diff options
Diffstat (limited to 'man/man7/cgroups.7')
-rw-r--r-- | man/man7/cgroups.7 | 1914 |
1 files changed, 1914 insertions, 0 deletions
diff --git a/man/man7/cgroups.7 b/man/man7/cgroups.7 new file mode 100644 index 0000000..964f13c --- /dev/null +++ b/man/man7/cgroups.7 @@ -0,0 +1,1914 @@ +.\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com> +.\" and Copyright (C) 2016, 2017 Michael Kerrisk <mtk.manpages@gmail.com> +.\" +.\" SPDX-License-Identifier: Linux-man-pages-copyleft +.\" +.TH cgroups 7 2024-05-02 "Linux man-pages (unreleased)" +.SH NAME +cgroups \- Linux control groups +.SH DESCRIPTION +Control groups, usually referred to as cgroups, +are a Linux kernel feature which allow processes to +be organized into hierarchical groups whose usage of +various types of resources can then be limited and monitored. +The kernel's cgroup interface is provided through +a pseudo-filesystem called cgroupfs. +Grouping is implemented in the core cgroup kernel code, +while resource tracking and limits are implemented in +a set of per-resource-type subsystems (memory, CPU, and so on). +.\" +.SS Terminology +A +.I cgroup +is a collection of processes that are bound to a set of +limits or parameters defined via the cgroup filesystem. +.P +A +.I subsystem +is a kernel component that modifies the behavior of +the processes in a cgroup. +Various subsystems have been implemented, making it possible to do things +such as limiting the amount of CPU time and memory available to a cgroup, +accounting for the CPU time used by a cgroup, +and freezing and resuming execution of the processes in a cgroup. +Subsystems are sometimes also known as +.I resource controllers +(or simply, controllers). +.P +The cgroups for a controller are arranged in a +.IR hierarchy . +This hierarchy is defined by creating, removing, and +renaming subdirectories within the cgroup filesystem. +At each level of the hierarchy, attributes (e.g., limits) can be defined. +The limits, control, and accounting provided by cgroups generally have +effect throughout the subhierarchy underneath the cgroup where the +attributes are defined. +Thus, for example, the limits placed on +a cgroup at a higher level in the hierarchy cannot be exceeded +by descendant cgroups. +.\" +.SS Cgroups version 1 and version 2 +The initial release of the cgroups implementation was in Linux 2.6.24. +Over time, various cgroup controllers have been added +to allow the management of various types of resources. +However, the development of these controllers was largely uncoordinated, +with the result that many inconsistencies arose between controllers +and management of the cgroup hierarchies became rather complex. +A longer description of these problems can be found in the kernel +source file +.I Documentation/admin\-guide/cgroup\-v2.rst +(or +.I Documentation/cgroup\-v2.txt +in Linux 4.17 and earlier). +.P +Because of the problems with the initial cgroups implementation +(cgroups version 1), +starting in Linux 3.10, work began on a new, +orthogonal implementation to remedy these problems. +Initially marked experimental, and hidden behind the +.I "\-o\ __DEVEL__sane_behavior" +mount option, the new version (cgroups version 2) +was eventually made official with the release of Linux 4.5. +Differences between the two versions are described in the text below. +The file +.IR cgroup.sane_behavior , +present in cgroups v1, is a relic of this mount option. +The file always reports "0" and is only retained for backward compatibility. +.P +Although cgroups v2 is intended as a replacement for cgroups v1, +the older system continues to exist +(and for compatibility reasons is unlikely to be removed). +Currently, cgroups v2 implements only a subset of the controllers +available in cgroups v1. +The two systems are implemented so that both v1 controllers and +v2 controllers can be mounted on the same system. +Thus, for example, it is possible to use those controllers +that are supported under version 2, +while also using version 1 controllers +where version 2 does not yet support those controllers. +The only restriction here is that a controller can't be simultaneously +employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy. +.\" +.SH CGROUPS VERSION 1 +Under cgroups v1, each controller may be mounted against a separate +cgroup filesystem that provides its own hierarchical organization of the +processes on the system. +It is also possible to comount multiple (or even all) cgroups v1 controllers +against the same cgroup filesystem, meaning that the comounted controllers +manage the same hierarchical organization of processes. +.P +For each mounted hierarchy, +the directory tree mirrors the control group hierarchy. +Each control group is represented by a directory, with each of its child +control cgroups represented as a child directory. +For instance, +.I /user/joe/1.session +represents control group +.IR 1.session , +which is a child of cgroup +.IR joe , +which is a child of +.IR /user . +Under each cgroup directory is a set of files which can be read or +written to, reflecting resource limits and a few general cgroup +properties. +.\" +.SS Tasks (threads) versus processes +In cgroups v1, a distinction is drawn between +.I processes +and +.IR tasks . +In this view, a process can consist of multiple tasks +(more commonly called threads, from a user-space perspective, +and called such in the remainder of this man page). +In cgroups v1, it is possible to independently manipulate +the cgroup memberships of the threads in a process. +.P +The cgroups v1 ability to split threads across different cgroups +caused problems in some cases. +For example, it made no sense for the +.I memory +controller, +since all of the threads of a process share a single address space. +Because of these problems, +the ability to independently manipulate the cgroup memberships +of the threads in a process was removed in the initial cgroups v2 +implementation, and subsequently restored in a more limited form +(see the discussion of "thread mode" below). +.\" +.SS Mounting v1 controllers +The use of cgroups requires a kernel built with the +.B CONFIG_CGROUP +option. +In addition, each of the v1 controllers has an associated +configuration option that must be set in order to employ that controller. +.P +In order to use a v1 controller, +it must be mounted against a cgroup filesystem. +The usual place for such mounts is under a +.BR tmpfs (5) +filesystem mounted at +.IR /sys/fs/cgroup . +Thus, one might mount the +.I cpu +controller as follows: +.P +.in +4n +.EX +mount \-t cgroup \-o cpu none /sys/fs/cgroup/cpu +.EE +.in +.P +It is possible to comount multiple controllers against the same hierarchy. +For example, here the +.I cpu +and +.I cpuacct +controllers are comounted against a single hierarchy: +.P +.in +4n +.EX +mount \-t cgroup \-o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct +.EE +.in +.P +Comounting controllers has the effect that a process is in the same cgroup for +all of the comounted controllers. +Separately mounting controllers allows a process to +be in cgroup +.I /foo1 +for one controller while being in +.I /foo2/foo3 +for another. +.P +It is possible to comount all v1 controllers against the same hierarchy: +.P +.in +4n +.EX +mount \-t cgroup \-o all cgroup /sys/fs/cgroup +.EE +.in +.P +(One can achieve the same result by omitting +.IR "\-o all" , +since it is the default if no controllers are explicitly specified.) +.P +It is not possible to mount the same controller +against multiple cgroup hierarchies. +For example, it is not possible to mount both the +.I cpu +and +.I cpuacct +controllers against one hierarchy, and to mount the +.I cpu +controller alone against another hierarchy. +It is possible to create multiple mount with exactly +the same set of comounted controllers. +However, in this case all that results is multiple mount points +providing a view of the same hierarchy. +.P +Note that on many systems, the v1 controllers are automatically mounted under +.IR /sys/fs/cgroup ; +in particular, +.BR systemd (1) +automatically creates such mounts. +.\" +.SS Unmounting v1 controllers +A mounted cgroup filesystem can be unmounted using the +.BR umount (8) +command, as in the following example: +.P +.in +4n +.EX +umount /sys/fs/cgroup/pids +.EE +.in +.P +.IR "But note well" : +a cgroup filesystem is unmounted only if it is not busy, +that is, it has no child cgroups. +If this is not the case, then the only effect of the +.BR umount (8) +is to make the mount invisible. +Thus, to ensure that the mount is really removed, +one must first remove all child cgroups, +which in turn can be done only after all member processes +have been moved from those cgroups to the root cgroup. +.\" +.SS Cgroups version 1 controllers +Each of the cgroups version 1 controllers is governed +by a kernel configuration option (listed below). +Additionally, the availability of the cgroups feature is governed by the +.B CONFIG_CGROUPS +kernel configuration option. +.TP +.IR cpu " (since Linux 2.6.24; " \fBCONFIG_CGROUP_SCHED\fP ) +Cgroups can be guaranteed a minimum number of "CPU shares" +when a system is busy. +This does not limit a cgroup's CPU usage if the CPUs are not busy. +For further information, see +.I Documentation/scheduler/sched\-design\-CFS.rst +(or +.I Documentation/scheduler/sched\-design\-CFS.txt +in Linux 5.2 and earlier). +.IP +In Linux 3.2, +this controller was extended to provide CPU "bandwidth" control. +If the kernel is configured with +.BR CONFIG_CFS_BANDWIDTH , +then within each scheduling period +(defined via a file in the cgroup directory), it is possible to define +an upper limit on the CPU time allocated to the processes in a cgroup. +This upper limit applies even if there is no other competition for the CPU. +Further information can be found in the kernel source file +.I Documentation/scheduler/sched\-bwc.rst +(or +.I Documentation/scheduler/sched\-bwc.txt +in Linux 5.2 and earlier). +.TP +.IR cpuacct " (since Linux 2.6.24; " \fBCONFIG_CGROUP_CPUACCT\fP ) +This provides accounting for CPU usage by groups of processes. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/cpuacct.rst +(or +.I Documentation/cgroup\-v1/cpuacct.txt +in Linux 5.2 and earlier). +.TP +.IR cpuset " (since Linux 2.6.24; " \fBCONFIG_CPUSETS\fP ) +This cgroup can be used to bind the processes in a cgroup to +a specified set of CPUs and NUMA nodes. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/cpusets.rst +(or +.I Documentation/cgroup\-v1/cpusets.txt +in Linux 5.2 and earlier). +. +.TP +.IR memory " (since Linux 2.6.25; " \fBCONFIG_MEMCG\fP ) +The memory controller supports reporting and limiting of process memory, kernel +memory, and swap used by cgroups. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/memory.rst +(or +.I Documentation/cgroup\-v1/memory.txt +in Linux 5.2 and earlier). +.TP +.IR devices " (since Linux 2.6.26; " \fBCONFIG_CGROUP_DEVICE\fP ) +This supports controlling which processes may create (mknod) devices as +well as open them for reading or writing. +The policies may be specified as allow-lists and deny-lists. +Hierarchy is enforced, so new rules must not +violate existing rules for the target or ancestor cgroups. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/devices.rst +(or +.I Documentation/cgroup\-v1/devices.txt +in Linux 5.2 and earlier). +.TP +.IR freezer " (since Linux 2.6.28; " \fBCONFIG_CGROUP_FREEZER\fP ) +The +.I freezer +cgroup can suspend and restore (resume) all processes in a cgroup. +Freezing a cgroup +.I /A +also causes its children, for example, processes in +.IR /A/B , +to be frozen. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/freezer\-subsystem.rst +(or +.I Documentation/cgroup\-v1/freezer\-subsystem.txt +in Linux 5.2 and earlier). +.TP +.IR net_cls " (since Linux 2.6.29; " \fBCONFIG_CGROUP_NET_CLASSID\fP ) +This places a classid, specified for the cgroup, on network packets +created by a cgroup. +These classids can then be used in firewall rules, +as well as used to shape traffic using +.BR tc (8). +This applies only to packets +leaving the cgroup, not to traffic arriving at the cgroup. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/net_cls.rst +(or +.I Documentation/cgroup\-v1/net_cls.txt +in Linux 5.2 and earlier). +.TP +.IR blkio " (since Linux 2.6.33; " \fBCONFIG_BLK_CGROUP\fP ) +The +.I blkio +cgroup controls and limits access to specified block devices by +applying IO control in the form of throttling and upper limits against leaf +nodes and intermediate nodes in the storage hierarchy. +.IP +Two policies are available. +The first is a proportional-weight time-based division +of disk implemented with CFQ. +This is in effect for leaf nodes using CFQ. +The second is a throttling policy which specifies +upper I/O rate limits on a device. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/blkio\-controller.rst +(or +.I Documentation/cgroup\-v1/blkio\-controller.txt +in Linux 5.2 and earlier). +.TP +.IR perf_event " (since Linux 2.6.39; " \fBCONFIG_CGROUP_PERF\fP ) +This controller allows +.I perf +monitoring of the set of processes grouped in a cgroup. +.IP +Further information can be found in the kernel source files +.TP +.IR net_prio " (since Linux 3.3; " \fBCONFIG_CGROUP_NET_PRIO\fP ) +This allows priorities to be specified, per network interface, for cgroups. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/net_prio.rst +(or +.I Documentation/cgroup\-v1/net_prio.txt +in Linux 5.2 and earlier). +.TP +.IR hugetlb " (since Linux 3.5; " \fBCONFIG_CGROUP_HUGETLB\fP ) +This supports limiting the use of huge pages by cgroups. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/hugetlb.rst +(or +.I Documentation/cgroup\-v1/hugetlb.txt +in Linux 5.2 and earlier). +.TP +.IR pids " (since Linux 4.3; " \fBCONFIG_CGROUP_PIDS\fP ) +This controller permits limiting the number of process that may be created +in a cgroup (and its descendants). +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/pids.rst +(or +.I Documentation/cgroup\-v1/pids.txt +in Linux 5.2 and earlier). +.TP +.IR rdma " (since Linux 4.11; " \fBCONFIG_CGROUP_RDMA\fP ) +The RDMA controller permits limiting the use of +RDMA/IB-specific resources per cgroup. +.IP +Further information can be found in the kernel source file +.I Documentation/admin\-guide/cgroup\-v1/rdma.rst +(or +.I Documentation/cgroup\-v1/rdma.txt +in Linux 5.2 and earlier). +.\" +.SS Creating cgroups and moving processes +A cgroup filesystem initially contains a single root cgroup, '/', +which all processes belong to. +A new cgroup is created by creating a directory in the cgroup filesystem: +.P +.in +4n +.EX +mkdir /sys/fs/cgroup/cpu/cg1 +.EE +.in +.P +This creates a new empty cgroup. +.P +A process may be moved to this cgroup by writing its PID into the cgroup's +.I cgroup.procs +file: +.P +.in +4n +.EX +echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs +.EE +.in +.P +Only one PID at a time should be written to this file. +.P +Writing the value 0 to a +.I cgroup.procs +file causes the writing process to be moved to the corresponding cgroup. +.P +When writing a PID into the +.IR cgroup.procs , +all threads in the process are moved into the new cgroup at once. +.P +Within a hierarchy, a process can be a member of exactly one cgroup. +Writing a process's PID to a +.I cgroup.procs +file automatically removes it from the cgroup of +which it was previously a member. +.P +The +.I cgroup.procs +file can be read to obtain a list of the processes that are +members of a cgroup. +The returned list of PIDs is not guaranteed to be in order. +Nor is it guaranteed to be free of duplicates. +(For example, a PID may be recycled while reading from the list.) +.P +In cgroups v1, an individual thread can be moved to +another cgroup by writing its thread ID +(i.e., the kernel thread ID returned by +.BR clone (2) +and +.BR gettid (2)) +to the +.I tasks +file in a cgroup directory. +This file can be read to discover the set of threads +that are members of the cgroup. +.\" +.SS Removing cgroups +To remove a cgroup, +it must first have no child cgroups and contain no (nonzombie) processes. +So long as that is the case, one can simply +remove the corresponding directory pathname. +Note that files in a cgroup directory cannot and need not be +removed. +.\" +.SS Cgroups v1 release notification +Two files can be used to determine whether the kernel provides +notifications when a cgroup becomes empty. +A cgroup is considered to be empty when it contains no child +cgroups and no member processes. +.P +A special file in the root directory of each cgroup hierarchy, +.IR release_agent , +can be used to register the pathname of a program that may be invoked when +a cgroup in the hierarchy becomes empty. +The pathname of the newly empty cgroup (relative to the cgroup mount point) +is provided as the sole command-line argument when the +.I release_agent +program is invoked. +The +.I release_agent +program might remove the cgroup directory, +or perhaps repopulate it with a process. +.P +The default value of the +.I release_agent +file is empty, meaning that no release agent is invoked. +.P +The content of the +.I release_agent +file can also be specified via a mount option when the +cgroup filesystem is mounted: +.P +.in +4n +.EX +mount \-o release_agent=pathname ... +.EE +.in +.P +Whether or not the +.I release_agent +program is invoked when a particular cgroup becomes empty is determined +by the value in the +.I notify_on_release +file in the corresponding cgroup directory. +If this file contains the value 0, then the +.I release_agent +program is not invoked. +If it contains the value 1, the +.I release_agent +program is invoked. +The default value for this file in the root cgroup is 0. +At the time when a new cgroup is created, +the value in this file is inherited from the corresponding file +in the parent cgroup. +.\" +.SS Cgroup v1 named hierarchies +In cgroups v1, +it is possible to mount a cgroup hierarchy that has no attached controllers: +.P +.in +4n +.EX +mount \-t cgroup \-o none,name=somename none /some/mount/point +.EE +.in +.P +Multiple instances of such hierarchies can be mounted; +each hierarchy must have a unique name. +The only purpose of such hierarchies is to track processes. +(See the discussion of release notification below.) +An example of this is the +.I name=systemd +cgroup hierarchy that is used by +.BR systemd (1) +to track services and user sessions. +.P +Since Linux 5.0, the +.I cgroup_no_v1 +kernel boot option (described below) can be used to disable cgroup v1 +named hierarchies, by specifying +.IR cgroup_no_v1=named . +.\" +.SH CGROUPS VERSION 2 +In cgroups v2, +all mounted controllers reside in a single unified hierarchy. +While (different) controllers may be simultaneously +mounted under the v1 and v2 hierarchies, +it is not possible to mount the same controller simultaneously +under both the v1 and the v2 hierarchies. +.P +The new behaviors in cgroups v2 are summarized here, +and in some cases elaborated in the following subsections. +.IP \[bu] 3 +Cgroups v2 provides a unified hierarchy against +which all controllers are mounted. +.IP \[bu] +"Internal" processes are not permitted. +With the exception of the root cgroup, processes may reside +only in leaf nodes (cgroups that do not themselves contain child cgroups). +The details are somewhat more subtle than this, and are described below. +.IP \[bu] +Active cgroups must be specified via the files +.I cgroup.controllers +and +.IR cgroup.subtree_control . +.IP \[bu] +The +.I tasks +file has been removed. +In addition, the +.I cgroup.clone_children +file that is employed by the +.I cpuset +controller has been removed. +.IP \[bu] +An improved mechanism for notification of empty cgroups is provided by the +.I cgroup.events +file. +.P +For more changes, see the +.I Documentation/admin\-guide/cgroup\-v2.rst +file in the kernel source +(or +.I Documentation/cgroup\-v2.txt +in Linux 4.17 and earlier). +. +.P +Some of the new behaviors listed above saw subsequent modification with +the addition in Linux 4.14 of "thread mode" (described below). +.\" +.SS Cgroups v2 unified hierarchy +In cgroups v1, the ability to mount different controllers +against different hierarchies was intended to allow great flexibility +for application design. +In practice, though, +the flexibility turned out to be less useful than expected, +and in many cases added complexity. +Therefore, in cgroups v2, +all available controllers are mounted against a single hierarchy. +The available controllers are automatically mounted, +meaning that it is not necessary (or possible) to specify the controllers +when mounting the cgroup v2 filesystem using a command such as the following: +.P +.in +4n +.EX +mount \-t cgroup2 none /mnt/cgroup2 +.EE +.in +.P +A cgroup v2 controller is available only if it is not currently in use +via a mount against a cgroup v1 hierarchy. +Or, to put things another way, it is not possible to employ +the same controller against both a v1 hierarchy and the unified v2 hierarchy. +This means that it may be necessary first to unmount a v1 controller +(as described above) before that controller is available in v2. +Since +.BR systemd (1) +makes heavy use of some v1 controllers by default, +it can in some cases be simpler to boot the system with +selected v1 controllers disabled. +To do this, specify the +.I cgroup_no_v1=list +option on the kernel boot command line; +.I list +is a comma-separated list of the names of the controllers to disable, +or the word +.I all +to disable all v1 controllers. +(This situation is correctly handled by +.BR systemd (1), +which falls back to operating without the specified controllers.) +.P +Note that on many modern systems, +.BR systemd (1) +automatically mounts the +.I cgroup2 +filesystem at +.I /sys/fs/cgroup/unified +during the boot process. +.\" +.SS Cgroups v2 mount options +The following options +.RI ( mount\~\-o ) +can be specified when mounting the group v2 filesystem: +.TP +.IR nsdelegate " (since Linux 4.15)" +Treat cgroup namespaces as delegation boundaries. +For details, see below. +.TP +.IR memory_localevents " (since Linux 5.2)" +.\" commit 9852ae3fe5293264f01c49f2571ef7688f7823ce +The +.I memory.events +should show statistics only for the cgroup itself, +and not for any descendant cgroups. +This was the behavior before Linux 5.2. +Starting in Linux 5.2, +the default behavior is to include statistics for descendant cgroups in +.IR memory.events , +and this mount option can be used to revert to the legacy behavior. +This option is system wide and can be set on mount or +modified through remount only from the initial mount namespace; +it is silently ignored in noninitial namespaces. +.\" +.SS Cgroups v2 controllers +The following controllers, documented in the kernel source file +.I Documentation/admin\-guide/cgroup\-v2.rst +(or +.I Documentation/cgroup\-v2.txt +in Linux 4.17 and earlier), +are supported in cgroups version 2: +.TP +.IR cpu " (since Linux 4.15)" +This is the successor to the version 1 +.I cpu +and +.I cpuacct +controllers. +.TP +.IR cpuset " (since Linux 5.0)" +This is the successor of the version 1 +.I cpuset +controller. +.TP +.IR freezer " (since Linux 5.2)" +.\" commit 76f969e8948d82e78e1bc4beb6b9465908e74873 +This is the successor of the version 1 +.I freezer +controller. +.TP +.IR hugetlb " (since Linux 5.6)" +This is the successor of the version 1 +.I hugetlb +controller. +.TP +.IR io " (since Linux 4.5)" +This is the successor of the version 1 +.I blkio +controller. +.TP +.IR memory " (since Linux 4.5)" +This is the successor of the version 1 +.I memory +controller. +.TP +.IR perf_event " (since Linux 4.11)" +This is the same as the version 1 +.I perf_event +controller. +.TP +.IR pids " (since Linux 4.5)" +This is the same as the version 1 +.I pids +controller. +.TP +.IR rdma " (since Linux 4.11)" +This is the same as the version 1 +.I rdma +controller. +.P +There is no direct equivalent of the +.I net_cls +and +.I net_prio +controllers from cgroups version 1. +Instead, support has been added to +.BR iptables (8) +to allow eBPF filters that hook on cgroup v2 pathnames to make decisions +about network traffic on a per-cgroup basis. +.P +The v2 +.I devices +controller provides no interface files; +instead, device control is gated by attaching an eBPF +.RB ( BPF_CGROUP_DEVICE ) +program to a v2 cgroup. +.\" +.SS Cgroups v2 subtree control +Each cgroup in the v2 hierarchy contains the following two files: +.TP +.I cgroup.controllers +This read-only file exposes a list of the controllers that are +.I available +in this cgroup. +The contents of this file match the contents of the +.I cgroup.subtree_control +file in the parent cgroup. +.TP +.I cgroup.subtree_control +This is a list of controllers that are +.I active +.RI ( enabled ) +in the cgroup. +The set of controllers in this file is a subset of the set in the +.I cgroup.controllers +of this cgroup. +The set of active controllers is modified by writing strings to this file +containing space-delimited controller names, +each preceded by '+' (to enable a controller) +or '\-' (to disable a controller), as in the following example: +.IP +.in +4n +.EX +echo \[aq]+pids \-memory\[aq] > x/y/cgroup.subtree_control +.EE +.in +.IP +An attempt to enable a controller +that is not present in +.I cgroup.controllers +leads to an +.B ENOENT +error when writing to the +.I cgroup.subtree_control +file. +.P +Because the list of controllers in +.I cgroup.subtree_control +is a subset of those +.IR cgroup.controllers , +a controller that has been disabled in one cgroup in the hierarchy +can never be re-enabled in the subtree below that cgroup. +.P +A cgroup's +.I cgroup.subtree_control +file determines the set of controllers that are exercised in the +.I child +cgroups. +When a controller (e.g., +.IR pids ) +is present in the +.I cgroup.subtree_control +file of a parent cgroup, +then the corresponding controller-interface files (e.g., +.IR pids.max ) +are automatically created in the children of that cgroup +and can be used to exert resource control in the child cgroups. +.\" +.SS Cgroups v2 \[dq]no internal processes\[dq] rule +Cgroups v2 enforces a so-called "no internal processes" rule. +Roughly speaking, this rule means that, +with the exception of the root cgroup, processes may reside +only in leaf nodes (cgroups that do not themselves contain child cgroups). +This avoids the need to decide how to partition resources between +processes which are members of cgroup A and processes in child cgroups of A. +.P +For instance, if cgroup +.I /cg1/cg2 +exists, then a process may reside in +.IR /cg1/cg2 , +but not in +.IR /cg1 . +This is to avoid an ambiguity in cgroups v1 +with respect to the delegation of resources between processes in +.I /cg1 +and its child cgroups. +The recommended approach in cgroups v2 is to create a subdirectory called +.I leaf +for any nonleaf cgroup which should contain processes, but no child cgroups. +Thus, processes which previously would have gone into +.I /cg1 +would now go into +.IR /cg1/leaf . +This has the advantage of making explicit +the relationship between processes in +.I /cg1/leaf +and +.IR /cg1 's +other children. +.P +The "no internal processes" rule is in fact more subtle than stated above. +More precisely, the rule is that a (nonroot) cgroup can't both +(1) have member processes, and +(2) distribute resources into child cgroups\[em]that is, have a nonempty +.I cgroup.subtree_control +file. +Thus, it +.I is +possible for a cgroup to have both member processes and child cgroups, +but before controllers can be enabled for that cgroup, +the member processes must be moved out of the cgroup +(e.g., perhaps into the child cgroups). +.P +With the Linux 4.14 addition of "thread mode" (described below), +the "no internal processes" rule has been relaxed in some cases. +.\" +.SS Cgroups v2 cgroup.events file +Each nonroot cgroup in the v2 hierarchy contains a read-only file, +.IR cgroup.events , +whose contents are key-value pairs +(delimited by newline characters, with the key and value separated by spaces) +providing state information about the cgroup: +.P +.in +4n +.EX +$ \fBcat mygrp/cgroup.events\fP +populated 1 +frozen 0 +.EE +.in +.P +The following keys may appear in this file: +.TP +.I populated +The value of this key is either 1, +if this cgroup or any of its descendants has member processes, +or otherwise 0. +.TP +.IR frozen " (since Linux 5.2)" +.\" commit 76f969e8948d82e78e1bc4beb6b9465908e7487 +The value of this key is 1 if this cgroup is currently frozen, +or 0 if it is not. +.P +The +.I cgroup.events +file can be monitored, in order to receive notification when the value of +one of its keys changes. +Such monitoring can be done using +.BR inotify (7), +which notifies changes as +.B IN_MODIFY +events, or +.BR poll (2), +which notifies changes by returning the +.B POLLPRI +and +.B POLLERR +bits in the +.I revents +field. +.\" +.SS Cgroup v2 release notification +Cgroups v2 provides a new mechanism for obtaining notification +when a cgroup becomes empty. +The cgroups v1 +.I release_agent +and +.I notify_on_release +files are removed, and replaced by the +.I populated +key in the +.I cgroup.events +file. +This key either has the value 0, +meaning that the cgroup (and its descendants) +contain no (nonzombie) member processes, +or 1, meaning that the cgroup (or one of its descendants) +contains member processes. +.P +The cgroups v2 release-notification mechanism +offers the following advantages over the cgroups v1 +.I release_agent +mechanism: +.IP \[bu] 3 +It allows for cheaper notification, +since a single process can monitor multiple +.I cgroup.events +files (using the techniques described earlier). +By contrast, the cgroups v1 mechanism requires the expense of creating +a process for each notification. +.IP \[bu] +Notification for different cgroup subhierarchies can be delegated +to different processes. +By contrast, the cgroups v1 mechanism allows only one release agent +for an entire hierarchy. +.\" +.SS Cgroups v2 cgroup.stat file +.\" commit ec39225cca42c05ac36853d11d28f877fde5c42e +Each cgroup in the v2 hierarchy contains a read-only +.I cgroup.stat +file (first introduced in Linux 4.14) +that consists of lines containing key-value pairs. +The following keys currently appear in this file: +.TP +.I nr_descendants +This is the total number of visible (i.e., living) descendant cgroups +underneath this cgroup. +.TP +.I nr_dying_descendants +This is the total number of dying descendant cgroups +underneath this cgroup. +A cgroup enters the dying state after being deleted. +It remains in that state for an undefined period +(which will depend on system load) +while resources are freed before the cgroup is destroyed. +Note that the presence of some cgroups in the dying state is normal, +and is not indicative of any problem. +.IP +A process can't be made a member of a dying cgroup, +and a dying cgroup can't be brought back to life. +.\" +.SS Limiting the number of descendant cgroups +Each cgroup in the v2 hierarchy contains the following files, +which can be used to view and set limits on the number +of descendant cgroups under that cgroup: +.TP +.IR cgroup.max.depth " (since Linux 4.14)" +.\" commit 1a926e0bbab83bae8207d05a533173425e0496d1 +This file defines a limit on the depth of nesting of descendant cgroups. +A value of 0 in this file means that no descendant cgroups can be created. +An attempt to create a descendant whose nesting level exceeds +the limit fails +.RI ( mkdir (2) +fails with the error +.BR EAGAIN ). +.IP +Writing the string +.I \[dq]max\[dq] +to this file means that no limit is imposed. +The default value in this file is +.IR \[dq]max\[dq] . +.TP +.IR cgroup.max.descendants " (since Linux 4.14)" +.\" commit 1a926e0bbab83bae8207d05a533173425e0496d1 +This file defines a limit on the number of live descendant cgroups that +this cgroup may have. +An attempt to create more descendants than allowed by the limit fails +.RI ( mkdir (2) +fails with the error +.BR EAGAIN ). +.IP +Writing the string +.I \[dq]max\[dq] +to this file means that no limit is imposed. +The default value in this file is +.IR \[dq]max\[dq] . +.\" +.SH CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER +In the context of cgroups, +delegation means passing management of some subtree +of the cgroup hierarchy to a nonprivileged user. +Cgroups v1 provides support for delegation based on file permissions +in the cgroup hierarchy but with less strict containment rules than v2 +(as noted below). +Cgroups v2 supports delegation with containment by explicit design. +The focus of the discussion in this section is on delegation in cgroups v2, +with some differences for cgroups v1 noted along the way. +.P +Some terminology is required in order to describe delegation. +A +.I delegater +is a privileged user (i.e., root) who owns a parent cgroup. +A +.I delegatee +is a nonprivileged user who will be granted the permissions needed +to manage some subhierarchy under that parent cgroup, +known as the +.IR "delegated subtree" . +.P +To perform delegation, +the delegater makes certain directories and files writable by the delegatee, +typically by changing the ownership of the objects to be the user ID +of the delegatee. +Assuming that we want to delegate the hierarchy rooted at (say) +.I /dlgt_grp +and that there are not yet any child cgroups under that cgroup, +the ownership of the following is changed to the user ID of the delegatee: +.TP +.I /dlgt_grp +Changing the ownership of the root of the subtree means that any new +cgroups created under the subtree (and the files they contain) +will also be owned by the delegatee. +.TP +.I /dlgt_grp/cgroup.procs +Changing the ownership of this file means that the delegatee +can move processes into the root of the delegated subtree. +.TP +.IR /dlgt_grp/cgroup.subtree_control " (cgroups v2 only)" +Changing the ownership of this file means that the delegatee +can enable controllers (that are present in +.IR /dlgt_grp/cgroup.controllers ) +in order to further redistribute resources at lower levels in the subtree. +(As an alternative to changing the ownership of this file, +the delegater might instead add selected controllers to this file.) +.TP +.IR /dlgt_grp/cgroup.threads " (cgroups v2 only)" +Changing the ownership of this file is necessary if a threaded subtree +is being delegated (see the description of "thread mode", below). +This permits the delegatee to write thread IDs to the file. +(The ownership of this file can also be changed when delegating +a domain subtree, but currently this serves no purpose, +since, as described below, it is not possible to move a thread between +domain cgroups by writing its thread ID to the +.I cgroup.threads +file.) +.IP +In cgroups v1, the corresponding file that should instead be delegated is the +.I tasks +file. +.P +The delegater should +.I not +change the ownership of any of the controller interfaces files (e.g., +.IR pids.max , +.IR memory.high ) +in +.IR dlgt_grp . +Those files are used from the next level above the delegated subtree +in order to distribute resources into the subtree, +and the delegatee should not have permission to change +the resources that are distributed into the delegated subtree. +.P +See also the discussion of the +.I /sys/kernel/cgroup/delegate +file in NOTES for information about further delegatable files in cgroups v2. +.P +After the aforementioned steps have been performed, +the delegatee can create child cgroups within the delegated subtree +(the cgroup subdirectories and the files they contain +will be owned by the delegatee) +and move processes between cgroups in the subtree. +If some controllers are present in +.IR dlgt_grp/cgroup.subtree_control , +or the ownership of that file was passed to the delegatee, +the delegatee can also control the further redistribution +of the corresponding resources into the delegated subtree. +.\" +.SS Cgroups v2 delegation: nsdelegate and cgroup namespaces +Starting with Linux 4.13, +.\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9 +there is a second way to perform cgroup delegation in the cgroups v2 hierarchy. +This is done by mounting or remounting the cgroup v2 filesystem with the +.I nsdelegate +mount option. +For example, if the cgroup v2 filesystem has already been mounted, +we can remount it with the +.I nsdelegate +option as follows: +.P +.in +4n +.EX +mount \-t cgroup2 \-o remount,nsdelegate \e + none /sys/fs/cgroup/unified +.EE +.in +.\" +.\" Alternatively, we could boot the kernel with the options: +.\" +.\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller +.\" +.\" The effect of the latter option is to prevent systemd from employing +.\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2. +.P +The effect of this mount option is to cause cgroup namespaces +to automatically become delegation boundaries. +More specifically, +the following restrictions apply for processes inside the cgroup namespace: +.IP \[bu] 3 +Writes to controller interface files in the root directory of the namespace +will fail with the error +.BR EPERM . +Processes inside the cgroup namespace can still write to delegatable +files in the root directory of the cgroup namespace such as +.I cgroup.procs +and +.IR cgroup.subtree_control , +and can create subhierarchy underneath the root directory. +.IP \[bu] +Attempts to migrate processes across the namespace boundary are denied +(with the error +.BR ENOENT ). +Processes inside the cgroup namespace can still +(subject to the containment rules described below) +move processes between cgroups +.I within +the subhierarchy under the namespace root. +.P +The ability to define cgroup namespaces as delegation boundaries +makes cgroup namespaces more useful. +To understand why, suppose that we already have one cgroup hierarchy +that has been delegated to a nonprivileged user, +.IR cecilia , +using the older delegation technique described above. +Suppose further that +.I cecilia +wanted to further delegate a subhierarchy +under the existing delegated hierarchy. +(For example, the delegated hierarchy might be associated with +an unprivileged container run by +.IR cecilia .) +Even if a cgroup namespace was employed, +because both hierarchies are owned by the unprivileged user +.IR cecilia , +the following illegitimate actions could be performed: +.IP \[bu] 3 +A process in the inferior hierarchy could change the +resource controller settings in the root directory of that hierarchy. +(These resource controller settings are intended to allow control to +be exercised from the +.I parent +cgroup; +a process inside the child cgroup should not be allowed to modify them.) +.IP \[bu] +A process inside the inferior hierarchy could move processes +into and out of the inferior hierarchy if the cgroups in the +superior hierarchy were somehow visible. +.P +Employing the +.I nsdelegate +mount option prevents both of these possibilities. +.P +The +.I nsdelegate +mount option only has an effect when performed in +the initial mount namespace; +in other mount namespaces, the option is silently ignored. +.P +.IR Note : +On some systems, +.BR systemd (1) +automatically mounts the cgroup v2 filesystem. +In order to experiment with the +.I nsdelegate +operation, it may be useful to boot the kernel with +the following command-line options: +.P +.in +4n +.EX +cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller +.EE +.in +.P +These options cause the kernel to boot with the cgroups v1 controllers +disabled (meaning that the controllers are available in the v2 hierarchy), +and tells +.BR systemd (1) +not to mount and use the cgroup v2 hierarchy, +so that the v2 hierarchy can be manually mounted +with the desired options after boot-up. +.\" +.SS Cgroup delegation containment rules +Some delegation +.I containment rules +ensure that the delegatee can move processes between cgroups within the +delegated subtree, +but can't move processes from outside the delegated subtree into +the subtree or vice versa. +A nonprivileged process (i.e., the delegatee) can write the PID of +a "target" process into a +.I cgroup.procs +file only if all of the following are true: +.IP \[bu] 3 +The writer has write permission on the +.I cgroup.procs +file in the destination cgroup. +.IP \[bu] +The writer has write permission on the +.I cgroup.procs +file in the nearest common ancestor of the source and destination cgroups. +Note that in some cases, +the nearest common ancestor may be the source or destination cgroup itself. +This requirement is not enforced for cgroups v1 hierarchies, +with the consequence that containment in v1 is less strict than in v2. +(For example, in cgroups v1 the user that owns two distinct +delegated subhierarchies can move a process between the hierarchies.) +.IP \[bu] +If the cgroup v2 filesystem was mounted with the +.I nsdelegate +option, the writer must be able to see the source and destination cgroups +from its cgroup namespace. +.IP \[bu] +In cgroups v1: +the effective UID of the writer (i.e., the delegatee) matches the +real user ID or the saved set-user-ID of the target process. +Before Linux 4.11, +.\" commit 576dd464505fc53d501bb94569db76f220104d28 +this requirement also applied in cgroups v2 +(This was a historical requirement inherited from cgroups v1 +that was later deemed unnecessary, +since the other rules suffice for containment in cgroups v2.) +.P +.IR Note : +one consequence of these delegation containment rules is that the +unprivileged delegatee can't place the first process into +the delegated subtree; +instead, the delegater must place the first process +(a process owned by the delegatee) into the delegated subtree. +.\" +.SH CGROUPS VERSION 2 THREAD MODE +Among the restrictions imposed by cgroups v2 that were not present +in cgroups v1 are the following: +.IP \[bu] 3 +.IR "No thread-granularity control" : +all of the threads of a process must be in the same cgroup. +.IP \[bu] +.IR "No internal processes" : +a cgroup can't both have member processes and +exercise controllers on child cgroups. +.P +Both of these restrictions were added because +the lack of these restrictions had caused problems +in cgroups v1. +In particular, the cgroups v1 ability to allow thread-level granularity +for cgroup membership made no sense for some controllers. +(A notable example was the +.I memory +controller: since threads share an address space, +it made no sense to split threads across different +.I memory +cgroups.) +.P +Notwithstanding the initial design decision in cgroups v2, +there were use cases for certain controllers, notably the +.I cpu +controller, +for which thread-level granularity of control was meaningful and useful. +To accommodate such use cases, Linux 4.14 added +.I "thread mode" +for cgroups v2. +.P +Thread mode allows the following: +.IP \[bu] 3 +The creation of +.I threaded subtrees +in which the threads of a process may +be spread across cgroups inside the tree. +(A threaded subtree may contain multiple multithreaded processes.) +.IP \[bu] +The concept of +.IR "threaded controllers" , +which can distribute resources across the cgroups in a threaded subtree. +.IP \[bu] +A relaxation of the "no internal processes rule", +so that, within a threaded subtree, +a cgroup can both contain member threads and +exercise resource control over child cgroups. +.P +With the addition of thread mode, +each nonroot cgroup now contains a new file, +.IR cgroup.type , +that exposes, and in some circumstances can be used to change, +the "type" of a cgroup. +This file contains one of the following type values: +.TP +.I domain +This is a normal v2 cgroup that provides process-granularity control. +If a process is a member of this cgroup, +then all threads of the process are (by definition) in the same cgroup. +This is the default cgroup type, +and provides the same behavior that was provided for +cgroups in the initial cgroups v2 implementation. +.TP +.I threaded +This cgroup is a member of a threaded subtree. +Threads can be added to this cgroup, +and controllers can be enabled for the cgroup. +.TP +.I domain threaded +This is a domain cgroup that serves as the root of a threaded subtree. +This cgroup type is also known as "threaded root". +.TP +.I domain invalid +This is a cgroup inside a threaded subtree +that is in an "invalid" state. +Processes can't be added to the cgroup, +and controllers can't be enabled for the cgroup. +The only thing that can be done with this cgroup (other than deleting it) +is to convert it to a +.I threaded +cgroup by writing the string +.I \[dq]threaded\[dq] +to the +.I cgroup.type +file. +.IP +The rationale for the existence of this "interim" type +during the creation of a threaded subtree +(rather than the kernel simply immediately converting all cgroups +under the threaded root to the type +.IR threaded ) +is to allow for +possible future extensions to the thread mode model +.\" +.SS Threaded versus domain controllers +With the addition of threads mode, +cgroups v2 now distinguishes two types of resource controllers: +.IP \[bu] 3 +.I Threaded +.\" In the kernel source, look for ".threaded[ \t]*= true" in +.\" initializations of struct cgroup_subsys +controllers: these controllers support thread-granularity for +resource control and can be enabled inside threaded subtrees, +with the result that the corresponding controller-interface files +appear inside the cgroups in the threaded subtree. +As at Linux 4.19, the following controllers are threaded: +.IR cpu , +.IR perf_event , +and +.IR pids . +.IP \[bu] +.I Domain +controllers: these controllers support only process granularity +for resource control. +From the perspective of a domain controller, +all threads of a process are always in the same cgroup. +Domain controllers can't be enabled inside a threaded subtree. +.\" +.SS Creating a threaded subtree +There are two pathways that lead to the creation of a threaded subtree. +The first pathway proceeds as follows: +.IP (1) 5 +We write the string +.I \[dq]threaded\[dq] +to the +.I cgroup.type +file of a cgroup +.I y/z +that currently has the type +.IR domain . +This has the following effects: +.RS +.IP \[bu] 3 +The type of the cgroup +.I y/z +becomes +.IR threaded . +.IP \[bu] +The type of the parent cgroup, +.IR y , +becomes +.IR "domain threaded" . +The parent cgroup is the root of a threaded subtree +(also known as the "threaded root"). +.IP \[bu] +All other cgroups under +.I y +that were not already of type +.I threaded +(because they were inside already existing threaded subtrees +under the new threaded root) +are converted to type +.IR "domain invalid" . +Any subsequently created cgroups under +.I y +will also have the type +.IR "domain invalid" . +.RE +.IP (2) +We write the string +.I \[dq]threaded\[dq] +to each of the +.I domain invalid +cgroups under +.IR y , +in order to convert them to the type +.IR threaded . +As a consequence of this step, all threads under the threaded root +now have the type +.I threaded +and the threaded subtree is now fully usable. +The requirement to write +.I \[dq]threaded\[dq] +to each of these cgroups is somewhat cumbersome, +but allows for possible future extensions to the thread-mode model. +.P +The second way of creating a threaded subtree is as follows: +.IP (1) 5 +In an existing cgroup, +.IR z , +that currently has the type +.IR domain , +we (1.1) enable one or more threaded controllers and +(1.2) make a process a member of +.IR z . +(These two steps can be done in either order.) +This has the following consequences: +.RS +.IP \[bu] 3 +The type of +.I z +becomes +.IR "domain threaded" . +.IP \[bu] +All of the descendant cgroups of +.I z +that were not already of type +.I threaded +are converted to type +.IR "domain invalid" . +.RE +.IP (2) +As before, we make the threaded subtree usable by writing the string +.I \[dq]threaded\[dq] +to each of the +.I domain invalid +cgroups under +.IR z , +in order to convert them to the type +.IR threaded . +.P +One of the consequences of the above pathways to creating a threaded subtree +is that the threaded root cgroup can be a parent only to +.I threaded +(and +.IR "domain invalid" ) +cgroups. +The threaded root cgroup can't be a parent of a +.I domain +cgroups, and a +.I threaded +cgroup +can't have a sibling that is a +.I domain +cgroup. +.\" +.SS Using a threaded subtree +Within a threaded subtree, threaded controllers can be enabled +in each subgroup whose type has been changed to +.IR threaded ; +upon doing so, the corresponding controller interface files +appear in the children of that cgroup. +.P +A process can be moved into a threaded subtree by writing its PID to the +.I cgroup.procs +file in one of the cgroups inside the tree. +This has the effect of making all of the threads +in the process members of the corresponding cgroup +and makes the process a member of the threaded subtree. +The threads of the process can then be spread across +the threaded subtree by writing their thread IDs (see +.BR gettid (2)) +to the +.I cgroup.threads +files in different cgroups inside the subtree. +The threads of a process must all reside in the same threaded subtree. +.P +As with writing to +.IR cgroup.procs , +some containment rules apply when writing to the +.I cgroup.threads +file: +.IP \[bu] 3 +The writer must have write permission on the +cgroup.threads +file in the destination cgroup. +.IP \[bu] +The writer must have write permission on the +.I cgroup.procs +file in the common ancestor of the source and destination cgroups. +(In some cases, +the common ancestor may be the source or destination cgroup itself.) +.IP \[bu] +The source and destination cgroups must be in the same threaded subtree. +(Outside a threaded subtree, an attempt to move a thread by writing +its thread ID to the +.I cgroup.threads +file in a different +.I domain +cgroup fails with the error +.BR EOPNOTSUPP .) +.P +The +.I cgroup.threads +file is present in each cgroup (including +.I domain +cgroups) and can be read in order to discover the set of threads +that is present in the cgroup. +The set of thread IDs obtained when reading this file +is not guaranteed to be ordered or free of duplicates. +.P +The +.I cgroup.procs +file in the threaded root shows the PIDs of all processes +that are members of the threaded subtree. +The +.I cgroup.procs +files in the other cgroups in the subtree are not readable. +.P +Domain controllers can't be enabled in a threaded subtree; +no controller-interface files appear inside the cgroups underneath the +threaded root. +From the point of view of a domain controller, +threaded subtrees are invisible: +a multithreaded process inside a threaded subtree appears to a domain +controller as a process that resides in the threaded root cgroup. +.P +Within a threaded subtree, the "no internal processes" rule does not apply: +a cgroup can both contain member processes (or thread) +and exercise controllers on child cgroups. +.\" +.SS Rules for writing to cgroup.type and creating threaded subtrees +A number of rules apply when writing to the +.I cgroup.type +file: +.IP \[bu] 3 +Only the string +.I \[dq]threaded\[dq] +may be written. +In other words, the only explicit transition that is possible is to convert a +.I domain +cgroup to type +.IR threaded . +.IP \[bu] +The effect of writing +.I \[dq]threaded\[dq] +depends on the current value in +.IR cgroup.type , +as follows: +.RS +.IP \[bu] 3 +.I domain +or +.IR "domain threaded" : +start the creation of a threaded subtree +(whose root is the parent of this cgroup) via +the first of the pathways described above; +.IP \[bu] +.IR "domain\ invalid" : +convert this cgroup (which is inside a threaded subtree) to a usable (i.e., +.IR threaded ) +state; +.IP \[bu] +.IR threaded : +no effect (a "no-op"). +.RE +.IP \[bu] +We can't write to a +.I cgroup.type +file if the parent's type is +.IR "domain invalid" . +In other words, the cgroups of a threaded subtree must be converted to the +.I threaded +state in a top-down manner. +.P +There are also some constraints that must be satisfied +in order to create a threaded subtree rooted at the cgroup +.IR x : +.IP \[bu] 3 +There can be no member processes in the descendant cgroups of +.IR x . +(The cgroup +.I x +can itself have member processes.) +.IP \[bu] +No domain controllers may be enabled in +.IR x 's +.I cgroup.subtree_control +file. +.P +If any of the above constraints is violated, then an attempt to write +.I \[dq]threaded\[dq] +to a +.I cgroup.type +file fails with the error +.BR ENOTSUP . +.\" +.SS The \[dq]domain threaded\[dq] cgroup type +According to the pathways described above, +the type of a cgroup can change to +.I domain threaded +in either of the following cases: +.IP \[bu] 3 +The string +.I \[dq]threaded\[dq] +is written to a child cgroup. +.IP \[bu] +A threaded controller is enabled inside the cgroup and +a process is made a member of the cgroup. +.P +A +.I domain threaded +cgroup, +.IR x , +can revert to the type +.I domain +if the above conditions no longer hold true\[em]that is, if all +.I threaded +child cgroups of +.I x +are removed and either +.I x +no longer has threaded controllers enabled or +no longer has member processes. +.P +When a +.I domain threaded +cgroup +.I x +reverts to the type +.IR domain : +.IP \[bu] 3 +All +.I domain invalid +descendants of +.I x +that are not in lower-level threaded subtrees revert to the type +.IR domain . +.IP \[bu] +The root cgroups in any lower-level threaded subtrees revert to the type +.IR "domain threaded" . +.\" +.SS Exceptions for the root cgroup +The root cgroup of the v2 hierarchy is treated exceptionally: +it can be the parent of both +.I domain +and +.I threaded +cgroups. +If the string +.I \[dq]threaded\[dq] +is written to the +.I cgroup.type +file of one of the children of the root cgroup, then +.IP \[bu] 3 +The type of that cgroup becomes +.IR threaded . +.IP \[bu] +The type of any descendants of that cgroup that +are not part of lower-level threaded subtrees changes to +.IR "domain invalid" . +.P +Note that in this case, there is no cgroup whose type becomes +.IR "domain threaded" . +(Notionally, the root cgroup can be considered as the threaded root +for the cgroup whose type was changed to +.IR threaded .) +.P +The aim of this exceptional treatment for the root cgroup is to +allow a threaded cgroup that employs the +.I cpu +controller to be placed as high as possible in the hierarchy, +so as to minimize the (small) cost of traversing the cgroup hierarchy. +.\" +.SS The cgroups v2 \[dq]cpu\[dq] controller and realtime threads +As at Linux 4.19, the cgroups v2 +.I cpu +controller does not support control of realtime threads +(specifically threads scheduled under any of the policies +.BR SCHED_FIFO , +.BR SCHED_RR , +described +.BR SCHED_DEADLINE ; +see +.BR sched (7)). +Therefore, the +.I cpu +controller can be enabled in the root cgroup only +if all realtime threads are in the root cgroup. +(If there are realtime threads in nonroot cgroups, then a +.BR write (2) +of the string +.I \[dq]+cpu\[dq] +to the +.I cgroup.subtree_control +file fails with the error +.BR EINVAL .) +.P +On some systems, +.BR systemd (1) +places certain realtime threads in nonroot cgroups in the v2 hierarchy. +On such systems, +these threads must first be moved to the root cgroup before the +.I cpu +controller can be enabled. +.\" +.SH ERRORS +The following errors can occur for +.BR mount (2): +.TP +.B EBUSY +An attempt to mount a cgroup version 1 filesystem specified neither the +.I name= +option (to mount a named hierarchy) nor a controller name (or +.IR all ). +.SH NOTES +A child process created via +.BR fork (2) +inherits its parent's cgroup memberships. +A process's cgroup memberships are preserved across +.BR execve (2). +.P +The +.BR clone3 (2) +.B CLONE_INTO_CGROUP +flag can be used to create a child process that begins its life in +a different version 2 cgroup from the parent process. +.\" +.SS /proc files +.TP +.IR /proc/cgroups " (since Linux 2.6.24)" +This file contains information about the controllers +that are compiled into the kernel. +An example of the contents of this file (reformatted for readability) +is the following: +.IP +.in +4n +.EX +#subsys_name hierarchy num_cgroups enabled +cpuset 4 1 1 +cpu 8 1 1 +cpuacct 8 1 1 +blkio 6 1 1 +memory 3 1 1 +devices 10 84 1 +freezer 7 1 1 +net_cls 9 1 1 +perf_event 5 1 1 +net_prio 9 1 1 +hugetlb 0 1 0 +pids 2 1 1 +.EE +.in +.IP +The fields in this file are, from left to right: +.RS +.IP [1] 5 +The name of the controller. +.IP [2] +The unique ID of the cgroup hierarchy on which this controller is mounted. +If multiple cgroups v1 controllers are bound to the same hierarchy, +then each will show the same hierarchy ID in this field. +The value in this field will be 0 if: +.RS +.IP \[bu] 3 +the controller is not mounted on a cgroups v1 hierarchy; +.IP \[bu] +the controller is bound to the cgroups v2 single unified hierarchy; or +.IP \[bu] +the controller is disabled (see below). +.RE +.IP [3] +The number of control groups in this hierarchy using this controller. +.IP [4] +This field contains the value 1 if this controller is enabled, +or 0 if it has been disabled (via the +.I cgroup_disable +kernel command-line boot parameter). +.RE +.TP +.IR /proc/ pid /cgroup " (since Linux 2.6.24)" +This file describes control groups to which the process +with the corresponding PID belongs. +The displayed information differs for +cgroups version 1 and version 2 hierarchies. +.IP +For each cgroup hierarchy of which the process is a member, +there is one entry containing three colon-separated fields: +.IP +.in +4n +.EX +hierarchy\-ID:controller\-list:cgroup\-path +.EE +.in +.IP +For example: +.IP +.in +4n +.EX +5:cpuacct,cpu,cpuset:/daemons +.EE +.in +.IP +The colon-separated fields are, from left to right: +.RS +.IP [1] 5 +For cgroups version 1 hierarchies, +this field contains a unique hierarchy ID number +that can be matched to a hierarchy ID in +.IR /proc/cgroups . +For the cgroups version 2 hierarchy, this field contains the value 0. +.IP [2] +For cgroups version 1 hierarchies, +this field contains a comma-separated list of the controllers +bound to the hierarchy. +For the cgroups version 2 hierarchy, this field is empty. +.IP [3] +This field contains the pathname of the control group in the hierarchy +to which the process belongs. +This pathname is relative to the mount point of the hierarchy. +.RE +.\" +.SS /sys/kernel/cgroup files +.TP +.IR /sys/kernel/cgroup/delegate " (since Linux 4.15)" +.\" commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3 +This file exports a list of the cgroups v2 files +(one per line) that are delegatable +(i.e., whose ownership should be changed to the user ID of the delegatee). +In the future, the set of delegatable files may change or grow, +and this file provides a way for the kernel to inform +user-space applications of which files must be delegated. +As at Linux 4.15, one sees the following when inspecting this file: +.IP +.in +4n +.EX +$ \fBcat /sys/kernel/cgroup/delegate\fP +cgroup.procs +cgroup.subtree_control +cgroup.threads +.EE +.in +.TP +.IR /sys/kernel/cgroup/features " (since Linux 4.15)" +.\" commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce +Over time, the set of cgroups v2 features that are provided by the +kernel may change or grow, +or some features may not be enabled by default. +This file provides a way for user-space applications to discover what +features the running kernel supports and has enabled. +Features are listed one per line: +.IP +.in +4n +.EX +$ \fBcat /sys/kernel/cgroup/features\fP +nsdelegate +memory_localevents +.EE +.in +.IP +The entries that can appear in this file are: +.RS +.TP +.IR memory_localevents " (since Linux 5.2)" +The kernel supports the +.I memory_localevents +mount option. +.TP +.IR nsdelegate " (since Linux 4.15)" +The kernel supports the +.I nsdelegate +mount option. +.TP +.IR memory_recursiveprot " (since Linux 5.7)" +.\" commit 8a931f801340c2be10552c7b5622d5f4852f3a36 +The kernel supports the +.I memory_recursiveprot +mount option. +.RE +.SH SEE ALSO +.BR prlimit (1), +.BR systemd (1), +.BR systemd\-cgls (1), +.BR systemd\-cgtop (1), +.BR clone (2), +.BR ioprio_set (2), +.BR perf_event_open (2), +.BR setrlimit (2), +.BR cgroup_namespaces (7), +.BR cpuset (7), +.BR namespaces (7), +.BR sched (7), +.BR user_namespaces (7) +.P +The kernel source file +.IR Documentation/admin\-guide/cgroup\-v2.rst . |