1 files changed, 248 insertions, 0 deletions
diff --git a/man7/cgroup_namespaces.7 b/man7/cgroup_namespaces.7
new file mode 100644
index 0000000..c1162fe
--- /dev/null
+++ b/man7/cgroup_namespaces.7
@@ -0,0 +1,248 @@
+.\" Copyright (c) 2016 by Michael Kerrisk <mtk.manpages@gmail.com>
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.\"
+.TH cgroup_namespaces 7 2023-03-30 "Linux man-pages 6.05.01"
+.SH NAME
+cgroup_namespaces \- overview of Linux cgroup namespaces
+.SH DESCRIPTION
+For an overview of namespaces, see
+.BR namespaces (7).
+.PP
+Cgroup namespaces virtualize the view of a process's cgroups (see
+.BR cgroups (7))
+as seen via
+.IR /proc/ pid /cgroup
+and
+.IR /proc/ pid /mountinfo .
+.PP
+Each cgroup namespace has its own set of cgroup root directories.
+These root directories are the base points for the relative
+locations displayed in the corresponding records in the
+.IR /proc/ pid /cgroup
+file.
+When a process creates a new cgroup namespace using
+.BR clone (2)
+or
+.BR unshare (2)
+with the
+.B CLONE_NEWCGROUP
+flag, its current
+cgroups directories become the cgroup root directories
+of the new namespace.
+(This applies both for the cgroups version 1 hierarchies
+and the cgroups version 2 unified hierarchy.)
+.PP
+When reading the cgroup memberships of a "target" process from
+.IR /proc/ pid /cgroup ,
+the pathname shown in the third field of each record will be
+relative to the reading process's root directory
+for the corresponding cgroup hierarchy.
+If the cgroup directory of the target process lies outside
+the root directory of the reading process's cgroup namespace,
+then the pathname will show
+.I ../
+entries for each ancestor level in the cgroup hierarchy.
+.PP
+The following shell session demonstrates the effect of creating
+a new cgroup namespace.
+.PP
+First, (as superuser) in a shell in the initial cgroup namespace,
+we create a child cgroup in the
+.I freezer
+hierarchy, and place a process in that cgroup that we will
+use as part of the demonstration below:
+.PP
+.in +4n
+.EX
+# \fBmkdir \-p /sys/fs/cgroup/freezer/sub2\fP
+# \fBsleep 10000 &\fP     # Create a process that lives for a while
+[1] 20124
+# \fBecho 20124 > /sys/fs/cgroup/freezer/sub2/cgroup.procs\fP
+.EE
+.in
+.PP
+We then create another child cgroup in the
+.I freezer
+hierarchy and put the shell into that cgroup:
+.PP
+.in +4n
+.EX
+# \fBmkdir \-p /sys/fs/cgroup/freezer/sub\fP
+# \fBecho $$\fP                      # Show PID of this shell
+30655
+# \fBecho 30655 > /sys/fs/cgroup/freezer/sub/cgroup.procs\fP
+# \fBcat /proc/self/cgroup | grep freezer\fP
+7:freezer:/sub
+.EE
+.in
+.PP
+Next, we use
+.BR unshare (1)
+to create a process running a new shell in new cgroup and mount namespaces:
+.PP
+.in +4n
+.EX
+# \fBPS1="sh2# " unshare \-Cm bash\fP
+.EE
+.in
+.PP
+From the new shell started by
+.BR unshare (1),
+we then inspect the
+.IR /proc/ pid /cgroup
+files of, respectively, the new shell,
+a process that is in the initial cgroup namespace
+.RI ( init ,
+with PID 1), and the process in the sibling cgroup
+.RI ( sub2 ):
+.PP
+.in +4n
+.EX
+sh2# \fBcat /proc/self/cgroup | grep freezer\fP
+7:freezer:/
+sh2# \fBcat /proc/1/cgroup | grep freezer\fP
+7:freezer:/..
+sh2# \fBcat /proc/20124/cgroup | grep freezer\fP
+7:freezer:/../sub2
+.EE
+.in
+.PP
+From the output of the first command,
+we see that the freezer cgroup membership of the new shell
+(which is in the same cgroup as the initial shell)
+is shown defined relative to the freezer cgroup root directory
+that was established when the new cgroup namespace was created.
+(In absolute terms,
+the new shell is in the
+.I /sub
+freezer cgroup,
+and the root directory of the freezer cgroup hierarchy
+in the new cgroup namespace is also
+.IR /sub .
+Thus, the new shell's cgroup membership is displayed as \[aq]/\[aq].)
+.PP
+However, when we look in
+.I /proc/self/mountinfo
+we see the following anomaly:
+.PP
+.in +4n
+.EX
+sh2# \fBcat /proc/self/mountinfo | grep freezer\fP
+155 145 0:32 /.. /sys/fs/cgroup/freezer ...
+.EE
+.in
+.PP
+The fourth field of this line
+.RI ( /.. )
+should show the
+directory in the cgroup filesystem which forms the root of this mount.
+Since by the definition of cgroup namespaces, the process's current
+freezer cgroup directory became its root freezer cgroup directory,
+we should see \[aq]/\[aq] in this field.
+The problem here is that we are seeing a mount entry for the cgroup
+filesystem corresponding to the initial cgroup namespace
+(whose cgroup filesystem is indeed rooted at the parent directory of
+.IR sub ).
+To fix this problem, we must remount the freezer cgroup filesystem
+from the new shell (i.e., perform the mount from a process that is in the
+new cgroup namespace), after which we see the expected results:
+.PP
+.in +4n
+.EX
+sh2# \fBmount \-\-make\-rslave /\fP     # Don\[aq]t propagate mount events
+                               # to other namespaces
+sh2# \fBumount /sys/fs/cgroup/freezer\fP
+sh2# \fBmount \-t cgroup \-o freezer freezer /sys/fs/cgroup/freezer\fP
+sh2# \fBcat /proc/self/mountinfo | grep freezer\fP
+155 145 0:32 / /sys/fs/cgroup/freezer rw,relatime ...
+.EE
+.in
+.\"
+.SH STANDARDS
+Linux.
+.SH NOTES
+Use of cgroup namespaces requires a kernel that is configured with the
+.B CONFIG_CGROUPS
+option.
+.PP
+The virtualization provided by cgroup namespaces serves a number of purposes:
+.IP \[bu] 3
+It prevents information leaks whereby cgroup directory paths outside of
+a container would otherwise be visible to processes in the container.
+Such leakages could, for example,
+reveal information about the container framework
+to containerized applications.
+.IP \[bu]
+It eases tasks such as container migration.
+The virtualization provided by cgroup namespaces
+allows containers to be isolated from knowledge of
+the pathnames of ancestor cgroups.
+Without such isolation, the full cgroup pathnames (displayed in
+.IR /proc/self/cgroups )
+would need to be replicated on the target system when migrating a container;
+those pathnames would also need to be unique,
+so that they don't conflict with other pathnames on the target system.
+.IP \[bu]
+It allows better confinement of containerized processes,
+because it is possible to mount the container's cgroup filesystems such that
+the container processes can't gain access to ancestor cgroup directories.
+Consider, for example, the following scenario:
+.RS
+.IP \[bu] 3
+We have a cgroup directory,
+.IR /cg/1 ,
+that is owned by user ID 9000.
+.IP \[bu]
+We have a process,
+.IR X ,
+also owned by user ID 9000,
+that is namespaced under the cgroup
+.I /cg/1/2
+(i.e.,
+.I X
+was placed in a new cgroup namespace via
+.BR clone (2)
+or
+.BR unshare (2)
+with the
+.B CLONE_NEWCGROUP
+flag).
+.RE
+.IP
+In the absence of cgroup namespacing, because the cgroup directory
+.I /cg/1
+is owned (and writable) by UID 9000 and process
+.I X
+is also owned by user ID 9000, process
+.I X
+would be able to modify the contents of cgroups files
+(i.e., change cgroup settings) not only in
+.I /cg/1/2
+but also in the ancestor cgroup directory
+.IR /cg/1 .
+Namespacing process
+.I X
+under the cgroup directory
+.IR /cg/1/2 ,
+in combination with suitable mount operations
+for the cgroup filesystem (as shown above),
+prevents it modifying files in
+.IR /cg/1 ,
+since it cannot even see the contents of that directory
+(or of further removed cgroup ancestor directories).
+Combined with correct enforcement of hierarchical limits,
+this prevents process
+.I X
+from escaping the limits imposed by ancestor cgroups.
+.SH SEE ALSO
+.BR unshare (1),
+.BR clone (2),
+.BR setns (2),
+.BR unshare (2),
+.BR proc (5),
+.BR cgroups (7),
+.BR credentials (7),
+.BR namespaces (7),
+.BR user_namespaces (7)