diff options
Diffstat (limited to 'man7/user_namespaces.7')
-rw-r--r-- | man7/user_namespaces.7 | 1469 |
1 files changed, 0 insertions, 1469 deletions
diff --git a/man7/user_namespaces.7 b/man7/user_namespaces.7 deleted file mode 100644 index 94e0785..0000000 --- a/man7/user_namespaces.7 +++ /dev/null @@ -1,1469 +0,0 @@ -.\" Copyright (c) 2013, 2014 by Michael Kerrisk <mtk.manpages@gmail.com> -.\" and Copyright (c) 2012, 2014 by Eric W. Biederman <ebiederm@xmission.com> -.\" -.\" SPDX-License-Identifier: Linux-man-pages-copyleft -.\" -.\" -.TH user_namespaces 7 2024-02-25 "Linux man-pages 6.7" -.SH NAME -user_namespaces \- overview of Linux user namespaces -.SH DESCRIPTION -For an overview of namespaces, see -.BR namespaces (7). -.P -User namespaces isolate security-related identifiers and attributes, -in particular, -user IDs and group IDs (see -.BR credentials (7)), -the root directory, -keys (see -.BR keyrings (7)), -.\" FIXME: This page says very little about the interaction -.\" of user namespaces and keys. Add something on this topic. -and capabilities (see -.BR capabilities (7)). -A process's user and group IDs can be different -inside and outside a user namespace. -In particular, -a process can have a normal unprivileged user ID outside a user namespace -while at the same time having a user ID of 0 inside the namespace; -in other words, -the process has full privileges for operations inside the user namespace, -but is unprivileged for operations outside the namespace. -.\" -.\" ============================================================ -.\" -.SS Nested namespaces, namespace membership -User namespaces can be nested; -that is, each user namespace\[em]except the initial ("root") -namespace\[em]has a parent user namespace, -and can have zero or more child user namespaces. -The parent user namespace is the user namespace -of the process that creates the user namespace via a call to -.BR unshare (2) -or -.BR clone (2) -with the -.B CLONE_NEWUSER -flag. -.P -The kernel imposes (since Linux 3.11) a limit of 32 nested levels of -.\" commit 8742f229b635bf1c1c84a3dfe5e47c814c20b5c8 -user namespaces. -.\" FIXME Explain the rationale for this limit. (What is the rationale?) -Calls to -.BR unshare (2) -or -.BR clone (2) -that would cause this limit to be exceeded fail with the error -.BR EUSERS . -.P -Each process is a member of exactly one user namespace. -A process created via -.BR fork (2) -or -.BR clone (2) -without the -.B CLONE_NEWUSER -flag is a member of the same user namespace as its parent. -A single-threaded process can join another user namespace with -.BR setns (2) -if it has the -.B CAP_SYS_ADMIN -in that namespace; -upon doing so, it gains a full set of capabilities in that namespace. -.P -A call to -.BR clone (2) -or -.BR unshare (2) -with the -.B CLONE_NEWUSER -flag makes the new child process (for -.BR clone (2)) -or the caller (for -.BR unshare (2)) -a member of the new user namespace created by the call. -.P -The -.B NS_GET_PARENT -.BR ioctl (2) -operation can be used to discover the parental relationship -between user namespaces; see -.BR ioctl_ns (2). -.P -A task that changes one of its effective IDs -will have its dumpability reset to the value in -.IR /proc/sys/fs/suid_dumpable . -This may affect the ownership of proc files of child processes -and may thus cause the parent to lack the permissions -to write to mapping files of child processes running in a new user namespace. -In such cases making the parent process dumpable, using -.B PR_SET_DUMPABLE -in a call to -.BR prctl (2), -before creating a child process in a new user namespace -may rectify this problem. -See -.BR prctl (2) -and -.BR proc (5) -for details on how ownership is affected. -.\" -.\" ============================================================ -.\" -.SS Capabilities -The child process created by -.BR clone (2) -with the -.B CLONE_NEWUSER -flag starts out with a complete set -of capabilities in the new user namespace. -Likewise, a process that creates a new user namespace using -.BR unshare (2) -or joins an existing user namespace using -.BR setns (2) -gains a full set of capabilities in that namespace. -On the other hand, -that process has no capabilities in the parent (in the case of -.BR clone (2)) -or previous (in the case of -.BR unshare (2) -and -.BR setns (2)) -user namespace, -even if the new namespace is created or joined by the root user -(i.e., a process with user ID 0 in the root namespace). -.P -Note that a call to -.BR execve (2) -will cause a process's capabilities to be recalculated in the usual way (see -.BR capabilities (7)). -Consequently, -unless the process has a user ID of 0 within the namespace, -or the executable file has a nonempty inheritable capabilities mask, -the process will lose all capabilities. -See the discussion of user and group ID mappings, below. -.P -A call to -.BR clone (2) -or -.BR unshare (2) -using the -.B CLONE_NEWUSER -flag -or a call to -.BR setns (2) -that moves the caller into another user namespace -sets the "securebits" flags -(see -.BR capabilities (7)) -to their default values (all flags disabled) in the child (for -.BR clone (2)) -or caller (for -.BR unshare (2) -or -.BR setns (2)). -Note that because the caller no longer has capabilities -in its original user namespace after a call to -.BR setns (2), -it is not possible for a process to reset its "securebits" flags while -retaining its user namespace membership by using a pair of -.BR setns (2) -calls to move to another user namespace and then return to -its original user namespace. -.P -The rules for determining whether or not a process has a capability -in a particular user namespace are as follows: -.IP \[bu] 3 -A process has a capability inside a user namespace -if it is a member of that namespace and -it has the capability in its effective capability set. -A process can gain capabilities in its effective capability -set in various ways. -For example, it may execute a set-user-ID program or an -executable with associated file capabilities. -In addition, -a process may gain capabilities via the effect of -.BR clone (2), -.BR unshare (2), -or -.BR setns (2), -as already described. -.\" In the 3.8 sources, see security/commoncap.c::cap_capable(): -.IP \[bu] -If a process has a capability in a user namespace, -then it has that capability in all child (and further removed descendant) -namespaces as well. -.IP \[bu] -.\" * The owner of the user namespace in the parent of the -.\" * user namespace has all caps. -When a user namespace is created, the kernel records the effective -user ID of the creating process as being the "owner" of the namespace. -.\" (and likewise associates the effective group ID of the creating process -.\" with the namespace). -A process that resides -in the parent of the user namespace -.\" See kernel commit 520d9eabce18edfef76a60b7b839d54facafe1f9 for a fix -.\" on this point -and whose effective user ID matches the owner of the namespace -has all capabilities in the namespace. -.\" This includes the case where the process executes a set-user-ID -.\" program that confers the effective UID of the creator of the namespace. -By virtue of the previous rule, -this means that the process has all capabilities in all -further removed descendant user namespaces as well. -The -.B NS_GET_OWNER_UID -.BR ioctl (2) -operation can be used to discover the user ID of the owner of the namespace; -see -.BR ioctl_ns (2). -.\" -.\" ============================================================ -.\" -.SS Effect of capabilities within a user namespace -Having a capability inside a user namespace -permits a process to perform operations (that require privilege) -only on resources governed by that namespace. -In other words, having a capability in a user namespace permits a process -to perform privileged operations on resources that are governed by (nonuser) -namespaces owned by (associated with) the user namespace -(see the next subsection). -.P -On the other hand, there are many privileged operations that affect -resources that are not associated with any namespace type, -for example, changing the system (i.e., calendar) time (governed by -.BR CAP_SYS_TIME ), -loading a kernel module (governed by -.BR CAP_SYS_MODULE ), -and creating a device (governed by -.BR CAP_MKNOD ). -Only a process with privileges in the -.I initial -user namespace can perform such operations. -.P -Holding -.B CAP_SYS_ADMIN -within the user namespace that owns a process's mount namespace -allows that process to create bind mounts -and mount the following types of filesystems: -.\" fs_flags = FS_USERNS_MOUNT in kernel sources -.P -.RS 4 -.PD 0 -.IP \[bu] 3 -.I /proc -(since Linux 3.8) -.IP \[bu] -.I /sys -(since Linux 3.8) -.IP \[bu] -.I devpts -(since Linux 3.9) -.IP \[bu] -.BR tmpfs (5) -(since Linux 3.9) -.IP \[bu] -.I ramfs -(since Linux 3.9) -.IP \[bu] -.I mqueue -(since Linux 3.9) -.IP \[bu] -.I bpf -.\" commit b2197755b2633e164a439682fb05a9b5ea48f706 -(since Linux 4.4) -.IP \[bu] -.I overlayfs -.\" commit 92dbc9dedccb9759c7f9f2f0ae6242396376988f -.\" commit 4cb2c00c43b3fe88b32f29df4f76da1b92c33224 -(since Linux 5.11) -.PD -.RE -.P -Holding -.B CAP_SYS_ADMIN -within the user namespace that owns a process's cgroup namespace -allows (since Linux 4.6) -that process to the mount the cgroup version 2 filesystem and -cgroup version 1 named hierarchies -(i.e., cgroup filesystems mounted with the -.I \[dq]none,name=\[dq] -option). -.P -Holding -.B CAP_SYS_ADMIN -within the user namespace that owns a process's PID namespace -allows (since Linux 3.8) -that process to mount -.I /proc -filesystems. -.P -Note, however, that mounting block-based filesystems can be done -only by a process that holds -.B CAP_SYS_ADMIN -in the initial user namespace. -.\" -.\" ============================================================ -.\" -.SS Interaction of user namespaces and other types of namespaces -Starting in Linux 3.8, unprivileged processes can create user namespaces, -and the other types of namespaces can be created with just the -.B CAP_SYS_ADMIN -capability in the caller's user namespace. -.P -When a nonuser namespace is created, -it is owned by the user namespace in which the creating process -was a member at the time of the creation of the namespace. -Privileged operations on resources governed by the nonuser namespace -require that the process has the necessary capabilities -in the user namespace that owns the nonuser namespace. -.P -If -.B CLONE_NEWUSER -is specified along with other -.B CLONE_NEW* -flags in a single -.BR clone (2) -or -.BR unshare (2) -call, the user namespace is guaranteed to be created first, -giving the child -.RB ( clone (2)) -or caller -.RB ( unshare (2)) -privileges over the remaining namespaces created by the call. -Thus, it is possible for an unprivileged caller to specify this combination -of flags. -.P -When a new namespace (other than a user namespace) is created via -.BR clone (2) -or -.BR unshare (2), -the kernel records the user namespace of the creating process as the owner of -the new namespace. -(This association can't be changed.) -When a process in the new namespace subsequently performs -privileged operations that operate on global -resources isolated by the namespace, -the permission checks are performed according to the process's capabilities -in the user namespace that the kernel associated with the new namespace. -For example, suppose that a process attempts to change the hostname -.RB ( sethostname (2)), -a resource governed by the UTS namespace. -In this case, -the kernel will determine which user namespace owns -the process's UTS namespace, and check whether the process has the -required capability -.RB ( CAP_SYS_ADMIN ) -in that user namespace. -.P -The -.B NS_GET_USERNS -.BR ioctl (2) -operation can be used to discover the user namespace -that owns a nonuser namespace; see -.BR ioctl_ns (2). -.\" -.\" ============================================================ -.\" -.SS User and group ID mappings: uid_map and gid_map -When a user namespace is created, -it starts out without a mapping of user IDs (group IDs) -to the parent user namespace. -The -.IR /proc/ pid /uid_map -and -.IR /proc/ pid /gid_map -files (available since Linux 3.5) -.\" commit 22d917d80e842829d0ca0a561967d728eb1d6303 -expose the mappings for user and group IDs -inside the user namespace for the process -.IR pid . -These files can be read to view the mappings in a user namespace and -written to (once) to define the mappings. -.P -The description in the following paragraphs explains the details for -.IR uid_map ; -.I gid_map -is exactly the same, -but each instance of "user ID" is replaced by "group ID". -.P -The -.I uid_map -file exposes the mapping of user IDs from the user namespace -of the process -.I pid -to the user namespace of the process that opened -.I uid_map -(but see a qualification to this point below). -In other words, processes that are in different user namespaces -will potentially see different values when reading from a particular -.I uid_map -file, depending on the user ID mappings for the user namespaces -of the reading processes. -.P -Each line in the -.I uid_map -file specifies a 1-to-1 mapping of a range of contiguous -user IDs between two user namespaces. -(When a user namespace is first created, this file is empty.) -The specification in each line takes the form of -three numbers delimited by white space. -The first two numbers specify the starting user ID in -each of the two user namespaces. -The third number specifies the length of the mapped range. -In detail, the fields are interpreted as follows: -.IP (1) 5 -The start of the range of user IDs in -the user namespace of the process -.IR pid . -.IP (2) -The start of the range of user -IDs to which the user IDs specified by field one map. -How field two is interpreted depends on whether the process that opened -.I uid_map -and the process -.I pid -are in the same user namespace, as follows: -.RS -.IP (a) 5 -If the two processes are in different user namespaces: -field two is the start of a range of -user IDs in the user namespace of the process that opened -.IR uid_map . -.IP (b) -If the two processes are in the same user namespace: -field two is the start of the range of -user IDs in the parent user namespace of the process -.IR pid . -This case enables the opener of -.I uid_map -(the common case here is opening -.IR /proc/self/uid_map ) -to see the mapping of user IDs into the user namespace of the process -that created this user namespace. -.RE -.IP (3) -The length of the range of user IDs that is mapped between the two -user namespaces. -.P -System calls that return user IDs (group IDs)\[em]for example, -.BR getuid (2), -.BR getgid (2), -and the credential fields in the structure returned by -.BR stat (2)\[em]return -the user ID (group ID) mapped into the caller's user namespace. -.P -When a process accesses a file, its user and group IDs -are mapped into the initial user namespace for the purpose of permission -checking and assigning IDs when creating a file. -When a process retrieves file user and group IDs via -.BR stat (2), -the IDs are mapped in the opposite direction, -to produce values relative to the process user and group ID mappings. -.P -The initial user namespace has no parent namespace, -but, for consistency, the kernel provides dummy user and group -ID mapping files for this namespace. -Looking at the -.I uid_map -file -.RI ( gid_map -is the same) from a shell in the initial namespace shows: -.P -.in +4n -.EX -$ \fBcat /proc/$$/uid_map\fP - 0 0 4294967295 -.EE -.in -.P -This mapping tells us -that the range starting at user ID 0 in this namespace -maps to a range starting at 0 in the (nonexistent) parent namespace, -and the length of the range is the largest 32-bit unsigned integer. -This leaves 4294967295 (the 32-bit signed \-1 value) unmapped. -This is deliberate: -.I (uid_t)\~\-1 -is used in several interfaces (e.g., -.BR setreuid (2)) -as a way to specify "no user ID". -Leaving -.I (uid_t)\~\-1 -unmapped and unusable guarantees that there will be no -confusion when using these interfaces. -.\" -.\" ============================================================ -.\" -.SS Defining user and group ID mappings: writing to uid_map and gid_map -After the creation of a new user namespace, the -.I uid_map -file of -.I one -of the processes in the namespace may be written to -.I once -to define the mapping of user IDs in the new user namespace. -An attempt to write more than once to a -.I uid_map -file in a user namespace fails with the error -.BR EPERM . -Similar rules apply for -.I gid_map -files. -.P -The lines written to -.I uid_map -.RI ( gid_map ) -must conform to the following validity rules: -.IP \[bu] 3 -The three fields must be valid numbers, -and the last field must be greater than 0. -.IP \[bu] -Lines are terminated by newline characters. -.IP \[bu] -There is a limit on the number of lines in the file. -In Linux 4.14 and earlier, this limit was (arbitrarily) -.\" 5*12-byte records could fit in a 64B cache line -set at 5 lines. -Since Linux 4.15, -.\" commit 6397fac4915ab3002dc15aae751455da1a852f25 -the limit is 340 lines. -In addition, the number of bytes written to -the file must be less than the system page size, -and the write must be performed at the start of the file (i.e., -.BR lseek (2) -and -.BR pwrite (2) -can't be used to write to nonzero offsets in the file). -.IP \[bu] -The range of user IDs (group IDs) -specified in each line cannot overlap with the ranges -in any other lines. -In the initial implementation (Linux 3.8), this requirement was -satisfied by a simplistic implementation that imposed the further -requirement that -the values in both field 1 and field 2 of successive lines must be -in ascending numerical order, -which prevented some otherwise valid maps from being created. -Linux 3.9 and later -.\" commit 0bd14b4fd72afd5df41e9fd59f356740f22fceba -fix this limitation, allowing any valid set of nonoverlapping maps. -.IP \[bu] -At least one line must be written to the file. -.P -Writes that violate the above rules fail with the error -.BR EINVAL . -.P -In order for a process to write to the -.IR /proc/ pid /uid_map -.RI ( /proc/ pid /gid_map ) -file, all of the following permission requirements must be met: -.IP \[bu] 3 -The writing process must have the -.B CAP_SETUID -.RB ( CAP_SETGID ) -capability in the user namespace of the process -.IR pid . -.IP \[bu] -The writing process must either be in the user namespace of the process -.I pid -or be in the parent user namespace of the process -.IR pid . -.IP \[bu] -The mapped user IDs (group IDs) must in turn have a mapping -in the parent user namespace. -.IP \[bu] -If updating -.IR /proc/ pid /uid_map -to create a mapping that maps UID 0 in the parent namespace, -then one of the following must be true: -.RS -.IP (a) 5 -if writing process is in the parent user namespace, -then it must have the -.B CAP_SETFCAP -capability in that user namespace; or -.IP (b) -if the writing process is in the child user namespace, -then the process that created the user namespace must have had the -.B CAP_SETFCAP -capability when the namespace was created. -.RE -.IP -This rule has been in place since -.\" commit db2e718a47984b9d71ed890eb2ea36ecf150de18 -Linux 5.12. -It eliminates an earlier security bug whereby -a UID 0 process that lacks the -.B CAP_SETFCAP -capability, -which is needed to create a binary with namespaced file capabilities -(as described in -.BR capabilities (7)), -could nevertheless create such a binary, -by the following steps: -.RS -.IP (1) 5 -Create a new user namespace with the identity mapping -(i.e., UID 0 in the new user namespace maps to UID 0 in the parent namespace), -so that UID 0 in both namespaces is equivalent to the same root user ID. -.IP (2) -Since the child process has the -.B CAP_SETFCAP -capability, it could create a binary with namespaced file capabilities -that would then be effective in the parent user namespace -(because the root user IDs are the same in the two namespaces). -.RE -.IP \[bu] -One of the following two cases applies: -.RS -.IP (a) 5 -.I Either -the writing process has the -.B CAP_SETUID -.RB ( CAP_SETGID ) -capability in the -.I parent -user namespace. -.RS -.IP \[bu] 3 -No further restrictions apply: -the process can make mappings to arbitrary user IDs (group IDs) -in the parent user namespace. -.RE -.IP (b) -.I Or -otherwise all of the following restrictions apply: -.RS -.IP \[bu] 3 -The data written to -.I uid_map -.RI ( gid_map ) -must consist of a single line that maps -the writing process's effective user ID -(group ID) in the parent user namespace to a user ID (group ID) -in the user namespace. -.IP \[bu] -The writing process must have the same effective user ID as the process -that created the user namespace. -.IP \[bu] -In the case of -.IR gid_map , -use of the -.BR setgroups (2) -system call must first be denied by writing -.RI \[dq] deny \[dq] -to the -.IR /proc/ pid /setgroups -file (see below) before writing to -.IR gid_map . -.RE -.RE -.P -Writes that violate the above rules fail with the error -.BR EPERM . -.\" -.\" ============================================================ -.\" -.SS Project ID mappings: projid_map -Similarly to user and group ID mappings, -it is possible to create project ID mappings for a user namespace. -(Project IDs are used for disk quotas; see -.BR setquota (8) -and -.BR quotactl (2).) -.P -Project ID mappings are defined by writing to the -.IR /proc/ pid /projid_map -file (present since -.\" commit f76d207a66c3a53defea67e7d36c3eb1b7d6d61d -Linux 3.7). -.P -The validity rules for writing to the -.IR /proc/ pid /projid_map -file are as for writing to the -.I uid_map -file; violation of these rules causes -.BR write (2) -to fail with the error -.BR EINVAL . -.P -The permission rules for writing to the -.IR /proc/ pid /projid_map -file are as follows: -.IP \[bu] 3 -The writing process must either be in the user namespace of the process -.I pid -or be in the parent user namespace of the process -.IR pid . -.IP \[bu] -The mapped project IDs must in turn have a mapping -in the parent user namespace. -.P -Violation of these rules causes -.BR write (2) -to fail with the error -.BR EPERM . -.\" -.\" ============================================================ -.\" -.SS Interaction with system calls that change process UIDs or GIDs -In a user namespace where the -.I uid_map -file has not been written, the system calls that change user IDs will fail. -Similarly, if the -.I gid_map -file has not been written, the system calls that change group IDs will fail. -After the -.I uid_map -and -.I gid_map -files have been written, only the mapped values may be used in -system calls that change user and group IDs. -.P -For user IDs, the relevant system calls include -.BR setuid (2), -.BR setfsuid (2), -.BR setreuid (2), -and -.BR setresuid (2). -For group IDs, the relevant system calls include -.BR setgid (2), -.BR setfsgid (2), -.BR setregid (2), -.BR setresgid (2), -and -.BR setgroups (2). -.P -Writing -.RI \[dq] deny \[dq] -to the -.IR /proc/ pid /setgroups -file before writing to -.IR /proc/ pid /gid_map -.\" Things changed in Linux 3.19 -.\" commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 -.\" commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272 -.\" http://lwn.net/Articles/626665/ -will permanently disable -.BR setgroups (2) -in a user namespace and allow writing to -.IR /proc/ pid /gid_map -without having the -.B CAP_SETGID -capability in the parent user namespace. -.\" -.\" ============================================================ -.\" -.SS The \fI/proc/\fPpid\fI/setgroups\fP file -.\" -.\" commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 -.\" commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272 -.\" http://lwn.net/Articles/626665/ -.\" http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-8989 -.\" -The -.IR /proc/ pid /setgroups -file displays the string -.RI \[dq] allow \[dq] -if processes in the user namespace that contains the process -.I pid -are permitted to employ the -.BR setgroups (2) -system call; it displays -.RI \[dq] deny \[dq] -if -.BR setgroups (2) -is not permitted in that user namespace. -Note that regardless of the value in the -.IR /proc/ pid /setgroups -file (and regardless of the process's capabilities), calls to -.BR setgroups (2) -are also not permitted if -.IR /proc/ pid /gid_map -has not yet been set. -.P -A privileged process (one with the -.B CAP_SYS_ADMIN -capability in the namespace) may write either of the strings -.RI \[dq] allow \[dq] -or -.RI \[dq] deny \[dq] -to this file -.I before -writing a group ID mapping -for this user namespace to the file -.IR /proc/ pid /gid_map . -Writing the string -.RI \[dq] deny \[dq] -prevents any process in the user namespace from employing -.BR setgroups (2). -.P -The essence of the restrictions described in the preceding -paragraph is that it is permitted to write to -.IR /proc/ pid /setgroups -only so long as calling -.BR setgroups (2) -is disallowed because -.IR /proc/ pid /gid_map -has not been set. -This ensures that a process cannot transition from a state where -.BR setgroups (2) -is allowed to a state where -.BR setgroups (2) -is denied; -a process can transition only from -.BR setgroups (2) -being disallowed to -.BR setgroups (2) -being allowed. -.P -The default value of this file in the initial user namespace is -.RI \[dq] allow \[dq]. -.P -Once -.IR /proc/ pid /gid_map -has been written to -(which has the effect of enabling -.BR setgroups (2) -in the user namespace), -it is no longer possible to disallow -.BR setgroups (2) -by writing -.RI \[dq] deny \[dq] -to -.IR /proc/ pid /setgroups -(the write fails with the error -.BR EPERM ). -.P -A child user namespace inherits the -.IR /proc/ pid /setgroups -setting from its parent. -.P -If the -.I setgroups -file has the value -.RI \[dq] deny \[dq], -then the -.BR setgroups (2) -system call can't subsequently be reenabled (by writing -.RI \[dq] allow \[dq] -to the file) in this user namespace. -(Attempts to do so fail with the error -.BR EPERM .) -This restriction also propagates down to all child user namespaces of -this user namespace. -.P -The -.IR /proc/ pid /setgroups -file was added in Linux 3.19, -but was backported to many earlier stable kernel series, -because it addresses a security issue. -The issue concerned files with permissions such as "rwx\-\-\-rwx". -Such files give fewer permissions to "group" than they do to "other". -This means that dropping groups using -.BR setgroups (2) -might allow a process file access that it did not formerly have. -Before the existence of user namespaces this was not a concern, -since only a privileged process (one with the -.B CAP_SETGID -capability) could call -.BR setgroups (2). -However, with the introduction of user namespaces, -it became possible for an unprivileged process to create -a new namespace in which the user had all privileges. -This then allowed formerly unprivileged -users to drop groups and thus gain file access -that they did not previously have. -The -.IR /proc/ pid /setgroups -file was added to address this security issue, -by denying any pathway for an unprivileged process to drop groups with -.BR setgroups (2). -.\" -.\" /proc/PID/setgroups -.\" [allow == setgroups() is allowed, "deny" == setgroups() is disallowed] -.\" * Can write if have CAP_SYS_ADMIN in NS -.\" * Must write BEFORE writing to /proc/PID/gid_map -.\" -.\" setgroups() -.\" * Must already have written to gid_map -.\" * /proc/PID/setgroups must be "allow" -.\" -.\" /proc/PID/gid_map -- writing -.\" * Must already have written "deny" to /proc/PID/setgroups -.\" -.\" ============================================================ -.\" -.SS Unmapped user and group IDs -There are various places where an unmapped user ID (group ID) -may be exposed to user space. -For example, the first process in a new user namespace may call -.BR getuid (2) -before a user ID mapping has been defined for the namespace. -In most such cases, an unmapped user ID is converted -.\" from_kuid_munged(), from_kgid_munged() -to the overflow user ID (group ID); -the default value for the overflow user ID (group ID) is 65534. -See the descriptions of -.I /proc/sys/kernel/overflowuid -and -.I /proc/sys/kernel/overflowgid -in -.BR proc (5). -.P -The cases where unmapped IDs are mapped in this fashion include -system calls that return user IDs -.RB ( getuid (2), -.BR getgid (2), -and similar), -credentials passed over a UNIX domain socket, -.\" also SO_PEERCRED -credentials returned by -.BR stat (2), -.BR waitid (2), -and the System V IPC "ctl" -.B IPC_STAT -operations, -credentials exposed by -.IR /proc/ pid /status -and the files in -.IR /proc/sysvipc/* , -credentials returned via the -.I si_uid -field in the -.I siginfo_t -received with a signal (see -.BR sigaction (2)), -credentials written to the process accounting file (see -.BR acct (5)), -and credentials returned with POSIX message queue notifications (see -.BR mq_notify (3)). -.P -There is one notable case where unmapped user and group IDs are -.I not -.\" from_kuid(), from_kgid() -.\" Also F_GETOWNER_UIDS is an exception -converted to the corresponding overflow ID value. -When viewing a -.I uid_map -or -.I gid_map -file in which there is no mapping for the second field, -that field is displayed as 4294967295 (\-1 as an unsigned integer). -.\" -.\" ============================================================ -.\" -.SS Accessing files -In order to determine permissions when an unprivileged process accesses a file, -the process credentials (UID, GID) and the file credentials -are in effect mapped back to what they would be in -the initial user namespace and then compared to determine -the permissions that the process has on the file. -The same is also true of other objects that employ the credentials plus -permissions mask accessibility model, such as System V IPC objects. -.\" -.\" ============================================================ -.\" -.SS Operation of file-related capabilities -Certain capabilities allow a process to bypass various -kernel-enforced restrictions when performing operations on -files owned by other users or groups. -These capabilities are: -.BR CAP_CHOWN , -.BR CAP_DAC_OVERRIDE , -.BR CAP_DAC_READ_SEARCH , -.BR CAP_FOWNER , -and -.BR CAP_FSETID . -.P -Within a user namespace, -these capabilities allow a process to bypass the rules -if the process has the relevant capability over the file, -meaning that: -.IP \[bu] 3 -the process has the relevant effective capability in its user namespace; and -.IP \[bu] -the file's user ID and group ID both have valid mappings -in the user namespace. -.P -The -.B CAP_FOWNER -capability is treated somewhat exceptionally: -.\" These are the checks performed by the kernel function -.\" inode_owner_or_capable(). There is one exception to the exception: -.\" overriding the directory sticky permission bit requires that -.\" the file has a valid mapping for both its UID and GID. -it allows a process to bypass the corresponding rules so long as -at least the file's user ID has a mapping in the user namespace -(i.e., the file's group ID does not need to have a valid mapping). -.\" -.\" ============================================================ -.\" -.SS Set-user-ID and set-group-ID programs -When a process inside a user namespace executes -a set-user-ID (set-group-ID) program, -the process's effective user (group) ID inside the namespace is changed -to whatever value is mapped for the user (group) ID of the file. -However, if either the user -.I or -the group ID of the file has no mapping inside the namespace, -the set-user-ID (set-group-ID) bit is silently ignored: -the new program is executed, -but the process's effective user (group) ID is left unchanged. -(This mirrors the semantics of executing a set-user-ID or set-group-ID -program that resides on a filesystem that was mounted with the -.B MS_NOSUID -flag, as described in -.BR mount (2).) -.\" -.\" ============================================================ -.\" -.SS Miscellaneous -When a process's user and group IDs are passed over a UNIX domain socket -to a process in a different user namespace (see the description of -.B SCM_CREDENTIALS -in -.BR unix (7)), -they are translated into the corresponding values as per the -receiving process's user and group ID mappings. -.\" -.SH STANDARDS -Linux. -.\" -.SH NOTES -Over the years, there have been a lot of features that have been added -to the Linux kernel that have been made available only to privileged users -because of their potential to confuse set-user-ID-root applications. -In general, it becomes safe to allow the root user in a user namespace to -use those features because it is impossible, while in a user namespace, -to gain more privilege than the root user of a user namespace has. -.\" -.\" ============================================================ -.\" -.SS Global root -The term "global root" is sometimes used as a shorthand for -user ID 0 in the initial user namespace. -.\" -.\" ============================================================ -.\" -.SS Availability -Use of user namespaces requires a kernel that is configured with the -.B CONFIG_USER_NS -option. -User namespaces require support in a range of subsystems across -the kernel. -When an unsupported subsystem is configured into the kernel, -it is not possible to configure user namespaces support. -.P -As at Linux 3.8, most relevant subsystems supported user namespaces, -but a number of filesystems did not have the infrastructure needed -to map user and group IDs between user namespaces. -Linux 3.9 added the required infrastructure support for many of -the remaining unsupported filesystems -(Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA, NFS, and OCFS2). -Linux 3.12 added support for the last of the unsupported major filesystems, -.\" commit d6970d4b726cea6d7a9bc4120814f95c09571fc3 -XFS. -.\" -.SH EXAMPLES -The program below is designed to allow experimenting with -user namespaces, as well as other types of namespaces. -It creates namespaces as specified by command-line options and then executes -a command inside those namespaces. -The comments and -.IR usage () -function inside the program provide a full explanation of the program. -The following shell session demonstrates its use. -.P -First, we look at the run-time environment: -.P -.in +4n -.EX -$ \fBuname \-rs\fP # Need Linux 3.8 or later -Linux 3.8.0 -$ \fBid \-u\fP # Running as unprivileged user -1000 -$ \fBid \-g\fP -1000 -.EE -.in -.P -Now start a new shell in new user -.RI ( \-U ), -mount -.RI ( \-m ), -and PID -.RI ( \-p ) -namespaces, with user ID -.RI ( \-M ) -and group ID -.RI ( \-G ) -1000 mapped to 0 inside the user namespace: -.P -.in +4n -.EX -$ \fB./userns_child_exec \-p \-m \-U \-M \[aq]0 1000 1\[aq] \-G \[aq]0 1000 1\[aq] bash\fP -.EE -.in -.P -The shell has PID 1, because it is the first process in the new -PID namespace: -.P -.in +4n -.EX -bash$ \fBecho $$\fP -1 -.EE -.in -.P -Mounting a new -.I /proc -filesystem and listing all of the processes visible -in the new PID namespace shows that the shell can't see -any processes outside the PID namespace: -.P -.in +4n -.EX -bash$ \fBmount \-t proc proc /proc\fP -bash$ \fBps ax\fP - PID TTY STAT TIME COMMAND - 1 pts/3 S 0:00 bash - 22 pts/3 R+ 0:00 ps ax -.EE -.in -.P -Inside the user namespace, the shell has user and group ID 0, -and a full set of permitted and effective capabilities: -.P -.in +4n -.EX -bash$ \fBcat /proc/$$/status | egrep \[aq]\[ha][UG]id\[aq]\fP -Uid: 0 0 0 0 -Gid: 0 0 0 0 -bash$ \fBcat /proc/$$/status | egrep \[aq]\[ha]Cap(Prm|Inh|Eff)\[aq]\fP -CapInh: 0000000000000000 -CapPrm: 0000001fffffffff -CapEff: 0000001fffffffff -.EE -.in -.SS Program source -\& -.EX -/* userns_child_exec.c -\& - Licensed under GNU General Public License v2 or later -\& - Create a child process that executes a shell command in new - namespace(s); allow UID and GID mappings to be specified when - creating a user namespace. -*/ -#define _GNU_SOURCE -#include <err.h> -#include <sched.h> -#include <unistd.h> -#include <stdint.h> -#include <stdlib.h> -#include <sys/wait.h> -#include <signal.h> -#include <fcntl.h> -#include <stdio.h> -#include <string.h> -#include <limits.h> -#include <errno.h> -\& -struct child_args { - char **argv; /* Command to be executed by child, with args */ - int pipe_fd[2]; /* Pipe used to synchronize parent and child */ -}; -\& -static int verbose; -\& -static void -usage(char *pname) -{ - fprintf(stderr, "Usage: %s [options] cmd [arg...]\en\en", pname); - fprintf(stderr, "Create a child process that executes a shell " - "command in a new user namespace,\en" - "and possibly also other new namespace(s).\en\en"); - fprintf(stderr, "Options can be:\en\en"); -#define fpe(str) fprintf(stderr, " %s", str); - fpe("\-i New IPC namespace\en"); - fpe("\-m New mount namespace\en"); - fpe("\-n New network namespace\en"); - fpe("\-p New PID namespace\en"); - fpe("\-u New UTS namespace\en"); - fpe("\-U New user namespace\en"); - fpe("\-M uid_map Specify UID map for user namespace\en"); - fpe("\-G gid_map Specify GID map for user namespace\en"); - fpe("\-z Map user\[aq]s UID and GID to 0 in user namespace\en"); - fpe(" (equivalent to: \-M \[aq]0 <uid> 1\[aq] \-G \[aq]0 <gid> 1\[aq])\en"); - fpe("\-v Display verbose messages\en"); - fpe("\en"); - fpe("If \-z, \-M, or \-G is specified, \-U is required.\en"); - fpe("It is not permitted to specify both \-z and either \-M or \-G.\en"); - fpe("\en"); - fpe("Map strings for \-M and \-G consist of records of the form:\en"); - fpe("\en"); - fpe(" ID\-inside\-ns ID\-outside\-ns len\en"); - fpe("\en"); - fpe("A map string can contain multiple records, separated" - " by commas;\en"); - fpe("the commas are replaced by newlines before writing" - " to map files.\en"); -\& - exit(EXIT_FAILURE); -} -\& -/* Update the mapping file \[aq]map_file\[aq], with the value provided in - \[aq]mapping\[aq], a string that defines a UID or GID mapping. A UID or - GID mapping consists of one or more newline\-delimited records - of the form: -\& - ID_inside\-ns ID\-outside\-ns length -\& - Requiring the user to supply a string that contains newlines is - of course inconvenient for command\-line use. Thus, we permit the - use of commas to delimit records in this string, and replace them - with newlines before writing the string to the file. */ -\& -static void -update_map(char *mapping, char *map_file) -{ - int fd; - size_t map_len; /* Length of \[aq]mapping\[aq] */ -\& - /* Replace commas in mapping string with newlines. */ -\& - map_len = strlen(mapping); - for (size_t j = 0; j < map_len; j++) - if (mapping[j] == \[aq],\[aq]) - mapping[j] = \[aq]\en\[aq]; -\& - fd = open(map_file, O_RDWR); - if (fd == \-1) { - fprintf(stderr, "ERROR: open %s: %s\en", map_file, - strerror(errno)); - exit(EXIT_FAILURE); - } -\& - if (write(fd, mapping, map_len) != map_len) { - fprintf(stderr, "ERROR: write %s: %s\en", map_file, - strerror(errno)); - exit(EXIT_FAILURE); - } -\& - close(fd); -} -\& -/* Linux 3.19 made a change in the handling of setgroups(2) and - the \[aq]gid_map\[aq] file to address a security issue. The issue - allowed *unprivileged* users to employ user namespaces in - order to drop groups. The upshot of the 3.19 changes is that - in order to update the \[aq]gid_maps\[aq] file, use of the setgroups() - system call in this user namespace must first be disabled by - writing "deny" to one of the /proc/PID/setgroups files for - this namespace. That is the purpose of the following function. */ -\& -static void -proc_setgroups_write(pid_t child_pid, char *str) -{ - char setgroups_path[PATH_MAX]; - int fd; -\& - snprintf(setgroups_path, PATH_MAX, "/proc/%jd/setgroups", - (intmax_t) child_pid); -\& - fd = open(setgroups_path, O_RDWR); - if (fd == \-1) { -\& - /* We may be on a system that doesn\[aq]t support - /proc/PID/setgroups. In that case, the file won\[aq]t exist, - and the system won\[aq]t impose the restrictions that Linux 3.19 - added. That\[aq]s fine: we don\[aq]t need to do anything in order - to permit \[aq]gid_map\[aq] to be updated. -\& - However, if the error from open() was something other than - the ENOENT error that is expected for that case, let the - user know. */ -\& - if (errno != ENOENT) - fprintf(stderr, "ERROR: open %s: %s\en", setgroups_path, - strerror(errno)); - return; - } -\& - if (write(fd, str, strlen(str)) == \-1) - fprintf(stderr, "ERROR: write %s: %s\en", setgroups_path, - strerror(errno)); -\& - close(fd); -} -\& -static int /* Start function for cloned child */ -childFunc(void *arg) -{ - struct child_args *args = arg; - char ch; -\& - /* Wait until the parent has updated the UID and GID mappings. - See the comment in main(). We wait for end of file on a - pipe that will be closed by the parent process once it has - updated the mappings. */ -\& - close(args\->pipe_fd[1]); /* Close our descriptor for the write - end of the pipe so that we see EOF - when parent closes its descriptor. */ - if (read(args\->pipe_fd[0], &ch, 1) != 0) { - fprintf(stderr, - "Failure in child: read from pipe returned != 0\en"); - exit(EXIT_FAILURE); - } -\& - close(args\->pipe_fd[0]); -\& - /* Execute a shell command. */ -\& - printf("About to exec %s\en", args\->argv[0]); - execvp(args\->argv[0], args\->argv); - err(EXIT_FAILURE, "execvp"); -} -\& -#define STACK_SIZE (1024 * 1024) -\& -static char child_stack[STACK_SIZE]; /* Space for child\[aq]s stack */ -\& -int -main(int argc, char *argv[]) -{ - int flags, opt, map_zero; - pid_t child_pid; - struct child_args args; - char *uid_map, *gid_map; - const int MAP_BUF_SIZE = 100; - char map_buf[MAP_BUF_SIZE]; - char map_path[PATH_MAX]; -\& - /* Parse command\-line options. The initial \[aq]+\[aq] character in - the final getopt() argument prevents GNU\-style permutation - of command\-line options. That\[aq]s useful, since sometimes - the \[aq]command\[aq] to be executed by this program itself - has command\-line options. We don\[aq]t want getopt() to treat - those as options to this program. */ -\& - flags = 0; - verbose = 0; - gid_map = NULL; - uid_map = NULL; - map_zero = 0; - while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != \-1) { - switch (opt) { - case \[aq]i\[aq]: flags |= CLONE_NEWIPC; break; - case \[aq]m\[aq]: flags |= CLONE_NEWNS; break; - case \[aq]n\[aq]: flags |= CLONE_NEWNET; break; - case \[aq]p\[aq]: flags |= CLONE_NEWPID; break; - case \[aq]u\[aq]: flags |= CLONE_NEWUTS; break; - case \[aq]v\[aq]: verbose = 1; break; - case \[aq]z\[aq]: map_zero = 1; break; - case \[aq]M\[aq]: uid_map = optarg; break; - case \[aq]G\[aq]: gid_map = optarg; break; - case \[aq]U\[aq]: flags |= CLONE_NEWUSER; break; - default: usage(argv[0]); - } - } -\& - /* \-M or \-G without \-U is nonsensical */ -\& - if (((uid_map != NULL || gid_map != NULL || map_zero) && - !(flags & CLONE_NEWUSER)) || - (map_zero && (uid_map != NULL || gid_map != NULL))) - usage(argv[0]); -\& - args.argv = &argv[optind]; -\& - /* We use a pipe to synchronize the parent and child, in order to - ensure that the parent sets the UID and GID maps before the child - calls execve(). This ensures that the child maintains its - capabilities during the execve() in the common case where we - want to map the child\[aq]s effective user ID to 0 in the new user - namespace. Without this synchronization, the child would lose - its capabilities if it performed an execve() with nonzero - user IDs (see the capabilities(7) man page for details of the - transformation of a process\[aq]s capabilities during execve()). */ -\& - if (pipe(args.pipe_fd) == \-1) - err(EXIT_FAILURE, "pipe"); -\& - /* Create the child in new namespace(s). */ -\& - child_pid = clone(childFunc, child_stack + STACK_SIZE, - flags | SIGCHLD, &args); - if (child_pid == \-1) - err(EXIT_FAILURE, "clone"); -\& - /* Parent falls through to here. */ -\& - if (verbose) - printf("%s: PID of child created by clone() is %jd\en", - argv[0], (intmax_t) child_pid); -\& - /* Update the UID and GID maps in the child. */ -\& - if (uid_map != NULL || map_zero) { - snprintf(map_path, PATH_MAX, "/proc/%jd/uid_map", - (intmax_t) child_pid); - if (map_zero) { - snprintf(map_buf, MAP_BUF_SIZE, "0 %jd 1", - (intmax_t) getuid()); - uid_map = map_buf; - } - update_map(uid_map, map_path); - } -\& - if (gid_map != NULL || map_zero) { - proc_setgroups_write(child_pid, "deny"); -\& - snprintf(map_path, PATH_MAX, "/proc/%jd/gid_map", - (intmax_t) child_pid); - if (map_zero) { - snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", - (intmax_t) getgid()); - gid_map = map_buf; - } - update_map(gid_map, map_path); - } -\& - /* Close the write end of the pipe, to signal to the child that we - have updated the UID and GID maps. */ -\& - close(args.pipe_fd[1]); -\& - if (waitpid(child_pid, NULL, 0) == \-1) /* Wait for child */ - err(EXIT_FAILURE, "waitpid"); -\& - if (verbose) - printf("%s: terminating\en", argv[0]); -\& - exit(EXIT_SUCCESS); -} -.EE -.SH SEE ALSO -.BR newgidmap (1), \" From the shadow package -.BR newuidmap (1), \" From the shadow package -.BR clone (2), -.BR ptrace (2), -.BR setns (2), -.BR unshare (2), -.BR proc (5), -.BR subgid (5), \" From the shadow package -.BR subuid (5), \" From the shadow package -.BR capabilities (7), -.BR cgroup_namespaces (7), -.BR credentials (7), -.BR namespaces (7), -.BR pid_namespaces (7) -.P -The kernel source file -.IR Documentation/admin\-guide/namespaces/resource\-control.rst . |