diff options
Diffstat (limited to 'man7/pid_namespaces.7')
-rw-r--r-- | man7/pid_namespaces.7 | 388 |
1 files changed, 388 insertions, 0 deletions
diff --git a/man7/pid_namespaces.7 b/man7/pid_namespaces.7 new file mode 100644 index 0000000..b154fb4 --- /dev/null +++ b/man7/pid_namespaces.7 @@ -0,0 +1,388 @@ +.\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com> +.\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com> +.\" +.\" SPDX-License-Identifier: Linux-man-pages-copyleft +.\" +.\" +.TH pid_namespaces 7 2023-03-30 "Linux man-pages 6.05.01" +.SH NAME +pid_namespaces \- overview of Linux PID namespaces +.SH DESCRIPTION +For an overview of namespaces, see +.BR namespaces (7). +.PP +PID namespaces isolate the process ID number space, +meaning that processes in different PID namespaces can have the same PID. +PID namespaces allow containers to provide functionality +such as suspending/resuming the set of processes in the container and +migrating the container to a new host +while the processes inside the container maintain the same PIDs. +.PP +PIDs in a new PID namespace start at 1, +somewhat like a standalone system, and calls to +.BR fork (2), +.BR vfork (2), +or +.BR clone (2) +will produce processes with PIDs that are unique within the namespace. +.PP +Use of PID namespaces requires a kernel that is configured with the +.B CONFIG_PID_NS +option. +.\" +.\" ============================================================ +.\" +.SS The namespace "init" process +The first process created in a new namespace +(i.e., the process created using +.BR clone (2) +with the +.B CLONE_NEWPID +flag, or the first child created by a process after a call to +.BR unshare (2) +using the +.B CLONE_NEWPID +flag) has the PID 1, and is the "init" process for the namespace (see +.BR init (1)). +This process becomes the parent of any child processes that are orphaned +because a process that resides in this PID namespace terminated +(see below for further details). +.PP +If the "init" process of a PID namespace terminates, +the kernel terminates all of the processes in the namespace via a +.B SIGKILL +signal. +This behavior reflects the fact that the "init" process +is essential for the correct operation of a PID namespace. +In this case, a subsequent +.BR fork (2) +into this PID namespace fail with the error +.BR ENOMEM ; +it is not possible to create a new process in a PID namespace whose "init" +process has terminated. +Such scenarios can occur when, for example, +a process uses an open file descriptor for a +.IR /proc/ pid /ns/pid +file corresponding to a process that was in a namespace to +.BR setns (2) +into that namespace after the "init" process has terminated. +Another possible scenario can occur after a call to +.BR unshare (2): +if the first child subsequently created by a +.BR fork (2) +terminates, then subsequent calls to +.BR fork (2) +fail with +.BR ENOMEM . +.PP +Only signals for which the "init" process has established a signal handler +can be sent to the "init" process by other members of the PID namespace. +This restriction applies even to privileged processes, +and prevents other members of the PID namespace from +accidentally killing the "init" process. +.PP +Likewise, a process in an ancestor namespace +can\[em]subject to the usual permission checks described in +.BR kill (2)\[em]send +signals to the "init" process of a child PID namespace only +if the "init" process has established a handler for that signal. +(Within the handler, the +.I siginfo_t +.I si_pid +field described in +.BR sigaction (2) +will be zero.) +.B SIGKILL +or +.B SIGSTOP +are treated exceptionally: +these signals are forcibly delivered when sent from an ancestor PID namespace. +Neither of these signals can be caught by the "init" process, +and so will result in the usual actions associated with those signals +(respectively, terminating and stopping the process). +.PP +Starting with Linux 3.4, the +.BR reboot (2) +system call causes a signal to be sent to the namespace "init" process. +See +.BR reboot (2) +for more details. +.\" +.\" ============================================================ +.\" +.SS Nesting PID namespaces +PID namespaces can be nested: +each PID namespace has a parent, +except for the initial ("root") PID namespace. +The parent of a PID namespace is the PID namespace of the process that +created the namespace using +.BR clone (2) +or +.BR unshare (2). +PID namespaces thus form a tree, +with all namespaces ultimately tracing their ancestry to the root namespace. +Since Linux 3.7, +.\" commit f2302505775fd13ba93f034206f1e2a587017929 +.\" The kernel constant MAX_PID_NS_LEVEL +the kernel limits the maximum nesting depth for PID namespaces to 32. +.PP +A process is visible to other processes in its PID namespace, +and to the processes in each direct ancestor PID namespace +going back to the root PID namespace. +In this context, "visible" means that one process +can be the target of operations by another process using +system calls that specify a process ID. +Conversely, the processes in a child PID namespace can't see +processes in the parent and further removed ancestor namespaces. +More succinctly: a process can see (e.g., send signals with +.BR kill (2), +set nice values with +.BR setpriority (2), +etc.) only processes contained in its own PID namespace +and in descendants of that namespace. +.PP +A process has one process ID in each of the layers of the PID +namespace hierarchy in which is visible, +and walking back though each direct ancestor namespace +through to the root PID namespace. +System calls that operate on process IDs always +operate using the process ID that is visible in the +PID namespace of the caller. +A call to +.BR getpid (2) +always returns the PID associated with the namespace in which +the process was created. +.PP +Some processes in a PID namespace may have parents +that are outside of the namespace. +For example, the parent of the initial process in the namespace +(i.e., the +.BR init (1) +process with PID 1) is necessarily in another namespace. +Likewise, the direct children of a process that uses +.BR setns (2) +to cause its children to join a PID namespace are in a different +PID namespace from the caller of +.BR setns (2). +Calls to +.BR getppid (2) +for such processes return 0. +.PP +While processes may freely descend into child PID namespaces +(e.g., using +.BR setns (2) +with a PID namespace file descriptor), +they may not move in the other direction. +That is to say, processes may not enter any ancestor namespaces +(parent, grandparent, etc.). +Changing PID namespaces is a one-way operation. +.PP +The +.B NS_GET_PARENT +.BR ioctl (2) +operation can be used to discover the parental relationship +between PID namespaces; see +.BR ioctl_ns (2). +.\" +.\" ============================================================ +.\" +.SS setns(2) and unshare(2) semantics +Calls to +.BR setns (2) +that specify a PID namespace file descriptor +and calls to +.BR unshare (2) +with the +.B CLONE_NEWPID +flag cause children subsequently created +by the caller to be placed in a different PID namespace from the caller. +(Since Linux 4.12, that PID namespace is shown via the +.IR /proc/ pid /ns/pid_for_children +file, as described in +.BR namespaces (7).) +These calls do not, however, +change the PID namespace of the calling process, +because doing so would change the caller's idea of its own PID +(as reported by +.BR getpid ()), +which would break many applications and libraries. +.PP +To put things another way: +a process's PID namespace membership is determined when the process is created +and cannot be changed thereafter. +Among other things, this means that the parental relationship +between processes mirrors the parental relationship between PID namespaces: +the parent of a process is either in the same namespace +or resides in the immediate parent PID namespace. +.PP +A process may call +.BR unshare (2) +with the +.B CLONE_NEWPID +flag only once. +After it has performed this operation, its +.IR /proc/ pid /ns/pid_for_children +symbolic link will be empty until the first child is created in the namespace. +.\" +.\" ============================================================ +.\" +.SS Adoption of orphaned children +When a child process becomes orphaned, it is reparented to the "init" +process in the PID namespace of its parent +(unless one of the nearer ancestors of the parent employed the +.BR prctl (2) +.B PR_SET_CHILD_SUBREAPER +command to mark itself as the reaper of orphaned descendant processes). +Note that because of the +.BR setns (2) +and +.BR unshare (2) +semantics described above, this may be the "init" process in the PID +namespace that is the +.I parent +of the child's PID namespace, +rather than the "init" process in the child's own PID namespace. +.\" Furthermore, by definition, the parent of the "init" process +.\" of a PID namespace resides in the parent PID namespace. +.\" +.\" ============================================================ +.\" +.SS Compatibility of CLONE_NEWPID with other CLONE_* flags +In current versions of Linux, +.B CLONE_NEWPID +can't be combined with +.BR CLONE_THREAD . +Threads are required to be in the same PID namespace such that +the threads in a process can send signals to each other. +Similarly, it must be possible to see all of the threads +of a process in the +.BR proc (5) +filesystem. +Additionally, if two threads were in different PID +namespaces, the process ID of the process sending a signal +could not be meaningfully encoded when a signal is sent +(see the description of the +.I siginfo_t +type in +.BR sigaction (2)). +Since this is computed when a signal is enqueued, +a signal queue shared by processes in multiple PID namespaces +would defeat that. +.PP +.\" Note these restrictions were all introduced in +.\" 8382fcac1b813ad0a4e68a838fc7ae93fa39eda0 +.\" when CLONE_NEWPID|CLONE_VM was disallowed +In earlier versions of Linux, +.B CLONE_NEWPID +was additionally disallowed (failing with the error +.BR EINVAL ) +in combination with +.B CLONE_SIGHAND +.\" (restriction lifted in faf00da544045fdc1454f3b9e6d7f65c841de302) +(before Linux 4.3) as well as +.\" (restriction lifted in e79f525e99b04390ca4d2366309545a836c03bf1) +.B CLONE_VM +(before Linux 3.12). +The changes that lifted these restrictions have also been ported to +earlier stable kernels. +.\" +.\" ============================================================ +.\" +.SS /proc and PID namespaces +A +.I /proc +filesystem shows (in the +.IR /proc/ pid +directories) only processes visible in the PID namespace +of the process that performed the mount, even if the +.I /proc +filesystem is viewed from processes in other namespaces. +.PP +After creating a new PID namespace, +it is useful for the child to change its root directory +and mount a new procfs instance at +.I /proc +so that tools such as +.BR ps (1) +work correctly. +If a new mount namespace is simultaneously created by including +.B CLONE_NEWNS +in the +.I flags +argument of +.BR clone (2) +or +.BR unshare (2), +then it isn't necessary to change the root directory: +a new procfs instance can be mounted directly over +.IR /proc . +.PP +From a shell, the command to mount +.I /proc +is: +.PP +.in +4n +.EX +$ mount \-t proc proc /proc +.EE +.in +.PP +Calling +.BR readlink (2) +on the path +.I /proc/self +yields the process ID of the caller in the PID namespace of the procfs mount +(i.e., the PID namespace of the process that mounted the procfs). +This can be useful for introspection purposes, +when a process wants to discover its PID in other namespaces. +.\" +.\" ============================================================ +.\" +.SS /proc files +.TP +.BR /proc/sys/kernel/ns_last_pid " (since Linux 3.3)" +.\" commit b8f566b04d3cddd192cfd2418ae6d54ac6353792 +This file +(which is virtualized per PID namespace) +displays the last PID that was allocated in this PID namespace. +When the next PID is allocated, +the kernel will search for the lowest unallocated PID +that is greater than this value, +and when this file is subsequently read it will show that PID. +.IP +This file is writable by a process that has the +.B CAP_SYS_ADMIN +or (since Linux 5.9) +.B CAP_CHECKPOINT_RESTORE +capability inside the user namespace that owns the PID namespace. +.\" This ability is necessary to support checkpoint restore in user-space +This makes it possible to determine the PID that is allocated +to the next process that is created inside this PID namespace. +.\" +.\" ============================================================ +.\" +.SS Miscellaneous +When a process ID is passed over a UNIX domain socket to a +process in a different PID namespace (see the description of +.B SCM_CREDENTIALS +in +.BR unix (7)), +it is translated into the corresponding PID value in +the receiving process's PID namespace. +.SH STANDARDS +Linux. +.SH EXAMPLES +See +.BR user_namespaces (7). +.SH SEE ALSO +.BR clone (2), +.BR reboot (2), +.BR setns (2), +.BR unshare (2), +.BR proc (5), +.BR capabilities (7), +.BR credentials (7), +.BR mount_namespaces (7), +.BR namespaces (7), +.BR user_namespaces (7), +.BR switch_root (8) |