1 files changed, 388 insertions, 0 deletions
diff --git a/man/man7/pid_namespaces.7 b/man/man7/pid_namespaces.7
new file mode 100644
index 0000000..2ffbb15
--- /dev/null
+++ b/man/man7/pid_namespaces.7
@@ -0,0 +1,388 @@
+.\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com>
+.\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com>
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.\"
+.TH pid_namespaces 7 2024-05-02 "Linux man-pages (unreleased)"
+.SH NAME
+pid_namespaces \- overview of Linux PID namespaces
+.SH DESCRIPTION
+For an overview of namespaces, see
+.BR namespaces (7).
+.P
+PID namespaces isolate the process ID number space,
+meaning that processes in different PID namespaces can have the same PID.
+PID namespaces allow containers to provide functionality
+such as suspending/resuming the set of processes in the container and
+migrating the container to a new host
+while the processes inside the container maintain the same PIDs.
+.P
+PIDs in a new PID namespace start at 1,
+somewhat like a standalone system, and calls to
+.BR fork (2),
+.BR vfork (2),
+or
+.BR clone (2)
+will produce processes with PIDs that are unique within the namespace.
+.P
+Use of PID namespaces requires a kernel that is configured with the
+.B CONFIG_PID_NS
+option.
+.\"
+.\" ============================================================
+.\"
+.SS The namespace "init" process
+The first process created in a new namespace
+(i.e., the process created using
+.BR clone (2)
+with the
+.B CLONE_NEWPID
+flag, or the first child created by a process after a call to
+.BR unshare (2)
+using the
+.B CLONE_NEWPID
+flag) has the PID 1, and is the "init" process for the namespace (see
+.BR init (1)).
+This process becomes the parent of any child processes that are orphaned
+because a process that resides in this PID namespace terminated
+(see below for further details).
+.P
+If the "init" process of a PID namespace terminates,
+the kernel terminates all of the processes in the namespace via a
+.B SIGKILL
+signal.
+This behavior reflects the fact that the "init" process
+is essential for the correct operation of a PID namespace.
+In this case, a subsequent
+.BR fork (2)
+into this PID namespace fail with the error
+.BR ENOMEM ;
+it is not possible to create a new process in a PID namespace whose "init"
+process has terminated.
+Such scenarios can occur when, for example,
+a process uses an open file descriptor for a
+.IR /proc/ pid /ns/pid
+file corresponding to a process that was in a namespace to
+.BR setns (2)
+into that namespace after the "init" process has terminated.
+Another possible scenario can occur after a call to
+.BR unshare (2):
+if the first child subsequently created by a
+.BR fork (2)
+terminates, then subsequent calls to
+.BR fork (2)
+fail with
+.BR ENOMEM .
+.P
+Only signals for which the "init" process has established a signal handler
+can be sent to the "init" process by other members of the PID namespace.
+This restriction applies even to privileged processes,
+and prevents other members of the PID namespace from
+accidentally killing the "init" process.
+.P
+Likewise, a process in an ancestor namespace
+can\[em]subject to the usual permission checks described in
+.BR kill (2)\[em]send
+signals to the "init" process of a child PID namespace only
+if the "init" process has established a handler for that signal.
+(Within the handler, the
+.I siginfo_t
+.I si_pid
+field described in
+.BR sigaction (2)
+will be zero.)
+.B SIGKILL
+or
+.B SIGSTOP
+are treated exceptionally:
+these signals are forcibly delivered when sent from an ancestor PID namespace.
+Neither of these signals can be caught by the "init" process,
+and so will result in the usual actions associated with those signals
+(respectively, terminating and stopping the process).
+.P
+Starting with Linux 3.4, the
+.BR reboot (2)
+system call causes a signal to be sent to the namespace "init" process.
+See
+.BR reboot (2)
+for more details.
+.\"
+.\" ============================================================
+.\"
+.SS Nesting PID namespaces
+PID namespaces can be nested:
+each PID namespace has a parent,
+except for the initial ("root") PID namespace.
+The parent of a PID namespace is the PID namespace of the process that
+created the namespace using
+.BR clone (2)
+or
+.BR unshare (2).
+PID namespaces thus form a tree,
+with all namespaces ultimately tracing their ancestry to the root namespace.
+Since Linux 3.7,
+.\" commit f2302505775fd13ba93f034206f1e2a587017929
+.\" The kernel constant MAX_PID_NS_LEVEL
+the kernel limits the maximum nesting depth for PID namespaces to 32.
+.P
+A process is visible to other processes in its PID namespace,
+and to the processes in each direct ancestor PID namespace
+going back to the root PID namespace.
+In this context, "visible" means that one process
+can be the target of operations by another process using
+system calls that specify a process ID.
+Conversely, the processes in a child PID namespace can't see
+processes in the parent and further removed ancestor namespaces.
+More succinctly: a process can see (e.g., send signals with
+.BR kill (2),
+set nice values with
+.BR setpriority (2),
+etc.) only processes contained in its own PID namespace
+and in descendants of that namespace.
+.P
+A process has one process ID in each of the layers of the PID
+namespace hierarchy in which is visible,
+and walking back though each direct ancestor namespace
+through to the root PID namespace.
+System calls that operate on process IDs always
+operate using the process ID that is visible in the
+PID namespace of the caller.
+A call to
+.BR getpid (2)
+always returns the PID associated with the namespace in which
+the process was created.
+.P
+Some processes in a PID namespace may have parents
+that are outside of the namespace.
+For example, the parent of the initial process in the namespace
+(i.e., the
+.BR init (1)
+process with PID 1) is necessarily in another namespace.
+Likewise, the direct children of a process that uses
+.BR setns (2)
+to cause its children to join a PID namespace are in a different
+PID namespace from the caller of
+.BR setns (2).
+Calls to
+.BR getppid (2)
+for such processes return 0.
+.P
+While processes may freely descend into child PID namespaces
+(e.g., using
+.BR setns (2)
+with a PID namespace file descriptor),
+they may not move in the other direction.
+That is to say, processes may not enter any ancestor namespaces
+(parent, grandparent, etc.).
+Changing PID namespaces is a one-way operation.
+.P
+The
+.B NS_GET_PARENT
+.BR ioctl (2)
+operation can be used to discover the parental relationship
+between PID namespaces; see
+.BR ioctl_ns (2).
+.\"
+.\" ============================================================
+.\"
+.SS setns(2) and unshare(2) semantics
+Calls to
+.BR setns (2)
+that specify a PID namespace file descriptor
+and calls to
+.BR unshare (2)
+with the
+.B CLONE_NEWPID
+flag cause children subsequently created
+by the caller to be placed in a different PID namespace from the caller.
+(Since Linux 4.12, that PID namespace is shown via the
+.IR /proc/ pid /ns/pid_for_children
+file, as described in
+.BR namespaces (7).)
+These calls do not, however,
+change the PID namespace of the calling process,
+because doing so would change the caller's idea of its own PID
+(as reported by
+.BR getpid ()),
+which would break many applications and libraries.
+.P
+To put things another way:
+a process's PID namespace membership is determined when the process is created
+and cannot be changed thereafter.
+Among other things, this means that the parental relationship
+between processes mirrors the parental relationship between PID namespaces:
+the parent of a process is either in the same namespace
+or resides in the immediate parent PID namespace.
+.P
+A process may call
+.BR unshare (2)
+with the
+.B CLONE_NEWPID
+flag only once.
+After it has performed this operation, its
+.IR /proc/ pid /ns/pid_for_children
+symbolic link will be empty until the first child is created in the namespace.
+.\"
+.\" ============================================================
+.\"
+.SS Adoption of orphaned children
+When a child process becomes orphaned, it is reparented to the "init"
+process in the PID namespace of its parent
+(unless one of the nearer ancestors of the parent employed the
+.BR prctl (2)
+.B PR_SET_CHILD_SUBREAPER
+command to mark itself as the reaper of orphaned descendant processes).
+Note that because of the
+.BR setns (2)
+and
+.BR unshare (2)
+semantics described above, this may be the "init" process in the PID
+namespace that is the
+.I parent
+of the child's PID namespace,
+rather than the "init" process in the child's own PID namespace.
+.\" Furthermore, by definition, the parent of the "init" process
+.\" of a PID namespace resides in the parent PID namespace.
+.\"
+.\" ============================================================
+.\"
+.SS Compatibility of CLONE_NEWPID with other CLONE_* flags
+In current versions of Linux,
+.B CLONE_NEWPID
+can't be combined with
+.BR CLONE_THREAD .
+Threads are required to be in the same PID namespace such that
+the threads in a process can send signals to each other.
+Similarly, it must be possible to see all of the threads
+of a process in the
+.BR proc (5)
+filesystem.
+Additionally, if two threads were in different PID
+namespaces, the process ID of the process sending a signal
+could not be meaningfully encoded when a signal is sent
+(see the description of the
+.I siginfo_t
+type in
+.BR sigaction (2)).
+Since this is computed when a signal is enqueued,
+a signal queue shared by processes in multiple PID namespaces
+would defeat that.
+.P
+.\" Note these restrictions were all introduced in
+.\" 8382fcac1b813ad0a4e68a838fc7ae93fa39eda0
+.\" when CLONE_NEWPID|CLONE_VM was disallowed
+In earlier versions of Linux,
+.B CLONE_NEWPID
+was additionally disallowed (failing with the error
+.BR EINVAL )
+in combination with
+.B CLONE_SIGHAND
+.\" (restriction lifted in faf00da544045fdc1454f3b9e6d7f65c841de302)
+(before Linux 4.3) as well as
+.\" (restriction lifted in e79f525e99b04390ca4d2366309545a836c03bf1)
+.B CLONE_VM
+(before Linux 3.12).
+The changes that lifted these restrictions have also been ported to
+earlier stable kernels.
+.\"
+.\" ============================================================
+.\"
+.SS /proc and PID namespaces
+A
+.I /proc
+filesystem shows (in the
+.IR /proc/ pid
+directories) only processes visible in the PID namespace
+of the process that performed the mount, even if the
+.I /proc
+filesystem is viewed from processes in other namespaces.
+.P
+After creating a new PID namespace,
+it is useful for the child to change its root directory
+and mount a new procfs instance at
+.I /proc
+so that tools such as
+.BR ps (1)
+work correctly.
+If a new mount namespace is simultaneously created by including
+.B  CLONE_NEWNS
+in the
+.I flags
+argument of
+.BR clone (2)
+or
+.BR unshare (2),
+then it isn't necessary to change the root directory:
+a new procfs instance can be mounted directly over
+.IR /proc .
+.P
+From a shell, the command to mount
+.I /proc
+is:
+.P
+.in +4n
+.EX
+$ mount \-t proc proc /proc
+.EE
+.in
+.P
+Calling
+.BR readlink (2)
+on the path
+.I /proc/self
+yields the process ID of the caller in the PID namespace of the procfs mount
+(i.e., the PID namespace of the process that mounted the procfs).
+This can be useful for introspection purposes,
+when a process wants to discover its PID in other namespaces.
+.\"
+.\" ============================================================
+.\"
+.SS /proc files
+.TP
+.BR /proc/sys/kernel/ns_last_pid " (since Linux 3.3)"
+.\" commit b8f566b04d3cddd192cfd2418ae6d54ac6353792
+This file
+(which is virtualized per PID namespace)
+displays the last PID that was allocated in this PID namespace.
+When the next PID is allocated,
+the kernel will search for the lowest unallocated PID
+that is greater than this value,
+and when this file is subsequently read it will show that PID.
+.IP
+This file is writable by a process that has the
+.B CAP_SYS_ADMIN
+or (since Linux 5.9)
+.B CAP_CHECKPOINT_RESTORE
+capability inside the user namespace that owns the PID namespace.
+.\" This ability is necessary to support checkpoint restore in user-space
+This makes it possible to determine the PID that is allocated
+to the next process that is created inside this PID namespace.
+.\"
+.\" ============================================================
+.\"
+.SS Miscellaneous
+When a process ID is passed over a UNIX domain socket to a
+process in a different PID namespace (see the description of
+.B SCM_CREDENTIALS
+in
+.BR unix (7)),
+it is translated into the corresponding PID value in
+the receiving process's PID namespace.
+.SH STANDARDS
+Linux.
+.SH EXAMPLES
+See
+.BR user_namespaces (7).
+.SH SEE ALSO
+.BR clone (2),
+.BR reboot (2),
+.BR setns (2),
+.BR unshare (2),
+.BR proc (5),
+.BR capabilities (7),
+.BR credentials (7),
+.BR mount_namespaces (7),
+.BR namespaces (7),
+.BR user_namespaces (7),
+.BR switch_root (8)