From 3d08cd331c1adcf0d917392f7e527b3f00511748 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Fri, 24 May 2024 06:52:22 +0200 Subject: Merging upstream version 6.8. Signed-off-by: Daniel Baumann --- man/man7/namespaces.7 | 417 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 417 insertions(+) create mode 100644 man/man7/namespaces.7 (limited to 'man/man7/namespaces.7') diff --git a/man/man7/namespaces.7 b/man/man7/namespaces.7 new file mode 100644 index 0000000..c7a96aa --- /dev/null +++ b/man/man7/namespaces.7 @@ -0,0 +1,417 @@ +'\" t +.\" Copyright (c) 2013, 2016, 2017 by Michael Kerrisk +.\" and Copyright (c) 2012 by Eric W. Biederman +.\" +.\" SPDX-License-Identifier: Linux-man-pages-copyleft +.\" +.\" +.TH namespaces 7 2024-05-02 "Linux man-pages (unreleased)" +.SH NAME +namespaces \- overview of Linux namespaces +.SH DESCRIPTION +A namespace wraps a global system resource in an abstraction that +makes it appear to the processes within the namespace that they +have their own isolated instance of the global resource. +Changes to the global resource are visible to other processes +that are members of the namespace, but are invisible to other processes. +One use of namespaces is to implement containers. +.P +This page provides pointers to information on the various namespace types, +describes the associated +.I /proc +files, and summarizes the APIs for working with namespaces. +.\" +.SS Namespace types +The following table shows the namespace types available on Linux. +The second column of the table shows the flag value that is used to specify +the namespace type in various APIs. +The third column identifies the manual page that provides details +on the namespace type. +The last column is a summary of the resources that are isolated by +the namespace type. +.TS +lB lB lB lB +l1 lB1 l1 l. 
+Namespace Flag Page Isolates +Cgroup CLONE_NEWCGROUP \fBcgroup_namespaces\fP(7) T{ +Cgroup root directory +T} +IPC CLONE_NEWIPC \fBipc_namespaces\fP(7) T{ +System V IPC, +POSIX message queues +T} +Network CLONE_NEWNET \fBnetwork_namespaces\fP(7) T{ +Network devices, +stacks, ports, etc. +T} +Mount CLONE_NEWNS \fBmount_namespaces\fP(7) Mount points +PID CLONE_NEWPID \fBpid_namespaces\fP(7) Process IDs +Time CLONE_NEWTIME \fBtime_namespaces\fP(7) T{ +Boot and monotonic +clocks +T} +User CLONE_NEWUSER \fBuser_namespaces\fP(7) T{ +User and group IDs +T} +UTS CLONE_NEWUTS \fButs_namespaces\fP(7) T{ +Hostname and NIS +domain name +T} +.TE +.\" +.\" ==================== The namespaces API ==================== +.\" +.SS The namespaces API +As well as various +.I /proc +files described below, +the namespaces API includes the following system calls: +.TP +.BR clone (2) +The +.BR clone (2) +system call creates a new process. +If the +.I flags +argument of the call specifies one or more of the +.B CLONE_NEW* +flags listed above, then new namespaces are created for each flag, +and the child process is made a member of those namespaces. +(This system call also implements a number of features +unrelated to namespaces.) +.TP +.BR setns (2) +The +.BR setns (2) +system call allows the calling process to join an existing namespace. +The namespace to join is specified via a file descriptor that refers to +one of the +.IR /proc/ pid /ns +files described below. +.TP +.BR unshare (2) +The +.BR unshare (2) +system call moves the calling process to a new namespace. +If the +.I flags +argument of the call specifies one or more of the +.B CLONE_NEW* +flags listed above, then new namespaces are created for each flag, +and the calling process is made a member of those namespaces. +(This system call also implements a number of features +unrelated to namespaces.) +.TP +.BR ioctl (2) +Various +.BR ioctl (2) +operations can be used to discover information about namespaces. 
+These operations are described in +.BR ioctl_ns (2). +.P +Creation of new namespaces using +.BR clone (2) +and +.BR unshare (2) +in most cases requires the +.B CAP_SYS_ADMIN +capability, since, in the new namespace, +the creator will have the power to change global resources +that are visible to other processes that are subsequently created in, +or join the namespace. +User namespaces are the exception: since Linux 3.8, +no privilege is required to create a user namespace. +.\" +.\" ==================== The /proc/[pid]/ns/ directory ==================== +.\" +.SS The \fI/proc/\fPpid\fI/ns/\fP directory +Each process has a +.IR /proc/ pid /ns/ +.\" See commit 6b4e306aa3dc94a0545eb9279475b1ab6209a31f +subdirectory containing one entry for each namespace that +supports being manipulated by +.BR setns (2): +.P +.in +4n +.EX +$ \fBls \-l /proc/$$/ns | awk \[aq]{print $1, $9, $10, $11}\[aq]\fP +total 0 +lrwxrwxrwx. cgroup \-> cgroup:[4026531835] +lrwxrwxrwx. ipc \-> ipc:[4026531839] +lrwxrwxrwx. mnt \-> mnt:[4026531840] +lrwxrwxrwx. net \-> net:[4026531969] +lrwxrwxrwx. pid \-> pid:[4026531836] +lrwxrwxrwx. pid_for_children \-> pid:[4026531834] +lrwxrwxrwx. time \-> time:[4026531834] +lrwxrwxrwx. time_for_children \-> time:[4026531834] +lrwxrwxrwx. user \-> user:[4026531837] +lrwxrwxrwx. uts \-> uts:[4026531838] +.EE +.in +.P +Bind mounting (see +.BR mount (2)) +one of the files in this directory +to somewhere else in the filesystem keeps +the corresponding namespace of the process specified by +.I pid +alive even if all processes currently in the namespace terminate. +.P +Opening one of the files in this directory +(or a file that is bind mounted to one of these files) +returns a file handle for +the corresponding namespace of the process specified by +.IR pid . +As long as this file descriptor remains open, +the namespace will remain alive, +even if all processes in the namespace terminate. +The file descriptor can be passed to +.BR setns (2). 
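The behavior described above can be sketched as follows. This is a minimal illustration in Python rather than C, assuming a Linux system with /proc mounted. The helper name namespace_identity and the default choice of the uts entry are hypothetical, not part of any API documented on this page. The sketch opens a namespace file, reads the device ID and inode number through the descriptor, and closes it again; a real program would keep the descriptor open for as long as the namespace must stay alive, or pass it to setns(2).

```python
import os


def namespace_identity(pid="self", ns="uts"):
    """Return (st_dev, st_ino) for the namespace behind /proc/<pid>/ns/<ns>.

    Hypothetical helper for illustration only.
    """
    path = f"/proc/{pid}/ns/{ns}"
    # Opening the file follows the symbolic link and yields a descriptor
    # that refers to the namespace itself.
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        # While fd is open, the namespace cannot be torn down.
        st = os.fstat(fd)
        return st.st_dev, st.st_ino
    finally:
        os.close(fd)
```

Calling namespace_identity() with no arguments reports on the UTS namespace of the calling process; passing a numeric PID inspects another process, subject to the permission check described later for these files.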
+.P +In Linux 3.7 and earlier, these files were visible as hard links. +Since Linux 3.8, +.\" commit bf056bfa80596a5d14b26b17276a56a0dcb080e5 +they appear as symbolic links. +If two processes are in the same namespace, +then the device IDs and inode numbers of their +.IR /proc/ pid /ns/ xxx +symbolic links will be the same; an application can check this using the +.I stat.st_dev +.\" Eric Biederman: "I reserve the right for st_dev to be significant +.\" when comparing namespaces." +.\" https://lore.kernel.org/lkml/87poky5ca9.fsf@xmission.com/ +.\" Re: Documenting the ioctl interfaces to discover relationships... +.\" Date: Mon, 12 Dec 2016 11:30:38 +1300 +and +.I stat.st_ino +fields returned by +.BR stat (2). +The content of this symbolic link is a string containing +the namespace type and inode number as in the following example: +.P +.in +4n +.EX +$ \fBreadlink /proc/$$/ns/uts\fP +uts:[4026531838] +.EE +.in +.P +The symbolic links in this subdirectory are as follows: +.TP +.IR /proc/ pid /ns/cgroup " (since Linux 4.6)" +This file is a handle for the cgroup namespace of the process. +.TP +.IR /proc/ pid /ns/ipc " (since Linux 3.0)" +This file is a handle for the IPC namespace of the process. +.TP +.IR /proc/ pid /ns/mnt " (since Linux 3.8)" +.\" commit 8823c079ba7136dc1948d6f6dcb5f8022bde438e +This file is a handle for the mount namespace of the process. +.TP +.IR /proc/ pid /ns/net " (since Linux 3.0)" +This file is a handle for the network namespace of the process. +.TP +.IR /proc/ pid /ns/pid " (since Linux 3.8)" +.\" commit 57e8391d327609cbf12d843259c968b9e5c1838f +This file is a handle for the PID namespace of the process. +This handle is permanent for the lifetime of the process +(i.e., a process's PID namespace membership never changes). +.TP +.IR /proc/ pid /ns/pid_for_children " (since Linux 4.12)" +.\" commit eaa0d190bfe1ed891b814a52712dcd852554cb08 +This file is a handle for the PID namespace of +child processes created by this process. 
+This can change as a consequence of calls to +.BR unshare (2) +and +.BR setns (2) +(see +.BR pid_namespaces (7)), +so the file may differ from +.IR /proc/ pid /ns/pid . +The symbolic link gains a value only after the first child process +is created in the namespace. +(Beforehand, +.BR readlink (2) +of the symbolic link will return an empty buffer.) +.TP +.IR /proc/ pid /ns/time " (since Linux 5.6)" +This file is a handle for the time namespace of the process. +.TP +.IR /proc/ pid /ns/time_for_children " (since Linux 5.6)" +This file is a handle for the time namespace of +child processes created by this process. +This can change as a consequence of calls to +.BR unshare (2) +and +.BR setns (2) +(see +.BR time_namespaces (7)), +so the file may differ from +.IR /proc/ pid /ns/time . +.TP +.IR /proc/ pid /ns/user " (since Linux 3.8)" +.\" commit cde1975bc242f3e1072bde623ef378e547b73f91 +This file is a handle for the user namespace of the process. +.TP +.IR /proc/ pid /ns/uts " (since Linux 3.0)" +This file is a handle for the UTS namespace of the process. +.P +Permission to dereference or read +.RB ( readlink (2)) +these symbolic links is governed by a ptrace access mode +.B PTRACE_MODE_READ_FSCREDS +check; see +.BR ptrace (2). +.\" +.\" ==================== The /proc/sys/user directory ==================== +.\" +.SS The \fI/proc/sys/user\fP directory +The files in the +.I /proc/sys/user +directory (which is present since Linux 4.9) expose limits +on the number of namespaces of various types that can be created. +The files are as follows: +.TP +.I max_cgroup_namespaces +The value in this file defines a per-user limit on the number of +cgroup namespaces that may be created in the user namespace. +.TP +.I max_ipc_namespaces +The value in this file defines a per-user limit on the number of +ipc namespaces that may be created in the user namespace. 
+.TP +.I max_mnt_namespaces +The value in this file defines a per-user limit on the number of +mount namespaces that may be created in the user namespace. +.TP +.I max_net_namespaces +The value in this file defines a per-user limit on the number of +network namespaces that may be created in the user namespace. +.TP +.I max_pid_namespaces +The value in this file defines a per-user limit on the number of +PID namespaces that may be created in the user namespace. +.TP +.IR max_time_namespaces " (since Linux 5.7)" +.\" commit eeec26d5da8248ea4e240b8795bb4364213d3247 +The value in this file defines a per-user limit on the number of +time namespaces that may be created in the user namespace. +.TP +.I max_user_namespaces +The value in this file defines a per-user limit on the number of +user namespaces that may be created in the user namespace. +.TP +.I max_uts_namespaces +The value in this file defines a per-user limit on the number of +uts namespaces that may be created in the user namespace. +.P +Note the following details about these files: +.IP \[bu] 3 +The values in these files are modifiable by privileged processes. +.IP \[bu] +The values exposed by these files are the limits for the user namespace +in which the opening process resides. +.IP \[bu] +The limits are per-user. +Each user in the same user namespace +can create namespaces up to the defined limit. +.IP \[bu] +The limits apply to all users, including UID 0. +.IP \[bu] +These limits apply in addition to any other per-namespace +limits (such as those for PID and user namespaces) that may be enforced. +.IP \[bu] +Upon encountering these limits, +.BR clone (2) +and +.BR unshare (2) +fail with the error +.BR ENOSPC . +.IP \[bu] +For the initial user namespace, +the default value in each of these files is half the limit on the number +of threads that may be created +.RI ( /proc/sys/kernel/threads\-max ). +In all descendant user namespaces, the default value in each file is +.BR MAXINT . 
+.IP \[bu] +When a namespace is created, the object is also accounted +against ancestor namespaces. +More precisely: +.RS +.IP \[bu] 3 +Each user namespace has a creator UID. +.IP \[bu] +When a namespace is created, +it is accounted against the creator UIDs in each of the +ancestor user namespaces, +and the kernel ensures that the corresponding namespace limit +for the creator UID in the ancestor namespace is not exceeded. +.IP \[bu] +The aforementioned point ensures that creating a new user namespace +cannot be used as a means to escape the limits in force +in the current user namespace. +.RE +.\" +.SS Namespace lifetime +Absent any other factors, +a namespace is automatically torn down when the last process in +the namespace terminates or leaves the namespace. +However, there are a number of other factors that may pin +a namespace into existence even though it has no member processes. +These factors include the following: +.IP \[bu] 3 +An open file descriptor or a bind mount exists for the corresponding +.IR /proc/ pid /ns/* +file. +.IP \[bu] +The namespace is hierarchical (i.e., a PID or user namespace), +and has a child namespace. +.IP \[bu] +It is a user namespace that owns one or more nonuser namespaces. +.IP \[bu] +It is a PID namespace, +and there is a process that refers to the namespace via a +.IR /proc/ pid /ns/pid_for_children +symbolic link. +.IP \[bu] +It is a time namespace, +and there is a process that refers to the namespace via a +.IR /proc/ pid /ns/time_for_children +symbolic link. +.IP \[bu] +It is an IPC namespace, and a corresponding mount of an +.I mqueue +filesystem (see +.BR mq_overview (7)) +refers to this namespace. +.IP \[bu] +It is a PID namespace, and a corresponding mount of a +.BR proc (5) +filesystem refers to this namespace. +.SH EXAMPLES +See +.BR clone (2) +and +.BR user_namespaces (7). 
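As a further illustration of the /proc/pid/ns/ links described earlier, the following sketch parses the link contents and compares two namespaces using the device ID and inode number. It is written in Python for brevity and assumes a Linux system with /proc mounted; the function names parse_ns_link and same_namespace are hypothetical and chosen only for this example.

```python
import os
import re

# Link contents have the form "type:[inode]", for example "uts:[4026531838]".
_NS_LINK = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split the contents of a namespace link into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def same_namespace(path_a, path_b):
    """Return True if both paths refer to the same namespace.

    Both st_dev and st_ino are compared, following the text on the
    /proc/<pid>/ns/ directory.
    """
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

For example, parse_ns_link(os.readlink("/proc/self/ns/uts")) splits the link of the calling process into its type and inode number, and same_namespace("/proc/self/ns/net", "/proc/1/ns/net") reports whether the caller shares a network namespace with PID 1, provided the caller is permitted to dereference that link.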
+.SH SEE ALSO +.BR nsenter (1), +.BR readlink (1), +.BR unshare (1), +.BR clone (2), +.BR ioctl_ns (2), +.BR setns (2), +.BR unshare (2), +.BR proc (5), +.BR capabilities (7), +.BR cgroup_namespaces (7), +.BR cgroups (7), +.BR credentials (7), +.BR ipc_namespaces (7), +.BR network_namespaces (7), +.BR pid_namespaces (7), +.BR user_namespaces (7), +.BR uts_namespaces (7), +.BR lsns (8), +.BR switch_root (8) -- cgit v1.2.3