Diffstat (limited to 'man2/ioctl_userfaultfd.2')
-rw-r--r-- man2/ioctl_userfaultfd.2 340
1 files changed, 253 insertions, 87 deletions
diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
index 6ab9c11..08e52e6 100644
--- a/man2/ioctl_userfaultfd.2
+++ b/man2/ioctl_userfaultfd.2
@@ -5,7 +5,7 @@
.\" SPDX-License-Identifier: Linux-man-pages-copyleft
.\"
.\"
-.TH ioctl_userfaultfd 2 2023-05-03 "Linux man-pages 6.05.01"
+.TH ioctl_userfaultfd 2 2024-03-03 "Linux man-pages 6.7"
.SH NAME
ioctl_userfaultfd \- create a file descriptor for handling page faults in user
space
@@ -16,8 +16,8 @@ Standard C library
.nf
.BR "#include <linux/userfaultfd.h>" " /* Definition of " UFFD* " constants */"
.B #include <sys/ioctl.h>
-.PP
-.BI "int ioctl(int " fd ", int " cmd ", ...);"
+.P
+.BI "int ioctl(int " fd ", int " op ", ...);"
.fi
.SH DESCRIPTION
Various
@@ -25,21 +25,22 @@ Various
operations can be performed on a userfaultfd object (created by a call to
.BR userfaultfd (2))
using calls of the form:
-.PP
+.P
.in +4n
.EX
-ioctl(fd, cmd, argp);
+ioctl(fd, op, argp);
.EE
.in
+.P
In the above,
.I fd
is a file descriptor referring to a userfaultfd object,
-.I cmd
-is one of the commands listed below, and
+.I op
+is one of the operations listed below, and
.I argp
is a pointer to a data structure that is specific to
-.IR cmd .
-.PP
+.IR op .
+.P
The various
.BR ioctl (2)
operations are described below.
@@ -62,13 +63,13 @@ events.
.SS UFFDIO_API
(Since Linux 4.3.)
Enable operation of the userfaultfd and perform API handshake.
-.PP
+.P
The
.I argp
argument is a pointer to a
.I uffdio_api
structure, defined as:
-.PP
+.P
.in +4n
.EX
struct uffdio_api {
@@ -78,11 +79,10 @@ struct uffdio_api {
};
.EE
.in
-.PP
+.P
The
.I api
field denotes the API version requested by the application.
-.PP
The kernel verifies that it can support the requested API version,
and sets the
.I features
@@ -91,7 +91,26 @@ and
fields to bit masks representing all the available features and the generic
.BR ioctl (2)
operations available.
-.PP
+.P
+Since Linux 4.11,
+applications should use the
+.I features
+field to perform a two-step handshake.
+First,
+.B UFFDIO_API
+is called with the
+.I features
+field set to zero.
+The kernel responds by setting all supported feature bits.
+.P
+Applications which do not require any specific features
+can begin using the userfaultfd immediately.
+Applications which do need specific features
+should call
+.B UFFDIO_API
+again with a subset of the reported feature bits set
+to enable those features.
+.P
Before Linux 4.11, the
.I features
field must be initialized to zero before the call to
@@ -100,26 +119,13 @@ and zero (i.e., no feature bits) is placed in the
.I features
field by the kernel upon return from
.BR ioctl (2).
-.PP
-Starting from Linux 4.11, the
-.I features
-field can be used to ask whether particular features are supported
-and explicitly enable userfaultfd features that are disabled by default.
-The kernel always reports all the available features in the
-.I features
-field.
-.PP
-To enable userfaultfd features the application should set
-a bit corresponding to each feature it wants to enable in the
-.I features
-field.
-If the kernel supports all the requested features it will enable them.
-Otherwise it will zero out the returned
+.P
+If the application sets unsupported feature bits,
+the kernel will zero out the returned
.I uffdio_api
structure and return
.BR EINVAL .
-.\" FIXME add more details about feature negotiation and enablement
-.PP
+.P
The following feature bits may be set:
.TP
.BR UFFD_FEATURE_EVENT_FORK " (since Linux 4.11)"
@@ -198,6 +204,13 @@ If this feature bit is set,
.I uffd_msg.pagefault.feat.ptid
will be set to the faulted thread ID for each page-fault message.
.TP
+.BR UFFD_FEATURE_PAGEFAULT_FLAG_WP " (since Linux 5.10)"
+If this feature bit is set,
+userfaultfd supports write-protect faults
+for anonymous memory.
+(Note that shmem / hugetlbfs support
+is indicated by a separate feature.)
+.TP
.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)"
If this feature bit is set,
the kernel supports registering userfaultfd ranges
@@ -215,7 +228,28 @@ will be set to the exact page-fault address that was reported by the hardware,
and will not mask the offset within the page.
Note that old Linux versions might indicate the exact address as well,
even though the feature bit is not set.
-.PP
+.TP
+.BR UFFD_FEATURE_WP_HUGETLBFS_SHMEM " (since Linux 5.19)"
+If this feature bit is set,
+userfaultfd supports write-protect faults
+for hugetlbfs and shmem / tmpfs memory.
+.TP
+.BR UFFD_FEATURE_WP_UNPOPULATED " (since Linux 6.4)"
+If this feature bit is set,
+the kernel will handle anonymous memory the same way as file memory,
+by allowing the user to write-protect unpopulated page table entries.
+.TP
+.BR UFFD_FEATURE_POISON " (since Linux 6.6)"
+If this feature bit is set,
+the kernel supports resolving faults with the
+.B UFFDIO_POISON
+ioctl.
+.TP
+.BR UFFD_FEATURE_WP_ASYNC " (since Linux 6.7)"
+If this feature bit is set,
+write-protection faults are resolved asynchronously
+by the kernel.
+.P
The returned
.I ioctls
field can contain the following bits:
@@ -236,13 +270,21 @@ operation is supported.
The
.B UFFDIO_UNREGISTER
operation is supported.
-.PP
+.P
This
.BR ioctl (2)
operation returns 0 on success.
On error, \-1 is returned and
.I errno
is set to indicate the error.
+If an error occurs,
+the kernel may zero the provided
+.I uffdio_api
+structure.
+The caller should treat its contents as unspecified,
+and reinitialize it before attempting another
+.B UFFDIO_API
+call.
Possible errors include:
.TP
.B EFAULT
@@ -251,38 +293,44 @@ refers to an address that is outside the calling process's
accessible address space.
.TP
.B EINVAL
-The userfaultfd has already been enabled by a previous
-.B UFFDIO_API
-operation.
-.TP
-.B EINVAL
The API version requested in the
.I api
field is not supported by this kernel, or the
.I features
field passed to the kernel includes feature bits that are not supported
by the current kernel version.
-.\" FIXME In the above error case, the returned 'uffdio_api' structure is
-.\" zeroed out. Why is this done? This should be explained in the manual page.
-.\"
-.\" Mike Rapoport:
-.\" In my understanding the uffdio_api
-.\" structure is zeroed to allow the caller
-.\" to distinguish the reasons for -EINVAL.
-.\"
+.TP
+.B EINVAL
+A previous
+.B UFFDIO_API
+call already enabled one or more features for this userfaultfd.
+Calling
+.B UFFDIO_API
+twice,
+the first time with no features set,
+is explicitly allowed
+as per the two-step feature detection handshake.
+.TP
+.B EPERM
+The
+.B UFFD_FEATURE_EVENT_FORK
+feature was enabled,
+but the calling process doesn't have the
+.B CAP_SYS_PTRACE
+capability.
.SS UFFDIO_REGISTER
(Since Linux 4.3.)
Register a memory address range with the userfaultfd object.
-The pages in the range must be "compatible".
+The pages in the range must be \[lq]compatible\[rq].
Please refer to the list of register modes below
for the compatible memory backends for each mode.
-.PP
+.P
The
.I argp
argument is a pointer to a
.I uffdio_register
structure, defined as:
-.PP
+.P
.in +4n
.EX
struct uffdio_range {
@@ -297,7 +345,7 @@ struct uffdio_register {
};
.EE
.in
-.PP
+.P
The
.I range
field defines a memory range starting at
@@ -305,7 +353,7 @@ field defines a memory range starting at
and continuing for
.I len
bytes that should be handled by the userfaultfd.
-.PP
+.P
The
.I mode
field defines the mode of operation desired for this memory region.
@@ -330,7 +378,7 @@ Since Linux 5.13,
only hugetlbfs ranges are compatible.
Since Linux 5.14,
compatibility with shmem ranges was added.
-.PP
+.P
If the operation is successful, the kernel modifies the
.I ioctls
bit-mask field to indicate which
@@ -351,6 +399,7 @@ operation is supported.
.B 1 << _UFFDIO_WRITEPROTECT
The
.B UFFDIO_WRITEPROTECT
+operation is supported.
.TP
.B 1 << _UFFDIO_ZEROPAGE
The
@@ -361,7 +410,12 @@ operation is supported.
The
.B UFFDIO_CONTINUE
operation is supported.
-.PP
+.TP
+.B 1 << _UFFDIO_POISON
+The
+.B UFFDIO_POISON
+operation is supported.
+.P
This
.BR ioctl (2)
operation returns 0 on success.
@@ -407,14 +461,15 @@ There as an incompatible mapping in the specified address range.
.SS UFFDIO_UNREGISTER
(Since Linux 4.3.)
Unregister a memory address range from userfaultfd.
-The pages in the range must be "compatible" (see the description of
-.BR UFFDIO_REGISTER .)
-.PP
+The pages in the range must be \[lq]compatible\[rq]
+(see the description of
+.BR UFFDIO_REGISTER ).
+.P
The address range to unregister is specified in the
.I uffdio_range
structure pointed to by
.IR argp .
-.PP
+.P
This
.BR ioctl (2)
operation returns 0 on success.
@@ -446,12 +501,15 @@ Atomically copy a continuous memory chunk into the userfault registered
range and optionally wake up the blocked thread.
The source and destination addresses and the number of bytes to copy are
specified by the
-.IR src ", " dst ", and " len
+.IR src ,
+.IR dst ,
+and
+.I len
fields of the
.I uffdio_copy
structure pointed to by
.IR argp :
-.PP
+.P
.in +4n
.EX
struct uffdio_copy {
@@ -463,7 +521,7 @@ struct uffdio_copy {
};
.EE
.in
-.PP
+.P
The following value may be bitwise ORed in
.I mode
to change the behavior of the
@@ -482,7 +540,7 @@ This is used only when both
and
.B UFFDIO_REGISTER_MODE_WP
modes are enabled for the registered range.
-.PP
+.P
The
.I copy
field is used by the kernel to return the number of bytes
@@ -503,7 +561,7 @@ field is output-only;
it is not read by the
.B UFFDIO_COPY
operation.
-.PP
+.P
This
.BR ioctl (2)
operation returns 0 on success.
@@ -560,14 +618,14 @@ operation.
.SS UFFDIO_ZEROPAGE
(Since Linux 4.3.)
Zero out a memory range registered with userfaultfd.
-.PP
+.P
The requested range is specified by the
.I range
field of the
.I uffdio_zeropage
structure pointed to by
.IR argp :
-.PP
+.P
.in +4n
.EX
struct uffdio_zeropage {
@@ -577,7 +635,7 @@ struct uffdio_zeropage {
};
.EE
.in
-.PP
+.P
The following value may be bitwise ORed in
.I mode
to change the behavior of the
@@ -586,7 +644,7 @@ operation:
.TP
.B UFFDIO_ZEROPAGE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
-.PP
+.P
The
.I zeropage
field is used by the kernel to return the number of bytes
@@ -607,7 +665,7 @@ field is output-only;
it is not read by the
.B UFFDIO_ZEROPAGE
operation.
-.PP
+.P
This
.BR ioctl (2)
operation returns 0 on success.
@@ -648,7 +706,7 @@ operation.
(Since Linux 4.3.)
Wake up the thread waiting for page-fault resolution on
a specified memory address range.
-.PP
+.P
The
.B UFFDIO_WAKE
operation is used in conjunction with
@@ -668,13 +726,13 @@ and
.B UFFDIO_ZEROPAGE
operations in a batch and then explicitly wake up the faulting thread using
.BR UFFDIO_WAKE .
-.PP
+.P
The
.I argp
argument is a pointer to a
.I uffdio_range
structure (shown above) that specifies the address range.
-.PP
+.P
This
.BR ioctl (2)
operation returns 0 on success.
@@ -693,17 +751,18 @@ field of the
structure was not a multiple of the system page size; or
.I len
was zero; or the specified range was otherwise invalid.
-.SS UFFDIO_WRITEPROTECT (Since Linux 5.7)
+.SS UFFDIO_WRITEPROTECT
+(Since Linux 5.7.)
Write-protect or write-unprotect a userfaultfd-registered memory range
registered with mode
.BR UFFDIO_REGISTER_MODE_WP .
-.PP
+.P
The
.I argp
argument is a pointer to a
.I uffdio_range
structure as shown below:
-.PP
+.P
.in +4n
.EX
struct uffdio_writeprotect {
@@ -712,7 +771,7 @@ struct uffdio_writeprotect {
};
.EE
.in
-.PP
+.P
There are two mode bits that are supported in this structure:
.TP
.B UFFDIO_WRITEPROTECT_MODE_WP
@@ -729,7 +788,7 @@ page-fault resolution after the operation.
This can be specified only if
.B UFFDIO_WRITEPROTECT_MODE_WP
is not specified.
-.PP
+.P
This
.BR ioctl (2)
operation returns 0 on success.
@@ -767,13 +826,13 @@ Encountered a generic fault during processing.
Resolve a minor page fault
by installing page table entries
for existing pages in the page cache.
-.PP
+.P
The
.I argp
argument is a pointer to a
.I uffdio_continue
structure as shown below:
-.PP
+.P
.in +4n
.EX
struct uffdio_continue {
@@ -784,7 +843,7 @@ struct uffdio_continue {
};
.EE
.in
-.PP
+.P
The following value may be bitwise ORed in
.I mode
to change the behavior of the
@@ -793,7 +852,7 @@ operation:
.TP
.B UFFDIO_CONTINUE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
-.PP
+.P
The
.I mapped
field is used by the kernel
@@ -812,7 +871,7 @@ field is output-only;
it is not read by the
.B UFFDIO_CONTINUE
operation.
-.PP
+.P
This
.BR ioctl (2)
operation returns 0 on success.
@@ -832,6 +891,12 @@ does not equal the value that was specified in the
.I range.len
field.
.TP
+.B EEXIST
+One or more pages were already mapped in the given range.
+.TP
+.B EFAULT
+No existing page could be found in the page cache for the given range.
+.TP
.B EINVAL
Either
.I range.start
@@ -846,9 +911,6 @@ An invalid bit was specified in the
.I mode
field.
.TP
-.B EEXIST
-One or more pages were already mapped in the given range.
-.TP
.B ENOENT
The faulting process has changed its virtual memory layout simultaneously with
an outstanding
@@ -858,14 +920,118 @@ operation.
.B ENOMEM
Allocating memory needed to setup the page table mappings failed.
.TP
-.B EFAULT
-No existing page could be found in the page cache for the given range.
-.TP
.B ESRCH
The faulting process has exited at the time of a
.B UFFDIO_CONTINUE
operation.
.\"
+.SS UFFDIO_POISON
+(Since Linux 6.6.)
+Mark an address range as \[lq]poisoned\[rq].
+Future accesses to these addresses will raise a
+.B SIGBUS
+signal.
+Unlike
+.BR MADV_HWPOISON ,
+this works by installing page table entries,
+rather than \[lq]really\[rq] poisoning the underlying physical pages.
+This means it only affects this particular address space.
+.P
+The
+.I argp
+argument is a pointer to a
+.I uffdio_poison
+structure as shown below:
+.P
+.in +4n
+.EX
+struct uffdio_poison {
+    struct uffdio_range range;
+                    /* Range to install poison PTE markers in */
+    __u64 mode;     /* Flags controlling the behavior of poison */
+    __s64 updated;  /* Number of bytes poisoned, or negated error */
+};
+.EE
+.in
+.P
+The following value may be bitwise ORed in
+.I mode
+to change the behavior of the
+.B UFFDIO_POISON
+operation:
+.TP
+.B UFFDIO_POISON_MODE_DONTWAKE
+Do not wake up the thread that waits for page-fault resolution.
+.P
+The
+.I updated
+field is used by the kernel
+to return the number of bytes that were actually poisoned,
+or an error in the same manner as
+.BR UFFDIO_COPY .
+If the value returned in the
+.I updated
+field doesn't match the value that was specified in
+.IR range.len ,
+the operation fails with the error
+.BR EAGAIN .
+The
+.I updated
+field is output-only;
+it is not read by the
+.B UFFDIO_POISON
+operation.
+.P
+This
+.BR ioctl (2)
+operation returns 0 on success.
+In this case,
+the entire area was poisoned.
+On error, \-1 is returned and
+.I errno
+is set to indicate the error.
+Possible errors include:
+.TP
+.B EAGAIN
+The number of bytes poisoned
+(i.e., the value returned in the
+.I updated
+field)
+does not equal the value that was specified in the
+.I range.len
+field.
+.TP
+.B EINVAL
+Either
+.I range.start
+or
+.I range.len
+was not a multiple of the system page size; or
+.I range.len
+was zero; or the range specified was invalid.
+.TP
+.B EINVAL
+An invalid bit was specified in the
+.I mode
+field.
+.TP
+.B EEXIST
+One or more pages were already mapped in the given range.
+.TP
+.B ENOENT
+The faulting process has changed its virtual memory layout simultaneously with
+an outstanding
+.B UFFDIO_POISON
+operation.
+.TP
+.B ENOMEM
+Allocating memory for page table entries failed.
+.TP
+.B ESRCH
+The faulting process has exited at the time of a
+.B UFFDIO_POISON
+operation.
+.\"
.SH RETURN VALUE
See descriptions of the individual operations, above.
.SH ERRORS
@@ -901,6 +1067,6 @@ See
.BR ioctl (2),
.BR mmap (2),
.BR userfaultfd (2)
-.PP
+.P
.I Documentation/admin\-guide/mm/userfaultfd.rst
in the Linux kernel source tree
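
Note on the UFFDIO_API hunk: the two-step handshake that the new text describes can be sketched in C roughly as follows. This is a minimal sketch, not a reference implementation. The helper name uffd_handshake() and its parameters are hypothetical, and it assumes fd was obtained from userfaultfd(2). It follows the sequence as the patch describes it (probe with features set to zero, then call again with a subset of the reported bits); whether a given kernel accepts the second call has not been verified here.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper (not part of any library).
 *
 * fd        - userfaultfd file descriptor (assumed to come from
 *             userfaultfd(2))
 * wanted    - feature bits the caller would like to enable
 * supported - out: feature bits reported by the kernel in step 1
 *
 * Returns 0 on success, or -1 with errno set.
 */
static int
uffd_handshake(int fd, uint64_t wanted, uint64_t *supported)
{
    struct uffdio_api api;

    assert(supported != NULL);

    /* Step 1: probe with no feature bits set. */
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    if (ioctl(fd, UFFDIO_API, &api) == -1)
        return -1;
    *supported = api.features;

    /* No specific features needed: the fd can be used as is. */
    if (wanted == 0)
        return 0;

    /*
     * Local shortcut (a choice made for this sketch): the text says
     * the kernel rejects unsupported bits with EINVAL, so report the
     * same error without issuing the second call.
     */
    if ((wanted & ~api.features) != 0) {
        errno = EINVAL;
        return -1;
    }

    /*
     * Step 2: reinitialize the structure, then request a subset of
     * the reported feature bits.
     */
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = wanted;
    return ioctl(fd, UFFDIO_API, &api);
}
```

The structure is zeroed before each call because, per the new text, its contents should be treated as unspecified after an error. A caller needing, for example, UFFD_FEATURE_EVENT_FORK would pass that bit as wanted and should be prepared for the EPERM case listed in the patch.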