Diffstat (limited to 'man2/ioctl_userfaultfd.2')
-rw-r--r-- | man2/ioctl_userfaultfd.2 | 340
1 files changed, 253 insertions, 87 deletions
diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 index 6ab9c11..08e52e6 100644 --- a/man2/ioctl_userfaultfd.2 +++ b/man2/ioctl_userfaultfd.2 @@ -5,7 +5,7 @@ .\" SPDX-License-Identifier: Linux-man-pages-copyleft .\" .\" -.TH ioctl_userfaultfd 2 2023-05-03 "Linux man-pages 6.05.01" +.TH ioctl_userfaultfd 2 2024-03-03 "Linux man-pages 6.7" .SH NAME ioctl_userfaultfd \- create a file descriptor for handling page faults in user space @@ -16,8 +16,8 @@ Standard C library .nf .BR "#include <linux/userfaultfd.h>" " /* Definition of " UFFD* " constants */" .B #include <sys/ioctl.h> -.PP -.BI "int ioctl(int " fd ", int " cmd ", ...);" +.P +.BI "int ioctl(int " fd ", int " op ", ...);" .fi .SH DESCRIPTION Various @@ -25,21 +25,22 @@ Various operations can be performed on a userfaultfd object (created by a call to .BR userfaultfd (2)) using calls of the form: -.PP +.P .in +4n .EX -ioctl(fd, cmd, argp); +ioctl(fd, op, argp); .EE .in +.P In the above, .I fd is a file descriptor referring to a userfaultfd object, -.I cmd -is one of the commands listed below, and +.I op +is one of the operations listed below, and .I argp is a pointer to a data structure that is specific to -.IR cmd . -.PP +.IR op . +.P The various .BR ioctl (2) operations are described below. @@ -62,13 +63,13 @@ events. .SS UFFDIO_API (Since Linux 4.3.) Enable operation of the userfaultfd and perform API handshake. -.PP +.P The .I argp argument is a pointer to a .I uffdio_api structure, defined as: -.PP +.P .in +4n .EX struct uffdio_api { @@ -78,11 +79,10 @@ struct uffdio_api { }; .EE .in -.PP +.P The .I api field denotes the API version requested by the application. -.PP The kernel verifies that it can support the requested API version, and sets the .I features @@ -91,7 +91,26 @@ and fields to bit masks representing all the available features and the generic .BR ioctl (2) operations available. 
-.PP +.P +Since Linux 4.11, +applications should use the +.I features +field to perform a two-step handshake. +First, +.B UFFDIO_API +is called with the +.I features +field set to zero. +The kernel responds by setting all supported feature bits. +.P +Applications which do not require any specific features +can begin using the userfaultfd immediately. +Applications which do need specific features +should call +.B UFFDIO_API +again with a subset of the reported feature bits set +to enable those features. +.P Before Linux 4.11, the .I features field must be initialized to zero before the call to @@ -100,26 +119,13 @@ and zero (i.e., no feature bits) is placed in the .I features field by the kernel upon return from .BR ioctl (2). -.PP -Starting from Linux 4.11, the -.I features -field can be used to ask whether particular features are supported -and explicitly enable userfaultfd features that are disabled by default. -The kernel always reports all the available features in the -.I features -field. -.PP -To enable userfaultfd features the application should set -a bit corresponding to each feature it wants to enable in the -.I features -field. -If the kernel supports all the requested features it will enable them. -Otherwise it will zero out the returned +.P +If the application sets unsupported feature bits, +the kernel will zero out the returned .I uffdio_api structure and return .BR EINVAL . -.\" FIXME add more details about feature negotiation and enablement -.PP +.P The following feature bits may be set: .TP .BR UFFD_FEATURE_EVENT_FORK " (since Linux 4.11)" @@ -198,6 +204,13 @@ If this feature bit is set, .I uffd_msg.pagefault.feat.ptid will be set to the faulted thread ID for each page-fault message. .TP +.BR UFFD_FEATURE_PAGEFAULT_FLAG_WP " (since Linux 5.10)" +If this feature bit is set, +userfaultfd supports write-protect faults +for anonymous memory. +(Note that shmem / hugetlbfs support +is indicated by a separate feature.) 
+.TP .BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)" If this feature bit is set, the kernel supports registering userfaultfd ranges @@ -215,7 +228,28 @@ will be set to the exact page-fault address that was reported by the hardware, and will not mask the offset within the page. Note that old Linux versions might indicate the exact address as well, even though the feature bit is not set. -.PP +.TP +.BR UFFD_FEATURE_WP_HUGETLBFS_SHMEM " (since Linux 5.19)" +If this feature bit is set, +userfaultfd supports write-protect faults +for hugetlbfs and shmem / tmpfs memory. +.TP +.BR UFFD_FEATURE_WP_UNPOPULATED " (since Linux 6.4)" +If this feature bit is set, +the kernel will handle anonymous memory the same way as file memory, +by allowing the user to write-protect unpopulated page table entries. +.TP +.BR UFFD_FEATURE_POISON " (since Linux 6.6)" +If this feature bit is set, +the kernel supports resolving faults with the +.B UFFDIO_POISON +ioctl. +.TP +.BR UFFD_FEATURE_WP_ASYNC " (since Linux 6.7)" +If this feature bit is set, +the write protection faults would be asynchronously resolved +by the kernel. +.P The returned .I ioctls field can contain the following bits: @@ -236,13 +270,21 @@ operation is supported. The .B UFFDIO_UNREGISTER operation is supported. -.PP +.P This .BR ioctl (2) operation returns 0 on success. On error, \-1 is returned and .I errno is set to indicate the error. +If an error occurs, +the kernel may zero the provided +.I uffdio_api +structure. +The caller should treat its contents as unspecified, +and reinitialize it before re-attempting another +.B UFFDIO_API +call. Possible errors include: .TP .B EFAULT @@ -251,38 +293,44 @@ refers to an address that is outside the calling process's accessible address space. .TP .B EINVAL -The userfaultfd has already been enabled by a previous -.B UFFDIO_API -operation. 
-.TP -.B EINVAL The API version requested in the .I api field is not supported by this kernel, or the .I features field passed to the kernel includes feature bits that are not supported by the current kernel version. -.\" FIXME In the above error case, the returned 'uffdio_api' structure is -.\" zeroed out. Why is this done? This should be explained in the manual page. -.\" -.\" Mike Rapoport: -.\" In my understanding the uffdio_api -.\" structure is zeroed to allow the caller -.\" to distinguish the reasons for -EINVAL. -.\" +.TP +.B EINVAL +A previous +.B UFFDIO_API +call already enabled one or more features for this userfaultfd. +Calling +.B UFFDIO_API +twice, +the first time with no features set, +is explicitly allowed +as per the two-step feature detection handshake. +.TP +.B EPERM +The +.B UFFD_FEATURE_EVENT_FORK +feature was enabled, +but the calling process doesn't have the +.B CAP_SYS_PTRACE +capability. .SS UFFDIO_REGISTER (Since Linux 4.3.) Register a memory address range with the userfaultfd object. -The pages in the range must be "compatible". +The pages in the range must be \[lq]compatible\[rq]. Please refer to the list of register modes below for the compatible memory backends for each mode. -.PP +.P The .I argp argument is a pointer to a .I uffdio_register structure, defined as: -.PP +.P .in +4n .EX struct uffdio_range { @@ -297,7 +345,7 @@ struct uffdio_register { }; .EE .in -.PP +.P The .I range field defines a memory range starting at @@ -305,7 +353,7 @@ field defines a memory range starting at and continuing for .I len bytes that should be handled by the userfaultfd. -.PP +.P The .I mode field defines the mode of operation desired for this memory region. @@ -330,7 +378,7 @@ Since Linux 5.13, only hugetlbfs ranges are compatible. Since Linux 5.14, compatibility with shmem ranges was added. -.PP +.P If the operation is successful, the kernel modifies the .I ioctls bit-mask field to indicate which @@ -351,6 +399,7 @@ operation is supported. 
.B 1 << _UFFDIO_WRITEPROTECT The .B UFFDIO_WRITEPROTECT +operation is supported. .TP .B 1 << _UFFDIO_ZEROPAGE The @@ -361,7 +410,12 @@ operation is supported. The .B UFFDIO_CONTINUE operation is supported. -.PP +.TP +.B 1 << _UFFDIO_POISON +The +.B UFFDIO_POISON +operation is supported. +.P This .BR ioctl (2) operation returns 0 on success. @@ -407,14 +461,15 @@ There as an incompatible mapping in the specified address range. .SS UFFDIO_UNREGISTER (Since Linux 4.3.) Unregister a memory address range from userfaultfd. -The pages in the range must be "compatible" (see the description of -.BR UFFDIO_REGISTER .) -.PP +The pages in the range must be \[lq]compatible\[rq] +(see the description of +.BR UFFDIO_REGISTER .) +.P The address range to unregister is specified in the .I uffdio_range structure pointed to by .IR argp . -.PP +.P This .BR ioctl (2) operation returns 0 on success. @@ -446,12 +501,15 @@ Atomically copy a continuous memory chunk into the userfault registered range and optionally wake up the blocked thread. The source and destination addresses and the number of bytes to copy are specified by the -.IR src ", " dst ", and " len +.IR src , +.IR dst , +and +.I len fields of the .I uffdio_copy structure pointed to by .IR argp : -.PP +.P .in +4n .EX struct uffdio_copy { @@ -463,7 +521,7 @@ struct uffdio_copy { }; .EE .in -.PP +.P The following value may be bitwise ORed in .I mode to change the behavior of the @@ -482,7 +540,7 @@ This is used only when both and .B UFFDIO_REGISTER_MODE_WP modes are enabled for the registered range. -.PP +.P The .I copy field is used by the kernel to return the number of bytes @@ -503,7 +561,7 @@ field is output-only; it is not read by the .B UFFDIO_COPY operation. -.PP +.P This .BR ioctl (2) operation returns 0 on success. @@ -560,14 +618,14 @@ operation. .SS UFFDIO_ZEROPAGE (Since Linux 4.3.) Zero out a memory range registered with userfaultfd. 
-.PP +.P The requested range is specified by the .I range field of the .I uffdio_zeropage structure pointed to by .IR argp : -.PP +.P .in +4n .EX struct uffdio_zeropage { @@ -577,7 +635,7 @@ struct uffdio_zeropage { }; .EE .in -.PP +.P The following value may be bitwise ORed in .I mode to change the behavior of the @@ -586,7 +644,7 @@ operation: .TP .B UFFDIO_ZEROPAGE_MODE_DONTWAKE Do not wake up the thread that waits for page-fault resolution. -.PP +.P The .I zeropage field is used by the kernel to return the number of bytes @@ -607,7 +665,7 @@ field is output-only; it is not read by the .B UFFDIO_ZEROPAGE operation. -.PP +.P This .BR ioctl (2) operation returns 0 on success. @@ -648,7 +706,7 @@ operation. (Since Linux 4.3.) Wake up the thread waiting for page-fault resolution on a specified memory address range. -.PP +.P The .B UFFDIO_WAKE operation is used in conjunction with @@ -668,13 +726,13 @@ and .B UFFDIO_ZEROPAGE operations in a batch and then explicitly wake up the faulting thread using .BR UFFDIO_WAKE . -.PP +.P The .I argp argument is a pointer to a .I uffdio_range structure (shown above) that specifies the address range. -.PP +.P This .BR ioctl (2) operation returns 0 on success. @@ -693,17 +751,18 @@ field of the structure was not a multiple of the system page size; or .I len was zero; or the specified range was otherwise invalid. -.SS UFFDIO_WRITEPROTECT (Since Linux 5.7) +.SS UFFDIO_WRITEPROTECT +(Since Linux 5.7.) Write-protect or write-unprotect a userfaultfd-registered memory range registered with mode .BR UFFDIO_REGISTER_MODE_WP . -.PP +.P The .I argp argument is a pointer to a .I uffdio_range structure as shown below: -.PP +.P .in +4n .EX struct uffdio_writeprotect { @@ -712,7 +771,7 @@ struct uffdio_writeprotect { }; .EE .in -.PP +.P There are two mode bits that are supported in this structure: .TP .B UFFDIO_WRITEPROTECT_MODE_WP @@ -729,7 +788,7 @@ page-fault resolution after the operation. 
This can be specified only if .B UFFDIO_WRITEPROTECT_MODE_WP is not specified. -.PP +.P This .BR ioctl (2) operation returns 0 on success. @@ -767,13 +826,13 @@ Encountered a generic fault during processing. Resolve a minor page fault by installing page table entries for existing pages in the page cache. -.PP +.P The .I argp argument is a pointer to a .I uffdio_continue structure as shown below: -.PP +.P .in +4n .EX struct uffdio_continue { @@ -784,7 +843,7 @@ struct uffdio_continue { }; .EE .in -.PP +.P The following value may be bitwise ORed in .I mode to change the behavior of the @@ -793,7 +852,7 @@ operation: .TP .B UFFDIO_CONTINUE_MODE_DONTWAKE Do not wake up the thread that waits for page-fault resolution. -.PP +.P The .I mapped field is used by the kernel @@ -812,7 +871,7 @@ field is output-only; it is not read by the .B UFFDIO_CONTINUE operation. -.PP +.P This .BR ioctl (2) operation returns 0 on success. @@ -832,6 +891,12 @@ does not equal the value that was specified in the .I range.len field. .TP +.B EEXIST +One or more pages were already mapped in the given range. +.TP +.B EFAULT +No existing page could be found in the page cache for the given range. +.TP .B EINVAL Either .I range.start @@ -846,9 +911,6 @@ An invalid bit was specified in the .I mode field. .TP -.B EEXIST -One or more pages were already mapped in the given range. -.TP .B ENOENT The faulting process has changed its virtual memory layout simultaneously with an outstanding @@ -858,14 +920,118 @@ operation. .B ENOMEM Allocating memory needed to setup the page table mappings failed. .TP -.B EFAULT -No existing page could be found in the page cache for the given range. -.TP .B ESRCH The faulting process has exited at the time of a .B UFFDIO_CONTINUE operation. .\" +.SS UFFDIO_POISON +(Since Linux 6.6.) +Mark an address range as "poisoned". +Future accesses to these addresses will raise a +.B SIGBUS +signal. 
+Unlike +.B MADV_HWPOISON +this works by installing page table entries, +rather than "really" poisoning the underlying physical pages. +This means it only affects this particular address space. +.P +The +.I argp +argument is a pointer to a +.I uffdio_poison +structure as shown below: +.P +.in +4n +.EX +struct uffdio_poison { + struct uffdio_range range; + /* Range to install poison PTE markers in */ + __u64 mode; /* Flags controlling the behavior of poison */ + __s64 updated; /* Number of bytes poisoned, or negated error */ +}; +.EE +.in +.P +The following value may be bitwise ORed in +.I mode +to change the behavior of the +.B UFFDIO_POISON +operation: +.TP +.B UFFDIO_POISON_MODE_DONTWAKE +Do not wake up the thread that waits for page-fault resolution. +.P +The +.I updated +field is used by the kernel +to return the number of bytes that were actually poisoned, +or an error in the same manner as +.BR UFFDIO_COPY . +If the value returned in the +.I updated +field doesn't match the value that was specified in +.IR range.len , +the operation fails with the error +.BR EAGAIN . +The +.I updated +field is output-only; +it is not read by the +.B UFFDIO_POISON +operation. +.P +This +.BR ioctl (2) +operation returns 0 on success. +In this case, +the entire area was poisoned. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EAGAIN +The number of bytes mapped +(i.e., the value returned in the +.I updated +field) +does not equal the value that was specified in the +.I range.len +field. +.TP +.B EINVAL +Either +.I range.start +or +.I range.len +was not a multiple of the system page size; or +.I range.len +was zero; or the range specified was invalid. +.TP +.B EINVAL +An invalid bit was specified in the +.I mode +field. +.TP +.B EEXIST +One or more pages were already mapped in the given range. 
+.TP +.B ENOENT +The faulting process has changed its virtual memory layout simultaneously with +an outstanding +.B UFFDIO_POISON +operation. +.TP +.B ENOMEM +Allocating memory for page table entries failed. +.TP +.B ESRCH +The faulting process has exited at the time of a +.B UFFDIO_POISON +operation. +.\" .SH RETURN VALUE See descriptions of the individual operations, above. .SH ERRORS @@ -901,6 +1067,6 @@ See .BR ioctl (2), .BR mmap (2), .BR userfaultfd (2) -.PP +.P .I Documentation/admin\-guide/mm/userfaultfd.rst in the Linux kernel source tree
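Illustrative note: the two-step UFFDIO_API handshake that this patch documents can be sketched in C roughly as follows. This is a minimal sketch that follows the wording of the new text only; it has not been checked against any particular kernel version, so whether the second call succeeds on a given kernel is not something it demonstrates. The helper name `uffd_handshake` and its parameters are hypothetical and are not part of any library. The descriptor is assumed to come from an earlier userfaultfd(2) call. The structure is reinitialized before the second call because the text says its contents should be treated as unspecified after an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper (not a library function): run the two-step
 * UFFDIO_API handshake described in the patched man page text.
 *
 *   uffd      - descriptor assumed to come from userfaultfd(2)
 *   wanted    - feature bits the caller wants enabled (0 for none)
 *   supported - out: feature bits reported by the first call
 *
 * Returns 0 on success, -1 with errno set on failure.
 */
int uffd_handshake(int uffd, uint64_t wanted, uint64_t *supported)
{
    struct uffdio_api api;

    assert(supported != NULL);

    /* Step 1: features == 0 asks the kernel to report what it supports. */
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;
    *supported = api.features;

    /* No specific features needed: per the text, the fd is usable now. */
    if (wanted == 0)
        return 0;

    /* Refuse unsupported bits locally instead of provoking EINVAL. */
    if ((wanted & ~api.features) != 0) {
        errno = EINVAL;
        return -1;
    }

    /* Step 2: start from a clean struct and request a subset. */
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = wanted;
    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;

    return 0;
}
```

A caller that needs, say, UFFD_FEATURE_THREAD_ID would pass that bit as `wanted` and check the return value. A caller with no feature requirements passes 0 and only inspects `supported`. Rejecting unsupported bits before the second call is a design choice of this sketch: it keeps the EINVAL returned by the kernel reserved for the cases listed in the ERRORS text above.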