diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 19:40:15 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 19:40:15 +0000 |
commit | 399644e47874bff147afb19c89228901ac39340e (patch) | |
tree | 1c4c0b733f4c16b5783b41bebb19194a9ef62ad1 /man2/ioctl_userfaultfd.2 | |
parent | Initial commit. (diff) | |
download | manpages-399644e47874bff147afb19c89228901ac39340e.tar.xz manpages-399644e47874bff147afb19c89228901ac39340e.zip |
Adding upstream version 6.05.01.upstream/6.05.01
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'man2/ioctl_userfaultfd.2')
-rw-r--r-- | man2/ioctl_userfaultfd.2 | 906 |
1 files changed, 906 insertions, 0 deletions
diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 new file mode 100644 index 0000000..6ab9c11 --- /dev/null +++ b/man2/ioctl_userfaultfd.2 @@ -0,0 +1,906 @@ +.\" Copyright (c) 2016, IBM Corporation. +.\" Written by Mike Rapoport <rppt@linux.vnet.ibm.com> +.\" and Copyright (C) 2016 Michael Kerrisk <mtk.manpages@gmail.com> +.\" +.\" SPDX-License-Identifier: Linux-man-pages-copyleft +.\" +.\" +.TH ioctl_userfaultfd 2 2023-05-03 "Linux man-pages 6.05.01" +.SH NAME +ioctl_userfaultfd \- create a file descriptor for handling page faults in user +space +.SH LIBRARY +Standard C library +.RI ( libc ", " \-lc ) +.SH SYNOPSIS +.nf +.BR "#include <linux/userfaultfd.h>" " /* Definition of " UFFD* " constants */" +.B #include <sys/ioctl.h> +.PP +.BI "int ioctl(int " fd ", int " cmd ", ...);" +.fi +.SH DESCRIPTION +Various +.BR ioctl (2) +operations can be performed on a userfaultfd object (created by a call to +.BR userfaultfd (2)) +using calls of the form: +.PP +.in +4n +.EX +ioctl(fd, cmd, argp); +.EE +.in +In the above, +.I fd +is a file descriptor referring to a userfaultfd object, +.I cmd +is one of the commands listed below, and +.I argp +is a pointer to a data structure that is specific to +.IR cmd . +.PP +The various +.BR ioctl (2) +operations are described below. +The +.BR UFFDIO_API , +.BR UFFDIO_REGISTER , +and +.B UFFDIO_UNREGISTER +operations are used to +.I configure +userfaultfd behavior. +These operations allow the caller to choose what features will be enabled and +what kinds of events will be delivered to the application. +The remaining operations are +.I range +operations. +These operations enable the calling application to resolve page-fault +events. +.\" +.SS UFFDIO_API +(Since Linux 4.3.) +Enable operation of the userfaultfd and perform API handshake. +.PP +The +.I argp +argument is a pointer to a +.I uffdio_api +structure, defined as: +.PP +.in +4n +.EX +struct uffdio_api { + __u64 api; /* Requested API version (input) */ + __u64 features; /* Requested features (input/output) */ + __u64 ioctls; /* Available ioctl() operations (output) */ +}; +.EE +.in +.PP +The +.I api +field denotes the API version requested by the application. +.PP +The kernel verifies that it can support the requested API version, +and sets the +.I features +and +.I ioctls +fields to bit masks representing all the available features and the generic +.BR ioctl (2) +operations available. +.PP +Before Linux 4.11, the +.I features +field must be initialized to zero before the call to +.BR UFFDIO_API , +and zero (i.e., no feature bits) is placed in the +.I features +field by the kernel upon return from +.BR ioctl (2). +.PP +Starting from Linux 4.11, the +.I features +field can be used to ask whether particular features are supported +and explicitly enable userfaultfd features that are disabled by default. +The kernel always reports all the available features in the +.I features +field. +.PP +To enable userfaultfd features the application should set +a bit corresponding to each feature it wants to enable in the +.I features +field. +If the kernel supports all the requested features it will enable them. +Otherwise it will zero out the returned +.I uffdio_api +structure and return +.BR EINVAL . +.\" FIXME add more details about feature negotiation and enablement +.PP +The following feature bits may be set: +.TP +.BR UFFD_FEATURE_EVENT_FORK " (since Linux 4.11)" +When this feature is enabled, +the userfaultfd objects associated with a parent process are duplicated +into the child process during +.BR fork (2) +and a +.B UFFD_EVENT_FORK +event is delivered to the userfaultfd monitor +.TP +.BR UFFD_FEATURE_EVENT_REMAP " (since Linux 4.11)" +If this feature is enabled, +when the faulting process invokes +.BR mremap (2), +the userfaultfd monitor will receive an event of type +.BR UFFD_EVENT_REMAP . +.TP +.BR UFFD_FEATURE_EVENT_REMOVE " (since Linux 4.11)" +If this feature is enabled, +when the faulting process calls +.BR madvise (2) +with the +.B MADV_DONTNEED +or +.B MADV_REMOVE +advice value to free a virtual memory area +the userfaultfd monitor will receive an event of type +.BR UFFD_EVENT_REMOVE . +.TP +.BR UFFD_FEATURE_EVENT_UNMAP " (since Linux 4.11)" +If this feature is enabled, +when the faulting process unmaps virtual memory either explicitly with +.BR munmap (2), +or implicitly during either +.BR mmap (2) +or +.BR mremap (2), +the userfaultfd monitor will receive an event of type +.BR UFFD_EVENT_UNMAP . +.TP +.BR UFFD_FEATURE_MISSING_HUGETLBFS " (since Linux 4.11)" +If this feature bit is set, +the kernel supports registering userfaultfd ranges on hugetlbfs +virtual memory areas +.TP +.BR UFFD_FEATURE_MISSING_SHMEM " (since Linux 4.11)" +If this feature bit is set, +the kernel supports registering userfaultfd ranges on shared memory areas. +This includes all kernel shared memory APIs: +System V shared memory, +.BR tmpfs (5), +shared mappings of +.IR /dev/zero , +.BR mmap (2) +with the +.B MAP_SHARED +flag set, +.BR memfd_create (2), +and so on. +.TP +.BR UFFD_FEATURE_SIGBUS " (since Linux 4.14)" +.\" commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e +If this feature bit is set, no page-fault events +.RB ( UFFD_EVENT_PAGEFAULT ) +will be delivered. +Instead, a +.B SIGBUS +signal will be sent to the faulting process. +Applications using this +feature will not require the use of a userfaultfd monitor for processing +memory accesses to the regions registered with userfaultfd. +.TP +.BR UFFD_FEATURE_THREAD_ID " (since Linux 4.14)" +If this feature bit is set, +.I uffd_msg.pagefault.feat.ptid +will be set to the faulted thread ID for each page-fault message. +.TP +.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)" +If this feature bit is set, +the kernel supports registering userfaultfd ranges +in minor mode on hugetlbfs-backed memory areas. +.TP +.BR UFFD_FEATURE_MINOR_SHMEM " (since Linux 5.14)" +If this feature bit is set, +the kernel supports registering userfaultfd ranges +in minor mode on shmem-backed memory areas. +.TP +.BR UFFD_FEATURE_EXACT_ADDRESS " (since Linux 5.18)" +If this feature bit is set, +.I uffd_msg.pagefault.address +will be set to the exact page-fault address that was reported by the hardware, +and will not mask the offset within the page. +Note that old Linux versions might indicate the exact address as well, +even though the feature bit is not set. +.PP +The returned +.I ioctls +field can contain the following bits: +.\" FIXME This user-space API seems not fully polished. Why are there +.\" not constants defined for each of the bit-mask values listed below? +.TP +.B 1 << _UFFDIO_API +The +.B UFFDIO_API +operation is supported. +.TP +.B 1 << _UFFDIO_REGISTER +The +.B UFFDIO_REGISTER +operation is supported. +.TP +.B 1 << _UFFDIO_UNREGISTER +The +.B UFFDIO_UNREGISTER +operation is supported. +.PP +This +.BR ioctl (2) +operation returns 0 on success. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EFAULT +.I argp +refers to an address that is outside the calling process's +accessible address space. +.TP +.B EINVAL +The userfaultfd has already been enabled by a previous +.B UFFDIO_API +operation. +.TP +.B EINVAL +The API version requested in the +.I api +field is not supported by this kernel, or the +.I features +field passed to the kernel includes feature bits that are not supported +by the current kernel version. +.\" FIXME In the above error case, the returned 'uffdio_api' structure is +.\" zeroed out. Why is this done? This should be explained in the manual page. +.\" +.\" Mike Rapoport: +.\" In my understanding the uffdio_api +.\" structure is zeroed to allow the caller +.\" to distinguish the reasons for -EINVAL. +.\" +.SS UFFDIO_REGISTER +(Since Linux 4.3.) +Register a memory address range with the userfaultfd object. +The pages in the range must be "compatible". +Please refer to the list of register modes below +for the compatible memory backends for each mode. +.PP +The +.I argp +argument is a pointer to a +.I uffdio_register +structure, defined as: +.PP +.in +4n +.EX +struct uffdio_range { + __u64 start; /* Start of range */ + __u64 len; /* Length of range (bytes) */ +}; +\& +struct uffdio_register { + struct uffdio_range range; + __u64 mode; /* Desired mode of operation (input) */ + __u64 ioctls; /* Available ioctl() operations (output) */ +}; +.EE +.in +.PP +The +.I range +field defines a memory range starting at +.I start +and continuing for +.I len +bytes that should be handled by the userfaultfd. +.PP +The +.I mode +field defines the mode of operation desired for this memory region. +The following values may be bitwise ORed to set the userfaultfd mode for +the specified range: +.TP +.B UFFDIO_REGISTER_MODE_MISSING +Track page faults on missing pages. +Since Linux 4.3, +only private anonymous ranges are compatible. +Since Linux 4.11, +hugetlbfs and shared memory ranges are also compatible. +.TP +.B UFFDIO_REGISTER_MODE_WP +Track page faults on write-protected pages. +Since Linux 5.7, +only private anonymous ranges are compatible. +.TP +.B UFFDIO_REGISTER_MODE_MINOR +Track minor page faults. +Since Linux 5.13, +only hugetlbfs ranges are compatible. +Since Linux 5.14, +compatibility with shmem ranges was added. +.PP +If the operation is successful, the kernel modifies the +.I ioctls +bit-mask field to indicate which +.BR ioctl (2) +operations are available for the specified range. +This returned bit mask can contain the following bits: +.TP +.B 1 << _UFFDIO_COPY +The +.B UFFDIO_COPY +operation is supported. +.TP +.B 1 << _UFFDIO_WAKE +The +.B UFFDIO_WAKE +operation is supported. +.TP +.B 1 << _UFFDIO_WRITEPROTECT +The +.B UFFDIO_WRITEPROTECT +.TP +.B 1 << _UFFDIO_ZEROPAGE +The +.B UFFDIO_ZEROPAGE +operation is supported. +.TP +.B 1 << _UFFDIO_CONTINUE +The +.B UFFDIO_CONTINUE +operation is supported. +.PP +This +.BR ioctl (2) +operation returns 0 on success. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.\" FIXME Is the following error list correct? +.\" +.TP +.B EBUSY +A mapping in the specified range is registered with another +userfaultfd object. +.TP +.B EFAULT +.I argp +refers to an address that is outside the calling process's +accessible address space. +.TP +.B EINVAL +An invalid or unsupported bit was specified in the +.I mode +field; or the +.I mode +field was zero. +.TP +.B EINVAL +There is no mapping in the specified address range. +.TP +.B EINVAL +.I range.start +or +.I range.len +is not a multiple of the system page size; or, +.I range.len +is zero; or these fields are otherwise invalid. +.TP +.B EINVAL +There as an incompatible mapping in the specified address range. +.\" Mike Rapoport: +.\" ENOMEM if the process is exiting and the +.\" mm_struct has gone by the time userfault grabs it. +.SS UFFDIO_UNREGISTER +(Since Linux 4.3.) +Unregister a memory address range from userfaultfd. +The pages in the range must be "compatible" (see the description of +.BR UFFDIO_REGISTER .) +.PP +The address range to unregister is specified in the +.I uffdio_range +structure pointed to by +.IR argp . +.PP +This +.BR ioctl (2) +operation returns 0 on success. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EINVAL +Either the +.I start +or the +.I len +field of the +.I ufdio_range +structure was not a multiple of the system page size; or the +.I len +field was zero; or these fields were otherwise invalid. +.TP +.B EINVAL +There as an incompatible mapping in the specified address range. +.TP +.B EINVAL +There was no mapping in the specified address range. +.\" +.SS UFFDIO_COPY +(Since Linux 4.3.) +Atomically copy a continuous memory chunk into the userfault registered +range and optionally wake up the blocked thread. +The source and destination addresses and the number of bytes to copy are +specified by the +.IR src ", " dst ", and " len +fields of the +.I uffdio_copy +structure pointed to by +.IR argp : +.PP +.in +4n +.EX +struct uffdio_copy { + __u64 dst; /* Destination of copy */ + __u64 src; /* Source of copy */ + __u64 len; /* Number of bytes to copy */ + __u64 mode; /* Flags controlling behavior of copy */ + __s64 copy; /* Number of bytes copied, or negated error */ +}; +.EE +.in +.PP +The following value may be bitwise ORed in +.I mode +to change the behavior of the +.B UFFDIO_COPY +operation: +.TP +.B UFFDIO_COPY_MODE_DONTWAKE +Do not wake up the thread that waits for page-fault resolution +.TP +.B UFFDIO_COPY_MODE_WP +Copy the page with read-only permission. +This allows the user to trap the next write to the page, +which will block and generate another write-protect userfault message. +This is used only when both +.B UFFDIO_REGISTER_MODE_MISSING +and +.B UFFDIO_REGISTER_MODE_WP +modes are enabled for the registered range. +.PP +The +.I copy +field is used by the kernel to return the number of bytes +that was actually copied, or an error (a negated +.IR errno -style +value). +.\" FIXME Above: Why is the 'copy' field used to return error values? +.\" This should be explained in the manual page. +If the value returned in +.I copy +doesn't match the value that was specified in +.IR len , +the operation fails with the error +.BR EAGAIN . +The +.I copy +field is output-only; +it is not read by the +.B UFFDIO_COPY +operation. +.PP +This +.BR ioctl (2) +operation returns 0 on success. +In this case, the entire area was copied. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EAGAIN +The number of bytes copied (i.e., the value returned in the +.I copy +field) +does not equal the value that was specified in the +.I len +field. +.TP +.B EINVAL +Either +.I dst +or +.I len +was not a multiple of the system page size, or the range specified by +.I src +and +.I len +or +.I dst +and +.I len +was invalid. +.TP +.B EINVAL +An invalid bit was specified in the +.I mode +field. +.TP +.BR ENOENT " (since Linux 4.11)" +The faulting process has changed +its virtual memory layout simultaneously with an outstanding +.B UFFDIO_COPY +operation. +.TP +.BR ENOSPC " (from Linux 4.11 until Linux 4.13)" +The faulting process has exited at the time of a +.B UFFDIO_COPY +operation. +.TP +.BR ESRCH " (since Linux 4.13)" +The faulting process has exited at the time of a +.B UFFDIO_COPY +operation. +.\" +.SS UFFDIO_ZEROPAGE +(Since Linux 4.3.) +Zero out a memory range registered with userfaultfd. +.PP +The requested range is specified by the +.I range +field of the +.I uffdio_zeropage +structure pointed to by +.IR argp : +.PP +.in +4n +.EX +struct uffdio_zeropage { + struct uffdio_range range; + __u64 mode; /* Flags controlling behavior of copy */ + __s64 zeropage; /* Number of bytes zeroed, or negated error */ +}; +.EE +.in +.PP +The following value may be bitwise ORed in +.I mode +to change the behavior of the +.B UFFDIO_ZEROPAGE +operation: +.TP +.B UFFDIO_ZEROPAGE_MODE_DONTWAKE +Do not wake up the thread that waits for page-fault resolution. +.PP +The +.I zeropage +field is used by the kernel to return the number of bytes +that was actually zeroed, +or an error in the same manner as +.BR UFFDIO_COPY . +.\" FIXME Why is the 'zeropage' field used to return error values? +.\" This should be explained in the manual page. +If the value returned in the +.I zeropage +field doesn't match the value that was specified in +.IR range.len , +the operation fails with the error +.BR EAGAIN . +The +.I zeropage +field is output-only; +it is not read by the +.B UFFDIO_ZEROPAGE +operation. +.PP +This +.BR ioctl (2) +operation returns 0 on success. +In this case, the entire area was zeroed. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EAGAIN +The number of bytes zeroed (i.e., the value returned in the +.I zeropage +field) +does not equal the value that was specified in the +.I range.len +field. +.TP +.B EINVAL +Either +.I range.start +or +.I range.len +was not a multiple of the system page size; or +.I range.len +was zero; or the range specified was invalid. +.TP +.B EINVAL +An invalid bit was specified in the +.I mode +field. +.TP +.BR ESRCH " (since Linux 4.13)" +The faulting process has exited at the time of a +.B UFFDIO_ZEROPAGE +operation. +.\" +.SS UFFDIO_WAKE +(Since Linux 4.3.) +Wake up the thread waiting for page-fault resolution on +a specified memory address range. +.PP +The +.B UFFDIO_WAKE +operation is used in conjunction with +.B UFFDIO_COPY +and +.B UFFDIO_ZEROPAGE +operations that have the +.B UFFDIO_COPY_MODE_DONTWAKE +or +.B UFFDIO_ZEROPAGE_MODE_DONTWAKE +bit set in the +.I mode +field. +The userfault monitor can perform several +.B UFFDIO_COPY +and +.B UFFDIO_ZEROPAGE +operations in a batch and then explicitly wake up the faulting thread using +.BR UFFDIO_WAKE . +.PP +The +.I argp +argument is a pointer to a +.I uffdio_range +structure (shown above) that specifies the address range. +.PP +This +.BR ioctl (2) +operation returns 0 on success. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EINVAL +The +.I start +or the +.I len +field of the +.I ufdio_range +structure was not a multiple of the system page size; or +.I len +was zero; or the specified range was otherwise invalid. +.SS UFFDIO_WRITEPROTECT (Since Linux 5.7) +Write-protect or write-unprotect a userfaultfd-registered memory range +registered with mode +.BR UFFDIO_REGISTER_MODE_WP . +.PP +The +.I argp +argument is a pointer to a +.I uffdio_range +structure as shown below: +.PP +.in +4n +.EX +struct uffdio_writeprotect { + struct uffdio_range range; /* Range to change write permission*/ + __u64 mode; /* Mode to change write permission */ +}; +.EE +.in +.PP +There are two mode bits that are supported in this structure: +.TP +.B UFFDIO_WRITEPROTECT_MODE_WP +When this mode bit is set, +the ioctl will be a write-protect operation upon the memory range specified by +.IR range . +Otherwise it will be a write-unprotect operation upon the specified range, +which can be used to resolve a userfaultfd write-protect page fault. +.TP +.B UFFDIO_WRITEPROTECT_MODE_DONTWAKE +When this mode bit is set, +do not wake up any thread that waits for +page-fault resolution after the operation. +This can be specified only if +.B UFFDIO_WRITEPROTECT_MODE_WP +is not specified. +.PP +This +.BR ioctl (2) +operation returns 0 on success. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EINVAL +The +.I start +or the +.I len +field of the +.I ufdio_range +structure was not a multiple of the system page size; or +.I len +was zero; or the specified range was otherwise invalid. +.TP +.B EAGAIN +The process was interrupted; retry this call. +.TP +.B ENOENT +The range specified in +.I range +is not valid. +For example, the virtual address does not exist, +or not registered with userfaultfd write-protect mode. +.TP +.B EFAULT +Encountered a generic fault during processing. +.\" +.SS UFFDIO_CONTINUE +(Since Linux 5.13.) +Resolve a minor page fault +by installing page table entries +for existing pages in the page cache. +.PP +The +.I argp +argument is a pointer to a +.I uffdio_continue +structure as shown below: +.PP +.in +4n +.EX +struct uffdio_continue { + struct uffdio_range range; + /* Range to install PTEs for and continue */ + __u64 mode; /* Flags controlling the behavior of continue */ + __s64 mapped; /* Number of bytes mapped, or negated error */ +}; +.EE +.in +.PP +The following value may be bitwise ORed in +.I mode +to change the behavior of the +.B UFFDIO_CONTINUE +operation: +.TP +.B UFFDIO_CONTINUE_MODE_DONTWAKE +Do not wake up the thread that waits for page-fault resolution. +.PP +The +.I mapped +field is used by the kernel +to return the number of bytes that were actually mapped, +or an error in the same manner as +.BR UFFDIO_COPY . +If the value returned in the +.I mapped +field doesn't match the value that was specified in +.IR range.len , +the operation fails with the error +.BR EAGAIN . +The +.I mapped +field is output-only; +it is not read by the +.B UFFDIO_CONTINUE +operation. +.PP +This +.BR ioctl (2) +operation returns 0 on success. +In this case, +the entire area was mapped. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EAGAIN +The number of bytes mapped +(i.e., the value returned in the +.I mapped +field) +does not equal the value that was specified in the +.I range.len +field. +.TP +.B EINVAL +Either +.I range.start +or +.I range.len +was not a multiple of the system page size; or +.I range.len +was zero; or the range specified was invalid. +.TP +.B EINVAL +An invalid bit was specified in the +.I mode +field. +.TP +.B EEXIST +One or more pages were already mapped in the given range. +.TP +.B ENOENT +The faulting process has changed its virtual memory layout simultaneously with +an outstanding +.B UFFDIO_CONTINUE +operation. +.TP +.B ENOMEM +Allocating memory needed to setup the page table mappings failed. +.TP +.B EFAULT +No existing page could be found in the page cache for the given range. +.TP +.B ESRCH +The faulting process has exited at the time of a +.B UFFDIO_CONTINUE +operation. +.\" +.SH RETURN VALUE +See descriptions of the individual operations, above. +.SH ERRORS +See descriptions of the individual operations, above. +In addition, the following general errors can occur for all of the +operations described above: +.TP +.B EFAULT +.I argp +does not point to a valid memory address. +.TP +.B EINVAL +(For all operations except +.BR UFFDIO_API .) +The userfaultfd object has not yet been enabled (via the +.B UFFDIO_API +operation). +.SH STANDARDS +Linux. +.SH BUGS +In order to detect available userfault features and +enable some subset of those features +the userfaultfd file descriptor must be closed after the first +.B UFFDIO_API +operation that queries features availability and reopened before +the second +.B UFFDIO_API +operation that actually enables the desired features. +.SH EXAMPLES +See +.BR userfaultfd (2). +.SH SEE ALSO +.BR ioctl (2), +.BR mmap (2), +.BR userfaultfd (2) +.PP +.I Documentation/admin\-guide/mm/userfaultfd.rst +in the Linux kernel source tree |