1 files changed, 293 insertions, 0 deletions
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
new file mode 100644
index 000000000..1dc2d5f82
--- /dev/null
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -0,0 +1,293 @@
+.. _userfaultfd:
+
+===========
+Userfaultfd
+===========
+
+Objective
+=========
+
+Userfaults allow the implementation of on-demand paging from userland
+and more generally they allow userland to take control of various
+memory page faults, something otherwise only the kernel code could do.
+
+For example userfaults allows a proper and more optimal implementation
+of the ``PROT_NONE+SIGSEGV`` trick.
+
+Design
+======
+
+Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
+
+The ``userfaultfd`` (aside from registering and unregistering virtual
+memory ranges) provides two primary functionalities:
+
+1) ``read/POLLIN`` protocol to notify a userland thread of the faults
+   happening
+
+2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions
+   registered in the ``userfaultfd`` that allows userland to efficiently
+   resolve the userfaults it receives via 1) or to manage the virtual
+   memory in the background
+
+The real advantage of userfaults if compared to regular virtual memory
+management of mremap/mprotect is that the userfaults in all their
+operations never involve heavyweight structures like vmas (in fact the
+``userfaultfd`` runtime load never takes the mmap_lock for writing).
+
+Vmas are not suitable for page- (or hugepage) granular fault tracking
+when dealing with virtual address spaces that could span
+Terabytes. Too many vmas would be needed for that.
+
+The ``userfaultfd`` once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware about what is going on
+(well of course unless they later try to use the ``userfaultfd``
+themselves on the same region the manager is already tracking, which
+is a corner case that would currently return ``-EBUSY``).
+
+API
+===
+
+When first opened the ``userfaultfd`` must be enabled invoking the
+``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
+a later API version) which will specify the ``read/POLLIN`` protocol
+userland intends to speak on the ``UFFD`` and the ``uffdio_api.features``
+userland requires. The ``UFFDIO_API`` ioctl if successful (i.e. if the
+requested ``uffdio_api.api`` is spoken also by the running kernel and the
+requested features are going to be enabled) will return into
+``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64bit bitmasks of
+respectively all the available features of the read(2) protocol and
+the generic ioctl available.
+
+The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
+defines what memory types are supported by the ``userfaultfd`` and what
+events, except page fault notifications, may be generated.
+
+If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs
+virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in
+``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be
+set if the kernel supports registering ``userfaultfd`` ranges on shared
+memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``,
+``MAP_SHARED``, ``memfd_create``, etc).
+
+The userland application that wants to use ``userfaultfd`` with hugetlbfs
+or shared memory need to set the corresponding flag in
+``uffdio_api.features`` to enable those features.
+
+If the userland desires to receive notifications for events other than
+page faults, it has to verify that ``uffdio_api.features`` has appropriate
+``UFFD_FEATURE_EVENT_*`` bits set. These events are described in more
+detail below in `Non-cooperative userfaultfd`_ section.
+
+Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should
+be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to
+register a memory range in the ``userfaultfd`` by setting the
+uffdio_register structure accordingly. The ``uffdio_register.mode``
+bitmask will specify to the kernel which kind of faults to track for
+the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing
+pages). The ``UFFDIO_REGISTER`` ioctl will return the
+``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the ``uffdio_register.ioctls`` to manage the virtual
+address space in the background (to add or potentially also remove
+memory from the ``userfaultfd`` registered range). This means a userfault
+could be triggering just before userland maps in the background the
+user-faulted page.
+
+The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That
+atomically copies a page into the userfault registered range and wakes
+up the blocked userfaults
+(unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set).
+Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in
+guaranteeing that nothing can see an half copied page since it'll
+keep userfaulting until the copy has finished.
+
+Notes:
+
+- If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then
+  you must provide some kind of page in your thread after reading from
+  the uffd.  You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``.
+  The normal behavior of the OS automatically providing a zero page on
+  an annonymous mmaping is not in place.
+
+- None of the page-delivering ioctls default to the range that you
+  registered with.  You must fill in all fields for the appropriate
+  ioctl struct including the range.
+
+- You get the address of the access that triggered the missing page
+  event out of a struct uffd_msg that you read in the thread from the
+  uffd.  You can supply as many pages as you want with ``UFFDIO_COPY`` or
+  ``UFFDIO_ZEROPAGE``.  Keep in mind that unless you used DONTWAKE then
+  the first of any of those IOCTLs wakes up the faulting thread.
+
+- Be sure to test for all errors including
+  (``pollfd[0].revents & POLLERR``).  This can happen, e.g. when ranges
+  supplied were incorrect.
+
+Write Protect Notifications
+---------------------------
+
+This is equivalent to (but faster than) using mprotect and a SIGSEGV
+signal handler.
+
+Firstly you need to register a range with ``UFFDIO_REGISTER_MODE_WP``.
+Instead of using mprotect(2) you use
+``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
+while ``mode = UFFDIO_WRITEPROTECT_MODE_WP``
+in the struct passed in.  The range does not default to and does not
+have to be identical to the range you registered with.  You can write
+protect as many ranges as you like (inside the registered range).
+Then, in the thread reading from uffd the struct will have
+``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send
+``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
+again while ``pagefault.mode`` does not have ``UFFDIO_WRITEPROTECT_MODE_WP``
+set. This wakes up the thread which will continue to run with writes. This
+allows you to do the bookkeeping about the write in the uffd reading
+thread before the ioctl.
+
+If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and
+``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in
+which you supply a page and undo write protect.  Note that there is a
+difference between writes into a WP area and into a !WP area.  The
+former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
+``UFFD_PAGEFAULT_FLAG_WRITE``.  The latter did not fail on protection but
+you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
+used.
+
+QEMU/KVM
+========
+
+QEMU/KVM is using the ``userfaultfd`` syscall to implement postcopy live
+migration. Postcopy live migration is one form of memory
+externalization consisting of a virtual machine running with part or
+all of its memory residing on a different node in the cloud. The
+``userfaultfd`` abstraction is generic enough that not a single line of
+KVM kernel code had to be modified in order to add postcopy live
+migration to QEMU.
+
+Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work
+just fine in combination with userfaults. Userfaults trigger async
+page faults in the guest scheduler so those guest processes that
+aren't waiting for userfaults (i.e. network bound) can keep running in
+the guest vcpus.
+
+It is generally beneficial to run one pass of precopy live migration
+just before starting postcopy live migration, in order to avoid
+generating userfaults for readonly guest regions.
+
+The implementation of postcopy live migration currently uses one
+single bidirectional socket but in the future two different sockets
+will be used (to reduce the latency of the userfaults to the minimum
+possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``).
+
+The QEMU in the source node writes all pages that it knows are missing
+in the destination node, into the socket, and the migration thread of
+the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE``
+ioctls on the ``userfaultfd`` in order to map the received pages into the
+guest (``UFFDIO_ZEROCOPY`` is used if the source page was a zero page).
+
+A different postcopy thread in the destination node listens with
+poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is
+generated after a userfault triggers, the postcopy thread read() from
+the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the
+userfault was already resolved and waken by a ``UFFDIO_COPY|ZEROPAGE`` run
+by the parallel QEMU migration thread).
+
+After the QEMU postcopy thread (running in the destination node) gets
+the userfault address it writes the information about the missing page
+into the socket. The QEMU source node receives the information and
+roughly "seeks" to that page address and continues sending all
+remaining missing pages from that new page offset. Soon after that
+(just the time to flush the tcp_wmem queue through the network) the
+migration thread in the QEMU running in the destination node will
+receive the page that triggered the userfault and it'll map it as
+usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it
+was spontaneously sent by the source or if it was an urgent page
+requested through a userfault).
+
+By the time the userfaults start, the QEMU in the destination node
+doesn't need to keep any per-page state bitmap relative to the live
+migration around and a single per-page bitmap has to be maintained in
+the QEMU running in the source node to know which pages are still
+missing in the destination node. The bitmap in the source node is
+checked to find which missing pages to send in round robin and we seek
+over it when receiving incoming userfaults. After sending each page of
+course the bitmap is updated accordingly. It's also useful to avoid
+sending the same page twice (in case the userfault is read by the
+postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration
+thread).
+
+Non-cooperative userfaultfd
+===========================
+
+When the ``userfaultfd`` is monitored by an external manager, the manager
+must be able to track changes in the process virtual memory
+layout. Userfaultfd can notify the manager about such changes using
+the same read(2) protocol as for the page fault notifications. The
+manager has to explicitly enable these events by setting appropriate
+bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl:
+
+``UFFD_FEATURE_EVENT_FORK``
+	enable ``userfaultfd`` hooks for fork(). When this feature is
+	enabled, the ``userfaultfd`` context of the parent process is
+	duplicated into the newly created process. The manager
+	receives ``UFFD_EVENT_FORK`` with file descriptor of the new
+	``userfaultfd`` context in the ``uffd_msg.fork``.
+
+``UFFD_FEATURE_EVENT_REMAP``
+	enable notifications about mremap() calls. When the
+	non-cooperative process moves a virtual memory area to a
+	different location, the manager will receive
+	``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and
+	new addresses of the area and its original length.
+
+``UFFD_FEATURE_EVENT_REMOVE``
+	enable notifications about madvise(MADV_REMOVE) and
+	madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will
+	be generated upon these calls to madvise(). The ``uffd_msg.remove``
+	will contain start and end addresses of the removed area.
+
+``UFFD_FEATURE_EVENT_UNMAP``
+	enable notifications about memory unmapping. The manager will
+	get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and
+	end addresses of the unmapped area.
+
+Although the ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP``
+are pretty similar, they quite differ in the action expected from the
+``userfaultfd`` manager. In the former case, the virtual memory is
+removed, but the area is not, the area remains monitored by the
+``userfaultfd``, and if a page fault occurs in that area it will be
+delivered to the manager. The proper resolution for such page fault is
+to zeromap the faulting address. However, in the latter case, when an
+area is unmapped, either explicitly (with munmap() system call), or
+implicitly (e.g. during mremap()), the area is removed and in turn the
+``userfaultfd`` context for such area disappears too and the manager will
+not get further userland page faults from the removed area. Still, the
+notification is required in order to prevent manager from using
+``UFFDIO_COPY`` on the unmapped area.
+
+Unlike userland page faults which have to be synchronous and require
+explicit or implicit wakeup, all the events are delivered
+asynchronously and the non-cooperative process resumes execution as
+soon as manager executes read(). The ``userfaultfd`` manager should
+carefully synchronize calls to ``UFFDIO_COPY`` with the events
+processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will
+return ``-ENOSPC`` when the monitored process exits at the time of
+``UFFDIO_COPY``, and ``-ENOENT``, when the non-cooperative process has changed
+its virtual memory layout simultaneously with outstanding ``UFFDIO_COPY``
+operation.
+
+The current asynchronous model of the event delivery is optimal for
+single threaded non-cooperative ``userfaultfd`` manager implementations. A
+synchronous event delivery model can be added later as a new
+``userfaultfd`` feature to facilitate multithreading enhancements of the
+non cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to
+run in parallel to the event reception. Single threaded
+implementations should continue to use the current async event
+delivery model instead.