1 files changed, 302 insertions, 0 deletions
diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
new file mode 100644
index 0000000000..d90796adc2
--- /dev/null
+++ b/Documentation/arch/x86/sgx.rst
@@ -0,0 +1,302 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Software Guard eXtensions (SGX)
+===============================
+
+Overview
+========
+
+Software Guard eXtensions (SGX) hardware enables for user space applications
+to set aside private memory regions of code and data:
+
+* Privileged (ring-0) ENCLS functions orchestrate the construction of the
+  regions.
+* Unprivileged (ring-3) ENCLU functions allow an application to enter and
+  execute inside the regions.
+
+These memory regions are called enclaves. An enclave can be only entered at a
+fixed set of entry points. Each entry point can hold a single hardware thread
+at a time.  While the enclave is loaded from a regular binary file by using
+ENCLS functions, only the threads inside the enclave can access its memory. The
+region is denied from outside access by the CPU, and encrypted before it leaves
+from LLC.
+
+The support can be determined by
+
+	``grep sgx /proc/cpuinfo``
+
+SGX must both be supported in the processor and enabled by the BIOS.  If SGX
+appears to be unsupported on a system which has hardware support, ensure
+support is enabled in the BIOS.  If a BIOS presents a choice between "Enabled"
+and "Software Enabled" modes for SGX, choose "Enabled".
+
+Enclave Page Cache
+==================
+
+SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
+with an enclave. It is contained in a BIOS-reserved region of physical memory.
+Unlike pages used for regular memory, pages can only be accessed from outside of
+the enclave during enclave construction with special, limited SGX instructions.
+
+Only a CPU executing inside an enclave can directly access enclave memory.
+However, a CPU executing inside an enclave may access normal memory outside the
+enclave.
+
+The kernel manages enclave memory similar to how it treats device memory.
+
+Enclave Page Types
+------------------
+
+**SGX Enclave Control Structure (SECS)**
+   Enclave's address range, attributes and other global data are defined
+   by this structure.
+
+**Regular (REG)**
+   Regular EPC pages contain the code and data of an enclave.
+
+**Thread Control Structure (TCS)**
+   Thread Control Structure pages define the entry points to an enclave and
+   track the execution state of an enclave thread.
+
+**Version Array (VA)**
+   Version Array pages contain 512 slots, each of which can contain a version
+   number for a page evicted from the EPC.
+
+Enclave Page Cache Map
+----------------------
+
+The processor tracks EPC pages in a hardware metadata structure called the
+*Enclave Page Cache Map (EPCM)*.  The EPCM contains an entry for each EPC page
+which describes the owning enclave, access rights and page type among the other
+things.
+
+EPCM permissions are separate from the normal page tables.  This prevents the
+kernel from, for instance, allowing writes to data which an enclave wishes to
+remain read-only.  EPCM permissions may only impose additional restrictions on
+top of normal x86 page permissions.
+
+For all intents and purposes, the SGX architecture allows the processor to
+invalidate all EPCM entries at will.  This requires that software be prepared to
+handle an EPCM fault at any time.  In practice, this can happen on events like
+power transitions when the ephemeral key that encrypts enclave memory is lost.
+
+Application interface
+=====================
+
+Enclave build functions
+-----------------------
+
+In addition to the traditional compiler and linker build process, SGX has a
+separate enclave “build” process.  Enclaves must be built before they can be
+executed (entered). The first step in building an enclave is opening the
+**/dev/sgx_enclave** device.  Since enclave memory is protected from direct
+access, special privileged instructions are then used to copy data into enclave
+pages and establish enclave page permissions.
+
+.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
+   :functions: sgx_ioc_enclave_create
+               sgx_ioc_enclave_add_pages
+               sgx_ioc_enclave_init
+               sgx_ioc_enclave_provision
+
+Enclave runtime management
+--------------------------
+
+Systems supporting SGX2 additionally support changes to initialized
+enclaves: modifying enclave page permissions and type, and dynamically
+adding and removing of enclave pages. When an enclave accesses an address
+within its address range that does not have a backing page then a new
+regular page will be dynamically added to the enclave. The enclave is
+still required to run EACCEPT on the new page before it can be used.
+
+.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
+   :functions: sgx_ioc_enclave_restrict_permissions
+               sgx_ioc_enclave_modify_types
+               sgx_ioc_enclave_remove_pages
+
+Enclave vDSO
+------------
+
+Entering an enclave can only be done through SGX-specific EENTER and ERESUME
+functions, and is a non-trivial process.  Because of the complexity of
+transitioning to and from an enclave, enclaves typically utilize a library to
+handle the actual transitions.  This is roughly analogous to how glibc
+implementations are used by most applications to wrap system calls.
+
+Another crucial characteristic of enclaves is that they can generate exceptions
+as part of their normal operation that need to be handled in the enclave or are
+unique to SGX.
+
+Instead of the traditional signal mechanism to handle these exceptions, SGX
+can leverage special exception fixup provided by the vDSO.  The kernel-provided
+vDSO function wraps low-level transitions to/from the enclave like EENTER and
+ERESUME.  The vDSO function intercepts exceptions that would otherwise generate
+a signal and return the fault information directly to its caller.  This avoids
+the need to juggle signal handlers.
+
+.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
+   :functions: vdso_sgx_enter_enclave_t
+
+ksgxd
+=====
+
+SGX support includes a kernel thread called *ksgxd*.
+
+EPC sanitization
+----------------
+
+ksgxd is started when SGX initializes.  Enclave memory is typically ready
+for use when the processor powers on or resets.  However, if SGX has been in
+use since the reset, enclave pages may be in an inconsistent state.  This might
+occur after a crash and kexec() cycle, for instance.  At boot, ksgxd
+reinitializes all enclave pages so that they can be allocated and re-used.
+
+The sanitization is done by going through EPC address space and applying the
+EREMOVE function to each physical page. Some enclave pages like SECS pages have
+hardware dependencies on other pages which prevents EREMOVE from functioning.
+Executing two EREMOVE passes removes the dependencies.
+
+Page reclaimer
+--------------
+
+Similar to the core kswapd, ksgxd, is responsible for managing the
+overcommitment of enclave memory.  If the system runs out of enclave memory,
+*ksgxd* “swaps” enclave memory to normal memory.
+
+Launch Control
+==============
+
+SGX provides a launch control mechanism. After all enclave pages have been
+copied, kernel executes EINIT function, which initializes the enclave. Only after
+this the CPU can execute inside the enclave.
+
+EINIT function takes an RSA-3072 signature of the enclave measurement.  The function
+checks that the measurement is correct and signature is signed with the key
+hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs representing the
+SHA256 of a public key.
+
+Those MSRs can be configured by the BIOS to be either readable or writable.
+Linux supports only writable configuration in order to give full control to the
+kernel on launch control policy. Before calling EINIT function, the driver sets
+the MSRs to match the enclave's signing key.
+
+Encryption engines
+==================
+
+In order to conceal the enclave data while it is out of the CPU package, the
+memory controller has an encryption engine to transparently encrypt and decrypt
+enclave memory.
+
+In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
+encrypt pages leaving the CPU caches. MEE uses a n-ary Merkle tree with root in
+SRAM to maintain integrity of the encrypted data. This provides integrity and
+anti-replay protection but does not scale to large memory sizes because the time
+required to update the Merkle tree grows logarithmically in relation to the
+memory size.
+
+CPUs starting from Icelake use Total Memory Encryption (TME) in the place of
+MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
+means integrity and replay-attacks are not mitigated.  B, it includes
+additional changes to prevent cipher text from being returned and SW memory
+aliases from being created.
+
+DMA to enclave memory is blocked by range registers on both MEE and TME systems
+(SDM section 41.10).
+
+Usage Models
+============
+
+Shared Library
+--------------
+
+Sensitive data and the code that acts on it is partitioned from the application
+into a separate library. The library is then linked as a DSO which can be loaded
+into an enclave. The application can then make individual function calls into
+the enclave through special SGX instructions. A run-time within the enclave is
+configured to marshal function parameters into and out of the enclave and to
+call the correct library function.
+
+Application Container
+---------------------
+
+An application may be loaded into a container enclave which is specially
+configured with a library OS and run-time which permits the application to run.
+The enclave run-time and library OS work together to execute the application
+when a thread enters the enclave.
+
+Impact of Potential Kernel SGX Bugs
+===================================
+
+EPC leaks
+---------
+
+When EPC page leaks happen, a WARNING like this is shown in dmesg:
+
+"EREMOVE returned ... and an EPC page was leaked.  SGX may become unusable..."
+
+This is effectively a kernel use-after-free of an EPC page, and due
+to the way SGX works, the bug is detected at freeing. Rather than
+adding the page back to the pool of available EPC pages, the kernel
+intentionally leaks the page to avoid additional errors in the future.
+
+When this happens, the kernel will likely soon leak more EPC pages, and
+SGX will likely become unusable because the memory available to SGX is
+limited. However, while this may be fatal to SGX, the rest of the kernel
+is unlikely to be impacted and should continue to work.
+
+As a result, when this happens, user should stop running any new
+SGX workloads, (or just any new workloads), and migrate all valuable
+workloads. Although a machine reboot can recover all EPC memory, the bug
+should be reported to Linux developers.
+
+
+Virtual EPC
+===========
+
+The implementation has also a virtual EPC driver to support SGX enclaves
+in guests. Unlike the SGX driver, an EPC page allocated by the virtual
+EPC driver doesn't have a specific enclave associated with it. This is
+because KVM doesn't track how a guest uses EPC pages.
+
+As a result, the SGX core page reclaimer doesn't support reclaiming EPC
+pages allocated to KVM guests through the virtual EPC driver. If the
+user wants to deploy SGX applications both on the host and in guests
+on the same machine, the user should reserve enough EPC (by taking out
+total virtual EPC size of all SGX VMs from the physical EPC size) for
+host SGX applications so they can run with acceptable performance.
+
+Architectural behavior is to restore all EPC pages to an uninitialized
+state also after a guest reboot.  Because this state can be reached only
+through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
+provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
+on all pages in the virtual EPC.
+
+``EREMOVE`` can fail for three reasons.  Userspace must pay attention
+to expected failures and handle them as follows:
+
+1. Page removal will always fail when any thread is running in the
+   enclave to which the page belongs.  In this case the ioctl will
+   return ``EBUSY`` independent of whether it has successfully removed
+   some pages; userspace can avoid these failures by preventing execution
+   of any vcpu which maps the virtual EPC.
+
+2. Page removal will cause a general protection fault if two calls to
+   ``EREMOVE`` happen concurrently for pages that refer to the same
+   "SECS" metadata pages.  This can happen if there are concurrent
+   invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
+   file descriptor in the guest is closed at the same time as
+   ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
+   This can be avoided in userspace by serializing calls to the ioctl()
+   and to close(), but in general it should not be a problem.
+
+3. Finally, page removal will fail for SECS metadata pages which still
+   have child pages.  Child pages can be removed by executing
+   ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
+   mapped into the guest.  This means that the ioctl() must be called
+   twice: an initial set of calls to remove child pages and a subsequent
+   set of calls to remove SECS pages.  The second set of calls is only
+   required for those mappings that returned a nonzero value from the
+   first call.  It indicates a bug in the kernel or the userspace client
+   if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
+   a return code other than 0.