summaryrefslogtreecommitdiffstats
path: root/Documentation/arch
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/arch')
-rw-r--r--Documentation/arch/arm64/arm-acpi.rst2
-rw-r--r--Documentation/arch/arm64/perf.rst72
-rw-r--r--Documentation/arch/arm64/silicon-errata.rst12
-rw-r--r--Documentation/arch/riscv/hwprobe.rst122
-rw-r--r--Documentation/arch/x86/boot.rst2
-rw-r--r--Documentation/arch/x86/cpuinfo.rst89
-rw-r--r--Documentation/arch/x86/pti.rst10
-rw-r--r--Documentation/arch/x86/tdx.rst207
8 files changed, 467 insertions, 49 deletions
diff --git a/Documentation/arch/arm64/arm-acpi.rst b/Documentation/arch/arm64/arm-acpi.rst
index a46c34fa96..e59e4505d0 100644
--- a/Documentation/arch/arm64/arm-acpi.rst
+++ b/Documentation/arch/arm64/arm-acpi.rst
@@ -130,7 +130,7 @@ When an Arm system boots, it can either have DT information, ACPI tables,
or in some very unusual cases, both. If no command line parameters are used,
the kernel will try to use DT for device enumeration; if there is no DT
present, the kernel will try to use ACPI tables, but only if they are present.
-In neither is available, the kernel will not boot. If acpi=force is used
+If neither is available, the kernel will not boot. If acpi=force is used
on the command line, the kernel will attempt to use ACPI tables first, but
fall back to DT if there are no ACPI tables present. The basic idea is that
the kernel will not fail to boot unless it absolutely has no other choice.
diff --git a/Documentation/arch/arm64/perf.rst b/Documentation/arch/arm64/perf.rst
index 1f87b57c23..997fd716b8 100644
--- a/Documentation/arch/arm64/perf.rst
+++ b/Documentation/arch/arm64/perf.rst
@@ -164,3 +164,75 @@ and should be used to mask the upper bits as needed.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/arch/arm64/tests/user-events.c
.. _tools/lib/perf/tests/test-evsel.c:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/tests/test-evsel.c
+
+Event Counting Threshold
+==========================================
+
+Overview
+--------
+
+FEAT_PMUv3_TH (Armv8.8) permits a PMU counter to increment only on
+events whose count meets a specified threshold condition. For example if
+threshold_compare is set to 2 ('Greater than or equal'), and the
+threshold is set to 2, then the PMU counter will now only increment by
+when an event would have previously incremented the PMU counter by 2 or
+more on a single processor cycle.
+
+To increment by 1 after passing the threshold condition instead of the
+number of events on that cycle, add the 'threshold_count' option to the
+commandline.
+
+How-to
+------
+
+These are the parameters for controlling the feature:
+
+.. list-table::
+ :header-rows: 1
+
+ * - Parameter
+ - Description
+ * - threshold
+ - Value to threshold the event by. A value of 0 means that
+ thresholding is disabled and the other parameters have no effect.
+ * - threshold_compare
+ - | Comparison function to use, with the following values supported:
+ |
+ | 0: Not-equal
+ | 1: Equals
+ | 2: Greater-than-or-equal
+ | 3: Less-than
+ * - threshold_count
+ - If this is set, count by 1 after passing the threshold condition
+ instead of the value of the event on this cycle.
+
+The threshold, threshold_compare and threshold_count values can be
+provided per event, for example:
+
+.. code-block:: sh
+
+ perf stat -e stall_slot/threshold=2,threshold_compare=2/ \
+ -e dtlb_walk/threshold=10,threshold_compare=3,threshold_count/
+
+In this example the stall_slot event will count by 2 or more on every
+cycle where 2 or more stalls happen. And dtlb_walk will count by 1 on
+every cycle where the number of dtlb walks were less than 10.
+
+The maximum supported threshold value can be read from the caps of each
+PMU, for example:
+
+.. code-block:: sh
+
+ cat /sys/bus/event_source/devices/armv8_pmuv3/caps/threshold_max
+
+ 0x000000ff
+
+If a value higher than this is given, then opening the event will result
+in an error. The highest possible maximum is 4095, as the config field
+for threshold is limited to 12 bits, and the Perf tool will refuse to
+parse higher values.
+
+If the PMU doesn't support FEAT_PMUv3_TH, then threshold_max will read
+0, and attempting to set a threshold value will also result in an error.
+threshold_max will also read as 0 on aarch32 guests, even if the host
+is running on hardware with the feature.
diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index 29fd5213ee..45a7f4932f 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -119,6 +119,10 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A76 | #1463225 | ARM64_ERRATUM_1463225 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A76 | #1490853 | N/A |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A77 | #1491015 | N/A |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A77 | #1508412 | ARM64_ERRATUM_1508412 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A710 | #2119858 | ARM64_ERRATUM_2119858 |
@@ -129,6 +133,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A715 | #2645198 | ARM64_ERRATUM_2645198 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X1 | #1502854 | N/A |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-X2 | #2119858 | ARM64_ERRATUM_2119858 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-X2 | #2224489 | ARM64_ERRATUM_2224489 |
@@ -137,6 +143,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1349291 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-N1 | #1490853 | N/A |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1542419 | ARM64_ERRATUM_1542419 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N2 | #2139208 | ARM64_ERRATUM_2139208 |
@@ -145,6 +153,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N2 | #2253138 | ARM64_ERRATUM_2253138 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-V1 | #1619801 | N/A |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | MMU-500 | #841119,826419 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | MMU-600 | #1076982,1209401| N/A |
@@ -227,11 +237,9 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| Rockchip | RK3588 | #3588001 | ROCKCHIP_ERRATUM_3588001 |
+----------------+-----------------+-----------------+-----------------------------+
-
+----------------+-----------------+-----------------+-----------------------------+
| Fujitsu | A64FX | E#010001 | FUJITSU_ERRATUM_010001 |
+----------------+-----------------+-----------------+-----------------------------+
-
+----------------+-----------------+-----------------+-----------------------------+
| ASR | ASR8601 | #8601001 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
diff --git a/Documentation/arch/riscv/hwprobe.rst b/Documentation/arch/riscv/hwprobe.rst
index 7b2384de47..b2bcc9eed9 100644
--- a/Documentation/arch/riscv/hwprobe.rst
+++ b/Documentation/arch/riscv/hwprobe.rst
@@ -12,7 +12,7 @@ is defined in <asm/hwprobe.h>::
};
long sys_riscv_hwprobe(struct riscv_hwprobe *pairs, size_t pair_count,
- size_t cpu_count, cpu_set_t *cpus,
+ size_t cpusetsize, cpu_set_t *cpus,
unsigned int flags);
The arguments are split into three groups: an array of key-value pairs, a CPU
@@ -20,12 +20,26 @@ set, and some flags. The key-value pairs are supplied with a count. Userspace
must prepopulate the key field for each element, and the kernel will fill in the
value if the key is recognized. If a key is unknown to the kernel, its key field
will be cleared to -1, and its value set to 0. The CPU set is defined by
-CPU_SET(3). For value-like keys (eg. vendor/arch/impl), the returned value will
-be only be valid if all CPUs in the given set have the same value. Otherwise -1
-will be returned. For boolean-like keys, the value returned will be a logical
-AND of the values for the specified CPUs. Usermode can supply NULL for cpus and
-0 for cpu_count as a shortcut for all online CPUs. There are currently no flags,
-this value must be zero for future compatibility.
+CPU_SET(3) with size ``cpusetsize`` bytes. For value-like keys (eg. vendor,
+arch, impl), the returned value will only be valid if all CPUs in the given set
+have the same value. Otherwise -1 will be returned. For boolean-like keys, the
+value returned will be a logical AND of the values for the specified CPUs.
+Usermode can supply NULL for ``cpus`` and 0 for ``cpusetsize`` as a shortcut for
+all online CPUs. The currently supported flags are:
+
+* :c:macro:`RISCV_HWPROBE_WHICH_CPUS`: This flag basically reverses the behavior
+ of sys_riscv_hwprobe(). Instead of populating the values of keys for a given
+ set of CPUs, the values of each key are given and the set of CPUs is reduced
+ by sys_riscv_hwprobe() to only those which match each of the key-value pairs.
+ How matching is done depends on the key type. For value-like keys, matching
+ means to be the exact same as the value. For boolean-like keys, matching
+ means the result of a logical AND of the pair's value with the CPU's value is
+ exactly the same as the pair's value. Additionally, when ``cpus`` is an empty
+ set, then it is initialized to all online CPUs which fit within it, i.e. the
+ CPU set returned is the reduction of all the online CPUs which can be
+ represented with a CPU set of size ``cpusetsize``.
+
+All other flags are reserved for future compatibility and must be zero.
On success 0 is returned, on failure a negative error code is returned.
@@ -80,6 +94,100 @@ The following keys are defined:
* :c:macro:`RISCV_HWPROBE_EXT_ZICBOZ`: The Zicboz extension is supported, as
ratified in commit 3dd606f ("Create cmobase-v1.0.pdf") of riscv-CMOs.
+ * :c:macro:`RISCV_HWPROBE_EXT_ZBC` The Zbc extension is supported, as defined
+ in version 1.0 of the Bit-Manipulation ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZBKB` The Zbkb extension is supported, as
+ defined in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZBKC` The Zbkc extension is supported, as
+ defined in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZBKX` The Zbkx extension is supported, as
+ defined in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZKND` The Zknd extension is supported, as
+ defined in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZKNE` The Zkne extension is supported, as
+ defined in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZKNH` The Zknh extension is supported, as
+ defined in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZKSED` The Zksed extension is supported, as
+ defined in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZKSH` The Zksh extension is supported, as
+ defined in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZKT` The Zkt extension is supported, as defined
+ in version 1.0 of the Scalar Crypto ISA extensions.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVBB`: The Zvbb extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVBC`: The Zvbc extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVKB`: The Zvkb extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVKG`: The Zvkg extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVKNED`: The Zvkned extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVKNHA`: The Zvknha extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVKNHB`: The Zvknhb extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVKSED`: The Zvksed extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVKSH`: The Zvksh extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVKT`: The Zvkt extension is supported as
+ defined in version 1.0 of the RISC-V Cryptography Extensions Volume II.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZFH`: The Zfh extension version 1.0 is supported
+ as defined in the RISC-V ISA manual.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZFHMIN`: The Zfhmin extension version 1.0 is
+ supported as defined in the RISC-V ISA manual.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZIHINTNTL`: The Zihintntl extension version 1.0
+ is supported as defined in the RISC-V ISA manual.
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVFH`: The Zvfh extension is supported as
+ defined in the RISC-V Vector manual starting from commit e2ccd0548d6c
+ ("Remove draft warnings from Zvfh[min]").
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZVFHMIN`: The Zvfhmin extension is supported as
+ defined in the RISC-V Vector manual starting from commit e2ccd0548d6c
+ ("Remove draft warnings from Zvfh[min]").
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZFA`: The Zfa extension is supported as
+ defined in the RISC-V ISA manual starting from commit 056b6ff467c7
+ ("Zfa is ratified").
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZTSO`: The Ztso extension is supported as
+ defined in the RISC-V ISA manual starting from commit 5618fb5a216b
+ ("Ztso is now ratified.")
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZACAS`: The Zacas extension is supported as
+ defined in the Atomic Compare-and-Swap (CAS) instructions manual starting
+ from commit 5059e0ca641c ("update to ratified").
+
+ * :c:macro:`RISCV_HWPROBE_EXT_ZICOND`: The Zicond extension is supported as
+ defined in the RISC-V Integer Conditional (Zicond) operations extension
+ manual starting from commit 95cf1f9 ("Add changes requested by Ved
+ during signoff")
+
* :c:macro:`RISCV_HWPROBE_KEY_CPUPERF_0`: A bitmask that contains performance
information about the selected set of processors.
diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
index 22cc7a040d..c513855a54 100644
--- a/Documentation/arch/x86/boot.rst
+++ b/Documentation/arch/x86/boot.rst
@@ -71,7 +71,7 @@ Protocol 2.13 (Kernel 3.14) Support 32- and 64-bit flags being set in
Protocol 2.14 BURNT BY INCORRECT COMMIT
ae7e1238e68f2a472a125673ab506d49158c1889
- (x86/boot: Add ACPI RSDP address to setup_header)
+ ("x86/boot: Add ACPI RSDP address to setup_header")
DO NOT USE!!! ASSUME SAME AS 2.13.
Protocol 2.15 (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max.
diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst
index 08246e8ac8..8895784d47 100644
--- a/Documentation/arch/x86/cpuinfo.rst
+++ b/Documentation/arch/x86/cpuinfo.rst
@@ -7,27 +7,74 @@ x86 Feature Flags
Introduction
============
-On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition
-in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature
-or KVM want to expose the feature to a KVM guest, it can and should have
-an X86_FEATURE_* defined. These flags represent hardware features as
-well as software features.
-
-If users want to know if a feature is available on a given system, they
-try to find the flag in /proc/cpuinfo. If a given flag is present, it
-means that the kernel supports it and is currently making it available.
-If such flag represents a hardware feature, it also means that the
-hardware supports it.
-
-If the expected flag does not appear in /proc/cpuinfo, things are murkier.
-Users need to find out the reason why the flag is missing and find the way
-how to enable it, which is not always easy. There are several factors that
-can explain missing flags: the expected feature failed to enable, the feature
-is missing in hardware, platform firmware did not enable it, the feature is
-disabled at build or run time, an old kernel is in use, or the kernel does
-not support the feature and thus has not enabled it. In general, /proc/cpuinfo
-shows features which the kernel supports. For a full list of CPUID flags
-which the CPU supports, use tools/arch/x86/kcpuid.
+The list of feature flags in /proc/cpuinfo is not complete and
+represents an ill-fated attempt from long time ago to put feature flags
+in an easy to find place for userspace.
+
+However, the amount of feature flags is growing by the CPU generation,
+leading to unparseable and unwieldy /proc/cpuinfo.
+
+What is more, those feature flags do not even need to be in that file
+because userspace doesn't care about them - glibc et al already use
+CPUID to find out what the target machine supports and what not.
+
+And even if it doesn't show a particular feature flag - although the CPU
+still does have support for the respective hardware functionality and
+said CPU supports CPUID faulting - userspace can simply probe for the
+feature and figure out if it is supported or not, regardless of whether
+it is being advertised somewhere.
+
+Furthermore, those flag strings become an ABI the moment they appear
+there and maintaining them forever when nothing even uses them is a lot
+of wasted effort.
+
+So, the current use of /proc/cpuinfo is to show features which the
+kernel has *enabled* and *supports*. As in: the CPUID feature flag is
+there, there's an additional setup which the kernel has done while
+booting and the functionality is ready to use. A perfect example for
+that is "user_shstk" where additional code enablement is present in the
+kernel to support shadow stack for user programs.
+
+So, if users want to know if a feature is available on a given system,
+they try to find the flag in /proc/cpuinfo. If a given flag is present,
+it means that
+
+* the kernel knows about the feature enough to have an X86_FEATURE bit
+
+* the kernel supports it and is currently making it available either to
+ userspace or some other part of the kernel
+
+* if the flag represents a hardware feature the hardware supports it.
+
+The absence of a flag in /proc/cpuinfo by itself means almost nothing to
+an end user.
+
+On the one hand, a feature like "vaes" might be fully available to user
+applications on a kernel that has not defined X86_FEATURE_VAES and thus
+there is no "vaes" in /proc/cpuinfo.
+
+On the other hand, a new kernel running on non-VAES hardware would also
+have no "vaes" in /proc/cpuinfo. There's no way for an application or
+user to tell the difference.
+
+The end result is that the flags field in /proc/cpuinfo is marginally
+useful for kernel debugging, but not really for anything else.
+Applications should instead use things like the glibc facilities for
+querying CPU support. Users should rely on tools like
+tools/arch/x86/kcpuid and cpuid(1).
+
+Regarding implementation, flags appearing in /proc/cpuinfo have an
+X86_FEATURE definition in arch/x86/include/asm/cpufeatures.h. These flags
+represent hardware features as well as software features.
+
+If the kernel cares about a feature or KVM want to expose the feature to
+a KVM guest, it should only then expose it to the guest when the guest
+needs to parse /proc/cpuinfo. Which, as mentioned above, is highly
+unlikely. KVM can synthesize the CPUID bit and the KVM guest can simply
+query CPUID and figure out what the hypervisor supports and what not. As
+already stated, /proc/cpuinfo is not a dumping ground for useless
+feature flags.
+
How are feature flags created?
==============================
diff --git a/Documentation/arch/x86/pti.rst b/Documentation/arch/x86/pti.rst
index 4b858a9bad..e08d35177b 100644
--- a/Documentation/arch/x86/pti.rst
+++ b/Documentation/arch/x86/pti.rst
@@ -81,11 +81,9 @@ this protection comes at a cost:
and exit (it can be skipped when the kernel is interrupted,
though.) Moves to CR3 are on the order of a hundred
cycles, and are required at every entry and exit.
- b. A "trampoline" must be used for SYSCALL entry. This
- trampoline depends on a smaller set of resources than the
- non-PTI SYSCALL entry code, so requires mapping fewer
- things into the userspace page tables. The downside is
- that stacks must be switched at entry time.
+ b. Percpu TSS is mapped into the user page tables to allow SYSCALL64 path
+ to work under PTI. This doesn't have a direct runtime cost but it can
+ be argued it opens certain timing attack scenarios.
c. Global pages are disabled for all kernel structures not
mapped into both kernel and userspace page tables. This
feature of the MMU allows different processes to share TLB
@@ -167,7 +165,7 @@ that are worth noting here.
* Failures of the selftests/x86 code. Usually a bug in one of the
more obscure corners of entry_64.S
* Crashes in early boot, especially around CPU bringup. Bugs
- in the trampoline code or mappings cause these.
+ in the mappings cause these.
* Crashes at the first interrupt. Caused by bugs in entry_64.S,
like screwing up a page table switch. Also caused by
incorrectly mapping the IRQ handler entry code.
diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index dc8d9fd2c3..719043cd8b 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -10,6 +10,191 @@ encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed by the SEAM Ranger Register (SEAMRR). A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionalities to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs
+as TDX private KeyIDs, which are only accessible within the SEAM mode.
+BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
+
+Before the TDX module can be used to create and run protected VMs, it
+must be loaded into the isolated range and properly initialized. The TDX
+architecture doesn't require the BIOS to load the TDX module, but the
+kernel assumes it is loaded by the BIOS.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot. Below dmesg shows when TDX is enabled by BIOS::
+
+ [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
+
+TDX module initialization
+---------------------------------------
+
+The kernel talks to the TDX module via the new SEAMCALL instruction. The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+If the TDX module isn't loaded, the SEAMCALL instruction fails with a
+special error. In this case the kernel fails the module initialization
+and reports the module isn't loaded::
+
+ [..] virt/tdx: module not loaded
+
+Initializing the TDX module consumes roughly ~1/256th system RAM size to
+use it as 'metadata' for the TDX memory. It also takes additional CPU
+time to initialize those metadata along with the TDX module itself. Both
+are not trivial. The kernel initializes the TDX module at runtime on
+demand.
+
+Besides initializing the TDX module, a per-cpu initialization SEAMCALL
+must be done on one cpu before any other SEAMCALLs can be made on that
+cpu.
+
+The kernel provides two functions, tdx_enable() and tdx_cpu_enable() to
+allow the user of TDX to enable the TDX module and enable TDX on local
+cpu respectively.
+
+Making SEAMCALL requires VMXON has been done on that CPU. Currently only
+KVM implements VMXON. For now both tdx_enable() and tdx_cpu_enable()
+don't do VMXON internally (not trivial), but depends on the caller to
+guarantee that.
+
+To enable TDX, the caller of TDX should: 1) temporarily disable CPU
+hotplug; 2) do VMXON and tdx_enable_cpu() on all online cpus; 3) call
+tdx_enable(). For example::
+
+ cpus_read_lock();
+ on_each_cpu(vmxon_and_tdx_cpu_enable());
+ ret = tdx_enable();
+ cpus_read_unlock();
+ if (ret)
+ goto no_tdx;
+ // TDX is ready to use
+
+And the caller of TDX must guarantee the tdx_cpu_enable() has been
+successfully done on any cpu before it wants to run any other SEAMCALL.
+A typical usage is do both VMXON and tdx_cpu_enable() in CPU hotplug
+online callback, and refuse to online if tdx_cpu_enable() fails.
+
+User can consult dmesg to see whether the TDX module has been initialized.
+
+If the TDX module is initialized successfully, dmesg shows something
+like below::
+
+ [..] virt/tdx: 262668 KBs allocated for PAMT
+ [..] virt/tdx: module initialized
+
+If the TDX module failed to initialize, dmesg also shows it failed to
+initialize::
+
+ [..] virt/tdx: module initialization failed ...
+
+TDX Interaction to Other Kernel Components
+------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Region" (CMR) to tell the
+kernel which memory is TDX compatible. The kernel needs to build a list
+of memory regions (out of CMRs) as "TDX-usable" memory and pass those
+regions to the TDX module. Once this is done, those "TDX-usable" memory
+regions are fixed during module's lifetime.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory. Specifically, the kernel uses all
+system memory in the core-mm "at the time of TDX module initialization"
+as TDX memory, and in the meantime, refuses to online any non-TDX-memory
+in the memory hotplug.
+
+Physical Memory Hotplug
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Note TDX assumes convertible memory is always physically present during
+machine's runtime. A non-buggy BIOS should never support hot-removal of
+any convertible memory. This implementation doesn't handle ACPI memory
+removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX module requires the per-cpu initialization SEAMCALL must be done on
+one cpu before any other SEAMCALLs can be made on that cpu. The kernel
+provides tdx_cpu_enable() to let the user of TDX to do it when the user
+wants to use a new cpu for TDX task.
+
+TDX doesn't support physical (ACPI) CPU hotplug. During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX. A non-buggy BIOS should never support hot-add/removal of
+physical CPU. Currently the kernel doesn't handle physical CPU hotplug,
+but depends on the BIOS to behave correctly.
+
+Note TDX works with CPU logical online/offline, thus the kernel still
+allows to offline logical CPU and online it again.
+
+Kexec()
+~~~~~~~
+
+TDX host support currently lacks the ability to handle kexec. For
+simplicity only one of them can be enabled in the Kconfig. This will be
+fixed in the future.
+
+Erratum
+~~~~~~~
+
+The first few generations of TDX hardware have an erratum. A partial
+write to a TDX private memory cacheline will silently "poison" the
+line. Subsequent reads will consume the poison and generate a machine
+check.
+
+A partial write is a memory write where a write transaction of less than
+cacheline lands at the memory controller. The CPU does these via
+non-temporal write instructions (like MOVNTI), or through UC/WC memory
+mappings. Devices can also do partial writes via DMA.
+
+Theoretically, a kernel bug could do partial write to TDX private memory
+and trigger unexpected machine check. What's more, the machine check
+code will present these as "Hardware error" when they were, in fact, a
+software-triggered issue. But in the end, this issue is hard to trigger.
+
+If the platform has such erratum, the kernel prints additional message in
+machine check handler to tell user the machine check may be caused by
+kernel bug on TDX private memory.
+
+Interaction vs S3 and deeper states
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+TDX cannot survive from S3 and deeper states. The hardware resets and
+disables TDX completely when platform goes to S3 and deeper. Both TDX
+guests and the TDX module get destroyed permanently.
+
+The kernel uses S3 for suspend-to-ram, and use S4 and deeper states for
+hibernation. Currently, for simplicity, the kernel chooses to make TDX
+mutually exclusive with S3 and hibernation.
+
+The kernel disables TDX during early boot when hibernation support is
+available::
+
+ [..] virt/tdx: initialization failed: Hibernation support is enabled
+
+Add 'nohibernate' kernel command line to disable hibernation in order to
+use TDX.
+
+ACPI S3 is disabled during kernel early boot if TDX is enabled. The user
+needs to turn off TDX in the BIOS in order to use S3.
+
+TDX Guest Support
+=================
Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +205,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.
New TDX Exceptions
-==================
+------------------
TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +215,7 @@ Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.
Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
- Port I/O (INS, OUTS, IN, OUT)
- HLT
@@ -41,7 +226,7 @@ Instruction-based #VE
- CPUID*
Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +237,7 @@ Instruction-based #GP
- RDMSR*,WRMSR*
RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
MSR access behavior falls into three categories:
@@ -73,7 +258,7 @@ trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.
CPUID Behavior
---------------
+~~~~~~~~~~~~~~
For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +278,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
#VE on Memory Accesses
-======================
+----------------------
There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
@@ -107,7 +292,7 @@ entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
#VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +312,7 @@ be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.
#VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
@@ -145,7 +330,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
to handle the exception.
Linux #VE handler
-=================
+-----------------
Just like page faults or #GP's, #VE exceptions can be either handled or be
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +352,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
MMIO handling
-=============
+-------------
In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +374,7 @@ MMIO access via other means (like structure overlays) may result in an
oops.
Shared Memory Conversions
-=========================
+-------------------------
All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device