summaryrefslogtreecommitdiffstats
path: root/Documentation/arch
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/arch')
-rw-r--r--Documentation/arch/arm64/elf_hwcaps.rst49
-rw-r--r--Documentation/arch/arm64/silicon-errata.rst41
-rw-r--r--Documentation/arch/arm64/sme.rst11
-rw-r--r--Documentation/arch/arm64/sve.rst10
-rw-r--r--Documentation/arch/m68k/buddha-driver.rst2
-rw-r--r--Documentation/arch/powerpc/dexcr.rst141
-rw-r--r--Documentation/arch/powerpc/firmware-assisted-dump.rst91
-rw-r--r--Documentation/arch/powerpc/kvm-nested.rst4
-rw-r--r--Documentation/arch/riscv/cmodx.rst98
-rw-r--r--Documentation/arch/riscv/hwprobe.rst4
-rw-r--r--Documentation/arch/riscv/index.rst1
-rw-r--r--Documentation/arch/riscv/uabi.rst4
-rw-r--r--Documentation/arch/riscv/vm-layout.rst16
-rw-r--r--Documentation/arch/s390/index.rst1
-rw-r--r--Documentation/arch/s390/mm.rst111
-rw-r--r--Documentation/arch/s390/vfio-ap.rst32
-rw-r--r--Documentation/arch/sparc/oradax/dax-hv-api.txt2
-rw-r--r--Documentation/arch/x86/amd_hsmp.rst7
-rw-r--r--Documentation/arch/x86/boot.rst3
-rw-r--r--Documentation/arch/x86/pti.rst6
-rw-r--r--Documentation/arch/x86/resctrl.rst16
-rw-r--r--Documentation/arch/x86/topology.rst24
-rw-r--r--Documentation/arch/x86/x86_64/fred.rst96
-rw-r--r--Documentation/arch/x86/x86_64/index.rst1
-rw-r--r--Documentation/arch/x86/xstate.rst2
25 files changed, 662 insertions, 111 deletions
diff --git a/Documentation/arch/arm64/elf_hwcaps.rst b/Documentation/arch/arm64/elf_hwcaps.rst
index ced7b335e2..448c166487 100644
--- a/Documentation/arch/arm64/elf_hwcaps.rst
+++ b/Documentation/arch/arm64/elf_hwcaps.rst
@@ -317,6 +317,55 @@ HWCAP2_LRCPC3
HWCAP2_LSE128
Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0011.
+HWCAP2_FPMR
+ Functionality implied by ID_AA64PFR2_EL1.FMR == 0b0001.
+
+HWCAP2_LUT
+ Functionality implied by ID_AA64ISAR2_EL1.LUT == 0b0001.
+
+HWCAP2_FAMINMAX
+ Functionality implied by ID_AA64ISAR3_EL1.FAMINMAX == 0b0001.
+
+HWCAP2_F8CVT
+ Functionality implied by ID_AA64FPFR0_EL1.F8CVT == 0b1.
+
+HWCAP2_F8FMA
+ Functionality implied by ID_AA64FPFR0_EL1.F8FMA == 0b1.
+
+HWCAP2_F8DP4
+ Functionality implied by ID_AA64FPFR0_EL1.F8DP4 == 0b1.
+
+HWCAP2_F8DP2
+ Functionality implied by ID_AA64FPFR0_EL1.F8DP2 == 0b1.
+
+HWCAP2_F8E4M3
+ Functionality implied by ID_AA64FPFR0_EL1.F8E4M3 == 0b1.
+
+HWCAP2_F8E5M2
+ Functionality implied by ID_AA64FPFR0_EL1.F8E5M2 == 0b1.
+
+HWCAP2_SME_LUTV2
+ Functionality implied by ID_AA64SMFR0_EL1.LUTv2 == 0b1.
+
+HWCAP2_SME_F8F16
+ Functionality implied by ID_AA64SMFR0_EL1.F8F16 == 0b1.
+
+HWCAP2_SME_F8F32
+ Functionality implied by ID_AA64SMFR0_EL1.F8F32 == 0b1.
+
+HWCAP2_SME_SF8FMA
+ Functionality implied by ID_AA64SMFR0_EL1.SF8FMA == 0b1.
+
+HWCAP2_SME_SF8DP4
+ Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1.
+
+HWCAP2_SME_SF8DP2
+ Functionality implied by ID_AA64SMFR0_EL1.SF8DP2 == 0b1.
+
+HWCAP2_SME_SF8DP4
+ Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1.
+
+
4. Unused AT_HWCAP bits
-----------------------
diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index 45a7f4932f..50327c05be 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -35,8 +35,9 @@ can be triggered by Linux).
For software workarounds that may adversely impact systems unaffected by
the erratum in question, a Kconfig entry is added under "Kernel
Features" -> "ARM errata workarounds via the alternatives framework".
-These are enabled by default and patched in at runtime when an affected
-CPU is detected. For less-intrusive workarounds, a Kconfig option is not
+With the exception of workarounds for errata deemed "rare" by Arm, these
+are enabled by default and patched in at runtime when an affected CPU is
+detected. For less-intrusive workarounds, a Kconfig option is not
available and the code is structured (preferably with a comment) in such
a way that the erratum will not be hit.
@@ -121,24 +122,50 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A76 | #1490853 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A76 | #3324349 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A77 | #1491015 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A77 | #1508412 | ARM64_ERRATUM_1508412 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A77 | #3324348 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A78 | #3324344 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A78C | #3324346,3324347| ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A710 | #2119858 | ARM64_ERRATUM_2119858 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A710 | #2054223 | ARM64_ERRATUM_2054223 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A710 | #2224489 | ARM64_ERRATUM_2224489 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A710 | #3324338 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A715 | #2645198 | ARM64_ERRATUM_2645198 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A720 | #3456091 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A725 | #3456106 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-X1 | #1502854 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X1 | #3324344 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X1C | #3324346 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-X2 | #2119858 | ARM64_ERRATUM_2119858 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-X2 | #2224489 | ARM64_ERRATUM_2224489 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X2 | #3324338 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X3 | #3324335 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X4 | #3194386 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X925 | #3324334 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1188873,1418040| ARM64_ERRATUM_1418040 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1349291 | N/A |
@@ -147,14 +174,24 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1542419 | ARM64_ERRATUM_1542419 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-N1 | #3324349 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N2 | #2139208 | ARM64_ERRATUM_2139208 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N2 | #2067961 | ARM64_ERRATUM_2067961 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N2 | #2253138 | ARM64_ERRATUM_2253138 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-N2 | #3324339 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-V1 | #1619801 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-V1 | #3324341 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-V2 | #3324336 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-V3 | #3312417 | ARM64_ERRATUM_3194386 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | MMU-500 | #841119,826419 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | MMU-600 | #1076982,1209401| N/A |
diff --git a/Documentation/arch/arm64/sme.rst b/Documentation/arch/arm64/sme.rst
index 3d0e53ecac..be317d4574 100644
--- a/Documentation/arch/arm64/sme.rst
+++ b/Documentation/arch/arm64/sme.rst
@@ -75,7 +75,7 @@ model features for SME is included in Appendix A.
2. Vector lengths
------------------
-SME defines a second vector length similar to the SVE vector length which is
+SME defines a second vector length similar to the SVE vector length which
controls the size of the streaming mode SVE vectors and the ZA matrix array.
The ZA matrix is square with each side having as many bytes as a streaming
mode SVE vector.
@@ -238,12 +238,12 @@ prctl(PR_SME_SET_VL, unsigned long arg)
bits of Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become
unspecified, including both streaming and non-streaming SVE state.
Calling PR_SME_SET_VL with vl equal to the thread's current vector
- length, or calling PR_SME_SET_VL with the PR_SVE_SET_VL_ONEXEC flag,
+ length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
does not constitute a change to the vector length for this purpose.
* Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared.
Calling PR_SME_SET_VL with vl equal to the thread's current vector
- length, or calling PR_SME_SET_VL with the PR_SVE_SET_VL_ONEXEC flag,
+ length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
does not constitute a change to the vector length for this purpose.
@@ -379,9 +379,8 @@ The regset data starts with struct user_za_header, containing:
/proc/sys/abi/sme_default_vector_length
Writing the text representation of an integer to this file sets the system
- default vector length to the specified value, unless the value is greater
- than the maximum vector length supported by the system in which case the
- default vector length is set to that maximum.
+ default vector length to the specified value rounded to a supported value
+ using the same rules as for setting vector length via PR_SME_SET_VL.
The result can be determined by reopening the file and reading its
contents.
diff --git a/Documentation/arch/arm64/sve.rst b/Documentation/arch/arm64/sve.rst
index 0d9a426e9f..8d8837fc39 100644
--- a/Documentation/arch/arm64/sve.rst
+++ b/Documentation/arch/arm64/sve.rst
@@ -117,11 +117,6 @@ the SVE instruction set architecture.
* The SVE registers are not used to pass arguments to or receive results from
any syscall.
-* In practice the affected registers/bits will be preserved or will be replaced
- with zeros on return from a syscall, but userspace should not make
- assumptions about this. The kernel behaviour may vary on a case-by-case
- basis.
-
* All other SVE state of a thread, including the currently configured vector
length, the state of the PR_SVE_VL_INHERIT flag, and the deferred vector
length (if any), is preserved across all syscalls, subject to the specific
@@ -428,9 +423,8 @@ The regset data starts with struct user_sve_header, containing:
/proc/sys/abi/sve_default_vector_length
Writing the text representation of an integer to this file sets the system
- default vector length to the specified value, unless the value is greater
- than the maximum vector length supported by the system in which case the
- default vector length is set to that maximum.
+ default vector length to the specified value rounded to a supported value
+ using the same rules as for setting vector length via PR_SVE_SET_VL.
The result can be determined by reopening the file and reading its
contents.
diff --git a/Documentation/arch/m68k/buddha-driver.rst b/Documentation/arch/m68k/buddha-driver.rst
index 20e4014139..5d1bc82497 100644
--- a/Documentation/arch/m68k/buddha-driver.rst
+++ b/Documentation/arch/m68k/buddha-driver.rst
@@ -173,7 +173,7 @@ When accessing IDE registers with A6=1 (for example $84x),
the timing will always be mode 0 8-bit compatible, no matter
what you have selected in the speed register:
-781ns select, IOR/IOW after 4 clock cycles (=314ns) aktive.
+781ns select, IOR/IOW after 4 clock cycles (=314ns) active.
All the timings with a very short select-signal (the 355ns
fast accesses) depend on the accelerator card used in the
diff --git a/Documentation/arch/powerpc/dexcr.rst b/Documentation/arch/powerpc/dexcr.rst
index 615a631f51..ab0724212f 100644
--- a/Documentation/arch/powerpc/dexcr.rst
+++ b/Documentation/arch/powerpc/dexcr.rst
@@ -36,8 +36,145 @@ state for a process.
Configuration
=============
-The DEXCR is currently unconfigurable. All threads are run with the
-NPHIE aspect enabled.
+prctl
+-----
+
+A process can control its own userspace DEXCR value using the
+``PR_PPC_GET_DEXCR`` and ``PR_PPC_SET_DEXCR`` pair of
+:manpage:`prctl(2)` commands. These calls have the form::
+
+ prctl(PR_PPC_GET_DEXCR, unsigned long which, 0, 0, 0);
+ prctl(PR_PPC_SET_DEXCR, unsigned long which, unsigned long ctrl, 0, 0);
+
+The possible 'which' and 'ctrl' values are as follows. Note there is no relation
+between the 'which' value and the DEXCR aspect's index.
+
+.. flat-table::
+ :header-rows: 1
+ :widths: 2 7 1
+
+ * - ``prctl()`` which
+ - Aspect name
+ - Aspect index
+
+ * - ``PR_PPC_DEXCR_SBHE``
+ - Speculative Branch Hint Enable (SBHE)
+ - 0
+
+ * - ``PR_PPC_DEXCR_IBRTPD``
+ - Indirect Branch Recurrent Target Prediction Disable (IBRTPD)
+ - 3
+
+ * - ``PR_PPC_DEXCR_SRAPD``
+ - Subroutine Return Address Prediction Disable (SRAPD)
+ - 4
+
+ * - ``PR_PPC_DEXCR_NPHIE``
+ - Non-Privileged Hash Instruction Enable (NPHIE)
+ - 5
+
+.. flat-table::
+ :header-rows: 1
+ :widths: 2 8
+
+ * - ``prctl()`` ctrl
+ - Meaning
+
+ * - ``PR_PPC_DEXCR_CTRL_EDITABLE``
+ - This aspect can be configured with PR_PPC_SET_DEXCR (get only)
+
+ * - ``PR_PPC_DEXCR_CTRL_SET``
+ - This aspect is set / set this aspect
+
+ * - ``PR_PPC_DEXCR_CTRL_CLEAR``
+ - This aspect is clear / clear this aspect
+
+ * - ``PR_PPC_DEXCR_CTRL_SET_ONEXEC``
+ - This aspect will be set after exec / set this aspect after exec
+
+ * - ``PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC``
+ - This aspect will be clear after exec / clear this aspect after exec
+
+Note that
+
+* which is a plain value, not a bitmask. Aspects must be worked with individually.
+
+* ctrl is a bitmask. ``PR_PPC_GET_DEXCR`` returns both the current and onexec
+ configuration. For example, ``PR_PPC_GET_DEXCR`` may return
+ ``PR_PPC_DEXCR_CTRL_EDITABLE | PR_PPC_DEXCR_CTRL_SET |
+ PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC``. This would indicate the aspect is currently
+ set, it will be cleared when you run exec, and you can change this with the
+ ``PR_PPC_SET_DEXCR`` prctl.
+
+* The set/clear terminology refers to setting/clearing the bit in the DEXCR.
+ For example::
+
+ prctl(PR_PPC_SET_DEXCR, PR_PPC_DEXCR_IBRTPD, PR_PPC_DEXCR_CTRL_SET, 0, 0);
+
+ will set the IBRTPD aspect bit in the DEXCR, causing indirect branch prediction
+ to be disabled.
+
+* The status returned by ``PR_PPC_GET_DEXCR`` represents what value the process
+ would like applied. It does not include any alternative overrides, such as if
+ the hypervisor is enforcing the aspect be set. To see the true DEXCR state
+ software should read the appropriate SPRs directly.
+
+* The aspect state when starting a process is copied from the parent's state on
+ :manpage:`fork(2)`. The state is reset to a fixed value on
+ :manpage:`execve(2)`. The PR_PPC_SET_DEXCR prctl() can control both of these
+ values.
+
+* The ``*_ONEXEC`` controls do not change the current process's DEXCR.
+
+Use ``PR_PPC_SET_DEXCR`` with one of ``PR_PPC_DEXCR_CTRL_SET`` or
+``PR_PPC_DEXCR_CTRL_CLEAR`` to edit a given aspect.
+
+Common error codes for both getting and setting the DEXCR are as follows:
+
+.. flat-table::
+ :header-rows: 1
+ :widths: 2 8
+
+ * - Error
+ - Meaning
+
+ * - ``EINVAL``
+ - The DEXCR is not supported by the kernel.
+
+ * - ``ENODEV``
+ - The aspect is not recognised by the kernel or not supported by the
+ hardware.
+
+``PR_PPC_SET_DEXCR`` may also report the following error codes:
+
+.. flat-table::
+ :header-rows: 1
+ :widths: 2 8
+
+ * - Error
+ - Meaning
+
+ * - ``EINVAL``
+ - The ctrl value contains unrecognised flags.
+
+ * - ``EINVAL``
+ - The ctrl value contains mutually conflicting flags (e.g.,
+ ``PR_PPC_DEXCR_CTRL_SET | PR_PPC_DEXCR_CTRL_CLEAR``)
+
+ * - ``EPERM``
+ - This aspect cannot be modified with prctl() (check for the
+ PR_PPC_DEXCR_CTRL_EDITABLE flag with PR_PPC_GET_DEXCR).
+
+ * - ``EPERM``
+ - The process does not have sufficient privilege to perform the operation.
+ For example, clearing NPHIE on exec is a privileged operation (a process
+ can still clear its own NPHIE aspect without privileges).
+
+This interface allows a process to control its own DEXCR aspects, and also set
+the initial DEXCR value for any children in its process tree (up to the next
+child to use an ``*_ONEXEC`` control). This allows fine-grained control over the
+default value of the DEXCR, for example allowing containers to run with different
+default values.
coredump and ptrace
diff --git a/Documentation/arch/powerpc/firmware-assisted-dump.rst b/Documentation/arch/powerpc/firmware-assisted-dump.rst
index e363fc4852..7e37aadd1f 100644
--- a/Documentation/arch/powerpc/firmware-assisted-dump.rst
+++ b/Documentation/arch/powerpc/firmware-assisted-dump.rst
@@ -134,12 +134,12 @@ that are run. If there is dump data, then the
memory is held.
If there is no waiting dump data, then only the memory required to
-hold CPU state, HPTE region, boot memory dump, FADump header and
-elfcore header, is usually reserved at an offset greater than boot
-memory size (see Fig. 1). This area is *not* released: this region
-will be kept permanently reserved, so that it can act as a receptacle
-for a copy of the boot memory content in addition to CPU state and
-HPTE region, in the case a crash does occur.
+hold CPU state, HPTE region, boot memory dump, and FADump header is
+usually reserved at an offset greater than boot memory size (see Fig. 1).
+This area is *not* released: this region will be kept permanently
+reserved, so that it can act as a receptacle for a copy of the boot
+memory content in addition to CPU state and HPTE region, in the case
+a crash does occur.
Since this reserved memory area is used only after the system crash,
there is no point in blocking this significant chunk of memory from
@@ -153,22 +153,22 @@ that were present in CMA region::
o Memory Reservation during first kernel
- Low memory Top of memory
- 0 boot memory size |<--- Reserved dump area --->| |
- | | | Permanent Reservation | |
- V V | | V
- +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
- | | |///|////| DUMP | HDR | ELF |////| |
- +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
- | ^ ^ ^ ^ ^
- | | | | | |
- \ CPU HPTE / | |
- ------------------------------ | |
- Boot memory content gets transferred | |
- to reserved area by firmware at the | |
- time of crash. | |
- FADump Header |
- (meta area) |
+ Low memory Top of memory
+ 0 boot memory size |<------ Reserved dump area ----->| |
+ | | | Permanent Reservation | |
+ V V | | V
+ +-----------+-----/ /---+---+----+-----------+-------+----+-----+
+ | | |///|////| DUMP | HDR |////| |
+ +-----------+-----/ /---+---+----+-----------+-------+----+-----+
+ | ^ ^ ^ ^ ^
+ | | | | | |
+ \ CPU HPTE / | |
+ -------------------------------- | |
+ Boot memory content gets transferred | |
+ to reserved area by firmware at the | |
+ time of crash. | |
+ FADump Header |
+ (meta area) |
|
|
Metadata: This area holds a metadata structure whose
@@ -186,13 +186,20 @@ that were present in CMA region::
0 boot memory size |
| |<------------ Crash preserved area ------------>|
V V |<--- Reserved dump area --->| |
- +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
- | | |///|////| DUMP | HDR | ELF |////| |
- +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
- | |
- V V
- Used by second /proc/vmcore
- kernel to boot
+ +----+---+--+-----/ /---+---+----+-------+-----+-----+-------+
+ | |ELF| | |///|////| DUMP | HDR |/////| |
+ +----+---+--+-----/ /---+---+----+-------+-----+-----+-------+
+ | | | | | |
+ ----- ------------------------------ ---------------
+ \ | |
+ \ | |
+ \ | |
+ \ | ----------------------------
+ \ | /
+ \ | /
+ \ | /
+ /proc/vmcore
+
+---+
|///| -> Regions (CPU, HPTE & Metadata) marked like this in the above
@@ -200,6 +207,12 @@ that were present in CMA region::
does not have CPU & HPTE regions while Metadata region is
not supported on pSeries currently.
+ +---+
+ |ELF| -> elfcorehdr, it is created in second kernel after crash.
+ +---+
+
+ Note: Memory from 0 to the boot memory size is used by second kernel
+
Fig. 2
@@ -353,26 +366,6 @@ TODO:
- Need to come up with the better approach to find out more
accurate boot memory size that is required for a kernel to
boot successfully when booted with restricted memory.
- - The FADump implementation introduces a FADump crash info structure
- in the scratch area before the ELF core header. The idea of introducing
- this structure is to pass some important crash info data to the second
- kernel which will help second kernel to populate ELF core header with
- correct data before it gets exported through /proc/vmcore. The current
- design implementation does not address a possibility of introducing
- additional fields (in future) to this structure without affecting
- compatibility. Need to come up with the better approach to address this.
-
- The possible approaches are:
-
- 1. Introduce version field for version tracking, bump up the version
- whenever a new field is added to the structure in future. The version
- field can be used to find out what fields are valid for the current
- version of the structure.
- 2. Reserve the area of predefined size (say PAGE_SIZE) for this
- structure and have unused area as reserved (initialized to zero)
- for future field additions.
-
- The advantage of approach 1 over 2 is we don't need to reserve extra space.
Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
diff --git a/Documentation/arch/powerpc/kvm-nested.rst b/Documentation/arch/powerpc/kvm-nested.rst
index 630602a8aa..5defd13cc6 100644
--- a/Documentation/arch/powerpc/kvm-nested.rst
+++ b/Documentation/arch/powerpc/kvm-nested.rst
@@ -546,7 +546,9 @@ table information.
+--------+-------+----+--------+----------------------------------+
| 0x1052 | 0x08 | RW | T | CTRL |
+--------+-------+----+--------+----------------------------------+
-| 0x1053-| | | | Reserved |
+| 0x1053 | 0x08 | RW | T | DPDES |
++--------+-------+----+--------+----------------------------------+
+| 0x1054-| | | | Reserved |
| 0x1FFF | | | | |
+--------+-------+----+--------+----------------------------------+
| 0x2000 | 0x04 | RW | T | CR |
diff --git a/Documentation/arch/riscv/cmodx.rst b/Documentation/arch/riscv/cmodx.rst
new file mode 100644
index 0000000000..8c48bcff3d
--- /dev/null
+++ b/Documentation/arch/riscv/cmodx.rst
@@ -0,0 +1,98 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================================================
+Concurrent Modification and Execution of Instructions (CMODX) for RISC-V Linux
+==============================================================================
+
+CMODX is a programming technique where a program executes instructions that were
+modified by the program itself. Instruction storage and the instruction cache
+(icache) are not guaranteed to be synchronized on RISC-V hardware. Therefore, the
+program must enforce its own synchronization with the unprivileged fence.i
+instruction.
+
+However, the default Linux ABI prohibits the use of fence.i in userspace
+applications. At any point the scheduler may migrate a task onto a new hart. If
+migration occurs after the userspace synchronized the icache and instruction
+storage with fence.i, the icache on the new hart will no longer be clean. This
+is due to the behavior of fence.i only affecting the hart that it is called on.
+Thus, the hart that the task has been migrated to may not have synchronized
+instruction storage and icache.
+
+There are two ways to solve this problem: use the riscv_flush_icache() syscall,
+or use the ``PR_RISCV_SET_ICACHE_FLUSH_CTX`` prctl() and emit fence.i in
+userspace. The syscall performs a one-off icache flushing operation. The prctl
+changes the Linux ABI to allow userspace to emit icache flushing operations.
+
+As an aside, "deferred" icache flushes can sometimes be triggered in the kernel.
+At the time of writing, this only occurs during the riscv_flush_icache() syscall
+and when the kernel uses copy_to_user_page(). These deferred flushes happen only
+when the memory map being used by a hart changes. If the prctl() context caused
+an icache flush, this deferred icache flush will be skipped as it is redundant.
+Therefore, there will be no additional flush when using the riscv_flush_icache()
+syscall inside of the prctl() context.
+
+prctl() Interface
+---------------------
+
+Call prctl() with ``PR_RISCV_SET_ICACHE_FLUSH_CTX`` as the first argument. The
+remaining arguments will be delegated to the riscv_set_icache_flush_ctx
+function detailed below.
+
+.. kernel-doc:: arch/riscv/mm/cacheflush.c
+ :identifiers: riscv_set_icache_flush_ctx
+
+Example usage:
+
+The following files are meant to be compiled and linked with each other. The
+modify_instruction() function replaces an add with 0 with an add with one,
+causing the instruction sequence in get_value() to change from returning a zero
+to returning a one.
+
+cmodx.c::
+
+ #include <stdio.h>
+ #include <sys/prctl.h>
+
+ extern int get_value();
+ extern void modify_instruction();
+
+ int main()
+ {
+ int value = get_value();
+ printf("Value before cmodx: %d\n", value);
+
+ // Call prctl before first fence.i is called inside modify_instruction
+ prctl(PR_RISCV_SET_ICACHE_FLUSH_CTX, PR_RISCV_CTX_SW_FENCEI_ON, PR_RISCV_SCOPE_PER_PROCESS);
+ modify_instruction();
+ // Call prctl after final fence.i is called in process
+ prctl(PR_RISCV_SET_ICACHE_FLUSH_CTX, PR_RISCV_CTX_SW_FENCEI_OFF, PR_RISCV_SCOPE_PER_PROCESS);
+
+ value = get_value();
+ printf("Value after cmodx: %d\n", value);
+ return 0;
+ }
+
+cmodx.S::
+
+ .option norvc
+
+ .text
+ .global modify_instruction
+ modify_instruction:
+ lw a0, new_insn
+ lui a5,%hi(old_insn)
+ sw a0,%lo(old_insn)(a5)
+ fence.i
+ ret
+
+ .section modifiable, "awx"
+ .global get_value
+ get_value:
+ li a0, 0
+ old_insn:
+ addi a0, a0, 0
+ ret
+
+ .data
+ new_insn:
+ addi a0, a0, 1
diff --git a/Documentation/arch/riscv/hwprobe.rst b/Documentation/arch/riscv/hwprobe.rst
index b2bcc9eed9..204cd4433a 100644
--- a/Documentation/arch/riscv/hwprobe.rst
+++ b/Documentation/arch/riscv/hwprobe.rst
@@ -188,6 +188,10 @@ The following keys are defined:
manual starting from commit 95cf1f9 ("Add changes requested by Ved
during signoff")
+ * :c:macro:`RISCV_HWPROBE_EXT_ZIHINTPAUSE`: The Zihintpause extension is
+ supported as defined in the RISC-V ISA manual starting from commit
+ d8ab5c78c207 ("Zihintpause is ratified").
+
* :c:macro:`RISCV_HWPROBE_KEY_CPUPERF_0`: A bitmask that contains performance
information about the selected set of processors.
diff --git a/Documentation/arch/riscv/index.rst b/Documentation/arch/riscv/index.rst
index 4dab0cb4b9..eecf347ce8 100644
--- a/Documentation/arch/riscv/index.rst
+++ b/Documentation/arch/riscv/index.rst
@@ -13,6 +13,7 @@ RISC-V architecture
patch-acceptance
uabi
vector
+ cmodx
features
diff --git a/Documentation/arch/riscv/uabi.rst b/Documentation/arch/riscv/uabi.rst
index 54d199dce7..2b420bab05 100644
--- a/Documentation/arch/riscv/uabi.rst
+++ b/Documentation/arch/riscv/uabi.rst
@@ -65,4 +65,6 @@ the extension, or may have deliberately removed it from the listing.
Misaligned accesses
-------------------
-Misaligned accesses are supported in userspace, but they may perform poorly.
+Misaligned scalar accesses are supported in userspace, but they may perform
+poorly. Misaligned vector accesses are only supported if the Zicclsm extension
+is supported.
diff --git a/Documentation/arch/riscv/vm-layout.rst b/Documentation/arch/riscv/vm-layout.rst
index 69ff6da1db..e476b4386b 100644
--- a/Documentation/arch/riscv/vm-layout.rst
+++ b/Documentation/arch/riscv/vm-layout.rst
@@ -144,14 +144,8 @@ passing 0 into the hint address parameter of mmap. On CPUs with an address space
smaller than sv48, the CPU maximum supported address space will be the default.
Software can "opt-in" to receiving VAs from another VA space by providing
-a hint address to mmap. A hint address passed to mmap will cause the largest
-address space that fits entirely into the hint to be used, unless there is no
-space left in the address space. If there is no space available in the requested
-address space, an address in the next smallest available address space will be
-returned.
-
-For example, in order to obtain 48-bit VA space, a hint address greater than
-:code:`1 << 47` must be provided. Note that this is 47 due to sv48 userspace
-ending at :code:`1 << 47` and the addresses beyond this are reserved for the
-kernel. Similarly, to obtain 57-bit VA space addresses, a hint address greater
-than or equal to :code:`1 << 56` must be provided.
+a hint address to mmap. When a hint address is passed to mmap, the returned
+address will never use more bits than the hint address. For example, if a hint
+address of `1 << 40` is passed to mmap, a valid returned address will never use
+bits 41 through 63. If no mappable addresses are available in that range, mmap
+will return `MAP_FAILED`.
diff --git a/Documentation/arch/s390/index.rst b/Documentation/arch/s390/index.rst
index 73c79bf586..e75a6e5d25 100644
--- a/Documentation/arch/s390/index.rst
+++ b/Documentation/arch/s390/index.rst
@@ -8,6 +8,7 @@ s390 Architecture
cds
3270
driver-model
+ mm
monreader
qeth
s390dbf
diff --git a/Documentation/arch/s390/mm.rst b/Documentation/arch/s390/mm.rst
new file mode 100644
index 0000000000..084adad5ee
--- /dev/null
+++ b/Documentation/arch/s390/mm.rst
@@ -0,0 +1,111 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Memory Management
+=================
+
+Virtual memory layout
+=====================
+
+.. note::
+
+ - Some aspects of the virtual memory layout setup are not
+ clarified (number of page levels, alignment, DMA memory).
+
+ - Unused gaps in the virtual memory layout could be present
+ or not - depending on how partucular system is configured.
+ No page tables are created for the unused gaps.
+
+ - The virtual memory regions are tracked or untracked by KASAN
+ instrumentation, as well as the KASAN shadow memory itself is
+ created only when CONFIG_KASAN configuration option is enabled.
+
+::
+
+ =============================================================================
+ | Physical | Virtual | VM area description
+ =============================================================================
+ +- 0 --------------+- 0 --------------+
+ | | S390_lowcore | Low-address memory
+ | +- 8 KB -----------+
+ | | |
+ | | |
+ | | ... unused gap | KASAN untracked
+ | | |
+ +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
+ |.amode31 text/data|.amode31 text/data| KASAN untracked
+ +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB)
+ | | |
+ | | |
+ +- __kaslr_offset_phys | kernel rand. phys start
+ | | |
+ | kernel text/data | |
+ | | |
+ +------------------+ | kernel phys end
+ | | |
+ | | |
+ | | |
+ | | |
+ +- ident_map_size -+ |
+ | |
+ | ... unused gap | KASAN untracked
+ | |
+ +- __identity_base + identity mapping start (>= 2GB)
+ | |
+ | identity | phys == virt - __identity_base
+ | mapping | virt == phys + __identity_base
+ | |
+ | | KASAN tracked
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ | |
+ +---- vmemmap -----+ 'struct page' array start
+ | |
+ | virtually mapped |
+ | memory map | KASAN untracked
+ | |
+ +- __abs_lowcore --+
+ | |
+ | Absolute Lowcore | KASAN untracked
+ | |
+ +- __memcpy_real_area
+ | |
+ | Real Memory Copy| KASAN untracked
+ | |
+ +- VMALLOC_START --+ vmalloc area start
+ | | KASAN untracked or
+ | vmalloc area | KASAN shallowly populated in case
+ | | CONFIG_KASAN_VMALLOC=y
+ +- MODULES_VADDR --+ modules area start
+ | | KASAN allocated per module or
+ | modules area | KASAN shallowly populated in case
+ | | CONFIG_KASAN_VMALLOC=y
+ +- __kaslr_offset -+ kernel rand. virt start
+ | | KASAN tracked
+ | kernel text/data | phys == (kvirt - __kaslr_offset) +
+ | | __kaslr_offset_phys
+ +- kernel .bss end + kernel rand. virt end
+ | |
+ | ... unused gap | KASAN untracked
+ | |
+ +------------------+ UltraVisor Secure Storage limit
+ | |
+ | ... unused gap | KASAN untracked
+ | |
+ +KASAN_SHADOW_START+ KASAN shadow memory start
+ | |
+ | KASAN shadow | KASAN untracked
+ | |
+ +------------------+ ASCE limit
diff --git a/Documentation/arch/s390/vfio-ap.rst b/Documentation/arch/s390/vfio-ap.rst
index 929ee1c1c9..ea744cbc86 100644
--- a/Documentation/arch/s390/vfio-ap.rst
+++ b/Documentation/arch/s390/vfio-ap.rst
@@ -380,6 +380,36 @@ matrix device.
control_domains:
A read-only file for displaying the control domain numbers assigned to the
vfio_ap mediated device.
+ ap_config:
+ A read/write file that, when written to, allows all three of the
+ vfio_ap mediated device's ap matrix masks to be replaced in one shot.
+ Three masks are given, one for adapters, one for domains, and one for
+ control domains. If the given state cannot be set then no changes are
+ made to the vfio-ap mediated device.
+
+ The format of the data written to ap_config is as follows:
+ {amask},{dmask},{cmask}\n
+
+ \n is a newline character.
+
+ amask, dmask, and cmask are masks identifying which adapters, domains,
+ and control domains should be assigned to the mediated device.
+
+ The format of a mask is as follows:
+ 0xNN..NN
+
+ Where NN..NN is 64 hexadecimal characters representing a 256-bit value.
+ The leftmost (highest order) bit represents adapter/domain 0.
+
+ For an example set of masks that represent your mdev's current
+ configuration, simply cat ap_config.
+
+ Setting an adapter or domain number greater than the maximum allowed for
+ the system will result in an error.
+
+ This attribute is intended to be used by automation. End users would be
+ better served using the respective assign/unassign attributes for
+ adapters, domains, and control domains.
* functions:
@@ -550,7 +580,7 @@ These are the steps:
following Kconfig elements selected:
* IOMMU_SUPPORT
* S390
- * ZCRYPT
+ * AP
* VFIO
* KVM
diff --git a/Documentation/arch/sparc/oradax/dax-hv-api.txt b/Documentation/arch/sparc/oradax/dax-hv-api.txt
index 7ecd0bf495..ef1a4c2bf0 100644
--- a/Documentation/arch/sparc/oradax/dax-hv-api.txt
+++ b/Documentation/arch/sparc/oradax/dax-hv-api.txt
@@ -41,7 +41,7 @@ Chapter 36. Coprocessor services
submissions until they succeed; waiting for an outstanding CCB to complete is not necessary, and would
not be a guarantee that a future submission would succeed.
- The availablility of DAX coprocessor command service is indicated by the presence of the DAX virtual
+ The availability of DAX coprocessor command service is indicated by the presence of the DAX virtual
device node in the guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device
node”).
diff --git a/Documentation/arch/x86/amd_hsmp.rst b/Documentation/arch/x86/amd_hsmp.rst
index c92bfd5535..1e499ecf5f 100644
--- a/Documentation/arch/x86/amd_hsmp.rst
+++ b/Documentation/arch/x86/amd_hsmp.rst
@@ -13,7 +13,8 @@ set of mailbox registers.
More details on the interface can be found in chapter
"7 Host System Management Port (HSMP)" of the family/model PPR
-Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
+Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
+
HSMP interface is supported on EPYC server CPU models only.
@@ -97,8 +98,8 @@ what happened. The transaction returns 0 on success.
More details on the interface and message definitions can be found in chapter
"7 Host System Management Port (HSMP)" of the respective family/model PPR
-eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
+eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
User space C-APIs are made available by linking against the esmi library,
-which is provided by the E-SMS project https://developer.amd.com/e-sms/.
+which is provided by the E-SMS project https://www.amd.com/en/developer/e-sms.html.
See: https://github.com/amd/esmi_ib_library
diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
index c513855a54..4fd492cb49 100644
--- a/Documentation/arch/x86/boot.rst
+++ b/Documentation/arch/x86/boot.rst
@@ -878,7 +878,8 @@ Protocol: 2.10+
address if possible.
A non-relocatable kernel will unconditionally move itself and to run
- at this address.
+ at this address. A relocatable kernel will move itself to this address if it
+ loaded below this address.
============ =======
Field name: init_size
diff --git a/Documentation/arch/x86/pti.rst b/Documentation/arch/x86/pti.rst
index e08d35177b..57e8392f61 100644
--- a/Documentation/arch/x86/pti.rst
+++ b/Documentation/arch/x86/pti.rst
@@ -26,9 +26,9 @@ comments in pti.c).
This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled. It can be
-enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
-Once enabled at compile-time, it can be disabled at boot with the
-'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
+time. Once enabled at compile-time, it can be disabled at boot with
+the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
Page Table Management
=====================
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a..627e23869b 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -45,7 +45,7 @@ mount options are:
Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
- bandwidth in MBps
+ bandwidth in MiBps
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
@@ -446,6 +446,12 @@ during mkdir.
max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.
+The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
+for a subset of RMID that are not immediately available for allocation.
+This can't be relied on to produce output every second, it may be necessary
+to attempt to create an empty monitor group to force an update. Output may
+only be produced if creation of a control or monitor group fails.
+
Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
@@ -526,7 +532,7 @@ threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.
In order to mitigate this and make the interface more user friendly,
-resctrl added support for specifying the bandwidth in MBps as well. The
+resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjust the memory bandwidth percentages to ensure::
@@ -573,13 +579,13 @@ Memory b/w domain is L3 cache.
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
-Memory bandwidth Allocation specified in MBps
----------------------------------------------
+Memory bandwidth Allocation specified in MiBps
+----------------------------------------------
Memory bandwidth domain is L3 cache.
::
- MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
diff --git a/Documentation/arch/x86/topology.rst b/Documentation/arch/x86/topology.rst
index 08ebf9edbf..7352ab89a5 100644
--- a/Documentation/arch/x86/topology.rst
+++ b/Documentation/arch/x86/topology.rst
@@ -47,17 +47,21 @@ AMD nomenclature for package is 'Node'.
Package-related topology information in the kernel:
- - cpuinfo_x86.x86_max_cores:
+ - topology_num_threads_per_package()
- The number of cores in a package. This information is retrieved via CPUID.
+ The number of threads in a package.
- - cpuinfo_x86.x86_max_dies:
+ - topology_num_cores_per_package()
- The number of dies in a package. This information is retrieved via CPUID.
+ The number of cores in a package.
+
+ - topology_max_dies_per_package()
+
+ The maximum number of dies in a package.
- cpuinfo_x86.topo.die_id:
- The physical ID of the die. This information is retrieved via CPUID.
+ The physical ID of the die.
- cpuinfo_x86.topo.pkg_id:
@@ -96,16 +100,6 @@ are SMT- or CMT-type threads.
AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses
"core".
-Core-related topology information in the kernel:
-
- - smp_num_siblings:
-
- The number of threads in a core. The number of threads in a package can be
- calculated by::
-
- threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings
-
-
Threads
=======
A thread is a single scheduling unit. It's the equivalent to a logical Linux
diff --git a/Documentation/arch/x86/x86_64/fred.rst b/Documentation/arch/x86/x86_64/fred.rst
new file mode 100644
index 0000000000..9f57e7b91f
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/fred.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Flexible Return and Event Delivery (FRED)
+=========================================
+
+Overview
+========
+
+The FRED architecture defines simple new transitions that change
+privilege level (ring transitions). The FRED architecture was
+designed with the following goals:
+
+1) Improve overall performance and response time by replacing event
+ delivery through the interrupt descriptor table (IDT event
+ delivery) and event return by the IRET instruction with lower
+ latency transitions.
+
+2) Improve software robustness by ensuring that event delivery
+ establishes the full supervisor context and that event return
+ establishes the full user context.
+
+The new transitions defined by the FRED architecture are FRED event
+delivery and, for returning from events, two FRED return instructions.
+FRED event delivery can effect a transition from ring 3 to ring 0, but
+it is used also to deliver events incident to ring 0. One FRED
+instruction (ERETU) effects a return from ring 0 to ring 3, while the
+other (ERETS) returns while remaining in ring 0. Collectively, FRED
+event delivery and the FRED return instructions are FRED transitions.
+
+In addition to these transitions, the FRED architecture defines a new
+instruction (LKGS) for managing the state of the GS segment register.
+The LKGS instruction can be used by 64-bit operating systems that do
+not use the new FRED transitions.
+
+Furthermore, the FRED architecture is easy to extend for future CPU
+architectures.
+
+Software based event dispatching
+================================
+
+FRED operates differently from IDT in terms of event handling. Instead
+of directly dispatching an event to its handler based on the event
+vector, FRED requires the software to dispatch an event to its handler
+based on both the event's type and vector. Therefore, an event dispatch
+framework must be implemented to facilitate the event-to-handler
+dispatch process. The FRED event dispatch framework takes control
+once an event is delivered, and employs a two-level dispatch.
+
+The first level dispatching is event type based, and the second level
+dispatching is event vector based.
+
+Full supervisor/user context
+============================
+
+FRED event delivery atomically save and restore full supervisor/user
+context upon event delivery and return. Thus it avoids the problem of
+transient states due to %cr2 and/or %dr6, and it is no longer needed
+to handle all the ugly corner cases caused by half baked entry states.
+
+FRED allows explicit unblock of NMI with new event return instructions
+ERETS/ERETU, avoiding the mess caused by IRET which unconditionally
+unblocks NMI, e.g., when an exception happens during NMI handling.
+
+FRED always restores the full value of %rsp, thus ESPFIX is no longer
+needed when FRED is enabled.
+
+LKGS
+====
+
+LKGS behaves like the MOV to GS instruction except that it loads the
+base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
+segment’s descriptor cache. With LKGS, it ends up with avoiding
+mucking with kernel GS, i.e., an operating system can always operate
+with its own GS base address.
+
+Because FRED event delivery from ring 3 and ERETU both swap the value
+of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, plus
+the introduction of LKGS instruction, the SWAPGS instruction is no
+longer needed when FRED is enabled, thus is disallowed (#UD).
+
+Stack levels
+============
+
+4 stack levels 0~3 are introduced to replace the nonreentrant IST for
+event handling, and each stack level should be configured to use a
+dedicated stack.
+
+The current stack level could be unchanged or go higher upon FRED
+event delivery. If unchanged, the CPU keeps using the current event
+stack. If higher, the CPU switches to a new event stack specified by
+the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP[123].
+
+Only execution of a FRED return instruction ERET[US], could lower the
+current stack level, causing the CPU to switch back to the stack it was
+on before a previous event delivery that promoted the stack level.
diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst
index a56070fc8e..ad15e9bd62 100644
--- a/Documentation/arch/x86/x86_64/index.rst
+++ b/Documentation/arch/x86/x86_64/index.rst
@@ -15,3 +15,4 @@ x86_64 Support
cpu-hotplug-spec
machinecheck
fsgs
+ fred
diff --git a/Documentation/arch/x86/xstate.rst b/Documentation/arch/x86/xstate.rst
index ae5c69e48b..cec05ac464 100644
--- a/Documentation/arch/x86/xstate.rst
+++ b/Documentation/arch/x86/xstate.rst
@@ -138,7 +138,7 @@ Note this example does not include the sigaltstack preparation.
Dynamic features in signal frames
---------------------------------
-Dynamcally enabled features are not written to the signal frame upon signal
+Dynamically enabled features are not written to the signal frame upon signal
entry if the feature is in its initial configuration. This differs from
non-dynamic features which are always written regardless of their
configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV