Documentation/arch/x86/mds.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209

Microarchitectural Data Sampling (MDS) mitigation
=================================================

.. _mds:

Overview
--------

Microarchitectural Data Sampling (MDS) is a family of side channel attacks
on internal buffers in Intel CPUs. The variants are:

 - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
 - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
 - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
 - Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091)

MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
dependent load (store-to-load forwarding) as an optimization. The forward
can also happen to a faulting or assisting load operation for a different
memory address, which can be exploited under certain conditions. Store
buffers are partitioned between Hyper-Threads so cross thread forwarding is
not possible. But if a thread enters or exits a sleep state the store
buffer is repartitioned which can expose data from one thread to the other.

MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
L1 miss situations and to hold data which is returned or sent in response
to a memory or I/O operation. Fill buffers can forward data to a load
operation and also write data to the cache. When the fill buffer is
deallocated it can retain the stale data of the preceding operations which
can then be forwarded to a faulting or assisting load operation, which can
be exploited under certain conditions. Fill buffers are shared between
Hyper-Threads so cross thread leakage is possible.

MLPDS leaks Load Port Data. Load ports are used to perform load operations
from memory or I/O. The received data is then forwarded to the register
file or a subsequent operation. In some implementations the Load Port can
contain stale data from a previous operation which can be forwarded to
faulting or assisting loads under certain conditions, which again can be
exploited eventually. Load ports are shared between Hyper-Threads so cross
thread leakage is possible.

MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from
memory that takes a fault or assist can leave data in a microarchitectural
structure that may later be observed using one of the same methods used by
MSBDS, MFBDS or MLPDS.

Exposure assumptions
--------------------

It is assumed that attack code resides in user space or in a guest with one
exception. The rationale behind this assumption is that the code construct
needed for exploiting MDS requires:

 - to control the load to trigger a fault or assist

 - to have a disclosure gadget which exposes the speculatively accessed
   data for consumption through a side channel.

 - to control the pointer through which the disclosure gadget exposes the
   data

The existence of such a construct in the kernel cannot be excluded with
100% certainty, but the complexity involved makes it extremely unlikely.

There is one exception, which is untrusted BPF. The functionality of
untrusted BPF is limited, but it needs to be thoroughly investigated
whether it can be used to create such a construct.


Mitigation strategy
-------------------

All variants have the same mitigation strategy at least for the single CPU
thread case (SMT off): Force the CPU to clear the affected buffers.

This is achieved by using the otherwise unused and obsolete VERW
instruction in combination with a microcode update. The microcode clears
the affected CPU buffers when the VERW instruction is executed.

For virtualization there are two ways to achieve CPU buffer
clearing. Either the modified VERW instruction or via the L1D Flush
command. The latter is issued when L1TF mitigation is enabled so the extra
VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
be issued.

If the VERW instruction with the supplied segment selector argument is
executed on a CPU without the microcode update there is no side effect
other than a small number of pointlessly wasted CPU cycles.

This does not protect against cross Hyper-Thread attacks except for MSBDS
which is only exploitable cross Hyper-thread when one of the Hyper-Threads
enters a C-state.

The kernel provides a function to invoke the buffer clearing:

    mds_clear_cpu_buffers()

Also macro CLEAR_CPU_BUFFERS can be used in ASM late in exit-to-user path.
Other than CFLAGS.ZF, this macro doesn't clobber any registers.

The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
(idle) transitions.

As a special quirk to address virtualization scenarios where the host has
the microcode updated, but the hypervisor does not (yet) expose the
MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
hope that it might actually clear the buffers. The state is reflected
accordingly.

According to current knowledge additional mitigations inside the kernel
itself are not required because the necessary gadgets to expose the leaked
data cannot be controlled in a way which allows exploitation from malicious
user space or VM guests.

Kernel internal mitigation modes
--------------------------------

 ======= ============================================================
 off      Mitigation is disabled. Either the CPU is not affected or
          mds=off is supplied on the kernel command line

 full     Mitigation is enabled. CPU is affected and MD_CLEAR is
          advertised in CPUID.

 vmwerv	  Mitigation is enabled. CPU is affected and MD_CLEAR is not
	  advertised in CPUID. That is mainly for virtualization
	  scenarios where the host has the updated microcode but the
	  hypervisor does not expose MD_CLEAR in CPUID. It's a best
	  effort approach without guarantee.
 ======= ============================================================

If the CPU is affected and mds=off is not supplied on the kernel command
line then the kernel selects the appropriate mitigation mode depending on
the availability of the MD_CLEAR CPUID bit.

Mitigation points
-----------------

1. Return to user space
^^^^^^^^^^^^^^^^^^^^^^^

   When transitioning from kernel to user space the CPU buffers are flushed
   on affected CPUs when the mitigation is not disabled on the kernel
   command line. The mitigation is enabled through the feature flag
   X86_FEATURE_CLEAR_CPU_BUF.

   The mitigation is invoked just before transitioning to userspace after
   user registers are restored. This is done to minimize the window in
   which kernel data could be accessed after VERW e.g. via an NMI after
   VERW.

   **Corner case not handled**
   Interrupts returning to kernel don't clear CPUs buffers since the
   exit-to-user path is expected to do that anyways. But, there could be
   a case when an NMI is generated in kernel after the exit-to-user path
   has cleared the buffers. This case is not handled and NMI returning to
   kernel don't clear CPU buffers because:

   1. It is rare to get an NMI after VERW, but before returning to userspace.
   2. For an unprivileged user, there is no known way to make that NMI
      less rare or target it.
   3. It would take a large number of these precisely-timed NMIs to mount
      an actual attack.  There's presumably not enough bandwidth.
   4. The NMI in question occurs after a VERW, i.e. when user state is
      restored and most interesting data is already scrubbed. Whats left
      is only the data that NMI touches, and that may or may not be of
      any interest.


2. C-State transition
^^^^^^^^^^^^^^^^^^^^^

   When a CPU goes idle and enters a C-State the CPU buffers need to be
   cleared on affected CPUs when SMT is active. This addresses the
   repartitioning of the store buffer when one of the Hyper-Threads enters
   a C-State.

   When SMT is inactive, i.e. either the CPU does not support it or all
   sibling threads are offline CPU buffer clearing is not required.

   The idle clearing is enabled on CPUs which are only affected by MSBDS
   and not by any other MDS variant. The other MDS variants cannot be
   protected against cross Hyper-Thread attacks because the Fill Buffer and
   the Load Ports are shared. So on CPUs affected by other variants, the
   idle clearing would be a window dressing exercise and is therefore not
   activated.

   The invocation is controlled by the static key mds_idle_clear which is
   switched depending on the chosen mitigation mode and the SMT state of
   the system.

   The buffer clear is only invoked before entering the C-State to prevent
   that stale data from the idling CPU from spilling to the Hyper-Thread
   sibling after the store buffer got repartitioned and all entries are
   available to the non idle sibling.

   When coming out of idle the store buffer is partitioned again so each
   sibling has half of it available. The back from idle CPU could be then
   speculatively exposed to contents of the sibling. The buffers are
   flushed either on exit to user space or on VMENTER so malicious code
   in user space or the guest cannot speculatively access them.

   The mitigation is hooked into all variants of halt()/mwait(), but does
   not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
   has been superseded by the intel_idle driver around 2010 and is
   preferred on all affected CPUs which are expected to gain the MD_CLEAR
   functionality in microcode. Aside of that the IO-Port mechanism is a
   legacy interface which is only used on older systems which are either
   not affected or do not receive microcode updates anymore.