summaryrefslogtreecommitdiffstats
path: root/Documentation/virt/kvm/msr.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-27 10:05:51 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-27 10:05:51 +0000
commit5d1646d90e1f2cceb9f0828f4b28318cd0ec7744 (patch)
treea94efe259b9009378be6d90eb30d2b019d95c194 /Documentation/virt/kvm/msr.rst
parentInitial commit. (diff)
downloadlinux-5d1646d90e1f2cceb9f0828f4b28318cd0ec7744.tar.xz
linux-5d1646d90e1f2cceb9f0828f4b28318cd0ec7744.zip
Adding upstream version 5.10.209.upstream/5.10.209upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'Documentation/virt/kvm/msr.rst')
-rw-r--r--Documentation/virt/kvm/msr.rst378
1 files changed, 378 insertions, 0 deletions
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
new file mode 100644
index 000000000..e37a14c32
--- /dev/null
+++ b/Documentation/virt/kvm/msr.rst
@@ -0,0 +1,378 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+KVM-specific MSRs
+=================
+
+:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
+
+KVM makes use of some custom MSRs to service some requests.
+
+Custom MSRs have a range reserved for them, that goes from
+0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
+but they are deprecated and their use is discouraged.
+
+Custom MSR list
+---------------
+
+The current supported Custom MSR list is:
+
+MSR_KVM_WALL_CLOCK_NEW:
+ 0x4b564d00
+
+data:
+ 4-byte alignment physical address of a memory area which must be
+ in guest RAM. This memory is expected to hold a copy of the following
+ structure::
+
+ struct pvclock_wall_clock {
+ u32 version;
+ u32 sec;
+ u32 nsec;
+ } __attribute__((__packed__));
+
+ whose data will be filled in by the hypervisor. The hypervisor is only
+ guaranteed to update this data at the moment of MSR write.
+ Users that want to reliably query this information more than once have
+ to write more than once to this MSR. Fields have the following meanings:
+
+ version:
+ guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ sec:
+ number of seconds for wallclock at time of boot.
+
+ nsec:
+ number of nanoseconds for wallclock at time of boot.
+
+ In order to get the current wallclock time, the system_time from
+ MSR_KVM_SYSTEM_TIME_NEW needs to be added.
+
+ Note that although MSRs are per-CPU entities, the effect of this
+ particular MSR is global.
+
+ Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
+ leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME_NEW:
+ 0x4b564d01
+
+data:
+ 4-byte aligned physical address of a memory area which must be in
+ guest RAM, plus an enable bit in bit 0. This memory is expected to hold
+ a copy of the following structure::
+
+ struct pvclock_vcpu_time_info {
+ u32 version;
+ u32 pad0;
+ u64 tsc_timestamp;
+ u64 system_time;
+ u32 tsc_to_system_mul;
+ s8 tsc_shift;
+ u8 flags;
+ u8 pad[2];
+ } __attribute__((__packed__)); /* 32 bytes */
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it.
+
+ Fields have the following meanings:
+
+ version:
+ guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ tsc_timestamp:
+ the tsc value at the current VCPU at the time
+ of the update of this structure. Guests can subtract this value
+ from current tsc to derive a notion of elapsed time since the
+ structure update.
+
+ system_time:
+ a host notion of monotonic time, including sleep
+ time at the time this structure was last updated. Unit is
+ nanoseconds.
+
+ tsc_to_system_mul:
+ multiplier to be used when converting
+ tsc-related quantity to nanoseconds
+
+ tsc_shift:
+ shift to be used when converting tsc-related
+ quantity to nanoseconds. This shift will ensure that
+ multiplication with tsc_to_system_mul does not overflow.
+ A positive value denotes a left shift, a negative value
+ a right shift.
+
+ The conversion from tsc to nanoseconds involves an additional
+ right shift by 32 bits. With this information, guests can
+ derive per-CPU time by doing::
+
+ time = (current_tsc - tsc_timestamp)
+ if (tsc_shift >= 0)
+ time <<= tsc_shift;
+ else
+ time >>= -tsc_shift;
+ time = (time * tsc_to_system_mul) >> 32
+ time = time + system_time
+
+ flags:
+ bits in this field indicate extended capabilities
+ coordinated between the guest and the hypervisor. Availability
+ of specific flags has to be checked in 0x40000001 cpuid leaf.
+ Current flags are:
+
+
+ +-----------+--------------+----------------------------------+
+ | flag bit | cpuid bit | meaning |
+ +-----------+--------------+----------------------------------+
+ | | | time measures taken across |
+ | 0 | 24 | multiple cpus are guaranteed to |
+ | | | be monotonic |
+ +-----------+--------------+----------------------------------+
+ | | | guest vcpu has been paused by |
+ | 1 | N/A | the host |
+ | | | See 4.70 in api.txt |
+ +-----------+--------------+----------------------------------+
+
+ Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
+ leaf prior to usage.
+
+
+MSR_KVM_WALL_CLOCK:
+ 0x11
+
+data and functioning:
+ same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
+
+ This MSR falls outside the reserved KVM range and may be removed in the
+ future. Its usage is deprecated.
+
+ Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
+ leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME:
+ 0x12
+
+data and functioning:
+ same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
+
+ This MSR falls outside the reserved KVM range and may be removed in the
+ future. Its usage is deprecated.
+
+ Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
+ leaf prior to usage.
+
+ The suggested algorithm for detecting kvmclock presence is then::
+
+ if (!kvm_para_available()) /* refer to cpuid.txt */
+ return NON_PRESENT;
+
+ flags = cpuid_eax(0x40000001);
+ if (flags & 3) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
+ return PRESENT;
+ } else if (flags & 0) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+ return PRESENT;
+ } else
+ return NON_PRESENT;
+
+MSR_KVM_ASYNC_PF_EN:
+ 0x4b564d02
+
+data:
+ Asynchronous page fault (APF) control MSR.
+
+ Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
+ which must be in guest RAM and must be zeroed. This memory is expected
+ to hold a copy of the following structure::
+
+ struct kvm_vcpu_pv_apf_data {
+ /* Used for 'page not present' events delivered via #PF */
+ __u32 flags;
+
+ /* Used for 'page ready' events delivered via interrupt notification */
+ __u32 token;
+
+ __u8 pad[56];
+ __u32 enabled;
+ };
+
+ Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
+ when asynchronous page faults are enabled on the vcpu, 0 when disabled.
+ Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
+ cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
+ #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
+ present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
+ events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
+ CPUID.
+
+ 'Page not present' events are currently always delivered as synthetic
+ #PF exception. During delivery of these events APF CR2 register contains
+ a token that will be used to notify the guest when missing page becomes
+ available. Also, to make it possible to distinguish between real #PF and
+ APF, first 4 bytes of 64 byte memory location ('flags') will be written
+ to by the hypervisor at the time of injection. Only first bit of 'flags'
+ is currently supported, when set, it indicates that the guest is dealing
+ with asynchronous 'page not present' event. If during a page fault APF
+ 'flags' is '0' it means that this is regular page fault. Guest is
+ supposed to clear 'flags' when it is done handling #PF exception so the
+ next event can be delivered.
+
+ Note, since APF 'page not present' events use the same exception vector
+ as regular page fault, guest must reset 'flags' to '0' before it does
+ something that can generate normal page fault.
+
+ Bytes 5-7 of 64 byte memory location ('token') will be written to by the
+ hypervisor at the time of APF 'page ready' event injection. The content
+ of these bytes is a token which was previously delivered as 'page not
+ present' event. The event indicates the page in now available. Guest is
+ supposed to write '0' to 'token' when it is done handling 'page ready'
+ event and to write 1' to MSR_KVM_ASYNC_PF_ACK after clearing the location;
+ writing to the MSR forces KVM to re-scan its queue and deliver the next
+ pending notification.
+
+ Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page
+ ready' APF delivery needs to be written to before enabling APF mechanism
+ in MSR_KVM_ASYNC_PF_EN or interrupt #0 can get injected. The MSR is
+ available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+ Note, previously, 'page ready' events were delivered via the same #PF
+ exception as 'page not present' events but this is now deprecated. If
+ bit 3 (interrupt based delivery) is not set APF events are not delivered.
+
+ If APF is disabled while there are outstanding APFs, they will
+ not be delivered.
+
+ Currently 'page ready' APF events will be always delivered on the
+ same vcpu as 'page not present' event was, but guest should not rely on
+ that.
+
+MSR_KVM_STEAL_TIME:
+ 0x4b564d03
+
+data:
+ 64-byte alignment physical address of a memory area which must be
+ in guest RAM, plus an enable bit in bit 0. This memory is expected to
+ hold a copy of the following structure::
+
+ struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u8 preempted;
+ __u8 u8_pad[3];
+ __u32 pad[11];
+ }
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it. Guest is required to make sure
+ this structure is initialized to zero.
+
+ Fields have the following meanings:
+
+ version:
+ a sequence counter. In other words, guest has to check
+ this field before and after grabbing time information and make
+ sure they are both equal and even. An odd version indicates an
+ in-progress update.
+
+ flags:
+ At this point, always zero. May be used to indicate
+ changes in this structure in the future.
+
+ steal:
+ the amount of time in which this vCPU did not run, in
+ nanoseconds. Time during which the vcpu is idle, will not be
+ reported as steal time.
+
+ preempted:
+ indicate the vCPU who owns this struct is running or
+ not. Non-zero values mean the vCPU has been preempted. Zero
+ means the vCPU is not preempted. NOTE, it is always zero if the
+ the hypervisor doesn't support this field.
+
+MSR_KVM_EOI_EN:
+ 0x4b564d04
+
+data:
+ Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
+ when disabled. Bit 1 is reserved and must be zero. When PV end of
+ interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
+ physical address of a 4 byte memory area which must be in guest RAM and
+ must be zeroed.
+
+ The first, least significant bit of 4 byte memory location will be
+ written to by the hypervisor, typically at the time of interrupt
+ injection. Value of 1 means that guest can skip writing EOI to the apic
+ (using MSR or MMIO write); instead, it is sufficient to signal
+ EOI by clearing the bit in guest memory - this location will
+ later be polled by the hypervisor.
+ Value of 0 means that the EOI write is required.
+
+ It is always safe for the guest to ignore the optimization and perform
+ the APIC EOI write anyway.
+
+ Hypervisor is guaranteed to only modify this least
+ significant bit while in the current VCPU context, this means that
+ guest does not need to use either lock prefix or memory ordering
+ primitives to synchronise with the hypervisor.
+
+ However, hypervisor can set and clear this memory bit at any time:
+ therefore to make sure hypervisor does not interrupt the
+ guest and clear the least significant bit in the memory area
+ in the window between guest testing it to detect
+ whether it can skip EOI apic write and between guest
+ clearing it to signal EOI to the hypervisor,
+ guest must both read the least significant bit in the memory area and
+ clear it using a single CPU instruction, such as test and clear, or
+ compare and exchange.
+
+MSR_KVM_POLL_CONTROL:
+ 0x4b564d05
+
+ Control host-side polling.
+
+data:
+ Bit 0 enables (1) or disables (0) host-side HLT polling logic.
+
+ KVM guests can request the host not to poll on HLT, for example if
+ they are performing polling themselves.
+
+MSR_KVM_ASYNC_PF_INT:
+ 0x4b564d06
+
+data:
+ Second asynchronous page fault (APF) control MSR.
+
+ Bits 0-7: APIC vector for delivery of 'page ready' APF events.
+ Bits 8-63: Reserved
+
+ Interrupt vector for asynchnonous 'page ready' notifications delivery.
+ The vector has to be set up before asynchronous page fault mechanism
+ is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if
+ KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_ASYNC_PF_ACK:
+ 0x4b564d07
+
+data:
+ Asynchronous page fault (APF) acknowledgment.
+
+ When the guest is done processing 'page ready' APF event and 'token'
+ field in 'struct kvm_vcpu_pv_apf_data' is cleared it is supposed to
+ write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
+ and check if there are more notifications pending. The MSR is available
+ if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.