summaryrefslogtreecommitdiffstats
path: root/Documentation/virt/kvm/x86
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/virt/kvm/x86')
-rw-r--r--Documentation/virt/kvm/x86/amd-memory-encryption.rst446
-rw-r--r--Documentation/virt/kvm/x86/cpuid.rst124
-rw-r--r--Documentation/virt/kvm/x86/errata.rst39
-rw-r--r--Documentation/virt/kvm/x86/hypercalls.rst192
-rw-r--r--Documentation/virt/kvm/x86/index.rst18
-rw-r--r--Documentation/virt/kvm/x86/mmu.rst484
-rw-r--r--Documentation/virt/kvm/x86/msr.rst391
-rw-r--r--Documentation/virt/kvm/x86/nested-vmx.rst244
-rw-r--r--Documentation/virt/kvm/x86/running-nested-guests.rst278
-rw-r--r--Documentation/virt/kvm/x86/timekeeping.rst645
10 files changed, 2861 insertions, 0 deletions
diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
new file mode 100644
index 000000000..935aaeb97
--- /dev/null
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -0,0 +1,446 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+Secure Encrypted Virtualization (SEV)
+======================================
+
+Overview
+========
+
+Secure Encrypted Virtualization (SEV) is a feature found on AMD processors.
+
+SEV is an extension to the AMD-V architecture which supports running
+virtual machines (VMs) under the control of a hypervisor. When enabled,
+the memory contents of a VM will be transparently encrypted with a key
+unique to that VM.
+
+The hypervisor can determine the SEV support through the CPUID
+instruction. The CPUID function 0x8000001f reports information related
+to SEV::
+
+ 0x8000001f[eax]:
+ Bit[1] indicates support for SEV
+ ...
+ [ecx]:
+ Bits[31:0] Number of encrypted guests supported simultaneously
+
+If support for SEV is present, MSR 0xc001_0010 (MSR_AMD64_SYSCFG) and MSR 0xc001_0015
+(MSR_K7_HWCR) can be used to determine if it can be enabled::
+
+ 0xc001_0010:
+ Bit[23] 1 = memory encryption can be enabled
+ 0 = memory encryption can not be enabled
+
+ 0xc001_0015:
+ Bit[0] 1 = memory encryption can be enabled
+ 0 = memory encryption can not be enabled
+
+When SEV support is available, it can be enabled in a specific VM by
+setting the SEV bit before executing VMRUN.::
+
+ VMCB[0x90]:
+ Bit[1] 1 = SEV is enabled
+ 0 = SEV is disabled
+
+SEV hardware uses ASIDs to associate a memory encryption key with a VM.
+Hence, the ASID for the SEV-enabled guests must be from 1 to a maximum value
+defined in the CPUID 0x8000001f[ecx] field.
+
+SEV Key Management
+==================
+
+The SEV guest key management is handled by a separate processor called the AMD
+Secure Processor (AMD-SP). Firmware running inside the AMD-SP provides a secure
+key management interface to perform common hypervisor activities such as
+encrypting bootstrap code, snapshot, migrating and debugging the guest. For more
+information, see the SEV Key Management spec [api-spec]_
+
+The main ioctl to access SEV is KVM_MEMORY_ENCRYPT_OP. If the argument
+to KVM_MEMORY_ENCRYPT_OP is NULL, the ioctl returns 0 if SEV is enabled
+and ``ENOTTY` if it is disabled (on some older versions of Linux,
+the ioctl runs normally even with a NULL argument, and therefore will
+likely return ``EFAULT``). If non-NULL, the argument to KVM_MEMORY_ENCRYPT_OP
+must be a struct kvm_sev_cmd::
+
+ struct kvm_sev_cmd {
+ __u32 id;
+ __u64 data;
+ __u32 error;
+ __u32 sev_fd;
+ };
+
+
+The ``id`` field contains the subcommand, and the ``data`` field points to
+another struct containing arguments specific to command. The ``sev_fd``
+should point to a file descriptor that is opened on the ``/dev/sev``
+device, if needed (see individual commands).
+
+On output, ``error`` is zero on success, or an error code. Error codes
+are defined in ``<linux/psp-dev.h>``.
+
+KVM implements the following commands to support common lifecycle events of SEV
+guests, such as launching, running, snapshotting, migrating and decommissioning.
+
+1. KVM_SEV_INIT
+---------------
+
+The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform
+context. In a typical workflow, this command should be the first command issued.
+
+The firmware can be initialized either by using its own non-volatile storage or
+the OS can manage the NV storage for the firmware using the module parameter
+``init_ex_path``. If the file specified by ``init_ex_path`` does not exist or
+is invalid, the OS will create or override the file with output from PSP.
+
+Returns: 0 on success, -negative on error
+
+2. KVM_SEV_LAUNCH_START
+-----------------------
+
+The KVM_SEV_LAUNCH_START command is used for creating the memory encryption
+context. To create the encryption context, user must provide a guest policy,
+the owner's public Diffie-Hellman (PDH) key and session information.
+
+Parameters: struct kvm_sev_launch_start (in/out)
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_launch_start {
+ __u32 handle; /* if zero then firmware creates a new handle */
+ __u32 policy; /* guest's policy */
+
+ __u64 dh_uaddr; /* userspace address pointing to the guest owner's PDH key */
+ __u32 dh_len;
+
+ __u64 session_addr; /* userspace address which points to the guest session information */
+ __u32 session_len;
+ };
+
+On success, the 'handle' field contains a new handle and on error, a negative value.
+
+KVM_SEV_LAUNCH_START requires the ``sev_fd`` field to be valid.
+
+For more details, see SEV spec Section 6.2.
+
+3. KVM_SEV_LAUNCH_UPDATE_DATA
+-----------------------------
+
+The KVM_SEV_LAUNCH_UPDATE_DATA is used for encrypting a memory region. It also
+calculates a measurement of the memory contents. The measurement is a signature
+of the memory contents that can be sent to the guest owner as an attestation
+that the memory was encrypted correctly by the firmware.
+
+Parameters (in): struct kvm_sev_launch_update_data
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_launch_update {
+ __u64 uaddr; /* userspace address to be encrypted (must be 16-byte aligned) */
+ __u32 len; /* length of the data to be encrypted (must be 16-byte aligned) */
+ };
+
+For more details, see SEV spec Section 6.3.
+
+4. KVM_SEV_LAUNCH_MEASURE
+-------------------------
+
+The KVM_SEV_LAUNCH_MEASURE command is used to retrieve the measurement of the
+data encrypted by the KVM_SEV_LAUNCH_UPDATE_DATA command. The guest owner may
+wait to provide the guest with confidential information until it can verify the
+measurement. Since the guest owner knows the initial contents of the guest at
+boot, the measurement can be verified by comparing it to what the guest owner
+expects.
+
+If len is zero on entry, the measurement blob length is written to len and
+uaddr is unused.
+
+Parameters (in): struct kvm_sev_launch_measure
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_launch_measure {
+ __u64 uaddr; /* where to copy the measurement */
+ __u32 len; /* length of measurement blob */
+ };
+
+For more details on the measurement verification flow, see SEV spec Section 6.4.
+
+5. KVM_SEV_LAUNCH_FINISH
+------------------------
+
+After completion of the launch flow, the KVM_SEV_LAUNCH_FINISH command can be
+issued to make the guest ready for the execution.
+
+Returns: 0 on success, -negative on error
+
+6. KVM_SEV_GUEST_STATUS
+-----------------------
+
+The KVM_SEV_GUEST_STATUS command is used to retrieve status information about a
+SEV-enabled guest.
+
+Parameters (out): struct kvm_sev_guest_status
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_guest_status {
+ __u32 handle; /* guest handle */
+ __u32 policy; /* guest policy */
+ __u8 state; /* guest state (see enum below) */
+ };
+
+SEV guest state:
+
+::
+
+ enum {
+ SEV_STATE_INVALID = 0;
+ SEV_STATE_LAUNCHING, /* guest is currently being launched */
+ SEV_STATE_SECRET, /* guest is being launched and ready to accept the ciphertext data */
+ SEV_STATE_RUNNING, /* guest is fully launched and running */
+ SEV_STATE_RECEIVING, /* guest is being migrated in from another SEV machine */
+ SEV_STATE_SENDING /* guest is getting migrated out to another SEV machine */
+ };
+
+7. KVM_SEV_DBG_DECRYPT
+----------------------
+
+The KVM_SEV_DEBUG_DECRYPT command can be used by the hypervisor to request the
+firmware to decrypt the data at the given memory region.
+
+Parameters (in): struct kvm_sev_dbg
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_dbg {
+ __u64 src_uaddr; /* userspace address of data to decrypt */
+ __u64 dst_uaddr; /* userspace address of destination */
+ __u32 len; /* length of memory region to decrypt */
+ };
+
+The command returns an error if the guest policy does not allow debugging.
+
+8. KVM_SEV_DBG_ENCRYPT
+----------------------
+
+The KVM_SEV_DEBUG_ENCRYPT command can be used by the hypervisor to request the
+firmware to encrypt the data at the given memory region.
+
+Parameters (in): struct kvm_sev_dbg
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_dbg {
+ __u64 src_uaddr; /* userspace address of data to encrypt */
+ __u64 dst_uaddr; /* userspace address of destination */
+ __u32 len; /* length of memory region to encrypt */
+ };
+
+The command returns an error if the guest policy does not allow debugging.
+
+9. KVM_SEV_LAUNCH_SECRET
+------------------------
+
+The KVM_SEV_LAUNCH_SECRET command can be used by the hypervisor to inject secret
+data after the measurement has been validated by the guest owner.
+
+Parameters (in): struct kvm_sev_launch_secret
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_launch_secret {
+ __u64 hdr_uaddr; /* userspace address containing the packet header */
+ __u32 hdr_len;
+
+ __u64 guest_uaddr; /* the guest memory region where the secret should be injected */
+ __u32 guest_len;
+
+ __u64 trans_uaddr; /* the hypervisor memory region which contains the secret */
+ __u32 trans_len;
+ };
+
+10. KVM_SEV_GET_ATTESTATION_REPORT
+----------------------------------
+
+The KVM_SEV_GET_ATTESTATION_REPORT command can be used by the hypervisor to query the attestation
+report containing the SHA-256 digest of the guest memory and VMSA passed through the KVM_SEV_LAUNCH
+commands and signed with the PEK. The digest returned by the command should match the digest
+used by the guest owner with the KVM_SEV_LAUNCH_MEASURE.
+
+If len is zero on entry, the measurement blob length is written to len and
+uaddr is unused.
+
+Parameters (in): struct kvm_sev_attestation
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_attestation_report {
+ __u8 mnonce[16]; /* A random mnonce that will be placed in the report */
+
+ __u64 uaddr; /* userspace address where the report should be copied */
+ __u32 len;
+ };
+
+11. KVM_SEV_SEND_START
+----------------------
+
+The KVM_SEV_SEND_START command can be used by the hypervisor to create an
+outgoing guest encryption context.
+
+If session_len is zero on entry, the length of the guest session information is
+written to session_len and all other fields are not used.
+
+Parameters (in): struct kvm_sev_send_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_send_start {
+ __u32 policy; /* guest policy */
+
+ __u64 pdh_cert_uaddr; /* platform Diffie-Hellman certificate */
+ __u32 pdh_cert_len;
+
+ __u64 plat_certs_uaddr; /* platform certificate chain */
+ __u32 plat_certs_len;
+
+ __u64 amd_certs_uaddr; /* AMD certificate */
+ __u32 amd_certs_len;
+
+ __u64 session_uaddr; /* Guest session information */
+ __u32 session_len;
+ };
+
+12. KVM_SEV_SEND_UPDATE_DATA
+----------------------------
+
+The KVM_SEV_SEND_UPDATE_DATA command can be used by the hypervisor to encrypt the
+outgoing guest memory region with the encryption context creating using
+KVM_SEV_SEND_START.
+
+If hdr_len or trans_len are zero on entry, the length of the packet header and
+transport region are written to hdr_len and trans_len respectively, and all
+other fields are not used.
+
+Parameters (in): struct kvm_sev_send_update_data
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_launch_send_update_data {
+ __u64 hdr_uaddr; /* userspace address containing the packet header */
+ __u32 hdr_len;
+
+ __u64 guest_uaddr; /* the source memory region to be encrypted */
+ __u32 guest_len;
+
+ __u64 trans_uaddr; /* the destination memory region */
+ __u32 trans_len;
+ };
+
+13. KVM_SEV_SEND_FINISH
+------------------------
+
+After completion of the migration flow, the KVM_SEV_SEND_FINISH command can be
+issued by the hypervisor to delete the encryption context.
+
+Returns: 0 on success, -negative on error
+
+14. KVM_SEV_SEND_CANCEL
+------------------------
+
+After completion of SEND_START, but before SEND_FINISH, the source VMM can issue the
+SEND_CANCEL command to stop a migration. This is necessary so that a cancelled
+migration can restart with a new target later.
+
+Returns: 0 on success, -negative on error
+
+15. KVM_SEV_RECEIVE_START
+-------------------------
+
+The KVM_SEV_RECEIVE_START command is used for creating the memory encryption
+context for an incoming SEV guest. To create the encryption context, the user must
+provide a guest policy, the platform public Diffie-Hellman (PDH) key and session
+information.
+
+Parameters: struct kvm_sev_receive_start (in/out)
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_receive_start {
+ __u32 handle; /* if zero then firmware creates a new handle */
+ __u32 policy; /* guest's policy */
+
+ __u64 pdh_uaddr; /* userspace address pointing to the PDH key */
+ __u32 pdh_len;
+
+ __u64 session_uaddr; /* userspace address which points to the guest session information */
+ __u32 session_len;
+ };
+
+On success, the 'handle' field contains a new handle and on error, a negative value.
+
+For more details, see SEV spec Section 6.12.
+
+16. KVM_SEV_RECEIVE_UPDATE_DATA
+-------------------------------
+
+The KVM_SEV_RECEIVE_UPDATE_DATA command can be used by the hypervisor to copy
+the incoming buffers into the guest memory region with encryption context
+created during the KVM_SEV_RECEIVE_START.
+
+Parameters (in): struct kvm_sev_receive_update_data
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_launch_receive_update_data {
+ __u64 hdr_uaddr; /* userspace address containing the packet header */
+ __u32 hdr_len;
+
+ __u64 guest_uaddr; /* the destination guest memory region */
+ __u32 guest_len;
+
+ __u64 trans_uaddr; /* the incoming buffer memory region */
+ __u32 trans_len;
+ };
+
+17. KVM_SEV_RECEIVE_FINISH
+--------------------------
+
+After completion of the migration flow, the KVM_SEV_RECEIVE_FINISH command can be
+issued by the hypervisor to make the guest ready for execution.
+
+Returns: 0 on success, -negative on error
+
+References
+==========
+
+
+See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+
+.. [white-paper] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
+.. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
+.. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
+.. [kvm-forum] https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
diff --git a/Documentation/virt/kvm/x86/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst
new file mode 100644
index 000000000..bda3e3e73
--- /dev/null
+++ b/Documentation/virt/kvm/x86/cpuid.rst
@@ -0,0 +1,124 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+KVM CPUID bits
+==============
+
+:Author: Glauber Costa <glommer@gmail.com>
+
+A guest running on a kvm host, can check some of its features using
+cpuid. This is not always guaranteed to work, since userspace can
+mask-out some, or even all KVM-related cpuid features before launching
+a guest.
+
+KVM cpuid functions are:
+
+function: KVM_CPUID_SIGNATURE (0x40000000)
+
+returns::
+
+ eax = 0x40000001
+ ebx = 0x4b4d564b
+ ecx = 0x564b4d56
+ edx = 0x4d
+
+Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
+The value in eax corresponds to the maximum cpuid function present in this leaf,
+and will be updated if more functions are added in the future.
+Note also that old hosts set eax value to 0x0. This should
+be interpreted as if the value was 0x40000001.
+This function queries the presence of KVM cpuid leafs.
+
+function: define KVM_CPUID_FEATURES (0x40000001)
+
+returns::
+
+ ebx, ecx
+ eax = an OR'ed group of (1 << flag)
+
+where ``flag`` is defined as below:
+
+================================== =========== ================================
+flag value meaning
+================================== =========== ================================
+KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs
+ 0x11 and 0x12
+
+KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays
+ on PIO operations
+
+KVM_FEATURE_MMU_OP 2 deprecated
+
+KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs
+ 0x4b564d00 and 0x4b564d01
+
+KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by
+ writing to msr 0x4b564d02
+
+KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by
+ writing to msr 0x4b564d03
+
+KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt
+ handler can be enabled by
+ writing to msr 0x4b564d04
+
+KVM_FEATURE_PV_UNHALT 7 guest checks this feature bit
+ before enabling paravirtualized
+ spinlock support
+
+KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit
+ before enabling paravirtualized
+ tlb flush
+
+KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT
+ can be enabled by setting bit 2
+ when writing to msr 0x4b564d02
+
+KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit
+ before enabling paravirtualized
+ send IPIs
+
+KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can
+ be disabled by writing
+ to msr 0x4b564d05.
+
+KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit
+ before using paravirtualized
+ sched yield.
+
+KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit
+ before using the second async
+ pf control msr 0x4b564d06 and
+ async pf acknowledgment msr
+ 0x4b564d07.
+
+KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit
+ before using extended destination
+ ID bits in MSI address bits 11-5.
+
+KVM_FEATURE_HC_MAP_GPA_RANGE 16 guest checks this feature bit before
+ using the map gpa range hypercall
+ to notify the page state change
+
+KVM_FEATURE_MIGRATION_CONTROL 17 guest checks this feature bit before
+ using MSR_KVM_MIGRATION_CONTROL
+
+KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side
+ per-cpu warps are expected in
+ kvmclock
+================================== =========== ================================
+
+::
+
+ edx = an OR'ed group of (1 << flag)
+
+Where ``flag`` here is defined as below:
+
+================== ============ =================================
+flag value meaning
+================== ============ =================================
+KVM_HINTS_REALTIME 0 guest checks this feature bit to
+ determine that vCPUs are never
+ preempted for an unlimited time
+ allowing optimizations
+================== ============ =================================
diff --git a/Documentation/virt/kvm/x86/errata.rst b/Documentation/virt/kvm/x86/errata.rst
new file mode 100644
index 000000000..410e0aa63
--- /dev/null
+++ b/Documentation/virt/kvm/x86/errata.rst
@@ -0,0 +1,39 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+Known limitations of CPU virtualization
+=======================================
+
+Whenever perfect emulation of a CPU feature is impossible or too hard, KVM
+has to choose between not implementing the feature at all or introducing
+behavioral differences between virtual machines and bare metal systems.
+
+This file documents some of the known limitations that KVM has in
+virtualizing CPU features.
+
+x86
+===
+
+``KVM_GET_SUPPORTED_CPUID`` issues
+----------------------------------
+
+x87 features
+~~~~~~~~~~~~
+
+Unlike most other CPUID feature bits, CPUID[EAX=7,ECX=0]:EBX[6]
+(FDP_EXCPTN_ONLY) and CPUID[EAX=7,ECX=0]:EBX]13] (ZERO_FCS_FDS) are
+clear if the features are present and set if the features are not present.
+
+Clearing these bits in CPUID has no effect on the operation of the guest;
+if these bits are set on hardware, the features will not be present on
+any virtual machine that runs on that hardware.
+
+**Workaround:** It is recommended to always set these bits in guest CPUID.
+Note however that any software (e.g ``WIN87EM.DLL``) expecting these features
+to be present likely predates these CPUID feature bits, and therefore
+doesn't know to check for them anyway.
+
+Nested virtualization features
+------------------------------
+
+TBD
diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
new file mode 100644
index 000000000..10db79247
--- /dev/null
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -0,0 +1,192 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+Linux KVM Hypercall
+===================
+
+X86:
+ KVM Hypercalls have a three-byte sequence of either the vmcall or the vmmcall
+ instruction. The hypervisor can replace it with instructions that are
+ guaranteed to be supported.
+
+ Up to four arguments may be passed in rbx, rcx, rdx, and rsi respectively.
+ The hypercall number should be placed in rax and the return value will be
+ placed in rax. No other registers will be clobbered unless explicitly stated
+ by the particular hypercall.
+
+S390:
+ R2-R7 are used for parameters 1-6. In addition, R1 is used for hypercall
+ number. The return value is written to R2.
+
+ S390 uses diagnose instruction as hypercall (0x500) along with hypercall
+ number in R1.
+
+ For further information on the S390 diagnose call as supported by KVM,
+ refer to Documentation/virt/kvm/s390/s390-diag.rst.
+
+PowerPC:
+ It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers.
+ Return value is placed in R3.
+
+ KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions'
+ property inside the device tree's /hypervisor node.
+ For more information refer to Documentation/virt/kvm/ppc-pv.rst
+
+MIPS:
+ KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall
+ number in $2 (v0). Up to four arguments may be placed in $4-$7 (a0-a3) and
+ the return value is placed in $2 (v0).
+
+KVM Hypercalls Documentation
+============================
+
+The template for each hypercall is:
+1. Hypercall name.
+2. Architecture(s)
+3. Status (deprecated, obsolete, active)
+4. Purpose
+
+1. KVM_HC_VAPIC_POLL_IRQ
+------------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Trigger guest exit so that the host can check for pending
+ interrupts on reentry.
+
+2. KVM_HC_MMU_OP
+----------------
+
+:Architecture: x86
+:Status: deprecated.
+:Purpose: Support MMU operations such as writing to PTE,
+ flushing TLB, release PT.
+
+3. KVM_HC_FEATURES
+------------------
+
+:Architecture: PPC
+:Status: active
+:Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid
+ used to enumerate which hypercalls are available. On PPC, either
+ device tree based lookup ( which is also what EPAPR dictates)
+ OR KVM specific enumeration mechanism (which is this hypercall)
+ can be used.
+
+4. KVM_HC_PPC_MAP_MAGIC_PAGE
+----------------------------
+
+:Architecture: PPC
+:Status: active
+:Purpose: To enable communication between the hypervisor and guest there is a
+ shared page that contains parts of supervisor visible register state.
+ The guest can map this shared page to access its supervisor register
+ through memory using this hypercall.
+
+5. KVM_HC_KICK_CPU
+------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Hypercall used to wakeup a vcpu from HLT state
+:Usage example:
+ A vcpu of a paravirtualized guest that is busywaiting in guest
+ kernel mode for an event to occur (ex: a spinlock to become available) can
+ execute HLT instruction once it has busy-waited for more than a threshold
+ time-interval. Execution of HLT instruction would cause the hypervisor to put
+ the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the
+ same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
+ specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0)
+ is used in the hypercall for future use.
+
+
+6. KVM_HC_CLOCK_PAIRING
+-----------------------
+:Architecture: x86
+:Status: active
+:Purpose: Hypercall used to synchronize host and guest clocks.
+
+Usage:
+
+a0: guest physical address where host copies
+"struct kvm_clock_offset" structure.
+
+a1: clock_type, ATM only KVM_CLOCK_PAIRING_WALLCLOCK (0)
+is supported (corresponding to the host's CLOCK_REALTIME clock).
+
+ ::
+
+ struct kvm_clock_pairing {
+ __s64 sec;
+ __s64 nsec;
+ __u64 tsc;
+ __u32 flags;
+ __u32 pad[9];
+ };
+
+ Where:
+ * sec: seconds from clock_type clock.
+ * nsec: nanoseconds from clock_type clock.
+ * tsc: guest TSC value used to calculate sec/nsec pair
+ * flags: flags, unused (0) at the moment.
+
+The hypercall lets a guest compute a precise timestamp across
+host and guest. The guest can use the returned TSC value to
+compute the CLOCK_REALTIME for its clock, at the same instant.
+
+Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
+or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+6. KVM_HC_SEND_IPI
+------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Send IPIs to multiple vCPUs.
+
+- a0: lower part of the bitmap of destination APIC IDs
+- a1: higher part of the bitmap of destination APIC IDs
+- a2: the lowest APIC ID in bitmap
+- a3: APIC ICR
+
+The hypercall lets a guest send multicast IPIs, with at most 128
+128 destinations per hypercall in 64-bit mode and 64 vCPUs per
+hypercall in 32-bit mode. The destinations are represented by a
+bitmap contained in the first two arguments (a0 and a1). Bit 0 of
+a0 corresponds to the APIC ID in the third argument (a2), bit 1
+corresponds to the APIC ID a2+1, and so on.
+
+Returns the number of CPUs to which the IPIs were delivered successfully.
+
+7. KVM_HC_SCHED_YIELD
+---------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Hypercall used to yield if the IPI target vCPU is preempted
+
+a0: destination APIC ID
+
+:Usage example: When sending a call-function IPI-many to vCPUs, yield if
+ any of the IPI target vCPUs was preempted.
+
+8. KVM_HC_MAP_GPA_RANGE
+-------------------------
+:Architecture: x86
+:Status: active
+:Purpose: Request KVM to map a GPA range with the specified attributes.
+
+a0: the guest physical address of the start page
+a1: the number of (4kb) pages (must be contiguous in GPA space)
+a2: attributes
+
+ Where 'attributes' :
+ * bits 3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
+ * bit 4 - plaintext = 0, encrypted = 1
+ * bits 63:5 - reserved (must be zero)
+
+**Implementation note**: this hypercall is implemented in userspace via
+the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
+before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In
+addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
+must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
new file mode 100644
index 000000000..9ece6b8dc
--- /dev/null
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+KVM for x86 systems
+===================
+
+.. toctree::
+ :maxdepth: 2
+
+ amd-memory-encryption
+ cpuid
+ errata
+ hypercalls
+ mmu
+ msr
+ nested-vmx
+ running-nested-guests
+ timekeeping
diff --git a/Documentation/virt/kvm/x86/mmu.rst b/Documentation/virt/kvm/x86/mmu.rst
new file mode 100644
index 000000000..8364afa22
--- /dev/null
+++ b/Documentation/virt/kvm/x86/mmu.rst
@@ -0,0 +1,484 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+The x86 kvm shadow mmu
+======================
+
+The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
+for presenting a standard x86 mmu to the guest, while translating guest
+physical addresses to host physical addresses.
+
+The mmu code attempts to satisfy the following requirements:
+
+- correctness:
+ the guest should not be able to determine that it is running
+ on an emulated mmu except for timing (we attempt to comply
+ with the specification, not emulate the characteristics of
+ a particular implementation such as tlb size)
+- security:
+ the guest must not be able to touch host memory not assigned
+ to it
+- performance:
+ minimize the performance penalty imposed by the mmu
+- scaling:
+ need to scale to large memory and large vcpu guests
+- hardware:
+ support the full range of x86 virtualization hardware
+- integration:
+ Linux memory management code must be in control of guest memory
+ so that swapping, page migration, page merging, transparent
+ hugepages, and similar features work without change
+- dirty tracking:
+ report writes to guest memory to enable live migration
+ and framebuffer-based displays
+- footprint:
+ keep the amount of pinned kernel memory low (most memory
+ should be shrinkable)
+- reliability:
+ avoid multipage or GFP_ATOMIC allocations
+
+Acronyms
+========
+
+==== ====================================================================
+pfn host page frame number
+hpa host physical address
+hva host virtual address
+gfn guest frame number
+gpa guest physical address
+gva guest virtual address
+ngpa nested guest physical address
+ngva nested guest virtual address
+pte page table entry (used also to refer generically to paging structure
+ entries)
+gpte guest pte (referring to gfns)
+spte shadow pte (referring to pfns)
+tdp two dimensional paging (vendor neutral term for NPT and EPT)
+==== ====================================================================
+
+Virtual and real hardware supported
+===================================
+
+The mmu supports first-generation mmu hardware, which allows an atomic switch
+of the current paging mode and cr3 during guest entry, as well as
+two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware
+it exposes is the traditional 2/3/4 level x86 mmu, with support for global
+pages, pae, pse, pse36, cr0.wp, and 1GB pages. Emulated hardware also
+able to expose NPT capable hardware on NPT capable hosts.
+
+Translation
+===========
+
+The primary job of the mmu is to program the processor's mmu to translate
+addresses for the guest. Different translations are required at different
+times:
+
+- when guest paging is disabled, we translate guest physical addresses to
+ host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual addresses, to
+ guest physical addresses, to host physical addresses (gva->gpa->hpa)
+- when the guest launches a guest of its own, we translate nested guest
+ virtual addresses, to nested guest physical addresses, to guest physical
+ addresses, to host physical addresses (ngva->ngpa->gpa->hpa)
+
+The primary challenge is to encode between 1 and 3 translations into hardware
+that support only 1 (traditional) and 2 (tdp) translations. When the
+number of required translations matches the hardware, the mmu operates in
+direct mode; otherwise it operates in shadow mode (see below).
+
+Memory
+======
+
+Guest memory (gpa) is part of the user address space of the process that is
+using kvm. Userspace defines the translation between guest addresses and user
+addresses (gpa->hva); note that two gpas may alias to the same hva, but not
+vice versa.
+
+These hvas may be backed using any method available to the host: anonymous
+memory, file backed memory, and device memory. Memory might be paged by the
+host at any time.
+
+Events
+======
+
+The mmu is driven by events, some from the guest, some from the host.
+
+Guest generated events:
+
+- writes to control registers (especially cr3)
+- invlpg/invlpga instruction execution
+- access to missing or protected translations
+
+Host generated events:
+
+- changes in the gpa->hpa translation (either through gpa->hva changes or
+ through hva->hpa changes)
+- memory pressure (the shrinker)
+
+Shadow pages
+============
+
+The principal data structure is the shadow page, 'struct kvm_mmu_page'. A
+shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A
+shadow page may contain a mix of leaf and nonleaf sptes.
+
+A nonleaf spte allows the hardware mmu to reach the leaf pages and
+is not related to a translation directly. It points to other shadow pages.
+
+A leaf spte corresponds to either one or two translations encoded into
+one paging structure entry. These are always the lowest level of the
+translation stack, with optional higher level translations left to NPT/EPT.
+Leaf ptes point at guest pages.
+
+The following table shows translations encoded by leaf ptes, with higher-level
+translations in parentheses:
+
+ Non-nested guests::
+
+ nonpaging: gpa->hpa
+ paging: gva->gpa->hpa
+ paging, tdp: (gva->)gpa->hpa
+
+ Nested guests::
+
+ non-tdp: ngva->gpa->hpa (*)
+ tdp: (ngva->)ngpa->gpa->hpa
+
+ (*) the guest hypervisor will encode the ngva->gpa translation into its page
+ tables if npt is not present
+
+Shadow pages contain the following information:
+ role.level:
+ The level in the shadow paging hierarchy that this shadow page belongs to.
+ 1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
+ role.direct:
+ If set, leaf sptes reachable from this page are for a linear range.
+ Examples include real mode translation, large guest pages backed by small
+ host pages, and gpa->hpa translations when NPT or EPT is active.
+ The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
+ by role.level (2MB for first level, 1GB for second level, 0.5TB for third
+ level, 256TB for fourth level)
+ If clear, this page corresponds to a guest page table denoted by the gfn
+ field.
+ role.quadrant:
+ When role.has_4_byte_gpte=1, the guest uses 32-bit gptes while the host uses 64-bit
+ sptes. That means a guest page table contains more ptes than the host,
+ so multiple shadow pages are needed to shadow one guest page.
+ For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
+ first or second 512-gpte block in the guest page table. For second-level
+ page tables, each 32-bit gpte is converted to two 64-bit sptes
+ (since each first-level guest page is shadowed by two first-level
+ shadow pages) so role.quadrant takes values in the range 0..3. Each
+ quadrant maps 1GB virtual address space.
+ role.access:
+ Inherited guest access permissions from the parent ptes in the form uwx.
+ Note execute permission is positive, not negative.
+ role.invalid:
+ The page is invalid and should not be used. It is a root page that is
+ currently pinned (by a cpu hardware register pointing to it); once it is
+ unpinned it will be destroyed.
+ role.has_4_byte_gpte:
+ Reflects the size of the guest PTE for which the page is valid, i.e. '0'
+ if direct map or 64-bit gptes are in use, '1' if 32-bit gptes are in use.
+ role.efer_nx:
+ Contains the value of efer.nx for which the page is valid.
+ role.cr0_wp:
+ Contains the value of cr0.wp for which the page is valid.
+ role.smep_andnot_wp:
+ Contains the value of cr4.smep && !cr0.wp for which the page is valid
+ (pages for which this is true are different from other pages; see the
+ treatment of cr0.wp=0 below).
+ role.smap_andnot_wp:
+ Contains the value of cr4.smap && !cr0.wp for which the page is valid
+ (pages for which this is true are different from other pages; see the
+ treatment of cr0.wp=0 below).
+ role.smm:
+ Is 1 if the page is valid in system management mode. This field
+ determines which of the kvm_memslots array was used to build this
+ shadow page; it is also used to go back from a struct kvm_mmu_page
+ to a memslot, through the kvm_memslots_for_spte_role macro and
+ __gfn_to_memslot.
+ role.ad_disabled:
+ Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D
+ bits before Haswell; shadow EPT page tables also cannot use A/D bits
+ if the L1 hypervisor does not enable them.
+ role.passthrough:
+ The page is not backed by a guest page table, but its first entry
+ points to one. This is set if NPT uses 5-level page tables (host
+ CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=1).
+ gfn:
+ Either the guest page table containing the translations shadowed by this
+ page, or the base page frame for linear translations. See role.direct.
+ spt:
+ A pageful of 64-bit sptes containing the translations for this page.
+ Accessed by both kvm and hardware.
+ The page pointed to by spt will have its page->private pointing back
+ at the shadow page structure.
+ sptes in spt point either at guest pages, or at lower-level shadow pages.
+ Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
+ at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte.
+ The spt array forms a DAG structure with the shadow page as a node, and
+ guest pages as leaves.
+ gfns:
+ An array of 512 guest frame numbers, one for each present pte. Used to
+ perform a reverse map from a pte to a gfn. When role.direct is set, any
+ element of this array can be calculated from the gfn field when used, in
+ this case, the array of gfns is not allocated. See role.direct and gfn.
+ root_count:
+ A counter keeping track of how many hardware registers (guest cr3 or
+ pdptrs) are now pointing at the page. While this counter is nonzero, the
+ page cannot be destroyed. See role.invalid.
+ parent_ptes:
+ The reverse mapping for the pte/ptes pointing at this page's spt. If
+ parent_ptes bit 0 is zero, only one spte points at this page and
+ parent_ptes points at this single spte, otherwise, there exists multiple
+ sptes pointing at this page and (parent_ptes & ~0x1) points at a data
+ structure with a list of parent sptes.
+ unsync:
+ If true, then the translations in this page may not match the guest's
+ translation. This is equivalent to the state of the tlb when a pte is
+ changed but before the tlb entry is flushed. Accordingly, unsync ptes
+ are synchronized when the guest executes invlpg or flushes its tlb by
+ other means. Valid for leaf pages.
+ unsync_children:
+ How many sptes in the page point at pages that are unsync (or have
+ unsynchronized children).
+ unsync_child_bitmap:
+ A bitmap indicating which sptes in spt point (directly or indirectly) at
+ pages that may be unsynchronized. Used to quickly locate all unsychronized
+ pages reachable from a given page.
+ clear_spte_count:
+ Only present on 32-bit hosts, where a 64-bit spte cannot be written
+ atomically. The reader uses this while running out of the MMU lock
+ to detect in-progress updates and retry them until the writer has
+ finished the write.
+ write_flooding_count:
+ A guest may write to a page table many times, causing a lot of
+ emulations if the page needs to be write-protected (see "Synchronized
+ and unsynchronized pages" below). Leaf pages can be unsynchronized
+ so that they do not trigger frequent emulation, but this is not
+ possible for non-leafs. This field counts the number of emulations
+ since the last time the page table was actually used; if emulation
+ is triggered too frequently on this page, KVM will unmap the page
+ to avoid emulation in the future.
+
+Reverse map
+===========
+
+The mmu maintains a reverse mapping whereby all ptes mapping a page can be
+reached given its gfn. This is used, for example, when swapping out a page.
+
+Synchronized and unsynchronized pages
+=====================================
+
+The guest uses two events to synchronize its tlb and page tables: tlb flushes
+and page invalidations (invlpg).
+
+A tlb flush means that we need to synchronize all sptes reachable from the
+guest's cr3. This is expensive, so we keep all guest page tables write
+protected, and synchronize sptes to gptes when a gpte is written.
+
+A special case is when a guest page table is reachable from the current
+guest cr3. In this case, the guest is obliged to issue an invlpg instruction
+before using the translation. We take advantage of that by removing write
+protection from the guest page, and allowing the guest to modify it freely.
+We synchronize modified gptes when the guest invokes invlpg. This reduces
+the amount of emulation we have to do when the guest modifies multiple gptes,
+or when the a guest page is no longer used as a page table and is used for
+random guest data.
+
+As a side effect we have to resynchronize all reachable unsynchronized shadow
+pages on a tlb flush.
+
+
+Reaction to events
+==================
+
+- guest page fault (or npt page fault, or ept violation)
+
+This is the most complicated event. The cause of a page fault can be:
+
+ - a true guest fault (the guest translation won't allow the access) (*)
+ - access to a missing translation
+ - access to a protected translation
+ - when logging dirty pages, memory is write protected
+ - synchronized shadow pages are write protected (*)
+ - access to untranslatable memory (mmio)
+
+ (*) not applicable in direct mode
+
+Handling a page fault is performed as follows:
+
+ - if the RSV bit of the error code is set, the page fault is caused by guest
+ accessing MMIO and cached MMIO information is available.
+
+ - walk shadow page table
+ - check for valid generation number in the spte (see "Fast invalidation of
+ MMIO sptes" below)
+ - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and
+ vcpu->arch.mmio_gfn, and call the emulator
+
+ - If both P bit and R/W bit of error code are set, this could possibly
+ be handled as a "fast page fault" (fixed without taking the MMU lock). See
+ the description in Documentation/virt/kvm/locking.rst.
+
+ - if needed, walk the guest page tables to determine the guest translation
+ (gva->gpa or ngpa->gpa)
+
+ - if permissions are insufficient, reflect the fault back to the guest
+
+ - determine the host page
+
+ - if this is an mmio request, there is no host page; cache the info to
+ vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn
+
+ - walk the shadow page table to find the spte for the translation,
+ instantiating missing intermediate page tables as necessary
+
+ - If this is an mmio request, cache the mmio info to the spte and set some
+ reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)
+
+ - try to unsynchronize the page
+
+ - if successful, we can let the guest continue and modify the gpte
+
+ - emulate the instruction
+
+ - if failed, unshadow the page and let the guest continue
+
+ - update any translations that were modified by the instruction
+
+invlpg handling:
+
+ - walk the shadow page hierarchy and drop affected translations
+ - try to reinstantiate the indicated translation in the hope that the
+ guest will use it in the near future
+
+Guest control register updates:
+
+- mov to cr3
+
+ - look up new shadow roots
+ - synchronize newly reachable shadow pages
+
+- mov to cr0/cr4/efer
+
+ - set up mmu context for new paging mode
+ - look up new shadow roots
+ - synchronize newly reachable shadow pages
+
+Host translation updates:
+
+ - mmu notifier called with updated hva
+ - look up affected sptes through reverse map
+ - drop (or update) translations
+
+Emulating cr0.wp
+================
+
+If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
+works for the guest kernel, not guest userspace. When the guest
+cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,
+we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
+semantics require allowing any guest kernel access plus user read access).
+
+We handle this by mapping the permissions to two possible sptes, depending
+on fault type:
+
+- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
+ disallows user access)
+- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
+ write access)
+
+(user write faults generate a #PF)
+
+In the first case there are two additional complications:
+
+- if CR4.SMEP is enabled: since we've turned the page into a kernel page,
+ the kernel may now execute it. We handle this by also setting spte.nx.
+ If we get a user fetch or read fault, we'll change spte.u=1 and
+ spte.nx=gpte.nx back. For this to work, KVM forces EFER.NX to 1 when
+ shadow paging is in use.
+- if CR4.SMAP is disabled: since the page has been changed to a kernel
+ page, it can not be reused when CR4.SMAP is enabled. We set
+ CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note,
+ here we do not care the case that CR4.SMAP is enabled since KVM will
+ directly inject #PF to guest due to failed permission check.
+
+To prevent an spte that was converted into a kernel page with cr0.wp=0
+from being written by the kernel after cr0.wp has changed to 1, we make
+the value of cr0.wp part of the page role. This means that an spte created
+with one value of cr0.wp cannot be used when cr0.wp has a different value -
+it will simply be missed by the shadow page lookup code. A similar issue
+exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after
+changing cr4.smep to 1. To avoid this, the value of !cr0.wp && cr4.smep
+is also made a part of the page role.
+
+Large pages
+===========
+
+The mmu supports all combinations of large and small guest and host pages.
+Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as
+two separate 2M pages, on both guest and host, since the mmu always uses PAE
+paging.
+
+To instantiate a large spte, four constraints must be satisfied:
+
+- the spte must point to a large host page
+- the guest pte must be a large pte of at least equivalent size (if tdp is
+ enabled, there is no guest pte and this condition is satisfied)
+- if the spte will be writeable, the large page frame may not overlap any
+ write-protected pages
+- the guest page must be wholly contained by a single memory slot
+
+To check the last two conditions, the mmu maintains a ->disallow_lpage set of
+arrays for each memory slot and large page size. Every write protected page
+causes its disallow_lpage to be incremented, thus preventing instantiation of
+a large spte. The frames at the end of an unaligned memory slot have
+artificially inflated ->disallow_lpages so they can never be instantiated.
+
+Fast invalidation of MMIO sptes
+===============================
+
+As mentioned in "Reaction to events" above, kvm will cache MMIO
+information in leaf sptes. When a new memslot is added or an existing
+memslot is changed, this information may become stale and needs to be
+invalidated. This also needs to hold the MMU lock while walking all
+shadow pages, and is made more scalable with a similar technique.
+
+MMIO sptes have a few spare bits, which are used to store a
+generation number. The global generation number is stored in
+kvm_memslots(kvm)->generation, and increased whenever guest memory info
+changes.
+
+When KVM finds an MMIO spte, it checks the generation number of the spte.
+If the generation number of the spte does not equal the global generation
+number, it will ignore the cached MMIO information and handle the page
+fault through the slow path.
+
+Since only 18 bits are used to store generation-number on mmio spte, all
+pages are zapped when there is an overflow.
+
+Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
+times, the last one happening when the generation number is retrieved and
+stored into the MMIO spte. Thus, the MMIO spte might be created based on
+out-of-date information, but with an up-to-date generation number.
+
+To avoid this, the generation number is incremented again after synchronize_srcu
+returns; thus, bit 63 of kvm_memslots(kvm)->generation set to 1 only during a
+memslot update, while some SRCU readers might be using the old copy. We do not
+want to use an MMIO sptes created with an odd generation number, and we can do
+this without losing a bit in the MMIO spte. The "update in-progress" bit of the
+generation is not stored in MMIO spte, and is so is implicitly zero when the
+generation is extracted out of the spte. If KVM is unlucky and creates an MMIO
+spte while an update is in-progress, the next access to the spte will always be
+a cache miss. For example, a subsequent access during the update window will
+miss due to the in-progress flag diverging, while an access after the update
+window closes will have a higher generation number (as compared to the spte).
+
+
+Further reading
+===============
+
+- NPT presentation from KVM Forum 2008
+ https://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf
diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
new file mode 100644
index 000000000..9315fc385
--- /dev/null
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -0,0 +1,391 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+KVM-specific MSRs
+=================
+
+:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
+
+KVM makes use of some custom MSRs to service some requests.
+
+Custom MSRs have a range reserved for them, that goes from
+0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
+but they are deprecated and their use is discouraged.
+
+Custom MSR list
+---------------
+
+The current supported Custom MSR list is:
+
+MSR_KVM_WALL_CLOCK_NEW:
+ 0x4b564d00
+
+data:
+ 4-byte alignment physical address of a memory area which must be
+ in guest RAM. This memory is expected to hold a copy of the following
+ structure::
+
+ struct pvclock_wall_clock {
+ u32 version;
+ u32 sec;
+ u32 nsec;
+ } __attribute__((__packed__));
+
+ whose data will be filled in by the hypervisor. The hypervisor is only
+ guaranteed to update this data at the moment of MSR write.
+ Users that want to reliably query this information more than once have
+ to write more than once to this MSR. Fields have the following meanings:
+
+ version:
+ guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ sec:
+ number of seconds for wallclock at time of boot.
+
+ nsec:
+ number of nanoseconds for wallclock at time of boot.
+
+ In order to get the current wallclock time, the system_time from
+ MSR_KVM_SYSTEM_TIME_NEW needs to be added.
+
+ Note that although MSRs are per-CPU entities, the effect of this
+ particular MSR is global.
+
+ Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
+ leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME_NEW:
+ 0x4b564d01
+
+data:
+ 4-byte aligned physical address of a memory area which must be in
+ guest RAM, plus an enable bit in bit 0. This memory is expected to hold
+ a copy of the following structure::
+
+ struct pvclock_vcpu_time_info {
+ u32 version;
+ u32 pad0;
+ u64 tsc_timestamp;
+ u64 system_time;
+ u32 tsc_to_system_mul;
+ s8 tsc_shift;
+ u8 flags;
+ u8 pad[2];
+ } __attribute__((__packed__)); /* 32 bytes */
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it.
+
+ Fields have the following meanings:
+
+ version:
+ guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ tsc_timestamp:
+ the tsc value at the current VCPU at the time
+ of the update of this structure. Guests can subtract this value
+ from current tsc to derive a notion of elapsed time since the
+ structure update.
+
+ system_time:
+ a host notion of monotonic time, including sleep
+ time at the time this structure was last updated. Unit is
+ nanoseconds.
+
+ tsc_to_system_mul:
+ multiplier to be used when converting
+ tsc-related quantity to nanoseconds
+
+ tsc_shift:
+ shift to be used when converting tsc-related
+ quantity to nanoseconds. This shift will ensure that
+ multiplication with tsc_to_system_mul does not overflow.
+ A positive value denotes a left shift, a negative value
+ a right shift.
+
+ The conversion from tsc to nanoseconds involves an additional
+ right shift by 32 bits. With this information, guests can
+ derive per-CPU time by doing::
+
+ time = (current_tsc - tsc_timestamp)
+ if (tsc_shift >= 0)
+ time <<= tsc_shift;
+ else
+ time >>= -tsc_shift;
+ time = (time * tsc_to_system_mul) >> 32
+ time = time + system_time
+
+ flags:
+ bits in this field indicate extended capabilities
+ coordinated between the guest and the hypervisor. Availability
+ of specific flags has to be checked in 0x40000001 cpuid leaf.
+ Current flags are:
+
+
+ +-----------+--------------+----------------------------------+
+ | flag bit | cpuid bit | meaning |
+ +-----------+--------------+----------------------------------+
+ | | | time measures taken across |
+ | 0 | 24 | multiple cpus are guaranteed to |
+ | | | be monotonic |
+ +-----------+--------------+----------------------------------+
+ | | | guest vcpu has been paused by |
+ | 1 | N/A | the host |
+ | | | See 4.70 in api.txt |
+ +-----------+--------------+----------------------------------+
+
+ Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
+ leaf prior to usage.
+
+
+MSR_KVM_WALL_CLOCK:
+ 0x11
+
+data and functioning:
+ same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
+
+ This MSR falls outside the reserved KVM range and may be removed in the
+ future. Its usage is deprecated.
+
+ Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
+ leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME:
+ 0x12
+
+data and functioning:
+ same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
+
+ This MSR falls outside the reserved KVM range and may be removed in the
+ future. Its usage is deprecated.
+
+ Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
+ leaf prior to usage.
+
+ The suggested algorithm for detecting kvmclock presence is then::
+
+ if (!kvm_para_available()) /* refer to cpuid.txt */
+ return NON_PRESENT;
+
+ flags = cpuid_eax(0x40000001);
+ if (flags & 3) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
+ return PRESENT;
+ } else if (flags & 0) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+ return PRESENT;
+ } else
+ return NON_PRESENT;
+
+MSR_KVM_ASYNC_PF_EN:
+ 0x4b564d02
+
+data:
+ Asynchronous page fault (APF) control MSR.
+
+ Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
+ which must be in guest RAM and must be zeroed. This memory is expected
+ to hold a copy of the following structure::
+
+ struct kvm_vcpu_pv_apf_data {
+ /* Used for 'page not present' events delivered via #PF */
+ __u32 flags;
+
+ /* Used for 'page ready' events delivered via interrupt notification */
+ __u32 token;
+
+ __u8 pad[56];
+ __u32 enabled;
+ };
+
+ Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
+ when asynchronous page faults are enabled on the vcpu, 0 when disabled.
+ Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
+ cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
+ #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
+ present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
+ events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
+ CPUID.
+
+ 'Page not present' events are currently always delivered as synthetic
+ #PF exception. During delivery of these events APF CR2 register contains
+ a token that will be used to notify the guest when missing page becomes
+ available. Also, to make it possible to distinguish between real #PF and
+ APF, first 4 bytes of 64 byte memory location ('flags') will be written
+ to by the hypervisor at the time of injection. Only first bit of 'flags'
+ is currently supported, when set, it indicates that the guest is dealing
+ with asynchronous 'page not present' event. If during a page fault APF
+ 'flags' is '0' it means that this is regular page fault. Guest is
+ supposed to clear 'flags' when it is done handling #PF exception so the
+ next event can be delivered.
+
+ Note, since APF 'page not present' events use the same exception vector
+ as regular page fault, guest must reset 'flags' to '0' before it does
+ something that can generate normal page fault.
+
+ Bytes 5-7 of 64 byte memory location ('token') will be written to by the
+ hypervisor at the time of APF 'page ready' event injection. The content
+ of these bytes is a token which was previously delivered as 'page not
+ present' event. The event indicates the page in now available. Guest is
+ supposed to write '0' to 'token' when it is done handling 'page ready'
+ event and to write 1' to MSR_KVM_ASYNC_PF_ACK after clearing the location;
+ writing to the MSR forces KVM to re-scan its queue and deliver the next
+ pending notification.
+
+ Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page
+ ready' APF delivery needs to be written to before enabling APF mechanism
+ in MSR_KVM_ASYNC_PF_EN or interrupt #0 can get injected. The MSR is
+ available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+ Note, previously, 'page ready' events were delivered via the same #PF
+ exception as 'page not present' events but this is now deprecated. If
+ bit 3 (interrupt based delivery) is not set APF events are not delivered.
+
+ If APF is disabled while there are outstanding APFs, they will
+ not be delivered.
+
+ Currently 'page ready' APF events will be always delivered on the
+ same vcpu as 'page not present' event was, but guest should not rely on
+ that.
+
+MSR_KVM_STEAL_TIME:
+ 0x4b564d03
+
+data:
+ 64-byte alignment physical address of a memory area which must be
+ in guest RAM, plus an enable bit in bit 0. This memory is expected to
+ hold a copy of the following structure::
+
+ struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u8 preempted;
+ __u8 u8_pad[3];
+ __u32 pad[11];
+ }
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it. Guest is required to make sure
+ this structure is initialized to zero.
+
+ Fields have the following meanings:
+
+ version:
+ a sequence counter. In other words, guest has to check
+ this field before and after grabbing time information and make
+ sure they are both equal and even. An odd version indicates an
+ in-progress update.
+
+ flags:
+ At this point, always zero. May be used to indicate
+ changes in this structure in the future.
+
+ steal:
+ the amount of time in which this vCPU did not run, in
+ nanoseconds. Time during which the vcpu is idle, will not be
+ reported as steal time.
+
+ preempted:
+ indicate the vCPU who owns this struct is running or
+ not. Non-zero values mean the vCPU has been preempted. Zero
+ means the vCPU is not preempted. NOTE, it is always zero if the
+ the hypervisor doesn't support this field.
+
+MSR_KVM_EOI_EN:
+ 0x4b564d04
+
+data:
+ Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
+ when disabled. Bit 1 is reserved and must be zero. When PV end of
+ interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
+ physical address of a 4 byte memory area which must be in guest RAM and
+ must be zeroed.
+
+ The first, least significant bit of 4 byte memory location will be
+ written to by the hypervisor, typically at the time of interrupt
+ injection. Value of 1 means that guest can skip writing EOI to the apic
+ (using MSR or MMIO write); instead, it is sufficient to signal
+ EOI by clearing the bit in guest memory - this location will
+ later be polled by the hypervisor.
+ Value of 0 means that the EOI write is required.
+
+ It is always safe for the guest to ignore the optimization and perform
+ the APIC EOI write anyway.
+
+ Hypervisor is guaranteed to only modify this least
+ significant bit while in the current VCPU context, this means that
+ guest does not need to use either lock prefix or memory ordering
+ primitives to synchronise with the hypervisor.
+
+ However, hypervisor can set and clear this memory bit at any time:
+ therefore to make sure hypervisor does not interrupt the
+ guest and clear the least significant bit in the memory area
+ in the window between guest testing it to detect
+ whether it can skip EOI apic write and between guest
+ clearing it to signal EOI to the hypervisor,
+ guest must both read the least significant bit in the memory area and
+ clear it using a single CPU instruction, such as test and clear, or
+ compare and exchange.
+
+MSR_KVM_POLL_CONTROL:
+ 0x4b564d05
+
+ Control host-side polling.
+
+data:
+ Bit 0 enables (1) or disables (0) host-side HLT polling logic.
+
+ KVM guests can request the host not to poll on HLT, for example if
+ they are performing polling themselves.
+
+MSR_KVM_ASYNC_PF_INT:
+ 0x4b564d06
+
+data:
+ Second asynchronous page fault (APF) control MSR.
+
+ Bits 0-7: APIC vector for delivery of 'page ready' APF events.
+ Bits 8-63: Reserved
+
+ Interrupt vector for asynchnonous 'page ready' notifications delivery.
+ The vector has to be set up before asynchronous page fault mechanism
+ is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if
+ KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_ASYNC_PF_ACK:
+ 0x4b564d07
+
+data:
+ Asynchronous page fault (APF) acknowledgment.
+
+ When the guest is done processing 'page ready' APF event and 'token'
+ field in 'struct kvm_vcpu_pv_apf_data' is cleared it is supposed to
+ write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
+ and check if there are more notifications pending. The MSR is available
+ if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_MIGRATION_CONTROL:
+ 0x4b564d08
+
+data:
+ This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
+ CPUID. Bit 0 represents whether live migration of the guest is allowed.
+
+ When a guest is started, bit 0 will be 0 if the guest has encrypted
+ memory and 1 if the guest does not have encrypted memory. If the
+ guest is communicating page encryption status to the host using the
+ ``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
+ allow live migration of the guest.
diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
new file mode 100644
index 000000000..ac2095d41
--- /dev/null
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -0,0 +1,244 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Nested VMX
+==========
+
+Overview
+---------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+ https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and its nested guest, which we
+call L2.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is enabled by default since Linux kernel v4.20. For
+older Linux kernel, it can be enabled by giving the "nested=1" option to the
+kvm-intel module.
+
+
+No modifications are required to user space (qemu). However, qemu's default
+emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
+explicitly enabled, by giving qemu one of the following options:
+
+ - cpu host (emulated CPU has all features of the real CPU)
+
+ - cpu qemu64,+vmx (add just the vmx feature to a named CPU type)
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for the a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested to know the
+internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c.
+
+The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
+also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS
+which L0 builds to actually run L2 - how this is done is explained in the
+aforementioned paper.
+
+For convenience, we repeat the content of struct vmcs12 here. If the internals
+of this structure changes, this can break live migration across KVM versions.
+VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
+struct shadow_vmcs is ever changed.
+
+::
+
+ typedef u64 natural_width;
+ struct __packed vmcs12 {
+ /* According to the Intel spec, a VMCS region must start with
+ * these two user-visible fields */
+ u32 revision_id;
+ u32 abort;
+
+ u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+ u32 padding[7]; /* room for future expansion */
+
+ u64 io_bitmap_a;
+ u64 io_bitmap_b;
+ u64 msr_bitmap;
+ u64 vm_exit_msr_store_addr;
+ u64 vm_exit_msr_load_addr;
+ u64 vm_entry_msr_load_addr;
+ u64 tsc_offset;
+ u64 virtual_apic_page_addr;
+ u64 apic_access_addr;
+ u64 ept_pointer;
+ u64 guest_physical_address;
+ u64 vmcs_link_pointer;
+ u64 guest_ia32_debugctl;
+ u64 guest_ia32_pat;
+ u64 guest_ia32_efer;
+ u64 guest_pdptr0;
+ u64 guest_pdptr1;
+ u64 guest_pdptr2;
+ u64 guest_pdptr3;
+ u64 host_ia32_pat;
+ u64 host_ia32_efer;
+ u64 padding64[8]; /* room for future expansion */
+ natural_width cr0_guest_host_mask;
+ natural_width cr4_guest_host_mask;
+ natural_width cr0_read_shadow;
+ natural_width cr4_read_shadow;
+ natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
+ natural_width exit_qualification;
+ natural_width guest_linear_address;
+ natural_width guest_cr0;
+ natural_width guest_cr3;
+ natural_width guest_cr4;
+ natural_width guest_es_base;
+ natural_width guest_cs_base;
+ natural_width guest_ss_base;
+ natural_width guest_ds_base;
+ natural_width guest_fs_base;
+ natural_width guest_gs_base;
+ natural_width guest_ldtr_base;
+ natural_width guest_tr_base;
+ natural_width guest_gdtr_base;
+ natural_width guest_idtr_base;
+ natural_width guest_dr7;
+ natural_width guest_rsp;
+ natural_width guest_rip;
+ natural_width guest_rflags;
+ natural_width guest_pending_dbg_exceptions;
+ natural_width guest_sysenter_esp;
+ natural_width guest_sysenter_eip;
+ natural_width host_cr0;
+ natural_width host_cr3;
+ natural_width host_cr4;
+ natural_width host_fs_base;
+ natural_width host_gs_base;
+ natural_width host_tr_base;
+ natural_width host_gdtr_base;
+ natural_width host_idtr_base;
+ natural_width host_ia32_sysenter_esp;
+ natural_width host_ia32_sysenter_eip;
+ natural_width host_rsp;
+ natural_width host_rip;
+ natural_width paddingl[8]; /* room for future expansion */
+ u32 pin_based_vm_exec_control;
+ u32 cpu_based_vm_exec_control;
+ u32 exception_bitmap;
+ u32 page_fault_error_code_mask;
+ u32 page_fault_error_code_match;
+ u32 cr3_target_count;
+ u32 vm_exit_controls;
+ u32 vm_exit_msr_store_count;
+ u32 vm_exit_msr_load_count;
+ u32 vm_entry_controls;
+ u32 vm_entry_msr_load_count;
+ u32 vm_entry_intr_info_field;
+ u32 vm_entry_exception_error_code;
+ u32 vm_entry_instruction_len;
+ u32 tpr_threshold;
+ u32 secondary_vm_exec_control;
+ u32 vm_instruction_error;
+ u32 vm_exit_reason;
+ u32 vm_exit_intr_info;
+ u32 vm_exit_intr_error_code;
+ u32 idt_vectoring_info_field;
+ u32 idt_vectoring_error_code;
+ u32 vm_exit_instruction_len;
+ u32 vmx_instruction_info;
+ u32 guest_es_limit;
+ u32 guest_cs_limit;
+ u32 guest_ss_limit;
+ u32 guest_ds_limit;
+ u32 guest_fs_limit;
+ u32 guest_gs_limit;
+ u32 guest_ldtr_limit;
+ u32 guest_tr_limit;
+ u32 guest_gdtr_limit;
+ u32 guest_idtr_limit;
+ u32 guest_es_ar_bytes;
+ u32 guest_cs_ar_bytes;
+ u32 guest_ss_ar_bytes;
+ u32 guest_ds_ar_bytes;
+ u32 guest_fs_ar_bytes;
+ u32 guest_gs_ar_bytes;
+ u32 guest_ldtr_ar_bytes;
+ u32 guest_tr_ar_bytes;
+ u32 guest_interruptibility_info;
+ u32 guest_activity_state;
+ u32 guest_sysenter_cs;
+ u32 host_ia32_sysenter_cs;
+ u32 padding32[8]; /* room for future expansion */
+ u16 virtual_processor_id;
+ u16 guest_es_selector;
+ u16 guest_cs_selector;
+ u16 guest_ss_selector;
+ u16 guest_ds_selector;
+ u16 guest_fs_selector;
+ u16 guest_gs_selector;
+ u16 guest_ldtr_selector;
+ u16 guest_tr_selector;
+ u16 host_es_selector;
+ u16 host_cs_selector;
+ u16 host_ss_selector;
+ u16 host_ds_selector;
+ u16 host_fs_selector;
+ u16 host_gs_selector;
+ u16 host_tr_selector;
+ };
+
+
+Authors
+-------
+
+These patches were written by:
+ - Abel Gordon, abelg <at> il.ibm.com
+ - Nadav Har'El, nyh <at> il.ibm.com
+ - Orit Wasserman, oritw <at> il.ibm.com
+ - Ben-Ami Yassor, benami <at> il.ibm.com
+ - Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+ - Anthony Liguori, aliguori <at> us.ibm.com
+ - Mike Day, mdday <at> us.ibm.com
+ - Michael Factor, factor <at> il.ibm.com
+ - Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+ - Avi Kivity, avi <at> redhat.com
+ - Gleb Natapov, gleb <at> redhat.com
+ - Marcelo Tosatti, mtosatti <at> redhat.com
+ - Kevin Tian, kevin.tian <at> intel.com
+ - and others.
diff --git a/Documentation/virt/kvm/x86/running-nested-guests.rst b/Documentation/virt/kvm/x86/running-nested-guests.rst
new file mode 100644
index 000000000..a27e6768d
--- /dev/null
+++ b/Documentation/virt/kvm/x86/running-nested-guests.rst
@@ -0,0 +1,278 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Running nested guests with KVM
+==============================
+
+A nested guest is the ability to run a guest inside another guest (it
+can be KVM-based or a different hypervisor). The straightforward
+example is a KVM guest that in turn runs on a KVM guest (the rest of
+this document is built on this example)::
+
+ .----------------. .----------------.
+ | | | |
+ | L2 | | L2 |
+ | (Nested Guest) | | (Nested Guest) |
+ | | | |
+ |----------------'--'----------------|
+ | |
+ | L1 (Guest Hypervisor) |
+ | KVM (/dev/kvm) |
+ | |
+ .------------------------------------------------------.
+ | L0 (Host Hypervisor) |
+ | KVM (/dev/kvm) |
+ |------------------------------------------------------|
+ | Hardware (with virtualization extensions) |
+ '------------------------------------------------------'
+
+Terminology:
+
+- L0 – level-0; the bare metal host, running KVM
+
+- L1 – level-1 guest; a VM running on L0; also called the "guest
+ hypervisor", as it itself is capable of running KVM.
+
+- L2 – level-2 guest; a VM running on L1, this is the "nested guest"
+
+.. note:: The above diagram is modelled after the x86 architecture;
+ s390x, ppc64 and other architectures are likely to have
+ a different design for nesting.
+
+ For example, s390x always has an LPAR (LogicalPARtition)
+ hypervisor running on bare metal, adding another layer and
+ resulting in at least four levels in a nested setup — L0 (bare
+ metal, running the LPAR hypervisor), L1 (host hypervisor), L2
+ (guest hypervisor), L3 (nested guest).
+
+ This document will stick with the three-level terminology (L0,
+ L1, and L2) for all architectures; and will largely focus on
+ x86.
+
+
+Use Cases
+---------
+
+There are several scenarios where nested KVM can be useful, to name a
+few:
+
+- As a developer, you want to test your software on different operating
+ systems (OSes). Instead of renting multiple VMs from a Cloud
+ Provider, using nested KVM lets you rent a large enough "guest
+ hypervisor" (level-1 guest). This in turn allows you to create
+ multiple nested guests (level-2 guests), running different OSes, on
+ which you can develop and test your software.
+
+- Live migration of "guest hypervisors" and their nested guests, for
+ load balancing, disaster recovery, etc.
+
+- VM image creation tools (e.g. ``virt-install``, etc) often run
+ their own VM, and users expect these to work inside a VM.
+
+- Some OSes use virtualization internally for security (e.g. to let
+ applications run safely in isolation).
+
+
+Enabling "nested" (x86)
+-----------------------
+
+From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
+by default for Intel and AMD. (Though your Linux distribution might
+override this default.)
+
+In case you are running a Linux kernel older than v4.19, to enable
+nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To
+persist this setting across reboots, you can add it in a config file, as
+shown below:
+
+1. On the bare metal host (L0), list the kernel modules and ensure that
+ the KVM modules::
+
+ $ lsmod | grep -i kvm
+ kvm_intel 133627 0
+ kvm 435079 1 kvm_intel
+
+2. Show information for ``kvm_intel`` module::
+
+ $ modinfo kvm_intel | grep -i nested
+ parm: nested:bool
+
+3. For the nested KVM configuration to persist across reboots, place the
+ below in ``/etc/modprobed/kvm_intel.conf`` (create the file if it
+ doesn't exist)::
+
+ $ cat /etc/modprobe.d/kvm_intel.conf
+ options kvm-intel nested=y
+
+4. Unload and re-load the KVM Intel module::
+
+ $ sudo rmmod kvm-intel
+ $ sudo modprobe kvm-intel
+
+5. Verify if the ``nested`` parameter for KVM is enabled::
+
+ $ cat /sys/module/kvm_intel/parameters/nested
+ Y
+
+For AMD hosts, the process is the same as above, except that the module
+name is ``kvm-amd``.
+
+
+Additional nested-related kernel parameters (x86)
+-------------------------------------------------
+
+If your hardware is sufficiently advanced (Intel Haswell processor or
+higher, which has newer hardware virt extensions), the following
+additional features will also be enabled by default: "Shadow VMCS
+(Virtual Machine Control Structure)", APIC Virtualization on your bare
+metal host (L0). Parameters for Intel hosts::
+
+ $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
+ Y
+
+ $ cat /sys/module/kvm_intel/parameters/enable_apicv
+ Y
+
+ $ cat /sys/module/kvm_intel/parameters/ept
+ Y
+
+.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
+ ensure the above are enabled (particularly
+ ``enable_shadow_vmcs`` and ``ept``).
+
+
+Starting a nested guest (x86)
+-----------------------------
+
+Once your bare metal host (L0) is configured for nesting, you should be
+able to start an L1 guest with::
+
+ $ qemu-kvm -cpu host [...]
+
+The above will pass through the host CPU's capabilities as-is to the
+gues); or for better live migration compatibility, use a named CPU
+model supported by QEMU. e.g.::
+
+ $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on
+
+then the guest hypervisor will subsequently be capable of running a
+nested guest with accelerated KVM.
+
+
+Enabling "nested" (s390x)
+-------------------------
+
+1. On the host hypervisor (L0), enable the ``nested`` parameter on
+ s390x::
+
+ $ rmmod kvm
+ $ modprobe kvm nested=1
+
+.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
+ with the ``nested`` paramter — i.e. to be able to enable
+ ``nested``, the ``hpage`` parameter *must* be disabled.
+
+2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
+ feature — with QEMU, this can be done by using "host passthrough"
+ (via the command-line ``-cpu host``).
+
+3. Now the KVM module can be loaded in the L1 (guest hypervisor)::
+
+ $ modprobe kvm
+
+
+Live migration with nested KVM
+------------------------------
+
+Migrating an L1 guest, with a *live* nested guest in it, to another
+bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
+Intel x86 systems, and even on older versions for s390x.
+
+On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
+should no longer be migrated or saved (refer to QEMU documentation on
+"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
+or save-and-load an L1 guest while an L2 guest is running will result in
+undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
+kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
+guest can no longer be considered stable or secure, and must be restarted.
+Migrating an L1 guest merely configured to support nesting, while not
+actually running L2 guests, is expected to function normally even on AMD
+systems but may fail once guests are started.
+
+Migrating an L2 guest is always expected to succeed, so all the following
+scenarios should work even on AMD systems:
+
+- Migrating a nested guest (L2) to another L1 guest on the *same* bare
+ metal host.
+
+- Migrating a nested guest (L2) to another L1 guest on a *different*
+ bare metal host.
+
+- Migrating a nested guest (L2) to a bare metal host.
+
+Reporting bugs from nested setups
+-----------------------------------
+
+Debugging "nested" problems can involve sifting through log files across
+L0, L1 and L2; this can result in tedious back-n-forth between the bug
+reporter and the bug fixer.
+
+- Mention that you are in a "nested" setup. If you are running any kind
+ of "nesting" at all, say so. Unfortunately, this needs to be called
+ out because when reporting bugs, people tend to forget to even
+ *mention* that they're using nested virtualization.
+
+- Ensure you are actually running KVM on KVM. Sometimes people do not
+ have KVM enabled for their guest hypervisor (L1), which results in
+ them running with pure emulation or what QEMU calls it as "TCG", but
+ they think they're running nested KVM. Thus confusing "nested Virt"
+ (which could also mean, QEMU on KVM) with "nested KVM" (KVM on KVM).
+
+Information to collect (generic)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following is not an exhaustive list, but a very good starting point:
+
+ - Kernel, libvirt, and QEMU version from L0
+
+ - Kernel, libvirt and QEMU version from L1
+
+ - QEMU command-line of L1 -- when using libvirt, you'll find it here:
+ ``/var/log/libvirt/qemu/instance.log``
+
+ - QEMU command-line of L2 -- as above, when using libvirt, get the
+ complete libvirt-generated QEMU command-line
+
+ - ``cat /sys/cpuinfo`` from L0
+
+ - ``cat /sys/cpuinfo`` from L1
+
+ - ``lscpu`` from L0
+
+ - ``lscpu`` from L1
+
+ - Full ``dmesg`` output from L0
+
+ - Full ``dmesg`` output from L1
+
+x86-specific info to collect
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Both the below commands, ``x86info`` and ``dmidecode``, should be
+available on most Linux distributions with the same name:
+
+ - Output of: ``x86info -a`` from L0
+
+ - Output of: ``x86info -a`` from L1
+
+ - Output of: ``dmidecode`` from L0
+
+ - Output of: ``dmidecode`` from L1
+
+s390x-specific info to collect
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Along with the earlier mentioned generic details, the below is
+also recommended:
+
+ - ``/proc/sysinfo`` from L1; this will also include the info from L0
diff --git a/Documentation/virt/kvm/x86/timekeeping.rst b/Documentation/virt/kvm/x86/timekeeping.rst
new file mode 100644
index 000000000..21ae7efa2
--- /dev/null
+++ b/Documentation/virt/kvm/x86/timekeeping.rst
@@ -0,0 +1,645 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Timekeeping Virtualization for X86-Based Architectures
+======================================================
+
+:Author: Zachary Amsden <zamsden@redhat.com>
+:Copyright: (c) 2010, Red Hat. All rights reserved.
+
+.. Contents
+
+ 1) Overview
+ 2) Timing Devices
+ 3) TSC Hardware
+ 4) Virtualization Problems
+
+1. Overview
+===========
+
+One of the most complicated parts of the X86 platform, and specifically,
+the virtualization of this platform is the plethora of timing devices available
+and the complexity of emulating those devices. In addition, virtualization of
+time introduces a new set of challenges because it introduces a multiplexed
+division of time beyond the control of the guest CPU.
+
+First, we will describe the various timekeeping hardware available, then
+present some of the problems which arise and solutions available, giving
+specific recommendations for certain classes of KVM guests.
+
+The purpose of this document is to collect data and information relevant to
+timekeeping which may be difficult to find elsewhere, specifically,
+information relevant to KVM and hardware-based virtualization.
+
+2. Timing Devices
+=================
+
+First we discuss the basic hardware devices available. TSC and the related
+KVM clock are special enough to warrant a full exposition and are described in
+the following section.
+
+2.1. i8254 - PIT
+----------------
+
+One of the first timer devices available is the programmable interrupt timer,
+or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
+channels which can be programmed to deliver periodic or one-shot interrupts.
+These three channels can be configured in different modes and have individual
+counters. Channel 1 and 2 were not available for general use in the original
+IBM PC, and historically were connected to control RAM refresh and the PC
+speaker. Now the PIT is typically integrated as part of an emulated chipset
+and a separate physical PIT is not used.
+
+The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
+using single or multiple byte access to the I/O ports. There are 6 modes
+available, but not all modes are available to all timers, as only timer 2
+has a connected gate input, required for modes 1 and 5. The gate line is
+controlled by port 61h, bit 0, as illustrated in the following diagram::
+
+ -------------- ----------------
+ | | | |
+ | 1.1932 MHz|---------->| CLOCK OUT | ---------> IRQ 0
+ | Clock | | | |
+ -------------- | +->| GATE TIMER 0 |
+ | ----------------
+ |
+ | ----------------
+ | | |
+ |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM
+ | | | (aka /dev/null)
+ | +->| GATE TIMER 1 |
+ | ----------------
+ |
+ | ----------------
+ | | |
+ |------>| CLOCK OUT | ---------> Port 61h, bit 5
+ | | |
+ Port 61h, bit 0 -------->| GATE TIMER 2 | \_.---- ____
+ ---------------- _| )--|LPF|---Speaker
+ / *---- \___/
+ Port 61h, bit 1 ---------------------------------/
+
+The timer modes are now described.
+
+Mode 0: Single Timeout.
+ This is a one-shot software timeout that counts down
+ when the gate is high (always true for timers 0 and 1). When the count
+ reaches zero, the output goes high.
+
+Mode 1: Triggered One-shot.
+ The output is initially set high. When the gate
+ line is set high, a countdown is initiated (which does not stop if the gate is
+ lowered), during which the output is set low. When the count reaches zero,
+ the output goes high.
+
+Mode 2: Rate Generator.
+ The output is initially set high. When the countdown
+ reaches 1, the output goes low for one count and then returns high. The value
+ is reloaded and the countdown automatically resumes. If the gate line goes
+ low, the count is halted. If the output is low when the gate is lowered, the
+ output automatically goes high (this only affects timer 2).
+
+Mode 3: Square Wave.
+ This generates a high / low square wave. The count
+ determines the length of the pulse, which alternates between high and low
+ when zero is reached. The count only proceeds when gate is high and is
+ automatically reloaded on reaching zero. The count is decremented twice at
+ each clock to generate a full high / low cycle at the full periodic rate.
+ If the count is even, the clock remains high for N/2 counts and low for N/2
+ counts; if the clock is odd, the clock is high for (N+1)/2 counts and low
+ for (N-1)/2 counts. Only even values are latched by the counter, so odd
+ values are not observed when reading. This is the intended mode for timer 2,
+ which generates sine-like tones by low-pass filtering the square wave output.
+
+Mode 4: Software Strobe.
+ After programming this mode and loading the counter,
+ the output remains high until the counter reaches zero. Then the output
+ goes low for 1 clock cycle and returns high. The counter is not reloaded.
+ Counting only occurs when gate is high.
+
+Mode 5: Hardware Strobe.
+ After programming and loading the counter, the
+ output remains high. When the gate is raised, a countdown is initiated
+ (which does not stop if the gate is lowered). When the counter reaches zero,
+ the output goes low for 1 clock cycle and then returns high. The counter is
+ not reloaded.
+
+In addition to normal binary counting, the PIT supports BCD counting. The
+command port, 0x43 is used to set the counter and mode for each of the three
+timers.
+
+PIT commands, issued to port 0x43, using the following bit encoding::
+
+ Bit 7-4: Command (See table below)
+ Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
+ Bit 0 : Binary (0) / BCD (1)
+
+Command table::
+
+ 0000 - Latch Timer 0 count for port 0x40
+ sample and hold the count to be read in port 0x40;
+ additional commands ignored until counter is read;
+ mode bits ignored.
+
+ 0001 - Set Timer 0 LSB mode for port 0x40
+ set timer to read LSB only and force MSB to zero;
+ mode bits set timer mode
+
+ 0010 - Set Timer 0 MSB mode for port 0x40
+ set timer to read MSB only and force LSB to zero;
+ mode bits set timer mode
+
+ 0011 - Set Timer 0 16-bit mode for port 0x40
+ set timer to read / write LSB first, then MSB;
+ mode bits set timer mode
+
+ 0100 - Latch Timer 1 count for port 0x41 - as described above
+ 0101 - Set Timer 1 LSB mode for port 0x41 - as described above
+ 0110 - Set Timer 1 MSB mode for port 0x41 - as described above
+ 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
+
+ 1000 - Latch Timer 2 count for port 0x42 - as described above
+ 1001 - Set Timer 2 LSB mode for port 0x42 - as described above
+ 1010 - Set Timer 2 MSB mode for port 0x42 - as described above
+ 1011 - Set Timer 2 16-bit mode for port 0x42 as described above
+
+ 1101 - General counter latch
+ Latch combination of counters into corresponding ports
+ Bit 3 = Counter 2
+ Bit 2 = Counter 1
+ Bit 1 = Counter 0
+ Bit 0 = Unused
+
+ 1110 - Latch timer status
+ Latch combination of counter mode into corresponding ports
+ Bit 3 = Counter 2
+ Bit 2 = Counter 1
+ Bit 1 = Counter 0
+
+ The output of ports 0x40-0x42 following this command will be:
+
+ Bit 7 = Output pin
+ Bit 6 = Count loaded (0 if timer has expired)
+ Bit 5-4 = Read / Write mode
+ 01 = MSB only
+ 10 = LSB only
+ 11 = LSB / MSB (16-bit)
+ Bit 3-1 = Mode
+ Bit 0 = Binary (0) / BCD mode (1)
+
+2.2. RTC
+--------
+
+The second device which was available in the original PC was the MC146818 real
+time clock. The original device is now obsolete, and usually emulated by the
+system chipset, sometimes by an HPET and some frankenstein IRQ routing.
+
+The RTC is accessed through CMOS variables, which uses an index register to
+control which bytes are read. Since there is only one index register, read
+of the CMOS and read of the RTC require lock protection (in addition, it is
+dangerous to allow userspace utilities such as hwclock to have direct RTC
+access, as they could corrupt kernel reads and writes of CMOS memory).
+
+The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
+can function as a periodic timer, an additional once a day alarm, and can issue
+interrupts after an update of the CMOS registers by the MC146818 is complete.
+The type of interrupt is signalled in the RTC status registers.
+
+The RTC will update the current time fields by battery power even while the
+system is off. The current time fields should not be read while an update is
+in progress, as indicated in the status register.
+
+The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
+programmed to a 32kHz divider if the RTC is to count seconds.
+
+This is the RAM map originally used for the RTC/CMOS::
+
+ Location Size Description
+ ------------------------------------------
+ 00h byte Current second (BCD)
+ 01h byte Seconds alarm (BCD)
+ 02h byte Current minute (BCD)
+ 03h byte Minutes alarm (BCD)
+ 04h byte Current hour (BCD)
+ 05h byte Hours alarm (BCD)
+ 06h byte Current day of week (BCD)
+ 07h byte Current day of month (BCD)
+ 08h byte Current month (BCD)
+ 09h byte Current year (BCD)
+ 0Ah byte Register A
+ bit 7 = Update in progress
+ bit 6-4 = Divider for clock
+ 000 = 4.194 MHz
+ 001 = 1.049 MHz
+ 010 = 32 kHz
+ 10X = test modes
+ 110 = reset / disable
+ 111 = reset / disable
+ bit 3-0 = Rate selection for periodic interrupt
+ 000 = periodic timer disabled
+ 001 = 3.90625 uS
+ 010 = 7.8125 uS
+ 011 = .122070 mS
+ 100 = .244141 mS
+ ...
+ 1101 = 125 mS
+ 1110 = 250 mS
+ 1111 = 500 mS
+ 0Bh byte Register B
+ bit 7 = Run (0) / Halt (1)
+ bit 6 = Periodic interrupt enable
+ bit 5 = Alarm interrupt enable
+ bit 4 = Update-ended interrupt enable
+ bit 3 = Square wave interrupt enable
+ bit 2 = BCD calendar (0) / Binary (1)
+ bit 1 = 12-hour mode (0) / 24-hour mode (1)
+ bit 0 = 0 (DST off) / 1 (DST enabled)
+ OCh byte Register C (read only)
+ bit 7 = interrupt request flag (IRQF)
+ bit 6 = periodic interrupt flag (PF)
+ bit 5 = alarm interrupt flag (AF)
+ bit 4 = update interrupt flag (UF)
+ bit 3-0 = reserved
+ ODh byte Register D (read only)
+ bit 7 = RTC has power
+ bit 6-0 = reserved
+ 32h byte Current century BCD (*)
+ (*) location vendor specific and now determined from ACPI global tables
+
+2.3. APIC
+---------
+
+On Pentium and later processors, an on-board timer is available to each CPU
+as part of the Advanced Programmable Interrupt Controller. The APIC is
+accessed through memory-mapped registers and provides interrupt service to each
+CPU, used for IPIs and local timer interrupts.
+
+Although in theory the APIC is a safe and stable source for local interrupts,
+in practice, many bugs and glitches have occurred due to the special nature of
+the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect
+the use of the APIC and that workarounds may be required. In addition, some of
+these workarounds pose unique constraints for virtualization - requiring either
+extra overhead incurred from extra reads of memory-mapped I/O or additional
+functionality that may be more computationally expensive to implement.
+
+Since the APIC is documented quite well in the Intel and AMD manuals, we will
+avoid repetition of the detail here. It should be pointed out that the APIC
+timer is programmed through the LVT (local vector timer) register, is capable
+of one-shot or periodic operation, and is based on the bus clock divided down
+by the programmable divider register.
+
+2.4. HPET
+---------
+
+HPET is quite complex, and was originally intended to replace the PIT / RTC
+support of the X86 PC. It remains to be seen whether that will be the case, as
+the de facto standard of PC hardware is to emulate these older devices. Some
+systems designated as legacy free may support only the HPET as a hardware timer
+device.
+
+The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
+but allowing implementation freedom to support many more. It also imposes no
+fixed rate on the timer frequency, but does impose some extremal values on
+frequency, error and slew.
+
+In general, the HPET is recommended as a high precision (compared to PIT /RTC)
+time source which is independent of local variation (as there is only one HPET
+in any given system). The HPET is also memory-mapped, and its presence is
+indicated through ACPI tables by the BIOS.
+
+Detailed specification of the HPET is beyond the current scope of this
+document, as it is also very well documented elsewhere.
+
+2.5. Offboard Timers
+--------------------
+
+Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
+timing chips built into the cards which may have registers which are accessible
+to kernel or user drivers. To the author's knowledge, using these to generate
+a clocksource for a Linux or other kernel has not yet been attempted and is in
+general frowned upon as not playing by the agreed rules of the game. Such a
+timer device would require additional support to be virtualized properly and is
+not considered important at this time as no known operating system does this.
+
+3. TSC Hardware
+===============
+
+The TSC or time stamp counter is relatively simple in theory; it counts
+instruction cycles issued by the processor, which can be used as a measure of
+time. In practice, due to a number of problems, it is the most complicated
+timekeeping device to use.
+
+The TSC is represented internally as a 64-bit MSR which can be read with the
+RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
+limitations made it possible to write the TSC, but generally on old hardware it
+was only possible to write the low 32-bits of the 64-bit counter, and the upper
+32-bits of the counter were cleared. Now, however, on Intel processors family
+0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
+has been lifted and all 64-bits are writable. On AMD systems, the ability to
+write the TSC MSR is not an architectural guarantee.
+
+The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
+means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.
+
+Some vendors have implemented an additional instruction, RDTSCP, which returns
+atomically not just the TSC, but an indicator which corresponds to the
+processor number. This can be used to index into an array of TSC variables to
+determine offset information in SMP systems where TSCs are not synchronized.
+The presence of this instruction must be determined by consulting CPUID feature
+bits.
+
+Both VMX and SVM provide extension fields in the virtualization hardware which
+allows the guest visible TSC to be offset by a constant. Newer implementations
+promise to allow the TSC to additionally be scaled, but this hardware is not
+yet widely available.
+
+3.1. TSC synchronization
+------------------------
+
+The TSC is a CPU-local clock in most implementations. This means, on SMP
+platforms, the TSCs of different CPUs may start at different times depending
+on when the CPUs are powered on. Generally, CPUs on the same die will share
+the same clock, however, this is not always the case.
+
+The BIOS may attempt to resynchronize the TSCs during the poweron process and
+the operating system or other system software may attempt to do this as well.
+Several hardware limitations make the problem worse - if it is not possible to
+write the full 64-bits of the TSC, it may be impossible to match the TSC in
+newly arriving CPUs to that of the rest of the system, resulting in
+unsynchronized TSCs. This may be done by BIOS or system software, but in
+practice, getting a perfectly synchronized TSC will not be possible unless all
+values are read from the same clock, which generally only is possible on single
+socket systems or those with special hardware support.
+
+3.2. TSC and CPU hotplug
+------------------------
+
+As touched on already, CPUs which arrive later than the boot time of the system
+may not have a TSC value that is synchronized with the rest of the system.
+Either system software, BIOS, or SMM code may actually try to establish the TSC
+to a value matching the rest of the system, but a perfect match is usually not
+a guarantee. This can have the effect of bringing a system from a state where
+TSC is synchronized back to a state where TSC synchronization flaws, however
+small, may be exposed to the OS and any virtualization environment.
+
+3.3. TSC and multi-socket / NUMA
+--------------------------------
+
+Multi-socket systems, especially large multi-socket systems are likely to have
+individual clocksources rather than a single, universally distributed clock.
+Since these clocks are driven by different crystals, they will not have
+perfectly matched frequency, and temperature and electrical variations will
+cause the CPU clocks, and thus the TSCs to drift over time. Depending on the
+exact clock and bus design, the drift may or may not be fixed in absolute
+error, and may accumulate over time.
+
+In addition, very large systems may deliberately slew the clocks of individual
+cores. This technique, known as spread-spectrum clocking, reduces EMI at the
+clock frequency and harmonics of it, which may be required to pass FCC
+standards for telecommunications and computer equipment.
+
+It is recommended not to trust the TSCs to remain synchronized on NUMA or
+multiple socket systems for these reasons.
+
+3.4. TSC and C-states
+---------------------
+
+C-states, or idling states of the processor, especially C1E and deeper sleep
+states may be problematic for TSC as well. The TSC may stop advancing in such
+a state, resulting in a TSC which is behind that of other CPUs when execution
+is resumed. Such CPUs must be detected and flagged by the operating system
+based on CPU and chipset identifications.
+
+The TSC in such a case may be corrected by catching it up to a known external
+clocksource.
+
+3.5. TSC frequency change / P-states
+------------------------------------
+
+To make things slightly more interesting, some CPUs may change frequency. They
+may or may not run the TSC at the same rate, and because the frequency change
+may be staggered or slewed, at some points in time, the TSC rate may not be
+known other than falling within a range of values. In this case, the TSC will
+not be a stable time source, and must be calibrated against a known, stable,
+external clock to be a usable source of time.
+
+Whether the TSC runs at a constant rate or scales with the P-state is model
+dependent and must be determined by inspecting CPUID, chipset or vendor
+specific MSR fields.
+
+In addition, some vendors have known bugs where the P-state is actually
+compensated for properly during normal operation, but when the processor is
+inactive, the P-state may be raised temporarily to service cache misses from
+other processors. In such cases, the TSC on halted CPUs could advance faster
+than that of non-halted processors. AMD Turion processors are known to have
+this problem.
+
+3.6. TSC and STPCLK / T-states
+------------------------------
+
+External signals given to the processor may also have the effect of stopping
+the TSC. This is typically done for thermal emergency power control to prevent
+an overheating condition, and typically, there is no way to detect that this
+condition has happened.
+
+3.7. TSC virtualization - VMX
+-----------------------------
+
+VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner. In
+addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
+field specified in the VMCS. Special instructions must be used to read and
+write the VMCS field.
+
+3.8. TSC virtualization - SVM
+-----------------------------
+
+SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner. In
+addition, SVM allows passing through the host TSC plus an additional offset
+field specified in the SVM control block.
+
+3.9. TSC feature bits in Linux
+------------------------------
+
+In summary, there is no way to guarantee the TSC remains in perfect
+synchronization unless it is explicitly guaranteed by the architecture. Even
+if so, the TSCs in multi-sockets or NUMA systems may still run independently
+despite being locally consistent.
+
+The following feature bits are used by Linux to signal various TSC attributes,
+but they can only be taken to be meaningful for UP or single node systems.
+
+========================= =======================================
+X86_FEATURE_TSC The TSC is available in hardware
+X86_FEATURE_RDTSCP The RDTSCP instruction is available
+X86_FEATURE_CONSTANT_TSC The TSC rate is unchanged with P-states
+X86_FEATURE_NONSTOP_TSC The TSC does not stop in C-states
+X86_FEATURE_TSC_RELIABLE TSC sync checks are skipped (VMware)
+========================= =======================================
+
+4. Virtualization Problems
+==========================
+
+Timekeeping is especially problematic for virtualization because a number of
+challenges arise. The most obvious problem is that time is now shared between
+the host and, potentially, a number of virtual machines. Thus the virtual
+operating system does not run with 100% usage of the CPU, despite the fact that
+it may very well make that assumption. It may expect it to remain true to very
+exacting bounds when interrupt sources are disabled, but in reality only its
+virtual interrupt sources are disabled, and the machine may still be preempted
+at any time. This causes problems as the passage of real time, the injection
+of machine interrupts and the associated clock sources are no longer completely
+synchronized with real time.
+
+This same problem can occur on native hardware to a degree, as SMM mode may
+steal cycles from the naturally on X86 systems when SMM mode is used by the
+BIOS, but not in such an extreme fashion. However, the fact that SMM mode may
+cause similar problems to virtualization makes it a good justification for
+solving many of these problems on bare metal.
+
+4.1. Interrupt clocking
+-----------------------
+
+One of the most immediate problems that occurs with legacy operating systems
+is that the system timekeeping routines are often designed to keep track of
+time by counting periodic interrupts. These interrupts may come from the PIT
+or the RTC, but the problem is the same: the host virtualization engine may not
+be able to deliver the proper number of interrupts per second, and so guest
+time may fall behind. This is especially problematic if a high interrupt rate
+is selected, such as 1000 HZ, which is unfortunately the default for many Linux
+guests.
+
+There are three approaches to solving this problem; first, it may be possible
+to simply ignore it. Guests which have a separate time source for tracking
+'wall clock' or 'real time' may not need any adjustment of their interrupts to
+maintain proper time. If this is not sufficient, it may be necessary to inject
+additional interrupts into the guest in order to increase the effective
+interrupt rate. This approach leads to complications in extreme conditions,
+where host load or guest lag is too much to compensate for, and thus another
+solution to the problem has risen: the guest may need to become aware of lost
+ticks and compensate for them internally. Although promising in theory, the
+implementation of this policy in Linux has been extremely error prone, and a
+number of buggy variants of lost tick compensation are distributed across
+commonly used Linux systems.
+
+Windows uses periodic RTC clocking as a means of keeping time internally, and
+thus requires interrupt slewing to keep proper time. It does use a low enough
+rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
+practice.
+
+4.2. TSC sampling and serialization
+-----------------------------------
+
+As the highest precision time source available, the cycle counter of the CPU
+has aroused much interest from developers. As explained above, this timer has
+many problems unique to its nature as a local, potentially unstable and
+potentially unsynchronized source. One issue which is not unique to the TSC,
+but is highlighted because of its very precise nature is sampling delay. By
+definition, the counter, once read is already old. However, it is also
+possible for the counter to be read ahead of the actual use of the result.
+This is a consequence of the superscalar execution of the instruction stream,
+which may execute instructions out of order. Such execution is called
+non-serialized. Forcing serialized execution is necessary for precise
+measurement with the TSC, and requires a serializing instruction, such as CPUID
+or an MSR read.
+
+Since CPUID may actually be virtualized by a trap and emulate mechanism, this
+serialization can pose a performance issue for hardware virtualization. An
+accurate time stamp counter reading may therefore not always be available, and
+it may be necessary for an implementation to guard against "backwards" reads of
+the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
+system.
+
+4.3. Timespec aliasing
+----------------------
+
+Additionally, this lack of serialization from the TSC poses another challenge
+when using results of the TSC when measured against another time source. As
+the TSC is much higher precision, many possible values of the TSC may be read
+while another clock is still expressing the same value.
+
+That is, you may read (T,T+10) while external clock C maintains the same value.
+Due to non-serialized reads, you may actually end up with a range which
+fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but
+calibrated against an external value may have a range of valid values.
+Re-calibrating this computation may actually cause time, as computed after the
+calibration, to go backwards, compared with time computed before the
+calibration.
+
+This problem is particularly pronounced with an internal time source in Linux,
+the kernel time, which is expressed in the theoretically high resolution
+timespec - but which advances in much larger granularity intervals, sometimes
+at the rate of jiffies, and possibly in catchup modes, at a much larger step.
+
+This aliasing requires care in the computation and recalibration of kvmclock
+and any other values derived from TSC computation (such as TSC virtualization
+itself).
+
+4.4. Migration
+--------------
+
+Migration of a virtual machine raises problems for timekeeping in two ways.
+First, the migration itself may take time, during which interrupts cannot be
+delivered, and after which, the guest time may need to be caught up. NTP may
+be able to help to some degree here, as the clock correction required is
+typically small enough to fall in the NTP-correctable window.
+
+An additional concern is that timers based off the TSC (or HPET, if the raw bus
+clock is exposed) may now be running at different rates, requiring compensation
+in some way in the hypervisor by virtualizing these timers. In addition,
+migrating to a faster machine may preclude the use of a passthrough TSC, as a
+faster clock cannot be made visible to a guest without the potential of time
+advancing faster than usual. A slower clock is less of a problem, as it can
+always be caught up to the original rate. KVM clock avoids these problems by
+simply storing multipliers and offsets against the TSC for the guest to convert
+back into nanosecond resolution values.
+
+4.5. Scheduling
+---------------
+
+Since scheduling may be based on precise timing and firing of interrupts, the
+scheduling algorithms of an operating system may be adversely affected by
+virtualization. In theory, the effect is random and should be universally
+distributed, but in contrived as well as real scenarios (guest device access,
+causes of virtualization exits, possible context switch), this may not always
+be the case. The effect of this has not been well studied.
+
+In an attempt to work around this, several implementations have provided a
+paravirtualized scheduler clock, which reveals the true amount of CPU time for
+which a virtual machine has been running.
+
+4.6. Watchdogs
+--------------
+
+Watchdog timers, such as the lock detector in Linux may fire accidentally when
+running under hardware virtualization due to timer interrupts being delayed or
+misinterpretation of the passage of real time. Usually, these warnings are
+spurious and can be ignored, but in some circumstances it may be necessary to
+disable such detection.
+
+4.7. Delays and precision timing
+--------------------------------
+
+Precise timing and delays may not be possible in a virtualized system. This
+can happen if the system is controlling physical hardware, or issues delays to
+compensate for slower I/O to and from devices. The first issue is not solvable
+in general for a virtualized system; hardware control software can't be
+adequately virtualized without a full real-time operating system, which would
+require an RT aware virtualization platform.
+
+The second issue may cause performance problems, but this is unlikely to be a
+significant issue. In many cases these delays may be eliminated through
+configuration or paravirtualization.
+
+4.8. Covert channels and leaks
+------------------------------
+
+In addition to the above problems, time information will inevitably leak to the
+guest about the host in anything but a perfect implementation of virtualized
+time. This may allow the guest to infer the presence of a hypervisor (as in a
+red-pill type detection), and it may allow information to leak between guests
+by using CPU utilization itself as a signalling channel. Preventing such
+problems would require completely isolated virtual time which may not track
+real time any longer. This may be useful in certain security or QA contexts,
+but in general isn't recommended for real-world deployment scenarios.