From 76cb841cb886eef6b3bee341a2266c76578724ad Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Mon, 6 May 2024 03:02:30 +0200 Subject: Adding upstream version 4.19.249. Signed-off-by: Daniel Baumann --- Documentation/x86/x86_64/00-INDEX | 16 ++ Documentation/x86/x86_64/5level-paging.txt | 61 ++++++ Documentation/x86/x86_64/boot-options.txt | 281 +++++++++++++++++++++++++ Documentation/x86/x86_64/cpu-hotplug-spec | 21 ++ Documentation/x86/x86_64/fake-numa-for-cpusets | 67 ++++++ Documentation/x86/x86_64/machinecheck | 83 ++++++++ Documentation/x86/x86_64/mm.txt | 81 +++++++ Documentation/x86/x86_64/uefi.txt | 42 ++++ 8 files changed, 652 insertions(+) create mode 100644 Documentation/x86/x86_64/00-INDEX create mode 100644 Documentation/x86/x86_64/5level-paging.txt create mode 100644 Documentation/x86/x86_64/boot-options.txt create mode 100644 Documentation/x86/x86_64/cpu-hotplug-spec create mode 100644 Documentation/x86/x86_64/fake-numa-for-cpusets create mode 100644 Documentation/x86/x86_64/machinecheck create mode 100644 Documentation/x86/x86_64/mm.txt create mode 100644 Documentation/x86/x86_64/uefi.txt (limited to 'Documentation/x86/x86_64') diff --git a/Documentation/x86/x86_64/00-INDEX b/Documentation/x86/x86_64/00-INDEX new file mode 100644 index 000000000..92fc20ab5 --- /dev/null +++ b/Documentation/x86/x86_64/00-INDEX @@ -0,0 +1,16 @@ +00-INDEX + - This file +boot-options.txt + - AMD64-specific boot options. +cpu-hotplug-spec + - Firmware support for CPU hotplug under Linux/x86-64 +fake-numa-for-cpusets + - Using numa=fake and CPUSets for Resource Management +kernel-stacks + - Context-specific per-processor interrupt stacks. +machinecheck + - Configurable sysfs parameters for the x86-64 machine check code. +mm.txt + - Memory layout of x86-64 (4 level page tables, 46 bits physical). +uefi.txt + - Booting Linux via Unified Extensible Firmware Interface. diff --git a/Documentation/x86/x86_64/5level-paging.txt b/Documentation/x86/x86_64/5level-paging.txt new file mode 100644 index 000000000..2432a5ef8 --- /dev/null +++ b/Documentation/x86/x86_64/5level-paging.txt @@ -0,0 +1,61 @@ +== Overview == + +Original x86-64 was limited by 4-level paing to 256 TiB of virtual address +space and 64 TiB of physical address space. We are already bumping into +this limit: some vendors offers servers with 64 TiB of memory today. + +To overcome the limitation upcoming hardware will introduce support for +5-level paging. It is a straight-forward extension of the current page +table structure adding one more layer of translation. + +It bumps the limits to 128 PiB of virtual address space and 4 PiB of +physical address space. This "ought to be enough for anybody" ©. + +QEMU 2.9 and later support 5-level paging. + +Virtual memory layout for 5-level paging is described in +Documentation/x86/x86_64/mm.txt + +== Enabling 5-level paging == + +CONFIG_X86_5LEVEL=y enables the feature. + +Kernel with CONFIG_X86_5LEVEL=y still able to boot on 4-level hardware. +In this case additional page table level -- p4d -- will be folded at +runtime. + +== User-space and large virtual address space == + +On x86, 5-level paging enables 56-bit userspace virtual address space. +Not all user space is ready to handle wide addresses. It's known that +at least some JIT compilers use higher bits in pointers to encode their +information. It collides with valid pointers with 5-level paging and +leads to crashes. + +To mitigate this, we are not going to allocate virtual address space +above 47-bit by default. + +But userspace can ask for allocation from full address space by +specifying hint address (with or without MAP_FIXED) above 47-bits. + +If hint address set above 47-bit, but MAP_FIXED is not specified, we try +to look for unmapped area by specified address. If it's already +occupied, we look for unmapped area in *full* address space, rather than +from 47-bit window. + +A high hint address would only affect the allocation in question, but not +any future mmap()s. + +Specifying high hint address on older kernel or on machine without 5-level +paging support is safe. The hint will be ignored and kernel will fall back +to allocation from 47-bit address space. + +This approach helps to easily make application's memory allocator aware +about large address space without manually tracking allocated virtual +address space. + +One important case we need to handle here is interaction with MPX. +MPX (without MAWA extension) cannot handle addresses above 47-bit, so we +need to make sure that MPX cannot be enabled we already have VMA above +the boundary and forbid creating such VMAs once MPX is enabled. + diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt new file mode 100644 index 000000000..ad6d2a80c --- /dev/null +++ b/Documentation/x86/x86_64/boot-options.txt @@ -0,0 +1,281 @@ +AMD64 specific boot options + +There are many others (usually documented in driver documentation), but +only the AMD64 specific ones are listed here. + +Machine check + + Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables. + + mce=off + Disable machine check + mce=no_cmci + Disable CMCI(Corrected Machine Check Interrupt) that + Intel processor supports. Usually this disablement is + not recommended, but it might be handy if your hardware + is misbehaving. + Note that you'll get more problems without CMCI than with + due to the shared banks, i.e. you might get duplicated + error logs. + mce=dont_log_ce + Don't make logs for corrected errors. All events reported + as corrected are silently cleared by OS. + This option will be useful if you have no interest in any + of corrected errors. + mce=ignore_ce + Disable features for corrected errors, e.g. polling timer + and CMCI. All events reported as corrected are not cleared + by OS and remained in its error banks. + Usually this disablement is not recommended, however if + there is an agent checking/clearing corrected errors + (e.g. BIOS or hardware monitoring applications), conflicting + with OS's error handling, and you cannot deactivate the agent, + then this option will be a help. + mce=no_lmce + Do not opt-in to Local MCE delivery. Use legacy method + to broadcast MCEs. + mce=bootlog + Enable logging of machine checks left over from booting. + Disabled by default on AMD Fam10h and older because some BIOS + leave bogus ones. + If your BIOS doesn't do that it's a good idea to enable though + to make sure you log even machine check events that result + in a reboot. On Intel systems it is enabled by default. + mce=nobootlog + Disable boot machine check logging. + mce=tolerancelevel[,monarchtimeout] (number,number) + tolerance levels: + 0: always panic on uncorrected errors, log corrected errors + 1: panic or SIGBUS on uncorrected errors, log corrected errors + 2: SIGBUS or log uncorrected errors, log corrected errors + 3: never panic or SIGBUS, log all errors (for testing only) + Default is 1 + Can be also set using sysfs which is preferable. + monarchtimeout: + Sets the time in us to wait for other CPUs on machine checks. 0 + to disable. + mce=bios_cmci_threshold + Don't overwrite the bios-set CMCI threshold. This boot option + prevents Linux from overwriting the CMCI threshold set by the + bios. Without this option, Linux always sets the CMCI + threshold to 1. Enabling this may make memory predictive failure + analysis less effective if the bios sets thresholds for memory + errors since we will not see details for all errors. + mce=recovery + Force-enable recoverable machine check code paths + + nomce (for compatibility with i386): same as mce=off + + Everything else is in sysfs now. + +APICs + + apic Use IO-APIC. Default + + noapic Don't use the IO-APIC. + + disableapic Don't use the local APIC + + nolapic Don't use the local APIC (alias for i386 compatibility) + + pirq=... See Documentation/x86/i386/IO-APIC.txt + + noapictimer Don't set up the APIC timer + + no_timer_check Don't check the IO-APIC timer. This can work around + problems with incorrect timer initialization on some boards. + apicpmtimer + Do APIC timer calibration using the pmtimer. Implies + apicmaintimer. Useful when your PIT timer is totally + broken. + +Timing + + notsc + Deprecated, use tsc=unstable instead. + + nohpet + Don't use the HPET timer. + +Idle loop + + idle=poll + Don't do power saving in the idle loop using HLT, but poll for rescheduling + event. This will make the CPUs eat a lot more power, but may be useful + to get slightly better performance in multiprocessor benchmarks. It also + makes some profiling using performance counters more accurate. + Please note that on systems with MONITOR/MWAIT support (like Intel EM64T + CPUs) this option has no performance advantage over the normal idle loop. + It may also interact badly with hyperthreading. + +Rebooting + + reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old] + bios Use the CPU reboot vector for warm reset + warm Don't set the cold reboot flag + cold Set the cold reboot flag + triple Force a triple fault (init) + kbd Use the keyboard controller. cold reset (default) + acpi Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the + ACPI reset does not work, the reboot path attempts the reset using + the keyboard controller. + efi Use efi reset_system runtime service. If EFI is not configured or the + EFI reset does not work, the reboot path attempts the reset using + the keyboard controller. + + Using warm reset will be much faster especially on big memory + systems because the BIOS will not go through the memory check. + Disadvantage is that not all hardware will be completely reinitialized + on reboot so there may be boot problems on some systems. + + reboot=force + + Don't stop other CPUs on reboot. This can make reboot more reliable + in some cases. + +Non Executable Mappings + + noexec=on|off + + on Enable(default) + off Disable + +NUMA + + numa=off Only set up a single NUMA node spanning all memory. + + numa=noacpi Don't parse the SRAT table for NUMA setup + + numa=fake=[MG] + If given as a memory unit, fills all system RAM with nodes of + size interleaved over physical nodes. + + numa=fake= + If given as an integer, fills all system RAM with N fake nodes + interleaved over physical nodes. + + numa=fake=U + If given as an integer followed by 'U', it will divide each + physical node into N emulated nodes. + +ACPI + + acpi=off Don't enable ACPI + acpi=ht Use ACPI boot table parsing, but don't enable ACPI + interpreter + acpi=force Force ACPI on (currently not needed) + + acpi=strict Disable out of spec ACPI workarounds. + + acpi_sci={edge,level,high,low} Set up ACPI SCI interrupt. + + acpi=noirq Don't route interrupts + + acpi=nocmcff Disable firmware first mode for corrected errors. This + disables parsing the HEST CMC error source to check if + firmware has set the FF flag. This may result in + duplicate corrected error reports. + +PCI + + pci=off Don't use PCI + pci=conf1 Use conf1 access. + pci=conf2 Use conf2 access. + pci=rom Assign ROMs. + pci=assign-busses Assign busses + pci=irqmask=MASK Set PCI interrupt mask to MASK + pci=lastbus=NUMBER Scan up to NUMBER busses, no matter what the mptable says. + pci=noacpi Don't use ACPI to set up PCI interrupt routing. + +IOMMU (input/output memory management unit) + + Multiple x86-64 PCI-DMA mapping implementations exist, for example: + + 1. : use no hardware/software IOMMU at all + (e.g. because you have < 3 GB memory). + Kernel boot message: "PCI-DMA: Disabling IOMMU" + + 2. : AMD GART based hardware IOMMU. + Kernel boot message: "PCI-DMA: using GART IOMMU" + + 3. : Software IOMMU implementation. Used + e.g. if there is no hardware IOMMU in the system and it is need because + you have >3GB memory or told the kernel to us it (iommu=soft)) + Kernel boot message: "PCI-DMA: Using software bounce buffering + for IO (SWIOTLB)" + + 4. : IBM Calgary hardware IOMMU. Used in IBM + pSeries and xSeries servers. This hardware IOMMU supports DMA address + mapping with memory protection, etc. + Kernel boot message: "PCI-DMA: Using Calgary IOMMU" + + iommu=[][,noagp][,off][,force][,noforce][,leak[=] + [,memaper[=]][,merge][,fullflush][,nomerge] + [,noaperture][,calgary] + + General iommu options: + off Don't initialize and use any kind of IOMMU. + noforce Don't force hardware IOMMU usage when it is not needed. + (default). + force Force the use of the hardware IOMMU even when it is + not actually needed (e.g. because < 3 GB memory). + soft Use software bounce buffering (SWIOTLB) (default for + Intel machines). This can be used to prevent the usage + of an available hardware IOMMU. + + iommu options only relevant to the AMD GART hardware IOMMU: + Set the size of the remapping area in bytes. + allowed Overwrite iommu off workarounds for specific chipsets. + fullflush Flush IOMMU on each allocation (default). + nofullflush Don't use IOMMU fullflush. + leak Turn on simple iommu leak tracing (only when + CONFIG_IOMMU_LEAK is on). Default number of leak pages + is 20. + memaper[=] Allocate an own aperture over RAM with size 32MB<[,force] + Prereserve that many 128K pages for the software IO + bounce buffering. + force Force all IO through the software TLB. + + Settings for the IBM Calgary hardware IOMMU currently found in IBM + pSeries and xSeries machines: + + calgary=[64k,128k,256k,512k,1M,2M,4M,8M] + calgary=[translate_empty_slots] + calgary=[disable=] + panic Always panic when IOMMU overflows + + 64k,...,8M - Set the size of each PCI slot's translation table + when using the Calgary IOMMU. This is the size of the translation + table itself in main memory. The smallest table, 64k, covers an IO + space of 32MB; the largest, 8MB table, can cover an IO space of + 4GB. Normally the kernel will make the right choice by itself. + + translate_empty_slots - Enable translation even on slots that have + no devices attached to them, in case a device will be hotplugged + in the future. + + disable= - Disable translation on a given PHB. For + example, the built-in graphics adapter resides on the first bridge + (PCI bus number 0); if translation (isolation) is enabled on this + bridge, X servers that access the hardware directly from user + space might stop working. Use this option if you have devices that + are accessed from userspace directly on some PCI host bridge. + +Miscellaneous + + nogbpages + Do not use GB pages for kernel direct mappings. + gbpages + Use GB pages for kernel direct mappings. diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec b/Documentation/x86/x86_64/cpu-hotplug-spec new file mode 100644 index 000000000..3c23e0587 --- /dev/null +++ b/Documentation/x86/x86_64/cpu-hotplug-spec @@ -0,0 +1,21 @@ +Firmware support for CPU hotplug under Linux/x86-64 +--------------------------------------------------- + +Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to +know in advance of boot time the maximum number of CPUs that could be plugged +into the system. ACPI 3.0 currently has no official way to supply +this information from the firmware to the operating system. + +In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the +ACPI 3.0 specification). ACPI already has the concept of disabled LAPIC +objects by setting the Enabled bit in the LAPIC object to zero. + +For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable +CPU is already available in the MADT. If the CPU is not available yet +it should have its LAPIC Enabled bit set to 0. Linux will use the number +of disabled LAPICs to compute the maximum number of future CPUs. + +In the worst case the user can overwrite this choice using a command line +option (additional_cpus=...), but it is recommended to supply the correct +number (or a reasonable approximation of it, with erring towards more not less) +in the MADT to avoid manual configuration. diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets new file mode 100644 index 000000000..4b09f1883 --- /dev/null +++ b/Documentation/x86/x86_64/fake-numa-for-cpusets @@ -0,0 +1,67 @@ +Using numa=fake and CPUSets for Resource Management +Written by David Rientjes + +This document describes how the numa=fake x86_64 command-line option can be used +in conjunction with cpusets for coarse memory management. Using this feature, +you can create fake NUMA nodes that represent contiguous chunks of memory and +assign them to cpusets and their attached tasks. This is a way of limiting the +amount of system memory that are available to a certain class of tasks. + +For more information on the features of cpusets, see +Documentation/cgroup-v1/cpusets.txt. +There are a number of different configurations you can use for your needs. For +more information on the numa=fake command line option and its various ways of +configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. + +For the purposes of this introduction, we'll assume a very primitive NUMA +emulation setup of "numa=fake=4*512,". This will split our system memory into +four equal chunks of 512M each that we can now use to assign to cpusets. As +you become more familiar with using this combination for resource control, +you'll determine a better setup to minimize the number of nodes you have to deal +with. + +A machine may be split as follows with "numa=fake=4*512," as reported by dmesg: + + Faking node 0 at 0000000000000000-0000000020000000 (512MB) + Faking node 1 at 0000000020000000-0000000040000000 (512MB) + Faking node 2 at 0000000040000000-0000000060000000 (512MB) + Faking node 3 at 0000000060000000-0000000080000000 (512MB) + ... + On node 0 totalpages: 130975 + On node 1 totalpages: 131072 + On node 2 totalpages: 131072 + On node 3 totalpages: 131072 + +Now following the instructions for mounting the cpusets filesystem from +Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory +address spaces) to individual cpusets: + + [root@xroads /]# mkdir exampleset + [root@xroads /]# mount -t cpuset none exampleset + [root@xroads /]# mkdir exampleset/ddset + [root@xroads /]# cd exampleset/ddset + [root@xroads /exampleset/ddset]# echo 0-1 > cpus + [root@xroads /exampleset/ddset]# echo 0-1 > mems + +Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for +memory allocations (1G). + +You can now assign tasks to these cpusets to limit the memory resources +available to them according to the fake nodes assigned as mems: + + [root@xroads /exampleset/ddset]# echo $$ > tasks + [root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G + [1] 13425 + +Notice the difference between the system memory usage as reported by +/proc/meminfo between the restricted cpuset case above and the unrestricted +case (i.e. running the same 'dd' command without assigning it to a fake NUMA +cpuset): + Unrestricted Restricted + MemTotal: 3091900 kB 3091900 kB + MemFree: 42113 kB 1513236 kB + +This allows for coarse memory management for the tasks you assign to particular +cpusets. Since cpusets can form a hierarchy, you can create some pretty +interesting combinations of use-cases for various classes of tasks for your +memory management needs. diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck new file mode 100644 index 000000000..d0648a74f --- /dev/null +++ b/Documentation/x86/x86_64/machinecheck @@ -0,0 +1,83 @@ + +Configurable sysfs parameters for the x86-64 machine check code. + +Machine checks report internal hardware error conditions detected +by the CPU. Uncorrected errors typically cause a machine check +(often with panic), corrected ones cause a machine check log entry. + +Machine checks are organized in banks (normally associated with +a hardware subsystem) and subevents in a bank. The exact meaning +of the banks and subevent is CPU specific. + +mcelog knows how to decode them. + +When you see the "Machine check errors logged" message in the system +log then mcelog should run to collect and decode machine check entries +from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. + +Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN +(N = CPU number) + +The directory contains some configurable entries: + +Entries: + +bankNctl +(N bank number) + 64bit Hex bitmask enabling/disabling specific subevents for bank N + When a bit in the bitmask is zero then the respective + subevent will not be reported. + By default all events are enabled. + Note that BIOS maintain another mask to disable specific events + per bank. This is not visible here + +The following entries appear for each CPU, but they are truly shared +between all CPUs. + +check_interval + How often to poll for corrected machine check errors, in seconds + (Note output is hexadecimal). Default 5 minutes. When the poller + finds MCEs it triggers an exponential speedup (poll more often) on + the polling interval. When the poller stops finding MCEs, it + triggers an exponential backoff (poll less often) on the polling + interval. The check_interval variable is both the initial and + maximum polling interval. 0 means no polling for corrected machine + check errors (but some corrected errors might be still reported + in other ways) + +tolerant + Tolerance level. When a machine check exception occurs for a non + corrected machine check the kernel can take different actions. + Since machine check exceptions can happen any time it is sometimes + risky for the kernel to kill a process because it defies + normal kernel locking rules. The tolerance level configures + how hard the kernel tries to recover even at some risk of + deadlock. Higher tolerant values trade potentially better uptime + with the risk of a crash or even corruption (for tolerant >= 3). + + 0: always panic on uncorrected errors, log corrected errors + 1: panic or SIGBUS on uncorrected errors, log corrected errors + 2: SIGBUS or log uncorrected errors, log corrected errors + 3: never panic or SIGBUS, log all errors (for testing only) + + Default: 1 + + Note this only makes a difference if the CPU allows recovery + from a machine check exception. Current x86 CPUs generally do not. + +trigger + Program to run when a machine check event is detected. + This is an alternative to running mcelog regularly from cron + and allows to detect events faster. +monarch_timeout + How long to wait for the other CPUs to machine check too on a + exception. 0 to disable waiting for other CPUs. + Unit: us + +TBD document entries for AMD threshold interrupt configuration + +For more details about the x86 machine check architecture +see the Intel and AMD architecture manuals from their developer websites. + +For more details about the architecture see +see http://one.firstfloor.org/~andi/mce.pdf diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt new file mode 100644 index 000000000..05ef53d83 --- /dev/null +++ b/Documentation/x86/x86_64/mm.txt @@ -0,0 +1,81 @@ + +Virtual memory map with 4 level page tables: + +0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm +hole caused by [47:63] sign extension +ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor +ffff880000000000 - ffff887fffffffff (=39 bits) LDT remap for PTI +ffff888000000000 - ffffc87fffffffff (=64 TB) direct mapping of all phys. memory +ffffc88000000000 - ffffc8ffffffffff (=39 bits) hole +ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space +ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole +ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB) +... unused hole ... +ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB) +... unused hole ... + vaddr_end for KASLR +fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping +fffffe8000000000 - fffffeffffffffff (=39 bits) LDT remap for PTI +ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks +... unused hole ... +ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space +... unused hole ... +ffffffff80000000 - ffffffff9fffffff (=512 MB) kernel text mapping, from phys 0 +ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space +[fixmap start] - ffffffffff5fffff kernel-internal fixmap range +ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI +ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole + +Virtual memory map with 5 level page tables: + +0000000000000000 - 00ffffffffffffff (=56 bits) user space, different per mm +hole caused by [56:63] sign extension +ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor +ff10000000000000 - ff10ffffffffffff (=48 bits) LDT remap for PTI +ff11000000000000 - ff90ffffffffffff (=55 bits) direct mapping of all phys. memory +ff91000000000000 - ff9fffffffffffff (=3840 TB) hole +ffa0000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space (12800 TB) +ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole +ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB) +... unused hole ... +ffdf000000000000 - fffffc0000000000 (=53 bits) kasan shadow memory (8PB) +... unused hole ... + vaddr_end for KASLR +fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping +... unused hole ... +ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks +... unused hole ... +ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space +... unused hole ... +ffffffff80000000 - ffffffff9fffffff (=512 MB) kernel text mapping, from phys 0 +ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space +[fixmap start] - ffffffffff5fffff kernel-internal fixmap range +ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI +ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole + +Architecture defines a 64-bit virtual address. Implementations can support +less. Currently supported are 48- and 57-bit virtual addresses. Bits 63 +through to the most-significant implemented bit are sign extended. +This causes hole between user space and kernel addresses if you interpret them +as unsigned. + +The direct mapping covers all memory in the system up to the highest +memory address (this means in some cases it can also include PCI memory +holes). + +vmalloc space is lazily synchronized into the different PML4/PML5 pages of +the processes using the page fault handler, with init_top_pgt as +reference. + +We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual +memory window (this size is arbitrary, it can be raised later if needed). +The mappings are not part of any other kernel PGD and are only available +during EFI runtime calls. + +Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all +physical memory, vmalloc/ioremap space and virtual memory map are randomized. +Their order is preserved but their base will be offset early at boot time. + +Be very careful vs. KASLR when changing anything here. The KASLR address +range must not overlap with anything except the KASAN shadow area, which is +correct as KASAN disables KASLR. diff --git a/Documentation/x86/x86_64/uefi.txt b/Documentation/x86/x86_64/uefi.txt new file mode 100644 index 000000000..a5e2b4fdb --- /dev/null +++ b/Documentation/x86/x86_64/uefi.txt @@ -0,0 +1,42 @@ +General note on [U]EFI x86_64 support +------------------------------------- + +The nomenclature EFI and UEFI are used interchangeably in this document. + +Although the tools below are _not_ needed for building the kernel, +the needed bootloader support and associated tools for x86_64 platforms +with EFI firmware and specifications are listed below. + +1. UEFI specification: http://www.uefi.org + +2. Booting Linux kernel on UEFI x86_64 platform requires bootloader + support. Elilo with x86_64 support can be used. + +3. x86_64 platform with EFI/UEFI firmware. + +Mechanics: +--------- +- Build the kernel with the following configuration. + CONFIG_FB_EFI=y + CONFIG_FRAMEBUFFER_CONSOLE=y + If EFI runtime services are expected, the following configuration should + be selected. + CONFIG_EFI=y + CONFIG_EFI_VARS=y or m # optional +- Create a VFAT partition on the disk +- Copy the following to the VFAT partition: + elilo bootloader with x86_64 support, elilo configuration file, + kernel image built in first step and corresponding + initrd. Instructions on building elilo and its dependencies + can be found in the elilo sourceforge project. +- Boot to EFI shell and invoke elilo choosing the kernel image built + in first step. +- If some or all EFI runtime services don't work, you can try following + kernel command line parameters to turn off some or all EFI runtime + services. + noefi turn off all EFI runtime services + reboot_type=k turn off EFI reboot runtime service +- If the EFI memory map has additional entries not in the E820 map, + you can include those entries in the kernels memory map of available + physical RAM by using the following kernel command line parameter. + add_efi_memmap include EFI memory map of available physical RAM -- cgit v1.2.3