diff options
Diffstat (limited to 'Documentation/powerpc/firmware-assisted-dump.rst')
-rw-r--r-- | Documentation/powerpc/firmware-assisted-dump.rst | 381 |
1 files changed, 0 insertions, 381 deletions
diff --git a/Documentation/powerpc/firmware-assisted-dump.rst b/Documentation/powerpc/firmware-assisted-dump.rst deleted file mode 100644 index e363fc4852..0000000000 --- a/Documentation/powerpc/firmware-assisted-dump.rst +++ /dev/null @@ -1,381 +0,0 @@ -====================== -Firmware-Assisted Dump -====================== - -July 2011 - -The goal of firmware-assisted dump is to enable the dump of -a crashed system, and to do so from a fully-reset system, and -to minimize the total elapsed time until the system is back -in production use. - -- Firmware-Assisted Dump (FADump) infrastructure is intended to replace - the existing phyp assisted dump. -- Fadump uses the same firmware interfaces and memory reservation model - as phyp assisted dump. -- Unlike phyp dump, FADump exports the memory dump through /proc/vmcore - in the ELF format in the same way as kdump. This helps us reuse the - kdump infrastructure for dump capture and filtering. -- Unlike phyp dump, userspace tool does not need to refer any sysfs - interface while reading /proc/vmcore. -- Unlike phyp dump, FADump allows user to release all the memory reserved - for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem. -- Once enabled through kernel boot parameter, FADump can be - started/stopped through /sys/kernel/fadump_registered interface (see - sysfs files section below) and can be easily integrated with kdump - service start/stop init scripts. - -Comparing with kdump or other strategies, firmware-assisted -dump offers several strong, practical advantages: - -- Unlike kdump, the system has been reset, and loaded - with a fresh copy of the kernel. In particular, - PCI and I/O devices have been reinitialized and are - in a clean, consistent state. -- Once the dump is copied out, the memory that held the dump - is immediately available to the running kernel. And therefore, - unlike kdump, FADump doesn't need a 2nd reboot to get back - the system to the production configuration. - -The above can only be accomplished by coordination with, -and assistance from the Power firmware. The procedure is -as follows: - -- The first kernel registers the sections of memory with the - Power firmware for dump preservation during OS initialization. - These registered sections of memory are reserved by the first - kernel during early boot. - -- When system crashes, the Power firmware will copy the registered - low memory regions (boot memory) from source to destination area. - It will also save hardware PTE's. - - NOTE: - The term 'boot memory' means size of the low memory chunk - that is required for a kernel to boot successfully when - booted with restricted memory. By default, the boot memory - size will be the larger of 5% of system RAM or 256MB. - Alternatively, user can also specify boot memory size - through boot parameter 'crashkernel=' which will override - the default calculated size. Use this option if default - boot memory size is not sufficient for second kernel to - boot successfully. For syntax of crashkernel= parameter, - refer to Documentation/admin-guide/kdump/kdump.rst. If any - offset is provided in crashkernel= parameter, it will be - ignored as FADump uses a predefined offset to reserve memory - for boot memory dump preservation in case of a crash. - -- After the low memory (boot memory) area has been saved, the - firmware will reset PCI and other hardware state. It will - *not* clear the RAM. It will then launch the bootloader, as - normal. - -- The freshly booted kernel will notice that there is a new node - (rtas/ibm,kernel-dump on pSeries or ibm,opal/dump/mpipl-boot - on OPAL platform) in the device tree, indicating that - there is crash data available from a previous boot. During - the early boot OS will reserve rest of the memory above - boot memory size effectively booting with restricted memory - size. This will make sure that this kernel (also, referred - to as second kernel or capture kernel) will not touch any - of the dump memory area. - -- User-space tools will read /proc/vmcore to obtain the contents - of memory, which holds the previous crashed kernel dump in ELF - format. The userspace tools may copy this info to disk, or - network, nas, san, iscsi, etc. as desired. - -- Once the userspace tool is done saving dump, it will echo - '1' to /sys/kernel/fadump_release_mem to release the reserved - memory back to general use, except the memory required for - next firmware-assisted dump registration. - - e.g.:: - - # echo 1 > /sys/kernel/fadump_release_mem - -Please note that the firmware-assisted dump feature -is only available on POWER6 and above systems on pSeries -(PowerVM) platform and POWER9 and above systems with OP940 -or later firmware versions on PowerNV (OPAL) platform. -Note that, OPAL firmware exports ibm,opal/dump node when -FADump is supported on PowerNV platform. - -On OPAL based machines, system first boots into an intermittent -kernel (referred to as petitboot kernel) before booting into the -capture kernel. This kernel would have minimal kernel and/or -userspace support to process crash data. Such kernel needs to -preserve previously crash'ed kernel's memory for the subsequent -capture kernel boot to process this crash data. Kernel config -option CONFIG_PRESERVE_FA_DUMP has to be enabled on such kernel -to ensure that crash data is preserved to process later. - --- On OPAL based machines (PowerNV), if the kernel is build with - CONFIG_OPAL_CORE=y, OPAL memory at the time of crash is also - exported as /sys/firmware/opal/mpipl/core file. This procfs file is - helpful in debugging OPAL crashes with GDB. The kernel memory - used for exporting this procfs file can be released by echo'ing - '1' to /sys/firmware/opal/mpipl/release_core node. - - e.g. - # echo 1 > /sys/firmware/opal/mpipl/release_core - -Implementation details: ------------------------ - -During boot, a check is made to see if firmware supports -this feature on that particular machine. If it does, then -we check to see if an active dump is waiting for us. If yes -then everything but boot memory size of RAM is reserved during -early boot (See Fig. 2). This area is released once we finish -collecting the dump from user land scripts (e.g. kdump scripts) -that are run. If there is dump data, then the -/sys/kernel/fadump_release_mem file is created, and the reserved -memory is held. - -If there is no waiting dump data, then only the memory required to -hold CPU state, HPTE region, boot memory dump, FADump header and -elfcore header, is usually reserved at an offset greater than boot -memory size (see Fig. 1). This area is *not* released: this region -will be kept permanently reserved, so that it can act as a receptacle -for a copy of the boot memory content in addition to CPU state and -HPTE region, in the case a crash does occur. - -Since this reserved memory area is used only after the system crash, -there is no point in blocking this significant chunk of memory from -production kernel. Hence, the implementation uses the Linux kernel's -Contiguous Memory Allocator (CMA) for memory reservation if CMA is -configured for kernel. With CMA reservation this memory will be -available for applications to use it, while kernel is prevented from -using it. With this FADump will still be able to capture all of the -kernel memory and most of the user space memory except the user pages -that were present in CMA region:: - - o Memory Reservation during first kernel - - Low memory Top of memory - 0 boot memory size |<--- Reserved dump area --->| | - | | | Permanent Reservation | | - V V | | V - +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ - | | |///|////| DUMP | HDR | ELF |////| | - +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ - | ^ ^ ^ ^ ^ - | | | | | | - \ CPU HPTE / | | - ------------------------------ | | - Boot memory content gets transferred | | - to reserved area by firmware at the | | - time of crash. | | - FADump Header | - (meta area) | - | - | - Metadata: This area holds a metadata structure whose - address is registered with f/w and retrieved in the - second kernel after crash, on platforms that support - tags (OPAL). Having such structure with info needed - to process the crashdump eases dump capture process. - - Fig. 1 - - - o Memory Reservation during second kernel after crash - - Low memory Top of memory - 0 boot memory size | - | |<------------ Crash preserved area ------------>| - V V |<--- Reserved dump area --->| | - +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ - | | |///|////| DUMP | HDR | ELF |////| | - +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ - | | - V V - Used by second /proc/vmcore - kernel to boot - - +---+ - |///| -> Regions (CPU, HPTE & Metadata) marked like this in the above - +---+ figures are not always present. For example, OPAL platform - does not have CPU & HPTE regions while Metadata region is - not supported on pSeries currently. - - Fig. 2 - - -Currently the dump will be copied from /proc/vmcore to a new file upon -user intervention. The dump data available through /proc/vmcore will be -in ELF format. Hence the existing kdump infrastructure (kdump scripts) -to save the dump works fine with minor modifications. KDump scripts on -major Distro releases have already been modified to work seamlessly (no -user intervention in saving the dump) when FADump is used, instead of -KDump, as dump mechanism. - -The tools to examine the dump will be same as the ones -used for kdump. - -How to enable firmware-assisted dump (FADump): ----------------------------------------------- - -1. Set config option CONFIG_FA_DUMP=y and build kernel. -2. Boot into linux kernel with 'fadump=on' kernel cmdline option. - By default, FADump reserved memory will be initialized as CMA area. - Alternatively, user can boot linux kernel with 'fadump=nocma' to - prevent FADump to use CMA. -3. Optionally, user can also set 'crashkernel=' kernel cmdline - to specify size of the memory to reserve for boot memory dump - preservation. - -NOTE: - 1. 'fadump_reserve_mem=' parameter has been deprecated. Instead - use 'crashkernel=' to specify size of the memory to reserve - for boot memory dump preservation. - 2. If firmware-assisted dump fails to reserve memory then it - will fallback to existing kdump mechanism if 'crashkernel=' - option is set at kernel cmdline. - 3. if user wants to capture all of user space memory and ok with - reserved memory not available to production system, then - 'fadump=nocma' kernel parameter can be used to fallback to - old behaviour. - -Sysfs/debugfs files: --------------------- - -Firmware-assisted dump feature uses sysfs file system to hold -the control files and debugfs file to display memory reserved region. - -Here is the list of files under kernel sysfs: - - /sys/kernel/fadump_enabled - This is used to display the FADump status. - - - 0 = FADump is disabled - - 1 = FADump is enabled - - This interface can be used by kdump init scripts to identify if - FADump is enabled in the kernel and act accordingly. - - /sys/kernel/fadump_registered - This is used to display the FADump registration status as well - as to control (start/stop) the FADump registration. - - - 0 = FADump is not registered. - - 1 = FADump is registered and ready to handle system crash. - - To register FADump echo 1 > /sys/kernel/fadump_registered and - echo 0 > /sys/kernel/fadump_registered for un-register and stop the - FADump. Once the FADump is un-registered, the system crash will not - be handled and vmcore will not be captured. This interface can be - easily integrated with kdump service start/stop. - - /sys/kernel/fadump/mem_reserved - - This is used to display the memory reserved by FADump for saving the - crash dump. - - /sys/kernel/fadump_release_mem - This file is available only when FADump is active during - second kernel. This is used to release the reserved memory - region that are held for saving crash dump. To release the - reserved memory echo 1 to it:: - - echo 1 > /sys/kernel/fadump_release_mem - - After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region - file will change to reflect the new memory reservations. - - The existing userspace tools (kdump infrastructure) can be easily - enhanced to use this interface to release the memory reserved for - dump and continue without 2nd reboot. - -Note: /sys/kernel/fadump_release_opalcore sysfs has moved to - /sys/firmware/opal/mpipl/release_core - - /sys/firmware/opal/mpipl/release_core - - This file is available only on OPAL based machines when FADump is - active during capture kernel. This is used to release the memory - used by the kernel to export /sys/firmware/opal/mpipl/core file. To - release this memory, echo '1' to it: - - echo 1 > /sys/firmware/opal/mpipl/release_core - -Note: The following FADump sysfs files are deprecated. - -+----------------------------------+--------------------------------+ -| Deprecated | Alternative | -+----------------------------------+--------------------------------+ -| /sys/kernel/fadump_enabled | /sys/kernel/fadump/enabled | -+----------------------------------+--------------------------------+ -| /sys/kernel/fadump_registered | /sys/kernel/fadump/registered | -+----------------------------------+--------------------------------+ -| /sys/kernel/fadump_release_mem | /sys/kernel/fadump/release_mem | -+----------------------------------+--------------------------------+ - -Here is the list of files under powerpc debugfs: -(Assuming debugfs is mounted on /sys/kernel/debug directory.) - - /sys/kernel/debug/powerpc/fadump_region - This file shows the reserved memory regions if FADump is - enabled otherwise this file is empty. The output format - is:: - - <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> - - and for kernel DUMP region is: - - DUMP: Src: <src-addr>, Dest: <dest-addr>, Size: <size>, Dumped: # bytes - - e.g. - Contents when FADump is registered during first kernel:: - - # cat /sys/kernel/debug/powerpc/fadump_region - CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 - HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 - DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 - - Contents when FADump is active during second kernel:: - - # cat /sys/kernel/debug/powerpc/fadump_region - CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 - HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 - DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 - : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 - - -NOTE: - Please refer to Documentation/filesystems/debugfs.rst on - how to mount the debugfs filesystem. - - -TODO: ------ - - Need to come up with the better approach to find out more - accurate boot memory size that is required for a kernel to - boot successfully when booted with restricted memory. - - The FADump implementation introduces a FADump crash info structure - in the scratch area before the ELF core header. The idea of introducing - this structure is to pass some important crash info data to the second - kernel which will help second kernel to populate ELF core header with - correct data before it gets exported through /proc/vmcore. The current - design implementation does not address a possibility of introducing - additional fields (in future) to this structure without affecting - compatibility. Need to come up with the better approach to address this. - - The possible approaches are: - - 1. Introduce version field for version tracking, bump up the version - whenever a new field is added to the structure in future. The version - field can be used to find out what fields are valid for the current - version of the structure. - 2. Reserve the area of predefined size (say PAGE_SIZE) for this - structure and have unused area as reserved (initialized to zero) - for future field additions. - - The advantage of approach 1 over 2 is we don't need to reserve extra space. - -Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> - -This document is based on the original documentation written for phyp - -assisted dump by Linas Vepstas and Manish Ahuja. |