summaryrefslogtreecommitdiffstats
path: root/Documentation/vm/hwpoison.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/vm/hwpoison.rst')
-rw-r--r--Documentation/vm/hwpoison.rst186
1 files changed, 186 insertions, 0 deletions
diff --git a/Documentation/vm/hwpoison.rst b/Documentation/vm/hwpoison.rst
new file mode 100644
index 000000000..a5c884293
--- /dev/null
+++ b/Documentation/vm/hwpoison.rst
@@ -0,0 +1,186 @@
+.. hwpoison:
+
+========
+hwpoison
+========
+
+What is hwpoison?
+=================
+
+Upcoming Intel CPUs have support for recovering from some memory errors
+(``MCA recovery``). This requires the OS to declare a page "poisoned",
+kill the processes associated with it and avoid using it in the future.
+
+This patchkit implements the necessary infrastructure in the VM.
+
+To quote the overview comment::
+
+ High level machine check handler. Handles pages reported by the
+ hardware as being corrupted usually due to a 2bit ECC memory or cache
+ failure.
+
+ This focusses on pages detected as corrupted in the background.
+ When the current CPU tries to consume corruption the currently
+ running process can just be killed directly instead. This implies
+ that if the error cannot be handled for some reason it's safe to
+ just ignore it because no corruption has been consumed yet. Instead
+ when that happens another machine check will happen.
+
+ Handles page cache pages in various states. The tricky part
+ here is that we can access any page asynchronous to other VM
+ users, because memory failures could happen anytime and anywhere,
+ possibly violating some of their assumptions. This is why this code
+ has to be extremely careful. Generally it tries to use normal locking
+ rules, as in get the standard locks, even if that means the
+ error handling takes potentially a long time.
+
+ Some of the operations here are somewhat inefficient and have non
+ linear algorithmic complexity, because the data structures have not
+ been optimized for this case. This is in particular the case
+ for the mapping from a vma to a process. Since this case is expected
+ to be rare we hope we can get away with this.
+
+The code consists of a the high level handler in mm/memory-failure.c,
+a new page poison bit and various checks in the VM to handle poisoned
+pages.
+
+The main target right now is KVM guests, but it works for all kinds
+of applications. KVM support requires a recent qemu-kvm release.
+
+For the KVM use there was need for a new signal type so that
+KVM can inject the machine check into the guest with the proper
+address. This in theory allows other applications to handle
+memory failures too. The expection is that near all applications
+won't do that, but some very specialized ones might.
+
+Failure recovery modes
+======================
+
+There are two (actually three) modes memory failure recovery can be in:
+
+vm.memory_failure_recovery sysctl set to zero:
+ All memory failures cause a panic. Do not attempt recovery.
+ (on x86 this can be also affected by the tolerant level of the
+ MCE subsystem)
+
+early kill
+ (can be controlled globally and per process)
+ Send SIGBUS to the application as soon as the error is detected
+ This allows applications who can process memory errors in a gentle
+ way (e.g. drop affected object)
+ This is the mode used by KVM qemu.
+
+late kill
+ Send SIGBUS when the application runs into the corrupted page.
+ This is best for memory error unaware applications and default
+ Note some pages are always handled as late kill.
+
+User control
+============
+
+vm.memory_failure_recovery
+ See sysctl.txt
+
+vm.memory_failure_early_kill
+ Enable early kill mode globally
+
+PR_MCE_KILL
+ Set early/late kill mode/revert to system default
+
+ arg1: PR_MCE_KILL_CLEAR:
+ Revert to system default
+ arg1: PR_MCE_KILL_SET:
+ arg2 defines thread specific mode
+
+ PR_MCE_KILL_EARLY:
+ Early kill
+ PR_MCE_KILL_LATE:
+ Late kill
+ PR_MCE_KILL_DEFAULT
+ Use system global default
+
+ Note that if you want to have a dedicated thread which handles
+ the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
+ call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
+ the SIGBUS is sent to the main thread.
+
+PR_MCE_KILL_GET
+ return current mode
+
+Testing
+=======
+
+* madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
+ process for testing
+
+* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``
+
+ corrupt-pfn
+ Inject hwpoison fault at PFN echoed into this file. This does
+ some early filtering to avoid corrupted unintended pages in test suites.
+
+ unpoison-pfn
+ Software-unpoison page at PFN echoed into this file. This way
+ a page can be reused again. This only works for Linux
+ injected failures, not for real memory failures.
+
+ Note these injection interfaces are not stable and might change between
+ kernel versions
+
+ corrupt-filter-dev-major, corrupt-filter-dev-minor
+ Only handle memory failures to pages associated with the file
+ system defined by block device major/minor. -1U is the
+ wildcard value. This should be only used for testing with
+ artificial injection.
+
+ corrupt-filter-memcg
+ Limit injection to pages owned by memgroup. Specified by inode
+ number of the memcg.
+
+ Example::
+
+ mkdir /sys/fs/cgroup/mem/hwpoison
+
+ usemem -m 100 -s 1000 &
+ echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
+
+ memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
+ echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
+
+ page-types -p `pidof init` --hwpoison # shall do nothing
+ page-types -p `pidof usemem` --hwpoison # poison its pages
+
+ corrupt-filter-flags-mask, corrupt-filter-flags-value
+ When specified, only poison pages if ((page_flags & mask) ==
+ value). This allows stress testing of many kinds of
+ pages. The page_flags are the same as in /proc/kpageflags. The
+ flag bits are defined in include/linux/kernel-page-flags.h and
+ documented in Documentation/admin-guide/mm/pagemap.rst
+
+* Architecture specific MCE injector
+
+ x86 has mce-inject, mce-test
+
+ Some portable hwpoison test programs in mce-test, see below.
+
+References
+==========
+
+http://halobates.de/mce-lc09-2.pdf
+ Overview presentation from LinuxCon 09
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
+ Test suite (hwpoison specific portable tests in tsrc)
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
+ x86 specific injector
+
+
+Limitations
+===========
+- Not all page types are supported and never will. Most kernel internal
+ objects cannot be recovered, only LRU pages for now.
+- Right now hugepage support is missing.
+
+---
+Andi Kleen, Oct 2009