64 files changed, 26061 insertions, 0 deletions
diff --git a/Documentation/admin-guide/LSM/LoadPin.rst b/Documentation/admin-guide/LSM/LoadPin.rst
new file mode 100644
index 000000000..32070762d
--- /dev/null
+++ b/Documentation/admin-guide/LSM/LoadPin.rst
@@ -0,0 +1,21 @@
+=======
+LoadPin
+=======
+
+LoadPin is a Linux Security Module that ensures all kernel-loaded files
+(modules, firmware, etc) all originate from the same filesystem, with
+the expectation that such a filesystem is backed by a read-only device
+such as dm-verity or CDROM. This allows systems that have a verified
+and/or unchangeable filesystem to enforce module and firmware loading
+restrictions without needing to sign the files individually.
+
+The LSM is selectable at build-time with ``CONFIG_SECURITY_LOADPIN``, and
+can be controlled at boot-time with the kernel command line option
+"``loadpin.enabled``". By default, it is enabled, but can be disabled at
+boot ("``loadpin.enabled=0``").
+
+LoadPin starts pinning when it sees the first file loaded. If the
+block device backing the filesystem is not read-only, a sysctl is
+created to toggle pinning: ``/proc/sys/kernel/loadpin/enabled``. (Having
+a mutable filesystem means pinning is mutable too, but having the
+sysctl allows for easy testing on systems with a mutable filesystem.)
diff --git a/Documentation/admin-guide/LSM/SELinux.rst b/Documentation/admin-guide/LSM/SELinux.rst
new file mode 100644
index 000000000..f722c9b41
--- /dev/null
+++ b/Documentation/admin-guide/LSM/SELinux.rst
@@ -0,0 +1,33 @@
+=======
+SELinux
+=======
+
+If you want to use SELinux, chances are you will want
+to use the distro-provided policies, or install the
+latest reference policy release from
+
+	http://oss.tresys.com/projects/refpolicy
+
+However, if you want to install a dummy policy for
+testing, you can do using ``mdp`` provided under
+scripts/selinux.  Note that this requires the selinux
+userspace to be installed - in particular you will
+need checkpolicy to compile a kernel, and setfiles and
+fixfiles to label the filesystem.
+
+	1. Compile the kernel with selinux enabled.
+	2. Type ``make`` to compile ``mdp``.
+	3. Make sure that you are not running with
+	   SELinux enabled and a real policy.  If
+	   you are, reboot with selinux disabled
+	   before continuing.
+	4. Run install_policy.sh::
+
+		cd scripts/selinux
+		sh install_policy.sh
+
+Step 4 will create a new dummy policy valid for your
+kernel, with a single selinux user, role, and type.
+It will compile the policy, will set your ``SELINUXTYPE`` to
+``dummy`` in ``/etc/selinux/config``, install the compiled policy
+as ``dummy``, and relabel your filesystem.
diff --git a/Documentation/admin-guide/LSM/Smack.rst b/Documentation/admin-guide/LSM/Smack.rst
new file mode 100644
index 000000000..6a5826a13
--- /dev/null
+++ b/Documentation/admin-guide/LSM/Smack.rst
@@ -0,0 +1,857 @@
+=====
+Smack
+=====
+
+
+    "Good for you, you've decided to clean the elevator!"
+    - The Elevator, from Dark Star
+
+Smack is the Simplified Mandatory Access Control Kernel.
+Smack is a kernel based implementation of mandatory access
+control that includes simplicity in its primary design goals.
+
+Smack is not the only Mandatory Access Control scheme
+available for Linux. Those new to Mandatory Access Control
+are encouraged to compare Smack with the other mechanisms
+available to determine which is best suited to the problem
+at hand.
+
+Smack consists of three major components:
+
+    - The kernel
+    - Basic utilities, which are helpful but not required
+    - Configuration data
+
+The kernel component of Smack is implemented as a Linux
+Security Modules (LSM) module. It requires netlabel and
+works best with file systems that support extended attributes,
+although xattr support is not strictly required.
+It is safe to run a Smack kernel under a "vanilla" distribution.
+
+Smack kernels use the CIPSO IP option. Some network
+configurations are intolerant of IP options and can impede
+access to systems that use them as Smack does.
+
+Smack is used in the Tizen operating system. Please
+go to http://wiki.tizen.org for information about how
+Smack is used in Tizen.
+
+The current git repository for Smack user space is:
+
+	git://github.com/smack-team/smack.git
+
+This should make and install on most modern distributions.
+There are five commands included in smackutil:
+
+chsmack:
+	display or set Smack extended attribute values
+
+smackctl:
+	load the Smack access rules
+
+smackaccess:
+	report if a process with one label has access
+	to an object with another
+
+These two commands are obsolete with the introduction of
+the smackfs/load2 and smackfs/cipso2 interfaces.
+
+smackload:
+	properly formats data for writing to smackfs/load
+
+smackcipso:
+	properly formats data for writing to smackfs/cipso
+
+In keeping with the intent of Smack, configuration data is
+minimal and not strictly required. The most important
+configuration step is mounting the smackfs pseudo filesystem.
+If smackutil is installed the startup script will take care
+of this, but it can be manually as well.
+
+Add this line to ``/etc/fstab``::
+
+    smackfs /sys/fs/smackfs smackfs defaults 0 0
+
+The ``/sys/fs/smackfs`` directory is created by the kernel.
+
+Smack uses extended attributes (xattrs) to store labels on filesystem
+objects. The attributes are stored in the extended attribute security
+name space. A process must have ``CAP_MAC_ADMIN`` to change any of these
+attributes.
+
+The extended attributes that Smack uses are:
+
+SMACK64
+	Used to make access control decisions. In almost all cases
+	the label given to a new filesystem object will be the label
+	of the process that created it.
+
+SMACK64EXEC
+	The Smack label of a process that execs a program file with
+	this attribute set will run with this attribute's value.
+
+SMACK64MMAP
+	Don't allow the file to be mmapped by a process whose Smack
+	label does not allow all of the access permitted to a process
+	with the label contained in this attribute. This is a very
+	specific use case for shared libraries.
+
+SMACK64TRANSMUTE
+	Can only have the value "TRUE". If this attribute is present
+	on a directory when an object is created in the directory and
+	the Smack rule (more below) that permitted the write access
+	to the directory includes the transmute ("t") mode the object
+	gets the label of the directory instead of the label of the
+	creating process. If the object being created is a directory
+	the SMACK64TRANSMUTE attribute is set as well.
+
+SMACK64IPIN
+	This attribute is only available on file descriptors for sockets.
+	Use the Smack label in this attribute for access control
+	decisions on packets being delivered to this socket.
+
+SMACK64IPOUT
+	This attribute is only available on file descriptors for sockets.
+	Use the Smack label in this attribute for access control
+	decisions on packets coming from this socket.
+
+There are multiple ways to set a Smack label on a file::
+
+    # attr -S -s SMACK64 -V "value" path
+    # chsmack -a value path
+
+A process can see the Smack label it is running with by
+reading ``/proc/self/attr/current``. A process with ``CAP_MAC_ADMIN``
+can set the process Smack by writing there.
+
+Most Smack configuration is accomplished by writing to files
+in the smackfs filesystem. This pseudo-filesystem is mounted
+on ``/sys/fs/smackfs``.
+
+access
+	Provided for backward compatibility. The access2 interface
+	is preferred and should be used instead.
+	This interface reports whether a subject with the specified
+	Smack label has a particular access to an object with a
+	specified Smack label. Write a fixed format access rule to
+	this file. The next read will indicate whether the access
+	would be permitted. The text will be either "1" indicating
+	access, or "0" indicating denial.
+
+access2
+	This interface reports whether a subject with the specified
+	Smack label has a particular access to an object with a
+	specified Smack label. Write a long format access rule to
+	this file. The next read will indicate whether the access
+	would be permitted. The text will be either "1" indicating
+	access, or "0" indicating denial.
+
+ambient
+	This contains the Smack label applied to unlabeled network
+	packets.
+
+change-rule
+	This interface allows modification of existing access control rules.
+	The format accepted on write is::
+
+		"%s %s %s %s"
+
+	where the first string is the subject label, the second the
+	object label, the third the access to allow and the fourth the
+	access to deny. The access strings may contain only the characters
+	"rwxat-". If a rule for a given subject and object exists it will be
+	modified by enabling the permissions in the third string and disabling
+	those in the fourth string. If there is no such rule it will be
+	created using the access specified in the third and the fourth strings.
+
+cipso
+	Provided for backward compatibility. The cipso2 interface
+	is preferred and should be used instead.
+	This interface allows a specific CIPSO header to be assigned
+	to a Smack label. The format accepted on write is::
+
+		"%24s%4d%4d"["%4d"]...
+
+	The first string is a fixed Smack label. The first number is
+	the level to use. The second number is the number of categories.
+	The following numbers are the categories::
+
+		"level-3-cats-5-19          3   2   5  19"
+
+cipso2
+	This interface allows a specific CIPSO header to be assigned
+	to a Smack label. The format accepted on write is::
+
+		"%s%4d%4d"["%4d"]...
+
+	The first string is a long Smack label. The first number is
+	the level to use. The second number is the number of categories.
+	The following numbers are the categories::
+
+		"level-3-cats-5-19   3   2   5  19"
+
+direct
+	This contains the CIPSO level used for Smack direct label
+	representation in network packets.
+
+doi
+	This contains the CIPSO domain of interpretation used in
+	network packets.
+
+ipv6host
+	This interface allows specific IPv6 internet addresses to be
+	treated as single label hosts. Packets are sent to single
+	label hosts only from processes that have Smack write access
+	to the host label. All packets received from single label hosts
+	are given the specified label. The format accepted on write is::
+
+		"%h:%h:%h:%h:%h:%h:%h:%h label" or
+		"%h:%h:%h:%h:%h:%h:%h:%h/%d label".
+
+	The "::" address shortcut is not supported.
+	If label is "-DELETE" a matched entry will be deleted.
+
+load
+	Provided for backward compatibility. The load2 interface
+	is preferred and should be used instead.
+	This interface allows access control rules in addition to
+	the system defined rules to be specified. The format accepted
+	on write is::
+
+		"%24s%24s%5s"
+
+	where the first string is the subject label, the second the
+	object label, and the third the requested access. The access
+	string may contain only the characters "rwxat-", and specifies
+	which sort of access is allowed. The "-" is a placeholder for
+	permissions that are not allowed. The string "r-x--" would
+	specify read and execute access. Labels are limited to 23
+	characters in length.
+
+load2
+	This interface allows access control rules in addition to
+	the system defined rules to be specified. The format accepted
+	on write is::
+
+		"%s %s %s"
+
+	where the first string is the subject label, the second the
+	object label, and the third the requested access. The access
+	string may contain only the characters "rwxat-", and specifies
+	which sort of access is allowed. The "-" is a placeholder for
+	permissions that are not allowed. The string "r-x--" would
+	specify read and execute access.
+
+load-self
+	Provided for backward compatibility. The load-self2 interface
+	is preferred and should be used instead.
+	This interface allows process specific access rules to be
+	defined. These rules are only consulted if access would
+	otherwise be permitted, and are intended to provide additional
+	restrictions on the process. The format is the same as for
+	the load interface.
+
+load-self2
+	This interface allows process specific access rules to be
+	defined. These rules are only consulted if access would
+	otherwise be permitted, and are intended to provide additional
+	restrictions on the process. The format is the same as for
+	the load2 interface.
+
+logging
+	This contains the Smack logging state.
+
+mapped
+	This contains the CIPSO level used for Smack mapped label
+	representation in network packets.
+
+netlabel
+	This interface allows specific internet addresses to be
+	treated as single label hosts. Packets are sent to single
+	label hosts without CIPSO headers, but only from processes
+	that have Smack write access to the host label. All packets
+	received from single label hosts are given the specified
+	label. The format accepted on write is::
+
+		"%d.%d.%d.%d label" or "%d.%d.%d.%d/%d label".
+
+	If the label specified is "-CIPSO" the address is treated
+	as a host that supports CIPSO headers.
+
+onlycap
+	This contains labels processes must have for CAP_MAC_ADMIN
+	and ``CAP_MAC_OVERRIDE`` to be effective. If this file is empty
+	these capabilities are effective at for processes with any
+	label. The values are set by writing the desired labels, separated
+	by spaces, to the file or cleared by writing "-" to the file.
+
+ptrace
+	This is used to define the current ptrace policy
+
+	0 - default:
+	    this is the policy that relies on Smack access rules.
+	    For the ``PTRACE_READ`` a subject needs to have a read access on
+	    object. For the ``PTRACE_ATTACH`` a read-write access is required.
+
+	1 - exact:
+	    this is the policy that limits ``PTRACE_ATTACH``. Attach is
+	    only allowed when subject's and object's labels are equal.
+	    ``PTRACE_READ`` is not affected. Can be overridden with ``CAP_SYS_PTRACE``.
+
+	2 - draconian:
+	    this policy behaves like the 'exact' above with an
+	    exception that it can't be overridden with ``CAP_SYS_PTRACE``.
+
+revoke-subject
+	Writing a Smack label here sets the access to '-' for all access
+	rules with that subject label.
+
+unconfined
+	If the kernel is configured with ``CONFIG_SECURITY_SMACK_BRINGUP``
+	a process with ``CAP_MAC_ADMIN`` can write a label into this interface.
+	Thereafter, accesses that involve that label will be logged and
+	the access permitted if it wouldn't be otherwise. Note that this
+	is dangerous and can ruin the proper labeling of your system.
+	It should never be used in production.
+
+relabel-self
+	This interface contains a list of labels to which the process can
+	transition to, by writing to ``/proc/self/attr/current``.
+	Normally a process can change its own label to any legal value, but only
+	if it has ``CAP_MAC_ADMIN``. This interface allows a process without
+	``CAP_MAC_ADMIN`` to relabel itself to one of labels from predefined list.
+	A process without ``CAP_MAC_ADMIN`` can change its label only once. When it
+	does, this list will be cleared.
+	The values are set by writing the desired labels, separated
+	by spaces, to the file or cleared by writing "-" to the file.
+
+If you are using the smackload utility
+you can add access rules in ``/etc/smack/accesses``. They take the form::
+
+    subjectlabel objectlabel access
+
+access is a combination of the letters rwxatb which specify the
+kind of access permitted a subject with subjectlabel on an
+object with objectlabel. If there is no rule no access is allowed.
+
+Look for additional programs on http://schaufler-ca.com
+
+The Simplified Mandatory Access Control Kernel (Whitepaper)
+===========================================================
+
+Casey Schaufler
+casey@schaufler-ca.com
+
+Mandatory Access Control
+------------------------
+
+Computer systems employ a variety of schemes to constrain how information is
+shared among the people and services using the machine. Some of these schemes
+allow the program or user to decide what other programs or users are allowed
+access to pieces of data. These schemes are called discretionary access
+control mechanisms because the access control is specified at the discretion
+of the user. Other schemes do not leave the decision regarding what a user or
+program can access up to users or programs. These schemes are called mandatory
+access control mechanisms because you don't have a choice regarding the users
+or programs that have access to pieces of data.
+
+Bell & LaPadula
+---------------
+
+From the middle of the 1980's until the turn of the century Mandatory Access
+Control (MAC) was very closely associated with the Bell & LaPadula security
+model, a mathematical description of the United States Department of Defense
+policy for marking paper documents. MAC in this form enjoyed a following
+within the Capital Beltway and Scandinavian supercomputer centers but was
+often sited as failing to address general needs.
+
+Domain Type Enforcement
+-----------------------
+
+Around the turn of the century Domain Type Enforcement (DTE) became popular.
+This scheme organizes users, programs, and data into domains that are
+protected from each other. This scheme has been widely deployed as a component
+of popular Linux distributions. The administrative overhead required to
+maintain this scheme and the detailed understanding of the whole system
+necessary to provide a secure domain mapping leads to the scheme being
+disabled or used in limited ways in the majority of cases.
+
+Smack
+-----
+
+Smack is a Mandatory Access Control mechanism designed to provide useful MAC
+while avoiding the pitfalls of its predecessors. The limitations of Bell &
+LaPadula are addressed by providing a scheme whereby access can be controlled
+according to the requirements of the system and its purpose rather than those
+imposed by an arcane government policy. The complexity of Domain Type
+Enforcement and avoided by defining access controls in terms of the access
+modes already in use.
+
+Smack Terminology
+-----------------
+
+The jargon used to talk about Smack will be familiar to those who have dealt
+with other MAC systems and shouldn't be too difficult for the uninitiated to
+pick up. There are four terms that are used in a specific way and that are
+especially important:
+
+  Subject:
+	A subject is an active entity on the computer system.
+	On Smack a subject is a task, which is in turn the basic unit
+	of execution.
+
+  Object:
+	An object is a passive entity on the computer system.
+	On Smack files of all types, IPC, and tasks can be objects.
+
+  Access:
+	Any attempt by a subject to put information into or get
+	information from an object is an access.
+
+  Label:
+	Data that identifies the Mandatory Access Control
+	characteristics of a subject or an object.
+
+These definitions are consistent with the traditional use in the security
+community. There are also some terms from Linux that are likely to crop up:
+
+  Capability:
+	A task that possesses a capability has permission to
+	violate an aspect of the system security policy, as identified by
+	the specific capability. A task that possesses one or more
+	capabilities is a privileged task, whereas a task with no
+	capabilities is an unprivileged task.
+
+  Privilege:
+	A task that is allowed to violate the system security
+	policy is said to have privilege. As of this writing a task can
+	have privilege either by possessing capabilities or by having an
+	effective user of root.
+
+Smack Basics
+------------
+
+Smack is an extension to a Linux system. It enforces additional restrictions
+on what subjects can access which objects, based on the labels attached to
+each of the subject and the object.
+
+Labels
+~~~~~~
+
+Smack labels are ASCII character strings. They can be up to 255 characters
+long, but keeping them to twenty-three characters is recommended.
+Single character labels using special characters, that being anything
+other than a letter or digit, are reserved for use by the Smack development
+team. Smack labels are unstructured, case sensitive, and the only operation
+ever performed on them is comparison for equality. Smack labels cannot
+contain unprintable characters, the "/" (slash), the "\" (backslash), the "'"
+(quote) and '"' (double-quote) characters.
+Smack labels cannot begin with a '-'. This is reserved for special options.
+
+There are some predefined labels::
+
+	_ 	Pronounced "floor", a single underscore character.
+	^ 	Pronounced "hat", a single circumflex character.
+	* 	Pronounced "star", a single asterisk character.
+	? 	Pronounced "huh", a single question mark character.
+	@ 	Pronounced "web", a single at sign character.
+
+Every task on a Smack system is assigned a label. The Smack label
+of a process will usually be assigned by the system initialization
+mechanism.
+
+Access Rules
+~~~~~~~~~~~~
+
+Smack uses the traditional access modes of Linux. These modes are read,
+execute, write, and occasionally append. There are a few cases where the
+access mode may not be obvious. These include:
+
+  Signals:
+	A signal is a write operation from the subject task to
+	the object task.
+
+  Internet Domain IPC:
+	Transmission of a packet is considered a
+	write operation from the source task to the destination task.
+
+Smack restricts access based on the label attached to a subject and the label
+attached to the object it is trying to access. The rules enforced are, in
+order:
+
+	1. Any access requested by a task labeled "*" is denied.
+	2. A read or execute access requested by a task labeled "^"
+	   is permitted.
+	3. A read or execute access requested on an object labeled "_"
+	   is permitted.
+	4. Any access requested on an object labeled "*" is permitted.
+	5. Any access requested by a task on an object with the same
+	   label is permitted.
+	6. Any access requested that is explicitly defined in the loaded
+	   rule set is permitted.
+	7. Any other access is denied.
+
+Smack Access Rules
+~~~~~~~~~~~~~~~~~~
+
+With the isolation provided by Smack access separation is simple. There are
+many interesting cases where limited access by subjects to objects with
+different labels is desired. One example is the familiar spy model of
+sensitivity, where a scientist working on a highly classified project would be
+able to read documents of lower classifications and anything she writes will
+be "born" highly classified. To accommodate such schemes Smack includes a
+mechanism for specifying rules allowing access between labels.
+
+Access Rule Format
+~~~~~~~~~~~~~~~~~~
+
+The format of an access rule is::
+
+	subject-label object-label access
+
+Where subject-label is the Smack label of the task, object-label is the Smack
+label of the thing being accessed, and access is a string specifying the sort
+of access allowed. The access specification is searched for letters that
+describe access modes:
+
+	a: indicates that append access should be granted.
+	r: indicates that read access should be granted.
+	w: indicates that write access should be granted.
+	x: indicates that execute access should be granted.
+	t: indicates that the rule requests transmutation.
+	b: indicates that the rule should be reported for bring-up.
+
+Uppercase values for the specification letters are allowed as well.
+Access mode specifications can be in any order. Examples of acceptable rules
+are::
+
+	TopSecret Secret  rx
+	Secret    Unclass R
+	Manager   Game    x
+	User      HR      w
+	Snap      Crackle rwxatb
+	New       Old     rRrRr
+	Closed    Off     -
+
+Examples of unacceptable rules are::
+
+	Top Secret Secret     rx
+	Ace        Ace        r
+	Odd        spells     waxbeans
+
+Spaces are not allowed in labels. Since a subject always has access to files
+with the same label specifying a rule for that case is pointless. Only
+valid letters (rwxatbRWXATB) and the dash ('-') character are allowed in
+access specifications. The dash is a placeholder, so "a-r" is the same
+as "ar". A lone dash is used to specify that no access should be allowed.
+
+Applying Access Rules
+~~~~~~~~~~~~~~~~~~~~~
+
+The developers of Linux rarely define new sorts of things, usually importing
+schemes and concepts from other systems. Most often, the other systems are
+variants of Unix. Unix has many endearing properties, but consistency of
+access control models is not one of them. Smack strives to treat accesses as
+uniformly as is sensible while keeping with the spirit of the underlying
+mechanism.
+
+File system objects including files, directories, named pipes, symbolic links,
+and devices require access permissions that closely match those used by mode
+bit access. To open a file for reading read access is required on the file. To
+search a directory requires execute access. Creating a file with write access
+requires both read and write access on the containing directory. Deleting a
+file requires read and write access to the file and to the containing
+directory. It is possible that a user may be able to see that a file exists
+but not any of its attributes by the circumstance of having read access to the
+containing directory but not to the differently labeled file. This is an
+artifact of the file name being data in the directory, not a part of the file.
+
+If a directory is marked as transmuting (SMACK64TRANSMUTE=TRUE) and the
+access rule that allows a process to create an object in that directory
+includes 't' access the label assigned to the new object will be that
+of the directory, not the creating process. This makes it much easier
+for two processes with different labels to share data without granting
+access to all of their files.
+
+IPC objects, message queues, semaphore sets, and memory segments exist in flat
+namespaces and access requests are only required to match the object in
+question.
+
+Process objects reflect tasks on the system and the Smack label used to access
+them is the same Smack label that the task would use for its own access
+attempts. Sending a signal via the kill() system call is a write operation
+from the signaler to the recipient. Debugging a process requires both reading
+and writing. Creating a new task is an internal operation that results in two
+tasks with identical Smack labels and requires no access checks.
+
+Sockets are data structures attached to processes and sending a packet from
+one process to another requires that the sender have write access to the
+receiver. The receiver is not required to have read access to the sender.
+
+Setting Access Rules
+~~~~~~~~~~~~~~~~~~~~
+
+The configuration file /etc/smack/accesses contains the rules to be set at
+system startup. The contents are written to the special file
+/sys/fs/smackfs/load2. Rules can be added at any time and take effect
+immediately. For any pair of subject and object labels there can be only
+one rule, with the most recently specified overriding any earlier
+specification.
+
+Task Attribute
+~~~~~~~~~~~~~~
+
+The Smack label of a process can be read from /proc/<pid>/attr/current. A
+process can read its own Smack label from /proc/self/attr/current. A
+privileged process can change its own Smack label by writing to
+/proc/self/attr/current but not the label of another process.
+
+File Attribute
+~~~~~~~~~~~~~~
+
+The Smack label of a filesystem object is stored as an extended attribute
+named SMACK64 on the file. This attribute is in the security namespace. It can
+only be changed by a process with privilege.
+
+Privilege
+~~~~~~~~~
+
+A process with CAP_MAC_OVERRIDE or CAP_MAC_ADMIN is privileged.
+CAP_MAC_OVERRIDE allows the process access to objects it would
+be denied otherwise. CAP_MAC_ADMIN allows a process to change
+Smack data, including rules and attributes.
+
+Smack Networking
+~~~~~~~~~~~~~~~~
+
+As mentioned before, Smack enforces access control on network protocol
+transmissions. Every packet sent by a Smack process is tagged with its Smack
+label. This is done by adding a CIPSO tag to the header of the IP packet. Each
+packet received is expected to have a CIPSO tag that identifies the label and
+if it lacks such a tag the network ambient label is assumed. Before the packet
+is delivered a check is made to determine that a subject with the label on the
+packet has write access to the receiving process and if that is not the case
+the packet is dropped.
+
+CIPSO Configuration
+~~~~~~~~~~~~~~~~~~~
+
+It is normally unnecessary to specify the CIPSO configuration. The default
+values used by the system handle all internal cases. Smack will compose CIPSO
+label values to match the Smack labels being used without administrative
+intervention. Unlabeled packets that come into the system will be given the
+ambient label.
+
+Smack requires configuration in the case where packets from a system that is
+not Smack that speaks CIPSO may be encountered. Usually this will be a Trusted
+Solaris system, but there are other, less widely deployed systems out there.
+CIPSO provides 3 important values, a Domain Of Interpretation (DOI), a level,
+and a category set with each packet. The DOI is intended to identify a group
+of systems that use compatible labeling schemes, and the DOI specified on the
+Smack system must match that of the remote system or packets will be
+discarded. The DOI is 3 by default. The value can be read from
+/sys/fs/smackfs/doi and can be changed by writing to /sys/fs/smackfs/doi.
+
+The label and category set are mapped to a Smack label as defined in
+/etc/smack/cipso.
+
+A Smack/CIPSO mapping has the form::
+
+	smack level [category [category]*]
+
+Smack does not expect the level or category sets to be related in any
+particular way and does not assume or assign accesses based on them. Some
+examples of mappings::
+
+	TopSecret 7
+	TS:A,B    7 1 2
+	SecBDE    5 2 4 6
+	RAFTERS   7 12 26
+
+The ":" and "," characters are permitted in a Smack label but have no special
+meaning.
+
+The mapping of Smack labels to CIPSO values is defined by writing to
+/sys/fs/smackfs/cipso2.
+
+In addition to explicit mappings Smack supports direct CIPSO mappings. One
+CIPSO level is used to indicate that the category set passed in the packet is
+in fact an encoding of the Smack label. The level used is 250 by default. The
+value can be read from /sys/fs/smackfs/direct and changed by writing to
+/sys/fs/smackfs/direct.
+
+Socket Attributes
+~~~~~~~~~~~~~~~~~
+
+There are two attributes that are associated with sockets. These attributes
+can only be set by privileged tasks, but any task can read them for their own
+sockets.
+
+  SMACK64IPIN:
+	The Smack label of the task object. A privileged
+	program that will enforce policy may set this to the star label.
+
+  SMACK64IPOUT:
+	The Smack label transmitted with outgoing packets.
+	A privileged program may set this to match the label of another
+	task with which it hopes to communicate.
+
+Smack Netlabel Exceptions
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You will often find that your labeled application has to talk to the outside,
+unlabeled world. To do this there's a special file /sys/fs/smackfs/netlabel
+where you can add some exceptions in the form of::
+
+	@IP1	   LABEL1 or
+	@IP2/MASK  LABEL2
+
+It means that your application will have unlabeled access to @IP1 if it has
+write access on LABEL1, and access to the subnet @IP2/MASK if it has write
+access on LABEL2.
+
+Entries in the /sys/fs/smackfs/netlabel file are matched by longest mask
+first, like in classless IPv4 routing.
+
+A special label '@' and an option '-CIPSO' can be used there::
+
+	@      means Internet, any application with any label has access to it
+	-CIPSO means standard CIPSO networking
+
+If you don't know what CIPSO is and don't plan to use it, you can just do::
+
+	echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel
+	echo 0.0.0.0/0 @      > /sys/fs/smackfs/netlabel
+
+If you use CIPSO on your 192.168.0.0/16 local network and need also unlabeled
+Internet access, you can have::
+
+	echo 127.0.0.1      -CIPSO > /sys/fs/smackfs/netlabel
+	echo 192.168.0.0/16 -CIPSO > /sys/fs/smackfs/netlabel
+	echo 0.0.0.0/0      @      > /sys/fs/smackfs/netlabel
+
+Writing Applications for Smack
+------------------------------
+
+There are three sorts of applications that will run on a Smack system. How an
+application interacts with Smack will determine what it will have to do to
+work properly under Smack.
+
+Smack Ignorant Applications
+---------------------------
+
+By far the majority of applications have no reason whatever to care about the
+unique properties of Smack. Since invoking a program has no impact on the
+Smack label associated with the process the only concern likely to arise is
+whether the process has execute access to the program.
+
+Smack Relevant Applications
+---------------------------
+
+Some programs can be improved by teaching them about Smack, but do not make
+any security decisions themselves. The utility ls(1) is one example of such a
+program.
+
+Smack Enforcing Applications
+----------------------------
+
+These are special programs that not only know about Smack, but participate in
+the enforcement of system policy. In most cases these are the programs that
+set up user sessions. There are also network services that provide information
+to processes running with various labels.
+
+File System Interfaces
+----------------------
+
+Smack maintains labels on file system objects using extended attributes. The
+Smack label of a file, directory, or other file system object can be obtained
+using getxattr(2)::
+
+	len = getxattr("/", "security.SMACK64", value, sizeof (value));
+
+will put the Smack label of the root directory into value. A privileged
+process can set the Smack label of a file system object with setxattr(2)::
+
+	len = strlen("Rubble");
+	rc = setxattr("/foo", "security.SMACK64", "Rubble", len, 0);
+
+will set the Smack label of /foo to "Rubble" if the program has appropriate
+privilege.
+
+Socket Interfaces
+-----------------
+
+The socket attributes can be read using fgetxattr(2).
+
+A privileged process can set the Smack label of outgoing packets with
+fsetxattr(2)::
+
+	len = strlen("Rubble");
+	rc = fsetxattr(fd, "security.SMACK64IPOUT", "Rubble", len, 0);
+
+will set the Smack label "Rubble" on packets going out from the socket if the
+program has appropriate privilege::
+
+	rc = fsetxattr(fd, "security.SMACK64IPIN, "*", strlen("*"), 0);
+
+will set the Smack label "*" as the object label against which incoming
+packets will be checked if the program has appropriate privilege.
+
+Administration
+--------------
+
+Smack supports some mount options:
+
+  smackfsdef=label:
+	specifies the label to give files that lack
+	the Smack label extended attribute.
+
+  smackfsroot=label:
+	specifies the label to assign the root of the
+	file system if it lacks the Smack extended attribute.
+
+  smackfshat=label:
+	specifies a label that must have read access to
+	all labels set on the filesystem. Not yet enforced.
+
+  smackfsfloor=label:
+	specifies a label to which all labels set on the
+	filesystem must have read access. Not yet enforced.
+
+These mount options apply to all file system types.
+
+Smack auditing
+--------------
+
+If you want Smack auditing of security events, you need to set CONFIG_AUDIT
+in your kernel configuration.
+By default, all denied events will be audited. You can change this behavior by
+writing a single character to the /sys/fs/smackfs/logging file::
+
+	0 : no logging
+	1 : log denied (default)
+	2 : log accepted
+	3 : log denied & accepted
+
+Events are logged as 'key=value' pairs, for each event you at least will get
+the subject, the object, the rights requested, the action, the kernel function
+that triggered the event, plus other pairs depending on the type of event
+audited.
+
+Bringup Mode
+------------
+
+Bringup mode provides logging features that can make application
+configuration and system bringup easier. Configure the kernel with
+CONFIG_SECURITY_SMACK_BRINGUP to enable these features. When bringup
+mode is enabled accesses that succeed due to rules marked with the "b"
+access mode will logged. When a new label is introduced for processes
+rules can be added aggressively, marked with the "b". The logging allows
+tracking of which rules actual get used for that label.
+
+Another feature of bringup mode is the "unconfined" option. Writing
+a label to /sys/fs/smackfs/unconfined makes subjects with that label
+able to access any object, and objects with that label accessible to
+all subjects. Any access that is granted because a label is unconfined
+is logged. This feature is dangerous, as files and directories may
+be created in places they couldn't if the policy were being enforced.
diff --git a/Documentation/admin-guide/LSM/Yama.rst b/Documentation/admin-guide/LSM/Yama.rst
new file mode 100644
index 000000000..13468ea69
--- /dev/null
+++ b/Documentation/admin-guide/LSM/Yama.rst
@@ -0,0 +1,74 @@
+====
+Yama
+====
+
+Yama is a Linux Security Module that collects system-wide DAC security
+protections that are not handled by the core kernel itself. This is
+selectable at build-time with ``CONFIG_SECURITY_YAMA``, and can be controlled
+at run-time through sysctls in ``/proc/sys/kernel/yama``:
+
+ptrace_scope
+============
+
+As Linux grows in popularity, it will become a larger target for
+malware. One particularly troubling weakness of the Linux process
+interfaces is that a single user is able to examine the memory and
+running state of any of their processes. For example, if one application
+(e.g. Pidgin) was compromised, it would be possible for an attacker to
+attach to other running processes (e.g. Firefox, SSH sessions, GPG agent,
+etc) to extract additional credentials and continue to expand the scope
+of their attack without resorting to user-assisted phishing.
+
+This is not a theoretical problem. SSH session hijacking
+(http://www.storm.net.nz/projects/7) and arbitrary code injection
+(http://c-skills.blogspot.com/2007/05/injectso.html) attacks already
+exist and remain possible if ptrace is allowed to operate as before.
+Since ptrace is not commonly used by non-developers and non-admins, system
+builders should be allowed the option to disable this debugging system.
+
+For a solution, some applications use ``prctl(PR_SET_DUMPABLE, ...)`` to
+specifically disallow such ptrace attachment (e.g. ssh-agent), but many
+do not. A more general solution is to only allow ptrace directly from a
+parent to a child process (i.e. direct "gdb EXE" and "strace EXE" still
+work), or with ``CAP_SYS_PTRACE`` (i.e. "gdb --pid=PID", and "strace -p PID"
+still work as root).
+
+In mode 1, software that has defined application-specific relationships
+between a debugging process and its inferior (crash handlers, etc),
+``prctl(PR_SET_PTRACER, pid, ...)`` can be used. An inferior can declare which
+other process (and its descendants) are allowed to call ``PTRACE_ATTACH``
+against it. Only one such declared debugging process can exists for
+each inferior at a time. For example, this is used by KDE, Chromium, and
+Firefox's crash handlers, and by Wine for allowing only Wine processes
+to ptrace each other. If a process wishes to entirely disable these ptrace
+restrictions, it can call ``prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, ...)``
+so that any otherwise allowed process (even those in external pid namespaces)
+may attach.
+
+The sysctl settings (writable only with ``CAP_SYS_PTRACE``) are:
+
+0 - classic ptrace permissions:
+    a process can ``PTRACE_ATTACH`` to any other
+    process running under the same uid, as long as it is dumpable (i.e.
+    did not transition uids, start privileged, or have called
+    ``prctl(PR_SET_DUMPABLE...)`` already). Similarly, ``PTRACE_TRACEME`` is
+    unchanged.
+
+1 - restricted ptrace:
+    a process must have a predefined relationship
+    with the inferior it wants to call ``PTRACE_ATTACH`` on. By default,
+    this relationship is that of only its descendants when the above
+    classic criteria is also met. To change the relationship, an
+    inferior can call ``prctl(PR_SET_PTRACER, debugger, ...)`` to declare
+    an allowed debugger PID to call ``PTRACE_ATTACH`` on the inferior.
+    Using ``PTRACE_TRACEME`` is unchanged.
+
+2 - admin-only attach:
+    only processes with ``CAP_SYS_PTRACE`` may use ptrace
+    with ``PTRACE_ATTACH``, or through children calling ``PTRACE_TRACEME``.
+
+3 - no attach:
+    no processes may use ptrace with ``PTRACE_ATTACH`` nor via
+    ``PTRACE_TRACEME``. Once set, this sysctl value cannot be changed.
+
+The original children-only logic was based on the restrictions in grsecurity.
diff --git a/Documentation/admin-guide/LSM/apparmor.rst b/Documentation/admin-guide/LSM/apparmor.rst
new file mode 100644
index 000000000..6cf81bbd7
--- /dev/null
+++ b/Documentation/admin-guide/LSM/apparmor.rst
@@ -0,0 +1,51 @@
+========
+AppArmor
+========
+
+What is AppArmor?
+=================
+
+AppArmor is MAC style security extension for the Linux kernel.  It implements
+a task centered policy, with task "profiles" being created and loaded
+from user space.  Tasks on the system that do not have a profile defined for
+them run in an unconfined state which is equivalent to standard Linux DAC
+permissions.
+
+How to enable/disable
+=====================
+
+set ``CONFIG_SECURITY_APPARMOR=y``
+
+If AppArmor should be selected as the default security module then set::
+
+   CONFIG_DEFAULT_SECURITY="apparmor"
+   CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1
+
+Build the kernel
+
+If AppArmor is not the default security module it can be enabled by passing
+``security=apparmor`` on the kernel's command line.
+
+If AppArmor is the default security module it can be disabled by passing
+``apparmor=0, security=XXXX`` (where ``XXXX`` is valid security module), on the
+kernel's command line.
+
+For AppArmor to enforce any restrictions beyond standard Linux DAC permissions
+policy must be loaded into the kernel from user space (see the Documentation
+and tools links).
+
+Documentation
+=============
+
+Documentation can be found on the wiki, linked below.
+
+Links
+=====
+
+Mailing List - apparmor@lists.ubuntu.com
+
+Wiki - http://wiki.apparmor.net
+
+User space tools - https://gitlab.com/apparmor
+
+Kernel module - git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst
new file mode 100644
index 000000000..c980dfe9a
--- /dev/null
+++ b/Documentation/admin-guide/LSM/index.rst
@@ -0,0 +1,41 @@
+===========================
+Linux Security Module Usage
+===========================
+
+The Linux Security Module (LSM) framework provides a mechanism for
+various security checks to be hooked by new kernel extensions. The name
+"module" is a bit of a misnomer since these extensions are not actually
+loadable kernel modules. Instead, they are selectable at build-time via
+CONFIG_DEFAULT_SECURITY and can be overridden at boot-time via the
+``"security=..."`` kernel command line argument, in the case where multiple
+LSMs were built into a given kernel.
+
+The primary users of the LSM interface are Mandatory Access Control
+(MAC) extensions which provide a comprehensive security policy. Examples
+include SELinux, Smack, Tomoyo, and AppArmor. In addition to the larger
+MAC extensions, other extensions can be built using the LSM to provide
+specific changes to system operation when these tweaks are not available
+in the core functionality of Linux itself.
+
+Without a specific LSM built into the kernel, the default LSM will be the
+Linux capabilities system. Most LSMs choose to extend the capabilities
+system, building their checks on top of the defined capability hooks.
+For more details on capabilities, see ``capabilities(7)`` in the Linux
+man-pages project.
+
+A list of the active security modules can be found by reading
+``/sys/kernel/security/lsm``. This is a comma separated list, and
+will always include the capability module. The list reflects the
+order in which checks are made. The capability module will always
+be first, followed by any "minor" modules (e.g. Yama) and then
+the one "major" module (e.g. SELinux) if there is one configured.
+
+.. toctree::
+   :maxdepth: 1
+
+   apparmor
+   LoadPin
+   SELinux
+   Smack
+   tomoyo
+   Yama
diff --git a/Documentation/admin-guide/LSM/tomoyo.rst b/Documentation/admin-guide/LSM/tomoyo.rst
new file mode 100644
index 000000000..e2d6b6e15
--- /dev/null
+++ b/Documentation/admin-guide/LSM/tomoyo.rst
@@ -0,0 +1,65 @@
+======
+TOMOYO
+======
+
+What is TOMOYO?
+===============
+
+TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel.
+
+LiveCD-based tutorials are available at
+
+http://tomoyo.sourceforge.jp/1.8/ubuntu12.04-live.html
+http://tomoyo.sourceforge.jp/1.8/centos6-live.html
+
+Though these tutorials use non-LSM version of TOMOYO, they are useful for you
+to know what TOMOYO is.
+
+How to enable TOMOYO?
+=====================
+
+Build the kernel with ``CONFIG_SECURITY_TOMOYO=y`` and pass ``security=tomoyo`` on
+kernel's command line.
+
+Please see http://tomoyo.osdn.jp/2.5/ for details.
+
+Where is documentation?
+=======================
+
+User <-> Kernel interface documentation is available at
+http://tomoyo.osdn.jp/2.5/policy-specification/index.html .
+
+Materials we prepared for seminars and symposiums are available at
+http://osdn.jp/projects/tomoyo/docs/?category_id=532&language_id=1 .
+Below lists are chosen from three aspects.
+
+What is TOMOYO?
+  TOMOYO Linux Overview
+    http://osdn.jp/projects/tomoyo/docs/lca2009-takeda.pdf
+  TOMOYO Linux: pragmatic and manageable security for Linux
+    http://osdn.jp/projects/tomoyo/docs/freedomhectaipei-tomoyo.pdf
+  TOMOYO Linux: A Practical Method to Understand and Protect Your Own Linux Box
+    http://osdn.jp/projects/tomoyo/docs/PacSec2007-en-no-demo.pdf
+
+What can TOMOYO do?
+  Deep inside TOMOYO Linux
+    http://osdn.jp/projects/tomoyo/docs/lca2009-kumaneko.pdf
+  The role of "pathname based access control" in security.
+    http://osdn.jp/projects/tomoyo/docs/lfj2008-bof.pdf
+
+History of TOMOYO?
+  Realities of Mainlining
+    http://osdn.jp/projects/tomoyo/docs/lfj2008.pdf
+
+What is future plan?
+====================
+
+We believe that inode based security and name based security are complementary
+and both should be used together. But unfortunately, so far, we cannot enable
+multiple LSM modules at the same time. We feel sorry that you have to give up
+SELinux/SMACK/AppArmor etc. when you want to use TOMOYO.
+
+We hope that LSM becomes stackable in future. Meanwhile, you can use non-LSM
+version of TOMOYO, available at http://tomoyo.osdn.jp/1.8/ .
+LSM version of TOMOYO is a subset of non-LSM version of TOMOYO. We are planning
+to port non-LSM version's functionalities to LSM versions.
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
new file mode 100644
index 000000000..15ea785b2
--- /dev/null
+++ b/Documentation/admin-guide/README.rst
@@ -0,0 +1,409 @@
+.. _readme:
+
+Linux kernel release 4.x <http://kernel.org/>
+=============================================
+
+These are the release notes for Linux version 4.  Read them carefully,
+as they tell you what this is all about, explain how to install the
+kernel, and what to do if something goes wrong.
+
+What is Linux?
+--------------
+
+  Linux is a clone of the operating system Unix, written from scratch by
+  Linus Torvalds with assistance from a loosely-knit team of hackers across
+  the Net. It aims towards POSIX and Single UNIX Specification compliance.
+
+  It has all the features you would expect in a modern fully-fledged Unix,
+  including true multitasking, virtual memory, shared libraries, demand
+  loading, shared copy-on-write executables, proper memory management,
+  and multistack networking including IPv4 and IPv6.
+
+  It is distributed under the GNU General Public License v2 - see the
+  accompanying COPYING file for more details.
+
+On what hardware does it run?
+-----------------------------
+
+  Although originally developed first for 32-bit x86-based PCs (386 or higher),
+  today Linux also runs on (at least) the Compaq Alpha AXP, Sun SPARC and
+  UltraSPARC, Motorola 68000, PowerPC, PowerPC64, ARM, Hitachi SuperH, Cell,
+  IBM S/390, MIPS, HP PA-RISC, Intel IA-64, DEC VAX, AMD x86-64 Xtensa, and
+  ARC architectures.
+
+  Linux is easily portable to most general-purpose 32- or 64-bit architectures
+  as long as they have a paged memory management unit (PMMU) and a port of the
+  GNU C compiler (gcc) (part of The GNU Compiler Collection, GCC). Linux has
+  also been ported to a number of architectures without a PMMU, although
+  functionality is then obviously somewhat limited.
+  Linux has also been ported to itself. You can now run the kernel as a
+  userspace application - this is called UserMode Linux (UML).
+
+Documentation
+-------------
+
+ - There is a lot of documentation available both in electronic form on
+   the Internet and in books, both Linux-specific and pertaining to
+   general UNIX questions.  I'd recommend looking into the documentation
+   subdirectories on any Linux FTP site for the LDP (Linux Documentation
+   Project) books.  This README is not meant to be documentation on the
+   system: there are much better sources available.
+
+ - There are various README files in the Documentation/ subdirectory:
+   these typically contain kernel-specific installation notes for some
+   drivers for example. See Documentation/00-INDEX for a list of what
+   is contained in each file.  Please read the
+   :ref:`Documentation/process/changes.rst <changes>` file, as it
+   contains information about the problems, which may result by upgrading
+   your kernel.
+
+Installing the kernel source
+----------------------------
+
+ - If you install the full sources, put the kernel tarball in a
+   directory where you have permissions (e.g. your home directory) and
+   unpack it::
+
+     xz -cd linux-4.X.tar.xz | tar xvf -
+
+   Replace "X" with the version number of the latest kernel.
+
+   Do NOT use the /usr/src/linux area! This area has a (usually
+   incomplete) set of kernel headers that are used by the library header
+   files.  They should match the library, and not get messed up by
+   whatever the kernel-du-jour happens to be.
+
+ - You can also upgrade between 4.x releases by patching.  Patches are
+   distributed in the xz format.  To install by patching, get all the
+   newer patch files, enter the top level directory of the kernel source
+   (linux-4.X) and execute::
+
+     xz -cd ../patch-4.x.xz | patch -p1
+
+   Replace "x" for all versions bigger than the version "X" of your current
+   source tree, **in_order**, and you should be ok.  You may want to remove
+   the backup files (some-file-name~ or some-file-name.orig), and make sure
+   that there are no failed patches (some-file-name# or some-file-name.rej).
+   If there are, either you or I have made a mistake.
+
+   Unlike patches for the 4.x kernels, patches for the 4.x.y kernels
+   (also known as the -stable kernels) are not incremental but instead apply
+   directly to the base 4.x kernel.  For example, if your base kernel is 4.0
+   and you want to apply the 4.0.3 patch, you must not first apply the 4.0.1
+   and 4.0.2 patches. Similarly, if you are running kernel version 4.0.2 and
+   want to jump to 4.0.3, you must first reverse the 4.0.2 patch (that is,
+   patch -R) **before** applying the 4.0.3 patch. You can read more on this in
+   :ref:`Documentation/process/applying-patches.rst <applying_patches>`.
+
+   Alternatively, the script patch-kernel can be used to automate this
+   process.  It determines the current kernel version and applies any
+   patches found::
+
+     linux/scripts/patch-kernel linux
+
+   The first argument in the command above is the location of the
+   kernel source.  Patches are applied from the current directory, but
+   an alternative directory can be specified as the second argument.
+
+ - Make sure you have no stale .o files and dependencies lying around::
+
+     cd linux
+     make mrproper
+
+   You should now have the sources correctly installed.
+
+Software requirements
+---------------------
+
+   Compiling and running the 4.x kernels requires up-to-date
+   versions of various software packages.  Consult
+   :ref:`Documentation/process/changes.rst <changes>` for the minimum version numbers
+   required and how to get updates for these packages.  Beware that using
+   excessively old versions of these packages can cause indirect
+   errors that are very difficult to track down, so don't assume that
+   you can just update packages when obvious problems arise during
+   build or operation.
+
+Build directory for the kernel
+------------------------------
+
+   When compiling the kernel, all output files will per default be
+   stored together with the kernel source code.
+   Using the option ``make O=output/dir`` allows you to specify an alternate
+   place for the output files (including .config).
+   Example::
+
+     kernel source code: /usr/src/linux-4.X
+     build directory:    /home/name/build/kernel
+
+   To configure and build the kernel, use::
+
+     cd /usr/src/linux-4.X
+     make O=/home/name/build/kernel menuconfig
+     make O=/home/name/build/kernel
+     sudo make O=/home/name/build/kernel modules_install install
+
+   Please note: If the ``O=output/dir`` option is used, then it must be
+   used for all invocations of make.
+
+Configuring the kernel
+----------------------
+
+   Do not skip this step even if you are only upgrading one minor
+   version.  New configuration options are added in each release, and
+   odd problems will turn up if the configuration files are not set up
+   as expected.  If you want to carry your existing configuration to a
+   new version with minimal work, use ``make oldconfig``, which will
+   only ask you for the answers to new questions.
+
+ - Alternative configuration commands are::
+
+     "make config"      Plain text interface.
+
+     "make menuconfig"  Text based color menus, radiolists & dialogs.
+
+     "make nconfig"     Enhanced text based color menus.
+
+     "make xconfig"     Qt based configuration tool.
+
+     "make gconfig"     GTK+ based configuration tool.
+
+     "make oldconfig"   Default all questions based on the contents of
+                        your existing ./.config file and asking about
+                        new config symbols.
+
+     "make olddefconfig"
+                        Like above, but sets new symbols to their default
+                        values without prompting.
+
+     "make defconfig"   Create a ./.config file by using the default
+                        symbol values from either arch/$ARCH/defconfig
+                        or arch/$ARCH/configs/${PLATFORM}_defconfig,
+                        depending on the architecture.
+
+     "make ${PLATFORM}_defconfig"
+                        Create a ./.config file by using the default
+                        symbol values from
+                        arch/$ARCH/configs/${PLATFORM}_defconfig.
+                        Use "make help" to get a list of all available
+                        platforms of your architecture.
+
+     "make allyesconfig"
+                        Create a ./.config file by setting symbol
+                        values to 'y' as much as possible.
+
+     "make allmodconfig"
+                        Create a ./.config file by setting symbol
+                        values to 'm' as much as possible.
+
+     "make allnoconfig" Create a ./.config file by setting symbol
+                        values to 'n' as much as possible.
+
+     "make randconfig"  Create a ./.config file by setting symbol
+                        values to random values.
+
+     "make localmodconfig" Create a config based on current config and
+                           loaded modules (lsmod). Disables any module
+                           option that is not needed for the loaded modules.
+
+                           To create a localmodconfig for another machine,
+                           store the lsmod of that machine into a file
+                           and pass it in as a LSMOD parameter.
+
+                   target$ lsmod > /tmp/mylsmod
+                   target$ scp /tmp/mylsmod host:/tmp
+
+                   host$ make LSMOD=/tmp/mylsmod localmodconfig
+
+                           The above also works when cross compiling.
+
+     "make localyesconfig" Similar to localmodconfig, except it will convert
+                           all module options to built in (=y) options.
+
+     "make kvmconfig"   Enable additional options for kvm guest kernel support.
+
+     "make xenconfig"   Enable additional options for xen dom0 guest kernel
+                        support.
+
+     "make tinyconfig"  Configure the tiniest possible kernel.
+
+   You can find more information on using the Linux kernel config tools
+   in Documentation/kbuild/kconfig.txt.
+
+ - NOTES on ``make config``:
+
+    - Having unnecessary drivers will make the kernel bigger, and can
+      under some circumstances lead to problems: probing for a
+      nonexistent controller card may confuse your other controllers.
+
+    - A kernel with math-emulation compiled in will still use the
+      coprocessor if one is present: the math emulation will just
+      never get used in that case.  The kernel will be slightly larger,
+      but will work on different machines regardless of whether they
+      have a math coprocessor or not.
+
+    - The "kernel hacking" configuration details usually result in a
+      bigger or slower kernel (or both), and can even make the kernel
+      less stable by configuring some routines to actively try to
+      break bad code to find kernel problems (kmalloc()).  Thus you
+      should probably answer 'n' to the questions for "development",
+      "experimental", or "debugging" features.
+
+Compiling the kernel
+--------------------
+
+ - Make sure you have at least gcc 3.2 available.
+   For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
+
+   Please note that you can still run a.out user programs with this kernel.
+
+ - Do a ``make`` to create a compressed kernel image. It is also
+   possible to do ``make install`` if you have lilo installed to suit the
+   kernel makefiles, but you may want to check your particular lilo setup first.
+
+   To do the actual install, you have to be root, but none of the normal
+   build should require that. Don't take the name of root in vain.
+
+ - If you configured any of the parts of the kernel as ``modules``, you
+   will also have to do ``make modules_install``.
+
+ - Verbose kernel compile/build output:
+
+   Normally, the kernel build system runs in a fairly quiet mode (but not
+   totally silent).  However, sometimes you or other kernel developers need
+   to see compile, link, or other commands exactly as they are executed.
+   For this, use "verbose" build mode.  This is done by passing
+   ``V=1`` to the ``make`` command, e.g.::
+
+     make V=1 all
+
+   To have the build system also tell the reason for the rebuild of each
+   target, use ``V=2``.  The default is ``V=0``.
+
+ - Keep a backup kernel handy in case something goes wrong.  This is
+   especially true for the development releases, since each new release
+   contains new code which has not been debugged.  Make sure you keep a
+   backup of the modules corresponding to that kernel, as well.  If you
+   are installing a new kernel with the same version number as your
+   working kernel, make a backup of your modules directory before you
+   do a ``make modules_install``.
+
+   Alternatively, before compiling, use the kernel config option
+   "LOCALVERSION" to append a unique suffix to the regular kernel version.
+   LOCALVERSION can be set in the "General Setup" menu.
+
+ - In order to boot your new kernel, you'll need to copy the kernel
+   image (e.g. .../linux/arch/x86/boot/bzImage after compilation)
+   to the place where your regular bootable kernel is found.
+
+ - Booting a kernel directly from a floppy without the assistance of a
+   bootloader such as LILO, is no longer supported.
+
+   If you boot Linux from the hard drive, chances are you use LILO, which
+   uses the kernel image as specified in the file /etc/lilo.conf.  The
+   kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
+   /boot/bzImage.  To use the new kernel, save a copy of the old image
+   and copy the new image over the old one.  Then, you MUST RERUN LILO
+   to update the loading map! If you don't, you won't be able to boot
+   the new kernel image.
+
+   Reinstalling LILO is usually a matter of running /sbin/lilo.
+   You may wish to edit /etc/lilo.conf to specify an entry for your
+   old kernel image (say, /vmlinux.old) in case the new one does not
+   work.  See the LILO docs for more information.
+
+   After reinstalling LILO, you should be all set.  Shutdown the system,
+   reboot, and enjoy!
+
+   If you ever need to change the default root device, video mode,
+   ramdisk size, etc.  in the kernel image, use the ``rdev`` program (or
+   alternatively the LILO boot options when appropriate).  No need to
+   recompile the kernel to change these parameters.
+
+ - Reboot with the new kernel and enjoy.
+
+If something goes wrong
+-----------------------
+
+ - If you have problems that seem to be due to kernel bugs, please check
+   the file MAINTAINERS to see if there is a particular person associated
+   with the part of the kernel that you are having trouble with. If there
+   isn't anyone listed there, then the second best thing is to mail
+   them to me (torvalds@linux-foundation.org), and possibly to any other
+   relevant mailing-list or to the newsgroup.
+
+ - In all bug-reports, *please* tell what kernel you are talking about,
+   how to duplicate the problem, and what your setup is (use your common
+   sense).  If the problem is new, tell me so, and if the problem is
+   old, please try to tell me when you first noticed it.
+
+ - If the bug results in a message like::
+
+     unable to handle kernel paging request at address C0000010
+     Oops: 0002
+     EIP:   0010:XXXXXXXX
+     eax: xxxxxxxx   ebx: xxxxxxxx   ecx: xxxxxxxx   edx: xxxxxxxx
+     esi: xxxxxxxx   edi: xxxxxxxx   ebp: xxxxxxxx
+     ds: xxxx  es: xxxx  fs: xxxx  gs: xxxx
+     Pid: xx, process nr: xx
+     xx xx xx xx xx xx xx xx xx xx
+
+   or similar kernel debugging information on your screen or in your
+   system log, please duplicate it *exactly*.  The dump may look
+   incomprehensible to you, but it does contain information that may
+   help debugging the problem.  The text above the dump is also
+   important: it tells something about why the kernel dumped code (in
+   the above example, it's due to a bad kernel pointer). More information
+   on making sense of the dump is in Documentation/admin-guide/bug-hunting.rst
+
+ - If you compiled the kernel with CONFIG_KALLSYMS you can send the dump
+   as is, otherwise you will have to use the ``ksymoops`` program to make
+   sense of the dump (but compiling with CONFIG_KALLSYMS is usually preferred).
+   This utility can be downloaded from
+   https://www.kernel.org/pub/linux/utils/kernel/ksymoops/ .
+   Alternatively, you can do the dump lookup by hand:
+
+ - In debugging dumps like the above, it helps enormously if you can
+   look up what the EIP value means.  The hex value as such doesn't help
+   me or anybody else very much: it will depend on your particular
+   kernel setup.  What you should do is take the hex value from the EIP
+   line (ignore the ``0010:``), and look it up in the kernel namelist to
+   see which kernel function contains the offending address.
+
+   To find out the kernel function name, you'll need to find the system
+   binary associated with the kernel that exhibited the symptom.  This is
+   the file 'linux/vmlinux'.  To extract the namelist and match it against
+   the EIP from the kernel crash, do::
+
+     nm vmlinux | sort | less
+
+   This will give you a list of kernel addresses sorted in ascending
+   order, from which it is simple to find the function that contains the
+   offending address.  Note that the address given by the kernel
+   debugging messages will not necessarily match exactly with the
+   function addresses (in fact, that is very unlikely), so you can't
+   just 'grep' the list: the list will, however, give you the starting
+   point of each kernel function, so by looking for the function that
+   has a starting address lower than the one you are searching for but
+   is followed by a function with a higher address you will find the one
+   you want.  In fact, it may be a good idea to include a bit of
+   "context" in your problem report, giving a few lines around the
+   interesting one.
+
+   If you for some reason cannot do the above (you have a pre-compiled
+   kernel image or similar), telling me as much about your setup as
+   possible will help.  Please read the :ref:`admin-guide/reporting-bugs.rst <reportingbugs>`
+   document for details.
+
+ - Alternatively, you can use gdb on a running kernel. (read-only; i.e. you
+   cannot change values or set break points.) To do this, first compile the
+   kernel with -g; edit arch/x86/Makefile appropriately, then do a ``make
+   clean``. You'll also need to enable CONFIG_PROC_FS (via ``make config``).
+
+   After you've rebooted with the new kernel, do ``gdb vmlinux /proc/kcore``.
+   You can now use all the usual gdb commands. The command to look up the
+   point where your system crashed is ``l *0xXXXXXXXX``. (Replace the XXXes
+   with the EIP value.)
+
+   gdb'ing a non-running kernel currently fails because ``gdb`` (wrongly)
+   disregards the starting offset for which the kernel is compiled.
diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst
new file mode 100644
index 000000000..c0ce64d75
--- /dev/null
+++ b/Documentation/admin-guide/bcache.rst
@@ -0,0 +1,649 @@
+============================
+A block layer cache (bcache)
+============================
+
+Say you've got a big slow raid 6, and an ssd or three. Wouldn't it be
+nice if you could use them as cache... Hence bcache.
+
+Wiki and git repositories are at:
+
+  - http://bcache.evilpiepirate.org
+  - http://evilpiepirate.org/git/linux-bcache.git
+  - http://evilpiepirate.org/git/bcache-tools.git
+
+It's designed around the performance characteristics of SSDs - it only allocates
+in erase block sized buckets, and it uses a hybrid btree/log to track cached
+extents (which can be anywhere from a single sector to the bucket size). It's
+designed to avoid random writes at all costs; it fills up an erase block
+sequentially, then issues a discard before reusing it.
+
+Both writethrough and writeback caching are supported. Writeback defaults to
+off, but can be switched on and off arbitrarily at runtime. Bcache goes to
+great lengths to protect your data - it reliably handles unclean shutdown. (It
+doesn't even have a notion of a clean shutdown; bcache simply doesn't return
+writes as completed until they're on stable storage).
+
+Writeback caching can use most of the cache for buffering writes - writing
+dirty data to the backing device is always done sequentially, scanning from the
+start to the end of the index.
+
+Since random IO is what SSDs excel at, there generally won't be much benefit
+to caching large sequential IO. Bcache detects sequential IO and skips it;
+it also keeps a rolling average of the IO sizes per task, and as long as the
+average is above the cutoff it will skip all IO from that task - instead of
+caching the first 512k after every seek. Backups and large file copies should
+thus entirely bypass the cache.
+
+In the event of a data IO error on the flash it will try to recover by reading
+from disk or invalidating cache entries.  For unrecoverable errors (meta data
+or dirty data), caching is automatically disabled; if dirty data was present
+in the cache it first disables writeback caching and waits for all dirty data
+to be flushed.
+
+Getting started:
+You'll need make-bcache from the bcache-tools repository. Both the cache device
+and backing device must be formatted before use::
+
+  make-bcache -B /dev/sdb
+  make-bcache -C /dev/sdc
+
+make-bcache has the ability to format multiple devices at the same time - if
+you format your backing devices and cache device at the same time, you won't
+have to manually attach::
+
+  make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
+
+bcache-tools now ships udev rules, and bcache devices are known to the kernel
+immediately.  Without udev, you can manually register devices like this::
+
+  echo /dev/sdb > /sys/fs/bcache/register
+  echo /dev/sdc > /sys/fs/bcache/register
+
+Registering the backing device makes the bcache device show up in /dev; you can
+now format it and use it as normal. But the first time using a new bcache
+device, it'll be running in passthrough mode until you attach it to a cache.
+If you are thinking about using bcache later, it is recommended to setup all your
+slow devices as bcache backing devices without a cache, and you can choose to add
+a caching device later.
+See 'ATTACHING' section below.
+
+The devices show up as::
+
+  /dev/bcache<N>
+
+As well as (with udev)::
+
+  /dev/bcache/by-uuid/<uuid>
+  /dev/bcache/by-label/<label>
+
+To get started::
+
+  mkfs.ext4 /dev/bcache0
+  mount /dev/bcache0 /mnt
+
+You can control bcache devices through sysfs at /sys/block/bcache<N>/bcache .
+You can also control them through /sys/fs//bcache/<cset-uuid>/ .
+
+Cache devices are managed as sets; multiple caches per set isn't supported yet
+but will allow for mirroring of metadata and dirty data in the future. Your new
+cache set shows up as /sys/fs/bcache/<UUID>
+
+Attaching
+---------
+
+After your cache device and backing device are registered, the backing device
+must be attached to your cache set to enable caching. Attaching a backing
+device to a cache set is done thusly, with the UUID of the cache set in
+/sys/fs/bcache::
+
+  echo <CSET-UUID> > /sys/block/bcache0/bcache/attach
+
+This only has to be done once. The next time you reboot, just reregister all
+your bcache devices. If a backing device has data in a cache somewhere, the
+/dev/bcache<N> device won't be created until the cache shows up - particularly
+important if you have writeback caching turned on.
+
+If you're booting up and your cache device is gone and never coming back, you
+can force run the backing device::
+
+  echo 1 > /sys/block/sdb/bcache/running
+
+(You need to use /sys/block/sdb (or whatever your backing device is called), not
+/sys/block/bcache0, because bcache0 doesn't exist yet. If you're using a
+partition, the bcache directory would be at /sys/block/sdb/sdb2/bcache)
+
+The backing device will still use that cache set if it shows up in the future,
+but all the cached data will be invalidated. If there was dirty data in the
+cache, don't expect the filesystem to be recoverable - you will have massive
+filesystem corruption, though ext4's fsck does work miracles.
+
+Error Handling
+--------------
+
+Bcache tries to transparently handle IO errors to/from the cache device without
+affecting normal operation; if it sees too many errors (the threshold is
+configurable, and defaults to 0) it shuts down the cache device and switches all
+the backing devices to passthrough mode.
+
+ - For reads from the cache, if they error we just retry the read from the
+   backing device.
+
+ - For writethrough writes, if the write to the cache errors we just switch to
+   invalidating the data at that lba in the cache (i.e. the same thing we do for
+   a write that bypasses the cache)
+
+ - For writeback writes, we currently pass that error back up to the
+   filesystem/userspace. This could be improved - we could retry it as a write
+   that skips the cache so we don't have to error the write.
+
+ - When we detach, we first try to flush any dirty data (if we were running in
+   writeback mode). It currently doesn't do anything intelligent if it fails to
+   read some of the dirty data, though.
+
+
+Howto/cookbook
+--------------
+
+A) Starting a bcache with a missing caching device
+
+If registering the backing device doesn't help, it's already there, you just need
+to force it to run without the cache::
+
+	host:~# echo /dev/sdb1 > /sys/fs/bcache/register
+	[  119.844831] bcache: register_bcache() error opening /dev/sdb1: device already registered
+
+Next, you try to register your caching device if it's present. However
+if it's absent, or registration fails for some reason, you can still
+start your bcache without its cache, like so::
+
+	host:/sys/block/sdb/sdb1/bcache# echo 1 > running
+
+Note that this may cause data loss if you were running in writeback mode.
+
+
+B) Bcache does not find its cache::
+
+	host:/sys/block/md5/bcache# echo 0226553a-37cf-41d5-b3ce-8b1e944543a8 > attach
+	[ 1933.455082] bcache: bch_cached_dev_attach() Couldn't find uuid for md5 in set
+	[ 1933.478179] bcache: __cached_dev_store() Can't attach 0226553a-37cf-41d5-b3ce-8b1e944543a8
+	[ 1933.478179] : cache set not found
+
+In this case, the caching device was simply not registered at boot
+or disappeared and came back, and needs to be (re-)registered::
+
+	host:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register
+
+
+C) Corrupt bcache crashes the kernel at device registration time:
+
+This should never happen.  If it does happen, then you have found a bug!
+Please report it to the bcache development list: linux-bcache@vger.kernel.org
+
+Be sure to provide as much information that you can including kernel dmesg
+output if available so that we may assist.
+
+
+D) Recovering data without bcache:
+
+If bcache is not available in the kernel, a filesystem on the backing
+device is still available at an 8KiB offset. So either via a loopdev
+of the backing device created with --offset 8K, or any value defined by
+--data-offset when you originally formatted bcache with `make-bcache`.
+
+For example::
+
+	losetup -o 8192 /dev/loop0 /dev/your_bcache_backing_dev
+
+This should present your unmodified backing device data in /dev/loop0
+
+If your cache is in writethrough mode, then you can safely discard the
+cache device without loosing data.
+
+
+E) Wiping a cache device
+
+::
+
+	host:~# wipefs -a /dev/sdh2
+	16 bytes were erased at offset 0x1018 (bcache)
+	they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
+
+After you boot back with bcache enabled, you recreate the cache and attach it::
+
+	host:~# make-bcache -C /dev/sdh2
+	UUID:                   7be7e175-8f4c-4f99-94b2-9c904d227045
+	Set UUID:               5bc072a8-ab17-446d-9744-e247949913c1
+	version:                0
+	nbuckets:               106874
+	block_size:             1
+	bucket_size:            1024
+	nr_in_set:              1
+	nr_this_dev:            0
+	first_bucket:           1
+	[  650.511912] bcache: run_cache_set() invalidating existing data
+	[  650.549228] bcache: register_cache() registered cache device sdh2
+
+start backing device with missing cache::
+
+	host:/sys/block/md5/bcache# echo 1 > running
+
+attach new cache::
+
+	host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach
+	[  865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1
+
+
+F) Remove or replace a caching device::
+
+	host:/sys/block/sda/sda7/bcache# echo 1 > detach
+	[  695.872542] bcache: cached_dev_detach_finish() Caching disabled for sda7
+
+	host:~# wipefs -a /dev/nvme0n1p4
+	wipefs: error: /dev/nvme0n1p4: probing initialization failed: Device or resource busy
+	Ooops, it's disabled, but not unregistered, so it's still protected
+
+We need to go and unregister it::
+
+	host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# ls -l cache0
+	lrwxrwxrwx 1 root root 0 Feb 25 18:33 cache0 -> ../../../devices/pci0000:00/0000:00:1d.0/0000:70:00.0/nvme/nvme0/nvme0n1/nvme0n1p4/bcache/
+	host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# echo 1 > stop
+	kernel: [  917.041908] bcache: cache_set_free() Cache set b7ba27a1-2398-4649-8ae3-0959f57ba128 unregistered
+
+Now we can wipe it::
+
+	host:~# wipefs -a /dev/nvme0n1p4
+	/dev/nvme0n1p4: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
+
+
+G) dm-crypt and bcache
+
+First setup bcache unencrypted and then install dmcrypt on top of
+/dev/bcache<N> This will work faster than if you dmcrypt both the backing
+and caching devices and then install bcache on top. [benchmarks?]
+
+
+H) Stop/free a registered bcache to wipe and/or recreate it
+
+Suppose that you need to free up all bcache references so that you can
+fdisk run and re-register a changed partition table, which won't work
+if there are any active backing or caching devices left on it:
+
+1) Is it present in /dev/bcache* ? (there are times where it won't be)
+
+   If so, it's easy::
+
+	host:/sys/block/bcache0/bcache# echo 1 > stop
+
+2) But if your backing device is gone, this won't work::
+
+	host:/sys/block/bcache0# cd bcache
+	bash: cd: bcache: No such file or directory
+
+   In this case, you may have to unregister the dmcrypt block device that
+   references this bcache to free it up::
+
+	host:~# dmsetup remove oldds1
+	bcache: bcache_device_free() bcache0 stopped
+	bcache: cache_set_free() Cache set 5bc072a8-ab17-446d-9744-e247949913c1 unregistered
+
+   This causes the backing bcache to be removed from /sys/fs/bcache and
+   then it can be reused.  This would be true of any block device stacking
+   where bcache is a lower device.
+
+3) In other cases, you can also look in /sys/fs/bcache/::
+
+	host:/sys/fs/bcache# ls -l */{cache?,bdev?}
+	lrwxrwxrwx 1 root root 0 Mar  5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/
+	lrwxrwxrwx 1 root root 0 Mar  5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/
+	lrwxrwxrwx 1 root root 0 Mar  5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/
+
+   The device names will show which UUID is relevant, cd in that directory
+   and stop the cache::
+
+	host:/sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1# echo 1 > stop
+
+   This will free up bcache references and let you reuse the partition for
+   other purposes.
+
+
+
+Troubleshooting performance
+---------------------------
+
+Bcache has a bunch of config options and tunables. The defaults are intended to
+be reasonable for typical desktop and server workloads, but they're not what you
+want for getting the best possible numbers when benchmarking.
+
+ - Backing device alignment
+
+   The default metadata size in bcache is 8k.  If your backing device is
+   RAID based, then be sure to align this by a multiple of your stride
+   width using `make-bcache --data-offset`. If you intend to expand your
+   disk array in the future, then multiply a series of primes by your
+   raid stripe size to get the disk multiples that you would like.
+
+   For example:  If you have a 64k stripe size, then the following offset
+   would provide alignment for many common RAID5 data spindle counts::
+
+	64k * 2*2*2*3*3*5*7 bytes = 161280k
+
+   That space is wasted, but for only 157.5MB you can grow your RAID 5
+   volume to the following data-spindle counts without re-aligning::
+
+	3,4,5,6,7,8,9,10,12,14,15,18,20,21 ...
+
+ - Bad write performance
+
+   If write performance is not what you expected, you probably wanted to be
+   running in writeback mode, which isn't the default (not due to a lack of
+   maturity, but simply because in writeback mode you'll lose data if something
+   happens to your SSD)::
+
+	# echo writeback > /sys/block/bcache0/bcache/cache_mode
+
+ - Bad performance, or traffic not going to the SSD that you'd expect
+
+   By default, bcache doesn't cache everything. It tries to skip sequential IO -
+   because you really want to be caching the random IO, and if you copy a 10
+   gigabyte file you probably don't want that pushing 10 gigabytes of randomly
+   accessed data out of your cache.
+
+   But if you want to benchmark reads from cache, and you start out with fio
+   writing an 8 gigabyte test file - so you want to disable that::
+
+	# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
+
+   To set it back to the default (4 mb), do::
+
+	# echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
+
+ - Traffic's still going to the spindle/still getting cache misses
+
+   In the real world, SSDs don't always keep up with disks - particularly with
+   slower SSDs, many disks being cached by one SSD, or mostly sequential IO. So
+   you want to avoid being bottlenecked by the SSD and having it slow everything
+   down.
+
+   To avoid that bcache tracks latency to the cache device, and gradually
+   throttles traffic if the latency exceeds a threshold (it does this by
+   cranking down the sequential bypass).
+
+   You can disable this if you need to by setting the thresholds to 0::
+
+	# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
+	# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
+
+   The default is 2000 us (2 milliseconds) for reads, and 20000 for writes.
+
+ - Still getting cache misses, of the same data
+
+   One last issue that sometimes trips people up is actually an old bug, due to
+   the way cache coherency is handled for cache misses. If a btree node is full,
+   a cache miss won't be able to insert a key for the new data and the data
+   won't be written to the cache.
+
+   In practice this isn't an issue because as soon as a write comes along it'll
+   cause the btree node to be split, and you need almost no write traffic for
+   this to not show up enough to be noticeable (especially since bcache's btree
+   nodes are huge and index large regions of the device). But when you're
+   benchmarking, if you're trying to warm the cache by reading a bunch of data
+   and there's no other traffic - that can be a problem.
+
+   Solution: warm the cache by doing writes, or use the testing branch (there's
+   a fix for the issue there).
+
+
+Sysfs - backing device
+----------------------
+
+Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and
+(if attached) /sys/fs/bcache/<cset-uuid>/bdev*
+
+attach
+  Echo the UUID of a cache set to this file to enable caching.
+
+cache_mode
+  Can be one of either writethrough, writeback, writearound or none.
+
+clear_stats
+  Writing to this file resets the running total stats (not the day/hour/5 minute
+  decaying versions).
+
+detach
+  Write to this file to detach from a cache set. If there is dirty data in the
+  cache, it will be flushed first.
+
+dirty_data
+  Amount of dirty data for this backing device in the cache. Continuously
+  updated unlike the cache set's version, but may be slightly off.
+
+label
+  Name of underlying device.
+
+readahead
+  Size of readahead that should be performed.  Defaults to 0.  If set to e.g.
+  1M, it will round cache miss reads up to that size, but without overlapping
+  existing cache entries.
+
+running
+  1 if bcache is running (i.e. whether the /dev/bcache device exists, whether
+  it's in passthrough mode or caching).
+
+sequential_cutoff
+  A sequential IO will bypass the cache once it passes this threshold; the
+  most recent 128 IOs are tracked so sequential IO can be detected even when
+  it isn't all done at once.
+
+sequential_merge
+  If non zero, bcache keeps a list of the last 128 requests submitted to compare
+  against all new requests to determine which new requests are sequential
+  continuations of previous requests for the purpose of determining sequential
+  cutoff. This is necessary if the sequential cutoff value is greater than the
+  maximum acceptable sequential size for any single request.
+
+state
+  The backing device can be in one of four different states:
+
+  no cache: Has never been attached to a cache set.
+
+  clean: Part of a cache set, and there is no cached dirty data.
+
+  dirty: Part of a cache set, and there is cached dirty data.
+
+  inconsistent: The backing device was forcibly run by the user when there was
+  dirty data cached but the cache set was unavailable; whatever data was on the
+  backing device has likely been corrupted.
+
+stop
+  Write to this file to shut down the bcache device and close the backing
+  device.
+
+writeback_delay
+  When dirty data is written to the cache and it previously did not contain
+  any, waits some number of seconds before initiating writeback. Defaults to
+  30.
+
+writeback_percent
+  If nonzero, bcache tries to keep around this percentage of the cache dirty by
+  throttling background writeback and using a PD controller to smoothly adjust
+  the rate.
+
+writeback_rate
+  Rate in sectors per second - if writeback_percent is nonzero, background
+  writeback is throttled to this rate. Continuously adjusted by bcache but may
+  also be set by the user.
+
+writeback_running
+  If off, writeback of dirty data will not take place at all. Dirty data will
+  still be added to the cache until it is mostly full; only meant for
+  benchmarking. Defaults to on.
+
+Sysfs - backing device stats
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are directories with these numbers for a running total, as well as
+versions that decay over the past day, hour and 5 minutes; they're also
+aggregated in the cache set directory as well.
+
+bypassed
+  Amount of IO (both reads and writes) that has bypassed the cache
+
+cache_hits, cache_misses, cache_hit_ratio
+  Hits and misses are counted per individual IO as bcache sees them; a
+  partial hit is counted as a miss.
+
+cache_bypass_hits, cache_bypass_misses
+  Hits and misses for IO that is intended to skip the cache are still counted,
+  but broken out here.
+
+cache_miss_collisions
+  Counts instances where data was going to be inserted into the cache from a
+  cache miss, but raced with a write and data was already present (usually 0
+  since the synchronization for cache misses was rewritten)
+
+cache_readaheads
+  Count of times readahead occurred.
+
+Sysfs - cache set
+~~~~~~~~~~~~~~~~~
+
+Available at /sys/fs/bcache/<cset-uuid>
+
+average_key_size
+  Average data per key in the btree.
+
+bdev<0..n>
+  Symlink to each of the attached backing devices.
+
+block_size
+  Block size of the cache devices.
+
+btree_cache_size
+  Amount of memory currently used by the btree cache
+
+bucket_size
+  Size of buckets
+
+cache<0..n>
+  Symlink to each of the cache devices comprising this cache set.
+
+cache_available_percent
+  Percentage of cache device which doesn't contain dirty data, and could
+  potentially be used for writeback.  This doesn't mean this space isn't used
+  for clean cached data; the unused statistic (in priority_stats) is typically
+  much lower.
+
+clear_stats
+  Clears the statistics associated with this cache
+
+dirty_data
+  Amount of dirty data is in the cache (updated when garbage collection runs).
+
+flash_vol_create
+  Echoing a size to this file (in human readable units, k/M/G) creates a thinly
+  provisioned volume backed by the cache set.
+
+io_error_halflife, io_error_limit
+  These determines how many errors we accept before disabling the cache.
+  Each error is decayed by the half life (in # ios).  If the decaying count
+  reaches io_error_limit dirty data is written out and the cache is disabled.
+
+journal_delay_ms
+  Journal writes will delay for up to this many milliseconds, unless a cache
+  flush happens sooner. Defaults to 100.
+
+root_usage_percent
+  Percentage of the root btree node in use.  If this gets too high the node
+  will split, increasing the tree depth.
+
+stop
+  Write to this file to shut down the cache set - waits until all attached
+  backing devices have been shut down.
+
+tree_depth
+  Depth of the btree (A single node btree has depth 0).
+
+unregister
+  Detaches all backing devices and closes the cache devices; if dirty data is
+  present it will disable writeback caching and wait for it to be flushed.
+
+Sysfs - cache set internal
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This directory also exposes timings for a number of internal operations, with
+separate files for average duration, average frequency, last occurrence and max
+duration: garbage collection, btree read, btree node sorts and btree splits.
+
+active_journal_entries
+  Number of journal entries that are newer than the index.
+
+btree_nodes
+  Total nodes in the btree.
+
+btree_used_percent
+  Average fraction of btree in use.
+
+bset_tree_stats
+  Statistics about the auxiliary search trees
+
+btree_cache_max_chain
+  Longest chain in the btree node cache's hash table
+
+cache_read_races
+  Counts instances where while data was being read from the cache, the bucket
+  was reused and invalidated - i.e. where the pointer was stale after the read
+  completed. When this occurs the data is reread from the backing device.
+
+trigger_gc
+  Writing to this file forces garbage collection to run.
+
+Sysfs - Cache device
+~~~~~~~~~~~~~~~~~~~~
+
+Available at /sys/block/<cdev>/bcache
+
+block_size
+  Minimum granularity of writes - should match hardware sector size.
+
+btree_written
+  Sum of all btree writes, in (kilo/mega/giga) bytes
+
+bucket_size
+  Size of buckets
+
+cache_replacement_policy
+  One of either lru, fifo or random.
+
+discard
+  Boolean; if on a discard/TRIM will be issued to each bucket before it is
+  reused. Defaults to off, since SATA TRIM is an unqueued command (and thus
+  slow).
+
+freelist_percent
+  Size of the freelist as a percentage of nbuckets. Can be written to to
+  increase the number of buckets kept on the freelist, which lets you
+  artificially reduce the size of the cache at runtime. Mostly for testing
+  purposes (i.e. testing how different size caches affect your hit rate), but
+  since buckets are discarded when they move on to the freelist will also make
+  the SSD's garbage collection easier by effectively giving it more reserved
+  space.
+
+io_errors
+  Number of errors that have occurred, decayed by io_error_halflife.
+
+metadata_written
+  Sum of all non data writes (btree writes and all other metadata).
+
+nbuckets
+  Total buckets in this cache
+
+priority_stats
+  Statistics about how recently data in the cache has been accessed.
+  This can reveal your working set size.  Unused is the percentage of
+  the cache that doesn't contain any data.  Metadata is bcache's
+  metadata overhead.  Average is the average priority of cache buckets.
+  Next is a list of quantiles with the priority threshold of each.
+
+written
+  Sum of all data that has been written to the cache; comparison with
+  btree_written gives the amount of write inflation in bcache.
diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst
new file mode 100644
index 000000000..97b0d7927
--- /dev/null
+++ b/Documentation/admin-guide/binfmt-misc.rst
@@ -0,0 +1,151 @@
+Kernel Support for miscellaneous (your favourite) Binary Formats v1.1
+=====================================================================
+
+This Kernel feature allows you to invoke almost (for restrictions see below)
+every program by simply typing its name in the shell.
+This includes for example compiled Java(TM), Python or Emacs programs.
+
+To achieve this you must tell binfmt_misc which interpreter has to be invoked
+with which binary. Binfmt_misc recognises the binary-type by matching some bytes
+at the beginning of the file with a magic byte sequence (masking out specified
+bits) you have supplied. Binfmt_misc can also recognise a filename extension
+aka ``.com`` or ``.exe``.
+
+First you must mount binfmt_misc::
+
+	mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc
+
+To actually register a new binary type, you have to set up a string looking like
+``:name:type:offset:magic:mask:interpreter:flags`` (where you can choose the
+``:`` upon your needs) and echo it to ``/proc/sys/fs/binfmt_misc/register``.
+
+Here is what the fields mean:
+
+- ``name``
+   is an identifier string. A new /proc file will be created with this
+   ``name below /proc/sys/fs/binfmt_misc``; cannot contain slashes ``/`` for
+   obvious reasons.
+- ``type``
+   is the type of recognition. Give ``M`` for magic and ``E`` for extension.
+- ``offset``
+   is the offset of the magic/mask in the file, counted in bytes. This
+   defaults to 0 if you omit it (i.e. you write ``:name:type::magic...``).
+   Ignored when using filename extension matching.
+- ``magic``
+   is the byte sequence binfmt_misc is matching for. The magic string
+   may contain hex-encoded characters like ``\x0a`` or ``\xA4``. Note that you
+   must escape any NUL bytes; parsing halts at the first one. In a shell
+   environment you might have to write ``\\x0a`` to prevent the shell from
+   eating your ``\``.
+   If you chose filename extension matching, this is the extension to be
+   recognised (without the ``.``, the ``\x0a`` specials are not allowed).
+   Extension    matching is case sensitive, and slashes ``/`` are not allowed!
+- ``mask``
+   is an (optional, defaults to all 0xff) mask. You can mask out some
+   bits from matching by supplying a string like magic and as long as magic.
+   The mask is anded with the byte sequence of the file. Note that you must
+   escape any NUL bytes; parsing halts at the first one. Ignored when using
+   filename extension matching.
+- ``interpreter``
+   is the program that should be invoked with the binary as first
+   argument (specify the full path)
+- ``flags``
+   is an optional field that controls several aspects of the invocation
+   of the interpreter. It is a string of capital letters, each controls a
+   certain aspect. The following flags are supported:
+
+      ``P`` - preserve-argv[0]
+            Legacy behavior of binfmt_misc is to overwrite
+            the original argv[0] with the full path to the binary. When this
+            flag is included, binfmt_misc will add an argument to the argument
+            vector for this purpose, thus preserving the original ``argv[0]``.
+            e.g. If your interp is set to ``/bin/foo`` and you run ``blah``
+            (which is in ``/usr/local/bin``), then the kernel will execute
+            ``/bin/foo`` with ``argv[]`` set to ``["/bin/foo", "/usr/local/bin/blah", "blah"]``.  The interp has to be aware of this so it can
+            execute ``/usr/local/bin/blah``
+            with ``argv[]`` set to ``["blah"]``.
+      ``O`` - open-binary
+	    Legacy behavior of binfmt_misc is to pass the full path
+            of the binary to the interpreter as an argument. When this flag is
+            included, binfmt_misc will open the file for reading and pass its
+            descriptor as an argument, instead of the full path, thus allowing
+            the interpreter to execute non-readable binaries. This feature
+            should be used with care - the interpreter has to be trusted not to
+            emit the contents of the non-readable binary.
+      ``C`` - credentials
+            Currently, the behavior of binfmt_misc is to calculate
+            the credentials and security token of the new process according to
+            the interpreter. When this flag is included, these attributes are
+            calculated according to the binary. It also implies the ``O`` flag.
+            This feature should be used with care as the interpreter
+            will run with root permissions when a setuid binary owned by root
+            is run with binfmt_misc.
+      ``F`` - fix binary
+            The usual behaviour of binfmt_misc is to spawn the
+	    binary lazily when the misc format file is invoked.  However,
+	    this doesn``t work very well in the face of mount namespaces and
+	    changeroots, so the ``F`` mode opens the binary as soon as the
+	    emulation is installed and uses the opened image to spawn the
+	    emulator, meaning it is always available once installed,
+	    regardless of how the environment changes.
+
+
+There are some restrictions:
+
+ - the whole register string may not exceed 1920 characters
+ - the magic must reside in the first 128 bytes of the file, i.e.
+   offset+size(magic) has to be less than 128
+ - the interpreter string may not exceed 127 characters
+
+To use binfmt_misc you have to mount it first. You can mount it with
+``mount -t binfmt_misc none /proc/sys/fs/binfmt_misc`` command, or you can add
+a line ``none  /proc/sys/fs/binfmt_misc binfmt_misc defaults 0 0`` to your
+``/etc/fstab`` so it auto mounts on boot.
+
+You may want to add the binary formats in one of your ``/etc/rc`` scripts during
+boot-up. Read the manual of your init program to figure out how to do this
+right.
+
+Think about the order of adding entries! Later added entries are matched first!
+
+
+A few examples (assumed you are in ``/proc/sys/fs/binfmt_misc``):
+
+- enable support for em86 (like binfmt_em86, for Alpha AXP only)::
+
+    echo ':i386:M::\x7fELF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x03:\xff\xff\xff\xff\xff\xfe\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfb\xff\xff:/bin/em86:' > register
+    echo ':i486:M::\x7fELF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x06:\xff\xff\xff\xff\xff\xfe\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfb\xff\xff:/bin/em86:' > register
+
+- enable support for packed DOS applications (pre-configured dosemu hdimages)::
+
+    echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register
+
+- enable support for Windows executables using wine::
+
+    echo ':DOSWin:M::MZ::/usr/local/bin/wine:' > register
+
+For java support see Documentation/admin-guide/java.rst
+
+
+You can enable/disable binfmt_misc or one binary type by echoing 0 (to disable)
+or 1 (to enable) to ``/proc/sys/fs/binfmt_misc/status`` or
+``/proc/.../the_name``.
+Catting the file tells you the current status of ``binfmt_misc/the_entry``.
+
+You can remove one entry or all entries by echoing -1 to ``/proc/.../the_name``
+or ``/proc/sys/fs/binfmt_misc/status``.
+
+
+Hints
+-----
+
+If you want to pass special arguments to your interpreter, you can
+write a wrapper script for it. See Documentation/admin-guide/java.rst for an
+example.
+
+Your interpreter should NOT look in the PATH for the filename; the kernel
+passes it the full filename (or the file descriptor) to use.  Using ``$PATH`` can
+cause unexpected behaviour and can be a security hazard.
+
+
+Richard Günther <rguenth@tat.physik.uni-tuebingen.de>
diff --git a/Documentation/admin-guide/braille-console.rst b/Documentation/admin-guide/braille-console.rst
new file mode 100644
index 000000000..18e79337d
--- /dev/null
+++ b/Documentation/admin-guide/braille-console.rst
@@ -0,0 +1,38 @@
+Linux Braille Console
+=====================
+
+To get early boot messages on a braille device (before userspace screen
+readers can start), you first need to compile the support for the usual serial
+console (see :ref:`Documentation/admin-guide/serial-console.rst <serial_console>`), and
+for braille device
+(in :menuselection:`Device Drivers --> Accessibility support --> Console on braille device`).
+
+Then you need to specify a ``console=brl``, option on the kernel command line, the
+format is::
+
+	console=brl,serial_options...
+
+where ``serial_options...`` are the same as described in
+:ref:`Documentation/admin-guide/serial-console.rst <serial_console>`.
+
+So for instance you can use ``console=brl,ttyS0`` if the braille device is connected to the first serial port, and ``console=brl,ttyS0,115200`` to
+override the baud rate to 115200, etc.
+
+By default, the braille device will just show the last kernel message (console
+mode).  To review previous messages, press the Insert key to switch to the VT
+review mode.  In review mode, the arrow keys permit to browse in the VT content,
+:kbd:`PAGE-UP`/:kbd:`PAGE-DOWN` keys go at the top/bottom of the screen, and
+the :kbd:`HOME` key goes back
+to the cursor, hence providing very basic screen reviewing facility.
+
+Sound feedback can be obtained by adding the ``braille_console.sound=1`` kernel
+parameter.
+
+For simplicity, only one braille console can be enabled, other uses of
+``console=brl,...`` will be discarded.  Also note that it does not interfere with
+the console selection mechanism described in
+:ref:`Documentation/admin-guide/serial-console.rst <serial_console>`.
+
+For now, only the VisioBraille device is supported.
+
+Samuel Thibault <samuel.thibault@ens-lyon.org>
diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst
new file mode 100644
index 000000000..59567da34
--- /dev/null
+++ b/Documentation/admin-guide/bug-bisect.rst
@@ -0,0 +1,76 @@
+Bisecting a bug
++++++++++++++++
+
+Last updated: 28 October 2016
+
+Introduction
+============
+
+Always try the latest kernel from kernel.org and build from source. If you are
+not confident in doing that please report the bug to your distribution vendor
+instead of to a kernel developer.
+
+Finding bugs is not always easy. Have a go though. If you can't find it don't
+give up. Report as much as you have found to the relevant maintainer. See
+MAINTAINERS for who that is for the subsystem you have worked on.
+
+Before you submit a bug report read
+:ref:`Documentation/admin-guide/reporting-bugs.rst <reportingbugs>`.
+
+Devices not appearing
+=====================
+
+Often this is caused by udev/systemd. Check that first before blaming it
+on the kernel.
+
+Finding patch that caused a bug
+===============================
+
+Using the provided tools with ``git`` makes finding bugs easy provided the bug
+is reproducible.
+
+Steps to do it:
+
+- build the Kernel from its git source
+- start bisect with [#f1]_::
+
+	$ git bisect start
+
+- mark the broken changeset with::
+
+	$ git bisect bad [commit]
+
+- mark a changeset where the code is known to work with::
+
+	$ git bisect good [commit]
+
+- rebuild the Kernel and test
+- interact with git bisect by using either::
+
+	$ git bisect good
+
+  or::
+
+	$ git bisect bad
+
+  depending if the bug happened on the changeset you're testing
+- After some interactions, git bisect will give you the changeset that
+  likely caused the bug.
+
+- For example, if you know that the current version is bad, and version
+  4.8 is good, you could do::
+
+           $ git bisect start
+           $ git bisect bad                 # Current version is bad
+           $ git bisect good v4.8
+
+
+.. [#f1] You can, optionally, provide both good and bad arguments at git
+	 start with ``git bisect start [BAD] [GOOD]``
+
+For further references, please read:
+
+- The man page for ``git-bisect``
+- `Fighting regressions with git bisect <https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html>`_
+- `Fully automated bisecting with "git bisect run" <https://lwn.net/Articles/317154>`_
+- `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_
diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst
new file mode 100644
index 000000000..f278b289e
--- /dev/null
+++ b/Documentation/admin-guide/bug-hunting.rst
@@ -0,0 +1,369 @@
+Bug hunting
+===========
+
+Kernel bug reports often come with a stack dump like the one below::
+
+	------------[ cut here ]------------
+	WARNING: CPU: 1 PID: 28102 at kernel/module.c:1108 module_put+0x57/0x70
+	Modules linked in: dvb_usb_gp8psk(-) dvb_usb dvb_core nvidia_drm(PO) nvidia_modeset(PO) snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore nvidia(PO) [last unloaded: rc_core]
+	CPU: 1 PID: 28102 Comm: rmmod Tainted: P        WC O 4.8.4-build.1 #1
+	Hardware name: MSI MS-7309/MS-7309, BIOS V1.12 02/23/2009
+	 00000000 c12ba080 00000000 00000000 c103ed6a c1616014 00000001 00006dc6
+	 c1615862 00000454 c109e8a7 c109e8a7 00000009 ffffffff 00000000 f13f6a10
+	 f5f5a600 c103ee33 00000009 00000000 00000000 c109e8a7 f80ca4d0 c109f617
+	Call Trace:
+	 [<c12ba080>] ? dump_stack+0x44/0x64
+	 [<c103ed6a>] ? __warn+0xfa/0x120
+	 [<c109e8a7>] ? module_put+0x57/0x70
+	 [<c109e8a7>] ? module_put+0x57/0x70
+	 [<c103ee33>] ? warn_slowpath_null+0x23/0x30
+	 [<c109e8a7>] ? module_put+0x57/0x70
+	 [<f80ca4d0>] ? gp8psk_fe_set_frontend+0x460/0x460 [dvb_usb_gp8psk]
+	 [<c109f617>] ? symbol_put_addr+0x27/0x50
+	 [<f80bc9ca>] ? dvb_usb_adapter_frontend_exit+0x3a/0x70 [dvb_usb]
+	 [<f80bb3bf>] ? dvb_usb_exit+0x2f/0xd0 [dvb_usb]
+	 [<c13d03bc>] ? usb_disable_endpoint+0x7c/0xb0
+	 [<f80bb48a>] ? dvb_usb_device_exit+0x2a/0x50 [dvb_usb]
+	 [<c13d2882>] ? usb_unbind_interface+0x62/0x250
+	 [<c136b514>] ? __pm_runtime_idle+0x44/0x70
+	 [<c13620d8>] ? __device_release_driver+0x78/0x120
+	 [<c1362907>] ? driver_detach+0x87/0x90
+	 [<c1361c48>] ? bus_remove_driver+0x38/0x90
+	 [<c13d1c18>] ? usb_deregister+0x58/0xb0
+	 [<c109fbb0>] ? SyS_delete_module+0x130/0x1f0
+	 [<c1055654>] ? task_work_run+0x64/0x80
+	 [<c1000fa5>] ? exit_to_usermode_loop+0x85/0x90
+	 [<c10013f0>] ? do_fast_syscall_32+0x80/0x130
+	 [<c1549f43>] ? sysenter_past_esp+0x40/0x6a
+	---[ end trace 6ebc60ef3981792f ]---
+
+Such stack traces provide enough information to identify the line inside the
+Kernel's source code where the bug happened. Depending on the severity of
+the issue, it may also contain the word **Oops**, as on this one::
+
+	BUG: unable to handle kernel NULL pointer dereference at   (null)
+	IP: [<c06969d4>] iret_exc+0x7d0/0xa59
+	*pdpt = 000000002258a001 *pde = 0000000000000000
+	Oops: 0002 [#1] PREEMPT SMP
+	...
+
+Despite being an **Oops** or some other sort of stack trace, the offended
+line is usually required to identify and handle the bug. Along this chapter,
+we'll refer to "Oops" for all kinds of stack traces that need to be analized.
+
+.. note::
+
+  ``ksymoops`` is useless on 2.6 or upper.  Please use the Oops in its original
+  format (from ``dmesg``, etc).  Ignore any references in this or other docs to
+  "decoding the Oops" or "running it through ksymoops".
+  If you post an Oops from 2.6+ that has been run through ``ksymoops``,
+  people will just tell you to repost it.
+
+Where is the Oops message is located?
+-------------------------------------
+
+Normally the Oops text is read from the kernel buffers by klogd and
+handed to ``syslogd`` which writes it to a syslog file, typically
+``/var/log/messages`` (depends on ``/etc/syslog.conf``). On systems with
+systemd, it may also be stored by the ``journald`` daemon, and accessed
+by running ``journalctl`` command.
+
+Sometimes ``klogd`` dies, in which case you can run ``dmesg > file`` to
+read the data from the kernel buffers and save it.  Or you can
+``cat /proc/kmsg > file``, however you have to break in to stop the transfer,
+``kmsg`` is a "never ending file".
+
+If the machine has crashed so badly that you cannot enter commands or
+the disk is not available then you have three options:
+
+(1) Hand copy the text from the screen and type it in after the machine
+    has restarted.  Messy but it is the only option if you have not
+    planned for a crash. Alternatively, you can take a picture of
+    the screen with a digital camera - not nice, but better than
+    nothing.  If the messages scroll off the top of the console, you
+    may find that booting with a higher resolution (eg, ``vga=791``)
+    will allow you to read more of the text. (Caveat: This needs ``vesafb``,
+    so won't help for 'early' oopses)
+
+(2) Boot with a serial console (see
+    :ref:`Documentation/admin-guide/serial-console.rst <serial_console>`),
+    run a null modem to a second machine and capture the output there
+    using your favourite communication program.  Minicom works well.
+
+(3) Use Kdump (see Documentation/kdump/kdump.txt),
+    extract the kernel ring buffer from old memory with using dmesg
+    gdbmacro in Documentation/kdump/gdbmacros.txt.
+
+Finding the bug's location
+--------------------------
+
+Reporting a bug works best if you point the location of the bug at the
+Kernel source file. There are two methods for doing that. Usually, using
+``gdb`` is easier, but the Kernel should be pre-compiled with debug info.
+
+gdb
+^^^
+
+The GNU debug (``gdb``) is the best way to figure out the exact file and line
+number of the OOPS from the ``vmlinux`` file.
+
+The usage of gdb works best on a kernel compiled with ``CONFIG_DEBUG_INFO``.
+This can be set by running::
+
+  $ ./scripts/config -d COMPILE_TEST -e DEBUG_KERNEL -e DEBUG_INFO
+
+On a kernel compiled with ``CONFIG_DEBUG_INFO``, you can simply copy the
+EIP value from the OOPS::
+
+ EIP:    0060:[<c021e50e>]    Not tainted VLI
+
+And use GDB to translate that to human-readable form::
+
+  $ gdb vmlinux
+  (gdb) l *0xc021e50e
+
+If you don't have ``CONFIG_DEBUG_INFO`` enabled, you use the function
+offset from the OOPS::
+
+ EIP is at vt_ioctl+0xda8/0x1482
+
+And recompile the kernel with ``CONFIG_DEBUG_INFO`` enabled::
+
+  $ ./scripts/config -d COMPILE_TEST -e DEBUG_KERNEL -e DEBUG_INFO
+  $ make vmlinux
+  $ gdb vmlinux
+  (gdb) l *vt_ioctl+0xda8
+  0x1888 is in vt_ioctl (drivers/tty/vt/vt_ioctl.c:293).
+  288	{
+  289		struct vc_data *vc = NULL;
+  290		int ret = 0;
+  291
+  292		console_lock();
+  293		if (VT_BUSY(vc_num))
+  294			ret = -EBUSY;
+  295		else if (vc_num)
+  296			vc = vc_deallocate(vc_num);
+  297		console_unlock();
+
+or, if you want to be more verbose::
+
+  (gdb) p vt_ioctl
+  $1 = {int (struct tty_struct *, unsigned int, unsigned long)} 0xae0 <vt_ioctl>
+  (gdb) l *0xae0+0xda8
+
+You could, instead, use the object file::
+
+  $ make drivers/tty/
+  $ gdb drivers/tty/vt/vt_ioctl.o
+  (gdb) l *vt_ioctl+0xda8
+
+If you have a call trace, such as::
+
+     Call Trace:
+      [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
+      [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
+      [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
+      ...
+
+this shows the problem likely in the :jbd: module. You can load that module
+in gdb and list the relevant code::
+
+  $ gdb fs/jbd/jbd.ko
+  (gdb) l *log_wait_commit+0xa3
+
+.. note::
+
+     You can also do the same for any function call at the stack trace,
+     like this one::
+
+	 [<f80bc9ca>] ? dvb_usb_adapter_frontend_exit+0x3a/0x70 [dvb_usb]
+
+     The position where the above call happened can be seen with::
+
+	$ gdb drivers/media/usb/dvb-usb/dvb-usb.o
+	(gdb) l *dvb_usb_adapter_frontend_exit+0x3a
+
+objdump
+^^^^^^^
+
+To debug a kernel, use objdump and look for the hex offset from the crash
+output to find the valid line of code/assembler. Without debug symbols, you
+will see the assembler code for the routine shown, but if your kernel has
+debug symbols the C code will also be available. (Debug symbols can be enabled
+in the kernel hacking menu of the menu configuration.) For example::
+
+    $ objdump -r -S -l --disassemble net/dccp/ipv4.o
+
+.. note::
+
+   You need to be at the top level of the kernel tree for this to pick up
+   your C files.
+
+If you don't have access to the code you can also debug on some crash dumps
+e.g. crash dump output as shown by Dave Miller::
+
+     EIP is at 	+0x14/0x4c0
+      ...
+     Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
+     00 00 55 57  56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
+     <8b> 83 3c 01 00 00 89 44  24 14 8b 45 28 85 c0 89 44 24 18 0f 85
+
+     Put the bytes into a "foo.s" file like this:
+
+            .text
+            .globl foo
+     foo:
+            .byte  .... /* bytes from Code: part of OOPS dump */
+
+     Compile it with "gcc -c -o foo.o foo.s" then look at the output of
+     "objdump --disassemble foo.o".
+
+     Output:
+
+     ip_queue_xmit:
+         push       %ebp
+         push       %edi
+         push       %esi
+         push       %ebx
+         sub        $0xbc, %esp
+         mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb)
+         mov        0x8(%ebp), %ebx         ! %ebx = skb->sk
+         mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt
+
+Reporting the bug
+-----------------
+
+Once you find where the bug happened, by inspecting its location,
+you could either try to fix it yourself or report it upstream.
+
+In order to report it upstream, you should identify the mailing list
+used for the development of the affected code. This can be done by using
+the ``get_maintainer.pl`` script.
+
+For example, if you find a bug at the gspca's sonixj.c file, you can get
+their maintainers with::
+
+	$ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c
+	Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%)
+	Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%)
+	Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%)
+	Bhaktipriya Shridhar <bhaktipriya96@gmail.com> (commit_signer:1/1=100%,authored:1/1=100%,added_lines:4/4=100%,removed_lines:9/9=100%)
+	linux-media@vger.kernel.org (open list:GSPCA USB WEBCAM DRIVER)
+	linux-kernel@vger.kernel.org (open list)
+
+Please notice that it will point to:
+
+- The last developers that touched on the source code. On the above example,
+  Tejun and Bhaktipriya (in this specific case, none really envolved on the
+  development of this file);
+- The driver maintainer (Hans Verkuil);
+- The subsystem maintainer (Mauro Carvalho Chehab);
+- The driver and/or subsystem mailing list (linux-media@vger.kernel.org);
+- the Linux Kernel mailing list (linux-kernel@vger.kernel.org).
+
+Usually, the fastest way to have your bug fixed is to report it to mailing
+list used for the development of the code (linux-media ML) copying the driver maintainer (Hans).
+
+If you are totally stumped as to whom to send the report, and
+``get_maintainer.pl`` didn't provide you anything useful, send it to
+linux-kernel@vger.kernel.org.
+
+Thanks for your help in making Linux as stable as humanly possible.
+
+Fixing the bug
+--------------
+
+If you know programming, you could help us by not only reporting the bug,
+but also providing us with a solution. After all, open source is about
+sharing what you do and don't you want to be recognised for your genius?
+
+If you decide to take this way, once you have worked out a fix please submit
+it upstream.
+
+Please do read
+:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` though
+to help your code get accepted.
+
+
+---------------------------------------------------------------------------
+
+Notes on Oops tracing with ``klogd``
+------------------------------------
+
+In order to help Linus and the other kernel developers there has been
+substantial support incorporated into ``klogd`` for processing protection
+faults.  In order to have full support for address resolution at least
+version 1.3-pl3 of the ``sysklogd`` package should be used.
+
+When a protection fault occurs the ``klogd`` daemon automatically
+translates important addresses in the kernel log messages to their
+symbolic equivalents.  This translated kernel message is then
+forwarded through whatever reporting mechanism ``klogd`` is using.  The
+protection fault message can be simply cut out of the message files
+and forwarded to the kernel developers.
+
+Two types of address resolution are performed by ``klogd``.  The first is
+static translation and the second is dynamic translation.  Static
+translation uses the System.map file in much the same manner that
+ksymoops does.  In order to do static translation the ``klogd`` daemon
+must be able to find a system map file at daemon initialization time.
+See the klogd man page for information on how ``klogd`` searches for map
+files.
+
+Dynamic address translation is important when kernel loadable modules
+are being used.  Since memory for kernel modules is allocated from the
+kernel's dynamic memory pools there are no fixed locations for either
+the start of the module or for functions and symbols in the module.
+
+The kernel supports system calls which allow a program to determine
+which modules are loaded and their location in memory.  Using these
+system calls the klogd daemon builds a symbol table which can be used
+to debug a protection fault which occurs in a loadable kernel module.
+
+At the very minimum klogd will provide the name of the module which
+generated the protection fault.  There may be additional symbolic
+information available if the developer of the loadable module chose to
+export symbol information from the module.
+
+Since the kernel module environment can be dynamic there must be a
+mechanism for notifying the ``klogd`` daemon when a change in module
+environment occurs.  There are command line options available which
+allow klogd to signal the currently executing daemon that symbol
+information should be refreshed.  See the ``klogd`` manual page for more
+information.
+
+A patch is included with the sysklogd distribution which modifies the
+``modules-2.0.0`` package to automatically signal klogd whenever a module
+is loaded or unloaded.  Applying this patch provides essentially
+seamless support for debugging protection faults which occur with
+kernel loadable modules.
+
+The following is an example of a protection fault in a loadable module
+processed by ``klogd``::
+
+	Aug 29 09:51:01 blizard kernel: Unable to handle kernel paging request at virtual address f15e97cc
+	Aug 29 09:51:01 blizard kernel: current->tss.cr3 = 0062d000, %cr3 = 0062d000
+	Aug 29 09:51:01 blizard kernel: *pde = 00000000
+	Aug 29 09:51:01 blizard kernel: Oops: 0002
+	Aug 29 09:51:01 blizard kernel: CPU:    0
+	Aug 29 09:51:01 blizard kernel: EIP:    0010:[oops:_oops+16/3868]
+	Aug 29 09:51:01 blizard kernel: EFLAGS: 00010212
+	Aug 29 09:51:01 blizard kernel: eax: 315e97cc   ebx: 003a6f80   ecx: 001be77b   edx: 00237c0c
+	Aug 29 09:51:01 blizard kernel: esi: 00000000   edi: bffffdb3   ebp: 00589f90   esp: 00589f8c
+	Aug 29 09:51:01 blizard kernel: ds: 0018   es: 0018   fs: 002b   gs: 002b   ss: 0018
+	Aug 29 09:51:01 blizard kernel: Process oops_test (pid: 3374, process nr: 21, stackpage=00589000)
+	Aug 29 09:51:01 blizard kernel: Stack: 315e97cc 00589f98 0100b0b4 bffffed4 0012e38e 00240c64 003a6f80 00000001
+	Aug 29 09:51:01 blizard kernel:        00000000 00237810 bfffff00 0010a7fa 00000003 00000001 00000000 bfffff00
+	Aug 29 09:51:01 blizard kernel:        bffffdb3 bffffed4 ffffffda 0000002b 0007002b 0000002b 0000002b 00000036
+	Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128]
+	Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3
+
+---------------------------------------------------------------------------
+
+::
+
+  Dr. G.W. Wettstein           Oncology Research Div. Computing Facility
+  Roger Maris Cancer Center    INTERNET: greg@wind.rmcc.com
+  820 4th St. N.
+  Fargo, ND  58122
+  Phone: 701-234-7556
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
new file mode 100644
index 000000000..184193bcb
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -0,0 +1,2142 @@
+================
+Control Group v2
+================
+
+:Date: October, 2015
+:Author: Tejun Heo <tj@kernel.org>
+
+This is the authoritative documentation on the design, interface and
+conventions of cgroup v2.  It describes all userland-visible aspects
+of cgroup including core and specific controller behaviors.  All
+future changes must be reflected in this document.  Documentation for
+v1 is available under Documentation/cgroup-v1/.
+
+.. CONTENTS
+
+   1. Introduction
+     1-1. Terminology
+     1-2. What is cgroup?
+   2. Basic Operations
+     2-1. Mounting
+     2-2. Organizing Processes and Threads
+       2-2-1. Processes
+       2-2-2. Threads
+     2-3. [Un]populated Notification
+     2-4. Controlling Controllers
+       2-4-1. Enabling and Disabling
+       2-4-2. Top-down Constraint
+       2-4-3. No Internal Process Constraint
+     2-5. Delegation
+       2-5-1. Model of Delegation
+       2-5-2. Delegation Containment
+     2-6. Guidelines
+       2-6-1. Organize Once and Control
+       2-6-2. Avoid Name Collisions
+   3. Resource Distribution Models
+     3-1. Weights
+     3-2. Limits
+     3-3. Protections
+     3-4. Allocations
+   4. Interface Files
+     4-1. Format
+     4-2. Conventions
+     4-3. Core Interface Files
+   5. Controllers
+     5-1. CPU
+       5-1-1. CPU Interface Files
+     5-2. Memory
+       5-2-1. Memory Interface Files
+       5-2-2. Usage Guidelines
+       5-2-3. Memory Ownership
+     5-3. IO
+       5-3-1. IO Interface Files
+       5-3-2. Writeback
+       5-3-3. IO Latency
+         5-3-3-1. How IO Latency Throttling Works
+         5-3-3-2. IO Latency Interface Files
+     5-4. PID
+       5-4-1. PID Interface Files
+     5-5. Device
+     5-6. RDMA
+       5-6-1. RDMA Interface Files
+     5-7. Misc
+       5-7-1. perf_event
+     5-N. Non-normative information
+       5-N-1. CPU controller root cgroup process behaviour
+       5-N-2. IO controller root cgroup process behaviour
+   6. Namespace
+     6-1. Basics
+     6-2. The Root and Views
+     6-3. Migration and setns(2)
+     6-4. Interaction with Other Namespaces
+   P. Information on Kernel Programming
+     P-1. Filesystem Support for Writeback
+   D. Deprecated v1 Core Features
+   R. Issues with v1 and Rationales for v2
+     R-1. Multiple Hierarchies
+     R-2. Thread Granularity
+     R-3. Competition Between Inner Nodes and Threads
+     R-4. Other Interface Issues
+     R-5. Controller Issues and Remedies
+       R-5-1. Memory
+
+
+Introduction
+============
+
+Terminology
+-----------
+
+"cgroup" stands for "control group" and is never capitalized.  The
+singular form is used to designate the whole feature and also as a
+qualifier as in "cgroup controllers".  When explicitly referring to
+multiple individual control groups, the plural form "cgroups" is used.
+
+
+What is cgroup?
+---------------
+
+cgroup is a mechanism to organize processes hierarchically and
+distribute system resources along the hierarchy in a controlled and
+configurable manner.
+
+cgroup is largely composed of two parts - the core and controllers.
+cgroup core is primarily responsible for hierarchically organizing
+processes.  A cgroup controller is usually responsible for
+distributing a specific type of system resource along the hierarchy
+although there are utility controllers which serve purposes other than
+resource distribution.
+
+cgroups form a tree structure and every process in the system belongs
+to one and only one cgroup.  All threads of a process belong to the
+same cgroup.  On creation, all processes are put in the cgroup that
+the parent process belongs to at the time.  A process can be migrated
+to another cgroup.  Migration of a process doesn't affect already
+existing descendant processes.
+
+Following certain structural constraints, controllers may be enabled or
+disabled selectively on a cgroup.  All controller behaviors are
+hierarchical - if a controller is enabled on a cgroup, it affects all
+processes which belong to the cgroups consisting the inclusive
+sub-hierarchy of the cgroup.  When a controller is enabled on a nested
+cgroup, it always restricts the resource distribution further.  The
+restrictions set closer to the root in the hierarchy can not be
+overridden from further away.
+
+
+Basic Operations
+================
+
+Mounting
+--------
+
+Unlike v1, cgroup v2 has only single hierarchy.  The cgroup v2
+hierarchy can be mounted with the following mount command::
+
+  # mount -t cgroup2 none $MOUNT_POINT
+
+cgroup2 filesystem has the magic number 0x63677270 ("cgrp").  All
+controllers which support v2 and are not bound to a v1 hierarchy are
+automatically bound to the v2 hierarchy and show up at the root.
+Controllers which are not in active use in the v2 hierarchy can be
+bound to other hierarchies.  This allows mixing v2 hierarchy with the
+legacy v1 multiple hierarchies in a fully backward compatible way.
+
+A controller can be moved across hierarchies only after the controller
+is no longer referenced in its current hierarchy.  Because per-cgroup
+controller states are destroyed asynchronously and controllers may
+have lingering references, a controller may not show up immediately on
+the v2 hierarchy after the final umount of the previous hierarchy.
+Similarly, a controller should be fully disabled to be moved out of
+the unified hierarchy and it may take some time for the disabled
+controller to become available for other hierarchies; furthermore, due
+to inter-controller dependencies, other controllers may need to be
+disabled too.
+
+While useful for development and manual configurations, moving
+controllers dynamically between the v2 and other hierarchies is
+strongly discouraged for production use.  It is recommended to decide
+the hierarchies and controller associations before starting using the
+controllers after system boot.
+
+During transition to v2, system management software might still
+automount the v1 cgroup filesystem and so hijack all controllers
+during boot, before manual intervention is possible. To make testing
+and experimenting easier, the kernel parameter cgroup_no_v1= allows
+disabling controllers in v1 and make them always available in v2.
+
+cgroup v2 currently supports the following mount options.
+
+  nsdelegate
+
+	Consider cgroup namespaces as delegation boundaries.  This
+	option is system wide and can only be set on mount or modified
+	through remount from the init namespace.  The mount option is
+	ignored on non-init namespace mounts.  Please refer to the
+	Delegation section for details.
+
+
+Organizing Processes and Threads
+--------------------------------
+
+Processes
+~~~~~~~~~
+
+Initially, only the root cgroup exists to which all processes belong.
+A child cgroup can be created by creating a sub-directory::
+
+  # mkdir $CGROUP_NAME
+
+A given cgroup may have multiple child cgroups forming a tree
+structure.  Each cgroup has a read-writable interface file
+"cgroup.procs".  When read, it lists the PIDs of all processes which
+belong to the cgroup one-per-line.  The PIDs are not ordered and the
+same PID may show up more than once if the process got moved to
+another cgroup and then back or the PID got recycled while reading.
+
+A process can be migrated into a cgroup by writing its PID to the
+target cgroup's "cgroup.procs" file.  Only one process can be migrated
+on a single write(2) call.  If a process is composed of multiple
+threads, writing the PID of any thread migrates all threads of the
+process.
+
+When a process forks a child process, the new process is born into the
+cgroup that the forking process belongs to at the time of the
+operation.  After exit, a process stays associated with the cgroup
+that it belonged to at the time of exit until it's reaped; however, a
+zombie process does not appear in "cgroup.procs" and thus can't be
+moved to another cgroup.
+
+A cgroup which doesn't have any children or live processes can be
+destroyed by removing the directory.  Note that a cgroup which doesn't
+have any children and is associated only with zombie processes is
+considered empty and can be removed::
+
+  # rmdir $CGROUP_NAME
+
+"/proc/$PID/cgroup" lists a process's cgroup membership.  If legacy
+cgroup is in use in the system, this file may contain multiple lines,
+one for each hierarchy.  The entry for cgroup v2 is always in the
+format "0::$PATH"::
+
+  # cat /proc/842/cgroup
+  ...
+  0::/test-cgroup/test-cgroup-nested
+
+If the process becomes a zombie and the cgroup it was associated with
+is removed subsequently, " (deleted)" is appended to the path::
+
+  # cat /proc/842/cgroup
+  ...
+  0::/test-cgroup/test-cgroup-nested (deleted)
+
+
+Threads
+~~~~~~~
+
+cgroup v2 supports thread granularity for a subset of controllers to
+support use cases requiring hierarchical resource distribution across
+the threads of a group of processes.  By default, all threads of a
+process belong to the same cgroup, which also serves as the resource
+domain to host resource consumptions which are not specific to a
+process or thread.  The thread mode allows threads to be spread across
+a subtree while still maintaining the common resource domain for them.
+
+Controllers which support thread mode are called threaded controllers.
+The ones which don't are called domain controllers.
+
+Marking a cgroup threaded makes it join the resource domain of its
+parent as a threaded cgroup.  The parent may be another threaded
+cgroup whose resource domain is further up in the hierarchy.  The root
+of a threaded subtree, that is, the nearest ancestor which is not
+threaded, is called threaded domain or thread root interchangeably and
+serves as the resource domain for the entire subtree.
+
+Inside a threaded subtree, threads of a process can be put in
+different cgroups and are not subject to the no internal process
+constraint - threaded controllers can be enabled on non-leaf cgroups
+whether they have threads in them or not.
+
+As the threaded domain cgroup hosts all the domain resource
+consumptions of the subtree, it is considered to have internal
+resource consumptions whether there are processes in it or not and
+can't have populated child cgroups which aren't threaded.  Because the
+root cgroup is not subject to no internal process constraint, it can
+serve both as a threaded domain and a parent to domain cgroups.
+
+The current operation mode or type of the cgroup is shown in the
+"cgroup.type" file which indicates whether the cgroup is a normal
+domain, a domain which is serving as the domain of a threaded subtree,
+or a threaded cgroup.
+
+On creation, a cgroup is always a domain cgroup and can be made
+threaded by writing "threaded" to the "cgroup.type" file.  The
+operation is single direction::
+
+  # echo threaded > cgroup.type
+
+Once threaded, the cgroup can't be made a domain again.  To enable the
+thread mode, the following conditions must be met.
+
+- As the cgroup will join the parent's resource domain.  The parent
+  must either be a valid (threaded) domain or a threaded cgroup.
+
+- When the parent is an unthreaded domain, it must not have any domain
+  controllers enabled or populated domain children.  The root is
+  exempt from this requirement.
+
+Topology-wise, a cgroup can be in an invalid state.  Please consider
+the following topology::
+
+  A (threaded domain) - B (threaded) - C (domain, just created)
+
+C is created as a domain but isn't connected to a parent which can
+host child domains.  C can't be used until it is turned into a
+threaded cgroup.  "cgroup.type" file will report "domain (invalid)" in
+these cases.  Operations which fail due to invalid topology use
+EOPNOTSUPP as the errno.
+
+A domain cgroup is turned into a threaded domain when one of its child
+cgroup becomes threaded or threaded controllers are enabled in the
+"cgroup.subtree_control" file while there are processes in the cgroup.
+A threaded domain reverts to a normal domain when the conditions
+clear.
+
+When read, "cgroup.threads" contains the list of the thread IDs of all
+threads in the cgroup.  Except that the operations are per-thread
+instead of per-process, "cgroup.threads" has the same format and
+behaves the same way as "cgroup.procs".  While "cgroup.threads" can be
+written to in any cgroup, as it can only move threads inside the same
+threaded domain, its operations are confined inside each threaded
+subtree.
+
+The threaded domain cgroup serves as the resource domain for the whole
+subtree, and, while the threads can be scattered across the subtree,
+all the processes are considered to be in the threaded domain cgroup.
+"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
+processes in the subtree and is not readable in the subtree proper.
+However, "cgroup.procs" can be written to from anywhere in the subtree
+to migrate all threads of the matching process to the cgroup.
+
+Only threaded controllers can be enabled in a threaded subtree.  When
+a threaded controller is enabled inside a threaded subtree, it only
+accounts for and controls resource consumptions associated with the
+threads in the cgroup and its descendants.  All consumptions which
+aren't tied to a specific thread belong to the threaded domain cgroup.
+
+Because a threaded subtree is exempt from no internal process
+constraint, a threaded controller must be able to handle competition
+between threads in a non-leaf cgroup and its child cgroups.  Each
+threaded controller defines how such competitions are handled.
+
+
+[Un]populated Notification
+--------------------------
+
+Each non-root cgroup has a "cgroup.events" file which contains
+"populated" field indicating whether the cgroup's sub-hierarchy has
+live processes in it.  Its value is 0 if there is no live process in
+the cgroup and its descendants; otherwise, 1.  poll and [id]notify
+events are triggered when the value changes.  This can be used, for
+example, to start a clean-up operation after all processes of a given
+sub-hierarchy have exited.  The populated state updates and
+notifications are recursive.  Consider the following sub-hierarchy
+where the numbers in the parentheses represent the numbers of processes
+in each cgroup::
+
+  A(4) - B(0) - C(1)
+              \ D(0)
+
+A, B and C's "populated" fields would be 1 while D's 0.  After the one
+process in C exits, B and C's "populated" fields would flip to "0" and
+file modified events will be generated on the "cgroup.events" files of
+both cgroups.
+
+
+Controlling Controllers
+-----------------------
+
+Enabling and Disabling
+~~~~~~~~~~~~~~~~~~~~~~
+
+Each cgroup has a "cgroup.controllers" file which lists all
+controllers available for the cgroup to enable::
+
+  # cat cgroup.controllers
+  cpu io memory
+
+No controller is enabled by default.  Controllers can be enabled and
+disabled by writing to the "cgroup.subtree_control" file::
+
+  # echo "+cpu +memory -io" > cgroup.subtree_control
+
+Only controllers which are listed in "cgroup.controllers" can be
+enabled.  When multiple operations are specified as above, either they
+all succeed or fail.  If multiple operations on the same controller
+are specified, the last one is effective.
+
+Enabling a controller in a cgroup indicates that the distribution of
+the target resource across its immediate children will be controlled.
+Consider the following sub-hierarchy.  The enabled controllers are
+listed in parentheses::
+
+  A(cpu,memory) - B(memory) - C()
+                            \ D()
+
+As A has "cpu" and "memory" enabled, A will control the distribution
+of CPU cycles and memory to its children, in this case, B.  As B has
+"memory" enabled but not "CPU", C and D will compete freely on CPU
+cycles but their division of memory available to B will be controlled.
+
+As a controller regulates the distribution of the target resource to
+the cgroup's children, enabling it creates the controller's interface
+files in the child cgroups.  In the above example, enabling "cpu" on B
+would create the "cpu." prefixed controller interface files in C and
+D.  Likewise, disabling "memory" from B would remove the "memory."
+prefixed controller interface files from C and D.  This means that the
+controller interface files - anything which doesn't start with
+"cgroup." are owned by the parent rather than the cgroup itself.
+
+
+Top-down Constraint
+~~~~~~~~~~~~~~~~~~~
+
+Resources are distributed top-down and a cgroup can further distribute
+a resource only if the resource has been distributed to it from the
+parent.  This means that all non-root "cgroup.subtree_control" files
+can only contain controllers which are enabled in the parent's
+"cgroup.subtree_control" file.  A controller can be enabled only if
+the parent has the controller enabled and a controller can't be
+disabled if one or more children have it enabled.
+
+
+No Internal Process Constraint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Non-root cgroups can distribute domain resources to their children
+only when they don't have any processes of their own.  In other words,
+only domain cgroups which don't contain any processes can have domain
+controllers enabled in their "cgroup.subtree_control" files.
+
+This guarantees that, when a domain controller is looking at the part
+of the hierarchy which has it enabled, processes are always only on
+the leaves.  This rules out situations where child cgroups compete
+against internal processes of the parent.
+
+The root cgroup is exempt from this restriction.  Root contains
+processes and anonymous resource consumption which can't be associated
+with any other cgroups and requires special treatment from most
+controllers.  How resource consumption in the root cgroup is governed
+is up to each controller (for more information on this topic please
+refer to the Non-normative information section in the Controllers
+chapter).
+
+Note that the restriction doesn't get in the way if there is no
+enabled controller in the cgroup's "cgroup.subtree_control".  This is
+important as otherwise it wouldn't be possible to create children of a
+populated cgroup.  To control resource distribution of a cgroup, the
+cgroup must create children and transfer all its processes to the
+children before enabling controllers in its "cgroup.subtree_control"
+file.
+
+
+Delegation
+----------
+
+Model of Delegation
+~~~~~~~~~~~~~~~~~~~
+
+A cgroup can be delegated in two ways.  First, to a less privileged
+user by granting write access of the directory and its "cgroup.procs",
+"cgroup.threads" and "cgroup.subtree_control" files to the user.
+Second, if the "nsdelegate" mount option is set, automatically to a
+cgroup namespace on namespace creation.
+
+Because the resource control interface files in a given directory
+control the distribution of the parent's resources, the delegatee
+shouldn't be allowed to write to them.  For the first method, this is
+achieved by not granting access to these files.  For the second, the
+kernel rejects writes to all files other than "cgroup.procs" and
+"cgroup.subtree_control" on a namespace root from inside the
+namespace.
+
+The end results are equivalent for both delegation types.  Once
+delegated, the user can build sub-hierarchy under the directory,
+organize processes inside it as it sees fit and further distribute the
+resources it received from the parent.  The limits and other settings
+of all resource controllers are hierarchical and regardless of what
+happens in the delegated sub-hierarchy, nothing can escape the
+resource restrictions imposed by the parent.
+
+Currently, cgroup doesn't impose any restrictions on the number of
+cgroups in or nesting depth of a delegated sub-hierarchy; however,
+this may be limited explicitly in the future.
+
+
+Delegation Containment
+~~~~~~~~~~~~~~~~~~~~~~
+
+A delegated sub-hierarchy is contained in the sense that processes
+can't be moved into or out of the sub-hierarchy by the delegatee.
+
+For delegations to a less privileged user, this is achieved by
+requiring the following conditions for a process with a non-root euid
+to migrate a target process into a cgroup by writing its PID to the
+"cgroup.procs" file.
+
+- The writer must have write access to the "cgroup.procs" file.
+
+- The writer must have write access to the "cgroup.procs" file of the
+  common ancestor of the source and destination cgroups.
+
+The above two constraints ensure that while a delegatee may migrate
+processes around freely in the delegated sub-hierarchy it can't pull
+in from or push out to outside the sub-hierarchy.
+
+For an example, let's assume cgroups C0 and C1 have been delegated to
+user U0 who created C00, C01 under C0 and C10 under C1 as follows and
+all processes under C0 and C1 belong to U0::
+
+  ~~~~~~~~~~~~~ - C0 - C00
+  ~ cgroup    ~      \ C01
+  ~ hierarchy ~
+  ~~~~~~~~~~~~~ - C1 - C10
+
+Let's also say U0 wants to write the PID of a process which is
+currently in C10 into "C00/cgroup.procs".  U0 has write access to the
+file; however, the common ancestor of the source cgroup C10 and the
+destination cgroup C00 is above the points of delegation and U0 would
+not have write access to its "cgroup.procs" files and thus the write
+will be denied with -EACCES.
+
+For delegations to namespaces, containment is achieved by requiring
+that both the source and destination cgroups are reachable from the
+namespace of the process which is attempting the migration.  If either
+is not reachable, the migration is rejected with -ENOENT.
+
+
+Guidelines
+----------
+
+Organize Once and Control
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Migrating a process across cgroups is a relatively expensive operation
+and stateful resources such as memory are not moved together with the
+process.  This is an explicit design decision as there often exist
+inherent trade-offs between migration and various hot paths in terms
+of synchronization cost.
+
+As such, migrating processes across cgroups frequently as a means to
+apply different resource restrictions is discouraged.  A workload
+should be assigned to a cgroup according to the system's logical and
+resource structure once on start-up.  Dynamic adjustments to resource
+distribution can be made by changing controller configuration through
+the interface files.
+
+
+Avoid Name Collisions
+~~~~~~~~~~~~~~~~~~~~~
+
+Interface files for a cgroup and its children cgroups occupy the same
+directory and it is possible to create children cgroups which collide
+with interface files.
+
+All cgroup core interface files are prefixed with "cgroup." and each
+controller's interface files are prefixed with the controller name and
+a dot.  A controller's name is composed of lower case alphabets and
+'_'s but never begins with an '_' so it can be used as the prefix
+character for collision avoidance.  Also, interface file names won't
+start or end with terms which are often used in categorizing workloads
+such as job, service, slice, unit or workload.
+
+cgroup doesn't do anything to prevent name collisions and it's the
+user's responsibility to avoid them.
+
+
+Resource Distribution Models
+============================
+
+cgroup controllers implement several resource distribution schemes
+depending on the resource type and expected use cases.  This section
+describes major schemes in use along with their expected behaviors.
+
+
+Weights
+-------
+
+A parent's resource is distributed by adding up the weights of all
+active children and giving each the fraction matching the ratio of its
+weight against the sum.  As only children which can make use of the
+resource at the moment participate in the distribution, this is
+work-conserving.  Due to the dynamic nature, this model is usually
+used for stateless resources.
+
+All weights are in the range [1, 10000] with the default at 100.  This
+allows symmetric multiplicative biases in both directions at fine
+enough granularity while staying in the intuitive range.
+
+As long as the weight is in range, all configuration combinations are
+valid and there is no reason to reject configuration changes or
+process migrations.
+
+"cpu.weight" proportionally distributes CPU cycles to active children
+and is an example of this type.
+
+
+Limits
+------
+
+A child can only consume upto the configured amount of the resource.
+Limits can be over-committed - the sum of the limits of children can
+exceed the amount of resource available to the parent.
+
+Limits are in the range [0, max] and defaults to "max", which is noop.
+
+As limits can be over-committed, all configuration combinations are
+valid and there is no reason to reject configuration changes or
+process migrations.
+
+"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
+on an IO device and is an example of this type.
+
+
+Protections
+-----------
+
+A cgroup is protected to be allocated upto the configured amount of
+the resource if the usages of all its ancestors are under their
+protected levels.  Protections can be hard guarantees or best effort
+soft boundaries.  Protections can also be over-committed in which case
+only upto the amount available to the parent is protected among
+children.
+
+Protections are in the range [0, max] and defaults to 0, which is
+noop.
+
+As protections can be over-committed, all configuration combinations
+are valid and there is no reason to reject configuration changes or
+process migrations.
+
+"memory.low" implements best-effort memory protection and is an
+example of this type.
+
+
+Allocations
+-----------
+
+A cgroup is exclusively allocated a certain amount of a finite
+resource.  Allocations can't be over-committed - the sum of the
+allocations of children can not exceed the amount of resource
+available to the parent.
+
+Allocations are in the range [0, max] and defaults to 0, which is no
+resource.
+
+As allocations can't be over-committed, some configuration
+combinations are invalid and should be rejected.  Also, if the
+resource is mandatory for execution of processes, process migrations
+may be rejected.
+
+"cpu.rt.max" hard-allocates realtime slices and is an example of this
+type.
+
+
+Interface Files
+===============
+
+Format
+------
+
+All interface files should be in one of the following formats whenever
+possible::
+
+  New-line separated values
+  (when only one value can be written at once)
+
+	VAL0\n
+	VAL1\n
+	...
+
+  Space separated values
+  (when read-only or multiple values can be written at once)
+
+	VAL0 VAL1 ...\n
+
+  Flat keyed
+
+	KEY0 VAL0\n
+	KEY1 VAL1\n
+	...
+
+  Nested keyed
+
+	KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+	KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+	...
+
+For a writable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+
+For both flat and nested keyed files, only the values for a single key
+can be written at a time.  For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
+
+
+Conventions
+-----------
+
+- Settings for a single feature should be contained in a single file.
+
+- The root cgroup should be exempt from resource control and thus
+  shouldn't have resource control interface files.  Also,
+  informational files on the root cgroup which end up showing global
+  information available elsewhere shouldn't exist.
+
+- If a controller implements weight based resource distribution, its
+  interface file should be named "weight" and have the range [1,
+  10000] with 100 as the default.  The values are chosen to allow
+  enough and symmetric bias in both directions while keeping it
+  intuitive (the default is 100%).
+
+- If a controller implements an absolute resource guarantee and/or
+  limit, the interface files should be named "min" and "max"
+  respectively.  If a controller implements best effort resource
+  guarantee and/or limit, the interface files should be named "low"
+  and "high" respectively.
+
+  In the above four control files, the special token "max" should be
+  used to represent upward infinity for both reading and writing.
+
+- If a setting has a configurable default value and keyed specific
+  overrides, the default entry should be keyed with "default" and
+  appear as the first entry in the file.
+
+  The default value can be updated by writing either "default $VAL" or
+  "$VAL".
+
+  When writing to update a specific override, "default" can be used as
+  the value to indicate removal of the override.  Override entries
+  with "default" as the value must not appear when read.
+
+  For example, a setting which is keyed by major:minor device numbers
+  with integer values may look like the following::
+
+    # cat cgroup-example-interface-file
+    default 150
+    8:0 300
+
+  The default value can be updated by::
+
+    # echo 125 > cgroup-example-interface-file
+
+  or::
+
+    # echo "default 125" > cgroup-example-interface-file
+
+  An override can be set by::
+
+    # echo "8:16 170" > cgroup-example-interface-file
+
+  and cleared by::
+
+    # echo "8:0 default" > cgroup-example-interface-file
+    # cat cgroup-example-interface-file
+    default 125
+    8:16 170
+
+- For events which are not very high frequency, an interface file
+  "events" should be created which lists event key value pairs.
+  Whenever a notifiable event happens, file modified event should be
+  generated on the file.
+
+
+Core Interface Files
+--------------------
+
+All cgroup core files are prefixed with "cgroup."
+
+  cgroup.type
+
+	A read-write single value file which exists on non-root
+	cgroups.
+
+	When read, it indicates the current type of the cgroup, which
+	can be one of the following values.
+
+	- "domain" : A normal valid domain cgroup.
+
+	- "domain threaded" : A threaded domain cgroup which is
+          serving as the root of a threaded subtree.
+
+	- "domain invalid" : A cgroup which is in an invalid state.
+	  It can't be populated or have controllers enabled.  It may
+	  be allowed to become a threaded cgroup.
+
+	- "threaded" : A threaded cgroup which is a member of a
+          threaded subtree.
+
+	A cgroup can be turned into a threaded cgroup by writing
+	"threaded" to this file.
+
+  cgroup.procs
+	A read-write new-line separated values file which exists on
+	all cgroups.
+
+	When read, it lists the PIDs of all processes which belong to
+	the cgroup one-per-line.  The PIDs are not ordered and the
+	same PID may show up more than once if the process got moved
+	to another cgroup and then back or the PID got recycled while
+	reading.
+
+	A PID can be written to migrate the process associated with
+	the PID to the cgroup.  The writer should match all of the
+	following conditions.
+
+	- It must have write access to the "cgroup.procs" file.
+
+	- It must have write access to the "cgroup.procs" file of the
+	  common ancestor of the source and destination cgroups.
+
+	When delegating a sub-hierarchy, write access to this file
+	should be granted along with the containing directory.
+
+	In a threaded cgroup, reading this file fails with EOPNOTSUPP
+	as all the processes belong to the thread root.  Writing is
+	supported and moves every thread of the process to the cgroup.
+
+  cgroup.threads
+	A read-write new-line separated values file which exists on
+	all cgroups.
+
+	When read, it lists the TIDs of all threads which belong to
+	the cgroup one-per-line.  The TIDs are not ordered and the
+	same TID may show up more than once if the thread got moved to
+	another cgroup and then back or the TID got recycled while
+	reading.
+
+	A TID can be written to migrate the thread associated with the
+	TID to the cgroup.  The writer should match all of the
+	following conditions.
+
+	- It must have write access to the "cgroup.threads" file.
+
+	- The cgroup that the thread is currently in must be in the
+          same resource domain as the destination cgroup.
+
+	- It must have write access to the "cgroup.procs" file of the
+	  common ancestor of the source and destination cgroups.
+
+	When delegating a sub-hierarchy, write access to this file
+	should be granted along with the containing directory.
+
+  cgroup.controllers
+	A read-only space separated values file which exists on all
+	cgroups.
+
+	It shows space separated list of all controllers available to
+	the cgroup.  The controllers are not ordered.
+
+  cgroup.subtree_control
+	A read-write space separated values file which exists on all
+	cgroups.  Starts out empty.
+
+	When read, it shows space separated list of the controllers
+	which are enabled to control resource distribution from the
+	cgroup to its children.
+
+	Space separated list of controllers prefixed with '+' or '-'
+	can be written to enable or disable controllers.  A controller
+	name prefixed with '+' enables the controller and '-'
+	disables.  If a controller appears more than once on the list,
+	the last one is effective.  When multiple enable and disable
+	operations are specified, either all succeed or all fail.
+
+  cgroup.events
+	A read-only flat-keyed file which exists on non-root cgroups.
+	The following entries are defined.  Unless specified
+	otherwise, a value change in this file generates a file
+	modified event.
+
+	  populated
+		1 if the cgroup or its descendants contains any live
+		processes; otherwise, 0.
+
+  cgroup.max.descendants
+	A read-write single value files.  The default is "max".
+
+	Maximum allowed number of descent cgroups.
+	If the actual number of descendants is equal or larger,
+	an attempt to create a new cgroup in the hierarchy will fail.
+
+  cgroup.max.depth
+	A read-write single value files.  The default is "max".
+
+	Maximum allowed descent depth below the current cgroup.
+	If the actual descent depth is equal or larger,
+	an attempt to create a new child cgroup will fail.
+
+  cgroup.stat
+	A read-only flat-keyed file with the following entries:
+
+	  nr_descendants
+		Total number of visible descendant cgroups.
+
+	  nr_dying_descendants
+		Total number of dying descendant cgroups. A cgroup becomes
+		dying after being deleted by a user. The cgroup will remain
+		in dying state for some time undefined time (which can depend
+		on system load) before being completely destroyed.
+
+		A process can't enter a dying cgroup under any circumstances,
+		a dying cgroup can't revive.
+
+		A dying cgroup can consume system resources not exceeding
+		limits, which were active at the moment of cgroup deletion.
+
+
+Controllers
+===========
+
+CPU
+---
+
+The "cpu" controllers regulates distribution of CPU cycles.  This
+controller implements weight and absolute bandwidth limit models for
+normal scheduling policy and absolute bandwidth allocation model for
+realtime scheduling policy.
+
+WARNING: cgroup2 doesn't yet support control of realtime processes and
+the cpu controller can only be enabled when all RT processes are in
+the root cgroup.  Be aware that system management software may already
+have placed RT processes into nonroot cgroups during the system boot
+process, and these processes may need to be moved to the root cgroup
+before the cpu controller can be enabled.
+
+
+CPU Interface Files
+~~~~~~~~~~~~~~~~~~~
+
+All time durations are in microseconds.
+
+  cpu.stat
+	A read-only flat-keyed file which exists on non-root cgroups.
+	This file exists whether the controller is enabled or not.
+
+	It always reports the following three stats:
+
+	- usage_usec
+	- user_usec
+	- system_usec
+
+	and the following three when the controller is enabled:
+
+	- nr_periods
+	- nr_throttled
+	- throttled_usec
+
+  cpu.weight
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "100".
+
+	The weight in the range [1, 10000].
+
+  cpu.weight.nice
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	The nice value is in the range [-20, 19].
+
+	This interface file is an alternative interface for
+	"cpu.weight" and allows reading and setting weight using the
+	same values used by nice(2).  Because the range is smaller and
+	granularity is coarser for the nice values, the read value is
+	the closest approximation of the current weight.
+
+  cpu.max
+	A read-write two value file which exists on non-root cgroups.
+	The default is "max 100000".
+
+	The maximum bandwidth limit.  It's in the following format::
+
+	  $MAX $PERIOD
+
+	which indicates that the group may consume upto $MAX in each
+	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
+	one number is written, $MAX is updated.
+
+
+Memory
+------
+
+The "memory" controller regulates distribution of memory.  Memory is
+stateful and implements both limit and protection models.  Due to the
+intertwining between memory usage and reclaim pressure and the
+stateful nature of memory, the distribution model is relatively
+complex.
+
+While not completely water-tight, all major memory usages by a given
+cgroup are tracked so that the total memory consumption can be
+accounted and controlled to a reasonable extent.  Currently, the
+following types of memory usages are tracked.
+
+- Userland memory - page cache and anonymous memory.
+
+- Kernel data structures such as dentries and inodes.
+
+- TCP socket buffers.
+
+The above list may expand in the future for better coverage.
+
+
+Memory Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+All memory amounts are in bytes.  If a value which is not aligned to
+PAGE_SIZE is written, the value may be rounded up to the closest
+PAGE_SIZE multiple when read back.
+
+  memory.current
+	A read-only single value file which exists on non-root
+	cgroups.
+
+	The total amount of memory currently being used by the cgroup
+	and its descendants.
+
+  memory.min
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	Hard memory protection.  If the memory usage of a cgroup
+	is within its effective min boundary, the cgroup's memory
+	won't be reclaimed under any conditions. If there is no
+	unprotected reclaimable memory available, OOM killer
+	is invoked.
+
+       Effective min boundary is limited by memory.min values of
+	all ancestor cgroups. If there is memory.min overcommitment
+	(child cgroup or cgroups are requiring more protected memory
+	than parent will allow), then each child cgroup will get
+	the part of parent's protection proportional to its
+	actual memory usage below memory.min.
+
+	Putting more memory than generally available under this
+	protection is discouraged and may lead to constant OOMs.
+
+	If a memory cgroup is not populated with processes,
+	its memory.min is ignored.
+
+  memory.low
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	Best-effort memory protection.  If the memory usage of a
+	cgroup is within its effective low boundary, the cgroup's
+	memory won't be reclaimed unless memory can be reclaimed
+	from unprotected cgroups.
+
+	Effective low boundary is limited by memory.low values of
+	all ancestor cgroups. If there is memory.low overcommitment
+	(child cgroup or cgroups are requiring more protected memory
+	than parent will allow), then each child cgroup will get
+	the part of parent's protection proportional to its
+	actual memory usage below memory.low.
+
+	Putting more memory than generally available under this
+	protection is discouraged.
+
+  memory.high
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Memory usage throttle limit.  This is the main mechanism to
+	control memory usage of a cgroup.  If a cgroup's usage goes
+	over the high boundary, the processes of the cgroup are
+	throttled and put under heavy reclaim pressure.
+
+	Going over the high limit never invokes the OOM killer and
+	under extreme conditions the limit may be breached.
+
+  memory.max
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Memory usage hard limit.  This is the final protection
+	mechanism.  If a cgroup's memory usage reaches this limit and
+	can't be reduced, the OOM killer is invoked in the cgroup.
+	Under certain circumstances, the usage may go over the limit
+	temporarily.
+
+	This is the ultimate protection mechanism.  As long as the
+	high limit is used and monitored properly, this limit's
+	utility is limited to providing the final safety net.
+
+  memory.oom.group
+	A read-write single value file which exists on non-root
+	cgroups.  The default value is "0".
+
+	Determines whether the cgroup should be treated as
+	an indivisible workload by the OOM killer. If set,
+	all tasks belonging to the cgroup or to its descendants
+	(if the memory cgroup is not a leaf cgroup) are killed
+	together or not at all. This can be used to avoid
+	partial kills to guarantee workload integrity.
+
+	Tasks with the OOM protection (oom_score_adj set to -1000)
+	are treated as an exception and are never killed.
+
+	If the OOM killer is invoked in a cgroup, it's not going
+	to kill any tasks outside of this cgroup, regardless
+	memory.oom.group values of ancestor cgroups.
+
+  memory.events
+	A read-only flat-keyed file which exists on non-root cgroups.
+	The following entries are defined.  Unless specified
+	otherwise, a value change in this file generates a file
+	modified event.
+
+	  low
+		The number of times the cgroup is reclaimed due to
+		high memory pressure even though its usage is under
+		the low boundary.  This usually indicates that the low
+		boundary is over-committed.
+
+	  high
+		The number of times processes of the cgroup are
+		throttled and routed to perform direct memory reclaim
+		because the high memory boundary was exceeded.  For a
+		cgroup whose memory usage is capped by the high limit
+		rather than global memory pressure, this event's
+		occurrences are expected.
+
+	  max
+		The number of times the cgroup's memory usage was
+		about to go over the max boundary.  If direct reclaim
+		fails to bring it down, the cgroup goes to OOM state.
+
+	  oom
+		The number of time the cgroup's memory usage was
+		reached the limit and allocation was about to fail.
+
+		Depending on context result could be invocation of OOM
+		killer and retrying allocation or failing allocation.
+
+		Failed allocation in its turn could be returned into
+		userspace as -ENOMEM or silently ignored in cases like
+		disk readahead.  For now OOM in memory cgroup kills
+		tasks iff shortage has happened inside page fault.
+
+	  oom_kill
+		The number of processes belonging to this cgroup
+		killed by any kind of OOM killer.
+
+  memory.stat
+	A read-only flat-keyed file which exists on non-root cgroups.
+
+	This breaks down the cgroup's memory footprint into different
+	types of memory, type-specific details, and other information
+	on the state and past events of the memory management system.
+
+	All memory amounts are in bytes.
+
+	The entries are ordered to be human readable, and new entries
+	can show up in the middle. Don't rely on items remaining in a
+	fixed position; use the keys to look up specific values!
+
+	  anon
+		Amount of memory used in anonymous mappings such as
+		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
+
+	  file
+		Amount of memory used to cache filesystem data,
+		including tmpfs and shared memory.
+
+	  kernel_stack
+		Amount of memory allocated to kernel stacks.
+
+	  slab
+		Amount of memory used for storing in-kernel data
+		structures.
+
+	  sock
+		Amount of memory used in network transmission buffers
+
+	  shmem
+		Amount of cached filesystem data that is swap-backed,
+		such as tmpfs, shm segments, shared anonymous mmap()s
+
+	  file_mapped
+		Amount of cached filesystem data mapped with mmap()
+
+	  file_dirty
+		Amount of cached filesystem data that was modified but
+		not yet written back to disk
+
+	  file_writeback
+		Amount of cached filesystem data that was modified and
+		is currently being written back to disk
+
+	  inactive_anon, active_anon, inactive_file, active_file, unevictable
+		Amount of memory, swap-backed and filesystem-backed,
+		on the internal memory management lists used by the
+		page reclaim algorithm
+
+	  slab_reclaimable
+		Part of "slab" that might be reclaimed, such as
+		dentries and inodes.
+
+	  slab_unreclaimable
+		Part of "slab" that cannot be reclaimed on memory
+		pressure.
+
+	  pgfault
+		Total number of page faults incurred
+
+	  pgmajfault
+		Number of major page faults incurred
+
+	  workingset_refault
+
+		Number of refaults of previously evicted pages
+
+	  workingset_activate
+
+		Number of refaulted pages that were immediately activated
+
+	  workingset_nodereclaim
+
+		Number of times a shadow node has been reclaimed
+
+	  pgrefill
+
+		Amount of scanned pages (in an active LRU list)
+
+	  pgscan
+
+		Amount of scanned pages (in an inactive LRU list)
+
+	  pgsteal
+
+		Amount of reclaimed pages
+
+	  pgactivate
+
+		Amount of pages moved to the active LRU list
+
+	  pgdeactivate
+
+		Amount of pages moved to the inactive LRU lis
+
+	  pglazyfree
+
+		Amount of pages postponed to be freed under memory pressure
+
+	  pglazyfreed
+
+		Amount of reclaimed lazyfree pages
+
+  memory.swap.current
+	A read-only single value file which exists on non-root
+	cgroups.
+
+	The total amount of swap currently being used by the cgroup
+	and its descendants.
+
+  memory.swap.max
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Swap usage hard limit.  If a cgroup's swap usage reaches this
+	limit, anonymous memory of the cgroup will not be swapped out.
+
+  memory.swap.events
+	A read-only flat-keyed file which exists on non-root cgroups.
+	The following entries are defined.  Unless specified
+	otherwise, a value change in this file generates a file
+	modified event.
+
+	  max
+		The number of times the cgroup's swap usage was about
+		to go over the max boundary and swap allocation
+		failed.
+
+	  fail
+		The number of times swap allocation failed either
+		because of running out of swap system-wide or max
+		limit.
+
+	When reduced under the current usage, the existing swap
+	entries are reclaimed gradually and the swap usage may stay
+	higher than the limit for an extended period of time.  This
+	reduces the impact on the workload and memory management.
+
+
+Usage Guidelines
+~~~~~~~~~~~~~~~~
+
+"memory.high" is the main mechanism to control memory usage.
+Over-committing on high limit (sum of high limits > available memory)
+and letting global memory pressure to distribute memory according to
+usage is a viable strategy.
+
+Because breach of the high limit doesn't trigger the OOM killer but
+throttles the offending cgroup, a management agent has ample
+opportunities to monitor and take appropriate actions such as granting
+more memory or terminating the workload.
+
+Determining whether a cgroup has enough memory is not trivial as
+memory usage doesn't indicate whether the workload can benefit from
+more memory.  For example, a workload which writes data received from
+network to a file can use all available memory but can also operate as
+performant with a small amount of memory.  A measure of memory
+pressure - how much the workload is being impacted due to lack of
+memory - is necessary to determine whether a workload needs more
+memory; unfortunately, memory pressure monitoring mechanism isn't
+implemented yet.
+
+
+Memory Ownership
+~~~~~~~~~~~~~~~~
+
+A memory area is charged to the cgroup which instantiated it and stays
+charged to the cgroup until the area is released.  Migrating a process
+to a different cgroup doesn't move the memory usages that it
+instantiated while in the previous cgroup to the new cgroup.
+
+A memory area may be used by processes belonging to different cgroups.
+To which cgroup the area will be charged is in-deterministic; however,
+over time, the memory area is likely to end up in a cgroup which has
+enough memory allowance to avoid high reclaim pressure.
+
+If a cgroup sweeps a considerable amount of memory which is expected
+to be accessed repeatedly by other cgroups, it may make sense to use
+POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
+belonging to the affected files to ensure correct memory ownership.
+
+
+IO
+--
+
+The "io" controller regulates the distribution of IO resources.  This
+controller implements both weight based and absolute bandwidth or IOPS
+limit distribution; however, weight based distribution is available
+only if cfq-iosched is in use and neither scheme is available for
+blk-mq devices.
+
+
+IO Interface Files
+~~~~~~~~~~~~~~~~~~
+
+  io.stat
+	A read-only nested-keyed file which exists on non-root
+	cgroups.
+
+	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
+	The following nested keys are defined.
+
+	  ======	=====================
+	  rbytes	Bytes read
+	  wbytes	Bytes written
+	  rios		Number of read IOs
+	  wios		Number of write IOs
+	  dbytes	Bytes discarded
+	  dios		Number of discard IOs
+	  ======	=====================
+
+	An example read output follows:
+
+	  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
+	  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
+
+  io.weight
+	A read-write flat-keyed file which exists on non-root cgroups.
+	The default is "default 100".
+
+	The first line is the default weight applied to devices
+	without specific override.  The rest are overrides keyed by
+	$MAJ:$MIN device numbers and not ordered.  The weights are in
+	the range [1, 10000] and specifies the relative amount IO time
+	the cgroup can use in relation to its siblings.
+
+	The default weight can be updated by writing either "default
+	$WEIGHT" or simply "$WEIGHT".  Overrides can be set by writing
+	"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
+
+	An example read output follows::
+
+	  default 100
+	  8:16 200
+	  8:0 50
+
+  io.max
+	A read-write nested-keyed file which exists on non-root
+	cgroups.
+
+	BPS and IOPS based IO limit.  Lines are keyed by $MAJ:$MIN
+	device numbers and not ordered.  The following nested keys are
+	defined.
+
+	  =====		==================================
+	  rbps		Max read bytes per second
+	  wbps		Max write bytes per second
+	  riops		Max read IO operations per second
+	  wiops		Max write IO operations per second
+	  =====		==================================
+
+	When writing, any number of nested key-value pairs can be
+	specified in any order.  "max" can be specified as the value
+	to remove a specific limit.  If the same key is specified
+	multiple times, the outcome is undefined.
+
+	BPS and IOPS are measured in each IO direction and IOs are
+	delayed if limit is reached.  Temporary bursts are allowed.
+
+	Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
+
+	  echo "8:16 rbps=2097152 wiops=120" > io.max
+
+	Reading returns the following::
+
+	  8:16 rbps=2097152 wbps=max riops=max wiops=120
+
+	Write IOPS limit can be removed by writing the following::
+
+	  echo "8:16 wiops=max" > io.max
+
+	Reading now returns the following::
+
+	  8:16 rbps=2097152 wbps=max riops=max wiops=max
+
+
+Writeback
+~~~~~~~~~
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism.  Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+The io controller, in conjunction with the memory controller,
+implements control of page cache writeback IOs.  The memory controller
+defines the memory domain that dirty memory ratio is calculated and
+maintained for and the io controller defines the io domain which
+writes out dirty pages for the memory domain.  Both system-wide and
+per-cgroup dirty memory states are examined and the more restrictive
+of the two is enforced.
+
+cgroup writeback requires explicit support from the underlying
+filesystem.  Currently, cgroup writeback is implemented on ext2, ext4
+and btrfs.  On other filesystems, all writeback IOs are attributed to
+the root cgroup.
+
+There are inherent differences in memory and writeback management
+which affects how cgroup ownership is tracked.  Memory is tracked per
+page while writeback per inode.  For the purpose of writeback, an
+inode is assigned to a cgroup and all IO requests to write dirty pages
+from the inode are attributed to that cgroup.
+
+As cgroup ownership for memory is tracked per page, there can be pages
+which are associated with different cgroups than the one the inode is
+associated with.  These are called foreign pages.  The writeback
+constantly keeps track of foreign pages and, if a particular foreign
+cgroup becomes the majority over a certain period of time, switches
+the ownership of the inode to that cgroup.
+
+While this model is enough for most use cases where a given inode is
+mostly dirtied by a single cgroup even when the main writing cgroup
+changes over time, use cases where multiple cgroups write to a single
+inode simultaneously are not supported well.  In such circumstances, a
+significant portion of IOs are likely to be attributed incorrectly.
+As memory controller assigns page ownership on the first use and
+doesn't update it until the page is released, even if writeback
+strictly follows page ownership, multiple cgroups dirtying overlapping
+areas wouldn't work as expected.  It's recommended to avoid such usage
+patterns.
+
+The sysctl knobs which affect writeback behavior are applied to cgroup
+writeback as follows.
+
+  vm.dirty_background_ratio, vm.dirty_ratio
+	These ratios apply the same to cgroup writeback with the
+	amount of available memory capped by limits imposed by the
+	memory controller and system-wide clean memory.
+
+  vm.dirty_background_bytes, vm.dirty_bytes
+	For cgroup writeback, this is calculated into ratio against
+	total available memory and applied the same way as
+	vm.dirty[_background]_ratio.
+
+
+IO Latency
+~~~~~~~~~~
+
+This is a cgroup v2 controller for IO workload protection.  You provide a group
+with a latency target, and if the average latency exceeds that target the
+controller will throttle any peers that have a lower latency target than the
+protected workload.
+
+The limits are only applied at the peer level in the hierarchy.  This means that
+in the diagram below, only groups A, B, and C will influence each other, and
+groups D and F will influence each other.  Group G will influence nobody.
+
+			[root]
+		/	   |		\
+		A	   B		C
+	       /  \        |
+	      D    F	   G
+
+
+So the ideal way to configure this is to set io.latency in groups A, B, and C.
+Generally you do not want to set a value lower than the latency your device
+supports.  Experiment to find the value that works best for your workload.
+Start at higher than the expected latency for your device and watch the
+avg_lat value in io.stat for your workload group to get an idea of the
+latency you see during normal operation.  Use the avg_lat value as a basis for
+your real setting, setting at 10-15% higher than the value in io.stat.
+
+How IO Latency Throttling Works
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+io.latency is work conserving; so as long as everybody is meeting their latency
+target the controller doesn't do anything.  Once a group starts missing its
+target it begins throttling any peer group that has a higher target than itself.
+This throttling takes 2 forms:
+
+- Queue depth throttling.  This is the number of outstanding IO's a group is
+  allowed to have.  We will clamp down relatively quickly, starting at no limit
+  and going all the way down to 1 IO at a time.
+
+- Artificial delay induction.  There are certain types of IO that cannot be
+  throttled without possibly adversely affecting higher priority groups.  This
+  includes swapping and metadata IO.  These types of IO are allowed to occur
+  normally, however they are "charged" to the originating group.  If the
+  originating group is being throttled you will see the use_delay and delay
+  fields in io.stat increase.  The delay value is how many microseconds that are
+  being added to any process that runs in this group.  Because this number can
+  grow quite large if there is a lot of swapping or metadata IO occurring we
+  limit the individual delay events to 1 second at a time.
+
+Once the victimized group starts meeting its latency target again it will start
+unthrottling any peer groups that were throttled previously.  If the victimized
+group simply stops doing IO the global counter will unthrottle appropriately.
+
+IO Latency Interface Files
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+  io.latency
+	This takes a similar format as the other controllers.
+
+		"MAJOR:MINOR target=<target time in microseconds"
+
+  io.stat
+	If the controller is enabled you will see extra stats in io.stat in
+	addition to the normal ones.
+
+	  depth
+		This is the current queue depth for the group.
+
+	  avg_lat
+		This is an exponential moving average with a decay rate of 1/exp
+		bound by the sampling interval.  The decay rate interval can be
+		calculated by multiplying the win value in io.stat by the
+		corresponding number of samples based on the win value.
+
+	  win
+		The sampling window size in milliseconds.  This is the minimum
+		duration of time between evaluation events.  Windows only elapse
+		with IO activity.  Idle periods extend the most recent window.
+
+PID
+---
+
+The process number controller is used to allow a cgroup to stop any
+new tasks from being fork()'d or clone()'d after a specified limit is
+reached.
+
+The number of tasks in a cgroup can be exhausted in ways which other
+controllers cannot prevent, thus warranting its own controller.  For
+example, a fork bomb is likely to exhaust the number of tasks before
+hitting memory restrictions.
+
+Note that PIDs used in this controller refer to TIDs, process IDs as
+used by the kernel.
+
+
+PID Interface Files
+~~~~~~~~~~~~~~~~~~~
+
+  pids.max
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Hard limit of number of processes.
+
+  pids.current
+	A read-only single value file which exists on all cgroups.
+
+	The number of processes currently in the cgroup and its
+	descendants.
+
+Organisational operations are not blocked by cgroup policies, so it is
+possible to have pids.current > pids.max.  This can be done by either
+setting the limit to be smaller than pids.current, or attaching enough
+processes to the cgroup such that pids.current is larger than
+pids.max.  However, it is not possible to violate a cgroup PID policy
+through fork() or clone(). These will return -EAGAIN if the creation
+of a new process would cause a cgroup policy to be violated.
+
+
+Device controller
+-----------------
+
+Device controller manages access to device files. It includes both
+creation of new device files (using mknod), and access to the
+existing device files.
+
+Cgroup v2 device controller has no interface files and is implemented
+on top of cgroup BPF. To control access to device files, a user may
+create bpf programs of the BPF_CGROUP_DEVICE type and attach them
+to cgroups. On an attempt to access a device file, corresponding
+BPF programs will be executed, and depending on the return value
+the attempt will succeed or fail with -EPERM.
+
+A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx
+structure, which describes the device access attempt: access type
+(mknod/read/write) and device (type, major and minor numbers).
+If the program returns 0, the attempt fails with -EPERM, otherwise
+it succeeds.
+
+An example of BPF_CGROUP_DEVICE program may be found in the kernel
+source tree in the tools/testing/selftests/bpf/dev_cgroup.c file.
+
+
+RDMA
+----
+
+The "rdma" controller regulates the distribution and accounting of
+of RDMA resources.
+
+RDMA Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+  rdma.max
+	A readwrite nested-keyed file that exists for all the cgroups
+	except root that describes current configured resource limit
+	for a RDMA/IB device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space separated resource name and its configured
+	limit that can be distributed.
+
+	The following nested keys are defined.
+
+	  ==========	=============================
+	  hca_handle	Maximum number of HCA Handles
+	  hca_object 	Maximum number of HCA Objects
+	  ==========	=============================
+
+	An example for mlx4 and ocrdma device follows::
+
+	  mlx4_0 hca_handle=2 hca_object=2000
+	  ocrdma1 hca_handle=3 hca_object=max
+
+  rdma.current
+	A read-only file that describes current resource usage.
+	It exists for all the cgroup except root.
+
+	An example for mlx4 and ocrdma device follows::
+
+	  mlx4_0 hca_handle=1 hca_object=20
+	  ocrdma1 hca_handle=1 hca_object=23
+
+
+Misc
+----
+
+perf_event
+~~~~~~~~~~
+
+perf_event controller, if not mounted on a legacy hierarchy, is
+automatically enabled on the v2 hierarchy so that perf events can
+always be filtered by cgroup v2 path.  The controller can still be
+moved to a legacy hierarchy after v2 hierarchy is populated.
+
+
+Non-normative information
+-------------------------
+
+This section contains information that isn't considered to be a part of
+the stable kernel API and so is subject to change.
+
+
+CPU controller root cgroup process behaviour
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When distributing CPU cycles in the root cgroup each thread in this
+cgroup is treated as if it was hosted in a separate child cgroup of the
+root cgroup. This child cgroup weight is dependent on its thread nice
+level.
+
+For details of this mapping see sched_prio_to_weight array in
+kernel/sched/core.c file (values from this array should be scaled
+appropriately so the neutral - nice 0 - value is 100 instead of 1024).
+
+
+IO controller root cgroup process behaviour
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Root cgroup processes are hosted in an implicit leaf child node.
+When distributing IO resources this implicit child node is taken into
+account as if it was a normal child cgroup of the root cgroup with a
+weight value of 200.
+
+
+Namespace
+=========
+
+Basics
+------
+
+cgroup namespace provides a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
+flag can be used with clone(2) and unshare(2) to create a new cgroup
+namespace.  The process running inside the cgroup namespace will have
+its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
+cgroupns root is the cgroup of the process at the time of creation of
+the cgroup namespace.
+
+Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
+complete path of the cgroup of a process.  In a container setup where
+a set of cgroups and namespaces are intended to isolate processes the
+"/proc/$PID/cgroup" file may leak potential system level information
+to the isolated processes.  For Example::
+
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can be considered as system-data
+and undesirable to expose to the isolated processes.  cgroup namespace
+can be used to restrict visibility of this path.  For example, before
+creating a cgroup namespace, one would see::
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+After unsharing a new namespace, the view changes::
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # cat /proc/self/cgroup
+  0::/
+
+When some thread from a multi-threaded process unshares its cgroup
+namespace, the new cgroupns gets applied to the entire process (all
+the threads).  This is natural for the v2 hierarchy; however, for the
+legacy hierarchies, this may be unexpected.
+
+A cgroup namespace is alive as long as there are processes inside or
+mounts pinning it.  When the last usage goes away, the cgroup
+namespace is destroyed.  The cgroupns root and the actual cgroups
+remain.
+
+
+The Root and Views
+------------------
+
+The 'cgroupns root' for a cgroup namespace is the cgroup in which the
+process calling unshare(2) is running.  For example, if a process in
+/batchjobs/container_id1 cgroup calls unshare, cgroup
+/batchjobs/container_id1 becomes the cgroupns root.  For the
+init_cgroup_ns, this is the real root ('/') cgroup.
+
+The cgroupns root cgroup does not change even if the namespace creator
+process later moves to a different cgroup::
+
+  # ~/unshare -c # unshare cgroupns in some cgroup
+  # cat /proc/self/cgroup
+  0::/
+  # mkdir sub_cgrp_1
+  # echo 0 > sub_cgrp_1/cgroup.procs
+  # cat /proc/self/cgroup
+  0::/sub_cgrp_1
+
+Each process gets its namespace-specific view of "/proc/$PID/cgroup"
+
+Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+From within an unshared cgroupns::
+
+  # sleep 100000 &
+  [1] 7353
+  # echo 7353 > sub_cgrp_1/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+
+From the initial cgroup namespace, the real cgroup path will be
+visible::
+
+  $ cat /proc/7353/cgroup
+  0::/batchjobs/container_id1/sub_cgrp_1
+
+From a sibling cgroup namespace (that is, a namespace rooted at a
+different cgroup), the cgroup path relative to its own cgroup
+namespace root will be shown.  For instance, if PID 7353's cgroup
+namespace root is at '/batchjobs/container_id2', then it will see::
+
+  # cat /proc/7353/cgroup
+  0::/../container_id2/sub_cgrp_1
+
+Note that the relative path always starts with '/' to indicate that
+its relative to the cgroup namespace root of the caller.
+
+
+Migration and setns(2)
+----------------------
+
+Processes inside a cgroup namespace can move into and out of the
+namespace root if they have proper access to external cgroups.  For
+example, from inside a namespace with cgroupns root at
+/batchjobs/container_id1, and assuming that the global hierarchy is
+still accessible inside cgroupns::
+
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+  # echo 7353 > batchjobs/container_id2/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/../container_id2
+
+Note that this kind of setup is not encouraged.  A task inside cgroup
+namespace should only be exposed to its own cgroupns hierarchy.
+
+setns(2) to another cgroup namespace is allowed when:
+
+(a) the process has CAP_SYS_ADMIN against its current user namespace
+(b) the process has CAP_SYS_ADMIN against the target cgroup
+    namespace's userns
+
+No implicit cgroup changes happen with attaching to another cgroup
+namespace.  It is expected that the someone moves the attaching
+process under the target cgroup namespace root.
+
+
+Interaction with Other Namespaces
+---------------------------------
+
+Namespace specific cgroup hierarchy can be mounted by a process
+running inside a non-init cgroup namespace::
+
+  # mount -t cgroup2 none $MOUNT_POINT
+
+This will mount the unified cgroup hierarchy with cgroupns root as the
+filesystem root.  The process needs CAP_SYS_ADMIN against its user and
+mount namespaces.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+provides a properly isolated cgroup view inside the container.
+
+
+Information on Kernel Programming
+=================================
+
+This section contains kernel programming information in the areas
+where interacting with cgroup is necessary.  cgroup core and
+controllers are not covered.
+
+
+Filesystem Support for Writeback
+--------------------------------
+
+A filesystem can support cgroup writeback by updating
+address_space_operations->writepage[s]() to annotate bio's using the
+following two functions.
+
+  wbc_init_bio(@wbc, @bio)
+	Should be called for each bio carrying writeback data and
+	associates the bio with the inode's owner cgroup.  Can be
+	called anytime between bio allocation and submission.
+
+  wbc_account_io(@wbc, @page, @bytes)
+	Should be called for each data segment being written out.
+	While this function doesn't care exactly when it's called
+	during the writeback session, it's the easiest and most
+	natural to call it as data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting SB_I_CGROUPWB in ->s_iflags.  This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup.  Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion.  There is no one easy solution
+for the problem.  Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
+
+
+Deprecated v1 Core Features
+===========================
+
+- Multiple hierarchies including named ones are not supported.
+
+- All v1 mount options are not supported.
+
+- The "tasks" file is removed and "cgroup.procs" is not sorted.
+
+- "cgroup.clone_children" is removed.
+
+- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" file
+  at the root instead.
+
+
+Issues with v1 and Rationales for v2
+====================================
+
+Multiple Hierarchies
+--------------------
+
+cgroup v1 allowed an arbitrary number of hierarchies and each
+hierarchy could host any number of controllers.  While this seemed to
+provide a high level of flexibility, it wasn't useful in practice.
+
+For example, as there is only one instance of each controller, utility
+type controllers such as freezer which can be useful in all
+hierarchies could only be used in one.  The issue is exacerbated by
+the fact that controllers couldn't be moved to another hierarchy once
+hierarchies were populated.  Another issue was that all controllers
+bound to a hierarchy were forced to have exactly the same view of the
+hierarchy.  It wasn't possible to vary the granularity depending on
+the specific controller.
+
+In practice, these issues heavily limited which controllers could be
+put on the same hierarchy and most configurations resorted to putting
+each controller on its own hierarchy.  Only closely related ones, such
+as the cpu and cpuacct controllers, made sense to be put on the same
+hierarchy.  This often meant that userland ended up managing multiple
+similar hierarchies repeating the same steps on each hierarchy
+whenever a hierarchy management operation was necessary.
+
+Furthermore, support for multiple hierarchies came at a steep cost.
+It greatly complicated cgroup core implementation but more importantly
+the support for multiple hierarchies restricted how cgroup could be
+used in general and what controllers was able to do.
+
+There was no limit on how many hierarchies there might be, which meant
+that a thread's cgroup membership couldn't be described in finite
+length.  The key might contain any number of entries and was unlimited
+in length, which made it highly awkward to manipulate and led to
+addition of controllers which existed only to identify membership,
+which in turn exacerbated the original problem of proliferating number
+of hierarchies.
+
+Also, as a controller couldn't have any expectation regarding the
+topologies of hierarchies other controllers might be on, each
+controller had to assume that all other controllers were attached to
+completely orthogonal hierarchies.  This made it impossible, or at
+least very cumbersome, for controllers to cooperate with each other.
+
+In most use cases, putting controllers on hierarchies which are
+completely orthogonal to each other isn't necessary.  What usually is
+called for is the ability to have differing levels of granularity
+depending on the specific controller.  In other words, hierarchy may
+be collapsed from leaf towards root when viewed from specific
+controllers.  For example, a given configuration might not care about
+how memory is distributed beyond a certain level while still wanting
+to control how CPU cycles are distributed.
+
+
+Thread Granularity
+------------------
+
+cgroup v1 allowed threads of a process to belong to different cgroups.
+This didn't make sense for some controllers and those controllers
+ended up implementing different ways to ignore such situations but
+much more importantly it blurred the line between API exposed to
+individual applications and system management interface.
+
+Generally, in-process knowledge is available only to the process
+itself; thus, unlike service-level organization of processes,
+categorizing threads of a process requires active participation from
+the application which owns the target process.
+
+cgroup v1 had an ambiguously defined delegation model which got abused
+in combination with thread granularity.  cgroups were delegated to
+individual applications so that they can create and manage their own
+sub-hierarchies and control resource distributions along them.  This
+effectively raised cgroup to the status of a syscall-like API exposed
+to lay programs.
+
+First of all, cgroup has a fundamentally inadequate interface to be
+exposed this way.  For a process to access its own knobs, it has to
+extract the path on the target hierarchy from /proc/self/cgroup,
+construct the path by appending the name of the knob to the path, open
+and then read and/or write to it.  This is not only extremely clunky
+and unusual but also inherently racy.  There is no conventional way to
+define transaction across the required steps and nothing can guarantee
+that the process would actually be operating on its own sub-hierarchy.
+
+cgroup controllers implemented a number of knobs which would never be
+accepted as public APIs because they were just adding control knobs to
+system-management pseudo filesystem.  cgroup ended up with interface
+knobs which were not properly abstracted or refined and directly
+revealed kernel internal details.  These knobs got exposed to
+individual applications through the ill-defined delegation mechanism
+effectively abusing cgroup as a shortcut to implementing public APIs
+without going through the required scrutiny.
+
+This was painful for both userland and kernel.  Userland ended up with
+misbehaving and poorly abstracted interfaces and kernel exposing and
+locked into constructs inadvertently.
+
+
+Competition Between Inner Nodes and Threads
+-------------------------------------------
+
+cgroup v1 allowed threads to be in any cgroups which created an
+interesting problem where threads belonging to a parent cgroup and its
+children cgroups competed for resources.  This was nasty as two
+different types of entities competed and there was no obvious way to
+settle it.  Different controllers did different things.
+
+The cpu controller considered threads and cgroups as equivalents and
+mapped nice levels to cgroup weights.  This worked for some cases but
+fell flat when children wanted to be allocated specific ratios of CPU
+cycles and the number of internal threads fluctuated - the ratios
+constantly changed as the number of competing entities fluctuated.
+There also were other issues.  The mapping from nice level to weight
+wasn't obvious or universal, and there were various other knobs which
+simply weren't available for threads.
+
+The io controller implicitly created a hidden leaf node for each
+cgroup to host the threads.  The hidden leaf had its own copies of all
+the knobs with ``leaf_`` prefixed.  While this allowed equivalent
+control over internal threads, it was with serious drawbacks.  It
+always added an extra layer of nesting which wouldn't be necessary
+otherwise, made the interface messy and significantly complicated the
+implementation.
+
+The memory controller didn't have a way to control what happened
+between internal tasks and child cgroups and the behavior was not
+clearly defined.  There were attempts to add ad-hoc behaviors and
+knobs to tailor the behavior to specific workloads which would have
+led to problems extremely difficult to resolve in the long term.
+
+Multiple controllers struggled with internal tasks and came up with
+different ways to deal with it; unfortunately, all the approaches were
+severely flawed and, furthermore, the widely different behaviors
+made cgroup as a whole highly inconsistent.
+
+This clearly is a problem which needs to be addressed from cgroup core
+in a uniform way.
+
+
+Other Interface Issues
+----------------------
+
+cgroup v1 grew without oversight and developed a large number of
+idiosyncrasies and inconsistencies.  One issue on the cgroup core side
+was how an empty cgroup was notified - a userland helper binary was
+forked and executed for each event.  The event delivery wasn't
+recursive or delegatable.  The limitations of the mechanism also led
+to in-kernel event delivery filtering mechanism further complicating
+the interface.
+
+Controller interfaces were problematic too.  An extreme example is
+controllers completely ignoring hierarchical organization and treating
+all cgroups as if they were all located directly under the root
+cgroup.  Some controllers exposed a large amount of inconsistent
+implementation details to userland.
+
+There also was no consistency across controllers.  When a new cgroup
+was created, some controllers defaulted to not imposing extra
+restrictions while others disallowed any resource usage until
+explicitly configured.  Configuration knobs for the same type of
+control used widely differing naming schemes and formats.  Statistics
+and information knobs were named arbitrarily and used different
+formats and units even in the same controller.
+
+cgroup v2 establishes common conventions where appropriate and updates
+controllers so that they expose minimal and consistent interfaces.
+
+
+Controller Issues and Remedies
+------------------------------
+
+Memory
+~~~~~~
+
+The original lower boundary, the soft limit, is defined as a limit
+that is per default unset.  As a result, the set of cgroups that
+global reclaim prefers is opt-in, rather than opt-out.  The costs for
+optimizing these mostly negative lookups are so high that the
+implementation, despite its enormous size, does not even provide the
+basic desirable behavior.  First off, the soft limit has no
+hierarchical meaning.  All configured groups are organized in a global
+rbtree and treated like equal peers, regardless where they are located
+in the hierarchy.  This makes subtree delegation impossible.  Second,
+the soft limit reclaim pass is so aggressive that it not just
+introduces high allocation latencies into the system, but also impacts
+system performance due to overreclaim, to the point where the feature
+becomes self-defeating.
+
+The memory.low boundary on the other hand is a top-down allocated
+reserve.  A cgroup enjoys reclaim protection when it's within its low,
+which makes delegation of subtrees possible.
+
+The original high boundary, the hard limit, is defined as a strict
+limit that can not budge, even if the OOM killer has to be called.
+But this generally goes against the goal of making the most out of the
+available memory.  The memory consumption of workloads varies during
+runtime, and that requires users to overcommit.  But doing that with a
+strict upper limit requires either a fairly accurate prediction of the
+working set size or adding slack to the limit.  Since working set size
+estimation is hard and error prone, and getting it wrong results in
+OOM kills, most users tend to err on the side of a looser limit and
+end up wasting precious resources.
+
+The memory.high boundary on the other hand can be set much more
+conservatively.  When hit, it throttles allocations by forcing them
+into direct reclaim to work off the excess, but it never invokes the
+OOM killer.  As a result, a high boundary that is chosen too
+aggressively will not terminate the processes, but instead it will
+lead to gradual performance degradation.  The user can monitor this
+and make corrections until the minimal memory footprint that still
+gives acceptable performance is found.
+
+In extreme cases, with many concurrent allocations and a complete
+breakdown of reclaim progress within the group, the high boundary can
+be exceeded.  But even then it's mostly better to satisfy the
+allocation from the slack available in other groups or the rest of the
+system than killing the group.  Otherwise, memory.max is there to
+limit this type of spillover and ultimately contain buggy or even
+malicious applications.
+
+Setting the original memory.limit_in_bytes below the current usage was
+subject to a race condition, where concurrent charges could cause the
+limit setting to fail. memory.max on the other hand will first set the
+limit to prevent new charges, and then reclaim and OOM kill until the
+new limit is met - or the task writing to memory.max is killed.
+
+The combined memory+swap accounting and limiting is replaced by real
+control over swap space.
+
+The main argument for a combined memory+swap facility in the original
+cgroup design was that global or parental pressure would always be
+able to swap all anonymous memory of a child group, regardless of the
+child's own (possibly untrusted) configuration.  However, untrusted
+groups can sabotage swapping by other means - such as referencing its
+anonymous memory in a tight loop - and an admin can not assume full
+swappability when overcommitting untrusted jobs.
+
+For trusted jobs, on the other hand, a combined counter is not an
+intuitive userspace interface, and it flies in the face of the idea
+that cgroup controllers should account and limit specific physical
+resources.  Swap space is a resource like all others in the system,
+and that's why unified hierarchy allows distributing it separately.
diff --git a/Documentation/admin-guide/conf.py b/Documentation/admin-guide/conf.py
new file mode 100644
index 000000000..86f738953
--- /dev/null
+++ b/Documentation/admin-guide/conf.py
@@ -0,0 +1,10 @@
+# -*- coding: utf-8; mode: python -*-
+
+project = 'Linux Kernel User Documentation'
+
+tags.add("subproject")
+
+latex_documents = [
+    ('index', 'linux-user.tex', 'Linux Kernel User Documentation',
+     'The kernel development community', 'manual'),
+]
diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst
new file mode 100644
index 000000000..7fadc0533
--- /dev/null
+++ b/Documentation/admin-guide/devices.rst
@@ -0,0 +1,268 @@
+
+Linux allocated devices (4.x+ version)
+======================================
+
+This list is the Linux Device List, the official registry of allocated
+device numbers and ``/dev`` directory nodes for the Linux operating
+system.
+
+The LaTeX version of this document is no longer maintained, nor is
+the document that used to reside at lanana.org.  This version in the
+mainline Linux kernel is the master document.  Updates shall be sent
+as patches to the kernel maintainers (see the
+:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` document).
+Specifically explore the sections titled "CHAR and MISC DRIVERS", and
+"BLOCK LAYER" in the MAINTAINERS file to find the right maintainers
+to involve for character and block devices.
+
+This document is included by reference into the Filesystem Hierarchy
+Standard (FHS).	 The FHS is available from http://www.pathname.com/fhs/.
+
+Allocations marked (68k/Amiga) apply to Linux/68k on the Amiga
+platform only.	Allocations marked (68k/Atari) apply to Linux/68k on
+the Atari platform only.
+
+This document is in the public domain.	The authors requests, however,
+that semantically altered versions are not distributed without
+permission of the authors, assuming the authors can be contacted without
+an unreasonable effort.
+
+
+.. attention::
+
+  DEVICE DRIVERS AUTHORS PLEASE READ THIS
+
+  Linux now has extensive support for dynamic allocation of device numbering
+  and can use ``sysfs`` and ``udev`` (``systemd``) to handle the naming needs.
+  There are still some exceptions in the serial and boot device area. Before
+  asking   for a device number make sure you actually need one.
+
+  To have a major number allocated, or a minor number in situations
+  where that applies (e.g. busmice), please submit a patch and send to
+  the authors as indicated above.
+
+  Keep the description of the device *in the same format
+  as this list*. The reason for this is that it is the only way we have
+  found to ensure we have all the requisite information to publish your
+  device and avoid conflicts.
+
+  Finally, sometimes we have to play "namespace police."  Please don't be
+  offended.  We often get submissions for ``/dev`` names that would be bound
+  to cause conflicts down the road.  We are trying to avoid getting in a
+  situation where we would have to suffer an incompatible forward
+  change.  Therefore, please consult with us **before** you make your
+  device names and numbers in any way public, at least to the point
+  where it would be at all difficult to get them changed.
+
+  Your cooperation is appreciated.
+
+.. include:: devices.txt
+   :literal:
+
+Additional ``/dev/`` directory entries
+--------------------------------------
+
+This section details additional entries that should or may exist in
+the /dev directory.  It is preferred that symbolic links use the same
+form (absolute or relative) as is indicated here.  Links are
+classified as "hard" or "symbolic" depending on the preferred type of
+link; if possible, the indicated type of link should be used.
+
+Compulsory links
+++++++++++++++++
+
+These links should exist on all systems:
+
+=============== =============== =============== ===============================
+/dev/fd		/proc/self/fd	symbolic	File descriptors
+/dev/stdin	fd/0		symbolic	stdin file descriptor
+/dev/stdout	fd/1		symbolic	stdout file descriptor
+/dev/stderr	fd/2		symbolic	stderr file descriptor
+/dev/nfsd	socksys		symbolic	Required by iBCS-2
+/dev/X0R	null		symbolic	Required by iBCS-2
+=============== =============== =============== ===============================
+
+Note: ``/dev/X0R`` is <letter X>-<digit 0>-<letter R>.
+
+Recommended links
++++++++++++++++++
+
+It is recommended that these links exist on all systems:
+
+
+=============== =============== =============== ===============================
+/dev/core	/proc/kcore	symbolic	Backward compatibility
+/dev/ramdisk	ram0		symbolic	Backward compatibility
+/dev/ftape	qft0		symbolic	Backward compatibility
+/dev/bttv0	video0		symbolic	Backward compatibility
+/dev/radio	radio0		symbolic	Backward compatibility
+/dev/i2o*	/dev/i2o/*	symbolic	Backward compatibility
+/dev/scd?	sr?		hard		Alternate SCSI CD-ROM name
+=============== =============== =============== ===============================
+
+Locally defined links
++++++++++++++++++++++
+
+The following links may be established locally to conform to the
+configuration of the system.  This is merely a tabulation of existing
+practice, and does not constitute a recommendation.  However, if they
+exist, they should have the following uses.
+
+=============== =============== =============== ===============================
+/dev/mouse	mouse port	symbolic	Current mouse device
+/dev/tape	tape device	symbolic	Current tape device
+/dev/cdrom	CD-ROM device	symbolic	Current CD-ROM device
+/dev/cdwriter	CD-writer	symbolic	Current CD-writer device
+/dev/scanner	scanner		symbolic	Current scanner device
+/dev/modem	modem port	symbolic	Current dialout device
+/dev/root	root device	symbolic	Current root filesystem
+/dev/swap	swap device	symbolic	Current swap device
+=============== =============== =============== ===============================
+
+``/dev/modem`` should not be used for a modem which supports dialin as
+well as dialout, as it tends to cause lock file problems.  If it
+exists, ``/dev/modem`` should point to the appropriate primary TTY device
+(the use of the alternate callout devices is deprecated).
+
+For SCSI devices, ``/dev/tape`` and ``/dev/cdrom`` should point to the
+*cooked* devices (``/dev/st*`` and ``/dev/sr*``, respectively), whereas
+``/dev/cdwriter`` and /dev/scanner should point to the appropriate generic
+SCSI devices (/dev/sg*).
+
+``/dev/mouse`` may point to a primary serial TTY device, a hardware mouse
+device, or a socket for a mouse driver program (e.g. ``/dev/gpmdata``).
+
+Sockets and pipes
++++++++++++++++++
+
+Non-transient sockets and named pipes may exist in /dev.  Common entries are:
+
+=============== =============== ===============================================
+/dev/printer	socket		lpd local socket
+/dev/log	socket		syslog local socket
+/dev/gpmdata	socket		gpm mouse multiplexer
+=============== =============== ===============================================
+
+Mount points
+++++++++++++
+
+The following names are reserved for mounting special filesystems
+under /dev.  These special filesystems provide kernel interfaces that
+cannot be provided with standard device nodes.
+
+=============== =============== ===============================================
+/dev/pts	devpts		PTY slave filesystem
+/dev/shm	tmpfs		POSIX shared memory maintenance access
+=============== =============== ===============================================
+
+Terminal devices
+----------------
+
+Terminal, or TTY devices are a special class of character devices.  A
+terminal device is any device that could act as a controlling terminal
+for a session; this includes virtual consoles, serial ports, and
+pseudoterminals (PTYs).
+
+All terminal devices share a common set of capabilities known as line
+disciplines; these include the common terminal line discipline as well
+as SLIP and PPP modes.
+
+All terminal devices are named similarly; this section explains the
+naming and use of the various types of TTYs.  Note that the naming
+conventions include several historical warts; some of these are
+Linux-specific, some were inherited from other systems, and some
+reflect Linux outgrowing a borrowed convention.
+
+A hash mark (``#``) in a device name is used here to indicate a decimal
+number without leading zeroes.
+
+Virtual consoles and the console device
++++++++++++++++++++++++++++++++++++++++
+
+Virtual consoles are full-screen terminal displays on the system video
+monitor.  Virtual consoles are named ``/dev/tty#``, with numbering
+starting at ``/dev/tty1``; ``/dev/tty0`` is the current virtual console.
+``/dev/tty0`` is the device that should be used to access the system video
+card on those architectures for which the frame buffer devices
+(``/dev/fb*``) are not applicable. Do not use ``/dev/console``
+for this purpose.
+
+The console device, ``/dev/console``, is the device to which system
+messages should be sent, and on which logins should be permitted in
+single-user mode.  Starting with Linux 2.1.71, ``/dev/console`` is managed
+by the kernel; for previous versions it should be a symbolic link to
+either ``/dev/tty0``, a specific virtual console such as ``/dev/tty1``, or to
+a serial port primary (``tty*``, not ``cu*``) device, depending on the
+configuration of the system.
+
+Serial ports
+++++++++++++
+
+Serial ports are RS-232 serial ports and any device which simulates
+one, either in hardware (such as internal modems) or in software (such
+as the ISDN driver.)  Under Linux, each serial ports has two device
+names, the primary or callin device and the alternate or callout one.
+Each kind of device is indicated by a different letter.	 For any
+letter X, the names of the devices are ``/dev/ttyX#`` and ``/dev/cux#``,
+respectively; for historical reasons, ``/dev/ttyS#`` and ``/dev/ttyC#``
+correspond to ``/dev/cua#`` and ``/dev/cub#``. In the future, it should be
+expected that multiple letters will be used; all letters will be upper
+case for the "tty" device (e.g. ``/dev/ttyDP#``) and lower case for the
+"cu" device (e.g. ``/dev/cudp#``).
+
+The names ``/dev/ttyQ#`` and ``/dev/cuq#`` are reserved for local use.
+
+The alternate devices provide for kernel-based exclusion and somewhat
+different defaults than the primary devices.  Their main purpose is to
+allow the use of serial ports with programs with no inherent or broken
+support for serial ports.  Their use is deprecated, and they may be
+removed from a future version of Linux.
+
+Arbitration of serial ports is provided by the use of lock files with
+the names ``/var/lock/LCK..ttyX#``. The contents of the lock file should
+be the PID of the locking process as an ASCII number.
+
+It is common practice to install links such as /dev/modem
+which point to serial ports.  In order to ensure proper locking in the
+presence of these links, it is recommended that software chase
+symlinks and lock all possible names; additionally, it is recommended
+that a lock file be installed with the corresponding alternate
+device.	 In order to avoid deadlocks, it is recommended that the locks
+are acquired in the following order, and released in the reverse:
+
+	1. The symbolic link name, if any (``/var/lock/LCK..modem``)
+	2. The "tty" name (``/var/lock/LCK..ttyS2``)
+	3. The alternate device name (``/var/lock/LCK..cua2``)
+
+In the case of nested symbolic links, the lock files should be
+installed in the order the symlinks are resolved.
+
+Under no circumstances should an application hold a lock while waiting
+for another to be released.  In addition, applications which attempt
+to create lock files for the corresponding alternate device names
+should take into account the possibility of being used on a non-serial
+port TTY, for which no alternate device would exist.
+
+Pseudoterminals (PTYs)
+++++++++++++++++++++++
+
+Pseudoterminals, or PTYs, are used to create login sessions or provide
+other capabilities requiring a TTY line discipline (including SLIP or
+PPP capability) to arbitrary data-generation processes.	 Each PTY has
+a master side, named ``/dev/pty[p-za-e][0-9a-f]``, and a slave side, named
+``/dev/tty[p-za-e][0-9a-f]``.  The kernel arbitrates the use of PTYs by
+allowing each master side to be opened only once.
+
+Once the master side has been opened, the corresponding slave device
+can be used in the same manner as any TTY device.  The master and
+slave devices are connected by the kernel, generating the equivalent
+of a bidirectional pipe with TTY capabilities.
+
+Recent versions of the Linux kernels and GNU libc contain support for
+the System V/Unix98 naming scheme for PTYs, which assigns a common
+device, ``/dev/ptmx``, to all the masters (opening it will automatically
+give you a previously unassigned PTY) and a subdirectory, ``/dev/pts``,
+for the slaves; the slaves are named with decimal integers (``/dev/pts/#``
+in our notation).  This removes the problem of exhausting the
+namespace and enables the kernel to automatically create the device
+nodes for the slaves on demand using the "devpts" filesystem.
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt
new file mode 100644
index 000000000..9cd7926c8
--- /dev/null
+++ b/Documentation/admin-guide/devices.txt
@@ -0,0 +1,3092 @@
+   0		Unnamed devices (e.g. non-device mounts)
+		  0 = reserved as null device number
+		See block major 144, 145, 146 for expansion areas.
+
+   1 char	Memory devices
+		  1 = /dev/mem		Physical memory access
+		  2 = /dev/kmem		Kernel virtual memory access
+		  3 = /dev/null		Null device
+		  4 = /dev/port		I/O port access
+		  5 = /dev/zero		Null byte source
+		  6 = /dev/core		OBSOLETE - replaced by /proc/kcore
+		  7 = /dev/full		Returns ENOSPC on write
+		  8 = /dev/random	Nondeterministic random number gen.
+		  9 = /dev/urandom	Faster, less secure random number gen.
+		 10 = /dev/aio		Asynchronous I/O notification interface
+		 11 = /dev/kmsg		Writes to this come out as printk's, reads
+					export the buffered printk records.
+		 12 = /dev/oldmem	OBSOLETE - replaced by /proc/vmcore
+
+   1 block	RAM disk
+		  0 = /dev/ram0		First RAM disk
+		  1 = /dev/ram1		Second RAM disk
+		    ...
+		250 = /dev/initrd	Initial RAM disk
+
+		Older kernels had /dev/ramdisk (1, 1) here.
+		/dev/initrd refers to a RAM disk which was preloaded
+		by the boot loader; newer kernels use /dev/ram0 for
+		the initrd.
+
+   2 char	Pseudo-TTY masters
+		  0 = /dev/ptyp0	First PTY master
+		  1 = /dev/ptyp1	Second PTY master
+		    ...
+		255 = /dev/ptyef	256th PTY master
+
+		Pseudo-tty's are named as follows:
+		* Masters are "pty", slaves are "tty";
+		* the fourth letter is one of pqrstuvwxyzabcde indicating
+		  the 1st through 16th series of 16 pseudo-ttys each, and
+		* the fifth letter is one of 0123456789abcdef indicating
+		  the position within the series.
+
+		These are the old-style (BSD) PTY devices; Unix98
+		devices are on major 128 and above and use the PTY
+		master multiplex (/dev/ptmx) to acquire a PTY on
+		demand.
+
+   2 block	Floppy disks
+		  0 = /dev/fd0		Controller 0, drive 0, autodetect
+		  1 = /dev/fd1		Controller 0, drive 1, autodetect
+		  2 = /dev/fd2		Controller 0, drive 2, autodetect
+		  3 = /dev/fd3		Controller 0, drive 3, autodetect
+		128 = /dev/fd4		Controller 1, drive 0, autodetect
+		129 = /dev/fd5		Controller 1, drive 1, autodetect
+		130 = /dev/fd6		Controller 1, drive 2, autodetect
+		131 = /dev/fd7		Controller 1, drive 3, autodetect
+
+		To specify format, add to the autodetect device number:
+		  0 = /dev/fd?		Autodetect format
+		  4 = /dev/fd?d360	5.25"  360K in a 360K  drive(1)
+		 20 = /dev/fd?h360	5.25"  360K in a 1200K drive(1)
+		 48 = /dev/fd?h410	5.25"  410K in a 1200K drive
+		 64 = /dev/fd?h420	5.25"  420K in a 1200K drive
+		 24 = /dev/fd?h720	5.25"  720K in a 1200K drive
+		 80 = /dev/fd?h880	5.25"  880K in a 1200K drive(1)
+		  8 = /dev/fd?h1200	5.25" 1200K in a 1200K drive(1)
+		 40 = /dev/fd?h1440	5.25" 1440K in a 1200K drive(1)
+		 56 = /dev/fd?h1476	5.25" 1476K in a 1200K drive
+		 72 = /dev/fd?h1494	5.25" 1494K in a 1200K drive
+		 92 = /dev/fd?h1600	5.25" 1600K in a 1200K drive(1)
+
+		 12 = /dev/fd?u360	3.5"   360K Double Density(2)
+		 16 = /dev/fd?u720	3.5"   720K Double Density(1)
+		120 = /dev/fd?u800	3.5"   800K Double Density(2)
+		 52 = /dev/fd?u820	3.5"   820K Double Density
+		 68 = /dev/fd?u830	3.5"   830K Double Density
+		 84 = /dev/fd?u1040	3.5"  1040K Double Density(1)
+		 88 = /dev/fd?u1120	3.5"  1120K Double Density(1)
+		 28 = /dev/fd?u1440	3.5"  1440K High Density(1)
+		124 = /dev/fd?u1600	3.5"  1600K High Density(1)
+		 44 = /dev/fd?u1680	3.5"  1680K High Density(3)
+		 60 = /dev/fd?u1722	3.5"  1722K High Density
+		 76 = /dev/fd?u1743	3.5"  1743K High Density
+		 96 = /dev/fd?u1760	3.5"  1760K High Density
+		116 = /dev/fd?u1840	3.5"  1840K High Density(3)
+		100 = /dev/fd?u1920	3.5"  1920K High Density(1)
+		 32 = /dev/fd?u2880	3.5"  2880K Extra Density(1)
+		104 = /dev/fd?u3200	3.5"  3200K Extra Density
+		108 = /dev/fd?u3520	3.5"  3520K Extra Density
+		112 = /dev/fd?u3840	3.5"  3840K Extra Density(1)
+
+		 36 = /dev/fd?CompaQ	Compaq 2880K drive; obsolete?
+
+		(1) Autodetectable format
+		(2) Autodetectable format in a Double Density (720K) drive only
+		(3) Autodetectable format in a High Density (1440K) drive only
+
+		NOTE: The letter in the device name (d, q, h or u)
+		signifies the type of drive: 5.25" Double Density (d),
+		5.25" Quad Density (q), 5.25" High Density (h) or 3.5"
+		(any model, u).	 The use of the capital letters D, H
+		and E for the 3.5" models have been deprecated, since
+		the drive type is insignificant for these devices.
+
+   3 char	Pseudo-TTY slaves
+		  0 = /dev/ttyp0	First PTY slave
+		  1 = /dev/ttyp1	Second PTY slave
+		    ...
+		255 = /dev/ttyef	256th PTY slave
+
+		These are the old-style (BSD) PTY devices; Unix98
+		devices are on major 136 and above.
+
+   3 block	First MFM, RLL and IDE hard disk/CD-ROM interface
+		  0 = /dev/hda		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdb		Slave: whole disk (or CD-ROM)
+
+		For partitions, add to the whole disk device number:
+		  0 = /dev/hd?		Whole disk
+		  1 = /dev/hd?1		First partition
+		  2 = /dev/hd?2		Second partition
+		    ...
+		 63 = /dev/hd?63	63rd partition
+
+		For Linux/i386, partitions 1-4 are the primary
+		partitions, and 5 and above are logical partitions.
+		Other versions of Linux use partitioning schemes
+		appropriate to their respective architectures.
+
+   4 char	TTY devices
+		  0 = /dev/tty0		Current virtual console
+
+		  1 = /dev/tty1		First virtual console
+		    ...
+		 63 = /dev/tty63	63rd virtual console
+		 64 = /dev/ttyS0	First UART serial port
+		    ...
+		255 = /dev/ttyS191	192nd UART serial port
+
+		UART serial ports refer to 8250/16450/16550 series devices.
+
+		Older versions of the Linux kernel used this major
+		number for BSD PTY devices.  As of Linux 2.1.115, this
+		is no longer supported.	 Use major numbers 2 and 3.
+
+   4 block	Aliases for dynamically allocated major devices to be used
+		when its not possible to create the real device nodes
+		because the root filesystem is mounted read-only.
+
+		   0 = /dev/root
+
+   5 char	Alternate TTY devices
+		  0 = /dev/tty		Current TTY device
+		  1 = /dev/console	System console
+		  2 = /dev/ptmx		PTY master multiplex
+		  3 = /dev/ttyprintk	User messages via printk TTY device
+		 64 = /dev/cua0		Callout device for ttyS0
+		    ...
+		255 = /dev/cua191	Callout device for ttyS191
+
+		(5,1) is /dev/console starting with Linux 2.1.71.  See
+		the section on terminal devices for more information
+		on /dev/console.
+
+   6 char	Parallel printer devices
+		  0 = /dev/lp0		Parallel printer on parport0
+		  1 = /dev/lp1		Parallel printer on parport1
+		    ...
+
+		Current Linux kernels no longer have a fixed mapping
+		between parallel ports and I/O addresses.  Instead,
+		they are redirected through the parport multiplex layer.
+
+   7 char	Virtual console capture devices
+		  0 = /dev/vcs		Current vc text (glyph) contents
+		  1 = /dev/vcs1		tty1 text (glyph) contents
+		    ...
+		 63 = /dev/vcs63	tty63 text (glyph) contents
+		 64 = /dev/vcsu		Current vc text (unicode) contents
+		65 = /dev/vcsu1		tty1 text (unicode) contents
+		    ...
+		127 = /dev/vcsu63	tty63 text (unicode) contents
+		128 = /dev/vcsa		Current vc text/attribute (glyph) contents
+		129 = /dev/vcsa1	tty1 text/attribute (glyph) contents
+		    ...
+		191 = /dev/vcsa63	tty63 text/attribute (glyph) contents
+
+		NOTE: These devices permit both read and write access.
+
+   7 block	Loopback devices
+		  0 = /dev/loop0	First loop device
+		  1 = /dev/loop1	Second loop device
+		    ...
+
+		The loop devices are used to mount filesystems not
+		associated with block devices.	The binding to the
+		loop devices is handled by mount(8) or losetup(8).
+
+   8 block	SCSI disk devices (0-15)
+		  0 = /dev/sda		First SCSI disk whole disk
+		 16 = /dev/sdb		Second SCSI disk whole disk
+		 32 = /dev/sdc		Third SCSI disk whole disk
+		    ...
+		240 = /dev/sdp		Sixteenth SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+   9 char	SCSI tape devices
+		  0 = /dev/st0		First SCSI tape, mode 0
+		  1 = /dev/st1		Second SCSI tape, mode 0
+		    ...
+		 32 = /dev/st0l		First SCSI tape, mode 1
+		 33 = /dev/st1l		Second SCSI tape, mode 1
+		    ...
+		 64 = /dev/st0m		First SCSI tape, mode 2
+		 65 = /dev/st1m		Second SCSI tape, mode 2
+		    ...
+		 96 = /dev/st0a		First SCSI tape, mode 3
+		 97 = /dev/st1a		Second SCSI tape, mode 3
+		      ...
+		128 = /dev/nst0		First SCSI tape, mode 0, no rewind
+		129 = /dev/nst1		Second SCSI tape, mode 0, no rewind
+		    ...
+		160 = /dev/nst0l	First SCSI tape, mode 1, no rewind
+		161 = /dev/nst1l	Second SCSI tape, mode 1, no rewind
+		    ...
+		192 = /dev/nst0m	First SCSI tape, mode 2, no rewind
+		193 = /dev/nst1m	Second SCSI tape, mode 2, no rewind
+		    ...
+		224 = /dev/nst0a	First SCSI tape, mode 3, no rewind
+		225 = /dev/nst1a	Second SCSI tape, mode 3, no rewind
+		    ...
+
+		"No rewind" refers to the omission of the default
+		automatic rewind on device close.  The MTREW or MTOFFL
+		ioctl()'s can be used to rewind the tape regardless of
+		the device used to access it.
+
+   9 block	Metadisk (RAID) devices
+		  0 = /dev/md0		First metadisk group
+		  1 = /dev/md1		Second metadisk group
+		    ...
+
+		The metadisk driver is used to span a
+		filesystem across multiple physical disks.
+
+  10 char	Non-serial mice, misc features
+		  0 = /dev/logibm	Logitech bus mouse
+		  1 = /dev/psaux	PS/2-style mouse port
+		  2 = /dev/inportbm	Microsoft Inport bus mouse
+		  3 = /dev/atibm	ATI XL bus mouse
+		  4 = /dev/jbm		J-mouse
+		  4 = /dev/amigamouse	Amiga mouse (68k/Amiga)
+		  5 = /dev/atarimouse	Atari mouse
+		  6 = /dev/sunmouse	Sun mouse
+		  7 = /dev/amigamouse1	Second Amiga mouse
+		  8 = /dev/smouse	Simple serial mouse driver
+		  9 = /dev/pc110pad	IBM PC-110 digitizer pad
+		 10 = /dev/adbmouse	Apple Desktop Bus mouse
+		 11 = /dev/vrtpanel	Vr41xx embedded touch panel
+		 13 = /dev/vpcmouse	Connectix Virtual PC Mouse
+		 14 = /dev/touchscreen/ucb1x00  UCB 1x00 touchscreen
+		 15 = /dev/touchscreen/mk712	MK712 touchscreen
+		128 = /dev/beep		Fancy beep device
+		129 =
+		130 = /dev/watchdog	Watchdog timer port
+		131 = /dev/temperature	Machine internal temperature
+		132 = /dev/hwtrap	Hardware fault trap
+		133 = /dev/exttrp	External device trap
+		134 = /dev/apm_bios	Advanced Power Management BIOS
+		135 = /dev/rtc		Real Time Clock
+		137 = /dev/vhci		Bluetooth virtual HCI driver
+		139 = /dev/openprom	SPARC OpenBoot PROM
+		140 = /dev/relay8	Berkshire Products Octal relay card
+		141 = /dev/relay16	Berkshire Products ISO-16 relay card
+		142 =
+		143 = /dev/pciconf	PCI configuration space
+		144 = /dev/nvram	Non-volatile configuration RAM
+		145 = /dev/hfmodem	Soundcard shortwave modem control
+		146 = /dev/graphics	Linux/SGI graphics device
+		147 = /dev/opengl	Linux/SGI OpenGL pipe
+		148 = /dev/gfx		Linux/SGI graphics effects device
+		149 = /dev/input/mouse	Linux/SGI Irix emulation mouse
+		150 = /dev/input/keyboard Linux/SGI Irix emulation keyboard
+		151 = /dev/led		Front panel LEDs
+		152 = /dev/kpoll	Kernel Poll Driver
+		153 = /dev/mergemem	Memory merge device
+		154 = /dev/pmu		Macintosh PowerBook power manager
+		155 = /dev/isictl	MultiTech ISICom serial control
+		156 = /dev/lcd		Front panel LCD display
+		157 = /dev/ac		Applicom Intl Profibus card
+		158 = /dev/nwbutton	Netwinder external button
+		159 = /dev/nwdebug	Netwinder debug interface
+		160 = /dev/nwflash	Netwinder flash memory
+		161 = /dev/userdma	User-space DMA access
+		162 = /dev/smbus	System Management Bus
+		163 = /dev/lik		Logitech Internet Keyboard
+		164 = /dev/ipmo		Intel Intelligent Platform Management
+		165 = /dev/vmmon	VMware virtual machine monitor
+		166 = /dev/i2o/ctl	I2O configuration manager
+		167 = /dev/specialix_sxctl Specialix serial control
+		168 = /dev/tcldrv	Technology Concepts serial control
+		169 = /dev/specialix_rioctl Specialix RIO serial control
+		170 = /dev/thinkpad/thinkpad	IBM Thinkpad devices
+		171 = /dev/srripc	QNX4 API IPC manager
+		172 = /dev/usemaclone	Semaphore clone device
+		173 = /dev/ipmikcs	Intelligent Platform Management
+		174 = /dev/uctrl	SPARCbook 3 microcontroller
+		175 = /dev/agpgart	AGP Graphics Address Remapping Table
+		176 = /dev/gtrsc	Gorgy Timing radio clock
+		177 = /dev/cbm		Serial CBM bus
+		178 = /dev/jsflash	JavaStation OS flash SIMM
+		179 = /dev/xsvc		High-speed shared-mem/semaphore service
+		180 = /dev/vrbuttons	Vr41xx button input device
+		181 = /dev/toshiba	Toshiba laptop SMM support
+		182 = /dev/perfctr	Performance-monitoring counters
+		183 = /dev/hwrng	Generic random number generator
+		184 = /dev/cpu/microcode CPU microcode update interface
+		186 = /dev/atomicps	Atomic shapshot of process state data
+		187 = /dev/irnet	IrNET device
+		188 = /dev/smbusbios	SMBus BIOS
+		189 = /dev/ussp_ctl	User space serial port control
+		190 = /dev/crash	Mission Critical Linux crash dump facility
+		191 = /dev/pcl181	<information missing>
+		192 = /dev/nas_xbus	NAS xbus LCD/buttons access
+		193 = /dev/d7s		SPARC 7-segment display
+		194 = /dev/zkshim	Zero-Knowledge network shim control
+		195 = /dev/elographics/e2201	Elographics touchscreen E271-2201
+		196 = /dev/vfio/vfio	VFIO userspace driver interface
+		197 = /dev/pxa3xx-gcu	PXA3xx graphics controller unit driver
+		198 = /dev/sexec	Signed executable interface
+		199 = /dev/scanners/cuecat :CueCat barcode scanner
+		200 = /dev/net/tun	TAP/TUN network device
+		201 = /dev/button/gulpb	Transmeta GULP-B buttons
+		202 = /dev/emd/ctl	Enhanced Metadisk RAID (EMD) control
+		203 = /dev/cuse		Cuse (character device in user-space)
+		204 = /dev/video/em8300		EM8300 DVD decoder control
+		205 = /dev/video/em8300_mv	EM8300 DVD decoder video
+		206 = /dev/video/em8300_ma	EM8300 DVD decoder audio
+		207 = /dev/video/em8300_sp	EM8300 DVD decoder subpicture
+		208 = /dev/compaq/cpqphpc	Compaq PCI Hot Plug Controller
+		209 = /dev/compaq/cpqrid	Compaq Remote Insight Driver
+		210 = /dev/impi/bt	IMPI coprocessor block transfer
+		211 = /dev/impi/smic	IMPI coprocessor stream interface
+		212 = /dev/watchdogs/0	First watchdog device
+		213 = /dev/watchdogs/1	Second watchdog device
+		214 = /dev/watchdogs/2	Third watchdog device
+		215 = /dev/watchdogs/3	Fourth watchdog device
+		216 = /dev/fujitsu/apanel	Fujitsu/Siemens application panel
+		217 = /dev/ni/natmotn		National Instruments Motion
+		218 = /dev/kchuid	Inter-process chuid control
+		219 = /dev/modems/mwave	MWave modem firmware upload
+		220 = /dev/mptctl	Message passing technology (MPT) control
+		221 = /dev/mvista/hssdsi	Montavista PICMG hot swap system driver
+		222 = /dev/mvista/hasi		Montavista PICMG high availability
+		223 = /dev/input/uinput		User level driver support for input
+		224 = /dev/tpm		TCPA TPM driver
+		225 = /dev/pps		Pulse Per Second driver
+		226 = /dev/systrace	Systrace device
+		227 = /dev/mcelog	X86_64 Machine Check Exception driver
+		228 = /dev/hpet		HPET driver
+		229 = /dev/fuse		Fuse (virtual filesystem in user-space)
+		230 = /dev/midishare	MidiShare driver
+		231 = /dev/snapshot	System memory snapshot device
+		232 = /dev/kvm		Kernel-based virtual machine (hardware virtualization extensions)
+		233 = /dev/kmview	View-OS A process with a view
+		234 = /dev/btrfs-control	Btrfs control device
+		235 = /dev/autofs	Autofs control device
+		236 = /dev/mapper/control	Device-Mapper control device
+		237 = /dev/loop-control Loopback control device
+		238 = /dev/vhost-net	Host kernel accelerator for virtio net
+		239 = /dev/uhid		User-space I/O driver support for HID subsystem
+		240 = /dev/userio	Serio driver testing device
+		241 = /dev/vhost-vsock	Host kernel driver for virtio vsock
+
+		242-254			Reserved for local use
+		255			Reserved for MISC_DYNAMIC_MINOR
+
+  11 char	Raw keyboard device	(Linux/SPARC only)
+		  0 = /dev/kbd		Raw keyboard device
+
+  11 char	Serial Mux device	(Linux/PA-RISC only)
+		  0 = /dev/ttyB0	First mux port
+		  1 = /dev/ttyB1	Second mux port
+		    ...
+
+  11 block	SCSI CD-ROM devices
+		  0 = /dev/scd0		First SCSI CD-ROM
+		  1 = /dev/scd1		Second SCSI CD-ROM
+		    ...
+
+		The prefix /dev/sr (instead of /dev/scd) has been deprecated.
+
+  12 char	QIC-02 tape
+		  2 = /dev/ntpqic11	QIC-11, no rewind-on-close
+		  3 = /dev/tpqic11	QIC-11, rewind-on-close
+		  4 = /dev/ntpqic24	QIC-24, no rewind-on-close
+		  5 = /dev/tpqic24	QIC-24, rewind-on-close
+		  6 = /dev/ntpqic120	QIC-120, no rewind-on-close
+		  7 = /dev/tpqic120	QIC-120, rewind-on-close
+		  8 = /dev/ntpqic150	QIC-150, no rewind-on-close
+		  9 = /dev/tpqic150	QIC-150, rewind-on-close
+
+		The device names specified are proposed -- if there
+		are "standard" names for these devices, please let me know.
+
+  12 block
+
+  13 char	Input core
+		  0 = /dev/input/js0	First joystick
+		  1 = /dev/input/js1	Second joystick
+		    ...
+		 32 = /dev/input/mouse0	First mouse
+		 33 = /dev/input/mouse1	Second mouse
+		    ...
+		 63 = /dev/input/mice	Unified mouse
+		 64 = /dev/input/event0	First event queue
+		 65 = /dev/input/event1	Second event queue
+		    ...
+
+		Each device type has 5 bits (32 minors).
+
+  13 block	Previously used for the XT disk (/dev/xdN)
+		Deleted in kernel v3.9.
+
+  14 char	Open Sound System (OSS)
+		  0 = /dev/mixer	Mixer control
+		  1 = /dev/sequencer	Audio sequencer
+		  2 = /dev/midi00	First MIDI port
+		  3 = /dev/dsp		Digital audio
+		  4 = /dev/audio	Sun-compatible digital audio
+		  6 =
+		  7 = /dev/audioctl	SPARC audio control device
+		  8 = /dev/sequencer2	Sequencer -- alternate device
+		 16 = /dev/mixer1	Second soundcard mixer control
+		 17 = /dev/patmgr0	Sequencer patch manager
+		 18 = /dev/midi01	Second MIDI port
+		 19 = /dev/dsp1		Second soundcard digital audio
+		 20 = /dev/audio1	Second soundcard Sun digital audio
+		 33 = /dev/patmgr1	Sequencer patch manager
+		 34 = /dev/midi02	Third MIDI port
+		 50 = /dev/midi03	Fourth MIDI port
+
+  14 block
+
+  15 char	Joystick
+		  0 = /dev/js0		First analog joystick
+		  1 = /dev/js1		Second analog joystick
+		    ...
+		128 = /dev/djs0		First digital joystick
+		129 = /dev/djs1		Second digital joystick
+		    ...
+  15 block	Sony CDU-31A/CDU-33A CD-ROM
+		  0 = /dev/sonycd	Sony CDU-31a CD-ROM
+
+  16 char	Non-SCSI scanners
+		  0 = /dev/gs4500	Genius 4500 handheld scanner
+
+  16 block	GoldStar CD-ROM
+		  0 = /dev/gscd		GoldStar CD-ROM
+
+  17 char	OBSOLETE (was Chase serial card)
+		  0 = /dev/ttyH0	First Chase port
+		  1 = /dev/ttyH1	Second Chase port
+		    ...
+  17 block	Optics Storage CD-ROM
+		  0 = /dev/optcd	Optics Storage CD-ROM
+
+  18 char	OBSOLETE (was Chase serial card - alternate devices)
+		  0 = /dev/cuh0		Callout device for ttyH0
+		  1 = /dev/cuh1		Callout device for ttyH1
+		    ...
+  18 block	Sanyo CD-ROM
+		  0 = /dev/sjcd		Sanyo CD-ROM
+
+  19 char	Cyclades serial card
+		  0 = /dev/ttyC0	First Cyclades port
+		    ...
+		 31 = /dev/ttyC31	32nd Cyclades port
+
+  19 block	"Double" compressed disk
+		  0 = /dev/double0	First compressed disk
+		    ...
+		  7 = /dev/double7	Eighth compressed disk
+		128 = /dev/cdouble0	Mirror of first compressed disk
+		    ...
+		135 = /dev/cdouble7	Mirror of eighth compressed disk
+
+		See the Double documentation for the meaning of the
+		mirror devices.
+
+  20 char	Cyclades serial card - alternate devices
+		  0 = /dev/cub0		Callout device for ttyC0
+		    ...
+		 31 = /dev/cub31	Callout device for ttyC31
+
+  20 block	Hitachi CD-ROM (under development)
+		  0 = /dev/hitcd	Hitachi CD-ROM
+
+  21 char	Generic SCSI access
+		  0 = /dev/sg0		First generic SCSI device
+		  1 = /dev/sg1		Second generic SCSI device
+		    ...
+
+		Most distributions name these /dev/sga, /dev/sgb...;
+		this sets an unnecessary limit of 26 SCSI devices in
+		the system and is counter to standard Linux
+		device-naming practice.
+
+  21 block	Acorn MFM hard drive interface
+		  0 = /dev/mfma		First MFM drive whole disk
+		 64 = /dev/mfmb		Second MFM drive whole disk
+
+		This device is used on the ARM-based Acorn RiscPC.
+		Partitions are handled the same way as for IDE disks
+		(see major number 3).
+
+  22 char	Digiboard serial card
+		  0 = /dev/ttyD0	First Digiboard port
+		  1 = /dev/ttyD1	Second Digiboard port
+		    ...
+  22 block	Second IDE hard disk/CD-ROM interface
+		  0 = /dev/hdc		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdd		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  23 char	Digiboard serial card - alternate devices
+		  0 = /dev/cud0		Callout device for ttyD0
+		  1 = /dev/cud1		Callout device for ttyD1
+		      ...
+  23 block	Mitsumi proprietary CD-ROM
+		  0 = /dev/mcd		Mitsumi CD-ROM
+
+  24 char	Stallion serial card
+		  0 = /dev/ttyE0	Stallion port 0 card 0
+		  1 = /dev/ttyE1	Stallion port 1 card 0
+		    ...
+		 64 = /dev/ttyE64	Stallion port 0 card 1
+		 65 = /dev/ttyE65	Stallion port 1 card 1
+		      ...
+		128 = /dev/ttyE128	Stallion port 0 card 2
+		129 = /dev/ttyE129	Stallion port 1 card 2
+		    ...
+		192 = /dev/ttyE192	Stallion port 0 card 3
+		193 = /dev/ttyE193	Stallion port 1 card 3
+		    ...
+  24 block	Sony CDU-535 CD-ROM
+		  0 = /dev/cdu535	Sony CDU-535 CD-ROM
+
+  25 char	Stallion serial card - alternate devices
+		  0 = /dev/cue0		Callout device for ttyE0
+		  1 = /dev/cue1		Callout device for ttyE1
+		    ...
+		 64 = /dev/cue64	Callout device for ttyE64
+		 65 = /dev/cue65	Callout device for ttyE65
+		    ...
+		128 = /dev/cue128	Callout device for ttyE128
+		129 = /dev/cue129	Callout device for ttyE129
+		    ...
+		192 = /dev/cue192	Callout device for ttyE192
+		193 = /dev/cue193	Callout device for ttyE193
+		      ...
+  25 block	First Matsushita (Panasonic/SoundBlaster) CD-ROM
+		  0 = /dev/sbpcd0	Panasonic CD-ROM controller 0 unit 0
+		  1 = /dev/sbpcd1	Panasonic CD-ROM controller 0 unit 1
+		  2 = /dev/sbpcd2	Panasonic CD-ROM controller 0 unit 2
+		  3 = /dev/sbpcd3	Panasonic CD-ROM controller 0 unit 3
+
+  26 char
+
+  26 block	Second Matsushita (Panasonic/SoundBlaster) CD-ROM
+		  0 = /dev/sbpcd4	Panasonic CD-ROM controller 1 unit 0
+		  1 = /dev/sbpcd5	Panasonic CD-ROM controller 1 unit 1
+		  2 = /dev/sbpcd6	Panasonic CD-ROM controller 1 unit 2
+		  3 = /dev/sbpcd7	Panasonic CD-ROM controller 1 unit 3
+
+  27 char	QIC-117 tape
+		  0 = /dev/qft0		Unit 0, rewind-on-close
+		  1 = /dev/qft1		Unit 1, rewind-on-close
+		  2 = /dev/qft2		Unit 2, rewind-on-close
+		  3 = /dev/qft3		Unit 3, rewind-on-close
+		  4 = /dev/nqft0	Unit 0, no rewind-on-close
+		  5 = /dev/nqft1	Unit 1, no rewind-on-close
+		  6 = /dev/nqft2	Unit 2, no rewind-on-close
+		  7 = /dev/nqft3	Unit 3, no rewind-on-close
+		 16 = /dev/zqft0	Unit 0, rewind-on-close, compression
+		 17 = /dev/zqft1	Unit 1, rewind-on-close, compression
+		 18 = /dev/zqft2	Unit 2, rewind-on-close, compression
+		 19 = /dev/zqft3	Unit 3, rewind-on-close, compression
+		 20 = /dev/nzqft0	Unit 0, no rewind-on-close, compression
+		 21 = /dev/nzqft1	Unit 1, no rewind-on-close, compression
+		 22 = /dev/nzqft2	Unit 2, no rewind-on-close, compression
+		 23 = /dev/nzqft3	Unit 3, no rewind-on-close, compression
+		 32 = /dev/rawqft0	Unit 0, rewind-on-close, no file marks
+		 33 = /dev/rawqft1	Unit 1, rewind-on-close, no file marks
+		 34 = /dev/rawqft2	Unit 2, rewind-on-close, no file marks
+		 35 = /dev/rawqft3	Unit 3, rewind-on-close, no file marks
+		 36 = /dev/nrawqft0	Unit 0, no rewind-on-close, no file marks
+		 37 = /dev/nrawqft1	Unit 1, no rewind-on-close, no file marks
+		 38 = /dev/nrawqft2	Unit 2, no rewind-on-close, no file marks
+		 39 = /dev/nrawqft3	Unit 3, no rewind-on-close, no file marks
+
+  27 block	Third Matsushita (Panasonic/SoundBlaster) CD-ROM
+		  0 = /dev/sbpcd8	Panasonic CD-ROM controller 2 unit 0
+		  1 = /dev/sbpcd9	Panasonic CD-ROM controller 2 unit 1
+		  2 = /dev/sbpcd10	Panasonic CD-ROM controller 2 unit 2
+		  3 = /dev/sbpcd11	Panasonic CD-ROM controller 2 unit 3
+
+  28 char	Stallion serial card - card programming
+		  0 = /dev/staliomem0	First Stallion card I/O memory
+		  1 = /dev/staliomem1	Second Stallion card I/O memory
+		  2 = /dev/staliomem2	Third Stallion card I/O memory
+		  3 = /dev/staliomem3	Fourth Stallion card I/O memory
+
+  28 char	Atari SLM ACSI laser printer (68k/Atari)
+		  0 = /dev/slm0		First SLM laser printer
+		  1 = /dev/slm1		Second SLM laser printer
+		    ...
+  28 block	Fourth Matsushita (Panasonic/SoundBlaster) CD-ROM
+		  0 = /dev/sbpcd12	Panasonic CD-ROM controller 3 unit 0
+		  1 = /dev/sbpcd13	Panasonic CD-ROM controller 3 unit 1
+		  2 = /dev/sbpcd14	Panasonic CD-ROM controller 3 unit 2
+		  3 = /dev/sbpcd15	Panasonic CD-ROM controller 3 unit 3
+
+  28 block	ACSI disk (68k/Atari)
+		  0 = /dev/ada		First ACSI disk whole disk
+		 16 = /dev/adb		Second ACSI disk whole disk
+		 32 = /dev/adc		Third ACSI disk whole disk
+		    ...
+		240 = /dev/adp		16th ACSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15, like SCSI.
+
+  29 char	Universal frame buffer
+		  0 = /dev/fb0		First frame buffer
+		  1 = /dev/fb1		Second frame buffer
+		    ...
+		 31 = /dev/fb31		32nd frame buffer
+
+  29 block	Aztech/Orchid/Okano/Wearnes CD-ROM
+		  0 = /dev/aztcd	Aztech CD-ROM
+
+  30 char	iBCS-2 compatibility devices
+		  0 = /dev/socksys	Socket access
+		  1 = /dev/spx		SVR3 local X interface
+		 32 = /dev/inet/ip	Network access
+		 33 = /dev/inet/icmp
+		 34 = /dev/inet/ggp
+		 35 = /dev/inet/ipip
+		 36 = /dev/inet/tcp
+		 37 = /dev/inet/egp
+		 38 = /dev/inet/pup
+		 39 = /dev/inet/udp
+		 40 = /dev/inet/idp
+		 41 = /dev/inet/rawip
+
+		Additionally, iBCS-2 requires the following links:
+
+		/dev/ip -> /dev/inet/ip
+		/dev/icmp -> /dev/inet/icmp
+		/dev/ggp -> /dev/inet/ggp
+		/dev/ipip -> /dev/inet/ipip
+		/dev/tcp -> /dev/inet/tcp
+		/dev/egp -> /dev/inet/egp
+		/dev/pup -> /dev/inet/pup
+		/dev/udp -> /dev/inet/udp
+		/dev/idp -> /dev/inet/idp
+		/dev/rawip -> /dev/inet/rawip
+		/dev/inet/arp -> /dev/inet/udp
+		/dev/inet/rip -> /dev/inet/udp
+		/dev/nfsd -> /dev/socksys
+		/dev/X0R -> /dev/null (? apparently not required ?)
+
+  30 block	Philips LMS CM-205 CD-ROM
+		  0 = /dev/cm205cd	Philips LMS CM-205 CD-ROM
+
+		/dev/lmscd is an older name for this device.  This
+		driver does not work with the CM-205MS CD-ROM.
+
+  31 char	MPU-401 MIDI
+		  0 = /dev/mpu401data	MPU-401 data port
+		  1 = /dev/mpu401stat	MPU-401 status port
+
+  31 block	ROM/flash memory card
+		  0 = /dev/rom0		First ROM card (rw)
+		      ...
+		  7 = /dev/rom7		Eighth ROM card (rw)
+		  8 = /dev/rrom0	First ROM card (ro)
+		    ...
+		 15 = /dev/rrom7	Eighth ROM card (ro)
+		 16 = /dev/flash0	First flash memory card (rw)
+		    ...
+		 23 = /dev/flash7	Eighth flash memory card (rw)
+		 24 = /dev/rflash0	First flash memory card (ro)
+		    ...
+		 31 = /dev/rflash7	Eighth flash memory card (ro)
+
+		The read-write (rw) devices support back-caching
+		written data in RAM, as well as writing to flash RAM
+		devices.  The read-only devices (ro) support reading
+		only.
+
+  32 char	Specialix serial card
+		  0 = /dev/ttyX0	First Specialix port
+		  1 = /dev/ttyX1	Second Specialix port
+		    ...
+  32 block	Philips LMS CM-206 CD-ROM
+		  0 = /dev/cm206cd	Philips LMS CM-206 CD-ROM
+
+  33 char	Specialix serial card - alternate devices
+		  0 = /dev/cux0		Callout device for ttyX0
+		  1 = /dev/cux1		Callout device for ttyX1
+		    ...
+  33 block	Third IDE hard disk/CD-ROM interface
+		  0 = /dev/hde		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdf		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  34 char	Z8530 HDLC driver
+		  0 = /dev/scc0		First Z8530, first port
+		  1 = /dev/scc1		First Z8530, second port
+		  2 = /dev/scc2		Second Z8530, first port
+		  3 = /dev/scc3		Second Z8530, second port
+		    ...
+
+		In a previous version these devices were named
+		/dev/sc1 for /dev/scc0, /dev/sc2 for /dev/scc1, and so
+		on.
+
+  34 block	Fourth IDE hard disk/CD-ROM interface
+		  0 = /dev/hdg		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdh		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  35 char	tclmidi MIDI driver
+		  0 = /dev/midi0	First MIDI port, kernel timed
+		  1 = /dev/midi1	Second MIDI port, kernel timed
+		  2 = /dev/midi2	Third MIDI port, kernel timed
+		  3 = /dev/midi3	Fourth MIDI port, kernel timed
+		 64 = /dev/rmidi0	First MIDI port, untimed
+		 65 = /dev/rmidi1	Second MIDI port, untimed
+		 66 = /dev/rmidi2	Third MIDI port, untimed
+		 67 = /dev/rmidi3	Fourth MIDI port, untimed
+		128 = /dev/smpte0	First MIDI port, SMPTE timed
+		129 = /dev/smpte1	Second MIDI port, SMPTE timed
+		130 = /dev/smpte2	Third MIDI port, SMPTE timed
+		131 = /dev/smpte3	Fourth MIDI port, SMPTE timed
+
+  35 block	Slow memory ramdisk
+		  0 = /dev/slram	Slow memory ramdisk
+
+  36 char	Netlink support
+		  0 = /dev/route	Routing, device updates, kernel to user
+		  1 = /dev/skip		enSKIP security cache control
+		  3 = /dev/fwmonitor	Firewall packet copies
+		 16 = /dev/tap0		First Ethertap device
+		    ...
+		 31 = /dev/tap15	16th Ethertap device
+
+  36 block	OBSOLETE (was MCA ESDI hard disk)
+
+  37 char	IDE tape
+		  0 = /dev/ht0		First IDE tape
+		  1 = /dev/ht1		Second IDE tape
+		    ...
+		128 = /dev/nht0		First IDE tape, no rewind-on-close
+		129 = /dev/nht1		Second IDE tape, no rewind-on-close
+		    ...
+
+		Currently, only one IDE tape drive is supported.
+
+  37 block	Zorro II ramdisk
+		  0 = /dev/z2ram	Zorro II ramdisk
+
+  38 char	Myricom PCI Myrinet board
+		  0 = /dev/mlanai0	First Myrinet board
+		  1 = /dev/mlanai1	Second Myrinet board
+		    ...
+
+		This device is used for status query, board control
+		and "user level packet I/O."  This board is also
+		accessible as a standard networking "eth" device.
+
+  38 block	OBSOLETE (was Linux/AP+)
+
+  39 char	ML-16P experimental I/O board
+		  0 = /dev/ml16pa-a0	First card, first analog channel
+		  1 = /dev/ml16pa-a1	First card, second analog channel
+		    ...
+		 15 = /dev/ml16pa-a15	First card, 16th analog channel
+		 16 = /dev/ml16pa-d	First card, digital lines
+		 17 = /dev/ml16pa-c0	First card, first counter/timer
+		 18 = /dev/ml16pa-c1	First card, second counter/timer
+		 19 = /dev/ml16pa-c2	First card, third counter/timer
+		 32 = /dev/ml16pb-a0	Second card, first analog channel
+		 33 = /dev/ml16pb-a1	Second card, second analog channel
+		    ...
+		 47 = /dev/ml16pb-a15	Second card, 16th analog channel
+		 48 = /dev/ml16pb-d	Second card, digital lines
+		 49 = /dev/ml16pb-c0	Second card, first counter/timer
+		 50 = /dev/ml16pb-c1	Second card, second counter/timer
+		 51 = /dev/ml16pb-c2	Second card, third counter/timer
+		      ...
+  39 block
+
+  40 char
+
+  40 block
+
+  41 char	Yet Another Micro Monitor
+		  0 = /dev/yamm		Yet Another Micro Monitor
+
+  41 block
+
+  42 char	Demo/sample use
+
+  42 block	Demo/sample use
+
+		This number is intended for use in sample code, as
+		well as a general "example" device number.  It
+		should never be used for a device driver that is being
+		distributed; either obtain an official number or use
+		the local/experimental range.  The sudden addition or
+		removal of a driver with this number should not cause
+		ill effects to the system (bugs excepted.)
+
+		IN PARTICULAR, ANY DISTRIBUTION WHICH CONTAINS A
+		DEVICE DRIVER USING MAJOR NUMBER 42 IS NONCOMPLIANT.
+
+  43 char	isdn4linux virtual modem
+		  0 = /dev/ttyI0	First virtual modem
+		    ...
+		 63 = /dev/ttyI63	64th virtual modem
+
+  43 block	Network block devices
+		  0 = /dev/nb0		First network block device
+		  1 = /dev/nb1		Second network block device
+		    ...
+
+		Network Block Device is somehow similar to loopback
+		devices: If you read from it, it sends packet across
+		network asking server for data. If you write to it, it
+		sends packet telling server to write. It could be used
+		to mounting filesystems over the net, swapping over
+		the net, implementing block device in userland etc.
+
+  44 char	isdn4linux virtual modem - alternate devices
+		  0 = /dev/cui0		Callout device for ttyI0
+		    ...
+		 63 = /dev/cui63	Callout device for ttyI63
+
+  44 block	Flash Translation Layer (FTL) filesystems
+		  0 = /dev/ftla		FTL on first Memory Technology Device
+		 16 = /dev/ftlb		FTL on second Memory Technology Device
+		 32 = /dev/ftlc		FTL on third Memory Technology Device
+		    ...
+		240 = /dev/ftlp		FTL on 16th Memory Technology Device
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the partition
+		limit is 15 rather than 63 per disk (same as SCSI.)
+
+  45 char	isdn4linux ISDN BRI driver
+		  0 = /dev/isdn0	First virtual B channel raw data
+		    ...
+		 63 = /dev/isdn63	64th virtual B channel raw data
+		 64 = /dev/isdnctrl0	First channel control/debug
+		    ...
+		127 = /dev/isdnctrl63	64th channel control/debug
+
+		128 = /dev/ippp0	First SyncPPP device
+		    ...
+		191 = /dev/ippp63	64th SyncPPP device
+
+		255 = /dev/isdninfo	ISDN monitor interface
+
+  45 block	Parallel port IDE disk devices
+		  0 = /dev/pda		First parallel port IDE disk
+		 16 = /dev/pdb		Second parallel port IDE disk
+		 32 = /dev/pdc		Third parallel port IDE disk
+		 48 = /dev/pdd		Fourth parallel port IDE disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the partition
+		limit is 15 rather than 63 per disk.
+
+  46 char	Comtrol Rocketport serial card
+		  0 = /dev/ttyR0	First Rocketport port
+		  1 = /dev/ttyR1	Second Rocketport port
+		    ...
+  46 block	Parallel port ATAPI CD-ROM devices
+		  0 = /dev/pcd0		First parallel port ATAPI CD-ROM
+		  1 = /dev/pcd1		Second parallel port ATAPI CD-ROM
+		  2 = /dev/pcd2		Third parallel port ATAPI CD-ROM
+		  3 = /dev/pcd3		Fourth parallel port ATAPI CD-ROM
+
+  47 char	Comtrol Rocketport serial card - alternate devices
+		  0 = /dev/cur0		Callout device for ttyR0
+		  1 = /dev/cur1		Callout device for ttyR1
+		    ...
+  47 block	Parallel port ATAPI disk devices
+		  0 = /dev/pf0		First parallel port ATAPI disk
+		  1 = /dev/pf1		Second parallel port ATAPI disk
+		  2 = /dev/pf2		Third parallel port ATAPI disk
+		  3 = /dev/pf3		Fourth parallel port ATAPI disk
+
+		This driver is intended for floppy disks and similar
+		devices and hence does not support partitioning.
+
+  48 char	SDL RISCom serial card
+		  0 = /dev/ttyL0	First RISCom port
+		  1 = /dev/ttyL1	Second RISCom port
+		    ...
+  48 block	Mylex DAC960 PCI RAID controller; first controller
+		  0 = /dev/rd/c0d0	First disk, whole disk
+		  8 = /dev/rd/c0d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c0d31	32nd disk, whole disk
+
+		For partitions add:
+		  0 = /dev/rd/c?d?	Whole disk
+		  1 = /dev/rd/c?d?p1	First partition
+		    ...
+		  7 = /dev/rd/c?d?p7	Seventh partition
+
+  49 char	SDL RISCom serial card - alternate devices
+		  0 = /dev/cul0		Callout device for ttyL0
+		  1 = /dev/cul1		Callout device for ttyL1
+		    ...
+  49 block	Mylex DAC960 PCI RAID controller; second controller
+		  0 = /dev/rd/c1d0	First disk, whole disk
+		  8 = /dev/rd/c1d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c1d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+  50 char	Reserved for GLINT
+
+  50 block	Mylex DAC960 PCI RAID controller; third controller
+		  0 = /dev/rd/c2d0	First disk, whole disk
+		  8 = /dev/rd/c2d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c2d31	32nd disk, whole disk
+
+  51 char	Baycom radio modem OR Radio Tech BIM-XXX-RS232 radio modem
+		  0 = /dev/bc0		First Baycom radio modem
+		  1 = /dev/bc1		Second Baycom radio modem
+		    ...
+  51 block	Mylex DAC960 PCI RAID controller; fourth controller
+		  0 = /dev/rd/c3d0	First disk, whole disk
+		  8 = /dev/rd/c3d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c3d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+  52 char	Spellcaster DataComm/BRI ISDN card
+		  0 = /dev/dcbri0	First DataComm card
+		  1 = /dev/dcbri1	Second DataComm card
+		  2 = /dev/dcbri2	Third DataComm card
+		  3 = /dev/dcbri3	Fourth DataComm card
+
+  52 block	Mylex DAC960 PCI RAID controller; fifth controller
+		  0 = /dev/rd/c4d0	First disk, whole disk
+		  8 = /dev/rd/c4d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c4d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+  53 char	BDM interface for remote debugging MC683xx microcontrollers
+		  0 = /dev/pd_bdm0	PD BDM interface on lp0
+		  1 = /dev/pd_bdm1	PD BDM interface on lp1
+		  2 = /dev/pd_bdm2	PD BDM interface on lp2
+		  4 = /dev/icd_bdm0	ICD BDM interface on lp0
+		  5 = /dev/icd_bdm1	ICD BDM interface on lp1
+		  6 = /dev/icd_bdm2	ICD BDM interface on lp2
+
+		This device is used for the interfacing to the MC683xx
+		microcontrollers via Background Debug Mode by use of a
+		Parallel Port interface. PD is the Motorola Public
+		Domain Interface and ICD is the commercial interface
+		by P&E.
+
+  53 block	Mylex DAC960 PCI RAID controller; sixth controller
+		  0 = /dev/rd/c5d0	First disk, whole disk
+		  8 = /dev/rd/c5d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c5d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+  54 char	Electrocardiognosis Holter serial card
+		  0 = /dev/holter0	First Holter port
+		  1 = /dev/holter1	Second Holter port
+		  2 = /dev/holter2	Third Holter port
+
+		A custom serial card used by Electrocardiognosis SRL
+		<mseritan@ottonel.pub.ro> to transfer data from Holter
+		24-hour heart monitoring equipment.
+
+  54 block	Mylex DAC960 PCI RAID controller; seventh controller
+		  0 = /dev/rd/c6d0	First disk, whole disk
+		  8 = /dev/rd/c6d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c6d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+  55 char	DSP56001 digital signal processor
+		  0 = /dev/dsp56k	First DSP56001
+
+  55 block	Mylex DAC960 PCI RAID controller; eighth controller
+		  0 = /dev/rd/c7d0	First disk, whole disk
+		  8 = /dev/rd/c7d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c7d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+  56 char	Apple Desktop Bus
+		  0 = /dev/adb		ADB bus control
+
+		Additional devices will be added to this number, all
+		starting with /dev/adb.
+
+  56 block	Fifth IDE hard disk/CD-ROM interface
+		  0 = /dev/hdi		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdj		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  57 char	Hayes ESP serial card
+		  0 = /dev/ttyP0	First ESP port
+		  1 = /dev/ttyP1	Second ESP port
+		    ...
+
+  57 block	Sixth IDE hard disk/CD-ROM interface
+		  0 = /dev/hdk		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdl		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  58 char	Hayes ESP serial card - alternate devices
+		  0 = /dev/cup0		Callout device for ttyP0
+		  1 = /dev/cup1		Callout device for ttyP1
+		    ...
+
+  58 block	Reserved for logical volume manager
+
+  59 char	sf firewall package
+		  0 = /dev/firewall	Communication with sf kernel module
+
+  59 block	Generic PDA filesystem device
+		  0 = /dev/pda0		First PDA device
+		  1 = /dev/pda1		Second PDA device
+		    ...
+
+		The pda devices are used to mount filesystems on
+		remote pda's (basically slow handheld machines with
+		proprietary OS's and limited memory and storage
+		running small fs translation drivers) through serial /
+		IRDA / parallel links.
+
+		NAMING CONFLICT -- PROPOSED REVISED NAME /dev/rpda0 etc
+
+  60-63 char	LOCAL/EXPERIMENTAL USE
+
+  60-63 block	LOCAL/EXPERIMENTAL USE
+		Allocated for local/experimental use.  For devices not
+		assigned official numbers, these ranges should be
+		used in order to avoid conflicting with future assignments.
+
+  64 char	ENskip kernel encryption package
+		  0 = /dev/enskip	Communication with ENskip kernel module
+
+  64 block	Scramdisk/DriveCrypt encrypted devices
+		  0 = /dev/scramdisk/master    Master node for ioctls
+		  1 = /dev/scramdisk/1         First encrypted device
+		  2 = /dev/scramdisk/2         Second encrypted device
+		  ...
+		255 = /dev/scramdisk/255       255th encrypted device
+
+		The filename of the encrypted container and the passwords
+		are sent via ioctls (using the sdmount tool) to the master
+		node which then activates them via one of the
+		/dev/scramdisk/x nodes for loop mounting (all handled
+		through the sdmount tool).
+
+		Requested by: andy@scramdisklinux.org
+
+  65 char	Sundance "plink" Transputer boards (obsolete, unused)
+		  0 = /dev/plink0	First plink device
+		  1 = /dev/plink1	Second plink device
+		  2 = /dev/plink2	Third plink device
+		  3 = /dev/plink3	Fourth plink device
+		 64 = /dev/rplink0	First plink device, raw
+		 65 = /dev/rplink1	Second plink device, raw
+		 66 = /dev/rplink2	Third plink device, raw
+		 67 = /dev/rplink3	Fourth plink device, raw
+		128 = /dev/plink0d	First plink device, debug
+		129 = /dev/plink1d	Second plink device, debug
+		130 = /dev/plink2d	Third plink device, debug
+		131 = /dev/plink3d	Fourth plink device, debug
+		192 = /dev/rplink0d	First plink device, raw, debug
+		193 = /dev/rplink1d	Second plink device, raw, debug
+		194 = /dev/rplink2d	Third plink device, raw, debug
+		195 = /dev/rplink3d	Fourth plink device, raw, debug
+
+		This is a commercial driver; contact James Howes
+		<jth@prosig.demon.co.uk> for information.
+
+  65 block	SCSI disk devices (16-31)
+		  0 = /dev/sdq		17th SCSI disk whole disk
+		 16 = /dev/sdr		18th SCSI disk whole disk
+		 32 = /dev/sds		19th SCSI disk whole disk
+		    ...
+		240 = /dev/sdaf		32nd SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  66 char	YARC PowerPC PCI coprocessor card
+		  0 = /dev/yppcpci0	First YARC card
+		  1 = /dev/yppcpci1	Second YARC card
+		    ...
+
+  66 block	SCSI disk devices (32-47)
+		  0 = /dev/sdag		33th SCSI disk whole disk
+		 16 = /dev/sdah		34th SCSI disk whole disk
+		 32 = /dev/sdai		35th SCSI disk whole disk
+		    ...
+		240 = /dev/sdav		48nd SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  67 char	Coda network file system
+		  0 = /dev/cfs0		Coda cache manager
+
+		See http://www.coda.cs.cmu.edu for information about Coda.
+
+  67 block	SCSI disk devices (48-63)
+		  0 = /dev/sdaw		49th SCSI disk whole disk
+		 16 = /dev/sdax		50th SCSI disk whole disk
+		 32 = /dev/sday		51st SCSI disk whole disk
+		    ...
+		240 = /dev/sdbl		64th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  68 char	CAPI 2.0 interface
+		  0 = /dev/capi20	Control device
+		  1 = /dev/capi20.00	First CAPI 2.0 application
+		  2 = /dev/capi20.01	Second CAPI 2.0 application
+		    ...
+		 20 = /dev/capi20.19	19th CAPI 2.0 application
+
+		ISDN CAPI 2.0 driver for use with CAPI 2.0
+		applications; currently supports the AVM B1 card.
+
+  68 block	SCSI disk devices (64-79)
+		  0 = /dev/sdbm		65th SCSI disk whole disk
+		 16 = /dev/sdbn		66th SCSI disk whole disk
+		 32 = /dev/sdbo		67th SCSI disk whole disk
+		    ...
+		240 = /dev/sdcb		80th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  69 char	MA16 numeric accelerator card
+		  0 = /dev/ma16		Board memory access
+
+  69 block	SCSI disk devices (80-95)
+		  0 = /dev/sdcc		81st SCSI disk whole disk
+		 16 = /dev/sdcd		82nd SCSI disk whole disk
+		 32 = /dev/sdce		83th SCSI disk whole disk
+		    ...
+		240 = /dev/sdcr		96th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  70 char	SpellCaster Protocol Services Interface
+		  0 = /dev/apscfg	Configuration interface
+		  1 = /dev/apsauth	Authentication interface
+		  2 = /dev/apslog	Logging interface
+		  3 = /dev/apsdbg	Debugging interface
+		 64 = /dev/apsisdn	ISDN command interface
+		 65 = /dev/apsasync	Async command interface
+		128 = /dev/apsmon	Monitor interface
+
+  70 block	SCSI disk devices (96-111)
+		  0 = /dev/sdcs		97th SCSI disk whole disk
+		 16 = /dev/sdct		98th SCSI disk whole disk
+		 32 = /dev/sdcu		99th SCSI disk whole disk
+		    ...
+		240 = /dev/sddh		112nd SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  71 char	Computone IntelliPort II serial card
+		  0 = /dev/ttyF0	IntelliPort II board 0, port 0
+		  1 = /dev/ttyF1	IntelliPort II board 0, port 1
+		    ...
+		 63 = /dev/ttyF63	IntelliPort II board 0, port 63
+		 64 = /dev/ttyF64	IntelliPort II board 1, port 0
+		 65 = /dev/ttyF65	IntelliPort II board 1, port 1
+		    ...
+		127 = /dev/ttyF127	IntelliPort II board 1, port 63
+		128 = /dev/ttyF128	IntelliPort II board 2, port 0
+		129 = /dev/ttyF129	IntelliPort II board 2, port 1
+		    ...
+		191 = /dev/ttyF191	IntelliPort II board 2, port 63
+		192 = /dev/ttyF192	IntelliPort II board 3, port 0
+		193 = /dev/ttyF193	IntelliPort II board 3, port 1
+		    ...
+		255 = /dev/ttyF255	IntelliPort II board 3, port 63
+
+  71 block	SCSI disk devices (112-127)
+		  0 = /dev/sddi		113th SCSI disk whole disk
+		 16 = /dev/sddj		114th SCSI disk whole disk
+		 32 = /dev/sddk		115th SCSI disk whole disk
+		    ...
+		240 = /dev/sddx		128th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  72 char	Computone IntelliPort II serial card - alternate devices
+		  0 = /dev/cuf0		Callout device for ttyF0
+		  1 = /dev/cuf1		Callout device for ttyF1
+		    ...
+		 63 = /dev/cuf63	Callout device for ttyF63
+		 64 = /dev/cuf64	Callout device for ttyF64
+		 65 = /dev/cuf65	Callout device for ttyF65
+		    ...
+		127 = /dev/cuf127	Callout device for ttyF127
+		128 = /dev/cuf128	Callout device for ttyF128
+		129 = /dev/cuf129	Callout device for ttyF129
+		    ...
+		191 = /dev/cuf191	Callout device for ttyF191
+		192 = /dev/cuf192	Callout device for ttyF192
+		193 = /dev/cuf193	Callout device for ttyF193
+		    ...
+		255 = /dev/cuf255	Callout device for ttyF255
+
+  72 block	Compaq Intelligent Drive Array, first controller
+		  0 = /dev/ida/c0d0	First logical drive whole disk
+		 16 = /dev/ida/c0d1	Second logical drive whole disk
+		    ...
+		240 = /dev/ida/c0d15	16th logical drive whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+  73 char	Computone IntelliPort II serial card - control devices
+		  0 = /dev/ip2ipl0	Loadware device for board 0
+		  1 = /dev/ip2stat0	Status device for board 0
+		  4 = /dev/ip2ipl1	Loadware device for board 1
+		  5 = /dev/ip2stat1	Status device for board 1
+		  8 = /dev/ip2ipl2	Loadware device for board 2
+		  9 = /dev/ip2stat2	Status device for board 2
+		 12 = /dev/ip2ipl3	Loadware device for board 3
+		 13 = /dev/ip2stat3	Status device for board 3
+
+  73 block	Compaq Intelligent Drive Array, second controller
+		  0 = /dev/ida/c1d0	First logical drive whole disk
+		 16 = /dev/ida/c1d1	Second logical drive whole disk
+		    ...
+		240 = /dev/ida/c1d15	16th logical drive whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+  74 char	SCI bridge
+		  0 = /dev/SCI/0	SCI device 0
+		  1 = /dev/SCI/1	SCI device 1
+		    ...
+
+		Currently for Dolphin Interconnect Solutions' PCI-SCI
+		bridge.
+
+  74 block	Compaq Intelligent Drive Array, third controller
+		  0 = /dev/ida/c2d0	First logical drive whole disk
+		 16 = /dev/ida/c2d1	Second logical drive whole disk
+		    ...
+		240 = /dev/ida/c2d15	16th logical drive whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+  75 char	Specialix IO8+ serial card
+		  0 = /dev/ttyW0	First IO8+ port, first card
+		  1 = /dev/ttyW1	Second IO8+ port, first card
+		    ...
+		  8 = /dev/ttyW8	First IO8+ port, second card
+		    ...
+
+  75 block	Compaq Intelligent Drive Array, fourth controller
+		  0 = /dev/ida/c3d0	First logical drive whole disk
+		 16 = /dev/ida/c3d1	Second logical drive whole disk
+		    ...
+		240 = /dev/ida/c3d15	16th logical drive whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+  76 char	Specialix IO8+ serial card - alternate devices
+		  0 = /dev/cuw0		Callout device for ttyW0
+		  1 = /dev/cuw1		Callout device for ttyW1
+		    ...
+		  8 = /dev/cuw8		Callout device for ttyW8
+		    ...
+
+  76 block	Compaq Intelligent Drive Array, fifth controller
+		  0 = /dev/ida/c4d0	First logical drive whole disk
+		 16 = /dev/ida/c4d1	Second logical drive whole disk
+		    ...
+		240 = /dev/ida/c4d15	16th logical drive whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+
+  77 char	ComScire Quantum Noise Generator
+		  0 = /dev/qng		ComScire Quantum Noise Generator
+
+  77 block	Compaq Intelligent Drive Array, sixth controller
+		  0 = /dev/ida/c5d0	First logical drive whole disk
+		 16 = /dev/ida/c5d1	Second logical drive whole disk
+		    ...
+		240 = /dev/ida/c5d15	16th logical drive whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+  78 char	PAM Software's multimodem boards
+		  0 = /dev/ttyM0	First PAM modem
+		  1 = /dev/ttyM1	Second PAM modem
+		    ...
+
+  78 block	Compaq Intelligent Drive Array, seventh controller
+		  0 = /dev/ida/c6d0	First logical drive whole disk
+		 16 = /dev/ida/c6d1	Second logical drive whole disk
+		    ...
+		240 = /dev/ida/c6d15	16th logical drive whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+  79 char	PAM Software's multimodem boards - alternate devices
+		  0 = /dev/cum0		Callout device for ttyM0
+		  1 = /dev/cum1		Callout device for ttyM1
+		    ...
+
+  79 block	Compaq Intelligent Drive Array, eighth controller
+		  0 = /dev/ida/c7d0	First logical drive whole disk
+		 16 = /dev/ida/c7d1	Second logical drive whole disk
+		    ...
+		240 = /dev/ida/c715	16th logical drive whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+  80 char	Photometrics AT200 CCD camera
+		  0 = /dev/at200	Photometrics AT200 CCD camera
+
+  80 block	I2O hard disk
+		  0 = /dev/i2o/hda	First I2O hard disk, whole disk
+		 16 = /dev/i2o/hdb	Second I2O hard disk, whole disk
+		    ...
+		240 = /dev/i2o/hdp	16th I2O hard disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  81 char	video4linux
+		  0 = /dev/video0	Video capture/overlay device
+		    ...
+		 63 = /dev/video63	Video capture/overlay device
+		 64 = /dev/radio0	Radio device
+		    ...
+		127 = /dev/radio63	Radio device
+		128 = /dev/swradio0	Software Defined Radio device
+		    ...
+		191 = /dev/swradio63	Software Defined Radio device
+		224 = /dev/vbi0		Vertical blank interrupt
+		    ...
+		255 = /dev/vbi31	Vertical blank interrupt
+
+		Minor numbers are allocated dynamically unless
+		CONFIG_VIDEO_FIXED_MINOR_RANGES (default n)
+		configuration option is set.
+
+  81 block	I2O hard disk
+		  0 = /dev/i2o/hdq	17th I2O hard disk, whole disk
+		 16 = /dev/i2o/hdr	18th I2O hard disk, whole disk
+		    ...
+		240 = /dev/i2o/hdaf	32nd I2O hard disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  82 char	WiNRADiO communications receiver card
+		  0 = /dev/winradio0	First WiNRADiO card
+		  1 = /dev/winradio1	Second WiNRADiO card
+		    ...
+
+		The driver and documentation may be obtained from
+		http://www.winradio.com/
+
+  82 block	I2O hard disk
+		  0 = /dev/i2o/hdag	33rd I2O hard disk, whole disk
+		 16 = /dev/i2o/hdah	34th I2O hard disk, whole disk
+		    ...
+		240 = /dev/i2o/hdav	48th I2O hard disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  83 char	Matrox mga_vid video driver
+		 0 = /dev/mga_vid0	1st video card
+		 1 = /dev/mga_vid1	2nd video card
+		 2 = /dev/mga_vid2	3rd video card
+		  ...
+		15 = /dev/mga_vid15	16th video card
+
+  83 block	I2O hard disk
+		  0 = /dev/i2o/hdaw	49th I2O hard disk, whole disk
+		 16 = /dev/i2o/hdax	50th I2O hard disk, whole disk
+		    ...
+		240 = /dev/i2o/hdbl	64th I2O hard disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  84 char	Ikon 1011[57] Versatec Greensheet Interface
+		  0 = /dev/ihcp0	First Greensheet port
+		  1 = /dev/ihcp1	Second Greensheet port
+
+  84 block	I2O hard disk
+		  0 = /dev/i2o/hdbm	65th I2O hard disk, whole disk
+		 16 = /dev/i2o/hdbn	66th I2O hard disk, whole disk
+		    ...
+		240 = /dev/i2o/hdcb	80th I2O hard disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  85 char	Linux/SGI shared memory input queue
+		  0 = /dev/shmiq	Master shared input queue
+		  1 = /dev/qcntl0	First device pushed
+		  2 = /dev/qcntl1	Second device pushed
+		    ...
+
+  85 block	I2O hard disk
+		  0 = /dev/i2o/hdcc	81st I2O hard disk, whole disk
+		 16 = /dev/i2o/hdcd	82nd I2O hard disk, whole disk
+		    ...
+		240 = /dev/i2o/hdcr	96th I2O hard disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  86 char	SCSI media changer
+		  0 = /dev/sch0		First SCSI media changer
+		  1 = /dev/sch1		Second SCSI media changer
+		    ...
+
+  86 block	I2O hard disk
+		  0 = /dev/i2o/hdcs	97th I2O hard disk, whole disk
+		 16 = /dev/i2o/hdct	98th I2O hard disk, whole disk
+		    ...
+		240 = /dev/i2o/hddh	112th I2O hard disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  87 char	Sony Control-A1 stereo control bus
+		  0 = /dev/controla0	First device on chain
+		  1 = /dev/controla1	Second device on chain
+		    ...
+
+  87 block	I2O hard disk
+		  0 = /dev/i2o/hddi	113rd I2O hard disk, whole disk
+		 16 = /dev/i2o/hddj	114th I2O hard disk, whole disk
+		    ...
+		240 = /dev/i2o/hddx	128th I2O hard disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  88 char	COMX synchronous serial card
+		  0 = /dev/comx0	COMX channel 0
+		  1 = /dev/comx1	COMX channel 1
+		    ...
+
+  88 block	Seventh IDE hard disk/CD-ROM interface
+		  0 = /dev/hdm		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdn		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  89 char	I2C bus interface
+		  0 = /dev/i2c-0	First I2C adapter
+		  1 = /dev/i2c-1	Second I2C adapter
+		    ...
+
+  89 block	Eighth IDE hard disk/CD-ROM interface
+		  0 = /dev/hdo		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdp		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  90 char	Memory Technology Device (RAM, ROM, Flash)
+		  0 = /dev/mtd0		First MTD (rw)
+		  1 = /dev/mtdr0	First MTD (ro)
+		    ...
+		 30 = /dev/mtd15	16th MTD (rw)
+		 31 = /dev/mtdr15	16th MTD (ro)
+
+  90 block	Ninth IDE hard disk/CD-ROM interface
+		  0 = /dev/hdq		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdr		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  91 char	CAN-Bus devices
+		  0 = /dev/can0		First CAN-Bus controller
+		  1 = /dev/can1		Second CAN-Bus controller
+		    ...
+
+  91 block	Tenth IDE hard disk/CD-ROM interface
+		  0 = /dev/hds		Master: whole disk (or CD-ROM)
+		 64 = /dev/hdt		Slave: whole disk (or CD-ROM)
+
+		Partitions are handled the same way as for the first
+		interface (see major number 3).
+
+  92 char	Reserved for ith Kommunikationstechnik MIC ISDN card
+
+  92 block	PPDD encrypted disk driver
+		  0 = /dev/ppdd0	First encrypted disk
+		  1 = /dev/ppdd1	Second encrypted disk
+		    ...
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+  93 char
+
+  93 block	NAND Flash Translation Layer filesystem
+		  0 = /dev/nftla	First NFTL layer
+		 16 = /dev/nftlb	Second NFTL layer
+		    ...
+		240 = /dev/nftlp	16th NTFL layer
+
+  94 char
+
+  94 block	IBM S/390 DASD block storage
+		  0 = /dev/dasda First DASD device, major
+		  1 = /dev/dasda1 First DASD device, block 1
+		  2 = /dev/dasda2 First DASD device, block 2
+		  3 = /dev/dasda3 First DASD device, block 3
+		  4 = /dev/dasdb Second DASD device, major
+		  5 = /dev/dasdb1 Second DASD device, block 1
+		  6 = /dev/dasdb2 Second DASD device, block 2
+		  7 = /dev/dasdb3 Second DASD device, block 3
+		    ...
+
+  95 char	IP filter
+		  0 = /dev/ipl		Filter control device/log file
+		  1 = /dev/ipnat	NAT control device/log file
+		  2 = /dev/ipstate	State information log file
+		  3 = /dev/ipauth	Authentication control device/log file
+		    ...
+
+  96 char	Parallel port ATAPI tape devices
+		  0 = /dev/pt0		First parallel port ATAPI tape
+		  1 = /dev/pt1		Second parallel port ATAPI tape
+		    ...
+		128 = /dev/npt0		First p.p. ATAPI tape, no rewind
+		129 = /dev/npt1		Second p.p. ATAPI tape, no rewind
+		    ...
+
+  96 block	Inverse NAND Flash Translation Layer
+		  0 = /dev/inftla First INFTL layer
+		 16 = /dev/inftlb Second INFTL layer
+		    ...
+		240 = /dev/inftlp	16th INTFL layer
+
+  97 char	Parallel port generic ATAPI interface
+		  0 = /dev/pg0		First parallel port ATAPI device
+		  1 = /dev/pg1		Second parallel port ATAPI device
+		  2 = /dev/pg2		Third parallel port ATAPI device
+		  3 = /dev/pg3		Fourth parallel port ATAPI device
+
+		These devices support the same API as the generic SCSI
+		devices.
+
+  98 char	Control and Measurement Device (comedi)
+		  0 = /dev/comedi0	First comedi device
+		  1 = /dev/comedi1	Second comedi device
+		    ...
+
+		See http://stm.lbl.gov/comedi.
+
+  98 block	User-mode virtual block device
+		  0 = /dev/ubda		First user-mode block device
+		 16 = /dev/udbb		Second user-mode block device
+		    ...
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+		This device is used by the user-mode virtual kernel port.
+
+  99 char	Raw parallel ports
+		  0 = /dev/parport0	First parallel port
+		  1 = /dev/parport1	Second parallel port
+		    ...
+
+  99 block	JavaStation flash disk
+		  0 = /dev/jsfd		JavaStation flash disk
+
+ 100 char	Telephony for Linux
+		  0 = /dev/phone0	First telephony device
+		  1 = /dev/phone1	Second telephony device
+		    ...
+
+ 101 char	Motorola DSP 56xxx board
+		  0 = /dev/mdspstat	Status information
+		  1 = /dev/mdsp1	First DSP board I/O controls
+		    ...
+		 16 = /dev/mdsp16	16th DSP board I/O controls
+
+ 101 block	AMI HyperDisk RAID controller
+		  0 = /dev/amiraid/ar0	First array whole disk
+		 16 = /dev/amiraid/ar1	Second array whole disk
+		    ...
+		240 = /dev/amiraid/ar15	16th array whole disk
+
+		For each device, partitions are added as:
+		  0 = /dev/amiraid/ar?	  Whole disk
+		  1 = /dev/amiraid/ar?p1  First partition
+		  2 = /dev/amiraid/ar?p2  Second partition
+		    ...
+		 15 = /dev/amiraid/ar?p15 15th partition
+
+ 102 char
+
+ 102 block	Compressed block device
+		  0 = /dev/cbd/a	First compressed block device, whole device
+		 16 = /dev/cbd/b	Second compressed block device, whole device
+		    ...
+		240 = /dev/cbd/p	16th compressed block device, whole device
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 103 char	Arla network file system
+		  0 = /dev/nnpfs0	First NNPFS device
+		  1 = /dev/nnpfs1	Second NNPFS device
+
+		Arla is a free clone of the Andrew File System, AFS.
+		The NNPFS device gives user mode filesystem
+		implementations a kernel presence for caching and easy
+		mounting.  For more information about the project,
+		write to <arla-drinkers@stacken.kth.se> or see
+		http://www.stacken.kth.se/project/arla/
+
+ 103 block	Audit device
+		  0 = /dev/audit	Audit device
+
+ 104 char	Flash BIOS support
+
+ 104 block	Compaq Next Generation Drive Array, first controller
+		  0 = /dev/cciss/c0d0	First logical drive, whole disk
+		 16 = /dev/cciss/c0d1	Second logical drive, whole disk
+		    ...
+		240 = /dev/cciss/c0d15	16th logical drive, whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+ 105 char	Comtrol VS-1000 serial controller
+		  0 = /dev/ttyV0	First VS-1000 port
+		  1 = /dev/ttyV1	Second VS-1000 port
+		    ...
+
+ 105 block	Compaq Next Generation Drive Array, second controller
+		  0 = /dev/cciss/c1d0	First logical drive, whole disk
+		 16 = /dev/cciss/c1d1	Second logical drive, whole disk
+		    ...
+		240 = /dev/cciss/c1d15	16th logical drive, whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+ 106 char	Comtrol VS-1000 serial controller - alternate devices
+		  0 = /dev/cuv0		First VS-1000 port
+		  1 = /dev/cuv1		Second VS-1000 port
+		    ...
+
+ 106 block	Compaq Next Generation Drive Array, third controller
+		  0 = /dev/cciss/c2d0	First logical drive, whole disk
+		 16 = /dev/cciss/c2d1	Second logical drive, whole disk
+		    ...
+		240 = /dev/cciss/c2d15	16th logical drive, whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+ 107 char	3Dfx Voodoo Graphics device
+		  0 = /dev/3dfx		Primary 3Dfx graphics device
+
+ 107 block	Compaq Next Generation Drive Array, fourth controller
+		  0 = /dev/cciss/c3d0	First logical drive, whole disk
+		 16 = /dev/cciss/c3d1	Second logical drive, whole disk
+		    ...
+		240 = /dev/cciss/c3d15	16th logical drive, whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+ 108 char	Device independent PPP interface
+		  0 = /dev/ppp		Device independent PPP interface
+
+ 108 block	Compaq Next Generation Drive Array, fifth controller
+		  0 = /dev/cciss/c4d0	First logical drive, whole disk
+		 16 = /dev/cciss/c4d1	Second logical drive, whole disk
+		    ...
+		240 = /dev/cciss/c4d15	16th logical drive, whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+ 109 char	Reserved for logical volume manager
+
+ 109 block	Compaq Next Generation Drive Array, sixth controller
+		  0 = /dev/cciss/c5d0	First logical drive, whole disk
+		 16 = /dev/cciss/c5d1	Second logical drive, whole disk
+		    ...
+		240 = /dev/cciss/c5d15	16th logical drive, whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+ 110 char	miroMEDIA Surround board
+		  0 = /dev/srnd0	First miroMEDIA Surround board
+		  1 = /dev/srnd1	Second miroMEDIA Surround board
+		    ...
+
+ 110 block	Compaq Next Generation Drive Array, seventh controller
+		  0 = /dev/cciss/c6d0	First logical drive, whole disk
+		 16 = /dev/cciss/c6d1	Second logical drive, whole disk
+		    ...
+		240 = /dev/cciss/c6d15	16th logical drive, whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+ 111 char
+
+ 111 block	Compaq Next Generation Drive Array, eighth controller
+		  0 = /dev/cciss/c7d0	First logical drive, whole disk
+		 16 = /dev/cciss/c7d1	Second logical drive, whole disk
+		    ...
+		240 = /dev/cciss/c7d15	16th logical drive, whole disk
+
+		Partitions are handled the same way as for Mylex
+		DAC960 (see major number 48) except that the limit on
+		partitions is 15.
+
+ 112 char	ISI serial card
+		  0 = /dev/ttyM0	First ISI port
+		  1 = /dev/ttyM1	Second ISI port
+		    ...
+
+		There is currently a device-naming conflict between
+		these and PAM multimodems (major 78).
+
+ 112 block	IBM iSeries virtual disk
+		  0 = /dev/iseries/vda	First virtual disk, whole disk
+		  8 = /dev/iseries/vdb	Second virtual disk, whole disk
+		    ...
+		200 = /dev/iseries/vdz	26th virtual disk, whole disk
+		208 = /dev/iseries/vdaa	27th virtual disk, whole disk
+		    ...
+		248 = /dev/iseries/vdaf	32nd virtual disk, whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 7.
+
+ 113 char	ISI serial card - alternate devices
+		  0 = /dev/cum0		Callout device for ttyM0
+		  1 = /dev/cum1		Callout device for ttyM1
+		    ...
+
+ 113 block	IBM iSeries virtual CD-ROM
+		  0 = /dev/iseries/vcda	First virtual CD-ROM
+		  1 = /dev/iseries/vcdb	Second virtual CD-ROM
+		    ...
+
+ 114 char	Picture Elements ISE board
+		  0 = /dev/ise0		First ISE board
+		  1 = /dev/ise1		Second ISE board
+		    ...
+		128 = /dev/isex0	Control node for first ISE board
+		129 = /dev/isex1	Control node for second ISE board
+		    ...
+
+		The ISE board is an embedded computer, optimized for
+		image processing. The /dev/iseN nodes are the general
+		I/O access to the board, the /dev/isex0 nodes command
+		nodes used to control the board.
+
+ 114 block       IDE BIOS powered software RAID interfaces such as the
+		Promise Fastrak
+
+		   0 = /dev/ataraid/d0
+		   1 = /dev/ataraid/d0p1
+		   2 = /dev/ataraid/d0p2
+		  ...
+		  16 = /dev/ataraid/d1
+		  17 = /dev/ataraid/d1p1
+		  18 = /dev/ataraid/d1p2
+		  ...
+		 255 = /dev/ataraid/d15p15
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 115 char	TI link cable devices (115 was formerly the console driver speaker)
+		  0 = /dev/tipar0    Parallel cable on first parallel port
+		  ...
+		  7 = /dev/tipar7    Parallel cable on seventh parallel port
+
+		  8 = /dev/tiser0    Serial cable on first serial port
+		  ...
+		 15 = /dev/tiser7    Serial cable on seventh serial port
+
+		 16 = /dev/tiusb0    First USB cable
+		  ...
+		 47 = /dev/tiusb31   32nd USB cable
+
+ 115 block       NetWare (NWFS) Devices (0-255)
+
+		The NWFS (NetWare) devices are used to present a
+		collection of NetWare Mirror Groups or NetWare
+		Partitions as a logical storage segment for
+		use in mounting NetWare volumes.  A maximum of
+		 256 NetWare volumes can be supported in a single
+		machine.
+
+		http://cgfa.telepac.pt/ftp2/kernel.org/linux/kernel/people/jmerkey/nwfs/
+
+		 0 = /dev/nwfs/v0    First NetWare (NWFS) Logical Volume
+		 1 = /dev/nwfs/v1    Second NetWare (NWFS) Logical Volume
+		 2 = /dev/nwfs/v2    Third NetWare (NWFS) Logical Volume
+		      ...
+		 255 = /dev/nwfs/v255    Last NetWare (NWFS) Logical Volume
+
+ 116 char	Advanced Linux Sound Driver (ALSA)
+
+ 116 block       MicroMemory battery backed RAM adapter (NVRAM)
+		Supports 16 boards, 15 partitions each.
+		Requested by neilb at cse.unsw.edu.au.
+
+		 0 = /dev/umem/d0      Whole of first board
+		 1 = /dev/umem/d0p1    First partition of first board
+		 2 = /dev/umem/d0p2    Second partition of first board
+		15 = /dev/umem/d0p15   15th partition of first board
+
+		16 = /dev/umem/d1      Whole of second board
+		17 = /dev/umem/d1p1    First partition of second board
+		    ...
+		255= /dev/umem/d15p15  15th partition of 16th board.
+
+ 117 char	COSA/SRP synchronous serial card
+		  0 = /dev/cosa0c0	1st board, 1st channel
+		  1 = /dev/cosa0c1	1st board, 2nd channel
+		    ...
+		 16 = /dev/cosa1c0	2nd board, 1st channel
+		 17 = /dev/cosa1c1	2nd board, 2nd channel
+		    ...
+
+ 117 block       Enterprise Volume Management System (EVMS)
+
+		The EVMS driver uses a layered, plug-in model to provide
+		unparalleled flexibility and extensibility in managing
+		storage.  This allows for easy expansion or customization
+		of various levels of volume management.  Requested by
+		Mark Peloquin (peloquin at us.ibm.com).
+
+		Note: EVMS populates and manages all the devnodes in
+		/dev/evms.
+
+		http://sf.net/projects/evms
+
+		   0 = /dev/evms/block_device   EVMS block device
+		   1 = /dev/evms/legacyname1    First EVMS legacy device
+		   2 = /dev/evms/legacyname2    Second EVMS legacy device
+		    ...
+		    Both ranges can grow (down or up) until they meet.
+		    ...
+		 254 = /dev/evms/EVMSname2      Second EVMS native device
+		 255 = /dev/evms/EVMSname1      First EVMS native device
+
+		Note: legacyname(s) are derived from the normal legacy
+		device names.  For example, /dev/hda5 would become
+		/dev/evms/hda5.
+
+ 118 char	IBM Cryptographic Accelerator
+		  0 = /dev/ica	Virtual interface to all IBM Crypto Accelerators
+		  1 = /dev/ica0	IBMCA Device 0
+		  2 = /dev/ica1	IBMCA Device 1
+		    ...
+
+ 119 char	VMware virtual network control
+		  0 = /dev/vnet0	1st virtual network
+		  1 = /dev/vnet1	2nd virtual network
+		    ...
+
+ 120-127 char	LOCAL/EXPERIMENTAL USE
+
+ 120-127 block	LOCAL/EXPERIMENTAL USE
+		Allocated for local/experimental use.  For devices not
+		assigned official numbers, these ranges should be
+		used in order to avoid conflicting with future assignments.
+
+ 128-135 char	Unix98 PTY masters
+
+		These devices should not have corresponding device
+		nodes; instead they should be accessed through the
+		/dev/ptmx cloning interface.
+
+ 128 block       SCSI disk devices (128-143)
+		   0 = /dev/sddy         129th SCSI disk whole disk
+		  16 = /dev/sddz         130th SCSI disk whole disk
+		  32 = /dev/sdea         131th SCSI disk whole disk
+		    ...
+		 240 = /dev/sden         144th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 129 block       SCSI disk devices (144-159)
+		   0 = /dev/sdeo         145th SCSI disk whole disk
+		  16 = /dev/sdep         146th SCSI disk whole disk
+		  32 = /dev/sdeq         147th SCSI disk whole disk
+		    ...
+		 240 = /dev/sdfd         160th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 130 char 	(Misc devices)
+
+ 130 block       SCSI disk devices (160-175)
+		   0 = /dev/sdfe         161st SCSI disk whole disk
+		  16 = /dev/sdff         162nd SCSI disk whole disk
+		  32 = /dev/sdfg         163rd SCSI disk whole disk
+		    ...
+		 240 = /dev/sdft         176th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 131 block       SCSI disk devices (176-191)
+		   0 = /dev/sdfu         177th SCSI disk whole disk
+		  16 = /dev/sdfv         178th SCSI disk whole disk
+		  32 = /dev/sdfw         179th SCSI disk whole disk
+		    ...
+		 240 = /dev/sdgj         192nd SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 132 block       SCSI disk devices (192-207)
+		   0 = /dev/sdgk         193rd SCSI disk whole disk
+		  16 = /dev/sdgl         194th SCSI disk whole disk
+		  32 = /dev/sdgm         195th SCSI disk whole disk
+		    ...
+		 240 = /dev/sdgz         208th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 133 block       SCSI disk devices (208-223)
+		   0 = /dev/sdha         209th SCSI disk whole disk
+		  16 = /dev/sdhb         210th SCSI disk whole disk
+		  32 = /dev/sdhc         211th SCSI disk whole disk
+		    ...
+		 240 = /dev/sdhp         224th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 134 block       SCSI disk devices (224-239)
+		   0 = /dev/sdhq         225th SCSI disk whole disk
+		  16 = /dev/sdhr         226th SCSI disk whole disk
+		  32 = /dev/sdhs         227th SCSI disk whole disk
+		    ...
+		 240 = /dev/sdif         240th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 135 block       SCSI disk devices (240-255)
+		   0 = /dev/sdig         241st SCSI disk whole disk
+		  16 = /dev/sdih         242nd SCSI disk whole disk
+		  32 = /dev/sdih         243rd SCSI disk whole disk
+		    ...
+		 240 = /dev/sdiv         256th SCSI disk whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 136-143 char	Unix98 PTY slaves
+		  0 = /dev/pts/0	First Unix98 pseudo-TTY
+		  1 = /dev/pts/1	Second Unix98 pseudo-TTY
+		    ...
+
+		These device nodes are automatically generated with
+		the proper permissions and modes by mounting the
+		devpts filesystem onto /dev/pts with the appropriate
+		mount options (distribution dependent, however, on
+		*most* distributions the appropriate options are
+		"mode=0620,gid=<gid of the "tty" group>".)
+
+ 136 block	Mylex DAC960 PCI RAID controller; ninth controller
+		  0 = /dev/rd/c8d0	First disk, whole disk
+		  8 = /dev/rd/c8d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c8d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+ 137 block	Mylex DAC960 PCI RAID controller; tenth controller
+		  0 = /dev/rd/c9d0	First disk, whole disk
+		  8 = /dev/rd/c9d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c9d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+ 138 block	Mylex DAC960 PCI RAID controller; eleventh controller
+		  0 = /dev/rd/c10d0	First disk, whole disk
+		  8 = /dev/rd/c10d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c10d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+ 139 block	Mylex DAC960 PCI RAID controller; twelfth controller
+		  0 = /dev/rd/c11d0	First disk, whole disk
+		  8 = /dev/rd/c11d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c11d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+ 140 block	Mylex DAC960 PCI RAID controller; thirteenth controller
+		  0 = /dev/rd/c12d0	First disk, whole disk
+		  8 = /dev/rd/c12d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c12d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+ 141 block	Mylex DAC960 PCI RAID controller; fourteenth controller
+		  0 = /dev/rd/c13d0	First disk, whole disk
+		  8 = /dev/rd/c13d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c13d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+ 142 block	Mylex DAC960 PCI RAID controller; fifteenth controller
+		  0 = /dev/rd/c14d0	First disk, whole disk
+		  8 = /dev/rd/c14d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c14d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+ 143 block	Mylex DAC960 PCI RAID controller; sixteenth controller
+		  0 = /dev/rd/c15d0	First disk, whole disk
+		  8 = /dev/rd/c15d1	Second disk, whole disk
+		    ...
+		248 = /dev/rd/c15d31	32nd disk, whole disk
+
+		Partitions are handled as for major 48.
+
+ 144 char	Encapsulated PPP
+		  0 = /dev/pppox0	First PPP over Ethernet
+		    ...
+		 63 = /dev/pppox63	64th PPP over Ethernet
+
+		This is primarily used for ADSL.
+
+		The SST 5136-DN DeviceNet interface driver has been
+		relocated to major 183 due to an unfortunate conflict.
+
+ 144 block	Expansion Area #1 for more non-device (e.g. NFS) mounts
+		  0 = mounted device 256
+		255 = mounted device 511
+
+ 145 char	SAM9407-based soundcard
+		  0 = /dev/sam0_mixer
+		  1 = /dev/sam0_sequencer
+		  2 = /dev/sam0_midi00
+		  3 = /dev/sam0_dsp
+		  4 = /dev/sam0_audio
+		  6 = /dev/sam0_sndstat
+		 18 = /dev/sam0_midi01
+		 34 = /dev/sam0_midi02
+		 50 = /dev/sam0_midi03
+		 64 = /dev/sam1_mixer
+		    ...
+		128 = /dev/sam2_mixer
+		    ...
+		192 = /dev/sam3_mixer
+		    ...
+
+		Device functions match OSS, but offer a number of
+		addons, which are sam9407 specific.  OSS can be
+		operated simultaneously, taking care of the codec.
+
+ 145 block	Expansion Area #2 for more non-device (e.g. NFS) mounts
+		  0 = mounted device 512
+		255 = mounted device 767
+
+ 146 char	SYSTRAM SCRAMNet mirrored-memory network
+		  0 = /dev/scramnet0	First SCRAMNet device
+		  1 = /dev/scramnet1	Second SCRAMNet device
+		    ...
+
+ 146 block	Expansion Area #3 for more non-device (e.g. NFS) mounts
+		  0 = mounted device 768
+		255 = mounted device 1023
+
+ 147 char	Aureal Semiconductor Vortex Audio device
+		  0 = /dev/aureal0	First Aureal Vortex
+		  1 = /dev/aureal1	Second Aureal Vortex
+		    ...
+
+ 147 block	Distributed Replicated Block Device (DRBD)
+		  0 = /dev/drbd0	First DRBD device
+		  1 = /dev/drbd1	Second DRBD device
+		    ...
+
+ 148 char	Technology Concepts serial card
+		  0 = /dev/ttyT0	First TCL port
+		  1 = /dev/ttyT1	Second TCL port
+		    ...
+
+ 149 char	Technology Concepts serial card - alternate devices
+		  0 = /dev/cut0		Callout device for ttyT0
+		  1 = /dev/cut0		Callout device for ttyT1
+		    ...
+
+ 150 char	Real-Time Linux FIFOs
+		  0 = /dev/rtf0		First RTLinux FIFO
+		  1 = /dev/rtf1		Second RTLinux FIFO
+		    ...
+
+ 151 char	DPT I2O SmartRaid V controller
+		  0 = /dev/dpti0	First DPT I2O adapter
+		  1 = /dev/dpti1	Second DPT I2O adapter
+		    ...
+
+ 152 char	EtherDrive Control Device
+		  0 = /dev/etherd/ctl	Connect/Disconnect an EtherDrive
+		  1 = /dev/etherd/err	Monitor errors
+		  2 = /dev/etherd/raw	Raw AoE packet monitor
+
+ 152 block	EtherDrive Block Devices
+		  0 = /dev/etherd/0	EtherDrive 0
+		    ...
+		255 = /dev/etherd/255	EtherDrive 255
+
+ 153 char	SPI Bus Interface (sometimes referred to as MicroWire)
+		  0 = /dev/spi0		First SPI device on the bus
+		  1 = /dev/spi1		Second SPI device on the bus
+		    ...
+		 15 = /dev/spi15	Sixteenth SPI device on the bus
+
+ 153 block	Enhanced Metadisk RAID (EMD) storage units
+		  0 = /dev/emd/0	First unit
+		  1 = /dev/emd/0p1	Partition 1 on First unit
+		  2 = /dev/emd/0p2	Partition 2 on First unit
+		    ...
+		 15 = /dev/emd/0p15	Partition 15 on First unit
+
+		 16 = /dev/emd/1	Second unit
+		 32 = /dev/emd/2	Third unit
+		    ...
+		240 = /dev/emd/15	Sixteenth unit
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 154 char	Specialix RIO serial card
+		  0 = /dev/ttySR0	First RIO port
+		    ...
+		255 = /dev/ttySR255	256th RIO port
+
+ 155 char	Specialix RIO serial card - alternate devices
+		  0 = /dev/cusr0	Callout device for ttySR0
+		    ...
+		255 = /dev/cusr255	Callout device for ttySR255
+
+ 156 char	Specialix RIO serial card
+		  0 = /dev/ttySR256	257th RIO port
+		    ...
+		255 = /dev/ttySR511	512th RIO port
+
+ 157 char	Specialix RIO serial card - alternate devices
+		  0 = /dev/cusr256	Callout device for ttySR256
+		    ...
+		255 = /dev/cusr511	Callout device for ttySR511
+
+ 158 char	Dialogic GammaLink fax driver
+		  0 = /dev/gfax0	GammaLink channel 0
+		  1 = /dev/gfax1	GammaLink channel 1
+		    ...
+
+ 159 char	RESERVED
+
+ 159 block	RESERVED
+
+ 160 char	General Purpose Instrument Bus (GPIB)
+		  0 = /dev/gpib0	First GPIB bus
+		  1 = /dev/gpib1	Second GPIB bus
+		    ...
+
+ 160 block       Carmel 8-port SATA Disks on First Controller
+		  0 = /dev/carmel/0     SATA disk 0 whole disk
+		  1 = /dev/carmel/0p1   SATA disk 0 partition 1
+		    ...
+		 31 = /dev/carmel/0p31  SATA disk 0 partition 31
+
+		 32 = /dev/carmel/1     SATA disk 1 whole disk
+		 64 = /dev/carmel/2     SATA disk 2 whole disk
+		    ...
+		224 = /dev/carmel/7     SATA disk 7 whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 31.
+
+ 161 char	IrCOMM devices (IrDA serial/parallel emulation)
+		  0 = /dev/ircomm0	First IrCOMM device
+		  1 = /dev/ircomm1	Second IrCOMM device
+		    ...
+		 16 = /dev/irlpt0	First IrLPT device
+		 17 = /dev/irlpt1	Second IrLPT device
+		    ...
+
+ 161 block       Carmel 8-port SATA Disks on Second Controller
+		  0 = /dev/carmel/8     SATA disk 8 whole disk
+		  1 = /dev/carmel/8p1   SATA disk 8 partition 1
+		    ...
+		 31 = /dev/carmel/8p31  SATA disk 8 partition 31
+
+		 32 = /dev/carmel/9     SATA disk 9 whole disk
+		 64 = /dev/carmel/10    SATA disk 10 whole disk
+		    ...
+		224 = /dev/carmel/15    SATA disk 15 whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 31.
+
+ 162 char	Raw block device interface
+		  0 = /dev/rawctl	Raw I/O control device
+		  1 = /dev/raw/raw1	First raw I/O device
+		  2 = /dev/raw/raw2	Second raw I/O device
+		    ...
+		 max minor number of raw device is set by kernel config
+		 MAX_RAW_DEVS or raw module parameter 'max_raw_devs'
+
+ 163 char
+
+ 164 char	Chase Research AT/PCI-Fast serial card
+		  0 = /dev/ttyCH0	AT/PCI-Fast board 0, port 0
+		    ...
+		 15 = /dev/ttyCH15	AT/PCI-Fast board 0, port 15
+		 16 = /dev/ttyCH16	AT/PCI-Fast board 1, port 0
+		    ...
+		 31 = /dev/ttyCH31	AT/PCI-Fast board 1, port 15
+		 32 = /dev/ttyCH32	AT/PCI-Fast board 2, port 0
+		    ...
+		 47 = /dev/ttyCH47	AT/PCI-Fast board 2, port 15
+		 48 = /dev/ttyCH48	AT/PCI-Fast board 3, port 0
+		    ...
+		 63 = /dev/ttyCH63	AT/PCI-Fast board 3, port 15
+
+ 165 char	Chase Research AT/PCI-Fast serial card - alternate devices
+		  0 = /dev/cuch0	Callout device for ttyCH0
+		    ...
+		 63 = /dev/cuch63	Callout device for ttyCH63
+
+ 166 char	ACM USB modems
+		  0 = /dev/ttyACM0	First ACM modem
+		  1 = /dev/ttyACM1	Second ACM modem
+		    ...
+
+ 167 char	ACM USB modems - alternate devices
+		  0 = /dev/cuacm0	Callout device for ttyACM0
+		  1 = /dev/cuacm1	Callout device for ttyACM1
+		    ...
+
+ 168 char	Eracom CSA7000 PCI encryption adaptor
+		  0 = /dev/ecsa0	First CSA7000
+		  1 = /dev/ecsa1	Second CSA7000
+		    ...
+
+ 169 char	Eracom CSA8000 PCI encryption adaptor
+		  0 = /dev/ecsa8-0	First CSA8000
+		  1 = /dev/ecsa8-1	Second CSA8000
+		    ...
+
+ 170 char	AMI MegaRAC remote access controller
+		  0 = /dev/megarac0	First MegaRAC card
+		  1 = /dev/megarac1	Second MegaRAC card
+		    ...
+
+ 171 char	Reserved for IEEE 1394 (Firewire)
+
+ 172 char	Moxa Intellio serial card
+		  0 = /dev/ttyMX0	First Moxa port
+		  1 = /dev/ttyMX1	Second Moxa port
+		    ...
+		127 = /dev/ttyMX127	128th Moxa port
+		128 = /dev/moxactl	Moxa control port
+
+ 173 char	Moxa Intellio serial card - alternate devices
+		  0 = /dev/cumx0	Callout device for ttyMX0
+		  1 = /dev/cumx1	Callout device for ttyMX1
+		    ...
+		127 = /dev/cumx127	Callout device for ttyMX127
+
+ 174 char	SmartIO serial card
+		  0 = /dev/ttySI0	First SmartIO port
+		  1 = /dev/ttySI1	Second SmartIO port
+		    ...
+
+ 175 char	SmartIO serial card - alternate devices
+		  0 = /dev/cusi0	Callout device for ttySI0
+		  1 = /dev/cusi1	Callout device for ttySI1
+		    ...
+
+ 176 char	nCipher nFast PCI crypto accelerator
+		  0 = /dev/nfastpci0	First nFast PCI device
+		  1 = /dev/nfastpci1	First nFast PCI device
+		    ...
+
+ 177 char	TI PCILynx memory spaces
+		  0 = /dev/pcilynx/aux0	 AUX space of first PCILynx card
+		    ...
+		 15 = /dev/pcilynx/aux15 AUX space of 16th PCILynx card
+		 16 = /dev/pcilynx/rom0	 ROM space of first PCILynx card
+		    ...
+		 31 = /dev/pcilynx/rom15 ROM space of 16th PCILynx card
+		 32 = /dev/pcilynx/ram0	 RAM space of first PCILynx card
+		    ...
+		 47 = /dev/pcilynx/ram15 RAM space of 16th PCILynx card
+
+ 178 char	Giganet cLAN1xxx virtual interface adapter
+		  0 = /dev/clanvi0	First cLAN adapter
+		  1 = /dev/clanvi1	Second cLAN adapter
+		    ...
+
+ 179 block       MMC block devices
+		  0 = /dev/mmcblk0      First SD/MMC card
+		  1 = /dev/mmcblk0p1    First partition on first MMC card
+		  8 = /dev/mmcblk1      Second SD/MMC card
+		    ...
+
+		The start of next SD/MMC card can be configured with
+		CONFIG_MMC_BLOCK_MINORS, or overridden at boot/modprobe
+		time using the mmcblk.perdev_minors option. That would
+		bump the offset between each card to be the configured
+		value instead of the default 8.
+
+ 179 char	CCube DVXChip-based PCI products
+		  0 = /dev/dvxirq0	First DVX device
+		  1 = /dev/dvxirq1	Second DVX device
+		    ...
+
+ 180 char	USB devices
+		  0 = /dev/usb/lp0	First USB printer
+		    ...
+		 15 = /dev/usb/lp15	16th USB printer
+		 48 = /dev/usb/scanner0	First USB scanner
+		    ...
+		 63 = /dev/usb/scanner15 16th USB scanner
+		 64 = /dev/usb/rio500	Diamond Rio 500
+		 65 = /dev/usb/usblcd	USBLCD Interface (info@usblcd.de)
+		 66 = /dev/usb/cpad0	Synaptics cPad (mouse/LCD)
+		 96 = /dev/usb/hiddev0	1st USB HID device
+		    ...
+		111 = /dev/usb/hiddev15	16th USB HID device
+		112 = /dev/usb/auer0	1st auerswald ISDN device
+		    ...
+		127 = /dev/usb/auer15	16th auerswald ISDN device
+		128 = /dev/usb/brlvgr0	First Braille Voyager device
+		    ...
+		131 = /dev/usb/brlvgr3	Fourth Braille Voyager device
+		132 = /dev/usb/idmouse	ID Mouse (fingerprint scanner) device
+		133 = /dev/usb/sisusbvga1	First SiSUSB VGA device
+		    ...
+		140 = /dev/usb/sisusbvga8	Eighth SISUSB VGA device
+		144 = /dev/usb/lcd	USB LCD device
+		160 = /dev/usb/legousbtower0	1st USB Legotower device
+		    ...
+		175 = /dev/usb/legousbtower15	16th USB Legotower device
+		176 = /dev/usb/usbtmc1	First USB TMC device
+		   ...
+		191 = /dev/usb/usbtmc16	16th USB TMC device
+		192 = /dev/usb/yurex1	First USB Yurex device
+		   ...
+		209 = /dev/usb/yurex16	16th USB Yurex device
+
+ 180 block	USB block devices
+		  0 = /dev/uba		First USB block device
+		  8 = /dev/ubb		Second USB block device
+		 16 = /dev/ubc		Third USB block device
+		    ...
+
+ 181 char	Conrad Electronic parallel port radio clocks
+		  0 = /dev/pcfclock0	First Conrad radio clock
+		  1 = /dev/pcfclock1	Second Conrad radio clock
+		    ...
+
+ 182 char	Picture Elements THR2 binarizer
+		  0 = /dev/pethr0	First THR2 board
+		  1 = /dev/pethr1	Second THR2 board
+		    ...
+
+ 183 char	SST 5136-DN DeviceNet interface
+		  0 = /dev/ss5136dn0	First DeviceNet interface
+		  1 = /dev/ss5136dn1	Second DeviceNet interface
+		    ...
+
+		This device used to be assigned to major number 144.
+		It had to be moved due to an unfortunate conflict.
+
+ 184 char	Picture Elements' video simulator/sender
+		  0 = /dev/pevss0	First sender board
+		  1 = /dev/pevss1	Second sender board
+		    ...
+
+ 185 char	InterMezzo high availability file system
+		  0 = /dev/intermezzo0	First cache manager
+		  1 = /dev/intermezzo1	Second cache manager
+		    ...
+
+		See http://web.archive.org/web/20080115195241/
+		http://inter-mezzo.org/index.html
+
+ 186 char	Object-based storage control device
+		  0 = /dev/obd0		First obd control device
+		  1 = /dev/obd1		Second obd control device
+		    ...
+
+		See ftp://ftp.lustre.org/pub/obd for code and information.
+
+ 187 char	DESkey hardware encryption device
+		  0 = /dev/deskey0	First DES key
+		  1 = /dev/deskey1	Second DES key
+		    ...
+
+ 188 char	USB serial converters
+		  0 = /dev/ttyUSB0	First USB serial converter
+		  1 = /dev/ttyUSB1	Second USB serial converter
+		    ...
+
+ 189 char	USB serial converters - alternate devices
+		  0 = /dev/cuusb0	Callout device for ttyUSB0
+		  1 = /dev/cuusb1	Callout device for ttyUSB1
+		    ...
+
+ 190 char	Kansas City tracker/tuner card
+		  0 = /dev/kctt0	First KCT/T card
+		  1 = /dev/kctt1	Second KCT/T card
+		    ...
+
+ 191 char	Reserved for PCMCIA
+
+ 192 char	Kernel profiling interface
+		  0 = /dev/profile	Profiling control device
+		  1 = /dev/profile0	Profiling device for CPU 0
+		  2 = /dev/profile1	Profiling device for CPU 1
+		    ...
+
+ 193 char	Kernel event-tracing interface
+		  0 = /dev/trace	Tracing control device
+		  1 = /dev/trace0	Tracing device for CPU 0
+		  2 = /dev/trace1	Tracing device for CPU 1
+		    ...
+
+ 194 char	linVideoStreams (LINVS)
+		  0 = /dev/mvideo/status0	Video compression status
+		  1 = /dev/mvideo/stream0	Video stream
+		  2 = /dev/mvideo/frame0	Single compressed frame
+		  3 = /dev/mvideo/rawframe0	Raw uncompressed frame
+		  4 = /dev/mvideo/codec0	Direct codec access
+		  5 = /dev/mvideo/video4linux0	Video4Linux compatibility
+
+		 16 = /dev/mvideo/status1	Second device
+		    ...
+		 32 = /dev/mvideo/status2	Third device
+		    ...
+		    ...
+		240 = /dev/mvideo/status15	16th device
+		    ...
+
+ 195 char	Nvidia graphics devices
+		  0 = /dev/nvidia0		First Nvidia card
+		  1 = /dev/nvidia1		Second Nvidia card
+		    ...
+		255 = /dev/nvidiactl		Nvidia card control device
+
+ 196 char	Tormenta T1 card
+		  0 = /dev/tor/0		Master control channel for all cards
+		  1 = /dev/tor/1		First DS0
+		  2 = /dev/tor/2		Second DS0
+		    ...
+		 48 = /dev/tor/48		48th DS0
+		 49 = /dev/tor/49		First pseudo-channel
+		 50 = /dev/tor/50		Second pseudo-channel
+		    ...
+
+ 197 char	OpenTNF tracing facility
+		  0 = /dev/tnf/t0		Trace 0 data extraction
+		  1 = /dev/tnf/t1		Trace 1 data extraction
+		    ...
+		128 = /dev/tnf/status		Tracing facility status
+		130 = /dev/tnf/trace		Tracing device
+
+ 198 char	Total Impact TPMP2 quad coprocessor PCI card
+		  0 = /dev/tpmp2/0		First card
+		  1 = /dev/tpmp2/1		Second card
+		    ...
+
+ 199 char	Veritas volume manager (VxVM) volumes
+		  0 = /dev/vx/rdsk/*/*		First volume
+		  1 = /dev/vx/rdsk/*/*		Second volume
+		    ...
+
+ 199 block	Veritas volume manager (VxVM) volumes
+		  0 = /dev/vx/dsk/*/*		First volume
+		  1 = /dev/vx/dsk/*/*		Second volume
+		    ...
+
+		The namespace in these directories is maintained by
+		the user space VxVM software.
+
+ 200 char	Veritas VxVM configuration interface
+		   0 = /dev/vx/config		Configuration access node
+		   1 = /dev/vx/trace		Volume i/o trace access node
+		   2 = /dev/vx/iod		Volume i/o daemon access node
+		   3 = /dev/vx/info		Volume information access node
+		   4 = /dev/vx/task		Volume tasks access node
+		   5 = /dev/vx/taskmon		Volume tasks monitor daemon
+
+ 201 char	Veritas VxVM dynamic multipathing driver
+		  0 = /dev/vx/rdmp/*		First multipath device
+		  1 = /dev/vx/rdmp/*		Second multipath device
+		    ...
+ 201 block	Veritas VxVM dynamic multipathing driver
+		  0 = /dev/vx/dmp/*		First multipath device
+		  1 = /dev/vx/dmp/*		Second multipath device
+		    ...
+
+		The namespace in these directories is maintained by
+		the user space VxVM software.
+
+ 202 char	CPU model-specific registers
+		  0 = /dev/cpu/0/msr		MSRs on CPU 0
+		  1 = /dev/cpu/1/msr		MSRs on CPU 1
+		    ...
+
+ 202 block	Xen Virtual Block Device
+		  0 = /dev/xvda       First Xen VBD whole disk
+		  16 = /dev/xvdb      Second Xen VBD whole disk
+		  32 = /dev/xvdc      Third Xen VBD whole disk
+		    ...
+		  240 = /dev/xvdp     Sixteenth Xen VBD whole disk
+
+		Partitions are handled in the same way as for IDE
+		disks (see major number 3) except that the limit on
+		partitions is 15.
+
+ 203 char	CPU CPUID information
+		  0 = /dev/cpu/0/cpuid		CPUID on CPU 0
+		  1 = /dev/cpu/1/cpuid		CPUID on CPU 1
+		    ...
+
+ 204 char	Low-density serial ports
+		  0 = /dev/ttyLU0		LinkUp Systems L72xx UART - port 0
+		  1 = /dev/ttyLU1		LinkUp Systems L72xx UART - port 1
+		  2 = /dev/ttyLU2		LinkUp Systems L72xx UART - port 2
+		  3 = /dev/ttyLU3		LinkUp Systems L72xx UART - port 3
+		  4 = /dev/ttyFB0		Intel Footbridge (ARM)
+		  5 = /dev/ttySA0		StrongARM builtin serial port 0
+		  6 = /dev/ttySA1		StrongARM builtin serial port 1
+		  7 = /dev/ttySA2		StrongARM builtin serial port 2
+		  8 = /dev/ttySC0		SCI serial port (SuperH) - port 0
+		  9 = /dev/ttySC1		SCI serial port (SuperH) - port 1
+		 10 = /dev/ttySC2		SCI serial port (SuperH) - port 2
+		 11 = /dev/ttySC3		SCI serial port (SuperH) - port 3
+		 12 = /dev/ttyFW0		Firmware console - port 0
+		 13 = /dev/ttyFW1		Firmware console - port 1
+		 14 = /dev/ttyFW2		Firmware console - port 2
+		 15 = /dev/ttyFW3		Firmware console - port 3
+		 16 = /dev/ttyAM0		ARM "AMBA" serial port 0
+		    ...
+		 31 = /dev/ttyAM15		ARM "AMBA" serial port 15
+		 32 = /dev/ttyDB0		DataBooster serial port 0
+		    ...
+		 39 = /dev/ttyDB7		DataBooster serial port 7
+		 40 = /dev/ttySG0		SGI Altix console port
+		 41 = /dev/ttySMX0		Motorola i.MX - port 0
+		 42 = /dev/ttySMX1		Motorola i.MX - port 1
+		 43 = /dev/ttySMX2		Motorola i.MX - port 2
+		 44 = /dev/ttyMM0		Marvell MPSC - port 0
+		 45 = /dev/ttyMM1		Marvell MPSC - port 1
+		 46 = /dev/ttyCPM0		PPC CPM (SCC or SMC) - port 0
+		    ...
+		 47 = /dev/ttyCPM5		PPC CPM (SCC or SMC) - port 5
+		 50 = /dev/ttyIOC0		Altix serial card
+		    ...
+		 81 = /dev/ttyIOC31		Altix serial card
+		 82 = /dev/ttyVR0		NEC VR4100 series SIU
+		 83 = /dev/ttyVR1		NEC VR4100 series DSIU
+		 84 = /dev/ttyIOC84		Altix ioc4 serial card
+		    ...
+		 115 = /dev/ttyIOC115		Altix ioc4 serial card
+		 116 = /dev/ttySIOC0		Altix ioc3 serial card
+		    ...
+		 147 = /dev/ttySIOC31		Altix ioc3 serial card
+		 148 = /dev/ttyPSC0		PPC PSC - port 0
+		    ...
+		 153 = /dev/ttyPSC5		PPC PSC - port 5
+		 154 = /dev/ttyAT0		ATMEL serial port 0
+		    ...
+		 169 = /dev/ttyAT15		ATMEL serial port 15
+		 170 = /dev/ttyNX0		Hilscher netX serial port 0
+		    ...
+		 185 = /dev/ttyNX15		Hilscher netX serial port 15
+		 186 = /dev/ttyJ0		JTAG1 DCC protocol based serial port emulation
+		 187 = /dev/ttyUL0		Xilinx uartlite - port 0
+		    ...
+		 190 = /dev/ttyUL3		Xilinx uartlite - port 3
+		 191 = /dev/xvc0		Xen virtual console - port 0
+		 192 = /dev/ttyPZ0		pmac_zilog - port 0
+		    ...
+		 195 = /dev/ttyPZ3		pmac_zilog - port 3
+		 196 = /dev/ttyTX0		TX39/49 serial port 0
+		    ...
+		 204 = /dev/ttyTX7		TX39/49 serial port 7
+		 205 = /dev/ttySC0		SC26xx serial port 0
+		 206 = /dev/ttySC1		SC26xx serial port 1
+		 207 = /dev/ttySC2		SC26xx serial port 2
+		 208 = /dev/ttySC3		SC26xx serial port 3
+		 209 = /dev/ttyMAX0		MAX3100 serial port 0
+		 210 = /dev/ttyMAX1		MAX3100 serial port 1
+		 211 = /dev/ttyMAX2		MAX3100 serial port 2
+		 212 = /dev/ttyMAX3		MAX3100 serial port 3
+
+ 205 char	Low-density serial ports (alternate device)
+		  0 = /dev/culu0		Callout device for ttyLU0
+		  1 = /dev/culu1		Callout device for ttyLU1
+		  2 = /dev/culu2		Callout device for ttyLU2
+		  3 = /dev/culu3		Callout device for ttyLU3
+		  4 = /dev/cufb0		Callout device for ttyFB0
+		  5 = /dev/cusa0		Callout device for ttySA0
+		  6 = /dev/cusa1		Callout device for ttySA1
+		  7 = /dev/cusa2		Callout device for ttySA2
+		  8 = /dev/cusc0		Callout device for ttySC0
+		  9 = /dev/cusc1		Callout device for ttySC1
+		 10 = /dev/cusc2		Callout device for ttySC2
+		 11 = /dev/cusc3		Callout device for ttySC3
+		 12 = /dev/cufw0		Callout device for ttyFW0
+		 13 = /dev/cufw1		Callout device for ttyFW1
+		 14 = /dev/cufw2		Callout device for ttyFW2
+		 15 = /dev/cufw3		Callout device for ttyFW3
+		 16 = /dev/cuam0		Callout device for ttyAM0
+		    ...
+		 31 = /dev/cuam15		Callout device for ttyAM15
+		 32 = /dev/cudb0		Callout device for ttyDB0
+		    ...
+		 39 = /dev/cudb7		Callout device for ttyDB7
+		 40 = /dev/cusg0		Callout device for ttySG0
+		 41 = /dev/ttycusmx0		Callout device for ttySMX0
+		 42 = /dev/ttycusmx1		Callout device for ttySMX1
+		 43 = /dev/ttycusmx2		Callout device for ttySMX2
+		 46 = /dev/cucpm0		Callout device for ttyCPM0
+		    ...
+		 49 = /dev/cucpm5		Callout device for ttyCPM5
+		 50 = /dev/cuioc40		Callout device for ttyIOC40
+		    ...
+		 81 = /dev/cuioc431		Callout device for ttyIOC431
+		 82 = /dev/cuvr0		Callout device for ttyVR0
+		 83 = /dev/cuvr1		Callout device for ttyVR1
+
+ 206 char	OnStream SC-x0 tape devices
+		  0 = /dev/osst0		First OnStream SCSI tape, mode 0
+		  1 = /dev/osst1		Second OnStream SCSI tape, mode 0
+		    ...
+		 32 = /dev/osst0l		First OnStream SCSI tape, mode 1
+		 33 = /dev/osst1l		Second OnStream SCSI tape, mode 1
+		    ...
+		 64 = /dev/osst0m		First OnStream SCSI tape, mode 2
+		 65 = /dev/osst1m		Second OnStream SCSI tape, mode 2
+		    ...
+		 96 = /dev/osst0a		First OnStream SCSI tape, mode 3
+		 97 = /dev/osst1a		Second OnStream SCSI tape, mode 3
+		    ...
+		128 = /dev/nosst0		No rewind version of /dev/osst0
+		129 = /dev/nosst1		No rewind version of /dev/osst1
+		    ...
+		160 = /dev/nosst0l		No rewind version of /dev/osst0l
+		161 = /dev/nosst1l		No rewind version of /dev/osst1l
+		    ...
+		192 = /dev/nosst0m		No rewind version of /dev/osst0m
+		193 = /dev/nosst1m		No rewind version of /dev/osst1m
+		    ...
+		224 = /dev/nosst0a		No rewind version of /dev/osst0a
+		225 = /dev/nosst1a		No rewind version of /dev/osst1a
+		    ...
+
+		The OnStream SC-x0 SCSI tapes do not support the
+		standard SCSI SASD command set and therefore need
+		their own driver "osst". Note that the IDE, USB (and
+		maybe ParPort) versions may be driven via ide-scsi or
+		usb-storage SCSI emulation and this osst device and
+		driver as well.  The ADR-x0 drives are QIC-157
+		compliant and don't need osst.
+
+ 207 char	Compaq ProLiant health feature indicate
+		  0 = /dev/cpqhealth/cpqw	Redirector interface
+		  1 = /dev/cpqhealth/crom	EISA CROM
+		  2 = /dev/cpqhealth/cdt	Data Table
+		  3 = /dev/cpqhealth/cevt	Event Log
+		  4 = /dev/cpqhealth/casr	Automatic Server Recovery
+		  5 = /dev/cpqhealth/cecc	ECC Memory
+		  6 = /dev/cpqhealth/cmca	Machine Check Architecture
+		  7 = /dev/cpqhealth/ccsm	Deprecated CDT
+		  8 = /dev/cpqhealth/cnmi	NMI Handling
+		  9 = /dev/cpqhealth/css	Sideshow Management
+		 10 = /dev/cpqhealth/cram	CMOS interface
+		 11 = /dev/cpqhealth/cpci	PCI IRQ interface
+
+ 208 char	User space serial ports
+		  0 = /dev/ttyU0		First user space serial port
+		  1 = /dev/ttyU1		Second user space serial port
+		    ...
+
+ 209 char	User space serial ports (alternate devices)
+		  0 = /dev/cuu0			Callout device for ttyU0
+		  1 = /dev/cuu1			Callout device for ttyU1
+		    ...
+
+ 210 char	SBE, Inc. sync/async serial card
+		  0 = /dev/sbei/wxcfg0		Configuration device for board 0
+		  1 = /dev/sbei/dld0		Download device for board 0
+		  2 = /dev/sbei/wan00		WAN device, port 0, board 0
+		  3 = /dev/sbei/wan01		WAN device, port 1, board 0
+		  4 = /dev/sbei/wan02		WAN device, port 2, board 0
+		  5 = /dev/sbei/wan03		WAN device, port 3, board 0
+		  6 = /dev/sbei/wanc00		WAN clone device, port 0, board 0
+		  7 = /dev/sbei/wanc01		WAN clone device, port 1, board 0
+		  8 = /dev/sbei/wanc02		WAN clone device, port 2, board 0
+		  9 = /dev/sbei/wanc03		WAN clone device, port 3, board 0
+		 10 = /dev/sbei/wxcfg1		Configuration device for board 1
+		 11 = /dev/sbei/dld1		Download device for board 1
+		 12 = /dev/sbei/wan10		WAN device, port 0, board 1
+		 13 = /dev/sbei/wan11		WAN device, port 1, board 1
+		 14 = /dev/sbei/wan12		WAN device, port 2, board 1
+		 15 = /dev/sbei/wan13		WAN device, port 3, board 1
+		 16 = /dev/sbei/wanc10		WAN clone device, port 0, board 1
+		 17 = /dev/sbei/wanc11		WAN clone device, port 1, board 1
+		 18 = /dev/sbei/wanc12		WAN clone device, port 2, board 1
+		 19 = /dev/sbei/wanc13		WAN clone device, port 3, board 1
+		    ...
+
+		Yes, each board is really spaced 10 (decimal) apart.
+
+ 211 char	Addinum CPCI1500 digital I/O card
+		  0 = /dev/addinum/cpci1500/0	First CPCI1500 card
+		  1 = /dev/addinum/cpci1500/1	Second CPCI1500 card
+		    ...
+
+ 212 char	LinuxTV.org DVB driver subsystem
+		  0 = /dev/dvb/adapter0/video0    first video decoder of first card
+		  1 = /dev/dvb/adapter0/audio0    first audio decoder of first card
+		  2 = /dev/dvb/adapter0/sec0      (obsolete/unused)
+		  3 = /dev/dvb/adapter0/frontend0 first frontend device of first card
+		  4 = /dev/dvb/adapter0/demux0    first demux device of first card
+		  5 = /dev/dvb/adapter0/dvr0      first digital video recoder device of first card
+		  6 = /dev/dvb/adapter0/ca0       first common access port of first card
+		  7 = /dev/dvb/adapter0/net0      first network device of first card
+		  8 = /dev/dvb/adapter0/osd0      first on-screen-display device of first card
+		  9 = /dev/dvb/adapter0/video1    second video decoder of first card
+		    ...
+		 64 = /dev/dvb/adapter1/video0    first video decoder of second card
+		    ...
+		128 = /dev/dvb/adapter2/video0    first video decoder of third card
+		    ...
+		196 = /dev/dvb/adapter3/video0    first video decoder of fourth card
+
+ 216 char	Bluetooth RFCOMM TTY devices
+		  0 = /dev/rfcomm0		First Bluetooth RFCOMM TTY device
+		  1 = /dev/rfcomm1		Second Bluetooth RFCOMM TTY device
+		    ...
+
+ 217 char	Bluetooth RFCOMM TTY devices (alternate devices)
+		  0 = /dev/curf0		Callout device for rfcomm0
+		  1 = /dev/curf1		Callout device for rfcomm1
+		    ...
+
+ 218 char	The Logical Company bus Unibus/Qbus adapters
+		  0 = /dev/logicalco/bci/0	First bus adapter
+		  1 = /dev/logicalco/bci/1	First bus adapter
+		    ...
+
+ 219 char	The Logical Company DCI-1300 digital I/O card
+		  0 = /dev/logicalco/dci1300/0	First DCI-1300 card
+		  1 = /dev/logicalco/dci1300/1	Second DCI-1300 card
+		    ...
+
+ 220 char	Myricom Myrinet "GM" board
+		  0 = /dev/myricom/gm0		First Myrinet GM board
+		  1 = /dev/myricom/gmp0		First board "root access"
+		  2 = /dev/myricom/gm1		Second Myrinet GM board
+		  3 = /dev/myricom/gmp1		Second board "root access"
+		    ...
+
+ 221 char	VME bus
+		  0 = /dev/bus/vme/m0		First master image
+		  1 = /dev/bus/vme/m1		Second master image
+		  2 = /dev/bus/vme/m2		Third master image
+		  3 = /dev/bus/vme/m3		Fourth master image
+		  4 = /dev/bus/vme/s0		First slave image
+		  5 = /dev/bus/vme/s1		Second slave image
+		  6 = /dev/bus/vme/s2		Third slave image
+		  7 = /dev/bus/vme/s3		Fourth slave image
+		  8 = /dev/bus/vme/ctl		Control
+
+		It is expected that all VME bus drivers will use the
+		same interface.  For interface documentation see
+		http://www.vmelinux.org/.
+
+ 224 char	A2232 serial card
+		  0 = /dev/ttyY0		First A2232 port
+		  1 = /dev/ttyY1		Second A2232 port
+		    ...
+
+ 225 char	A2232 serial card (alternate devices)
+		  0 = /dev/cuy0			Callout device for ttyY0
+		  1 = /dev/cuy1			Callout device for ttyY1
+		    ...
+
+ 226 char	Direct Rendering Infrastructure (DRI)
+		  0 = /dev/dri/card0		First graphics card
+		  1 = /dev/dri/card1		Second graphics card
+		    ...
+
+ 227 char	IBM 3270 terminal Unix tty access
+		  1 = /dev/3270/tty1		First 3270 terminal
+		  2 = /dev/3270/tty2		Seconds 3270 terminal
+		    ...
+
+ 228 char	IBM 3270 terminal block-mode access
+		  0 = /dev/3270/tub		Controlling interface
+		  1 = /dev/3270/tub1		First 3270 terminal
+		  2 = /dev/3270/tub2		Second 3270 terminal
+		    ...
+
+ 229 char	IBM iSeries/pSeries virtual console
+		  0 = /dev/hvc0			First console port
+		  1 = /dev/hvc1			Second console port
+		    ...
+
+ 230 char	IBM iSeries virtual tape
+		  0 = /dev/iseries/vt0		First virtual tape, mode 0
+		  1 = /dev/iseries/vt1		Second virtual tape, mode 0
+		    ...
+		 32 = /dev/iseries/vt0l		First virtual tape, mode 1
+		 33 = /dev/iseries/vt1l		Second virtual tape, mode 1
+		    ...
+		 64 = /dev/iseries/vt0m		First virtual tape, mode 2
+		 65 = /dev/iseries/vt1m		Second virtual tape, mode 2
+		    ...
+		 96 = /dev/iseries/vt0a		First virtual tape, mode 3
+		 97 = /dev/iseries/vt1a		Second virtual tape, mode 3
+		      ...
+		128 = /dev/iseries/nvt0		First virtual tape, mode 0, no rewind
+		129 = /dev/iseries/nvt1		Second virtual tape, mode 0, no rewind
+		    ...
+		160 = /dev/iseries/nvt0l	First virtual tape, mode 1, no rewind
+		161 = /dev/iseries/nvt1l	Second virtual tape, mode 1, no rewind
+		    ...
+		192 = /dev/iseries/nvt0m	First virtual tape, mode 2, no rewind
+		193 = /dev/iseries/nvt1m	Second virtual tape, mode 2, no rewind
+		    ...
+		224 = /dev/iseries/nvt0a	First virtual tape, mode 3, no rewind
+		225 = /dev/iseries/nvt1a	Second virtual tape, mode 3, no rewind
+		    ...
+
+		"No rewind" refers to the omission of the default
+		automatic rewind on device close.  The MTREW or MTOFFL
+		ioctl()'s can be used to rewind the tape regardless of
+		the device used to access it.
+
+ 231 char	InfiniBand
+		0 = /dev/infiniband/umad0
+		1 = /dev/infiniband/umad1
+		  ...
+		63 = /dev/infiniband/umad63    63rd InfiniBandMad device
+		64 = /dev/infiniband/issm0     First InfiniBand IsSM device
+		65 = /dev/infiniband/issm1     Second InfiniBand IsSM device
+		  ...
+		127 = /dev/infiniband/issm63    63rd InfiniBand IsSM device
+		192 = /dev/infiniband/uverbs0   First InfiniBand verbs device
+		193 = /dev/infiniband/uverbs1   Second InfiniBand verbs device
+		  ...
+		223 = /dev/infiniband/uverbs31  31st InfiniBand verbs device
+
+ 232 char	Biometric Devices
+		0 = /dev/biometric/sensor0/fingerprint	first fingerprint sensor on first device
+		1 = /dev/biometric/sensor0/iris		first iris sensor on first device
+		2 = /dev/biometric/sensor0/retina	first retina sensor on first device
+		3 = /dev/biometric/sensor0/voiceprint	first voiceprint sensor on first device
+		4 = /dev/biometric/sensor0/facial	first facial sensor on first device
+		5 = /dev/biometric/sensor0/hand		first hand sensor on first device
+		  ...
+		10 = /dev/biometric/sensor1/fingerprint	first fingerprint sensor on second device
+		  ...
+		20 = /dev/biometric/sensor2/fingerprint	first fingerprint sensor on third device
+		  ...
+
+ 233 char	PathScale InfiniPath interconnect
+		0 = /dev/ipath        Primary device for programs (any unit)
+		1 = /dev/ipath0       Access specifically to unit 0
+		2 = /dev/ipath1       Access specifically to unit 1
+		  ...
+		4 = /dev/ipath3       Access specifically to unit 3
+		129 = /dev/ipath_sma    Device used by Subnet Management Agent
+		130 = /dev/ipath_diag   Device used by diagnostics programs
+
+ 234-254	char	RESERVED FOR DYNAMIC ASSIGNMENT
+		Character devices that request a dynamic allocation of major number will
+		take numbers starting from 254 and downward.
+
+ 240-254 block	LOCAL/EXPERIMENTAL USE
+		Allocated for local/experimental use.  For devices not
+		assigned official numbers, these ranges should be
+		used in order to avoid conflicting with future assignments.
+
+ 255 char	RESERVED
+
+ 255 block	RESERVED
+
+		This major is reserved to assist the expansion to a
+		larger number space.  No device nodes with this major
+		should ever be created on the filesystem.
+		(This is probably not true anymore, but I'll leave it
+		for now /Torben)
+
+ ---LARGE MAJORS!!!!!---
+
+ 256 char	Equinox SST multi-port serial boards
+		   0 = /dev/ttyEQ0	First serial port on first Equinox SST board
+		 127 = /dev/ttyEQ127	Last serial port on first Equinox SST board
+		 128 = /dev/ttyEQ128	First serial port on second Equinox SST board
+		  ...
+		1027 = /dev/ttyEQ1027	Last serial port on eighth Equinox SST board
+
+ 256 block	Resident Flash Disk Flash Translation Layer
+		  0 = /dev/rfda		First RFD FTL layer
+		 16 = /dev/rfdb		Second RFD FTL layer
+		  ...
+		240 = /dev/rfdp		16th RFD FTL layer
+
+ 257 char	Phoenix Technologies Cryptographic Services Driver
+		  0 = /dev/ptlsec	Crypto Services Driver
+
+ 257 block	SSFDC Flash Translation Layer filesystem
+		  0 = /dev/ssfdca	First SSFDC layer
+		  8 = /dev/ssfdcb	Second SSFDC layer
+		 16 = /dev/ssfdcc	Third SSFDC layer
+		 24 = /dev/ssfdcd	4th SSFDC layer
+		 32 = /dev/ssfdce	5th SSFDC layer
+		 40 = /dev/ssfdcf	6th SSFDC layer
+		 48 = /dev/ssfdcg	7th SSFDC layer
+		 56 = /dev/ssfdch	8th SSFDC layer
+
+ 258 block	ROM/Flash read-only translation layer
+		  0 = /dev/blockrom0	First ROM card's translation layer interface
+		  1 = /dev/blockrom1	Second ROM card's translation layer interface
+		  ...
+
+ 259 block	Block Extended Major
+		  Used dynamically to hold additional partition minor
+		  numbers and allow large numbers of partitions per device
+
+ 259 char	FPGA configuration interfaces
+		  0 = /dev/icap0	First Xilinx internal configuration
+		  1 = /dev/icap1	Second Xilinx internal configuration
+
+ 260 char	OSD (Object-based-device) SCSI Device
+		  0 = /dev/osd0		First OSD Device
+		  1 = /dev/osd1		Second OSD Device
+		  ...
+		  255 = /dev/osd255	256th OSD Device
+
+ 384-511 char	RESERVED FOR DYNAMIC ASSIGNMENT
+		Character devices that request a dynamic allocation of major
+		number will take numbers starting from 511 and downward,
+		once the 234-254 range is full.
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
new file mode 100644
index 000000000..fdf72429f
--- /dev/null
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -0,0 +1,353 @@
+Dynamic debug
++++++++++++++
+
+
+Introduction
+============
+
+This document describes how to use the dynamic debug (dyndbg) feature.
+
+Dynamic debug is designed to allow you to dynamically enable/disable
+kernel code to obtain additional kernel information.  Currently, if
+``CONFIG_DYNAMIC_DEBUG`` is set, then all ``pr_debug()``/``dev_dbg()`` and
+``print_hex_dump_debug()``/``print_hex_dump_bytes()`` calls can be dynamically
+enabled per-callsite.
+
+If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is just
+shortcut for ``print_hex_dump(KERN_DEBUG)``.
+
+For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is
+its ``prefix_str`` argument, if it is constant string; or ``hexdump``
+in case ``prefix_str`` is built dynamically.
+
+Dynamic debug has even more useful features:
+
+ * Simple query language allows turning on and off debugging
+   statements by matching any combination of 0 or 1 of:
+
+   - source filename
+   - function name
+   - line number (including ranges of line numbers)
+   - module name
+   - format string
+
+ * Provides a debugfs control file: ``<debugfs>/dynamic_debug/control``
+   which can be read to display the complete list of known debug
+   statements, to help guide you
+
+Controlling dynamic debug Behaviour
+===================================
+
+The behaviour of ``pr_debug()``/``dev_dbg()`` are controlled via writing to a
+control file in the 'debugfs' filesystem. Thus, you must first mount
+the debugfs filesystem, in order to make use of this feature.
+Subsequently, we refer to the control file as:
+``<debugfs>/dynamic_debug/control``. For example, if you want to enable
+printing from source file ``svcsock.c``, line 1603 you simply do::
+
+  nullarbor:~ # echo 'file svcsock.c line 1603 +p' >
+				<debugfs>/dynamic_debug/control
+
+If you make a mistake with the syntax, the write will fail thus::
+
+  nullarbor:~ # echo 'file svcsock.c wtf 1 +p' >
+				<debugfs>/dynamic_debug/control
+  -bash: echo: write error: Invalid argument
+
+Viewing Dynamic Debug Behaviour
+===============================
+
+You can view the currently configured behaviour of all the debug
+statements via::
+
+  nullarbor:~ # cat <debugfs>/dynamic_debug/control
+  # filename:lineno [module]function flags format
+  /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:323 [svcxprt_rdma]svc_rdma_cleanup =_ "SVCRDMA Module Removed, deregister RPC RDMA transport\012"
+  /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:341 [svcxprt_rdma]svc_rdma_init =_ "\011max_inline       : %d\012"
+  /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:340 [svcxprt_rdma]svc_rdma_init =_ "\011sq_depth         : %d\012"
+  /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:338 [svcxprt_rdma]svc_rdma_init =_ "\011max_requests     : %d\012"
+  ...
+
+
+You can also apply standard Unix text manipulation filters to this
+data, e.g.::
+
+  nullarbor:~ # grep -i rdma <debugfs>/dynamic_debug/control  | wc -l
+  62
+
+  nullarbor:~ # grep -i tcp <debugfs>/dynamic_debug/control | wc -l
+  42
+
+The third column shows the currently enabled flags for each debug
+statement callsite (see below for definitions of the flags).  The
+default value, with no flags enabled, is ``=_``.  So you can view all
+the debug statement callsites with any non-default flags::
+
+  nullarbor:~ # awk '$3 != "=_"' <debugfs>/dynamic_debug/control
+  # filename:lineno [module]function flags format
+  /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svcsock.c:1603 [sunrpc]svc_send p "svc_process: st_sendto returned %d\012"
+
+Command Language Reference
+==========================
+
+At the lexical level, a command comprises a sequence of words separated
+by spaces or tabs.  So these are all equivalent::
+
+  nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
+				<debugfs>/dynamic_debug/control
+  nullarbor:~ # echo -n '  file   svcsock.c     line  1603 +p  ' >
+				<debugfs>/dynamic_debug/control
+  nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
+				<debugfs>/dynamic_debug/control
+
+Command submissions are bounded by a write() system call.
+Multiple commands can be written together, separated by ``;`` or ``\n``::
+
+  ~# echo "func pnpacpi_get_resources +p; func pnp_assign_mem +p" \
+     > <debugfs>/dynamic_debug/control
+
+If your query set is big, you can batch them too::
+
+  ~# cat query-batch-file > <debugfs>/dynamic_debug/control
+
+A another way is to use wildcard. The match rule support ``*`` (matches
+zero or more characters) and ``?`` (matches exactly one character).For
+example, you can match all usb drivers::
+
+  ~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
+
+At the syntactical level, a command comprises a sequence of match
+specifications, followed by a flags change specification::
+
+  command ::= match-spec* flags-spec
+
+The match-spec's are used to choose a subset of the known pr_debug()
+callsites to which to apply the flags-spec.  Think of them as a query
+with implicit ANDs between each pair.  Note that an empty list of
+match-specs will select all debug statement callsites.
+
+A match specification comprises a keyword, which controls the
+attribute of the callsite to be compared, and a value to compare
+against.  Possible keywords are:::
+
+  match-spec ::= 'func' string |
+		 'file' string |
+		 'module' string |
+		 'format' string |
+		 'line' line-range
+
+  line-range ::= lineno |
+		 '-'lineno |
+		 lineno'-' |
+		 lineno'-'lineno
+
+  lineno ::= unsigned-int
+
+.. note::
+
+  ``line-range`` cannot contain space, e.g.
+  "1-30" is valid range but "1 - 30" is not.
+
+
+The meanings of each keyword are:
+
+func
+    The given string is compared against the function name
+    of each callsite.  Example::
+
+	func svc_tcp_accept
+
+file
+    The given string is compared against either the full pathname, the
+    src-root relative pathname, or the basename of the source file of
+    each callsite.  Examples::
+
+	file svcsock.c
+	file kernel/freezer.c
+	file /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svcsock.c
+
+module
+    The given string is compared against the module name
+    of each callsite.  The module name is the string as
+    seen in ``lsmod``, i.e. without the directory or the ``.ko``
+    suffix and with ``-`` changed to ``_``.  Examples::
+
+	module sunrpc
+	module nfsd
+
+format
+    The given string is searched for in the dynamic debug format
+    string.  Note that the string does not need to match the
+    entire format, only some part.  Whitespace and other
+    special characters can be escaped using C octal character
+    escape ``\ooo`` notation, e.g. the space character is ``\040``.
+    Alternatively, the string can be enclosed in double quote
+    characters (``"``) or single quote characters (``'``).
+    Examples::
+
+	format svcrdma:         // many of the NFS/RDMA server pr_debugs
+	format readahead        // some pr_debugs in the readahead cache
+	format nfsd:\040SETATTR // one way to match a format with whitespace
+	format "nfsd: SETATTR"  // a neater way to match a format with whitespace
+	format 'nfsd: SETATTR'  // yet another way to match a format with whitespace
+
+line
+    The given line number or range of line numbers is compared
+    against the line number of each ``pr_debug()`` callsite.  A single
+    line number matches the callsite line number exactly.  A
+    range of line numbers matches any callsite between the first
+    and last line number inclusive.  An empty first number means
+    the first line in the file, an empty last line number means the
+    last line number in the file.  Examples::
+
+	line 1603           // exactly line 1603
+	line 1600-1605      // the six lines from line 1600 to line 1605
+	line -1605          // the 1605 lines from line 1 to line 1605
+	line 1600-          // all lines from line 1600 to the end of the file
+
+The flags specification comprises a change operation followed
+by one or more flag characters.  The change operation is one
+of the characters::
+
+  -    remove the given flags
+  +    add the given flags
+  =    set the flags to the given flags
+
+The flags are::
+
+  p    enables the pr_debug() callsite.
+  f    Include the function name in the printed message
+  l    Include line number in the printed message
+  m    Include module name in the printed message
+  t    Include thread ID in messages not generated from interrupt context
+  _    No flags are set. (Or'd with others on input)
+
+For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only ``p`` flag
+have meaning, other flags ignored.
+
+For display, the flags are preceded by ``=``
+(mnemonic: what the flags are currently equal to).
+
+Note the regexp ``^[-+=][flmpt_]+$`` matches a flags specification.
+To clear all flags at once, use ``=_`` or ``-flmpt``.
+
+
+Debug messages during Boot Process
+==================================
+
+To activate debug messages for core code and built-in modules during
+the boot process, even before userspace and debugfs exists, use
+``dyndbg="QUERY"``, ``module.dyndbg="QUERY"``, or ``ddebug_query="QUERY"``
+(``ddebug_query`` is obsoleted by ``dyndbg``, and deprecated).  QUERY follows
+the syntax described above, but must not exceed 1023 characters.  Your
+bootloader may impose lower limits.
+
+These ``dyndbg`` params are processed just after the ddebug tables are
+processed, as part of the arch_initcall.  Thus you can enable debug
+messages in all code run after this arch_initcall via this boot
+parameter.
+
+On an x86 system for example ACPI enablement is a subsys_initcall and::
+
+   dyndbg="file ec.c +p"
+
+will show early Embedded Controller transactions during ACPI setup if
+your machine (typically a laptop) has an Embedded Controller.
+PCI (or other devices) initialization also is a hot candidate for using
+this boot parameter for debugging purposes.
+
+If ``foo`` module is not built-in, ``foo.dyndbg`` will still be processed at
+boot time, without effect, but will be reprocessed when module is
+loaded later. ``dyndbg_query=`` and bare ``dyndbg=`` are only processed at
+boot.
+
+
+Debug Messages at Module Initialization Time
+============================================
+
+When ``modprobe foo`` is called, modprobe scans ``/proc/cmdline`` for
+``foo.params``, strips ``foo.``, and passes them to the kernel along with
+params given in modprobe args or ``/etc/modprob.d/*.conf`` files,
+in the following order:
+
+1. parameters given via ``/etc/modprobe.d/*.conf``::
+
+	options foo dyndbg=+pt
+	options foo dyndbg # defaults to +p
+
+2. ``foo.dyndbg`` as given in boot args, ``foo.`` is stripped and passed::
+
+	foo.dyndbg=" func bar +p; func buz +mp"
+
+3. args to modprobe::
+
+	modprobe foo dyndbg==pmf # override previous settings
+
+These ``dyndbg`` queries are applied in order, with last having final say.
+This allows boot args to override or modify those from ``/etc/modprobe.d``
+(sensible, since 1 is system wide, 2 is kernel or boot specific), and
+modprobe args to override both.
+
+In the ``foo.dyndbg="QUERY"`` form, the query must exclude ``module foo``.
+``foo`` is extracted from the param-name, and applied to each query in
+``QUERY``, and only 1 match-spec of each type is allowed.
+
+The ``dyndbg`` option is a "fake" module parameter, which means:
+
+- modules do not need to define it explicitly
+- every module gets it tacitly, whether they use pr_debug or not
+- it doesn't appear in ``/sys/module/$module/parameters/``
+  To see it, grep the control file, or inspect ``/proc/cmdline.``
+
+For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or
+enabled by ``-DDEBUG`` flag during compilation) can be disabled later via
+the sysfs interface if the debug messages are no longer needed::
+
+   echo "module module_name -p" > <debugfs>/dynamic_debug/control
+
+Examples
+========
+
+::
+
+  // enable the message at line 1603 of file svcsock.c
+  nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
+				<debugfs>/dynamic_debug/control
+
+  // enable all the messages in file svcsock.c
+  nullarbor:~ # echo -n 'file svcsock.c +p' >
+				<debugfs>/dynamic_debug/control
+
+  // enable all the messages in the NFS server module
+  nullarbor:~ # echo -n 'module nfsd +p' >
+				<debugfs>/dynamic_debug/control
+
+  // enable all 12 messages in the function svc_process()
+  nullarbor:~ # echo -n 'func svc_process +p' >
+				<debugfs>/dynamic_debug/control
+
+  // disable all 12 messages in the function svc_process()
+  nullarbor:~ # echo -n 'func svc_process -p' >
+				<debugfs>/dynamic_debug/control
+
+  // enable messages for NFS calls READ, READLINK, READDIR and READDIR+.
+  nullarbor:~ # echo -n 'format "nfsd: READ" +p' >
+				<debugfs>/dynamic_debug/control
+
+  // enable messages in files of which the paths include string "usb"
+  nullarbor:~ # echo -n '*usb* +p' > <debugfs>/dynamic_debug/control
+
+  // enable all messages
+  nullarbor:~ # echo -n '+p' > <debugfs>/dynamic_debug/control
+
+  // add module, function to all enabled messages
+  nullarbor:~ # echo -n '+mf' > <debugfs>/dynamic_debug/control
+
+  // boot-args example, with newlines and comments for readability
+  Kernel command line: ...
+    // see whats going on in dyndbg=value processing
+    dynamic_debug.verbose=1
+    // enable pr_debugs in 2 builtins, #cmt is stripped
+    dyndbg="module params +p #cmt ; module sys +p"
+    // enable pr_debugs in 2 functions in a module loaded later
+    pc87360.dyndbg="func pc87360_init_device +p; func pc87360_find +p"
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
new file mode 100644
index 000000000..2adec1e65
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -0,0 +1,18 @@
+========================
+Hardware vulnerabilities
+========================
+
+This section describes CPU vulnerabilities and provides an overview of the
+possible mitigations along with guidance for selecting mitigations if they
+are configurable at compile, boot or run time.
+
+.. toctree::
+   :maxdepth: 1
+
+   spectre
+   l1tf
+   mds
+   tsx_async_abort
+   multihit.rst
+   special-register-buffer-data-sampling.rst
+   processor_mmio_stale_data.rst
diff --git a/Documentation/admin-guide/hw-vuln/l1tf.rst b/Documentation/admin-guide/hw-vuln/l1tf.rst
new file mode 100644
index 000000000..31653a9f0
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
@@ -0,0 +1,615 @@
+L1TF - L1 Terminal Fault
+========================
+
+L1 Terminal Fault is a hardware vulnerability which allows unprivileged
+speculative access to data which is available in the Level 1 Data Cache
+when the page table entry controlling the virtual address, which is used
+for the access, has the Present bit cleared or other reserved bits set.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+   - Processors from AMD, Centaur and other non Intel vendors
+
+   - Older processor models, where the CPU family is < 6
+
+   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
+     Penwell, Pineview, Silvermont, Airmont, Merrifield)
+
+   - The Intel XEON PHI family
+
+   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
+     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
+     by the Meltdown vulnerability either. These CPUs should become
+     available by end of 2018.
+
+Whether a processor is affected or not can be read out from the L1TF
+vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the L1TF vulnerability:
+
+   =============  =================  ==============================
+   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
+   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
+   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
+   =============  =================  ==============================
+
+Problem
+-------
+
+If an instruction accesses a virtual address for which the relevant page
+table entry (PTE) has the Present bit cleared or other reserved bits set,
+then speculative execution ignores the invalid PTE and loads the referenced
+data if it is present in the Level 1 Data Cache, as if the page referenced
+by the address bits in the PTE was still present and accessible.
+
+While this is a purely speculative mechanism and the instruction will raise
+a page fault when it is retired eventually, the pure act of loading the
+data and making it available to other speculative instructions opens up the
+opportunity for side channel attacks to unprivileged malicious code,
+similar to the Meltdown attack.
+
+While Meltdown breaks the user space to kernel space protection, L1TF
+allows to attack any physical memory address in the system and the attack
+works across all protection domains. It allows an attack of SGX and also
+works from inside virtual machines because the speculation bypasses the
+extended page table (EPT) protection mechanism.
+
+
+Attack scenarios
+----------------
+
+1. Malicious user space
+^^^^^^^^^^^^^^^^^^^^^^^
+
+   Operating Systems store arbitrary information in the address bits of a
+   PTE which is marked non present. This allows a malicious user space
+   application to attack the physical memory to which these PTEs resolve.
+   In some cases user-space can maliciously influence the information
+   encoded in the address bits of the PTE, thus making attacks more
+   deterministic and more practical.
+
+   The Linux kernel contains a mitigation for this attack vector, PTE
+   inversion, which is permanently enabled and has no performance
+   impact. The kernel ensures that the address bits of PTEs, which are not
+   marked present, never point to cacheable physical memory space.
+
+   A system with an up to date kernel is protected against attacks from
+   malicious user space applications.
+
+2. Malicious guest in a virtual machine
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The fact that L1TF breaks all domain protections allows malicious guest
+   OSes, which can control the PTEs directly, and malicious guest user
+   space applications, which run on an unprotected guest kernel lacking the
+   PTE inversion mitigation for L1TF, to attack physical host memory.
+
+   A special aspect of L1TF in the context of virtualization is symmetric
+   multi threading (SMT). The Intel implementation of SMT is called
+   HyperThreading. The fact that Hyperthreads on the affected processors
+   share the L1 Data Cache (L1D) is important for this. As the flaw allows
+   only to attack data which is present in L1D, a malicious guest running
+   on one Hyperthread can attack the data which is brought into the L1D by
+   the context which runs on the sibling Hyperthread of the same physical
+   core. This context can be host OS, host user space or a different guest.
+
+   If the processor does not support Extended Page Tables, the attack is
+   only possible, when the hypervisor does not sanitize the content of the
+   effective (shadow) page tables.
+
+   While solutions exist to mitigate these attack vectors fully, these
+   mitigations are not enabled by default in the Linux kernel because they
+   can affect performance significantly. The kernel provides several
+   mechanisms which can be utilized to address the problem depending on the
+   deployment scenario. The mitigations, their protection scope and impact
+   are described in the next sections.
+
+   The default mitigations and the rationale for choosing them are explained
+   at the end of this document. See :ref:`default_mitigations`.
+
+.. _l1tf_sys_info:
+
+L1TF system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current L1TF
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/l1tf
+
+The possible values in this file are:
+
+  ===========================   ===============================
+  'Not affected'		The processor is not vulnerable
+  'Mitigation: PTE Inversion'	The host protection is active
+  ===========================   ===============================
+
+If KVM/VMX is enabled and the processor is vulnerable then the following
+information is appended to the 'Mitigation: PTE Inversion' part:
+
+  - SMT status:
+
+    =====================  ================
+    'VMX: SMT vulnerable'  SMT is enabled
+    'VMX: SMT disabled'    SMT is disabled
+    =====================  ================
+
+  - L1D Flush mode:
+
+    ================================  ====================================
+    'L1D vulnerable'		      L1D flushing is disabled
+
+    'L1D conditional cache flushes'   L1D flush is conditionally enabled
+
+    'L1D cache flushes'		      L1D flush is unconditionally enabled
+    ================================  ====================================
+
+The resulting grade of protection is discussed in the following sections.
+
+
+Host mitigation mechanism
+-------------------------
+
+The kernel is unconditionally protected against L1TF attacks from malicious
+user space running on the host.
+
+
+Guest mitigation mechanisms
+---------------------------
+
+.. _l1d_flush:
+
+1. L1D flush on VMENTER
+^^^^^^^^^^^^^^^^^^^^^^^
+
+   To make sure that a guest cannot attack data which is present in the L1D
+   the hypervisor flushes the L1D before entering the guest.
+
+   Flushing the L1D evicts not only the data which should not be accessed
+   by a potentially malicious guest, it also flushes the guest
+   data. Flushing the L1D has a performance impact as the processor has to
+   bring the flushed guest data back into the L1D. Depending on the
+   frequency of VMEXIT/VMENTER and the type of computations in the guest
+   performance degradation in the range of 1% to 50% has been observed. For
+   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
+   minimal. Virtio and mechanisms like posted interrupts are designed to
+   confine the VMEXITs to a bare minimum, but specific configurations and
+   application scenarios might still suffer from a high VMEXIT rate.
+
+   The kernel provides two L1D flush modes:
+    - conditional ('cond')
+    - unconditional ('always')
+
+   The conditional mode avoids L1D flushing after VMEXITs which execute
+   only audited code paths before the corresponding VMENTER. These code
+   paths have been verified that they cannot expose secrets or other
+   interesting data to an attacker, but they can leak information about the
+   address space layout of the hypervisor.
+
+   Unconditional mode flushes L1D on all VMENTER invocations and provides
+   maximum protection. It has a higher overhead than the conditional
+   mode. The overhead cannot be quantified correctly as it depends on the
+   workload scenario and the resulting number of VMEXITs.
+
+   The general recommendation is to enable L1D flush on VMENTER. The kernel
+   defaults to conditional mode on affected processors.
+
+   **Note**, that L1D flush does not prevent the SMT problem because the
+   sibling thread will also bring back its data into the L1D which makes it
+   attackable again.
+
+   L1D flush can be controlled by the administrator via the kernel command
+   line and sysfs control files. See :ref:`mitigation_control_command_line`
+   and :ref:`mitigation_control_kvm`.
+
+.. _guest_confinement:
+
+2. Guest VCPU confinement to dedicated physical cores
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   To address the SMT problem, it is possible to make a guest or a group of
+   guests affine to one or more physical cores. The proper mechanism for
+   that is to utilize exclusive cpusets to ensure that no other guest or
+   host tasks can run on these cores.
+
+   If only a single guest or related guests run on sibling SMT threads on
+   the same physical core then they can only attack their own memory and
+   restricted parts of the host memory.
+
+   Host memory is attackable, when one of the sibling SMT threads runs in
+   host OS (hypervisor) context and the other in guest context. The amount
+   of valuable information from the host OS context depends on the context
+   which the host OS executes, i.e. interrupts, soft interrupts and kernel
+   threads. The amount of valuable data from these contexts cannot be
+   declared as non-interesting for an attacker without deep inspection of
+   the code.
+
+   **Note**, that assigning guests to a fixed set of physical cores affects
+   the ability of the scheduler to do load balancing and might have
+   negative effects on CPU utilization depending on the hosting
+   scenario. Disabling SMT might be a viable alternative for particular
+   scenarios.
+
+   For further information about confining guests to a single or to a group
+   of cores consult the cpusets documentation:
+
+   https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
+
+.. _interrupt_isolation:
+
+3. Interrupt affinity
+^^^^^^^^^^^^^^^^^^^^^
+
+   Interrupts can be made affine to logical CPUs. This is not universally
+   true because there are types of interrupts which are truly per CPU
+   interrupts, e.g. the local timer interrupt. Aside of that multi queue
+   devices affine their interrupts to single CPUs or groups of CPUs per
+   queue without allowing the administrator to control the affinities.
+
+   Moving the interrupts, which can be affinity controlled, away from CPUs
+   which run untrusted guests, reduces the attack vector space.
+
+   Whether the interrupts with are affine to CPUs, which run untrusted
+   guests, provide interesting data for an attacker depends on the system
+   configuration and the scenarios which run on the system. While for some
+   of the interrupts it can be assumed that they won't expose interesting
+   information beyond exposing hints about the host OS memory layout, there
+   is no way to make general assumptions.
+
+   Interrupt affinity can be controlled by the administrator via the
+   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
+   available at:
+
+   https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
+
+.. _smt_control:
+
+4. SMT control
+^^^^^^^^^^^^^^
+
+   To prevent the SMT issues of L1TF it might be necessary to disable SMT
+   completely. Disabling SMT can have a significant performance impact, but
+   the impact depends on the hosting scenario and the type of workloads.
+   The impact of disabling SMT needs also to be weighted against the impact
+   of other mitigation solutions like confining guests to dedicated cores.
+
+   The kernel provides a sysfs interface to retrieve the status of SMT and
+   to control it. It also provides a kernel command line interface to
+   control SMT.
+
+   The kernel command line interface consists of the following options:
+
+     =========== ==========================================================
+     nosmt	 Affects the bring up of the secondary CPUs during boot. The
+		 kernel tries to bring all present CPUs online during the
+		 boot process. "nosmt" makes sure that from each physical
+		 core only one - the so called primary (hyper) thread is
+		 activated. Due to a design flaw of Intel processors related
+		 to Machine Check Exceptions the non primary siblings have
+		 to be brought up at least partially and are then shut down
+		 again.  "nosmt" can be undone via the sysfs interface.
+
+     nosmt=force Has the same effect as "nosmt" but it does not allow to
+		 undo the SMT disable via the sysfs interface.
+     =========== ==========================================================
+
+   The sysfs interface provides two files:
+
+   - /sys/devices/system/cpu/smt/control
+   - /sys/devices/system/cpu/smt/active
+
+   /sys/devices/system/cpu/smt/control:
+
+     This file allows to read out the SMT control state and provides the
+     ability to disable or (re)enable SMT. The possible states are:
+
+	==============  ===================================================
+	on		SMT is supported by the CPU and enabled. All
+			logical CPUs can be onlined and offlined without
+			restrictions.
+
+	off		SMT is supported by the CPU and disabled. Only
+			the so called primary SMT threads can be onlined
+			and offlined without restrictions. An attempt to
+			online a non-primary sibling is rejected
+
+	forceoff	Same as 'off' but the state cannot be controlled.
+			Attempts to write to the control file are rejected.
+
+	notsupported	The processor does not support SMT. It's therefore
+			not affected by the SMT implications of L1TF.
+			Attempts to write to the control file are rejected.
+	==============  ===================================================
+
+     The possible states which can be written into this file to control SMT
+     state are:
+
+     - on
+     - off
+     - forceoff
+
+   /sys/devices/system/cpu/smt/active:
+
+     This file reports whether SMT is enabled and active, i.e. if on any
+     physical core two or more sibling threads are online.
+
+   SMT control is also possible at boot time via the l1tf kernel command
+   line parameter in combination with L1D flush control. See
+   :ref:`mitigation_control_command_line`.
+
+5. Disabling EPT
+^^^^^^^^^^^^^^^^
+
+  Disabling EPT for virtual machines provides full mitigation for L1TF even
+  with SMT enabled, because the effective page tables for guests are
+  managed and sanitized by the hypervisor. Though disabling EPT has a
+  significant performance impact especially when the Meltdown mitigation
+  KPTI is enabled.
+
+  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
+
+There is ongoing research and development for new mitigation mechanisms to
+address the performance impact of disabling SMT or EPT.
+
+.. _mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the L1TF mitigations at boot
+time with the option "l1tf=". The valid arguments for this option are:
+
+  ============  =============================================================
+  full		Provides all available mitigations for the L1TF
+		vulnerability. Disables SMT and enables all mitigations in
+		the hypervisors, i.e. unconditional L1D flushing
+
+		SMT control and L1D flush control via the sysfs interface
+		is still possible after boot.  Hypervisors will issue a
+		warning when the first VM is started in a potentially
+		insecure configuration, i.e. SMT enabled or L1D flush
+		disabled.
+
+  full,force	Same as 'full', but disables SMT and L1D flush runtime
+		control. Implies the 'nosmt=force' command line option.
+		(i.e. sysfs control of SMT is disabled.)
+
+  flush		Leaves SMT enabled and enables the default hypervisor
+		mitigation, i.e. conditional L1D flushing
+
+		SMT control and L1D flush control via the sysfs interface
+		is still possible after boot.  Hypervisors will issue a
+		warning when the first VM is started in a potentially
+		insecure configuration, i.e. SMT enabled or L1D flush
+		disabled.
+
+  flush,nosmt	Disables SMT and enables the default hypervisor mitigation,
+		i.e. conditional L1D flushing.
+
+		SMT control and L1D flush control via the sysfs interface
+		is still possible after boot.  Hypervisors will issue a
+		warning when the first VM is started in a potentially
+		insecure configuration, i.e. SMT enabled or L1D flush
+		disabled.
+
+  flush,nowarn	Same as 'flush', but hypervisors will not warn when a VM is
+		started in a potentially insecure configuration.
+
+  off		Disables hypervisor mitigations and doesn't emit any
+		warnings.
+		It also drops the swap size and available RAM limit restrictions
+		on both hypervisor and bare metal.
+
+  ============  =============================================================
+
+The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
+
+
+.. _mitigation_control_kvm:
+
+Mitigation control for KVM - module parameter
+-------------------------------------------------------------
+
+The KVM hypervisor mitigation mechanism, flushing the L1D cache when
+entering a guest, can be controlled with a module parameter.
+
+The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
+following arguments:
+
+  ============  ==============================================================
+  always	L1D cache flush on every VMENTER.
+
+  cond		Flush L1D on VMENTER only when the code between VMEXIT and
+		VMENTER can leak host memory which is considered
+		interesting for an attacker. This still can leak host memory
+		which allows e.g. to determine the hosts address space layout.
+
+  never		Disables the mitigation
+  ============  ==============================================================
+
+The parameter can be provided on the kernel command line, as a module
+parameter when loading the modules and at runtime modified via the sysfs
+file:
+
+/sys/module/kvm_intel/parameters/vmentry_l1d_flush
+
+The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
+line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
+module parameter is ignored and writes to the sysfs file are rejected.
+
+.. _mitigation_selection:
+
+Mitigation selection guide
+--------------------------
+
+1. No virtualization in use
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The system is protected by the kernel unconditionally and no further
+   action is required.
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   If the guest comes from a trusted source and the guest OS kernel is
+   guaranteed to have the L1TF mitigations in place the system is fully
+   protected against L1TF and no further action is required.
+
+   To avoid the overhead of the default L1D flushing on VMENTER the
+   administrator can disable the flushing via the kernel command line and
+   sysfs control files. See :ref:`mitigation_control_command_line` and
+   :ref:`mitigation_control_kvm`.
+
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+3.1. SMT not supported or disabled
+""""""""""""""""""""""""""""""""""
+
+  If SMT is not supported by the processor or disabled in the BIOS or by
+  the kernel, it's only required to enforce L1D flushing on VMENTER.
+
+  Conditional L1D flushing is the default behaviour and can be tuned. See
+  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
+
+3.2. EPT not supported or disabled
+""""""""""""""""""""""""""""""""""
+
+  If EPT is not supported by the processor or disabled in the hypervisor,
+  the system is fully protected. SMT can stay enabled and L1D flushing on
+  VMENTER is not required.
+
+  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
+
+3.3. SMT and EPT supported and active
+"""""""""""""""""""""""""""""""""""""
+
+  If SMT and EPT are supported and active then various degrees of
+  mitigations can be employed:
+
+  - L1D flushing on VMENTER:
+
+    L1D flushing on VMENTER is the minimal protection requirement, but it
+    is only potent in combination with other mitigation methods.
+
+    Conditional L1D flushing is the default behaviour and can be tuned. See
+    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
+
+  - Guest confinement:
+
+    Confinement of guests to a single or a group of physical cores which
+    are not running any other processes, can reduce the attack surface
+    significantly, but interrupts, soft interrupts and kernel threads can
+    still expose valuable data to a potential attacker. See
+    :ref:`guest_confinement`.
+
+  - Interrupt isolation:
+
+    Isolating the guest CPUs from interrupts can reduce the attack surface
+    further, but still allows a malicious guest to explore a limited amount
+    of host physical memory. This can at least be used to gain knowledge
+    about the host address space layout. The interrupts which have a fixed
+    affinity to the CPUs which run the untrusted guests can depending on
+    the scenario still trigger soft interrupts and schedule kernel threads
+    which might expose valuable information. See
+    :ref:`interrupt_isolation`.
+
+The above three mitigation methods combined can provide protection to a
+certain degree, but the risk of the remaining attack surface has to be
+carefully analyzed. For full protection the following methods are
+available:
+
+  - Disabling SMT:
+
+    Disabling SMT and enforcing the L1D flushing provides the maximum
+    amount of protection. This mitigation is not depending on any of the
+    above mitigation methods.
+
+    SMT control and L1D flushing can be tuned by the command line
+    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
+    time with the matching sysfs control files. See :ref:`smt_control`,
+    :ref:`mitigation_control_command_line` and
+    :ref:`mitigation_control_kvm`.
+
+  - Disabling EPT:
+
+    Disabling EPT provides the maximum amount of protection as well. It is
+    not depending on any of the above mitigation methods. SMT can stay
+    enabled and L1D flushing is not required, but the performance impact is
+    significant.
+
+    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
+    parameter.
+
+3.4. Nested virtual machines
+""""""""""""""""""""""""""""
+
+When nested virtualization is in use, three operating systems are involved:
+the bare metal hypervisor, the nested hypervisor and the nested virtual
+machine.  VMENTER operations from the nested hypervisor into the nested
+guest will always be processed by the bare metal hypervisor. If KVM is the
+bare metal hypervisor it will:
+
+ - Flush the L1D cache on every switch from the nested hypervisor to the
+   nested virtual machine, so that the nested hypervisor's secrets are not
+   exposed to the nested virtual machine;
+
+ - Flush the L1D cache on every switch from the nested virtual machine to
+   the nested hypervisor; this is a complex operation, and flushing the L1D
+   cache avoids that the bare metal hypervisor's secrets are exposed to the
+   nested virtual machine;
+
+ - Instruct the nested hypervisor to not perform any L1D cache flush. This
+   is an optimization to avoid double L1D flushing.
+
+
+.. _default_mitigations:
+
+Default mitigations
+-------------------
+
+  The kernel default mitigations for vulnerable processors are:
+
+  - PTE inversion to protect against malicious user space. This is done
+    unconditionally and cannot be controlled. The swap storage is limited
+    to ~16TB.
+
+  - L1D conditional flushing on VMENTER when EPT is enabled for
+    a guest.
+
+  The kernel does not by default enforce the disabling of SMT, which leaves
+  SMT systems vulnerable when running untrusted guests with EPT enabled.
+
+  The rationale for this choice is:
+
+  - Force disabling SMT can break existing setups, especially with
+    unattended updates.
+
+  - If regular users run untrusted guests on their machine, then L1TF is
+    just an add on to other malware which might be embedded in an untrusted
+    guest, e.g. spam-bots or attacks on the local network.
+
+    There is no technical way to prevent a user from running untrusted code
+    on their machines blindly.
+
+  - It's technically extremely unlikely and from today's knowledge even
+    impossible that L1TF can be exploited via the most popular attack
+    mechanisms like JavaScript because these mechanisms have no way to
+    control PTEs. If this would be possible and not other mitigation would
+    be possible, then the default might be different.
+
+  - The administrators of cloud and hosting setups have to carefully
+    analyze the risk for their scenarios and make the appropriate
+    mitigation choices, which might even vary across their deployed
+    machines and also result in other changes of their overall setup.
+    There is no way for the kernel to provide a sensible default for this
+    kind of scenarios.
diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
new file mode 100644
index 000000000..2d19c9f4c
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -0,0 +1,311 @@
+MDS - Microarchitectural Data Sampling
+======================================
+
+Microarchitectural Data Sampling is a hardware vulnerability which allows
+unprivileged speculative access to data which is available in various CPU
+internal buffers.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+   - Processors from AMD, Centaur and other non Intel vendors
+
+   - Older processor models, where the CPU family is < 6
+
+   - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+   - Intel processors which have the ARCH_CAP_MDS_NO bit set in the
+     IA32_ARCH_CAPABILITIES MSR.
+
+Whether a processor is affected or not can be read out from the MDS
+vulnerability file in sysfs. See :ref:`mds_sys_info`.
+
+Not all processors are affected by all variants of MDS, but the mitigation
+is identical for all of them so the kernel treats them as a single
+vulnerability.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the MDS vulnerability:
+
+   ==============  =====  ===================================================
+   CVE-2018-12126  MSBDS  Microarchitectural Store Buffer Data Sampling
+   CVE-2018-12130  MFBDS  Microarchitectural Fill Buffer Data Sampling
+   CVE-2018-12127  MLPDS  Microarchitectural Load Port Data Sampling
+   CVE-2019-11091  MDSUM  Microarchitectural Data Sampling Uncacheable Memory
+   ==============  =====  ===================================================
+
+Problem
+-------
+
+When performing store, load, L1 refill operations, processors write data
+into temporary microarchitectural structures (buffers). The data in the
+buffer can be forwarded to load operations as an optimization.
+
+Under certain conditions, usually a fault/assist caused by a load
+operation, data unrelated to the load memory address can be speculatively
+forwarded from the buffers. Because the load operation causes a fault or
+assist and its result will be discarded, the forwarded data will not cause
+incorrect program execution or state changes. But a malicious operation
+may be able to forward this speculative data to a disclosure gadget which
+allows in turn to infer the value via a cache side channel attack.
+
+Because the buffers are potentially shared between Hyper-Threads cross
+Hyper-Thread attacks are possible.
+
+Deeper technical information is available in the MDS specific x86
+architecture section: :ref:`Documentation/x86/mds.rst <mds>`.
+
+
+Attack scenarios
+----------------
+
+Attacks against the MDS vulnerabilities can be mounted from malicious non
+priviledged user space applications running on hosts or guest. Malicious
+guest OSes can obviously mount attacks as well.
+
+Contrary to other speculation based vulnerabilities the MDS vulnerability
+does not allow the attacker to control the memory target address. As a
+consequence the attacks are purely sampling based, but as demonstrated with
+the TLBleed attack samples can be postprocessed successfully.
+
+Web-Browsers
+^^^^^^^^^^^^
+
+  It's unclear whether attacks through Web-Browsers are possible at
+  all. The exploitation through Java-Script is considered very unlikely,
+  but other widely used web technologies like Webassembly could possibly be
+  abused.
+
+
+.. _mds_sys_info:
+
+MDS system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current MDS
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+The possible values in this file are:
+
+  .. list-table::
+
+     * - 'Not affected'
+       - The processor is not vulnerable
+     * - 'Vulnerable'
+       - The processor is vulnerable, but no mitigation enabled
+     * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+       - The processor is vulnerable but microcode is not updated.
+
+         The mitigation is enabled on a best effort basis. See :ref:`vmwerv`
+     * - 'Mitigation: Clear CPU buffers'
+       - The processor is vulnerable and the CPU buffer clearing mitigation is
+         enabled.
+
+If the processor is vulnerable then the following information is appended
+to the above information:
+
+    ========================  ============================================
+    'SMT vulnerable'          SMT is enabled
+    'SMT mitigated'           SMT is enabled and mitigated
+    'SMT disabled'            SMT is disabled
+    'SMT Host state unknown'  Kernel runs in a VM, Host SMT state unknown
+    ========================  ============================================
+
+.. _vmwerv:
+
+Best effort mitigation mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+  If the processor is vulnerable, but the availability of the microcode based
+  mitigation mechanism is not advertised via CPUID the kernel selects a best
+  effort mitigation mode.  This mode invokes the mitigation instructions
+  without a guarantee that they clear the CPU buffers.
+
+  This is done to address virtualization scenarios where the host has the
+  microcode update applied, but the hypervisor is not yet updated to expose
+  the CPUID to the guest. If the host has updated microcode the protection
+  takes effect otherwise a few cpu cycles are wasted pointlessly.
+
+  The state in the mds sysfs file reflects this situation accordingly.
+
+
+Mitigation mechanism
+-------------------------
+
+The kernel detects the affected CPUs and the presence of the microcode
+which is required.
+
+If a CPU is affected and the microcode is available, then the kernel
+enables the mitigation by default. The mitigation can be controlled at boot
+time via a kernel command line option. See
+:ref:`mds_mitigation_control_command_line`.
+
+.. _cpu_buffer_clear:
+
+CPU buffer clearing
+^^^^^^^^^^^^^^^^^^^
+
+  The mitigation for MDS clears the affected CPU buffers on return to user
+  space and when entering a guest.
+
+  If SMT is enabled it also clears the buffers on idle entry when the CPU
+  is only affected by MSBDS and not any other MDS variant, because the
+  other variants cannot be protected against cross Hyper-Thread attacks.
+
+  For CPUs which are only affected by MSBDS the user space, guest and idle
+  transition mitigations are sufficient and SMT is not affected.
+
+.. _virt_mechanism:
+
+Virtualization mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+  The protection for host to guest transition depends on the L1TF
+  vulnerability of the CPU:
+
+  - CPU is affected by L1TF:
+
+    If the L1D flush mitigation is enabled and up to date microcode is
+    available, the L1D flush mitigation is automatically protecting the
+    guest transition.
+
+    If the L1D flush mitigation is disabled then the MDS mitigation is
+    invoked explicit when the host MDS mitigation is enabled.
+
+    For details on L1TF and virtualization see:
+    :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <mitigation_control_kvm>`.
+
+  - CPU is not affected by L1TF:
+
+    CPU buffers are flushed before entering the guest when the host MDS
+    mitigation is enabled.
+
+  The resulting MDS protection matrix for the host to guest transition:
+
+  ============ ===== ============= ============ =================
+   L1TF         MDS   VMX-L1FLUSH   Host MDS     MDS-State
+
+   Don't care   No    Don't care    N/A          Not affected
+
+   Yes          Yes   Disabled      Off          Vulnerable
+
+   Yes          Yes   Disabled      Full         Mitigated
+
+   Yes          Yes   Enabled       Don't care   Mitigated
+
+   No           Yes   N/A           Off          Vulnerable
+
+   No           Yes   N/A           Full         Mitigated
+  ============ ===== ============= ============ =================
+
+  This only covers the host to guest transition, i.e. prevents leakage from
+  host to guest, but does not protect the guest internally. Guests need to
+  have their own protections.
+
+.. _xeon_phi:
+
+XEON PHI specific considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+  The XEON PHI processor family is affected by MSBDS which can be exploited
+  cross Hyper-Threads when entering idle states. Some XEON PHI variants allow
+  to use MWAIT in user space (Ring 3) which opens an potential attack vector
+  for malicious user space. The exposure can be disabled on the kernel
+  command line with the 'ring3mwait=disable' command line option.
+
+  XEON PHI is not affected by the other MDS variants and MSBDS is mitigated
+  before the CPU enters a idle state. As XEON PHI is not affected by L1TF
+  either disabling SMT is not required for full protection.
+
+.. _mds_smt_control:
+
+SMT control
+^^^^^^^^^^^
+
+  All MDS variants except MSBDS can be attacked cross Hyper-Threads. That
+  means on CPUs which are affected by MFBDS or MLPDS it is necessary to
+  disable SMT for full protection. These are most of the affected CPUs; the
+  exception is XEON PHI, see :ref:`xeon_phi`.
+
+  Disabling SMT can have a significant performance impact, but the impact
+  depends on the type of workloads.
+
+  See the relevant chapter in the L1TF mitigation documentation for details:
+  :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+
+.. _mds_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the MDS mitigations at boot
+time with the option "mds=". The valid arguments for this option are:
+
+  ============  =============================================================
+  full		If the CPU is vulnerable, enable all available mitigations
+		for the MDS vulnerability, CPU buffer clearing on exit to
+		userspace and when entering a VM. Idle transitions are
+		protected as well if SMT is enabled.
+
+		It does not automatically disable SMT.
+
+  full,nosmt	The same as mds=full, with SMT disabled on vulnerable
+		CPUs.  This is the complete mitigation.
+
+  off		Disables MDS mitigations completely.
+
+  ============  =============================================================
+
+Not specifying this option is equivalent to "mds=full". For processors
+that are affected by both TAA (TSX Asynchronous Abort) and MDS,
+specifying just "mds=off" without an accompanying "tsx_async_abort=off"
+will have no effect as the same mitigation is used for both
+vulnerabilities.
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace
+^^^^^^^^^^^^^^^^^^^^
+
+   If all userspace applications are from a trusted source and do not
+   execute untrusted code which is supplied externally, then the mitigation
+   can be disabled.
+
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The same considerations as above versus trusted user space apply.
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The protection depends on the state of the L1TF mitigations.
+   See :ref:`virt_mechanism`.
+
+   If the MDS mitigation is enabled and SMT is disabled, guest to host and
+   guest to guest attacks are prevented.
+
+.. _mds_default_mitigations:
+
+Default mitigations
+-------------------
+
+  The kernel default mitigations for vulnerable processors are:
+
+  - Enable CPU buffer clearing
+
+  The kernel does not by default enforce the disabling of SMT, which leaves
+  SMT systems vulnerable when running untrusted code. The same rationale as
+  for L1TF applies.
+  See :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <default_mitigations>`.
diff --git a/Documentation/admin-guide/hw-vuln/multihit.rst b/Documentation/admin-guide/hw-vuln/multihit.rst
new file mode 100644
index 000000000..ba9988d8b
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/multihit.rst
@@ -0,0 +1,163 @@
+iTLB multihit
+=============
+
+iTLB multihit is an erratum where some processors may incur a machine check
+error, possibly resulting in an unrecoverable CPU lockup, when an
+instruction fetch hits multiple entries in the instruction TLB. This can
+occur when the page size is changed along with either the physical address
+or cache type. A malicious guest running on a virtualized system can
+exploit this erratum to perform a denial of service attack.
+
+
+Affected processors
+-------------------
+
+Variations of this erratum are present on most Intel Core and Xeon processor
+models. The erratum is not present on:
+
+   - non-Intel processors
+
+   - Some Atoms (Airmont, Bonnell, Goldmont, GoldmontPlus, Saltwell, Silvermont)
+
+   - Intel processors that have the PSCHANGE_MC_NO bit set in the
+     IA32_ARCH_CAPABILITIES MSR.
+
+
+Related CVEs
+------------
+
+The following CVE entry is related to this issue:
+
+   ==============  =================================================
+   CVE-2018-12207  Machine Check Error Avoidance on Page Size Change
+   ==============  =================================================
+
+
+Problem
+-------
+
+Privileged software, including OS and virtual machine managers (VMM), are in
+charge of memory management. A key component in memory management is the control
+of the page tables. Modern processors use virtual memory, a technique that creates
+the illusion of a very large memory for processors. This virtual space is split
+into pages of a given size. Page tables translate virtual addresses to physical
+addresses.
+
+To reduce latency when performing a virtual to physical address translation,
+processors include a structure, called TLB, that caches recent translations.
+There are separate TLBs for instruction (iTLB) and data (dTLB).
+
+Under this errata, instructions are fetched from a linear address translated
+using a 4 KB translation cached in the iTLB. Privileged software modifies the
+paging structure so that the same linear address using large page size (2 MB, 4
+MB, 1 GB) with a different physical address or memory type.  After the page
+structure modification but before the software invalidates any iTLB entries for
+the linear address, a code fetch that happens on the same linear address may
+cause a machine-check error which can result in a system hang or shutdown.
+
+
+Attack scenarios
+----------------
+
+Attacks against the iTLB multihit erratum can be mounted from malicious
+guests in a virtualized system.
+
+
+iTLB multihit system information
+--------------------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current iTLB
+multihit status of the system:whether the system is vulnerable and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/itlb_multihit
+
+The possible values in this file are:
+
+.. list-table::
+
+     * - Not affected
+       - The processor is not vulnerable.
+     * - KVM: Mitigation: Split huge pages
+       - Software changes mitigate this issue.
+     * - KVM: Vulnerable
+       - The processor is vulnerable, but no mitigation enabled
+
+
+Enumeration of the erratum
+--------------------------------
+
+A new bit has been allocated in the IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) msr
+and will be set on CPU's which are mitigated against this issue.
+
+   =======================================   ===========   ===============================
+   IA32_ARCH_CAPABILITIES MSR                Not present   Possibly vulnerable,check model
+   IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO]    '0'           Likely vulnerable,check model
+   IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO]    '1'           Not vulnerable
+   =======================================   ===========   ===============================
+
+
+Mitigation mechanism
+-------------------------
+
+This erratum can be mitigated by restricting the use of large page sizes to
+non-executable pages.  This forces all iTLB entries to be 4K, and removes
+the possibility of multiple hits.
+
+In order to mitigate the vulnerability, KVM initially marks all huge pages
+as non-executable. If the guest attempts to execute in one of those pages,
+the page is broken down into 4K pages, which are then marked executable.
+
+If EPT is disabled or not available on the host, KVM is in control of TLB
+flushes and the problematic situation cannot happen.  However, the shadow
+EPT paging mechanism used by nested virtualization is vulnerable, because
+the nested guest can trigger multiple iTLB hits by modifying its own
+(non-nested) page tables.  For simplicity, KVM will make large pages
+non-executable in all shadow paging modes.
+
+Mitigation control on the kernel command line and KVM - module parameter
+------------------------------------------------------------------------
+
+The KVM hypervisor mitigation mechanism for marking huge pages as
+non-executable can be controlled with a module parameter "nx_huge_pages=".
+The kernel command line allows to control the iTLB multihit mitigations at
+boot time with the option "kvm.nx_huge_pages=".
+
+The valid arguments for these options are:
+
+  ==========  ================================================================
+  force       Mitigation is enabled. In this case, the mitigation implements
+              non-executable huge pages in Linux kernel KVM module. All huge
+              pages in the EPT are marked as non-executable.
+              If a guest attempts to execute in one of those pages, the page is
+              broken down into 4K pages, which are then marked executable.
+
+  off	      Mitigation is disabled.
+
+  auto        Enable mitigation only if the platform is affected and the kernel
+              was not booted with the "mitigations=off" command line parameter.
+	      This is the default option.
+  ==========  ================================================================
+
+
+Mitigation selection guide
+--------------------------
+
+1. No virtualization in use
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The system is protected by the kernel unconditionally and no further
+   action is required.
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   If the guest comes from a trusted source, you may assume that the guest will
+   not attempt to maliciously exploit these errata and no further action is
+   required.
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+   If the guest comes from an untrusted source, the guest host kernel will need
+   to apply iTLB multihit mitigation via the kernel command line or kvm
+   module parameter.
diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
new file mode 100644
index 000000000..9393c50b5
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
@@ -0,0 +1,246 @@
+=========================================
+Processor MMIO Stale Data Vulnerabilities
+=========================================
+
+Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O
+(MMIO) vulnerabilities that can expose data. The sequences of operations for
+exposing data range from simple to very complex. Because most of the
+vulnerabilities require the attacker to have access to MMIO, many environments
+are not affected. System environments using virtualization where MMIO access is
+provided to untrusted guests may need mitigation. These vulnerabilities are
+not transient execution attacks. However, these vulnerabilities may propagate
+stale data into core fill buffers where the data can subsequently be inferred
+by an unmitigated transient execution attack. Mitigation for these
+vulnerabilities includes a combination of microcode update and software
+changes, depending on the platform and usage model. Some of these mitigations
+are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or
+those used to mitigate Special Register Buffer Data Sampling (SRBDS).
+
+Data Propagators
+================
+Propagators are operations that result in stale data being copied or moved from
+one microarchitectural buffer or register to another. Processor MMIO Stale Data
+Vulnerabilities are operations that may result in stale data being directly
+read into an architectural, software-visible state or sampled from a buffer or
+register.
+
+Fill Buffer Stale Data Propagator (FBSDP)
+-----------------------------------------
+Stale data may propagate from fill buffers (FB) into the non-coherent portion
+of the uncore on some non-coherent writes. Fill buffer propagation by itself
+does not make stale data architecturally visible. Stale data must be propagated
+to a location where it is subject to reading or sampling.
+
+Sideband Stale Data Propagator (SSDP)
+-------------------------------------
+The sideband stale data propagator (SSDP) is limited to the client (including
+Intel Xeon server E3) uncore implementation. The sideband response buffer is
+shared by all client cores. For non-coherent reads that go to sideband
+destinations, the uncore logic returns 64 bytes of data to the core, including
+both requested data and unrequested stale data, from a transaction buffer and
+the sideband response buffer. As a result, stale data from the sideband
+response and transaction buffers may now reside in a core fill buffer.
+
+Primary Stale Data Propagator (PSDP)
+------------------------------------
+The primary stale data propagator (PSDP) is limited to the client (including
+Intel Xeon server E3) uncore implementation. Similar to the sideband response
+buffer, the primary response buffer is shared by all client cores. For some
+processors, MMIO primary reads will return 64 bytes of data to the core fill
+buffer including both requested data and unrequested stale data. This is
+similar to the sideband stale data propagator.
+
+Vulnerabilities
+===============
+Device Register Partial Write (DRPW) (CVE-2022-21166)
+-----------------------------------------------------
+Some endpoint MMIO registers incorrectly handle writes that are smaller than
+the register size. Instead of aborting the write or only copying the correct
+subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than
+specified by the write transaction may be written to the register. On
+processors affected by FBSDP, this may expose stale data from the fill buffers
+of the core that created the write transaction.
+
+Shared Buffers Data Sampling (SBDS) (CVE-2022-21125)
+----------------------------------------------------
+After propagators may have moved data around the uncore and copied stale data
+into client core fill buffers, processors affected by MFBDS can leak data from
+the fill buffer. It is limited to the client (including Intel Xeon server E3)
+uncore implementation.
+
+Shared Buffers Data Read (SBDR) (CVE-2022-21123)
+------------------------------------------------
+It is similar to Shared Buffer Data Sampling (SBDS) except that the data is
+directly read into the architectural software-visible state. It is limited to
+the client (including Intel Xeon server E3) uncore implementation.
+
+Affected Processors
+===================
+Not all the CPUs are affected by all the variants. For instance, most
+processors for the server market (excluding Intel Xeon E3 processors) are
+impacted by only Device Register Partial Write (DRPW).
+
+Below is the list of affected Intel processors [#f1]_:
+
+   ===================  ============  =========
+   Common name          Family_Model  Steppings
+   ===================  ============  =========
+   HASWELL_X            06_3FH        2,4
+   SKYLAKE_L            06_4EH        3
+   BROADWELL_X          06_4FH        All
+   SKYLAKE_X            06_55H        3,4,6,7,11
+   BROADWELL_D          06_56H        3,4,5
+   SKYLAKE              06_5EH        3
+   ICELAKE_X            06_6AH        4,5,6
+   ICELAKE_D            06_6CH        1
+   ICELAKE_L            06_7EH        5
+   ATOM_TREMONT_D       06_86H        All
+   LAKEFIELD            06_8AH        1
+   KABYLAKE_L           06_8EH        9 to 12
+   ATOM_TREMONT         06_96H        1
+   ATOM_TREMONT_L       06_9CH        0
+   KABYLAKE             06_9EH        9 to 13
+   COMETLAKE            06_A5H        2,3,5
+   COMETLAKE_L          06_A6H        0,1
+   ROCKETLAKE           06_A7H        1
+   ===================  ============  =========
+
+If a CPU is in the affected processor list, but not affected by a variant, it
+is indicated by new bits in MSR IA32_ARCH_CAPABILITIES. As described in a later
+section, mitigation largely remains the same for all the variants, i.e. to
+clear the CPU fill buffers via VERW instruction.
+
+New bits in MSRs
+================
+Newer processors and microcode update on existing affected processors added new
+bits to IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate
+specific variants of Processor MMIO Stale Data vulnerabilities and mitigation
+capability.
+
+MSR IA32_ARCH_CAPABILITIES
+--------------------------
+Bit 13 - SBDR_SSDP_NO - When set, processor is not affected by either the
+	 Shared Buffers Data Read (SBDR) vulnerability or the sideband stale
+	 data propagator (SSDP).
+Bit 14 - FBSDP_NO - When set, processor is not affected by the Fill Buffer
+	 Stale Data Propagator (FBSDP).
+Bit 15 - PSDP_NO - When set, processor is not affected by Primary Stale Data
+	 Propagator (PSDP).
+Bit 17 - FB_CLEAR - When set, VERW instruction will overwrite CPU fill buffer
+	 values as part of MD_CLEAR operations. Processors that do not
+	 enumerate MDS_NO (meaning they are affected by MDS) but that do
+	 enumerate support for both L1D_FLUSH and MD_CLEAR implicitly enumerate
+	 FB_CLEAR as part of their MD_CLEAR support.
+Bit 18 - FB_CLEAR_CTRL - Processor supports read and write to MSR
+	 IA32_MCU_OPT_CTRL[FB_CLEAR_DIS]. On such processors, the FB_CLEAR_DIS
+	 bit can be set to cause the VERW instruction to not perform the
+	 FB_CLEAR action. Not all processors that support FB_CLEAR will support
+	 FB_CLEAR_CTRL.
+
+MSR IA32_MCU_OPT_CTRL
+---------------------
+Bit 3 - FB_CLEAR_DIS - When set, VERW instruction does not perform the FB_CLEAR
+action. This may be useful to reduce the performance impact of FB_CLEAR in
+cases where system software deems it warranted (for example, when performance
+is more critical, or the untrusted software has no MMIO access). Note that
+FB_CLEAR_DIS has no impact on enumeration (for example, it does not change
+FB_CLEAR or MD_CLEAR enumeration) and it may not be supported on all processors
+that enumerate FB_CLEAR.
+
+Mitigation
+==========
+Like MDS, all variants of Processor MMIO Stale Data vulnerabilities  have the
+same mitigation strategy to force the CPU to clear the affected buffers before
+an attacker can extract the secrets.
+
+This is achieved by using the otherwise unused and obsolete VERW instruction in
+combination with a microcode update. The microcode clears the affected CPU
+buffers when the VERW instruction is executed.
+
+Kernel reuses the MDS function to invoke the buffer clearing:
+
+	mds_clear_cpu_buffers()
+
+On MDS affected CPUs, the kernel already invokes CPU buffer clear on
+kernel/userspace, hypervisor/guest and C-state (idle) transitions. No
+additional mitigation is needed on such CPUs.
+
+For CPUs not affected by MDS or TAA, mitigation is needed only for the attacker
+with MMIO capability. Therefore, VERW is not required for kernel/userspace. For
+virtualization case, VERW is only needed at VMENTER for a guest with MMIO
+capability.
+
+Mitigation points
+-----------------
+Return to user space
+^^^^^^^^^^^^^^^^^^^^
+Same mitigation as MDS when affected by MDS/TAA, otherwise no mitigation
+needed.
+
+C-State transition
+^^^^^^^^^^^^^^^^^^
+Control register writes by CPU during C-state transition can propagate data
+from fill buffer to uncore buffers. Execute VERW before C-state transition to
+clear CPU fill buffers.
+
+Guest entry point
+^^^^^^^^^^^^^^^^^
+Same mitigation as MDS when processor is also affected by MDS/TAA, otherwise
+execute VERW at VMENTER only for MMIO capable guests. On CPUs not affected by
+MDS/TAA, guest without MMIO access cannot extract secrets using Processor MMIO
+Stale Data vulnerabilities, so there is no need to execute VERW for such guests.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The kernel command line allows to control the Processor MMIO Stale Data
+mitigations at boot time with the option "mmio_stale_data=". The valid
+arguments for this option are:
+
+  ==========  =================================================================
+  full        If the CPU is vulnerable, enable mitigation; CPU buffer clearing
+              on exit to userspace and when entering a VM. Idle transitions are
+              protected as well. It does not automatically disable SMT.
+  full,nosmt  Same as full, with SMT disabled on vulnerable CPUs. This is the
+              complete mitigation.
+  off         Disables mitigation completely.
+  ==========  =================================================================
+
+If the CPU is affected and mmio_stale_data=off is not supplied on the kernel
+command line, then the kernel selects the appropriate mitigation.
+
+Mitigation status information
+-----------------------------
+The Linux kernel provides a sysfs interface to enumerate the current
+vulnerability status of the system: whether the system is vulnerable, and
+which mitigations are active. The relevant sysfs file is:
+
+	/sys/devices/system/cpu/vulnerabilities/mmio_stale_data
+
+The possible values in this file are:
+
+  .. list-table::
+
+     * - 'Not affected'
+       - The processor is not vulnerable
+     * - 'Vulnerable'
+       - The processor is vulnerable, but no mitigation enabled
+     * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+       - The processor is vulnerable, but microcode is not updated. The
+         mitigation is enabled on a best effort basis.
+     * - 'Mitigation: Clear CPU buffers'
+       - The processor is vulnerable and the CPU buffer clearing mitigation is
+         enabled.
+
+If the processor is vulnerable then the following information is appended to
+the above information:
+
+  ========================  ===========================================
+  'SMT vulnerable'          SMT is enabled
+  'SMT disabled'            SMT is disabled
+  'SMT Host state unknown'  Kernel runs in a VM, Host SMT state unknown
+  ========================  ===========================================
+
+References
+----------
+.. [#f1] Affected Processors
+   https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
diff --git a/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst b/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst
new file mode 100644
index 000000000..47b1b3afa
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst
@@ -0,0 +1,149 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+SRBDS - Special Register Buffer Data Sampling
+=============================================
+
+SRBDS is a hardware vulnerability that allows MDS :doc:`mds` techniques to
+infer values returned from special register accesses.  Special register
+accesses are accesses to off core registers.  According to Intel's evaluation,
+the special register reads that have a security expectation of privacy are
+RDRAND, RDSEED and SGX EGETKEY.
+
+When RDRAND, RDSEED and EGETKEY instructions are used, the data is moved
+to the core through the special register mechanism that is susceptible
+to MDS attacks.
+
+Affected processors
+--------------------
+Core models (desktop, mobile, Xeon-E3) that implement RDRAND and/or RDSEED may
+be affected.
+
+A processor is affected by SRBDS if its Family_Model and stepping is
+in the following list, with the exception of the listed processors
+exporting MDS_NO while Intel TSX is available yet not enabled. The
+latter class of processors are only affected when Intel TSX is enabled
+by software using TSX_CTRL_MSR otherwise they are not affected.
+
+  =============  ============  ========
+  common name    Family_Model  Stepping
+  =============  ============  ========
+  IvyBridge      06_3AH        All
+
+  Haswell        06_3CH        All
+  Haswell_L      06_45H        All
+  Haswell_G      06_46H        All
+
+  Broadwell_G    06_47H        All
+  Broadwell      06_3DH        All
+
+  Skylake_L      06_4EH        All
+  Skylake        06_5EH        All
+
+  Kabylake_L     06_8EH        <= 0xC
+  Kabylake       06_9EH        <= 0xD
+  =============  ============  ========
+
+Related CVEs
+------------
+
+The following CVE entry is related to this SRBDS issue:
+
+    ==============  =====  =====================================
+    CVE-2020-0543   SRBDS  Special Register Buffer Data Sampling
+    ==============  =====  =====================================
+
+Attack scenarios
+----------------
+An unprivileged user can extract values returned from RDRAND and RDSEED
+executed on another core or sibling thread using MDS techniques.
+
+
+Mitigation mechanism
+-------------------
+Intel will release microcode updates that modify the RDRAND, RDSEED, and
+EGETKEY instructions to overwrite secret special register data in the shared
+staging buffer before the secret data can be accessed by another logical
+processor.
+
+During execution of the RDRAND, RDSEED, or EGETKEY instructions, off-core
+accesses from other logical processors will be delayed until the special
+register read is complete and the secret data in the shared staging buffer is
+overwritten.
+
+This has three effects on performance:
+
+#. RDRAND, RDSEED, or EGETKEY instructions have higher latency.
+
+#. Executing RDRAND at the same time on multiple logical processors will be
+   serialized, resulting in an overall reduction in the maximum RDRAND
+   bandwidth.
+
+#. Executing RDRAND, RDSEED or EGETKEY will delay memory accesses from other
+   logical processors that miss their core caches, with an impact similar to
+   legacy locked cache-line-split accesses.
+
+The microcode updates provide an opt-out mechanism (RNGDS_MITG_DIS) to disable
+the mitigation for RDRAND and RDSEED instructions executed outside of Intel
+Software Guard Extensions (Intel SGX) enclaves. On logical processors that
+disable the mitigation using this opt-out mechanism, RDRAND and RDSEED do not
+take longer to execute and do not impact performance of sibling logical
+processors memory accesses. The opt-out mechanism does not affect Intel SGX
+enclaves (including execution of RDRAND or RDSEED inside an enclave, as well
+as EGETKEY execution).
+
+IA32_MCU_OPT_CTRL MSR Definition
+--------------------------------
+Along with the mitigation for this issue, Intel added a new thread-scope
+IA32_MCU_OPT_CTRL MSR, (address 0x123). The presence of this MSR and
+RNGDS_MITG_DIS (bit 0) is enumerated by CPUID.(EAX=07H,ECX=0).EDX[SRBDS_CTRL =
+9]==1. This MSR is introduced through the microcode update.
+
+Setting IA32_MCU_OPT_CTRL[0] (RNGDS_MITG_DIS) to 1 for a logical processor
+disables the mitigation for RDRAND and RDSEED executed outside of an Intel SGX
+enclave on that logical processor. Opting out of the mitigation for a
+particular logical processor does not affect the RDRAND and RDSEED mitigations
+for other logical processors.
+
+Note that inside of an Intel SGX enclave, the mitigation is applied regardless
+of the value of RNGDS_MITG_DS.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The kernel command line allows control over the SRBDS mitigation at boot time
+with the option "srbds=".  The option for this is:
+
+  ============= =============================================================
+  off           This option disables SRBDS mitigation for RDRAND and RDSEED on
+                affected platforms.
+  ============= =============================================================
+
+SRBDS System Information
+-----------------------
+The Linux kernel provides vulnerability status information through sysfs.  For
+SRBDS this can be accessed by the following sysfs file:
+/sys/devices/system/cpu/vulnerabilities/srbds
+
+The possible values contained in this file are:
+
+ ============================== =============================================
+ Not affected                   Processor not vulnerable
+ Vulnerable                     Processor vulnerable and mitigation disabled
+ Vulnerable: No microcode       Processor vulnerable and microcode is missing
+                                mitigation
+ Mitigation: Microcode          Processor is vulnerable and mitigation is in
+                                effect.
+ Mitigation: TSX disabled       Processor is only vulnerable when TSX is
+                                enabled while this system was booted with TSX
+                                disabled.
+ Unknown: Dependent on
+ hypervisor status              Running on virtual guest processor that is
+                                affected but with no way to know if host
+                                processor is mitigated or vulnerable.
+ ============================== =============================================
+
+SRBDS Default mitigation
+------------------------
+This new microcode serializes processor access during execution of RDRAND,
+RDSEED ensures that the shared buffer is overwritten before it is released for
+reuse.  Use the "srbds=off" kernel command line to disable the mitigation for
+RDRAND and RDSEED.
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
new file mode 100644
index 000000000..6bd97cd50
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -0,0 +1,785 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Spectre Side Channels
+=====================
+
+Spectre is a class of side channel attacks that exploit branch prediction
+and speculative execution on modern CPUs to read memory, possibly
+bypassing access controls. Speculative execution side channel exploits
+do not modify memory but attempt to infer privileged data in the memory.
+
+This document covers Spectre variant 1 and Spectre variant 2.
+
+Affected processors
+-------------------
+
+Speculative execution side channel methods affect a wide range of modern
+high performance processors, since most modern high speed processors
+use branch prediction and speculative execution.
+
+The following CPUs are vulnerable:
+
+    - Intel Core, Atom, Pentium, and Xeon processors
+
+    - AMD Phenom, EPYC, and Zen processors
+
+    - IBM POWER and zSeries processors
+
+    - Higher end ARM processors
+
+    - Apple CPUs
+
+    - Higher end MIPS CPUs
+
+    - Likely most other high performance CPUs. Contact your CPU vendor for details.
+
+Whether a processor is affected or not can be read out from the Spectre
+vulnerability files in sysfs. See :ref:`spectre_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entries describe Spectre variants:
+
+   =============   =======================  ==========================
+   CVE-2017-5753   Bounds check bypass      Spectre variant 1
+   CVE-2017-5715   Branch target injection  Spectre variant 2
+   CVE-2019-1125   Spectre v1 swapgs        Spectre variant 1 (swapgs)
+   =============   =======================  ==========================
+
+Problem
+-------
+
+CPUs use speculative operations to improve performance. That may leave
+traces of memory accesses or computations in the processor's caches,
+buffers, and branch predictors. Malicious software may be able to
+influence the speculative execution paths, and then use the side effects
+of the speculative execution in the CPUs' caches and buffers to infer
+privileged data touched during the speculative execution.
+
+Spectre variant 1 attacks take advantage of speculative execution of
+conditional branches, while Spectre variant 2 attacks use speculative
+execution of indirect branches to leak privileged memory.
+See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[6] <spec_ref6>`
+:ref:`[7] <spec_ref7>` :ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
+
+Spectre variant 1 (Bounds Check Bypass)
+---------------------------------------
+
+The bounds check bypass attack :ref:`[2] <spec_ref2>` takes advantage
+of speculative execution that bypasses conditional branch instructions
+used for memory access bounds check (e.g. checking if the index of an
+array results in memory access within a valid range). This results in
+memory accesses to invalid memory (with out-of-bound index) that are
+done speculatively before validation checks resolve. Such speculative
+memory accesses can leave side effects, creating side channels which
+leak information to the attacker.
+
+There are some extensions of Spectre variant 1 attacks for reading data
+over the network, see :ref:`[12] <spec_ref12>`. However such attacks
+are difficult, low bandwidth, fragile, and are considered low risk.
+
+Note that, despite "Bounds Check Bypass" name, Spectre variant 1 is not
+only about user-controlled array bounds checks.  It can affect any
+conditional checks.  The kernel entry code interrupt, exception, and NMI
+handlers all have conditional swapgs checks.  Those may be problematic
+in the context of Spectre v1, as kernel code can speculatively run with
+a user GS.
+
+Spectre variant 2 (Branch Target Injection)
+-------------------------------------------
+
+The branch target injection attack takes advantage of speculative
+execution of indirect branches :ref:`[3] <spec_ref3>`.  The indirect
+branch predictors inside the processor used to guess the target of
+indirect branches can be influenced by an attacker, causing gadget code
+to be speculatively executed, thus exposing sensitive data touched by
+the victim. The side effects left in the CPU's caches during speculative
+execution can be measured to infer data values.
+
+.. _poison_btb:
+
+In Spectre variant 2 attacks, the attacker can steer speculative indirect
+branches in the victim to gadget code by poisoning the branch target
+buffer of a CPU used for predicting indirect branch addresses. Such
+poisoning could be done by indirect branching into existing code,
+with the address offset of the indirect branch under the attacker's
+control. Since the branch prediction on impacted hardware does not
+fully disambiguate branch address and uses the offset for prediction,
+this could cause privileged code's indirect branch to jump to a gadget
+code with the same offset.
+
+The most useful gadgets take an attacker-controlled input parameter (such
+as a register value) so that the memory read can be controlled. Gadgets
+without input parameters might be possible, but the attacker would have
+very little control over what memory can be read, reducing the risk of
+the attack revealing useful data.
+
+One other variant 2 attack vector is for the attacker to poison the
+return stack buffer (RSB) :ref:`[13] <spec_ref13>` to cause speculative
+subroutine return instruction execution to go to a gadget.  An attacker's
+imbalanced subroutine call instructions might "poison" entries in the
+return stack buffer which are later consumed by a victim's subroutine
+return instructions.  This attack can be mitigated by flushing the return
+stack buffer on context switch, or virtual machine (VM) exit.
+
+On systems with simultaneous multi-threading (SMT), attacks are possible
+from the sibling thread, as level 1 cache and branch target buffer
+(BTB) may be shared between hardware threads in a CPU core.  A malicious
+program running on the sibling thread may influence its peer's BTB to
+steer its indirect branch speculations to gadget code, and measure the
+speculative execution's side effects left in level 1 cache to infer the
+victim's data.
+
+Yet another variant 2 attack vector is for the attacker to poison the
+Branch History Buffer (BHB) to speculatively steer an indirect branch
+to a specific Branch Target Buffer (BTB) entry, even if the entry isn't
+associated with the source address of the indirect branch. Specifically,
+the BHB might be shared across privilege levels even in the presence of
+Enhanced IBRS.
+
+Currently the only known real-world BHB attack vector is via
+unprivileged eBPF. Therefore, it's highly recommended to not enable
+unprivileged eBPF, especially when eIBRS is used (without retpolines).
+For a full mitigation against BHB attacks, it's recommended to use
+retpolines (or eIBRS combined with retpolines).
+
+Attack scenarios
+----------------
+
+The following list of attack scenarios have been anticipated, but may
+not cover all possible attack vectors.
+
+1. A user process attacking the kernel
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Spectre variant 1
+~~~~~~~~~~~~~~~~~
+
+   The attacker passes a parameter to the kernel via a register or
+   via a known address in memory during a syscall. Such parameter may
+   be used later by the kernel as an index to an array or to derive
+   a pointer for a Spectre variant 1 attack.  The index or pointer
+   is invalid, but bound checks are bypassed in the code branch taken
+   for speculative execution. This could cause privileged memory to be
+   accessed and leaked.
+
+   For kernel code that has been identified where data pointers could
+   potentially be influenced for Spectre attacks, new "nospec" accessor
+   macros are used to prevent speculative loading of data.
+
+Spectre variant 1 (swapgs)
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   An attacker can train the branch predictor to speculatively skip the
+   swapgs path for an interrupt or exception.  If they initialize
+   the GS register to a user-space value, if the swapgs is speculatively
+   skipped, subsequent GS-related percpu accesses in the speculation
+   window will be done with the attacker-controlled GS value.  This
+   could cause privileged memory to be accessed and leaked.
+
+   For example:
+
+   ::
+
+     if (coming from user space)
+         swapgs
+     mov %gs:<percpu_offset>, %reg
+     mov (%reg), %reg1
+
+   When coming from user space, the CPU can speculatively skip the
+   swapgs, and then do a speculative percpu load using the user GS
+   value.  So the user can speculatively force a read of any kernel
+   value.  If a gadget exists which uses the percpu value as an address
+   in another load/store, then the contents of the kernel value may
+   become visible via an L1 side channel attack.
+
+   A similar attack exists when coming from kernel space.  The CPU can
+   speculatively do the swapgs, causing the user GS to get used for the
+   rest of the speculative window.
+
+Spectre variant 2
+~~~~~~~~~~~~~~~~~
+
+   A spectre variant 2 attacker can :ref:`poison <poison_btb>` the branch
+   target buffer (BTB) before issuing syscall to launch an attack.
+   After entering the kernel, the kernel could use the poisoned branch
+   target buffer on indirect jump and jump to gadget code in speculative
+   execution.
+
+   If an attacker tries to control the memory addresses leaked during
+   speculative execution, he would also need to pass a parameter to the
+   gadget, either through a register or a known address in memory. After
+   the gadget has executed, he can measure the side effect.
+
+   The kernel can protect itself against consuming poisoned branch
+   target buffer entries by using return trampolines (also known as
+   "retpoline") :ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` for all
+   indirect branches. Return trampolines trap speculative execution paths
+   to prevent jumping to gadget code during speculative execution.
+   x86 CPUs with Enhanced Indirect Branch Restricted Speculation
+   (Enhanced IBRS) available in hardware should use the feature to
+   mitigate Spectre variant 2 instead of retpoline. Enhanced IBRS is
+   more efficient than retpoline.
+
+   There may be gadget code in firmware which could be exploited with
+   Spectre variant 2 attack by a rogue user process. To mitigate such
+   attacks on x86, Indirect Branch Restricted Speculation (IBRS) feature
+   is turned on before the kernel invokes any firmware code.
+
+2. A user process attacking another user process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   A malicious user process can try to attack another user process,
+   either via a context switch on the same hardware thread, or from the
+   sibling hyperthread sharing a physical processor core on simultaneous
+   multi-threading (SMT) system.
+
+   Spectre variant 1 attacks generally require passing parameters
+   between the processes, which needs a data passing relationship, such
+   as remote procedure calls (RPC).  Those parameters are used in gadget
+   code to derive invalid data pointers accessing privileged memory in
+   the attacked process.
+
+   Spectre variant 2 attacks can be launched from a rogue process by
+   :ref:`poisoning <poison_btb>` the branch target buffer.  This can
+   influence the indirect branch targets for a victim process that either
+   runs later on the same hardware thread, or running concurrently on
+   a sibling hardware thread sharing the same physical core.
+
+   A user process can protect itself against Spectre variant 2 attacks
+   by using the prctl() syscall to disable indirect branch speculation
+   for itself.  An administrator can also cordon off an unsafe process
+   from polluting the branch target buffer by disabling the process's
+   indirect branch speculation. This comes with a performance cost
+   from not using indirect branch speculation and clearing the branch
+   target buffer.  When SMT is enabled on x86, for a process that has
+   indirect branch speculation disabled, Single Threaded Indirect Branch
+   Predictors (STIBP) :ref:`[4] <spec_ref4>` are turned on to prevent the
+   sibling thread from controlling branch target buffer.  In addition,
+   the Indirect Branch Prediction Barrier (IBPB) is issued to clear the
+   branch target buffer when context switching to and from such process.
+
+   On x86, the return stack buffer is stuffed on context switch.
+   This prevents the branch target buffer from being used for branch
+   prediction when the return stack buffer underflows while switching to
+   a deeper call stack. Any poisoned entries in the return stack buffer
+   left by the previous process will also be cleared.
+
+   User programs should use address space randomization to make attacks
+   more difficult (Set /proc/sys/kernel/randomize_va_space = 1 or 2).
+
+3. A virtualized guest attacking the host
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   The attack mechanism is similar to how user processes attack the
+   kernel.  The kernel is entered via hyper-calls or other virtualization
+   exit paths.
+
+   For Spectre variant 1 attacks, rogue guests can pass parameters
+   (e.g. in registers) via hyper-calls to derive invalid pointers to
+   speculate into privileged memory after entering the kernel.  For places
+   where such kernel code has been identified, nospec accessor macros
+   are used to stop speculative memory access.
+
+   For Spectre variant 2 attacks, rogue guests can :ref:`poison
+   <poison_btb>` the branch target buffer or return stack buffer, causing
+   the kernel to jump to gadget code in the speculative execution paths.
+
+   To mitigate variant 2, the host kernel can use return trampolines
+   for indirect branches to bypass the poisoned branch target buffer,
+   and flushing the return stack buffer on VM exit.  This prevents rogue
+   guests from affecting indirect branching in the host kernel.
+
+   To protect host processes from rogue guests, host processes can have
+   indirect branch speculation disabled via prctl().  The branch target
+   buffer is cleared before context switching to such processes.
+
+4. A virtualized guest attacking other guest
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   A rogue guest may attack another guest to get data accessible by the
+   other guest.
+
+   Spectre variant 1 attacks are possible if parameters can be passed
+   between guests.  This may be done via mechanisms such as shared memory
+   or message passing.  Such parameters could be used to derive data
+   pointers to privileged data in guest.  The privileged data could be
+   accessed by gadget code in the victim's speculation paths.
+
+   Spectre variant 2 attacks can be launched from a rogue guest by
+   :ref:`poisoning <poison_btb>` the branch target buffer or the return
+   stack buffer. Such poisoned entries could be used to influence
+   speculation execution paths in the victim guest.
+
+   Linux kernel mitigates attacks to other guests running in the same
+   CPU hardware thread by flushing the return stack buffer on VM exit,
+   and clearing the branch target buffer before switching to a new guest.
+
+   If SMT is used, Spectre variant 2 attacks from an untrusted guest
+   in the sibling hyperthread can be mitigated by the administrator,
+   by turning off the unsafe guest's indirect branch speculation via
+   prctl().  A guest can also protect itself by turning on microcode
+   based mitigations (such as IBPB or STIBP on x86) within the guest.
+
+.. _spectre_sys_info:
+
+Spectre system information
+--------------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current
+mitigation status of the system for Spectre: whether the system is
+vulnerable, and which mitigations are active.
+
+The sysfs file showing Spectre variant 1 mitigation status is:
+
+   /sys/devices/system/cpu/vulnerabilities/spectre_v1
+
+The possible values in this file are:
+
+  .. list-table::
+
+     * - 'Not affected'
+       - The processor is not vulnerable.
+     * - 'Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers'
+       - The swapgs protections are disabled; otherwise it has
+         protection in the kernel on a case by case base with explicit
+         pointer sanitation and usercopy LFENCE barriers.
+     * - 'Mitigation: usercopy/swapgs barriers and __user pointer sanitization'
+       - Protection in the kernel on a case by case base with explicit
+         pointer sanitation, usercopy LFENCE barriers, and swapgs LFENCE
+         barriers.
+
+However, the protections are put in place on a case by case basis,
+and there is no guarantee that all possible attack vectors for Spectre
+variant 1 are covered.
+
+The spectre_v2 kernel file reports if the kernel has been compiled with
+retpoline mitigation or if the CPU has hardware mitigation, and if the
+CPU has support for additional process-specific mitigation.
+
+This file also reports CPU features enabled by microcode to mitigate
+attack between user processes:
+
+1. Indirect Branch Prediction Barrier (IBPB) to add additional
+   isolation between processes of different users.
+2. Single Thread Indirect Branch Predictors (STIBP) to add additional
+   isolation between CPU threads running on the same core.
+
+These CPU features may impact performance when used and can be enabled
+per process on a case-by-case base.
+
+The sysfs file showing Spectre variant 2 mitigation status is:
+
+   /sys/devices/system/cpu/vulnerabilities/spectre_v2
+
+The possible values in this file are:
+
+  - Kernel status:
+
+  ========================================  =================================
+  'Not affected'                            The processor is not vulnerable
+  'Mitigation: None'                        Vulnerable, no mitigation
+  'Mitigation: Retpolines'                  Use Retpoline thunks
+  'Mitigation: LFENCE'                      Use LFENCE instructions
+  'Mitigation: Enhanced IBRS'               Hardware-focused mitigation
+  'Mitigation: Enhanced IBRS + Retpolines'  Hardware-focused + Retpolines
+  'Mitigation: Enhanced IBRS + LFENCE'      Hardware-focused + LFENCE
+  ========================================  =================================
+
+  - Firmware status: Show if Indirect Branch Restricted Speculation (IBRS) is
+    used to protect against Spectre variant 2 attacks when calling firmware (x86 only).
+
+  ========== =============================================================
+  'IBRS_FW'  Protection against user program attacks when calling firmware
+  ========== =============================================================
+
+  - Indirect branch prediction barrier (IBPB) status for protection between
+    processes of different users. This feature can be controlled through
+    prctl() per process, or through kernel command line options. This is
+    an x86 only feature. For more details see below.
+
+  ===================   ========================================================
+  'IBPB: disabled'      IBPB unused
+  'IBPB: always-on'     Use IBPB on all tasks
+  'IBPB: conditional'   Use IBPB on SECCOMP or indirect branch restricted tasks
+  ===================   ========================================================
+
+  - Single threaded indirect branch prediction (STIBP) status for protection
+    between different hyper threads. This feature can be controlled through
+    prctl per process, or through kernel command line options. This is x86
+    only feature. For more details see below.
+
+  ====================  ========================================================
+  'STIBP: disabled'     STIBP unused
+  'STIBP: forced'       Use STIBP on all tasks
+  'STIBP: conditional'  Use STIBP on SECCOMP or indirect branch restricted tasks
+  ====================  ========================================================
+
+  - Return stack buffer (RSB) protection status:
+
+  =============   ===========================================
+  'RSB filling'   Protection of RSB on context switch enabled
+  =============   ===========================================
+
+Full mitigation might require a microcode update from the CPU
+vendor. When the necessary microcode is not available, the kernel will
+report vulnerability.
+
+Turning on mitigation for Spectre variant 1 and Spectre variant 2
+-----------------------------------------------------------------
+
+1. Kernel mitigation
+^^^^^^^^^^^^^^^^^^^^
+
+Spectre variant 1
+~~~~~~~~~~~~~~~~~
+
+   For the Spectre variant 1, vulnerable kernel code (as determined
+   by code audit or scanning tools) is annotated on a case by case
+   basis to use nospec accessor macros for bounds clipping :ref:`[2]
+   <spec_ref2>` to avoid any usable disclosure gadgets. However, it may
+   not cover all attack vectors for Spectre variant 1.
+
+   Copy-from-user code has an LFENCE barrier to prevent the access_ok()
+   check from being mis-speculated.  The barrier is done by the
+   barrier_nospec() macro.
+
+   For the swapgs variant of Spectre variant 1, LFENCE barriers are
+   added to interrupt, exception and NMI entry where needed.  These
+   barriers are done by the FENCE_SWAPGS_KERNEL_ENTRY and
+   FENCE_SWAPGS_USER_ENTRY macros.
+
+Spectre variant 2
+~~~~~~~~~~~~~~~~~
+
+   For Spectre variant 2 mitigation, the compiler turns indirect calls or
+   jumps in the kernel into equivalent return trampolines (retpolines)
+   :ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` to go to the target
+   addresses.  Speculative execution paths under retpolines are trapped
+   in an infinite loop to prevent any speculative execution jumping to
+   a gadget.
+
+   To turn on retpoline mitigation on a vulnerable CPU, the kernel
+   needs to be compiled with a gcc compiler that supports the
+   -mindirect-branch=thunk-extern -mindirect-branch-register options.
+   If the kernel is compiled with a Clang compiler, the compiler needs
+   to support -mretpoline-external-thunk option.  The kernel config
+   CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with
+   the latest updated microcode.
+
+   On Intel Skylake-era systems the mitigation covers most, but not all,
+   cases. See :ref:`[3] <spec_ref3>` for more details.
+
+   On CPUs with hardware mitigation for Spectre variant 2 (e.g. Enhanced
+   IBRS on x86), retpoline is automatically disabled at run time.
+
+   The retpoline mitigation is turned on by default on vulnerable
+   CPUs. It can be forced on or off by the administrator
+   via the kernel command line and sysfs control files. See
+   :ref:`spectre_mitigation_control_command_line`.
+
+   On x86, indirect branch restricted speculation is turned on by default
+   before invoking any firmware code to prevent Spectre variant 2 exploits
+   using the firmware.
+
+   Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y
+   and CONFIG_SLAB_FREELIST_RANDOM=y in the kernel configuration) makes
+   attacks on the kernel generally more difficult.
+
+2. User program mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   User programs can mitigate Spectre variant 1 using LFENCE or "bounds
+   clipping". For more details see :ref:`[2] <spec_ref2>`.
+
+   For Spectre variant 2 mitigation, individual user programs
+   can be compiled with return trampolines for indirect branches.
+   This protects them from consuming poisoned entries in the branch
+   target buffer left by malicious software.  Alternatively, the
+   programs can disable their indirect branch speculation via prctl()
+   (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
+   On x86, this will turn on STIBP to guard against attacks from the
+   sibling thread when the user program is running, and use IBPB to
+   flush the branch target buffer when switching to/from the program.
+
+   Restricting indirect branch speculation on a user program will
+   also prevent the program from launching a variant 2 attack
+   on x86.  All sand-boxed SECCOMP programs have indirect branch
+   speculation restricted by default.  Administrators can change
+   that behavior via the kernel command line and sysfs control files.
+   See :ref:`spectre_mitigation_control_command_line`.
+
+   Programs that disable their indirect branch speculation will have
+   more overhead and run slower.
+
+   User programs should use address space randomization
+   (/proc/sys/kernel/randomize_va_space = 1 or 2) to make attacks more
+   difficult.
+
+3. VM mitigation
+^^^^^^^^^^^^^^^^
+
+   Within the kernel, Spectre variant 1 attacks from rogue guests are
+   mitigated on a case by case basis in VM exit paths. Vulnerable code
+   uses nospec accessor macros for "bounds clipping", to avoid any
+   usable disclosure gadgets.  However, this may not cover all variant
+   1 attack vectors.
+
+   For Spectre variant 2 attacks from rogue guests to the kernel, the
+   Linux kernel uses retpoline or Enhanced IBRS to prevent consumption of
+   poisoned entries in branch target buffer left by rogue guests.  It also
+   flushes the return stack buffer on every VM exit to prevent a return
+   stack buffer underflow so poisoned branch target buffer could be used,
+   or attacker guests leaving poisoned entries in the return stack buffer.
+
+   To mitigate guest-to-guest attacks in the same CPU hardware thread,
+   the branch target buffer is sanitized by flushing before switching
+   to a new guest on a CPU.
+
+   The above mitigations are turned on by default on vulnerable CPUs.
+
+   To mitigate guest-to-guest attacks from sibling thread when SMT is
+   in use, an untrusted guest running in the sibling thread can have
+   its indirect branch speculation disabled by administrator via prctl().
+
+   The kernel also allows guests to use any microcode based mitigation
+   they choose to use (such as IBPB or STIBP on x86) to protect themselves.
+
+.. _spectre_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+Spectre variant 2 mitigation can be disabled or force enabled at the
+kernel command line.
+
+	nospectre_v1
+
+		[X86,PPC] Disable mitigations for Spectre Variant 1
+		(bounds check bypass). With this option data leaks are
+		possible in the system.
+
+	nospectre_v2
+
+		[X86] Disable all mitigations for the Spectre variant 2
+		(indirect branch prediction) vulnerability. System may
+		allow data leaks with this option, which is equivalent
+		to spectre_v2=off.
+
+
+        spectre_v2=
+
+		[X86] Control mitigation of Spectre variant 2
+		(indirect branch speculation) vulnerability.
+		The default operation protects the kernel from
+		user space attacks.
+
+		on
+			unconditionally enable, implies
+			spectre_v2_user=on
+		off
+			unconditionally disable, implies
+		        spectre_v2_user=off
+		auto
+			kernel detects whether your CPU model is
+		        vulnerable
+
+		Selecting 'on' will, and 'auto' may, choose a
+		mitigation method at run time according to the
+		CPU, the available microcode, the setting of the
+		CONFIG_RETPOLINE configuration option, and the
+		compiler with which the kernel was built.
+
+		Selecting 'on' will also enable the mitigation
+		against user space to user space task attacks.
+
+		Selecting 'off' will disable both the kernel and
+		the user space protections.
+
+		Specific mitigations can also be selected manually:
+
+                retpoline               auto pick between generic,lfence
+                retpoline,generic       Retpolines
+                retpoline,lfence        LFENCE; indirect branch
+                retpoline,amd           alias for retpoline,lfence
+                eibrs                   enhanced IBRS
+                eibrs,retpoline         enhanced IBRS + Retpolines
+                eibrs,lfence            enhanced IBRS + LFENCE
+
+		Not specifying this option is equivalent to
+		spectre_v2=auto.
+
+For user space mitigation:
+
+        spectre_v2_user=
+
+		[X86] Control mitigation of Spectre variant 2
+		(indirect branch speculation) vulnerability between
+		user space tasks
+
+		on
+			Unconditionally enable mitigations. Is
+			enforced by spectre_v2=on
+
+		off
+			Unconditionally disable mitigations. Is
+			enforced by spectre_v2=off
+
+		prctl
+			Indirect branch speculation is enabled,
+			but mitigation can be enabled via prctl
+			per thread. The mitigation control state
+			is inherited on fork.
+
+		prctl,ibpb
+			Like "prctl" above, but only STIBP is
+			controlled per thread. IBPB is issued
+			always when switching between different user
+			space processes.
+
+		seccomp
+			Same as "prctl" above, but all seccomp
+			threads will enable the mitigation unless
+			they explicitly opt out.
+
+		seccomp,ibpb
+			Like "seccomp" above, but only STIBP is
+			controlled per thread. IBPB is issued
+			always when switching between different
+			user space processes.
+
+		auto
+			Kernel selects the mitigation depending on
+			the available CPU features and vulnerability.
+
+		Default mitigation:
+		If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
+
+		Not specifying this option is equivalent to
+		spectre_v2_user=auto.
+
+		In general the kernel by default selects
+		reasonable mitigations for the current CPU. To
+		disable Spectre variant 2 mitigations, boot with
+		spectre_v2=off. Spectre variant 1 mitigations
+		cannot be disabled.
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace
+^^^^^^^^^^^^^^^^^^^^
+
+   If all userspace applications are from trusted sources and do not
+   execute externally supplied untrusted code, then the mitigations can
+   be disabled.
+
+2. Protect sensitive programs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   For security-sensitive programs that have secrets (e.g. crypto
+   keys), protection against Spectre variant 2 can be put in place by
+   disabling indirect branch speculation when the program is running
+   (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
+
+3. Sandbox untrusted programs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+   Untrusted programs that could be a source of attacks can be cordoned
+   off by disabling their indirect branch speculation when they are run
+   (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
+   This prevents untrusted programs from polluting the branch target
+   buffer.  All programs running in SECCOMP sandboxes have indirect
+   branch speculation restricted by default. This behavior can be
+   changed via the kernel command line and sysfs control files. See
+   :ref:`spectre_mitigation_control_command_line`.
+
+3. High security mode
+^^^^^^^^^^^^^^^^^^^^^
+
+   All Spectre variant 2 mitigations can be forced on
+   at boot time for all programs (See the "on" option in
+   :ref:`spectre_mitigation_control_command_line`).  This will add
+   overhead as indirect branch speculations for all programs will be
+   restricted.
+
+   On x86, branch target buffer will be flushed with IBPB when switching
+   to a new program. STIBP is left on all the time to protect programs
+   against variant 2 attacks originating from programs running on
+   sibling threads.
+
+   Alternatively, STIBP can be used only when running programs
+   whose indirect branch speculation is explicitly disabled,
+   while IBPB is still used all the time when switching to a new
+   program to clear the branch target buffer (See "ibpb" option in
+   :ref:`spectre_mitigation_control_command_line`).  This "ibpb" option
+   has less performance cost than the "on" option, which leaves STIBP
+   on all the time.
+
+References on Spectre
+---------------------
+
+Intel white papers:
+
+.. _spec_ref1:
+
+[1] `Intel analysis of speculative execution side channels <https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf>`_.
+
+.. _spec_ref2:
+
+[2] `Bounds check bypass <https://software.intel.com/security-software-guidance/software-guidance/bounds-check-bypass>`_.
+
+.. _spec_ref3:
+
+[3] `Deep dive: Retpoline: A branch target injection mitigation <https://software.intel.com/security-software-guidance/insights/deep-dive-retpoline-branch-target-injection-mitigation>`_.
+
+.. _spec_ref4:
+
+[4] `Deep Dive: Single Thread Indirect Branch Predictors <https://software.intel.com/security-software-guidance/insights/deep-dive-single-thread-indirect-branch-predictors>`_.
+
+AMD white papers:
+
+.. _spec_ref5:
+
+[5] `AMD64 technology indirect branch control extension <https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf>`_.
+
+.. _spec_ref6:
+
+[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf>`_.
+
+ARM white papers:
+
+.. _spec_ref7:
+
+[7] `Cache speculation side-channels <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/download-the-whitepaper>`_.
+
+.. _spec_ref8:
+
+[8] `Cache speculation issues update <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/latest-updates/cache-speculation-issues-update>`_.
+
+Google white paper:
+
+.. _spec_ref9:
+
+[9] `Retpoline: a software construct for preventing branch-target-injection <https://support.google.com/faqs/answer/7625886>`_.
+
+MIPS white paper:
+
+.. _spec_ref10:
+
+[10] `MIPS: response on speculative execution and side channel vulnerabilities <https://www.mips.com/blog/mips-response-on-speculative-execution-and-side-channel-vulnerabilities/>`_.
+
+Academic papers:
+
+.. _spec_ref11:
+
+[11] `Spectre Attacks: Exploiting Speculative Execution <https://spectreattack.com/spectre.pdf>`_.
+
+.. _spec_ref12:
+
+[12] `NetSpectre: Read Arbitrary Memory over Network <https://arxiv.org/abs/1807.10535>`_.
+
+.. _spec_ref13:
+
+[13] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://www.usenix.org/system/files/conference/woot18/woot18-paper-koruyeh.pdf>`_.
diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
new file mode 100644
index 000000000..af6865b82
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
@@ -0,0 +1,279 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+TAA - TSX Asynchronous Abort
+======================================
+
+TAA is a hardware vulnerability that allows unprivileged speculative access to
+data which is available in various CPU internal buffers by using asynchronous
+aborts within an Intel TSX transactional region.
+
+Affected processors
+-------------------
+
+This vulnerability only affects Intel processors that support Intel
+Transactional Synchronization Extensions (TSX) when the TAA_NO bit (bit 8)
+is 0 in the IA32_ARCH_CAPABILITIES MSR.  On processors where the MDS_NO bit
+(bit 5) is 0 in the IA32_ARCH_CAPABILITIES MSR, the existing MDS mitigations
+also mitigate against TAA.
+
+Whether a processor is affected or not can be read out from the TAA
+vulnerability file in sysfs. See :ref:`tsx_async_abort_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entry is related to this TAA issue:
+
+   ==============  =====  ===================================================
+   CVE-2019-11135  TAA    TSX Asynchronous Abort (TAA) condition on some
+                          microprocessors utilizing speculative execution may
+                          allow an authenticated user to potentially enable
+                          information disclosure via a side channel with
+                          local access.
+   ==============  =====  ===================================================
+
+Problem
+-------
+
+When performing store, load or L1 refill operations, processors write
+data into temporary microarchitectural structures (buffers). The data in
+those buffers can be forwarded to load operations as an optimization.
+
+Intel TSX is an extension to the x86 instruction set architecture that adds
+hardware transactional memory support to improve performance of multi-threaded
+software. TSX lets the processor expose and exploit concurrency hidden in an
+application due to dynamically avoiding unnecessary synchronization.
+
+TSX supports atomic memory transactions that are either committed (success) or
+aborted. During an abort, operations that happened within the transactional region
+are rolled back. An asynchronous abort takes place, among other options, when a
+different thread accesses a cache line that is also used within the transactional
+region when that access might lead to a data race.
+
+Immediately after an uncompleted asynchronous abort, certain speculatively
+executed loads may read data from those internal buffers and pass it to dependent
+operations. This can be then used to infer the value via a cache side channel
+attack.
+
+Because the buffers are potentially shared between Hyper-Threads cross
+Hyper-Thread attacks are possible.
+
+The victim of a malicious actor does not need to make use of TSX. Only the
+attacker needs to begin a TSX transaction and raise an asynchronous abort
+which in turn potenitally leaks data stored in the buffers.
+
+More detailed technical information is available in the TAA specific x86
+architecture section: :ref:`Documentation/x86/tsx_async_abort.rst <tsx_async_abort>`.
+
+
+Attack scenarios
+----------------
+
+Attacks against the TAA vulnerability can be implemented from unprivileged
+applications running on hosts or guests.
+
+As for MDS, the attacker has no control over the memory addresses that can
+be leaked. Only the victim is responsible for bringing data to the CPU. As
+a result, the malicious actor has to sample as much data as possible and
+then postprocess it to try to infer any useful information from it.
+
+A potential attacker only has read access to the data. Also, there is no direct
+privilege escalation by using this technique.
+
+
+.. _tsx_async_abort_sys_info:
+
+TAA system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current TAA status
+of mitigated systems. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/tsx_async_abort
+
+The possible values in this file are:
+
+.. list-table::
+
+   * - 'Vulnerable'
+     - The CPU is affected by this vulnerability and the microcode and kernel mitigation are not applied.
+   * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+     - The system tries to clear the buffers but the microcode might not support the operation.
+   * - 'Mitigation: Clear CPU buffers'
+     - The microcode has been updated to clear the buffers. TSX is still enabled.
+   * - 'Mitigation: TSX disabled'
+     - TSX is disabled.
+   * - 'Not affected'
+     - The CPU is not affected by this issue.
+
+.. _ucode_needed:
+
+Best effort mitigation mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the processor is vulnerable, but the availability of the microcode-based
+mitigation mechanism is not advertised via CPUID the kernel selects a best
+effort mitigation mode.  This mode invokes the mitigation instructions
+without a guarantee that they clear the CPU buffers.
+
+This is done to address virtualization scenarios where the host has the
+microcode update applied, but the hypervisor is not yet updated to expose the
+CPUID to the guest. If the host has updated microcode the protection takes
+effect; otherwise a few CPU cycles are wasted pointlessly.
+
+The state in the tsx_async_abort sysfs file reflects this situation
+accordingly.
+
+
+Mitigation mechanism
+--------------------
+
+The kernel detects the affected CPUs and the presence of the microcode which is
+required. If a CPU is affected and the microcode is available, then the kernel
+enables the mitigation by default.
+
+
+The mitigation can be controlled at boot time via a kernel command line option.
+See :ref:`taa_mitigation_control_command_line`.
+
+.. _virt_mechanism:
+
+Virtualization mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Affected systems where the host has TAA microcode and TAA is mitigated by
+having disabled TSX previously, are not vulnerable regardless of the status
+of the VMs.
+
+In all other cases, if the host either does not have the TAA microcode or
+the kernel is not mitigated, the system might be vulnerable.
+
+
+.. _taa_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the TAA mitigations at boot time with
+the option "tsx_async_abort=". The valid arguments for this option are:
+
+  ============  =============================================================
+  off		This option disables the TAA mitigation on affected platforms.
+                If the system has TSX enabled (see next parameter) and the CPU
+                is affected, the system is vulnerable.
+
+  full	        TAA mitigation is enabled. If TSX is enabled, on an affected
+                system it will clear CPU buffers on ring transitions. On
+                systems which are MDS-affected and deploy MDS mitigation,
+                TAA is also mitigated. Specifying this option on those
+                systems will have no effect.
+
+  full,nosmt    The same as tsx_async_abort=full, with SMT disabled on
+                vulnerable CPUs that have TSX enabled. This is the complete
+                mitigation. When TSX is disabled, SMT is not disabled because
+                CPU is not vulnerable to cross-thread TAA attacks.
+  ============  =============================================================
+
+Not specifying this option is equivalent to "tsx_async_abort=full". For
+processors that are affected by both TAA and MDS, specifying just
+"tsx_async_abort=off" without an accompanying "mds=off" will have no
+effect as the same mitigation is used for both vulnerabilities.
+
+The kernel command line also allows to control the TSX feature using the
+parameter "tsx=" on CPUs which support TSX control. MSR_IA32_TSX_CTRL is used
+to control the TSX feature and the enumeration of the TSX feature bits (RTM
+and HLE) in CPUID.
+
+The valid options are:
+
+  ============  =============================================================
+  off		Disables TSX on the system.
+
+                Note that this option takes effect only on newer CPUs which are
+                not vulnerable to MDS, i.e., have MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1
+                and which get the new IA32_TSX_CTRL MSR through a microcode
+                update. This new MSR allows for the reliable deactivation of
+                the TSX functionality.
+
+  on		Enables TSX.
+
+                Although there are mitigations for all known security
+                vulnerabilities, TSX has been known to be an accelerator for
+                several previous speculation-related CVEs, and so there may be
+                unknown security risks associated with leaving it enabled.
+
+  auto		Disables TSX if X86_BUG_TAA is present, otherwise enables TSX
+                on the system.
+  ============  =============================================================
+
+Not specifying this option is equivalent to "tsx=off".
+
+The following combinations of the "tsx_async_abort" and "tsx" are possible. For
+affected platforms tsx=auto is equivalent to tsx=off and the result will be:
+
+  =========  ==========================   =========================================
+  tsx=on     tsx_async_abort=full         The system will use VERW to clear CPU
+                                          buffers. Cross-thread attacks are still
+					  possible on SMT machines.
+  tsx=on     tsx_async_abort=full,nosmt   As above, cross-thread attacks on SMT
+                                          mitigated.
+  tsx=on     tsx_async_abort=off          The system is vulnerable.
+  tsx=off    tsx_async_abort=full         TSX might be disabled if microcode
+                                          provides a TSX control MSR. If so,
+					  system is not vulnerable.
+  tsx=off    tsx_async_abort=full,nosmt   Ditto
+  tsx=off    tsx_async_abort=off          ditto
+  =========  ==========================   =========================================
+
+
+For unaffected platforms "tsx=on" and "tsx_async_abort=full" does not clear CPU
+buffers.  For platforms without TSX control (MSR_IA32_ARCH_CAPABILITIES.MDS_NO=0)
+"tsx" command line argument has no effect.
+
+For the affected platforms below table indicates the mitigation status for the
+combinations of CPUID bit MD_CLEAR and IA32_ARCH_CAPABILITIES MSR bits MDS_NO
+and TSX_CTRL_MSR.
+
+  =======  =========  =============  ========================================
+  MDS_NO   MD_CLEAR   TSX_CTRL_MSR   Status
+  =======  =========  =============  ========================================
+    0          0            0        Vulnerable (needs microcode)
+    0          1            0        MDS and TAA mitigated via VERW
+    1          1            0        MDS fixed, TAA vulnerable if TSX enabled
+                                     because MD_CLEAR has no meaning and
+                                     VERW is not guaranteed to clear buffers
+    1          X            1        MDS fixed, TAA can be mitigated by
+                                     VERW or TSX_CTRL_MSR
+  =======  =========  =============  ========================================
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace and guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If all user space applications are from a trusted source and do not execute
+untrusted code which is supplied externally, then the mitigation can be
+disabled. The same applies to virtualized environments with trusted guests.
+
+
+2. Untrusted userspace and guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If there are untrusted applications or guests on the system, enabling TSX
+might allow a malicious actor to leak data from the host or from other
+processes running on the same physical core.
+
+If the microcode is available and the TSX is disabled on the host, attacks
+are prevented in a virtualized environment as well, even if the VMs do not
+explicitly enable the mitigation.
+
+
+.. _taa_default_mitigations:
+
+Default mitigations
+-------------------
+
+The kernel's default action for vulnerable processors is:
+
+  - Deploy TSX disable mitigation (tsx_async_abort=full tsx=off).
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
new file mode 100644
index 000000000..89abc5057
--- /dev/null
+++ b/Documentation/admin-guide/index.rst
@@ -0,0 +1,82 @@
+The Linux kernel user's and administrator's guide
+=================================================
+
+The following is a collection of user-oriented documents that have been
+added to the kernel over time.  There is, as yet, little overall order or
+organization here — this material was not written to be a single, coherent
+document!  With luck things will improve quickly over time.
+
+This initial section contains overall information, including the README
+file describing the kernel as a whole, documentation on kernel parameters,
+etc.
+
+.. toctree::
+   :maxdepth: 1
+
+   README
+   kernel-parameters
+   devices
+
+This section describes CPU vulnerabilities and their mitigations.
+
+.. toctree::
+   :maxdepth: 1
+
+   hw-vuln/index
+
+Here is a set of documents aimed at users who are trying to track down
+problems and bugs in particular.
+
+.. toctree::
+   :maxdepth: 1
+
+   reporting-bugs
+   security-bugs
+   bug-hunting
+   bug-bisect
+   tainted-kernels
+   ramoops
+   dynamic-debug-howto
+   init
+
+This is the beginning of a section with information of interest to
+application developers.  Documents covering various aspects of the kernel
+ABI will be found here.
+
+.. toctree::
+   :maxdepth: 1
+
+   sysfs-rules
+
+The rest of this manual consists of various unordered guides on how to
+configure specific aspects of kernel behavior to your liking.
+
+.. toctree::
+   :maxdepth: 1
+
+   initrd
+   cgroup-v2
+   serial-console
+   braille-console
+   parport
+   md
+   module-signing
+   sysrq
+   unicode
+   vga-softcursor
+   binfmt-misc
+   mono
+   java
+   ras
+   bcache
+   pm/index
+   thunderbolt
+   LSM/index
+   mm/index
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/admin-guide/init.rst b/Documentation/admin-guide/init.rst
new file mode 100644
index 000000000..e89d97f31
--- /dev/null
+++ b/Documentation/admin-guide/init.rst
@@ -0,0 +1,52 @@
+Explaining the dreaded "No init found." boot hang message
+=========================================================
+
+OK, so you've got this pretty unintuitive message (currently located
+in init/main.c) and are wondering what the H*** went wrong.
+Some high-level reasons for failure (listed roughly in order of execution)
+to load the init binary are:
+
+A) Unable to mount root FS
+B) init binary doesn't exist on rootfs
+C) broken console device
+D) binary exists but dependencies not available
+E) binary cannot be loaded
+
+Detailed explanations:
+
+A) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE)
+   to get more detailed kernel messages.
+B) make sure you have the correct root FS type
+   (and ``root=`` kernel parameter points to the correct partition),
+   required drivers such as storage hardware (such as SCSI or USB!)
+   and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules,
+   to be pre-loaded by an initrd)
+C) Possibly a conflict in ``console= setup`` --> initial console unavailable.
+   E.g. some serial consoles are unreliable due to serial IRQ issues (e.g.
+   missing interrupt-based configuration).
+   Try using a different ``console= device`` or e.g. ``netconsole=``.
+D) e.g. required library dependencies of the init binary such as
+   ``/lib/ld-linux.so.2`` missing or broken. Use
+   ``readelf -d <INIT>|grep NEEDED`` to find out which libraries are required.
+E) make sure the binary's architecture matches your hardware.
+   E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware.
+   In case you tried loading a non-binary file here (shell script?),
+   you should make sure that the script specifies an interpreter in its shebang
+   header line (``#!/...``) that is fully working (including its library
+   dependencies). And before tackling scripts, better first test a simple
+   non-script binary such as ``/bin/sh`` and confirm its successful execution.
+   To find out more, add code ``to init/main.c`` to display kernel_execve()s
+   return values.
+
+Please extend this explanation whenever you find new failure causes
+(after all loading the init binary is a CRITICAL and hard transition step
+which needs to be made as painless as possible), then submit patch to LKML.
+Further TODOs:
+
+- Implement the various ``run_init_process()`` invocations via a struct array
+  which can then store the ``kernel_execve()`` result value and on failure
+  log it all by iterating over **all** results (very important usability fix).
+- try to make the implementation itself more helpful in general,
+  e.g. by providing additional error messages at affected places.
+
+Andreas Mohr <andi at lisas period de>
diff --git a/Documentation/admin-guide/initrd.rst b/Documentation/admin-guide/initrd.rst
new file mode 100644
index 000000000..a03dabaaf
--- /dev/null
+++ b/Documentation/admin-guide/initrd.rst
@@ -0,0 +1,383 @@
+Using the initial RAM disk (initrd)
+===================================
+
+Written 1996,2000 by Werner Almesberger <werner.almesberger@epfl.ch> and
+Hans Lermen <lermen@fgan.de>
+
+
+initrd provides the capability to load a RAM disk by the boot loader.
+This RAM disk can then be mounted as the root file system and programs
+can be run from it. Afterwards, a new root file system can be mounted
+from a different device. The previous root (from initrd) is then moved
+to a directory and can be subsequently unmounted.
+
+initrd is mainly designed to allow system startup to occur in two phases,
+where the kernel comes up with a minimum set of compiled-in drivers, and
+where additional modules are loaded from initrd.
+
+This document gives a brief overview of the use of initrd. A more detailed
+discussion of the boot process can be found in [#f1]_.
+
+
+Operation
+---------
+
+When using initrd, the system typically boots as follows:
+
+  1) the boot loader loads the kernel and the initial RAM disk
+  2) the kernel converts initrd into a "normal" RAM disk and
+     frees the memory used by initrd
+  3) if the root device is not ``/dev/ram0``, the old (deprecated)
+     change_root procedure is followed. see the "Obsolete root change
+     mechanism" section below.
+  4) root device is mounted. if it is ``/dev/ram0``, the initrd image is
+     then mounted as root
+  5) /sbin/init is executed (this can be any valid executable, including
+     shell scripts; it is run with uid 0 and can do basically everything
+     init can do).
+  6) init mounts the "real" root file system
+  7) init places the root file system at the root directory using the
+     pivot_root system call
+  8) init execs the ``/sbin/init`` on the new root filesystem, performing
+     the usual boot sequence
+  9) the initrd file system is removed
+
+Note that changing the root directory does not involve unmounting it.
+It is therefore possible to leave processes running on initrd during that
+procedure. Also note that file systems mounted under initrd continue to
+be accessible.
+
+
+Boot command-line options
+-------------------------
+
+initrd adds the following new options::
+
+  initrd=<path>    (e.g. LOADLIN)
+
+    Loads the specified file as the initial RAM disk. When using LILO, you
+    have to specify the RAM disk image file in /etc/lilo.conf, using the
+    INITRD configuration variable.
+
+  noinitrd
+
+    initrd data is preserved but it is not converted to a RAM disk and
+    the "normal" root file system is mounted. initrd data can be read
+    from /dev/initrd. Note that the data in initrd can have any structure
+    in this case and doesn't necessarily have to be a file system image.
+    This option is used mainly for debugging.
+
+    Note: /dev/initrd is read-only and it can only be used once. As soon
+    as the last process has closed it, all data is freed and /dev/initrd
+    can't be opened anymore.
+
+  root=/dev/ram0
+
+    initrd is mounted as root, and the normal boot procedure is followed,
+    with the RAM disk mounted as root.
+
+Compressed cpio images
+----------------------
+
+Recent kernels have support for populating a ramdisk from a compressed cpio
+archive. On such systems, the creation of a ramdisk image doesn't need to
+involve special block devices or loopbacks; you merely create a directory on
+disk with the desired initrd content, cd to that directory, and run (as an
+example)::
+
+	find . | cpio --quiet -H newc -o | gzip -9 -n > /boot/imagefile.img
+
+Examining the contents of an existing image file is just as simple::
+
+	mkdir /tmp/imagefile
+	cd /tmp/imagefile
+	gzip -cd /boot/imagefile.img | cpio -imd --quiet
+
+Installation
+------------
+
+First, a directory for the initrd file system has to be created on the
+"normal" root file system, e.g.::
+
+	# mkdir /initrd
+
+The name is not relevant. More details can be found on the
+:manpage:`pivot_root(2)` man page.
+
+If the root file system is created during the boot procedure (i.e. if
+you're building an install floppy), the root file system creation
+procedure should create the ``/initrd`` directory.
+
+If initrd will not be mounted in some cases, its content is still
+accessible if the following device has been created::
+
+	# mknod /dev/initrd b 1 250
+	# chmod 400 /dev/initrd
+
+Second, the kernel has to be compiled with RAM disk support and with
+support for the initial RAM disk enabled. Also, at least all components
+needed to execute programs from initrd (e.g. executable format and file
+system) must be compiled into the kernel.
+
+Third, you have to create the RAM disk image. This is done by creating a
+file system on a block device, copying files to it as needed, and then
+copying the content of the block device to the initrd file. With recent
+kernels, at least three types of devices are suitable for that:
+
+ - a floppy disk (works everywhere but it's painfully slow)
+ - a RAM disk (fast, but allocates physical memory)
+ - a loopback device (the most elegant solution)
+
+We'll describe the loopback device method:
+
+ 1) make sure loopback block devices are configured into the kernel
+ 2) create an empty file system of the appropriate size, e.g.::
+
+	# dd if=/dev/zero of=initrd bs=300k count=1
+	# mke2fs -F -m0 initrd
+
+    (if space is critical, you may want to use the Minix FS instead of Ext2)
+ 3) mount the file system, e.g.::
+
+	# mount -t ext2 -o loop initrd /mnt
+
+ 4) create the console device::
+
+    # mkdir /mnt/dev
+    # mknod /mnt/dev/console c 5 1
+
+ 5) copy all the files that are needed to properly use the initrd
+    environment. Don't forget the most important file, ``/sbin/init``
+
+    .. note:: ``/sbin/init`` permissions must include "x" (execute).
+
+ 6) correct operation the initrd environment can frequently be tested
+    even without rebooting with the command::
+
+	# chroot /mnt /sbin/init
+
+    This is of course limited to initrds that do not interfere with the
+    general system state (e.g. by reconfiguring network interfaces,
+    overwriting mounted devices, trying to start already running demons,
+    etc. Note however that it is usually possible to use pivot_root in
+    such a chroot'ed initrd environment.)
+ 7) unmount the file system::
+
+	# umount /mnt
+
+ 8) the initrd is now in the file "initrd". Optionally, it can now be
+    compressed::
+
+	# gzip -9 initrd
+
+For experimenting with initrd, you may want to take a rescue floppy and
+only add a symbolic link from ``/sbin/init`` to ``/bin/sh``. Alternatively, you
+can try the experimental newlib environment [#f2]_ to create a small
+initrd.
+
+Finally, you have to boot the kernel and load initrd. Almost all Linux
+boot loaders support initrd. Since the boot process is still compatible
+with an older mechanism, the following boot command line parameters
+have to be given::
+
+  root=/dev/ram0 rw
+
+(rw is only necessary if writing to the initrd file system.)
+
+With LOADLIN, you simply execute::
+
+     LOADLIN <kernel> initrd=<disk_image>
+
+e.g.::
+
+	LOADLIN C:\LINUX\BZIMAGE initrd=C:\LINUX\INITRD.GZ root=/dev/ram0 rw
+
+With LILO, you add the option ``INITRD=<path>`` to either the global section
+or to the section of the respective kernel in ``/etc/lilo.conf``, and pass
+the options using APPEND, e.g.::
+
+  image = /bzImage
+    initrd = /boot/initrd.gz
+    append = "root=/dev/ram0 rw"
+
+and run ``/sbin/lilo``
+
+For other boot loaders, please refer to the respective documentation.
+
+Now you can boot and enjoy using initrd.
+
+
+Changing the root device
+------------------------
+
+When finished with its duties, init typically changes the root device
+and proceeds with starting the Linux system on the "real" root device.
+
+The procedure involves the following steps:
+ - mounting the new root file system
+ - turning it into the root file system
+ - removing all accesses to the old (initrd) root file system
+ - unmounting the initrd file system and de-allocating the RAM disk
+
+Mounting the new root file system is easy: it just needs to be mounted on
+a directory under the current root. Example::
+
+	# mkdir /new-root
+	# mount -o ro /dev/hda1 /new-root
+
+The root change is accomplished with the pivot_root system call, which
+is also available via the ``pivot_root`` utility (see :manpage:`pivot_root(8)`
+man page; ``pivot_root`` is distributed with util-linux version 2.10h or higher
+[#f3]_). ``pivot_root`` moves the current root to a directory under the new
+root, and puts the new root at its place. The directory for the old root
+must exist before calling ``pivot_root``. Example::
+
+	# cd /new-root
+	# mkdir initrd
+	# pivot_root . initrd
+
+Now, the init process may still access the old root via its
+executable, shared libraries, standard input/output/error, and its
+current root directory. All these references are dropped by the
+following command::
+
+	# exec chroot . what-follows <dev/console >dev/console 2>&1
+
+Where what-follows is a program under the new root, e.g. ``/sbin/init``
+If the new root file system will be used with udev and has no valid
+``/dev`` directory, udev must be initialized before invoking chroot in order
+to provide ``/dev/console``.
+
+Note: implementation details of pivot_root may change with time. In order
+to ensure compatibility, the following points should be observed:
+
+ - before calling pivot_root, the current directory of the invoking
+   process should point to the new root directory
+ - use . as the first argument, and the _relative_ path of the directory
+   for the old root as the second argument
+ - a chroot program must be available under the old and the new root
+ - chroot to the new root afterwards
+ - use relative paths for dev/console in the exec command
+
+Now, the initrd can be unmounted and the memory allocated by the RAM
+disk can be freed::
+
+	# umount /initrd
+	# blockdev --flushbufs /dev/ram0
+
+It is also possible to use initrd with an NFS-mounted root, see the
+:manpage:`pivot_root(8)` man page for details.
+
+
+Usage scenarios
+---------------
+
+The main motivation for implementing initrd was to allow for modular
+kernel configuration at system installation. The procedure would work
+as follows:
+
+  1) system boots from floppy or other media with a minimal kernel
+     (e.g. support for RAM disks, initrd, a.out, and the Ext2 FS) and
+     loads initrd
+  2) ``/sbin/init`` determines what is needed to (1) mount the "real" root FS
+     (i.e. device type, device drivers, file system) and (2) the
+     distribution media (e.g. CD-ROM, network, tape, ...). This can be
+     done by asking the user, by auto-probing, or by using a hybrid
+     approach.
+  3) ``/sbin/init`` loads the necessary kernel modules
+  4) ``/sbin/init`` creates and populates the root file system (this doesn't
+     have to be a very usable system yet)
+  5) ``/sbin/init`` invokes ``pivot_root`` to change the root file system and
+     execs - via chroot - a program that continues the installation
+  6) the boot loader is installed
+  7) the boot loader is configured to load an initrd with the set of
+     modules that was used to bring up the system (e.g. ``/initrd`` can be
+     modified, then unmounted, and finally, the image is written from
+     ``/dev/ram0`` or ``/dev/rd/0`` to a file)
+  8) now the system is bootable and additional installation tasks can be
+     performed
+
+The key role of initrd here is to re-use the configuration data during
+normal system operation without requiring the use of a bloated "generic"
+kernel or re-compiling or re-linking the kernel.
+
+A second scenario is for installations where Linux runs on systems with
+different hardware configurations in a single administrative domain. In
+such cases, it is desirable to generate only a small set of kernels
+(ideally only one) and to keep the system-specific part of configuration
+information as small as possible. In this case, a common initrd could be
+generated with all the necessary modules. Then, only ``/sbin/init`` or a file
+read by it would have to be different.
+
+A third scenario is more convenient recovery disks, because information
+like the location of the root FS partition doesn't have to be provided at
+boot time, but the system loaded from initrd can invoke a user-friendly
+dialog and it can also perform some sanity checks (or even some form of
+auto-detection).
+
+Last not least, CD-ROM distributors may use it for better installation
+from CD, e.g. by using a boot floppy and bootstrapping a bigger RAM disk
+via initrd from CD; or by booting via a loader like ``LOADLIN`` or directly
+from the CD-ROM, and loading the RAM disk from CD without need of
+floppies.
+
+
+Obsolete root change mechanism
+------------------------------
+
+The following mechanism was used before the introduction of pivot_root.
+Current kernels still support it, but you should _not_ rely on its
+continued availability.
+
+It works by mounting the "real" root device (i.e. the one set with rdev
+in the kernel image or with root=... at the boot command line) as the
+root file system when linuxrc exits. The initrd file system is then
+unmounted, or, if it is still busy, moved to a directory ``/initrd``, if
+such a directory exists on the new root file system.
+
+In order to use this mechanism, you do not have to specify the boot
+command options root, init, or rw. (If specified, they will affect
+the real root file system, not the initrd environment.)
+
+If /proc is mounted, the "real" root device can be changed from within
+linuxrc by writing the number of the new root FS device to the special
+file /proc/sys/kernel/real-root-dev, e.g.::
+
+  # echo 0x301 >/proc/sys/kernel/real-root-dev
+
+Note that the mechanism is incompatible with NFS and similar file
+systems.
+
+This old, deprecated mechanism is commonly called ``change_root``, while
+the new, supported mechanism is called ``pivot_root``.
+
+
+Mixed change_root and pivot_root mechanism
+------------------------------------------
+
+In case you did not want to use ``root=/dev/ram0`` to trigger the pivot_root
+mechanism, you may create both ``/linuxrc`` and ``/sbin/init`` in your initrd
+image.
+
+``/linuxrc`` would contain only the following::
+
+	#! /bin/sh
+	mount -n -t proc proc /proc
+	echo 0x0100 >/proc/sys/kernel/real-root-dev
+	umount -n /proc
+
+Once linuxrc exited, the kernel would mount again your initrd as root,
+this time executing ``/sbin/init``. Again, it would be the duty of this init
+to build the right environment (maybe using the ``root= device`` passed on
+the cmdline) before the final execution of the real ``/sbin/init``.
+
+
+Resources
+---------
+
+.. [#f1] Almesberger, Werner; "Booting Linux: The History and the Future"
+    http://www.almesberger.net/cv/papers/ols2k-9.ps.gz
+.. [#f2] newlib package (experimental), with initrd example
+    https://www.sourceware.org/newlib/
+.. [#f3] util-linux: Miscellaneous utilities for Linux
+    https://www.kernel.org/pub/linux/utils/util-linux/
diff --git a/Documentation/admin-guide/java.rst b/Documentation/admin-guide/java.rst
new file mode 100644
index 000000000..8744e272e
--- /dev/null
+++ b/Documentation/admin-guide/java.rst
@@ -0,0 +1,423 @@
+Java(tm) Binary Kernel Support for Linux v1.03
+----------------------------------------------
+
+Linux beats them ALL! While all other OS's are TALKING about direct
+support of Java Binaries in the OS, Linux is doing it!
+
+You can execute Java applications and Java Applets just like any
+other program after you have done the following:
+
+1) You MUST FIRST install the Java Developers Kit for Linux.
+   The Java on Linux HOWTO gives the details on getting and
+   installing this. This HOWTO can be found at:
+
+	ftp://sunsite.unc.edu/pub/Linux/docs/HOWTO/Java-HOWTO
+
+   You should also set up a reasonable CLASSPATH environment
+   variable to use Java applications that make use of any
+   nonstandard classes (not included in the same directory
+   as the application itself).
+
+2) You have to compile BINFMT_MISC either as a module or into
+   the kernel (``CONFIG_BINFMT_MISC``) and set it up properly.
+   If you choose to compile it as a module, you will have
+   to insert it manually with modprobe/insmod, as kmod
+   cannot easily be supported with binfmt_misc.
+   Read the file 'binfmt_misc.txt' in this directory to know
+   more about the configuration process.
+
+3) Add the following configuration items to binfmt_misc
+   (you should really have read ``binfmt_misc.txt`` now):
+   support for Java applications::
+
+     ':Java:M::\xca\xfe\xba\xbe::/usr/local/bin/javawrapper:'
+
+   support for executable Jar files::
+
+     ':ExecutableJAR:E::jar::/usr/local/bin/jarwrapper:'
+
+   support for Java Applets::
+
+     ':Applet:E::html::/usr/bin/appletviewer:'
+
+   or the following, if you want to be more selective::
+
+     ':Applet:M::<!--applet::/usr/bin/appletviewer:'
+
+   Of course you have to fix the path names. The path/file names given in this
+   document match the Debian 2.1 system. (i.e. jdk installed in ``/usr``,
+   custom wrappers from this document in ``/usr/local``)
+
+   Note, that for the more selective applet support you have to modify
+   existing html-files to contain ``<!--applet-->`` in the first line
+   (``<`` has to be the first character!) to let this work!
+
+   For the compiled Java programs you need a wrapper script like the
+   following (this is because Java is broken in case of the filename
+   handling), again fix the path names, both in the script and in the
+   above given configuration string.
+
+   You, too, need the little program after the script. Compile like::
+
+	gcc -O2 -o javaclassname javaclassname.c
+
+   and stick it to ``/usr/local/bin``.
+
+   Both the javawrapper shellscript and the javaclassname program
+   were supplied by Colin J. Watson <cjw44@cam.ac.uk>.
+
+Javawrapper shell script:
+
+.. code-block:: sh
+
+  #!/bin/bash
+  # /usr/local/bin/javawrapper - the wrapper for binfmt_misc/java
+
+  if [ -z "$1" ]; then
+	exec 1>&2
+	echo Usage: $0 class-file
+	exit 1
+  fi
+
+  CLASS=$1
+  FQCLASS=`/usr/local/bin/javaclassname $1`
+  FQCLASSN=`echo $FQCLASS | sed -e 's/^.*\.\([^.]*\)$/\1/'`
+  FQCLASSP=`echo $FQCLASS | sed -e 's-\.-/-g' -e 's-^[^/]*$--' -e 's-/[^/]*$--'`
+
+  # for example:
+  # CLASS=Test.class
+  # FQCLASS=foo.bar.Test
+  # FQCLASSN=Test
+  # FQCLASSP=foo/bar
+
+  unset CLASSBASE
+
+  declare -i LINKLEVEL=0
+
+  while :; do
+	if [ "`basename $CLASS .class`" == "$FQCLASSN" ]; then
+		# See if this directory works straight off
+		cd -L `dirname $CLASS`
+		CLASSDIR=$PWD
+		cd $OLDPWD
+		if echo $CLASSDIR | grep -q "$FQCLASSP$"; then
+			CLASSBASE=`echo $CLASSDIR | sed -e "s.$FQCLASSP$.."`
+			break;
+		fi
+		# Try dereferencing the directory name
+		cd -P `dirname $CLASS`
+		CLASSDIR=$PWD
+		cd $OLDPWD
+		if echo $CLASSDIR | grep -q "$FQCLASSP$"; then
+			CLASSBASE=`echo $CLASSDIR | sed -e "s.$FQCLASSP$.."`
+			break;
+		fi
+		# If no other possible filename exists
+		if [ ! -L $CLASS ]; then
+			exec 1>&2
+			echo $0:
+			echo "  $CLASS should be in a" \
+			     "directory tree called $FQCLASSP"
+			exit 1
+		fi
+	fi
+	if [ ! -L $CLASS ]; then break; fi
+	# Go down one more level of symbolic links
+	let LINKLEVEL+=1
+	if [ $LINKLEVEL -gt 5 ]; then
+		exec 1>&2
+		echo $0:
+		echo "  Too many symbolic links encountered"
+		exit 1
+	fi
+	CLASS=`ls --color=no -l $CLASS | sed -e 's/^.* \([^ ]*\)$/\1/'`
+  done
+
+  if [ -z "$CLASSBASE" ]; then
+	if [ -z "$FQCLASSP" ]; then
+		GOODNAME=$FQCLASSN.class
+	else
+		GOODNAME=$FQCLASSP/$FQCLASSN.class
+	fi
+	exec 1>&2
+	echo $0:
+	echo "  $FQCLASS should be in a file called $GOODNAME"
+	exit 1
+  fi
+
+  if ! echo $CLASSPATH | grep -q "^\(.*:\)*$CLASSBASE\(:.*\)*"; then
+	# class is not in CLASSPATH, so prepend dir of class to CLASSPATH
+	if [ -z "${CLASSPATH}" ] ; then
+		export CLASSPATH=$CLASSBASE
+	else
+		export CLASSPATH=$CLASSBASE:$CLASSPATH
+	fi
+  fi
+
+  shift
+  /usr/bin/java $FQCLASS "$@"
+
+javaclassname.c:
+
+.. code-block:: c
+
+  /* javaclassname.c
+   *
+   * Extracts the class name from a Java class file; intended for use in a Java
+   * wrapper of the type supported by the binfmt_misc option in the Linux kernel.
+   *
+   * Copyright (C) 1999 Colin J. Watson <cjw44@cam.ac.uk>.
+   *
+   * This program is free software; you can redistribute it and/or modify
+   * it under the terms of the GNU General Public License as published by
+   * the Free Software Foundation; either version 2 of the License, or
+   * (at your option) any later version.
+   *
+   * This program is distributed in the hope that it will be useful,
+   * but WITHOUT ANY WARRANTY; without even the implied warranty of
+   * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   * GNU General Public License for more details.
+   *
+   * You should have received a copy of the GNU General Public License
+   * along with this program; if not, write to the Free Software
+   * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+   */
+
+  #include <stdlib.h>
+  #include <stdio.h>
+  #include <stdarg.h>
+  #include <sys/types.h>
+
+  /* From Sun's Java VM Specification, as tag entries in the constant pool. */
+
+  #define CP_UTF8 1
+  #define CP_INTEGER 3
+  #define CP_FLOAT 4
+  #define CP_LONG 5
+  #define CP_DOUBLE 6
+  #define CP_CLASS 7
+  #define CP_STRING 8
+  #define CP_FIELDREF 9
+  #define CP_METHODREF 10
+  #define CP_INTERFACEMETHODREF 11
+  #define CP_NAMEANDTYPE 12
+  #define CP_METHODHANDLE 15
+  #define CP_METHODTYPE 16
+  #define CP_INVOKEDYNAMIC 18
+
+  /* Define some commonly used error messages */
+
+  #define seek_error() error("%s: Cannot seek\n", program)
+  #define corrupt_error() error("%s: Class file corrupt\n", program)
+  #define eof_error() error("%s: Unexpected end of file\n", program)
+  #define utf8_error() error("%s: Only ASCII 1-255 supported\n", program);
+
+  char *program;
+
+  long *pool;
+
+  u_int8_t read_8(FILE *classfile);
+  u_int16_t read_16(FILE *classfile);
+  void skip_constant(FILE *classfile, u_int16_t *cur);
+  void error(const char *format, ...);
+  int main(int argc, char **argv);
+
+  /* Reads in an unsigned 8-bit integer. */
+  u_int8_t read_8(FILE *classfile)
+  {
+	int b = fgetc(classfile);
+	if(b == EOF)
+		eof_error();
+	return (u_int8_t)b;
+  }
+
+  /* Reads in an unsigned 16-bit integer. */
+  u_int16_t read_16(FILE *classfile)
+  {
+	int b1, b2;
+	b1 = fgetc(classfile);
+	if(b1 == EOF)
+		eof_error();
+	b2 = fgetc(classfile);
+	if(b2 == EOF)
+		eof_error();
+	return (u_int16_t)((b1 << 8) | b2);
+  }
+
+  /* Reads in a value from the constant pool. */
+  void skip_constant(FILE *classfile, u_int16_t *cur)
+  {
+	u_int16_t len;
+	int seekerr = 1;
+	pool[*cur] = ftell(classfile);
+	switch(read_8(classfile))
+	{
+	case CP_UTF8:
+		len = read_16(classfile);
+		seekerr = fseek(classfile, len, SEEK_CUR);
+		break;
+	case CP_CLASS:
+	case CP_STRING:
+	case CP_METHODTYPE:
+		seekerr = fseek(classfile, 2, SEEK_CUR);
+		break;
+	case CP_METHODHANDLE:
+		seekerr = fseek(classfile, 3, SEEK_CUR);
+		break;
+	case CP_INTEGER:
+	case CP_FLOAT:
+	case CP_FIELDREF:
+	case CP_METHODREF:
+	case CP_INTERFACEMETHODREF:
+	case CP_NAMEANDTYPE:
+	case CP_INVOKEDYNAMIC:
+		seekerr = fseek(classfile, 4, SEEK_CUR);
+		break;
+	case CP_LONG:
+	case CP_DOUBLE:
+		seekerr = fseek(classfile, 8, SEEK_CUR);
+		++(*cur);
+		break;
+	default:
+		corrupt_error();
+	}
+	if(seekerr)
+		seek_error();
+  }
+
+  void error(const char *format, ...)
+  {
+	va_list ap;
+	va_start(ap, format);
+	vfprintf(stderr, format, ap);
+	va_end(ap);
+	exit(1);
+  }
+
+  int main(int argc, char **argv)
+  {
+	FILE *classfile;
+	u_int16_t cp_count, i, this_class, classinfo_ptr;
+	u_int8_t length;
+
+	program = argv[0];
+
+	if(!argv[1])
+		error("%s: Missing input file\n", program);
+	classfile = fopen(argv[1], "rb");
+	if(!classfile)
+		error("%s: Error opening %s\n", program, argv[1]);
+
+	if(fseek(classfile, 8, SEEK_SET))  /* skip magic and version numbers */
+		seek_error();
+	cp_count = read_16(classfile);
+	pool = calloc(cp_count, sizeof(long));
+	if(!pool)
+		error("%s: Out of memory for constant pool\n", program);
+
+	for(i = 1; i < cp_count; ++i)
+		skip_constant(classfile, &i);
+	if(fseek(classfile, 2, SEEK_CUR))	/* skip access flags */
+		seek_error();
+
+	this_class = read_16(classfile);
+	if(this_class < 1 || this_class >= cp_count)
+		corrupt_error();
+	if(!pool[this_class] || pool[this_class] == -1)
+		corrupt_error();
+	if(fseek(classfile, pool[this_class] + 1, SEEK_SET))
+		seek_error();
+
+	classinfo_ptr = read_16(classfile);
+	if(classinfo_ptr < 1 || classinfo_ptr >= cp_count)
+		corrupt_error();
+	if(!pool[classinfo_ptr] || pool[classinfo_ptr] == -1)
+		corrupt_error();
+	if(fseek(classfile, pool[classinfo_ptr] + 1, SEEK_SET))
+		seek_error();
+
+	length = read_16(classfile);
+	for(i = 0; i < length; ++i)
+	{
+		u_int8_t x = read_8(classfile);
+		if((x & 0x80) || !x)
+		{
+			if((x & 0xE0) == 0xC0)
+			{
+				u_int8_t y = read_8(classfile);
+				if((y & 0xC0) == 0x80)
+				{
+					int c = ((x & 0x1f) << 6) + (y & 0x3f);
+					if(c) putchar(c);
+					else utf8_error();
+				}
+				else utf8_error();
+			}
+			else utf8_error();
+		}
+		else if(x == '/') putchar('.');
+		else putchar(x);
+	}
+	putchar('\n');
+	free(pool);
+	fclose(classfile);
+	return 0;
+  }
+
+jarwrapper::
+
+  #!/bin/bash
+  # /usr/local/java/bin/jarwrapper - the wrapper for binfmt_misc/jar
+
+  java -jar $1
+
+
+Now simply ``chmod +x`` the ``.class``, ``.jar`` and/or ``.html`` files you
+want to execute.
+
+To add a Java program to your path best put a symbolic link to the main
+.class file into /usr/bin (or another place you like) omitting the .class
+extension. The directory containing the original .class file will be
+added to your CLASSPATH during execution.
+
+
+To test your new setup, enter in the following simple Java app, and name
+it "HelloWorld.java":
+
+.. code-block:: java
+
+	class HelloWorld {
+		public static void main(String args[]) {
+			System.out.println("Hello World!");
+		}
+	}
+
+Now compile the application with::
+
+	javac HelloWorld.java
+
+Set the executable permissions of the binary file, with::
+
+	chmod 755 HelloWorld.class
+
+And then execute it::
+
+	./HelloWorld.class
+
+
+To execute Java Jar files, simple chmod the ``*.jar`` files to include
+the execution bit, then just do::
+
+       ./Application.jar
+
+
+To execute Java Applets, simple chmod the ``*.html`` files to include
+the execution bit, then just do::
+
+	./Applet.html
+
+
+originally by Brian A. Lantz, brian@lantz.com
+heavily edited for binfmt_misc by Richard Günther
+new scripts by Colin J. Watson <cjw44@cam.ac.uk>
+added executable Jar file support by Kurt Huwig <kurt@iku-netz.de>
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
new file mode 100644
index 000000000..b8d0bc07e
--- /dev/null
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -0,0 +1,211 @@
+.. _kernelparameters:
+
+The kernel's command-line parameters
+====================================
+
+The following is a consolidated list of the kernel parameters as
+implemented by the __setup(), core_param() and module_param() macros
+and sorted into English Dictionary order (defined as ignoring all
+punctuation and sorting digits before letters in a case insensitive
+manner), and with descriptions where known.
+
+The kernel parses parameters from the kernel command line up to "--";
+if it doesn't recognize a parameter and it doesn't contain a '.', the
+parameter gets passed to init: parameters with '=' go into init's
+environment, others are passed as command line arguments to init.
+Everything after "--" is passed as an argument to init.
+
+Module parameters can be specified in two ways: via the kernel command
+line with a module name prefix, or via modprobe, e.g.::
+
+	(kernel command line) usbcore.blinkenlights=1
+	(modprobe command line) modprobe usbcore blinkenlights=1
+
+Parameters for modules which are built into the kernel need to be
+specified on the kernel command line.  modprobe looks through the
+kernel command line (/proc/cmdline) and collects module parameters
+when it loads a module, so the kernel command line can be used for
+loadable modules too.
+
+Hyphens (dashes) and underscores are equivalent in parameter names, so::
+
+	log_buf_len=1M print-fatal-signals=1
+
+can also be entered as::
+
+	log-buf-len=1M print_fatal_signals=1
+
+Double-quotes can be used to protect spaces in values, e.g.::
+
+	param="spaces in here"
+
+cpu lists:
+----------
+
+Some kernel parameters take a list of CPUs as a value, e.g.  isolcpus,
+nohz_full, irqaffinity, rcu_nocbs.  The format of this list is:
+
+	<cpu number>,...,<cpu number>
+
+or
+
+	<cpu number>-<cpu number>
+	(must be a positive range in ascending order)
+
+or a mixture
+
+<cpu number>,...,<cpu number>-<cpu number>
+
+Note that for the special case of a range one can split the range into equal
+sized groups and for each group use some amount from the beginning of that
+group:
+
+	<cpu number>-cpu number>:<used size>/<group size>
+
+For example one can add to the command line following parameter:
+
+	isolcpus=1,2,10-20,100-2000:2/25
+
+where the final item represents CPUs 100,101,125,126,150,151,...
+
+
+
+This document may not be entirely up to date and comprehensive. The command
+"modinfo -p ${modulename}" shows a current list of all parameters of a loadable
+module. Loadable modules, after being loaded into the running kernel, also
+reveal their parameters in /sys/module/${modulename}/parameters/. Some of these
+parameters may be changed at runtime by the command
+``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``.
+
+The parameters listed below are only valid if certain kernel build options were
+enabled and if respective hardware is present. The text in square brackets at
+the beginning of each description states the restrictions within which a
+parameter is applicable::
+
+	ACPI	ACPI support is enabled.
+	AGP	AGP (Accelerated Graphics Port) is enabled.
+	ALSA	ALSA sound support is enabled.
+	APIC	APIC support is enabled.
+	APM	Advanced Power Management support is enabled.
+	ARM	ARM architecture is enabled.
+	AX25	Appropriate AX.25 support is enabled.
+	CLK	Common clock infrastructure is enabled.
+	CMA	Contiguous Memory Area support is enabled.
+	DRM	Direct Rendering Management support is enabled.
+	DYNAMIC_DEBUG Build in debug messages and enable them at runtime
+	EDD	BIOS Enhanced Disk Drive Services (EDD) is enabled
+	EFI	EFI Partitioning (GPT) is enabled
+	EIDE	EIDE/ATAPI support is enabled.
+	EVM	Extended Verification Module
+	FB	The frame buffer device is enabled.
+	FTRACE	Function tracing enabled.
+	GCOV	GCOV profiling is enabled.
+	HW	Appropriate hardware is enabled.
+	IA-64	IA-64 architecture is enabled.
+	IMA     Integrity measurement architecture is enabled.
+	IOSCHED	More than one I/O scheduler is enabled.
+	IP_PNP	IP DHCP, BOOTP, or RARP is enabled.
+	IPV6	IPv6 support is enabled.
+	ISAPNP	ISA PnP code is enabled.
+	ISDN	Appropriate ISDN support is enabled.
+	ISOL	CPU Isolation is enabled.
+	JOY	Appropriate joystick support is enabled.
+	KGDB	Kernel debugger support is enabled.
+	KVM	Kernel Virtual Machine support is enabled.
+	LIBATA  Libata driver is enabled
+	LP	Printer support is enabled.
+	LOOP	Loopback device support is enabled.
+	M68k	M68k architecture is enabled.
+			These options have more detailed description inside of
+			Documentation/m68k/kernel-options.txt.
+	MDA	MDA console support is enabled.
+	MIPS	MIPS architecture is enabled.
+	MOUSE	Appropriate mouse support is enabled.
+	MSI	Message Signaled Interrupts (PCI).
+	MTD	MTD (Memory Technology Device) support is enabled.
+	NET	Appropriate network support is enabled.
+	NUMA	NUMA support is enabled.
+	NFS	Appropriate NFS support is enabled.
+	OSS	OSS sound support is enabled.
+	PV_OPS	A paravirtualized kernel is enabled.
+	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
+	PARISC	The PA-RISC architecture is enabled.
+	PCI	PCI bus support is enabled.
+	PCIE	PCI Express support is enabled.
+	PCMCIA	The PCMCIA subsystem is enabled.
+	PNP	Plug & Play support is enabled.
+	PPC	PowerPC architecture is enabled.
+	PPT	Parallel port support is enabled.
+	PS2	Appropriate PS/2 support is enabled.
+	RAM	RAM disk support is enabled.
+	RDT	Intel Resource Director Technology.
+	S390	S390 architecture is enabled.
+	SCSI	Appropriate SCSI support is enabled.
+			A lot of drivers have their options described inside
+			the Documentation/scsi/ sub-directory.
+	SECURITY Different security models are enabled.
+	SELINUX SELinux support is enabled.
+	APPARMOR AppArmor support is enabled.
+	SERIAL	Serial support is enabled.
+	SH	SuperH architecture is enabled.
+	SMP	The kernel is an SMP kernel.
+	SPARC	Sparc architecture is enabled.
+	SWSUSP	Software suspend (hibernation) is enabled.
+	SUSPEND	System suspend states are enabled.
+	TPM	TPM drivers are enabled.
+	TS	Appropriate touchscreen support is enabled.
+	UMS	USB Mass Storage support is enabled.
+	USB	USB support is enabled.
+	USBHID	USB Human Interface Device support is enabled.
+	V4L	Video For Linux support is enabled.
+	VMMIO   Driver for memory mapped virtio devices is enabled.
+	VGA	The VGA console has been enabled.
+	VT	Virtual terminal support is enabled.
+	WDT	Watchdog support is enabled.
+	XT	IBM PC/XT MFM hard disk support is enabled.
+	X86-32	X86-32, aka i386 architecture is enabled.
+	X86-64	X86-64 architecture is enabled.
+			More X86-64 boot options can be found in
+			Documentation/x86/x86_64/boot-options.txt .
+	X86	Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
+	X86_UV	SGI UV support is enabled.
+	XEN	Xen support is enabled
+
+In addition, the following text indicates that the option::
+
+	BUGS=	Relates to possible processor bugs on the said processor.
+	KNL	Is a kernel start-up parameter.
+	BOOT	Is a boot loader parameter.
+
+Parameters denoted with BOOT are actually interpreted by the boot
+loader, and have no meaning to the kernel directly.
+Do not modify the syntax of boot loader parameters without extreme
+need or coordination with <Documentation/x86/boot.txt>.
+
+There are also arch-specific kernel-parameters not documented here.
+See for example <Documentation/x86/x86_64/boot-options.txt>.
+
+Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
+a trailing = on the name of any parameter states that that parameter will
+be entered as an environment variable, whereas its absence indicates that
+it will appear as a kernel argument readable via /proc/cmdline by programs
+running once the system is up.
+
+The number of kernel parameters is not limited, but the length of the
+complete command line (parameters including spaces etc.) is limited to
+a fixed number of characters. This limit depends on the architecture
+and is between 256 and 4096 characters. It is defined in the file
+./include/asm/setup.h as COMMAND_LINE_SIZE.
+
+Finally, the [KMG] suffix is commonly described after a number of kernel
+parameter values. These 'K', 'M', and 'G' letters represent the _binary_
+multipliers 'Kilo', 'Mega', and 'Giga', equaling 2^10, 2^20, and 2^30
+bytes respectively. Such letter suffixes can also be entirely omitted:
+
+.. include:: kernel-parameters.txt
+   :literal:
+
+Todo
+----
+
+	Add more DRM drivers.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
new file mode 100644
index 000000000..500032af0
--- /dev/null
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -0,0 +1,5361 @@
+	acpi=		[HW,ACPI,X86,ARM64]
+			Advanced Configuration and Power Interface
+			Format: { force | on | off | strict | noirq | rsdt |
+				  copy_dsdt }
+			force -- enable ACPI if default was off
+			on -- enable ACPI but allow fallback to DT [arm64]
+			off -- disable ACPI if default was on
+			noirq -- do not use ACPI for IRQ routing
+			strict -- Be less tolerant of platforms that are not
+				strictly ACPI specification compliant.
+			rsdt -- prefer RSDT over (default) XSDT
+			copy_dsdt -- copy DSDT to memory
+			For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force"
+			are available
+
+			See also Documentation/power/runtime_pm.txt, pci=noacpi
+
+	acpi_apic_instance=	[ACPI, IOAPIC]
+			Format: <int>
+			2: use 2nd APIC table, if available
+			1,0: use 1st APIC table
+			default: 0
+
+	acpi_backlight=	[HW,ACPI]
+			acpi_backlight=vendor
+			acpi_backlight=video
+			If set to vendor, prefer vendor specific driver
+			(e.g. thinkpad_acpi, sony_acpi, etc.) instead
+			of the ACPI video.ko driver.
+
+	acpi_force_32bit_fadt_addr
+			force FADT to use 32 bit addresses rather than the
+			64 bit X_* addresses. Some firmware have broken 64
+			bit addresses for force ACPI ignore these and use
+			the older legacy 32 bit addresses.
+
+	acpica_no_return_repair [HW, ACPI]
+			Disable AML predefined validation mechanism
+			This mechanism can repair the evaluation result to make
+			the return objects more ACPI specification compliant.
+			This option is useful for developers to identify the
+			root cause of an AML interpreter issue when the issue
+			has something to do with the repair mechanism.
+
+	acpi.debug_layer=	[HW,ACPI,ACPI_DEBUG]
+	acpi.debug_level=	[HW,ACPI,ACPI_DEBUG]
+			Format: <int>
+			CONFIG_ACPI_DEBUG must be enabled to produce any ACPI
+			debug output.  Bits in debug_layer correspond to a
+			_COMPONENT in an ACPI source file, e.g.,
+			    #define _COMPONENT ACPI_PCI_COMPONENT
+			Bits in debug_level correspond to a level in
+			ACPI_DEBUG_PRINT statements, e.g.,
+			    ACPI_DEBUG_PRINT((ACPI_DB_INFO, ...
+			The debug_level mask defaults to "info".  See
+			Documentation/acpi/debug.txt for more information about
+			debug layers and levels.
+
+			Enable processor driver info messages:
+			    acpi.debug_layer=0x20000000
+			Enable PCI/PCI interrupt routing info messages:
+			    acpi.debug_layer=0x400000
+			Enable AML "Debug" output, i.e., stores to the Debug
+			object while interpreting AML:
+			    acpi.debug_layer=0xffffffff acpi.debug_level=0x2
+			Enable all messages related to ACPI hardware:
+			    acpi.debug_layer=0x2 acpi.debug_level=0xffffffff
+
+			Some values produce so much output that the system is
+			unusable.  The "log_buf_len" parameter may be useful
+			if you need to capture more output.
+
+	acpi_enforce_resources=	[ACPI]
+			{ strict | lax | no }
+			Check for resource conflicts between native drivers
+			and ACPI OperationRegions (SystemIO and SystemMemory
+			only). IO ports and memory declared in ACPI might be
+			used by the ACPI subsystem in arbitrary AML code and
+			can interfere with legacy drivers.
+			strict (default): access to resources claimed by ACPI
+			is denied; legacy drivers trying to access reserved
+			resources will fail to bind to device using them.
+			lax: access to resources claimed by ACPI is allowed;
+			legacy drivers trying to access reserved resources
+			will bind successfully but a warning message is logged.
+			no: ACPI OperationRegions are not marked as reserved,
+			no further checks are performed.
+
+	acpi_force_table_verification	[HW,ACPI]
+			Enable table checksum verification during early stage.
+			By default, this is disabled due to x86 early mapping
+			size limitation.
+
+	acpi_irq_balance [HW,ACPI]
+			ACPI will balance active IRQs
+			default in APIC mode
+
+	acpi_irq_nobalance [HW,ACPI]
+			ACPI will not move active IRQs (default)
+			default in PIC mode
+
+	acpi_irq_isa=	[HW,ACPI] If irq_balance, mark listed IRQs used by ISA
+			Format: <irq>,<irq>...
+
+	acpi_irq_pci=	[HW,ACPI] If irq_balance, clear listed IRQs for
+			use by PCI
+			Format: <irq>,<irq>...
+
+	acpi_mask_gpe=	[HW,ACPI]
+			Due to the existence of _Lxx/_Exx, some GPEs triggered
+			by unsupported hardware/firmware features can result in
+			GPE floodings that cannot be automatically disabled by
+			the GPE dispatcher.
+			This facility can be used to prevent such uncontrolled
+			GPE floodings.
+			Format: <byte>
+
+	acpi_no_auto_serialize	[HW,ACPI]
+			Disable auto-serialization of AML methods
+			AML control methods that contain the opcodes to create
+			named objects will be marked as "Serialized" by the
+			auto-serialization feature.
+			This feature is enabled by default.
+			This option allows to turn off the feature.
+
+	acpi_no_memhotplug [ACPI] Disable memory hotplug.  Useful for kdump
+			   kernels.
+
+	acpi_no_static_ssdt	[HW,ACPI]
+			Disable installation of static SSDTs at early boot time
+			By default, SSDTs contained in the RSDT/XSDT will be
+			installed automatically and they will appear under
+			/sys/firmware/acpi/tables.
+			This option turns off this feature.
+			Note that specifying this option does not affect
+			dynamic table installation which will install SSDT
+			tables to /sys/firmware/acpi/tables/dynamic.
+
+	acpi_no_watchdog	[HW,ACPI,WDT]
+			Ignore the ACPI-based watchdog interface (WDAT) and let
+			a native driver control the watchdog device instead.
+
+	acpi_rsdp=	[ACPI,EFI,KEXEC]
+			Pass the RSDP address to the kernel, mostly used
+			on machines running EFI runtime service to boot the
+			second kernel for kdump.
+
+	acpi_os_name=	[HW,ACPI] Tell ACPI BIOS the name of the OS
+			Format: To spoof as Windows 98: ="Microsoft Windows"
+
+	acpi_rev_override [ACPI] Override the _REV object to return 5 (instead
+			of 2 which is mandated by ACPI 6) as the supported ACPI
+			specification revision (when using this switch, it may
+			be necessary to carry out a cold reboot _twice_ in a
+			row to make it take effect on the platform firmware).
+
+	acpi_osi=	[HW,ACPI] Modify list of supported OS interface strings
+			acpi_osi="string1"	# add string1
+			acpi_osi="!string2"	# remove string2
+			acpi_osi=!*		# remove all strings
+			acpi_osi=!		# disable all built-in OS vendor
+						  strings
+			acpi_osi=!!		# enable all built-in OS vendor
+						  strings
+			acpi_osi=		# disable all strings
+
+			'acpi_osi=!' can be used in combination with single or
+			multiple 'acpi_osi="string1"' to support specific OS
+			vendor string(s).  Note that such command can only
+			affect the default state of the OS vendor strings, thus
+			it cannot affect the default state of the feature group
+			strings and the current state of the OS vendor strings,
+			specifying it multiple times through kernel command line
+			is meaningless.  This command is useful when one do not
+			care about the state of the feature group strings which
+			should be controlled by the OSPM.
+			Examples:
+			  1. 'acpi_osi=! acpi_osi="Windows 2000"' is equivalent
+			     to 'acpi_osi="Windows 2000" acpi_osi=!', they all
+			     can make '_OSI("Windows 2000")' TRUE.
+
+			'acpi_osi=' cannot be used in combination with other
+			'acpi_osi=' command lines, the _OSI method will not
+			exist in the ACPI namespace.  NOTE that such command can
+			only affect the _OSI support state, thus specifying it
+			multiple times through kernel command line is also
+			meaningless.
+			Examples:
+			  1. 'acpi_osi=' can make 'CondRefOf(_OSI, Local1)'
+			     FALSE.
+
+			'acpi_osi=!*' can be used in combination with single or
+			multiple 'acpi_osi="string1"' to support specific
+			string(s).  Note that such command can affect the
+			current state of both the OS vendor strings and the
+			feature group strings, thus specifying it multiple times
+			through kernel command line is meaningful.  But it may
+			still not able to affect the final state of a string if
+			there are quirks related to this string.  This command
+			is useful when one want to control the state of the
+			feature group strings to debug BIOS issues related to
+			the OSPM features.
+			Examples:
+			  1. 'acpi_osi="Module Device" acpi_osi=!*' can make
+			     '_OSI("Module Device")' FALSE.
+			  2. 'acpi_osi=!* acpi_osi="Module Device"' can make
+			     '_OSI("Module Device")' TRUE.
+			  3. 'acpi_osi=! acpi_osi=!* acpi_osi="Windows 2000"' is
+			     equivalent to
+			     'acpi_osi=!* acpi_osi=! acpi_osi="Windows 2000"'
+			     and
+			     'acpi_osi=!* acpi_osi="Windows 2000" acpi_osi=!',
+			     they all will make '_OSI("Windows 2000")' TRUE.
+
+	acpi_pm_good	[X86]
+			Override the pmtimer bug detection: force the kernel
+			to assume that this machine's pmtimer latches its value
+			and always returns good values.
+
+	acpi_sci=	[HW,ACPI] ACPI System Control Interrupt trigger mode
+			Format: { level | edge | high | low }
+
+	acpi_skip_timer_override [HW,ACPI]
+			Recognize and ignore IRQ0/pin2 Interrupt Override.
+			For broken nForce2 BIOS resulting in XT-PIC timer.
+
+	acpi_sleep=	[HW,ACPI] Sleep options
+			Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
+				  old_ordering, nonvs, sci_force_enable, nobl }
+			See Documentation/power/video.txt for information on
+			s3_bios and s3_mode.
+			s3_beep is for debugging; it makes the PC's speaker beep
+			as soon as the kernel's real-mode entry point is called.
+			s4_nohwsig prevents ACPI hardware signature from being
+			used during resume from hibernation.
+			old_ordering causes the ACPI 1.0 ordering of the _PTS
+			control method, with respect to putting devices into
+			low power states, to be enforced (the ACPI 2.0 ordering
+			of _PTS is used by default).
+			nonvs prevents the kernel from saving/restoring the
+			ACPI NVS memory during suspend/hibernation and resume.
+			sci_force_enable causes the kernel to set SCI_EN directly
+			on resume from S1/S3 (which is against the ACPI spec,
+			but some broken systems don't work without it).
+			nobl causes the internal blacklist of systems known to
+			behave incorrectly in some ways with respect to system
+			suspend and resume to be ignored (use wisely).
+
+	acpi_use_timer_override [HW,ACPI]
+			Use timer override. For some broken Nvidia NF5 boards
+			that require a timer override, but don't have HPET
+
+	add_efi_memmap	[EFI; X86] Include EFI memory map in
+			kernel's map of available physical RAM.
+
+	agp=		[AGP]
+			{ off | try_unsupported }
+			off: disable AGP support
+			try_unsupported: try to drive unsupported chipsets
+				(may crash computer or cause data corruption)
+
+	ALSA		[HW,ALSA]
+			See Documentation/sound/alsa-configuration.rst
+
+	alignment=	[KNL,ARM]
+			Allow the default userspace alignment fault handler
+			behaviour to be specified.  Bit 0 enables warnings,
+			bit 1 enables fixups, and bit 2 sends a segfault.
+
+	align_va_addr=	[X86-64]
+			Align virtual addresses by clearing slice [14:12] when
+			allocating a VMA at process creation time. This option
+			gives you up to 3% performance improvement on AMD F15h
+			machines (where it is enabled by default) for a
+			CPU-intensive style benchmark, and it can vary highly in
+			a microbenchmark depending on workload and compiler.
+
+			32: only for 32-bit processes
+			64: only for 64-bit processes
+			on: enable for both 32- and 64-bit processes
+			off: disable for both 32- and 64-bit processes
+
+	alloc_snapshot	[FTRACE]
+			Allocate the ftrace snapshot buffer on boot up when the
+			main buffer is allocated. This is handy if debugging
+			and you need to use tracing_snapshot() on boot up, and
+			do not want to use tracing_snapshot_alloc() as it needs
+			to be done where GFP_KERNEL allocations are allowed.
+
+	amd_iommu=	[HW,X86-64]
+			Pass parameters to the AMD IOMMU driver in the system.
+			Possible values are:
+			fullflush - enable flushing of IO/TLB entries when
+				    they are unmapped. Otherwise they are
+				    flushed before they will be reused, which
+				    is a lot of faster
+			off	  - do not initialize any AMD IOMMU found in
+				    the system
+			force_isolation - Force device isolation for all
+					  devices. The IOMMU driver is not
+					  allowed anymore to lift isolation
+					  requirements as needed. This option
+					  does not override iommu=pt
+
+	amd_iommu_dump=	[HW,X86-64]
+			Enable AMD IOMMU driver option to dump the ACPI table
+			for AMD IOMMU. With this option enabled, AMD IOMMU
+			driver will print ACPI tables for AMD IOMMU during
+			IOMMU initialization.
+
+	amd_iommu_intr=	[HW,X86-64]
+			Specifies one of the following AMD IOMMU interrupt
+			remapping modes:
+			legacy     - Use legacy interrupt remapping mode.
+			vapic      - Use virtual APIC mode, which allows IOMMU
+			             to inject interrupts directly into guest.
+			             This mode requires kvm-amd.avic=1.
+			             (Default when IOMMU HW support is present.)
+
+	amijoy.map=	[HW,JOY] Amiga joystick support
+			Map of devices attached to JOY0DAT and JOY1DAT
+			Format: <a>,<b>
+			See also Documentation/input/joydev/joystick.rst
+
+	analog.map=	[HW,JOY] Analog joystick and gamepad support
+			Specifies type or capabilities of an analog joystick
+			connected to one of 16 gameports
+			Format: <type1>,<type2>,..<type16>
+
+	apc=		[HW,SPARC]
+			Power management functions (SPARCstation-4/5 + deriv.)
+			Format: noidle
+			Disable APC CPU standby support. SPARCstation-Fox does
+			not play well with APC CPU idle - disable it if you have
+			APC and your system crashes randomly.
+
+	apic=		[APIC,X86] Advanced Programmable Interrupt Controller
+			Change the output verbosity whilst booting
+			Format: { quiet (default) | verbose | debug }
+			Change the amount of debugging information output
+			when initialising the APIC and IO-APIC components.
+			For X86-32, this can also be used to specify an APIC
+			driver name.
+			Format: apic=driver_name
+			Examples: apic=bigsmp
+
+	apic_extnmi=	[APIC,X86] External NMI delivery setting
+			Format: { bsp (default) | all | none }
+			bsp:  External NMI is delivered only to CPU 0
+			all:  External NMIs are broadcast to all CPUs as a
+			      backup of CPU 0
+			none: External NMI is masked for all CPUs. This is
+			      useful so that a dump capture kernel won't be
+			      shot down by NMI
+
+	autoconf=	[IPV6]
+			See Documentation/networking/ipv6.txt.
+
+	show_lapic=	[APIC,X86] Advanced Programmable Interrupt Controller
+			Limit apic dumping. The parameter defines the maximal
+			number of local apics being dumped. Also it is possible
+			to set it to "all" by meaning -- no limit here.
+			Format: { 1 (default) | 2 | ... | all }.
+			The parameter valid if only apic=debug or
+			apic=verbose is specified.
+			Example: apic=debug show_lapic=all
+
+	apm=		[APM] Advanced Power Management
+			See header of arch/x86/kernel/apm_32.c.
+
+	arcrimi=	[HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards
+			Format: <io>,<irq>,<nodeID>
+
+	ataflop=	[HW,M68k]
+
+	atarimouse=	[HW,MOUSE] Atari Mouse
+
+	atkbd.extra=	[HW] Enable extra LEDs and keys on IBM RapidAccess,
+			EzKey and similar keyboards
+
+	atkbd.reset=	[HW] Reset keyboard during initialization
+
+	atkbd.set=	[HW] Select keyboard code set
+			Format: <int> (2 = AT (default), 3 = PS/2)
+
+	atkbd.scroll=	[HW] Enable scroll wheel on MS Office and similar
+			keyboards
+
+	atkbd.softraw=	[HW] Choose between synthetic and real raw mode
+			Format: <bool> (0 = real, 1 = synthetic (default))
+
+	atkbd.softrepeat= [HW]
+			Use software keyboard repeat
+
+	audit=		[KNL] Enable the audit sub-system
+			Format: { "0" | "1" | "off" | "on" }
+			0 | off - kernel audit is disabled and can not be
+			    enabled until the next reboot
+			unset - kernel audit is initialized but disabled and
+			    will be fully enabled by the userspace auditd.
+			1 | on - kernel audit is initialized and partially
+			    enabled, storing at most audit_backlog_limit
+			    messages in RAM until it is fully enabled by the
+			    userspace auditd.
+			Default: unset
+
+	audit_backlog_limit= [KNL] Set the audit queue size limit.
+			Format: <int> (must be >=0)
+			Default: 64
+
+	bau=		[X86_UV] Enable the BAU on SGI UV.  The default
+			behavior is to disable the BAU (i.e. bau=0).
+			Format: { "0" | "1" }
+			0 - Disable the BAU.
+			1 - Enable the BAU.
+			unset - Disable the BAU.
+
+	baycom_epp=	[HW,AX25]
+			Format: <io>,<mode>
+
+	baycom_par=	[HW,AX25] BayCom Parallel Port AX.25 Modem
+			Format: <io>,<mode>
+			See header of drivers/net/hamradio/baycom_par.c.
+
+	baycom_ser_fdx=	[HW,AX25]
+			BayCom Serial Port AX.25 Modem (Full Duplex Mode)
+			Format: <io>,<irq>,<mode>[,<baud>]
+			See header of drivers/net/hamradio/baycom_ser_fdx.c.
+
+	baycom_ser_hdx=	[HW,AX25]
+			BayCom Serial Port AX.25 Modem (Half Duplex Mode)
+			Format: <io>,<irq>,<mode>
+			See header of drivers/net/hamradio/baycom_ser_hdx.c.
+
+	blkdevparts=	Manual partition parsing of block device(s) for
+			embedded devices based on command line input.
+			See Documentation/block/cmdline-partition.txt
+
+	boot_delay=	Milliseconds to delay each printk during boot.
+			Values larger than 10 seconds (10000) are changed to
+			no delay (0).
+			Format: integer
+
+	bootmem_debug	[KNL] Enable bootmem allocator debug messages.
+
+	bert_disable	[ACPI]
+			Disable BERT OS support on buggy BIOSes.
+
+	bttv.card=	[HW,V4L] bttv (bt848 + bt878 based grabber cards)
+	bttv.radio=	Most important insmod options are available as
+			kernel args too.
+	bttv.pll=	See Documentation/media/v4l-drivers/bttv.rst
+	bttv.tuner=
+
+	bulk_remove=off	[PPC]  This parameter disables the use of the pSeries
+			firmware feature for flushing multiple hpte entries
+			at a time.
+
+	c101=		[NET] Moxa C101 synchronous serial card
+
+	cachesize=	[BUGS=X86-32] Override level 2 CPU cache size detection.
+			Sometimes CPU hardware bugs make them report the cache
+			size incorrectly. The kernel will attempt work arounds
+			to fix known problems, but for some CPUs it is not
+			possible to determine what the correct size should be.
+			This option provides an override for these situations.
+
+	ca_keys=	[KEYS] This parameter identifies a specific key(s) on
+			the system trusted keyring to be used for certificate
+			trust validation.
+			format: { id:<keyid> | builtin }
+
+	cca=		[MIPS] Override the kernel pages' cache coherency
+			algorithm.  Accepted values range from 0 to 7
+			inclusive. See arch/mips/include/asm/pgtable-bits.h
+			for platform specific values (SB1, Loongson3 and
+			others).
+
+	ccw_timeout_log	[S390]
+			See Documentation/s390/CommonIO for details.
+
+	cgroup_disable=	[KNL] Disable a particular controller
+			Format: {name of the controller(s) to disable}
+			The effects of cgroup_disable=foo are:
+			- foo isn't auto-mounted if you mount all cgroups in
+			  a single hierarchy
+			- foo isn't visible as an individually mountable
+			  subsystem
+			{Currently only "memory" controller deal with this and
+			cut the overhead, others just disable the usage. So
+			only cgroup_disable=memory is actually worthy}
+
+	cgroup_no_v1=	[KNL] Disable one, multiple, all cgroup controllers in v1
+			Format: { controller[,controller...] | "all" }
+			Like cgroup_disable, but only applies to cgroup v1;
+			the blacklisted controllers remain available in cgroup2.
+
+	cgroup.memory=	[KNL] Pass options to the cgroup memory controller.
+			Format: <string>
+			nosocket -- Disable socket memory accounting.
+			nokmem -- Disable kernel memory accounting.
+
+	checkreqprot	[SELINUX] Set initial checkreqprot flag value.
+			Format: { "0" | "1" }
+			See security/selinux/Kconfig help text.
+			0 -- check protection applied by kernel (includes
+				any implied execute protection).
+			1 -- check protection requested by application.
+			Default value is set via a kernel config option.
+			Value can be changed at runtime via
+				/selinux/checkreqprot.
+
+	cio_ignore=	[S390]
+			See Documentation/s390/CommonIO for details.
+	clk_ignore_unused
+			[CLK]
+			Prevents the clock framework from automatically gating
+			clocks that have not been explicitly enabled by a Linux
+			device driver but are enabled in hardware at reset or
+			by the bootloader/firmware. Note that this does not
+			force such clocks to be always-on nor does it reserve
+			those clocks in any way. This parameter is useful for
+			debug and development, but should not be needed on a
+			platform with proper driver support.  For more
+			information, see Documentation/driver-api/clk.rst.
+
+	clock=		[BUGS=X86-32, HW] gettimeofday clocksource override.
+			[Deprecated]
+			Forces specified clocksource (if available) to be used
+			when calculating gettimeofday(). If specified
+			clocksource is not available, it defaults to PIT.
+			Format: { pit | tsc | cyclone | pmtmr }
+
+	clocksource=	Override the default clocksource
+			Format: <string>
+			Override the default clocksource and use the clocksource
+			with the name specified.
+			Some clocksource names to choose from, depending on
+			the platform:
+			[all] jiffies (this is the base, fallback clocksource)
+			[ACPI] acpi_pm
+			[ARM] imx_timer1,OSTS,netx_timer,mpu_timer2,
+				pxa_timer,timer3,32k_counter,timer0_1
+			[X86-32] pit,hpet,tsc;
+				scx200_hrt on Geode; cyclone on IBM x440
+			[MIPS] MIPS
+			[PARISC] cr16
+			[S390] tod
+			[SH] SuperH
+			[SPARC64] tick
+			[X86-64] hpet,tsc
+
+	clocksource.arm_arch_timer.evtstrm=
+			[ARM,ARM64]
+			Format: <bool>
+			Enable/disable the eventstream feature of the ARM
+			architected timer so that code using WFE-based polling
+			loops can be debugged more effectively on production
+			systems.
+
+	clocksource.max_cswd_read_retries= [KNL]
+			Number of clocksource_watchdog() retries due to
+			external delays before the clock will be marked
+			unstable.  Defaults to three retries, that is,
+			four attempts to read the clock under test.
+
+	clearcpuid=BITNUM[,BITNUM...] [X86]
+			Disable CPUID feature X for the kernel. See
+			arch/x86/include/asm/cpufeatures.h for the valid bit
+			numbers. Note the Linux specific bits are not necessarily
+			stable over kernel options, but the vendor specific
+			ones should be.
+			Also note that user programs calling CPUID directly
+			or using the feature without checking anything
+			will still see it. This just prevents it from
+			being used by the kernel or shown in /proc/cpuinfo.
+			Also note the kernel might malfunction if you disable
+			some critical bits.
+
+	cma=nn[MG]@[start[MG][-end[MG]]]
+			[ARM,X86,KNL]
+			Sets the size of kernel global memory area for
+			contiguous memory allocations and optionally the
+			placement constraint by the physical address range of
+			memory allocations. A value of 0 disables CMA
+			altogether. For more information, see
+			include/linux/dma-contiguous.h
+
+	cmo_free_hint=	[PPC] Format: { yes | no }
+			Specify whether pages are marked as being inactive
+			when they are freed.  This is used in CMO environments
+			to determine OS memory pressure for page stealing by
+			a hypervisor.
+			Default: yes
+
+	coherent_pool=nn[KMG]	[ARM,KNL]
+			Sets the size of memory pool for coherent, atomic dma
+			allocations, by default set to 256K.
+
+	com20020=	[HW,NET] ARCnet - COM20020 chipset
+			Format:
+			<io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]]
+
+	com90io=	[HW,NET] ARCnet - COM90xx chipset (IO-mapped buffers)
+			Format: <io>[,<irq>]
+
+	com90xx=	[HW,NET]
+			ARCnet - COM90xx chipset (memory-mapped buffers)
+			Format: <io>[,<irq>[,<memstart>]]
+
+	condev=		[HW,S390] console device
+	conmode=
+
+	console=	[KNL] Output console device and options.
+
+		tty<n>	Use the virtual console device <n>.
+
+		ttyS<n>[,options]
+		ttyUSB0[,options]
+			Use the specified serial port.  The options are of
+			the form "bbbbpnf", where "bbbb" is the baud rate,
+			"p" is parity ("n", "o", or "e"), "n" is number of
+			bits, and "f" is flow control ("r" for RTS or
+			omit it).  Default is "9600n8".
+
+			See Documentation/admin-guide/serial-console.rst for more
+			information.  See
+			Documentation/networking/netconsole.txt for an
+			alternative.
+
+		uart[8250],io,<addr>[,options]
+		uart[8250],mmio,<addr>[,options]
+		uart[8250],mmio16,<addr>[,options]
+		uart[8250],mmio32,<addr>[,options]
+		uart[8250],0x<addr>[,options]
+			Start an early, polled-mode console on the 8250/16550
+			UART at the specified I/O port or MMIO address,
+			switching to the matching ttyS device later.
+			MMIO inter-register address stride is either 8-bit
+			(mmio), 16-bit (mmio16), or 32-bit (mmio32).
+			If none of [io|mmio|mmio16|mmio32], <addr> is assumed
+			to be equivalent to 'mmio'. 'options' are specified in
+			the same format described for ttyS above; if unspecified,
+			the h/w is not re-initialized.
+
+		hvc<n>	Use the hypervisor console device <n>. This is for
+			both Xen and PowerPC hypervisors.
+
+		If the device connected to the port is not a TTY but a braille
+		device, prepend "brl," before the device type, for instance
+			console=brl,ttyS0
+		For now, only VisioBraille is supported.
+
+	console_msg_format=
+			[KNL] Change console messages format
+		default
+			By default we print messages on consoles in
+			"[time stamp] text\n" format (time stamp may not be
+			printed, depending on CONFIG_PRINTK_TIME or
+			`printk_time' param).
+		syslog
+			Switch to syslog format: "<%u>[time stamp] text\n"
+			IOW, each message will have a facility and loglevel
+			prefix. The format is similar to one used by syslog()
+			syscall, or to executing "dmesg -S --raw" or to reading
+			from /proc/kmsg.
+
+	consoleblank=	[KNL] The console blank (screen saver) timeout in
+			seconds. A value of 0 disables the blank timer.
+			Defaults to 0.
+
+	coredump_filter=
+			[KNL] Change the default value for
+			/proc/<pid>/coredump_filter.
+			See also Documentation/filesystems/proc.txt.
+
+	coresight_cpu_debug.enable
+			[ARM,ARM64]
+			Format: <bool>
+			Enable/disable the CPU sampling based debugging.
+			0: default value, disable debugging
+			1: enable debugging at boot time
+
+	cpuidle.off=1	[CPU_IDLE]
+			disable the cpuidle sub-system
+
+	cpufreq.off=1	[CPU_FREQ]
+			disable the cpufreq sub-system
+
+	cpu_init_udelay=N
+			[X86] Delay for N microsec between assert and de-assert
+			of APIC INIT to start processors.  This delay occurs
+			on every CPU online, such as boot, and resume from suspend.
+			Default: 10000
+
+	cpcihp_generic=	[HW,PCI] Generic port I/O CompactPCI driver
+			Format:
+			<first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
+
+	crashkernel=size[KMG][@offset[KMG]]
+			[KNL] Using kexec, Linux can switch to a 'crash kernel'
+			upon panic. This parameter reserves the physical
+			memory region [offset, offset + size] for that kernel
+			image. If '@offset' is omitted, then a suitable offset
+			is selected automatically. Check
+			Documentation/kdump/kdump.txt for further details.
+
+	crashkernel=range1:size1[,range2:size2,...][@offset]
+			[KNL] Same as above, but depends on the memory
+			in the running system. The syntax of range is
+			start-[end] where start and end are both
+			a memory unit (amount[KMG]). See also
+			Documentation/kdump/kdump.txt for an example.
+
+	crashkernel=size[KMG],high
+			[KNL, x86_64] range could be above 4G. Allow kernel
+			to allocate physical memory region from top, so could
+			be above 4G if system have more than 4G ram installed.
+			Otherwise memory region will be allocated below 4G, if
+			available.
+			It will be ignored if crashkernel=X is specified.
+	crashkernel=size[KMG],low
+			[KNL, x86_64] range under 4G. When crashkernel=X,high
+			is passed, kernel could allocate physical memory region
+			above 4G, that cause second kernel crash on system
+			that require some amount of low memory, e.g. swiotlb
+			requires at least 64M+32K low memory, also enough extra
+			low memory is needed to make sure DMA buffers for 32-bit
+			devices won't run out. Kernel would try to allocate at
+			at least 256M below 4G automatically.
+			This one let user to specify own low range under 4G
+			for second kernel instead.
+			0: to disable low allocation.
+			It will be ignored when crashkernel=X,high is not used
+			or memory reserved is below 4G.
+
+	cryptomgr.notests
+			[KNL] Disable crypto self-tests
+
+	cs89x0_dma=	[HW,NET]
+			Format: <dma>
+
+	cs89x0_media=	[HW,NET]
+			Format: { rj45 | aui | bnc }
+
+	dasd=		[HW,NET]
+			See header of drivers/s390/block/dasd_devmap.c.
+
+	db9.dev[2|3]=	[HW,JOY] Multisystem joystick support via parallel port
+			(one device per port)
+			Format: <port#>,<type>
+			See also Documentation/input/devices/joystick-parport.rst
+
+	ddebug_query=	[KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
+			time. See
+			Documentation/admin-guide/dynamic-debug-howto.rst for
+			details.  Deprecated, see dyndbg.
+
+	debug		[KNL] Enable kernel debugging (events log level).
+
+	debug_boot_weak_hash
+			[KNL] Enable printing [hashed] pointers early in the
+			boot sequence.  If enabled, we use a weak hash instead
+			of siphash to hash pointers.  Use this option if you are
+			seeing instances of '(___ptrval___)') and need to see a
+			value (hashed pointer) instead. Cryptographically
+			insecure, please do not use on production kernels.
+
+	debug_locks_verbose=
+			[KNL] verbose self-tests
+			Format=<0|1>
+			Print debugging info while doing the locking API
+			self-tests.
+			We default to 0 (no extra messages), setting it to
+			1 will print _a lot_ more information - normally
+			only useful to kernel developers.
+
+	debug_objects	[KNL] Enable object debugging
+
+	no_debug_objects
+			[KNL] Disable object debugging
+
+	debug_guardpage_minorder=
+			[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
+			parameter allows control of the order of pages that will
+			be intentionally kept free (and hence protected) by the
+			buddy allocator. Bigger value increase the probability
+			of catching random memory corruption, but reduce the
+			amount of memory for normal system use. The maximum
+			possible value is MAX_ORDER/2.  Setting this parameter
+			to 1 or 2 should be enough to identify most random
+			memory corruption problems caused by bugs in kernel or
+			driver code when a CPU writes to (or reads from) a
+			random memory location. Note that there exists a class
+			of memory corruptions problems caused by buggy H/W or
+			F/W or by drivers badly programing DMA (basically when
+			memory is written at bus level and the CPU MMU is
+			bypassed) which are not detectable by
+			CONFIG_DEBUG_PAGEALLOC, hence this option will not help
+			tracking down these problems.
+
+	debug_pagealloc=
+			[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
+			parameter enables the feature at boot time. In
+			default, it is disabled. We can avoid allocating huge
+			chunk of memory for debug pagealloc if we don't enable
+			it at boot time and the system will work mostly same
+			with the kernel built without CONFIG_DEBUG_PAGEALLOC.
+			on: enable the feature
+
+	debugpat	[X86] Enable PAT debugging
+
+	decnet.addr=	[HW,NET]
+			Format: <area>[,<node>]
+			See also Documentation/networking/decnet.txt.
+
+	default_hugepagesz=
+			[same as hugepagesz=] The size of the default
+			HugeTLB page size. This is the size represented by
+			the legacy /proc/ hugepages APIs, used for SHM, and
+			default size when mounting hugetlbfs filesystems.
+			Defaults to the default architecture's huge page size
+			if not specified.
+
+	deferred_probe_timeout=
+			[KNL] Debugging option to set a timeout in seconds for
+			deferred probe to give up waiting on dependencies to
+			probe. Only specific dependencies (subsystems or
+			drivers) that have opted in will be ignored. A timeout of 0
+			will timeout at the end of initcalls. This option will also
+			dump out devices still on the deferred probe list after
+			retrying.
+
+	dhash_entries=	[KNL]
+			Set number of hash buckets for dentry cache.
+
+	disable_1tb_segments [PPC]
+			Disables the use of 1TB hash page table segments. This
+			causes the kernel to fall back to 256MB segments which
+			can be useful when debugging issues that require an SLB
+			miss to occur.
+
+	disable=	[IPV6]
+			See Documentation/networking/ipv6.txt.
+
+	hardened_usercopy=
+                        [KNL] Under CONFIG_HARDENED_USERCOPY, whether
+                        hardening is enabled for this boot. Hardened
+                        usercopy checking is used to protect the kernel
+                        from reading or writing beyond known memory
+                        allocation boundaries as a proactive defense
+                        against bounds-checking flaws in the kernel's
+                        copy_to_user()/copy_from_user() interface.
+                on      Perform hardened usercopy checks (default).
+                off     Disable hardened usercopy checks.
+
+	disable_radix	[PPC]
+			Disable RADIX MMU mode on POWER9
+
+	disable_cpu_apicid= [X86,APIC,SMP]
+			Format: <int>
+			The number of initial APIC ID for the
+			corresponding CPU to be disabled at boot,
+			mostly used for the kdump 2nd kernel to
+			disable BSP to wake up multiple CPUs without
+			causing system reset or hang due to sending
+			INIT from AP to BSP.
+
+	disable_ddw	[PPC/PSERIES]
+			Disable Dynamic DMA Window support. Use this if
+			to workaround buggy firmware.
+
+	disable_ipv6=	[IPV6]
+			See Documentation/networking/ipv6.txt.
+
+	disable_mtrr_cleanup [X86]
+			The kernel tries to adjust MTRR layout from continuous
+			to discrete, to make X server driver able to add WB
+			entry later. This parameter disables that.
+
+	disable_mtrr_trim [X86, Intel and AMD only]
+			By default the kernel will trim any uncacheable
+			memory out of your available memory pool based on
+			MTRR settings.  This parameter disables that behavior,
+			possibly causing your machine to run very slowly.
+
+	disable_timer_pin_1 [X86]
+			Disable PIN 1 of APIC timer
+			Can be useful to work around chipset bugs.
+
+	dis_ucode_ldr	[X86] Disable the microcode loader.
+
+	dma_debug=off	If the kernel is compiled with DMA_API_DEBUG support,
+			this option disables the debugging code at boot.
+
+	dma_debug_entries=<number>
+			This option allows to tune the number of preallocated
+			entries for DMA-API debugging code. One entry is
+			required per DMA-API allocation. Use this if the
+			DMA-API debugging code disables itself because the
+			architectural default is too low.
+
+	dma_debug_driver=<driver_name>
+			With this option the DMA-API debugging driver
+			filter feature can be enabled at boot time. Just
+			pass the driver to filter for as the parameter.
+			The filter can be disabled or changed to another
+			driver later using sysfs.
+
+	drm.edid_firmware=[<connector>:]<file>[,[<connector>:]<file>]
+			Broken monitors, graphic adapters, KVMs and EDIDless
+			panels may send no or incorrect EDID data sets.
+			This parameter allows to specify an EDID data sets
+			in the /lib/firmware directory that are used instead.
+			Generic built-in EDID data sets are used, if one of
+			edid/1024x768.bin, edid/1280x1024.bin,
+			edid/1680x1050.bin, or edid/1920x1080.bin is given
+			and no file with the same name exists. Details and
+			instructions how to build your own EDID data are
+			available in Documentation/EDID/HOWTO.txt. An EDID
+			data set will only be used for a particular connector,
+			if its name and a colon are prepended to the EDID
+			name. Each connector may use a unique EDID data
+			set by separating the files with a comma.  An EDID
+			data set with no connector name will be used for
+			any connectors not explicitly specified.
+
+	dscc4.setup=	[NET]
+
+	dt_cpu_ftrs=	[PPC]
+			Format: {"off" | "known"}
+			Control how the dt_cpu_ftrs device-tree binding is
+			used for CPU feature discovery and setup (if it
+			exists).
+			off: Do not use it, fall back to legacy cpu table.
+			known: Do not pass through unknown features to guests
+			or userspace, only those that the kernel is aware of.
+
+	dump_apple_properties	[X86]
+			Dump name and content of EFI device properties on
+			x86 Macs.  Useful for driver authors to determine
+			what data is available or for reverse-engineering.
+
+	dyndbg[="val"]		[KNL,DYNAMIC_DEBUG]
+	module.dyndbg[="val"]
+			Enable debug messages at boot time.  See
+			Documentation/admin-guide/dynamic-debug-howto.rst
+			for details.
+
+	nompx		[X86] Disables Intel Memory Protection Extensions.
+			See Documentation/x86/intel_mpx.txt for more
+			information about the feature.
+
+	nopku		[X86] Disable Memory Protection Keys CPU feature found
+			in some Intel CPUs.
+
+	module.async_probe [KNL]
+			Enable asynchronous probe on this module.
+
+	early_ioremap_debug [KNL]
+			Enable debug messages in early_ioremap support. This
+			is useful for tracking down temporary early mappings
+			which are not unmapped.
+
+	earlycon=	[KNL] Output early console device and options.
+
+			[ARM64] The early console is determined by the
+			stdout-path property in device tree's chosen node,
+			or determined by the ACPI SPCR table.
+
+			[X86] When used with no options the early console is
+			determined by the ACPI SPCR table.
+
+		cdns,<addr>[,options]
+			Start an early, polled-mode console on a Cadence
+			(xuartps) serial port at the specified address. Only
+			supported option is baud rate. If baud rate is not
+			specified, the serial port must already be setup and
+			configured.
+
+		uart[8250],io,<addr>[,options]
+		uart[8250],mmio,<addr>[,options]
+		uart[8250],mmio32,<addr>[,options]
+		uart[8250],mmio32be,<addr>[,options]
+		uart[8250],0x<addr>[,options]
+			Start an early, polled-mode console on the 8250/16550
+			UART at the specified I/O port or MMIO address.
+			MMIO inter-register address stride is either 8-bit
+			(mmio) or 32-bit (mmio32 or mmio32be).
+			If none of [io|mmio|mmio32|mmio32be], <addr> is assumed
+			to be equivalent to 'mmio'. 'options' are specified
+			in the same format described for "console=ttyS<n>"; if
+			unspecified, the h/w is not initialized.
+
+		pl011,<addr>
+		pl011,mmio32,<addr>
+			Start an early, polled-mode console on a pl011 serial
+			port at the specified address. The pl011 serial port
+			must already be setup and configured. Options are not
+			yet supported.  If 'mmio32' is specified, then only
+			the driver will use only 32-bit accessors to read/write
+			the device registers.
+
+		meson,<addr>
+			Start an early, polled-mode console on a meson serial
+			port at the specified address. The serial port must
+			already be setup and configured. Options are not yet
+			supported.
+
+		msm_serial,<addr>
+			Start an early, polled-mode console on an msm serial
+			port at the specified address. The serial port
+			must already be setup and configured. Options are not
+			yet supported.
+
+		msm_serial_dm,<addr>
+			Start an early, polled-mode console on an msm serial
+			dm port at the specified address. The serial port
+			must already be setup and configured. Options are not
+			yet supported.
+
+		owl,<addr>
+			Start an early, polled-mode console on a serial port
+			of an Actions Semi SoC, such as S500 or S900, at the
+			specified address. The serial port must already be
+			setup and configured. Options are not yet supported.
+
+		smh	Use ARM semihosting calls for early console.
+
+		s3c2410,<addr>
+		s3c2412,<addr>
+		s3c2440,<addr>
+		s3c6400,<addr>
+		s5pv210,<addr>
+		exynos4210,<addr>
+			Use early console provided by serial driver available
+			on Samsung SoCs, requires selecting proper type and
+			a correct base address of the selected UART port. The
+			serial port must already be setup and configured.
+			Options are not yet supported.
+
+		lantiq,<addr>
+			Start an early, polled-mode console on a lantiq serial
+			(lqasc) port at the specified address. The serial port
+			must already be setup and configured. Options are not
+			yet supported.
+
+		lpuart,<addr>
+		lpuart32,<addr>
+			Use early console provided by Freescale LP UART driver
+			found on Freescale Vybrid and QorIQ LS1021A processors.
+			A valid base address must be provided, and the serial
+			port must already be setup and configured.
+
+		ar3700_uart,<addr>
+			Start an early, polled-mode console on the
+			Armada 3700 serial port at the specified
+			address. The serial port must already be setup
+			and configured. Options are not yet supported.
+
+		qcom_geni,<addr>
+			Start an early, polled-mode console on a Qualcomm
+			Generic Interface (GENI) based serial port at the
+			specified address. The serial port must already be
+			setup and configured. Options are not yet supported.
+
+	earlyprintk=	[X86,SH,ARM,M68k,S390]
+			earlyprintk=vga
+			earlyprintk=efi
+			earlyprintk=sclp
+			earlyprintk=xen
+			earlyprintk=serial[,ttySn[,baudrate]]
+			earlyprintk=serial[,0x...[,baudrate]]
+			earlyprintk=ttySn[,baudrate]
+			earlyprintk=dbgp[debugController#]
+			earlyprintk=pciserial[,force],bus:device.function[,baudrate]
+			earlyprintk=xdbc[xhciController#]
+
+			earlyprintk is useful when the kernel crashes before
+			the normal console is initialized. It is not enabled by
+			default because it has some cosmetic problems.
+
+			Append ",keep" to not disable it when the real console
+			takes over.
+
+			Only one of vga, efi, serial, or usb debug port can
+			be used at a time.
+
+			Currently only ttyS0 and ttyS1 may be specified by
+			name.  Other I/O ports may be explicitly specified
+			on some architectures (x86 and arm at least) by
+			replacing ttySn with an I/O port address, like this:
+				earlyprintk=serial,0x1008,115200
+			You can find the port for a given device in
+			/proc/tty/driver/serial:
+				2: uart:ST16650V2 port:00001008 irq:18 ...
+
+			Interaction with the standard serial driver is not
+			very good.
+
+			The VGA and EFI output is eventually overwritten by
+			the real console.
+
+			The xen output can only be used by Xen PV guests.
+
+			The sclp output can only be used on s390.
+
+			The optional "force" to "pciserial" enables use of a
+			PCI device even when its classcode is not of the
+			UART class.
+
+	edac_report=	[HW,EDAC] Control how to report EDAC event
+			Format: {"on" | "off" | "force"}
+			on: enable EDAC to report H/W event. May be overridden
+			by other higher priority error reporting module.
+			off: disable H/W event reporting through EDAC.
+			force: enforce the use of EDAC to report H/W event.
+			default: on.
+
+	ekgdboc=	[X86,KGDB] Allow early kernel console debugging
+			ekgdboc=kbd
+
+			This is designed to be used in conjunction with
+			the boot argument: earlyprintk=vga
+
+	edd=		[EDD]
+			Format: {"off" | "on" | "skip[mbr]"}
+
+	efi=		[EFI]
+			Format: { "old_map", "nochunk", "noruntime", "debug" }
+			old_map [X86-64]: switch to the old ioremap-based EFI
+			runtime services mapping. 32-bit still uses this one by
+			default.
+			nochunk: disable reading files in "chunks" in the EFI
+			boot stub, as chunking can cause problems with some
+			firmware implementations.
+			noruntime : disable EFI runtime services support
+			debug: enable misc debug output
+
+	efi_no_storage_paranoia [EFI; X86]
+			Using this parameter you can use more than 50% of
+			your efi variable storage. Use this parameter only if
+			you are really sure that your UEFI does sane gc and
+			fulfills the spec otherwise your board may brick.
+
+	efi_fake_mem=	nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86]
+			Add arbitrary attribute to specific memory range by
+			updating original EFI memory map.
+			Region of memory which aa attribute is added to is
+			from ss to ss+nn.
+			If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000
+			is specified, EFI_MEMORY_MORE_RELIABLE(0x10000)
+			attribute is added to range 0x100000000-0x180000000 and
+			0x10a0000000-0x1120000000.
+
+			Using this parameter you can do debugging of EFI memmap
+			related feature. For example, you can do debugging of
+			Address Range Mirroring feature even if your box
+			doesn't support it.
+
+	efivar_ssdt=	[EFI; X86] Name of an EFI variable that contains an SSDT
+			that is to be dynamically loaded by Linux. If there are
+			multiple variables with the same name but with different
+			vendor GUIDs, all of them will be loaded. See
+			Documentation/acpi/ssdt-overlays.txt for details.
+
+
+	eisa_irq_edge=	[PARISC,HW]
+			See header of drivers/parisc/eisa.c.
+
+	elanfreq=	[X86-32]
+			See comment before function elanfreq_setup() in
+			arch/x86/kernel/cpu/cpufreq/elanfreq.c.
+
+	elevator=	[IOSCHED]
+			Format: {"cfq" | "deadline" | "noop"}
+			See Documentation/block/cfq-iosched.txt and
+			Documentation/block/deadline-iosched.txt for details.
+
+	elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
+			Specifies physical address of start of kernel core
+			image elf header and optionally the size. Generally
+			kexec loader will pass this option to capture kernel.
+			See Documentation/kdump/kdump.txt for details.
+
+	enable_mtrr_cleanup [X86]
+			The kernel tries to adjust MTRR layout from continuous
+			to discrete, to make X server driver able to add WB
+			entry later. This parameter enables that.
+
+	enable_timer_pin_1 [X86]
+			Enable PIN 1 of APIC timer
+			Can be useful to work around chipset bugs
+			(in particular on some ATI chipsets).
+			The kernel tries to set a reasonable default.
+
+	enforcing	[SELINUX] Set initial enforcing status.
+			Format: {"0" | "1"}
+			See security/selinux/Kconfig help text.
+			0 -- permissive (log only, no denials).
+			1 -- enforcing (deny and log).
+			Default value is 0.
+			Value can be changed at runtime via /selinux/enforce.
+
+	erst_disable	[ACPI]
+			Disable Error Record Serialization Table (ERST)
+			support.
+
+	ether=		[HW,NET] Ethernet cards parameters
+			This option is obsoleted by the "netdev=" option, which
+			has equivalent usage. See its documentation for details.
+
+	evm=		[EVM]
+			Format: { "fix" }
+			Permit 'security.evm' to be updated regardless of
+			current integrity status.
+
+	failslab=
+	fail_page_alloc=
+	fail_make_request=[KNL]
+			General fault injection mechanism.
+			Format: <interval>,<probability>,<space>,<times>
+			See also Documentation/fault-injection/.
+
+	floppy=		[HW]
+			See Documentation/blockdev/floppy.txt.
+
+	force_pal_cache_flush
+			[IA-64] Avoid check_sal_cache_flush which may hang on
+			buggy SAL_CACHE_FLUSH implementations. Using this
+			parameter will force ia64_sal_cache_flush to call
+			ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
+
+	forcepae	[X86-32]
+			Forcefully enable Physical Address Extension (PAE).
+			Many Pentium M systems disable PAE but may have a
+			functionally usable PAE implementation.
+			Warning: use of this parameter will taint the kernel
+			and may cause unknown problems.
+
+	ftrace=[tracer]
+			[FTRACE] will set and start the specified tracer
+			as early as possible in order to facilitate early
+			boot debugging.
+
+	ftrace_dump_on_oops[=orig_cpu]
+			[FTRACE] will dump the trace buffers on oops.
+			If no parameter is passed, ftrace will dump
+			buffers of all CPUs, but if you pass orig_cpu, it will
+			dump only the buffer of the CPU that triggered the
+			oops.
+
+	ftrace_filter=[function-list]
+			[FTRACE] Limit the functions traced by the function
+			tracer at boot up. function-list is a comma separated
+			list of functions. This list can be changed at run
+			time by the set_ftrace_filter file in the debugfs
+			tracing directory.
+
+	ftrace_notrace=[function-list]
+			[FTRACE] Do not trace the functions specified in
+			function-list. This list can be changed at run time
+			by the set_ftrace_notrace file in the debugfs
+			tracing directory.
+
+	ftrace_graph_filter=[function-list]
+			[FTRACE] Limit the top level callers functions traced
+			by the function graph tracer at boot up.
+			function-list is a comma separated list of functions
+			that can be changed at run time by the
+			set_graph_function file in the debugfs tracing directory.
+
+	ftrace_graph_notrace=[function-list]
+			[FTRACE] Do not trace from the functions specified in
+			function-list.  This list is a comma separated list of
+			functions that can be changed at run time by the
+			set_graph_notrace file in the debugfs tracing directory.
+
+	ftrace_graph_max_depth=<uint>
+			[FTRACE] Used with the function graph tracer. This is
+			the max depth it will trace into a function. This value
+			can be changed at run time by the max_graph_depth file
+			in the tracefs tracing directory. default: 0 (no limit)
+
+	gamecon.map[2|3]=
+			[HW,JOY] Multisystem joystick and NES/SNES/PSX pad
+			support via parallel port (up to 5 devices per port)
+			Format: <port#>,<pad1>,<pad2>,<pad3>,<pad4>,<pad5>
+			See also Documentation/input/devices/joystick-parport.rst
+
+	gamma=		[HW,DRM]
+
+	gart_fix_e820=	[X86_64] disable the fix e820 for K8 GART
+			Format: off | on
+			default: on
+
+	gcov_persist=	[GCOV] When non-zero (default), profiling data for
+			kernel modules is saved and remains accessible via
+			debugfs, even when the module is unloaded/reloaded.
+			When zero, profiling data is discarded and associated
+			debugfs files are removed at module unload time.
+
+	goldfish	[X86] Enable the goldfish android emulator platform.
+			Don't use this when you are not running on the
+			android emulator
+
+	gpt		[EFI] Forces disk with valid GPT signature but
+			invalid Protective MBR to be treated as GPT. If the
+			primary GPT is corrupted, it enables the backup/alternate
+			GPT to be used instead.
+
+	grcan.enable0=	[HW] Configuration of physical interface 0. Determines
+			the "Enable 0" bit of the configuration register.
+			Format: 0 | 1
+			Default: 0
+	grcan.enable1=	[HW] Configuration of physical interface 1. Determines
+			the "Enable 0" bit of the configuration register.
+			Format: 0 | 1
+			Default: 0
+	grcan.select=	[HW] Select which physical interface to use.
+			Format: 0 | 1
+			Default: 0
+	grcan.txsize=	[HW] Sets the size of the tx buffer.
+			Format: <unsigned int> such that (txsize & ~0x1fffc0) == 0.
+			Default: 1024
+	grcan.rxsize=	[HW] Sets the size of the rx buffer.
+			Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
+			Default: 1024
+
+	gpio-mockup.gpio_mockup_ranges
+			[HW] Sets the ranges of gpiochip of for this device.
+			Format: <start1>,<end1>,<start2>,<end2>...
+
+	hardlockup_all_cpu_backtrace=
+			[KNL] Should the hard-lockup detector generate
+			backtraces on all cpus.
+			Format: <integer>
+
+	hashdist=	[KNL,NUMA] Large hashes allocated during boot
+			are distributed across NUMA nodes.  Defaults on
+			for 64-bit NUMA, off otherwise.
+			Format: 0 | 1 (for off | on)
+
+	hcl=		[IA-64] SGI's Hardware Graph compatibility layer
+
+	hd=		[EIDE] (E)IDE hard drive subsystem geometry
+			Format: <cyl>,<head>,<sect>
+
+	hest_disable	[ACPI]
+			Disable Hardware Error Source Table (HEST) support;
+			corresponding firmware-first mode error processing
+			logic will be disabled.
+
+	highmem=nn[KMG]	[KNL,BOOT] forces the highmem zone to have an exact
+			size of <nn>. This works even on boxes that have no
+			highmem otherwise. This also works to reduce highmem
+			size on bigger boxes.
+
+	highres=	[KNL] Enable/disable high resolution timer mode.
+			Valid parameters: "on", "off"
+			Default: "on"
+
+	hisax=		[HW,ISDN]
+			See Documentation/isdn/README.HiSax.
+
+	hlt		[BUGS=ARM,SH]
+
+	hpet=		[X86-32,HPET] option to control HPET usage
+			Format: { enable (default) | disable | force |
+				verbose }
+			disable: disable HPET and use PIT instead
+			force: allow force enabled of undocumented chips (ICH4,
+				VIA, nVidia)
+			verbose: show contents of HPET registers during setup
+
+	hpet_mmap=	[X86, HPET_MMAP] Allow userspace to mmap HPET
+			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
+
+	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
+			On x86-64 and powerpc, this option can be specified
+			multiple times interleaved with hugepages= to reserve
+			huge pages of different sizes. Valid pages sizes on
+			x86-64 are 2M (when the CPU supports "pse") and 1G
+			(when the CPU supports the "pdpe1gb" cpuinfo flag).
+
+	hung_task_panic=
+			[KNL] Should the hung task detector generate panics.
+			Format: <integer>
+
+			A nonzero value instructs the kernel to panic when a
+			hung task is detected. The default value is controlled
+			by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
+			option. The value selected by this boot parameter can
+			be changed later by the kernel.hung_task_panic sysctl.
+
+	hvc_iucv=	[S390]	Number of z/VM IUCV hypervisor console (HVC)
+				terminal devices. Valid values: 0..8
+	hvc_iucv_allow=	[S390]	Comma-separated list of z/VM user IDs.
+				If specified, z/VM IUCV HVC accepts connections
+				from listed z/VM user IDs only.
+	keep_bootcon	[KNL]
+			Do not unregister boot console at start. This is only
+			useful for debugging when something happens in the window
+			between unregistering the boot console and initializing
+			the real console.
+
+	i2c_bus=	[HW]	Override the default board specific I2C bus speed
+				or register an additional I2C bus that is not
+				registered from board initialization code.
+				Format:
+				<bus_id>,<clkrate>
+
+	i8042.debug	[HW] Toggle i8042 debug mode
+	i8042.unmask_kbd_data
+			[HW] Enable printing of interrupt data from the KBD port
+			     (disabled by default, and as a pre-condition
+			     requires that i8042.debug=1 be enabled)
+	i8042.direct	[HW] Put keyboard port into non-translated mode
+	i8042.dumbkbd	[HW] Pretend that controller can only read data from
+			     keyboard and cannot control its state
+			     (Don't attempt to blink the leds)
+	i8042.noaux	[HW] Don't check for auxiliary (== mouse) port
+	i8042.nokbd	[HW] Don't check/create keyboard port
+	i8042.noloop	[HW] Disable the AUX Loopback command while probing
+			     for the AUX port
+	i8042.nomux	[HW] Don't check presence of an active multiplexing
+			     controller
+	i8042.nopnp	[HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX
+			     controllers
+	i8042.notimeout	[HW] Ignore timeout condition signalled by controller
+	i8042.reset	[HW] Reset the controller during init, cleanup and
+			     suspend-to-ram transitions, only during s2r
+			     transitions, or never reset
+			Format: { 1 | Y | y | 0 | N | n }
+			1, Y, y: always reset controller
+			0, N, n: don't ever reset controller
+			Default: only on s2r transitions on x86; most other
+			architectures force reset to be always executed
+	i8042.unlock	[HW] Unlock (ignore) the keylock
+	i8042.kbdreset	[HW] Reset device connected to KBD port
+	i8042.probe_defer
+			[HW] Allow deferred probing upon i8042 probe errors
+
+	i810=		[HW,DRM]
+
+	i8k.ignore_dmi	[HW] Continue probing hardware even if DMI data
+			indicates that the driver is running on unsupported
+			hardware.
+	i8k.force	[HW] Activate i8k driver even if SMM BIOS signature
+			does not match list of supported models.
+	i8k.power_status
+			[HW] Report power status in /proc/i8k
+			(disabled by default)
+	i8k.restricted	[HW] Allow controlling fans only if SYS_ADMIN
+			capability is set.
+
+	i915.invert_brightness=
+			[DRM] Invert the sense of the variable that is used to
+			set the brightness of the panel backlight. Normally a
+			brightness value of 0 indicates backlight switched off,
+			and the maximum of the brightness value sets the backlight
+			to maximum brightness. If this parameter is set to 0
+			(default) and the machine requires it, or this parameter
+			is set to 1, a brightness value of 0 sets the backlight
+			to maximum brightness, and the maximum of the brightness
+			value switches the backlight off.
+			-1 -- never invert brightness
+			 0 -- machine default
+			 1 -- force brightness inversion
+
+	icn=		[HW,ISDN]
+			Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]
+
+	ide-core.nodma=	[HW] (E)IDE subsystem
+			Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
+			.vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr
+			.cdrom .chs .ignore_cable are additional options
+			See Documentation/ide/ide.txt.
+
+	ide-generic.probe-mask= [HW] (E)IDE subsystem
+			Format: <int>
+			Probe mask for legacy ISA IDE ports.  Depending on
+			platform up to 6 ports are supported, enabled by
+			setting corresponding bits in the mask to 1.  The
+			default value is 0x0, which has a special meaning.
+			On systems that have PCI, it triggers scanning the
+			PCI bus for the first and the second port, which
+			are then probed.  On systems without PCI the value
+			of 0x0 enables probing the two first ports as if it
+			was 0x3.
+
+	ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem
+			Claim all unknown PCI IDE storage controllers.
+
+	idle=		[X86]
+			Format: idle=poll, idle=halt, idle=nomwait
+			Poll forces a polling idle loop that can slightly
+			improve the performance of waking up a idle CPU, but
+			will use a lot of power and make the system run hot.
+			Not recommended.
+			idle=halt: Halt is forced to be used for CPU idle.
+			In such case C2/C3 won't be used again.
+			idle=nomwait: Disable mwait for CPU C-states
+
+	ieee754=	[MIPS] Select IEEE Std 754 conformance mode
+			Format: { strict | legacy | 2008 | relaxed }
+			Default: strict
+
+			Choose which programs will be accepted for execution
+			based on the IEEE 754 NaN encoding(s) supported by
+			the FPU and the NaN encoding requested with the value
+			of an ELF file header flag individually set by each
+			binary.  Hardware implementations are permitted to
+			support either or both of the legacy and the 2008 NaN
+			encoding mode.
+
+			Available settings are as follows:
+			strict	accept binaries that request a NaN encoding
+				supported by the FPU
+			legacy	only accept legacy-NaN binaries, if supported
+				by the FPU
+			2008	only accept 2008-NaN binaries, if supported
+				by the FPU
+			relaxed	accept any binaries regardless of whether
+				supported by the FPU
+
+			The FPU emulator is always able to support both NaN
+			encodings, so if no FPU hardware is present or it has
+			been disabled with 'nofpu', then the settings of
+			'legacy' and '2008' strap the emulator accordingly,
+			'relaxed' straps the emulator for both legacy-NaN and
+			2008-NaN, whereas 'strict' enables legacy-NaN only on
+			legacy processors and both NaN encodings on MIPS32 or
+			MIPS64 CPUs.
+
+			The setting for ABS.fmt/NEG.fmt instruction execution
+			mode generally follows that for the NaN encoding,
+			except where unsupported by hardware.
+
+	ignore_loglevel	[KNL]
+			Ignore loglevel setting - this will print /all/
+			kernel messages to the console. Useful for debugging.
+			We also add it as printk module parameter, so users
+			could change it dynamically, usually by
+			/sys/module/printk/parameters/ignore_loglevel.
+
+	ignore_rlimit_data
+			Ignore RLIMIT_DATA setting for data mappings,
+			print warning at first misuse.  Can be changed via
+			/sys/module/kernel/parameters/ignore_rlimit_data.
+
+	ihash_entries=	[KNL]
+			Set number of hash buckets for inode cache.
+
+	ima_appraise=	[IMA] appraise integrity measurements
+			Format: { "off" | "enforce" | "fix" | "log" }
+			default: "enforce"
+
+	ima_appraise_tcb [IMA]
+			The builtin appraise policy appraises all files
+			owned by uid=0.
+
+	ima_canonical_fmt [IMA]
+			Use the canonical format for the binary runtime
+			measurements, instead of host native format.
+
+	ima_hash=	[IMA]
+			Format: { md5 | sha1 | rmd160 | sha256 | sha384
+				   | sha512 | ... }
+			default: "sha1"
+
+			The list of supported hash algorithms is defined
+			in crypto/hash_info.h.
+
+	ima_policy=	[IMA]
+			The builtin policies to load during IMA setup.
+			Format: "tcb | appraise_tcb | secure_boot |
+				 fail_securely"
+
+			The "tcb" policy measures all programs exec'd, files
+			mmap'd for exec, and all files opened with the read
+			mode bit set by either the effective uid (euid=0) or
+			uid=0.
+
+			The "appraise_tcb" policy appraises the integrity of
+			all files owned by root. (This is the equivalent
+			of ima_appraise_tcb.)
+
+			The "secure_boot" policy appraises the integrity
+			of files (eg. kexec kernel image, kernel modules,
+			firmware, policy, etc) based on file signatures.
+
+			The "fail_securely" policy forces file signature
+			verification failure also on privileged mounted
+			filesystems with the SB_I_UNVERIFIABLE_SIGNATURE
+			flag.
+
+	ima_tcb		[IMA] Deprecated.  Use ima_policy= instead.
+			Load a policy which meets the needs of the Trusted
+			Computing Base.  This means IMA will measure all
+			programs exec'd, files mmap'd for exec, and all files
+			opened for read by uid=0.
+
+	ima_template=	[IMA]
+			Select one of defined IMA measurements template formats.
+			Formats: { "ima" | "ima-ng" | "ima-sig" }
+			Default: "ima-ng"
+
+	ima_template_fmt=
+			[IMA] Define a custom template format.
+			Format: { "field1|...|fieldN" }
+
+	ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage
+			Format: <min_file_size>
+			Set the minimal file size for using asynchronous hash.
+			If left unspecified, ahash usage is disabled.
+
+			ahash performance varies for different data sizes on
+			different crypto accelerators. This option can be used
+			to achieve the best performance for a particular HW.
+
+	ima.ahash_bufsize= [IMA] Asynchronous hash buffer size
+			Format: <bufsize>
+			Set hashing buffer size. Default: 4k.
+
+			ahash performance varies for different chunk sizes on
+			different crypto accelerators. This option can be used
+			to achieve best performance for particular HW.
+
+	init=		[KNL]
+			Format: <full_path>
+			Run specified binary instead of /sbin/init as init
+			process.
+
+	initcall_debug	[KNL] Trace initcalls as they are executed.  Useful
+			for working out where the kernel is dying during
+			startup.
+
+	initcall_blacklist=  [KNL] Do not execute a comma-separated list of
+			initcall functions.  Useful for debugging built-in
+			modules and initcalls.
+
+	initrd=		[BOOT] Specify the location of the initial ramdisk
+
+	init_pkru=	[x86] Specify the default memory protection keys rights
+			register contents for all processes.  0x55555554 by
+			default (disallow access to all but pkey 0).  Can
+			override in debugfs after boot.
+
+	inport.irq=	[HW] Inport (ATI XL and Microsoft) busmouse driver
+			Format: <irq>
+
+	int_pln_enable	[x86] Enable power limit notification interrupt
+
+	integrity_audit=[IMA]
+			Format: { "0" | "1" }
+			0 -- basic integrity auditing messages. (Default)
+			1 -- additional integrity auditing messages.
+
+	intel_iommu=	[DMAR] Intel IOMMU driver (DMAR) option
+		on
+			Enable intel iommu driver.
+		off
+			Disable intel iommu driver.
+		igfx_off [Default Off]
+			By default, gfx is mapped as normal device. If a gfx
+			device has a dedicated DMAR unit, the DMAR unit is
+			bypassed by not enabling DMAR with this option. In
+			this case, gfx device will use physical address for
+			DMA.
+		forcedac [x86_64]
+			With this option iommu will not optimize to look
+			for io virtual address below 32-bit forcing dual
+			address cycle on pci bus for cards supporting greater
+			than 32-bit addressing. The default is to look
+			for translation below 32-bit and if not available
+			then look in the higher range.
+		strict [Default Off]
+			With this option on every unmap_single operation will
+			result in a hardware IOTLB flush operation as opposed
+			to batching them for performance.
+		sp_off [Default Off]
+			By default, super page will be supported if Intel IOMMU
+			has the capability. With this option, super page will
+			not be supported.
+		ecs_off [Default Off]
+			By default, extended context tables will be supported if
+			the hardware advertises that it has support both for the
+			extended tables themselves, and also PASID support. With
+			this option set, extended tables will not be used even
+			on hardware which claims to support them.
+		tboot_noforce [Default Off]
+			Do not force the Intel IOMMU enabled under tboot.
+			By default, tboot will force Intel IOMMU on, which
+			could harm performance of some high-throughput
+			devices like 40GBit network cards, even if identity
+			mapping is enabled.
+			Note that using this option lowers the security
+			provided by tboot because it makes the system
+			vulnerable to DMA attacks.
+
+	intel_idle.max_cstate=	[KNL,HW,ACPI,X86]
+			0	disables intel_idle and fall back on acpi_idle.
+			1 to 9	specify maximum depth of C-state.
+
+	intel_pstate=	[X86]
+			disable
+			  Do not enable intel_pstate as the default
+			  scaling driver for the supported processors
+			passive
+			  Use intel_pstate as a scaling driver, but configure it
+			  to work with generic cpufreq governors (instead of
+			  enabling its internal governor).  This mode cannot be
+			  used along with the hardware-managed P-states (HWP)
+			  feature.
+			force
+			  Enable intel_pstate on systems that prohibit it by default
+			  in favor of acpi-cpufreq. Forcing the intel_pstate driver
+			  instead of acpi-cpufreq may disable platform features, such
+			  as thermal controls and power capping, that rely on ACPI
+			  P-States information being indicated to OSPM and therefore
+			  should be used with caution. This option does not work with
+			  processors that aren't supported by the intel_pstate driver
+			  or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
+			no_hwp
+			  Do not enable hardware P state control (HWP)
+			  if available.
+			hwp_only
+			  Only load intel_pstate on systems which support
+			  hardware P state control (HWP) if available.
+			support_acpi_ppc
+			  Enforce ACPI _PPC performance limits. If the Fixed ACPI
+			  Description Table, specifies preferred power management
+			  profile as "Enterprise Server" or "Performance Server",
+			  then this feature is turned on by default.
+			per_cpu_perf_limits
+			  Allow per-logical-CPU P-State performance control limits using
+			  cpufreq sysfs interface
+
+	intremap=	[X86-64, Intel-IOMMU]
+			on	enable Interrupt Remapping (default)
+			off	disable Interrupt Remapping
+			nosid	disable Source ID checking
+			no_x2apic_optout
+				BIOS x2APIC opt-out request will be ignored
+			nopost	disable Interrupt Posting
+
+	iomem=		Disable strict checking of access to MMIO memory
+		strict	regions from userspace.
+		relaxed
+
+	iommu=		[x86]
+		off
+		force
+		noforce
+		biomerge
+		panic
+		nopanic
+		merge
+		nomerge
+		soft
+		pt		[x86]
+		nopt		[x86]
+		nobypass	[PPC/POWERNV]
+			Disable IOMMU bypass, using IOMMU for PCI devices.
+
+	iommu.passthrough=
+			[ARM64] Configure DMA to bypass the IOMMU by default.
+			Format: { "0" | "1" }
+			0 - Use IOMMU translation for DMA.
+			1 - Bypass the IOMMU for DMA.
+			unset - Use IOMMU translation for DMA.
+
+	io7=		[HW] IO7 for Marvel based alpha systems
+			See comment before marvel_specify_io7 in
+			arch/alpha/kernel/core_marvel.c.
+
+	io_delay=	[X86] I/O delay method
+		0x80
+			Standard port 0x80 based delay
+		0xed
+			Alternate port 0xed based delay (needed on some systems)
+		udelay
+			Simple two microseconds delay
+		none
+			No delay
+
+	ip=		[IP_PNP]
+			See Documentation/filesystems/nfs/nfsroot.txt.
+
+	irqaffinity=	[SMP] Set the default irq affinity mask
+			The argument is a cpu list, as described above.
+
+	irqchip.gicv2_force_probe=
+			[ARM, ARM64]
+			Format: <bool>
+			Force the kernel to look for the second 4kB page
+			of a GICv2 controller even if the memory range
+			exposed by the device tree is too small.
+
+	irqchip.gicv3_nolpi=
+			[ARM, ARM64]
+			Force the kernel to ignore the availability of
+			LPIs (and by consequence ITSs). Intended for system
+			that use the kernel as a bootloader, and thus want
+			to let secondary kernels in charge of setting up
+			LPIs.
+
+	irqfixup	[HW]
+			When an interrupt is not handled search all handlers
+			for it. Intended to get systems with badly broken
+			firmware running.
+
+	irqpoll		[HW]
+			When an interrupt is not handled search all handlers
+			for it. Also check all handlers each timer
+			interrupt. Intended to get systems with badly broken
+			firmware running.
+
+	isapnp=		[ISAPNP]
+			Format: <RDP>,<reset>,<pci_scan>,<verbosity>
+
+	isolcpus=	[KNL,SMP,ISOL] Isolate a given set of CPUs from disturbance.
+			[Deprecated - use cpusets instead]
+			Format: [flag-list,]<cpu-list>
+
+			Specify one or more CPUs to isolate from disturbances
+			specified in the flag list (default: domain):
+
+			nohz
+			  Disable the tick when a single task runs.
+
+			  A residual 1Hz tick is offloaded to workqueues, which you
+			  need to affine to housekeeping through the global
+			  workqueue's affinity configured via the
+			  /sys/devices/virtual/workqueue/cpumask sysfs file, or
+			  by using the 'domain' flag described below.
+
+			  NOTE: by default the global workqueue runs on all CPUs,
+			  so to protect individual CPUs the 'cpumask' file has to
+			  be configured manually after bootup.
+
+			domain
+			  Isolate from the general SMP balancing and scheduling
+			  algorithms. Note that performing domain isolation this way
+			  is irreversible: it's not possible to bring back a CPU to
+			  the domains once isolated through isolcpus. It's strongly
+			  advised to use cpusets instead to disable scheduler load
+			  balancing through the "cpuset.sched_load_balance" file.
+			  It offers a much more flexible interface where CPUs can
+			  move in and out of an isolated set anytime.
+
+			  You can move a process onto or off an "isolated" CPU via
+			  the CPU affinity syscalls or cpuset.
+			  <cpu number> begins at 0 and the maximum value is
+			  "number of CPUs in system - 1".
+
+			The format of <cpu-list> is described above.
+
+
+
+	iucv=		[HW,NET]
+
+	ivrs_ioapic	[HW,X86_64]
+			Provide an override to the IOAPIC-ID<->DEVICE-ID
+			mapping provided in the IVRS ACPI table. For
+			example, to map IOAPIC-ID decimal 10 to
+			PCI device 00:14.0 write the parameter as:
+				ivrs_ioapic[10]=00:14.0
+
+	ivrs_hpet	[HW,X86_64]
+			Provide an override to the HPET-ID<->DEVICE-ID
+			mapping provided in the IVRS ACPI table. For
+			example, to map HPET-ID decimal 0 to
+			PCI device 00:14.0 write the parameter as:
+				ivrs_hpet[0]=00:14.0
+
+	ivrs_acpihid	[HW,X86_64]
+			Provide an override to the ACPI-HID:UID<->DEVICE-ID
+			mapping provided in the IVRS ACPI table. For
+			example, to map UART-HID:UID AMD0020:0 to
+			PCI device 00:14.5 write the parameter as:
+				ivrs_acpihid[00:14.5]=AMD0020:0
+
+	js=		[HW,JOY] Analog joystick
+			See Documentation/input/joydev/joystick.rst.
+
+	nokaslr		[KNL]
+			When CONFIG_RANDOMIZE_BASE is set, this disables
+			kernel and module base offset ASLR (Address Space
+			Layout Randomization).
+
+	kasan_multi_shot
+			[KNL] Enforce KASAN (Kernel Address Sanitizer) to print
+			report on every invalid memory access. Without this
+			parameter KASAN will print report only for the first
+			invalid access.
+
+	keepinitrd	[HW,ARM]
+
+	kernelcore=	[KNL,X86,IA-64,PPC]
+			Format: nn[KMGTPE] | nn% | "mirror"
+			This parameter specifies the amount of memory usable by
+			the kernel for non-movable allocations.  The requested
+			amount is spread evenly throughout all nodes in the
+			system as ZONE_NORMAL.  The remaining memory is used for
+			movable memory in its own zone, ZONE_MOVABLE.  In the
+			event, a node is too small to have both ZONE_NORMAL and
+			ZONE_MOVABLE, kernelcore memory will take priority and
+			other nodes will have a larger ZONE_MOVABLE.
+
+			ZONE_MOVABLE is used for the allocation of pages that
+			may be reclaimed or moved by the page migration
+			subsystem.  Note that allocations like PTEs-from-HighMem
+			still use the HighMem zone if it exists, and the Normal
+			zone if it does not.
+
+			It is possible to specify the exact amount of memory in
+			the form of "nn[KMGTPE]", a percentage of total system
+			memory in the form of "nn%", or "mirror".  If "mirror"
+			option is specified, mirrored (reliable) memory is used
+			for non-movable allocations and remaining memory is used
+			for Movable pages.  "nn[KMGTPE]", "nn%", and "mirror"
+			are exclusive, so you cannot specify multiple forms.
+
+	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
+			Format: <Controller#>[,poll interval]
+			The controller # is the number of the ehci usb debug
+			port as it is probed via PCI.  The poll interval is
+			optional and is the number seconds in between
+			each poll cycle to the debug port in case you need
+			the functionality for interrupting the kernel with
+			gdb or control-c on the dbgp connection.  When
+			not using this parameter you use sysrq-g to break into
+			the kernel debugger.
+
+	kgdboc=		[KGDB,HW] kgdb over consoles.
+			Requires a tty driver that supports console polling,
+			or a supported polling keyboard driver (non-usb).
+			 Serial only format: <serial_device>[,baud]
+			 keyboard only format: kbd
+			 keyboard and serial format: kbd,<serial_device>[,baud]
+			Optional Kernel mode setting:
+			 kms, kbd format: kms,kbd
+			 kms, kbd and serial format: kms,kbd,<ser_dev>[,baud]
+
+	kgdbwait	[KGDB] Stop kernel execution and enter the
+			kernel debugger at the earliest opportunity.
+
+	kmac=		[MIPS] korina ethernet MAC address.
+			Configure the RouterBoard 532 series on-chip
+			Ethernet adapter MAC address.
+
+	kmemleak=	[KNL] Boot-time kmemleak enable/disable
+			Valid arguments: on, off
+			Default: on
+			Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,
+			the default is off.
+
+	kpti=		[ARM64] Control page table isolation of user
+			and kernel address spaces.
+			Default: enabled on cores which need mitigation.
+			0: force disabled
+			1: force enabled
+
+	kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
+			Default is 0 (don't ignore, but inject #GP)
+
+	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
+				   Default is false (don't support).
+
+	kvm.mmu_audit=	[KVM] This is a R/W parameter which allows audit
+			KVM MMU at runtime.
+			Default is 0 (off)
+
+	kvm.nx_huge_pages=
+			[KVM] Controls the software workaround for the
+			X86_BUG_ITLB_MULTIHIT bug.
+			force	: Always deploy workaround.
+			off	: Never deploy workaround.
+			auto    : Deploy workaround based on the presence of
+				  X86_BUG_ITLB_MULTIHIT.
+
+			Default is 'auto'.
+
+			If the software workaround is enabled for the host,
+			guests do need not to enable it for nested guests.
+
+	kvm.nx_huge_pages_recovery_ratio=
+			[KVM] Controls how many 4KiB pages are periodically zapped
+			back to huge pages.  0 disables the recovery, otherwise if
+			the value is N KVM will zap 1/Nth of the 4KiB pages every
+			minute.  The default is 60.
+
+	kvm-amd.nested=	[KVM,AMD] Allow nested virtualization in KVM/SVM.
+			Default is 1 (enabled)
+
+	kvm-amd.npt=	[KVM,AMD] Disable nested paging (virtualized MMU)
+			for all guests.
+			Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.
+
+	kvm-arm.vgic_v3_group0_trap=
+			[KVM,ARM] Trap guest accesses to GICv3 group-0
+			system registers
+
+	kvm-arm.vgic_v3_group1_trap=
+			[KVM,ARM] Trap guest accesses to GICv3 group-1
+			system registers
+
+	kvm-arm.vgic_v3_common_trap=
+			[KVM,ARM] Trap guest accesses to GICv3 common
+			system registers
+
+	kvm-arm.vgic_v4_enable=
+			[KVM,ARM] Allow use of GICv4 for direct injection of
+			LPIs.
+
+	kvm-intel.ept=	[KVM,Intel] Disable extended page tables
+			(virtualized MMU) support on capable Intel chips.
+			Default is 1 (enabled)
+
+	kvm-intel.emulate_invalid_guest_state=
+			[KVM,Intel] Disable emulation of invalid guest state.
+			Ignored if kvm-intel.enable_unrestricted_guest=1, as
+			guest state is never invalid for unrestricted guests.
+			This param doesn't apply to nested guests (L2), as KVM
+			never emulates invalid L2 guest state.
+			Default is 1 (enabled)
+
+	kvm-intel.flexpriority=
+			[KVM,Intel] Disable FlexPriority feature (TPR shadow).
+			Default is 1 (enabled)
+
+	kvm-intel.nested=
+			[KVM,Intel] Enable VMX nesting (nVMX).
+			Default is 0 (disabled)
+
+	kvm-intel.unrestricted_guest=
+			[KVM,Intel] Disable unrestricted guest feature
+			(virtualized real and unpaged mode) on capable
+			Intel chips. Default is 1 (enabled)
+
+	kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
+			CVE-2018-3620.
+
+			Valid arguments: never, cond, always
+
+			always: L1D cache flush on every VMENTER.
+			cond:	Flush L1D on VMENTER only when the code between
+				VMEXIT and VMENTER can leak host memory.
+			never:	Disables the mitigation
+
+			Default is cond (do L1 cache flush in specific instances)
+
+	kvm-intel.vpid=	[KVM,Intel] Disable Virtual Processor Identification
+			feature (tagged TLBs) on capable Intel chips.
+			Default is 1 (enabled)
+
+	l1tf=           [X86] Control mitigation of the L1TF vulnerability on
+			      affected CPUs
+
+			The kernel PTE inversion protection is unconditionally
+			enabled and cannot be disabled.
+
+			full
+				Provides all available mitigations for the
+				L1TF vulnerability. Disables SMT and
+				enables all mitigations in the
+				hypervisors, i.e. unconditional L1D flush.
+
+				SMT control and L1D flush control via the
+				sysfs interface is still possible after
+				boot.  Hypervisors will issue a warning
+				when the first VM is started in a
+				potentially insecure configuration,
+				i.e. SMT enabled or L1D flush disabled.
+
+			full,force
+				Same as 'full', but disables SMT and L1D
+				flush runtime control. Implies the
+				'nosmt=force' command line option.
+				(i.e. sysfs control of SMT is disabled.)
+
+			flush
+				Leaves SMT enabled and enables the default
+				hypervisor mitigation, i.e. conditional
+				L1D flush.
+
+				SMT control and L1D flush control via the
+				sysfs interface is still possible after
+				boot.  Hypervisors will issue a warning
+				when the first VM is started in a
+				potentially insecure configuration,
+				i.e. SMT enabled or L1D flush disabled.
+
+			flush,nosmt
+
+				Disables SMT and enables the default
+				hypervisor mitigation.
+
+				SMT control and L1D flush control via the
+				sysfs interface is still possible after
+				boot.  Hypervisors will issue a warning
+				when the first VM is started in a
+				potentially insecure configuration,
+				i.e. SMT enabled or L1D flush disabled.
+
+			flush,nowarn
+				Same as 'flush', but hypervisors will not
+				warn when a VM is started in a potentially
+				insecure configuration.
+
+			off
+				Disables hypervisor mitigations and doesn't
+				emit any warnings.
+				It also drops the swap size and available
+				RAM limit restriction on both hypervisor and
+				bare metal.
+
+			Default is 'flush'.
+
+			For details see: Documentation/admin-guide/hw-vuln/l1tf.rst
+
+	l2cr=		[PPC]
+
+	l3cr=		[PPC]
+
+	lapic		[X86-32,APIC] Enable the local APIC even if BIOS
+			disabled it.
+
+	lapic=		[x86,APIC] "notscdeadline" Do not use TSC deadline
+			value for LAPIC timer one-shot implementation. Default
+			back to the programmable timer unit in the LAPIC.
+
+	lapic_timer_c2_ok	[X86,APIC] trust the local apic timer
+			in C2 power state.
+
+	libata.dma=	[LIBATA] DMA control
+			libata.dma=0	  Disable all PATA and SATA DMA
+			libata.dma=1	  PATA and SATA Disk DMA only
+			libata.dma=2	  ATAPI (CDROM) DMA only
+			libata.dma=4	  Compact Flash DMA only
+			Combinations also work, so libata.dma=3 enables DMA
+			for disks and CDROMs, but not CFs.
+
+	libata.ignore_hpa=	[LIBATA] Ignore HPA limit
+			libata.ignore_hpa=0	  keep BIOS limits (default)
+			libata.ignore_hpa=1	  ignore limits, using full disk
+
+	libata.noacpi	[LIBATA] Disables use of ACPI in libata suspend/resume
+			when set.
+			Format: <int>
+
+	libata.force=	[LIBATA] Force configurations.  The format is comma
+			separated list of "[ID:]VAL" where ID is
+			PORT[.DEVICE].  PORT and DEVICE are decimal numbers
+			matching port, link or device.  Basically, it matches
+			the ATA ID string printed on console by libata.  If
+			the whole ID part is omitted, the last PORT and DEVICE
+			values are used.  If ID hasn't been specified yet, the
+			configuration applies to all ports, links and devices.
+
+			If only DEVICE is omitted, the parameter applies to
+			the port and all links and devices behind it.  DEVICE
+			number of 0 either selects the first device or the
+			first fan-out link behind PMP device.  It does not
+			select the host link.  DEVICE number of 15 selects the
+			host link and device attached to it.
+
+			The VAL specifies the configuration to force.  As long
+			as there's no ambiguity shortcut notation is allowed.
+			For example, both 1.5 and 1.5G would work for 1.5Gbps.
+			The following configurations can be forced.
+
+			* Cable type: 40c, 80c, short40c, unk, ign or sata.
+			  Any ID with matching PORT is used.
+
+			* SATA link speed limit: 1.5Gbps or 3.0Gbps.
+
+			* Transfer mode: pio[0-7], mwdma[0-4] and udma[0-7].
+			  udma[/][16,25,33,44,66,100,133] notation is also
+			  allowed.
+
+			* [no]ncq: Turn on or off NCQ.
+
+			* [no]ncqtrim: Turn off queued DSM TRIM.
+
+			* nohrst, nosrst, norst: suppress hard, soft
+			  and both resets.
+
+			* rstonce: only attempt one reset during
+			  hot-unplug link recovery
+
+			* dump_id: dump IDENTIFY data.
+
+			* atapi_dmadir: Enable ATAPI DMADIR bridge support
+
+			* disable: Disable this device.
+
+			If there are multiple matching configurations changing
+			the same attribute, the last one is used.
+
+	memblock=debug	[KNL] Enable memblock debug messages.
+
+	load_ramdisk=	[RAM] List of ramdisks to load from floppy
+			See Documentation/blockdev/ramdisk.txt.
+
+	lockd.nlm_grace_period=P  [NFS] Assign grace period.
+			Format: <integer>
+
+	lockd.nlm_tcpport=N	[NFS] Assign TCP port.
+			Format: <integer>
+
+	lockd.nlm_timeout=T	[NFS] Assign timeout value.
+			Format: <integer>
+
+	lockd.nlm_udpport=M	[NFS] Assign UDP port.
+			Format: <integer>
+
+	locktorture.nreaders_stress= [KNL]
+			Set the number of locking read-acquisition kthreads.
+			Defaults to being automatically set based on the
+			number of online CPUs.
+
+	locktorture.nwriters_stress= [KNL]
+			Set the number of locking write-acquisition kthreads.
+
+	locktorture.onoff_holdoff= [KNL]
+			Set time (s) after boot for CPU-hotplug testing.
+
+	locktorture.onoff_interval= [KNL]
+			Set time (s) between CPU-hotplug operations, or
+			zero to disable CPU-hotplug testing.
+
+	locktorture.shuffle_interval= [KNL]
+			Set task-shuffle interval (jiffies).  Shuffling
+			tasks allows some CPUs to go into dyntick-idle
+			mode during the locktorture test.
+
+	locktorture.shutdown_secs= [KNL]
+			Set time (s) after boot system shutdown.  This
+			is useful for hands-off automated testing.
+
+	locktorture.stat_interval= [KNL]
+			Time (s) between statistics printk()s.
+
+	locktorture.stutter= [KNL]
+			Time (s) to stutter testing, for example,
+			specifying five seconds causes the test to run for
+			five seconds, wait for five seconds, and so on.
+			This tests the locking primitive's ability to
+			transition abruptly to and from idle.
+
+	locktorture.torture_type= [KNL]
+			Specify the locking implementation to test.
+
+	locktorture.verbose= [KNL]
+			Enable additional printk() statements.
+
+	logibm.irq=	[HW,MOUSE] Logitech Bus Mouse Driver
+			Format: <irq>
+
+	loglevel=	All Kernel Messages with a loglevel smaller than the
+			console loglevel will be printed to the console. It can
+			also be changed with klogd or other programs. The
+			loglevels are defined as follows:
+
+			0 (KERN_EMERG)		system is unusable
+			1 (KERN_ALERT)		action must be taken immediately
+			2 (KERN_CRIT)		critical conditions
+			3 (KERN_ERR)		error conditions
+			4 (KERN_WARNING)	warning conditions
+			5 (KERN_NOTICE)		normal but significant condition
+			6 (KERN_INFO)		informational
+			7 (KERN_DEBUG)		debug-level messages
+
+	log_buf_len=n[KMG]	Sets the size of the printk ring buffer,
+			in bytes.  n must be a power of two and greater
+			than the minimal size. The minimal size is defined
+			by LOG_BUF_SHIFT kernel config parameter. There is
+			also CONFIG_LOG_CPU_MAX_BUF_SHIFT config parameter
+			that allows to increase the default size depending on
+			the number of CPUs. See init/Kconfig for more details.
+
+	logo.nologo	[FB] Disables display of the built-in Linux logo.
+			This may be used to provide more screen space for
+			kernel log messages and is useful when debugging
+			kernel boot problems.
+
+	lp=0		[LP]	Specify parallel ports to use, e.g,
+	lp=port[,port...]	lp=none,parport0 (lp0 not configured, lp1 uses
+	lp=reset		first parallel port). 'lp=0' disables the
+	lp=auto			printer driver. 'lp=reset' (which can be
+				specified in addition to the ports) causes
+				attached printers to be reset. Using
+				lp=port1,port2,... specifies the parallel ports
+				to associate lp devices with, starting with
+				lp0. A port specification may be 'none' to skip
+				that lp device, or a parport name such as
+				'parport0'. Specifying 'lp=auto' instead of a
+				port specification list means that device IDs
+				from each port should be examined, to see if
+				an IEEE 1284-compliant printer is attached; if
+				so, the driver will manage that printer.
+				See also header of drivers/char/lp.c.
+
+	lpj=n		[KNL]
+			Sets loops_per_jiffy to given constant, thus avoiding
+			time-consuming boot-time autodetection (up to 250 ms per
+			CPU). 0 enables autodetection (default). To determine
+			the correct value for your kernel, boot with normal
+			autodetection and see what value is printed. Note that
+			on SMP systems the preset will be applied to all CPUs,
+			which is likely to cause problems if your CPUs need
+			significantly divergent settings. An incorrect value
+			will cause delays in the kernel to be wrong, leading to
+			unpredictable I/O errors and other breakage. Although
+			unlikely, in the extreme case this might damage your
+			hardware.
+
+	ltpc=		[NET]
+			Format: <io>,<irq>,<dma>
+
+	machvec=	[IA-64] Force the use of a particular machine-vector
+			(machvec) in a generic kernel.
+			Example: machvec=hpzx1_swiotlb
+
+	machtype=	[Loongson] Share the same kernel image file between different
+			 yeeloong laptop.
+			Example: machtype=lemote-yeeloong-2f-7inch
+
+	max_addr=nn[KMG]	[KNL,BOOT,ia64] All physical memory greater
+			than or equal to this physical address is ignored.
+
+	maxcpus=	[SMP] Maximum number of processors that	an SMP kernel
+			will bring up during bootup.  maxcpus=n : n >= 0 limits
+			the kernel to bring up 'n' processors. Surely after
+			bootup you can bring up the other plugged cpu by executing
+			"echo 1 > /sys/devices/system/cpu/cpuX/online". So maxcpus
+			only takes effect during system bootup.
+			While n=0 is a special case, it is equivalent to "nosmp",
+			which also disables the IO APIC.
+
+	max_loop=	[LOOP] The number of loop block devices that get
+	(loop.max_loop)	unconditionally pre-created at init time. The default
+			number is configured by BLK_DEV_LOOP_MIN_COUNT. Instead
+			of statically allocating a predefined number, loop
+			devices can be requested on-demand with the
+			/dev/loop-control interface.
+
+	mce		[X86-32] Machine Check Exception
+
+	mce=option	[X86-64] See Documentation/x86/x86_64/boot-options.txt
+
+	md=		[HW] RAID subsystems devices and level
+			See Documentation/admin-guide/md.rst.
+
+	mdacon=		[MDA]
+			Format: <first>,<last>
+			Specifies range of consoles to be captured by the MDA.
+
+	mds=		[X86,INTEL]
+			Control mitigation for the Micro-architectural Data
+			Sampling (MDS) vulnerability.
+
+			Certain CPUs are vulnerable to an exploit against CPU
+			internal buffers which can forward information to a
+			disclosure gadget under certain conditions.
+
+			In vulnerable processors, the speculatively
+			forwarded data can be used in a cache side channel
+			attack, to access data to which the attacker does
+			not have direct access.
+
+			This parameter controls the MDS mitigation. The
+			options are:
+
+			full       - Enable MDS mitigation on vulnerable CPUs
+			full,nosmt - Enable MDS mitigation and disable
+				     SMT on vulnerable CPUs
+			off        - Unconditionally disable MDS mitigation
+
+			On TAA-affected machines, mds=off can be prevented by
+			an active TAA mitigation as both vulnerabilities are
+			mitigated with the same mechanism so in order to disable
+			this mitigation, you need to specify tsx_async_abort=off
+			too.
+
+			Not specifying this option is equivalent to
+			mds=full.
+
+			For details see: Documentation/admin-guide/hw-vuln/mds.rst
+
+	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
+			Amount of memory to be used when the kernel is not able
+			to see the whole system memory or for test.
+			[X86] Work as limiting max address. Use together
+			with memmap= to avoid physical address space collisions.
+			Without memmap= PCI devices could be placed at addresses
+			belonging to unused RAM.
+
+	mem=nopentium	[BUGS=X86-32] Disable usage of 4MB pages for kernel
+			memory.
+
+	memchunk=nn[KMG]
+			[KNL,SH] Allow user to override the default size for
+			per-device physically contiguous DMA buffers.
+
+	memhp_default_state=online/offline
+			[KNL] Set the initial state for the memory hotplug
+			onlining policy. If not specified, the default value is
+			set according to the
+			CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
+			option.
+			See Documentation/memory-hotplug.txt.
+
+	memmap=exactmap	[KNL,X86] Enable setting of an exact
+			E820 memory map, as specified by the user.
+			Such memmap=exactmap lines can be constructed based on
+			BIOS output or other requirements. See the memmap=nn@ss
+			option description.
+
+	memmap=nn[KMG]@ss[KMG]
+			[KNL] Force usage of a specific region of memory.
+			Region of memory to be used is from ss to ss+nn.
+			If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG],
+			which limits max address to nn[KMG].
+			Multiple different regions can be specified,
+			comma delimited.
+			Example:
+				memmap=100M@2G,100M#3G,1G!1024G
+
+	memmap=nn[KMG]#ss[KMG]
+			[KNL,ACPI] Mark specific memory as ACPI data.
+			Region of memory to be marked is from ss to ss+nn.
+
+	memmap=nn[KMG]$ss[KMG]
+			[KNL,ACPI] Mark specific memory as reserved.
+			Region of memory to be reserved is from ss to ss+nn.
+			Example: Exclude memory from 0x18690000-0x1869ffff
+			         memmap=64K$0x18690000
+			         or
+			         memmap=0x10000$0x18690000
+			Some bootloaders may need an escape character before '$',
+			like Grub2, otherwise '$' and the following number
+			will be eaten.
+
+	memmap=nn[KMG]!ss[KMG]
+			[KNL,X86] Mark specific memory as protected.
+			Region of memory to be used, from ss to ss+nn.
+			The memory region may be marked as e820 type 12 (0xc)
+			and is NVDIMM or ADR memory.
+
+	memmap=<size>%<offset>-<oldtype>+<newtype>
+			[KNL,ACPI] Convert memory within the specified region
+			from <oldtype> to <newtype>. If "-<oldtype>" is left
+			out, the whole region will be marked as <newtype>,
+			even if previously unavailable. If "+<newtype>" is left
+			out, matching memory will be removed. Types are
+			specified as e820 types, e.g., 1 = RAM, 2 = reserved,
+			3 = ACPI, 12 = PRAM.
+
+	memory_corruption_check=0/1 [X86]
+			Some BIOSes seem to corrupt the first 64k of
+			memory when doing things like suspend/resume.
+			Setting this option will scan the memory
+			looking for corruption.  Enabling this will
+			both detect corruption and prevent the kernel
+			from using the memory being corrupted.
+			However, its intended as a diagnostic tool; if
+			repeatable BIOS-originated corruption always
+			affects the same memory, you can use memmap=
+			to prevent the kernel from using that memory.
+
+	memory_corruption_check_size=size [X86]
+			By default it checks for corruption in the low
+			64k, making this memory unavailable for normal
+			use.  Use this parameter to scan for
+			corruption in more or less memory.
+
+	memory_corruption_check_period=seconds [X86]
+			By default it checks for corruption every 60
+			seconds.  Use this parameter to check at some
+			other rate.  0 disables periodic checking.
+
+	memtest=	[KNL,X86,ARM] Enable memtest
+			Format: <integer>
+			default : 0 <disable>
+			Specifies the number of memtest passes to be
+			performed. Each pass selects another test
+			pattern from a given set of patterns. Memtest
+			fills the memory with this pattern, validates
+			memory contents and reserves bad memory
+			regions that are detected.
+
+	mem_encrypt=	[X86-64] AMD Secure Memory Encryption (SME) control
+			Valid arguments: on, off
+			Default (depends on kernel configuration option):
+			  on  (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y)
+			  off (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n)
+			mem_encrypt=on:		Activate SME
+			mem_encrypt=off:	Do not activate SME
+
+			Refer to Documentation/x86/amd-memory-encryption.txt
+			for details on when memory encryption can be activated.
+
+	mem_sleep_default=	[SUSPEND] Default system suspend mode:
+			s2idle  - Suspend-To-Idle
+			shallow - Power-On Suspend or equivalent (if supported)
+			deep    - Suspend-To-RAM or equivalent (if supported)
+			See Documentation/admin-guide/pm/sleep-states.rst.
+
+	meye.*=		[HW] Set MotionEye Camera parameters
+			See Documentation/media/v4l-drivers/meye.rst.
+
+	mfgpt_irq=	[IA-32] Specify the IRQ to use for the
+			Multi-Function General Purpose Timers on AMD Geode
+			platforms.
+
+	mfgptfix	[X86-32] Fix MFGPT timers on AMD Geode platforms when
+			the BIOS has incorrectly applied a workaround. TinyBIOS
+			version 0.98 is known to be affected, 0.99 fixes the
+			problem by letting the user disable the workaround.
+
+	mga=		[HW,DRM]
+
+	min_addr=nn[KMG]	[KNL,BOOT,ia64] All physical memory below this
+			physical address is ignored.
+
+	mini2440=	[ARM,HW,KNL]
+			Format:[0..2][b][c][t]
+			Default: "0tb"
+			MINI2440 configuration specification:
+			0 - The attached screen is the 3.5" TFT
+			1 - The attached screen is the 7" TFT
+			2 - The VGA Shield is attached (1024x768)
+			Leaving out the screen size parameter will not load
+			the TFT driver, and the framebuffer will be left
+			unconfigured.
+			b - Enable backlight. The TFT backlight pin will be
+			linked to the kernel VESA blanking code and a GPIO
+			LED. This parameter is not necessary when using the
+			VGA shield.
+			c - Enable the s3c camera interface.
+			t - Reserved for enabling touchscreen support. The
+			touchscreen support is not enabled in the mainstream
+			kernel as of 2.6.30, a preliminary port can be found
+			in the "bleeding edge" mini2440 support kernel at
+			http://repo.or.cz/w/linux-2.6/mini2440.git
+
+	mitigations=
+			[X86,PPC,S390,ARM64] Control optional mitigations for
+			CPU vulnerabilities.  This is a set of curated,
+			arch-independent options, each of which is an
+			aggregation of existing arch-specific options.
+
+			off
+				Disable all optional CPU mitigations.  This
+				improves system performance, but it may also
+				expose users to several CPU vulnerabilities.
+				Equivalent to: nopti [X86,PPC]
+					       kpti=0 [ARM64]
+					       nospectre_v1 [PPC]
+					       nobp=0 [S390]
+					       nospectre_v1 [X86]
+					       nospectre_v2 [X86,PPC,S390,ARM64]
+					       spectre_v2_user=off [X86]
+					       spec_store_bypass_disable=off [X86,PPC]
+					       ssbd=force-off [ARM64]
+					       l1tf=off [X86]
+					       mds=off [X86]
+					       tsx_async_abort=off [X86]
+					       kvm.nx_huge_pages=off [X86]
+					       no_entry_flush [PPC]
+					       no_uaccess_flush [PPC]
+					       mmio_stale_data=off [X86]
+
+				Exceptions:
+					       This does not have any effect on
+					       kvm.nx_huge_pages when
+					       kvm.nx_huge_pages=force.
+
+			auto (default)
+				Mitigate all CPU vulnerabilities, but leave SMT
+				enabled, even if it's vulnerable.  This is for
+				users who don't want to be surprised by SMT
+				getting disabled across kernel upgrades, or who
+				have other ways of avoiding SMT-based attacks.
+				Equivalent to: (default behavior)
+
+			auto,nosmt
+				Mitigate all CPU vulnerabilities, disabling SMT
+				if needed.  This is for users who always want to
+				be fully mitigated, even if it means losing SMT.
+				Equivalent to: l1tf=flush,nosmt [X86]
+					       mds=full,nosmt [X86]
+					       tsx_async_abort=full,nosmt [X86]
+					       mmio_stale_data=full,nosmt [X86]
+
+	mminit_loglevel=
+			[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
+			parameter allows control of the logging verbosity for
+			the additional memory initialisation checks. A value
+			of 0 disables mminit logging and a level of 4 will
+			log everything. Information is printed at KERN_DEBUG
+			so loglevel=8 may also need to be specified.
+
+	mmio_stale_data=
+			[X86,INTEL] Control mitigation for the Processor
+			MMIO Stale Data vulnerabilities.
+
+			Processor MMIO Stale Data is a class of
+			vulnerabilities that may expose data after an MMIO
+			operation. Exposed data could originate or end in
+			the same CPU buffers as affected by MDS and TAA.
+			Therefore, similar to MDS and TAA, the mitigation
+			is to clear the affected CPU buffers.
+
+			This parameter controls the mitigation. The
+			options are:
+
+			full       - Enable mitigation on vulnerable CPUs
+
+			full,nosmt - Enable mitigation and disable SMT on
+				     vulnerable CPUs.
+
+			off        - Unconditionally disable mitigation
+
+			On MDS or TAA affected machines,
+			mmio_stale_data=off can be prevented by an active
+			MDS or TAA mitigation as these vulnerabilities are
+			mitigated with the same mechanism so in order to
+			disable this mitigation, you need to specify
+			mds=off and tsx_async_abort=off too.
+
+			Not specifying this option is equivalent to
+			mmio_stale_data=full.
+
+			For details see:
+			Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
+
+	module.sig_enforce
+			[KNL] When CONFIG_MODULE_SIG is set, this means that
+			modules without (valid) signatures will fail to load.
+			Note that if CONFIG_MODULE_SIG_FORCE is set, that
+			is always true, so this option does nothing.
+
+	module_blacklist=  [KNL] Do not load a comma-separated list of
+			modules.  Useful for debugging problem modules.
+
+	mousedev.tap_time=
+			[MOUSE] Maximum time between finger touching and
+			leaving touchpad surface for touch to be considered
+			a tap and be reported as a left button click (for
+			touchpads working in absolute mode only).
+			Format: <msecs>
+	mousedev.xres=	[MOUSE] Horizontal screen resolution, used for devices
+			reporting absolute coordinates, such as tablets
+	mousedev.yres=	[MOUSE] Vertical screen resolution, used for devices
+			reporting absolute coordinates, such as tablets
+
+	movablecore=	[KNL,X86,IA-64,PPC]
+			Format: nn[KMGTPE] | nn%
+			This parameter is the complement to kernelcore=, it
+			specifies the amount of memory used for migratable
+			allocations.  If both kernelcore and movablecore is
+			specified, then kernelcore will be at *least* the
+			specified value but may be more.  If movablecore on its
+			own is specified, the administrator must be careful
+			that the amount of memory usable for all allocations
+			is not too small.
+
+	movable_node	[KNL] Boot-time switch to make hotplugable memory
+			NUMA nodes to be movable. This means that the memory
+			of such nodes will be usable only for movable
+			allocations which rules out almost all kernel
+			allocations. Use with caution!
+
+	MTD_Partition=	[MTD]
+			Format: <name>,<region-number>,<size>,<offset>
+
+	MTD_Region=	[MTD] Format:
+			<name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>]
+
+	mtdparts=	[MTD]
+			See drivers/mtd/cmdlinepart.c.
+
+	multitce=off	[PPC]  This parameter disables the use of the pSeries
+			firmware feature for updating multiple TCE entries
+			at a time.
+
+	onenand.bdry=	[HW,MTD] Flex-OneNAND Boundary Configuration
+
+			Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
+
+			boundary - index of last SLC block on Flex-OneNAND.
+				   The remaining blocks are configured as MLC blocks.
+			lock	 - Configure if Flex-OneNAND boundary should be locked.
+				   Once locked, the boundary cannot be changed.
+				   1 indicates lock status, 0 indicates unlock status.
+
+	mtdset=		[ARM]
+			ARM/S3C2412 JIVE boot control
+
+			See arch/arm/mach-s3c2412/mach-jive.c
+
+	mtouchusb.raw_coordinates=
+			[HW] Make the MicroTouch USB driver use raw coordinates
+			('y', default) or cooked coordinates ('n')
+
+	mtrr_chunk_size=nn[KMG] [X86]
+			used for mtrr cleanup. It is largest continuous chunk
+			that could hold holes aka. UC entries.
+
+	mtrr_gran_size=nn[KMG] [X86]
+			Used for mtrr cleanup. It is granularity of mtrr block.
+			Default is 1.
+			Large value could prevent small alignment from
+			using up MTRRs.
+
+	mtrr_spare_reg_nr=n [X86]
+			Format: <integer>
+			Range: 0,7 : spare reg number
+			Default : 1
+			Used for mtrr cleanup. It is spare mtrr entries number.
+			Set to 2 or more if your graphical card needs more.
+
+	n2=		[NET] SDL Inc. RISCom/N2 synchronous serial card
+
+	netdev=		[NET] Network devices parameters
+			Format: <irq>,<io>,<mem_start>,<mem_end>,<name>
+			Note that mem_start is often overloaded to mean
+			something different and driver-specific.
+			This usage is only documented in each driver source
+			file if at all.
+
+	nf_conntrack.acct=
+			[NETFILTER] Enable connection tracking flow accounting
+			0 to disable accounting
+			1 to enable accounting
+			Default value is 0.
+
+	nfsaddrs=	[NFS] Deprecated.  Use ip= instead.
+			See Documentation/filesystems/nfs/nfsroot.txt.
+
+	nfsroot=	[NFS] nfs root filesystem for disk-less boxes.
+			See Documentation/filesystems/nfs/nfsroot.txt.
+
+	nfsrootdebug	[NFS] enable nfsroot debugging messages.
+			See Documentation/filesystems/nfs/nfsroot.txt.
+
+	nfs.callback_nr_threads=
+			[NFSv4] set the total number of threads that the
+			NFS client will assign to service NFSv4 callback
+			requests.
+
+	nfs.callback_tcpport=
+			[NFS] set the TCP port on which the NFSv4 callback
+			channel should listen.
+
+	nfs.cache_getent=
+			[NFS] sets the pathname to the program which is used
+			to update the NFS client cache entries.
+
+	nfs.cache_getent_timeout=
+			[NFS] sets the timeout after which an attempt to
+			update a cache entry is deemed to have failed.
+
+	nfs.idmap_cache_timeout=
+			[NFS] set the maximum lifetime for idmapper cache
+			entries.
+
+	nfs.enable_ino64=
+			[NFS] enable 64-bit inode numbers.
+			If zero, the NFS client will fake up a 32-bit inode
+			number for the readdir() and stat() syscalls instead
+			of returning the full 64-bit number.
+			The default is to return 64-bit inode numbers.
+
+	nfs.max_session_cb_slots=
+			[NFSv4.1] Sets the maximum number of session
+			slots the client will assign to the callback
+			channel. This determines the maximum number of
+			callbacks the client will process in parallel for
+			a particular server.
+
+	nfs.max_session_slots=
+			[NFSv4.1] Sets the maximum number of session slots
+			the client will attempt to negotiate with the server.
+			This limits the number of simultaneous RPC requests
+			that the client can send to the NFSv4.1 server.
+			Note that there is little point in setting this
+			value higher than the max_tcp_slot_table_limit.
+
+	nfs.nfs4_disable_idmapping=
+			[NFSv4] When set to the default of '1', this option
+			ensures that both the RPC level authentication
+			scheme and the NFS level operations agree to use
+			numeric uids/gids if the mount is using the
+			'sec=sys' security flavour. In effect it is
+			disabling idmapping, which can make migration from
+			legacy NFSv2/v3 systems to NFSv4 easier.
+			Servers that do not support this mode of operation
+			will be autodetected by the client, and it will fall
+			back to using the idmapper.
+			To turn off this behaviour, set the value to '0'.
+	nfs.nfs4_unique_id=
+			[NFS4] Specify an additional fixed unique ident-
+			ification string that NFSv4 clients can insert into
+			their nfs_client_id4 string.  This is typically a
+			UUID that is generated at system install time.
+
+	nfs.send_implementation_id =
+			[NFSv4.1] Send client implementation identification
+			information in exchange_id requests.
+			If zero, no implementation identification information
+			will be sent.
+			The default is to send the implementation identification
+			information.
+
+	nfs.recover_lost_locks =
+			[NFSv4] Attempt to recover locks that were lost due
+			to a lease timeout on the server. Please note that
+			doing this risks data corruption, since there are
+			no guarantees that the file will remain unchanged
+			after the locks are lost.
+			If you want to enable the kernel legacy behaviour of
+			attempting to recover these locks, then set this
+			parameter to '1'.
+			The default parameter value of '0' causes the kernel
+			not to attempt recovery of lost locks.
+
+	nfs4.layoutstats_timer =
+			[NFSv4.2] Change the rate at which the kernel sends
+			layoutstats to the pNFS metadata server.
+
+			Setting this to value to 0 causes the kernel to use
+			whatever value is the default set by the layout
+			driver. A non-zero value sets the minimum interval
+			in seconds between layoutstats transmissions.
+
+	nfsd.nfs4_disable_idmapping=
+			[NFSv4] When set to the default of '1', the NFSv4
+			server will return only numeric uids and gids to
+			clients using auth_sys, and will accept numeric uids
+			and gids from such clients.  This is intended to ease
+			migration from NFSv2/v3.
+
+	nmi_debug=	[KNL,SH] Specify one or more actions to take
+			when a NMI is triggered.
+			Format: [state][,regs][,debounce][,die]
+
+	nmi_watchdog=	[KNL,BUGS=X86] Debugging features for SMP kernels
+			Format: [panic,][nopanic,][num]
+			Valid num: 0 or 1
+			0 - turn hardlockup detector in nmi_watchdog off
+			1 - turn hardlockup detector in nmi_watchdog on
+			When panic is specified, panic when an NMI watchdog
+			timeout occurs (or 'nopanic' to override the opposite
+			default). To disable both hard and soft lockup detectors,
+			please see 'nowatchdog'.
+			This is useful when you use a panic=... timeout and
+			need the box quickly up again.
+
+			These settings can be accessed at runtime via
+			the nmi_watchdog and hardlockup_panic sysctls.
+
+	netpoll.carrier_timeout=
+			[NET] Specifies amount of time (in seconds) that
+			netpoll should wait for a carrier. By default netpoll
+			waits 4 seconds.
+
+	no387		[BUGS=X86-32] Tells the kernel to use the 387 maths
+			emulation library even if a 387 maths coprocessor
+			is present.
+
+	no5lvl		[X86-64] Disable 5-level paging mode. Forces
+			kernel to use 4-level paging instead.
+
+	no_console_suspend
+			[HW] Never suspend the console
+			Disable suspending of consoles during suspend and
+			hibernate operations.  Once disabled, debugging
+			messages can reach various consoles while the rest
+			of the system is being put to sleep (ie, while
+			debugging driver suspend/resume hooks).  This may
+			not work reliably with all consoles, but is known
+			to work with serial and VGA consoles.
+			To facilitate more flexible debugging, we also add
+			console_suspend, a printk module parameter to control
+			it. Users could use console_suspend (usually
+			/sys/module/printk/parameters/console_suspend) to
+			turn on/off it dynamically.
+
+	noaliencache	[MM, NUMA, SLAB] Disables the allocation of alien
+			caches in the slab allocator.  Saves per-node memory,
+			but will impact performance.
+
+	noalign		[KNL,ARM]
+
+	noaltinstr	[S390] Disables alternative instructions patching
+			(CPU alternatives feature).
+
+	noapic		[SMP,APIC] Tells the kernel to not make use of any
+			IOAPICs that may be present in the system.
+
+	noautogroup	Disable scheduler automatic task group creation.
+
+	nobats		[PPC] Do not use BATs for mapping kernel lowmem
+			on "Classic" PPC cores.
+
+	nocache		[ARM]
+
+	noclflush	[BUGS=X86] Don't use the CLFLUSH instruction
+
+	nodelayacct	[KNL] Disable per-task delay accounting
+
+	nodsp		[SH] Disable hardware DSP at boot time.
+
+	noefi		Disable EFI runtime services support.
+
+	no_entry_flush  [PPC] Don't flush the L1-D cache when entering the kernel.
+
+	noexec		[IA-64]
+
+	noexec		[X86]
+			On X86-32 available only on PAE configured kernels.
+			noexec=on: enable non-executable mappings (default)
+			noexec=off: disable non-executable mappings
+
+	nosmap		[X86]
+			Disable SMAP (Supervisor Mode Access Prevention)
+			even if it is supported by processor.
+
+	nosmep		[X86]
+			Disable SMEP (Supervisor Mode Execution Prevention)
+			even if it is supported by processor.
+
+	noexec32	[X86-64]
+			This affects only 32-bit executables.
+			noexec32=on: enable non-executable mappings (default)
+				read doesn't imply executable mappings
+			noexec32=off: disable non-executable mappings
+				read implies executable mappings
+
+	nofpu		[MIPS,SH] Disable hardware FPU at boot time.
+
+	nofxsr		[BUGS=X86-32] Disables x86 floating point extended
+			register save and restore. The kernel will only save
+			legacy floating-point registers on task switch.
+
+	nohugeiomap	[KNL,x86] Disable kernel huge I/O mappings.
+
+	nosmt		[KNL,S390] Disable symmetric multithreading (SMT).
+			Equivalent to smt=1.
+
+			[KNL,x86] Disable symmetric multithreading (SMT).
+			nosmt=force: Force disable SMT, cannot be undone
+				     via the sysfs control file.
+
+	nospectre_v1	[X66, PPC] Disable mitigations for Spectre Variant 1
+			(bounds check bypass). With this option data leaks
+			are possible in the system.
+
+	nospectre_v2	[X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for
+			the Spectre variant 2 (indirect branch prediction)
+			vulnerability. System may allow data leaks with this
+			option.
+
+	nospec_store_bypass_disable
+			[HW] Disable all mitigations for the Speculative Store Bypass vulnerability
+
+	no_uaccess_flush
+	                [PPC] Don't flush the L1-D cache after accessing user data.
+
+	noxsave		[BUGS=X86] Disables x86 extended register state save
+			and restore using xsave. The kernel will fallback to
+			enabling legacy floating-point and sse state.
+
+	noxsaveopt	[X86] Disables xsaveopt used in saving x86 extended
+			register states. The kernel will fall back to use
+			xsave to save the states. By using this parameter,
+			performance of saving the states is degraded because
+			xsave doesn't support modified optimization while
+			xsaveopt supports it on xsaveopt enabled systems.
+
+	noxsaves	[X86] Disables xsaves and xrstors used in saving and
+			restoring x86 extended register state in compacted
+			form of xsave area. The kernel will fall back to use
+			xsaveopt and xrstor to save and restore the states
+			in standard form of xsave area. By using this
+			parameter, xsave area per process might occupy more
+			memory on xsaves enabled systems.
+
+	nohlt		[BUGS=ARM,SH] Tells the kernel that the sleep(SH) or
+			wfi(ARM) instruction doesn't work correctly and not to
+			use it. This is also useful when using JTAG debugger.
+
+	no_file_caps	Tells the kernel not to honor file capabilities.  The
+			only way then for a file to be executed with privilege
+			is to be setuid root or executed by root.
+
+	nohalt		[IA-64] Tells the kernel not to use the power saving
+			function PAL_HALT_LIGHT when idle. This increases
+			power-consumption. On the positive side, it reduces
+			interrupt wake-up latency, which may improve performance
+			in certain environments such as networked servers or
+			real-time systems.
+
+	nohibernate	[HIBERNATION] Disable hibernation and resume.
+
+	nohz=		[KNL] Boottime enable/disable dynamic ticks
+			Valid arguments: on, off
+			Default: on
+
+	nohz_full=	[KNL,BOOT,SMP,ISOL]
+			The argument is a cpu list, as described above.
+			In kernels built with CONFIG_NO_HZ_FULL=y, set
+			the specified list of CPUs whose tick will be stopped
+			whenever possible. The boot CPU will be forced outside
+			the range to maintain the timekeeping.  Any CPUs
+			in this list will have their RCU callbacks offloaded,
+			just as if they had also been called out in the
+			rcu_nocbs= boot parameter.
+
+	noiotrap	[SH] Disables trapped I/O port accesses.
+
+	noirqdebug	[X86-32] Disables the code which attempts to detect and
+			disable unhandled interrupt sources.
+
+	no_timer_check	[X86,APIC] Disables the code which tests for
+			broken timer IRQ sources.
+
+	noisapnp	[ISAPNP] Disables ISA PnP code.
+
+	noinitrd	[RAM] Tells the kernel not to load any configured
+			initial RAM disk.
+
+	nointremap	[X86-64, Intel-IOMMU] Do not enable interrupt
+			remapping.
+			[Deprecated - use intremap=off]
+
+	nointroute	[IA-64]
+
+	noinvpcid	[X86] Disable the INVPCID cpu feature.
+
+	nojitter	[IA-64] Disables jitter checking for ITC timers.
+
+	no-kvmclock	[X86,KVM] Disable paravirtualized KVM clock driver
+
+	no-kvmapf	[X86,KVM] Disable paravirtualized asynchronous page
+			fault handling.
+
+	no-vmw-sched-clock
+			[X86,PV_OPS] Disable paravirtualized VMware scheduler
+			clock and use the default one.
+
+	no-steal-acc	[X86,KVM] Disable paravirtualized steal time accounting.
+			steal time is computed, but won't influence scheduler
+			behaviour
+
+	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
+
+	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.
+
+	noltlbs		[PPC] Do not use large page/tlb entries for kernel
+			lowmem mapping on PPC40x and PPC8xx
+
+	nomca		[IA-64] Disable machine check abort handling
+
+	nomce		[X86-32] Disable Machine Check Exception
+
+	nomfgpt		[X86-32] Disable Multi-Function General Purpose
+			Timer usage (for AMD Geode machines).
+
+	nonmi_ipi	[X86] Disable using NMI IPIs during panic/reboot to
+			shutdown the other cpus.  Instead use the REBOOT_VECTOR
+			irq.
+
+	nomodule	Disable module load
+
+	nopat		[X86] Disable PAT (page attribute table extension of
+			pagetables) support.
+
+	nopcid		[X86-64] Disable the PCID cpu feature.
+
+	norandmaps	Don't use address space randomization.  Equivalent to
+			echo 0 > /proc/sys/kernel/randomize_va_space
+
+	noreplace-smp	[X86-32,SMP] Don't replace SMP instructions
+			with UP alternatives
+
+	nordrand	[X86] Disable kernel use of the RDRAND and
+			RDSEED instructions even if they are supported
+			by the processor.  RDRAND and RDSEED are still
+			available to user space applications.
+
+	noresume	[SWSUSP] Disables resume and restores original swap
+			space.
+
+	no-scroll	[VGA] Disables scrollback.
+			This is required for the Braillex ib80-piezo Braille
+			reader made by F.H. Papenmeier (Germany).
+
+	nosbagart	[IA-64]
+
+	nosep		[BUGS=X86-32] Disables x86 SYSENTER/SYSEXIT support.
+
+	nosmp		[SMP] Tells an SMP kernel to act as a UP kernel,
+			and disable the IO APIC.  legacy for "maxcpus=0".
+
+	nosoftlockup	[KNL] Disable the soft-lockup detector.
+
+	nosync		[HW,M68K] Disables sync negotiation for all devices.
+
+	nowatchdog	[KNL] Disable both lockup detectors, i.e.
+			soft-lockup and NMI watchdog (hard-lockup).
+
+	nowb		[ARM]
+
+	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
+
+	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
+			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
+			Some features depend on CPU0. Known dependencies are:
+			1. Resume from suspend/hibernate depends on CPU0.
+			Suspend/hibernate will fail if CPU0 is offline and you
+			need to online CPU0 before suspend/hibernate.
+			2. PIC interrupts also depend on CPU0. CPU0 can't be
+			removed if a PIC interrupt is detected.
+			It's said poweroff/reboot may depend on CPU0 on some
+			machines although I haven't seen such issues so far
+			after CPU0 is offline on a few tested machines.
+			If the dependencies are under your control, you can
+			turn on cpu0_hotplug.
+
+	nps_mtm_hs_ctr=	[KNL,ARC]
+			This parameter sets the maximum duration, in
+			cycles, each HW thread of the CTOP can run
+			without interruptions, before HW switches it.
+			The actual maximum duration is 16 times this
+			parameter's value.
+			Format: integer between 1 and 255
+			Default: 255
+
+	nptcg=		[IA-64] Override max number of concurrent global TLB
+			purges which is reported from either PAL_VM_SUMMARY or
+			SAL PALO.
+
+	nr_cpus=	[SMP] Maximum number of processors that	an SMP kernel
+			could support.  nr_cpus=n : n >= 1 limits the kernel to
+			support 'n' processors. It could be larger than the
+			number of already plugged CPU during bootup, later in
+			runtime you can physically add extra cpu until it reaches
+			n. So during boot up some boot time memory for per-cpu
+			variables need be pre-allocated for later physical cpu
+			hot plugging.
+
+	nr_uarts=	[SERIAL] maximum number of UARTs to be registered.
+
+	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
+			Allowed values are enable and disable
+
+	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
+			'node', 'default' can be specified
+			This can be set from sysctl after boot.
+			See Documentation/sysctl/vm.txt for details.
+
+	ohci1394_dma=early	[HW] enable debugging via the ohci1394 driver.
+			See Documentation/debugging-via-ohci1394.txt for more
+			info.
+
+	olpc_ec_timeout= [OLPC] ms delay when issuing EC commands
+			Rather than timing out after 20 ms if an EC
+			command is not properly ACKed, override the length
+			of the timeout.  We have interrupts disabled while
+			waiting for the ACK, so if this is set too high
+			interrupts *may* be lost!
+
+	omap_mux=	[OMAP] Override bootloader pin multiplexing.
+			Format: <mux_mode0.mode_name=value>...
+			For example, to override I2C bus2:
+			omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100
+
+	oprofile.timer=	[HW]
+			Use timer interrupt instead of performance counters
+
+	oprofile.cpu_type=	Force an oprofile cpu type
+			This might be useful if you have an older oprofile
+			userland or if you want common events.
+			Format: { arch_perfmon }
+			arch_perfmon: [X86] Force use of architectural
+				perfmon on Intel CPUs instead of the
+				CPU specific event set.
+			timer: [X86] Force use of architectural NMI
+				timer mode (see also oprofile.timer
+				for generic hr timer mode)
+
+	oops=panic	Always panic on oopses. Default is to just kill the
+			process, but there is a small probability of
+			deadlocking the machine.
+			This will also cause panics on machine check exceptions.
+			Useful together with panic=30 to trigger a reboot.
+
+	page_owner=	[KNL] Boot-time page_owner enabling option.
+			Storage of the information about who allocated
+			each page is disabled in default. With this switch,
+			we can turn it on.
+			on: enable the feature
+
+	page_poison=	[KNL] Boot-time parameter changing the state of
+			poisoning on the buddy allocator, available with
+			CONFIG_PAGE_POISONING=y.
+			off: turn off poisoning (default)
+			on: turn on poisoning
+
+	panic=		[KNL] Kernel behaviour on panic: delay <timeout>
+			timeout > 0: seconds before rebooting
+			timeout = 0: wait forever
+			timeout < 0: reboot immediately
+			Format: <timeout>
+
+	panic_on_warn	panic() instead of WARN().  Useful to cause kdump
+			on a WARN().
+
+	crash_kexec_post_notifiers
+			Run kdump after running panic-notifiers and dumping
+			kmsg. This only for the users who doubt kdump always
+			succeeds in any situation.
+			Note that this also increases risks of kdump failure,
+			because some panic notifiers can make the crashed
+			kernel more unstable.
+
+	parkbd.port=	[HW] Parallel port number the keyboard adapter is
+			connected to, default is 0.
+			Format: <parport#>
+	parkbd.mode=	[HW] Parallel port keyboard adapter mode of operation,
+			0 for XT, 1 for AT (default is AT).
+			Format: <mode>
+
+	parport=	[HW,PPT] Specify parallel ports. 0 disables.
+			Format: { 0 | auto | 0xBBB[,IRQ[,DMA]] }
+			Use 'auto' to force the driver to use any
+			IRQ/DMA settings detected (the default is to
+			ignore detected IRQ/DMA settings because of
+			possible conflicts). You can specify the base
+			address, IRQ, and DMA settings; IRQ and DMA
+			should be numbers, or 'auto' (for using detected
+			settings on that particular port), or 'nofifo'
+			(to avoid using a FIFO even if it is detected).
+			Parallel ports are assigned in the order they
+			are specified on the command line, starting
+			with parport0.
+
+	parport_init_mode=	[HW,PPT]
+			Configure VIA parallel port to operate in
+			a specific mode. This is necessary on Pegasos
+			computer where firmware has no options for setting
+			up parallel port mode and sets it to spp.
+			Currently this function knows 686a and 8231 chips.
+			Format: [spp|ps2|epp|ecp|ecpepp]
+
+	pause_on_oops=
+			Halt all CPUs after the first oops has been printed for
+			the specified number of seconds.  This is to be used if
+			your oopses keep scrolling off the screen.
+
+	pcbit=		[HW,ISDN]
+
+	pcd.		[PARIDE]
+			See header of drivers/block/paride/pcd.c.
+			See also Documentation/blockdev/paride.txt.
+
+	pci=option[,option...]	[PCI] various PCI subsystem options.
+
+				Some options herein operate on a specific device
+				or a set of devices (<pci_dev>). These are
+				specified in one of the following formats:
+
+				[<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*
+				pci:<vendor>:<device>[:<subvendor>:<subdevice>]
+
+				Note: the first format specifies a PCI
+				bus/device/function address which may change
+				if new hardware is inserted, if motherboard
+				firmware changes, or due to changes caused
+				by other kernel parameters. If the
+				domain is left unspecified, it is
+				taken to be zero. Optionally, a path
+				to a device through multiple device/function
+				addresses can be specified after the base
+				address (this is more robust against
+				renumbering issues).  The second format
+				selects devices using IDs from the
+				configuration space which may match multiple
+				devices in the system.
+
+		earlydump	dump PCI config space before the kernel
+				changes anything
+		off		[X86] don't probe for the PCI bus
+		bios		[X86-32] force use of PCI BIOS, don't access
+				the hardware directly. Use this if your machine
+				has a non-standard PCI host bridge.
+		nobios		[X86-32] disallow use of PCI BIOS, only direct
+				hardware access methods are allowed. Use this
+				if you experience crashes upon bootup and you
+				suspect they are caused by the BIOS.
+		conf1		[X86] Force use of PCI Configuration Access
+				Mechanism 1 (config address in IO port 0xCF8,
+				data in IO port 0xCFC, both 32-bit).
+		conf2		[X86] Force use of PCI Configuration Access
+				Mechanism 2 (IO port 0xCF8 is an 8-bit port for
+				the function, IO port 0xCFA, also 8-bit, sets
+				bus number. The config space is then accessed
+				through ports 0xC000-0xCFFF).
+				See http://wiki.osdev.org/PCI for more info
+				on the configuration access mechanisms.
+		noaer		[PCIE] If the PCIEAER kernel config parameter is
+				enabled, this kernel boot option can be used to
+				disable the use of PCIE advanced error reporting.
+		nodomains	[PCI] Disable support for multiple PCI
+				root domains (aka PCI segments, in ACPI-speak).
+		nommconf	[X86] Disable use of MMCONFIG for PCI
+				Configuration
+		check_enable_amd_mmconf [X86] check for and enable
+				properly configured MMIO access to PCI
+				config space on AMD family 10h CPU
+		nomsi		[MSI] If the PCI_MSI kernel config parameter is
+				enabled, this kernel boot option can be used to
+				disable the use of MSI interrupts system-wide.
+		noioapicquirk	[APIC] Disable all boot interrupt quirks.
+				Safety option to keep boot IRQs enabled. This
+				should never be necessary.
+		ioapicreroute	[APIC] Enable rerouting of boot IRQs to the
+				primary IO-APIC for bridges that cannot disable
+				boot IRQs. This fixes a source of spurious IRQs
+				when the system masks IRQs.
+		noioapicreroute	[APIC] Disable workaround that uses the
+				boot IRQ equivalent of an IRQ that connects to
+				a chipset where boot IRQs cannot be disabled.
+				The opposite of ioapicreroute.
+		biosirq		[X86-32] Use PCI BIOS calls to get the interrupt
+				routing table. These calls are known to be buggy
+				on several machines and they hang the machine
+				when used, but on other computers it's the only
+				way to get the interrupt routing table. Try
+				this option if the kernel is unable to allocate
+				IRQs or discover secondary PCI buses on your
+				motherboard.
+		rom		[X86] Assign address space to expansion ROMs.
+				Use with caution as certain devices share
+				address decoders between ROMs and other
+				resources.
+		norom		[X86] Do not assign address space to
+				expansion ROMs that do not already have
+				BIOS assigned address ranges.
+		nobar		[X86] Do not assign address space to the
+				BARs that weren't assigned by the BIOS.
+		irqmask=0xMMMM	[X86] Set a bit mask of IRQs allowed to be
+				assigned automatically to PCI devices. You can
+				make the kernel exclude IRQs of your ISA cards
+				this way.
+		pirqaddr=0xAAAAA	[X86] Specify the physical address
+				of the PIRQ table (normally generated
+				by the BIOS) if it is outside the
+				F0000h-100000h range.
+		lastbus=N	[X86] Scan all buses thru bus #N. Can be
+				useful if the kernel is unable to find your
+				secondary buses and you want to tell it
+				explicitly which ones they are.
+		assign-busses	[X86] Always assign all PCI bus
+				numbers ourselves, overriding
+				whatever the firmware may have done.
+		usepirqmask	[X86] Honor the possible IRQ mask stored
+				in the BIOS $PIR table. This is needed on
+				some systems with broken BIOSes, notably
+				some HP Pavilion N5400 and Omnibook XE3
+				notebooks. This will have no effect if ACPI
+				IRQ routing is enabled.
+		noacpi		[X86] Do not use ACPI for IRQ routing
+				or for PCI scanning.
+		use_crs		[X86] Use PCI host bridge window information
+				from ACPI.  On BIOSes from 2008 or later, this
+				is enabled by default.  If you need to use this,
+				please report a bug.
+		nocrs		[X86] Ignore PCI host bridge windows from ACPI.
+				If you need to use this, please report a bug.
+		routeirq	Do IRQ routing for all PCI devices.
+				This is normally done in pci_enable_device(),
+				so this option is a temporary workaround
+				for broken drivers that don't call it.
+		skip_isa_align	[X86] do not align io start addr, so can
+				handle more pci cards
+		noearly		[X86] Don't do any early type 1 scanning.
+				This might help on some broken boards which
+				machine check when some devices' config space
+				is read. But various workarounds are disabled
+				and some IOMMU drivers will not work.
+		bfsort		Sort PCI devices into breadth-first order.
+				This sorting is done to get a device
+				order compatible with older (<= 2.4) kernels.
+		nobfsort	Don't sort PCI devices into breadth-first order.
+		pcie_bus_tune_off	Disable PCIe MPS (Max Payload Size)
+				tuning and use the BIOS-configured MPS defaults.
+		pcie_bus_safe	Set every device's MPS to the largest value
+				supported by all devices below the root complex.
+		pcie_bus_perf	Set device MPS to the largest allowable MPS
+				based on its parent bus. Also set MRRS (Max
+				Read Request Size) to the largest supported
+				value (no larger than the MPS that the device
+				or bus can support) for best performance.
+		pcie_bus_peer2peer	Set every device's MPS to 128B, which
+				every device is guaranteed to support. This
+				configuration allows peer-to-peer DMA between
+				any pair of devices, possibly at the cost of
+				reduced performance.  This also guarantees
+				that hot-added devices will work.
+		cbiosize=nn[KMG]	The fixed amount of bus space which is
+				reserved for the CardBus bridge's IO window.
+				The default value is 256 bytes.
+		cbmemsize=nn[KMG]	The fixed amount of bus space which is
+				reserved for the CardBus bridge's memory
+				window. The default value is 64 megabytes.
+		resource_alignment=
+				Format:
+				[<order of align>@]<pci_dev>[; ...]
+				Specifies alignment and device to reassign
+				aligned memory resources. How to
+				specify the device is described above.
+				If <order of align> is not specified,
+				PAGE_SIZE is used as alignment.
+				PCI-PCI bridge can be specified, if resource
+				windows need to be expanded.
+				To specify the alignment for several
+				instances of a device, the PCI vendor,
+				device, subvendor, and subdevice may be
+				specified, e.g., 4096@pci:8086:9c22:103c:198f
+		ecrc=		Enable/disable PCIe ECRC (transaction layer
+				end-to-end CRC checking).
+				bios: Use BIOS/firmware settings. This is the
+				the default.
+				off: Turn ECRC off
+				on: Turn ECRC on.
+		hpiosize=nn[KMG]	The fixed amount of bus space which is
+				reserved for hotplug bridge's IO window.
+				Default size is 256 bytes.
+		hpmemsize=nn[KMG]	The fixed amount of bus space which is
+				reserved for hotplug bridge's memory window.
+				Default size is 2 megabytes.
+		hpbussize=nn	The minimum amount of additional bus numbers
+				reserved for buses below a hotplug bridge.
+				Default is 1.
+		realloc=	Enable/disable reallocating PCI bridge resources
+				if allocations done by BIOS are too small to
+				accommodate resources required by all child
+				devices.
+				off: Turn realloc off
+				on: Turn realloc on
+		realloc		same as realloc=on
+		noari		do not use PCIe ARI.
+		noats		[PCIE, Intel-IOMMU, AMD-IOMMU]
+				do not use PCIe ATS (and IOMMU device IOTLB).
+		pcie_scan_all	Scan all possible PCIe devices.  Otherwise we
+				only look for one device below a PCIe downstream
+				port.
+		big_root_window	Try to add a big 64bit memory window to the PCIe
+				root complex on AMD CPUs. Some GFX hardware
+				can resize a BAR to allow access to all VRAM.
+				Adding the window is slightly risky (it may
+				conflict with unreported devices), so this
+				taints the kernel.
+		disable_acs_redir=<pci_dev>[; ...]
+				Specify one or more PCI devices (in the format
+				specified above) separated by semicolons.
+				Each device specified will have the PCI ACS
+				redirect capabilities forced off which will
+				allow P2P traffic between devices through
+				bridges without forcing it upstream. Note:
+				this removes isolation between devices and
+				may put more devices in an IOMMU group.
+
+	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
+			Management.
+		off	Disable ASPM.
+		force	Enable ASPM even on devices that claim not to support it.
+			WARNING: Forcing ASPM on may cause system lockups.
+
+	pcie_ports=	[PCIE] PCIe port services handling:
+		native	Use native PCIe services (PME, AER, DPC, PCIe hotplug)
+			even if the platform doesn't give the OS permission to
+			use them.  This may cause conflicts if the platform
+			also tries to use these services.
+		compat	Disable native PCIe services (PME, AER, DPC, PCIe
+			hotplug).
+
+	pcie_port_pm=	[PCIE] PCIe port power management handling:
+		off	Disable power management of all PCIe ports
+		force	Forcibly enable power management of all PCIe ports
+
+	pcie_pme=	[PCIE,PM] Native PCIe PME signaling options:
+		nomsi	Do not use MSI for native PCIe PME signaling (this makes
+			all PCIe root ports use INTx for all services).
+
+	pcmv=		[HW,PCMCIA] BadgePAD 4
+
+	pd_ignore_unused
+			[PM]
+			Keep all power-domains already enabled by bootloader on,
+			even if no driver has claimed them. This is useful
+			for debug and development, but should not be
+			needed on a platform with proper driver support.
+
+	pd.		[PARIDE]
+			See Documentation/blockdev/paride.txt.
+
+	pdcchassis=	[PARISC,HW] Disable/Enable PDC Chassis Status codes at
+			boot time.
+			Format: { 0 | 1 }
+			See arch/parisc/kernel/pdc_chassis.c
+
+	percpu_alloc=	Select which percpu first chunk allocator to use.
+			Currently supported values are "embed" and "page".
+			Archs may support subset or none of the	selections.
+			See comments in mm/percpu.c for details on each
+			allocator.  This parameter is primarily	for debugging
+			and performance comparison.
+
+	pf.		[PARIDE]
+			See Documentation/blockdev/paride.txt.
+
+	pg.		[PARIDE]
+			See Documentation/blockdev/paride.txt.
+
+	pirq=		[SMP,APIC] Manual mp-table setup
+			See Documentation/x86/i386/IO-APIC.txt.
+
+	plip=		[PPT,NET] Parallel port network link
+			Format: { parport<nr> | timid | 0 }
+			See also Documentation/admin-guide/parport.rst.
+
+	pmtmr=		[X86] Manual setup of pmtmr I/O Port.
+			Override pmtimer IOPort with a hex value.
+			e.g. pmtmr=0x508
+
+	pnp.debug=1	[PNP]
+			Enable PNP debug messages (depends on the
+			CONFIG_PNP_DEBUG_MESSAGES option).  Change at run-time
+			via /sys/module/pnp/parameters/debug.  We always show
+			current resource usage; turning this on also shows
+			possible settings and some assignment information.
+
+	pnpacpi=	[ACPI]
+			{ off }
+
+	pnpbios=	[ISAPNP]
+			{ on | off | curr | res | no-curr | no-res }
+
+	pnp_reserve_irq=
+			[ISAPNP] Exclude IRQs for the autoconfiguration
+
+	pnp_reserve_dma=
+			[ISAPNP] Exclude DMAs for the autoconfiguration
+
+	pnp_reserve_io=	[ISAPNP] Exclude I/O ports for the autoconfiguration
+			Ranges are in pairs (I/O port base and size).
+
+	pnp_reserve_mem=
+			[ISAPNP] Exclude memory regions for the
+			autoconfiguration.
+			Ranges are in pairs (memory base and size).
+
+	ports=		[IP_VS_FTP] IPVS ftp helper module
+			Default is 21.
+			Up to 8 (IP_VS_APP_MAX_PORTS) ports
+			may be specified.
+			Format: <port>,<port>....
+
+	powersave=off	[PPC] This option disables power saving features.
+			It specifically disables cpuidle and sets the
+			platform machine description specific power_save
+			function to NULL. On Idle the CPU just reduces
+			execution priority.
+
+	ppc_strict_facility_enable
+			[PPC] This option catches any kernel floating point,
+			Altivec, VSX and SPE outside of regions specifically
+			allowed (eg kernel_enable_fpu()/kernel_disable_fpu()).
+			There is some performance impact when enabling this.
+
+	ppc_tm=		[PPC]
+			Format: {"off"}
+			Disable Hardware Transactional Memory
+
+	print-fatal-signals=
+			[KNL] debug: print fatal signals
+
+			If enabled, warn about various signal handling
+			related application anomalies: too many signals,
+			too many POSIX.1 timers, fatal signals causing a
+			coredump - etc.
+
+			If you hit the warning due to signal overflow,
+			you might want to try "ulimit -i unlimited".
+
+			default: off.
+
+	printk.always_kmsg_dump=
+			Trigger kmsg_dump for cases other than kernel oops or
+			panics
+			Format: <bool>  (1/Y/y=enable, 0/N/n=disable)
+			default: disabled
+
+	printk.devkmsg={on,off,ratelimit}
+			Control writing to /dev/kmsg.
+			on - unlimited logging to /dev/kmsg from userspace
+			off - logging to /dev/kmsg disabled
+			ratelimit - ratelimit the logging
+			Default: ratelimit
+
+	printk.time=	Show timing data prefixed to each printk message line
+			Format: <bool>  (1/Y/y=enable, 0/N/n=disable)
+
+	processor.max_cstate=	[HW,ACPI]
+			Limit processor to maximum C-state
+			max_cstate=9 overrides any DMI blacklist limit.
+
+	processor.nocst	[HW,ACPI]
+			Ignore the _CST method to determine C-states,
+			instead using the legacy FADT method
+
+	profile=	[KNL] Enable kernel profiling via /proc/profile
+			Format: [<profiletype>,]<number>
+			Param: <profiletype>: "schedule", "sleep", or "kvm"
+				[defaults to kernel profiling]
+			Param: "schedule" - profile schedule points.
+			Param: "sleep" - profile D-state sleeping (millisecs).
+				Requires CONFIG_SCHEDSTATS
+			Param: "kvm" - profile VM exits.
+			Param: <number> - step/bucket size as a power of 2 for
+				statistical time based profiling.
+
+	prompt_ramdisk=	[RAM] List of RAM disks to prompt for floppy disk
+			before loading.
+			See Documentation/blockdev/ramdisk.txt.
+
+	psmouse.proto=	[HW,MOUSE] Highest PS2 mouse protocol extension to
+			probe for; one of (bare|imps|exps|lifebook|any).
+	psmouse.rate=	[HW,MOUSE] Set desired mouse report rate, in reports
+			per second.
+	psmouse.resetafter=	[HW,MOUSE]
+			Try to reset the device after so many bad packets
+			(0 = never).
+	psmouse.resolution=
+			[HW,MOUSE] Set desired mouse resolution, in dpi.
+	psmouse.smartscroll=
+			[HW,MOUSE] Controls Logitech smartscroll autorepeat.
+			0 = disabled, 1 = enabled (default).
+
+	pstore.backend=	Specify the name of the pstore backend to use
+
+	pt.		[PARIDE]
+			See Documentation/blockdev/paride.txt.
+
+	pti=		[X86_64] Control Page Table Isolation of user and
+			kernel address spaces.  Disabling this feature
+			removes hardening, but improves performance of
+			system calls and interrupts.
+
+			on   - unconditionally enable
+			off  - unconditionally disable
+			auto - kernel detects whether your CPU model is
+			       vulnerable to issues that PTI mitigates
+
+			Not specifying this option is equivalent to pti=auto.
+
+	nopti		[X86_64]
+			Equivalent to pti=off
+
+	pty.legacy_count=
+			[KNL] Number of legacy pty's. Overwrites compiled-in
+			default number.
+
+	quiet		[KNL] Disable most log messages
+
+	r128=		[HW,DRM]
+
+	raid=		[HW,RAID]
+			See Documentation/admin-guide/md.rst.
+
+	ramdisk_size=	[RAM] Sizes of RAM disks in kilobytes
+			See Documentation/blockdev/ramdisk.txt.
+
+	random.trust_cpu={on,off}
+			[KNL] Enable or disable trusting the use of the
+			CPU's random number generator (if available) to
+			fully seed the kernel's CRNG. Default is controlled
+			by CONFIG_RANDOM_TRUST_CPU.
+
+	random.trust_bootloader={on,off}
+			[KNL] Enable or disable trusting the use of a
+			seed passed by the bootloader (if available) to
+			fully seed the kernel's CRNG. Default is controlled
+			by CONFIG_RANDOM_TRUST_BOOTLOADER.
+
+	ras=option[,option,...]	[KNL] RAS-specific options
+
+		cec_disable	[X86]
+				Disable the Correctable Errors Collector,
+				see CONFIG_RAS_CEC help text.
+
+	rcu_nocbs=	[KNL]
+			The argument is a cpu list, as described above.
+
+			In kernels built with CONFIG_RCU_NOCB_CPU=y, set
+			the specified list of CPUs to be no-callback CPUs.
+			Invocation of these CPUs' RCU callbacks will
+			be offloaded to "rcuox/N" kthreads created for
+			that purpose, where "x" is "b" for RCU-bh, "p"
+			for RCU-preempt, and "s" for RCU-sched, and "N"
+			is the CPU number.  This reduces OS jitter on the
+			offloaded CPUs, which can be useful for HPC and
+			real-time workloads.  It can also improve energy
+			efficiency for asymmetric multiprocessors.
+
+	rcu_nocb_poll	[KNL]
+			Rather than requiring that offloaded CPUs
+			(specified by rcu_nocbs= above) explicitly
+			awaken the corresponding "rcuoN" kthreads,
+			make these kthreads poll for callbacks.
+			This improves the real-time response for the
+			offloaded CPUs by relieving them of the need to
+			wake up the corresponding kthread, but degrades
+			energy efficiency by requiring that the kthreads
+			periodically wake up to do the polling.
+
+	rcutree.blimit=	[KNL]
+			Set maximum number of finished RCU callbacks to
+			process in one batch.
+
+	rcutree.dump_tree=	[KNL]
+			Dump the structure of the rcu_node combining tree
+			out at early boot.  This is used for diagnostic
+			purposes, to verify correct tree setup.
+
+	rcutree.gp_cleanup_delay=	[KNL]
+			Set the number of jiffies to delay each step of
+			RCU grace-period cleanup.
+
+	rcutree.gp_init_delay=	[KNL]
+			Set the number of jiffies to delay each step of
+			RCU grace-period initialization.
+
+	rcutree.gp_preinit_delay=	[KNL]
+			Set the number of jiffies to delay each step of
+			RCU grace-period pre-initialization, that is,
+			the propagation of recent CPU-hotplug changes up
+			the rcu_node combining tree.
+
+	rcutree.rcu_fanout_exact= [KNL]
+			Disable autobalancing of the rcu_node combining
+			tree.  This is used by rcutorture, and might
+			possibly be useful for architectures having high
+			cache-to-cache transfer latencies.
+
+	rcutree.rcu_fanout_leaf= [KNL]
+			Change the number of CPUs assigned to each
+			leaf rcu_node structure.  Useful for very
+			large systems, which will choose the value 64,
+			and for NUMA systems with large remote-access
+			latencies, which will choose a value aligned
+			with the appropriate hardware boundaries.
+
+	rcutree.jiffies_till_sched_qs= [KNL]
+			Set required age in jiffies for a
+			given grace period before RCU starts
+			soliciting quiescent-state help from
+			rcu_note_context_switch().
+
+	rcutree.jiffies_till_first_fqs= [KNL]
+			Set delay from grace-period initialization to
+			first attempt to force quiescent states.
+			Units are jiffies, minimum value is zero,
+			and maximum value is HZ.
+
+	rcutree.jiffies_till_next_fqs= [KNL]
+			Set delay between subsequent attempts to force
+			quiescent states.  Units are jiffies, minimum
+			value is one, and maximum value is HZ.
+
+	rcutree.kthread_prio= 	 [KNL,BOOT]
+			Set the SCHED_FIFO priority of the RCU per-CPU
+			kthreads (rcuc/N). This value is also used for
+			the priority of the RCU boost threads (rcub/N)
+			and for the RCU grace-period kthreads (rcu_bh,
+			rcu_preempt, and rcu_sched). If RCU_BOOST is
+			set, valid values are 1-99 and the default is 1
+			(the least-favored priority).  Otherwise, when
+			RCU_BOOST is not set, valid values are 0-99 and
+			the default is zero (non-realtime operation).
+
+	rcutree.rcu_nocb_leader_stride= [KNL]
+			Set the number of NOCB kthread groups, which
+			defaults to the square root of the number of
+			CPUs.  Larger numbers reduces the wakeup overhead
+			on the per-CPU grace-period kthreads, but increases
+			that same overhead on each group's leader.
+
+	rcutree.qhimark= [KNL]
+			Set threshold of queued RCU callbacks beyond which
+			batch limiting is disabled.
+
+	rcutree.qlowmark= [KNL]
+			Set threshold of queued RCU callbacks below which
+			batch limiting is re-enabled.
+
+	rcutree.rcu_idle_gp_delay= [KNL]
+			Set wakeup interval for idle CPUs that have
+			RCU callbacks (RCU_FAST_NO_HZ=y).
+
+	rcutree.rcu_idle_lazy_gp_delay= [KNL]
+			Set wakeup interval for idle CPUs that have
+			only "lazy" RCU callbacks (RCU_FAST_NO_HZ=y).
+			Lazy RCU callbacks are those which RCU can
+			prove do nothing more than free memory.
+
+	rcutree.rcu_kick_kthreads= [KNL]
+			Cause the grace-period kthread to get an extra
+			wake_up() if it sleeps three times longer than
+			it should at force-quiescent-state time.
+			This wake_up() will be accompanied by a
+			WARN_ONCE() splat and an ftrace_dump().
+
+	rcuperf.gp_async= [KNL]
+			Measure performance of asynchronous
+			grace-period primitives such as call_rcu().
+
+	rcuperf.gp_async_max= [KNL]
+			Specify the maximum number of outstanding
+			callbacks per writer thread.  When a writer
+			thread exceeds this limit, it invokes the
+			corresponding flavor of rcu_barrier() to allow
+			previously posted callbacks to drain.
+
+	rcuperf.gp_exp= [KNL]
+			Measure performance of expedited synchronous
+			grace-period primitives.
+
+	rcuperf.holdoff= [KNL]
+			Set test-start holdoff period.  The purpose of
+			this parameter is to delay the start of the
+			test until boot completes in order to avoid
+			interference.
+
+	rcuperf.nreaders= [KNL]
+			Set number of RCU readers.  The value -1 selects
+			N, where N is the number of CPUs.  A value
+			"n" less than -1 selects N-n+1, where N is again
+			the number of CPUs.  For example, -2 selects N
+			(the number of CPUs), -3 selects N+1, and so on.
+			A value of "n" less than or equal to -N selects
+			a single reader.
+
+	rcuperf.nwriters= [KNL]
+			Set number of RCU writers.  The values operate
+			the same as for rcuperf.nreaders.
+			N, where N is the number of CPUs
+
+	rcuperf.perf_type= [KNL]
+			Specify the RCU implementation to test.
+
+	rcuperf.shutdown= [KNL]
+			Shut the system down after performance tests
+			complete.  This is useful for hands-off automated
+			testing.
+
+	rcuperf.verbose= [KNL]
+			Enable additional printk() statements.
+
+	rcuperf.writer_holdoff= [KNL]
+			Write-side holdoff between grace periods,
+			in microseconds.  The default of zero says
+			no holdoff.
+
+	rcutorture.cbflood_inter_holdoff= [KNL]
+			Set holdoff time (jiffies) between successive
+			callback-flood tests.
+
+	rcutorture.cbflood_intra_holdoff= [KNL]
+			Set holdoff time (jiffies) between successive
+			bursts of callbacks within a given callback-flood
+			test.
+
+	rcutorture.cbflood_n_burst= [KNL]
+			Set the number of bursts making up a given
+			callback-flood test.  Set this to zero to
+			disable callback-flood testing.
+
+	rcutorture.cbflood_n_per_burst= [KNL]
+			Set the number of callbacks to be registered
+			in a given burst of a callback-flood test.
+
+	rcutorture.fqs_duration= [KNL]
+			Set duration of force_quiescent_state bursts
+			in microseconds.
+
+	rcutorture.fqs_holdoff= [KNL]
+			Set holdoff time within force_quiescent_state bursts
+			in microseconds.
+
+	rcutorture.fqs_stutter= [KNL]
+			Set wait time between force_quiescent_state bursts
+			in seconds.
+
+	rcutorture.gp_cond= [KNL]
+			Use conditional/asynchronous update-side
+			primitives, if available.
+
+	rcutorture.gp_exp= [KNL]
+			Use expedited update-side primitives, if available.
+
+	rcutorture.gp_normal= [KNL]
+			Use normal (non-expedited) asynchronous
+			update-side primitives, if available.
+
+	rcutorture.gp_sync= [KNL]
+			Use normal (non-expedited) synchronous
+			update-side primitives, if available.  If all
+			of rcutorture.gp_cond=, rcutorture.gp_exp=,
+			rcutorture.gp_normal=, and rcutorture.gp_sync=
+			are zero, rcutorture acts as if is interpreted
+			they are all non-zero.
+
+	rcutorture.n_barrier_cbs= [KNL]
+			Set callbacks/threads for rcu_barrier() testing.
+
+	rcutorture.nfakewriters= [KNL]
+			Set number of concurrent RCU writers.  These just
+			stress RCU, they don't participate in the actual
+			test, hence the "fake".
+
+	rcutorture.nreaders= [KNL]
+			Set number of RCU readers.  The value -1 selects
+			N-1, where N is the number of CPUs.  A value
+			"n" less than -1 selects N-n-2, where N is again
+			the number of CPUs.  For example, -2 selects N
+			(the number of CPUs), -3 selects N+1, and so on.
+
+	rcutorture.object_debug= [KNL]
+			Enable debug-object double-call_rcu() testing.
+
+	rcutorture.onoff_holdoff= [KNL]
+			Set time (s) after boot for CPU-hotplug testing.
+
+	rcutorture.onoff_interval= [KNL]
+			Set time (jiffies) between CPU-hotplug operations,
+			or zero to disable CPU-hotplug testing.
+
+	rcutorture.shuffle_interval= [KNL]
+			Set task-shuffle interval (s).  Shuffling tasks
+			allows some CPUs to go into dyntick-idle mode
+			during the rcutorture test.
+
+	rcutorture.shutdown_secs= [KNL]
+			Set time (s) after boot system shutdown.  This
+			is useful for hands-off automated testing.
+
+	rcutorture.stall_cpu= [KNL]
+			Duration of CPU stall (s) to test RCU CPU stall
+			warnings, zero to disable.
+
+	rcutorture.stall_cpu_holdoff= [KNL]
+			Time to wait (s) after boot before inducing stall.
+
+	rcutorture.stall_cpu_irqsoff= [KNL]
+			Disable interrupts while stalling if set.
+
+	rcutorture.stat_interval= [KNL]
+			Time (s) between statistics printk()s.
+
+	rcutorture.stutter= [KNL]
+			Time (s) to stutter testing, for example, specifying
+			five seconds causes the test to run for five seconds,
+			wait for five seconds, and so on.  This tests RCU's
+			ability to transition abruptly to and from idle.
+
+	rcutorture.test_boost= [KNL]
+			Test RCU priority boosting?  0=no, 1=maybe, 2=yes.
+			"Maybe" means test if the RCU implementation
+			under test support RCU priority boosting.
+
+	rcutorture.test_boost_duration= [KNL]
+			Duration (s) of each individual boost test.
+
+	rcutorture.test_boost_interval= [KNL]
+			Interval (s) between each boost test.
+
+	rcutorture.test_no_idle_hz= [KNL]
+			Test RCU's dyntick-idle handling.  See also the
+			rcutorture.shuffle_interval parameter.
+
+	rcutorture.torture_type= [KNL]
+			Specify the RCU implementation to test.
+
+	rcutorture.verbose= [KNL]
+			Enable additional printk() statements.
+
+	rcupdate.rcu_cpu_stall_suppress= [KNL]
+			Suppress RCU CPU stall warning messages.
+
+	rcupdate.rcu_cpu_stall_timeout= [KNL]
+			Set timeout for RCU CPU stall warning messages.
+
+	rcupdate.rcu_expedited= [KNL]
+			Use expedited grace-period primitives, for
+			example, synchronize_rcu_expedited() instead
+			of synchronize_rcu().  This reduces latency,
+			but can increase CPU utilization, degrade
+			real-time latency, and degrade energy efficiency.
+			No effect on CONFIG_TINY_RCU kernels.
+
+	rcupdate.rcu_normal= [KNL]
+			Use only normal grace-period primitives,
+			for example, synchronize_rcu() instead of
+			synchronize_rcu_expedited().  This improves
+			real-time latency, CPU utilization, and
+			energy efficiency, but can expose users to
+			increased grace-period latency.  This parameter
+			overrides rcupdate.rcu_expedited.  No effect on
+			CONFIG_TINY_RCU kernels.
+
+	rcupdate.rcu_normal_after_boot= [KNL]
+			Once boot has completed (that is, after
+			rcu_end_inkernel_boot() has been invoked), use
+			only normal grace-period primitives.  No effect
+			on CONFIG_TINY_RCU kernels.
+
+	rcupdate.rcu_task_stall_timeout= [KNL]
+			Set timeout in jiffies for RCU task stall warning
+			messages.  Disable with a value less than or equal
+			to zero.
+
+	rcupdate.rcu_self_test= [KNL]
+			Run the RCU early boot self tests
+
+	rcupdate.rcu_self_test_bh= [KNL]
+			Run the RCU bh early boot self tests
+
+	rcupdate.rcu_self_test_sched= [KNL]
+			Run the RCU sched early boot self tests
+
+	rdinit=		[KNL]
+			Format: <full_path>
+			Run specified binary instead of /init from the ramdisk,
+			used for early userspace startup. See initrd.
+
+	rdrand=		[X86]
+			force - Override the decision by the kernel to hide the
+				advertisement of RDRAND support (this affects
+				certain AMD processors because of buggy BIOS
+				support, specifically around the suspend/resume
+				path).
+
+	rdt=		[HW,X86,RDT]
+			Turn on/off individual RDT features. List is:
+			cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
+			mba.
+			E.g. to turn on cmt and turn off mba use:
+				rdt=cmt,!mba
+
+	reboot=		[KNL]
+			Format (x86 or x86_64):
+				[w[arm] | c[old] | h[ard] | s[oft] | g[pio]] \
+				[[,]s[mp]#### \
+				[[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \
+				[[,]f[orce]
+			Where reboot_mode is one of warm (soft) or cold (hard) or gpio,
+			      reboot_type is one of bios, acpi, kbd, triple, efi, or pci,
+			      reboot_force is either force or not specified,
+			      reboot_cpu is s[mp]#### with #### being the processor
+					to be used for rebooting.
+
+	relax_domain_level=
+			[KNL, SMP] Set scheduler's default relax_domain_level.
+			See Documentation/cgroup-v1/cpusets.txt.
+
+	reserve=	[KNL,BUGS] Force kernel to ignore I/O ports or memory
+			Format: <base1>,<size1>[,<base2>,<size2>,...]
+			Reserve I/O ports or memory so the kernel won't use
+			them.  If <base> is less than 0x10000, the region
+			is assumed to be I/O ports; otherwise it is memory.
+
+	reservetop=	[X86-32]
+			Format: nn[KMG]
+			Reserves a hole at the top of the kernel virtual
+			address space.
+
+	reservelow=	[X86]
+			Format: nn[K]
+			Set the amount of memory to reserve for BIOS at
+			the bottom of the address space.
+
+	reset_devices	[KNL] Force drivers to reset the underlying device
+			during initialization.
+
+	resume=		[SWSUSP]
+			Specify the partition device for software suspend
+			Format:
+			{/dev/<dev> | PARTUUID=<uuid> | <int>:<int> | <hex>}
+
+	resume_offset=	[SWSUSP]
+			Specify the offset from the beginning of the partition
+			given by "resume=" at which the swap header is located,
+			in <PAGE_SIZE> units (needed only for swap files).
+			See  Documentation/power/swsusp-and-swap-files.txt
+
+	resumedelay=	[HIBERNATION] Delay (in seconds) to pause before attempting to
+			read the resume files
+
+	resumewait	[HIBERNATION] Wait (indefinitely) for resume device to show up.
+			Useful for devices that are detected asynchronously
+			(e.g. USB and MMC devices).
+
+	hibernate=	[HIBERNATION]
+		noresume	Don't check if there's a hibernation image
+				present during boot.
+		nocompress	Don't compress/decompress hibernation images.
+		no		Disable hibernation and resume.
+		protect_image	Turn on image protection during restoration
+				(that will set all pages holding image data
+				during restoration read-only).
+
+	retain_initrd	[RAM] Keep initrd memory after extraction
+
+	rfkill.default_state=
+		0	"airplane mode".  All wifi, bluetooth, wimax, gps, fm,
+			etc. communication is blocked by default.
+		1	Unblocked.
+
+	rfkill.master_switch_mode=
+		0	The "airplane mode" button does nothing.
+		1	The "airplane mode" button toggles between everything
+			blocked and the previous configuration.
+		2	The "airplane mode" button toggles between everything
+			blocked and everything unblocked.
+
+	rhash_entries=	[KNL,NET]
+			Set number of hash buckets for route cache
+
+	ring3mwait=disable
+			[KNL] Disable ring 3 MONITOR/MWAIT feature on supported
+			CPUs.
+
+	ro		[KNL] Mount root device read-only on boot
+
+	rodata=		[KNL]
+		on	Mark read-only kernel memory as read-only (default).
+		off	Leave read-only kernel memory writable for debugging.
+
+	rockchip.usb_uart
+			Enable the uart passthrough on the designated usb port
+			on Rockchip SoCs. When active, the signals of the
+			debug-uart get routed to the D+ and D- pins of the usb
+			port and the regular usb controller gets disabled.
+
+	root=		[KNL] Root filesystem
+			See name_to_dev_t comment in init/do_mounts.c.
+
+	rootdelay=	[KNL] Delay (in seconds) to pause before attempting to
+			mount the root filesystem
+
+	rootflags=	[KNL] Set root filesystem mount option string
+
+	rootfstype=	[KNL] Set root filesystem type
+
+	rootwait	[KNL] Wait (indefinitely) for root device to show up.
+			Useful for devices that are detected asynchronously
+			(e.g. USB and MMC devices).
+
+	rproc_mem=nn[KMG][@address]
+			[KNL,ARM,CMA] Remoteproc physical memory block.
+			Memory area to be used by remote processor image,
+			managed by CMA.
+
+	rw		[KNL] Mount root device read-write on boot
+
+	S		[KNL] Run init in single mode
+
+	s390_iommu=	[HW,S390]
+			Set s390 IOTLB flushing mode
+		strict
+			With strict flushing every unmap operation will result in
+			an IOTLB flush. Default is lazy flushing before reuse,
+			which is faster.
+
+	sa1100ir	[NET]
+			See drivers/net/irda/sa1100_ir.c.
+
+	sbni=		[NET] Granch SBNI12 leased line adapter
+
+	sched_debug	[KNL] Enables verbose scheduler debug messages.
+
+	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
+			Allowed values are enable and disable. This feature
+			incurs a small amount of overhead in the scheduler
+			but is useful for debugging and performance tuning.
+
+	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
+			xtime_lock contention on larger systems, and/or RCU lock
+			contention on all systems with CONFIG_MAXSMP set.
+			Format: { "0" | "1" }
+			0 -- disable. (may be 1 via CONFIG_CMDLINE="skew_tick=1"
+			1 -- enable.
+			Note: increases power consumption, thus should only be
+			enabled if running jitter sensitive (HPC/RT) workloads.
+
+	security=	[SECURITY] Choose a security module to enable at boot.
+			If this boot parameter is not specified, only the first
+			security module asking for security registration will be
+			loaded. An invalid security module name will be treated
+			as if no module has been chosen.
+
+	selinux=	[SELINUX] Disable or enable SELinux at boot time.
+			Format: { "0" | "1" }
+			See security/selinux/Kconfig help text.
+			0 -- disable.
+			1 -- enable.
+			Default value is set via kernel config option.
+			If enabled at boot time, /selinux/disable can be used
+			later to disable prior to initial policy load.
+
+	apparmor=	[APPARMOR] Disable or enable AppArmor at boot time
+			Format: { "0" | "1" }
+			See security/apparmor/Kconfig help text
+			0 -- disable.
+			1 -- enable.
+			Default value is set via kernel config option.
+
+	serialnumber	[BUGS=X86-32]
+
+	shapers=	[NET]
+			Maximal number of shapers.
+
+	simeth=		[IA-64]
+	simscsi=
+
+	slram=		[HW,MTD]
+
+	slab_nomerge	[MM]
+			Disable merging of slabs with similar size. May be
+			necessary if there is some reason to distinguish
+			allocs to different slabs, especially in hardened
+			environments where the risk of heap overflows and
+			layout control by attackers can usually be
+			frustrated by disabling merging. This will reduce
+			most of the exposure of a heap attack to a single
+			cache (risks via metadata attacks are mostly
+			unchanged). Debug options disable merging on their
+			own.
+			For more information see Documentation/vm/slub.rst.
+
+	slab_max_order=	[MM, SLAB]
+			Determines the maximum allowed order for slabs.
+			A high setting may cause OOMs due to memory
+			fragmentation.  Defaults to 1 for systems with
+			more than 32MB of RAM, 0 otherwise.
+
+	slub_debug[=options[,slabs]]	[MM, SLUB]
+			Enabling slub_debug allows one to determine the
+			culprit if slab objects become corrupted. Enabling
+			slub_debug can create guard zones around objects and
+			may poison objects when not in use. Also tracks the
+			last alloc / free. For more information see
+			Documentation/vm/slub.rst.
+
+	slub_memcg_sysfs=	[MM, SLUB]
+			Determines whether to enable sysfs directories for
+			memory cgroup sub-caches. 1 to enable, 0 to disable.
+			The default is determined by CONFIG_SLUB_MEMCG_SYSFS_ON.
+			Enabling this can lead to a very high number of	debug
+			directories and files being created under
+			/sys/kernel/slub.
+
+	slub_max_order= [MM, SLUB]
+			Determines the maximum allowed order for slabs.
+			A high setting may cause OOMs due to memory
+			fragmentation. For more information see
+			Documentation/vm/slub.rst.
+
+	slub_min_objects=	[MM, SLUB]
+			The minimum number of objects per slab. SLUB will
+			increase the slab order up to slub_max_order to
+			generate a sufficiently large slab able to contain
+			the number of objects indicated. The higher the number
+			of objects the smaller the overhead of tracking slabs
+			and the less frequently locks need to be acquired.
+			For more information see Documentation/vm/slub.rst.
+
+	slub_min_order=	[MM, SLUB]
+			Determines the minimum page order for slabs. Must be
+			lower than slub_max_order.
+			For more information see Documentation/vm/slub.rst.
+
+	slub_nomerge	[MM, SLUB]
+			Same with slab_nomerge. This is supported for legacy.
+			See slab_nomerge for more information.
+
+	smart2=		[HW]
+			Format: <io1>[,<io2>[,...,<io8>]]
+
+	smsc-ircc2.nopnp	[HW] Don't use PNP to discover SMC devices
+	smsc-ircc2.ircc_cfg=	[HW] Device configuration I/O port
+	smsc-ircc2.ircc_sir=	[HW] SIR base I/O port
+	smsc-ircc2.ircc_fir=	[HW] FIR base I/O port
+	smsc-ircc2.ircc_irq=	[HW] IRQ line
+	smsc-ircc2.ircc_dma=	[HW] DMA channel
+	smsc-ircc2.ircc_transceiver= [HW] Transceiver type:
+				0: Toshiba Satellite 1800 (GP data pin select)
+				1: Fast pin select (default)
+				2: ATC IRMode
+
+	smt		[KNL,S390] Set the maximum number of threads (logical
+			CPUs) to use per physical CPU on systems capable of
+			symmetric multithreading (SMT). Will be capped to the
+			actual hardware limit.
+			Format: <integer>
+			Default: -1 (no limit)
+
+	softlockup_panic=
+			[KNL] Should the soft-lockup detector generate panics.
+			Format: <integer>
+
+			A nonzero value instructs the soft-lockup detector
+			to panic the machine when a soft-lockup occurs. This
+			is also controlled by CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC
+			which is the respective build-time switch to that
+			functionality.
+
+	softlockup_all_cpu_backtrace=
+			[KNL] Should the soft-lockup detector generate
+			backtraces on all cpus.
+			Format: <integer>
+
+	sonypi.*=	[HW] Sony Programmable I/O Control Device driver
+			See Documentation/laptops/sonypi.txt
+
+	spectre_v2=	[X86] Control mitigation of Spectre variant 2
+			(indirect branch speculation) vulnerability.
+			The default operation protects the kernel from
+			user space attacks.
+
+			on   - unconditionally enable, implies
+			       spectre_v2_user=on
+			off  - unconditionally disable, implies
+			       spectre_v2_user=off
+			auto - kernel detects whether your CPU model is
+			       vulnerable
+
+			Selecting 'on' will, and 'auto' may, choose a
+			mitigation method at run time according to the
+			CPU, the available microcode, the setting of the
+			CONFIG_RETPOLINE configuration option, and the
+			compiler with which the kernel was built.
+
+			Selecting 'on' will also enable the mitigation
+			against user space to user space task attacks.
+
+			Selecting 'off' will disable both the kernel and
+			the user space protections.
+
+			Specific mitigations can also be selected manually:
+
+			retpoline	  - replace indirect branches
+			retpoline,generic - Retpolines
+			retpoline,lfence  - LFENCE; indirect branch
+			retpoline,amd     - alias for retpoline,lfence
+			eibrs		  - enhanced IBRS
+			eibrs,retpoline   - enhanced IBRS + Retpolines
+			eibrs,lfence      - enhanced IBRS + LFENCE
+
+			Not specifying this option is equivalent to
+			spectre_v2=auto.
+
+	spectre_v2_user=
+			[X86] Control mitigation of Spectre variant 2
+		        (indirect branch speculation) vulnerability between
+		        user space tasks
+
+			on	- Unconditionally enable mitigations. Is
+				  enforced by spectre_v2=on
+
+			off     - Unconditionally disable mitigations. Is
+				  enforced by spectre_v2=off
+
+			prctl   - Indirect branch speculation is enabled,
+				  but mitigation can be enabled via prctl
+				  per thread.  The mitigation control state
+				  is inherited on fork.
+
+			prctl,ibpb
+				- Like "prctl" above, but only STIBP is
+				  controlled per thread. IBPB is issued
+				  always when switching between different user
+				  space processes.
+
+			seccomp
+				- Same as "prctl" above, but all seccomp
+				  threads will enable the mitigation unless
+				  they explicitly opt out.
+
+			seccomp,ibpb
+				- Like "seccomp" above, but only STIBP is
+				  controlled per thread. IBPB is issued
+				  always when switching between different
+				  user space processes.
+
+			auto    - Kernel selects the mitigation depending on
+				  the available CPU features and vulnerability.
+
+			Default mitigation:
+			If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
+
+			Not specifying this option is equivalent to
+			spectre_v2_user=auto.
+
+	spec_store_bypass_disable=
+			[HW] Control Speculative Store Bypass (SSB) Disable mitigation
+			(Speculative Store Bypass vulnerability)
+
+			Certain CPUs are vulnerable to an exploit against a
+			a common industry wide performance optimization known
+			as "Speculative Store Bypass" in which recent stores
+			to the same memory location may not be observed by
+			later loads during speculative execution. The idea
+			is that such stores are unlikely and that they can
+			be detected prior to instruction retirement at the
+			end of a particular speculation execution window.
+
+			In vulnerable processors, the speculatively forwarded
+			store can be used in a cache side channel attack, for
+			example to read memory to which the attacker does not
+			directly have access (e.g. inside sandboxed code).
+
+			This parameter controls whether the Speculative Store
+			Bypass optimization is used.
+
+			On x86 the options are:
+
+			on      - Unconditionally disable Speculative Store Bypass
+			off     - Unconditionally enable Speculative Store Bypass
+			auto    - Kernel detects whether the CPU model contains an
+				  implementation of Speculative Store Bypass and
+				  picks the most appropriate mitigation. If the
+				  CPU is not vulnerable, "off" is selected. If the
+				  CPU is vulnerable the default mitigation is
+				  architecture and Kconfig dependent. See below.
+			prctl   - Control Speculative Store Bypass per thread
+				  via prctl. Speculative Store Bypass is enabled
+				  for a process by default. The state of the control
+				  is inherited on fork.
+			seccomp - Same as "prctl" above, but all seccomp threads
+				  will disable SSB unless they explicitly opt out.
+
+			Default mitigations:
+			X86:	If CONFIG_SECCOMP=y "seccomp", otherwise "prctl"
+
+			On powerpc the options are:
+
+			on,auto - On Power8 and Power9 insert a store-forwarding
+				  barrier on kernel entry and exit. On Power7
+				  perform a software flush on kernel entry and
+				  exit.
+			off	- No action.
+
+			Not specifying this option is equivalent to
+			spec_store_bypass_disable=auto.
+
+	spia_io_base=	[HW,MTD]
+	spia_fio_base=
+	spia_pedr=
+	spia_peddr=
+
+	srbds=		[X86,INTEL]
+			Control the Special Register Buffer Data Sampling
+			(SRBDS) mitigation.
+
+			Certain CPUs are vulnerable to an MDS-like
+			exploit which can leak bits from the random
+			number generator.
+
+			By default, this issue is mitigated by
+			microcode.  However, the microcode fix can cause
+			the RDRAND and RDSEED instructions to become
+			much slower.  Among other effects, this will
+			result in reduced throughput from /dev/urandom.
+
+			The microcode mitigation can be disabled with
+			the following option:
+
+			off:    Disable mitigation and remove
+				performance impact to RDRAND and RDSEED
+
+	srcutree.counter_wrap_check [KNL]
+			Specifies how frequently to check for
+			grace-period sequence counter wrap for the
+			srcu_data structure's ->srcu_gp_seq_needed field.
+			The greater the number of bits set in this kernel
+			parameter, the less frequently counter wrap will
+			be checked for.  Note that the bottom two bits
+			are ignored.
+
+	srcutree.exp_holdoff [KNL]
+			Specifies how many nanoseconds must elapse
+			since the end of the last SRCU grace period for
+			a given srcu_struct until the next normal SRCU
+			grace period will be considered for automatic
+			expediting.  Set to zero to disable automatic
+			expediting.
+
+	ssbd=		[ARM64,HW]
+			Speculative Store Bypass Disable control
+
+			On CPUs that are vulnerable to the Speculative
+			Store Bypass vulnerability and offer a
+			firmware based mitigation, this parameter
+			indicates how the mitigation should be used:
+
+			force-on:  Unconditionally enable mitigation for
+				   for both kernel and userspace
+			force-off: Unconditionally disable mitigation for
+				   for both kernel and userspace
+			kernel:    Always enable mitigation in the
+				   kernel, and offer a prctl interface
+				   to allow userspace to register its
+				   interest in being mitigated too.
+
+	stack_guard_gap=	[MM]
+			override the default stack gap protection. The value
+			is in page units and it defines how many pages prior
+			to (for stacks growing down) resp. after (for stacks
+			growing up) the main stack are reserved for no other
+			mapping. Default value is 256 pages.
+
+	stacktrace	[FTRACE]
+			Enabled the stack tracer on boot up.
+
+	stacktrace_filter=[function-list]
+			[FTRACE] Limit the functions that the stack tracer
+			will trace at boot up. function-list is a comma separated
+			list of functions. This list can be changed at run
+			time by the stack_trace_filter file in the debugfs
+			tracing directory. Note, this enables stack tracing
+			and the stacktrace above is not needed.
+
+	sti=		[PARISC,HW]
+			Format: <num>
+			Set the STI (builtin display/keyboard on the HP-PARISC
+			machines) console (graphic card) which should be used
+			as the initial boot-console.
+			See also comment in drivers/video/console/sticore.c.
+
+	sti_font=	[HW]
+			See comment in drivers/video/console/sticore.c.
+
+	stifb=		[HW]
+			Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]]
+
+	sunrpc.min_resvport=
+	sunrpc.max_resvport=
+			[NFS,SUNRPC]
+			SunRPC servers often require that client requests
+			originate from a privileged port (i.e. a port in the
+			range 0 < portnr < 1024).
+			An administrator who wishes to reserve some of these
+			ports for other uses may adjust the range that the
+			kernel's sunrpc client considers to be privileged
+			using these two parameters to set the minimum and
+			maximum port values.
+
+	sunrpc.svc_rpc_per_connection_limit=
+			[NFS,SUNRPC]
+			Limit the number of requests that the server will
+			process in parallel from a single connection.
+			The default value is 0 (no limit).
+
+	sunrpc.pool_mode=
+			[NFS]
+			Control how the NFS server code allocates CPUs to
+			service thread pools.  Depending on how many NICs
+			you have and where their interrupts are bound, this
+			option will affect which CPUs will do NFS serving.
+			Note: this parameter cannot be changed while the
+			NFS server is running.
+
+			auto	    the server chooses an appropriate mode
+				    automatically using heuristics
+			global	    a single global pool contains all CPUs
+			percpu	    one pool for each CPU
+			pernode	    one pool for each NUMA node (equivalent
+				    to global on non-NUMA machines)
+
+	sunrpc.tcp_slot_table_entries=
+	sunrpc.udp_slot_table_entries=
+			[NFS,SUNRPC]
+			Sets the upper limit on the number of simultaneous
+			RPC calls that can be sent from the client to a
+			server. Increasing these values may allow you to
+			improve throughput, but will also increase the
+			amount of memory reserved for use by the client.
+
+	suspend.pm_test_delay=
+			[SUSPEND]
+			Sets the number of seconds to remain in a suspend test
+			mode before resuming the system (see
+			/sys/power/pm_test). Only available when CONFIG_PM_DEBUG
+			is set. Default value is 5.
+
+	swapaccount=[0|1]
+			[KNL] Enable accounting of swap in memory resource
+			controller if no parameter or 1 is given or disable
+			it if 0 is given (See Documentation/cgroup-v1/memory.txt)
+
+	swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
+			Format: { <int> | force | noforce }
+			<int> -- Number of I/O TLB slabs
+			force -- force using of bounce buffers even if they
+			         wouldn't be automatically used by the kernel
+			noforce -- Never use bounce buffers (for debugging)
+
+	switches=	[HW,M68k]
+
+	sysfs.deprecated=0|1 [KNL]
+			Enable/disable old style sysfs layout for old udev
+			on older distributions. When this option is enabled
+			very new udev will not work anymore. When this option
+			is disabled (or CONFIG_SYSFS_DEPRECATED not compiled)
+			in older udev will not work anymore.
+			Default depends on CONFIG_SYSFS_DEPRECATED_V2 set in
+			the kernel configuration.
+
+	sysrq_always_enabled
+			[KNL]
+			Ignore sysrq setting - this boot parameter will
+			neutralize any effect of /proc/sys/kernel/sysrq.
+			Useful for debugging.
+
+	tcpmhash_entries= [KNL,NET]
+			Set the number of tcp_metrics_hash slots.
+			Default value is 8192 or 16384 depending on total
+			ram pages. This is used to specify the TCP metrics
+			cache size. See Documentation/networking/ip-sysctl.txt
+			"tcp_no_metrics_save" section for more details.
+
+	tdfx=		[HW,DRM]
+
+	test_suspend=	[SUSPEND][,N]
+			Specify "mem" (for Suspend-to-RAM) or "standby" (for
+			standby suspend) or "freeze" (for suspend type freeze)
+			as the system sleep state during system startup with
+			the optional capability to repeat N number of times.
+			The system is woken from this state using a
+			wakeup-capable RTC alarm.
+
+	thash_entries=	[KNL,NET]
+			Set number of hash buckets for TCP connection
+
+	thermal.act=	[HW,ACPI]
+			-1: disable all active trip points in all thermal zones
+			<degrees C>: override all lowest active trip points
+
+	thermal.crt=	[HW,ACPI]
+			-1: disable all critical trip points in all thermal zones
+			<degrees C>: override all critical trip points
+
+	thermal.nocrt=	[HW,ACPI]
+			Set to disable actions on ACPI thermal zone
+			critical and hot trip points.
+
+	thermal.off=	[HW,ACPI]
+			1: disable ACPI thermal control
+
+	thermal.psv=	[HW,ACPI]
+			-1: disable all passive trip points
+			<degrees C>: override all passive trip points to this
+			value
+
+	thermal.tzp=	[HW,ACPI]
+			Specify global default ACPI thermal zone polling rate
+			<deci-seconds>: poll all this frequency
+			0: no polling (default)
+
+	threadirqs	[KNL]
+			Force threading of all interrupt handlers except those
+			marked explicitly IRQF_NO_THREAD.
+
+	tmem		[KNL,XEN]
+			Enable the Transcendent memory driver if built-in.
+
+	tmem.cleancache=0|1 [KNL, XEN]
+			Default is on (1). Disable the usage of the cleancache
+			API to send anonymous pages to the hypervisor.
+
+	tmem.frontswap=0|1 [KNL, XEN]
+			Default is on (1). Disable the usage of the frontswap
+			API to send swap pages to the hypervisor. If disabled
+			the selfballooning and selfshrinking are force disabled.
+
+	tmem.selfballooning=0|1 [KNL, XEN]
+			Default is on (1). Disable the driving of swap pages
+			to the hypervisor.
+
+	tmem.selfshrinking=0|1 [KNL, XEN]
+			Default is on (1). Partial swapoff that immediately
+			transfers pages from Xen hypervisor back to the
+			kernel based on different criteria.
+
+	topology=	[S390]
+			Format: {off | on}
+			Specify if the kernel should make use of the cpu
+			topology information if the hardware supports this.
+			The scheduler will make use of this information and
+			e.g. base its process migration decisions on it.
+			Default is on.
+
+	topology_updates= [KNL, PPC, NUMA]
+			Format: {off}
+			Specify if the kernel should ignore (off)
+			topology updates sent by the hypervisor to this
+			LPAR.
+
+	tp720=		[HW,PS2]
+
+	tpm_suspend_pcr=[HW,TPM]
+			Format: integer pcr id
+			Specify that at suspend time, the tpm driver
+			should extend the specified pcr with zeros,
+			as a workaround for some chips which fail to
+			flush the last written pcr on TPM_SaveState.
+			This will guarantee that all the other pcrs
+			are saved.
+
+	trace_buf_size=nn[KMG]
+			[FTRACE] will set tracing buffer size on each cpu.
+
+	trace_event=[event-list]
+			[FTRACE] Set and start specified trace events in order
+			to facilitate early boot debugging. The event-list is a
+			comma separated list of trace events to enable. See
+			also Documentation/trace/events.rst
+
+	trace_options=[option-list]
+			[FTRACE] Enable or disable tracer options at boot.
+			The option-list is a comma delimited list of options
+			that can be enabled or disabled just as if you were
+			to echo the option name into
+
+			    /sys/kernel/debug/tracing/trace_options
+
+			For example, to enable stacktrace option (to dump the
+			stack trace of each event), add to the command line:
+
+			      trace_options=stacktrace
+
+			See also Documentation/trace/ftrace.rst "trace options"
+			section.
+
+	tp_printk[FTRACE]
+			Have the tracepoints sent to printk as well as the
+			tracing ring buffer. This is useful for early boot up
+			where the system hangs or reboots and does not give the
+			option for reading the tracing buffer or performing a
+			ftrace_dump_on_oops.
+
+			To turn off having tracepoints sent to printk,
+			 echo 0 > /proc/sys/kernel/tracepoint_printk
+			Note, echoing 1 into this file without the
+			tracepoint_printk kernel cmdline option has no effect.
+
+			** CAUTION **
+
+			Having tracepoints sent to printk() and activating high
+			frequency tracepoints such as irq or sched, can cause
+			the system to live lock.
+
+	traceoff_on_warning
+			[FTRACE] enable this option to disable tracing when a
+			warning is hit. This turns off "tracing_on". Tracing can
+			be enabled again by echoing '1' into the "tracing_on"
+			file located in /sys/kernel/debug/tracing/
+
+			This option is useful, as it disables the trace before
+			the WARNING dump is called, which prevents the trace to
+			be filled with content caused by the warning output.
+
+			This option can also be set at run time via the sysctl
+			option:  kernel/traceoff_on_warning
+
+	transparent_hugepage=
+			[KNL]
+			Format: [always|madvise|never]
+			Can be used to control the default behavior of the system
+			with respect to transparent hugepages.
+			See Documentation/admin-guide/mm/transhuge.rst
+			for more details.
+
+	tsc=		Disable clocksource stability checks for TSC.
+			Format: <string>
+			[x86] reliable: mark tsc clocksource as reliable, this
+			disables clocksource verification at runtime, as well
+			as the stability checks done at bootup.	Used to enable
+			high-resolution timer mode on older hardware, and in
+			virtualized environment.
+			[x86] noirqtime: Do not use TSC to do irq accounting.
+			Used to run time disable IRQ_TIME_ACCOUNTING on any
+			platforms where RDTSC is slow and this accounting
+			can add overhead.
+			[x86] unstable: mark the TSC clocksource as unstable, this
+			marks the TSC unconditionally unstable at bootup and
+			avoids any further wobbles once the TSC watchdog notices.
+
+	tsx=		[X86] Control Transactional Synchronization
+			Extensions (TSX) feature in Intel processors that
+			support TSX control.
+
+			This parameter controls the TSX feature. The options are:
+
+			on	- Enable TSX on the system. Although there are
+				mitigations for all known security vulnerabilities,
+				TSX has been known to be an accelerator for
+				several previous speculation-related CVEs, and
+				so there may be unknown	security risks associated
+				with leaving it enabled.
+
+			off	- Disable TSX on the system. (Note that this
+				option takes effect only on newer CPUs which are
+				not vulnerable to MDS, i.e., have
+				MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 and which get
+				the new IA32_TSX_CTRL MSR through a microcode
+				update. This new MSR allows for the reliable
+				deactivation of the TSX functionality.)
+
+			auto	- Disable TSX if X86_BUG_TAA is present,
+				  otherwise enable TSX on the system.
+
+			Not specifying this option is equivalent to tsx=off.
+
+			See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
+			for more details.
+
+	tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async
+			Abort (TAA) vulnerability.
+
+			Similar to Micro-architectural Data Sampling (MDS)
+			certain CPUs that support Transactional
+			Synchronization Extensions (TSX) are vulnerable to an
+			exploit against CPU internal buffers which can forward
+			information to a disclosure gadget under certain
+			conditions.
+
+			In vulnerable processors, the speculatively forwarded
+			data can be used in a cache side channel attack, to
+			access data to which the attacker does not have direct
+			access.
+
+			This parameter controls the TAA mitigation.  The
+			options are:
+
+			full       - Enable TAA mitigation on vulnerable CPUs
+				     if TSX is enabled.
+
+			full,nosmt - Enable TAA mitigation and disable SMT on
+				     vulnerable CPUs. If TSX is disabled, SMT
+				     is not disabled because CPU is not
+				     vulnerable to cross-thread TAA attacks.
+			off        - Unconditionally disable TAA mitigation
+
+			On MDS-affected machines, tsx_async_abort=off can be
+			prevented by an active MDS mitigation as both vulnerabilities
+			are mitigated with the same mechanism so in order to disable
+			this mitigation, you need to specify mds=off too.
+
+			Not specifying this option is equivalent to
+			tsx_async_abort=full.  On CPUs which are MDS affected
+			and deploy MDS mitigation, TAA mitigation is not
+			required and doesn't provide any additional
+			mitigation.
+
+			For details see:
+			Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
+
+	turbografx.map[2|3]=	[HW,JOY]
+			TurboGraFX parallel port interface
+			Format:
+			<port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
+			See also Documentation/input/devices/joystick-parport.rst
+
+	udbg-immortal	[PPC] When debugging early kernel crashes that
+			happen after console_init() and before a proper
+			console driver takes over, this boot options might
+			help "seeing" what's going on.
+
+	uhash_entries=	[KNL,NET]
+			Set number of hash buckets for UDP/UDP-Lite connections
+
+	uhci-hcd.ignore_oc=
+			[USB] Ignore overcurrent events (default N).
+			Some badly-designed motherboards generate lots of
+			bogus events, for ports that aren't wired to
+			anything.  Set this parameter to avoid log spamming.
+			Note that genuine overcurrent events won't be
+			reported either.
+
+	unknown_nmi_panic
+			[X86] Cause panic on unknown NMI.
+
+	usbcore.authorized_default=
+			[USB] Default USB device authorization:
+			(default -1 = authorized except for wireless USB,
+			0 = not authorized, 1 = authorized)
+
+	usbcore.autosuspend=
+			[USB] The autosuspend time delay (in seconds) used
+			for newly-detected USB devices (default 2).  This
+			is the time required before an idle device will be
+			autosuspended.  Devices for which the delay is set
+			to a negative value won't be autosuspended at all.
+
+	usbcore.usbfs_snoop=
+			[USB] Set to log all usbfs traffic (default 0 = off).
+
+	usbcore.usbfs_snoop_max=
+			[USB] Maximum number of bytes to snoop in each URB
+			(default = 65536).
+
+	usbcore.blinkenlights=
+			[USB] Set to cycle leds on hubs (default 0 = off).
+
+	usbcore.old_scheme_first=
+			[USB] Start with the old device initialization
+			scheme (default 0 = off).
+
+	usbcore.usbfs_memory_mb=
+			[USB] Memory limit (in MB) for buffers allocated by
+			usbfs (default = 16, 0 = max = 2047).
+
+	usbcore.use_both_schemes=
+			[USB] Try the other device initialization scheme
+			if the first one fails (default 1 = enabled).
+
+	usbcore.initial_descriptor_timeout=
+			[USB] Specifies timeout for the initial 64-byte
+			USB_REQ_GET_DESCRIPTOR request in milliseconds
+			(default 5000 = 5.0 seconds).
+
+	usbcore.nousb	[USB] Disable the USB subsystem
+
+	usbcore.quirks=
+			[USB] A list of quirk entries to augment the built-in
+			usb core quirk list. List entries are separated by
+			commas. Each entry has the form
+			VendorID:ProductID:Flags. The IDs are 4-digit hex
+			numbers and Flags is a set of letters. Each letter
+			will change the built-in quirk; setting it if it is
+			clear and clearing it if it is set. The letters have
+			the following meanings:
+				a = USB_QUIRK_STRING_FETCH_255 (string
+					descriptors must not be fetched using
+					a 255-byte read);
+				b = USB_QUIRK_RESET_RESUME (device can't resume
+					correctly so reset it instead);
+				c = USB_QUIRK_NO_SET_INTF (device can't handle
+					Set-Interface requests);
+				d = USB_QUIRK_CONFIG_INTF_STRINGS (device can't
+					handle its Configuration or Interface
+					strings);
+				e = USB_QUIRK_RESET (device can't be reset
+					(e.g morph devices), don't use reset);
+				f = USB_QUIRK_HONOR_BNUMINTERFACES (device has
+					more interface descriptions than the
+					bNumInterfaces count, and can't handle
+					talking to these interfaces);
+				g = USB_QUIRK_DELAY_INIT (device needs a pause
+					during initialization, after we read
+					the device descriptor);
+				h = USB_QUIRK_LINEAR_UFRAME_INTR_BINTERVAL (For
+					high speed and super speed interrupt
+					endpoints, the USB 2.0 and USB 3.0 spec
+					require the interval in microframes (1
+					microframe = 125 microseconds) to be
+					calculated as interval = 2 ^
+					(bInterval-1).
+					Devices with this quirk report their
+					bInterval as the result of this
+					calculation instead of the exponent
+					variable used in the calculation);
+				i = USB_QUIRK_DEVICE_QUALIFIER (device can't
+					handle device_qualifier descriptor
+					requests);
+				j = USB_QUIRK_IGNORE_REMOTE_WAKEUP (device
+					generates spurious wakeup, ignore
+					remote wakeup capability);
+				k = USB_QUIRK_NO_LPM (device can't handle Link
+					Power Management);
+				l = USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL
+					(Device reports its bInterval as linear
+					frames instead of the USB 2.0
+					calculation);
+				m = USB_QUIRK_DISCONNECT_SUSPEND (Device needs
+					to be disconnected before suspend to
+					prevent spurious wakeup);
+				n = USB_QUIRK_DELAY_CTRL_MSG (Device needs a
+					pause after every control message);
+				o = USB_QUIRK_HUB_SLOW_RESET (Hub needs extra
+					delay after resetting its port);
+			Example: quirks=0781:5580:bk,0a5c:5834:gij
+
+	usbhid.mousepoll=
+			[USBHID] The interval which mice are to be polled at.
+
+	usbhid.jspoll=
+			[USBHID] The interval which joysticks are to be polled at.
+
+	usbhid.kbpoll=
+			[USBHID] The interval which keyboards are to be polled at.
+
+	usb-storage.delay_use=
+			[UMS] The delay in seconds before a new device is
+			scanned for Logical Units (default 1).
+
+	usb-storage.quirks=
+			[UMS] A list of quirks entries to supplement or
+			override the built-in unusual_devs list.  List
+			entries are separated by commas.  Each entry has
+			the form VID:PID:Flags where VID and PID are Vendor
+			and Product ID values (4-digit hex numbers) and
+			Flags is a set of characters, each corresponding
+			to a common usb-storage quirk flag as follows:
+				a = SANE_SENSE (collect more than 18 bytes
+					of sense data, not on uas);
+				b = BAD_SENSE (don't collect more than 18
+					bytes of sense data, not on uas);
+				c = FIX_CAPACITY (decrease the reported
+					device capacity by one sector);
+				d = NO_READ_DISC_INFO (don't use
+					READ_DISC_INFO command, not on uas);
+				e = NO_READ_CAPACITY_16 (don't use
+					READ_CAPACITY_16 command);
+				f = NO_REPORT_OPCODES (don't use report opcodes
+					command, uas only);
+				g = MAX_SECTORS_240 (don't transfer more than
+					240 sectors at a time, uas only);
+				h = CAPACITY_HEURISTICS (decrease the
+					reported device capacity by one
+					sector if the number is odd);
+				i = IGNORE_DEVICE (don't bind to this
+					device);
+				j = NO_REPORT_LUNS (don't use report luns
+					command, uas only);
+				k = NO_SAME (do not use WRITE_SAME, uas only)
+				l = NOT_LOCKABLE (don't try to lock and
+					unlock ejectable media, not on uas);
+				m = MAX_SECTORS_64 (don't transfer more
+					than 64 sectors = 32 KB at a time,
+					not on uas);
+				n = INITIAL_READ10 (force a retry of the
+					initial READ(10) command, not on uas);
+				o = CAPACITY_OK (accept the capacity
+					reported by the device, not on uas);
+				p = WRITE_CACHE (the device cache is ON
+					by default, not on uas);
+				r = IGNORE_RESIDUE (the device reports
+					bogus residue values, not on uas);
+				s = SINGLE_LUN (the device has only one
+					Logical Unit);
+				t = NO_ATA_1X (don't allow ATA(12) and ATA(16)
+					commands, uas only);
+				u = IGNORE_UAS (don't bind to the uas driver);
+				w = NO_WP_DETECT (don't test whether the
+					medium is write-protected).
+				y = ALWAYS_SYNC (issue a SYNCHRONIZE_CACHE
+					even if the device claims no cache,
+					not on uas)
+			Example: quirks=0419:aaf5:rl,0421:0433:rc
+
+	user_debug=	[KNL,ARM]
+			Format: <int>
+			See arch/arm/Kconfig.debug help text.
+				 1 - undefined instruction events
+				 2 - system calls
+				 4 - invalid data aborts
+				 8 - SIGSEGV faults
+				16 - SIGBUS faults
+			Example: user_debug=31
+
+	userpte=
+			[X86] Flags controlling user PTE allocations.
+
+				nohigh = do not allocate PTE pages in
+					HIGHMEM regardless of setting
+					of CONFIG_HIGHPTE.
+
+	vdso=		[X86,SH]
+			On X86_32, this is an alias for vdso32=.  Otherwise:
+
+			vdso=1: enable VDSO (the default)
+			vdso=0: disable VDSO mapping
+
+	vdso32=		[X86] Control the 32-bit vDSO
+			vdso32=1: enable 32-bit VDSO
+			vdso32=0 or vdso32=2: disable 32-bit VDSO
+
+			See the help text for CONFIG_COMPAT_VDSO for more
+			details.  If CONFIG_COMPAT_VDSO is set, the default is
+			vdso32=0; otherwise, the default is vdso32=1.
+
+			For compatibility with older kernels, vdso32=2 is an
+			alias for vdso32=0.
+
+			Try vdso32=0 if you encounter an error that says:
+			dl_main: Assertion `(void *) ph->p_vaddr == _rtld_local._dl_sysinfo_dso' failed!
+
+	vector=		[IA-64,SMP]
+			vector=percpu: enable percpu vector domain
+
+	video=		[FB] Frame buffer configuration
+			See Documentation/fb/modedb.txt.
+
+	video.brightness_switch_enabled= [0,1]
+			If set to 1, on receiving an ACPI notify event
+			generated by hotkey, video driver will adjust brightness
+			level and then send out the event to user space through
+			the allocated input device; If set to 0, video driver
+			will only send out the event without touching backlight
+			brightness level.
+			default: 1
+
+	virtio_mmio.device=
+			[VMMIO] Memory mapped virtio (platform) device.
+
+				<size>@<baseaddr>:<irq>[:<id>]
+			where:
+				<size>     := size (can use standard suffixes
+						like K, M and G)
+				<baseaddr> := physical base address
+				<irq>      := interrupt number (as passed to
+						request_irq())
+				<id>       := (optional) platform device id
+			example:
+				virtio_mmio.device=1K@0x100b0000:48:7
+
+			Can be used multiple times for multiple devices.
+
+	vga=		[BOOT,X86-32] Select a particular video mode
+			See Documentation/x86/boot.txt and
+			Documentation/svga.txt.
+			Use vga=ask for menu.
+			This is actually a boot loader parameter; the value is
+			passed to the kernel using a special protocol.
+
+	vmalloc=nn[KMG]	[KNL,BOOT] Forces the vmalloc area to have an exact
+			size of <nn>. This can be used to increase the
+			minimum size (128MB on x86). It can also be used to
+			decrease the size and leave more room for directly
+			mapped kernel RAM.
+
+	vmcp_cma=nn[MG]	[KNL,S390]
+			Sets the memory size reserved for contiguous memory
+			allocations for the vmcp device driver.
+
+	vmhalt=		[KNL,S390] Perform z/VM CP command after system halt.
+			Format: <command>
+
+	vmpanic=	[KNL,S390] Perform z/VM CP command after kernel panic.
+			Format: <command>
+
+	vmpoff=		[KNL,S390] Perform z/VM CP command after power off.
+			Format: <command>
+
+	vsyscall=	[X86-64]
+			Controls the behavior of vsyscalls (i.e. calls to
+			fixed addresses of 0xffffffffff600x00 from legacy
+			code).  Most statically-linked binaries and older
+			versions of glibc use these calls.  Because these
+			functions are at fixed addresses, they make nice
+			targets for exploits that can control RIP.
+
+			emulate     [default] Vsyscalls turn into traps and are
+			            emulated reasonably safely.
+
+			none        Vsyscalls don't work at all.  This makes
+			            them quite hard to use for exploits but
+			            might break your system.
+
+	vt.color=	[VT] Default text color.
+			Format: 0xYX, X = foreground, Y = background.
+			Default: 0x07 = light gray on black.
+
+	vt.cur_default=	[VT] Default cursor shape.
+			Format: 0xCCBBAA, where AA, BB, and CC are the same as
+			the parameters of the <Esc>[?A;B;Cc escape sequence;
+			see VGA-softcursor.txt. Default: 2 = underline.
+
+	vt.default_blu=	[VT]
+			Format: <blue0>,<blue1>,<blue2>,...,<blue15>
+			Change the default blue palette of the console.
+			This is a 16-member array composed of values
+			ranging from 0-255.
+
+	vt.default_grn=	[VT]
+			Format: <green0>,<green1>,<green2>,...,<green15>
+			Change the default green palette of the console.
+			This is a 16-member array composed of values
+			ranging from 0-255.
+
+	vt.default_red=	[VT]
+			Format: <red0>,<red1>,<red2>,...,<red15>
+			Change the default red palette of the console.
+			This is a 16-member array composed of values
+			ranging from 0-255.
+
+	vt.default_utf8=
+			[VT]
+			Format=<0|1>
+			Set system-wide default UTF-8 mode for all tty's.
+			Default is 1, i.e. UTF-8 mode is enabled for all
+			newly opened terminals.
+
+	vt.global_cursor_default=
+			[VT]
+			Format=<-1|0|1>
+			Set system-wide default for whether a cursor
+			is shown on new VTs. Default is -1,
+			i.e. cursors will be created by default unless
+			overridden by individual drivers. 0 will hide
+			cursors, 1 will display them.
+
+	vt.italic=	[VT] Default color for italic text; 0-15.
+			Default: 2 = green.
+
+	vt.underline=	[VT] Default color for underlined text; 0-15.
+			Default: 3 = cyan.
+
+	watchdog timers	[HW,WDT] For information on watchdog timers,
+			see Documentation/watchdog/watchdog-parameters.txt
+			or other driver-specific files in the
+			Documentation/watchdog/ directory.
+
+	workqueue.watchdog_thresh=
+			If CONFIG_WQ_WATCHDOG is configured, workqueue can
+			warn stall conditions and dump internal state to
+			help debugging.  0 disables workqueue stall
+			detection; otherwise, it's the stall threshold
+			duration in seconds.  The default value is 30 and
+			it can be updated at runtime by writing to the
+			corresponding sysfs file.
+
+	workqueue.disable_numa
+			By default, all work items queued to unbound
+			workqueues are affine to the NUMA nodes they're
+			issued on, which results in better behavior in
+			general.  If NUMA affinity needs to be disabled for
+			whatever reason, this option can be used.  Note
+			that this also can be controlled per-workqueue for
+			workqueues visible under /sys/bus/workqueue/.
+
+	workqueue.power_efficient
+			Per-cpu workqueues are generally preferred because
+			they show better performance thanks to cache
+			locality; unfortunately, per-cpu workqueues tend to
+			be more power hungry than unbound workqueues.
+
+			Enabling this makes the per-cpu workqueues which
+			were observed to contribute significantly to power
+			consumption unbound, leading to measurably lower
+			power usage at the cost of small performance
+			overhead.
+
+			The default value of this parameter is determined by
+			the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
+
+	workqueue.debug_force_rr_cpu
+			Workqueue used to implicitly guarantee that work
+			items queued without explicit CPU specified are put
+			on the local CPU.  This guarantee is no longer true
+			and while local CPU is still preferred work items
+			may be put on foreign CPUs.  This debug option
+			forces round-robin CPU selection to flush out
+			usages which depend on the now broken guarantee.
+			When enabled, memory and cache locality will be
+			impacted.
+
+	x2apic_phys	[X86-64,APIC] Use x2apic physical mode instead of
+			default x2apic cluster mode on platforms
+			supporting x2apic.
+
+	x86_intel_mid_timer= [X86-32,APBT]
+			Choose timer option for x86 Intel MID platform.
+			Two valid options are apbt timer only and lapic timer
+			plus one apbt timer for broadcast timer.
+			x86_intel_mid_timer=apbt_only | lapic_and_apbt
+
+	xen_512gb_limit		[KNL,X86-64,XEN]
+			Restricts the kernel running paravirtualized under Xen
+			to use only up to 512 GB of RAM. The reason to do so is
+			crash analysis tools and Xen tools for doing domain
+			save/restore/migration must be enabled to handle larger
+			domains.
+
+	xen_emul_unplug=		[HW,X86,XEN]
+			Unplug Xen emulated devices
+			Format: [unplug0,][unplug1]
+			ide-disks -- unplug primary master IDE devices
+			aux-ide-disks -- unplug non-primary-master IDE devices
+			nics -- unplug network devices
+			all -- unplug all emulated devices (NICs and IDE disks)
+			unnecessary -- unplugging emulated devices is
+				unnecessary even if the host did not respond to
+				the unplug protocol
+			never -- do not unplug even if version check succeeds
+
+	xen_legacy_crash	[X86,XEN]
+			Crash from Xen panic notifier, without executing late
+			panic() code such as dumping handler.
+
+	xen_nopvspin	[X86,XEN]
+			Disables the ticketlock slowpath using Xen PV
+			optimizations.
+
+	xen_nopv	[X86]
+			Disables the PV optimizations forcing the HVM guest to
+			run as generic HVM guest with no PV drivers.
+
+	xen_scrub_pages=	[XEN]
+			Boolean option to control scrubbing pages before giving them back
+			to Xen, for use by other domains. Can be also changed at runtime
+			with /sys/devices/system/xen_memory/xen_memory0/scrub_pages.
+			Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
+
+	xen.balloon_boot_timeout= [XEN]
+			The time (in seconds) to wait before giving up to boot
+			in case initial ballooning fails to free enough memory.
+			Applies only when running as HVM or PVH guest and
+			started with less memory configured than allowed at
+			max. Default is 180.
+
+	xen.event_eoi_delay=	[XEN]
+			How long to delay EOI handling in case of event
+			storms (jiffies). Default is 10.
+
+	xen.event_loop_timeout=	[XEN]
+			After which time (jiffies) the event handling loop
+			should start to delay EOI handling. Default is 2.
+
+	xirc2ps_cs=	[NET,PCMCIA]
+			Format:
+			<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
+
+	xhci-hcd.quirks		[USB,KNL]
+			A hex value specifying bitmask with supplemental xhci
+			host controller quirks. Meaning of each bit can be
+			consulted in header drivers/usb/host/xhci.h.
diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
new file mode 100644
index 000000000..84de718f2
--- /dev/null
+++ b/Documentation/admin-guide/md.rst
@@ -0,0 +1,758 @@
+RAID arrays
+===========
+
+Boot time assembly of RAID arrays
+---------------------------------
+
+Tools that manage md devices can be found at
+   http://www.kernel.org/pub/linux/utils/raid/
+
+
+You can boot with your md device with the following kernel command
+lines:
+
+for old raid arrays without persistent superblocks::
+
+  md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn
+
+for raid arrays with persistent superblocks::
+
+  md=<md device no.>,dev0,dev1,...,devn
+
+or, to assemble a partitionable array::
+
+  md=d<md device no.>,dev0,dev1,...,devn
+
+``md device no.``
++++++++++++++++++
+
+The number of the md device
+
+================= =========
+``md device no.`` device
+================= =========
+              0		md0
+	      1		md1
+	      2		md2
+	      3		md3
+	      4		md4
+================= =========
+
+``raid level``
+++++++++++++++
+
+level of the RAID array
+
+=============== =============
+``raid level``  level
+=============== =============
+-1		linear mode
+0		striped mode
+=============== =============
+
+other modes are only supported with persistent super blocks
+
+``chunk size factor``
++++++++++++++++++++++
+
+(raid-0 and raid-1 only)
+
+Set  the chunk size as 4k << n.
+
+``fault level``
++++++++++++++++
+
+Totally ignored
+
+``dev0`` to ``devn``
+++++++++++++++++++++
+
+e.g. ``/dev/hda1``, ``/dev/hdc1``, ``/dev/sda1``, ``/dev/sdb1``
+
+A possible loadlin line (Harald Hoyer <HarryH@Royal.Net>)  looks like this::
+
+	e:\loadlin\loadlin e:\zimage root=/dev/md0 md=0,0,4,0,/dev/hdb2,/dev/hdc3 ro
+
+
+Boot time autodetection of RAID arrays
+--------------------------------------
+
+When md is compiled into the kernel (not as module), partitions of
+type 0xfd are scanned and automatically assembled into RAID arrays.
+This autodetection may be suppressed with the kernel parameter
+``raid=noautodetect``.  As of kernel 2.6.9, only drives with a type 0
+superblock can be autodetected and run at boot time.
+
+The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
+that all auto-detected arrays are assembled as partitionable.
+
+Boot time assembly of degraded/dirty arrays
+-------------------------------------------
+
+If a raid5 or raid6 array is both dirty and degraded, it could have
+undetectable data corruption.  This is because the fact that it is
+``dirty`` means that the parity cannot be trusted, and the fact that it
+is degraded means that some datablocks are missing and cannot reliably
+be reconstructed (due to no parity).
+
+For this reason, md will normally refuse to start such an array.  This
+requires the sysadmin to take action to explicitly start the array
+despite possible corruption.  This is normally done with::
+
+   mdadm --assemble --force ....
+
+This option is not really available if the array has the root
+filesystem on it.  In order to support this booting from such an
+array, md supports a module parameter ``start_dirty_degraded`` which,
+when set to 1, bypassed the checks and will allows dirty degraded
+arrays to be started.
+
+So, to boot with a root filesystem of a dirty degraded raid 5 or 6, use::
+
+   md-mod.start_dirty_degraded=1
+
+
+Superblock formats
+------------------
+
+The md driver can support a variety of different superblock formats.
+Currently, it supports superblock formats ``0.90.0`` and the ``md-1`` format
+introduced in the 2.5 development series.
+
+The kernel will autodetect which format superblock is being used.
+
+Superblock format ``0`` is treated differently to others for legacy
+reasons - it is the original superblock format.
+
+
+General Rules - apply for all superblock formats
+------------------------------------------------
+
+An array is ``created`` by writing appropriate superblocks to all
+devices.
+
+It is ``assembled`` by associating each of these devices with an
+particular md virtual device.  Once it is completely assembled, it can
+be accessed.
+
+An array should be created by a user-space tool.  This will write
+superblocks to all devices.  It will usually mark the array as
+``unclean``, or with some devices missing so that the kernel md driver
+can create appropriate redundancy (copying in raid 1, parity
+calculation in raid 4/5).
+
+When an array is assembled, it is first initialized with the
+SET_ARRAY_INFO ioctl.  This contains, in particular, a major and minor
+version number.  The major version number selects which superblock
+format is to be used.  The minor number might be used to tune handling
+of the format, such as suggesting where on each device to look for the
+superblock.
+
+Then each device is added using the ADD_NEW_DISK ioctl.  This
+provides, in particular, a major and minor number identifying the
+device to add.
+
+The array is started with the RUN_ARRAY ioctl.
+
+Once started, new devices can be added.  They should have an
+appropriate superblock written to them, and then be passed in with
+ADD_NEW_DISK.
+
+Devices that have failed or are not yet active can be detached from an
+array using HOT_REMOVE_DISK.
+
+
+Specific Rules that apply to format-0 super block arrays, and arrays with no superblock (non-persistent)
+--------------------------------------------------------------------------------------------------------
+
+An array can be ``created`` by describing the array (level, chunksize
+etc) in a SET_ARRAY_INFO ioctl.  This must have ``major_version==0`` and
+``raid_disks != 0``.
+
+Then uninitialized devices can be added with ADD_NEW_DISK.  The
+structure passed to ADD_NEW_DISK must specify the state of the device
+and its role in the array.
+
+Once started with RUN_ARRAY, uninitialized spares can be added with
+HOT_ADD_DISK.
+
+
+MD devices in sysfs
+-------------------
+
+md devices appear in sysfs (``/sys``) as regular block devices,
+e.g.::
+
+   /sys/block/md0
+
+Each ``md`` device will contain a subdirectory called ``md`` which
+contains further md-specific information about the device.
+
+All md devices contain:
+
+  level
+     a text file indicating the ``raid level``. e.g. raid0, raid1,
+     raid5, linear, multipath, faulty.
+     If no raid level has been set yet (array is still being
+     assembled), the value will reflect whatever has been written
+     to it, which may be a name like the above, or may be a number
+     such as ``0``, ``5``, etc.
+
+  raid_disks
+     a text file with a simple number indicating the number of devices
+     in a fully functional array.  If this is not yet known, the file
+     will be empty.  If an array is being resized this will contain
+     the new number of devices.
+     Some raid levels allow this value to be set while the array is
+     active.  This will reconfigure the array.   Otherwise it can only
+     be set while assembling an array.
+     A change to this attribute will not be permitted if it would
+     reduce the size of the array.  To reduce the number of drives
+     in an e.g. raid5, the array size must first be reduced by
+     setting the ``array_size`` attribute.
+
+  chunk_size
+     This is the size in bytes for ``chunks`` and is only relevant to
+     raid levels that involve striping (0,4,5,6,10). The address space
+     of the array is conceptually divided into chunks and consecutive
+     chunks are striped onto neighbouring devices.
+     The size should be at least PAGE_SIZE (4k) and should be a power
+     of 2.  This can only be set while assembling an array
+
+  layout
+     The ``layout`` for the array for the particular level.  This is
+     simply a number that is interpretted differently by different
+     levels.  It can be written while assembling an array.
+
+  array_size
+     This can be used to artificially constrain the available space in
+     the array to be less than is actually available on the combined
+     devices.  Writing a number (in Kilobytes) which is less than
+     the available size will set the size.  Any reconfiguration of the
+     array (e.g. adding devices) will not cause the size to change.
+     Writing the word ``default`` will cause the effective size of the
+     array to be whatever size is actually available based on
+     ``level``, ``chunk_size`` and ``component_size``.
+
+     This can be used to reduce the size of the array before reducing
+     the number of devices in a raid4/5/6, or to support external
+     metadata formats which mandate such clipping.
+
+  reshape_position
+     This is either ``none`` or a sector number within the devices of
+     the array where ``reshape`` is up to.  If this is set, the three
+     attributes mentioned above (raid_disks, chunk_size, layout) can
+     potentially have 2 values, an old and a new value.  If these
+     values differ, reading the attribute returns::
+
+        new (old)
+
+     and writing will effect the ``new`` value, leaving the ``old``
+     unchanged.
+
+  component_size
+     For arrays with data redundancy (i.e. not raid0, linear, faulty,
+     multipath), all components must be the same size - or at least
+     there must a size that they all provide space for.  This is a key
+     part or the geometry of the array.  It is measured in sectors
+     and can be read from here.  Writing to this value may resize
+     the array if the personality supports it (raid1, raid5, raid6),
+     and if the component drives are large enough.
+
+  metadata_version
+     This indicates the format that is being used to record metadata
+     about the array.  It can be 0.90 (traditional format), 1.0, 1.1,
+     1.2 (newer format in varying locations) or ``none`` indicating that
+     the kernel isn't managing metadata at all.
+     Alternately it can be ``external:`` followed by a string which
+     is set by user-space.  This indicates that metadata is managed
+     by a user-space program.  Any device failure or other event that
+     requires a metadata update will cause array activity to be
+     suspended until the event is acknowledged.
+
+  resync_start
+     The point at which resync should start.  If no resync is needed,
+     this will be a very large number (or ``none`` since 2.6.30-rc1).  At
+     array creation it will default to 0, though starting the array as
+     ``clean`` will set it much larger.
+
+  new_dev
+     This file can be written but not read.  The value written should
+     be a block device number as major:minor.  e.g. 8:0
+     This will cause that device to be attached to the array, if it is
+     available.  It will then appear at md/dev-XXX (depending on the
+     name of the device) and further configuration is then possible.
+
+  safe_mode_delay
+     When an md array has seen no write requests for a certain period
+     of time, it will be marked as ``clean``.  When another write
+     request arrives, the array is marked as ``dirty`` before the write
+     commences.  This is known as ``safe_mode``.
+     The ``certain period`` is controlled by this file which stores the
+     period as a number of seconds.  The default is 200msec (0.200).
+     Writing a value of 0 disables safemode.
+
+  array_state
+     This file contains a single word which describes the current
+     state of the array.  In many cases, the state can be set by
+     writing the word for the desired state, however some states
+     cannot be explicitly set, and some transitions are not allowed.
+
+     Select/poll works on this file.  All changes except between
+     Active_idle and active (which can be frequent and are not
+     very interesting) are notified.  active->active_idle is
+     reported if the metadata is externally managed.
+
+     clear
+         No devices, no size, no level
+
+         Writing is equivalent to STOP_ARRAY ioctl
+
+     inactive
+         May have some settings, but array is not active
+         all IO results in error
+
+         When written, doesn't tear down array, but just stops it
+
+     suspended (not supported yet)
+         All IO requests will block. The array can be reconfigured.
+
+         Writing this, if accepted, will block until array is quiessent
+
+     readonly
+         no resync can happen.  no superblocks get written.
+
+         Write requests fail
+
+     read-auto
+         like readonly, but behaves like ``clean`` on a write request.
+
+     clean
+         no pending writes, but otherwise active.
+
+         When written to inactive array, starts without resync
+
+         If a write request arrives then
+         if metadata is known, mark ``dirty`` and switch to ``active``.
+         if not known, block and switch to write-pending
+
+         If written to an active array that has pending writes, then fails.
+     active
+         fully active: IO and resync can be happening.
+         When written to inactive array, starts with resync
+
+     write-pending
+         clean, but writes are blocked waiting for ``active`` to be written.
+
+     active-idle
+         like active, but no writes have been seen for a while (safe_mode_delay).
+
+  bitmap/location
+     This indicates where the write-intent bitmap for the array is
+     stored.
+
+     It can be one of ``none``, ``file`` or ``[+-]N``.
+     ``file`` may later be extended to ``file:/file/name``
+     ``[+-]N`` means that many sectors from the start of the metadata.
+
+     This is replicated on all devices.  For arrays with externally
+     managed metadata, the offset is from the beginning of the
+     device.
+
+  bitmap/chunksize
+     The size, in bytes, of the chunk which will be represented by a
+     single bit.  For RAID456, it is a portion of an individual
+     device. For RAID10, it is a portion of the array.  For RAID1, it
+     is both (they come to the same thing).
+
+  bitmap/time_base
+     The time, in seconds, between looking for bits in the bitmap to
+     be cleared. In the current implementation, a bit will be cleared
+     between 2 and 3 times ``time_base`` after all the covered blocks
+     are known to be in-sync.
+
+  bitmap/backlog
+     When write-mostly devices are active in a RAID1, write requests
+     to those devices proceed in the background - the filesystem (or
+     other user of the device) does not have to wait for them.
+     ``backlog`` sets a limit on the number of concurrent background
+     writes.  If there are more than this, new writes will by
+     synchronous.
+
+  bitmap/metadata
+     This can be either ``internal`` or ``external``.
+
+     ``internal``
+       is the default and means the metadata for the bitmap
+       is stored in the first 256 bytes of the allocated space and is
+       managed by the md module.
+
+     ``external``
+       means that bitmap metadata is managed externally to
+       the kernel (i.e. by some userspace program)
+
+  bitmap/can_clear
+     This is either ``true`` or ``false``.  If ``true``, then bits in the
+     bitmap will be cleared when the corresponding blocks are thought
+     to be in-sync.  If ``false``, bits will never be cleared.
+     This is automatically set to ``false`` if a write happens on a
+     degraded array, or if the array becomes degraded during a write.
+     When metadata is managed externally, it should be set to true
+     once the array becomes non-degraded, and this fact has been
+     recorded in the metadata.
+
+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
+
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+
+     bitmap
+       Resync assisted by a write-intent bitmap.
+
+     journal
+       For raid4/5/6, journal device is used to log transactions and replay
+       after unclean shutdown.
+
+     ppl
+       For raid5 only, Partial Parity Log is used to close the write hole and
+       eliminate resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL.
+
+
+As component devices are added to an md array, they appear in the ``md``
+directory as new directories named::
+
+      dev-XXX
+
+where ``XXX`` is a name that the kernel knows for the device, e.g. hdb1.
+Each directory contains:
+
+      block
+        a symlink to the block device in /sys/block, e.g.::
+
+	     /sys/block/md0/md/dev-hdb1/block -> ../../../../block/hdb/hdb1
+
+      super
+        A file containing an image of the superblock read from, or
+        written to, that device.
+
+      state
+	A file recording the current state of the device in the array
+	which can be a comma separated list of:
+
+	      faulty
+			device has been kicked from active use due to
+			a detected fault, or it has unacknowledged bad
+			blocks
+
+	      in_sync
+			device is a fully in-sync member of the array
+
+	      writemostly
+			device will only be subject to read
+			requests if there are no other options.
+
+			This applies only to raid1 arrays.
+
+	      blocked
+			device has failed, and the failure hasn't been
+			acknowledged yet by the metadata handler.
+
+			Writes that would write to this device if
+			it were not faulty are blocked.
+
+	      spare
+			device is working, but not a full member.
+
+			This includes spares that are in the process
+			of being recovered to
+
+	      write_error
+			device has ever seen a write error.
+
+	      want_replacement
+			device is (mostly) working but probably
+			should be replaced, either due to errors or
+			due to user request.
+
+	      replacement
+			device is a replacement for another active
+			device with same raid_disk.
+
+
+	This list may grow in future.
+
+	This can be written to.
+
+	Writing ``faulty``  simulates a failure on the device.
+
+	Writing ``remove`` removes the device from the array.
+
+	Writing ``writemostly`` sets the writemostly flag.
+
+	Writing ``-writemostly`` clears the writemostly flag.
+
+	Writing ``blocked`` sets the ``blocked`` flag.
+
+	Writing ``-blocked`` clears the ``blocked`` flags and allows writes
+	to complete and possibly simulates an error.
+
+	Writing ``in_sync`` sets the in_sync flag.
+
+	Writing ``write_error`` sets writeerrorseen flag.
+
+	Writing ``-write_error`` clears writeerrorseen flag.
+
+	Writing ``want_replacement`` is allowed at any time except to a
+	replacement device or a spare.  It sets the flag.
+
+	Writing ``-want_replacement`` is allowed at any time.  It clears
+	the flag.
+
+	Writing ``replacement`` or ``-replacement`` is only allowed before
+	starting the array.  It sets or clears the flag.
+
+
+	This file responds to select/poll. Any change to ``faulty``
+	or ``blocked`` causes an event.
+
+      errors
+	An approximate count of read errors that have been detected on
+	this device but have not caused the device to be evicted from
+	the array (either because they were corrected or because they
+	happened while the array was read-only).  When using version-1
+	metadata, this value persists across restarts of the array.
+
+	This value can be written while assembling an array thus
+	providing an ongoing count for arrays with metadata managed by
+	userspace.
+
+      slot
+        This gives the role that the device has in the array.  It will
+	either be ``none`` if the device is not active in the array
+        (i.e. is a spare or has failed) or an integer less than the
+	``raid_disks`` number for the array indicating which position
+	it currently fills.  This can only be set while assembling an
+	array.  A device for which this is set is assumed to be working.
+
+      offset
+        This gives the location in the device (in sectors from the
+        start) where data from the array will be stored.  Any part of
+        the device before this offset is not touched, unless it is
+        used for storing metadata (Formats 1.1 and 1.2).
+
+      size
+        The amount of the device, after the offset, that can be used
+        for storage of data.  This will normally be the same as the
+	component_size.  This can be written while assembling an
+        array.  If a value less than the current component_size is
+        written, it will be rejected.
+
+      recovery_start
+        When the device is not ``in_sync``, this records the number of
+	sectors from the start of the device which are known to be
+	correct.  This is normally zero, but during a recovery
+	operation it will steadily increase, and if the recovery is
+	interrupted, restoring this value can cause recovery to
+	avoid repeating the earlier blocks.  With v1.x metadata, this
+	value is saved and restored automatically.
+
+	This can be set whenever the device is not an active member of
+	the array, either before the array is activated, or before
+	the ``slot`` is set.
+
+	Setting this to ``none`` is equivalent to setting ``in_sync``.
+	Setting to any other value also clears the ``in_sync`` flag.
+
+      bad_blocks
+	This gives the list of all known bad blocks in the form of
+	start address and length (in sectors respectively). If output
+	is too big to fit in a page, it will be truncated. Writing
+	``sector length`` to this file adds new acknowledged (i.e.
+	recorded to disk safely) bad blocks.
+
+      unacknowledged_bad_blocks
+	This gives the list of known-but-not-yet-saved-to-disk bad
+	blocks in the same form of ``bad_blocks``. If output is too big
+	to fit in a page, it will be truncated. Writing to this file
+	adds bad blocks without acknowledging them. This is largely
+	for testing.
+
+      ppl_sector, ppl_size
+        Location and size (in sectors) of the space used for Partial Parity Log
+        on this device.
+
+
+An active md device will also contain an entry for each active device
+in the array.  These are named::
+
+    rdNN
+
+where ``NN`` is the position in the array, starting from 0.
+So for a 3 drive array there will be rd0, rd1, rd2.
+These are symbolic links to the appropriate ``dev-XXX`` entry.
+Thus, for example::
+
+       cat /sys/block/md*/md/rd*/state
+
+will show ``in_sync`` on every line.
+
+
+
+Active md devices for levels that support data redundancy (1,4,5,6,10)
+also have
+
+   sync_action
+     a text file that can be used to monitor and control the rebuild
+     process.  It contains one word which can be one of:
+
+       resync
+		redundancy is being recalculated after unclean
+                shutdown or creation
+
+       recover
+		a hot spare is being built to replace a
+		failed/missing device
+
+       idle
+		nothing is happening
+       check
+		A full check of redundancy was requested and is
+                happening.  This reads all blocks and checks
+                them. A repair may also happen for some raid
+                levels.
+
+       repair
+		A full check and repair is happening.  This is
+		similar to ``resync``, but was requested by the
+                user, and the write-intent bitmap is NOT used to
+		optimise the process.
+
+      This file is writable, and each of the strings that could be
+      read are meaningful for writing.
+
+	``idle`` will stop an active resync/recovery etc.  There is no
+	guarantee that another resync/recovery may not be automatically
+	started again, though some event will be needed to trigger
+	this.
+
+	``resync`` or ``recovery`` can be used to restart the
+        corresponding operation if it was stopped with ``idle``.
+
+	``check`` and ``repair`` will start the appropriate process
+	providing the current state is ``idle``.
+
+      This file responds to select/poll.  Any important change in the value
+      triggers a poll event.  Sometimes the value will briefly be
+      ``recover`` if a recovery seems to be needed, but cannot be
+      achieved. In that case, the transition to ``recover`` isn't
+      notified, but the transition away is.
+
+   degraded
+      This contains a count of the number of devices by which the
+      arrays is degraded.  So an optimal array will show ``0``.  A
+      single failed/missing drive will show ``1``, etc.
+
+      This file responds to select/poll, any increase or decrease
+      in the count of missing devices will trigger an event.
+
+   mismatch_count
+      When performing ``check`` and ``repair``, and possibly when
+      performing ``resync``, md will count the number of errors that are
+      found.  The count in ``mismatch_cnt`` is the number of sectors
+      that were re-written, or (for ``check``) would have been
+      re-written.  As most raid levels work in units of pages rather
+      than sectors, this may be larger than the number of actual errors
+      by a factor of the number of sectors in a page.
+
+   bitmap_set_bits
+      If the array has a write-intent bitmap, then writing to this
+      attribute can set bits in the bitmap, indicating that a resync
+      would need to check the corresponding blocks. Either individual
+      numbers or start-end pairs can be written.  Multiple numbers
+      can be separated by a space.
+
+      Note that the numbers are ``bit`` numbers, not ``block`` numbers.
+      They should be scaled by the bitmap_chunksize.
+
+   sync_speed_min, sync_speed_max
+     This are similar to ``/proc/sys/dev/raid/speed_limit_{min,max}``
+     however they only apply to the particular array.
+
+     If no value has been written to these, or if the word ``system``
+     is written, then the system-wide value is used.  If a value,
+     in kibibytes-per-second is written, then it is used.
+
+     When the files are read, they show the currently active value
+     followed by ``(local)`` or ``(system)`` depending on whether it is
+     a locally set or system-wide value.
+
+   sync_completed
+     This shows the number of sectors that have been completed of
+     whatever the current sync_action is, followed by the number of
+     sectors in total that could need to be processed.  The two
+     numbers are separated by a ``/``  thus effectively showing one
+     value, a fraction of the process that is complete.
+
+     A ``select`` on this attribute will return when resync completes,
+     when it reaches the current sync_max (below) and possibly at
+     other times.
+
+   sync_speed
+     This shows the current actual speed, in K/sec, of the current
+     sync_action.  It is averaged over the last 30 seconds.
+
+   suspend_lo, suspend_hi
+     The two values, given as numbers of sectors, indicate a range
+     within the array where IO will be blocked.  This is currently
+     only supported for raid4/5/6.
+
+   sync_min, sync_max
+     The two values, given as numbers of sectors, indicate a range
+     within the array where ``check``/``repair`` will operate. Must be
+     a multiple of chunk_size. When it reaches ``sync_max`` it will
+     pause, rather than complete.
+     You can use ``select`` or ``poll`` on ``sync_completed`` to wait for
+     that number to reach sync_max.  Then you can either increase
+     ``sync_max``, or can write ``idle`` to ``sync_action``.
+
+     The value of ``max`` for ``sync_max`` effectively disables the limit.
+     When a resync is active, the value can only ever be increased,
+     never decreased.
+     The value of ``0`` is the minimum for ``sync_min``.
+
+
+
+Each active md device may also have attributes specific to the
+personality module that manages it.
+These are specific to the implementation of the module and could
+change substantially if the implementation changes.
+
+These currently include:
+
+  stripe_cache_size  (currently raid5 only)
+      number of entries in the stripe cache.  This is writable, but
+      there are upper and lower limits (32768, 17).  Default is 256.
+
+  strip_cache_active (currently raid5 only)
+      number of active entries in the stripe cache
+
+  preread_bypass_threshold (currently raid5 only)
+      number of times a stripe requiring preread will be bypassed by
+      a stripe that does not require preread.  For fairness defaults
+      to 1.  Setting this to 0 disables bypass accounting and
+      requires preread stripes to wait until all full-width stripe-
+      writes are complete.  Valid values are 0 to stripe_cache_size.
+
+  journal_mode (currently raid5 only)
+      The cache mode for raid5. raid5 could include an extra disk for
+      caching. The mode can be "write-throuth" and "write-back". The
+      default is "write-through".
diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst
new file mode 100644
index 000000000..291699c81
--- /dev/null
+++ b/Documentation/admin-guide/mm/concepts.rst
@@ -0,0 +1,222 @@
+.. _mm_concepts:
+
+=================
+Concepts overview
+=================
+
+The memory management in Linux is complex system that evolved over the
+years and included more and more functionality to support variety of
+systems from MMU-less microcontrollers to supercomputers. The memory
+management for systems without MMU is called ``nommu`` and it
+definitely deserves a dedicated document, which hopefully will be
+eventually written. Yet, although some of the concepts are the same,
+here we assume that MMU is available and CPU can translate a virtual
+address to a physical address.
+
+.. contents:: :local:
+
+Virtual Memory Primer
+=====================
+
+The physical memory in a computer system is a limited resource and
+even for systems that support memory hotplug there is a hard limit on
+the amount of memory that can be installed. The physical memory is not
+necessary contiguous, it might be accessible as a set of distinct
+address ranges. Besides, different CPU architectures, and even
+different implementations of the same architecture have different view
+how these address ranges defined.
+
+All this makes dealing directly with physical memory quite complex and
+to avoid this complexity a concept of virtual memory was developed.
+
+The virtual memory abstracts the details of physical memory from the
+application software, allows to keep only needed information in the
+physical memory (demand paging) and provides a mechanism for the
+protection and controlled sharing of data between processes.
+
+With virtual memory, each and every memory access uses a virtual
+address. When the CPU decodes the an instruction that reads (or
+writes) from (or to) the system memory, it translates the `virtual`
+address encoded in that instruction to a `physical` address that the
+memory controller can understand.
+
+The physical system memory is divided into page frames, or pages. The
+size of each page is architecture specific. Some architectures allow
+selection of the page size from several supported values; this
+selection is performed at the kernel build time by setting an
+appropriate kernel configuration option.
+
+Each physical memory page can be mapped as one or more virtual
+pages. These mappings are described by page tables that allow
+translation from virtual address used by programs to real address in
+the physical memory. The page tables organized hierarchically.
+
+The tables at the lowest level of the hierarchy contain physical
+addresses of actual pages used by the software. The tables at higher
+levels contain physical addresses of the pages belonging to the lower
+levels. The pointer to the top level page table resides in a
+register. When the CPU performs the address translation, it uses this
+register to access the top level page table. The high bits of the
+virtual address are used to index an entry in the top level page
+table. That entry is then used to access the next level in the
+hierarchy with the next bits of the virtual address as the index to
+that level page table. The lowest bits in the virtual address define
+the offset inside the actual page.
+
+Huge Pages
+==========
+
+The address translation requires several memory accesses and memory
+accesses are slow relatively to CPU speed. To avoid spending precious
+processor cycles on the address translation, CPUs maintain a cache of
+such translations called Translation Lookaside Buffer (or
+TLB). Usually TLB is pretty scarce resource and applications with
+large memory working set will experience performance hit because of
+TLB misses.
+
+Many modern CPU architectures allow mapping of the memory pages
+directly by the higher levels in the page table. For instance, on x86,
+it is possible to map 2M and even 1G pages using entries in the second
+and the third level page tables. In Linux such pages are called
+`huge`. Usage of huge pages significantly reduces pressure on TLB,
+improves TLB hit-rate and thus improves overall system performance.
+
+There are two mechanisms in Linux that enable mapping of the physical
+memory with the huge pages. The first one is `HugeTLB filesystem`, or
+hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
+store. For the files created in this filesystem the data resides in
+the memory and mapped using huge pages. The hugetlbfs is described at
+:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
+
+Another, more recent, mechanism that enables use of the huge pages is
+called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
+requires users and/or system administrators to configure what parts of
+the system memory should and can be mapped by the huge pages, THP
+manages such mappings transparently to the user and hence the
+name. See
+:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
+for more details about THP.
+
+Zones
+=====
+
+Often hardware poses restrictions on how different physical memory
+ranges can be accessed. In some cases, devices cannot perform DMA to
+all the addressable memory. In other cases, the size of the physical
+memory exceeds the maximal addressable size of virtual memory and
+special actions are required to access portions of the memory. Linux
+groups memory pages into `zones` according to their possible
+usage. For example, ZONE_DMA will contain memory that can be used by
+devices for DMA, ZONE_HIGHMEM will contain memory that is not
+permanently mapped into kernel's address space and ZONE_NORMAL will
+contain normally addressed pages.
+
+The actual layout of the memory zones is hardware dependent as not all
+architectures define all zones, and requirements for DMA are different
+for different platforms.
+
+Nodes
+=====
+
+Many multi-processor machines are NUMA - Non-Uniform Memory Access -
+systems. In such systems the memory is arranged into banks that have
+different access latency depending on the "distance" from the
+processor. Each bank is referred as `node` and for each node Linux
+constructs an independent memory management subsystem. A node has it's
+own set of zones, lists of free and used pages and various statistics
+counters. You can find more details about NUMA in
+:ref:`Documentation/vm/numa.rst <numa>` and in
+:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
+
+Page cache
+==========
+
+The physical memory is volatile and the common case for getting data
+into the memory is to read it from files. Whenever a file is read, the
+data is put into the `page cache` to avoid expensive disk access on
+the subsequent reads. Similarly, when one writes to a file, the data
+is placed in the page cache and eventually gets into the backing
+storage device. The written pages are marked as `dirty` and when Linux
+decides to reuse them for other purposes, it makes sure to synchronize
+the file contents on the device with the updated data.
+
+Anonymous Memory
+================
+
+The `anonymous memory` or `anonymous mappings` represent memory that
+is not backed by a filesystem. Such mappings are implicitly created
+for program's stack and heap or by explicit calls to mmap(2) system
+call. Usually, the anonymous mappings only define virtual memory areas
+that the program is allowed to access. The read accesses will result
+in creation of a page table entry that references a special physical
+page filled with zeroes. When the program performs a write, regular
+physical page will be allocated to hold the written data. The page
+will be marked dirty and if the kernel will decide to repurpose it,
+the dirty page will be swapped out.
+
+Reclaim
+=======
+
+Throughout the system lifetime, a physical page can be used for storing
+different types of data. It can be kernel internal data structures,
+DMA'able buffers for device drivers use, data read from a filesystem,
+memory allocated by user space processes etc.
+
+Depending on the page usage it is treated differently by the Linux
+memory management. The pages that can be freed at any time, either
+because they cache the data available elsewhere, for instance, on a
+hard disk, or because they can be swapped out, again, to the hard
+disk, are called `reclaimable`. The most notable categories of the
+reclaimable pages are page cache and anonymous memory.
+
+In most cases, the pages holding internal kernel data and used as DMA
+buffers cannot be repurposed, and they remain pinned until freed by
+their user. Such pages are called `unreclaimable`. However, in certain
+circumstances, even pages occupied with kernel data structures can be
+reclaimed. For instance, in-memory caches of filesystem metadata can
+be re-read from the storage device and therefore it is possible to
+discard them from the main memory when system is under memory
+pressure.
+
+The process of freeing the reclaimable physical memory pages and
+repurposing them is called (surprise!) `reclaim`. Linux can reclaim
+pages either asynchronously or synchronously, depending on the state
+of the system. When system is not loaded, most of the memory is free
+and allocation request will be satisfied immediately from the free
+pages supply. As the load increases, the amount of the free pages goes
+down and when it reaches a certain threshold (high watermark), an
+allocation request will awaken the ``kswapd`` daemon. It will
+asynchronously scan memory pages and either just free them if the data
+they contain is available elsewhere, or evict to the backing storage
+device (remember those dirty pages?). As memory usage increases even
+more and reaches another threshold - min watermark - an allocation
+will trigger the `direct reclaim`. In this case allocation is stalled
+until enough memory pages are reclaimed to satisfy the request.
+
+Compaction
+==========
+
+As the system runs, tasks allocate and free the memory and it becomes
+fragmented. Although with virtual memory it is possible to present
+scattered physical pages as virtually contiguous range, sometimes it is
+necessary to allocate large physically contiguous memory areas. Such
+need may arise, for instance, when a device driver requires large
+buffer for DMA, or when THP allocates a huge page. Memory `compaction`
+addresses the fragmentation issue. This mechanism moves occupied pages
+from the lower part of a memory zone to free pages in the upper part
+of the zone. When a compaction scan is finished free pages are grouped
+together at the beginning of the zone and allocations of large
+physically contiguous areas become possible.
+
+Like reclaim, the compaction may happen asynchronously in ``kcompactd``
+daemon or synchronously as a result of memory allocation request.
+
+OOM killer
+==========
+
+It may happen, that on a loaded machine memory will be exhausted. When
+the kernel detects that the system runs out of memory (OOM) it invokes
+`OOM killer`. Its mission is simple: all it has to do is to select a
+task to sacrifice for the sake of the overall system health. The
+selected task is killed in a hope that after it exits enough memory
+will be freed to continue normal operation.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
new file mode 100644
index 000000000..1cc0bc78d
--- /dev/null
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -0,0 +1,382 @@
+.. _hugetlbpage:
+
+=============
+HugeTLB Pages
+=============
+
+Overview
+========
+
+The intent of this file is to give a brief summary of hugetlbpage support in
+the Linux kernel.  This support is built on top of multiple page size support
+that is provided by most modern architectures.  For example, x86 CPUs normally
+support 4K and 2M (1G if architecturally supported) page sizes, ia64
+architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
+256M and ppc64 supports 4K and 16M.  A TLB is a cache of virtual-to-physical
+translations.  Typically this is a very scarce resource on processor.
+Operating systems try to make best use of limited number of TLB resources.
+This optimization is more critical now as bigger and bigger physical memories
+(several GBs) are more readily available.
+
+Users can use the huge page support in Linux kernel by either using the mmap
+system call or standard SYSV shared memory system calls (shmget, shmat).
+
+First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
+(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
+automatically when CONFIG_HUGETLBFS is selected) configuration
+options.
+
+The ``/proc/meminfo`` file provides information about the total number of
+persistent hugetlb pages in the kernel's huge page pool.  It also displays
+default huge page size and information about the number of free, reserved
+and surplus huge pages in the pool of huge pages of default size.
+The huge page size is needed for generating the proper alignment and
+size of the arguments to system calls that map huge page regions.
+
+The output of ``cat /proc/meminfo`` will include lines like::
+
+	HugePages_Total: uuu
+	HugePages_Free:  vvv
+	HugePages_Rsvd:  www
+	HugePages_Surp:  xxx
+	Hugepagesize:    yyy kB
+	Hugetlb:         zzz kB
+
+where:
+
+HugePages_Total
+	is the size of the pool of huge pages.
+HugePages_Free
+	is the number of huge pages in the pool that are not yet
+        allocated.
+HugePages_Rsvd
+	is short for "reserved," and is the number of huge pages for
+        which a commitment to allocate from the pool has been made,
+        but no allocation has yet been made.  Reserved huge pages
+        guarantee that an application will be able to allocate a
+        huge page from the pool of huge pages at fault time.
+HugePages_Surp
+	is short for "surplus," and is the number of huge pages in
+        the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
+        maximum number of surplus huge pages is controlled by
+        ``/proc/sys/vm/nr_overcommit_hugepages``.
+Hugepagesize
+	is the default hugepage size (in Kb).
+Hugetlb
+        is the total amount of memory (in kB), consumed by huge
+        pages of all sizes.
+        If huge pages of different sizes are in use, this number
+        will exceed HugePages_Total \* Hugepagesize. To get more
+        detailed information, please, refer to
+        ``/sys/kernel/mm/hugepages`` (described below).
+
+
+``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
+configured in the kernel.
+
+``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
+pages in the kernel's huge page pool.  "Persistent" huge pages will be
+returned to the huge page pool when freed by a task.  A user with root
+privileges can dynamically allocate more or free some persistent huge pages
+by increasing or decreasing the value of ``nr_hugepages``.
+
+Pages that are used as huge pages are reserved inside the kernel and cannot
+be used for other purposes.  Huge pages cannot be swapped out under
+memory pressure.
+
+Once a number of huge pages have been pre-allocated to the kernel huge page
+pool, a user with appropriate privilege can use either the mmap system call
+or shared memory system calls to use the huge pages.  See the discussion of
+:ref:`Using Huge Pages <using_huge_pages>`, below.
+
+The administrator can allocate persistent huge pages on the kernel boot
+command line by specifying the "hugepages=N" parameter, where 'N' = the
+number of huge pages requested.  This is the most reliable method of
+allocating huge pages as memory has not yet become fragmented.
+
+Some platforms support multiple huge page sizes.  To allocate huge pages
+of a specific size, one must precede the huge pages boot command parameters
+with a huge page size selection parameter "hugepagesz=<size>".  <size> must
+be specified in bytes with optional scale suffix [kKmMgG].  The default huge
+page size may be selected with the "default_hugepagesz=<size>" boot parameter.
+
+When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
+indicates the current number of pre-allocated huge pages of the default size.
+Thus, one can use the following command to dynamically allocate/deallocate
+default sized persistent huge pages::
+
+	echo 20 > /proc/sys/vm/nr_hugepages
+
+This command will try to adjust the number of default sized huge pages in the
+huge page pool to 20, allocating or freeing huge pages, as required.
+
+On a NUMA platform, the kernel will attempt to distribute the huge page pool
+over all the set of allowed nodes specified by the NUMA memory policy of the
+task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
+task has default memory policy--is all on-line nodes with memory.  Allowed
+nodes with insufficient available, contiguous memory for a huge page will be
+silently skipped when allocating persistent huge pages.  See the
+:ref:`discussion below <mem_policy_and_hp_alloc>`
+of the interaction of task memory policy, cpusets and per node attributes
+with the allocation and freeing of persistent huge pages.
+
+The success or failure of huge page allocation depends on the amount of
+physically contiguous memory that is present in system at the time of the
+allocation attempt.  If the kernel is unable to allocate huge pages from
+some nodes in a NUMA system, it will attempt to make up the difference by
+allocating extra pages on other nodes with sufficient available contiguous
+memory, if any.
+
+System administrators may want to put this command in one of the local rc
+init files.  This will enable the kernel to allocate huge pages early in
+the boot process when the possibility of getting physical contiguous pages
+is still very high.  Administrators can verify the number of huge pages
+actually allocated by checking the sysctl or meminfo.  To check the per node
+distribution of huge pages in a NUMA system, use::
+
+	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
+
+``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
+huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
+requested by applications.  Writing any non-zero value into this file
+indicates that the hugetlb subsystem is allowed to try to obtain that
+number of "surplus" huge pages from the kernel's normal page pool, when the
+persistent huge page pool is exhausted. As these surplus huge pages become
+unused, they are freed back to the kernel's normal page pool.
+
+When increasing the huge page pool size via ``nr_hugepages``, any existing
+surplus pages will first be promoted to persistent huge pages.  Then, additional
+huge pages will be allocated, if necessary and if possible, to fulfill
+the new persistent huge page pool size.
+
+The administrator may shrink the pool of persistent huge pages for
+the default huge page size by setting the ``nr_hugepages`` sysctl to a
+smaller value.  The kernel will attempt to balance the freeing of huge pages
+across all nodes in the memory policy of the task modifying ``nr_hugepages``.
+Any free huge pages on the selected nodes will be freed back to the kernel's
+normal page pool.
+
+Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
+it becomes less than the number of huge pages in use will convert the balance
+of the in-use huge pages to surplus huge pages.  This will occur even if
+the number of surplus pages would exceed the overcommit value.  As long as
+this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
+increased sufficiently, or the surplus huge pages go out of use and are freed--
+no more surplus huge pages will be allowed to be allocated.
+
+With support for multiple huge page pools at run-time available, much of
+the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
+sysfs.
+The ``/proc`` interfaces discussed above have been retained for backwards
+compatibility. The root huge page control directory in sysfs is::
+
+	/sys/kernel/mm/hugepages
+
+For each huge page size supported by the running kernel, a subdirectory
+will exist, of the form::
+
+	hugepages-${size}kB
+
+Inside each of these directories, the same set of files will exist::
+
+	nr_hugepages
+	nr_hugepages_mempolicy
+	nr_overcommit_hugepages
+	free_hugepages
+	resv_hugepages
+	surplus_hugepages
+
+which function as described above for the default huge page-sized case.
+
+.. _mem_policy_and_hp_alloc:
+
+Interaction of Task Memory Policy with Huge Page Allocation/Freeing
+===================================================================
+
+Whether huge pages are allocated and freed via the ``/proc`` interface or
+the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
+NUMA nodes from which huge pages are allocated or freed are controlled by the
+NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
+sysctl or attribute.  When the ``nr_hugepages`` attribute is used, mempolicy
+is ignored.
+
+The recommended method to allocate or free huge pages to/from the kernel
+huge page pool, using the ``nr_hugepages`` example above, is::
+
+    numactl --interleave <node-list> echo 20 \
+				>/proc/sys/vm/nr_hugepages_mempolicy
+
+or, more succinctly::
+
+    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
+
+This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
+specified in <node-list>, depending on whether number of persistent huge pages
+is initially less than or greater than 20, respectively.  No huge pages will be
+allocated nor freed on any node not included in the specified <node-list>.
+
+When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
+memory policy mode--bind, preferred, local or interleave--may be used.  The
+resulting effect on persistent huge page allocation is as follows:
+
+#. Regardless of mempolicy mode [see
+   :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
+   persistent huge pages will be distributed across the node or nodes
+   specified in the mempolicy as if "interleave" had been specified.
+   However, if a node in the policy does not contain sufficient contiguous
+   memory for a huge page, the allocation will not "fallback" to the nearest
+   neighbor node with sufficient contiguous memory.  To do this would cause
+   undesirable imbalance in the distribution of the huge page pool, or
+   possibly, allocation of persistent huge pages on nodes not allowed by
+   the task's memory policy.
+
+#. One or more nodes may be specified with the bind or interleave policy.
+   If more than one node is specified with the preferred policy, only the
+   lowest numeric id will be used.  Local policy will select the node where
+   the task is running at the time the nodes_allowed mask is constructed.
+   For local policy to be deterministic, the task must be bound to a cpu or
+   cpus in a single node.  Otherwise, the task could be migrated to some
+   other node at any time after launch and the resulting node will be
+   indeterminate.  Thus, local policy is not very useful for this purpose.
+   Any of the other mempolicy modes may be used to specify a single node.
+
+#. The nodes allowed mask will be derived from any non-default task mempolicy,
+   whether this policy was set explicitly by the task itself or one of its
+   ancestors, such as numactl.  This means that if the task is invoked from a
+   shell with non-default policy, that policy will be used.  One can specify a
+   node list of "all" with numactl --interleave or --membind [-m] to achieve
+   interleaving over all nodes in the system or cpuset.
+
+#. Any task mempolicy specified--e.g., using numactl--will be constrained by
+   the resource limits of any cpuset in which the task runs.  Thus, there will
+   be no way for a task with non-default policy running in a cpuset with a
+   subset of the system nodes to allocate huge pages outside the cpuset
+   without first moving to a cpuset that contains all of the desired nodes.
+
+#. Boot-time huge page allocation attempts to distribute the requested number
+   of huge pages over all on-lines nodes with memory.
+
+Per Node Hugepages Attributes
+=============================
+
+A subset of the contents of the root huge page control directory in sysfs,
+described above, will be replicated under each the system device of each
+NUMA node with memory in::
+
+	/sys/devices/system/node/node[0-9]*/hugepages/
+
+Under this directory, the subdirectory for each supported huge page size
+contains the following attribute files::
+
+	nr_hugepages
+	free_hugepages
+	surplus_hugepages
+
+The free\_' and surplus\_' attribute files are read-only.  They return the number
+of free and surplus [overcommitted] huge pages, respectively, on the parent
+node.
+
+The ``nr_hugepages`` attribute returns the total number of huge pages on the
+specified node.  When this attribute is written, the number of persistent huge
+pages on the parent node will be adjusted to the specified value, if sufficient
+resources exist, regardless of the task's mempolicy or cpuset constraints.
+
+Note that the number of overcommit and reserve pages remain global quantities,
+as we don't know until fault time, when the faulting task's mempolicy is
+applied, from which node the huge page allocation will be attempted.
+
+.. _using_huge_pages:
+
+Using Huge Pages
+================
+
+If the user applications are going to request huge pages using mmap system
+call, then it is required that system administrator mount a file system of
+type hugetlbfs::
+
+  mount -t hugetlbfs \
+	-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
+	min_size=<value>,nr_inodes=<value> none /mnt/huge
+
+This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
+``/mnt/huge``.  Any file created on ``/mnt/huge`` uses huge pages.
+
+The ``uid`` and ``gid`` options sets the owner and group of the root of the
+file system.  By default the ``uid`` and ``gid`` of the current process
+are taken.
+
+The ``mode`` option sets the mode of root of file system to value & 01777.
+This value is given in octal. By default the value 0755 is picked.
+
+If the platform supports multiple huge page sizes, the ``pagesize`` option can
+be used to specify the huge page size and associated pool. ``pagesize``
+is specified in bytes. If ``pagesize`` is not specified the platform's
+default huge page size and associated pool will be used.
+
+The ``size`` option sets the maximum value of memory (huge pages) allowed
+for that filesystem (``/mnt/huge``). The ``size`` option can be specified
+in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
+The size is rounded down to HPAGE_SIZE boundary.
+
+The ``min_size`` option sets the minimum value of memory (huge pages) allowed
+for the filesystem. ``min_size`` can be specified in the same way as ``size``,
+either bytes or a percentage of the huge page pool.
+At mount time, the number of huge pages specified by ``min_size`` are reserved
+for use by the filesystem.
+If there are not enough free huge pages available, the mount will fail.
+As huge pages are allocated to the filesystem and freed, the reserve count
+is adjusted so that the sum of allocated and reserved huge pages is always
+at least ``min_size``.
+
+The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
+can use.
+
+If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
+command line then no limits are set.
+
+For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
+use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
+For example, size=2K has the same meaning as size=2048.
+
+While read system calls are supported on files that reside on hugetlb
+file systems, write system calls are not.
+
+Regular chown, chgrp, and chmod commands (with right permissions) could be
+used to change the file attributes on hugetlbfs.
+
+Also, it is important to note that no such mount command is required if
+applications are going to use only shmat/shmget system calls or mmap with
+MAP_HUGETLB.  For an example of how to use mmap with MAP_HUGETLB see
+:ref:`map_hugetlb <map_hugetlb>` below.
+
+Users who wish to use hugetlb memory via shared memory segment should be
+members of a supplementary group and system admin needs to configure that gid
+into ``/proc/sys/vm/hugetlb_shm_group``.  It is possible for same or different
+applications to use any combination of mmaps and shm* calls, though the mount of
+filesystem will be required for using mmap calls without MAP_HUGETLB.
+
+Syscalls that operate on memory backed by hugetlb pages only have their lengths
+aligned to the native page size of the processor; they will normally fail with
+errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
+not hugepage aligned.  For example, munmap(2) will fail if memory is backed by
+a hugetlb page and the length is smaller than the hugepage size.
+
+
+Examples
+========
+
+.. _map_hugetlb:
+
+``map_hugetlb``
+	see tools/testing/selftests/vm/map_hugetlb.c
+
+``hugepage-shm``
+	see tools/testing/selftests/vm/hugepage-shm.c
+
+``hugepage-mmap``
+	see tools/testing/selftests/vm/hugepage-mmap.c
+
+The `libhugetlbfs`_  library provides a wide range of userspace tools
+to help with huge page usability, environment setup, and control.
+
+.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
new file mode 100644
index 000000000..df9394fb3
--- /dev/null
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -0,0 +1,121 @@
+.. _idle_page_tracking:
+
+==================
+Idle Page Tracking
+==================
+
+Motivation
+==========
+
+The idle page tracking feature allows to track which memory pages are being
+accessed by a workload and which are idle. This information can be useful for
+estimating the workload's working set size, which, in turn, can be taken into
+account when configuring the workload parameters, setting memory cgroup limits,
+or deciding where to place the workload within a compute cluster.
+
+It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
+
+.. _user_api:
+
+User API
+========
+
+The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
+Currently, it consists of the only read-write file,
+``/sys/kernel/mm/page_idle/bitmap``.
+
+The file implements a bitmap where each bit corresponds to a memory page. The
+bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
+mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
+set, the corresponding page is idle.
+
+A page is considered idle if it has not been accessed since it was marked idle
+(for more details on what "accessed" actually means see the :ref:`Implementation
+Details <impl_details>` section).
+To mark a page idle one has to set the bit corresponding to
+the page by writing to the file. A value written to the file is OR-ed with the
+current bitmap value.
+
+Only accesses to user memory pages are tracked. These are pages mapped to a
+process address space, page cache and buffer pages, swap cache pages. For other
+page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
+and hence such pages are never reported idle.
+
+For huge pages the idle flag is set only on the head page, so one has to read
+``/proc/kpageflags`` in order to correctly count idle huge pages.
+
+Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
+-EINVAL if you are not starting the read/write on an 8-byte boundary, or
+if the size of the read/write is not a multiple of 8 bytes. Writing to
+this file beyond max PFN will return -ENXIO.
+
+That said, in order to estimate the amount of pages that are not used by a
+workload one should:
+
+ 1. Mark all the workload's pages as idle by setting corresponding bits in
+    ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
+    ``/proc/pid/pagemap`` if the workload is represented by a process, or by
+    filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
+    is placed in a memory cgroup.
+
+ 2. Wait until the workload accesses its working set.
+
+ 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
+    If one wants to ignore certain types of pages, e.g. mlocked pages since they
+    are not reclaimable, he or she can filter them out using
+    ``/proc/kpageflags``.
+
+The page-types tool in the tools/vm directory can be used to assist in this.
+If the tool is run initially with the appropriate option, it will mark all the
+queried pages as idle.  Subsequent runs of the tool can then show which pages have
+their idle flag cleared in the interim.
+
+See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
+information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
+``/proc/kpagecgroup``.
+
+.. _impl_details:
+
+Implementation Details
+======================
+
+The kernel internally keeps track of accesses to user memory pages in order to
+reclaim unreferenced pages first on memory shortage conditions. A page is
+considered referenced if it has been recently accessed via a process address
+space, in which case one or more PTEs it is mapped to will have the Accessed bit
+set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The
+latter happens when:
+
+ - a userspace process reads or writes a page using a system call (e.g. read(2)
+   or write(2))
+
+ - a page that is used for storing filesystem buffers is read or written,
+   because a process needs filesystem metadata stored in it (e.g. lists a
+   directory tree)
+
+ - a page is accessed by a device driver using get_user_pages()
+
+When a dirty page is written to swap or disk as a result of memory reclaim or
+exceeding the dirty memory limit, it is not marked referenced.
+
+The idle memory tracking feature adds a new page flag, the Idle flag. This flag
+is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
+:ref:`User API <user_api>`
+section), and cleared automatically whenever a page is referenced as defined
+above.
+
+When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
+mapped to, otherwise we will not be able to detect accesses to the page coming
+from a process address space. To avoid interference with the reclaimer, which,
+as noted above, uses the Accessed bit to promote actively referenced pages, one
+more page flag is introduced, the Young flag. When the PTE Accessed bit is
+cleared as a result of setting or updating a page's Idle flag, the Young flag
+is set on the page. The reclaimer treats the Young flag as an extra PTE
+Accessed bit and therefore will consider such a page as referenced.
+
+Since the idle memory tracking feature is based on the memory reclaimer logic,
+it only works with pages that are on an LRU list, other pages are silently
+ignored. That means it will ignore a user memory page if it is isolated, but
+since there are usually not many of them, it should not affect the overall
+result noticeably. In order not to stall scanning of the idle page bitmap,
+locked pages may be skipped too.
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
new file mode 100644
index 000000000..ceead68c2
--- /dev/null
+++ b/Documentation/admin-guide/mm/index.rst
@@ -0,0 +1,36 @@
+=================
+Memory Management
+=================
+
+Linux memory management subsystem is responsible, as the name implies,
+for managing the memory in the system. This includes implemnetation of
+virtual memory and demand paging, memory allocation both for kernel
+internal structures and user space programms, mapping of files into
+processes address space and many other cool things.
+
+Linux memory management is a complex system with many configurable
+settings. Most of these settings are available via ``/proc``
+filesystem and can be quired and adjusted using ``sysctl``. These APIs
+are described in Documentation/sysctl/vm.txt and in `man 5 proc`_.
+
+.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
+
+Linux memory management has its own jargon and if you are not yet
+familiar with it, consider reading
+:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
+
+Here we document in detail how to interact with various mechanisms in
+the Linux memory management.
+
+.. toctree::
+   :maxdepth: 1
+
+   concepts
+   hugetlbpage
+   idle_page_tracking
+   ksm
+   numa_memory_policy
+   pagemap
+   soft-dirty
+   transhuge
+   userfaultfd
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
new file mode 100644
index 000000000..930378663
--- /dev/null
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -0,0 +1,189 @@
+.. _admin_guide_ksm:
+
+=======================
+Kernel Samepage Merging
+=======================
+
+Overview
+========
+
+KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
+added to the Linux kernel in 2.6.32.  See ``mm/ksm.c`` for its implementation,
+and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
+
+KSM was originally developed for use with KVM (where it was known as
+Kernel Shared Memory), to fit more virtual machines into physical memory,
+by sharing the data common between them.  But it can be useful to any
+application which generates many instances of the same data.
+
+The KSM daemon ksmd periodically scans those areas of user memory
+which have been registered with it, looking for pages of identical
+content which can be replaced by a single write-protected page (which
+is automatically copied if a process later wants to update its
+content). The amount of pages that KSM daemon scans in a single pass
+and the time between the passes are configured using :ref:`sysfs
+intraface <ksm_sysfs>`
+
+KSM only merges anonymous (private) pages, never pagecache (file) pages.
+KSM's merged pages were originally locked into kernel memory, but can now
+be swapped out just like other user pages (but sharing is broken when they
+are swapped back in: ksmd must rediscover their identity and merge again).
+
+Controlling KSM with madvise
+============================
+
+KSM only operates on those areas of address space which an application
+has advised to be likely candidates for merging, by using the madvise(2)
+system call::
+
+	int madvise(addr, length, MADV_MERGEABLE)
+
+The app may call
+
+::
+
+	int madvise(addr, length, MADV_UNMERGEABLE)
+
+to cancel that advice and restore unshared pages: whereupon KSM
+unmerges whatever it merged in that range.  Note: this unmerging call
+may suddenly require more memory than is available - possibly failing
+with EAGAIN, but more probably arousing the Out-Of-Memory killer.
+
+If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
+and MADV_UNMERGEABLE simply fail with EINVAL.  If the running kernel was
+built with CONFIG_KSM=y, those calls will normally succeed: even if the
+the KSM daemon is not currently running, MADV_MERGEABLE still registers
+the range for whenever the KSM daemon is started; even if the range
+cannot contain any pages which KSM could actually merge; even if
+MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
+
+If a region of memory must be split into at least one new MADV_MERGEABLE
+or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
+will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).
+
+Like other madvise calls, they are intended for use on mapped areas of
+the user address space: they will report ENOMEM if the specified range
+includes unmapped gaps (though working on the intervening mapped areas),
+and might fail with EAGAIN if not enough memory for internal structures.
+
+Applications should be considerate in their use of MADV_MERGEABLE,
+restricting its use to areas likely to benefit.  KSM's scans may use a lot
+of processing power: some installations will disable KSM for that reason.
+
+.. _ksm_sysfs:
+
+KSM daemon sysfs interface
+==========================
+
+The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
+readable by all but writable only by root:
+
+pages_to_scan
+        how many pages to scan before ksmd goes to sleep
+        e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
+
+        Default: 100 (chosen for demonstration purposes)
+
+sleep_millisecs
+        how many milliseconds ksmd should sleep before next scan
+        e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``
+
+        Default: 20 (chosen for demonstration purposes)
+
+merge_across_nodes
+        specifies if pages from different NUMA nodes can be merged.
+        When set to 0, ksm merges only pages which physically reside
+        in the memory area of same NUMA node. That brings lower
+        latency to access of shared pages. Systems with more nodes, at
+        significant NUMA distances, are likely to benefit from the
+        lower latency of setting 0. Smaller systems, which need to
+        minimize memory usage, are likely to benefit from the greater
+        sharing of setting 1 (default). You may wish to compare how
+        your system performs under each setting, before deciding on
+        which to use. ``merge_across_nodes`` setting can be changed only
+        when there are no ksm shared pages in the system: set run 2 to
+        unmerge pages first, then to 1 after changing
+        ``merge_across_nodes``, to remerge according to the new setting.
+
+        Default: 1 (merging across nodes as in earlier releases)
+
+run
+        * set to 0 to stop ksmd from running but keep merged pages,
+        * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
+        * set to 2 to stop ksmd and unmerge all pages currently merged, but
+	  leave mergeable areas registered for next run.
+
+        Default: 0 (must be changed to 1 to activate KSM, except if
+        CONFIG_SYSFS is disabled)
+
+use_zero_pages
+        specifies whether empty pages (i.e. allocated pages that only
+        contain zeroes) should be treated specially.  When set to 1,
+        empty pages are merged with the kernel zero page(s) instead of
+        with each other as it would happen normally. This can improve
+        the performance on architectures with coloured zero pages,
+        depending on the workload. Care should be taken when enabling
+        this setting, as it can potentially degrade the performance of
+        KSM for some workloads, for example if the checksums of pages
+        candidate for merging match the checksum of an empty
+        page. This setting can be changed at any time, it is only
+        effective for pages merged after the change.
+
+        Default: 0 (normal KSM behaviour as in earlier releases)
+
+max_page_sharing
+        Maximum sharing allowed for each KSM page. This enforces a
+        deduplication limit to avoid high latency for virtual memory
+        operations that involve traversal of the virtual mappings that
+        share the KSM page. The minimum value is 2 as a newly created
+        KSM page will have at least two sharers. The higher this value
+        the faster KSM will merge the memory and the higher the
+        deduplication factor will be, but the slower the worst case
+        virtual mappings traversal could be for any given KSM
+        page. Slowing down this traversal means there will be higher
+        latency for certain virtual memory operations happening during
+        swapping, compaction, NUMA balancing and page migration, in
+        turn decreasing responsiveness for the caller of those virtual
+        memory operations. The scheduler latency of other tasks not
+        involved with the VM operations doing the virtual mappings
+        traversal is not affected by this parameter as these
+        traversals are always schedule friendly themselves.
+
+stable_node_chains_prune_millisecs
+        specifies how frequently KSM checks the metadata of the pages
+        that hit the deduplication limit for stale information.
+        Smaller milllisecs values will free up the KSM metadata with
+        lower latency, but they will make ksmd use more CPU during the
+        scan. It's a noop if not a single KSM page hit the
+        ``max_page_sharing`` yet.
+
+The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
+
+pages_shared
+        how many shared pages are being used
+pages_sharing
+        how many more sites are sharing them i.e. how much saved
+pages_unshared
+        how many pages unique but repeatedly checked for merging
+pages_volatile
+        how many pages changing too fast to be placed in a tree
+full_scans
+        how many times all mergeable areas have been scanned
+stable_node_chains
+        the number of KSM pages that hit the ``max_page_sharing`` limit
+stable_node_dups
+        number of duplicated KSM pages
+
+A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
+sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
+indicates wasted effort.  ``pages_volatile`` embraces several
+different kinds of activity, but a high proportion there would also
+indicate poor use of madvise MADV_MERGEABLE.
+
+The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
+``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
+be increased accordingly.
+
+--
+Izik Eidus,
+Hugh Dickins, 17 Nov 2009
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
new file mode 100644
index 000000000..d78c5b315
--- /dev/null
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -0,0 +1,495 @@
+.. _numa_memory_policy:
+
+==================
+NUMA Memory Policy
+==================
+
+What is NUMA Memory Policy?
+============================
+
+In the Linux kernel, "memory policy" determines from which node the kernel will
+allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
+supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
+The current memory policy support was added to Linux 2.6 around May 2004.  This
+document attempts to describe the concepts and APIs of the 2.6 memory policy
+support.
+
+Memory policies should not be confused with cpusets
+(``Documentation/cgroup-v1/cpusets.txt``)
+which is an administrative mechanism for restricting the nodes from which
+memory may be allocated by a set of processes. Memory policies are a
+programming interface that a NUMA-aware application can take advantage of.  When
+both cpusets and policies are applied to a task, the restrictions of the cpuset
+takes priority.  See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
+below for more details.
+
+Memory Policy Concepts
+======================
+
+Scope of Memory Policies
+------------------------
+
+The Linux kernel supports _scopes_ of memory policy, described here from
+most general to most specific:
+
+System Default Policy
+	this policy is "hard coded" into the kernel.  It is the policy
+	that governs all page allocations that aren't controlled by
+	one of the more specific policy scopes discussed below.  When
+	the system is "up and running", the system default policy will
+	use "local allocation" described below.  However, during boot
+	up, the system default policy will be set to interleave
+	allocations across all nodes with "sufficient" memory, so as
+	not to overload the initial boot node with boot-time
+	allocations.
+
+Task/Process Policy
+	this is an optional, per-task policy.  When defined for a
+	specific task, this policy controls all page allocations made
+	by or on behalf of the task that aren't controlled by a more
+	specific scope. If a task does not define a task policy, then
+	all page allocations that would have been controlled by the
+	task policy "fall back" to the System Default Policy.
+
+	The task policy applies to the entire address space of a task. Thus,
+	it is inheritable, and indeed is inherited, across both fork()
+	[clone() w/o the CLONE_VM flag] and exec*().  This allows a parent task
+	to establish the task policy for a child task exec()'d from an
+	executable image that has no awareness of memory policy.  See the
+	:ref:`Memory Policy APIs <memory_policy_apis>` section,
+	below, for an overview of the system call
+	that a task may use to set/change its task/process policy.
+
+	In a multi-threaded task, task policies apply only to the thread
+	[Linux kernel task] that installs the policy and any threads
+	subsequently created by that thread.  Any sibling threads existing
+	at the time a new task policy is installed retain their current
+	policy.
+
+	A task policy applies only to pages allocated after the policy is
+	installed.  Any pages already faulted in by the task when the task
+	changes its task policy remain where they were allocated based on
+	the policy at the time they were allocated.
+
+.. _vma_policy:
+
+VMA Policy
+	A "VMA" or "Virtual Memory Area" refers to a range of a task's
+	virtual address space.  A task may define a specific policy for a range
+	of its virtual address space.   See the
+	:ref:`Memory Policy APIs <memory_policy_apis>` section,
+	below, for an overview of the mbind() system call used to set a VMA
+	policy.
+
+	A VMA policy will govern the allocation of pages that back
+	this region of the address space.  Any regions of the task's
+	address space that don't have an explicit VMA policy will fall
+	back to the task policy, which may itself fall back to the
+	System Default Policy.
+
+	VMA policies have a few complicating details:
+
+	* VMA policy applies ONLY to anonymous pages.  These include
+	  pages allocated for anonymous segments, such as the task
+	  stack and heap, and any regions of the address space
+	  mmap()ed with the MAP_ANONYMOUS flag.  If a VMA policy is
+	  applied to a file mapping, it will be ignored if the mapping
+	  used the MAP_SHARED flag.  If the file mapping used the
+	  MAP_PRIVATE flag, the VMA policy will only be applied when
+	  an anonymous page is allocated on an attempt to write to the
+	  mapping-- i.e., at Copy-On-Write.
+
+	* VMA policies are shared between all tasks that share a
+	  virtual address space--a.k.a. threads--independent of when
+	  the policy is installed; and they are inherited across
+	  fork().  However, because VMA policies refer to a specific
+	  region of a task's address space, and because the address
+	  space is discarded and recreated on exec*(), VMA policies
+	  are NOT inheritable across exec().  Thus, only NUMA-aware
+	  applications may use VMA policies.
+
+	* A task may install a new VMA policy on a sub-range of a
+	  previously mmap()ed region.  When this happens, Linux splits
+	  the existing virtual memory area into 2 or 3 VMAs, each with
+	  it's own policy.
+
+	* By default, VMA policy applies only to pages allocated after
+	  the policy is installed.  Any pages already faulted into the
+	  VMA range remain where they were allocated based on the
+	  policy at the time they were allocated.  However, since
+	  2.6.16, Linux supports page migration via the mbind() system
+	  call, so that page contents can be moved to match a newly
+	  installed policy.
+
+Shared Policy
+	Conceptually, shared policies apply to "memory objects" mapped
+	shared into one or more tasks' distinct address spaces.  An
+	application installs shared policies the same way as VMA
+	policies--using the mbind() system call specifying a range of
+	virtual addresses that map the shared object.  However, unlike
+	VMA policies, which can be considered to be an attribute of a
+	range of a task's address space, shared policies apply
+	directly to the shared object.  Thus, all tasks that attach to
+	the object share the policy, and all pages allocated for the
+	shared object, by any task, will obey the shared policy.
+
+	As of 2.6.22, only shared memory segments, created by shmget() or
+	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
+	policy support was added to Linux, the associated data structures were
+	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
+	support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
+	shmem segments were never "hooked up" to the shared policy support.
+	Although hugetlbfs segments now support lazy allocation, their support
+	for shared policy has not been completed.
+
+	As mentioned above in :ref:`VMA policies <vma_policy>` section,
+	allocations of page cache pages for regular files mmap()ed
+	with MAP_SHARED ignore any VMA policy installed on the virtual
+	address range backed by the shared file mapping.  Rather,
+	shared page cache pages, including pages backing private
+	mappings that have not yet been written by the task, follow
+	task policy, if any, else System Default Policy.
+
+	The shared policy infrastructure supports different policies on subset
+	ranges of the shared object.  However, Linux still splits the VMA of
+	the task that installs the policy for each range of distinct policy.
+	Thus, different tasks that attach to a shared memory segment can have
+	different VMA configurations mapping that one shared object.  This
+	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
+	a shared memory region, when one task has installed shared policy on
+	one or more ranges of the region.
+
+Components of Memory Policies
+-----------------------------
+
+A NUMA memory policy consists of a "mode", optional mode flags, and
+an optional set of nodes.  The mode determines the behavior of the
+policy, the optional mode flags determine the behavior of the mode,
+and the optional set of nodes can be viewed as the arguments to the
+policy behavior.
+
+Internally, memory policies are implemented by a reference counted
+structure, struct mempolicy.  Details of this structure will be
+discussed in context, below, as required to explain the behavior.
+
+NUMA memory policy supports the following 4 behavioral modes:
+
+Default Mode--MPOL_DEFAULT
+	This mode is only used in the memory policy APIs.  Internally,
+	MPOL_DEFAULT is converted to the NULL memory policy in all
+	policy scopes.  Any existing non-default policy will simply be
+	removed when MPOL_DEFAULT is specified.  As a result,
+	MPOL_DEFAULT means "fall back to the next most specific policy
+	scope."
+
+	For example, a NULL or default task policy will fall back to the
+	system default policy.  A NULL or default vma policy will fall
+	back to the task policy.
+
+	When specified in one of the memory policy APIs, the Default mode
+	does not use the optional set of nodes.
+
+	It is an error for the set of nodes specified for this policy to
+	be non-empty.
+
+MPOL_BIND
+	This mode specifies that memory must come from the set of
+	nodes specified by the policy.  Memory will be allocated from
+	the node in the set with sufficient free memory that is
+	closest to the node where the allocation takes place.
+
+MPOL_PREFERRED
+	This mode specifies that the allocation should be attempted
+	from the single node specified in the policy.  If that
+	allocation fails, the kernel will search other nodes, in order
+	of increasing distance from the preferred node based on
+	information provided by the platform firmware.
+
+	Internally, the Preferred policy uses a single node--the
+	preferred_node member of struct mempolicy.  When the internal
+	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
+	and the policy is interpreted as local allocation.  "Local"
+	allocation policy can be viewed as a Preferred policy that
+	starts at the node containing the cpu where the allocation
+	takes place.
+
+	It is possible for the user to specify that local allocation
+	is always preferred by passing an empty nodemask with this
+	mode.  If an empty nodemask is passed, the policy cannot use
+	the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
+	described below.
+
+MPOL_INTERLEAVED
+	This mode specifies that page allocations be interleaved, on a
+	page granularity, across the nodes specified in the policy.
+	This mode also behaves slightly differently, based on the
+	context where it is used:
+
+	For allocation of anonymous pages and shared memory pages,
+	Interleave mode indexes the set of nodes specified by the
+	policy using the page offset of the faulting address into the
+	segment [VMA] containing the address modulo the number of
+	nodes specified by the policy.  It then attempts to allocate a
+	page, starting at the selected node, as if the node had been
+	specified by a Preferred policy or had been selected by a
+	local allocation.  That is, allocation will follow the per
+	node zonelist.
+
+	For allocation of page cache pages, Interleave mode indexes
+	the set of nodes specified by the policy using a node counter
+	maintained per task.  This counter wraps around to the lowest
+	specified node after it reaches the highest specified node.
+	This will tend to spread the pages out over the nodes
+	specified by the policy based on the order in which they are
+	allocated, rather than based on any page offset into an
+	address range or file.  During system boot up, the temporary
+	interleaved system default policy works in this mode.
+
+NUMA memory policy supports the following optional mode flags:
+
+MPOL_F_STATIC_NODES
+	This flag specifies that the nodemask passed by
+	the user should not be remapped if the task or VMA's set of allowed
+	nodes changes after the memory policy has been defined.
+
+	Without this flag, any time a mempolicy is rebound because of a
+	change in the set of allowed nodes, the node (Preferred) or
+	nodemask (Bind, Interleave) is remapped to the new set of
+	allowed nodes.  This may result in nodes being used that were
+	previously undesired.
+
+	With this flag, if the user-specified nodes overlap with the
+	nodes allowed by the task's cpuset, then the memory policy is
+	applied to their intersection.  If the two sets of nodes do not
+	overlap, the Default policy is used.
+
+	For example, consider a task that is attached to a cpuset with
+	mems 1-3 that sets an Interleave policy over the same set.  If
+	the cpuset's mems change to 3-5, the Interleave will now occur
+	over nodes 3, 4, and 5.  With this flag, however, since only node
+	3 is allowed from the user's nodemask, the "interleave" only
+	occurs over that node.  If no nodes from the user's nodemask are
+	now allowed, the Default behavior is used.
+
+	MPOL_F_STATIC_NODES cannot be combined with the
+	MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
+	MPOL_PREFERRED policies that were created with an empty nodemask
+	(local allocation).
+
+MPOL_F_RELATIVE_NODES
+	This flag specifies that the nodemask passed
+	by the user will be mapped relative to the set of the task or VMA's
+	set of allowed nodes.  The kernel stores the user-passed nodemask,
+	and if the allowed nodes changes, then that original nodemask will
+	be remapped relative to the new set of allowed nodes.
+
+	Without this flag (and without MPOL_F_STATIC_NODES), anytime a
+	mempolicy is rebound because of a change in the set of allowed
+	nodes, the node (Preferred) or nodemask (Bind, Interleave) is
+	remapped to the new set of allowed nodes.  That remap may not
+	preserve the relative nature of the user's passed nodemask to its
+	set of allowed nodes upon successive rebinds: a nodemask of
+	1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
+	allowed nodes is restored to its original state.
+
+	With this flag, the remap is done so that the node numbers from
+	the user's passed nodemask are relative to the set of allowed
+	nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
+	nodemask, the policy will be effected over the first (and in the
+	Bind or Interleave case, the third and fifth) nodes in the set of
+	allowed nodes.  The nodemask passed by the user represents nodes
+	relative to task or VMA's set of allowed nodes.
+
+	If the user's nodemask includes nodes that are outside the range
+	of the new set of allowed nodes (for example, node 5 is set in
+	the user's nodemask when the set of allowed nodes is only 0-3),
+	then the remap wraps around to the beginning of the nodemask and,
+	if not already set, sets the node in the mempolicy nodemask.
+
+	For example, consider a task that is attached to a cpuset with
+	mems 2-5 that sets an Interleave policy over the same set with
+	MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
+	interleave now occurs over nodes 3,5-7.  If the cpuset's mems
+	then change to 0,2-3,5, then the interleave occurs over nodes
+	0,2-3,5.
+
+	Thanks to the consistent remapping, applications preparing
+	nodemasks to specify memory policies using this flag should
+	disregard their current, actual cpuset imposed memory placement
+	and prepare the nodemask as if they were always located on
+	memory nodes 0 to N-1, where N is the number of memory nodes the
+	policy is intended to manage.  Let the kernel then remap to the
+	set of memory nodes allowed by the task's cpuset, as that may
+	change over time.
+
+	MPOL_F_RELATIVE_NODES cannot be combined with the
+	MPOL_F_STATIC_NODES flag.  It also cannot be used for
+	MPOL_PREFERRED policies that were created with an empty nodemask
+	(local allocation).
+
+Memory Policy Reference Counting
+================================
+
+To resolve use/free races, struct mempolicy contains an atomic reference
+count field.  Internal interfaces, mpol_get()/mpol_put() increment and
+decrement this reference count, respectively.  mpol_put() will only free
+the structure back to the mempolicy kmem cache when the reference count
+goes to zero.
+
+When a new memory policy is allocated, its reference count is initialized
+to '1', representing the reference held by the task that is installing the
+new policy.  When a pointer to a memory policy structure is stored in another
+structure, another reference is added, as the task's reference will be dropped
+on completion of the policy installation.
+
+During run-time "usage" of the policy, we attempt to minimize atomic operations
+on the reference count, as this can lead to cache lines bouncing between cpus
+and NUMA nodes.  "Usage" here means one of the following:
+
+1) querying of the policy, either by the task itself [using the get_mempolicy()
+   API discussed below] or by another task using the /proc/<pid>/numa_maps
+   interface.
+
+2) examination of the policy to determine the policy mode and associated node
+   or node lists, if any, for page allocation.  This is considered a "hot
+   path".  Note that for MPOL_BIND, the "usage" extends across the entire
+   allocation process, which may sleep during page reclaimation, because the
+   BIND policy nodemask is used, by reference, to filter ineligible nodes.
+
+We can avoid taking an extra reference during the usages listed above as
+follows:
+
+1) we never need to get/free the system default policy as this is never
+   changed nor freed, once the system is up and running.
+
+2) for querying the policy, we do not need to take an extra reference on the
+   target task's task policy nor vma policies because we always acquire the
+   task's mm's mmap_sem for read during the query.  The set_mempolicy() and
+   mbind() APIs [see below] always acquire the mmap_sem for write when
+   installing or replacing task or vma policies.  Thus, there is no possibility
+   of a task or thread freeing a policy while another task or thread is
+   querying it.
+
+3) Page allocation usage of task or vma policy occurs in the fault path where
+   we hold them mmap_sem for read.  Again, because replacing the task or vma
+   policy requires that the mmap_sem be held for write, the policy can't be
+   freed out from under us while we're using it for page allocation.
+
+4) Shared policies require special consideration.  One task can replace a
+   shared memory policy while another task, with a distinct mmap_sem, is
+   querying or allocating a page based on the policy.  To resolve this
+   potential race, the shared policy infrastructure adds an extra reference
+   to the shared policy during lookup while holding a spin lock on the shared
+   policy management structure.  This requires that we drop this extra
+   reference when we're finished "using" the policy.  We must drop the
+   extra reference on shared policies in the same query/allocation paths
+   used for non-shared policies.  For this reason, shared policies are marked
+   as such, and the extra reference is dropped "conditionally"--i.e., only
+   for shared policies.
+
+   Because of this extra reference counting, and because we must lookup
+   shared policies in a tree structure under spinlock, shared policies are
+   more expensive to use in the page allocation path.  This is especially
+   true for shared policies on shared memory regions shared by tasks running
+   on different NUMA nodes.  This extra overhead can be avoided by always
+   falling back to task or system default policy for shared memory regions,
+   or by prefaulting the entire shared memory region into memory and locking
+   it down.  However, this might not be appropriate for all applications.
+
+.. _memory_policy_apis:
+
+Memory Policy APIs
+==================
+
+Linux supports 3 system calls for controlling memory policy.  These APIS
+always affect only the calling task, the calling task's address space, or
+some shared object mapped into the calling task's address space.
+
+.. note::
+   the headers that define these APIs and the parameter data types for
+   user space applications reside in a package that is not part of the
+   Linux kernel.  The kernel system call interfaces, with the 'sys\_'
+   prefix, are defined in <linux/syscalls.h>; the mode and flag
+   definitions are defined in <linux/mempolicy.h>.
+
+Set [Task] Memory Policy::
+
+	long set_mempolicy(int mode, const unsigned long *nmask,
+					unsigned long maxnode);
+
+Set's the calling task's "task/process memory policy" to mode
+specified by the 'mode' argument and the set of nodes defined by
+'nmask'.  'nmask' points to a bit mask of node ids containing at least
+'maxnode' ids.  Optional mode flags may be passed by combining the
+'mode' argument with the flag (for example: MPOL_INTERLEAVE |
+MPOL_F_STATIC_NODES).
+
+See the set_mempolicy(2) man page for more details
+
+
+Get [Task] Memory Policy or Related Information::
+
+	long get_mempolicy(int *mode,
+			   const unsigned long *nmask, unsigned long maxnode,
+			   void *addr, int flags);
+
+Queries the "task/process memory policy" of the calling task, or the
+policy or location of a specified virtual address, depending on the
+'flags' argument.
+
+See the get_mempolicy(2) man page for more details
+
+
+Install VMA/Shared Policy for a Range of Task's Address Space::
+
+	long mbind(void *start, unsigned long len, int mode,
+		   const unsigned long *nmask, unsigned long maxnode,
+		   unsigned flags);
+
+mbind() installs the policy specified by (mode, nmask, maxnodes) as a
+VMA policy for the range of the calling task's address space specified
+by the 'start' and 'len' arguments.  Additional actions may be
+requested via the 'flags' argument.
+
+See the mbind(2) man page for more details.
+
+Memory Policy Command Line Interface
+====================================
+
+Although not strictly part of the Linux implementation of memory policy,
+a command line tool, numactl(8), exists that allows one to:
+
++ set the task policy for a specified program via set_mempolicy(2), fork(2) and
+  exec(2)
+
++ set the shared policy for a shared memory segment via mbind(2)
+
+The numactl(8) tool is packaged with the run-time version of the library
+containing the memory policy system call wrappers.  Some distributions
+package the headers and compile-time libraries in a separate development
+package.
+
+.. _mem_pol_and_cpusets:
+
+Memory Policies and cpusets
+===========================
+
+Memory policies work within cpusets as described above.  For memory policies
+that require a node or set of nodes, the nodes are restricted to the set of
+nodes whose memories are allowed by the cpuset constraints.  If the nodemask
+specified for the policy contains nodes that are not allowed by the cpuset and
+MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
+specified for the policy and the set of nodes with memory is used.  If the
+result is the empty set, the policy is considered invalid and cannot be
+installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
+onto and folded into the task's set of allowed nodes as previously described.
+
+The interaction of memory policies and cpusets can be problematic when tasks
+in two cpusets share access to a memory region, such as shared memory segments
+created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
+any of the tasks install shared policy on the region, only nodes whose
+memories are allowed in both cpusets may be used in the policies.  Obtaining
+this information requires "stepping outside" the memory policy APIs to use the
+cpuset information and requires that one know in what cpusets other task might
+be attaching to the shared region.  Furthermore, if the cpusets' allowed
+memory sets are disjoint, "local" allocation is the only valid policy.
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
new file mode 100644
index 000000000..3f7bade2c
--- /dev/null
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -0,0 +1,204 @@
+.. _pagemap:
+
+=============================
+Examining Process Page Tables
+=============================
+
+pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
+userspace programs to examine the page tables and related information by
+reading files in ``/proc``.
+
+There are four components to pagemap:
+
+ * ``/proc/pid/pagemap``.  This file lets a userspace process find out which
+   physical frame each virtual page is mapped to.  It contains one 64-bit
+   value for each virtual page, containing the following data (from
+   ``fs/proc/task_mmu.c``, above pagemap_read):
+
+    * Bits 0-54  page frame number (PFN) if present
+    * Bits 0-4   swap type if swapped
+    * Bits 5-54  swap offset if swapped
+    * Bit  55    pte is soft-dirty (see
+      :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
+    * Bit  56    page exclusively mapped (since 4.2)
+    * Bits 57-60 zero
+    * Bit  61    page is file-page or shared-anon (since 3.5)
+    * Bit  62    page swapped
+    * Bit  63    page present
+
+   Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
+   In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
+   4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
+   Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
+
+   If the page is not present but in swap, then the PFN contains an
+   encoding of the swap file number and the page's offset into the
+   swap. Unmapped pages return a null PFN. This allows determining
+   precisely which pages are mapped (or in swap) and comparing mapped
+   pages between processes.
+
+   Efficient users of this interface will use ``/proc/pid/maps`` to
+   determine which areas of memory are actually mapped and llseek to
+   skip over unmapped regions.
+
+ * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
+   times each page is mapped, indexed by PFN.
+
+The page-types tool in the tools/vm directory can be used to query the
+number of times a page is mapped.
+
+ * ``/proc/kpageflags``.  This file contains a 64-bit set of flags for each
+   page, indexed by PFN.
+
+   The flags are (from ``fs/proc/page.c``, above kpageflags_read):
+
+    0. LOCKED
+    1. ERROR
+    2. REFERENCED
+    3. UPTODATE
+    4. DIRTY
+    5. LRU
+    6. ACTIVE
+    7. SLAB
+    8. WRITEBACK
+    9. RECLAIM
+    10. BUDDY
+    11. MMAP
+    12. ANON
+    13. SWAPCACHE
+    14. SWAPBACKED
+    15. COMPOUND_HEAD
+    16. COMPOUND_TAIL
+    17. HUGE
+    18. UNEVICTABLE
+    19. HWPOISON
+    20. NOPAGE
+    21. KSM
+    22. THP
+    23. BALLOON
+    24. ZERO_PAGE
+    25. IDLE
+
+ * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
+   memory cgroup each page is charged to, indexed by PFN. Only available when
+   CONFIG_MEMCG is set.
+
+Short descriptions to the page flags
+====================================
+
+0 - LOCKED
+   page is being locked for exclusive access, e.g. by undergoing read/write IO
+7 - SLAB
+   page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
+   When compound page is used, SLUB/SLQB will only set this flag on the head
+   page; SLOB will not flag it at all.
+10 - BUDDY
+    a free memory block managed by the buddy system allocator
+    The buddy system organizes free memory in blocks of various orders.
+    An order N block has 2^N physically contiguous pages, with the BUDDY flag
+    set for and _only_ for the first page.
+15 - COMPOUND_HEAD
+    A compound page with order N consists of 2^N physically contiguous pages.
+    A compound page with order 2 takes the form of "HTTT", where H donates its
+    head page and T donates its tail page(s).  The major consumers of compound
+    pages are hugeTLB pages
+    (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
+    the SLUB etc.  memory allocators and various device drivers.
+    However in this interface, only huge/giga pages are made visible
+    to end users.
+16 - COMPOUND_TAIL
+    A compound page tail (see description above).
+17 - HUGE
+    this is an integral part of a HugeTLB page
+19 - HWPOISON
+    hardware detected memory corruption on this page: don't touch the data!
+20 - NOPAGE
+    no page frame exists at the requested address
+21 - KSM
+    identical memory pages dynamically shared between one or more processes
+22 - THP
+    contiguous pages which construct transparent hugepages
+23 - BALLOON
+    balloon compaction page
+24 - ZERO_PAGE
+    zero page for pfn_zero or huge_zero page
+25 - IDLE
+    page has not been accessed since it was marked idle (see
+    :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
+    Note that this flag may be stale in case the page was accessed via
+    a PTE. To make sure the flag is up-to-date one has to read
+    ``/sys/kernel/mm/page_idle/bitmap`` first.
+
+IO related page flags
+---------------------
+
+1 - ERROR
+   IO error occurred
+3 - UPTODATE
+   page has up-to-date data
+   ie. for file backed page: (in-memory data revision >= on-disk one)
+4 - DIRTY
+   page has been written to, hence contains new data
+   i.e. for file backed page: (in-memory data revision >  on-disk one)
+8 - WRITEBACK
+   page is being synced to disk
+
+LRU related page flags
+----------------------
+
+5 - LRU
+   page is in one of the LRU lists
+6 - ACTIVE
+   page is in the active LRU list
+18 - UNEVICTABLE
+   page is in the unevictable (non-)LRU list It is somehow pinned and
+   not a candidate for LRU page reclaims, e.g. ramfs pages,
+   shmctl(SHM_LOCK) and mlock() memory segments
+2 - REFERENCED
+   page has been referenced since last LRU list enqueue/requeue
+9 - RECLAIM
+   page will be reclaimed soon after its pageout IO completed
+11 - MMAP
+   a memory mapped page
+12 - ANON
+   a memory mapped page that is not part of a file
+13 - SWAPCACHE
+   page is mapped to swap space, i.e. has an associated swap entry
+14 - SWAPBACKED
+   page is backed by swap/RAM
+
+The page-types tool in the tools/vm directory can be used to query the
+above flags.
+
+Using pagemap to do something useful
+====================================
+
+The general procedure for using pagemap to find out about a process' memory
+usage goes like this:
+
+ 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
+    mapped to what.
+ 2. Select the maps you are interested in -- all of them, or a particular
+    library, or the stack or the heap, etc.
+ 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
+ 4. Read a u64 for each page from pagemap.
+ 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``.  For each PFN you
+    just read, seek to that entry in the file, and read the data you want.
+
+For example, to find the "unique set size" (USS), which is the amount of
+memory that a process is using that is not shared with any other process,
+you can go through every map in the process, find the PFNs, look those up
+in kpagecount, and tally up the number of pages that are only referenced
+once.
+
+Other notes
+===========
+
+Reading from any of the files will return -EINVAL if you are not starting
+the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
+into the file), or if the size of the read is not a multiple of 8 bytes.
+
+Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
+always 12 at most architectures). Since Linux 3.11 their meaning changes
+after first clear of soft-dirty bits. Since Linux 4.2 they are used for
+flags unconditionally.
diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst
new file mode 100644
index 000000000..cb0cfd667
--- /dev/null
+++ b/Documentation/admin-guide/mm/soft-dirty.rst
@@ -0,0 +1,47 @@
+.. _soft_dirty:
+
+===============
+Soft-Dirty PTEs
+===============
+
+The soft-dirty is a bit on a PTE which helps to track which pages a task
+writes to. In order to do this tracking one should
+
+  1. Clear soft-dirty bits from the task's PTEs.
+
+     This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
+     task in question.
+
+  2. Wait some time.
+
+  3. Read soft-dirty bits from the PTEs.
+
+     This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
+     64-bit qword is the soft-dirty one. If set, the respective PTE was
+     written to since step 1.
+
+
+Internally, to do this tracking, the writable bit is cleared from PTEs
+when the soft-dirty bit is cleared. So, after this, when the task tries to
+modify a page at some virtual address the #PF occurs and the kernel sets
+the soft-dirty bit on the respective PTE.
+
+Note, that although all the task's address space is marked as r/o after the
+soft-dirty bits clear, the #PF-s that occur after that are processed fast.
+This is so, since the pages are still mapped to physical memory, and thus all
+the kernel does is finds this fact out and puts both writable and soft-dirty
+bits on the PTE.
+
+While in most cases tracking memory changes by #PF-s is more than enough
+there is still a scenario when we can lose soft dirty bits -- a task
+unmaps a previously mapped memory region and then maps a new one at exactly
+the same place. When unmap is called, the kernel internally clears PTE values
+including soft dirty bits. To notify user space application about such
+memory region renewal the kernel always marks new memory regions (and
+expanded regions) as soft dirty.
+
+This feature is actively used by the checkpoint-restore project. You
+can find more details about it on http://criu.org
+
+
+-- Pavel Emelyanov, Apr 9, 2013
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
new file mode 100644
index 000000000..7ab93a840
--- /dev/null
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -0,0 +1,418 @@
+.. _admin_guide_transhuge:
+
+============================
+Transparent Hugepage Support
+============================
+
+Objective
+=========
+
+Performance critical computing applications dealing with large memory
+working sets are already running on top of libhugetlbfs and in turn
+hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of
+using huge pages for the backing of virtual memory with huge pages
+that supports the automatic promotion and demotion of page sizes and
+without the shortcomings of hugetlbfs.
+
+Currently THP only works for anonymous memory mappings and tmpfs/shmem.
+But in the future it can expand to other filesystems.
+
+.. note::
+   in the examples below we presume that the basic page size is 4K and
+   the huge page size is 2M, although the actual numbers may vary
+   depending on the CPU architecture.
+
+The reason applications are running faster is because of two
+factors. The first factor is almost completely irrelevant and it's not
+of significant interest because it'll also have the downside of
+requiring larger clear-page copy-page in page faults which is a
+potentially negative effect. The first factor consists in taking a
+single page fault for each 2M virtual region touched by userland (so
+reducing the enter/exit kernel frequency by a 512 times factor). This
+only matters the first time the memory is accessed for the lifetime of
+a memory mapping. The second long lasting and much more important
+factor will affect all subsequent accesses to the memory for the whole
+runtime of the application. The second factor consist of two
+components:
+
+1) the TLB miss will run faster (especially with virtualization using
+   nested pagetables but almost always also on bare metal without
+   virtualization)
+
+2) a single TLB entry will be mapping a much larger amount of virtual
+   memory in turn reducing the number of TLB misses. With
+   virtualization and nested pagetables the TLB can be mapped of
+   larger size only if both KVM and the Linux guest are using
+   hugepages but a significant speedup already happens if only one of
+   the two is using hugepages just because of the fact the TLB miss is
+   going to run faster.
+
+THP can be enabled system wide or restricted to certain tasks or even
+memory ranges inside task's address space. Unless THP is completely
+disabled, there is ``khugepaged`` daemon that scans memory and
+collapses sequences of basic pages into huge pages.
+
+The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
+interface and using madivse(2) and prctl(2) system calls.
+
+Transparent Hugepage Support maximizes the usefulness of free memory
+if compared to the reservation approach of hugetlbfs by allowing all
+unused memory to be used as cache or other movable (or even unmovable
+entities). It doesn't require reservation to prevent hugepage
+allocation failures to be noticeable from userland. It allows paging
+and all other advanced VM features to be available on the
+hugepages. It requires no modifications for applications to take
+advantage of it.
+
+Applications however can be further optimized to take advantage of
+this feature, like for example they've been optimized before to avoid
+a flood of mmap system calls for every malloc(4k). Optimizing userland
+is by far not mandatory and khugepaged already can take care of long
+lived page allocations even for hugepage unaware applications that
+deals with large amounts of memory.
+
+In certain cases when hugepages are enabled system wide, application
+may end up allocating more memory resources. An application may mmap a
+large region but only touch 1 byte of it, in that case a 2M page might
+be allocated instead of a 4k page for no good. This is why it's
+possible to disable hugepages system-wide and to only have them inside
+MADV_HUGEPAGE madvise regions.
+
+Embedded systems should enable hugepages only inside madvise regions
+to eliminate any risk of wasting any precious byte of memory and to
+only run faster.
+
+Applications that gets a lot of benefit from hugepages and that don't
+risk to lose memory by using hugepages, should use
+madvise(MADV_HUGEPAGE) on their critical mmapped regions.
+
+.. _thp_sysfs:
+
+sysfs
+=====
+
+Global THP controls
+-------------------
+
+Transparent Hugepage Support for anonymous memory can be entirely disabled
+(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
+regions (to avoid the risk of consuming more memory resources) or enabled
+system wide. This can be achieved with one of::
+
+	echo always >/sys/kernel/mm/transparent_hugepage/enabled
+	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+	echo never >/sys/kernel/mm/transparent_hugepage/enabled
+
+It's also possible to limit defrag efforts in the VM to generate
+anonymous hugepages in case they're not immediately free to madvise
+regions or to never try to defrag memory and simply fallback to regular
+pages unless hugepages are immediately available. Clearly if we spend CPU
+time to defrag memory, we would expect to gain even more by the fact we
+use hugepages later instead of regular pages. This isn't always
+guaranteed, but it may be more likely in case the allocation is for a
+MADV_HUGEPAGE region.
+
+::
+
+	echo always >/sys/kernel/mm/transparent_hugepage/defrag
+	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
+	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
+	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
+	echo never >/sys/kernel/mm/transparent_hugepage/defrag
+
+always
+	means that an application requesting THP will stall on
+	allocation failure and directly reclaim pages and compact
+	memory in an effort to allocate a THP immediately. This may be
+	desirable for virtual machines that benefit heavily from THP
+	use and are willing to delay the VM start to utilise them.
+
+defer
+	means that an application will wake kswapd in the background
+	to reclaim pages and wake kcompactd to compact memory so that
+	THP is available in the near future. It's the responsibility
+	of khugepaged to then install the THP pages later.
+
+defer+madvise
+	will enter direct reclaim and compaction like ``always``, but
+	only for regions that have used madvise(MADV_HUGEPAGE); all
+	other regions will wake kswapd in the background to reclaim
+	pages and wake kcompactd to compact memory so that THP is
+	available in the near future.
+
+madvise
+	will enter direct reclaim like ``always`` but only for regions
+	that are have used madvise(MADV_HUGEPAGE). This is the default
+	behaviour.
+
+never
+	should be self-explanatory.
+
+By default kernel tries to use huge zero page on read page fault to
+anonymous mapping. It's possible to disable huge zero page by writing 0
+or enable it back by writing 1::
+
+	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
+	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
+
+Some userspace (such as a test program, or an optimized memory allocation
+library) may want to know the size (in bytes) of a transparent hugepage::
+
+	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
+
+khugepaged will be automatically started when
+transparent_hugepage/enabled is set to "always" or "madvise, and it'll
+be automatically shutdown if it's set to "never".
+
+Khugepaged controls
+-------------------
+
+khugepaged runs usually at low frequency so while one may not want to
+invoke defrag algorithms synchronously during the page faults, it
+should be worth invoking defrag at least in khugepaged. However it's
+also possible to disable defrag in khugepaged by writing 0 or enable
+defrag in khugepaged by writing 1::
+
+	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+
+You can also control how many pages khugepaged should scan at each
+pass::
+
+	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
+
+and how many milliseconds to wait in khugepaged between each pass (you
+can set this to 0 to run khugepaged at 100% utilization of one core)::
+
+	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
+
+and how many milliseconds to wait in khugepaged if there's an hugepage
+allocation failure to throttle the next allocation attempt::
+
+	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
+
+The khugepaged progress can be seen in the number of pages collapsed::
+
+	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
+
+for each pass::
+
+	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
+
+``max_ptes_none`` specifies how many extra small pages (that are
+not already mapped) can be allocated when collapsing a group
+of small pages into one large page::
+
+	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
+
+A higher value leads to use additional memory for programs.
+A lower value leads to gain less thp performance. Value of
+max_ptes_none can waste cpu time very little, you can
+ignore it.
+
+``max_ptes_swap`` specifies how many pages can be brought in from
+swap when collapsing a group of pages into a transparent huge page::
+
+	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
+
+A higher value can cause excessive swap IO and waste
+memory. A lower value can prevent THPs from being
+collapsed, resulting fewer pages being collapsed into
+THPs, and lower memory access performance.
+
+Boot parameter
+==============
+
+You can change the sysfs boot time defaults of Transparent Hugepage
+Support by passing the parameter ``transparent_hugepage=always`` or
+``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
+to the kernel command line.
+
+Hugepages in tmpfs/shmem
+========================
+
+You can control hugepage allocation policy in tmpfs with mount option
+``huge=``. It can have following values:
+
+always
+    Attempt to allocate huge pages every time we need a new page;
+
+never
+    Do not allocate huge pages;
+
+within_size
+    Only allocate huge page if it will be fully within i_size.
+    Also respect fadvise()/madvise() hints;
+
+advise
+    Only allocate huge pages if requested with fadvise()/madvise();
+
+The default policy is ``never``.
+
+``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
+``huge=never`` will not attempt to break up huge pages at all, just stop more
+from being allocated.
+
+There's also sysfs knob to control hugepage allocation policy for internal
+shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
+is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
+MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+In addition to policies listed above, shmem_enabled allows two further
+values:
+
+deny
+    For use in emergencies, to force the huge option off from
+    all mounts;
+force
+    Force the huge option on for all - very useful for testing;
+
+Need of application restart
+===========================
+
+The transparent_hugepage/enabled values and tmpfs mount option only affect
+future behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to the
+regions registered in khugepaged.
+
+Monitoring usage
+================
+
+The number of anonymous transparent huge pages currently used by the
+system is available by reading the AnonHugePages field in ``/proc/meminfo``.
+To identify what applications are using anonymous transparent huge pages,
+it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
+for each mapping.
+
+The number of file transparent huge pages mapped to userspace is available
+by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
+To identify what applications are mapping file transparent huge pages, it
+is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
+for each mapping.
+
+Note that reading the smaps file is expensive and reading it
+frequently will incur overhead.
+
+There are a number of counters in ``/proc/vmstat`` that may be used to
+monitor how successfully the system is providing huge pages for use.
+
+thp_fault_alloc
+	is incremented every time a huge page is successfully
+	allocated to handle a page fault. This applies to both the
+	first time a page is faulted and for COW faults.
+
+thp_collapse_alloc
+	is incremented by khugepaged when it has found
+	a range of pages to collapse into one huge page and has
+	successfully allocated a new huge page to store the data.
+
+thp_fault_fallback
+	is incremented if a page fault fails to allocate
+	a huge page and instead falls back to using small pages.
+
+thp_collapse_alloc_failed
+	is incremented if khugepaged found a range
+	of pages that should be collapsed into one huge page but failed
+	the allocation.
+
+thp_file_alloc
+	is incremented every time a file huge page is successfully
+	allocated.
+
+thp_file_mapped
+	is incremented every time a file huge page is mapped into
+	user address space.
+
+thp_split_page
+	is incremented every time a huge page is split into base
+	pages. This can happen for a variety of reasons but a common
+	reason is that a huge page is old and is being reclaimed.
+	This action implies splitting all PMD the page mapped with.
+
+thp_split_page_failed
+	is incremented if kernel fails to split huge
+	page. This can happen if the page was pinned by somebody.
+
+thp_deferred_split_page
+	is incremented when a huge page is put onto split
+	queue. This happens when a huge page is partially unmapped and
+	splitting it would free up some memory. Pages on split queue are
+	going to be split under memory pressure.
+
+thp_split_pmd
+	is incremented every time a PMD split into table of PTEs.
+	This can happen, for instance, when application calls mprotect() or
+	munmap() on part of huge page. It doesn't split huge page, only
+	page table entry.
+
+thp_zero_page_alloc
+	is incremented every time a huge zero page is
+	successfully allocated. It includes allocations which where
+	dropped due race with other allocation. Note, it doesn't count
+	every map of the huge zero page, only its allocation.
+
+thp_zero_page_alloc_failed
+	is incremented if kernel fails to allocate
+	huge zero page and falls back to using small pages.
+
+thp_swpout
+	is incremented every time a huge page is swapout in one
+	piece without splitting.
+
+thp_swpout_fallback
+	is incremented if a huge page has to be split before swapout.
+	Usually because failed to allocate some continuous swap space
+	for the huge page.
+
+As the system ages, allocating huge pages may be expensive as the
+system uses memory compaction to copy data around memory to free a
+huge page for use. There are some counters in ``/proc/vmstat`` to help
+monitor this overhead.
+
+compact_stall
+	is incremented every time a process stalls to run
+	memory compaction so that a huge page is free for use.
+
+compact_success
+	is incremented if the system compacted memory and
+	freed a huge page for use.
+
+compact_fail
+	is incremented if the system tries to compact memory
+	but failed.
+
+compact_pages_moved
+	is incremented each time a page is moved. If
+	this value is increasing rapidly, it implies that the system
+	is copying a lot of data to satisfy the huge page allocation.
+	It is possible that the cost of copying exceeds any savings
+	from reduced TLB misses.
+
+compact_pagemigrate_failed
+	is incremented when the underlying mechanism
+	for moving a page failed.
+
+compact_blocks_moved
+	is incremented each time memory compaction examines
+	a huge page aligned range of pages.
+
+It is possible to establish how long the stalls were using the function
+tracer to record how long was spent in __alloc_pages_nodemask and
+using the mm_page_alloc tracepoint to identify which allocations were
+for huge pages.
+
+Optimizing the applications
+===========================
+
+To be guaranteed that the kernel will map a 2M page immediately in any
+memory region, the mmap region has to be hugepage naturally
+aligned. posix_memalign() can provide that guarantee.
+
+Hugetlbfs
+=========
+
+You can use hugetlbfs on a kernel that has transparent hugepage
+support enabled just fine as always. No difference can be noted in
+hugetlbfs other than there will be less overall fragmentation. All
+usual features belonging to hugetlbfs are preserved and
+unaffected. libhugetlbfs will also work fine as usual.
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
new file mode 100644
index 000000000..5048cf661
--- /dev/null
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -0,0 +1,241 @@
+.. _userfaultfd:
+
+===========
+Userfaultfd
+===========
+
+Objective
+=========
+
+Userfaults allow the implementation of on-demand paging from userland
+and more generally they allow userland to take control of various
+memory page faults, something otherwise only the kernel code could do.
+
+For example userfaults allows a proper and more optimal implementation
+of the PROT_NONE+SIGSEGV trick.
+
+Design
+======
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides two primary functionalities:
+
+1) read/POLLIN protocol to notify a userland thread of the faults
+   happening
+
+2) various UFFDIO_* ioctls that can manage the virtual memory regions
+   registered in the userfaultfd that allows userland to efficiently
+   resolve the userfaults it receives via 1) or to manage the virtual
+   memory in the background
+
+The real advantage of userfaults if compared to regular virtual memory
+management of mremap/mprotect is that the userfaults in all their
+operations never involve heavyweight structures like vmas (in fact the
+userfaultfd runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page- (or hugepage) granular fault tracking
+when dealing with virtual address spaces that could span
+Terabytes. Too many vmas would be needed for that.
+
+The userfaultfd once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware about what is going on
+(well of course unless they later try to use the userfaultfd
+themselves on the same region the manager is already tracking, which
+is a corner case that would currently return -EBUSY).
+
+API
+===
+
+When first opened the userfaultfd must be enabled invoking the
+UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
+a later API version) which will specify the read/POLLIN protocol
+userland intends to speak on the UFFD and the uffdio_api.features
+userland requires. The UFFDIO_API ioctl if successful (i.e. if the
+requested uffdio_api.api is spoken also by the running kernel and the
+requested features are going to be enabled) will return into
+uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
+respectively all the available features of the read(2) protocol and
+the generic ioctl available.
+
+The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
+defines what memory types are supported by the userfaultfd and what
+events, except page fault notifications, may be generated.
+
+If the kernel supports registering userfaultfd ranges on hugetlbfs
+virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
+uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
+set if the kernel supports registering userfaultfd ranges on shared
+memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
+MAP_SHARED, memfd_create, etc).
+
+The userland application that wants to use userfaultfd with hugetlbfs
+or shared memory need to set the corresponding flag in
+uffdio_api.features to enable those features.
+
+If the userland desires to receive notifications for events other than
+page faults, it has to verify that uffdio_api.features has appropriate
+UFFD_FEATURE_EVENT_* bits set. These events are described in more
+detail below in "Non-cooperative userfaultfd" section.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to manage the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means a userfault
+could be triggering just before userland maps in the background the
+user-faulted page.
+
+The primary ioctl to resolve userfaults is UFFDIO_COPY. That
+atomically copies a page into the userfault registered range and wakes
+up the blocked userfaults (unless uffdio_copy.mode &
+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
+UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
+half copied page since it'll keep userfaulting until the copy has
+finished.
+
+QEMU/KVM
+========
+
+QEMU/KVM is using the userfaultfd syscall to implement postcopy live
+migration. Postcopy live migration is one form of memory
+externalization consisting of a virtual machine running with part or
+all of its memory residing on a different node in the cloud. The
+userfaultfd abstraction is generic enough that not a single line of
+KVM kernel code had to be modified in order to add postcopy live
+migration to QEMU.
+
+Guest async page faults, FOLL_NOWAIT and all other GUP features work
+just fine in combination with userfaults. Userfaults trigger async
+page faults in the guest scheduler so those guest processes that
+aren't waiting for userfaults (i.e. network bound) can keep running in
+the guest vcpus.
+
+It is generally beneficial to run one pass of precopy live migration
+just before starting postcopy live migration, in order to avoid
+generating userfaults for readonly guest regions.
+
+The implementation of postcopy live migration currently uses one
+single bidirectional socket but in the future two different sockets
+will be used (to reduce the latency of the userfaults to the minimum
+possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
+
+The QEMU in the source node writes all pages that it knows are missing
+in the destination node, into the socket, and the migration thread of
+the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
+ioctls on the userfaultfd in order to map the received pages into the
+guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).
+
+A different postcopy thread in the destination node listens with
+poll() to the userfaultfd in parallel. When a POLLIN event is
+generated after a userfault triggers, the postcopy thread read() from
+the userfaultfd and receives the fault address (or -EAGAIN in case the
+userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run
+by the parallel QEMU migration thread).
+
+After the QEMU postcopy thread (running in the destination node) gets
+the userfault address it writes the information about the missing page
+into the socket. The QEMU source node receives the information and
+roughly "seeks" to that page address and continues sending all
+remaining missing pages from that new page offset. Soon after that
+(just the time to flush the tcp_wmem queue through the network) the
+migration thread in the QEMU running in the destination node will
+receive the page that triggered the userfault and it'll map it as
+usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
+was spontaneously sent by the source or if it was an urgent page
+requested through a userfault).
+
+By the time the userfaults start, the QEMU in the destination node
+doesn't need to keep any per-page state bitmap relative to the live
+migration around and a single per-page bitmap has to be maintained in
+the QEMU running in the source node to know which pages are still
+missing in the destination node. The bitmap in the source node is
+checked to find which missing pages to send in round robin and we seek
+over it when receiving incoming userfaults. After sending each page of
+course the bitmap is updated accordingly. It's also useful to avoid
+sending the same page twice (in case the userfault is read by the
+postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
+thread).
+
+Non-cooperative userfaultfd
+===========================
+
+When the userfaultfd is monitored by an external manager, the manager
+must be able to track changes in the process virtual memory
+layout. Userfaultfd can notify the manager about such changes using
+the same read(2) protocol as for the page fault notifications. The
+manager has to explicitly enable these events by setting appropriate
+bits in uffdio_api.features passed to UFFDIO_API ioctl:
+
+UFFD_FEATURE_EVENT_FORK
+	enable userfaultfd hooks for fork(). When this feature is
+	enabled, the userfaultfd context of the parent process is
+	duplicated into the newly created process. The manager
+	receives UFFD_EVENT_FORK with file descriptor of the new
+	userfaultfd context in the uffd_msg.fork.
+
+UFFD_FEATURE_EVENT_REMAP
+	enable notifications about mremap() calls. When the
+	non-cooperative process moves a virtual memory area to a
+	different location, the manager will receive
+	UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
+	new addresses of the area and its original length.
+
+UFFD_FEATURE_EVENT_REMOVE
+	enable notifications about madvise(MADV_REMOVE) and
+	madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
+	be generated upon these calls to madvise. The uffd_msg.remove
+	will contain start and end addresses of the removed area.
+
+UFFD_FEATURE_EVENT_UNMAP
+	enable notifications about memory unmapping. The manager will
+	get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
+	end addresses of the unmapped area.
+
+Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
+are pretty similar, they quite differ in the action expected from the
+userfaultfd manager. In the former case, the virtual memory is
+removed, but the area is not, the area remains monitored by the
+userfaultfd, and if a page fault occurs in that area it will be
+delivered to the manager. The proper resolution for such page fault is
+to zeromap the faulting address. However, in the latter case, when an
+area is unmapped, either explicitly (with munmap() system call), or
+implicitly (e.g. during mremap()), the area is removed and in turn the
+userfaultfd context for such area disappears too and the manager will
+not get further userland page faults from the removed area. Still, the
+notification is required in order to prevent manager from using
+UFFDIO_COPY on the unmapped area.
+
+Unlike userland page faults which have to be synchronous and require
+explicit or implicit wakeup, all the events are delivered
+asynchronously and the non-cooperative process resumes execution as
+soon as manager executes read(). The userfaultfd manager should
+carefully synchronize calls to UFFDIO_COPY with the events
+processing. To aid the synchronization, the UFFDIO_COPY ioctl will
+return -ENOSPC when the monitored process exits at the time of
+UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
+its virtual memory layout simultaneously with outstanding UFFDIO_COPY
+operation.
+
+The current asynchronous model of the event delivery is optimal for
+single threaded non-cooperative userfaultfd manager implementations. A
+synchronous event delivery model can be added later as a new
+userfaultfd feature to facilitate multithreading enhancements of the
+non cooperative manager, for example to allow UFFDIO_COPY ioctls to
+run in parallel to the event reception. Single threaded
+implementations should continue to use the current async event
+delivery model instead.
diff --git a/Documentation/admin-guide/module-signing.rst b/Documentation/admin-guide/module-signing.rst
new file mode 100644
index 000000000..f8b584179
--- /dev/null
+++ b/Documentation/admin-guide/module-signing.rst
@@ -0,0 +1,285 @@
+Kernel module signing facility
+------------------------------
+
+.. CONTENTS
+..
+.. - Overview.
+.. - Configuring module signing.
+.. - Generating signing keys.
+.. - Public keys in the kernel.
+.. - Manually signing modules.
+.. - Signed modules and stripping.
+.. - Loading signed modules.
+.. - Non-valid signatures and unsigned modules.
+.. - Administering/protecting the private key.
+
+
+========
+Overview
+========
+
+The kernel module signing facility cryptographically signs modules during
+installation and then checks the signature upon loading the module.  This
+allows increased kernel security by disallowing the loading of unsigned modules
+or modules signed with an invalid key.  Module signing increases security by
+making it harder to load a malicious module into the kernel.  The module
+signature checking is done by the kernel so that it is not necessary to have
+trusted userspace bits.
+
+This facility uses X.509 ITU-T standard certificates to encode the public keys
+involved.  The signatures are not themselves encoded in any industrial standard
+type.  The facility currently only supports the RSA public key encryption
+standard (though it is pluggable and permits others to be used).  The possible
+hash algorithms that can be used are SHA-1, SHA-224, SHA-256, SHA-384, and
+SHA-512 (the algorithm is selected by data in the signature).
+
+
+==========================
+Configuring module signing
+==========================
+
+The module signing facility is enabled by going to the
+:menuselection:`Enable Loadable Module Support` section of
+the kernel configuration and turning on::
+
+	CONFIG_MODULE_SIG	"Module signature verification"
+
+This has a number of options available:
+
+ (1) :menuselection:`Require modules to be validly signed`
+     (``CONFIG_MODULE_SIG_FORCE``)
+
+     This specifies how the kernel should deal with a module that has a
+     signature for which the key is not known or a module that is unsigned.
+
+     If this is off (ie. "permissive"), then modules for which the key is not
+     available and modules that are unsigned are permitted, but the kernel will
+     be marked as being tainted, and the concerned modules will be marked as
+     tainted, shown with the character 'E'.
+
+     If this is on (ie. "restrictive"), only modules that have a valid
+     signature that can be verified by a public key in the kernel's possession
+     will be loaded.  All other modules will generate an error.
+
+     Irrespective of the setting here, if the module has a signature block that
+     cannot be parsed, it will be rejected out of hand.
+
+
+ (2) :menuselection:`Automatically sign all modules`
+     (``CONFIG_MODULE_SIG_ALL``)
+
+     If this is on then modules will be automatically signed during the
+     modules_install phase of a build.  If this is off, then the modules must
+     be signed manually using::
+
+	scripts/sign-file
+
+
+ (3) :menuselection:`Which hash algorithm should modules be signed with?`
+
+     This presents a choice of which hash algorithm the installation phase will
+     sign the modules with:
+
+        =============================== ==========================================
+	``CONFIG_MODULE_SIG_SHA1``	:menuselection:`Sign modules with SHA-1`
+	``CONFIG_MODULE_SIG_SHA224``	:menuselection:`Sign modules with SHA-224`
+	``CONFIG_MODULE_SIG_SHA256``	:menuselection:`Sign modules with SHA-256`
+	``CONFIG_MODULE_SIG_SHA384``	:menuselection:`Sign modules with SHA-384`
+	``CONFIG_MODULE_SIG_SHA512``	:menuselection:`Sign modules with SHA-512`
+        =============================== ==========================================
+
+     The algorithm selected here will also be built into the kernel (rather
+     than being a module) so that modules signed with that algorithm can have
+     their signatures checked without causing a dependency loop.
+
+
+ (4) :menuselection:`File name or PKCS#11 URI of module signing key`
+     (``CONFIG_MODULE_SIG_KEY``)
+
+     Setting this option to something other than its default of
+     ``certs/signing_key.pem`` will disable the autogeneration of signing keys
+     and allow the kernel modules to be signed with a key of your choosing.
+     The string provided should identify a file containing both a private key
+     and its corresponding X.509 certificate in PEM form, or — on systems where
+     the OpenSSL ENGINE_pkcs11 is functional — a PKCS#11 URI as defined by
+     RFC7512. In the latter case, the PKCS#11 URI should reference both a
+     certificate and a private key.
+
+     If the PEM file containing the private key is encrypted, or if the
+     PKCS#11 token requries a PIN, this can be provided at build time by
+     means of the ``KBUILD_SIGN_PIN`` variable.
+
+
+ (5) :menuselection:`Additional X.509 keys for default system keyring`
+     (``CONFIG_SYSTEM_TRUSTED_KEYS``)
+
+     This option can be set to the filename of a PEM-encoded file containing
+     additional certificates which will be included in the system keyring by
+     default.
+
+Note that enabling module signing adds a dependency on the OpenSSL devel
+packages to the kernel build processes for the tool that does the signing.
+
+
+=======================
+Generating signing keys
+=======================
+
+Cryptographic keypairs are required to generate and check signatures.  A
+private key is used to generate a signature and the corresponding public key is
+used to check it.  The private key is only needed during the build, after which
+it can be deleted or stored securely.  The public key gets built into the
+kernel so that it can be used to check the signatures as the modules are
+loaded.
+
+Under normal conditions, when ``CONFIG_MODULE_SIG_KEY`` is unchanged from its
+default, the kernel build will automatically generate a new keypair using
+openssl if one does not exist in the file::
+
+	certs/signing_key.pem
+
+during the building of vmlinux (the public part of the key needs to be built
+into vmlinux) using parameters in the::
+
+	certs/x509.genkey
+
+file (which is also generated if it does not already exist).
+
+It is strongly recommended that you provide your own x509.genkey file.
+
+Most notably, in the x509.genkey file, the req_distinguished_name section
+should be altered from the default::
+
+	[ req_distinguished_name ]
+	#O = Unspecified company
+	CN = Build time autogenerated kernel key
+	#emailAddress = unspecified.user@unspecified.company
+
+The generated RSA key size can also be set with::
+
+	[ req ]
+	default_bits = 4096
+
+
+It is also possible to manually generate the key private/public files using the
+x509.genkey key generation configuration file in the root node of the Linux
+kernel sources tree and the openssl command.  The following is an example to
+generate the public/private key files::
+
+	openssl req -new -nodes -utf8 -sha256 -days 36500 -batch -x509 \
+	   -config x509.genkey -outform PEM -out kernel_key.pem \
+	   -keyout kernel_key.pem
+
+The full pathname for the resulting kernel_key.pem file can then be specified
+in the ``CONFIG_MODULE_SIG_KEY`` option, and the certificate and key therein will
+be used instead of an autogenerated keypair.
+
+
+=========================
+Public keys in the kernel
+=========================
+
+The kernel contains a ring of public keys that can be viewed by root.  They're
+in a keyring called ".builtin_trusted_keys" that can be seen by::
+
+	[root@deneb ~]# cat /proc/keys
+	...
+	223c7853 I------     1 perm 1f030000     0     0 keyring   .builtin_trusted_keys: 1
+	302d2d52 I------     1 perm 1f010000     0     0 asymmetri Fedora kernel signing key: d69a84e6bce3d216b979e9505b3e3ef9a7118079: X509.RSA a7118079 []
+	...
+
+Beyond the public key generated specifically for module signing, additional
+trusted certificates can be provided in a PEM-encoded file referenced by the
+``CONFIG_SYSTEM_TRUSTED_KEYS`` configuration option.
+
+Further, the architecture code may take public keys from a hardware store and
+add those in also (e.g. from the UEFI key database).
+
+Finally, it is possible to add additional public keys by doing::
+
+	keyctl padd asymmetric "" [.builtin_trusted_keys-ID] <[key-file]
+
+e.g.::
+
+	keyctl padd asymmetric "" 0x223c7853 <my_public_key.x509
+
+Note, however, that the kernel will only permit keys to be added to
+``.builtin_trusted_keys`` **if** the new key's X.509 wrapper is validly signed by a key
+that is already resident in the ``.builtin_trusted_keys`` at the time the key was added.
+
+
+========================
+Manually signing modules
+========================
+
+To manually sign a module, use the scripts/sign-file tool available in
+the Linux kernel source tree.  The script requires 4 arguments:
+
+	1.  The hash algorithm (e.g., sha256)
+	2.  The private key filename or PKCS#11 URI
+	3.  The public key filename
+	4.  The kernel module to be signed
+
+The following is an example to sign a kernel module::
+
+	scripts/sign-file sha512 kernel-signkey.priv \
+		kernel-signkey.x509 module.ko
+
+The hash algorithm used does not have to match the one configured, but if it
+doesn't, you should make sure that hash algorithm is either built into the
+kernel or can be loaded without requiring itself.
+
+If the private key requires a passphrase or PIN, it can be provided in the
+$KBUILD_SIGN_PIN environment variable.
+
+
+============================
+Signed modules and stripping
+============================
+
+A signed module has a digital signature simply appended at the end.  The string
+``~Module signature appended~.`` at the end of the module's file confirms that a
+signature is present but it does not confirm that the signature is valid!
+
+Signed modules are BRITTLE as the signature is outside of the defined ELF
+container.  Thus they MAY NOT be stripped once the signature is computed and
+attached.  Note the entire module is the signed payload, including any and all
+debug information present at the time of signing.
+
+
+======================
+Loading signed modules
+======================
+
+Modules are loaded with insmod, modprobe, ``init_module()`` or
+``finit_module()``, exactly as for unsigned modules as no processing is
+done in userspace.  The signature checking is all done within the kernel.
+
+
+=========================================
+Non-valid signatures and unsigned modules
+=========================================
+
+If ``CONFIG_MODULE_SIG_FORCE`` is enabled or module.sig_enforce=1 is supplied on
+the kernel command line, the kernel will only load validly signed modules
+for which it has a public key.   Otherwise, it will also load modules that are
+unsigned.   Any module for which the kernel has a key, but which proves to have
+a signature mismatch will not be permitted to load.
+
+Any module that has an unparseable signature will be rejected.
+
+
+=========================================
+Administering/protecting the private key
+=========================================
+
+Since the private key is used to sign modules, viruses and malware could use
+the private key to sign modules and compromise the operating system.  The
+private key must be either destroyed or moved to a secure location and not kept
+in the root node of the kernel source tree.
+
+If you use the same private key to sign modules for multiple kernel
+configurations, you must ensure that the module version information is
+sufficient to prevent loading a module into a different kernel.  Either
+set ``CONFIG_MODVERSIONS=y`` or ensure that each configuration has a different
+kernel release string by changing ``EXTRAVERSION`` or ``CONFIG_LOCALVERSION``.
diff --git a/Documentation/admin-guide/mono.rst b/Documentation/admin-guide/mono.rst
new file mode 100644
index 000000000..59e6d59f0
--- /dev/null
+++ b/Documentation/admin-guide/mono.rst
@@ -0,0 +1,70 @@
+Mono(tm) Binary Kernel Support for Linux
+-----------------------------------------
+
+To configure Linux to automatically execute Mono-based .NET binaries
+(in the form of .exe files) without the need to use the mono CLR
+wrapper, you can use the BINFMT_MISC kernel support.
+
+This will allow you to execute Mono-based .NET binaries just like any
+other program after you have done the following:
+
+1) You MUST FIRST install the Mono CLR support, either by downloading
+   a binary package, a source tarball or by installing from Git. Binary
+   packages for several distributions can be found at:
+
+	http://www.mono-project.com/download/
+
+   Instructions for compiling Mono can be found at:
+
+	http://www.mono-project.com/docs/compiling-mono/linux/
+
+   Once the Mono CLR support has been installed, just check that
+   ``/usr/bin/mono`` (which could be located elsewhere, for example
+   ``/usr/local/bin/mono``) is working.
+
+2) You have to compile BINFMT_MISC either as a module or into
+   the kernel (``CONFIG_BINFMT_MISC``) and set it up properly.
+   If you choose to compile it as a module, you will have
+   to insert it manually with modprobe/insmod, as kmod
+   cannot be easily supported with binfmt_misc.
+   Read the file ``binfmt_misc.txt`` in this directory to know
+   more about the configuration process.
+
+3) Add the following entries to ``/etc/rc.local`` or similar script
+   to be run at system startup:
+
+   .. code-block:: sh
+
+    # Insert BINFMT_MISC module into the kernel
+    if [ ! -e /proc/sys/fs/binfmt_misc/register ]; then
+        /sbin/modprobe binfmt_misc
+	# Some distributions, like Fedora Core, perform
+	# the following command automatically when the
+	# binfmt_misc module is loaded into the kernel
+	# or during normal boot up (systemd-based systems).
+	# Thus, it is possible that the following line
+	# is not needed at all.
+	mount -t binfmt_misc none /proc/sys/fs/binfmt_misc
+    fi
+
+    # Register support for .NET CLR binaries
+    if [ -e /proc/sys/fs/binfmt_misc/register ]; then
+	# Replace /usr/bin/mono with the correct pathname to
+	# the Mono CLR runtime (usually /usr/local/bin/mono
+	# when compiling from sources or CVS).
+        echo ':CLR:M::MZ::/usr/bin/mono:' > /proc/sys/fs/binfmt_misc/register
+    else
+        echo "No binfmt_misc support"
+        exit 1
+    fi
+
+4) Check that ``.exe`` binaries can be ran without the need of a
+   wrapper script, simply by launching the ``.exe`` file directly
+   from a command prompt, for example::
+
+	/usr/bin/xsd.exe
+
+   .. note::
+
+      If this fails with a permission denied error, check
+      that the ``.exe`` file has execute permissions.
diff --git a/Documentation/admin-guide/parport.rst b/Documentation/admin-guide/parport.rst
new file mode 100644
index 000000000..ad3f9b8a1
--- /dev/null
+++ b/Documentation/admin-guide/parport.rst
@@ -0,0 +1,286 @@
+Parport
++++++++
+
+The ``parport`` code provides parallel-port support under Linux.  This
+includes the ability to share one port between multiple device
+drivers.
+
+You can pass parameters to the ``parport`` code to override its automatic
+detection of your hardware.  This is particularly useful if you want
+to use IRQs, since in general these can't be autoprobed successfully.
+By default IRQs are not used even if they **can** be probed.  This is
+because there are a lot of people using the same IRQ for their
+parallel port and a sound card or network card.
+
+The ``parport`` code is split into two parts: generic (which deals with
+port-sharing) and architecture-dependent (which deals with actually
+using the port).
+
+
+Parport as modules
+==================
+
+If you load the `parport`` code as a module, say::
+
+	# insmod parport
+
+to load the generic ``parport`` code.  You then must load the
+architecture-dependent code with (for example)::
+
+	# insmod parport_pc io=0x3bc,0x378,0x278 irq=none,7,auto
+
+to tell the ``parport`` code that you want three PC-style ports, one at
+0x3bc with no IRQ, one at 0x378 using IRQ 7, and one at 0x278 with an
+auto-detected IRQ.  Currently, PC-style (``parport_pc``), Sun ``bpp``,
+Amiga, Atari, and MFC3 hardware is supported.
+
+PCI parallel I/O card support comes from ``parport_pc``.  Base I/O
+addresses should not be specified for supported PCI cards since they
+are automatically detected.
+
+
+modprobe
+--------
+
+If you use modprobe , you will find it useful to add lines as below to a
+configuration file in /etc/modprobe.d/ directory::
+
+	alias parport_lowlevel parport_pc
+	options parport_pc io=0x378,0x278 irq=7,auto
+
+modprobe will load ``parport_pc`` (with the options ``io=0x378,0x278 irq=7,auto``)
+whenever a parallel port device driver (such as ``lp``) is loaded.
+
+Note that these are example lines only!  You shouldn't in general need
+to specify any options to ``parport_pc`` in order to be able to use a
+parallel port.
+
+
+Parport probe [optional]
+------------------------
+
+In 2.2 kernels there was a module called ``parport_probe``, which was used
+for collecting IEEE 1284 device ID information.  This has now been
+enhanced and now lives with the IEEE 1284 support.  When a parallel
+port is detected, the devices that are connected to it are analysed,
+and information is logged like this::
+
+	parport0: Printer, BJC-210 (Canon)
+
+The probe information is available from files in ``/proc/sys/dev/parport/``.
+
+
+Parport linked into the kernel statically
+=========================================
+
+If you compile the ``parport`` code into the kernel, then you can use
+kernel boot parameters to get the same effect.  Add something like the
+following to your LILO command line::
+
+	parport=0x3bc parport=0x378,7 parport=0x278,auto,nofifo
+
+You can have many ``parport=...`` statements, one for each port you want
+to add.  Adding ``parport=0`` to the kernel command-line will disable
+parport support entirely.  Adding ``parport=auto`` to the kernel
+command-line will make ``parport`` use any IRQ lines or DMA channels that
+it auto-detects.
+
+
+Files in /proc
+==============
+
+If you have configured the ``/proc`` filesystem into your kernel, you will
+see a new directory entry: ``/proc/sys/dev/parport``.  In there will be a
+directory entry for each parallel port for which parport is
+configured.  In each of those directories are a collection of files
+describing that parallel port.
+
+The ``/proc/sys/dev/parport`` directory tree looks like::
+
+	parport
+	|-- default
+	|   |-- spintime
+	|   `-- timeslice
+	|-- parport0
+	|   |-- autoprobe
+	|   |-- autoprobe0
+	|   |-- autoprobe1
+	|   |-- autoprobe2
+	|   |-- autoprobe3
+	|   |-- devices
+	|   |   |-- active
+	|   |   `-- lp
+	|   |       `-- timeslice
+	|   |-- base-addr
+	|   |-- irq
+	|   |-- dma
+	|   |-- modes
+	|   `-- spintime
+	`-- parport1
+	|-- autoprobe
+	|-- autoprobe0
+	|-- autoprobe1
+	|-- autoprobe2
+	|-- autoprobe3
+	|-- devices
+	|   |-- active
+	|   `-- ppa
+	|       `-- timeslice
+	|-- base-addr
+	|-- irq
+	|-- dma
+	|-- modes
+	`-- spintime
+
+.. tabularcolumns:: |p{4.0cm}|p{13.5cm}|
+
+=======================	=======================================================
+File			Contents
+=======================	=======================================================
+``devices/active``	A list of the device drivers using that port.  A "+"
+			will appear by the name of the device currently using
+			the port (it might not appear against any).  The
+			string "none" means that there are no device drivers
+			using that port.
+
+``base-addr``		Parallel port's base address, or addresses if the port
+			has more than one in which case they are separated
+			with tabs.  These values might not have any sensible
+			meaning for some ports.
+
+``irq``			Parallel port's IRQ, or -1 if none is being used.
+
+``dma``			Parallel port's DMA channel, or -1 if none is being
+			used.
+
+``modes``		Parallel port's hardware modes, comma-separated,
+			meaning:
+
+			- PCSPP
+				PC-style SPP registers are available.
+
+			- TRISTATE
+				Port is bidirectional.
+
+			- COMPAT
+				Hardware acceleration for printers is
+				available and will be used.
+
+			- EPP
+				Hardware acceleration for EPP protocol
+				is available and will be used.
+
+			- ECP
+				Hardware acceleration for ECP protocol
+				is available and will be used.
+
+			- DMA
+				DMA is available and will be used.
+
+			Note that the current implementation will only take
+			advantage of COMPAT and ECP modes if it has an IRQ
+			line to use.
+
+``autoprobe``		Any IEEE-1284 device ID information that has been
+			acquired from the (non-IEEE 1284.3) device.
+
+``autoprobe[0-3]``	IEEE 1284 device ID information retrieved from
+			daisy-chain devices that conform to IEEE 1284.3.
+
+``spintime``		The number of microseconds to busy-loop while waiting
+			for the peripheral to respond.  You might find that
+			adjusting this improves performance, depending on your
+			peripherals.  This is a port-wide setting, i.e. it
+			applies to all devices on a particular port.
+
+``timeslice``		The number of milliseconds that a device driver is
+			allowed to keep a port claimed for.  This is advisory,
+			and driver can ignore it if it must.
+
+``default/*``		The defaults for spintime and timeslice. When a new
+			port is	registered, it picks up the default spintime.
+			When a new device is registered, it picks up the
+			default timeslice.
+=======================	=======================================================
+
+Device drivers
+==============
+
+Once the parport code is initialised, you can attach device drivers to
+specific ports.  Normally this happens automatically; if the lp driver
+is loaded it will create one lp device for each port found.  You can
+override this, though, by using parameters either when you load the lp
+driver::
+
+	# insmod lp parport=0,2
+
+or on the LILO command line::
+
+	lp=parport0 lp=parport2
+
+Both the above examples would inform lp that you want ``/dev/lp0`` to be
+the first parallel port, and /dev/lp1 to be the **third** parallel port,
+with no lp device associated with the second port (parport1).  Note
+that this is different to the way older kernels worked; there used to
+be a static association between the I/O port address and the device
+name, so ``/dev/lp0`` was always the port at 0x3bc.  This is no longer the
+case - if you only have one port, it will default to being ``/dev/lp0``,
+regardless of base address.
+
+Also:
+
+ * If you selected the IEEE 1284 support at compile time, you can say
+   ``lp=auto`` on the kernel command line, and lp will create devices
+   only for those ports that seem to have printers attached.
+
+ * If you give PLIP the ``timid`` parameter, either with ``plip=timid`` on
+   the command line, or with ``insmod plip timid=1`` when using modules,
+   it will avoid any ports that seem to be in use by other devices.
+
+ * IRQ autoprobing works only for a few port types at the moment.
+
+Reporting printer problems with parport
+=======================================
+
+If you are having problems printing, please go through these steps to
+try to narrow down where the problem area is.
+
+When reporting problems with parport, really you need to give all of
+the messages that ``parport_pc`` spits out when it initialises.  There are
+several code paths:
+
+- polling
+- interrupt-driven, protocol in software
+- interrupt-driven, protocol in hardware using PIO
+- interrupt-driven, protocol in hardware using DMA
+
+The kernel messages that ``parport_pc`` logs give an indication of which
+code path is being used. (They could be a lot better actually..)
+
+For normal printer protocol, having IEEE 1284 modes enabled or not
+should not make a difference.
+
+To turn off the 'protocol in hardware' code paths, disable
+``CONFIG_PARPORT_PC_FIFO``.  Note that when they are enabled they are not
+necessarily **used**; it depends on whether the hardware is available,
+enabled by the BIOS, and detected by the driver.
+
+So, to start with, disable ``CONFIG_PARPORT_PC_FIFO``, and load ``parport_pc``
+with ``irq=none``. See if printing works then.  It really should,
+because this is the simplest code path.
+
+If that works fine, try with ``io=0x378 irq=7`` (adjust for your
+hardware), to make it use interrupt-driven in-software protocol.
+
+If **that** works fine, then one of the hardware modes isn't working
+right.  Enable ``CONFIG_FIFO`` (no, it isn't a module option,
+and yes, it should be), set the port to ECP mode in the BIOS and note
+the DMA channel, and try with::
+
+    io=0x378 irq=7 dma=none (for PIO)
+    io=0x378 irq=7 dma=3 (for DMA)
+
+----------
+
+philb@gnu.org
+tim@cyberelk.net
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
new file mode 100644
index 000000000..47153e64d
--- /dev/null
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -0,0 +1,701 @@
+.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
+.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
+
+=======================
+CPU Performance Scaling
+=======================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+The Concept of CPU Performance Scaling
+======================================
+
+The majority of modern processors are capable of operating in a number of
+different clock frequency and voltage configurations, often referred to as
+Operating Performance Points or P-states (in ACPI terminology).  As a rule,
+the higher the clock frequency and the higher the voltage, the more instructions
+can be retired by the CPU over a unit of time, but also the higher the clock
+frequency and the higher the voltage, the more energy is consumed over a unit of
+time (or the more power is drawn) by the CPU in the given P-state.  Therefore
+there is a natural tradeoff between the CPU capacity (the number of instructions
+that can be executed over a unit of time) and the power drawn by the CPU.
+
+In some situations it is desirable or even necessary to run the program as fast
+as possible and then there is no reason to use any P-states different from the
+highest one (i.e. the highest-performance frequency/voltage configuration
+available).  In some other cases, however, it may not be necessary to execute
+instructions so quickly and maintaining the highest available CPU capacity for a
+relatively long time without utilizing it entirely may be regarded as wasteful.
+It also may not be physically possible to maintain maximum CPU capacity for too
+long for thermal or power supply capacity reasons or similar.  To cover those
+cases, there are hardware interfaces allowing CPUs to be switched between
+different frequency/voltage configurations or (in the ACPI terminology) to be
+put into different P-states.
+
+Typically, they are used along with algorithms to estimate the required CPU
+capacity, so as to decide which P-states to put the CPUs into.  Of course, since
+the utilization of the system generally changes over time, that has to be done
+repeatedly on a regular basis.  The activity by which this happens is referred
+to as CPU performance scaling or CPU frequency scaling (because it involves
+adjusting the CPU clock frequency).
+
+
+CPU Performance Scaling in Linux
+================================
+
+The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
+(CPU Frequency scaling) subsystem that consists of three layers of code: the
+core, scaling governors and scaling drivers.
+
+The ``CPUFreq`` core provides the common code infrastructure and user space
+interfaces for all platforms that support CPU performance scaling.  It defines
+the basic framework in which the other components operate.
+
+Scaling governors implement algorithms to estimate the required CPU capacity.
+As a rule, each governor implements one, possibly parametrized, scaling
+algorithm.
+
+Scaling drivers talk to the hardware.  They provide scaling governors with
+information on the available P-states (or P-state ranges in some cases) and
+access platform-specific hardware interfaces to change CPU P-states as requested
+by scaling governors.
+
+In principle, all available scaling governors can be used with every scaling
+driver.  That design is based on the observation that the information used by
+performance scaling algorithms for P-state selection can be represented in a
+platform-independent form in the majority of cases, so it should be possible
+to use the same performance scaling algorithm implemented in exactly the same
+way regardless of which scaling driver is used.  Consequently, the same set of
+scaling governors should be suitable for every supported platform.
+
+However, that observation may not hold for performance scaling algorithms
+based on information provided by the hardware itself, for example through
+feedback registers, as that information is typically specific to the hardware
+interface it comes from and may not be easily represented in an abstract,
+platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
+to bypass the governor layer and implement their own performance scaling
+algorithms.  That is done by the |intel_pstate| scaling driver.
+
+
+``CPUFreq`` Policy Objects
+==========================
+
+In some cases the hardware interface for P-state control is shared by multiple
+CPUs.  That is, for example, the same register (or set of registers) is used to
+control the P-state of multiple CPUs at the same time and writing to it affects
+all of those CPUs simultaneously.
+
+Sets of CPUs sharing hardware P-state control interfaces are represented by
+``CPUFreq`` as |struct cpufreq_policy| objects.  For consistency,
+|struct cpufreq_policy| is also used when there is only one CPU in the given
+set.
+
+The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
+every CPU in the system, including CPUs that are currently offline.  If multiple
+CPUs share the same hardware P-state control interface, all of the pointers
+corresponding to them point to the same |struct cpufreq_policy| object.
+
+``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
+of its user space interface is based on the policy concept.
+
+
+CPU Initialization
+==================
+
+First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
+It is only possible to register one scaling driver at a time, so the scaling
+driver is expected to be able to handle all CPUs in the system.
+
+The scaling driver may be registered before or after CPU registration.  If
+CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
+take a note of all of the already registered CPUs during the registration of the
+scaling driver.  In turn, if any CPUs are registered after the registration of
+the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
+at their registration time.
+
+In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
+has not seen so far as soon as it is ready to handle that CPU.  [Note that the
+logical CPU may be a physical single-core processor, or a single core in a
+multicore processor, or a hardware thread in a physical processor or processor
+core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
+otherwise and the word "processor" is used to refer to the physical part
+possibly including multiple logical CPUs.]
+
+Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
+for the given CPU and if so, it skips the policy object creation.  Otherwise,
+a new policy object is created and initialized, which involves the creation of
+a new policy directory in ``sysfs``, and the policy pointer corresponding to
+the given CPU is set to the new policy object's address in memory.
+
+Next, the scaling driver's ``->init()`` callback is invoked with the policy
+pointer of the new CPU passed to it as the argument.  That callback is expected
+to initialize the performance scaling hardware interface for the given CPU (or,
+more precisely, for the set of CPUs sharing the hardware interface it belongs
+to, represented by its policy object) and, if the policy object it has been
+called for is new, to set parameters of the policy, like the minimum and maximum
+frequencies supported by the hardware, the table of available frequencies (if
+the set of supported P-states is not a continuous range), and the mask of CPUs
+that belong to the same policy (including both online and offline CPUs).  That
+mask is then used by the core to populate the policy pointers for all of the
+CPUs in it.
+
+The next major initialization step for a new policy object is to attach a
+scaling governor to it (to begin with, that is the default scaling governor
+determined by the kernel configuration, but it may be changed later
+via ``sysfs``).  First, a pointer to the new policy object is passed to the
+governor's ``->init()`` callback which is expected to initialize all of the
+data structures necessary to handle the given policy and, possibly, to add
+a governor ``sysfs`` interface to it.  Next, the governor is started by
+invoking its ``->start()`` callback.
+
+That callback it expected to register per-CPU utilization update callbacks for
+all of the online CPUs belonging to the given policy with the CPU scheduler.
+The utilization update callbacks will be invoked by the CPU scheduler on
+important events, like task enqueue and dequeue, on every iteration of the
+scheduler tick or generally whenever the CPU utilization may change (from the
+scheduler's perspective).  They are expected to carry out computations needed
+to determine the P-state to use for the given policy going forward and to
+invoke the scaling driver to make changes to the hardware in accordance with
+the P-state selection.  The scaling driver may be invoked directly from
+scheduler context or asynchronously, via a kernel thread or workqueue, depending
+on the configuration and capabilities of the scaling driver and the governor.
+
+Similar steps are taken for policy objects that are not new, but were "inactive"
+previously, meaning that all of the CPUs belonging to them were offline.  The
+only practical difference in that case is that the ``CPUFreq`` core will attempt
+to use the scaling governor previously used with the policy that became
+"inactive" (and is re-initialized now) instead of the default governor.
+
+In turn, if a previously offline CPU is being brought back online, but some
+other CPUs sharing the policy object with it are online already, there is no
+need to re-initialize the policy object at all.  In that case, it only is
+necessary to restart the scaling governor so that it can take the new online CPU
+into account.  That is achieved by invoking the governor's ``->stop`` and
+``->start()`` callbacks, in this order, for the entire policy.
+
+As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
+governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
+Consequently, if |intel_pstate| is used, scaling governors are not attached to
+new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
+to register per-CPU utilization update callbacks for each policy.  These
+callbacks are invoked by the CPU scheduler in the same way as for scaling
+governors, but in the |intel_pstate| case they both determine the P-state to
+use and change the hardware configuration accordingly in one go from scheduler
+context.
+
+The policy objects created during CPU initialization and other data structures
+associated with them are torn down when the scaling driver is unregistered
+(which happens when the kernel module containing it is unloaded, for example) or
+when the last CPU belonging to the given policy in unregistered.
+
+
+Policy Interface in ``sysfs``
+=============================
+
+During the initialization of the kernel, the ``CPUFreq`` core creates a
+``sysfs`` directory (kobject) called ``cpufreq`` under
+:file:`/sys/devices/system/cpu/`.
+
+That directory contains a ``policyX`` subdirectory (where ``X`` represents an
+integer number) for every policy object maintained by the ``CPUFreq`` core.
+Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
+under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
+that may be different from the one represented by ``X``) for all of the CPUs
+associated with (or belonging to) the given policy.  The ``policyX`` directories
+in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
+attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
+objects (that is, for all of the CPUs associated with them).
+
+Some of those attributes are generic.  They are created by the ``CPUFreq`` core
+and their behavior generally does not depend on what scaling driver is in use
+and what scaling governor is attached to the given policy.  Some scaling drivers
+also add driver-specific attributes to the policy directories in ``sysfs`` to
+control policy-specific aspects of driver behavior.
+
+The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
+are the following:
+
+``affected_cpus``
+	List of online CPUs belonging to this policy (i.e. sharing the hardware
+	performance scaling interface represented by the ``policyX`` policy
+	object).
+
+``bios_limit``
+	If the platform firmware (BIOS) tells the OS to apply an upper limit to
+	CPU frequencies, that limit will be reported through this attribute (if
+	present).
+
+	The existence of the limit may be a result of some (often unintentional)
+	BIOS settings, restrictions coming from a service processor or another
+	BIOS/HW-based mechanisms.
+
+	This does not cover ACPI thermal limitations which can be discovered
+	through a generic thermal driver.
+
+	This attribute is not present if the scaling driver in use does not
+	support it.
+
+``cpuinfo_cur_freq``
+	Current frequency of the CPUs belonging to this policy as obtained from
+	the hardware (in KHz).
+
+	This is expected to be the frequency the hardware actually runs at.
+	If that frequency cannot be determined, this attribute should not
+	be present.
+
+``cpuinfo_max_freq``
+	Maximum possible operating frequency the CPUs belonging to this policy
+	can run at (in kHz).
+
+``cpuinfo_min_freq``
+	Minimum possible operating frequency the CPUs belonging to this policy
+	can run at (in kHz).
+
+``cpuinfo_transition_latency``
+	The time it takes to switch the CPUs belonging to this policy from one
+	P-state to another, in nanoseconds.
+
+	If unknown or if known to be so high that the scaling driver does not
+	work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
+	will be returned by reads from this attribute.
+
+``related_cpus``
+	List of all (online and offline) CPUs belonging to this policy.
+
+``scaling_available_governors``
+	List of ``CPUFreq`` scaling governors present in the kernel that can
+	be attached to this policy or (if the |intel_pstate| scaling driver is
+	in use) list of scaling algorithms provided by the driver that can be
+	applied to this policy.
+
+	[Note that some governors are modular and it may be necessary to load a
+	kernel module for the governor held by it to become available and be
+	listed by this attribute.]
+
+``scaling_cur_freq``
+	Current frequency of all of the CPUs belonging to this policy (in kHz).
+
+	In the majority of cases, this is the frequency of the last P-state
+	requested by the scaling driver from the hardware using the scaling
+	interface provided by it, which may or may not reflect the frequency
+	the CPU is actually running at (due to hardware design and other
+	limitations).
+
+	Some architectures (e.g. ``x86``) may attempt to provide information
+	more precisely reflecting the current CPU frequency through this
+	attribute, but that still may not be the exact current CPU frequency as
+	seen by the hardware at the moment.
+
+``scaling_driver``
+	The scaling driver currently in use.
+
+``scaling_governor``
+	The scaling governor currently attached to this policy or (if the
+	|intel_pstate| scaling driver is in use) the scaling algorithm
+	provided by the driver that is currently applied to this policy.
+
+	This attribute is read-write and writing to it will cause a new scaling
+	governor to be attached to this policy or a new scaling algorithm
+	provided by the scaling driver to be applied to it (in the
+	|intel_pstate| case), as indicated by the string written to this
+	attribute (which must be one of the names listed by the
+	``scaling_available_governors`` attribute described above).
+
+``scaling_max_freq``
+	Maximum frequency the CPUs belonging to this policy are allowed to be
+	running at (in kHz).
+
+	This attribute is read-write and writing a string representing an
+	integer to it will cause a new limit to be set (it must not be lower
+	than the value of the ``scaling_min_freq`` attribute).
+
+``scaling_min_freq``
+	Minimum frequency the CPUs belonging to this policy are allowed to be
+	running at (in kHz).
+
+	This attribute is read-write and writing a string representing a
+	non-negative integer to it will cause a new limit to be set (it must not
+	be higher than the value of the ``scaling_max_freq`` attribute).
+
+``scaling_setspeed``
+	This attribute is functional only if the `userspace`_ scaling governor
+	is attached to the given policy.
+
+	It returns the last frequency requested by the governor (in kHz) or can
+	be written to in order to set a new frequency for the policy.
+
+
+Generic Scaling Governors
+=========================
+
+``CPUFreq`` provides generic scaling governors that can be used with all
+scaling drivers.  As stated before, each of them implements a single, possibly
+parametrized, performance scaling algorithm.
+
+Scaling governors are attached to policy objects and different policy objects
+can be handled by different scaling governors at the same time (although that
+may lead to suboptimal results in some cases).
+
+The scaling governor for a given policy object can be changed at any time with
+the help of the ``scaling_governor`` policy attribute in ``sysfs``.
+
+Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
+algorithms implemented by them.  Those attributes, referred to as governor
+tunables, can be either global (system-wide) or per-policy, depending on the
+scaling driver in use.  If the driver requires governor tunables to be
+per-policy, they are located in a subdirectory of each policy directory.
+Otherwise, they are located in a subdirectory under
+:file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
+subdirectory containing the governor tunables is the name of the governor
+providing them.
+
+``performance``
+---------------
+
+When attached to a policy object, this governor causes the highest frequency,
+within the ``scaling_max_freq`` policy limit, to be requested for that policy.
+
+The request is made once at that time the governor for the policy is set to
+``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
+policy limits change after that.
+
+``powersave``
+-------------
+
+When attached to a policy object, this governor causes the lowest frequency,
+within the ``scaling_min_freq`` policy limit, to be requested for that policy.
+
+The request is made once at that time the governor for the policy is set to
+``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
+policy limits change after that.
+
+``userspace``
+-------------
+
+This governor does not do anything by itself.  Instead, it allows user space
+to set the CPU frequency for the policy it is attached to by writing to the
+``scaling_setspeed`` attribute of that policy.
+
+``schedutil``
+-------------
+
+This governor uses CPU utilization data available from the CPU scheduler.  It
+generally is regarded as a part of the CPU scheduler, so it can access the
+scheduler's internal data structures directly.
+
+It runs entirely in scheduler context, although in some cases it may need to
+invoke the scaling driver asynchronously when it decides that the CPU frequency
+should be changed for a given policy (that depends on whether or not the driver
+is capable of changing the CPU frequency from scheduler context).
+
+The actions of this governor for a particular CPU depend on the scheduling class
+invoking its utilization update callback for that CPU.  If it is invoked by the
+RT or deadline scheduling classes, the governor will increase the frequency to
+the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
+if it is invoked by the CFS scheduling class, the governor will use the
+Per-Entity Load Tracking (PELT) metric for the root control group of the
+given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
+LWN.net article for a description of the PELT mechanism).  Then, the new
+CPU frequency to apply is computed in accordance with the formula
+
+	f = 1.25 * ``f_0`` * ``util`` / ``max``
+
+where ``util`` is the PELT number, ``max`` is the theoretical maximum of
+``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
+policy (if the PELT number is frequency-invariant), or the current CPU frequency
+(otherwise).
+
+This governor also employs a mechanism allowing it to temporarily bump up the
+CPU frequency for tasks that have been waiting on I/O most recently, called
+"IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
+is passed by the scheduler to the governor callback which causes the frequency
+to go up to the allowed maximum immediately and then draw back to the value
+returned by the above formula over time.
+
+This governor exposes only one tunable:
+
+``rate_limit_us``
+	Minimum time (in microseconds) that has to pass between two consecutive
+	runs of governor computations (default: 1000 times the scaling driver's
+	transition latency).
+
+	The purpose of this tunable is to reduce the scheduler context overhead
+	of the governor which might be excessive without it.
+
+This governor generally is regarded as a replacement for the older `ondemand`_
+and `conservative`_ governors (described below), as it is simpler and more
+tightly integrated with the CPU scheduler, its overhead in terms of CPU context
+switches and similar is less significant, and it uses the scheduler's own CPU
+utilization metric, so in principle its decisions should not contradict the
+decisions made by the other parts of the scheduler.
+
+``ondemand``
+------------
+
+This governor uses CPU load as a CPU frequency selection metric.
+
+In order to estimate the current CPU load, it measures the time elapsed between
+consecutive invocations of its worker routine and computes the fraction of that
+time in which the given CPU was not idle.  The ratio of the non-idle (active)
+time to the total CPU time is taken as an estimate of the load.
+
+If this governor is attached to a policy shared by multiple CPUs, the load is
+estimated for all of them and the greatest result is taken as the load estimate
+for the entire policy.
+
+The worker routine of this governor has to run in process context, so it is
+invoked asynchronously (via a workqueue) and CPU P-states are updated from
+there if necessary.  As a result, the scheduler context overhead from this
+governor is minimum, but it causes additional CPU context switches to happen
+relatively often and the CPU P-state updates triggered by it can be relatively
+irregular.  Also, it affects its own CPU load metric by running code that
+reduces the CPU idle time (even though the CPU idle time is only reduced very
+slightly by it).
+
+It generally selects CPU frequencies proportional to the estimated load, so that
+the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
+1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
+corresponds to the load of 0, unless when the load exceeds a (configurable)
+speedup threshold, in which case it will go straight for the highest frequency
+it is allowed to use (the ``scaling_max_freq`` policy limit).
+
+This governor exposes the following tunables:
+
+``sampling_rate``
+	This is how often the governor's worker routine should run, in
+	microseconds.
+
+	Typically, it is set to values of the order of 10000 (10 ms).  Its
+	default value is equal to the value of ``cpuinfo_transition_latency``
+	for each policy this governor is attached to (but since the unit here
+	is greater by 1000, this means that the time represented by
+	``sampling_rate`` is 1000 times greater than the transition latency by
+	default).
+
+	If this tunable is per-policy, the following shell command sets the time
+	represented by it to be 750 times as high as the transition latency::
+
+	# echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
+
+``up_threshold``
+	If the estimated CPU load is above this value (in percent), the governor
+	will set the frequency to the maximum value allowed for the policy.
+	Otherwise, the selected frequency will be proportional to the estimated
+	CPU load.
+
+``ignore_nice_load``
+	If set to 1 (default 0), it will cause the CPU load estimation code to
+	treat the CPU time spent on executing tasks with "nice" levels greater
+	than 0 as CPU idle time.
+
+	This may be useful if there are tasks in the system that should not be
+	taken into account when deciding what frequency to run the CPUs at.
+	Then, to make that happen it is sufficient to increase the "nice" level
+	of those tasks above 0 and set this attribute to 1.
+
+``sampling_down_factor``
+	Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
+	the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.
+
+	This causes the next execution of the governor's worker routine (after
+	setting the frequency to the allowed maximum) to be delayed, so the
+	frequency stays at the maximum level for a longer time.
+
+	Frequency fluctuations in some bursty workloads may be avoided this way
+	at the cost of additional energy spent on maintaining the maximum CPU
+	capacity.
+
+``powersave_bias``
+	Reduction factor to apply to the original frequency target of the
+	governor (including the maximum value used when the ``up_threshold``
+	value is exceeded by the estimated CPU load) or sensitivity threshold
+	for the AMD frequency sensitivity powersave bias driver
+	(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
+	inclusive.
+
+	If the AMD frequency sensitivity powersave bias driver is not loaded,
+	the effective frequency to apply is given by
+
+		f * (1 - ``powersave_bias`` / 1000)
+
+	where f is the governor's original frequency target.  The default value
+	of this attribute is 0 in that case.
+
+	If the AMD frequency sensitivity powersave bias driver is loaded, the
+	value of this attribute is 400 by default and it is used in a different
+	way.
+
+	On Family 16h (and later) AMD processors there is a mechanism to get a
+	measured workload sensitivity, between 0 and 100% inclusive, from the
+	hardware.  That value can be used to estimate how the performance of the
+	workload running on a CPU will change in response to frequency changes.
+
+	The performance of a workload with the sensitivity of 0 (memory-bound or
+	IO-bound) is not expected to increase at all as a result of increasing
+	the CPU frequency, whereas workloads with the sensitivity of 100%
+	(CPU-bound) are expected to perform much better if the CPU frequency is
+	increased.
+
+	If the workload sensitivity is less than the threshold represented by
+	the ``powersave_bias`` value, the sensitivity powersave bias driver
+	will cause the governor to select a frequency lower than its original
+	target, so as to avoid over-provisioning workloads that will not benefit
+	from running at higher CPU frequencies.
+
+``conservative``
+----------------
+
+This governor uses CPU load as a CPU frequency selection metric.
+
+It estimates the CPU load in the same way as the `ondemand`_ governor described
+above, but the CPU frequency selection algorithm implemented by it is different.
+
+Namely, it avoids changing the frequency significantly over short time intervals
+which may not be suitable for systems with limited power supply capacity (e.g.
+battery-powered).  To achieve that, it changes the frequency in relatively
+small steps, one step at a time, up or down - depending on whether or not a
+(configurable) threshold has been exceeded by the estimated CPU load.
+
+This governor exposes the following tunables:
+
+``freq_step``
+	Frequency step in percent of the maximum frequency the governor is
+	allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
+	100 (5 by default).
+
+	This is how much the frequency is allowed to change in one go.  Setting
+	it to 0 will cause the default frequency step (5 percent) to be used
+	and setting it to 100 effectively causes the governor to periodically
+	switch the frequency between the ``scaling_min_freq`` and
+	``scaling_max_freq`` policy limits.
+
+``down_threshold``
+	Threshold value (in percent, 20 by default) used to determine the
+	frequency change direction.
+
+	If the estimated CPU load is greater than this value, the frequency will
+	go up (by ``freq_step``).  If the load is less than this value (and the
+	``sampling_down_factor`` mechanism is not in effect), the frequency will
+	go down.  Otherwise, the frequency will not be changed.
+
+``sampling_down_factor``
+	Frequency decrease deferral factor, between 1 (default) and 10
+	inclusive.
+
+	It effectively causes the frequency to go down ``sampling_down_factor``
+	times slower than it ramps up.
+
+
+Frequency Boost Support
+=======================
+
+Background
+----------
+
+Some processors support a mechanism to raise the operating frequency of some
+cores in a multicore package temporarily (and above the sustainable frequency
+threshold for the whole package) under certain conditions, for example if the
+whole chip is not fully utilized and below its intended thermal or power budget.
+
+Different names are used by different vendors to refer to this functionality.
+For Intel processors it is referred to as "Turbo Boost", AMD calls it
+"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
+As a rule, it also is implemented differently by different vendors.  The simple
+term "frequency boost" is used here for brevity to refer to all of those
+implementations.
+
+The frequency boost mechanism may be either hardware-based or software-based.
+If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
+made by the hardware (although in general it requires the hardware to be put
+into a special state in which it can control the CPU frequency within certain
+limits).  If it is software-based (e.g. on ARM), the scaling driver decides
+whether or not to trigger boosting and when to do that.
+
+The ``boost`` File in ``sysfs``
+-------------------------------
+
+This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
+the "boost" setting for the whole system.  It is not present if the underlying
+scaling driver does not support the frequency boost mechanism (or supports it,
+but provides a driver-specific interface for controlling it, like
+|intel_pstate|).
+
+If the value in this file is 1, the frequency boost mechanism is enabled.  This
+means that either the hardware can be put into states in which it is able to
+trigger boosting (in the hardware-based case), or the software is allowed to
+trigger boosting (in the software-based case).  It does not mean that boosting
+is actually in use at the moment on any CPUs in the system.  It only means a
+permission to use the frequency boost mechanism (which still may never be used
+for other reasons).
+
+If the value in this file is 0, the frequency boost mechanism is disabled and
+cannot be used at all.
+
+The only values that can be written to this file are 0 and 1.
+
+Rationale for Boost Control Knob
+--------------------------------
+
+The frequency boost mechanism is generally intended to help to achieve optimum
+CPU performance on time scales below software resolution (e.g. below the
+scheduler tick interval) and it is demonstrably suitable for many workloads, but
+it may lead to problems in certain situations.
+
+For this reason, many systems make it possible to disable the frequency boost
+mechanism in the platform firmware (BIOS) setup, but that requires the system to
+be restarted for the setting to be adjusted as desired, which may not be
+practical at least in some cases.  For example:
+
+  1. Boosting means overclocking the processor, although under controlled
+     conditions.  Generally, the processor's energy consumption increases
+     as a result of increasing its frequency and voltage, even temporarily.
+     That may not be desirable on systems that switch to power sources of
+     limited capacity, such as batteries, so the ability to disable the boost
+     mechanism while the system is running may help there (but that depends on
+     the workload too).
+
+  2. In some situations deterministic behavior is more important than
+     performance or energy consumption (or both) and the ability to disable
+     boosting while the system is running may be useful then.
+
+  3. To examine the impact of the frequency boost mechanism itself, it is useful
+     to be able to run tests with and without boosting, preferably without
+     restarting the system in the meantime.
+
+  4. Reproducible results are important when running benchmarks.  Since
+     the boosting functionality depends on the load of the whole package,
+     single-thread performance may vary because of it which may lead to
+     unreproducible results sometimes.  That can be avoided by disabling the
+     frequency boost mechanism before running benchmarks sensitive to that
+     issue.
+
+Legacy AMD ``cpb`` Knob
+-----------------------
+
+The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
+the global ``boost`` one.  It is used for disabling/enabling the "Core
+Performance Boost" feature of some AMD processors.
+
+If present, that knob is located in every ``CPUFreq`` policy directory in
+``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
+``cpb``, which indicates a more fine grained control interface.  The actual
+implementation, however, works on the system-wide basis and setting that knob
+for one policy causes the same value of it to be set for all of the other
+policies at the same time.
+
+That knob is still supported on AMD processors that support its underlying
+hardware feature, but it may be configured out of the kernel (via the
+:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
+``boost`` knob is present regardless.  Thus it is always possible use the
+``boost`` knob instead of the ``cpb`` one which is highly recommended, as that
+is more consistent with what all of the other systems do (and the ``cpb`` knob
+may not be supported any more in the future).
+
+The ``cpb`` knob is never present for any processors without the underlying
+hardware feature (e.g. all Intel ones), even if the
+:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
+
+
+.. _Per-entity load tracking: https://lwn.net/Articles/531853/
diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst
new file mode 100644
index 000000000..49237ac73
--- /dev/null
+++ b/Documentation/admin-guide/pm/index.rst
@@ -0,0 +1,10 @@
+================
+Power Management
+================
+
+.. toctree::
+   :maxdepth: 2
+
+   strategies
+   system-wide
+   working-state
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
new file mode 100644
index 000000000..8f1d3de44
--- /dev/null
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -0,0 +1,718 @@
+===============================================
+``intel_pstate`` CPU Performance Scaling Driver
+===============================================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+
+General Information
+===================
+
+``intel_pstate`` is a part of the
+:doc:`CPU performance scaling subsystem <cpufreq>` in the Linux kernel
+(``CPUFreq``).  It is a scaling driver for the Sandy Bridge and later
+generations of Intel processors.  Note, however, that some of those processors
+may not be supported.  [To understand ``intel_pstate`` it is necessary to know
+how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if
+you have not done that yet.]
+
+For the processors supported by ``intel_pstate``, the P-state concept is broader
+than just an operating frequency or an operating performance point (see the
+`LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more
+information about that).  For this reason, the representation of P-states used
+by ``intel_pstate`` internally follows the hardware specification (for details
+refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual
+Volume 3: System Programming Guide <SDM_>`_).  However, the ``CPUFreq`` core
+uses frequencies for identifying operating performance points of CPUs and
+frequencies are involved in the user space interface exposed by it, so
+``intel_pstate`` maps its internal representation of P-states to frequencies too
+(fortunately, that mapping is unambiguous).  At the same time, it would not be
+practical for ``intel_pstate`` to supply the ``CPUFreq`` core with a table of
+available frequencies due to the possible size of it, so the driver does not do
+that.  Some functionality of the core is limited by that.
+
+Since the hardware P-state selection interface used by ``intel_pstate`` is
+available at the logical CPU level, the driver always works with individual
+CPUs.  Consequently, if ``intel_pstate`` is in use, every ``CPUFreq`` policy
+object corresponds to one logical CPU and ``CPUFreq`` policies are effectively
+equivalent to CPUs.  In particular, this means that they become "inactive" every
+time the corresponding CPU is taken offline and need to be re-initialized when
+it goes back online.
+
+``intel_pstate`` is not modular, so it cannot be unloaded, which means that the
+only way to pass early-configuration-time parameters to it is via the kernel
+command line.  However, its configuration can be adjusted via ``sysfs`` to a
+great extent.  In some configurations it even is possible to unregister it via
+``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and
+registered (see `below <status_attr_>`_).
+
+
+Operation Modes
+===============
+
+``intel_pstate`` can operate in three different modes: in the active mode with
+or without hardware-managed P-states support and in the passive mode.  Which of
+them will be in effect depends on what kernel command line options are used and
+on the capabilities of the processor.
+
+Active Mode
+-----------
+
+This is the default operation mode of ``intel_pstate``.  If it works in this
+mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq``
+policies contains the string "intel_pstate".
+
+In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and
+provides its own scaling algorithms for P-state selection.  Those algorithms
+can be applied to ``CPUFreq`` policies in the same way as generic scaling
+governors (that is, through the ``scaling_governor`` policy attribute in
+``sysfs``).  [Note that different P-state selection algorithms may be chosen for
+different policies, but that is not recommended.]
+
+They are not generic scaling governors, but their names are the same as the
+names of some of those governors.  Moreover, confusingly enough, they generally
+do not work in the same way as the generic governors they share the names with.
+For example, the ``powersave`` P-state selection algorithm provided by
+``intel_pstate`` is not a counterpart of the generic ``powersave`` governor
+(roughly, it corresponds to the ``schedutil`` and ``ondemand`` governors).
+
+There are two P-state selection algorithms provided by ``intel_pstate`` in the
+active mode: ``powersave`` and ``performance``.  The way they both operate
+depends on whether or not the hardware-managed P-states (HWP) feature has been
+enabled in the processor and possibly on the processor model.
+
+Which of the P-state selection algorithms is used by default depends on the
+:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option.
+Namely, if that option is set, the ``performance`` algorithm will be used by
+default, and the other one will be used by default if it is not set.
+
+Active Mode With HWP
+~~~~~~~~~~~~~~~~~~~~
+
+If the processor supports the HWP feature, it will be enabled during the
+processor initialization and cannot be disabled after that.  It is possible
+to avoid enabling it by passing the ``intel_pstate=no_hwp`` argument to the
+kernel in the command line.
+
+If the HWP feature has been enabled, ``intel_pstate`` relies on the processor to
+select P-states by itself, but still it can give hints to the processor's
+internal P-state selection logic.  What those hints are depends on which P-state
+selection algorithm has been applied to the given policy (or to the CPU it
+corresponds to).
+
+Even though the P-state selection is carried out by the processor automatically,
+``intel_pstate`` registers utilization update callbacks with the CPU scheduler
+in this mode.  However, they are not used for running a P-state selection
+algorithm, but for periodic updates of the current CPU frequency information to
+be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs``.
+
+HWP + ``performance``
+.....................
+
+In this configuration ``intel_pstate`` will write 0 to the processor's
+Energy-Performance Preference (EPP) knob (if supported) or its
+Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's
+internal P-state selection logic is expected to focus entirely on performance.
+
+This will override the EPP/EPB setting coming from the ``sysfs`` interface
+(see `Energy vs Performance Hints`_ below).
+
+Also, in this configuration the range of P-states available to the processor's
+internal P-state selection logic is always restricted to the upper boundary
+(that is, the maximum P-state that the driver is allowed to use).
+
+HWP + ``powersave``
+...................
+
+In this configuration ``intel_pstate`` will set the processor's
+Energy-Performance Preference (EPP) knob (if supported) or its
+Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was
+previously set to via ``sysfs`` (or whatever default value it was
+set to by the platform firmware).  This usually causes the processor's
+internal P-state selection logic to be less performance-focused.
+
+Active Mode Without HWP
+~~~~~~~~~~~~~~~~~~~~~~~
+
+This is the default operation mode for processors that do not support the HWP
+feature.  It also is used by default with the ``intel_pstate=no_hwp`` argument
+in the kernel command line.  However, in this mode ``intel_pstate`` may refuse
+to work with the given processor if it does not recognize it.  [Note that
+``intel_pstate`` will never refuse to work with any processor with the HWP
+feature enabled.]
+
+In this mode ``intel_pstate`` registers utilization update callbacks with the
+CPU scheduler in order to run a P-state selection algorithm, either
+``powersave`` or ``performance``, depending on the ``scaling_governor`` policy
+setting in ``sysfs``.  The current CPU frequency information to be made
+available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is
+periodically updated by those utilization update callbacks too.
+
+``performance``
+...............
+
+Without HWP, this P-state selection algorithm is always the same regardless of
+the processor model and platform configuration.
+
+It selects the maximum P-state it is allowed to use, subject to limits set via
+``sysfs``, every time the driver configuration for the given CPU is updated
+(e.g. via ``sysfs``).
+
+This is the default P-state selection algorithm if the
+:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
+is set.
+
+``powersave``
+.............
+
+Without HWP, this P-state selection algorithm is similar to the algorithm
+implemented by the generic ``schedutil`` scaling governor except that the
+utilization metric used by it is based on numbers coming from feedback
+registers of the CPU.  It generally selects P-states proportional to the
+current CPU utilization.
+
+This algorithm is run by the driver's utilization update callback for the
+given CPU when it is invoked by the CPU scheduler, but not more often than
+every 10 ms.  Like in the ``performance`` case, the hardware configuration
+is not touched if the new P-state turns out to be the same as the current
+one.
+
+This is the default P-state selection algorithm if the
+:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
+is not set.
+
+Passive Mode
+------------
+
+This mode is used if the ``intel_pstate=passive`` argument is passed to the
+kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting too).
+Like in the active mode without HWP support, in this mode ``intel_pstate`` may
+refuse to work with the given processor if it does not recognize it.
+
+If the driver works in this mode, the ``scaling_driver`` policy attribute in
+``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
+Then, the driver behaves like a regular ``CPUFreq`` scaling driver.  That is,
+it is invoked by generic scaling governors when necessary to talk to the
+hardware in order to change the P-state of a CPU (in particular, the
+``schedutil`` governor can invoke it directly from scheduler context).
+
+While in this mode, ``intel_pstate`` can be used with all of the (generic)
+scaling governors listed by the ``scaling_available_governors`` policy attribute
+in ``sysfs`` (and the P-state selection algorithms described above are not
+used).  Then, it is responsible for the configuration of policy objects
+corresponding to CPUs and provides the ``CPUFreq`` core (and the scaling
+governors attached to the policy objects) with accurate information on the
+maximum and minimum operating frequencies supported by the hardware (including
+the so-called "turbo" frequency ranges).  In other words, in the passive mode
+the entire range of available P-states is exposed by ``intel_pstate`` to the
+``CPUFreq`` core.  However, in this mode the driver does not register
+utilization update callbacks with the CPU scheduler and the ``scaling_cur_freq``
+information comes from the ``CPUFreq`` core (and is the last frequency selected
+by the current scaling governor for the given policy).
+
+
+.. _turbo:
+
+Turbo P-states Support
+======================
+
+In the majority of cases, the entire range of P-states available to
+``intel_pstate`` can be divided into two sub-ranges that correspond to
+different types of processor behavior, above and below a boundary that
+will be referred to as the "turbo threshold" in what follows.
+
+The P-states above the turbo threshold are referred to as "turbo P-states" and
+the whole sub-range of P-states they belong to is referred to as the "turbo
+range".  These names are related to the Turbo Boost technology allowing a
+multicore processor to opportunistically increase the P-state of one or more
+cores if there is enough power to do that and if that is not going to cause the
+thermal envelope of the processor package to be exceeded.
+
+Specifically, if software sets the P-state of a CPU core within the turbo range
+(that is, above the turbo threshold), the processor is permitted to take over
+performance scaling control for that core and put it into turbo P-states of its
+choice going forward.  However, that permission is interpreted differently by
+different processor generations.  Namely, the Sandy Bridge generation of
+processors will never use any P-states above the last one set by software for
+the given core, even if it is within the turbo range, whereas all of the later
+processor generations will take it as a license to use any P-states from the
+turbo range, even above the one set by software.  In other words, on those
+processors setting any P-state from the turbo range will enable the processor
+to put the given core into all turbo P-states up to and including the maximum
+supported one as it sees fit.
+
+One important property of turbo P-states is that they are not sustainable.  More
+precisely, there is no guarantee that any CPUs will be able to stay in any of
+those states indefinitely, because the power distribution within the processor
+package may change over time  or the thermal envelope it was designed for might
+be exceeded if a turbo P-state was used for too long.
+
+In turn, the P-states below the turbo threshold generally are sustainable.  In
+fact, if one of them is set by software, the processor is not expected to change
+it to a lower one unless in a thermal stress or a power limit violation
+situation (a higher P-state may still be used if it is set for another CPU in
+the same package at the same time, for example).
+
+Some processors allow multiple cores to be in turbo P-states at the same time,
+but the maximum P-state that can be set for them generally depends on the number
+of cores running concurrently.  The maximum turbo P-state that can be set for 3
+cores at the same time usually is lower than the analogous maximum P-state for
+2 cores, which in turn usually is lower than the maximum turbo P-state that can
+be set for 1 core.  The one-core maximum turbo P-state is thus the maximum
+supported one overall.
+
+The maximum supported turbo P-state, the turbo threshold (the maximum supported
+non-turbo P-state) and the minimum supported P-state are specific to the
+processor model and can be determined by reading the processor's model-specific
+registers (MSRs).  Moreover, some processors support the Configurable TDP
+(Thermal Design Power) feature and, when that feature is enabled, the turbo
+threshold effectively becomes a configurable value that can be set by the
+platform firmware.
+
+Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes
+the entire range of available P-states, including the whole turbo range, to the
+``CPUFreq`` core and (in the passive mode) to generic scaling governors.  This
+generally causes turbo P-states to be set more often when ``intel_pstate`` is
+used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_
+for more information).
+
+Moreover, since ``intel_pstate`` always knows what the real turbo threshold is
+(even if the Configurable TDP feature is enabled in the processor), its
+``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should
+work as expected in all cases (that is, if set to disable turbo P-states, it
+always should prevent ``intel_pstate`` from using them).
+
+
+Processor Support
+=================
+
+To handle a given processor ``intel_pstate`` requires a number of different
+pieces of information on it to be known, including:
+
+ * The minimum supported P-state.
+
+ * The maximum supported `non-turbo P-state <turbo_>`_.
+
+ * Whether or not turbo P-states are supported at all.
+
+ * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states
+   are supported).
+
+ * The scaling formula to translate the driver's internal representation
+   of P-states into frequencies and the other way around.
+
+Generally, ways to obtain that information are specific to the processor model
+or family.  Although it often is possible to obtain all of it from the processor
+itself (using model-specific registers), there are cases in which hardware
+manuals need to be consulted to get to it too.
+
+For this reason, there is a list of supported processors in ``intel_pstate`` and
+the driver initialization will fail if the detected processor is not in that
+list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
+obtain all of the information listed above is the same for all of the processors
+supporting the HWP feature, which is why they all are supported by
+``intel_pstate``.]
+
+
+User Space Interface in ``sysfs``
+=================================
+
+Global Attributes
+-----------------
+
+``intel_pstate`` exposes several global attributes (files) in ``sysfs`` to
+control its functionality at the system level.  They are located in the
+``/sys/devices/system/cpu/intel_pstate/`` directory and affect all CPUs.
+
+Some of them are not present if the ``intel_pstate=per_cpu_perf_limits``
+argument is passed to the kernel in the command line.
+
+``max_perf_pct``
+	Maximum P-state the driver is allowed to set in percent of the
+	maximum supported performance level (the highest supported `turbo
+	P-state <turbo_>`_).
+
+	This attribute will not be exposed if the
+	``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
+	command line.
+
+``min_perf_pct``
+	Minimum P-state the driver is allowed to set in percent of the
+	maximum supported performance level (the highest supported `turbo
+	P-state <turbo_>`_).
+
+	This attribute will not be exposed if the
+	``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel
+	command line.
+
+``num_pstates``
+	Number of P-states supported by the processor (between 0 and 255
+	inclusive) including both turbo and non-turbo P-states (see
+	`Turbo P-states Support`_).
+
+	The value of this attribute is not affected by the ``no_turbo``
+	setting described `below <no_turbo_attr_>`_.
+
+	This attribute is read-only.
+
+``turbo_pct``
+	Ratio of the `turbo range <turbo_>`_ size to the size of the entire
+	range of supported P-states, in percent.
+
+	This attribute is read-only.
+
+.. _no_turbo_attr:
+
+``no_turbo``
+	If set (equal to 1), the driver is not allowed to set any turbo P-states
+	(see `Turbo P-states Support`_).  If unset (equalt to 0, which is the
+	default), turbo P-states can be set by the driver.
+	[Note that ``intel_pstate`` does not support the general ``boost``
+	attribute (supported by some other scaling drivers) which is replaced
+	by this one.]
+
+	This attrubute does not affect the maximum supported frequency value
+	supplied to the ``CPUFreq`` core and exposed via the policy interface,
+	but it affects the maximum possible value of per-policy P-state	limits
+	(see `Interpretation of Policy Attributes`_ below for details).
+
+``hwp_dynamic_boost``
+	This attribute is only present if ``intel_pstate`` works in the
+	`active mode with the HWP feature enabled <Active Mode With HWP_>`_ in
+	the processor.  If set (equal to 1), it causes the minimum P-state limit
+	to be increased dynamically for a short time whenever a task previously
+	waiting on I/O is selected to run on a given logical CPU (the purpose
+	of this mechanism is to improve performance).
+
+	This setting has no effect on logical CPUs whose minimum P-state limit
+	is directly set to the highest non-turbo P-state or above it.
+
+.. _status_attr:
+
+``status``
+	Operation mode of the driver: "active", "passive" or "off".
+
+	"active"
+		The driver is functional and in the `active mode
+		<Active Mode_>`_.
+
+	"passive"
+		The driver is functional and in the `passive mode
+		<Passive Mode_>`_.
+
+	"off"
+		The driver is not functional (it is not registered as a scaling
+		driver with the ``CPUFreq`` core).
+
+	This attribute can be written to in order to change the driver's
+	operation mode or to unregister it.  The string written to it must be
+	one of the possible values of it and, if successful, the write will
+	cause the driver to switch over to the operation mode represented by
+	that string - or to be unregistered in the "off" case.  [Actually,
+	switching over from the active mode to the passive mode or the other
+	way around causes the driver to be unregistered and registered again
+	with a different set of callbacks, so all of its settings (the global
+	as well as the per-policy ones) are then reset to their default
+	values, possibly depending on the target operation mode.]
+
+	That only is supported in some configurations, though (for example, if
+	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
+	the operation mode of the driver cannot be changed), and if it is not
+	supported in the current configuration, writes to this attribute will
+	fail with an appropriate error.
+
+Interpretation of Policy Attributes
+-----------------------------------
+
+The interpretation of some ``CPUFreq`` policy attributes described in
+:doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver
+and it generally depends on the driver's `operation mode <Operation Modes_>`_.
+
+First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and
+``scaling_cur_freq`` attributes are produced by applying a processor-specific
+multiplier to the internal P-state representation used by ``intel_pstate``.
+Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq``
+attributes are capped by the frequency corresponding to the maximum P-state that
+the driver is allowed to set.
+
+If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is
+not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq``
+and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency.
+Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and
+``scaling_min_freq`` to go down to that value if they were above it before.
+However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be
+restored after unsetting ``no_turbo``, unless these attributes have been written
+to after ``no_turbo`` was set.
+
+If ``no_turbo`` is not set, the maximum possible value of ``scaling_max_freq``
+and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state,
+which also is the value of ``cpuinfo_max_freq`` in either case.
+
+Next, the following policy attributes have special meaning if
+``intel_pstate`` works in the `active mode <Active Mode_>`_:
+
+``scaling_available_governors``
+	List of P-state selection algorithms provided by ``intel_pstate``.
+
+``scaling_governor``
+	P-state selection algorithm provided by ``intel_pstate`` currently in
+	use with the given policy.
+
+``scaling_cur_freq``
+	Frequency of the average P-state of the CPU represented by the given
+	policy for the time interval between the last two invocations of the
+	driver's utilization update callback by the CPU scheduler for that CPU.
+
+The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the
+same as for other scaling drivers.
+
+Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate``
+depends on the operation mode of the driver.  Namely, it is either
+"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the
+`passive mode <Passive Mode_>`_).
+
+Coordination of P-State Limits
+------------------------------
+
+``intel_pstate`` allows P-state limits to be set in two ways: with the help of
+the ``max_perf_pct`` and ``min_perf_pct`` `global attributes
+<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq``
+``CPUFreq`` policy attributes.  The coordination between those limits is based
+on the following rules, regardless of the current operation mode of the driver:
+
+ 1. All CPUs are affected by the global limits (that is, none of them can be
+    requested to run faster than the global maximum and none of them can be
+    requested to run slower than the global minimum).
+
+ 2. Each individual CPU is affected by its own per-policy limits (that is, it
+    cannot be requested to run faster than its own per-policy maximum and it
+    cannot be requested to run slower than its own per-policy minimum).
+
+ 3. The global and per-policy limits can be set independently.
+
+If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
+resulting effective values are written into its registers whenever the limits
+change in order to request its internal P-state selection logic to always set
+P-states within these limits.  Otherwise, the limits are taken into account by
+scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+every time before setting a new P-state for a CPU.
+
+Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
+is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed
+at all and the only way to set the limits is by using the policy attributes.
+
+
+Energy vs Performance Hints
+---------------------------
+
+If ``intel_pstate`` works in the `active mode with the HWP feature enabled
+<Active Mode With HWP_>`_ in the processor, additional attributes are present
+in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
+user space to help ``intel_pstate`` to adjust the processor's internal P-state
+selection logic by focusing it on performance or on energy-efficiency, or
+somewhere between the two extremes:
+
+``energy_performance_preference``
+	Current value of the energy vs performance hint for the given policy
+	(or the CPU represented by it).
+
+	The hint can be changed by writing to this attribute.
+
+``energy_performance_available_preferences``
+	List of strings that can be written to the
+	``energy_performance_preference`` attribute.
+
+	They represent different energy vs performance hints and should be
+	self-explanatory, except that ``default`` represents whatever hint
+	value was set by the platform firmware.
+
+Strings written to the ``energy_performance_preference`` attribute are
+internally translated to integer values written to the processor's
+Energy-Performance Preference (EPP) knob (if supported) or its
+Energy-Performance Bias (EPB) knob.
+
+[Note that tasks may by migrated from one CPU to another by the scheduler's
+load-balancing algorithm and if different energy vs performance hints are
+set for those CPUs, that may lead to undesirable outcomes.  To avoid such
+issues it is better to set the same energy vs performance hint for all CPUs
+or to pin every task potentially sensitive to them to a specific CPU.]
+
+.. _acpi-cpufreq:
+
+``intel_pstate`` vs ``acpi-cpufreq``
+====================================
+
+On the majority of systems supported by ``intel_pstate``, the ACPI tables
+provided by the platform firmware contain ``_PSS`` objects returning information
+that can be used for CPU performance scaling (refer to the `ACPI specification`_
+for details on the ``_PSS`` objects and the format of the information returned
+by them).
+
+The information returned by the ACPI ``_PSS`` objects is used by the
+``acpi-cpufreq`` scaling driver.  On systems supported by ``intel_pstate``
+the ``acpi-cpufreq`` driver uses the same hardware CPU performance scaling
+interface, but the set of P-states it can use is limited by the ``_PSS``
+output.
+
+On those systems each ``_PSS`` object returns a list of P-states supported by
+the corresponding CPU which basically is a subset of the P-states range that can
+be used by ``intel_pstate`` on the same system, with one exception: the whole
+`turbo range <turbo_>`_ is represented by one item in it (the topmost one).  By
+convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz
+than the frequency of the highest non-turbo P-state listed by it, but the
+corresponding P-state representation (following the hardware specification)
+returned for it matches the maximum supported turbo P-state (or is the
+special value 255 meaning essentially "go as high as you can get").
+
+The list of P-states returned by ``_PSS`` is reflected by the table of
+available frequencies supplied by ``acpi-cpufreq`` to the ``CPUFreq`` core and
+scaling governors and the minimum and maximum supported frequencies reported by
+it come from that list as well.  In particular, given the special representation
+of the turbo range described above, this means that the maximum supported
+frequency reported by ``acpi-cpufreq`` is higher by 1 MHz than the frequency
+of the highest supported non-turbo P-state listed by ``_PSS`` which, of course,
+affects decisions made by the scaling governors, except for ``powersave`` and
+``performance``.
+
+For example, if a given governor attempts to select a frequency proportional to
+estimated CPU load and maps the load of 100% to the maximum supported frequency
+(possibly multiplied by a constant), then it will tend to choose P-states below
+the turbo threshold if ``acpi-cpufreq`` is used as the scaling driver, because
+in that case the turbo range corresponds to a small fraction of the frequency
+band it can use (1 MHz vs 1 GHz or more).  In consequence, it will only go to
+the turbo range for the highest loads and the other loads above 50% that might
+benefit from running at turbo frequencies will be given non-turbo P-states
+instead.
+
+One more issue related to that may appear on systems supporting the
+`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the
+turbo threshold.  Namely, if that is not coordinated with the lists of P-states
+returned by ``_PSS`` properly, there may be more than one item corresponding to
+a turbo P-state in those lists and there may be a problem with avoiding the
+turbo range (if desirable or necessary).  Usually, to avoid using turbo
+P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed
+by ``_PSS``, but that is not sufficient when there are other turbo P-states in
+the list returned by it.
+
+Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the
+`passive mode <Passive Mode_>`_, except that the number of P-states it can set
+is limited to the ones listed by the ACPI ``_PSS`` objects.
+
+
+Kernel Command Line Options for ``intel_pstate``
+================================================
+
+Several kernel command line options can be used to pass early-configuration-time
+parameters to ``intel_pstate`` in order to enforce specific behavior of it.  All
+of them have to be prepended with the ``intel_pstate=`` prefix.
+
+``disable``
+	Do not register ``intel_pstate`` as the scaling driver even if the
+	processor is supported by it.
+
+``passive``
+	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
+	start with.
+
+	This option implies the ``no_hwp`` one described below.
+
+``force``
+	Register ``intel_pstate`` as the scaling driver instead of
+	``acpi-cpufreq`` even if the latter is preferred on the given system.
+
+	This may prevent some platform features (such as thermal controls and
+	power capping) that rely on the availability of ACPI P-states
+	information from functioning as expected, so it should be used with
+	caution.
+
+	This option does not work with processors that are not supported by
+	``intel_pstate`` and on platforms where the ``pcc-cpufreq`` scaling
+	driver is used instead of ``acpi-cpufreq``.
+
+``no_hwp``
+	Do not enable the `hardware-managed P-states (HWP) feature
+	<Active Mode With HWP_>`_ even if it is supported by the processor.
+
+``hwp_only``
+	Register ``intel_pstate`` as the scaling driver only if the
+	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
+	supported by the processor.
+
+``support_acpi_ppc``
+	Take ACPI ``_PPC`` performance limits into account.
+
+	If the preferred power management profile in the FADT (Fixed ACPI
+	Description Table) is set to "Enterprise Server" or "Performance
+	Server", the ACPI ``_PPC`` limits are taken into account by default
+	and this option has no effect.
+
+``per_cpu_perf_limits``
+	Use per-logical-CPU P-State limits (see `Coordination of P-state
+	Limits`_ for details).
+
+
+Diagnostics and Tuning
+======================
+
+Trace Events
+------------
+
+There are two static trace events that can be used for ``intel_pstate``
+diagnostics.  One of them is the ``cpu_frequency`` trace event generally used
+by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific
+to ``intel_pstate``.  Both of them are triggered by ``intel_pstate`` only if
+it works in the `active mode <Active Mode_>`_.
+
+The following sequence of shell commands can be used to enable them and see
+their output (if the kernel is generally configured to support event tracing)::
+
+ # cd /sys/kernel/debug/tracing/
+ # echo 1 > events/power/pstate_sample/enable
+ # echo 1 > events/power/cpu_frequency/enable
+ # cat trace
+ gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476
+ cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2
+
+If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the
+``cpu_frequency`` trace event will be triggered either by the ``schedutil``
+scaling governor (for the policies it is attached to), or by the ``CPUFreq``
+core (for the policies with other scaling governors).
+
+``ftrace``
+----------
+
+The ``ftrace`` interface can be used for low-level diagnostics of
+``intel_pstate``.  For example, to check how often the function to set a
+P-state is called, the ``ftrace`` filter can be set to to
+:c:func:`intel_pstate_set_pstate`::
+
+ # cd /sys/kernel/debug/tracing/
+ # cat available_filter_functions | grep -i pstate
+ intel_pstate_set_pstate
+ intel_pstate_cpu_init
+ ...
+ # echo intel_pstate_set_pstate > set_ftrace_filter
+ # echo function > current_tracer
+ # cat trace | head -15
+ # tracer: function
+ #
+ # entries-in-buffer/entries-written: 80/80   #P:4
+ #
+ #                              _-----=> irqs-off
+ #                             / _----=> need-resched
+ #                            | / _---=> hardirq/softirq
+ #                            || / _--=> preempt-depth
+ #                            ||| /     delay
+ #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+ #              | |       |   ||||       |         |
+             Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
+  gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
+      gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
+           <idle>-0     [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
+
+
+.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
+.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
+.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst
new file mode 100644
index 000000000..dbf5acd49
--- /dev/null
+++ b/Documentation/admin-guide/pm/sleep-states.rst
@@ -0,0 +1,245 @@
+===================
+System Sleep States
+===================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+Sleep states are global low-power states of the entire system in which user
+space code cannot be executed and the overall system activity is significantly
+reduced.
+
+
+Sleep States That Can Be Supported
+==================================
+
+Depending on its configuration and the capabilities of the platform it runs on,
+the Linux kernel can support up to four system sleep states, including
+hibernation and up to three variants of system suspend.  The sleep states that
+can be supported by the kernel are listed below.
+
+.. _s2idle:
+
+Suspend-to-Idle
+---------------
+
+This is a generic, pure software, light-weight variant of system suspend (also
+referred to as S2I or S2Idle).  It allows more energy to be saved relative to
+runtime idle by freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states (possibly lower-power than available in the
+working state), such that the processors can spend time in their deepest idle
+states while the system is suspended.
+
+The system is woken up from this state by in-band interrupts, so theoretically
+any devices that can cause interrupts to be generated in the working state can
+also be set up as wakeup devices for S2Idle.
+
+This state can be used on platforms without support for :ref:`standby <standby>`
+or :ref:`suspend-to-RAM <s2ram>`, or it can be used in addition to any of the
+deeper system suspend variants to provide reduced resume latency.  It is always
+supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option is set.
+
+.. _standby:
+
+Standby
+-------
+
+This state, if supported, offers moderate, but real, energy savings, while
+providing a relatively straightforward transition back to the working state.  No
+operating state is lost (the system core logic retains power), so the system can
+go back to where it left off easily enough.
+
+In addition to freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states, which is done for :ref:`suspend-to-idle
+<s2idle>` too, nonboot CPUs are taken offline and all low-level system functions
+are suspended during transitions into this state.  For this reason, it should
+allow more energy to be saved relative to :ref:`suspend-to-idle <s2idle>`, but
+the resume latency will generally be greater than for that state.
+
+The set of devices that can wake up the system from this state usually is
+reduced relative to :ref:`suspend-to-idle <s2idle>` and it may be necessary to
+rely on the platform for setting up the wakeup functionality as appropriate.
+
+This state is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration
+option is set and the support for it is registered by the platform with the
+core system suspend subsystem.  On ACPI-based systems this state is mapped to
+the S1 system state defined by ACPI.
+
+.. _s2ram:
+
+Suspend-to-RAM
+--------------
+
+This state (also referred to as STR or S2RAM), if supported, offers significant
+energy savings as everything in the system is put into a low-power state, except
+for memory, which should be placed into the self-refresh mode to retain its
+contents.  All of the steps carried out when entering :ref:`standby <standby>`
+are also carried out during transitions to S2RAM.  Additional operations may
+take place depending on the platform capabilities.  In particular, on ACPI-based
+systems the kernel passes control to the platform firmware (BIOS) as the last
+step during S2RAM transitions and that usually results in powering down some
+more low-level components that are not directly controlled by the kernel.
+
+The state of devices and CPUs is saved and held in memory.  All devices are
+suspended and put into low-power states.  In many cases, all peripheral buses
+lose power when entering S2RAM, so devices must be able to handle the transition
+back to the "on" state.
+
+On ACPI-based systems S2RAM requires some minimal boot-strapping code in the
+platform firmware to resume the system from it.  This may be the case on other
+platforms too.
+
+The set of devices that can wake up the system from S2RAM usually is reduced
+relative to :ref:`suspend-to-idle <s2idle>` and :ref:`standby <standby>` and it
+may be necessary to rely on the platform for setting up the wakeup functionality
+as appropriate.
+
+S2RAM is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option
+is set and the support for it is registered by the platform with the core system
+suspend subsystem.  On ACPI-based systems it is mapped to the S3 system state
+defined by ACPI.
+
+.. _hibernation:
+
+Hibernation
+-----------
+
+This state (also referred to as Suspend-to-Disk or STD) offers the greatest
+energy savings and can be used even in the absence of low-level platform support
+for system suspend.  However, it requires some low-level code for resuming the
+system to be present for the underlying CPU architecture.
+
+Hibernation is significantly different from any of the system suspend variants.
+It takes three system state changes to put it into hibernation and two system
+state changes to resume it.
+
+First, when hibernation is triggered, the kernel stops all system activity and
+creates a snapshot image of memory to be written into persistent storage.  Next,
+the system goes into a state in which the snapshot image can be saved, the image
+is written out and finally the system goes into the target low-power state in
+which power is cut from almost all of its hardware components, including memory,
+except for a limited set of wakeup devices.
+
+Once the snapshot image has been written out, the system may either enter a
+special low-power state (like ACPI S4), or it may simply power down itself.
+Powering down means minimum power draw and it allows this mechanism to work on
+any system.  However, entering a special low-power state may allow additional
+means of system wakeup to be used  (e.g. pressing a key on the keyboard or
+opening a laptop lid).
+
+After wakeup, control goes to the platform firmware that runs a boot loader
+which boots a fresh instance of the kernel (control may also go directly to
+the boot loader, depending on the system configuration, but anyway it causes
+a fresh instance of the kernel to be booted).  That new instance of the kernel
+(referred to as the ``restore kernel``) looks for a hibernation image in
+persistent storage and if one is found, it is loaded into memory.  Next, all
+activity in the system is stopped and the restore kernel overwrites itself with
+the image contents and jumps into a special trampoline area in the original
+kernel stored in the image (referred to as the ``image kernel``), which is where
+the special architecture-specific low-level code is needed.  Finally, the
+image kernel restores the system to the pre-hibernation state and allows user
+space to run again.
+
+Hibernation is supported if the :c:macro:`CONFIG_HIBERNATION` kernel
+configuration option is set.  However, this option can only be set if support
+for the given CPU architecture includes the low-level code for system resume.
+
+
+Basic ``sysfs`` Interfaces for System Suspend and Hibernation
+=============================================================
+
+The following files located in the :file:`/sys/power/` directory can be used by
+user space for sleep states control.
+
+``state``
+	This file contains a list of strings representing sleep states supported
+	by the kernel.  Writing one of these strings into it causes the kernel
+	to start a transition of the system into the sleep state represented by
+	that string.
+
+	In particular, the strings "disk", "freeze" and "standby" represent the
+	:ref:`hibernation <hibernation>`, :ref:`suspend-to-idle <s2idle>` and
+	:ref:`standby <standby>` sleep states, respectively.  The string "mem"
+	is interpreted in accordance with the contents of the ``mem_sleep`` file
+	described below.
+
+	If the kernel does not support any system sleep states, this file is
+	not present.
+
+``mem_sleep``
+	This file contains a list of strings representing supported system
+	suspend	variants and allows user space to select the variant to be
+	associated with the "mem" string in the ``state`` file described above.
+
+	The strings that may be present in this file are "s2idle", "shallow"
+	and "deep".  The string "s2idle" always represents :ref:`suspend-to-idle
+	<s2idle>` and, by convention, "shallow" and "deep" represent
+	:ref:`standby <standby>` and :ref:`suspend-to-RAM <s2ram>`,
+	respectively.
+
+	Writing one of the listed strings into this file causes the system
+	suspend variant represented by it to be associated with the "mem" string
+	in the ``state`` file.  The string representing the suspend variant
+	currently associated with the "mem" string in the ``state`` file
+	is listed in square brackets.
+
+	If the kernel does not support system suspend, this file is not present.
+
+``disk``
+	This file contains a list of strings representing different operations
+	that can be carried out after the hibernation image has been saved.  The
+	possible options are as follows:
+
+	``platform``
+		Put the system into a special low-power state (e.g. ACPI S4) to
+		make additional wakeup options available and possibly allow the
+		platform firmware to take a simplified initialization path after
+		wakeup.
+
+	``shutdown``
+		Power off the system.
+
+	``reboot``
+		Reboot the system (useful for diagnostics mostly).
+
+	``suspend``
+		Hybrid system suspend.  Put the system into the suspend sleep
+		state selected through the ``mem_sleep`` file described above.
+		If the system is successfully woken up from that state, discard
+		the hibernation image and continue.  Otherwise, use the image
+		to restore the previous state of the system.
+
+	``test_resume``
+		Diagnostic operation.  Load the image as though the system had
+		just woken up from hibernation and the currently running kernel
+		instance was a restore kernel and follow up with full system
+		resume.
+
+	Writing one of the listed strings into this file causes the option
+	represented by it to be selected.
+
+	The currently selected option is shown in square brackets which means
+	that the operation represented by it will be carried out after creating
+	and saving the image next time hibernation is triggered by writing
+	``disk`` to :file:`/sys/power/state`.
+
+	If the kernel does not support hibernation, this file is not present.
+
+According to the above, there are two ways to make the system go into the
+:ref:`suspend-to-idle <s2idle>` state.  The first one is to write "freeze"
+directly to :file:`/sys/power/state`.  The second one is to write "s2idle" to
+:file:`/sys/power/mem_sleep` and then to write "mem" to
+:file:`/sys/power/state`.  Likewise, there are two ways to make the system go
+into the :ref:`standby <standby>` state (the strings to write to the control
+files in that case are "standby" or "shallow" and "mem", respectively) if that
+state is supported by the platform.  However, there is only one way to make the
+system go into the :ref:`suspend-to-RAM <s2ram>` state (write "deep" into
+:file:`/sys/power/mem_sleep` and "mem" into :file:`/sys/power/state`).
+
+The default suspend variant (ie. the one to be used without writing anything
+into :file:`/sys/power/mem_sleep`) is either "deep" (on the majority of systems
+supporting :ref:`suspend-to-RAM <s2ram>`) or "s2idle", but it can be overridden
+by the value of the "mem_sleep_default" parameter in the kernel command line.
+On some ACPI-based systems, depending on the information in the ACPI tables, the
+default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported.
diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst
new file mode 100644
index 000000000..afe4d3f83
--- /dev/null
+++ b/Documentation/admin-guide/pm/strategies.rst
@@ -0,0 +1,52 @@
+===========================
+Power Management Strategies
+===========================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+The Linux kernel supports two major high-level power management strategies.
+
+One of them is based on using global low-power states of the whole system in
+which user space code cannot be executed and the overall system activity is
+significantly reduced, referred to as :doc:`sleep states <sleep-states>`.  The
+kernel puts the system into one of these states when requested by user space
+and the system stays in it until a special signal is received from one of
+designated devices, triggering a transition to the ``working state`` in which
+user space code can run.  Because sleep states are global and the whole system
+is affected by the state changes, this strategy is referred to as the
+:doc:`system-wide power management <system-wide>`.
+
+The other strategy, referred to as the :doc:`working-state power management
+<working-state>`, is based on adjusting the power states of individual hardware
+components of the system, as needed, in the working state.  In consequence, if
+this strategy is in use, the working state of the system usually does not
+correspond to any particular physical configuration of it, but can be treated as
+a metastate covering a range of different power states of the system in which
+the individual components of it can be either ``active`` (in use) or
+``inactive`` (idle).  If they are active, they have to be in power states
+allowing them to process data and to be accessed by software.  In turn, if they
+are inactive, ideally, they should be in low-power states in which they may not
+be accessible.
+
+If all of the system components are active, the system as a whole is regarded as
+"runtime active" and that situation typically corresponds to the maximum power
+draw (or maximum energy usage) of it.  If all of them are inactive, the system
+as a whole is regarded as "runtime idle" which may be very close to a sleep
+state from the physical system configuration and power draw perspective, but
+then it takes much less time and effort to start executing user space code than
+for the same system in a sleep state.  However, transitions from sleep states
+back to the working state can only be started by a limited set of devices, so
+typically the system can spend much more time in a sleep state than it can be
+runtime idle in one go.  For this reason, systems usually use less energy in
+sleep states than when they are runtime idle most of the time.
+
+Moreover, the two power management strategies address different usage scenarios.
+Namely, if the user indicates that the system will not be in use going forward,
+for example by closing its lid (if the system is a laptop), it probably should
+go into a sleep state at that point.  On the other hand, if the user simply goes
+away from the laptop keyboard, it probably should stay in the working state and
+use the working-state power management in case it becomes idle, because the user
+may come back to it at any time and then may want the system to be immediately
+accessible.
diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst
new file mode 100644
index 000000000..0c81e4c5d
--- /dev/null
+++ b/Documentation/admin-guide/pm/system-wide.rst
@@ -0,0 +1,8 @@
+============================
+System-Wide Power Management
+============================
+
+.. toctree::
+   :maxdepth: 2
+
+   sleep-states
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
new file mode 100644
index 000000000..fa01bf083
--- /dev/null
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -0,0 +1,9 @@
+==============================
+Working-State Power Management
+==============================
+
+.. toctree::
+   :maxdepth: 2
+
+   cpufreq
+   intel_pstate
diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst
new file mode 100644
index 000000000..6dbcc5481
--- /dev/null
+++ b/Documentation/admin-guide/ramoops.rst
@@ -0,0 +1,156 @@
+Ramoops oops/panic logger
+=========================
+
+Sergiu Iordache <sergiu@chromium.org>
+
+Updated: 17 November 2011
+
+Introduction
+------------
+
+Ramoops is an oops/panic logger that writes its logs to RAM before the system
+crashes. It works by logging oopses and panics in a circular buffer. Ramoops
+needs a system with persistent RAM so that the content of that area can
+survive after a restart.
+
+Ramoops concepts
+----------------
+
+Ramoops uses a predefined memory area to store the dump. The start and size
+and type of the memory area are set using three variables:
+
+  * ``mem_address`` for the start
+  * ``mem_size`` for the size. The memory size will be rounded down to a
+    power of two.
+  * ``mem_type`` to specifiy if the memory type (default is pgprot_writecombine).
+
+Typically the default value of ``mem_type=0`` should be used as that sets the pstore
+mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use
+``pgprot_noncached``, which only works on some platforms. This is because pstore
+depends on atomic operations. At least on ARM, pgprot_noncached causes the
+memory to be mapped strongly ordered, and atomic operations on strongly ordered
+memory are implementation defined, and won't work on many ARMs such as omaps.
+
+The memory area is divided into ``record_size`` chunks (also rounded down to
+power of two) and each oops/panic writes a ``record_size`` chunk of
+information.
+
+Dumping both oopses and panics can be done by setting 1 in the ``dump_oops``
+variable while setting 0 in that variable dumps only the panics.
+
+The module uses a counter to record multiple dumps but the counter gets reset
+on restart (i.e. new dumps after the restart will overwrite old ones).
+
+Ramoops also supports software ECC protection of persistent memory regions.
+This might be useful when a hardware reset was used to bring the machine back
+to life (i.e. a watchdog triggered). In such cases, RAM may be somewhat
+corrupt, but usually it is restorable.
+
+Setting the parameters
+----------------------
+
+Setting the ramoops parameters can be done in several different manners:
+
+ A. Use the module parameters (which have the names of the variables described
+ as before). For quick debugging, you can also reserve parts of memory during
+ boot and then use the reserved memory for ramoops. For example, assuming a
+ machine with > 128 MB of memory, the following kernel command line will tell
+ the kernel to use only the first 128 MB of memory, and place ECC-protected
+ ramoops region at 128 MB boundary::
+
+	mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
+
+ B. Use Device Tree bindings, as described in
+ ``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
+ For example::
+
+	reserved-memory {
+		#address-cells = <2>;
+		#size-cells = <2>;
+		ranges;
+
+		ramoops@8f000000 {
+			compatible = "ramoops";
+			reg = <0 0x8f000000 0 0x100000>;
+			record-size = <0x4000>;
+			console-size = <0x4000>;
+		};
+	};
+
+ C. Use a platform device and set the platform data. The parameters can then
+ be set through that platform data. An example of doing that is:
+
+ .. code-block:: c
+
+  #include <linux/pstore_ram.h>
+  [...]
+
+  static struct ramoops_platform_data ramoops_data = {
+        .mem_size               = <...>,
+        .mem_address            = <...>,
+        .mem_type               = <...>,
+        .record_size            = <...>,
+        .dump_oops              = <...>,
+        .ecc                    = <...>,
+  };
+
+  static struct platform_device ramoops_dev = {
+        .name = "ramoops",
+        .dev = {
+                .platform_data = &ramoops_data,
+        },
+  };
+
+  [... inside a function ...]
+  int ret;
+
+  ret = platform_device_register(&ramoops_dev);
+  if (ret) {
+	printk(KERN_ERR "unable to register platform device\n");
+	return ret;
+  }
+
+You can specify either RAM memory or peripheral devices' memory. However, when
+specifying RAM, be sure to reserve the memory by issuing memblock_reserve()
+very early in the architecture code, e.g.::
+
+	#include <linux/memblock.h>
+
+	memblock_reserve(ramoops_data.mem_address, ramoops_data.mem_size);
+
+Dump format
+-----------
+
+The data dump begins with a header, currently defined as ``====`` followed by a
+timestamp and a new line. The dump then continues with the actual data.
+
+Reading the data
+----------------
+
+The dump data can be read from the pstore filesystem. The format for these
+files is ``dmesg-ramoops-N``, where N is the record number in memory. To delete
+a stored record from RAM, simply unlink the respective pstore file.
+
+Persistent function tracing
+---------------------------
+
+Persistent function tracing might be useful for debugging software or hardware
+related hangs. The functions call chain log is stored in a ``ftrace-ramoops``
+file. Here is an example of usage::
+
+ # mount -t debugfs debugfs /sys/kernel/debug/
+ # echo 1 > /sys/kernel/debug/pstore/record_ftrace
+ # reboot -f
+ [...]
+ # mount -t pstore pstore /mnt/
+ # tail /mnt/ftrace-ramoops
+ 0 ffffffff8101ea64  ffffffff8101bcda  native_apic_mem_read <- disconnect_bsp_APIC+0x6a/0xc0
+ 0 ffffffff8101ea44  ffffffff8101bcf6  native_apic_mem_write <- disconnect_bsp_APIC+0x86/0xc0
+ 0 ffffffff81020084  ffffffff8101a4b5  hpet_disable <- native_machine_shutdown+0x75/0x90
+ 0 ffffffff81005f94  ffffffff8101a4bb  iommu_shutdown_noop <- native_machine_shutdown+0x7b/0x90
+ 0 ffffffff8101a6a1  ffffffff8101a437  native_machine_emergency_restart <- native_machine_restart+0x37/0x40
+ 0 ffffffff811f9876  ffffffff8101a73a  acpi_reboot <- native_machine_emergency_restart+0xaa/0x1e0
+ 0 ffffffff8101a514  ffffffff8101a772  mach_reboot_fixups <- native_machine_emergency_restart+0xe2/0x1e0
+ 0 ffffffff811d9c54  ffffffff8101a7a0  __const_udelay <- native_machine_emergency_restart+0x110/0x1e0
+ 0 ffffffff811d9c34  ffffffff811d9c80  __delay <- __const_udelay+0x30/0x40
+ 0 ffffffff811d9d14  ffffffff811d9c3f  delay_tsc <- __delay+0xf/0x20
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst
new file mode 100644
index 000000000..197896718
--- /dev/null
+++ b/Documentation/admin-guide/ras.rst
@@ -0,0 +1,1210 @@
+.. include:: <isonum.txt>
+
+============================================
+Reliability, Availability and Serviceability
+============================================
+
+RAS concepts
+************
+
+Reliability, Availability and Serviceability (RAS) is a concept used on
+servers meant to measure their robustness.
+
+Reliability
+  is the probability that a system will produce correct outputs.
+
+  * Generally measured as Mean Time Between Failures (MTBF)
+  * Enhanced by features that help to avoid, detect and repair hardware faults
+
+Availability
+  is the probability that a system is operational at a given time
+
+  * Generally measured as a percentage of downtime per a period of time
+  * Often uses mechanisms to detect and correct hardware faults in
+    runtime;
+
+Serviceability (or maintainability)
+  is the simplicity and speed with which a system can be repaired or
+  maintained
+
+  * Generally measured on Mean Time Between Repair (MTBR)
+
+Improving RAS
+-------------
+
+In order to reduce systems downtime, a system should be capable of detecting
+hardware errors, and, when possible correcting them in runtime. It should
+also provide mechanisms to detect hardware degradation, in order to warn
+the system administrator to take the action of replacing a component before
+it causes data loss or system downtime.
+
+Among the monitoring measures, the most usual ones include:
+
+* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
+* Memory – add error correction logic (ECC) to detect and correct errors;
+* I/O – add CRC checksums for transferred data;
+* Storage – RAID, journal file systems, checksums,
+  Self-Monitoring, Analysis and Reporting Technology (SMART).
+
+By monitoring the number of occurrences of error detections, it is possible
+to identify if the probability of hardware errors is increasing, and, on such
+case, do a preventive maintenance to replace a degraded component while
+those errors are correctable.
+
+Types of errors
+---------------
+
+Most mechanisms used on modern systems use use technologies like Hamming
+Codes that allow error correction when the number of errors on a bit packet
+is below a threshold. If the number of errors is above, those mechanisms
+can indicate with a high degree of confidence that an error happened, but
+they can't correct.
+
+Also, sometimes an error occur on a component that it is not used. For
+example, a part of the memory that it is not currently allocated.
+
+That defines some categories of errors:
+
+* **Correctable Error (CE)** - the error detection mechanism detected and
+  corrected the error. Such errors are usually not fatal, although some
+  Kernel mechanisms allow the system administrator to consider them as fatal.
+
+* **Uncorrected Error (UE)** - the amount of errors happened above the error
+  correction threshold, and the system was unable to auto-correct.
+
+* **Fatal Error** - when an UE error happens on a critical component of the
+  system (for example, a piece of the Kernel got corrupted by an UE), the
+  only reliable way to avoid data corruption is to hang or reboot the machine.
+
+* **Non-fatal Error** - when an UE error happens on an unused component,
+  like a CPU in power down state or an unused memory bank, the system may
+  still run, eventually replacing the affected hardware by a hot spare,
+  if available.
+
+  Also, when an error happens on a userspace process, it is also possible to
+  kill such process and let userspace restart it.
+
+The mechanism for handling non-fatal errors is usually complex and may
+require the help of some userspace application, in order to apply the
+policy desired by the system administrator.
+
+Identifying a bad hardware component
+------------------------------------
+
+Just detecting a hardware flaw is usually not enough, as the system needs
+to pinpoint to the minimal replaceable unit (MRU) that should be exchanged
+to make the hardware reliable again.
+
+So, it requires not only error logging facilities, but also mechanisms that
+will translate the error message to the silkscreen or component label for
+the MRU.
+
+Typically, it is very complex for memory, as modern CPUs interlace memory
+from different memory modules, in order to provide a better performance. The
+DMI BIOS usually have a list of memory module labels, with can be obtained
+using the ``dmidecode`` tool. For example, on a desktop machine, it shows::
+
+	Memory Device
+		Total Width: 64 bits
+		Data Width: 64 bits
+		Size: 16384 MB
+		Form Factor: SODIMM
+		Set: None
+		Locator: ChannelA-DIMM0
+		Bank Locator: BANK 0
+		Type: DDR4
+		Type Detail: Synchronous
+		Speed: 2133 MHz
+		Rank: 2
+		Configured Clock Speed: 2133 MHz
+
+On the above example, a DDR4 SO-DIMM memory module is located at the
+system's memory labeled as "BANK 0", as given by the *bank locator* field.
+Please notice that, on such system, the *total width* is equal to the
+*data width*. It means that such memory module doesn't have error
+detection/correction mechanisms.
+
+Unfortunately, not all systems use the same field to specify the memory
+bank. On this example, from an older server, ``dmidecode`` shows::
+
+	Memory Device
+		Array Handle: 0x1000
+		Error Information Handle: Not Provided
+		Total Width: 72 bits
+		Data Width: 64 bits
+		Size: 8192 MB
+		Form Factor: DIMM
+		Set: 1
+		Locator: DIMM_A1
+		Bank Locator: Not Specified
+		Type: DDR3
+		Type Detail: Synchronous Registered (Buffered)
+		Speed: 1600 MHz
+		Rank: 2
+		Configured Clock Speed: 1600 MHz
+
+There, the DDR3 RDIMM memory module is located at the system's memory labeled
+as "DIMM_A1", as given by the *locator* field. Please notice that this
+memory module has 64 bits of *data width* and 72 bits of *total width*. So,
+it has 8 extra bits to be used by error detection and correction mechanisms.
+Such kind of memory is called Error-correcting code memory (ECC memory).
+
+To make things even worse, it is not uncommon that systems with different
+labels on their system's board to use exactly the same BIOS, meaning that
+the labels provided by the BIOS won't match the real ones.
+
+ECC memory
+----------
+
+As mentioned on the previous section, ECC memory has extra bits to be
+used for error correction. So, on 64 bit systems, a memory module
+has 64 bits of *data width*, and 74 bits of *total width*. So, there are
+8 bits extra bits to be used for the error detection and correction
+mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.
+
+So, when the cpu requests the memory controller to write a word with
+*data width*, the memory controller calculates the *syndrome* in real time,
+using Hamming code, or some other error correction code, like SECDED+,
+producing a code with *total width* size. Such code is then written
+on the memory modules.
+
+At read, the *total width* bits code is converted back, using the same
+ECC code used on write, producing a word with *data width* and a *syndrome*.
+The word with *data width* is sent to the CPU, even when errors happen.
+
+The memory controller also looks at the *syndrome* in order to check if
+there was an error, and if the ECC code was able to fix such error.
+If the error was corrected, a Corrected Error (CE) happened. If not, an
+Uncorrected Error (UE) happened.
+
+The information about the CE/UE errors is stored on some special registers
+at the memory controller and can be accessed by reading such registers,
+either by BIOS, by some special CPUs or by Linux EDAC driver. On x86 64
+bit CPUs, such errors can also be retrieved via the Machine Check
+Architecture (MCA)\ [#f3]_.
+
+.. [#f1] Please notice that several memory controllers allow operation on a
+  mode called "Lock-Step", where it groups two memory modules together,
+  doing 128-bit reads/writes. That gives 16 bits for error correction, with
+  significantly improves the error correction mechanism, at the expense
+  that, when an error happens, there's no way to know what memory module is
+  to blame. So, it has to blame both memory modules.
+
+.. [#f2] Some memory controllers also allow using memory in mirror mode.
+  On such mode, the same data is written to two memory modules. At read,
+  the system checks both memory modules, in order to check if both provide
+  identical data. On such configuration, when an error happens, there's no
+  way to know what memory module is to blame. So, it has to blame both
+  memory modules (or 4 memory modules, if the system is also on Lock-step
+  mode).
+
+.. [#f3] For more details about the Machine Check Architecture (MCA),
+  please read Documentation/x86/x86_64/machinecheck at the Kernel tree.
+
+EDAC - Error Detection And Correction
+*************************************
+
+.. note::
+
+   "bluesmoke" was the name for this device driver subsystem when it
+   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
+   That site is mostly archaic now and can be used only for historical
+   purposes.
+
+   When the subsystem was pushed upstream for the first time, on
+   Kernel 2.6.16, for the first time, it was renamed to ``EDAC``.
+
+Purpose
+-------
+
+The ``edac`` kernel module's goal is to detect and report hardware errors
+that occur within the computer system running under linux.
+
+Memory
+------
+
+Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
+primary errors being harvested. These types of errors are harvested by
+the ``edac_mc`` device.
+
+Detecting CE events, then harvesting those events and reporting them,
+**can** but must not necessarily be a predictor of future UE events. With
+CE events only, the system can and will continue to operate as no data
+has been damaged yet.
+
+However, preventive maintenance and proactive part replacement of memory
+modules exhibiting CEs can reduce the likelihood of the dreaded UE events
+and system panics.
+
+Other hardware elements
+-----------------------
+
+A new feature for EDAC, the ``edac_device`` class of device, was added in
+the 2.6.23 version of the kernel.
+
+This new device type allows for non-memory type of ECC hardware detectors
+to have their states harvested and presented to userspace via the sysfs
+interface.
+
+Some architectures have ECC detectors for L1, L2 and L3 caches,
+along with DMA engines, fabric switches, main data path switches,
+interconnections, and various other hardware data paths. If the hardware
+reports it, then a edac_device device probably can be constructed to
+harvest and present that to userspace.
+
+
+PCI bus scanning
+----------------
+
+In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
+in order to determine if errors are occurring during data transfers.
+
+The presence of PCI Parity errors must be examined with a grain of salt.
+There are several add-in adapters that do **not** follow the PCI specification
+with regards to Parity generation and reporting. The specification says
+the vendor should tie the parity status bits to 0 if they do not intend
+to generate parity.  Some vendors do not do this, and thus the parity bit
+can "float" giving false positives.
+
+There is a PCI device attribute located in sysfs that is checked by
+the EDAC PCI scanning code. If that attribute is set, PCI parity/error
+scanning is skipped for that device. The attribute is::
+
+	broken_parity_status
+
+and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
+PCI devices.
+
+
+Versioning
+----------
+
+EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
+Controller (MC) driver modules. On a given system, the CORE is loaded
+and one MC driver will be loaded. Both the CORE and the MC driver (or
+``edac_device`` driver) have individual versions that reflect current
+release level of their respective modules.
+
+Thus, to "report" on what version a system is running, one must report
+both the CORE's and the MC driver's versions.
+
+
+Loading
+-------
+
+If ``edac`` was statically linked with the kernel then no loading
+is necessary. If ``edac`` was built as modules then simply modprobe
+the ``edac`` pieces that you need. You should be able to modprobe
+hardware-specific modules and have the dependencies load the necessary
+core modules.
+
+Example::
+
+	$ modprobe amd76x_edac
+
+loads both the ``amd76x_edac.ko`` memory controller module and the
+``edac_mc.ko`` core module.
+
+
+Sysfs interface
+---------------
+
+EDAC presents a ``sysfs`` interface for control and reporting purposes. It
+lives in the /sys/devices/system/edac directory.
+
+Within this directory there currently reside 2 components:
+
+	======= ==============================
+	mc	memory controller(s) system
+	pci	PCI control and status system
+	======= ==============================
+
+
+
+Memory Controller (mc) Model
+----------------------------
+
+Each ``mc`` device controls a set of memory modules [#f4]_. These modules
+are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
+There can be multiple csrows and multiple channels.
+
+.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
+  used to refer to a memory module, although there are other memory
+  packaging alternatives, like SO-DIMM, SIMM, etc. Along this document,
+  and inside the EDAC system, the term "dimm" is used for all memory
+  modules, even when they use a different kind of packaging.
+
+Memory controllers allow for several csrows, with 8 csrows being a
+typical value. Yet, the actual number of csrows depends on the layout of
+a given motherboard, memory controller and memory module characteristics.
+
+Dual channels allow for dual data length (e. g. 128 bits, on 64 bit systems)
+data transfers to/from the CPU from/to memory. Some newer chipsets allow
+for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
+controllers. The following example will assume 2 channels:
+
+	+------------+-----------------------+
+	| CS Rows    |       Channels        |
+	+------------+-----------+-----------+
+	|            |  ``ch0``  |  ``ch1``  |
+	+============+===========+===========+
+	| ``csrow0`` |  DIMM_A0  |  DIMM_B0  |
+	+------------+           |           |
+	| ``csrow1`` |           |           |
+	+------------+-----------+-----------+
+	| ``csrow2`` |  DIMM_A1  | DIMM_B1   |
+	+------------+           |           |
+	| ``csrow3`` |           |           |
+	+------------+-----------+-----------+
+
+In the above example, there are 4 physical slots on the motherboard
+for memory DIMMs:
+
+	+---------+---------+
+	| DIMM_A0 | DIMM_B0 |
+	+---------+---------+
+	| DIMM_A1 | DIMM_B1 |
+	+---------+---------+
+
+Labels for these slots are usually silk-screened on the motherboard.
+Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
+channel 1. Notice that there are two csrows possible on a physical DIMM.
+These csrows are allocated their csrow assignment based on the slot into
+which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
+Channel, the csrows cross both DIMMs.
+
+Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
+Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
+will have just one csrow (csrow0). csrow1 will be empty. On the other
+hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0
+and csrow1 will be populated. The pattern repeats itself for csrow2 and
+csrow3.
+
+The representation of the above is reflected in the directory
+tree in EDAC's sysfs interface. Starting in directory
+``/sys/devices/system/edac/mc``, each memory controller will be
+represented by its own ``mcX`` directory, where ``X`` is the
+index of the MC::
+
+	..../edac/mc/
+		   |
+		   |->mc0
+		   |->mc1
+		   |->mc2
+		   ....
+
+Under each ``mcX`` directory each ``csrowX`` is again represented by a
+``csrowX``, where ``X`` is the csrow index::
+
+	.../mc/mc0/
+		|
+		|->csrow0
+		|->csrow2
+		|->csrow3
+		....
+
+Notice that there is no csrow1, which indicates that csrow0 is composed
+of a single ranked DIMMs. This should also apply in both Channels, in
+order to have dual-channel mode be operational. Since both csrow2 and
+csrow3 are populated, this indicates a dual ranked set of DIMMs for
+channels 0 and 1.
+
+Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
+control and attribute files.
+
+``mcX`` directories
+-------------------
+
+In ``mcX`` directories are EDAC control and attribute files for
+this ``X`` instance of the memory controllers.
+
+For a description of the sysfs API, please see:
+
+	Documentation/ABI/testing/sysfs-devices-edac
+
+
+``dimmX`` or ``rankX`` directories
+----------------------------------
+
+The recommended way to use the EDAC subsystem is to look at the information
+provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
+
+A typical EDAC system has the following structure under
+``/sys/devices/system/edac/``\ [#f6]_::
+
+	/sys/devices/system/edac/
+	├── mc
+	│   ├── mc0
+	│   │   ├── ce_count
+	│   │   ├── ce_noinfo_count
+	│   │   ├── dimm0
+	│   │   │   ├── dimm_ce_count
+	│   │   │   ├── dimm_dev_type
+	│   │   │   ├── dimm_edac_mode
+	│   │   │   ├── dimm_label
+	│   │   │   ├── dimm_location
+	│   │   │   ├── dimm_mem_type
+	│   │   │   ├── dimm_ue_count
+	│   │   │   ├── size
+	│   │   │   └── uevent
+	│   │   ├── max_location
+	│   │   ├── mc_name
+	│   │   ├── reset_counters
+	│   │   ├── seconds_since_reset
+	│   │   ├── size_mb
+	│   │   ├── ue_count
+	│   │   ├── ue_noinfo_count
+	│   │   └── uevent
+	│   ├── mc1
+	│   │   ├── ce_count
+	│   │   ├── ce_noinfo_count
+	│   │   ├── dimm0
+	│   │   │   ├── dimm_ce_count
+	│   │   │   ├── dimm_dev_type
+	│   │   │   ├── dimm_edac_mode
+	│   │   │   ├── dimm_label
+	│   │   │   ├── dimm_location
+	│   │   │   ├── dimm_mem_type
+	│   │   │   ├── dimm_ue_count
+	│   │   │   ├── size
+	│   │   │   └── uevent
+	│   │   ├── max_location
+	│   │   ├── mc_name
+	│   │   ├── reset_counters
+	│   │   ├── seconds_since_reset
+	│   │   ├── size_mb
+	│   │   ├── ue_count
+	│   │   ├── ue_noinfo_count
+	│   │   └── uevent
+	│   └── uevent
+	└── uevent
+
+In the ``dimmX`` directories are EDAC control and attribute files for
+this ``X`` memory module:
+
+- ``size`` - Total memory managed by this csrow attribute file
+
+	This attribute file displays, in count of megabytes, the memory
+	that this csrow contains.
+
+- ``dimm_ue_count`` - Uncorrectable Errors count attribute file
+
+	This attribute file displays the total count of uncorrectable
+	errors that have occurred on this DIMM. If panic_on_ue is set
+	this counter will not have a chance to increment, since EDAC
+	will panic the system.
+
+- ``dimm_ce_count`` - Correctable Errors count attribute file
+
+	This attribute file displays the total count of correctable
+	errors that have occurred on this DIMM. This count is very
+	important to examine. CEs provide early indications that a
+	DIMM is beginning to fail. This count field should be
+	monitored for non-zero values and report such information
+	to the system administrator.
+
+- ``dimm_dev_type``  - Device type attribute file
+
+	This attribute file will display what type of DRAM device is
+	being utilized on this DIMM.
+	Examples:
+
+		- x1
+		- x2
+		- x4
+		- x8
+
+- ``dimm_edac_mode`` - EDAC Mode of operation attribute file
+
+	This attribute file will display what type of Error detection
+	and correction is being utilized.
+
+- ``dimm_label`` - memory module label control file
+
+	This control file allows this DIMM to have a label assigned
+	to it. With this label in the module, when errors occur
+	the output can provide the DIMM label in the system log.
+	This becomes vital for panic events to isolate the
+	cause of the UE event.
+
+	DIMM Labels must be assigned after booting, with information
+	that correctly identifies the physical slot with its
+	silk screen label. This information is currently very
+	motherboard specific and determination of this information
+	must occur in userland at this time.
+
+- ``dimm_location`` - location of the memory module
+
+	The location can have up to 3 levels, and describe how the
+	memory controller identifies the location of a memory module.
+	Depending on the type of memory and memory controller, it
+	can be:
+
+		- *csrow* and *channel* - used when the memory controller
+		  doesn't identify a single DIMM - e. g. in ``rankX`` dir;
+		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
+		  controllers;
+		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
+
+- ``dimm_mem_type`` - Memory Type attribute file
+
+	This attribute file will display what type of memory is currently
+	on this csrow. Normally, either buffered or unbuffered memory.
+	Examples:
+
+		- Registered-DDR
+		- Unbuffered-DDR
+
+.. [#f5] On some systems, the memory controller doesn't have any logic
+  to identify the memory module. On such systems, the directory is called ``rankX`` and works on a similar way as the ``csrowX`` directories.
+  On modern Intel memory controllers, the memory controller identifies the
+  memory modules directly. On such systems, the directory is called ``dimmX``.
+
+.. [#f6] There are also some ``power`` directories and ``subsystem``
+  symlinks inside the sysfs mapping that are automatically created by
+  the sysfs subsystem. Currently, they serve no purpose.
+
+``csrowX`` directories
+----------------------
+
+When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
+directories. As this API doesn't work properly for Rambus, FB-DIMMs and
+modern Intel Memory Controllers, this is being deprecated in favor of
+``dimmX`` directories.
+
+In the ``csrowX`` directories are EDAC control and attribute files for
+this ``X`` instance of csrow:
+
+
+- ``ue_count`` - Total Uncorrectable Errors count attribute file
+
+	This attribute file displays the total count of uncorrectable
+	errors that have occurred on this csrow. If panic_on_ue is set
+	this counter will not have a chance to increment, since EDAC
+	will panic the system.
+
+
+- ``ce_count`` - Total Correctable Errors count attribute file
+
+	This attribute file displays the total count of correctable
+	errors that have occurred on this csrow. This count is very
+	important to examine. CEs provide early indications that a
+	DIMM is beginning to fail. This count field should be
+	monitored for non-zero values and report such information
+	to the system administrator.
+
+
+- ``size_mb`` - Total memory managed by this csrow attribute file
+
+	This attribute file displays, in count of megabytes, the memory
+	that this csrow contains.
+
+
+- ``mem_type`` - Memory Type attribute file
+
+	This attribute file will display what type of memory is currently
+	on this csrow. Normally, either buffered or unbuffered memory.
+	Examples:
+
+		- Registered-DDR
+		- Unbuffered-DDR
+
+
+- ``edac_mode`` - EDAC Mode of operation attribute file
+
+	This attribute file will display what type of Error detection
+	and correction is being utilized.
+
+
+- ``dev_type`` - Device type attribute file
+
+	This attribute file will display what type of DRAM device is
+	being utilized on this DIMM.
+	Examples:
+
+		- x1
+		- x2
+		- x4
+		- x8
+
+
+- ``ch0_ce_count`` - Channel 0 CE Count attribute file
+
+	This attribute file will display the count of CEs on this
+	DIMM located in channel 0.
+
+
+- ``ch0_ue_count`` - Channel 0 UE Count attribute file
+
+	This attribute file will display the count of UEs on this
+	DIMM located in channel 0.
+
+
+- ``ch0_dimm_label`` - Channel 0 DIMM Label control file
+
+
+	This control file allows this DIMM to have a label assigned
+	to it. With this label in the module, when errors occur
+	the output can provide the DIMM label in the system log.
+	This becomes vital for panic events to isolate the
+	cause of the UE event.
+
+	DIMM Labels must be assigned after booting, with information
+	that correctly identifies the physical slot with its
+	silk screen label. This information is currently very
+	motherboard specific and determination of this information
+	must occur in userland at this time.
+
+
+- ``ch1_ce_count`` - Channel 1 CE Count attribute file
+
+
+	This attribute file will display the count of CEs on this
+	DIMM located in channel 1.
+
+
+- ``ch1_ue_count`` - Channel 1 UE Count attribute file
+
+
+	This attribute file will display the count of UEs on this
+	DIMM located in channel 0.
+
+
+- ``ch1_dimm_label`` - Channel 1 DIMM Label control file
+
+	This control file allows this DIMM to have a label assigned
+	to it. With this label in the module, when errors occur
+	the output can provide the DIMM label in the system log.
+	This becomes vital for panic events to isolate the
+	cause of the UE event.
+
+	DIMM Labels must be assigned after booting, with information
+	that correctly identifies the physical slot with its
+	silk screen label. This information is currently very
+	motherboard specific and determination of this information
+	must occur in userland at this time.
+
+
+System Logging
+--------------
+
+If logging for UEs and CEs is enabled, then system logs will contain
+information indicating that errors have been detected::
+
+  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
+  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
+
+
+The structure of the message is:
+
+	+---------------------------------------+-------------+
+	| Content                               | Example     |
+	+=======================================+=============+
+	| The memory controller                 | MC0         |
+	+---------------------------------------+-------------+
+	| Error type                            | CE          |
+	+---------------------------------------+-------------+
+	| Memory page                           | 0x283       |
+	+---------------------------------------+-------------+
+	| Offset in the page                    | 0xce0       |
+	+---------------------------------------+-------------+
+	| The byte granularity                  | grain 8     |
+	| or resolution of the error            |             |
+	+---------------------------------------+-------------+
+	| The error syndrome                    | 0xb741      |
+	+---------------------------------------+-------------+
+	| Memory row                            | row 0       |
+	+---------------------------------------+-------------+
+	| Memory channel                        | channel 1   |
+	+---------------------------------------+-------------+
+	| DIMM label, if set prior              | DIMM B1     |
+	+---------------------------------------+-------------+
+	| And then an optional, driver-specific |             |
+	| message that may have additional      |             |
+	| information.                          |             |
+	+---------------------------------------+-------------+
+
+Both UEs and CEs with no info will lack all but memory controller, error
+type, a notice of "no info" and then an optional, driver-specific error
+message.
+
+
+PCI Bus Parity Detection
+------------------------
+
+On Header Type 00 devices, the primary status is looked at for any
+parity error regardless of whether parity is enabled on the device or
+not. (The spec indicates parity is generated in some cases). On Header
+Type 01 bridges, the secondary status register is also looked at to see
+if parity occurred on the bus on the other side of the bridge.
+
+
+Sysfs configuration
+-------------------
+
+Under ``/sys/devices/system/edac/pci`` are control and attribute files as
+follows:
+
+
+- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
+
+	This control file enables or disables the PCI Bus Parity scanning
+	operation. Writing a 1 to this file enables the scanning. Writing
+	a 0 to this file disables the scanning.
+
+	Enable::
+
+		echo "1" >/sys/devices/system/edac/pci/check_pci_parity
+
+	Disable::
+
+		echo "0" >/sys/devices/system/edac/pci/check_pci_parity
+
+
+- ``pci_parity_count`` - Parity Count
+
+	This attribute file will display the number of parity errors that
+	have been detected.
+
+
+Module parameters
+-----------------
+
+- ``edac_mc_panic_on_ue`` - Panic on UE control file
+
+	An uncorrectable error will cause a machine panic.  This is usually
+	desirable.  It is a bad idea to continue when an uncorrectable error
+	occurs - it is indeterminate what was uncorrected and the operating
+	system context might be so mangled that continuing will lead to further
+	corruption. If the kernel has MCE configured, then EDAC will never
+	notice the UE.
+
+	LOAD TIME::
+
+		module/kernel parameter: edac_mc_panic_on_ue=[0|1]
+
+	RUN TIME::
+
+		echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
+
+
+- ``edac_mc_log_ue`` - Log UE control file
+
+
+	Generate kernel messages describing uncorrectable errors.  These errors
+	are reported through the system message log system.  UE statistics
+	will be accumulated even when UE logging is disabled.
+
+	LOAD TIME::
+
+		module/kernel parameter: edac_mc_log_ue=[0|1]
+
+	RUN TIME::
+
+		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
+
+
+- ``edac_mc_log_ce`` - Log CE control file
+
+
+	Generate kernel messages describing correctable errors.  These
+	errors are reported through the system message log system.
+	CE statistics will be accumulated even when CE logging is disabled.
+
+	LOAD TIME::
+
+		module/kernel parameter: edac_mc_log_ce=[0|1]
+
+	RUN TIME::
+
+		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
+
+
+- ``edac_mc_poll_msec`` - Polling period control file
+
+
+	The time period, in milliseconds, for polling for error information.
+	Too small a value wastes resources.  Too large a value might delay
+	necessary handling of errors and might loose valuable information for
+	locating the error.  1000 milliseconds (once each second) is the current
+	default. Systems which require all the bandwidth they can get, may
+	increase this.
+
+	LOAD TIME::
+
+		module/kernel parameter: edac_mc_poll_msec=[0|1]
+
+	RUN TIME::
+
+		echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
+
+
+- ``panic_on_pci_parity`` - Panic on PCI PARITY Error
+
+
+	This control file enables or disables panicking when a parity
+	error has been detected.
+
+
+	module/kernel parameter::
+
+			edac_panic_on_pci_pe=[0|1]
+
+	Enable::
+
+		echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
+
+	Disable::
+
+		echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
+
+
+
+EDAC device type
+----------------
+
+In the header file, edac_pci.h, there is a series of edac_device structures
+and APIs for the EDAC_DEVICE.
+
+User space access to an edac_device is through the sysfs interface.
+
+At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
+will appear.
+
+There is a three level tree beneath the above ``edac`` directory. For example,
+the ``test_device_edac`` device (found at the http://bluesmoke.sourceforget.net
+website) installs itself as::
+
+	/sys/devices/system/edac/test-instance
+
+in this directory are various controls, a symlink and one or more ``instance``
+directories.
+
+The standard default controls are:
+
+	==============	=======================================================
+	log_ce		boolean to log CE events
+	log_ue		boolean to log UE events
+	panic_on_ue	boolean to ``panic`` the system if an UE is encountered
+			(default off, can be set true via startup script)
+	poll_msec	time period between POLL cycles for events
+	==============	=======================================================
+
+The test_device_edac device adds at least one of its own custom control:
+
+	==============	==================================================
+	test_bits	which in the current test driver does nothing but
+			show how it is installed. A ported driver can
+			add one or more such controls and/or attributes
+			for specific uses.
+			One out-of-tree driver uses controls here to allow
+			for ERROR INJECTION operations to hardware
+			injection registers
+	==============	==================================================
+
+The symlink points to the 'struct dev' that is registered for this edac_device.
+
+Instances
+---------
+
+One or more instance directories are present. For the ``test_device_edac``
+case:
+
+	+----------------+
+	| test-instance0 |
+	+----------------+
+
+
+In this directory there are two default counter attributes, which are totals of
+counter in deeper subdirectories.
+
+	==============	====================================
+	ce_count	total of CE events of subdirectories
+	ue_count	total of UE events of subdirectories
+	==============	====================================
+
+Blocks
+------
+
+At the lowest directory level is the ``block`` directory. There can be 0, 1
+or more blocks specified in each instance:
+
+	+-------------+
+	| test-block0 |
+	+-------------+
+
+In this directory the default attributes are:
+
+	==============	================================================
+	ce_count	which is counter of CE events for this ``block``
+			of hardware being monitored
+	ue_count	which is counter of UE events for this ``block``
+			of hardware being monitored
+	==============	================================================
+
+
+The ``test_device_edac`` device adds 4 attributes and 1 control:
+
+	================== ====================================================
+	test-block-bits-0	for every POLL cycle this counter
+				is incremented
+	test-block-bits-1	every 10 cycles, this counter is bumped once,
+				and test-block-bits-0 is set to 0
+	test-block-bits-2	every 100 cycles, this counter is bumped once,
+				and test-block-bits-1 is set to 0
+	test-block-bits-3	every 1000 cycles, this counter is bumped once,
+				and test-block-bits-2 is set to 0
+	================== ====================================================
+
+
+	================== ====================================================
+	reset-counters		writing ANY thing to this control will
+				reset all the above counters.
+	================== ====================================================
+
+
+Use of the ``test_device_edac`` driver should enable any others to create their own
+unique drivers for their hardware systems.
+
+The ``test_device_edac`` sample driver is located at the
+http://bluesmoke.sourceforge.net project site for EDAC.
+
+
+Usage of EDAC APIs on Nehalem and newer Intel CPUs
+--------------------------------------------------
+
+On older Intel architectures, the memory controller was part of the North
+Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Sky Lake and
+newer Intel architectures integrated an enhanced version of the memory
+controller (MC) inside the CPUs.
+
+This chapter will cover the differences of the enhanced memory controllers
+found on newer Intel CPUs, such as ``i7core_edac``, ``sb_edac`` and
+``sbx_edac`` drivers.
+
+.. note::
+
+   The Xeon E7 processor families use a separate chip for the memory
+   controller, called Intel Scalable Memory Buffer. This section doesn't
+   apply for such families.
+
+1) There is one Memory Controller per Quick Patch Interconnect
+   (QPI). At the driver, the term "socket" means one QPI. This is
+   associated with a physical CPU socket.
+
+   Each MC have 3 physical read channels, 3 physical write channels and
+   3 logic channels. The driver currently sees it as just 3 channels.
+   Each channel can have up to 3 DIMMs.
+
+   The minimum known unity is DIMMs. There are no information about csrows.
+   As EDAC API maps the minimum unity is csrows, the driver sequentially
+   maps channel/DIMM into different csrows.
+
+   For example, supposing the following layout::
+
+	Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
+	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+	  dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
+        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
+	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+	Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
+	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+
+   The driver will map it as::
+
+	csrow0: channel 0, dimm0
+	csrow1: channel 0, dimm1
+	csrow2: channel 1, dimm0
+	csrow3: channel 2, dimm0
+
+   exports one DIMM per csrow.
+
+   Each QPI is exported as a different memory controller.
+
+2) The MC has the ability to inject errors to test drivers. The drivers
+   implement this functionality via some error injection nodes:
+
+   For injecting a memory error, there are some sysfs nodes, under
+   ``/sys/devices/system/edac/mc/mc?/``:
+
+   - ``inject_addrmatch/*``:
+      Controls the error injection mask register. It is possible to specify
+      several characteristics of the address to match an error code::
+
+         dimm = the affected dimm. Numbers are relative to a channel;
+         rank = the memory rank;
+         channel = the channel that will generate an error;
+         bank = the affected bank;
+         page = the page address;
+         column (or col) = the address column.
+
+      each of the above values can be set to "any" to match any valid value.
+
+      At driver init, all values are set to any.
+
+      For example, to generate an error at rank 1 of dimm 2, for any channel,
+      any bank, any page, any column::
+
+		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
+		echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
+
+	To return to the default behaviour of matching any, you can do::
+
+		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
+		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
+
+   - ``inject_eccmask``:
+          specifies what bits will have troubles,
+
+   - ``inject_section``:
+       specifies what ECC cache section will get the error::
+
+		3 for both
+		2 for the highest
+		1 for the lowest
+
+   - ``inject_type``:
+       specifies the type of error, being a combination of the following bits::
+
+		bit 0 - repeat
+		bit 1 - ecc
+		bit 2 - parity
+
+   - ``inject_enable``:
+       starts the error generation when something different than 0 is written.
+
+   All inject vars can be read. root permission is needed for write.
+
+   Datasheet states that the error will only be generated after a write on an
+   address that matches inject_addrmatch. It seems, however, that reading will
+   also produce an error.
+
+   For example, the following code will generate an error for any write access
+   at socket 0, on any DIMM/address on channel 2::
+
+	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
+	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
+	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
+	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
+	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
+	dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
+
+   For socket 1, it is needed to replace "mc0" by "mc1" at the above
+   commands.
+
+   The generated error message will look like::
+
+	EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
+
+3) Corrected Error memory register counters
+
+   Those newer MCs have some registers to count memory errors. The driver
+   uses those registers to report Corrected Errors on devices with Registered
+   DIMMs.
+
+   However, those counters don't work with Unregistered DIMM. As the chipset
+   offers some counters that also work with UDIMMs (but with a worse level of
+   granularity than the default ones), the driver exposes those registers for
+   UDIMM memories.
+
+   They can be read by looking at the contents of ``all_channel_counts/``::
+
+     $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
+	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
+	0
+	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
+	0
+	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
+	0
+
+   What happens here is that errors on different csrows, but at the same
+   dimm number will increment the same counter.
+   So, in this memory mapping::
+
+	csrow0: channel 0, dimm0
+	csrow1: channel 0, dimm1
+	csrow2: channel 1, dimm0
+	csrow3: channel 2, dimm0
+
+   The hardware will increment udimm0 for an error at the first dimm at either
+   csrow0, csrow2  or csrow3;
+
+   The hardware will increment udimm1 for an error at the second dimm at either
+   csrow0, csrow2  or csrow3;
+
+   The hardware will increment udimm2 for an error at the third dimm at either
+   csrow0, csrow2  or csrow3;
+
+4) Standard error counters
+
+   The standard error counters are generated when an mcelog error is received
+   by the driver. Since, with UDIMM, this is counted by software, it is
+   possible that some errors could be lost. With RDIMM's, they display the
+   contents of the registers
+
+Reference documents used on ``amd64_edac``
+------------------------------------------
+
+``amd64_edac`` module is based on the following documents
+(available from http://support.amd.com/en-us/search/tech-docs):
+
+1. :Title:  BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
+	   Opteron Processors
+   :AMD publication #: 26094
+   :Revision: 3.26
+   :Link: http://support.amd.com/TechDocs/26094.PDF
+
+2. :Title:  BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
+	   Processors
+   :AMD publication #: 32559
+   :Revision: 3.00
+   :Issue Date: May 2006
+   :Link: http://support.amd.com/TechDocs/32559.pdf
+
+3. :Title:  BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
+	   Processors
+   :AMD publication #: 31116
+   :Revision: 3.00
+   :Issue Date: September 07, 2007
+   :Link: http://support.amd.com/TechDocs/31116.pdf
+
+4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
+	  Models 30h-3Fh Processors
+   :AMD publication #: 49125
+   :Revision: 3.06
+   :Issue Date: 2/12/2015 (latest release)
+   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
+
+5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
+	  Models 60h-6Fh Processors
+   :AMD publication #: 50742
+   :Revision: 3.01
+   :Issue Date: 7/23/2015 (latest release)
+   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
+
+6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
+	  Models 00h-0Fh Processors
+   :AMD publication #: 48751
+   :Revision: 3.03
+   :Issue Date: 2/23/2015 (latest release)
+   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf
+
+Credits
+=======
+
+* Written by Doug Thompson <dougthompson@xmission.com>
+
+  - 7 Dec 2005
+  - 17 Jul 2007	Updated
+
+* |copy| Mauro Carvalho Chehab
+
+  - 05 Aug 2009	Nehalem interface
+  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
+
+* EDAC authors/maintainers:
+
+  - Doug Thompson, Dave Jiang, Dave Peterson et al,
+  - Mauro Carvalho Chehab
+  - Borislav Petkov
+  - original author: Thayne Harbaugh
diff --git a/Documentation/admin-guide/reporting-bugs.rst b/Documentation/admin-guide/reporting-bugs.rst
new file mode 100644
index 000000000..4650edb88
--- /dev/null
+++ b/Documentation/admin-guide/reporting-bugs.rst
@@ -0,0 +1,182 @@
+.. _reportingbugs:
+
+Reporting bugs
+++++++++++++++
+
+Background
+==========
+
+The upstream Linux kernel maintainers only fix bugs for specific kernel
+versions.  Those versions include the current "release candidate" (or -rc)
+kernel, any "stable" kernel versions, and any "long term" kernels.
+
+Please see https://www.kernel.org/ for a list of supported kernels.  Any
+kernel marked with [EOL] is "end of life" and will not have any fixes
+backported to it.
+
+If you've found a bug on a kernel version that isn't listed on kernel.org,
+contact your Linux distribution or embedded vendor for support.
+Alternatively, you can attempt to run one of the supported stable or -rc
+kernels, and see if you can reproduce the bug on that.  It's preferable
+to reproduce the bug on the latest -rc kernel.
+
+
+How to report Linux kernel bugs
+===============================
+
+
+Identify the problematic subsystem
+----------------------------------
+
+Identifying which part of the Linux kernel might be causing your issue
+increases your chances of getting your bug fixed. Simply posting to the
+generic linux-kernel mailing list (LKML) may cause your bug report to be
+lost in the noise of a mailing list that gets 1000+ emails a day.
+
+Instead, try to figure out which kernel subsystem is causing the issue,
+and email that subsystem's maintainer and mailing list.  If the subsystem
+maintainer doesn't answer, then expand your scope to mailing lists like
+LKML.
+
+
+Identify who to notify
+----------------------
+
+Once you know the subsystem that is causing the issue, you should send a
+bug report.  Some maintainers prefer bugs to be reported via bugzilla
+(https://bugzilla.kernel.org), while others prefer that bugs be reported
+via the subsystem mailing list.
+
+To find out where to send an emailed bug report, find your subsystem or
+device driver in the MAINTAINERS file.  Search in the file for relevant
+entries, and send your bug report to the person(s) listed in the "M:"
+lines, making sure to Cc the mailing list(s) in the "L:" lines.  When the
+maintainer replies to you, make sure to 'Reply-all' in order to keep the
+public mailing list(s) in the email thread.
+
+If you know which driver is causing issues, you can pass one of the driver
+files to the get_maintainer.pl script::
+
+     perl scripts/get_maintainer.pl -f <filename>
+
+If it is a security bug, please copy the Security Contact listed in the
+MAINTAINERS file.  They can help coordinate bugfix and disclosure.  See
+:ref:`Documentation/admin-guide/security-bugs.rst <securitybugs>` for more information.
+
+If you can't figure out which subsystem caused the issue, you should file
+a bug in kernel.org bugzilla and send email to
+linux-kernel@vger.kernel.org, referencing the bugzilla URL.  (For more
+information on the linux-kernel mailing list see
+http://www.tux.org/lkml/).
+
+
+Tips for reporting bugs
+-----------------------
+
+If you haven't reported a bug before, please read:
+
+	http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
+
+	http://www.catb.org/esr/faqs/smart-questions.html
+
+It's REALLY important to report bugs that seem unrelated as separate email
+threads or separate bugzilla entries.  If you report several unrelated
+bugs at once, it's difficult for maintainers to tease apart the relevant
+data.
+
+
+Gather information
+------------------
+
+The most important information in a bug report is how to reproduce the
+bug.  This includes system information, and (most importantly)
+step-by-step instructions for how a user can trigger the bug.
+
+If the failure includes an "OOPS:", take a picture of the screen, capture
+a netconsole trace, or type the message from your screen into the bug
+report.  Please read "Documentation/admin-guide/bug-hunting.rst" before posting your
+bug report. This explains what you should do with the "Oops" information
+to make it useful to the recipient.
+
+This is a suggested format for a bug report sent via email or bugzilla.
+Having a standardized bug report form makes it easier for you not to
+overlook things, and easier for the developers to find the pieces of
+information they're really interested in.  If some information is not
+relevant to your bug, feel free to exclude it.
+
+First run the ver_linux script included as scripts/ver_linux, which
+reports the version of some important subsystems.  Run this script with
+the command ``awk -f scripts/ver_linux``.
+
+Use that information to fill in all fields of the bug report form, and
+post it to the mailing list with a subject of "PROBLEM: <one line
+summary from [1.]>" for easy identification by the developers::
+
+  [1.] One line summary of the problem:
+  [2.] Full description of the problem/report:
+  [3.] Keywords (i.e., modules, networking, kernel):
+  [4.] Kernel information
+  [4.1.] Kernel version (from /proc/version):
+  [4.2.] Kernel .config file:
+  [5.] Most recent kernel version which did not have the bug:
+  [6.] Output of Oops.. message (if applicable) with symbolic information
+       resolved (see Documentation/admin-guide/bug-hunting.rst)
+  [7.] A small shell script or example program which triggers the
+       problem (if possible)
+  [8.] Environment
+  [8.1.] Software (add the output of the ver_linux script here)
+  [8.2.] Processor information (from /proc/cpuinfo):
+  [8.3.] Module information (from /proc/modules):
+  [8.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
+  [8.5.] PCI information ('lspci -vvv' as root)
+  [8.6.] SCSI information (from /proc/scsi/scsi)
+  [8.7.] Other information that might be relevant to the problem
+         (please look in /proc and include all information that you
+         think to be relevant):
+  [X.] Other notes, patches, fixes, workarounds:
+
+
+Follow up
+=========
+
+Expectations for bug reporters
+------------------------------
+
+Linux kernel maintainers expect bug reporters to be able to follow up on
+bug reports.  That may include running new tests, applying patches,
+recompiling your kernel, and/or re-triggering your bug.  The most
+frustrating thing for maintainers is for someone to report a bug, and then
+never follow up on a request to try out a fix.
+
+That said, it's still useful for a kernel maintainer to know a bug exists
+on a supported kernel, even if you can't follow up with retests.  Follow
+up reports, such as replying to the email thread with "I tried the latest
+kernel and I can't reproduce my bug anymore" are also helpful, because
+maintainers have to assume silence means things are still broken.
+
+Expectations for kernel maintainers
+-----------------------------------
+
+Linux kernel maintainers are busy, overworked human beings.  Some times
+they may not be able to address your bug in a day, a week, or two weeks.
+If they don't answer your email, they may be on vacation, or at a Linux
+conference.  Check the conference schedule at https://LWN.net for more info:
+
+	https://lwn.net/Calendar/
+
+In general, kernel maintainers take 1 to 5 business days to respond to
+bugs.  The majority of kernel maintainers are employed to work on the
+kernel, and they may not work on the weekends.  Maintainers are scattered
+around the world, and they may not work in your time zone.  Unless you
+have a high priority bug, please wait at least a week after the first bug
+report before sending the maintainer a reminder email.
+
+The exceptions to this rule are regressions, kernel crashes, security holes,
+or userspace breakage caused by new kernel behavior.  Those bugs should be
+addressed by the maintainers ASAP.  If you suspect a maintainer is not
+responding to these types of bugs in a timely manner (especially during a
+merge window), escalate the bug to LKML and Linus Torvalds.
+
+Thank you!
+
+[Some of this is taken from Frohwalt Egerer's original linux-kernel FAQ]
diff --git a/Documentation/admin-guide/security-bugs.rst b/Documentation/admin-guide/security-bugs.rst
new file mode 100644
index 000000000..30187d49d
--- /dev/null
+++ b/Documentation/admin-guide/security-bugs.rst
@@ -0,0 +1,89 @@
+.. _securitybugs:
+
+Security bugs
+=============
+
+Linux kernel developers take security very seriously.  As such, we'd
+like to know when a security bug is found so that it can be fixed and
+disclosed as quickly as possible.  Please report security bugs to the
+Linux kernel security team.
+
+Contact
+-------
+
+The Linux kernel security team can be contacted by email at
+<security@kernel.org>.  This is a private list of security officers
+who will help verify the bug report and develop and release a fix.
+If you already have a fix, please include it with your report, as
+that can speed up the process considerably.  It is possible that the
+security team will bring in extra help from area maintainers to
+understand and fix the security vulnerability.
+
+As it is with any bug, the more information provided the easier it
+will be to diagnose and fix.  Please review the procedure outlined in
+admin-guide/reporting-bugs.rst if you are unclear about what
+information is helpful.  Any exploit code is very helpful and will not
+be released without consent from the reporter unless it has already been
+made public.
+
+Disclosure and embargoed information
+------------------------------------
+
+The security list is not a disclosure channel.  For that, see Coordination
+below.
+
+Once a robust fix has been developed, the release process starts.  Fixes
+for publicly known bugs are released immediately.
+
+Although our preference is to release fixes for publicly undisclosed bugs
+as soon as they become available, this may be postponed at the request of
+the reporter or an affected party for up to 7 calendar days from the start
+of the release process, with an exceptional extension to 14 calendar days
+if it is agreed that the criticality of the bug requires more time.  The
+only valid reason for deferring the publication of a fix is to accommodate
+the logistics of QA and large scale rollouts which require release
+coordination.
+
+Whilst embargoed information may be shared with trusted individuals in
+order to develop a fix, such information will not be published alongside
+the fix or on any other disclosure channel without the permission of the
+reporter.  This includes but is not limited to the original bug report
+and followup discussions (if any), exploits, CVE information or the
+identity of the reporter.
+
+In other words our only interest is in getting bugs fixed.  All other
+information submitted to the security list and any followup discussions
+of the report are treated confidentially even after the embargo has been
+lifted, in perpetuity.
+
+Coordination
+------------
+
+Fixes for sensitive bugs, such as those that might lead to privilege
+escalations, may need to be coordinated with the private
+<linux-distros@vs.openwall.org> mailing list so that distribution vendors
+are well prepared to issue a fixed kernel upon public disclosure of the
+upstream fix. Distros will need some time to test the proposed patch and
+will generally request at least a few days of embargo, and vendor update
+publication prefers to happen Tuesday through Thursday. When appropriate,
+the security team can assist with this coordination, or the reporter can
+include linux-distros from the start. In this case, remember to prefix
+the email Subject line with "[vs]" as described in the linux-distros wiki:
+<http://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists>
+
+CVE assignment
+--------------
+
+The security team does not normally assign CVEs, nor do we require them
+for reports or fixes, as this can needlessly complicate the process and
+may delay the bug handling. If a reporter wishes to have a CVE identifier
+assigned ahead of public disclosure, they will need to contact the private
+linux-distros list, described above. When such a CVE identifier is known
+before a patch is provided, it is desirable to mention it in the commit
+message if the reporter agrees.
+
+Non-disclosure agreements
+-------------------------
+
+The Linux kernel security team is not a formal body and therefore unable
+to enter any non-disclosure agreements.
diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst
new file mode 100644
index 000000000..a8d1e36b6
--- /dev/null
+++ b/Documentation/admin-guide/serial-console.rst
@@ -0,0 +1,115 @@
+.. _serial_console:
+
+Linux Serial Console
+====================
+
+To use a serial port as console you need to compile the support into your
+kernel - by default it is not compiled in. For PC style serial ports
+it's the config option next to menu option:
+
+:menuselection:`Character devices --> Serial drivers --> 8250/16550 and compatible serial support --> Console on 8250/16550 and compatible serial port`
+
+You must compile serial support into the kernel and not as a module.
+
+It is possible to specify multiple devices for console output. You can
+define a new kernel command line option to select which device(s) to
+use for console output.
+
+The format of this option is::
+
+	console=device,options
+
+	device:		tty0 for the foreground virtual console
+			ttyX for any other virtual console
+			ttySx for a serial port
+			lp0 for the first parallel port
+			ttyUSB0 for the first USB serial device
+
+	options:	depend on the driver. For the serial port this
+			defines the baudrate/parity/bits/flow control of
+			the port, in the format BBBBPNF, where BBBB is the
+			speed, P is parity (n/o/e), N is number of bits,
+			and F is flow control ('r' for RTS). Default is
+			9600n8. The maximum baudrate is 115200.
+
+You can specify multiple console= options on the kernel command line.
+Output will appear on all of them. The last device will be used when
+you open ``/dev/console``. So, for example::
+
+	console=ttyS1,9600 console=tty0
+
+defines that opening ``/dev/console`` will get you the current foreground
+virtual console, and kernel messages will appear on both the VGA
+console and the 2nd serial port (ttyS1 or COM2) at 9600 baud.
+
+Note that you can only define one console per device type (serial, video).
+
+If no console device is specified, the first device found capable of
+acting as a system console will be used. At this time, the system
+first looks for a VGA card and then for a serial port. So if you don't
+have a VGA card in your system the first serial port will automatically
+become the console.
+
+You will need to create a new device to use ``/dev/console``. The official
+``/dev/console`` is now character device 5,1.
+
+(You can also use a network device as a console.  See
+``Documentation/networking/netconsole.txt`` for information on that.)
+
+Here's an example that will use ``/dev/ttyS1`` (COM2) as the console.
+Replace the sample values as needed.
+
+1. Create ``/dev/console`` (real console) and ``/dev/tty0`` (master virtual
+   console)::
+
+     cd /dev
+     rm -f console tty0
+     mknod -m 622 console c 5 1
+     mknod -m 622 tty0 c 4 0
+
+2. LILO can also take input from a serial device. This is a very
+   useful option. To tell LILO to use the serial port:
+   In lilo.conf (global section)::
+
+     serial  = 1,9600n8 (ttyS1, 9600 bd, no parity, 8 bits)
+
+3. Adjust to kernel flags for the new kernel,
+   again in lilo.conf (kernel section)::
+
+     append = "console=ttyS1,9600"
+
+4. Make sure a getty runs on the serial port so that you can login to
+   it once the system is done booting. This is done by adding a line
+   like this to ``/etc/inittab`` (exact syntax depends on your getty)::
+
+     S1:23:respawn:/sbin/getty -L ttyS1 9600 vt100
+
+5. Init and ``/etc/ioctl.save``
+
+   Sysvinit remembers its stty settings in a file in ``/etc``, called
+   ``/etc/ioctl.save``. REMOVE THIS FILE before using the serial
+   console for the first time, because otherwise init will probably
+   set the baudrate to 38400 (baudrate of the virtual console).
+
+6. ``/dev/console`` and X
+   Programs that want to do something with the virtual console usually
+   open ``/dev/console``. If you have created the new ``/dev/console`` device,
+   and your console is NOT the virtual console some programs will fail.
+   Those are programs that want to access the VT interface, and use
+   ``/dev/console instead of /dev/tty0``. Some of those programs are::
+
+     Xfree86, svgalib, gpm, SVGATextMode
+
+   It should be fixed in modern versions of these programs though.
+
+   Note that if you boot without a ``console=`` option (or with
+   ``console=/dev/tty0``), ``/dev/console`` is the same as ``/dev/tty0``.
+   In that case everything will still work.
+
+7. Thanks
+
+   Thanks to Geert Uytterhoeven <geert@linux-m68k.org>
+   for porting the patches from 2.1.4x to 2.1.6x for taking care of
+   the integration of these patches into m68k, ppc and alpha.
+
+Miquel van Smoorenburg <miquels@cistron.nl>, 11-Jun-2000
diff --git a/Documentation/admin-guide/sysfs-rules.rst b/Documentation/admin-guide/sysfs-rules.rst
new file mode 100644
index 000000000..abad33526
--- /dev/null
+++ b/Documentation/admin-guide/sysfs-rules.rst
@@ -0,0 +1,192 @@
+Rules on how to access information in sysfs
+===========================================
+
+The kernel-exported sysfs exports internal kernel implementation details
+and depends on internal kernel structures and layout. It is agreed upon
+by the kernel developers that the Linux kernel does not provide a stable
+internal API. Therefore, there are aspects of the sysfs interface that
+may not be stable across kernel releases.
+
+To minimize the risk of breaking users of sysfs, which are in most cases
+low-level userspace applications, with a new kernel release, the users
+of sysfs must follow some rules to use an as-abstract-as-possible way to
+access this filesystem. The current udev and HAL programs already
+implement this and users are encouraged to plug, if possible, into the
+abstractions these programs provide instead of accessing sysfs directly.
+
+But if you really do want or need to access sysfs directly, please follow
+the following rules and then your programs should work with future
+versions of the sysfs interface.
+
+- Do not use libsysfs
+    It makes assumptions about sysfs which are not true. Its API does not
+    offer any abstraction, it exposes all the kernel driver-core
+    implementation details in its own API. Therefore it is not better than
+    reading directories and opening the files yourself.
+    Also, it is not actively maintained, in the sense of reflecting the
+    current kernel development. The goal of providing a stable interface
+    to sysfs has failed; it causes more problems than it solves. It
+    violates many of the rules in this document.
+
+- sysfs is always at ``/sys``
+    Parsing ``/proc/mounts`` is a waste of time. Other mount points are a
+    system configuration bug you should not try to solve. For test cases,
+    possibly support a ``SYSFS_PATH`` environment variable to overwrite the
+    application's behavior, but never try to search for sysfs. Never try
+    to mount it, if you are not an early boot script.
+
+- devices are only "devices"
+    There is no such thing like class-, bus-, physical devices,
+    interfaces, and such that you can rely on in userspace. Everything is
+    just simply a "device". Class-, bus-, physical, ... types are just
+    kernel implementation details which should not be expected by
+    applications that look for devices in sysfs.
+
+    The properties of a device are:
+
+    - devpath (``/devices/pci0000:00/0000:00:1d.1/usb2/2-2/2-2:1.0``)
+
+      - identical to the DEVPATH value in the event sent from the kernel
+        at device creation and removal
+      - the unique key to the device at that point in time
+      - the kernel's path to the device directory without the leading
+        ``/sys``, and always starting with a slash
+      - all elements of a devpath must be real directories. Symlinks
+        pointing to /sys/devices must always be resolved to their real
+        target and the target path must be used to access the device.
+        That way the devpath to the device matches the devpath of the
+        kernel used at event time.
+      - using or exposing symlink values as elements in a devpath string
+        is a bug in the application
+
+    - kernel name (``sda``, ``tty``, ``0000:00:1f.2``, ...)
+
+      - a directory name, identical to the last element of the devpath
+      - applications need to handle spaces and characters like ``!`` in
+        the name
+
+    - subsystem (``block``, ``tty``, ``pci``, ...)
+
+      - simple string, never a path or a link
+      - retrieved by reading the "subsystem"-link and using only the
+        last element of the target path
+
+    - driver (``tg3``, ``ata_piix``, ``uhci_hcd``)
+
+      - a simple string, which may contain spaces, never a path or a
+        link
+      - it is retrieved by reading the "driver"-link and using only the
+        last element of the target path
+      - devices which do not have "driver"-link just do not have a
+        driver; copying the driver value in a child device context is a
+        bug in the application
+
+    - attributes
+
+      - the files in the device directory or files below subdirectories
+        of the same device directory
+      - accessing attributes reached by a symlink pointing to another device,
+        like the "device"-link, is a bug in the application
+
+    Everything else is just a kernel driver-core implementation detail
+    that should not be assumed to be stable across kernel releases.
+
+- Properties of parent devices never belong into a child device.
+    Always look at the parent devices themselves for determining device
+    context properties. If the device ``eth0`` or ``sda`` does not have a
+    "driver"-link, then this device does not have a driver. Its value is empty.
+    Never copy any property of the parent-device into a child-device. Parent
+    device properties may change dynamically without any notice to the
+    child device.
+
+- Hierarchy in a single device tree
+    There is only one valid place in sysfs where hierarchy can be examined
+    and this is below: ``/sys/devices.``
+    It is planned that all device directories will end up in the tree
+    below this directory.
+
+- Classification by subsystem
+    There are currently three places for classification of devices:
+    ``/sys/block,`` ``/sys/class`` and ``/sys/bus.`` It is planned that these will
+    not contain any device directories themselves, but only flat lists of
+    symlinks pointing to the unified ``/sys/devices`` tree.
+    All three places have completely different rules on how to access
+    device information. It is planned to merge all three
+    classification directories into one place at ``/sys/subsystem``,
+    following the layout of the bus directories. All buses and
+    classes, including the converted block subsystem, will show up
+    there.
+    The devices belonging to a subsystem will create a symlink in the
+    "devices" directory at ``/sys/subsystem/<name>/devices``,
+
+    If ``/sys/subsystem`` exists, ``/sys/bus``, ``/sys/class`` and ``/sys/block``
+    can be ignored. If it does not exist, you always have to scan all three
+    places, as the kernel is free to move a subsystem from one place to
+    the other, as long as the devices are still reachable by the same
+    subsystem name.
+
+    Assuming ``/sys/class/<subsystem>`` and ``/sys/bus/<subsystem>``, or
+    ``/sys/block`` and ``/sys/class/block`` are not interchangeable is a bug in
+    the application.
+
+- Block
+    The converted block subsystem at ``/sys/class/block`` or
+    ``/sys/subsystem/block`` will contain the links for disks and partitions
+    at the same level, never in a hierarchy. Assuming the block subsystem to
+    contain only disks and not partition devices in the same flat list is
+    a bug in the application.
+
+- "device"-link and <subsystem>:<kernel name>-links
+    Never depend on the "device"-link. The "device"-link is a workaround
+    for the old layout, where class devices are not created in
+    ``/sys/devices/`` like the bus devices. If the link-resolving of a
+    device directory does not end in ``/sys/devices/``, you can use the
+    "device"-link to find the parent devices in ``/sys/devices/``, That is the
+    single valid use of the "device"-link; it must never appear in any
+    path as an element. Assuming the existence of the "device"-link for
+    a device in ``/sys/devices/`` is a bug in the application.
+    Accessing ``/sys/class/net/eth0/device`` is a bug in the application.
+
+    Never depend on the class-specific links back to the ``/sys/class``
+    directory.  These links are also a workaround for the design mistake
+    that class devices are not created in ``/sys/devices.`` If a device
+    directory does not contain directories for child devices, these links
+    may be used to find the child devices in ``/sys/class.`` That is the single
+    valid use of these links; they must never appear in any path as an
+    element. Assuming the existence of these links for devices which are
+    real child device directories in the ``/sys/devices`` tree is a bug in
+    the application.
+
+    It is planned to remove all these links when all class device
+    directories live in ``/sys/devices.``
+
+- Position of devices along device chain can change.
+    Never depend on a specific parent device position in the devpath,
+    or the chain of parent devices. The kernel is free to insert devices into
+    the chain. You must always request the parent device you are looking for
+    by its subsystem value. You need to walk up the chain until you find
+    the device that matches the expected subsystem. Depending on a specific
+    position of a parent device or exposing relative paths using ``../`` to
+    access the chain of parents is a bug in the application.
+
+- When reading and writing sysfs device attribute files, avoid dependency
+    on specific error codes wherever possible. This minimizes coupling to
+    the error handling implementation within the kernel.
+
+    In general, failures to read or write sysfs device attributes shall
+    propagate errors wherever possible. Common errors include, but are not
+    limited to:
+
+	``-EIO``: The read or store operation is not supported, typically
+	returned by the sysfs system itself if the read or store pointer
+	is ``NULL``.
+
+	``-ENXIO``: The read or store operation failed
+
+    Error codes will not be changed without good reason, and should a change
+    to error codes result in user-space breakage, it will be fixed, or the
+    the offending change will be reverted.
+
+    Userspace applications can, however, expect the format and contents of
+    the attribute files to remain consistent in the absence of a version
+    attribute change in the context of a given attribute.
diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst
new file mode 100644
index 000000000..7b9035c01
--- /dev/null
+++ b/Documentation/admin-guide/sysrq.rst
@@ -0,0 +1,290 @@
+Linux Magic System Request Key Hacks
+====================================
+
+Documentation for sysrq.c
+
+What is the magic SysRq key?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is a 'magical' key combo you can hit which the kernel will respond to
+regardless of whatever else it is doing, unless it is completely locked up.
+
+How do I enable the magic SysRq key?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You need to say "yes" to 'Magic SysRq key (CONFIG_MAGIC_SYSRQ)' when
+configuring the kernel. When running a kernel with SysRq compiled in,
+/proc/sys/kernel/sysrq controls the functions allowed to be invoked via
+the SysRq key. The default value in this file is set by the
+CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE config symbol, which itself defaults
+to 1. Here is the list of possible values in /proc/sys/kernel/sysrq:
+
+   -  0 - disable sysrq completely
+   -  1 - enable all functions of sysrq
+   - >1 - bitmask of allowed sysrq functions (see below for detailed function
+     description)::
+
+          2 =   0x2 - enable control of console logging level
+          4 =   0x4 - enable control of keyboard (SAK, unraw)
+          8 =   0x8 - enable debugging dumps of processes etc.
+         16 =  0x10 - enable sync command
+         32 =  0x20 - enable remount read-only
+         64 =  0x40 - enable signalling of processes (term, kill, oom-kill)
+        128 =  0x80 - allow reboot/poweroff
+        256 = 0x100 - allow nicing of all RT tasks
+
+You can set the value in the file by the following command::
+
+    echo "number" >/proc/sys/kernel/sysrq
+
+The number may be written here either as decimal or as hexadecimal
+with the 0x prefix. CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE must always be
+written in hexadecimal.
+
+Note that the value of ``/proc/sys/kernel/sysrq`` influences only the invocation
+via a keyboard. Invocation of any operation via ``/proc/sysrq-trigger`` is
+always allowed (by a user with admin privileges).
+
+How do I use the magic SysRq key?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On x86   - You press the key combo :kbd:`ALT-SysRq-<command key>`.
+
+.. note::
+	   Some
+           keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is
+           also known as the 'Print Screen' key. Also some keyboards cannot
+	   handle so many keys being pressed at the same time, so you might
+	   have better luck with press :kbd:`Alt`, press :kbd:`SysRq`,
+	   release :kbd:`SysRq`, press :kbd:`<command key>`, release everything.
+
+On SPARC - You press :kbd:`ALT-STOP-<command key>`, I believe.
+
+On the serial console (PC style standard serial ports only)
+        You send a ``BREAK``, then within 5 seconds a command key. Sending
+        ``BREAK`` twice is interpreted as a normal BREAK.
+
+On PowerPC
+	Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:`<command key>`,
+        :kbd:`Print Screen` (or :kbd:`F13`) - :kbd:`<command key>` may suffice.
+
+On other
+	If you know of the key combos for other architectures, please
+        let me know so I can add them to this section.
+
+On all
+	write a character to /proc/sysrq-trigger.  e.g.::
+
+		echo t > /proc/sysrq-trigger
+
+What are the 'command' keys?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+=========== ===================================================================
+Command	    Function
+=========== ===================================================================
+``b``	    Will immediately reboot the system without syncing or unmounting
+            your disks.
+
+``c``	    Will perform a system crash by a NULL pointer dereference.
+            A crashdump will be taken if configured.
+
+``d``	    Shows all locks that are held.
+
+``e``	    Send a SIGTERM to all processes, except for init.
+
+``f``	    Will call the oom killer to kill a memory hog process, but do not
+	    panic if nothing can be killed.
+
+``g``	    Used by kgdb (kernel debugger)
+
+``h``	    Will display help (actually any other key than those listed
+            here will display help. but ``h`` is easy to remember :-)
+
+``i``	    Send a SIGKILL to all processes, except for init.
+
+``j``	    Forcibly "Just thaw it" - filesystems frozen by the FIFREEZE ioctl.
+
+``k``	    Secure Access Key (SAK) Kills all programs on the current virtual
+            console. NOTE: See important comments below in SAK section.
+
+``l``	    Shows a stack backtrace for all active CPUs.
+
+``m``	    Will dump current memory info to your console.
+
+``n``	    Used to make RT tasks nice-able
+
+``o``	    Will shut your system off (if configured and supported).
+
+``p``	    Will dump the current registers and flags to your console.
+
+``q``	    Will dump per CPU lists of all armed hrtimers (but NOT regular
+            timer_list timers) and detailed information about all
+            clockevent devices.
+
+``r``	    Turns off keyboard raw mode and sets it to XLATE.
+
+``s``	    Will attempt to sync all mounted filesystems.
+
+``t``	    Will dump a list of current tasks and their information to your
+            console.
+
+``u``	    Will attempt to remount all mounted filesystems read-only.
+
+``v``	    Forcefully restores framebuffer console
+``v``	    Causes ETM buffer dump [ARM-specific]
+
+``w``	    Dumps tasks that are in uninterruptable (blocked) state.
+
+``x``	    Used by xmon interface on ppc/powerpc platforms.
+            Show global PMU Registers on sparc64.
+            Dump all TLB entries on MIPS.
+
+``y``	    Show global CPU Registers [SPARC-64 specific]
+
+``z``	    Dump the ftrace buffer
+
+``0``-``9`` Sets the console log level, controlling which kernel messages
+            will be printed to your console. (``0``, for example would make
+            it so that only emergency messages like PANICs or OOPSes would
+            make it to your console.)
+=========== ===================================================================
+
+Okay, so what can I use them for?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Well, unraw(r) is very handy when your X server or a svgalib program crashes.
+
+sak(k) (Secure Access Key) is useful when you want to be sure there is no
+trojan program running at console which could grab your password
+when you would try to login. It will kill all programs on given console,
+thus letting you make sure that the login prompt you see is actually
+the one from init, not some trojan program.
+
+.. important::
+
+   In its true form it is not a true SAK like the one in a
+   c2 compliant system, and it should not be mistaken as
+   such.
+
+It seems others find it useful as (System Attention Key) which is
+useful when you want to exit a program that will not let you switch consoles.
+(For example, X or a svgalib program.)
+
+``reboot(b)`` is good when you're unable to shut down. But you should also
+``sync(s)`` and ``umount(u)`` first.
+
+``crash(c)`` can be used to manually trigger a crashdump when the system is hung.
+Note that this just triggers a crash if there is no dump mechanism available.
+
+``sync(s)`` is great when your system is locked up, it allows you to sync your
+disks and will certainly lessen the chance of data loss and fscking. Note
+that the sync hasn't taken place until you see the "OK" and "Done" appear
+on the screen. (If the kernel is really in strife, you may not ever get the
+OK or Done message...)
+
+``umount(u)`` is basically useful in the same ways as ``sync(s)``. I generally
+``sync(s)``, ``umount(u)``, then ``reboot(b)`` when my system locks. It's saved
+me many a fsck. Again, the unmount (remount read-only) hasn't taken place until
+you see the "OK" and "Done" message appear on the screen.
+
+The loglevels ``0``-``9`` are useful when your console is being flooded with
+kernel messages you do not want to see. Selecting ``0`` will prevent all but
+the most urgent kernel messages from reaching your console. (They will
+still be logged if syslogd/klogd are alive, though.)
+
+``term(e)`` and ``kill(i)`` are useful if you have some sort of runaway process
+you are unable to kill any other way, especially if it's spawning other
+processes.
+
+"just thaw ``it(j)``" is useful if your system becomes unresponsive due to a
+frozen (probably root) filesystem via the FIFREEZE ioctl.
+
+Sometimes SysRq seems to get 'stuck' after using it, what can I do?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+That happens to me, also. I've found that tapping shift, alt, and control
+on both sides of the keyboard, and hitting an invalid sysrq sequence again
+will fix the problem. (i.e., something like :kbd:`alt-sysrq-z`). Switching to
+another virtual console (:kbd:`ALT+Fn`) and then back again should also help.
+
+I hit SysRq, but nothing seems to happen, what's wrong?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are some keyboards that produce a different keycode for SysRq than the
+pre-defined value of 99
+(see ``KEY_SYSRQ`` in ``include/uapi/linux/input-event-codes.h``), or
+which don't have a SysRq key at all. In these cases, run ``showkey -s`` to find
+an appropriate scancode sequence, and use ``setkeycodes <sequence> 99`` to map
+this sequence to the usual SysRq code (e.g., ``setkeycodes e05b 99``). It's
+probably best to put this command in a boot script. Oh, and by the way, you
+exit ``showkey`` by not typing anything for ten seconds.
+
+I want to add SysRQ key events to a module, how does it work?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to register a basic function with the table, you must first include
+the header ``include/linux/sysrq.h``, this will define everything else you need.
+Next, you must create a ``sysrq_key_op`` struct, and populate it with A) the key
+handler function you will use, B) a help_msg string, that will print when SysRQ
+prints help, and C) an action_msg string, that will print right before your
+handler is called. Your handler must conform to the prototype in 'sysrq.h'.
+
+After the ``sysrq_key_op`` is created, you can call the kernel function
+``register_sysrq_key(int key, struct sysrq_key_op *op_p);`` this will
+register the operation pointed to by ``op_p`` at table key 'key',
+if that slot in the table is blank. At module unload time, you must call
+the function ``unregister_sysrq_key(int key, struct sysrq_key_op *op_p)``, which
+will remove the key op pointed to by 'op_p' from the key 'key', if and only if
+it is currently registered in that slot. This is in case the slot has been
+overwritten since you registered it.
+
+The Magic SysRQ system works by registering key operations against a key op
+lookup table, which is defined in 'drivers/tty/sysrq.c'. This key table has
+a number of operations registered into it at compile time, but is mutable,
+and 2 functions are exported for interface to it::
+
+	register_sysrq_key and unregister_sysrq_key.
+
+Of course, never ever leave an invalid pointer in the table. I.e., when
+your module that called register_sysrq_key() exits, it must call
+unregister_sysrq_key() to clean up the sysrq key table entry that it used.
+Null pointers in the table are always safe. :)
+
+If for some reason you feel the need to call the handle_sysrq function from
+within a function called by handle_sysrq, you must be aware that you are in
+a lock (you are also in an interrupt handler, which means don't sleep!), so
+you must call ``__handle_sysrq_nolock`` instead.
+
+When I hit a SysRq key combination only the header appears on the console?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sysrq output is subject to the same console loglevel control as all
+other console output.  This means that if the kernel was booted 'quiet'
+as is common on distro kernels the output may not appear on the actual
+console, even though it will appear in the dmesg buffer, and be accessible
+via the dmesg command and to the consumers of ``/proc/kmsg``.  As a specific
+exception the header line from the sysrq command is passed to all console
+consumers as if the current loglevel was maximum.  If only the header
+is emitted it is almost certain that the kernel loglevel is too low.
+Should you require the output on the console channel then you will need
+to temporarily up the console loglevel using :kbd:`alt-sysrq-8` or::
+
+    echo 8 > /proc/sysrq-trigger
+
+Remember to return the loglevel to normal after triggering the sysrq
+command you are interested in.
+
+I have more questions, who can I ask?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Just ask them on the linux-kernel mailing list:
+	linux-kernel@vger.kernel.org
+
+Credits
+~~~~~~~
+
+Written by Mydraal <vulpyne@vulpyne.net>
+Updated by Adam Sulmicki <adam@cfar.umd.edu>
+Updated by Jeremy M. Dolan <jmd@turbogeek.org> 2001/01/28 10:15:59
+Added to by Crutcher Dunnavant <crutcher+kernel@datastacks.com>
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
new file mode 100644
index 000000000..28a869c50
--- /dev/null
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -0,0 +1,59 @@
+Tainted kernels
+---------------
+
+Some oops reports contain the string **'Tainted: '** after the program
+counter. This indicates that the kernel has been tainted by some
+mechanism.  The string is followed by a series of position-sensitive
+characters, each representing a particular tainted value.
+
+ 1)  ``G`` if all modules loaded have a GPL or compatible license, ``P`` if
+     any proprietary module has been loaded.  Modules without a
+     MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by
+     insmod as GPL compatible are assumed to be proprietary.
+
+ 2)  ``F`` if any module was force loaded by ``insmod -f``, ``' '`` if all
+     modules were loaded normally.
+
+ 3)  ``S`` if the oops occurred on an SMP kernel running on hardware that
+     hasn't been certified as safe to run multiprocessor.
+     Currently this occurs only on various Athlons that are not
+     SMP capable.
+
+ 4)  ``R`` if a module was force unloaded by ``rmmod -f``, ``' '`` if all
+     modules were unloaded normally.
+
+ 5)  ``M`` if any processor has reported a Machine Check Exception,
+     ``' '`` if no Machine Check Exceptions have occurred.
+
+ 6)  ``B`` if a page-release function has found a bad page reference or
+     some unexpected page flags.
+
+ 7)  ``U`` if a user or user application specifically requested that the
+     Tainted flag be set, ``' '`` otherwise.
+
+ 8)  ``D`` if the kernel has died recently, i.e. there was an OOPS or BUG.
+
+ 9)  ``A`` if the ACPI table has been overridden.
+
+ 10) ``W`` if a warning has previously been issued by the kernel.
+     (Though some warnings may set more specific taint flags.)
+
+ 11) ``C`` if a staging driver has been loaded.
+
+ 12) ``I`` if the kernel is working around a severe bug in the platform
+     firmware (BIOS or similar).
+
+ 13) ``O`` if an externally-built ("out-of-tree") module has been loaded.
+
+ 14) ``E`` if an unsigned module has been loaded in a kernel supporting
+     module signature.
+
+ 15) ``L`` if a soft lockup has previously occurred on the system.
+
+ 16) ``K`` if the kernel has been live patched.
+
+The primary reason for the **'Tainted: '** string is to tell kernel
+debuggers if this is a clean kernel or if anything unusual has
+occurred.  Tainting is permanent: even if an offending module is
+unloaded, the tainted value remains to indicate that the kernel is not
+trustworthy.
diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst
new file mode 100644
index 000000000..35fccba6a
--- /dev/null
+++ b/Documentation/admin-guide/thunderbolt.rst
@@ -0,0 +1,243 @@
+=============
+ Thunderbolt
+=============
+The interface presented here is not meant for end users. Instead there
+should be a userspace tool that handles all the low-level details, keeps
+a database of the authorized devices and prompts users for new connections.
+
+More details about the sysfs interface for Thunderbolt devices can be
+found in ``Documentation/ABI/testing/sysfs-bus-thunderbolt``.
+
+Those users who just want to connect any device without any sort of
+manual work can add following line to
+``/etc/udev/rules.d/99-local.rules``::
+
+  ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"
+
+This will authorize all devices automatically when they appear. However,
+keep in mind that this bypasses the security levels and makes the system
+vulnerable to DMA attacks.
+
+Security levels and how to use them
+-----------------------------------
+Starting with Intel Falcon Ridge Thunderbolt controller there are 4
+security levels available. Intel Titan Ridge added one more security level
+(usbonly). The reason for these is the fact that the connected devices can
+be DMA masters and thus read contents of the host memory without CPU and OS
+knowing about it. There are ways to prevent this by setting up an IOMMU but
+it is not always available for various reasons.
+
+The security levels are as follows:
+
+  none
+    All devices are automatically connected by the firmware. No user
+    approval is needed. In BIOS settings this is typically called
+    *Legacy mode*.
+
+  user
+    User is asked whether the device is allowed to be connected.
+    Based on the device identification information available through
+    ``/sys/bus/thunderbolt/devices``, the user then can make the decision.
+    In BIOS settings this is typically called *Unique ID*.
+
+  secure
+    User is asked whether the device is allowed to be connected. In
+    addition to UUID the device (if it supports secure connect) is sent
+    a challenge that should match the expected one based on a random key
+    written to the ``key`` sysfs attribute. In BIOS settings this is
+    typically called *One time saved key*.
+
+  dponly
+    The firmware automatically creates tunnels for Display Port and
+    USB. No PCIe tunneling is done. In BIOS settings this is
+    typically called *Display Port Only*.
+
+  usbonly
+    The firmware automatically creates tunnels for the USB controller and
+    Display Port in a dock. All PCIe links downstream of the dock are
+    removed.
+
+The current security level can be read from
+``/sys/bus/thunderbolt/devices/domainX/security`` where ``domainX`` is
+the Thunderbolt domain the host controller manages. There is typically
+one domain per Thunderbolt host controller.
+
+If the security level reads as ``user`` or ``secure`` the connected
+device must be authorized by the user before PCIe tunnels are created
+(e.g the PCIe device appears).
+
+Each Thunderbolt device plugged in will appear in sysfs under
+``/sys/bus/thunderbolt/devices``. The device directory carries
+information that can be used to identify the particular device,
+including its name and UUID.
+
+Authorizing devices when security level is ``user`` or ``secure``
+-----------------------------------------------------------------
+When a device is plugged in it will appear in sysfs as follows::
+
+  /sys/bus/thunderbolt/devices/0-1/authorized	- 0
+  /sys/bus/thunderbolt/devices/0-1/device	- 0x8004
+  /sys/bus/thunderbolt/devices/0-1/device_name	- Thunderbolt to FireWire Adapter
+  /sys/bus/thunderbolt/devices/0-1/vendor	- 0x1
+  /sys/bus/thunderbolt/devices/0-1/vendor_name	- Apple, Inc.
+  /sys/bus/thunderbolt/devices/0-1/unique_id	- e0376f00-0300-0100-ffff-ffffffffffff
+
+The ``authorized`` attribute reads 0 which means no PCIe tunnels are
+created yet. The user can authorize the device by simply entering::
+
+  # echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized
+
+This will create the PCIe tunnels and the device is now connected.
+
+If the device supports secure connect, and the domain security level is
+set to ``secure``, it has an additional attribute ``key`` which can hold
+a random 32-byte value used for authorization and challenging the device in
+future connects::
+
+  /sys/bus/thunderbolt/devices/0-3/authorized	- 0
+  /sys/bus/thunderbolt/devices/0-3/device	- 0x305
+  /sys/bus/thunderbolt/devices/0-3/device_name	- AKiTiO Thunder3 PCIe Box
+  /sys/bus/thunderbolt/devices/0-3/key		-
+  /sys/bus/thunderbolt/devices/0-3/vendor	- 0x41
+  /sys/bus/thunderbolt/devices/0-3/vendor_name	- inXtron
+  /sys/bus/thunderbolt/devices/0-3/unique_id	- dc010000-0000-8508-a22d-32ca6421cb16
+
+Notice the key is empty by default.
+
+If the user does not want to use secure connect they can just ``echo 1``
+to the ``authorized`` attribute and the PCIe tunnels will be created in
+the same way as in the ``user`` security level.
+
+If the user wants to use secure connect, the first time the device is
+plugged a key needs to be created and sent to the device::
+
+  # key=$(openssl rand -hex 32)
+  # echo $key > /sys/bus/thunderbolt/devices/0-3/key
+  # echo 1 > /sys/bus/thunderbolt/devices/0-3/authorized
+
+Now the device is connected (PCIe tunnels are created) and in addition
+the key is stored on the device NVM.
+
+Next time the device is plugged in the user can verify (challenge) the
+device using the same key::
+
+  # echo $key > /sys/bus/thunderbolt/devices/0-3/key
+  # echo 2 > /sys/bus/thunderbolt/devices/0-3/authorized
+
+If the challenge the device returns back matches the one we expect based
+on the key, the device is connected and the PCIe tunnels are created.
+However, if the challenge fails no tunnels are created and error is
+returned to the user.
+
+If the user still wants to connect the device they can either approve
+the device without a key or write a new key and write 1 to the
+``authorized`` file to get the new key stored on the device NVM.
+
+Upgrading NVM on Thunderbolt device or host
+-------------------------------------------
+Since most of the functionality is handled in firmware running on a
+host controller or a device, it is important that the firmware can be
+upgraded to the latest where possible bugs in it have been fixed.
+Typically OEMs provide this firmware from their support site.
+
+There is also a central site which has links where to download firmware
+for some machines:
+
+  `Thunderbolt Updates <https://thunderbolttechnology.net/updates>`_
+
+Before you upgrade firmware on a device or host, please make sure it is a
+suitable upgrade. Failing to do that may render the device (or host) in a
+state where it cannot be used properly anymore without special tools!
+
+Host NVM upgrade on Apple Macs is not supported.
+
+Once the NVM image has been downloaded, you need to plug in a
+Thunderbolt device so that the host controller appears. It does not
+matter which device is connected (unless you are upgrading NVM on a
+device - then you need to connect that particular device).
+
+Note an OEM-specific method to power the controller up ("force power") may
+be available for your system in which case there is no need to plug in a
+Thunderbolt device.
+
+After that we can write the firmware to the non-active parts of the NVM
+of the host or device. As an example here is how Intel NUC6i7KYK (Skull
+Canyon) Thunderbolt controller NVM is upgraded::
+
+  # dd if=KYK_TBT_FW_0018.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem
+
+Once the operation completes we can trigger NVM authentication and
+upgrade process as follows::
+
+  # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate
+
+If no errors are returned, the host controller shortly disappears. Once
+it comes back the driver notices it and initiates a full power cycle.
+After a while the host controller appears again and this time it should
+be fully functional.
+
+We can verify that the new NVM firmware is active by running the following
+commands::
+
+  # cat /sys/bus/thunderbolt/devices/0-0/nvm_authenticate
+  0x0
+  # cat /sys/bus/thunderbolt/devices/0-0/nvm_version
+  18.0
+
+If ``nvm_authenticate`` contains anything other than 0x0 it is the error
+code from the last authentication cycle, which means the authentication
+of the NVM image failed.
+
+Note names of the NVMem devices ``nvm_activeN`` and ``nvm_non_activeN``
+depend on the order they are registered in the NVMem subsystem. N in
+the name is the identifier added by the NVMem subsystem.
+
+Upgrading NVM when host controller is in safe mode
+--------------------------------------------------
+If the existing NVM is not properly authenticated (or is missing) the
+host controller goes into safe mode which means that the only available
+functionality is flashing a new NVM image. When in this mode, reading
+``nvm_version`` fails with ``ENODATA`` and the device identification
+information is missing.
+
+To recover from this mode, one needs to flash a valid NVM image to the
+host controller in the same way it is done in the previous chapter.
+
+Networking over Thunderbolt cable
+---------------------------------
+Thunderbolt technology allows software communication between two hosts
+connected by a Thunderbolt cable.
+
+It is possible to tunnel any kind of traffic over a Thunderbolt link but
+currently we only support Apple ThunderboltIP protocol.
+
+If the other host is running Windows or macOS, the only thing you need to
+do is to connect a Thunderbolt cable between the two hosts; the
+``thunderbolt-net`` driver is loaded automatically. If the other host is
+also Linux you should load ``thunderbolt-net`` manually on one host (it
+does not matter which one)::
+
+  # modprobe thunderbolt-net
+
+This triggers module load on the other host automatically. If the driver
+is built-in to the kernel image, there is no need to do anything.
+
+The driver will create one virtual ethernet interface per Thunderbolt
+port which are named like ``thunderbolt0`` and so on. From this point
+you can either use standard userspace tools like ``ifconfig`` to
+configure the interface or let your GUI handle it automatically.
+
+Forcing power
+-------------
+Many OEMs include a method that can be used to force the power of a
+Thunderbolt controller to an "On" state even if nothing is connected.
+If supported by your machine this will be exposed by the WMI bus with
+a sysfs attribute called "force_power".
+
+For example the intel-wmi-thunderbolt driver exposes this attribute in:
+  /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
+
+  To force the power to on, write 1 to this attribute file.
+  To disable force power, write 0 to this attribute file.
+
+Note: it's currently not possible to query the force power state of a platform.
diff --git a/Documentation/admin-guide/unicode.rst b/Documentation/admin-guide/unicode.rst
new file mode 100644
index 000000000..7425a3351
--- /dev/null
+++ b/Documentation/admin-guide/unicode.rst
@@ -0,0 +1,189 @@
+Unicode support
+===============
+
+		 Last update: 2005-01-17, version 1.4
+
+This file is maintained by H. Peter Anvin <unicode@lanana.org> as part
+of the Linux Assigned Names And Numbers Authority (LANANA) project.
+The current version can be found at:
+
+	    http://www.lanana.org/docs/unicode/admin-guide/unicode.rst
+
+Introduction
+------------
+
+The Linux kernel code has been rewritten to use Unicode to map
+characters to fonts.  By downloading a single Unicode-to-font table,
+both the eight-bit character sets and UTF-8 mode are changed to use
+the font as indicated.
+
+This changes the semantics of the eight-bit character tables subtly.
+The four character tables are now:
+
+=============== =============================== ================
+Map symbol	Map name			Escape code (G0)
+=============== =============================== ================
+LAT1_MAP	Latin-1 (ISO 8859-1)		ESC ( B
+GRAF_MAP	DEC VT100 pseudographics	ESC ( 0
+IBMPC_MAP	IBM code page 437		ESC ( U
+USER_MAP	User defined			ESC ( K
+=============== =============================== ================
+
+In particular, ESC ( U is no longer "straight to font", since the font
+might be completely different than the IBM character set.  This
+permits for example the use of block graphics even with a Latin-1 font
+loaded.
+
+Note that although these codes are similar to ISO 2022, neither the
+codes nor their uses match ISO 2022; Linux has two 8-bit codes (G0 and
+G1), whereas ISO 2022 has four 7-bit codes (G0-G3).
+
+In accordance with the Unicode standard/ISO 10646 the range U+F000 to
+U+F8FF has been reserved for OS-wide allocation (the Unicode Standard
+refers to this as a "Corporate Zone", since this is inaccurate for
+Linux we call it the "Linux Zone").  U+F000 was picked as the starting
+point since it lets the direct-mapping area start on a large power of
+two (in case 1024- or 2048-character fonts ever become necessary).
+This leaves U+E000 to U+EFFF as End User Zone.
+
+[v1.2]: The Unicodes range from U+F000 and up to U+F7FF have been
+hard-coded to map directly to the loaded font, bypassing the
+translation table.  The user-defined map now defaults to U+F000 to
+U+F0FF, emulating the previous behaviour.  In practice, this range
+might be shorter; for example, vgacon can only handle 256-character
+(U+F000..U+F0FF) or 512-character (U+F000..U+F1FF) fonts.
+
+
+Actual characters assigned in the Linux Zone
+--------------------------------------------
+
+In addition, the following characters not present in Unicode 1.1.4
+have been defined; these are used by the DEC VT graphics map.  [v1.2]
+THIS USE IS OBSOLETE AND SHOULD NO LONGER BE USED; PLEASE SEE BELOW.
+
+====== ======================================
+U+F800 DEC VT GRAPHICS HORIZONTAL LINE SCAN 1
+U+F801 DEC VT GRAPHICS HORIZONTAL LINE SCAN 3
+U+F803 DEC VT GRAPHICS HORIZONTAL LINE SCAN 7
+U+F804 DEC VT GRAPHICS HORIZONTAL LINE SCAN 9
+====== ======================================
+
+The DEC VT220 uses a 6x10 character matrix, and these characters form
+a smooth progression in the DEC VT graphics character set.  I have
+omitted the scan 5 line, since it is also used as a block-graphics
+character, and hence has been coded as U+2500 FORMS LIGHT HORIZONTAL.
+
+[v1.3]: These characters have been officially added to Unicode 3.2.0;
+they are added at U+23BA, U+23BB, U+23BC, U+23BD.  Linux now uses the
+new values.
+
+[v1.2]: The following characters have been added to represent common
+keyboard symbols that are unlikely to ever be added to Unicode proper
+since they are horribly vendor-specific.  This, of course, is an
+excellent example of horrible design.
+
+====== ======================================
+U+F810 KEYBOARD SYMBOL FLYING FLAG
+U+F811 KEYBOARD SYMBOL PULLDOWN MENU
+U+F812 KEYBOARD SYMBOL OPEN APPLE
+U+F813 KEYBOARD SYMBOL SOLID APPLE
+====== ======================================
+
+Klingon language support
+------------------------
+
+In 1996, Linux was the first operating system in the world to add
+support for the artificial language Klingon, created by Marc Okrand
+for the "Star Trek" television series.	This encoding was later
+adopted by the ConScript Unicode Registry and proposed (but ultimately
+rejected) for inclusion in Unicode Plane 1.  Thus, it remains as a
+Linux/CSUR private assignment in the Linux Zone.
+
+This encoding has been endorsed by the Klingon Language Institute.
+For more information, contact them at:
+
+	http://www.kli.org/
+
+Since the characters in the beginning of the Linux CZ have been more
+of the dingbats/symbols/forms type and this is a language, I have
+located it at the end, on a 16-cell boundary in keeping with standard
+Unicode practice.
+
+.. note::
+
+  This range is now officially managed by the ConScript Unicode
+  Registry.  The normative reference is at:
+
+	http://www.evertype.com/standards/csur/klingon.html
+
+Klingon has an alphabet of 26 characters, a positional numeric writing
+system with 10 digits, and is written left-to-right, top-to-bottom.
+
+Several glyph forms for the Klingon alphabet have been proposed.
+However, since the set of symbols appear to be consistent throughout,
+with only the actual shapes being different, in keeping with standard
+Unicode practice these differences are considered font variants.
+
+======	=======================================================
+U+F8D0	KLINGON LETTER A
+U+F8D1	KLINGON LETTER B
+U+F8D2	KLINGON LETTER CH
+U+F8D3	KLINGON LETTER D
+U+F8D4	KLINGON LETTER E
+U+F8D5	KLINGON LETTER GH
+U+F8D6	KLINGON LETTER H
+U+F8D7	KLINGON LETTER I
+U+F8D8	KLINGON LETTER J
+U+F8D9	KLINGON LETTER L
+U+F8DA	KLINGON LETTER M
+U+F8DB	KLINGON LETTER N
+U+F8DC	KLINGON LETTER NG
+U+F8DD	KLINGON LETTER O
+U+F8DE	KLINGON LETTER P
+U+F8DF	KLINGON LETTER Q
+	- Written <q> in standard Okrand Latin transliteration
+U+F8E0	KLINGON LETTER QH
+	- Written <Q> in standard Okrand Latin transliteration
+U+F8E1	KLINGON LETTER R
+U+F8E2	KLINGON LETTER S
+U+F8E3	KLINGON LETTER T
+U+F8E4	KLINGON LETTER TLH
+U+F8E5	KLINGON LETTER U
+U+F8E6	KLINGON LETTER V
+U+F8E7	KLINGON LETTER W
+U+F8E8	KLINGON LETTER Y
+U+F8E9	KLINGON LETTER GLOTTAL STOP
+
+U+F8F0	KLINGON DIGIT ZERO
+U+F8F1	KLINGON DIGIT ONE
+U+F8F2	KLINGON DIGIT TWO
+U+F8F3	KLINGON DIGIT THREE
+U+F8F4	KLINGON DIGIT FOUR
+U+F8F5	KLINGON DIGIT FIVE
+U+F8F6	KLINGON DIGIT SIX
+U+F8F7	KLINGON DIGIT SEVEN
+U+F8F8	KLINGON DIGIT EIGHT
+U+F8F9	KLINGON DIGIT NINE
+
+U+F8FD	KLINGON COMMA
+U+F8FE	KLINGON FULL STOP
+U+F8FF	KLINGON SYMBOL FOR EMPIRE
+======	=======================================================
+
+Other Fictional and Artificial Scripts
+--------------------------------------
+
+Since the assignment of the Klingon Linux Unicode block, a registry of
+fictional and artificial scripts has been established by John Cowan
+<jcowan@reutershealth.com> and Michael Everson <everson@evertype.com>.
+The ConScript Unicode Registry is accessible at:
+
+	  http://www.evertype.com/standards/csur/
+
+The ranges used fall at the low end of the End User Zone and can hence
+not be normatively assigned, but it is recommended that people who
+wish to encode fictional scripts use these codes, in the interest of
+interoperability.  For Klingon, CSUR has adopted the Linux encoding.
+The CSUR people are driving adding Tengwar and Cirth into Unicode
+Plane 1; the addition of Klingon to Unicode Plane 1 has been rejected
+and so the above encoding remains official.
diff --git a/Documentation/admin-guide/vga-softcursor.rst b/Documentation/admin-guide/vga-softcursor.rst
new file mode 100644
index 000000000..f52175457
--- /dev/null
+++ b/Documentation/admin-guide/vga-softcursor.rst
@@ -0,0 +1,62 @@
+Software cursor for VGA
+=======================
+
+by Pavel Machek <pavel@atrey.karlin.mff.cuni.cz>
+and Martin Mares <mj@atrey.karlin.mff.cuni.cz>
+
+Linux now has some ability to manipulate cursor appearance.  Normally,
+you can set the size of hardware cursor.  You can now play a few new
+tricks: you can make your cursor look like a non-blinking red block,
+make it inverse background of the character it's over or to highlight
+that character and still choose whether the original hardware cursor
+should remain visible or not.  There may be other things I have never
+thought of.
+
+The cursor appearance is controlled by a ``<ESC>[?1;2;3c`` escape sequence
+where 1, 2 and 3 are parameters described below. If you omit any of them,
+they will default to zeroes.
+
+first Parameter
+	specifies cursor size::
+
+		0=default
+		1=invisible
+		2=underline,
+		...
+		8=full block
+		+ 16 if you want the software cursor to be applied
+		+ 32 if you want to always change the background color
+		+ 64 if you dislike having the background the same as the
+		     foreground.
+
+	Highlights are ignored for the last two flags.
+
+second parameter
+	selects character attribute bits you want to change
+	(by simply XORing them with the value of this parameter). On standard
+	VGA, the high four bits specify background and the low four the
+	foreground. In both groups, low three bits set color (as in normal
+	color codes used by the console) and the most significant one turns
+	on highlight (or sometimes blinking -- it depends on the configuration
+	of your VGA).
+
+third parameter
+	consists of character attribute bits you want to set.
+
+	Bit setting takes place before bit toggling, so you can simply clear a
+	bit by including it in both the set mask and the toggle mask.
+
+Examples
+--------
+
+To get normal blinking underline, use::
+
+	echo -e '\033[?2c'
+
+To get blinking block, use::
+
+	echo -e '\033[?6c'
+
+To get red non-blinking block, use::
+
+	echo -e '\033[?17;0;64c'