diff options
Diffstat (limited to 'Documentation/admin-guide')
64 files changed, 26061 insertions, 0 deletions
diff --git a/Documentation/admin-guide/LSM/LoadPin.rst b/Documentation/admin-guide/LSM/LoadPin.rst new file mode 100644 index 000000000..32070762d --- /dev/null +++ b/Documentation/admin-guide/LSM/LoadPin.rst @@ -0,0 +1,21 @@ +======= +LoadPin +======= + +LoadPin is a Linux Security Module that ensures all kernel-loaded files +(modules, firmware, etc) all originate from the same filesystem, with +the expectation that such a filesystem is backed by a read-only device +such as dm-verity or CDROM. This allows systems that have a verified +and/or unchangeable filesystem to enforce module and firmware loading +restrictions without needing to sign the files individually. + +The LSM is selectable at build-time with ``CONFIG_SECURITY_LOADPIN``, and +can be controlled at boot-time with the kernel command line option +"``loadpin.enabled``". By default, it is enabled, but can be disabled at +boot ("``loadpin.enabled=0``"). + +LoadPin starts pinning when it sees the first file loaded. If the +block device backing the filesystem is not read-only, a sysctl is +created to toggle pinning: ``/proc/sys/kernel/loadpin/enabled``. (Having +a mutable filesystem means pinning is mutable too, but having the +sysctl allows for easy testing on systems with a mutable filesystem.) diff --git a/Documentation/admin-guide/LSM/SELinux.rst b/Documentation/admin-guide/LSM/SELinux.rst new file mode 100644 index 000000000..f722c9b41 --- /dev/null +++ b/Documentation/admin-guide/LSM/SELinux.rst @@ -0,0 +1,33 @@ +======= +SELinux +======= + +If you want to use SELinux, chances are you will want +to use the distro-provided policies, or install the +latest reference policy release from + + http://oss.tresys.com/projects/refpolicy + +However, if you want to install a dummy policy for +testing, you can do using ``mdp`` provided under +scripts/selinux. Note that this requires the selinux +userspace to be installed - in particular you will +need checkpolicy to compile a kernel, and setfiles and +fixfiles to label the filesystem. + + 1. Compile the kernel with selinux enabled. + 2. Type ``make`` to compile ``mdp``. + 3. Make sure that you are not running with + SELinux enabled and a real policy. If + you are, reboot with selinux disabled + before continuing. + 4. Run install_policy.sh:: + + cd scripts/selinux + sh install_policy.sh + +Step 4 will create a new dummy policy valid for your +kernel, with a single selinux user, role, and type. +It will compile the policy, will set your ``SELINUXTYPE`` to +``dummy`` in ``/etc/selinux/config``, install the compiled policy +as ``dummy``, and relabel your filesystem. diff --git a/Documentation/admin-guide/LSM/Smack.rst b/Documentation/admin-guide/LSM/Smack.rst new file mode 100644 index 000000000..6a5826a13 --- /dev/null +++ b/Documentation/admin-guide/LSM/Smack.rst @@ -0,0 +1,857 @@ +===== +Smack +===== + + + "Good for you, you've decided to clean the elevator!" + - The Elevator, from Dark Star + +Smack is the Simplified Mandatory Access Control Kernel. +Smack is a kernel based implementation of mandatory access +control that includes simplicity in its primary design goals. + +Smack is not the only Mandatory Access Control scheme +available for Linux. Those new to Mandatory Access Control +are encouraged to compare Smack with the other mechanisms +available to determine which is best suited to the problem +at hand. + +Smack consists of three major components: + + - The kernel + - Basic utilities, which are helpful but not required + - Configuration data + +The kernel component of Smack is implemented as a Linux +Security Modules (LSM) module. It requires netlabel and +works best with file systems that support extended attributes, +although xattr support is not strictly required. +It is safe to run a Smack kernel under a "vanilla" distribution. + +Smack kernels use the CIPSO IP option. Some network +configurations are intolerant of IP options and can impede +access to systems that use them as Smack does. + +Smack is used in the Tizen operating system. Please +go to http://wiki.tizen.org for information about how +Smack is used in Tizen. + +The current git repository for Smack user space is: + + git://github.com/smack-team/smack.git + +This should make and install on most modern distributions. +There are five commands included in smackutil: + +chsmack: + display or set Smack extended attribute values + +smackctl: + load the Smack access rules + +smackaccess: + report if a process with one label has access + to an object with another + +These two commands are obsolete with the introduction of +the smackfs/load2 and smackfs/cipso2 interfaces. + +smackload: + properly formats data for writing to smackfs/load + +smackcipso: + properly formats data for writing to smackfs/cipso + +In keeping with the intent of Smack, configuration data is +minimal and not strictly required. The most important +configuration step is mounting the smackfs pseudo filesystem. +If smackutil is installed the startup script will take care +of this, but it can be manually as well. + +Add this line to ``/etc/fstab``:: + + smackfs /sys/fs/smackfs smackfs defaults 0 0 + +The ``/sys/fs/smackfs`` directory is created by the kernel. + +Smack uses extended attributes (xattrs) to store labels on filesystem +objects. The attributes are stored in the extended attribute security +name space. A process must have ``CAP_MAC_ADMIN`` to change any of these +attributes. + +The extended attributes that Smack uses are: + +SMACK64 + Used to make access control decisions. In almost all cases + the label given to a new filesystem object will be the label + of the process that created it. + +SMACK64EXEC + The Smack label of a process that execs a program file with + this attribute set will run with this attribute's value. + +SMACK64MMAP + Don't allow the file to be mmapped by a process whose Smack + label does not allow all of the access permitted to a process + with the label contained in this attribute. This is a very + specific use case for shared libraries. + +SMACK64TRANSMUTE + Can only have the value "TRUE". If this attribute is present + on a directory when an object is created in the directory and + the Smack rule (more below) that permitted the write access + to the directory includes the transmute ("t") mode the object + gets the label of the directory instead of the label of the + creating process. If the object being created is a directory + the SMACK64TRANSMUTE attribute is set as well. + +SMACK64IPIN + This attribute is only available on file descriptors for sockets. + Use the Smack label in this attribute for access control + decisions on packets being delivered to this socket. + +SMACK64IPOUT + This attribute is only available on file descriptors for sockets. + Use the Smack label in this attribute for access control + decisions on packets coming from this socket. + +There are multiple ways to set a Smack label on a file:: + + # attr -S -s SMACK64 -V "value" path + # chsmack -a value path + +A process can see the Smack label it is running with by +reading ``/proc/self/attr/current``. A process with ``CAP_MAC_ADMIN`` +can set the process Smack by writing there. + +Most Smack configuration is accomplished by writing to files +in the smackfs filesystem. This pseudo-filesystem is mounted +on ``/sys/fs/smackfs``. + +access + Provided for backward compatibility. The access2 interface + is preferred and should be used instead. + This interface reports whether a subject with the specified + Smack label has a particular access to an object with a + specified Smack label. Write a fixed format access rule to + this file. The next read will indicate whether the access + would be permitted. The text will be either "1" indicating + access, or "0" indicating denial. + +access2 + This interface reports whether a subject with the specified + Smack label has a particular access to an object with a + specified Smack label. Write a long format access rule to + this file. The next read will indicate whether the access + would be permitted. The text will be either "1" indicating + access, or "0" indicating denial. + +ambient + This contains the Smack label applied to unlabeled network + packets. + +change-rule + This interface allows modification of existing access control rules. + The format accepted on write is:: + + "%s %s %s %s" + + where the first string is the subject label, the second the + object label, the third the access to allow and the fourth the + access to deny. The access strings may contain only the characters + "rwxat-". If a rule for a given subject and object exists it will be + modified by enabling the permissions in the third string and disabling + those in the fourth string. If there is no such rule it will be + created using the access specified in the third and the fourth strings. + +cipso + Provided for backward compatibility. The cipso2 interface + is preferred and should be used instead. + This interface allows a specific CIPSO header to be assigned + to a Smack label. The format accepted on write is:: + + "%24s%4d%4d"["%4d"]... + + The first string is a fixed Smack label. The first number is + the level to use. The second number is the number of categories. + The following numbers are the categories:: + + "level-3-cats-5-19 3 2 5 19" + +cipso2 + This interface allows a specific CIPSO header to be assigned + to a Smack label. The format accepted on write is:: + + "%s%4d%4d"["%4d"]... + + The first string is a long Smack label. The first number is + the level to use. The second number is the number of categories. + The following numbers are the categories:: + + "level-3-cats-5-19 3 2 5 19" + +direct + This contains the CIPSO level used for Smack direct label + representation in network packets. + +doi + This contains the CIPSO domain of interpretation used in + network packets. + +ipv6host + This interface allows specific IPv6 internet addresses to be + treated as single label hosts. Packets are sent to single + label hosts only from processes that have Smack write access + to the host label. All packets received from single label hosts + are given the specified label. The format accepted on write is:: + + "%h:%h:%h:%h:%h:%h:%h:%h label" or + "%h:%h:%h:%h:%h:%h:%h:%h/%d label". + + The "::" address shortcut is not supported. + If label is "-DELETE" a matched entry will be deleted. + +load + Provided for backward compatibility. The load2 interface + is preferred and should be used instead. + This interface allows access control rules in addition to + the system defined rules to be specified. The format accepted + on write is:: + + "%24s%24s%5s" + + where the first string is the subject label, the second the + object label, and the third the requested access. The access + string may contain only the characters "rwxat-", and specifies + which sort of access is allowed. The "-" is a placeholder for + permissions that are not allowed. The string "r-x--" would + specify read and execute access. Labels are limited to 23 + characters in length. + +load2 + This interface allows access control rules in addition to + the system defined rules to be specified. The format accepted + on write is:: + + "%s %s %s" + + where the first string is the subject label, the second the + object label, and the third the requested access. The access + string may contain only the characters "rwxat-", and specifies + which sort of access is allowed. The "-" is a placeholder for + permissions that are not allowed. The string "r-x--" would + specify read and execute access. + +load-self + Provided for backward compatibility. The load-self2 interface + is preferred and should be used instead. + This interface allows process specific access rules to be + defined. These rules are only consulted if access would + otherwise be permitted, and are intended to provide additional + restrictions on the process. The format is the same as for + the load interface. + +load-self2 + This interface allows process specific access rules to be + defined. These rules are only consulted if access would + otherwise be permitted, and are intended to provide additional + restrictions on the process. The format is the same as for + the load2 interface. + +logging + This contains the Smack logging state. + +mapped + This contains the CIPSO level used for Smack mapped label + representation in network packets. + +netlabel + This interface allows specific internet addresses to be + treated as single label hosts. Packets are sent to single + label hosts without CIPSO headers, but only from processes + that have Smack write access to the host label. All packets + received from single label hosts are given the specified + label. The format accepted on write is:: + + "%d.%d.%d.%d label" or "%d.%d.%d.%d/%d label". + + If the label specified is "-CIPSO" the address is treated + as a host that supports CIPSO headers. + +onlycap + This contains labels processes must have for CAP_MAC_ADMIN + and ``CAP_MAC_OVERRIDE`` to be effective. If this file is empty + these capabilities are effective at for processes with any + label. The values are set by writing the desired labels, separated + by spaces, to the file or cleared by writing "-" to the file. + +ptrace + This is used to define the current ptrace policy + + 0 - default: + this is the policy that relies on Smack access rules. + For the ``PTRACE_READ`` a subject needs to have a read access on + object. For the ``PTRACE_ATTACH`` a read-write access is required. + + 1 - exact: + this is the policy that limits ``PTRACE_ATTACH``. Attach is + only allowed when subject's and object's labels are equal. + ``PTRACE_READ`` is not affected. Can be overridden with ``CAP_SYS_PTRACE``. + + 2 - draconian: + this policy behaves like the 'exact' above with an + exception that it can't be overridden with ``CAP_SYS_PTRACE``. + +revoke-subject + Writing a Smack label here sets the access to '-' for all access + rules with that subject label. + +unconfined + If the kernel is configured with ``CONFIG_SECURITY_SMACK_BRINGUP`` + a process with ``CAP_MAC_ADMIN`` can write a label into this interface. + Thereafter, accesses that involve that label will be logged and + the access permitted if it wouldn't be otherwise. Note that this + is dangerous and can ruin the proper labeling of your system. + It should never be used in production. + +relabel-self + This interface contains a list of labels to which the process can + transition to, by writing to ``/proc/self/attr/current``. + Normally a process can change its own label to any legal value, but only + if it has ``CAP_MAC_ADMIN``. This interface allows a process without + ``CAP_MAC_ADMIN`` to relabel itself to one of labels from predefined list. + A process without ``CAP_MAC_ADMIN`` can change its label only once. When it + does, this list will be cleared. + The values are set by writing the desired labels, separated + by spaces, to the file or cleared by writing "-" to the file. + +If you are using the smackload utility +you can add access rules in ``/etc/smack/accesses``. They take the form:: + + subjectlabel objectlabel access + +access is a combination of the letters rwxatb which specify the +kind of access permitted a subject with subjectlabel on an +object with objectlabel. If there is no rule no access is allowed. + +Look for additional programs on http://schaufler-ca.com + +The Simplified Mandatory Access Control Kernel (Whitepaper) +=========================================================== + +Casey Schaufler +casey@schaufler-ca.com + +Mandatory Access Control +------------------------ + +Computer systems employ a variety of schemes to constrain how information is +shared among the people and services using the machine. Some of these schemes +allow the program or user to decide what other programs or users are allowed +access to pieces of data. These schemes are called discretionary access +control mechanisms because the access control is specified at the discretion +of the user. Other schemes do not leave the decision regarding what a user or +program can access up to users or programs. These schemes are called mandatory +access control mechanisms because you don't have a choice regarding the users +or programs that have access to pieces of data. + +Bell & LaPadula +--------------- + +From the middle of the 1980's until the turn of the century Mandatory Access +Control (MAC) was very closely associated with the Bell & LaPadula security +model, a mathematical description of the United States Department of Defense +policy for marking paper documents. MAC in this form enjoyed a following +within the Capital Beltway and Scandinavian supercomputer centers but was +often sited as failing to address general needs. + +Domain Type Enforcement +----------------------- + +Around the turn of the century Domain Type Enforcement (DTE) became popular. +This scheme organizes users, programs, and data into domains that are +protected from each other. This scheme has been widely deployed as a component +of popular Linux distributions. The administrative overhead required to +maintain this scheme and the detailed understanding of the whole system +necessary to provide a secure domain mapping leads to the scheme being +disabled or used in limited ways in the majority of cases. + +Smack +----- + +Smack is a Mandatory Access Control mechanism designed to provide useful MAC +while avoiding the pitfalls of its predecessors. The limitations of Bell & +LaPadula are addressed by providing a scheme whereby access can be controlled +according to the requirements of the system and its purpose rather than those +imposed by an arcane government policy. The complexity of Domain Type +Enforcement and avoided by defining access controls in terms of the access +modes already in use. + +Smack Terminology +----------------- + +The jargon used to talk about Smack will be familiar to those who have dealt +with other MAC systems and shouldn't be too difficult for the uninitiated to +pick up. There are four terms that are used in a specific way and that are +especially important: + + Subject: + A subject is an active entity on the computer system. + On Smack a subject is a task, which is in turn the basic unit + of execution. + + Object: + An object is a passive entity on the computer system. + On Smack files of all types, IPC, and tasks can be objects. + + Access: + Any attempt by a subject to put information into or get + information from an object is an access. + + Label: + Data that identifies the Mandatory Access Control + characteristics of a subject or an object. + +These definitions are consistent with the traditional use in the security +community. There are also some terms from Linux that are likely to crop up: + + Capability: + A task that possesses a capability has permission to + violate an aspect of the system security policy, as identified by + the specific capability. A task that possesses one or more + capabilities is a privileged task, whereas a task with no + capabilities is an unprivileged task. + + Privilege: + A task that is allowed to violate the system security + policy is said to have privilege. As of this writing a task can + have privilege either by possessing capabilities or by having an + effective user of root. + +Smack Basics +------------ + +Smack is an extension to a Linux system. It enforces additional restrictions +on what subjects can access which objects, based on the labels attached to +each of the subject and the object. + +Labels +~~~~~~ + +Smack labels are ASCII character strings. They can be up to 255 characters +long, but keeping them to twenty-three characters is recommended. +Single character labels using special characters, that being anything +other than a letter or digit, are reserved for use by the Smack development +team. Smack labels are unstructured, case sensitive, and the only operation +ever performed on them is comparison for equality. Smack labels cannot +contain unprintable characters, the "/" (slash), the "\" (backslash), the "'" +(quote) and '"' (double-quote) characters. +Smack labels cannot begin with a '-'. This is reserved for special options. + +There are some predefined labels:: + + _ Pronounced "floor", a single underscore character. + ^ Pronounced "hat", a single circumflex character. + * Pronounced "star", a single asterisk character. + ? Pronounced "huh", a single question mark character. + @ Pronounced "web", a single at sign character. + +Every task on a Smack system is assigned a label. The Smack label +of a process will usually be assigned by the system initialization +mechanism. + +Access Rules +~~~~~~~~~~~~ + +Smack uses the traditional access modes of Linux. These modes are read, +execute, write, and occasionally append. There are a few cases where the +access mode may not be obvious. These include: + + Signals: + A signal is a write operation from the subject task to + the object task. + + Internet Domain IPC: + Transmission of a packet is considered a + write operation from the source task to the destination task. + +Smack restricts access based on the label attached to a subject and the label +attached to the object it is trying to access. The rules enforced are, in +order: + + 1. Any access requested by a task labeled "*" is denied. + 2. A read or execute access requested by a task labeled "^" + is permitted. + 3. A read or execute access requested on an object labeled "_" + is permitted. + 4. Any access requested on an object labeled "*" is permitted. + 5. Any access requested by a task on an object with the same + label is permitted. + 6. Any access requested that is explicitly defined in the loaded + rule set is permitted. + 7. Any other access is denied. + +Smack Access Rules +~~~~~~~~~~~~~~~~~~ + +With the isolation provided by Smack access separation is simple. There are +many interesting cases where limited access by subjects to objects with +different labels is desired. One example is the familiar spy model of +sensitivity, where a scientist working on a highly classified project would be +able to read documents of lower classifications and anything she writes will +be "born" highly classified. To accommodate such schemes Smack includes a +mechanism for specifying rules allowing access between labels. + +Access Rule Format +~~~~~~~~~~~~~~~~~~ + +The format of an access rule is:: + + subject-label object-label access + +Where subject-label is the Smack label of the task, object-label is the Smack +label of the thing being accessed, and access is a string specifying the sort +of access allowed. The access specification is searched for letters that +describe access modes: + + a: indicates that append access should be granted. + r: indicates that read access should be granted. + w: indicates that write access should be granted. + x: indicates that execute access should be granted. + t: indicates that the rule requests transmutation. + b: indicates that the rule should be reported for bring-up. + +Uppercase values for the specification letters are allowed as well. +Access mode specifications can be in any order. Examples of acceptable rules +are:: + + TopSecret Secret rx + Secret Unclass R + Manager Game x + User HR w + Snap Crackle rwxatb + New Old rRrRr + Closed Off - + +Examples of unacceptable rules are:: + + Top Secret Secret rx + Ace Ace r + Odd spells waxbeans + +Spaces are not allowed in labels. Since a subject always has access to files +with the same label specifying a rule for that case is pointless. Only +valid letters (rwxatbRWXATB) and the dash ('-') character are allowed in +access specifications. The dash is a placeholder, so "a-r" is the same +as "ar". A lone dash is used to specify that no access should be allowed. + +Applying Access Rules +~~~~~~~~~~~~~~~~~~~~~ + +The developers of Linux rarely define new sorts of things, usually importing +schemes and concepts from other systems. Most often, the other systems are +variants of Unix. Unix has many endearing properties, but consistency of +access control models is not one of them. Smack strives to treat accesses as +uniformly as is sensible while keeping with the spirit of the underlying +mechanism. + +File system objects including files, directories, named pipes, symbolic links, +and devices require access permissions that closely match those used by mode +bit access. To open a file for reading read access is required on the file. To +search a directory requires execute access. Creating a file with write access +requires both read and write access on the containing directory. Deleting a +file requires read and write access to the file and to the containing +directory. It is possible that a user may be able to see that a file exists +but not any of its attributes by the circumstance of having read access to the +containing directory but not to the differently labeled file. This is an +artifact of the file name being data in the directory, not a part of the file. + +If a directory is marked as transmuting (SMACK64TRANSMUTE=TRUE) and the +access rule that allows a process to create an object in that directory +includes 't' access the label assigned to the new object will be that +of the directory, not the creating process. This makes it much easier +for two processes with different labels to share data without granting +access to all of their files. + +IPC objects, message queues, semaphore sets, and memory segments exist in flat +namespaces and access requests are only required to match the object in +question. + +Process objects reflect tasks on the system and the Smack label used to access +them is the same Smack label that the task would use for its own access +attempts. Sending a signal via the kill() system call is a write operation +from the signaler to the recipient. Debugging a process requires both reading +and writing. Creating a new task is an internal operation that results in two +tasks with identical Smack labels and requires no access checks. + +Sockets are data structures attached to processes and sending a packet from +one process to another requires that the sender have write access to the +receiver. The receiver is not required to have read access to the sender. + +Setting Access Rules +~~~~~~~~~~~~~~~~~~~~ + +The configuration file /etc/smack/accesses contains the rules to be set at +system startup. The contents are written to the special file +/sys/fs/smackfs/load2. Rules can be added at any time and take effect +immediately. For any pair of subject and object labels there can be only +one rule, with the most recently specified overriding any earlier +specification. + +Task Attribute +~~~~~~~~~~~~~~ + +The Smack label of a process can be read from /proc/<pid>/attr/current. A +process can read its own Smack label from /proc/self/attr/current. A +privileged process can change its own Smack label by writing to +/proc/self/attr/current but not the label of another process. + +File Attribute +~~~~~~~~~~~~~~ + +The Smack label of a filesystem object is stored as an extended attribute +named SMACK64 on the file. This attribute is in the security namespace. It can +only be changed by a process with privilege. + +Privilege +~~~~~~~~~ + +A process with CAP_MAC_OVERRIDE or CAP_MAC_ADMIN is privileged. +CAP_MAC_OVERRIDE allows the process access to objects it would +be denied otherwise. CAP_MAC_ADMIN allows a process to change +Smack data, including rules and attributes. + +Smack Networking +~~~~~~~~~~~~~~~~ + +As mentioned before, Smack enforces access control on network protocol +transmissions. Every packet sent by a Smack process is tagged with its Smack +label. This is done by adding a CIPSO tag to the header of the IP packet. Each +packet received is expected to have a CIPSO tag that identifies the label and +if it lacks such a tag the network ambient label is assumed. Before the packet +is delivered a check is made to determine that a subject with the label on the +packet has write access to the receiving process and if that is not the case +the packet is dropped. + +CIPSO Configuration +~~~~~~~~~~~~~~~~~~~ + +It is normally unnecessary to specify the CIPSO configuration. The default +values used by the system handle all internal cases. Smack will compose CIPSO +label values to match the Smack labels being used without administrative +intervention. Unlabeled packets that come into the system will be given the +ambient label. + +Smack requires configuration in the case where packets from a system that is +not Smack that speaks CIPSO may be encountered. Usually this will be a Trusted +Solaris system, but there are other, less widely deployed systems out there. +CIPSO provides 3 important values, a Domain Of Interpretation (DOI), a level, +and a category set with each packet. The DOI is intended to identify a group +of systems that use compatible labeling schemes, and the DOI specified on the +Smack system must match that of the remote system or packets will be +discarded. The DOI is 3 by default. The value can be read from +/sys/fs/smackfs/doi and can be changed by writing to /sys/fs/smackfs/doi. + +The label and category set are mapped to a Smack label as defined in +/etc/smack/cipso. + +A Smack/CIPSO mapping has the form:: + + smack level [category [category]*] + +Smack does not expect the level or category sets to be related in any +particular way and does not assume or assign accesses based on them. Some +examples of mappings:: + + TopSecret 7 + TS:A,B 7 1 2 + SecBDE 5 2 4 6 + RAFTERS 7 12 26 + +The ":" and "," characters are permitted in a Smack label but have no special +meaning. + +The mapping of Smack labels to CIPSO values is defined by writing to +/sys/fs/smackfs/cipso2. + +In addition to explicit mappings Smack supports direct CIPSO mappings. One +CIPSO level is used to indicate that the category set passed in the packet is +in fact an encoding of the Smack label. The level used is 250 by default. The +value can be read from /sys/fs/smackfs/direct and changed by writing to +/sys/fs/smackfs/direct. + +Socket Attributes +~~~~~~~~~~~~~~~~~ + +There are two attributes that are associated with sockets. These attributes +can only be set by privileged tasks, but any task can read them for their own +sockets. + + SMACK64IPIN: + The Smack label of the task object. A privileged + program that will enforce policy may set this to the star label. + + SMACK64IPOUT: + The Smack label transmitted with outgoing packets. + A privileged program may set this to match the label of another + task with which it hopes to communicate. + +Smack Netlabel Exceptions +~~~~~~~~~~~~~~~~~~~~~~~~~ + +You will often find that your labeled application has to talk to the outside, +unlabeled world. To do this there's a special file /sys/fs/smackfs/netlabel +where you can add some exceptions in the form of:: + + @IP1 LABEL1 or + @IP2/MASK LABEL2 + +It means that your application will have unlabeled access to @IP1 if it has +write access on LABEL1, and access to the subnet @IP2/MASK if it has write +access on LABEL2. + +Entries in the /sys/fs/smackfs/netlabel file are matched by longest mask +first, like in classless IPv4 routing. + +A special label '@' and an option '-CIPSO' can be used there:: + + @ means Internet, any application with any label has access to it + -CIPSO means standard CIPSO networking + +If you don't know what CIPSO is and don't plan to use it, you can just do:: + + echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel + echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel + +If you use CIPSO on your 192.168.0.0/16 local network and need also unlabeled +Internet access, you can have:: + + echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel + echo 192.168.0.0/16 -CIPSO > /sys/fs/smackfs/netlabel + echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel + +Writing Applications for Smack +------------------------------ + +There are three sorts of applications that will run on a Smack system. How an +application interacts with Smack will determine what it will have to do to +work properly under Smack. + +Smack Ignorant Applications +--------------------------- + +By far the majority of applications have no reason whatever to care about the +unique properties of Smack. Since invoking a program has no impact on the +Smack label associated with the process the only concern likely to arise is +whether the process has execute access to the program. + +Smack Relevant Applications +--------------------------- + +Some programs can be improved by teaching them about Smack, but do not make +any security decisions themselves. The utility ls(1) is one example of such a +program. + +Smack Enforcing Applications +---------------------------- + +These are special programs that not only know about Smack, but participate in +the enforcement of system policy. In most cases these are the programs that +set up user sessions. There are also network services that provide information +to processes running with various labels. + +File System Interfaces +---------------------- + +Smack maintains labels on file system objects using extended attributes. The +Smack label of a file, directory, or other file system object can be obtained +using getxattr(2):: + + len = getxattr("/", "security.SMACK64", value, sizeof (value)); + +will put the Smack label of the root directory into value. A privileged +process can set the Smack label of a file system object with setxattr(2):: + + len = strlen("Rubble"); + rc = setxattr("/foo", "security.SMACK64", "Rubble", len, 0); + +will set the Smack label of /foo to "Rubble" if the program has appropriate +privilege. + +Socket Interfaces +----------------- + +The socket attributes can be read using fgetxattr(2). + +A privileged process can set the Smack label of outgoing packets with +fsetxattr(2):: + + len = strlen("Rubble"); + rc = fsetxattr(fd, "security.SMACK64IPOUT", "Rubble", len, 0); + +will set the Smack label "Rubble" on packets going out from the socket if the +program has appropriate privilege:: + + rc = fsetxattr(fd, "security.SMACK64IPIN, "*", strlen("*"), 0); + +will set the Smack label "*" as the object label against which incoming +packets will be checked if the program has appropriate privilege. + +Administration +-------------- + +Smack supports some mount options: + + smackfsdef=label: + specifies the label to give files that lack + the Smack label extended attribute. + + smackfsroot=label: + specifies the label to assign the root of the + file system if it lacks the Smack extended attribute. + + smackfshat=label: + specifies a label that must have read access to + all labels set on the filesystem. Not yet enforced. + + smackfsfloor=label: + specifies a label to which all labels set on the + filesystem must have read access. Not yet enforced. + +These mount options apply to all file system types. + +Smack auditing +-------------- + +If you want Smack auditing of security events, you need to set CONFIG_AUDIT +in your kernel configuration. +By default, all denied events will be audited. You can change this behavior by +writing a single character to the /sys/fs/smackfs/logging file:: + + 0 : no logging + 1 : log denied (default) + 2 : log accepted + 3 : log denied & accepted + +Events are logged as 'key=value' pairs, for each event you at least will get +the subject, the object, the rights requested, the action, the kernel function +that triggered the event, plus other pairs depending on the type of event +audited. + +Bringup Mode +------------ + +Bringup mode provides logging features that can make application +configuration and system bringup easier. Configure the kernel with +CONFIG_SECURITY_SMACK_BRINGUP to enable these features. When bringup +mode is enabled accesses that succeed due to rules marked with the "b" +access mode will logged. When a new label is introduced for processes +rules can be added aggressively, marked with the "b". The logging allows +tracking of which rules actual get used for that label. + +Another feature of bringup mode is the "unconfined" option. Writing +a label to /sys/fs/smackfs/unconfined makes subjects with that label +able to access any object, and objects with that label accessible to +all subjects. Any access that is granted because a label is unconfined +is logged. This feature is dangerous, as files and directories may +be created in places they couldn't if the policy were being enforced. diff --git a/Documentation/admin-guide/LSM/Yama.rst b/Documentation/admin-guide/LSM/Yama.rst new file mode 100644 index 000000000..13468ea69 --- /dev/null +++ b/Documentation/admin-guide/LSM/Yama.rst @@ -0,0 +1,74 @@ +==== +Yama +==== + +Yama is a Linux Security Module that collects system-wide DAC security +protections that are not handled by the core kernel itself. This is +selectable at build-time with ``CONFIG_SECURITY_YAMA``, and can be controlled +at run-time through sysctls in ``/proc/sys/kernel/yama``: + +ptrace_scope +============ + +As Linux grows in popularity, it will become a larger target for +malware. One particularly troubling weakness of the Linux process +interfaces is that a single user is able to examine the memory and +running state of any of their processes. For example, if one application +(e.g. Pidgin) was compromised, it would be possible for an attacker to +attach to other running processes (e.g. Firefox, SSH sessions, GPG agent, +etc) to extract additional credentials and continue to expand the scope +of their attack without resorting to user-assisted phishing. + +This is not a theoretical problem. SSH session hijacking +(http://www.storm.net.nz/projects/7) and arbitrary code injection +(http://c-skills.blogspot.com/2007/05/injectso.html) attacks already +exist and remain possible if ptrace is allowed to operate as before. +Since ptrace is not commonly used by non-developers and non-admins, system +builders should be allowed the option to disable this debugging system. + +For a solution, some applications use ``prctl(PR_SET_DUMPABLE, ...)`` to +specifically disallow such ptrace attachment (e.g. ssh-agent), but many +do not. A more general solution is to only allow ptrace directly from a +parent to a child process (i.e. direct "gdb EXE" and "strace EXE" still +work), or with ``CAP_SYS_PTRACE`` (i.e. "gdb --pid=PID", and "strace -p PID" +still work as root). + +In mode 1, software that has defined application-specific relationships +between a debugging process and its inferior (crash handlers, etc), +``prctl(PR_SET_PTRACER, pid, ...)`` can be used. An inferior can declare which +other process (and its descendants) are allowed to call ``PTRACE_ATTACH`` +against it. Only one such declared debugging process can exists for +each inferior at a time. For example, this is used by KDE, Chromium, and +Firefox's crash handlers, and by Wine for allowing only Wine processes +to ptrace each other. If a process wishes to entirely disable these ptrace +restrictions, it can call ``prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, ...)`` +so that any otherwise allowed process (even those in external pid namespaces) +may attach. + +The sysctl settings (writable only with ``CAP_SYS_PTRACE``) are: + +0 - classic ptrace permissions: + a process can ``PTRACE_ATTACH`` to any other + process running under the same uid, as long as it is dumpable (i.e. + did not transition uids, start privileged, or have called + ``prctl(PR_SET_DUMPABLE...)`` already). Similarly, ``PTRACE_TRACEME`` is + unchanged. + +1 - restricted ptrace: + a process must have a predefined relationship + with the inferior it wants to call ``PTRACE_ATTACH`` on. By default, + this relationship is that of only its descendants when the above + classic criteria is also met. To change the relationship, an + inferior can call ``prctl(PR_SET_PTRACER, debugger, ...)`` to declare + an allowed debugger PID to call ``PTRACE_ATTACH`` on the inferior. + Using ``PTRACE_TRACEME`` is unchanged. + +2 - admin-only attach: + only processes with ``CAP_SYS_PTRACE`` may use ptrace + with ``PTRACE_ATTACH``, or through children calling ``PTRACE_TRACEME``. + +3 - no attach: + no processes may use ptrace with ``PTRACE_ATTACH`` nor via + ``PTRACE_TRACEME``. Once set, this sysctl value cannot be changed. + +The original children-only logic was based on the restrictions in grsecurity. diff --git a/Documentation/admin-guide/LSM/apparmor.rst b/Documentation/admin-guide/LSM/apparmor.rst new file mode 100644 index 000000000..6cf81bbd7 --- /dev/null +++ b/Documentation/admin-guide/LSM/apparmor.rst @@ -0,0 +1,51 @@ +======== +AppArmor +======== + +What is AppArmor? +================= + +AppArmor is MAC style security extension for the Linux kernel. It implements +a task centered policy, with task "profiles" being created and loaded +from user space. Tasks on the system that do not have a profile defined for +them run in an unconfined state which is equivalent to standard Linux DAC +permissions. + +How to enable/disable +===================== + +set ``CONFIG_SECURITY_APPARMOR=y`` + +If AppArmor should be selected as the default security module then set:: + + CONFIG_DEFAULT_SECURITY="apparmor" + CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1 + +Build the kernel + +If AppArmor is not the default security module it can be enabled by passing +``security=apparmor`` on the kernel's command line. + +If AppArmor is the default security module it can be disabled by passing +``apparmor=0, security=XXXX`` (where ``XXXX`` is valid security module), on the +kernel's command line. + +For AppArmor to enforce any restrictions beyond standard Linux DAC permissions +policy must be loaded into the kernel from user space (see the Documentation +and tools links). + +Documentation +============= + +Documentation can be found on the wiki, linked below. + +Links +===== + +Mailing List - apparmor@lists.ubuntu.com + +Wiki - http://wiki.apparmor.net + +User space tools - https://gitlab.com/apparmor + +Kernel module - git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst new file mode 100644 index 000000000..c980dfe9a --- /dev/null +++ b/Documentation/admin-guide/LSM/index.rst @@ -0,0 +1,41 @@ +=========================== +Linux Security Module Usage +=========================== + +The Linux Security Module (LSM) framework provides a mechanism for +various security checks to be hooked by new kernel extensions. The name +"module" is a bit of a misnomer since these extensions are not actually +loadable kernel modules. Instead, they are selectable at build-time via +CONFIG_DEFAULT_SECURITY and can be overridden at boot-time via the +``"security=..."`` kernel command line argument, in the case where multiple +LSMs were built into a given kernel. + +The primary users of the LSM interface are Mandatory Access Control +(MAC) extensions which provide a comprehensive security policy. Examples +include SELinux, Smack, Tomoyo, and AppArmor. In addition to the larger +MAC extensions, other extensions can be built using the LSM to provide +specific changes to system operation when these tweaks are not available +in the core functionality of Linux itself. + +Without a specific LSM built into the kernel, the default LSM will be the +Linux capabilities system. Most LSMs choose to extend the capabilities +system, building their checks on top of the defined capability hooks. +For more details on capabilities, see ``capabilities(7)`` in the Linux +man-pages project. + +A list of the active security modules can be found by reading +``/sys/kernel/security/lsm``. This is a comma separated list, and +will always include the capability module. The list reflects the +order in which checks are made. The capability module will always +be first, followed by any "minor" modules (e.g. Yama) and then +the one "major" module (e.g. SELinux) if there is one configured. + +.. toctree:: + :maxdepth: 1 + + apparmor + LoadPin + SELinux + Smack + tomoyo + Yama diff --git a/Documentation/admin-guide/LSM/tomoyo.rst b/Documentation/admin-guide/LSM/tomoyo.rst new file mode 100644 index 000000000..e2d6b6e15 --- /dev/null +++ b/Documentation/admin-guide/LSM/tomoyo.rst @@ -0,0 +1,65 @@ +====== +TOMOYO +====== + +What is TOMOYO? +=============== + +TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel. + +LiveCD-based tutorials are available at + +http://tomoyo.sourceforge.jp/1.8/ubuntu12.04-live.html +http://tomoyo.sourceforge.jp/1.8/centos6-live.html + +Though these tutorials use non-LSM version of TOMOYO, they are useful for you +to know what TOMOYO is. + +How to enable TOMOYO? +===================== + +Build the kernel with ``CONFIG_SECURITY_TOMOYO=y`` and pass ``security=tomoyo`` on +kernel's command line. + +Please see http://tomoyo.osdn.jp/2.5/ for details. + +Where is documentation? +======================= + +User <-> Kernel interface documentation is available at +http://tomoyo.osdn.jp/2.5/policy-specification/index.html . + +Materials we prepared for seminars and symposiums are available at +http://osdn.jp/projects/tomoyo/docs/?category_id=532&language_id=1 . +Below lists are chosen from three aspects. + +What is TOMOYO? + TOMOYO Linux Overview + http://osdn.jp/projects/tomoyo/docs/lca2009-takeda.pdf + TOMOYO Linux: pragmatic and manageable security for Linux + http://osdn.jp/projects/tomoyo/docs/freedomhectaipei-tomoyo.pdf + TOMOYO Linux: A Practical Method to Understand and Protect Your Own Linux Box + http://osdn.jp/projects/tomoyo/docs/PacSec2007-en-no-demo.pdf + +What can TOMOYO do? + Deep inside TOMOYO Linux + http://osdn.jp/projects/tomoyo/docs/lca2009-kumaneko.pdf + The role of "pathname based access control" in security. + http://osdn.jp/projects/tomoyo/docs/lfj2008-bof.pdf + +History of TOMOYO? + Realities of Mainlining + http://osdn.jp/projects/tomoyo/docs/lfj2008.pdf + +What is future plan? +==================== + +We believe that inode based security and name based security are complementary +and both should be used together. But unfortunately, so far, we cannot enable +multiple LSM modules at the same time. We feel sorry that you have to give up +SELinux/SMACK/AppArmor etc. when you want to use TOMOYO. + +We hope that LSM becomes stackable in future. Meanwhile, you can use non-LSM +version of TOMOYO, available at http://tomoyo.osdn.jp/1.8/ . +LSM version of TOMOYO is a subset of non-LSM version of TOMOYO. We are planning +to port non-LSM version's functionalities to LSM versions. diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst new file mode 100644 index 000000000..15ea785b2 --- /dev/null +++ b/Documentation/admin-guide/README.rst @@ -0,0 +1,409 @@ +.. _readme: + +Linux kernel release 4.x <http://kernel.org/> +============================================= + +These are the release notes for Linux version 4. Read them carefully, +as they tell you what this is all about, explain how to install the +kernel, and what to do if something goes wrong. + +What is Linux? +-------------- + + Linux is a clone of the operating system Unix, written from scratch by + Linus Torvalds with assistance from a loosely-knit team of hackers across + the Net. It aims towards POSIX and Single UNIX Specification compliance. + + It has all the features you would expect in a modern fully-fledged Unix, + including true multitasking, virtual memory, shared libraries, demand + loading, shared copy-on-write executables, proper memory management, + and multistack networking including IPv4 and IPv6. + + It is distributed under the GNU General Public License v2 - see the + accompanying COPYING file for more details. + +On what hardware does it run? +----------------------------- + + Although originally developed first for 32-bit x86-based PCs (386 or higher), + today Linux also runs on (at least) the Compaq Alpha AXP, Sun SPARC and + UltraSPARC, Motorola 68000, PowerPC, PowerPC64, ARM, Hitachi SuperH, Cell, + IBM S/390, MIPS, HP PA-RISC, Intel IA-64, DEC VAX, AMD x86-64 Xtensa, and + ARC architectures. + + Linux is easily portable to most general-purpose 32- or 64-bit architectures + as long as they have a paged memory management unit (PMMU) and a port of the + GNU C compiler (gcc) (part of The GNU Compiler Collection, GCC). Linux has + also been ported to a number of architectures without a PMMU, although + functionality is then obviously somewhat limited. + Linux has also been ported to itself. You can now run the kernel as a + userspace application - this is called UserMode Linux (UML). + +Documentation +------------- + + - There is a lot of documentation available both in electronic form on + the Internet and in books, both Linux-specific and pertaining to + general UNIX questions. I'd recommend looking into the documentation + subdirectories on any Linux FTP site for the LDP (Linux Documentation + Project) books. This README is not meant to be documentation on the + system: there are much better sources available. + + - There are various README files in the Documentation/ subdirectory: + these typically contain kernel-specific installation notes for some + drivers for example. See Documentation/00-INDEX for a list of what + is contained in each file. Please read the + :ref:`Documentation/process/changes.rst <changes>` file, as it + contains information about the problems, which may result by upgrading + your kernel. + +Installing the kernel source +---------------------------- + + - If you install the full sources, put the kernel tarball in a + directory where you have permissions (e.g. your home directory) and + unpack it:: + + xz -cd linux-4.X.tar.xz | tar xvf - + + Replace "X" with the version number of the latest kernel. + + Do NOT use the /usr/src/linux area! This area has a (usually + incomplete) set of kernel headers that are used by the library header + files. They should match the library, and not get messed up by + whatever the kernel-du-jour happens to be. + + - You can also upgrade between 4.x releases by patching. Patches are + distributed in the xz format. To install by patching, get all the + newer patch files, enter the top level directory of the kernel source + (linux-4.X) and execute:: + + xz -cd ../patch-4.x.xz | patch -p1 + + Replace "x" for all versions bigger than the version "X" of your current + source tree, **in_order**, and you should be ok. You may want to remove + the backup files (some-file-name~ or some-file-name.orig), and make sure + that there are no failed patches (some-file-name# or some-file-name.rej). + If there are, either you or I have made a mistake. + + Unlike patches for the 4.x kernels, patches for the 4.x.y kernels + (also known as the -stable kernels) are not incremental but instead apply + directly to the base 4.x kernel. For example, if your base kernel is 4.0 + and you want to apply the 4.0.3 patch, you must not first apply the 4.0.1 + and 4.0.2 patches. Similarly, if you are running kernel version 4.0.2 and + want to jump to 4.0.3, you must first reverse the 4.0.2 patch (that is, + patch -R) **before** applying the 4.0.3 patch. You can read more on this in + :ref:`Documentation/process/applying-patches.rst <applying_patches>`. + + Alternatively, the script patch-kernel can be used to automate this + process. It determines the current kernel version and applies any + patches found:: + + linux/scripts/patch-kernel linux + + The first argument in the command above is the location of the + kernel source. Patches are applied from the current directory, but + an alternative directory can be specified as the second argument. + + - Make sure you have no stale .o files and dependencies lying around:: + + cd linux + make mrproper + + You should now have the sources correctly installed. + +Software requirements +--------------------- + + Compiling and running the 4.x kernels requires up-to-date + versions of various software packages. Consult + :ref:`Documentation/process/changes.rst <changes>` for the minimum version numbers + required and how to get updates for these packages. Beware that using + excessively old versions of these packages can cause indirect + errors that are very difficult to track down, so don't assume that + you can just update packages when obvious problems arise during + build or operation. + +Build directory for the kernel +------------------------------ + + When compiling the kernel, all output files will per default be + stored together with the kernel source code. + Using the option ``make O=output/dir`` allows you to specify an alternate + place for the output files (including .config). + Example:: + + kernel source code: /usr/src/linux-4.X + build directory: /home/name/build/kernel + + To configure and build the kernel, use:: + + cd /usr/src/linux-4.X + make O=/home/name/build/kernel menuconfig + make O=/home/name/build/kernel + sudo make O=/home/name/build/kernel modules_install install + + Please note: If the ``O=output/dir`` option is used, then it must be + used for all invocations of make. + +Configuring the kernel +---------------------- + + Do not skip this step even if you are only upgrading one minor + version. New configuration options are added in each release, and + odd problems will turn up if the configuration files are not set up + as expected. If you want to carry your existing configuration to a + new version with minimal work, use ``make oldconfig``, which will + only ask you for the answers to new questions. + + - Alternative configuration commands are:: + + "make config" Plain text interface. + + "make menuconfig" Text based color menus, radiolists & dialogs. + + "make nconfig" Enhanced text based color menus. + + "make xconfig" Qt based configuration tool. + + "make gconfig" GTK+ based configuration tool. + + "make oldconfig" Default all questions based on the contents of + your existing ./.config file and asking about + new config symbols. + + "make olddefconfig" + Like above, but sets new symbols to their default + values without prompting. + + "make defconfig" Create a ./.config file by using the default + symbol values from either arch/$ARCH/defconfig + or arch/$ARCH/configs/${PLATFORM}_defconfig, + depending on the architecture. + + "make ${PLATFORM}_defconfig" + Create a ./.config file by using the default + symbol values from + arch/$ARCH/configs/${PLATFORM}_defconfig. + Use "make help" to get a list of all available + platforms of your architecture. + + "make allyesconfig" + Create a ./.config file by setting symbol + values to 'y' as much as possible. + + "make allmodconfig" + Create a ./.config file by setting symbol + values to 'm' as much as possible. + + "make allnoconfig" Create a ./.config file by setting symbol + values to 'n' as much as possible. + + "make randconfig" Create a ./.config file by setting symbol + values to random values. + + "make localmodconfig" Create a config based on current config and + loaded modules (lsmod). Disables any module + option that is not needed for the loaded modules. + + To create a localmodconfig for another machine, + store the lsmod of that machine into a file + and pass it in as a LSMOD parameter. + + target$ lsmod > /tmp/mylsmod + target$ scp /tmp/mylsmod host:/tmp + + host$ make LSMOD=/tmp/mylsmod localmodconfig + + The above also works when cross compiling. + + "make localyesconfig" Similar to localmodconfig, except it will convert + all module options to built in (=y) options. + + "make kvmconfig" Enable additional options for kvm guest kernel support. + + "make xenconfig" Enable additional options for xen dom0 guest kernel + support. + + "make tinyconfig" Configure the tiniest possible kernel. + + You can find more information on using the Linux kernel config tools + in Documentation/kbuild/kconfig.txt. + + - NOTES on ``make config``: + + - Having unnecessary drivers will make the kernel bigger, and can + under some circumstances lead to problems: probing for a + nonexistent controller card may confuse your other controllers. + + - A kernel with math-emulation compiled in will still use the + coprocessor if one is present: the math emulation will just + never get used in that case. The kernel will be slightly larger, + but will work on different machines regardless of whether they + have a math coprocessor or not. + + - The "kernel hacking" configuration details usually result in a + bigger or slower kernel (or both), and can even make the kernel + less stable by configuring some routines to actively try to + break bad code to find kernel problems (kmalloc()). Thus you + should probably answer 'n' to the questions for "development", + "experimental", or "debugging" features. + +Compiling the kernel +-------------------- + + - Make sure you have at least gcc 3.2 available. + For more information, refer to :ref:`Documentation/process/changes.rst <changes>`. + + Please note that you can still run a.out user programs with this kernel. + + - Do a ``make`` to create a compressed kernel image. It is also + possible to do ``make install`` if you have lilo installed to suit the + kernel makefiles, but you may want to check your particular lilo setup first. + + To do the actual install, you have to be root, but none of the normal + build should require that. Don't take the name of root in vain. + + - If you configured any of the parts of the kernel as ``modules``, you + will also have to do ``make modules_install``. + + - Verbose kernel compile/build output: + + Normally, the kernel build system runs in a fairly quiet mode (but not + totally silent). However, sometimes you or other kernel developers need + to see compile, link, or other commands exactly as they are executed. + For this, use "verbose" build mode. This is done by passing + ``V=1`` to the ``make`` command, e.g.:: + + make V=1 all + + To have the build system also tell the reason for the rebuild of each + target, use ``V=2``. The default is ``V=0``. + + - Keep a backup kernel handy in case something goes wrong. This is + especially true for the development releases, since each new release + contains new code which has not been debugged. Make sure you keep a + backup of the modules corresponding to that kernel, as well. If you + are installing a new kernel with the same version number as your + working kernel, make a backup of your modules directory before you + do a ``make modules_install``. + + Alternatively, before compiling, use the kernel config option + "LOCALVERSION" to append a unique suffix to the regular kernel version. + LOCALVERSION can be set in the "General Setup" menu. + + - In order to boot your new kernel, you'll need to copy the kernel + image (e.g. .../linux/arch/x86/boot/bzImage after compilation) + to the place where your regular bootable kernel is found. + + - Booting a kernel directly from a floppy without the assistance of a + bootloader such as LILO, is no longer supported. + + If you boot Linux from the hard drive, chances are you use LILO, which + uses the kernel image as specified in the file /etc/lilo.conf. The + kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or + /boot/bzImage. To use the new kernel, save a copy of the old image + and copy the new image over the old one. Then, you MUST RERUN LILO + to update the loading map! If you don't, you won't be able to boot + the new kernel image. + + Reinstalling LILO is usually a matter of running /sbin/lilo. + You may wish to edit /etc/lilo.conf to specify an entry for your + old kernel image (say, /vmlinux.old) in case the new one does not + work. See the LILO docs for more information. + + After reinstalling LILO, you should be all set. Shutdown the system, + reboot, and enjoy! + + If you ever need to change the default root device, video mode, + ramdisk size, etc. in the kernel image, use the ``rdev`` program (or + alternatively the LILO boot options when appropriate). No need to + recompile the kernel to change these parameters. + + - Reboot with the new kernel and enjoy. + +If something goes wrong +----------------------- + + - If you have problems that seem to be due to kernel bugs, please check + the file MAINTAINERS to see if there is a particular person associated + with the part of the kernel that you are having trouble with. If there + isn't anyone listed there, then the second best thing is to mail + them to me (torvalds@linux-foundation.org), and possibly to any other + relevant mailing-list or to the newsgroup. + + - In all bug-reports, *please* tell what kernel you are talking about, + how to duplicate the problem, and what your setup is (use your common + sense). If the problem is new, tell me so, and if the problem is + old, please try to tell me when you first noticed it. + + - If the bug results in a message like:: + + unable to handle kernel paging request at address C0000010 + Oops: 0002 + EIP: 0010:XXXXXXXX + eax: xxxxxxxx ebx: xxxxxxxx ecx: xxxxxxxx edx: xxxxxxxx + esi: xxxxxxxx edi: xxxxxxxx ebp: xxxxxxxx + ds: xxxx es: xxxx fs: xxxx gs: xxxx + Pid: xx, process nr: xx + xx xx xx xx xx xx xx xx xx xx + + or similar kernel debugging information on your screen or in your + system log, please duplicate it *exactly*. The dump may look + incomprehensible to you, but it does contain information that may + help debugging the problem. The text above the dump is also + important: it tells something about why the kernel dumped code (in + the above example, it's due to a bad kernel pointer). More information + on making sense of the dump is in Documentation/admin-guide/bug-hunting.rst + + - If you compiled the kernel with CONFIG_KALLSYMS you can send the dump + as is, otherwise you will have to use the ``ksymoops`` program to make + sense of the dump (but compiling with CONFIG_KALLSYMS is usually preferred). + This utility can be downloaded from + https://www.kernel.org/pub/linux/utils/kernel/ksymoops/ . + Alternatively, you can do the dump lookup by hand: + + - In debugging dumps like the above, it helps enormously if you can + look up what the EIP value means. The hex value as such doesn't help + me or anybody else very much: it will depend on your particular + kernel setup. What you should do is take the hex value from the EIP + line (ignore the ``0010:``), and look it up in the kernel namelist to + see which kernel function contains the offending address. + + To find out the kernel function name, you'll need to find the system + binary associated with the kernel that exhibited the symptom. This is + the file 'linux/vmlinux'. To extract the namelist and match it against + the EIP from the kernel crash, do:: + + nm vmlinux | sort | less + + This will give you a list of kernel addresses sorted in ascending + order, from which it is simple to find the function that contains the + offending address. Note that the address given by the kernel + debugging messages will not necessarily match exactly with the + function addresses (in fact, that is very unlikely), so you can't + just 'grep' the list: the list will, however, give you the starting + point of each kernel function, so by looking for the function that + has a starting address lower than the one you are searching for but + is followed by a function with a higher address you will find the one + you want. In fact, it may be a good idea to include a bit of + "context" in your problem report, giving a few lines around the + interesting one. + + If you for some reason cannot do the above (you have a pre-compiled + kernel image or similar), telling me as much about your setup as + possible will help. Please read the :ref:`admin-guide/reporting-bugs.rst <reportingbugs>` + document for details. + + - Alternatively, you can use gdb on a running kernel. (read-only; i.e. you + cannot change values or set break points.) To do this, first compile the + kernel with -g; edit arch/x86/Makefile appropriately, then do a ``make + clean``. You'll also need to enable CONFIG_PROC_FS (via ``make config``). + + After you've rebooted with the new kernel, do ``gdb vmlinux /proc/kcore``. + You can now use all the usual gdb commands. The command to look up the + point where your system crashed is ``l *0xXXXXXXXX``. (Replace the XXXes + with the EIP value.) + + gdb'ing a non-running kernel currently fails because ``gdb`` (wrongly) + disregards the starting offset for which the kernel is compiled. diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst new file mode 100644 index 000000000..c0ce64d75 --- /dev/null +++ b/Documentation/admin-guide/bcache.rst @@ -0,0 +1,649 @@ +============================ +A block layer cache (bcache) +============================ + +Say you've got a big slow raid 6, and an ssd or three. Wouldn't it be +nice if you could use them as cache... Hence bcache. + +Wiki and git repositories are at: + + - http://bcache.evilpiepirate.org + - http://evilpiepirate.org/git/linux-bcache.git + - http://evilpiepirate.org/git/bcache-tools.git + +It's designed around the performance characteristics of SSDs - it only allocates +in erase block sized buckets, and it uses a hybrid btree/log to track cached +extents (which can be anywhere from a single sector to the bucket size). It's +designed to avoid random writes at all costs; it fills up an erase block +sequentially, then issues a discard before reusing it. + +Both writethrough and writeback caching are supported. Writeback defaults to +off, but can be switched on and off arbitrarily at runtime. Bcache goes to +great lengths to protect your data - it reliably handles unclean shutdown. (It +doesn't even have a notion of a clean shutdown; bcache simply doesn't return +writes as completed until they're on stable storage). + +Writeback caching can use most of the cache for buffering writes - writing +dirty data to the backing device is always done sequentially, scanning from the +start to the end of the index. + +Since random IO is what SSDs excel at, there generally won't be much benefit +to caching large sequential IO. Bcache detects sequential IO and skips it; +it also keeps a rolling average of the IO sizes per task, and as long as the +average is above the cutoff it will skip all IO from that task - instead of +caching the first 512k after every seek. Backups and large file copies should +thus entirely bypass the cache. + +In the event of a data IO error on the flash it will try to recover by reading +from disk or invalidating cache entries. For unrecoverable errors (meta data +or dirty data), caching is automatically disabled; if dirty data was present +in the cache it first disables writeback caching and waits for all dirty data +to be flushed. + +Getting started: +You'll need make-bcache from the bcache-tools repository. Both the cache device +and backing device must be formatted before use:: + + make-bcache -B /dev/sdb + make-bcache -C /dev/sdc + +make-bcache has the ability to format multiple devices at the same time - if +you format your backing devices and cache device at the same time, you won't +have to manually attach:: + + make-bcache -B /dev/sda /dev/sdb -C /dev/sdc + +bcache-tools now ships udev rules, and bcache devices are known to the kernel +immediately. Without udev, you can manually register devices like this:: + + echo /dev/sdb > /sys/fs/bcache/register + echo /dev/sdc > /sys/fs/bcache/register + +Registering the backing device makes the bcache device show up in /dev; you can +now format it and use it as normal. But the first time using a new bcache +device, it'll be running in passthrough mode until you attach it to a cache. +If you are thinking about using bcache later, it is recommended to setup all your +slow devices as bcache backing devices without a cache, and you can choose to add +a caching device later. +See 'ATTACHING' section below. + +The devices show up as:: + + /dev/bcache<N> + +As well as (with udev):: + + /dev/bcache/by-uuid/<uuid> + /dev/bcache/by-label/<label> + +To get started:: + + mkfs.ext4 /dev/bcache0 + mount /dev/bcache0 /mnt + +You can control bcache devices through sysfs at /sys/block/bcache<N>/bcache . +You can also control them through /sys/fs//bcache/<cset-uuid>/ . + +Cache devices are managed as sets; multiple caches per set isn't supported yet +but will allow for mirroring of metadata and dirty data in the future. Your new +cache set shows up as /sys/fs/bcache/<UUID> + +Attaching +--------- + +After your cache device and backing device are registered, the backing device +must be attached to your cache set to enable caching. Attaching a backing +device to a cache set is done thusly, with the UUID of the cache set in +/sys/fs/bcache:: + + echo <CSET-UUID> > /sys/block/bcache0/bcache/attach + +This only has to be done once. The next time you reboot, just reregister all +your bcache devices. If a backing device has data in a cache somewhere, the +/dev/bcache<N> device won't be created until the cache shows up - particularly +important if you have writeback caching turned on. + +If you're booting up and your cache device is gone and never coming back, you +can force run the backing device:: + + echo 1 > /sys/block/sdb/bcache/running + +(You need to use /sys/block/sdb (or whatever your backing device is called), not +/sys/block/bcache0, because bcache0 doesn't exist yet. If you're using a +partition, the bcache directory would be at /sys/block/sdb/sdb2/bcache) + +The backing device will still use that cache set if it shows up in the future, +but all the cached data will be invalidated. If there was dirty data in the +cache, don't expect the filesystem to be recoverable - you will have massive +filesystem corruption, though ext4's fsck does work miracles. + +Error Handling +-------------- + +Bcache tries to transparently handle IO errors to/from the cache device without +affecting normal operation; if it sees too many errors (the threshold is +configurable, and defaults to 0) it shuts down the cache device and switches all +the backing devices to passthrough mode. + + - For reads from the cache, if they error we just retry the read from the + backing device. + + - For writethrough writes, if the write to the cache errors we just switch to + invalidating the data at that lba in the cache (i.e. the same thing we do for + a write that bypasses the cache) + + - For writeback writes, we currently pass that error back up to the + filesystem/userspace. This could be improved - we could retry it as a write + that skips the cache so we don't have to error the write. + + - When we detach, we first try to flush any dirty data (if we were running in + writeback mode). It currently doesn't do anything intelligent if it fails to + read some of the dirty data, though. + + +Howto/cookbook +-------------- + +A) Starting a bcache with a missing caching device + +If registering the backing device doesn't help, it's already there, you just need +to force it to run without the cache:: + + host:~# echo /dev/sdb1 > /sys/fs/bcache/register + [ 119.844831] bcache: register_bcache() error opening /dev/sdb1: device already registered + +Next, you try to register your caching device if it's present. However +if it's absent, or registration fails for some reason, you can still +start your bcache without its cache, like so:: + + host:/sys/block/sdb/sdb1/bcache# echo 1 > running + +Note that this may cause data loss if you were running in writeback mode. + + +B) Bcache does not find its cache:: + + host:/sys/block/md5/bcache# echo 0226553a-37cf-41d5-b3ce-8b1e944543a8 > attach + [ 1933.455082] bcache: bch_cached_dev_attach() Couldn't find uuid for md5 in set + [ 1933.478179] bcache: __cached_dev_store() Can't attach 0226553a-37cf-41d5-b3ce-8b1e944543a8 + [ 1933.478179] : cache set not found + +In this case, the caching device was simply not registered at boot +or disappeared and came back, and needs to be (re-)registered:: + + host:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register + + +C) Corrupt bcache crashes the kernel at device registration time: + +This should never happen. If it does happen, then you have found a bug! +Please report it to the bcache development list: linux-bcache@vger.kernel.org + +Be sure to provide as much information that you can including kernel dmesg +output if available so that we may assist. + + +D) Recovering data without bcache: + +If bcache is not available in the kernel, a filesystem on the backing +device is still available at an 8KiB offset. So either via a loopdev +of the backing device created with --offset 8K, or any value defined by +--data-offset when you originally formatted bcache with `make-bcache`. + +For example:: + + losetup -o 8192 /dev/loop0 /dev/your_bcache_backing_dev + +This should present your unmodified backing device data in /dev/loop0 + +If your cache is in writethrough mode, then you can safely discard the +cache device without loosing data. + + +E) Wiping a cache device + +:: + + host:~# wipefs -a /dev/sdh2 + 16 bytes were erased at offset 0x1018 (bcache) + they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81 + +After you boot back with bcache enabled, you recreate the cache and attach it:: + + host:~# make-bcache -C /dev/sdh2 + UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045 + Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1 + version: 0 + nbuckets: 106874 + block_size: 1 + bucket_size: 1024 + nr_in_set: 1 + nr_this_dev: 0 + first_bucket: 1 + [ 650.511912] bcache: run_cache_set() invalidating existing data + [ 650.549228] bcache: register_cache() registered cache device sdh2 + +start backing device with missing cache:: + + host:/sys/block/md5/bcache# echo 1 > running + +attach new cache:: + + host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach + [ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1 + + +F) Remove or replace a caching device:: + + host:/sys/block/sda/sda7/bcache# echo 1 > detach + [ 695.872542] bcache: cached_dev_detach_finish() Caching disabled for sda7 + + host:~# wipefs -a /dev/nvme0n1p4 + wipefs: error: /dev/nvme0n1p4: probing initialization failed: Device or resource busy + Ooops, it's disabled, but not unregistered, so it's still protected + +We need to go and unregister it:: + + host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# ls -l cache0 + lrwxrwxrwx 1 root root 0 Feb 25 18:33 cache0 -> ../../../devices/pci0000:00/0000:00:1d.0/0000:70:00.0/nvme/nvme0/nvme0n1/nvme0n1p4/bcache/ + host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# echo 1 > stop + kernel: [ 917.041908] bcache: cache_set_free() Cache set b7ba27a1-2398-4649-8ae3-0959f57ba128 unregistered + +Now we can wipe it:: + + host:~# wipefs -a /dev/nvme0n1p4 + /dev/nvme0n1p4: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81 + + +G) dm-crypt and bcache + +First setup bcache unencrypted and then install dmcrypt on top of +/dev/bcache<N> This will work faster than if you dmcrypt both the backing +and caching devices and then install bcache on top. [benchmarks?] + + +H) Stop/free a registered bcache to wipe and/or recreate it + +Suppose that you need to free up all bcache references so that you can +fdisk run and re-register a changed partition table, which won't work +if there are any active backing or caching devices left on it: + +1) Is it present in /dev/bcache* ? (there are times where it won't be) + + If so, it's easy:: + + host:/sys/block/bcache0/bcache# echo 1 > stop + +2) But if your backing device is gone, this won't work:: + + host:/sys/block/bcache0# cd bcache + bash: cd: bcache: No such file or directory + + In this case, you may have to unregister the dmcrypt block device that + references this bcache to free it up:: + + host:~# dmsetup remove oldds1 + bcache: bcache_device_free() bcache0 stopped + bcache: cache_set_free() Cache set 5bc072a8-ab17-446d-9744-e247949913c1 unregistered + + This causes the backing bcache to be removed from /sys/fs/bcache and + then it can be reused. This would be true of any block device stacking + where bcache is a lower device. + +3) In other cases, you can also look in /sys/fs/bcache/:: + + host:/sys/fs/bcache# ls -l */{cache?,bdev?} + lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/ + lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/ + lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/ + + The device names will show which UUID is relevant, cd in that directory + and stop the cache:: + + host:/sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1# echo 1 > stop + + This will free up bcache references and let you reuse the partition for + other purposes. + + + +Troubleshooting performance +--------------------------- + +Bcache has a bunch of config options and tunables. The defaults are intended to +be reasonable for typical desktop and server workloads, but they're not what you +want for getting the best possible numbers when benchmarking. + + - Backing device alignment + + The default metadata size in bcache is 8k. If your backing device is + RAID based, then be sure to align this by a multiple of your stride + width using `make-bcache --data-offset`. If you intend to expand your + disk array in the future, then multiply a series of primes by your + raid stripe size to get the disk multiples that you would like. + + For example: If you have a 64k stripe size, then the following offset + would provide alignment for many common RAID5 data spindle counts:: + + 64k * 2*2*2*3*3*5*7 bytes = 161280k + + That space is wasted, but for only 157.5MB you can grow your RAID 5 + volume to the following data-spindle counts without re-aligning:: + + 3,4,5,6,7,8,9,10,12,14,15,18,20,21 ... + + - Bad write performance + + If write performance is not what you expected, you probably wanted to be + running in writeback mode, which isn't the default (not due to a lack of + maturity, but simply because in writeback mode you'll lose data if something + happens to your SSD):: + + # echo writeback > /sys/block/bcache0/bcache/cache_mode + + - Bad performance, or traffic not going to the SSD that you'd expect + + By default, bcache doesn't cache everything. It tries to skip sequential IO - + because you really want to be caching the random IO, and if you copy a 10 + gigabyte file you probably don't want that pushing 10 gigabytes of randomly + accessed data out of your cache. + + But if you want to benchmark reads from cache, and you start out with fio + writing an 8 gigabyte test file - so you want to disable that:: + + # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff + + To set it back to the default (4 mb), do:: + + # echo 4M > /sys/block/bcache0/bcache/sequential_cutoff + + - Traffic's still going to the spindle/still getting cache misses + + In the real world, SSDs don't always keep up with disks - particularly with + slower SSDs, many disks being cached by one SSD, or mostly sequential IO. So + you want to avoid being bottlenecked by the SSD and having it slow everything + down. + + To avoid that bcache tracks latency to the cache device, and gradually + throttles traffic if the latency exceeds a threshold (it does this by + cranking down the sequential bypass). + + You can disable this if you need to by setting the thresholds to 0:: + + # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us + # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us + + The default is 2000 us (2 milliseconds) for reads, and 20000 for writes. + + - Still getting cache misses, of the same data + + One last issue that sometimes trips people up is actually an old bug, due to + the way cache coherency is handled for cache misses. If a btree node is full, + a cache miss won't be able to insert a key for the new data and the data + won't be written to the cache. + + In practice this isn't an issue because as soon as a write comes along it'll + cause the btree node to be split, and you need almost no write traffic for + this to not show up enough to be noticeable (especially since bcache's btree + nodes are huge and index large regions of the device). But when you're + benchmarking, if you're trying to warm the cache by reading a bunch of data + and there's no other traffic - that can be a problem. + + Solution: warm the cache by doing writes, or use the testing branch (there's + a fix for the issue there). + + +Sysfs - backing device +---------------------- + +Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and +(if attached) /sys/fs/bcache/<cset-uuid>/bdev* + +attach + Echo the UUID of a cache set to this file to enable caching. + +cache_mode + Can be one of either writethrough, writeback, writearound or none. + +clear_stats + Writing to this file resets the running total stats (not the day/hour/5 minute + decaying versions). + +detach + Write to this file to detach from a cache set. If there is dirty data in the + cache, it will be flushed first. + +dirty_data + Amount of dirty data for this backing device in the cache. Continuously + updated unlike the cache set's version, but may be slightly off. + +label + Name of underlying device. + +readahead + Size of readahead that should be performed. Defaults to 0. If set to e.g. + 1M, it will round cache miss reads up to that size, but without overlapping + existing cache entries. + +running + 1 if bcache is running (i.e. whether the /dev/bcache device exists, whether + it's in passthrough mode or caching). + +sequential_cutoff + A sequential IO will bypass the cache once it passes this threshold; the + most recent 128 IOs are tracked so sequential IO can be detected even when + it isn't all done at once. + +sequential_merge + If non zero, bcache keeps a list of the last 128 requests submitted to compare + against all new requests to determine which new requests are sequential + continuations of previous requests for the purpose of determining sequential + cutoff. This is necessary if the sequential cutoff value is greater than the + maximum acceptable sequential size for any single request. + +state + The backing device can be in one of four different states: + + no cache: Has never been attached to a cache set. + + clean: Part of a cache set, and there is no cached dirty data. + + dirty: Part of a cache set, and there is cached dirty data. + + inconsistent: The backing device was forcibly run by the user when there was + dirty data cached but the cache set was unavailable; whatever data was on the + backing device has likely been corrupted. + +stop + Write to this file to shut down the bcache device and close the backing + device. + +writeback_delay + When dirty data is written to the cache and it previously did not contain + any, waits some number of seconds before initiating writeback. Defaults to + 30. + +writeback_percent + If nonzero, bcache tries to keep around this percentage of the cache dirty by + throttling background writeback and using a PD controller to smoothly adjust + the rate. + +writeback_rate + Rate in sectors per second - if writeback_percent is nonzero, background + writeback is throttled to this rate. Continuously adjusted by bcache but may + also be set by the user. + +writeback_running + If off, writeback of dirty data will not take place at all. Dirty data will + still be added to the cache until it is mostly full; only meant for + benchmarking. Defaults to on. + +Sysfs - backing device stats +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are directories with these numbers for a running total, as well as +versions that decay over the past day, hour and 5 minutes; they're also +aggregated in the cache set directory as well. + +bypassed + Amount of IO (both reads and writes) that has bypassed the cache + +cache_hits, cache_misses, cache_hit_ratio + Hits and misses are counted per individual IO as bcache sees them; a + partial hit is counted as a miss. + +cache_bypass_hits, cache_bypass_misses + Hits and misses for IO that is intended to skip the cache are still counted, + but broken out here. + +cache_miss_collisions + Counts instances where data was going to be inserted into the cache from a + cache miss, but raced with a write and data was already present (usually 0 + since the synchronization for cache misses was rewritten) + +cache_readaheads + Count of times readahead occurred. + +Sysfs - cache set +~~~~~~~~~~~~~~~~~ + +Available at /sys/fs/bcache/<cset-uuid> + +average_key_size + Average data per key in the btree. + +bdev<0..n> + Symlink to each of the attached backing devices. + +block_size + Block size of the cache devices. + +btree_cache_size + Amount of memory currently used by the btree cache + +bucket_size + Size of buckets + +cache<0..n> + Symlink to each of the cache devices comprising this cache set. + +cache_available_percent + Percentage of cache device which doesn't contain dirty data, and could + potentially be used for writeback. This doesn't mean this space isn't used + for clean cached data; the unused statistic (in priority_stats) is typically + much lower. + +clear_stats + Clears the statistics associated with this cache + +dirty_data + Amount of dirty data is in the cache (updated when garbage collection runs). + +flash_vol_create + Echoing a size to this file (in human readable units, k/M/G) creates a thinly + provisioned volume backed by the cache set. + +io_error_halflife, io_error_limit + These determines how many errors we accept before disabling the cache. + Each error is decayed by the half life (in # ios). If the decaying count + reaches io_error_limit dirty data is written out and the cache is disabled. + +journal_delay_ms + Journal writes will delay for up to this many milliseconds, unless a cache + flush happens sooner. Defaults to 100. + +root_usage_percent + Percentage of the root btree node in use. If this gets too high the node + will split, increasing the tree depth. + +stop + Write to this file to shut down the cache set - waits until all attached + backing devices have been shut down. + +tree_depth + Depth of the btree (A single node btree has depth 0). + +unregister + Detaches all backing devices and closes the cache devices; if dirty data is + present it will disable writeback caching and wait for it to be flushed. + +Sysfs - cache set internal +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This directory also exposes timings for a number of internal operations, with +separate files for average duration, average frequency, last occurrence and max +duration: garbage collection, btree read, btree node sorts and btree splits. + +active_journal_entries + Number of journal entries that are newer than the index. + +btree_nodes + Total nodes in the btree. + +btree_used_percent + Average fraction of btree in use. + +bset_tree_stats + Statistics about the auxiliary search trees + +btree_cache_max_chain + Longest chain in the btree node cache's hash table + +cache_read_races + Counts instances where while data was being read from the cache, the bucket + was reused and invalidated - i.e. where the pointer was stale after the read + completed. When this occurs the data is reread from the backing device. + +trigger_gc + Writing to this file forces garbage collection to run. + +Sysfs - Cache device +~~~~~~~~~~~~~~~~~~~~ + +Available at /sys/block/<cdev>/bcache + +block_size + Minimum granularity of writes - should match hardware sector size. + +btree_written + Sum of all btree writes, in (kilo/mega/giga) bytes + +bucket_size + Size of buckets + +cache_replacement_policy + One of either lru, fifo or random. + +discard + Boolean; if on a discard/TRIM will be issued to each bucket before it is + reused. Defaults to off, since SATA TRIM is an unqueued command (and thus + slow). + +freelist_percent + Size of the freelist as a percentage of nbuckets. Can be written to to + increase the number of buckets kept on the freelist, which lets you + artificially reduce the size of the cache at runtime. Mostly for testing + purposes (i.e. testing how different size caches affect your hit rate), but + since buckets are discarded when they move on to the freelist will also make + the SSD's garbage collection easier by effectively giving it more reserved + space. + +io_errors + Number of errors that have occurred, decayed by io_error_halflife. + +metadata_written + Sum of all non data writes (btree writes and all other metadata). + +nbuckets + Total buckets in this cache + +priority_stats + Statistics about how recently data in the cache has been accessed. + This can reveal your working set size. Unused is the percentage of + the cache that doesn't contain any data. Metadata is bcache's + metadata overhead. Average is the average priority of cache buckets. + Next is a list of quantiles with the priority threshold of each. + +written + Sum of all data that has been written to the cache; comparison with + btree_written gives the amount of write inflation in bcache. diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst new file mode 100644 index 000000000..97b0d7927 --- /dev/null +++ b/Documentation/admin-guide/binfmt-misc.rst @@ -0,0 +1,151 @@ +Kernel Support for miscellaneous (your favourite) Binary Formats v1.1 +===================================================================== + +This Kernel feature allows you to invoke almost (for restrictions see below) +every program by simply typing its name in the shell. +This includes for example compiled Java(TM), Python or Emacs programs. + +To achieve this you must tell binfmt_misc which interpreter has to be invoked +with which binary. Binfmt_misc recognises the binary-type by matching some bytes +at the beginning of the file with a magic byte sequence (masking out specified +bits) you have supplied. Binfmt_misc can also recognise a filename extension +aka ``.com`` or ``.exe``. + +First you must mount binfmt_misc:: + + mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc + +To actually register a new binary type, you have to set up a string looking like +``:name:type:offset:magic:mask:interpreter:flags`` (where you can choose the +``:`` upon your needs) and echo it to ``/proc/sys/fs/binfmt_misc/register``. + +Here is what the fields mean: + +- ``name`` + is an identifier string. A new /proc file will be created with this + ``name below /proc/sys/fs/binfmt_misc``; cannot contain slashes ``/`` for + obvious reasons. +- ``type`` + is the type of recognition. Give ``M`` for magic and ``E`` for extension. +- ``offset`` + is the offset of the magic/mask in the file, counted in bytes. This + defaults to 0 if you omit it (i.e. you write ``:name:type::magic...``). + Ignored when using filename extension matching. +- ``magic`` + is the byte sequence binfmt_misc is matching for. The magic string + may contain hex-encoded characters like ``\x0a`` or ``\xA4``. Note that you + must escape any NUL bytes; parsing halts at the first one. In a shell + environment you might have to write ``\\x0a`` to prevent the shell from + eating your ``\``. + If you chose filename extension matching, this is the extension to be + recognised (without the ``.``, the ``\x0a`` specials are not allowed). + Extension matching is case sensitive, and slashes ``/`` are not allowed! +- ``mask`` + is an (optional, defaults to all 0xff) mask. You can mask out some + bits from matching by supplying a string like magic and as long as magic. + The mask is anded with the byte sequence of the file. Note that you must + escape any NUL bytes; parsing halts at the first one. Ignored when using + filename extension matching. +- ``interpreter`` + is the program that should be invoked with the binary as first + argument (specify the full path) +- ``flags`` + is an optional field that controls several aspects of the invocation + of the interpreter. It is a string of capital letters, each controls a + certain aspect. The following flags are supported: + + ``P`` - preserve-argv[0] + Legacy behavior of binfmt_misc is to overwrite + the original argv[0] with the full path to the binary. When this + flag is included, binfmt_misc will add an argument to the argument + vector for this purpose, thus preserving the original ``argv[0]``. + e.g. If your interp is set to ``/bin/foo`` and you run ``blah`` + (which is in ``/usr/local/bin``), then the kernel will execute + ``/bin/foo`` with ``argv[]`` set to ``["/bin/foo", "/usr/local/bin/blah", "blah"]``. The interp has to be aware of this so it can + execute ``/usr/local/bin/blah`` + with ``argv[]`` set to ``["blah"]``. + ``O`` - open-binary + Legacy behavior of binfmt_misc is to pass the full path + of the binary to the interpreter as an argument. When this flag is + included, binfmt_misc will open the file for reading and pass its + descriptor as an argument, instead of the full path, thus allowing + the interpreter to execute non-readable binaries. This feature + should be used with care - the interpreter has to be trusted not to + emit the contents of the non-readable binary. + ``C`` - credentials + Currently, the behavior of binfmt_misc is to calculate + the credentials and security token of the new process according to + the interpreter. When this flag is included, these attributes are + calculated according to the binary. It also implies the ``O`` flag. + This feature should be used with care as the interpreter + will run with root permissions when a setuid binary owned by root + is run with binfmt_misc. + ``F`` - fix binary + The usual behaviour of binfmt_misc is to spawn the + binary lazily when the misc format file is invoked. However, + this doesn``t work very well in the face of mount namespaces and + changeroots, so the ``F`` mode opens the binary as soon as the + emulation is installed and uses the opened image to spawn the + emulator, meaning it is always available once installed, + regardless of how the environment changes. + + +There are some restrictions: + + - the whole register string may not exceed 1920 characters + - the magic must reside in the first 128 bytes of the file, i.e. + offset+size(magic) has to be less than 128 + - the interpreter string may not exceed 127 characters + +To use binfmt_misc you have to mount it first. You can mount it with +``mount -t binfmt_misc none /proc/sys/fs/binfmt_misc`` command, or you can add +a line ``none /proc/sys/fs/binfmt_misc binfmt_misc defaults 0 0`` to your +``/etc/fstab`` so it auto mounts on boot. + +You may want to add the binary formats in one of your ``/etc/rc`` scripts during +boot-up. Read the manual of your init program to figure out how to do this +right. + +Think about the order of adding entries! Later added entries are matched first! + + +A few examples (assumed you are in ``/proc/sys/fs/binfmt_misc``): + +- enable support for em86 (like binfmt_em86, for Alpha AXP only):: + + echo ':i386:M::\x7fELF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x03:\xff\xff\xff\xff\xff\xfe\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfb\xff\xff:/bin/em86:' > register + echo ':i486:M::\x7fELF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x06:\xff\xff\xff\xff\xff\xfe\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfb\xff\xff:/bin/em86:' > register + +- enable support for packed DOS applications (pre-configured dosemu hdimages):: + + echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register + +- enable support for Windows executables using wine:: + + echo ':DOSWin:M::MZ::/usr/local/bin/wine:' > register + +For java support see Documentation/admin-guide/java.rst + + +You can enable/disable binfmt_misc or one binary type by echoing 0 (to disable) +or 1 (to enable) to ``/proc/sys/fs/binfmt_misc/status`` or +``/proc/.../the_name``. +Catting the file tells you the current status of ``binfmt_misc/the_entry``. + +You can remove one entry or all entries by echoing -1 to ``/proc/.../the_name`` +or ``/proc/sys/fs/binfmt_misc/status``. + + +Hints +----- + +If you want to pass special arguments to your interpreter, you can +write a wrapper script for it. See Documentation/admin-guide/java.rst for an +example. + +Your interpreter should NOT look in the PATH for the filename; the kernel +passes it the full filename (or the file descriptor) to use. Using ``$PATH`` can +cause unexpected behaviour and can be a security hazard. + + +Richard Günther <rguenth@tat.physik.uni-tuebingen.de> diff --git a/Documentation/admin-guide/braille-console.rst b/Documentation/admin-guide/braille-console.rst new file mode 100644 index 000000000..18e79337d --- /dev/null +++ b/Documentation/admin-guide/braille-console.rst @@ -0,0 +1,38 @@ +Linux Braille Console +===================== + +To get early boot messages on a braille device (before userspace screen +readers can start), you first need to compile the support for the usual serial +console (see :ref:`Documentation/admin-guide/serial-console.rst <serial_console>`), and +for braille device +(in :menuselection:`Device Drivers --> Accessibility support --> Console on braille device`). + +Then you need to specify a ``console=brl``, option on the kernel command line, the +format is:: + + console=brl,serial_options... + +where ``serial_options...`` are the same as described in +:ref:`Documentation/admin-guide/serial-console.rst <serial_console>`. + +So for instance you can use ``console=brl,ttyS0`` if the braille device is connected to the first serial port, and ``console=brl,ttyS0,115200`` to +override the baud rate to 115200, etc. + +By default, the braille device will just show the last kernel message (console +mode). To review previous messages, press the Insert key to switch to the VT +review mode. In review mode, the arrow keys permit to browse in the VT content, +:kbd:`PAGE-UP`/:kbd:`PAGE-DOWN` keys go at the top/bottom of the screen, and +the :kbd:`HOME` key goes back +to the cursor, hence providing very basic screen reviewing facility. + +Sound feedback can be obtained by adding the ``braille_console.sound=1`` kernel +parameter. + +For simplicity, only one braille console can be enabled, other uses of +``console=brl,...`` will be discarded. Also note that it does not interfere with +the console selection mechanism described in +:ref:`Documentation/admin-guide/serial-console.rst <serial_console>`. + +For now, only the VisioBraille device is supported. + +Samuel Thibault <samuel.thibault@ens-lyon.org> diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst new file mode 100644 index 000000000..59567da34 --- /dev/null +++ b/Documentation/admin-guide/bug-bisect.rst @@ -0,0 +1,76 @@ +Bisecting a bug ++++++++++++++++ + +Last updated: 28 October 2016 + +Introduction +============ + +Always try the latest kernel from kernel.org and build from source. If you are +not confident in doing that please report the bug to your distribution vendor +instead of to a kernel developer. + +Finding bugs is not always easy. Have a go though. If you can't find it don't +give up. Report as much as you have found to the relevant maintainer. See +MAINTAINERS for who that is for the subsystem you have worked on. + +Before you submit a bug report read +:ref:`Documentation/admin-guide/reporting-bugs.rst <reportingbugs>`. + +Devices not appearing +===================== + +Often this is caused by udev/systemd. Check that first before blaming it +on the kernel. + +Finding patch that caused a bug +=============================== + +Using the provided tools with ``git`` makes finding bugs easy provided the bug +is reproducible. + +Steps to do it: + +- build the Kernel from its git source +- start bisect with [#f1]_:: + + $ git bisect start + +- mark the broken changeset with:: + + $ git bisect bad [commit] + +- mark a changeset where the code is known to work with:: + + $ git bisect good [commit] + +- rebuild the Kernel and test +- interact with git bisect by using either:: + + $ git bisect good + + or:: + + $ git bisect bad + + depending if the bug happened on the changeset you're testing +- After some interactions, git bisect will give you the changeset that + likely caused the bug. + +- For example, if you know that the current version is bad, and version + 4.8 is good, you could do:: + + $ git bisect start + $ git bisect bad # Current version is bad + $ git bisect good v4.8 + + +.. [#f1] You can, optionally, provide both good and bad arguments at git + start with ``git bisect start [BAD] [GOOD]`` + +For further references, please read: + +- The man page for ``git-bisect`` +- `Fighting regressions with git bisect <https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html>`_ +- `Fully automated bisecting with "git bisect run" <https://lwn.net/Articles/317154>`_ +- `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_ diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst new file mode 100644 index 000000000..f278b289e --- /dev/null +++ b/Documentation/admin-guide/bug-hunting.rst @@ -0,0 +1,369 @@ +Bug hunting +=========== + +Kernel bug reports often come with a stack dump like the one below:: + + ------------[ cut here ]------------ + WARNING: CPU: 1 PID: 28102 at kernel/module.c:1108 module_put+0x57/0x70 + Modules linked in: dvb_usb_gp8psk(-) dvb_usb dvb_core nvidia_drm(PO) nvidia_modeset(PO) snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore nvidia(PO) [last unloaded: rc_core] + CPU: 1 PID: 28102 Comm: rmmod Tainted: P WC O 4.8.4-build.1 #1 + Hardware name: MSI MS-7309/MS-7309, BIOS V1.12 02/23/2009 + 00000000 c12ba080 00000000 00000000 c103ed6a c1616014 00000001 00006dc6 + c1615862 00000454 c109e8a7 c109e8a7 00000009 ffffffff 00000000 f13f6a10 + f5f5a600 c103ee33 00000009 00000000 00000000 c109e8a7 f80ca4d0 c109f617 + Call Trace: + [<c12ba080>] ? dump_stack+0x44/0x64 + [<c103ed6a>] ? __warn+0xfa/0x120 + [<c109e8a7>] ? module_put+0x57/0x70 + [<c109e8a7>] ? module_put+0x57/0x70 + [<c103ee33>] ? warn_slowpath_null+0x23/0x30 + [<c109e8a7>] ? module_put+0x57/0x70 + [<f80ca4d0>] ? gp8psk_fe_set_frontend+0x460/0x460 [dvb_usb_gp8psk] + [<c109f617>] ? symbol_put_addr+0x27/0x50 + [<f80bc9ca>] ? dvb_usb_adapter_frontend_exit+0x3a/0x70 [dvb_usb] + [<f80bb3bf>] ? dvb_usb_exit+0x2f/0xd0 [dvb_usb] + [<c13d03bc>] ? usb_disable_endpoint+0x7c/0xb0 + [<f80bb48a>] ? dvb_usb_device_exit+0x2a/0x50 [dvb_usb] + [<c13d2882>] ? usb_unbind_interface+0x62/0x250 + [<c136b514>] ? __pm_runtime_idle+0x44/0x70 + [<c13620d8>] ? __device_release_driver+0x78/0x120 + [<c1362907>] ? driver_detach+0x87/0x90 + [<c1361c48>] ? bus_remove_driver+0x38/0x90 + [<c13d1c18>] ? usb_deregister+0x58/0xb0 + [<c109fbb0>] ? SyS_delete_module+0x130/0x1f0 + [<c1055654>] ? task_work_run+0x64/0x80 + [<c1000fa5>] ? exit_to_usermode_loop+0x85/0x90 + [<c10013f0>] ? do_fast_syscall_32+0x80/0x130 + [<c1549f43>] ? sysenter_past_esp+0x40/0x6a + ---[ end trace 6ebc60ef3981792f ]--- + +Such stack traces provide enough information to identify the line inside the +Kernel's source code where the bug happened. Depending on the severity of +the issue, it may also contain the word **Oops**, as on this one:: + + BUG: unable to handle kernel NULL pointer dereference at (null) + IP: [<c06969d4>] iret_exc+0x7d0/0xa59 + *pdpt = 000000002258a001 *pde = 0000000000000000 + Oops: 0002 [#1] PREEMPT SMP + ... + +Despite being an **Oops** or some other sort of stack trace, the offended +line is usually required to identify and handle the bug. Along this chapter, +we'll refer to "Oops" for all kinds of stack traces that need to be analized. + +.. note:: + + ``ksymoops`` is useless on 2.6 or upper. Please use the Oops in its original + format (from ``dmesg``, etc). Ignore any references in this or other docs to + "decoding the Oops" or "running it through ksymoops". + If you post an Oops from 2.6+ that has been run through ``ksymoops``, + people will just tell you to repost it. + +Where is the Oops message is located? +------------------------------------- + +Normally the Oops text is read from the kernel buffers by klogd and +handed to ``syslogd`` which writes it to a syslog file, typically +``/var/log/messages`` (depends on ``/etc/syslog.conf``). On systems with +systemd, it may also be stored by the ``journald`` daemon, and accessed +by running ``journalctl`` command. + +Sometimes ``klogd`` dies, in which case you can run ``dmesg > file`` to +read the data from the kernel buffers and save it. Or you can +``cat /proc/kmsg > file``, however you have to break in to stop the transfer, +``kmsg`` is a "never ending file". + +If the machine has crashed so badly that you cannot enter commands or +the disk is not available then you have three options: + +(1) Hand copy the text from the screen and type it in after the machine + has restarted. Messy but it is the only option if you have not + planned for a crash. Alternatively, you can take a picture of + the screen with a digital camera - not nice, but better than + nothing. If the messages scroll off the top of the console, you + may find that booting with a higher resolution (eg, ``vga=791``) + will allow you to read more of the text. (Caveat: This needs ``vesafb``, + so won't help for 'early' oopses) + +(2) Boot with a serial console (see + :ref:`Documentation/admin-guide/serial-console.rst <serial_console>`), + run a null modem to a second machine and capture the output there + using your favourite communication program. Minicom works well. + +(3) Use Kdump (see Documentation/kdump/kdump.txt), + extract the kernel ring buffer from old memory with using dmesg + gdbmacro in Documentation/kdump/gdbmacros.txt. + +Finding the bug's location +-------------------------- + +Reporting a bug works best if you point the location of the bug at the +Kernel source file. There are two methods for doing that. Usually, using +``gdb`` is easier, but the Kernel should be pre-compiled with debug info. + +gdb +^^^ + +The GNU debug (``gdb``) is the best way to figure out the exact file and line +number of the OOPS from the ``vmlinux`` file. + +The usage of gdb works best on a kernel compiled with ``CONFIG_DEBUG_INFO``. +This can be set by running:: + + $ ./scripts/config -d COMPILE_TEST -e DEBUG_KERNEL -e DEBUG_INFO + +On a kernel compiled with ``CONFIG_DEBUG_INFO``, you can simply copy the +EIP value from the OOPS:: + + EIP: 0060:[<c021e50e>] Not tainted VLI + +And use GDB to translate that to human-readable form:: + + $ gdb vmlinux + (gdb) l *0xc021e50e + +If you don't have ``CONFIG_DEBUG_INFO`` enabled, you use the function +offset from the OOPS:: + + EIP is at vt_ioctl+0xda8/0x1482 + +And recompile the kernel with ``CONFIG_DEBUG_INFO`` enabled:: + + $ ./scripts/config -d COMPILE_TEST -e DEBUG_KERNEL -e DEBUG_INFO + $ make vmlinux + $ gdb vmlinux + (gdb) l *vt_ioctl+0xda8 + 0x1888 is in vt_ioctl (drivers/tty/vt/vt_ioctl.c:293). + 288 { + 289 struct vc_data *vc = NULL; + 290 int ret = 0; + 291 + 292 console_lock(); + 293 if (VT_BUSY(vc_num)) + 294 ret = -EBUSY; + 295 else if (vc_num) + 296 vc = vc_deallocate(vc_num); + 297 console_unlock(); + +or, if you want to be more verbose:: + + (gdb) p vt_ioctl + $1 = {int (struct tty_struct *, unsigned int, unsigned long)} 0xae0 <vt_ioctl> + (gdb) l *0xae0+0xda8 + +You could, instead, use the object file:: + + $ make drivers/tty/ + $ gdb drivers/tty/vt/vt_ioctl.o + (gdb) l *vt_ioctl+0xda8 + +If you have a call trace, such as:: + + Call Trace: + [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5 + [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e + [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee + ... + +this shows the problem likely in the :jbd: module. You can load that module +in gdb and list the relevant code:: + + $ gdb fs/jbd/jbd.ko + (gdb) l *log_wait_commit+0xa3 + +.. note:: + + You can also do the same for any function call at the stack trace, + like this one:: + + [<f80bc9ca>] ? dvb_usb_adapter_frontend_exit+0x3a/0x70 [dvb_usb] + + The position where the above call happened can be seen with:: + + $ gdb drivers/media/usb/dvb-usb/dvb-usb.o + (gdb) l *dvb_usb_adapter_frontend_exit+0x3a + +objdump +^^^^^^^ + +To debug a kernel, use objdump and look for the hex offset from the crash +output to find the valid line of code/assembler. Without debug symbols, you +will see the assembler code for the routine shown, but if your kernel has +debug symbols the C code will also be available. (Debug symbols can be enabled +in the kernel hacking menu of the menu configuration.) For example:: + + $ objdump -r -S -l --disassemble net/dccp/ipv4.o + +.. note:: + + You need to be at the top level of the kernel tree for this to pick up + your C files. + +If you don't have access to the code you can also debug on some crash dumps +e.g. crash dump output as shown by Dave Miller:: + + EIP is at +0x14/0x4c0 + ... + Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00 + 00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08 + <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85 + + Put the bytes into a "foo.s" file like this: + + .text + .globl foo + foo: + .byte .... /* bytes from Code: part of OOPS dump */ + + Compile it with "gcc -c -o foo.o foo.s" then look at the output of + "objdump --disassemble foo.o". + + Output: + + ip_queue_xmit: + push %ebp + push %edi + push %esi + push %ebx + sub $0xbc, %esp + mov 0xd0(%esp), %ebp ! %ebp = arg0 (skb) + mov 0x8(%ebp), %ebx ! %ebx = skb->sk + mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt + +Reporting the bug +----------------- + +Once you find where the bug happened, by inspecting its location, +you could either try to fix it yourself or report it upstream. + +In order to report it upstream, you should identify the mailing list +used for the development of the affected code. This can be done by using +the ``get_maintainer.pl`` script. + +For example, if you find a bug at the gspca's sonixj.c file, you can get +their maintainers with:: + + $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c + Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) + Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%) + Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%) + Bhaktipriya Shridhar <bhaktipriya96@gmail.com> (commit_signer:1/1=100%,authored:1/1=100%,added_lines:4/4=100%,removed_lines:9/9=100%) + linux-media@vger.kernel.org (open list:GSPCA USB WEBCAM DRIVER) + linux-kernel@vger.kernel.org (open list) + +Please notice that it will point to: + +- The last developers that touched on the source code. On the above example, + Tejun and Bhaktipriya (in this specific case, none really envolved on the + development of this file); +- The driver maintainer (Hans Verkuil); +- The subsystem maintainer (Mauro Carvalho Chehab); +- The driver and/or subsystem mailing list (linux-media@vger.kernel.org); +- the Linux Kernel mailing list (linux-kernel@vger.kernel.org). + +Usually, the fastest way to have your bug fixed is to report it to mailing +list used for the development of the code (linux-media ML) copying the driver maintainer (Hans). + +If you are totally stumped as to whom to send the report, and +``get_maintainer.pl`` didn't provide you anything useful, send it to +linux-kernel@vger.kernel.org. + +Thanks for your help in making Linux as stable as humanly possible. + +Fixing the bug +-------------- + +If you know programming, you could help us by not only reporting the bug, +but also providing us with a solution. After all, open source is about +sharing what you do and don't you want to be recognised for your genius? + +If you decide to take this way, once you have worked out a fix please submit +it upstream. + +Please do read +:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` though +to help your code get accepted. + + +--------------------------------------------------------------------------- + +Notes on Oops tracing with ``klogd`` +------------------------------------ + +In order to help Linus and the other kernel developers there has been +substantial support incorporated into ``klogd`` for processing protection +faults. In order to have full support for address resolution at least +version 1.3-pl3 of the ``sysklogd`` package should be used. + +When a protection fault occurs the ``klogd`` daemon automatically +translates important addresses in the kernel log messages to their +symbolic equivalents. This translated kernel message is then +forwarded through whatever reporting mechanism ``klogd`` is using. The +protection fault message can be simply cut out of the message files +and forwarded to the kernel developers. + +Two types of address resolution are performed by ``klogd``. The first is +static translation and the second is dynamic translation. Static +translation uses the System.map file in much the same manner that +ksymoops does. In order to do static translation the ``klogd`` daemon +must be able to find a system map file at daemon initialization time. +See the klogd man page for information on how ``klogd`` searches for map +files. + +Dynamic address translation is important when kernel loadable modules +are being used. Since memory for kernel modules is allocated from the +kernel's dynamic memory pools there are no fixed locations for either +the start of the module or for functions and symbols in the module. + +The kernel supports system calls which allow a program to determine +which modules are loaded and their location in memory. Using these +system calls the klogd daemon builds a symbol table which can be used +to debug a protection fault which occurs in a loadable kernel module. + +At the very minimum klogd will provide the name of the module which +generated the protection fault. There may be additional symbolic +information available if the developer of the loadable module chose to +export symbol information from the module. + +Since the kernel module environment can be dynamic there must be a +mechanism for notifying the ``klogd`` daemon when a change in module +environment occurs. There are command line options available which +allow klogd to signal the currently executing daemon that symbol +information should be refreshed. See the ``klogd`` manual page for more +information. + +A patch is included with the sysklogd distribution which modifies the +``modules-2.0.0`` package to automatically signal klogd whenever a module +is loaded or unloaded. Applying this patch provides essentially +seamless support for debugging protection faults which occur with +kernel loadable modules. + +The following is an example of a protection fault in a loadable module +processed by ``klogd``:: + + Aug 29 09:51:01 blizard kernel: Unable to handle kernel paging request at virtual address f15e97cc + Aug 29 09:51:01 blizard kernel: current->tss.cr3 = 0062d000, %cr3 = 0062d000 + Aug 29 09:51:01 blizard kernel: *pde = 00000000 + Aug 29 09:51:01 blizard kernel: Oops: 0002 + Aug 29 09:51:01 blizard kernel: CPU: 0 + Aug 29 09:51:01 blizard kernel: EIP: 0010:[oops:_oops+16/3868] + Aug 29 09:51:01 blizard kernel: EFLAGS: 00010212 + Aug 29 09:51:01 blizard kernel: eax: 315e97cc ebx: 003a6f80 ecx: 001be77b edx: 00237c0c + Aug 29 09:51:01 blizard kernel: esi: 00000000 edi: bffffdb3 ebp: 00589f90 esp: 00589f8c + Aug 29 09:51:01 blizard kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018 + Aug 29 09:51:01 blizard kernel: Process oops_test (pid: 3374, process nr: 21, stackpage=00589000) + Aug 29 09:51:01 blizard kernel: Stack: 315e97cc 00589f98 0100b0b4 bffffed4 0012e38e 00240c64 003a6f80 00000001 + Aug 29 09:51:01 blizard kernel: 00000000 00237810 bfffff00 0010a7fa 00000003 00000001 00000000 bfffff00 + Aug 29 09:51:01 blizard kernel: bffffdb3 bffffed4 ffffffda 0000002b 0007002b 0000002b 0000002b 00000036 + Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128] + Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3 + +--------------------------------------------------------------------------- + +:: + + Dr. G.W. Wettstein Oncology Research Div. Computing Facility + Roger Maris Cancer Center INTERNET: greg@wind.rmcc.com + 820 4th St. N. + Fargo, ND 58122 + Phone: 701-234-7556 diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst new file mode 100644 index 000000000..184193bcb --- /dev/null +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -0,0 +1,2142 @@ +================ +Control Group v2 +================ + +:Date: October, 2015 +:Author: Tejun Heo <tj@kernel.org> + +This is the authoritative documentation on the design, interface and +conventions of cgroup v2. It describes all userland-visible aspects +of cgroup including core and specific controller behaviors. All +future changes must be reflected in this document. Documentation for +v1 is available under Documentation/cgroup-v1/. + +.. CONTENTS + + 1. Introduction + 1-1. Terminology + 1-2. What is cgroup? + 2. Basic Operations + 2-1. Mounting + 2-2. Organizing Processes and Threads + 2-2-1. Processes + 2-2-2. Threads + 2-3. [Un]populated Notification + 2-4. Controlling Controllers + 2-4-1. Enabling and Disabling + 2-4-2. Top-down Constraint + 2-4-3. No Internal Process Constraint + 2-5. Delegation + 2-5-1. Model of Delegation + 2-5-2. Delegation Containment + 2-6. Guidelines + 2-6-1. Organize Once and Control + 2-6-2. Avoid Name Collisions + 3. Resource Distribution Models + 3-1. Weights + 3-2. Limits + 3-3. Protections + 3-4. Allocations + 4. Interface Files + 4-1. Format + 4-2. Conventions + 4-3. Core Interface Files + 5. Controllers + 5-1. CPU + 5-1-1. CPU Interface Files + 5-2. Memory + 5-2-1. Memory Interface Files + 5-2-2. Usage Guidelines + 5-2-3. Memory Ownership + 5-3. IO + 5-3-1. IO Interface Files + 5-3-2. Writeback + 5-3-3. IO Latency + 5-3-3-1. How IO Latency Throttling Works + 5-3-3-2. IO Latency Interface Files + 5-4. PID + 5-4-1. PID Interface Files + 5-5. Device + 5-6. RDMA + 5-6-1. RDMA Interface Files + 5-7. Misc + 5-7-1. perf_event + 5-N. Non-normative information + 5-N-1. CPU controller root cgroup process behaviour + 5-N-2. IO controller root cgroup process behaviour + 6. Namespace + 6-1. Basics + 6-2. The Root and Views + 6-3. Migration and setns(2) + 6-4. Interaction with Other Namespaces + P. Information on Kernel Programming + P-1. Filesystem Support for Writeback + D. Deprecated v1 Core Features + R. Issues with v1 and Rationales for v2 + R-1. Multiple Hierarchies + R-2. Thread Granularity + R-3. Competition Between Inner Nodes and Threads + R-4. Other Interface Issues + R-5. Controller Issues and Remedies + R-5-1. Memory + + +Introduction +============ + +Terminology +----------- + +"cgroup" stands for "control group" and is never capitalized. The +singular form is used to designate the whole feature and also as a +qualifier as in "cgroup controllers". When explicitly referring to +multiple individual control groups, the plural form "cgroups" is used. + + +What is cgroup? +--------------- + +cgroup is a mechanism to organize processes hierarchically and +distribute system resources along the hierarchy in a controlled and +configurable manner. + +cgroup is largely composed of two parts - the core and controllers. +cgroup core is primarily responsible for hierarchically organizing +processes. A cgroup controller is usually responsible for +distributing a specific type of system resource along the hierarchy +although there are utility controllers which serve purposes other than +resource distribution. + +cgroups form a tree structure and every process in the system belongs +to one and only one cgroup. All threads of a process belong to the +same cgroup. On creation, all processes are put in the cgroup that +the parent process belongs to at the time. A process can be migrated +to another cgroup. Migration of a process doesn't affect already +existing descendant processes. + +Following certain structural constraints, controllers may be enabled or +disabled selectively on a cgroup. All controller behaviors are +hierarchical - if a controller is enabled on a cgroup, it affects all +processes which belong to the cgroups consisting the inclusive +sub-hierarchy of the cgroup. When a controller is enabled on a nested +cgroup, it always restricts the resource distribution further. The +restrictions set closer to the root in the hierarchy can not be +overridden from further away. + + +Basic Operations +================ + +Mounting +-------- + +Unlike v1, cgroup v2 has only single hierarchy. The cgroup v2 +hierarchy can be mounted with the following mount command:: + + # mount -t cgroup2 none $MOUNT_POINT + +cgroup2 filesystem has the magic number 0x63677270 ("cgrp"). All +controllers which support v2 and are not bound to a v1 hierarchy are +automatically bound to the v2 hierarchy and show up at the root. +Controllers which are not in active use in the v2 hierarchy can be +bound to other hierarchies. This allows mixing v2 hierarchy with the +legacy v1 multiple hierarchies in a fully backward compatible way. + +A controller can be moved across hierarchies only after the controller +is no longer referenced in its current hierarchy. Because per-cgroup +controller states are destroyed asynchronously and controllers may +have lingering references, a controller may not show up immediately on +the v2 hierarchy after the final umount of the previous hierarchy. +Similarly, a controller should be fully disabled to be moved out of +the unified hierarchy and it may take some time for the disabled +controller to become available for other hierarchies; furthermore, due +to inter-controller dependencies, other controllers may need to be +disabled too. + +While useful for development and manual configurations, moving +controllers dynamically between the v2 and other hierarchies is +strongly discouraged for production use. It is recommended to decide +the hierarchies and controller associations before starting using the +controllers after system boot. + +During transition to v2, system management software might still +automount the v1 cgroup filesystem and so hijack all controllers +during boot, before manual intervention is possible. To make testing +and experimenting easier, the kernel parameter cgroup_no_v1= allows +disabling controllers in v1 and make them always available in v2. + +cgroup v2 currently supports the following mount options. + + nsdelegate + + Consider cgroup namespaces as delegation boundaries. This + option is system wide and can only be set on mount or modified + through remount from the init namespace. The mount option is + ignored on non-init namespace mounts. Please refer to the + Delegation section for details. + + +Organizing Processes and Threads +-------------------------------- + +Processes +~~~~~~~~~ + +Initially, only the root cgroup exists to which all processes belong. +A child cgroup can be created by creating a sub-directory:: + + # mkdir $CGROUP_NAME + +A given cgroup may have multiple child cgroups forming a tree +structure. Each cgroup has a read-writable interface file +"cgroup.procs". When read, it lists the PIDs of all processes which +belong to the cgroup one-per-line. The PIDs are not ordered and the +same PID may show up more than once if the process got moved to +another cgroup and then back or the PID got recycled while reading. + +A process can be migrated into a cgroup by writing its PID to the +target cgroup's "cgroup.procs" file. Only one process can be migrated +on a single write(2) call. If a process is composed of multiple +threads, writing the PID of any thread migrates all threads of the +process. + +When a process forks a child process, the new process is born into the +cgroup that the forking process belongs to at the time of the +operation. After exit, a process stays associated with the cgroup +that it belonged to at the time of exit until it's reaped; however, a +zombie process does not appear in "cgroup.procs" and thus can't be +moved to another cgroup. + +A cgroup which doesn't have any children or live processes can be +destroyed by removing the directory. Note that a cgroup which doesn't +have any children and is associated only with zombie processes is +considered empty and can be removed:: + + # rmdir $CGROUP_NAME + +"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy +cgroup is in use in the system, this file may contain multiple lines, +one for each hierarchy. The entry for cgroup v2 is always in the +format "0::$PATH":: + + # cat /proc/842/cgroup + ... + 0::/test-cgroup/test-cgroup-nested + +If the process becomes a zombie and the cgroup it was associated with +is removed subsequently, " (deleted)" is appended to the path:: + + # cat /proc/842/cgroup + ... + 0::/test-cgroup/test-cgroup-nested (deleted) + + +Threads +~~~~~~~ + +cgroup v2 supports thread granularity for a subset of controllers to +support use cases requiring hierarchical resource distribution across +the threads of a group of processes. By default, all threads of a +process belong to the same cgroup, which also serves as the resource +domain to host resource consumptions which are not specific to a +process or thread. The thread mode allows threads to be spread across +a subtree while still maintaining the common resource domain for them. + +Controllers which support thread mode are called threaded controllers. +The ones which don't are called domain controllers. + +Marking a cgroup threaded makes it join the resource domain of its +parent as a threaded cgroup. The parent may be another threaded +cgroup whose resource domain is further up in the hierarchy. The root +of a threaded subtree, that is, the nearest ancestor which is not +threaded, is called threaded domain or thread root interchangeably and +serves as the resource domain for the entire subtree. + +Inside a threaded subtree, threads of a process can be put in +different cgroups and are not subject to the no internal process +constraint - threaded controllers can be enabled on non-leaf cgroups +whether they have threads in them or not. + +As the threaded domain cgroup hosts all the domain resource +consumptions of the subtree, it is considered to have internal +resource consumptions whether there are processes in it or not and +can't have populated child cgroups which aren't threaded. Because the +root cgroup is not subject to no internal process constraint, it can +serve both as a threaded domain and a parent to domain cgroups. + +The current operation mode or type of the cgroup is shown in the +"cgroup.type" file which indicates whether the cgroup is a normal +domain, a domain which is serving as the domain of a threaded subtree, +or a threaded cgroup. + +On creation, a cgroup is always a domain cgroup and can be made +threaded by writing "threaded" to the "cgroup.type" file. The +operation is single direction:: + + # echo threaded > cgroup.type + +Once threaded, the cgroup can't be made a domain again. To enable the +thread mode, the following conditions must be met. + +- As the cgroup will join the parent's resource domain. The parent + must either be a valid (threaded) domain or a threaded cgroup. + +- When the parent is an unthreaded domain, it must not have any domain + controllers enabled or populated domain children. The root is + exempt from this requirement. + +Topology-wise, a cgroup can be in an invalid state. Please consider +the following topology:: + + A (threaded domain) - B (threaded) - C (domain, just created) + +C is created as a domain but isn't connected to a parent which can +host child domains. C can't be used until it is turned into a +threaded cgroup. "cgroup.type" file will report "domain (invalid)" in +these cases. Operations which fail due to invalid topology use +EOPNOTSUPP as the errno. + +A domain cgroup is turned into a threaded domain when one of its child +cgroup becomes threaded or threaded controllers are enabled in the +"cgroup.subtree_control" file while there are processes in the cgroup. +A threaded domain reverts to a normal domain when the conditions +clear. + +When read, "cgroup.threads" contains the list of the thread IDs of all +threads in the cgroup. Except that the operations are per-thread +instead of per-process, "cgroup.threads" has the same format and +behaves the same way as "cgroup.procs". While "cgroup.threads" can be +written to in any cgroup, as it can only move threads inside the same +threaded domain, its operations are confined inside each threaded +subtree. + +The threaded domain cgroup serves as the resource domain for the whole +subtree, and, while the threads can be scattered across the subtree, +all the processes are considered to be in the threaded domain cgroup. +"cgroup.procs" in a threaded domain cgroup contains the PIDs of all +processes in the subtree and is not readable in the subtree proper. +However, "cgroup.procs" can be written to from anywhere in the subtree +to migrate all threads of the matching process to the cgroup. + +Only threaded controllers can be enabled in a threaded subtree. When +a threaded controller is enabled inside a threaded subtree, it only +accounts for and controls resource consumptions associated with the +threads in the cgroup and its descendants. All consumptions which +aren't tied to a specific thread belong to the threaded domain cgroup. + +Because a threaded subtree is exempt from no internal process +constraint, a threaded controller must be able to handle competition +between threads in a non-leaf cgroup and its child cgroups. Each +threaded controller defines how such competitions are handled. + + +[Un]populated Notification +-------------------------- + +Each non-root cgroup has a "cgroup.events" file which contains +"populated" field indicating whether the cgroup's sub-hierarchy has +live processes in it. Its value is 0 if there is no live process in +the cgroup and its descendants; otherwise, 1. poll and [id]notify +events are triggered when the value changes. This can be used, for +example, to start a clean-up operation after all processes of a given +sub-hierarchy have exited. The populated state updates and +notifications are recursive. Consider the following sub-hierarchy +where the numbers in the parentheses represent the numbers of processes +in each cgroup:: + + A(4) - B(0) - C(1) + \ D(0) + +A, B and C's "populated" fields would be 1 while D's 0. After the one +process in C exits, B and C's "populated" fields would flip to "0" and +file modified events will be generated on the "cgroup.events" files of +both cgroups. + + +Controlling Controllers +----------------------- + +Enabling and Disabling +~~~~~~~~~~~~~~~~~~~~~~ + +Each cgroup has a "cgroup.controllers" file which lists all +controllers available for the cgroup to enable:: + + # cat cgroup.controllers + cpu io memory + +No controller is enabled by default. Controllers can be enabled and +disabled by writing to the "cgroup.subtree_control" file:: + + # echo "+cpu +memory -io" > cgroup.subtree_control + +Only controllers which are listed in "cgroup.controllers" can be +enabled. When multiple operations are specified as above, either they +all succeed or fail. If multiple operations on the same controller +are specified, the last one is effective. + +Enabling a controller in a cgroup indicates that the distribution of +the target resource across its immediate children will be controlled. +Consider the following sub-hierarchy. The enabled controllers are +listed in parentheses:: + + A(cpu,memory) - B(memory) - C() + \ D() + +As A has "cpu" and "memory" enabled, A will control the distribution +of CPU cycles and memory to its children, in this case, B. As B has +"memory" enabled but not "CPU", C and D will compete freely on CPU +cycles but their division of memory available to B will be controlled. + +As a controller regulates the distribution of the target resource to +the cgroup's children, enabling it creates the controller's interface +files in the child cgroups. In the above example, enabling "cpu" on B +would create the "cpu." prefixed controller interface files in C and +D. Likewise, disabling "memory" from B would remove the "memory." +prefixed controller interface files from C and D. This means that the +controller interface files - anything which doesn't start with +"cgroup." are owned by the parent rather than the cgroup itself. + + +Top-down Constraint +~~~~~~~~~~~~~~~~~~~ + +Resources are distributed top-down and a cgroup can further distribute +a resource only if the resource has been distributed to it from the +parent. This means that all non-root "cgroup.subtree_control" files +can only contain controllers which are enabled in the parent's +"cgroup.subtree_control" file. A controller can be enabled only if +the parent has the controller enabled and a controller can't be +disabled if one or more children have it enabled. + + +No Internal Process Constraint +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Non-root cgroups can distribute domain resources to their children +only when they don't have any processes of their own. In other words, +only domain cgroups which don't contain any processes can have domain +controllers enabled in their "cgroup.subtree_control" files. + +This guarantees that, when a domain controller is looking at the part +of the hierarchy which has it enabled, processes are always only on +the leaves. This rules out situations where child cgroups compete +against internal processes of the parent. + +The root cgroup is exempt from this restriction. Root contains +processes and anonymous resource consumption which can't be associated +with any other cgroups and requires special treatment from most +controllers. How resource consumption in the root cgroup is governed +is up to each controller (for more information on this topic please +refer to the Non-normative information section in the Controllers +chapter). + +Note that the restriction doesn't get in the way if there is no +enabled controller in the cgroup's "cgroup.subtree_control". This is +important as otherwise it wouldn't be possible to create children of a +populated cgroup. To control resource distribution of a cgroup, the +cgroup must create children and transfer all its processes to the +children before enabling controllers in its "cgroup.subtree_control" +file. + + +Delegation +---------- + +Model of Delegation +~~~~~~~~~~~~~~~~~~~ + +A cgroup can be delegated in two ways. First, to a less privileged +user by granting write access of the directory and its "cgroup.procs", +"cgroup.threads" and "cgroup.subtree_control" files to the user. +Second, if the "nsdelegate" mount option is set, automatically to a +cgroup namespace on namespace creation. + +Because the resource control interface files in a given directory +control the distribution of the parent's resources, the delegatee +shouldn't be allowed to write to them. For the first method, this is +achieved by not granting access to these files. For the second, the +kernel rejects writes to all files other than "cgroup.procs" and +"cgroup.subtree_control" on a namespace root from inside the +namespace. + +The end results are equivalent for both delegation types. Once +delegated, the user can build sub-hierarchy under the directory, +organize processes inside it as it sees fit and further distribute the +resources it received from the parent. The limits and other settings +of all resource controllers are hierarchical and regardless of what +happens in the delegated sub-hierarchy, nothing can escape the +resource restrictions imposed by the parent. + +Currently, cgroup doesn't impose any restrictions on the number of +cgroups in or nesting depth of a delegated sub-hierarchy; however, +this may be limited explicitly in the future. + + +Delegation Containment +~~~~~~~~~~~~~~~~~~~~~~ + +A delegated sub-hierarchy is contained in the sense that processes +can't be moved into or out of the sub-hierarchy by the delegatee. + +For delegations to a less privileged user, this is achieved by +requiring the following conditions for a process with a non-root euid +to migrate a target process into a cgroup by writing its PID to the +"cgroup.procs" file. + +- The writer must have write access to the "cgroup.procs" file. + +- The writer must have write access to the "cgroup.procs" file of the + common ancestor of the source and destination cgroups. + +The above two constraints ensure that while a delegatee may migrate +processes around freely in the delegated sub-hierarchy it can't pull +in from or push out to outside the sub-hierarchy. + +For an example, let's assume cgroups C0 and C1 have been delegated to +user U0 who created C00, C01 under C0 and C10 under C1 as follows and +all processes under C0 and C1 belong to U0:: + + ~~~~~~~~~~~~~ - C0 - C00 + ~ cgroup ~ \ C01 + ~ hierarchy ~ + ~~~~~~~~~~~~~ - C1 - C10 + +Let's also say U0 wants to write the PID of a process which is +currently in C10 into "C00/cgroup.procs". U0 has write access to the +file; however, the common ancestor of the source cgroup C10 and the +destination cgroup C00 is above the points of delegation and U0 would +not have write access to its "cgroup.procs" files and thus the write +will be denied with -EACCES. + +For delegations to namespaces, containment is achieved by requiring +that both the source and destination cgroups are reachable from the +namespace of the process which is attempting the migration. If either +is not reachable, the migration is rejected with -ENOENT. + + +Guidelines +---------- + +Organize Once and Control +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Migrating a process across cgroups is a relatively expensive operation +and stateful resources such as memory are not moved together with the +process. This is an explicit design decision as there often exist +inherent trade-offs between migration and various hot paths in terms +of synchronization cost. + +As such, migrating processes across cgroups frequently as a means to +apply different resource restrictions is discouraged. A workload +should be assigned to a cgroup according to the system's logical and +resource structure once on start-up. Dynamic adjustments to resource +distribution can be made by changing controller configuration through +the interface files. + + +Avoid Name Collisions +~~~~~~~~~~~~~~~~~~~~~ + +Interface files for a cgroup and its children cgroups occupy the same +directory and it is possible to create children cgroups which collide +with interface files. + +All cgroup core interface files are prefixed with "cgroup." and each +controller's interface files are prefixed with the controller name and +a dot. A controller's name is composed of lower case alphabets and +'_'s but never begins with an '_' so it can be used as the prefix +character for collision avoidance. Also, interface file names won't +start or end with terms which are often used in categorizing workloads +such as job, service, slice, unit or workload. + +cgroup doesn't do anything to prevent name collisions and it's the +user's responsibility to avoid them. + + +Resource Distribution Models +============================ + +cgroup controllers implement several resource distribution schemes +depending on the resource type and expected use cases. This section +describes major schemes in use along with their expected behaviors. + + +Weights +------- + +A parent's resource is distributed by adding up the weights of all +active children and giving each the fraction matching the ratio of its +weight against the sum. As only children which can make use of the +resource at the moment participate in the distribution, this is +work-conserving. Due to the dynamic nature, this model is usually +used for stateless resources. + +All weights are in the range [1, 10000] with the default at 100. This +allows symmetric multiplicative biases in both directions at fine +enough granularity while staying in the intuitive range. + +As long as the weight is in range, all configuration combinations are +valid and there is no reason to reject configuration changes or +process migrations. + +"cpu.weight" proportionally distributes CPU cycles to active children +and is an example of this type. + + +Limits +------ + +A child can only consume upto the configured amount of the resource. +Limits can be over-committed - the sum of the limits of children can +exceed the amount of resource available to the parent. + +Limits are in the range [0, max] and defaults to "max", which is noop. + +As limits can be over-committed, all configuration combinations are +valid and there is no reason to reject configuration changes or +process migrations. + +"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume +on an IO device and is an example of this type. + + +Protections +----------- + +A cgroup is protected to be allocated upto the configured amount of +the resource if the usages of all its ancestors are under their +protected levels. Protections can be hard guarantees or best effort +soft boundaries. Protections can also be over-committed in which case +only upto the amount available to the parent is protected among +children. + +Protections are in the range [0, max] and defaults to 0, which is +noop. + +As protections can be over-committed, all configuration combinations +are valid and there is no reason to reject configuration changes or +process migrations. + +"memory.low" implements best-effort memory protection and is an +example of this type. + + +Allocations +----------- + +A cgroup is exclusively allocated a certain amount of a finite +resource. Allocations can't be over-committed - the sum of the +allocations of children can not exceed the amount of resource +available to the parent. + +Allocations are in the range [0, max] and defaults to 0, which is no +resource. + +As allocations can't be over-committed, some configuration +combinations are invalid and should be rejected. Also, if the +resource is mandatory for execution of processes, process migrations +may be rejected. + +"cpu.rt.max" hard-allocates realtime slices and is an example of this +type. + + +Interface Files +=============== + +Format +------ + +All interface files should be in one of the following formats whenever +possible:: + + New-line separated values + (when only one value can be written at once) + + VAL0\n + VAL1\n + ... + + Space separated values + (when read-only or multiple values can be written at once) + + VAL0 VAL1 ...\n + + Flat keyed + + KEY0 VAL0\n + KEY1 VAL1\n + ... + + Nested keyed + + KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... + KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... + ... + +For a writable file, the format for writing should generally match +reading; however, controllers may allow omitting later fields or +implement restricted shortcuts for most common use cases. + +For both flat and nested keyed files, only the values for a single key +can be written at a time. For nested keyed files, the sub key pairs +may be specified in any order and not all pairs have to be specified. + + +Conventions +----------- + +- Settings for a single feature should be contained in a single file. + +- The root cgroup should be exempt from resource control and thus + shouldn't have resource control interface files. Also, + informational files on the root cgroup which end up showing global + information available elsewhere shouldn't exist. + +- If a controller implements weight based resource distribution, its + interface file should be named "weight" and have the range [1, + 10000] with 100 as the default. The values are chosen to allow + enough and symmetric bias in both directions while keeping it + intuitive (the default is 100%). + +- If a controller implements an absolute resource guarantee and/or + limit, the interface files should be named "min" and "max" + respectively. If a controller implements best effort resource + guarantee and/or limit, the interface files should be named "low" + and "high" respectively. + + In the above four control files, the special token "max" should be + used to represent upward infinity for both reading and writing. + +- If a setting has a configurable default value and keyed specific + overrides, the default entry should be keyed with "default" and + appear as the first entry in the file. + + The default value can be updated by writing either "default $VAL" or + "$VAL". + + When writing to update a specific override, "default" can be used as + the value to indicate removal of the override. Override entries + with "default" as the value must not appear when read. + + For example, a setting which is keyed by major:minor device numbers + with integer values may look like the following:: + + # cat cgroup-example-interface-file + default 150 + 8:0 300 + + The default value can be updated by:: + + # echo 125 > cgroup-example-interface-file + + or:: + + # echo "default 125" > cgroup-example-interface-file + + An override can be set by:: + + # echo "8:16 170" > cgroup-example-interface-file + + and cleared by:: + + # echo "8:0 default" > cgroup-example-interface-file + # cat cgroup-example-interface-file + default 125 + 8:16 170 + +- For events which are not very high frequency, an interface file + "events" should be created which lists event key value pairs. + Whenever a notifiable event happens, file modified event should be + generated on the file. + + +Core Interface Files +-------------------- + +All cgroup core files are prefixed with "cgroup." + + cgroup.type + + A read-write single value file which exists on non-root + cgroups. + + When read, it indicates the current type of the cgroup, which + can be one of the following values. + + - "domain" : A normal valid domain cgroup. + + - "domain threaded" : A threaded domain cgroup which is + serving as the root of a threaded subtree. + + - "domain invalid" : A cgroup which is in an invalid state. + It can't be populated or have controllers enabled. It may + be allowed to become a threaded cgroup. + + - "threaded" : A threaded cgroup which is a member of a + threaded subtree. + + A cgroup can be turned into a threaded cgroup by writing + "threaded" to this file. + + cgroup.procs + A read-write new-line separated values file which exists on + all cgroups. + + When read, it lists the PIDs of all processes which belong to + the cgroup one-per-line. The PIDs are not ordered and the + same PID may show up more than once if the process got moved + to another cgroup and then back or the PID got recycled while + reading. + + A PID can be written to migrate the process associated with + the PID to the cgroup. The writer should match all of the + following conditions. + + - It must have write access to the "cgroup.procs" file. + + - It must have write access to the "cgroup.procs" file of the + common ancestor of the source and destination cgroups. + + When delegating a sub-hierarchy, write access to this file + should be granted along with the containing directory. + + In a threaded cgroup, reading this file fails with EOPNOTSUPP + as all the processes belong to the thread root. Writing is + supported and moves every thread of the process to the cgroup. + + cgroup.threads + A read-write new-line separated values file which exists on + all cgroups. + + When read, it lists the TIDs of all threads which belong to + the cgroup one-per-line. The TIDs are not ordered and the + same TID may show up more than once if the thread got moved to + another cgroup and then back or the TID got recycled while + reading. + + A TID can be written to migrate the thread associated with the + TID to the cgroup. The writer should match all of the + following conditions. + + - It must have write access to the "cgroup.threads" file. + + - The cgroup that the thread is currently in must be in the + same resource domain as the destination cgroup. + + - It must have write access to the "cgroup.procs" file of the + common ancestor of the source and destination cgroups. + + When delegating a sub-hierarchy, write access to this file + should be granted along with the containing directory. + + cgroup.controllers + A read-only space separated values file which exists on all + cgroups. + + It shows space separated list of all controllers available to + the cgroup. The controllers are not ordered. + + cgroup.subtree_control + A read-write space separated values file which exists on all + cgroups. Starts out empty. + + When read, it shows space separated list of the controllers + which are enabled to control resource distribution from the + cgroup to its children. + + Space separated list of controllers prefixed with '+' or '-' + can be written to enable or disable controllers. A controller + name prefixed with '+' enables the controller and '-' + disables. If a controller appears more than once on the list, + the last one is effective. When multiple enable and disable + operations are specified, either all succeed or all fail. + + cgroup.events + A read-only flat-keyed file which exists on non-root cgroups. + The following entries are defined. Unless specified + otherwise, a value change in this file generates a file + modified event. + + populated + 1 if the cgroup or its descendants contains any live + processes; otherwise, 0. + + cgroup.max.descendants + A read-write single value files. The default is "max". + + Maximum allowed number of descent cgroups. + If the actual number of descendants is equal or larger, + an attempt to create a new cgroup in the hierarchy will fail. + + cgroup.max.depth + A read-write single value files. The default is "max". + + Maximum allowed descent depth below the current cgroup. + If the actual descent depth is equal or larger, + an attempt to create a new child cgroup will fail. + + cgroup.stat + A read-only flat-keyed file with the following entries: + + nr_descendants + Total number of visible descendant cgroups. + + nr_dying_descendants + Total number of dying descendant cgroups. A cgroup becomes + dying after being deleted by a user. The cgroup will remain + in dying state for some time undefined time (which can depend + on system load) before being completely destroyed. + + A process can't enter a dying cgroup under any circumstances, + a dying cgroup can't revive. + + A dying cgroup can consume system resources not exceeding + limits, which were active at the moment of cgroup deletion. + + +Controllers +=========== + +CPU +--- + +The "cpu" controllers regulates distribution of CPU cycles. This +controller implements weight and absolute bandwidth limit models for +normal scheduling policy and absolute bandwidth allocation model for +realtime scheduling policy. + +WARNING: cgroup2 doesn't yet support control of realtime processes and +the cpu controller can only be enabled when all RT processes are in +the root cgroup. Be aware that system management software may already +have placed RT processes into nonroot cgroups during the system boot +process, and these processes may need to be moved to the root cgroup +before the cpu controller can be enabled. + + +CPU Interface Files +~~~~~~~~~~~~~~~~~~~ + +All time durations are in microseconds. + + cpu.stat + A read-only flat-keyed file which exists on non-root cgroups. + This file exists whether the controller is enabled or not. + + It always reports the following three stats: + + - usage_usec + - user_usec + - system_usec + + and the following three when the controller is enabled: + + - nr_periods + - nr_throttled + - throttled_usec + + cpu.weight + A read-write single value file which exists on non-root + cgroups. The default is "100". + + The weight in the range [1, 10000]. + + cpu.weight.nice + A read-write single value file which exists on non-root + cgroups. The default is "0". + + The nice value is in the range [-20, 19]. + + This interface file is an alternative interface for + "cpu.weight" and allows reading and setting weight using the + same values used by nice(2). Because the range is smaller and + granularity is coarser for the nice values, the read value is + the closest approximation of the current weight. + + cpu.max + A read-write two value file which exists on non-root cgroups. + The default is "max 100000". + + The maximum bandwidth limit. It's in the following format:: + + $MAX $PERIOD + + which indicates that the group may consume upto $MAX in each + $PERIOD duration. "max" for $MAX indicates no limit. If only + one number is written, $MAX is updated. + + +Memory +------ + +The "memory" controller regulates distribution of memory. Memory is +stateful and implements both limit and protection models. Due to the +intertwining between memory usage and reclaim pressure and the +stateful nature of memory, the distribution model is relatively +complex. + +While not completely water-tight, all major memory usages by a given +cgroup are tracked so that the total memory consumption can be +accounted and controlled to a reasonable extent. Currently, the +following types of memory usages are tracked. + +- Userland memory - page cache and anonymous memory. + +- Kernel data structures such as dentries and inodes. + +- TCP socket buffers. + +The above list may expand in the future for better coverage. + + +Memory Interface Files +~~~~~~~~~~~~~~~~~~~~~~ + +All memory amounts are in bytes. If a value which is not aligned to +PAGE_SIZE is written, the value may be rounded up to the closest +PAGE_SIZE multiple when read back. + + memory.current + A read-only single value file which exists on non-root + cgroups. + + The total amount of memory currently being used by the cgroup + and its descendants. + + memory.min + A read-write single value file which exists on non-root + cgroups. The default is "0". + + Hard memory protection. If the memory usage of a cgroup + is within its effective min boundary, the cgroup's memory + won't be reclaimed under any conditions. If there is no + unprotected reclaimable memory available, OOM killer + is invoked. + + Effective min boundary is limited by memory.min values of + all ancestor cgroups. If there is memory.min overcommitment + (child cgroup or cgroups are requiring more protected memory + than parent will allow), then each child cgroup will get + the part of parent's protection proportional to its + actual memory usage below memory.min. + + Putting more memory than generally available under this + protection is discouraged and may lead to constant OOMs. + + If a memory cgroup is not populated with processes, + its memory.min is ignored. + + memory.low + A read-write single value file which exists on non-root + cgroups. The default is "0". + + Best-effort memory protection. If the memory usage of a + cgroup is within its effective low boundary, the cgroup's + memory won't be reclaimed unless memory can be reclaimed + from unprotected cgroups. + + Effective low boundary is limited by memory.low values of + all ancestor cgroups. If there is memory.low overcommitment + (child cgroup or cgroups are requiring more protected memory + than parent will allow), then each child cgroup will get + the part of parent's protection proportional to its + actual memory usage below memory.low. + + Putting more memory than generally available under this + protection is discouraged. + + memory.high + A read-write single value file which exists on non-root + cgroups. The default is "max". + + Memory usage throttle limit. This is the main mechanism to + control memory usage of a cgroup. If a cgroup's usage goes + over the high boundary, the processes of the cgroup are + throttled and put under heavy reclaim pressure. + + Going over the high limit never invokes the OOM killer and + under extreme conditions the limit may be breached. + + memory.max + A read-write single value file which exists on non-root + cgroups. The default is "max". + + Memory usage hard limit. This is the final protection + mechanism. If a cgroup's memory usage reaches this limit and + can't be reduced, the OOM killer is invoked in the cgroup. + Under certain circumstances, the usage may go over the limit + temporarily. + + This is the ultimate protection mechanism. As long as the + high limit is used and monitored properly, this limit's + utility is limited to providing the final safety net. + + memory.oom.group + A read-write single value file which exists on non-root + cgroups. The default value is "0". + + Determines whether the cgroup should be treated as + an indivisible workload by the OOM killer. If set, + all tasks belonging to the cgroup or to its descendants + (if the memory cgroup is not a leaf cgroup) are killed + together or not at all. This can be used to avoid + partial kills to guarantee workload integrity. + + Tasks with the OOM protection (oom_score_adj set to -1000) + are treated as an exception and are never killed. + + If the OOM killer is invoked in a cgroup, it's not going + to kill any tasks outside of this cgroup, regardless + memory.oom.group values of ancestor cgroups. + + memory.events + A read-only flat-keyed file which exists on non-root cgroups. + The following entries are defined. Unless specified + otherwise, a value change in this file generates a file + modified event. + + low + The number of times the cgroup is reclaimed due to + high memory pressure even though its usage is under + the low boundary. This usually indicates that the low + boundary is over-committed. + + high + The number of times processes of the cgroup are + throttled and routed to perform direct memory reclaim + because the high memory boundary was exceeded. For a + cgroup whose memory usage is capped by the high limit + rather than global memory pressure, this event's + occurrences are expected. + + max + The number of times the cgroup's memory usage was + about to go over the max boundary. If direct reclaim + fails to bring it down, the cgroup goes to OOM state. + + oom + The number of time the cgroup's memory usage was + reached the limit and allocation was about to fail. + + Depending on context result could be invocation of OOM + killer and retrying allocation or failing allocation. + + Failed allocation in its turn could be returned into + userspace as -ENOMEM or silently ignored in cases like + disk readahead. For now OOM in memory cgroup kills + tasks iff shortage has happened inside page fault. + + oom_kill + The number of processes belonging to this cgroup + killed by any kind of OOM killer. + + memory.stat + A read-only flat-keyed file which exists on non-root cgroups. + + This breaks down the cgroup's memory footprint into different + types of memory, type-specific details, and other information + on the state and past events of the memory management system. + + All memory amounts are in bytes. + + The entries are ordered to be human readable, and new entries + can show up in the middle. Don't rely on items remaining in a + fixed position; use the keys to look up specific values! + + anon + Amount of memory used in anonymous mappings such as + brk(), sbrk(), and mmap(MAP_ANONYMOUS) + + file + Amount of memory used to cache filesystem data, + including tmpfs and shared memory. + + kernel_stack + Amount of memory allocated to kernel stacks. + + slab + Amount of memory used for storing in-kernel data + structures. + + sock + Amount of memory used in network transmission buffers + + shmem + Amount of cached filesystem data that is swap-backed, + such as tmpfs, shm segments, shared anonymous mmap()s + + file_mapped + Amount of cached filesystem data mapped with mmap() + + file_dirty + Amount of cached filesystem data that was modified but + not yet written back to disk + + file_writeback + Amount of cached filesystem data that was modified and + is currently being written back to disk + + inactive_anon, active_anon, inactive_file, active_file, unevictable + Amount of memory, swap-backed and filesystem-backed, + on the internal memory management lists used by the + page reclaim algorithm + + slab_reclaimable + Part of "slab" that might be reclaimed, such as + dentries and inodes. + + slab_unreclaimable + Part of "slab" that cannot be reclaimed on memory + pressure. + + pgfault + Total number of page faults incurred + + pgmajfault + Number of major page faults incurred + + workingset_refault + + Number of refaults of previously evicted pages + + workingset_activate + + Number of refaulted pages that were immediately activated + + workingset_nodereclaim + + Number of times a shadow node has been reclaimed + + pgrefill + + Amount of scanned pages (in an active LRU list) + + pgscan + + Amount of scanned pages (in an inactive LRU list) + + pgsteal + + Amount of reclaimed pages + + pgactivate + + Amount of pages moved to the active LRU list + + pgdeactivate + + Amount of pages moved to the inactive LRU lis + + pglazyfree + + Amount of pages postponed to be freed under memory pressure + + pglazyfreed + + Amount of reclaimed lazyfree pages + + memory.swap.current + A read-only single value file which exists on non-root + cgroups. + + The total amount of swap currently being used by the cgroup + and its descendants. + + memory.swap.max + A read-write single value file which exists on non-root + cgroups. The default is "max". + + Swap usage hard limit. If a cgroup's swap usage reaches this + limit, anonymous memory of the cgroup will not be swapped out. + + memory.swap.events + A read-only flat-keyed file which exists on non-root cgroups. + The following entries are defined. Unless specified + otherwise, a value change in this file generates a file + modified event. + + max + The number of times the cgroup's swap usage was about + to go over the max boundary and swap allocation + failed. + + fail + The number of times swap allocation failed either + because of running out of swap system-wide or max + limit. + + When reduced under the current usage, the existing swap + entries are reclaimed gradually and the swap usage may stay + higher than the limit for an extended period of time. This + reduces the impact on the workload and memory management. + + +Usage Guidelines +~~~~~~~~~~~~~~~~ + +"memory.high" is the main mechanism to control memory usage. +Over-committing on high limit (sum of high limits > available memory) +and letting global memory pressure to distribute memory according to +usage is a viable strategy. + +Because breach of the high limit doesn't trigger the OOM killer but +throttles the offending cgroup, a management agent has ample +opportunities to monitor and take appropriate actions such as granting +more memory or terminating the workload. + +Determining whether a cgroup has enough memory is not trivial as +memory usage doesn't indicate whether the workload can benefit from +more memory. For example, a workload which writes data received from +network to a file can use all available memory but can also operate as +performant with a small amount of memory. A measure of memory +pressure - how much the workload is being impacted due to lack of +memory - is necessary to determine whether a workload needs more +memory; unfortunately, memory pressure monitoring mechanism isn't +implemented yet. + + +Memory Ownership +~~~~~~~~~~~~~~~~ + +A memory area is charged to the cgroup which instantiated it and stays +charged to the cgroup until the area is released. Migrating a process +to a different cgroup doesn't move the memory usages that it +instantiated while in the previous cgroup to the new cgroup. + +A memory area may be used by processes belonging to different cgroups. +To which cgroup the area will be charged is in-deterministic; however, +over time, the memory area is likely to end up in a cgroup which has +enough memory allowance to avoid high reclaim pressure. + +If a cgroup sweeps a considerable amount of memory which is expected +to be accessed repeatedly by other cgroups, it may make sense to use +POSIX_FADV_DONTNEED to relinquish the ownership of memory areas +belonging to the affected files to ensure correct memory ownership. + + +IO +-- + +The "io" controller regulates the distribution of IO resources. This +controller implements both weight based and absolute bandwidth or IOPS +limit distribution; however, weight based distribution is available +only if cfq-iosched is in use and neither scheme is available for +blk-mq devices. + + +IO Interface Files +~~~~~~~~~~~~~~~~~~ + + io.stat + A read-only nested-keyed file which exists on non-root + cgroups. + + Lines are keyed by $MAJ:$MIN device numbers and not ordered. + The following nested keys are defined. + + ====== ===================== + rbytes Bytes read + wbytes Bytes written + rios Number of read IOs + wios Number of write IOs + dbytes Bytes discarded + dios Number of discard IOs + ====== ===================== + + An example read output follows: + + 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 + 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 + + io.weight + A read-write flat-keyed file which exists on non-root cgroups. + The default is "default 100". + + The first line is the default weight applied to devices + without specific override. The rest are overrides keyed by + $MAJ:$MIN device numbers and not ordered. The weights are in + the range [1, 10000] and specifies the relative amount IO time + the cgroup can use in relation to its siblings. + + The default weight can be updated by writing either "default + $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing + "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default". + + An example read output follows:: + + default 100 + 8:16 200 + 8:0 50 + + io.max + A read-write nested-keyed file which exists on non-root + cgroups. + + BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN + device numbers and not ordered. The following nested keys are + defined. + + ===== ================================== + rbps Max read bytes per second + wbps Max write bytes per second + riops Max read IO operations per second + wiops Max write IO operations per second + ===== ================================== + + When writing, any number of nested key-value pairs can be + specified in any order. "max" can be specified as the value + to remove a specific limit. If the same key is specified + multiple times, the outcome is undefined. + + BPS and IOPS are measured in each IO direction and IOs are + delayed if limit is reached. Temporary bursts are allowed. + + Setting read limit at 2M BPS and write at 120 IOPS for 8:16:: + + echo "8:16 rbps=2097152 wiops=120" > io.max + + Reading returns the following:: + + 8:16 rbps=2097152 wbps=max riops=max wiops=120 + + Write IOPS limit can be removed by writing the following:: + + echo "8:16 wiops=max" > io.max + + Reading now returns the following:: + + 8:16 rbps=2097152 wbps=max riops=max wiops=max + + +Writeback +~~~~~~~~~ + +Page cache is dirtied through buffered writes and shared mmaps and +written asynchronously to the backing filesystem by the writeback +mechanism. Writeback sits between the memory and IO domains and +regulates the proportion of dirty memory by balancing dirtying and +write IOs. + +The io controller, in conjunction with the memory controller, +implements control of page cache writeback IOs. The memory controller +defines the memory domain that dirty memory ratio is calculated and +maintained for and the io controller defines the io domain which +writes out dirty pages for the memory domain. Both system-wide and +per-cgroup dirty memory states are examined and the more restrictive +of the two is enforced. + +cgroup writeback requires explicit support from the underlying +filesystem. Currently, cgroup writeback is implemented on ext2, ext4 +and btrfs. On other filesystems, all writeback IOs are attributed to +the root cgroup. + +There are inherent differences in memory and writeback management +which affects how cgroup ownership is tracked. Memory is tracked per +page while writeback per inode. For the purpose of writeback, an +inode is assigned to a cgroup and all IO requests to write dirty pages +from the inode are attributed to that cgroup. + +As cgroup ownership for memory is tracked per page, there can be pages +which are associated with different cgroups than the one the inode is +associated with. These are called foreign pages. The writeback +constantly keeps track of foreign pages and, if a particular foreign +cgroup becomes the majority over a certain period of time, switches +the ownership of the inode to that cgroup. + +While this model is enough for most use cases where a given inode is +mostly dirtied by a single cgroup even when the main writing cgroup +changes over time, use cases where multiple cgroups write to a single +inode simultaneously are not supported well. In such circumstances, a +significant portion of IOs are likely to be attributed incorrectly. +As memory controller assigns page ownership on the first use and +doesn't update it until the page is released, even if writeback +strictly follows page ownership, multiple cgroups dirtying overlapping +areas wouldn't work as expected. It's recommended to avoid such usage +patterns. + +The sysctl knobs which affect writeback behavior are applied to cgroup +writeback as follows. + + vm.dirty_background_ratio, vm.dirty_ratio + These ratios apply the same to cgroup writeback with the + amount of available memory capped by limits imposed by the + memory controller and system-wide clean memory. + + vm.dirty_background_bytes, vm.dirty_bytes + For cgroup writeback, this is calculated into ratio against + total available memory and applied the same way as + vm.dirty[_background]_ratio. + + +IO Latency +~~~~~~~~~~ + +This is a cgroup v2 controller for IO workload protection. You provide a group +with a latency target, and if the average latency exceeds that target the +controller will throttle any peers that have a lower latency target than the +protected workload. + +The limits are only applied at the peer level in the hierarchy. This means that +in the diagram below, only groups A, B, and C will influence each other, and +groups D and F will influence each other. Group G will influence nobody. + + [root] + / | \ + A B C + / \ | + D F G + + +So the ideal way to configure this is to set io.latency in groups A, B, and C. +Generally you do not want to set a value lower than the latency your device +supports. Experiment to find the value that works best for your workload. +Start at higher than the expected latency for your device and watch the +avg_lat value in io.stat for your workload group to get an idea of the +latency you see during normal operation. Use the avg_lat value as a basis for +your real setting, setting at 10-15% higher than the value in io.stat. + +How IO Latency Throttling Works +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +io.latency is work conserving; so as long as everybody is meeting their latency +target the controller doesn't do anything. Once a group starts missing its +target it begins throttling any peer group that has a higher target than itself. +This throttling takes 2 forms: + +- Queue depth throttling. This is the number of outstanding IO's a group is + allowed to have. We will clamp down relatively quickly, starting at no limit + and going all the way down to 1 IO at a time. + +- Artificial delay induction. There are certain types of IO that cannot be + throttled without possibly adversely affecting higher priority groups. This + includes swapping and metadata IO. These types of IO are allowed to occur + normally, however they are "charged" to the originating group. If the + originating group is being throttled you will see the use_delay and delay + fields in io.stat increase. The delay value is how many microseconds that are + being added to any process that runs in this group. Because this number can + grow quite large if there is a lot of swapping or metadata IO occurring we + limit the individual delay events to 1 second at a time. + +Once the victimized group starts meeting its latency target again it will start +unthrottling any peer groups that were throttled previously. If the victimized +group simply stops doing IO the global counter will unthrottle appropriately. + +IO Latency Interface Files +~~~~~~~~~~~~~~~~~~~~~~~~~~ + + io.latency + This takes a similar format as the other controllers. + + "MAJOR:MINOR target=<target time in microseconds" + + io.stat + If the controller is enabled you will see extra stats in io.stat in + addition to the normal ones. + + depth + This is the current queue depth for the group. + + avg_lat + This is an exponential moving average with a decay rate of 1/exp + bound by the sampling interval. The decay rate interval can be + calculated by multiplying the win value in io.stat by the + corresponding number of samples based on the win value. + + win + The sampling window size in milliseconds. This is the minimum + duration of time between evaluation events. Windows only elapse + with IO activity. Idle periods extend the most recent window. + +PID +--- + +The process number controller is used to allow a cgroup to stop any +new tasks from being fork()'d or clone()'d after a specified limit is +reached. + +The number of tasks in a cgroup can be exhausted in ways which other +controllers cannot prevent, thus warranting its own controller. For +example, a fork bomb is likely to exhaust the number of tasks before +hitting memory restrictions. + +Note that PIDs used in this controller refer to TIDs, process IDs as +used by the kernel. + + +PID Interface Files +~~~~~~~~~~~~~~~~~~~ + + pids.max + A read-write single value file which exists on non-root + cgroups. The default is "max". + + Hard limit of number of processes. + + pids.current + A read-only single value file which exists on all cgroups. + + The number of processes currently in the cgroup and its + descendants. + +Organisational operations are not blocked by cgroup policies, so it is +possible to have pids.current > pids.max. This can be done by either +setting the limit to be smaller than pids.current, or attaching enough +processes to the cgroup such that pids.current is larger than +pids.max. However, it is not possible to violate a cgroup PID policy +through fork() or clone(). These will return -EAGAIN if the creation +of a new process would cause a cgroup policy to be violated. + + +Device controller +----------------- + +Device controller manages access to device files. It includes both +creation of new device files (using mknod), and access to the +existing device files. + +Cgroup v2 device controller has no interface files and is implemented +on top of cgroup BPF. To control access to device files, a user may +create bpf programs of the BPF_CGROUP_DEVICE type and attach them +to cgroups. On an attempt to access a device file, corresponding +BPF programs will be executed, and depending on the return value +the attempt will succeed or fail with -EPERM. + +A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx +structure, which describes the device access attempt: access type +(mknod/read/write) and device (type, major and minor numbers). +If the program returns 0, the attempt fails with -EPERM, otherwise +it succeeds. + +An example of BPF_CGROUP_DEVICE program may be found in the kernel +source tree in the tools/testing/selftests/bpf/dev_cgroup.c file. + + +RDMA +---- + +The "rdma" controller regulates the distribution and accounting of +of RDMA resources. + +RDMA Interface Files +~~~~~~~~~~~~~~~~~~~~ + + rdma.max + A readwrite nested-keyed file that exists for all the cgroups + except root that describes current configured resource limit + for a RDMA/IB device. + + Lines are keyed by device name and are not ordered. + Each line contains space separated resource name and its configured + limit that can be distributed. + + The following nested keys are defined. + + ========== ============================= + hca_handle Maximum number of HCA Handles + hca_object Maximum number of HCA Objects + ========== ============================= + + An example for mlx4 and ocrdma device follows:: + + mlx4_0 hca_handle=2 hca_object=2000 + ocrdma1 hca_handle=3 hca_object=max + + rdma.current + A read-only file that describes current resource usage. + It exists for all the cgroup except root. + + An example for mlx4 and ocrdma device follows:: + + mlx4_0 hca_handle=1 hca_object=20 + ocrdma1 hca_handle=1 hca_object=23 + + +Misc +---- + +perf_event +~~~~~~~~~~ + +perf_event controller, if not mounted on a legacy hierarchy, is +automatically enabled on the v2 hierarchy so that perf events can +always be filtered by cgroup v2 path. The controller can still be +moved to a legacy hierarchy after v2 hierarchy is populated. + + +Non-normative information +------------------------- + +This section contains information that isn't considered to be a part of +the stable kernel API and so is subject to change. + + +CPU controller root cgroup process behaviour +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When distributing CPU cycles in the root cgroup each thread in this +cgroup is treated as if it was hosted in a separate child cgroup of the +root cgroup. This child cgroup weight is dependent on its thread nice +level. + +For details of this mapping see sched_prio_to_weight array in +kernel/sched/core.c file (values from this array should be scaled +appropriately so the neutral - nice 0 - value is 100 instead of 1024). + + +IO controller root cgroup process behaviour +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Root cgroup processes are hosted in an implicit leaf child node. +When distributing IO resources this implicit child node is taken into +account as if it was a normal child cgroup of the root cgroup with a +weight value of 200. + + +Namespace +========= + +Basics +------ + +cgroup namespace provides a mechanism to virtualize the view of the +"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone +flag can be used with clone(2) and unshare(2) to create a new cgroup +namespace. The process running inside the cgroup namespace will have +its "/proc/$PID/cgroup" output restricted to cgroupns root. The +cgroupns root is the cgroup of the process at the time of creation of +the cgroup namespace. + +Without cgroup namespace, the "/proc/$PID/cgroup" file shows the +complete path of the cgroup of a process. In a container setup where +a set of cgroups and namespaces are intended to isolate processes the +"/proc/$PID/cgroup" file may leak potential system level information +to the isolated processes. For Example:: + + # cat /proc/self/cgroup + 0::/batchjobs/container_id1 + +The path '/batchjobs/container_id1' can be considered as system-data +and undesirable to expose to the isolated processes. cgroup namespace +can be used to restrict visibility of this path. For example, before +creating a cgroup namespace, one would see:: + + # ls -l /proc/self/ns/cgroup + lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835] + # cat /proc/self/cgroup + 0::/batchjobs/container_id1 + +After unsharing a new namespace, the view changes:: + + # ls -l /proc/self/ns/cgroup + lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183] + # cat /proc/self/cgroup + 0::/ + +When some thread from a multi-threaded process unshares its cgroup +namespace, the new cgroupns gets applied to the entire process (all +the threads). This is natural for the v2 hierarchy; however, for the +legacy hierarchies, this may be unexpected. + +A cgroup namespace is alive as long as there are processes inside or +mounts pinning it. When the last usage goes away, the cgroup +namespace is destroyed. The cgroupns root and the actual cgroups +remain. + + +The Root and Views +------------------ + +The 'cgroupns root' for a cgroup namespace is the cgroup in which the +process calling unshare(2) is running. For example, if a process in +/batchjobs/container_id1 cgroup calls unshare, cgroup +/batchjobs/container_id1 becomes the cgroupns root. For the +init_cgroup_ns, this is the real root ('/') cgroup. + +The cgroupns root cgroup does not change even if the namespace creator +process later moves to a different cgroup:: + + # ~/unshare -c # unshare cgroupns in some cgroup + # cat /proc/self/cgroup + 0::/ + # mkdir sub_cgrp_1 + # echo 0 > sub_cgrp_1/cgroup.procs + # cat /proc/self/cgroup + 0::/sub_cgrp_1 + +Each process gets its namespace-specific view of "/proc/$PID/cgroup" + +Processes running inside the cgroup namespace will be able to see +cgroup paths (in /proc/self/cgroup) only inside their root cgroup. +From within an unshared cgroupns:: + + # sleep 100000 & + [1] 7353 + # echo 7353 > sub_cgrp_1/cgroup.procs + # cat /proc/7353/cgroup + 0::/sub_cgrp_1 + +From the initial cgroup namespace, the real cgroup path will be +visible:: + + $ cat /proc/7353/cgroup + 0::/batchjobs/container_id1/sub_cgrp_1 + +From a sibling cgroup namespace (that is, a namespace rooted at a +different cgroup), the cgroup path relative to its own cgroup +namespace root will be shown. For instance, if PID 7353's cgroup +namespace root is at '/batchjobs/container_id2', then it will see:: + + # cat /proc/7353/cgroup + 0::/../container_id2/sub_cgrp_1 + +Note that the relative path always starts with '/' to indicate that +its relative to the cgroup namespace root of the caller. + + +Migration and setns(2) +---------------------- + +Processes inside a cgroup namespace can move into and out of the +namespace root if they have proper access to external cgroups. For +example, from inside a namespace with cgroupns root at +/batchjobs/container_id1, and assuming that the global hierarchy is +still accessible inside cgroupns:: + + # cat /proc/7353/cgroup + 0::/sub_cgrp_1 + # echo 7353 > batchjobs/container_id2/cgroup.procs + # cat /proc/7353/cgroup + 0::/../container_id2 + +Note that this kind of setup is not encouraged. A task inside cgroup +namespace should only be exposed to its own cgroupns hierarchy. + +setns(2) to another cgroup namespace is allowed when: + +(a) the process has CAP_SYS_ADMIN against its current user namespace +(b) the process has CAP_SYS_ADMIN against the target cgroup + namespace's userns + +No implicit cgroup changes happen with attaching to another cgroup +namespace. It is expected that the someone moves the attaching +process under the target cgroup namespace root. + + +Interaction with Other Namespaces +--------------------------------- + +Namespace specific cgroup hierarchy can be mounted by a process +running inside a non-init cgroup namespace:: + + # mount -t cgroup2 none $MOUNT_POINT + +This will mount the unified cgroup hierarchy with cgroupns root as the +filesystem root. The process needs CAP_SYS_ADMIN against its user and +mount namespaces. + +The virtualization of /proc/self/cgroup file combined with restricting +the view of cgroup hierarchy by namespace-private cgroupfs mount +provides a properly isolated cgroup view inside the container. + + +Information on Kernel Programming +================================= + +This section contains kernel programming information in the areas +where interacting with cgroup is necessary. cgroup core and +controllers are not covered. + + +Filesystem Support for Writeback +-------------------------------- + +A filesystem can support cgroup writeback by updating +address_space_operations->writepage[s]() to annotate bio's using the +following two functions. + + wbc_init_bio(@wbc, @bio) + Should be called for each bio carrying writeback data and + associates the bio with the inode's owner cgroup. Can be + called anytime between bio allocation and submission. + + wbc_account_io(@wbc, @page, @bytes) + Should be called for each data segment being written out. + While this function doesn't care exactly when it's called + during the writeback session, it's the easiest and most + natural to call it as data segments are added to a bio. + +With writeback bio's annotated, cgroup support can be enabled per +super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for +selective disabling of cgroup writeback support which is helpful when +certain filesystem features, e.g. journaled data mode, are +incompatible. + +wbc_init_bio() binds the specified bio to its cgroup. Depending on +the configuration, the bio may be executed at a lower priority and if +the writeback session is holding shared resources, e.g. a journal +entry, may lead to priority inversion. There is no one easy solution +for the problem. Filesystems can try to work around specific problem +cases by skipping wbc_init_bio() or using bio_associate_blkcg() +directly. + + +Deprecated v1 Core Features +=========================== + +- Multiple hierarchies including named ones are not supported. + +- All v1 mount options are not supported. + +- The "tasks" file is removed and "cgroup.procs" is not sorted. + +- "cgroup.clone_children" is removed. + +- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file + at the root instead. + + +Issues with v1 and Rationales for v2 +==================================== + +Multiple Hierarchies +-------------------- + +cgroup v1 allowed an arbitrary number of hierarchies and each +hierarchy could host any number of controllers. While this seemed to +provide a high level of flexibility, it wasn't useful in practice. + +For example, as there is only one instance of each controller, utility +type controllers such as freezer which can be useful in all +hierarchies could only be used in one. The issue is exacerbated by +the fact that controllers couldn't be moved to another hierarchy once +hierarchies were populated. Another issue was that all controllers +bound to a hierarchy were forced to have exactly the same view of the +hierarchy. It wasn't possible to vary the granularity depending on +the specific controller. + +In practice, these issues heavily limited which controllers could be +put on the same hierarchy and most configurations resorted to putting +each controller on its own hierarchy. Only closely related ones, such +as the cpu and cpuacct controllers, made sense to be put on the same +hierarchy. This often meant that userland ended up managing multiple +similar hierarchies repeating the same steps on each hierarchy +whenever a hierarchy management operation was necessary. + +Furthermore, support for multiple hierarchies came at a steep cost. +It greatly complicated cgroup core implementation but more importantly +the support for multiple hierarchies restricted how cgroup could be +used in general and what controllers was able to do. + +There was no limit on how many hierarchies there might be, which meant +that a thread's cgroup membership couldn't be described in finite +length. The key might contain any number of entries and was unlimited +in length, which made it highly awkward to manipulate and led to +addition of controllers which existed only to identify membership, +which in turn exacerbated the original problem of proliferating number +of hierarchies. + +Also, as a controller couldn't have any expectation regarding the +topologies of hierarchies other controllers might be on, each +controller had to assume that all other controllers were attached to +completely orthogonal hierarchies. This made it impossible, or at +least very cumbersome, for controllers to cooperate with each other. + +In most use cases, putting controllers on hierarchies which are +completely orthogonal to each other isn't necessary. What usually is +called for is the ability to have differing levels of granularity +depending on the specific controller. In other words, hierarchy may +be collapsed from leaf towards root when viewed from specific +controllers. For example, a given configuration might not care about +how memory is distributed beyond a certain level while still wanting +to control how CPU cycles are distributed. + + +Thread Granularity +------------------ + +cgroup v1 allowed threads of a process to belong to different cgroups. +This didn't make sense for some controllers and those controllers +ended up implementing different ways to ignore such situations but +much more importantly it blurred the line between API exposed to +individual applications and system management interface. + +Generally, in-process knowledge is available only to the process +itself; thus, unlike service-level organization of processes, +categorizing threads of a process requires active participation from +the application which owns the target process. + +cgroup v1 had an ambiguously defined delegation model which got abused +in combination with thread granularity. cgroups were delegated to +individual applications so that they can create and manage their own +sub-hierarchies and control resource distributions along them. This +effectively raised cgroup to the status of a syscall-like API exposed +to lay programs. + +First of all, cgroup has a fundamentally inadequate interface to be +exposed this way. For a process to access its own knobs, it has to +extract the path on the target hierarchy from /proc/self/cgroup, +construct the path by appending the name of the knob to the path, open +and then read and/or write to it. This is not only extremely clunky +and unusual but also inherently racy. There is no conventional way to +define transaction across the required steps and nothing can guarantee +that the process would actually be operating on its own sub-hierarchy. + +cgroup controllers implemented a number of knobs which would never be +accepted as public APIs because they were just adding control knobs to +system-management pseudo filesystem. cgroup ended up with interface +knobs which were not properly abstracted or refined and directly +revealed kernel internal details. These knobs got exposed to +individual applications through the ill-defined delegation mechanism +effectively abusing cgroup as a shortcut to implementing public APIs +without going through the required scrutiny. + +This was painful for both userland and kernel. Userland ended up with +misbehaving and poorly abstracted interfaces and kernel exposing and +locked into constructs inadvertently. + + +Competition Between Inner Nodes and Threads +------------------------------------------- + +cgroup v1 allowed threads to be in any cgroups which created an +interesting problem where threads belonging to a parent cgroup and its +children cgroups competed for resources. This was nasty as two +different types of entities competed and there was no obvious way to +settle it. Different controllers did different things. + +The cpu controller considered threads and cgroups as equivalents and +mapped nice levels to cgroup weights. This worked for some cases but +fell flat when children wanted to be allocated specific ratios of CPU +cycles and the number of internal threads fluctuated - the ratios +constantly changed as the number of competing entities fluctuated. +There also were other issues. The mapping from nice level to weight +wasn't obvious or universal, and there were various other knobs which +simply weren't available for threads. + +The io controller implicitly created a hidden leaf node for each +cgroup to host the threads. The hidden leaf had its own copies of all +the knobs with ``leaf_`` prefixed. While this allowed equivalent +control over internal threads, it was with serious drawbacks. It +always added an extra layer of nesting which wouldn't be necessary +otherwise, made the interface messy and significantly complicated the +implementation. + +The memory controller didn't have a way to control what happened +between internal tasks and child cgroups and the behavior was not +clearly defined. There were attempts to add ad-hoc behaviors and +knobs to tailor the behavior to specific workloads which would have +led to problems extremely difficult to resolve in the long term. + +Multiple controllers struggled with internal tasks and came up with +different ways to deal with it; unfortunately, all the approaches were +severely flawed and, furthermore, the widely different behaviors +made cgroup as a whole highly inconsistent. + +This clearly is a problem which needs to be addressed from cgroup core +in a uniform way. + + +Other Interface Issues +---------------------- + +cgroup v1 grew without oversight and developed a large number of +idiosyncrasies and inconsistencies. One issue on the cgroup core side +was how an empty cgroup was notified - a userland helper binary was +forked and executed for each event. The event delivery wasn't +recursive or delegatable. The limitations of the mechanism also led +to in-kernel event delivery filtering mechanism further complicating +the interface. + +Controller interfaces were problematic too. An extreme example is +controllers completely ignoring hierarchical organization and treating +all cgroups as if they were all located directly under the root +cgroup. Some controllers exposed a large amount of inconsistent +implementation details to userland. + +There also was no consistency across controllers. When a new cgroup +was created, some controllers defaulted to not imposing extra +restrictions while others disallowed any resource usage until +explicitly configured. Configuration knobs for the same type of +control used widely differing naming schemes and formats. Statistics +and information knobs were named arbitrarily and used different +formats and units even in the same controller. + +cgroup v2 establishes common conventions where appropriate and updates +controllers so that they expose minimal and consistent interfaces. + + +Controller Issues and Remedies +------------------------------ + +Memory +~~~~~~ + +The original lower boundary, the soft limit, is defined as a limit +that is per default unset. As a result, the set of cgroups that +global reclaim prefers is opt-in, rather than opt-out. The costs for +optimizing these mostly negative lookups are so high that the +implementation, despite its enormous size, does not even provide the +basic desirable behavior. First off, the soft limit has no +hierarchical meaning. All configured groups are organized in a global +rbtree and treated like equal peers, regardless where they are located +in the hierarchy. This makes subtree delegation impossible. Second, +the soft limit reclaim pass is so aggressive that it not just +introduces high allocation latencies into the system, but also impacts +system performance due to overreclaim, to the point where the feature +becomes self-defeating. + +The memory.low boundary on the other hand is a top-down allocated +reserve. A cgroup enjoys reclaim protection when it's within its low, +which makes delegation of subtrees possible. + +The original high boundary, the hard limit, is defined as a strict +limit that can not budge, even if the OOM killer has to be called. +But this generally goes against the goal of making the most out of the +available memory. The memory consumption of workloads varies during +runtime, and that requires users to overcommit. But doing that with a +strict upper limit requires either a fairly accurate prediction of the +working set size or adding slack to the limit. Since working set size +estimation is hard and error prone, and getting it wrong results in +OOM kills, most users tend to err on the side of a looser limit and +end up wasting precious resources. + +The memory.high boundary on the other hand can be set much more +conservatively. When hit, it throttles allocations by forcing them +into direct reclaim to work off the excess, but it never invokes the +OOM killer. As a result, a high boundary that is chosen too +aggressively will not terminate the processes, but instead it will +lead to gradual performance degradation. The user can monitor this +and make corrections until the minimal memory footprint that still +gives acceptable performance is found. + +In extreme cases, with many concurrent allocations and a complete +breakdown of reclaim progress within the group, the high boundary can +be exceeded. But even then it's mostly better to satisfy the +allocation from the slack available in other groups or the rest of the +system than killing the group. Otherwise, memory.max is there to +limit this type of spillover and ultimately contain buggy or even +malicious applications. + +Setting the original memory.limit_in_bytes below the current usage was +subject to a race condition, where concurrent charges could cause the +limit setting to fail. memory.max on the other hand will first set the +limit to prevent new charges, and then reclaim and OOM kill until the +new limit is met - or the task writing to memory.max is killed. + +The combined memory+swap accounting and limiting is replaced by real +control over swap space. + +The main argument for a combined memory+swap facility in the original +cgroup design was that global or parental pressure would always be +able to swap all anonymous memory of a child group, regardless of the +child's own (possibly untrusted) configuration. However, untrusted +groups can sabotage swapping by other means - such as referencing its +anonymous memory in a tight loop - and an admin can not assume full +swappability when overcommitting untrusted jobs. + +For trusted jobs, on the other hand, a combined counter is not an +intuitive userspace interface, and it flies in the face of the idea +that cgroup controllers should account and limit specific physical +resources. Swap space is a resource like all others in the system, +and that's why unified hierarchy allows distributing it separately. diff --git a/Documentation/admin-guide/conf.py b/Documentation/admin-guide/conf.py new file mode 100644 index 000000000..86f738953 --- /dev/null +++ b/Documentation/admin-guide/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = 'Linux Kernel User Documentation' + +tags.add("subproject") + +latex_documents = [ + ('index', 'linux-user.tex', 'Linux Kernel User Documentation', + 'The kernel development community', 'manual'), +] diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst new file mode 100644 index 000000000..7fadc0533 --- /dev/null +++ b/Documentation/admin-guide/devices.rst @@ -0,0 +1,268 @@ + +Linux allocated devices (4.x+ version) +====================================== + +This list is the Linux Device List, the official registry of allocated +device numbers and ``/dev`` directory nodes for the Linux operating +system. + +The LaTeX version of this document is no longer maintained, nor is +the document that used to reside at lanana.org. This version in the +mainline Linux kernel is the master document. Updates shall be sent +as patches to the kernel maintainers (see the +:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` document). +Specifically explore the sections titled "CHAR and MISC DRIVERS", and +"BLOCK LAYER" in the MAINTAINERS file to find the right maintainers +to involve for character and block devices. + +This document is included by reference into the Filesystem Hierarchy +Standard (FHS). The FHS is available from http://www.pathname.com/fhs/. + +Allocations marked (68k/Amiga) apply to Linux/68k on the Amiga +platform only. Allocations marked (68k/Atari) apply to Linux/68k on +the Atari platform only. + +This document is in the public domain. The authors requests, however, +that semantically altered versions are not distributed without +permission of the authors, assuming the authors can be contacted without +an unreasonable effort. + + +.. attention:: + + DEVICE DRIVERS AUTHORS PLEASE READ THIS + + Linux now has extensive support for dynamic allocation of device numbering + and can use ``sysfs`` and ``udev`` (``systemd``) to handle the naming needs. + There are still some exceptions in the serial and boot device area. Before + asking for a device number make sure you actually need one. + + To have a major number allocated, or a minor number in situations + where that applies (e.g. busmice), please submit a patch and send to + the authors as indicated above. + + Keep the description of the device *in the same format + as this list*. The reason for this is that it is the only way we have + found to ensure we have all the requisite information to publish your + device and avoid conflicts. + + Finally, sometimes we have to play "namespace police." Please don't be + offended. We often get submissions for ``/dev`` names that would be bound + to cause conflicts down the road. We are trying to avoid getting in a + situation where we would have to suffer an incompatible forward + change. Therefore, please consult with us **before** you make your + device names and numbers in any way public, at least to the point + where it would be at all difficult to get them changed. + + Your cooperation is appreciated. + +.. include:: devices.txt + :literal: + +Additional ``/dev/`` directory entries +-------------------------------------- + +This section details additional entries that should or may exist in +the /dev directory. It is preferred that symbolic links use the same +form (absolute or relative) as is indicated here. Links are +classified as "hard" or "symbolic" depending on the preferred type of +link; if possible, the indicated type of link should be used. + +Compulsory links +++++++++++++++++ + +These links should exist on all systems: + +=============== =============== =============== =============================== +/dev/fd /proc/self/fd symbolic File descriptors +/dev/stdin fd/0 symbolic stdin file descriptor +/dev/stdout fd/1 symbolic stdout file descriptor +/dev/stderr fd/2 symbolic stderr file descriptor +/dev/nfsd socksys symbolic Required by iBCS-2 +/dev/X0R null symbolic Required by iBCS-2 +=============== =============== =============== =============================== + +Note: ``/dev/X0R`` is <letter X>-<digit 0>-<letter R>. + +Recommended links ++++++++++++++++++ + +It is recommended that these links exist on all systems: + + +=============== =============== =============== =============================== +/dev/core /proc/kcore symbolic Backward compatibility +/dev/ramdisk ram0 symbolic Backward compatibility +/dev/ftape qft0 symbolic Backward compatibility +/dev/bttv0 video0 symbolic Backward compatibility +/dev/radio radio0 symbolic Backward compatibility +/dev/i2o* /dev/i2o/* symbolic Backward compatibility +/dev/scd? sr? hard Alternate SCSI CD-ROM name +=============== =============== =============== =============================== + +Locally defined links ++++++++++++++++++++++ + +The following links may be established locally to conform to the +configuration of the system. This is merely a tabulation of existing +practice, and does not constitute a recommendation. However, if they +exist, they should have the following uses. + +=============== =============== =============== =============================== +/dev/mouse mouse port symbolic Current mouse device +/dev/tape tape device symbolic Current tape device +/dev/cdrom CD-ROM device symbolic Current CD-ROM device +/dev/cdwriter CD-writer symbolic Current CD-writer device +/dev/scanner scanner symbolic Current scanner device +/dev/modem modem port symbolic Current dialout device +/dev/root root device symbolic Current root filesystem +/dev/swap swap device symbolic Current swap device +=============== =============== =============== =============================== + +``/dev/modem`` should not be used for a modem which supports dialin as +well as dialout, as it tends to cause lock file problems. If it +exists, ``/dev/modem`` should point to the appropriate primary TTY device +(the use of the alternate callout devices is deprecated). + +For SCSI devices, ``/dev/tape`` and ``/dev/cdrom`` should point to the +*cooked* devices (``/dev/st*`` and ``/dev/sr*``, respectively), whereas +``/dev/cdwriter`` and /dev/scanner should point to the appropriate generic +SCSI devices (/dev/sg*). + +``/dev/mouse`` may point to a primary serial TTY device, a hardware mouse +device, or a socket for a mouse driver program (e.g. ``/dev/gpmdata``). + +Sockets and pipes ++++++++++++++++++ + +Non-transient sockets and named pipes may exist in /dev. Common entries are: + +=============== =============== =============================================== +/dev/printer socket lpd local socket +/dev/log socket syslog local socket +/dev/gpmdata socket gpm mouse multiplexer +=============== =============== =============================================== + +Mount points +++++++++++++ + +The following names are reserved for mounting special filesystems +under /dev. These special filesystems provide kernel interfaces that +cannot be provided with standard device nodes. + +=============== =============== =============================================== +/dev/pts devpts PTY slave filesystem +/dev/shm tmpfs POSIX shared memory maintenance access +=============== =============== =============================================== + +Terminal devices +---------------- + +Terminal, or TTY devices are a special class of character devices. A +terminal device is any device that could act as a controlling terminal +for a session; this includes virtual consoles, serial ports, and +pseudoterminals (PTYs). + +All terminal devices share a common set of capabilities known as line +disciplines; these include the common terminal line discipline as well +as SLIP and PPP modes. + +All terminal devices are named similarly; this section explains the +naming and use of the various types of TTYs. Note that the naming +conventions include several historical warts; some of these are +Linux-specific, some were inherited from other systems, and some +reflect Linux outgrowing a borrowed convention. + +A hash mark (``#``) in a device name is used here to indicate a decimal +number without leading zeroes. + +Virtual consoles and the console device ++++++++++++++++++++++++++++++++++++++++ + +Virtual consoles are full-screen terminal displays on the system video +monitor. Virtual consoles are named ``/dev/tty#``, with numbering +starting at ``/dev/tty1``; ``/dev/tty0`` is the current virtual console. +``/dev/tty0`` is the device that should be used to access the system video +card on those architectures for which the frame buffer devices +(``/dev/fb*``) are not applicable. Do not use ``/dev/console`` +for this purpose. + +The console device, ``/dev/console``, is the device to which system +messages should be sent, and on which logins should be permitted in +single-user mode. Starting with Linux 2.1.71, ``/dev/console`` is managed +by the kernel; for previous versions it should be a symbolic link to +either ``/dev/tty0``, a specific virtual console such as ``/dev/tty1``, or to +a serial port primary (``tty*``, not ``cu*``) device, depending on the +configuration of the system. + +Serial ports +++++++++++++ + +Serial ports are RS-232 serial ports and any device which simulates +one, either in hardware (such as internal modems) or in software (such +as the ISDN driver.) Under Linux, each serial ports has two device +names, the primary or callin device and the alternate or callout one. +Each kind of device is indicated by a different letter. For any +letter X, the names of the devices are ``/dev/ttyX#`` and ``/dev/cux#``, +respectively; for historical reasons, ``/dev/ttyS#`` and ``/dev/ttyC#`` +correspond to ``/dev/cua#`` and ``/dev/cub#``. In the future, it should be +expected that multiple letters will be used; all letters will be upper +case for the "tty" device (e.g. ``/dev/ttyDP#``) and lower case for the +"cu" device (e.g. ``/dev/cudp#``). + +The names ``/dev/ttyQ#`` and ``/dev/cuq#`` are reserved for local use. + +The alternate devices provide for kernel-based exclusion and somewhat +different defaults than the primary devices. Their main purpose is to +allow the use of serial ports with programs with no inherent or broken +support for serial ports. Their use is deprecated, and they may be +removed from a future version of Linux. + +Arbitration of serial ports is provided by the use of lock files with +the names ``/var/lock/LCK..ttyX#``. The contents of the lock file should +be the PID of the locking process as an ASCII number. + +It is common practice to install links such as /dev/modem +which point to serial ports. In order to ensure proper locking in the +presence of these links, it is recommended that software chase +symlinks and lock all possible names; additionally, it is recommended +that a lock file be installed with the corresponding alternate +device. In order to avoid deadlocks, it is recommended that the locks +are acquired in the following order, and released in the reverse: + + 1. The symbolic link name, if any (``/var/lock/LCK..modem``) + 2. The "tty" name (``/var/lock/LCK..ttyS2``) + 3. The alternate device name (``/var/lock/LCK..cua2``) + +In the case of nested symbolic links, the lock files should be +installed in the order the symlinks are resolved. + +Under no circumstances should an application hold a lock while waiting +for another to be released. In addition, applications which attempt +to create lock files for the corresponding alternate device names +should take into account the possibility of being used on a non-serial +port TTY, for which no alternate device would exist. + +Pseudoterminals (PTYs) +++++++++++++++++++++++ + +Pseudoterminals, or PTYs, are used to create login sessions or provide +other capabilities requiring a TTY line discipline (including SLIP or +PPP capability) to arbitrary data-generation processes. Each PTY has +a master side, named ``/dev/pty[p-za-e][0-9a-f]``, and a slave side, named +``/dev/tty[p-za-e][0-9a-f]``. The kernel arbitrates the use of PTYs by +allowing each master side to be opened only once. + +Once the master side has been opened, the corresponding slave device +can be used in the same manner as any TTY device. The master and +slave devices are connected by the kernel, generating the equivalent +of a bidirectional pipe with TTY capabilities. + +Recent versions of the Linux kernels and GNU libc contain support for +the System V/Unix98 naming scheme for PTYs, which assigns a common +device, ``/dev/ptmx``, to all the masters (opening it will automatically +give you a previously unassigned PTY) and a subdirectory, ``/dev/pts``, +for the slaves; the slaves are named with decimal integers (``/dev/pts/#`` +in our notation). This removes the problem of exhausting the +namespace and enables the kernel to automatically create the device +nodes for the slaves on demand using the "devpts" filesystem. diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt new file mode 100644 index 000000000..9cd7926c8 --- /dev/null +++ b/Documentation/admin-guide/devices.txt @@ -0,0 +1,3092 @@ + 0 Unnamed devices (e.g. non-device mounts) + 0 = reserved as null device number + See block major 144, 145, 146 for expansion areas. + + 1 char Memory devices + 1 = /dev/mem Physical memory access + 2 = /dev/kmem Kernel virtual memory access + 3 = /dev/null Null device + 4 = /dev/port I/O port access + 5 = /dev/zero Null byte source + 6 = /dev/core OBSOLETE - replaced by /proc/kcore + 7 = /dev/full Returns ENOSPC on write + 8 = /dev/random Nondeterministic random number gen. + 9 = /dev/urandom Faster, less secure random number gen. + 10 = /dev/aio Asynchronous I/O notification interface + 11 = /dev/kmsg Writes to this come out as printk's, reads + export the buffered printk records. + 12 = /dev/oldmem OBSOLETE - replaced by /proc/vmcore + + 1 block RAM disk + 0 = /dev/ram0 First RAM disk + 1 = /dev/ram1 Second RAM disk + ... + 250 = /dev/initrd Initial RAM disk + + Older kernels had /dev/ramdisk (1, 1) here. + /dev/initrd refers to a RAM disk which was preloaded + by the boot loader; newer kernels use /dev/ram0 for + the initrd. + + 2 char Pseudo-TTY masters + 0 = /dev/ptyp0 First PTY master + 1 = /dev/ptyp1 Second PTY master + ... + 255 = /dev/ptyef 256th PTY master + + Pseudo-tty's are named as follows: + * Masters are "pty", slaves are "tty"; + * the fourth letter is one of pqrstuvwxyzabcde indicating + the 1st through 16th series of 16 pseudo-ttys each, and + * the fifth letter is one of 0123456789abcdef indicating + the position within the series. + + These are the old-style (BSD) PTY devices; Unix98 + devices are on major 128 and above and use the PTY + master multiplex (/dev/ptmx) to acquire a PTY on + demand. + + 2 block Floppy disks + 0 = /dev/fd0 Controller 0, drive 0, autodetect + 1 = /dev/fd1 Controller 0, drive 1, autodetect + 2 = /dev/fd2 Controller 0, drive 2, autodetect + 3 = /dev/fd3 Controller 0, drive 3, autodetect + 128 = /dev/fd4 Controller 1, drive 0, autodetect + 129 = /dev/fd5 Controller 1, drive 1, autodetect + 130 = /dev/fd6 Controller 1, drive 2, autodetect + 131 = /dev/fd7 Controller 1, drive 3, autodetect + + To specify format, add to the autodetect device number: + 0 = /dev/fd? Autodetect format + 4 = /dev/fd?d360 5.25" 360K in a 360K drive(1) + 20 = /dev/fd?h360 5.25" 360K in a 1200K drive(1) + 48 = /dev/fd?h410 5.25" 410K in a 1200K drive + 64 = /dev/fd?h420 5.25" 420K in a 1200K drive + 24 = /dev/fd?h720 5.25" 720K in a 1200K drive + 80 = /dev/fd?h880 5.25" 880K in a 1200K drive(1) + 8 = /dev/fd?h1200 5.25" 1200K in a 1200K drive(1) + 40 = /dev/fd?h1440 5.25" 1440K in a 1200K drive(1) + 56 = /dev/fd?h1476 5.25" 1476K in a 1200K drive + 72 = /dev/fd?h1494 5.25" 1494K in a 1200K drive + 92 = /dev/fd?h1600 5.25" 1600K in a 1200K drive(1) + + 12 = /dev/fd?u360 3.5" 360K Double Density(2) + 16 = /dev/fd?u720 3.5" 720K Double Density(1) + 120 = /dev/fd?u800 3.5" 800K Double Density(2) + 52 = /dev/fd?u820 3.5" 820K Double Density + 68 = /dev/fd?u830 3.5" 830K Double Density + 84 = /dev/fd?u1040 3.5" 1040K Double Density(1) + 88 = /dev/fd?u1120 3.5" 1120K Double Density(1) + 28 = /dev/fd?u1440 3.5" 1440K High Density(1) + 124 = /dev/fd?u1600 3.5" 1600K High Density(1) + 44 = /dev/fd?u1680 3.5" 1680K High Density(3) + 60 = /dev/fd?u1722 3.5" 1722K High Density + 76 = /dev/fd?u1743 3.5" 1743K High Density + 96 = /dev/fd?u1760 3.5" 1760K High Density + 116 = /dev/fd?u1840 3.5" 1840K High Density(3) + 100 = /dev/fd?u1920 3.5" 1920K High Density(1) + 32 = /dev/fd?u2880 3.5" 2880K Extra Density(1) + 104 = /dev/fd?u3200 3.5" 3200K Extra Density + 108 = /dev/fd?u3520 3.5" 3520K Extra Density + 112 = /dev/fd?u3840 3.5" 3840K Extra Density(1) + + 36 = /dev/fd?CompaQ Compaq 2880K drive; obsolete? + + (1) Autodetectable format + (2) Autodetectable format in a Double Density (720K) drive only + (3) Autodetectable format in a High Density (1440K) drive only + + NOTE: The letter in the device name (d, q, h or u) + signifies the type of drive: 5.25" Double Density (d), + 5.25" Quad Density (q), 5.25" High Density (h) or 3.5" + (any model, u). The use of the capital letters D, H + and E for the 3.5" models have been deprecated, since + the drive type is insignificant for these devices. + + 3 char Pseudo-TTY slaves + 0 = /dev/ttyp0 First PTY slave + 1 = /dev/ttyp1 Second PTY slave + ... + 255 = /dev/ttyef 256th PTY slave + + These are the old-style (BSD) PTY devices; Unix98 + devices are on major 136 and above. + + 3 block First MFM, RLL and IDE hard disk/CD-ROM interface + 0 = /dev/hda Master: whole disk (or CD-ROM) + 64 = /dev/hdb Slave: whole disk (or CD-ROM) + + For partitions, add to the whole disk device number: + 0 = /dev/hd? Whole disk + 1 = /dev/hd?1 First partition + 2 = /dev/hd?2 Second partition + ... + 63 = /dev/hd?63 63rd partition + + For Linux/i386, partitions 1-4 are the primary + partitions, and 5 and above are logical partitions. + Other versions of Linux use partitioning schemes + appropriate to their respective architectures. + + 4 char TTY devices + 0 = /dev/tty0 Current virtual console + + 1 = /dev/tty1 First virtual console + ... + 63 = /dev/tty63 63rd virtual console + 64 = /dev/ttyS0 First UART serial port + ... + 255 = /dev/ttyS191 192nd UART serial port + + UART serial ports refer to 8250/16450/16550 series devices. + + Older versions of the Linux kernel used this major + number for BSD PTY devices. As of Linux 2.1.115, this + is no longer supported. Use major numbers 2 and 3. + + 4 block Aliases for dynamically allocated major devices to be used + when its not possible to create the real device nodes + because the root filesystem is mounted read-only. + + 0 = /dev/root + + 5 char Alternate TTY devices + 0 = /dev/tty Current TTY device + 1 = /dev/console System console + 2 = /dev/ptmx PTY master multiplex + 3 = /dev/ttyprintk User messages via printk TTY device + 64 = /dev/cua0 Callout device for ttyS0 + ... + 255 = /dev/cua191 Callout device for ttyS191 + + (5,1) is /dev/console starting with Linux 2.1.71. See + the section on terminal devices for more information + on /dev/console. + + 6 char Parallel printer devices + 0 = /dev/lp0 Parallel printer on parport0 + 1 = /dev/lp1 Parallel printer on parport1 + ... + + Current Linux kernels no longer have a fixed mapping + between parallel ports and I/O addresses. Instead, + they are redirected through the parport multiplex layer. + + 7 char Virtual console capture devices + 0 = /dev/vcs Current vc text (glyph) contents + 1 = /dev/vcs1 tty1 text (glyph) contents + ... + 63 = /dev/vcs63 tty63 text (glyph) contents + 64 = /dev/vcsu Current vc text (unicode) contents + 65 = /dev/vcsu1 tty1 text (unicode) contents + ... + 127 = /dev/vcsu63 tty63 text (unicode) contents + 128 = /dev/vcsa Current vc text/attribute (glyph) contents + 129 = /dev/vcsa1 tty1 text/attribute (glyph) contents + ... + 191 = /dev/vcsa63 tty63 text/attribute (glyph) contents + + NOTE: These devices permit both read and write access. + + 7 block Loopback devices + 0 = /dev/loop0 First loop device + 1 = /dev/loop1 Second loop device + ... + + The loop devices are used to mount filesystems not + associated with block devices. The binding to the + loop devices is handled by mount(8) or losetup(8). + + 8 block SCSI disk devices (0-15) + 0 = /dev/sda First SCSI disk whole disk + 16 = /dev/sdb Second SCSI disk whole disk + 32 = /dev/sdc Third SCSI disk whole disk + ... + 240 = /dev/sdp Sixteenth SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 9 char SCSI tape devices + 0 = /dev/st0 First SCSI tape, mode 0 + 1 = /dev/st1 Second SCSI tape, mode 0 + ... + 32 = /dev/st0l First SCSI tape, mode 1 + 33 = /dev/st1l Second SCSI tape, mode 1 + ... + 64 = /dev/st0m First SCSI tape, mode 2 + 65 = /dev/st1m Second SCSI tape, mode 2 + ... + 96 = /dev/st0a First SCSI tape, mode 3 + 97 = /dev/st1a Second SCSI tape, mode 3 + ... + 128 = /dev/nst0 First SCSI tape, mode 0, no rewind + 129 = /dev/nst1 Second SCSI tape, mode 0, no rewind + ... + 160 = /dev/nst0l First SCSI tape, mode 1, no rewind + 161 = /dev/nst1l Second SCSI tape, mode 1, no rewind + ... + 192 = /dev/nst0m First SCSI tape, mode 2, no rewind + 193 = /dev/nst1m Second SCSI tape, mode 2, no rewind + ... + 224 = /dev/nst0a First SCSI tape, mode 3, no rewind + 225 = /dev/nst1a Second SCSI tape, mode 3, no rewind + ... + + "No rewind" refers to the omission of the default + automatic rewind on device close. The MTREW or MTOFFL + ioctl()'s can be used to rewind the tape regardless of + the device used to access it. + + 9 block Metadisk (RAID) devices + 0 = /dev/md0 First metadisk group + 1 = /dev/md1 Second metadisk group + ... + + The metadisk driver is used to span a + filesystem across multiple physical disks. + + 10 char Non-serial mice, misc features + 0 = /dev/logibm Logitech bus mouse + 1 = /dev/psaux PS/2-style mouse port + 2 = /dev/inportbm Microsoft Inport bus mouse + 3 = /dev/atibm ATI XL bus mouse + 4 = /dev/jbm J-mouse + 4 = /dev/amigamouse Amiga mouse (68k/Amiga) + 5 = /dev/atarimouse Atari mouse + 6 = /dev/sunmouse Sun mouse + 7 = /dev/amigamouse1 Second Amiga mouse + 8 = /dev/smouse Simple serial mouse driver + 9 = /dev/pc110pad IBM PC-110 digitizer pad + 10 = /dev/adbmouse Apple Desktop Bus mouse + 11 = /dev/vrtpanel Vr41xx embedded touch panel + 13 = /dev/vpcmouse Connectix Virtual PC Mouse + 14 = /dev/touchscreen/ucb1x00 UCB 1x00 touchscreen + 15 = /dev/touchscreen/mk712 MK712 touchscreen + 128 = /dev/beep Fancy beep device + 129 = + 130 = /dev/watchdog Watchdog timer port + 131 = /dev/temperature Machine internal temperature + 132 = /dev/hwtrap Hardware fault trap + 133 = /dev/exttrp External device trap + 134 = /dev/apm_bios Advanced Power Management BIOS + 135 = /dev/rtc Real Time Clock + 137 = /dev/vhci Bluetooth virtual HCI driver + 139 = /dev/openprom SPARC OpenBoot PROM + 140 = /dev/relay8 Berkshire Products Octal relay card + 141 = /dev/relay16 Berkshire Products ISO-16 relay card + 142 = + 143 = /dev/pciconf PCI configuration space + 144 = /dev/nvram Non-volatile configuration RAM + 145 = /dev/hfmodem Soundcard shortwave modem control + 146 = /dev/graphics Linux/SGI graphics device + 147 = /dev/opengl Linux/SGI OpenGL pipe + 148 = /dev/gfx Linux/SGI graphics effects device + 149 = /dev/input/mouse Linux/SGI Irix emulation mouse + 150 = /dev/input/keyboard Linux/SGI Irix emulation keyboard + 151 = /dev/led Front panel LEDs + 152 = /dev/kpoll Kernel Poll Driver + 153 = /dev/mergemem Memory merge device + 154 = /dev/pmu Macintosh PowerBook power manager + 155 = /dev/isictl MultiTech ISICom serial control + 156 = /dev/lcd Front panel LCD display + 157 = /dev/ac Applicom Intl Profibus card + 158 = /dev/nwbutton Netwinder external button + 159 = /dev/nwdebug Netwinder debug interface + 160 = /dev/nwflash Netwinder flash memory + 161 = /dev/userdma User-space DMA access + 162 = /dev/smbus System Management Bus + 163 = /dev/lik Logitech Internet Keyboard + 164 = /dev/ipmo Intel Intelligent Platform Management + 165 = /dev/vmmon VMware virtual machine monitor + 166 = /dev/i2o/ctl I2O configuration manager + 167 = /dev/specialix_sxctl Specialix serial control + 168 = /dev/tcldrv Technology Concepts serial control + 169 = /dev/specialix_rioctl Specialix RIO serial control + 170 = /dev/thinkpad/thinkpad IBM Thinkpad devices + 171 = /dev/srripc QNX4 API IPC manager + 172 = /dev/usemaclone Semaphore clone device + 173 = /dev/ipmikcs Intelligent Platform Management + 174 = /dev/uctrl SPARCbook 3 microcontroller + 175 = /dev/agpgart AGP Graphics Address Remapping Table + 176 = /dev/gtrsc Gorgy Timing radio clock + 177 = /dev/cbm Serial CBM bus + 178 = /dev/jsflash JavaStation OS flash SIMM + 179 = /dev/xsvc High-speed shared-mem/semaphore service + 180 = /dev/vrbuttons Vr41xx button input device + 181 = /dev/toshiba Toshiba laptop SMM support + 182 = /dev/perfctr Performance-monitoring counters + 183 = /dev/hwrng Generic random number generator + 184 = /dev/cpu/microcode CPU microcode update interface + 186 = /dev/atomicps Atomic shapshot of process state data + 187 = /dev/irnet IrNET device + 188 = /dev/smbusbios SMBus BIOS + 189 = /dev/ussp_ctl User space serial port control + 190 = /dev/crash Mission Critical Linux crash dump facility + 191 = /dev/pcl181 <information missing> + 192 = /dev/nas_xbus NAS xbus LCD/buttons access + 193 = /dev/d7s SPARC 7-segment display + 194 = /dev/zkshim Zero-Knowledge network shim control + 195 = /dev/elographics/e2201 Elographics touchscreen E271-2201 + 196 = /dev/vfio/vfio VFIO userspace driver interface + 197 = /dev/pxa3xx-gcu PXA3xx graphics controller unit driver + 198 = /dev/sexec Signed executable interface + 199 = /dev/scanners/cuecat :CueCat barcode scanner + 200 = /dev/net/tun TAP/TUN network device + 201 = /dev/button/gulpb Transmeta GULP-B buttons + 202 = /dev/emd/ctl Enhanced Metadisk RAID (EMD) control + 203 = /dev/cuse Cuse (character device in user-space) + 204 = /dev/video/em8300 EM8300 DVD decoder control + 205 = /dev/video/em8300_mv EM8300 DVD decoder video + 206 = /dev/video/em8300_ma EM8300 DVD decoder audio + 207 = /dev/video/em8300_sp EM8300 DVD decoder subpicture + 208 = /dev/compaq/cpqphpc Compaq PCI Hot Plug Controller + 209 = /dev/compaq/cpqrid Compaq Remote Insight Driver + 210 = /dev/impi/bt IMPI coprocessor block transfer + 211 = /dev/impi/smic IMPI coprocessor stream interface + 212 = /dev/watchdogs/0 First watchdog device + 213 = /dev/watchdogs/1 Second watchdog device + 214 = /dev/watchdogs/2 Third watchdog device + 215 = /dev/watchdogs/3 Fourth watchdog device + 216 = /dev/fujitsu/apanel Fujitsu/Siemens application panel + 217 = /dev/ni/natmotn National Instruments Motion + 218 = /dev/kchuid Inter-process chuid control + 219 = /dev/modems/mwave MWave modem firmware upload + 220 = /dev/mptctl Message passing technology (MPT) control + 221 = /dev/mvista/hssdsi Montavista PICMG hot swap system driver + 222 = /dev/mvista/hasi Montavista PICMG high availability + 223 = /dev/input/uinput User level driver support for input + 224 = /dev/tpm TCPA TPM driver + 225 = /dev/pps Pulse Per Second driver + 226 = /dev/systrace Systrace device + 227 = /dev/mcelog X86_64 Machine Check Exception driver + 228 = /dev/hpet HPET driver + 229 = /dev/fuse Fuse (virtual filesystem in user-space) + 230 = /dev/midishare MidiShare driver + 231 = /dev/snapshot System memory snapshot device + 232 = /dev/kvm Kernel-based virtual machine (hardware virtualization extensions) + 233 = /dev/kmview View-OS A process with a view + 234 = /dev/btrfs-control Btrfs control device + 235 = /dev/autofs Autofs control device + 236 = /dev/mapper/control Device-Mapper control device + 237 = /dev/loop-control Loopback control device + 238 = /dev/vhost-net Host kernel accelerator for virtio net + 239 = /dev/uhid User-space I/O driver support for HID subsystem + 240 = /dev/userio Serio driver testing device + 241 = /dev/vhost-vsock Host kernel driver for virtio vsock + + 242-254 Reserved for local use + 255 Reserved for MISC_DYNAMIC_MINOR + + 11 char Raw keyboard device (Linux/SPARC only) + 0 = /dev/kbd Raw keyboard device + + 11 char Serial Mux device (Linux/PA-RISC only) + 0 = /dev/ttyB0 First mux port + 1 = /dev/ttyB1 Second mux port + ... + + 11 block SCSI CD-ROM devices + 0 = /dev/scd0 First SCSI CD-ROM + 1 = /dev/scd1 Second SCSI CD-ROM + ... + + The prefix /dev/sr (instead of /dev/scd) has been deprecated. + + 12 char QIC-02 tape + 2 = /dev/ntpqic11 QIC-11, no rewind-on-close + 3 = /dev/tpqic11 QIC-11, rewind-on-close + 4 = /dev/ntpqic24 QIC-24, no rewind-on-close + 5 = /dev/tpqic24 QIC-24, rewind-on-close + 6 = /dev/ntpqic120 QIC-120, no rewind-on-close + 7 = /dev/tpqic120 QIC-120, rewind-on-close + 8 = /dev/ntpqic150 QIC-150, no rewind-on-close + 9 = /dev/tpqic150 QIC-150, rewind-on-close + + The device names specified are proposed -- if there + are "standard" names for these devices, please let me know. + + 12 block + + 13 char Input core + 0 = /dev/input/js0 First joystick + 1 = /dev/input/js1 Second joystick + ... + 32 = /dev/input/mouse0 First mouse + 33 = /dev/input/mouse1 Second mouse + ... + 63 = /dev/input/mice Unified mouse + 64 = /dev/input/event0 First event queue + 65 = /dev/input/event1 Second event queue + ... + + Each device type has 5 bits (32 minors). + + 13 block Previously used for the XT disk (/dev/xdN) + Deleted in kernel v3.9. + + 14 char Open Sound System (OSS) + 0 = /dev/mixer Mixer control + 1 = /dev/sequencer Audio sequencer + 2 = /dev/midi00 First MIDI port + 3 = /dev/dsp Digital audio + 4 = /dev/audio Sun-compatible digital audio + 6 = + 7 = /dev/audioctl SPARC audio control device + 8 = /dev/sequencer2 Sequencer -- alternate device + 16 = /dev/mixer1 Second soundcard mixer control + 17 = /dev/patmgr0 Sequencer patch manager + 18 = /dev/midi01 Second MIDI port + 19 = /dev/dsp1 Second soundcard digital audio + 20 = /dev/audio1 Second soundcard Sun digital audio + 33 = /dev/patmgr1 Sequencer patch manager + 34 = /dev/midi02 Third MIDI port + 50 = /dev/midi03 Fourth MIDI port + + 14 block + + 15 char Joystick + 0 = /dev/js0 First analog joystick + 1 = /dev/js1 Second analog joystick + ... + 128 = /dev/djs0 First digital joystick + 129 = /dev/djs1 Second digital joystick + ... + 15 block Sony CDU-31A/CDU-33A CD-ROM + 0 = /dev/sonycd Sony CDU-31a CD-ROM + + 16 char Non-SCSI scanners + 0 = /dev/gs4500 Genius 4500 handheld scanner + + 16 block GoldStar CD-ROM + 0 = /dev/gscd GoldStar CD-ROM + + 17 char OBSOLETE (was Chase serial card) + 0 = /dev/ttyH0 First Chase port + 1 = /dev/ttyH1 Second Chase port + ... + 17 block Optics Storage CD-ROM + 0 = /dev/optcd Optics Storage CD-ROM + + 18 char OBSOLETE (was Chase serial card - alternate devices) + 0 = /dev/cuh0 Callout device for ttyH0 + 1 = /dev/cuh1 Callout device for ttyH1 + ... + 18 block Sanyo CD-ROM + 0 = /dev/sjcd Sanyo CD-ROM + + 19 char Cyclades serial card + 0 = /dev/ttyC0 First Cyclades port + ... + 31 = /dev/ttyC31 32nd Cyclades port + + 19 block "Double" compressed disk + 0 = /dev/double0 First compressed disk + ... + 7 = /dev/double7 Eighth compressed disk + 128 = /dev/cdouble0 Mirror of first compressed disk + ... + 135 = /dev/cdouble7 Mirror of eighth compressed disk + + See the Double documentation for the meaning of the + mirror devices. + + 20 char Cyclades serial card - alternate devices + 0 = /dev/cub0 Callout device for ttyC0 + ... + 31 = /dev/cub31 Callout device for ttyC31 + + 20 block Hitachi CD-ROM (under development) + 0 = /dev/hitcd Hitachi CD-ROM + + 21 char Generic SCSI access + 0 = /dev/sg0 First generic SCSI device + 1 = /dev/sg1 Second generic SCSI device + ... + + Most distributions name these /dev/sga, /dev/sgb...; + this sets an unnecessary limit of 26 SCSI devices in + the system and is counter to standard Linux + device-naming practice. + + 21 block Acorn MFM hard drive interface + 0 = /dev/mfma First MFM drive whole disk + 64 = /dev/mfmb Second MFM drive whole disk + + This device is used on the ARM-based Acorn RiscPC. + Partitions are handled the same way as for IDE disks + (see major number 3). + + 22 char Digiboard serial card + 0 = /dev/ttyD0 First Digiboard port + 1 = /dev/ttyD1 Second Digiboard port + ... + 22 block Second IDE hard disk/CD-ROM interface + 0 = /dev/hdc Master: whole disk (or CD-ROM) + 64 = /dev/hdd Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 23 char Digiboard serial card - alternate devices + 0 = /dev/cud0 Callout device for ttyD0 + 1 = /dev/cud1 Callout device for ttyD1 + ... + 23 block Mitsumi proprietary CD-ROM + 0 = /dev/mcd Mitsumi CD-ROM + + 24 char Stallion serial card + 0 = /dev/ttyE0 Stallion port 0 card 0 + 1 = /dev/ttyE1 Stallion port 1 card 0 + ... + 64 = /dev/ttyE64 Stallion port 0 card 1 + 65 = /dev/ttyE65 Stallion port 1 card 1 + ... + 128 = /dev/ttyE128 Stallion port 0 card 2 + 129 = /dev/ttyE129 Stallion port 1 card 2 + ... + 192 = /dev/ttyE192 Stallion port 0 card 3 + 193 = /dev/ttyE193 Stallion port 1 card 3 + ... + 24 block Sony CDU-535 CD-ROM + 0 = /dev/cdu535 Sony CDU-535 CD-ROM + + 25 char Stallion serial card - alternate devices + 0 = /dev/cue0 Callout device for ttyE0 + 1 = /dev/cue1 Callout device for ttyE1 + ... + 64 = /dev/cue64 Callout device for ttyE64 + 65 = /dev/cue65 Callout device for ttyE65 + ... + 128 = /dev/cue128 Callout device for ttyE128 + 129 = /dev/cue129 Callout device for ttyE129 + ... + 192 = /dev/cue192 Callout device for ttyE192 + 193 = /dev/cue193 Callout device for ttyE193 + ... + 25 block First Matsushita (Panasonic/SoundBlaster) CD-ROM + 0 = /dev/sbpcd0 Panasonic CD-ROM controller 0 unit 0 + 1 = /dev/sbpcd1 Panasonic CD-ROM controller 0 unit 1 + 2 = /dev/sbpcd2 Panasonic CD-ROM controller 0 unit 2 + 3 = /dev/sbpcd3 Panasonic CD-ROM controller 0 unit 3 + + 26 char + + 26 block Second Matsushita (Panasonic/SoundBlaster) CD-ROM + 0 = /dev/sbpcd4 Panasonic CD-ROM controller 1 unit 0 + 1 = /dev/sbpcd5 Panasonic CD-ROM controller 1 unit 1 + 2 = /dev/sbpcd6 Panasonic CD-ROM controller 1 unit 2 + 3 = /dev/sbpcd7 Panasonic CD-ROM controller 1 unit 3 + + 27 char QIC-117 tape + 0 = /dev/qft0 Unit 0, rewind-on-close + 1 = /dev/qft1 Unit 1, rewind-on-close + 2 = /dev/qft2 Unit 2, rewind-on-close + 3 = /dev/qft3 Unit 3, rewind-on-close + 4 = /dev/nqft0 Unit 0, no rewind-on-close + 5 = /dev/nqft1 Unit 1, no rewind-on-close + 6 = /dev/nqft2 Unit 2, no rewind-on-close + 7 = /dev/nqft3 Unit 3, no rewind-on-close + 16 = /dev/zqft0 Unit 0, rewind-on-close, compression + 17 = /dev/zqft1 Unit 1, rewind-on-close, compression + 18 = /dev/zqft2 Unit 2, rewind-on-close, compression + 19 = /dev/zqft3 Unit 3, rewind-on-close, compression + 20 = /dev/nzqft0 Unit 0, no rewind-on-close, compression + 21 = /dev/nzqft1 Unit 1, no rewind-on-close, compression + 22 = /dev/nzqft2 Unit 2, no rewind-on-close, compression + 23 = /dev/nzqft3 Unit 3, no rewind-on-close, compression + 32 = /dev/rawqft0 Unit 0, rewind-on-close, no file marks + 33 = /dev/rawqft1 Unit 1, rewind-on-close, no file marks + 34 = /dev/rawqft2 Unit 2, rewind-on-close, no file marks + 35 = /dev/rawqft3 Unit 3, rewind-on-close, no file marks + 36 = /dev/nrawqft0 Unit 0, no rewind-on-close, no file marks + 37 = /dev/nrawqft1 Unit 1, no rewind-on-close, no file marks + 38 = /dev/nrawqft2 Unit 2, no rewind-on-close, no file marks + 39 = /dev/nrawqft3 Unit 3, no rewind-on-close, no file marks + + 27 block Third Matsushita (Panasonic/SoundBlaster) CD-ROM + 0 = /dev/sbpcd8 Panasonic CD-ROM controller 2 unit 0 + 1 = /dev/sbpcd9 Panasonic CD-ROM controller 2 unit 1 + 2 = /dev/sbpcd10 Panasonic CD-ROM controller 2 unit 2 + 3 = /dev/sbpcd11 Panasonic CD-ROM controller 2 unit 3 + + 28 char Stallion serial card - card programming + 0 = /dev/staliomem0 First Stallion card I/O memory + 1 = /dev/staliomem1 Second Stallion card I/O memory + 2 = /dev/staliomem2 Third Stallion card I/O memory + 3 = /dev/staliomem3 Fourth Stallion card I/O memory + + 28 char Atari SLM ACSI laser printer (68k/Atari) + 0 = /dev/slm0 First SLM laser printer + 1 = /dev/slm1 Second SLM laser printer + ... + 28 block Fourth Matsushita (Panasonic/SoundBlaster) CD-ROM + 0 = /dev/sbpcd12 Panasonic CD-ROM controller 3 unit 0 + 1 = /dev/sbpcd13 Panasonic CD-ROM controller 3 unit 1 + 2 = /dev/sbpcd14 Panasonic CD-ROM controller 3 unit 2 + 3 = /dev/sbpcd15 Panasonic CD-ROM controller 3 unit 3 + + 28 block ACSI disk (68k/Atari) + 0 = /dev/ada First ACSI disk whole disk + 16 = /dev/adb Second ACSI disk whole disk + 32 = /dev/adc Third ACSI disk whole disk + ... + 240 = /dev/adp 16th ACSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15, like SCSI. + + 29 char Universal frame buffer + 0 = /dev/fb0 First frame buffer + 1 = /dev/fb1 Second frame buffer + ... + 31 = /dev/fb31 32nd frame buffer + + 29 block Aztech/Orchid/Okano/Wearnes CD-ROM + 0 = /dev/aztcd Aztech CD-ROM + + 30 char iBCS-2 compatibility devices + 0 = /dev/socksys Socket access + 1 = /dev/spx SVR3 local X interface + 32 = /dev/inet/ip Network access + 33 = /dev/inet/icmp + 34 = /dev/inet/ggp + 35 = /dev/inet/ipip + 36 = /dev/inet/tcp + 37 = /dev/inet/egp + 38 = /dev/inet/pup + 39 = /dev/inet/udp + 40 = /dev/inet/idp + 41 = /dev/inet/rawip + + Additionally, iBCS-2 requires the following links: + + /dev/ip -> /dev/inet/ip + /dev/icmp -> /dev/inet/icmp + /dev/ggp -> /dev/inet/ggp + /dev/ipip -> /dev/inet/ipip + /dev/tcp -> /dev/inet/tcp + /dev/egp -> /dev/inet/egp + /dev/pup -> /dev/inet/pup + /dev/udp -> /dev/inet/udp + /dev/idp -> /dev/inet/idp + /dev/rawip -> /dev/inet/rawip + /dev/inet/arp -> /dev/inet/udp + /dev/inet/rip -> /dev/inet/udp + /dev/nfsd -> /dev/socksys + /dev/X0R -> /dev/null (? apparently not required ?) + + 30 block Philips LMS CM-205 CD-ROM + 0 = /dev/cm205cd Philips LMS CM-205 CD-ROM + + /dev/lmscd is an older name for this device. This + driver does not work with the CM-205MS CD-ROM. + + 31 char MPU-401 MIDI + 0 = /dev/mpu401data MPU-401 data port + 1 = /dev/mpu401stat MPU-401 status port + + 31 block ROM/flash memory card + 0 = /dev/rom0 First ROM card (rw) + ... + 7 = /dev/rom7 Eighth ROM card (rw) + 8 = /dev/rrom0 First ROM card (ro) + ... + 15 = /dev/rrom7 Eighth ROM card (ro) + 16 = /dev/flash0 First flash memory card (rw) + ... + 23 = /dev/flash7 Eighth flash memory card (rw) + 24 = /dev/rflash0 First flash memory card (ro) + ... + 31 = /dev/rflash7 Eighth flash memory card (ro) + + The read-write (rw) devices support back-caching + written data in RAM, as well as writing to flash RAM + devices. The read-only devices (ro) support reading + only. + + 32 char Specialix serial card + 0 = /dev/ttyX0 First Specialix port + 1 = /dev/ttyX1 Second Specialix port + ... + 32 block Philips LMS CM-206 CD-ROM + 0 = /dev/cm206cd Philips LMS CM-206 CD-ROM + + 33 char Specialix serial card - alternate devices + 0 = /dev/cux0 Callout device for ttyX0 + 1 = /dev/cux1 Callout device for ttyX1 + ... + 33 block Third IDE hard disk/CD-ROM interface + 0 = /dev/hde Master: whole disk (or CD-ROM) + 64 = /dev/hdf Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 34 char Z8530 HDLC driver + 0 = /dev/scc0 First Z8530, first port + 1 = /dev/scc1 First Z8530, second port + 2 = /dev/scc2 Second Z8530, first port + 3 = /dev/scc3 Second Z8530, second port + ... + + In a previous version these devices were named + /dev/sc1 for /dev/scc0, /dev/sc2 for /dev/scc1, and so + on. + + 34 block Fourth IDE hard disk/CD-ROM interface + 0 = /dev/hdg Master: whole disk (or CD-ROM) + 64 = /dev/hdh Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 35 char tclmidi MIDI driver + 0 = /dev/midi0 First MIDI port, kernel timed + 1 = /dev/midi1 Second MIDI port, kernel timed + 2 = /dev/midi2 Third MIDI port, kernel timed + 3 = /dev/midi3 Fourth MIDI port, kernel timed + 64 = /dev/rmidi0 First MIDI port, untimed + 65 = /dev/rmidi1 Second MIDI port, untimed + 66 = /dev/rmidi2 Third MIDI port, untimed + 67 = /dev/rmidi3 Fourth MIDI port, untimed + 128 = /dev/smpte0 First MIDI port, SMPTE timed + 129 = /dev/smpte1 Second MIDI port, SMPTE timed + 130 = /dev/smpte2 Third MIDI port, SMPTE timed + 131 = /dev/smpte3 Fourth MIDI port, SMPTE timed + + 35 block Slow memory ramdisk + 0 = /dev/slram Slow memory ramdisk + + 36 char Netlink support + 0 = /dev/route Routing, device updates, kernel to user + 1 = /dev/skip enSKIP security cache control + 3 = /dev/fwmonitor Firewall packet copies + 16 = /dev/tap0 First Ethertap device + ... + 31 = /dev/tap15 16th Ethertap device + + 36 block OBSOLETE (was MCA ESDI hard disk) + + 37 char IDE tape + 0 = /dev/ht0 First IDE tape + 1 = /dev/ht1 Second IDE tape + ... + 128 = /dev/nht0 First IDE tape, no rewind-on-close + 129 = /dev/nht1 Second IDE tape, no rewind-on-close + ... + + Currently, only one IDE tape drive is supported. + + 37 block Zorro II ramdisk + 0 = /dev/z2ram Zorro II ramdisk + + 38 char Myricom PCI Myrinet board + 0 = /dev/mlanai0 First Myrinet board + 1 = /dev/mlanai1 Second Myrinet board + ... + + This device is used for status query, board control + and "user level packet I/O." This board is also + accessible as a standard networking "eth" device. + + 38 block OBSOLETE (was Linux/AP+) + + 39 char ML-16P experimental I/O board + 0 = /dev/ml16pa-a0 First card, first analog channel + 1 = /dev/ml16pa-a1 First card, second analog channel + ... + 15 = /dev/ml16pa-a15 First card, 16th analog channel + 16 = /dev/ml16pa-d First card, digital lines + 17 = /dev/ml16pa-c0 First card, first counter/timer + 18 = /dev/ml16pa-c1 First card, second counter/timer + 19 = /dev/ml16pa-c2 First card, third counter/timer + 32 = /dev/ml16pb-a0 Second card, first analog channel + 33 = /dev/ml16pb-a1 Second card, second analog channel + ... + 47 = /dev/ml16pb-a15 Second card, 16th analog channel + 48 = /dev/ml16pb-d Second card, digital lines + 49 = /dev/ml16pb-c0 Second card, first counter/timer + 50 = /dev/ml16pb-c1 Second card, second counter/timer + 51 = /dev/ml16pb-c2 Second card, third counter/timer + ... + 39 block + + 40 char + + 40 block + + 41 char Yet Another Micro Monitor + 0 = /dev/yamm Yet Another Micro Monitor + + 41 block + + 42 char Demo/sample use + + 42 block Demo/sample use + + This number is intended for use in sample code, as + well as a general "example" device number. It + should never be used for a device driver that is being + distributed; either obtain an official number or use + the local/experimental range. The sudden addition or + removal of a driver with this number should not cause + ill effects to the system (bugs excepted.) + + IN PARTICULAR, ANY DISTRIBUTION WHICH CONTAINS A + DEVICE DRIVER USING MAJOR NUMBER 42 IS NONCOMPLIANT. + + 43 char isdn4linux virtual modem + 0 = /dev/ttyI0 First virtual modem + ... + 63 = /dev/ttyI63 64th virtual modem + + 43 block Network block devices + 0 = /dev/nb0 First network block device + 1 = /dev/nb1 Second network block device + ... + + Network Block Device is somehow similar to loopback + devices: If you read from it, it sends packet across + network asking server for data. If you write to it, it + sends packet telling server to write. It could be used + to mounting filesystems over the net, swapping over + the net, implementing block device in userland etc. + + 44 char isdn4linux virtual modem - alternate devices + 0 = /dev/cui0 Callout device for ttyI0 + ... + 63 = /dev/cui63 Callout device for ttyI63 + + 44 block Flash Translation Layer (FTL) filesystems + 0 = /dev/ftla FTL on first Memory Technology Device + 16 = /dev/ftlb FTL on second Memory Technology Device + 32 = /dev/ftlc FTL on third Memory Technology Device + ... + 240 = /dev/ftlp FTL on 16th Memory Technology Device + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the partition + limit is 15 rather than 63 per disk (same as SCSI.) + + 45 char isdn4linux ISDN BRI driver + 0 = /dev/isdn0 First virtual B channel raw data + ... + 63 = /dev/isdn63 64th virtual B channel raw data + 64 = /dev/isdnctrl0 First channel control/debug + ... + 127 = /dev/isdnctrl63 64th channel control/debug + + 128 = /dev/ippp0 First SyncPPP device + ... + 191 = /dev/ippp63 64th SyncPPP device + + 255 = /dev/isdninfo ISDN monitor interface + + 45 block Parallel port IDE disk devices + 0 = /dev/pda First parallel port IDE disk + 16 = /dev/pdb Second parallel port IDE disk + 32 = /dev/pdc Third parallel port IDE disk + 48 = /dev/pdd Fourth parallel port IDE disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the partition + limit is 15 rather than 63 per disk. + + 46 char Comtrol Rocketport serial card + 0 = /dev/ttyR0 First Rocketport port + 1 = /dev/ttyR1 Second Rocketport port + ... + 46 block Parallel port ATAPI CD-ROM devices + 0 = /dev/pcd0 First parallel port ATAPI CD-ROM + 1 = /dev/pcd1 Second parallel port ATAPI CD-ROM + 2 = /dev/pcd2 Third parallel port ATAPI CD-ROM + 3 = /dev/pcd3 Fourth parallel port ATAPI CD-ROM + + 47 char Comtrol Rocketport serial card - alternate devices + 0 = /dev/cur0 Callout device for ttyR0 + 1 = /dev/cur1 Callout device for ttyR1 + ... + 47 block Parallel port ATAPI disk devices + 0 = /dev/pf0 First parallel port ATAPI disk + 1 = /dev/pf1 Second parallel port ATAPI disk + 2 = /dev/pf2 Third parallel port ATAPI disk + 3 = /dev/pf3 Fourth parallel port ATAPI disk + + This driver is intended for floppy disks and similar + devices and hence does not support partitioning. + + 48 char SDL RISCom serial card + 0 = /dev/ttyL0 First RISCom port + 1 = /dev/ttyL1 Second RISCom port + ... + 48 block Mylex DAC960 PCI RAID controller; first controller + 0 = /dev/rd/c0d0 First disk, whole disk + 8 = /dev/rd/c0d1 Second disk, whole disk + ... + 248 = /dev/rd/c0d31 32nd disk, whole disk + + For partitions add: + 0 = /dev/rd/c?d? Whole disk + 1 = /dev/rd/c?d?p1 First partition + ... + 7 = /dev/rd/c?d?p7 Seventh partition + + 49 char SDL RISCom serial card - alternate devices + 0 = /dev/cul0 Callout device for ttyL0 + 1 = /dev/cul1 Callout device for ttyL1 + ... + 49 block Mylex DAC960 PCI RAID controller; second controller + 0 = /dev/rd/c1d0 First disk, whole disk + 8 = /dev/rd/c1d1 Second disk, whole disk + ... + 248 = /dev/rd/c1d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 50 char Reserved for GLINT + + 50 block Mylex DAC960 PCI RAID controller; third controller + 0 = /dev/rd/c2d0 First disk, whole disk + 8 = /dev/rd/c2d1 Second disk, whole disk + ... + 248 = /dev/rd/c2d31 32nd disk, whole disk + + 51 char Baycom radio modem OR Radio Tech BIM-XXX-RS232 radio modem + 0 = /dev/bc0 First Baycom radio modem + 1 = /dev/bc1 Second Baycom radio modem + ... + 51 block Mylex DAC960 PCI RAID controller; fourth controller + 0 = /dev/rd/c3d0 First disk, whole disk + 8 = /dev/rd/c3d1 Second disk, whole disk + ... + 248 = /dev/rd/c3d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 52 char Spellcaster DataComm/BRI ISDN card + 0 = /dev/dcbri0 First DataComm card + 1 = /dev/dcbri1 Second DataComm card + 2 = /dev/dcbri2 Third DataComm card + 3 = /dev/dcbri3 Fourth DataComm card + + 52 block Mylex DAC960 PCI RAID controller; fifth controller + 0 = /dev/rd/c4d0 First disk, whole disk + 8 = /dev/rd/c4d1 Second disk, whole disk + ... + 248 = /dev/rd/c4d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 53 char BDM interface for remote debugging MC683xx microcontrollers + 0 = /dev/pd_bdm0 PD BDM interface on lp0 + 1 = /dev/pd_bdm1 PD BDM interface on lp1 + 2 = /dev/pd_bdm2 PD BDM interface on lp2 + 4 = /dev/icd_bdm0 ICD BDM interface on lp0 + 5 = /dev/icd_bdm1 ICD BDM interface on lp1 + 6 = /dev/icd_bdm2 ICD BDM interface on lp2 + + This device is used for the interfacing to the MC683xx + microcontrollers via Background Debug Mode by use of a + Parallel Port interface. PD is the Motorola Public + Domain Interface and ICD is the commercial interface + by P&E. + + 53 block Mylex DAC960 PCI RAID controller; sixth controller + 0 = /dev/rd/c5d0 First disk, whole disk + 8 = /dev/rd/c5d1 Second disk, whole disk + ... + 248 = /dev/rd/c5d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 54 char Electrocardiognosis Holter serial card + 0 = /dev/holter0 First Holter port + 1 = /dev/holter1 Second Holter port + 2 = /dev/holter2 Third Holter port + + A custom serial card used by Electrocardiognosis SRL + <mseritan@ottonel.pub.ro> to transfer data from Holter + 24-hour heart monitoring equipment. + + 54 block Mylex DAC960 PCI RAID controller; seventh controller + 0 = /dev/rd/c6d0 First disk, whole disk + 8 = /dev/rd/c6d1 Second disk, whole disk + ... + 248 = /dev/rd/c6d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 55 char DSP56001 digital signal processor + 0 = /dev/dsp56k First DSP56001 + + 55 block Mylex DAC960 PCI RAID controller; eighth controller + 0 = /dev/rd/c7d0 First disk, whole disk + 8 = /dev/rd/c7d1 Second disk, whole disk + ... + 248 = /dev/rd/c7d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 56 char Apple Desktop Bus + 0 = /dev/adb ADB bus control + + Additional devices will be added to this number, all + starting with /dev/adb. + + 56 block Fifth IDE hard disk/CD-ROM interface + 0 = /dev/hdi Master: whole disk (or CD-ROM) + 64 = /dev/hdj Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 57 char Hayes ESP serial card + 0 = /dev/ttyP0 First ESP port + 1 = /dev/ttyP1 Second ESP port + ... + + 57 block Sixth IDE hard disk/CD-ROM interface + 0 = /dev/hdk Master: whole disk (or CD-ROM) + 64 = /dev/hdl Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 58 char Hayes ESP serial card - alternate devices + 0 = /dev/cup0 Callout device for ttyP0 + 1 = /dev/cup1 Callout device for ttyP1 + ... + + 58 block Reserved for logical volume manager + + 59 char sf firewall package + 0 = /dev/firewall Communication with sf kernel module + + 59 block Generic PDA filesystem device + 0 = /dev/pda0 First PDA device + 1 = /dev/pda1 Second PDA device + ... + + The pda devices are used to mount filesystems on + remote pda's (basically slow handheld machines with + proprietary OS's and limited memory and storage + running small fs translation drivers) through serial / + IRDA / parallel links. + + NAMING CONFLICT -- PROPOSED REVISED NAME /dev/rpda0 etc + + 60-63 char LOCAL/EXPERIMENTAL USE + + 60-63 block LOCAL/EXPERIMENTAL USE + Allocated for local/experimental use. For devices not + assigned official numbers, these ranges should be + used in order to avoid conflicting with future assignments. + + 64 char ENskip kernel encryption package + 0 = /dev/enskip Communication with ENskip kernel module + + 64 block Scramdisk/DriveCrypt encrypted devices + 0 = /dev/scramdisk/master Master node for ioctls + 1 = /dev/scramdisk/1 First encrypted device + 2 = /dev/scramdisk/2 Second encrypted device + ... + 255 = /dev/scramdisk/255 255th encrypted device + + The filename of the encrypted container and the passwords + are sent via ioctls (using the sdmount tool) to the master + node which then activates them via one of the + /dev/scramdisk/x nodes for loop mounting (all handled + through the sdmount tool). + + Requested by: andy@scramdisklinux.org + + 65 char Sundance "plink" Transputer boards (obsolete, unused) + 0 = /dev/plink0 First plink device + 1 = /dev/plink1 Second plink device + 2 = /dev/plink2 Third plink device + 3 = /dev/plink3 Fourth plink device + 64 = /dev/rplink0 First plink device, raw + 65 = /dev/rplink1 Second plink device, raw + 66 = /dev/rplink2 Third plink device, raw + 67 = /dev/rplink3 Fourth plink device, raw + 128 = /dev/plink0d First plink device, debug + 129 = /dev/plink1d Second plink device, debug + 130 = /dev/plink2d Third plink device, debug + 131 = /dev/plink3d Fourth plink device, debug + 192 = /dev/rplink0d First plink device, raw, debug + 193 = /dev/rplink1d Second plink device, raw, debug + 194 = /dev/rplink2d Third plink device, raw, debug + 195 = /dev/rplink3d Fourth plink device, raw, debug + + This is a commercial driver; contact James Howes + <jth@prosig.demon.co.uk> for information. + + 65 block SCSI disk devices (16-31) + 0 = /dev/sdq 17th SCSI disk whole disk + 16 = /dev/sdr 18th SCSI disk whole disk + 32 = /dev/sds 19th SCSI disk whole disk + ... + 240 = /dev/sdaf 32nd SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 66 char YARC PowerPC PCI coprocessor card + 0 = /dev/yppcpci0 First YARC card + 1 = /dev/yppcpci1 Second YARC card + ... + + 66 block SCSI disk devices (32-47) + 0 = /dev/sdag 33th SCSI disk whole disk + 16 = /dev/sdah 34th SCSI disk whole disk + 32 = /dev/sdai 35th SCSI disk whole disk + ... + 240 = /dev/sdav 48nd SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 67 char Coda network file system + 0 = /dev/cfs0 Coda cache manager + + See http://www.coda.cs.cmu.edu for information about Coda. + + 67 block SCSI disk devices (48-63) + 0 = /dev/sdaw 49th SCSI disk whole disk + 16 = /dev/sdax 50th SCSI disk whole disk + 32 = /dev/sday 51st SCSI disk whole disk + ... + 240 = /dev/sdbl 64th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 68 char CAPI 2.0 interface + 0 = /dev/capi20 Control device + 1 = /dev/capi20.00 First CAPI 2.0 application + 2 = /dev/capi20.01 Second CAPI 2.0 application + ... + 20 = /dev/capi20.19 19th CAPI 2.0 application + + ISDN CAPI 2.0 driver for use with CAPI 2.0 + applications; currently supports the AVM B1 card. + + 68 block SCSI disk devices (64-79) + 0 = /dev/sdbm 65th SCSI disk whole disk + 16 = /dev/sdbn 66th SCSI disk whole disk + 32 = /dev/sdbo 67th SCSI disk whole disk + ... + 240 = /dev/sdcb 80th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 69 char MA16 numeric accelerator card + 0 = /dev/ma16 Board memory access + + 69 block SCSI disk devices (80-95) + 0 = /dev/sdcc 81st SCSI disk whole disk + 16 = /dev/sdcd 82nd SCSI disk whole disk + 32 = /dev/sdce 83th SCSI disk whole disk + ... + 240 = /dev/sdcr 96th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 70 char SpellCaster Protocol Services Interface + 0 = /dev/apscfg Configuration interface + 1 = /dev/apsauth Authentication interface + 2 = /dev/apslog Logging interface + 3 = /dev/apsdbg Debugging interface + 64 = /dev/apsisdn ISDN command interface + 65 = /dev/apsasync Async command interface + 128 = /dev/apsmon Monitor interface + + 70 block SCSI disk devices (96-111) + 0 = /dev/sdcs 97th SCSI disk whole disk + 16 = /dev/sdct 98th SCSI disk whole disk + 32 = /dev/sdcu 99th SCSI disk whole disk + ... + 240 = /dev/sddh 112nd SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 71 char Computone IntelliPort II serial card + 0 = /dev/ttyF0 IntelliPort II board 0, port 0 + 1 = /dev/ttyF1 IntelliPort II board 0, port 1 + ... + 63 = /dev/ttyF63 IntelliPort II board 0, port 63 + 64 = /dev/ttyF64 IntelliPort II board 1, port 0 + 65 = /dev/ttyF65 IntelliPort II board 1, port 1 + ... + 127 = /dev/ttyF127 IntelliPort II board 1, port 63 + 128 = /dev/ttyF128 IntelliPort II board 2, port 0 + 129 = /dev/ttyF129 IntelliPort II board 2, port 1 + ... + 191 = /dev/ttyF191 IntelliPort II board 2, port 63 + 192 = /dev/ttyF192 IntelliPort II board 3, port 0 + 193 = /dev/ttyF193 IntelliPort II board 3, port 1 + ... + 255 = /dev/ttyF255 IntelliPort II board 3, port 63 + + 71 block SCSI disk devices (112-127) + 0 = /dev/sddi 113th SCSI disk whole disk + 16 = /dev/sddj 114th SCSI disk whole disk + 32 = /dev/sddk 115th SCSI disk whole disk + ... + 240 = /dev/sddx 128th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 72 char Computone IntelliPort II serial card - alternate devices + 0 = /dev/cuf0 Callout device for ttyF0 + 1 = /dev/cuf1 Callout device for ttyF1 + ... + 63 = /dev/cuf63 Callout device for ttyF63 + 64 = /dev/cuf64 Callout device for ttyF64 + 65 = /dev/cuf65 Callout device for ttyF65 + ... + 127 = /dev/cuf127 Callout device for ttyF127 + 128 = /dev/cuf128 Callout device for ttyF128 + 129 = /dev/cuf129 Callout device for ttyF129 + ... + 191 = /dev/cuf191 Callout device for ttyF191 + 192 = /dev/cuf192 Callout device for ttyF192 + 193 = /dev/cuf193 Callout device for ttyF193 + ... + 255 = /dev/cuf255 Callout device for ttyF255 + + 72 block Compaq Intelligent Drive Array, first controller + 0 = /dev/ida/c0d0 First logical drive whole disk + 16 = /dev/ida/c0d1 Second logical drive whole disk + ... + 240 = /dev/ida/c0d15 16th logical drive whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 73 char Computone IntelliPort II serial card - control devices + 0 = /dev/ip2ipl0 Loadware device for board 0 + 1 = /dev/ip2stat0 Status device for board 0 + 4 = /dev/ip2ipl1 Loadware device for board 1 + 5 = /dev/ip2stat1 Status device for board 1 + 8 = /dev/ip2ipl2 Loadware device for board 2 + 9 = /dev/ip2stat2 Status device for board 2 + 12 = /dev/ip2ipl3 Loadware device for board 3 + 13 = /dev/ip2stat3 Status device for board 3 + + 73 block Compaq Intelligent Drive Array, second controller + 0 = /dev/ida/c1d0 First logical drive whole disk + 16 = /dev/ida/c1d1 Second logical drive whole disk + ... + 240 = /dev/ida/c1d15 16th logical drive whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 74 char SCI bridge + 0 = /dev/SCI/0 SCI device 0 + 1 = /dev/SCI/1 SCI device 1 + ... + + Currently for Dolphin Interconnect Solutions' PCI-SCI + bridge. + + 74 block Compaq Intelligent Drive Array, third controller + 0 = /dev/ida/c2d0 First logical drive whole disk + 16 = /dev/ida/c2d1 Second logical drive whole disk + ... + 240 = /dev/ida/c2d15 16th logical drive whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 75 char Specialix IO8+ serial card + 0 = /dev/ttyW0 First IO8+ port, first card + 1 = /dev/ttyW1 Second IO8+ port, first card + ... + 8 = /dev/ttyW8 First IO8+ port, second card + ... + + 75 block Compaq Intelligent Drive Array, fourth controller + 0 = /dev/ida/c3d0 First logical drive whole disk + 16 = /dev/ida/c3d1 Second logical drive whole disk + ... + 240 = /dev/ida/c3d15 16th logical drive whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 76 char Specialix IO8+ serial card - alternate devices + 0 = /dev/cuw0 Callout device for ttyW0 + 1 = /dev/cuw1 Callout device for ttyW1 + ... + 8 = /dev/cuw8 Callout device for ttyW8 + ... + + 76 block Compaq Intelligent Drive Array, fifth controller + 0 = /dev/ida/c4d0 First logical drive whole disk + 16 = /dev/ida/c4d1 Second logical drive whole disk + ... + 240 = /dev/ida/c4d15 16th logical drive whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + + 77 char ComScire Quantum Noise Generator + 0 = /dev/qng ComScire Quantum Noise Generator + + 77 block Compaq Intelligent Drive Array, sixth controller + 0 = /dev/ida/c5d0 First logical drive whole disk + 16 = /dev/ida/c5d1 Second logical drive whole disk + ... + 240 = /dev/ida/c5d15 16th logical drive whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 78 char PAM Software's multimodem boards + 0 = /dev/ttyM0 First PAM modem + 1 = /dev/ttyM1 Second PAM modem + ... + + 78 block Compaq Intelligent Drive Array, seventh controller + 0 = /dev/ida/c6d0 First logical drive whole disk + 16 = /dev/ida/c6d1 Second logical drive whole disk + ... + 240 = /dev/ida/c6d15 16th logical drive whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 79 char PAM Software's multimodem boards - alternate devices + 0 = /dev/cum0 Callout device for ttyM0 + 1 = /dev/cum1 Callout device for ttyM1 + ... + + 79 block Compaq Intelligent Drive Array, eighth controller + 0 = /dev/ida/c7d0 First logical drive whole disk + 16 = /dev/ida/c7d1 Second logical drive whole disk + ... + 240 = /dev/ida/c715 16th logical drive whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 80 char Photometrics AT200 CCD camera + 0 = /dev/at200 Photometrics AT200 CCD camera + + 80 block I2O hard disk + 0 = /dev/i2o/hda First I2O hard disk, whole disk + 16 = /dev/i2o/hdb Second I2O hard disk, whole disk + ... + 240 = /dev/i2o/hdp 16th I2O hard disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 81 char video4linux + 0 = /dev/video0 Video capture/overlay device + ... + 63 = /dev/video63 Video capture/overlay device + 64 = /dev/radio0 Radio device + ... + 127 = /dev/radio63 Radio device + 128 = /dev/swradio0 Software Defined Radio device + ... + 191 = /dev/swradio63 Software Defined Radio device + 224 = /dev/vbi0 Vertical blank interrupt + ... + 255 = /dev/vbi31 Vertical blank interrupt + + Minor numbers are allocated dynamically unless + CONFIG_VIDEO_FIXED_MINOR_RANGES (default n) + configuration option is set. + + 81 block I2O hard disk + 0 = /dev/i2o/hdq 17th I2O hard disk, whole disk + 16 = /dev/i2o/hdr 18th I2O hard disk, whole disk + ... + 240 = /dev/i2o/hdaf 32nd I2O hard disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 82 char WiNRADiO communications receiver card + 0 = /dev/winradio0 First WiNRADiO card + 1 = /dev/winradio1 Second WiNRADiO card + ... + + The driver and documentation may be obtained from + http://www.winradio.com/ + + 82 block I2O hard disk + 0 = /dev/i2o/hdag 33rd I2O hard disk, whole disk + 16 = /dev/i2o/hdah 34th I2O hard disk, whole disk + ... + 240 = /dev/i2o/hdav 48th I2O hard disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 83 char Matrox mga_vid video driver + 0 = /dev/mga_vid0 1st video card + 1 = /dev/mga_vid1 2nd video card + 2 = /dev/mga_vid2 3rd video card + ... + 15 = /dev/mga_vid15 16th video card + + 83 block I2O hard disk + 0 = /dev/i2o/hdaw 49th I2O hard disk, whole disk + 16 = /dev/i2o/hdax 50th I2O hard disk, whole disk + ... + 240 = /dev/i2o/hdbl 64th I2O hard disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 84 char Ikon 1011[57] Versatec Greensheet Interface + 0 = /dev/ihcp0 First Greensheet port + 1 = /dev/ihcp1 Second Greensheet port + + 84 block I2O hard disk + 0 = /dev/i2o/hdbm 65th I2O hard disk, whole disk + 16 = /dev/i2o/hdbn 66th I2O hard disk, whole disk + ... + 240 = /dev/i2o/hdcb 80th I2O hard disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 85 char Linux/SGI shared memory input queue + 0 = /dev/shmiq Master shared input queue + 1 = /dev/qcntl0 First device pushed + 2 = /dev/qcntl1 Second device pushed + ... + + 85 block I2O hard disk + 0 = /dev/i2o/hdcc 81st I2O hard disk, whole disk + 16 = /dev/i2o/hdcd 82nd I2O hard disk, whole disk + ... + 240 = /dev/i2o/hdcr 96th I2O hard disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 86 char SCSI media changer + 0 = /dev/sch0 First SCSI media changer + 1 = /dev/sch1 Second SCSI media changer + ... + + 86 block I2O hard disk + 0 = /dev/i2o/hdcs 97th I2O hard disk, whole disk + 16 = /dev/i2o/hdct 98th I2O hard disk, whole disk + ... + 240 = /dev/i2o/hddh 112th I2O hard disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 87 char Sony Control-A1 stereo control bus + 0 = /dev/controla0 First device on chain + 1 = /dev/controla1 Second device on chain + ... + + 87 block I2O hard disk + 0 = /dev/i2o/hddi 113rd I2O hard disk, whole disk + 16 = /dev/i2o/hddj 114th I2O hard disk, whole disk + ... + 240 = /dev/i2o/hddx 128th I2O hard disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 88 char COMX synchronous serial card + 0 = /dev/comx0 COMX channel 0 + 1 = /dev/comx1 COMX channel 1 + ... + + 88 block Seventh IDE hard disk/CD-ROM interface + 0 = /dev/hdm Master: whole disk (or CD-ROM) + 64 = /dev/hdn Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 89 char I2C bus interface + 0 = /dev/i2c-0 First I2C adapter + 1 = /dev/i2c-1 Second I2C adapter + ... + + 89 block Eighth IDE hard disk/CD-ROM interface + 0 = /dev/hdo Master: whole disk (or CD-ROM) + 64 = /dev/hdp Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 90 char Memory Technology Device (RAM, ROM, Flash) + 0 = /dev/mtd0 First MTD (rw) + 1 = /dev/mtdr0 First MTD (ro) + ... + 30 = /dev/mtd15 16th MTD (rw) + 31 = /dev/mtdr15 16th MTD (ro) + + 90 block Ninth IDE hard disk/CD-ROM interface + 0 = /dev/hdq Master: whole disk (or CD-ROM) + 64 = /dev/hdr Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 91 char CAN-Bus devices + 0 = /dev/can0 First CAN-Bus controller + 1 = /dev/can1 Second CAN-Bus controller + ... + + 91 block Tenth IDE hard disk/CD-ROM interface + 0 = /dev/hds Master: whole disk (or CD-ROM) + 64 = /dev/hdt Slave: whole disk (or CD-ROM) + + Partitions are handled the same way as for the first + interface (see major number 3). + + 92 char Reserved for ith Kommunikationstechnik MIC ISDN card + + 92 block PPDD encrypted disk driver + 0 = /dev/ppdd0 First encrypted disk + 1 = /dev/ppdd1 Second encrypted disk + ... + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 93 char + + 93 block NAND Flash Translation Layer filesystem + 0 = /dev/nftla First NFTL layer + 16 = /dev/nftlb Second NFTL layer + ... + 240 = /dev/nftlp 16th NTFL layer + + 94 char + + 94 block IBM S/390 DASD block storage + 0 = /dev/dasda First DASD device, major + 1 = /dev/dasda1 First DASD device, block 1 + 2 = /dev/dasda2 First DASD device, block 2 + 3 = /dev/dasda3 First DASD device, block 3 + 4 = /dev/dasdb Second DASD device, major + 5 = /dev/dasdb1 Second DASD device, block 1 + 6 = /dev/dasdb2 Second DASD device, block 2 + 7 = /dev/dasdb3 Second DASD device, block 3 + ... + + 95 char IP filter + 0 = /dev/ipl Filter control device/log file + 1 = /dev/ipnat NAT control device/log file + 2 = /dev/ipstate State information log file + 3 = /dev/ipauth Authentication control device/log file + ... + + 96 char Parallel port ATAPI tape devices + 0 = /dev/pt0 First parallel port ATAPI tape + 1 = /dev/pt1 Second parallel port ATAPI tape + ... + 128 = /dev/npt0 First p.p. ATAPI tape, no rewind + 129 = /dev/npt1 Second p.p. ATAPI tape, no rewind + ... + + 96 block Inverse NAND Flash Translation Layer + 0 = /dev/inftla First INFTL layer + 16 = /dev/inftlb Second INFTL layer + ... + 240 = /dev/inftlp 16th INTFL layer + + 97 char Parallel port generic ATAPI interface + 0 = /dev/pg0 First parallel port ATAPI device + 1 = /dev/pg1 Second parallel port ATAPI device + 2 = /dev/pg2 Third parallel port ATAPI device + 3 = /dev/pg3 Fourth parallel port ATAPI device + + These devices support the same API as the generic SCSI + devices. + + 98 char Control and Measurement Device (comedi) + 0 = /dev/comedi0 First comedi device + 1 = /dev/comedi1 Second comedi device + ... + + See http://stm.lbl.gov/comedi. + + 98 block User-mode virtual block device + 0 = /dev/ubda First user-mode block device + 16 = /dev/udbb Second user-mode block device + ... + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + This device is used by the user-mode virtual kernel port. + + 99 char Raw parallel ports + 0 = /dev/parport0 First parallel port + 1 = /dev/parport1 Second parallel port + ... + + 99 block JavaStation flash disk + 0 = /dev/jsfd JavaStation flash disk + + 100 char Telephony for Linux + 0 = /dev/phone0 First telephony device + 1 = /dev/phone1 Second telephony device + ... + + 101 char Motorola DSP 56xxx board + 0 = /dev/mdspstat Status information + 1 = /dev/mdsp1 First DSP board I/O controls + ... + 16 = /dev/mdsp16 16th DSP board I/O controls + + 101 block AMI HyperDisk RAID controller + 0 = /dev/amiraid/ar0 First array whole disk + 16 = /dev/amiraid/ar1 Second array whole disk + ... + 240 = /dev/amiraid/ar15 16th array whole disk + + For each device, partitions are added as: + 0 = /dev/amiraid/ar? Whole disk + 1 = /dev/amiraid/ar?p1 First partition + 2 = /dev/amiraid/ar?p2 Second partition + ... + 15 = /dev/amiraid/ar?p15 15th partition + + 102 char + + 102 block Compressed block device + 0 = /dev/cbd/a First compressed block device, whole device + 16 = /dev/cbd/b Second compressed block device, whole device + ... + 240 = /dev/cbd/p 16th compressed block device, whole device + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 103 char Arla network file system + 0 = /dev/nnpfs0 First NNPFS device + 1 = /dev/nnpfs1 Second NNPFS device + + Arla is a free clone of the Andrew File System, AFS. + The NNPFS device gives user mode filesystem + implementations a kernel presence for caching and easy + mounting. For more information about the project, + write to <arla-drinkers@stacken.kth.se> or see + http://www.stacken.kth.se/project/arla/ + + 103 block Audit device + 0 = /dev/audit Audit device + + 104 char Flash BIOS support + + 104 block Compaq Next Generation Drive Array, first controller + 0 = /dev/cciss/c0d0 First logical drive, whole disk + 16 = /dev/cciss/c0d1 Second logical drive, whole disk + ... + 240 = /dev/cciss/c0d15 16th logical drive, whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 105 char Comtrol VS-1000 serial controller + 0 = /dev/ttyV0 First VS-1000 port + 1 = /dev/ttyV1 Second VS-1000 port + ... + + 105 block Compaq Next Generation Drive Array, second controller + 0 = /dev/cciss/c1d0 First logical drive, whole disk + 16 = /dev/cciss/c1d1 Second logical drive, whole disk + ... + 240 = /dev/cciss/c1d15 16th logical drive, whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 106 char Comtrol VS-1000 serial controller - alternate devices + 0 = /dev/cuv0 First VS-1000 port + 1 = /dev/cuv1 Second VS-1000 port + ... + + 106 block Compaq Next Generation Drive Array, third controller + 0 = /dev/cciss/c2d0 First logical drive, whole disk + 16 = /dev/cciss/c2d1 Second logical drive, whole disk + ... + 240 = /dev/cciss/c2d15 16th logical drive, whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 107 char 3Dfx Voodoo Graphics device + 0 = /dev/3dfx Primary 3Dfx graphics device + + 107 block Compaq Next Generation Drive Array, fourth controller + 0 = /dev/cciss/c3d0 First logical drive, whole disk + 16 = /dev/cciss/c3d1 Second logical drive, whole disk + ... + 240 = /dev/cciss/c3d15 16th logical drive, whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 108 char Device independent PPP interface + 0 = /dev/ppp Device independent PPP interface + + 108 block Compaq Next Generation Drive Array, fifth controller + 0 = /dev/cciss/c4d0 First logical drive, whole disk + 16 = /dev/cciss/c4d1 Second logical drive, whole disk + ... + 240 = /dev/cciss/c4d15 16th logical drive, whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 109 char Reserved for logical volume manager + + 109 block Compaq Next Generation Drive Array, sixth controller + 0 = /dev/cciss/c5d0 First logical drive, whole disk + 16 = /dev/cciss/c5d1 Second logical drive, whole disk + ... + 240 = /dev/cciss/c5d15 16th logical drive, whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 110 char miroMEDIA Surround board + 0 = /dev/srnd0 First miroMEDIA Surround board + 1 = /dev/srnd1 Second miroMEDIA Surround board + ... + + 110 block Compaq Next Generation Drive Array, seventh controller + 0 = /dev/cciss/c6d0 First logical drive, whole disk + 16 = /dev/cciss/c6d1 Second logical drive, whole disk + ... + 240 = /dev/cciss/c6d15 16th logical drive, whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 111 char + + 111 block Compaq Next Generation Drive Array, eighth controller + 0 = /dev/cciss/c7d0 First logical drive, whole disk + 16 = /dev/cciss/c7d1 Second logical drive, whole disk + ... + 240 = /dev/cciss/c7d15 16th logical drive, whole disk + + Partitions are handled the same way as for Mylex + DAC960 (see major number 48) except that the limit on + partitions is 15. + + 112 char ISI serial card + 0 = /dev/ttyM0 First ISI port + 1 = /dev/ttyM1 Second ISI port + ... + + There is currently a device-naming conflict between + these and PAM multimodems (major 78). + + 112 block IBM iSeries virtual disk + 0 = /dev/iseries/vda First virtual disk, whole disk + 8 = /dev/iseries/vdb Second virtual disk, whole disk + ... + 200 = /dev/iseries/vdz 26th virtual disk, whole disk + 208 = /dev/iseries/vdaa 27th virtual disk, whole disk + ... + 248 = /dev/iseries/vdaf 32nd virtual disk, whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 7. + + 113 char ISI serial card - alternate devices + 0 = /dev/cum0 Callout device for ttyM0 + 1 = /dev/cum1 Callout device for ttyM1 + ... + + 113 block IBM iSeries virtual CD-ROM + 0 = /dev/iseries/vcda First virtual CD-ROM + 1 = /dev/iseries/vcdb Second virtual CD-ROM + ... + + 114 char Picture Elements ISE board + 0 = /dev/ise0 First ISE board + 1 = /dev/ise1 Second ISE board + ... + 128 = /dev/isex0 Control node for first ISE board + 129 = /dev/isex1 Control node for second ISE board + ... + + The ISE board is an embedded computer, optimized for + image processing. The /dev/iseN nodes are the general + I/O access to the board, the /dev/isex0 nodes command + nodes used to control the board. + + 114 block IDE BIOS powered software RAID interfaces such as the + Promise Fastrak + + 0 = /dev/ataraid/d0 + 1 = /dev/ataraid/d0p1 + 2 = /dev/ataraid/d0p2 + ... + 16 = /dev/ataraid/d1 + 17 = /dev/ataraid/d1p1 + 18 = /dev/ataraid/d1p2 + ... + 255 = /dev/ataraid/d15p15 + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 115 char TI link cable devices (115 was formerly the console driver speaker) + 0 = /dev/tipar0 Parallel cable on first parallel port + ... + 7 = /dev/tipar7 Parallel cable on seventh parallel port + + 8 = /dev/tiser0 Serial cable on first serial port + ... + 15 = /dev/tiser7 Serial cable on seventh serial port + + 16 = /dev/tiusb0 First USB cable + ... + 47 = /dev/tiusb31 32nd USB cable + + 115 block NetWare (NWFS) Devices (0-255) + + The NWFS (NetWare) devices are used to present a + collection of NetWare Mirror Groups or NetWare + Partitions as a logical storage segment for + use in mounting NetWare volumes. A maximum of + 256 NetWare volumes can be supported in a single + machine. + + http://cgfa.telepac.pt/ftp2/kernel.org/linux/kernel/people/jmerkey/nwfs/ + + 0 = /dev/nwfs/v0 First NetWare (NWFS) Logical Volume + 1 = /dev/nwfs/v1 Second NetWare (NWFS) Logical Volume + 2 = /dev/nwfs/v2 Third NetWare (NWFS) Logical Volume + ... + 255 = /dev/nwfs/v255 Last NetWare (NWFS) Logical Volume + + 116 char Advanced Linux Sound Driver (ALSA) + + 116 block MicroMemory battery backed RAM adapter (NVRAM) + Supports 16 boards, 15 partitions each. + Requested by neilb at cse.unsw.edu.au. + + 0 = /dev/umem/d0 Whole of first board + 1 = /dev/umem/d0p1 First partition of first board + 2 = /dev/umem/d0p2 Second partition of first board + 15 = /dev/umem/d0p15 15th partition of first board + + 16 = /dev/umem/d1 Whole of second board + 17 = /dev/umem/d1p1 First partition of second board + ... + 255= /dev/umem/d15p15 15th partition of 16th board. + + 117 char COSA/SRP synchronous serial card + 0 = /dev/cosa0c0 1st board, 1st channel + 1 = /dev/cosa0c1 1st board, 2nd channel + ... + 16 = /dev/cosa1c0 2nd board, 1st channel + 17 = /dev/cosa1c1 2nd board, 2nd channel + ... + + 117 block Enterprise Volume Management System (EVMS) + + The EVMS driver uses a layered, plug-in model to provide + unparalleled flexibility and extensibility in managing + storage. This allows for easy expansion or customization + of various levels of volume management. Requested by + Mark Peloquin (peloquin at us.ibm.com). + + Note: EVMS populates and manages all the devnodes in + /dev/evms. + + http://sf.net/projects/evms + + 0 = /dev/evms/block_device EVMS block device + 1 = /dev/evms/legacyname1 First EVMS legacy device + 2 = /dev/evms/legacyname2 Second EVMS legacy device + ... + Both ranges can grow (down or up) until they meet. + ... + 254 = /dev/evms/EVMSname2 Second EVMS native device + 255 = /dev/evms/EVMSname1 First EVMS native device + + Note: legacyname(s) are derived from the normal legacy + device names. For example, /dev/hda5 would become + /dev/evms/hda5. + + 118 char IBM Cryptographic Accelerator + 0 = /dev/ica Virtual interface to all IBM Crypto Accelerators + 1 = /dev/ica0 IBMCA Device 0 + 2 = /dev/ica1 IBMCA Device 1 + ... + + 119 char VMware virtual network control + 0 = /dev/vnet0 1st virtual network + 1 = /dev/vnet1 2nd virtual network + ... + + 120-127 char LOCAL/EXPERIMENTAL USE + + 120-127 block LOCAL/EXPERIMENTAL USE + Allocated for local/experimental use. For devices not + assigned official numbers, these ranges should be + used in order to avoid conflicting with future assignments. + + 128-135 char Unix98 PTY masters + + These devices should not have corresponding device + nodes; instead they should be accessed through the + /dev/ptmx cloning interface. + + 128 block SCSI disk devices (128-143) + 0 = /dev/sddy 129th SCSI disk whole disk + 16 = /dev/sddz 130th SCSI disk whole disk + 32 = /dev/sdea 131th SCSI disk whole disk + ... + 240 = /dev/sden 144th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 129 block SCSI disk devices (144-159) + 0 = /dev/sdeo 145th SCSI disk whole disk + 16 = /dev/sdep 146th SCSI disk whole disk + 32 = /dev/sdeq 147th SCSI disk whole disk + ... + 240 = /dev/sdfd 160th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 130 char (Misc devices) + + 130 block SCSI disk devices (160-175) + 0 = /dev/sdfe 161st SCSI disk whole disk + 16 = /dev/sdff 162nd SCSI disk whole disk + 32 = /dev/sdfg 163rd SCSI disk whole disk + ... + 240 = /dev/sdft 176th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 131 block SCSI disk devices (176-191) + 0 = /dev/sdfu 177th SCSI disk whole disk + 16 = /dev/sdfv 178th SCSI disk whole disk + 32 = /dev/sdfw 179th SCSI disk whole disk + ... + 240 = /dev/sdgj 192nd SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 132 block SCSI disk devices (192-207) + 0 = /dev/sdgk 193rd SCSI disk whole disk + 16 = /dev/sdgl 194th SCSI disk whole disk + 32 = /dev/sdgm 195th SCSI disk whole disk + ... + 240 = /dev/sdgz 208th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 133 block SCSI disk devices (208-223) + 0 = /dev/sdha 209th SCSI disk whole disk + 16 = /dev/sdhb 210th SCSI disk whole disk + 32 = /dev/sdhc 211th SCSI disk whole disk + ... + 240 = /dev/sdhp 224th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 134 block SCSI disk devices (224-239) + 0 = /dev/sdhq 225th SCSI disk whole disk + 16 = /dev/sdhr 226th SCSI disk whole disk + 32 = /dev/sdhs 227th SCSI disk whole disk + ... + 240 = /dev/sdif 240th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 135 block SCSI disk devices (240-255) + 0 = /dev/sdig 241st SCSI disk whole disk + 16 = /dev/sdih 242nd SCSI disk whole disk + 32 = /dev/sdih 243rd SCSI disk whole disk + ... + 240 = /dev/sdiv 256th SCSI disk whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 136-143 char Unix98 PTY slaves + 0 = /dev/pts/0 First Unix98 pseudo-TTY + 1 = /dev/pts/1 Second Unix98 pseudo-TTY + ... + + These device nodes are automatically generated with + the proper permissions and modes by mounting the + devpts filesystem onto /dev/pts with the appropriate + mount options (distribution dependent, however, on + *most* distributions the appropriate options are + "mode=0620,gid=<gid of the "tty" group>".) + + 136 block Mylex DAC960 PCI RAID controller; ninth controller + 0 = /dev/rd/c8d0 First disk, whole disk + 8 = /dev/rd/c8d1 Second disk, whole disk + ... + 248 = /dev/rd/c8d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 137 block Mylex DAC960 PCI RAID controller; tenth controller + 0 = /dev/rd/c9d0 First disk, whole disk + 8 = /dev/rd/c9d1 Second disk, whole disk + ... + 248 = /dev/rd/c9d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 138 block Mylex DAC960 PCI RAID controller; eleventh controller + 0 = /dev/rd/c10d0 First disk, whole disk + 8 = /dev/rd/c10d1 Second disk, whole disk + ... + 248 = /dev/rd/c10d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 139 block Mylex DAC960 PCI RAID controller; twelfth controller + 0 = /dev/rd/c11d0 First disk, whole disk + 8 = /dev/rd/c11d1 Second disk, whole disk + ... + 248 = /dev/rd/c11d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 140 block Mylex DAC960 PCI RAID controller; thirteenth controller + 0 = /dev/rd/c12d0 First disk, whole disk + 8 = /dev/rd/c12d1 Second disk, whole disk + ... + 248 = /dev/rd/c12d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 141 block Mylex DAC960 PCI RAID controller; fourteenth controller + 0 = /dev/rd/c13d0 First disk, whole disk + 8 = /dev/rd/c13d1 Second disk, whole disk + ... + 248 = /dev/rd/c13d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 142 block Mylex DAC960 PCI RAID controller; fifteenth controller + 0 = /dev/rd/c14d0 First disk, whole disk + 8 = /dev/rd/c14d1 Second disk, whole disk + ... + 248 = /dev/rd/c14d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 143 block Mylex DAC960 PCI RAID controller; sixteenth controller + 0 = /dev/rd/c15d0 First disk, whole disk + 8 = /dev/rd/c15d1 Second disk, whole disk + ... + 248 = /dev/rd/c15d31 32nd disk, whole disk + + Partitions are handled as for major 48. + + 144 char Encapsulated PPP + 0 = /dev/pppox0 First PPP over Ethernet + ... + 63 = /dev/pppox63 64th PPP over Ethernet + + This is primarily used for ADSL. + + The SST 5136-DN DeviceNet interface driver has been + relocated to major 183 due to an unfortunate conflict. + + 144 block Expansion Area #1 for more non-device (e.g. NFS) mounts + 0 = mounted device 256 + 255 = mounted device 511 + + 145 char SAM9407-based soundcard + 0 = /dev/sam0_mixer + 1 = /dev/sam0_sequencer + 2 = /dev/sam0_midi00 + 3 = /dev/sam0_dsp + 4 = /dev/sam0_audio + 6 = /dev/sam0_sndstat + 18 = /dev/sam0_midi01 + 34 = /dev/sam0_midi02 + 50 = /dev/sam0_midi03 + 64 = /dev/sam1_mixer + ... + 128 = /dev/sam2_mixer + ... + 192 = /dev/sam3_mixer + ... + + Device functions match OSS, but offer a number of + addons, which are sam9407 specific. OSS can be + operated simultaneously, taking care of the codec. + + 145 block Expansion Area #2 for more non-device (e.g. NFS) mounts + 0 = mounted device 512 + 255 = mounted device 767 + + 146 char SYSTRAM SCRAMNet mirrored-memory network + 0 = /dev/scramnet0 First SCRAMNet device + 1 = /dev/scramnet1 Second SCRAMNet device + ... + + 146 block Expansion Area #3 for more non-device (e.g. NFS) mounts + 0 = mounted device 768 + 255 = mounted device 1023 + + 147 char Aureal Semiconductor Vortex Audio device + 0 = /dev/aureal0 First Aureal Vortex + 1 = /dev/aureal1 Second Aureal Vortex + ... + + 147 block Distributed Replicated Block Device (DRBD) + 0 = /dev/drbd0 First DRBD device + 1 = /dev/drbd1 Second DRBD device + ... + + 148 char Technology Concepts serial card + 0 = /dev/ttyT0 First TCL port + 1 = /dev/ttyT1 Second TCL port + ... + + 149 char Technology Concepts serial card - alternate devices + 0 = /dev/cut0 Callout device for ttyT0 + 1 = /dev/cut0 Callout device for ttyT1 + ... + + 150 char Real-Time Linux FIFOs + 0 = /dev/rtf0 First RTLinux FIFO + 1 = /dev/rtf1 Second RTLinux FIFO + ... + + 151 char DPT I2O SmartRaid V controller + 0 = /dev/dpti0 First DPT I2O adapter + 1 = /dev/dpti1 Second DPT I2O adapter + ... + + 152 char EtherDrive Control Device + 0 = /dev/etherd/ctl Connect/Disconnect an EtherDrive + 1 = /dev/etherd/err Monitor errors + 2 = /dev/etherd/raw Raw AoE packet monitor + + 152 block EtherDrive Block Devices + 0 = /dev/etherd/0 EtherDrive 0 + ... + 255 = /dev/etherd/255 EtherDrive 255 + + 153 char SPI Bus Interface (sometimes referred to as MicroWire) + 0 = /dev/spi0 First SPI device on the bus + 1 = /dev/spi1 Second SPI device on the bus + ... + 15 = /dev/spi15 Sixteenth SPI device on the bus + + 153 block Enhanced Metadisk RAID (EMD) storage units + 0 = /dev/emd/0 First unit + 1 = /dev/emd/0p1 Partition 1 on First unit + 2 = /dev/emd/0p2 Partition 2 on First unit + ... + 15 = /dev/emd/0p15 Partition 15 on First unit + + 16 = /dev/emd/1 Second unit + 32 = /dev/emd/2 Third unit + ... + 240 = /dev/emd/15 Sixteenth unit + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 154 char Specialix RIO serial card + 0 = /dev/ttySR0 First RIO port + ... + 255 = /dev/ttySR255 256th RIO port + + 155 char Specialix RIO serial card - alternate devices + 0 = /dev/cusr0 Callout device for ttySR0 + ... + 255 = /dev/cusr255 Callout device for ttySR255 + + 156 char Specialix RIO serial card + 0 = /dev/ttySR256 257th RIO port + ... + 255 = /dev/ttySR511 512th RIO port + + 157 char Specialix RIO serial card - alternate devices + 0 = /dev/cusr256 Callout device for ttySR256 + ... + 255 = /dev/cusr511 Callout device for ttySR511 + + 158 char Dialogic GammaLink fax driver + 0 = /dev/gfax0 GammaLink channel 0 + 1 = /dev/gfax1 GammaLink channel 1 + ... + + 159 char RESERVED + + 159 block RESERVED + + 160 char General Purpose Instrument Bus (GPIB) + 0 = /dev/gpib0 First GPIB bus + 1 = /dev/gpib1 Second GPIB bus + ... + + 160 block Carmel 8-port SATA Disks on First Controller + 0 = /dev/carmel/0 SATA disk 0 whole disk + 1 = /dev/carmel/0p1 SATA disk 0 partition 1 + ... + 31 = /dev/carmel/0p31 SATA disk 0 partition 31 + + 32 = /dev/carmel/1 SATA disk 1 whole disk + 64 = /dev/carmel/2 SATA disk 2 whole disk + ... + 224 = /dev/carmel/7 SATA disk 7 whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 31. + + 161 char IrCOMM devices (IrDA serial/parallel emulation) + 0 = /dev/ircomm0 First IrCOMM device + 1 = /dev/ircomm1 Second IrCOMM device + ... + 16 = /dev/irlpt0 First IrLPT device + 17 = /dev/irlpt1 Second IrLPT device + ... + + 161 block Carmel 8-port SATA Disks on Second Controller + 0 = /dev/carmel/8 SATA disk 8 whole disk + 1 = /dev/carmel/8p1 SATA disk 8 partition 1 + ... + 31 = /dev/carmel/8p31 SATA disk 8 partition 31 + + 32 = /dev/carmel/9 SATA disk 9 whole disk + 64 = /dev/carmel/10 SATA disk 10 whole disk + ... + 224 = /dev/carmel/15 SATA disk 15 whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 31. + + 162 char Raw block device interface + 0 = /dev/rawctl Raw I/O control device + 1 = /dev/raw/raw1 First raw I/O device + 2 = /dev/raw/raw2 Second raw I/O device + ... + max minor number of raw device is set by kernel config + MAX_RAW_DEVS or raw module parameter 'max_raw_devs' + + 163 char + + 164 char Chase Research AT/PCI-Fast serial card + 0 = /dev/ttyCH0 AT/PCI-Fast board 0, port 0 + ... + 15 = /dev/ttyCH15 AT/PCI-Fast board 0, port 15 + 16 = /dev/ttyCH16 AT/PCI-Fast board 1, port 0 + ... + 31 = /dev/ttyCH31 AT/PCI-Fast board 1, port 15 + 32 = /dev/ttyCH32 AT/PCI-Fast board 2, port 0 + ... + 47 = /dev/ttyCH47 AT/PCI-Fast board 2, port 15 + 48 = /dev/ttyCH48 AT/PCI-Fast board 3, port 0 + ... + 63 = /dev/ttyCH63 AT/PCI-Fast board 3, port 15 + + 165 char Chase Research AT/PCI-Fast serial card - alternate devices + 0 = /dev/cuch0 Callout device for ttyCH0 + ... + 63 = /dev/cuch63 Callout device for ttyCH63 + + 166 char ACM USB modems + 0 = /dev/ttyACM0 First ACM modem + 1 = /dev/ttyACM1 Second ACM modem + ... + + 167 char ACM USB modems - alternate devices + 0 = /dev/cuacm0 Callout device for ttyACM0 + 1 = /dev/cuacm1 Callout device for ttyACM1 + ... + + 168 char Eracom CSA7000 PCI encryption adaptor + 0 = /dev/ecsa0 First CSA7000 + 1 = /dev/ecsa1 Second CSA7000 + ... + + 169 char Eracom CSA8000 PCI encryption adaptor + 0 = /dev/ecsa8-0 First CSA8000 + 1 = /dev/ecsa8-1 Second CSA8000 + ... + + 170 char AMI MegaRAC remote access controller + 0 = /dev/megarac0 First MegaRAC card + 1 = /dev/megarac1 Second MegaRAC card + ... + + 171 char Reserved for IEEE 1394 (Firewire) + + 172 char Moxa Intellio serial card + 0 = /dev/ttyMX0 First Moxa port + 1 = /dev/ttyMX1 Second Moxa port + ... + 127 = /dev/ttyMX127 128th Moxa port + 128 = /dev/moxactl Moxa control port + + 173 char Moxa Intellio serial card - alternate devices + 0 = /dev/cumx0 Callout device for ttyMX0 + 1 = /dev/cumx1 Callout device for ttyMX1 + ... + 127 = /dev/cumx127 Callout device for ttyMX127 + + 174 char SmartIO serial card + 0 = /dev/ttySI0 First SmartIO port + 1 = /dev/ttySI1 Second SmartIO port + ... + + 175 char SmartIO serial card - alternate devices + 0 = /dev/cusi0 Callout device for ttySI0 + 1 = /dev/cusi1 Callout device for ttySI1 + ... + + 176 char nCipher nFast PCI crypto accelerator + 0 = /dev/nfastpci0 First nFast PCI device + 1 = /dev/nfastpci1 First nFast PCI device + ... + + 177 char TI PCILynx memory spaces + 0 = /dev/pcilynx/aux0 AUX space of first PCILynx card + ... + 15 = /dev/pcilynx/aux15 AUX space of 16th PCILynx card + 16 = /dev/pcilynx/rom0 ROM space of first PCILynx card + ... + 31 = /dev/pcilynx/rom15 ROM space of 16th PCILynx card + 32 = /dev/pcilynx/ram0 RAM space of first PCILynx card + ... + 47 = /dev/pcilynx/ram15 RAM space of 16th PCILynx card + + 178 char Giganet cLAN1xxx virtual interface adapter + 0 = /dev/clanvi0 First cLAN adapter + 1 = /dev/clanvi1 Second cLAN adapter + ... + + 179 block MMC block devices + 0 = /dev/mmcblk0 First SD/MMC card + 1 = /dev/mmcblk0p1 First partition on first MMC card + 8 = /dev/mmcblk1 Second SD/MMC card + ... + + The start of next SD/MMC card can be configured with + CONFIG_MMC_BLOCK_MINORS, or overridden at boot/modprobe + time using the mmcblk.perdev_minors option. That would + bump the offset between each card to be the configured + value instead of the default 8. + + 179 char CCube DVXChip-based PCI products + 0 = /dev/dvxirq0 First DVX device + 1 = /dev/dvxirq1 Second DVX device + ... + + 180 char USB devices + 0 = /dev/usb/lp0 First USB printer + ... + 15 = /dev/usb/lp15 16th USB printer + 48 = /dev/usb/scanner0 First USB scanner + ... + 63 = /dev/usb/scanner15 16th USB scanner + 64 = /dev/usb/rio500 Diamond Rio 500 + 65 = /dev/usb/usblcd USBLCD Interface (info@usblcd.de) + 66 = /dev/usb/cpad0 Synaptics cPad (mouse/LCD) + 96 = /dev/usb/hiddev0 1st USB HID device + ... + 111 = /dev/usb/hiddev15 16th USB HID device + 112 = /dev/usb/auer0 1st auerswald ISDN device + ... + 127 = /dev/usb/auer15 16th auerswald ISDN device + 128 = /dev/usb/brlvgr0 First Braille Voyager device + ... + 131 = /dev/usb/brlvgr3 Fourth Braille Voyager device + 132 = /dev/usb/idmouse ID Mouse (fingerprint scanner) device + 133 = /dev/usb/sisusbvga1 First SiSUSB VGA device + ... + 140 = /dev/usb/sisusbvga8 Eighth SISUSB VGA device + 144 = /dev/usb/lcd USB LCD device + 160 = /dev/usb/legousbtower0 1st USB Legotower device + ... + 175 = /dev/usb/legousbtower15 16th USB Legotower device + 176 = /dev/usb/usbtmc1 First USB TMC device + ... + 191 = /dev/usb/usbtmc16 16th USB TMC device + 192 = /dev/usb/yurex1 First USB Yurex device + ... + 209 = /dev/usb/yurex16 16th USB Yurex device + + 180 block USB block devices + 0 = /dev/uba First USB block device + 8 = /dev/ubb Second USB block device + 16 = /dev/ubc Third USB block device + ... + + 181 char Conrad Electronic parallel port radio clocks + 0 = /dev/pcfclock0 First Conrad radio clock + 1 = /dev/pcfclock1 Second Conrad radio clock + ... + + 182 char Picture Elements THR2 binarizer + 0 = /dev/pethr0 First THR2 board + 1 = /dev/pethr1 Second THR2 board + ... + + 183 char SST 5136-DN DeviceNet interface + 0 = /dev/ss5136dn0 First DeviceNet interface + 1 = /dev/ss5136dn1 Second DeviceNet interface + ... + + This device used to be assigned to major number 144. + It had to be moved due to an unfortunate conflict. + + 184 char Picture Elements' video simulator/sender + 0 = /dev/pevss0 First sender board + 1 = /dev/pevss1 Second sender board + ... + + 185 char InterMezzo high availability file system + 0 = /dev/intermezzo0 First cache manager + 1 = /dev/intermezzo1 Second cache manager + ... + + See http://web.archive.org/web/20080115195241/ + http://inter-mezzo.org/index.html + + 186 char Object-based storage control device + 0 = /dev/obd0 First obd control device + 1 = /dev/obd1 Second obd control device + ... + + See ftp://ftp.lustre.org/pub/obd for code and information. + + 187 char DESkey hardware encryption device + 0 = /dev/deskey0 First DES key + 1 = /dev/deskey1 Second DES key + ... + + 188 char USB serial converters + 0 = /dev/ttyUSB0 First USB serial converter + 1 = /dev/ttyUSB1 Second USB serial converter + ... + + 189 char USB serial converters - alternate devices + 0 = /dev/cuusb0 Callout device for ttyUSB0 + 1 = /dev/cuusb1 Callout device for ttyUSB1 + ... + + 190 char Kansas City tracker/tuner card + 0 = /dev/kctt0 First KCT/T card + 1 = /dev/kctt1 Second KCT/T card + ... + + 191 char Reserved for PCMCIA + + 192 char Kernel profiling interface + 0 = /dev/profile Profiling control device + 1 = /dev/profile0 Profiling device for CPU 0 + 2 = /dev/profile1 Profiling device for CPU 1 + ... + + 193 char Kernel event-tracing interface + 0 = /dev/trace Tracing control device + 1 = /dev/trace0 Tracing device for CPU 0 + 2 = /dev/trace1 Tracing device for CPU 1 + ... + + 194 char linVideoStreams (LINVS) + 0 = /dev/mvideo/status0 Video compression status + 1 = /dev/mvideo/stream0 Video stream + 2 = /dev/mvideo/frame0 Single compressed frame + 3 = /dev/mvideo/rawframe0 Raw uncompressed frame + 4 = /dev/mvideo/codec0 Direct codec access + 5 = /dev/mvideo/video4linux0 Video4Linux compatibility + + 16 = /dev/mvideo/status1 Second device + ... + 32 = /dev/mvideo/status2 Third device + ... + ... + 240 = /dev/mvideo/status15 16th device + ... + + 195 char Nvidia graphics devices + 0 = /dev/nvidia0 First Nvidia card + 1 = /dev/nvidia1 Second Nvidia card + ... + 255 = /dev/nvidiactl Nvidia card control device + + 196 char Tormenta T1 card + 0 = /dev/tor/0 Master control channel for all cards + 1 = /dev/tor/1 First DS0 + 2 = /dev/tor/2 Second DS0 + ... + 48 = /dev/tor/48 48th DS0 + 49 = /dev/tor/49 First pseudo-channel + 50 = /dev/tor/50 Second pseudo-channel + ... + + 197 char OpenTNF tracing facility + 0 = /dev/tnf/t0 Trace 0 data extraction + 1 = /dev/tnf/t1 Trace 1 data extraction + ... + 128 = /dev/tnf/status Tracing facility status + 130 = /dev/tnf/trace Tracing device + + 198 char Total Impact TPMP2 quad coprocessor PCI card + 0 = /dev/tpmp2/0 First card + 1 = /dev/tpmp2/1 Second card + ... + + 199 char Veritas volume manager (VxVM) volumes + 0 = /dev/vx/rdsk/*/* First volume + 1 = /dev/vx/rdsk/*/* Second volume + ... + + 199 block Veritas volume manager (VxVM) volumes + 0 = /dev/vx/dsk/*/* First volume + 1 = /dev/vx/dsk/*/* Second volume + ... + + The namespace in these directories is maintained by + the user space VxVM software. + + 200 char Veritas VxVM configuration interface + 0 = /dev/vx/config Configuration access node + 1 = /dev/vx/trace Volume i/o trace access node + 2 = /dev/vx/iod Volume i/o daemon access node + 3 = /dev/vx/info Volume information access node + 4 = /dev/vx/task Volume tasks access node + 5 = /dev/vx/taskmon Volume tasks monitor daemon + + 201 char Veritas VxVM dynamic multipathing driver + 0 = /dev/vx/rdmp/* First multipath device + 1 = /dev/vx/rdmp/* Second multipath device + ... + 201 block Veritas VxVM dynamic multipathing driver + 0 = /dev/vx/dmp/* First multipath device + 1 = /dev/vx/dmp/* Second multipath device + ... + + The namespace in these directories is maintained by + the user space VxVM software. + + 202 char CPU model-specific registers + 0 = /dev/cpu/0/msr MSRs on CPU 0 + 1 = /dev/cpu/1/msr MSRs on CPU 1 + ... + + 202 block Xen Virtual Block Device + 0 = /dev/xvda First Xen VBD whole disk + 16 = /dev/xvdb Second Xen VBD whole disk + 32 = /dev/xvdc Third Xen VBD whole disk + ... + 240 = /dev/xvdp Sixteenth Xen VBD whole disk + + Partitions are handled in the same way as for IDE + disks (see major number 3) except that the limit on + partitions is 15. + + 203 char CPU CPUID information + 0 = /dev/cpu/0/cpuid CPUID on CPU 0 + 1 = /dev/cpu/1/cpuid CPUID on CPU 1 + ... + + 204 char Low-density serial ports + 0 = /dev/ttyLU0 LinkUp Systems L72xx UART - port 0 + 1 = /dev/ttyLU1 LinkUp Systems L72xx UART - port 1 + 2 = /dev/ttyLU2 LinkUp Systems L72xx UART - port 2 + 3 = /dev/ttyLU3 LinkUp Systems L72xx UART - port 3 + 4 = /dev/ttyFB0 Intel Footbridge (ARM) + 5 = /dev/ttySA0 StrongARM builtin serial port 0 + 6 = /dev/ttySA1 StrongARM builtin serial port 1 + 7 = /dev/ttySA2 StrongARM builtin serial port 2 + 8 = /dev/ttySC0 SCI serial port (SuperH) - port 0 + 9 = /dev/ttySC1 SCI serial port (SuperH) - port 1 + 10 = /dev/ttySC2 SCI serial port (SuperH) - port 2 + 11 = /dev/ttySC3 SCI serial port (SuperH) - port 3 + 12 = /dev/ttyFW0 Firmware console - port 0 + 13 = /dev/ttyFW1 Firmware console - port 1 + 14 = /dev/ttyFW2 Firmware console - port 2 + 15 = /dev/ttyFW3 Firmware console - port 3 + 16 = /dev/ttyAM0 ARM "AMBA" serial port 0 + ... + 31 = /dev/ttyAM15 ARM "AMBA" serial port 15 + 32 = /dev/ttyDB0 DataBooster serial port 0 + ... + 39 = /dev/ttyDB7 DataBooster serial port 7 + 40 = /dev/ttySG0 SGI Altix console port + 41 = /dev/ttySMX0 Motorola i.MX - port 0 + 42 = /dev/ttySMX1 Motorola i.MX - port 1 + 43 = /dev/ttySMX2 Motorola i.MX - port 2 + 44 = /dev/ttyMM0 Marvell MPSC - port 0 + 45 = /dev/ttyMM1 Marvell MPSC - port 1 + 46 = /dev/ttyCPM0 PPC CPM (SCC or SMC) - port 0 + ... + 47 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5 + 50 = /dev/ttyIOC0 Altix serial card + ... + 81 = /dev/ttyIOC31 Altix serial card + 82 = /dev/ttyVR0 NEC VR4100 series SIU + 83 = /dev/ttyVR1 NEC VR4100 series DSIU + 84 = /dev/ttyIOC84 Altix ioc4 serial card + ... + 115 = /dev/ttyIOC115 Altix ioc4 serial card + 116 = /dev/ttySIOC0 Altix ioc3 serial card + ... + 147 = /dev/ttySIOC31 Altix ioc3 serial card + 148 = /dev/ttyPSC0 PPC PSC - port 0 + ... + 153 = /dev/ttyPSC5 PPC PSC - port 5 + 154 = /dev/ttyAT0 ATMEL serial port 0 + ... + 169 = /dev/ttyAT15 ATMEL serial port 15 + 170 = /dev/ttyNX0 Hilscher netX serial port 0 + ... + 185 = /dev/ttyNX15 Hilscher netX serial port 15 + 186 = /dev/ttyJ0 JTAG1 DCC protocol based serial port emulation + 187 = /dev/ttyUL0 Xilinx uartlite - port 0 + ... + 190 = /dev/ttyUL3 Xilinx uartlite - port 3 + 191 = /dev/xvc0 Xen virtual console - port 0 + 192 = /dev/ttyPZ0 pmac_zilog - port 0 + ... + 195 = /dev/ttyPZ3 pmac_zilog - port 3 + 196 = /dev/ttyTX0 TX39/49 serial port 0 + ... + 204 = /dev/ttyTX7 TX39/49 serial port 7 + 205 = /dev/ttySC0 SC26xx serial port 0 + 206 = /dev/ttySC1 SC26xx serial port 1 + 207 = /dev/ttySC2 SC26xx serial port 2 + 208 = /dev/ttySC3 SC26xx serial port 3 + 209 = /dev/ttyMAX0 MAX3100 serial port 0 + 210 = /dev/ttyMAX1 MAX3100 serial port 1 + 211 = /dev/ttyMAX2 MAX3100 serial port 2 + 212 = /dev/ttyMAX3 MAX3100 serial port 3 + + 205 char Low-density serial ports (alternate device) + 0 = /dev/culu0 Callout device for ttyLU0 + 1 = /dev/culu1 Callout device for ttyLU1 + 2 = /dev/culu2 Callout device for ttyLU2 + 3 = /dev/culu3 Callout device for ttyLU3 + 4 = /dev/cufb0 Callout device for ttyFB0 + 5 = /dev/cusa0 Callout device for ttySA0 + 6 = /dev/cusa1 Callout device for ttySA1 + 7 = /dev/cusa2 Callout device for ttySA2 + 8 = /dev/cusc0 Callout device for ttySC0 + 9 = /dev/cusc1 Callout device for ttySC1 + 10 = /dev/cusc2 Callout device for ttySC2 + 11 = /dev/cusc3 Callout device for ttySC3 + 12 = /dev/cufw0 Callout device for ttyFW0 + 13 = /dev/cufw1 Callout device for ttyFW1 + 14 = /dev/cufw2 Callout device for ttyFW2 + 15 = /dev/cufw3 Callout device for ttyFW3 + 16 = /dev/cuam0 Callout device for ttyAM0 + ... + 31 = /dev/cuam15 Callout device for ttyAM15 + 32 = /dev/cudb0 Callout device for ttyDB0 + ... + 39 = /dev/cudb7 Callout device for ttyDB7 + 40 = /dev/cusg0 Callout device for ttySG0 + 41 = /dev/ttycusmx0 Callout device for ttySMX0 + 42 = /dev/ttycusmx1 Callout device for ttySMX1 + 43 = /dev/ttycusmx2 Callout device for ttySMX2 + 46 = /dev/cucpm0 Callout device for ttyCPM0 + ... + 49 = /dev/cucpm5 Callout device for ttyCPM5 + 50 = /dev/cuioc40 Callout device for ttyIOC40 + ... + 81 = /dev/cuioc431 Callout device for ttyIOC431 + 82 = /dev/cuvr0 Callout device for ttyVR0 + 83 = /dev/cuvr1 Callout device for ttyVR1 + + 206 char OnStream SC-x0 tape devices + 0 = /dev/osst0 First OnStream SCSI tape, mode 0 + 1 = /dev/osst1 Second OnStream SCSI tape, mode 0 + ... + 32 = /dev/osst0l First OnStream SCSI tape, mode 1 + 33 = /dev/osst1l Second OnStream SCSI tape, mode 1 + ... + 64 = /dev/osst0m First OnStream SCSI tape, mode 2 + 65 = /dev/osst1m Second OnStream SCSI tape, mode 2 + ... + 96 = /dev/osst0a First OnStream SCSI tape, mode 3 + 97 = /dev/osst1a Second OnStream SCSI tape, mode 3 + ... + 128 = /dev/nosst0 No rewind version of /dev/osst0 + 129 = /dev/nosst1 No rewind version of /dev/osst1 + ... + 160 = /dev/nosst0l No rewind version of /dev/osst0l + 161 = /dev/nosst1l No rewind version of /dev/osst1l + ... + 192 = /dev/nosst0m No rewind version of /dev/osst0m + 193 = /dev/nosst1m No rewind version of /dev/osst1m + ... + 224 = /dev/nosst0a No rewind version of /dev/osst0a + 225 = /dev/nosst1a No rewind version of /dev/osst1a + ... + + The OnStream SC-x0 SCSI tapes do not support the + standard SCSI SASD command set and therefore need + their own driver "osst". Note that the IDE, USB (and + maybe ParPort) versions may be driven via ide-scsi or + usb-storage SCSI emulation and this osst device and + driver as well. The ADR-x0 drives are QIC-157 + compliant and don't need osst. + + 207 char Compaq ProLiant health feature indicate + 0 = /dev/cpqhealth/cpqw Redirector interface + 1 = /dev/cpqhealth/crom EISA CROM + 2 = /dev/cpqhealth/cdt Data Table + 3 = /dev/cpqhealth/cevt Event Log + 4 = /dev/cpqhealth/casr Automatic Server Recovery + 5 = /dev/cpqhealth/cecc ECC Memory + 6 = /dev/cpqhealth/cmca Machine Check Architecture + 7 = /dev/cpqhealth/ccsm Deprecated CDT + 8 = /dev/cpqhealth/cnmi NMI Handling + 9 = /dev/cpqhealth/css Sideshow Management + 10 = /dev/cpqhealth/cram CMOS interface + 11 = /dev/cpqhealth/cpci PCI IRQ interface + + 208 char User space serial ports + 0 = /dev/ttyU0 First user space serial port + 1 = /dev/ttyU1 Second user space serial port + ... + + 209 char User space serial ports (alternate devices) + 0 = /dev/cuu0 Callout device for ttyU0 + 1 = /dev/cuu1 Callout device for ttyU1 + ... + + 210 char SBE, Inc. sync/async serial card + 0 = /dev/sbei/wxcfg0 Configuration device for board 0 + 1 = /dev/sbei/dld0 Download device for board 0 + 2 = /dev/sbei/wan00 WAN device, port 0, board 0 + 3 = /dev/sbei/wan01 WAN device, port 1, board 0 + 4 = /dev/sbei/wan02 WAN device, port 2, board 0 + 5 = /dev/sbei/wan03 WAN device, port 3, board 0 + 6 = /dev/sbei/wanc00 WAN clone device, port 0, board 0 + 7 = /dev/sbei/wanc01 WAN clone device, port 1, board 0 + 8 = /dev/sbei/wanc02 WAN clone device, port 2, board 0 + 9 = /dev/sbei/wanc03 WAN clone device, port 3, board 0 + 10 = /dev/sbei/wxcfg1 Configuration device for board 1 + 11 = /dev/sbei/dld1 Download device for board 1 + 12 = /dev/sbei/wan10 WAN device, port 0, board 1 + 13 = /dev/sbei/wan11 WAN device, port 1, board 1 + 14 = /dev/sbei/wan12 WAN device, port 2, board 1 + 15 = /dev/sbei/wan13 WAN device, port 3, board 1 + 16 = /dev/sbei/wanc10 WAN clone device, port 0, board 1 + 17 = /dev/sbei/wanc11 WAN clone device, port 1, board 1 + 18 = /dev/sbei/wanc12 WAN clone device, port 2, board 1 + 19 = /dev/sbei/wanc13 WAN clone device, port 3, board 1 + ... + + Yes, each board is really spaced 10 (decimal) apart. + + 211 char Addinum CPCI1500 digital I/O card + 0 = /dev/addinum/cpci1500/0 First CPCI1500 card + 1 = /dev/addinum/cpci1500/1 Second CPCI1500 card + ... + + 212 char LinuxTV.org DVB driver subsystem + 0 = /dev/dvb/adapter0/video0 first video decoder of first card + 1 = /dev/dvb/adapter0/audio0 first audio decoder of first card + 2 = /dev/dvb/adapter0/sec0 (obsolete/unused) + 3 = /dev/dvb/adapter0/frontend0 first frontend device of first card + 4 = /dev/dvb/adapter0/demux0 first demux device of first card + 5 = /dev/dvb/adapter0/dvr0 first digital video recoder device of first card + 6 = /dev/dvb/adapter0/ca0 first common access port of first card + 7 = /dev/dvb/adapter0/net0 first network device of first card + 8 = /dev/dvb/adapter0/osd0 first on-screen-display device of first card + 9 = /dev/dvb/adapter0/video1 second video decoder of first card + ... + 64 = /dev/dvb/adapter1/video0 first video decoder of second card + ... + 128 = /dev/dvb/adapter2/video0 first video decoder of third card + ... + 196 = /dev/dvb/adapter3/video0 first video decoder of fourth card + + 216 char Bluetooth RFCOMM TTY devices + 0 = /dev/rfcomm0 First Bluetooth RFCOMM TTY device + 1 = /dev/rfcomm1 Second Bluetooth RFCOMM TTY device + ... + + 217 char Bluetooth RFCOMM TTY devices (alternate devices) + 0 = /dev/curf0 Callout device for rfcomm0 + 1 = /dev/curf1 Callout device for rfcomm1 + ... + + 218 char The Logical Company bus Unibus/Qbus adapters + 0 = /dev/logicalco/bci/0 First bus adapter + 1 = /dev/logicalco/bci/1 First bus adapter + ... + + 219 char The Logical Company DCI-1300 digital I/O card + 0 = /dev/logicalco/dci1300/0 First DCI-1300 card + 1 = /dev/logicalco/dci1300/1 Second DCI-1300 card + ... + + 220 char Myricom Myrinet "GM" board + 0 = /dev/myricom/gm0 First Myrinet GM board + 1 = /dev/myricom/gmp0 First board "root access" + 2 = /dev/myricom/gm1 Second Myrinet GM board + 3 = /dev/myricom/gmp1 Second board "root access" + ... + + 221 char VME bus + 0 = /dev/bus/vme/m0 First master image + 1 = /dev/bus/vme/m1 Second master image + 2 = /dev/bus/vme/m2 Third master image + 3 = /dev/bus/vme/m3 Fourth master image + 4 = /dev/bus/vme/s0 First slave image + 5 = /dev/bus/vme/s1 Second slave image + 6 = /dev/bus/vme/s2 Third slave image + 7 = /dev/bus/vme/s3 Fourth slave image + 8 = /dev/bus/vme/ctl Control + + It is expected that all VME bus drivers will use the + same interface. For interface documentation see + http://www.vmelinux.org/. + + 224 char A2232 serial card + 0 = /dev/ttyY0 First A2232 port + 1 = /dev/ttyY1 Second A2232 port + ... + + 225 char A2232 serial card (alternate devices) + 0 = /dev/cuy0 Callout device for ttyY0 + 1 = /dev/cuy1 Callout device for ttyY1 + ... + + 226 char Direct Rendering Infrastructure (DRI) + 0 = /dev/dri/card0 First graphics card + 1 = /dev/dri/card1 Second graphics card + ... + + 227 char IBM 3270 terminal Unix tty access + 1 = /dev/3270/tty1 First 3270 terminal + 2 = /dev/3270/tty2 Seconds 3270 terminal + ... + + 228 char IBM 3270 terminal block-mode access + 0 = /dev/3270/tub Controlling interface + 1 = /dev/3270/tub1 First 3270 terminal + 2 = /dev/3270/tub2 Second 3270 terminal + ... + + 229 char IBM iSeries/pSeries virtual console + 0 = /dev/hvc0 First console port + 1 = /dev/hvc1 Second console port + ... + + 230 char IBM iSeries virtual tape + 0 = /dev/iseries/vt0 First virtual tape, mode 0 + 1 = /dev/iseries/vt1 Second virtual tape, mode 0 + ... + 32 = /dev/iseries/vt0l First virtual tape, mode 1 + 33 = /dev/iseries/vt1l Second virtual tape, mode 1 + ... + 64 = /dev/iseries/vt0m First virtual tape, mode 2 + 65 = /dev/iseries/vt1m Second virtual tape, mode 2 + ... + 96 = /dev/iseries/vt0a First virtual tape, mode 3 + 97 = /dev/iseries/vt1a Second virtual tape, mode 3 + ... + 128 = /dev/iseries/nvt0 First virtual tape, mode 0, no rewind + 129 = /dev/iseries/nvt1 Second virtual tape, mode 0, no rewind + ... + 160 = /dev/iseries/nvt0l First virtual tape, mode 1, no rewind + 161 = /dev/iseries/nvt1l Second virtual tape, mode 1, no rewind + ... + 192 = /dev/iseries/nvt0m First virtual tape, mode 2, no rewind + 193 = /dev/iseries/nvt1m Second virtual tape, mode 2, no rewind + ... + 224 = /dev/iseries/nvt0a First virtual tape, mode 3, no rewind + 225 = /dev/iseries/nvt1a Second virtual tape, mode 3, no rewind + ... + + "No rewind" refers to the omission of the default + automatic rewind on device close. The MTREW or MTOFFL + ioctl()'s can be used to rewind the tape regardless of + the device used to access it. + + 231 char InfiniBand + 0 = /dev/infiniband/umad0 + 1 = /dev/infiniband/umad1 + ... + 63 = /dev/infiniband/umad63 63rd InfiniBandMad device + 64 = /dev/infiniband/issm0 First InfiniBand IsSM device + 65 = /dev/infiniband/issm1 Second InfiniBand IsSM device + ... + 127 = /dev/infiniband/issm63 63rd InfiniBand IsSM device + 192 = /dev/infiniband/uverbs0 First InfiniBand verbs device + 193 = /dev/infiniband/uverbs1 Second InfiniBand verbs device + ... + 223 = /dev/infiniband/uverbs31 31st InfiniBand verbs device + + 232 char Biometric Devices + 0 = /dev/biometric/sensor0/fingerprint first fingerprint sensor on first device + 1 = /dev/biometric/sensor0/iris first iris sensor on first device + 2 = /dev/biometric/sensor0/retina first retina sensor on first device + 3 = /dev/biometric/sensor0/voiceprint first voiceprint sensor on first device + 4 = /dev/biometric/sensor0/facial first facial sensor on first device + 5 = /dev/biometric/sensor0/hand first hand sensor on first device + ... + 10 = /dev/biometric/sensor1/fingerprint first fingerprint sensor on second device + ... + 20 = /dev/biometric/sensor2/fingerprint first fingerprint sensor on third device + ... + + 233 char PathScale InfiniPath interconnect + 0 = /dev/ipath Primary device for programs (any unit) + 1 = /dev/ipath0 Access specifically to unit 0 + 2 = /dev/ipath1 Access specifically to unit 1 + ... + 4 = /dev/ipath3 Access specifically to unit 3 + 129 = /dev/ipath_sma Device used by Subnet Management Agent + 130 = /dev/ipath_diag Device used by diagnostics programs + + 234-254 char RESERVED FOR DYNAMIC ASSIGNMENT + Character devices that request a dynamic allocation of major number will + take numbers starting from 254 and downward. + + 240-254 block LOCAL/EXPERIMENTAL USE + Allocated for local/experimental use. For devices not + assigned official numbers, these ranges should be + used in order to avoid conflicting with future assignments. + + 255 char RESERVED + + 255 block RESERVED + + This major is reserved to assist the expansion to a + larger number space. No device nodes with this major + should ever be created on the filesystem. + (This is probably not true anymore, but I'll leave it + for now /Torben) + + ---LARGE MAJORS!!!!!--- + + 256 char Equinox SST multi-port serial boards + 0 = /dev/ttyEQ0 First serial port on first Equinox SST board + 127 = /dev/ttyEQ127 Last serial port on first Equinox SST board + 128 = /dev/ttyEQ128 First serial port on second Equinox SST board + ... + 1027 = /dev/ttyEQ1027 Last serial port on eighth Equinox SST board + + 256 block Resident Flash Disk Flash Translation Layer + 0 = /dev/rfda First RFD FTL layer + 16 = /dev/rfdb Second RFD FTL layer + ... + 240 = /dev/rfdp 16th RFD FTL layer + + 257 char Phoenix Technologies Cryptographic Services Driver + 0 = /dev/ptlsec Crypto Services Driver + + 257 block SSFDC Flash Translation Layer filesystem + 0 = /dev/ssfdca First SSFDC layer + 8 = /dev/ssfdcb Second SSFDC layer + 16 = /dev/ssfdcc Third SSFDC layer + 24 = /dev/ssfdcd 4th SSFDC layer + 32 = /dev/ssfdce 5th SSFDC layer + 40 = /dev/ssfdcf 6th SSFDC layer + 48 = /dev/ssfdcg 7th SSFDC layer + 56 = /dev/ssfdch 8th SSFDC layer + + 258 block ROM/Flash read-only translation layer + 0 = /dev/blockrom0 First ROM card's translation layer interface + 1 = /dev/blockrom1 Second ROM card's translation layer interface + ... + + 259 block Block Extended Major + Used dynamically to hold additional partition minor + numbers and allow large numbers of partitions per device + + 259 char FPGA configuration interfaces + 0 = /dev/icap0 First Xilinx internal configuration + 1 = /dev/icap1 Second Xilinx internal configuration + + 260 char OSD (Object-based-device) SCSI Device + 0 = /dev/osd0 First OSD Device + 1 = /dev/osd1 Second OSD Device + ... + 255 = /dev/osd255 256th OSD Device + + 384-511 char RESERVED FOR DYNAMIC ASSIGNMENT + Character devices that request a dynamic allocation of major + number will take numbers starting from 511 and downward, + once the 234-254 range is full. diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst new file mode 100644 index 000000000..fdf72429f --- /dev/null +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -0,0 +1,353 @@ +Dynamic debug ++++++++++++++ + + +Introduction +============ + +This document describes how to use the dynamic debug (dyndbg) feature. + +Dynamic debug is designed to allow you to dynamically enable/disable +kernel code to obtain additional kernel information. Currently, if +``CONFIG_DYNAMIC_DEBUG`` is set, then all ``pr_debug()``/``dev_dbg()`` and +``print_hex_dump_debug()``/``print_hex_dump_bytes()`` calls can be dynamically +enabled per-callsite. + +If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is just +shortcut for ``print_hex_dump(KERN_DEBUG)``. + +For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is +its ``prefix_str`` argument, if it is constant string; or ``hexdump`` +in case ``prefix_str`` is built dynamically. + +Dynamic debug has even more useful features: + + * Simple query language allows turning on and off debugging + statements by matching any combination of 0 or 1 of: + + - source filename + - function name + - line number (including ranges of line numbers) + - module name + - format string + + * Provides a debugfs control file: ``<debugfs>/dynamic_debug/control`` + which can be read to display the complete list of known debug + statements, to help guide you + +Controlling dynamic debug Behaviour +=================================== + +The behaviour of ``pr_debug()``/``dev_dbg()`` are controlled via writing to a +control file in the 'debugfs' filesystem. Thus, you must first mount +the debugfs filesystem, in order to make use of this feature. +Subsequently, we refer to the control file as: +``<debugfs>/dynamic_debug/control``. For example, if you want to enable +printing from source file ``svcsock.c``, line 1603 you simply do:: + + nullarbor:~ # echo 'file svcsock.c line 1603 +p' > + <debugfs>/dynamic_debug/control + +If you make a mistake with the syntax, the write will fail thus:: + + nullarbor:~ # echo 'file svcsock.c wtf 1 +p' > + <debugfs>/dynamic_debug/control + -bash: echo: write error: Invalid argument + +Viewing Dynamic Debug Behaviour +=============================== + +You can view the currently configured behaviour of all the debug +statements via:: + + nullarbor:~ # cat <debugfs>/dynamic_debug/control + # filename:lineno [module]function flags format + /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:323 [svcxprt_rdma]svc_rdma_cleanup =_ "SVCRDMA Module Removed, deregister RPC RDMA transport\012" + /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:341 [svcxprt_rdma]svc_rdma_init =_ "\011max_inline : %d\012" + /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:340 [svcxprt_rdma]svc_rdma_init =_ "\011sq_depth : %d\012" + /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:338 [svcxprt_rdma]svc_rdma_init =_ "\011max_requests : %d\012" + ... + + +You can also apply standard Unix text manipulation filters to this +data, e.g.:: + + nullarbor:~ # grep -i rdma <debugfs>/dynamic_debug/control | wc -l + 62 + + nullarbor:~ # grep -i tcp <debugfs>/dynamic_debug/control | wc -l + 42 + +The third column shows the currently enabled flags for each debug +statement callsite (see below for definitions of the flags). The +default value, with no flags enabled, is ``=_``. So you can view all +the debug statement callsites with any non-default flags:: + + nullarbor:~ # awk '$3 != "=_"' <debugfs>/dynamic_debug/control + # filename:lineno [module]function flags format + /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svcsock.c:1603 [sunrpc]svc_send p "svc_process: st_sendto returned %d\012" + +Command Language Reference +========================== + +At the lexical level, a command comprises a sequence of words separated +by spaces or tabs. So these are all equivalent:: + + nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > + <debugfs>/dynamic_debug/control + nullarbor:~ # echo -n ' file svcsock.c line 1603 +p ' > + <debugfs>/dynamic_debug/control + nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > + <debugfs>/dynamic_debug/control + +Command submissions are bounded by a write() system call. +Multiple commands can be written together, separated by ``;`` or ``\n``:: + + ~# echo "func pnpacpi_get_resources +p; func pnp_assign_mem +p" \ + > <debugfs>/dynamic_debug/control + +If your query set is big, you can batch them too:: + + ~# cat query-batch-file > <debugfs>/dynamic_debug/control + +A another way is to use wildcard. The match rule support ``*`` (matches +zero or more characters) and ``?`` (matches exactly one character).For +example, you can match all usb drivers:: + + ~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control + +At the syntactical level, a command comprises a sequence of match +specifications, followed by a flags change specification:: + + command ::= match-spec* flags-spec + +The match-spec's are used to choose a subset of the known pr_debug() +callsites to which to apply the flags-spec. Think of them as a query +with implicit ANDs between each pair. Note that an empty list of +match-specs will select all debug statement callsites. + +A match specification comprises a keyword, which controls the +attribute of the callsite to be compared, and a value to compare +against. Possible keywords are::: + + match-spec ::= 'func' string | + 'file' string | + 'module' string | + 'format' string | + 'line' line-range + + line-range ::= lineno | + '-'lineno | + lineno'-' | + lineno'-'lineno + + lineno ::= unsigned-int + +.. note:: + + ``line-range`` cannot contain space, e.g. + "1-30" is valid range but "1 - 30" is not. + + +The meanings of each keyword are: + +func + The given string is compared against the function name + of each callsite. Example:: + + func svc_tcp_accept + +file + The given string is compared against either the full pathname, the + src-root relative pathname, or the basename of the source file of + each callsite. Examples:: + + file svcsock.c + file kernel/freezer.c + file /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svcsock.c + +module + The given string is compared against the module name + of each callsite. The module name is the string as + seen in ``lsmod``, i.e. without the directory or the ``.ko`` + suffix and with ``-`` changed to ``_``. Examples:: + + module sunrpc + module nfsd + +format + The given string is searched for in the dynamic debug format + string. Note that the string does not need to match the + entire format, only some part. Whitespace and other + special characters can be escaped using C octal character + escape ``\ooo`` notation, e.g. the space character is ``\040``. + Alternatively, the string can be enclosed in double quote + characters (``"``) or single quote characters (``'``). + Examples:: + + format svcrdma: // many of the NFS/RDMA server pr_debugs + format readahead // some pr_debugs in the readahead cache + format nfsd:\040SETATTR // one way to match a format with whitespace + format "nfsd: SETATTR" // a neater way to match a format with whitespace + format 'nfsd: SETATTR' // yet another way to match a format with whitespace + +line + The given line number or range of line numbers is compared + against the line number of each ``pr_debug()`` callsite. A single + line number matches the callsite line number exactly. A + range of line numbers matches any callsite between the first + and last line number inclusive. An empty first number means + the first line in the file, an empty last line number means the + last line number in the file. Examples:: + + line 1603 // exactly line 1603 + line 1600-1605 // the six lines from line 1600 to line 1605 + line -1605 // the 1605 lines from line 1 to line 1605 + line 1600- // all lines from line 1600 to the end of the file + +The flags specification comprises a change operation followed +by one or more flag characters. The change operation is one +of the characters:: + + - remove the given flags + + add the given flags + = set the flags to the given flags + +The flags are:: + + p enables the pr_debug() callsite. + f Include the function name in the printed message + l Include line number in the printed message + m Include module name in the printed message + t Include thread ID in messages not generated from interrupt context + _ No flags are set. (Or'd with others on input) + +For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only ``p`` flag +have meaning, other flags ignored. + +For display, the flags are preceded by ``=`` +(mnemonic: what the flags are currently equal to). + +Note the regexp ``^[-+=][flmpt_]+$`` matches a flags specification. +To clear all flags at once, use ``=_`` or ``-flmpt``. + + +Debug messages during Boot Process +================================== + +To activate debug messages for core code and built-in modules during +the boot process, even before userspace and debugfs exists, use +``dyndbg="QUERY"``, ``module.dyndbg="QUERY"``, or ``ddebug_query="QUERY"`` +(``ddebug_query`` is obsoleted by ``dyndbg``, and deprecated). QUERY follows +the syntax described above, but must not exceed 1023 characters. Your +bootloader may impose lower limits. + +These ``dyndbg`` params are processed just after the ddebug tables are +processed, as part of the arch_initcall. Thus you can enable debug +messages in all code run after this arch_initcall via this boot +parameter. + +On an x86 system for example ACPI enablement is a subsys_initcall and:: + + dyndbg="file ec.c +p" + +will show early Embedded Controller transactions during ACPI setup if +your machine (typically a laptop) has an Embedded Controller. +PCI (or other devices) initialization also is a hot candidate for using +this boot parameter for debugging purposes. + +If ``foo`` module is not built-in, ``foo.dyndbg`` will still be processed at +boot time, without effect, but will be reprocessed when module is +loaded later. ``dyndbg_query=`` and bare ``dyndbg=`` are only processed at +boot. + + +Debug Messages at Module Initialization Time +============================================ + +When ``modprobe foo`` is called, modprobe scans ``/proc/cmdline`` for +``foo.params``, strips ``foo.``, and passes them to the kernel along with +params given in modprobe args or ``/etc/modprob.d/*.conf`` files, +in the following order: + +1. parameters given via ``/etc/modprobe.d/*.conf``:: + + options foo dyndbg=+pt + options foo dyndbg # defaults to +p + +2. ``foo.dyndbg`` as given in boot args, ``foo.`` is stripped and passed:: + + foo.dyndbg=" func bar +p; func buz +mp" + +3. args to modprobe:: + + modprobe foo dyndbg==pmf # override previous settings + +These ``dyndbg`` queries are applied in order, with last having final say. +This allows boot args to override or modify those from ``/etc/modprobe.d`` +(sensible, since 1 is system wide, 2 is kernel or boot specific), and +modprobe args to override both. + +In the ``foo.dyndbg="QUERY"`` form, the query must exclude ``module foo``. +``foo`` is extracted from the param-name, and applied to each query in +``QUERY``, and only 1 match-spec of each type is allowed. + +The ``dyndbg`` option is a "fake" module parameter, which means: + +- modules do not need to define it explicitly +- every module gets it tacitly, whether they use pr_debug or not +- it doesn't appear in ``/sys/module/$module/parameters/`` + To see it, grep the control file, or inspect ``/proc/cmdline.`` + +For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or +enabled by ``-DDEBUG`` flag during compilation) can be disabled later via +the sysfs interface if the debug messages are no longer needed:: + + echo "module module_name -p" > <debugfs>/dynamic_debug/control + +Examples +======== + +:: + + // enable the message at line 1603 of file svcsock.c + nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > + <debugfs>/dynamic_debug/control + + // enable all the messages in file svcsock.c + nullarbor:~ # echo -n 'file svcsock.c +p' > + <debugfs>/dynamic_debug/control + + // enable all the messages in the NFS server module + nullarbor:~ # echo -n 'module nfsd +p' > + <debugfs>/dynamic_debug/control + + // enable all 12 messages in the function svc_process() + nullarbor:~ # echo -n 'func svc_process +p' > + <debugfs>/dynamic_debug/control + + // disable all 12 messages in the function svc_process() + nullarbor:~ # echo -n 'func svc_process -p' > + <debugfs>/dynamic_debug/control + + // enable messages for NFS calls READ, READLINK, READDIR and READDIR+. + nullarbor:~ # echo -n 'format "nfsd: READ" +p' > + <debugfs>/dynamic_debug/control + + // enable messages in files of which the paths include string "usb" + nullarbor:~ # echo -n '*usb* +p' > <debugfs>/dynamic_debug/control + + // enable all messages + nullarbor:~ # echo -n '+p' > <debugfs>/dynamic_debug/control + + // add module, function to all enabled messages + nullarbor:~ # echo -n '+mf' > <debugfs>/dynamic_debug/control + + // boot-args example, with newlines and comments for readability + Kernel command line: ... + // see whats going on in dyndbg=value processing + dynamic_debug.verbose=1 + // enable pr_debugs in 2 builtins, #cmt is stripped + dyndbg="module params +p #cmt ; module sys +p" + // enable pr_debugs in 2 functions in a module loaded later + pc87360.dyndbg="func pc87360_init_device +p; func pc87360_find +p" diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst new file mode 100644 index 000000000..2adec1e65 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -0,0 +1,18 @@ +======================== +Hardware vulnerabilities +======================== + +This section describes CPU vulnerabilities and provides an overview of the +possible mitigations along with guidance for selecting mitigations if they +are configurable at compile, boot or run time. + +.. toctree:: + :maxdepth: 1 + + spectre + l1tf + mds + tsx_async_abort + multihit.rst + special-register-buffer-data-sampling.rst + processor_mmio_stale_data.rst diff --git a/Documentation/admin-guide/hw-vuln/l1tf.rst b/Documentation/admin-guide/hw-vuln/l1tf.rst new file mode 100644 index 000000000..31653a9f0 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/l1tf.rst @@ -0,0 +1,615 @@ +L1TF - L1 Terminal Fault +======================== + +L1 Terminal Fault is a hardware vulnerability which allows unprivileged +speculative access to data which is available in the Level 1 Data Cache +when the page table entry controlling the virtual address, which is used +for the access, has the Present bit cleared or other reserved bits set. + +Affected processors +------------------- + +This vulnerability affects a wide range of Intel processors. The +vulnerability is not present on: + + - Processors from AMD, Centaur and other non Intel vendors + + - Older processor models, where the CPU family is < 6 + + - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft, + Penwell, Pineview, Silvermont, Airmont, Merrifield) + + - The Intel XEON PHI family + + - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the + IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected + by the Meltdown vulnerability either. These CPUs should become + available by end of 2018. + +Whether a processor is affected or not can be read out from the L1TF +vulnerability file in sysfs. See :ref:`l1tf_sys_info`. + +Related CVEs +------------ + +The following CVE entries are related to the L1TF vulnerability: + + ============= ================= ============================== + CVE-2018-3615 L1 Terminal Fault SGX related aspects + CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects + CVE-2018-3646 L1 Terminal Fault Virtualization related aspects + ============= ================= ============================== + +Problem +------- + +If an instruction accesses a virtual address for which the relevant page +table entry (PTE) has the Present bit cleared or other reserved bits set, +then speculative execution ignores the invalid PTE and loads the referenced +data if it is present in the Level 1 Data Cache, as if the page referenced +by the address bits in the PTE was still present and accessible. + +While this is a purely speculative mechanism and the instruction will raise +a page fault when it is retired eventually, the pure act of loading the +data and making it available to other speculative instructions opens up the +opportunity for side channel attacks to unprivileged malicious code, +similar to the Meltdown attack. + +While Meltdown breaks the user space to kernel space protection, L1TF +allows to attack any physical memory address in the system and the attack +works across all protection domains. It allows an attack of SGX and also +works from inside virtual machines because the speculation bypasses the +extended page table (EPT) protection mechanism. + + +Attack scenarios +---------------- + +1. Malicious user space +^^^^^^^^^^^^^^^^^^^^^^^ + + Operating Systems store arbitrary information in the address bits of a + PTE which is marked non present. This allows a malicious user space + application to attack the physical memory to which these PTEs resolve. + In some cases user-space can maliciously influence the information + encoded in the address bits of the PTE, thus making attacks more + deterministic and more practical. + + The Linux kernel contains a mitigation for this attack vector, PTE + inversion, which is permanently enabled and has no performance + impact. The kernel ensures that the address bits of PTEs, which are not + marked present, never point to cacheable physical memory space. + + A system with an up to date kernel is protected against attacks from + malicious user space applications. + +2. Malicious guest in a virtual machine +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The fact that L1TF breaks all domain protections allows malicious guest + OSes, which can control the PTEs directly, and malicious guest user + space applications, which run on an unprotected guest kernel lacking the + PTE inversion mitigation for L1TF, to attack physical host memory. + + A special aspect of L1TF in the context of virtualization is symmetric + multi threading (SMT). The Intel implementation of SMT is called + HyperThreading. The fact that Hyperthreads on the affected processors + share the L1 Data Cache (L1D) is important for this. As the flaw allows + only to attack data which is present in L1D, a malicious guest running + on one Hyperthread can attack the data which is brought into the L1D by + the context which runs on the sibling Hyperthread of the same physical + core. This context can be host OS, host user space or a different guest. + + If the processor does not support Extended Page Tables, the attack is + only possible, when the hypervisor does not sanitize the content of the + effective (shadow) page tables. + + While solutions exist to mitigate these attack vectors fully, these + mitigations are not enabled by default in the Linux kernel because they + can affect performance significantly. The kernel provides several + mechanisms which can be utilized to address the problem depending on the + deployment scenario. The mitigations, their protection scope and impact + are described in the next sections. + + The default mitigations and the rationale for choosing them are explained + at the end of this document. See :ref:`default_mitigations`. + +.. _l1tf_sys_info: + +L1TF system information +----------------------- + +The Linux kernel provides a sysfs interface to enumerate the current L1TF +status of the system: whether the system is vulnerable, and which +mitigations are active. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/l1tf + +The possible values in this file are: + + =========================== =============================== + 'Not affected' The processor is not vulnerable + 'Mitigation: PTE Inversion' The host protection is active + =========================== =============================== + +If KVM/VMX is enabled and the processor is vulnerable then the following +information is appended to the 'Mitigation: PTE Inversion' part: + + - SMT status: + + ===================== ================ + 'VMX: SMT vulnerable' SMT is enabled + 'VMX: SMT disabled' SMT is disabled + ===================== ================ + + - L1D Flush mode: + + ================================ ==================================== + 'L1D vulnerable' L1D flushing is disabled + + 'L1D conditional cache flushes' L1D flush is conditionally enabled + + 'L1D cache flushes' L1D flush is unconditionally enabled + ================================ ==================================== + +The resulting grade of protection is discussed in the following sections. + + +Host mitigation mechanism +------------------------- + +The kernel is unconditionally protected against L1TF attacks from malicious +user space running on the host. + + +Guest mitigation mechanisms +--------------------------- + +.. _l1d_flush: + +1. L1D flush on VMENTER +^^^^^^^^^^^^^^^^^^^^^^^ + + To make sure that a guest cannot attack data which is present in the L1D + the hypervisor flushes the L1D before entering the guest. + + Flushing the L1D evicts not only the data which should not be accessed + by a potentially malicious guest, it also flushes the guest + data. Flushing the L1D has a performance impact as the processor has to + bring the flushed guest data back into the L1D. Depending on the + frequency of VMEXIT/VMENTER and the type of computations in the guest + performance degradation in the range of 1% to 50% has been observed. For + scenarios where guest VMEXIT/VMENTER are rare the performance impact is + minimal. Virtio and mechanisms like posted interrupts are designed to + confine the VMEXITs to a bare minimum, but specific configurations and + application scenarios might still suffer from a high VMEXIT rate. + + The kernel provides two L1D flush modes: + - conditional ('cond') + - unconditional ('always') + + The conditional mode avoids L1D flushing after VMEXITs which execute + only audited code paths before the corresponding VMENTER. These code + paths have been verified that they cannot expose secrets or other + interesting data to an attacker, but they can leak information about the + address space layout of the hypervisor. + + Unconditional mode flushes L1D on all VMENTER invocations and provides + maximum protection. It has a higher overhead than the conditional + mode. The overhead cannot be quantified correctly as it depends on the + workload scenario and the resulting number of VMEXITs. + + The general recommendation is to enable L1D flush on VMENTER. The kernel + defaults to conditional mode on affected processors. + + **Note**, that L1D flush does not prevent the SMT problem because the + sibling thread will also bring back its data into the L1D which makes it + attackable again. + + L1D flush can be controlled by the administrator via the kernel command + line and sysfs control files. See :ref:`mitigation_control_command_line` + and :ref:`mitigation_control_kvm`. + +.. _guest_confinement: + +2. Guest VCPU confinement to dedicated physical cores +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + To address the SMT problem, it is possible to make a guest or a group of + guests affine to one or more physical cores. The proper mechanism for + that is to utilize exclusive cpusets to ensure that no other guest or + host tasks can run on these cores. + + If only a single guest or related guests run on sibling SMT threads on + the same physical core then they can only attack their own memory and + restricted parts of the host memory. + + Host memory is attackable, when one of the sibling SMT threads runs in + host OS (hypervisor) context and the other in guest context. The amount + of valuable information from the host OS context depends on the context + which the host OS executes, i.e. interrupts, soft interrupts and kernel + threads. The amount of valuable data from these contexts cannot be + declared as non-interesting for an attacker without deep inspection of + the code. + + **Note**, that assigning guests to a fixed set of physical cores affects + the ability of the scheduler to do load balancing and might have + negative effects on CPU utilization depending on the hosting + scenario. Disabling SMT might be a viable alternative for particular + scenarios. + + For further information about confining guests to a single or to a group + of cores consult the cpusets documentation: + + https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt + +.. _interrupt_isolation: + +3. Interrupt affinity +^^^^^^^^^^^^^^^^^^^^^ + + Interrupts can be made affine to logical CPUs. This is not universally + true because there are types of interrupts which are truly per CPU + interrupts, e.g. the local timer interrupt. Aside of that multi queue + devices affine their interrupts to single CPUs or groups of CPUs per + queue without allowing the administrator to control the affinities. + + Moving the interrupts, which can be affinity controlled, away from CPUs + which run untrusted guests, reduces the attack vector space. + + Whether the interrupts with are affine to CPUs, which run untrusted + guests, provide interesting data for an attacker depends on the system + configuration and the scenarios which run on the system. While for some + of the interrupts it can be assumed that they won't expose interesting + information beyond exposing hints about the host OS memory layout, there + is no way to make general assumptions. + + Interrupt affinity can be controlled by the administrator via the + /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is + available at: + + https://www.kernel.org/doc/Documentation/IRQ-affinity.txt + +.. _smt_control: + +4. SMT control +^^^^^^^^^^^^^^ + + To prevent the SMT issues of L1TF it might be necessary to disable SMT + completely. Disabling SMT can have a significant performance impact, but + the impact depends on the hosting scenario and the type of workloads. + The impact of disabling SMT needs also to be weighted against the impact + of other mitigation solutions like confining guests to dedicated cores. + + The kernel provides a sysfs interface to retrieve the status of SMT and + to control it. It also provides a kernel command line interface to + control SMT. + + The kernel command line interface consists of the following options: + + =========== ========================================================== + nosmt Affects the bring up of the secondary CPUs during boot. The + kernel tries to bring all present CPUs online during the + boot process. "nosmt" makes sure that from each physical + core only one - the so called primary (hyper) thread is + activated. Due to a design flaw of Intel processors related + to Machine Check Exceptions the non primary siblings have + to be brought up at least partially and are then shut down + again. "nosmt" can be undone via the sysfs interface. + + nosmt=force Has the same effect as "nosmt" but it does not allow to + undo the SMT disable via the sysfs interface. + =========== ========================================================== + + The sysfs interface provides two files: + + - /sys/devices/system/cpu/smt/control + - /sys/devices/system/cpu/smt/active + + /sys/devices/system/cpu/smt/control: + + This file allows to read out the SMT control state and provides the + ability to disable or (re)enable SMT. The possible states are: + + ============== =================================================== + on SMT is supported by the CPU and enabled. All + logical CPUs can be onlined and offlined without + restrictions. + + off SMT is supported by the CPU and disabled. Only + the so called primary SMT threads can be onlined + and offlined without restrictions. An attempt to + online a non-primary sibling is rejected + + forceoff Same as 'off' but the state cannot be controlled. + Attempts to write to the control file are rejected. + + notsupported The processor does not support SMT. It's therefore + not affected by the SMT implications of L1TF. + Attempts to write to the control file are rejected. + ============== =================================================== + + The possible states which can be written into this file to control SMT + state are: + + - on + - off + - forceoff + + /sys/devices/system/cpu/smt/active: + + This file reports whether SMT is enabled and active, i.e. if on any + physical core two or more sibling threads are online. + + SMT control is also possible at boot time via the l1tf kernel command + line parameter in combination with L1D flush control. See + :ref:`mitigation_control_command_line`. + +5. Disabling EPT +^^^^^^^^^^^^^^^^ + + Disabling EPT for virtual machines provides full mitigation for L1TF even + with SMT enabled, because the effective page tables for guests are + managed and sanitized by the hypervisor. Though disabling EPT has a + significant performance impact especially when the Meltdown mitigation + KPTI is enabled. + + EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter. + +There is ongoing research and development for new mitigation mechanisms to +address the performance impact of disabling SMT or EPT. + +.. _mitigation_control_command_line: + +Mitigation control on the kernel command line +--------------------------------------------- + +The kernel command line allows to control the L1TF mitigations at boot +time with the option "l1tf=". The valid arguments for this option are: + + ============ ============================================================= + full Provides all available mitigations for the L1TF + vulnerability. Disables SMT and enables all mitigations in + the hypervisors, i.e. unconditional L1D flushing + + SMT control and L1D flush control via the sysfs interface + is still possible after boot. Hypervisors will issue a + warning when the first VM is started in a potentially + insecure configuration, i.e. SMT enabled or L1D flush + disabled. + + full,force Same as 'full', but disables SMT and L1D flush runtime + control. Implies the 'nosmt=force' command line option. + (i.e. sysfs control of SMT is disabled.) + + flush Leaves SMT enabled and enables the default hypervisor + mitigation, i.e. conditional L1D flushing + + SMT control and L1D flush control via the sysfs interface + is still possible after boot. Hypervisors will issue a + warning when the first VM is started in a potentially + insecure configuration, i.e. SMT enabled or L1D flush + disabled. + + flush,nosmt Disables SMT and enables the default hypervisor mitigation, + i.e. conditional L1D flushing. + + SMT control and L1D flush control via the sysfs interface + is still possible after boot. Hypervisors will issue a + warning when the first VM is started in a potentially + insecure configuration, i.e. SMT enabled or L1D flush + disabled. + + flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is + started in a potentially insecure configuration. + + off Disables hypervisor mitigations and doesn't emit any + warnings. + It also drops the swap size and available RAM limit restrictions + on both hypervisor and bare metal. + + ============ ============================================================= + +The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`. + + +.. _mitigation_control_kvm: + +Mitigation control for KVM - module parameter +------------------------------------------------------------- + +The KVM hypervisor mitigation mechanism, flushing the L1D cache when +entering a guest, can be controlled with a module parameter. + +The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the +following arguments: + + ============ ============================================================== + always L1D cache flush on every VMENTER. + + cond Flush L1D on VMENTER only when the code between VMEXIT and + VMENTER can leak host memory which is considered + interesting for an attacker. This still can leak host memory + which allows e.g. to determine the hosts address space layout. + + never Disables the mitigation + ============ ============================================================== + +The parameter can be provided on the kernel command line, as a module +parameter when loading the modules and at runtime modified via the sysfs +file: + +/sys/module/kvm_intel/parameters/vmentry_l1d_flush + +The default is 'cond'. If 'l1tf=full,force' is given on the kernel command +line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush +module parameter is ignored and writes to the sysfs file are rejected. + +.. _mitigation_selection: + +Mitigation selection guide +-------------------------- + +1. No virtualization in use +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The system is protected by the kernel unconditionally and no further + action is required. + +2. Virtualization with trusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + If the guest comes from a trusted source and the guest OS kernel is + guaranteed to have the L1TF mitigations in place the system is fully + protected against L1TF and no further action is required. + + To avoid the overhead of the default L1D flushing on VMENTER the + administrator can disable the flushing via the kernel command line and + sysfs control files. See :ref:`mitigation_control_command_line` and + :ref:`mitigation_control_kvm`. + + +3. Virtualization with untrusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +3.1. SMT not supported or disabled +"""""""""""""""""""""""""""""""""" + + If SMT is not supported by the processor or disabled in the BIOS or by + the kernel, it's only required to enforce L1D flushing on VMENTER. + + Conditional L1D flushing is the default behaviour and can be tuned. See + :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`. + +3.2. EPT not supported or disabled +"""""""""""""""""""""""""""""""""" + + If EPT is not supported by the processor or disabled in the hypervisor, + the system is fully protected. SMT can stay enabled and L1D flushing on + VMENTER is not required. + + EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter. + +3.3. SMT and EPT supported and active +""""""""""""""""""""""""""""""""""""" + + If SMT and EPT are supported and active then various degrees of + mitigations can be employed: + + - L1D flushing on VMENTER: + + L1D flushing on VMENTER is the minimal protection requirement, but it + is only potent in combination with other mitigation methods. + + Conditional L1D flushing is the default behaviour and can be tuned. See + :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`. + + - Guest confinement: + + Confinement of guests to a single or a group of physical cores which + are not running any other processes, can reduce the attack surface + significantly, but interrupts, soft interrupts and kernel threads can + still expose valuable data to a potential attacker. See + :ref:`guest_confinement`. + + - Interrupt isolation: + + Isolating the guest CPUs from interrupts can reduce the attack surface + further, but still allows a malicious guest to explore a limited amount + of host physical memory. This can at least be used to gain knowledge + about the host address space layout. The interrupts which have a fixed + affinity to the CPUs which run the untrusted guests can depending on + the scenario still trigger soft interrupts and schedule kernel threads + which might expose valuable information. See + :ref:`interrupt_isolation`. + +The above three mitigation methods combined can provide protection to a +certain degree, but the risk of the remaining attack surface has to be +carefully analyzed. For full protection the following methods are +available: + + - Disabling SMT: + + Disabling SMT and enforcing the L1D flushing provides the maximum + amount of protection. This mitigation is not depending on any of the + above mitigation methods. + + SMT control and L1D flushing can be tuned by the command line + parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run + time with the matching sysfs control files. See :ref:`smt_control`, + :ref:`mitigation_control_command_line` and + :ref:`mitigation_control_kvm`. + + - Disabling EPT: + + Disabling EPT provides the maximum amount of protection as well. It is + not depending on any of the above mitigation methods. SMT can stay + enabled and L1D flushing is not required, but the performance impact is + significant. + + EPT can be disabled in the hypervisor via the 'kvm-intel.ept' + parameter. + +3.4. Nested virtual machines +"""""""""""""""""""""""""""" + +When nested virtualization is in use, three operating systems are involved: +the bare metal hypervisor, the nested hypervisor and the nested virtual +machine. VMENTER operations from the nested hypervisor into the nested +guest will always be processed by the bare metal hypervisor. If KVM is the +bare metal hypervisor it will: + + - Flush the L1D cache on every switch from the nested hypervisor to the + nested virtual machine, so that the nested hypervisor's secrets are not + exposed to the nested virtual machine; + + - Flush the L1D cache on every switch from the nested virtual machine to + the nested hypervisor; this is a complex operation, and flushing the L1D + cache avoids that the bare metal hypervisor's secrets are exposed to the + nested virtual machine; + + - Instruct the nested hypervisor to not perform any L1D cache flush. This + is an optimization to avoid double L1D flushing. + + +.. _default_mitigations: + +Default mitigations +------------------- + + The kernel default mitigations for vulnerable processors are: + + - PTE inversion to protect against malicious user space. This is done + unconditionally and cannot be controlled. The swap storage is limited + to ~16TB. + + - L1D conditional flushing on VMENTER when EPT is enabled for + a guest. + + The kernel does not by default enforce the disabling of SMT, which leaves + SMT systems vulnerable when running untrusted guests with EPT enabled. + + The rationale for this choice is: + + - Force disabling SMT can break existing setups, especially with + unattended updates. + + - If regular users run untrusted guests on their machine, then L1TF is + just an add on to other malware which might be embedded in an untrusted + guest, e.g. spam-bots or attacks on the local network. + + There is no technical way to prevent a user from running untrusted code + on their machines blindly. + + - It's technically extremely unlikely and from today's knowledge even + impossible that L1TF can be exploited via the most popular attack + mechanisms like JavaScript because these mechanisms have no way to + control PTEs. If this would be possible and not other mitigation would + be possible, then the default might be different. + + - The administrators of cloud and hosting setups have to carefully + analyze the risk for their scenarios and make the appropriate + mitigation choices, which might even vary across their deployed + machines and also result in other changes of their overall setup. + There is no way for the kernel to provide a sensible default for this + kind of scenarios. diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst new file mode 100644 index 000000000..2d19c9f4c --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/mds.rst @@ -0,0 +1,311 @@ +MDS - Microarchitectural Data Sampling +====================================== + +Microarchitectural Data Sampling is a hardware vulnerability which allows +unprivileged speculative access to data which is available in various CPU +internal buffers. + +Affected processors +------------------- + +This vulnerability affects a wide range of Intel processors. The +vulnerability is not present on: + + - Processors from AMD, Centaur and other non Intel vendors + + - Older processor models, where the CPU family is < 6 + + - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus) + + - Intel processors which have the ARCH_CAP_MDS_NO bit set in the + IA32_ARCH_CAPABILITIES MSR. + +Whether a processor is affected or not can be read out from the MDS +vulnerability file in sysfs. See :ref:`mds_sys_info`. + +Not all processors are affected by all variants of MDS, but the mitigation +is identical for all of them so the kernel treats them as a single +vulnerability. + +Related CVEs +------------ + +The following CVE entries are related to the MDS vulnerability: + + ============== ===== =================================================== + CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling + CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling + CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling + CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory + ============== ===== =================================================== + +Problem +------- + +When performing store, load, L1 refill operations, processors write data +into temporary microarchitectural structures (buffers). The data in the +buffer can be forwarded to load operations as an optimization. + +Under certain conditions, usually a fault/assist caused by a load +operation, data unrelated to the load memory address can be speculatively +forwarded from the buffers. Because the load operation causes a fault or +assist and its result will be discarded, the forwarded data will not cause +incorrect program execution or state changes. But a malicious operation +may be able to forward this speculative data to a disclosure gadget which +allows in turn to infer the value via a cache side channel attack. + +Because the buffers are potentially shared between Hyper-Threads cross +Hyper-Thread attacks are possible. + +Deeper technical information is available in the MDS specific x86 +architecture section: :ref:`Documentation/x86/mds.rst <mds>`. + + +Attack scenarios +---------------- + +Attacks against the MDS vulnerabilities can be mounted from malicious non +priviledged user space applications running on hosts or guest. Malicious +guest OSes can obviously mount attacks as well. + +Contrary to other speculation based vulnerabilities the MDS vulnerability +does not allow the attacker to control the memory target address. As a +consequence the attacks are purely sampling based, but as demonstrated with +the TLBleed attack samples can be postprocessed successfully. + +Web-Browsers +^^^^^^^^^^^^ + + It's unclear whether attacks through Web-Browsers are possible at + all. The exploitation through Java-Script is considered very unlikely, + but other widely used web technologies like Webassembly could possibly be + abused. + + +.. _mds_sys_info: + +MDS system information +----------------------- + +The Linux kernel provides a sysfs interface to enumerate the current MDS +status of the system: whether the system is vulnerable, and which +mitigations are active. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/mds + +The possible values in this file are: + + .. list-table:: + + * - 'Not affected' + - The processor is not vulnerable + * - 'Vulnerable' + - The processor is vulnerable, but no mitigation enabled + * - 'Vulnerable: Clear CPU buffers attempted, no microcode' + - The processor is vulnerable but microcode is not updated. + + The mitigation is enabled on a best effort basis. See :ref:`vmwerv` + * - 'Mitigation: Clear CPU buffers' + - The processor is vulnerable and the CPU buffer clearing mitigation is + enabled. + +If the processor is vulnerable then the following information is appended +to the above information: + + ======================== ============================================ + 'SMT vulnerable' SMT is enabled + 'SMT mitigated' SMT is enabled and mitigated + 'SMT disabled' SMT is disabled + 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown + ======================== ============================================ + +.. _vmwerv: + +Best effort mitigation mode +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + If the processor is vulnerable, but the availability of the microcode based + mitigation mechanism is not advertised via CPUID the kernel selects a best + effort mitigation mode. This mode invokes the mitigation instructions + without a guarantee that they clear the CPU buffers. + + This is done to address virtualization scenarios where the host has the + microcode update applied, but the hypervisor is not yet updated to expose + the CPUID to the guest. If the host has updated microcode the protection + takes effect otherwise a few cpu cycles are wasted pointlessly. + + The state in the mds sysfs file reflects this situation accordingly. + + +Mitigation mechanism +------------------------- + +The kernel detects the affected CPUs and the presence of the microcode +which is required. + +If a CPU is affected and the microcode is available, then the kernel +enables the mitigation by default. The mitigation can be controlled at boot +time via a kernel command line option. See +:ref:`mds_mitigation_control_command_line`. + +.. _cpu_buffer_clear: + +CPU buffer clearing +^^^^^^^^^^^^^^^^^^^ + + The mitigation for MDS clears the affected CPU buffers on return to user + space and when entering a guest. + + If SMT is enabled it also clears the buffers on idle entry when the CPU + is only affected by MSBDS and not any other MDS variant, because the + other variants cannot be protected against cross Hyper-Thread attacks. + + For CPUs which are only affected by MSBDS the user space, guest and idle + transition mitigations are sufficient and SMT is not affected. + +.. _virt_mechanism: + +Virtualization mitigation +^^^^^^^^^^^^^^^^^^^^^^^^^ + + The protection for host to guest transition depends on the L1TF + vulnerability of the CPU: + + - CPU is affected by L1TF: + + If the L1D flush mitigation is enabled and up to date microcode is + available, the L1D flush mitigation is automatically protecting the + guest transition. + + If the L1D flush mitigation is disabled then the MDS mitigation is + invoked explicit when the host MDS mitigation is enabled. + + For details on L1TF and virtualization see: + :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <mitigation_control_kvm>`. + + - CPU is not affected by L1TF: + + CPU buffers are flushed before entering the guest when the host MDS + mitigation is enabled. + + The resulting MDS protection matrix for the host to guest transition: + + ============ ===== ============= ============ ================= + L1TF MDS VMX-L1FLUSH Host MDS MDS-State + + Don't care No Don't care N/A Not affected + + Yes Yes Disabled Off Vulnerable + + Yes Yes Disabled Full Mitigated + + Yes Yes Enabled Don't care Mitigated + + No Yes N/A Off Vulnerable + + No Yes N/A Full Mitigated + ============ ===== ============= ============ ================= + + This only covers the host to guest transition, i.e. prevents leakage from + host to guest, but does not protect the guest internally. Guests need to + have their own protections. + +.. _xeon_phi: + +XEON PHI specific considerations +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The XEON PHI processor family is affected by MSBDS which can be exploited + cross Hyper-Threads when entering idle states. Some XEON PHI variants allow + to use MWAIT in user space (Ring 3) which opens an potential attack vector + for malicious user space. The exposure can be disabled on the kernel + command line with the 'ring3mwait=disable' command line option. + + XEON PHI is not affected by the other MDS variants and MSBDS is mitigated + before the CPU enters a idle state. As XEON PHI is not affected by L1TF + either disabling SMT is not required for full protection. + +.. _mds_smt_control: + +SMT control +^^^^^^^^^^^ + + All MDS variants except MSBDS can be attacked cross Hyper-Threads. That + means on CPUs which are affected by MFBDS or MLPDS it is necessary to + disable SMT for full protection. These are most of the affected CPUs; the + exception is XEON PHI, see :ref:`xeon_phi`. + + Disabling SMT can have a significant performance impact, but the impact + depends on the type of workloads. + + See the relevant chapter in the L1TF mitigation documentation for details: + :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`. + + +.. _mds_mitigation_control_command_line: + +Mitigation control on the kernel command line +--------------------------------------------- + +The kernel command line allows to control the MDS mitigations at boot +time with the option "mds=". The valid arguments for this option are: + + ============ ============================================================= + full If the CPU is vulnerable, enable all available mitigations + for the MDS vulnerability, CPU buffer clearing on exit to + userspace and when entering a VM. Idle transitions are + protected as well if SMT is enabled. + + It does not automatically disable SMT. + + full,nosmt The same as mds=full, with SMT disabled on vulnerable + CPUs. This is the complete mitigation. + + off Disables MDS mitigations completely. + + ============ ============================================================= + +Not specifying this option is equivalent to "mds=full". For processors +that are affected by both TAA (TSX Asynchronous Abort) and MDS, +specifying just "mds=off" without an accompanying "tsx_async_abort=off" +will have no effect as the same mitigation is used for both +vulnerabilities. + +Mitigation selection guide +-------------------------- + +1. Trusted userspace +^^^^^^^^^^^^^^^^^^^^ + + If all userspace applications are from a trusted source and do not + execute untrusted code which is supplied externally, then the mitigation + can be disabled. + + +2. Virtualization with trusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The same considerations as above versus trusted user space apply. + +3. Virtualization with untrusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The protection depends on the state of the L1TF mitigations. + See :ref:`virt_mechanism`. + + If the MDS mitigation is enabled and SMT is disabled, guest to host and + guest to guest attacks are prevented. + +.. _mds_default_mitigations: + +Default mitigations +------------------- + + The kernel default mitigations for vulnerable processors are: + + - Enable CPU buffer clearing + + The kernel does not by default enforce the disabling of SMT, which leaves + SMT systems vulnerable when running untrusted code. The same rationale as + for L1TF applies. + See :ref:`Documentation/admin-guide/hw-vuln//l1tf.rst <default_mitigations>`. diff --git a/Documentation/admin-guide/hw-vuln/multihit.rst b/Documentation/admin-guide/hw-vuln/multihit.rst new file mode 100644 index 000000000..ba9988d8b --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/multihit.rst @@ -0,0 +1,163 @@ +iTLB multihit +============= + +iTLB multihit is an erratum where some processors may incur a machine check +error, possibly resulting in an unrecoverable CPU lockup, when an +instruction fetch hits multiple entries in the instruction TLB. This can +occur when the page size is changed along with either the physical address +or cache type. A malicious guest running on a virtualized system can +exploit this erratum to perform a denial of service attack. + + +Affected processors +------------------- + +Variations of this erratum are present on most Intel Core and Xeon processor +models. The erratum is not present on: + + - non-Intel processors + + - Some Atoms (Airmont, Bonnell, Goldmont, GoldmontPlus, Saltwell, Silvermont) + + - Intel processors that have the PSCHANGE_MC_NO bit set in the + IA32_ARCH_CAPABILITIES MSR. + + +Related CVEs +------------ + +The following CVE entry is related to this issue: + + ============== ================================================= + CVE-2018-12207 Machine Check Error Avoidance on Page Size Change + ============== ================================================= + + +Problem +------- + +Privileged software, including OS and virtual machine managers (VMM), are in +charge of memory management. A key component in memory management is the control +of the page tables. Modern processors use virtual memory, a technique that creates +the illusion of a very large memory for processors. This virtual space is split +into pages of a given size. Page tables translate virtual addresses to physical +addresses. + +To reduce latency when performing a virtual to physical address translation, +processors include a structure, called TLB, that caches recent translations. +There are separate TLBs for instruction (iTLB) and data (dTLB). + +Under this errata, instructions are fetched from a linear address translated +using a 4 KB translation cached in the iTLB. Privileged software modifies the +paging structure so that the same linear address using large page size (2 MB, 4 +MB, 1 GB) with a different physical address or memory type. After the page +structure modification but before the software invalidates any iTLB entries for +the linear address, a code fetch that happens on the same linear address may +cause a machine-check error which can result in a system hang or shutdown. + + +Attack scenarios +---------------- + +Attacks against the iTLB multihit erratum can be mounted from malicious +guests in a virtualized system. + + +iTLB multihit system information +-------------------------------- + +The Linux kernel provides a sysfs interface to enumerate the current iTLB +multihit status of the system:whether the system is vulnerable and which +mitigations are active. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/itlb_multihit + +The possible values in this file are: + +.. list-table:: + + * - Not affected + - The processor is not vulnerable. + * - KVM: Mitigation: Split huge pages + - Software changes mitigate this issue. + * - KVM: Vulnerable + - The processor is vulnerable, but no mitigation enabled + + +Enumeration of the erratum +-------------------------------- + +A new bit has been allocated in the IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) msr +and will be set on CPU's which are mitigated against this issue. + + ======================================= =========== =============================== + IA32_ARCH_CAPABILITIES MSR Not present Possibly vulnerable,check model + IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '0' Likely vulnerable,check model + IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '1' Not vulnerable + ======================================= =========== =============================== + + +Mitigation mechanism +------------------------- + +This erratum can be mitigated by restricting the use of large page sizes to +non-executable pages. This forces all iTLB entries to be 4K, and removes +the possibility of multiple hits. + +In order to mitigate the vulnerability, KVM initially marks all huge pages +as non-executable. If the guest attempts to execute in one of those pages, +the page is broken down into 4K pages, which are then marked executable. + +If EPT is disabled or not available on the host, KVM is in control of TLB +flushes and the problematic situation cannot happen. However, the shadow +EPT paging mechanism used by nested virtualization is vulnerable, because +the nested guest can trigger multiple iTLB hits by modifying its own +(non-nested) page tables. For simplicity, KVM will make large pages +non-executable in all shadow paging modes. + +Mitigation control on the kernel command line and KVM - module parameter +------------------------------------------------------------------------ + +The KVM hypervisor mitigation mechanism for marking huge pages as +non-executable can be controlled with a module parameter "nx_huge_pages=". +The kernel command line allows to control the iTLB multihit mitigations at +boot time with the option "kvm.nx_huge_pages=". + +The valid arguments for these options are: + + ========== ================================================================ + force Mitigation is enabled. In this case, the mitigation implements + non-executable huge pages in Linux kernel KVM module. All huge + pages in the EPT are marked as non-executable. + If a guest attempts to execute in one of those pages, the page is + broken down into 4K pages, which are then marked executable. + + off Mitigation is disabled. + + auto Enable mitigation only if the platform is affected and the kernel + was not booted with the "mitigations=off" command line parameter. + This is the default option. + ========== ================================================================ + + +Mitigation selection guide +-------------------------- + +1. No virtualization in use +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The system is protected by the kernel unconditionally and no further + action is required. + +2. Virtualization with trusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + If the guest comes from a trusted source, you may assume that the guest will + not attempt to maliciously exploit these errata and no further action is + required. + +3. Virtualization with untrusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + If the guest comes from an untrusted source, the guest host kernel will need + to apply iTLB multihit mitigation via the kernel command line or kvm + module parameter. diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst new file mode 100644 index 000000000..9393c50b5 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst @@ -0,0 +1,246 @@ +========================================= +Processor MMIO Stale Data Vulnerabilities +========================================= + +Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O +(MMIO) vulnerabilities that can expose data. The sequences of operations for +exposing data range from simple to very complex. Because most of the +vulnerabilities require the attacker to have access to MMIO, many environments +are not affected. System environments using virtualization where MMIO access is +provided to untrusted guests may need mitigation. These vulnerabilities are +not transient execution attacks. However, these vulnerabilities may propagate +stale data into core fill buffers where the data can subsequently be inferred +by an unmitigated transient execution attack. Mitigation for these +vulnerabilities includes a combination of microcode update and software +changes, depending on the platform and usage model. Some of these mitigations +are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or +those used to mitigate Special Register Buffer Data Sampling (SRBDS). + +Data Propagators +================ +Propagators are operations that result in stale data being copied or moved from +one microarchitectural buffer or register to another. Processor MMIO Stale Data +Vulnerabilities are operations that may result in stale data being directly +read into an architectural, software-visible state or sampled from a buffer or +register. + +Fill Buffer Stale Data Propagator (FBSDP) +----------------------------------------- +Stale data may propagate from fill buffers (FB) into the non-coherent portion +of the uncore on some non-coherent writes. Fill buffer propagation by itself +does not make stale data architecturally visible. Stale data must be propagated +to a location where it is subject to reading or sampling. + +Sideband Stale Data Propagator (SSDP) +------------------------------------- +The sideband stale data propagator (SSDP) is limited to the client (including +Intel Xeon server E3) uncore implementation. The sideband response buffer is +shared by all client cores. For non-coherent reads that go to sideband +destinations, the uncore logic returns 64 bytes of data to the core, including +both requested data and unrequested stale data, from a transaction buffer and +the sideband response buffer. As a result, stale data from the sideband +response and transaction buffers may now reside in a core fill buffer. + +Primary Stale Data Propagator (PSDP) +------------------------------------ +The primary stale data propagator (PSDP) is limited to the client (including +Intel Xeon server E3) uncore implementation. Similar to the sideband response +buffer, the primary response buffer is shared by all client cores. For some +processors, MMIO primary reads will return 64 bytes of data to the core fill +buffer including both requested data and unrequested stale data. This is +similar to the sideband stale data propagator. + +Vulnerabilities +=============== +Device Register Partial Write (DRPW) (CVE-2022-21166) +----------------------------------------------------- +Some endpoint MMIO registers incorrectly handle writes that are smaller than +the register size. Instead of aborting the write or only copying the correct +subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than +specified by the write transaction may be written to the register. On +processors affected by FBSDP, this may expose stale data from the fill buffers +of the core that created the write transaction. + +Shared Buffers Data Sampling (SBDS) (CVE-2022-21125) +---------------------------------------------------- +After propagators may have moved data around the uncore and copied stale data +into client core fill buffers, processors affected by MFBDS can leak data from +the fill buffer. It is limited to the client (including Intel Xeon server E3) +uncore implementation. + +Shared Buffers Data Read (SBDR) (CVE-2022-21123) +------------------------------------------------ +It is similar to Shared Buffer Data Sampling (SBDS) except that the data is +directly read into the architectural software-visible state. It is limited to +the client (including Intel Xeon server E3) uncore implementation. + +Affected Processors +=================== +Not all the CPUs are affected by all the variants. For instance, most +processors for the server market (excluding Intel Xeon E3 processors) are +impacted by only Device Register Partial Write (DRPW). + +Below is the list of affected Intel processors [#f1]_: + + =================== ============ ========= + Common name Family_Model Steppings + =================== ============ ========= + HASWELL_X 06_3FH 2,4 + SKYLAKE_L 06_4EH 3 + BROADWELL_X 06_4FH All + SKYLAKE_X 06_55H 3,4,6,7,11 + BROADWELL_D 06_56H 3,4,5 + SKYLAKE 06_5EH 3 + ICELAKE_X 06_6AH 4,5,6 + ICELAKE_D 06_6CH 1 + ICELAKE_L 06_7EH 5 + ATOM_TREMONT_D 06_86H All + LAKEFIELD 06_8AH 1 + KABYLAKE_L 06_8EH 9 to 12 + ATOM_TREMONT 06_96H 1 + ATOM_TREMONT_L 06_9CH 0 + KABYLAKE 06_9EH 9 to 13 + COMETLAKE 06_A5H 2,3,5 + COMETLAKE_L 06_A6H 0,1 + ROCKETLAKE 06_A7H 1 + =================== ============ ========= + +If a CPU is in the affected processor list, but not affected by a variant, it +is indicated by new bits in MSR IA32_ARCH_CAPABILITIES. As described in a later +section, mitigation largely remains the same for all the variants, i.e. to +clear the CPU fill buffers via VERW instruction. + +New bits in MSRs +================ +Newer processors and microcode update on existing affected processors added new +bits to IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate +specific variants of Processor MMIO Stale Data vulnerabilities and mitigation +capability. + +MSR IA32_ARCH_CAPABILITIES +-------------------------- +Bit 13 - SBDR_SSDP_NO - When set, processor is not affected by either the + Shared Buffers Data Read (SBDR) vulnerability or the sideband stale + data propagator (SSDP). +Bit 14 - FBSDP_NO - When set, processor is not affected by the Fill Buffer + Stale Data Propagator (FBSDP). +Bit 15 - PSDP_NO - When set, processor is not affected by Primary Stale Data + Propagator (PSDP). +Bit 17 - FB_CLEAR - When set, VERW instruction will overwrite CPU fill buffer + values as part of MD_CLEAR operations. Processors that do not + enumerate MDS_NO (meaning they are affected by MDS) but that do + enumerate support for both L1D_FLUSH and MD_CLEAR implicitly enumerate + FB_CLEAR as part of their MD_CLEAR support. +Bit 18 - FB_CLEAR_CTRL - Processor supports read and write to MSR + IA32_MCU_OPT_CTRL[FB_CLEAR_DIS]. On such processors, the FB_CLEAR_DIS + bit can be set to cause the VERW instruction to not perform the + FB_CLEAR action. Not all processors that support FB_CLEAR will support + FB_CLEAR_CTRL. + +MSR IA32_MCU_OPT_CTRL +--------------------- +Bit 3 - FB_CLEAR_DIS - When set, VERW instruction does not perform the FB_CLEAR +action. This may be useful to reduce the performance impact of FB_CLEAR in +cases where system software deems it warranted (for example, when performance +is more critical, or the untrusted software has no MMIO access). Note that +FB_CLEAR_DIS has no impact on enumeration (for example, it does not change +FB_CLEAR or MD_CLEAR enumeration) and it may not be supported on all processors +that enumerate FB_CLEAR. + +Mitigation +========== +Like MDS, all variants of Processor MMIO Stale Data vulnerabilities have the +same mitigation strategy to force the CPU to clear the affected buffers before +an attacker can extract the secrets. + +This is achieved by using the otherwise unused and obsolete VERW instruction in +combination with a microcode update. The microcode clears the affected CPU +buffers when the VERW instruction is executed. + +Kernel reuses the MDS function to invoke the buffer clearing: + + mds_clear_cpu_buffers() + +On MDS affected CPUs, the kernel already invokes CPU buffer clear on +kernel/userspace, hypervisor/guest and C-state (idle) transitions. No +additional mitigation is needed on such CPUs. + +For CPUs not affected by MDS or TAA, mitigation is needed only for the attacker +with MMIO capability. Therefore, VERW is not required for kernel/userspace. For +virtualization case, VERW is only needed at VMENTER for a guest with MMIO +capability. + +Mitigation points +----------------- +Return to user space +^^^^^^^^^^^^^^^^^^^^ +Same mitigation as MDS when affected by MDS/TAA, otherwise no mitigation +needed. + +C-State transition +^^^^^^^^^^^^^^^^^^ +Control register writes by CPU during C-state transition can propagate data +from fill buffer to uncore buffers. Execute VERW before C-state transition to +clear CPU fill buffers. + +Guest entry point +^^^^^^^^^^^^^^^^^ +Same mitigation as MDS when processor is also affected by MDS/TAA, otherwise +execute VERW at VMENTER only for MMIO capable guests. On CPUs not affected by +MDS/TAA, guest without MMIO access cannot extract secrets using Processor MMIO +Stale Data vulnerabilities, so there is no need to execute VERW for such guests. + +Mitigation control on the kernel command line +--------------------------------------------- +The kernel command line allows to control the Processor MMIO Stale Data +mitigations at boot time with the option "mmio_stale_data=". The valid +arguments for this option are: + + ========== ================================================================= + full If the CPU is vulnerable, enable mitigation; CPU buffer clearing + on exit to userspace and when entering a VM. Idle transitions are + protected as well. It does not automatically disable SMT. + full,nosmt Same as full, with SMT disabled on vulnerable CPUs. This is the + complete mitigation. + off Disables mitigation completely. + ========== ================================================================= + +If the CPU is affected and mmio_stale_data=off is not supplied on the kernel +command line, then the kernel selects the appropriate mitigation. + +Mitigation status information +----------------------------- +The Linux kernel provides a sysfs interface to enumerate the current +vulnerability status of the system: whether the system is vulnerable, and +which mitigations are active. The relevant sysfs file is: + + /sys/devices/system/cpu/vulnerabilities/mmio_stale_data + +The possible values in this file are: + + .. list-table:: + + * - 'Not affected' + - The processor is not vulnerable + * - 'Vulnerable' + - The processor is vulnerable, but no mitigation enabled + * - 'Vulnerable: Clear CPU buffers attempted, no microcode' + - The processor is vulnerable, but microcode is not updated. The + mitigation is enabled on a best effort basis. + * - 'Mitigation: Clear CPU buffers' + - The processor is vulnerable and the CPU buffer clearing mitigation is + enabled. + +If the processor is vulnerable then the following information is appended to +the above information: + + ======================== =========================================== + 'SMT vulnerable' SMT is enabled + 'SMT disabled' SMT is disabled + 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown + ======================== =========================================== + +References +---------- +.. [#f1] Affected Processors + https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html diff --git a/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst b/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst new file mode 100644 index 000000000..47b1b3afa --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst @@ -0,0 +1,149 @@ +.. SPDX-License-Identifier: GPL-2.0 + +SRBDS - Special Register Buffer Data Sampling +============================================= + +SRBDS is a hardware vulnerability that allows MDS :doc:`mds` techniques to +infer values returned from special register accesses. Special register +accesses are accesses to off core registers. According to Intel's evaluation, +the special register reads that have a security expectation of privacy are +RDRAND, RDSEED and SGX EGETKEY. + +When RDRAND, RDSEED and EGETKEY instructions are used, the data is moved +to the core through the special register mechanism that is susceptible +to MDS attacks. + +Affected processors +-------------------- +Core models (desktop, mobile, Xeon-E3) that implement RDRAND and/or RDSEED may +be affected. + +A processor is affected by SRBDS if its Family_Model and stepping is +in the following list, with the exception of the listed processors +exporting MDS_NO while Intel TSX is available yet not enabled. The +latter class of processors are only affected when Intel TSX is enabled +by software using TSX_CTRL_MSR otherwise they are not affected. + + ============= ============ ======== + common name Family_Model Stepping + ============= ============ ======== + IvyBridge 06_3AH All + + Haswell 06_3CH All + Haswell_L 06_45H All + Haswell_G 06_46H All + + Broadwell_G 06_47H All + Broadwell 06_3DH All + + Skylake_L 06_4EH All + Skylake 06_5EH All + + Kabylake_L 06_8EH <= 0xC + Kabylake 06_9EH <= 0xD + ============= ============ ======== + +Related CVEs +------------ + +The following CVE entry is related to this SRBDS issue: + + ============== ===== ===================================== + CVE-2020-0543 SRBDS Special Register Buffer Data Sampling + ============== ===== ===================================== + +Attack scenarios +---------------- +An unprivileged user can extract values returned from RDRAND and RDSEED +executed on another core or sibling thread using MDS techniques. + + +Mitigation mechanism +------------------- +Intel will release microcode updates that modify the RDRAND, RDSEED, and +EGETKEY instructions to overwrite secret special register data in the shared +staging buffer before the secret data can be accessed by another logical +processor. + +During execution of the RDRAND, RDSEED, or EGETKEY instructions, off-core +accesses from other logical processors will be delayed until the special +register read is complete and the secret data in the shared staging buffer is +overwritten. + +This has three effects on performance: + +#. RDRAND, RDSEED, or EGETKEY instructions have higher latency. + +#. Executing RDRAND at the same time on multiple logical processors will be + serialized, resulting in an overall reduction in the maximum RDRAND + bandwidth. + +#. Executing RDRAND, RDSEED or EGETKEY will delay memory accesses from other + logical processors that miss their core caches, with an impact similar to + legacy locked cache-line-split accesses. + +The microcode updates provide an opt-out mechanism (RNGDS_MITG_DIS) to disable +the mitigation for RDRAND and RDSEED instructions executed outside of Intel +Software Guard Extensions (Intel SGX) enclaves. On logical processors that +disable the mitigation using this opt-out mechanism, RDRAND and RDSEED do not +take longer to execute and do not impact performance of sibling logical +processors memory accesses. The opt-out mechanism does not affect Intel SGX +enclaves (including execution of RDRAND or RDSEED inside an enclave, as well +as EGETKEY execution). + +IA32_MCU_OPT_CTRL MSR Definition +-------------------------------- +Along with the mitigation for this issue, Intel added a new thread-scope +IA32_MCU_OPT_CTRL MSR, (address 0x123). The presence of this MSR and +RNGDS_MITG_DIS (bit 0) is enumerated by CPUID.(EAX=07H,ECX=0).EDX[SRBDS_CTRL = +9]==1. This MSR is introduced through the microcode update. + +Setting IA32_MCU_OPT_CTRL[0] (RNGDS_MITG_DIS) to 1 for a logical processor +disables the mitigation for RDRAND and RDSEED executed outside of an Intel SGX +enclave on that logical processor. Opting out of the mitigation for a +particular logical processor does not affect the RDRAND and RDSEED mitigations +for other logical processors. + +Note that inside of an Intel SGX enclave, the mitigation is applied regardless +of the value of RNGDS_MITG_DS. + +Mitigation control on the kernel command line +--------------------------------------------- +The kernel command line allows control over the SRBDS mitigation at boot time +with the option "srbds=". The option for this is: + + ============= ============================================================= + off This option disables SRBDS mitigation for RDRAND and RDSEED on + affected platforms. + ============= ============================================================= + +SRBDS System Information +----------------------- +The Linux kernel provides vulnerability status information through sysfs. For +SRBDS this can be accessed by the following sysfs file: +/sys/devices/system/cpu/vulnerabilities/srbds + +The possible values contained in this file are: + + ============================== ============================================= + Not affected Processor not vulnerable + Vulnerable Processor vulnerable and mitigation disabled + Vulnerable: No microcode Processor vulnerable and microcode is missing + mitigation + Mitigation: Microcode Processor is vulnerable and mitigation is in + effect. + Mitigation: TSX disabled Processor is only vulnerable when TSX is + enabled while this system was booted with TSX + disabled. + Unknown: Dependent on + hypervisor status Running on virtual guest processor that is + affected but with no way to know if host + processor is mitigated or vulnerable. + ============================== ============================================= + +SRBDS Default mitigation +------------------------ +This new microcode serializes processor access during execution of RDRAND, +RDSEED ensures that the shared buffer is overwritten before it is released for +reuse. Use the "srbds=off" kernel command line to disable the mitigation for +RDRAND and RDSEED. diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst new file mode 100644 index 000000000..6bd97cd50 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/spectre.rst @@ -0,0 +1,785 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Spectre Side Channels +===================== + +Spectre is a class of side channel attacks that exploit branch prediction +and speculative execution on modern CPUs to read memory, possibly +bypassing access controls. Speculative execution side channel exploits +do not modify memory but attempt to infer privileged data in the memory. + +This document covers Spectre variant 1 and Spectre variant 2. + +Affected processors +------------------- + +Speculative execution side channel methods affect a wide range of modern +high performance processors, since most modern high speed processors +use branch prediction and speculative execution. + +The following CPUs are vulnerable: + + - Intel Core, Atom, Pentium, and Xeon processors + + - AMD Phenom, EPYC, and Zen processors + + - IBM POWER and zSeries processors + + - Higher end ARM processors + + - Apple CPUs + + - Higher end MIPS CPUs + + - Likely most other high performance CPUs. Contact your CPU vendor for details. + +Whether a processor is affected or not can be read out from the Spectre +vulnerability files in sysfs. See :ref:`spectre_sys_info`. + +Related CVEs +------------ + +The following CVE entries describe Spectre variants: + + ============= ======================= ========================== + CVE-2017-5753 Bounds check bypass Spectre variant 1 + CVE-2017-5715 Branch target injection Spectre variant 2 + CVE-2019-1125 Spectre v1 swapgs Spectre variant 1 (swapgs) + ============= ======================= ========================== + +Problem +------- + +CPUs use speculative operations to improve performance. That may leave +traces of memory accesses or computations in the processor's caches, +buffers, and branch predictors. Malicious software may be able to +influence the speculative execution paths, and then use the side effects +of the speculative execution in the CPUs' caches and buffers to infer +privileged data touched during the speculative execution. + +Spectre variant 1 attacks take advantage of speculative execution of +conditional branches, while Spectre variant 2 attacks use speculative +execution of indirect branches to leak privileged memory. +See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[6] <spec_ref6>` +:ref:`[7] <spec_ref7>` :ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`. + +Spectre variant 1 (Bounds Check Bypass) +--------------------------------------- + +The bounds check bypass attack :ref:`[2] <spec_ref2>` takes advantage +of speculative execution that bypasses conditional branch instructions +used for memory access bounds check (e.g. checking if the index of an +array results in memory access within a valid range). This results in +memory accesses to invalid memory (with out-of-bound index) that are +done speculatively before validation checks resolve. Such speculative +memory accesses can leave side effects, creating side channels which +leak information to the attacker. + +There are some extensions of Spectre variant 1 attacks for reading data +over the network, see :ref:`[12] <spec_ref12>`. However such attacks +are difficult, low bandwidth, fragile, and are considered low risk. + +Note that, despite "Bounds Check Bypass" name, Spectre variant 1 is not +only about user-controlled array bounds checks. It can affect any +conditional checks. The kernel entry code interrupt, exception, and NMI +handlers all have conditional swapgs checks. Those may be problematic +in the context of Spectre v1, as kernel code can speculatively run with +a user GS. + +Spectre variant 2 (Branch Target Injection) +------------------------------------------- + +The branch target injection attack takes advantage of speculative +execution of indirect branches :ref:`[3] <spec_ref3>`. The indirect +branch predictors inside the processor used to guess the target of +indirect branches can be influenced by an attacker, causing gadget code +to be speculatively executed, thus exposing sensitive data touched by +the victim. The side effects left in the CPU's caches during speculative +execution can be measured to infer data values. + +.. _poison_btb: + +In Spectre variant 2 attacks, the attacker can steer speculative indirect +branches in the victim to gadget code by poisoning the branch target +buffer of a CPU used for predicting indirect branch addresses. Such +poisoning could be done by indirect branching into existing code, +with the address offset of the indirect branch under the attacker's +control. Since the branch prediction on impacted hardware does not +fully disambiguate branch address and uses the offset for prediction, +this could cause privileged code's indirect branch to jump to a gadget +code with the same offset. + +The most useful gadgets take an attacker-controlled input parameter (such +as a register value) so that the memory read can be controlled. Gadgets +without input parameters might be possible, but the attacker would have +very little control over what memory can be read, reducing the risk of +the attack revealing useful data. + +One other variant 2 attack vector is for the attacker to poison the +return stack buffer (RSB) :ref:`[13] <spec_ref13>` to cause speculative +subroutine return instruction execution to go to a gadget. An attacker's +imbalanced subroutine call instructions might "poison" entries in the +return stack buffer which are later consumed by a victim's subroutine +return instructions. This attack can be mitigated by flushing the return +stack buffer on context switch, or virtual machine (VM) exit. + +On systems with simultaneous multi-threading (SMT), attacks are possible +from the sibling thread, as level 1 cache and branch target buffer +(BTB) may be shared between hardware threads in a CPU core. A malicious +program running on the sibling thread may influence its peer's BTB to +steer its indirect branch speculations to gadget code, and measure the +speculative execution's side effects left in level 1 cache to infer the +victim's data. + +Yet another variant 2 attack vector is for the attacker to poison the +Branch History Buffer (BHB) to speculatively steer an indirect branch +to a specific Branch Target Buffer (BTB) entry, even if the entry isn't +associated with the source address of the indirect branch. Specifically, +the BHB might be shared across privilege levels even in the presence of +Enhanced IBRS. + +Currently the only known real-world BHB attack vector is via +unprivileged eBPF. Therefore, it's highly recommended to not enable +unprivileged eBPF, especially when eIBRS is used (without retpolines). +For a full mitigation against BHB attacks, it's recommended to use +retpolines (or eIBRS combined with retpolines). + +Attack scenarios +---------------- + +The following list of attack scenarios have been anticipated, but may +not cover all possible attack vectors. + +1. A user process attacking the kernel +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Spectre variant 1 +~~~~~~~~~~~~~~~~~ + + The attacker passes a parameter to the kernel via a register or + via a known address in memory during a syscall. Such parameter may + be used later by the kernel as an index to an array or to derive + a pointer for a Spectre variant 1 attack. The index or pointer + is invalid, but bound checks are bypassed in the code branch taken + for speculative execution. This could cause privileged memory to be + accessed and leaked. + + For kernel code that has been identified where data pointers could + potentially be influenced for Spectre attacks, new "nospec" accessor + macros are used to prevent speculative loading of data. + +Spectre variant 1 (swapgs) +~~~~~~~~~~~~~~~~~~~~~~~~~~ + + An attacker can train the branch predictor to speculatively skip the + swapgs path for an interrupt or exception. If they initialize + the GS register to a user-space value, if the swapgs is speculatively + skipped, subsequent GS-related percpu accesses in the speculation + window will be done with the attacker-controlled GS value. This + could cause privileged memory to be accessed and leaked. + + For example: + + :: + + if (coming from user space) + swapgs + mov %gs:<percpu_offset>, %reg + mov (%reg), %reg1 + + When coming from user space, the CPU can speculatively skip the + swapgs, and then do a speculative percpu load using the user GS + value. So the user can speculatively force a read of any kernel + value. If a gadget exists which uses the percpu value as an address + in another load/store, then the contents of the kernel value may + become visible via an L1 side channel attack. + + A similar attack exists when coming from kernel space. The CPU can + speculatively do the swapgs, causing the user GS to get used for the + rest of the speculative window. + +Spectre variant 2 +~~~~~~~~~~~~~~~~~ + + A spectre variant 2 attacker can :ref:`poison <poison_btb>` the branch + target buffer (BTB) before issuing syscall to launch an attack. + After entering the kernel, the kernel could use the poisoned branch + target buffer on indirect jump and jump to gadget code in speculative + execution. + + If an attacker tries to control the memory addresses leaked during + speculative execution, he would also need to pass a parameter to the + gadget, either through a register or a known address in memory. After + the gadget has executed, he can measure the side effect. + + The kernel can protect itself against consuming poisoned branch + target buffer entries by using return trampolines (also known as + "retpoline") :ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` for all + indirect branches. Return trampolines trap speculative execution paths + to prevent jumping to gadget code during speculative execution. + x86 CPUs with Enhanced Indirect Branch Restricted Speculation + (Enhanced IBRS) available in hardware should use the feature to + mitigate Spectre variant 2 instead of retpoline. Enhanced IBRS is + more efficient than retpoline. + + There may be gadget code in firmware which could be exploited with + Spectre variant 2 attack by a rogue user process. To mitigate such + attacks on x86, Indirect Branch Restricted Speculation (IBRS) feature + is turned on before the kernel invokes any firmware code. + +2. A user process attacking another user process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + A malicious user process can try to attack another user process, + either via a context switch on the same hardware thread, or from the + sibling hyperthread sharing a physical processor core on simultaneous + multi-threading (SMT) system. + + Spectre variant 1 attacks generally require passing parameters + between the processes, which needs a data passing relationship, such + as remote procedure calls (RPC). Those parameters are used in gadget + code to derive invalid data pointers accessing privileged memory in + the attacked process. + + Spectre variant 2 attacks can be launched from a rogue process by + :ref:`poisoning <poison_btb>` the branch target buffer. This can + influence the indirect branch targets for a victim process that either + runs later on the same hardware thread, or running concurrently on + a sibling hardware thread sharing the same physical core. + + A user process can protect itself against Spectre variant 2 attacks + by using the prctl() syscall to disable indirect branch speculation + for itself. An administrator can also cordon off an unsafe process + from polluting the branch target buffer by disabling the process's + indirect branch speculation. This comes with a performance cost + from not using indirect branch speculation and clearing the branch + target buffer. When SMT is enabled on x86, for a process that has + indirect branch speculation disabled, Single Threaded Indirect Branch + Predictors (STIBP) :ref:`[4] <spec_ref4>` are turned on to prevent the + sibling thread from controlling branch target buffer. In addition, + the Indirect Branch Prediction Barrier (IBPB) is issued to clear the + branch target buffer when context switching to and from such process. + + On x86, the return stack buffer is stuffed on context switch. + This prevents the branch target buffer from being used for branch + prediction when the return stack buffer underflows while switching to + a deeper call stack. Any poisoned entries in the return stack buffer + left by the previous process will also be cleared. + + User programs should use address space randomization to make attacks + more difficult (Set /proc/sys/kernel/randomize_va_space = 1 or 2). + +3. A virtualized guest attacking the host +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The attack mechanism is similar to how user processes attack the + kernel. The kernel is entered via hyper-calls or other virtualization + exit paths. + + For Spectre variant 1 attacks, rogue guests can pass parameters + (e.g. in registers) via hyper-calls to derive invalid pointers to + speculate into privileged memory after entering the kernel. For places + where such kernel code has been identified, nospec accessor macros + are used to stop speculative memory access. + + For Spectre variant 2 attacks, rogue guests can :ref:`poison + <poison_btb>` the branch target buffer or return stack buffer, causing + the kernel to jump to gadget code in the speculative execution paths. + + To mitigate variant 2, the host kernel can use return trampolines + for indirect branches to bypass the poisoned branch target buffer, + and flushing the return stack buffer on VM exit. This prevents rogue + guests from affecting indirect branching in the host kernel. + + To protect host processes from rogue guests, host processes can have + indirect branch speculation disabled via prctl(). The branch target + buffer is cleared before context switching to such processes. + +4. A virtualized guest attacking other guest +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + A rogue guest may attack another guest to get data accessible by the + other guest. + + Spectre variant 1 attacks are possible if parameters can be passed + between guests. This may be done via mechanisms such as shared memory + or message passing. Such parameters could be used to derive data + pointers to privileged data in guest. The privileged data could be + accessed by gadget code in the victim's speculation paths. + + Spectre variant 2 attacks can be launched from a rogue guest by + :ref:`poisoning <poison_btb>` the branch target buffer or the return + stack buffer. Such poisoned entries could be used to influence + speculation execution paths in the victim guest. + + Linux kernel mitigates attacks to other guests running in the same + CPU hardware thread by flushing the return stack buffer on VM exit, + and clearing the branch target buffer before switching to a new guest. + + If SMT is used, Spectre variant 2 attacks from an untrusted guest + in the sibling hyperthread can be mitigated by the administrator, + by turning off the unsafe guest's indirect branch speculation via + prctl(). A guest can also protect itself by turning on microcode + based mitigations (such as IBPB or STIBP on x86) within the guest. + +.. _spectre_sys_info: + +Spectre system information +-------------------------- + +The Linux kernel provides a sysfs interface to enumerate the current +mitigation status of the system for Spectre: whether the system is +vulnerable, and which mitigations are active. + +The sysfs file showing Spectre variant 1 mitigation status is: + + /sys/devices/system/cpu/vulnerabilities/spectre_v1 + +The possible values in this file are: + + .. list-table:: + + * - 'Not affected' + - The processor is not vulnerable. + * - 'Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers' + - The swapgs protections are disabled; otherwise it has + protection in the kernel on a case by case base with explicit + pointer sanitation and usercopy LFENCE barriers. + * - 'Mitigation: usercopy/swapgs barriers and __user pointer sanitization' + - Protection in the kernel on a case by case base with explicit + pointer sanitation, usercopy LFENCE barriers, and swapgs LFENCE + barriers. + +However, the protections are put in place on a case by case basis, +and there is no guarantee that all possible attack vectors for Spectre +variant 1 are covered. + +The spectre_v2 kernel file reports if the kernel has been compiled with +retpoline mitigation or if the CPU has hardware mitigation, and if the +CPU has support for additional process-specific mitigation. + +This file also reports CPU features enabled by microcode to mitigate +attack between user processes: + +1. Indirect Branch Prediction Barrier (IBPB) to add additional + isolation between processes of different users. +2. Single Thread Indirect Branch Predictors (STIBP) to add additional + isolation between CPU threads running on the same core. + +These CPU features may impact performance when used and can be enabled +per process on a case-by-case base. + +The sysfs file showing Spectre variant 2 mitigation status is: + + /sys/devices/system/cpu/vulnerabilities/spectre_v2 + +The possible values in this file are: + + - Kernel status: + + ======================================== ================================= + 'Not affected' The processor is not vulnerable + 'Mitigation: None' Vulnerable, no mitigation + 'Mitigation: Retpolines' Use Retpoline thunks + 'Mitigation: LFENCE' Use LFENCE instructions + 'Mitigation: Enhanced IBRS' Hardware-focused mitigation + 'Mitigation: Enhanced IBRS + Retpolines' Hardware-focused + Retpolines + 'Mitigation: Enhanced IBRS + LFENCE' Hardware-focused + LFENCE + ======================================== ================================= + + - Firmware status: Show if Indirect Branch Restricted Speculation (IBRS) is + used to protect against Spectre variant 2 attacks when calling firmware (x86 only). + + ========== ============================================================= + 'IBRS_FW' Protection against user program attacks when calling firmware + ========== ============================================================= + + - Indirect branch prediction barrier (IBPB) status for protection between + processes of different users. This feature can be controlled through + prctl() per process, or through kernel command line options. This is + an x86 only feature. For more details see below. + + =================== ======================================================== + 'IBPB: disabled' IBPB unused + 'IBPB: always-on' Use IBPB on all tasks + 'IBPB: conditional' Use IBPB on SECCOMP or indirect branch restricted tasks + =================== ======================================================== + + - Single threaded indirect branch prediction (STIBP) status for protection + between different hyper threads. This feature can be controlled through + prctl per process, or through kernel command line options. This is x86 + only feature. For more details see below. + + ==================== ======================================================== + 'STIBP: disabled' STIBP unused + 'STIBP: forced' Use STIBP on all tasks + 'STIBP: conditional' Use STIBP on SECCOMP or indirect branch restricted tasks + ==================== ======================================================== + + - Return stack buffer (RSB) protection status: + + ============= =========================================== + 'RSB filling' Protection of RSB on context switch enabled + ============= =========================================== + +Full mitigation might require a microcode update from the CPU +vendor. When the necessary microcode is not available, the kernel will +report vulnerability. + +Turning on mitigation for Spectre variant 1 and Spectre variant 2 +----------------------------------------------------------------- + +1. Kernel mitigation +^^^^^^^^^^^^^^^^^^^^ + +Spectre variant 1 +~~~~~~~~~~~~~~~~~ + + For the Spectre variant 1, vulnerable kernel code (as determined + by code audit or scanning tools) is annotated on a case by case + basis to use nospec accessor macros for bounds clipping :ref:`[2] + <spec_ref2>` to avoid any usable disclosure gadgets. However, it may + not cover all attack vectors for Spectre variant 1. + + Copy-from-user code has an LFENCE barrier to prevent the access_ok() + check from being mis-speculated. The barrier is done by the + barrier_nospec() macro. + + For the swapgs variant of Spectre variant 1, LFENCE barriers are + added to interrupt, exception and NMI entry where needed. These + barriers are done by the FENCE_SWAPGS_KERNEL_ENTRY and + FENCE_SWAPGS_USER_ENTRY macros. + +Spectre variant 2 +~~~~~~~~~~~~~~~~~ + + For Spectre variant 2 mitigation, the compiler turns indirect calls or + jumps in the kernel into equivalent return trampolines (retpolines) + :ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` to go to the target + addresses. Speculative execution paths under retpolines are trapped + in an infinite loop to prevent any speculative execution jumping to + a gadget. + + To turn on retpoline mitigation on a vulnerable CPU, the kernel + needs to be compiled with a gcc compiler that supports the + -mindirect-branch=thunk-extern -mindirect-branch-register options. + If the kernel is compiled with a Clang compiler, the compiler needs + to support -mretpoline-external-thunk option. The kernel config + CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with + the latest updated microcode. + + On Intel Skylake-era systems the mitigation covers most, but not all, + cases. See :ref:`[3] <spec_ref3>` for more details. + + On CPUs with hardware mitigation for Spectre variant 2 (e.g. Enhanced + IBRS on x86), retpoline is automatically disabled at run time. + + The retpoline mitigation is turned on by default on vulnerable + CPUs. It can be forced on or off by the administrator + via the kernel command line and sysfs control files. See + :ref:`spectre_mitigation_control_command_line`. + + On x86, indirect branch restricted speculation is turned on by default + before invoking any firmware code to prevent Spectre variant 2 exploits + using the firmware. + + Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y + and CONFIG_SLAB_FREELIST_RANDOM=y in the kernel configuration) makes + attacks on the kernel generally more difficult. + +2. User program mitigation +^^^^^^^^^^^^^^^^^^^^^^^^^^ + + User programs can mitigate Spectre variant 1 using LFENCE or "bounds + clipping". For more details see :ref:`[2] <spec_ref2>`. + + For Spectre variant 2 mitigation, individual user programs + can be compiled with return trampolines for indirect branches. + This protects them from consuming poisoned entries in the branch + target buffer left by malicious software. Alternatively, the + programs can disable their indirect branch speculation via prctl() + (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`). + On x86, this will turn on STIBP to guard against attacks from the + sibling thread when the user program is running, and use IBPB to + flush the branch target buffer when switching to/from the program. + + Restricting indirect branch speculation on a user program will + also prevent the program from launching a variant 2 attack + on x86. All sand-boxed SECCOMP programs have indirect branch + speculation restricted by default. Administrators can change + that behavior via the kernel command line and sysfs control files. + See :ref:`spectre_mitigation_control_command_line`. + + Programs that disable their indirect branch speculation will have + more overhead and run slower. + + User programs should use address space randomization + (/proc/sys/kernel/randomize_va_space = 1 or 2) to make attacks more + difficult. + +3. VM mitigation +^^^^^^^^^^^^^^^^ + + Within the kernel, Spectre variant 1 attacks from rogue guests are + mitigated on a case by case basis in VM exit paths. Vulnerable code + uses nospec accessor macros for "bounds clipping", to avoid any + usable disclosure gadgets. However, this may not cover all variant + 1 attack vectors. + + For Spectre variant 2 attacks from rogue guests to the kernel, the + Linux kernel uses retpoline or Enhanced IBRS to prevent consumption of + poisoned entries in branch target buffer left by rogue guests. It also + flushes the return stack buffer on every VM exit to prevent a return + stack buffer underflow so poisoned branch target buffer could be used, + or attacker guests leaving poisoned entries in the return stack buffer. + + To mitigate guest-to-guest attacks in the same CPU hardware thread, + the branch target buffer is sanitized by flushing before switching + to a new guest on a CPU. + + The above mitigations are turned on by default on vulnerable CPUs. + + To mitigate guest-to-guest attacks from sibling thread when SMT is + in use, an untrusted guest running in the sibling thread can have + its indirect branch speculation disabled by administrator via prctl(). + + The kernel also allows guests to use any microcode based mitigation + they choose to use (such as IBPB or STIBP on x86) to protect themselves. + +.. _spectre_mitigation_control_command_line: + +Mitigation control on the kernel command line +--------------------------------------------- + +Spectre variant 2 mitigation can be disabled or force enabled at the +kernel command line. + + nospectre_v1 + + [X86,PPC] Disable mitigations for Spectre Variant 1 + (bounds check bypass). With this option data leaks are + possible in the system. + + nospectre_v2 + + [X86] Disable all mitigations for the Spectre variant 2 + (indirect branch prediction) vulnerability. System may + allow data leaks with this option, which is equivalent + to spectre_v2=off. + + + spectre_v2= + + [X86] Control mitigation of Spectre variant 2 + (indirect branch speculation) vulnerability. + The default operation protects the kernel from + user space attacks. + + on + unconditionally enable, implies + spectre_v2_user=on + off + unconditionally disable, implies + spectre_v2_user=off + auto + kernel detects whether your CPU model is + vulnerable + + Selecting 'on' will, and 'auto' may, choose a + mitigation method at run time according to the + CPU, the available microcode, the setting of the + CONFIG_RETPOLINE configuration option, and the + compiler with which the kernel was built. + + Selecting 'on' will also enable the mitigation + against user space to user space task attacks. + + Selecting 'off' will disable both the kernel and + the user space protections. + + Specific mitigations can also be selected manually: + + retpoline auto pick between generic,lfence + retpoline,generic Retpolines + retpoline,lfence LFENCE; indirect branch + retpoline,amd alias for retpoline,lfence + eibrs enhanced IBRS + eibrs,retpoline enhanced IBRS + Retpolines + eibrs,lfence enhanced IBRS + LFENCE + + Not specifying this option is equivalent to + spectre_v2=auto. + +For user space mitigation: + + spectre_v2_user= + + [X86] Control mitigation of Spectre variant 2 + (indirect branch speculation) vulnerability between + user space tasks + + on + Unconditionally enable mitigations. Is + enforced by spectre_v2=on + + off + Unconditionally disable mitigations. Is + enforced by spectre_v2=off + + prctl + Indirect branch speculation is enabled, + but mitigation can be enabled via prctl + per thread. The mitigation control state + is inherited on fork. + + prctl,ibpb + Like "prctl" above, but only STIBP is + controlled per thread. IBPB is issued + always when switching between different user + space processes. + + seccomp + Same as "prctl" above, but all seccomp + threads will enable the mitigation unless + they explicitly opt out. + + seccomp,ibpb + Like "seccomp" above, but only STIBP is + controlled per thread. IBPB is issued + always when switching between different + user space processes. + + auto + Kernel selects the mitigation depending on + the available CPU features and vulnerability. + + Default mitigation: + If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl" + + Not specifying this option is equivalent to + spectre_v2_user=auto. + + In general the kernel by default selects + reasonable mitigations for the current CPU. To + disable Spectre variant 2 mitigations, boot with + spectre_v2=off. Spectre variant 1 mitigations + cannot be disabled. + +Mitigation selection guide +-------------------------- + +1. Trusted userspace +^^^^^^^^^^^^^^^^^^^^ + + If all userspace applications are from trusted sources and do not + execute externally supplied untrusted code, then the mitigations can + be disabled. + +2. Protect sensitive programs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + For security-sensitive programs that have secrets (e.g. crypto + keys), protection against Spectre variant 2 can be put in place by + disabling indirect branch speculation when the program is running + (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`). + +3. Sandbox untrusted programs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + Untrusted programs that could be a source of attacks can be cordoned + off by disabling their indirect branch speculation when they are run + (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`). + This prevents untrusted programs from polluting the branch target + buffer. All programs running in SECCOMP sandboxes have indirect + branch speculation restricted by default. This behavior can be + changed via the kernel command line and sysfs control files. See + :ref:`spectre_mitigation_control_command_line`. + +3. High security mode +^^^^^^^^^^^^^^^^^^^^^ + + All Spectre variant 2 mitigations can be forced on + at boot time for all programs (See the "on" option in + :ref:`spectre_mitigation_control_command_line`). This will add + overhead as indirect branch speculations for all programs will be + restricted. + + On x86, branch target buffer will be flushed with IBPB when switching + to a new program. STIBP is left on all the time to protect programs + against variant 2 attacks originating from programs running on + sibling threads. + + Alternatively, STIBP can be used only when running programs + whose indirect branch speculation is explicitly disabled, + while IBPB is still used all the time when switching to a new + program to clear the branch target buffer (See "ibpb" option in + :ref:`spectre_mitigation_control_command_line`). This "ibpb" option + has less performance cost than the "on" option, which leaves STIBP + on all the time. + +References on Spectre +--------------------- + +Intel white papers: + +.. _spec_ref1: + +[1] `Intel analysis of speculative execution side channels <https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf>`_. + +.. _spec_ref2: + +[2] `Bounds check bypass <https://software.intel.com/security-software-guidance/software-guidance/bounds-check-bypass>`_. + +.. _spec_ref3: + +[3] `Deep dive: Retpoline: A branch target injection mitigation <https://software.intel.com/security-software-guidance/insights/deep-dive-retpoline-branch-target-injection-mitigation>`_. + +.. _spec_ref4: + +[4] `Deep Dive: Single Thread Indirect Branch Predictors <https://software.intel.com/security-software-guidance/insights/deep-dive-single-thread-indirect-branch-predictors>`_. + +AMD white papers: + +.. _spec_ref5: + +[5] `AMD64 technology indirect branch control extension <https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf>`_. + +.. _spec_ref6: + +[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf>`_. + +ARM white papers: + +.. _spec_ref7: + +[7] `Cache speculation side-channels <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/download-the-whitepaper>`_. + +.. _spec_ref8: + +[8] `Cache speculation issues update <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/latest-updates/cache-speculation-issues-update>`_. + +Google white paper: + +.. _spec_ref9: + +[9] `Retpoline: a software construct for preventing branch-target-injection <https://support.google.com/faqs/answer/7625886>`_. + +MIPS white paper: + +.. _spec_ref10: + +[10] `MIPS: response on speculative execution and side channel vulnerabilities <https://www.mips.com/blog/mips-response-on-speculative-execution-and-side-channel-vulnerabilities/>`_. + +Academic papers: + +.. _spec_ref11: + +[11] `Spectre Attacks: Exploiting Speculative Execution <https://spectreattack.com/spectre.pdf>`_. + +.. _spec_ref12: + +[12] `NetSpectre: Read Arbitrary Memory over Network <https://arxiv.org/abs/1807.10535>`_. + +.. _spec_ref13: + +[13] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://www.usenix.org/system/files/conference/woot18/woot18-paper-koruyeh.pdf>`_. diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst new file mode 100644 index 000000000..af6865b82 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst @@ -0,0 +1,279 @@ +.. SPDX-License-Identifier: GPL-2.0 + +TAA - TSX Asynchronous Abort +====================================== + +TAA is a hardware vulnerability that allows unprivileged speculative access to +data which is available in various CPU internal buffers by using asynchronous +aborts within an Intel TSX transactional region. + +Affected processors +------------------- + +This vulnerability only affects Intel processors that support Intel +Transactional Synchronization Extensions (TSX) when the TAA_NO bit (bit 8) +is 0 in the IA32_ARCH_CAPABILITIES MSR. On processors where the MDS_NO bit +(bit 5) is 0 in the IA32_ARCH_CAPABILITIES MSR, the existing MDS mitigations +also mitigate against TAA. + +Whether a processor is affected or not can be read out from the TAA +vulnerability file in sysfs. See :ref:`tsx_async_abort_sys_info`. + +Related CVEs +------------ + +The following CVE entry is related to this TAA issue: + + ============== ===== =================================================== + CVE-2019-11135 TAA TSX Asynchronous Abort (TAA) condition on some + microprocessors utilizing speculative execution may + allow an authenticated user to potentially enable + information disclosure via a side channel with + local access. + ============== ===== =================================================== + +Problem +------- + +When performing store, load or L1 refill operations, processors write +data into temporary microarchitectural structures (buffers). The data in +those buffers can be forwarded to load operations as an optimization. + +Intel TSX is an extension to the x86 instruction set architecture that adds +hardware transactional memory support to improve performance of multi-threaded +software. TSX lets the processor expose and exploit concurrency hidden in an +application due to dynamically avoiding unnecessary synchronization. + +TSX supports atomic memory transactions that are either committed (success) or +aborted. During an abort, operations that happened within the transactional region +are rolled back. An asynchronous abort takes place, among other options, when a +different thread accesses a cache line that is also used within the transactional +region when that access might lead to a data race. + +Immediately after an uncompleted asynchronous abort, certain speculatively +executed loads may read data from those internal buffers and pass it to dependent +operations. This can be then used to infer the value via a cache side channel +attack. + +Because the buffers are potentially shared between Hyper-Threads cross +Hyper-Thread attacks are possible. + +The victim of a malicious actor does not need to make use of TSX. Only the +attacker needs to begin a TSX transaction and raise an asynchronous abort +which in turn potenitally leaks data stored in the buffers. + +More detailed technical information is available in the TAA specific x86 +architecture section: :ref:`Documentation/x86/tsx_async_abort.rst <tsx_async_abort>`. + + +Attack scenarios +---------------- + +Attacks against the TAA vulnerability can be implemented from unprivileged +applications running on hosts or guests. + +As for MDS, the attacker has no control over the memory addresses that can +be leaked. Only the victim is responsible for bringing data to the CPU. As +a result, the malicious actor has to sample as much data as possible and +then postprocess it to try to infer any useful information from it. + +A potential attacker only has read access to the data. Also, there is no direct +privilege escalation by using this technique. + + +.. _tsx_async_abort_sys_info: + +TAA system information +----------------------- + +The Linux kernel provides a sysfs interface to enumerate the current TAA status +of mitigated systems. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/tsx_async_abort + +The possible values in this file are: + +.. list-table:: + + * - 'Vulnerable' + - The CPU is affected by this vulnerability and the microcode and kernel mitigation are not applied. + * - 'Vulnerable: Clear CPU buffers attempted, no microcode' + - The system tries to clear the buffers but the microcode might not support the operation. + * - 'Mitigation: Clear CPU buffers' + - The microcode has been updated to clear the buffers. TSX is still enabled. + * - 'Mitigation: TSX disabled' + - TSX is disabled. + * - 'Not affected' + - The CPU is not affected by this issue. + +.. _ucode_needed: + +Best effort mitigation mode +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If the processor is vulnerable, but the availability of the microcode-based +mitigation mechanism is not advertised via CPUID the kernel selects a best +effort mitigation mode. This mode invokes the mitigation instructions +without a guarantee that they clear the CPU buffers. + +This is done to address virtualization scenarios where the host has the +microcode update applied, but the hypervisor is not yet updated to expose the +CPUID to the guest. If the host has updated microcode the protection takes +effect; otherwise a few CPU cycles are wasted pointlessly. + +The state in the tsx_async_abort sysfs file reflects this situation +accordingly. + + +Mitigation mechanism +-------------------- + +The kernel detects the affected CPUs and the presence of the microcode which is +required. If a CPU is affected and the microcode is available, then the kernel +enables the mitigation by default. + + +The mitigation can be controlled at boot time via a kernel command line option. +See :ref:`taa_mitigation_control_command_line`. + +.. _virt_mechanism: + +Virtualization mitigation +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Affected systems where the host has TAA microcode and TAA is mitigated by +having disabled TSX previously, are not vulnerable regardless of the status +of the VMs. + +In all other cases, if the host either does not have the TAA microcode or +the kernel is not mitigated, the system might be vulnerable. + + +.. _taa_mitigation_control_command_line: + +Mitigation control on the kernel command line +--------------------------------------------- + +The kernel command line allows to control the TAA mitigations at boot time with +the option "tsx_async_abort=". The valid arguments for this option are: + + ============ ============================================================= + off This option disables the TAA mitigation on affected platforms. + If the system has TSX enabled (see next parameter) and the CPU + is affected, the system is vulnerable. + + full TAA mitigation is enabled. If TSX is enabled, on an affected + system it will clear CPU buffers on ring transitions. On + systems which are MDS-affected and deploy MDS mitigation, + TAA is also mitigated. Specifying this option on those + systems will have no effect. + + full,nosmt The same as tsx_async_abort=full, with SMT disabled on + vulnerable CPUs that have TSX enabled. This is the complete + mitigation. When TSX is disabled, SMT is not disabled because + CPU is not vulnerable to cross-thread TAA attacks. + ============ ============================================================= + +Not specifying this option is equivalent to "tsx_async_abort=full". For +processors that are affected by both TAA and MDS, specifying just +"tsx_async_abort=off" without an accompanying "mds=off" will have no +effect as the same mitigation is used for both vulnerabilities. + +The kernel command line also allows to control the TSX feature using the +parameter "tsx=" on CPUs which support TSX control. MSR_IA32_TSX_CTRL is used +to control the TSX feature and the enumeration of the TSX feature bits (RTM +and HLE) in CPUID. + +The valid options are: + + ============ ============================================================= + off Disables TSX on the system. + + Note that this option takes effect only on newer CPUs which are + not vulnerable to MDS, i.e., have MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 + and which get the new IA32_TSX_CTRL MSR through a microcode + update. This new MSR allows for the reliable deactivation of + the TSX functionality. + + on Enables TSX. + + Although there are mitigations for all known security + vulnerabilities, TSX has been known to be an accelerator for + several previous speculation-related CVEs, and so there may be + unknown security risks associated with leaving it enabled. + + auto Disables TSX if X86_BUG_TAA is present, otherwise enables TSX + on the system. + ============ ============================================================= + +Not specifying this option is equivalent to "tsx=off". + +The following combinations of the "tsx_async_abort" and "tsx" are possible. For +affected platforms tsx=auto is equivalent to tsx=off and the result will be: + + ========= ========================== ========================================= + tsx=on tsx_async_abort=full The system will use VERW to clear CPU + buffers. Cross-thread attacks are still + possible on SMT machines. + tsx=on tsx_async_abort=full,nosmt As above, cross-thread attacks on SMT + mitigated. + tsx=on tsx_async_abort=off The system is vulnerable. + tsx=off tsx_async_abort=full TSX might be disabled if microcode + provides a TSX control MSR. If so, + system is not vulnerable. + tsx=off tsx_async_abort=full,nosmt Ditto + tsx=off tsx_async_abort=off ditto + ========= ========================== ========================================= + + +For unaffected platforms "tsx=on" and "tsx_async_abort=full" does not clear CPU +buffers. For platforms without TSX control (MSR_IA32_ARCH_CAPABILITIES.MDS_NO=0) +"tsx" command line argument has no effect. + +For the affected platforms below table indicates the mitigation status for the +combinations of CPUID bit MD_CLEAR and IA32_ARCH_CAPABILITIES MSR bits MDS_NO +and TSX_CTRL_MSR. + + ======= ========= ============= ======================================== + MDS_NO MD_CLEAR TSX_CTRL_MSR Status + ======= ========= ============= ======================================== + 0 0 0 Vulnerable (needs microcode) + 0 1 0 MDS and TAA mitigated via VERW + 1 1 0 MDS fixed, TAA vulnerable if TSX enabled + because MD_CLEAR has no meaning and + VERW is not guaranteed to clear buffers + 1 X 1 MDS fixed, TAA can be mitigated by + VERW or TSX_CTRL_MSR + ======= ========= ============= ======================================== + +Mitigation selection guide +-------------------------- + +1. Trusted userspace and guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If all user space applications are from a trusted source and do not execute +untrusted code which is supplied externally, then the mitigation can be +disabled. The same applies to virtualized environments with trusted guests. + + +2. Untrusted userspace and guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If there are untrusted applications or guests on the system, enabling TSX +might allow a malicious actor to leak data from the host or from other +processes running on the same physical core. + +If the microcode is available and the TSX is disabled on the host, attacks +are prevented in a virtualized environment as well, even if the VMs do not +explicitly enable the mitigation. + + +.. _taa_default_mitigations: + +Default mitigations +------------------- + +The kernel's default action for vulnerable processors is: + + - Deploy TSX disable mitigation (tsx_async_abort=full tsx=off). diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst new file mode 100644 index 000000000..89abc5057 --- /dev/null +++ b/Documentation/admin-guide/index.rst @@ -0,0 +1,82 @@ +The Linux kernel user's and administrator's guide +================================================= + +The following is a collection of user-oriented documents that have been +added to the kernel over time. There is, as yet, little overall order or +organization here — this material was not written to be a single, coherent +document! With luck things will improve quickly over time. + +This initial section contains overall information, including the README +file describing the kernel as a whole, documentation on kernel parameters, +etc. + +.. toctree:: + :maxdepth: 1 + + README + kernel-parameters + devices + +This section describes CPU vulnerabilities and their mitigations. + +.. toctree:: + :maxdepth: 1 + + hw-vuln/index + +Here is a set of documents aimed at users who are trying to track down +problems and bugs in particular. + +.. toctree:: + :maxdepth: 1 + + reporting-bugs + security-bugs + bug-hunting + bug-bisect + tainted-kernels + ramoops + dynamic-debug-howto + init + +This is the beginning of a section with information of interest to +application developers. Documents covering various aspects of the kernel +ABI will be found here. + +.. toctree:: + :maxdepth: 1 + + sysfs-rules + +The rest of this manual consists of various unordered guides on how to +configure specific aspects of kernel behavior to your liking. + +.. toctree:: + :maxdepth: 1 + + initrd + cgroup-v2 + serial-console + braille-console + parport + md + module-signing + sysrq + unicode + vga-softcursor + binfmt-misc + mono + java + ras + bcache + pm/index + thunderbolt + LSM/index + mm/index + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/admin-guide/init.rst b/Documentation/admin-guide/init.rst new file mode 100644 index 000000000..e89d97f31 --- /dev/null +++ b/Documentation/admin-guide/init.rst @@ -0,0 +1,52 @@ +Explaining the dreaded "No init found." boot hang message +========================================================= + +OK, so you've got this pretty unintuitive message (currently located +in init/main.c) and are wondering what the H*** went wrong. +Some high-level reasons for failure (listed roughly in order of execution) +to load the init binary are: + +A) Unable to mount root FS +B) init binary doesn't exist on rootfs +C) broken console device +D) binary exists but dependencies not available +E) binary cannot be loaded + +Detailed explanations: + +A) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE) + to get more detailed kernel messages. +B) make sure you have the correct root FS type + (and ``root=`` kernel parameter points to the correct partition), + required drivers such as storage hardware (such as SCSI or USB!) + and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules, + to be pre-loaded by an initrd) +C) Possibly a conflict in ``console= setup`` --> initial console unavailable. + E.g. some serial consoles are unreliable due to serial IRQ issues (e.g. + missing interrupt-based configuration). + Try using a different ``console= device`` or e.g. ``netconsole=``. +D) e.g. required library dependencies of the init binary such as + ``/lib/ld-linux.so.2`` missing or broken. Use + ``readelf -d <INIT>|grep NEEDED`` to find out which libraries are required. +E) make sure the binary's architecture matches your hardware. + E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware. + In case you tried loading a non-binary file here (shell script?), + you should make sure that the script specifies an interpreter in its shebang + header line (``#!/...``) that is fully working (including its library + dependencies). And before tackling scripts, better first test a simple + non-script binary such as ``/bin/sh`` and confirm its successful execution. + To find out more, add code ``to init/main.c`` to display kernel_execve()s + return values. + +Please extend this explanation whenever you find new failure causes +(after all loading the init binary is a CRITICAL and hard transition step +which needs to be made as painless as possible), then submit patch to LKML. +Further TODOs: + +- Implement the various ``run_init_process()`` invocations via a struct array + which can then store the ``kernel_execve()`` result value and on failure + log it all by iterating over **all** results (very important usability fix). +- try to make the implementation itself more helpful in general, + e.g. by providing additional error messages at affected places. + +Andreas Mohr <andi at lisas period de> diff --git a/Documentation/admin-guide/initrd.rst b/Documentation/admin-guide/initrd.rst new file mode 100644 index 000000000..a03dabaaf --- /dev/null +++ b/Documentation/admin-guide/initrd.rst @@ -0,0 +1,383 @@ +Using the initial RAM disk (initrd) +=================================== + +Written 1996,2000 by Werner Almesberger <werner.almesberger@epfl.ch> and +Hans Lermen <lermen@fgan.de> + + +initrd provides the capability to load a RAM disk by the boot loader. +This RAM disk can then be mounted as the root file system and programs +can be run from it. Afterwards, a new root file system can be mounted +from a different device. The previous root (from initrd) is then moved +to a directory and can be subsequently unmounted. + +initrd is mainly designed to allow system startup to occur in two phases, +where the kernel comes up with a minimum set of compiled-in drivers, and +where additional modules are loaded from initrd. + +This document gives a brief overview of the use of initrd. A more detailed +discussion of the boot process can be found in [#f1]_. + + +Operation +--------- + +When using initrd, the system typically boots as follows: + + 1) the boot loader loads the kernel and the initial RAM disk + 2) the kernel converts initrd into a "normal" RAM disk and + frees the memory used by initrd + 3) if the root device is not ``/dev/ram0``, the old (deprecated) + change_root procedure is followed. see the "Obsolete root change + mechanism" section below. + 4) root device is mounted. if it is ``/dev/ram0``, the initrd image is + then mounted as root + 5) /sbin/init is executed (this can be any valid executable, including + shell scripts; it is run with uid 0 and can do basically everything + init can do). + 6) init mounts the "real" root file system + 7) init places the root file system at the root directory using the + pivot_root system call + 8) init execs the ``/sbin/init`` on the new root filesystem, performing + the usual boot sequence + 9) the initrd file system is removed + +Note that changing the root directory does not involve unmounting it. +It is therefore possible to leave processes running on initrd during that +procedure. Also note that file systems mounted under initrd continue to +be accessible. + + +Boot command-line options +------------------------- + +initrd adds the following new options:: + + initrd=<path> (e.g. LOADLIN) + + Loads the specified file as the initial RAM disk. When using LILO, you + have to specify the RAM disk image file in /etc/lilo.conf, using the + INITRD configuration variable. + + noinitrd + + initrd data is preserved but it is not converted to a RAM disk and + the "normal" root file system is mounted. initrd data can be read + from /dev/initrd. Note that the data in initrd can have any structure + in this case and doesn't necessarily have to be a file system image. + This option is used mainly for debugging. + + Note: /dev/initrd is read-only and it can only be used once. As soon + as the last process has closed it, all data is freed and /dev/initrd + can't be opened anymore. + + root=/dev/ram0 + + initrd is mounted as root, and the normal boot procedure is followed, + with the RAM disk mounted as root. + +Compressed cpio images +---------------------- + +Recent kernels have support for populating a ramdisk from a compressed cpio +archive. On such systems, the creation of a ramdisk image doesn't need to +involve special block devices or loopbacks; you merely create a directory on +disk with the desired initrd content, cd to that directory, and run (as an +example):: + + find . | cpio --quiet -H newc -o | gzip -9 -n > /boot/imagefile.img + +Examining the contents of an existing image file is just as simple:: + + mkdir /tmp/imagefile + cd /tmp/imagefile + gzip -cd /boot/imagefile.img | cpio -imd --quiet + +Installation +------------ + +First, a directory for the initrd file system has to be created on the +"normal" root file system, e.g.:: + + # mkdir /initrd + +The name is not relevant. More details can be found on the +:manpage:`pivot_root(2)` man page. + +If the root file system is created during the boot procedure (i.e. if +you're building an install floppy), the root file system creation +procedure should create the ``/initrd`` directory. + +If initrd will not be mounted in some cases, its content is still +accessible if the following device has been created:: + + # mknod /dev/initrd b 1 250 + # chmod 400 /dev/initrd + +Second, the kernel has to be compiled with RAM disk support and with +support for the initial RAM disk enabled. Also, at least all components +needed to execute programs from initrd (e.g. executable format and file +system) must be compiled into the kernel. + +Third, you have to create the RAM disk image. This is done by creating a +file system on a block device, copying files to it as needed, and then +copying the content of the block device to the initrd file. With recent +kernels, at least three types of devices are suitable for that: + + - a floppy disk (works everywhere but it's painfully slow) + - a RAM disk (fast, but allocates physical memory) + - a loopback device (the most elegant solution) + +We'll describe the loopback device method: + + 1) make sure loopback block devices are configured into the kernel + 2) create an empty file system of the appropriate size, e.g.:: + + # dd if=/dev/zero of=initrd bs=300k count=1 + # mke2fs -F -m0 initrd + + (if space is critical, you may want to use the Minix FS instead of Ext2) + 3) mount the file system, e.g.:: + + # mount -t ext2 -o loop initrd /mnt + + 4) create the console device:: + + # mkdir /mnt/dev + # mknod /mnt/dev/console c 5 1 + + 5) copy all the files that are needed to properly use the initrd + environment. Don't forget the most important file, ``/sbin/init`` + + .. note:: ``/sbin/init`` permissions must include "x" (execute). + + 6) correct operation the initrd environment can frequently be tested + even without rebooting with the command:: + + # chroot /mnt /sbin/init + + This is of course limited to initrds that do not interfere with the + general system state (e.g. by reconfiguring network interfaces, + overwriting mounted devices, trying to start already running demons, + etc. Note however that it is usually possible to use pivot_root in + such a chroot'ed initrd environment.) + 7) unmount the file system:: + + # umount /mnt + + 8) the initrd is now in the file "initrd". Optionally, it can now be + compressed:: + + # gzip -9 initrd + +For experimenting with initrd, you may want to take a rescue floppy and +only add a symbolic link from ``/sbin/init`` to ``/bin/sh``. Alternatively, you +can try the experimental newlib environment [#f2]_ to create a small +initrd. + +Finally, you have to boot the kernel and load initrd. Almost all Linux +boot loaders support initrd. Since the boot process is still compatible +with an older mechanism, the following boot command line parameters +have to be given:: + + root=/dev/ram0 rw + +(rw is only necessary if writing to the initrd file system.) + +With LOADLIN, you simply execute:: + + LOADLIN <kernel> initrd=<disk_image> + +e.g.:: + + LOADLIN C:\LINUX\BZIMAGE initrd=C:\LINUX\INITRD.GZ root=/dev/ram0 rw + +With LILO, you add the option ``INITRD=<path>`` to either the global section +or to the section of the respective kernel in ``/etc/lilo.conf``, and pass +the options using APPEND, e.g.:: + + image = /bzImage + initrd = /boot/initrd.gz + append = "root=/dev/ram0 rw" + +and run ``/sbin/lilo`` + +For other boot loaders, please refer to the respective documentation. + +Now you can boot and enjoy using initrd. + + +Changing the root device +------------------------ + +When finished with its duties, init typically changes the root device +and proceeds with starting the Linux system on the "real" root device. + +The procedure involves the following steps: + - mounting the new root file system + - turning it into the root file system + - removing all accesses to the old (initrd) root file system + - unmounting the initrd file system and de-allocating the RAM disk + +Mounting the new root file system is easy: it just needs to be mounted on +a directory under the current root. Example:: + + # mkdir /new-root + # mount -o ro /dev/hda1 /new-root + +The root change is accomplished with the pivot_root system call, which +is also available via the ``pivot_root`` utility (see :manpage:`pivot_root(8)` +man page; ``pivot_root`` is distributed with util-linux version 2.10h or higher +[#f3]_). ``pivot_root`` moves the current root to a directory under the new +root, and puts the new root at its place. The directory for the old root +must exist before calling ``pivot_root``. Example:: + + # cd /new-root + # mkdir initrd + # pivot_root . initrd + +Now, the init process may still access the old root via its +executable, shared libraries, standard input/output/error, and its +current root directory. All these references are dropped by the +following command:: + + # exec chroot . what-follows <dev/console >dev/console 2>&1 + +Where what-follows is a program under the new root, e.g. ``/sbin/init`` +If the new root file system will be used with udev and has no valid +``/dev`` directory, udev must be initialized before invoking chroot in order +to provide ``/dev/console``. + +Note: implementation details of pivot_root may change with time. In order +to ensure compatibility, the following points should be observed: + + - before calling pivot_root, the current directory of the invoking + process should point to the new root directory + - use . as the first argument, and the _relative_ path of the directory + for the old root as the second argument + - a chroot program must be available under the old and the new root + - chroot to the new root afterwards + - use relative paths for dev/console in the exec command + +Now, the initrd can be unmounted and the memory allocated by the RAM +disk can be freed:: + + # umount /initrd + # blockdev --flushbufs /dev/ram0 + +It is also possible to use initrd with an NFS-mounted root, see the +:manpage:`pivot_root(8)` man page for details. + + +Usage scenarios +--------------- + +The main motivation for implementing initrd was to allow for modular +kernel configuration at system installation. The procedure would work +as follows: + + 1) system boots from floppy or other media with a minimal kernel + (e.g. support for RAM disks, initrd, a.out, and the Ext2 FS) and + loads initrd + 2) ``/sbin/init`` determines what is needed to (1) mount the "real" root FS + (i.e. device type, device drivers, file system) and (2) the + distribution media (e.g. CD-ROM, network, tape, ...). This can be + done by asking the user, by auto-probing, or by using a hybrid + approach. + 3) ``/sbin/init`` loads the necessary kernel modules + 4) ``/sbin/init`` creates and populates the root file system (this doesn't + have to be a very usable system yet) + 5) ``/sbin/init`` invokes ``pivot_root`` to change the root file system and + execs - via chroot - a program that continues the installation + 6) the boot loader is installed + 7) the boot loader is configured to load an initrd with the set of + modules that was used to bring up the system (e.g. ``/initrd`` can be + modified, then unmounted, and finally, the image is written from + ``/dev/ram0`` or ``/dev/rd/0`` to a file) + 8) now the system is bootable and additional installation tasks can be + performed + +The key role of initrd here is to re-use the configuration data during +normal system operation without requiring the use of a bloated "generic" +kernel or re-compiling or re-linking the kernel. + +A second scenario is for installations where Linux runs on systems with +different hardware configurations in a single administrative domain. In +such cases, it is desirable to generate only a small set of kernels +(ideally only one) and to keep the system-specific part of configuration +information as small as possible. In this case, a common initrd could be +generated with all the necessary modules. Then, only ``/sbin/init`` or a file +read by it would have to be different. + +A third scenario is more convenient recovery disks, because information +like the location of the root FS partition doesn't have to be provided at +boot time, but the system loaded from initrd can invoke a user-friendly +dialog and it can also perform some sanity checks (or even some form of +auto-detection). + +Last not least, CD-ROM distributors may use it for better installation +from CD, e.g. by using a boot floppy and bootstrapping a bigger RAM disk +via initrd from CD; or by booting via a loader like ``LOADLIN`` or directly +from the CD-ROM, and loading the RAM disk from CD without need of +floppies. + + +Obsolete root change mechanism +------------------------------ + +The following mechanism was used before the introduction of pivot_root. +Current kernels still support it, but you should _not_ rely on its +continued availability. + +It works by mounting the "real" root device (i.e. the one set with rdev +in the kernel image or with root=... at the boot command line) as the +root file system when linuxrc exits. The initrd file system is then +unmounted, or, if it is still busy, moved to a directory ``/initrd``, if +such a directory exists on the new root file system. + +In order to use this mechanism, you do not have to specify the boot +command options root, init, or rw. (If specified, they will affect +the real root file system, not the initrd environment.) + +If /proc is mounted, the "real" root device can be changed from within +linuxrc by writing the number of the new root FS device to the special +file /proc/sys/kernel/real-root-dev, e.g.:: + + # echo 0x301 >/proc/sys/kernel/real-root-dev + +Note that the mechanism is incompatible with NFS and similar file +systems. + +This old, deprecated mechanism is commonly called ``change_root``, while +the new, supported mechanism is called ``pivot_root``. + + +Mixed change_root and pivot_root mechanism +------------------------------------------ + +In case you did not want to use ``root=/dev/ram0`` to trigger the pivot_root +mechanism, you may create both ``/linuxrc`` and ``/sbin/init`` in your initrd +image. + +``/linuxrc`` would contain only the following:: + + #! /bin/sh + mount -n -t proc proc /proc + echo 0x0100 >/proc/sys/kernel/real-root-dev + umount -n /proc + +Once linuxrc exited, the kernel would mount again your initrd as root, +this time executing ``/sbin/init``. Again, it would be the duty of this init +to build the right environment (maybe using the ``root= device`` passed on +the cmdline) before the final execution of the real ``/sbin/init``. + + +Resources +--------- + +.. [#f1] Almesberger, Werner; "Booting Linux: The History and the Future" + http://www.almesberger.net/cv/papers/ols2k-9.ps.gz +.. [#f2] newlib package (experimental), with initrd example + https://www.sourceware.org/newlib/ +.. [#f3] util-linux: Miscellaneous utilities for Linux + https://www.kernel.org/pub/linux/utils/util-linux/ diff --git a/Documentation/admin-guide/java.rst b/Documentation/admin-guide/java.rst new file mode 100644 index 000000000..8744e272e --- /dev/null +++ b/Documentation/admin-guide/java.rst @@ -0,0 +1,423 @@ +Java(tm) Binary Kernel Support for Linux v1.03 +---------------------------------------------- + +Linux beats them ALL! While all other OS's are TALKING about direct +support of Java Binaries in the OS, Linux is doing it! + +You can execute Java applications and Java Applets just like any +other program after you have done the following: + +1) You MUST FIRST install the Java Developers Kit for Linux. + The Java on Linux HOWTO gives the details on getting and + installing this. This HOWTO can be found at: + + ftp://sunsite.unc.edu/pub/Linux/docs/HOWTO/Java-HOWTO + + You should also set up a reasonable CLASSPATH environment + variable to use Java applications that make use of any + nonstandard classes (not included in the same directory + as the application itself). + +2) You have to compile BINFMT_MISC either as a module or into + the kernel (``CONFIG_BINFMT_MISC``) and set it up properly. + If you choose to compile it as a module, you will have + to insert it manually with modprobe/insmod, as kmod + cannot easily be supported with binfmt_misc. + Read the file 'binfmt_misc.txt' in this directory to know + more about the configuration process. + +3) Add the following configuration items to binfmt_misc + (you should really have read ``binfmt_misc.txt`` now): + support for Java applications:: + + ':Java:M::\xca\xfe\xba\xbe::/usr/local/bin/javawrapper:' + + support for executable Jar files:: + + ':ExecutableJAR:E::jar::/usr/local/bin/jarwrapper:' + + support for Java Applets:: + + ':Applet:E::html::/usr/bin/appletviewer:' + + or the following, if you want to be more selective:: + + ':Applet:M::<!--applet::/usr/bin/appletviewer:' + + Of course you have to fix the path names. The path/file names given in this + document match the Debian 2.1 system. (i.e. jdk installed in ``/usr``, + custom wrappers from this document in ``/usr/local``) + + Note, that for the more selective applet support you have to modify + existing html-files to contain ``<!--applet-->`` in the first line + (``<`` has to be the first character!) to let this work! + + For the compiled Java programs you need a wrapper script like the + following (this is because Java is broken in case of the filename + handling), again fix the path names, both in the script and in the + above given configuration string. + + You, too, need the little program after the script. Compile like:: + + gcc -O2 -o javaclassname javaclassname.c + + and stick it to ``/usr/local/bin``. + + Both the javawrapper shellscript and the javaclassname program + were supplied by Colin J. Watson <cjw44@cam.ac.uk>. + +Javawrapper shell script: + +.. code-block:: sh + + #!/bin/bash + # /usr/local/bin/javawrapper - the wrapper for binfmt_misc/java + + if [ -z "$1" ]; then + exec 1>&2 + echo Usage: $0 class-file + exit 1 + fi + + CLASS=$1 + FQCLASS=`/usr/local/bin/javaclassname $1` + FQCLASSN=`echo $FQCLASS | sed -e 's/^.*\.\([^.]*\)$/\1/'` + FQCLASSP=`echo $FQCLASS | sed -e 's-\.-/-g' -e 's-^[^/]*$--' -e 's-/[^/]*$--'` + + # for example: + # CLASS=Test.class + # FQCLASS=foo.bar.Test + # FQCLASSN=Test + # FQCLASSP=foo/bar + + unset CLASSBASE + + declare -i LINKLEVEL=0 + + while :; do + if [ "`basename $CLASS .class`" == "$FQCLASSN" ]; then + # See if this directory works straight off + cd -L `dirname $CLASS` + CLASSDIR=$PWD + cd $OLDPWD + if echo $CLASSDIR | grep -q "$FQCLASSP$"; then + CLASSBASE=`echo $CLASSDIR | sed -e "s.$FQCLASSP$.."` + break; + fi + # Try dereferencing the directory name + cd -P `dirname $CLASS` + CLASSDIR=$PWD + cd $OLDPWD + if echo $CLASSDIR | grep -q "$FQCLASSP$"; then + CLASSBASE=`echo $CLASSDIR | sed -e "s.$FQCLASSP$.."` + break; + fi + # If no other possible filename exists + if [ ! -L $CLASS ]; then + exec 1>&2 + echo $0: + echo " $CLASS should be in a" \ + "directory tree called $FQCLASSP" + exit 1 + fi + fi + if [ ! -L $CLASS ]; then break; fi + # Go down one more level of symbolic links + let LINKLEVEL+=1 + if [ $LINKLEVEL -gt 5 ]; then + exec 1>&2 + echo $0: + echo " Too many symbolic links encountered" + exit 1 + fi + CLASS=`ls --color=no -l $CLASS | sed -e 's/^.* \([^ ]*\)$/\1/'` + done + + if [ -z "$CLASSBASE" ]; then + if [ -z "$FQCLASSP" ]; then + GOODNAME=$FQCLASSN.class + else + GOODNAME=$FQCLASSP/$FQCLASSN.class + fi + exec 1>&2 + echo $0: + echo " $FQCLASS should be in a file called $GOODNAME" + exit 1 + fi + + if ! echo $CLASSPATH | grep -q "^\(.*:\)*$CLASSBASE\(:.*\)*"; then + # class is not in CLASSPATH, so prepend dir of class to CLASSPATH + if [ -z "${CLASSPATH}" ] ; then + export CLASSPATH=$CLASSBASE + else + export CLASSPATH=$CLASSBASE:$CLASSPATH + fi + fi + + shift + /usr/bin/java $FQCLASS "$@" + +javaclassname.c: + +.. code-block:: c + + /* javaclassname.c + * + * Extracts the class name from a Java class file; intended for use in a Java + * wrapper of the type supported by the binfmt_misc option in the Linux kernel. + * + * Copyright (C) 1999 Colin J. Watson <cjw44@cam.ac.uk>. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + + #include <stdlib.h> + #include <stdio.h> + #include <stdarg.h> + #include <sys/types.h> + + /* From Sun's Java VM Specification, as tag entries in the constant pool. */ + + #define CP_UTF8 1 + #define CP_INTEGER 3 + #define CP_FLOAT 4 + #define CP_LONG 5 + #define CP_DOUBLE 6 + #define CP_CLASS 7 + #define CP_STRING 8 + #define CP_FIELDREF 9 + #define CP_METHODREF 10 + #define CP_INTERFACEMETHODREF 11 + #define CP_NAMEANDTYPE 12 + #define CP_METHODHANDLE 15 + #define CP_METHODTYPE 16 + #define CP_INVOKEDYNAMIC 18 + + /* Define some commonly used error messages */ + + #define seek_error() error("%s: Cannot seek\n", program) + #define corrupt_error() error("%s: Class file corrupt\n", program) + #define eof_error() error("%s: Unexpected end of file\n", program) + #define utf8_error() error("%s: Only ASCII 1-255 supported\n", program); + + char *program; + + long *pool; + + u_int8_t read_8(FILE *classfile); + u_int16_t read_16(FILE *classfile); + void skip_constant(FILE *classfile, u_int16_t *cur); + void error(const char *format, ...); + int main(int argc, char **argv); + + /* Reads in an unsigned 8-bit integer. */ + u_int8_t read_8(FILE *classfile) + { + int b = fgetc(classfile); + if(b == EOF) + eof_error(); + return (u_int8_t)b; + } + + /* Reads in an unsigned 16-bit integer. */ + u_int16_t read_16(FILE *classfile) + { + int b1, b2; + b1 = fgetc(classfile); + if(b1 == EOF) + eof_error(); + b2 = fgetc(classfile); + if(b2 == EOF) + eof_error(); + return (u_int16_t)((b1 << 8) | b2); + } + + /* Reads in a value from the constant pool. */ + void skip_constant(FILE *classfile, u_int16_t *cur) + { + u_int16_t len; + int seekerr = 1; + pool[*cur] = ftell(classfile); + switch(read_8(classfile)) + { + case CP_UTF8: + len = read_16(classfile); + seekerr = fseek(classfile, len, SEEK_CUR); + break; + case CP_CLASS: + case CP_STRING: + case CP_METHODTYPE: + seekerr = fseek(classfile, 2, SEEK_CUR); + break; + case CP_METHODHANDLE: + seekerr = fseek(classfile, 3, SEEK_CUR); + break; + case CP_INTEGER: + case CP_FLOAT: + case CP_FIELDREF: + case CP_METHODREF: + case CP_INTERFACEMETHODREF: + case CP_NAMEANDTYPE: + case CP_INVOKEDYNAMIC: + seekerr = fseek(classfile, 4, SEEK_CUR); + break; + case CP_LONG: + case CP_DOUBLE: + seekerr = fseek(classfile, 8, SEEK_CUR); + ++(*cur); + break; + default: + corrupt_error(); + } + if(seekerr) + seek_error(); + } + + void error(const char *format, ...) + { + va_list ap; + va_start(ap, format); + vfprintf(stderr, format, ap); + va_end(ap); + exit(1); + } + + int main(int argc, char **argv) + { + FILE *classfile; + u_int16_t cp_count, i, this_class, classinfo_ptr; + u_int8_t length; + + program = argv[0]; + + if(!argv[1]) + error("%s: Missing input file\n", program); + classfile = fopen(argv[1], "rb"); + if(!classfile) + error("%s: Error opening %s\n", program, argv[1]); + + if(fseek(classfile, 8, SEEK_SET)) /* skip magic and version numbers */ + seek_error(); + cp_count = read_16(classfile); + pool = calloc(cp_count, sizeof(long)); + if(!pool) + error("%s: Out of memory for constant pool\n", program); + + for(i = 1; i < cp_count; ++i) + skip_constant(classfile, &i); + if(fseek(classfile, 2, SEEK_CUR)) /* skip access flags */ + seek_error(); + + this_class = read_16(classfile); + if(this_class < 1 || this_class >= cp_count) + corrupt_error(); + if(!pool[this_class] || pool[this_class] == -1) + corrupt_error(); + if(fseek(classfile, pool[this_class] + 1, SEEK_SET)) + seek_error(); + + classinfo_ptr = read_16(classfile); + if(classinfo_ptr < 1 || classinfo_ptr >= cp_count) + corrupt_error(); + if(!pool[classinfo_ptr] || pool[classinfo_ptr] == -1) + corrupt_error(); + if(fseek(classfile, pool[classinfo_ptr] + 1, SEEK_SET)) + seek_error(); + + length = read_16(classfile); + for(i = 0; i < length; ++i) + { + u_int8_t x = read_8(classfile); + if((x & 0x80) || !x) + { + if((x & 0xE0) == 0xC0) + { + u_int8_t y = read_8(classfile); + if((y & 0xC0) == 0x80) + { + int c = ((x & 0x1f) << 6) + (y & 0x3f); + if(c) putchar(c); + else utf8_error(); + } + else utf8_error(); + } + else utf8_error(); + } + else if(x == '/') putchar('.'); + else putchar(x); + } + putchar('\n'); + free(pool); + fclose(classfile); + return 0; + } + +jarwrapper:: + + #!/bin/bash + # /usr/local/java/bin/jarwrapper - the wrapper for binfmt_misc/jar + + java -jar $1 + + +Now simply ``chmod +x`` the ``.class``, ``.jar`` and/or ``.html`` files you +want to execute. + +To add a Java program to your path best put a symbolic link to the main +.class file into /usr/bin (or another place you like) omitting the .class +extension. The directory containing the original .class file will be +added to your CLASSPATH during execution. + + +To test your new setup, enter in the following simple Java app, and name +it "HelloWorld.java": + +.. code-block:: java + + class HelloWorld { + public static void main(String args[]) { + System.out.println("Hello World!"); + } + } + +Now compile the application with:: + + javac HelloWorld.java + +Set the executable permissions of the binary file, with:: + + chmod 755 HelloWorld.class + +And then execute it:: + + ./HelloWorld.class + + +To execute Java Jar files, simple chmod the ``*.jar`` files to include +the execution bit, then just do:: + + ./Application.jar + + +To execute Java Applets, simple chmod the ``*.html`` files to include +the execution bit, then just do:: + + ./Applet.html + + +originally by Brian A. Lantz, brian@lantz.com +heavily edited for binfmt_misc by Richard Günther +new scripts by Colin J. Watson <cjw44@cam.ac.uk> +added executable Jar file support by Kurt Huwig <kurt@iku-netz.de> diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst new file mode 100644 index 000000000..b8d0bc07e --- /dev/null +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -0,0 +1,211 @@ +.. _kernelparameters: + +The kernel's command-line parameters +==================================== + +The following is a consolidated list of the kernel parameters as +implemented by the __setup(), core_param() and module_param() macros +and sorted into English Dictionary order (defined as ignoring all +punctuation and sorting digits before letters in a case insensitive +manner), and with descriptions where known. + +The kernel parses parameters from the kernel command line up to "--"; +if it doesn't recognize a parameter and it doesn't contain a '.', the +parameter gets passed to init: parameters with '=' go into init's +environment, others are passed as command line arguments to init. +Everything after "--" is passed as an argument to init. + +Module parameters can be specified in two ways: via the kernel command +line with a module name prefix, or via modprobe, e.g.:: + + (kernel command line) usbcore.blinkenlights=1 + (modprobe command line) modprobe usbcore blinkenlights=1 + +Parameters for modules which are built into the kernel need to be +specified on the kernel command line. modprobe looks through the +kernel command line (/proc/cmdline) and collects module parameters +when it loads a module, so the kernel command line can be used for +loadable modules too. + +Hyphens (dashes) and underscores are equivalent in parameter names, so:: + + log_buf_len=1M print-fatal-signals=1 + +can also be entered as:: + + log-buf-len=1M print_fatal_signals=1 + +Double-quotes can be used to protect spaces in values, e.g.:: + + param="spaces in here" + +cpu lists: +---------- + +Some kernel parameters take a list of CPUs as a value, e.g. isolcpus, +nohz_full, irqaffinity, rcu_nocbs. The format of this list is: + + <cpu number>,...,<cpu number> + +or + + <cpu number>-<cpu number> + (must be a positive range in ascending order) + +or a mixture + +<cpu number>,...,<cpu number>-<cpu number> + +Note that for the special case of a range one can split the range into equal +sized groups and for each group use some amount from the beginning of that +group: + + <cpu number>-cpu number>:<used size>/<group size> + +For example one can add to the command line following parameter: + + isolcpus=1,2,10-20,100-2000:2/25 + +where the final item represents CPUs 100,101,125,126,150,151,... + + + +This document may not be entirely up to date and comprehensive. The command +"modinfo -p ${modulename}" shows a current list of all parameters of a loadable +module. Loadable modules, after being loaded into the running kernel, also +reveal their parameters in /sys/module/${modulename}/parameters/. Some of these +parameters may be changed at runtime by the command +``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``. + +The parameters listed below are only valid if certain kernel build options were +enabled and if respective hardware is present. The text in square brackets at +the beginning of each description states the restrictions within which a +parameter is applicable:: + + ACPI ACPI support is enabled. + AGP AGP (Accelerated Graphics Port) is enabled. + ALSA ALSA sound support is enabled. + APIC APIC support is enabled. + APM Advanced Power Management support is enabled. + ARM ARM architecture is enabled. + AX25 Appropriate AX.25 support is enabled. + CLK Common clock infrastructure is enabled. + CMA Contiguous Memory Area support is enabled. + DRM Direct Rendering Management support is enabled. + DYNAMIC_DEBUG Build in debug messages and enable them at runtime + EDD BIOS Enhanced Disk Drive Services (EDD) is enabled + EFI EFI Partitioning (GPT) is enabled + EIDE EIDE/ATAPI support is enabled. + EVM Extended Verification Module + FB The frame buffer device is enabled. + FTRACE Function tracing enabled. + GCOV GCOV profiling is enabled. + HW Appropriate hardware is enabled. + IA-64 IA-64 architecture is enabled. + IMA Integrity measurement architecture is enabled. + IOSCHED More than one I/O scheduler is enabled. + IP_PNP IP DHCP, BOOTP, or RARP is enabled. + IPV6 IPv6 support is enabled. + ISAPNP ISA PnP code is enabled. + ISDN Appropriate ISDN support is enabled. + ISOL CPU Isolation is enabled. + JOY Appropriate joystick support is enabled. + KGDB Kernel debugger support is enabled. + KVM Kernel Virtual Machine support is enabled. + LIBATA Libata driver is enabled + LP Printer support is enabled. + LOOP Loopback device support is enabled. + M68k M68k architecture is enabled. + These options have more detailed description inside of + Documentation/m68k/kernel-options.txt. + MDA MDA console support is enabled. + MIPS MIPS architecture is enabled. + MOUSE Appropriate mouse support is enabled. + MSI Message Signaled Interrupts (PCI). + MTD MTD (Memory Technology Device) support is enabled. + NET Appropriate network support is enabled. + NUMA NUMA support is enabled. + NFS Appropriate NFS support is enabled. + OSS OSS sound support is enabled. + PV_OPS A paravirtualized kernel is enabled. + PARIDE The ParIDE (parallel port IDE) subsystem is enabled. + PARISC The PA-RISC architecture is enabled. + PCI PCI bus support is enabled. + PCIE PCI Express support is enabled. + PCMCIA The PCMCIA subsystem is enabled. + PNP Plug & Play support is enabled. + PPC PowerPC architecture is enabled. + PPT Parallel port support is enabled. + PS2 Appropriate PS/2 support is enabled. + RAM RAM disk support is enabled. + RDT Intel Resource Director Technology. + S390 S390 architecture is enabled. + SCSI Appropriate SCSI support is enabled. + A lot of drivers have their options described inside + the Documentation/scsi/ sub-directory. + SECURITY Different security models are enabled. + SELINUX SELinux support is enabled. + APPARMOR AppArmor support is enabled. + SERIAL Serial support is enabled. + SH SuperH architecture is enabled. + SMP The kernel is an SMP kernel. + SPARC Sparc architecture is enabled. + SWSUSP Software suspend (hibernation) is enabled. + SUSPEND System suspend states are enabled. + TPM TPM drivers are enabled. + TS Appropriate touchscreen support is enabled. + UMS USB Mass Storage support is enabled. + USB USB support is enabled. + USBHID USB Human Interface Device support is enabled. + V4L Video For Linux support is enabled. + VMMIO Driver for memory mapped virtio devices is enabled. + VGA The VGA console has been enabled. + VT Virtual terminal support is enabled. + WDT Watchdog support is enabled. + XT IBM PC/XT MFM hard disk support is enabled. + X86-32 X86-32, aka i386 architecture is enabled. + X86-64 X86-64 architecture is enabled. + More X86-64 boot options can be found in + Documentation/x86/x86_64/boot-options.txt . + X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) + X86_UV SGI UV support is enabled. + XEN Xen support is enabled + +In addition, the following text indicates that the option:: + + BUGS= Relates to possible processor bugs on the said processor. + KNL Is a kernel start-up parameter. + BOOT Is a boot loader parameter. + +Parameters denoted with BOOT are actually interpreted by the boot +loader, and have no meaning to the kernel directly. +Do not modify the syntax of boot loader parameters without extreme +need or coordination with <Documentation/x86/boot.txt>. + +There are also arch-specific kernel-parameters not documented here. +See for example <Documentation/x86/x86_64/boot-options.txt>. + +Note that ALL kernel parameters listed below are CASE SENSITIVE, and that +a trailing = on the name of any parameter states that that parameter will +be entered as an environment variable, whereas its absence indicates that +it will appear as a kernel argument readable via /proc/cmdline by programs +running once the system is up. + +The number of kernel parameters is not limited, but the length of the +complete command line (parameters including spaces etc.) is limited to +a fixed number of characters. This limit depends on the architecture +and is between 256 and 4096 characters. It is defined in the file +./include/asm/setup.h as COMMAND_LINE_SIZE. + +Finally, the [KMG] suffix is commonly described after a number of kernel +parameter values. These 'K', 'M', and 'G' letters represent the _binary_ +multipliers 'Kilo', 'Mega', and 'Giga', equaling 2^10, 2^20, and 2^30 +bytes respectively. Such letter suffixes can also be entirely omitted: + +.. include:: kernel-parameters.txt + :literal: + +Todo +---- + + Add more DRM drivers. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt new file mode 100644 index 000000000..500032af0 --- /dev/null +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -0,0 +1,5361 @@ + acpi= [HW,ACPI,X86,ARM64] + Advanced Configuration and Power Interface + Format: { force | on | off | strict | noirq | rsdt | + copy_dsdt } + force -- enable ACPI if default was off + on -- enable ACPI but allow fallback to DT [arm64] + off -- disable ACPI if default was on + noirq -- do not use ACPI for IRQ routing + strict -- Be less tolerant of platforms that are not + strictly ACPI specification compliant. + rsdt -- prefer RSDT over (default) XSDT + copy_dsdt -- copy DSDT to memory + For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force" + are available + + See also Documentation/power/runtime_pm.txt, pci=noacpi + + acpi_apic_instance= [ACPI, IOAPIC] + Format: <int> + 2: use 2nd APIC table, if available + 1,0: use 1st APIC table + default: 0 + + acpi_backlight= [HW,ACPI] + acpi_backlight=vendor + acpi_backlight=video + If set to vendor, prefer vendor specific driver + (e.g. thinkpad_acpi, sony_acpi, etc.) instead + of the ACPI video.ko driver. + + acpi_force_32bit_fadt_addr + force FADT to use 32 bit addresses rather than the + 64 bit X_* addresses. Some firmware have broken 64 + bit addresses for force ACPI ignore these and use + the older legacy 32 bit addresses. + + acpica_no_return_repair [HW, ACPI] + Disable AML predefined validation mechanism + This mechanism can repair the evaluation result to make + the return objects more ACPI specification compliant. + This option is useful for developers to identify the + root cause of an AML interpreter issue when the issue + has something to do with the repair mechanism. + + acpi.debug_layer= [HW,ACPI,ACPI_DEBUG] + acpi.debug_level= [HW,ACPI,ACPI_DEBUG] + Format: <int> + CONFIG_ACPI_DEBUG must be enabled to produce any ACPI + debug output. Bits in debug_layer correspond to a + _COMPONENT in an ACPI source file, e.g., + #define _COMPONENT ACPI_PCI_COMPONENT + Bits in debug_level correspond to a level in + ACPI_DEBUG_PRINT statements, e.g., + ACPI_DEBUG_PRINT((ACPI_DB_INFO, ... + The debug_level mask defaults to "info". See + Documentation/acpi/debug.txt for more information about + debug layers and levels. + + Enable processor driver info messages: + acpi.debug_layer=0x20000000 + Enable PCI/PCI interrupt routing info messages: + acpi.debug_layer=0x400000 + Enable AML "Debug" output, i.e., stores to the Debug + object while interpreting AML: + acpi.debug_layer=0xffffffff acpi.debug_level=0x2 + Enable all messages related to ACPI hardware: + acpi.debug_layer=0x2 acpi.debug_level=0xffffffff + + Some values produce so much output that the system is + unusable. The "log_buf_len" parameter may be useful + if you need to capture more output. + + acpi_enforce_resources= [ACPI] + { strict | lax | no } + Check for resource conflicts between native drivers + and ACPI OperationRegions (SystemIO and SystemMemory + only). IO ports and memory declared in ACPI might be + used by the ACPI subsystem in arbitrary AML code and + can interfere with legacy drivers. + strict (default): access to resources claimed by ACPI + is denied; legacy drivers trying to access reserved + resources will fail to bind to device using them. + lax: access to resources claimed by ACPI is allowed; + legacy drivers trying to access reserved resources + will bind successfully but a warning message is logged. + no: ACPI OperationRegions are not marked as reserved, + no further checks are performed. + + acpi_force_table_verification [HW,ACPI] + Enable table checksum verification during early stage. + By default, this is disabled due to x86 early mapping + size limitation. + + acpi_irq_balance [HW,ACPI] + ACPI will balance active IRQs + default in APIC mode + + acpi_irq_nobalance [HW,ACPI] + ACPI will not move active IRQs (default) + default in PIC mode + + acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA + Format: <irq>,<irq>... + + acpi_irq_pci= [HW,ACPI] If irq_balance, clear listed IRQs for + use by PCI + Format: <irq>,<irq>... + + acpi_mask_gpe= [HW,ACPI] + Due to the existence of _Lxx/_Exx, some GPEs triggered + by unsupported hardware/firmware features can result in + GPE floodings that cannot be automatically disabled by + the GPE dispatcher. + This facility can be used to prevent such uncontrolled + GPE floodings. + Format: <byte> + + acpi_no_auto_serialize [HW,ACPI] + Disable auto-serialization of AML methods + AML control methods that contain the opcodes to create + named objects will be marked as "Serialized" by the + auto-serialization feature. + This feature is enabled by default. + This option allows to turn off the feature. + + acpi_no_memhotplug [ACPI] Disable memory hotplug. Useful for kdump + kernels. + + acpi_no_static_ssdt [HW,ACPI] + Disable installation of static SSDTs at early boot time + By default, SSDTs contained in the RSDT/XSDT will be + installed automatically and they will appear under + /sys/firmware/acpi/tables. + This option turns off this feature. + Note that specifying this option does not affect + dynamic table installation which will install SSDT + tables to /sys/firmware/acpi/tables/dynamic. + + acpi_no_watchdog [HW,ACPI,WDT] + Ignore the ACPI-based watchdog interface (WDAT) and let + a native driver control the watchdog device instead. + + acpi_rsdp= [ACPI,EFI,KEXEC] + Pass the RSDP address to the kernel, mostly used + on machines running EFI runtime service to boot the + second kernel for kdump. + + acpi_os_name= [HW,ACPI] Tell ACPI BIOS the name of the OS + Format: To spoof as Windows 98: ="Microsoft Windows" + + acpi_rev_override [ACPI] Override the _REV object to return 5 (instead + of 2 which is mandated by ACPI 6) as the supported ACPI + specification revision (when using this switch, it may + be necessary to carry out a cold reboot _twice_ in a + row to make it take effect on the platform firmware). + + acpi_osi= [HW,ACPI] Modify list of supported OS interface strings + acpi_osi="string1" # add string1 + acpi_osi="!string2" # remove string2 + acpi_osi=!* # remove all strings + acpi_osi=! # disable all built-in OS vendor + strings + acpi_osi=!! # enable all built-in OS vendor + strings + acpi_osi= # disable all strings + + 'acpi_osi=!' can be used in combination with single or + multiple 'acpi_osi="string1"' to support specific OS + vendor string(s). Note that such command can only + affect the default state of the OS vendor strings, thus + it cannot affect the default state of the feature group + strings and the current state of the OS vendor strings, + specifying it multiple times through kernel command line + is meaningless. This command is useful when one do not + care about the state of the feature group strings which + should be controlled by the OSPM. + Examples: + 1. 'acpi_osi=! acpi_osi="Windows 2000"' is equivalent + to 'acpi_osi="Windows 2000" acpi_osi=!', they all + can make '_OSI("Windows 2000")' TRUE. + + 'acpi_osi=' cannot be used in combination with other + 'acpi_osi=' command lines, the _OSI method will not + exist in the ACPI namespace. NOTE that such command can + only affect the _OSI support state, thus specifying it + multiple times through kernel command line is also + meaningless. + Examples: + 1. 'acpi_osi=' can make 'CondRefOf(_OSI, Local1)' + FALSE. + + 'acpi_osi=!*' can be used in combination with single or + multiple 'acpi_osi="string1"' to support specific + string(s). Note that such command can affect the + current state of both the OS vendor strings and the + feature group strings, thus specifying it multiple times + through kernel command line is meaningful. But it may + still not able to affect the final state of a string if + there are quirks related to this string. This command + is useful when one want to control the state of the + feature group strings to debug BIOS issues related to + the OSPM features. + Examples: + 1. 'acpi_osi="Module Device" acpi_osi=!*' can make + '_OSI("Module Device")' FALSE. + 2. 'acpi_osi=!* acpi_osi="Module Device"' can make + '_OSI("Module Device")' TRUE. + 3. 'acpi_osi=! acpi_osi=!* acpi_osi="Windows 2000"' is + equivalent to + 'acpi_osi=!* acpi_osi=! acpi_osi="Windows 2000"' + and + 'acpi_osi=!* acpi_osi="Windows 2000" acpi_osi=!', + they all will make '_OSI("Windows 2000")' TRUE. + + acpi_pm_good [X86] + Override the pmtimer bug detection: force the kernel + to assume that this machine's pmtimer latches its value + and always returns good values. + + acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode + Format: { level | edge | high | low } + + acpi_skip_timer_override [HW,ACPI] + Recognize and ignore IRQ0/pin2 Interrupt Override. + For broken nForce2 BIOS resulting in XT-PIC timer. + + acpi_sleep= [HW,ACPI] Sleep options + Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, + old_ordering, nonvs, sci_force_enable, nobl } + See Documentation/power/video.txt for information on + s3_bios and s3_mode. + s3_beep is for debugging; it makes the PC's speaker beep + as soon as the kernel's real-mode entry point is called. + s4_nohwsig prevents ACPI hardware signature from being + used during resume from hibernation. + old_ordering causes the ACPI 1.0 ordering of the _PTS + control method, with respect to putting devices into + low power states, to be enforced (the ACPI 2.0 ordering + of _PTS is used by default). + nonvs prevents the kernel from saving/restoring the + ACPI NVS memory during suspend/hibernation and resume. + sci_force_enable causes the kernel to set SCI_EN directly + on resume from S1/S3 (which is against the ACPI spec, + but some broken systems don't work without it). + nobl causes the internal blacklist of systems known to + behave incorrectly in some ways with respect to system + suspend and resume to be ignored (use wisely). + + acpi_use_timer_override [HW,ACPI] + Use timer override. For some broken Nvidia NF5 boards + that require a timer override, but don't have HPET + + add_efi_memmap [EFI; X86] Include EFI memory map in + kernel's map of available physical RAM. + + agp= [AGP] + { off | try_unsupported } + off: disable AGP support + try_unsupported: try to drive unsupported chipsets + (may crash computer or cause data corruption) + + ALSA [HW,ALSA] + See Documentation/sound/alsa-configuration.rst + + alignment= [KNL,ARM] + Allow the default userspace alignment fault handler + behaviour to be specified. Bit 0 enables warnings, + bit 1 enables fixups, and bit 2 sends a segfault. + + align_va_addr= [X86-64] + Align virtual addresses by clearing slice [14:12] when + allocating a VMA at process creation time. This option + gives you up to 3% performance improvement on AMD F15h + machines (where it is enabled by default) for a + CPU-intensive style benchmark, and it can vary highly in + a microbenchmark depending on workload and compiler. + + 32: only for 32-bit processes + 64: only for 64-bit processes + on: enable for both 32- and 64-bit processes + off: disable for both 32- and 64-bit processes + + alloc_snapshot [FTRACE] + Allocate the ftrace snapshot buffer on boot up when the + main buffer is allocated. This is handy if debugging + and you need to use tracing_snapshot() on boot up, and + do not want to use tracing_snapshot_alloc() as it needs + to be done where GFP_KERNEL allocations are allowed. + + amd_iommu= [HW,X86-64] + Pass parameters to the AMD IOMMU driver in the system. + Possible values are: + fullflush - enable flushing of IO/TLB entries when + they are unmapped. Otherwise they are + flushed before they will be reused, which + is a lot of faster + off - do not initialize any AMD IOMMU found in + the system + force_isolation - Force device isolation for all + devices. The IOMMU driver is not + allowed anymore to lift isolation + requirements as needed. This option + does not override iommu=pt + + amd_iommu_dump= [HW,X86-64] + Enable AMD IOMMU driver option to dump the ACPI table + for AMD IOMMU. With this option enabled, AMD IOMMU + driver will print ACPI tables for AMD IOMMU during + IOMMU initialization. + + amd_iommu_intr= [HW,X86-64] + Specifies one of the following AMD IOMMU interrupt + remapping modes: + legacy - Use legacy interrupt remapping mode. + vapic - Use virtual APIC mode, which allows IOMMU + to inject interrupts directly into guest. + This mode requires kvm-amd.avic=1. + (Default when IOMMU HW support is present.) + + amijoy.map= [HW,JOY] Amiga joystick support + Map of devices attached to JOY0DAT and JOY1DAT + Format: <a>,<b> + See also Documentation/input/joydev/joystick.rst + + analog.map= [HW,JOY] Analog joystick and gamepad support + Specifies type or capabilities of an analog joystick + connected to one of 16 gameports + Format: <type1>,<type2>,..<type16> + + apc= [HW,SPARC] + Power management functions (SPARCstation-4/5 + deriv.) + Format: noidle + Disable APC CPU standby support. SPARCstation-Fox does + not play well with APC CPU idle - disable it if you have + APC and your system crashes randomly. + + apic= [APIC,X86] Advanced Programmable Interrupt Controller + Change the output verbosity whilst booting + Format: { quiet (default) | verbose | debug } + Change the amount of debugging information output + when initialising the APIC and IO-APIC components. + For X86-32, this can also be used to specify an APIC + driver name. + Format: apic=driver_name + Examples: apic=bigsmp + + apic_extnmi= [APIC,X86] External NMI delivery setting + Format: { bsp (default) | all | none } + bsp: External NMI is delivered only to CPU 0 + all: External NMIs are broadcast to all CPUs as a + backup of CPU 0 + none: External NMI is masked for all CPUs. This is + useful so that a dump capture kernel won't be + shot down by NMI + + autoconf= [IPV6] + See Documentation/networking/ipv6.txt. + + show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller + Limit apic dumping. The parameter defines the maximal + number of local apics being dumped. Also it is possible + to set it to "all" by meaning -- no limit here. + Format: { 1 (default) | 2 | ... | all }. + The parameter valid if only apic=debug or + apic=verbose is specified. + Example: apic=debug show_lapic=all + + apm= [APM] Advanced Power Management + See header of arch/x86/kernel/apm_32.c. + + arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards + Format: <io>,<irq>,<nodeID> + + ataflop= [HW,M68k] + + atarimouse= [HW,MOUSE] Atari Mouse + + atkbd.extra= [HW] Enable extra LEDs and keys on IBM RapidAccess, + EzKey and similar keyboards + + atkbd.reset= [HW] Reset keyboard during initialization + + atkbd.set= [HW] Select keyboard code set + Format: <int> (2 = AT (default), 3 = PS/2) + + atkbd.scroll= [HW] Enable scroll wheel on MS Office and similar + keyboards + + atkbd.softraw= [HW] Choose between synthetic and real raw mode + Format: <bool> (0 = real, 1 = synthetic (default)) + + atkbd.softrepeat= [HW] + Use software keyboard repeat + + audit= [KNL] Enable the audit sub-system + Format: { "0" | "1" | "off" | "on" } + 0 | off - kernel audit is disabled and can not be + enabled until the next reboot + unset - kernel audit is initialized but disabled and + will be fully enabled by the userspace auditd. + 1 | on - kernel audit is initialized and partially + enabled, storing at most audit_backlog_limit + messages in RAM until it is fully enabled by the + userspace auditd. + Default: unset + + audit_backlog_limit= [KNL] Set the audit queue size limit. + Format: <int> (must be >=0) + Default: 64 + + bau= [X86_UV] Enable the BAU on SGI UV. The default + behavior is to disable the BAU (i.e. bau=0). + Format: { "0" | "1" } + 0 - Disable the BAU. + 1 - Enable the BAU. + unset - Disable the BAU. + + baycom_epp= [HW,AX25] + Format: <io>,<mode> + + baycom_par= [HW,AX25] BayCom Parallel Port AX.25 Modem + Format: <io>,<mode> + See header of drivers/net/hamradio/baycom_par.c. + + baycom_ser_fdx= [HW,AX25] + BayCom Serial Port AX.25 Modem (Full Duplex Mode) + Format: <io>,<irq>,<mode>[,<baud>] + See header of drivers/net/hamradio/baycom_ser_fdx.c. + + baycom_ser_hdx= [HW,AX25] + BayCom Serial Port AX.25 Modem (Half Duplex Mode) + Format: <io>,<irq>,<mode> + See header of drivers/net/hamradio/baycom_ser_hdx.c. + + blkdevparts= Manual partition parsing of block device(s) for + embedded devices based on command line input. + See Documentation/block/cmdline-partition.txt + + boot_delay= Milliseconds to delay each printk during boot. + Values larger than 10 seconds (10000) are changed to + no delay (0). + Format: integer + + bootmem_debug [KNL] Enable bootmem allocator debug messages. + + bert_disable [ACPI] + Disable BERT OS support on buggy BIOSes. + + bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards) + bttv.radio= Most important insmod options are available as + kernel args too. + bttv.pll= See Documentation/media/v4l-drivers/bttv.rst + bttv.tuner= + + bulk_remove=off [PPC] This parameter disables the use of the pSeries + firmware feature for flushing multiple hpte entries + at a time. + + c101= [NET] Moxa C101 synchronous serial card + + cachesize= [BUGS=X86-32] Override level 2 CPU cache size detection. + Sometimes CPU hardware bugs make them report the cache + size incorrectly. The kernel will attempt work arounds + to fix known problems, but for some CPUs it is not + possible to determine what the correct size should be. + This option provides an override for these situations. + + ca_keys= [KEYS] This parameter identifies a specific key(s) on + the system trusted keyring to be used for certificate + trust validation. + format: { id:<keyid> | builtin } + + cca= [MIPS] Override the kernel pages' cache coherency + algorithm. Accepted values range from 0 to 7 + inclusive. See arch/mips/include/asm/pgtable-bits.h + for platform specific values (SB1, Loongson3 and + others). + + ccw_timeout_log [S390] + See Documentation/s390/CommonIO for details. + + cgroup_disable= [KNL] Disable a particular controller + Format: {name of the controller(s) to disable} + The effects of cgroup_disable=foo are: + - foo isn't auto-mounted if you mount all cgroups in + a single hierarchy + - foo isn't visible as an individually mountable + subsystem + {Currently only "memory" controller deal with this and + cut the overhead, others just disable the usage. So + only cgroup_disable=memory is actually worthy} + + cgroup_no_v1= [KNL] Disable one, multiple, all cgroup controllers in v1 + Format: { controller[,controller...] | "all" } + Like cgroup_disable, but only applies to cgroup v1; + the blacklisted controllers remain available in cgroup2. + + cgroup.memory= [KNL] Pass options to the cgroup memory controller. + Format: <string> + nosocket -- Disable socket memory accounting. + nokmem -- Disable kernel memory accounting. + + checkreqprot [SELINUX] Set initial checkreqprot flag value. + Format: { "0" | "1" } + See security/selinux/Kconfig help text. + 0 -- check protection applied by kernel (includes + any implied execute protection). + 1 -- check protection requested by application. + Default value is set via a kernel config option. + Value can be changed at runtime via + /selinux/checkreqprot. + + cio_ignore= [S390] + See Documentation/s390/CommonIO for details. + clk_ignore_unused + [CLK] + Prevents the clock framework from automatically gating + clocks that have not been explicitly enabled by a Linux + device driver but are enabled in hardware at reset or + by the bootloader/firmware. Note that this does not + force such clocks to be always-on nor does it reserve + those clocks in any way. This parameter is useful for + debug and development, but should not be needed on a + platform with proper driver support. For more + information, see Documentation/driver-api/clk.rst. + + clock= [BUGS=X86-32, HW] gettimeofday clocksource override. + [Deprecated] + Forces specified clocksource (if available) to be used + when calculating gettimeofday(). If specified + clocksource is not available, it defaults to PIT. + Format: { pit | tsc | cyclone | pmtmr } + + clocksource= Override the default clocksource + Format: <string> + Override the default clocksource and use the clocksource + with the name specified. + Some clocksource names to choose from, depending on + the platform: + [all] jiffies (this is the base, fallback clocksource) + [ACPI] acpi_pm + [ARM] imx_timer1,OSTS,netx_timer,mpu_timer2, + pxa_timer,timer3,32k_counter,timer0_1 + [X86-32] pit,hpet,tsc; + scx200_hrt on Geode; cyclone on IBM x440 + [MIPS] MIPS + [PARISC] cr16 + [S390] tod + [SH] SuperH + [SPARC64] tick + [X86-64] hpet,tsc + + clocksource.arm_arch_timer.evtstrm= + [ARM,ARM64] + Format: <bool> + Enable/disable the eventstream feature of the ARM + architected timer so that code using WFE-based polling + loops can be debugged more effectively on production + systems. + + clocksource.max_cswd_read_retries= [KNL] + Number of clocksource_watchdog() retries due to + external delays before the clock will be marked + unstable. Defaults to three retries, that is, + four attempts to read the clock under test. + + clearcpuid=BITNUM[,BITNUM...] [X86] + Disable CPUID feature X for the kernel. See + arch/x86/include/asm/cpufeatures.h for the valid bit + numbers. Note the Linux specific bits are not necessarily + stable over kernel options, but the vendor specific + ones should be. + Also note that user programs calling CPUID directly + or using the feature without checking anything + will still see it. This just prevents it from + being used by the kernel or shown in /proc/cpuinfo. + Also note the kernel might malfunction if you disable + some critical bits. + + cma=nn[MG]@[start[MG][-end[MG]]] + [ARM,X86,KNL] + Sets the size of kernel global memory area for + contiguous memory allocations and optionally the + placement constraint by the physical address range of + memory allocations. A value of 0 disables CMA + altogether. For more information, see + include/linux/dma-contiguous.h + + cmo_free_hint= [PPC] Format: { yes | no } + Specify whether pages are marked as being inactive + when they are freed. This is used in CMO environments + to determine OS memory pressure for page stealing by + a hypervisor. + Default: yes + + coherent_pool=nn[KMG] [ARM,KNL] + Sets the size of memory pool for coherent, atomic dma + allocations, by default set to 256K. + + com20020= [HW,NET] ARCnet - COM20020 chipset + Format: + <io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]] + + com90io= [HW,NET] ARCnet - COM90xx chipset (IO-mapped buffers) + Format: <io>[,<irq>] + + com90xx= [HW,NET] + ARCnet - COM90xx chipset (memory-mapped buffers) + Format: <io>[,<irq>[,<memstart>]] + + condev= [HW,S390] console device + conmode= + + console= [KNL] Output console device and options. + + tty<n> Use the virtual console device <n>. + + ttyS<n>[,options] + ttyUSB0[,options] + Use the specified serial port. The options are of + the form "bbbbpnf", where "bbbb" is the baud rate, + "p" is parity ("n", "o", or "e"), "n" is number of + bits, and "f" is flow control ("r" for RTS or + omit it). Default is "9600n8". + + See Documentation/admin-guide/serial-console.rst for more + information. See + Documentation/networking/netconsole.txt for an + alternative. + + uart[8250],io,<addr>[,options] + uart[8250],mmio,<addr>[,options] + uart[8250],mmio16,<addr>[,options] + uart[8250],mmio32,<addr>[,options] + uart[8250],0x<addr>[,options] + Start an early, polled-mode console on the 8250/16550 + UART at the specified I/O port or MMIO address, + switching to the matching ttyS device later. + MMIO inter-register address stride is either 8-bit + (mmio), 16-bit (mmio16), or 32-bit (mmio32). + If none of [io|mmio|mmio16|mmio32], <addr> is assumed + to be equivalent to 'mmio'. 'options' are specified in + the same format described for ttyS above; if unspecified, + the h/w is not re-initialized. + + hvc<n> Use the hypervisor console device <n>. This is for + both Xen and PowerPC hypervisors. + + If the device connected to the port is not a TTY but a braille + device, prepend "brl," before the device type, for instance + console=brl,ttyS0 + For now, only VisioBraille is supported. + + console_msg_format= + [KNL] Change console messages format + default + By default we print messages on consoles in + "[time stamp] text\n" format (time stamp may not be + printed, depending on CONFIG_PRINTK_TIME or + `printk_time' param). + syslog + Switch to syslog format: "<%u>[time stamp] text\n" + IOW, each message will have a facility and loglevel + prefix. The format is similar to one used by syslog() + syscall, or to executing "dmesg -S --raw" or to reading + from /proc/kmsg. + + consoleblank= [KNL] The console blank (screen saver) timeout in + seconds. A value of 0 disables the blank timer. + Defaults to 0. + + coredump_filter= + [KNL] Change the default value for + /proc/<pid>/coredump_filter. + See also Documentation/filesystems/proc.txt. + + coresight_cpu_debug.enable + [ARM,ARM64] + Format: <bool> + Enable/disable the CPU sampling based debugging. + 0: default value, disable debugging + 1: enable debugging at boot time + + cpuidle.off=1 [CPU_IDLE] + disable the cpuidle sub-system + + cpufreq.off=1 [CPU_FREQ] + disable the cpufreq sub-system + + cpu_init_udelay=N + [X86] Delay for N microsec between assert and de-assert + of APIC INIT to start processors. This delay occurs + on every CPU online, such as boot, and resume from suspend. + Default: 10000 + + cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver + Format: + <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>] + + crashkernel=size[KMG][@offset[KMG]] + [KNL] Using kexec, Linux can switch to a 'crash kernel' + upon panic. This parameter reserves the physical + memory region [offset, offset + size] for that kernel + image. If '@offset' is omitted, then a suitable offset + is selected automatically. Check + Documentation/kdump/kdump.txt for further details. + + crashkernel=range1:size1[,range2:size2,...][@offset] + [KNL] Same as above, but depends on the memory + in the running system. The syntax of range is + start-[end] where start and end are both + a memory unit (amount[KMG]). See also + Documentation/kdump/kdump.txt for an example. + + crashkernel=size[KMG],high + [KNL, x86_64] range could be above 4G. Allow kernel + to allocate physical memory region from top, so could + be above 4G if system have more than 4G ram installed. + Otherwise memory region will be allocated below 4G, if + available. + It will be ignored if crashkernel=X is specified. + crashkernel=size[KMG],low + [KNL, x86_64] range under 4G. When crashkernel=X,high + is passed, kernel could allocate physical memory region + above 4G, that cause second kernel crash on system + that require some amount of low memory, e.g. swiotlb + requires at least 64M+32K low memory, also enough extra + low memory is needed to make sure DMA buffers for 32-bit + devices won't run out. Kernel would try to allocate at + at least 256M below 4G automatically. + This one let user to specify own low range under 4G + for second kernel instead. + 0: to disable low allocation. + It will be ignored when crashkernel=X,high is not used + or memory reserved is below 4G. + + cryptomgr.notests + [KNL] Disable crypto self-tests + + cs89x0_dma= [HW,NET] + Format: <dma> + + cs89x0_media= [HW,NET] + Format: { rj45 | aui | bnc } + + dasd= [HW,NET] + See header of drivers/s390/block/dasd_devmap.c. + + db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port + (one device per port) + Format: <port#>,<type> + See also Documentation/input/devices/joystick-parport.rst + + ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot + time. See + Documentation/admin-guide/dynamic-debug-howto.rst for + details. Deprecated, see dyndbg. + + debug [KNL] Enable kernel debugging (events log level). + + debug_boot_weak_hash + [KNL] Enable printing [hashed] pointers early in the + boot sequence. If enabled, we use a weak hash instead + of siphash to hash pointers. Use this option if you are + seeing instances of '(___ptrval___)') and need to see a + value (hashed pointer) instead. Cryptographically + insecure, please do not use on production kernels. + + debug_locks_verbose= + [KNL] verbose self-tests + Format=<0|1> + Print debugging info while doing the locking API + self-tests. + We default to 0 (no extra messages), setting it to + 1 will print _a lot_ more information - normally + only useful to kernel developers. + + debug_objects [KNL] Enable object debugging + + no_debug_objects + [KNL] Disable object debugging + + debug_guardpage_minorder= + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this + parameter allows control of the order of pages that will + be intentionally kept free (and hence protected) by the + buddy allocator. Bigger value increase the probability + of catching random memory corruption, but reduce the + amount of memory for normal system use. The maximum + possible value is MAX_ORDER/2. Setting this parameter + to 1 or 2 should be enough to identify most random + memory corruption problems caused by bugs in kernel or + driver code when a CPU writes to (or reads from) a + random memory location. Note that there exists a class + of memory corruptions problems caused by buggy H/W or + F/W or by drivers badly programing DMA (basically when + memory is written at bus level and the CPU MMU is + bypassed) which are not detectable by + CONFIG_DEBUG_PAGEALLOC, hence this option will not help + tracking down these problems. + + debug_pagealloc= + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this + parameter enables the feature at boot time. In + default, it is disabled. We can avoid allocating huge + chunk of memory for debug pagealloc if we don't enable + it at boot time and the system will work mostly same + with the kernel built without CONFIG_DEBUG_PAGEALLOC. + on: enable the feature + + debugpat [X86] Enable PAT debugging + + decnet.addr= [HW,NET] + Format: <area>[,<node>] + See also Documentation/networking/decnet.txt. + + default_hugepagesz= + [same as hugepagesz=] The size of the default + HugeTLB page size. This is the size represented by + the legacy /proc/ hugepages APIs, used for SHM, and + default size when mounting hugetlbfs filesystems. + Defaults to the default architecture's huge page size + if not specified. + + deferred_probe_timeout= + [KNL] Debugging option to set a timeout in seconds for + deferred probe to give up waiting on dependencies to + probe. Only specific dependencies (subsystems or + drivers) that have opted in will be ignored. A timeout of 0 + will timeout at the end of initcalls. This option will also + dump out devices still on the deferred probe list after + retrying. + + dhash_entries= [KNL] + Set number of hash buckets for dentry cache. + + disable_1tb_segments [PPC] + Disables the use of 1TB hash page table segments. This + causes the kernel to fall back to 256MB segments which + can be useful when debugging issues that require an SLB + miss to occur. + + disable= [IPV6] + See Documentation/networking/ipv6.txt. + + hardened_usercopy= + [KNL] Under CONFIG_HARDENED_USERCOPY, whether + hardening is enabled for this boot. Hardened + usercopy checking is used to protect the kernel + from reading or writing beyond known memory + allocation boundaries as a proactive defense + against bounds-checking flaws in the kernel's + copy_to_user()/copy_from_user() interface. + on Perform hardened usercopy checks (default). + off Disable hardened usercopy checks. + + disable_radix [PPC] + Disable RADIX MMU mode on POWER9 + + disable_cpu_apicid= [X86,APIC,SMP] + Format: <int> + The number of initial APIC ID for the + corresponding CPU to be disabled at boot, + mostly used for the kdump 2nd kernel to + disable BSP to wake up multiple CPUs without + causing system reset or hang due to sending + INIT from AP to BSP. + + disable_ddw [PPC/PSERIES] + Disable Dynamic DMA Window support. Use this if + to workaround buggy firmware. + + disable_ipv6= [IPV6] + See Documentation/networking/ipv6.txt. + + disable_mtrr_cleanup [X86] + The kernel tries to adjust MTRR layout from continuous + to discrete, to make X server driver able to add WB + entry later. This parameter disables that. + + disable_mtrr_trim [X86, Intel and AMD only] + By default the kernel will trim any uncacheable + memory out of your available memory pool based on + MTRR settings. This parameter disables that behavior, + possibly causing your machine to run very slowly. + + disable_timer_pin_1 [X86] + Disable PIN 1 of APIC timer + Can be useful to work around chipset bugs. + + dis_ucode_ldr [X86] Disable the microcode loader. + + dma_debug=off If the kernel is compiled with DMA_API_DEBUG support, + this option disables the debugging code at boot. + + dma_debug_entries=<number> + This option allows to tune the number of preallocated + entries for DMA-API debugging code. One entry is + required per DMA-API allocation. Use this if the + DMA-API debugging code disables itself because the + architectural default is too low. + + dma_debug_driver=<driver_name> + With this option the DMA-API debugging driver + filter feature can be enabled at boot time. Just + pass the driver to filter for as the parameter. + The filter can be disabled or changed to another + driver later using sysfs. + + drm.edid_firmware=[<connector>:]<file>[,[<connector>:]<file>] + Broken monitors, graphic adapters, KVMs and EDIDless + panels may send no or incorrect EDID data sets. + This parameter allows to specify an EDID data sets + in the /lib/firmware directory that are used instead. + Generic built-in EDID data sets are used, if one of + edid/1024x768.bin, edid/1280x1024.bin, + edid/1680x1050.bin, or edid/1920x1080.bin is given + and no file with the same name exists. Details and + instructions how to build your own EDID data are + available in Documentation/EDID/HOWTO.txt. An EDID + data set will only be used for a particular connector, + if its name and a colon are prepended to the EDID + name. Each connector may use a unique EDID data + set by separating the files with a comma. An EDID + data set with no connector name will be used for + any connectors not explicitly specified. + + dscc4.setup= [NET] + + dt_cpu_ftrs= [PPC] + Format: {"off" | "known"} + Control how the dt_cpu_ftrs device-tree binding is + used for CPU feature discovery and setup (if it + exists). + off: Do not use it, fall back to legacy cpu table. + known: Do not pass through unknown features to guests + or userspace, only those that the kernel is aware of. + + dump_apple_properties [X86] + Dump name and content of EFI device properties on + x86 Macs. Useful for driver authors to determine + what data is available or for reverse-engineering. + + dyndbg[="val"] [KNL,DYNAMIC_DEBUG] + module.dyndbg[="val"] + Enable debug messages at boot time. See + Documentation/admin-guide/dynamic-debug-howto.rst + for details. + + nompx [X86] Disables Intel Memory Protection Extensions. + See Documentation/x86/intel_mpx.txt for more + information about the feature. + + nopku [X86] Disable Memory Protection Keys CPU feature found + in some Intel CPUs. + + module.async_probe [KNL] + Enable asynchronous probe on this module. + + early_ioremap_debug [KNL] + Enable debug messages in early_ioremap support. This + is useful for tracking down temporary early mappings + which are not unmapped. + + earlycon= [KNL] Output early console device and options. + + [ARM64] The early console is determined by the + stdout-path property in device tree's chosen node, + or determined by the ACPI SPCR table. + + [X86] When used with no options the early console is + determined by the ACPI SPCR table. + + cdns,<addr>[,options] + Start an early, polled-mode console on a Cadence + (xuartps) serial port at the specified address. Only + supported option is baud rate. If baud rate is not + specified, the serial port must already be setup and + configured. + + uart[8250],io,<addr>[,options] + uart[8250],mmio,<addr>[,options] + uart[8250],mmio32,<addr>[,options] + uart[8250],mmio32be,<addr>[,options] + uart[8250],0x<addr>[,options] + Start an early, polled-mode console on the 8250/16550 + UART at the specified I/O port or MMIO address. + MMIO inter-register address stride is either 8-bit + (mmio) or 32-bit (mmio32 or mmio32be). + If none of [io|mmio|mmio32|mmio32be], <addr> is assumed + to be equivalent to 'mmio'. 'options' are specified + in the same format described for "console=ttyS<n>"; if + unspecified, the h/w is not initialized. + + pl011,<addr> + pl011,mmio32,<addr> + Start an early, polled-mode console on a pl011 serial + port at the specified address. The pl011 serial port + must already be setup and configured. Options are not + yet supported. If 'mmio32' is specified, then only + the driver will use only 32-bit accessors to read/write + the device registers. + + meson,<addr> + Start an early, polled-mode console on a meson serial + port at the specified address. The serial port must + already be setup and configured. Options are not yet + supported. + + msm_serial,<addr> + Start an early, polled-mode console on an msm serial + port at the specified address. The serial port + must already be setup and configured. Options are not + yet supported. + + msm_serial_dm,<addr> + Start an early, polled-mode console on an msm serial + dm port at the specified address. The serial port + must already be setup and configured. Options are not + yet supported. + + owl,<addr> + Start an early, polled-mode console on a serial port + of an Actions Semi SoC, such as S500 or S900, at the + specified address. The serial port must already be + setup and configured. Options are not yet supported. + + smh Use ARM semihosting calls for early console. + + s3c2410,<addr> + s3c2412,<addr> + s3c2440,<addr> + s3c6400,<addr> + s5pv210,<addr> + exynos4210,<addr> + Use early console provided by serial driver available + on Samsung SoCs, requires selecting proper type and + a correct base address of the selected UART port. The + serial port must already be setup and configured. + Options are not yet supported. + + lantiq,<addr> + Start an early, polled-mode console on a lantiq serial + (lqasc) port at the specified address. The serial port + must already be setup and configured. Options are not + yet supported. + + lpuart,<addr> + lpuart32,<addr> + Use early console provided by Freescale LP UART driver + found on Freescale Vybrid and QorIQ LS1021A processors. + A valid base address must be provided, and the serial + port must already be setup and configured. + + ar3700_uart,<addr> + Start an early, polled-mode console on the + Armada 3700 serial port at the specified + address. The serial port must already be setup + and configured. Options are not yet supported. + + qcom_geni,<addr> + Start an early, polled-mode console on a Qualcomm + Generic Interface (GENI) based serial port at the + specified address. The serial port must already be + setup and configured. Options are not yet supported. + + earlyprintk= [X86,SH,ARM,M68k,S390] + earlyprintk=vga + earlyprintk=efi + earlyprintk=sclp + earlyprintk=xen + earlyprintk=serial[,ttySn[,baudrate]] + earlyprintk=serial[,0x...[,baudrate]] + earlyprintk=ttySn[,baudrate] + earlyprintk=dbgp[debugController#] + earlyprintk=pciserial[,force],bus:device.function[,baudrate] + earlyprintk=xdbc[xhciController#] + + earlyprintk is useful when the kernel crashes before + the normal console is initialized. It is not enabled by + default because it has some cosmetic problems. + + Append ",keep" to not disable it when the real console + takes over. + + Only one of vga, efi, serial, or usb debug port can + be used at a time. + + Currently only ttyS0 and ttyS1 may be specified by + name. Other I/O ports may be explicitly specified + on some architectures (x86 and arm at least) by + replacing ttySn with an I/O port address, like this: + earlyprintk=serial,0x1008,115200 + You can find the port for a given device in + /proc/tty/driver/serial: + 2: uart:ST16650V2 port:00001008 irq:18 ... + + Interaction with the standard serial driver is not + very good. + + The VGA and EFI output is eventually overwritten by + the real console. + + The xen output can only be used by Xen PV guests. + + The sclp output can only be used on s390. + + The optional "force" to "pciserial" enables use of a + PCI device even when its classcode is not of the + UART class. + + edac_report= [HW,EDAC] Control how to report EDAC event + Format: {"on" | "off" | "force"} + on: enable EDAC to report H/W event. May be overridden + by other higher priority error reporting module. + off: disable H/W event reporting through EDAC. + force: enforce the use of EDAC to report H/W event. + default: on. + + ekgdboc= [X86,KGDB] Allow early kernel console debugging + ekgdboc=kbd + + This is designed to be used in conjunction with + the boot argument: earlyprintk=vga + + edd= [EDD] + Format: {"off" | "on" | "skip[mbr]"} + + efi= [EFI] + Format: { "old_map", "nochunk", "noruntime", "debug" } + old_map [X86-64]: switch to the old ioremap-based EFI + runtime services mapping. 32-bit still uses this one by + default. + nochunk: disable reading files in "chunks" in the EFI + boot stub, as chunking can cause problems with some + firmware implementations. + noruntime : disable EFI runtime services support + debug: enable misc debug output + + efi_no_storage_paranoia [EFI; X86] + Using this parameter you can use more than 50% of + your efi variable storage. Use this parameter only if + you are really sure that your UEFI does sane gc and + fulfills the spec otherwise your board may brick. + + efi_fake_mem= nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86] + Add arbitrary attribute to specific memory range by + updating original EFI memory map. + Region of memory which aa attribute is added to is + from ss to ss+nn. + If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000 + is specified, EFI_MEMORY_MORE_RELIABLE(0x10000) + attribute is added to range 0x100000000-0x180000000 and + 0x10a0000000-0x1120000000. + + Using this parameter you can do debugging of EFI memmap + related feature. For example, you can do debugging of + Address Range Mirroring feature even if your box + doesn't support it. + + efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT + that is to be dynamically loaded by Linux. If there are + multiple variables with the same name but with different + vendor GUIDs, all of them will be loaded. See + Documentation/acpi/ssdt-overlays.txt for details. + + + eisa_irq_edge= [PARISC,HW] + See header of drivers/parisc/eisa.c. + + elanfreq= [X86-32] + See comment before function elanfreq_setup() in + arch/x86/kernel/cpu/cpufreq/elanfreq.c. + + elevator= [IOSCHED] + Format: {"cfq" | "deadline" | "noop"} + See Documentation/block/cfq-iosched.txt and + Documentation/block/deadline-iosched.txt for details. + + elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390] + Specifies physical address of start of kernel core + image elf header and optionally the size. Generally + kexec loader will pass this option to capture kernel. + See Documentation/kdump/kdump.txt for details. + + enable_mtrr_cleanup [X86] + The kernel tries to adjust MTRR layout from continuous + to discrete, to make X server driver able to add WB + entry later. This parameter enables that. + + enable_timer_pin_1 [X86] + Enable PIN 1 of APIC timer + Can be useful to work around chipset bugs + (in particular on some ATI chipsets). + The kernel tries to set a reasonable default. + + enforcing [SELINUX] Set initial enforcing status. + Format: {"0" | "1"} + See security/selinux/Kconfig help text. + 0 -- permissive (log only, no denials). + 1 -- enforcing (deny and log). + Default value is 0. + Value can be changed at runtime via /selinux/enforce. + + erst_disable [ACPI] + Disable Error Record Serialization Table (ERST) + support. + + ether= [HW,NET] Ethernet cards parameters + This option is obsoleted by the "netdev=" option, which + has equivalent usage. See its documentation for details. + + evm= [EVM] + Format: { "fix" } + Permit 'security.evm' to be updated regardless of + current integrity status. + + failslab= + fail_page_alloc= + fail_make_request=[KNL] + General fault injection mechanism. + Format: <interval>,<probability>,<space>,<times> + See also Documentation/fault-injection/. + + floppy= [HW] + See Documentation/blockdev/floppy.txt. + + force_pal_cache_flush + [IA-64] Avoid check_sal_cache_flush which may hang on + buggy SAL_CACHE_FLUSH implementations. Using this + parameter will force ia64_sal_cache_flush to call + ia64_pal_cache_flush instead of SAL_CACHE_FLUSH. + + forcepae [X86-32] + Forcefully enable Physical Address Extension (PAE). + Many Pentium M systems disable PAE but may have a + functionally usable PAE implementation. + Warning: use of this parameter will taint the kernel + and may cause unknown problems. + + ftrace=[tracer] + [FTRACE] will set and start the specified tracer + as early as possible in order to facilitate early + boot debugging. + + ftrace_dump_on_oops[=orig_cpu] + [FTRACE] will dump the trace buffers on oops. + If no parameter is passed, ftrace will dump + buffers of all CPUs, but if you pass orig_cpu, it will + dump only the buffer of the CPU that triggered the + oops. + + ftrace_filter=[function-list] + [FTRACE] Limit the functions traced by the function + tracer at boot up. function-list is a comma separated + list of functions. This list can be changed at run + time by the set_ftrace_filter file in the debugfs + tracing directory. + + ftrace_notrace=[function-list] + [FTRACE] Do not trace the functions specified in + function-list. This list can be changed at run time + by the set_ftrace_notrace file in the debugfs + tracing directory. + + ftrace_graph_filter=[function-list] + [FTRACE] Limit the top level callers functions traced + by the function graph tracer at boot up. + function-list is a comma separated list of functions + that can be changed at run time by the + set_graph_function file in the debugfs tracing directory. + + ftrace_graph_notrace=[function-list] + [FTRACE] Do not trace from the functions specified in + function-list. This list is a comma separated list of + functions that can be changed at run time by the + set_graph_notrace file in the debugfs tracing directory. + + ftrace_graph_max_depth=<uint> + [FTRACE] Used with the function graph tracer. This is + the max depth it will trace into a function. This value + can be changed at run time by the max_graph_depth file + in the tracefs tracing directory. default: 0 (no limit) + + gamecon.map[2|3]= + [HW,JOY] Multisystem joystick and NES/SNES/PSX pad + support via parallel port (up to 5 devices per port) + Format: <port#>,<pad1>,<pad2>,<pad3>,<pad4>,<pad5> + See also Documentation/input/devices/joystick-parport.rst + + gamma= [HW,DRM] + + gart_fix_e820= [X86_64] disable the fix e820 for K8 GART + Format: off | on + default: on + + gcov_persist= [GCOV] When non-zero (default), profiling data for + kernel modules is saved and remains accessible via + debugfs, even when the module is unloaded/reloaded. + When zero, profiling data is discarded and associated + debugfs files are removed at module unload time. + + goldfish [X86] Enable the goldfish android emulator platform. + Don't use this when you are not running on the + android emulator + + gpt [EFI] Forces disk with valid GPT signature but + invalid Protective MBR to be treated as GPT. If the + primary GPT is corrupted, it enables the backup/alternate + GPT to be used instead. + + grcan.enable0= [HW] Configuration of physical interface 0. Determines + the "Enable 0" bit of the configuration register. + Format: 0 | 1 + Default: 0 + grcan.enable1= [HW] Configuration of physical interface 1. Determines + the "Enable 0" bit of the configuration register. + Format: 0 | 1 + Default: 0 + grcan.select= [HW] Select which physical interface to use. + Format: 0 | 1 + Default: 0 + grcan.txsize= [HW] Sets the size of the tx buffer. + Format: <unsigned int> such that (txsize & ~0x1fffc0) == 0. + Default: 1024 + grcan.rxsize= [HW] Sets the size of the rx buffer. + Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0. + Default: 1024 + + gpio-mockup.gpio_mockup_ranges + [HW] Sets the ranges of gpiochip of for this device. + Format: <start1>,<end1>,<start2>,<end2>... + + hardlockup_all_cpu_backtrace= + [KNL] Should the hard-lockup detector generate + backtraces on all cpus. + Format: <integer> + + hashdist= [KNL,NUMA] Large hashes allocated during boot + are distributed across NUMA nodes. Defaults on + for 64-bit NUMA, off otherwise. + Format: 0 | 1 (for off | on) + + hcl= [IA-64] SGI's Hardware Graph compatibility layer + + hd= [EIDE] (E)IDE hard drive subsystem geometry + Format: <cyl>,<head>,<sect> + + hest_disable [ACPI] + Disable Hardware Error Source Table (HEST) support; + corresponding firmware-first mode error processing + logic will be disabled. + + highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact + size of <nn>. This works even on boxes that have no + highmem otherwise. This also works to reduce highmem + size on bigger boxes. + + highres= [KNL] Enable/disable high resolution timer mode. + Valid parameters: "on", "off" + Default: "on" + + hisax= [HW,ISDN] + See Documentation/isdn/README.HiSax. + + hlt [BUGS=ARM,SH] + + hpet= [X86-32,HPET] option to control HPET usage + Format: { enable (default) | disable | force | + verbose } + disable: disable HPET and use PIT instead + force: allow force enabled of undocumented chips (ICH4, + VIA, nVidia) + verbose: show contents of HPET registers during setup + + hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET + registers. Default set by CONFIG_HPET_MMAP_DEFAULT. + + hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot. + hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages. + On x86-64 and powerpc, this option can be specified + multiple times interleaved with hugepages= to reserve + huge pages of different sizes. Valid pages sizes on + x86-64 are 2M (when the CPU supports "pse") and 1G + (when the CPU supports the "pdpe1gb" cpuinfo flag). + + hung_task_panic= + [KNL] Should the hung task detector generate panics. + Format: <integer> + + A nonzero value instructs the kernel to panic when a + hung task is detected. The default value is controlled + by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time + option. The value selected by this boot parameter can + be changed later by the kernel.hung_task_panic sysctl. + + hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) + terminal devices. Valid values: 0..8 + hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs. + If specified, z/VM IUCV HVC accepts connections + from listed z/VM user IDs only. + keep_bootcon [KNL] + Do not unregister boot console at start. This is only + useful for debugging when something happens in the window + between unregistering the boot console and initializing + the real console. + + i2c_bus= [HW] Override the default board specific I2C bus speed + or register an additional I2C bus that is not + registered from board initialization code. + Format: + <bus_id>,<clkrate> + + i8042.debug [HW] Toggle i8042 debug mode + i8042.unmask_kbd_data + [HW] Enable printing of interrupt data from the KBD port + (disabled by default, and as a pre-condition + requires that i8042.debug=1 be enabled) + i8042.direct [HW] Put keyboard port into non-translated mode + i8042.dumbkbd [HW] Pretend that controller can only read data from + keyboard and cannot control its state + (Don't attempt to blink the leds) + i8042.noaux [HW] Don't check for auxiliary (== mouse) port + i8042.nokbd [HW] Don't check/create keyboard port + i8042.noloop [HW] Disable the AUX Loopback command while probing + for the AUX port + i8042.nomux [HW] Don't check presence of an active multiplexing + controller + i8042.nopnp [HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX + controllers + i8042.notimeout [HW] Ignore timeout condition signalled by controller + i8042.reset [HW] Reset the controller during init, cleanup and + suspend-to-ram transitions, only during s2r + transitions, or never reset + Format: { 1 | Y | y | 0 | N | n } + 1, Y, y: always reset controller + 0, N, n: don't ever reset controller + Default: only on s2r transitions on x86; most other + architectures force reset to be always executed + i8042.unlock [HW] Unlock (ignore) the keylock + i8042.kbdreset [HW] Reset device connected to KBD port + i8042.probe_defer + [HW] Allow deferred probing upon i8042 probe errors + + i810= [HW,DRM] + + i8k.ignore_dmi [HW] Continue probing hardware even if DMI data + indicates that the driver is running on unsupported + hardware. + i8k.force [HW] Activate i8k driver even if SMM BIOS signature + does not match list of supported models. + i8k.power_status + [HW] Report power status in /proc/i8k + (disabled by default) + i8k.restricted [HW] Allow controlling fans only if SYS_ADMIN + capability is set. + + i915.invert_brightness= + [DRM] Invert the sense of the variable that is used to + set the brightness of the panel backlight. Normally a + brightness value of 0 indicates backlight switched off, + and the maximum of the brightness value sets the backlight + to maximum brightness. If this parameter is set to 0 + (default) and the machine requires it, or this parameter + is set to 1, a brightness value of 0 sets the backlight + to maximum brightness, and the maximum of the brightness + value switches the backlight off. + -1 -- never invert brightness + 0 -- machine default + 1 -- force brightness inversion + + icn= [HW,ISDN] + Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]] + + ide-core.nodma= [HW] (E)IDE subsystem + Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc + .vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr + .cdrom .chs .ignore_cable are additional options + See Documentation/ide/ide.txt. + + ide-generic.probe-mask= [HW] (E)IDE subsystem + Format: <int> + Probe mask for legacy ISA IDE ports. Depending on + platform up to 6 ports are supported, enabled by + setting corresponding bits in the mask to 1. The + default value is 0x0, which has a special meaning. + On systems that have PCI, it triggers scanning the + PCI bus for the first and the second port, which + are then probed. On systems without PCI the value + of 0x0 enables probing the two first ports as if it + was 0x3. + + ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem + Claim all unknown PCI IDE storage controllers. + + idle= [X86] + Format: idle=poll, idle=halt, idle=nomwait + Poll forces a polling idle loop that can slightly + improve the performance of waking up a idle CPU, but + will use a lot of power and make the system run hot. + Not recommended. + idle=halt: Halt is forced to be used for CPU idle. + In such case C2/C3 won't be used again. + idle=nomwait: Disable mwait for CPU C-states + + ieee754= [MIPS] Select IEEE Std 754 conformance mode + Format: { strict | legacy | 2008 | relaxed } + Default: strict + + Choose which programs will be accepted for execution + based on the IEEE 754 NaN encoding(s) supported by + the FPU and the NaN encoding requested with the value + of an ELF file header flag individually set by each + binary. Hardware implementations are permitted to + support either or both of the legacy and the 2008 NaN + encoding mode. + + Available settings are as follows: + strict accept binaries that request a NaN encoding + supported by the FPU + legacy only accept legacy-NaN binaries, if supported + by the FPU + 2008 only accept 2008-NaN binaries, if supported + by the FPU + relaxed accept any binaries regardless of whether + supported by the FPU + + The FPU emulator is always able to support both NaN + encodings, so if no FPU hardware is present or it has + been disabled with 'nofpu', then the settings of + 'legacy' and '2008' strap the emulator accordingly, + 'relaxed' straps the emulator for both legacy-NaN and + 2008-NaN, whereas 'strict' enables legacy-NaN only on + legacy processors and both NaN encodings on MIPS32 or + MIPS64 CPUs. + + The setting for ABS.fmt/NEG.fmt instruction execution + mode generally follows that for the NaN encoding, + except where unsupported by hardware. + + ignore_loglevel [KNL] + Ignore loglevel setting - this will print /all/ + kernel messages to the console. Useful for debugging. + We also add it as printk module parameter, so users + could change it dynamically, usually by + /sys/module/printk/parameters/ignore_loglevel. + + ignore_rlimit_data + Ignore RLIMIT_DATA setting for data mappings, + print warning at first misuse. Can be changed via + /sys/module/kernel/parameters/ignore_rlimit_data. + + ihash_entries= [KNL] + Set number of hash buckets for inode cache. + + ima_appraise= [IMA] appraise integrity measurements + Format: { "off" | "enforce" | "fix" | "log" } + default: "enforce" + + ima_appraise_tcb [IMA] + The builtin appraise policy appraises all files + owned by uid=0. + + ima_canonical_fmt [IMA] + Use the canonical format for the binary runtime + measurements, instead of host native format. + + ima_hash= [IMA] + Format: { md5 | sha1 | rmd160 | sha256 | sha384 + | sha512 | ... } + default: "sha1" + + The list of supported hash algorithms is defined + in crypto/hash_info.h. + + ima_policy= [IMA] + The builtin policies to load during IMA setup. + Format: "tcb | appraise_tcb | secure_boot | + fail_securely" + + The "tcb" policy measures all programs exec'd, files + mmap'd for exec, and all files opened with the read + mode bit set by either the effective uid (euid=0) or + uid=0. + + The "appraise_tcb" policy appraises the integrity of + all files owned by root. (This is the equivalent + of ima_appraise_tcb.) + + The "secure_boot" policy appraises the integrity + of files (eg. kexec kernel image, kernel modules, + firmware, policy, etc) based on file signatures. + + The "fail_securely" policy forces file signature + verification failure also on privileged mounted + filesystems with the SB_I_UNVERIFIABLE_SIGNATURE + flag. + + ima_tcb [IMA] Deprecated. Use ima_policy= instead. + Load a policy which meets the needs of the Trusted + Computing Base. This means IMA will measure all + programs exec'd, files mmap'd for exec, and all files + opened for read by uid=0. + + ima_template= [IMA] + Select one of defined IMA measurements template formats. + Formats: { "ima" | "ima-ng" | "ima-sig" } + Default: "ima-ng" + + ima_template_fmt= + [IMA] Define a custom template format. + Format: { "field1|...|fieldN" } + + ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage + Format: <min_file_size> + Set the minimal file size for using asynchronous hash. + If left unspecified, ahash usage is disabled. + + ahash performance varies for different data sizes on + different crypto accelerators. This option can be used + to achieve the best performance for a particular HW. + + ima.ahash_bufsize= [IMA] Asynchronous hash buffer size + Format: <bufsize> + Set hashing buffer size. Default: 4k. + + ahash performance varies for different chunk sizes on + different crypto accelerators. This option can be used + to achieve best performance for particular HW. + + init= [KNL] + Format: <full_path> + Run specified binary instead of /sbin/init as init + process. + + initcall_debug [KNL] Trace initcalls as they are executed. Useful + for working out where the kernel is dying during + startup. + + initcall_blacklist= [KNL] Do not execute a comma-separated list of + initcall functions. Useful for debugging built-in + modules and initcalls. + + initrd= [BOOT] Specify the location of the initial ramdisk + + init_pkru= [x86] Specify the default memory protection keys rights + register contents for all processes. 0x55555554 by + default (disallow access to all but pkey 0). Can + override in debugfs after boot. + + inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver + Format: <irq> + + int_pln_enable [x86] Enable power limit notification interrupt + + integrity_audit=[IMA] + Format: { "0" | "1" } + 0 -- basic integrity auditing messages. (Default) + 1 -- additional integrity auditing messages. + + intel_iommu= [DMAR] Intel IOMMU driver (DMAR) option + on + Enable intel iommu driver. + off + Disable intel iommu driver. + igfx_off [Default Off] + By default, gfx is mapped as normal device. If a gfx + device has a dedicated DMAR unit, the DMAR unit is + bypassed by not enabling DMAR with this option. In + this case, gfx device will use physical address for + DMA. + forcedac [x86_64] + With this option iommu will not optimize to look + for io virtual address below 32-bit forcing dual + address cycle on pci bus for cards supporting greater + than 32-bit addressing. The default is to look + for translation below 32-bit and if not available + then look in the higher range. + strict [Default Off] + With this option on every unmap_single operation will + result in a hardware IOTLB flush operation as opposed + to batching them for performance. + sp_off [Default Off] + By default, super page will be supported if Intel IOMMU + has the capability. With this option, super page will + not be supported. + ecs_off [Default Off] + By default, extended context tables will be supported if + the hardware advertises that it has support both for the + extended tables themselves, and also PASID support. With + this option set, extended tables will not be used even + on hardware which claims to support them. + tboot_noforce [Default Off] + Do not force the Intel IOMMU enabled under tboot. + By default, tboot will force Intel IOMMU on, which + could harm performance of some high-throughput + devices like 40GBit network cards, even if identity + mapping is enabled. + Note that using this option lowers the security + provided by tboot because it makes the system + vulnerable to DMA attacks. + + intel_idle.max_cstate= [KNL,HW,ACPI,X86] + 0 disables intel_idle and fall back on acpi_idle. + 1 to 9 specify maximum depth of C-state. + + intel_pstate= [X86] + disable + Do not enable intel_pstate as the default + scaling driver for the supported processors + passive + Use intel_pstate as a scaling driver, but configure it + to work with generic cpufreq governors (instead of + enabling its internal governor). This mode cannot be + used along with the hardware-managed P-states (HWP) + feature. + force + Enable intel_pstate on systems that prohibit it by default + in favor of acpi-cpufreq. Forcing the intel_pstate driver + instead of acpi-cpufreq may disable platform features, such + as thermal controls and power capping, that rely on ACPI + P-States information being indicated to OSPM and therefore + should be used with caution. This option does not work with + processors that aren't supported by the intel_pstate driver + or on platforms that use pcc-cpufreq instead of acpi-cpufreq. + no_hwp + Do not enable hardware P state control (HWP) + if available. + hwp_only + Only load intel_pstate on systems which support + hardware P state control (HWP) if available. + support_acpi_ppc + Enforce ACPI _PPC performance limits. If the Fixed ACPI + Description Table, specifies preferred power management + profile as "Enterprise Server" or "Performance Server", + then this feature is turned on by default. + per_cpu_perf_limits + Allow per-logical-CPU P-State performance control limits using + cpufreq sysfs interface + + intremap= [X86-64, Intel-IOMMU] + on enable Interrupt Remapping (default) + off disable Interrupt Remapping + nosid disable Source ID checking + no_x2apic_optout + BIOS x2APIC opt-out request will be ignored + nopost disable Interrupt Posting + + iomem= Disable strict checking of access to MMIO memory + strict regions from userspace. + relaxed + + iommu= [x86] + off + force + noforce + biomerge + panic + nopanic + merge + nomerge + soft + pt [x86] + nopt [x86] + nobypass [PPC/POWERNV] + Disable IOMMU bypass, using IOMMU for PCI devices. + + iommu.passthrough= + [ARM64] Configure DMA to bypass the IOMMU by default. + Format: { "0" | "1" } + 0 - Use IOMMU translation for DMA. + 1 - Bypass the IOMMU for DMA. + unset - Use IOMMU translation for DMA. + + io7= [HW] IO7 for Marvel based alpha systems + See comment before marvel_specify_io7 in + arch/alpha/kernel/core_marvel.c. + + io_delay= [X86] I/O delay method + 0x80 + Standard port 0x80 based delay + 0xed + Alternate port 0xed based delay (needed on some systems) + udelay + Simple two microseconds delay + none + No delay + + ip= [IP_PNP] + See Documentation/filesystems/nfs/nfsroot.txt. + + irqaffinity= [SMP] Set the default irq affinity mask + The argument is a cpu list, as described above. + + irqchip.gicv2_force_probe= + [ARM, ARM64] + Format: <bool> + Force the kernel to look for the second 4kB page + of a GICv2 controller even if the memory range + exposed by the device tree is too small. + + irqchip.gicv3_nolpi= + [ARM, ARM64] + Force the kernel to ignore the availability of + LPIs (and by consequence ITSs). Intended for system + that use the kernel as a bootloader, and thus want + to let secondary kernels in charge of setting up + LPIs. + + irqfixup [HW] + When an interrupt is not handled search all handlers + for it. Intended to get systems with badly broken + firmware running. + + irqpoll [HW] + When an interrupt is not handled search all handlers + for it. Also check all handlers each timer + interrupt. Intended to get systems with badly broken + firmware running. + + isapnp= [ISAPNP] + Format: <RDP>,<reset>,<pci_scan>,<verbosity> + + isolcpus= [KNL,SMP,ISOL] Isolate a given set of CPUs from disturbance. + [Deprecated - use cpusets instead] + Format: [flag-list,]<cpu-list> + + Specify one or more CPUs to isolate from disturbances + specified in the flag list (default: domain): + + nohz + Disable the tick when a single task runs. + + A residual 1Hz tick is offloaded to workqueues, which you + need to affine to housekeeping through the global + workqueue's affinity configured via the + /sys/devices/virtual/workqueue/cpumask sysfs file, or + by using the 'domain' flag described below. + + NOTE: by default the global workqueue runs on all CPUs, + so to protect individual CPUs the 'cpumask' file has to + be configured manually after bootup. + + domain + Isolate from the general SMP balancing and scheduling + algorithms. Note that performing domain isolation this way + is irreversible: it's not possible to bring back a CPU to + the domains once isolated through isolcpus. It's strongly + advised to use cpusets instead to disable scheduler load + balancing through the "cpuset.sched_load_balance" file. + It offers a much more flexible interface where CPUs can + move in and out of an isolated set anytime. + + You can move a process onto or off an "isolated" CPU via + the CPU affinity syscalls or cpuset. + <cpu number> begins at 0 and the maximum value is + "number of CPUs in system - 1". + + The format of <cpu-list> is described above. + + + + iucv= [HW,NET] + + ivrs_ioapic [HW,X86_64] + Provide an override to the IOAPIC-ID<->DEVICE-ID + mapping provided in the IVRS ACPI table. For + example, to map IOAPIC-ID decimal 10 to + PCI device 00:14.0 write the parameter as: + ivrs_ioapic[10]=00:14.0 + + ivrs_hpet [HW,X86_64] + Provide an override to the HPET-ID<->DEVICE-ID + mapping provided in the IVRS ACPI table. For + example, to map HPET-ID decimal 0 to + PCI device 00:14.0 write the parameter as: + ivrs_hpet[0]=00:14.0 + + ivrs_acpihid [HW,X86_64] + Provide an override to the ACPI-HID:UID<->DEVICE-ID + mapping provided in the IVRS ACPI table. For + example, to map UART-HID:UID AMD0020:0 to + PCI device 00:14.5 write the parameter as: + ivrs_acpihid[00:14.5]=AMD0020:0 + + js= [HW,JOY] Analog joystick + See Documentation/input/joydev/joystick.rst. + + nokaslr [KNL] + When CONFIG_RANDOMIZE_BASE is set, this disables + kernel and module base offset ASLR (Address Space + Layout Randomization). + + kasan_multi_shot + [KNL] Enforce KASAN (Kernel Address Sanitizer) to print + report on every invalid memory access. Without this + parameter KASAN will print report only for the first + invalid access. + + keepinitrd [HW,ARM] + + kernelcore= [KNL,X86,IA-64,PPC] + Format: nn[KMGTPE] | nn% | "mirror" + This parameter specifies the amount of memory usable by + the kernel for non-movable allocations. The requested + amount is spread evenly throughout all nodes in the + system as ZONE_NORMAL. The remaining memory is used for + movable memory in its own zone, ZONE_MOVABLE. In the + event, a node is too small to have both ZONE_NORMAL and + ZONE_MOVABLE, kernelcore memory will take priority and + other nodes will have a larger ZONE_MOVABLE. + + ZONE_MOVABLE is used for the allocation of pages that + may be reclaimed or moved by the page migration + subsystem. Note that allocations like PTEs-from-HighMem + still use the HighMem zone if it exists, and the Normal + zone if it does not. + + It is possible to specify the exact amount of memory in + the form of "nn[KMGTPE]", a percentage of total system + memory in the form of "nn%", or "mirror". If "mirror" + option is specified, mirrored (reliable) memory is used + for non-movable allocations and remaining memory is used + for Movable pages. "nn[KMGTPE]", "nn%", and "mirror" + are exclusive, so you cannot specify multiple forms. + + kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port. + Format: <Controller#>[,poll interval] + The controller # is the number of the ehci usb debug + port as it is probed via PCI. The poll interval is + optional and is the number seconds in between + each poll cycle to the debug port in case you need + the functionality for interrupting the kernel with + gdb or control-c on the dbgp connection. When + not using this parameter you use sysrq-g to break into + the kernel debugger. + + kgdboc= [KGDB,HW] kgdb over consoles. + Requires a tty driver that supports console polling, + or a supported polling keyboard driver (non-usb). + Serial only format: <serial_device>[,baud] + keyboard only format: kbd + keyboard and serial format: kbd,<serial_device>[,baud] + Optional Kernel mode setting: + kms, kbd format: kms,kbd + kms, kbd and serial format: kms,kbd,<ser_dev>[,baud] + + kgdbwait [KGDB] Stop kernel execution and enter the + kernel debugger at the earliest opportunity. + + kmac= [MIPS] korina ethernet MAC address. + Configure the RouterBoard 532 series on-chip + Ethernet adapter MAC address. + + kmemleak= [KNL] Boot-time kmemleak enable/disable + Valid arguments: on, off + Default: on + Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y, + the default is off. + + kpti= [ARM64] Control page table isolation of user + and kernel address spaces. + Default: enabled on cores which need mitigation. + 0: force disabled + 1: force enabled + + kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs. + Default is 0 (don't ignore, but inject #GP) + + kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface. + Default is false (don't support). + + kvm.mmu_audit= [KVM] This is a R/W parameter which allows audit + KVM MMU at runtime. + Default is 0 (off) + + kvm.nx_huge_pages= + [KVM] Controls the software workaround for the + X86_BUG_ITLB_MULTIHIT bug. + force : Always deploy workaround. + off : Never deploy workaround. + auto : Deploy workaround based on the presence of + X86_BUG_ITLB_MULTIHIT. + + Default is 'auto'. + + If the software workaround is enabled for the host, + guests do need not to enable it for nested guests. + + kvm.nx_huge_pages_recovery_ratio= + [KVM] Controls how many 4KiB pages are periodically zapped + back to huge pages. 0 disables the recovery, otherwise if + the value is N KVM will zap 1/Nth of the 4KiB pages every + minute. The default is 60. + + kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM. + Default is 1 (enabled) + + kvm-amd.npt= [KVM,AMD] Disable nested paging (virtualized MMU) + for all guests. + Default is 1 (enabled) if in 64-bit or 32-bit PAE mode. + + kvm-arm.vgic_v3_group0_trap= + [KVM,ARM] Trap guest accesses to GICv3 group-0 + system registers + + kvm-arm.vgic_v3_group1_trap= + [KVM,ARM] Trap guest accesses to GICv3 group-1 + system registers + + kvm-arm.vgic_v3_common_trap= + [KVM,ARM] Trap guest accesses to GICv3 common + system registers + + kvm-arm.vgic_v4_enable= + [KVM,ARM] Allow use of GICv4 for direct injection of + LPIs. + + kvm-intel.ept= [KVM,Intel] Disable extended page tables + (virtualized MMU) support on capable Intel chips. + Default is 1 (enabled) + + kvm-intel.emulate_invalid_guest_state= + [KVM,Intel] Disable emulation of invalid guest state. + Ignored if kvm-intel.enable_unrestricted_guest=1, as + guest state is never invalid for unrestricted guests. + This param doesn't apply to nested guests (L2), as KVM + never emulates invalid L2 guest state. + Default is 1 (enabled) + + kvm-intel.flexpriority= + [KVM,Intel] Disable FlexPriority feature (TPR shadow). + Default is 1 (enabled) + + kvm-intel.nested= + [KVM,Intel] Enable VMX nesting (nVMX). + Default is 0 (disabled) + + kvm-intel.unrestricted_guest= + [KVM,Intel] Disable unrestricted guest feature + (virtualized real and unpaged mode) on capable + Intel chips. Default is 1 (enabled) + + kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault + CVE-2018-3620. + + Valid arguments: never, cond, always + + always: L1D cache flush on every VMENTER. + cond: Flush L1D on VMENTER only when the code between + VMEXIT and VMENTER can leak host memory. + never: Disables the mitigation + + Default is cond (do L1 cache flush in specific instances) + + kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification + feature (tagged TLBs) on capable Intel chips. + Default is 1 (enabled) + + l1tf= [X86] Control mitigation of the L1TF vulnerability on + affected CPUs + + The kernel PTE inversion protection is unconditionally + enabled and cannot be disabled. + + full + Provides all available mitigations for the + L1TF vulnerability. Disables SMT and + enables all mitigations in the + hypervisors, i.e. unconditional L1D flush. + + SMT control and L1D flush control via the + sysfs interface is still possible after + boot. Hypervisors will issue a warning + when the first VM is started in a + potentially insecure configuration, + i.e. SMT enabled or L1D flush disabled. + + full,force + Same as 'full', but disables SMT and L1D + flush runtime control. Implies the + 'nosmt=force' command line option. + (i.e. sysfs control of SMT is disabled.) + + flush + Leaves SMT enabled and enables the default + hypervisor mitigation, i.e. conditional + L1D flush. + + SMT control and L1D flush control via the + sysfs interface is still possible after + boot. Hypervisors will issue a warning + when the first VM is started in a + potentially insecure configuration, + i.e. SMT enabled or L1D flush disabled. + + flush,nosmt + + Disables SMT and enables the default + hypervisor mitigation. + + SMT control and L1D flush control via the + sysfs interface is still possible after + boot. Hypervisors will issue a warning + when the first VM is started in a + potentially insecure configuration, + i.e. SMT enabled or L1D flush disabled. + + flush,nowarn + Same as 'flush', but hypervisors will not + warn when a VM is started in a potentially + insecure configuration. + + off + Disables hypervisor mitigations and doesn't + emit any warnings. + It also drops the swap size and available + RAM limit restriction on both hypervisor and + bare metal. + + Default is 'flush'. + + For details see: Documentation/admin-guide/hw-vuln/l1tf.rst + + l2cr= [PPC] + + l3cr= [PPC] + + lapic [X86-32,APIC] Enable the local APIC even if BIOS + disabled it. + + lapic= [x86,APIC] "notscdeadline" Do not use TSC deadline + value for LAPIC timer one-shot implementation. Default + back to the programmable timer unit in the LAPIC. + + lapic_timer_c2_ok [X86,APIC] trust the local apic timer + in C2 power state. + + libata.dma= [LIBATA] DMA control + libata.dma=0 Disable all PATA and SATA DMA + libata.dma=1 PATA and SATA Disk DMA only + libata.dma=2 ATAPI (CDROM) DMA only + libata.dma=4 Compact Flash DMA only + Combinations also work, so libata.dma=3 enables DMA + for disks and CDROMs, but not CFs. + + libata.ignore_hpa= [LIBATA] Ignore HPA limit + libata.ignore_hpa=0 keep BIOS limits (default) + libata.ignore_hpa=1 ignore limits, using full disk + + libata.noacpi [LIBATA] Disables use of ACPI in libata suspend/resume + when set. + Format: <int> + + libata.force= [LIBATA] Force configurations. The format is comma + separated list of "[ID:]VAL" where ID is + PORT[.DEVICE]. PORT and DEVICE are decimal numbers + matching port, link or device. Basically, it matches + the ATA ID string printed on console by libata. If + the whole ID part is omitted, the last PORT and DEVICE + values are used. If ID hasn't been specified yet, the + configuration applies to all ports, links and devices. + + If only DEVICE is omitted, the parameter applies to + the port and all links and devices behind it. DEVICE + number of 0 either selects the first device or the + first fan-out link behind PMP device. It does not + select the host link. DEVICE number of 15 selects the + host link and device attached to it. + + The VAL specifies the configuration to force. As long + as there's no ambiguity shortcut notation is allowed. + For example, both 1.5 and 1.5G would work for 1.5Gbps. + The following configurations can be forced. + + * Cable type: 40c, 80c, short40c, unk, ign or sata. + Any ID with matching PORT is used. + + * SATA link speed limit: 1.5Gbps or 3.0Gbps. + + * Transfer mode: pio[0-7], mwdma[0-4] and udma[0-7]. + udma[/][16,25,33,44,66,100,133] notation is also + allowed. + + * [no]ncq: Turn on or off NCQ. + + * [no]ncqtrim: Turn off queued DSM TRIM. + + * nohrst, nosrst, norst: suppress hard, soft + and both resets. + + * rstonce: only attempt one reset during + hot-unplug link recovery + + * dump_id: dump IDENTIFY data. + + * atapi_dmadir: Enable ATAPI DMADIR bridge support + + * disable: Disable this device. + + If there are multiple matching configurations changing + the same attribute, the last one is used. + + memblock=debug [KNL] Enable memblock debug messages. + + load_ramdisk= [RAM] List of ramdisks to load from floppy + See Documentation/blockdev/ramdisk.txt. + + lockd.nlm_grace_period=P [NFS] Assign grace period. + Format: <integer> + + lockd.nlm_tcpport=N [NFS] Assign TCP port. + Format: <integer> + + lockd.nlm_timeout=T [NFS] Assign timeout value. + Format: <integer> + + lockd.nlm_udpport=M [NFS] Assign UDP port. + Format: <integer> + + locktorture.nreaders_stress= [KNL] + Set the number of locking read-acquisition kthreads. + Defaults to being automatically set based on the + number of online CPUs. + + locktorture.nwriters_stress= [KNL] + Set the number of locking write-acquisition kthreads. + + locktorture.onoff_holdoff= [KNL] + Set time (s) after boot for CPU-hotplug testing. + + locktorture.onoff_interval= [KNL] + Set time (s) between CPU-hotplug operations, or + zero to disable CPU-hotplug testing. + + locktorture.shuffle_interval= [KNL] + Set task-shuffle interval (jiffies). Shuffling + tasks allows some CPUs to go into dyntick-idle + mode during the locktorture test. + + locktorture.shutdown_secs= [KNL] + Set time (s) after boot system shutdown. This + is useful for hands-off automated testing. + + locktorture.stat_interval= [KNL] + Time (s) between statistics printk()s. + + locktorture.stutter= [KNL] + Time (s) to stutter testing, for example, + specifying five seconds causes the test to run for + five seconds, wait for five seconds, and so on. + This tests the locking primitive's ability to + transition abruptly to and from idle. + + locktorture.torture_type= [KNL] + Specify the locking implementation to test. + + locktorture.verbose= [KNL] + Enable additional printk() statements. + + logibm.irq= [HW,MOUSE] Logitech Bus Mouse Driver + Format: <irq> + + loglevel= All Kernel Messages with a loglevel smaller than the + console loglevel will be printed to the console. It can + also be changed with klogd or other programs. The + loglevels are defined as follows: + + 0 (KERN_EMERG) system is unusable + 1 (KERN_ALERT) action must be taken immediately + 2 (KERN_CRIT) critical conditions + 3 (KERN_ERR) error conditions + 4 (KERN_WARNING) warning conditions + 5 (KERN_NOTICE) normal but significant condition + 6 (KERN_INFO) informational + 7 (KERN_DEBUG) debug-level messages + + log_buf_len=n[KMG] Sets the size of the printk ring buffer, + in bytes. n must be a power of two and greater + than the minimal size. The minimal size is defined + by LOG_BUF_SHIFT kernel config parameter. There is + also CONFIG_LOG_CPU_MAX_BUF_SHIFT config parameter + that allows to increase the default size depending on + the number of CPUs. See init/Kconfig for more details. + + logo.nologo [FB] Disables display of the built-in Linux logo. + This may be used to provide more screen space for + kernel log messages and is useful when debugging + kernel boot problems. + + lp=0 [LP] Specify parallel ports to use, e.g, + lp=port[,port...] lp=none,parport0 (lp0 not configured, lp1 uses + lp=reset first parallel port). 'lp=0' disables the + lp=auto printer driver. 'lp=reset' (which can be + specified in addition to the ports) causes + attached printers to be reset. Using + lp=port1,port2,... specifies the parallel ports + to associate lp devices with, starting with + lp0. A port specification may be 'none' to skip + that lp device, or a parport name such as + 'parport0'. Specifying 'lp=auto' instead of a + port specification list means that device IDs + from each port should be examined, to see if + an IEEE 1284-compliant printer is attached; if + so, the driver will manage that printer. + See also header of drivers/char/lp.c. + + lpj=n [KNL] + Sets loops_per_jiffy to given constant, thus avoiding + time-consuming boot-time autodetection (up to 250 ms per + CPU). 0 enables autodetection (default). To determine + the correct value for your kernel, boot with normal + autodetection and see what value is printed. Note that + on SMP systems the preset will be applied to all CPUs, + which is likely to cause problems if your CPUs need + significantly divergent settings. An incorrect value + will cause delays in the kernel to be wrong, leading to + unpredictable I/O errors and other breakage. Although + unlikely, in the extreme case this might damage your + hardware. + + ltpc= [NET] + Format: <io>,<irq>,<dma> + + machvec= [IA-64] Force the use of a particular machine-vector + (machvec) in a generic kernel. + Example: machvec=hpzx1_swiotlb + + machtype= [Loongson] Share the same kernel image file between different + yeeloong laptop. + Example: machtype=lemote-yeeloong-2f-7inch + + max_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory greater + than or equal to this physical address is ignored. + + maxcpus= [SMP] Maximum number of processors that an SMP kernel + will bring up during bootup. maxcpus=n : n >= 0 limits + the kernel to bring up 'n' processors. Surely after + bootup you can bring up the other plugged cpu by executing + "echo 1 > /sys/devices/system/cpu/cpuX/online". So maxcpus + only takes effect during system bootup. + While n=0 is a special case, it is equivalent to "nosmp", + which also disables the IO APIC. + + max_loop= [LOOP] The number of loop block devices that get + (loop.max_loop) unconditionally pre-created at init time. The default + number is configured by BLK_DEV_LOOP_MIN_COUNT. Instead + of statically allocating a predefined number, loop + devices can be requested on-demand with the + /dev/loop-control interface. + + mce [X86-32] Machine Check Exception + + mce=option [X86-64] See Documentation/x86/x86_64/boot-options.txt + + md= [HW] RAID subsystems devices and level + See Documentation/admin-guide/md.rst. + + mdacon= [MDA] + Format: <first>,<last> + Specifies range of consoles to be captured by the MDA. + + mds= [X86,INTEL] + Control mitigation for the Micro-architectural Data + Sampling (MDS) vulnerability. + + Certain CPUs are vulnerable to an exploit against CPU + internal buffers which can forward information to a + disclosure gadget under certain conditions. + + In vulnerable processors, the speculatively + forwarded data can be used in a cache side channel + attack, to access data to which the attacker does + not have direct access. + + This parameter controls the MDS mitigation. The + options are: + + full - Enable MDS mitigation on vulnerable CPUs + full,nosmt - Enable MDS mitigation and disable + SMT on vulnerable CPUs + off - Unconditionally disable MDS mitigation + + On TAA-affected machines, mds=off can be prevented by + an active TAA mitigation as both vulnerabilities are + mitigated with the same mechanism so in order to disable + this mitigation, you need to specify tsx_async_abort=off + too. + + Not specifying this option is equivalent to + mds=full. + + For details see: Documentation/admin-guide/hw-vuln/mds.rst + + mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory + Amount of memory to be used when the kernel is not able + to see the whole system memory or for test. + [X86] Work as limiting max address. Use together + with memmap= to avoid physical address space collisions. + Without memmap= PCI devices could be placed at addresses + belonging to unused RAM. + + mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel + memory. + + memchunk=nn[KMG] + [KNL,SH] Allow user to override the default size for + per-device physically contiguous DMA buffers. + + memhp_default_state=online/offline + [KNL] Set the initial state for the memory hotplug + onlining policy. If not specified, the default value is + set according to the + CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config + option. + See Documentation/memory-hotplug.txt. + + memmap=exactmap [KNL,X86] Enable setting of an exact + E820 memory map, as specified by the user. + Such memmap=exactmap lines can be constructed based on + BIOS output or other requirements. See the memmap=nn@ss + option description. + + memmap=nn[KMG]@ss[KMG] + [KNL] Force usage of a specific region of memory. + Region of memory to be used is from ss to ss+nn. + If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG], + which limits max address to nn[KMG]. + Multiple different regions can be specified, + comma delimited. + Example: + memmap=100M@2G,100M#3G,1G!1024G + + memmap=nn[KMG]#ss[KMG] + [KNL,ACPI] Mark specific memory as ACPI data. + Region of memory to be marked is from ss to ss+nn. + + memmap=nn[KMG]$ss[KMG] + [KNL,ACPI] Mark specific memory as reserved. + Region of memory to be reserved is from ss to ss+nn. + Example: Exclude memory from 0x18690000-0x1869ffff + memmap=64K$0x18690000 + or + memmap=0x10000$0x18690000 + Some bootloaders may need an escape character before '$', + like Grub2, otherwise '$' and the following number + will be eaten. + + memmap=nn[KMG]!ss[KMG] + [KNL,X86] Mark specific memory as protected. + Region of memory to be used, from ss to ss+nn. + The memory region may be marked as e820 type 12 (0xc) + and is NVDIMM or ADR memory. + + memmap=<size>%<offset>-<oldtype>+<newtype> + [KNL,ACPI] Convert memory within the specified region + from <oldtype> to <newtype>. If "-<oldtype>" is left + out, the whole region will be marked as <newtype>, + even if previously unavailable. If "+<newtype>" is left + out, matching memory will be removed. Types are + specified as e820 types, e.g., 1 = RAM, 2 = reserved, + 3 = ACPI, 12 = PRAM. + + memory_corruption_check=0/1 [X86] + Some BIOSes seem to corrupt the first 64k of + memory when doing things like suspend/resume. + Setting this option will scan the memory + looking for corruption. Enabling this will + both detect corruption and prevent the kernel + from using the memory being corrupted. + However, its intended as a diagnostic tool; if + repeatable BIOS-originated corruption always + affects the same memory, you can use memmap= + to prevent the kernel from using that memory. + + memory_corruption_check_size=size [X86] + By default it checks for corruption in the low + 64k, making this memory unavailable for normal + use. Use this parameter to scan for + corruption in more or less memory. + + memory_corruption_check_period=seconds [X86] + By default it checks for corruption every 60 + seconds. Use this parameter to check at some + other rate. 0 disables periodic checking. + + memtest= [KNL,X86,ARM] Enable memtest + Format: <integer> + default : 0 <disable> + Specifies the number of memtest passes to be + performed. Each pass selects another test + pattern from a given set of patterns. Memtest + fills the memory with this pattern, validates + memory contents and reserves bad memory + regions that are detected. + + mem_encrypt= [X86-64] AMD Secure Memory Encryption (SME) control + Valid arguments: on, off + Default (depends on kernel configuration option): + on (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) + off (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n) + mem_encrypt=on: Activate SME + mem_encrypt=off: Do not activate SME + + Refer to Documentation/x86/amd-memory-encryption.txt + for details on when memory encryption can be activated. + + mem_sleep_default= [SUSPEND] Default system suspend mode: + s2idle - Suspend-To-Idle + shallow - Power-On Suspend or equivalent (if supported) + deep - Suspend-To-RAM or equivalent (if supported) + See Documentation/admin-guide/pm/sleep-states.rst. + + meye.*= [HW] Set MotionEye Camera parameters + See Documentation/media/v4l-drivers/meye.rst. + + mfgpt_irq= [IA-32] Specify the IRQ to use for the + Multi-Function General Purpose Timers on AMD Geode + platforms. + + mfgptfix [X86-32] Fix MFGPT timers on AMD Geode platforms when + the BIOS has incorrectly applied a workaround. TinyBIOS + version 0.98 is known to be affected, 0.99 fixes the + problem by letting the user disable the workaround. + + mga= [HW,DRM] + + min_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory below this + physical address is ignored. + + mini2440= [ARM,HW,KNL] + Format:[0..2][b][c][t] + Default: "0tb" + MINI2440 configuration specification: + 0 - The attached screen is the 3.5" TFT + 1 - The attached screen is the 7" TFT + 2 - The VGA Shield is attached (1024x768) + Leaving out the screen size parameter will not load + the TFT driver, and the framebuffer will be left + unconfigured. + b - Enable backlight. The TFT backlight pin will be + linked to the kernel VESA blanking code and a GPIO + LED. This parameter is not necessary when using the + VGA shield. + c - Enable the s3c camera interface. + t - Reserved for enabling touchscreen support. The + touchscreen support is not enabled in the mainstream + kernel as of 2.6.30, a preliminary port can be found + in the "bleeding edge" mini2440 support kernel at + http://repo.or.cz/w/linux-2.6/mini2440.git + + mitigations= + [X86,PPC,S390,ARM64] Control optional mitigations for + CPU vulnerabilities. This is a set of curated, + arch-independent options, each of which is an + aggregation of existing arch-specific options. + + off + Disable all optional CPU mitigations. This + improves system performance, but it may also + expose users to several CPU vulnerabilities. + Equivalent to: nopti [X86,PPC] + kpti=0 [ARM64] + nospectre_v1 [PPC] + nobp=0 [S390] + nospectre_v1 [X86] + nospectre_v2 [X86,PPC,S390,ARM64] + spectre_v2_user=off [X86] + spec_store_bypass_disable=off [X86,PPC] + ssbd=force-off [ARM64] + l1tf=off [X86] + mds=off [X86] + tsx_async_abort=off [X86] + kvm.nx_huge_pages=off [X86] + no_entry_flush [PPC] + no_uaccess_flush [PPC] + mmio_stale_data=off [X86] + + Exceptions: + This does not have any effect on + kvm.nx_huge_pages when + kvm.nx_huge_pages=force. + + auto (default) + Mitigate all CPU vulnerabilities, but leave SMT + enabled, even if it's vulnerable. This is for + users who don't want to be surprised by SMT + getting disabled across kernel upgrades, or who + have other ways of avoiding SMT-based attacks. + Equivalent to: (default behavior) + + auto,nosmt + Mitigate all CPU vulnerabilities, disabling SMT + if needed. This is for users who always want to + be fully mitigated, even if it means losing SMT. + Equivalent to: l1tf=flush,nosmt [X86] + mds=full,nosmt [X86] + tsx_async_abort=full,nosmt [X86] + mmio_stale_data=full,nosmt [X86] + + mminit_loglevel= + [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this + parameter allows control of the logging verbosity for + the additional memory initialisation checks. A value + of 0 disables mminit logging and a level of 4 will + log everything. Information is printed at KERN_DEBUG + so loglevel=8 may also need to be specified. + + mmio_stale_data= + [X86,INTEL] Control mitigation for the Processor + MMIO Stale Data vulnerabilities. + + Processor MMIO Stale Data is a class of + vulnerabilities that may expose data after an MMIO + operation. Exposed data could originate or end in + the same CPU buffers as affected by MDS and TAA. + Therefore, similar to MDS and TAA, the mitigation + is to clear the affected CPU buffers. + + This parameter controls the mitigation. The + options are: + + full - Enable mitigation on vulnerable CPUs + + full,nosmt - Enable mitigation and disable SMT on + vulnerable CPUs. + + off - Unconditionally disable mitigation + + On MDS or TAA affected machines, + mmio_stale_data=off can be prevented by an active + MDS or TAA mitigation as these vulnerabilities are + mitigated with the same mechanism so in order to + disable this mitigation, you need to specify + mds=off and tsx_async_abort=off too. + + Not specifying this option is equivalent to + mmio_stale_data=full. + + For details see: + Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst + + module.sig_enforce + [KNL] When CONFIG_MODULE_SIG is set, this means that + modules without (valid) signatures will fail to load. + Note that if CONFIG_MODULE_SIG_FORCE is set, that + is always true, so this option does nothing. + + module_blacklist= [KNL] Do not load a comma-separated list of + modules. Useful for debugging problem modules. + + mousedev.tap_time= + [MOUSE] Maximum time between finger touching and + leaving touchpad surface for touch to be considered + a tap and be reported as a left button click (for + touchpads working in absolute mode only). + Format: <msecs> + mousedev.xres= [MOUSE] Horizontal screen resolution, used for devices + reporting absolute coordinates, such as tablets + mousedev.yres= [MOUSE] Vertical screen resolution, used for devices + reporting absolute coordinates, such as tablets + + movablecore= [KNL,X86,IA-64,PPC] + Format: nn[KMGTPE] | nn% + This parameter is the complement to kernelcore=, it + specifies the amount of memory used for migratable + allocations. If both kernelcore and movablecore is + specified, then kernelcore will be at *least* the + specified value but may be more. If movablecore on its + own is specified, the administrator must be careful + that the amount of memory usable for all allocations + is not too small. + + movable_node [KNL] Boot-time switch to make hotplugable memory + NUMA nodes to be movable. This means that the memory + of such nodes will be usable only for movable + allocations which rules out almost all kernel + allocations. Use with caution! + + MTD_Partition= [MTD] + Format: <name>,<region-number>,<size>,<offset> + + MTD_Region= [MTD] Format: + <name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>] + + mtdparts= [MTD] + See drivers/mtd/cmdlinepart.c. + + multitce=off [PPC] This parameter disables the use of the pSeries + firmware feature for updating multiple TCE entries + at a time. + + onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration + + Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock] + + boundary - index of last SLC block on Flex-OneNAND. + The remaining blocks are configured as MLC blocks. + lock - Configure if Flex-OneNAND boundary should be locked. + Once locked, the boundary cannot be changed. + 1 indicates lock status, 0 indicates unlock status. + + mtdset= [ARM] + ARM/S3C2412 JIVE boot control + + See arch/arm/mach-s3c2412/mach-jive.c + + mtouchusb.raw_coordinates= + [HW] Make the MicroTouch USB driver use raw coordinates + ('y', default) or cooked coordinates ('n') + + mtrr_chunk_size=nn[KMG] [X86] + used for mtrr cleanup. It is largest continuous chunk + that could hold holes aka. UC entries. + + mtrr_gran_size=nn[KMG] [X86] + Used for mtrr cleanup. It is granularity of mtrr block. + Default is 1. + Large value could prevent small alignment from + using up MTRRs. + + mtrr_spare_reg_nr=n [X86] + Format: <integer> + Range: 0,7 : spare reg number + Default : 1 + Used for mtrr cleanup. It is spare mtrr entries number. + Set to 2 or more if your graphical card needs more. + + n2= [NET] SDL Inc. RISCom/N2 synchronous serial card + + netdev= [NET] Network devices parameters + Format: <irq>,<io>,<mem_start>,<mem_end>,<name> + Note that mem_start is often overloaded to mean + something different and driver-specific. + This usage is only documented in each driver source + file if at all. + + nf_conntrack.acct= + [NETFILTER] Enable connection tracking flow accounting + 0 to disable accounting + 1 to enable accounting + Default value is 0. + + nfsaddrs= [NFS] Deprecated. Use ip= instead. + See Documentation/filesystems/nfs/nfsroot.txt. + + nfsroot= [NFS] nfs root filesystem for disk-less boxes. + See Documentation/filesystems/nfs/nfsroot.txt. + + nfsrootdebug [NFS] enable nfsroot debugging messages. + See Documentation/filesystems/nfs/nfsroot.txt. + + nfs.callback_nr_threads= + [NFSv4] set the total number of threads that the + NFS client will assign to service NFSv4 callback + requests. + + nfs.callback_tcpport= + [NFS] set the TCP port on which the NFSv4 callback + channel should listen. + + nfs.cache_getent= + [NFS] sets the pathname to the program which is used + to update the NFS client cache entries. + + nfs.cache_getent_timeout= + [NFS] sets the timeout after which an attempt to + update a cache entry is deemed to have failed. + + nfs.idmap_cache_timeout= + [NFS] set the maximum lifetime for idmapper cache + entries. + + nfs.enable_ino64= + [NFS] enable 64-bit inode numbers. + If zero, the NFS client will fake up a 32-bit inode + number for the readdir() and stat() syscalls instead + of returning the full 64-bit number. + The default is to return 64-bit inode numbers. + + nfs.max_session_cb_slots= + [NFSv4.1] Sets the maximum number of session + slots the client will assign to the callback + channel. This determines the maximum number of + callbacks the client will process in parallel for + a particular server. + + nfs.max_session_slots= + [NFSv4.1] Sets the maximum number of session slots + the client will attempt to negotiate with the server. + This limits the number of simultaneous RPC requests + that the client can send to the NFSv4.1 server. + Note that there is little point in setting this + value higher than the max_tcp_slot_table_limit. + + nfs.nfs4_disable_idmapping= + [NFSv4] When set to the default of '1', this option + ensures that both the RPC level authentication + scheme and the NFS level operations agree to use + numeric uids/gids if the mount is using the + 'sec=sys' security flavour. In effect it is + disabling idmapping, which can make migration from + legacy NFSv2/v3 systems to NFSv4 easier. + Servers that do not support this mode of operation + will be autodetected by the client, and it will fall + back to using the idmapper. + To turn off this behaviour, set the value to '0'. + nfs.nfs4_unique_id= + [NFS4] Specify an additional fixed unique ident- + ification string that NFSv4 clients can insert into + their nfs_client_id4 string. This is typically a + UUID that is generated at system install time. + + nfs.send_implementation_id = + [NFSv4.1] Send client implementation identification + information in exchange_id requests. + If zero, no implementation identification information + will be sent. + The default is to send the implementation identification + information. + + nfs.recover_lost_locks = + [NFSv4] Attempt to recover locks that were lost due + to a lease timeout on the server. Please note that + doing this risks data corruption, since there are + no guarantees that the file will remain unchanged + after the locks are lost. + If you want to enable the kernel legacy behaviour of + attempting to recover these locks, then set this + parameter to '1'. + The default parameter value of '0' causes the kernel + not to attempt recovery of lost locks. + + nfs4.layoutstats_timer = + [NFSv4.2] Change the rate at which the kernel sends + layoutstats to the pNFS metadata server. + + Setting this to value to 0 causes the kernel to use + whatever value is the default set by the layout + driver. A non-zero value sets the minimum interval + in seconds between layoutstats transmissions. + + nfsd.nfs4_disable_idmapping= + [NFSv4] When set to the default of '1', the NFSv4 + server will return only numeric uids and gids to + clients using auth_sys, and will accept numeric uids + and gids from such clients. This is intended to ease + migration from NFSv2/v3. + + nmi_debug= [KNL,SH] Specify one or more actions to take + when a NMI is triggered. + Format: [state][,regs][,debounce][,die] + + nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels + Format: [panic,][nopanic,][num] + Valid num: 0 or 1 + 0 - turn hardlockup detector in nmi_watchdog off + 1 - turn hardlockup detector in nmi_watchdog on + When panic is specified, panic when an NMI watchdog + timeout occurs (or 'nopanic' to override the opposite + default). To disable both hard and soft lockup detectors, + please see 'nowatchdog'. + This is useful when you use a panic=... timeout and + need the box quickly up again. + + These settings can be accessed at runtime via + the nmi_watchdog and hardlockup_panic sysctls. + + netpoll.carrier_timeout= + [NET] Specifies amount of time (in seconds) that + netpoll should wait for a carrier. By default netpoll + waits 4 seconds. + + no387 [BUGS=X86-32] Tells the kernel to use the 387 maths + emulation library even if a 387 maths coprocessor + is present. + + no5lvl [X86-64] Disable 5-level paging mode. Forces + kernel to use 4-level paging instead. + + no_console_suspend + [HW] Never suspend the console + Disable suspending of consoles during suspend and + hibernate operations. Once disabled, debugging + messages can reach various consoles while the rest + of the system is being put to sleep (ie, while + debugging driver suspend/resume hooks). This may + not work reliably with all consoles, but is known + to work with serial and VGA consoles. + To facilitate more flexible debugging, we also add + console_suspend, a printk module parameter to control + it. Users could use console_suspend (usually + /sys/module/printk/parameters/console_suspend) to + turn on/off it dynamically. + + noaliencache [MM, NUMA, SLAB] Disables the allocation of alien + caches in the slab allocator. Saves per-node memory, + but will impact performance. + + noalign [KNL,ARM] + + noaltinstr [S390] Disables alternative instructions patching + (CPU alternatives feature). + + noapic [SMP,APIC] Tells the kernel to not make use of any + IOAPICs that may be present in the system. + + noautogroup Disable scheduler automatic task group creation. + + nobats [PPC] Do not use BATs for mapping kernel lowmem + on "Classic" PPC cores. + + nocache [ARM] + + noclflush [BUGS=X86] Don't use the CLFLUSH instruction + + nodelayacct [KNL] Disable per-task delay accounting + + nodsp [SH] Disable hardware DSP at boot time. + + noefi Disable EFI runtime services support. + + no_entry_flush [PPC] Don't flush the L1-D cache when entering the kernel. + + noexec [IA-64] + + noexec [X86] + On X86-32 available only on PAE configured kernels. + noexec=on: enable non-executable mappings (default) + noexec=off: disable non-executable mappings + + nosmap [X86] + Disable SMAP (Supervisor Mode Access Prevention) + even if it is supported by processor. + + nosmep [X86] + Disable SMEP (Supervisor Mode Execution Prevention) + even if it is supported by processor. + + noexec32 [X86-64] + This affects only 32-bit executables. + noexec32=on: enable non-executable mappings (default) + read doesn't imply executable mappings + noexec32=off: disable non-executable mappings + read implies executable mappings + + nofpu [MIPS,SH] Disable hardware FPU at boot time. + + nofxsr [BUGS=X86-32] Disables x86 floating point extended + register save and restore. The kernel will only save + legacy floating-point registers on task switch. + + nohugeiomap [KNL,x86] Disable kernel huge I/O mappings. + + nosmt [KNL,S390] Disable symmetric multithreading (SMT). + Equivalent to smt=1. + + [KNL,x86] Disable symmetric multithreading (SMT). + nosmt=force: Force disable SMT, cannot be undone + via the sysfs control file. + + nospectre_v1 [X66, PPC] Disable mitigations for Spectre Variant 1 + (bounds check bypass). With this option data leaks + are possible in the system. + + nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for + the Spectre variant 2 (indirect branch prediction) + vulnerability. System may allow data leaks with this + option. + + nospec_store_bypass_disable + [HW] Disable all mitigations for the Speculative Store Bypass vulnerability + + no_uaccess_flush + [PPC] Don't flush the L1-D cache after accessing user data. + + noxsave [BUGS=X86] Disables x86 extended register state save + and restore using xsave. The kernel will fallback to + enabling legacy floating-point and sse state. + + noxsaveopt [X86] Disables xsaveopt used in saving x86 extended + register states. The kernel will fall back to use + xsave to save the states. By using this parameter, + performance of saving the states is degraded because + xsave doesn't support modified optimization while + xsaveopt supports it on xsaveopt enabled systems. + + noxsaves [X86] Disables xsaves and xrstors used in saving and + restoring x86 extended register state in compacted + form of xsave area. The kernel will fall back to use + xsaveopt and xrstor to save and restore the states + in standard form of xsave area. By using this + parameter, xsave area per process might occupy more + memory on xsaves enabled systems. + + nohlt [BUGS=ARM,SH] Tells the kernel that the sleep(SH) or + wfi(ARM) instruction doesn't work correctly and not to + use it. This is also useful when using JTAG debugger. + + no_file_caps Tells the kernel not to honor file capabilities. The + only way then for a file to be executed with privilege + is to be setuid root or executed by root. + + nohalt [IA-64] Tells the kernel not to use the power saving + function PAL_HALT_LIGHT when idle. This increases + power-consumption. On the positive side, it reduces + interrupt wake-up latency, which may improve performance + in certain environments such as networked servers or + real-time systems. + + nohibernate [HIBERNATION] Disable hibernation and resume. + + nohz= [KNL] Boottime enable/disable dynamic ticks + Valid arguments: on, off + Default: on + + nohz_full= [KNL,BOOT,SMP,ISOL] + The argument is a cpu list, as described above. + In kernels built with CONFIG_NO_HZ_FULL=y, set + the specified list of CPUs whose tick will be stopped + whenever possible. The boot CPU will be forced outside + the range to maintain the timekeeping. Any CPUs + in this list will have their RCU callbacks offloaded, + just as if they had also been called out in the + rcu_nocbs= boot parameter. + + noiotrap [SH] Disables trapped I/O port accesses. + + noirqdebug [X86-32] Disables the code which attempts to detect and + disable unhandled interrupt sources. + + no_timer_check [X86,APIC] Disables the code which tests for + broken timer IRQ sources. + + noisapnp [ISAPNP] Disables ISA PnP code. + + noinitrd [RAM] Tells the kernel not to load any configured + initial RAM disk. + + nointremap [X86-64, Intel-IOMMU] Do not enable interrupt + remapping. + [Deprecated - use intremap=off] + + nointroute [IA-64] + + noinvpcid [X86] Disable the INVPCID cpu feature. + + nojitter [IA-64] Disables jitter checking for ITC timers. + + no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver + + no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page + fault handling. + + no-vmw-sched-clock + [X86,PV_OPS] Disable paravirtualized VMware scheduler + clock and use the default one. + + no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. + steal time is computed, but won't influence scheduler + behaviour + + nolapic [X86-32,APIC] Do not enable or use the local APIC. + + nolapic_timer [X86-32,APIC] Do not use the local APIC timer. + + noltlbs [PPC] Do not use large page/tlb entries for kernel + lowmem mapping on PPC40x and PPC8xx + + nomca [IA-64] Disable machine check abort handling + + nomce [X86-32] Disable Machine Check Exception + + nomfgpt [X86-32] Disable Multi-Function General Purpose + Timer usage (for AMD Geode machines). + + nonmi_ipi [X86] Disable using NMI IPIs during panic/reboot to + shutdown the other cpus. Instead use the REBOOT_VECTOR + irq. + + nomodule Disable module load + + nopat [X86] Disable PAT (page attribute table extension of + pagetables) support. + + nopcid [X86-64] Disable the PCID cpu feature. + + norandmaps Don't use address space randomization. Equivalent to + echo 0 > /proc/sys/kernel/randomize_va_space + + noreplace-smp [X86-32,SMP] Don't replace SMP instructions + with UP alternatives + + nordrand [X86] Disable kernel use of the RDRAND and + RDSEED instructions even if they are supported + by the processor. RDRAND and RDSEED are still + available to user space applications. + + noresume [SWSUSP] Disables resume and restores original swap + space. + + no-scroll [VGA] Disables scrollback. + This is required for the Braillex ib80-piezo Braille + reader made by F.H. Papenmeier (Germany). + + nosbagart [IA-64] + + nosep [BUGS=X86-32] Disables x86 SYSENTER/SYSEXIT support. + + nosmp [SMP] Tells an SMP kernel to act as a UP kernel, + and disable the IO APIC. legacy for "maxcpus=0". + + nosoftlockup [KNL] Disable the soft-lockup detector. + + nosync [HW,M68K] Disables sync negotiation for all devices. + + nowatchdog [KNL] Disable both lockup detectors, i.e. + soft-lockup and NMI watchdog (hard-lockup). + + nowb [ARM] + + nox2apic [X86-64,APIC] Do not enable x2APIC mode. + + cpu0_hotplug [X86] Turn on CPU0 hotplug feature when + CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off. + Some features depend on CPU0. Known dependencies are: + 1. Resume from suspend/hibernate depends on CPU0. + Suspend/hibernate will fail if CPU0 is offline and you + need to online CPU0 before suspend/hibernate. + 2. PIC interrupts also depend on CPU0. CPU0 can't be + removed if a PIC interrupt is detected. + It's said poweroff/reboot may depend on CPU0 on some + machines although I haven't seen such issues so far + after CPU0 is offline on a few tested machines. + If the dependencies are under your control, you can + turn on cpu0_hotplug. + + nps_mtm_hs_ctr= [KNL,ARC] + This parameter sets the maximum duration, in + cycles, each HW thread of the CTOP can run + without interruptions, before HW switches it. + The actual maximum duration is 16 times this + parameter's value. + Format: integer between 1 and 255 + Default: 255 + + nptcg= [IA-64] Override max number of concurrent global TLB + purges which is reported from either PAL_VM_SUMMARY or + SAL PALO. + + nr_cpus= [SMP] Maximum number of processors that an SMP kernel + could support. nr_cpus=n : n >= 1 limits the kernel to + support 'n' processors. It could be larger than the + number of already plugged CPU during bootup, later in + runtime you can physically add extra cpu until it reaches + n. So during boot up some boot time memory for per-cpu + variables need be pre-allocated for later physical cpu + hot plugging. + + nr_uarts= [SERIAL] maximum number of UARTs to be registered. + + numa_balancing= [KNL,X86] Enable or disable automatic NUMA balancing. + Allowed values are enable and disable + + numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. + 'node', 'default' can be specified + This can be set from sysctl after boot. + See Documentation/sysctl/vm.txt for details. + + ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. + See Documentation/debugging-via-ohci1394.txt for more + info. + + olpc_ec_timeout= [OLPC] ms delay when issuing EC commands + Rather than timing out after 20 ms if an EC + command is not properly ACKed, override the length + of the timeout. We have interrupts disabled while + waiting for the ACK, so if this is set too high + interrupts *may* be lost! + + omap_mux= [OMAP] Override bootloader pin multiplexing. + Format: <mux_mode0.mode_name=value>... + For example, to override I2C bus2: + omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100 + + oprofile.timer= [HW] + Use timer interrupt instead of performance counters + + oprofile.cpu_type= Force an oprofile cpu type + This might be useful if you have an older oprofile + userland or if you want common events. + Format: { arch_perfmon } + arch_perfmon: [X86] Force use of architectural + perfmon on Intel CPUs instead of the + CPU specific event set. + timer: [X86] Force use of architectural NMI + timer mode (see also oprofile.timer + for generic hr timer mode) + + oops=panic Always panic on oopses. Default is to just kill the + process, but there is a small probability of + deadlocking the machine. + This will also cause panics on machine check exceptions. + Useful together with panic=30 to trigger a reboot. + + page_owner= [KNL] Boot-time page_owner enabling option. + Storage of the information about who allocated + each page is disabled in default. With this switch, + we can turn it on. + on: enable the feature + + page_poison= [KNL] Boot-time parameter changing the state of + poisoning on the buddy allocator, available with + CONFIG_PAGE_POISONING=y. + off: turn off poisoning (default) + on: turn on poisoning + + panic= [KNL] Kernel behaviour on panic: delay <timeout> + timeout > 0: seconds before rebooting + timeout = 0: wait forever + timeout < 0: reboot immediately + Format: <timeout> + + panic_on_warn panic() instead of WARN(). Useful to cause kdump + on a WARN(). + + crash_kexec_post_notifiers + Run kdump after running panic-notifiers and dumping + kmsg. This only for the users who doubt kdump always + succeeds in any situation. + Note that this also increases risks of kdump failure, + because some panic notifiers can make the crashed + kernel more unstable. + + parkbd.port= [HW] Parallel port number the keyboard adapter is + connected to, default is 0. + Format: <parport#> + parkbd.mode= [HW] Parallel port keyboard adapter mode of operation, + 0 for XT, 1 for AT (default is AT). + Format: <mode> + + parport= [HW,PPT] Specify parallel ports. 0 disables. + Format: { 0 | auto | 0xBBB[,IRQ[,DMA]] } + Use 'auto' to force the driver to use any + IRQ/DMA settings detected (the default is to + ignore detected IRQ/DMA settings because of + possible conflicts). You can specify the base + address, IRQ, and DMA settings; IRQ and DMA + should be numbers, or 'auto' (for using detected + settings on that particular port), or 'nofifo' + (to avoid using a FIFO even if it is detected). + Parallel ports are assigned in the order they + are specified on the command line, starting + with parport0. + + parport_init_mode= [HW,PPT] + Configure VIA parallel port to operate in + a specific mode. This is necessary on Pegasos + computer where firmware has no options for setting + up parallel port mode and sets it to spp. + Currently this function knows 686a and 8231 chips. + Format: [spp|ps2|epp|ecp|ecpepp] + + pause_on_oops= + Halt all CPUs after the first oops has been printed for + the specified number of seconds. This is to be used if + your oopses keep scrolling off the screen. + + pcbit= [HW,ISDN] + + pcd. [PARIDE] + See header of drivers/block/paride/pcd.c. + See also Documentation/blockdev/paride.txt. + + pci=option[,option...] [PCI] various PCI subsystem options. + + Some options herein operate on a specific device + or a set of devices (<pci_dev>). These are + specified in one of the following formats: + + [<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]* + pci:<vendor>:<device>[:<subvendor>:<subdevice>] + + Note: the first format specifies a PCI + bus/device/function address which may change + if new hardware is inserted, if motherboard + firmware changes, or due to changes caused + by other kernel parameters. If the + domain is left unspecified, it is + taken to be zero. Optionally, a path + to a device through multiple device/function + addresses can be specified after the base + address (this is more robust against + renumbering issues). The second format + selects devices using IDs from the + configuration space which may match multiple + devices in the system. + + earlydump dump PCI config space before the kernel + changes anything + off [X86] don't probe for the PCI bus + bios [X86-32] force use of PCI BIOS, don't access + the hardware directly. Use this if your machine + has a non-standard PCI host bridge. + nobios [X86-32] disallow use of PCI BIOS, only direct + hardware access methods are allowed. Use this + if you experience crashes upon bootup and you + suspect they are caused by the BIOS. + conf1 [X86] Force use of PCI Configuration Access + Mechanism 1 (config address in IO port 0xCF8, + data in IO port 0xCFC, both 32-bit). + conf2 [X86] Force use of PCI Configuration Access + Mechanism 2 (IO port 0xCF8 is an 8-bit port for + the function, IO port 0xCFA, also 8-bit, sets + bus number. The config space is then accessed + through ports 0xC000-0xCFFF). + See http://wiki.osdev.org/PCI for more info + on the configuration access mechanisms. + noaer [PCIE] If the PCIEAER kernel config parameter is + enabled, this kernel boot option can be used to + disable the use of PCIE advanced error reporting. + nodomains [PCI] Disable support for multiple PCI + root domains (aka PCI segments, in ACPI-speak). + nommconf [X86] Disable use of MMCONFIG for PCI + Configuration + check_enable_amd_mmconf [X86] check for and enable + properly configured MMIO access to PCI + config space on AMD family 10h CPU + nomsi [MSI] If the PCI_MSI kernel config parameter is + enabled, this kernel boot option can be used to + disable the use of MSI interrupts system-wide. + noioapicquirk [APIC] Disable all boot interrupt quirks. + Safety option to keep boot IRQs enabled. This + should never be necessary. + ioapicreroute [APIC] Enable rerouting of boot IRQs to the + primary IO-APIC for bridges that cannot disable + boot IRQs. This fixes a source of spurious IRQs + when the system masks IRQs. + noioapicreroute [APIC] Disable workaround that uses the + boot IRQ equivalent of an IRQ that connects to + a chipset where boot IRQs cannot be disabled. + The opposite of ioapicreroute. + biosirq [X86-32] Use PCI BIOS calls to get the interrupt + routing table. These calls are known to be buggy + on several machines and they hang the machine + when used, but on other computers it's the only + way to get the interrupt routing table. Try + this option if the kernel is unable to allocate + IRQs or discover secondary PCI buses on your + motherboard. + rom [X86] Assign address space to expansion ROMs. + Use with caution as certain devices share + address decoders between ROMs and other + resources. + norom [X86] Do not assign address space to + expansion ROMs that do not already have + BIOS assigned address ranges. + nobar [X86] Do not assign address space to the + BARs that weren't assigned by the BIOS. + irqmask=0xMMMM [X86] Set a bit mask of IRQs allowed to be + assigned automatically to PCI devices. You can + make the kernel exclude IRQs of your ISA cards + this way. + pirqaddr=0xAAAAA [X86] Specify the physical address + of the PIRQ table (normally generated + by the BIOS) if it is outside the + F0000h-100000h range. + lastbus=N [X86] Scan all buses thru bus #N. Can be + useful if the kernel is unable to find your + secondary buses and you want to tell it + explicitly which ones they are. + assign-busses [X86] Always assign all PCI bus + numbers ourselves, overriding + whatever the firmware may have done. + usepirqmask [X86] Honor the possible IRQ mask stored + in the BIOS $PIR table. This is needed on + some systems with broken BIOSes, notably + some HP Pavilion N5400 and Omnibook XE3 + notebooks. This will have no effect if ACPI + IRQ routing is enabled. + noacpi [X86] Do not use ACPI for IRQ routing + or for PCI scanning. + use_crs [X86] Use PCI host bridge window information + from ACPI. On BIOSes from 2008 or later, this + is enabled by default. If you need to use this, + please report a bug. + nocrs [X86] Ignore PCI host bridge windows from ACPI. + If you need to use this, please report a bug. + routeirq Do IRQ routing for all PCI devices. + This is normally done in pci_enable_device(), + so this option is a temporary workaround + for broken drivers that don't call it. + skip_isa_align [X86] do not align io start addr, so can + handle more pci cards + noearly [X86] Don't do any early type 1 scanning. + This might help on some broken boards which + machine check when some devices' config space + is read. But various workarounds are disabled + and some IOMMU drivers will not work. + bfsort Sort PCI devices into breadth-first order. + This sorting is done to get a device + order compatible with older (<= 2.4) kernels. + nobfsort Don't sort PCI devices into breadth-first order. + pcie_bus_tune_off Disable PCIe MPS (Max Payload Size) + tuning and use the BIOS-configured MPS defaults. + pcie_bus_safe Set every device's MPS to the largest value + supported by all devices below the root complex. + pcie_bus_perf Set device MPS to the largest allowable MPS + based on its parent bus. Also set MRRS (Max + Read Request Size) to the largest supported + value (no larger than the MPS that the device + or bus can support) for best performance. + pcie_bus_peer2peer Set every device's MPS to 128B, which + every device is guaranteed to support. This + configuration allows peer-to-peer DMA between + any pair of devices, possibly at the cost of + reduced performance. This also guarantees + that hot-added devices will work. + cbiosize=nn[KMG] The fixed amount of bus space which is + reserved for the CardBus bridge's IO window. + The default value is 256 bytes. + cbmemsize=nn[KMG] The fixed amount of bus space which is + reserved for the CardBus bridge's memory + window. The default value is 64 megabytes. + resource_alignment= + Format: + [<order of align>@]<pci_dev>[; ...] + Specifies alignment and device to reassign + aligned memory resources. How to + specify the device is described above. + If <order of align> is not specified, + PAGE_SIZE is used as alignment. + PCI-PCI bridge can be specified, if resource + windows need to be expanded. + To specify the alignment for several + instances of a device, the PCI vendor, + device, subvendor, and subdevice may be + specified, e.g., 4096@pci:8086:9c22:103c:198f + ecrc= Enable/disable PCIe ECRC (transaction layer + end-to-end CRC checking). + bios: Use BIOS/firmware settings. This is the + the default. + off: Turn ECRC off + on: Turn ECRC on. + hpiosize=nn[KMG] The fixed amount of bus space which is + reserved for hotplug bridge's IO window. + Default size is 256 bytes. + hpmemsize=nn[KMG] The fixed amount of bus space which is + reserved for hotplug bridge's memory window. + Default size is 2 megabytes. + hpbussize=nn The minimum amount of additional bus numbers + reserved for buses below a hotplug bridge. + Default is 1. + realloc= Enable/disable reallocating PCI bridge resources + if allocations done by BIOS are too small to + accommodate resources required by all child + devices. + off: Turn realloc off + on: Turn realloc on + realloc same as realloc=on + noari do not use PCIe ARI. + noats [PCIE, Intel-IOMMU, AMD-IOMMU] + do not use PCIe ATS (and IOMMU device IOTLB). + pcie_scan_all Scan all possible PCIe devices. Otherwise we + only look for one device below a PCIe downstream + port. + big_root_window Try to add a big 64bit memory window to the PCIe + root complex on AMD CPUs. Some GFX hardware + can resize a BAR to allow access to all VRAM. + Adding the window is slightly risky (it may + conflict with unreported devices), so this + taints the kernel. + disable_acs_redir=<pci_dev>[; ...] + Specify one or more PCI devices (in the format + specified above) separated by semicolons. + Each device specified will have the PCI ACS + redirect capabilities forced off which will + allow P2P traffic between devices through + bridges without forcing it upstream. Note: + this removes isolation between devices and + may put more devices in an IOMMU group. + + pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power + Management. + off Disable ASPM. + force Enable ASPM even on devices that claim not to support it. + WARNING: Forcing ASPM on may cause system lockups. + + pcie_ports= [PCIE] PCIe port services handling: + native Use native PCIe services (PME, AER, DPC, PCIe hotplug) + even if the platform doesn't give the OS permission to + use them. This may cause conflicts if the platform + also tries to use these services. + compat Disable native PCIe services (PME, AER, DPC, PCIe + hotplug). + + pcie_port_pm= [PCIE] PCIe port power management handling: + off Disable power management of all PCIe ports + force Forcibly enable power management of all PCIe ports + + pcie_pme= [PCIE,PM] Native PCIe PME signaling options: + nomsi Do not use MSI for native PCIe PME signaling (this makes + all PCIe root ports use INTx for all services). + + pcmv= [HW,PCMCIA] BadgePAD 4 + + pd_ignore_unused + [PM] + Keep all power-domains already enabled by bootloader on, + even if no driver has claimed them. This is useful + for debug and development, but should not be + needed on a platform with proper driver support. + + pd. [PARIDE] + See Documentation/blockdev/paride.txt. + + pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at + boot time. + Format: { 0 | 1 } + See arch/parisc/kernel/pdc_chassis.c + + percpu_alloc= Select which percpu first chunk allocator to use. + Currently supported values are "embed" and "page". + Archs may support subset or none of the selections. + See comments in mm/percpu.c for details on each + allocator. This parameter is primarily for debugging + and performance comparison. + + pf. [PARIDE] + See Documentation/blockdev/paride.txt. + + pg. [PARIDE] + See Documentation/blockdev/paride.txt. + + pirq= [SMP,APIC] Manual mp-table setup + See Documentation/x86/i386/IO-APIC.txt. + + plip= [PPT,NET] Parallel port network link + Format: { parport<nr> | timid | 0 } + See also Documentation/admin-guide/parport.rst. + + pmtmr= [X86] Manual setup of pmtmr I/O Port. + Override pmtimer IOPort with a hex value. + e.g. pmtmr=0x508 + + pnp.debug=1 [PNP] + Enable PNP debug messages (depends on the + CONFIG_PNP_DEBUG_MESSAGES option). Change at run-time + via /sys/module/pnp/parameters/debug. We always show + current resource usage; turning this on also shows + possible settings and some assignment information. + + pnpacpi= [ACPI] + { off } + + pnpbios= [ISAPNP] + { on | off | curr | res | no-curr | no-res } + + pnp_reserve_irq= + [ISAPNP] Exclude IRQs for the autoconfiguration + + pnp_reserve_dma= + [ISAPNP] Exclude DMAs for the autoconfiguration + + pnp_reserve_io= [ISAPNP] Exclude I/O ports for the autoconfiguration + Ranges are in pairs (I/O port base and size). + + pnp_reserve_mem= + [ISAPNP] Exclude memory regions for the + autoconfiguration. + Ranges are in pairs (memory base and size). + + ports= [IP_VS_FTP] IPVS ftp helper module + Default is 21. + Up to 8 (IP_VS_APP_MAX_PORTS) ports + may be specified. + Format: <port>,<port>.... + + powersave=off [PPC] This option disables power saving features. + It specifically disables cpuidle and sets the + platform machine description specific power_save + function to NULL. On Idle the CPU just reduces + execution priority. + + ppc_strict_facility_enable + [PPC] This option catches any kernel floating point, + Altivec, VSX and SPE outside of regions specifically + allowed (eg kernel_enable_fpu()/kernel_disable_fpu()). + There is some performance impact when enabling this. + + ppc_tm= [PPC] + Format: {"off"} + Disable Hardware Transactional Memory + + print-fatal-signals= + [KNL] debug: print fatal signals + + If enabled, warn about various signal handling + related application anomalies: too many signals, + too many POSIX.1 timers, fatal signals causing a + coredump - etc. + + If you hit the warning due to signal overflow, + you might want to try "ulimit -i unlimited". + + default: off. + + printk.always_kmsg_dump= + Trigger kmsg_dump for cases other than kernel oops or + panics + Format: <bool> (1/Y/y=enable, 0/N/n=disable) + default: disabled + + printk.devkmsg={on,off,ratelimit} + Control writing to /dev/kmsg. + on - unlimited logging to /dev/kmsg from userspace + off - logging to /dev/kmsg disabled + ratelimit - ratelimit the logging + Default: ratelimit + + printk.time= Show timing data prefixed to each printk message line + Format: <bool> (1/Y/y=enable, 0/N/n=disable) + + processor.max_cstate= [HW,ACPI] + Limit processor to maximum C-state + max_cstate=9 overrides any DMI blacklist limit. + + processor.nocst [HW,ACPI] + Ignore the _CST method to determine C-states, + instead using the legacy FADT method + + profile= [KNL] Enable kernel profiling via /proc/profile + Format: [<profiletype>,]<number> + Param: <profiletype>: "schedule", "sleep", or "kvm" + [defaults to kernel profiling] + Param: "schedule" - profile schedule points. + Param: "sleep" - profile D-state sleeping (millisecs). + Requires CONFIG_SCHEDSTATS + Param: "kvm" - profile VM exits. + Param: <number> - step/bucket size as a power of 2 for + statistical time based profiling. + + prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk + before loading. + See Documentation/blockdev/ramdisk.txt. + + psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to + probe for; one of (bare|imps|exps|lifebook|any). + psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports + per second. + psmouse.resetafter= [HW,MOUSE] + Try to reset the device after so many bad packets + (0 = never). + psmouse.resolution= + [HW,MOUSE] Set desired mouse resolution, in dpi. + psmouse.smartscroll= + [HW,MOUSE] Controls Logitech smartscroll autorepeat. + 0 = disabled, 1 = enabled (default). + + pstore.backend= Specify the name of the pstore backend to use + + pt. [PARIDE] + See Documentation/blockdev/paride.txt. + + pti= [X86_64] Control Page Table Isolation of user and + kernel address spaces. Disabling this feature + removes hardening, but improves performance of + system calls and interrupts. + + on - unconditionally enable + off - unconditionally disable + auto - kernel detects whether your CPU model is + vulnerable to issues that PTI mitigates + + Not specifying this option is equivalent to pti=auto. + + nopti [X86_64] + Equivalent to pti=off + + pty.legacy_count= + [KNL] Number of legacy pty's. Overwrites compiled-in + default number. + + quiet [KNL] Disable most log messages + + r128= [HW,DRM] + + raid= [HW,RAID] + See Documentation/admin-guide/md.rst. + + ramdisk_size= [RAM] Sizes of RAM disks in kilobytes + See Documentation/blockdev/ramdisk.txt. + + random.trust_cpu={on,off} + [KNL] Enable or disable trusting the use of the + CPU's random number generator (if available) to + fully seed the kernel's CRNG. Default is controlled + by CONFIG_RANDOM_TRUST_CPU. + + random.trust_bootloader={on,off} + [KNL] Enable or disable trusting the use of a + seed passed by the bootloader (if available) to + fully seed the kernel's CRNG. Default is controlled + by CONFIG_RANDOM_TRUST_BOOTLOADER. + + ras=option[,option,...] [KNL] RAS-specific options + + cec_disable [X86] + Disable the Correctable Errors Collector, + see CONFIG_RAS_CEC help text. + + rcu_nocbs= [KNL] + The argument is a cpu list, as described above. + + In kernels built with CONFIG_RCU_NOCB_CPU=y, set + the specified list of CPUs to be no-callback CPUs. + Invocation of these CPUs' RCU callbacks will + be offloaded to "rcuox/N" kthreads created for + that purpose, where "x" is "b" for RCU-bh, "p" + for RCU-preempt, and "s" for RCU-sched, and "N" + is the CPU number. This reduces OS jitter on the + offloaded CPUs, which can be useful for HPC and + real-time workloads. It can also improve energy + efficiency for asymmetric multiprocessors. + + rcu_nocb_poll [KNL] + Rather than requiring that offloaded CPUs + (specified by rcu_nocbs= above) explicitly + awaken the corresponding "rcuoN" kthreads, + make these kthreads poll for callbacks. + This improves the real-time response for the + offloaded CPUs by relieving them of the need to + wake up the corresponding kthread, but degrades + energy efficiency by requiring that the kthreads + periodically wake up to do the polling. + + rcutree.blimit= [KNL] + Set maximum number of finished RCU callbacks to + process in one batch. + + rcutree.dump_tree= [KNL] + Dump the structure of the rcu_node combining tree + out at early boot. This is used for diagnostic + purposes, to verify correct tree setup. + + rcutree.gp_cleanup_delay= [KNL] + Set the number of jiffies to delay each step of + RCU grace-period cleanup. + + rcutree.gp_init_delay= [KNL] + Set the number of jiffies to delay each step of + RCU grace-period initialization. + + rcutree.gp_preinit_delay= [KNL] + Set the number of jiffies to delay each step of + RCU grace-period pre-initialization, that is, + the propagation of recent CPU-hotplug changes up + the rcu_node combining tree. + + rcutree.rcu_fanout_exact= [KNL] + Disable autobalancing of the rcu_node combining + tree. This is used by rcutorture, and might + possibly be useful for architectures having high + cache-to-cache transfer latencies. + + rcutree.rcu_fanout_leaf= [KNL] + Change the number of CPUs assigned to each + leaf rcu_node structure. Useful for very + large systems, which will choose the value 64, + and for NUMA systems with large remote-access + latencies, which will choose a value aligned + with the appropriate hardware boundaries. + + rcutree.jiffies_till_sched_qs= [KNL] + Set required age in jiffies for a + given grace period before RCU starts + soliciting quiescent-state help from + rcu_note_context_switch(). + + rcutree.jiffies_till_first_fqs= [KNL] + Set delay from grace-period initialization to + first attempt to force quiescent states. + Units are jiffies, minimum value is zero, + and maximum value is HZ. + + rcutree.jiffies_till_next_fqs= [KNL] + Set delay between subsequent attempts to force + quiescent states. Units are jiffies, minimum + value is one, and maximum value is HZ. + + rcutree.kthread_prio= [KNL,BOOT] + Set the SCHED_FIFO priority of the RCU per-CPU + kthreads (rcuc/N). This value is also used for + the priority of the RCU boost threads (rcub/N) + and for the RCU grace-period kthreads (rcu_bh, + rcu_preempt, and rcu_sched). If RCU_BOOST is + set, valid values are 1-99 and the default is 1 + (the least-favored priority). Otherwise, when + RCU_BOOST is not set, valid values are 0-99 and + the default is zero (non-realtime operation). + + rcutree.rcu_nocb_leader_stride= [KNL] + Set the number of NOCB kthread groups, which + defaults to the square root of the number of + CPUs. Larger numbers reduces the wakeup overhead + on the per-CPU grace-period kthreads, but increases + that same overhead on each group's leader. + + rcutree.qhimark= [KNL] + Set threshold of queued RCU callbacks beyond which + batch limiting is disabled. + + rcutree.qlowmark= [KNL] + Set threshold of queued RCU callbacks below which + batch limiting is re-enabled. + + rcutree.rcu_idle_gp_delay= [KNL] + Set wakeup interval for idle CPUs that have + RCU callbacks (RCU_FAST_NO_HZ=y). + + rcutree.rcu_idle_lazy_gp_delay= [KNL] + Set wakeup interval for idle CPUs that have + only "lazy" RCU callbacks (RCU_FAST_NO_HZ=y). + Lazy RCU callbacks are those which RCU can + prove do nothing more than free memory. + + rcutree.rcu_kick_kthreads= [KNL] + Cause the grace-period kthread to get an extra + wake_up() if it sleeps three times longer than + it should at force-quiescent-state time. + This wake_up() will be accompanied by a + WARN_ONCE() splat and an ftrace_dump(). + + rcuperf.gp_async= [KNL] + Measure performance of asynchronous + grace-period primitives such as call_rcu(). + + rcuperf.gp_async_max= [KNL] + Specify the maximum number of outstanding + callbacks per writer thread. When a writer + thread exceeds this limit, it invokes the + corresponding flavor of rcu_barrier() to allow + previously posted callbacks to drain. + + rcuperf.gp_exp= [KNL] + Measure performance of expedited synchronous + grace-period primitives. + + rcuperf.holdoff= [KNL] + Set test-start holdoff period. The purpose of + this parameter is to delay the start of the + test until boot completes in order to avoid + interference. + + rcuperf.nreaders= [KNL] + Set number of RCU readers. The value -1 selects + N, where N is the number of CPUs. A value + "n" less than -1 selects N-n+1, where N is again + the number of CPUs. For example, -2 selects N + (the number of CPUs), -3 selects N+1, and so on. + A value of "n" less than or equal to -N selects + a single reader. + + rcuperf.nwriters= [KNL] + Set number of RCU writers. The values operate + the same as for rcuperf.nreaders. + N, where N is the number of CPUs + + rcuperf.perf_type= [KNL] + Specify the RCU implementation to test. + + rcuperf.shutdown= [KNL] + Shut the system down after performance tests + complete. This is useful for hands-off automated + testing. + + rcuperf.verbose= [KNL] + Enable additional printk() statements. + + rcuperf.writer_holdoff= [KNL] + Write-side holdoff between grace periods, + in microseconds. The default of zero says + no holdoff. + + rcutorture.cbflood_inter_holdoff= [KNL] + Set holdoff time (jiffies) between successive + callback-flood tests. + + rcutorture.cbflood_intra_holdoff= [KNL] + Set holdoff time (jiffies) between successive + bursts of callbacks within a given callback-flood + test. + + rcutorture.cbflood_n_burst= [KNL] + Set the number of bursts making up a given + callback-flood test. Set this to zero to + disable callback-flood testing. + + rcutorture.cbflood_n_per_burst= [KNL] + Set the number of callbacks to be registered + in a given burst of a callback-flood test. + + rcutorture.fqs_duration= [KNL] + Set duration of force_quiescent_state bursts + in microseconds. + + rcutorture.fqs_holdoff= [KNL] + Set holdoff time within force_quiescent_state bursts + in microseconds. + + rcutorture.fqs_stutter= [KNL] + Set wait time between force_quiescent_state bursts + in seconds. + + rcutorture.gp_cond= [KNL] + Use conditional/asynchronous update-side + primitives, if available. + + rcutorture.gp_exp= [KNL] + Use expedited update-side primitives, if available. + + rcutorture.gp_normal= [KNL] + Use normal (non-expedited) asynchronous + update-side primitives, if available. + + rcutorture.gp_sync= [KNL] + Use normal (non-expedited) synchronous + update-side primitives, if available. If all + of rcutorture.gp_cond=, rcutorture.gp_exp=, + rcutorture.gp_normal=, and rcutorture.gp_sync= + are zero, rcutorture acts as if is interpreted + they are all non-zero. + + rcutorture.n_barrier_cbs= [KNL] + Set callbacks/threads for rcu_barrier() testing. + + rcutorture.nfakewriters= [KNL] + Set number of concurrent RCU writers. These just + stress RCU, they don't participate in the actual + test, hence the "fake". + + rcutorture.nreaders= [KNL] + Set number of RCU readers. The value -1 selects + N-1, where N is the number of CPUs. A value + "n" less than -1 selects N-n-2, where N is again + the number of CPUs. For example, -2 selects N + (the number of CPUs), -3 selects N+1, and so on. + + rcutorture.object_debug= [KNL] + Enable debug-object double-call_rcu() testing. + + rcutorture.onoff_holdoff= [KNL] + Set time (s) after boot for CPU-hotplug testing. + + rcutorture.onoff_interval= [KNL] + Set time (jiffies) between CPU-hotplug operations, + or zero to disable CPU-hotplug testing. + + rcutorture.shuffle_interval= [KNL] + Set task-shuffle interval (s). Shuffling tasks + allows some CPUs to go into dyntick-idle mode + during the rcutorture test. + + rcutorture.shutdown_secs= [KNL] + Set time (s) after boot system shutdown. This + is useful for hands-off automated testing. + + rcutorture.stall_cpu= [KNL] + Duration of CPU stall (s) to test RCU CPU stall + warnings, zero to disable. + + rcutorture.stall_cpu_holdoff= [KNL] + Time to wait (s) after boot before inducing stall. + + rcutorture.stall_cpu_irqsoff= [KNL] + Disable interrupts while stalling if set. + + rcutorture.stat_interval= [KNL] + Time (s) between statistics printk()s. + + rcutorture.stutter= [KNL] + Time (s) to stutter testing, for example, specifying + five seconds causes the test to run for five seconds, + wait for five seconds, and so on. This tests RCU's + ability to transition abruptly to and from idle. + + rcutorture.test_boost= [KNL] + Test RCU priority boosting? 0=no, 1=maybe, 2=yes. + "Maybe" means test if the RCU implementation + under test support RCU priority boosting. + + rcutorture.test_boost_duration= [KNL] + Duration (s) of each individual boost test. + + rcutorture.test_boost_interval= [KNL] + Interval (s) between each boost test. + + rcutorture.test_no_idle_hz= [KNL] + Test RCU's dyntick-idle handling. See also the + rcutorture.shuffle_interval parameter. + + rcutorture.torture_type= [KNL] + Specify the RCU implementation to test. + + rcutorture.verbose= [KNL] + Enable additional printk() statements. + + rcupdate.rcu_cpu_stall_suppress= [KNL] + Suppress RCU CPU stall warning messages. + + rcupdate.rcu_cpu_stall_timeout= [KNL] + Set timeout for RCU CPU stall warning messages. + + rcupdate.rcu_expedited= [KNL] + Use expedited grace-period primitives, for + example, synchronize_rcu_expedited() instead + of synchronize_rcu(). This reduces latency, + but can increase CPU utilization, degrade + real-time latency, and degrade energy efficiency. + No effect on CONFIG_TINY_RCU kernels. + + rcupdate.rcu_normal= [KNL] + Use only normal grace-period primitives, + for example, synchronize_rcu() instead of + synchronize_rcu_expedited(). This improves + real-time latency, CPU utilization, and + energy efficiency, but can expose users to + increased grace-period latency. This parameter + overrides rcupdate.rcu_expedited. No effect on + CONFIG_TINY_RCU kernels. + + rcupdate.rcu_normal_after_boot= [KNL] + Once boot has completed (that is, after + rcu_end_inkernel_boot() has been invoked), use + only normal grace-period primitives. No effect + on CONFIG_TINY_RCU kernels. + + rcupdate.rcu_task_stall_timeout= [KNL] + Set timeout in jiffies for RCU task stall warning + messages. Disable with a value less than or equal + to zero. + + rcupdate.rcu_self_test= [KNL] + Run the RCU early boot self tests + + rcupdate.rcu_self_test_bh= [KNL] + Run the RCU bh early boot self tests + + rcupdate.rcu_self_test_sched= [KNL] + Run the RCU sched early boot self tests + + rdinit= [KNL] + Format: <full_path> + Run specified binary instead of /init from the ramdisk, + used for early userspace startup. See initrd. + + rdrand= [X86] + force - Override the decision by the kernel to hide the + advertisement of RDRAND support (this affects + certain AMD processors because of buggy BIOS + support, specifically around the suspend/resume + path). + + rdt= [HW,X86,RDT] + Turn on/off individual RDT features. List is: + cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp, + mba. + E.g. to turn on cmt and turn off mba use: + rdt=cmt,!mba + + reboot= [KNL] + Format (x86 or x86_64): + [w[arm] | c[old] | h[ard] | s[oft] | g[pio]] \ + [[,]s[mp]#### \ + [[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \ + [[,]f[orce] + Where reboot_mode is one of warm (soft) or cold (hard) or gpio, + reboot_type is one of bios, acpi, kbd, triple, efi, or pci, + reboot_force is either force or not specified, + reboot_cpu is s[mp]#### with #### being the processor + to be used for rebooting. + + relax_domain_level= + [KNL, SMP] Set scheduler's default relax_domain_level. + See Documentation/cgroup-v1/cpusets.txt. + + reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory + Format: <base1>,<size1>[,<base2>,<size2>,...] + Reserve I/O ports or memory so the kernel won't use + them. If <base> is less than 0x10000, the region + is assumed to be I/O ports; otherwise it is memory. + + reservetop= [X86-32] + Format: nn[KMG] + Reserves a hole at the top of the kernel virtual + address space. + + reservelow= [X86] + Format: nn[K] + Set the amount of memory to reserve for BIOS at + the bottom of the address space. + + reset_devices [KNL] Force drivers to reset the underlying device + during initialization. + + resume= [SWSUSP] + Specify the partition device for software suspend + Format: + {/dev/<dev> | PARTUUID=<uuid> | <int>:<int> | <hex>} + + resume_offset= [SWSUSP] + Specify the offset from the beginning of the partition + given by "resume=" at which the swap header is located, + in <PAGE_SIZE> units (needed only for swap files). + See Documentation/power/swsusp-and-swap-files.txt + + resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to + read the resume files + + resumewait [HIBERNATION] Wait (indefinitely) for resume device to show up. + Useful for devices that are detected asynchronously + (e.g. USB and MMC devices). + + hibernate= [HIBERNATION] + noresume Don't check if there's a hibernation image + present during boot. + nocompress Don't compress/decompress hibernation images. + no Disable hibernation and resume. + protect_image Turn on image protection during restoration + (that will set all pages holding image data + during restoration read-only). + + retain_initrd [RAM] Keep initrd memory after extraction + + rfkill.default_state= + 0 "airplane mode". All wifi, bluetooth, wimax, gps, fm, + etc. communication is blocked by default. + 1 Unblocked. + + rfkill.master_switch_mode= + 0 The "airplane mode" button does nothing. + 1 The "airplane mode" button toggles between everything + blocked and the previous configuration. + 2 The "airplane mode" button toggles between everything + blocked and everything unblocked. + + rhash_entries= [KNL,NET] + Set number of hash buckets for route cache + + ring3mwait=disable + [KNL] Disable ring 3 MONITOR/MWAIT feature on supported + CPUs. + + ro [KNL] Mount root device read-only on boot + + rodata= [KNL] + on Mark read-only kernel memory as read-only (default). + off Leave read-only kernel memory writable for debugging. + + rockchip.usb_uart + Enable the uart passthrough on the designated usb port + on Rockchip SoCs. When active, the signals of the + debug-uart get routed to the D+ and D- pins of the usb + port and the regular usb controller gets disabled. + + root= [KNL] Root filesystem + See name_to_dev_t comment in init/do_mounts.c. + + rootdelay= [KNL] Delay (in seconds) to pause before attempting to + mount the root filesystem + + rootflags= [KNL] Set root filesystem mount option string + + rootfstype= [KNL] Set root filesystem type + + rootwait [KNL] Wait (indefinitely) for root device to show up. + Useful for devices that are detected asynchronously + (e.g. USB and MMC devices). + + rproc_mem=nn[KMG][@address] + [KNL,ARM,CMA] Remoteproc physical memory block. + Memory area to be used by remote processor image, + managed by CMA. + + rw [KNL] Mount root device read-write on boot + + S [KNL] Run init in single mode + + s390_iommu= [HW,S390] + Set s390 IOTLB flushing mode + strict + With strict flushing every unmap operation will result in + an IOTLB flush. Default is lazy flushing before reuse, + which is faster. + + sa1100ir [NET] + See drivers/net/irda/sa1100_ir.c. + + sbni= [NET] Granch SBNI12 leased line adapter + + sched_debug [KNL] Enables verbose scheduler debug messages. + + schedstats= [KNL,X86] Enable or disable scheduled statistics. + Allowed values are enable and disable. This feature + incurs a small amount of overhead in the scheduler + but is useful for debugging and performance tuning. + + skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate + xtime_lock contention on larger systems, and/or RCU lock + contention on all systems with CONFIG_MAXSMP set. + Format: { "0" | "1" } + 0 -- disable. (may be 1 via CONFIG_CMDLINE="skew_tick=1" + 1 -- enable. + Note: increases power consumption, thus should only be + enabled if running jitter sensitive (HPC/RT) workloads. + + security= [SECURITY] Choose a security module to enable at boot. + If this boot parameter is not specified, only the first + security module asking for security registration will be + loaded. An invalid security module name will be treated + as if no module has been chosen. + + selinux= [SELINUX] Disable or enable SELinux at boot time. + Format: { "0" | "1" } + See security/selinux/Kconfig help text. + 0 -- disable. + 1 -- enable. + Default value is set via kernel config option. + If enabled at boot time, /selinux/disable can be used + later to disable prior to initial policy load. + + apparmor= [APPARMOR] Disable or enable AppArmor at boot time + Format: { "0" | "1" } + See security/apparmor/Kconfig help text + 0 -- disable. + 1 -- enable. + Default value is set via kernel config option. + + serialnumber [BUGS=X86-32] + + shapers= [NET] + Maximal number of shapers. + + simeth= [IA-64] + simscsi= + + slram= [HW,MTD] + + slab_nomerge [MM] + Disable merging of slabs with similar size. May be + necessary if there is some reason to distinguish + allocs to different slabs, especially in hardened + environments where the risk of heap overflows and + layout control by attackers can usually be + frustrated by disabling merging. This will reduce + most of the exposure of a heap attack to a single + cache (risks via metadata attacks are mostly + unchanged). Debug options disable merging on their + own. + For more information see Documentation/vm/slub.rst. + + slab_max_order= [MM, SLAB] + Determines the maximum allowed order for slabs. + A high setting may cause OOMs due to memory + fragmentation. Defaults to 1 for systems with + more than 32MB of RAM, 0 otherwise. + + slub_debug[=options[,slabs]] [MM, SLUB] + Enabling slub_debug allows one to determine the + culprit if slab objects become corrupted. Enabling + slub_debug can create guard zones around objects and + may poison objects when not in use. Also tracks the + last alloc / free. For more information see + Documentation/vm/slub.rst. + + slub_memcg_sysfs= [MM, SLUB] + Determines whether to enable sysfs directories for + memory cgroup sub-caches. 1 to enable, 0 to disable. + The default is determined by CONFIG_SLUB_MEMCG_SYSFS_ON. + Enabling this can lead to a very high number of debug + directories and files being created under + /sys/kernel/slub. + + slub_max_order= [MM, SLUB] + Determines the maximum allowed order for slabs. + A high setting may cause OOMs due to memory + fragmentation. For more information see + Documentation/vm/slub.rst. + + slub_min_objects= [MM, SLUB] + The minimum number of objects per slab. SLUB will + increase the slab order up to slub_max_order to + generate a sufficiently large slab able to contain + the number of objects indicated. The higher the number + of objects the smaller the overhead of tracking slabs + and the less frequently locks need to be acquired. + For more information see Documentation/vm/slub.rst. + + slub_min_order= [MM, SLUB] + Determines the minimum page order for slabs. Must be + lower than slub_max_order. + For more information see Documentation/vm/slub.rst. + + slub_nomerge [MM, SLUB] + Same with slab_nomerge. This is supported for legacy. + See slab_nomerge for more information. + + smart2= [HW] + Format: <io1>[,<io2>[,...,<io8>]] + + smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices + smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port + smsc-ircc2.ircc_sir= [HW] SIR base I/O port + smsc-ircc2.ircc_fir= [HW] FIR base I/O port + smsc-ircc2.ircc_irq= [HW] IRQ line + smsc-ircc2.ircc_dma= [HW] DMA channel + smsc-ircc2.ircc_transceiver= [HW] Transceiver type: + 0: Toshiba Satellite 1800 (GP data pin select) + 1: Fast pin select (default) + 2: ATC IRMode + + smt [KNL,S390] Set the maximum number of threads (logical + CPUs) to use per physical CPU on systems capable of + symmetric multithreading (SMT). Will be capped to the + actual hardware limit. + Format: <integer> + Default: -1 (no limit) + + softlockup_panic= + [KNL] Should the soft-lockup detector generate panics. + Format: <integer> + + A nonzero value instructs the soft-lockup detector + to panic the machine when a soft-lockup occurs. This + is also controlled by CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC + which is the respective build-time switch to that + functionality. + + softlockup_all_cpu_backtrace= + [KNL] Should the soft-lockup detector generate + backtraces on all cpus. + Format: <integer> + + sonypi.*= [HW] Sony Programmable I/O Control Device driver + See Documentation/laptops/sonypi.txt + + spectre_v2= [X86] Control mitigation of Spectre variant 2 + (indirect branch speculation) vulnerability. + The default operation protects the kernel from + user space attacks. + + on - unconditionally enable, implies + spectre_v2_user=on + off - unconditionally disable, implies + spectre_v2_user=off + auto - kernel detects whether your CPU model is + vulnerable + + Selecting 'on' will, and 'auto' may, choose a + mitigation method at run time according to the + CPU, the available microcode, the setting of the + CONFIG_RETPOLINE configuration option, and the + compiler with which the kernel was built. + + Selecting 'on' will also enable the mitigation + against user space to user space task attacks. + + Selecting 'off' will disable both the kernel and + the user space protections. + + Specific mitigations can also be selected manually: + + retpoline - replace indirect branches + retpoline,generic - Retpolines + retpoline,lfence - LFENCE; indirect branch + retpoline,amd - alias for retpoline,lfence + eibrs - enhanced IBRS + eibrs,retpoline - enhanced IBRS + Retpolines + eibrs,lfence - enhanced IBRS + LFENCE + + Not specifying this option is equivalent to + spectre_v2=auto. + + spectre_v2_user= + [X86] Control mitigation of Spectre variant 2 + (indirect branch speculation) vulnerability between + user space tasks + + on - Unconditionally enable mitigations. Is + enforced by spectre_v2=on + + off - Unconditionally disable mitigations. Is + enforced by spectre_v2=off + + prctl - Indirect branch speculation is enabled, + but mitigation can be enabled via prctl + per thread. The mitigation control state + is inherited on fork. + + prctl,ibpb + - Like "prctl" above, but only STIBP is + controlled per thread. IBPB is issued + always when switching between different user + space processes. + + seccomp + - Same as "prctl" above, but all seccomp + threads will enable the mitigation unless + they explicitly opt out. + + seccomp,ibpb + - Like "seccomp" above, but only STIBP is + controlled per thread. IBPB is issued + always when switching between different + user space processes. + + auto - Kernel selects the mitigation depending on + the available CPU features and vulnerability. + + Default mitigation: + If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl" + + Not specifying this option is equivalent to + spectre_v2_user=auto. + + spec_store_bypass_disable= + [HW] Control Speculative Store Bypass (SSB) Disable mitigation + (Speculative Store Bypass vulnerability) + + Certain CPUs are vulnerable to an exploit against a + a common industry wide performance optimization known + as "Speculative Store Bypass" in which recent stores + to the same memory location may not be observed by + later loads during speculative execution. The idea + is that such stores are unlikely and that they can + be detected prior to instruction retirement at the + end of a particular speculation execution window. + + In vulnerable processors, the speculatively forwarded + store can be used in a cache side channel attack, for + example to read memory to which the attacker does not + directly have access (e.g. inside sandboxed code). + + This parameter controls whether the Speculative Store + Bypass optimization is used. + + On x86 the options are: + + on - Unconditionally disable Speculative Store Bypass + off - Unconditionally enable Speculative Store Bypass + auto - Kernel detects whether the CPU model contains an + implementation of Speculative Store Bypass and + picks the most appropriate mitigation. If the + CPU is not vulnerable, "off" is selected. If the + CPU is vulnerable the default mitigation is + architecture and Kconfig dependent. See below. + prctl - Control Speculative Store Bypass per thread + via prctl. Speculative Store Bypass is enabled + for a process by default. The state of the control + is inherited on fork. + seccomp - Same as "prctl" above, but all seccomp threads + will disable SSB unless they explicitly opt out. + + Default mitigations: + X86: If CONFIG_SECCOMP=y "seccomp", otherwise "prctl" + + On powerpc the options are: + + on,auto - On Power8 and Power9 insert a store-forwarding + barrier on kernel entry and exit. On Power7 + perform a software flush on kernel entry and + exit. + off - No action. + + Not specifying this option is equivalent to + spec_store_bypass_disable=auto. + + spia_io_base= [HW,MTD] + spia_fio_base= + spia_pedr= + spia_peddr= + + srbds= [X86,INTEL] + Control the Special Register Buffer Data Sampling + (SRBDS) mitigation. + + Certain CPUs are vulnerable to an MDS-like + exploit which can leak bits from the random + number generator. + + By default, this issue is mitigated by + microcode. However, the microcode fix can cause + the RDRAND and RDSEED instructions to become + much slower. Among other effects, this will + result in reduced throughput from /dev/urandom. + + The microcode mitigation can be disabled with + the following option: + + off: Disable mitigation and remove + performance impact to RDRAND and RDSEED + + srcutree.counter_wrap_check [KNL] + Specifies how frequently to check for + grace-period sequence counter wrap for the + srcu_data structure's ->srcu_gp_seq_needed field. + The greater the number of bits set in this kernel + parameter, the less frequently counter wrap will + be checked for. Note that the bottom two bits + are ignored. + + srcutree.exp_holdoff [KNL] + Specifies how many nanoseconds must elapse + since the end of the last SRCU grace period for + a given srcu_struct until the next normal SRCU + grace period will be considered for automatic + expediting. Set to zero to disable automatic + expediting. + + ssbd= [ARM64,HW] + Speculative Store Bypass Disable control + + On CPUs that are vulnerable to the Speculative + Store Bypass vulnerability and offer a + firmware based mitigation, this parameter + indicates how the mitigation should be used: + + force-on: Unconditionally enable mitigation for + for both kernel and userspace + force-off: Unconditionally disable mitigation for + for both kernel and userspace + kernel: Always enable mitigation in the + kernel, and offer a prctl interface + to allow userspace to register its + interest in being mitigated too. + + stack_guard_gap= [MM] + override the default stack gap protection. The value + is in page units and it defines how many pages prior + to (for stacks growing down) resp. after (for stacks + growing up) the main stack are reserved for no other + mapping. Default value is 256 pages. + + stacktrace [FTRACE] + Enabled the stack tracer on boot up. + + stacktrace_filter=[function-list] + [FTRACE] Limit the functions that the stack tracer + will trace at boot up. function-list is a comma separated + list of functions. This list can be changed at run + time by the stack_trace_filter file in the debugfs + tracing directory. Note, this enables stack tracing + and the stacktrace above is not needed. + + sti= [PARISC,HW] + Format: <num> + Set the STI (builtin display/keyboard on the HP-PARISC + machines) console (graphic card) which should be used + as the initial boot-console. + See also comment in drivers/video/console/sticore.c. + + sti_font= [HW] + See comment in drivers/video/console/sticore.c. + + stifb= [HW] + Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]] + + sunrpc.min_resvport= + sunrpc.max_resvport= + [NFS,SUNRPC] + SunRPC servers often require that client requests + originate from a privileged port (i.e. a port in the + range 0 < portnr < 1024). + An administrator who wishes to reserve some of these + ports for other uses may adjust the range that the + kernel's sunrpc client considers to be privileged + using these two parameters to set the minimum and + maximum port values. + + sunrpc.svc_rpc_per_connection_limit= + [NFS,SUNRPC] + Limit the number of requests that the server will + process in parallel from a single connection. + The default value is 0 (no limit). + + sunrpc.pool_mode= + [NFS] + Control how the NFS server code allocates CPUs to + service thread pools. Depending on how many NICs + you have and where their interrupts are bound, this + option will affect which CPUs will do NFS serving. + Note: this parameter cannot be changed while the + NFS server is running. + + auto the server chooses an appropriate mode + automatically using heuristics + global a single global pool contains all CPUs + percpu one pool for each CPU + pernode one pool for each NUMA node (equivalent + to global on non-NUMA machines) + + sunrpc.tcp_slot_table_entries= + sunrpc.udp_slot_table_entries= + [NFS,SUNRPC] + Sets the upper limit on the number of simultaneous + RPC calls that can be sent from the client to a + server. Increasing these values may allow you to + improve throughput, but will also increase the + amount of memory reserved for use by the client. + + suspend.pm_test_delay= + [SUSPEND] + Sets the number of seconds to remain in a suspend test + mode before resuming the system (see + /sys/power/pm_test). Only available when CONFIG_PM_DEBUG + is set. Default value is 5. + + swapaccount=[0|1] + [KNL] Enable accounting of swap in memory resource + controller if no parameter or 1 is given or disable + it if 0 is given (See Documentation/cgroup-v1/memory.txt) + + swiotlb= [ARM,IA-64,PPC,MIPS,X86] + Format: { <int> | force | noforce } + <int> -- Number of I/O TLB slabs + force -- force using of bounce buffers even if they + wouldn't be automatically used by the kernel + noforce -- Never use bounce buffers (for debugging) + + switches= [HW,M68k] + + sysfs.deprecated=0|1 [KNL] + Enable/disable old style sysfs layout for old udev + on older distributions. When this option is enabled + very new udev will not work anymore. When this option + is disabled (or CONFIG_SYSFS_DEPRECATED not compiled) + in older udev will not work anymore. + Default depends on CONFIG_SYSFS_DEPRECATED_V2 set in + the kernel configuration. + + sysrq_always_enabled + [KNL] + Ignore sysrq setting - this boot parameter will + neutralize any effect of /proc/sys/kernel/sysrq. + Useful for debugging. + + tcpmhash_entries= [KNL,NET] + Set the number of tcp_metrics_hash slots. + Default value is 8192 or 16384 depending on total + ram pages. This is used to specify the TCP metrics + cache size. See Documentation/networking/ip-sysctl.txt + "tcp_no_metrics_save" section for more details. + + tdfx= [HW,DRM] + + test_suspend= [SUSPEND][,N] + Specify "mem" (for Suspend-to-RAM) or "standby" (for + standby suspend) or "freeze" (for suspend type freeze) + as the system sleep state during system startup with + the optional capability to repeat N number of times. + The system is woken from this state using a + wakeup-capable RTC alarm. + + thash_entries= [KNL,NET] + Set number of hash buckets for TCP connection + + thermal.act= [HW,ACPI] + -1: disable all active trip points in all thermal zones + <degrees C>: override all lowest active trip points + + thermal.crt= [HW,ACPI] + -1: disable all critical trip points in all thermal zones + <degrees C>: override all critical trip points + + thermal.nocrt= [HW,ACPI] + Set to disable actions on ACPI thermal zone + critical and hot trip points. + + thermal.off= [HW,ACPI] + 1: disable ACPI thermal control + + thermal.psv= [HW,ACPI] + -1: disable all passive trip points + <degrees C>: override all passive trip points to this + value + + thermal.tzp= [HW,ACPI] + Specify global default ACPI thermal zone polling rate + <deci-seconds>: poll all this frequency + 0: no polling (default) + + threadirqs [KNL] + Force threading of all interrupt handlers except those + marked explicitly IRQF_NO_THREAD. + + tmem [KNL,XEN] + Enable the Transcendent memory driver if built-in. + + tmem.cleancache=0|1 [KNL, XEN] + Default is on (1). Disable the usage of the cleancache + API to send anonymous pages to the hypervisor. + + tmem.frontswap=0|1 [KNL, XEN] + Default is on (1). Disable the usage of the frontswap + API to send swap pages to the hypervisor. If disabled + the selfballooning and selfshrinking are force disabled. + + tmem.selfballooning=0|1 [KNL, XEN] + Default is on (1). Disable the driving of swap pages + to the hypervisor. + + tmem.selfshrinking=0|1 [KNL, XEN] + Default is on (1). Partial swapoff that immediately + transfers pages from Xen hypervisor back to the + kernel based on different criteria. + + topology= [S390] + Format: {off | on} + Specify if the kernel should make use of the cpu + topology information if the hardware supports this. + The scheduler will make use of this information and + e.g. base its process migration decisions on it. + Default is on. + + topology_updates= [KNL, PPC, NUMA] + Format: {off} + Specify if the kernel should ignore (off) + topology updates sent by the hypervisor to this + LPAR. + + tp720= [HW,PS2] + + tpm_suspend_pcr=[HW,TPM] + Format: integer pcr id + Specify that at suspend time, the tpm driver + should extend the specified pcr with zeros, + as a workaround for some chips which fail to + flush the last written pcr on TPM_SaveState. + This will guarantee that all the other pcrs + are saved. + + trace_buf_size=nn[KMG] + [FTRACE] will set tracing buffer size on each cpu. + + trace_event=[event-list] + [FTRACE] Set and start specified trace events in order + to facilitate early boot debugging. The event-list is a + comma separated list of trace events to enable. See + also Documentation/trace/events.rst + + trace_options=[option-list] + [FTRACE] Enable or disable tracer options at boot. + The option-list is a comma delimited list of options + that can be enabled or disabled just as if you were + to echo the option name into + + /sys/kernel/debug/tracing/trace_options + + For example, to enable stacktrace option (to dump the + stack trace of each event), add to the command line: + + trace_options=stacktrace + + See also Documentation/trace/ftrace.rst "trace options" + section. + + tp_printk[FTRACE] + Have the tracepoints sent to printk as well as the + tracing ring buffer. This is useful for early boot up + where the system hangs or reboots and does not give the + option for reading the tracing buffer or performing a + ftrace_dump_on_oops. + + To turn off having tracepoints sent to printk, + echo 0 > /proc/sys/kernel/tracepoint_printk + Note, echoing 1 into this file without the + tracepoint_printk kernel cmdline option has no effect. + + ** CAUTION ** + + Having tracepoints sent to printk() and activating high + frequency tracepoints such as irq or sched, can cause + the system to live lock. + + traceoff_on_warning + [FTRACE] enable this option to disable tracing when a + warning is hit. This turns off "tracing_on". Tracing can + be enabled again by echoing '1' into the "tracing_on" + file located in /sys/kernel/debug/tracing/ + + This option is useful, as it disables the trace before + the WARNING dump is called, which prevents the trace to + be filled with content caused by the warning output. + + This option can also be set at run time via the sysctl + option: kernel/traceoff_on_warning + + transparent_hugepage= + [KNL] + Format: [always|madvise|never] + Can be used to control the default behavior of the system + with respect to transparent hugepages. + See Documentation/admin-guide/mm/transhuge.rst + for more details. + + tsc= Disable clocksource stability checks for TSC. + Format: <string> + [x86] reliable: mark tsc clocksource as reliable, this + disables clocksource verification at runtime, as well + as the stability checks done at bootup. Used to enable + high-resolution timer mode on older hardware, and in + virtualized environment. + [x86] noirqtime: Do not use TSC to do irq accounting. + Used to run time disable IRQ_TIME_ACCOUNTING on any + platforms where RDTSC is slow and this accounting + can add overhead. + [x86] unstable: mark the TSC clocksource as unstable, this + marks the TSC unconditionally unstable at bootup and + avoids any further wobbles once the TSC watchdog notices. + + tsx= [X86] Control Transactional Synchronization + Extensions (TSX) feature in Intel processors that + support TSX control. + + This parameter controls the TSX feature. The options are: + + on - Enable TSX on the system. Although there are + mitigations for all known security vulnerabilities, + TSX has been known to be an accelerator for + several previous speculation-related CVEs, and + so there may be unknown security risks associated + with leaving it enabled. + + off - Disable TSX on the system. (Note that this + option takes effect only on newer CPUs which are + not vulnerable to MDS, i.e., have + MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 and which get + the new IA32_TSX_CTRL MSR through a microcode + update. This new MSR allows for the reliable + deactivation of the TSX functionality.) + + auto - Disable TSX if X86_BUG_TAA is present, + otherwise enable TSX on the system. + + Not specifying this option is equivalent to tsx=off. + + See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst + for more details. + + tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async + Abort (TAA) vulnerability. + + Similar to Micro-architectural Data Sampling (MDS) + certain CPUs that support Transactional + Synchronization Extensions (TSX) are vulnerable to an + exploit against CPU internal buffers which can forward + information to a disclosure gadget under certain + conditions. + + In vulnerable processors, the speculatively forwarded + data can be used in a cache side channel attack, to + access data to which the attacker does not have direct + access. + + This parameter controls the TAA mitigation. The + options are: + + full - Enable TAA mitigation on vulnerable CPUs + if TSX is enabled. + + full,nosmt - Enable TAA mitigation and disable SMT on + vulnerable CPUs. If TSX is disabled, SMT + is not disabled because CPU is not + vulnerable to cross-thread TAA attacks. + off - Unconditionally disable TAA mitigation + + On MDS-affected machines, tsx_async_abort=off can be + prevented by an active MDS mitigation as both vulnerabilities + are mitigated with the same mechanism so in order to disable + this mitigation, you need to specify mds=off too. + + Not specifying this option is equivalent to + tsx_async_abort=full. On CPUs which are MDS affected + and deploy MDS mitigation, TAA mitigation is not + required and doesn't provide any additional + mitigation. + + For details see: + Documentation/admin-guide/hw-vuln/tsx_async_abort.rst + + turbografx.map[2|3]= [HW,JOY] + TurboGraFX parallel port interface + Format: + <port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7> + See also Documentation/input/devices/joystick-parport.rst + + udbg-immortal [PPC] When debugging early kernel crashes that + happen after console_init() and before a proper + console driver takes over, this boot options might + help "seeing" what's going on. + + uhash_entries= [KNL,NET] + Set number of hash buckets for UDP/UDP-Lite connections + + uhci-hcd.ignore_oc= + [USB] Ignore overcurrent events (default N). + Some badly-designed motherboards generate lots of + bogus events, for ports that aren't wired to + anything. Set this parameter to avoid log spamming. + Note that genuine overcurrent events won't be + reported either. + + unknown_nmi_panic + [X86] Cause panic on unknown NMI. + + usbcore.authorized_default= + [USB] Default USB device authorization: + (default -1 = authorized except for wireless USB, + 0 = not authorized, 1 = authorized) + + usbcore.autosuspend= + [USB] The autosuspend time delay (in seconds) used + for newly-detected USB devices (default 2). This + is the time required before an idle device will be + autosuspended. Devices for which the delay is set + to a negative value won't be autosuspended at all. + + usbcore.usbfs_snoop= + [USB] Set to log all usbfs traffic (default 0 = off). + + usbcore.usbfs_snoop_max= + [USB] Maximum number of bytes to snoop in each URB + (default = 65536). + + usbcore.blinkenlights= + [USB] Set to cycle leds on hubs (default 0 = off). + + usbcore.old_scheme_first= + [USB] Start with the old device initialization + scheme (default 0 = off). + + usbcore.usbfs_memory_mb= + [USB] Memory limit (in MB) for buffers allocated by + usbfs (default = 16, 0 = max = 2047). + + usbcore.use_both_schemes= + [USB] Try the other device initialization scheme + if the first one fails (default 1 = enabled). + + usbcore.initial_descriptor_timeout= + [USB] Specifies timeout for the initial 64-byte + USB_REQ_GET_DESCRIPTOR request in milliseconds + (default 5000 = 5.0 seconds). + + usbcore.nousb [USB] Disable the USB subsystem + + usbcore.quirks= + [USB] A list of quirk entries to augment the built-in + usb core quirk list. List entries are separated by + commas. Each entry has the form + VendorID:ProductID:Flags. The IDs are 4-digit hex + numbers and Flags is a set of letters. Each letter + will change the built-in quirk; setting it if it is + clear and clearing it if it is set. The letters have + the following meanings: + a = USB_QUIRK_STRING_FETCH_255 (string + descriptors must not be fetched using + a 255-byte read); + b = USB_QUIRK_RESET_RESUME (device can't resume + correctly so reset it instead); + c = USB_QUIRK_NO_SET_INTF (device can't handle + Set-Interface requests); + d = USB_QUIRK_CONFIG_INTF_STRINGS (device can't + handle its Configuration or Interface + strings); + e = USB_QUIRK_RESET (device can't be reset + (e.g morph devices), don't use reset); + f = USB_QUIRK_HONOR_BNUMINTERFACES (device has + more interface descriptions than the + bNumInterfaces count, and can't handle + talking to these interfaces); + g = USB_QUIRK_DELAY_INIT (device needs a pause + during initialization, after we read + the device descriptor); + h = USB_QUIRK_LINEAR_UFRAME_INTR_BINTERVAL (For + high speed and super speed interrupt + endpoints, the USB 2.0 and USB 3.0 spec + require the interval in microframes (1 + microframe = 125 microseconds) to be + calculated as interval = 2 ^ + (bInterval-1). + Devices with this quirk report their + bInterval as the result of this + calculation instead of the exponent + variable used in the calculation); + i = USB_QUIRK_DEVICE_QUALIFIER (device can't + handle device_qualifier descriptor + requests); + j = USB_QUIRK_IGNORE_REMOTE_WAKEUP (device + generates spurious wakeup, ignore + remote wakeup capability); + k = USB_QUIRK_NO_LPM (device can't handle Link + Power Management); + l = USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL + (Device reports its bInterval as linear + frames instead of the USB 2.0 + calculation); + m = USB_QUIRK_DISCONNECT_SUSPEND (Device needs + to be disconnected before suspend to + prevent spurious wakeup); + n = USB_QUIRK_DELAY_CTRL_MSG (Device needs a + pause after every control message); + o = USB_QUIRK_HUB_SLOW_RESET (Hub needs extra + delay after resetting its port); + Example: quirks=0781:5580:bk,0a5c:5834:gij + + usbhid.mousepoll= + [USBHID] The interval which mice are to be polled at. + + usbhid.jspoll= + [USBHID] The interval which joysticks are to be polled at. + + usbhid.kbpoll= + [USBHID] The interval which keyboards are to be polled at. + + usb-storage.delay_use= + [UMS] The delay in seconds before a new device is + scanned for Logical Units (default 1). + + usb-storage.quirks= + [UMS] A list of quirks entries to supplement or + override the built-in unusual_devs list. List + entries are separated by commas. Each entry has + the form VID:PID:Flags where VID and PID are Vendor + and Product ID values (4-digit hex numbers) and + Flags is a set of characters, each corresponding + to a common usb-storage quirk flag as follows: + a = SANE_SENSE (collect more than 18 bytes + of sense data, not on uas); + b = BAD_SENSE (don't collect more than 18 + bytes of sense data, not on uas); + c = FIX_CAPACITY (decrease the reported + device capacity by one sector); + d = NO_READ_DISC_INFO (don't use + READ_DISC_INFO command, not on uas); + e = NO_READ_CAPACITY_16 (don't use + READ_CAPACITY_16 command); + f = NO_REPORT_OPCODES (don't use report opcodes + command, uas only); + g = MAX_SECTORS_240 (don't transfer more than + 240 sectors at a time, uas only); + h = CAPACITY_HEURISTICS (decrease the + reported device capacity by one + sector if the number is odd); + i = IGNORE_DEVICE (don't bind to this + device); + j = NO_REPORT_LUNS (don't use report luns + command, uas only); + k = NO_SAME (do not use WRITE_SAME, uas only) + l = NOT_LOCKABLE (don't try to lock and + unlock ejectable media, not on uas); + m = MAX_SECTORS_64 (don't transfer more + than 64 sectors = 32 KB at a time, + not on uas); + n = INITIAL_READ10 (force a retry of the + initial READ(10) command, not on uas); + o = CAPACITY_OK (accept the capacity + reported by the device, not on uas); + p = WRITE_CACHE (the device cache is ON + by default, not on uas); + r = IGNORE_RESIDUE (the device reports + bogus residue values, not on uas); + s = SINGLE_LUN (the device has only one + Logical Unit); + t = NO_ATA_1X (don't allow ATA(12) and ATA(16) + commands, uas only); + u = IGNORE_UAS (don't bind to the uas driver); + w = NO_WP_DETECT (don't test whether the + medium is write-protected). + y = ALWAYS_SYNC (issue a SYNCHRONIZE_CACHE + even if the device claims no cache, + not on uas) + Example: quirks=0419:aaf5:rl,0421:0433:rc + + user_debug= [KNL,ARM] + Format: <int> + See arch/arm/Kconfig.debug help text. + 1 - undefined instruction events + 2 - system calls + 4 - invalid data aborts + 8 - SIGSEGV faults + 16 - SIGBUS faults + Example: user_debug=31 + + userpte= + [X86] Flags controlling user PTE allocations. + + nohigh = do not allocate PTE pages in + HIGHMEM regardless of setting + of CONFIG_HIGHPTE. + + vdso= [X86,SH] + On X86_32, this is an alias for vdso32=. Otherwise: + + vdso=1: enable VDSO (the default) + vdso=0: disable VDSO mapping + + vdso32= [X86] Control the 32-bit vDSO + vdso32=1: enable 32-bit VDSO + vdso32=0 or vdso32=2: disable 32-bit VDSO + + See the help text for CONFIG_COMPAT_VDSO for more + details. If CONFIG_COMPAT_VDSO is set, the default is + vdso32=0; otherwise, the default is vdso32=1. + + For compatibility with older kernels, vdso32=2 is an + alias for vdso32=0. + + Try vdso32=0 if you encounter an error that says: + dl_main: Assertion `(void *) ph->p_vaddr == _rtld_local._dl_sysinfo_dso' failed! + + vector= [IA-64,SMP] + vector=percpu: enable percpu vector domain + + video= [FB] Frame buffer configuration + See Documentation/fb/modedb.txt. + + video.brightness_switch_enabled= [0,1] + If set to 1, on receiving an ACPI notify event + generated by hotkey, video driver will adjust brightness + level and then send out the event to user space through + the allocated input device; If set to 0, video driver + will only send out the event without touching backlight + brightness level. + default: 1 + + virtio_mmio.device= + [VMMIO] Memory mapped virtio (platform) device. + + <size>@<baseaddr>:<irq>[:<id>] + where: + <size> := size (can use standard suffixes + like K, M and G) + <baseaddr> := physical base address + <irq> := interrupt number (as passed to + request_irq()) + <id> := (optional) platform device id + example: + virtio_mmio.device=1K@0x100b0000:48:7 + + Can be used multiple times for multiple devices. + + vga= [BOOT,X86-32] Select a particular video mode + See Documentation/x86/boot.txt and + Documentation/svga.txt. + Use vga=ask for menu. + This is actually a boot loader parameter; the value is + passed to the kernel using a special protocol. + + vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact + size of <nn>. This can be used to increase the + minimum size (128MB on x86). It can also be used to + decrease the size and leave more room for directly + mapped kernel RAM. + + vmcp_cma=nn[MG] [KNL,S390] + Sets the memory size reserved for contiguous memory + allocations for the vmcp device driver. + + vmhalt= [KNL,S390] Perform z/VM CP command after system halt. + Format: <command> + + vmpanic= [KNL,S390] Perform z/VM CP command after kernel panic. + Format: <command> + + vmpoff= [KNL,S390] Perform z/VM CP command after power off. + Format: <command> + + vsyscall= [X86-64] + Controls the behavior of vsyscalls (i.e. calls to + fixed addresses of 0xffffffffff600x00 from legacy + code). Most statically-linked binaries and older + versions of glibc use these calls. Because these + functions are at fixed addresses, they make nice + targets for exploits that can control RIP. + + emulate [default] Vsyscalls turn into traps and are + emulated reasonably safely. + + none Vsyscalls don't work at all. This makes + them quite hard to use for exploits but + might break your system. + + vt.color= [VT] Default text color. + Format: 0xYX, X = foreground, Y = background. + Default: 0x07 = light gray on black. + + vt.cur_default= [VT] Default cursor shape. + Format: 0xCCBBAA, where AA, BB, and CC are the same as + the parameters of the <Esc>[?A;B;Cc escape sequence; + see VGA-softcursor.txt. Default: 2 = underline. + + vt.default_blu= [VT] + Format: <blue0>,<blue1>,<blue2>,...,<blue15> + Change the default blue palette of the console. + This is a 16-member array composed of values + ranging from 0-255. + + vt.default_grn= [VT] + Format: <green0>,<green1>,<green2>,...,<green15> + Change the default green palette of the console. + This is a 16-member array composed of values + ranging from 0-255. + + vt.default_red= [VT] + Format: <red0>,<red1>,<red2>,...,<red15> + Change the default red palette of the console. + This is a 16-member array composed of values + ranging from 0-255. + + vt.default_utf8= + [VT] + Format=<0|1> + Set system-wide default UTF-8 mode for all tty's. + Default is 1, i.e. UTF-8 mode is enabled for all + newly opened terminals. + + vt.global_cursor_default= + [VT] + Format=<-1|0|1> + Set system-wide default for whether a cursor + is shown on new VTs. Default is -1, + i.e. cursors will be created by default unless + overridden by individual drivers. 0 will hide + cursors, 1 will display them. + + vt.italic= [VT] Default color for italic text; 0-15. + Default: 2 = green. + + vt.underline= [VT] Default color for underlined text; 0-15. + Default: 3 = cyan. + + watchdog timers [HW,WDT] For information on watchdog timers, + see Documentation/watchdog/watchdog-parameters.txt + or other driver-specific files in the + Documentation/watchdog/ directory. + + workqueue.watchdog_thresh= + If CONFIG_WQ_WATCHDOG is configured, workqueue can + warn stall conditions and dump internal state to + help debugging. 0 disables workqueue stall + detection; otherwise, it's the stall threshold + duration in seconds. The default value is 30 and + it can be updated at runtime by writing to the + corresponding sysfs file. + + workqueue.disable_numa + By default, all work items queued to unbound + workqueues are affine to the NUMA nodes they're + issued on, which results in better behavior in + general. If NUMA affinity needs to be disabled for + whatever reason, this option can be used. Note + that this also can be controlled per-workqueue for + workqueues visible under /sys/bus/workqueue/. + + workqueue.power_efficient + Per-cpu workqueues are generally preferred because + they show better performance thanks to cache + locality; unfortunately, per-cpu workqueues tend to + be more power hungry than unbound workqueues. + + Enabling this makes the per-cpu workqueues which + were observed to contribute significantly to power + consumption unbound, leading to measurably lower + power usage at the cost of small performance + overhead. + + The default value of this parameter is determined by + the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT. + + workqueue.debug_force_rr_cpu + Workqueue used to implicitly guarantee that work + items queued without explicit CPU specified are put + on the local CPU. This guarantee is no longer true + and while local CPU is still preferred work items + may be put on foreign CPUs. This debug option + forces round-robin CPU selection to flush out + usages which depend on the now broken guarantee. + When enabled, memory and cache locality will be + impacted. + + x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of + default x2apic cluster mode on platforms + supporting x2apic. + + x86_intel_mid_timer= [X86-32,APBT] + Choose timer option for x86 Intel MID platform. + Two valid options are apbt timer only and lapic timer + plus one apbt timer for broadcast timer. + x86_intel_mid_timer=apbt_only | lapic_and_apbt + + xen_512gb_limit [KNL,X86-64,XEN] + Restricts the kernel running paravirtualized under Xen + to use only up to 512 GB of RAM. The reason to do so is + crash analysis tools and Xen tools for doing domain + save/restore/migration must be enabled to handle larger + domains. + + xen_emul_unplug= [HW,X86,XEN] + Unplug Xen emulated devices + Format: [unplug0,][unplug1] + ide-disks -- unplug primary master IDE devices + aux-ide-disks -- unplug non-primary-master IDE devices + nics -- unplug network devices + all -- unplug all emulated devices (NICs and IDE disks) + unnecessary -- unplugging emulated devices is + unnecessary even if the host did not respond to + the unplug protocol + never -- do not unplug even if version check succeeds + + xen_legacy_crash [X86,XEN] + Crash from Xen panic notifier, without executing late + panic() code such as dumping handler. + + xen_nopvspin [X86,XEN] + Disables the ticketlock slowpath using Xen PV + optimizations. + + xen_nopv [X86] + Disables the PV optimizations forcing the HVM guest to + run as generic HVM guest with no PV drivers. + + xen_scrub_pages= [XEN] + Boolean option to control scrubbing pages before giving them back + to Xen, for use by other domains. Can be also changed at runtime + with /sys/devices/system/xen_memory/xen_memory0/scrub_pages. + Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT. + + xen.balloon_boot_timeout= [XEN] + The time (in seconds) to wait before giving up to boot + in case initial ballooning fails to free enough memory. + Applies only when running as HVM or PVH guest and + started with less memory configured than allowed at + max. Default is 180. + + xen.event_eoi_delay= [XEN] + How long to delay EOI handling in case of event + storms (jiffies). Default is 10. + + xen.event_loop_timeout= [XEN] + After which time (jiffies) the event handling loop + should start to delay EOI handling. Default is 2. + + xirc2ps_cs= [NET,PCMCIA] + Format: + <irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]] + + xhci-hcd.quirks [USB,KNL] + A hex value specifying bitmask with supplemental xhci + host controller quirks. Meaning of each bit can be + consulted in header drivers/usb/host/xhci.h. diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst new file mode 100644 index 000000000..84de718f2 --- /dev/null +++ b/Documentation/admin-guide/md.rst @@ -0,0 +1,758 @@ +RAID arrays +=========== + +Boot time assembly of RAID arrays +--------------------------------- + +Tools that manage md devices can be found at + http://www.kernel.org/pub/linux/utils/raid/ + + +You can boot with your md device with the following kernel command +lines: + +for old raid arrays without persistent superblocks:: + + md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn + +for raid arrays with persistent superblocks:: + + md=<md device no.>,dev0,dev1,...,devn + +or, to assemble a partitionable array:: + + md=d<md device no.>,dev0,dev1,...,devn + +``md device no.`` ++++++++++++++++++ + +The number of the md device + +================= ========= +``md device no.`` device +================= ========= + 0 md0 + 1 md1 + 2 md2 + 3 md3 + 4 md4 +================= ========= + +``raid level`` +++++++++++++++ + +level of the RAID array + +=============== ============= +``raid level`` level +=============== ============= +-1 linear mode +0 striped mode +=============== ============= + +other modes are only supported with persistent super blocks + +``chunk size factor`` ++++++++++++++++++++++ + +(raid-0 and raid-1 only) + +Set the chunk size as 4k << n. + +``fault level`` ++++++++++++++++ + +Totally ignored + +``dev0`` to ``devn`` +++++++++++++++++++++ + +e.g. ``/dev/hda1``, ``/dev/hdc1``, ``/dev/sda1``, ``/dev/sdb1`` + +A possible loadlin line (Harald Hoyer <HarryH@Royal.Net>) looks like this:: + + e:\loadlin\loadlin e:\zimage root=/dev/md0 md=0,0,4,0,/dev/hdb2,/dev/hdc3 ro + + +Boot time autodetection of RAID arrays +-------------------------------------- + +When md is compiled into the kernel (not as module), partitions of +type 0xfd are scanned and automatically assembled into RAID arrays. +This autodetection may be suppressed with the kernel parameter +``raid=noautodetect``. As of kernel 2.6.9, only drives with a type 0 +superblock can be autodetected and run at boot time. + +The kernel parameter ``raid=partitionable`` (or ``raid=part``) means +that all auto-detected arrays are assembled as partitionable. + +Boot time assembly of degraded/dirty arrays +------------------------------------------- + +If a raid5 or raid6 array is both dirty and degraded, it could have +undetectable data corruption. This is because the fact that it is +``dirty`` means that the parity cannot be trusted, and the fact that it +is degraded means that some datablocks are missing and cannot reliably +be reconstructed (due to no parity). + +For this reason, md will normally refuse to start such an array. This +requires the sysadmin to take action to explicitly start the array +despite possible corruption. This is normally done with:: + + mdadm --assemble --force .... + +This option is not really available if the array has the root +filesystem on it. In order to support this booting from such an +array, md supports a module parameter ``start_dirty_degraded`` which, +when set to 1, bypassed the checks and will allows dirty degraded +arrays to be started. + +So, to boot with a root filesystem of a dirty degraded raid 5 or 6, use:: + + md-mod.start_dirty_degraded=1 + + +Superblock formats +------------------ + +The md driver can support a variety of different superblock formats. +Currently, it supports superblock formats ``0.90.0`` and the ``md-1`` format +introduced in the 2.5 development series. + +The kernel will autodetect which format superblock is being used. + +Superblock format ``0`` is treated differently to others for legacy +reasons - it is the original superblock format. + + +General Rules - apply for all superblock formats +------------------------------------------------ + +An array is ``created`` by writing appropriate superblocks to all +devices. + +It is ``assembled`` by associating each of these devices with an +particular md virtual device. Once it is completely assembled, it can +be accessed. + +An array should be created by a user-space tool. This will write +superblocks to all devices. It will usually mark the array as +``unclean``, or with some devices missing so that the kernel md driver +can create appropriate redundancy (copying in raid 1, parity +calculation in raid 4/5). + +When an array is assembled, it is first initialized with the +SET_ARRAY_INFO ioctl. This contains, in particular, a major and minor +version number. The major version number selects which superblock +format is to be used. The minor number might be used to tune handling +of the format, such as suggesting where on each device to look for the +superblock. + +Then each device is added using the ADD_NEW_DISK ioctl. This +provides, in particular, a major and minor number identifying the +device to add. + +The array is started with the RUN_ARRAY ioctl. + +Once started, new devices can be added. They should have an +appropriate superblock written to them, and then be passed in with +ADD_NEW_DISK. + +Devices that have failed or are not yet active can be detached from an +array using HOT_REMOVE_DISK. + + +Specific Rules that apply to format-0 super block arrays, and arrays with no superblock (non-persistent) +-------------------------------------------------------------------------------------------------------- + +An array can be ``created`` by describing the array (level, chunksize +etc) in a SET_ARRAY_INFO ioctl. This must have ``major_version==0`` and +``raid_disks != 0``. + +Then uninitialized devices can be added with ADD_NEW_DISK. The +structure passed to ADD_NEW_DISK must specify the state of the device +and its role in the array. + +Once started with RUN_ARRAY, uninitialized spares can be added with +HOT_ADD_DISK. + + +MD devices in sysfs +------------------- + +md devices appear in sysfs (``/sys``) as regular block devices, +e.g.:: + + /sys/block/md0 + +Each ``md`` device will contain a subdirectory called ``md`` which +contains further md-specific information about the device. + +All md devices contain: + + level + a text file indicating the ``raid level``. e.g. raid0, raid1, + raid5, linear, multipath, faulty. + If no raid level has been set yet (array is still being + assembled), the value will reflect whatever has been written + to it, which may be a name like the above, or may be a number + such as ``0``, ``5``, etc. + + raid_disks + a text file with a simple number indicating the number of devices + in a fully functional array. If this is not yet known, the file + will be empty. If an array is being resized this will contain + the new number of devices. + Some raid levels allow this value to be set while the array is + active. This will reconfigure the array. Otherwise it can only + be set while assembling an array. + A change to this attribute will not be permitted if it would + reduce the size of the array. To reduce the number of drives + in an e.g. raid5, the array size must first be reduced by + setting the ``array_size`` attribute. + + chunk_size + This is the size in bytes for ``chunks`` and is only relevant to + raid levels that involve striping (0,4,5,6,10). The address space + of the array is conceptually divided into chunks and consecutive + chunks are striped onto neighbouring devices. + The size should be at least PAGE_SIZE (4k) and should be a power + of 2. This can only be set while assembling an array + + layout + The ``layout`` for the array for the particular level. This is + simply a number that is interpretted differently by different + levels. It can be written while assembling an array. + + array_size + This can be used to artificially constrain the available space in + the array to be less than is actually available on the combined + devices. Writing a number (in Kilobytes) which is less than + the available size will set the size. Any reconfiguration of the + array (e.g. adding devices) will not cause the size to change. + Writing the word ``default`` will cause the effective size of the + array to be whatever size is actually available based on + ``level``, ``chunk_size`` and ``component_size``. + + This can be used to reduce the size of the array before reducing + the number of devices in a raid4/5/6, or to support external + metadata formats which mandate such clipping. + + reshape_position + This is either ``none`` or a sector number within the devices of + the array where ``reshape`` is up to. If this is set, the three + attributes mentioned above (raid_disks, chunk_size, layout) can + potentially have 2 values, an old and a new value. If these + values differ, reading the attribute returns:: + + new (old) + + and writing will effect the ``new`` value, leaving the ``old`` + unchanged. + + component_size + For arrays with data redundancy (i.e. not raid0, linear, faulty, + multipath), all components must be the same size - or at least + there must a size that they all provide space for. This is a key + part or the geometry of the array. It is measured in sectors + and can be read from here. Writing to this value may resize + the array if the personality supports it (raid1, raid5, raid6), + and if the component drives are large enough. + + metadata_version + This indicates the format that is being used to record metadata + about the array. It can be 0.90 (traditional format), 1.0, 1.1, + 1.2 (newer format in varying locations) or ``none`` indicating that + the kernel isn't managing metadata at all. + Alternately it can be ``external:`` followed by a string which + is set by user-space. This indicates that metadata is managed + by a user-space program. Any device failure or other event that + requires a metadata update will cause array activity to be + suspended until the event is acknowledged. + + resync_start + The point at which resync should start. If no resync is needed, + this will be a very large number (or ``none`` since 2.6.30-rc1). At + array creation it will default to 0, though starting the array as + ``clean`` will set it much larger. + + new_dev + This file can be written but not read. The value written should + be a block device number as major:minor. e.g. 8:0 + This will cause that device to be attached to the array, if it is + available. It will then appear at md/dev-XXX (depending on the + name of the device) and further configuration is then possible. + + safe_mode_delay + When an md array has seen no write requests for a certain period + of time, it will be marked as ``clean``. When another write + request arrives, the array is marked as ``dirty`` before the write + commences. This is known as ``safe_mode``. + The ``certain period`` is controlled by this file which stores the + period as a number of seconds. The default is 200msec (0.200). + Writing a value of 0 disables safemode. + + array_state + This file contains a single word which describes the current + state of the array. In many cases, the state can be set by + writing the word for the desired state, however some states + cannot be explicitly set, and some transitions are not allowed. + + Select/poll works on this file. All changes except between + Active_idle and active (which can be frequent and are not + very interesting) are notified. active->active_idle is + reported if the metadata is externally managed. + + clear + No devices, no size, no level + + Writing is equivalent to STOP_ARRAY ioctl + + inactive + May have some settings, but array is not active + all IO results in error + + When written, doesn't tear down array, but just stops it + + suspended (not supported yet) + All IO requests will block. The array can be reconfigured. + + Writing this, if accepted, will block until array is quiessent + + readonly + no resync can happen. no superblocks get written. + + Write requests fail + + read-auto + like readonly, but behaves like ``clean`` on a write request. + + clean + no pending writes, but otherwise active. + + When written to inactive array, starts without resync + + If a write request arrives then + if metadata is known, mark ``dirty`` and switch to ``active``. + if not known, block and switch to write-pending + + If written to an active array that has pending writes, then fails. + active + fully active: IO and resync can be happening. + When written to inactive array, starts with resync + + write-pending + clean, but writes are blocked waiting for ``active`` to be written. + + active-idle + like active, but no writes have been seen for a while (safe_mode_delay). + + bitmap/location + This indicates where the write-intent bitmap for the array is + stored. + + It can be one of ``none``, ``file`` or ``[+-]N``. + ``file`` may later be extended to ``file:/file/name`` + ``[+-]N`` means that many sectors from the start of the metadata. + + This is replicated on all devices. For arrays with externally + managed metadata, the offset is from the beginning of the + device. + + bitmap/chunksize + The size, in bytes, of the chunk which will be represented by a + single bit. For RAID456, it is a portion of an individual + device. For RAID10, it is a portion of the array. For RAID1, it + is both (they come to the same thing). + + bitmap/time_base + The time, in seconds, between looking for bits in the bitmap to + be cleared. In the current implementation, a bit will be cleared + between 2 and 3 times ``time_base`` after all the covered blocks + are known to be in-sync. + + bitmap/backlog + When write-mostly devices are active in a RAID1, write requests + to those devices proceed in the background - the filesystem (or + other user of the device) does not have to wait for them. + ``backlog`` sets a limit on the number of concurrent background + writes. If there are more than this, new writes will by + synchronous. + + bitmap/metadata + This can be either ``internal`` or ``external``. + + ``internal`` + is the default and means the metadata for the bitmap + is stored in the first 256 bytes of the allocated space and is + managed by the md module. + + ``external`` + means that bitmap metadata is managed externally to + the kernel (i.e. by some userspace program) + + bitmap/can_clear + This is either ``true`` or ``false``. If ``true``, then bits in the + bitmap will be cleared when the corresponding blocks are thought + to be in-sync. If ``false``, bits will never be cleared. + This is automatically set to ``false`` if a write happens on a + degraded array, or if the array becomes degraded during a write. + When metadata is managed externally, it should be set to true + once the array becomes non-degraded, and this fact has been + recorded in the metadata. + + consistency_policy + This indicates how the array maintains consistency in case of unexpected + shutdown. It can be: + + none + Array has no redundancy information, e.g. raid0, linear. + + resync + Full resync is performed and all redundancy is regenerated when the + array is started after unclean shutdown. + + bitmap + Resync assisted by a write-intent bitmap. + + journal + For raid4/5/6, journal device is used to log transactions and replay + after unclean shutdown. + + ppl + For raid5 only, Partial Parity Log is used to close the write hole and + eliminate resync. + + The accepted values when writing to this file are ``ppl`` and ``resync``, + used to enable and disable PPL. + + +As component devices are added to an md array, they appear in the ``md`` +directory as new directories named:: + + dev-XXX + +where ``XXX`` is a name that the kernel knows for the device, e.g. hdb1. +Each directory contains: + + block + a symlink to the block device in /sys/block, e.g.:: + + /sys/block/md0/md/dev-hdb1/block -> ../../../../block/hdb/hdb1 + + super + A file containing an image of the superblock read from, or + written to, that device. + + state + A file recording the current state of the device in the array + which can be a comma separated list of: + + faulty + device has been kicked from active use due to + a detected fault, or it has unacknowledged bad + blocks + + in_sync + device is a fully in-sync member of the array + + writemostly + device will only be subject to read + requests if there are no other options. + + This applies only to raid1 arrays. + + blocked + device has failed, and the failure hasn't been + acknowledged yet by the metadata handler. + + Writes that would write to this device if + it were not faulty are blocked. + + spare + device is working, but not a full member. + + This includes spares that are in the process + of being recovered to + + write_error + device has ever seen a write error. + + want_replacement + device is (mostly) working but probably + should be replaced, either due to errors or + due to user request. + + replacement + device is a replacement for another active + device with same raid_disk. + + + This list may grow in future. + + This can be written to. + + Writing ``faulty`` simulates a failure on the device. + + Writing ``remove`` removes the device from the array. + + Writing ``writemostly`` sets the writemostly flag. + + Writing ``-writemostly`` clears the writemostly flag. + + Writing ``blocked`` sets the ``blocked`` flag. + + Writing ``-blocked`` clears the ``blocked`` flags and allows writes + to complete and possibly simulates an error. + + Writing ``in_sync`` sets the in_sync flag. + + Writing ``write_error`` sets writeerrorseen flag. + + Writing ``-write_error`` clears writeerrorseen flag. + + Writing ``want_replacement`` is allowed at any time except to a + replacement device or a spare. It sets the flag. + + Writing ``-want_replacement`` is allowed at any time. It clears + the flag. + + Writing ``replacement`` or ``-replacement`` is only allowed before + starting the array. It sets or clears the flag. + + + This file responds to select/poll. Any change to ``faulty`` + or ``blocked`` causes an event. + + errors + An approximate count of read errors that have been detected on + this device but have not caused the device to be evicted from + the array (either because they were corrected or because they + happened while the array was read-only). When using version-1 + metadata, this value persists across restarts of the array. + + This value can be written while assembling an array thus + providing an ongoing count for arrays with metadata managed by + userspace. + + slot + This gives the role that the device has in the array. It will + either be ``none`` if the device is not active in the array + (i.e. is a spare or has failed) or an integer less than the + ``raid_disks`` number for the array indicating which position + it currently fills. This can only be set while assembling an + array. A device for which this is set is assumed to be working. + + offset + This gives the location in the device (in sectors from the + start) where data from the array will be stored. Any part of + the device before this offset is not touched, unless it is + used for storing metadata (Formats 1.1 and 1.2). + + size + The amount of the device, after the offset, that can be used + for storage of data. This will normally be the same as the + component_size. This can be written while assembling an + array. If a value less than the current component_size is + written, it will be rejected. + + recovery_start + When the device is not ``in_sync``, this records the number of + sectors from the start of the device which are known to be + correct. This is normally zero, but during a recovery + operation it will steadily increase, and if the recovery is + interrupted, restoring this value can cause recovery to + avoid repeating the earlier blocks. With v1.x metadata, this + value is saved and restored automatically. + + This can be set whenever the device is not an active member of + the array, either before the array is activated, or before + the ``slot`` is set. + + Setting this to ``none`` is equivalent to setting ``in_sync``. + Setting to any other value also clears the ``in_sync`` flag. + + bad_blocks + This gives the list of all known bad blocks in the form of + start address and length (in sectors respectively). If output + is too big to fit in a page, it will be truncated. Writing + ``sector length`` to this file adds new acknowledged (i.e. + recorded to disk safely) bad blocks. + + unacknowledged_bad_blocks + This gives the list of known-but-not-yet-saved-to-disk bad + blocks in the same form of ``bad_blocks``. If output is too big + to fit in a page, it will be truncated. Writing to this file + adds bad blocks without acknowledging them. This is largely + for testing. + + ppl_sector, ppl_size + Location and size (in sectors) of the space used for Partial Parity Log + on this device. + + +An active md device will also contain an entry for each active device +in the array. These are named:: + + rdNN + +where ``NN`` is the position in the array, starting from 0. +So for a 3 drive array there will be rd0, rd1, rd2. +These are symbolic links to the appropriate ``dev-XXX`` entry. +Thus, for example:: + + cat /sys/block/md*/md/rd*/state + +will show ``in_sync`` on every line. + + + +Active md devices for levels that support data redundancy (1,4,5,6,10) +also have + + sync_action + a text file that can be used to monitor and control the rebuild + process. It contains one word which can be one of: + + resync + redundancy is being recalculated after unclean + shutdown or creation + + recover + a hot spare is being built to replace a + failed/missing device + + idle + nothing is happening + check + A full check of redundancy was requested and is + happening. This reads all blocks and checks + them. A repair may also happen for some raid + levels. + + repair + A full check and repair is happening. This is + similar to ``resync``, but was requested by the + user, and the write-intent bitmap is NOT used to + optimise the process. + + This file is writable, and each of the strings that could be + read are meaningful for writing. + + ``idle`` will stop an active resync/recovery etc. There is no + guarantee that another resync/recovery may not be automatically + started again, though some event will be needed to trigger + this. + + ``resync`` or ``recovery`` can be used to restart the + corresponding operation if it was stopped with ``idle``. + + ``check`` and ``repair`` will start the appropriate process + providing the current state is ``idle``. + + This file responds to select/poll. Any important change in the value + triggers a poll event. Sometimes the value will briefly be + ``recover`` if a recovery seems to be needed, but cannot be + achieved. In that case, the transition to ``recover`` isn't + notified, but the transition away is. + + degraded + This contains a count of the number of devices by which the + arrays is degraded. So an optimal array will show ``0``. A + single failed/missing drive will show ``1``, etc. + + This file responds to select/poll, any increase or decrease + in the count of missing devices will trigger an event. + + mismatch_count + When performing ``check`` and ``repair``, and possibly when + performing ``resync``, md will count the number of errors that are + found. The count in ``mismatch_cnt`` is the number of sectors + that were re-written, or (for ``check``) would have been + re-written. As most raid levels work in units of pages rather + than sectors, this may be larger than the number of actual errors + by a factor of the number of sectors in a page. + + bitmap_set_bits + If the array has a write-intent bitmap, then writing to this + attribute can set bits in the bitmap, indicating that a resync + would need to check the corresponding blocks. Either individual + numbers or start-end pairs can be written. Multiple numbers + can be separated by a space. + + Note that the numbers are ``bit`` numbers, not ``block`` numbers. + They should be scaled by the bitmap_chunksize. + + sync_speed_min, sync_speed_max + This are similar to ``/proc/sys/dev/raid/speed_limit_{min,max}`` + however they only apply to the particular array. + + If no value has been written to these, or if the word ``system`` + is written, then the system-wide value is used. If a value, + in kibibytes-per-second is written, then it is used. + + When the files are read, they show the currently active value + followed by ``(local)`` or ``(system)`` depending on whether it is + a locally set or system-wide value. + + sync_completed + This shows the number of sectors that have been completed of + whatever the current sync_action is, followed by the number of + sectors in total that could need to be processed. The two + numbers are separated by a ``/`` thus effectively showing one + value, a fraction of the process that is complete. + + A ``select`` on this attribute will return when resync completes, + when it reaches the current sync_max (below) and possibly at + other times. + + sync_speed + This shows the current actual speed, in K/sec, of the current + sync_action. It is averaged over the last 30 seconds. + + suspend_lo, suspend_hi + The two values, given as numbers of sectors, indicate a range + within the array where IO will be blocked. This is currently + only supported for raid4/5/6. + + sync_min, sync_max + The two values, given as numbers of sectors, indicate a range + within the array where ``check``/``repair`` will operate. Must be + a multiple of chunk_size. When it reaches ``sync_max`` it will + pause, rather than complete. + You can use ``select`` or ``poll`` on ``sync_completed`` to wait for + that number to reach sync_max. Then you can either increase + ``sync_max``, or can write ``idle`` to ``sync_action``. + + The value of ``max`` for ``sync_max`` effectively disables the limit. + When a resync is active, the value can only ever be increased, + never decreased. + The value of ``0`` is the minimum for ``sync_min``. + + + +Each active md device may also have attributes specific to the +personality module that manages it. +These are specific to the implementation of the module and could +change substantially if the implementation changes. + +These currently include: + + stripe_cache_size (currently raid5 only) + number of entries in the stripe cache. This is writable, but + there are upper and lower limits (32768, 17). Default is 256. + + strip_cache_active (currently raid5 only) + number of active entries in the stripe cache + + preread_bypass_threshold (currently raid5 only) + number of times a stripe requiring preread will be bypassed by + a stripe that does not require preread. For fairness defaults + to 1. Setting this to 0 disables bypass accounting and + requires preread stripes to wait until all full-width stripe- + writes are complete. Valid values are 0 to stripe_cache_size. + + journal_mode (currently raid5 only) + The cache mode for raid5. raid5 could include an extra disk for + caching. The mode can be "write-throuth" and "write-back". The + default is "write-through". diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst new file mode 100644 index 000000000..291699c81 --- /dev/null +++ b/Documentation/admin-guide/mm/concepts.rst @@ -0,0 +1,222 @@ +.. _mm_concepts: + +================= +Concepts overview +================= + +The memory management in Linux is complex system that evolved over the +years and included more and more functionality to support variety of +systems from MMU-less microcontrollers to supercomputers. The memory +management for systems without MMU is called ``nommu`` and it +definitely deserves a dedicated document, which hopefully will be +eventually written. Yet, although some of the concepts are the same, +here we assume that MMU is available and CPU can translate a virtual +address to a physical address. + +.. contents:: :local: + +Virtual Memory Primer +===================== + +The physical memory in a computer system is a limited resource and +even for systems that support memory hotplug there is a hard limit on +the amount of memory that can be installed. The physical memory is not +necessary contiguous, it might be accessible as a set of distinct +address ranges. Besides, different CPU architectures, and even +different implementations of the same architecture have different view +how these address ranges defined. + +All this makes dealing directly with physical memory quite complex and +to avoid this complexity a concept of virtual memory was developed. + +The virtual memory abstracts the details of physical memory from the +application software, allows to keep only needed information in the +physical memory (demand paging) and provides a mechanism for the +protection and controlled sharing of data between processes. + +With virtual memory, each and every memory access uses a virtual +address. When the CPU decodes the an instruction that reads (or +writes) from (or to) the system memory, it translates the `virtual` +address encoded in that instruction to a `physical` address that the +memory controller can understand. + +The physical system memory is divided into page frames, or pages. The +size of each page is architecture specific. Some architectures allow +selection of the page size from several supported values; this +selection is performed at the kernel build time by setting an +appropriate kernel configuration option. + +Each physical memory page can be mapped as one or more virtual +pages. These mappings are described by page tables that allow +translation from virtual address used by programs to real address in +the physical memory. The page tables organized hierarchically. + +The tables at the lowest level of the hierarchy contain physical +addresses of actual pages used by the software. The tables at higher +levels contain physical addresses of the pages belonging to the lower +levels. The pointer to the top level page table resides in a +register. When the CPU performs the address translation, it uses this +register to access the top level page table. The high bits of the +virtual address are used to index an entry in the top level page +table. That entry is then used to access the next level in the +hierarchy with the next bits of the virtual address as the index to +that level page table. The lowest bits in the virtual address define +the offset inside the actual page. + +Huge Pages +========== + +The address translation requires several memory accesses and memory +accesses are slow relatively to CPU speed. To avoid spending precious +processor cycles on the address translation, CPUs maintain a cache of +such translations called Translation Lookaside Buffer (or +TLB). Usually TLB is pretty scarce resource and applications with +large memory working set will experience performance hit because of +TLB misses. + +Many modern CPU architectures allow mapping of the memory pages +directly by the higher levels in the page table. For instance, on x86, +it is possible to map 2M and even 1G pages using entries in the second +and the third level page tables. In Linux such pages are called +`huge`. Usage of huge pages significantly reduces pressure on TLB, +improves TLB hit-rate and thus improves overall system performance. + +There are two mechanisms in Linux that enable mapping of the physical +memory with the huge pages. The first one is `HugeTLB filesystem`, or +hugetlbfs. It is a pseudo filesystem that uses RAM as its backing +store. For the files created in this filesystem the data resides in +the memory and mapped using huge pages. The hugetlbfs is described at +:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`. + +Another, more recent, mechanism that enables use of the huge pages is +called `Transparent HugePages`, or THP. Unlike the hugetlbfs that +requires users and/or system administrators to configure what parts of +the system memory should and can be mapped by the huge pages, THP +manages such mappings transparently to the user and hence the +name. See +:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>` +for more details about THP. + +Zones +===== + +Often hardware poses restrictions on how different physical memory +ranges can be accessed. In some cases, devices cannot perform DMA to +all the addressable memory. In other cases, the size of the physical +memory exceeds the maximal addressable size of virtual memory and +special actions are required to access portions of the memory. Linux +groups memory pages into `zones` according to their possible +usage. For example, ZONE_DMA will contain memory that can be used by +devices for DMA, ZONE_HIGHMEM will contain memory that is not +permanently mapped into kernel's address space and ZONE_NORMAL will +contain normally addressed pages. + +The actual layout of the memory zones is hardware dependent as not all +architectures define all zones, and requirements for DMA are different +for different platforms. + +Nodes +===== + +Many multi-processor machines are NUMA - Non-Uniform Memory Access - +systems. In such systems the memory is arranged into banks that have +different access latency depending on the "distance" from the +processor. Each bank is referred as `node` and for each node Linux +constructs an independent memory management subsystem. A node has it's +own set of zones, lists of free and used pages and various statistics +counters. You can find more details about NUMA in +:ref:`Documentation/vm/numa.rst <numa>` and in +:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`. + +Page cache +========== + +The physical memory is volatile and the common case for getting data +into the memory is to read it from files. Whenever a file is read, the +data is put into the `page cache` to avoid expensive disk access on +the subsequent reads. Similarly, when one writes to a file, the data +is placed in the page cache and eventually gets into the backing +storage device. The written pages are marked as `dirty` and when Linux +decides to reuse them for other purposes, it makes sure to synchronize +the file contents on the device with the updated data. + +Anonymous Memory +================ + +The `anonymous memory` or `anonymous mappings` represent memory that +is not backed by a filesystem. Such mappings are implicitly created +for program's stack and heap or by explicit calls to mmap(2) system +call. Usually, the anonymous mappings only define virtual memory areas +that the program is allowed to access. The read accesses will result +in creation of a page table entry that references a special physical +page filled with zeroes. When the program performs a write, regular +physical page will be allocated to hold the written data. The page +will be marked dirty and if the kernel will decide to repurpose it, +the dirty page will be swapped out. + +Reclaim +======= + +Throughout the system lifetime, a physical page can be used for storing +different types of data. It can be kernel internal data structures, +DMA'able buffers for device drivers use, data read from a filesystem, +memory allocated by user space processes etc. + +Depending on the page usage it is treated differently by the Linux +memory management. The pages that can be freed at any time, either +because they cache the data available elsewhere, for instance, on a +hard disk, or because they can be swapped out, again, to the hard +disk, are called `reclaimable`. The most notable categories of the +reclaimable pages are page cache and anonymous memory. + +In most cases, the pages holding internal kernel data and used as DMA +buffers cannot be repurposed, and they remain pinned until freed by +their user. Such pages are called `unreclaimable`. However, in certain +circumstances, even pages occupied with kernel data structures can be +reclaimed. For instance, in-memory caches of filesystem metadata can +be re-read from the storage device and therefore it is possible to +discard them from the main memory when system is under memory +pressure. + +The process of freeing the reclaimable physical memory pages and +repurposing them is called (surprise!) `reclaim`. Linux can reclaim +pages either asynchronously or synchronously, depending on the state +of the system. When system is not loaded, most of the memory is free +and allocation request will be satisfied immediately from the free +pages supply. As the load increases, the amount of the free pages goes +down and when it reaches a certain threshold (high watermark), an +allocation request will awaken the ``kswapd`` daemon. It will +asynchronously scan memory pages and either just free them if the data +they contain is available elsewhere, or evict to the backing storage +device (remember those dirty pages?). As memory usage increases even +more and reaches another threshold - min watermark - an allocation +will trigger the `direct reclaim`. In this case allocation is stalled +until enough memory pages are reclaimed to satisfy the request. + +Compaction +========== + +As the system runs, tasks allocate and free the memory and it becomes +fragmented. Although with virtual memory it is possible to present +scattered physical pages as virtually contiguous range, sometimes it is +necessary to allocate large physically contiguous memory areas. Such +need may arise, for instance, when a device driver requires large +buffer for DMA, or when THP allocates a huge page. Memory `compaction` +addresses the fragmentation issue. This mechanism moves occupied pages +from the lower part of a memory zone to free pages in the upper part +of the zone. When a compaction scan is finished free pages are grouped +together at the beginning of the zone and allocations of large +physically contiguous areas become possible. + +Like reclaim, the compaction may happen asynchronously in ``kcompactd`` +daemon or synchronously as a result of memory allocation request. + +OOM killer +========== + +It may happen, that on a loaded machine memory will be exhausted. When +the kernel detects that the system runs out of memory (OOM) it invokes +`OOM killer`. Its mission is simple: all it has to do is to select a +task to sacrifice for the sake of the overall system health. The +selected task is killed in a hope that after it exits enough memory +will be freed to continue normal operation. diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst new file mode 100644 index 000000000..1cc0bc78d --- /dev/null +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -0,0 +1,382 @@ +.. _hugetlbpage: + +============= +HugeTLB Pages +============= + +Overview +======== + +The intent of this file is to give a brief summary of hugetlbpage support in +the Linux kernel. This support is built on top of multiple page size support +that is provided by most modern architectures. For example, x86 CPUs normally +support 4K and 2M (1G if architecturally supported) page sizes, ia64 +architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M, +256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical +translations. Typically this is a very scarce resource on processor. +Operating systems try to make best use of limited number of TLB resources. +This optimization is more critical now as bigger and bigger physical memories +(several GBs) are more readily available. + +Users can use the huge page support in Linux kernel by either using the mmap +system call or standard SYSV shared memory system calls (shmget, shmat). + +First the Linux kernel needs to be built with the CONFIG_HUGETLBFS +(present under "File systems") and CONFIG_HUGETLB_PAGE (selected +automatically when CONFIG_HUGETLBFS is selected) configuration +options. + +The ``/proc/meminfo`` file provides information about the total number of +persistent hugetlb pages in the kernel's huge page pool. It also displays +default huge page size and information about the number of free, reserved +and surplus huge pages in the pool of huge pages of default size. +The huge page size is needed for generating the proper alignment and +size of the arguments to system calls that map huge page regions. + +The output of ``cat /proc/meminfo`` will include lines like:: + + HugePages_Total: uuu + HugePages_Free: vvv + HugePages_Rsvd: www + HugePages_Surp: xxx + Hugepagesize: yyy kB + Hugetlb: zzz kB + +where: + +HugePages_Total + is the size of the pool of huge pages. +HugePages_Free + is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd + is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. Reserved huge pages + guarantee that an application will be able to allocate a + huge page from the pool of huge pages at fault time. +HugePages_Surp + is short for "surplus," and is the number of huge pages in + the pool above the value in ``/proc/sys/vm/nr_hugepages``. The + maximum number of surplus huge pages is controlled by + ``/proc/sys/vm/nr_overcommit_hugepages``. +Hugepagesize + is the default hugepage size (in Kb). +Hugetlb + is the total amount of memory (in kB), consumed by huge + pages of all sizes. + If huge pages of different sizes are in use, this number + will exceed HugePages_Total \* Hugepagesize. To get more + detailed information, please, refer to + ``/sys/kernel/mm/hugepages`` (described below). + + +``/proc/filesystems`` should also show a filesystem of type "hugetlbfs" +configured in the kernel. + +``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge +pages in the kernel's huge page pool. "Persistent" huge pages will be +returned to the huge page pool when freed by a task. A user with root +privileges can dynamically allocate more or free some persistent huge pages +by increasing or decreasing the value of ``nr_hugepages``. + +Pages that are used as huge pages are reserved inside the kernel and cannot +be used for other purposes. Huge pages cannot be swapped out under +memory pressure. + +Once a number of huge pages have been pre-allocated to the kernel huge page +pool, a user with appropriate privilege can use either the mmap system call +or shared memory system calls to use the huge pages. See the discussion of +:ref:`Using Huge Pages <using_huge_pages>`, below. + +The administrator can allocate persistent huge pages on the kernel boot +command line by specifying the "hugepages=N" parameter, where 'N' = the +number of huge pages requested. This is the most reliable method of +allocating huge pages as memory has not yet become fragmented. + +Some platforms support multiple huge page sizes. To allocate huge pages +of a specific size, one must precede the huge pages boot command parameters +with a huge page size selection parameter "hugepagesz=<size>". <size> must +be specified in bytes with optional scale suffix [kKmMgG]. The default huge +page size may be selected with the "default_hugepagesz=<size>" boot parameter. + +When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` +indicates the current number of pre-allocated huge pages of the default size. +Thus, one can use the following command to dynamically allocate/deallocate +default sized persistent huge pages:: + + echo 20 > /proc/sys/vm/nr_hugepages + +This command will try to adjust the number of default sized huge pages in the +huge page pool to 20, allocating or freeing huge pages, as required. + +On a NUMA platform, the kernel will attempt to distribute the huge page pool +over all the set of allowed nodes specified by the NUMA memory policy of the +task that modifies ``nr_hugepages``. The default for the allowed nodes--when the +task has default memory policy--is all on-line nodes with memory. Allowed +nodes with insufficient available, contiguous memory for a huge page will be +silently skipped when allocating persistent huge pages. See the +:ref:`discussion below <mem_policy_and_hp_alloc>` +of the interaction of task memory policy, cpusets and per node attributes +with the allocation and freeing of persistent huge pages. + +The success or failure of huge page allocation depends on the amount of +physically contiguous memory that is present in system at the time of the +allocation attempt. If the kernel is unable to allocate huge pages from +some nodes in a NUMA system, it will attempt to make up the difference by +allocating extra pages on other nodes with sufficient available contiguous +memory, if any. + +System administrators may want to put this command in one of the local rc +init files. This will enable the kernel to allocate huge pages early in +the boot process when the possibility of getting physical contiguous pages +is still very high. Administrators can verify the number of huge pages +actually allocated by checking the sysctl or meminfo. To check the per node +distribution of huge pages in a NUMA system, use:: + + cat /sys/devices/system/node/node*/meminfo | fgrep Huge + +``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of +huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are +requested by applications. Writing any non-zero value into this file +indicates that the hugetlb subsystem is allowed to try to obtain that +number of "surplus" huge pages from the kernel's normal page pool, when the +persistent huge page pool is exhausted. As these surplus huge pages become +unused, they are freed back to the kernel's normal page pool. + +When increasing the huge page pool size via ``nr_hugepages``, any existing +surplus pages will first be promoted to persistent huge pages. Then, additional +huge pages will be allocated, if necessary and if possible, to fulfill +the new persistent huge page pool size. + +The administrator may shrink the pool of persistent huge pages for +the default huge page size by setting the ``nr_hugepages`` sysctl to a +smaller value. The kernel will attempt to balance the freeing of huge pages +across all nodes in the memory policy of the task modifying ``nr_hugepages``. +Any free huge pages on the selected nodes will be freed back to the kernel's +normal page pool. + +Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that +it becomes less than the number of huge pages in use will convert the balance +of the in-use huge pages to surplus huge pages. This will occur even if +the number of surplus pages would exceed the overcommit value. As long as +this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is +increased sufficiently, or the surplus huge pages go out of use and are freed-- +no more surplus huge pages will be allowed to be allocated. + +With support for multiple huge page pools at run-time available, much of +the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in +sysfs. +The ``/proc`` interfaces discussed above have been retained for backwards +compatibility. The root huge page control directory in sysfs is:: + + /sys/kernel/mm/hugepages + +For each huge page size supported by the running kernel, a subdirectory +will exist, of the form:: + + hugepages-${size}kB + +Inside each of these directories, the same set of files will exist:: + + nr_hugepages + nr_hugepages_mempolicy + nr_overcommit_hugepages + free_hugepages + resv_hugepages + surplus_hugepages + +which function as described above for the default huge page-sized case. + +.. _mem_policy_and_hp_alloc: + +Interaction of Task Memory Policy with Huge Page Allocation/Freeing +=================================================================== + +Whether huge pages are allocated and freed via the ``/proc`` interface or +the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the +NUMA nodes from which huge pages are allocated or freed are controlled by the +NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy`` +sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy +is ignored. + +The recommended method to allocate or free huge pages to/from the kernel +huge page pool, using the ``nr_hugepages`` example above, is:: + + numactl --interleave <node-list> echo 20 \ + >/proc/sys/vm/nr_hugepages_mempolicy + +or, more succinctly:: + + numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy + +This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes +specified in <node-list>, depending on whether number of persistent huge pages +is initially less than or greater than 20, respectively. No huge pages will be +allocated nor freed on any node not included in the specified <node-list>. + +When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any +memory policy mode--bind, preferred, local or interleave--may be used. The +resulting effect on persistent huge page allocation is as follows: + +#. Regardless of mempolicy mode [see + :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`], + persistent huge pages will be distributed across the node or nodes + specified in the mempolicy as if "interleave" had been specified. + However, if a node in the policy does not contain sufficient contiguous + memory for a huge page, the allocation will not "fallback" to the nearest + neighbor node with sufficient contiguous memory. To do this would cause + undesirable imbalance in the distribution of the huge page pool, or + possibly, allocation of persistent huge pages on nodes not allowed by + the task's memory policy. + +#. One or more nodes may be specified with the bind or interleave policy. + If more than one node is specified with the preferred policy, only the + lowest numeric id will be used. Local policy will select the node where + the task is running at the time the nodes_allowed mask is constructed. + For local policy to be deterministic, the task must be bound to a cpu or + cpus in a single node. Otherwise, the task could be migrated to some + other node at any time after launch and the resulting node will be + indeterminate. Thus, local policy is not very useful for this purpose. + Any of the other mempolicy modes may be used to specify a single node. + +#. The nodes allowed mask will be derived from any non-default task mempolicy, + whether this policy was set explicitly by the task itself or one of its + ancestors, such as numactl. This means that if the task is invoked from a + shell with non-default policy, that policy will be used. One can specify a + node list of "all" with numactl --interleave or --membind [-m] to achieve + interleaving over all nodes in the system or cpuset. + +#. Any task mempolicy specified--e.g., using numactl--will be constrained by + the resource limits of any cpuset in which the task runs. Thus, there will + be no way for a task with non-default policy running in a cpuset with a + subset of the system nodes to allocate huge pages outside the cpuset + without first moving to a cpuset that contains all of the desired nodes. + +#. Boot-time huge page allocation attempts to distribute the requested number + of huge pages over all on-lines nodes with memory. + +Per Node Hugepages Attributes +============================= + +A subset of the contents of the root huge page control directory in sysfs, +described above, will be replicated under each the system device of each +NUMA node with memory in:: + + /sys/devices/system/node/node[0-9]*/hugepages/ + +Under this directory, the subdirectory for each supported huge page size +contains the following attribute files:: + + nr_hugepages + free_hugepages + surplus_hugepages + +The free\_' and surplus\_' attribute files are read-only. They return the number +of free and surplus [overcommitted] huge pages, respectively, on the parent +node. + +The ``nr_hugepages`` attribute returns the total number of huge pages on the +specified node. When this attribute is written, the number of persistent huge +pages on the parent node will be adjusted to the specified value, if sufficient +resources exist, regardless of the task's mempolicy or cpuset constraints. + +Note that the number of overcommit and reserve pages remain global quantities, +as we don't know until fault time, when the faulting task's mempolicy is +applied, from which node the huge page allocation will be attempted. + +.. _using_huge_pages: + +Using Huge Pages +================ + +If the user applications are going to request huge pages using mmap system +call, then it is required that system administrator mount a file system of +type hugetlbfs:: + + mount -t hugetlbfs \ + -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\ + min_size=<value>,nr_inodes=<value> none /mnt/huge + +This command mounts a (pseudo) filesystem of type hugetlbfs on the directory +``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages. + +The ``uid`` and ``gid`` options sets the owner and group of the root of the +file system. By default the ``uid`` and ``gid`` of the current process +are taken. + +The ``mode`` option sets the mode of root of file system to value & 01777. +This value is given in octal. By default the value 0755 is picked. + +If the platform supports multiple huge page sizes, the ``pagesize`` option can +be used to specify the huge page size and associated pool. ``pagesize`` +is specified in bytes. If ``pagesize`` is not specified the platform's +default huge page size and associated pool will be used. + +The ``size`` option sets the maximum value of memory (huge pages) allowed +for that filesystem (``/mnt/huge``). The ``size`` option can be specified +in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``). +The size is rounded down to HPAGE_SIZE boundary. + +The ``min_size`` option sets the minimum value of memory (huge pages) allowed +for the filesystem. ``min_size`` can be specified in the same way as ``size``, +either bytes or a percentage of the huge page pool. +At mount time, the number of huge pages specified by ``min_size`` are reserved +for use by the filesystem. +If there are not enough free huge pages available, the mount will fail. +As huge pages are allocated to the filesystem and freed, the reserve count +is adjusted so that the sum of allocated and reserved huge pages is always +at least ``min_size``. + +The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge`` +can use. + +If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on +command line then no limits are set. + +For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can +use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. +For example, size=2K has the same meaning as size=2048. + +While read system calls are supported on files that reside on hugetlb +file systems, write system calls are not. + +Regular chown, chgrp, and chmod commands (with right permissions) could be +used to change the file attributes on hugetlbfs. + +Also, it is important to note that no such mount command is required if +applications are going to use only shmat/shmget system calls or mmap with +MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see +:ref:`map_hugetlb <map_hugetlb>` below. + +Users who wish to use hugetlb memory via shared memory segment should be +members of a supplementary group and system admin needs to configure that gid +into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different +applications to use any combination of mmaps and shm* calls, though the mount of +filesystem will be required for using mmap calls without MAP_HUGETLB. + +Syscalls that operate on memory backed by hugetlb pages only have their lengths +aligned to the native page size of the processor; they will normally fail with +errno set to EINVAL or exclude hugetlb pages that extend beyond the length if +not hugepage aligned. For example, munmap(2) will fail if memory is backed by +a hugetlb page and the length is smaller than the hugepage size. + + +Examples +======== + +.. _map_hugetlb: + +``map_hugetlb`` + see tools/testing/selftests/vm/map_hugetlb.c + +``hugepage-shm`` + see tools/testing/selftests/vm/hugepage-shm.c + +``hugepage-mmap`` + see tools/testing/selftests/vm/hugepage-mmap.c + +The `libhugetlbfs`_ library provides a wide range of userspace tools +to help with huge page usability, environment setup, and control. + +.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst new file mode 100644 index 000000000..df9394fb3 --- /dev/null +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -0,0 +1,121 @@ +.. _idle_page_tracking: + +================== +Idle Page Tracking +================== + +Motivation +========== + +The idle page tracking feature allows to track which memory pages are being +accessed by a workload and which are idle. This information can be useful for +estimating the workload's working set size, which, in turn, can be taken into +account when configuring the workload parameters, setting memory cgroup limits, +or deciding where to place the workload within a compute cluster. + +It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. + +.. _user_api: + +User API +======== + +The idle page tracking API is located at ``/sys/kernel/mm/page_idle``. +Currently, it consists of the only read-write file, +``/sys/kernel/mm/page_idle/bitmap``. + +The file implements a bitmap where each bit corresponds to a memory page. The +bitmap is represented by an array of 8-byte integers, and the page at PFN #i is +mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is +set, the corresponding page is idle. + +A page is considered idle if it has not been accessed since it was marked idle +(for more details on what "accessed" actually means see the :ref:`Implementation +Details <impl_details>` section). +To mark a page idle one has to set the bit corresponding to +the page by writing to the file. A value written to the file is OR-ed with the +current bitmap value. + +Only accesses to user memory pages are tracked. These are pages mapped to a +process address space, page cache and buffer pages, swap cache pages. For other +page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, +and hence such pages are never reported idle. + +For huge pages the idle flag is set only on the head page, so one has to read +``/proc/kpageflags`` in order to correctly count idle huge pages. + +Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return +-EINVAL if you are not starting the read/write on an 8-byte boundary, or +if the size of the read/write is not a multiple of 8 bytes. Writing to +this file beyond max PFN will return -ENXIO. + +That said, in order to estimate the amount of pages that are not used by a +workload one should: + + 1. Mark all the workload's pages as idle by setting corresponding bits in + ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading + ``/proc/pid/pagemap`` if the workload is represented by a process, or by + filtering out alien pages using ``/proc/kpagecgroup`` in case the workload + is placed in a memory cgroup. + + 2. Wait until the workload accesses its working set. + + 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set. + If one wants to ignore certain types of pages, e.g. mlocked pages since they + are not reclaimable, he or she can filter them out using + ``/proc/kpageflags``. + +The page-types tool in the tools/vm directory can be used to assist in this. +If the tool is run initially with the appropriate option, it will mark all the +queried pages as idle. Subsequent runs of the tool can then show which pages have +their idle flag cleared in the interim. + +See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more +information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and +``/proc/kpagecgroup``. + +.. _impl_details: + +Implementation Details +====================== + +The kernel internally keeps track of accesses to user memory pages in order to +reclaim unreferenced pages first on memory shortage conditions. A page is +considered referenced if it has been recently accessed via a process address +space, in which case one or more PTEs it is mapped to will have the Accessed bit +set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The +latter happens when: + + - a userspace process reads or writes a page using a system call (e.g. read(2) + or write(2)) + + - a page that is used for storing filesystem buffers is read or written, + because a process needs filesystem metadata stored in it (e.g. lists a + directory tree) + + - a page is accessed by a device driver using get_user_pages() + +When a dirty page is written to swap or disk as a result of memory reclaim or +exceeding the dirty memory limit, it is not marked referenced. + +The idle memory tracking feature adds a new page flag, the Idle flag. This flag +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the +:ref:`User API <user_api>` +section), and cleared automatically whenever a page is referenced as defined +above. + +When a page is marked idle, the Accessed bit must be cleared in all PTEs it is +mapped to, otherwise we will not be able to detect accesses to the page coming +from a process address space. To avoid interference with the reclaimer, which, +as noted above, uses the Accessed bit to promote actively referenced pages, one +more page flag is introduced, the Young flag. When the PTE Accessed bit is +cleared as a result of setting or updating a page's Idle flag, the Young flag +is set on the page. The reclaimer treats the Young flag as an extra PTE +Accessed bit and therefore will consider such a page as referenced. + +Since the idle memory tracking feature is based on the memory reclaimer logic, +it only works with pages that are on an LRU list, other pages are silently +ignored. That means it will ignore a user memory page if it is isolated, but +since there are usually not many of them, it should not affect the overall +result noticeably. In order not to stall scanning of the idle page bitmap, +locked pages may be skipped too. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst new file mode 100644 index 000000000..ceead68c2 --- /dev/null +++ b/Documentation/admin-guide/mm/index.rst @@ -0,0 +1,36 @@ +================= +Memory Management +================= + +Linux memory management subsystem is responsible, as the name implies, +for managing the memory in the system. This includes implemnetation of +virtual memory and demand paging, memory allocation both for kernel +internal structures and user space programms, mapping of files into +processes address space and many other cool things. + +Linux memory management is a complex system with many configurable +settings. Most of these settings are available via ``/proc`` +filesystem and can be quired and adjusted using ``sysctl``. These APIs +are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. + +.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html + +Linux memory management has its own jargon and if you are not yet +familiar with it, consider reading +:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`. + +Here we document in detail how to interact with various mechanisms in +the Linux memory management. + +.. toctree:: + :maxdepth: 1 + + concepts + hugetlbpage + idle_page_tracking + ksm + numa_memory_policy + pagemap + soft-dirty + transhuge + userfaultfd diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst new file mode 100644 index 000000000..930378663 --- /dev/null +++ b/Documentation/admin-guide/mm/ksm.rst @@ -0,0 +1,189 @@ +.. _admin_guide_ksm: + +======================= +Kernel Samepage Merging +======================= + +Overview +======== + +KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, +added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation, +and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ + +KSM was originally developed for use with KVM (where it was known as +Kernel Shared Memory), to fit more virtual machines into physical memory, +by sharing the data common between them. But it can be useful to any +application which generates many instances of the same data. + +The KSM daemon ksmd periodically scans those areas of user memory +which have been registered with it, looking for pages of identical +content which can be replaced by a single write-protected page (which +is automatically copied if a process later wants to update its +content). The amount of pages that KSM daemon scans in a single pass +and the time between the passes are configured using :ref:`sysfs +intraface <ksm_sysfs>` + +KSM only merges anonymous (private) pages, never pagecache (file) pages. +KSM's merged pages were originally locked into kernel memory, but can now +be swapped out just like other user pages (but sharing is broken when they +are swapped back in: ksmd must rediscover their identity and merge again). + +Controlling KSM with madvise +============================ + +KSM only operates on those areas of address space which an application +has advised to be likely candidates for merging, by using the madvise(2) +system call:: + + int madvise(addr, length, MADV_MERGEABLE) + +The app may call + +:: + + int madvise(addr, length, MADV_UNMERGEABLE) + +to cancel that advice and restore unshared pages: whereupon KSM +unmerges whatever it merged in that range. Note: this unmerging call +may suddenly require more memory than is available - possibly failing +with EAGAIN, but more probably arousing the Out-Of-Memory killer. + +If KSM is not configured into the running kernel, madvise MADV_MERGEABLE +and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was +built with CONFIG_KSM=y, those calls will normally succeed: even if the +the KSM daemon is not currently running, MADV_MERGEABLE still registers +the range for whenever the KSM daemon is started; even if the range +cannot contain any pages which KSM could actually merge; even if +MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. + +If a region of memory must be split into at least one new MADV_MERGEABLE +or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process +will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt). + +Like other madvise calls, they are intended for use on mapped areas of +the user address space: they will report ENOMEM if the specified range +includes unmapped gaps (though working on the intervening mapped areas), +and might fail with EAGAIN if not enough memory for internal structures. + +Applications should be considerate in their use of MADV_MERGEABLE, +restricting its use to areas likely to benefit. KSM's scans may use a lot +of processing power: some installations will disable KSM for that reason. + +.. _ksm_sysfs: + +KSM daemon sysfs interface +========================== + +The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``, +readable by all but writable only by root: + +pages_to_scan + how many pages to scan before ksmd goes to sleep + e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``. + + Default: 100 (chosen for demonstration purposes) + +sleep_millisecs + how many milliseconds ksmd should sleep before next scan + e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs`` + + Default: 20 (chosen for demonstration purposes) + +merge_across_nodes + specifies if pages from different NUMA nodes can be merged. + When set to 0, ksm merges only pages which physically reside + in the memory area of same NUMA node. That brings lower + latency to access of shared pages. Systems with more nodes, at + significant NUMA distances, are likely to benefit from the + lower latency of setting 0. Smaller systems, which need to + minimize memory usage, are likely to benefit from the greater + sharing of setting 1 (default). You may wish to compare how + your system performs under each setting, before deciding on + which to use. ``merge_across_nodes`` setting can be changed only + when there are no ksm shared pages in the system: set run 2 to + unmerge pages first, then to 1 after changing + ``merge_across_nodes``, to remerge according to the new setting. + + Default: 1 (merging across nodes as in earlier releases) + +run + * set to 0 to stop ksmd from running but keep merged pages, + * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``, + * set to 2 to stop ksmd and unmerge all pages currently merged, but + leave mergeable areas registered for next run. + + Default: 0 (must be changed to 1 to activate KSM, except if + CONFIG_SYSFS is disabled) + +use_zero_pages + specifies whether empty pages (i.e. allocated pages that only + contain zeroes) should be treated specially. When set to 1, + empty pages are merged with the kernel zero page(s) instead of + with each other as it would happen normally. This can improve + the performance on architectures with coloured zero pages, + depending on the workload. Care should be taken when enabling + this setting, as it can potentially degrade the performance of + KSM for some workloads, for example if the checksums of pages + candidate for merging match the checksum of an empty + page. This setting can be changed at any time, it is only + effective for pages merged after the change. + + Default: 0 (normal KSM behaviour as in earlier releases) + +max_page_sharing + Maximum sharing allowed for each KSM page. This enforces a + deduplication limit to avoid high latency for virtual memory + operations that involve traversal of the virtual mappings that + share the KSM page. The minimum value is 2 as a newly created + KSM page will have at least two sharers. The higher this value + the faster KSM will merge the memory and the higher the + deduplication factor will be, but the slower the worst case + virtual mappings traversal could be for any given KSM + page. Slowing down this traversal means there will be higher + latency for certain virtual memory operations happening during + swapping, compaction, NUMA balancing and page migration, in + turn decreasing responsiveness for the caller of those virtual + memory operations. The scheduler latency of other tasks not + involved with the VM operations doing the virtual mappings + traversal is not affected by this parameter as these + traversals are always schedule friendly themselves. + +stable_node_chains_prune_millisecs + specifies how frequently KSM checks the metadata of the pages + that hit the deduplication limit for stale information. + Smaller milllisecs values will free up the KSM metadata with + lower latency, but they will make ksmd use more CPU during the + scan. It's a noop if not a single KSM page hit the + ``max_page_sharing`` yet. + +The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: + +pages_shared + how many shared pages are being used +pages_sharing + how many more sites are sharing them i.e. how much saved +pages_unshared + how many pages unique but repeatedly checked for merging +pages_volatile + how many pages changing too fast to be placed in a tree +full_scans + how many times all mergeable areas have been scanned +stable_node_chains + the number of KSM pages that hit the ``max_page_sharing`` limit +stable_node_dups + number of duplicated KSM pages + +A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good +sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing`` +indicates wasted effort. ``pages_volatile`` embraces several +different kinds of activity, but a high proportion there would also +indicate poor use of madvise MADV_MERGEABLE. + +The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the +``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must +be increased accordingly. + +-- +Izik Eidus, +Hugh Dickins, 17 Nov 2009 diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst new file mode 100644 index 000000000..d78c5b315 --- /dev/null +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -0,0 +1,495 @@ +.. _numa_memory_policy: + +================== +NUMA Memory Policy +================== + +What is NUMA Memory Policy? +============================ + +In the Linux kernel, "memory policy" determines from which node the kernel will +allocate memory in a NUMA system or in an emulated NUMA system. Linux has +supported platforms with Non-Uniform Memory Access architectures since 2.4.?. +The current memory policy support was added to Linux 2.6 around May 2004. This +document attempts to describe the concepts and APIs of the 2.6 memory policy +support. + +Memory policies should not be confused with cpusets +(``Documentation/cgroup-v1/cpusets.txt``) +which is an administrative mechanism for restricting the nodes from which +memory may be allocated by a set of processes. Memory policies are a +programming interface that a NUMA-aware application can take advantage of. When +both cpusets and policies are applied to a task, the restrictions of the cpuset +takes priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>` +below for more details. + +Memory Policy Concepts +====================== + +Scope of Memory Policies +------------------------ + +The Linux kernel supports _scopes_ of memory policy, described here from +most general to most specific: + +System Default Policy + this policy is "hard coded" into the kernel. It is the policy + that governs all page allocations that aren't controlled by + one of the more specific policy scopes discussed below. When + the system is "up and running", the system default policy will + use "local allocation" described below. However, during boot + up, the system default policy will be set to interleave + allocations across all nodes with "sufficient" memory, so as + not to overload the initial boot node with boot-time + allocations. + +Task/Process Policy + this is an optional, per-task policy. When defined for a + specific task, this policy controls all page allocations made + by or on behalf of the task that aren't controlled by a more + specific scope. If a task does not define a task policy, then + all page allocations that would have been controlled by the + task policy "fall back" to the System Default Policy. + + The task policy applies to the entire address space of a task. Thus, + it is inheritable, and indeed is inherited, across both fork() + [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task + to establish the task policy for a child task exec()'d from an + executable image that has no awareness of memory policy. See the + :ref:`Memory Policy APIs <memory_policy_apis>` section, + below, for an overview of the system call + that a task may use to set/change its task/process policy. + + In a multi-threaded task, task policies apply only to the thread + [Linux kernel task] that installs the policy and any threads + subsequently created by that thread. Any sibling threads existing + at the time a new task policy is installed retain their current + policy. + + A task policy applies only to pages allocated after the policy is + installed. Any pages already faulted in by the task when the task + changes its task policy remain where they were allocated based on + the policy at the time they were allocated. + +.. _vma_policy: + +VMA Policy + A "VMA" or "Virtual Memory Area" refers to a range of a task's + virtual address space. A task may define a specific policy for a range + of its virtual address space. See the + :ref:`Memory Policy APIs <memory_policy_apis>` section, + below, for an overview of the mbind() system call used to set a VMA + policy. + + A VMA policy will govern the allocation of pages that back + this region of the address space. Any regions of the task's + address space that don't have an explicit VMA policy will fall + back to the task policy, which may itself fall back to the + System Default Policy. + + VMA policies have a few complicating details: + + * VMA policy applies ONLY to anonymous pages. These include + pages allocated for anonymous segments, such as the task + stack and heap, and any regions of the address space + mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is + applied to a file mapping, it will be ignored if the mapping + used the MAP_SHARED flag. If the file mapping used the + MAP_PRIVATE flag, the VMA policy will only be applied when + an anonymous page is allocated on an attempt to write to the + mapping-- i.e., at Copy-On-Write. + + * VMA policies are shared between all tasks that share a + virtual address space--a.k.a. threads--independent of when + the policy is installed; and they are inherited across + fork(). However, because VMA policies refer to a specific + region of a task's address space, and because the address + space is discarded and recreated on exec*(), VMA policies + are NOT inheritable across exec(). Thus, only NUMA-aware + applications may use VMA policies. + + * A task may install a new VMA policy on a sub-range of a + previously mmap()ed region. When this happens, Linux splits + the existing virtual memory area into 2 or 3 VMAs, each with + it's own policy. + + * By default, VMA policy applies only to pages allocated after + the policy is installed. Any pages already faulted into the + VMA range remain where they were allocated based on the + policy at the time they were allocated. However, since + 2.6.16, Linux supports page migration via the mbind() system + call, so that page contents can be moved to match a newly + installed policy. + +Shared Policy + Conceptually, shared policies apply to "memory objects" mapped + shared into one or more tasks' distinct address spaces. An + application installs shared policies the same way as VMA + policies--using the mbind() system call specifying a range of + virtual addresses that map the shared object. However, unlike + VMA policies, which can be considered to be an attribute of a + range of a task's address space, shared policies apply + directly to the shared object. Thus, all tasks that attach to + the object share the policy, and all pages allocated for the + shared object, by any task, will obey the shared policy. + + As of 2.6.22, only shared memory segments, created by shmget() or + mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared + policy support was added to Linux, the associated data structures were + added to hugetlbfs shmem segments. At the time, hugetlbfs did not + support allocation at fault time--a.k.a lazy allocation--so hugetlbfs + shmem segments were never "hooked up" to the shared policy support. + Although hugetlbfs segments now support lazy allocation, their support + for shared policy has not been completed. + + As mentioned above in :ref:`VMA policies <vma_policy>` section, + allocations of page cache pages for regular files mmap()ed + with MAP_SHARED ignore any VMA policy installed on the virtual + address range backed by the shared file mapping. Rather, + shared page cache pages, including pages backing private + mappings that have not yet been written by the task, follow + task policy, if any, else System Default Policy. + + The shared policy infrastructure supports different policies on subset + ranges of the shared object. However, Linux still splits the VMA of + the task that installs the policy for each range of distinct policy. + Thus, different tasks that attach to a shared memory segment can have + different VMA configurations mapping that one shared object. This + can be seen by examining the /proc/<pid>/numa_maps of tasks sharing + a shared memory region, when one task has installed shared policy on + one or more ranges of the region. + +Components of Memory Policies +----------------------------- + +A NUMA memory policy consists of a "mode", optional mode flags, and +an optional set of nodes. The mode determines the behavior of the +policy, the optional mode flags determine the behavior of the mode, +and the optional set of nodes can be viewed as the arguments to the +policy behavior. + +Internally, memory policies are implemented by a reference counted +structure, struct mempolicy. Details of this structure will be +discussed in context, below, as required to explain the behavior. + +NUMA memory policy supports the following 4 behavioral modes: + +Default Mode--MPOL_DEFAULT + This mode is only used in the memory policy APIs. Internally, + MPOL_DEFAULT is converted to the NULL memory policy in all + policy scopes. Any existing non-default policy will simply be + removed when MPOL_DEFAULT is specified. As a result, + MPOL_DEFAULT means "fall back to the next most specific policy + scope." + + For example, a NULL or default task policy will fall back to the + system default policy. A NULL or default vma policy will fall + back to the task policy. + + When specified in one of the memory policy APIs, the Default mode + does not use the optional set of nodes. + + It is an error for the set of nodes specified for this policy to + be non-empty. + +MPOL_BIND + This mode specifies that memory must come from the set of + nodes specified by the policy. Memory will be allocated from + the node in the set with sufficient free memory that is + closest to the node where the allocation takes place. + +MPOL_PREFERRED + This mode specifies that the allocation should be attempted + from the single node specified in the policy. If that + allocation fails, the kernel will search other nodes, in order + of increasing distance from the preferred node based on + information provided by the platform firmware. + + Internally, the Preferred policy uses a single node--the + preferred_node member of struct mempolicy. When the internal + mode flag MPOL_F_LOCAL is set, the preferred_node is ignored + and the policy is interpreted as local allocation. "Local" + allocation policy can be viewed as a Preferred policy that + starts at the node containing the cpu where the allocation + takes place. + + It is possible for the user to specify that local allocation + is always preferred by passing an empty nodemask with this + mode. If an empty nodemask is passed, the policy cannot use + the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags + described below. + +MPOL_INTERLEAVED + This mode specifies that page allocations be interleaved, on a + page granularity, across the nodes specified in the policy. + This mode also behaves slightly differently, based on the + context where it is used: + + For allocation of anonymous pages and shared memory pages, + Interleave mode indexes the set of nodes specified by the + policy using the page offset of the faulting address into the + segment [VMA] containing the address modulo the number of + nodes specified by the policy. It then attempts to allocate a + page, starting at the selected node, as if the node had been + specified by a Preferred policy or had been selected by a + local allocation. That is, allocation will follow the per + node zonelist. + + For allocation of page cache pages, Interleave mode indexes + the set of nodes specified by the policy using a node counter + maintained per task. This counter wraps around to the lowest + specified node after it reaches the highest specified node. + This will tend to spread the pages out over the nodes + specified by the policy based on the order in which they are + allocated, rather than based on any page offset into an + address range or file. During system boot up, the temporary + interleaved system default policy works in this mode. + +NUMA memory policy supports the following optional mode flags: + +MPOL_F_STATIC_NODES + This flag specifies that the nodemask passed by + the user should not be remapped if the task or VMA's set of allowed + nodes changes after the memory policy has been defined. + + Without this flag, any time a mempolicy is rebound because of a + change in the set of allowed nodes, the node (Preferred) or + nodemask (Bind, Interleave) is remapped to the new set of + allowed nodes. This may result in nodes being used that were + previously undesired. + + With this flag, if the user-specified nodes overlap with the + nodes allowed by the task's cpuset, then the memory policy is + applied to their intersection. If the two sets of nodes do not + overlap, the Default policy is used. + + For example, consider a task that is attached to a cpuset with + mems 1-3 that sets an Interleave policy over the same set. If + the cpuset's mems change to 3-5, the Interleave will now occur + over nodes 3, 4, and 5. With this flag, however, since only node + 3 is allowed from the user's nodemask, the "interleave" only + occurs over that node. If no nodes from the user's nodemask are + now allowed, the Default behavior is used. + + MPOL_F_STATIC_NODES cannot be combined with the + MPOL_F_RELATIVE_NODES flag. It also cannot be used for + MPOL_PREFERRED policies that were created with an empty nodemask + (local allocation). + +MPOL_F_RELATIVE_NODES + This flag specifies that the nodemask passed + by the user will be mapped relative to the set of the task or VMA's + set of allowed nodes. The kernel stores the user-passed nodemask, + and if the allowed nodes changes, then that original nodemask will + be remapped relative to the new set of allowed nodes. + + Without this flag (and without MPOL_F_STATIC_NODES), anytime a + mempolicy is rebound because of a change in the set of allowed + nodes, the node (Preferred) or nodemask (Bind, Interleave) is + remapped to the new set of allowed nodes. That remap may not + preserve the relative nature of the user's passed nodemask to its + set of allowed nodes upon successive rebinds: a nodemask of + 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of + allowed nodes is restored to its original state. + + With this flag, the remap is done so that the node numbers from + the user's passed nodemask are relative to the set of allowed + nodes. In other words, if nodes 0, 2, and 4 are set in the user's + nodemask, the policy will be effected over the first (and in the + Bind or Interleave case, the third and fifth) nodes in the set of + allowed nodes. The nodemask passed by the user represents nodes + relative to task or VMA's set of allowed nodes. + + If the user's nodemask includes nodes that are outside the range + of the new set of allowed nodes (for example, node 5 is set in + the user's nodemask when the set of allowed nodes is only 0-3), + then the remap wraps around to the beginning of the nodemask and, + if not already set, sets the node in the mempolicy nodemask. + + For example, consider a task that is attached to a cpuset with + mems 2-5 that sets an Interleave policy over the same set with + MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the + interleave now occurs over nodes 3,5-7. If the cpuset's mems + then change to 0,2-3,5, then the interleave occurs over nodes + 0,2-3,5. + + Thanks to the consistent remapping, applications preparing + nodemasks to specify memory policies using this flag should + disregard their current, actual cpuset imposed memory placement + and prepare the nodemask as if they were always located on + memory nodes 0 to N-1, where N is the number of memory nodes the + policy is intended to manage. Let the kernel then remap to the + set of memory nodes allowed by the task's cpuset, as that may + change over time. + + MPOL_F_RELATIVE_NODES cannot be combined with the + MPOL_F_STATIC_NODES flag. It also cannot be used for + MPOL_PREFERRED policies that were created with an empty nodemask + (local allocation). + +Memory Policy Reference Counting +================================ + +To resolve use/free races, struct mempolicy contains an atomic reference +count field. Internal interfaces, mpol_get()/mpol_put() increment and +decrement this reference count, respectively. mpol_put() will only free +the structure back to the mempolicy kmem cache when the reference count +goes to zero. + +When a new memory policy is allocated, its reference count is initialized +to '1', representing the reference held by the task that is installing the +new policy. When a pointer to a memory policy structure is stored in another +structure, another reference is added, as the task's reference will be dropped +on completion of the policy installation. + +During run-time "usage" of the policy, we attempt to minimize atomic operations +on the reference count, as this can lead to cache lines bouncing between cpus +and NUMA nodes. "Usage" here means one of the following: + +1) querying of the policy, either by the task itself [using the get_mempolicy() + API discussed below] or by another task using the /proc/<pid>/numa_maps + interface. + +2) examination of the policy to determine the policy mode and associated node + or node lists, if any, for page allocation. This is considered a "hot + path". Note that for MPOL_BIND, the "usage" extends across the entire + allocation process, which may sleep during page reclaimation, because the + BIND policy nodemask is used, by reference, to filter ineligible nodes. + +We can avoid taking an extra reference during the usages listed above as +follows: + +1) we never need to get/free the system default policy as this is never + changed nor freed, once the system is up and running. + +2) for querying the policy, we do not need to take an extra reference on the + target task's task policy nor vma policies because we always acquire the + task's mm's mmap_sem for read during the query. The set_mempolicy() and + mbind() APIs [see below] always acquire the mmap_sem for write when + installing or replacing task or vma policies. Thus, there is no possibility + of a task or thread freeing a policy while another task or thread is + querying it. + +3) Page allocation usage of task or vma policy occurs in the fault path where + we hold them mmap_sem for read. Again, because replacing the task or vma + policy requires that the mmap_sem be held for write, the policy can't be + freed out from under us while we're using it for page allocation. + +4) Shared policies require special consideration. One task can replace a + shared memory policy while another task, with a distinct mmap_sem, is + querying or allocating a page based on the policy. To resolve this + potential race, the shared policy infrastructure adds an extra reference + to the shared policy during lookup while holding a spin lock on the shared + policy management structure. This requires that we drop this extra + reference when we're finished "using" the policy. We must drop the + extra reference on shared policies in the same query/allocation paths + used for non-shared policies. For this reason, shared policies are marked + as such, and the extra reference is dropped "conditionally"--i.e., only + for shared policies. + + Because of this extra reference counting, and because we must lookup + shared policies in a tree structure under spinlock, shared policies are + more expensive to use in the page allocation path. This is especially + true for shared policies on shared memory regions shared by tasks running + on different NUMA nodes. This extra overhead can be avoided by always + falling back to task or system default policy for shared memory regions, + or by prefaulting the entire shared memory region into memory and locking + it down. However, this might not be appropriate for all applications. + +.. _memory_policy_apis: + +Memory Policy APIs +================== + +Linux supports 3 system calls for controlling memory policy. These APIS +always affect only the calling task, the calling task's address space, or +some shared object mapped into the calling task's address space. + +.. note:: + the headers that define these APIs and the parameter data types for + user space applications reside in a package that is not part of the + Linux kernel. The kernel system call interfaces, with the 'sys\_' + prefix, are defined in <linux/syscalls.h>; the mode and flag + definitions are defined in <linux/mempolicy.h>. + +Set [Task] Memory Policy:: + + long set_mempolicy(int mode, const unsigned long *nmask, + unsigned long maxnode); + +Set's the calling task's "task/process memory policy" to mode +specified by the 'mode' argument and the set of nodes defined by +'nmask'. 'nmask' points to a bit mask of node ids containing at least +'maxnode' ids. Optional mode flags may be passed by combining the +'mode' argument with the flag (for example: MPOL_INTERLEAVE | +MPOL_F_STATIC_NODES). + +See the set_mempolicy(2) man page for more details + + +Get [Task] Memory Policy or Related Information:: + + long get_mempolicy(int *mode, + const unsigned long *nmask, unsigned long maxnode, + void *addr, int flags); + +Queries the "task/process memory policy" of the calling task, or the +policy or location of a specified virtual address, depending on the +'flags' argument. + +See the get_mempolicy(2) man page for more details + + +Install VMA/Shared Policy for a Range of Task's Address Space:: + + long mbind(void *start, unsigned long len, int mode, + const unsigned long *nmask, unsigned long maxnode, + unsigned flags); + +mbind() installs the policy specified by (mode, nmask, maxnodes) as a +VMA policy for the range of the calling task's address space specified +by the 'start' and 'len' arguments. Additional actions may be +requested via the 'flags' argument. + +See the mbind(2) man page for more details. + +Memory Policy Command Line Interface +==================================== + +Although not strictly part of the Linux implementation of memory policy, +a command line tool, numactl(8), exists that allows one to: + ++ set the task policy for a specified program via set_mempolicy(2), fork(2) and + exec(2) + ++ set the shared policy for a shared memory segment via mbind(2) + +The numactl(8) tool is packaged with the run-time version of the library +containing the memory policy system call wrappers. Some distributions +package the headers and compile-time libraries in a separate development +package. + +.. _mem_pol_and_cpusets: + +Memory Policies and cpusets +=========================== + +Memory policies work within cpusets as described above. For memory policies +that require a node or set of nodes, the nodes are restricted to the set of +nodes whose memories are allowed by the cpuset constraints. If the nodemask +specified for the policy contains nodes that are not allowed by the cpuset and +MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes +specified for the policy and the set of nodes with memory is used. If the +result is the empty set, the policy is considered invalid and cannot be +installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped +onto and folded into the task's set of allowed nodes as previously described. + +The interaction of memory policies and cpusets can be problematic when tasks +in two cpusets share access to a memory region, such as shared memory segments +created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and +any of the tasks install shared policy on the region, only nodes whose +memories are allowed in both cpusets may be used in the policies. Obtaining +this information requires "stepping outside" the memory policy APIs to use the +cpuset information and requires that one know in what cpusets other task might +be attaching to the shared region. Furthermore, if the cpusets' allowed +memory sets are disjoint, "local" allocation is the only valid policy. diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst new file mode 100644 index 000000000..3f7bade2c --- /dev/null +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -0,0 +1,204 @@ +.. _pagemap: + +============================= +Examining Process Page Tables +============================= + +pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow +userspace programs to examine the page tables and related information by +reading files in ``/proc``. + +There are four components to pagemap: + + * ``/proc/pid/pagemap``. This file lets a userspace process find out which + physical frame each virtual page is mapped to. It contains one 64-bit + value for each virtual page, containing the following data (from + ``fs/proc/task_mmu.c``, above pagemap_read): + + * Bits 0-54 page frame number (PFN) if present + * Bits 0-4 swap type if swapped + * Bits 5-54 swap offset if swapped + * Bit 55 pte is soft-dirty (see + :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`) + * Bit 56 page exclusively mapped (since 4.2) + * Bits 57-60 zero + * Bit 61 page is file-page or shared-anon (since 3.5) + * Bit 62 page swapped + * Bit 63 page present + + Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. + In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from + 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. + Reason: information about PFNs helps in exploiting Rowhammer vulnerability. + + If the page is not present but in swap, then the PFN contains an + encoding of the swap file number and the page's offset into the + swap. Unmapped pages return a null PFN. This allows determining + precisely which pages are mapped (or in swap) and comparing mapped + pages between processes. + + Efficient users of this interface will use ``/proc/pid/maps`` to + determine which areas of memory are actually mapped and llseek to + skip over unmapped regions. + + * ``/proc/kpagecount``. This file contains a 64-bit count of the number of + times each page is mapped, indexed by PFN. + +The page-types tool in the tools/vm directory can be used to query the +number of times a page is mapped. + + * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each + page, indexed by PFN. + + The flags are (from ``fs/proc/page.c``, above kpageflags_read): + + 0. LOCKED + 1. ERROR + 2. REFERENCED + 3. UPTODATE + 4. DIRTY + 5. LRU + 6. ACTIVE + 7. SLAB + 8. WRITEBACK + 9. RECLAIM + 10. BUDDY + 11. MMAP + 12. ANON + 13. SWAPCACHE + 14. SWAPBACKED + 15. COMPOUND_HEAD + 16. COMPOUND_TAIL + 17. HUGE + 18. UNEVICTABLE + 19. HWPOISON + 20. NOPAGE + 21. KSM + 22. THP + 23. BALLOON + 24. ZERO_PAGE + 25. IDLE + + * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the + memory cgroup each page is charged to, indexed by PFN. Only available when + CONFIG_MEMCG is set. + +Short descriptions to the page flags +==================================== + +0 - LOCKED + page is being locked for exclusive access, e.g. by undergoing read/write IO +7 - SLAB + page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + When compound page is used, SLUB/SLQB will only set this flag on the head + page; SLOB will not flag it at all. +10 - BUDDY + a free memory block managed by the buddy system allocator + The buddy system organizes free memory in blocks of various orders. + An order N block has 2^N physically contiguous pages, with the BUDDY flag + set for and _only_ for the first page. +15 - COMPOUND_HEAD + A compound page with order N consists of 2^N physically contiguous pages. + A compound page with order 2 takes the form of "HTTT", where H donates its + head page and T donates its tail page(s). The major consumers of compound + pages are hugeTLB pages + (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`), + the SLUB etc. memory allocators and various device drivers. + However in this interface, only huge/giga pages are made visible + to end users. +16 - COMPOUND_TAIL + A compound page tail (see description above). +17 - HUGE + this is an integral part of a HugeTLB page +19 - HWPOISON + hardware detected memory corruption on this page: don't touch the data! +20 - NOPAGE + no page frame exists at the requested address +21 - KSM + identical memory pages dynamically shared between one or more processes +22 - THP + contiguous pages which construct transparent hugepages +23 - BALLOON + balloon compaction page +24 - ZERO_PAGE + zero page for pfn_zero or huge_zero page +25 - IDLE + page has not been accessed since it was marked idle (see + :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`). + Note that this flag may be stale in case the page was accessed via + a PTE. To make sure the flag is up-to-date one has to read + ``/sys/kernel/mm/page_idle/bitmap`` first. + +IO related page flags +--------------------- + +1 - ERROR + IO error occurred +3 - UPTODATE + page has up-to-date data + ie. for file backed page: (in-memory data revision >= on-disk one) +4 - DIRTY + page has been written to, hence contains new data + i.e. for file backed page: (in-memory data revision > on-disk one) +8 - WRITEBACK + page is being synced to disk + +LRU related page flags +---------------------- + +5 - LRU + page is in one of the LRU lists +6 - ACTIVE + page is in the active LRU list +18 - UNEVICTABLE + page is in the unevictable (non-)LRU list It is somehow pinned and + not a candidate for LRU page reclaims, e.g. ramfs pages, + shmctl(SHM_LOCK) and mlock() memory segments +2 - REFERENCED + page has been referenced since last LRU list enqueue/requeue +9 - RECLAIM + page will be reclaimed soon after its pageout IO completed +11 - MMAP + a memory mapped page +12 - ANON + a memory mapped page that is not part of a file +13 - SWAPCACHE + page is mapped to swap space, i.e. has an associated swap entry +14 - SWAPBACKED + page is backed by swap/RAM + +The page-types tool in the tools/vm directory can be used to query the +above flags. + +Using pagemap to do something useful +==================================== + +The general procedure for using pagemap to find out about a process' memory +usage goes like this: + + 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are + mapped to what. + 2. Select the maps you are interested in -- all of them, or a particular + library, or the stack or the heap, etc. + 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. + 4. Read a u64 for each page from pagemap. + 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you + just read, seek to that entry in the file, and read the data you want. + +For example, to find the "unique set size" (USS), which is the amount of +memory that a process is using that is not shared with any other process, +you can go through every map in the process, find the PFNs, look those up +in kpagecount, and tally up the number of pages that are only referenced +once. + +Other notes +=========== + +Reading from any of the files will return -EINVAL if you are not starting +the read on an 8-byte boundary (e.g., if you sought an odd number of bytes +into the file), or if the size of the read is not a multiple of 8 bytes. + +Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is +always 12 at most architectures). Since Linux 3.11 their meaning changes +after first clear of soft-dirty bits. Since Linux 4.2 they are used for +flags unconditionally. diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst new file mode 100644 index 000000000..cb0cfd667 --- /dev/null +++ b/Documentation/admin-guide/mm/soft-dirty.rst @@ -0,0 +1,47 @@ +.. _soft_dirty: + +=============== +Soft-Dirty PTEs +=============== + +The soft-dirty is a bit on a PTE which helps to track which pages a task +writes to. In order to do this tracking one should + + 1. Clear soft-dirty bits from the task's PTEs. + + This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the + task in question. + + 2. Wait some time. + + 3. Read soft-dirty bits from the PTEs. + + This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the + 64-bit qword is the soft-dirty one. If set, the respective PTE was + written to since step 1. + + +Internally, to do this tracking, the writable bit is cleared from PTEs +when the soft-dirty bit is cleared. So, after this, when the task tries to +modify a page at some virtual address the #PF occurs and the kernel sets +the soft-dirty bit on the respective PTE. + +Note, that although all the task's address space is marked as r/o after the +soft-dirty bits clear, the #PF-s that occur after that are processed fast. +This is so, since the pages are still mapped to physical memory, and thus all +the kernel does is finds this fact out and puts both writable and soft-dirty +bits on the PTE. + +While in most cases tracking memory changes by #PF-s is more than enough +there is still a scenario when we can lose soft dirty bits -- a task +unmaps a previously mapped memory region and then maps a new one at exactly +the same place. When unmap is called, the kernel internally clears PTE values +including soft dirty bits. To notify user space application about such +memory region renewal the kernel always marks new memory regions (and +expanded regions) as soft dirty. + +This feature is actively used by the checkpoint-restore project. You +can find more details about it on http://criu.org + + +-- Pavel Emelyanov, Apr 9, 2013 diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst new file mode 100644 index 000000000..7ab93a840 --- /dev/null +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -0,0 +1,418 @@ +.. _admin_guide_transhuge: + +============================ +Transparent Hugepage Support +============================ + +Objective +========= + +Performance critical computing applications dealing with large memory +working sets are already running on top of libhugetlbfs and in turn +hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of +using huge pages for the backing of virtual memory with huge pages +that supports the automatic promotion and demotion of page sizes and +without the shortcomings of hugetlbfs. + +Currently THP only works for anonymous memory mappings and tmpfs/shmem. +But in the future it can expand to other filesystems. + +.. note:: + in the examples below we presume that the basic page size is 4K and + the huge page size is 2M, although the actual numbers may vary + depending on the CPU architecture. + +The reason applications are running faster is because of two +factors. The first factor is almost completely irrelevant and it's not +of significant interest because it'll also have the downside of +requiring larger clear-page copy-page in page faults which is a +potentially negative effect. The first factor consists in taking a +single page fault for each 2M virtual region touched by userland (so +reducing the enter/exit kernel frequency by a 512 times factor). This +only matters the first time the memory is accessed for the lifetime of +a memory mapping. The second long lasting and much more important +factor will affect all subsequent accesses to the memory for the whole +runtime of the application. The second factor consist of two +components: + +1) the TLB miss will run faster (especially with virtualization using + nested pagetables but almost always also on bare metal without + virtualization) + +2) a single TLB entry will be mapping a much larger amount of virtual + memory in turn reducing the number of TLB misses. With + virtualization and nested pagetables the TLB can be mapped of + larger size only if both KVM and the Linux guest are using + hugepages but a significant speedup already happens if only one of + the two is using hugepages just because of the fact the TLB miss is + going to run faster. + +THP can be enabled system wide or restricted to certain tasks or even +memory ranges inside task's address space. Unless THP is completely +disabled, there is ``khugepaged`` daemon that scans memory and +collapses sequences of basic pages into huge pages. + +The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>` +interface and using madivse(2) and prctl(2) system calls. + +Transparent Hugepage Support maximizes the usefulness of free memory +if compared to the reservation approach of hugetlbfs by allowing all +unused memory to be used as cache or other movable (or even unmovable +entities). It doesn't require reservation to prevent hugepage +allocation failures to be noticeable from userland. It allows paging +and all other advanced VM features to be available on the +hugepages. It requires no modifications for applications to take +advantage of it. + +Applications however can be further optimized to take advantage of +this feature, like for example they've been optimized before to avoid +a flood of mmap system calls for every malloc(4k). Optimizing userland +is by far not mandatory and khugepaged already can take care of long +lived page allocations even for hugepage unaware applications that +deals with large amounts of memory. + +In certain cases when hugepages are enabled system wide, application +may end up allocating more memory resources. An application may mmap a +large region but only touch 1 byte of it, in that case a 2M page might +be allocated instead of a 4k page for no good. This is why it's +possible to disable hugepages system-wide and to only have them inside +MADV_HUGEPAGE madvise regions. + +Embedded systems should enable hugepages only inside madvise regions +to eliminate any risk of wasting any precious byte of memory and to +only run faster. + +Applications that gets a lot of benefit from hugepages and that don't +risk to lose memory by using hugepages, should use +madvise(MADV_HUGEPAGE) on their critical mmapped regions. + +.. _thp_sysfs: + +sysfs +===== + +Global THP controls +------------------- + +Transparent Hugepage Support for anonymous memory can be entirely disabled +(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE +regions (to avoid the risk of consuming more memory resources) or enabled +system wide. This can be achieved with one of:: + + echo always >/sys/kernel/mm/transparent_hugepage/enabled + echo madvise >/sys/kernel/mm/transparent_hugepage/enabled + echo never >/sys/kernel/mm/transparent_hugepage/enabled + +It's also possible to limit defrag efforts in the VM to generate +anonymous hugepages in case they're not immediately free to madvise +regions or to never try to defrag memory and simply fallback to regular +pages unless hugepages are immediately available. Clearly if we spend CPU +time to defrag memory, we would expect to gain even more by the fact we +use hugepages later instead of regular pages. This isn't always +guaranteed, but it may be more likely in case the allocation is for a +MADV_HUGEPAGE region. + +:: + + echo always >/sys/kernel/mm/transparent_hugepage/defrag + echo defer >/sys/kernel/mm/transparent_hugepage/defrag + echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo never >/sys/kernel/mm/transparent_hugepage/defrag + +always + means that an application requesting THP will stall on + allocation failure and directly reclaim pages and compact + memory in an effort to allocate a THP immediately. This may be + desirable for virtual machines that benefit heavily from THP + use and are willing to delay the VM start to utilise them. + +defer + means that an application will wake kswapd in the background + to reclaim pages and wake kcompactd to compact memory so that + THP is available in the near future. It's the responsibility + of khugepaged to then install the THP pages later. + +defer+madvise + will enter direct reclaim and compaction like ``always``, but + only for regions that have used madvise(MADV_HUGEPAGE); all + other regions will wake kswapd in the background to reclaim + pages and wake kcompactd to compact memory so that THP is + available in the near future. + +madvise + will enter direct reclaim like ``always`` but only for regions + that are have used madvise(MADV_HUGEPAGE). This is the default + behaviour. + +never + should be self-explanatory. + +By default kernel tries to use huge zero page on read page fault to +anonymous mapping. It's possible to disable huge zero page by writing 0 +or enable it back by writing 1:: + + echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page + echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page + +Some userspace (such as a test program, or an optimized memory allocation +library) may want to know the size (in bytes) of a transparent hugepage:: + + cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size + +khugepaged will be automatically started when +transparent_hugepage/enabled is set to "always" or "madvise, and it'll +be automatically shutdown if it's set to "never". + +Khugepaged controls +------------------- + +khugepaged runs usually at low frequency so while one may not want to +invoke defrag algorithms synchronously during the page faults, it +should be worth invoking defrag at least in khugepaged. However it's +also possible to disable defrag in khugepaged by writing 0 or enable +defrag in khugepaged by writing 1:: + + echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag + echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag + +You can also control how many pages khugepaged should scan at each +pass:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan + +and how many milliseconds to wait in khugepaged between each pass (you +can set this to 0 to run khugepaged at 100% utilization of one core):: + + /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs + +and how many milliseconds to wait in khugepaged if there's an hugepage +allocation failure to throttle the next allocation attempt:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs + +The khugepaged progress can be seen in the number of pages collapsed:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed + +for each pass:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans + +``max_ptes_none`` specifies how many extra small pages (that are +not already mapped) can be allocated when collapsing a group +of small pages into one large page:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none + +A higher value leads to use additional memory for programs. +A lower value leads to gain less thp performance. Value of +max_ptes_none can waste cpu time very little, you can +ignore it. + +``max_ptes_swap`` specifies how many pages can be brought in from +swap when collapsing a group of pages into a transparent huge page:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap + +A higher value can cause excessive swap IO and waste +memory. A lower value can prevent THPs from being +collapsed, resulting fewer pages being collapsed into +THPs, and lower memory access performance. + +Boot parameter +============== + +You can change the sysfs boot time defaults of Transparent Hugepage +Support by passing the parameter ``transparent_hugepage=always`` or +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` +to the kernel command line. + +Hugepages in tmpfs/shmem +======================== + +You can control hugepage allocation policy in tmpfs with mount option +``huge=``. It can have following values: + +always + Attempt to allocate huge pages every time we need a new page; + +never + Do not allocate huge pages; + +within_size + Only allocate huge page if it will be fully within i_size. + Also respect fadvise()/madvise() hints; + +advise + Only allocate huge pages if requested with fadvise()/madvise(); + +The default policy is ``never``. + +``mount -o remount,huge= /mountpoint`` works fine after mount: remounting +``huge=never`` will not attempt to break up huge pages at all, just stop more +from being allocated. + +There's also sysfs knob to control hugepage allocation policy for internal +shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount +is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or +MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. + +In addition to policies listed above, shmem_enabled allows two further +values: + +deny + For use in emergencies, to force the huge option off from + all mounts; +force + Force the huge option on for all - very useful for testing; + +Need of application restart +=========================== + +The transparent_hugepage/enabled values and tmpfs mount option only affect +future behavior. So to make them effective you need to restart any +application that could have been using hugepages. This also applies to the +regions registered in khugepaged. + +Monitoring usage +================ + +The number of anonymous transparent huge pages currently used by the +system is available by reading the AnonHugePages field in ``/proc/meminfo``. +To identify what applications are using anonymous transparent huge pages, +it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields +for each mapping. + +The number of file transparent huge pages mapped to userspace is available +by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. +To identify what applications are mapping file transparent huge pages, it +is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields +for each mapping. + +Note that reading the smaps file is expensive and reading it +frequently will incur overhead. + +There are a number of counters in ``/proc/vmstat`` that may be used to +monitor how successfully the system is providing huge pages for use. + +thp_fault_alloc + is incremented every time a huge page is successfully + allocated to handle a page fault. This applies to both the + first time a page is faulted and for COW faults. + +thp_collapse_alloc + is incremented by khugepaged when it has found + a range of pages to collapse into one huge page and has + successfully allocated a new huge page to store the data. + +thp_fault_fallback + is incremented if a page fault fails to allocate + a huge page and instead falls back to using small pages. + +thp_collapse_alloc_failed + is incremented if khugepaged found a range + of pages that should be collapsed into one huge page but failed + the allocation. + +thp_file_alloc + is incremented every time a file huge page is successfully + allocated. + +thp_file_mapped + is incremented every time a file huge page is mapped into + user address space. + +thp_split_page + is incremented every time a huge page is split into base + pages. This can happen for a variety of reasons but a common + reason is that a huge page is old and is being reclaimed. + This action implies splitting all PMD the page mapped with. + +thp_split_page_failed + is incremented if kernel fails to split huge + page. This can happen if the page was pinned by somebody. + +thp_deferred_split_page + is incremented when a huge page is put onto split + queue. This happens when a huge page is partially unmapped and + splitting it would free up some memory. Pages on split queue are + going to be split under memory pressure. + +thp_split_pmd + is incremented every time a PMD split into table of PTEs. + This can happen, for instance, when application calls mprotect() or + munmap() on part of huge page. It doesn't split huge page, only + page table entry. + +thp_zero_page_alloc + is incremented every time a huge zero page is + successfully allocated. It includes allocations which where + dropped due race with other allocation. Note, it doesn't count + every map of the huge zero page, only its allocation. + +thp_zero_page_alloc_failed + is incremented if kernel fails to allocate + huge zero page and falls back to using small pages. + +thp_swpout + is incremented every time a huge page is swapout in one + piece without splitting. + +thp_swpout_fallback + is incremented if a huge page has to be split before swapout. + Usually because failed to allocate some continuous swap space + for the huge page. + +As the system ages, allocating huge pages may be expensive as the +system uses memory compaction to copy data around memory to free a +huge page for use. There are some counters in ``/proc/vmstat`` to help +monitor this overhead. + +compact_stall + is incremented every time a process stalls to run + memory compaction so that a huge page is free for use. + +compact_success + is incremented if the system compacted memory and + freed a huge page for use. + +compact_fail + is incremented if the system tries to compact memory + but failed. + +compact_pages_moved + is incremented each time a page is moved. If + this value is increasing rapidly, it implies that the system + is copying a lot of data to satisfy the huge page allocation. + It is possible that the cost of copying exceeds any savings + from reduced TLB misses. + +compact_pagemigrate_failed + is incremented when the underlying mechanism + for moving a page failed. + +compact_blocks_moved + is incremented each time memory compaction examines + a huge page aligned range of pages. + +It is possible to establish how long the stalls were using the function +tracer to record how long was spent in __alloc_pages_nodemask and +using the mm_page_alloc tracepoint to identify which allocations were +for huge pages. + +Optimizing the applications +=========================== + +To be guaranteed that the kernel will map a 2M page immediately in any +memory region, the mmap region has to be hugepage naturally +aligned. posix_memalign() can provide that guarantee. + +Hugetlbfs +========= + +You can use hugetlbfs on a kernel that has transparent hugepage +support enabled just fine as always. No difference can be noted in +hugetlbfs other than there will be less overall fragmentation. All +usual features belonging to hugetlbfs are preserved and +unaffected. libhugetlbfs will also work fine as usual. diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst new file mode 100644 index 000000000..5048cf661 --- /dev/null +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -0,0 +1,241 @@ +.. _userfaultfd: + +=========== +Userfaultfd +=========== + +Objective +========= + +Userfaults allow the implementation of on-demand paging from userland +and more generally they allow userland to take control of various +memory page faults, something otherwise only the kernel code could do. + +For example userfaults allows a proper and more optimal implementation +of the PROT_NONE+SIGSEGV trick. + +Design +====== + +Userfaults are delivered and resolved through the userfaultfd syscall. + +The userfaultfd (aside from registering and unregistering virtual +memory ranges) provides two primary functionalities: + +1) read/POLLIN protocol to notify a userland thread of the faults + happening + +2) various UFFDIO_* ioctls that can manage the virtual memory regions + registered in the userfaultfd that allows userland to efficiently + resolve the userfaults it receives via 1) or to manage the virtual + memory in the background + +The real advantage of userfaults if compared to regular virtual memory +management of mremap/mprotect is that the userfaults in all their +operations never involve heavyweight structures like vmas (in fact the +userfaultfd runtime load never takes the mmap_sem for writing). + +Vmas are not suitable for page- (or hugepage) granular fault tracking +when dealing with virtual address spaces that could span +Terabytes. Too many vmas would be needed for that. + +The userfaultfd once opened by invoking the syscall, can also be +passed using unix domain sockets to a manager process, so the same +manager process could handle the userfaults of a multitude of +different processes without them being aware about what is going on +(well of course unless they later try to use the userfaultfd +themselves on the same region the manager is already tracking, which +is a corner case that would currently return -EBUSY). + +API +=== + +When first opened the userfaultfd must be enabled invoking the +UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or +a later API version) which will specify the read/POLLIN protocol +userland intends to speak on the UFFD and the uffdio_api.features +userland requires. The UFFDIO_API ioctl if successful (i.e. if the +requested uffdio_api.api is spoken also by the running kernel and the +requested features are going to be enabled) will return into +uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of +respectively all the available features of the read(2) protocol and +the generic ioctl available. + +The uffdio_api.features bitmask returned by the UFFDIO_API ioctl +defines what memory types are supported by the userfaultfd and what +events, except page fault notifications, may be generated. + +If the kernel supports registering userfaultfd ranges on hugetlbfs +virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in +uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be +set if the kernel supports registering userfaultfd ranges on shared +memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero +MAP_SHARED, memfd_create, etc). + +The userland application that wants to use userfaultfd with hugetlbfs +or shared memory need to set the corresponding flag in +uffdio_api.features to enable those features. + +If the userland desires to receive notifications for events other than +page faults, it has to verify that uffdio_api.features has appropriate +UFFD_FEATURE_EVENT_* bits set. These events are described in more +detail below in "Non-cooperative userfaultfd" section. + +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should +be invoked (if present in the returned uffdio_api.ioctls bitmask) to +register a memory range in the userfaultfd by setting the +uffdio_register structure accordingly. The uffdio_register.mode +bitmask will specify to the kernel which kind of faults to track for +the range (UFFDIO_REGISTER_MODE_MISSING would track missing +pages). The UFFDIO_REGISTER ioctl will return the +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve +userfaults on the range registered. Not all ioctls will necessarily be +supported for all memory types depending on the underlying virtual +memory backend (anonymous memory vs tmpfs vs real filebacked +mappings). + +Userland can use the uffdio_register.ioctls to manage the virtual +address space in the background (to add or potentially also remove +memory from the userfaultfd registered range). This means a userfault +could be triggering just before userland maps in the background the +user-faulted page. + +The primary ioctl to resolve userfaults is UFFDIO_COPY. That +atomically copies a page into the userfault registered range and wakes +up the blocked userfaults (unless uffdio_copy.mode & +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to +UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an +half copied page since it'll keep userfaulting until the copy has +finished. + +QEMU/KVM +======== + +QEMU/KVM is using the userfaultfd syscall to implement postcopy live +migration. Postcopy live migration is one form of memory +externalization consisting of a virtual machine running with part or +all of its memory residing on a different node in the cloud. The +userfaultfd abstraction is generic enough that not a single line of +KVM kernel code had to be modified in order to add postcopy live +migration to QEMU. + +Guest async page faults, FOLL_NOWAIT and all other GUP features work +just fine in combination with userfaults. Userfaults trigger async +page faults in the guest scheduler so those guest processes that +aren't waiting for userfaults (i.e. network bound) can keep running in +the guest vcpus. + +It is generally beneficial to run one pass of precopy live migration +just before starting postcopy live migration, in order to avoid +generating userfaults for readonly guest regions. + +The implementation of postcopy live migration currently uses one +single bidirectional socket but in the future two different sockets +will be used (to reduce the latency of the userfaults to the minimum +possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). + +The QEMU in the source node writes all pages that it knows are missing +in the destination node, into the socket, and the migration thread of +the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE +ioctls on the userfaultfd in order to map the received pages into the +guest (UFFDIO_ZEROCOPY is used if the source page was a zero page). + +A different postcopy thread in the destination node listens with +poll() to the userfaultfd in parallel. When a POLLIN event is +generated after a userfault triggers, the postcopy thread read() from +the userfaultfd and receives the fault address (or -EAGAIN in case the +userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run +by the parallel QEMU migration thread). + +After the QEMU postcopy thread (running in the destination node) gets +the userfault address it writes the information about the missing page +into the socket. The QEMU source node receives the information and +roughly "seeks" to that page address and continues sending all +remaining missing pages from that new page offset. Soon after that +(just the time to flush the tcp_wmem queue through the network) the +migration thread in the QEMU running in the destination node will +receive the page that triggered the userfault and it'll map it as +usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it +was spontaneously sent by the source or if it was an urgent page +requested through a userfault). + +By the time the userfaults start, the QEMU in the destination node +doesn't need to keep any per-page state bitmap relative to the live +migration around and a single per-page bitmap has to be maintained in +the QEMU running in the source node to know which pages are still +missing in the destination node. The bitmap in the source node is +checked to find which missing pages to send in round robin and we seek +over it when receiving incoming userfaults. After sending each page of +course the bitmap is updated accordingly. It's also useful to avoid +sending the same page twice (in case the userfault is read by the +postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration +thread). + +Non-cooperative userfaultfd +=========================== + +When the userfaultfd is monitored by an external manager, the manager +must be able to track changes in the process virtual memory +layout. Userfaultfd can notify the manager about such changes using +the same read(2) protocol as for the page fault notifications. The +manager has to explicitly enable these events by setting appropriate +bits in uffdio_api.features passed to UFFDIO_API ioctl: + +UFFD_FEATURE_EVENT_FORK + enable userfaultfd hooks for fork(). When this feature is + enabled, the userfaultfd context of the parent process is + duplicated into the newly created process. The manager + receives UFFD_EVENT_FORK with file descriptor of the new + userfaultfd context in the uffd_msg.fork. + +UFFD_FEATURE_EVENT_REMAP + enable notifications about mremap() calls. When the + non-cooperative process moves a virtual memory area to a + different location, the manager will receive + UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and + new addresses of the area and its original length. + +UFFD_FEATURE_EVENT_REMOVE + enable notifications about madvise(MADV_REMOVE) and + madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will + be generated upon these calls to madvise. The uffd_msg.remove + will contain start and end addresses of the removed area. + +UFFD_FEATURE_EVENT_UNMAP + enable notifications about memory unmapping. The manager will + get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and + end addresses of the unmapped area. + +Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP +are pretty similar, they quite differ in the action expected from the +userfaultfd manager. In the former case, the virtual memory is +removed, but the area is not, the area remains monitored by the +userfaultfd, and if a page fault occurs in that area it will be +delivered to the manager. The proper resolution for such page fault is +to zeromap the faulting address. However, in the latter case, when an +area is unmapped, either explicitly (with munmap() system call), or +implicitly (e.g. during mremap()), the area is removed and in turn the +userfaultfd context for such area disappears too and the manager will +not get further userland page faults from the removed area. Still, the +notification is required in order to prevent manager from using +UFFDIO_COPY on the unmapped area. + +Unlike userland page faults which have to be synchronous and require +explicit or implicit wakeup, all the events are delivered +asynchronously and the non-cooperative process resumes execution as +soon as manager executes read(). The userfaultfd manager should +carefully synchronize calls to UFFDIO_COPY with the events +processing. To aid the synchronization, the UFFDIO_COPY ioctl will +return -ENOSPC when the monitored process exits at the time of +UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed +its virtual memory layout simultaneously with outstanding UFFDIO_COPY +operation. + +The current asynchronous model of the event delivery is optimal for +single threaded non-cooperative userfaultfd manager implementations. A +synchronous event delivery model can be added later as a new +userfaultfd feature to facilitate multithreading enhancements of the +non cooperative manager, for example to allow UFFDIO_COPY ioctls to +run in parallel to the event reception. Single threaded +implementations should continue to use the current async event +delivery model instead. diff --git a/Documentation/admin-guide/module-signing.rst b/Documentation/admin-guide/module-signing.rst new file mode 100644 index 000000000..f8b584179 --- /dev/null +++ b/Documentation/admin-guide/module-signing.rst @@ -0,0 +1,285 @@ +Kernel module signing facility +------------------------------ + +.. CONTENTS +.. +.. - Overview. +.. - Configuring module signing. +.. - Generating signing keys. +.. - Public keys in the kernel. +.. - Manually signing modules. +.. - Signed modules and stripping. +.. - Loading signed modules. +.. - Non-valid signatures and unsigned modules. +.. - Administering/protecting the private key. + + +======== +Overview +======== + +The kernel module signing facility cryptographically signs modules during +installation and then checks the signature upon loading the module. This +allows increased kernel security by disallowing the loading of unsigned modules +or modules signed with an invalid key. Module signing increases security by +making it harder to load a malicious module into the kernel. The module +signature checking is done by the kernel so that it is not necessary to have +trusted userspace bits. + +This facility uses X.509 ITU-T standard certificates to encode the public keys +involved. The signatures are not themselves encoded in any industrial standard +type. The facility currently only supports the RSA public key encryption +standard (though it is pluggable and permits others to be used). The possible +hash algorithms that can be used are SHA-1, SHA-224, SHA-256, SHA-384, and +SHA-512 (the algorithm is selected by data in the signature). + + +========================== +Configuring module signing +========================== + +The module signing facility is enabled by going to the +:menuselection:`Enable Loadable Module Support` section of +the kernel configuration and turning on:: + + CONFIG_MODULE_SIG "Module signature verification" + +This has a number of options available: + + (1) :menuselection:`Require modules to be validly signed` + (``CONFIG_MODULE_SIG_FORCE``) + + This specifies how the kernel should deal with a module that has a + signature for which the key is not known or a module that is unsigned. + + If this is off (ie. "permissive"), then modules for which the key is not + available and modules that are unsigned are permitted, but the kernel will + be marked as being tainted, and the concerned modules will be marked as + tainted, shown with the character 'E'. + + If this is on (ie. "restrictive"), only modules that have a valid + signature that can be verified by a public key in the kernel's possession + will be loaded. All other modules will generate an error. + + Irrespective of the setting here, if the module has a signature block that + cannot be parsed, it will be rejected out of hand. + + + (2) :menuselection:`Automatically sign all modules` + (``CONFIG_MODULE_SIG_ALL``) + + If this is on then modules will be automatically signed during the + modules_install phase of a build. If this is off, then the modules must + be signed manually using:: + + scripts/sign-file + + + (3) :menuselection:`Which hash algorithm should modules be signed with?` + + This presents a choice of which hash algorithm the installation phase will + sign the modules with: + + =============================== ========================================== + ``CONFIG_MODULE_SIG_SHA1`` :menuselection:`Sign modules with SHA-1` + ``CONFIG_MODULE_SIG_SHA224`` :menuselection:`Sign modules with SHA-224` + ``CONFIG_MODULE_SIG_SHA256`` :menuselection:`Sign modules with SHA-256` + ``CONFIG_MODULE_SIG_SHA384`` :menuselection:`Sign modules with SHA-384` + ``CONFIG_MODULE_SIG_SHA512`` :menuselection:`Sign modules with SHA-512` + =============================== ========================================== + + The algorithm selected here will also be built into the kernel (rather + than being a module) so that modules signed with that algorithm can have + their signatures checked without causing a dependency loop. + + + (4) :menuselection:`File name or PKCS#11 URI of module signing key` + (``CONFIG_MODULE_SIG_KEY``) + + Setting this option to something other than its default of + ``certs/signing_key.pem`` will disable the autogeneration of signing keys + and allow the kernel modules to be signed with a key of your choosing. + The string provided should identify a file containing both a private key + and its corresponding X.509 certificate in PEM form, or — on systems where + the OpenSSL ENGINE_pkcs11 is functional — a PKCS#11 URI as defined by + RFC7512. In the latter case, the PKCS#11 URI should reference both a + certificate and a private key. + + If the PEM file containing the private key is encrypted, or if the + PKCS#11 token requries a PIN, this can be provided at build time by + means of the ``KBUILD_SIGN_PIN`` variable. + + + (5) :menuselection:`Additional X.509 keys for default system keyring` + (``CONFIG_SYSTEM_TRUSTED_KEYS``) + + This option can be set to the filename of a PEM-encoded file containing + additional certificates which will be included in the system keyring by + default. + +Note that enabling module signing adds a dependency on the OpenSSL devel +packages to the kernel build processes for the tool that does the signing. + + +======================= +Generating signing keys +======================= + +Cryptographic keypairs are required to generate and check signatures. A +private key is used to generate a signature and the corresponding public key is +used to check it. The private key is only needed during the build, after which +it can be deleted or stored securely. The public key gets built into the +kernel so that it can be used to check the signatures as the modules are +loaded. + +Under normal conditions, when ``CONFIG_MODULE_SIG_KEY`` is unchanged from its +default, the kernel build will automatically generate a new keypair using +openssl if one does not exist in the file:: + + certs/signing_key.pem + +during the building of vmlinux (the public part of the key needs to be built +into vmlinux) using parameters in the:: + + certs/x509.genkey + +file (which is also generated if it does not already exist). + +It is strongly recommended that you provide your own x509.genkey file. + +Most notably, in the x509.genkey file, the req_distinguished_name section +should be altered from the default:: + + [ req_distinguished_name ] + #O = Unspecified company + CN = Build time autogenerated kernel key + #emailAddress = unspecified.user@unspecified.company + +The generated RSA key size can also be set with:: + + [ req ] + default_bits = 4096 + + +It is also possible to manually generate the key private/public files using the +x509.genkey key generation configuration file in the root node of the Linux +kernel sources tree and the openssl command. The following is an example to +generate the public/private key files:: + + openssl req -new -nodes -utf8 -sha256 -days 36500 -batch -x509 \ + -config x509.genkey -outform PEM -out kernel_key.pem \ + -keyout kernel_key.pem + +The full pathname for the resulting kernel_key.pem file can then be specified +in the ``CONFIG_MODULE_SIG_KEY`` option, and the certificate and key therein will +be used instead of an autogenerated keypair. + + +========================= +Public keys in the kernel +========================= + +The kernel contains a ring of public keys that can be viewed by root. They're +in a keyring called ".builtin_trusted_keys" that can be seen by:: + + [root@deneb ~]# cat /proc/keys + ... + 223c7853 I------ 1 perm 1f030000 0 0 keyring .builtin_trusted_keys: 1 + 302d2d52 I------ 1 perm 1f010000 0 0 asymmetri Fedora kernel signing key: d69a84e6bce3d216b979e9505b3e3ef9a7118079: X509.RSA a7118079 [] + ... + +Beyond the public key generated specifically for module signing, additional +trusted certificates can be provided in a PEM-encoded file referenced by the +``CONFIG_SYSTEM_TRUSTED_KEYS`` configuration option. + +Further, the architecture code may take public keys from a hardware store and +add those in also (e.g. from the UEFI key database). + +Finally, it is possible to add additional public keys by doing:: + + keyctl padd asymmetric "" [.builtin_trusted_keys-ID] <[key-file] + +e.g.:: + + keyctl padd asymmetric "" 0x223c7853 <my_public_key.x509 + +Note, however, that the kernel will only permit keys to be added to +``.builtin_trusted_keys`` **if** the new key's X.509 wrapper is validly signed by a key +that is already resident in the ``.builtin_trusted_keys`` at the time the key was added. + + +======================== +Manually signing modules +======================== + +To manually sign a module, use the scripts/sign-file tool available in +the Linux kernel source tree. The script requires 4 arguments: + + 1. The hash algorithm (e.g., sha256) + 2. The private key filename or PKCS#11 URI + 3. The public key filename + 4. The kernel module to be signed + +The following is an example to sign a kernel module:: + + scripts/sign-file sha512 kernel-signkey.priv \ + kernel-signkey.x509 module.ko + +The hash algorithm used does not have to match the one configured, but if it +doesn't, you should make sure that hash algorithm is either built into the +kernel or can be loaded without requiring itself. + +If the private key requires a passphrase or PIN, it can be provided in the +$KBUILD_SIGN_PIN environment variable. + + +============================ +Signed modules and stripping +============================ + +A signed module has a digital signature simply appended at the end. The string +``~Module signature appended~.`` at the end of the module's file confirms that a +signature is present but it does not confirm that the signature is valid! + +Signed modules are BRITTLE as the signature is outside of the defined ELF +container. Thus they MAY NOT be stripped once the signature is computed and +attached. Note the entire module is the signed payload, including any and all +debug information present at the time of signing. + + +====================== +Loading signed modules +====================== + +Modules are loaded with insmod, modprobe, ``init_module()`` or +``finit_module()``, exactly as for unsigned modules as no processing is +done in userspace. The signature checking is all done within the kernel. + + +========================================= +Non-valid signatures and unsigned modules +========================================= + +If ``CONFIG_MODULE_SIG_FORCE`` is enabled or module.sig_enforce=1 is supplied on +the kernel command line, the kernel will only load validly signed modules +for which it has a public key. Otherwise, it will also load modules that are +unsigned. Any module for which the kernel has a key, but which proves to have +a signature mismatch will not be permitted to load. + +Any module that has an unparseable signature will be rejected. + + +========================================= +Administering/protecting the private key +========================================= + +Since the private key is used to sign modules, viruses and malware could use +the private key to sign modules and compromise the operating system. The +private key must be either destroyed or moved to a secure location and not kept +in the root node of the kernel source tree. + +If you use the same private key to sign modules for multiple kernel +configurations, you must ensure that the module version information is +sufficient to prevent loading a module into a different kernel. Either +set ``CONFIG_MODVERSIONS=y`` or ensure that each configuration has a different +kernel release string by changing ``EXTRAVERSION`` or ``CONFIG_LOCALVERSION``. diff --git a/Documentation/admin-guide/mono.rst b/Documentation/admin-guide/mono.rst new file mode 100644 index 000000000..59e6d59f0 --- /dev/null +++ b/Documentation/admin-guide/mono.rst @@ -0,0 +1,70 @@ +Mono(tm) Binary Kernel Support for Linux +----------------------------------------- + +To configure Linux to automatically execute Mono-based .NET binaries +(in the form of .exe files) without the need to use the mono CLR +wrapper, you can use the BINFMT_MISC kernel support. + +This will allow you to execute Mono-based .NET binaries just like any +other program after you have done the following: + +1) You MUST FIRST install the Mono CLR support, either by downloading + a binary package, a source tarball or by installing from Git. Binary + packages for several distributions can be found at: + + http://www.mono-project.com/download/ + + Instructions for compiling Mono can be found at: + + http://www.mono-project.com/docs/compiling-mono/linux/ + + Once the Mono CLR support has been installed, just check that + ``/usr/bin/mono`` (which could be located elsewhere, for example + ``/usr/local/bin/mono``) is working. + +2) You have to compile BINFMT_MISC either as a module or into + the kernel (``CONFIG_BINFMT_MISC``) and set it up properly. + If you choose to compile it as a module, you will have + to insert it manually with modprobe/insmod, as kmod + cannot be easily supported with binfmt_misc. + Read the file ``binfmt_misc.txt`` in this directory to know + more about the configuration process. + +3) Add the following entries to ``/etc/rc.local`` or similar script + to be run at system startup: + + .. code-block:: sh + + # Insert BINFMT_MISC module into the kernel + if [ ! -e /proc/sys/fs/binfmt_misc/register ]; then + /sbin/modprobe binfmt_misc + # Some distributions, like Fedora Core, perform + # the following command automatically when the + # binfmt_misc module is loaded into the kernel + # or during normal boot up (systemd-based systems). + # Thus, it is possible that the following line + # is not needed at all. + mount -t binfmt_misc none /proc/sys/fs/binfmt_misc + fi + + # Register support for .NET CLR binaries + if [ -e /proc/sys/fs/binfmt_misc/register ]; then + # Replace /usr/bin/mono with the correct pathname to + # the Mono CLR runtime (usually /usr/local/bin/mono + # when compiling from sources or CVS). + echo ':CLR:M::MZ::/usr/bin/mono:' > /proc/sys/fs/binfmt_misc/register + else + echo "No binfmt_misc support" + exit 1 + fi + +4) Check that ``.exe`` binaries can be ran without the need of a + wrapper script, simply by launching the ``.exe`` file directly + from a command prompt, for example:: + + /usr/bin/xsd.exe + + .. note:: + + If this fails with a permission denied error, check + that the ``.exe`` file has execute permissions. diff --git a/Documentation/admin-guide/parport.rst b/Documentation/admin-guide/parport.rst new file mode 100644 index 000000000..ad3f9b8a1 --- /dev/null +++ b/Documentation/admin-guide/parport.rst @@ -0,0 +1,286 @@ +Parport ++++++++ + +The ``parport`` code provides parallel-port support under Linux. This +includes the ability to share one port between multiple device +drivers. + +You can pass parameters to the ``parport`` code to override its automatic +detection of your hardware. This is particularly useful if you want +to use IRQs, since in general these can't be autoprobed successfully. +By default IRQs are not used even if they **can** be probed. This is +because there are a lot of people using the same IRQ for their +parallel port and a sound card or network card. + +The ``parport`` code is split into two parts: generic (which deals with +port-sharing) and architecture-dependent (which deals with actually +using the port). + + +Parport as modules +================== + +If you load the `parport`` code as a module, say:: + + # insmod parport + +to load the generic ``parport`` code. You then must load the +architecture-dependent code with (for example):: + + # insmod parport_pc io=0x3bc,0x378,0x278 irq=none,7,auto + +to tell the ``parport`` code that you want three PC-style ports, one at +0x3bc with no IRQ, one at 0x378 using IRQ 7, and one at 0x278 with an +auto-detected IRQ. Currently, PC-style (``parport_pc``), Sun ``bpp``, +Amiga, Atari, and MFC3 hardware is supported. + +PCI parallel I/O card support comes from ``parport_pc``. Base I/O +addresses should not be specified for supported PCI cards since they +are automatically detected. + + +modprobe +-------- + +If you use modprobe , you will find it useful to add lines as below to a +configuration file in /etc/modprobe.d/ directory:: + + alias parport_lowlevel parport_pc + options parport_pc io=0x378,0x278 irq=7,auto + +modprobe will load ``parport_pc`` (with the options ``io=0x378,0x278 irq=7,auto``) +whenever a parallel port device driver (such as ``lp``) is loaded. + +Note that these are example lines only! You shouldn't in general need +to specify any options to ``parport_pc`` in order to be able to use a +parallel port. + + +Parport probe [optional] +------------------------ + +In 2.2 kernels there was a module called ``parport_probe``, which was used +for collecting IEEE 1284 device ID information. This has now been +enhanced and now lives with the IEEE 1284 support. When a parallel +port is detected, the devices that are connected to it are analysed, +and information is logged like this:: + + parport0: Printer, BJC-210 (Canon) + +The probe information is available from files in ``/proc/sys/dev/parport/``. + + +Parport linked into the kernel statically +========================================= + +If you compile the ``parport`` code into the kernel, then you can use +kernel boot parameters to get the same effect. Add something like the +following to your LILO command line:: + + parport=0x3bc parport=0x378,7 parport=0x278,auto,nofifo + +You can have many ``parport=...`` statements, one for each port you want +to add. Adding ``parport=0`` to the kernel command-line will disable +parport support entirely. Adding ``parport=auto`` to the kernel +command-line will make ``parport`` use any IRQ lines or DMA channels that +it auto-detects. + + +Files in /proc +============== + +If you have configured the ``/proc`` filesystem into your kernel, you will +see a new directory entry: ``/proc/sys/dev/parport``. In there will be a +directory entry for each parallel port for which parport is +configured. In each of those directories are a collection of files +describing that parallel port. + +The ``/proc/sys/dev/parport`` directory tree looks like:: + + parport + |-- default + | |-- spintime + | `-- timeslice + |-- parport0 + | |-- autoprobe + | |-- autoprobe0 + | |-- autoprobe1 + | |-- autoprobe2 + | |-- autoprobe3 + | |-- devices + | | |-- active + | | `-- lp + | | `-- timeslice + | |-- base-addr + | |-- irq + | |-- dma + | |-- modes + | `-- spintime + `-- parport1 + |-- autoprobe + |-- autoprobe0 + |-- autoprobe1 + |-- autoprobe2 + |-- autoprobe3 + |-- devices + | |-- active + | `-- ppa + | `-- timeslice + |-- base-addr + |-- irq + |-- dma + |-- modes + `-- spintime + +.. tabularcolumns:: |p{4.0cm}|p{13.5cm}| + +======================= ======================================================= +File Contents +======================= ======================================================= +``devices/active`` A list of the device drivers using that port. A "+" + will appear by the name of the device currently using + the port (it might not appear against any). The + string "none" means that there are no device drivers + using that port. + +``base-addr`` Parallel port's base address, or addresses if the port + has more than one in which case they are separated + with tabs. These values might not have any sensible + meaning for some ports. + +``irq`` Parallel port's IRQ, or -1 if none is being used. + +``dma`` Parallel port's DMA channel, or -1 if none is being + used. + +``modes`` Parallel port's hardware modes, comma-separated, + meaning: + + - PCSPP + PC-style SPP registers are available. + + - TRISTATE + Port is bidirectional. + + - COMPAT + Hardware acceleration for printers is + available and will be used. + + - EPP + Hardware acceleration for EPP protocol + is available and will be used. + + - ECP + Hardware acceleration for ECP protocol + is available and will be used. + + - DMA + DMA is available and will be used. + + Note that the current implementation will only take + advantage of COMPAT and ECP modes if it has an IRQ + line to use. + +``autoprobe`` Any IEEE-1284 device ID information that has been + acquired from the (non-IEEE 1284.3) device. + +``autoprobe[0-3]`` IEEE 1284 device ID information retrieved from + daisy-chain devices that conform to IEEE 1284.3. + +``spintime`` The number of microseconds to busy-loop while waiting + for the peripheral to respond. You might find that + adjusting this improves performance, depending on your + peripherals. This is a port-wide setting, i.e. it + applies to all devices on a particular port. + +``timeslice`` The number of milliseconds that a device driver is + allowed to keep a port claimed for. This is advisory, + and driver can ignore it if it must. + +``default/*`` The defaults for spintime and timeslice. When a new + port is registered, it picks up the default spintime. + When a new device is registered, it picks up the + default timeslice. +======================= ======================================================= + +Device drivers +============== + +Once the parport code is initialised, you can attach device drivers to +specific ports. Normally this happens automatically; if the lp driver +is loaded it will create one lp device for each port found. You can +override this, though, by using parameters either when you load the lp +driver:: + + # insmod lp parport=0,2 + +or on the LILO command line:: + + lp=parport0 lp=parport2 + +Both the above examples would inform lp that you want ``/dev/lp0`` to be +the first parallel port, and /dev/lp1 to be the **third** parallel port, +with no lp device associated with the second port (parport1). Note +that this is different to the way older kernels worked; there used to +be a static association between the I/O port address and the device +name, so ``/dev/lp0`` was always the port at 0x3bc. This is no longer the +case - if you only have one port, it will default to being ``/dev/lp0``, +regardless of base address. + +Also: + + * If you selected the IEEE 1284 support at compile time, you can say + ``lp=auto`` on the kernel command line, and lp will create devices + only for those ports that seem to have printers attached. + + * If you give PLIP the ``timid`` parameter, either with ``plip=timid`` on + the command line, or with ``insmod plip timid=1`` when using modules, + it will avoid any ports that seem to be in use by other devices. + + * IRQ autoprobing works only for a few port types at the moment. + +Reporting printer problems with parport +======================================= + +If you are having problems printing, please go through these steps to +try to narrow down where the problem area is. + +When reporting problems with parport, really you need to give all of +the messages that ``parport_pc`` spits out when it initialises. There are +several code paths: + +- polling +- interrupt-driven, protocol in software +- interrupt-driven, protocol in hardware using PIO +- interrupt-driven, protocol in hardware using DMA + +The kernel messages that ``parport_pc`` logs give an indication of which +code path is being used. (They could be a lot better actually..) + +For normal printer protocol, having IEEE 1284 modes enabled or not +should not make a difference. + +To turn off the 'protocol in hardware' code paths, disable +``CONFIG_PARPORT_PC_FIFO``. Note that when they are enabled they are not +necessarily **used**; it depends on whether the hardware is available, +enabled by the BIOS, and detected by the driver. + +So, to start with, disable ``CONFIG_PARPORT_PC_FIFO``, and load ``parport_pc`` +with ``irq=none``. See if printing works then. It really should, +because this is the simplest code path. + +If that works fine, try with ``io=0x378 irq=7`` (adjust for your +hardware), to make it use interrupt-driven in-software protocol. + +If **that** works fine, then one of the hardware modes isn't working +right. Enable ``CONFIG_FIFO`` (no, it isn't a module option, +and yes, it should be), set the port to ECP mode in the BIOS and note +the DMA channel, and try with:: + + io=0x378 irq=7 dma=none (for PIO) + io=0x378 irq=7 dma=3 (for DMA) + +---------- + +philb@gnu.org +tim@cyberelk.net diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst new file mode 100644 index 000000000..47153e64d --- /dev/null +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -0,0 +1,701 @@ +.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>` +.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>` + +======================= +CPU Performance Scaling +======================= + +:: + + Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> + +The Concept of CPU Performance Scaling +====================================== + +The majority of modern processors are capable of operating in a number of +different clock frequency and voltage configurations, often referred to as +Operating Performance Points or P-states (in ACPI terminology). As a rule, +the higher the clock frequency and the higher the voltage, the more instructions +can be retired by the CPU over a unit of time, but also the higher the clock +frequency and the higher the voltage, the more energy is consumed over a unit of +time (or the more power is drawn) by the CPU in the given P-state. Therefore +there is a natural tradeoff between the CPU capacity (the number of instructions +that can be executed over a unit of time) and the power drawn by the CPU. + +In some situations it is desirable or even necessary to run the program as fast +as possible and then there is no reason to use any P-states different from the +highest one (i.e. the highest-performance frequency/voltage configuration +available). In some other cases, however, it may not be necessary to execute +instructions so quickly and maintaining the highest available CPU capacity for a +relatively long time without utilizing it entirely may be regarded as wasteful. +It also may not be physically possible to maintain maximum CPU capacity for too +long for thermal or power supply capacity reasons or similar. To cover those +cases, there are hardware interfaces allowing CPUs to be switched between +different frequency/voltage configurations or (in the ACPI terminology) to be +put into different P-states. + +Typically, they are used along with algorithms to estimate the required CPU +capacity, so as to decide which P-states to put the CPUs into. Of course, since +the utilization of the system generally changes over time, that has to be done +repeatedly on a regular basis. The activity by which this happens is referred +to as CPU performance scaling or CPU frequency scaling (because it involves +adjusting the CPU clock frequency). + + +CPU Performance Scaling in Linux +================================ + +The Linux kernel supports CPU performance scaling by means of the ``CPUFreq`` +(CPU Frequency scaling) subsystem that consists of three layers of code: the +core, scaling governors and scaling drivers. + +The ``CPUFreq`` core provides the common code infrastructure and user space +interfaces for all platforms that support CPU performance scaling. It defines +the basic framework in which the other components operate. + +Scaling governors implement algorithms to estimate the required CPU capacity. +As a rule, each governor implements one, possibly parametrized, scaling +algorithm. + +Scaling drivers talk to the hardware. They provide scaling governors with +information on the available P-states (or P-state ranges in some cases) and +access platform-specific hardware interfaces to change CPU P-states as requested +by scaling governors. + +In principle, all available scaling governors can be used with every scaling +driver. That design is based on the observation that the information used by +performance scaling algorithms for P-state selection can be represented in a +platform-independent form in the majority of cases, so it should be possible +to use the same performance scaling algorithm implemented in exactly the same +way regardless of which scaling driver is used. Consequently, the same set of +scaling governors should be suitable for every supported platform. + +However, that observation may not hold for performance scaling algorithms +based on information provided by the hardware itself, for example through +feedback registers, as that information is typically specific to the hardware +interface it comes from and may not be easily represented in an abstract, +platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers +to bypass the governor layer and implement their own performance scaling +algorithms. That is done by the |intel_pstate| scaling driver. + + +``CPUFreq`` Policy Objects +========================== + +In some cases the hardware interface for P-state control is shared by multiple +CPUs. That is, for example, the same register (or set of registers) is used to +control the P-state of multiple CPUs at the same time and writing to it affects +all of those CPUs simultaneously. + +Sets of CPUs sharing hardware P-state control interfaces are represented by +``CPUFreq`` as |struct cpufreq_policy| objects. For consistency, +|struct cpufreq_policy| is also used when there is only one CPU in the given +set. + +The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for +every CPU in the system, including CPUs that are currently offline. If multiple +CPUs share the same hardware P-state control interface, all of the pointers +corresponding to them point to the same |struct cpufreq_policy| object. + +``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design +of its user space interface is based on the policy concept. + + +CPU Initialization +================== + +First of all, a scaling driver has to be registered for ``CPUFreq`` to work. +It is only possible to register one scaling driver at a time, so the scaling +driver is expected to be able to handle all CPUs in the system. + +The scaling driver may be registered before or after CPU registration. If +CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to +take a note of all of the already registered CPUs during the registration of the +scaling driver. In turn, if any CPUs are registered after the registration of +the scaling driver, the ``CPUFreq`` core will be invoked to take note of them +at their registration time. + +In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it +has not seen so far as soon as it is ready to handle that CPU. [Note that the +logical CPU may be a physical single-core processor, or a single core in a +multicore processor, or a hardware thread in a physical processor or processor +core. In what follows "CPU" always means "logical CPU" unless explicitly stated +otherwise and the word "processor" is used to refer to the physical part +possibly including multiple logical CPUs.] + +Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set +for the given CPU and if so, it skips the policy object creation. Otherwise, +a new policy object is created and initialized, which involves the creation of +a new policy directory in ``sysfs``, and the policy pointer corresponding to +the given CPU is set to the new policy object's address in memory. + +Next, the scaling driver's ``->init()`` callback is invoked with the policy +pointer of the new CPU passed to it as the argument. That callback is expected +to initialize the performance scaling hardware interface for the given CPU (or, +more precisely, for the set of CPUs sharing the hardware interface it belongs +to, represented by its policy object) and, if the policy object it has been +called for is new, to set parameters of the policy, like the minimum and maximum +frequencies supported by the hardware, the table of available frequencies (if +the set of supported P-states is not a continuous range), and the mask of CPUs +that belong to the same policy (including both online and offline CPUs). That +mask is then used by the core to populate the policy pointers for all of the +CPUs in it. + +The next major initialization step for a new policy object is to attach a +scaling governor to it (to begin with, that is the default scaling governor +determined by the kernel configuration, but it may be changed later +via ``sysfs``). First, a pointer to the new policy object is passed to the +governor's ``->init()`` callback which is expected to initialize all of the +data structures necessary to handle the given policy and, possibly, to add +a governor ``sysfs`` interface to it. Next, the governor is started by +invoking its ``->start()`` callback. + +That callback it expected to register per-CPU utilization update callbacks for +all of the online CPUs belonging to the given policy with the CPU scheduler. +The utilization update callbacks will be invoked by the CPU scheduler on +important events, like task enqueue and dequeue, on every iteration of the +scheduler tick or generally whenever the CPU utilization may change (from the +scheduler's perspective). They are expected to carry out computations needed +to determine the P-state to use for the given policy going forward and to +invoke the scaling driver to make changes to the hardware in accordance with +the P-state selection. The scaling driver may be invoked directly from +scheduler context or asynchronously, via a kernel thread or workqueue, depending +on the configuration and capabilities of the scaling driver and the governor. + +Similar steps are taken for policy objects that are not new, but were "inactive" +previously, meaning that all of the CPUs belonging to them were offline. The +only practical difference in that case is that the ``CPUFreq`` core will attempt +to use the scaling governor previously used with the policy that became +"inactive" (and is re-initialized now) instead of the default governor. + +In turn, if a previously offline CPU is being brought back online, but some +other CPUs sharing the policy object with it are online already, there is no +need to re-initialize the policy object at all. In that case, it only is +necessary to restart the scaling governor so that it can take the new online CPU +into account. That is achieved by invoking the governor's ``->stop`` and +``->start()`` callbacks, in this order, for the entire policy. + +As mentioned before, the |intel_pstate| scaling driver bypasses the scaling +governor layer of ``CPUFreq`` and provides its own P-state selection algorithms. +Consequently, if |intel_pstate| is used, scaling governors are not attached to +new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked +to register per-CPU utilization update callbacks for each policy. These +callbacks are invoked by the CPU scheduler in the same way as for scaling +governors, but in the |intel_pstate| case they both determine the P-state to +use and change the hardware configuration accordingly in one go from scheduler +context. + +The policy objects created during CPU initialization and other data structures +associated with them are torn down when the scaling driver is unregistered +(which happens when the kernel module containing it is unloaded, for example) or +when the last CPU belonging to the given policy in unregistered. + + +Policy Interface in ``sysfs`` +============================= + +During the initialization of the kernel, the ``CPUFreq`` core creates a +``sysfs`` directory (kobject) called ``cpufreq`` under +:file:`/sys/devices/system/cpu/`. + +That directory contains a ``policyX`` subdirectory (where ``X`` represents an +integer number) for every policy object maintained by the ``CPUFreq`` core. +Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links +under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer +that may be different from the one represented by ``X``) for all of the CPUs +associated with (or belonging to) the given policy. The ``policyX`` directories +in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific +attributes (files) to control ``CPUFreq`` behavior for the corresponding policy +objects (that is, for all of the CPUs associated with them). + +Some of those attributes are generic. They are created by the ``CPUFreq`` core +and their behavior generally does not depend on what scaling driver is in use +and what scaling governor is attached to the given policy. Some scaling drivers +also add driver-specific attributes to the policy directories in ``sysfs`` to +control policy-specific aspects of driver behavior. + +The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/` +are the following: + +``affected_cpus`` + List of online CPUs belonging to this policy (i.e. sharing the hardware + performance scaling interface represented by the ``policyX`` policy + object). + +``bios_limit`` + If the platform firmware (BIOS) tells the OS to apply an upper limit to + CPU frequencies, that limit will be reported through this attribute (if + present). + + The existence of the limit may be a result of some (often unintentional) + BIOS settings, restrictions coming from a service processor or another + BIOS/HW-based mechanisms. + + This does not cover ACPI thermal limitations which can be discovered + through a generic thermal driver. + + This attribute is not present if the scaling driver in use does not + support it. + +``cpuinfo_cur_freq`` + Current frequency of the CPUs belonging to this policy as obtained from + the hardware (in KHz). + + This is expected to be the frequency the hardware actually runs at. + If that frequency cannot be determined, this attribute should not + be present. + +``cpuinfo_max_freq`` + Maximum possible operating frequency the CPUs belonging to this policy + can run at (in kHz). + +``cpuinfo_min_freq`` + Minimum possible operating frequency the CPUs belonging to this policy + can run at (in kHz). + +``cpuinfo_transition_latency`` + The time it takes to switch the CPUs belonging to this policy from one + P-state to another, in nanoseconds. + + If unknown or if known to be so high that the scaling driver does not + work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`) + will be returned by reads from this attribute. + +``related_cpus`` + List of all (online and offline) CPUs belonging to this policy. + +``scaling_available_governors`` + List of ``CPUFreq`` scaling governors present in the kernel that can + be attached to this policy or (if the |intel_pstate| scaling driver is + in use) list of scaling algorithms provided by the driver that can be + applied to this policy. + + [Note that some governors are modular and it may be necessary to load a + kernel module for the governor held by it to become available and be + listed by this attribute.] + +``scaling_cur_freq`` + Current frequency of all of the CPUs belonging to this policy (in kHz). + + In the majority of cases, this is the frequency of the last P-state + requested by the scaling driver from the hardware using the scaling + interface provided by it, which may or may not reflect the frequency + the CPU is actually running at (due to hardware design and other + limitations). + + Some architectures (e.g. ``x86``) may attempt to provide information + more precisely reflecting the current CPU frequency through this + attribute, but that still may not be the exact current CPU frequency as + seen by the hardware at the moment. + +``scaling_driver`` + The scaling driver currently in use. + +``scaling_governor`` + The scaling governor currently attached to this policy or (if the + |intel_pstate| scaling driver is in use) the scaling algorithm + provided by the driver that is currently applied to this policy. + + This attribute is read-write and writing to it will cause a new scaling + governor to be attached to this policy or a new scaling algorithm + provided by the scaling driver to be applied to it (in the + |intel_pstate| case), as indicated by the string written to this + attribute (which must be one of the names listed by the + ``scaling_available_governors`` attribute described above). + +``scaling_max_freq`` + Maximum frequency the CPUs belonging to this policy are allowed to be + running at (in kHz). + + This attribute is read-write and writing a string representing an + integer to it will cause a new limit to be set (it must not be lower + than the value of the ``scaling_min_freq`` attribute). + +``scaling_min_freq`` + Minimum frequency the CPUs belonging to this policy are allowed to be + running at (in kHz). + + This attribute is read-write and writing a string representing a + non-negative integer to it will cause a new limit to be set (it must not + be higher than the value of the ``scaling_max_freq`` attribute). + +``scaling_setspeed`` + This attribute is functional only if the `userspace`_ scaling governor + is attached to the given policy. + + It returns the last frequency requested by the governor (in kHz) or can + be written to in order to set a new frequency for the policy. + + +Generic Scaling Governors +========================= + +``CPUFreq`` provides generic scaling governors that can be used with all +scaling drivers. As stated before, each of them implements a single, possibly +parametrized, performance scaling algorithm. + +Scaling governors are attached to policy objects and different policy objects +can be handled by different scaling governors at the same time (although that +may lead to suboptimal results in some cases). + +The scaling governor for a given policy object can be changed at any time with +the help of the ``scaling_governor`` policy attribute in ``sysfs``. + +Some governors expose ``sysfs`` attributes to control or fine-tune the scaling +algorithms implemented by them. Those attributes, referred to as governor +tunables, can be either global (system-wide) or per-policy, depending on the +scaling driver in use. If the driver requires governor tunables to be +per-policy, they are located in a subdirectory of each policy directory. +Otherwise, they are located in a subdirectory under +:file:`/sys/devices/system/cpu/cpufreq/`. In either case the name of the +subdirectory containing the governor tunables is the name of the governor +providing them. + +``performance`` +--------------- + +When attached to a policy object, this governor causes the highest frequency, +within the ``scaling_max_freq`` policy limit, to be requested for that policy. + +The request is made once at that time the governor for the policy is set to +``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq`` +policy limits change after that. + +``powersave`` +------------- + +When attached to a policy object, this governor causes the lowest frequency, +within the ``scaling_min_freq`` policy limit, to be requested for that policy. + +The request is made once at that time the governor for the policy is set to +``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq`` +policy limits change after that. + +``userspace`` +------------- + +This governor does not do anything by itself. Instead, it allows user space +to set the CPU frequency for the policy it is attached to by writing to the +``scaling_setspeed`` attribute of that policy. + +``schedutil`` +------------- + +This governor uses CPU utilization data available from the CPU scheduler. It +generally is regarded as a part of the CPU scheduler, so it can access the +scheduler's internal data structures directly. + +It runs entirely in scheduler context, although in some cases it may need to +invoke the scaling driver asynchronously when it decides that the CPU frequency +should be changed for a given policy (that depends on whether or not the driver +is capable of changing the CPU frequency from scheduler context). + +The actions of this governor for a particular CPU depend on the scheduling class +invoking its utilization update callback for that CPU. If it is invoked by the +RT or deadline scheduling classes, the governor will increase the frequency to +the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn, +if it is invoked by the CFS scheduling class, the governor will use the +Per-Entity Load Tracking (PELT) metric for the root control group of the +given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_ +LWN.net article for a description of the PELT mechanism). Then, the new +CPU frequency to apply is computed in accordance with the formula + + f = 1.25 * ``f_0`` * ``util`` / ``max`` + +where ``util`` is the PELT number, ``max`` is the theoretical maximum of +``util``, and ``f_0`` is either the maximum possible CPU frequency for the given +policy (if the PELT number is frequency-invariant), or the current CPU frequency +(otherwise). + +This governor also employs a mechanism allowing it to temporarily bump up the +CPU frequency for tasks that have been waiting on I/O most recently, called +"IO-wait boosting". That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag +is passed by the scheduler to the governor callback which causes the frequency +to go up to the allowed maximum immediately and then draw back to the value +returned by the above formula over time. + +This governor exposes only one tunable: + +``rate_limit_us`` + Minimum time (in microseconds) that has to pass between two consecutive + runs of governor computations (default: 1000 times the scaling driver's + transition latency). + + The purpose of this tunable is to reduce the scheduler context overhead + of the governor which might be excessive without it. + +This governor generally is regarded as a replacement for the older `ondemand`_ +and `conservative`_ governors (described below), as it is simpler and more +tightly integrated with the CPU scheduler, its overhead in terms of CPU context +switches and similar is less significant, and it uses the scheduler's own CPU +utilization metric, so in principle its decisions should not contradict the +decisions made by the other parts of the scheduler. + +``ondemand`` +------------ + +This governor uses CPU load as a CPU frequency selection metric. + +In order to estimate the current CPU load, it measures the time elapsed between +consecutive invocations of its worker routine and computes the fraction of that +time in which the given CPU was not idle. The ratio of the non-idle (active) +time to the total CPU time is taken as an estimate of the load. + +If this governor is attached to a policy shared by multiple CPUs, the load is +estimated for all of them and the greatest result is taken as the load estimate +for the entire policy. + +The worker routine of this governor has to run in process context, so it is +invoked asynchronously (via a workqueue) and CPU P-states are updated from +there if necessary. As a result, the scheduler context overhead from this +governor is minimum, but it causes additional CPU context switches to happen +relatively often and the CPU P-state updates triggered by it can be relatively +irregular. Also, it affects its own CPU load metric by running code that +reduces the CPU idle time (even though the CPU idle time is only reduced very +slightly by it). + +It generally selects CPU frequencies proportional to the estimated load, so that +the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of +1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute +corresponds to the load of 0, unless when the load exceeds a (configurable) +speedup threshold, in which case it will go straight for the highest frequency +it is allowed to use (the ``scaling_max_freq`` policy limit). + +This governor exposes the following tunables: + +``sampling_rate`` + This is how often the governor's worker routine should run, in + microseconds. + + Typically, it is set to values of the order of 10000 (10 ms). Its + default value is equal to the value of ``cpuinfo_transition_latency`` + for each policy this governor is attached to (but since the unit here + is greater by 1000, this means that the time represented by + ``sampling_rate`` is 1000 times greater than the transition latency by + default). + + If this tunable is per-policy, the following shell command sets the time + represented by it to be 750 times as high as the transition latency:: + + # echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate + +``up_threshold`` + If the estimated CPU load is above this value (in percent), the governor + will set the frequency to the maximum value allowed for the policy. + Otherwise, the selected frequency will be proportional to the estimated + CPU load. + +``ignore_nice_load`` + If set to 1 (default 0), it will cause the CPU load estimation code to + treat the CPU time spent on executing tasks with "nice" levels greater + than 0 as CPU idle time. + + This may be useful if there are tasks in the system that should not be + taken into account when deciding what frequency to run the CPUs at. + Then, to make that happen it is sufficient to increase the "nice" level + of those tasks above 0 and set this attribute to 1. + +``sampling_down_factor`` + Temporary multiplier, between 1 (default) and 100 inclusive, to apply to + the ``sampling_rate`` value if the CPU load goes above ``up_threshold``. + + This causes the next execution of the governor's worker routine (after + setting the frequency to the allowed maximum) to be delayed, so the + frequency stays at the maximum level for a longer time. + + Frequency fluctuations in some bursty workloads may be avoided this way + at the cost of additional energy spent on maintaining the maximum CPU + capacity. + +``powersave_bias`` + Reduction factor to apply to the original frequency target of the + governor (including the maximum value used when the ``up_threshold`` + value is exceeded by the estimated CPU load) or sensitivity threshold + for the AMD frequency sensitivity powersave bias driver + (:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000 + inclusive. + + If the AMD frequency sensitivity powersave bias driver is not loaded, + the effective frequency to apply is given by + + f * (1 - ``powersave_bias`` / 1000) + + where f is the governor's original frequency target. The default value + of this attribute is 0 in that case. + + If the AMD frequency sensitivity powersave bias driver is loaded, the + value of this attribute is 400 by default and it is used in a different + way. + + On Family 16h (and later) AMD processors there is a mechanism to get a + measured workload sensitivity, between 0 and 100% inclusive, from the + hardware. That value can be used to estimate how the performance of the + workload running on a CPU will change in response to frequency changes. + + The performance of a workload with the sensitivity of 0 (memory-bound or + IO-bound) is not expected to increase at all as a result of increasing + the CPU frequency, whereas workloads with the sensitivity of 100% + (CPU-bound) are expected to perform much better if the CPU frequency is + increased. + + If the workload sensitivity is less than the threshold represented by + the ``powersave_bias`` value, the sensitivity powersave bias driver + will cause the governor to select a frequency lower than its original + target, so as to avoid over-provisioning workloads that will not benefit + from running at higher CPU frequencies. + +``conservative`` +---------------- + +This governor uses CPU load as a CPU frequency selection metric. + +It estimates the CPU load in the same way as the `ondemand`_ governor described +above, but the CPU frequency selection algorithm implemented by it is different. + +Namely, it avoids changing the frequency significantly over short time intervals +which may not be suitable for systems with limited power supply capacity (e.g. +battery-powered). To achieve that, it changes the frequency in relatively +small steps, one step at a time, up or down - depending on whether or not a +(configurable) threshold has been exceeded by the estimated CPU load. + +This governor exposes the following tunables: + +``freq_step`` + Frequency step in percent of the maximum frequency the governor is + allowed to set (the ``scaling_max_freq`` policy limit), between 0 and + 100 (5 by default). + + This is how much the frequency is allowed to change in one go. Setting + it to 0 will cause the default frequency step (5 percent) to be used + and setting it to 100 effectively causes the governor to periodically + switch the frequency between the ``scaling_min_freq`` and + ``scaling_max_freq`` policy limits. + +``down_threshold`` + Threshold value (in percent, 20 by default) used to determine the + frequency change direction. + + If the estimated CPU load is greater than this value, the frequency will + go up (by ``freq_step``). If the load is less than this value (and the + ``sampling_down_factor`` mechanism is not in effect), the frequency will + go down. Otherwise, the frequency will not be changed. + +``sampling_down_factor`` + Frequency decrease deferral factor, between 1 (default) and 10 + inclusive. + + It effectively causes the frequency to go down ``sampling_down_factor`` + times slower than it ramps up. + + +Frequency Boost Support +======================= + +Background +---------- + +Some processors support a mechanism to raise the operating frequency of some +cores in a multicore package temporarily (and above the sustainable frequency +threshold for the whole package) under certain conditions, for example if the +whole chip is not fully utilized and below its intended thermal or power budget. + +Different names are used by different vendors to refer to this functionality. +For Intel processors it is referred to as "Turbo Boost", AMD calls it +"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on. +As a rule, it also is implemented differently by different vendors. The simple +term "frequency boost" is used here for brevity to refer to all of those +implementations. + +The frequency boost mechanism may be either hardware-based or software-based. +If it is hardware-based (e.g. on x86), the decision to trigger the boosting is +made by the hardware (although in general it requires the hardware to be put +into a special state in which it can control the CPU frequency within certain +limits). If it is software-based (e.g. on ARM), the scaling driver decides +whether or not to trigger boosting and when to do that. + +The ``boost`` File in ``sysfs`` +------------------------------- + +This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls +the "boost" setting for the whole system. It is not present if the underlying +scaling driver does not support the frequency boost mechanism (or supports it, +but provides a driver-specific interface for controlling it, like +|intel_pstate|). + +If the value in this file is 1, the frequency boost mechanism is enabled. This +means that either the hardware can be put into states in which it is able to +trigger boosting (in the hardware-based case), or the software is allowed to +trigger boosting (in the software-based case). It does not mean that boosting +is actually in use at the moment on any CPUs in the system. It only means a +permission to use the frequency boost mechanism (which still may never be used +for other reasons). + +If the value in this file is 0, the frequency boost mechanism is disabled and +cannot be used at all. + +The only values that can be written to this file are 0 and 1. + +Rationale for Boost Control Knob +-------------------------------- + +The frequency boost mechanism is generally intended to help to achieve optimum +CPU performance on time scales below software resolution (e.g. below the +scheduler tick interval) and it is demonstrably suitable for many workloads, but +it may lead to problems in certain situations. + +For this reason, many systems make it possible to disable the frequency boost +mechanism in the platform firmware (BIOS) setup, but that requires the system to +be restarted for the setting to be adjusted as desired, which may not be +practical at least in some cases. For example: + + 1. Boosting means overclocking the processor, although under controlled + conditions. Generally, the processor's energy consumption increases + as a result of increasing its frequency and voltage, even temporarily. + That may not be desirable on systems that switch to power sources of + limited capacity, such as batteries, so the ability to disable the boost + mechanism while the system is running may help there (but that depends on + the workload too). + + 2. In some situations deterministic behavior is more important than + performance or energy consumption (or both) and the ability to disable + boosting while the system is running may be useful then. + + 3. To examine the impact of the frequency boost mechanism itself, it is useful + to be able to run tests with and without boosting, preferably without + restarting the system in the meantime. + + 4. Reproducible results are important when running benchmarks. Since + the boosting functionality depends on the load of the whole package, + single-thread performance may vary because of it which may lead to + unreproducible results sometimes. That can be avoided by disabling the + frequency boost mechanism before running benchmarks sensitive to that + issue. + +Legacy AMD ``cpb`` Knob +----------------------- + +The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to +the global ``boost`` one. It is used for disabling/enabling the "Core +Performance Boost" feature of some AMD processors. + +If present, that knob is located in every ``CPUFreq`` policy directory in +``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called +``cpb``, which indicates a more fine grained control interface. The actual +implementation, however, works on the system-wide basis and setting that knob +for one policy causes the same value of it to be set for all of the other +policies at the same time. + +That knob is still supported on AMD processors that support its underlying +hardware feature, but it may be configured out of the kernel (via the +:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global +``boost`` knob is present regardless. Thus it is always possible use the +``boost`` knob instead of the ``cpb`` one which is highly recommended, as that +is more consistent with what all of the other systems do (and the ``cpb`` knob +may not be supported any more in the future). + +The ``cpb`` knob is never present for any processors without the underlying +hardware feature (e.g. all Intel ones), even if the +:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set. + + +.. _Per-entity load tracking: https://lwn.net/Articles/531853/ diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst new file mode 100644 index 000000000..49237ac73 --- /dev/null +++ b/Documentation/admin-guide/pm/index.rst @@ -0,0 +1,10 @@ +================ +Power Management +================ + +.. toctree:: + :maxdepth: 2 + + strategies + system-wide + working-state diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst new file mode 100644 index 000000000..8f1d3de44 --- /dev/null +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -0,0 +1,718 @@ +=============================================== +``intel_pstate`` CPU Performance Scaling Driver +=============================================== + +:: + + Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> + + +General Information +=================== + +``intel_pstate`` is a part of the +:doc:`CPU performance scaling subsystem <cpufreq>` in the Linux kernel +(``CPUFreq``). It is a scaling driver for the Sandy Bridge and later +generations of Intel processors. Note, however, that some of those processors +may not be supported. [To understand ``intel_pstate`` it is necessary to know +how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if +you have not done that yet.] + +For the processors supported by ``intel_pstate``, the P-state concept is broader +than just an operating frequency or an operating performance point (see the +`LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more +information about that). For this reason, the representation of P-states used +by ``intel_pstate`` internally follows the hardware specification (for details +refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual +Volume 3: System Programming Guide <SDM_>`_). However, the ``CPUFreq`` core +uses frequencies for identifying operating performance points of CPUs and +frequencies are involved in the user space interface exposed by it, so +``intel_pstate`` maps its internal representation of P-states to frequencies too +(fortunately, that mapping is unambiguous). At the same time, it would not be +practical for ``intel_pstate`` to supply the ``CPUFreq`` core with a table of +available frequencies due to the possible size of it, so the driver does not do +that. Some functionality of the core is limited by that. + +Since the hardware P-state selection interface used by ``intel_pstate`` is +available at the logical CPU level, the driver always works with individual +CPUs. Consequently, if ``intel_pstate`` is in use, every ``CPUFreq`` policy +object corresponds to one logical CPU and ``CPUFreq`` policies are effectively +equivalent to CPUs. In particular, this means that they become "inactive" every +time the corresponding CPU is taken offline and need to be re-initialized when +it goes back online. + +``intel_pstate`` is not modular, so it cannot be unloaded, which means that the +only way to pass early-configuration-time parameters to it is via the kernel +command line. However, its configuration can be adjusted via ``sysfs`` to a +great extent. In some configurations it even is possible to unregister it via +``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and +registered (see `below <status_attr_>`_). + + +Operation Modes +=============== + +``intel_pstate`` can operate in three different modes: in the active mode with +or without hardware-managed P-states support and in the passive mode. Which of +them will be in effect depends on what kernel command line options are used and +on the capabilities of the processor. + +Active Mode +----------- + +This is the default operation mode of ``intel_pstate``. If it works in this +mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` +policies contains the string "intel_pstate". + +In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and +provides its own scaling algorithms for P-state selection. Those algorithms +can be applied to ``CPUFreq`` policies in the same way as generic scaling +governors (that is, through the ``scaling_governor`` policy attribute in +``sysfs``). [Note that different P-state selection algorithms may be chosen for +different policies, but that is not recommended.] + +They are not generic scaling governors, but their names are the same as the +names of some of those governors. Moreover, confusingly enough, they generally +do not work in the same way as the generic governors they share the names with. +For example, the ``powersave`` P-state selection algorithm provided by +``intel_pstate`` is not a counterpart of the generic ``powersave`` governor +(roughly, it corresponds to the ``schedutil`` and ``ondemand`` governors). + +There are two P-state selection algorithms provided by ``intel_pstate`` in the +active mode: ``powersave`` and ``performance``. The way they both operate +depends on whether or not the hardware-managed P-states (HWP) feature has been +enabled in the processor and possibly on the processor model. + +Which of the P-state selection algorithms is used by default depends on the +:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option. +Namely, if that option is set, the ``performance`` algorithm will be used by +default, and the other one will be used by default if it is not set. + +Active Mode With HWP +~~~~~~~~~~~~~~~~~~~~ + +If the processor supports the HWP feature, it will be enabled during the +processor initialization and cannot be disabled after that. It is possible +to avoid enabling it by passing the ``intel_pstate=no_hwp`` argument to the +kernel in the command line. + +If the HWP feature has been enabled, ``intel_pstate`` relies on the processor to +select P-states by itself, but still it can give hints to the processor's +internal P-state selection logic. What those hints are depends on which P-state +selection algorithm has been applied to the given policy (or to the CPU it +corresponds to). + +Even though the P-state selection is carried out by the processor automatically, +``intel_pstate`` registers utilization update callbacks with the CPU scheduler +in this mode. However, they are not used for running a P-state selection +algorithm, but for periodic updates of the current CPU frequency information to +be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs``. + +HWP + ``performance`` +..................... + +In this configuration ``intel_pstate`` will write 0 to the processor's +Energy-Performance Preference (EPP) knob (if supported) or its +Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's +internal P-state selection logic is expected to focus entirely on performance. + +This will override the EPP/EPB setting coming from the ``sysfs`` interface +(see `Energy vs Performance Hints`_ below). + +Also, in this configuration the range of P-states available to the processor's +internal P-state selection logic is always restricted to the upper boundary +(that is, the maximum P-state that the driver is allowed to use). + +HWP + ``powersave`` +................... + +In this configuration ``intel_pstate`` will set the processor's +Energy-Performance Preference (EPP) knob (if supported) or its +Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was +previously set to via ``sysfs`` (or whatever default value it was +set to by the platform firmware). This usually causes the processor's +internal P-state selection logic to be less performance-focused. + +Active Mode Without HWP +~~~~~~~~~~~~~~~~~~~~~~~ + +This is the default operation mode for processors that do not support the HWP +feature. It also is used by default with the ``intel_pstate=no_hwp`` argument +in the kernel command line. However, in this mode ``intel_pstate`` may refuse +to work with the given processor if it does not recognize it. [Note that +``intel_pstate`` will never refuse to work with any processor with the HWP +feature enabled.] + +In this mode ``intel_pstate`` registers utilization update callbacks with the +CPU scheduler in order to run a P-state selection algorithm, either +``powersave`` or ``performance``, depending on the ``scaling_governor`` policy +setting in ``sysfs``. The current CPU frequency information to be made +available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is +periodically updated by those utilization update callbacks too. + +``performance`` +............... + +Without HWP, this P-state selection algorithm is always the same regardless of +the processor model and platform configuration. + +It selects the maximum P-state it is allowed to use, subject to limits set via +``sysfs``, every time the driver configuration for the given CPU is updated +(e.g. via ``sysfs``). + +This is the default P-state selection algorithm if the +:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option +is set. + +``powersave`` +............. + +Without HWP, this P-state selection algorithm is similar to the algorithm +implemented by the generic ``schedutil`` scaling governor except that the +utilization metric used by it is based on numbers coming from feedback +registers of the CPU. It generally selects P-states proportional to the +current CPU utilization. + +This algorithm is run by the driver's utilization update callback for the +given CPU when it is invoked by the CPU scheduler, but not more often than +every 10 ms. Like in the ``performance`` case, the hardware configuration +is not touched if the new P-state turns out to be the same as the current +one. + +This is the default P-state selection algorithm if the +:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option +is not set. + +Passive Mode +------------ + +This mode is used if the ``intel_pstate=passive`` argument is passed to the +kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting too). +Like in the active mode without HWP support, in this mode ``intel_pstate`` may +refuse to work with the given processor if it does not recognize it. + +If the driver works in this mode, the ``scaling_driver`` policy attribute in +``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq". +Then, the driver behaves like a regular ``CPUFreq`` scaling driver. That is, +it is invoked by generic scaling governors when necessary to talk to the +hardware in order to change the P-state of a CPU (in particular, the +``schedutil`` governor can invoke it directly from scheduler context). + +While in this mode, ``intel_pstate`` can be used with all of the (generic) +scaling governors listed by the ``scaling_available_governors`` policy attribute +in ``sysfs`` (and the P-state selection algorithms described above are not +used). Then, it is responsible for the configuration of policy objects +corresponding to CPUs and provides the ``CPUFreq`` core (and the scaling +governors attached to the policy objects) with accurate information on the +maximum and minimum operating frequencies supported by the hardware (including +the so-called "turbo" frequency ranges). In other words, in the passive mode +the entire range of available P-states is exposed by ``intel_pstate`` to the +``CPUFreq`` core. However, in this mode the driver does not register +utilization update callbacks with the CPU scheduler and the ``scaling_cur_freq`` +information comes from the ``CPUFreq`` core (and is the last frequency selected +by the current scaling governor for the given policy). + + +.. _turbo: + +Turbo P-states Support +====================== + +In the majority of cases, the entire range of P-states available to +``intel_pstate`` can be divided into two sub-ranges that correspond to +different types of processor behavior, above and below a boundary that +will be referred to as the "turbo threshold" in what follows. + +The P-states above the turbo threshold are referred to as "turbo P-states" and +the whole sub-range of P-states they belong to is referred to as the "turbo +range". These names are related to the Turbo Boost technology allowing a +multicore processor to opportunistically increase the P-state of one or more +cores if there is enough power to do that and if that is not going to cause the +thermal envelope of the processor package to be exceeded. + +Specifically, if software sets the P-state of a CPU core within the turbo range +(that is, above the turbo threshold), the processor is permitted to take over +performance scaling control for that core and put it into turbo P-states of its +choice going forward. However, that permission is interpreted differently by +different processor generations. Namely, the Sandy Bridge generation of +processors will never use any P-states above the last one set by software for +the given core, even if it is within the turbo range, whereas all of the later +processor generations will take it as a license to use any P-states from the +turbo range, even above the one set by software. In other words, on those +processors setting any P-state from the turbo range will enable the processor +to put the given core into all turbo P-states up to and including the maximum +supported one as it sees fit. + +One important property of turbo P-states is that they are not sustainable. More +precisely, there is no guarantee that any CPUs will be able to stay in any of +those states indefinitely, because the power distribution within the processor +package may change over time or the thermal envelope it was designed for might +be exceeded if a turbo P-state was used for too long. + +In turn, the P-states below the turbo threshold generally are sustainable. In +fact, if one of them is set by software, the processor is not expected to change +it to a lower one unless in a thermal stress or a power limit violation +situation (a higher P-state may still be used if it is set for another CPU in +the same package at the same time, for example). + +Some processors allow multiple cores to be in turbo P-states at the same time, +but the maximum P-state that can be set for them generally depends on the number +of cores running concurrently. The maximum turbo P-state that can be set for 3 +cores at the same time usually is lower than the analogous maximum P-state for +2 cores, which in turn usually is lower than the maximum turbo P-state that can +be set for 1 core. The one-core maximum turbo P-state is thus the maximum +supported one overall. + +The maximum supported turbo P-state, the turbo threshold (the maximum supported +non-turbo P-state) and the minimum supported P-state are specific to the +processor model and can be determined by reading the processor's model-specific +registers (MSRs). Moreover, some processors support the Configurable TDP +(Thermal Design Power) feature and, when that feature is enabled, the turbo +threshold effectively becomes a configurable value that can be set by the +platform firmware. + +Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes +the entire range of available P-states, including the whole turbo range, to the +``CPUFreq`` core and (in the passive mode) to generic scaling governors. This +generally causes turbo P-states to be set more often when ``intel_pstate`` is +used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_ +for more information). + +Moreover, since ``intel_pstate`` always knows what the real turbo threshold is +(even if the Configurable TDP feature is enabled in the processor), its +``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should +work as expected in all cases (that is, if set to disable turbo P-states, it +always should prevent ``intel_pstate`` from using them). + + +Processor Support +================= + +To handle a given processor ``intel_pstate`` requires a number of different +pieces of information on it to be known, including: + + * The minimum supported P-state. + + * The maximum supported `non-turbo P-state <turbo_>`_. + + * Whether or not turbo P-states are supported at all. + + * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states + are supported). + + * The scaling formula to translate the driver's internal representation + of P-states into frequencies and the other way around. + +Generally, ways to obtain that information are specific to the processor model +or family. Although it often is possible to obtain all of it from the processor +itself (using model-specific registers), there are cases in which hardware +manuals need to be consulted to get to it too. + +For this reason, there is a list of supported processors in ``intel_pstate`` and +the driver initialization will fail if the detected processor is not in that +list, unless it supports the `HWP feature <Active Mode_>`_. [The interface to +obtain all of the information listed above is the same for all of the processors +supporting the HWP feature, which is why they all are supported by +``intel_pstate``.] + + +User Space Interface in ``sysfs`` +================================= + +Global Attributes +----------------- + +``intel_pstate`` exposes several global attributes (files) in ``sysfs`` to +control its functionality at the system level. They are located in the +``/sys/devices/system/cpu/intel_pstate/`` directory and affect all CPUs. + +Some of them are not present if the ``intel_pstate=per_cpu_perf_limits`` +argument is passed to the kernel in the command line. + +``max_perf_pct`` + Maximum P-state the driver is allowed to set in percent of the + maximum supported performance level (the highest supported `turbo + P-state <turbo_>`_). + + This attribute will not be exposed if the + ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel + command line. + +``min_perf_pct`` + Minimum P-state the driver is allowed to set in percent of the + maximum supported performance level (the highest supported `turbo + P-state <turbo_>`_). + + This attribute will not be exposed if the + ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel + command line. + +``num_pstates`` + Number of P-states supported by the processor (between 0 and 255 + inclusive) including both turbo and non-turbo P-states (see + `Turbo P-states Support`_). + + The value of this attribute is not affected by the ``no_turbo`` + setting described `below <no_turbo_attr_>`_. + + This attribute is read-only. + +``turbo_pct`` + Ratio of the `turbo range <turbo_>`_ size to the size of the entire + range of supported P-states, in percent. + + This attribute is read-only. + +.. _no_turbo_attr: + +``no_turbo`` + If set (equal to 1), the driver is not allowed to set any turbo P-states + (see `Turbo P-states Support`_). If unset (equalt to 0, which is the + default), turbo P-states can be set by the driver. + [Note that ``intel_pstate`` does not support the general ``boost`` + attribute (supported by some other scaling drivers) which is replaced + by this one.] + + This attrubute does not affect the maximum supported frequency value + supplied to the ``CPUFreq`` core and exposed via the policy interface, + but it affects the maximum possible value of per-policy P-state limits + (see `Interpretation of Policy Attributes`_ below for details). + +``hwp_dynamic_boost`` + This attribute is only present if ``intel_pstate`` works in the + `active mode with the HWP feature enabled <Active Mode With HWP_>`_ in + the processor. If set (equal to 1), it causes the minimum P-state limit + to be increased dynamically for a short time whenever a task previously + waiting on I/O is selected to run on a given logical CPU (the purpose + of this mechanism is to improve performance). + + This setting has no effect on logical CPUs whose minimum P-state limit + is directly set to the highest non-turbo P-state or above it. + +.. _status_attr: + +``status`` + Operation mode of the driver: "active", "passive" or "off". + + "active" + The driver is functional and in the `active mode + <Active Mode_>`_. + + "passive" + The driver is functional and in the `passive mode + <Passive Mode_>`_. + + "off" + The driver is not functional (it is not registered as a scaling + driver with the ``CPUFreq`` core). + + This attribute can be written to in order to change the driver's + operation mode or to unregister it. The string written to it must be + one of the possible values of it and, if successful, the write will + cause the driver to switch over to the operation mode represented by + that string - or to be unregistered in the "off" case. [Actually, + switching over from the active mode to the passive mode or the other + way around causes the driver to be unregistered and registered again + with a different set of callbacks, so all of its settings (the global + as well as the per-policy ones) are then reset to their default + values, possibly depending on the target operation mode.] + + That only is supported in some configurations, though (for example, if + the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, + the operation mode of the driver cannot be changed), and if it is not + supported in the current configuration, writes to this attribute will + fail with an appropriate error. + +Interpretation of Policy Attributes +----------------------------------- + +The interpretation of some ``CPUFreq`` policy attributes described in +:doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver +and it generally depends on the driver's `operation mode <Operation Modes_>`_. + +First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and +``scaling_cur_freq`` attributes are produced by applying a processor-specific +multiplier to the internal P-state representation used by ``intel_pstate``. +Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq`` +attributes are capped by the frequency corresponding to the maximum P-state that +the driver is allowed to set. + +If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is +not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq`` +and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency. +Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and +``scaling_min_freq`` to go down to that value if they were above it before. +However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be +restored after unsetting ``no_turbo``, unless these attributes have been written +to after ``no_turbo`` was set. + +If ``no_turbo`` is not set, the maximum possible value of ``scaling_max_freq`` +and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state, +which also is the value of ``cpuinfo_max_freq`` in either case. + +Next, the following policy attributes have special meaning if +``intel_pstate`` works in the `active mode <Active Mode_>`_: + +``scaling_available_governors`` + List of P-state selection algorithms provided by ``intel_pstate``. + +``scaling_governor`` + P-state selection algorithm provided by ``intel_pstate`` currently in + use with the given policy. + +``scaling_cur_freq`` + Frequency of the average P-state of the CPU represented by the given + policy for the time interval between the last two invocations of the + driver's utilization update callback by the CPU scheduler for that CPU. + +The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the +same as for other scaling drivers. + +Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate`` +depends on the operation mode of the driver. Namely, it is either +"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the +`passive mode <Passive Mode_>`_). + +Coordination of P-State Limits +------------------------------ + +``intel_pstate`` allows P-state limits to be set in two ways: with the help of +the ``max_perf_pct`` and ``min_perf_pct`` `global attributes +<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq`` +``CPUFreq`` policy attributes. The coordination between those limits is based +on the following rules, regardless of the current operation mode of the driver: + + 1. All CPUs are affected by the global limits (that is, none of them can be + requested to run faster than the global maximum and none of them can be + requested to run slower than the global minimum). + + 2. Each individual CPU is affected by its own per-policy limits (that is, it + cannot be requested to run faster than its own per-policy maximum and it + cannot be requested to run slower than its own per-policy minimum). + + 3. The global and per-policy limits can be set independently. + +If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the +resulting effective values are written into its registers whenever the limits +change in order to request its internal P-state selection logic to always set +P-states within these limits. Otherwise, the limits are taken into account by +scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver +every time before setting a new P-state for a CPU. + +Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument +is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed +at all and the only way to set the limits is by using the policy attributes. + + +Energy vs Performance Hints +--------------------------- + +If ``intel_pstate`` works in the `active mode with the HWP feature enabled +<Active Mode With HWP_>`_ in the processor, additional attributes are present +in every ``CPUFreq`` policy directory in ``sysfs``. They are intended to allow +user space to help ``intel_pstate`` to adjust the processor's internal P-state +selection logic by focusing it on performance or on energy-efficiency, or +somewhere between the two extremes: + +``energy_performance_preference`` + Current value of the energy vs performance hint for the given policy + (or the CPU represented by it). + + The hint can be changed by writing to this attribute. + +``energy_performance_available_preferences`` + List of strings that can be written to the + ``energy_performance_preference`` attribute. + + They represent different energy vs performance hints and should be + self-explanatory, except that ``default`` represents whatever hint + value was set by the platform firmware. + +Strings written to the ``energy_performance_preference`` attribute are +internally translated to integer values written to the processor's +Energy-Performance Preference (EPP) knob (if supported) or its +Energy-Performance Bias (EPB) knob. + +[Note that tasks may by migrated from one CPU to another by the scheduler's +load-balancing algorithm and if different energy vs performance hints are +set for those CPUs, that may lead to undesirable outcomes. To avoid such +issues it is better to set the same energy vs performance hint for all CPUs +or to pin every task potentially sensitive to them to a specific CPU.] + +.. _acpi-cpufreq: + +``intel_pstate`` vs ``acpi-cpufreq`` +==================================== + +On the majority of systems supported by ``intel_pstate``, the ACPI tables +provided by the platform firmware contain ``_PSS`` objects returning information +that can be used for CPU performance scaling (refer to the `ACPI specification`_ +for details on the ``_PSS`` objects and the format of the information returned +by them). + +The information returned by the ACPI ``_PSS`` objects is used by the +``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate`` +the ``acpi-cpufreq`` driver uses the same hardware CPU performance scaling +interface, but the set of P-states it can use is limited by the ``_PSS`` +output. + +On those systems each ``_PSS`` object returns a list of P-states supported by +the corresponding CPU which basically is a subset of the P-states range that can +be used by ``intel_pstate`` on the same system, with one exception: the whole +`turbo range <turbo_>`_ is represented by one item in it (the topmost one). By +convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz +than the frequency of the highest non-turbo P-state listed by it, but the +corresponding P-state representation (following the hardware specification) +returned for it matches the maximum supported turbo P-state (or is the +special value 255 meaning essentially "go as high as you can get"). + +The list of P-states returned by ``_PSS`` is reflected by the table of +available frequencies supplied by ``acpi-cpufreq`` to the ``CPUFreq`` core and +scaling governors and the minimum and maximum supported frequencies reported by +it come from that list as well. In particular, given the special representation +of the turbo range described above, this means that the maximum supported +frequency reported by ``acpi-cpufreq`` is higher by 1 MHz than the frequency +of the highest supported non-turbo P-state listed by ``_PSS`` which, of course, +affects decisions made by the scaling governors, except for ``powersave`` and +``performance``. + +For example, if a given governor attempts to select a frequency proportional to +estimated CPU load and maps the load of 100% to the maximum supported frequency +(possibly multiplied by a constant), then it will tend to choose P-states below +the turbo threshold if ``acpi-cpufreq`` is used as the scaling driver, because +in that case the turbo range corresponds to a small fraction of the frequency +band it can use (1 MHz vs 1 GHz or more). In consequence, it will only go to +the turbo range for the highest loads and the other loads above 50% that might +benefit from running at turbo frequencies will be given non-turbo P-states +instead. + +One more issue related to that may appear on systems supporting the +`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the +turbo threshold. Namely, if that is not coordinated with the lists of P-states +returned by ``_PSS`` properly, there may be more than one item corresponding to +a turbo P-state in those lists and there may be a problem with avoiding the +turbo range (if desirable or necessary). Usually, to avoid using turbo +P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed +by ``_PSS``, but that is not sufficient when there are other turbo P-states in +the list returned by it. + +Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the +`passive mode <Passive Mode_>`_, except that the number of P-states it can set +is limited to the ones listed by the ACPI ``_PSS`` objects. + + +Kernel Command Line Options for ``intel_pstate`` +================================================ + +Several kernel command line options can be used to pass early-configuration-time +parameters to ``intel_pstate`` in order to enforce specific behavior of it. All +of them have to be prepended with the ``intel_pstate=`` prefix. + +``disable`` + Do not register ``intel_pstate`` as the scaling driver even if the + processor is supported by it. + +``passive`` + Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to + start with. + + This option implies the ``no_hwp`` one described below. + +``force`` + Register ``intel_pstate`` as the scaling driver instead of + ``acpi-cpufreq`` even if the latter is preferred on the given system. + + This may prevent some platform features (such as thermal controls and + power capping) that rely on the availability of ACPI P-states + information from functioning as expected, so it should be used with + caution. + + This option does not work with processors that are not supported by + ``intel_pstate`` and on platforms where the ``pcc-cpufreq`` scaling + driver is used instead of ``acpi-cpufreq``. + +``no_hwp`` + Do not enable the `hardware-managed P-states (HWP) feature + <Active Mode With HWP_>`_ even if it is supported by the processor. + +``hwp_only`` + Register ``intel_pstate`` as the scaling driver only if the + `hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is + supported by the processor. + +``support_acpi_ppc`` + Take ACPI ``_PPC`` performance limits into account. + + If the preferred power management profile in the FADT (Fixed ACPI + Description Table) is set to "Enterprise Server" or "Performance + Server", the ACPI ``_PPC`` limits are taken into account by default + and this option has no effect. + +``per_cpu_perf_limits`` + Use per-logical-CPU P-State limits (see `Coordination of P-state + Limits`_ for details). + + +Diagnostics and Tuning +====================== + +Trace Events +------------ + +There are two static trace events that can be used for ``intel_pstate`` +diagnostics. One of them is the ``cpu_frequency`` trace event generally used +by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific +to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if +it works in the `active mode <Active Mode_>`_. + +The following sequence of shell commands can be used to enable them and see +their output (if the kernel is generally configured to support event tracing):: + + # cd /sys/kernel/debug/tracing/ + # echo 1 > events/power/pstate_sample/enable + # echo 1 > events/power/cpu_frequency/enable + # cat trace + gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476 + cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 + +If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the +``cpu_frequency`` trace event will be triggered either by the ``schedutil`` +scaling governor (for the policies it is attached to), or by the ``CPUFreq`` +core (for the policies with other scaling governors). + +``ftrace`` +---------- + +The ``ftrace`` interface can be used for low-level diagnostics of +``intel_pstate``. For example, to check how often the function to set a +P-state is called, the ``ftrace`` filter can be set to to +:c:func:`intel_pstate_set_pstate`:: + + # cd /sys/kernel/debug/tracing/ + # cat available_filter_functions | grep -i pstate + intel_pstate_set_pstate + intel_pstate_cpu_init + ... + # echo intel_pstate_set_pstate > set_ftrace_filter + # echo function > current_tracer + # cat trace | head -15 + # tracer: function + # + # entries-in-buffer/entries-written: 80/80 #P:4 + # + # _-----=> irqs-off + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # ||| / delay + # TASK-PID CPU# |||| TIMESTAMP FUNCTION + # | | | |||| | | + Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func + gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func + gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func + <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func + + +.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf +.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html +.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst new file mode 100644 index 000000000..dbf5acd49 --- /dev/null +++ b/Documentation/admin-guide/pm/sleep-states.rst @@ -0,0 +1,245 @@ +=================== +System Sleep States +=================== + +:: + + Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> + +Sleep states are global low-power states of the entire system in which user +space code cannot be executed and the overall system activity is significantly +reduced. + + +Sleep States That Can Be Supported +================================== + +Depending on its configuration and the capabilities of the platform it runs on, +the Linux kernel can support up to four system sleep states, including +hibernation and up to three variants of system suspend. The sleep states that +can be supported by the kernel are listed below. + +.. _s2idle: + +Suspend-to-Idle +--------------- + +This is a generic, pure software, light-weight variant of system suspend (also +referred to as S2I or S2Idle). It allows more energy to be saved relative to +runtime idle by freezing user space, suspending the timekeeping and putting all +I/O devices into low-power states (possibly lower-power than available in the +working state), such that the processors can spend time in their deepest idle +states while the system is suspended. + +The system is woken up from this state by in-band interrupts, so theoretically +any devices that can cause interrupts to be generated in the working state can +also be set up as wakeup devices for S2Idle. + +This state can be used on platforms without support for :ref:`standby <standby>` +or :ref:`suspend-to-RAM <s2ram>`, or it can be used in addition to any of the +deeper system suspend variants to provide reduced resume latency. It is always +supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option is set. + +.. _standby: + +Standby +------- + +This state, if supported, offers moderate, but real, energy savings, while +providing a relatively straightforward transition back to the working state. No +operating state is lost (the system core logic retains power), so the system can +go back to where it left off easily enough. + +In addition to freezing user space, suspending the timekeeping and putting all +I/O devices into low-power states, which is done for :ref:`suspend-to-idle +<s2idle>` too, nonboot CPUs are taken offline and all low-level system functions +are suspended during transitions into this state. For this reason, it should +allow more energy to be saved relative to :ref:`suspend-to-idle <s2idle>`, but +the resume latency will generally be greater than for that state. + +The set of devices that can wake up the system from this state usually is +reduced relative to :ref:`suspend-to-idle <s2idle>` and it may be necessary to +rely on the platform for setting up the wakeup functionality as appropriate. + +This state is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration +option is set and the support for it is registered by the platform with the +core system suspend subsystem. On ACPI-based systems this state is mapped to +the S1 system state defined by ACPI. + +.. _s2ram: + +Suspend-to-RAM +-------------- + +This state (also referred to as STR or S2RAM), if supported, offers significant +energy savings as everything in the system is put into a low-power state, except +for memory, which should be placed into the self-refresh mode to retain its +contents. All of the steps carried out when entering :ref:`standby <standby>` +are also carried out during transitions to S2RAM. Additional operations may +take place depending on the platform capabilities. In particular, on ACPI-based +systems the kernel passes control to the platform firmware (BIOS) as the last +step during S2RAM transitions and that usually results in powering down some +more low-level components that are not directly controlled by the kernel. + +The state of devices and CPUs is saved and held in memory. All devices are +suspended and put into low-power states. In many cases, all peripheral buses +lose power when entering S2RAM, so devices must be able to handle the transition +back to the "on" state. + +On ACPI-based systems S2RAM requires some minimal boot-strapping code in the +platform firmware to resume the system from it. This may be the case on other +platforms too. + +The set of devices that can wake up the system from S2RAM usually is reduced +relative to :ref:`suspend-to-idle <s2idle>` and :ref:`standby <standby>` and it +may be necessary to rely on the platform for setting up the wakeup functionality +as appropriate. + +S2RAM is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option +is set and the support for it is registered by the platform with the core system +suspend subsystem. On ACPI-based systems it is mapped to the S3 system state +defined by ACPI. + +.. _hibernation: + +Hibernation +----------- + +This state (also referred to as Suspend-to-Disk or STD) offers the greatest +energy savings and can be used even in the absence of low-level platform support +for system suspend. However, it requires some low-level code for resuming the +system to be present for the underlying CPU architecture. + +Hibernation is significantly different from any of the system suspend variants. +It takes three system state changes to put it into hibernation and two system +state changes to resume it. + +First, when hibernation is triggered, the kernel stops all system activity and +creates a snapshot image of memory to be written into persistent storage. Next, +the system goes into a state in which the snapshot image can be saved, the image +is written out and finally the system goes into the target low-power state in +which power is cut from almost all of its hardware components, including memory, +except for a limited set of wakeup devices. + +Once the snapshot image has been written out, the system may either enter a +special low-power state (like ACPI S4), or it may simply power down itself. +Powering down means minimum power draw and it allows this mechanism to work on +any system. However, entering a special low-power state may allow additional +means of system wakeup to be used (e.g. pressing a key on the keyboard or +opening a laptop lid). + +After wakeup, control goes to the platform firmware that runs a boot loader +which boots a fresh instance of the kernel (control may also go directly to +the boot loader, depending on the system configuration, but anyway it causes +a fresh instance of the kernel to be booted). That new instance of the kernel +(referred to as the ``restore kernel``) looks for a hibernation image in +persistent storage and if one is found, it is loaded into memory. Next, all +activity in the system is stopped and the restore kernel overwrites itself with +the image contents and jumps into a special trampoline area in the original +kernel stored in the image (referred to as the ``image kernel``), which is where +the special architecture-specific low-level code is needed. Finally, the +image kernel restores the system to the pre-hibernation state and allows user +space to run again. + +Hibernation is supported if the :c:macro:`CONFIG_HIBERNATION` kernel +configuration option is set. However, this option can only be set if support +for the given CPU architecture includes the low-level code for system resume. + + +Basic ``sysfs`` Interfaces for System Suspend and Hibernation +============================================================= + +The following files located in the :file:`/sys/power/` directory can be used by +user space for sleep states control. + +``state`` + This file contains a list of strings representing sleep states supported + by the kernel. Writing one of these strings into it causes the kernel + to start a transition of the system into the sleep state represented by + that string. + + In particular, the strings "disk", "freeze" and "standby" represent the + :ref:`hibernation <hibernation>`, :ref:`suspend-to-idle <s2idle>` and + :ref:`standby <standby>` sleep states, respectively. The string "mem" + is interpreted in accordance with the contents of the ``mem_sleep`` file + described below. + + If the kernel does not support any system sleep states, this file is + not present. + +``mem_sleep`` + This file contains a list of strings representing supported system + suspend variants and allows user space to select the variant to be + associated with the "mem" string in the ``state`` file described above. + + The strings that may be present in this file are "s2idle", "shallow" + and "deep". The string "s2idle" always represents :ref:`suspend-to-idle + <s2idle>` and, by convention, "shallow" and "deep" represent + :ref:`standby <standby>` and :ref:`suspend-to-RAM <s2ram>`, + respectively. + + Writing one of the listed strings into this file causes the system + suspend variant represented by it to be associated with the "mem" string + in the ``state`` file. The string representing the suspend variant + currently associated with the "mem" string in the ``state`` file + is listed in square brackets. + + If the kernel does not support system suspend, this file is not present. + +``disk`` + This file contains a list of strings representing different operations + that can be carried out after the hibernation image has been saved. The + possible options are as follows: + + ``platform`` + Put the system into a special low-power state (e.g. ACPI S4) to + make additional wakeup options available and possibly allow the + platform firmware to take a simplified initialization path after + wakeup. + + ``shutdown`` + Power off the system. + + ``reboot`` + Reboot the system (useful for diagnostics mostly). + + ``suspend`` + Hybrid system suspend. Put the system into the suspend sleep + state selected through the ``mem_sleep`` file described above. + If the system is successfully woken up from that state, discard + the hibernation image and continue. Otherwise, use the image + to restore the previous state of the system. + + ``test_resume`` + Diagnostic operation. Load the image as though the system had + just woken up from hibernation and the currently running kernel + instance was a restore kernel and follow up with full system + resume. + + Writing one of the listed strings into this file causes the option + represented by it to be selected. + + The currently selected option is shown in square brackets which means + that the operation represented by it will be carried out after creating + and saving the image next time hibernation is triggered by writing + ``disk`` to :file:`/sys/power/state`. + + If the kernel does not support hibernation, this file is not present. + +According to the above, there are two ways to make the system go into the +:ref:`suspend-to-idle <s2idle>` state. The first one is to write "freeze" +directly to :file:`/sys/power/state`. The second one is to write "s2idle" to +:file:`/sys/power/mem_sleep` and then to write "mem" to +:file:`/sys/power/state`. Likewise, there are two ways to make the system go +into the :ref:`standby <standby>` state (the strings to write to the control +files in that case are "standby" or "shallow" and "mem", respectively) if that +state is supported by the platform. However, there is only one way to make the +system go into the :ref:`suspend-to-RAM <s2ram>` state (write "deep" into +:file:`/sys/power/mem_sleep` and "mem" into :file:`/sys/power/state`). + +The default suspend variant (ie. the one to be used without writing anything +into :file:`/sys/power/mem_sleep`) is either "deep" (on the majority of systems +supporting :ref:`suspend-to-RAM <s2ram>`) or "s2idle", but it can be overridden +by the value of the "mem_sleep_default" parameter in the kernel command line. +On some ACPI-based systems, depending on the information in the ACPI tables, the +default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported. diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst new file mode 100644 index 000000000..afe4d3f83 --- /dev/null +++ b/Documentation/admin-guide/pm/strategies.rst @@ -0,0 +1,52 @@ +=========================== +Power Management Strategies +=========================== + +:: + + Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> + +The Linux kernel supports two major high-level power management strategies. + +One of them is based on using global low-power states of the whole system in +which user space code cannot be executed and the overall system activity is +significantly reduced, referred to as :doc:`sleep states <sleep-states>`. The +kernel puts the system into one of these states when requested by user space +and the system stays in it until a special signal is received from one of +designated devices, triggering a transition to the ``working state`` in which +user space code can run. Because sleep states are global and the whole system +is affected by the state changes, this strategy is referred to as the +:doc:`system-wide power management <system-wide>`. + +The other strategy, referred to as the :doc:`working-state power management +<working-state>`, is based on adjusting the power states of individual hardware +components of the system, as needed, in the working state. In consequence, if +this strategy is in use, the working state of the system usually does not +correspond to any particular physical configuration of it, but can be treated as +a metastate covering a range of different power states of the system in which +the individual components of it can be either ``active`` (in use) or +``inactive`` (idle). If they are active, they have to be in power states +allowing them to process data and to be accessed by software. In turn, if they +are inactive, ideally, they should be in low-power states in which they may not +be accessible. + +If all of the system components are active, the system as a whole is regarded as +"runtime active" and that situation typically corresponds to the maximum power +draw (or maximum energy usage) of it. If all of them are inactive, the system +as a whole is regarded as "runtime idle" which may be very close to a sleep +state from the physical system configuration and power draw perspective, but +then it takes much less time and effort to start executing user space code than +for the same system in a sleep state. However, transitions from sleep states +back to the working state can only be started by a limited set of devices, so +typically the system can spend much more time in a sleep state than it can be +runtime idle in one go. For this reason, systems usually use less energy in +sleep states than when they are runtime idle most of the time. + +Moreover, the two power management strategies address different usage scenarios. +Namely, if the user indicates that the system will not be in use going forward, +for example by closing its lid (if the system is a laptop), it probably should +go into a sleep state at that point. On the other hand, if the user simply goes +away from the laptop keyboard, it probably should stay in the working state and +use the working-state power management in case it becomes idle, because the user +may come back to it at any time and then may want the system to be immediately +accessible. diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst new file mode 100644 index 000000000..0c81e4c5d --- /dev/null +++ b/Documentation/admin-guide/pm/system-wide.rst @@ -0,0 +1,8 @@ +============================ +System-Wide Power Management +============================ + +.. toctree:: + :maxdepth: 2 + + sleep-states diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst new file mode 100644 index 000000000..fa01bf083 --- /dev/null +++ b/Documentation/admin-guide/pm/working-state.rst @@ -0,0 +1,9 @@ +============================== +Working-State Power Management +============================== + +.. toctree:: + :maxdepth: 2 + + cpufreq + intel_pstate diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst new file mode 100644 index 000000000..6dbcc5481 --- /dev/null +++ b/Documentation/admin-guide/ramoops.rst @@ -0,0 +1,156 @@ +Ramoops oops/panic logger +========================= + +Sergiu Iordache <sergiu@chromium.org> + +Updated: 17 November 2011 + +Introduction +------------ + +Ramoops is an oops/panic logger that writes its logs to RAM before the system +crashes. It works by logging oopses and panics in a circular buffer. Ramoops +needs a system with persistent RAM so that the content of that area can +survive after a restart. + +Ramoops concepts +---------------- + +Ramoops uses a predefined memory area to store the dump. The start and size +and type of the memory area are set using three variables: + + * ``mem_address`` for the start + * ``mem_size`` for the size. The memory size will be rounded down to a + power of two. + * ``mem_type`` to specifiy if the memory type (default is pgprot_writecombine). + +Typically the default value of ``mem_type=0`` should be used as that sets the pstore +mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use +``pgprot_noncached``, which only works on some platforms. This is because pstore +depends on atomic operations. At least on ARM, pgprot_noncached causes the +memory to be mapped strongly ordered, and atomic operations on strongly ordered +memory are implementation defined, and won't work on many ARMs such as omaps. + +The memory area is divided into ``record_size`` chunks (also rounded down to +power of two) and each oops/panic writes a ``record_size`` chunk of +information. + +Dumping both oopses and panics can be done by setting 1 in the ``dump_oops`` +variable while setting 0 in that variable dumps only the panics. + +The module uses a counter to record multiple dumps but the counter gets reset +on restart (i.e. new dumps after the restart will overwrite old ones). + +Ramoops also supports software ECC protection of persistent memory regions. +This might be useful when a hardware reset was used to bring the machine back +to life (i.e. a watchdog triggered). In such cases, RAM may be somewhat +corrupt, but usually it is restorable. + +Setting the parameters +---------------------- + +Setting the ramoops parameters can be done in several different manners: + + A. Use the module parameters (which have the names of the variables described + as before). For quick debugging, you can also reserve parts of memory during + boot and then use the reserved memory for ramoops. For example, assuming a + machine with > 128 MB of memory, the following kernel command line will tell + the kernel to use only the first 128 MB of memory, and place ECC-protected + ramoops region at 128 MB boundary:: + + mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1 + + B. Use Device Tree bindings, as described in + ``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``. + For example:: + + reserved-memory { + #address-cells = <2>; + #size-cells = <2>; + ranges; + + ramoops@8f000000 { + compatible = "ramoops"; + reg = <0 0x8f000000 0 0x100000>; + record-size = <0x4000>; + console-size = <0x4000>; + }; + }; + + C. Use a platform device and set the platform data. The parameters can then + be set through that platform data. An example of doing that is: + + .. code-block:: c + + #include <linux/pstore_ram.h> + [...] + + static struct ramoops_platform_data ramoops_data = { + .mem_size = <...>, + .mem_address = <...>, + .mem_type = <...>, + .record_size = <...>, + .dump_oops = <...>, + .ecc = <...>, + }; + + static struct platform_device ramoops_dev = { + .name = "ramoops", + .dev = { + .platform_data = &ramoops_data, + }, + }; + + [... inside a function ...] + int ret; + + ret = platform_device_register(&ramoops_dev); + if (ret) { + printk(KERN_ERR "unable to register platform device\n"); + return ret; + } + +You can specify either RAM memory or peripheral devices' memory. However, when +specifying RAM, be sure to reserve the memory by issuing memblock_reserve() +very early in the architecture code, e.g.:: + + #include <linux/memblock.h> + + memblock_reserve(ramoops_data.mem_address, ramoops_data.mem_size); + +Dump format +----------- + +The data dump begins with a header, currently defined as ``====`` followed by a +timestamp and a new line. The dump then continues with the actual data. + +Reading the data +---------------- + +The dump data can be read from the pstore filesystem. The format for these +files is ``dmesg-ramoops-N``, where N is the record number in memory. To delete +a stored record from RAM, simply unlink the respective pstore file. + +Persistent function tracing +--------------------------- + +Persistent function tracing might be useful for debugging software or hardware +related hangs. The functions call chain log is stored in a ``ftrace-ramoops`` +file. Here is an example of usage:: + + # mount -t debugfs debugfs /sys/kernel/debug/ + # echo 1 > /sys/kernel/debug/pstore/record_ftrace + # reboot -f + [...] + # mount -t pstore pstore /mnt/ + # tail /mnt/ftrace-ramoops + 0 ffffffff8101ea64 ffffffff8101bcda native_apic_mem_read <- disconnect_bsp_APIC+0x6a/0xc0 + 0 ffffffff8101ea44 ffffffff8101bcf6 native_apic_mem_write <- disconnect_bsp_APIC+0x86/0xc0 + 0 ffffffff81020084 ffffffff8101a4b5 hpet_disable <- native_machine_shutdown+0x75/0x90 + 0 ffffffff81005f94 ffffffff8101a4bb iommu_shutdown_noop <- native_machine_shutdown+0x7b/0x90 + 0 ffffffff8101a6a1 ffffffff8101a437 native_machine_emergency_restart <- native_machine_restart+0x37/0x40 + 0 ffffffff811f9876 ffffffff8101a73a acpi_reboot <- native_machine_emergency_restart+0xaa/0x1e0 + 0 ffffffff8101a514 ffffffff8101a772 mach_reboot_fixups <- native_machine_emergency_restart+0xe2/0x1e0 + 0 ffffffff811d9c54 ffffffff8101a7a0 __const_udelay <- native_machine_emergency_restart+0x110/0x1e0 + 0 ffffffff811d9c34 ffffffff811d9c80 __delay <- __const_udelay+0x30/0x40 + 0 ffffffff811d9d14 ffffffff811d9c3f delay_tsc <- __delay+0xf/0x20 diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst new file mode 100644 index 000000000..197896718 --- /dev/null +++ b/Documentation/admin-guide/ras.rst @@ -0,0 +1,1210 @@ +.. include:: <isonum.txt> + +============================================ +Reliability, Availability and Serviceability +============================================ + +RAS concepts +************ + +Reliability, Availability and Serviceability (RAS) is a concept used on +servers meant to measure their robustness. + +Reliability + is the probability that a system will produce correct outputs. + + * Generally measured as Mean Time Between Failures (MTBF) + * Enhanced by features that help to avoid, detect and repair hardware faults + +Availability + is the probability that a system is operational at a given time + + * Generally measured as a percentage of downtime per a period of time + * Often uses mechanisms to detect and correct hardware faults in + runtime; + +Serviceability (or maintainability) + is the simplicity and speed with which a system can be repaired or + maintained + + * Generally measured on Mean Time Between Repair (MTBR) + +Improving RAS +------------- + +In order to reduce systems downtime, a system should be capable of detecting +hardware errors, and, when possible correcting them in runtime. It should +also provide mechanisms to detect hardware degradation, in order to warn +the system administrator to take the action of replacing a component before +it causes data loss or system downtime. + +Among the monitoring measures, the most usual ones include: + +* CPU – detect errors at instruction execution and at L1/L2/L3 caches; +* Memory – add error correction logic (ECC) to detect and correct errors; +* I/O – add CRC checksums for transferred data; +* Storage – RAID, journal file systems, checksums, + Self-Monitoring, Analysis and Reporting Technology (SMART). + +By monitoring the number of occurrences of error detections, it is possible +to identify if the probability of hardware errors is increasing, and, on such +case, do a preventive maintenance to replace a degraded component while +those errors are correctable. + +Types of errors +--------------- + +Most mechanisms used on modern systems use use technologies like Hamming +Codes that allow error correction when the number of errors on a bit packet +is below a threshold. If the number of errors is above, those mechanisms +can indicate with a high degree of confidence that an error happened, but +they can't correct. + +Also, sometimes an error occur on a component that it is not used. For +example, a part of the memory that it is not currently allocated. + +That defines some categories of errors: + +* **Correctable Error (CE)** - the error detection mechanism detected and + corrected the error. Such errors are usually not fatal, although some + Kernel mechanisms allow the system administrator to consider them as fatal. + +* **Uncorrected Error (UE)** - the amount of errors happened above the error + correction threshold, and the system was unable to auto-correct. + +* **Fatal Error** - when an UE error happens on a critical component of the + system (for example, a piece of the Kernel got corrupted by an UE), the + only reliable way to avoid data corruption is to hang or reboot the machine. + +* **Non-fatal Error** - when an UE error happens on an unused component, + like a CPU in power down state or an unused memory bank, the system may + still run, eventually replacing the affected hardware by a hot spare, + if available. + + Also, when an error happens on a userspace process, it is also possible to + kill such process and let userspace restart it. + +The mechanism for handling non-fatal errors is usually complex and may +require the help of some userspace application, in order to apply the +policy desired by the system administrator. + +Identifying a bad hardware component +------------------------------------ + +Just detecting a hardware flaw is usually not enough, as the system needs +to pinpoint to the minimal replaceable unit (MRU) that should be exchanged +to make the hardware reliable again. + +So, it requires not only error logging facilities, but also mechanisms that +will translate the error message to the silkscreen or component label for +the MRU. + +Typically, it is very complex for memory, as modern CPUs interlace memory +from different memory modules, in order to provide a better performance. The +DMI BIOS usually have a list of memory module labels, with can be obtained +using the ``dmidecode`` tool. For example, on a desktop machine, it shows:: + + Memory Device + Total Width: 64 bits + Data Width: 64 bits + Size: 16384 MB + Form Factor: SODIMM + Set: None + Locator: ChannelA-DIMM0 + Bank Locator: BANK 0 + Type: DDR4 + Type Detail: Synchronous + Speed: 2133 MHz + Rank: 2 + Configured Clock Speed: 2133 MHz + +On the above example, a DDR4 SO-DIMM memory module is located at the +system's memory labeled as "BANK 0", as given by the *bank locator* field. +Please notice that, on such system, the *total width* is equal to the +*data width*. It means that such memory module doesn't have error +detection/correction mechanisms. + +Unfortunately, not all systems use the same field to specify the memory +bank. On this example, from an older server, ``dmidecode`` shows:: + + Memory Device + Array Handle: 0x1000 + Error Information Handle: Not Provided + Total Width: 72 bits + Data Width: 64 bits + Size: 8192 MB + Form Factor: DIMM + Set: 1 + Locator: DIMM_A1 + Bank Locator: Not Specified + Type: DDR3 + Type Detail: Synchronous Registered (Buffered) + Speed: 1600 MHz + Rank: 2 + Configured Clock Speed: 1600 MHz + +There, the DDR3 RDIMM memory module is located at the system's memory labeled +as "DIMM_A1", as given by the *locator* field. Please notice that this +memory module has 64 bits of *data width* and 72 bits of *total width*. So, +it has 8 extra bits to be used by error detection and correction mechanisms. +Such kind of memory is called Error-correcting code memory (ECC memory). + +To make things even worse, it is not uncommon that systems with different +labels on their system's board to use exactly the same BIOS, meaning that +the labels provided by the BIOS won't match the real ones. + +ECC memory +---------- + +As mentioned on the previous section, ECC memory has extra bits to be +used for error correction. So, on 64 bit systems, a memory module +has 64 bits of *data width*, and 74 bits of *total width*. So, there are +8 bits extra bits to be used for the error detection and correction +mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_. + +So, when the cpu requests the memory controller to write a word with +*data width*, the memory controller calculates the *syndrome* in real time, +using Hamming code, or some other error correction code, like SECDED+, +producing a code with *total width* size. Such code is then written +on the memory modules. + +At read, the *total width* bits code is converted back, using the same +ECC code used on write, producing a word with *data width* and a *syndrome*. +The word with *data width* is sent to the CPU, even when errors happen. + +The memory controller also looks at the *syndrome* in order to check if +there was an error, and if the ECC code was able to fix such error. +If the error was corrected, a Corrected Error (CE) happened. If not, an +Uncorrected Error (UE) happened. + +The information about the CE/UE errors is stored on some special registers +at the memory controller and can be accessed by reading such registers, +either by BIOS, by some special CPUs or by Linux EDAC driver. On x86 64 +bit CPUs, such errors can also be retrieved via the Machine Check +Architecture (MCA)\ [#f3]_. + +.. [#f1] Please notice that several memory controllers allow operation on a + mode called "Lock-Step", where it groups two memory modules together, + doing 128-bit reads/writes. That gives 16 bits for error correction, with + significantly improves the error correction mechanism, at the expense + that, when an error happens, there's no way to know what memory module is + to blame. So, it has to blame both memory modules. + +.. [#f2] Some memory controllers also allow using memory in mirror mode. + On such mode, the same data is written to two memory modules. At read, + the system checks both memory modules, in order to check if both provide + identical data. On such configuration, when an error happens, there's no + way to know what memory module is to blame. So, it has to blame both + memory modules (or 4 memory modules, if the system is also on Lock-step + mode). + +.. [#f3] For more details about the Machine Check Architecture (MCA), + please read Documentation/x86/x86_64/machinecheck at the Kernel tree. + +EDAC - Error Detection And Correction +************************************* + +.. note:: + + "bluesmoke" was the name for this device driver subsystem when it + was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net. + That site is mostly archaic now and can be used only for historical + purposes. + + When the subsystem was pushed upstream for the first time, on + Kernel 2.6.16, for the first time, it was renamed to ``EDAC``. + +Purpose +------- + +The ``edac`` kernel module's goal is to detect and report hardware errors +that occur within the computer system running under linux. + +Memory +------ + +Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the +primary errors being harvested. These types of errors are harvested by +the ``edac_mc`` device. + +Detecting CE events, then harvesting those events and reporting them, +**can** but must not necessarily be a predictor of future UE events. With +CE events only, the system can and will continue to operate as no data +has been damaged yet. + +However, preventive maintenance and proactive part replacement of memory +modules exhibiting CEs can reduce the likelihood of the dreaded UE events +and system panics. + +Other hardware elements +----------------------- + +A new feature for EDAC, the ``edac_device`` class of device, was added in +the 2.6.23 version of the kernel. + +This new device type allows for non-memory type of ECC hardware detectors +to have their states harvested and presented to userspace via the sysfs +interface. + +Some architectures have ECC detectors for L1, L2 and L3 caches, +along with DMA engines, fabric switches, main data path switches, +interconnections, and various other hardware data paths. If the hardware +reports it, then a edac_device device probably can be constructed to +harvest and present that to userspace. + + +PCI bus scanning +---------------- + +In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors +in order to determine if errors are occurring during data transfers. + +The presence of PCI Parity errors must be examined with a grain of salt. +There are several add-in adapters that do **not** follow the PCI specification +with regards to Parity generation and reporting. The specification says +the vendor should tie the parity status bits to 0 if they do not intend +to generate parity. Some vendors do not do this, and thus the parity bit +can "float" giving false positives. + +There is a PCI device attribute located in sysfs that is checked by +the EDAC PCI scanning code. If that attribute is set, PCI parity/error +scanning is skipped for that device. The attribute is:: + + broken_parity_status + +and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for +PCI devices. + + +Versioning +---------- + +EDAC is composed of a "core" module (``edac_core.ko``) and several Memory +Controller (MC) driver modules. On a given system, the CORE is loaded +and one MC driver will be loaded. Both the CORE and the MC driver (or +``edac_device`` driver) have individual versions that reflect current +release level of their respective modules. + +Thus, to "report" on what version a system is running, one must report +both the CORE's and the MC driver's versions. + + +Loading +------- + +If ``edac`` was statically linked with the kernel then no loading +is necessary. If ``edac`` was built as modules then simply modprobe +the ``edac`` pieces that you need. You should be able to modprobe +hardware-specific modules and have the dependencies load the necessary +core modules. + +Example:: + + $ modprobe amd76x_edac + +loads both the ``amd76x_edac.ko`` memory controller module and the +``edac_mc.ko`` core module. + + +Sysfs interface +--------------- + +EDAC presents a ``sysfs`` interface for control and reporting purposes. It +lives in the /sys/devices/system/edac directory. + +Within this directory there currently reside 2 components: + + ======= ============================== + mc memory controller(s) system + pci PCI control and status system + ======= ============================== + + + +Memory Controller (mc) Model +---------------------------- + +Each ``mc`` device controls a set of memory modules [#f4]_. These modules +are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``). +There can be multiple csrows and multiple channels. + +.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely + used to refer to a memory module, although there are other memory + packaging alternatives, like SO-DIMM, SIMM, etc. Along this document, + and inside the EDAC system, the term "dimm" is used for all memory + modules, even when they use a different kind of packaging. + +Memory controllers allow for several csrows, with 8 csrows being a +typical value. Yet, the actual number of csrows depends on the layout of +a given motherboard, memory controller and memory module characteristics. + +Dual channels allow for dual data length (e. g. 128 bits, on 64 bit systems) +data transfers to/from the CPU from/to memory. Some newer chipsets allow +for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory +controllers. The following example will assume 2 channels: + + +------------+-----------------------+ + | CS Rows | Channels | + +------------+-----------+-----------+ + | | ``ch0`` | ``ch1`` | + +============+===========+===========+ + | ``csrow0`` | DIMM_A0 | DIMM_B0 | + +------------+ | | + | ``csrow1`` | | | + +------------+-----------+-----------+ + | ``csrow2`` | DIMM_A1 | DIMM_B1 | + +------------+ | | + | ``csrow3`` | | | + +------------+-----------+-----------+ + +In the above example, there are 4 physical slots on the motherboard +for memory DIMMs: + + +---------+---------+ + | DIMM_A0 | DIMM_B0 | + +---------+---------+ + | DIMM_A1 | DIMM_B1 | + +---------+---------+ + +Labels for these slots are usually silk-screened on the motherboard. +Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are +channel 1. Notice that there are two csrows possible on a physical DIMM. +These csrows are allocated their csrow assignment based on the slot into +which the memory DIMM is placed. Thus, when 1 DIMM is placed in each +Channel, the csrows cross both DIMMs. + +Memory DIMMs come single or dual "ranked". A rank is a populated csrow. +Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above +will have just one csrow (csrow0). csrow1 will be empty. On the other +hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0 +and csrow1 will be populated. The pattern repeats itself for csrow2 and +csrow3. + +The representation of the above is reflected in the directory +tree in EDAC's sysfs interface. Starting in directory +``/sys/devices/system/edac/mc``, each memory controller will be +represented by its own ``mcX`` directory, where ``X`` is the +index of the MC:: + + ..../edac/mc/ + | + |->mc0 + |->mc1 + |->mc2 + .... + +Under each ``mcX`` directory each ``csrowX`` is again represented by a +``csrowX``, where ``X`` is the csrow index:: + + .../mc/mc0/ + | + |->csrow0 + |->csrow2 + |->csrow3 + .... + +Notice that there is no csrow1, which indicates that csrow0 is composed +of a single ranked DIMMs. This should also apply in both Channels, in +order to have dual-channel mode be operational. Since both csrow2 and +csrow3 are populated, this indicates a dual ranked set of DIMMs for +channels 0 and 1. + +Within each of the ``mcX`` and ``csrowX`` directories are several EDAC +control and attribute files. + +``mcX`` directories +------------------- + +In ``mcX`` directories are EDAC control and attribute files for +this ``X`` instance of the memory controllers. + +For a description of the sysfs API, please see: + + Documentation/ABI/testing/sysfs-devices-edac + + +``dimmX`` or ``rankX`` directories +---------------------------------- + +The recommended way to use the EDAC subsystem is to look at the information +provided by the ``dimmX`` or ``rankX`` directories [#f5]_. + +A typical EDAC system has the following structure under +``/sys/devices/system/edac/``\ [#f6]_:: + + /sys/devices/system/edac/ + ├── mc + │  ├── mc0 + │  │  ├── ce_count + │  │  ├── ce_noinfo_count + │  │  ├── dimm0 + │  │  │  ├── dimm_ce_count + │  │  │  ├── dimm_dev_type + │  │  │  ├── dimm_edac_mode + │  │  │  ├── dimm_label + │  │  │  ├── dimm_location + │  │  │  ├── dimm_mem_type + │  │  │  ├── dimm_ue_count + │  │  │  ├── size + │  │  │  └── uevent + │  │  ├── max_location + │  │  ├── mc_name + │  │  ├── reset_counters + │  │  ├── seconds_since_reset + │  │  ├── size_mb + │  │  ├── ue_count + │  │  ├── ue_noinfo_count + │  │  └── uevent + │  ├── mc1 + │  │  ├── ce_count + │  │  ├── ce_noinfo_count + │  │  ├── dimm0 + │  │  │  ├── dimm_ce_count + │  │  │  ├── dimm_dev_type + │  │  │  ├── dimm_edac_mode + │  │  │  ├── dimm_label + │  │  │  ├── dimm_location + │  │  │  ├── dimm_mem_type + │  │  │  ├── dimm_ue_count + │  │  │  ├── size + │  │  │  └── uevent + │  │  ├── max_location + │  │  ├── mc_name + │  │  ├── reset_counters + │  │  ├── seconds_since_reset + │  │  ├── size_mb + │  │  ├── ue_count + │  │  ├── ue_noinfo_count + │  │  └── uevent + │  └── uevent + └── uevent + +In the ``dimmX`` directories are EDAC control and attribute files for +this ``X`` memory module: + +- ``size`` - Total memory managed by this csrow attribute file + + This attribute file displays, in count of megabytes, the memory + that this csrow contains. + +- ``dimm_ue_count`` - Uncorrectable Errors count attribute file + + This attribute file displays the total count of uncorrectable + errors that have occurred on this DIMM. If panic_on_ue is set + this counter will not have a chance to increment, since EDAC + will panic the system. + +- ``dimm_ce_count`` - Correctable Errors count attribute file + + This attribute file displays the total count of correctable + errors that have occurred on this DIMM. This count is very + important to examine. CEs provide early indications that a + DIMM is beginning to fail. This count field should be + monitored for non-zero values and report such information + to the system administrator. + +- ``dimm_dev_type`` - Device type attribute file + + This attribute file will display what type of DRAM device is + being utilized on this DIMM. + Examples: + + - x1 + - x2 + - x4 + - x8 + +- ``dimm_edac_mode`` - EDAC Mode of operation attribute file + + This attribute file will display what type of Error detection + and correction is being utilized. + +- ``dimm_label`` - memory module label control file + + This control file allows this DIMM to have a label assigned + to it. With this label in the module, when errors occur + the output can provide the DIMM label in the system log. + This becomes vital for panic events to isolate the + cause of the UE event. + + DIMM Labels must be assigned after booting, with information + that correctly identifies the physical slot with its + silk screen label. This information is currently very + motherboard specific and determination of this information + must occur in userland at this time. + +- ``dimm_location`` - location of the memory module + + The location can have up to 3 levels, and describe how the + memory controller identifies the location of a memory module. + Depending on the type of memory and memory controller, it + can be: + + - *csrow* and *channel* - used when the memory controller + doesn't identify a single DIMM - e. g. in ``rankX`` dir; + - *branch*, *channel*, *slot* - typically used on FB-DIMM memory + controllers; + - *channel*, *slot* - used on Nehalem and newer Intel drivers. + +- ``dimm_mem_type`` - Memory Type attribute file + + This attribute file will display what type of memory is currently + on this csrow. Normally, either buffered or unbuffered memory. + Examples: + + - Registered-DDR + - Unbuffered-DDR + +.. [#f5] On some systems, the memory controller doesn't have any logic + to identify the memory module. On such systems, the directory is called ``rankX`` and works on a similar way as the ``csrowX`` directories. + On modern Intel memory controllers, the memory controller identifies the + memory modules directly. On such systems, the directory is called ``dimmX``. + +.. [#f6] There are also some ``power`` directories and ``subsystem`` + symlinks inside the sysfs mapping that are automatically created by + the sysfs subsystem. Currently, they serve no purpose. + +``csrowX`` directories +---------------------- + +When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX`` +directories. As this API doesn't work properly for Rambus, FB-DIMMs and +modern Intel Memory Controllers, this is being deprecated in favor of +``dimmX`` directories. + +In the ``csrowX`` directories are EDAC control and attribute files for +this ``X`` instance of csrow: + + +- ``ue_count`` - Total Uncorrectable Errors count attribute file + + This attribute file displays the total count of uncorrectable + errors that have occurred on this csrow. If panic_on_ue is set + this counter will not have a chance to increment, since EDAC + will panic the system. + + +- ``ce_count`` - Total Correctable Errors count attribute file + + This attribute file displays the total count of correctable + errors that have occurred on this csrow. This count is very + important to examine. CEs provide early indications that a + DIMM is beginning to fail. This count field should be + monitored for non-zero values and report such information + to the system administrator. + + +- ``size_mb`` - Total memory managed by this csrow attribute file + + This attribute file displays, in count of megabytes, the memory + that this csrow contains. + + +- ``mem_type`` - Memory Type attribute file + + This attribute file will display what type of memory is currently + on this csrow. Normally, either buffered or unbuffered memory. + Examples: + + - Registered-DDR + - Unbuffered-DDR + + +- ``edac_mode`` - EDAC Mode of operation attribute file + + This attribute file will display what type of Error detection + and correction is being utilized. + + +- ``dev_type`` - Device type attribute file + + This attribute file will display what type of DRAM device is + being utilized on this DIMM. + Examples: + + - x1 + - x2 + - x4 + - x8 + + +- ``ch0_ce_count`` - Channel 0 CE Count attribute file + + This attribute file will display the count of CEs on this + DIMM located in channel 0. + + +- ``ch0_ue_count`` - Channel 0 UE Count attribute file + + This attribute file will display the count of UEs on this + DIMM located in channel 0. + + +- ``ch0_dimm_label`` - Channel 0 DIMM Label control file + + + This control file allows this DIMM to have a label assigned + to it. With this label in the module, when errors occur + the output can provide the DIMM label in the system log. + This becomes vital for panic events to isolate the + cause of the UE event. + + DIMM Labels must be assigned after booting, with information + that correctly identifies the physical slot with its + silk screen label. This information is currently very + motherboard specific and determination of this information + must occur in userland at this time. + + +- ``ch1_ce_count`` - Channel 1 CE Count attribute file + + + This attribute file will display the count of CEs on this + DIMM located in channel 1. + + +- ``ch1_ue_count`` - Channel 1 UE Count attribute file + + + This attribute file will display the count of UEs on this + DIMM located in channel 0. + + +- ``ch1_dimm_label`` - Channel 1 DIMM Label control file + + This control file allows this DIMM to have a label assigned + to it. With this label in the module, when errors occur + the output can provide the DIMM label in the system log. + This becomes vital for panic events to isolate the + cause of the UE event. + + DIMM Labels must be assigned after booting, with information + that correctly identifies the physical slot with its + silk screen label. This information is currently very + motherboard specific and determination of this information + must occur in userland at this time. + + +System Logging +-------------- + +If logging for UEs and CEs is enabled, then system logs will contain +information indicating that errors have been detected:: + + EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac + EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac + + +The structure of the message is: + + +---------------------------------------+-------------+ + | Content | Example | + +=======================================+=============+ + | The memory controller | MC0 | + +---------------------------------------+-------------+ + | Error type | CE | + +---------------------------------------+-------------+ + | Memory page | 0x283 | + +---------------------------------------+-------------+ + | Offset in the page | 0xce0 | + +---------------------------------------+-------------+ + | The byte granularity | grain 8 | + | or resolution of the error | | + +---------------------------------------+-------------+ + | The error syndrome | 0xb741 | + +---------------------------------------+-------------+ + | Memory row | row 0 | + +---------------------------------------+-------------+ + | Memory channel | channel 1 | + +---------------------------------------+-------------+ + | DIMM label, if set prior | DIMM B1 | + +---------------------------------------+-------------+ + | And then an optional, driver-specific | | + | message that may have additional | | + | information. | | + +---------------------------------------+-------------+ + +Both UEs and CEs with no info will lack all but memory controller, error +type, a notice of "no info" and then an optional, driver-specific error +message. + + +PCI Bus Parity Detection +------------------------ + +On Header Type 00 devices, the primary status is looked at for any +parity error regardless of whether parity is enabled on the device or +not. (The spec indicates parity is generated in some cases). On Header +Type 01 bridges, the secondary status register is also looked at to see +if parity occurred on the bus on the other side of the bridge. + + +Sysfs configuration +------------------- + +Under ``/sys/devices/system/edac/pci`` are control and attribute files as +follows: + + +- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file + + This control file enables or disables the PCI Bus Parity scanning + operation. Writing a 1 to this file enables the scanning. Writing + a 0 to this file disables the scanning. + + Enable:: + + echo "1" >/sys/devices/system/edac/pci/check_pci_parity + + Disable:: + + echo "0" >/sys/devices/system/edac/pci/check_pci_parity + + +- ``pci_parity_count`` - Parity Count + + This attribute file will display the number of parity errors that + have been detected. + + +Module parameters +----------------- + +- ``edac_mc_panic_on_ue`` - Panic on UE control file + + An uncorrectable error will cause a machine panic. This is usually + desirable. It is a bad idea to continue when an uncorrectable error + occurs - it is indeterminate what was uncorrected and the operating + system context might be so mangled that continuing will lead to further + corruption. If the kernel has MCE configured, then EDAC will never + notice the UE. + + LOAD TIME:: + + module/kernel parameter: edac_mc_panic_on_ue=[0|1] + + RUN TIME:: + + echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue + + +- ``edac_mc_log_ue`` - Log UE control file + + + Generate kernel messages describing uncorrectable errors. These errors + are reported through the system message log system. UE statistics + will be accumulated even when UE logging is disabled. + + LOAD TIME:: + + module/kernel parameter: edac_mc_log_ue=[0|1] + + RUN TIME:: + + echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue + + +- ``edac_mc_log_ce`` - Log CE control file + + + Generate kernel messages describing correctable errors. These + errors are reported through the system message log system. + CE statistics will be accumulated even when CE logging is disabled. + + LOAD TIME:: + + module/kernel parameter: edac_mc_log_ce=[0|1] + + RUN TIME:: + + echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce + + +- ``edac_mc_poll_msec`` - Polling period control file + + + The time period, in milliseconds, for polling for error information. + Too small a value wastes resources. Too large a value might delay + necessary handling of errors and might loose valuable information for + locating the error. 1000 milliseconds (once each second) is the current + default. Systems which require all the bandwidth they can get, may + increase this. + + LOAD TIME:: + + module/kernel parameter: edac_mc_poll_msec=[0|1] + + RUN TIME:: + + echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec + + +- ``panic_on_pci_parity`` - Panic on PCI PARITY Error + + + This control file enables or disables panicking when a parity + error has been detected. + + + module/kernel parameter:: + + edac_panic_on_pci_pe=[0|1] + + Enable:: + + echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe + + Disable:: + + echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe + + + +EDAC device type +---------------- + +In the header file, edac_pci.h, there is a series of edac_device structures +and APIs for the EDAC_DEVICE. + +User space access to an edac_device is through the sysfs interface. + +At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices +will appear. + +There is a three level tree beneath the above ``edac`` directory. For example, +the ``test_device_edac`` device (found at the http://bluesmoke.sourceforget.net +website) installs itself as:: + + /sys/devices/system/edac/test-instance + +in this directory are various controls, a symlink and one or more ``instance`` +directories. + +The standard default controls are: + + ============== ======================================================= + log_ce boolean to log CE events + log_ue boolean to log UE events + panic_on_ue boolean to ``panic`` the system if an UE is encountered + (default off, can be set true via startup script) + poll_msec time period between POLL cycles for events + ============== ======================================================= + +The test_device_edac device adds at least one of its own custom control: + + ============== ================================================== + test_bits which in the current test driver does nothing but + show how it is installed. A ported driver can + add one or more such controls and/or attributes + for specific uses. + One out-of-tree driver uses controls here to allow + for ERROR INJECTION operations to hardware + injection registers + ============== ================================================== + +The symlink points to the 'struct dev' that is registered for this edac_device. + +Instances +--------- + +One or more instance directories are present. For the ``test_device_edac`` +case: + + +----------------+ + | test-instance0 | + +----------------+ + + +In this directory there are two default counter attributes, which are totals of +counter in deeper subdirectories. + + ============== ==================================== + ce_count total of CE events of subdirectories + ue_count total of UE events of subdirectories + ============== ==================================== + +Blocks +------ + +At the lowest directory level is the ``block`` directory. There can be 0, 1 +or more blocks specified in each instance: + + +-------------+ + | test-block0 | + +-------------+ + +In this directory the default attributes are: + + ============== ================================================ + ce_count which is counter of CE events for this ``block`` + of hardware being monitored + ue_count which is counter of UE events for this ``block`` + of hardware being monitored + ============== ================================================ + + +The ``test_device_edac`` device adds 4 attributes and 1 control: + + ================== ==================================================== + test-block-bits-0 for every POLL cycle this counter + is incremented + test-block-bits-1 every 10 cycles, this counter is bumped once, + and test-block-bits-0 is set to 0 + test-block-bits-2 every 100 cycles, this counter is bumped once, + and test-block-bits-1 is set to 0 + test-block-bits-3 every 1000 cycles, this counter is bumped once, + and test-block-bits-2 is set to 0 + ================== ==================================================== + + + ================== ==================================================== + reset-counters writing ANY thing to this control will + reset all the above counters. + ================== ==================================================== + + +Use of the ``test_device_edac`` driver should enable any others to create their own +unique drivers for their hardware systems. + +The ``test_device_edac`` sample driver is located at the +http://bluesmoke.sourceforge.net project site for EDAC. + + +Usage of EDAC APIs on Nehalem and newer Intel CPUs +-------------------------------------------------- + +On older Intel architectures, the memory controller was part of the North +Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Sky Lake and +newer Intel architectures integrated an enhanced version of the memory +controller (MC) inside the CPUs. + +This chapter will cover the differences of the enhanced memory controllers +found on newer Intel CPUs, such as ``i7core_edac``, ``sb_edac`` and +``sbx_edac`` drivers. + +.. note:: + + The Xeon E7 processor families use a separate chip for the memory + controller, called Intel Scalable Memory Buffer. This section doesn't + apply for such families. + +1) There is one Memory Controller per Quick Patch Interconnect + (QPI). At the driver, the term "socket" means one QPI. This is + associated with a physical CPU socket. + + Each MC have 3 physical read channels, 3 physical write channels and + 3 logic channels. The driver currently sees it as just 3 channels. + Each channel can have up to 3 DIMMs. + + The minimum known unity is DIMMs. There are no information about csrows. + As EDAC API maps the minimum unity is csrows, the driver sequentially + maps channel/DIMM into different csrows. + + For example, supposing the following layout:: + + Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs + dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 + dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400 + Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs + dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 + Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs + dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 + + The driver will map it as:: + + csrow0: channel 0, dimm0 + csrow1: channel 0, dimm1 + csrow2: channel 1, dimm0 + csrow3: channel 2, dimm0 + + exports one DIMM per csrow. + + Each QPI is exported as a different memory controller. + +2) The MC has the ability to inject errors to test drivers. The drivers + implement this functionality via some error injection nodes: + + For injecting a memory error, there are some sysfs nodes, under + ``/sys/devices/system/edac/mc/mc?/``: + + - ``inject_addrmatch/*``: + Controls the error injection mask register. It is possible to specify + several characteristics of the address to match an error code:: + + dimm = the affected dimm. Numbers are relative to a channel; + rank = the memory rank; + channel = the channel that will generate an error; + bank = the affected bank; + page = the page address; + column (or col) = the address column. + + each of the above values can be set to "any" to match any valid value. + + At driver init, all values are set to any. + + For example, to generate an error at rank 1 of dimm 2, for any channel, + any bank, any page, any column:: + + echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm + echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank + + To return to the default behaviour of matching any, you can do:: + + echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm + echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank + + - ``inject_eccmask``: + specifies what bits will have troubles, + + - ``inject_section``: + specifies what ECC cache section will get the error:: + + 3 for both + 2 for the highest + 1 for the lowest + + - ``inject_type``: + specifies the type of error, being a combination of the following bits:: + + bit 0 - repeat + bit 1 - ecc + bit 2 - parity + + - ``inject_enable``: + starts the error generation when something different than 0 is written. + + All inject vars can be read. root permission is needed for write. + + Datasheet states that the error will only be generated after a write on an + address that matches inject_addrmatch. It seems, however, that reading will + also produce an error. + + For example, the following code will generate an error for any write access + at socket 0, on any DIMM/address on channel 2:: + + echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel + echo 2 >/sys/devices/system/edac/mc/mc0/inject_type + echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask + echo 3 >/sys/devices/system/edac/mc/mc0/inject_section + echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable + dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null + + For socket 1, it is needed to replace "mc0" by "mc1" at the above + commands. + + The generated error message will look like:: + + EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error)) + +3) Corrected Error memory register counters + + Those newer MCs have some registers to count memory errors. The driver + uses those registers to report Corrected Errors on devices with Registered + DIMMs. + + However, those counters don't work with Unregistered DIMM. As the chipset + offers some counters that also work with UDIMMs (but with a worse level of + granularity than the default ones), the driver exposes those registers for + UDIMM memories. + + They can be read by looking at the contents of ``all_channel_counts/``:: + + $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done + /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0 + 0 + /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1 + 0 + /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2 + 0 + + What happens here is that errors on different csrows, but at the same + dimm number will increment the same counter. + So, in this memory mapping:: + + csrow0: channel 0, dimm0 + csrow1: channel 0, dimm1 + csrow2: channel 1, dimm0 + csrow3: channel 2, dimm0 + + The hardware will increment udimm0 for an error at the first dimm at either + csrow0, csrow2 or csrow3; + + The hardware will increment udimm1 for an error at the second dimm at either + csrow0, csrow2 or csrow3; + + The hardware will increment udimm2 for an error at the third dimm at either + csrow0, csrow2 or csrow3; + +4) Standard error counters + + The standard error counters are generated when an mcelog error is received + by the driver. Since, with UDIMM, this is counted by software, it is + possible that some errors could be lost. With RDIMM's, they display the + contents of the registers + +Reference documents used on ``amd64_edac`` +------------------------------------------ + +``amd64_edac`` module is based on the following documents +(available from http://support.amd.com/en-us/search/tech-docs): + +1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD + Opteron Processors + :AMD publication #: 26094 + :Revision: 3.26 + :Link: http://support.amd.com/TechDocs/26094.PDF + +2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh + Processors + :AMD publication #: 32559 + :Revision: 3.00 + :Issue Date: May 2006 + :Link: http://support.amd.com/TechDocs/32559.pdf + +3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h + Processors + :AMD publication #: 31116 + :Revision: 3.00 + :Issue Date: September 07, 2007 + :Link: http://support.amd.com/TechDocs/31116.pdf + +4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h + Models 30h-3Fh Processors + :AMD publication #: 49125 + :Revision: 3.06 + :Issue Date: 2/12/2015 (latest release) + :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf + +5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h + Models 60h-6Fh Processors + :AMD publication #: 50742 + :Revision: 3.01 + :Issue Date: 7/23/2015 (latest release) + :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf + +6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h + Models 00h-0Fh Processors + :AMD publication #: 48751 + :Revision: 3.03 + :Issue Date: 2/23/2015 (latest release) + :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf + +Credits +======= + +* Written by Doug Thompson <dougthompson@xmission.com> + + - 7 Dec 2005 + - 17 Jul 2007 Updated + +* |copy| Mauro Carvalho Chehab + + - 05 Aug 2009 Nehalem interface + - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section + +* EDAC authors/maintainers: + + - Doug Thompson, Dave Jiang, Dave Peterson et al, + - Mauro Carvalho Chehab + - Borislav Petkov + - original author: Thayne Harbaugh diff --git a/Documentation/admin-guide/reporting-bugs.rst b/Documentation/admin-guide/reporting-bugs.rst new file mode 100644 index 000000000..4650edb88 --- /dev/null +++ b/Documentation/admin-guide/reporting-bugs.rst @@ -0,0 +1,182 @@ +.. _reportingbugs: + +Reporting bugs +++++++++++++++ + +Background +========== + +The upstream Linux kernel maintainers only fix bugs for specific kernel +versions. Those versions include the current "release candidate" (or -rc) +kernel, any "stable" kernel versions, and any "long term" kernels. + +Please see https://www.kernel.org/ for a list of supported kernels. Any +kernel marked with [EOL] is "end of life" and will not have any fixes +backported to it. + +If you've found a bug on a kernel version that isn't listed on kernel.org, +contact your Linux distribution or embedded vendor for support. +Alternatively, you can attempt to run one of the supported stable or -rc +kernels, and see if you can reproduce the bug on that. It's preferable +to reproduce the bug on the latest -rc kernel. + + +How to report Linux kernel bugs +=============================== + + +Identify the problematic subsystem +---------------------------------- + +Identifying which part of the Linux kernel might be causing your issue +increases your chances of getting your bug fixed. Simply posting to the +generic linux-kernel mailing list (LKML) may cause your bug report to be +lost in the noise of a mailing list that gets 1000+ emails a day. + +Instead, try to figure out which kernel subsystem is causing the issue, +and email that subsystem's maintainer and mailing list. If the subsystem +maintainer doesn't answer, then expand your scope to mailing lists like +LKML. + + +Identify who to notify +---------------------- + +Once you know the subsystem that is causing the issue, you should send a +bug report. Some maintainers prefer bugs to be reported via bugzilla +(https://bugzilla.kernel.org), while others prefer that bugs be reported +via the subsystem mailing list. + +To find out where to send an emailed bug report, find your subsystem or +device driver in the MAINTAINERS file. Search in the file for relevant +entries, and send your bug report to the person(s) listed in the "M:" +lines, making sure to Cc the mailing list(s) in the "L:" lines. When the +maintainer replies to you, make sure to 'Reply-all' in order to keep the +public mailing list(s) in the email thread. + +If you know which driver is causing issues, you can pass one of the driver +files to the get_maintainer.pl script:: + + perl scripts/get_maintainer.pl -f <filename> + +If it is a security bug, please copy the Security Contact listed in the +MAINTAINERS file. They can help coordinate bugfix and disclosure. See +:ref:`Documentation/admin-guide/security-bugs.rst <securitybugs>` for more information. + +If you can't figure out which subsystem caused the issue, you should file +a bug in kernel.org bugzilla and send email to +linux-kernel@vger.kernel.org, referencing the bugzilla URL. (For more +information on the linux-kernel mailing list see +http://www.tux.org/lkml/). + + +Tips for reporting bugs +----------------------- + +If you haven't reported a bug before, please read: + + http://www.chiark.greenend.org.uk/~sgtatham/bugs.html + + http://www.catb.org/esr/faqs/smart-questions.html + +It's REALLY important to report bugs that seem unrelated as separate email +threads or separate bugzilla entries. If you report several unrelated +bugs at once, it's difficult for maintainers to tease apart the relevant +data. + + +Gather information +------------------ + +The most important information in a bug report is how to reproduce the +bug. This includes system information, and (most importantly) +step-by-step instructions for how a user can trigger the bug. + +If the failure includes an "OOPS:", take a picture of the screen, capture +a netconsole trace, or type the message from your screen into the bug +report. Please read "Documentation/admin-guide/bug-hunting.rst" before posting your +bug report. This explains what you should do with the "Oops" information +to make it useful to the recipient. + +This is a suggested format for a bug report sent via email or bugzilla. +Having a standardized bug report form makes it easier for you not to +overlook things, and easier for the developers to find the pieces of +information they're really interested in. If some information is not +relevant to your bug, feel free to exclude it. + +First run the ver_linux script included as scripts/ver_linux, which +reports the version of some important subsystems. Run this script with +the command ``awk -f scripts/ver_linux``. + +Use that information to fill in all fields of the bug report form, and +post it to the mailing list with a subject of "PROBLEM: <one line +summary from [1.]>" for easy identification by the developers:: + + [1.] One line summary of the problem: + [2.] Full description of the problem/report: + [3.] Keywords (i.e., modules, networking, kernel): + [4.] Kernel information + [4.1.] Kernel version (from /proc/version): + [4.2.] Kernel .config file: + [5.] Most recent kernel version which did not have the bug: + [6.] Output of Oops.. message (if applicable) with symbolic information + resolved (see Documentation/admin-guide/bug-hunting.rst) + [7.] A small shell script or example program which triggers the + problem (if possible) + [8.] Environment + [8.1.] Software (add the output of the ver_linux script here) + [8.2.] Processor information (from /proc/cpuinfo): + [8.3.] Module information (from /proc/modules): + [8.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem) + [8.5.] PCI information ('lspci -vvv' as root) + [8.6.] SCSI information (from /proc/scsi/scsi) + [8.7.] Other information that might be relevant to the problem + (please look in /proc and include all information that you + think to be relevant): + [X.] Other notes, patches, fixes, workarounds: + + +Follow up +========= + +Expectations for bug reporters +------------------------------ + +Linux kernel maintainers expect bug reporters to be able to follow up on +bug reports. That may include running new tests, applying patches, +recompiling your kernel, and/or re-triggering your bug. The most +frustrating thing for maintainers is for someone to report a bug, and then +never follow up on a request to try out a fix. + +That said, it's still useful for a kernel maintainer to know a bug exists +on a supported kernel, even if you can't follow up with retests. Follow +up reports, such as replying to the email thread with "I tried the latest +kernel and I can't reproduce my bug anymore" are also helpful, because +maintainers have to assume silence means things are still broken. + +Expectations for kernel maintainers +----------------------------------- + +Linux kernel maintainers are busy, overworked human beings. Some times +they may not be able to address your bug in a day, a week, or two weeks. +If they don't answer your email, they may be on vacation, or at a Linux +conference. Check the conference schedule at https://LWN.net for more info: + + https://lwn.net/Calendar/ + +In general, kernel maintainers take 1 to 5 business days to respond to +bugs. The majority of kernel maintainers are employed to work on the +kernel, and they may not work on the weekends. Maintainers are scattered +around the world, and they may not work in your time zone. Unless you +have a high priority bug, please wait at least a week after the first bug +report before sending the maintainer a reminder email. + +The exceptions to this rule are regressions, kernel crashes, security holes, +or userspace breakage caused by new kernel behavior. Those bugs should be +addressed by the maintainers ASAP. If you suspect a maintainer is not +responding to these types of bugs in a timely manner (especially during a +merge window), escalate the bug to LKML and Linus Torvalds. + +Thank you! + +[Some of this is taken from Frohwalt Egerer's original linux-kernel FAQ] diff --git a/Documentation/admin-guide/security-bugs.rst b/Documentation/admin-guide/security-bugs.rst new file mode 100644 index 000000000..30187d49d --- /dev/null +++ b/Documentation/admin-guide/security-bugs.rst @@ -0,0 +1,89 @@ +.. _securitybugs: + +Security bugs +============= + +Linux kernel developers take security very seriously. As such, we'd +like to know when a security bug is found so that it can be fixed and +disclosed as quickly as possible. Please report security bugs to the +Linux kernel security team. + +Contact +------- + +The Linux kernel security team can be contacted by email at +<security@kernel.org>. This is a private list of security officers +who will help verify the bug report and develop and release a fix. +If you already have a fix, please include it with your report, as +that can speed up the process considerably. It is possible that the +security team will bring in extra help from area maintainers to +understand and fix the security vulnerability. + +As it is with any bug, the more information provided the easier it +will be to diagnose and fix. Please review the procedure outlined in +admin-guide/reporting-bugs.rst if you are unclear about what +information is helpful. Any exploit code is very helpful and will not +be released without consent from the reporter unless it has already been +made public. + +Disclosure and embargoed information +------------------------------------ + +The security list is not a disclosure channel. For that, see Coordination +below. + +Once a robust fix has been developed, the release process starts. Fixes +for publicly known bugs are released immediately. + +Although our preference is to release fixes for publicly undisclosed bugs +as soon as they become available, this may be postponed at the request of +the reporter or an affected party for up to 7 calendar days from the start +of the release process, with an exceptional extension to 14 calendar days +if it is agreed that the criticality of the bug requires more time. The +only valid reason for deferring the publication of a fix is to accommodate +the logistics of QA and large scale rollouts which require release +coordination. + +Whilst embargoed information may be shared with trusted individuals in +order to develop a fix, such information will not be published alongside +the fix or on any other disclosure channel without the permission of the +reporter. This includes but is not limited to the original bug report +and followup discussions (if any), exploits, CVE information or the +identity of the reporter. + +In other words our only interest is in getting bugs fixed. All other +information submitted to the security list and any followup discussions +of the report are treated confidentially even after the embargo has been +lifted, in perpetuity. + +Coordination +------------ + +Fixes for sensitive bugs, such as those that might lead to privilege +escalations, may need to be coordinated with the private +<linux-distros@vs.openwall.org> mailing list so that distribution vendors +are well prepared to issue a fixed kernel upon public disclosure of the +upstream fix. Distros will need some time to test the proposed patch and +will generally request at least a few days of embargo, and vendor update +publication prefers to happen Tuesday through Thursday. When appropriate, +the security team can assist with this coordination, or the reporter can +include linux-distros from the start. In this case, remember to prefix +the email Subject line with "[vs]" as described in the linux-distros wiki: +<http://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists> + +CVE assignment +-------------- + +The security team does not normally assign CVEs, nor do we require them +for reports or fixes, as this can needlessly complicate the process and +may delay the bug handling. If a reporter wishes to have a CVE identifier +assigned ahead of public disclosure, they will need to contact the private +linux-distros list, described above. When such a CVE identifier is known +before a patch is provided, it is desirable to mention it in the commit +message if the reporter agrees. + +Non-disclosure agreements +------------------------- + +The Linux kernel security team is not a formal body and therefore unable +to enter any non-disclosure agreements. diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst new file mode 100644 index 000000000..a8d1e36b6 --- /dev/null +++ b/Documentation/admin-guide/serial-console.rst @@ -0,0 +1,115 @@ +.. _serial_console: + +Linux Serial Console +==================== + +To use a serial port as console you need to compile the support into your +kernel - by default it is not compiled in. For PC style serial ports +it's the config option next to menu option: + +:menuselection:`Character devices --> Serial drivers --> 8250/16550 and compatible serial support --> Console on 8250/16550 and compatible serial port` + +You must compile serial support into the kernel and not as a module. + +It is possible to specify multiple devices for console output. You can +define a new kernel command line option to select which device(s) to +use for console output. + +The format of this option is:: + + console=device,options + + device: tty0 for the foreground virtual console + ttyX for any other virtual console + ttySx for a serial port + lp0 for the first parallel port + ttyUSB0 for the first USB serial device + + options: depend on the driver. For the serial port this + defines the baudrate/parity/bits/flow control of + the port, in the format BBBBPNF, where BBBB is the + speed, P is parity (n/o/e), N is number of bits, + and F is flow control ('r' for RTS). Default is + 9600n8. The maximum baudrate is 115200. + +You can specify multiple console= options on the kernel command line. +Output will appear on all of them. The last device will be used when +you open ``/dev/console``. So, for example:: + + console=ttyS1,9600 console=tty0 + +defines that opening ``/dev/console`` will get you the current foreground +virtual console, and kernel messages will appear on both the VGA +console and the 2nd serial port (ttyS1 or COM2) at 9600 baud. + +Note that you can only define one console per device type (serial, video). + +If no console device is specified, the first device found capable of +acting as a system console will be used. At this time, the system +first looks for a VGA card and then for a serial port. So if you don't +have a VGA card in your system the first serial port will automatically +become the console. + +You will need to create a new device to use ``/dev/console``. The official +``/dev/console`` is now character device 5,1. + +(You can also use a network device as a console. See +``Documentation/networking/netconsole.txt`` for information on that.) + +Here's an example that will use ``/dev/ttyS1`` (COM2) as the console. +Replace the sample values as needed. + +1. Create ``/dev/console`` (real console) and ``/dev/tty0`` (master virtual + console):: + + cd /dev + rm -f console tty0 + mknod -m 622 console c 5 1 + mknod -m 622 tty0 c 4 0 + +2. LILO can also take input from a serial device. This is a very + useful option. To tell LILO to use the serial port: + In lilo.conf (global section):: + + serial = 1,9600n8 (ttyS1, 9600 bd, no parity, 8 bits) + +3. Adjust to kernel flags for the new kernel, + again in lilo.conf (kernel section):: + + append = "console=ttyS1,9600" + +4. Make sure a getty runs on the serial port so that you can login to + it once the system is done booting. This is done by adding a line + like this to ``/etc/inittab`` (exact syntax depends on your getty):: + + S1:23:respawn:/sbin/getty -L ttyS1 9600 vt100 + +5. Init and ``/etc/ioctl.save`` + + Sysvinit remembers its stty settings in a file in ``/etc``, called + ``/etc/ioctl.save``. REMOVE THIS FILE before using the serial + console for the first time, because otherwise init will probably + set the baudrate to 38400 (baudrate of the virtual console). + +6. ``/dev/console`` and X + Programs that want to do something with the virtual console usually + open ``/dev/console``. If you have created the new ``/dev/console`` device, + and your console is NOT the virtual console some programs will fail. + Those are programs that want to access the VT interface, and use + ``/dev/console instead of /dev/tty0``. Some of those programs are:: + + Xfree86, svgalib, gpm, SVGATextMode + + It should be fixed in modern versions of these programs though. + + Note that if you boot without a ``console=`` option (or with + ``console=/dev/tty0``), ``/dev/console`` is the same as ``/dev/tty0``. + In that case everything will still work. + +7. Thanks + + Thanks to Geert Uytterhoeven <geert@linux-m68k.org> + for porting the patches from 2.1.4x to 2.1.6x for taking care of + the integration of these patches into m68k, ppc and alpha. + +Miquel van Smoorenburg <miquels@cistron.nl>, 11-Jun-2000 diff --git a/Documentation/admin-guide/sysfs-rules.rst b/Documentation/admin-guide/sysfs-rules.rst new file mode 100644 index 000000000..abad33526 --- /dev/null +++ b/Documentation/admin-guide/sysfs-rules.rst @@ -0,0 +1,192 @@ +Rules on how to access information in sysfs +=========================================== + +The kernel-exported sysfs exports internal kernel implementation details +and depends on internal kernel structures and layout. It is agreed upon +by the kernel developers that the Linux kernel does not provide a stable +internal API. Therefore, there are aspects of the sysfs interface that +may not be stable across kernel releases. + +To minimize the risk of breaking users of sysfs, which are in most cases +low-level userspace applications, with a new kernel release, the users +of sysfs must follow some rules to use an as-abstract-as-possible way to +access this filesystem. The current udev and HAL programs already +implement this and users are encouraged to plug, if possible, into the +abstractions these programs provide instead of accessing sysfs directly. + +But if you really do want or need to access sysfs directly, please follow +the following rules and then your programs should work with future +versions of the sysfs interface. + +- Do not use libsysfs + It makes assumptions about sysfs which are not true. Its API does not + offer any abstraction, it exposes all the kernel driver-core + implementation details in its own API. Therefore it is not better than + reading directories and opening the files yourself. + Also, it is not actively maintained, in the sense of reflecting the + current kernel development. The goal of providing a stable interface + to sysfs has failed; it causes more problems than it solves. It + violates many of the rules in this document. + +- sysfs is always at ``/sys`` + Parsing ``/proc/mounts`` is a waste of time. Other mount points are a + system configuration bug you should not try to solve. For test cases, + possibly support a ``SYSFS_PATH`` environment variable to overwrite the + application's behavior, but never try to search for sysfs. Never try + to mount it, if you are not an early boot script. + +- devices are only "devices" + There is no such thing like class-, bus-, physical devices, + interfaces, and such that you can rely on in userspace. Everything is + just simply a "device". Class-, bus-, physical, ... types are just + kernel implementation details which should not be expected by + applications that look for devices in sysfs. + + The properties of a device are: + + - devpath (``/devices/pci0000:00/0000:00:1d.1/usb2/2-2/2-2:1.0``) + + - identical to the DEVPATH value in the event sent from the kernel + at device creation and removal + - the unique key to the device at that point in time + - the kernel's path to the device directory without the leading + ``/sys``, and always starting with a slash + - all elements of a devpath must be real directories. Symlinks + pointing to /sys/devices must always be resolved to their real + target and the target path must be used to access the device. + That way the devpath to the device matches the devpath of the + kernel used at event time. + - using or exposing symlink values as elements in a devpath string + is a bug in the application + + - kernel name (``sda``, ``tty``, ``0000:00:1f.2``, ...) + + - a directory name, identical to the last element of the devpath + - applications need to handle spaces and characters like ``!`` in + the name + + - subsystem (``block``, ``tty``, ``pci``, ...) + + - simple string, never a path or a link + - retrieved by reading the "subsystem"-link and using only the + last element of the target path + + - driver (``tg3``, ``ata_piix``, ``uhci_hcd``) + + - a simple string, which may contain spaces, never a path or a + link + - it is retrieved by reading the "driver"-link and using only the + last element of the target path + - devices which do not have "driver"-link just do not have a + driver; copying the driver value in a child device context is a + bug in the application + + - attributes + + - the files in the device directory or files below subdirectories + of the same device directory + - accessing attributes reached by a symlink pointing to another device, + like the "device"-link, is a bug in the application + + Everything else is just a kernel driver-core implementation detail + that should not be assumed to be stable across kernel releases. + +- Properties of parent devices never belong into a child device. + Always look at the parent devices themselves for determining device + context properties. If the device ``eth0`` or ``sda`` does not have a + "driver"-link, then this device does not have a driver. Its value is empty. + Never copy any property of the parent-device into a child-device. Parent + device properties may change dynamically without any notice to the + child device. + +- Hierarchy in a single device tree + There is only one valid place in sysfs where hierarchy can be examined + and this is below: ``/sys/devices.`` + It is planned that all device directories will end up in the tree + below this directory. + +- Classification by subsystem + There are currently three places for classification of devices: + ``/sys/block,`` ``/sys/class`` and ``/sys/bus.`` It is planned that these will + not contain any device directories themselves, but only flat lists of + symlinks pointing to the unified ``/sys/devices`` tree. + All three places have completely different rules on how to access + device information. It is planned to merge all three + classification directories into one place at ``/sys/subsystem``, + following the layout of the bus directories. All buses and + classes, including the converted block subsystem, will show up + there. + The devices belonging to a subsystem will create a symlink in the + "devices" directory at ``/sys/subsystem/<name>/devices``, + + If ``/sys/subsystem`` exists, ``/sys/bus``, ``/sys/class`` and ``/sys/block`` + can be ignored. If it does not exist, you always have to scan all three + places, as the kernel is free to move a subsystem from one place to + the other, as long as the devices are still reachable by the same + subsystem name. + + Assuming ``/sys/class/<subsystem>`` and ``/sys/bus/<subsystem>``, or + ``/sys/block`` and ``/sys/class/block`` are not interchangeable is a bug in + the application. + +- Block + The converted block subsystem at ``/sys/class/block`` or + ``/sys/subsystem/block`` will contain the links for disks and partitions + at the same level, never in a hierarchy. Assuming the block subsystem to + contain only disks and not partition devices in the same flat list is + a bug in the application. + +- "device"-link and <subsystem>:<kernel name>-links + Never depend on the "device"-link. The "device"-link is a workaround + for the old layout, where class devices are not created in + ``/sys/devices/`` like the bus devices. If the link-resolving of a + device directory does not end in ``/sys/devices/``, you can use the + "device"-link to find the parent devices in ``/sys/devices/``, That is the + single valid use of the "device"-link; it must never appear in any + path as an element. Assuming the existence of the "device"-link for + a device in ``/sys/devices/`` is a bug in the application. + Accessing ``/sys/class/net/eth0/device`` is a bug in the application. + + Never depend on the class-specific links back to the ``/sys/class`` + directory. These links are also a workaround for the design mistake + that class devices are not created in ``/sys/devices.`` If a device + directory does not contain directories for child devices, these links + may be used to find the child devices in ``/sys/class.`` That is the single + valid use of these links; they must never appear in any path as an + element. Assuming the existence of these links for devices which are + real child device directories in the ``/sys/devices`` tree is a bug in + the application. + + It is planned to remove all these links when all class device + directories live in ``/sys/devices.`` + +- Position of devices along device chain can change. + Never depend on a specific parent device position in the devpath, + or the chain of parent devices. The kernel is free to insert devices into + the chain. You must always request the parent device you are looking for + by its subsystem value. You need to walk up the chain until you find + the device that matches the expected subsystem. Depending on a specific + position of a parent device or exposing relative paths using ``../`` to + access the chain of parents is a bug in the application. + +- When reading and writing sysfs device attribute files, avoid dependency + on specific error codes wherever possible. This minimizes coupling to + the error handling implementation within the kernel. + + In general, failures to read or write sysfs device attributes shall + propagate errors wherever possible. Common errors include, but are not + limited to: + + ``-EIO``: The read or store operation is not supported, typically + returned by the sysfs system itself if the read or store pointer + is ``NULL``. + + ``-ENXIO``: The read or store operation failed + + Error codes will not be changed without good reason, and should a change + to error codes result in user-space breakage, it will be fixed, or the + the offending change will be reverted. + + Userspace applications can, however, expect the format and contents of + the attribute files to remain consistent in the absence of a version + attribute change in the context of a given attribute. diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst new file mode 100644 index 000000000..7b9035c01 --- /dev/null +++ b/Documentation/admin-guide/sysrq.rst @@ -0,0 +1,290 @@ +Linux Magic System Request Key Hacks +==================================== + +Documentation for sysrq.c + +What is the magic SysRq key? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is a 'magical' key combo you can hit which the kernel will respond to +regardless of whatever else it is doing, unless it is completely locked up. + +How do I enable the magic SysRq key? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You need to say "yes" to 'Magic SysRq key (CONFIG_MAGIC_SYSRQ)' when +configuring the kernel. When running a kernel with SysRq compiled in, +/proc/sys/kernel/sysrq controls the functions allowed to be invoked via +the SysRq key. The default value in this file is set by the +CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE config symbol, which itself defaults +to 1. Here is the list of possible values in /proc/sys/kernel/sysrq: + + - 0 - disable sysrq completely + - 1 - enable all functions of sysrq + - >1 - bitmask of allowed sysrq functions (see below for detailed function + description):: + + 2 = 0x2 - enable control of console logging level + 4 = 0x4 - enable control of keyboard (SAK, unraw) + 8 = 0x8 - enable debugging dumps of processes etc. + 16 = 0x10 - enable sync command + 32 = 0x20 - enable remount read-only + 64 = 0x40 - enable signalling of processes (term, kill, oom-kill) + 128 = 0x80 - allow reboot/poweroff + 256 = 0x100 - allow nicing of all RT tasks + +You can set the value in the file by the following command:: + + echo "number" >/proc/sys/kernel/sysrq + +The number may be written here either as decimal or as hexadecimal +with the 0x prefix. CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE must always be +written in hexadecimal. + +Note that the value of ``/proc/sys/kernel/sysrq`` influences only the invocation +via a keyboard. Invocation of any operation via ``/proc/sysrq-trigger`` is +always allowed (by a user with admin privileges). + +How do I use the magic SysRq key? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On x86 - You press the key combo :kbd:`ALT-SysRq-<command key>`. + +.. note:: + Some + keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is + also known as the 'Print Screen' key. Also some keyboards cannot + handle so many keys being pressed at the same time, so you might + have better luck with press :kbd:`Alt`, press :kbd:`SysRq`, + release :kbd:`SysRq`, press :kbd:`<command key>`, release everything. + +On SPARC - You press :kbd:`ALT-STOP-<command key>`, I believe. + +On the serial console (PC style standard serial ports only) + You send a ``BREAK``, then within 5 seconds a command key. Sending + ``BREAK`` twice is interpreted as a normal BREAK. + +On PowerPC + Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:`<command key>`, + :kbd:`Print Screen` (or :kbd:`F13`) - :kbd:`<command key>` may suffice. + +On other + If you know of the key combos for other architectures, please + let me know so I can add them to this section. + +On all + write a character to /proc/sysrq-trigger. e.g.:: + + echo t > /proc/sysrq-trigger + +What are the 'command' keys? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +=========== =================================================================== +Command Function +=========== =================================================================== +``b`` Will immediately reboot the system without syncing or unmounting + your disks. + +``c`` Will perform a system crash by a NULL pointer dereference. + A crashdump will be taken if configured. + +``d`` Shows all locks that are held. + +``e`` Send a SIGTERM to all processes, except for init. + +``f`` Will call the oom killer to kill a memory hog process, but do not + panic if nothing can be killed. + +``g`` Used by kgdb (kernel debugger) + +``h`` Will display help (actually any other key than those listed + here will display help. but ``h`` is easy to remember :-) + +``i`` Send a SIGKILL to all processes, except for init. + +``j`` Forcibly "Just thaw it" - filesystems frozen by the FIFREEZE ioctl. + +``k`` Secure Access Key (SAK) Kills all programs on the current virtual + console. NOTE: See important comments below in SAK section. + +``l`` Shows a stack backtrace for all active CPUs. + +``m`` Will dump current memory info to your console. + +``n`` Used to make RT tasks nice-able + +``o`` Will shut your system off (if configured and supported). + +``p`` Will dump the current registers and flags to your console. + +``q`` Will dump per CPU lists of all armed hrtimers (but NOT regular + timer_list timers) and detailed information about all + clockevent devices. + +``r`` Turns off keyboard raw mode and sets it to XLATE. + +``s`` Will attempt to sync all mounted filesystems. + +``t`` Will dump a list of current tasks and their information to your + console. + +``u`` Will attempt to remount all mounted filesystems read-only. + +``v`` Forcefully restores framebuffer console +``v`` Causes ETM buffer dump [ARM-specific] + +``w`` Dumps tasks that are in uninterruptable (blocked) state. + +``x`` Used by xmon interface on ppc/powerpc platforms. + Show global PMU Registers on sparc64. + Dump all TLB entries on MIPS. + +``y`` Show global CPU Registers [SPARC-64 specific] + +``z`` Dump the ftrace buffer + +``0``-``9`` Sets the console log level, controlling which kernel messages + will be printed to your console. (``0``, for example would make + it so that only emergency messages like PANICs or OOPSes would + make it to your console.) +=========== =================================================================== + +Okay, so what can I use them for? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Well, unraw(r) is very handy when your X server or a svgalib program crashes. + +sak(k) (Secure Access Key) is useful when you want to be sure there is no +trojan program running at console which could grab your password +when you would try to login. It will kill all programs on given console, +thus letting you make sure that the login prompt you see is actually +the one from init, not some trojan program. + +.. important:: + + In its true form it is not a true SAK like the one in a + c2 compliant system, and it should not be mistaken as + such. + +It seems others find it useful as (System Attention Key) which is +useful when you want to exit a program that will not let you switch consoles. +(For example, X or a svgalib program.) + +``reboot(b)`` is good when you're unable to shut down. But you should also +``sync(s)`` and ``umount(u)`` first. + +``crash(c)`` can be used to manually trigger a crashdump when the system is hung. +Note that this just triggers a crash if there is no dump mechanism available. + +``sync(s)`` is great when your system is locked up, it allows you to sync your +disks and will certainly lessen the chance of data loss and fscking. Note +that the sync hasn't taken place until you see the "OK" and "Done" appear +on the screen. (If the kernel is really in strife, you may not ever get the +OK or Done message...) + +``umount(u)`` is basically useful in the same ways as ``sync(s)``. I generally +``sync(s)``, ``umount(u)``, then ``reboot(b)`` when my system locks. It's saved +me many a fsck. Again, the unmount (remount read-only) hasn't taken place until +you see the "OK" and "Done" message appear on the screen. + +The loglevels ``0``-``9`` are useful when your console is being flooded with +kernel messages you do not want to see. Selecting ``0`` will prevent all but +the most urgent kernel messages from reaching your console. (They will +still be logged if syslogd/klogd are alive, though.) + +``term(e)`` and ``kill(i)`` are useful if you have some sort of runaway process +you are unable to kill any other way, especially if it's spawning other +processes. + +"just thaw ``it(j)``" is useful if your system becomes unresponsive due to a +frozen (probably root) filesystem via the FIFREEZE ioctl. + +Sometimes SysRq seems to get 'stuck' after using it, what can I do? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +That happens to me, also. I've found that tapping shift, alt, and control +on both sides of the keyboard, and hitting an invalid sysrq sequence again +will fix the problem. (i.e., something like :kbd:`alt-sysrq-z`). Switching to +another virtual console (:kbd:`ALT+Fn`) and then back again should also help. + +I hit SysRq, but nothing seems to happen, what's wrong? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are some keyboards that produce a different keycode for SysRq than the +pre-defined value of 99 +(see ``KEY_SYSRQ`` in ``include/uapi/linux/input-event-codes.h``), or +which don't have a SysRq key at all. In these cases, run ``showkey -s`` to find +an appropriate scancode sequence, and use ``setkeycodes <sequence> 99`` to map +this sequence to the usual SysRq code (e.g., ``setkeycodes e05b 99``). It's +probably best to put this command in a boot script. Oh, and by the way, you +exit ``showkey`` by not typing anything for ten seconds. + +I want to add SysRQ key events to a module, how does it work? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to register a basic function with the table, you must first include +the header ``include/linux/sysrq.h``, this will define everything else you need. +Next, you must create a ``sysrq_key_op`` struct, and populate it with A) the key +handler function you will use, B) a help_msg string, that will print when SysRQ +prints help, and C) an action_msg string, that will print right before your +handler is called. Your handler must conform to the prototype in 'sysrq.h'. + +After the ``sysrq_key_op`` is created, you can call the kernel function +``register_sysrq_key(int key, struct sysrq_key_op *op_p);`` this will +register the operation pointed to by ``op_p`` at table key 'key', +if that slot in the table is blank. At module unload time, you must call +the function ``unregister_sysrq_key(int key, struct sysrq_key_op *op_p)``, which +will remove the key op pointed to by 'op_p' from the key 'key', if and only if +it is currently registered in that slot. This is in case the slot has been +overwritten since you registered it. + +The Magic SysRQ system works by registering key operations against a key op +lookup table, which is defined in 'drivers/tty/sysrq.c'. This key table has +a number of operations registered into it at compile time, but is mutable, +and 2 functions are exported for interface to it:: + + register_sysrq_key and unregister_sysrq_key. + +Of course, never ever leave an invalid pointer in the table. I.e., when +your module that called register_sysrq_key() exits, it must call +unregister_sysrq_key() to clean up the sysrq key table entry that it used. +Null pointers in the table are always safe. :) + +If for some reason you feel the need to call the handle_sysrq function from +within a function called by handle_sysrq, you must be aware that you are in +a lock (you are also in an interrupt handler, which means don't sleep!), so +you must call ``__handle_sysrq_nolock`` instead. + +When I hit a SysRq key combination only the header appears on the console? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Sysrq output is subject to the same console loglevel control as all +other console output. This means that if the kernel was booted 'quiet' +as is common on distro kernels the output may not appear on the actual +console, even though it will appear in the dmesg buffer, and be accessible +via the dmesg command and to the consumers of ``/proc/kmsg``. As a specific +exception the header line from the sysrq command is passed to all console +consumers as if the current loglevel was maximum. If only the header +is emitted it is almost certain that the kernel loglevel is too low. +Should you require the output on the console channel then you will need +to temporarily up the console loglevel using :kbd:`alt-sysrq-8` or:: + + echo 8 > /proc/sysrq-trigger + +Remember to return the loglevel to normal after triggering the sysrq +command you are interested in. + +I have more questions, who can I ask? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Just ask them on the linux-kernel mailing list: + linux-kernel@vger.kernel.org + +Credits +~~~~~~~ + +Written by Mydraal <vulpyne@vulpyne.net> +Updated by Adam Sulmicki <adam@cfar.umd.edu> +Updated by Jeremy M. Dolan <jmd@turbogeek.org> 2001/01/28 10:15:59 +Added to by Crutcher Dunnavant <crutcher+kernel@datastacks.com> diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst new file mode 100644 index 000000000..28a869c50 --- /dev/null +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -0,0 +1,59 @@ +Tainted kernels +--------------- + +Some oops reports contain the string **'Tainted: '** after the program +counter. This indicates that the kernel has been tainted by some +mechanism. The string is followed by a series of position-sensitive +characters, each representing a particular tainted value. + + 1) ``G`` if all modules loaded have a GPL or compatible license, ``P`` if + any proprietary module has been loaded. Modules without a + MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by + insmod as GPL compatible are assumed to be proprietary. + + 2) ``F`` if any module was force loaded by ``insmod -f``, ``' '`` if all + modules were loaded normally. + + 3) ``S`` if the oops occurred on an SMP kernel running on hardware that + hasn't been certified as safe to run multiprocessor. + Currently this occurs only on various Athlons that are not + SMP capable. + + 4) ``R`` if a module was force unloaded by ``rmmod -f``, ``' '`` if all + modules were unloaded normally. + + 5) ``M`` if any processor has reported a Machine Check Exception, + ``' '`` if no Machine Check Exceptions have occurred. + + 6) ``B`` if a page-release function has found a bad page reference or + some unexpected page flags. + + 7) ``U`` if a user or user application specifically requested that the + Tainted flag be set, ``' '`` otherwise. + + 8) ``D`` if the kernel has died recently, i.e. there was an OOPS or BUG. + + 9) ``A`` if the ACPI table has been overridden. + + 10) ``W`` if a warning has previously been issued by the kernel. + (Though some warnings may set more specific taint flags.) + + 11) ``C`` if a staging driver has been loaded. + + 12) ``I`` if the kernel is working around a severe bug in the platform + firmware (BIOS or similar). + + 13) ``O`` if an externally-built ("out-of-tree") module has been loaded. + + 14) ``E`` if an unsigned module has been loaded in a kernel supporting + module signature. + + 15) ``L`` if a soft lockup has previously occurred on the system. + + 16) ``K`` if the kernel has been live patched. + +The primary reason for the **'Tainted: '** string is to tell kernel +debuggers if this is a clean kernel or if anything unusual has +occurred. Tainting is permanent: even if an offending module is +unloaded, the tainted value remains to indicate that the kernel is not +trustworthy. diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst new file mode 100644 index 000000000..35fccba6a --- /dev/null +++ b/Documentation/admin-guide/thunderbolt.rst @@ -0,0 +1,243 @@ +============= + Thunderbolt +============= +The interface presented here is not meant for end users. Instead there +should be a userspace tool that handles all the low-level details, keeps +a database of the authorized devices and prompts users for new connections. + +More details about the sysfs interface for Thunderbolt devices can be +found in ``Documentation/ABI/testing/sysfs-bus-thunderbolt``. + +Those users who just want to connect any device without any sort of +manual work can add following line to +``/etc/udev/rules.d/99-local.rules``:: + + ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1" + +This will authorize all devices automatically when they appear. However, +keep in mind that this bypasses the security levels and makes the system +vulnerable to DMA attacks. + +Security levels and how to use them +----------------------------------- +Starting with Intel Falcon Ridge Thunderbolt controller there are 4 +security levels available. Intel Titan Ridge added one more security level +(usbonly). The reason for these is the fact that the connected devices can +be DMA masters and thus read contents of the host memory without CPU and OS +knowing about it. There are ways to prevent this by setting up an IOMMU but +it is not always available for various reasons. + +The security levels are as follows: + + none + All devices are automatically connected by the firmware. No user + approval is needed. In BIOS settings this is typically called + *Legacy mode*. + + user + User is asked whether the device is allowed to be connected. + Based on the device identification information available through + ``/sys/bus/thunderbolt/devices``, the user then can make the decision. + In BIOS settings this is typically called *Unique ID*. + + secure + User is asked whether the device is allowed to be connected. In + addition to UUID the device (if it supports secure connect) is sent + a challenge that should match the expected one based on a random key + written to the ``key`` sysfs attribute. In BIOS settings this is + typically called *One time saved key*. + + dponly + The firmware automatically creates tunnels for Display Port and + USB. No PCIe tunneling is done. In BIOS settings this is + typically called *Display Port Only*. + + usbonly + The firmware automatically creates tunnels for the USB controller and + Display Port in a dock. All PCIe links downstream of the dock are + removed. + +The current security level can be read from +``/sys/bus/thunderbolt/devices/domainX/security`` where ``domainX`` is +the Thunderbolt domain the host controller manages. There is typically +one domain per Thunderbolt host controller. + +If the security level reads as ``user`` or ``secure`` the connected +device must be authorized by the user before PCIe tunnels are created +(e.g the PCIe device appears). + +Each Thunderbolt device plugged in will appear in sysfs under +``/sys/bus/thunderbolt/devices``. The device directory carries +information that can be used to identify the particular device, +including its name and UUID. + +Authorizing devices when security level is ``user`` or ``secure`` +----------------------------------------------------------------- +When a device is plugged in it will appear in sysfs as follows:: + + /sys/bus/thunderbolt/devices/0-1/authorized - 0 + /sys/bus/thunderbolt/devices/0-1/device - 0x8004 + /sys/bus/thunderbolt/devices/0-1/device_name - Thunderbolt to FireWire Adapter + /sys/bus/thunderbolt/devices/0-1/vendor - 0x1 + /sys/bus/thunderbolt/devices/0-1/vendor_name - Apple, Inc. + /sys/bus/thunderbolt/devices/0-1/unique_id - e0376f00-0300-0100-ffff-ffffffffffff + +The ``authorized`` attribute reads 0 which means no PCIe tunnels are +created yet. The user can authorize the device by simply entering:: + + # echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized + +This will create the PCIe tunnels and the device is now connected. + +If the device supports secure connect, and the domain security level is +set to ``secure``, it has an additional attribute ``key`` which can hold +a random 32-byte value used for authorization and challenging the device in +future connects:: + + /sys/bus/thunderbolt/devices/0-3/authorized - 0 + /sys/bus/thunderbolt/devices/0-3/device - 0x305 + /sys/bus/thunderbolt/devices/0-3/device_name - AKiTiO Thunder3 PCIe Box + /sys/bus/thunderbolt/devices/0-3/key - + /sys/bus/thunderbolt/devices/0-3/vendor - 0x41 + /sys/bus/thunderbolt/devices/0-3/vendor_name - inXtron + /sys/bus/thunderbolt/devices/0-3/unique_id - dc010000-0000-8508-a22d-32ca6421cb16 + +Notice the key is empty by default. + +If the user does not want to use secure connect they can just ``echo 1`` +to the ``authorized`` attribute and the PCIe tunnels will be created in +the same way as in the ``user`` security level. + +If the user wants to use secure connect, the first time the device is +plugged a key needs to be created and sent to the device:: + + # key=$(openssl rand -hex 32) + # echo $key > /sys/bus/thunderbolt/devices/0-3/key + # echo 1 > /sys/bus/thunderbolt/devices/0-3/authorized + +Now the device is connected (PCIe tunnels are created) and in addition +the key is stored on the device NVM. + +Next time the device is plugged in the user can verify (challenge) the +device using the same key:: + + # echo $key > /sys/bus/thunderbolt/devices/0-3/key + # echo 2 > /sys/bus/thunderbolt/devices/0-3/authorized + +If the challenge the device returns back matches the one we expect based +on the key, the device is connected and the PCIe tunnels are created. +However, if the challenge fails no tunnels are created and error is +returned to the user. + +If the user still wants to connect the device they can either approve +the device without a key or write a new key and write 1 to the +``authorized`` file to get the new key stored on the device NVM. + +Upgrading NVM on Thunderbolt device or host +------------------------------------------- +Since most of the functionality is handled in firmware running on a +host controller or a device, it is important that the firmware can be +upgraded to the latest where possible bugs in it have been fixed. +Typically OEMs provide this firmware from their support site. + +There is also a central site which has links where to download firmware +for some machines: + + `Thunderbolt Updates <https://thunderbolttechnology.net/updates>`_ + +Before you upgrade firmware on a device or host, please make sure it is a +suitable upgrade. Failing to do that may render the device (or host) in a +state where it cannot be used properly anymore without special tools! + +Host NVM upgrade on Apple Macs is not supported. + +Once the NVM image has been downloaded, you need to plug in a +Thunderbolt device so that the host controller appears. It does not +matter which device is connected (unless you are upgrading NVM on a +device - then you need to connect that particular device). + +Note an OEM-specific method to power the controller up ("force power") may +be available for your system in which case there is no need to plug in a +Thunderbolt device. + +After that we can write the firmware to the non-active parts of the NVM +of the host or device. As an example here is how Intel NUC6i7KYK (Skull +Canyon) Thunderbolt controller NVM is upgraded:: + + # dd if=KYK_TBT_FW_0018.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem + +Once the operation completes we can trigger NVM authentication and +upgrade process as follows:: + + # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate + +If no errors are returned, the host controller shortly disappears. Once +it comes back the driver notices it and initiates a full power cycle. +After a while the host controller appears again and this time it should +be fully functional. + +We can verify that the new NVM firmware is active by running the following +commands:: + + # cat /sys/bus/thunderbolt/devices/0-0/nvm_authenticate + 0x0 + # cat /sys/bus/thunderbolt/devices/0-0/nvm_version + 18.0 + +If ``nvm_authenticate`` contains anything other than 0x0 it is the error +code from the last authentication cycle, which means the authentication +of the NVM image failed. + +Note names of the NVMem devices ``nvm_activeN`` and ``nvm_non_activeN`` +depend on the order they are registered in the NVMem subsystem. N in +the name is the identifier added by the NVMem subsystem. + +Upgrading NVM when host controller is in safe mode +-------------------------------------------------- +If the existing NVM is not properly authenticated (or is missing) the +host controller goes into safe mode which means that the only available +functionality is flashing a new NVM image. When in this mode, reading +``nvm_version`` fails with ``ENODATA`` and the device identification +information is missing. + +To recover from this mode, one needs to flash a valid NVM image to the +host controller in the same way it is done in the previous chapter. + +Networking over Thunderbolt cable +--------------------------------- +Thunderbolt technology allows software communication between two hosts +connected by a Thunderbolt cable. + +It is possible to tunnel any kind of traffic over a Thunderbolt link but +currently we only support Apple ThunderboltIP protocol. + +If the other host is running Windows or macOS, the only thing you need to +do is to connect a Thunderbolt cable between the two hosts; the +``thunderbolt-net`` driver is loaded automatically. If the other host is +also Linux you should load ``thunderbolt-net`` manually on one host (it +does not matter which one):: + + # modprobe thunderbolt-net + +This triggers module load on the other host automatically. If the driver +is built-in to the kernel image, there is no need to do anything. + +The driver will create one virtual ethernet interface per Thunderbolt +port which are named like ``thunderbolt0`` and so on. From this point +you can either use standard userspace tools like ``ifconfig`` to +configure the interface or let your GUI handle it automatically. + +Forcing power +------------- +Many OEMs include a method that can be used to force the power of a +Thunderbolt controller to an "On" state even if nothing is connected. +If supported by your machine this will be exposed by the WMI bus with +a sysfs attribute called "force_power". + +For example the intel-wmi-thunderbolt driver exposes this attribute in: + /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power + + To force the power to on, write 1 to this attribute file. + To disable force power, write 0 to this attribute file. + +Note: it's currently not possible to query the force power state of a platform. diff --git a/Documentation/admin-guide/unicode.rst b/Documentation/admin-guide/unicode.rst new file mode 100644 index 000000000..7425a3351 --- /dev/null +++ b/Documentation/admin-guide/unicode.rst @@ -0,0 +1,189 @@ +Unicode support +=============== + + Last update: 2005-01-17, version 1.4 + +This file is maintained by H. Peter Anvin <unicode@lanana.org> as part +of the Linux Assigned Names And Numbers Authority (LANANA) project. +The current version can be found at: + + http://www.lanana.org/docs/unicode/admin-guide/unicode.rst + +Introduction +------------ + +The Linux kernel code has been rewritten to use Unicode to map +characters to fonts. By downloading a single Unicode-to-font table, +both the eight-bit character sets and UTF-8 mode are changed to use +the font as indicated. + +This changes the semantics of the eight-bit character tables subtly. +The four character tables are now: + +=============== =============================== ================ +Map symbol Map name Escape code (G0) +=============== =============================== ================ +LAT1_MAP Latin-1 (ISO 8859-1) ESC ( B +GRAF_MAP DEC VT100 pseudographics ESC ( 0 +IBMPC_MAP IBM code page 437 ESC ( U +USER_MAP User defined ESC ( K +=============== =============================== ================ + +In particular, ESC ( U is no longer "straight to font", since the font +might be completely different than the IBM character set. This +permits for example the use of block graphics even with a Latin-1 font +loaded. + +Note that although these codes are similar to ISO 2022, neither the +codes nor their uses match ISO 2022; Linux has two 8-bit codes (G0 and +G1), whereas ISO 2022 has four 7-bit codes (G0-G3). + +In accordance with the Unicode standard/ISO 10646 the range U+F000 to +U+F8FF has been reserved for OS-wide allocation (the Unicode Standard +refers to this as a "Corporate Zone", since this is inaccurate for +Linux we call it the "Linux Zone"). U+F000 was picked as the starting +point since it lets the direct-mapping area start on a large power of +two (in case 1024- or 2048-character fonts ever become necessary). +This leaves U+E000 to U+EFFF as End User Zone. + +[v1.2]: The Unicodes range from U+F000 and up to U+F7FF have been +hard-coded to map directly to the loaded font, bypassing the +translation table. The user-defined map now defaults to U+F000 to +U+F0FF, emulating the previous behaviour. In practice, this range +might be shorter; for example, vgacon can only handle 256-character +(U+F000..U+F0FF) or 512-character (U+F000..U+F1FF) fonts. + + +Actual characters assigned in the Linux Zone +-------------------------------------------- + +In addition, the following characters not present in Unicode 1.1.4 +have been defined; these are used by the DEC VT graphics map. [v1.2] +THIS USE IS OBSOLETE AND SHOULD NO LONGER BE USED; PLEASE SEE BELOW. + +====== ====================================== +U+F800 DEC VT GRAPHICS HORIZONTAL LINE SCAN 1 +U+F801 DEC VT GRAPHICS HORIZONTAL LINE SCAN 3 +U+F803 DEC VT GRAPHICS HORIZONTAL LINE SCAN 7 +U+F804 DEC VT GRAPHICS HORIZONTAL LINE SCAN 9 +====== ====================================== + +The DEC VT220 uses a 6x10 character matrix, and these characters form +a smooth progression in the DEC VT graphics character set. I have +omitted the scan 5 line, since it is also used as a block-graphics +character, and hence has been coded as U+2500 FORMS LIGHT HORIZONTAL. + +[v1.3]: These characters have been officially added to Unicode 3.2.0; +they are added at U+23BA, U+23BB, U+23BC, U+23BD. Linux now uses the +new values. + +[v1.2]: The following characters have been added to represent common +keyboard symbols that are unlikely to ever be added to Unicode proper +since they are horribly vendor-specific. This, of course, is an +excellent example of horrible design. + +====== ====================================== +U+F810 KEYBOARD SYMBOL FLYING FLAG +U+F811 KEYBOARD SYMBOL PULLDOWN MENU +U+F812 KEYBOARD SYMBOL OPEN APPLE +U+F813 KEYBOARD SYMBOL SOLID APPLE +====== ====================================== + +Klingon language support +------------------------ + +In 1996, Linux was the first operating system in the world to add +support for the artificial language Klingon, created by Marc Okrand +for the "Star Trek" television series. This encoding was later +adopted by the ConScript Unicode Registry and proposed (but ultimately +rejected) for inclusion in Unicode Plane 1. Thus, it remains as a +Linux/CSUR private assignment in the Linux Zone. + +This encoding has been endorsed by the Klingon Language Institute. +For more information, contact them at: + + http://www.kli.org/ + +Since the characters in the beginning of the Linux CZ have been more +of the dingbats/symbols/forms type and this is a language, I have +located it at the end, on a 16-cell boundary in keeping with standard +Unicode practice. + +.. note:: + + This range is now officially managed by the ConScript Unicode + Registry. The normative reference is at: + + http://www.evertype.com/standards/csur/klingon.html + +Klingon has an alphabet of 26 characters, a positional numeric writing +system with 10 digits, and is written left-to-right, top-to-bottom. + +Several glyph forms for the Klingon alphabet have been proposed. +However, since the set of symbols appear to be consistent throughout, +with only the actual shapes being different, in keeping with standard +Unicode practice these differences are considered font variants. + +====== ======================================================= +U+F8D0 KLINGON LETTER A +U+F8D1 KLINGON LETTER B +U+F8D2 KLINGON LETTER CH +U+F8D3 KLINGON LETTER D +U+F8D4 KLINGON LETTER E +U+F8D5 KLINGON LETTER GH +U+F8D6 KLINGON LETTER H +U+F8D7 KLINGON LETTER I +U+F8D8 KLINGON LETTER J +U+F8D9 KLINGON LETTER L +U+F8DA KLINGON LETTER M +U+F8DB KLINGON LETTER N +U+F8DC KLINGON LETTER NG +U+F8DD KLINGON LETTER O +U+F8DE KLINGON LETTER P +U+F8DF KLINGON LETTER Q + - Written <q> in standard Okrand Latin transliteration +U+F8E0 KLINGON LETTER QH + - Written <Q> in standard Okrand Latin transliteration +U+F8E1 KLINGON LETTER R +U+F8E2 KLINGON LETTER S +U+F8E3 KLINGON LETTER T +U+F8E4 KLINGON LETTER TLH +U+F8E5 KLINGON LETTER U +U+F8E6 KLINGON LETTER V +U+F8E7 KLINGON LETTER W +U+F8E8 KLINGON LETTER Y +U+F8E9 KLINGON LETTER GLOTTAL STOP + +U+F8F0 KLINGON DIGIT ZERO +U+F8F1 KLINGON DIGIT ONE +U+F8F2 KLINGON DIGIT TWO +U+F8F3 KLINGON DIGIT THREE +U+F8F4 KLINGON DIGIT FOUR +U+F8F5 KLINGON DIGIT FIVE +U+F8F6 KLINGON DIGIT SIX +U+F8F7 KLINGON DIGIT SEVEN +U+F8F8 KLINGON DIGIT EIGHT +U+F8F9 KLINGON DIGIT NINE + +U+F8FD KLINGON COMMA +U+F8FE KLINGON FULL STOP +U+F8FF KLINGON SYMBOL FOR EMPIRE +====== ======================================================= + +Other Fictional and Artificial Scripts +-------------------------------------- + +Since the assignment of the Klingon Linux Unicode block, a registry of +fictional and artificial scripts has been established by John Cowan +<jcowan@reutershealth.com> and Michael Everson <everson@evertype.com>. +The ConScript Unicode Registry is accessible at: + + http://www.evertype.com/standards/csur/ + +The ranges used fall at the low end of the End User Zone and can hence +not be normatively assigned, but it is recommended that people who +wish to encode fictional scripts use these codes, in the interest of +interoperability. For Klingon, CSUR has adopted the Linux encoding. +The CSUR people are driving adding Tengwar and Cirth into Unicode +Plane 1; the addition of Klingon to Unicode Plane 1 has been rejected +and so the above encoding remains official. diff --git a/Documentation/admin-guide/vga-softcursor.rst b/Documentation/admin-guide/vga-softcursor.rst new file mode 100644 index 000000000..f52175457 --- /dev/null +++ b/Documentation/admin-guide/vga-softcursor.rst @@ -0,0 +1,62 @@ +Software cursor for VGA +======================= + +by Pavel Machek <pavel@atrey.karlin.mff.cuni.cz> +and Martin Mares <mj@atrey.karlin.mff.cuni.cz> + +Linux now has some ability to manipulate cursor appearance. Normally, +you can set the size of hardware cursor. You can now play a few new +tricks: you can make your cursor look like a non-blinking red block, +make it inverse background of the character it's over or to highlight +that character and still choose whether the original hardware cursor +should remain visible or not. There may be other things I have never +thought of. + +The cursor appearance is controlled by a ``<ESC>[?1;2;3c`` escape sequence +where 1, 2 and 3 are parameters described below. If you omit any of them, +they will default to zeroes. + +first Parameter + specifies cursor size:: + + 0=default + 1=invisible + 2=underline, + ... + 8=full block + + 16 if you want the software cursor to be applied + + 32 if you want to always change the background color + + 64 if you dislike having the background the same as the + foreground. + + Highlights are ignored for the last two flags. + +second parameter + selects character attribute bits you want to change + (by simply XORing them with the value of this parameter). On standard + VGA, the high four bits specify background and the low four the + foreground. In both groups, low three bits set color (as in normal + color codes used by the console) and the most significant one turns + on highlight (or sometimes blinking -- it depends on the configuration + of your VGA). + +third parameter + consists of character attribute bits you want to set. + + Bit setting takes place before bit toggling, so you can simply clear a + bit by including it in both the set mask and the toggle mask. + +Examples +-------- + +To get normal blinking underline, use:: + + echo -e '\033[?2c' + +To get blinking block, use:: + + echo -e '\033[?6c' + +To get red non-blinking block, use:: + + echo -e '\033[?17;0;64c' |