diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-11 08:27:49 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-11 08:27:49 +0000 |
commit | ace9429bb58fd418f0c81d4c2835699bddf6bde6 (patch) | |
tree | b2d64bc10158fdd5497876388cd68142ca374ed3 /Documentation/core-api | |
parent | Initial commit. (diff) | |
download | linux-ace9429bb58fd418f0c81d4c2835699bddf6bde6.tar.xz linux-ace9429bb58fd418f0c81d4c2835699bddf6bde6.zip |
Adding upstream version 6.6.15.upstream/6.6.15
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'Documentation/core-api')
55 files changed, 13795 insertions, 0 deletions
diff --git a/Documentation/core-api/asm-annotations.rst b/Documentation/core-api/asm-annotations.rst new file mode 100644 index 0000000000..11c96d3f9a --- /dev/null +++ b/Documentation/core-api/asm-annotations.rst @@ -0,0 +1,222 @@ +Assembler Annotations +===================== + +Copyright (c) 2017-2019 Jiri Slaby + +This document describes the new macros for annotation of data and code in +assembly. In particular, it contains information about ``SYM_FUNC_START``, +``SYM_FUNC_END``, ``SYM_CODE_START``, and similar. + +Rationale +--------- +Some code like entries, trampolines, or boot code needs to be written in +assembly. The same as in C, such code is grouped into functions and +accompanied with data. Standard assemblers do not force users into precisely +marking these pieces as code, data, or even specifying their length. +Nevertheless, assemblers provide developers with such annotations to aid +debuggers throughout assembly. On top of that, developers also want to mark +some functions as *global* in order to be visible outside of their translation +units. + +Over time, the Linux kernel has adopted macros from various projects (like +``binutils``) to facilitate such annotations. So for historic reasons, +developers have been using ``ENTRY``, ``END``, ``ENDPROC``, and other +annotations in assembly. Due to the lack of their documentation, the macros +are used in rather wrong contexts at some locations. Clearly, ``ENTRY`` was +intended to denote the beginning of global symbols (be it data or code). +``END`` used to mark the end of data or end of special functions with +*non-standard* calling convention. In contrast, ``ENDPROC`` should annotate +only ends of *standard* functions. + +When these macros are used correctly, they help assemblers generate a nice +object with both sizes and types set correctly. For example, the result of +``arch/x86/lib/putuser.S``:: + + Num: Value Size Type Bind Vis Ndx Name + 25: 0000000000000000 33 FUNC GLOBAL DEFAULT 1 __put_user_1 + 29: 0000000000000030 37 FUNC GLOBAL DEFAULT 1 __put_user_2 + 32: 0000000000000060 36 FUNC GLOBAL DEFAULT 1 __put_user_4 + 35: 0000000000000090 37 FUNC GLOBAL DEFAULT 1 __put_user_8 + +This is not only important for debugging purposes. When there are properly +annotated objects like this, tools can be run on them to generate more useful +information. In particular, on properly annotated objects, ``objtool`` can be +run to check and fix the object if needed. Currently, ``objtool`` can report +missing frame pointer setup/destruction in functions. It can also +automatically generate annotations for the ORC unwinder +(Documentation/arch/x86/orc-unwinder.rst) +for most code. Both of these are especially important to support reliable +stack traces which are in turn necessary for kernel live patching +(Documentation/livepatch/livepatch.rst). + +Caveat and Discussion +--------------------- +As one might realize, there were only three macros previously. That is indeed +insufficient to cover all the combinations of cases: + +* standard/non-standard function +* code/data +* global/local symbol + +There was a discussion_ and instead of extending the current ``ENTRY/END*`` +macros, it was decided that brand new macros should be introduced instead:: + + So how about using macro names that actually show the purpose, instead + of importing all the crappy, historic, essentially randomly chosen + debug symbol macro names from the binutils and older kernels? + +.. _discussion: https://lore.kernel.org/r/20170217104757.28588-1-jslaby@suse.cz + +Macros Description +------------------ + +The new macros are prefixed with the ``SYM_`` prefix and can be divided into +three main groups: + +1. ``SYM_FUNC_*`` -- to annotate C-like functions. This means functions with + standard C calling conventions. For example, on x86, this means that the + stack contains a return address at the predefined place and a return from + the function can happen in a standard way. When frame pointers are enabled, + save/restore of frame pointer shall happen at the start/end of a function, + respectively, too. + + Checking tools like ``objtool`` should ensure such marked functions conform + to these rules. The tools can also easily annotate these functions with + debugging information (like *ORC data*) automatically. + +2. ``SYM_CODE_*`` -- special functions called with special stack. Be it + interrupt handlers with special stack content, trampolines, or startup + functions. + + Checking tools mostly ignore checking of these functions. But some debug + information still can be generated automatically. For correct debug data, + this code needs hints like ``UNWIND_HINT_REGS`` provided by developers. + +3. ``SYM_DATA*`` -- obviously data belonging to ``.data`` sections and not to + ``.text``. Data do not contain instructions, so they have to be treated + specially by the tools: they should not treat the bytes as instructions, + nor assign any debug information to them. + +Instruction Macros +~~~~~~~~~~~~~~~~~~ +This section covers ``SYM_FUNC_*`` and ``SYM_CODE_*`` enumerated above. + +``objtool`` requires that all code must be contained in an ELF symbol. Symbol +names that have a ``.L`` prefix do not emit symbol table entries. ``.L`` +prefixed symbols can be used within a code region, but should be avoided for +denoting a range of code via ``SYM_*_START/END`` annotations. + +* ``SYM_FUNC_START`` and ``SYM_FUNC_START_LOCAL`` are supposed to be **the + most frequent markings**. They are used for functions with standard calling + conventions -- global and local. Like in C, they both align the functions to + architecture specific ``__ALIGN`` bytes. There are also ``_NOALIGN`` variants + for special cases where developers do not want this implicit alignment. + + ``SYM_FUNC_START_WEAK`` and ``SYM_FUNC_START_WEAK_NOALIGN`` markings are + also offered as an assembler counterpart to the *weak* attribute known from + C. + + All of these **shall** be coupled with ``SYM_FUNC_END``. First, it marks + the sequence of instructions as a function and computes its size to the + generated object file. Second, it also eases checking and processing such + object files as the tools can trivially find exact function boundaries. + + So in most cases, developers should write something like in the following + example, having some asm instructions in between the macros, of course:: + + SYM_FUNC_START(memset) + ... asm insns ... + SYM_FUNC_END(memset) + + In fact, this kind of annotation corresponds to the now deprecated ``ENTRY`` + and ``ENDPROC`` macros. + +* ``SYM_FUNC_ALIAS``, ``SYM_FUNC_ALIAS_LOCAL``, and ``SYM_FUNC_ALIAS_WEAK`` can + be used to define multiple names for a function. The typical use is:: + + SYM_FUNC_START(__memset) + ... asm insns ... + SYN_FUNC_END(__memset) + SYM_FUNC_ALIAS(memset, __memset) + + In this example, one can call ``__memset`` or ``memset`` with the same + result, except the debug information for the instructions is generated to + the object file only once -- for the non-``ALIAS`` case. + +* ``SYM_CODE_START`` and ``SYM_CODE_START_LOCAL`` should be used only in + special cases -- if you know what you are doing. This is used exclusively + for interrupt handlers and similar where the calling convention is not the C + one. ``_NOALIGN`` variants exist too. The use is the same as for the ``FUNC`` + category above:: + + SYM_CODE_START_LOCAL(bad_put_user) + ... asm insns ... + SYM_CODE_END(bad_put_user) + + Again, every ``SYM_CODE_START*`` **shall** be coupled by ``SYM_CODE_END``. + + To some extent, this category corresponds to deprecated ``ENTRY`` and + ``END``. Except ``END`` had several other meanings too. + +* ``SYM_INNER_LABEL*`` is used to denote a label inside some + ``SYM_{CODE,FUNC}_START`` and ``SYM_{CODE,FUNC}_END``. They are very similar + to C labels, except they can be made global. An example of use:: + + SYM_CODE_START(ftrace_caller) + /* save_mcount_regs fills in first two parameters */ + ... + + SYM_INNER_LABEL(ftrace_caller_op_ptr, SYM_L_GLOBAL) + /* Load the ftrace_ops into the 3rd parameter */ + ... + + SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) + call ftrace_stub + ... + retq + SYM_CODE_END(ftrace_caller) + +Data Macros +~~~~~~~~~~~ +Similar to instructions, there is a couple of macros to describe data in the +assembly. + +* ``SYM_DATA_START`` and ``SYM_DATA_START_LOCAL`` mark the start of some data + and shall be used in conjunction with either ``SYM_DATA_END``, or + ``SYM_DATA_END_LABEL``. The latter adds also a label to the end, so that + people can use ``lstack`` and (local) ``lstack_end`` in the following + example:: + + SYM_DATA_START_LOCAL(lstack) + .skip 4096 + SYM_DATA_END_LABEL(lstack, SYM_L_LOCAL, lstack_end) + +* ``SYM_DATA`` and ``SYM_DATA_LOCAL`` are variants for simple, mostly one-line + data:: + + SYM_DATA(HEAP, .long rm_heap) + SYM_DATA(heap_end, .long rm_stack) + + In the end, they expand to ``SYM_DATA_START`` with ``SYM_DATA_END`` + internally. + +Support Macros +~~~~~~~~~~~~~~ +All the above reduce themselves to some invocation of ``SYM_START``, +``SYM_END``, or ``SYM_ENTRY`` at last. Normally, developers should avoid using +these. + +Further, in the above examples, one could see ``SYM_L_LOCAL``. There are also +``SYM_L_GLOBAL`` and ``SYM_L_WEAK``. All are intended to denote linkage of a +symbol marked by them. They are used either in ``_LABEL`` variants of the +earlier macros, or in ``SYM_START``. + + +Overriding Macros +~~~~~~~~~~~~~~~~~ +Architecture can also override any of the macros in their own +``asm/linkage.h``, including macros specifying the type of a symbol +(``SYM_T_FUNC``, ``SYM_T_OBJECT``, and ``SYM_T_NONE``). As every macro +described in this file is surrounded by ``#ifdef`` + ``#endif``, it is enough +to define the macros differently in the aforementioned architecture-dependent +header. diff --git a/Documentation/core-api/assoc_array.rst b/Documentation/core-api/assoc_array.rst new file mode 100644 index 0000000000..792bbf9939 --- /dev/null +++ b/Documentation/core-api/assoc_array.rst @@ -0,0 +1,554 @@ +======================================== +Generic Associative Array Implementation +======================================== + +Overview +======== + +This associative array implementation is an object container with the following +properties: + +1. Objects are opaque pointers. The implementation does not care where they + point (if anywhere) or what they point to (if anything). + + .. note:: + + Pointers to objects _must_ be zero in the least significant bit. + +2. Objects do not need to contain linkage blocks for use by the array. This + permits an object to be located in multiple arrays simultaneously. + Rather, the array is made up of metadata blocks that point to objects. + +3. Objects require index keys to locate them within the array. + +4. Index keys must be unique. Inserting an object with the same key as one + already in the array will replace the old object. + +5. Index keys can be of any length and can be of different lengths. + +6. Index keys should encode the length early on, before any variation due to + length is seen. + +7. Index keys can include a hash to scatter objects throughout the array. + +8. The array can iterated over. The objects will not necessarily come out in + key order. + +9. The array can be iterated over while it is being modified, provided the + RCU readlock is being held by the iterator. Note, however, under these + circumstances, some objects may be seen more than once. If this is a + problem, the iterator should lock against modification. Objects will not + be missed, however, unless deleted. + +10. Objects in the array can be looked up by means of their index key. + +11. Objects can be looked up while the array is being modified, provided the + RCU readlock is being held by the thread doing the look up. + +The implementation uses a tree of 16-pointer nodes internally that are indexed +on each level by nibbles from the index key in the same manner as in a radix +tree. To improve memory efficiency, shortcuts can be emplaced to skip over +what would otherwise be a series of single-occupancy nodes. Further, nodes +pack leaf object pointers into spare space in the node rather than making an +extra branch until as such time an object needs to be added to a full node. + + +The Public API +============== + +The public API can be found in ``<linux/assoc_array.h>``. The associative +array is rooted on the following structure:: + + struct assoc_array { + ... + }; + +The code is selected by enabling ``CONFIG_ASSOCIATIVE_ARRAY`` with:: + + ./script/config -e ASSOCIATIVE_ARRAY + + +Edit Script +----------- + +The insertion and deletion functions produce an 'edit script' that can later be +applied to effect the changes without risking ``ENOMEM``. This retains the +preallocated metadata blocks that will be installed in the internal tree and +keeps track of the metadata blocks that will be removed from the tree when the +script is applied. + +This is also used to keep track of dead blocks and dead objects after the +script has been applied so that they can be freed later. The freeing is done +after an RCU grace period has passed - thus allowing access functions to +proceed under the RCU read lock. + +The script appears as outside of the API as a pointer of the type:: + + struct assoc_array_edit; + +There are two functions for dealing with the script: + +1. Apply an edit script:: + + void assoc_array_apply_edit(struct assoc_array_edit *edit); + +This will perform the edit functions, interpolating various write barriers +to permit accesses under the RCU read lock to continue. The edit script +will then be passed to ``call_rcu()`` to free it and any dead stuff it points +to. + +2. Cancel an edit script:: + + void assoc_array_cancel_edit(struct assoc_array_edit *edit); + +This frees the edit script and all preallocated memory immediately. If +this was for insertion, the new object is _not_ released by this function, +but must rather be released by the caller. + +These functions are guaranteed not to fail. + + +Operations Table +---------------- + +Various functions take a table of operations:: + + struct assoc_array_ops { + ... + }; + +This points to a number of methods, all of which need to be provided: + +1. Get a chunk of index key from caller data:: + + unsigned long (*get_key_chunk)(const void *index_key, int level); + +This should return a chunk of caller-supplied index key starting at the +*bit* position given by the level argument. The level argument will be a +multiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return +``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``. No error is possible. + + +2. Get a chunk of an object's index key:: + + unsigned long (*get_object_key_chunk)(const void *object, int level); + +As the previous function, but gets its data from an object in the array +rather than from a caller-supplied index key. + + +3. See if this is the object we're looking for:: + + bool (*compare_object)(const void *object, const void *index_key); + +Compare the object against an index key and return ``true`` if it matches and +``false`` if it doesn't. + + +4. Diff the index keys of two objects:: + + int (*diff_objects)(const void *object, const void *index_key); + +Return the bit position at which the index key of the specified object +differs from the given index key or -1 if they are the same. + + +5. Free an object:: + + void (*free_object)(void *object); + +Free the specified object. Note that this may be called an RCU grace period +after ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may be +necessary on module unloading. + + +Manipulation Functions +---------------------- + +There are a number of functions for manipulating an associative array: + +1. Initialise an associative array:: + + void assoc_array_init(struct assoc_array *array); + +This initialises the base structure for an associative array. It can't fail. + + +2. Insert/replace an object in an associative array:: + + struct assoc_array_edit * + assoc_array_insert(struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key, + void *object); + +This inserts the given object into the array. Note that the least +significant bit of the pointer must be zero as it's used to type-mark +pointers internally. + +If an object already exists for that key then it will be replaced with the +new object and the old one will be freed automatically. + +The ``index_key`` argument should hold index key information and is +passed to the methods in the ops table when they are called. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. + +The caller should lock exclusively against other modifiers of the array. + + +3. Delete an object from an associative array:: + + struct assoc_array_edit * + assoc_array_delete(struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key); + +This deletes an object that matches the specified data from the array. + +The ``index_key`` argument should hold index key information and is +passed to the methods in the ops table when they are called. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. ``NULL`` will be returned if the specified object is +not found within the array. + +The caller should lock exclusively against other modifiers of the array. + + +4. Delete all objects from an associative array:: + + struct assoc_array_edit * + assoc_array_clear(struct assoc_array *array, + const struct assoc_array_ops *ops); + +This deletes all the objects from an associative array and leaves it +completely empty. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. + +The caller should lock exclusively against other modifiers of the array. + + +5. Destroy an associative array, deleting all objects:: + + void assoc_array_destroy(struct assoc_array *array, + const struct assoc_array_ops *ops); + +This destroys the contents of the associative array and leaves it +completely empty. It is not permitted for another thread to be traversing +the array under the RCU read lock at the same time as this function is +destroying it as no RCU deferral is performed on memory release - +something that would require memory to be allocated. + +The caller should lock exclusively against other modifiers and accessors +of the array. + + +6. Garbage collect an associative array:: + + int assoc_array_gc(struct assoc_array *array, + const struct assoc_array_ops *ops, + bool (*iterator)(void *object, void *iterator_data), + void *iterator_data); + +This iterates over the objects in an associative array and passes each one to +``iterator()``. If ``iterator()`` returns ``true``, the object is kept. If it +returns ``false``, the object will be freed. If the ``iterator()`` function +returns ``true``, it must perform any appropriate refcount incrementing on the +object before returning. + +The internal tree will be packed down if possible as part of the iteration +to reduce the number of nodes in it. + +The ``iterator_data`` is passed directly to ``iterator()`` and is otherwise +ignored by the function. + +The function will return ``0`` if successful and ``-ENOMEM`` if there wasn't +enough memory. + +It is possible for other threads to iterate over or search the array under +the RCU read lock while this function is in progress. The caller should +lock exclusively against other modifiers of the array. + + +Access Functions +---------------- + +There are two functions for accessing an associative array: + +1. Iterate over all the objects in an associative array:: + + int assoc_array_iterate(const struct assoc_array *array, + int (*iterator)(const void *object, + void *iterator_data), + void *iterator_data); + +This passes each object in the array to the iterator callback function. +``iterator_data`` is private data for that function. + +This may be used on an array at the same time as the array is being +modified, provided the RCU read lock is held. Under such circumstances, +it is possible for the iteration function to see some objects twice. If +this is a problem, then modification should be locked against. The +iteration algorithm should not, however, miss any objects. + +The function will return ``0`` if no objects were in the array or else it will +return the result of the last iterator function called. Iteration stops +immediately if any call to the iteration function results in a non-zero +return. + + +2. Find an object in an associative array:: + + void *assoc_array_find(const struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key); + +This walks through the array's internal tree directly to the object +specified by the index key.. + +This may be used on an array at the same time as the array is being +modified, provided the RCU read lock is held. + +The function will return the object if found (and set ``*_type`` to the object +type) or will return ``NULL`` if the object was not found. + + +Index Key Form +-------------- + +The index key can be of any form, but since the algorithms aren't told how long +the key is, it is strongly recommended that the index key includes its length +very early on before any variation due to the length would have an effect on +comparisons. + +This will cause leaves with different length keys to scatter away from each +other - and those with the same length keys to cluster together. + +It is also recommended that the index key begin with a hash of the rest of the +key to maximise scattering throughout keyspace. + +The better the scattering, the wider and lower the internal tree will be. + +Poor scattering isn't too much of a problem as there are shortcuts and nodes +can contain mixtures of leaves and metadata pointers. + +The index key is read in chunks of machine word. Each chunk is subdivided into +one nibble (4 bits) per level, so on a 32-bit CPU this is good for 8 levels and +on a 64-bit CPU, 16 levels. Unless the scattering is really poor, it is +unlikely that more than one word of any particular index key will have to be +used. + + +Internal Workings +================= + +The associative array data structure has an internal tree. This tree is +constructed of two types of metadata blocks: nodes and shortcuts. + +A node is an array of slots. Each slot can contain one of four things: + +* A NULL pointer, indicating that the slot is empty. +* A pointer to an object (a leaf). +* A pointer to a node at the next level. +* A pointer to a shortcut. + + +Basic Internal Tree Layout +-------------------------- + +Ignoring shortcuts for the moment, the nodes form a multilevel tree. The index +key space is strictly subdivided by the nodes in the tree and nodes occur on +fixed levels. For example:: + + Level: 0 1 2 3 + =============== =============== =============== =============== + NODE D + NODE B NODE C +------>+---+ + +------>+---+ +------>+---+ | | 0 | + NODE A | | 0 | | | 0 | | +---+ + +---+ | +---+ | +---+ | : : + | 0 | | : : | : : | +---+ + +---+ | +---+ | +---+ | | f | + | 1 |---+ | 3 |---+ | 7 |---+ +---+ + +---+ +---+ +---+ + : : : : | 8 |---+ + +---+ +---+ +---+ | NODE E + | e |---+ | f | : : +------>+---+ + +---+ | +---+ +---+ | 0 | + | f | | | f | +---+ + +---+ | +---+ : : + | NODE F +---+ + +------>+---+ | f | + | 0 | NODE G +---+ + +---+ +------>+---+ + : : | | 0 | + +---+ | +---+ + | 6 |---+ : : + +---+ +---+ + : : | f | + +---+ +---+ + | f | + +---+ + +In the above example, there are 7 nodes (A-G), each with 16 slots (0-f). +Assuming no other meta data nodes in the tree, the key space is divided +thusly:: + + KEY PREFIX NODE + ========== ==== + 137* D + 138* E + 13[0-69-f]* C + 1[0-24-f]* B + e6* G + e[0-57-f]* F + [02-df]* A + +So, for instance, keys with the following example index keys will be found in +the appropriate nodes:: + + INDEX KEY PREFIX NODE + =============== ======= ==== + 13694892892489 13 C + 13795289025897 137 D + 13889dde88793 138 E + 138bbb89003093 138 E + 1394879524789 12 C + 1458952489 1 B + 9431809de993ba - A + b4542910809cd - A + e5284310def98 e F + e68428974237 e6 G + e7fffcbd443 e F + f3842239082 - A + +To save memory, if a node can hold all the leaves in its portion of keyspace, +then the node will have all those leaves in it and will not have any metadata +pointers - even if some of those leaves would like to be in the same slot. + +A node can contain a heterogeneous mix of leaves and metadata pointers. +Metadata pointers must be in the slots that match their subdivisions of key +space. The leaves can be in any slot not occupied by a metadata pointer. It +is guaranteed that none of the leaves in a node will match a slot occupied by a +metadata pointer. If the metadata pointer is there, any leaf whose key matches +the metadata key prefix must be in the subtree that the metadata pointer points +to. + +In the above example list of index keys, node A will contain:: + + SLOT CONTENT INDEX KEY (PREFIX) + ==== =============== ================== + 1 PTR TO NODE B 1* + any LEAF 9431809de993ba + any LEAF b4542910809cd + e PTR TO NODE F e* + any LEAF f3842239082 + +and node B:: + + 3 PTR TO NODE C 13* + any LEAF 1458952489 + + +Shortcuts +--------- + +Shortcuts are metadata records that jump over a piece of keyspace. A shortcut +is a replacement for a series of single-occupancy nodes ascending through the +levels. Shortcuts exist to save memory and to speed up traversal. + +It is possible for the root of the tree to be a shortcut - say, for example, +the tree contains at least 17 nodes all with key prefix ``1111``. The +insertion algorithm will insert a shortcut to skip over the ``1111`` keyspace +in a single bound and get to the fourth level where these actually become +different. + + +Splitting And Collapsing Nodes +------------------------------ + +Each node has a maximum capacity of 16 leaves and metadata pointers. If the +insertion algorithm finds that it is trying to insert a 17th object into a +node, that node will be split such that at least two leaves that have a common +key segment at that level end up in a separate node rooted on that slot for +that common key segment. + +If the leaves in a full node and the leaf that is being inserted are +sufficiently similar, then a shortcut will be inserted into the tree. + +When the number of objects in the subtree rooted at a node falls to 16 or +fewer, then the subtree will be collapsed down to a single node - and this will +ripple towards the root if possible. + + +Non-Recursive Iteration +----------------------- + +Each node and shortcut contains a back pointer to its parent and the number of +slot in that parent that points to it. None-recursive iteration uses these to +proceed rootwards through the tree, going to the parent node, slot N + 1 to +make sure progress is made without the need for a stack. + +The backpointers, however, make simultaneous alteration and iteration tricky. + + +Simultaneous Alteration And Iteration +------------------------------------- + +There are a number of cases to consider: + +1. Simple insert/replace. This involves simply replacing a NULL or old + matching leaf pointer with the pointer to the new leaf after a barrier. + The metadata blocks don't change otherwise. An old leaf won't be freed + until after the RCU grace period. + +2. Simple delete. This involves just clearing an old matching leaf. The + metadata blocks don't change otherwise. The old leaf won't be freed until + after the RCU grace period. + +3. Insertion replacing part of a subtree that we haven't yet entered. This + may involve replacement of part of that subtree - but that won't affect + the iteration as we won't have reached the pointer to it yet and the + ancestry blocks are not replaced (the layout of those does not change). + +4. Insertion replacing nodes that we're actively processing. This isn't a + problem as we've passed the anchoring pointer and won't switch onto the + new layout until we follow the back pointers - at which point we've + already examined the leaves in the replaced node (we iterate over all the + leaves in a node before following any of its metadata pointers). + + We might, however, re-see some leaves that have been split out into a new + branch that's in a slot further along than we were at. + +5. Insertion replacing nodes that we're processing a dependent branch of. + This won't affect us until we follow the back pointers. Similar to (4). + +6. Deletion collapsing a branch under us. This doesn't affect us because the + back pointers will get us back to the parent of the new node before we + could see the new node. The entire collapsed subtree is thrown away + unchanged - and will still be rooted on the same slot, so we shouldn't + process it a second time as we'll go back to slot + 1. + +.. note:: + + Under some circumstances, we need to simultaneously change the parent + pointer and the parent slot pointer on a node (say, for example, we + inserted another node before it and moved it up a level). We cannot do + this without locking against a read - so we have to replace that node too. + + However, when we're changing a shortcut into a node this isn't a problem + as shortcuts only have one slot and so the parent slot number isn't used + when traversing backwards over one. This means that it's okay to change + the slot number first - provided suitable barriers are used to make sure + the parent slot number is read after the back pointer. + +Obsolete blocks and leaves are freed up after an RCU grace period has passed, +so as long as anyone doing walking or iteration holds the RCU read lock, the +old superstructure should not go away on them. diff --git a/Documentation/core-api/boot-time-mm.rst b/Documentation/core-api/boot-time-mm.rst new file mode 100644 index 0000000000..e5ec9f1a56 --- /dev/null +++ b/Documentation/core-api/boot-time-mm.rst @@ -0,0 +1,41 @@ +=========================== +Boot time memory management +=========================== + +Early system initialization cannot use "normal" memory management +simply because it is not set up yet. But there is still need to +allocate memory for various data structures, for instance for the +physical page allocator. + +A specialized allocator called ``memblock`` performs the +boot time memory management. The architecture specific initialization +must set it up in :c:func:`setup_arch` and tear it down in +:c:func:`mem_init` functions. + +Once the early memory management is available it offers a variety of +functions and macros for memory allocations. The allocation request +may be directed to the first (and probably the only) node or to a +particular node in a NUMA system. There are API variants that panic +when an allocation fails and those that don't. + +Memblock also offers a variety of APIs that control its own behaviour. + +Memblock Overview +================= + +.. kernel-doc:: mm/memblock.c + :doc: memblock overview + + +Functions and structures +======================== + +Here is the description of memblock data structures, functions and +macros. Some of them are actually internal, but since they are +documented it would be silly to omit them. Besides, reading the +descriptions for the internal functions can help to understand what +really happens under the hood. + +.. kernel-doc:: include/linux/memblock.h +.. kernel-doc:: mm/memblock.c + :functions: diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst new file mode 100644 index 0000000000..889fc84ccd --- /dev/null +++ b/Documentation/core-api/cachetlb.rst @@ -0,0 +1,398 @@ +================================== +Cache and TLB Flushing Under Linux +================================== + +:Author: David S. Miller <davem@redhat.com> + +This document describes the cache/tlb flushing interfaces called +by the Linux VM subsystem. It enumerates over each interface, +describes its intended purpose, and what side effect is expected +after the interface is invoked. + +The side effects described below are stated for a uniprocessor +implementation, and what is to happen on that single processor. The +SMP cases are a simple extension, in that you just extend the +definition such that the side effect for a particular interface occurs +on all processors in the system. Don't let this scare you into +thinking SMP cache/tlb flushing must be so inefficient, this is in +fact an area where many optimizations are possible. For example, +if it can be proven that a user address space has never executed +on a cpu (see mm_cpumask()), one need not perform a flush +for this address space on that cpu. + +First, the TLB flushing interfaces, since they are the simplest. The +"TLB" is abstracted under Linux as something the cpu uses to cache +virtual-->physical address translations obtained from the software +page tables. Meaning that if the software page tables change, it is +possible for stale translations to exist in this "TLB" cache. +Therefore when software page table changes occur, the kernel will +invoke one of the following flush methods _after_ the page table +changes occur: + +1) ``void flush_tlb_all(void)`` + + The most severe flush of all. After this interface runs, + any previous page table modification whatsoever will be + visible to the cpu. + + This is usually invoked when the kernel page tables are + changed, since such translations are "global" in nature. + +2) ``void flush_tlb_mm(struct mm_struct *mm)`` + + This interface flushes an entire user address space from + the TLB. After running, this interface must make sure that + any previous page table modifications for the address space + 'mm' will be visible to the cpu. That is, after running, + there will be no entries in the TLB for 'mm'. + + This interface is used to handle whole address space + page table operations such as what happens during + fork, and exec. + +3) ``void flush_tlb_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end)`` + + Here we are flushing a specific range of (user) virtual + address translations from the TLB. After running, this + interface must make sure that any previous page table + modifications for the address space 'vma->vm_mm' in the range + 'start' to 'end-1' will be visible to the cpu. That is, after + running, there will be no entries in the TLB for 'mm' for + virtual addresses in the range 'start' to 'end-1'. + + The "vma" is the backing store being used for the region. + Primarily, this is used for munmap() type operations. + + The interface is provided in hopes that the port can find + a suitably efficient method for removing multiple page + sized translations from the TLB, instead of having the kernel + call flush_tlb_page (see below) for each entry which may be + modified. + +4) ``void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)`` + + This time we need to remove the PAGE_SIZE sized translation + from the TLB. The 'vma' is the backing structure used by + Linux to keep track of mmap'd regions for a process, the + address space is available via vma->vm_mm. Also, one may + test (vma->vm_flags & VM_EXEC) to see if this region is + executable (and thus could be in the 'instruction TLB' in + split-tlb type setups). + + After running, this interface must make sure that any previous + page table modification for address space 'vma->vm_mm' for + user virtual address 'addr' will be visible to the cpu. That + is, after running, there will be no entries in the TLB for + 'vma->vm_mm' for virtual address 'addr'. + + This is used primarily during fault processing. + +5) ``void update_mmu_cache_range(struct vm_fault *vmf, + struct vm_area_struct *vma, unsigned long address, pte_t *ptep, + unsigned int nr)`` + + At the end of every page fault, this routine is invoked to tell + the architecture specific code that translations now exists + in the software page tables for address space "vma->vm_mm" + at virtual address "address" for "nr" consecutive pages. + + This routine is also invoked in various other places which pass + a NULL "vmf". + + A port may use this information in any way it so chooses. + For example, it could use this event to pre-load TLB + translations for software managed TLB configurations. + The sparc64 port currently does this. + +Next, we have the cache flushing interfaces. In general, when Linux +is changing an existing virtual-->physical mapping to a new value, +the sequence will be in one of the following forms:: + + 1) flush_cache_mm(mm); + change_all_page_tables_of(mm); + flush_tlb_mm(mm); + + 2) flush_cache_range(vma, start, end); + change_range_of_page_tables(mm, start, end); + flush_tlb_range(vma, start, end); + + 3) flush_cache_page(vma, addr, pfn); + set_pte(pte_pointer, new_pte_val); + flush_tlb_page(vma, addr); + +The cache level flush will always be first, because this allows +us to properly handle systems whose caches are strict and require +a virtual-->physical translation to exist for a virtual address +when that virtual address is flushed from the cache. The HyperSparc +cpu is one such cpu with this attribute. + +The cache flushing routines below need only deal with cache flushing +to the extent that it is necessary for a particular cpu. Mostly, +these routines must be implemented for cpus which have virtually +indexed caches which must be flushed when virtual-->physical +translations are changed or removed. So, for example, the physically +indexed physically tagged caches of IA32 processors have no need to +implement these interfaces since the caches are fully synchronized +and have no dependency on translation information. + +Here are the routines, one by one: + +1) ``void flush_cache_mm(struct mm_struct *mm)`` + + This interface flushes an entire user address space from + the caches. That is, after running, there will be no cache + lines associated with 'mm'. + + This interface is used to handle whole address space + page table operations such as what happens during exit and exec. + +2) ``void flush_cache_dup_mm(struct mm_struct *mm)`` + + This interface flushes an entire user address space from + the caches. That is, after running, there will be no cache + lines associated with 'mm'. + + This interface is used to handle whole address space + page table operations such as what happens during fork. + + This option is separate from flush_cache_mm to allow some + optimizations for VIPT caches. + +3) ``void flush_cache_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end)`` + + Here we are flushing a specific range of (user) virtual + addresses from the cache. After running, there will be no + entries in the cache for 'vma->vm_mm' for virtual addresses in + the range 'start' to 'end-1'. + + The "vma" is the backing store being used for the region. + Primarily, this is used for munmap() type operations. + + The interface is provided in hopes that the port can find + a suitably efficient method for removing multiple page + sized regions from the cache, instead of having the kernel + call flush_cache_page (see below) for each entry which may be + modified. + +4) ``void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)`` + + This time we need to remove a PAGE_SIZE sized range + from the cache. The 'vma' is the backing structure used by + Linux to keep track of mmap'd regions for a process, the + address space is available via vma->vm_mm. Also, one may + test (vma->vm_flags & VM_EXEC) to see if this region is + executable (and thus could be in the 'instruction cache' in + "Harvard" type cache layouts). + + The 'pfn' indicates the physical page frame (shift this value + left by PAGE_SHIFT to get the physical address) that 'addr' + translates to. It is this mapping which should be removed from + the cache. + + After running, there will be no entries in the cache for + 'vma->vm_mm' for virtual address 'addr' which translates + to 'pfn'. + + This is used primarily during fault processing. + +5) ``void flush_cache_kmaps(void)`` + + This routine need only be implemented if the platform utilizes + highmem. It will be called right before all of the kmaps + are invalidated. + + After running, there will be no entries in the cache for + the kernel virtual address range PKMAP_ADDR(0) to + PKMAP_ADDR(LAST_PKMAP). + + This routing should be implemented in asm/highmem.h + +6) ``void flush_cache_vmap(unsigned long start, unsigned long end)`` + ``void flush_cache_vunmap(unsigned long start, unsigned long end)`` + + Here in these two interfaces we are flushing a specific range + of (kernel) virtual addresses from the cache. After running, + there will be no entries in the cache for the kernel address + space for virtual addresses in the range 'start' to 'end-1'. + + The first of these two routines is invoked after vmap_range() + has installed the page table entries. The second is invoked + before vunmap_range() deletes the page table entries. + +There exists another whole class of cpu cache issues which currently +require a whole different set of interfaces to handle properly. +The biggest problem is that of virtual aliasing in the data cache +of a processor. + +Is your port susceptible to virtual aliasing in its D-cache? +Well, if your D-cache is virtually indexed, is larger in size than +PAGE_SIZE, and does not prevent multiple cache lines for the same +physical address from existing at once, you have this problem. + +If your D-cache has this problem, first define asm/shmparam.h SHMLBA +properly, it should essentially be the size of your virtually +addressed D-cache (or if the size is variable, the largest possible +size). This setting will force the SYSv IPC layer to only allow user +processes to mmap shared memory at address which are a multiple of +this value. + +.. note:: + + This does not fix shared mmaps, check out the sparc64 port for + one way to solve this (in particular SPARC_FLAG_MMAPSHARED). + +Next, you have to solve the D-cache aliasing issue for all +other cases. Please keep in mind that fact that, for a given page +mapped into some user address space, there is always at least one more +mapping, that of the kernel in its linear mapping starting at +PAGE_OFFSET. So immediately, once the first user maps a given +physical page into its address space, by implication the D-cache +aliasing problem has the potential to exist since the kernel already +maps this page at its virtual address. + + ``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)`` + ``void clear_user_page(void *to, unsigned long addr, struct page *page)`` + + These two routines store data in user anonymous or COW + pages. It allows a port to efficiently avoid D-cache alias + issues between userspace and the kernel. + + For example, a port may temporarily map 'from' and 'to' to + kernel virtual addresses during the copy. The virtual address + for these two pages is chosen in such a way that the kernel + load/store instructions happen to virtual addresses which are + of the same "color" as the user mapping of the page. Sparc64 + for example, uses this technique. + + The 'addr' parameter tells the virtual address where the + user will ultimately have this page mapped, and the 'page' + parameter gives a pointer to the struct page of the target. + + If D-cache aliasing is not an issue, these two routines may + simply call memcpy/memset directly and do nothing more. + + ``void flush_dcache_folio(struct folio *folio)`` + + This routines must be called when: + + a) the kernel did write to a page that is in the page cache page + and / or in high memory + b) the kernel is about to read from a page cache page and user space + shared/writable mappings of this page potentially exist. Note + that {get,pin}_user_pages{_fast} already call flush_dcache_folio + on any page found in the user address space and thus driver + code rarely needs to take this into account. + + .. note:: + + This routine need only be called for page cache pages + which can potentially ever be mapped into the address + space of a user process. So for example, VFS layer code + handling vfs symlinks in the page cache need not call + this interface at all. + + The phrase "kernel writes to a page cache page" means, specifically, + that the kernel executes store instructions that dirty data in that + page at the kernel virtual mapping of that page. It is important to + flush here to handle D-cache aliasing, to make sure these kernel stores + are visible to user space mappings of that page. + + The corollary case is just as important, if there are users which have + shared+writable mappings of this file, we must make sure that kernel + reads of these pages will see the most recent stores done by the user. + + If D-cache aliasing is not an issue, this routine may simply be defined + as a nop on that architecture. + + There is a bit set aside in folio->flags (PG_arch_1) as "architecture + private". The kernel guarantees that, for pagecache pages, it will + clear this bit when such a page first enters the pagecache. + + This allows these interfaces to be implemented much more + efficiently. It allows one to "defer" (perhaps indefinitely) the + actual flush if there are currently no user processes mapping this + page. See sparc64's flush_dcache_folio and update_mmu_cache_range + implementations for an example of how to go about doing this. + + The idea is, first at flush_dcache_folio() time, if + folio_flush_mapping() returns a mapping, and mapping_mapped() on that + mapping returns %false, just mark the architecture private page + flag bit. Later, in update_mmu_cache_range(), a check is made + of this flag bit, and if set the flush is done and the flag bit + is cleared. + + .. important:: + + It is often important, if you defer the flush, + that the actual flush occurs on the same CPU + as did the cpu stores into the page to make it + dirty. Again, see sparc64 for examples of how + to deal with this. + + ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page, + unsigned long user_vaddr, void *dst, void *src, int len)`` + ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page, + unsigned long user_vaddr, void *dst, void *src, int len)`` + + When the kernel needs to copy arbitrary data in and out + of arbitrary user pages (f.e. for ptrace()) it will use + these two routines. + + Any necessary cache flushing or other coherency operations + that need to occur should happen here. If the processor's + instruction cache does not snoop cpu stores, it is very + likely that you will need to flush the instruction cache + for copy_to_user_page(). + + ``void flush_anon_page(struct vm_area_struct *vma, struct page *page, + unsigned long vmaddr)`` + + When the kernel needs to access the contents of an anonymous + page, it calls this function (currently only + get_user_pages()). Note: flush_dcache_folio() deliberately + doesn't work for an anonymous page. The default + implementation is a nop (and should remain so for all coherent + architectures). For incoherent architectures, it should flush + the cache of the page at vmaddr. + + ``void flush_icache_range(unsigned long start, unsigned long end)`` + + When the kernel stores into addresses that it will execute + out of (eg when loading modules), this function is called. + + If the icache does not snoop stores then this routine will need + to flush it. + + ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)`` + + All the functionality of flush_icache_page can be implemented in + flush_dcache_folio and update_mmu_cache_range. In the future, the hope + is to remove this interface completely. + +The final category of APIs is for I/O to deliberately aliased address +ranges inside the kernel. Such aliases are set up by use of the +vmap/vmalloc API. Since kernel I/O goes via physical pages, the I/O +subsystem assumes that the user mapping and kernel offset mapping are +the only aliases. This isn't true for vmap aliases, so anything in +the kernel trying to do I/O to vmap areas must manually manage +coherency. It must do this by flushing the vmap range before doing +I/O and invalidating it after the I/O returns. + + ``void flush_kernel_vmap_range(void *vaddr, int size)`` + + flushes the kernel cache for a given virtual address range in + the vmap area. This is to make sure that any data the kernel + modified in the vmap range is made visible to the physical + page. The design is to make this area safe to perform I/O on. + Note that this API does *not* also flush the offset map alias + of the area. + + ``void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates`` + + the cache for a given virtual address range in the vmap area + which prevents the processor from making the cache stale by + speculatively reading data while the I/O was occurring to the + physical pages. This is only necessary for data reads into the + vmap area. diff --git a/Documentation/core-api/circular-buffers.rst b/Documentation/core-api/circular-buffers.rst new file mode 100644 index 0000000000..50966f66e3 --- /dev/null +++ b/Documentation/core-api/circular-buffers.rst @@ -0,0 +1,237 @@ +================ +Circular Buffers +================ + +:Author: David Howells <dhowells@redhat.com> +:Author: Paul E. McKenney <paulmck@linux.ibm.com> + + +Linux provides a number of features that can be used to implement circular +buffering. There are two sets of such features: + + (1) Convenience functions for determining information about power-of-2 sized + buffers. + + (2) Memory barriers for when the producer and the consumer of objects in the + buffer don't want to share a lock. + +To use these facilities, as discussed below, there needs to be just one +producer and just one consumer. It is possible to handle multiple producers by +serialising them, and to handle multiple consumers by serialising them. + + +.. Contents: + + (*) What is a circular buffer? + + (*) Measuring power-of-2 buffers. + + (*) Using memory barriers with circular buffers. + - The producer. + - The consumer. + + + +What is a circular buffer? +========================== + +First of all, what is a circular buffer? A circular buffer is a buffer of +fixed, finite size into which there are two indices: + + (1) A 'head' index - the point at which the producer inserts items into the + buffer. + + (2) A 'tail' index - the point at which the consumer finds the next item in + the buffer. + +Typically when the tail pointer is equal to the head pointer, the buffer is +empty; and the buffer is full when the head pointer is one less than the tail +pointer. + +The head index is incremented when items are added, and the tail index when +items are removed. The tail index should never jump the head index, and both +indices should be wrapped to 0 when they reach the end of the buffer, thus +allowing an infinite amount of data to flow through the buffer. + +Typically, items will all be of the same unit size, but this isn't strictly +required to use the techniques below. The indices can be increased by more +than 1 if multiple items or variable-sized items are to be included in the +buffer, provided that neither index overtakes the other. The implementer must +be careful, however, as a region more than one unit in size may wrap the end of +the buffer and be broken into two segments. + +Measuring power-of-2 buffers +============================ + +Calculation of the occupancy or the remaining capacity of an arbitrarily sized +circular buffer would normally be a slow operation, requiring the use of a +modulus (divide) instruction. However, if the buffer is of a power-of-2 size, +then a much quicker bitwise-AND instruction can be used instead. + +Linux provides a set of macros for handling power-of-2 circular buffers. These +can be made use of by:: + + #include <linux/circ_buf.h> + +The macros are: + + (#) Measure the remaining capacity of a buffer:: + + CIRC_SPACE(head_index, tail_index, buffer_size); + + This returns the amount of space left in the buffer[1] into which items + can be inserted. + + + (#) Measure the maximum consecutive immediate space in a buffer:: + + CIRC_SPACE_TO_END(head_index, tail_index, buffer_size); + + This returns the amount of consecutive space left in the buffer[1] into + which items can be immediately inserted without having to wrap back to the + beginning of the buffer. + + + (#) Measure the occupancy of a buffer:: + + CIRC_CNT(head_index, tail_index, buffer_size); + + This returns the number of items currently occupying a buffer[2]. + + + (#) Measure the non-wrapping occupancy of a buffer:: + + CIRC_CNT_TO_END(head_index, tail_index, buffer_size); + + This returns the number of consecutive items[2] that can be extracted from + the buffer without having to wrap back to the beginning of the buffer. + + +Each of these macros will nominally return a value between 0 and buffer_size-1, +however: + + (1) CIRC_SPACE*() are intended to be used in the producer. To the producer + they will return a lower bound as the producer controls the head index, + but the consumer may still be depleting the buffer on another CPU and + moving the tail index. + + To the consumer it will show an upper bound as the producer may be busy + depleting the space. + + (2) CIRC_CNT*() are intended to be used in the consumer. To the consumer they + will return a lower bound as the consumer controls the tail index, but the + producer may still be filling the buffer on another CPU and moving the + head index. + + To the producer it will show an upper bound as the consumer may be busy + emptying the buffer. + + (3) To a third party, the order in which the writes to the indices by the + producer and consumer become visible cannot be guaranteed as they are + independent and may be made on different CPUs - so the result in such a + situation will merely be a guess, and may even be negative. + +Using memory barriers with circular buffers +=========================================== + +By using memory barriers in conjunction with circular buffers, you can avoid +the need to: + + (1) use a single lock to govern access to both ends of the buffer, thus + allowing the buffer to be filled and emptied at the same time; and + + (2) use atomic counter operations. + +There are two sides to this: the producer that fills the buffer, and the +consumer that empties it. Only one thing should be filling a buffer at any one +time, and only one thing should be emptying a buffer at any one time, but the +two sides can operate simultaneously. + + +The producer +------------ + +The producer will look something like this:: + + spin_lock(&producer_lock); + + unsigned long head = buffer->head; + /* The spin_unlock() and next spin_lock() provide needed ordering. */ + unsigned long tail = READ_ONCE(buffer->tail); + + if (CIRC_SPACE(head, tail, buffer->size) >= 1) { + /* insert one item into the buffer */ + struct item *item = buffer[head]; + + produce_item(item); + + smp_store_release(buffer->head, + (head + 1) & (buffer->size - 1)); + + /* wake_up() will make sure that the head is committed before + * waking anyone up */ + wake_up(consumer); + } + + spin_unlock(&producer_lock); + +This will instruct the CPU that the contents of the new item must be written +before the head index makes it available to the consumer and then instructs the +CPU that the revised head index must be written before the consumer is woken. + +Note that wake_up() does not guarantee any sort of barrier unless something +is actually awakened. We therefore cannot rely on it for ordering. However, +there is always one element of the array left empty. Therefore, the +producer must produce two elements before it could possibly corrupt the +element currently being read by the consumer. Therefore, the unlock-lock +pair between consecutive invocations of the consumer provides the necessary +ordering between the read of the index indicating that the consumer has +vacated a given element and the write by the producer to that same element. + + +The Consumer +------------ + +The consumer will look something like this:: + + spin_lock(&consumer_lock); + + /* Read index before reading contents at that index. */ + unsigned long head = smp_load_acquire(buffer->head); + unsigned long tail = buffer->tail; + + if (CIRC_CNT(head, tail, buffer->size) >= 1) { + + /* extract one item from the buffer */ + struct item *item = buffer[tail]; + + consume_item(item); + + /* Finish reading descriptor before incrementing tail. */ + smp_store_release(buffer->tail, + (tail + 1) & (buffer->size - 1)); + } + + spin_unlock(&consumer_lock); + +This will instruct the CPU to make sure the index is up to date before reading +the new item, and then it shall make sure the CPU has finished reading the item +before it writes the new tail pointer, which will erase the item. + +Note the use of READ_ONCE() and smp_load_acquire() to read the +opposition index. This prevents the compiler from discarding and +reloading its cached value. This isn't strictly needed if you can +be sure that the opposition index will _only_ be used the once. +The smp_load_acquire() additionally forces the CPU to order against +subsequent memory references. Similarly, smp_store_release() is used +in both algorithms to write the thread's index. This documents the +fact that we are writing to something that can be read concurrently, +prevents the compiler from tearing the store, and enforces ordering +against previous accesses. + + +Further reading +=============== + +See also Documentation/memory-barriers.txt for a description of Linux's memory +barrier facilities. diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst new file mode 100644 index 0000000000..9511e405aa --- /dev/null +++ b/Documentation/core-api/cpu_hotplug.rst @@ -0,0 +1,765 @@ +========================= +CPU hotplug in the Kernel +========================= + +:Date: September, 2021 +:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>, + Rusty Russell <rusty@rustcorp.com.au>, + Srivatsa Vaddagiri <vatsa@in.ibm.com>, + Ashok Raj <ashok.raj@intel.com>, + Joel Schopp <jschopp@austin.ibm.com>, + Thomas Gleixner <tglx@linutronix.de> + +Introduction +============ + +Modern advances in system architectures have introduced advanced error +reporting and correction capabilities in processors. There are couple OEMS that +support NUMA hardware which are hot pluggable as well, where physical node +insertion and removal require support for CPU hotplug. + +Such advances require CPUs available to a kernel to be removed either for +provisioning reasons, or for RAS purposes to keep an offending CPU off +system execution path. Hence the need for CPU hotplug support in the +Linux kernel. + +A more novel use of CPU-hotplug support is its use today in suspend resume +support for SMP. Dual-core and HT support makes even a laptop run SMP kernels +which didn't support these methods. + + +Command Line Switches +===================== +``maxcpus=n`` + Restrict boot time CPUs to *n*. Say if you have four CPUs, using + ``maxcpus=2`` will only boot two. You can choose to bring the + other CPUs later online. + +``nr_cpus=n`` + Restrict the total amount of CPUs the kernel will support. If the number + supplied here is lower than the number of physically available CPUs, then + those CPUs can not be brought online later. + +``additional_cpus=n`` + Use this to limit hotpluggable CPUs. This option sets + ``cpu_possible_mask = cpu_present_mask + additional_cpus`` + + This option is limited to the IA64 architecture. + +``possible_cpus=n`` + This option sets ``possible_cpus`` bits in ``cpu_possible_mask``. + + This option is limited to the X86 and S390 architecture. + +``cpu0_hotplug`` + Allow to shutdown CPU0. + + This option is limited to the X86 architecture. + +CPU maps +======== + +``cpu_possible_mask`` + Bitmap of possible CPUs that can ever be available in the + system. This is used to allocate some boot time memory for per_cpu variables + that aren't designed to grow/shrink as CPUs are made available or removed. + Once set during boot time discovery phase, the map is static, i.e no bits + are added or removed anytime. Trimming it accurately for your system needs + upfront can save some boot time memory. + +``cpu_online_mask`` + Bitmap of all CPUs currently online. Its set in ``__cpu_up()`` + after a CPU is available for kernel scheduling and ready to receive + interrupts from devices. Its cleared when a CPU is brought down using + ``__cpu_disable()``, before which all OS services including interrupts are + migrated to another target CPU. + +``cpu_present_mask`` + Bitmap of CPUs currently present in the system. Not all + of them may be online. When physical hotplug is processed by the relevant + subsystem (e.g ACPI) can change and new bit either be added or removed + from the map depending on the event is hot-add/hot-remove. There are currently + no locking rules as of now. Typical usage is to init topology during boot, + at which time hotplug is disabled. + +You really don't need to manipulate any of the system CPU maps. They should +be read-only for most use. When setting up per-cpu resources almost always use +``cpu_possible_mask`` or ``for_each_possible_cpu()`` to iterate. To macro +``for_each_cpu()`` can be used to iterate over a custom CPU mask. + +Never use anything other than ``cpumask_t`` to represent bitmap of CPUs. + + +Using CPU hotplug +================= + +The kernel option *CONFIG_HOTPLUG_CPU* needs to be enabled. It is currently +available on multiple architectures including ARM, MIPS, PowerPC and X86. The +configuration is done via the sysfs interface:: + + $ ls -lh /sys/devices/system/cpu + total 0 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu0 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu1 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu2 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu3 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu4 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu5 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu6 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu7 + drwxr-xr-x 2 root root 0 Dec 21 16:33 hotplug + -r--r--r-- 1 root root 4.0K Dec 21 16:33 offline + -r--r--r-- 1 root root 4.0K Dec 21 16:33 online + -r--r--r-- 1 root root 4.0K Dec 21 16:33 possible + -r--r--r-- 1 root root 4.0K Dec 21 16:33 present + +The files *offline*, *online*, *possible*, *present* represent the CPU masks. +Each CPU folder contains an *online* file which controls the logical on (1) and +off (0) state. To logically shutdown CPU4:: + + $ echo 0 > /sys/devices/system/cpu/cpu4/online + smpboot: CPU 4 is now offline + +Once the CPU is shutdown, it will be removed from */proc/interrupts*, +*/proc/cpuinfo* and should also not be shown visible by the *top* command. To +bring CPU4 back online:: + + $ echo 1 > /sys/devices/system/cpu/cpu4/online + smpboot: Booting Node 0 Processor 4 APIC 0x1 + +The CPU is usable again. This should work on all CPUs, but CPU0 is often special +and excluded from CPU hotplug. + +The CPU hotplug coordination +============================ + +The offline case +---------------- + +Once a CPU has been logically shutdown the teardown callbacks of registered +hotplug states will be invoked, starting with ``CPUHP_ONLINE`` and terminating +at state ``CPUHP_OFFLINE``. This includes: + +* If tasks are frozen due to a suspend operation then *cpuhp_tasks_frozen* + will be set to true. +* All processes are migrated away from this outgoing CPU to new CPUs. + The new CPU is chosen from each process' current cpuset, which may be + a subset of all online CPUs. +* All interrupts targeted to this CPU are migrated to a new CPU +* timers are also migrated to a new CPU +* Once all services are migrated, kernel calls an arch specific routine + ``__cpu_disable()`` to perform arch specific cleanup. + + +The CPU hotplug API +=================== + +CPU hotplug state machine +------------------------- + +CPU hotplug uses a trivial state machine with a linear state space from +CPUHP_OFFLINE to CPUHP_ONLINE. Each state has a startup and a teardown +callback. + +When a CPU is onlined, the startup callbacks are invoked sequentially until +the state CPUHP_ONLINE is reached. They can also be invoked when the +callbacks of a state are set up or an instance is added to a multi-instance +state. + +When a CPU is offlined the teardown callbacks are invoked in the reverse +order sequentially until the state CPUHP_OFFLINE is reached. They can also +be invoked when the callbacks of a state are removed or an instance is +removed from a multi-instance state. + +If a usage site requires only a callback in one direction of the hotplug +operations (CPU online or CPU offline) then the other not-required callback +can be set to NULL when the state is set up. + +The state space is divided into three sections: + +* The PREPARE section + + The PREPARE section covers the state space from CPUHP_OFFLINE to + CPUHP_BRINGUP_CPU. + + The startup callbacks in this section are invoked before the CPU is + started during a CPU online operation. The teardown callbacks are invoked + after the CPU has become dysfunctional during a CPU offline operation. + + The callbacks are invoked on a control CPU as they can't obviously run on + the hotplugged CPU which is either not yet started or has become + dysfunctional already. + + The startup callbacks are used to setup resources which are required to + bring a CPU successfully online. The teardown callbacks are used to free + resources or to move pending work to an online CPU after the hotplugged + CPU became dysfunctional. + + The startup callbacks are allowed to fail. If a callback fails, the CPU + online operation is aborted and the CPU is brought down to the previous + state (usually CPUHP_OFFLINE) again. + + The teardown callbacks in this section are not allowed to fail. + +* The STARTING section + + The STARTING section covers the state space between CPUHP_BRINGUP_CPU + 1 + and CPUHP_AP_ONLINE. + + The startup callbacks in this section are invoked on the hotplugged CPU + with interrupts disabled during a CPU online operation in the early CPU + setup code. The teardown callbacks are invoked with interrupts disabled + on the hotplugged CPU during a CPU offline operation shortly before the + CPU is completely shut down. + + The callbacks in this section are not allowed to fail. + + The callbacks are used for low level hardware initialization/shutdown and + for core subsystems. + +* The ONLINE section + + The ONLINE section covers the state space between CPUHP_AP_ONLINE + 1 and + CPUHP_ONLINE. + + The startup callbacks in this section are invoked on the hotplugged CPU + during a CPU online operation. The teardown callbacks are invoked on the + hotplugged CPU during a CPU offline operation. + + The callbacks are invoked in the context of the per CPU hotplug thread, + which is pinned on the hotplugged CPU. The callbacks are invoked with + interrupts and preemption enabled. + + The callbacks are allowed to fail. When a callback fails the hotplug + operation is aborted and the CPU is brought back to the previous state. + +CPU online/offline operations +----------------------------- + +A successful online operation looks like this:: + + [CPUHP_OFFLINE] + [CPUHP_OFFLINE + 1]->startup() -> success + [CPUHP_OFFLINE + 2]->startup() -> success + [CPUHP_OFFLINE + 3] -> skipped because startup == NULL + ... + [CPUHP_BRINGUP_CPU]->startup() -> success + === End of PREPARE section + [CPUHP_BRINGUP_CPU + 1]->startup() -> success + ... + [CPUHP_AP_ONLINE]->startup() -> success + === End of STARTUP section + [CPUHP_AP_ONLINE + 1]->startup() -> success + ... + [CPUHP_ONLINE - 1]->startup() -> success + [CPUHP_ONLINE] + +A successful offline operation looks like this:: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_AP_ONLINE + 1]->teardown() -> success + === Start of STARTUP section + [CPUHP_AP_ONLINE]->teardown() -> success + ... + [CPUHP_BRINGUP_ONLINE - 1]->teardown() + ... + === Start of PREPARE section + [CPUHP_BRINGUP_CPU]->teardown() + [CPUHP_OFFLINE + 3]->teardown() + [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL + [CPUHP_OFFLINE + 1]->teardown() + [CPUHP_OFFLINE] + +A failed online operation looks like this:: + + [CPUHP_OFFLINE] + [CPUHP_OFFLINE + 1]->startup() -> success + [CPUHP_OFFLINE + 2]->startup() -> success + [CPUHP_OFFLINE + 3] -> skipped because startup == NULL + ... + [CPUHP_BRINGUP_CPU]->startup() -> success + === End of PREPARE section + [CPUHP_BRINGUP_CPU + 1]->startup() -> success + ... + [CPUHP_AP_ONLINE]->startup() -> success + === End of STARTUP section + [CPUHP_AP_ONLINE + 1]->startup() -> success + --- + [CPUHP_AP_ONLINE + N]->startup() -> fail + [CPUHP_AP_ONLINE + (N - 1)]->teardown() + ... + [CPUHP_AP_ONLINE + 1]->teardown() + === Start of STARTUP section + [CPUHP_AP_ONLINE]->teardown() + ... + [CPUHP_BRINGUP_ONLINE - 1]->teardown() + ... + === Start of PREPARE section + [CPUHP_BRINGUP_CPU]->teardown() + [CPUHP_OFFLINE + 3]->teardown() + [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL + [CPUHP_OFFLINE + 1]->teardown() + [CPUHP_OFFLINE] + +A failed offline operation looks like this:: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() + ... + [CPUHP_ONLINE - 1]->startup() + [CPUHP_ONLINE] + +Recursive failures cannot be handled sensibly. Look at the following +example of a recursive fail due to a failed offline operation: :: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() -> success + [CPUHP_ONLINE - (N - 2)]->startup() -> fail + +The CPU hotplug state machine stops right here and does not try to go back +down again because that would likely result in an endless loop:: + + [CPUHP_ONLINE - (N - 1)]->teardown() -> success + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() -> success + [CPUHP_ONLINE - (N - 2)]->startup() -> fail + [CPUHP_ONLINE - (N - 1)]->teardown() -> success + [CPUHP_ONLINE - N]->teardown() -> fail + +Lather, rinse and repeat. In this case the CPU left in state:: + + [CPUHP_ONLINE - (N - 1)] + +which at least lets the system make progress and gives the user a chance to +debug or even resolve the situation. + +Allocating a state +------------------ + +There are two ways to allocate a CPU hotplug state: + +* Static allocation + + Static allocation has to be used when the subsystem or driver has + ordering requirements versus other CPU hotplug states. E.g. the PERF core + startup callback has to be invoked before the PERF driver startup + callbacks during a CPU online operation. During a CPU offline operation + the driver teardown callbacks have to be invoked before the core teardown + callback. The statically allocated states are described by constants in + the cpuhp_state enum which can be found in include/linux/cpuhotplug.h. + + Insert the state into the enum at the proper place so the ordering + requirements are fulfilled. The state constant has to be used for state + setup and removal. + + Static allocation is also required when the state callbacks are not set + up at runtime and are part of the initializer of the CPU hotplug state + array in kernel/cpu.c. + +* Dynamic allocation + + When there are no ordering requirements for the state callbacks then + dynamic allocation is the preferred method. The state number is allocated + by the setup function and returned to the caller on success. + + Only the PREPARE and ONLINE sections provide a dynamic allocation + range. The STARTING section does not as most of the callbacks in that + section have explicit ordering requirements. + +Setup of a CPU hotplug state +---------------------------- + +The core code provides the following functions to setup a state: + +* cpuhp_setup_state(state, name, startup, teardown) +* cpuhp_setup_state_nocalls(state, name, startup, teardown) +* cpuhp_setup_state_cpuslocked(state, name, startup, teardown) +* cpuhp_setup_state_nocalls_cpuslocked(state, name, startup, teardown) + +For cases where a driver or a subsystem has multiple instances and the same +CPU hotplug state callbacks need to be invoked for each instance, the CPU +hotplug core provides multi-instance support. The advantage over driver +specific instance lists is that the instance related functions are fully +serialized against CPU hotplug operations and provide the automatic +invocations of the state callbacks on add and removal. To set up such a +multi-instance state the following function is available: + +* cpuhp_setup_state_multi(state, name, startup, teardown) + +The @state argument is either a statically allocated state or one of the +constants for dynamically allocated states - CPUHP_BP_PREPARE_DYN, +CPUHP_AP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for +which a dynamic state should be allocated. + +The @name argument is used for sysfs output and for instrumentation. The +naming convention is "subsys:mode" or "subsys/driver:mode", +e.g. "perf:mode" or "perf/x86:mode". The common mode names are: + +======== ======================================================= +prepare For states in the PREPARE section + +dead For states in the PREPARE section which do not provide + a startup callback + +starting For states in the STARTING section + +dying For states in the STARTING section which do not provide + a startup callback + +online For states in the ONLINE section + +offline For states in the ONLINE section which do not provide + a startup callback +======== ======================================================= + +As the @name argument is only used for sysfs and instrumentation other mode +descriptors can be used as well if they describe the nature of the state +better than the common ones. + +Examples for @name arguments: "perf/online", "perf/x86:prepare", +"RCU/tree:dying", "sched/waitempty" + +The @startup argument is a function pointer to the callback which should be +invoked during a CPU online operation. If the usage site does not require a +startup callback set the pointer to NULL. + +The @teardown argument is a function pointer to the callback which should +be invoked during a CPU offline operation. If the usage site does not +require a teardown callback set the pointer to NULL. + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_setup_state_nocalls(), cpuhp_setup_state_nocalls_cpuslocked() + and cpuhp_setup_state_multi() only install the callbacks + + * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the + callbacks and invoke the @startup callback (if not NULL) for all online + CPUs which have currently a state greater than the newly installed + state. Depending on the state section the callback is either invoked on + the current CPU (PREPARE section) or on each online CPU (ONLINE + section) in the context of the CPU's hotplug thread. + + If a callback fails for CPU N then the teardown callback for CPU + 0 .. N-1 is invoked to rollback the operation. The state setup fails, + the callbacks for the state are not installed and in case of dynamic + allocation the allocated state is freed. + +The state setup and the callback invocations are serialized against CPU +hotplug operations. If the setup function has to be called from a CPU +hotplug read locked region, then the _cpuslocked() variants have to be +used. These functions cannot be used from within CPU hotplug callbacks. + +The function return values: + ======== =================================================================== + 0 Statically allocated state was successfully set up + + >0 Dynamically allocated state was successfully set up. + + The returned number is the state number which was allocated. If + the state callbacks have to be removed later, e.g. module + removal, then this number has to be saved by the caller and used + as @state argument for the state remove function. For + multi-instance states the dynamically allocated state number is + also required as @state argument for the instance add/remove + operations. + + <0 Operation failed + ======== =================================================================== + +Removal of a CPU hotplug state +------------------------------ + +To remove a previously set up state, the following functions are provided: + +* cpuhp_remove_state(state) +* cpuhp_remove_state_nocalls(state) +* cpuhp_remove_state_nocalls_cpuslocked(state) +* cpuhp_remove_multi_state(state) + +The @state argument is either a statically allocated state or the state +number which was allocated in the dynamic range by cpuhp_setup_state*(). If +the state is in the dynamic range, then the state number is freed and +available for dynamic allocation again. + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_remove_state_nocalls(), cpuhp_remove_state_nocalls_cpuslocked() + and cpuhp_remove_multi_state() only remove the callbacks. + + * cpuhp_remove_state() removes the callbacks and invokes the teardown + callback (if not NULL) for all online CPUs which have currently a state + greater than the removed state. Depending on the state section the + callback is either invoked on the current CPU (PREPARE section) or on + each online CPU (ONLINE section) in the context of the CPU's hotplug + thread. + + In order to complete the removal, the teardown callback should not fail. + +The state removal and the callback invocations are serialized against CPU +hotplug operations. If the remove function has to be called from a CPU +hotplug read locked region, then the _cpuslocked() variants have to be +used. These functions cannot be used from within CPU hotplug callbacks. + +If a multi-instance state is removed then the caller has to remove all +instances first. + +Multi-Instance state instance management +---------------------------------------- + +Once the multi-instance state is set up, instances can be added to the +state: + + * cpuhp_state_add_instance(state, node) + * cpuhp_state_add_instance_nocalls(state, node) + +The @state argument is either a statically allocated state or the state +number which was allocated in the dynamic range by cpuhp_setup_state_multi(). + +The @node argument is a pointer to an hlist_node which is embedded in the +instance's data structure. The pointer is handed to the multi-instance +state callbacks and can be used by the callback to retrieve the instance +via container_of(). + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_state_add_instance_nocalls() and only adds the instance to the + multi-instance state's node list. + + * cpuhp_state_add_instance() adds the instance and invokes the startup + callback (if not NULL) associated with @state for all online CPUs which + have currently a state greater than @state. The callback is only + invoked for the to be added instance. Depending on the state section + the callback is either invoked on the current CPU (PREPARE section) or + on each online CPU (ONLINE section) in the context of the CPU's hotplug + thread. + + If a callback fails for CPU N then the teardown callback for CPU + 0 .. N-1 is invoked to rollback the operation, the function fails and + the instance is not added to the node list of the multi-instance state. + +To remove an instance from the state's node list these functions are +available: + + * cpuhp_state_remove_instance(state, node) + * cpuhp_state_remove_instance_nocalls(state, node) + +The arguments are the same as for the cpuhp_state_add_instance*() +variants above. + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_state_remove_instance_nocalls() only removes the instance from the + state's node list. + + * cpuhp_state_remove_instance() removes the instance and invokes the + teardown callback (if not NULL) associated with @state for all online + CPUs which have currently a state greater than @state. The callback is + only invoked for the to be removed instance. Depending on the state + section the callback is either invoked on the current CPU (PREPARE + section) or on each online CPU (ONLINE section) in the context of the + CPU's hotplug thread. + + In order to complete the removal, the teardown callback should not fail. + +The node list add/remove operations and the callback invocations are +serialized against CPU hotplug operations. These functions cannot be used +from within CPU hotplug callbacks and CPU hotplug read locked regions. + +Examples +-------- + +Setup and teardown a statically allocated state in the STARTING section for +notifications on online and offline operations:: + + ret = cpuhp_setup_state(CPUHP_SUBSYS_STARTING, "subsys:starting", subsys_cpu_starting, subsys_cpu_dying); + if (ret < 0) + return ret; + .... + cpuhp_remove_state(CPUHP_SUBSYS_STARTING); + +Setup and teardown a dynamically allocated state in the ONLINE section +for notifications on offline operations:: + + state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline); + if (state < 0) + return state; + .... + cpuhp_remove_state(state); + +Setup and teardown a dynamically allocated state in the ONLINE section +for notifications on online operations without invoking the callbacks:: + + state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL); + if (state < 0) + return state; + .... + cpuhp_remove_state_nocalls(state); + +Setup, use and teardown a dynamically allocated multi-instance state in the +ONLINE section for notifications on online and offline operation:: + + state = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline); + if (state < 0) + return state; + .... + ret = cpuhp_state_add_instance(state, &inst1->node); + if (ret) + return ret; + .... + ret = cpuhp_state_add_instance(state, &inst2->node); + if (ret) + return ret; + .... + cpuhp_remove_instance(state, &inst1->node); + .... + cpuhp_remove_instance(state, &inst2->node); + .... + remove_multi_state(state); + + +Testing of hotplug states +========================= + +One way to verify whether a custom state is working as expected or not is to +shutdown a CPU and then put it online again. It is also possible to put the CPU +to certain state (for instance *CPUHP_AP_ONLINE*) and then go back to +*CPUHP_ONLINE*. This would simulate an error one state after *CPUHP_AP_ONLINE* +which would lead to rollback to the online state. + +All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states`` :: + + $ tail /sys/devices/system/cpu/hotplug/states + 138: mm/vmscan:online + 139: mm/vmstat:online + 140: lib/percpu_cnt:online + 141: acpi/cpu-drv:online + 142: base/cacheinfo:online + 143: virtio/net:online + 144: x86/mce:online + 145: printk:online + 168: sched:active + 169: online + +To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue:: + + $ cat /sys/devices/system/cpu/cpu4/hotplug/state + 169 + $ echo 140 > /sys/devices/system/cpu/cpu4/hotplug/target + $ cat /sys/devices/system/cpu/cpu4/hotplug/state + 140 + +It is important to note that the teardown callback of state 140 have been +invoked. And now get back online:: + + $ echo 169 > /sys/devices/system/cpu/cpu4/hotplug/target + $ cat /sys/devices/system/cpu/cpu4/hotplug/state + 169 + +With trace events enabled, the individual steps are visible, too:: + + # TASK-PID CPU# TIMESTAMP FUNCTION + # | | | | | + bash-394 [001] 22.976: cpuhp_enter: cpu: 0004 target: 140 step: 169 (cpuhp_kick_ap_work) + cpuhp/4-31 [004] 22.977: cpuhp_enter: cpu: 0004 target: 140 step: 168 (sched_cpu_deactivate) + cpuhp/4-31 [004] 22.990: cpuhp_exit: cpu: 0004 state: 168 step: 168 ret: 0 + cpuhp/4-31 [004] 22.991: cpuhp_enter: cpu: 0004 target: 140 step: 144 (mce_cpu_pre_down) + cpuhp/4-31 [004] 22.992: cpuhp_exit: cpu: 0004 state: 144 step: 144 ret: 0 + cpuhp/4-31 [004] 22.993: cpuhp_multi_enter: cpu: 0004 target: 140 step: 143 (virtnet_cpu_down_prep) + cpuhp/4-31 [004] 22.994: cpuhp_exit: cpu: 0004 state: 143 step: 143 ret: 0 + cpuhp/4-31 [004] 22.995: cpuhp_enter: cpu: 0004 target: 140 step: 142 (cacheinfo_cpu_pre_down) + cpuhp/4-31 [004] 22.996: cpuhp_exit: cpu: 0004 state: 142 step: 142 ret: 0 + bash-394 [001] 22.997: cpuhp_exit: cpu: 0004 state: 140 step: 169 ret: 0 + bash-394 [005] 95.540: cpuhp_enter: cpu: 0004 target: 169 step: 140 (cpuhp_kick_ap_work) + cpuhp/4-31 [004] 95.541: cpuhp_enter: cpu: 0004 target: 169 step: 141 (acpi_soft_cpu_online) + cpuhp/4-31 [004] 95.542: cpuhp_exit: cpu: 0004 state: 141 step: 141 ret: 0 + cpuhp/4-31 [004] 95.543: cpuhp_enter: cpu: 0004 target: 169 step: 142 (cacheinfo_cpu_online) + cpuhp/4-31 [004] 95.544: cpuhp_exit: cpu: 0004 state: 142 step: 142 ret: 0 + cpuhp/4-31 [004] 95.545: cpuhp_multi_enter: cpu: 0004 target: 169 step: 143 (virtnet_cpu_online) + cpuhp/4-31 [004] 95.546: cpuhp_exit: cpu: 0004 state: 143 step: 143 ret: 0 + cpuhp/4-31 [004] 95.547: cpuhp_enter: cpu: 0004 target: 169 step: 144 (mce_cpu_online) + cpuhp/4-31 [004] 95.548: cpuhp_exit: cpu: 0004 state: 144 step: 144 ret: 0 + cpuhp/4-31 [004] 95.549: cpuhp_enter: cpu: 0004 target: 169 step: 145 (console_cpu_notify) + cpuhp/4-31 [004] 95.550: cpuhp_exit: cpu: 0004 state: 145 step: 145 ret: 0 + cpuhp/4-31 [004] 95.551: cpuhp_enter: cpu: 0004 target: 169 step: 168 (sched_cpu_activate) + cpuhp/4-31 [004] 95.552: cpuhp_exit: cpu: 0004 state: 168 step: 168 ret: 0 + bash-394 [005] 95.553: cpuhp_exit: cpu: 0004 state: 169 step: 140 ret: 0 + +As it an be seen, CPU4 went down until timestamp 22.996 and then back up until +95.552. All invoked callbacks including their return codes are visible in the +trace. + +Architecture's requirements +=========================== + +The following functions and configurations are required: + +``CONFIG_HOTPLUG_CPU`` + This entry needs to be enabled in Kconfig + +``__cpu_up()`` + Arch interface to bring up a CPU + +``__cpu_disable()`` + Arch interface to shutdown a CPU, no more interrupts can be handled by the + kernel after the routine returns. This includes the shutdown of the timer. + +``__cpu_die()`` + This actually supposed to ensure death of the CPU. Actually look at some + example code in other arch that implement CPU hotplug. The processor is taken + down from the ``idle()`` loop for that specific architecture. ``__cpu_die()`` + typically waits for some per_cpu state to be set, to ensure the processor dead + routine is called to be sure positively. + +User Space Notification +======================= + +After CPU successfully onlined or offline udev events are sent. A udev rule like:: + + SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="the_hotplug_receiver.sh" + +will receive all events. A script like:: + + #!/bin/sh + + if [ "${ACTION}" = "offline" ] + then + echo "CPU ${DEVPATH##*/} offline" + + elif [ "${ACTION}" = "online" ] + then + echo "CPU ${DEVPATH##*/} online" + + fi + +can process the event further. + +When changes to the CPUs in the system occur, the sysfs file +/sys/devices/system/cpu/crash_hotplug contains '1' if the kernel +updates the kdump capture kernel list of CPUs itself (via elfcorehdr), +or '0' if userspace must update the kdump capture kernel list of CPUs. + +The availability depends on the CONFIG_HOTPLUG_CPU kernel configuration +option. + +To skip userspace processing of CPU hot un/plug events for kdump +(i.e. the unload-then-reload to obtain a current list of CPUs), this sysfs +file can be used in a udev rule as follows: + + SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" + +For a CPU hot un/plug event, if the architecture supports kernel updates +of the elfcorehdr (which contains the list of CPUs), then the rule skips +the unload-then-reload of the kdump capture kernel. + +Kernel Inline Documentations Reference +====================================== + +.. kernel-doc:: include/linux/cpuhotplug.h diff --git a/Documentation/core-api/debug-objects.rst b/Documentation/core-api/debug-objects.rst new file mode 100644 index 0000000000..ac926fd55a --- /dev/null +++ b/Documentation/core-api/debug-objects.rst @@ -0,0 +1,310 @@ +============================================ +The object-lifetime debugging infrastructure +============================================ + +:Author: Thomas Gleixner + +Introduction +============ + +debugobjects is a generic infrastructure to track the life time of +kernel objects and validate the operations on those. + +debugobjects is useful to check for the following error patterns: + +- Activation of uninitialized objects + +- Initialization of active objects + +- Usage of freed/destroyed objects + +debugobjects is not changing the data structure of the real object so it +can be compiled in with a minimal runtime impact and enabled on demand +with a kernel command line option. + +Howto use debugobjects +====================== + +A kernel subsystem needs to provide a data structure which describes the +object type and add calls into the debug code at appropriate places. The +data structure to describe the object type needs at minimum the name of +the object type. Optional functions can and should be provided to fixup +detected problems so the kernel can continue to work and the debug +information can be retrieved from a live system instead of hard core +debugging with serial consoles and stack trace transcripts from the +monitor. + +The debug calls provided by debugobjects are: + +- debug_object_init + +- debug_object_init_on_stack + +- debug_object_activate + +- debug_object_deactivate + +- debug_object_destroy + +- debug_object_free + +- debug_object_assert_init + +Each of these functions takes the address of the real object and a +pointer to the object type specific debug description structure. + +Each detected error is reported in the statistics and a limited number +of errors are printk'ed including a full stack trace. + +The statistics are available via /sys/kernel/debug/debug_objects/stats. +They provide information about the number of warnings and the number of +successful fixups along with information about the usage of the internal +tracking objects and the state of the internal tracking objects pool. + +Debug functions +=============== + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_init + +This function is called whenever the initialization function of a real +object is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be initialized. Initializing is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_init function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real initialization of the object happens. E.g. it +can deactivate an active object in order to prevent damage to the +subsystem. + +When the real object is not yet tracked by debugobjects, debugobjects +allocates a tracker object for the real object and sets the tracker +object state to ODEBUG_STATE_INIT. It verifies that the object is not +on the callers stack. If it is on the callers stack then a limited +number of warnings including a full stack trace is printk'ed. The +calling code must use debug_object_init_on_stack() and remove the +object before leaving the function which allocated it. See next section. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_init_on_stack + +This function is called whenever the initialization function of a real +object which resides on the stack is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be initialized. Initializing is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_init function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real initialization of the object happens. E.g. it +can deactivate an active object in order to prevent damage to the +subsystem. + +When the real object is not yet tracked by debugobjects debugobjects +allocates a tracker object for the real object and sets the tracker +object state to ODEBUG_STATE_INIT. It verifies that the object is on +the callers stack. + +An object which is on the stack must be removed from the tracker by +calling debug_object_free() before the function which allocates the +object returns. Otherwise we keep track of stale objects. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_activate + +This function is called whenever the activation function of a real +object is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be activated. Activating is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_activate function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real activation of the object happens. E.g. it can +deactivate an active object in order to prevent damage to the subsystem. + +When the real object is not yet tracked by debugobjects then the +fixup_activate function is called if available. This is necessary to +allow the legitimate activation of statically allocated and initialized +objects. The fixup function checks whether the object is valid and calls +the debug_objects_init() function to initialize the tracking of this +object. + +When the activation is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_ACTIVE. + + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_deactivate + +This function is called whenever the deactivation function of a real +object is called. + +When the real object is tracked by debugobjects it is checked, whether +the object can be deactivated. Deactivating is not allowed for untracked +or destroyed objects. + +When the deactivation is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_INACTIVE. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_destroy + +This function is called to mark an object destroyed. This is useful to +prevent the usage of invalid objects, which are still available in +memory: either statically allocated objects or objects which are freed +later. + +When the real object is tracked by debugobjects it is checked, whether +the object can be destroyed. Destruction is not allowed for active and +destroyed objects. When debugobjects detects an error, then it calls the +fixup_destroy function of the object type description structure if +provided by the caller. The fixup function can correct the problem +before the real destruction of the object happens. E.g. it can +deactivate an active object in order to prevent damage to the subsystem. + +When the destruction is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_DESTROYED. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_free + +This function is called before an object is freed. + +When the real object is tracked by debugobjects it is checked, whether +the object can be freed. Free is not allowed for active objects. When +debugobjects detects an error, then it calls the fixup_free function of +the object type description structure if provided by the caller. The +fixup function can correct the problem before the real free of the +object happens. E.g. it can deactivate an active object in order to +prevent damage to the subsystem. + +Note that debug_object_free removes the object from the tracker. Later +usage of the object is detected by the other debug checks. + + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_assert_init + +This function is called to assert that an object has been initialized. + +When the real object is not tracked by debugobjects, it calls +fixup_assert_init of the object type description structure provided by +the caller, with the hardcoded object state ODEBUG_NOT_AVAILABLE. The +fixup function can correct the problem by calling debug_object_init +and other specific initializing functions. + +When the real object is already tracked by debugobjects it is ignored. + +Fixup functions +=============== + +Debug object type description structure +--------------------------------------- + +.. kernel-doc:: include/linux/debugobjects.h + :internal: + +fixup_init +----------- + +This function is called from the debug code whenever a problem in +debug_object_init is detected. The function takes the address of the +object and the state which is currently recorded in the tracker. + +Called from debug_object_init when the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note, that the function needs to call the debug_object_init() function +again, after the damage has been repaired in order to keep the state +consistent. + +fixup_activate +--------------- + +This function is called from the debug code whenever a problem in +debug_object_activate is detected. + +Called from debug_object_activate when the object state is: + +- ODEBUG_STATE_NOTAVAILABLE + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note that the function needs to call the debug_object_activate() +function again after the damage has been repaired in order to keep the +state consistent. + +The activation of statically initialized objects is a special case. When +debug_object_activate() has no tracked object for this object address +then fixup_activate() is called with object state +ODEBUG_STATE_NOTAVAILABLE. The fixup function needs to check whether +this is a legitimate case of a statically initialized object or not. In +case it is it calls debug_object_init() and debug_object_activate() +to make the object known to the tracker and marked active. In this case +the function should return false because this is not a real fixup. + +fixup_destroy +-------------- + +This function is called from the debug code whenever a problem in +debug_object_destroy is detected. + +Called from debug_object_destroy when the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +fixup_free +----------- + +This function is called from the debug code whenever a problem in +debug_object_free is detected. Further it can be called from the debug +checks in kfree/vfree, when an active object is detected from the +debug_check_no_obj_freed() sanity checks. + +Called from debug_object_free() or debug_check_no_obj_freed() when +the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +fixup_assert_init +------------------- + +This function is called from the debug code whenever a problem in +debug_object_assert_init is detected. + +Called from debug_object_assert_init() with a hardcoded state +ODEBUG_STATE_NOTAVAILABLE when the object is not found in the debug +bucket. + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note, this function should make sure debug_object_init() is called +before returning. + +The handling of statically initialized objects is a special case. The +fixup function should check if this is a legitimate case of a statically +initialized object or not. In this case only debug_object_init() +should be called to make the object known to the tracker. Then the +function should return false because this is not a real fixup. + +Known Bugs And Assumptions +========================== + +None (knock on wood). diff --git a/Documentation/core-api/debugging-via-ohci1394.rst b/Documentation/core-api/debugging-via-ohci1394.rst new file mode 100644 index 0000000000..981ad4f89f --- /dev/null +++ b/Documentation/core-api/debugging-via-ohci1394.rst @@ -0,0 +1,185 @@ +=========================================================================== +Using physical DMA provided by OHCI-1394 FireWire controllers for debugging +=========================================================================== + +Introduction +------------ + +Basically all FireWire controllers which are in use today are compliant +to the OHCI-1394 specification which defines the controller to be a PCI +bus master which uses DMA to offload data transfers from the CPU and has +a "Physical Response Unit" which executes specific requests by employing +PCI-Bus master DMA after applying filters defined by the OHCI-1394 driver. + +Once properly configured, remote machines can send these requests to +ask the OHCI-1394 controller to perform read and write requests on +physical system memory and, for read requests, send the result of +the physical memory read back to the requester. + +With that, it is possible to debug issues by reading interesting memory +locations such as buffers like the printk buffer or the process table. + +Retrieving a full system memory dump is also possible over the FireWire, +using data transfer rates in the order of 10MB/s or more. + +With most FireWire controllers, memory access is limited to the low 4 GB +of physical address space. This can be a problem on IA64 machines where +memory is located mostly above that limit, but it is rarely a problem on +more common hardware such as x86, x86-64 and PowerPC. + +At least LSI FW643e and FW643e2 controllers are known to support access to +physical addresses above 4 GB, but this feature is currently not enabled by +Linux. + +Together with a early initialization of the OHCI-1394 controller for debugging, +this facility proved most useful for examining long debugs logs in the printk +buffer on to debug early boot problems in areas like ACPI where the system +fails to boot and other means for debugging (serial port) are either not +available (notebooks) or too slow for extensive debug information (like ACPI). + +Drivers +------- + +The firewire-ohci driver in drivers/firewire uses filtered physical +DMA by default, which is more secure but not suitable for remote debugging. +Pass the remote_dma=1 parameter to the driver to get unfiltered physical DMA. + +Because the firewire-ohci driver depends on the PCI enumeration to be +completed, an initialization routine which runs pretty early has been +implemented for x86. This routine runs long before console_init() can be +called, i.e. before the printk buffer appears on the console. + +To activate it, enable CONFIG_PROVIDE_OHCI1394_DMA_INIT (Kernel hacking menu: +Remote debugging over FireWire early on boot) and pass the parameter +"ohci1394_dma=early" to the recompiled kernel on boot. + +Tools +----- + +firescope - Originally developed by Benjamin Herrenschmidt, Andi Kleen ported +it from PowerPC to x86 and x86_64 and added functionality, firescope can now +be used to view the printk buffer of a remote machine, even with live update. + +Bernhard Kaindl enhanced firescope to support accessing 64-bit machines +from 32-bit firescope and vice versa: +- http://v3.sk/~lkundrak/firescope/ + +and he implemented fast system dump (alpha version - read README.txt): +- http://halobates.de/firewire/firedump-0.1.tar.bz2 + +There is also a gdb proxy for firewire which allows to use gdb to access +data which can be referenced from symbols found by gdb in vmlinux: +- http://halobates.de/firewire/fireproxy-0.33.tar.bz2 + +The latest version of this gdb proxy (fireproxy-0.34) can communicate (not +yet stable) with kgdb over an memory-based communication module (kgdbom). + +Getting Started +--------------- + +The OHCI-1394 specification regulates that the OHCI-1394 controller must +disable all physical DMA on each bus reset. + +This means that if you want to debug an issue in a system state where +interrupts are disabled and where no polling of the OHCI-1394 controller +for bus resets takes place, you have to establish any FireWire cable +connections and fully initialize all FireWire hardware __before__ the +system enters such state. + +Step-by-step instructions for using firescope with early OHCI initialization: + +1) Verify that your hardware is supported: + + Load the firewire-ohci module and check your kernel logs. + You should see a line similar to:: + + firewire_ohci 0000:15:00.1: added OHCI v1.0 device as card 2, 4 IR + 4 IT + ... contexts, quirks 0x11 + + when loading the driver. If you have no supported controller, many PCI, + CardBus and even some Express cards which are fully compliant to OHCI-1394 + specification are available. If it requires no driver for Windows operating + systems, it most likely is. Only specialized shops have cards which are not + compliant, they are based on TI PCILynx chips and require drivers for Windows + operating systems. + + The mentioned kernel log message contains the string "physUB" if the + controller implements a writable Physical Upper Bound register. This is + required for physical DMA above 4 GB (but not utilized by Linux yet). + +2) Establish a working FireWire cable connection: + + Any FireWire cable, as long at it provides electrically and mechanically + stable connection and has matching connectors (there are small 4-pin and + large 6-pin FireWire ports) will do. + + If an driver is running on both machines you should see a line like:: + + firewire_core 0000:15:00.1: created device fw1: GUID 00061b0020105917, S400 + + on both machines in the kernel log when the cable is plugged in + and connects the two machines. + +3) Test physical DMA using firescope: + + On the debug host, make sure that /dev/fw* is accessible, + then start firescope:: + + $ firescope + Port 0 (/dev/fw1) opened, 2 nodes detected + + FireScope + --------- + Target : <unspecified> + Gen : 1 + [Ctrl-T] choose target + [Ctrl-H] this menu + [Ctrl-Q] quit + + ------> Press Ctrl-T now, the output should be similar to: + + 2 nodes available, local node is: 0 + 0: ffc0, uuid: 00000000 00000000 [LOCAL] + 1: ffc1, uuid: 00279000 ba4bb801 + + Besides the [LOCAL] node, it must show another node without error message. + +4) Prepare for debugging with early OHCI-1394 initialization: + + 4.1) Kernel compilation and installation on debug target + + Compile the kernel to be debugged with CONFIG_PROVIDE_OHCI1394_DMA_INIT + (Kernel hacking: Provide code for enabling DMA over FireWire early on boot) + enabled and install it on the machine to be debugged (debug target). + + 4.2) Transfer the System.map of the debugged kernel to the debug host + + Copy the System.map of the kernel be debugged to the debug host (the host + which is connected to the debugged machine over the FireWire cable). + +5) Retrieving the printk buffer contents: + + With the FireWire cable connected, the OHCI-1394 driver on the debugging + host loaded, reboot the debugged machine, booting the kernel which has + CONFIG_PROVIDE_OHCI1394_DMA_INIT enabled, with the option ohci1394_dma=early. + + Then, on the debugging host, run firescope, for example by using -A:: + + firescope -A System.map-of-debug-target-kernel + + Note: -A automatically attaches to the first non-local node. It only works + reliably if only connected two machines are connected using FireWire. + + After having attached to the debug target, press Ctrl-D to view the + complete printk buffer or Ctrl-U to enter auto update mode and get an + updated live view of recent kernel messages logged on the debug target. + + Call "firescope -h" to get more information on firescope's options. + +Notes +----- + +Documentation and specifications: http://halobates.de/firewire/ + +FireWire is a trademark of Apple Inc. - for more information please refer to: +https://en.wikipedia.org/wiki/FireWire diff --git a/Documentation/core-api/dma-api-howto.rst b/Documentation/core-api/dma-api-howto.rst new file mode 100644 index 0000000000..72f6cdb6be --- /dev/null +++ b/Documentation/core-api/dma-api-howto.rst @@ -0,0 +1,915 @@ +========================= +Dynamic DMA mapping Guide +========================= + +:Author: David S. Miller <davem@redhat.com> +:Author: Richard Henderson <rth@cygnus.com> +:Author: Jakub Jelinek <jakub@redhat.com> + +This is a guide to device driver writers on how to use the DMA API +with example pseudo-code. For a concise description of the API, see +DMA-API.txt. + +CPU and DMA addresses +===================== + +There are several kinds of addresses involved in the DMA API, and it's +important to understand the differences. + +The kernel normally uses virtual addresses. Any address returned by +kmalloc(), vmalloc(), and similar interfaces is a virtual address and can +be stored in a ``void *``. + +The virtual memory system (TLB, page tables, etc.) translates virtual +addresses to CPU physical addresses, which are stored as "phys_addr_t" or +"resource_size_t". The kernel manages device resources like registers as +physical addresses. These are the addresses in /proc/iomem. The physical +address is not directly useful to a driver; it must use ioremap() to map +the space and produce a virtual address. + +I/O devices use a third kind of address: a "bus address". If a device has +registers at an MMIO address, or if it performs DMA to read or write system +memory, the addresses used by the device are bus addresses. In some +systems, bus addresses are identical to CPU physical addresses, but in +general they are not. IOMMUs and host bridges can produce arbitrary +mappings between physical and bus addresses. + +From a device's point of view, DMA uses the bus address space, but it may +be restricted to a subset of that space. For example, even if a system +supports 64-bit addresses for main memory and PCI BARs, it may use an IOMMU +so devices only need to use 32-bit DMA addresses. + +Here's a picture and some examples:: + + CPU CPU Bus + Virtual Physical Address + Address Address Space + Space Space + + +-------+ +------+ +------+ + | | |MMIO | Offset | | + | | Virtual |Space | applied | | + C +-------+ --------> B +------+ ----------> +------+ A + | | mapping | | by host | | + +-----+ | | | | bridge | | +--------+ + | | | | +------+ | | | | + | CPU | | | | RAM | | | | Device | + | | | | | | | | | | + +-----+ +-------+ +------+ +------+ +--------+ + | | Virtual |Buffer| Mapping | | + X +-------+ --------> Y +------+ <---------- +------+ Z + | | mapping | RAM | by IOMMU + | | | | + | | | | + +-------+ +------+ + +During the enumeration process, the kernel learns about I/O devices and +their MMIO space and the host bridges that connect them to the system. For +example, if a PCI device has a BAR, the kernel reads the bus address (A) +from the BAR and converts it to a CPU physical address (B). The address B +is stored in a struct resource and usually exposed via /proc/iomem. When a +driver claims a device, it typically uses ioremap() to map physical address +B at a virtual address (C). It can then use, e.g., ioread32(C), to access +the device registers at bus address A. + +If the device supports DMA, the driver sets up a buffer using kmalloc() or +a similar interface, which returns a virtual address (X). The virtual +memory system maps X to a physical address (Y) in system RAM. The driver +can use virtual address X to access the buffer, but the device itself +cannot because DMA doesn't go through the CPU virtual memory system. + +In some simple systems, the device can do DMA directly to physical address +Y. But in many others, there is IOMMU hardware that translates DMA +addresses to physical addresses, e.g., it translates Z to Y. This is part +of the reason for the DMA API: the driver can give a virtual address X to +an interface like dma_map_single(), which sets up any required IOMMU +mapping and returns the DMA address Z. The driver then tells the device to +do DMA to Z, and the IOMMU maps it to the buffer at address Y in system +RAM. + +So that Linux can use the dynamic DMA mapping, it needs some help from the +drivers, namely it has to take into account that DMA addresses should be +mapped only for the time they are actually used and unmapped after the DMA +transfer. + +The following API will work of course even on platforms where no such +hardware exists. + +Note that the DMA API works with any bus independent of the underlying +microprocessor architecture. You should use the DMA API rather than the +bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the +pci_map_*() interfaces. + +First of all, you should make sure:: + + #include <linux/dma-mapping.h> + +is in your driver, which provides the definition of dma_addr_t. This type +can hold any valid DMA address for the platform and should be used +everywhere you hold a DMA address returned from the DMA mapping functions. + +What memory is DMA'able? +======================== + +The first piece of information you must know is what kernel memory can +be used with the DMA mapping facilities. There has been an unwritten +set of rules regarding this, and this text is an attempt to finally +write them down. + +If you acquired your memory via the page allocator +(i.e. __get_free_page*()) or the generic memory allocators +(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from +that memory using the addresses returned from those routines. + +This means specifically that you may _not_ use the memory/addresses +returned from vmalloc() for DMA. It is possible to DMA to the +_underlying_ memory mapped into a vmalloc() area, but this requires +walking page tables to get the physical addresses, and then +translating each of those pages back to a kernel address using +something like __va(). [ EDIT: Update this when we integrate +Gerd Knorr's generic code which does this. ] + +This rule also means that you may use neither kernel image addresses +(items in data/text/bss segments), nor module image addresses, nor +stack addresses for DMA. These could all be mapped somewhere entirely +different than the rest of physical memory. Even if those classes of +memory could physically work with DMA, you'd need to ensure the I/O +buffers were cacheline-aligned. Without that, you'd see cacheline +sharing problems (data corruption) on CPUs with DMA-incoherent caches. +(The CPU could write to one word, DMA would write to a different one +in the same cache line, and one of them could be overwritten.) + +Also, this means that you cannot take the return of a kmap() +call and DMA to/from that. This is similar to vmalloc(). + +What about block I/O and networking buffers? The block I/O and +networking subsystems make sure that the buffers they use are valid +for you to DMA from/to. + +DMA addressing capabilities +=========================== + +By default, the kernel assumes that your device can address 32-bits of DMA +addressing. For a 64-bit capable device, this needs to be increased, and for +a device with limitations, it needs to be decreased. + +Special note about PCI: PCI-X specification requires PCI-X devices to support +64-bit addressing (DAC) for all transactions. And at least one platform (SGI +SN2) requires 64-bit consistent allocations to operate correctly when the IO +bus is in PCI-X mode. + +For correct operation, you must set the DMA mask to inform the kernel about +your devices DMA addressing capabilities. + +This is performed via a call to dma_set_mask_and_coherent():: + + int dma_set_mask_and_coherent(struct device *dev, u64 mask); + +which will set the mask for both streaming and coherent APIs together. If you +have some special requirements, then the following two separate calls can be +used instead: + + The setup for streaming mappings is performed via a call to + dma_set_mask():: + + int dma_set_mask(struct device *dev, u64 mask); + + The setup for consistent allocations is performed via a call + to dma_set_coherent_mask():: + + int dma_set_coherent_mask(struct device *dev, u64 mask); + +Here, dev is a pointer to the device struct of your device, and mask is a bit +mask describing which bits of an address your device supports. Often the +device struct of your device is embedded in the bus-specific device struct of +your device. For example, &pdev->dev is a pointer to the device struct of a +PCI device (pdev is a pointer to the PCI device struct of your device). + +These calls usually return zero to indicate your device can perform DMA +properly on the machine given the address mask you provided, but they might +return an error if the mask is too small to be supportable on the given +system. If it returns non-zero, your device cannot perform DMA properly on +this platform, and attempting to do so will result in undefined behavior. +You must not use DMA on this device unless the dma_set_mask family of +functions has returned success. + +This means that in the failure case, you have two options: + +1) Use some non-DMA mode for data transfer, if possible. +2) Ignore this device and do not initialize it. + +It is recommended that your driver print a kernel KERN_WARNING message when +setting the DMA mask fails. In this manner, if a user of your driver reports +that performance is bad or that the device is not even detected, you can ask +them for the kernel messages to find out exactly why. + +The standard 64-bit addressing device would do something like this:: + + if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64))) { + dev_warn(dev, "mydev: No suitable DMA available\n"); + goto ignore_this_device; + } + +If the device only supports 32-bit addressing for descriptors in the +coherent allocations, but supports full 64-bits for streaming mappings +it would look like this:: + + if (dma_set_mask(dev, DMA_BIT_MASK(64))) { + dev_warn(dev, "mydev: No suitable DMA available\n"); + goto ignore_this_device; + } + +The coherent mask will always be able to set the same or a smaller mask as +the streaming mask. However for the rare case that a device driver only +uses consistent allocations, one would have to check the return value from +dma_set_coherent_mask(). + +Finally, if your device can only drive the low 24-bits of +address you might do something like:: + + if (dma_set_mask(dev, DMA_BIT_MASK(24))) { + dev_warn(dev, "mydev: 24-bit DMA addressing not available\n"); + goto ignore_this_device; + } + +When dma_set_mask() or dma_set_mask_and_coherent() is successful, and +returns zero, the kernel saves away this mask you have provided. The +kernel will use this information later when you make DMA mappings. + +There is a case which we are aware of at this time, which is worth +mentioning in this documentation. If your device supports multiple +functions (for example a sound card provides playback and record +functions) and the various different functions have _different_ +DMA addressing limitations, you may wish to probe each mask and +only provide the functionality which the machine can handle. It +is important that the last call to dma_set_mask() be for the +most specific mask. + +Here is pseudo-code showing how this might be done:: + + #define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32) + #define RECORD_ADDRESS_BITS DMA_BIT_MASK(24) + + struct my_sound_card *card; + struct device *dev; + + ... + if (!dma_set_mask(dev, PLAYBACK_ADDRESS_BITS)) { + card->playback_enabled = 1; + } else { + card->playback_enabled = 0; + dev_warn(dev, "%s: Playback disabled due to DMA limitations\n", + card->name); + } + if (!dma_set_mask(dev, RECORD_ADDRESS_BITS)) { + card->record_enabled = 1; + } else { + card->record_enabled = 0; + dev_warn(dev, "%s: Record disabled due to DMA limitations\n", + card->name); + } + +A sound card was used as an example here because this genre of PCI +devices seems to be littered with ISA chips given a PCI front end, +and thus retaining the 16MB DMA addressing limitations of ISA. + +Types of DMA mappings +===================== + +There are two types of DMA mappings: + +- Consistent DMA mappings which are usually mapped at driver + initialization, unmapped at the end and for which the hardware should + guarantee that the device and the CPU can access the data + in parallel and will see updates made by each other without any + explicit software flushing. + + Think of "consistent" as "synchronous" or "coherent". + + The current default is to return consistent memory in the low 32 + bits of the DMA space. However, for future compatibility you should + set the consistent mask even if this default is fine for your + driver. + + Good examples of what to use consistent mappings for are: + + - Network card DMA ring descriptors. + - SCSI adapter mailbox command data structures. + - Device firmware microcode executed out of + main memory. + + The invariant these examples all require is that any CPU store + to memory is immediately visible to the device, and vice + versa. Consistent mappings guarantee this. + + .. important:: + + Consistent DMA memory does not preclude the usage of + proper memory barriers. The CPU may reorder stores to + consistent memory just as it may normal memory. Example: + if it is important for the device to see the first word + of a descriptor updated before the second, you must do + something like:: + + desc->word0 = address; + wmb(); + desc->word1 = DESC_VALID; + + in order to get correct behavior on all platforms. + + Also, on some platforms your driver may need to flush CPU write + buffers in much the same way as it needs to flush write buffers + found in PCI bridges (such as by reading a register's value + after writing it). + +- Streaming DMA mappings which are usually mapped for one DMA + transfer, unmapped right after it (unless you use dma_sync_* below) + and for which hardware can optimize for sequential accesses. + + Think of "streaming" as "asynchronous" or "outside the coherency + domain". + + Good examples of what to use streaming mappings for are: + + - Networking buffers transmitted/received by a device. + - Filesystem buffers written/read by a SCSI device. + + The interfaces for using this type of mapping were designed in + such a way that an implementation can make whatever performance + optimizations the hardware allows. To this end, when using + such mappings you must be explicit about what you want to happen. + +Neither type of DMA mapping has alignment restrictions that come from +the underlying bus, although some devices may have such restrictions. +Also, systems with caches that aren't DMA-coherent will work better +when the underlying buffers don't share cache lines with other data. + + +Using Consistent DMA mappings +============================= + +To allocate and map large (PAGE_SIZE or so) consistent DMA regions, +you should do:: + + dma_addr_t dma_handle; + + cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp); + +where device is a ``struct device *``. This may be called in interrupt +context with the GFP_ATOMIC flag. + +Size is the length of the region you want to allocate, in bytes. + +This routine will allocate RAM for that region, so it acts similarly to +__get_free_pages() (but takes size instead of a page order). If your +driver needs regions sized smaller than a page, you may prefer using +the dma_pool interface, described below. + +The consistent DMA mapping interfaces, will by default return a DMA address +which is 32-bit addressable. Even if the device indicates (via the DMA mask) +that it may address the upper 32-bits, consistent allocation will only +return > 32-bit addresses for DMA if the consistent DMA mask has been +explicitly changed via dma_set_coherent_mask(). This is true of the +dma_pool interface as well. + +dma_alloc_coherent() returns two values: the virtual address which you +can use to access it from the CPU and dma_handle which you pass to the +card. + +The CPU virtual address and the DMA address are both +guaranteed to be aligned to the smallest PAGE_SIZE order which +is greater than or equal to the requested size. This invariant +exists (for example) to guarantee that if you allocate a chunk +which is smaller than or equal to 64 kilobytes, the extent of the +buffer you receive will not cross a 64K boundary. + +To unmap and free such a DMA region, you call:: + + dma_free_coherent(dev, size, cpu_addr, dma_handle); + +where dev, size are the same as in the above call and cpu_addr and +dma_handle are the values dma_alloc_coherent() returned to you. +This function may not be called in interrupt context. + +If your driver needs lots of smaller memory regions, you can write +custom code to subdivide pages returned by dma_alloc_coherent(), +or you can use the dma_pool API to do that. A dma_pool is like +a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages(). +Also, it understands common hardware constraints for alignment, +like queue heads needing to be aligned on N byte boundaries. + +Create a dma_pool like this:: + + struct dma_pool *pool; + + pool = dma_pool_create(name, dev, size, align, boundary); + +The "name" is for diagnostics (like a kmem_cache name); dev and size +are as above. The device's hardware alignment requirement for this +type of data is "align" (which is expressed in bytes, and must be a +power of two). If your device has no boundary crossing restrictions, +pass 0 for boundary; passing 4096 says memory allocated from this pool +must not cross 4KByte boundaries (but at that time it may be better to +use dma_alloc_coherent() directly instead). + +Allocate memory from a DMA pool like this:: + + cpu_addr = dma_pool_alloc(pool, flags, &dma_handle); + +flags are GFP_KERNEL if blocking is permitted (not in_interrupt nor +holding SMP locks), GFP_ATOMIC otherwise. Like dma_alloc_coherent(), +this returns two values, cpu_addr and dma_handle. + +Free memory that was allocated from a dma_pool like this:: + + dma_pool_free(pool, cpu_addr, dma_handle); + +where pool is what you passed to dma_pool_alloc(), and cpu_addr and +dma_handle are the values dma_pool_alloc() returned. This function +may be called in interrupt context. + +Destroy a dma_pool by calling:: + + dma_pool_destroy(pool); + +Make sure you've called dma_pool_free() for all memory allocated +from a pool before you destroy the pool. This function may not +be called in interrupt context. + +DMA Direction +============= + +The interfaces described in subsequent portions of this document +take a DMA direction argument, which is an integer and takes on +one of the following values:: + + DMA_BIDIRECTIONAL + DMA_TO_DEVICE + DMA_FROM_DEVICE + DMA_NONE + +You should provide the exact DMA direction if you know it. + +DMA_TO_DEVICE means "from main memory to the device" +DMA_FROM_DEVICE means "from the device to main memory" +It is the direction in which the data moves during the DMA +transfer. + +You are _strongly_ encouraged to specify this as precisely +as you possibly can. + +If you absolutely cannot know the direction of the DMA transfer, +specify DMA_BIDIRECTIONAL. It means that the DMA can go in +either direction. The platform guarantees that you may legally +specify this, and that it will work, but this may be at the +cost of performance for example. + +The value DMA_NONE is to be used for debugging. One can +hold this in a data structure before you come to know the +precise direction, and this will help catch cases where your +direction tracking logic has failed to set things up properly. + +Another advantage of specifying this value precisely (outside of +potential platform-specific optimizations of such) is for debugging. +Some platforms actually have a write permission boolean which DMA +mappings can be marked with, much like page protections in the user +program address space. Such platforms can and do report errors in the +kernel logs when the DMA controller hardware detects violation of the +permission setting. + +Only streaming mappings specify a direction, consistent mappings +implicitly have a direction attribute setting of +DMA_BIDIRECTIONAL. + +The SCSI subsystem tells you the direction to use in the +'sc_data_direction' member of the SCSI command your driver is +working on. + +For Networking drivers, it's a rather simple affair. For transmit +packets, map/unmap them with the DMA_TO_DEVICE direction +specifier. For receive packets, just the opposite, map/unmap them +with the DMA_FROM_DEVICE direction specifier. + +Using Streaming DMA mappings +============================ + +The streaming DMA mapping routines can be called from interrupt +context. There are two versions of each map/unmap, one which will +map/unmap a single memory region, and one which will map/unmap a +scatterlist. + +To map a single region, you do:: + + struct device *dev = &my_dev->dev; + dma_addr_t dma_handle; + void *addr = buffer->ptr; + size_t size = buffer->len; + + dma_handle = dma_map_single(dev, addr, size, direction); + if (dma_mapping_error(dev, dma_handle)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + goto map_error_handling; + } + +and to unmap it:: + + dma_unmap_single(dev, dma_handle, size, direction); + +You should call dma_mapping_error() as dma_map_single() could fail and return +error. Doing so will ensure that the mapping code will work correctly on all +DMA implementations without any dependency on the specifics of the underlying +implementation. Using the returned address without checking for errors could +result in failures ranging from panics to silent data corruption. The same +applies to dma_map_page() as well. + +You should call dma_unmap_single() when the DMA activity is finished, e.g., +from the interrupt which told you that the DMA transfer is done. + +Using CPU pointers like this for single mappings has a disadvantage: +you cannot reference HIGHMEM memory in this way. Thus, there is a +map/unmap interface pair akin to dma_{map,unmap}_single(). These +interfaces deal with page/offset pairs instead of CPU pointers. +Specifically:: + + struct device *dev = &my_dev->dev; + dma_addr_t dma_handle; + struct page *page = buffer->page; + unsigned long offset = buffer->offset; + size_t size = buffer->len; + + dma_handle = dma_map_page(dev, page, offset, size, direction); + if (dma_mapping_error(dev, dma_handle)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + goto map_error_handling; + } + + ... + + dma_unmap_page(dev, dma_handle, size, direction); + +Here, "offset" means byte offset within the given page. + +You should call dma_mapping_error() as dma_map_page() could fail and return +error as outlined under the dma_map_single() discussion. + +You should call dma_unmap_page() when the DMA activity is finished, e.g., +from the interrupt which told you that the DMA transfer is done. + +With scatterlists, you map a region gathered from several regions by:: + + int i, count = dma_map_sg(dev, sglist, nents, direction); + struct scatterlist *sg; + + for_each_sg(sglist, sg, count, i) { + hw_address[i] = sg_dma_address(sg); + hw_len[i] = sg_dma_len(sg); + } + +where nents is the number of entries in the sglist. + +The implementation is free to merge several consecutive sglist entries +into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any +consecutive sglist entries can be merged into one provided the first one +ends and the second one starts on a page boundary - in fact this is a huge +advantage for cards which either cannot do scatter-gather or have very +limited number of scatter-gather entries) and returns the actual number +of sg entries it mapped them to. On failure 0 is returned. + +Then you should loop count times (note: this can be less than nents times) +and use sg_dma_address() and sg_dma_len() macros where you previously +accessed sg->address and sg->length as shown above. + +To unmap a scatterlist, just call:: + + dma_unmap_sg(dev, sglist, nents, direction); + +Again, make sure DMA activity has already finished. + +.. note:: + + The 'nents' argument to the dma_unmap_sg call must be + the _same_ one you passed into the dma_map_sg call, + it should _NOT_ be the 'count' value _returned_ from the + dma_map_sg call. + +Every dma_map_{single,sg}() call should have its dma_unmap_{single,sg}() +counterpart, because the DMA address space is a shared resource and +you could render the machine unusable by consuming all DMA addresses. + +If you need to use the same streaming DMA region multiple times and touch +the data in between the DMA transfers, the buffer needs to be synced +properly in order for the CPU and device to see the most up-to-date and +correct copy of the DMA buffer. + +So, firstly, just map it with dma_map_{single,sg}(), and after each DMA +transfer call either:: + + dma_sync_single_for_cpu(dev, dma_handle, size, direction); + +or:: + + dma_sync_sg_for_cpu(dev, sglist, nents, direction); + +as appropriate. + +Then, if you wish to let the device get at the DMA area again, +finish accessing the data with the CPU, and then before actually +giving the buffer to the hardware call either:: + + dma_sync_single_for_device(dev, dma_handle, size, direction); + +or:: + + dma_sync_sg_for_device(dev, sglist, nents, direction); + +as appropriate. + +.. note:: + + The 'nents' argument to dma_sync_sg_for_cpu() and + dma_sync_sg_for_device() must be the same passed to + dma_map_sg(). It is _NOT_ the count returned by + dma_map_sg(). + +After the last DMA transfer call one of the DMA unmap routines +dma_unmap_{single,sg}(). If you don't touch the data from the first +dma_map_*() call till dma_unmap_*(), then you don't have to call the +dma_sync_*() routines at all. + +Here is pseudo code which shows a situation in which you would need +to use the dma_sync_*() interfaces:: + + my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) + { + dma_addr_t mapping; + + mapping = dma_map_single(cp->dev, buffer, len, DMA_FROM_DEVICE); + if (dma_mapping_error(cp->dev, mapping)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + goto map_error_handling; + } + + cp->rx_buf = buffer; + cp->rx_len = len; + cp->rx_dma = mapping; + + give_rx_buf_to_card(cp); + } + + ... + + my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) + { + struct my_card *cp = devid; + + ... + if (read_card_status(cp) == RX_BUF_TRANSFERRED) { + struct my_card_header *hp; + + /* Examine the header to see if we wish + * to accept the data. But synchronize + * the DMA transfer with the CPU first + * so that we see updated contents. + */ + dma_sync_single_for_cpu(&cp->dev, cp->rx_dma, + cp->rx_len, + DMA_FROM_DEVICE); + + /* Now it is safe to examine the buffer. */ + hp = (struct my_card_header *) cp->rx_buf; + if (header_is_ok(hp)) { + dma_unmap_single(&cp->dev, cp->rx_dma, cp->rx_len, + DMA_FROM_DEVICE); + pass_to_upper_layers(cp->rx_buf); + make_and_setup_new_rx_buf(cp); + } else { + /* CPU should not write to + * DMA_FROM_DEVICE-mapped area, + * so dma_sync_single_for_device() is + * not needed here. It would be required + * for DMA_BIDIRECTIONAL mapping if + * the memory was modified. + */ + give_rx_buf_to_card(cp); + } + } + } + +Handling Errors +=============== + +DMA address space is limited on some architectures and an allocation +failure can be determined by: + +- checking if dma_alloc_coherent() returns NULL or dma_map_sg returns 0 + +- checking the dma_addr_t returned from dma_map_single() and dma_map_page() + by using dma_mapping_error():: + + dma_addr_t dma_handle; + + dma_handle = dma_map_single(dev, addr, size, direction); + if (dma_mapping_error(dev, dma_handle)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + goto map_error_handling; + } + +- unmap pages that are already mapped, when mapping error occurs in the middle + of a multiple page mapping attempt. These example are applicable to + dma_map_page() as well. + +Example 1:: + + dma_addr_t dma_handle1; + dma_addr_t dma_handle2; + + dma_handle1 = dma_map_single(dev, addr, size, direction); + if (dma_mapping_error(dev, dma_handle1)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + goto map_error_handling1; + } + dma_handle2 = dma_map_single(dev, addr, size, direction); + if (dma_mapping_error(dev, dma_handle2)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + goto map_error_handling2; + } + + ... + + map_error_handling2: + dma_unmap_single(dma_handle1); + map_error_handling1: + +Example 2:: + + /* + * if buffers are allocated in a loop, unmap all mapped buffers when + * mapping error is detected in the middle + */ + + dma_addr_t dma_addr; + dma_addr_t array[DMA_BUFFERS]; + int save_index = 0; + + for (i = 0; i < DMA_BUFFERS; i++) { + + ... + + dma_addr = dma_map_single(dev, addr, size, direction); + if (dma_mapping_error(dev, dma_addr)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + goto map_error_handling; + } + array[i].dma_addr = dma_addr; + save_index++; + } + + ... + + map_error_handling: + + for (i = 0; i < save_index; i++) { + + ... + + dma_unmap_single(array[i].dma_addr); + } + +Networking drivers must call dev_kfree_skb() to free the socket buffer +and return NETDEV_TX_OK if the DMA mapping fails on the transmit hook +(ndo_start_xmit). This means that the socket buffer is just dropped in +the failure case. + +SCSI drivers must return SCSI_MLQUEUE_HOST_BUSY if the DMA mapping +fails in the queuecommand hook. This means that the SCSI subsystem +passes the command to the driver again later. + +Optimizing Unmap State Space Consumption +======================================== + +On many platforms, dma_unmap_{single,page}() is simply a nop. +Therefore, keeping track of the mapping address and length is a waste +of space. Instead of filling your drivers up with ifdefs and the like +to "work around" this (which would defeat the whole purpose of a +portable API) the following facilities are provided. + +Actually, instead of describing the macros one by one, we'll +transform some example code. + +1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures. + Example, before:: + + struct ring_state { + struct sk_buff *skb; + dma_addr_t mapping; + __u32 len; + }; + + after:: + + struct ring_state { + struct sk_buff *skb; + DEFINE_DMA_UNMAP_ADDR(mapping); + DEFINE_DMA_UNMAP_LEN(len); + }; + +2) Use dma_unmap_{addr,len}_set() to set these values. + Example, before:: + + ringp->mapping = FOO; + ringp->len = BAR; + + after:: + + dma_unmap_addr_set(ringp, mapping, FOO); + dma_unmap_len_set(ringp, len, BAR); + +3) Use dma_unmap_{addr,len}() to access these values. + Example, before:: + + dma_unmap_single(dev, ringp->mapping, ringp->len, + DMA_FROM_DEVICE); + + after:: + + dma_unmap_single(dev, + dma_unmap_addr(ringp, mapping), + dma_unmap_len(ringp, len), + DMA_FROM_DEVICE); + +It really should be self-explanatory. We treat the ADDR and LEN +separately, because it is possible for an implementation to only +need the address in order to perform the unmap operation. + +Platform Issues +=============== + +If you are just writing drivers for Linux and do not maintain +an architecture port for the kernel, you can safely skip down +to "Closing". + +1) Struct scatterlist requirements. + + You need to enable CONFIG_NEED_SG_DMA_LENGTH if the architecture + supports IOMMUs (including software IOMMU). + +2) ARCH_DMA_MINALIGN + + Architectures must ensure that kmalloc'ed buffer is + DMA-safe. Drivers and subsystems depend on it. If an architecture + isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in + the CPU cache is identical to data in main memory), + ARCH_DMA_MINALIGN must be set so that the memory allocator + makes sure that kmalloc'ed buffer doesn't share a cache line with + the others. See arch/arm/include/asm/cache.h as an example. + + Note that ARCH_DMA_MINALIGN is about DMA memory alignment + constraints. You don't need to worry about the architecture data + alignment constraints (e.g. the alignment constraints about 64-bit + objects). + +Closing +======= + +This document, and the API itself, would not be in its current +form without the feedback and suggestions from numerous individuals. +We would like to specifically mention, in no particular order, the +following people:: + + Russell King <rmk@arm.linux.org.uk> + Leo Dagum <dagum@barrel.engr.sgi.com> + Ralf Baechle <ralf@oss.sgi.com> + Grant Grundler <grundler@cup.hp.com> + Jay Estabrook <Jay.Estabrook@compaq.com> + Thomas Sailer <sailer@ife.ee.ethz.ch> + Andrea Arcangeli <andrea@suse.de> + Jens Axboe <jens.axboe@oracle.com> + David Mosberger-Tang <davidm@hpl.hp.com> diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst new file mode 100644 index 0000000000..829f20a193 --- /dev/null +++ b/Documentation/core-api/dma-api.rst @@ -0,0 +1,846 @@ +============================================ +Dynamic DMA mapping using the generic device +============================================ + +:Author: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> + +This document describes the DMA API. For a more gentle introduction +of the API (and actual examples), see Documentation/core-api/dma-api-howto.rst. + +This API is split into two pieces. Part I describes the basic API. +Part II describes extensions for supporting non-consistent memory +machines. Unless you know that your driver absolutely has to support +non-consistent platforms (this is usually only legacy platforms) you +should only use the API described in part I. + +Part I - dma_API +---------------- + +To get the dma_API, you must #include <linux/dma-mapping.h>. This +provides dma_addr_t and the interfaces described below. + +A dma_addr_t can hold any valid DMA address for the platform. It can be +given to a device to use as a DMA source or target. A CPU cannot reference +a dma_addr_t directly because there may be translation between its physical +address space and the DMA address space. + +Part Ia - Using large DMA-coherent buffers +------------------------------------------ + +:: + + void * + dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t flag) + +Consistent memory is memory for which a write by either the device or +the processor can immediately be read by the processor or device +without having to worry about caching effects. (You may however need +to make sure to flush the processor's write buffers before telling +devices to read that memory.) + +This routine allocates a region of <size> bytes of consistent memory. + +It returns a pointer to the allocated region (in the processor's virtual +address space) or NULL if the allocation failed. + +It also returns a <dma_handle> which may be cast to an unsigned integer the +same width as the bus and given to the device as the DMA address base of +the region. + +Note: consistent memory can be expensive on some platforms, and the +minimum allocation length may be as big as a page, so you should +consolidate your requests for consistent memory as much as possible. +The simplest way to do that is to use the dma_pool calls (see below). + +The flag parameter (dma_alloc_coherent() only) allows the caller to +specify the ``GFP_`` flags (see kmalloc()) for the allocation (the +implementation may choose to ignore flags that affect the location of +the returned memory, like GFP_DMA). + +:: + + void + dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle) + +Free a region of consistent memory you previously allocated. dev, +size and dma_handle must all be the same as those passed into +dma_alloc_coherent(). cpu_addr must be the virtual address returned by +the dma_alloc_coherent(). + +Note that unlike their sibling allocation calls, these routines +may only be called with IRQs enabled. + + +Part Ib - Using small DMA-coherent buffers +------------------------------------------ + +To get this part of the dma_API, you must #include <linux/dmapool.h> + +Many drivers need lots of small DMA-coherent memory regions for DMA +descriptors or I/O buffers. Rather than allocating in units of a page +or more using dma_alloc_coherent(), you can use DMA pools. These work +much like a struct kmem_cache, except that they use the DMA-coherent allocator, +not __get_free_pages(). Also, they understand common hardware constraints +for alignment, like queue heads needing to be aligned on N-byte boundaries. + + +:: + + struct dma_pool * + dma_pool_create(const char *name, struct device *dev, + size_t size, size_t align, size_t alloc); + +dma_pool_create() initializes a pool of DMA-coherent buffers +for use with a given device. It must be called in a context which +can sleep. + +The "name" is for diagnostics (like a struct kmem_cache name); dev and size +are like what you'd pass to dma_alloc_coherent(). The device's hardware +alignment requirement for this type of data is "align" (which is expressed +in bytes, and must be a power of two). If your device has no boundary +crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated +from this pool must not cross 4KByte boundaries. + +:: + + void * + dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags, + dma_addr_t *handle) + +Wraps dma_pool_alloc() and also zeroes the returned memory if the +allocation attempt succeeded. + + +:: + + void * + dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags, + dma_addr_t *dma_handle); + +This allocates memory from the pool; the returned memory will meet the +size and alignment requirements specified at creation time. Pass +GFP_ATOMIC to prevent blocking, or if it's permitted (not +in_interrupt, not holding SMP locks), pass GFP_KERNEL to allow +blocking. Like dma_alloc_coherent(), this returns two values: an +address usable by the CPU, and the DMA address usable by the pool's +device. + +:: + + void + dma_pool_free(struct dma_pool *pool, void *vaddr, + dma_addr_t addr); + +This puts memory back into the pool. The pool is what was passed to +dma_pool_alloc(); the CPU (vaddr) and DMA addresses are what +were returned when that routine allocated the memory being freed. + +:: + + void + dma_pool_destroy(struct dma_pool *pool); + +dma_pool_destroy() frees the resources of the pool. It must be +called in a context which can sleep. Make sure you've freed all allocated +memory back to the pool before you destroy it. + + +Part Ic - DMA addressing limitations +------------------------------------ + +:: + + int + dma_set_mask_and_coherent(struct device *dev, u64 mask) + +Checks to see if the mask is possible and updates the device +streaming and coherent DMA mask parameters if it is. + +Returns: 0 if successful and a negative error if not. + +:: + + int + dma_set_mask(struct device *dev, u64 mask) + +Checks to see if the mask is possible and updates the device +parameters if it is. + +Returns: 0 if successful and a negative error if not. + +:: + + int + dma_set_coherent_mask(struct device *dev, u64 mask) + +Checks to see if the mask is possible and updates the device +parameters if it is. + +Returns: 0 if successful and a negative error if not. + +:: + + u64 + dma_get_required_mask(struct device *dev) + +This API returns the mask that the platform requires to +operate efficiently. Usually this means the returned mask +is the minimum required to cover all of memory. Examining the +required mask gives drivers with variable descriptor sizes the +opportunity to use smaller descriptors as necessary. + +Requesting the required mask does not alter the current mask. If you +wish to take advantage of it, you should issue a dma_set_mask() +call to set the mask to the value returned. + +:: + + size_t + dma_max_mapping_size(struct device *dev); + +Returns the maximum size of a mapping for the device. The size parameter +of the mapping functions like dma_map_single(), dma_map_page() and +others should not be larger than the returned value. + +:: + + size_t + dma_opt_mapping_size(struct device *dev); + +Returns the maximum optimal size of a mapping for the device. + +Mapping larger buffers may take much longer in certain scenarios. In +addition, for high-rate short-lived streaming mappings, the upfront time +spent on the mapping may account for an appreciable part of the total +request lifetime. As such, if splitting larger requests incurs no +significant performance penalty, then device drivers are advised to +limit total DMA streaming mappings length to the returned value. + +:: + + bool + dma_need_sync(struct device *dev, dma_addr_t dma_addr); + +Returns %true if dma_sync_single_for_{device,cpu} calls are required to +transfer memory ownership. Returns %false if those calls can be skipped. + +:: + + unsigned long + dma_get_merge_boundary(struct device *dev); + +Returns the DMA merge boundary. If the device cannot merge any the DMA address +segments, the function returns 0. + +Part Id - Streaming DMA mappings +-------------------------------- + +:: + + dma_addr_t + dma_map_single(struct device *dev, void *cpu_addr, size_t size, + enum dma_data_direction direction) + +Maps a piece of processor virtual memory so it can be accessed by the +device and returns the DMA address of the memory. + +The direction for both APIs may be converted freely by casting. +However the dma_API uses a strongly typed enumerator for its +direction: + +======================= ============================================= +DMA_NONE no direction (used for debugging) +DMA_TO_DEVICE data is going from the memory to the device +DMA_FROM_DEVICE data is coming from the device to the memory +DMA_BIDIRECTIONAL direction isn't known +======================= ============================================= + +.. note:: + + Not all memory regions in a machine can be mapped by this API. + Further, contiguous kernel virtual space may not be contiguous as + physical memory. Since this API does not provide any scatter/gather + capability, it will fail if the user tries to map a non-physically + contiguous piece of memory. For this reason, memory to be mapped by + this API should be obtained from sources which guarantee it to be + physically contiguous (like kmalloc). + + Further, the DMA address of the memory must be within the + dma_mask of the device (the dma_mask is a bit mask of the + addressable region for the device, i.e., if the DMA address of + the memory ANDed with the dma_mask is still equal to the DMA + address, then the device can perform DMA to the memory). To + ensure that the memory allocated by kmalloc is within the dma_mask, + the driver may specify various platform-dependent flags to restrict + the DMA address range of the allocation (e.g., on x86, GFP_DMA + guarantees to be within the first 16MB of available DMA addresses, + as required by ISA devices). + + Note also that the above constraints on physical contiguity and + dma_mask may not apply if the platform has an IOMMU (a device which + maps an I/O DMA address to a physical memory address). However, to be + portable, device driver writers may *not* assume that such an IOMMU + exists. + +.. warning:: + + Memory coherency operates at a granularity called the cache + line width. In order for memory mapped by this API to operate + correctly, the mapped region must begin exactly on a cache line + boundary and end exactly on one (to prevent two separately mapped + regions from sharing a single cache line). Since the cache line size + may not be known at compile time, the API will not enforce this + requirement. Therefore, it is recommended that driver writers who + don't take special care to determine the cache line size at run time + only map virtual regions that begin and end on page boundaries (which + are guaranteed also to be cache line boundaries). + + DMA_TO_DEVICE synchronisation must be done after the last modification + of the memory region by the software and before it is handed off to + the device. Once this primitive is used, memory covered by this + primitive should be treated as read-only by the device. If the device + may write to it at any point, it should be DMA_BIDIRECTIONAL (see + below). + + DMA_FROM_DEVICE synchronisation must be done before the driver + accesses data that may be changed by the device. This memory should + be treated as read-only by the driver. If the driver needs to write + to it at any point, it should be DMA_BIDIRECTIONAL (see below). + + DMA_BIDIRECTIONAL requires special handling: it means that the driver + isn't sure if the memory was modified before being handed off to the + device and also isn't sure if the device will also modify it. Thus, + you must always sync bidirectional memory twice: once before the + memory is handed off to the device (to make sure all memory changes + are flushed from the processor) and once before the data may be + accessed after being used by the device (to make sure any processor + cache lines are updated with data that the device may have changed). + +:: + + void + dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, + enum dma_data_direction direction) + +Unmaps the region previously mapped. All the parameters passed in +must be identical to those passed in (and returned) by the mapping +API. + +:: + + dma_addr_t + dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction direction) + + void + dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, + enum dma_data_direction direction) + +API for mapping and unmapping for pages. All the notes and warnings +for the other mapping APIs apply here. Also, although the <offset> +and <size> parameters are provided to do partial page mapping, it is +recommended that you never use these unless you really know what the +cache width is. + +:: + + dma_addr_t + dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size, + enum dma_data_direction dir, unsigned long attrs) + + void + dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size, + enum dma_data_direction dir, unsigned long attrs) + +API for mapping and unmapping for MMIO resources. All the notes and +warnings for the other mapping APIs apply here. The API should only be +used to map device MMIO resources, mapping of RAM is not permitted. + +:: + + int + dma_mapping_error(struct device *dev, dma_addr_t dma_addr) + +In some circumstances dma_map_single(), dma_map_page() and dma_map_resource() +will fail to create a mapping. A driver can check for these errors by testing +the returned DMA address with dma_mapping_error(). A non-zero return value +means the mapping could not be created and the driver should take appropriate +action (e.g. reduce current DMA mapping usage or delay and try again later). + +:: + + int + dma_map_sg(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction direction) + +Returns: the number of DMA address segments mapped (this may be shorter +than <nents> passed in if some elements of the scatter/gather list are +physically or virtually adjacent and an IOMMU maps them with a single +entry). + +Please note that the sg cannot be mapped again if it has been mapped once. +The mapping process is allowed to destroy information in the sg. + +As with the other mapping interfaces, dma_map_sg() can fail. When it +does, 0 is returned and a driver must take appropriate action. It is +critical that the driver do something, in the case of a block driver +aborting the request or even oopsing is better than doing nothing and +corrupting the filesystem. + +With scatterlists, you use the resulting mapping like this:: + + int i, count = dma_map_sg(dev, sglist, nents, direction); + struct scatterlist *sg; + + for_each_sg(sglist, sg, count, i) { + hw_address[i] = sg_dma_address(sg); + hw_len[i] = sg_dma_len(sg); + } + +where nents is the number of entries in the sglist. + +The implementation is free to merge several consecutive sglist entries +into one (e.g. with an IOMMU, or if several pages just happen to be +physically contiguous) and returns the actual number of sg entries it +mapped them to. On failure 0, is returned. + +Then you should loop count times (note: this can be less than nents times) +and use sg_dma_address() and sg_dma_len() macros where you previously +accessed sg->address and sg->length as shown above. + +:: + + void + dma_unmap_sg(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction direction) + +Unmap the previously mapped scatter/gather list. All the parameters +must be the same as those and passed in to the scatter/gather mapping +API. + +Note: <nents> must be the number you passed in, *not* the number of +DMA address entries returned. + +:: + + void + dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, + size_t size, + enum dma_data_direction direction) + + void + dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, + size_t size, + enum dma_data_direction direction) + + void + dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, + int nents, + enum dma_data_direction direction) + + void + dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, + int nents, + enum dma_data_direction direction) + +Synchronise a single contiguous or scatter/gather mapping for the CPU +and device. With the sync_sg API, all the parameters must be the same +as those passed into the single mapping API. With the sync_single API, +you can use dma_handle and size parameters that aren't identical to +those passed into the single mapping API to do a partial sync. + + +.. note:: + + You must do this: + + - Before reading values that have been written by DMA from the device + (use the DMA_FROM_DEVICE direction) + - After writing values that will be written to the device using DMA + (use the DMA_TO_DEVICE) direction + - before *and* after handing memory to the device if the memory is + DMA_BIDIRECTIONAL + +See also dma_map_single(). + +:: + + dma_addr_t + dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size, + enum dma_data_direction dir, + unsigned long attrs) + + void + dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) + + int + dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, + unsigned long attrs) + + void + dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, + unsigned long attrs) + +The four functions above are just like the counterpart functions +without the _attrs suffixes, except that they pass an optional +dma_attrs. + +The interpretation of DMA attributes is architecture-specific, and +each attribute should be documented in +Documentation/core-api/dma-attributes.rst. + +If dma_attrs are 0, the semantics of each of these functions +is identical to those of the corresponding function +without the _attrs suffix. As a result dma_map_single_attrs() +can generally replace dma_map_single(), etc. + +As an example of the use of the ``*_attrs`` functions, here's how +you could pass an attribute DMA_ATTR_FOO when mapping memory +for DMA:: + + #include <linux/dma-mapping.h> + /* DMA_ATTR_FOO should be defined in linux/dma-mapping.h and + * documented in Documentation/core-api/dma-attributes.rst */ + ... + + unsigned long attr; + attr |= DMA_ATTR_FOO; + .... + n = dma_map_sg_attrs(dev, sg, nents, DMA_TO_DEVICE, attr); + .... + +Architectures that care about DMA_ATTR_FOO would check for its +presence in their implementations of the mapping and unmapping +routines, e.g.::: + + void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) + { + .... + if (attrs & DMA_ATTR_FOO) + /* twizzle the frobnozzle */ + .... + } + + +Part II - Non-coherent DMA allocations +-------------------------------------- + +These APIs allow to allocate pages that are guaranteed to be DMA addressable +by the passed in device, but which need explicit management of memory ownership +for the kernel vs the device. + +If you don't understand how cache line coherency works between a processor and +an I/O device, you should not be using this part of the API. + +:: + + struct page * + dma_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle, + enum dma_data_direction dir, gfp_t gfp) + +This routine allocates a region of <size> bytes of non-coherent memory. It +returns a pointer to first struct page for the region, or NULL if the +allocation failed. The resulting struct page can be used for everything a +struct page is suitable for. + +It also returns a <dma_handle> which may be cast to an unsigned integer the +same width as the bus and given to the device as the DMA address base of +the region. + +The dir parameter specified if data is read and/or written by the device, +see dma_map_single() for details. + +The gfp parameter allows the caller to specify the ``GFP_`` flags (see +kmalloc()) for the allocation, but rejects flags used to specify a memory +zone such as GFP_DMA or GFP_HIGHMEM. + +Before giving the memory to the device, dma_sync_single_for_device() needs +to be called, and before reading memory written by the device, +dma_sync_single_for_cpu(), just like for streaming DMA mappings that are +reused. + +:: + + void + dma_free_pages(struct device *dev, size_t size, struct page *page, + dma_addr_t dma_handle, enum dma_data_direction dir) + +Free a region of memory previously allocated using dma_alloc_pages(). +dev, size, dma_handle and dir must all be the same as those passed into +dma_alloc_pages(). page must be the pointer returned by dma_alloc_pages(). + +:: + + int + dma_mmap_pages(struct device *dev, struct vm_area_struct *vma, + size_t size, struct page *page) + +Map an allocation returned from dma_alloc_pages() into a user address space. +dev and size must be the same as those passed into dma_alloc_pages(). +page must be the pointer returned by dma_alloc_pages(). + +:: + + void * + dma_alloc_noncoherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, enum dma_data_direction dir, + gfp_t gfp) + +This routine is a convenient wrapper around dma_alloc_pages that returns the +kernel virtual address for the allocated memory instead of the page structure. + +:: + + void + dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle, enum dma_data_direction dir) + +Free a region of memory previously allocated using dma_alloc_noncoherent(). +dev, size, dma_handle and dir must all be the same as those passed into +dma_alloc_noncoherent(). cpu_addr must be the virtual address returned by +dma_alloc_noncoherent(). + +:: + + struct sg_table * + dma_alloc_noncontiguous(struct device *dev, size_t size, + enum dma_data_direction dir, gfp_t gfp, + unsigned long attrs); + +This routine allocates <size> bytes of non-coherent and possibly non-contiguous +memory. It returns a pointer to struct sg_table that describes the allocated +and DMA mapped memory, or NULL if the allocation failed. The resulting memory +can be used for struct page mapped into a scatterlist are suitable for. + +The return sg_table is guaranteed to have 1 single DMA mapped segment as +indicated by sgt->nents, but it might have multiple CPU side segments as +indicated by sgt->orig_nents. + +The dir parameter specified if data is read and/or written by the device, +see dma_map_single() for details. + +The gfp parameter allows the caller to specify the ``GFP_`` flags (see +kmalloc()) for the allocation, but rejects flags used to specify a memory +zone such as GFP_DMA or GFP_HIGHMEM. + +The attrs argument must be either 0 or DMA_ATTR_ALLOC_SINGLE_PAGES. + +Before giving the memory to the device, dma_sync_sgtable_for_device() needs +to be called, and before reading memory written by the device, +dma_sync_sgtable_for_cpu(), just like for streaming DMA mappings that are +reused. + +:: + + void + dma_free_noncontiguous(struct device *dev, size_t size, + struct sg_table *sgt, + enum dma_data_direction dir) + +Free memory previously allocated using dma_alloc_noncontiguous(). dev, size, +and dir must all be the same as those passed into dma_alloc_noncontiguous(). +sgt must be the pointer returned by dma_alloc_noncontiguous(). + +:: + + void * + dma_vmap_noncontiguous(struct device *dev, size_t size, + struct sg_table *sgt) + +Return a contiguous kernel mapping for an allocation returned from +dma_alloc_noncontiguous(). dev and size must be the same as those passed into +dma_alloc_noncontiguous(). sgt must be the pointer returned by +dma_alloc_noncontiguous(). + +Once a non-contiguous allocation is mapped using this function, the +flush_kernel_vmap_range() and invalidate_kernel_vmap_range() APIs must be used +to manage the coherency between the kernel mapping, the device and user space +mappings (if any). + +:: + + void + dma_vunmap_noncontiguous(struct device *dev, void *vaddr) + +Unmap a kernel mapping returned by dma_vmap_noncontiguous(). dev must be the +same the one passed into dma_alloc_noncontiguous(). vaddr must be the pointer +returned by dma_vmap_noncontiguous(). + + +:: + + int + dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma, + size_t size, struct sg_table *sgt) + +Map an allocation returned from dma_alloc_noncontiguous() into a user address +space. dev and size must be the same as those passed into +dma_alloc_noncontiguous(). sgt must be the pointer returned by +dma_alloc_noncontiguous(). + +:: + + int + dma_get_cache_alignment(void) + +Returns the processor cache alignment. This is the absolute minimum +alignment *and* width that you must observe when either mapping +memory or doing partial flushes. + +.. note:: + + This API may return a number *larger* than the actual cache + line, but it will guarantee that one or more cache lines fit exactly + into the width returned by this call. It will also always be a power + of two for easy alignment. + + +Part III - Debug drivers use of the DMA-API +------------------------------------------- + +The DMA-API as described above has some constraints. DMA addresses must be +released with the corresponding function with the same size for example. With +the advent of hardware IOMMUs it becomes more and more important that drivers +do not violate those constraints. In the worst case such a violation can +result in data corruption up to destroyed filesystems. + +To debug drivers and find bugs in the usage of the DMA-API checking code can +be compiled into the kernel which will tell the developer about those +violations. If your architecture supports it you can select the "Enable +debugging of DMA-API usage" option in your kernel configuration. Enabling this +option has a performance impact. Do not enable it in production kernels. + +If you boot the resulting kernel will contain code which does some bookkeeping +about what DMA memory was allocated for which device. If this code detects an +error it prints a warning message with some details into your kernel log. An +example warning message may look like this:: + + WARNING: at /data2/repos/linux-2.6-iommu/lib/dma-debug.c:448 + check_unmap+0x203/0x490() + Hardware name: + forcedeth 0000:00:08.0: DMA-API: device driver frees DMA memory with wrong + function [device address=0x00000000640444be] [size=66 bytes] [mapped as + single] [unmapped as page] + Modules linked in: nfsd exportfs bridge stp llc r8169 + Pid: 0, comm: swapper Tainted: G W 2.6.28-dmatest-09289-g8bb99c0 #1 + Call Trace: + <IRQ> [<ffffffff80240b22>] warn_slowpath+0xf2/0x130 + [<ffffffff80647b70>] _spin_unlock+0x10/0x30 + [<ffffffff80537e75>] usb_hcd_link_urb_to_ep+0x75/0xc0 + [<ffffffff80647c22>] _spin_unlock_irqrestore+0x12/0x40 + [<ffffffff8055347f>] ohci_urb_enqueue+0x19f/0x7c0 + [<ffffffff80252f96>] queue_work+0x56/0x60 + [<ffffffff80237e10>] enqueue_task_fair+0x20/0x50 + [<ffffffff80539279>] usb_hcd_submit_urb+0x379/0xbc0 + [<ffffffff803b78c3>] cpumask_next_and+0x23/0x40 + [<ffffffff80235177>] find_busiest_group+0x207/0x8a0 + [<ffffffff8064784f>] _spin_lock_irqsave+0x1f/0x50 + [<ffffffff803c7ea3>] check_unmap+0x203/0x490 + [<ffffffff803c8259>] debug_dma_unmap_page+0x49/0x50 + [<ffffffff80485f26>] nv_tx_done_optimized+0xc6/0x2c0 + [<ffffffff80486c13>] nv_nic_irq_optimized+0x73/0x2b0 + [<ffffffff8026df84>] handle_IRQ_event+0x34/0x70 + [<ffffffff8026ffe9>] handle_edge_irq+0xc9/0x150 + [<ffffffff8020e3ab>] do_IRQ+0xcb/0x1c0 + [<ffffffff8020c093>] ret_from_intr+0x0/0xa + <EOI> <4>---[ end trace f6435a98e2a38c0e ]--- + +The driver developer can find the driver and the device including a stacktrace +of the DMA-API call which caused this warning. + +Per default only the first error will result in a warning message. All other +errors will only silently counted. This limitation exist to prevent the code +from flooding your kernel log. To support debugging a device driver this can +be disabled via debugfs. See the debugfs interface documentation below for +details. + +The debugfs directory for the DMA-API debugging code is called dma-api/. In +this directory the following files can currently be found: + +=============================== =============================================== +dma-api/all_errors This file contains a numeric value. If this + value is not equal to zero the debugging code + will print a warning for every error it finds + into the kernel log. Be careful with this + option, as it can easily flood your logs. + +dma-api/disabled This read-only file contains the character 'Y' + if the debugging code is disabled. This can + happen when it runs out of memory or if it was + disabled at boot time + +dma-api/dump This read-only file contains current DMA + mappings. + +dma-api/error_count This file is read-only and shows the total + numbers of errors found. + +dma-api/num_errors The number in this file shows how many + warnings will be printed to the kernel log + before it stops. This number is initialized to + one at system boot and be set by writing into + this file + +dma-api/min_free_entries This read-only file can be read to get the + minimum number of free dma_debug_entries the + allocator has ever seen. If this value goes + down to zero the code will attempt to increase + nr_total_entries to compensate. + +dma-api/num_free_entries The current number of free dma_debug_entries + in the allocator. + +dma-api/nr_total_entries The total number of dma_debug_entries in the + allocator, both free and used. + +dma-api/driver_filter You can write a name of a driver into this file + to limit the debug output to requests from that + particular driver. Write an empty string to + that file to disable the filter and see + all errors again. +=============================== =============================================== + +If you have this code compiled into your kernel it will be enabled by default. +If you want to boot without the bookkeeping anyway you can provide +'dma_debug=off' as a boot parameter. This will disable DMA-API debugging. +Notice that you can not enable it again at runtime. You have to reboot to do +so. + +If you want to see debug messages only for a special device driver you can +specify the dma_debug_driver=<drivername> parameter. This will enable the +driver filter at boot time. The debug code will only print errors for that +driver afterwards. This filter can be disabled or changed later using debugfs. + +When the code disables itself at runtime this is most likely because it ran +out of dma_debug_entries and was unable to allocate more on-demand. 65536 +entries are preallocated at boot - if this is too low for you boot with +'dma_debug_entries=<your_desired_number>' to overwrite the default. Note +that the code allocates entries in batches, so the exact number of +preallocated entries may be greater than the actual number requested. The +code will print to the kernel log each time it has dynamically allocated +as many entries as were initially preallocated. This is to indicate that a +larger preallocation size may be appropriate, or if it happens continually +that a driver may be leaking mappings. + +:: + + void + debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr); + +dma-debug interface debug_dma_mapping_error() to debug drivers that fail +to check DMA mapping errors on addresses returned by dma_map_single() and +dma_map_page() interfaces. This interface clears a flag set by +debug_dma_map_page() to indicate that dma_mapping_error() has been called by +the driver. When driver does unmap, debug_dma_unmap() checks the flag and if +this flag is still set, prints warning message that includes call trace that +leads up to the unmap. This interface can be called from dma_mapping_error() +routines to enable DMA mapping error check debugging. diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst new file mode 100644 index 0000000000..1887d92e8e --- /dev/null +++ b/Documentation/core-api/dma-attributes.rst @@ -0,0 +1,132 @@ +============== +DMA attributes +============== + +This document describes the semantics of the DMA attributes that are +defined in linux/dma-mapping.h. + +DMA_ATTR_WEAK_ORDERING +---------------------- + +DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping +may be weakly ordered, that is that reads and writes may pass each other. + +Since it is optional for platforms to implement DMA_ATTR_WEAK_ORDERING, +those that do not will simply ignore the attribute and exhibit default +behavior. + +DMA_ATTR_WRITE_COMBINE +---------------------- + +DMA_ATTR_WRITE_COMBINE specifies that writes to the mapping may be +buffered to improve performance. + +Since it is optional for platforms to implement DMA_ATTR_WRITE_COMBINE, +those that do not will simply ignore the attribute and exhibit default +behavior. + +DMA_ATTR_NO_KERNEL_MAPPING +-------------------------- + +DMA_ATTR_NO_KERNEL_MAPPING lets the platform to avoid creating a kernel +virtual mapping for the allocated buffer. On some architectures creating +such mapping is non-trivial task and consumes very limited resources +(like kernel virtual address space or dma consistent address space). +Buffers allocated with this attribute can be only passed to user space +by calling dma_mmap_attrs(). By using this API, you are guaranteeing +that you won't dereference the pointer returned by dma_alloc_attr(). You +can treat it as a cookie that must be passed to dma_mmap_attrs() and +dma_free_attrs(). Make sure that both of these also get this attribute +set on each call. + +Since it is optional for platforms to implement +DMA_ATTR_NO_KERNEL_MAPPING, those that do not will simply ignore the +attribute and exhibit default behavior. + +DMA_ATTR_SKIP_CPU_SYNC +---------------------- + +By default dma_map_{single,page,sg} functions family transfer a given +buffer from CPU domain to device domain. Some advanced use cases might +require sharing a buffer between more than one device. This requires +having a mapping created separately for each device and is usually +performed by calling dma_map_{single,page,sg} function more than once +for the given buffer with device pointer to each device taking part in +the buffer sharing. The first call transfers a buffer from 'CPU' domain +to 'device' domain, what synchronizes CPU caches for the given region +(usually it means that the cache has been flushed or invalidated +depending on the dma direction). However, next calls to +dma_map_{single,page,sg}() for other devices will perform exactly the +same synchronization operation on the CPU cache. CPU cache synchronization +might be a time consuming operation, especially if the buffers are +large, so it is highly recommended to avoid it if possible. +DMA_ATTR_SKIP_CPU_SYNC allows platform code to skip synchronization of +the CPU cache for the given buffer assuming that it has been already +transferred to 'device' domain. This attribute can be also used for +dma_unmap_{single,page,sg} functions family to force buffer to stay in +device domain after releasing a mapping for it. Use this attribute with +care! + +DMA_ATTR_FORCE_CONTIGUOUS +------------------------- + +By default DMA-mapping subsystem is allowed to assemble the buffer +allocated by dma_alloc_attrs() function from individual pages if it can +be mapped as contiguous chunk into device dma address space. By +specifying this attribute the allocated buffer is forced to be contiguous +also in physical memory. + +DMA_ATTR_ALLOC_SINGLE_PAGES +--------------------------- + +This is a hint to the DMA-mapping subsystem that it's probably not worth +the time to try to allocate memory to in a way that gives better TLB +efficiency (AKA it's not worth trying to build the mapping out of larger +pages). You might want to specify this if: + +- You know that the accesses to this memory won't thrash the TLB. + You might know that the accesses are likely to be sequential or + that they aren't sequential but it's unlikely you'll ping-pong + between many addresses that are likely to be in different physical + pages. +- You know that the penalty of TLB misses while accessing the + memory will be small enough to be inconsequential. If you are + doing a heavy operation like decryption or decompression this + might be the case. +- You know that the DMA mapping is fairly transitory. If you expect + the mapping to have a short lifetime then it may be worth it to + optimize allocation (avoid coming up with large pages) instead of + getting the slight performance win of larger pages. + +Setting this hint doesn't guarantee that you won't get huge pages, but it +means that we won't try quite as hard to get them. + +.. note:: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM, + though ARM64 patches will likely be posted soon. + +DMA_ATTR_NO_WARN +---------------- + +This tells the DMA-mapping subsystem to suppress allocation failure reports +(similarly to __GFP_NOWARN). + +On some architectures allocation failures are reported with error messages +to the system logs. Although this can help to identify and debug problems, +drivers which handle failures (eg, retry later) have no problems with them, +and can actually flood the system logs with error messages that aren't any +problem at all, depending on the implementation of the retry mechanism. + +So, this provides a way for drivers to avoid those error messages on calls +where allocation failures are not a problem, and shouldn't bother the logs. + +.. note:: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC. + +DMA_ATTR_PRIVILEGED +------------------- + +Some advanced peripherals such as remote processors and GPUs perform +accesses to DMA buffers in both privileged "supervisor" and unprivileged +"user" modes. This attribute is used to indicate to the DMA-mapping +subsystem that the buffer is fully accessible at the elevated privilege +level (and ideally inaccessible or at least read-only at the +lesser-privileged levels). diff --git a/Documentation/core-api/dma-isa-lpc.rst b/Documentation/core-api/dma-isa-lpc.rst new file mode 100644 index 0000000000..17b193603f --- /dev/null +++ b/Documentation/core-api/dma-isa-lpc.rst @@ -0,0 +1,152 @@ +============================ +DMA with ISA and LPC devices +============================ + +:Author: Pierre Ossman <drzeus@drzeus.cx> + +This document describes how to do DMA transfers using the old ISA DMA +controller. Even though ISA is more or less dead today the LPC bus +uses the same DMA system so it will be around for quite some time. + +Headers and dependencies +------------------------ + +To do ISA style DMA you need to include two headers:: + + #include <linux/dma-mapping.h> + #include <asm/dma.h> + +The first is the generic DMA API used to convert virtual addresses to +bus addresses (see Documentation/core-api/dma-api.rst for details). + +The second contains the routines specific to ISA DMA transfers. Since +this is not present on all platforms make sure you construct your +Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries +to build your driver on unsupported platforms. + +Buffer allocation +----------------- + +The ISA DMA controller has some very strict requirements on which +memory it can access so extra care must be taken when allocating +buffers. + +(You usually need a special buffer for DMA transfers instead of +transferring directly to and from your normal data structures.) + +The DMA-able address space is the lowest 16 MB of _physical_ memory. +Also the transfer block may not cross page boundaries (which are 64 +or 128 KiB depending on which channel you use). + +In order to allocate a piece of memory that satisfies all these +requirements you pass the flag GFP_DMA to kmalloc. + +Unfortunately the memory available for ISA DMA is scarce so unless you +allocate the memory during boot-up it's a good idea to also pass +__GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder. + +(This scarcity also means that you should allocate the buffer as +early as possible and not release it until the driver is unloaded.) + +Address translation +------------------- + +To translate the virtual address to a bus address, use the normal DMA +API. Do _not_ use isa_virt_to_bus() even though it does the same +thing. The reason for this is that the function isa_virt_to_bus() +will require a Kconfig dependency to ISA, not just ISA_DMA_API which +is really all you need. Remember that even though the DMA controller +has its origins in ISA it is used elsewhere. + +Note: x86_64 had a broken DMA API when it came to ISA but has since +been fixed. If your arch has problems then fix the DMA API instead of +reverting to the ISA functions. + +Channels +-------- + +A normal ISA DMA controller has 8 channels. The lower four are for +8-bit transfers and the upper four are for 16-bit transfers. + +(Actually the DMA controller is really two separate controllers where +channel 4 is used to give DMA access for the second controller (0-3). +This means that of the four 16-bits channels only three are usable.) + +You allocate these in a similar fashion as all basic resources: + +extern int request_dma(unsigned int dmanr, const char * device_id); +extern void free_dma(unsigned int dmanr); + +The ability to use 16-bit or 8-bit transfers is _not_ up to you as a +driver author but depends on what the hardware supports. Check your +specs or test different channels. + +Transfer data +------------- + +Now for the good stuff, the actual DMA transfer. :) + +Before you use any ISA DMA routines you need to claim the DMA lock +using claim_dma_lock(). The reason is that some DMA operations are +not atomic so only one driver may fiddle with the registers at a +time. + +The first time you use the DMA controller you should call +clear_dma_ff(). This clears an internal register in the DMA +controller that is used for the non-atomic operations. As long as you +(and everyone else) uses the locking functions then you only need to +reset this once. + +Next, you tell the controller in which direction you intend to do the +transfer using set_dma_mode(). Currently you have the options +DMA_MODE_READ and DMA_MODE_WRITE. + +Set the address from where the transfer should start (this needs to +be 16-bit aligned for 16-bit transfers) and how many bytes to +transfer. Note that it's _bytes_. The DMA routines will do all the +required translation to values that the DMA controller understands. + +The final step is enabling the DMA channel and releasing the DMA +lock. + +Once the DMA transfer is finished (or timed out) you should disable +the channel again. You should also check get_dma_residue() to make +sure that all data has been transferred. + +Example:: + + int flags, residue; + + flags = claim_dma_lock(); + + clear_dma_ff(); + + set_dma_mode(channel, DMA_MODE_WRITE); + set_dma_addr(channel, phys_addr); + set_dma_count(channel, num_bytes); + + dma_enable(channel); + + release_dma_lock(flags); + + while (!device_done()); + + flags = claim_dma_lock(); + + dma_disable(channel); + + residue = dma_get_residue(channel); + if (residue != 0) + printk(KERN_ERR "driver: Incomplete DMA transfer!" + " %d bytes left!\n", residue); + + release_dma_lock(flags); + +Suspend/resume +-------------- + +It is the driver's responsibility to make sure that the machine isn't +suspended while a DMA transfer is in progress. Also, all DMA settings +are lost when the system suspends so if your driver relies on the DMA +controller being in a certain state then you have to restore these +registers upon resume. diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst new file mode 100644 index 0000000000..e12f22ab33 --- /dev/null +++ b/Documentation/core-api/entry.rst @@ -0,0 +1,279 @@ +Entry/exit handling for exceptions, interrupts, syscalls and KVM +================================================================ + +All transitions between execution domains require state updates which are +subject to strict ordering constraints. State updates are required for the +following: + + * Lockdep + * RCU / Context tracking + * Preemption counter + * Tracing + * Time accounting + +The update order depends on the transition type and is explained below in +the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular +exceptions`_, `NMI and NMI-like exceptions`_. + +Non-instrumentable code - noinstr +--------------------------------- + +Most instrumentation facilities depend on RCU, so intrumentation is prohibited +for entry code before RCU starts watching and exit code after RCU stops +watching. In addition, many architectures must save and restore register state, +which means that (for example) a breakpoint in the breakpoint entry code would +overwrite the debug registers of the initial breakpoint. + +Such code must be marked with the 'noinstr' attribute, placing that code into a +special section inaccessible to instrumentation and debug facilities. Some +functions are partially instrumentable, which is handled by marking them +noinstr and using instrumentation_begin() and instrumentation_end() to flag the +instrumentable ranges of code: + +.. code-block:: c + + noinstr void entry(void) + { + handle_entry(); // <-- must be 'noinstr' or '__always_inline' + ... + + instrumentation_begin(); + handle_context(); // <-- instrumentable code + instrumentation_end(); + + ... + handle_exit(); // <-- must be 'noinstr' or '__always_inline' + } + +This allows verification of the 'noinstr' restrictions via objtool on +supported architectures. + +Invoking non-instrumentable functions from instrumentable context has no +restrictions and is useful to protect e.g. state switching which would +cause malfunction if instrumented. + +All non-instrumentable entry/exit code sections before and after the RCU +state transitions must run with interrupts disabled. + +Syscalls +-------- + +Syscall-entry code starts in assembly code and calls out into low-level C code +after establishing low-level architecture-specific state and stack frames. This +low-level C code must not be instrumented. A typical syscall handling function +invoked from low-level assembly code looks like this: + +.. code-block:: c + + noinstr void syscall(struct pt_regs *regs, int nr) + { + arch_syscall_enter(regs); + nr = syscall_enter_from_user_mode(regs, nr); + + instrumentation_begin(); + if (!invoke_syscall(regs, nr) && nr != -1) + result_reg(regs) = __sys_ni_syscall(regs); + instrumentation_end(); + + syscall_exit_to_user_mode(regs); + } + +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which +establishes state in the following order: + + * Lockdep + * RCU / Context tracking + * Tracing + +and then invokes the various entry work functions like ptrace, seccomp, audit, +syscall tracing, etc. After all that is done, the instrumentable invoke_syscall +function can be invoked. The instrumentable code section then ends, after which +syscall_exit_to_user_mode() is invoked. + +syscall_exit_to_user_mode() handles all work which needs to be done before +returning to user space like tracing, audit, signals, task work etc. After +that it invokes exit_to_user_mode() which again handles the state +transition in the reverse order: + + * Tracing + * RCU / Context tracking + * Lockdep + +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also +available as fine grained subfunctions in cases where the architecture code +has to do extra work between the various steps. In such cases it has to +ensure that enter_from_user_mode() is called first on entry and +exit_to_user_mode() is called last on exit. + +Do not nest syscalls. Nested systcalls will cause RCU and/or context tracking +to print a warning. + +KVM +--- + +Entering or exiting guest mode is very similar to syscalls. From the host +kernel point of view the CPU goes off into user space when entering the +guest and returns to the kernel on exit. + +kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode() +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode(). +The state operations have the same ordering. + +Task work handling is done separately for guest at the boundary of the +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of +the work handled on return to user space. + +Do not nest KVM entry/exit transitions because doing so is nonsensical. + +Interrupts and regular exceptions +--------------------------------- + +Interrupts entry and exit handling is slightly more complex than syscalls +and KVM transitions. + +If an interrupt is raised while the CPU executes in user space, the entry +and exit handling is exactly the same as for syscalls. + +If the interrupt is raised while the CPU executes in kernel space the entry and +exit handling is slightly different. RCU state is only updated when the +interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will +already be watching. Lockdep and tracing have to be updated unconditionally. + +irqentry_enter() and irqentry_exit() provide the implementation for this. + +The architecture-specific part looks similar to syscall handling: + +.. code-block:: c + + noinstr void interrupt(struct pt_regs *regs, int nr) + { + arch_interrupt_enter(regs); + state = irqentry_enter(regs); + + instrumentation_begin(); + + irq_enter_rcu(); + invoke_irq_handler(regs, nr); + irq_exit_rcu(); + + instrumentation_end(); + + irqentry_exit(regs, state); + } + +Note that the invocation of the actual interrupt handler is within a +irq_enter_rcu() and irq_exit_rcu() pair. + +irq_enter_rcu() updates the preemption count which makes in_hardirq() +return true, handles NOHZ tick state and interrupt time accounting. This +means that up to the point where irq_enter_rcu() is invoked in_hardirq() +returns false. + +irq_exit_rcu() handles interrupt time accounting, undoes the preemption +count update and eventually handles soft interrupts and NOHZ tick state. + +In theory, the preemption count could be updated in irqentry_enter(). In +practice, deferring this update to irq_enter_rcu() allows the preemption-count +code to be traced, while also maintaining symmetry with irq_exit_rcu() and +irqentry_exit(), which are described in the next paragraph. The only downside +is that the early entry code up to irq_enter_rcu() must be aware that the +preemption count has not yet been updated with the HARDIRQ_OFFSET state. + +Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count +before it handles soft interrupts, whose handlers must run in BH context rather +than irq-disabled context. In addition, irqentry_exit() might schedule, which +also requires that HARDIRQ_OFFSET has been removed from the preemption count. + +Even though interrupt handlers are expected to run with local interrupts +disabled, interrupt nesting is common from an entry/exit perspective. For +example, softirq handling happens within an irqentry_{enter,exit}() block with +local interrupts enabled. Also, although uncommon, nothing prevents an +interrupt handler from re-enabling interrupts. + +Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it +runs with local interrupts disabled. But NMIs can happen anytime, and a lot of +the entry code is shared between the two. + +NMI and NMI-like exceptions +--------------------------- + +NMIs and NMI-like exceptions (machine checks, double faults, debug +interrupts, etc.) can hit any context and must be extra careful with +the state. + +State changes for debug exceptions and machine-check exceptions depend on +whether these exceptions happened in user-space (breakpoints or watchpoints) or +in kernel mode (code patching). From user-space, they are treated like +interrupts, while from kernel mode they are treated like NMIs. + +NMIs and other NMI-like exceptions handle state transitions without +distinguishing between user-mode and kernel-mode origin. + +The state update on entry is handled in irqentry_nmi_enter() which updates +state in the following order: + + * Preemption counter + * Lockdep + * RCU / Context tracking + * Tracing + +The exit counterpart irqentry_nmi_exit() does the reverse operation in the +reverse order. + +Note that the update of the preemption counter has to be the first +operation on enter and the last operation on exit. The reason is that both +lockdep and RCU rely on in_nmi() returning true in this case. The +preemption count modification in the NMI entry/exit case must not be +traced. + +Architecture-specific code looks like this: + +.. code-block:: c + + noinstr void nmi(struct pt_regs *regs) + { + arch_nmi_enter(regs); + state = irqentry_nmi_enter(regs); + + instrumentation_begin(); + nmi_handler(regs); + instrumentation_end(); + + irqentry_nmi_exit(regs); + } + +and for e.g. a debug exception it can look like this: + +.. code-block:: c + + noinstr void debug(struct pt_regs *regs) + { + arch_nmi_enter(regs); + + debug_regs = save_debug_regs(); + + if (user_mode(regs)) { + state = irqentry_enter(regs); + + instrumentation_begin(); + user_mode_debug_handler(regs, debug_regs); + instrumentation_end(); + + irqentry_exit(regs, state); + } else { + state = irqentry_nmi_enter(regs); + + instrumentation_begin(); + kernel_mode_debug_handler(regs, debug_regs); + instrumentation_end(); + + irqentry_nmi_exit(regs, state); + } + } + +There is no combined irqentry_nmi_if_kernel() function available as the +above cannot be handled in an exception-agnostic way. + +NMIs can happen in any context. For example, an NMI-like exception triggered +while handling an NMI. So NMI entry code has to be reentrant and state updates +need to handle nesting. diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst new file mode 100644 index 0000000000..ff332e2724 --- /dev/null +++ b/Documentation/core-api/errseq.rst @@ -0,0 +1,159 @@ +===================== +The errseq_t datatype +===================== + +An errseq_t is a way of recording errors in one place, and allowing any +number of "subscribers" to tell whether it has changed since a previous +point where it was sampled. + +The initial use case for this is tracking errors for file +synchronization syscalls (fsync, fdatasync, msync and sync_file_range), +but it may be usable in other situations. + +It's implemented as an unsigned 32-bit value. The low order bits are +designated to hold an error code (between 1 and MAX_ERRNO). The upper bits +are used as a counter. This is done with atomics instead of locking so that +these functions can be called from any context. + +Note that there is a risk of collisions if new errors are being recorded +frequently, since we have so few bits to use as a counter. + +To mitigate this, the bit between the error value and counter is used as +a flag to tell whether the value has been sampled since a new value was +recorded. That allows us to avoid bumping the counter if no one has +sampled it since the last time an error was recorded. + +Thus we end up with a value that looks something like this: + ++--------------------------------------+----+------------------------+ +| 31..13 | 12 | 11..0 | ++--------------------------------------+----+------------------------+ +| counter | SF | errno | ++--------------------------------------+----+------------------------+ + +The general idea is for "watchers" to sample an errseq_t value and keep +it as a running cursor. That value can later be used to tell whether +any new errors have occurred since that sampling was done, and atomically +record the state at the time that it was checked. This allows us to +record errors in one place, and then have a number of "watchers" that +can tell whether the value has changed since they last checked it. + +A new errseq_t should always be zeroed out. An errseq_t value of all zeroes +is the special (but common) case where there has never been an error. An all +zero value thus serves as the "epoch" if one wishes to know whether there +has ever been an error set since it was first initialized. + +API usage +========= + +Let me tell you a story about a worker drone. Now, he's a good worker +overall, but the company is a little...management heavy. He has to +report to 77 supervisors today, and tomorrow the "big boss" is coming in +from out of town and he's sure to test the poor fellow too. + +They're all handing him work to do -- so much he can't keep track of who +handed him what, but that's not really a big problem. The supervisors +just want to know when he's finished all of the work they've handed him so +far and whether he made any mistakes since they last asked. + +He might have made the mistake on work they didn't actually hand him, +but he can't keep track of things at that level of detail, all he can +remember is the most recent mistake that he made. + +Here's our worker_drone representation:: + + struct worker_drone { + errseq_t wd_err; /* for recording errors */ + }; + +Every day, the worker_drone starts out with a blank slate:: + + struct worker_drone wd; + + wd.wd_err = (errseq_t)0; + +The supervisors come in and get an initial read for the day. They +don't care about anything that happened before their watch begins:: + + struct supervisor { + errseq_t s_wd_err; /* private "cursor" for wd_err */ + spinlock_t s_wd_err_lock; /* protects s_wd_err */ + } + + struct supervisor su; + + su.s_wd_err = errseq_sample(&wd.wd_err); + spin_lock_init(&su.s_wd_err_lock); + +Now they start handing him tasks to do. Every few minutes they ask him to +finish up all of the work they've handed him so far. Then they ask him +whether he made any mistakes on any of it:: + + spin_lock(&su.su_wd_err_lock); + err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); + spin_unlock(&su.su_wd_err_lock); + +Up to this point, that just keeps returning 0. + +Now, the owners of this company are quite miserly and have given him +substandard equipment with which to do his job. Occasionally it +glitches and he makes a mistake. He sighs a heavy sigh, and marks it +down:: + + errseq_set(&wd.wd_err, -EIO); + +...and then gets back to work. The supervisors eventually poll again +and they each get the error when they next check. Subsequent calls will +return 0, until another error is recorded, at which point it's reported +to each of them once. + +Note that the supervisors can't tell how many mistakes he made, only +whether one was made since they last checked, and the latest value +recorded. + +Occasionally the big boss comes in for a spot check and asks the worker +to do a one-off job for him. He's not really watching the worker +full-time like the supervisors, but he does need to know whether a +mistake occurred while his job was processing. + +He can just sample the current errseq_t in the worker, and then use that +to tell whether an error has occurred later:: + + errseq_t since = errseq_sample(&wd.wd_err); + /* submit some work and wait for it to complete */ + err = errseq_check(&wd.wd_err, since); + +Since he's just going to discard "since" after that point, he doesn't +need to advance it here. He also doesn't need any locking since it's +not usable by anyone else. + +Serializing errseq_t cursor updates +=================================== + +Note that the errseq_t API does not protect the errseq_t cursor during a +check_and_advance_operation. Only the canonical error code is handled +atomically. In a situation where more than one task might be using the +same errseq_t cursor at the same time, it's important to serialize +updates to that cursor. + +If that's not done, then it's possible for the cursor to go backward +in which case the same error could be reported more than once. + +Because of this, it's often advantageous to first do an errseq_check to +see if anything has changed, and only later do an +errseq_check_and_advance after taking the lock. e.g.:: + + if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) { + /* su.s_wd_err is protected by s_wd_err_lock */ + spin_lock(&su.s_wd_err_lock); + err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); + spin_unlock(&su.s_wd_err_lock); + } + +That avoids the spinlock in the common case where nothing has changed +since the last time it was checked. + +Functions +========= + +.. kernel-doc:: lib/errseq.c diff --git a/Documentation/core-api/genalloc.rst b/Documentation/core-api/genalloc.rst new file mode 100644 index 0000000000..a5af2cbf58 --- /dev/null +++ b/Documentation/core-api/genalloc.rst @@ -0,0 +1,144 @@ +The genalloc/genpool subsystem +============================== + +There are a number of memory-allocation subsystems in the kernel, each +aimed at a specific need. Sometimes, however, a kernel developer needs to +implement a new allocator for a specific range of special-purpose memory; +often that memory is located on a device somewhere. The author of the +driver for that device can certainly write a little allocator to get the +job done, but that is the way to fill the kernel with dozens of poorly +tested allocators. Back in 2005, Jes Sorensen lifted one of those +allocators from the sym53c8xx_2 driver and posted_ it as a generic module +for the creation of ad hoc memory allocators. This code was merged +for the 2.6.13 release; it has been modified considerably since then. + +.. _posted: https://lwn.net/Articles/125842/ + +Code using this allocator should include <linux/genalloc.h>. The action +begins with the creation of a pool using one of: + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_create + +.. kernel-doc:: lib/genalloc.c + :functions: devm_gen_pool_create + +A call to gen_pool_create() will create a pool. The granularity of +allocations is set with min_alloc_order; it is a log-base-2 number like +those used by the page allocator, but it refers to bytes rather than pages. +So, if min_alloc_order is passed as 3, then all allocations will be a +multiple of eight bytes. Increasing min_alloc_order decreases the memory +required to track the memory in the pool. The nid parameter specifies +which NUMA node should be used for the allocation of the housekeeping +structures; it can be -1 if the caller doesn't care. + +The "managed" interface devm_gen_pool_create() ties the pool to a +specific device. Among other things, it will automatically clean up the +pool when the given device is destroyed. + +A pool is shut down with: + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_destroy + +It's worth noting that, if there are still allocations outstanding from the +given pool, this function will take the rather extreme step of invoking +BUG(), crashing the entire system. You have been warned. + +A freshly created pool has no memory to allocate. It is fairly useless in +that state, so one of the first orders of business is usually to add memory +to the pool. That can be done with one of: + +.. kernel-doc:: include/linux/genalloc.h + :functions: gen_pool_add + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_add_owner + +A call to gen_pool_add() will place the size bytes of memory +starting at addr (in the kernel's virtual address space) into the given +pool, once again using nid as the node ID for ancillary memory allocations. +The gen_pool_add_virt() variant associates an explicit physical +address with the memory; this is only necessary if the pool will be used +for DMA allocations. + +The functions for allocating memory from the pool (and putting it back) +are: + +.. kernel-doc:: include/linux/genalloc.h + :functions: gen_pool_alloc + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_dma_alloc + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_free_owner + +As one would expect, gen_pool_alloc() will allocate size< bytes +from the given pool. The gen_pool_dma_alloc() variant allocates +memory for use with DMA operations, returning the associated physical +address in the space pointed to by dma. This will only work if the memory +was added with gen_pool_add_virt(). Note that this function +departs from the usual genpool pattern of using unsigned long values to +represent kernel addresses; it returns a void * instead. + +That all seems relatively simple; indeed, some developers clearly found it +to be too simple. After all, the interface above provides no control over +how the allocation functions choose which specific piece of memory to +return. If that sort of control is needed, the following functions will be +of interest: + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_alloc_algo_owner + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_set_algo + +Allocations with gen_pool_alloc_algo() specify an algorithm to be +used to choose the memory to be allocated; the default algorithm can be set +with gen_pool_set_algo(). The data value is passed to the +algorithm; most ignore it, but it is occasionally needed. One can, +naturally, write a special-purpose algorithm, but there is a fair set +already available: + +- gen_pool_first_fit is a simple first-fit allocator; this is the default + algorithm if none other has been specified. + +- gen_pool_first_fit_align forces the allocation to have a specific + alignment (passed via data in a genpool_data_align structure). + +- gen_pool_first_fit_order_align aligns the allocation to the order of the + size. A 60-byte allocation will thus be 64-byte aligned, for example. + +- gen_pool_best_fit, as one would expect, is a simple best-fit allocator. + +- gen_pool_fixed_alloc allocates at a specific offset (passed in a + genpool_data_fixed structure via the data parameter) within the pool. + If the indicated memory is not available the allocation fails. + +There is a handful of other functions, mostly for purposes like querying +the space available in the pool or iterating through chunks of memory. +Most users, however, should not need much beyond what has been described +above. With luck, wider awareness of this module will help to prevent the +writing of special-purpose memory allocators in the future. + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_virt_to_phys + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_for_each_chunk + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_has_addr + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_avail + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_size + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_get + +.. kernel-doc:: lib/genalloc.c + :functions: of_gen_pool_get diff --git a/Documentation/core-api/generic-radix-tree.rst b/Documentation/core-api/generic-radix-tree.rst new file mode 100644 index 0000000000..ed42839ae4 --- /dev/null +++ b/Documentation/core-api/generic-radix-tree.rst @@ -0,0 +1,12 @@ +================================= +Generic radix trees/sparse arrays +================================= + +.. kernel-doc:: include/linux/generic-radix-tree.h + :doc: Generic radix trees/sparse arrays + +generic radix tree functions +---------------------------- + +.. kernel-doc:: include/linux/generic-radix-tree.h + :functions: diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst new file mode 100644 index 0000000000..4a460639ab --- /dev/null +++ b/Documentation/core-api/genericirq.rst @@ -0,0 +1,444 @@ +.. include:: <isonum.txt> + +========================== +Linux generic IRQ handling +========================== + +:Copyright: |copy| 2005-2010: Thomas Gleixner +:Copyright: |copy| 2005-2006: Ingo Molnar + +Introduction +============ + +The generic interrupt handling layer is designed to provide a complete +abstraction of interrupt handling for device drivers. It is able to +handle all the different types of interrupt controller hardware. Device +drivers use generic API functions to request, enable, disable and free +interrupts. The drivers do not have to know anything about interrupt +hardware details, so they can be used on different platforms without +code changes. + +This documentation is provided to developers who want to implement an +interrupt subsystem based for their architecture, with the help of the +generic IRQ handling layer. + +Rationale +========= + +The original implementation of interrupt handling in Linux uses the +__do_IRQ() super-handler, which is able to deal with every type of +interrupt logic. + +Originally, Russell King identified different types of handlers to build +a quite universal set for the ARM interrupt handler implementation in +Linux 2.5/2.6. He distinguished between: + +- Level type + +- Edge type + +- Simple type + +During the implementation we identified another type: + +- Fast EOI type + +In the SMP world of the __do_IRQ() super-handler another type was +identified: + +- Per CPU type + +This split implementation of high-level IRQ handlers allows us to +optimize the flow of the interrupt handling for each specific interrupt +type. This reduces complexity in that particular code path and allows +the optimized handling of a given type. + +The original general IRQ implementation used hw_interrupt_type +structures and their ``->ack``, ``->end`` [etc.] callbacks to differentiate +the flow control in the super-handler. This leads to a mix of flow logic +and low-level hardware logic, and it also leads to unnecessary code +duplication: for example in i386, there is an ``ioapic_level_irq`` and an +``ioapic_edge_irq`` IRQ-type which share many of the low-level details but +have different flow handling. + +A more natural abstraction is the clean separation of the 'irq flow' and +the 'chip details'. + +Analysing a couple of architecture's IRQ subsystem implementations +reveals that most of them can use a generic set of 'irq flow' methods +and only need to add the chip-level specific code. The separation is +also valuable for (sub)architectures which need specific quirks in the +IRQ flow itself but not in the chip details - and thus provides a more +transparent IRQ subsystem design. + +Each interrupt descriptor is assigned its own high-level flow handler, +which is normally one of the generic implementations. (This high-level +flow handler implementation also makes it simple to provide +demultiplexing handlers which can be found in embedded platforms on +various architectures.) + +The separation makes the generic interrupt handling layer more flexible +and extensible. For example, an (sub)architecture can use a generic +IRQ-flow implementation for 'level type' interrupts and add a +(sub)architecture specific 'edge type' implementation. + +To make the transition to the new model easier and prevent the breakage +of existing implementations, the __do_IRQ() super-handler is still +available. This leads to a kind of duality for the time being. Over time +the new model should be used in more and more architectures, as it +enables smaller and cleaner IRQ subsystems. It's deprecated for three +years now and about to be removed. + +Known Bugs And Assumptions +========================== + +None (knock on wood). + +Abstraction layers +================== + +There are three main levels of abstraction in the interrupt code: + +1. High-level driver API + +2. High-level IRQ flow handlers + +3. Chip-level hardware encapsulation + +Interrupt control flow +---------------------- + +Each interrupt is described by an interrupt descriptor structure +irq_desc. The interrupt is referenced by an 'unsigned int' numeric +value which selects the corresponding interrupt description structure in +the descriptor structures array. The descriptor structure contains +status information and pointers to the interrupt flow method and the +interrupt chip structure which are assigned to this interrupt. + +Whenever an interrupt triggers, the low-level architecture code calls +into the generic interrupt code by calling desc->handle_irq(). This +high-level IRQ handling function only uses desc->irq_data.chip +primitives referenced by the assigned chip descriptor structure. + +High-level Driver API +--------------------- + +The high-level Driver API consists of following functions: + +- request_irq() + +- request_threaded_irq() + +- free_irq() + +- disable_irq() + +- enable_irq() + +- disable_irq_nosync() (SMP only) + +- synchronize_irq() (SMP only) + +- irq_set_irq_type() + +- irq_set_irq_wake() + +- irq_set_handler_data() + +- irq_set_chip() + +- irq_set_chip_data() + +See the autogenerated function documentation for details. + +High-level IRQ flow handlers +---------------------------- + +The generic layer provides a set of pre-defined irq-flow methods: + +- handle_level_irq() + +- handle_edge_irq() + +- handle_fasteoi_irq() + +- handle_simple_irq() + +- handle_percpu_irq() + +- handle_edge_eoi_irq() + +- handle_bad_irq() + +The interrupt flow handlers (either pre-defined or architecture +specific) are assigned to specific interrupts by the architecture either +during bootup or during device initialization. + +Default flow implementations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Helper functions +^^^^^^^^^^^^^^^^ + +The helper functions call the chip primitives and are used by the +default flow implementations. The following helper functions are +implemented (simplified excerpt):: + + default_enable(struct irq_data *data) + { + desc->irq_data.chip->irq_unmask(data); + } + + default_disable(struct irq_data *data) + { + if (!delay_disable(data)) + desc->irq_data.chip->irq_mask(data); + } + + default_ack(struct irq_data *data) + { + chip->irq_ack(data); + } + + default_mask_ack(struct irq_data *data) + { + if (chip->irq_mask_ack) { + chip->irq_mask_ack(data); + } else { + chip->irq_mask(data); + chip->irq_ack(data); + } + } + + noop(struct irq_data *data)) + { + } + + + +Default flow handler implementations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Default Level IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_level_irq provides a generic implementation for level-triggered +interrupts. + +The following control flow is implemented (simplified excerpt):: + + desc->irq_data.chip->irq_mask_ack(); + handle_irq_event(desc->action); + desc->irq_data.chip->irq_unmask(); + + +Default Fast EOI IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_fasteoi_irq provides a generic implementation for interrupts, +which only need an EOI at the end of the handler. + +The following control flow is implemented (simplified excerpt):: + + handle_irq_event(desc->action); + desc->irq_data.chip->irq_eoi(); + + +Default Edge IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_edge_irq provides a generic implementation for edge-triggered +interrupts. + +The following control flow is implemented (simplified excerpt):: + + if (desc->status & running) { + desc->irq_data.chip->irq_mask_ack(); + desc->status |= pending | masked; + return; + } + desc->irq_data.chip->irq_ack(); + desc->status |= running; + do { + if (desc->status & masked) + desc->irq_data.chip->irq_unmask(); + desc->status &= ~pending; + handle_irq_event(desc->action); + } while (desc->status & pending); + desc->status &= ~running; + + +Default simple IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_simple_irq provides a generic implementation for simple +interrupts. + +.. note:: + + The simple flow handler does not call any handler/chip primitives. + +The following control flow is implemented (simplified excerpt):: + + handle_irq_event(desc->action); + + +Default per CPU flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_percpu_irq provides a generic implementation for per CPU +interrupts. + +Per CPU interrupts are only available on SMP and the handler provides a +simplified version without locking. + +The following control flow is implemented (simplified excerpt):: + + if (desc->irq_data.chip->irq_ack) + desc->irq_data.chip->irq_ack(); + handle_irq_event(desc->action); + if (desc->irq_data.chip->irq_eoi) + desc->irq_data.chip->irq_eoi(); + + +EOI Edge IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_edge_eoi_irq provides an abnomination of the edge handler +which is solely used to tame a badly wreckaged irq controller on +powerpc/cell. + +Bad IRQ flow handler +^^^^^^^^^^^^^^^^^^^^ + +handle_bad_irq is used for spurious interrupts which have no real +handler assigned.. + +Quirks and optimizations +~~~~~~~~~~~~~~~~~~~~~~~~ + +The generic functions are intended for 'clean' architectures and chips, +which have no platform-specific IRQ handling quirks. If an architecture +needs to implement quirks on the 'flow' level then it can do so by +overriding the high-level irq-flow handler. + +Delayed interrupt disable +~~~~~~~~~~~~~~~~~~~~~~~~~ + +This per interrupt selectable feature, which was introduced by Russell +King in the ARM interrupt implementation, does not mask an interrupt at +the hardware level when disable_irq() is called. The interrupt is kept +enabled and is masked in the flow handler when an interrupt event +happens. This prevents losing edge interrupts on hardware which does not +store an edge interrupt event while the interrupt is disabled at the +hardware level. When an interrupt arrives while the IRQ_DISABLED flag +is set, then the interrupt is masked at the hardware level and the +IRQ_PENDING bit is set. When the interrupt is re-enabled by +enable_irq() the pending bit is checked and if it is set, the interrupt +is resent either via hardware or by a software resend mechanism. (It's +necessary to enable CONFIG_HARDIRQS_SW_RESEND when you want to use +the delayed interrupt disable feature and your hardware is not capable +of retriggering an interrupt.) The delayed interrupt disable is not +configurable. + +Chip-level hardware encapsulation +--------------------------------- + +The chip-level hardware descriptor structure :c:type:`irq_chip` contains all +the direct chip relevant functions, which can be utilized by the irq flow +implementations. + +- ``irq_ack`` + +- ``irq_mask_ack`` - Optional, recommended for performance + +- ``irq_mask`` + +- ``irq_unmask`` + +- ``irq_eoi`` - Optional, required for EOI flow handlers + +- ``irq_retrigger`` - Optional + +- ``irq_set_type`` - Optional + +- ``irq_set_wake`` - Optional + +These primitives are strictly intended to mean what they say: ack means +ACK, masking means masking of an IRQ line, etc. It is up to the flow +handler(s) to use these basic units of low-level functionality. + +__do_IRQ entry point +==================== + +The original implementation __do_IRQ() was an alternative entry point +for all types of interrupts. It no longer exists. + +This handler turned out to be not suitable for all interrupt hardware +and was therefore reimplemented with split functionality for +edge/level/simple/percpu interrupts. This is not only a functional +optimization. It also shortens code paths for interrupts. + +Locking on SMP +============== + +The locking of chip registers is up to the architecture that defines the +chip primitives. The per-irq structure is protected via desc->lock, by +the generic layer. + +Generic interrupt chip +====================== + +To avoid copies of identical implementations of IRQ chips the core +provides a configurable generic interrupt chip implementation. +Developers should check carefully whether the generic chip fits their +needs before implementing the same functionality slightly differently +themselves. + +.. kernel-doc:: kernel/irq/generic-chip.c + :export: + +Structures +========== + +This chapter contains the autogenerated documentation of the structures +which are used in the generic IRQ layer. + +.. kernel-doc:: include/linux/irq.h + :internal: + +.. kernel-doc:: include/linux/interrupt.h + :internal: + +Public Functions Provided +========================= + +This chapter contains the autogenerated documentation of the kernel API +functions which are exported. + +.. kernel-doc:: kernel/irq/manage.c + +.. kernel-doc:: kernel/irq/chip.c + :export: + +Internal Functions Provided +=========================== + +This chapter contains the autogenerated documentation of the internal +functions. + +.. kernel-doc:: kernel/irq/irqdesc.c + +.. kernel-doc:: kernel/irq/handle.c + +.. kernel-doc:: kernel/irq/chip.c + :internal: + +Credits +======= + +The following people have contributed to this document: + +1. Thomas Gleixner tglx@linutronix.de + +2. Ingo Molnar mingo@elte.hu diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst new file mode 100644 index 0000000000..e7c32a8de1 --- /dev/null +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst @@ -0,0 +1,68 @@ +.. _gfp_mask_from_fs_io: + +================================= +GFP masks used from FS/IO context +================================= + +:Date: May, 2018 +:Author: Michal Hocko <mhocko@kernel.org> + +Introduction +============ + +Code paths in the filesystem and IO stacks must be careful when +allocating memory to prevent recursion deadlocks caused by direct +memory reclaim calling back into the FS or IO paths and blocking on +already held resources (e.g. locks - most commonly those used for the +transaction context). + +The traditional way to avoid this deadlock problem is to clear __GFP_FS +respectively __GFP_IO (note the latter implies clearing the first as well) in +the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be +used as shortcut. It turned out though that above approach has led to +abuses when the restricted gfp mask is used "just in case" without a +deeper consideration which leads to problems because an excessive use +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory +reclaim issues. + +New API +======== + +Since 4.12 we do have a generic scope API for both NOFS and NOIO context +``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``, +``memalloc_noio_restore`` which allow to mark a scope to be a critical +section from a filesystem or I/O point of view. Any allocation from that +scope will inherently drop __GFP_FS respectively __GFP_IO from the given +mask so no memory allocation can recurse back in the FS/IO. + +.. kernel-doc:: include/linux/sched/mm.h + :functions: memalloc_nofs_save memalloc_nofs_restore +.. kernel-doc:: include/linux/sched/mm.h + :functions: memalloc_noio_save memalloc_noio_restore + +FS/IO code then simply calls the appropriate save function before +any critical section with respect to the reclaim is started - e.g. +lock shared with the reclaim context or when a transaction context +nesting would be possible via reclaim. The restore function should be +called when the critical section ends. All that ideally along with an +explanation what is the reclaim context for easier maintenance. + +Please note that the proper pairing of save/restore functions +allows nesting so it is safe to call ``memalloc_noio_save`` or +``memalloc_noio_restore`` respectively from an existing NOIO or NOFS +scope. + +What about __vmalloc(GFP_NOFS) +============================== + +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is +almost always a bug. The good news is that the NOFS/NOIO semantic can be +achieved by the scope API. + +In the ideal world, upper layers should already mark dangerous contexts +and so no special care is required and vmalloc should be called without +any problems. Sometimes if the context is not really clear or there are +layering violations then the recommended way around that is to wrap ``vmalloc`` +by the scope API with a comment explaining the problem. diff --git a/Documentation/core-api/idr.rst b/Documentation/core-api/idr.rst new file mode 100644 index 0000000000..18d7248670 --- /dev/null +++ b/Documentation/core-api/idr.rst @@ -0,0 +1,84 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============= +ID Allocation +============= + +:Author: Matthew Wilcox + +Overview +======== + +A common problem to solve is allocating identifiers (IDs); generally +small numbers which identify a thing. Examples include file descriptors, +process IDs, packet identifiers in networking protocols, SCSI tags +and device instance numbers. The IDR and the IDA provide a reasonable +solution to the problem to avoid everybody inventing their own. The IDR +provides the ability to map an ID to a pointer, while the IDA provides +only ID allocation, and as a result is much more memory-efficient. + +The IDR interface is deprecated; please use the :doc:`XArray <xarray>` +instead. + +IDR usage +========= + +Start by initialising an IDR, either with DEFINE_IDR() +for statically allocated IDRs or idr_init() for dynamically +allocated IDRs. + +You can call idr_alloc() to allocate an unused ID. Look up +the pointer you associated with the ID by calling idr_find() +and free the ID by calling idr_remove(). + +If you need to change the pointer associated with an ID, you can call +idr_replace(). One common reason to do this is to reserve an +ID by passing a ``NULL`` pointer to the allocation function; initialise the +object with the reserved ID and finally insert the initialised object +into the IDR. + +Some users need to allocate IDs larger than ``INT_MAX``. So far all of +these users have been content with a ``UINT_MAX`` limit, and they use +idr_alloc_u32(). If you need IDs that will not fit in a u32, +we will work with you to address your needs. + +If you need to allocate IDs sequentially, you can use +idr_alloc_cyclic(). The IDR becomes less efficient when dealing +with larger IDs, so using this function comes at a slight cost. + +To perform an action on all pointers used by the IDR, you can +either use the callback-based idr_for_each() or the +iterator-style idr_for_each_entry(). You may need to use +idr_for_each_entry_continue() to continue an iteration. You can +also use idr_get_next() if the iterator doesn't fit your needs. + +When you have finished using an IDR, you can call idr_destroy() +to release the memory used by the IDR. This will not free the objects +pointed to from the IDR; if you want to do that, use one of the iterators +to do it. + +You can use idr_is_empty() to find out whether there are any +IDs currently allocated. + +If you need to take a lock while allocating a new ID from the IDR, +you may need to pass a restrictive set of GFP flags, which can lead +to the IDR being unable to allocate memory. To work around this, +you can call idr_preload() before taking the lock, and then +idr_preload_end() after the allocation. + +.. kernel-doc:: include/linux/idr.h + :doc: idr sync + +IDA usage +========= + +.. kernel-doc:: lib/idr.c + :doc: IDA description + +Functions and structures +======================== + +.. kernel-doc:: include/linux/idr.h + :functions: +.. kernel-doc:: lib/idr.c + :functions: diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst new file mode 100644 index 0000000000..7a3a08d81f --- /dev/null +++ b/Documentation/core-api/index.rst @@ -0,0 +1,137 @@ +====================== +Core API Documentation +====================== + +This is the beginning of a manual for core kernel APIs. The conversion +(and writing!) of documents for this manual is much appreciated! + +Core utilities +============== + +This section has general and "core core" documentation. The first is a +massive grab-bag of kerneldoc info left over from the docbook days; it +should really be broken up someday when somebody finds the energy to do +it. + +.. toctree:: + :maxdepth: 1 + + kernel-api + workqueue + watch_queue + printk-basics + printk-formats + printk-index + symbol-namespaces + asm-annotations + +Data structures and low-level utilities +======================================= + +Library functionality that is used throughout the kernel. + +.. toctree:: + :maxdepth: 1 + + kobject + kref + assoc_array + xarray + maple_tree + idr + circular-buffers + rbtree + generic-radix-tree + packing + this_cpu_ops + timekeeping + errseq + wrappers/atomic_t + wrappers/atomic_bitops + +Low level entry and exit +======================== + +.. toctree:: + :maxdepth: 1 + + entry + +Concurrency primitives +====================== + +How Linux keeps everything from happening at the same time. See +Documentation/locking/index.rst for more related documentation. + +.. toctree:: + :maxdepth: 1 + + refcount-vs-atomic + irq/index + local_ops + padata + ../RCU/index + wrappers/memory-barriers.rst + +Low-level hardware management +============================= + +Cache management, managing CPU hotplug, etc. + +.. toctree:: + :maxdepth: 1 + + cachetlb + cpu_hotplug + memory-hotplug + genericirq + protection-keys + +Memory management +================= + +How to allocate and use memory in the kernel. Note that there is a lot +more memory-management documentation in Documentation/mm/index.rst. + +.. toctree:: + :maxdepth: 1 + + memory-allocation + unaligned-memory-access + dma-api + dma-api-howto + dma-attributes + dma-isa-lpc + mm-api + genalloc + pin_user_pages + boot-time-mm + gfp_mask-from-fs-io + +Interfaces for kernel debugging +=============================== + +.. toctree:: + :maxdepth: 1 + + debug-objects + tracepoint + debugging-via-ohci1394 + +Everything else +=============== + +Documents that don't fit elsewhere or which have yet to be categorized. + +.. toctree:: + :maxdepth: 1 + + librs + netlink + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/core-api/irq/concepts.rst b/Documentation/core-api/irq/concepts.rst new file mode 100644 index 0000000000..4273806a60 --- /dev/null +++ b/Documentation/core-api/irq/concepts.rst @@ -0,0 +1,24 @@ +=============== +What is an IRQ? +=============== + +An IRQ is an interrupt request from a device. +Currently they can come in over a pin, or over a packet. +Several devices may be connected to the same pin thus +sharing an IRQ. + +An IRQ number is a kernel identifier used to talk about a hardware +interrupt source. Typically this is an index into the global irq_desc +array, but except for what linux/interrupt.h implements the details +are architecture specific. + +An IRQ number is an enumeration of the possible interrupt sources on a +machine. Typically what is enumerated is the number of input pins on +all of the interrupt controller in the system. In the case of ISA +what is enumerated are the 16 input pins on the two i8259 interrupt +controllers. + +Architectures can assign additional meaning to the IRQ numbers, and +are encouraged to in the case where there is any manual configuration +of the hardware involved. The ISA IRQs are a classic example of +assigning this kind of additional meaning. diff --git a/Documentation/core-api/irq/index.rst b/Documentation/core-api/irq/index.rst new file mode 100644 index 0000000000..0d65d11e54 --- /dev/null +++ b/Documentation/core-api/irq/index.rst @@ -0,0 +1,11 @@ +==== +IRQs +==== + +.. toctree:: + :maxdepth: 1 + + concepts + irq-affinity + irq-domain + irqflags-tracing diff --git a/Documentation/core-api/irq/irq-affinity.rst b/Documentation/core-api/irq/irq-affinity.rst new file mode 100644 index 0000000000..29da500083 --- /dev/null +++ b/Documentation/core-api/irq/irq-affinity.rst @@ -0,0 +1,70 @@ +================ +SMP IRQ affinity +================ + +ChangeLog: + - Started by Ingo Molnar <mingo@redhat.com> + - Update by Max Krasnyansky <maxk@qualcomm.com> + + +/proc/irq/IRQ#/smp_affinity and /proc/irq/IRQ#/smp_affinity_list specify +which target CPUs are permitted for a given IRQ source. It's a bitmask +(smp_affinity) or cpu list (smp_affinity_list) of allowed CPUs. It's not +allowed to turn off all CPUs, and if an IRQ controller does not support +IRQ affinity then the value will not change from the default of all cpus. + +/proc/irq/default_smp_affinity specifies default affinity mask that applies +to all non-active IRQs. Once IRQ is allocated/activated its affinity bitmask +will be set to the default mask. It can then be changed as described above. +Default mask is 0xffffffff. + +Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting +it to CPU4-7 (this is an 8-CPU SMP box):: + + [root@moon 44]# cd /proc/irq/44 + [root@moon 44]# cat smp_affinity + ffffffff + + [root@moon 44]# echo 0f > smp_affinity + [root@moon 44]# cat smp_affinity + 0000000f + [root@moon 44]# ping -f h + PING hell (195.4.7.3): 56 data bytes + ... + --- hell ping statistics --- + 6029 packets transmitted, 6027 packets received, 0% packet loss + round-trip min/avg/max = 0.1/0.1/0.4 ms + [root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:' + CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 + 44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1 + +As can be seen from the line above IRQ44 was delivered only to the first four +processors (0-3). +Now lets restrict that IRQ to CPU(4-7). + +:: + + [root@moon 44]# echo f0 > smp_affinity + [root@moon 44]# cat smp_affinity + 000000f0 + [root@moon 44]# ping -f h + PING hell (195.4.7.3): 56 data bytes + .. + --- hell ping statistics --- + 2779 packets transmitted, 2777 packets received, 0% packet loss + round-trip min/avg/max = 0.1/0.5/585.4 ms + [root@moon 44]# cat /proc/interrupts | 'CPU\|44:' + CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 + 44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1 + +This time around IRQ44 was delivered only to the last four processors. +i.e counters for the CPU0-3 did not change. + +Here is an example of limiting that same irq (44) to cpus 1024 to 1031:: + + [root@moon 44]# echo 1024-1031 > smp_affinity_list + [root@moon 44]# cat smp_affinity_list + 1024-1031 + +Note that to do this with a bitmask would require 32 bitmasks of zero +to follow the pertinent one. diff --git a/Documentation/core-api/irq/irq-domain.rst b/Documentation/core-api/irq/irq-domain.rst new file mode 100644 index 0000000000..f88a6ee67a --- /dev/null +++ b/Documentation/core-api/irq/irq-domain.rst @@ -0,0 +1,297 @@ +=============================================== +The irq_domain interrupt number mapping library +=============================================== + +The current design of the Linux kernel uses a single large number +space where each separate IRQ source is assigned a different number. +This is simple when there is only one interrupt controller, but in +systems with multiple interrupt controllers the kernel must ensure +that each one gets assigned non-overlapping allocations of Linux +IRQ numbers. + +The number of interrupt controllers registered as unique irqchips +show a rising tendency: for example subdrivers of different kinds +such as GPIO controllers avoid reimplementing identical callback +mechanisms as the IRQ core system by modelling their interrupt +handlers as irqchips, i.e. in effect cascading interrupt controllers. + +Here the interrupt number loose all kind of correspondence to +hardware interrupt numbers: whereas in the past, IRQ numbers could +be chosen so they matched the hardware IRQ line into the root +interrupt controller (i.e. the component actually fireing the +interrupt line to the CPU) nowadays this number is just a number. + +For this reason we need a mechanism to separate controller-local +interrupt numbers, called hardware irq's, from Linux IRQ numbers. + +The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of +irq numbers, but they don't provide any support for reverse mapping of +the controller-local IRQ (hwirq) number into the Linux IRQ number +space. + +The irq_domain library adds mapping between hwirq and IRQ numbers on +top of the irq_alloc_desc*() API. An irq_domain to manage mapping is +preferred over interrupt controller drivers open coding their own +reverse mapping scheme. + +irq_domain also implements translation from an abstract irq_fwspec +structure to hwirq numbers (Device Tree and ACPI GSI so far), and can +be easily extended to support other IRQ topology data sources. + +irq_domain usage +================ + +An interrupt controller driver creates and registers an irq_domain by +calling one of the irq_domain_add_*() or irq_domain_create_*() functions +(each mapping method has a different allocator function, more on that later). +The function will return a pointer to the irq_domain on success. The caller +must provide the allocator function with an irq_domain_ops structure. + +In most cases, the irq_domain will begin empty without any mappings +between hwirq and IRQ numbers. Mappings are added to the irq_domain +by calling irq_create_mapping() which accepts the irq_domain and a +hwirq number as arguments. If a mapping for the hwirq doesn't already +exist then it will allocate a new Linux irq_desc, associate it with +the hwirq, and call the .map() callback so the driver can perform any +required hardware setup. + +Once a mapping has been established, it can be retrieved or used via a +variety of methods: + +- irq_resolve_mapping() returns a pointer to the irq_desc structure + for a given domain and hwirq number, and NULL if there was no + mapping. +- irq_find_mapping() returns a Linux IRQ number for a given domain and + hwirq number, and 0 if there was no mapping +- irq_linear_revmap() is now identical to irq_find_mapping(), and is + deprecated +- generic_handle_domain_irq() handles an interrupt described by a + domain and a hwirq number + +Note that irq domain lookups must happen in contexts that are +compatible with a RCU read-side critical section. + +The irq_create_mapping() function must be called *at least once* +before any call to irq_find_mapping(), lest the descriptor will not +be allocated. + +If the driver has the Linux IRQ number or the irq_data pointer, and +needs to know the associated hwirq number (such as in the irq_chip +callbacks) then it can be directly obtained from irq_data->hwirq. + +Types of irq_domain mappings +============================ + +There are several mechanisms available for reverse mapping from hwirq +to Linux irq, and each mechanism uses a different allocation function. +Which reverse map type should be used depends on the use case. Each +of the reverse map types are described below: + +Linear +------ + +:: + + irq_domain_add_linear() + irq_domain_create_linear() + +The linear reverse map maintains a fixed size table indexed by the +hwirq number. When a hwirq is mapped, an irq_desc is allocated for +the hwirq, and the IRQ number is stored in the table. + +The Linear map is a good choice when the maximum number of hwirqs is +fixed and a relatively small number (~ < 256). The advantages of this +map are fixed time lookup for IRQ numbers, and irq_descs are only +allocated for in-use IRQs. The disadvantage is that the table must be +as large as the largest possible hwirq number. + +irq_domain_add_linear() and irq_domain_create_linear() are functionally +equivalent, except for the first argument is different - the former +accepts an Open Firmware specific 'struct device_node', while the latter +accepts a more general abstraction 'struct fwnode_handle'. + +The majority of drivers should use the linear map. + +Tree +---- + +:: + + irq_domain_add_tree() + irq_domain_create_tree() + +The irq_domain maintains a radix tree map from hwirq numbers to Linux +IRQs. When an hwirq is mapped, an irq_desc is allocated and the +hwirq is used as the lookup key for the radix tree. + +The tree map is a good choice if the hwirq number can be very large +since it doesn't need to allocate a table as large as the largest +hwirq number. The disadvantage is that hwirq to IRQ number lookup is +dependent on how many entries are in the table. + +irq_domain_add_tree() and irq_domain_create_tree() are functionally +equivalent, except for the first argument is different - the former +accepts an Open Firmware specific 'struct device_node', while the latter +accepts a more general abstraction 'struct fwnode_handle'. + +Very few drivers should need this mapping. + +No Map +------ + +:: + + irq_domain_add_nomap() + +The No Map mapping is to be used when the hwirq number is +programmable in the hardware. In this case it is best to program the +Linux IRQ number into the hardware itself so that no mapping is +required. Calling irq_create_direct_mapping() will allocate a Linux +IRQ number and call the .map() callback so that driver can program the +Linux IRQ number into the hardware. + +Most drivers cannot use this mapping, and it is now gated on the +CONFIG_IRQ_DOMAIN_NOMAP option. Please refrain from introducing new +users of this API. + +Legacy +------ + +:: + + irq_domain_add_simple() + irq_domain_add_legacy() + irq_domain_create_simple() + irq_domain_create_legacy() + +The Legacy mapping is a special case for drivers that already have a +range of irq_descs allocated for the hwirqs. It is used when the +driver cannot be immediately converted to use the linear mapping. For +example, many embedded system board support files use a set of #defines +for IRQ numbers that are passed to struct device registrations. In that +case the Linux IRQ numbers cannot be dynamically assigned and the legacy +mapping should be used. + +As the name implies, the \*_legacy() functions are deprecated and only +exist to ease the support of ancient platforms. No new users should be +added. Same goes for the \*_simple() functions when their use results +in the legacy behaviour. + +The legacy map assumes a contiguous range of IRQ numbers has already +been allocated for the controller and that the IRQ number can be +calculated by adding a fixed offset to the hwirq number, and +visa-versa. The disadvantage is that it requires the interrupt +controller to manage IRQ allocations and it requires an irq_desc to be +allocated for every hwirq, even if it is unused. + +The legacy map should only be used if fixed IRQ mappings must be +supported. For example, ISA controllers would use the legacy map for +mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ +numbers. + +Most users of legacy mappings should use irq_domain_add_simple() or +irq_domain_create_simple() which will use a legacy domain only if an IRQ range +is supplied by the system and will otherwise use a linear domain mapping. +The semantics of this call are such that if an IRQ range is specified then +descriptors will be allocated on-the-fly for it, and if no range is +specified it will fall through to irq_domain_add_linear() or +irq_domain_create_linear() which means *no* irq descriptors will be allocated. + +A typical use case for simple domains is where an irqchip provider +is supporting both dynamic and static IRQ assignments. + +In order to avoid ending up in a situation where a linear domain is +used and no descriptor gets allocated it is very important to make sure +that the driver using the simple domain call irq_create_mapping() +before any irq_find_mapping() since the latter will actually work +for the static IRQ assignment case. + +irq_domain_add_simple() and irq_domain_create_simple() as well as +irq_domain_add_legacy() and irq_domain_create_legacy() are functionally +equivalent, except for the first argument is different - the former +accepts an Open Firmware specific 'struct device_node', while the latter +accepts a more general abstraction 'struct fwnode_handle'. + +Hierarchy IRQ domain +-------------------- + +On some architectures, there may be multiple interrupt controllers +involved in delivering an interrupt from the device to the target CPU. +Let's look at a typical interrupt delivering path on x86 platforms:: + + Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU + +There are three interrupt controllers involved: + +1) IOAPIC controller +2) Interrupt remapping controller +3) Local APIC controller + +To support such a hardware topology and make software architecture match +hardware architecture, an irq_domain data structure is built for each +interrupt controller and those irq_domains are organized into hierarchy. +When building irq_domain hierarchy, the irq_domain near to the device is +child and the irq_domain near to CPU is parent. So a hierarchy structure +as below will be built for the example above:: + + CPU Vector irq_domain (root irq_domain to manage CPU vectors) + ^ + | + Interrupt Remapping irq_domain (manage irq_remapping entries) + ^ + | + IOAPIC irq_domain (manage IOAPIC delivery entries/pins) + +There are four major interfaces to use hierarchy irq_domain: + +1) irq_domain_alloc_irqs(): allocate IRQ descriptors and interrupt + controller related resources to deliver these interrupts. +2) irq_domain_free_irqs(): free IRQ descriptors and interrupt controller + related resources associated with these interrupts. +3) irq_domain_activate_irq(): activate interrupt controller hardware to + deliver the interrupt. +4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware + to stop delivering the interrupt. + +Following changes are needed to support hierarchy irq_domain: + +1) a new field 'parent' is added to struct irq_domain; it's used to + maintain irq_domain hierarchy information. +2) a new field 'parent_data' is added to struct irq_data; it's used to + build hierarchy irq_data to match hierarchy irq_domains. The irq_data + is used to store irq_domain pointer and hardware irq number. +3) new callbacks are added to struct irq_domain_ops to support hierarchy + irq_domain operations. + +With support of hierarchy irq_domain and hierarchy irq_data ready, an +irq_domain structure is built for each interrupt controller, and an +irq_data structure is allocated for each irq_domain associated with an +IRQ. Now we could go one step further to support stacked(hierarchy) +irq_chip. That is, an irq_chip is associated with each irq_data along +the hierarchy. A child irq_chip may implement a required action by +itself or by cooperating with its parent irq_chip. + +With stacked irq_chip, interrupt controller driver only needs to deal +with the hardware managed by itself and may ask for services from its +parent irq_chip when needed. So we could achieve a much cleaner +software architecture. + +For an interrupt controller driver to support hierarchy irq_domain, it +needs to: + +1) Implement irq_domain_ops.alloc and irq_domain_ops.free +2) Optionally implement irq_domain_ops.activate and + irq_domain_ops.deactivate. +3) Optionally implement an irq_chip to manage the interrupt controller + hardware. +4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap, + they are unused with hierarchy irq_domain. + +Hierarchy irq_domain is in no way x86 specific, and is heavily used to +support other architectures, such as ARM, ARM64 etc. + +Debugging +========= + +Most of the internals of the IRQ subsystem are exposed in debugfs by +turning CONFIG_GENERIC_IRQ_DEBUGFS on. diff --git a/Documentation/core-api/irq/irqflags-tracing.rst b/Documentation/core-api/irq/irqflags-tracing.rst new file mode 100644 index 0000000000..bdd208259f --- /dev/null +++ b/Documentation/core-api/irq/irqflags-tracing.rst @@ -0,0 +1,52 @@ +======================= +IRQ-flags state tracing +======================= + +:Author: started by Ingo Molnar <mingo@redhat.com> + +The "irq-flags tracing" feature "traces" hardirq and softirq state, in +that it gives interested subsystems an opportunity to be notified of +every hardirqs-off/hardirqs-on, softirqs-off/softirqs-on event that +happens in the kernel. + +CONFIG_TRACE_IRQFLAGS_SUPPORT is needed for CONFIG_PROVE_SPIN_LOCKING +and CONFIG_PROVE_RW_LOCKING to be offered by the generic lock debugging +code. Otherwise only CONFIG_PROVE_MUTEX_LOCKING and +CONFIG_PROVE_RWSEM_LOCKING will be offered on an architecture - these +are locking APIs that are not used in IRQ context. (the one exception +for rwsems is worked around) + +Architecture support for this is certainly not in the "trivial" +category, because lots of lowlevel assembly code deal with irq-flags +state changes. But an architecture can be irq-flags-tracing enabled in a +rather straightforward and risk-free manner. + +Architectures that want to support this need to do a couple of +code-organizational changes first: + +- add and enable TRACE_IRQFLAGS_SUPPORT in their arch level Kconfig file + +and then a couple of functional changes are needed as well to implement +irq-flags-tracing support: + +- in lowlevel entry code add (build-conditional) calls to the + trace_hardirqs_off()/trace_hardirqs_on() functions. The lock validator + closely guards whether the 'real' irq-flags matches the 'virtual' + irq-flags state, and complains loudly (and turns itself off) if the + two do not match. Usually most of the time for arch support for + irq-flags-tracing is spent in this state: look at the lockdep + complaint, try to figure out the assembly code we did not cover yet, + fix and repeat. Once the system has booted up and works without a + lockdep complaint in the irq-flags-tracing functions arch support is + complete. +- if the architecture has non-maskable interrupts then those need to be + excluded from the irq-tracing [and lock validation] mechanism via + lockdep_off()/lockdep_on(). + +In general there is no risk from having an incomplete irq-flags-tracing +implementation in an architecture: lockdep will detect that and will +turn itself off. I.e. the lock validator will still be reliable. There +should be no crashes due to irq-tracing bugs. (except if the assembly +changes break other code by modifying conditions or registers that +shouldn't be) + diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst new file mode 100644 index 0000000000..ae92a25713 --- /dev/null +++ b/Documentation/core-api/kernel-api.rst @@ -0,0 +1,434 @@ +==================== +The Linux Kernel API +==================== + + +List Management Functions +========================= + +.. kernel-doc:: include/linux/list.h + :internal: + +Basic C Library Functions +========================= + +When writing drivers, you cannot in general use routines which are from +the C Library. Some of the functions have been found generally useful +and they are listed below. The behaviour of these functions may vary +slightly from those defined by ANSI, and these deviations are noted in +the text. + +String Conversions +------------------ + +.. kernel-doc:: lib/vsprintf.c + :export: + +.. kernel-doc:: include/linux/kstrtox.h + :functions: kstrtol kstrtoul + +.. kernel-doc:: lib/kstrtox.c + :export: + +.. kernel-doc:: lib/string_helpers.c + :export: + +String Manipulation +------------------- + +.. kernel-doc:: include/linux/fortify-string.h + :internal: + +.. kernel-doc:: lib/string.c + :export: + +.. kernel-doc:: include/linux/string.h + :internal: + +.. kernel-doc:: mm/util.c + :functions: kstrdup kstrdup_const kstrndup kmemdup kmemdup_nul memdup_user + vmemdup_user strndup_user memdup_user_nul + +Basic Kernel Library Functions +============================== + +The Linux kernel provides more basic utility functions. + +Bit Operations +-------------- + +.. kernel-doc:: include/asm-generic/bitops/instrumented-atomic.h + :internal: + +.. kernel-doc:: include/asm-generic/bitops/instrumented-non-atomic.h + :internal: + +.. kernel-doc:: include/asm-generic/bitops/instrumented-lock.h + :internal: + +Bitmap Operations +----------------- + +.. kernel-doc:: lib/bitmap.c + :doc: bitmap introduction + +.. kernel-doc:: include/linux/bitmap.h + :doc: declare bitmap + +.. kernel-doc:: include/linux/bitmap.h + :doc: bitmap overview + +.. kernel-doc:: include/linux/bitmap.h + :doc: bitmap bitops + +.. kernel-doc:: lib/bitmap.c + :export: + +.. kernel-doc:: lib/bitmap.c + :internal: + +.. kernel-doc:: include/linux/bitmap.h + :internal: + +Command-line Parsing +-------------------- + +.. kernel-doc:: lib/cmdline.c + :export: + +Error Pointers +-------------- + +.. kernel-doc:: include/linux/err.h + :internal: + +Sorting +------- + +.. kernel-doc:: lib/sort.c + :export: + +.. kernel-doc:: lib/list_sort.c + :export: + +Text Searching +-------------- + +.. kernel-doc:: lib/textsearch.c + :doc: ts_intro + +.. kernel-doc:: lib/textsearch.c + :export: + +.. kernel-doc:: include/linux/textsearch.h + :functions: textsearch_find textsearch_next \ + textsearch_get_pattern textsearch_get_pattern_len + +CRC and Math Functions in Linux +=============================== + +Arithmetic Overflow Checking +---------------------------- + +.. kernel-doc:: include/linux/overflow.h + :internal: + +CRC Functions +------------- + +.. kernel-doc:: lib/crc4.c + :export: + +.. kernel-doc:: lib/crc7.c + :export: + +.. kernel-doc:: lib/crc8.c + :export: + +.. kernel-doc:: lib/crc16.c + :export: + +.. kernel-doc:: lib/crc32.c + +.. kernel-doc:: lib/crc-ccitt.c + :export: + +.. kernel-doc:: lib/crc-itu-t.c + :export: + +Base 2 log and power Functions +------------------------------ + +.. kernel-doc:: include/linux/log2.h + :internal: + +Integer log and power Functions +------------------------------- + +.. kernel-doc:: include/linux/int_log.h + +.. kernel-doc:: lib/math/int_pow.c + :export: + +.. kernel-doc:: lib/math/int_sqrt.c + :export: + +Division Functions +------------------ + +.. kernel-doc:: include/asm-generic/div64.h + :functions: do_div + +.. kernel-doc:: include/linux/math64.h + :internal: + +.. kernel-doc:: lib/math/gcd.c + :export: + +UUID/GUID +--------- + +.. kernel-doc:: lib/uuid.c + :export: + +Kernel IPC facilities +===================== + +IPC utilities +------------- + +.. kernel-doc:: ipc/util.c + :internal: + +FIFO Buffer +=========== + +kfifo interface +--------------- + +.. kernel-doc:: include/linux/kfifo.h + :internal: + +relay interface support +======================= + +Relay interface support is designed to provide an efficient mechanism +for tools and facilities to relay large amounts of data from kernel +space to user space. + +relay interface +--------------- + +.. kernel-doc:: kernel/relay.c + :export: + +.. kernel-doc:: kernel/relay.c + :internal: + +Module Support +============== + +Kernel module auto-loading +-------------------------- + +.. kernel-doc:: kernel/module/kmod.c + :export: + +Module debugging +---------------- + +.. kernel-doc:: kernel/module/stats.c + :doc: module debugging statistics overview + +dup_failed_modules - tracks duplicate failed modules +**************************************************** + +.. kernel-doc:: kernel/module/stats.c + :doc: dup_failed_modules - tracks duplicate failed modules + +module statistics debugfs counters +********************************** + +.. kernel-doc:: kernel/module/stats.c + :doc: module statistics debugfs counters + +Inter Module support +-------------------- + +Refer to the files in kernel/module/ for more information. + +Hardware Interfaces +=================== + +DMA Channels +------------ + +.. kernel-doc:: kernel/dma.c + :export: + +Resources Management +-------------------- + +.. kernel-doc:: kernel/resource.c + :internal: + +.. kernel-doc:: kernel/resource.c + :export: + +MTRR Handling +------------- + +.. kernel-doc:: arch/x86/kernel/cpu/mtrr/mtrr.c + :export: + +Security Framework +================== + +.. kernel-doc:: security/security.c + :internal: + +.. kernel-doc:: security/inode.c + :export: + +Audit Interfaces +================ + +.. kernel-doc:: kernel/audit.c + :export: + +.. kernel-doc:: kernel/auditsc.c + :internal: + +.. kernel-doc:: kernel/auditfilter.c + :internal: + +Accounting Framework +==================== + +.. kernel-doc:: kernel/acct.c + :internal: + +Block Devices +============= + +.. kernel-doc:: include/linux/bio.h +.. kernel-doc:: block/blk-core.c + :export: + +.. kernel-doc:: block/blk-core.c + :internal: + +.. kernel-doc:: block/blk-map.c + :export: + +.. kernel-doc:: block/blk-sysfs.c + :internal: + +.. kernel-doc:: block/blk-settings.c + :export: + +.. kernel-doc:: block/blk-flush.c + :export: + +.. kernel-doc:: block/blk-lib.c + :export: + +.. kernel-doc:: block/blk-integrity.c + :export: + +.. kernel-doc:: kernel/trace/blktrace.c + :internal: + +.. kernel-doc:: block/genhd.c + :internal: + +.. kernel-doc:: block/genhd.c + :export: + +.. kernel-doc:: block/bdev.c + :export: + +Char devices +============ + +.. kernel-doc:: fs/char_dev.c + :export: + +Clock Framework +=============== + +The clock framework defines programming interfaces to support software +management of the system clock tree. This framework is widely used with +System-On-Chip (SOC) platforms to support power management and various +devices which may need custom clock rates. Note that these "clocks" +don't relate to timekeeping or real time clocks (RTCs), each of which +have separate frameworks. These :c:type:`struct clk <clk>` +instances may be used to manage for example a 96 MHz signal that is used +to shift bits into and out of peripherals or busses, or otherwise +trigger synchronous state machine transitions in system hardware. + +Power management is supported by explicit software clock gating: unused +clocks are disabled, so the system doesn't waste power changing the +state of transistors that aren't in active use. On some systems this may +be backed by hardware clock gating, where clocks are gated without being +disabled in software. Sections of chips that are powered but not clocked +may be able to retain their last state. This low power state is often +called a *retention mode*. This mode still incurs leakage currents, +especially with finer circuit geometries, but for CMOS circuits power is +mostly used by clocked state changes. + +Power-aware drivers only enable their clocks when the device they manage +is in active use. Also, system sleep states often differ according to +which clock domains are active: while a "standby" state may allow wakeup +from several active domains, a "mem" (suspend-to-RAM) state may require +a more wholesale shutdown of clocks derived from higher speed PLLs and +oscillators, limiting the number of possible wakeup event sources. A +driver's suspend method may need to be aware of system-specific clock +constraints on the target sleep state. + +Some platforms support programmable clock generators. These can be used +by external chips of various kinds, such as other CPUs, multimedia +codecs, and devices with strict requirements for interface clocking. + +.. kernel-doc:: include/linux/clk.h + :internal: + +Synchronization Primitives +========================== + +Read-Copy Update (RCU) +---------------------- + +.. kernel-doc:: include/linux/rcupdate.h + +.. kernel-doc:: kernel/rcu/tree.c + +.. kernel-doc:: kernel/rcu/tree_exp.h + +.. kernel-doc:: kernel/rcu/update.c + +.. kernel-doc:: include/linux/srcu.h + +.. kernel-doc:: kernel/rcu/srcutree.c + +.. kernel-doc:: include/linux/rculist_bl.h + +.. kernel-doc:: include/linux/rculist.h + +.. kernel-doc:: include/linux/rculist_nulls.h + +.. kernel-doc:: include/linux/rcu_sync.h + +.. kernel-doc:: kernel/rcu/sync.c + +.. kernel-doc:: kernel/rcu/tasks.h + +.. kernel-doc:: kernel/rcu/tree_stall.h + +.. kernel-doc:: include/linux/rcupdate_trace.h + +.. kernel-doc:: include/linux/rcupdate_wait.h + +.. kernel-doc:: include/linux/rcuref.h + +.. kernel-doc:: include/linux/rcutree.h diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst new file mode 100644 index 0000000000..7310247310 --- /dev/null +++ b/Documentation/core-api/kobject.rst @@ -0,0 +1,434 @@ +===================================================================== +Everything you never wanted to know about kobjects, ksets, and ktypes +===================================================================== + +:Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org> +:Last updated: December 19, 2007 + +Based on an original article by Jon Corbet for lwn.net written October 1, +2003 and located at https://lwn.net/Articles/51437/ + +Part of the difficulty in understanding the driver model - and the kobject +abstraction upon which it is built - is that there is no obvious starting +place. Dealing with kobjects requires understanding a few different types, +all of which make reference to each other. In an attempt to make things +easier, we'll take a multi-pass approach, starting with vague terms and +adding detail as we go. To that end, here are some quick definitions of +some terms we will be working with. + + - A kobject is an object of type struct kobject. Kobjects have a name + and a reference count. A kobject also has a parent pointer (allowing + objects to be arranged into hierarchies), a specific type, and, + usually, a representation in the sysfs virtual filesystem. + + Kobjects are generally not interesting on their own; instead, they are + usually embedded within some other structure which contains the stuff + the code is really interested in. + + No structure should **EVER** have more than one kobject embedded within it. + If it does, the reference counting for the object is sure to be messed + up and incorrect, and your code will be buggy. So do not do this. + + - A ktype is the type of object that embeds a kobject. Every structure + that embeds a kobject needs a corresponding ktype. The ktype controls + what happens to the kobject when it is created and destroyed. + + - A kset is a group of kobjects. These kobjects can be of the same ktype + or belong to different ktypes. The kset is the basic container type for + collections of kobjects. Ksets contain their own kobjects, but you can + safely ignore that implementation detail as the kset core code handles + this kobject automatically. + + When you see a sysfs directory full of other directories, generally each + of those directories corresponds to a kobject in the same kset. + +We'll look at how to create and manipulate all of these types. A bottom-up +approach will be taken, so we'll go back to kobjects. + + +Embedding kobjects +================== + +It is rare for kernel code to create a standalone kobject, with one major +exception explained below. Instead, kobjects are used to control access to +a larger, domain-specific object. To this end, kobjects will be found +embedded in other structures. If you are used to thinking of things in +object-oriented terms, kobjects can be seen as a top-level, abstract class +from which other classes are derived. A kobject implements a set of +capabilities which are not particularly useful by themselves, but are +nice to have in other objects. The C language does not allow for the +direct expression of inheritance, so other techniques - such as structure +embedding - must be used. + +(As an aside, for those familiar with the kernel linked list implementation, +this is analogous as to how "list_head" structs are rarely useful on +their own, but are invariably found embedded in the larger objects of +interest.) + +So, for example, the UIO code in ``drivers/uio/uio.c`` has a structure that +defines the memory region associated with a uio device:: + + struct uio_map { + struct kobject kobj; + struct uio_mem *mem; + }; + +If you have a struct uio_map structure, finding its embedded kobject is +just a matter of using the kobj member. Code that works with kobjects will +often have the opposite problem, however: given a struct kobject pointer, +what is the pointer to the containing structure? You must avoid tricks +(such as assuming that the kobject is at the beginning of the structure) +and, instead, use the container_of() macro, found in ``<linux/kernel.h>``:: + + container_of(ptr, type, member) + +where: + + * ``ptr`` is the pointer to the embedded kobject, + * ``type`` is the type of the containing structure, and + * ``member`` is the name of the structure field to which ``pointer`` points. + +The return value from container_of() is a pointer to the corresponding +container type. So, for example, a pointer ``kp`` to a struct kobject +embedded **within** a struct uio_map could be converted to a pointer to the +**containing** uio_map structure with:: + + struct uio_map *u_map = container_of(kp, struct uio_map, kobj); + +For convenience, programmers often define a simple macro for **back-casting** +kobject pointers to the containing type. Exactly this happens in the +earlier ``drivers/uio/uio.c``, as you can see here:: + + struct uio_map { + struct kobject kobj; + struct uio_mem *mem; + }; + + #define to_map(map) container_of(map, struct uio_map, kobj) + +where the macro argument "map" is a pointer to the struct kobject in +question. That macro is subsequently invoked with:: + + struct uio_map *map = to_map(kobj); + + +Initialization of kobjects +========================== + +Code which creates a kobject must, of course, initialize that object. Some +of the internal fields are setup with a (mandatory) call to kobject_init():: + + void kobject_init(struct kobject *kobj, const struct kobj_type *ktype); + +The ktype is required for a kobject to be created properly, as every kobject +must have an associated kobj_type. After calling kobject_init(), to +register the kobject with sysfs, the function kobject_add() must be called:: + + int kobject_add(struct kobject *kobj, struct kobject *parent, + const char *fmt, ...); + +This sets up the parent of the kobject and the name for the kobject +properly. If the kobject is to be associated with a specific kset, +kobj->kset must be assigned before calling kobject_add(). If a kset is +associated with a kobject, then the parent for the kobject can be set to +NULL in the call to kobject_add() and then the kobject's parent will be the +kset itself. + +As the name of the kobject is set when it is added to the kernel, the name +of the kobject should never be manipulated directly. If you must change +the name of the kobject, call kobject_rename():: + + int kobject_rename(struct kobject *kobj, const char *new_name); + +kobject_rename() does not perform any locking or have a solid notion of +what names are valid so the caller must provide their own sanity checking +and serialization. + +There is a function called kobject_set_name() but that is legacy cruft and +is being removed. If your code needs to call this function, it is +incorrect and needs to be fixed. + +To properly access the name of the kobject, use the function +kobject_name():: + + const char *kobject_name(const struct kobject * kobj); + +There is a helper function to both initialize and add the kobject to the +kernel at the same time, called surprisingly enough kobject_init_and_add():: + + int kobject_init_and_add(struct kobject *kobj, const struct kobj_type *ktype, + struct kobject *parent, const char *fmt, ...); + +The arguments are the same as the individual kobject_init() and +kobject_add() functions described above. + + +Uevents +======= + +After a kobject has been registered with the kobject core, you need to +announce to the world that it has been created. This can be done with a +call to kobject_uevent():: + + int kobject_uevent(struct kobject *kobj, enum kobject_action action); + +Use the **KOBJ_ADD** action for when the kobject is first added to the kernel. +This should be done only after any attributes or children of the kobject +have been initialized properly, as userspace will instantly start to look +for them when this call happens. + +When the kobject is removed from the kernel (details on how to do that are +below), the uevent for **KOBJ_REMOVE** will be automatically created by the +kobject core, so the caller does not have to worry about doing that by +hand. + + +Reference counts +================ + +One of the key functions of a kobject is to serve as a reference counter +for the object in which it is embedded. As long as references to the object +exist, the object (and the code which supports it) must continue to exist. +The low-level functions for manipulating a kobject's reference counts are:: + + struct kobject *kobject_get(struct kobject *kobj); + void kobject_put(struct kobject *kobj); + +A successful call to kobject_get() will increment the kobject's reference +counter and return the pointer to the kobject. + +When a reference is released, the call to kobject_put() will decrement the +reference count and, possibly, free the object. Note that kobject_init() +sets the reference count to one, so the code which sets up the kobject will +need to do a kobject_put() eventually to release that reference. + +Because kobjects are dynamic, they must not be declared statically or on +the stack, but instead, always allocated dynamically. Future versions of +the kernel will contain a run-time check for kobjects that are created +statically and will warn the developer of this improper usage. + +If all that you want to use a kobject for is to provide a reference counter +for your structure, please use the struct kref instead; a kobject would be +overkill. For more information on how to use struct kref, please see the +file Documentation/core-api/kref.rst in the Linux kernel source tree. + + +Creating "simple" kobjects +========================== + +Sometimes all that a developer wants is a way to create a simple directory +in the sysfs hierarchy, and not have to mess with the whole complication of +ksets, show and store functions, and other details. This is the one +exception where a single kobject should be created. To create such an +entry, use the function:: + + struct kobject *kobject_create_and_add(const char *name, struct kobject *parent); + +This function will create a kobject and place it in sysfs in the location +underneath the specified parent kobject. To create simple attributes +associated with this kobject, use:: + + int sysfs_create_file(struct kobject *kobj, const struct attribute *attr); + +or:: + + int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp); + +Both types of attributes used here, with a kobject that has been created +with the kobject_create_and_add(), can be of type kobj_attribute, so no +special custom attribute is needed to be created. + +See the example module, ``samples/kobject/kobject-example.c`` for an +implementation of a simple kobject and attributes. + + + +ktypes and release methods +========================== + +One important thing still missing from the discussion is what happens to a +kobject when its reference count reaches zero. The code which created the +kobject generally does not know when that will happen; if it did, there +would be little point in using a kobject in the first place. Even +predictable object lifecycles become more complicated when sysfs is brought +in as other portions of the kernel can get a reference on any kobject that +is registered in the system. + +The end result is that a structure protected by a kobject cannot be freed +before its reference count goes to zero. The reference count is not under +the direct control of the code which created the kobject. So that code must +be notified asynchronously whenever the last reference to one of its +kobjects goes away. + +Once you registered your kobject via kobject_add(), you must never use +kfree() to free it directly. The only safe way is to use kobject_put(). It +is good practice to always use kobject_put() after kobject_init() to avoid +errors creeping in. + +This notification is done through a kobject's release() method. Usually +such a method has a form like:: + + void my_object_release(struct kobject *kobj) + { + struct my_object *mine = container_of(kobj, struct my_object, kobj); + + /* Perform any additional cleanup on this object, then... */ + kfree(mine); + } + +One important point cannot be overstated: every kobject must have a +release() method, and the kobject must persist (in a consistent state) +until that method is called. If these constraints are not met, the code is +flawed. Note that the kernel will warn you if you forget to provide a +release() method. Do not try to get rid of this warning by providing an +"empty" release function. + +If all your cleanup function needs to do is call kfree(), then you must +create a wrapper function which uses container_of() to upcast to the correct +type (as shown in the example above) and then calls kfree() on the overall +structure. + +Note, the name of the kobject is available in the release function, but it +must NOT be changed within this callback. Otherwise there will be a memory +leak in the kobject core, which makes people unhappy. + +Interestingly, the release() method is not stored in the kobject itself; +instead, it is associated with the ktype. So let us introduce struct +kobj_type:: + + struct kobj_type { + void (*release)(struct kobject *kobj); + const struct sysfs_ops *sysfs_ops; + const struct attribute_group **default_groups; + const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); + const void *(*namespace)(struct kobject *kobj); + void (*get_ownership)(struct kobject *kobj, kuid_t *uid, kgid_t *gid); + }; + +This structure is used to describe a particular type of kobject (or, more +correctly, of containing object). Every kobject needs to have an associated +kobj_type structure; a pointer to that structure must be specified when you +call kobject_init() or kobject_init_and_add(). + +The release field in struct kobj_type is, of course, a pointer to the +release() method for this type of kobject. The other two fields (sysfs_ops +and default_groups) control how objects of this type are represented in +sysfs; they are beyond the scope of this document. + +The default_groups pointer is a list of default attributes that will be +automatically created for any kobject that is registered with this ktype. + + +ksets +===== + +A kset is merely a collection of kobjects that want to be associated with +each other. There is no restriction that they be of the same ktype, but be +very careful if they are not. + +A kset serves these functions: + + - It serves as a bag containing a group of objects. A kset can be used by + the kernel to track "all block devices" or "all PCI device drivers." + + - A kset is also a subdirectory in sysfs, where the associated kobjects + with the kset can show up. Every kset contains a kobject which can be + set up to be the parent of other kobjects; the top-level directories of + the sysfs hierarchy are constructed in this way. + + - Ksets can support the "hotplugging" of kobjects and influence how + uevent events are reported to user space. + +In object-oriented terms, "kset" is the top-level container class; ksets +contain their own kobject, but that kobject is managed by the kset code and +should not be manipulated by any other user. + +A kset keeps its children in a standard kernel linked list. Kobjects point +back to their containing kset via their kset field. In almost all cases, +the kobjects belonging to a kset have that kset (or, strictly, its embedded +kobject) in their parent. + +As a kset contains a kobject within it, it should always be dynamically +created and never declared statically or on the stack. To create a new +kset use:: + + struct kset *kset_create_and_add(const char *name, + const struct kset_uevent_ops *uevent_ops, + struct kobject *parent_kobj); + +When you are finished with the kset, call:: + + void kset_unregister(struct kset *k); + +to destroy it. This removes the kset from sysfs and decrements its reference +count. When the reference count goes to zero, the kset will be released. +Because other references to the kset may still exist, the release may happen +after kset_unregister() returns. + +An example of using a kset can be seen in the +``samples/kobject/kset-example.c`` file in the kernel tree. + +If a kset wishes to control the uevent operations of the kobjects +associated with it, it can use the struct kset_uevent_ops to handle it:: + + struct kset_uevent_ops { + int (* const filter)(struct kobject *kobj); + const char *(* const name)(struct kobject *kobj); + int (* const uevent)(struct kobject *kobj, struct kobj_uevent_env *env); + }; + + +The filter function allows a kset to prevent a uevent from being emitted to +userspace for a specific kobject. If the function returns 0, the uevent +will not be emitted. + +The name function will be called to override the default name of the kset +that the uevent sends to userspace. By default, the name will be the same +as the kset itself, but this function, if present, can override that name. + +The uevent function will be called when the uevent is about to be sent to +userspace to allow more environment variables to be added to the uevent. + +One might ask how, exactly, a kobject is added to a kset, given that no +functions which perform that function have been presented. The answer is +that this task is handled by kobject_add(). When a kobject is passed to +kobject_add(), its kset member should point to the kset to which the +kobject will belong. kobject_add() will handle the rest. + +If the kobject belonging to a kset has no parent kobject set, it will be +added to the kset's directory. Not all members of a kset do necessarily +live in the kset directory. If an explicit parent kobject is assigned +before the kobject is added, the kobject is registered with the kset, but +added below the parent kobject. + + +Kobject removal +=============== + +After a kobject has been registered with the kobject core successfully, it +must be cleaned up when the code is finished with it. To do that, call +kobject_put(). By doing this, the kobject core will automatically clean up +all of the memory allocated by this kobject. If a ``KOBJ_ADD`` uevent has been +sent for the object, a corresponding ``KOBJ_REMOVE`` uevent will be sent, and +any other sysfs housekeeping will be handled for the caller properly. + +If you need to do a two-stage delete of the kobject (say you are not +allowed to sleep when you need to destroy the object), then call +kobject_del() which will unregister the kobject from sysfs. This makes the +kobject "invisible", but it is not cleaned up, and the reference count of +the object is still the same. At a later time call kobject_put() to finish +the cleanup of the memory associated with the kobject. + +kobject_del() can be used to drop the reference to the parent object, if +circular references are constructed. It is valid in some cases, that a +parent objects references a child. Circular references _must_ be broken +with an explicit call to kobject_del(), so that a release functions will be +called, and the objects in the former circle release each other. + + +Example code to copy from +========================= + +For a more complete example of using ksets and kobjects properly, see the +example programs ``samples/kobject/{kobject-example.c,kset-example.c}``, +which will be built as loadable modules if you select ``CONFIG_SAMPLE_KOBJECT``. diff --git a/Documentation/core-api/kref.rst b/Documentation/core-api/kref.rst new file mode 100644 index 0000000000..c61eea6f1b --- /dev/null +++ b/Documentation/core-api/kref.rst @@ -0,0 +1,323 @@ +=================================================== +Adding reference counters (krefs) to kernel objects +=================================================== + +:Author: Corey Minyard <minyard@acm.org> +:Author: Thomas Hellstrom <thellstrom@vmware.com> + +A lot of this was lifted from Greg Kroah-Hartman's 2004 OLS paper and +presentation on krefs, which can be found at: + + - http://www.kroah.com/linux/talks/ols_2004_kref_paper/Reprint-Kroah-Hartman-OLS2004.pdf + - http://www.kroah.com/linux/talks/ols_2004_kref_talk/ + +Introduction +============ + +krefs allow you to add reference counters to your objects. If you +have objects that are used in multiple places and passed around, and +you don't have refcounts, your code is almost certainly broken. If +you want refcounts, krefs are the way to go. + +To use a kref, add one to your data structures like:: + + struct my_data + { + . + . + struct kref refcount; + . + . + }; + +The kref can occur anywhere within the data structure. + +Initialization +============== + +You must initialize the kref after you allocate it. To do this, call +kref_init as so:: + + struct my_data *data; + + data = kmalloc(sizeof(*data), GFP_KERNEL); + if (!data) + return -ENOMEM; + kref_init(&data->refcount); + +This sets the refcount in the kref to 1. + +Kref rules +========== + +Once you have an initialized kref, you must follow the following +rules: + +1) If you make a non-temporary copy of a pointer, especially if + it can be passed to another thread of execution, you must + increment the refcount with kref_get() before passing it off:: + + kref_get(&data->refcount); + + If you already have a valid pointer to a kref-ed structure (the + refcount cannot go to zero) you may do this without a lock. + +2) When you are done with a pointer, you must call kref_put():: + + kref_put(&data->refcount, data_release); + + If this is the last reference to the pointer, the release + routine will be called. If the code never tries to get + a valid pointer to a kref-ed structure without already + holding a valid pointer, it is safe to do this without + a lock. + +3) If the code attempts to gain a reference to a kref-ed structure + without already holding a valid pointer, it must serialize access + where a kref_put() cannot occur during the kref_get(), and the + structure must remain valid during the kref_get(). + +For example, if you allocate some data and then pass it to another +thread to process:: + + void data_release(struct kref *ref) + { + struct my_data *data = container_of(ref, struct my_data, refcount); + kfree(data); + } + + void more_data_handling(void *cb_data) + { + struct my_data *data = cb_data; + . + . do stuff with data here + . + kref_put(&data->refcount, data_release); + } + + int my_data_handler(void) + { + int rv = 0; + struct my_data *data; + struct task_struct *task; + data = kmalloc(sizeof(*data), GFP_KERNEL); + if (!data) + return -ENOMEM; + kref_init(&data->refcount); + + kref_get(&data->refcount); + task = kthread_run(more_data_handling, data, "more_data_handling"); + if (task == ERR_PTR(-ENOMEM)) { + rv = -ENOMEM; + kref_put(&data->refcount, data_release); + goto out; + } + + . + . do stuff with data here + . + out: + kref_put(&data->refcount, data_release); + return rv; + } + +This way, it doesn't matter what order the two threads handle the +data, the kref_put() handles knowing when the data is not referenced +any more and releasing it. The kref_get() does not require a lock, +since we already have a valid pointer that we own a refcount for. The +put needs no lock because nothing tries to get the data without +already holding a pointer. + +In the above example, kref_put() will be called 2 times in both success +and error paths. This is necessary because the reference count got +incremented 2 times by kref_init() and kref_get(). + +Note that the "before" in rule 1 is very important. You should never +do something like:: + + task = kthread_run(more_data_handling, data, "more_data_handling"); + if (task == ERR_PTR(-ENOMEM)) { + rv = -ENOMEM; + goto out; + } else + /* BAD BAD BAD - get is after the handoff */ + kref_get(&data->refcount); + +Don't assume you know what you are doing and use the above construct. +First of all, you may not know what you are doing. Second, you may +know what you are doing (there are some situations where locking is +involved where the above may be legal) but someone else who doesn't +know what they are doing may change the code or copy the code. It's +bad style. Don't do it. + +There are some situations where you can optimize the gets and puts. +For instance, if you are done with an object and enqueuing it for +something else or passing it off to something else, there is no reason +to do a get then a put:: + + /* Silly extra get and put */ + kref_get(&obj->ref); + enqueue(obj); + kref_put(&obj->ref, obj_cleanup); + +Just do the enqueue. A comment about this is always welcome:: + + enqueue(obj); + /* We are done with obj, so we pass our refcount off + to the queue. DON'T TOUCH obj AFTER HERE! */ + +The last rule (rule 3) is the nastiest one to handle. Say, for +instance, you have a list of items that are each kref-ed, and you wish +to get the first one. You can't just pull the first item off the list +and kref_get() it. That violates rule 3 because you are not already +holding a valid pointer. You must add a mutex (or some other lock). +For instance:: + + static DEFINE_MUTEX(mutex); + static LIST_HEAD(q); + struct my_data + { + struct kref refcount; + struct list_head link; + }; + + static struct my_data *get_entry() + { + struct my_data *entry = NULL; + mutex_lock(&mutex); + if (!list_empty(&q)) { + entry = container_of(q.next, struct my_data, link); + kref_get(&entry->refcount); + } + mutex_unlock(&mutex); + return entry; + } + + static void release_entry(struct kref *ref) + { + struct my_data *entry = container_of(ref, struct my_data, refcount); + + list_del(&entry->link); + kfree(entry); + } + + static void put_entry(struct my_data *entry) + { + mutex_lock(&mutex); + kref_put(&entry->refcount, release_entry); + mutex_unlock(&mutex); + } + +The kref_put() return value is useful if you do not want to hold the +lock during the whole release operation. Say you didn't want to call +kfree() with the lock held in the example above (since it is kind of +pointless to do so). You could use kref_put() as follows:: + + static void release_entry(struct kref *ref) + { + /* All work is done after the return from kref_put(). */ + } + + static void put_entry(struct my_data *entry) + { + mutex_lock(&mutex); + if (kref_put(&entry->refcount, release_entry)) { + list_del(&entry->link); + mutex_unlock(&mutex); + kfree(entry); + } else + mutex_unlock(&mutex); + } + +This is really more useful if you have to call other routines as part +of the free operations that could take a long time or might claim the +same lock. Note that doing everything in the release routine is still +preferred as it is a little neater. + +The above example could also be optimized using kref_get_unless_zero() in +the following way:: + + static struct my_data *get_entry() + { + struct my_data *entry = NULL; + mutex_lock(&mutex); + if (!list_empty(&q)) { + entry = container_of(q.next, struct my_data, link); + if (!kref_get_unless_zero(&entry->refcount)) + entry = NULL; + } + mutex_unlock(&mutex); + return entry; + } + + static void release_entry(struct kref *ref) + { + struct my_data *entry = container_of(ref, struct my_data, refcount); + + mutex_lock(&mutex); + list_del(&entry->link); + mutex_unlock(&mutex); + kfree(entry); + } + + static void put_entry(struct my_data *entry) + { + kref_put(&entry->refcount, release_entry); + } + +Which is useful to remove the mutex lock around kref_put() in put_entry(), but +it's important that kref_get_unless_zero is enclosed in the same critical +section that finds the entry in the lookup table, +otherwise kref_get_unless_zero may reference already freed memory. +Note that it is illegal to use kref_get_unless_zero without checking its +return value. If you are sure (by already having a valid pointer) that +kref_get_unless_zero() will return true, then use kref_get() instead. + +Krefs and RCU +============= + +The function kref_get_unless_zero also makes it possible to use rcu +locking for lookups in the above example:: + + struct my_data + { + struct rcu_head rhead; + . + struct kref refcount; + . + . + }; + + static struct my_data *get_entry_rcu() + { + struct my_data *entry = NULL; + rcu_read_lock(); + if (!list_empty(&q)) { + entry = container_of(q.next, struct my_data, link); + if (!kref_get_unless_zero(&entry->refcount)) + entry = NULL; + } + rcu_read_unlock(); + return entry; + } + + static void release_entry_rcu(struct kref *ref) + { + struct my_data *entry = container_of(ref, struct my_data, refcount); + + mutex_lock(&mutex); + list_del_rcu(&entry->link); + mutex_unlock(&mutex); + kfree_rcu(entry, rhead); + } + + static void put_entry(struct my_data *entry) + { + kref_put(&entry->refcount, release_entry_rcu); + } + +But note that the struct kref member needs to remain in valid memory for a +rcu grace period after release_entry_rcu was called. That can be accomplished +by using kfree_rcu(entry, rhead) as done above, or by calling synchronize_rcu() +before using kfree, but note that synchronize_rcu() may sleep for a +substantial amount of time. diff --git a/Documentation/core-api/librs.rst b/Documentation/core-api/librs.rst new file mode 100644 index 0000000000..6010f5bc5b --- /dev/null +++ b/Documentation/core-api/librs.rst @@ -0,0 +1,212 @@ +========================================== +Reed-Solomon Library Programming Interface +========================================== + +:Author: Thomas Gleixner + +Introduction +============ + +The generic Reed-Solomon Library provides encoding, decoding and error +correction functions. + +Reed-Solomon codes are used in communication and storage applications to +ensure data integrity. + +This documentation is provided for developers who want to utilize the +functions provided by the library. + +Known Bugs And Assumptions +========================== + +None. + +Usage +===== + +This chapter provides examples of how to use the library. + +Initializing +------------ + +The init function init_rs returns a pointer to an rs decoder structure, +which holds the necessary information for encoding, decoding and error +correction with the given polynomial. It either uses an existing +matching decoder or creates a new one. On creation all the lookup tables +for fast en/decoding are created. The function may take a while, so make +sure not to call it in critical code paths. + +:: + + /* the Reed Solomon control structure */ + static struct rs_control *rs_decoder; + + /* Symbolsize is 10 (bits) + * Primitive polynomial is x^10+x^3+1 + * first consecutive root is 0 + * primitive element to generate roots = 1 + * generator polynomial degree (number of roots) = 6 + */ + rs_decoder = init_rs (10, 0x409, 0, 1, 6); + + +Encoding +-------- + +The encoder calculates the Reed-Solomon code over the given data length +and stores the result in the parity buffer. Note that the parity buffer +must be initialized before calling the encoder. + +The expanded data can be inverted on the fly by providing a non-zero +inversion mask. The expanded data is XOR'ed with the mask. This is used +e.g. for FLASH ECC, where the all 0xFF is inverted to an all 0x00. The +Reed-Solomon code for all 0x00 is all 0x00. The code is inverted before +storing to FLASH so it is 0xFF too. This prevents that reading from an +erased FLASH results in ECC errors. + +The databytes are expanded to the given symbol size on the fly. There is +no support for encoding continuous bitstreams with a symbol size != 8 at +the moment. If it is necessary it should be not a big deal to implement +such functionality. + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6]; + /* Initialize the parity buffer */ + memset(par, 0, sizeof(par)); + /* Encode 512 byte in data8. Store parity in buffer par */ + encode_rs8 (rs_decoder, data8, 512, par, 0); + + +Decoding +-------- + +The decoder calculates the syndrome over the given data length and the +received parity symbols and corrects errors in the data. + +If a syndrome is available from a hardware decoder then the syndrome +calculation is skipped. + +The correction of the data buffer can be suppressed by providing a +correction pattern buffer and an error location buffer to the decoder. +The decoder stores the calculated error location and the correction +bitmask in the given buffers. This is useful for hardware decoders which +use a weird bit ordering scheme. + +The databytes are expanded to the given symbol size on the fly. There is +no support for decoding continuous bitstreams with a symbolsize != 8 at +the moment. If it is necessary it should be not a big deal to implement +such functionality. + +Decoding with syndrome calculation, direct data correction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6]; + uint8_t data[512]; + int numerr; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, data8, par, 512, NULL, 0, NULL, 0, NULL); + + +Decoding with syndrome given by hardware decoder, direct data correction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6], syn[6]; + uint8_t data[512]; + int numerr; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Get syndrome from hardware decoder */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, data8, par, 512, syn, 0, NULL, 0, NULL); + + +Decoding with syndrome given by hardware decoder, no direct data correction. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Note: It's not necessary to give data and received parity to the +decoder. + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6], syn[6], corr[8]; + uint8_t data[512]; + int numerr, errpos[8]; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Get syndrome from hardware decoder */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, NULL, NULL, 512, syn, 0, errpos, 0, corr); + for (i = 0; i < numerr; i++) { + do_error_correction_in_your_buffer(errpos[i], corr[i]); + } + + +Cleanup +------- + +The function free_rs frees the allocated resources, if the caller is +the last user of the decoder. + +:: + + /* Release resources */ + free_rs(rs_decoder); + + +Structures +========== + +This chapter contains the autogenerated documentation of the structures +which are used in the Reed-Solomon Library and are relevant for a +developer. + +.. kernel-doc:: include/linux/rslib.h + :internal: + +Public Functions Provided +========================= + +This chapter contains the autogenerated documentation of the +Reed-Solomon functions which are exported. + +.. kernel-doc:: lib/reed_solomon/reed_solomon.c + :export: + +Credits +======= + +The library code for encoding and decoding was written by Phil Karn. + +:: + + Copyright 2002, Phil Karn, KA9Q + May be used under the terms of the GNU General Public License (GPL) + + +The wrapper functions and interfaces are written by Thomas Gleixner. + +Many users have provided bugfixes, improvements and helping hands for +testing. Thanks a lot. + +The following people have contributed to this document: + +Thomas Gleixner\ tglx@linutronix.de diff --git a/Documentation/core-api/local_ops.rst b/Documentation/core-api/local_ops.rst new file mode 100644 index 0000000000..0b42ceaaf3 --- /dev/null +++ b/Documentation/core-api/local_ops.rst @@ -0,0 +1,202 @@ + +.. _local_ops: + +================================================= +Semantics and Behavior of Local Atomic Operations +================================================= + +:Author: Mathieu Desnoyers + + +This document explains the purpose of the local atomic operations, how +to implement them for any given architecture and shows how they can be used +properly. It also stresses on the precautions that must be taken when reading +those local variables across CPUs when the order of memory writes matters. + +.. note:: + + Note that ``local_t`` based operations are not recommended for general + kernel use. Please use the ``this_cpu`` operations instead unless there is + really a special purpose. Most uses of ``local_t`` in the kernel have been + replaced by ``this_cpu`` operations. ``this_cpu`` operations combine the + relocation with the ``local_t`` like semantics in a single instruction and + yield more compact and faster executing code. + + +Purpose of local atomic operations +================================== + +Local atomic operations are meant to provide fast and highly reentrant per CPU +counters. They minimize the performance cost of standard atomic operations by +removing the LOCK prefix and memory barriers normally required to synchronize +across CPUs. + +Having fast per CPU atomic counters is interesting in many cases: it does not +require disabling interrupts to protect from interrupt handlers and it permits +coherent counters in NMI handlers. It is especially useful for tracing purposes +and for various performance monitoring counters. + +Local atomic operations only guarantee variable modification atomicity wrt the +CPU which owns the data. Therefore, care must taken to make sure that only one +CPU writes to the ``local_t`` data. This is done by using per cpu data and +making sure that we modify it from within a preemption safe context. It is +however permitted to read ``local_t`` data from any CPU: it will then appear to +be written out of order wrt other memory writes by the owner CPU. + + +Implementation for a given architecture +======================================= + +It can be done by slightly modifying the standard atomic operations: only +their UP variant must be kept. It typically means removing LOCK prefix (on +i386 and x86_64) and any SMP synchronization barrier. If the architecture does +not have a different behavior between SMP and UP, including +``asm-generic/local.h`` in your architecture's ``local.h`` is sufficient. + +The ``local_t`` type is defined as an opaque ``signed long`` by embedding an +``atomic_long_t`` inside a structure. This is made so a cast from this type to +a ``long`` fails. The definition looks like:: + + typedef struct { atomic_long_t a; } local_t; + + +Rules to follow when using local atomic operations +================================================== + +* Variables touched by local ops must be per cpu variables. +* *Only* the CPU owner of these variables must write to them. +* This CPU can use local ops from any context (process, irq, softirq, nmi, ...) + to update its ``local_t`` variables. +* Preemption (or interrupts) must be disabled when using local ops in + process context to make sure the process won't be migrated to a + different CPU between getting the per-cpu variable and doing the + actual local op. +* When using local ops in interrupt context, no special care must be + taken on a mainline kernel, since they will run on the local CPU with + preemption already disabled. I suggest, however, to explicitly + disable preemption anyway to make sure it will still work correctly on + -rt kernels. +* Reading the local cpu variable will provide the current copy of the + variable. +* Reads of these variables can be done from any CPU, because updates to + "``long``", aligned, variables are always atomic. Since no memory + synchronization is done by the writer CPU, an outdated copy of the + variable can be read when reading some *other* cpu's variables. + + +How to use local atomic operations +================================== + +:: + + #include <linux/percpu.h> + #include <asm/local.h> + + static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0); + + +Counting +======== + +Counting is done on all the bits of a signed long. + +In preemptible context, use ``get_cpu_var()`` and ``put_cpu_var()`` around +local atomic operations: it makes sure that preemption is disabled around write +access to the per cpu variable. For instance:: + + local_inc(&get_cpu_var(counters)); + put_cpu_var(counters); + +If you are already in a preemption-safe context, you can use +``this_cpu_ptr()`` instead:: + + local_inc(this_cpu_ptr(&counters)); + + + +Reading the counters +==================== + +Those local counters can be read from foreign CPUs to sum the count. Note that +the data seen by local_read across CPUs must be considered to be out of order +relatively to other memory writes happening on the CPU that owns the data:: + + long sum = 0; + for_each_online_cpu(cpu) + sum += local_read(&per_cpu(counters, cpu)); + +If you want to use a remote local_read to synchronize access to a resource +between CPUs, explicit ``smp_wmb()`` and ``smp_rmb()`` memory barriers must be used +respectively on the writer and the reader CPUs. It would be the case if you use +the ``local_t`` variable as a counter of bytes written in a buffer: there should +be a ``smp_wmb()`` between the buffer write and the counter increment and also a +``smp_rmb()`` between the counter read and the buffer read. + + +Here is a sample module which implements a basic per cpu counter using +``local.h``:: + + /* test-local.c + * + * Sample module for local.h usage. + */ + + + #include <asm/local.h> + #include <linux/module.h> + #include <linux/timer.h> + + static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0); + + static struct timer_list test_timer; + + /* IPI called on each CPU. */ + static void test_each(void *info) + { + /* Increment the counter from a non preemptible context */ + printk("Increment on cpu %d\n", smp_processor_id()); + local_inc(this_cpu_ptr(&counters)); + + /* This is what incrementing the variable would look like within a + * preemptible context (it disables preemption) : + * + * local_inc(&get_cpu_var(counters)); + * put_cpu_var(counters); + */ + } + + static void do_test_timer(unsigned long data) + { + int cpu; + + /* Increment the counters */ + on_each_cpu(test_each, NULL, 1); + /* Read all the counters */ + printk("Counters read from CPU %d\n", smp_processor_id()); + for_each_online_cpu(cpu) { + printk("Read : CPU %d, count %ld\n", cpu, + local_read(&per_cpu(counters, cpu))); + } + mod_timer(&test_timer, jiffies + 1000); + } + + static int __init test_init(void) + { + /* initialize the timer that will increment the counter */ + timer_setup(&test_timer, do_test_timer, 0); + mod_timer(&test_timer, jiffies + 1); + + return 0; + } + + static void __exit test_exit(void) + { + timer_shutdown_sync(&test_timer); + } + + module_init(test_init); + module_exit(test_exit); + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Mathieu Desnoyers"); + MODULE_DESCRIPTION("Local Atomic Ops"); diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst new file mode 100644 index 0000000000..45defcf15d --- /dev/null +++ b/Documentation/core-api/maple_tree.rst @@ -0,0 +1,217 @@ +.. SPDX-License-Identifier: GPL-2.0+ + + +========== +Maple Tree +========== + +:Author: Liam R. Howlett + +Overview +======== + +The Maple Tree is a B-Tree data type which is optimized for storing +non-overlapping ranges, including ranges of size 1. The tree was designed to +be simple to use and does not require a user written search method. It +supports iterating over a range of entries and going to the previous or next +entry in a cache-efficient manner. The tree can also be put into an RCU-safe +mode of operation which allows reading and writing concurrently. Writers must +synchronize on a lock, which can be the default spinlock, or the user can set +the lock to an external lock of a different type. + +The Maple Tree maintains a small memory footprint and was designed to use +modern processor cache efficiently. The majority of the users will be able to +use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex +scenarios. The most important usage of the Maple Tree is the tracking of the +virtual memory areas. + +The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple +Tree reserves values with the bottom two bits set to '10' which are below 4096 +(ie 2, 6, 10 .. 4094) for internal use. If the entries may use reserved +entries then the users can convert the entries using xa_mk_value() and convert +them back by calling xa_to_value(). If the user needs to use a reserved +value, then the user can convert the value when using the +:ref:`maple-tree-advanced-api`, but are blocked by the normal API. + +The Maple Tree can also be configured to support searching for a gap of a given +size (or larger). + +Pre-allocating of nodes is also supported using the +:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a +successful store operation within a given +code segment when allocating cannot be done. Allocations of nodes are +relatively small at around 256 bytes. + +.. _maple-tree-normal-api: + +Normal API +========== + +Start by initialising a maple tree, either with DEFINE_MTREE() for statically +allocated maple trees or mt_init() for dynamically allocated ones. A +freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0`` +- ``ULONG_MAX``. There are currently two types of maple trees supported: the +allocation tree and the regular tree. The regular tree has a higher branching +factor for internal nodes. The allocation tree has a lower branching factor +but allows the user to search for a gap of a given size or larger from either +``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by +passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree. + +You can then set entries using mtree_store() or mtree_store_range(). +mtree_store() will overwrite any entry with the new entry and return 0 on +success or an error code otherwise. mtree_store_range() works in the same way +but takes a range. mtree_load() is used to retrieve the entry stored at a +given index. You can use mtree_erase() to erase an entire range by only +knowing one value within that range, or mtree_store() call with an entry of +NULL may be used to partially erase a range or many ranges at once. + +If you want to only store a new entry to a range (or index) if that range is +currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which +return -EEXIST if the range is not empty. + +You can search for an entry from an index upwards by using mt_find(). + +You can walk each entry within a range by calling mt_for_each(). You must +provide a temporary variable to store a cursor. If you want to walk each +element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If +the caller is going to hold the lock for the duration of the walk then it is +worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api` +section. + +Sometimes it is necessary to ensure the next call to store to a maple tree does +not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case. + +Finally, you can remove all entries from a maple tree by calling +mtree_destroy(). If the maple tree entries are pointers, you may wish to free +the entries first. + +Allocating Nodes +---------------- + +The allocations are handled by the internal tree code. See +:ref:`maple-tree-advanced-alloc` for other options. + +Locking +------- + +You do not have to worry about locking. See :ref:`maple-tree-advanced-locks` +for other options. + +The Maple Tree uses RCU and an internal spinlock to synchronise access: + +Takes RCU read lock: + * mtree_load() + * mt_find() + * mt_for_each() + * mt_next() + * mt_prev() + +Takes ma_lock internally: + * mtree_store() + * mtree_store_range() + * mtree_insert() + * mtree_insert_range() + * mtree_erase() + * mtree_destroy() + * mt_set_in_rcu() + * mt_clear_in_rcu() + +If you want to take advantage of the internal lock to protect the data +structures that you are storing in the Maple Tree, you can call mtree_lock() +before calling mtree_load(), then take a reference count on the object you +have found before calling mtree_unlock(). This will prevent stores from +removing the object from the tree between looking up the object and +incrementing the refcount. You can also use RCU to avoid dereferencing +freed memory, but an explanation of that is beyond the scope of this +document. + +.. _maple-tree-advanced-api: + +Advanced API +============ + +The advanced API offers more flexibility and better performance at the +cost of an interface which can be harder to use and has fewer safeguards. +You must take care of your own locking while using the advanced API. +You can use the ma_lock, RCU or an external lock for protection. +You can mix advanced and normal operations on the same array, as long +as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented +in terms of the advanced API. + +The advanced API is based around the ma_state, this is where the 'mas' +prefix originates. The ma_state struct keeps track of tree operations to make +life easier for both internal and external tree users. + +Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`. +Please see above. + +The maple state keeps track of the range start and end in mas->index and +mas->last, respectively. + +mas_walk() will walk the tree to the location of mas->index and set the +mas->index and mas->last according to the range for the entry. + +You can set entries using mas_store(). mas_store() will overwrite any entry +with the new entry and return the first existing entry that is overwritten. +The range is passed in as members of the maple state: index and last. + +You can use mas_erase() to erase an entire range by setting index and +last of the maple state to the desired range to erase. This will erase +the first range that is found in that range, set the maple state index +and last as the range that was erased and return the entry that existed +at that location. + +You can walk each entry within a range by using mas_for_each(). If you want +to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as +the range. If the lock needs to be periodically dropped, see the locking +section mas_pause(). + +Using a maple state allows mas_next() and mas_prev() to function as if the +tree was a linked list. With such a high branching factor the amortized +performance penalty is outweighed by cache optimization. mas_next() will +return the next entry which occurs after the entry at index. mas_prev() +will return the previous entry which occurs before the entry at index. + +mas_find() will find the first entry which exists at or above index on +the first call, and the next entry from every subsequent calls. + +mas_find_rev() will find the fist entry which exists at or below the last on +the first call, and the previous entry from every subsequent calls. + +If the user needs to yield the lock during an operation, then the maple state +must be paused using mas_pause(). + +There are a few extra interfaces provided when using an allocation tree. +If you wish to search for a gap within a range, then mas_empty_area() +or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap +starting at the lowest index given up to the maximum of the range. +mas_empty_area_rev() searches for a gap starting at the highest index given +and continues downward to the lower bound of the range. + +.. _maple-tree-advanced-alloc: + +Advanced Allocating Nodes +------------------------- + +Allocations are usually handled internally to the tree, however if allocations +need to occur before a write occurs then calling mas_expected_entries() will +allocate the worst-case number of needed nodes to insert the provided number of +ranges. This also causes the tree to enter mass insertion mode. Once +insertions are complete calling mas_destroy() on the maple state will free the +unused allocations. + +.. _maple-tree-advanced-locks: + +Advanced Locking +---------------- + +The maple tree uses a spinlock by default, but external locks can be used for +tree updates as well. To use an external lock, the tree must be initialized +with the ``MT_FLAGS_LOCK_EXTERN flag``, this is usually done with the +MTREE_INIT_EXT() #define, which takes an external lock as an argument. + +Functions and structures +======================== + +.. kernel-doc:: include/linux/maple_tree.h +.. kernel-doc:: lib/maple_tree.c diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst new file mode 100644 index 0000000000..1c58d883b2 --- /dev/null +++ b/Documentation/core-api/memory-allocation.rst @@ -0,0 +1,185 @@ +.. _memory_allocation: + +======================= +Memory Allocation Guide +======================= + +Linux provides a variety of APIs for memory allocation. You can +allocate small chunks using `kmalloc` or `kmem_cache_alloc` families, +large virtually contiguous areas using `vmalloc` and its derivatives, +or you can directly request pages from the page allocator with +`alloc_pages`. It is also possible to use more specialized allocators, +for instance `cma_alloc` or `zs_malloc`. + +Most of the memory allocation APIs use GFP flags to express how that +memory should be allocated. The GFP acronym stands for "get free +pages", the underlying memory allocation function. + +Diversity of the allocation APIs combined with the numerous GFP flags +makes the question "How should I allocate memory?" not that easy to +answer, although very likely you should use + +:: + + kzalloc(<size>, GFP_KERNEL); + +Of course there are cases when other allocation APIs and different GFP +flags must be used. + +Get Free Page flags +=================== + +The GFP flags control the allocators behavior. They tell what memory +zones can be used, how hard the allocator should try to find free +memory, whether the memory can be accessed by the userspace etc. The +:ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>` provides +reference documentation for the GFP flags and their combinations and +here we briefly outline their recommended usage: + + * Most of the time ``GFP_KERNEL`` is what you need. Memory for the + kernel data structures, DMAable memory, inode cache, all these and + many other allocations types can use ``GFP_KERNEL``. Note, that + using ``GFP_KERNEL`` implies ``GFP_RECLAIM``, which means that + direct reclaim may be triggered under memory pressure; the calling + context must be allowed to sleep. + * If the allocation is performed from an atomic context, e.g interrupt + handler, use ``GFP_NOWAIT``. This flag prevents direct reclaim and + IO or filesystem operations. Consequently, under memory pressure + ``GFP_NOWAIT`` allocation is likely to fail. Allocations which + have a reasonable fallback should be using ``GFP_NOWARN``. + * If you think that accessing memory reserves is justified and the kernel + will be stressed unless allocation succeeds, you may use ``GFP_ATOMIC``. + * Untrusted allocations triggered from userspace should be a subject + of kmem accounting and must have ``__GFP_ACCOUNT`` bit set. There + is the handy ``GFP_KERNEL_ACCOUNT`` shortcut for ``GFP_KERNEL`` + allocations that should be accounted. + * Userspace allocations should use either of the ``GFP_USER``, + ``GFP_HIGHUSER`` or ``GFP_HIGHUSER_MOVABLE`` flags. The longer + the flag name the less restrictive it is. + + ``GFP_HIGHUSER_MOVABLE`` does not require that allocated memory + will be directly accessible by the kernel and implies that the + data is movable. + + ``GFP_HIGHUSER`` means that the allocated memory is not movable, + but it is not required to be directly accessible by the kernel. An + example may be a hardware allocation that maps data directly into + userspace but has no addressing limitations. + + ``GFP_USER`` means that the allocated memory is not movable and it + must be directly accessible by the kernel. + +You may notice that quite a few allocations in the existing code +specify ``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to +prevent recursion deadlocks caused by direct memory reclaim calling +back into the FS or IO paths and blocking on already held +resources. Since 4.12 the preferred way to address this issue is to +use new scope APIs described in +:ref:`Documentation/core-api/gfp_mask-from-fs-io.rst <gfp_mask_from_fs_io>`. + +Other legacy GFP flags are ``GFP_DMA`` and ``GFP_DMA32``. They are +used to ensure that the allocated memory is accessible by hardware +with limited addressing capabilities. So unless you are writing a +driver for a device with such restrictions, avoid using these flags. +And even with hardware with restrictions it is preferable to use +`dma_alloc*` APIs. + +GFP flags and reclaim behavior +------------------------------ +Memory allocations may trigger direct or background reclaim and it is +useful to understand how hard the page allocator will try to satisfy that +or another request. + + * ``GFP_KERNEL & ~__GFP_RECLAIM`` - optimistic allocation without _any_ + attempt to free memory at all. The most light weight mode which even + doesn't kick the background reclaim. Should be used carefully because it + might deplete the memory and the next user might hit the more aggressive + reclaim. + + * ``GFP_KERNEL & ~__GFP_DIRECT_RECLAIM`` (or ``GFP_NOWAIT``)- optimistic + allocation without any attempt to free memory from the current + context but can wake kswapd to reclaim memory if the zone is below + the low watermark. Can be used from either atomic contexts or when + the request is a performance optimization and there is another + fallback for a slow path. + + * ``(GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM`` (aka ``GFP_ATOMIC``) - + non sleeping allocation with an expensive fallback so it can access + some portion of memory reserves. Usually used from interrupt/bottom-half + context with an expensive slow path fallback. + + * ``GFP_KERNEL`` - both background and direct reclaim are allowed and the + **default** page allocator behavior is used. That means that not costly + allocation requests are basically no-fail but there is no guarantee of + that behavior so failures have to be checked properly by callers + (e.g. OOM killer victim is allowed to fail currently). + + * ``GFP_KERNEL | __GFP_NORETRY`` - overrides the default allocator behavior + and all allocation requests fail early rather than cause disruptive + reclaim (one round of reclaim in this implementation). The OOM killer + is not invoked. + + * ``GFP_KERNEL | __GFP_RETRY_MAYFAIL`` - overrides the default allocator + behavior and all allocation requests try really hard. The request + will fail if the reclaim cannot make any progress. The OOM killer + won't be triggered. + + * ``GFP_KERNEL | __GFP_NOFAIL`` - overrides the default allocator behavior + and all allocation requests will loop endlessly until they succeed. + This might be really dangerous especially for larger orders. + +Selecting memory allocator +========================== + +The most straightforward way to allocate memory is to use a function +from the kmalloc() family. And, to be on the safe side it's best to use +routines that set memory to zero, like kzalloc(). If you need to +allocate memory for an array, there are kmalloc_array() and kcalloc() +helpers. The helpers struct_size(), array_size() and array3_size() can +be used to safely calculate object sizes without overflowing. + +The maximal size of a chunk that can be allocated with `kmalloc` is +limited. The actual limit depends on the hardware and the kernel +configuration, but it is a good practice to use `kmalloc` for objects +smaller than page size. + +The address of a chunk allocated with `kmalloc` is aligned to at least +ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the +alignment is also guaranteed to be at least the respective size. + +Chunks allocated with kmalloc() can be resized with krealloc(). Similarly +to kmalloc_array(): a helper for resizing arrays is provided in the form of +krealloc_array(). + +For large allocations you can use vmalloc() and vzalloc(), or directly +request pages from the page allocator. The memory allocated by `vmalloc` +and related functions is not physically contiguous. + +If you are not sure whether the allocation size is too large for +`kmalloc`, it is possible to use kvmalloc() and its derivatives. It will +try to allocate memory with `kmalloc` and if the allocation fails it +will be retried with `vmalloc`. There are restrictions on which GFP +flags can be used with `kvmalloc`; please see kvmalloc_node() reference +documentation. Note that `kvmalloc` may return memory that is not +physically contiguous. + +If you need to allocate many identical objects you can use the slab +cache allocator. The cache should be set up with kmem_cache_create() or +kmem_cache_create_usercopy() before it can be used. The second function +should be used if a part of the cache might be copied to the userspace. +After the cache is created kmem_cache_alloc() and its convenience +wrappers can allocate memory from that cache. + +When the allocated memory is no longer needed it must be freed. + +Objects allocated by `kmalloc` can be freed by `kfree` or `kvfree`. Objects +allocated by `kmem_cache_alloc` can be freed with `kmem_cache_free`, `kfree` +or `kvfree`, where the latter two might be more convenient thanks to not +needing the kmem_cache pointer. + +The same rules apply to _bulk and _rcu flavors of freeing functions. + +Memory allocated by `vmalloc` can be freed with `vfree` or `kvfree`. +Memory allocated by `kvmalloc` can be freed with `kvfree`. +Caches created by `kmem_cache_create` should be freed with +`kmem_cache_destroy` only after freeing all the allocated objects first. diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst new file mode 100644 index 0000000000..682259ee63 --- /dev/null +++ b/Documentation/core-api/memory-hotplug.rst @@ -0,0 +1,122 @@ +.. _memory_hotplug: + +============== +Memory hotplug +============== + +Memory hotplug event notifier +============================= + +Hotplugging events are sent to a notification queue. + +There are six types of notification defined in ``include/linux/memory.h``: + +MEM_GOING_ONLINE + Generated before new memory becomes available in order to be able to + prepare subsystems to handle memory. The page allocator is still unable + to allocate from the new memory. + +MEM_CANCEL_ONLINE + Generated if MEM_GOING_ONLINE fails. + +MEM_ONLINE + Generated when memory has successfully brought online. The callback may + allocate pages from the new memory. + +MEM_GOING_OFFLINE + Generated to begin the process of offlining memory. Allocations are no + longer possible from the memory but some of the memory to be offlined + is still in use. The callback can be used to free memory known to a + subsystem from the indicated memory block. + +MEM_CANCEL_OFFLINE + Generated if MEM_GOING_OFFLINE fails. Memory is available again from + the memory block that we attempted to offline. + +MEM_OFFLINE + Generated after offlining memory is complete. + +A callback routine can be registered by calling:: + + hotplug_memory_notifier(callback_func, priority) + +Callback functions with higher values of priority are called before callback +functions with lower values. + +A callback function must have the following prototype:: + + int callback_func( + struct notifier_block *self, unsigned long action, void *arg); + +The first argument of the callback function (self) is a pointer to the block +of the notifier chain that points to the callback function itself. +The second argument (action) is one of the event types described above. +The third argument (arg) passes a pointer of struct memory_notify:: + + struct memory_notify { + unsigned long start_pfn; + unsigned long nr_pages; + int status_change_nid_normal; + int status_change_nid; + } + +- start_pfn is start_pfn of online/offline memory. +- nr_pages is # of pages of online/offline memory. +- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask + is (will be) set/clear, if this is -1, then nodemask status is not changed. +- status_change_nid is set node id when N_MEMORY of nodemask is (will be) + set/clear. It means a new(memoryless) node gets new memory by online and a + node loses all memory. If this is -1, then nodemask status is not changed. + + If status_changed_nid* >= 0, callback should create/discard structures for the + node if necessary. + +The callback routine shall return one of the values +NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP +defined in ``include/linux/notifier.h`` + +NOTIFY_DONE and NOTIFY_OK have no effect on the further processing. + +NOTIFY_BAD is used as response to the MEM_GOING_ONLINE, MEM_GOING_OFFLINE, +MEM_ONLINE, or MEM_OFFLINE action to cancel hotplugging. It stops +further processing of the notification queue. + +NOTIFY_STOP stops further processing of the notification queue. + +Locking Internals +================= + +When adding/removing memory that uses memory block devices (i.e. ordinary RAM), +the device_hotplug_lock should be held to: + +- synchronize against online/offline requests (e.g. via sysfs). This way, memory + block devices can only be accessed (.online/.state attributes) by user + space once memory has been fully added. And when removing memory, we + know nobody is in critical sections. +- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC) + +Especially, there is a possible lock inversion that is avoided using +device_hotplug_lock when adding memory and user space tries to online that +memory faster than expected: + +- device_online() will first take the device_lock(), followed by + mem_hotplug_lock +- add_memory_resource() will first take the mem_hotplug_lock, followed by + the device_lock() (while creating the devices, during bus_add_device()). + +As the device is visible to user space before taking the device_lock(), this +can result in a lock inversion. + +onlining/offlining of memory should be done via device_online()/ +device_offline() - to make sure it is properly synchronized to actions +via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type) + +When adding/removing/onlining/offlining memory or adding/removing +heterogeneous/device memory, we should always hold the mem_hotplug_lock in +write mode to serialise memory hotplug (e.g. access to global/zone +variables). + +In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read +mode allows for a quite efficient get_online_mems/put_online_mems +implementation, so code accessing memory can protect from that memory +vanishing. diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst new file mode 100644 index 0000000000..2d091c873d --- /dev/null +++ b/Documentation/core-api/mm-api.rst @@ -0,0 +1,142 @@ +====================== +Memory Management APIs +====================== + +User Space Memory Access +======================== + +.. kernel-doc:: arch/x86/include/asm/uaccess.h + :internal: + +.. kernel-doc:: arch/x86/lib/usercopy_32.c + :export: + +.. kernel-doc:: mm/gup.c + :functions: get_user_pages_fast + +.. _mm-api-gfp-flags: + +Memory Allocation Controls +========================== + +.. kernel-doc:: include/linux/gfp_types.h + :doc: Page mobility and placement hints + +.. kernel-doc:: include/linux/gfp_types.h + :doc: Watermark modifiers + +.. kernel-doc:: include/linux/gfp_types.h + :doc: Reclaim modifiers + +.. kernel-doc:: include/linux/gfp_types.h + :doc: Useful GFP flag combinations + +The Slab Cache +============== + +.. kernel-doc:: include/linux/slab.h + :internal: + +.. kernel-doc:: mm/slab.c + :export: + +.. kernel-doc:: mm/slab_common.c + :export: + +.. kernel-doc:: mm/util.c + :functions: kfree_const kvmalloc_node kvfree + +Virtually Contiguous Mappings +============================= + +.. kernel-doc:: mm/vmalloc.c + :export: + +File Mapping and Page Cache +=========================== + +Filemap +------- + +.. kernel-doc:: mm/filemap.c + :export: + +Readahead +--------- + +.. kernel-doc:: mm/readahead.c + :doc: Readahead Overview + +.. kernel-doc:: mm/readahead.c + :export: + +Writeback +--------- + +.. kernel-doc:: mm/page-writeback.c + :export: + +Truncate +-------- + +.. kernel-doc:: mm/truncate.c + :export: + +.. kernel-doc:: include/linux/pagemap.h + :internal: + +Memory pools +============ + +.. kernel-doc:: mm/mempool.c + :export: + +DMA pools +========= + +.. kernel-doc:: mm/dmapool.c + :export: + +More Memory Management Functions +================================ + +.. kernel-doc:: mm/memory.c + :export: + +.. kernel-doc:: mm/page_alloc.c +.. kernel-doc:: mm/mempolicy.c +.. kernel-doc:: include/linux/mm_types.h + :internal: +.. kernel-doc:: include/linux/mm_inline.h +.. kernel-doc:: include/linux/page-flags.h +.. kernel-doc:: include/linux/mm.h + :internal: +.. kernel-doc:: include/linux/page_ref.h +.. kernel-doc:: include/linux/mmzone.h +.. kernel-doc:: mm/util.c + :functions: folio_mapping + +.. kernel-doc:: mm/rmap.c +.. kernel-doc:: mm/migrate.c +.. kernel-doc:: mm/mmap.c +.. kernel-doc:: mm/kmemleak.c +.. #kernel-doc:: mm/hmm.c (build warnings) +.. kernel-doc:: mm/memremap.c +.. kernel-doc:: mm/hugetlb.c +.. kernel-doc:: mm/swap.c +.. kernel-doc:: mm/zpool.c +.. kernel-doc:: mm/memcontrol.c +.. #kernel-doc:: mm/memory-tiers.c (build warnings) +.. kernel-doc:: mm/shmem.c +.. kernel-doc:: mm/migrate_device.c +.. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files) +.. kernel-doc:: mm/mapping_dirty_helpers.c +.. #kernel-doc:: mm/memory-failure.c (build warnings) +.. kernel-doc:: mm/percpu.c +.. kernel-doc:: mm/maccess.c +.. kernel-doc:: mm/vmscan.c +.. kernel-doc:: mm/memory_hotplug.c +.. kernel-doc:: mm/mmu_notifier.c +.. kernel-doc:: mm/balloon_compaction.c +.. kernel-doc:: mm/huge_memory.c +.. kernel-doc:: mm/io-mapping.c diff --git a/Documentation/core-api/netlink.rst b/Documentation/core-api/netlink.rst new file mode 100644 index 0000000000..9f692b02bf --- /dev/null +++ b/Documentation/core-api/netlink.rst @@ -0,0 +1,102 @@ +.. SPDX-License-Identifier: BSD-3-Clause + +.. _kernel_netlink: + +=================================== +Netlink notes for kernel developers +=================================== + +General guidance +================ + +Attribute enums +--------------- + +Older families often define "null" attributes and commands with value +of ``0`` and named ``unspec``. This is supported (``type: unused``) +but should be avoided in new families. The ``unspec`` enum values are +not used in practice, so just set the value of the first attribute to ``1``. + +Message enums +------------- + +Use the same command IDs for requests and replies. This makes it easier +to match them up, and we have plenty of ID space. + +Use separate command IDs for notifications. This makes it easier to +sort the notifications from replies (and present them to the user +application via a different API than replies). + +Answer requests +--------------- + +Older families do not reply to all of the commands, especially NEW / ADD +commands. User only gets information whether the operation succeeded or +not via the ACK. Try to find useful data to return. Once the command is +added whether it replies with a full message or only an ACK is uAPI and +cannot be changed. It's better to err on the side of replying. + +Specifically NEW and ADD commands should reply with information identifying +the created object such as the allocated object's ID (without having to +resort to using ``NLM_F_ECHO``). + +NLM_F_ECHO +---------- + +Make sure to pass the request info to genl_notify() to allow ``NLM_F_ECHO`` +to take effect. This is useful for programs that need precise feedback +from the kernel (for example for logging purposes). + +Support dump consistency +------------------------ + +If iterating over objects during dump may skip over objects or repeat +them - make sure to report dump inconsistency with ``NLM_F_DUMP_INTR``. +This is usually implemented by maintaining a generation id for the +structure and recording it in the ``seq`` member of struct netlink_callback. + +Netlink specification +===================== + +Documentation of the Netlink specification parts which are only relevant +to the kernel space. + +Globals +------- + +kernel-policy +~~~~~~~~~~~~~ + +Defines whether the kernel validation policy is ``global`` i.e. the same for all +operations of the family, defined for each operation individually - ``per-op``, +or separately for each operation and operation type (do vs dump) - ``split``. +New families should use ``per-op`` (default) to be able to narrow down the +attributes accepted by a specific command. + +checks +------ + +Documentation for the ``checks`` sub-sections of attribute specs. + +unterminated-ok +~~~~~~~~~~~~~~~ + +Accept strings without the null-termination (for legacy families only). +Switches from the ``NLA_NUL_STRING`` to ``NLA_STRING`` policy type. + +max-len +~~~~~~~ + +Defines max length for a binary or string attribute (corresponding +to the ``len`` member of struct nla_policy). For string attributes terminating +null character is not counted towards ``max-len``. + +The field may either be a literal integer value or a name of a defined +constant. String types may reduce the constant by one +(i.e. specify ``max-len: CONST - 1``) to reserve space for the terminating +character so implementations should recognize such pattern. + +min-len +~~~~~~~ + +Similar to ``max-len`` but defines minimum length. diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst new file mode 100644 index 0000000000..3ed13bc9a1 --- /dev/null +++ b/Documentation/core-api/packing.rst @@ -0,0 +1,166 @@ +================================================ +Generic bitfield packing and unpacking functions +================================================ + +Problem statement +----------------- + +When working with hardware, one has to choose between several approaches of +interfacing with it. +One can memory-map a pointer to a carefully crafted struct over the hardware +device's memory region, and access its fields as struct members (potentially +declared as bitfields). But writing code this way would make it less portable, +due to potential endianness mismatches between the CPU and the hardware device. +Additionally, one has to pay close attention when translating register +definitions from the hardware documentation into bit field indices for the +structs. Also, some hardware (typically networking equipment) tends to group +its register fields in ways that violate any reasonable word boundaries +(sometimes even 64 bit ones). This creates the inconvenience of having to +define "high" and "low" portions of register fields within the struct. +A more robust alternative to struct field definitions would be to extract the +required fields by shifting the appropriate number of bits. But this would +still not protect from endianness mismatches, except if all memory accesses +were performed byte-by-byte. Also the code can easily get cluttered, and the +high-level idea might get lost among the many bit shifts required. +Many drivers take the bit-shifting approach and then attempt to reduce the +clutter with tailored macros, but more often than not these macros take +shortcuts that still prevent the code from being truly portable. + +The solution +------------ + +This API deals with 2 basic operations: + + - Packing a CPU-usable number into a memory buffer (with hardware + constraints/quirks) + - Unpacking a memory buffer (which has hardware constraints/quirks) + into a CPU-usable number. + +The API offers an abstraction over said hardware constraints and quirks, +over CPU endianness and therefore between possible mismatches between +the two. + +The basic unit of these API functions is the u64. From the CPU's +perspective, bit 63 always means bit offset 7 of byte 7, albeit only +logically. The question is: where do we lay this bit out in memory? + +The following examples cover the memory layout of a packed u64 field. +The byte offsets in the packed buffer are always implicitly 0, 1, ... 7. +What the examples show is where the logical bytes and bits sit. + +1. Normally (no quirks), we would do it like this: + +:: + + 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 + 7 6 5 4 + 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 + 3 2 1 0 + +That is, the MSByte (7) of the CPU-usable u64 sits at memory offset 0, and the +LSByte (0) of the u64 sits at memory offset 7. +This corresponds to what most folks would regard to as "big endian", where +bit i corresponds to the number 2^i. This is also referred to in the code +comments as "logical" notation. + + +2. If QUIRK_MSB_ON_THE_RIGHT is set, we do it like this: + +:: + + 56 57 58 59 60 61 62 63 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47 32 33 34 35 36 37 38 39 + 7 6 5 4 + 24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 + 3 2 1 0 + +That is, QUIRK_MSB_ON_THE_RIGHT does not affect byte positioning, but +inverts bit offsets inside a byte. + + +3. If QUIRK_LITTLE_ENDIAN is set, we do it like this: + +:: + + 39 38 37 36 35 34 33 32 47 46 45 44 43 42 41 40 55 54 53 52 51 50 49 48 63 62 61 60 59 58 57 56 + 4 5 6 7 + 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 23 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24 + 0 1 2 3 + +Therefore, QUIRK_LITTLE_ENDIAN means that inside the memory region, every +byte from each 4-byte word is placed at its mirrored position compared to +the boundary of that word. + +4. If QUIRK_MSB_ON_THE_RIGHT and QUIRK_LITTLE_ENDIAN are both set, we do it + like this: + +:: + + 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 + 4 5 6 7 + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 + 0 1 2 3 + + +5. If just QUIRK_LSW32_IS_FIRST is set, we do it like this: + +:: + + 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 + 3 2 1 0 + 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 + 7 6 5 4 + +In this case the 8 byte memory region is interpreted as follows: first +4 bytes correspond to the least significant 4-byte word, next 4 bytes to +the more significant 4-byte word. + + +6. If QUIRK_LSW32_IS_FIRST and QUIRK_MSB_ON_THE_RIGHT are set, we do it like + this: + +:: + + 24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 + 3 2 1 0 + 56 57 58 59 60 61 62 63 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47 32 33 34 35 36 37 38 39 + 7 6 5 4 + + +7. If QUIRK_LSW32_IS_FIRST and QUIRK_LITTLE_ENDIAN are set, it looks like + this: + +:: + + 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 23 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24 + 0 1 2 3 + 39 38 37 36 35 34 33 32 47 46 45 44 43 42 41 40 55 54 53 52 51 50 49 48 63 62 61 60 59 58 57 56 + 4 5 6 7 + + +8. If QUIRK_LSW32_IS_FIRST, QUIRK_LITTLE_ENDIAN and QUIRK_MSB_ON_THE_RIGHT + are set, it looks like this: + +:: + + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 + 0 1 2 3 + 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 + 4 5 6 7 + + +We always think of our offsets as if there were no quirk, and we translate +them afterwards, before accessing the memory region. + +Intended use +------------ + +Drivers that opt to use this API first need to identify which of the above 3 +quirk combinations (for a total of 8) match what the hardware documentation +describes. Then they should wrap the packing() function, creating a new +xxx_packing() that calls it using the proper QUIRK_* one-hot bits set. + +The packing() function returns an int-encoded error code, which protects the +programmer against incorrect API use. The errors are not expected to occur +during runtime, therefore it is reasonable for xxx_packing() to return void +and simply swallow those errors. Optionally it can dump stack or print the +error description. diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst new file mode 100644 index 0000000000..05b73c6c10 --- /dev/null +++ b/Documentation/core-api/padata.rst @@ -0,0 +1,178 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +The padata parallel execution mechanism +======================================= + +:Date: May 2020 + +Padata is a mechanism by which the kernel can farm jobs out to be done in +parallel on multiple CPUs while optionally retaining their ordering. + +It was originally developed for IPsec, which needs to perform encryption and +decryption on large numbers of packets without reordering those packets. This +is currently the sole consumer of padata's serialized job support. + +Padata also supports multithreaded jobs, splitting up the job evenly while load +balancing and coordinating between threads. + +Running Serialized Jobs +======================= + +Initializing +------------ + +The first step in using padata to run serialized jobs is to set up a +padata_instance structure for overall control of how jobs are to be run:: + + #include <linux/padata.h> + + struct padata_instance *padata_alloc(const char *name); + +'name' simply identifies the instance. + +Then, complete padata initialization by allocating a padata_shell:: + + struct padata_shell *padata_alloc_shell(struct padata_instance *pinst); + +A padata_shell is used to submit a job to padata and allows a series of such +jobs to be serialized independently. A padata_instance may have one or more +padata_shells associated with it, each allowing a separate series of jobs. + +Modifying cpumasks +------------------ + +The CPUs used to run jobs can be changed in two ways, programmatically with +padata_set_cpumask() or via sysfs. The former is defined:: + + int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type, + cpumask_var_t cpumask); + +Here cpumask_type is one of PADATA_CPU_PARALLEL or PADATA_CPU_SERIAL, where a +parallel cpumask describes which processors will be used to execute jobs +submitted to this instance in parallel and a serial cpumask defines which +processors are allowed to be used as the serialization callback processor. +cpumask specifies the new cpumask to use. + +There may be sysfs files for an instance's cpumasks. For example, pcrypt's +live in /sys/kernel/pcrypt/<instance-name>. Within an instance's directory +there are two files, parallel_cpumask and serial_cpumask, and either cpumask +may be changed by echoing a bitmask into the file, for example:: + + echo f > /sys/kernel/pcrypt/pencrypt/parallel_cpumask + +Reading one of these files shows the user-supplied cpumask, which may be +different from the 'usable' cpumask. + +Padata maintains two pairs of cpumasks internally, the user-supplied cpumasks +and the 'usable' cpumasks. (Each pair consists of a parallel and a serial +cpumask.) The user-supplied cpumasks default to all possible CPUs on instance +allocation and may be changed as above. The usable cpumasks are always a +subset of the user-supplied cpumasks and contain only the online CPUs in the +user-supplied masks; these are the cpumasks padata actually uses. So it is +legal to supply a cpumask to padata that contains offline CPUs. Once an +offline CPU in the user-supplied cpumask comes online, padata is going to use +it. + +Changing the CPU masks are expensive operations, so it should not be done with +great frequency. + +Running A Job +------------- + +Actually submitting work to the padata instance requires the creation of a +padata_priv structure, which represents one job:: + + struct padata_priv { + /* Other stuff here... */ + void (*parallel)(struct padata_priv *padata); + void (*serial)(struct padata_priv *padata); + }; + +This structure will almost certainly be embedded within some larger +structure specific to the work to be done. Most of its fields are private to +padata, but the structure should be zeroed at initialisation time, and the +parallel() and serial() functions should be provided. Those functions will +be called in the process of getting the work done as we will see +momentarily. + +The submission of the job is done with:: + + int padata_do_parallel(struct padata_shell *ps, + struct padata_priv *padata, int *cb_cpu); + +The ps and padata structures must be set up as described above; cb_cpu +points to the preferred CPU to be used for the final callback when the job is +done; it must be in the current instance's CPU mask (if not the cb_cpu pointer +is updated to point to the CPU actually chosen). The return value from +padata_do_parallel() is zero on success, indicating that the job is in +progress. -EBUSY means that somebody, somewhere else is messing with the +instance's CPU mask, while -EINVAL is a complaint about cb_cpu not being in the +serial cpumask, no online CPUs in the parallel or serial cpumasks, or a stopped +instance. + +Each job submitted to padata_do_parallel() will, in turn, be passed to +exactly one call to the above-mentioned parallel() function, on one CPU, so +true parallelism is achieved by submitting multiple jobs. parallel() runs with +software interrupts disabled and thus cannot sleep. The parallel() +function gets the padata_priv structure pointer as its lone parameter; +information about the actual work to be done is probably obtained by using +container_of() to find the enclosing structure. + +Note that parallel() has no return value; the padata subsystem assumes that +parallel() will take responsibility for the job from this point. The job +need not be completed during this call, but, if parallel() leaves work +outstanding, it should be prepared to be called again with a new job before +the previous one completes. + +Serializing Jobs +---------------- + +When a job does complete, parallel() (or whatever function actually finishes +the work) should inform padata of the fact with a call to:: + + void padata_do_serial(struct padata_priv *padata); + +At some point in the future, padata_do_serial() will trigger a call to the +serial() function in the padata_priv structure. That call will happen on +the CPU requested in the initial call to padata_do_parallel(); it, too, is +run with local software interrupts disabled. +Note that this call may be deferred for a while since the padata code takes +pains to ensure that jobs are completed in the order in which they were +submitted. + +Destroying +---------- + +Cleaning up a padata instance predictably involves calling the two free +functions that correspond to the allocation in reverse:: + + void padata_free_shell(struct padata_shell *ps); + void padata_free(struct padata_instance *pinst); + +It is the user's responsibility to ensure all outstanding jobs are complete +before any of the above are called. + +Running Multithreaded Jobs +========================== + +A multithreaded job has a main thread and zero or more helper threads, with the +main thread participating in the job and then waiting until all helpers have +finished. padata splits the job into units called chunks, where a chunk is a +piece of the job that one thread completes in one call to the thread function. + +A user has to do three things to run a multithreaded job. First, describe the +job by defining a padata_mt_job structure, which is explained in the Interface +section. This includes a pointer to the thread function, which padata will +call each time it assigns a job chunk to a thread. Then, define the thread +function, which accepts three arguments, ``start``, ``end``, and ``arg``, where +the first two delimit the range that the thread operates on and the last is a +pointer to the job's shared state, if any. Prepare the shared state, which is +typically allocated on the main thread's stack. Last, call +padata_do_multithreaded(), which will return once the job is finished. + +Interface +========= + +.. kernel-doc:: include/linux/padata.h +.. kernel-doc:: kernel/padata.c diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst new file mode 100644 index 0000000000..d3c1f6d8c0 --- /dev/null +++ b/Documentation/core-api/pin_user_pages.rst @@ -0,0 +1,284 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================================== +pin_user_pages() and related calls +==================================================== + +.. contents:: :local: + +Overview +======== + +This document describes the following functions:: + + pin_user_pages() + pin_user_pages_fast() + pin_user_pages_remote() + +Basic description of FOLL_PIN +============================= + +FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*() +("gup") family of functions. FOLL_PIN has significant interactions and +interdependencies with FOLL_LONGTERM, so both are covered here. + +FOLL_PIN is internal to gup, meaning that it should not appear at the gup call +sites. This allows the associated wrapper functions (pin_user_pages*() and +others) to set the correct combination of these flags, and to check for problems +as well. + +FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites. +This is in order to avoid creating a large number of wrapper functions to cover +all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the +pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so +that's a natural dividing line, and a good point to make separate wrapper calls. +In other words, use pin_user_pages*() for DMA-pinned pages, and +get_user_pages*() for other cases. There are five cases described later on in +this document, to further clarify that concept. + +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, +multiple threads and call sites are free to pin the same struct pages, via both +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the +other, not the struct page(s). + +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN +uses a different reference counting technique. + +FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, +FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN. + +Which flags are set by each wrapper +=================================== + +For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup +flags the caller provides. The caller is required to pass in a non-null struct +pages* array, and the function then pins pages by incrementing each by a special +value: GUP_PIN_COUNTING_BIAS. + +For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead, +the extra space available in the struct folio is used to store the +pincount directly. + +This approach for large folios avoids the counting upper limit problems +that are discussed below. Those limitations would have been aggravated +severely by huge pages, because each tail page adds a refcount to the +head page. And in fact, testing revealed that, without a separate pincount +field, refcount overflows were seen in some huge page stress tests. + +This also means that huge pages and large folios do not suffer +from the false positives problem that is mentioned below.:: + + Function + -------- + pin_user_pages FOLL_PIN is always set internally by this function. + pin_user_pages_fast FOLL_PIN is always set internally by this function. + pin_user_pages_remote FOLL_PIN is always set internally by this function. + +For these get_user_pages*() functions, FOLL_GET might not even be specified. +Behavior is a little more complex than above. If FOLL_GET was *not* specified, +but the caller passed in a non-null struct pages* array, then the function +sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount +of each page by +1.:: + + Function + -------- + get_user_pages FOLL_GET is sometimes set internally by this function. + get_user_pages_fast FOLL_GET is sometimes set internally by this function. + get_user_pages_remote FOLL_GET is sometimes set internally by this function. + +Tracking dma-pinned pages +========================= + +Some of the key design constraints, and solutions, for tracking dma-pinned +pages: + +* An actual reference count, per struct page, is required. This is because + multiple processes may pin and unpin a page. + +* False positives (reporting that a page is dma-pinned, when in fact it is not) + are acceptable, but false negatives are not. + +* struct page may not be increased in size for this, and all fields are already + used. + +* Given the above, we can overload the page->_refcount field by using, sort of, + the upper bits in that field for a dma-pinned count. "Sort of", means that, + rather than dividing page->_refcount into bit fields, we simple add a medium- + large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to + page->_refcount. This provides fuzzy behavior: if a page has get_page() called + on it 1024 times, then it will appear to have a single dma-pinned count. + And again, that's acceptable. + +This also leads to limitations: there are only 31-10==21 bits available for a +counter that increments 10 bits at a time. + +* Because of that limitation, special handling is applied to the zero pages + when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its + refcount or pincount at all (it is permanent, so there's no need). The + unpinning functions also don't do anything to a zero page. This is + transparent to the caller. + +* Callers must specifically request "dma-pinned tracking of pages". In other + words, just calling get_user_pages() will not suffice; a new set of functions, + pin_user_page() and related, must be used. + +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags +========================================================== + +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing +these categories: + +CASE 1: Direct IO (DIO) +----------------------- +There are GUP references to pages that are serving +as DIO buffers. These buffers are needed for a relatively short time (so they +are not "long term"). No special synchronization with page_mkclean() or +munmap() is provided. Therefore, flags to set at the call site are: :: + + FOLL_PIN + +...but rather than setting FOLL_PIN directly, call sites should use one of +the pin_user_pages*() routines that set FOLL_PIN. + +CASE 2: RDMA +------------ +There are GUP references to pages that are serving as DMA +buffers. These buffers are needed for a long time ("long term"). No special +synchronization with page_mkclean() or munmap() is provided. Therefore, flags +to set at the call site are: :: + + FOLL_PIN | FOLL_LONGTERM + +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's +because DAX pages do not have a separate page cache, and so "pinning" implies +locking down file system blocks, which is not (yet) supported in that way. + +CASE 3: MMU notifier registration, with or without page faulting hardware +------------------------------------------------------------------------- +Device drivers can pin pages via get_user_pages*(), and register for mmu +notifier callbacks for the memory range. Then, upon receiving a notifier +"invalidate range" callback , stop the device from using the range, and unpin +the pages. There may be other possible schemes, such as for example explicitly +synchronizing against pending IO, that accomplish approximately the same thing. + +Or, if the hardware supports replayable page faults, then the device driver can +avoid pinning entirely (this is ideal), as follows: register for mmu notifier +callbacks as above, but instead of stopping the device and unpinning in the +callback, simply remove the range from the device's page tables. + +Either way, as long as the driver unpins the pages upon mmu notifier callback, +then there is proper synchronization with both filesystem and mm +(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set. + +CASE 4: Pinning for struct page manipulation only +------------------------------------------------- +If only struct page data (as opposed to the actual memory contents that a page +is tracking) is affected, then normal GUP calls are sufficient, and neither flag +needs to be set. + +CASE 5: Pinning in order to write to the data within the page +------------------------------------------------------------- +Even though neither DMA nor Direct IO is involved, just a simple case of "pin, +write to a page's data, unpin" can cause a problem. Case 5 may be considered a +superset of Case 1, plus Case 2, plus anything that invokes that pattern. In +other words, if the code is neither Case 1 nor Case 2, it may still require +FOLL_PIN, for patterns like this: + +Correct (uses FOLL_PIN calls): + pin_user_pages() + write to the data within the pages + unpin_user_pages() + +INCORRECT (uses FOLL_GET calls): + get_user_pages() + write to the data within the pages + put_page() + +page_maybe_dma_pinned(): the whole point of pinning +=================================================== + +The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able +to query, "is this page DMA-pinned?" That allows code such as page_mkclean() +(and file system writeback code in general) to make informed decisions about +what to do when a page cannot be unmapped due to such pins. + +What to do in those cases is the subject of a years-long series of discussions +and debates (see the References at the end of this document). It's a TODO item +here: fill in the details once that's worked out. Meanwhile, it's safe to say +that having this available: :: + + static inline bool page_maybe_dma_pinned(struct page *page) + +...is a prerequisite to solving the long-running gup+DMA problem. + +Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM +=================================================================== + +Another way of thinking about these flags is as a progression of restrictions: +FOLL_GET is for struct page manipulation, without affecting the data that the +struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for +short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is +a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more +restrictive case that has FOLL_PIN as a prerequisite: this is for pages that +will be pinned longterm, and whose data will be accessed. + +Unit testing +============ +This file:: + + tools/testing/selftests/mm/gup_test.c + +has the following new calls to exercise the new pin*() wrapper functions: + +* PIN_FAST_BENCHMARK (./gup_test -a) +* PIN_BASIC_TEST (./gup_test -b) + +You can monitor how many total dma-pinned pages have been acquired and released +since the system was booted, via two new /proc/vmstat entries: :: + + /proc/vmstat/nr_foll_pin_acquired + /proc/vmstat/nr_foll_pin_released + +Under normal conditions, these two values will be equal unless there are any +long-term [R]DMA pins in place, or during pin/unpin transitions. + +* nr_foll_pin_acquired: This is the number of logical pins that have been + acquired since the system was powered on. For huge pages, the head page is + pinned once for each page (head page and each tail page) within the huge page. + This follows the same sort of behavior that get_user_pages() uses for huge + pages: the head page is refcounted once for each tail or head page in the huge + page, when get_user_pages() is applied to a huge page. + +* nr_foll_pin_released: The number of logical pins that have been released since + the system was powered on. Note that pages are released (unpinned) on a + PAGE_SIZE granularity, even if the original pin was applied to a huge page. + Becaused of the pin count behavior described above in "nr_foll_pin_acquired", + the accounting balances out, so that after doing this:: + + pin_user_pages(huge_page); + for (each page in huge_page) + unpin_user_page(page); + +...the following is expected:: + + nr_foll_pin_released == nr_foll_pin_acquired + +(...unless it was already out of balance due to a long-term RDMA pin being in +place.) + +Other diagnostics +================= + +dump_page() has been enhanced slightly to handle these new counting +fields, and to better report on large folios in general. Specifically, +for large folios, the exact pincount is reported. + +References +========== + +* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_ +* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_ +* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_ +* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_ + +John Hubbard, October, 2019 diff --git a/Documentation/core-api/printk-basics.rst b/Documentation/core-api/printk-basics.rst new file mode 100644 index 0000000000..2dde24ca7d --- /dev/null +++ b/Documentation/core-api/printk-basics.rst @@ -0,0 +1,112 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +Message logging with printk +=========================== + +printk() is one of the most widely known functions in the Linux kernel. It's the +standard tool we have for printing messages and usually the most basic way of +tracing and debugging. If you're familiar with printf(3) you can tell printk() +is based on it, although it has some functional differences: + + - printk() messages can specify a log level. + + - the format string, while largely compatible with C99, doesn't follow the + exact same specification. It has some extensions and a few limitations + (no ``%n`` or floating point conversion specifiers). See :ref:`How to get + printk format specifiers right <printk-specifiers>`. + +All printk() messages are printed to the kernel log buffer, which is a ring +buffer exported to userspace through /dev/kmsg. The usual way to read it is +using ``dmesg``. + +printk() is typically used like this:: + + printk(KERN_INFO "Message: %s\n", arg); + +where ``KERN_INFO`` is the log level (note that it's concatenated to the format +string, the log level is not a separate argument). The available log levels are: + ++----------------+--------+-----------------------------------------------+ +| Name | String | Alias function | ++================+========+===============================================+ +| KERN_EMERG | "0" | pr_emerg() | ++----------------+--------+-----------------------------------------------+ +| KERN_ALERT | "1" | pr_alert() | ++----------------+--------+-----------------------------------------------+ +| KERN_CRIT | "2" | pr_crit() | ++----------------+--------+-----------------------------------------------+ +| KERN_ERR | "3" | pr_err() | ++----------------+--------+-----------------------------------------------+ +| KERN_WARNING | "4" | pr_warn() | ++----------------+--------+-----------------------------------------------+ +| KERN_NOTICE | "5" | pr_notice() | ++----------------+--------+-----------------------------------------------+ +| KERN_INFO | "6" | pr_info() | ++----------------+--------+-----------------------------------------------+ +| KERN_DEBUG | "7" | pr_debug() and pr_devel() if DEBUG is defined | ++----------------+--------+-----------------------------------------------+ +| KERN_DEFAULT | "" | | ++----------------+--------+-----------------------------------------------+ +| KERN_CONT | "c" | pr_cont() | ++----------------+--------+-----------------------------------------------+ + + +The log level specifies the importance of a message. The kernel decides whether +to show the message immediately (printing it to the current console) depending +on its log level and the current *console_loglevel* (a kernel variable). If the +message priority is higher (lower log level value) than the *console_loglevel* +the message will be printed to the console. + +If the log level is omitted, the message is printed with ``KERN_DEFAULT`` +level. + +You can check the current *console_loglevel* with:: + + $ cat /proc/sys/kernel/printk + 4 4 1 7 + +The result shows the *current*, *default*, *minimum* and *boot-time-default* log +levels. + +To change the current console_loglevel simply write the desired level to +``/proc/sys/kernel/printk``. For example, to print all messages to the console:: + + # echo 8 > /proc/sys/kernel/printk + +Another way, using ``dmesg``:: + + # dmesg -n 5 + +sets the console_loglevel to print KERN_WARNING (4) or more severe messages to +console. See ``dmesg(1)`` for more information. + +As an alternative to printk() you can use the ``pr_*()`` aliases for +logging. This family of macros embed the log level in the macro names. For +example:: + + pr_info("Info message no. %d\n", msg_num); + +prints a ``KERN_INFO`` message. + +Besides being more concise than the equivalent printk() calls, they can use a +common definition for the format string through the pr_fmt() macro. For +instance, defining this at the top of a source file (before any ``#include`` +directive):: + + #define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__ + +would prefix every pr_*() message in that file with the module and function name +that originated the message. + +For debugging purposes there are also two conditionally-compiled macros: +pr_debug() and pr_devel(), which are compiled-out unless ``DEBUG`` (or +also ``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined. + + +Function reference +================== + +.. kernel-doc:: include/linux/printk.h + :functions: printk pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info + pr_fmt pr_debug pr_devel pr_cont diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst new file mode 100644 index 0000000000..4451ef5019 --- /dev/null +++ b/Documentation/core-api/printk-formats.rst @@ -0,0 +1,651 @@ +========================================= +How to get printk format specifiers right +========================================= + +.. _printk-specifiers: + +:Author: Randy Dunlap <rdunlap@infradead.org> +:Author: Andrew Murray <amurray@mpc-data.co.uk> + + +Integer types +============= + +:: + + If variable is of Type, use printk format specifier: + ------------------------------------------------------------ + signed char %d or %hhx + unsigned char %u or %x + char %u or %x + short int %d or %hx + unsigned short int %u or %x + int %d or %x + unsigned int %u or %x + long %ld or %lx + unsigned long %lu or %lx + long long %lld or %llx + unsigned long long %llu or %llx + size_t %zu or %zx + ssize_t %zd or %zx + s8 %d or %hhx + u8 %u or %x + s16 %d or %hx + u16 %u or %x + s32 %d or %x + u32 %u or %x + s64 %lld or %llx + u64 %llu or %llx + + +If <type> is architecture-dependent for its size (e.g., cycles_t, tcflag_t) or +is dependent on a config option for its size (e.g., blk_status_t), use a format +specifier of its largest possible type and explicitly cast to it. + +Example:: + + printk("test: latency: %llu cycles\n", (unsigned long long)time); + +Reminder: sizeof() returns type size_t. + +The kernel's printf does not support %n. Floating point formats (%e, %f, +%g, %a) are also not recognized, for obvious reasons. Use of any +unsupported specifier or length qualifier results in a WARN and early +return from vsnprintf(). + +Pointer types +============= + +A raw pointer value may be printed with %p which will hash the address +before printing. The kernel also supports extended specifiers for printing +pointers of different types. + +Some of the extended specifiers print the data on the given address instead +of printing the address itself. In this case, the following error messages +might be printed instead of the unreachable information:: + + (null) data on plain NULL address + (efault) data on invalid address + (einval) invalid data on a valid address + +Plain Pointers +-------------- + +:: + + %p abcdef12 or 00000000abcdef12 + +Pointers printed without a specifier extension (i.e unadorned %p) are +hashed to prevent leaking information about the kernel memory layout. This +has the added benefit of providing a unique identifier. On 64-bit machines +the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it +gathers enough entropy. + +When possible, use specialised modifiers such as %pS or %pB (described below) +to avoid the need of providing an unhashed address that has to be interpreted +post-hoc. If not possible, and the aim of printing the address is to provide +more information for debugging, use %p and boot the kernel with the +``no_hash_pointers`` parameter during debugging, which will print all %p +addresses unmodified. If you *really* always want the unmodified address, see +%px below. + +If (and only if) you are printing addresses as a content of a virtual file in +e.g. procfs or sysfs (using e.g. seq_printf(), not printk()) read by a +userspace process, use the %pK modifier described below instead of %p or %px. + +Error Pointers +-------------- + +:: + + %pe -ENOSPC + +For printing error pointers (i.e. a pointer for which IS_ERR() is true) +as a symbolic error name. Error values for which no symbolic name is +known are printed in decimal, while a non-ERR_PTR passed as the +argument to %pe gets treated as ordinary %p. + +Symbols/Function Pointers +------------------------- + +:: + + %pS versatile_init+0x0/0x110 + %ps versatile_init + %pSR versatile_init+0x9/0x110 + (with __builtin_extract_return_addr() translation) + %pB prev_fn_of_versatile_init+0x88/0x88 + + +The ``S`` and ``s`` specifiers are used for printing a pointer in symbolic +format. They result in the symbol name with (S) or without (s) +offsets. If KALLSYMS are disabled then the symbol address is printed instead. + +The ``B`` specifier results in the symbol name with offsets and should be +used when printing stack backtraces. The specifier takes into +consideration the effect of compiler optimisations which may occur +when tail-calls are used and marked with the noreturn GCC attribute. + +If the pointer is within a module, the module name and optionally build ID is +printed after the symbol name with an extra ``b`` appended to the end of the +specifier. + +:: + + %pS versatile_init+0x0/0x110 [module_name] + %pSb versatile_init+0x0/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e] + %pSRb versatile_init+0x9/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e] + (with __builtin_extract_return_addr() translation) + %pBb prev_fn_of_versatile_init+0x88/0x88 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e] + +Probed Pointers from BPF / tracing +---------------------------------- + +:: + + %pks kernel string + %pus user string + +The ``k`` and ``u`` specifiers are used for printing prior probed memory from +either kernel memory (k) or user memory (u). The subsequent ``s`` specifier +results in printing a string. For direct use in regular vsnprintf() the (k) +and (u) annotation is ignored, however, when used out of BPF's bpf_trace_printk(), +for example, it reads the memory it is pointing to without faulting. + +Kernel Pointers +--------------- + +:: + + %pK 01234567 or 0123456789abcdef + +For printing kernel pointers which should be hidden from unprivileged +users. The behaviour of %pK depends on the kptr_restrict sysctl - see +Documentation/admin-guide/sysctl/kernel.rst for more details. + +This modifier is *only* intended when producing content of a file read by +userspace from e.g. procfs or sysfs, not for dmesg. Please refer to the +section about %p above for discussion about how to manage hashing pointers +in printk(). + +Unmodified Addresses +-------------------- + +:: + + %px 01234567 or 0123456789abcdef + +For printing pointers when you *really* want to print the address. Please +consider whether or not you are leaking sensitive information about the +kernel memory layout before printing pointers with %px. %px is functionally +equivalent to %lx (or %lu). %px is preferred because it is more uniquely +grep'able. If in the future we need to modify the way the kernel handles +printing pointers we will be better equipped to find the call sites. + +Before using %px, consider if using %p is sufficient together with enabling the +``no_hash_pointers`` kernel parameter during debugging sessions (see the %p +description above). One valid scenario for %px might be printing information +immediately before a panic, which prevents any sensitive information to be +exploited anyway, and with %px there would be no need to reproduce the panic +with no_hash_pointers. + +Pointer Differences +------------------- + +:: + + %td 2560 + %tx a00 + +For printing the pointer differences, use the %t modifier for ptrdiff_t. + +Example:: + + printk("test: difference between pointers: %td\n", ptr2 - ptr1); + +Struct Resources +---------------- + +:: + + %pr [mem 0x60000000-0x6fffffff flags 0x2200] or + [mem 0x0000000060000000-0x000000006fffffff flags 0x2200] + %pR [mem 0x60000000-0x6fffffff pref] or + [mem 0x0000000060000000-0x000000006fffffff pref] + +For printing struct resources. The ``R`` and ``r`` specifiers result in a +printed resource with (R) or without (r) a decoded flags member. + +Passed by reference. + +Physical address types phys_addr_t +---------------------------------- + +:: + + %pa[p] 0x01234567 or 0x0123456789abcdef + +For printing a phys_addr_t type (and its derivatives, such as +resource_size_t) which can vary based on build options, regardless of the +width of the CPU data path. + +Passed by reference. + +DMA address types dma_addr_t +---------------------------- + +:: + + %pad 0x01234567 or 0x0123456789abcdef + +For printing a dma_addr_t type which can vary based on build options, +regardless of the width of the CPU data path. + +Passed by reference. + +Raw buffer as an escaped string +------------------------------- + +:: + + %*pE[achnops] + +For printing raw buffer as an escaped string. For the following buffer:: + + 1b 62 20 5c 43 07 22 90 0d 5d + +A few examples show how the conversion would be done (excluding surrounding +quotes):: + + %*pE "\eb \C\a"\220\r]" + %*pEhp "\x1bb \C\x07"\x90\x0d]" + %*pEa "\e\142\040\\\103\a\042\220\r\135" + +The conversion rules are applied according to an optional combination +of flags (see :c:func:`string_escape_mem` kernel documentation for the +details): + + - a - ESCAPE_ANY + - c - ESCAPE_SPECIAL + - h - ESCAPE_HEX + - n - ESCAPE_NULL + - o - ESCAPE_OCTAL + - p - ESCAPE_NP + - s - ESCAPE_SPACE + +By default ESCAPE_ANY_NP is used. + +ESCAPE_ANY_NP is the sane choice for many cases, in particularly for +printing SSIDs. + +If field width is omitted then 1 byte only will be escaped. + +Raw buffer as a hex string +-------------------------- + +:: + + %*ph 00 01 02 ... 3f + %*phC 00:01:02: ... :3f + %*phD 00-01-02- ... -3f + %*phN 000102 ... 3f + +For printing small buffers (up to 64 bytes long) as a hex string with a +certain separator. For larger buffers consider using +:c:func:`print_hex_dump`. + +MAC/FDDI addresses +------------------ + +:: + + %pM 00:01:02:03:04:05 + %pMR 05:04:03:02:01:00 + %pMF 00-01-02-03-04-05 + %pm 000102030405 + %pmR 050403020100 + +For printing 6-byte MAC/FDDI addresses in hex notation. The ``M`` and ``m`` +specifiers result in a printed address with (M) or without (m) byte +separators. The default byte separator is the colon (:). + +Where FDDI addresses are concerned the ``F`` specifier can be used after +the ``M`` specifier to use dash (-) separators instead of the default +separator. + +For Bluetooth addresses the ``R`` specifier shall be used after the ``M`` +specifier to use reversed byte order suitable for visual interpretation +of Bluetooth addresses which are in the little endian order. + +Passed by reference. + +IPv4 addresses +-------------- + +:: + + %pI4 1.2.3.4 + %pi4 001.002.003.004 + %p[Ii]4[hnbl] + +For printing IPv4 dot-separated decimal addresses. The ``I4`` and ``i4`` +specifiers result in a printed address with (i4) or without (I4) leading +zeros. + +The additional ``h``, ``n``, ``b``, and ``l`` specifiers are used to specify +host, network, big or little endian order addresses respectively. Where +no specifier is provided the default network/big endian order is used. + +Passed by reference. + +IPv6 addresses +-------------- + +:: + + %pI6 0001:0002:0003:0004:0005:0006:0007:0008 + %pi6 00010002000300040005000600070008 + %pI6c 1:2:3:4:5:6:7:8 + +For printing IPv6 network-order 16-bit hex addresses. The ``I6`` and ``i6`` +specifiers result in a printed address with (I6) or without (i6) +colon-separators. Leading zeros are always used. + +The additional ``c`` specifier can be used with the ``I`` specifier to +print a compressed IPv6 address as described by +https://tools.ietf.org/html/rfc5952 + +Passed by reference. + +IPv4/IPv6 addresses (generic, with port, flowinfo, scope) +--------------------------------------------------------- + +:: + + %pIS 1.2.3.4 or 0001:0002:0003:0004:0005:0006:0007:0008 + %piS 001.002.003.004 or 00010002000300040005000600070008 + %pISc 1.2.3.4 or 1:2:3:4:5:6:7:8 + %pISpc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345 + %p[Ii]S[pfschnbl] + +For printing an IP address without the need to distinguish whether it's of +type AF_INET or AF_INET6. A pointer to a valid struct sockaddr, +specified through ``IS`` or ``iS``, can be passed to this format specifier. + +The additional ``p``, ``f``, and ``s`` specifiers are used to specify port +(IPv4, IPv6), flowinfo (IPv6) and scope (IPv6). Ports have a ``:`` prefix, +flowinfo a ``/`` and scope a ``%``, each followed by the actual value. + +In case of an IPv6 address the compressed IPv6 address as described by +https://tools.ietf.org/html/rfc5952 is being used if the additional +specifier ``c`` is given. The IPv6 address is surrounded by ``[``, ``]`` in +case of additional specifiers ``p``, ``f`` or ``s`` as suggested by +https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07 + +In case of IPv4 addresses, the additional ``h``, ``n``, ``b``, and ``l`` +specifiers can be used as well and are ignored in case of an IPv6 +address. + +Passed by reference. + +Further examples:: + + %pISfc 1.2.3.4 or [1:2:3:4:5:6:7:8]/123456789 + %pISsc 1.2.3.4 or [1:2:3:4:5:6:7:8]%1234567890 + %pISpfc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345/123456789 + +UUID/GUID addresses +------------------- + +:: + + %pUb 00010203-0405-0607-0809-0a0b0c0d0e0f + %pUB 00010203-0405-0607-0809-0A0B0C0D0E0F + %pUl 03020100-0504-0706-0809-0a0b0c0e0e0f + %pUL 03020100-0504-0706-0809-0A0B0C0E0E0F + +For printing 16-byte UUID/GUIDs addresses. The additional ``l``, ``L``, +``b`` and ``B`` specifiers are used to specify a little endian order in +lower (l) or upper case (L) hex notation - and big endian order in lower (b) +or upper case (B) hex notation. + +Where no additional specifiers are used the default big endian +order with lower case hex notation will be printed. + +Passed by reference. + +dentry names +------------ + +:: + + %pd{,2,3,4} + %pD{,2,3,4} + +For printing dentry name; if we race with :c:func:`d_move`, the name might +be a mix of old and new ones, but it won't oops. %pd dentry is a safer +equivalent of %s dentry->d_name.name we used to use, %pd<n> prints ``n`` +last components. %pD does the same thing for struct file. + +Passed by reference. + +block_device names +------------------ + +:: + + %pg sda, sda1 or loop0p1 + +For printing name of block_device pointers. + +struct va_format +---------------- + +:: + + %pV + +For printing struct va_format structures. These contain a format string +and va_list as follows:: + + struct va_format { + const char *fmt; + va_list *va; + }; + +Implements a "recursive vsnprintf". + +Do not use this feature without some mechanism to verify the +correctness of the format string and va_list arguments. + +Passed by reference. + +Device tree nodes +----------------- + +:: + + %pOF[fnpPcCF] + + +For printing device tree node structures. Default behaviour is +equivalent to %pOFf. + + - f - device node full_name + - n - device node name + - p - device node phandle + - P - device node path spec (name + @unit) + - F - device node flags + - c - major compatible string + - C - full compatible string + +The separator when using multiple arguments is ':' + +Examples:: + + %pOF /foo/bar@0 - Node full name + %pOFf /foo/bar@0 - Same as above + %pOFfp /foo/bar@0:10 - Node full name + phandle + %pOFfcF /foo/bar@0:foo,device:--P- - Node full name + + major compatible string + + node flags + D - dynamic + d - detached + P - Populated + B - Populated bus + +Passed by reference. + +Fwnode handles +-------------- + +:: + + %pfw[fP] + +For printing information on fwnode handles. The default is to print the full +node name, including the path. The modifiers are functionally equivalent to +%pOF above. + + - f - full name of the node, including the path + - P - the name of the node including an address (if there is one) + +Examples (ACPI):: + + %pfwf \_SB.PCI0.CIO2.port@1.endpoint@0 - Full node name + %pfwP endpoint@0 - Node name + +Examples (OF):: + + %pfwf /ocp@68000000/i2c@48072000/camera@10/port/endpoint - Full name + %pfwP endpoint - Node name + +Time and date +------------- + +:: + + %pt[RT] YYYY-mm-ddTHH:MM:SS + %pt[RT]s YYYY-mm-dd HH:MM:SS + %pt[RT]d YYYY-mm-dd + %pt[RT]t HH:MM:SS + %pt[RT][dt][r][s] + +For printing date and time as represented by:: + + R struct rtc_time structure + T time64_t type + +in human readable format. + +By default year will be incremented by 1900 and month by 1. +Use %pt[RT]r (raw) to suppress this behaviour. + +The %pt[RT]s (space) will override ISO 8601 separator by using ' ' (space) +instead of 'T' (Capital T) between date and time. It won't have any effect +when date or time is omitted. + +Passed by reference. + +struct clk +---------- + +:: + + %pC pll1 + %pCn pll1 + +For printing struct clk structures. %pC and %pCn print the name of the clock +(Common Clock Framework) or a unique 32-bit ID (legacy clock framework). + +Passed by reference. + +bitmap and its derivatives such as cpumask and nodemask +------------------------------------------------------- + +:: + + %*pb 0779 + %*pbl 0,3-6,8-10 + +For printing bitmap and its derivatives such as cpumask and nodemask, +%*pb outputs the bitmap with field width as the number of bits and %*pbl +output the bitmap as range list with field width as the number of bits. + +The field width is passed by value, the bitmap is passed by reference. +Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease +printing cpumask and nodemask. + +Flags bitfields such as page flags, page_type, gfp_flags +-------------------------------------------------------- + +:: + + %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff) + %pGt 0xffffff7f(buddy) + %pGg GFP_USER|GFP_DMA32|GFP_NOWARN + %pGv read|exec|mayread|maywrite|mayexec|denywrite + +For printing flags bitfields as a collection of symbolic constants that +would construct the value. The type of flags is given by the third +character. Currently supported are: + + - p - [p]age flags, expects value of type (``unsigned long *``) + - t - page [t]ype, expects value of type (``unsigned int *``) + - v - [v]ma_flags, expects value of type (``unsigned long *``) + - g - [g]fp_flags, expects value of type (``gfp_t *``) + +The flag names and print order depends on the particular type. + +Note that this format should not be used directly in the +:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags() +functions from <trace/events/mmflags.h>. + +Passed by reference. + +Network device features +----------------------- + +:: + + %pNF 0x000000000000c000 + +For printing netdev_features_t. + +Passed by reference. + +V4L2 and DRM FourCC code (pixel format) +--------------------------------------- + +:: + + %p4cc + +Print a FourCC code used by V4L2 or DRM, including format endianness and +its numerical value as hexadecimal. + +Passed by reference. + +Examples:: + + %p4cc BG12 little-endian (0x32314742) + %p4cc Y10 little-endian (0x20303159) + %p4cc NV12 big-endian (0xb231564e) + +Rust +---- + +:: + + %pA + +Only intended to be used from Rust code to format ``core::fmt::Arguments``. +Do *not* use it from C. + +Thanks +====== + +If you add other %p extensions, please extend <lib/test_printf.c> with +one or more test cases, if at all feasible. + +Thank you for your cooperation and attention. diff --git a/Documentation/core-api/printk-index.rst b/Documentation/core-api/printk-index.rst new file mode 100644 index 0000000000..3062f37d11 --- /dev/null +++ b/Documentation/core-api/printk-index.rst @@ -0,0 +1,137 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +Printk Index +============ + +There are many ways how to monitor the state of the system. One important +source of information is the system log. It provides a lot of information, +including more or less important warnings and error messages. + +There are monitoring tools that filter and take action based on messages +logged. + +The kernel messages are evolving together with the code. As a result, +particular kernel messages are not KABI and never will be! + +It is a huge challenge for maintaining the system log monitors. It requires +knowing what messages were updated in a particular kernel version and why. +Finding these changes in the sources would require non-trivial parsers. +Also it would require matching the sources with the binary kernel which +is not always trivial. Various changes might be backported. Various kernel +versions might be used on different monitored systems. + +This is where the printk index feature might become useful. It provides +a dump of printk formats used all over the source code used for the kernel +and modules on the running system. It is accessible at runtime via debugfs. + +The printk index helps to find changes in the message formats. Also it helps +to track the strings back to the kernel sources and the related commit. + + +User Interface +============== + +The index of printk formats are split in into separate files. The files are +named according to the binaries where the printk formats are built-in. There +is always "vmlinux" and optionally also modules, for example:: + + /sys/kernel/debug/printk/index/vmlinux + /sys/kernel/debug/printk/index/ext4 + /sys/kernel/debug/printk/index/scsi_mod + +Note that only loaded modules are shown. Also printk formats from a module +might appear in "vmlinux" when the module is built-in. + +The content is inspired by the dynamic debug interface and looks like:: + + $> head -1 /sys/kernel/debug/printk/index/vmlinux; shuf -n 5 vmlinux + # <level[,flags]> filename:line function "format" + <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n" + <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n" + <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n" + <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n" + <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n" + +, where the meaning is: + + - :level: log level value: 0-7 for particular severity, -1 as default, + 'c' as continuous line without an explicit log level + - :flags: optional flags: currently only 'c' for KERN_CONT + - :filename\:line: source filename and line number of the related + printk() call. Note that there are many wrappers, for example, + pr_warn(), pr_warn_once(), dev_warn(). + - :function: function name where the printk() call is used. + - :format: format string + +The extra information makes it a bit harder to find differences +between various kernels. Especially the line number might change +very often. On the other hand, it helps a lot to confirm that +it is the same string or find the commit that is responsible +for eventual changes. + + +printk() Is Not a Stable KABI +============================= + +Several developers are afraid that exporting all these implementation +details into the user space will transform particular printk() calls +into KABI. + +But it is exactly the opposite. printk() calls must _not_ be KABI. +And the printk index helps user space tools to deal with this. + + +Subsystem specific printk wrappers +================================== + +The printk index is generated using extra metadata that are stored in +a dedicated .elf section ".printk_index". It is achieved using macro +wrappers doing __printk_index_emit() together with the real printk() +call. The same technique is used also for the metadata used by +the dynamic debug feature. + +The metadata are stored for a particular message only when it is printed +using these special wrappers. It is implemented for the commonly +used printk() calls, including, for example, pr_warn(), or pr_once(). + +Additional changes are necessary for various subsystem specific wrappers +that call the original printk() via a common helper function. These needs +their own wrappers adding __printk_index_emit(). + +Only few subsystem specific wrappers have been updated so far, +for example, dev_printk(). As a result, the printk formats from +some subsystes can be missing in the printk index. + + +Subsystem specific prefix +========================= + +The macro pr_fmt() macro allows to define a prefix that is printed +before the string generated by the related printk() calls. + +Subsystem specific wrappers usually add even more complicated +prefixes. + +These prefixes can be stored into the printk index metadata +by an optional parameter of __printk_index_emit(). The debugfs +interface might then show the printk formats including these prefixes. +For example, drivers/acpi/osl.c contains:: + + #define pr_fmt(fmt) "ACPI: OSL: " fmt + + static int __init acpi_no_auto_serialize_setup(char *str) + { + acpi_gbl_auto_serialize_methods = FALSE; + pr_info("Auto-serialization disabled\n"); + + return 1; + } + +This results in the following printk index entry:: + + <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n" + +It helps matching messages from the real log with printk index. +Then the source file name, line number, and function name can +be used to match the string with the source code. diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst new file mode 100644 index 0000000000..bf28ac0401 --- /dev/null +++ b/Documentation/core-api/protection-keys.rst @@ -0,0 +1,98 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +Memory Protection Keys +====================== + +Memory Protection Keys provide a mechanism for enforcing page-based +protections, but without requiring modification of the page tables when an +application changes protection domains. + +Pkeys Userspace (PKU) is a feature which can be found on: + * Intel server CPUs, Skylake and later + * Intel client CPUs, Tiger Lake (11th Gen Core) and later + * Future AMD CPUs + +Pkeys work by dedicating 4 previously Reserved bits in each page table entry to +a "protection key", giving 16 possible keys. + +Protections for each key are defined with a per-CPU user-accessible register +(PKRU). Each of these is a 32-bit register storing two bits (Access Disable +and Write Disable) for each of 16 keys. + +Being a CPU register, PKRU is inherently thread-local, potentially giving each +thread a different set of protections from every other thread. + +There are two instructions (RDPKRU/WRPKRU) for reading and writing to the +register. The feature is only available in 64-bit mode, even though there is +theoretically space in the PAE PTEs. These permissions are enforced on data +access only and have no effect on instruction fetches. + +Syscalls +======== + +There are 3 system calls which directly interact with pkeys:: + + int pkey_alloc(unsigned long flags, unsigned long init_access_rights) + int pkey_free(int pkey); + int pkey_mprotect(unsigned long start, size_t len, + unsigned long prot, int pkey); + +Before a pkey can be used, it must first be allocated with +pkey_alloc(). An application calls the WRPKRU instruction +directly in order to change access permissions to memory covered +with a key. In this example WRPKRU is wrapped by a C function +called pkey_set(). +:: + + int real_prot = PROT_READ|PROT_WRITE; + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE); + ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); + ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey); + ... application runs here + +Now, if the application needs to update the data at 'ptr', it can +gain access, do the update, then remove its write access:: + + pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE + *ptr = foo; // assign something + pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again + +Now when it frees the memory, it will also free the pkey since it +is no longer in use:: + + munmap(ptr, PAGE_SIZE); + pkey_free(pkey); + +.. note:: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions. + An example implementation can be found in + tools/testing/selftests/x86/protection_keys.c. + +Behavior +======== + +The kernel attempts to make protection keys consistent with the +behavior of a plain mprotect(). For instance if you do this:: + + mprotect(ptr, size, PROT_NONE); + something(ptr); + +you can expect the same effects with protection keys when doing this:: + + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ); + pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey); + something(ptr); + +That should be true whether something() is a direct access to 'ptr' +like:: + + *ptr = foo; + +or when the kernel does the access on the application's behalf like +with a read():: + + read(fd, ptr, 1); + +The kernel will send a SIGSEGV in both cases, but si_code will be set +to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when +the plain mprotect() permissions are violated. diff --git a/Documentation/core-api/rbtree.rst b/Documentation/core-api/rbtree.rst new file mode 100644 index 0000000000..ed1a9fbc77 --- /dev/null +++ b/Documentation/core-api/rbtree.rst @@ -0,0 +1,429 @@ +================================= +Red-black Trees (rbtree) in Linux +================================= + + +:Date: January 18, 2007 +:Author: Rob Landley <rob@landley.net> + +What are red-black trees, and what are they for? +------------------------------------------------ + +Red-black trees are a type of self-balancing binary search tree, used for +storing sortable key/value data pairs. This differs from radix trees (which +are used to efficiently store sparse arrays and thus use long integer indexes +to insert/access/delete nodes) and hash tables (which are not kept sorted to +be easily traversed in order, and must be tuned for a specific size and +hash function where rbtrees scale gracefully storing arbitrary keys). + +Red-black trees are similar to AVL trees, but provide faster real-time bounded +worst case performance for insertion and deletion (at most two rotations and +three rotations, respectively, to balance the tree), with slightly slower +(but still O(log n)) lookup time. + +To quote Linux Weekly News: + + There are a number of red-black trees in use in the kernel. + The deadline and CFQ I/O schedulers employ rbtrees to + track requests; the packet CD/DVD driver does the same. + The high-resolution timer code uses an rbtree to organize outstanding + timer requests. The ext3 filesystem tracks directory entries in a + red-black tree. Virtual memory areas (VMAs) are tracked with red-black + trees, as are epoll file descriptors, cryptographic keys, and network + packets in the "hierarchical token bucket" scheduler. + +This document covers use of the Linux rbtree implementation. For more +information on the nature and implementation of Red Black Trees, see: + + Linux Weekly News article on red-black trees + https://lwn.net/Articles/184495/ + + Wikipedia entry on red-black trees + https://en.wikipedia.org/wiki/Red-black_tree + +Linux implementation of red-black trees +--------------------------------------- + +Linux's rbtree implementation lives in the file "lib/rbtree.c". To use it, +"#include <linux/rbtree.h>". + +The Linux rbtree implementation is optimized for speed, and thus has one +less layer of indirection (and better cache locality) than more traditional +tree implementations. Instead of using pointers to separate rb_node and data +structures, each instance of struct rb_node is embedded in the data structure +it organizes. And instead of using a comparison callback function pointer, +users are expected to write their own tree search and insert functions +which call the provided rbtree functions. Locking is also left up to the +user of the rbtree code. + +Creating a new rbtree +--------------------- + +Data nodes in an rbtree tree are structures containing a struct rb_node member:: + + struct mytype { + struct rb_node node; + char *keystring; + }; + +When dealing with a pointer to the embedded struct rb_node, the containing data +structure may be accessed with the standard container_of() macro. In addition, +individual members may be accessed directly via rb_entry(node, type, member). + +At the root of each rbtree is an rb_root structure, which is initialized to be +empty via: + + struct rb_root mytree = RB_ROOT; + +Searching for a value in an rbtree +---------------------------------- + +Writing a search function for your tree is fairly straightforward: start at the +root, compare each value, and follow the left or right branch as necessary. + +Example:: + + struct mytype *my_search(struct rb_root *root, char *string) + { + struct rb_node *node = root->rb_node; + + while (node) { + struct mytype *data = container_of(node, struct mytype, node); + int result; + + result = strcmp(string, data->keystring); + + if (result < 0) + node = node->rb_left; + else if (result > 0) + node = node->rb_right; + else + return data; + } + return NULL; + } + +Inserting data into an rbtree +----------------------------- + +Inserting data in the tree involves first searching for the place to insert the +new node, then inserting the node and rebalancing ("recoloring") the tree. + +The search for insertion differs from the previous search by finding the +location of the pointer on which to graft the new node. The new node also +needs a link to its parent node for rebalancing purposes. + +Example:: + + int my_insert(struct rb_root *root, struct mytype *data) + { + struct rb_node **new = &(root->rb_node), *parent = NULL; + + /* Figure out where to put new node */ + while (*new) { + struct mytype *this = container_of(*new, struct mytype, node); + int result = strcmp(data->keystring, this->keystring); + + parent = *new; + if (result < 0) + new = &((*new)->rb_left); + else if (result > 0) + new = &((*new)->rb_right); + else + return FALSE; + } + + /* Add new node and rebalance tree. */ + rb_link_node(&data->node, parent, new); + rb_insert_color(&data->node, root); + + return TRUE; + } + +Removing or replacing existing data in an rbtree +------------------------------------------------ + +To remove an existing node from a tree, call:: + + void rb_erase(struct rb_node *victim, struct rb_root *tree); + +Example:: + + struct mytype *data = mysearch(&mytree, "walrus"); + + if (data) { + rb_erase(&data->node, &mytree); + myfree(data); + } + +To replace an existing node in a tree with a new one with the same key, call:: + + void rb_replace_node(struct rb_node *old, struct rb_node *new, + struct rb_root *tree); + +Replacing a node this way does not re-sort the tree: If the new node doesn't +have the same key as the old node, the rbtree will probably become corrupted. + +Iterating through the elements stored in an rbtree (in sort order) +------------------------------------------------------------------ + +Four functions are provided for iterating through an rbtree's contents in +sorted order. These work on arbitrary trees, and should not need to be +modified or wrapped (except for locking purposes):: + + struct rb_node *rb_first(struct rb_root *tree); + struct rb_node *rb_last(struct rb_root *tree); + struct rb_node *rb_next(struct rb_node *node); + struct rb_node *rb_prev(struct rb_node *node); + +To start iterating, call rb_first() or rb_last() with a pointer to the root +of the tree, which will return a pointer to the node structure contained in +the first or last element in the tree. To continue, fetch the next or previous +node by calling rb_next() or rb_prev() on the current node. This will return +NULL when there are no more nodes left. + +The iterator functions return a pointer to the embedded struct rb_node, from +which the containing data structure may be accessed with the container_of() +macro, and individual members may be accessed directly via +rb_entry(node, type, member). + +Example:: + + struct rb_node *node; + for (node = rb_first(&mytree); node; node = rb_next(node)) + printk("key=%s\n", rb_entry(node, struct mytype, node)->keystring); + +Cached rbtrees +-------------- + +Computing the leftmost (smallest) node is quite a common task for binary +search trees, such as for traversals or users relying on a the particular +order for their own logic. To this end, users can use 'struct rb_root_cached' +to optimize O(logN) rb_first() calls to a simple pointer fetch avoiding +potentially expensive tree iterations. This is done at negligible runtime +overhead for maintenance; albeit larger memory footprint. + +Similar to the rb_root structure, cached rbtrees are initialized to be +empty via:: + + struct rb_root_cached mytree = RB_ROOT_CACHED; + +Cached rbtree is simply a regular rb_root with an extra pointer to cache the +leftmost node. This allows rb_root_cached to exist wherever rb_root does, +which permits augmented trees to be supported as well as only a few extra +interfaces:: + + struct rb_node *rb_first_cached(struct rb_root_cached *tree); + void rb_insert_color_cached(struct rb_node *, struct rb_root_cached *, bool); + void rb_erase_cached(struct rb_node *node, struct rb_root_cached *); + +Both insert and erase calls have their respective counterpart of augmented +trees:: + + void rb_insert_augmented_cached(struct rb_node *node, struct rb_root_cached *, + bool, struct rb_augment_callbacks *); + void rb_erase_augmented_cached(struct rb_node *, struct rb_root_cached *, + struct rb_augment_callbacks *); + + +Support for Augmented rbtrees +----------------------------- + +Augmented rbtree is an rbtree with "some" additional data stored in +each node, where the additional data for node N must be a function of +the contents of all nodes in the subtree rooted at N. This data can +be used to augment some new functionality to rbtree. Augmented rbtree +is an optional feature built on top of basic rbtree infrastructure. +An rbtree user who wants this feature will have to call the augmentation +functions with the user provided augmentation callback when inserting +and erasing nodes. + +C files implementing augmented rbtree manipulation must include +<linux/rbtree_augmented.h> instead of <linux/rbtree.h>. Note that +linux/rbtree_augmented.h exposes some rbtree implementations details +you are not expected to rely on; please stick to the documented APIs +there and do not include <linux/rbtree_augmented.h> from header files +either so as to minimize chances of your users accidentally relying on +such implementation details. + +On insertion, the user must update the augmented information on the path +leading to the inserted node, then call rb_link_node() as usual and +rb_augment_inserted() instead of the usual rb_insert_color() call. +If rb_augment_inserted() rebalances the rbtree, it will callback into +a user provided function to update the augmented information on the +affected subtrees. + +When erasing a node, the user must call rb_erase_augmented() instead of +rb_erase(). rb_erase_augmented() calls back into user provided functions +to updated the augmented information on affected subtrees. + +In both cases, the callbacks are provided through struct rb_augment_callbacks. +3 callbacks must be defined: + +- A propagation callback, which updates the augmented value for a given + node and its ancestors, up to a given stop point (or NULL to update + all the way to the root). + +- A copy callback, which copies the augmented value for a given subtree + to a newly assigned subtree root. + +- A tree rotation callback, which copies the augmented value for a given + subtree to a newly assigned subtree root AND recomputes the augmented + information for the former subtree root. + +The compiled code for rb_erase_augmented() may inline the propagation and +copy callbacks, which results in a large function, so each augmented rbtree +user should have a single rb_erase_augmented() call site in order to limit +compiled code size. + + +Sample usage +^^^^^^^^^^^^ + +Interval tree is an example of augmented rb tree. Reference - +"Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein. +More details about interval trees: + +Classical rbtree has a single key and it cannot be directly used to store +interval ranges like [lo:hi] and do a quick lookup for any overlap with a new +lo:hi or to find whether there is an exact match for a new lo:hi. + +However, rbtree can be augmented to store such interval ranges in a structured +way making it possible to do efficient lookup and exact match. + +This "extra information" stored in each node is the maximum hi +(max_hi) value among all the nodes that are its descendants. This +information can be maintained at each node just be looking at the node +and its immediate children. And this will be used in O(log n) lookup +for lowest match (lowest start address among all possible matches) +with something like:: + + struct interval_tree_node * + interval_tree_first_match(struct rb_root *root, + unsigned long start, unsigned long last) + { + struct interval_tree_node *node; + + if (!root->rb_node) + return NULL; + node = rb_entry(root->rb_node, struct interval_tree_node, rb); + + while (true) { + if (node->rb.rb_left) { + struct interval_tree_node *left = + rb_entry(node->rb.rb_left, + struct interval_tree_node, rb); + if (left->__subtree_last >= start) { + /* + * Some nodes in left subtree satisfy Cond2. + * Iterate to find the leftmost such node N. + * If it also satisfies Cond1, that's the match + * we are looking for. Otherwise, there is no + * matching interval as nodes to the right of N + * can't satisfy Cond1 either. + */ + node = left; + continue; + } + } + if (node->start <= last) { /* Cond1 */ + if (node->last >= start) /* Cond2 */ + return node; /* node is leftmost match */ + if (node->rb.rb_right) { + node = rb_entry(node->rb.rb_right, + struct interval_tree_node, rb); + if (node->__subtree_last >= start) + continue; + } + } + return NULL; /* No match */ + } + } + +Insertion/removal are defined using the following augmented callbacks:: + + static inline unsigned long + compute_subtree_last(struct interval_tree_node *node) + { + unsigned long max = node->last, subtree_last; + if (node->rb.rb_left) { + subtree_last = rb_entry(node->rb.rb_left, + struct interval_tree_node, rb)->__subtree_last; + if (max < subtree_last) + max = subtree_last; + } + if (node->rb.rb_right) { + subtree_last = rb_entry(node->rb.rb_right, + struct interval_tree_node, rb)->__subtree_last; + if (max < subtree_last) + max = subtree_last; + } + return max; + } + + static void augment_propagate(struct rb_node *rb, struct rb_node *stop) + { + while (rb != stop) { + struct interval_tree_node *node = + rb_entry(rb, struct interval_tree_node, rb); + unsigned long subtree_last = compute_subtree_last(node); + if (node->__subtree_last == subtree_last) + break; + node->__subtree_last = subtree_last; + rb = rb_parent(&node->rb); + } + } + + static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new) + { + struct interval_tree_node *old = + rb_entry(rb_old, struct interval_tree_node, rb); + struct interval_tree_node *new = + rb_entry(rb_new, struct interval_tree_node, rb); + + new->__subtree_last = old->__subtree_last; + } + + static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new) + { + struct interval_tree_node *old = + rb_entry(rb_old, struct interval_tree_node, rb); + struct interval_tree_node *new = + rb_entry(rb_new, struct interval_tree_node, rb); + + new->__subtree_last = old->__subtree_last; + old->__subtree_last = compute_subtree_last(old); + } + + static const struct rb_augment_callbacks augment_callbacks = { + augment_propagate, augment_copy, augment_rotate + }; + + void interval_tree_insert(struct interval_tree_node *node, + struct rb_root *root) + { + struct rb_node **link = &root->rb_node, *rb_parent = NULL; + unsigned long start = node->start, last = node->last; + struct interval_tree_node *parent; + + while (*link) { + rb_parent = *link; + parent = rb_entry(rb_parent, struct interval_tree_node, rb); + if (parent->__subtree_last < last) + parent->__subtree_last = last; + if (start < parent->start) + link = &parent->rb.rb_left; + else + link = &parent->rb.rb_right; + } + + node->__subtree_last = last; + rb_link_node(&node->rb, rb_parent, link); + rb_insert_augmented(&node->rb, root, &augment_callbacks); + } + + void interval_tree_remove(struct interval_tree_node *node, + struct rb_root *root) + { + rb_erase_augmented(&node->rb, root, &augment_callbacks); + } diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst new file mode 100644 index 0000000000..79a009ce11 --- /dev/null +++ b/Documentation/core-api/refcount-vs-atomic.rst @@ -0,0 +1,168 @@ +=================================== +refcount_t API compared to atomic_t +=================================== + +.. contents:: :local: + +Introduction +============ + +The goal of refcount_t API is to provide a minimal API for implementing +an object's reference counters. While a generic architecture-independent +implementation from lib/refcount.c uses atomic operations underneath, +there are a number of differences between some of the ``refcount_*()`` and +``atomic_*()`` functions with regards to the memory ordering guarantees. +This document outlines the differences and provides respective examples +in order to help maintainers validate their code against the change in +these memory ordering guarantees. + +The terms used through this document try to follow the formal LKMM defined in +tools/memory-model/Documentation/explanation.txt. + +memory-barriers.txt and atomic_t.txt provide more background to the +memory ordering in general and for atomic operations specifically. + +Relevant types of memory ordering +================================= + +.. note:: The following section only covers some of the memory + ordering types that are relevant for the atomics and reference + counters and used through this document. For a much broader picture + please consult memory-barriers.txt document. + +In the absence of any memory ordering guarantees (i.e. fully unordered) +atomics & refcounters only provide atomicity and +program order (po) relation (on the same CPU). It guarantees that +each ``atomic_*()`` and ``refcount_*()`` operation is atomic and instructions +are executed in program order on a single CPU. +This is implemented using READ_ONCE()/WRITE_ONCE() and +compare-and-swap primitives. + +A strong (full) memory ordering guarantees that all prior loads and +stores (all po-earlier instructions) on the same CPU are completed +before any po-later instruction is executed on the same CPU. +It also guarantees that all po-earlier stores on the same CPU +and all propagated stores from other CPUs must propagate to all +other CPUs before any po-later instruction is executed on the original +CPU (A-cumulative property). This is implemented using smp_mb(). + +A RELEASE memory ordering guarantees that all prior loads and +stores (all po-earlier instructions) on the same CPU are completed +before the operation. It also guarantees that all po-earlier +stores on the same CPU and all propagated stores from other CPUs +must propagate to all other CPUs before the release operation +(A-cumulative property). This is implemented using +smp_store_release(). + +An ACQUIRE memory ordering guarantees that all post loads and +stores (all po-later instructions) on the same CPU are +completed after the acquire operation. It also guarantees that all +po-later stores on the same CPU must propagate to all other CPUs +after the acquire operation executes. This is implemented using +smp_acquire__after_ctrl_dep(). + +A control dependency (on success) for refcounters guarantees that +if a reference for an object was successfully obtained (reference +counter increment or addition happened, function returned true), +then further stores are ordered against this operation. +Control dependency on stores are not implemented using any explicit +barriers, but rely on CPU not to speculate on stores. This is only +a single CPU relation and provides no guarantees for other CPUs. + + +Comparison of functions +======================= + +case 1) - non-"Read/Modify/Write" (RMW) ops +------------------------------------------- + +Function changes: + + * atomic_set() --> refcount_set() + * atomic_read() --> refcount_read() + +Memory ordering guarantee changes: + + * none (both fully unordered) + + +case 2) - increment-based ops that return no value +-------------------------------------------------- + +Function changes: + + * atomic_inc() --> refcount_inc() + * atomic_add() --> refcount_add() + +Memory ordering guarantee changes: + + * none (both fully unordered) + +case 3) - decrement-based RMW ops that return no value +------------------------------------------------------ + +Function changes: + + * atomic_dec() --> refcount_dec() + +Memory ordering guarantee changes: + + * fully unordered --> RELEASE ordering + + +case 4) - increment-based RMW ops that return a value +----------------------------------------------------- + +Function changes: + + * atomic_inc_not_zero() --> refcount_inc_not_zero() + * no atomic counterpart --> refcount_add_not_zero() + +Memory ordering guarantees changes: + + * fully ordered --> control dependency on success for stores + +.. note:: We really assume here that necessary ordering is provided as a + result of obtaining pointer to the object! + + +case 5) - generic dec/sub decrement-based RMW ops that return a value +--------------------------------------------------------------------- + +Function changes: + + * atomic_dec_and_test() --> refcount_dec_and_test() + * atomic_sub_and_test() --> refcount_sub_and_test() + +Memory ordering guarantees changes: + + * fully ordered --> RELEASE ordering + ACQUIRE ordering on success + + +case 6) other decrement-based RMW ops that return a value +--------------------------------------------------------- + +Function changes: + + * no atomic counterpart --> refcount_dec_if_one() + * ``atomic_add_unless(&var, -1, 1)`` --> ``refcount_dec_not_one(&var)`` + +Memory ordering guarantees changes: + + * fully ordered --> RELEASE ordering + control dependency + +.. note:: atomic_add_unless() only provides full order on success. + + +case 7) - lock-based RMW +------------------------ + +Function changes: + + * atomic_dec_and_lock() --> refcount_dec_and_lock() + * atomic_dec_and_mutex_lock() --> refcount_dec_and_mutex_lock() + +Memory ordering guarantees changes: + + * fully ordered --> RELEASE ordering + control dependency + hold + spin_lock() on success diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst new file mode 100644 index 0000000000..12e4aecdae --- /dev/null +++ b/Documentation/core-api/symbol-namespaces.rst @@ -0,0 +1,157 @@ +================= +Symbol Namespaces +================= + +The following document describes how to use Symbol Namespaces to structure the +export surface of in-kernel symbols exported through the family of +EXPORT_SYMBOL() macros. + +.. Table of Contents + + === 1 Introduction + === 2 How to define Symbol Namespaces + --- 2.1 Using the EXPORT_SYMBOL macros + --- 2.2 Using the DEFAULT_SYMBOL_NAMESPACE define + === 3 How to use Symbols exported in Namespaces + === 4 Loading Modules that use namespaced Symbols + === 5 Automatically creating MODULE_IMPORT_NS statements + +1. Introduction +=============== + +Symbol Namespaces have been introduced as a means to structure the export +surface of the in-kernel API. It allows subsystem maintainers to partition +their exported symbols into separate namespaces. That is useful for +documentation purposes (think of the SUBSYSTEM_DEBUG namespace) as well as for +limiting the availability of a set of symbols for use in other parts of the +kernel. As of today, modules that make use of symbols exported into namespaces, +are required to import the namespace. Otherwise the kernel will, depending on +its configuration, reject loading the module or warn about a missing import. + +2. How to define Symbol Namespaces +================================== + +Symbols can be exported into namespace using different methods. All of them are +changing the way EXPORT_SYMBOL and friends are instrumented to create ksymtab +entries. + +2.1 Using the EXPORT_SYMBOL macros +================================== + +In addition to the macros EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(), that allow +exporting of kernel symbols to the kernel symbol table, variants of these are +available to export symbols into a certain namespace: EXPORT_SYMBOL_NS() and +EXPORT_SYMBOL_NS_GPL(). They take one additional argument: the namespace. +Please note that due to macro expansion that argument needs to be a +preprocessor symbol. E.g. to export the symbol ``usb_stor_suspend`` into the +namespace ``USB_STORAGE``, use:: + + EXPORT_SYMBOL_NS(usb_stor_suspend, USB_STORAGE); + +The corresponding ksymtab entry struct ``kernel_symbol`` will have the member +``namespace`` set accordingly. A symbol that is exported without a namespace will +refer to ``NULL``. There is no default namespace if none is defined. ``modpost`` +and kernel/module/main.c make use the namespace at build time or module load +time, respectively. + +2.2 Using the DEFAULT_SYMBOL_NAMESPACE define +============================================= + +Defining namespaces for all symbols of a subsystem can be very verbose and may +become hard to maintain. Therefore a default define (DEFAULT_SYMBOL_NAMESPACE) +is been provided, that, if set, will become the default for all EXPORT_SYMBOL() +and EXPORT_SYMBOL_GPL() macro expansions that do not specify a namespace. + +There are multiple ways of specifying this define and it depends on the +subsystem and the maintainer's preference, which one to use. The first option +is to define the default namespace in the ``Makefile`` of the subsystem. E.g. to +export all symbols defined in usb-common into the namespace USB_COMMON, add a +line like this to drivers/usb/common/Makefile:: + + ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=USB_COMMON + +That will affect all EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL() statements. A +symbol exported with EXPORT_SYMBOL_NS() while this definition is present, will +still be exported into the namespace that is passed as the namespace argument +as this argument has preference over a default symbol namespace. + +A second option to define the default namespace is directly in the compilation +unit as preprocessor statement. The above example would then read:: + + #undef DEFAULT_SYMBOL_NAMESPACE + #define DEFAULT_SYMBOL_NAMESPACE USB_COMMON + +within the corresponding compilation unit before any EXPORT_SYMBOL macro is +used. + +3. How to use Symbols exported in Namespaces +============================================ + +In order to use symbols that are exported into namespaces, kernel modules need +to explicitly import these namespaces. Otherwise the kernel might reject to +load the module. The module code is required to use the macro MODULE_IMPORT_NS +for the namespaces it uses symbols from. E.g. a module using the +usb_stor_suspend symbol from above, needs to import the namespace USB_STORAGE +using a statement like:: + + MODULE_IMPORT_NS(USB_STORAGE); + +This will create a ``modinfo`` tag in the module for each imported namespace. +This has the side effect, that the imported namespaces of a module can be +inspected with modinfo:: + + $ modinfo drivers/usb/storage/ums-karma.ko + [...] + import_ns: USB_STORAGE + [...] + + +It is advisable to add the MODULE_IMPORT_NS() statement close to other module +metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE(). Refer to section +5. for a way to create missing import statements automatically. + +4. Loading Modules that use namespaced Symbols +============================================== + +At module loading time (e.g. ``insmod``), the kernel will check each symbol +referenced from the module for its availability and whether the namespace it +might be exported to has been imported by the module. The default behaviour of +the kernel is to reject loading modules that don't specify sufficient imports. +An error will be logged and loading will be failed with EINVAL. In order to +allow loading of modules that don't satisfy this precondition, a configuration +option is available: Setting MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=y will +enable loading regardless, but will emit a warning. + +5. Automatically creating MODULE_IMPORT_NS statements +===================================================== + +Missing namespaces imports can easily be detected at build time. In fact, +modpost will emit a warning if a module uses a symbol from a namespace +without importing it. +MODULE_IMPORT_NS() statements will usually be added at a definite location +(along with other module meta data). To make the life of module authors (and +subsystem maintainers) easier, a script and make target is available to fixup +missing imports. Fixing missing imports can be done with:: + + $ make nsdeps + +A typical scenario for module authors would be:: + + - write code that depends on a symbol from a not imported namespace + - ``make`` + - notice the warning of modpost telling about a missing import + - run ``make nsdeps`` to add the import to the correct code location + +For subsystem maintainers introducing a namespace, the steps are very similar. +Again, ``make nsdeps`` will eventually add the missing namespace imports for +in-tree modules:: + + - move or add symbols to a namespace (e.g. with EXPORT_SYMBOL_NS()) + - ``make`` (preferably with an allmodconfig to cover all in-kernel + modules) + - notice the warning of modpost telling about a missing import + - run ``make nsdeps`` to add the import to the correct code location + +You can also run nsdeps for external module builds. A typical usage is:: + + $ make -C <path_to_kernel_src> M=$PWD nsdeps diff --git a/Documentation/core-api/this_cpu_ops.rst b/Documentation/core-api/this_cpu_ops.rst new file mode 100644 index 0000000000..91acbcf30e --- /dev/null +++ b/Documentation/core-api/this_cpu_ops.rst @@ -0,0 +1,337 @@ +=================== +this_cpu operations +=================== + +:Author: Christoph Lameter, August 4th, 2014 +:Author: Pranith Kumar, Aug 2nd, 2014 + +this_cpu operations are a way of optimizing access to per cpu +variables associated with the *currently* executing processor. This is +done through the use of segment registers (or a dedicated register where +the cpu permanently stored the beginning of the per cpu area for a +specific processor). + +this_cpu operations add a per cpu variable offset to the processor +specific per cpu base and encode that operation in the instruction +operating on the per cpu variable. + +This means that there are no atomicity issues between the calculation of +the offset and the operation on the data. Therefore it is not +necessary to disable preemption or interrupts to ensure that the +processor is not changed between the calculation of the address and +the operation on the data. + +Read-modify-write operations are of particular interest. Frequently +processors have special lower latency instructions that can operate +without the typical synchronization overhead, but still provide some +sort of relaxed atomicity guarantees. The x86, for example, can execute +RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the +lock prefix and the associated latency penalty. + +Access to the variable without the lock prefix is not synchronized but +synchronization is not necessary since we are dealing with per cpu +data specific to the currently executing processor. Only the current +processor should be accessing that variable and therefore there are no +concurrency issues with other processors in the system. + +Please note that accesses by remote processors to a per cpu area are +exceptional situations and may impact performance and/or correctness +(remote write operations) of local RMW operations via this_cpu_*. + +The main use of the this_cpu operations has been to optimize counter +operations. + +The following this_cpu() operations with implied preemption protection +are defined. These operations can be used without worrying about +preemption and interrupts:: + + this_cpu_read(pcp) + this_cpu_write(pcp, val) + this_cpu_add(pcp, val) + this_cpu_and(pcp, val) + this_cpu_or(pcp, val) + this_cpu_add_return(pcp, val) + this_cpu_xchg(pcp, nval) + this_cpu_cmpxchg(pcp, oval, nval) + this_cpu_sub(pcp, val) + this_cpu_inc(pcp) + this_cpu_dec(pcp) + this_cpu_sub_return(pcp, val) + this_cpu_inc_return(pcp) + this_cpu_dec_return(pcp) + + +Inner working of this_cpu operations +------------------------------------ + +On x86 the fs: or the gs: segment registers contain the base of the +per cpu area. It is then possible to simply use the segment override +to relocate a per cpu relative address to the proper per cpu area for +the processor. So the relocation to the per cpu base is encoded in the +instruction via a segment register prefix. + +For example:: + + DEFINE_PER_CPU(int, x); + int z; + + z = this_cpu_read(x); + +results in a single instruction:: + + mov ax, gs:[x] + +instead of a sequence of calculation of the address and then a fetch +from that address which occurs with the per cpu operations. Before +this_cpu_ops such sequence also required preempt disable/enable to +prevent the kernel from moving the thread to a different processor +while the calculation is performed. + +Consider the following this_cpu operation:: + + this_cpu_inc(x) + +The above results in the following single instruction (no lock prefix!):: + + inc gs:[x] + +instead of the following operations required if there is no segment +register:: + + int *y; + int cpu; + + cpu = get_cpu(); + y = per_cpu_ptr(&x, cpu); + (*y)++; + put_cpu(); + +Note that these operations can only be used on per cpu data that is +reserved for a specific processor. Without disabling preemption in the +surrounding code this_cpu_inc() will only guarantee that one of the +per cpu counters is correctly incremented. However, there is no +guarantee that the OS will not move the process directly before or +after the this_cpu instruction is executed. In general this means that +the value of the individual counters for each processor are +meaningless. The sum of all the per cpu counters is the only value +that is of interest. + +Per cpu variables are used for performance reasons. Bouncing cache +lines can be avoided if multiple processors concurrently go through +the same code paths. Since each processor has its own per cpu +variables no concurrent cache line updates take place. The price that +has to be paid for this optimization is the need to add up the per cpu +counters when the value of a counter is needed. + + +Special operations +------------------ + +:: + + y = this_cpu_ptr(&x) + +Takes the offset of a per cpu variable (&x !) and returns the address +of the per cpu variable that belongs to the currently executing +processor. this_cpu_ptr avoids multiple steps that the common +get_cpu/put_cpu sequence requires. No processor number is +available. Instead, the offset of the local per cpu area is simply +added to the per cpu offset. + +Note that this operation is usually used in a code segment when +preemption has been disabled. The pointer is then used to +access local per cpu data in a critical section. When preemption +is re-enabled this pointer is usually no longer useful since it may +no longer point to per cpu data of the current processor. + + +Per cpu variables and offsets +----------------------------- + +Per cpu variables have *offsets* to the beginning of the per cpu +area. They do not have addresses although they look like that in the +code. Offsets cannot be directly dereferenced. The offset must be +added to a base pointer of a per cpu area of a processor in order to +form a valid address. + +Therefore the use of x or &x outside of the context of per cpu +operations is invalid and will generally be treated like a NULL +pointer dereference. + +:: + + DEFINE_PER_CPU(int, x); + +In the context of per cpu operations the above implies that x is a per +cpu variable. Most this_cpu operations take a cpu variable. + +:: + + int __percpu *p = &x; + +&x and hence p is the *offset* of a per cpu variable. this_cpu_ptr() +takes the offset of a per cpu variable which makes this look a bit +strange. + + +Operations on a field of a per cpu structure +-------------------------------------------- + +Let's say we have a percpu structure:: + + struct s { + int n,m; + }; + + DEFINE_PER_CPU(struct s, p); + + +Operations on these fields are straightforward:: + + this_cpu_inc(p.m) + + z = this_cpu_cmpxchg(p.m, 0, 1); + + +If we have an offset to struct s:: + + struct s __percpu *ps = &p; + + this_cpu_dec(ps->m); + + z = this_cpu_inc_return(ps->n); + + +The calculation of the pointer may require the use of this_cpu_ptr() +if we do not make use of this_cpu ops later to manipulate fields:: + + struct s *pp; + + pp = this_cpu_ptr(&p); + + pp->m--; + + z = pp->n++; + + +Variants of this_cpu ops +------------------------ + +this_cpu ops are interrupt safe. Some architectures do not support +these per cpu local operations. In that case the operation must be +replaced by code that disables interrupts, then does the operations +that are guaranteed to be atomic and then re-enable interrupts. Doing +so is expensive. If there are other reasons why the scheduler cannot +change the processor we are executing on then there is no reason to +disable interrupts. For that purpose the following __this_cpu operations +are provided. + +These operations have no guarantee against concurrent interrupts or +preemption. If a per cpu variable is not used in an interrupt context +and the scheduler cannot preempt, then they are safe. If any interrupts +still occur while an operation is in progress and if the interrupt too +modifies the variable, then RMW actions can not be guaranteed to be +safe:: + + __this_cpu_read(pcp) + __this_cpu_write(pcp, val) + __this_cpu_add(pcp, val) + __this_cpu_and(pcp, val) + __this_cpu_or(pcp, val) + __this_cpu_add_return(pcp, val) + __this_cpu_xchg(pcp, nval) + __this_cpu_cmpxchg(pcp, oval, nval) + __this_cpu_sub(pcp, val) + __this_cpu_inc(pcp) + __this_cpu_dec(pcp) + __this_cpu_sub_return(pcp, val) + __this_cpu_inc_return(pcp) + __this_cpu_dec_return(pcp) + + +Will increment x and will not fall-back to code that disables +interrupts on platforms that cannot accomplish atomicity through +address relocation and a Read-Modify-Write operation in the same +instruction. + + +&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n) +-------------------------------------------- + +The first operation takes the offset and forms an address and then +adds the offset of the n field. This may result in two add +instructions emitted by the compiler. + +The second one first adds the two offsets and then does the +relocation. IMHO the second form looks cleaner and has an easier time +with (). The second form also is consistent with the way +this_cpu_read() and friends are used. + + +Remote access to per cpu data +------------------------------ + +Per cpu data structures are designed to be used by one cpu exclusively. +If you use the variables as intended, this_cpu_ops() are guaranteed to +be "atomic" as no other CPU has access to these data structures. + +There are special cases where you might need to access per cpu data +structures remotely. It is usually safe to do a remote read access +and that is frequently done to summarize counters. Remote write access +something which could be problematic because this_cpu ops do not +have lock semantics. A remote write may interfere with a this_cpu +RMW operation. + +Remote write accesses to percpu data structures are highly discouraged +unless absolutely necessary. Please consider using an IPI to wake up +the remote CPU and perform the update to its per cpu area. + +To access per-cpu data structure remotely, typically the per_cpu_ptr() +function is used:: + + + DEFINE_PER_CPU(struct data, datap); + + struct data *p = per_cpu_ptr(&datap, cpu); + +This makes it explicit that we are getting ready to access a percpu +area remotely. + +You can also do the following to convert the datap offset to an address:: + + struct data *p = this_cpu_ptr(&datap); + +but, passing of pointers calculated via this_cpu_ptr to other cpus is +unusual and should be avoided. + +Remote access are typically only for reading the status of another cpus +per cpu data. Write accesses can cause unique problems due to the +relaxed synchronization requirements for this_cpu operations. + +One example that illustrates some concerns with write operations is +the following scenario that occurs because two per cpu variables +share a cache-line but the relaxed synchronization is applied to +only one process updating the cache-line. + +Consider the following example:: + + + struct test { + atomic_t a; + int b; + }; + + DEFINE_PER_CPU(struct test, onecacheline); + +There is some concern about what would happen if the field 'a' is updated +remotely from one processor and the local processor would use this_cpu ops +to update field b. Care should be taken that such simultaneous accesses to +data within the same cache line are avoided. Also costly synchronization +may be necessary. IPIs are generally recommended in such scenarios instead +of a remote write to the per cpu area of another processor. + +Even in cases where the remote writes are rare, please bear in +mind that a remote write will evict the cache line from the processor +that most likely will access it. If the processor wakes up and finds a +missing local cache line of a per cpu area, its performance and hence +the wake up times will be affected. diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst new file mode 100644 index 0000000000..22ec68f244 --- /dev/null +++ b/Documentation/core-api/timekeeping.rst @@ -0,0 +1,190 @@ +ktime accessors +=============== + +Device drivers can read the current time using ktime_get() and the many +related functions declared in linux/timekeeping.h. As a rule of thumb, +using an accessor with a shorter name is preferred over one with a longer +name if both are equally fit for a particular use case. + +Basic ktime_t based interfaces +------------------------------ + +The recommended simplest form returns an opaque ktime_t, with variants +that return time for different clock references: + + +.. c:function:: ktime_t ktime_get( void ) + + CLOCK_MONOTONIC + + Useful for reliable timestamps and measuring short time intervals + accurately. Starts at system boot time but stops during suspend. + +.. c:function:: ktime_t ktime_get_boottime( void ) + + CLOCK_BOOTTIME + + Like ktime_get(), but does not stop when suspended. This can be + used e.g. for key expiration times that need to be synchronized + with other machines across a suspend operation. + +.. c:function:: ktime_t ktime_get_real( void ) + + CLOCK_REALTIME + + Returns the time in relative to the UNIX epoch starting in 1970 + using the Coordinated Universal Time (UTC), same as gettimeofday() + user space. This is used for all timestamps that need to + persist across a reboot, like inode times, but should be avoided + for internal uses, since it can jump backwards due to a leap + second update, NTP adjustment settimeofday() operation from user + space. + +.. c:function:: ktime_t ktime_get_clocktai( void ) + + CLOCK_TAI + + Like ktime_get_real(), but uses the International Atomic Time (TAI) + reference instead of UTC to avoid jumping on leap second updates. + This is rarely useful in the kernel. + +.. c:function:: ktime_t ktime_get_raw( void ) + + CLOCK_MONOTONIC_RAW + + Like ktime_get(), but runs at the same rate as the hardware + clocksource without (NTP) adjustments for clock drift. This is + also rarely needed in the kernel. + +nanosecond, timespec64, and second output +----------------------------------------- + +For all of the above, there are variants that return the time in a +different format depending on what is required by the user: + +.. c:function:: u64 ktime_get_ns( void ) + u64 ktime_get_boottime_ns( void ) + u64 ktime_get_real_ns( void ) + u64 ktime_get_clocktai_ns( void ) + u64 ktime_get_raw_ns( void ) + + Same as the plain ktime_get functions, but returning a u64 number + of nanoseconds in the respective time reference, which may be + more convenient for some callers. + +.. c:function:: void ktime_get_ts64( struct timespec64 * ) + void ktime_get_boottime_ts64( struct timespec64 * ) + void ktime_get_real_ts64( struct timespec64 * ) + void ktime_get_clocktai_ts64( struct timespec64 * ) + void ktime_get_raw_ts64( struct timespec64 * ) + + Same above, but returns the time in a 'struct timespec64', split + into seconds and nanoseconds. This can avoid an extra division + when printing the time, or when passing it into an external + interface that expects a 'timespec' or 'timeval' structure. + +.. c:function:: time64_t ktime_get_seconds( void ) + time64_t ktime_get_boottime_seconds( void ) + time64_t ktime_get_real_seconds( void ) + time64_t ktime_get_clocktai_seconds( void ) + time64_t ktime_get_raw_seconds( void ) + + Return a coarse-grained version of the time as a scalar + time64_t. This avoids accessing the clock hardware and rounds + down the seconds to the full seconds of the last timer tick + using the respective reference. + +Coarse and fast_ns access +------------------------- + +Some additional variants exist for more specialized cases: + +.. c:function:: ktime_t ktime_get_coarse( void ) + ktime_t ktime_get_coarse_boottime( void ) + ktime_t ktime_get_coarse_real( void ) + ktime_t ktime_get_coarse_clocktai( void ) + +.. c:function:: u64 ktime_get_coarse_ns( void ) + u64 ktime_get_coarse_boottime_ns( void ) + u64 ktime_get_coarse_real_ns( void ) + u64 ktime_get_coarse_clocktai_ns( void ) + +.. c:function:: void ktime_get_coarse_ts64( struct timespec64 * ) + void ktime_get_coarse_boottime_ts64( struct timespec64 * ) + void ktime_get_coarse_real_ts64( struct timespec64 * ) + void ktime_get_coarse_clocktai_ts64( struct timespec64 * ) + + These are quicker than the non-coarse versions, but less accurate, + corresponding to CLOCK_MONOTONIC_COARSE and CLOCK_REALTIME_COARSE + in user space, along with the equivalent boottime/tai/raw + timebase not available in user space. + + The time returned here corresponds to the last timer tick, which + may be as much as 10ms in the past (for CONFIG_HZ=100), same as + reading the 'jiffies' variable. These are only useful when called + in a fast path and one still expects better than second accuracy, + but can't easily use 'jiffies', e.g. for inode timestamps. + Skipping the hardware clock access saves around 100 CPU cycles + on most modern machines with a reliable cycle counter, but + up to several microseconds on older hardware with an external + clocksource. + +.. c:function:: u64 ktime_get_mono_fast_ns( void ) + u64 ktime_get_raw_fast_ns( void ) + u64 ktime_get_boot_fast_ns( void ) + u64 ktime_get_tai_fast_ns( void ) + u64 ktime_get_real_fast_ns( void ) + + These variants are safe to call from any context, including from + a non-maskable interrupt (NMI) during a timekeeper update, and + while we are entering suspend with the clocksource powered down. + This is useful in some tracing or debugging code as well as + machine check reporting, but most drivers should never call them, + since the time is allowed to jump under certain conditions. + +Deprecated time interfaces +-------------------------- + +Older kernels used some other interfaces that are now being phased out +but may appear in third-party drivers being ported here. In particular, +all interfaces returning a 'struct timeval' or 'struct timespec' have +been replaced because the tv_sec member overflows in year 2038 on 32-bit +architectures. These are the recommended replacements: + +.. c:function:: void ktime_get_ts( struct timespec * ) + + Use ktime_get() or ktime_get_ts64() instead. + +.. c:function:: void do_gettimeofday( struct timeval * ) + void getnstimeofday( struct timespec * ) + void getnstimeofday64( struct timespec64 * ) + void ktime_get_real_ts( struct timespec * ) + + ktime_get_real_ts64() is a direct replacement, but consider using + monotonic time (ktime_get_ts64()) and/or a ktime_t based interface + (ktime_get()/ktime_get_real()). + +.. c:function:: struct timespec current_kernel_time( void ) + struct timespec64 current_kernel_time64( void ) + struct timespec get_monotonic_coarse( void ) + struct timespec64 get_monotonic_coarse64( void ) + + These are replaced by ktime_get_coarse_real_ts64() and + ktime_get_coarse_ts64(). However, A lot of code that wants + coarse-grained times can use the simple 'jiffies' instead, while + some drivers may actually want the higher resolution accessors + these days. + +.. c:function:: struct timespec getrawmonotonic( void ) + struct timespec64 getrawmonotonic64( void ) + struct timespec timekeeping_clocktai( void ) + struct timespec64 timekeeping_clocktai64( void ) + struct timespec get_monotonic_boottime( void ) + struct timespec64 get_monotonic_boottime64( void ) + + These are replaced by ktime_get_raw()/ktime_get_raw_ts64(), + ktime_get_clocktai()/ktime_get_clocktai_ts64() as well + as ktime_get_boottime()/ktime_get_boottime_ts64(). + However, if the particular choice of clock source is not + important for the user, consider converting to + ktime_get()/ktime_get_ts64() instead for consistency. diff --git a/Documentation/core-api/tracepoint.rst b/Documentation/core-api/tracepoint.rst new file mode 100644 index 0000000000..6b44bec0de --- /dev/null +++ b/Documentation/core-api/tracepoint.rst @@ -0,0 +1,55 @@ +=============================== +The Linux Kernel Tracepoint API +=============================== + +:Author: Jason Baron +:Author: William Cohen + +Introduction +============ + +Tracepoints are static probe points that are located in strategic points +throughout the kernel. 'Probes' register/unregister with tracepoints via +a callback mechanism. The 'probes' are strictly typed functions that are +passed a unique set of parameters defined by each tracepoint. + +From this simple callback mechanism, 'probes' can be used to profile, +debug, and understand kernel behavior. There are a number of tools that +provide a framework for using 'probes'. These tools include Systemtap, +ftrace, and LTTng. + +Tracepoints are defined in a number of header files via various macros. +Thus, the purpose of this document is to provide a clear accounting of +the available tracepoints. The intention is to understand not only what +tracepoints are available but also to understand where future +tracepoints might be added. + +The API presented has functions of the form: +``trace_tracepointname(function parameters)``. These are the tracepoints +callbacks that are found throughout the code. Registering and +unregistering probes with these callback sites is covered in the +``Documentation/trace/*`` directory. + +IRQ +=== + +.. kernel-doc:: include/trace/events/irq.h + :internal: + +SIGNAL +====== + +.. kernel-doc:: include/trace/events/signal.h + :internal: + +Block IO +======== + +.. kernel-doc:: include/trace/events/block.h + :internal: + +Workqueue +========= + +.. kernel-doc:: include/trace/events/workqueue.h + :internal: diff --git a/Documentation/core-api/unaligned-memory-access.rst b/Documentation/core-api/unaligned-memory-access.rst new file mode 100644 index 0000000000..1ee82419d8 --- /dev/null +++ b/Documentation/core-api/unaligned-memory-access.rst @@ -0,0 +1,265 @@ +========================= +Unaligned Memory Accesses +========================= + +:Author: Daniel Drake <dsd@gentoo.org>, +:Author: Johannes Berg <johannes@sipsolutions.net> + +:With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt, + Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz, + Vadim Lobanov + + +Linux runs on a wide variety of architectures which have varying behaviour +when it comes to memory access. This document presents some details about +unaligned accesses, why you need to write code that doesn't cause them, +and how to write such code! + + +The definition of an unaligned access +===================================== + +Unaligned memory accesses occur when you try to read N bytes of data starting +from an address that is not evenly divisible by N (i.e. addr % N != 0). +For example, reading 4 bytes of data from address 0x10004 is fine, but +reading 4 bytes of data from address 0x10005 would be an unaligned memory +access. + +The above may seem a little vague, as memory access can happen in different +ways. The context here is at the machine code level: certain instructions read +or write a number of bytes to or from memory (e.g. movb, movw, movl in x86 +assembly). As will become clear, it is relatively easy to spot C statements +which will compile to multiple-byte memory access instructions, namely when +dealing with types such as u16, u32 and u64. + + +Natural alignment +================= + +The rule mentioned above forms what we refer to as natural alignment: +When accessing N bytes of memory, the base memory address must be evenly +divisible by N, i.e. addr % N == 0. + +When writing code, assume the target architecture has natural alignment +requirements. + +In reality, only a few architectures require natural alignment on all sizes +of memory access. However, we must consider ALL supported architectures; +writing code that satisfies natural alignment requirements is the easiest way +to achieve full portability. + + +Why unaligned access is bad +=========================== + +The effects of performing an unaligned memory access vary from architecture +to architecture. It would be easy to write a whole document on the differences +here; a summary of the common scenarios is presented below: + + - Some architectures are able to perform unaligned memory accesses + transparently, but there is usually a significant performance cost. + - Some architectures raise processor exceptions when unaligned accesses + happen. The exception handler is able to correct the unaligned access, + at significant cost to performance. + - Some architectures raise processor exceptions when unaligned accesses + happen, but the exceptions do not contain enough information for the + unaligned access to be corrected. + - Some architectures are not capable of unaligned memory access, but will + silently perform a different memory access to the one that was requested, + resulting in a subtle code bug that is hard to detect! + +It should be obvious from the above that if your code causes unaligned +memory accesses to happen, your code will not work correctly on certain +platforms and will cause performance problems on others. + + +Code that does not cause unaligned access +========================================= + +At first, the concepts above may seem a little hard to relate to actual +coding practice. After all, you don't have a great deal of control over +memory addresses of certain variables, etc. + +Fortunately things are not too complex, as in most cases, the compiler +ensures that things will work for you. For example, take the following +structure:: + + struct foo { + u16 field1; + u32 field2; + u8 field3; + }; + +Let us assume that an instance of the above structure resides in memory +starting at address 0x10000. With a basic level of understanding, it would +not be unreasonable to expect that accessing field2 would cause an unaligned +access. You'd be expecting field2 to be located at offset 2 bytes into the +structure, i.e. address 0x10002, but that address is not evenly divisible +by 4 (remember, we're reading a 4 byte value here). + +Fortunately, the compiler understands the alignment constraints, so in the +above case it would insert 2 bytes of padding in between field1 and field2. +Therefore, for standard structure types you can always rely on the compiler +to pad structures so that accesses to fields are suitably aligned (assuming +you do not cast the field to a type of different length). + +Similarly, you can also rely on the compiler to align variables and function +parameters to a naturally aligned scheme, based on the size of the type of +the variable. + +At this point, it should be clear that accessing a single byte (u8 or char) +will never cause an unaligned access, because all memory addresses are evenly +divisible by one. + +On a related topic, with the above considerations in mind you may observe +that you could reorder the fields in the structure in order to place fields +where padding would otherwise be inserted, and hence reduce the overall +resident memory size of structure instances. The optimal layout of the +above example is:: + + struct foo { + u32 field2; + u16 field1; + u8 field3; + }; + +For a natural alignment scheme, the compiler would only have to add a single +byte of padding at the end of the structure. This padding is added in order +to satisfy alignment constraints for arrays of these structures. + +Another point worth mentioning is the use of __attribute__((packed)) on a +structure type. This GCC-specific attribute tells the compiler never to +insert any padding within structures, useful when you want to use a C struct +to represent some data that comes in a fixed arrangement 'off the wire'. + +You might be inclined to believe that usage of this attribute can easily +lead to unaligned accesses when accessing fields that do not satisfy +architectural alignment requirements. However, again, the compiler is aware +of the alignment constraints and will generate extra instructions to perform +the memory access in a way that does not cause unaligned access. Of course, +the extra instructions obviously cause a loss in performance compared to the +non-packed case, so the packed attribute should only be used when avoiding +structure padding is of importance. + + +Code that causes unaligned access +================================= + +With the above in mind, let's move onto a real life example of a function +that can cause an unaligned memory access. The following function taken +from include/linux/etherdevice.h is an optimized routine to compare two +ethernet MAC addresses for equality:: + + bool ether_addr_equal(const u8 *addr1, const u8 *addr2) + { + #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) | + ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4))); + + return fold == 0; + #else + const u16 *a = (const u16 *)addr1; + const u16 *b = (const u16 *)addr2; + return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0; + #endif + } + +In the above function, when the hardware has efficient unaligned access +capability, there is no issue with this code. But when the hardware isn't +able to access memory on arbitrary boundaries, the reference to a[0] causes +2 bytes (16 bits) to be read from memory starting at address addr1. + +Think about what would happen if addr1 was an odd address such as 0x10003. +(Hint: it'd be an unaligned access.) + +Despite the potential unaligned access problems with the above function, it +is included in the kernel anyway but is understood to only work normally on +16-bit-aligned addresses. It is up to the caller to ensure this alignment or +not use this function at all. This alignment-unsafe function is still useful +as it is a decent optimization for the cases when you can ensure alignment, +which is true almost all of the time in ethernet networking context. + + +Here is another example of some code that could cause unaligned accesses:: + + void myfunc(u8 *data, u32 value) + { + [...] + *((u32 *) data) = cpu_to_le32(value); + [...] + } + +This code will cause unaligned accesses every time the data parameter points +to an address that is not evenly divisible by 4. + +In summary, the 2 main scenarios where you may run into unaligned access +problems involve: + + 1. Casting variables to types of different lengths + 2. Pointer arithmetic followed by access to at least 2 bytes of data + + +Avoiding unaligned accesses +=========================== + +The easiest way to avoid unaligned access is to use the get_unaligned() and +put_unaligned() macros provided by the <asm/unaligned.h> header file. + +Going back to an earlier example of code that potentially causes unaligned +access:: + + void myfunc(u8 *data, u32 value) + { + [...] + *((u32 *) data) = cpu_to_le32(value); + [...] + } + +To avoid the unaligned memory access, you would rewrite it as follows:: + + void myfunc(u8 *data, u32 value) + { + [...] + value = cpu_to_le32(value); + put_unaligned(value, (u32 *) data); + [...] + } + +The get_unaligned() macro works similarly. Assuming 'data' is a pointer to +memory and you wish to avoid unaligned access, its usage is as follows:: + + u32 value = get_unaligned((u32 *) data); + +These macros work for memory accesses of any length (not just 32 bits as +in the examples above). Be aware that when compared to standard access of +aligned memory, using these macros to access unaligned memory can be costly in +terms of performance. + +If use of such macros is not convenient, another option is to use memcpy(), +where the source or destination (or both) are of type u8* or unsigned char*. +Due to the byte-wise nature of this operation, unaligned accesses are avoided. + + +Alignment vs. Networking +======================== + +On architectures that require aligned loads, networking requires that the IP +header is aligned on a four-byte boundary to optimise the IP stack. For +regular ethernet hardware, the constant NET_IP_ALIGN is used. On most +architectures this constant has the value 2 because the normal ethernet +header is 14 bytes long, so in order to get proper alignment one needs to +DMA to an address which can be expressed as 4*n + 2. One notable exception +here is powerpc which defines NET_IP_ALIGN to 0 because DMA to unaligned +addresses can be very expensive and dwarf the cost of unaligned loads. + +For some ethernet hardware that cannot DMA to unaligned addresses like +4*n+2 or non-ethernet hardware, this can be a problem, and it is then +required to copy the incoming frame into an aligned buffer. Because this is +unnecessary on architectures that can do unaligned accesses, the code can be +made dependent on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS like so:: + + #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + skb = original skb + #else + skb = copy skb + #endif diff --git a/Documentation/core-api/watch_queue.rst b/Documentation/core-api/watch_queue.rst new file mode 100644 index 0000000000..54f13ad5fc --- /dev/null +++ b/Documentation/core-api/watch_queue.rst @@ -0,0 +1,343 @@ +============================== +General notification mechanism +============================== + +The general notification mechanism is built on top of the standard pipe driver +whereby it effectively splices notification messages from the kernel into pipes +opened by userspace. This can be used in conjunction with:: + + * Key/keyring notifications + + +The notifications buffers can be enabled by: + + "General setup"/"General notification queue" + (CONFIG_WATCH_QUEUE) + +This document has the following sections: + +.. contents:: :local: + + +Overview +======== + +This facility appears as a pipe that is opened in a special mode. The pipe's +internal ring buffer is used to hold messages that are generated by the kernel. +These messages are then read out by read(). Splice and similar are disabled on +such pipes due to them wanting to, under some circumstances, revert their +additions to the ring - which might end up interleaved with notification +messages. + +The owner of the pipe has to tell the kernel which sources it would like to +watch through that pipe. Only sources that have been connected to a pipe will +insert messages into it. Note that a source may be bound to multiple pipes and +insert messages into all of them simultaneously. + +Filters may also be emplaced on a pipe so that certain source types and +subevents can be ignored if they're not of interest. + +A message will be discarded if there isn't a slot available in the ring or if +no preallocated message buffer is available. In both of these cases, read() +will insert a WATCH_META_LOSS_NOTIFICATION message into the output buffer after +the last message currently in the buffer has been read. + +Note that when producing a notification, the kernel does not wait for the +consumers to collect it, but rather just continues on. This means that +notifications can be generated whilst spinlocks are held and also protects the +kernel from being held up indefinitely by a userspace malfunction. + + +Message Structure +================= + +Notification messages begin with a short header:: + + struct watch_notification { + __u32 type:24; + __u32 subtype:8; + __u32 info; + }; + +"type" indicates the source of the notification record and "subtype" indicates +the type of record from that source (see the Watch Sources section below). The +type may also be "WATCH_TYPE_META". This is a special record type generated +internally by the watch queue itself. There are two subtypes: + + * WATCH_META_REMOVAL_NOTIFICATION + * WATCH_META_LOSS_NOTIFICATION + +The first indicates that an object on which a watch was installed was removed +or destroyed and the second indicates that some messages have been lost. + +"info" indicates a bunch of things, including: + + * The length of the message in bytes, including the header (mask with + WATCH_INFO_LENGTH and shift by WATCH_INFO_LENGTH__SHIFT). This indicates + the size of the record, which may be between 8 and 127 bytes. + + * The watch ID (mask with WATCH_INFO_ID and shift by WATCH_INFO_ID__SHIFT). + This indicates that caller's ID of the watch, which may be between 0 + and 255. Multiple watches may share a queue, and this provides a means to + distinguish them. + + * A type-specific field (WATCH_INFO_TYPE_INFO). This is set by the + notification producer to indicate some meaning specific to the type and + subtype. + +Everything in info apart from the length can be used for filtering. + +The header can be followed by supplementary information. The format of this is +at the discretion is defined by the type and subtype. + + +Watch List (Notification Source) API +==================================== + +A "watch list" is a list of watchers that are subscribed to a source of +notifications. A list may be attached to an object (say a key or a superblock) +or may be global (say for device events). From a userspace perspective, a +non-global watch list is typically referred to by reference to the object it +belongs to (such as using KEYCTL_NOTIFY and giving it a key serial number to +watch that specific key). + +To manage a watch list, the following functions are provided: + + * :: + + void init_watch_list(struct watch_list *wlist, + void (*release_watch)(struct watch *wlist)); + + Initialise a watch list. If ``release_watch`` is not NULL, then this + indicates a function that should be called when the watch_list object is + destroyed to discard any references the watch list holds on the watched + object. + + * ``void remove_watch_list(struct watch_list *wlist);`` + + This removes all of the watches subscribed to a watch_list and frees them + and then destroys the watch_list object itself. + + +Watch Queue (Notification Output) API +===================================== + +A "watch queue" is the buffer allocated by an application that notification +records will be written into. The workings of this are hidden entirely inside +of the pipe device driver, but it is necessary to gain a reference to it to set +a watch. These can be managed with: + + * ``struct watch_queue *get_watch_queue(int fd);`` + + Since watch queues are indicated to the kernel by the fd of the pipe that + implements the buffer, userspace must hand that fd through a system call. + This can be used to look up an opaque pointer to the watch queue from the + system call. + + * ``void put_watch_queue(struct watch_queue *wqueue);`` + + This discards the reference obtained from ``get_watch_queue()``. + + +Watch Subscription API +====================== + +A "watch" is a subscription on a watch list, indicating the watch queue, and +thus the buffer, into which notification records should be written. The watch +queue object may also carry filtering rules for that object, as set by +userspace. Some parts of the watch struct can be set by the driver:: + + struct watch { + union { + u32 info_id; /* ID to be OR'd in to info field */ + ... + }; + void *private; /* Private data for the watched object */ + u64 id; /* Internal identifier */ + ... + }; + +The ``info_id`` value should be an 8-bit number obtained from userspace and +shifted by WATCH_INFO_ID__SHIFT. This is OR'd into the WATCH_INFO_ID field of +struct watch_notification::info when and if the notification is written into +the associated watch queue buffer. + +The ``private`` field is the driver's data associated with the watch_list and +is cleaned up by the ``watch_list::release_watch()`` method. + +The ``id`` field is the source's ID. Notifications that are posted with a +different ID are ignored. + +The following functions are provided to manage watches: + + * ``void init_watch(struct watch *watch, struct watch_queue *wqueue);`` + + Initialise a watch object, setting its pointer to the watch queue, using + appropriate barriering to avoid lockdep complaints. + + * ``int add_watch_to_object(struct watch *watch, struct watch_list *wlist);`` + + Subscribe a watch to a watch list (notification source). The + driver-settable fields in the watch struct must have been set before this + is called. + + * :: + + int remove_watch_from_object(struct watch_list *wlist, + struct watch_queue *wqueue, + u64 id, false); + + Remove a watch from a watch list, where the watch must match the specified + watch queue (``wqueue``) and object identifier (``id``). A notification + (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue to + indicate that the watch got removed. + + * ``int remove_watch_from_object(struct watch_list *wlist, NULL, 0, true);`` + + Remove all the watches from a watch list. It is expected that this will be + called preparatory to destruction and that the watch list will be + inaccessible to new watches by this point. A notification + (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue of each + subscribed watch to indicate that the watch got removed. + + +Notification Posting API +======================== + +To post a notification to watch list so that the subscribed watches can see it, +the following function should be used:: + + void post_watch_notification(struct watch_list *wlist, + struct watch_notification *n, + const struct cred *cred, + u64 id); + +The notification should be preformatted and a pointer to the header (``n``) +should be passed in. The notification may be larger than this and the size in +units of buffer slots is noted in ``n->info & WATCH_INFO_LENGTH``. + +The ``cred`` struct indicates the credentials of the source (subject) and is +passed to the LSMs, such as SELinux, to allow or suppress the recording of the +note in each individual queue according to the credentials of that queue +(object). + +The ``id`` is the ID of the source object (such as the serial number on a key). +Only watches that have the same ID set in them will see this notification. + + +Watch Sources +============= + +Any particular buffer can be fed from multiple sources. Sources include: + + * WATCH_TYPE_KEY_NOTIFY + + Notifications of this type indicate changes to keys and keyrings, including + the changes of keyring contents or the attributes of keys. + + See Documentation/security/keys/core.rst for more information. + + +Event Filtering +=============== + +Once a watch queue has been created, a set of filters can be applied to limit +the events that are received using:: + + struct watch_notification_filter filter = { + ... + }; + ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter) + +The filter description is a variable of type:: + + struct watch_notification_filter { + __u32 nr_filters; + __u32 __reserved; + struct watch_notification_type_filter filters[]; + }; + +Where "nr_filters" is the number of filters in filters[] and "__reserved" +should be 0. The "filters" array has elements of the following type:: + + struct watch_notification_type_filter { + __u32 type; + __u32 info_filter; + __u32 info_mask; + __u32 subtype_filter[8]; + }; + +Where: + + * ``type`` is the event type to filter for and should be something like + "WATCH_TYPE_KEY_NOTIFY" + + * ``info_filter`` and ``info_mask`` act as a filter on the info field of the + notification record. The notification is only written into the buffer if:: + + (watch.info & info_mask) == info_filter + + This could be used, for example, to ignore events that are not exactly on + the watched point in a mount tree. + + * ``subtype_filter`` is a bitmask indicating the subtypes that are of + interest. Bit 0 of subtype_filter[0] corresponds to subtype 0, bit 1 to + subtype 1, and so on. + +If the argument to the ioctl() is NULL, then the filters will be removed and +all events from the watched sources will come through. + + +Userspace Code Example +====================== + +A buffer is created with something like the following:: + + pipe2(fds, O_TMPFILE); + ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256); + +It can then be set to receive keyring change notifications:: + + keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); + +The notifications can then be consumed by something like the following:: + + static void consumer(int rfd, struct watch_queue_buffer *buf) + { + unsigned char buffer[128]; + ssize_t buf_len; + + while (buf_len = read(rfd, buffer, sizeof(buffer)), + buf_len > 0 + ) { + void *p = buffer; + void *end = buffer + buf_len; + while (p < end) { + union { + struct watch_notification n; + unsigned char buf1[128]; + } n; + size_t largest, len; + + largest = end - p; + if (largest > 128) + largest = 128; + memcpy(&n, p, largest); + + len = (n->info & WATCH_INFO_LENGTH) >> + WATCH_INFO_LENGTH__SHIFT; + if (len == 0 || len > largest) + return; + + switch (n.n.type) { + case WATCH_TYPE_META: + got_meta(&n.n); + case WATCH_TYPE_KEY_NOTIFY: + saw_key_change(&n.n); + break; + } + + p += len; + } + } + } diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst new file mode 100644 index 0000000000..0046af0653 --- /dev/null +++ b/Documentation/core-api/workqueue.rst @@ -0,0 +1,763 @@ +========= +Workqueue +========= + +:Date: September, 2010 +:Author: Tejun Heo <tj@kernel.org> +:Author: Florian Mickler <florian@mickler.org> + + +Introduction +============ + +There are many cases where an asynchronous process execution context +is needed and the workqueue (wq) API is the most commonly used +mechanism for such cases. + +When such an asynchronous execution context is needed, a work item +describing which function to execute is put on a queue. An +independent thread serves as the asynchronous execution context. The +queue is called workqueue and the thread is called worker. + +While there are work items on the workqueue the worker executes the +functions associated with the work items one after the other. When +there is no work item left on the workqueue the worker becomes idle. +When a new work item gets queued, the worker begins executing again. + + +Why Concurrency Managed Workqueue? +================================== + +In the original wq implementation, a multi threaded (MT) wq had one +worker thread per CPU and a single threaded (ST) wq had one worker +thread system-wide. A single MT wq needed to keep around the same +number of workers as the number of CPUs. The kernel grew a lot of MT +wq users over the years and with the number of CPU cores continuously +rising, some systems saturated the default 32k PID space just booting +up. + +Although MT wq wasted a lot of resource, the level of concurrency +provided was unsatisfactory. The limitation was common to both ST and +MT wq albeit less severe on MT. Each wq maintained its own separate +worker pool. An MT wq could provide only one execution context per CPU +while an ST wq one for the whole system. Work items had to compete for +those very limited execution contexts leading to various problems +including proneness to deadlocks around the single execution context. + +The tension between the provided level of concurrency and resource +usage also forced its users to make unnecessary tradeoffs like libata +choosing to use ST wq for polling PIOs and accepting an unnecessary +limitation that no two polling PIOs can progress at the same time. As +MT wq don't provide much better concurrency, users which require +higher level of concurrency, like async or fscache, had to implement +their own thread pool. + +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with +focus on the following goals. + +* Maintain compatibility with the original workqueue API. + +* Use per-CPU unified worker pools shared by all wq to provide + flexible level of concurrency on demand without wasting a lot of + resource. + +* Automatically regulate worker pool and level of concurrency so that + the API users don't need to worry about such details. + + +The Design +========== + +In order to ease the asynchronous execution of functions a new +abstraction, the work item, is introduced. + +A work item is a simple struct that holds a pointer to the function +that is to be executed asynchronously. Whenever a driver or subsystem +wants a function to be executed asynchronously it has to set up a work +item pointing to that function and queue that work item on a +workqueue. + +Special purpose threads, called worker threads, execute the functions +off of the queue, one after the other. If no work is queued, the +worker threads become idle. These worker threads are managed in so +called worker-pools. + +The cmwq design differentiates between the user-facing workqueues that +subsystems and drivers queue work items on and the backend mechanism +which manages worker-pools and processes the queued work items. + +There are two worker-pools, one for normal work items and the other +for high priority ones, for each possible CPU and some extra +worker-pools to serve work items queued on unbound workqueues - the +number of these backing pools is dynamic. + +Subsystems and drivers can create and queue work items through special +workqueue API functions as they see fit. They can influence some +aspects of the way the work items are executed by setting flags on the +workqueue they are putting the work item on. These flags include +things like CPU locality, concurrency limits, priority and more. To +get a detailed overview refer to the API description of +``alloc_workqueue()`` below. + +When a work item is queued to a workqueue, the target worker-pool is +determined according to the queue parameters and workqueue attributes +and appended on the shared worklist of the worker-pool. For example, +unless specifically overridden, a work item of a bound workqueue will +be queued on the worklist of either normal or highpri worker-pool that +is associated to the CPU the issuer is running on. + +For any worker pool implementation, managing the concurrency level +(how many execution contexts are active) is an important issue. cmwq +tries to keep the concurrency at a minimal but sufficient level. +Minimal to save resources and sufficient in that the system is used at +its full capacity. + +Each worker-pool bound to an actual CPU implements concurrency +management by hooking into the scheduler. The worker-pool is notified +whenever an active worker wakes up or sleeps and keeps track of the +number of the currently runnable workers. Generally, work items are +not expected to hog a CPU and consume many cycles. That means +maintaining just enough concurrency to prevent work processing from +stalling should be optimal. As long as there are one or more runnable +workers on the CPU, the worker-pool doesn't start execution of a new +work, but, when the last running worker goes to sleep, it immediately +schedules a new worker so that the CPU doesn't sit idle while there +are pending work items. This allows using a minimal number of workers +without losing execution bandwidth. + +Keeping idle workers around doesn't cost other than the memory space +for kthreads, so cmwq holds onto idle ones for a while before killing +them. + +For unbound workqueues, the number of backing pools is dynamic. +Unbound workqueue can be assigned custom attributes using +``apply_workqueue_attrs()`` and workqueue will automatically create +backing worker pools matching the attributes. The responsibility of +regulating concurrency level is on the users. There is also a flag to +mark a bound wq to ignore the concurrency management. Please refer to +the API section for details. + +Forward progress guarantee relies on that workers can be created when +more execution contexts are necessary, which in turn is guaranteed +through the use of rescue workers. All work items which might be used +on code paths that handle memory reclaim are required to be queued on +wq's that have a rescue-worker reserved for execution under memory +pressure. Else it is possible that the worker-pool deadlocks waiting +for execution contexts to free up. + + +Application Programming Interface (API) +======================================= + +``alloc_workqueue()`` allocates a wq. The original +``create_*workqueue()`` functions are deprecated and scheduled for +removal. ``alloc_workqueue()`` takes three arguments - ``@name``, +``@flags`` and ``@max_active``. ``@name`` is the name of the wq and +also used as the name of the rescuer thread if there is one. + +A wq no longer manages execution resources but serves as a domain for +forward progress guarantee, flush and work item attributes. ``@flags`` +and ``@max_active`` control how work items are assigned execution +resources, scheduled and executed. + + +``flags`` +--------- + +``WQ_UNBOUND`` + Work items queued to an unbound wq are served by the special + worker-pools which host workers which are not bound to any + specific CPU. This makes the wq behave as a simple execution + context provider without concurrency management. The unbound + worker-pools try to start execution of work items as soon as + possible. Unbound wq sacrifices locality but is useful for + the following cases. + + * Wide fluctuation in the concurrency level requirement is + expected and using bound wq may end up creating large number + of mostly unused workers across different CPUs as the issuer + hops through different CPUs. + + * Long running CPU intensive workloads which can be better + managed by the system scheduler. + +``WQ_FREEZABLE`` + A freezable wq participates in the freeze phase of the system + suspend operations. Work items on the wq are drained and no + new work item starts execution until thawed. + +``WQ_MEM_RECLAIM`` + All wq which might be used in the memory reclaim paths **MUST** + have this flag set. The wq is guaranteed to have at least one + execution context regardless of memory pressure. + +``WQ_HIGHPRI`` + Work items of a highpri wq are queued to the highpri + worker-pool of the target cpu. Highpri worker-pools are + served by worker threads with elevated nice level. + + Note that normal and highpri worker-pools don't interact with + each other. Each maintains its separate pool of workers and + implements concurrency management among its workers. + +``WQ_CPU_INTENSIVE`` + Work items of a CPU intensive wq do not contribute to the + concurrency level. In other words, runnable CPU intensive + work items will not prevent other work items in the same + worker-pool from starting execution. This is useful for bound + work items which are expected to hog CPU cycles so that their + execution is regulated by the system scheduler. + + Although CPU intensive work items don't contribute to the + concurrency level, start of their executions is still + regulated by the concurrency management and runnable + non-CPU-intensive work items can delay execution of CPU + intensive work items. + + This flag is meaningless for unbound wq. + + +``max_active`` +-------------- + +``@max_active`` determines the maximum number of execution contexts per +CPU which can be assigned to the work items of a wq. For example, with +``@max_active`` of 16, at most 16 work items of the wq can be executing +at the same time per CPU. This is always a per-CPU attribute, even for +unbound workqueues. + +The maximum limit for ``@max_active`` is 512 and the default value used +when 0 is specified is 256. These values are chosen sufficiently high +such that they are not the limiting factor while providing protection in +runaway cases. + +The number of active work items of a wq is usually regulated by the +users of the wq, more specifically, by how many work items the users +may queue at the same time. Unless there is a specific need for +throttling the number of active work items, specifying '0' is +recommended. + +Some users depend on the strict execution ordering of ST wq. The +combination of ``@max_active`` of 1 and ``WQ_UNBOUND`` used to +achieve this behavior. Work items on such wq were always queued to the +unbound worker-pools and only one work item could be active at any given +time thus achieving the same ordering property as ST wq. + +In the current implementation the above configuration only guarantees +ST behavior within a given NUMA node. Instead ``alloc_ordered_workqueue()`` should +be used to achieve system-wide ST behavior. + + +Example Execution Scenarios +=========================== + +The following example execution scenarios try to illustrate how cmwq +behave under different configurations. + + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms + again before finishing. w1 and w2 burn CPU for 5ms then sleep for + 10ms. + +Ignoring all other tasks, works and processing overhead, and assuming +simple FIFO scheduling, the following is one highly simplified version +of possible sequences of events with the original wq. :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 starts and burns CPU + 25 w1 sleeps + 35 w1 wakes up and finishes + 35 w2 starts and burns CPU + 40 w2 sleeps + 50 w2 wakes up and finishes + +And with cmwq with ``@max_active`` >= 3, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 10 w2 starts and burns CPU + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + +If ``@max_active`` == 2, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 20 w2 starts and burns CPU + 25 w2 sleeps + 35 w2 wakes up and finishes + +Now, let's assume w1 and w2 are queued to a different wq q1 which has +``WQ_CPU_INTENSIVE`` set, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 and w2 start and burn CPU + 10 w1 sleeps + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + + +Guidelines +========== + +* Do not forget to use ``WQ_MEM_RECLAIM`` if a wq may process work + items which are used during memory reclaim. Each wq with + ``WQ_MEM_RECLAIM`` set has an execution context reserved for it. If + there is dependency among multiple work items used during memory + reclaim, they should be queued to separate wq each with + ``WQ_MEM_RECLAIM``. + +* Unless strict ordering is required, there is no need to use ST wq. + +* Unless there is a specific need, using 0 for @max_active is + recommended. In most use cases, concurrency level usually stays + well under the default limit. + +* A wq serves as a domain for forward progress guarantee + (``WQ_MEM_RECLAIM``, flush and work item attributes. Work items + which are not involved in memory reclaim and don't need to be + flushed as a part of a group of work items, and don't require any + special attribute, can use one of the system wq. There is no + difference in execution characteristics between using a dedicated wq + and a system wq. + +* Unless work items are expected to consume a huge amount of CPU + cycles, using a bound wq is usually beneficial due to the increased + level of locality in wq operations and work item execution. + + +Affinity Scopes +=============== + +An unbound workqueue groups CPUs according to its affinity scope to improve +cache locality. For example, if a workqueue is using the default affinity +scope of "cache", it will group CPUs according to last level cache +boundaries. A work item queued on the workqueue will be assigned to a worker +on one of the CPUs which share the last level cache with the issuing CPU. +Once started, the worker may or may not be allowed to move outside the scope +depending on the ``affinity_strict`` setting of the scope. + +Workqueue currently supports the following affinity scopes. + +``default`` + Use the scope in module parameter ``workqueue.default_affinity_scope`` + which is always set to one of the scopes below. + +``cpu`` + CPUs are not grouped. A work item issued on one CPU is processed by a + worker on the same CPU. This makes unbound workqueues behave as per-cpu + workqueues without concurrency management. + +``smt`` + CPUs are grouped according to SMT boundaries. This usually means that the + logical threads of each physical CPU core are grouped together. + +``cache`` + CPUs are grouped according to cache boundaries. Which specific cache + boundary is used is determined by the arch code. L3 is used in a lot of + cases. This is the default affinity scope. + +``numa`` + CPUs are grouped according to NUMA bounaries. + +``system`` + All CPUs are put in the same group. Workqueue makes no effort to process a + work item on a CPU close to the issuing CPU. + +The default affinity scope can be changed with the module parameter +``workqueue.default_affinity_scope`` and a specific workqueue's affinity +scope can be changed using ``apply_workqueue_attrs()``. + +If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope +related interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/`` +directory. + +``affinity_scope`` + Read to see the current affinity scope. Write to change. + + When default is the current scope, reading this file will also show the + current effective scope in parentheses, for example, ``default (cache)``. + +``affinity_strict`` + 0 by default indicating that affinity scopes are not strict. When a work + item starts execution, workqueue makes a best-effort attempt to ensure + that the worker is inside its affinity scope, which is called + repatriation. Once started, the scheduler is free to move the worker + anywhere in the system as it sees fit. This enables benefiting from scope + locality while still being able to utilize other CPUs if necessary and + available. + + If set to 1, all workers of the scope are guaranteed always to be in the + scope. This may be useful when crossing affinity scopes has other + implications, for example, in terms of power consumption or workload + isolation. Strict NUMA scope can also be used to match the workqueue + behavior of older kernels. + + +Affinity Scopes and Performance +=============================== + +It'd be ideal if an unbound workqueue's behavior is optimal for vast +majority of use cases without further tuning. Unfortunately, in the current +kernel, there exists a pronounced trade-off between locality and utilization +necessitating explicit configurations when workqueues are heavily used. + +Higher locality leads to higher efficiency where more work is performed for +the same number of consumed CPU cycles. However, higher locality may also +cause lower overall system utilization if the work items are not spread +enough across the affinity scopes by the issuers. The following performance +testing with dm-crypt clearly illustrates this trade-off. + +The tests are run on a CPU with 12-cores/24-threads split across four L3 +caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency. +``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and +opened with ``cryptsetup`` with default settings. + + +Scenario 1: Enough issuers and work spread across the machine +------------------------------------------------------------- + +The command used: :: + + $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \ + --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \ + --name=iops-test-job --verify=sha512 + +There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512`` +makes ``fio`` generate and read back the content each time which makes +execution locality matter between the issuer and ``kcryptd``. The followings +are the read bandwidths and CPU utilizations depending on different affinity +scope settings on ``kcryptd`` measured over five runs. Bandwidths are in +MiBps, and CPU util in percents. + +.. list-table:: + :widths: 16 20 20 + :header-rows: 1 + + * - Affinity + - Bandwidth (MiBps) + - CPU util (%) + + * - system + - 1159.40 ±1.34 + - 99.31 ±0.02 + + * - cache + - 1166.40 ±0.89 + - 99.34 ±0.01 + + * - cache (strict) + - 1166.00 ±0.71 + - 99.35 ±0.01 + +With enough issuers spread across the system, there is no downside to +"cache", strict or otherwise. All three configurations saturate the whole +machine but the cache-affine ones outperform by 0.6% thanks to improved +locality. + + +Scenario 2: Fewer issuers, enough work for saturation +----------------------------------------------------- + +The command used: :: + + $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \ + --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \ + --time_based --group_reporting --name=iops-test-job --verify=sha512 + +The only difference from the previous scenario is ``--numjobs=8``. There are +a third of the issuers but is still enough total work to saturate the +system. + +.. list-table:: + :widths: 16 20 20 + :header-rows: 1 + + * - Affinity + - Bandwidth (MiBps) + - CPU util (%) + + * - system + - 1155.40 ±0.89 + - 97.41 ±0.05 + + * - cache + - 1154.40 ±1.14 + - 96.15 ±0.09 + + * - cache (strict) + - 1112.00 ±4.64 + - 93.26 ±0.35 + +This is more than enough work to saturate the system. Both "system" and +"cache" are nearly saturating the machine but not fully. "cache" is using +less CPU but the better efficiency puts it at the same bandwidth as +"system". + +Eight issuers moving around over four L3 cache scope still allow "cache +(strict)" to mostly saturate the machine but the loss of work conservation +is now starting to hurt with 3.7% bandwidth loss. + + +Scenario 3: Even fewer issuers, not enough work to saturate +----------------------------------------------------------- + +The command used: :: + + $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \ + --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \ + --time_based --group_reporting --name=iops-test-job --verify=sha512 + +Again, the only difference is ``--numjobs=4``. With the number of issuers +reduced to four, there now isn't enough work to saturate the whole system +and the bandwidth becomes dependent on completion latencies. + +.. list-table:: + :widths: 16 20 20 + :header-rows: 1 + + * - Affinity + - Bandwidth (MiBps) + - CPU util (%) + + * - system + - 993.60 ±1.82 + - 75.49 ±0.06 + + * - cache + - 973.40 ±1.52 + - 74.90 ±0.07 + + * - cache (strict) + - 828.20 ±4.49 + - 66.84 ±0.29 + +Now, the tradeoff between locality and utilization is clearer. "cache" shows +2% bandwidth loss compared to "system" and "cache (struct)" whopping 20%. + + +Conclusion and Recommendations +------------------------------ + +In the above experiments, the efficiency advantage of the "cache" affinity +scope over "system" is, while consistent and noticeable, small. However, the +impact is dependent on the distances between the scopes and may be more +pronounced in processors with more complex topologies. + +While the loss of work-conservation in certain scenarios hurts, it is a lot +better than "cache (strict)" and maximizing workqueue utilization is +unlikely to be the common case anyway. As such, "cache" is the default +affinity scope for unbound pools. + +* As there is no one option which is great for most cases, workqueue usages + that may consume a significant amount of CPU are recommended to configure + the workqueues using ``apply_workqueue_attrs()`` and/or enable + ``WQ_SYSFS``. + +* An unbound workqueue with strict "cpu" affinity scope behaves the same as + ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advanage to the + latter and an unbound workqueue provides a lot more flexibility. + +* Affinity scopes are introduced in Linux v6.5. To emulate the previous + behavior, use strict "numa" affinity scope. + +* The loss of work-conservation in non-strict affinity scopes is likely + originating from the scheduler. There is no theoretical reason why the + kernel wouldn't be able to do the right thing and maintain + work-conservation in most cases. As such, it is possible that future + scheduler improvements may make most of these tunables unnecessary. + + +Examining Configuration +======================= + +Use tools/workqueue/wq_dump.py to examine unbound CPU affinity +configuration, worker pools and how workqueues map to the pools: :: + + $ tools/workqueue/wq_dump.py + Affinity Scopes + =============== + wq_unbound_cpumask=0000000f + + CPU + nr_pods 4 + pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 + pod_node [0]=0 [1]=0 [2]=1 [3]=1 + cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 + + SMT + nr_pods 4 + pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 + pod_node [0]=0 [1]=0 [2]=1 [3]=1 + cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 + + CACHE (default) + nr_pods 2 + pod_cpus [0]=00000003 [1]=0000000c + pod_node [0]=0 [1]=1 + cpu_pod [0]=0 [1]=0 [2]=1 [3]=1 + + NUMA + nr_pods 2 + pod_cpus [0]=00000003 [1]=0000000c + pod_node [0]=0 [1]=1 + cpu_pod [0]=0 [1]=0 [2]=1 [3]=1 + + SYSTEM + nr_pods 1 + pod_cpus [0]=0000000f + pod_node [0]=-1 + cpu_pod [0]=0 [1]=0 [2]=0 [3]=0 + + Worker Pools + ============ + pool[00] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 0 + pool[01] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 0 + pool[02] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 1 + pool[03] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 1 + pool[04] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 2 + pool[05] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 2 + pool[06] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 3 + pool[07] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 3 + pool[08] ref=42 nice= 0 idle/workers= 6/ 6 cpus=0000000f + pool[09] ref=28 nice= 0 idle/workers= 3/ 3 cpus=00000003 + pool[10] ref=28 nice= 0 idle/workers= 17/ 17 cpus=0000000c + pool[11] ref= 1 nice=-20 idle/workers= 1/ 1 cpus=0000000f + pool[12] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=00000003 + pool[13] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=0000000c + + Workqueue CPU -> pool + ===================== + [ workqueue \ CPU 0 1 2 3 dfl] + events percpu 0 2 4 6 + events_highpri percpu 1 3 5 7 + events_long percpu 0 2 4 6 + events_unbound unbound 9 9 10 10 8 + events_freezable percpu 0 2 4 6 + events_power_efficient percpu 0 2 4 6 + events_freezable_power_ percpu 0 2 4 6 + rcu_gp percpu 0 2 4 6 + rcu_par_gp percpu 0 2 4 6 + slub_flushwq percpu 0 2 4 6 + netns ordered 8 8 8 8 8 + ... + +See the command's help message for more info. + + +Monitoring +========== + +Use tools/workqueue/wq_monitor.py to monitor workqueue operations: :: + + $ tools/workqueue/wq_monitor.py events + total infl CPUtime CPUhog CMW/RPR mayday rescued + events 18545 0 6.1 0 5 - - + events_highpri 8 0 0.0 0 0 - - + events_long 3 0 0.0 0 0 - - + events_unbound 38306 0 0.1 - 7 - - + events_freezable 0 0 0.0 0 0 - - + events_power_efficient 29598 0 0.2 0 0 - - + events_freezable_power_ 10 0 0.0 0 0 - - + sock_diag_events 0 0 0.0 0 0 - - + + total infl CPUtime CPUhog CMW/RPR mayday rescued + events 18548 0 6.1 0 5 - - + events_highpri 8 0 0.0 0 0 - - + events_long 3 0 0.0 0 0 - - + events_unbound 38322 0 0.1 - 7 - - + events_freezable 0 0 0.0 0 0 - - + events_power_efficient 29603 0 0.2 0 0 - - + events_freezable_power_ 10 0 0.0 0 0 - - + sock_diag_events 0 0 0.0 0 0 - - + + ... + +See the command's help message for more info. + + +Debugging +========= + +Because the work functions are executed by generic worker threads +there are a few tricks needed to shed some light on misbehaving +workqueue users. + +Worker threads show up in the process list as: :: + + root 5671 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/0:1] + root 5672 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/1:2] + root 5673 0.0 0.0 0 0 ? S 12:12 0:00 [kworker/0:0] + root 5674 0.0 0.0 0 0 ? S 12:13 0:00 [kworker/1:0] + +If kworkers are going crazy (using too much cpu), there are two types +of possible problems: + + 1. Something being scheduled in rapid succession + 2. A single work item that consumes lots of cpu cycles + +The first one can be tracked using tracing: :: + + $ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event + $ cat /sys/kernel/tracing/trace_pipe > out.txt + (wait a few secs) + ^C + +If something is busy looping on work queueing, it would be dominating +the output and the offender can be determined with the work item +function. + +For the second type of problems it should be possible to just check +the stack trace of the offending worker thread. :: + + $ cat /proc/THE_OFFENDING_KWORKER/stack + +The work item's function should be trivially visible in the stack +trace. + + +Non-reentrance Conditions +========================= + +Workqueue guarantees that a work item cannot be re-entrant if the following +conditions hold after a work item gets queued: + + 1. The work function hasn't been changed. + 2. No one queues the work item to another workqueue. + 3. The work item hasn't been reinitiated. + +In other words, if the above conditions hold, the work item is guaranteed to be +executed by at most one worker system-wide at any given time. + +Note that requeuing the work item (to the same queue) in the self function +doesn't break these conditions, so it's safe to do. Otherwise, caution is +required when breaking the conditions inside a work function. + + +Kernel Inline Documentations Reference +====================================== + +.. kernel-doc:: include/linux/workqueue.h + +.. kernel-doc:: kernel/workqueue.c diff --git a/Documentation/core-api/wrappers/atomic_bitops.rst b/Documentation/core-api/wrappers/atomic_bitops.rst new file mode 100644 index 0000000000..bf24e4081a --- /dev/null +++ b/Documentation/core-api/wrappers/atomic_bitops.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + This is a simple wrapper to bring atomic_bitops.txt into the RST world + until such a time as that file can be converted directly. + +============= +Atomic bitops +============= + +.. raw:: latex + + \footnotesize + +.. include:: ../../atomic_bitops.txt + :literal: + +.. raw:: latex + + \normalsize diff --git a/Documentation/core-api/wrappers/atomic_t.rst b/Documentation/core-api/wrappers/atomic_t.rst new file mode 100644 index 0000000000..ed109a964c --- /dev/null +++ b/Documentation/core-api/wrappers/atomic_t.rst @@ -0,0 +1,19 @@ +.. SPDX-License-Identifier: GPL-2.0 + This is a simple wrapper to bring atomic_t.txt into the RST world + until such a time as that file can be converted directly. + +============ +Atomic types +============ + +.. raw:: latex + + \footnotesize + +.. include:: ../../atomic_t.txt + :literal: + +.. raw:: latex + + \normalsize + diff --git a/Documentation/core-api/wrappers/memory-barriers.rst b/Documentation/core-api/wrappers/memory-barriers.rst new file mode 100644 index 0000000000..532460b5e3 --- /dev/null +++ b/Documentation/core-api/wrappers/memory-barriers.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + This is a simple wrapper to bring memory-barriers.txt into the RST world + until such a time as that file can be converted directly. + +============================ +Linux kernel memory barriers +============================ + +.. raw:: latex + + \footnotesize + +.. include:: ../../memory-barriers.txt + :literal: + +.. raw:: latex + + \normalsize diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst new file mode 100644 index 0000000000..77e0ece2b1 --- /dev/null +++ b/Documentation/core-api/xarray.rst @@ -0,0 +1,496 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +====== +XArray +====== + +:Author: Matthew Wilcox + +Overview +======== + +The XArray is an abstract data type which behaves like a very large array +of pointers. It meets many of the same needs as a hash or a conventional +resizable array. Unlike a hash, it allows you to sensibly go to the +next or previous entry in a cache-efficient manner. In contrast to a +resizable array, there is no need to copy data or change MMU mappings in +order to grow the array. It is more memory-efficient, parallelisable +and cache friendly than a doubly-linked list. It takes advantage of +RCU to perform lookups without locking. + +The XArray implementation is efficient when the indices used are densely +clustered; hashing the object and using the hash as the index will not +perform well. The XArray is optimised for small indices, but still has +good performance with large indices. If your index can be larger than +``ULONG_MAX`` then the XArray is not the data type for you. The most +important user of the XArray is the page cache. + +Normal pointers may be stored in the XArray directly. They must be 4-byte +aligned, which is true for any pointer returned from kmalloc() and +alloc_page(). It isn't true for arbitrary user-space pointers, +nor for function pointers. You can store pointers to statically allocated +objects, as long as those objects have an alignment of at least 4. + +You can also store integers between 0 and ``LONG_MAX`` in the XArray. +You must first convert it into an entry using xa_mk_value(). +When you retrieve an entry from the XArray, you can check whether it is +a value entry by calling xa_is_value(), and convert it back to +an integer by calling xa_to_value(). + +Some users want to tag the pointers they store in the XArray. You can +call xa_tag_pointer() to create an entry with a tag, xa_untag_pointer() +to turn a tagged entry back into an untagged pointer and xa_pointer_tag() +to retrieve the tag of an entry. Tagged pointers use the same bits that +are used to distinguish value entries from normal pointers, so you must +decide whether they want to store value entries or tagged pointers in +any particular XArray. + +The XArray does not support storing IS_ERR() pointers as some +conflict with value entries or internal entries. + +An unusual feature of the XArray is the ability to create entries which +occupy a range of indices. Once stored to, looking up any index in +the range will return the same entry as looking up any other index in +the range. Storing to any index will store to all of them. Multi-index +entries can be explicitly split into smaller entries, or storing ``NULL`` +into any entry will cause the XArray to forget about the range. + +Normal API +========== + +Start by initialising an XArray, either with DEFINE_XARRAY() +for statically allocated XArrays or xa_init() for dynamically +allocated ones. A freshly-initialised XArray contains a ``NULL`` +pointer at every index. + +You can then set entries using xa_store() and get entries +using xa_load(). xa_store will overwrite any entry with the +new entry and return the previous entry stored at that index. You can +use xa_erase() instead of calling xa_store() with a +``NULL`` entry. There is no difference between an entry that has never +been stored to, one that has been erased and one that has most recently +had ``NULL`` stored to it. + +You can conditionally replace an entry at an index by using +xa_cmpxchg(). Like cmpxchg(), it will only succeed if +the entry at that index has the 'old' value. It also returns the entry +which was at that index; if it returns the same entry which was passed as +'old', then xa_cmpxchg() succeeded. + +If you want to only store a new entry to an index if the current entry +at that index is ``NULL``, you can use xa_insert() which +returns ``-EBUSY`` if the entry is not empty. + +You can copy entries out of the XArray into a plain array by calling +xa_extract(). Or you can iterate over the present entries in the XArray +by calling xa_for_each(), xa_for_each_start() or xa_for_each_range(). +You may prefer to use xa_find() or xa_find_after() to move to the next +present entry in the XArray. + +Calling xa_store_range() stores the same entry in a range +of indices. If you do this, some of the other operations will behave +in a slightly odd way. For example, marking the entry at one index +may result in the entry being marked at some, but not all of the other +indices. Storing into one index may result in the entry retrieved by +some, but not all of the other indices changing. + +Sometimes you need to ensure that a subsequent call to xa_store() +will not need to allocate memory. The xa_reserve() function +will store a reserved entry at the indicated index. Users of the +normal API will see this entry as containing ``NULL``. If you do +not need to use the reserved entry, you can call xa_release() +to remove the unused entry. If another user has stored to the entry +in the meantime, xa_release() will do nothing; if instead you +want the entry to become ``NULL``, you should use xa_erase(). +Using xa_insert() on a reserved entry will fail. + +If all entries in the array are ``NULL``, the xa_empty() function +will return ``true``. + +Finally, you can remove all entries from an XArray by calling +xa_destroy(). If the XArray entries are pointers, you may wish +to free the entries first. You can do this by iterating over all present +entries in the XArray using the xa_for_each() iterator. + +Search Marks +------------ + +Each entry in the array has three bits associated with it called marks. +Each mark may be set or cleared independently of the others. You can +iterate over marked entries by using the xa_for_each_marked() iterator. + +You can enquire whether a mark is set on an entry by using +xa_get_mark(). If the entry is not ``NULL``, you can set a mark on it +by using xa_set_mark() and remove the mark from an entry by calling +xa_clear_mark(). You can ask whether any entry in the XArray has a +particular mark set by calling xa_marked(). Erasing an entry from the +XArray causes all marks associated with that entry to be cleared. + +Setting or clearing a mark on any index of a multi-index entry will +affect all indices covered by that entry. Querying the mark on any +index will return the same result. + +There is no way to iterate over entries which are not marked; the data +structure does not allow this to be implemented efficiently. There are +not currently iterators to search for logical combinations of bits (eg +iterate over all entries which have both ``XA_MARK_1`` and ``XA_MARK_2`` +set, or iterate over all entries which have ``XA_MARK_0`` or ``XA_MARK_2`` +set). It would be possible to add these if a user arises. + +Allocating XArrays +------------------ + +If you use DEFINE_XARRAY_ALLOC() to define the XArray, or +initialise it by passing ``XA_FLAGS_ALLOC`` to xa_init_flags(), +the XArray changes to track whether entries are in use or not. + +You can call xa_alloc() to store the entry at an unused index +in the XArray. If you need to modify the array from interrupt context, +you can use xa_alloc_bh() or xa_alloc_irq() to disable +interrupts while allocating the ID. + +Using xa_store(), xa_cmpxchg() or xa_insert() will +also mark the entry as being allocated. Unlike a normal XArray, storing +``NULL`` will mark the entry as being in use, like xa_reserve(). +To free an entry, use xa_erase() (or xa_release() if +you only want to free the entry if it's ``NULL``). + +By default, the lowest free entry is allocated starting from 0. If you +want to allocate entries starting at 1, it is more efficient to use +DEFINE_XARRAY_ALLOC1() or ``XA_FLAGS_ALLOC1``. If you want to +allocate IDs up to a maximum, then wrap back around to the lowest free +ID, you can use xa_alloc_cyclic(). + +You cannot use ``XA_MARK_0`` with an allocating XArray as this mark +is used to track whether an entry is free or not. The other marks are +available for your use. + +Memory allocation +----------------- + +The xa_store(), xa_cmpxchg(), xa_alloc(), +xa_reserve() and xa_insert() functions take a gfp_t +parameter in case the XArray needs to allocate memory to store this entry. +If the entry is being deleted, no memory allocation needs to be performed, +and the GFP flags specified will be ignored. + +It is possible for no memory to be allocatable, particularly if you pass +a restrictive set of GFP flags. In that case, the functions return a +special value which can be turned into an errno using xa_err(). +If you don't need to know exactly which error occurred, using +xa_is_err() is slightly more efficient. + +Locking +------- + +When using the Normal API, you do not have to worry about locking. +The XArray uses RCU and an internal spinlock to synchronise access: + +No lock needed: + * xa_empty() + * xa_marked() + +Takes RCU read lock: + * xa_load() + * xa_for_each() + * xa_for_each_start() + * xa_for_each_range() + * xa_find() + * xa_find_after() + * xa_extract() + * xa_get_mark() + +Takes xa_lock internally: + * xa_store() + * xa_store_bh() + * xa_store_irq() + * xa_insert() + * xa_insert_bh() + * xa_insert_irq() + * xa_erase() + * xa_erase_bh() + * xa_erase_irq() + * xa_cmpxchg() + * xa_cmpxchg_bh() + * xa_cmpxchg_irq() + * xa_store_range() + * xa_alloc() + * xa_alloc_bh() + * xa_alloc_irq() + * xa_reserve() + * xa_reserve_bh() + * xa_reserve_irq() + * xa_destroy() + * xa_set_mark() + * xa_clear_mark() + +Assumes xa_lock held on entry: + * __xa_store() + * __xa_insert() + * __xa_erase() + * __xa_cmpxchg() + * __xa_alloc() + * __xa_set_mark() + * __xa_clear_mark() + +If you want to take advantage of the lock to protect the data structures +that you are storing in the XArray, you can call xa_lock() +before calling xa_load(), then take a reference count on the +object you have found before calling xa_unlock(). This will +prevent stores from removing the object from the array between looking +up the object and incrementing the refcount. You can also use RCU to +avoid dereferencing freed memory, but an explanation of that is beyond +the scope of this document. + +The XArray does not disable interrupts or softirqs while modifying +the array. It is safe to read the XArray from interrupt or softirq +context as the RCU lock provides enough protection. + +If, for example, you want to store entries in the XArray in process +context and then erase them in softirq context, you can do that this way:: + + void foo_init(struct foo *foo) + { + xa_init_flags(&foo->array, XA_FLAGS_LOCK_BH); + } + + int foo_store(struct foo *foo, unsigned long index, void *entry) + { + int err; + + xa_lock_bh(&foo->array); + err = xa_err(__xa_store(&foo->array, index, entry, GFP_KERNEL)); + if (!err) + foo->count++; + xa_unlock_bh(&foo->array); + return err; + } + + /* foo_erase() is only called from softirq context */ + void foo_erase(struct foo *foo, unsigned long index) + { + xa_lock(&foo->array); + __xa_erase(&foo->array, index); + foo->count--; + xa_unlock(&foo->array); + } + +If you are going to modify the XArray from interrupt or softirq context, +you need to initialise the array using xa_init_flags(), passing +``XA_FLAGS_LOCK_IRQ`` or ``XA_FLAGS_LOCK_BH``. + +The above example also shows a common pattern of wanting to extend the +coverage of the xa_lock on the store side to protect some statistics +associated with the array. + +Sharing the XArray with interrupt context is also possible, either +using xa_lock_irqsave() in both the interrupt handler and process +context, or xa_lock_irq() in process context and xa_lock() +in the interrupt handler. Some of the more common patterns have helper +functions such as xa_store_bh(), xa_store_irq(), +xa_erase_bh(), xa_erase_irq(), xa_cmpxchg_bh() +and xa_cmpxchg_irq(). + +Sometimes you need to protect access to the XArray with a mutex because +that lock sits above another mutex in the locking hierarchy. That does +not entitle you to use functions like __xa_erase() without taking +the xa_lock; the xa_lock is used for lockdep validation and will be used +for other purposes in the future. + +The __xa_set_mark() and __xa_clear_mark() functions are also +available for situations where you look up an entry and want to atomically +set or clear a mark. It may be more efficient to use the advanced API +in this case, as it will save you from walking the tree twice. + +Advanced API +============ + +The advanced API offers more flexibility and better performance at the +cost of an interface which can be harder to use and has fewer safeguards. +No locking is done for you by the advanced API, and you are required +to use the xa_lock while modifying the array. You can choose whether +to use the xa_lock or the RCU lock while doing read-only operations on +the array. You can mix advanced and normal operations on the same array; +indeed the normal API is implemented in terms of the advanced API. The +advanced API is only available to modules with a GPL-compatible license. + +The advanced API is based around the xa_state. This is an opaque data +structure which you declare on the stack using the XA_STATE() macro. +This macro initialises the xa_state ready to start walking around the +XArray. It is used as a cursor to maintain the position in the XArray +and let you compose various operations together without having to restart +from the top every time. The contents of the xa_state are protected by +the rcu_read_lock() or the xas_lock(). If you need to drop whichever of +those locks is protecting your state and tree, you must call xas_pause() +so that future calls do not rely on the parts of the state which were +left unprotected. + +The xa_state is also used to store errors. You can call +xas_error() to retrieve the error. All operations check whether +the xa_state is in an error state before proceeding, so there's no need +for you to check for an error after each call; you can make multiple +calls in succession and only check at a convenient point. The only +errors currently generated by the XArray code itself are ``ENOMEM`` and +``EINVAL``, but it supports arbitrary errors in case you want to call +xas_set_err() yourself. + +If the xa_state is holding an ``ENOMEM`` error, calling xas_nomem() +will attempt to allocate more memory using the specified gfp flags and +cache it in the xa_state for the next attempt. The idea is that you take +the xa_lock, attempt the operation and drop the lock. The operation +attempts to allocate memory while holding the lock, but it is more +likely to fail. Once you have dropped the lock, xas_nomem() +can try harder to allocate more memory. It will return ``true`` if it +is worth retrying the operation (i.e. that there was a memory error *and* +more memory was allocated). If it has previously allocated memory, and +that memory wasn't used, and there is no error (or some error that isn't +``ENOMEM``), then it will free the memory previously allocated. + +Internal Entries +---------------- + +The XArray reserves some entries for its own purposes. These are never +exposed through the normal API, but when using the advanced API, it's +possible to see them. Usually the best way to handle them is to pass them +to xas_retry(), and retry the operation if it returns ``true``. + +.. flat-table:: + :widths: 1 1 6 + + * - Name + - Test + - Usage + + * - Node + - xa_is_node() + - An XArray node. May be visible when using a multi-index xa_state. + + * - Sibling + - xa_is_sibling() + - A non-canonical entry for a multi-index entry. The value indicates + which slot in this node has the canonical entry. + + * - Retry + - xa_is_retry() + - This entry is currently being modified by a thread which has the + xa_lock. The node containing this entry may be freed at the end + of this RCU period. You should restart the lookup from the head + of the array. + + * - Zero + - xa_is_zero() + - Zero entries appear as ``NULL`` through the Normal API, but occupy + an entry in the XArray which can be used to reserve the index for + future use. This is used by allocating XArrays for allocated entries + which are ``NULL``. + +Other internal entries may be added in the future. As far as possible, they +will be handled by xas_retry(). + +Additional functionality +------------------------ + +The xas_create_range() function allocates all the necessary memory +to store every entry in a range. It will set ENOMEM in the xa_state if +it cannot allocate memory. + +You can use xas_init_marks() to reset the marks on an entry +to their default state. This is usually all marks clear, unless the +XArray is marked with ``XA_FLAGS_TRACK_FREE``, in which case mark 0 is set +and all other marks are clear. Replacing one entry with another using +xas_store() will not reset the marks on that entry; if you want +the marks reset, you should do that explicitly. + +The xas_load() will walk the xa_state as close to the entry +as it can. If you know the xa_state has already been walked to the +entry and need to check that the entry hasn't changed, you can use +xas_reload() to save a function call. + +If you need to move to a different index in the XArray, call +xas_set(). This resets the cursor to the top of the tree, which +will generally make the next operation walk the cursor to the desired +spot in the tree. If you want to move to the next or previous index, +call xas_next() or xas_prev(). Setting the index does +not walk the cursor around the array so does not require a lock to be +held, while moving to the next or previous index does. + +You can search for the next present entry using xas_find(). This +is the equivalent of both xa_find() and xa_find_after(); +if the cursor has been walked to an entry, then it will find the next +entry after the one currently referenced. If not, it will return the +entry at the index of the xa_state. Using xas_next_entry() to +move to the next present entry instead of xas_find() will save +a function call in the majority of cases at the expense of emitting more +inline code. + +The xas_find_marked() function is similar. If the xa_state has +not been walked, it will return the entry at the index of the xa_state, +if it is marked. Otherwise, it will return the first marked entry after +the entry referenced by the xa_state. The xas_next_marked() +function is the equivalent of xas_next_entry(). + +When iterating over a range of the XArray using xas_for_each() +or xas_for_each_marked(), it may be necessary to temporarily stop +the iteration. The xas_pause() function exists for this purpose. +After you have done the necessary work and wish to resume, the xa_state +is in an appropriate state to continue the iteration after the entry +you last processed. If you have interrupts disabled while iterating, +then it is good manners to pause the iteration and reenable interrupts +every ``XA_CHECK_SCHED`` entries. + +The xas_get_mark(), xas_set_mark() and xas_clear_mark() functions require +the xa_state cursor to have been moved to the appropriate location in the +XArray; they will do nothing if you have called xas_pause() or xas_set() +immediately before. + +You can call xas_set_update() to have a callback function +called each time the XArray updates a node. This is used by the page +cache workingset code to maintain its list of nodes which contain only +shadow entries. + +Multi-Index Entries +------------------- + +The XArray has the ability to tie multiple indices together so that +operations on one index affect all indices. For example, storing into +any index will change the value of the entry retrieved from any index. +Setting or clearing a mark on any index will set or clear the mark +on every index that is tied together. The current implementation +only allows tying ranges which are aligned powers of two together; +eg indices 64-127 may be tied together, but 2-6 may not be. This may +save substantial quantities of memory; for example tying 512 entries +together will save over 4kB. + +You can create a multi-index entry by using XA_STATE_ORDER() +or xas_set_order() followed by a call to xas_store(). +Calling xas_load() with a multi-index xa_state will walk the +xa_state to the right location in the tree, but the return value is not +meaningful, potentially being an internal entry or ``NULL`` even when there +is an entry stored within the range. Calling xas_find_conflict() +will return the first entry within the range or ``NULL`` if there are no +entries in the range. The xas_for_each_conflict() iterator will +iterate over every entry which overlaps the specified range. + +If xas_load() encounters a multi-index entry, the xa_index +in the xa_state will not be changed. When iterating over an XArray +or calling xas_find(), if the initial index is in the middle +of a multi-index entry, it will not be altered. Subsequent calls +or iterations will move the index to the first index in the range. +Each entry will only be returned once, no matter how many indices it +occupies. + +Using xas_next() or xas_prev() with a multi-index xa_state is not +supported. Using either of these functions on a multi-index entry will +reveal sibling entries; these should be skipped over by the caller. + +Storing ``NULL`` into any index of a multi-index entry will set the +entry at every index to ``NULL`` and dissolve the tie. A multi-index +entry can be split into entries occupying smaller ranges by calling +xas_split_alloc() without the xa_lock held, followed by taking the lock +and calling xas_split(). + +Functions and structures +======================== + +.. kernel-doc:: include/linux/xarray.h +.. kernel-doc:: lib/xarray.c |