diff options
Diffstat (limited to 'Documentation/core-api')
24 files changed, 5873 insertions, 0 deletions
diff --git a/Documentation/core-api/assoc_array.rst b/Documentation/core-api/assoc_array.rst new file mode 100644 index 000000000..8231b915c --- /dev/null +++ b/Documentation/core-api/assoc_array.rst @@ -0,0 +1,554 @@ +======================================== +Generic Associative Array Implementation +======================================== + +Overview +======== + +This associative array implementation is an object container with the following +properties: + +1. Objects are opaque pointers. The implementation does not care where they + point (if anywhere) or what they point to (if anything). + + .. note:: + + Pointers to objects _must_ be zero in the least significant bit. + +2. Objects do not need to contain linkage blocks for use by the array. This + permits an object to be located in multiple arrays simultaneously. + Rather, the array is made up of metadata blocks that point to objects. + +3. Objects require index keys to locate them within the array. + +4. Index keys must be unique. Inserting an object with the same key as one + already in the array will replace the old object. + +5. Index keys can be of any length and can be of different lengths. + +6. Index keys should encode the length early on, before any variation due to + length is seen. + +7. Index keys can include a hash to scatter objects throughout the array. + +8. The array can iterated over. The objects will not necessarily come out in + key order. + +9. The array can be iterated over whilst it is being modified, provided the + RCU readlock is being held by the iterator. Note, however, under these + circumstances, some objects may be seen more than once. If this is a + problem, the iterator should lock against modification. Objects will not + be missed, however, unless deleted. + +10. Objects in the array can be looked up by means of their index key. + +11. Objects can be looked up whilst the array is being modified, provided the + RCU readlock is being held by the thread doing the look up. + +The implementation uses a tree of 16-pointer nodes internally that are indexed +on each level by nibbles from the index key in the same manner as in a radix +tree. To improve memory efficiency, shortcuts can be emplaced to skip over +what would otherwise be a series of single-occupancy nodes. Further, nodes +pack leaf object pointers into spare space in the node rather than making an +extra branch until as such time an object needs to be added to a full node. + + +The Public API +============== + +The public API can be found in ``<linux/assoc_array.h>``. The associative +array is rooted on the following structure:: + + struct assoc_array { + ... + }; + +The code is selected by enabling ``CONFIG_ASSOCIATIVE_ARRAY`` with:: + + ./script/config -e ASSOCIATIVE_ARRAY + + +Edit Script +----------- + +The insertion and deletion functions produce an 'edit script' that can later be +applied to effect the changes without risking ``ENOMEM``. This retains the +preallocated metadata blocks that will be installed in the internal tree and +keeps track of the metadata blocks that will be removed from the tree when the +script is applied. + +This is also used to keep track of dead blocks and dead objects after the +script has been applied so that they can be freed later. The freeing is done +after an RCU grace period has passed - thus allowing access functions to +proceed under the RCU read lock. + +The script appears as outside of the API as a pointer of the type:: + + struct assoc_array_edit; + +There are two functions for dealing with the script: + +1. Apply an edit script:: + + void assoc_array_apply_edit(struct assoc_array_edit *edit); + +This will perform the edit functions, interpolating various write barriers +to permit accesses under the RCU read lock to continue. The edit script +will then be passed to ``call_rcu()`` to free it and any dead stuff it points +to. + +2. Cancel an edit script:: + + void assoc_array_cancel_edit(struct assoc_array_edit *edit); + +This frees the edit script and all preallocated memory immediately. If +this was for insertion, the new object is _not_ released by this function, +but must rather be released by the caller. + +These functions are guaranteed not to fail. + + +Operations Table +---------------- + +Various functions take a table of operations:: + + struct assoc_array_ops { + ... + }; + +This points to a number of methods, all of which need to be provided: + +1. Get a chunk of index key from caller data:: + + unsigned long (*get_key_chunk)(const void *index_key, int level); + +This should return a chunk of caller-supplied index key starting at the +*bit* position given by the level argument. The level argument will be a +multiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return +``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``. No error is possible. + + +2. Get a chunk of an object's index key:: + + unsigned long (*get_object_key_chunk)(const void *object, int level); + +As the previous function, but gets its data from an object in the array +rather than from a caller-supplied index key. + + +3. See if this is the object we're looking for:: + + bool (*compare_object)(const void *object, const void *index_key); + +Compare the object against an index key and return ``true`` if it matches and +``false`` if it doesn't. + + +4. Diff the index keys of two objects:: + + int (*diff_objects)(const void *object, const void *index_key); + +Return the bit position at which the index key of the specified object +differs from the given index key or -1 if they are the same. + + +5. Free an object:: + + void (*free_object)(void *object); + +Free the specified object. Note that this may be called an RCU grace period +after ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may be +necessary on module unloading. + + +Manipulation Functions +---------------------- + +There are a number of functions for manipulating an associative array: + +1. Initialise an associative array:: + + void assoc_array_init(struct assoc_array *array); + +This initialises the base structure for an associative array. It can't fail. + + +2. Insert/replace an object in an associative array:: + + struct assoc_array_edit * + assoc_array_insert(struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key, + void *object); + +This inserts the given object into the array. Note that the least +significant bit of the pointer must be zero as it's used to type-mark +pointers internally. + +If an object already exists for that key then it will be replaced with the +new object and the old one will be freed automatically. + +The ``index_key`` argument should hold index key information and is +passed to the methods in the ops table when they are called. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. + +The caller should lock exclusively against other modifiers of the array. + + +3. Delete an object from an associative array:: + + struct assoc_array_edit * + assoc_array_delete(struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key); + +This deletes an object that matches the specified data from the array. + +The ``index_key`` argument should hold index key information and is +passed to the methods in the ops table when they are called. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. ``NULL`` will be returned if the specified object is +not found within the array. + +The caller should lock exclusively against other modifiers of the array. + + +4. Delete all objects from an associative array:: + + struct assoc_array_edit * + assoc_array_clear(struct assoc_array *array, + const struct assoc_array_ops *ops); + +This deletes all the objects from an associative array and leaves it +completely empty. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. + +The caller should lock exclusively against other modifiers of the array. + + +5. Destroy an associative array, deleting all objects:: + + void assoc_array_destroy(struct assoc_array *array, + const struct assoc_array_ops *ops); + +This destroys the contents of the associative array and leaves it +completely empty. It is not permitted for another thread to be traversing +the array under the RCU read lock at the same time as this function is +destroying it as no RCU deferral is performed on memory release - +something that would require memory to be allocated. + +The caller should lock exclusively against other modifiers and accessors +of the array. + + +6. Garbage collect an associative array:: + + int assoc_array_gc(struct assoc_array *array, + const struct assoc_array_ops *ops, + bool (*iterator)(void *object, void *iterator_data), + void *iterator_data); + +This iterates over the objects in an associative array and passes each one to +``iterator()``. If ``iterator()`` returns ``true``, the object is kept. If it +returns ``false``, the object will be freed. If the ``iterator()`` function +returns ``true``, it must perform any appropriate refcount incrementing on the +object before returning. + +The internal tree will be packed down if possible as part of the iteration +to reduce the number of nodes in it. + +The ``iterator_data`` is passed directly to ``iterator()`` and is otherwise +ignored by the function. + +The function will return ``0`` if successful and ``-ENOMEM`` if there wasn't +enough memory. + +It is possible for other threads to iterate over or search the array under +the RCU read lock whilst this function is in progress. The caller should +lock exclusively against other modifiers of the array. + + +Access Functions +---------------- + +There are two functions for accessing an associative array: + +1. Iterate over all the objects in an associative array:: + + int assoc_array_iterate(const struct assoc_array *array, + int (*iterator)(const void *object, + void *iterator_data), + void *iterator_data); + +This passes each object in the array to the iterator callback function. +``iterator_data`` is private data for that function. + +This may be used on an array at the same time as the array is being +modified, provided the RCU read lock is held. Under such circumstances, +it is possible for the iteration function to see some objects twice. If +this is a problem, then modification should be locked against. The +iteration algorithm should not, however, miss any objects. + +The function will return ``0`` if no objects were in the array or else it will +return the result of the last iterator function called. Iteration stops +immediately if any call to the iteration function results in a non-zero +return. + + +2. Find an object in an associative array:: + + void *assoc_array_find(const struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key); + +This walks through the array's internal tree directly to the object +specified by the index key.. + +This may be used on an array at the same time as the array is being +modified, provided the RCU read lock is held. + +The function will return the object if found (and set ``*_type`` to the object +type) or will return ``NULL`` if the object was not found. + + +Index Key Form +-------------- + +The index key can be of any form, but since the algorithms aren't told how long +the key is, it is strongly recommended that the index key includes its length +very early on before any variation due to the length would have an effect on +comparisons. + +This will cause leaves with different length keys to scatter away from each +other - and those with the same length keys to cluster together. + +It is also recommended that the index key begin with a hash of the rest of the +key to maximise scattering throughout keyspace. + +The better the scattering, the wider and lower the internal tree will be. + +Poor scattering isn't too much of a problem as there are shortcuts and nodes +can contain mixtures of leaves and metadata pointers. + +The index key is read in chunks of machine word. Each chunk is subdivided into +one nibble (4 bits) per level, so on a 32-bit CPU this is good for 8 levels and +on a 64-bit CPU, 16 levels. Unless the scattering is really poor, it is +unlikely that more than one word of any particular index key will have to be +used. + + +Internal Workings +================= + +The associative array data structure has an internal tree. This tree is +constructed of two types of metadata blocks: nodes and shortcuts. + +A node is an array of slots. Each slot can contain one of four things: + +* A NULL pointer, indicating that the slot is empty. +* A pointer to an object (a leaf). +* A pointer to a node at the next level. +* A pointer to a shortcut. + + +Basic Internal Tree Layout +-------------------------- + +Ignoring shortcuts for the moment, the nodes form a multilevel tree. The index +key space is strictly subdivided by the nodes in the tree and nodes occur on +fixed levels. For example:: + + Level: 0 1 2 3 + =============== =============== =============== =============== + NODE D + NODE B NODE C +------>+---+ + +------>+---+ +------>+---+ | | 0 | + NODE A | | 0 | | | 0 | | +---+ + +---+ | +---+ | +---+ | : : + | 0 | | : : | : : | +---+ + +---+ | +---+ | +---+ | | f | + | 1 |---+ | 3 |---+ | 7 |---+ +---+ + +---+ +---+ +---+ + : : : : | 8 |---+ + +---+ +---+ +---+ | NODE E + | e |---+ | f | : : +------>+---+ + +---+ | +---+ +---+ | 0 | + | f | | | f | +---+ + +---+ | +---+ : : + | NODE F +---+ + +------>+---+ | f | + | 0 | NODE G +---+ + +---+ +------>+---+ + : : | | 0 | + +---+ | +---+ + | 6 |---+ : : + +---+ +---+ + : : | f | + +---+ +---+ + | f | + +---+ + +In the above example, there are 7 nodes (A-G), each with 16 slots (0-f). +Assuming no other meta data nodes in the tree, the key space is divided +thusly:: + + KEY PREFIX NODE + ========== ==== + 137* D + 138* E + 13[0-69-f]* C + 1[0-24-f]* B + e6* G + e[0-57-f]* F + [02-df]* A + +So, for instance, keys with the following example index keys will be found in +the appropriate nodes:: + + INDEX KEY PREFIX NODE + =============== ======= ==== + 13694892892489 13 C + 13795289025897 137 D + 13889dde88793 138 E + 138bbb89003093 138 E + 1394879524789 12 C + 1458952489 1 B + 9431809de993ba - A + b4542910809cd - A + e5284310def98 e F + e68428974237 e6 G + e7fffcbd443 e F + f3842239082 - A + +To save memory, if a node can hold all the leaves in its portion of keyspace, +then the node will have all those leaves in it and will not have any metadata +pointers - even if some of those leaves would like to be in the same slot. + +A node can contain a heterogeneous mix of leaves and metadata pointers. +Metadata pointers must be in the slots that match their subdivisions of key +space. The leaves can be in any slot not occupied by a metadata pointer. It +is guaranteed that none of the leaves in a node will match a slot occupied by a +metadata pointer. If the metadata pointer is there, any leaf whose key matches +the metadata key prefix must be in the subtree that the metadata pointer points +to. + +In the above example list of index keys, node A will contain:: + + SLOT CONTENT INDEX KEY (PREFIX) + ==== =============== ================== + 1 PTR TO NODE B 1* + any LEAF 9431809de993ba + any LEAF b4542910809cd + e PTR TO NODE F e* + any LEAF f3842239082 + +and node B:: + + 3 PTR TO NODE C 13* + any LEAF 1458952489 + + +Shortcuts +--------- + +Shortcuts are metadata records that jump over a piece of keyspace. A shortcut +is a replacement for a series of single-occupancy nodes ascending through the +levels. Shortcuts exist to save memory and to speed up traversal. + +It is possible for the root of the tree to be a shortcut - say, for example, +the tree contains at least 17 nodes all with key prefix ``1111``. The +insertion algorithm will insert a shortcut to skip over the ``1111`` keyspace +in a single bound and get to the fourth level where these actually become +different. + + +Splitting And Collapsing Nodes +------------------------------ + +Each node has a maximum capacity of 16 leaves and metadata pointers. If the +insertion algorithm finds that it is trying to insert a 17th object into a +node, that node will be split such that at least two leaves that have a common +key segment at that level end up in a separate node rooted on that slot for +that common key segment. + +If the leaves in a full node and the leaf that is being inserted are +sufficiently similar, then a shortcut will be inserted into the tree. + +When the number of objects in the subtree rooted at a node falls to 16 or +fewer, then the subtree will be collapsed down to a single node - and this will +ripple towards the root if possible. + + +Non-Recursive Iteration +----------------------- + +Each node and shortcut contains a back pointer to its parent and the number of +slot in that parent that points to it. None-recursive iteration uses these to +proceed rootwards through the tree, going to the parent node, slot N + 1 to +make sure progress is made without the need for a stack. + +The backpointers, however, make simultaneous alteration and iteration tricky. + + +Simultaneous Alteration And Iteration +------------------------------------- + +There are a number of cases to consider: + +1. Simple insert/replace. This involves simply replacing a NULL or old + matching leaf pointer with the pointer to the new leaf after a barrier. + The metadata blocks don't change otherwise. An old leaf won't be freed + until after the RCU grace period. + +2. Simple delete. This involves just clearing an old matching leaf. The + metadata blocks don't change otherwise. The old leaf won't be freed until + after the RCU grace period. + +3. Insertion replacing part of a subtree that we haven't yet entered. This + may involve replacement of part of that subtree - but that won't affect + the iteration as we won't have reached the pointer to it yet and the + ancestry blocks are not replaced (the layout of those does not change). + +4. Insertion replacing nodes that we're actively processing. This isn't a + problem as we've passed the anchoring pointer and won't switch onto the + new layout until we follow the back pointers - at which point we've + already examined the leaves in the replaced node (we iterate over all the + leaves in a node before following any of its metadata pointers). + + We might, however, re-see some leaves that have been split out into a new + branch that's in a slot further along than we were at. + +5. Insertion replacing nodes that we're processing a dependent branch of. + This won't affect us until we follow the back pointers. Similar to (4). + +6. Deletion collapsing a branch under us. This doesn't affect us because the + back pointers will get us back to the parent of the new node before we + could see the new node. The entire collapsed subtree is thrown away + unchanged - and will still be rooted on the same slot, so we shouldn't + process it a second time as we'll go back to slot + 1. + +.. note:: + + Under some circumstances, we need to simultaneously change the parent + pointer and the parent slot pointer on a node (say, for example, we + inserted another node before it and moved it up a level). We cannot do + this without locking against a read - so we have to replace that node too. + + However, when we're changing a shortcut into a node this isn't a problem + as shortcuts only have one slot and so the parent slot number isn't used + when traversing backwards over one. This means that it's okay to change + the slot number first - provided suitable barriers are used to make sure + the parent slot number is read after the back pointer. + +Obsolete blocks and leaves are freed up after an RCU grace period has passed, +so as long as anyone doing walking or iteration holds the RCU read lock, the +old superstructure should not go away on them. diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst new file mode 100644 index 000000000..724583453 --- /dev/null +++ b/Documentation/core-api/atomic_ops.rst @@ -0,0 +1,664 @@ +======================================================= +Semantics and Behavior of Atomic and Bitmask Operations +======================================================= + +:Author: David S. Miller + +This document is intended to serve as a guide to Linux port +maintainers on how to implement atomic counter, bitops, and spinlock +interfaces properly. + +Atomic Type And Operations +========================== + +The atomic_t type should be defined as a signed integer and +the atomic_long_t type as a signed long integer. Also, they should +be made opaque such that any kind of cast to a normal C integer type +will fail. Something like the following should suffice:: + + typedef struct { int counter; } atomic_t; + typedef struct { long counter; } atomic_long_t; + +Historically, counter has been declared volatile. This is now discouraged. +See :ref:`Documentation/process/volatile-considered-harmful.rst +<volatile_considered_harmful>` for the complete rationale. + +local_t is very similar to atomic_t. If the counter is per CPU and only +updated by one CPU, local_t is probably more appropriate. Please see +:ref:`Documentation/core-api/local_ops.rst <local_ops>` for the semantics of +local_t. + +The first operations to implement for atomic_t's are the initializers and +plain writes. :: + + #define ATOMIC_INIT(i) { (i) } + #define atomic_set(v, i) ((v)->counter = (i)) + +The first macro is used in definitions, such as:: + + static atomic_t my_counter = ATOMIC_INIT(1); + +The initializer is atomic in that the return values of the atomic operations +are guaranteed to be correct reflecting the initialized value if the +initializer is used before runtime. If the initializer is used at runtime, a +proper implicit or explicit read memory barrier is needed before reading the +value with atomic_read from another thread. + +As with all of the ``atomic_`` interfaces, replace the leading ``atomic_`` +with ``atomic_long_`` to operate on atomic_long_t. + +The second interface can be used at runtime, as in:: + + struct foo { atomic_t counter; }; + ... + + struct foo *k; + + k = kmalloc(sizeof(*k), GFP_KERNEL); + if (!k) + return -ENOMEM; + atomic_set(&k->counter, 0); + +The setting is atomic in that the return values of the atomic operations by +all threads are guaranteed to be correct reflecting either the value that has +been set with this operation or set with another operation. A proper implicit +or explicit memory barrier is needed before the value set with the operation +is guaranteed to be readable with atomic_read from another thread. + +Next, we have:: + + #define atomic_read(v) ((v)->counter) + +which simply reads the counter value currently visible to the calling thread. +The read is atomic in that the return value is guaranteed to be one of the +values initialized or modified with the interface operations if a proper +implicit or explicit memory barrier is used after possible runtime +initialization by any other thread and the value is modified only with the +interface operations. atomic_read does not guarantee that the runtime +initialization by any other thread is visible yet, so the user of the +interface must take care of that with a proper implicit or explicit memory +barrier. + +.. warning:: + + ``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS! + + Some architectures may choose to use the volatile keyword, barriers, or + inline assembly to guarantee some degree of immediacy for atomic_read() + and atomic_set(). This is not uniformly guaranteed, and may change in + the future, so all users of atomic_t should treat atomic_read() and + atomic_set() as simple C statements that may be reordered or optimized + away entirely by the compiler or processor, and explicitly invoke the + appropriate compiler and/or memory barrier for each use case. Failure + to do so will result in code that may suddenly break when used with + different architectures or compiler optimizations, or even changes in + unrelated code which changes how the compiler optimizes the section + accessing atomic_t variables. + +Properly aligned pointers, longs, ints, and chars (and unsigned +equivalents) may be atomically loaded from and stored to in the same +sense as described for atomic_read() and atomic_set(). The READ_ONCE() +and WRITE_ONCE() macros should be used to prevent the compiler from using +optimizations that might otherwise optimize accesses out of existence on +the one hand, or that might create unsolicited accesses on the other. + +For example consider the following code:: + + while (a > 0) + do_something(); + +If the compiler can prove that do_something() does not store to the +variable a, then the compiler is within its rights transforming this to +the following:: + + if (a > 0) + for (;;) + do_something(); + +If you don't want the compiler to do this (and you probably don't), then +you should use something like the following:: + + while (READ_ONCE(a) > 0) + do_something(); + +Alternatively, you could place a barrier() call in the loop. + +For another example, consider the following code:: + + tmp_a = a; + do_something_with(tmp_a); + do_something_else_with(tmp_a); + +If the compiler can prove that do_something_with() does not store to the +variable a, then the compiler is within its rights to manufacture an +additional load as follows:: + + tmp_a = a; + do_something_with(tmp_a); + tmp_a = a; + do_something_else_with(tmp_a); + +This could fatally confuse your code if it expected the same value +to be passed to do_something_with() and do_something_else_with(). + +The compiler would be likely to manufacture this additional load if +do_something_with() was an inline function that made very heavy use +of registers: reloading from variable a could save a flush to the +stack and later reload. To prevent the compiler from attacking your +code in this manner, write the following:: + + tmp_a = READ_ONCE(a); + do_something_with(tmp_a); + do_something_else_with(tmp_a); + +For a final example, consider the following code, assuming that the +variable a is set at boot time before the second CPU is brought online +and never changed later, so that memory barriers are not needed:: + + if (a) + b = 9; + else + b = 42; + +The compiler is within its rights to manufacture an additional store +by transforming the above code into the following:: + + b = 42; + if (a) + b = 9; + +This could come as a fatal surprise to other code running concurrently +that expected b to never have the value 42 if a was zero. To prevent +the compiler from doing this, write something like:: + + if (a) + WRITE_ONCE(b, 9); + else + WRITE_ONCE(b, 42); + +Don't even -think- about doing this without proper use of memory barriers, +locks, or atomic operations if variable a can change at runtime! + +.. warning:: + + ``READ_ONCE()`` OR ``WRITE_ONCE()`` DO NOT IMPLY A BARRIER! + +Now, we move onto the atomic operation interfaces typically implemented with +the help of assembly code. :: + + void atomic_add(int i, atomic_t *v); + void atomic_sub(int i, atomic_t *v); + void atomic_inc(atomic_t *v); + void atomic_dec(atomic_t *v); + +These four routines add and subtract integral values to/from the given +atomic_t value. The first two routines pass explicit integers by +which to make the adjustment, whereas the latter two use an implicit +adjustment value of "1". + +One very important aspect of these two routines is that they DO NOT +require any explicit memory barriers. They need only perform the +atomic_t counter update in an SMP safe manner. + +Next, we have:: + + int atomic_inc_return(atomic_t *v); + int atomic_dec_return(atomic_t *v); + +These routines add 1 and subtract 1, respectively, from the given +atomic_t and return the new counter value after the operation is +performed. + +Unlike the above routines, it is required that these primitives +include explicit memory barriers that are performed before and after +the operation. It must be done such that all memory operations before +and after the atomic operation calls are strongly ordered with respect +to the atomic operation itself. + +For example, it should behave as if a smp_mb() call existed both +before and after the atomic operation. + +If the atomic instructions used in an implementation provide explicit +memory barrier semantics which satisfy the above requirements, that is +fine as well. + +Let's move on:: + + int atomic_add_return(int i, atomic_t *v); + int atomic_sub_return(int i, atomic_t *v); + +These behave just like atomic_{inc,dec}_return() except that an +explicit counter adjustment is given instead of the implicit "1". +This means that like atomic_{inc,dec}_return(), the memory barrier +semantics are required. + +Next:: + + int atomic_inc_and_test(atomic_t *v); + int atomic_dec_and_test(atomic_t *v); + +These two routines increment and decrement by 1, respectively, the +given atomic counter. They return a boolean indicating whether the +resulting counter value was zero or not. + +Again, these primitives provide explicit memory barrier semantics around +the atomic operation:: + + int atomic_sub_and_test(int i, atomic_t *v); + +This is identical to atomic_dec_and_test() except that an explicit +decrement is given instead of the implicit "1". This primitive must +provide explicit memory barrier semantics around the operation:: + + int atomic_add_negative(int i, atomic_t *v); + +The given increment is added to the given atomic counter value. A boolean +is return which indicates whether the resulting counter value is negative. +This primitive must provide explicit memory barrier semantics around +the operation. + +Then:: + + int atomic_xchg(atomic_t *v, int new); + +This performs an atomic exchange operation on the atomic variable v, setting +the given new value. It returns the old value that the atomic variable v had +just before the operation. + +atomic_xchg must provide explicit memory barriers around the operation. :: + + int atomic_cmpxchg(atomic_t *v, int old, int new); + +This performs an atomic compare exchange operation on the atomic value v, +with the given old and new values. Like all atomic_xxx operations, +atomic_cmpxchg will only satisfy its atomicity semantics as long as all +other accesses of \*v are performed through atomic_xxx operations. + +atomic_cmpxchg must provide explicit memory barriers around the operation, +although if the comparison fails then no memory ordering guarantees are +required. + +The semantics for atomic_cmpxchg are the same as those defined for 'cas' +below. + +Finally:: + + int atomic_add_unless(atomic_t *v, int a, int u); + +If the atomic value v is not equal to u, this function adds a to v, and +returns non zero. If v is equal to u then it returns zero. This is done as +an atomic operation. + +atomic_add_unless must provide explicit memory barriers around the +operation unless it fails (returns 0). + +atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0) + + +If a caller requires memory barrier semantics around an atomic_t +operation which does not return a value, a set of interfaces are +defined which accomplish this:: + + void smp_mb__before_atomic(void); + void smp_mb__after_atomic(void); + +Preceding a non-value-returning read-modify-write atomic operation with +smp_mb__before_atomic() and following it with smp_mb__after_atomic() +provides the same full ordering that is provided by value-returning +read-modify-write atomic operations. + +For example, smp_mb__before_atomic() can be used like so:: + + obj->dead = 1; + smp_mb__before_atomic(); + atomic_dec(&obj->ref_count); + +It makes sure that all memory operations preceding the atomic_dec() +call are strongly ordered with respect to the atomic counter +operation. In the above example, it guarantees that the assignment of +"1" to obj->dead will be globally visible to other cpus before the +atomic counter decrement. + +Without the explicit smp_mb__before_atomic() call, the +implementation could legally allow the atomic counter update visible +to other cpus before the "obj->dead = 1;" assignment. + +A missing memory barrier in the cases where they are required by the +atomic_t implementation above can have disastrous results. Here is +an example, which follows a pattern occurring frequently in the Linux +kernel. It is the use of atomic counters to implement reference +counting, and it works such that once the counter falls to zero it can +be guaranteed that no other entity can be accessing the object:: + + static void obj_list_add(struct obj *obj, struct list_head *head) + { + obj->active = 1; + list_add(&obj->list, head); + } + + static void obj_list_del(struct obj *obj) + { + list_del(&obj->list); + obj->active = 0; + } + + static void obj_destroy(struct obj *obj) + { + BUG_ON(obj->active); + kfree(obj); + } + + struct obj *obj_list_peek(struct list_head *head) + { + if (!list_empty(head)) { + struct obj *obj; + + obj = list_entry(head->next, struct obj, list); + atomic_inc(&obj->refcnt); + return obj; + } + return NULL; + } + + void obj_poke(void) + { + struct obj *obj; + + spin_lock(&global_list_lock); + obj = obj_list_peek(&global_list); + spin_unlock(&global_list_lock); + + if (obj) { + obj->ops->poke(obj); + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } + } + + void obj_timeout(struct obj *obj) + { + spin_lock(&global_list_lock); + obj_list_del(obj); + spin_unlock(&global_list_lock); + + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } + +.. note:: + + This is a simplification of the ARP queue management in the generic + neighbour discover code of the networking. Olaf Kirch found a bug wrt. + memory barriers in kfree_skb() that exposed the atomic_t memory barrier + requirements quite clearly. + +Given the above scheme, it must be the case that the obj->active +update done by the obj list deletion be visible to other processors +before the atomic counter decrement is performed. + +Otherwise, the counter could fall to zero, yet obj->active would still +be set, thus triggering the assertion in obj_destroy(). The error +sequence looks like this:: + + cpu 0 cpu 1 + obj_poke() obj_timeout() + obj = obj_list_peek(); + ... gains ref to obj, refcnt=2 + obj_list_del(obj); + obj->active = 0 ... + ... visibility delayed ... + atomic_dec_and_test() + ... refcnt drops to 1 ... + atomic_dec_and_test() + ... refcount drops to 0 ... + obj_destroy() + BUG() triggers since obj->active + still seen as one + obj->active update visibility occurs + +With the memory barrier semantics required of the atomic_t operations +which return values, the above sequence of memory visibility can never +happen. Specifically, in the above case the atomic_dec_and_test() +counter decrement would not become globally visible until the +obj->active update does. + +As a historical note, 32-bit Sparc used to only allow usage of +24-bits of its atomic_t type. This was because it used 8 bits +as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" +type instruction. However, 32-bit Sparc has since been moved over +to a "hash table of spinlocks" scheme, that allows the full 32-bit +counter to be realized. Essentially, an array of spinlocks are +indexed into based upon the address of the atomic_t being operated +on, and that lock protects the atomic operation. Parisc uses the +same scheme. + +Another note is that the atomic_t operations returning values are +extremely slow on an old 386. + + +Atomic Bitmask +============== + +We will now cover the atomic bitmask operations. You will find that +their SMP and memory barrier semantics are similar in shape and scope +to the atomic_t ops above. + +Native atomic bit operations are defined to operate on objects aligned +to the size of an "unsigned long" C data type, and are least of that +size. The endianness of the bits within each "unsigned long" are the +native endianness of the cpu. :: + + void set_bit(unsigned long nr, volatile unsigned long *addr); + void clear_bit(unsigned long nr, volatile unsigned long *addr); + void change_bit(unsigned long nr, volatile unsigned long *addr); + +These routines set, clear, and change, respectively, the bit number +indicated by "nr" on the bit mask pointed to by "ADDR". + +They must execute atomically, yet there are no implicit memory barrier +semantics required of these interfaces. :: + + int test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +Like the above, except that these routines return a boolean which +indicates whether the changed bit was set _BEFORE_ the atomic bit +operation. + + +.. warning:: + It is incredibly important that the value be a boolean, ie. "0" or "1". + Do not try to be fancy and save a few instructions by declaring the + above to return "long" and just returning something like "old_val & + mask" because that will not work. + +For one thing, this return value gets truncated to int in many code +paths using these interfaces, so on 64-bit if the bit is set in the +upper 32-bits then testers will never see that. + +One great example of where this problem crops up are the thread_info +flag operations. Routines such as test_and_set_ti_thread_flag() chop +the return value into an int. There are other places where things +like this occur as well. + +These routines, like the atomic_t counter operations returning values, +must provide explicit memory barrier semantics around their execution. +All memory operations before the atomic bit operation call must be +made visible globally before the atomic bit operation is made visible. +Likewise, the atomic bit operation must be visible globally before any +subsequent memory operation is made visible. For example:: + + obj->dead = 1; + if (test_and_set_bit(0, &obj->flags)) + /* ... */; + obj->killed = 1; + +The implementation of test_and_set_bit() must guarantee that +"obj->dead = 1;" is visible to cpus before the atomic memory operation +done by test_and_set_bit() becomes visible. Likewise, the atomic +memory operation done by test_and_set_bit() must become visible before +"obj->killed = 1;" is visible. + +Finally there is the basic operation:: + + int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); + +Which returns a boolean indicating if bit "nr" is set in the bitmask +pointed to by "addr". + +If explicit memory barriers are required around {set,clear}_bit() (which do +not return a value, and thus does not need to provide memory barrier +semantics), two interfaces are provided:: + + void smp_mb__before_atomic(void); + void smp_mb__after_atomic(void); + +They are used as follows, and are akin to their atomic_t operation +brothers:: + + /* All memory operations before this call will + * be globally visible before the clear_bit(). + */ + smp_mb__before_atomic(); + clear_bit( ... ); + + /* The clear_bit() will be visible before all + * subsequent memory operations. + */ + smp_mb__after_atomic(); + +There are two special bitops with lock barrier semantics (acquire/release, +same as spinlocks). These operate in the same way as their non-_lock/unlock +postfixed variants, except that they are to provide acquire/release semantics, +respectively. This means they can be used for bit_spin_trylock and +bit_spin_unlock type operations without specifying any more barriers. :: + + int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); + void clear_bit_unlock(unsigned long nr, unsigned long *addr); + void __clear_bit_unlock(unsigned long nr, unsigned long *addr); + +The __clear_bit_unlock version is non-atomic, however it still implements +unlock barrier semantics. This can be useful if the lock itself is protecting +the other bits in the word. + +Finally, there are non-atomic versions of the bitmask operations +provided. They are used in contexts where some other higher-level SMP +locking scheme is being used to protect the bitmask, and thus less +expensive non-atomic operations may be used in the implementation. +They have names similar to the above bitmask operation interfaces, +except that two underscores are prefixed to the interface name. :: + + void __set_bit(unsigned long nr, volatile unsigned long *addr); + void __clear_bit(unsigned long nr, volatile unsigned long *addr); + void __change_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +These non-atomic variants also do not require any special memory +barrier semantics. + +The routines xchg() and cmpxchg() must provide the same exact +memory-barrier semantics as the atomic and bit operations returning +values. + +.. note:: + + If someone wants to use xchg(), cmpxchg() and their variants, + linux/atomic.h should be included rather than asm/cmpxchg.h, unless the + code is in arch/* and can take care of itself. + +Spinlocks and rwlocks have memory barrier expectations as well. +The rule to follow is simple: + +1) When acquiring a lock, the implementation must make it globally + visible before any subsequent memory operation. + +2) When releasing a lock, the implementation must make it such that + all previous memory operations are globally visible before the + lock release. + +Which finally brings us to _atomic_dec_and_lock(). There is an +architecture-neutral version implemented in lib/dec_and_lock.c, +but most platforms will wish to optimize this in assembler. :: + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); + +Atomically decrement the given counter, and if will drop to zero +atomically acquire the given spinlock and perform the decrement +of the counter to zero. If it does not drop to zero, do nothing +with the spinlock. + +It is actually pretty simple to get the memory barrier correct. +Simply satisfy the spinlock grab requirements, which is make +sure the spinlock operation is globally visible before any +subsequent memory operation. + +We can demonstrate this operation more clearly if we define +an abstract atomic operation:: + + long cas(long *mem, long old, long new); + +"cas" stands for "compare and swap". It atomically: + +1) Compares "old" with the value currently at "mem". +2) If they are equal, "new" is written to "mem". +3) Regardless, the current value at "mem" is returned. + +As an example usage, here is what an atomic counter update +might look like:: + + void example_atomic_inc(long *counter) + { + long old, new, ret; + + while (1) { + old = *counter; + new = old + 1; + + ret = cas(counter, old, new); + if (ret == old) + break; + } + } + +Let's use cas() in order to build a pseudo-C atomic_dec_and_lock():: + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) + { + long old, new, ret; + int went_to_zero; + + went_to_zero = 0; + while (1) { + old = atomic_read(atomic); + new = old - 1; + if (new == 0) { + went_to_zero = 1; + spin_lock(lock); + } + ret = cas(atomic, old, new); + if (ret == old) + break; + if (went_to_zero) { + spin_unlock(lock); + went_to_zero = 0; + } + } + + return went_to_zero; + } + +Now, as far as memory barriers go, as long as spin_lock() +strictly orders all subsequent memory operations (including +the cas()) with respect to itself, things will be fine. + +Said another way, _atomic_dec_and_lock() must guarantee that +a counter dropping to zero is never made visible before the +spinlock being acquired. + +.. note:: + + Note that this also means that for the case where the counter is not + dropping to zero, there are no memory ordering requirements. diff --git a/Documentation/core-api/boot-time-mm.rst b/Documentation/core-api/boot-time-mm.rst new file mode 100644 index 000000000..03cb1643f --- /dev/null +++ b/Documentation/core-api/boot-time-mm.rst @@ -0,0 +1,92 @@ +=========================== +Boot time memory management +=========================== + +Early system initialization cannot use "normal" memory management +simply because it is not set up yet. But there is still need to +allocate memory for various data structures, for instance for the +physical page allocator. To address this, a specialized allocator +called the :ref:`Boot Memory Allocator <bootmem>`, or bootmem, was +introduced. Several years later PowerPC developers added a "Logical +Memory Blocks" allocator, which was later adopted by other +architectures and renamed to :ref:`memblock <memblock>`. There is also +a compatibility layer called `nobootmem` that translates bootmem +allocation interfaces to memblock calls. + +The selection of the early allocator is done using +``CONFIG_NO_BOOTMEM`` and ``CONFIG_HAVE_MEMBLOCK`` kernel +configuration options. These options are enabled or disabled +statically by the architectures' Kconfig files. + +* Architectures that rely only on bootmem select + ``CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=n``. +* The users of memblock with the nobootmem compatibility layer set + ``CONFIG_NO_BOOTMEM=y && CONFIG_HAVE_MEMBLOCK=y``. +* And for those that use both memblock and bootmem the configuration + includes ``CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=y``. + +Whichever allocator is used, it is the responsibility of the +architecture specific initialization to set it up in +:c:func:`setup_arch` and tear it down in :c:func:`mem_init` functions. + +Once the early memory management is available it offers a variety of +functions and macros for memory allocations. The allocation request +may be directed to the first (and probably the only) node or to a +particular node in a NUMA system. There are API variants that panic +when an allocation fails and those that don't. And more recent and +advanced memblock even allows controlling its own behaviour. + +.. _bootmem: + +Bootmem +======= + +(mostly stolen from Mel Gorman's "Understanding the Linux Virtual +Memory Manager" `book`_) + +.. _book: https://www.kernel.org/doc/gorman/ + +.. kernel-doc:: mm/bootmem.c + :doc: bootmem overview + +.. _memblock: + +Memblock +======== + +.. kernel-doc:: mm/memblock.c + :doc: memblock overview + + +Functions and structures +======================== + +Common API +---------- + +The functions that are described in this section are available +regardless of what early memory manager is enabled. + +.. kernel-doc:: mm/nobootmem.c + +Bootmem specific API +-------------------- + +These interfaces available only with bootmem, i.e when ``CONFIG_NO_BOOTMEM=n`` + +.. kernel-doc:: include/linux/bootmem.h +.. kernel-doc:: mm/bootmem.c + :nodocs: + +Memblock specific API +--------------------- + +Here is the description of memblock data structures, functions and +macros. Some of them are actually internal, but since they are +documented it would be silly to omit them. Besides, reading the +descriptions for the internal functions can help to understand what +really happens under the hood. + +.. kernel-doc:: include/linux/memblock.h +.. kernel-doc:: mm/memblock.c + :nodocs: diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst new file mode 100644 index 000000000..6eb9d3f09 --- /dev/null +++ b/Documentation/core-api/cachetlb.rst @@ -0,0 +1,415 @@ +================================== +Cache and TLB Flushing Under Linux +================================== + +:Author: David S. Miller <davem@redhat.com> + +This document describes the cache/tlb flushing interfaces called +by the Linux VM subsystem. It enumerates over each interface, +describes its intended purpose, and what side effect is expected +after the interface is invoked. + +The side effects described below are stated for a uniprocessor +implementation, and what is to happen on that single processor. The +SMP cases are a simple extension, in that you just extend the +definition such that the side effect for a particular interface occurs +on all processors in the system. Don't let this scare you into +thinking SMP cache/tlb flushing must be so inefficient, this is in +fact an area where many optimizations are possible. For example, +if it can be proven that a user address space has never executed +on a cpu (see mm_cpumask()), one need not perform a flush +for this address space on that cpu. + +First, the TLB flushing interfaces, since they are the simplest. The +"TLB" is abstracted under Linux as something the cpu uses to cache +virtual-->physical address translations obtained from the software +page tables. Meaning that if the software page tables change, it is +possible for stale translations to exist in this "TLB" cache. +Therefore when software page table changes occur, the kernel will +invoke one of the following flush methods _after_ the page table +changes occur: + +1) ``void flush_tlb_all(void)`` + + The most severe flush of all. After this interface runs, + any previous page table modification whatsoever will be + visible to the cpu. + + This is usually invoked when the kernel page tables are + changed, since such translations are "global" in nature. + +2) ``void flush_tlb_mm(struct mm_struct *mm)`` + + This interface flushes an entire user address space from + the TLB. After running, this interface must make sure that + any previous page table modifications for the address space + 'mm' will be visible to the cpu. That is, after running, + there will be no entries in the TLB for 'mm'. + + This interface is used to handle whole address space + page table operations such as what happens during + fork, and exec. + +3) ``void flush_tlb_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end)`` + + Here we are flushing a specific range of (user) virtual + address translations from the TLB. After running, this + interface must make sure that any previous page table + modifications for the address space 'vma->vm_mm' in the range + 'start' to 'end-1' will be visible to the cpu. That is, after + running, there will be no entries in the TLB for 'mm' for + virtual addresses in the range 'start' to 'end-1'. + + The "vma" is the backing store being used for the region. + Primarily, this is used for munmap() type operations. + + The interface is provided in hopes that the port can find + a suitably efficient method for removing multiple page + sized translations from the TLB, instead of having the kernel + call flush_tlb_page (see below) for each entry which may be + modified. + +4) ``void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)`` + + This time we need to remove the PAGE_SIZE sized translation + from the TLB. The 'vma' is the backing structure used by + Linux to keep track of mmap'd regions for a process, the + address space is available via vma->vm_mm. Also, one may + test (vma->vm_flags & VM_EXEC) to see if this region is + executable (and thus could be in the 'instruction TLB' in + split-tlb type setups). + + After running, this interface must make sure that any previous + page table modification for address space 'vma->vm_mm' for + user virtual address 'addr' will be visible to the cpu. That + is, after running, there will be no entries in the TLB for + 'vma->vm_mm' for virtual address 'addr'. + + This is used primarily during fault processing. + +5) ``void update_mmu_cache(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep)`` + + At the end of every page fault, this routine is invoked to + tell the architecture specific code that a translation + now exists at virtual address "address" for address space + "vma->vm_mm", in the software page tables. + + A port may use this information in any way it so chooses. + For example, it could use this event to pre-load TLB + translations for software managed TLB configurations. + The sparc64 port currently does this. + +6) ``void tlb_migrate_finish(struct mm_struct *mm)`` + + This interface is called at the end of an explicit + process migration. This interface provides a hook + to allow a platform to update TLB or context-specific + information for the address space. + + The ia64 sn2 platform is one example of a platform + that uses this interface. + +Next, we have the cache flushing interfaces. In general, when Linux +is changing an existing virtual-->physical mapping to a new value, +the sequence will be in one of the following forms:: + + 1) flush_cache_mm(mm); + change_all_page_tables_of(mm); + flush_tlb_mm(mm); + + 2) flush_cache_range(vma, start, end); + change_range_of_page_tables(mm, start, end); + flush_tlb_range(vma, start, end); + + 3) flush_cache_page(vma, addr, pfn); + set_pte(pte_pointer, new_pte_val); + flush_tlb_page(vma, addr); + +The cache level flush will always be first, because this allows +us to properly handle systems whose caches are strict and require +a virtual-->physical translation to exist for a virtual address +when that virtual address is flushed from the cache. The HyperSparc +cpu is one such cpu with this attribute. + +The cache flushing routines below need only deal with cache flushing +to the extent that it is necessary for a particular cpu. Mostly, +these routines must be implemented for cpus which have virtually +indexed caches which must be flushed when virtual-->physical +translations are changed or removed. So, for example, the physically +indexed physically tagged caches of IA32 processors have no need to +implement these interfaces since the caches are fully synchronized +and have no dependency on translation information. + +Here are the routines, one by one: + +1) ``void flush_cache_mm(struct mm_struct *mm)`` + + This interface flushes an entire user address space from + the caches. That is, after running, there will be no cache + lines associated with 'mm'. + + This interface is used to handle whole address space + page table operations such as what happens during exit and exec. + +2) ``void flush_cache_dup_mm(struct mm_struct *mm)`` + + This interface flushes an entire user address space from + the caches. That is, after running, there will be no cache + lines associated with 'mm'. + + This interface is used to handle whole address space + page table operations such as what happens during fork. + + This option is separate from flush_cache_mm to allow some + optimizations for VIPT caches. + +3) ``void flush_cache_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end)`` + + Here we are flushing a specific range of (user) virtual + addresses from the cache. After running, there will be no + entries in the cache for 'vma->vm_mm' for virtual addresses in + the range 'start' to 'end-1'. + + The "vma" is the backing store being used for the region. + Primarily, this is used for munmap() type operations. + + The interface is provided in hopes that the port can find + a suitably efficient method for removing multiple page + sized regions from the cache, instead of having the kernel + call flush_cache_page (see below) for each entry which may be + modified. + +4) ``void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)`` + + This time we need to remove a PAGE_SIZE sized range + from the cache. The 'vma' is the backing structure used by + Linux to keep track of mmap'd regions for a process, the + address space is available via vma->vm_mm. Also, one may + test (vma->vm_flags & VM_EXEC) to see if this region is + executable (and thus could be in the 'instruction cache' in + "Harvard" type cache layouts). + + The 'pfn' indicates the physical page frame (shift this value + left by PAGE_SHIFT to get the physical address) that 'addr' + translates to. It is this mapping which should be removed from + the cache. + + After running, there will be no entries in the cache for + 'vma->vm_mm' for virtual address 'addr' which translates + to 'pfn'. + + This is used primarily during fault processing. + +5) ``void flush_cache_kmaps(void)`` + + This routine need only be implemented if the platform utilizes + highmem. It will be called right before all of the kmaps + are invalidated. + + After running, there will be no entries in the cache for + the kernel virtual address range PKMAP_ADDR(0) to + PKMAP_ADDR(LAST_PKMAP). + + This routing should be implemented in asm/highmem.h + +6) ``void flush_cache_vmap(unsigned long start, unsigned long end)`` + ``void flush_cache_vunmap(unsigned long start, unsigned long end)`` + + Here in these two interfaces we are flushing a specific range + of (kernel) virtual addresses from the cache. After running, + there will be no entries in the cache for the kernel address + space for virtual addresses in the range 'start' to 'end-1'. + + The first of these two routines is invoked after map_vm_area() + has installed the page table entries. The second is invoked + before unmap_kernel_range() deletes the page table entries. + +There exists another whole class of cpu cache issues which currently +require a whole different set of interfaces to handle properly. +The biggest problem is that of virtual aliasing in the data cache +of a processor. + +Is your port susceptible to virtual aliasing in its D-cache? +Well, if your D-cache is virtually indexed, is larger in size than +PAGE_SIZE, and does not prevent multiple cache lines for the same +physical address from existing at once, you have this problem. + +If your D-cache has this problem, first define asm/shmparam.h SHMLBA +properly, it should essentially be the size of your virtually +addressed D-cache (or if the size is variable, the largest possible +size). This setting will force the SYSv IPC layer to only allow user +processes to mmap shared memory at address which are a multiple of +this value. + +.. note:: + + This does not fix shared mmaps, check out the sparc64 port for + one way to solve this (in particular SPARC_FLAG_MMAPSHARED). + +Next, you have to solve the D-cache aliasing issue for all +other cases. Please keep in mind that fact that, for a given page +mapped into some user address space, there is always at least one more +mapping, that of the kernel in its linear mapping starting at +PAGE_OFFSET. So immediately, once the first user maps a given +physical page into its address space, by implication the D-cache +aliasing problem has the potential to exist since the kernel already +maps this page at its virtual address. + + ``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)`` + ``void clear_user_page(void *to, unsigned long addr, struct page *page)`` + + These two routines store data in user anonymous or COW + pages. It allows a port to efficiently avoid D-cache alias + issues between userspace and the kernel. + + For example, a port may temporarily map 'from' and 'to' to + kernel virtual addresses during the copy. The virtual address + for these two pages is chosen in such a way that the kernel + load/store instructions happen to virtual addresses which are + of the same "color" as the user mapping of the page. Sparc64 + for example, uses this technique. + + The 'addr' parameter tells the virtual address where the + user will ultimately have this page mapped, and the 'page' + parameter gives a pointer to the struct page of the target. + + If D-cache aliasing is not an issue, these two routines may + simply call memcpy/memset directly and do nothing more. + + ``void flush_dcache_page(struct page *page)`` + + Any time the kernel writes to a page cache page, _OR_ + the kernel is about to read from a page cache page and + user space shared/writable mappings of this page potentially + exist, this routine is called. + + .. note:: + + This routine need only be called for page cache pages + which can potentially ever be mapped into the address + space of a user process. So for example, VFS layer code + handling vfs symlinks in the page cache need not call + this interface at all. + + The phrase "kernel writes to a page cache page" means, + specifically, that the kernel executes store instructions + that dirty data in that page at the page->virtual mapping + of that page. It is important to flush here to handle + D-cache aliasing, to make sure these kernel stores are + visible to user space mappings of that page. + + The corollary case is just as important, if there are users + which have shared+writable mappings of this file, we must make + sure that kernel reads of these pages will see the most recent + stores done by the user. + + If D-cache aliasing is not an issue, this routine may + simply be defined as a nop on that architecture. + + There is a bit set aside in page->flags (PG_arch_1) as + "architecture private". The kernel guarantees that, + for pagecache pages, it will clear this bit when such + a page first enters the pagecache. + + This allows these interfaces to be implemented much more + efficiently. It allows one to "defer" (perhaps indefinitely) + the actual flush if there are currently no user processes + mapping this page. See sparc64's flush_dcache_page and + update_mmu_cache implementations for an example of how to go + about doing this. + + The idea is, first at flush_dcache_page() time, if + page->mapping->i_mmap is an empty tree, just mark the architecture + private page flag bit. Later, in update_mmu_cache(), a check is + made of this flag bit, and if set the flush is done and the flag + bit is cleared. + + .. important:: + + It is often important, if you defer the flush, + that the actual flush occurs on the same CPU + as did the cpu stores into the page to make it + dirty. Again, see sparc64 for examples of how + to deal with this. + + ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page, + unsigned long user_vaddr, void *dst, void *src, int len)`` + ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page, + unsigned long user_vaddr, void *dst, void *src, int len)`` + + When the kernel needs to copy arbitrary data in and out + of arbitrary user pages (f.e. for ptrace()) it will use + these two routines. + + Any necessary cache flushing or other coherency operations + that need to occur should happen here. If the processor's + instruction cache does not snoop cpu stores, it is very + likely that you will need to flush the instruction cache + for copy_to_user_page(). + + ``void flush_anon_page(struct vm_area_struct *vma, struct page *page, + unsigned long vmaddr)`` + + When the kernel needs to access the contents of an anonymous + page, it calls this function (currently only + get_user_pages()). Note: flush_dcache_page() deliberately + doesn't work for an anonymous page. The default + implementation is a nop (and should remain so for all coherent + architectures). For incoherent architectures, it should flush + the cache of the page at vmaddr. + + ``void flush_kernel_dcache_page(struct page *page)`` + + When the kernel needs to modify a user page is has obtained + with kmap, it calls this function after all modifications are + complete (but before kunmapping it) to bring the underlying + page up to date. It is assumed here that the user has no + incoherent cached copies (i.e. the original page was obtained + from a mechanism like get_user_pages()). The default + implementation is a nop and should remain so on all coherent + architectures. On incoherent architectures, this should flush + the kernel cache for page (using page_address(page)). + + + ``void flush_icache_range(unsigned long start, unsigned long end)`` + + When the kernel stores into addresses that it will execute + out of (eg when loading modules), this function is called. + + If the icache does not snoop stores then this routine will need + to flush it. + + ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)`` + + All the functionality of flush_icache_page can be implemented in + flush_dcache_page and update_mmu_cache. In the future, the hope + is to remove this interface completely. + +The final category of APIs is for I/O to deliberately aliased address +ranges inside the kernel. Such aliases are set up by use of the +vmap/vmalloc API. Since kernel I/O goes via physical pages, the I/O +subsystem assumes that the user mapping and kernel offset mapping are +the only aliases. This isn't true for vmap aliases, so anything in +the kernel trying to do I/O to vmap areas must manually manage +coherency. It must do this by flushing the vmap range before doing +I/O and invalidating it after the I/O returns. + + ``void flush_kernel_vmap_range(void *vaddr, int size)`` + + flushes the kernel cache for a given virtual address range in + the vmap area. This is to make sure that any data the kernel + modified in the vmap range is made visible to the physical + page. The design is to make this area safe to perform I/O on. + Note that this API does *not* also flush the offset map alias + of the area. + + ``void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates`` + + the cache for a given virtual address range in the vmap area + which prevents the processor from making the cache stale by + speculatively reading data while the I/O was occurring to the + physical pages. This is only necessary for data reads into the + vmap area. diff --git a/Documentation/core-api/circular-buffers.rst b/Documentation/core-api/circular-buffers.rst new file mode 100644 index 000000000..53e51caa3 --- /dev/null +++ b/Documentation/core-api/circular-buffers.rst @@ -0,0 +1,237 @@ +================ +Circular Buffers +================ + +:Author: David Howells <dhowells@redhat.com> +:Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> + + +Linux provides a number of features that can be used to implement circular +buffering. There are two sets of such features: + + (1) Convenience functions for determining information about power-of-2 sized + buffers. + + (2) Memory barriers for when the producer and the consumer of objects in the + buffer don't want to share a lock. + +To use these facilities, as discussed below, there needs to be just one +producer and just one consumer. It is possible to handle multiple producers by +serialising them, and to handle multiple consumers by serialising them. + + +.. Contents: + + (*) What is a circular buffer? + + (*) Measuring power-of-2 buffers. + + (*) Using memory barriers with circular buffers. + - The producer. + - The consumer. + + + +What is a circular buffer? +========================== + +First of all, what is a circular buffer? A circular buffer is a buffer of +fixed, finite size into which there are two indices: + + (1) A 'head' index - the point at which the producer inserts items into the + buffer. + + (2) A 'tail' index - the point at which the consumer finds the next item in + the buffer. + +Typically when the tail pointer is equal to the head pointer, the buffer is +empty; and the buffer is full when the head pointer is one less than the tail +pointer. + +The head index is incremented when items are added, and the tail index when +items are removed. The tail index should never jump the head index, and both +indices should be wrapped to 0 when they reach the end of the buffer, thus +allowing an infinite amount of data to flow through the buffer. + +Typically, items will all be of the same unit size, but this isn't strictly +required to use the techniques below. The indices can be increased by more +than 1 if multiple items or variable-sized items are to be included in the +buffer, provided that neither index overtakes the other. The implementer must +be careful, however, as a region more than one unit in size may wrap the end of +the buffer and be broken into two segments. + +Measuring power-of-2 buffers +============================ + +Calculation of the occupancy or the remaining capacity of an arbitrarily sized +circular buffer would normally be a slow operation, requiring the use of a +modulus (divide) instruction. However, if the buffer is of a power-of-2 size, +then a much quicker bitwise-AND instruction can be used instead. + +Linux provides a set of macros for handling power-of-2 circular buffers. These +can be made use of by:: + + #include <linux/circ_buf.h> + +The macros are: + + (#) Measure the remaining capacity of a buffer:: + + CIRC_SPACE(head_index, tail_index, buffer_size); + + This returns the amount of space left in the buffer[1] into which items + can be inserted. + + + (#) Measure the maximum consecutive immediate space in a buffer:: + + CIRC_SPACE_TO_END(head_index, tail_index, buffer_size); + + This returns the amount of consecutive space left in the buffer[1] into + which items can be immediately inserted without having to wrap back to the + beginning of the buffer. + + + (#) Measure the occupancy of a buffer:: + + CIRC_CNT(head_index, tail_index, buffer_size); + + This returns the number of items currently occupying a buffer[2]. + + + (#) Measure the non-wrapping occupancy of a buffer:: + + CIRC_CNT_TO_END(head_index, tail_index, buffer_size); + + This returns the number of consecutive items[2] that can be extracted from + the buffer without having to wrap back to the beginning of the buffer. + + +Each of these macros will nominally return a value between 0 and buffer_size-1, +however: + + (1) CIRC_SPACE*() are intended to be used in the producer. To the producer + they will return a lower bound as the producer controls the head index, + but the consumer may still be depleting the buffer on another CPU and + moving the tail index. + + To the consumer it will show an upper bound as the producer may be busy + depleting the space. + + (2) CIRC_CNT*() are intended to be used in the consumer. To the consumer they + will return a lower bound as the consumer controls the tail index, but the + producer may still be filling the buffer on another CPU and moving the + head index. + + To the producer it will show an upper bound as the consumer may be busy + emptying the buffer. + + (3) To a third party, the order in which the writes to the indices by the + producer and consumer become visible cannot be guaranteed as they are + independent and may be made on different CPUs - so the result in such a + situation will merely be a guess, and may even be negative. + +Using memory barriers with circular buffers +=========================================== + +By using memory barriers in conjunction with circular buffers, you can avoid +the need to: + + (1) use a single lock to govern access to both ends of the buffer, thus + allowing the buffer to be filled and emptied at the same time; and + + (2) use atomic counter operations. + +There are two sides to this: the producer that fills the buffer, and the +consumer that empties it. Only one thing should be filling a buffer at any one +time, and only one thing should be emptying a buffer at any one time, but the +two sides can operate simultaneously. + + +The producer +------------ + +The producer will look something like this:: + + spin_lock(&producer_lock); + + unsigned long head = buffer->head; + /* The spin_unlock() and next spin_lock() provide needed ordering. */ + unsigned long tail = READ_ONCE(buffer->tail); + + if (CIRC_SPACE(head, tail, buffer->size) >= 1) { + /* insert one item into the buffer */ + struct item *item = buffer[head]; + + produce_item(item); + + smp_store_release(buffer->head, + (head + 1) & (buffer->size - 1)); + + /* wake_up() will make sure that the head is committed before + * waking anyone up */ + wake_up(consumer); + } + + spin_unlock(&producer_lock); + +This will instruct the CPU that the contents of the new item must be written +before the head index makes it available to the consumer and then instructs the +CPU that the revised head index must be written before the consumer is woken. + +Note that wake_up() does not guarantee any sort of barrier unless something +is actually awakened. We therefore cannot rely on it for ordering. However, +there is always one element of the array left empty. Therefore, the +producer must produce two elements before it could possibly corrupt the +element currently being read by the consumer. Therefore, the unlock-lock +pair between consecutive invocations of the consumer provides the necessary +ordering between the read of the index indicating that the consumer has +vacated a given element and the write by the producer to that same element. + + +The Consumer +------------ + +The consumer will look something like this:: + + spin_lock(&consumer_lock); + + /* Read index before reading contents at that index. */ + unsigned long head = smp_load_acquire(buffer->head); + unsigned long tail = buffer->tail; + + if (CIRC_CNT(head, tail, buffer->size) >= 1) { + + /* extract one item from the buffer */ + struct item *item = buffer[tail]; + + consume_item(item); + + /* Finish reading descriptor before incrementing tail. */ + smp_store_release(buffer->tail, + (tail + 1) & (buffer->size - 1)); + } + + spin_unlock(&consumer_lock); + +This will instruct the CPU to make sure the index is up to date before reading +the new item, and then it shall make sure the CPU has finished reading the item +before it writes the new tail pointer, which will erase the item. + +Note the use of READ_ONCE() and smp_load_acquire() to read the +opposition index. This prevents the compiler from discarding and +reloading its cached value. This isn't strictly needed if you can +be sure that the opposition index will _only_ be used the once. +The smp_load_acquire() additionally forces the CPU to order against +subsequent memory references. Similarly, smp_store_release() is used +in both algorithms to write the thread's index. This documents the +fact that we are writing to something that can be read concurrently, +prevents the compiler from tearing the store, and enforces ordering +against previous accesses. + + +Further reading +=============== + +See also Documentation/memory-barriers.txt for a description of Linux's memory +barrier facilities. diff --git a/Documentation/core-api/conf.py b/Documentation/core-api/conf.py new file mode 100644 index 000000000..db1f7659f --- /dev/null +++ b/Documentation/core-api/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Core-API Documentation" + +tags.add("subproject") + +latex_documents = [ + ('index', 'core-api.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst new file mode 100644 index 000000000..4a50ab781 --- /dev/null +++ b/Documentation/core-api/cpu_hotplug.rst @@ -0,0 +1,372 @@ +========================= +CPU hotplug in the Kernel +========================= + +:Date: December, 2016 +:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>, + Rusty Russell <rusty@rustcorp.com.au>, + Srivatsa Vaddagiri <vatsa@in.ibm.com>, + Ashok Raj <ashok.raj@intel.com>, + Joel Schopp <jschopp@austin.ibm.com> + +Introduction +============ + +Modern advances in system architectures have introduced advanced error +reporting and correction capabilities in processors. There are couple OEMS that +support NUMA hardware which are hot pluggable as well, where physical node +insertion and removal require support for CPU hotplug. + +Such advances require CPUs available to a kernel to be removed either for +provisioning reasons, or for RAS purposes to keep an offending CPU off +system execution path. Hence the need for CPU hotplug support in the +Linux kernel. + +A more novel use of CPU-hotplug support is its use today in suspend resume +support for SMP. Dual-core and HT support makes even a laptop run SMP kernels +which didn't support these methods. + + +Command Line Switches +===================== +``maxcpus=n`` + Restrict boot time CPUs to *n*. Say if you have fourV CPUs, using + ``maxcpus=2`` will only boot two. You can choose to bring the + other CPUs later online. + +``nr_cpus=n`` + Restrict the total amount CPUs the kernel will support. If the number + supplied here is lower than the number of physically available CPUs than + those CPUs can not be brought online later. + +``additional_cpus=n`` + Use this to limit hotpluggable CPUs. This option sets + ``cpu_possible_mask = cpu_present_mask + additional_cpus`` + + This option is limited to the IA64 architecture. + +``possible_cpus=n`` + This option sets ``possible_cpus`` bits in ``cpu_possible_mask``. + + This option is limited to the X86 and S390 architecture. + +``cede_offline={"off","on"}`` + Use this option to disable/enable putting offlined processors to an extended + ``H_CEDE`` state on supported pseries platforms. If nothing is specified, + ``cede_offline`` is set to "on". + + This option is limited to the PowerPC architecture. + +``cpu0_hotplug`` + Allow to shutdown CPU0. + + This option is limited to the X86 architecture. + +CPU maps +======== + +``cpu_possible_mask`` + Bitmap of possible CPUs that can ever be available in the + system. This is used to allocate some boot time memory for per_cpu variables + that aren't designed to grow/shrink as CPUs are made available or removed. + Once set during boot time discovery phase, the map is static, i.e no bits + are added or removed anytime. Trimming it accurately for your system needs + upfront can save some boot time memory. + +``cpu_online_mask`` + Bitmap of all CPUs currently online. Its set in ``__cpu_up()`` + after a CPU is available for kernel scheduling and ready to receive + interrupts from devices. Its cleared when a CPU is brought down using + ``__cpu_disable()``, before which all OS services including interrupts are + migrated to another target CPU. + +``cpu_present_mask`` + Bitmap of CPUs currently present in the system. Not all + of them may be online. When physical hotplug is processed by the relevant + subsystem (e.g ACPI) can change and new bit either be added or removed + from the map depending on the event is hot-add/hot-remove. There are currently + no locking rules as of now. Typical usage is to init topology during boot, + at which time hotplug is disabled. + +You really don't need to manipulate any of the system CPU maps. They should +be read-only for most use. When setting up per-cpu resources almost always use +``cpu_possible_mask`` or ``for_each_possible_cpu()`` to iterate. To macro +``for_each_cpu()`` can be used to iterate over a custom CPU mask. + +Never use anything other than ``cpumask_t`` to represent bitmap of CPUs. + + +Using CPU hotplug +================= +The kernel option *CONFIG_HOTPLUG_CPU* needs to be enabled. It is currently +available on multiple architectures including ARM, MIPS, PowerPC and X86. The +configuration is done via the sysfs interface: :: + + $ ls -lh /sys/devices/system/cpu + total 0 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu0 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu1 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu2 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu3 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu4 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu5 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu6 + drwxr-xr-x 9 root root 0 Dec 21 16:33 cpu7 + drwxr-xr-x 2 root root 0 Dec 21 16:33 hotplug + -r--r--r-- 1 root root 4.0K Dec 21 16:33 offline + -r--r--r-- 1 root root 4.0K Dec 21 16:33 online + -r--r--r-- 1 root root 4.0K Dec 21 16:33 possible + -r--r--r-- 1 root root 4.0K Dec 21 16:33 present + +The files *offline*, *online*, *possible*, *present* represent the CPU masks. +Each CPU folder contains an *online* file which controls the logical on (1) and +off (0) state. To logically shutdown CPU4: :: + + $ echo 0 > /sys/devices/system/cpu/cpu4/online + smpboot: CPU 4 is now offline + +Once the CPU is shutdown, it will be removed from */proc/interrupts*, +*/proc/cpuinfo* and should also not be shown visible by the *top* command. To +bring CPU4 back online: :: + + $ echo 1 > /sys/devices/system/cpu/cpu4/online + smpboot: Booting Node 0 Processor 4 APIC 0x1 + +The CPU is usable again. This should work on all CPUs. CPU0 is often special +and excluded from CPU hotplug. On X86 the kernel option +*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to +shutdown CPU0. Alternatively the kernel command option *cpu0_hotplug* can be +used. Some known dependencies of CPU0: + +* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline. +* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected. + +Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies +on CPU0. + +The CPU hotplug coordination +============================ + +The offline case +---------------- +Once a CPU has been logically shutdown the teardown callbacks of registered +hotplug states will be invoked, starting with ``CPUHP_ONLINE`` and terminating +at state ``CPUHP_OFFLINE``. This includes: + +* If tasks are frozen due to a suspend operation then *cpuhp_tasks_frozen* + will be set to true. +* All processes are migrated away from this outgoing CPU to new CPUs. + The new CPU is chosen from each process' current cpuset, which may be + a subset of all online CPUs. +* All interrupts targeted to this CPU are migrated to a new CPU +* timers are also migrated to a new CPU +* Once all services are migrated, kernel calls an arch specific routine + ``__cpu_disable()`` to perform arch specific cleanup. + +Using the hotplug API +--------------------- +It is possible to receive notifications once a CPU is offline or onlined. This +might be important to certain drivers which need to perform some kind of setup +or clean up functions based on the number of available CPUs: :: + + #include <linux/cpuhotplug.h> + + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online", + Y_online, Y_prepare_down); + +*X* is the subsystem and *Y* the particular driver. The *Y_online* callback +will be invoked during registration on all online CPUs. If an error +occurs during the online callback the *Y_prepare_down* callback will be +invoked on all CPUs on which the online callback was previously invoked. +After registration completed, the *Y_online* callback will be invoked +once a CPU is brought online and *Y_prepare_down* will be invoked when a +CPU is shutdown. All resources which were previously allocated in +*Y_online* should be released in *Y_prepare_down*. +The return value *ret* is negative if an error occurred during the +registration process. Otherwise a positive value is returned which +contains the allocated hotplug for dynamically allocated states +(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states. + +The callback can be remove by invoking ``cpuhp_remove_state()``. In case of a +dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state. +During the removal of a hotplug state the teardown callback will be invoked. + +Multiple instances +~~~~~~~~~~~~~~~~~~ +If a driver has multiple instances and each instance needs to perform the +callback independently then it is likely that a ''multi-state'' should be used. +First a multi-state state needs to be registered: :: + + ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online, + Y_online, Y_prepare_down); + Y_hp_online = ret; + +The ``cpuhp_setup_state_multi()`` behaves similar to ``cpuhp_setup_state()`` +except it prepares the callbacks for a multi state and does not invoke +the callbacks. This is a one time setup. +Once a new instance is allocated, you need to register this new instance: :: + + ret = cpuhp_state_add_instance(Y_hp_online, &d->node); + +This function will add this instance to your previously allocated +*Y_hp_online* state and invoke the previously registered callback +(*Y_online*) on all online CPUs. The *node* element is a ``struct +hlist_node`` member of your per-instance data structure. + +On removal of the instance: :: + cpuhp_state_remove_instance(Y_hp_online, &d->node) + +should be invoked which will invoke the teardown callback on all online +CPUs. + +Manual setup +~~~~~~~~~~~~ +Usually it is handy to invoke setup and teardown callbacks on registration or +removal of a state because usually the operation needs to performed once a CPU +goes online (offline) and during initial setup (shutdown) of the driver. However +each registration and removal function is also available with a ``_nocalls`` +suffix which does not invoke the provided callbacks if the invocation of the +callbacks is not desired. During the manual setup (or teardown) the functions +``get_online_cpus()`` and ``put_online_cpus()`` should be used to inhibit CPU +hotplug operations. + + +The ordering of the events +-------------------------- +The hotplug states are defined in ``include/linux/cpuhotplug.h``: + +* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the + CPU is up. +* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked + just the after the CPU has been brought up. The interrupts are off and + the scheduler is not yet active on this CPU. Starting with *CPUHP_AP_OFFLINE* + the callbacks are invoked on the target CPU. +* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are + reserved for the dynamic allocation. +* The states are invoked in the reverse order on CPU shutdown starting with + *CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are + invoked on the CPU that will be shutdown until *CPUHP_AP_OFFLINE*. + +A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough. +However if an earlier invocation during the bring up or shutdown is required +then an explicit state should be acquired. An explicit state might also be +required if the hotplug event requires specific ordering in respect to +another hotplug event. + +Testing of hotplug states +========================= +One way to verify whether a custom state is working as expected or not is to +shutdown a CPU and then put it online again. It is also possible to put the CPU +to certain state (for instance *CPUHP_AP_ONLINE*) and then go back to +*CPUHP_ONLINE*. This would simulate an error one state after *CPUHP_AP_ONLINE* +which would lead to rollback to the online state. + +All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states``: :: + + $ tail /sys/devices/system/cpu/hotplug/states + 138: mm/vmscan:online + 139: mm/vmstat:online + 140: lib/percpu_cnt:online + 141: acpi/cpu-drv:online + 142: base/cacheinfo:online + 143: virtio/net:online + 144: x86/mce:online + 145: printk:online + 168: sched:active + 169: online + +To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue: :: + + $ cat /sys/devices/system/cpu/cpu4/hotplug/state + 169 + $ echo 140 > /sys/devices/system/cpu/cpu4/hotplug/target + $ cat /sys/devices/system/cpu/cpu4/hotplug/state + 140 + +It is important to note that the teardown callbac of state 140 have been +invoked. And now get back online: :: + + $ echo 169 > /sys/devices/system/cpu/cpu4/hotplug/target + $ cat /sys/devices/system/cpu/cpu4/hotplug/state + 169 + +With trace events enabled, the individual steps are visible, too: :: + + # TASK-PID CPU# TIMESTAMP FUNCTION + # | | | | | + bash-394 [001] 22.976: cpuhp_enter: cpu: 0004 target: 140 step: 169 (cpuhp_kick_ap_work) + cpuhp/4-31 [004] 22.977: cpuhp_enter: cpu: 0004 target: 140 step: 168 (sched_cpu_deactivate) + cpuhp/4-31 [004] 22.990: cpuhp_exit: cpu: 0004 state: 168 step: 168 ret: 0 + cpuhp/4-31 [004] 22.991: cpuhp_enter: cpu: 0004 target: 140 step: 144 (mce_cpu_pre_down) + cpuhp/4-31 [004] 22.992: cpuhp_exit: cpu: 0004 state: 144 step: 144 ret: 0 + cpuhp/4-31 [004] 22.993: cpuhp_multi_enter: cpu: 0004 target: 140 step: 143 (virtnet_cpu_down_prep) + cpuhp/4-31 [004] 22.994: cpuhp_exit: cpu: 0004 state: 143 step: 143 ret: 0 + cpuhp/4-31 [004] 22.995: cpuhp_enter: cpu: 0004 target: 140 step: 142 (cacheinfo_cpu_pre_down) + cpuhp/4-31 [004] 22.996: cpuhp_exit: cpu: 0004 state: 142 step: 142 ret: 0 + bash-394 [001] 22.997: cpuhp_exit: cpu: 0004 state: 140 step: 169 ret: 0 + bash-394 [005] 95.540: cpuhp_enter: cpu: 0004 target: 169 step: 140 (cpuhp_kick_ap_work) + cpuhp/4-31 [004] 95.541: cpuhp_enter: cpu: 0004 target: 169 step: 141 (acpi_soft_cpu_online) + cpuhp/4-31 [004] 95.542: cpuhp_exit: cpu: 0004 state: 141 step: 141 ret: 0 + cpuhp/4-31 [004] 95.543: cpuhp_enter: cpu: 0004 target: 169 step: 142 (cacheinfo_cpu_online) + cpuhp/4-31 [004] 95.544: cpuhp_exit: cpu: 0004 state: 142 step: 142 ret: 0 + cpuhp/4-31 [004] 95.545: cpuhp_multi_enter: cpu: 0004 target: 169 step: 143 (virtnet_cpu_online) + cpuhp/4-31 [004] 95.546: cpuhp_exit: cpu: 0004 state: 143 step: 143 ret: 0 + cpuhp/4-31 [004] 95.547: cpuhp_enter: cpu: 0004 target: 169 step: 144 (mce_cpu_online) + cpuhp/4-31 [004] 95.548: cpuhp_exit: cpu: 0004 state: 144 step: 144 ret: 0 + cpuhp/4-31 [004] 95.549: cpuhp_enter: cpu: 0004 target: 169 step: 145 (console_cpu_notify) + cpuhp/4-31 [004] 95.550: cpuhp_exit: cpu: 0004 state: 145 step: 145 ret: 0 + cpuhp/4-31 [004] 95.551: cpuhp_enter: cpu: 0004 target: 169 step: 168 (sched_cpu_activate) + cpuhp/4-31 [004] 95.552: cpuhp_exit: cpu: 0004 state: 168 step: 168 ret: 0 + bash-394 [005] 95.553: cpuhp_exit: cpu: 0004 state: 169 step: 140 ret: 0 + +As it an be seen, CPU4 went down until timestamp 22.996 and then back up until +95.552. All invoked callbacks including their return codes are visible in the +trace. + +Architecture's requirements +=========================== +The following functions and configurations are required: + +``CONFIG_HOTPLUG_CPU`` + This entry needs to be enabled in Kconfig + +``__cpu_up()`` + Arch interface to bring up a CPU + +``__cpu_disable()`` + Arch interface to shutdown a CPU, no more interrupts can be handled by the + kernel after the routine returns. This includes the shutdown of the timer. + +``__cpu_die()`` + This actually supposed to ensure death of the CPU. Actually look at some + example code in other arch that implement CPU hotplug. The processor is taken + down from the ``idle()`` loop for that specific architecture. ``__cpu_die()`` + typically waits for some per_cpu state to be set, to ensure the processor dead + routine is called to be sure positively. + +User Space Notification +======================= +After CPU successfully onlined or offline udev events are sent. A udev rule like: :: + + SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="the_hotplug_receiver.sh" + +will receive all events. A script like: :: + + #!/bin/sh + + if [ "${ACTION}" = "offline" ] + then + echo "CPU ${DEVPATH##*/} offline" + + elif [ "${ACTION}" = "online" ] + then + echo "CPU ${DEVPATH##*/} online" + + fi + +can process the event further. + +Kernel Inline Documentations Reference +====================================== + +.. kernel-doc:: include/linux/cpuhotplug.h diff --git a/Documentation/core-api/debug-objects.rst b/Documentation/core-api/debug-objects.rst new file mode 100644 index 000000000..ac926fd55 --- /dev/null +++ b/Documentation/core-api/debug-objects.rst @@ -0,0 +1,310 @@ +============================================ +The object-lifetime debugging infrastructure +============================================ + +:Author: Thomas Gleixner + +Introduction +============ + +debugobjects is a generic infrastructure to track the life time of +kernel objects and validate the operations on those. + +debugobjects is useful to check for the following error patterns: + +- Activation of uninitialized objects + +- Initialization of active objects + +- Usage of freed/destroyed objects + +debugobjects is not changing the data structure of the real object so it +can be compiled in with a minimal runtime impact and enabled on demand +with a kernel command line option. + +Howto use debugobjects +====================== + +A kernel subsystem needs to provide a data structure which describes the +object type and add calls into the debug code at appropriate places. The +data structure to describe the object type needs at minimum the name of +the object type. Optional functions can and should be provided to fixup +detected problems so the kernel can continue to work and the debug +information can be retrieved from a live system instead of hard core +debugging with serial consoles and stack trace transcripts from the +monitor. + +The debug calls provided by debugobjects are: + +- debug_object_init + +- debug_object_init_on_stack + +- debug_object_activate + +- debug_object_deactivate + +- debug_object_destroy + +- debug_object_free + +- debug_object_assert_init + +Each of these functions takes the address of the real object and a +pointer to the object type specific debug description structure. + +Each detected error is reported in the statistics and a limited number +of errors are printk'ed including a full stack trace. + +The statistics are available via /sys/kernel/debug/debug_objects/stats. +They provide information about the number of warnings and the number of +successful fixups along with information about the usage of the internal +tracking objects and the state of the internal tracking objects pool. + +Debug functions +=============== + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_init + +This function is called whenever the initialization function of a real +object is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be initialized. Initializing is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_init function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real initialization of the object happens. E.g. it +can deactivate an active object in order to prevent damage to the +subsystem. + +When the real object is not yet tracked by debugobjects, debugobjects +allocates a tracker object for the real object and sets the tracker +object state to ODEBUG_STATE_INIT. It verifies that the object is not +on the callers stack. If it is on the callers stack then a limited +number of warnings including a full stack trace is printk'ed. The +calling code must use debug_object_init_on_stack() and remove the +object before leaving the function which allocated it. See next section. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_init_on_stack + +This function is called whenever the initialization function of a real +object which resides on the stack is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be initialized. Initializing is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_init function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real initialization of the object happens. E.g. it +can deactivate an active object in order to prevent damage to the +subsystem. + +When the real object is not yet tracked by debugobjects debugobjects +allocates a tracker object for the real object and sets the tracker +object state to ODEBUG_STATE_INIT. It verifies that the object is on +the callers stack. + +An object which is on the stack must be removed from the tracker by +calling debug_object_free() before the function which allocates the +object returns. Otherwise we keep track of stale objects. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_activate + +This function is called whenever the activation function of a real +object is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be activated. Activating is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_activate function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real activation of the object happens. E.g. it can +deactivate an active object in order to prevent damage to the subsystem. + +When the real object is not yet tracked by debugobjects then the +fixup_activate function is called if available. This is necessary to +allow the legitimate activation of statically allocated and initialized +objects. The fixup function checks whether the object is valid and calls +the debug_objects_init() function to initialize the tracking of this +object. + +When the activation is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_ACTIVE. + + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_deactivate + +This function is called whenever the deactivation function of a real +object is called. + +When the real object is tracked by debugobjects it is checked, whether +the object can be deactivated. Deactivating is not allowed for untracked +or destroyed objects. + +When the deactivation is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_INACTIVE. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_destroy + +This function is called to mark an object destroyed. This is useful to +prevent the usage of invalid objects, which are still available in +memory: either statically allocated objects or objects which are freed +later. + +When the real object is tracked by debugobjects it is checked, whether +the object can be destroyed. Destruction is not allowed for active and +destroyed objects. When debugobjects detects an error, then it calls the +fixup_destroy function of the object type description structure if +provided by the caller. The fixup function can correct the problem +before the real destruction of the object happens. E.g. it can +deactivate an active object in order to prevent damage to the subsystem. + +When the destruction is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_DESTROYED. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_free + +This function is called before an object is freed. + +When the real object is tracked by debugobjects it is checked, whether +the object can be freed. Free is not allowed for active objects. When +debugobjects detects an error, then it calls the fixup_free function of +the object type description structure if provided by the caller. The +fixup function can correct the problem before the real free of the +object happens. E.g. it can deactivate an active object in order to +prevent damage to the subsystem. + +Note that debug_object_free removes the object from the tracker. Later +usage of the object is detected by the other debug checks. + + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_assert_init + +This function is called to assert that an object has been initialized. + +When the real object is not tracked by debugobjects, it calls +fixup_assert_init of the object type description structure provided by +the caller, with the hardcoded object state ODEBUG_NOT_AVAILABLE. The +fixup function can correct the problem by calling debug_object_init +and other specific initializing functions. + +When the real object is already tracked by debugobjects it is ignored. + +Fixup functions +=============== + +Debug object type description structure +--------------------------------------- + +.. kernel-doc:: include/linux/debugobjects.h + :internal: + +fixup_init +----------- + +This function is called from the debug code whenever a problem in +debug_object_init is detected. The function takes the address of the +object and the state which is currently recorded in the tracker. + +Called from debug_object_init when the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note, that the function needs to call the debug_object_init() function +again, after the damage has been repaired in order to keep the state +consistent. + +fixup_activate +--------------- + +This function is called from the debug code whenever a problem in +debug_object_activate is detected. + +Called from debug_object_activate when the object state is: + +- ODEBUG_STATE_NOTAVAILABLE + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note that the function needs to call the debug_object_activate() +function again after the damage has been repaired in order to keep the +state consistent. + +The activation of statically initialized objects is a special case. When +debug_object_activate() has no tracked object for this object address +then fixup_activate() is called with object state +ODEBUG_STATE_NOTAVAILABLE. The fixup function needs to check whether +this is a legitimate case of a statically initialized object or not. In +case it is it calls debug_object_init() and debug_object_activate() +to make the object known to the tracker and marked active. In this case +the function should return false because this is not a real fixup. + +fixup_destroy +-------------- + +This function is called from the debug code whenever a problem in +debug_object_destroy is detected. + +Called from debug_object_destroy when the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +fixup_free +----------- + +This function is called from the debug code whenever a problem in +debug_object_free is detected. Further it can be called from the debug +checks in kfree/vfree, when an active object is detected from the +debug_check_no_obj_freed() sanity checks. + +Called from debug_object_free() or debug_check_no_obj_freed() when +the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +fixup_assert_init +------------------- + +This function is called from the debug code whenever a problem in +debug_object_assert_init is detected. + +Called from debug_object_assert_init() with a hardcoded state +ODEBUG_STATE_NOTAVAILABLE when the object is not found in the debug +bucket. + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note, this function should make sure debug_object_init() is called +before returning. + +The handling of statically initialized objects is a special case. The +fixup function should check if this is a legitimate case of a statically +initialized object or not. In this case only debug_object_init() +should be called to make the object known to the tracker. Then the +function should return false because this is not a real fixup. + +Known Bugs And Assumptions +========================== + +None (knock on wood). diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst new file mode 100644 index 000000000..ff332e272 --- /dev/null +++ b/Documentation/core-api/errseq.rst @@ -0,0 +1,159 @@ +===================== +The errseq_t datatype +===================== + +An errseq_t is a way of recording errors in one place, and allowing any +number of "subscribers" to tell whether it has changed since a previous +point where it was sampled. + +The initial use case for this is tracking errors for file +synchronization syscalls (fsync, fdatasync, msync and sync_file_range), +but it may be usable in other situations. + +It's implemented as an unsigned 32-bit value. The low order bits are +designated to hold an error code (between 1 and MAX_ERRNO). The upper bits +are used as a counter. This is done with atomics instead of locking so that +these functions can be called from any context. + +Note that there is a risk of collisions if new errors are being recorded +frequently, since we have so few bits to use as a counter. + +To mitigate this, the bit between the error value and counter is used as +a flag to tell whether the value has been sampled since a new value was +recorded. That allows us to avoid bumping the counter if no one has +sampled it since the last time an error was recorded. + +Thus we end up with a value that looks something like this: + ++--------------------------------------+----+------------------------+ +| 31..13 | 12 | 11..0 | ++--------------------------------------+----+------------------------+ +| counter | SF | errno | ++--------------------------------------+----+------------------------+ + +The general idea is for "watchers" to sample an errseq_t value and keep +it as a running cursor. That value can later be used to tell whether +any new errors have occurred since that sampling was done, and atomically +record the state at the time that it was checked. This allows us to +record errors in one place, and then have a number of "watchers" that +can tell whether the value has changed since they last checked it. + +A new errseq_t should always be zeroed out. An errseq_t value of all zeroes +is the special (but common) case where there has never been an error. An all +zero value thus serves as the "epoch" if one wishes to know whether there +has ever been an error set since it was first initialized. + +API usage +========= + +Let me tell you a story about a worker drone. Now, he's a good worker +overall, but the company is a little...management heavy. He has to +report to 77 supervisors today, and tomorrow the "big boss" is coming in +from out of town and he's sure to test the poor fellow too. + +They're all handing him work to do -- so much he can't keep track of who +handed him what, but that's not really a big problem. The supervisors +just want to know when he's finished all of the work they've handed him so +far and whether he made any mistakes since they last asked. + +He might have made the mistake on work they didn't actually hand him, +but he can't keep track of things at that level of detail, all he can +remember is the most recent mistake that he made. + +Here's our worker_drone representation:: + + struct worker_drone { + errseq_t wd_err; /* for recording errors */ + }; + +Every day, the worker_drone starts out with a blank slate:: + + struct worker_drone wd; + + wd.wd_err = (errseq_t)0; + +The supervisors come in and get an initial read for the day. They +don't care about anything that happened before their watch begins:: + + struct supervisor { + errseq_t s_wd_err; /* private "cursor" for wd_err */ + spinlock_t s_wd_err_lock; /* protects s_wd_err */ + } + + struct supervisor su; + + su.s_wd_err = errseq_sample(&wd.wd_err); + spin_lock_init(&su.s_wd_err_lock); + +Now they start handing him tasks to do. Every few minutes they ask him to +finish up all of the work they've handed him so far. Then they ask him +whether he made any mistakes on any of it:: + + spin_lock(&su.su_wd_err_lock); + err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); + spin_unlock(&su.su_wd_err_lock); + +Up to this point, that just keeps returning 0. + +Now, the owners of this company are quite miserly and have given him +substandard equipment with which to do his job. Occasionally it +glitches and he makes a mistake. He sighs a heavy sigh, and marks it +down:: + + errseq_set(&wd.wd_err, -EIO); + +...and then gets back to work. The supervisors eventually poll again +and they each get the error when they next check. Subsequent calls will +return 0, until another error is recorded, at which point it's reported +to each of them once. + +Note that the supervisors can't tell how many mistakes he made, only +whether one was made since they last checked, and the latest value +recorded. + +Occasionally the big boss comes in for a spot check and asks the worker +to do a one-off job for him. He's not really watching the worker +full-time like the supervisors, but he does need to know whether a +mistake occurred while his job was processing. + +He can just sample the current errseq_t in the worker, and then use that +to tell whether an error has occurred later:: + + errseq_t since = errseq_sample(&wd.wd_err); + /* submit some work and wait for it to complete */ + err = errseq_check(&wd.wd_err, since); + +Since he's just going to discard "since" after that point, he doesn't +need to advance it here. He also doesn't need any locking since it's +not usable by anyone else. + +Serializing errseq_t cursor updates +=================================== + +Note that the errseq_t API does not protect the errseq_t cursor during a +check_and_advance_operation. Only the canonical error code is handled +atomically. In a situation where more than one task might be using the +same errseq_t cursor at the same time, it's important to serialize +updates to that cursor. + +If that's not done, then it's possible for the cursor to go backward +in which case the same error could be reported more than once. + +Because of this, it's often advantageous to first do an errseq_check to +see if anything has changed, and only later do an +errseq_check_and_advance after taking the lock. e.g.:: + + if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) { + /* su.s_wd_err is protected by s_wd_err_lock */ + spin_lock(&su.s_wd_err_lock); + err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); + spin_unlock(&su.s_wd_err_lock); + } + +That avoids the spinlock in the common case where nothing has changed +since the last time it was checked. + +Functions +========= + +.. kernel-doc:: lib/errseq.c diff --git a/Documentation/core-api/flexible-arrays.rst b/Documentation/core-api/flexible-arrays.rst new file mode 100644 index 000000000..b6b85a1b5 --- /dev/null +++ b/Documentation/core-api/flexible-arrays.rst @@ -0,0 +1,130 @@ + +=================================== +Using flexible arrays in the kernel +=================================== + +Large contiguous memory allocations can be unreliable in the Linux kernel. +Kernel programmers will sometimes respond to this problem by allocating +pages with :c:func:`vmalloc()`. This solution not ideal, though. On 32-bit +systems, memory from vmalloc() must be mapped into a relatively small address +space; it's easy to run out. On SMP systems, the page table changes required +by vmalloc() allocations can require expensive cross-processor interrupts on +all CPUs. And, on all systems, use of space in the vmalloc() range increases +pressure on the translation lookaside buffer (TLB), reducing the performance +of the system. + +In many cases, the need for memory from vmalloc() can be eliminated by piecing +together an array from smaller parts; the flexible array library exists to make +this task easier. + +A flexible array holds an arbitrary (within limits) number of fixed-sized +objects, accessed via an integer index. Sparse arrays are handled +reasonably well. Only single-page allocations are made, so memory +allocation failures should be relatively rare. The down sides are that the +arrays cannot be indexed directly, individual object size cannot exceed the +system page size, and putting data into a flexible array requires a copy +operation. It's also worth noting that flexible arrays do no internal +locking at all; if concurrent access to an array is possible, then the +caller must arrange for appropriate mutual exclusion. + +The creation of a flexible array is done with :c:func:`flex_array_alloc()`:: + + #include <linux/flex_array.h> + + struct flex_array *flex_array_alloc(int element_size, + unsigned int total, + gfp_t flags); + +The individual object size is provided by ``element_size``, while total is the +maximum number of objects which can be stored in the array. The flags +argument is passed directly to the internal memory allocation calls. With +the current code, using flags to ask for high memory is likely to lead to +notably unpleasant side effects. + +It is also possible to define flexible arrays at compile time with:: + + DEFINE_FLEX_ARRAY(name, element_size, total); + +This macro will result in a definition of an array with the given name; the +element size and total will be checked for validity at compile time. + +Storing data into a flexible array is accomplished with a call to +:c:func:`flex_array_put()`:: + + int flex_array_put(struct flex_array *array, unsigned int element_nr, + void *src, gfp_t flags); + +This call will copy the data from src into the array, in the position +indicated by ``element_nr`` (which must be less than the maximum specified when +the array was created). If any memory allocations must be performed, flags +will be used. The return value is zero on success, a negative error code +otherwise. + +There might possibly be a need to store data into a flexible array while +running in some sort of atomic context; in this situation, sleeping in the +memory allocator would be a bad thing. That can be avoided by using +``GFP_ATOMIC`` for the flags value, but, often, there is a better way. The +trick is to ensure that any needed memory allocations are done before +entering atomic context, using :c:func:`flex_array_prealloc()`:: + + int flex_array_prealloc(struct flex_array *array, unsigned int start, + unsigned int nr_elements, gfp_t flags); + +This function will ensure that memory for the elements indexed in the range +defined by ``start`` and ``nr_elements`` has been allocated. Thereafter, a +``flex_array_put()`` call on an element in that range is guaranteed not to +block. + +Getting data back out of the array is done with :c:func:`flex_array_get()`:: + + void *flex_array_get(struct flex_array *fa, unsigned int element_nr); + +The return value is a pointer to the data element, or NULL if that +particular element has never been allocated. + +Note that it is possible to get back a valid pointer for an element which +has never been stored in the array. Memory for array elements is allocated +one page at a time; a single allocation could provide memory for several +adjacent elements. Flexible array elements are normally initialized to the +value ``FLEX_ARRAY_FREE`` (defined as 0x6c in <linux/poison.h>), so errors +involving that number probably result from use of unstored array entries. +Note that, if array elements are allocated with ``__GFP_ZERO``, they will be +initialized to zero and this poisoning will not happen. + +Individual elements in the array can be cleared with +:c:func:`flex_array_clear()`:: + + int flex_array_clear(struct flex_array *array, unsigned int element_nr); + +This function will set the given element to ``FLEX_ARRAY_FREE`` and return +zero. If storage for the indicated element is not allocated for the array, +``flex_array_clear()`` will return ``-EINVAL`` instead. Note that clearing an +element does not release the storage associated with it; to reduce the +allocated size of an array, call :c:func:`flex_array_shrink()`:: + + int flex_array_shrink(struct flex_array *array); + +The return value will be the number of pages of memory actually freed. +This function works by scanning the array for pages containing nothing but +``FLEX_ARRAY_FREE`` bytes, so (1) it can be expensive, and (2) it will not work +if the array's pages are allocated with ``__GFP_ZERO``. + +It is possible to remove all elements of an array with a call to +:c:func:`flex_array_free_parts()`:: + + void flex_array_free_parts(struct flex_array *array); + +This call frees all elements, but leaves the array itself in place. +Freeing the entire array is done with :c:func:`flex_array_free()`:: + + void flex_array_free(struct flex_array *array); + +As of this writing, there are no users of flexible arrays in the mainline +kernel. The functions described here are also not exported to modules; +that will probably be fixed when somebody comes up with a need for it. + + +Flexible array functions +------------------------ + +.. kernel-doc:: include/linux/flex_array.h diff --git a/Documentation/core-api/genalloc.rst b/Documentation/core-api/genalloc.rst new file mode 100644 index 000000000..6b38a39fa --- /dev/null +++ b/Documentation/core-api/genalloc.rst @@ -0,0 +1,144 @@ +The genalloc/genpool subsystem +============================== + +There are a number of memory-allocation subsystems in the kernel, each +aimed at a specific need. Sometimes, however, a kernel developer needs to +implement a new allocator for a specific range of special-purpose memory; +often that memory is located on a device somewhere. The author of the +driver for that device can certainly write a little allocator to get the +job done, but that is the way to fill the kernel with dozens of poorly +tested allocators. Back in 2005, Jes Sorensen lifted one of those +allocators from the sym53c8xx_2 driver and posted_ it as a generic module +for the creation of ad hoc memory allocators. This code was merged +for the 2.6.13 release; it has been modified considerably since then. + +.. _posted: https://lwn.net/Articles/125842/ + +Code using this allocator should include <linux/genalloc.h>. The action +begins with the creation of a pool using one of: + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_create + +.. kernel-doc:: lib/genalloc.c + :functions: devm_gen_pool_create + +A call to :c:func:`gen_pool_create` will create a pool. The granularity of +allocations is set with min_alloc_order; it is a log-base-2 number like +those used by the page allocator, but it refers to bytes rather than pages. +So, if min_alloc_order is passed as 3, then all allocations will be a +multiple of eight bytes. Increasing min_alloc_order decreases the memory +required to track the memory in the pool. The nid parameter specifies +which NUMA node should be used for the allocation of the housekeeping +structures; it can be -1 if the caller doesn't care. + +The "managed" interface :c:func:`devm_gen_pool_create` ties the pool to a +specific device. Among other things, it will automatically clean up the +pool when the given device is destroyed. + +A pool is shut down with: + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_destroy + +It's worth noting that, if there are still allocations outstanding from the +given pool, this function will take the rather extreme step of invoking +BUG(), crashing the entire system. You have been warned. + +A freshly created pool has no memory to allocate. It is fairly useless in +that state, so one of the first orders of business is usually to add memory +to the pool. That can be done with one of: + +.. kernel-doc:: include/linux/genalloc.h + :functions: gen_pool_add + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_add_virt + +A call to :c:func:`gen_pool_add` will place the size bytes of memory +starting at addr (in the kernel's virtual address space) into the given +pool, once again using nid as the node ID for ancillary memory allocations. +The :c:func:`gen_pool_add_virt` variant associates an explicit physical +address with the memory; this is only necessary if the pool will be used +for DMA allocations. + +The functions for allocating memory from the pool (and putting it back) +are: + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_alloc + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_dma_alloc + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_free + +As one would expect, :c:func:`gen_pool_alloc` will allocate size< bytes +from the given pool. The :c:func:`gen_pool_dma_alloc` variant allocates +memory for use with DMA operations, returning the associated physical +address in the space pointed to by dma. This will only work if the memory +was added with :c:func:`gen_pool_add_virt`. Note that this function +departs from the usual genpool pattern of using unsigned long values to +represent kernel addresses; it returns a void * instead. + +That all seems relatively simple; indeed, some developers clearly found it +to be too simple. After all, the interface above provides no control over +how the allocation functions choose which specific piece of memory to +return. If that sort of control is needed, the following functions will be +of interest: + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_alloc_algo + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_set_algo + +Allocations with :c:func:`gen_pool_alloc_algo` specify an algorithm to be +used to choose the memory to be allocated; the default algorithm can be set +with :c:func:`gen_pool_set_algo`. The data value is passed to the +algorithm; most ignore it, but it is occasionally needed. One can, +naturally, write a special-purpose algorithm, but there is a fair set +already available: + +- gen_pool_first_fit is a simple first-fit allocator; this is the default + algorithm if none other has been specified. + +- gen_pool_first_fit_align forces the allocation to have a specific + alignment (passed via data in a genpool_data_align structure). + +- gen_pool_first_fit_order_align aligns the allocation to the order of the + size. A 60-byte allocation will thus be 64-byte aligned, for example. + +- gen_pool_best_fit, as one would expect, is a simple best-fit allocator. + +- gen_pool_fixed_alloc allocates at a specific offset (passed in a + genpool_data_fixed structure via the data parameter) within the pool. + If the indicated memory is not available the allocation fails. + +There is a handful of other functions, mostly for purposes like querying +the space available in the pool or iterating through chunks of memory. +Most users, however, should not need much beyond what has been described +above. With luck, wider awareness of this module will help to prevent the +writing of special-purpose memory allocators in the future. + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_virt_to_phys + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_for_each_chunk + +.. kernel-doc:: lib/genalloc.c + :functions: addr_in_gen_pool + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_avail + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_size + +.. kernel-doc:: lib/genalloc.c + :functions: gen_pool_get + +.. kernel-doc:: lib/genalloc.c + :functions: of_gen_pool_get diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst new file mode 100644 index 000000000..4da67b65c --- /dev/null +++ b/Documentation/core-api/genericirq.rst @@ -0,0 +1,440 @@ +.. include:: <isonum.txt> + +========================== +Linux generic IRQ handling +========================== + +:Copyright: |copy| 2005-2010: Thomas Gleixner +:Copyright: |copy| 2005-2006: Ingo Molnar + +Introduction +============ + +The generic interrupt handling layer is designed to provide a complete +abstraction of interrupt handling for device drivers. It is able to +handle all the different types of interrupt controller hardware. Device +drivers use generic API functions to request, enable, disable and free +interrupts. The drivers do not have to know anything about interrupt +hardware details, so they can be used on different platforms without +code changes. + +This documentation is provided to developers who want to implement an +interrupt subsystem based for their architecture, with the help of the +generic IRQ handling layer. + +Rationale +========= + +The original implementation of interrupt handling in Linux uses the +:c:func:`__do_IRQ` super-handler, which is able to deal with every type of +interrupt logic. + +Originally, Russell King identified different types of handlers to build +a quite universal set for the ARM interrupt handler implementation in +Linux 2.5/2.6. He distinguished between: + +- Level type + +- Edge type + +- Simple type + +During the implementation we identified another type: + +- Fast EOI type + +In the SMP world of the :c:func:`__do_IRQ` super-handler another type was +identified: + +- Per CPU type + +This split implementation of high-level IRQ handlers allows us to +optimize the flow of the interrupt handling for each specific interrupt +type. This reduces complexity in that particular code path and allows +the optimized handling of a given type. + +The original general IRQ implementation used hw_interrupt_type +structures and their ``->ack``, ``->end`` [etc.] callbacks to differentiate +the flow control in the super-handler. This leads to a mix of flow logic +and low-level hardware logic, and it also leads to unnecessary code +duplication: for example in i386, there is an ``ioapic_level_irq`` and an +``ioapic_edge_irq`` IRQ-type which share many of the low-level details but +have different flow handling. + +A more natural abstraction is the clean separation of the 'irq flow' and +the 'chip details'. + +Analysing a couple of architecture's IRQ subsystem implementations +reveals that most of them can use a generic set of 'irq flow' methods +and only need to add the chip-level specific code. The separation is +also valuable for (sub)architectures which need specific quirks in the +IRQ flow itself but not in the chip details - and thus provides a more +transparent IRQ subsystem design. + +Each interrupt descriptor is assigned its own high-level flow handler, +which is normally one of the generic implementations. (This high-level +flow handler implementation also makes it simple to provide +demultiplexing handlers which can be found in embedded platforms on +various architectures.) + +The separation makes the generic interrupt handling layer more flexible +and extensible. For example, an (sub)architecture can use a generic +IRQ-flow implementation for 'level type' interrupts and add a +(sub)architecture specific 'edge type' implementation. + +To make the transition to the new model easier and prevent the breakage +of existing implementations, the :c:func:`__do_IRQ` super-handler is still +available. This leads to a kind of duality for the time being. Over time +the new model should be used in more and more architectures, as it +enables smaller and cleaner IRQ subsystems. It's deprecated for three +years now and about to be removed. + +Known Bugs And Assumptions +========================== + +None (knock on wood). + +Abstraction layers +================== + +There are three main levels of abstraction in the interrupt code: + +1. High-level driver API + +2. High-level IRQ flow handlers + +3. Chip-level hardware encapsulation + +Interrupt control flow +---------------------- + +Each interrupt is described by an interrupt descriptor structure +irq_desc. The interrupt is referenced by an 'unsigned int' numeric +value which selects the corresponding interrupt description structure in +the descriptor structures array. The descriptor structure contains +status information and pointers to the interrupt flow method and the +interrupt chip structure which are assigned to this interrupt. + +Whenever an interrupt triggers, the low-level architecture code calls +into the generic interrupt code by calling :c:func:`desc->handle_irq`. This +high-level IRQ handling function only uses desc->irq_data.chip +primitives referenced by the assigned chip descriptor structure. + +High-level Driver API +--------------------- + +The high-level Driver API consists of following functions: + +- :c:func:`request_irq` + +- :c:func:`free_irq` + +- :c:func:`disable_irq` + +- :c:func:`enable_irq` + +- :c:func:`disable_irq_nosync` (SMP only) + +- :c:func:`synchronize_irq` (SMP only) + +- :c:func:`irq_set_irq_type` + +- :c:func:`irq_set_irq_wake` + +- :c:func:`irq_set_handler_data` + +- :c:func:`irq_set_chip` + +- :c:func:`irq_set_chip_data` + +See the autogenerated function documentation for details. + +High-level IRQ flow handlers +---------------------------- + +The generic layer provides a set of pre-defined irq-flow methods: + +- :c:func:`handle_level_irq` + +- :c:func:`handle_edge_irq` + +- :c:func:`handle_fasteoi_irq` + +- :c:func:`handle_simple_irq` + +- :c:func:`handle_percpu_irq` + +- :c:func:`handle_edge_eoi_irq` + +- :c:func:`handle_bad_irq` + +The interrupt flow handlers (either pre-defined or architecture +specific) are assigned to specific interrupts by the architecture either +during bootup or during device initialization. + +Default flow implementations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Helper functions +^^^^^^^^^^^^^^^^ + +The helper functions call the chip primitives and are used by the +default flow implementations. The following helper functions are +implemented (simplified excerpt):: + + default_enable(struct irq_data *data) + { + desc->irq_data.chip->irq_unmask(data); + } + + default_disable(struct irq_data *data) + { + if (!delay_disable(data)) + desc->irq_data.chip->irq_mask(data); + } + + default_ack(struct irq_data *data) + { + chip->irq_ack(data); + } + + default_mask_ack(struct irq_data *data) + { + if (chip->irq_mask_ack) { + chip->irq_mask_ack(data); + } else { + chip->irq_mask(data); + chip->irq_ack(data); + } + } + + noop(struct irq_data *data)) + { + } + + + +Default flow handler implementations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Default Level IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_level_irq provides a generic implementation for level-triggered +interrupts. + +The following control flow is implemented (simplified excerpt):: + + desc->irq_data.chip->irq_mask_ack(); + handle_irq_event(desc->action); + desc->irq_data.chip->irq_unmask(); + + +Default Fast EOI IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_fasteoi_irq provides a generic implementation for interrupts, +which only need an EOI at the end of the handler. + +The following control flow is implemented (simplified excerpt):: + + handle_irq_event(desc->action); + desc->irq_data.chip->irq_eoi(); + + +Default Edge IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_edge_irq provides a generic implementation for edge-triggered +interrupts. + +The following control flow is implemented (simplified excerpt):: + + if (desc->status & running) { + desc->irq_data.chip->irq_mask_ack(); + desc->status |= pending | masked; + return; + } + desc->irq_data.chip->irq_ack(); + desc->status |= running; + do { + if (desc->status & masked) + desc->irq_data.chip->irq_unmask(); + desc->status &= ~pending; + handle_irq_event(desc->action); + } while (status & pending); + desc->status &= ~running; + + +Default simple IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_simple_irq provides a generic implementation for simple +interrupts. + +.. note:: + + The simple flow handler does not call any handler/chip primitives. + +The following control flow is implemented (simplified excerpt):: + + handle_irq_event(desc->action); + + +Default per CPU flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_percpu_irq provides a generic implementation for per CPU +interrupts. + +Per CPU interrupts are only available on SMP and the handler provides a +simplified version without locking. + +The following control flow is implemented (simplified excerpt):: + + if (desc->irq_data.chip->irq_ack) + desc->irq_data.chip->irq_ack(); + handle_irq_event(desc->action); + if (desc->irq_data.chip->irq_eoi) + desc->irq_data.chip->irq_eoi(); + + +EOI Edge IRQ flow handler +^^^^^^^^^^^^^^^^^^^^^^^^^ + +handle_edge_eoi_irq provides an abnomination of the edge handler +which is solely used to tame a badly wreckaged irq controller on +powerpc/cell. + +Bad IRQ flow handler +^^^^^^^^^^^^^^^^^^^^ + +handle_bad_irq is used for spurious interrupts which have no real +handler assigned.. + +Quirks and optimizations +~~~~~~~~~~~~~~~~~~~~~~~~ + +The generic functions are intended for 'clean' architectures and chips, +which have no platform-specific IRQ handling quirks. If an architecture +needs to implement quirks on the 'flow' level then it can do so by +overriding the high-level irq-flow handler. + +Delayed interrupt disable +~~~~~~~~~~~~~~~~~~~~~~~~~ + +This per interrupt selectable feature, which was introduced by Russell +King in the ARM interrupt implementation, does not mask an interrupt at +the hardware level when :c:func:`disable_irq` is called. The interrupt is kept +enabled and is masked in the flow handler when an interrupt event +happens. This prevents losing edge interrupts on hardware which does not +store an edge interrupt event while the interrupt is disabled at the +hardware level. When an interrupt arrives while the IRQ_DISABLED flag +is set, then the interrupt is masked at the hardware level and the +IRQ_PENDING bit is set. When the interrupt is re-enabled by +:c:func:`enable_irq` the pending bit is checked and if it is set, the interrupt +is resent either via hardware or by a software resend mechanism. (It's +necessary to enable CONFIG_HARDIRQS_SW_RESEND when you want to use +the delayed interrupt disable feature and your hardware is not capable +of retriggering an interrupt.) The delayed interrupt disable is not +configurable. + +Chip-level hardware encapsulation +--------------------------------- + +The chip-level hardware descriptor structure :c:type:`irq_chip` contains all +the direct chip relevant functions, which can be utilized by the irq flow +implementations. + +- ``irq_ack`` + +- ``irq_mask_ack`` - Optional, recommended for performance + +- ``irq_mask`` + +- ``irq_unmask`` + +- ``irq_eoi`` - Optional, required for EOI flow handlers + +- ``irq_retrigger`` - Optional + +- ``irq_set_type`` - Optional + +- ``irq_set_wake`` - Optional + +These primitives are strictly intended to mean what they say: ack means +ACK, masking means masking of an IRQ line, etc. It is up to the flow +handler(s) to use these basic units of low-level functionality. + +__do_IRQ entry point +==================== + +The original implementation :c:func:`__do_IRQ` was an alternative entry point +for all types of interrupts. It no longer exists. + +This handler turned out to be not suitable for all interrupt hardware +and was therefore reimplemented with split functionality for +edge/level/simple/percpu interrupts. This is not only a functional +optimization. It also shortens code paths for interrupts. + +Locking on SMP +============== + +The locking of chip registers is up to the architecture that defines the +chip primitives. The per-irq structure is protected via desc->lock, by +the generic layer. + +Generic interrupt chip +====================== + +To avoid copies of identical implementations of IRQ chips the core +provides a configurable generic interrupt chip implementation. +Developers should check carefully whether the generic chip fits their +needs before implementing the same functionality slightly differently +themselves. + +.. kernel-doc:: kernel/irq/generic-chip.c + :export: + +Structures +========== + +This chapter contains the autogenerated documentation of the structures +which are used in the generic IRQ layer. + +.. kernel-doc:: include/linux/irq.h + :internal: + +.. kernel-doc:: include/linux/interrupt.h + :internal: + +Public Functions Provided +========================= + +This chapter contains the autogenerated documentation of the kernel API +functions which are exported. + +.. kernel-doc:: kernel/irq/manage.c + +.. kernel-doc:: kernel/irq/chip.c + +Internal Functions Provided +=========================== + +This chapter contains the autogenerated documentation of the internal +functions. + +.. kernel-doc:: kernel/irq/irqdesc.c + +.. kernel-doc:: kernel/irq/handle.c + +.. kernel-doc:: kernel/irq/chip.c + +Credits +======= + +The following people have contributed to this document: + +1. Thomas Gleixner tglx@linutronix.de + +2. Ingo Molnar mingo@elte.hu diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst new file mode 100644 index 000000000..e0df8f416 --- /dev/null +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst @@ -0,0 +1,66 @@ +================================= +GFP masks used from FS/IO context +================================= + +:Date: May, 2018 +:Author: Michal Hocko <mhocko@kernel.org> + +Introduction +============ + +Code paths in the filesystem and IO stacks must be careful when +allocating memory to prevent recursion deadlocks caused by direct +memory reclaim calling back into the FS or IO paths and blocking on +already held resources (e.g. locks - most commonly those used for the +transaction context). + +The traditional way to avoid this deadlock problem is to clear __GFP_FS +respectively __GFP_IO (note the latter implies clearing the first as well) in +the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be +used as shortcut. It turned out though that above approach has led to +abuses when the restricted gfp mask is used "just in case" without a +deeper consideration which leads to problems because an excessive use +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory +reclaim issues. + +New API +======== + +Since 4.12 we do have a generic scope API for both NOFS and NOIO context +``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``, +``memalloc_noio_restore`` which allow to mark a scope to be a critical +section from a filesystem or I/O point of view. Any allocation from that +scope will inherently drop __GFP_FS respectively __GFP_IO from the given +mask so no memory allocation can recurse back in the FS/IO. + +.. kernel-doc:: include/linux/sched/mm.h + :functions: memalloc_nofs_save memalloc_nofs_restore +.. kernel-doc:: include/linux/sched/mm.h + :functions: memalloc_noio_save memalloc_noio_restore + +FS/IO code then simply calls the appropriate save function before +any critical section with respect to the reclaim is started - e.g. +lock shared with the reclaim context or when a transaction context +nesting would be possible via reclaim. The restore function should be +called when the critical section ends. All that ideally along with an +explanation what is the reclaim context for easier maintenance. + +Please note that the proper pairing of save/restore functions +allows nesting so it is safe to call ``memalloc_noio_save`` or +``memalloc_noio_restore`` respectively from an existing NOIO or NOFS +scope. + +What about __vmalloc(GFP_NOFS) +============================== + +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is +almost always a bug. The good news is that the NOFS/NOIO semantic can be +achieved by the scope API. + +In the ideal world, upper layers should already mark dangerous contexts +and so no special care is required and vmalloc should be called without +any problems. Sometimes if the context is not really clear or there are +layering violations then the recommended way around that is to wrap ``vmalloc`` +by the scope API with a comment explaining the problem. diff --git a/Documentation/core-api/idr.rst b/Documentation/core-api/idr.rst new file mode 100644 index 000000000..a2738050c --- /dev/null +++ b/Documentation/core-api/idr.rst @@ -0,0 +1,81 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============= +ID Allocation +============= + +:Author: Matthew Wilcox + +Overview +======== + +A common problem to solve is allocating identifiers (IDs); generally +small numbers which identify a thing. Examples include file descriptors, +process IDs, packet identifiers in networking protocols, SCSI tags +and device instance numbers. The IDR and the IDA provide a reasonable +solution to the problem to avoid everybody inventing their own. The IDR +provides the ability to map an ID to a pointer, while the IDA provides +only ID allocation, and as a result is much more memory-efficient. + +IDR usage +========= + +Start by initialising an IDR, either with :c:func:`DEFINE_IDR` +for statically allocated IDRs or :c:func:`idr_init` for dynamically +allocated IDRs. + +You can call :c:func:`idr_alloc` to allocate an unused ID. Look up +the pointer you associated with the ID by calling :c:func:`idr_find` +and free the ID by calling :c:func:`idr_remove`. + +If you need to change the pointer associated with an ID, you can call +:c:func:`idr_replace`. One common reason to do this is to reserve an +ID by passing a ``NULL`` pointer to the allocation function; initialise the +object with the reserved ID and finally insert the initialised object +into the IDR. + +Some users need to allocate IDs larger than ``INT_MAX``. So far all of +these users have been content with a ``UINT_MAX`` limit, and they use +:c:func:`idr_alloc_u32`. If you need IDs that will not fit in a u32, +we will work with you to address your needs. + +If you need to allocate IDs sequentially, you can use +:c:func:`idr_alloc_cyclic`. The IDR becomes less efficient when dealing +with larger IDs, so using this function comes at a slight cost. + +To perform an action on all pointers used by the IDR, you can +either use the callback-based :c:func:`idr_for_each` or the +iterator-style :c:func:`idr_for_each_entry`. You may need to use +:c:func:`idr_for_each_entry_continue` to continue an iteration. You can +also use :c:func:`idr_get_next` if the iterator doesn't fit your needs. + +When you have finished using an IDR, you can call :c:func:`idr_destroy` +to release the memory used by the IDR. This will not free the objects +pointed to from the IDR; if you want to do that, use one of the iterators +to do it. + +You can use :c:func:`idr_is_empty` to find out whether there are any +IDs currently allocated. + +If you need to take a lock while allocating a new ID from the IDR, +you may need to pass a restrictive set of GFP flags, which can lead +to the IDR being unable to allocate memory. To work around this, +you can call :c:func:`idr_preload` before taking the lock, and then +:c:func:`idr_preload_end` after the allocation. + +.. kernel-doc:: include/linux/idr.h + :doc: idr sync + +IDA usage +========= + +.. kernel-doc:: lib/idr.c + :doc: IDA description + +Functions and structures +======================== + +.. kernel-doc:: include/linux/idr.h + :functions: +.. kernel-doc:: lib/idr.c + :functions: diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst new file mode 100644 index 000000000..26b735cef --- /dev/null +++ b/Documentation/core-api/index.rst @@ -0,0 +1,49 @@ +====================== +Core API Documentation +====================== + +This is the beginning of a manual for core kernel APIs. The conversion +(and writing!) of documents for this manual is much appreciated! + +Core utilities +============== + +.. toctree:: + :maxdepth: 1 + + kernel-api + assoc_array + atomic_ops + cachetlb + refcount-vs-atomic + cpu_hotplug + idr + local_ops + workqueue + genericirq + flexible-arrays + librs + genalloc + errseq + printk-formats + circular-buffers + mm-api + gfp_mask-from-fs-io + timekeeping + boot-time-mm + +Interfaces for kernel debugging +=============================== + +.. toctree:: + :maxdepth: 1 + + debug-objects + tracepoint + +.. only:: subproject + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst new file mode 100644 index 000000000..3431337ee --- /dev/null +++ b/Documentation/core-api/kernel-api.rst @@ -0,0 +1,389 @@ +==================== +The Linux Kernel API +==================== + + +List Management Functions +========================= + +.. kernel-doc:: include/linux/list.h + :internal: + +Basic C Library Functions +========================= + +When writing drivers, you cannot in general use routines which are from +the C Library. Some of the functions have been found generally useful +and they are listed below. The behaviour of these functions may vary +slightly from those defined by ANSI, and these deviations are noted in +the text. + +String Conversions +------------------ + +.. kernel-doc:: lib/vsprintf.c + :export: + +.. kernel-doc:: include/linux/kernel.h + :functions: kstrtol + +.. kernel-doc:: include/linux/kernel.h + :functions: kstrtoul + +.. kernel-doc:: lib/kstrtox.c + :export: + +String Manipulation +------------------- + +.. kernel-doc:: lib/string.c + :export: + +.. kernel-doc:: mm/util.c + :functions: kstrdup kstrdup_const kstrndup kmemdup kmemdup_nul memdup_user + vmemdup_user strndup_user memdup_user_nul + +Basic Kernel Library Functions +============================== + +The Linux kernel provides more basic utility functions. + +Bit Operations +-------------- + +.. kernel-doc:: arch/x86/include/asm/bitops.h + :internal: + +Bitmap Operations +----------------- + +.. kernel-doc:: lib/bitmap.c + :doc: bitmap introduction + +.. kernel-doc:: include/linux/bitmap.h + :doc: declare bitmap + +.. kernel-doc:: include/linux/bitmap.h + :doc: bitmap overview + +.. kernel-doc:: include/linux/bitmap.h + :doc: bitmap bitops + +.. kernel-doc:: lib/bitmap.c + :export: + +.. kernel-doc:: lib/bitmap.c + :internal: + +.. kernel-doc:: include/linux/bitmap.h + :internal: + +Command-line Parsing +-------------------- + +.. kernel-doc:: lib/cmdline.c + :export: + +Sorting +------- + +.. kernel-doc:: lib/sort.c + :export: + +.. kernel-doc:: lib/list_sort.c + :export: + +Text Searching +-------------- + +.. kernel-doc:: lib/textsearch.c + :doc: ts_intro + +.. kernel-doc:: lib/textsearch.c + :export: + +.. kernel-doc:: include/linux/textsearch.h + :functions: textsearch_find textsearch_next \ + textsearch_get_pattern textsearch_get_pattern_len + +CRC and Math Functions in Linux +=============================== + +CRC Functions +------------- + +.. kernel-doc:: lib/crc4.c + :export: + +.. kernel-doc:: lib/crc7.c + :export: + +.. kernel-doc:: lib/crc8.c + :export: + +.. kernel-doc:: lib/crc16.c + :export: + +.. kernel-doc:: lib/crc32.c + +.. kernel-doc:: lib/crc-ccitt.c + :export: + +.. kernel-doc:: lib/crc-itu-t.c + :export: + +Base 2 log and power Functions +------------------------------ + +.. kernel-doc:: include/linux/log2.h + :internal: + +Division Functions +------------------ + +.. kernel-doc:: include/asm-generic/div64.h + :functions: do_div + +.. kernel-doc:: include/linux/math64.h + :internal: + +.. kernel-doc:: lib/div64.c + :functions: div_s64_rem div64_u64_rem div64_u64 div64_s64 + +.. kernel-doc:: lib/gcd.c + :export: + +UUID/GUID +--------- + +.. kernel-doc:: lib/uuid.c + :export: + +Kernel IPC facilities +===================== + +IPC utilities +------------- + +.. kernel-doc:: ipc/util.c + :internal: + +FIFO Buffer +=========== + +kfifo interface +--------------- + +.. kernel-doc:: include/linux/kfifo.h + :internal: + +relay interface support +======================= + +Relay interface support is designed to provide an efficient mechanism +for tools and facilities to relay large amounts of data from kernel +space to user space. + +relay interface +--------------- + +.. kernel-doc:: kernel/relay.c + :export: + +.. kernel-doc:: kernel/relay.c + :internal: + +Module Support +============== + +Module Loading +-------------- + +.. kernel-doc:: kernel/kmod.c + :export: + +Inter Module support +-------------------- + +Refer to the file kernel/module.c for more information. + +Hardware Interfaces +=================== + +Interrupt Handling +------------------ + +.. kernel-doc:: kernel/irq/manage.c + :export: + +DMA Channels +------------ + +.. kernel-doc:: kernel/dma.c + :export: + +Resources Management +-------------------- + +.. kernel-doc:: kernel/resource.c + :internal: + +.. kernel-doc:: kernel/resource.c + :export: + +MTRR Handling +------------- + +.. kernel-doc:: arch/x86/kernel/cpu/mtrr/mtrr.c + :export: + +Security Framework +================== + +.. kernel-doc:: security/security.c + :internal: + +.. kernel-doc:: security/inode.c + :export: + +Audit Interfaces +================ + +.. kernel-doc:: kernel/audit.c + :export: + +.. kernel-doc:: kernel/auditsc.c + :internal: + +.. kernel-doc:: kernel/auditfilter.c + :internal: + +Accounting Framework +==================== + +.. kernel-doc:: kernel/acct.c + :internal: + +Block Devices +============= + +.. kernel-doc:: block/blk-core.c + :export: + +.. kernel-doc:: block/blk-core.c + :internal: + +.. kernel-doc:: block/blk-map.c + :export: + +.. kernel-doc:: block/blk-sysfs.c + :internal: + +.. kernel-doc:: block/blk-settings.c + :export: + +.. kernel-doc:: block/blk-exec.c + :export: + +.. kernel-doc:: block/blk-flush.c + :export: + +.. kernel-doc:: block/blk-lib.c + :export: + +.. kernel-doc:: block/blk-tag.c + :export: + +.. kernel-doc:: block/blk-tag.c + :internal: + +.. kernel-doc:: block/blk-integrity.c + :export: + +.. kernel-doc:: kernel/trace/blktrace.c + :internal: + +.. kernel-doc:: block/genhd.c + :internal: + +.. kernel-doc:: block/genhd.c + :export: + +Char devices +============ + +.. kernel-doc:: fs/char_dev.c + :export: + +Clock Framework +=============== + +The clock framework defines programming interfaces to support software +management of the system clock tree. This framework is widely used with +System-On-Chip (SOC) platforms to support power management and various +devices which may need custom clock rates. Note that these "clocks" +don't relate to timekeeping or real time clocks (RTCs), each of which +have separate frameworks. These :c:type:`struct clk <clk>` +instances may be used to manage for example a 96 MHz signal that is used +to shift bits into and out of peripherals or busses, or otherwise +trigger synchronous state machine transitions in system hardware. + +Power management is supported by explicit software clock gating: unused +clocks are disabled, so the system doesn't waste power changing the +state of transistors that aren't in active use. On some systems this may +be backed by hardware clock gating, where clocks are gated without being +disabled in software. Sections of chips that are powered but not clocked +may be able to retain their last state. This low power state is often +called a *retention mode*. This mode still incurs leakage currents, +especially with finer circuit geometries, but for CMOS circuits power is +mostly used by clocked state changes. + +Power-aware drivers only enable their clocks when the device they manage +is in active use. Also, system sleep states often differ according to +which clock domains are active: while a "standby" state may allow wakeup +from several active domains, a "mem" (suspend-to-RAM) state may require +a more wholesale shutdown of clocks derived from higher speed PLLs and +oscillators, limiting the number of possible wakeup event sources. A +driver's suspend method may need to be aware of system-specific clock +constraints on the target sleep state. + +Some platforms support programmable clock generators. These can be used +by external chips of various kinds, such as other CPUs, multimedia +codecs, and devices with strict requirements for interface clocking. + +.. kernel-doc:: include/linux/clk.h + :internal: + +Synchronization Primitives +========================== + +Read-Copy Update (RCU) +---------------------- + +.. kernel-doc:: include/linux/rcupdate.h + +.. kernel-doc:: include/linux/rcupdate_wait.h + +.. kernel-doc:: include/linux/rcutree.h + +.. kernel-doc:: kernel/rcu/tree.c + +.. kernel-doc:: kernel/rcu/tree_plugin.h + +.. kernel-doc:: kernel/rcu/tree_exp.h + +.. kernel-doc:: kernel/rcu/update.c + +.. kernel-doc:: include/linux/srcu.h + +.. kernel-doc:: kernel/rcu/srcutree.c + +.. kernel-doc:: include/linux/rculist_bl.h + +.. kernel-doc:: include/linux/rculist.h + +.. kernel-doc:: include/linux/rculist_nulls.h + +.. kernel-doc:: include/linux/rcu_sync.h + +.. kernel-doc:: kernel/rcu/sync.c diff --git a/Documentation/core-api/librs.rst b/Documentation/core-api/librs.rst new file mode 100644 index 000000000..6010f5bc5 --- /dev/null +++ b/Documentation/core-api/librs.rst @@ -0,0 +1,212 @@ +========================================== +Reed-Solomon Library Programming Interface +========================================== + +:Author: Thomas Gleixner + +Introduction +============ + +The generic Reed-Solomon Library provides encoding, decoding and error +correction functions. + +Reed-Solomon codes are used in communication and storage applications to +ensure data integrity. + +This documentation is provided for developers who want to utilize the +functions provided by the library. + +Known Bugs And Assumptions +========================== + +None. + +Usage +===== + +This chapter provides examples of how to use the library. + +Initializing +------------ + +The init function init_rs returns a pointer to an rs decoder structure, +which holds the necessary information for encoding, decoding and error +correction with the given polynomial. It either uses an existing +matching decoder or creates a new one. On creation all the lookup tables +for fast en/decoding are created. The function may take a while, so make +sure not to call it in critical code paths. + +:: + + /* the Reed Solomon control structure */ + static struct rs_control *rs_decoder; + + /* Symbolsize is 10 (bits) + * Primitive polynomial is x^10+x^3+1 + * first consecutive root is 0 + * primitive element to generate roots = 1 + * generator polynomial degree (number of roots) = 6 + */ + rs_decoder = init_rs (10, 0x409, 0, 1, 6); + + +Encoding +-------- + +The encoder calculates the Reed-Solomon code over the given data length +and stores the result in the parity buffer. Note that the parity buffer +must be initialized before calling the encoder. + +The expanded data can be inverted on the fly by providing a non-zero +inversion mask. The expanded data is XOR'ed with the mask. This is used +e.g. for FLASH ECC, where the all 0xFF is inverted to an all 0x00. The +Reed-Solomon code for all 0x00 is all 0x00. The code is inverted before +storing to FLASH so it is 0xFF too. This prevents that reading from an +erased FLASH results in ECC errors. + +The databytes are expanded to the given symbol size on the fly. There is +no support for encoding continuous bitstreams with a symbol size != 8 at +the moment. If it is necessary it should be not a big deal to implement +such functionality. + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6]; + /* Initialize the parity buffer */ + memset(par, 0, sizeof(par)); + /* Encode 512 byte in data8. Store parity in buffer par */ + encode_rs8 (rs_decoder, data8, 512, par, 0); + + +Decoding +-------- + +The decoder calculates the syndrome over the given data length and the +received parity symbols and corrects errors in the data. + +If a syndrome is available from a hardware decoder then the syndrome +calculation is skipped. + +The correction of the data buffer can be suppressed by providing a +correction pattern buffer and an error location buffer to the decoder. +The decoder stores the calculated error location and the correction +bitmask in the given buffers. This is useful for hardware decoders which +use a weird bit ordering scheme. + +The databytes are expanded to the given symbol size on the fly. There is +no support for decoding continuous bitstreams with a symbolsize != 8 at +the moment. If it is necessary it should be not a big deal to implement +such functionality. + +Decoding with syndrome calculation, direct data correction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6]; + uint8_t data[512]; + int numerr; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, data8, par, 512, NULL, 0, NULL, 0, NULL); + + +Decoding with syndrome given by hardware decoder, direct data correction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6], syn[6]; + uint8_t data[512]; + int numerr; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Get syndrome from hardware decoder */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, data8, par, 512, syn, 0, NULL, 0, NULL); + + +Decoding with syndrome given by hardware decoder, no direct data correction. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Note: It's not necessary to give data and received parity to the +decoder. + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6], syn[6], corr[8]; + uint8_t data[512]; + int numerr, errpos[8]; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Get syndrome from hardware decoder */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, NULL, NULL, 512, syn, 0, errpos, 0, corr); + for (i = 0; i < numerr; i++) { + do_error_correction_in_your_buffer(errpos[i], corr[i]); + } + + +Cleanup +------- + +The function free_rs frees the allocated resources, if the caller is +the last user of the decoder. + +:: + + /* Release resources */ + free_rs(rs_decoder); + + +Structures +========== + +This chapter contains the autogenerated documentation of the structures +which are used in the Reed-Solomon Library and are relevant for a +developer. + +.. kernel-doc:: include/linux/rslib.h + :internal: + +Public Functions Provided +========================= + +This chapter contains the autogenerated documentation of the +Reed-Solomon functions which are exported. + +.. kernel-doc:: lib/reed_solomon/reed_solomon.c + :export: + +Credits +======= + +The library code for encoding and decoding was written by Phil Karn. + +:: + + Copyright 2002, Phil Karn, KA9Q + May be used under the terms of the GNU General Public License (GPL) + + +The wrapper functions and interfaces are written by Thomas Gleixner. + +Many users have provided bugfixes, improvements and helping hands for +testing. Thanks a lot. + +The following people have contributed to this document: + +Thomas Gleixner\ tglx@linutronix.de diff --git a/Documentation/core-api/local_ops.rst b/Documentation/core-api/local_ops.rst new file mode 100644 index 000000000..2ac3f9f29 --- /dev/null +++ b/Documentation/core-api/local_ops.rst @@ -0,0 +1,202 @@ + +.. _local_ops: + +================================================= +Semantics and Behavior of Local Atomic Operations +================================================= + +:Author: Mathieu Desnoyers + + +This document explains the purpose of the local atomic operations, how +to implement them for any given architecture and shows how they can be used +properly. It also stresses on the precautions that must be taken when reading +those local variables across CPUs when the order of memory writes matters. + +.. note:: + + Note that ``local_t`` based operations are not recommended for general + kernel use. Please use the ``this_cpu`` operations instead unless there is + really a special purpose. Most uses of ``local_t`` in the kernel have been + replaced by ``this_cpu`` operations. ``this_cpu`` operations combine the + relocation with the ``local_t`` like semantics in a single instruction and + yield more compact and faster executing code. + + +Purpose of local atomic operations +================================== + +Local atomic operations are meant to provide fast and highly reentrant per CPU +counters. They minimize the performance cost of standard atomic operations by +removing the LOCK prefix and memory barriers normally required to synchronize +across CPUs. + +Having fast per CPU atomic counters is interesting in many cases: it does not +require disabling interrupts to protect from interrupt handlers and it permits +coherent counters in NMI handlers. It is especially useful for tracing purposes +and for various performance monitoring counters. + +Local atomic operations only guarantee variable modification atomicity wrt the +CPU which owns the data. Therefore, care must taken to make sure that only one +CPU writes to the ``local_t`` data. This is done by using per cpu data and +making sure that we modify it from within a preemption safe context. It is +however permitted to read ``local_t`` data from any CPU: it will then appear to +be written out of order wrt other memory writes by the owner CPU. + + +Implementation for a given architecture +======================================= + +It can be done by slightly modifying the standard atomic operations: only +their UP variant must be kept. It typically means removing LOCK prefix (on +i386 and x86_64) and any SMP synchronization barrier. If the architecture does +not have a different behavior between SMP and UP, including +``asm-generic/local.h`` in your architecture's ``local.h`` is sufficient. + +The ``local_t`` type is defined as an opaque ``signed long`` by embedding an +``atomic_long_t`` inside a structure. This is made so a cast from this type to +a ``long`` fails. The definition looks like:: + + typedef struct { atomic_long_t a; } local_t; + + +Rules to follow when using local atomic operations +================================================== + +* Variables touched by local ops must be per cpu variables. +* *Only* the CPU owner of these variables must write to them. +* This CPU can use local ops from any context (process, irq, softirq, nmi, ...) + to update its ``local_t`` variables. +* Preemption (or interrupts) must be disabled when using local ops in + process context to make sure the process won't be migrated to a + different CPU between getting the per-cpu variable and doing the + actual local op. +* When using local ops in interrupt context, no special care must be + taken on a mainline kernel, since they will run on the local CPU with + preemption already disabled. I suggest, however, to explicitly + disable preemption anyway to make sure it will still work correctly on + -rt kernels. +* Reading the local cpu variable will provide the current copy of the + variable. +* Reads of these variables can be done from any CPU, because updates to + "``long``", aligned, variables are always atomic. Since no memory + synchronization is done by the writer CPU, an outdated copy of the + variable can be read when reading some *other* cpu's variables. + + +How to use local atomic operations +================================== + +:: + + #include <linux/percpu.h> + #include <asm/local.h> + + static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0); + + +Counting +======== + +Counting is done on all the bits of a signed long. + +In preemptible context, use ``get_cpu_var()`` and ``put_cpu_var()`` around +local atomic operations: it makes sure that preemption is disabled around write +access to the per cpu variable. For instance:: + + local_inc(&get_cpu_var(counters)); + put_cpu_var(counters); + +If you are already in a preemption-safe context, you can use +``this_cpu_ptr()`` instead:: + + local_inc(this_cpu_ptr(&counters)); + + + +Reading the counters +==================== + +Those local counters can be read from foreign CPUs to sum the count. Note that +the data seen by local_read across CPUs must be considered to be out of order +relatively to other memory writes happening on the CPU that owns the data:: + + long sum = 0; + for_each_online_cpu(cpu) + sum += local_read(&per_cpu(counters, cpu)); + +If you want to use a remote local_read to synchronize access to a resource +between CPUs, explicit ``smp_wmb()`` and ``smp_rmb()`` memory barriers must be used +respectively on the writer and the reader CPUs. It would be the case if you use +the ``local_t`` variable as a counter of bytes written in a buffer: there should +be a ``smp_wmb()`` between the buffer write and the counter increment and also a +``smp_rmb()`` between the counter read and the buffer read. + + +Here is a sample module which implements a basic per cpu counter using +``local.h``:: + + /* test-local.c + * + * Sample module for local.h usage. + */ + + + #include <asm/local.h> + #include <linux/module.h> + #include <linux/timer.h> + + static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0); + + static struct timer_list test_timer; + + /* IPI called on each CPU. */ + static void test_each(void *info) + { + /* Increment the counter from a non preemptible context */ + printk("Increment on cpu %d\n", smp_processor_id()); + local_inc(this_cpu_ptr(&counters)); + + /* This is what incrementing the variable would look like within a + * preemptible context (it disables preemption) : + * + * local_inc(&get_cpu_var(counters)); + * put_cpu_var(counters); + */ + } + + static void do_test_timer(unsigned long data) + { + int cpu; + + /* Increment the counters */ + on_each_cpu(test_each, NULL, 1); + /* Read all the counters */ + printk("Counters read from CPU %d\n", smp_processor_id()); + for_each_online_cpu(cpu) { + printk("Read : CPU %d, count %ld\n", cpu, + local_read(&per_cpu(counters, cpu))); + } + mod_timer(&test_timer, jiffies + 1000); + } + + static int __init test_init(void) + { + /* initialize the timer that will increment the counter */ + timer_setup(&test_timer, do_test_timer, 0); + mod_timer(&test_timer, jiffies + 1); + + return 0; + } + + static void __exit test_exit(void) + { + del_timer_sync(&test_timer); + } + + module_init(test_init); + module_exit(test_exit); + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Mathieu Desnoyers"); + MODULE_DESCRIPTION("Local Atomic Ops"); diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst new file mode 100644 index 000000000..46ae3537f --- /dev/null +++ b/Documentation/core-api/mm-api.rst @@ -0,0 +1,78 @@ +====================== +Memory Management APIs +====================== + +User Space Memory Access +======================== + +.. kernel-doc:: arch/x86/include/asm/uaccess.h + :internal: + +.. kernel-doc:: arch/x86/lib/usercopy_32.c + :export: + +.. kernel-doc:: mm/util.c + :functions: get_user_pages_fast + +Memory Allocation Controls +========================== + +Functions which need to allocate memory often use GFP flags to express +how that memory should be allocated. The GFP acronym stands for "get +free pages", the underlying memory allocation function. Not every GFP +flag is allowed to every function which may allocate memory. Most +users will want to use a plain ``GFP_KERNEL``. + +.. kernel-doc:: include/linux/gfp.h + :doc: Page mobility and placement hints + +.. kernel-doc:: include/linux/gfp.h + :doc: Watermark modifiers + +.. kernel-doc:: include/linux/gfp.h + :doc: Reclaim modifiers + +.. kernel-doc:: include/linux/gfp.h + :doc: Common combinations + +The Slab Cache +============== + +.. kernel-doc:: include/linux/slab.h + :internal: + +.. kernel-doc:: mm/slab.c + :export: + +.. kernel-doc:: mm/util.c + :functions: kfree_const kvmalloc_node kvfree + +More Memory Management Functions +================================ + +.. kernel-doc:: mm/readahead.c + :export: + +.. kernel-doc:: mm/filemap.c + :export: + +.. kernel-doc:: mm/memory.c + :export: + +.. kernel-doc:: mm/vmalloc.c + :export: + +.. kernel-doc:: mm/page_alloc.c + :internal: + +.. kernel-doc:: mm/mempool.c + :export: + +.. kernel-doc:: mm/dmapool.c + :export: + +.. kernel-doc:: mm/page-writeback.c + :export: + +.. kernel-doc:: mm/truncate.c + :export: diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst new file mode 100644 index 000000000..25dc591cb --- /dev/null +++ b/Documentation/core-api/printk-formats.rst @@ -0,0 +1,481 @@ +========================================= +How to get printk format specifiers right +========================================= + +:Author: Randy Dunlap <rdunlap@infradead.org> +:Author: Andrew Murray <amurray@mpc-data.co.uk> + + +Integer types +============= + +:: + + If variable is of Type, use printk format specifier: + ------------------------------------------------------------ + int %d or %x + unsigned int %u or %x + long %ld or %lx + unsigned long %lu or %lx + long long %lld or %llx + unsigned long long %llu or %llx + size_t %zu or %zx + ssize_t %zd or %zx + s32 %d or %x + u32 %u or %x + s64 %lld or %llx + u64 %llu or %llx + + +If <type> is dependent on a config option for its size (e.g., sector_t, +blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a +format specifier of its largest possible type and explicitly cast to it. + +Example:: + + printk("test: sector number/total blocks: %llu/%llu\n", + (unsigned long long)sector, (unsigned long long)blockcount); + +Reminder: sizeof() returns type size_t. + +The kernel's printf does not support %n. Floating point formats (%e, %f, +%g, %a) are also not recognized, for obvious reasons. Use of any +unsupported specifier or length qualifier results in a WARN and early +return from vsnprintf(). + +Pointer types +============= + +A raw pointer value may be printed with %p which will hash the address +before printing. The kernel also supports extended specifiers for printing +pointers of different types. + +Plain Pointers +-------------- + +:: + + %p abcdef12 or 00000000abcdef12 + +Pointers printed without a specifier extension (i.e unadorned %p) are +hashed to prevent leaking information about the kernel memory layout. This +has the added benefit of providing a unique identifier. On 64-bit machines +the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it +gathers enough entropy. If you *really* want the address see %px below. + +Symbols/Function Pointers +------------------------- + +:: + + %pS versatile_init+0x0/0x110 + %ps versatile_init + %pF versatile_init+0x0/0x110 + %pf versatile_init + %pSR versatile_init+0x9/0x110 + (with __builtin_extract_return_addr() translation) + %pB prev_fn_of_versatile_init+0x88/0x88 + + +The ``S`` and ``s`` specifiers are used for printing a pointer in symbolic +format. They result in the symbol name with (S) or without (s) +offsets. If KALLSYMS are disabled then the symbol address is printed instead. + +Note, that the ``F`` and ``f`` specifiers are identical to ``S`` (``s``) +and thus deprecated. We have ``F`` and ``f`` because on ia64, ppc64 and +parisc64 function pointers are indirect and, in fact, are function +descriptors, which require additional dereferencing before we can lookup +the symbol. As of now, ``S`` and ``s`` perform dereferencing on those +platforms (when needed), so ``F`` and ``f`` exist for compatibility +reasons only. + +The ``B`` specifier results in the symbol name with offsets and should be +used when printing stack backtraces. The specifier takes into +consideration the effect of compiler optimisations which may occur +when tail-calls are used and marked with the noreturn GCC attribute. + +Kernel Pointers +--------------- + +:: + + %pK 01234567 or 0123456789abcdef + +For printing kernel pointers which should be hidden from unprivileged +users. The behaviour of %pK depends on the kptr_restrict sysctl - see +Documentation/sysctl/kernel.txt for more details. + +Unmodified Addresses +-------------------- + +:: + + %px 01234567 or 0123456789abcdef + +For printing pointers when you *really* want to print the address. Please +consider whether or not you are leaking sensitive information about the +kernel memory layout before printing pointers with %px. %px is functionally +equivalent to %lx (or %lu). %px is preferred because it is more uniquely +grep'able. If in the future we need to modify the way the kernel handles +printing pointers we will be better equipped to find the call sites. + +Struct Resources +---------------- + +:: + + %pr [mem 0x60000000-0x6fffffff flags 0x2200] or + [mem 0x0000000060000000-0x000000006fffffff flags 0x2200] + %pR [mem 0x60000000-0x6fffffff pref] or + [mem 0x0000000060000000-0x000000006fffffff pref] + +For printing struct resources. The ``R`` and ``r`` specifiers result in a +printed resource with (R) or without (r) a decoded flags member. + +Passed by reference. + +Physical address types phys_addr_t +---------------------------------- + +:: + + %pa[p] 0x01234567 or 0x0123456789abcdef + +For printing a phys_addr_t type (and its derivatives, such as +resource_size_t) which can vary based on build options, regardless of the +width of the CPU data path. + +Passed by reference. + +DMA address types dma_addr_t +---------------------------- + +:: + + %pad 0x01234567 or 0x0123456789abcdef + +For printing a dma_addr_t type which can vary based on build options, +regardless of the width of the CPU data path. + +Passed by reference. + +Raw buffer as an escaped string +------------------------------- + +:: + + %*pE[achnops] + +For printing raw buffer as an escaped string. For the following buffer:: + + 1b 62 20 5c 43 07 22 90 0d 5d + +A few examples show how the conversion would be done (excluding surrounding +quotes):: + + %*pE "\eb \C\a"\220\r]" + %*pEhp "\x1bb \C\x07"\x90\x0d]" + %*pEa "\e\142\040\\\103\a\042\220\r\135" + +The conversion rules are applied according to an optional combination +of flags (see :c:func:`string_escape_mem` kernel documentation for the +details): + + - a - ESCAPE_ANY + - c - ESCAPE_SPECIAL + - h - ESCAPE_HEX + - n - ESCAPE_NULL + - o - ESCAPE_OCTAL + - p - ESCAPE_NP + - s - ESCAPE_SPACE + +By default ESCAPE_ANY_NP is used. + +ESCAPE_ANY_NP is the sane choice for many cases, in particularly for +printing SSIDs. + +If field width is omitted then 1 byte only will be escaped. + +Raw buffer as a hex string +-------------------------- + +:: + + %*ph 00 01 02 ... 3f + %*phC 00:01:02: ... :3f + %*phD 00-01-02- ... -3f + %*phN 000102 ... 3f + +For printing small buffers (up to 64 bytes long) as a hex string with a +certain separator. For larger buffers consider using +:c:func:`print_hex_dump`. + +MAC/FDDI addresses +------------------ + +:: + + %pM 00:01:02:03:04:05 + %pMR 05:04:03:02:01:00 + %pMF 00-01-02-03-04-05 + %pm 000102030405 + %pmR 050403020100 + +For printing 6-byte MAC/FDDI addresses in hex notation. The ``M`` and ``m`` +specifiers result in a printed address with (M) or without (m) byte +separators. The default byte separator is the colon (:). + +Where FDDI addresses are concerned the ``F`` specifier can be used after +the ``M`` specifier to use dash (-) separators instead of the default +separator. + +For Bluetooth addresses the ``R`` specifier shall be used after the ``M`` +specifier to use reversed byte order suitable for visual interpretation +of Bluetooth addresses which are in the little endian order. + +Passed by reference. + +IPv4 addresses +-------------- + +:: + + %pI4 1.2.3.4 + %pi4 001.002.003.004 + %p[Ii]4[hnbl] + +For printing IPv4 dot-separated decimal addresses. The ``I4`` and ``i4`` +specifiers result in a printed address with (i4) or without (I4) leading +zeros. + +The additional ``h``, ``n``, ``b``, and ``l`` specifiers are used to specify +host, network, big or little endian order addresses respectively. Where +no specifier is provided the default network/big endian order is used. + +Passed by reference. + +IPv6 addresses +-------------- + +:: + + %pI6 0001:0002:0003:0004:0005:0006:0007:0008 + %pi6 00010002000300040005000600070008 + %pI6c 1:2:3:4:5:6:7:8 + +For printing IPv6 network-order 16-bit hex addresses. The ``I6`` and ``i6`` +specifiers result in a printed address with (I6) or without (i6) +colon-separators. Leading zeros are always used. + +The additional ``c`` specifier can be used with the ``I`` specifier to +print a compressed IPv6 address as described by +http://tools.ietf.org/html/rfc5952 + +Passed by reference. + +IPv4/IPv6 addresses (generic, with port, flowinfo, scope) +--------------------------------------------------------- + +:: + + %pIS 1.2.3.4 or 0001:0002:0003:0004:0005:0006:0007:0008 + %piS 001.002.003.004 or 00010002000300040005000600070008 + %pISc 1.2.3.4 or 1:2:3:4:5:6:7:8 + %pISpc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345 + %p[Ii]S[pfschnbl] + +For printing an IP address without the need to distinguish whether it's of +type AF_INET or AF_INET6. A pointer to a valid struct sockaddr, +specified through ``IS`` or ``iS``, can be passed to this format specifier. + +The additional ``p``, ``f``, and ``s`` specifiers are used to specify port +(IPv4, IPv6), flowinfo (IPv6) and scope (IPv6). Ports have a ``:`` prefix, +flowinfo a ``/`` and scope a ``%``, each followed by the actual value. + +In case of an IPv6 address the compressed IPv6 address as described by +http://tools.ietf.org/html/rfc5952 is being used if the additional +specifier ``c`` is given. The IPv6 address is surrounded by ``[``, ``]`` in +case of additional specifiers ``p``, ``f`` or ``s`` as suggested by +https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07 + +In case of IPv4 addresses, the additional ``h``, ``n``, ``b``, and ``l`` +specifiers can be used as well and are ignored in case of an IPv6 +address. + +Passed by reference. + +Further examples:: + + %pISfc 1.2.3.4 or [1:2:3:4:5:6:7:8]/123456789 + %pISsc 1.2.3.4 or [1:2:3:4:5:6:7:8]%1234567890 + %pISpfc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345/123456789 + +UUID/GUID addresses +------------------- + +:: + + %pUb 00010203-0405-0607-0809-0a0b0c0d0e0f + %pUB 00010203-0405-0607-0809-0A0B0C0D0E0F + %pUl 03020100-0504-0706-0809-0a0b0c0e0e0f + %pUL 03020100-0504-0706-0809-0A0B0C0E0E0F + +For printing 16-byte UUID/GUIDs addresses. The additional ``l``, ``L``, +``b`` and ``B`` specifiers are used to specify a little endian order in +lower (l) or upper case (L) hex notation - and big endian order in lower (b) +or upper case (B) hex notation. + +Where no additional specifiers are used the default big endian +order with lower case hex notation will be printed. + +Passed by reference. + +dentry names +------------ + +:: + + %pd{,2,3,4} + %pD{,2,3,4} + +For printing dentry name; if we race with :c:func:`d_move`, the name might +be a mix of old and new ones, but it won't oops. %pd dentry is a safer +equivalent of %s dentry->d_name.name we used to use, %pd<n> prints ``n`` +last components. %pD does the same thing for struct file. + +Passed by reference. + +block_device names +------------------ + +:: + + %pg sda, sda1 or loop0p1 + +For printing name of block_device pointers. + +struct va_format +---------------- + +:: + + %pV + +For printing struct va_format structures. These contain a format string +and va_list as follows:: + + struct va_format { + const char *fmt; + va_list *va; + }; + +Implements a "recursive vsnprintf". + +Do not use this feature without some mechanism to verify the +correctness of the format string and va_list arguments. + +Passed by reference. + +kobjects +-------- + +:: + + %pOF[fnpPcCF] + + +For printing kobject based structs (device nodes). Default behaviour is +equivalent to %pOFf. + + - f - device node full_name + - n - device node name + - p - device node phandle + - P - device node path spec (name + @unit) + - F - device node flags + - c - major compatible string + - C - full compatible string + +The separator when using multiple arguments is ':' + +Examples:: + + %pOF /foo/bar@0 - Node full name + %pOFf /foo/bar@0 - Same as above + %pOFfp /foo/bar@0:10 - Node full name + phandle + %pOFfcF /foo/bar@0:foo,device:--P- - Node full name + + major compatible string + + node flags + D - dynamic + d - detached + P - Populated + B - Populated bus + +Passed by reference. + +struct clk +---------- + +:: + + %pC pll1 + %pCn pll1 + +For printing struct clk structures. %pC and %pCn print the name +(Common Clock Framework) or address (legacy clock framework) of the +structure. + +Passed by reference. + +bitmap and its derivatives such as cpumask and nodemask +------------------------------------------------------- + +:: + + %*pb 0779 + %*pbl 0,3-6,8-10 + +For printing bitmap and its derivatives such as cpumask and nodemask, +%*pb outputs the bitmap with field width as the number of bits and %*pbl +output the bitmap as range list with field width as the number of bits. + +Passed by reference. + +Flags bitfields such as page flags, gfp_flags +--------------------------------------------- + +:: + + %pGp referenced|uptodate|lru|active|private + %pGg GFP_USER|GFP_DMA32|GFP_NOWARN + %pGv read|exec|mayread|maywrite|mayexec|denywrite + +For printing flags bitfields as a collection of symbolic constants that +would construct the value. The type of flags is given by the third +character. Currently supported are [p]age flags, [v]ma_flags (both +expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag +names and print order depends on the particular type. + +Note that this format should not be used directly in the +:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags() +functions from <trace/events/mmflags.h>. + +Passed by reference. + +Network device features +----------------------- + +:: + + %pNF 0x000000000000c000 + +For printing netdev_features_t. + +Passed by reference. + +Thanks +====== + +If you add other %p extensions, please extend <lib/test_printf.c> with +one or more test cases, if at all feasible. + +Thank you for your cooperation and attention. diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst new file mode 100644 index 000000000..322851bad --- /dev/null +++ b/Documentation/core-api/refcount-vs-atomic.rst @@ -0,0 +1,150 @@ +=================================== +refcount_t API compared to atomic_t +=================================== + +.. contents:: :local: + +Introduction +============ + +The goal of refcount_t API is to provide a minimal API for implementing +an object's reference counters. While a generic architecture-independent +implementation from lib/refcount.c uses atomic operations underneath, +there are a number of differences between some of the ``refcount_*()`` and +``atomic_*()`` functions with regards to the memory ordering guarantees. +This document outlines the differences and provides respective examples +in order to help maintainers validate their code against the change in +these memory ordering guarantees. + +The terms used through this document try to follow the formal LKMM defined in +tools/memory-model/Documentation/explanation.txt. + +memory-barriers.txt and atomic_t.txt provide more background to the +memory ordering in general and for atomic operations specifically. + +Relevant types of memory ordering +================================= + +.. note:: The following section only covers some of the memory + ordering types that are relevant for the atomics and reference + counters and used through this document. For a much broader picture + please consult memory-barriers.txt document. + +In the absence of any memory ordering guarantees (i.e. fully unordered) +atomics & refcounters only provide atomicity and +program order (po) relation (on the same CPU). It guarantees that +each ``atomic_*()`` and ``refcount_*()`` operation is atomic and instructions +are executed in program order on a single CPU. +This is implemented using :c:func:`READ_ONCE`/:c:func:`WRITE_ONCE` and +compare-and-swap primitives. + +A strong (full) memory ordering guarantees that all prior loads and +stores (all po-earlier instructions) on the same CPU are completed +before any po-later instruction is executed on the same CPU. +It also guarantees that all po-earlier stores on the same CPU +and all propagated stores from other CPUs must propagate to all +other CPUs before any po-later instruction is executed on the original +CPU (A-cumulative property). This is implemented using :c:func:`smp_mb`. + +A RELEASE memory ordering guarantees that all prior loads and +stores (all po-earlier instructions) on the same CPU are completed +before the operation. It also guarantees that all po-earlier +stores on the same CPU and all propagated stores from other CPUs +must propagate to all other CPUs before the release operation +(A-cumulative property). This is implemented using +:c:func:`smp_store_release`. + +A control dependency (on success) for refcounters guarantees that +if a reference for an object was successfully obtained (reference +counter increment or addition happened, function returned true), +then further stores are ordered against this operation. +Control dependency on stores are not implemented using any explicit +barriers, but rely on CPU not to speculate on stores. This is only +a single CPU relation and provides no guarantees for other CPUs. + + +Comparison of functions +======================= + +case 1) - non-"Read/Modify/Write" (RMW) ops +------------------------------------------- + +Function changes: + + * :c:func:`atomic_set` --> :c:func:`refcount_set` + * :c:func:`atomic_read` --> :c:func:`refcount_read` + +Memory ordering guarantee changes: + + * none (both fully unordered) + + +case 2) - increment-based ops that return no value +-------------------------------------------------- + +Function changes: + + * :c:func:`atomic_inc` --> :c:func:`refcount_inc` + * :c:func:`atomic_add` --> :c:func:`refcount_add` + +Memory ordering guarantee changes: + + * none (both fully unordered) + +case 3) - decrement-based RMW ops that return no value +------------------------------------------------------ + +Function changes: + + * :c:func:`atomic_dec` --> :c:func:`refcount_dec` + +Memory ordering guarantee changes: + + * fully unordered --> RELEASE ordering + + +case 4) - increment-based RMW ops that return a value +----------------------------------------------------- + +Function changes: + + * :c:func:`atomic_inc_not_zero` --> :c:func:`refcount_inc_not_zero` + * no atomic counterpart --> :c:func:`refcount_add_not_zero` + +Memory ordering guarantees changes: + + * fully ordered --> control dependency on success for stores + +.. note:: We really assume here that necessary ordering is provided as a + result of obtaining pointer to the object! + + +case 5) - decrement-based RMW ops that return a value +----------------------------------------------------- + +Function changes: + + * :c:func:`atomic_dec_and_test` --> :c:func:`refcount_dec_and_test` + * :c:func:`atomic_sub_and_test` --> :c:func:`refcount_sub_and_test` + * no atomic counterpart --> :c:func:`refcount_dec_if_one` + * ``atomic_add_unless(&var, -1, 1)`` --> ``refcount_dec_not_one(&var)`` + +Memory ordering guarantees changes: + + * fully ordered --> RELEASE ordering + control dependency + +.. note:: :c:func:`atomic_add_unless` only provides full order on success. + + +case 6) - lock-based RMW +------------------------ + +Function changes: + + * :c:func:`atomic_dec_and_lock` --> :c:func:`refcount_dec_and_lock` + * :c:func:`atomic_dec_and_mutex_lock` --> :c:func:`refcount_dec_and_mutex_lock` + +Memory ordering guarantees changes: + + * fully ordered --> RELEASE ordering + control dependency + hold + :c:func:`spin_lock` on success diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst new file mode 100644 index 000000000..93cbeb9da --- /dev/null +++ b/Documentation/core-api/timekeeping.rst @@ -0,0 +1,185 @@ +ktime accessors +=============== + +Device drivers can read the current time using ktime_get() and the many +related functions declared in linux/timekeeping.h. As a rule of thumb, +using an accessor with a shorter name is preferred over one with a longer +name if both are equally fit for a particular use case. + +Basic ktime_t based interfaces +------------------------------ + +The recommended simplest form returns an opaque ktime_t, with variants +that return time for different clock references: + + +.. c:function:: ktime_t ktime_get( void ) + + CLOCK_MONOTONIC + + Useful for reliable timestamps and measuring short time intervals + accurately. Starts at system boot time but stops during suspend. + +.. c:function:: ktime_t ktime_get_boottime( void ) + + CLOCK_BOOTTIME + + Like ktime_get(), but does not stop when suspended. This can be + used e.g. for key expiration times that need to be synchronized + with other machines across a suspend operation. + +.. c:function:: ktime_t ktime_get_real( void ) + + CLOCK_REALTIME + + Returns the time in relative to the UNIX epoch starting in 1970 + using the Coordinated Universal Time (UTC), same as gettimeofday() + user space. This is used for all timestamps that need to + persist across a reboot, like inode times, but should be avoided + for internal uses, since it can jump backwards due to a leap + second update, NTP adjustment settimeofday() operation from user + space. + +.. c:function:: ktime_t ktime_get_clocktai( void ) + + CLOCK_TAI + + Like ktime_get_real(), but uses the International Atomic Time (TAI) + reference instead of UTC to avoid jumping on leap second updates. + This is rarely useful in the kernel. + +.. c:function:: ktime_t ktime_get_raw( void ) + + CLOCK_MONOTONIC_RAW + + Like ktime_get(), but runs at the same rate as the hardware + clocksource without (NTP) adjustments for clock drift. This is + also rarely needed in the kernel. + +nanosecond, timespec64, and second output +----------------------------------------- + +For all of the above, there are variants that return the time in a +different format depending on what is required by the user: + +.. c:function:: u64 ktime_get_ns( void ) + u64 ktime_get_boottime_ns( void ) + u64 ktime_get_real_ns( void ) + u64 ktime_get_tai_ns( void ) + u64 ktime_get_raw_ns( void ) + + Same as the plain ktime_get functions, but returning a u64 number + of nanoseconds in the respective time reference, which may be + more convenient for some callers. + +.. c:function:: void ktime_get_ts64( struct timespec64 * ) + void ktime_get_boottime_ts64( struct timespec64 * ) + void ktime_get_real_ts64( struct timespec64 * ) + void ktime_get_clocktai_ts64( struct timespec64 * ) + void ktime_get_raw_ts64( struct timespec64 * ) + + Same above, but returns the time in a 'struct timespec64', split + into seconds and nanoseconds. This can avoid an extra division + when printing the time, or when passing it into an external + interface that expects a 'timespec' or 'timeval' structure. + +.. c:function:: time64_t ktime_get_seconds( void ) + time64_t ktime_get_boottime_seconds( void ) + time64_t ktime_get_real_seconds( void ) + time64_t ktime_get_clocktai_seconds( void ) + time64_t ktime_get_raw_seconds( void ) + + Return a coarse-grained version of the time as a scalar + time64_t. This avoids accessing the clock hardware and rounds + down the seconds to the full seconds of the last timer tick + using the respective reference. + +Coarse and fast_ns access +------------------------- + +Some additional variants exist for more specialized cases: + +.. c:function:: ktime_t ktime_get_coarse_boottime( void ) + ktime_t ktime_get_coarse_real( void ) + ktime_t ktime_get_coarse_clocktai( void ) + ktime_t ktime_get_coarse_raw( void ) + +.. c:function:: void ktime_get_coarse_ts64( struct timespec64 * ) + void ktime_get_coarse_boottime_ts64( struct timespec64 * ) + void ktime_get_coarse_real_ts64( struct timespec64 * ) + void ktime_get_coarse_clocktai_ts64( struct timespec64 * ) + void ktime_get_coarse_raw_ts64( struct timespec64 * ) + + These are quicker than the non-coarse versions, but less accurate, + corresponding to CLOCK_MONONOTNIC_COARSE and CLOCK_REALTIME_COARSE + in user space, along with the equivalent boottime/tai/raw + timebase not available in user space. + + The time returned here corresponds to the last timer tick, which + may be as much as 10ms in the past (for CONFIG_HZ=100), same as + reading the 'jiffies' variable. These are only useful when called + in a fast path and one still expects better than second accuracy, + but can't easily use 'jiffies', e.g. for inode timestamps. + Skipping the hardware clock access saves around 100 CPU cycles + on most modern machines with a reliable cycle counter, but + up to several microseconds on older hardware with an external + clocksource. + +.. c:function:: u64 ktime_get_mono_fast_ns( void ) + u64 ktime_get_raw_fast_ns( void ) + u64 ktime_get_boot_fast_ns( void ) + u64 ktime_get_real_fast_ns( void ) + + These variants are safe to call from any context, including from + a non-maskable interrupt (NMI) during a timekeeper update, and + while we are entering suspend with the clocksource powered down. + This is useful in some tracing or debugging code as well as + machine check reporting, but most drivers should never call them, + since the time is allowed to jump under certain conditions. + +Deprecated time interfaces +-------------------------- + +Older kernels used some other interfaces that are now being phased out +but may appear in third-party drivers being ported here. In particular, +all interfaces returning a 'struct timeval' or 'struct timespec' have +been replaced because the tv_sec member overflows in year 2038 on 32-bit +architectures. These are the recommended replacements: + +.. c:function:: void ktime_get_ts( struct timespec * ) + + Use ktime_get() or ktime_get_ts64() instead. + +.. c:function:: struct timeval do_gettimeofday( void ) + struct timespec getnstimeofday( void ) + struct timespec64 getnstimeofday64( void ) + void ktime_get_real_ts( struct timespec * ) + + ktime_get_real_ts64() is a direct replacement, but consider using + monotonic time (ktime_get_ts64()) and/or a ktime_t based interface + (ktime_get()/ktime_get_real()). + +.. c:function:: struct timespec current_kernel_time( void ) + struct timespec64 current_kernel_time64( void ) + struct timespec get_monotonic_coarse( void ) + struct timespec64 get_monotonic_coarse64( void ) + + These are replaced by ktime_get_coarse_real_ts64() and + ktime_get_coarse_ts64(). However, A lot of code that wants + coarse-grained times can use the simple 'jiffies' instead, while + some drivers may actually want the higher resolution accessors + these days. + +.. c:function:: struct timespec getrawmonotonic( void ) + struct timespec64 getrawmonotonic64( void ) + struct timespec timekeeping_clocktai( void ) + struct timespec64 timekeeping_clocktai64( void ) + struct timespec get_monotonic_boottime( void ) + struct timespec64 get_monotonic_boottime64( void ) + + These are replaced by ktime_get_raw()/ktime_get_raw_ts64(), + ktime_get_clocktai()/ktime_get_clocktai_ts64() as well + as ktime_get_boottime()/ktime_get_boottime_ts64(). + However, if the particular choice of clock source is not + important for the user, consider converting to + ktime_get()/ktime_get_ts64() instead for consistency. diff --git a/Documentation/core-api/tracepoint.rst b/Documentation/core-api/tracepoint.rst new file mode 100644 index 000000000..6b44bec0d --- /dev/null +++ b/Documentation/core-api/tracepoint.rst @@ -0,0 +1,55 @@ +=============================== +The Linux Kernel Tracepoint API +=============================== + +:Author: Jason Baron +:Author: William Cohen + +Introduction +============ + +Tracepoints are static probe points that are located in strategic points +throughout the kernel. 'Probes' register/unregister with tracepoints via +a callback mechanism. The 'probes' are strictly typed functions that are +passed a unique set of parameters defined by each tracepoint. + +From this simple callback mechanism, 'probes' can be used to profile, +debug, and understand kernel behavior. There are a number of tools that +provide a framework for using 'probes'. These tools include Systemtap, +ftrace, and LTTng. + +Tracepoints are defined in a number of header files via various macros. +Thus, the purpose of this document is to provide a clear accounting of +the available tracepoints. The intention is to understand not only what +tracepoints are available but also to understand where future +tracepoints might be added. + +The API presented has functions of the form: +``trace_tracepointname(function parameters)``. These are the tracepoints +callbacks that are found throughout the code. Registering and +unregistering probes with these callback sites is covered in the +``Documentation/trace/*`` directory. + +IRQ +=== + +.. kernel-doc:: include/trace/events/irq.h + :internal: + +SIGNAL +====== + +.. kernel-doc:: include/trace/events/signal.h + :internal: + +Block IO +======== + +.. kernel-doc:: include/trace/events/block.h + :internal: + +Workqueue +========= + +.. kernel-doc:: include/trace/events/workqueue.h + :internal: diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst new file mode 100644 index 000000000..00a5ba51e --- /dev/null +++ b/Documentation/core-api/workqueue.rst @@ -0,0 +1,398 @@ +==================================== +Concurrency Managed Workqueue (cmwq) +==================================== + +:Date: September, 2010 +:Author: Tejun Heo <tj@kernel.org> +:Author: Florian Mickler <florian@mickler.org> + + +Introduction +============ + +There are many cases where an asynchronous process execution context +is needed and the workqueue (wq) API is the most commonly used +mechanism for such cases. + +When such an asynchronous execution context is needed, a work item +describing which function to execute is put on a queue. An +independent thread serves as the asynchronous execution context. The +queue is called workqueue and the thread is called worker. + +While there are work items on the workqueue the worker executes the +functions associated with the work items one after the other. When +there is no work item left on the workqueue the worker becomes idle. +When a new work item gets queued, the worker begins executing again. + + +Why cmwq? +========= + +In the original wq implementation, a multi threaded (MT) wq had one +worker thread per CPU and a single threaded (ST) wq had one worker +thread system-wide. A single MT wq needed to keep around the same +number of workers as the number of CPUs. The kernel grew a lot of MT +wq users over the years and with the number of CPU cores continuously +rising, some systems saturated the default 32k PID space just booting +up. + +Although MT wq wasted a lot of resource, the level of concurrency +provided was unsatisfactory. The limitation was common to both ST and +MT wq albeit less severe on MT. Each wq maintained its own separate +worker pool. An MT wq could provide only one execution context per CPU +while an ST wq one for the whole system. Work items had to compete for +those very limited execution contexts leading to various problems +including proneness to deadlocks around the single execution context. + +The tension between the provided level of concurrency and resource +usage also forced its users to make unnecessary tradeoffs like libata +choosing to use ST wq for polling PIOs and accepting an unnecessary +limitation that no two polling PIOs can progress at the same time. As +MT wq don't provide much better concurrency, users which require +higher level of concurrency, like async or fscache, had to implement +their own thread pool. + +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with +focus on the following goals. + +* Maintain compatibility with the original workqueue API. + +* Use per-CPU unified worker pools shared by all wq to provide + flexible level of concurrency on demand without wasting a lot of + resource. + +* Automatically regulate worker pool and level of concurrency so that + the API users don't need to worry about such details. + + +The Design +========== + +In order to ease the asynchronous execution of functions a new +abstraction, the work item, is introduced. + +A work item is a simple struct that holds a pointer to the function +that is to be executed asynchronously. Whenever a driver or subsystem +wants a function to be executed asynchronously it has to set up a work +item pointing to that function and queue that work item on a +workqueue. + +Special purpose threads, called worker threads, execute the functions +off of the queue, one after the other. If no work is queued, the +worker threads become idle. These worker threads are managed in so +called worker-pools. + +The cmwq design differentiates between the user-facing workqueues that +subsystems and drivers queue work items on and the backend mechanism +which manages worker-pools and processes the queued work items. + +There are two worker-pools, one for normal work items and the other +for high priority ones, for each possible CPU and some extra +worker-pools to serve work items queued on unbound workqueues - the +number of these backing pools is dynamic. + +Subsystems and drivers can create and queue work items through special +workqueue API functions as they see fit. They can influence some +aspects of the way the work items are executed by setting flags on the +workqueue they are putting the work item on. These flags include +things like CPU locality, concurrency limits, priority and more. To +get a detailed overview refer to the API description of +``alloc_workqueue()`` below. + +When a work item is queued to a workqueue, the target worker-pool is +determined according to the queue parameters and workqueue attributes +and appended on the shared worklist of the worker-pool. For example, +unless specifically overridden, a work item of a bound workqueue will +be queued on the worklist of either normal or highpri worker-pool that +is associated to the CPU the issuer is running on. + +For any worker pool implementation, managing the concurrency level +(how many execution contexts are active) is an important issue. cmwq +tries to keep the concurrency at a minimal but sufficient level. +Minimal to save resources and sufficient in that the system is used at +its full capacity. + +Each worker-pool bound to an actual CPU implements concurrency +management by hooking into the scheduler. The worker-pool is notified +whenever an active worker wakes up or sleeps and keeps track of the +number of the currently runnable workers. Generally, work items are +not expected to hog a CPU and consume many cycles. That means +maintaining just enough concurrency to prevent work processing from +stalling should be optimal. As long as there are one or more runnable +workers on the CPU, the worker-pool doesn't start execution of a new +work, but, when the last running worker goes to sleep, it immediately +schedules a new worker so that the CPU doesn't sit idle while there +are pending work items. This allows using a minimal number of workers +without losing execution bandwidth. + +Keeping idle workers around doesn't cost other than the memory space +for kthreads, so cmwq holds onto idle ones for a while before killing +them. + +For unbound workqueues, the number of backing pools is dynamic. +Unbound workqueue can be assigned custom attributes using +``apply_workqueue_attrs()`` and workqueue will automatically create +backing worker pools matching the attributes. The responsibility of +regulating concurrency level is on the users. There is also a flag to +mark a bound wq to ignore the concurrency management. Please refer to +the API section for details. + +Forward progress guarantee relies on that workers can be created when +more execution contexts are necessary, which in turn is guaranteed +through the use of rescue workers. All work items which might be used +on code paths that handle memory reclaim are required to be queued on +wq's that have a rescue-worker reserved for execution under memory +pressure. Else it is possible that the worker-pool deadlocks waiting +for execution contexts to free up. + + +Application Programming Interface (API) +======================================= + +``alloc_workqueue()`` allocates a wq. The original +``create_*workqueue()`` functions are deprecated and scheduled for +removal. ``alloc_workqueue()`` takes three arguments - ``@name``, +``@flags`` and ``@max_active``. ``@name`` is the name of the wq and +also used as the name of the rescuer thread if there is one. + +A wq no longer manages execution resources but serves as a domain for +forward progress guarantee, flush and work item attributes. ``@flags`` +and ``@max_active`` control how work items are assigned execution +resources, scheduled and executed. + + +``flags`` +--------- + +``WQ_UNBOUND`` + Work items queued to an unbound wq are served by the special + worker-pools which host workers which are not bound to any + specific CPU. This makes the wq behave as a simple execution + context provider without concurrency management. The unbound + worker-pools try to start execution of work items as soon as + possible. Unbound wq sacrifices locality but is useful for + the following cases. + + * Wide fluctuation in the concurrency level requirement is + expected and using bound wq may end up creating large number + of mostly unused workers across different CPUs as the issuer + hops through different CPUs. + + * Long running CPU intensive workloads which can be better + managed by the system scheduler. + +``WQ_FREEZABLE`` + A freezable wq participates in the freeze phase of the system + suspend operations. Work items on the wq are drained and no + new work item starts execution until thawed. + +``WQ_MEM_RECLAIM`` + All wq which might be used in the memory reclaim paths **MUST** + have this flag set. The wq is guaranteed to have at least one + execution context regardless of memory pressure. + +``WQ_HIGHPRI`` + Work items of a highpri wq are queued to the highpri + worker-pool of the target cpu. Highpri worker-pools are + served by worker threads with elevated nice level. + + Note that normal and highpri worker-pools don't interact with + each other. Each maintains its separate pool of workers and + implements concurrency management among its workers. + +``WQ_CPU_INTENSIVE`` + Work items of a CPU intensive wq do not contribute to the + concurrency level. In other words, runnable CPU intensive + work items will not prevent other work items in the same + worker-pool from starting execution. This is useful for bound + work items which are expected to hog CPU cycles so that their + execution is regulated by the system scheduler. + + Although CPU intensive work items don't contribute to the + concurrency level, start of their executions is still + regulated by the concurrency management and runnable + non-CPU-intensive work items can delay execution of CPU + intensive work items. + + This flag is meaningless for unbound wq. + +Note that the flag ``WQ_NON_REENTRANT`` no longer exists as all +workqueues are now non-reentrant - any work item is guaranteed to be +executed by at most one worker system-wide at any given time. + + +``max_active`` +-------------- + +``@max_active`` determines the maximum number of execution contexts +per CPU which can be assigned to the work items of a wq. For example, +with ``@max_active`` of 16, at most 16 work items of the wq can be +executing at the same time per CPU. + +Currently, for a bound wq, the maximum limit for ``@max_active`` is +512 and the default value used when 0 is specified is 256. For an +unbound wq, the limit is higher of 512 and 4 * +``num_possible_cpus()``. These values are chosen sufficiently high +such that they are not the limiting factor while providing protection +in runaway cases. + +The number of active work items of a wq is usually regulated by the +users of the wq, more specifically, by how many work items the users +may queue at the same time. Unless there is a specific need for +throttling the number of active work items, specifying '0' is +recommended. + +Some users depend on the strict execution ordering of ST wq. The +combination of ``@max_active`` of 1 and ``WQ_UNBOUND`` used to +achieve this behavior. Work items on such wq were always queued to the +unbound worker-pools and only one work item could be active at any given +time thus achieving the same ordering property as ST wq. + +In the current implementation the above configuration only guarantees +ST behavior within a given NUMA node. Instead ``alloc_ordered_queue()`` should +be used to achieve system-wide ST behavior. + + +Example Execution Scenarios +=========================== + +The following example execution scenarios try to illustrate how cmwq +behave under different configurations. + + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms + again before finishing. w1 and w2 burn CPU for 5ms then sleep for + 10ms. + +Ignoring all other tasks, works and processing overhead, and assuming +simple FIFO scheduling, the following is one highly simplified version +of possible sequences of events with the original wq. :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 starts and burns CPU + 25 w1 sleeps + 35 w1 wakes up and finishes + 35 w2 starts and burns CPU + 40 w2 sleeps + 50 w2 wakes up and finishes + +And with cmwq with ``@max_active`` >= 3, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 10 w2 starts and burns CPU + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + +If ``@max_active`` == 2, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 20 w2 starts and burns CPU + 25 w2 sleeps + 35 w2 wakes up and finishes + +Now, let's assume w1 and w2 are queued to a different wq q1 which has +``WQ_CPU_INTENSIVE`` set, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 and w2 start and burn CPU + 10 w1 sleeps + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + + +Guidelines +========== + +* Do not forget to use ``WQ_MEM_RECLAIM`` if a wq may process work + items which are used during memory reclaim. Each wq with + ``WQ_MEM_RECLAIM`` set has an execution context reserved for it. If + there is dependency among multiple work items used during memory + reclaim, they should be queued to separate wq each with + ``WQ_MEM_RECLAIM``. + +* Unless strict ordering is required, there is no need to use ST wq. + +* Unless there is a specific need, using 0 for @max_active is + recommended. In most use cases, concurrency level usually stays + well under the default limit. + +* A wq serves as a domain for forward progress guarantee + (``WQ_MEM_RECLAIM``, flush and work item attributes. Work items + which are not involved in memory reclaim and don't need to be + flushed as a part of a group of work items, and don't require any + special attribute, can use one of the system wq. There is no + difference in execution characteristics between using a dedicated wq + and a system wq. + +* Unless work items are expected to consume a huge amount of CPU + cycles, using a bound wq is usually beneficial due to the increased + level of locality in wq operations and work item execution. + + +Debugging +========= + +Because the work functions are executed by generic worker threads +there are a few tricks needed to shed some light on misbehaving +workqueue users. + +Worker threads show up in the process list as: :: + + root 5671 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/0:1] + root 5672 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/1:2] + root 5673 0.0 0.0 0 0 ? S 12:12 0:00 [kworker/0:0] + root 5674 0.0 0.0 0 0 ? S 12:13 0:00 [kworker/1:0] + +If kworkers are going crazy (using too much cpu), there are two types +of possible problems: + + 1. Something being scheduled in rapid succession + 2. A single work item that consumes lots of cpu cycles + +The first one can be tracked using tracing: :: + + $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace_pipe > out.txt + (wait a few secs) + ^C + +If something is busy looping on work queueing, it would be dominating +the output and the offender can be determined with the work item +function. + +For the second type of problems it should be possible to just check +the stack trace of the offending worker thread. :: + + $ cat /proc/THE_OFFENDING_KWORKER/stack + +The work item's function should be trivially visible in the stack +trace. + + +Kernel Inline Documentations Reference +====================================== + +.. kernel-doc:: include/linux/workqueue.h |