diff options
Diffstat (limited to 'doc/dev/osd_internals')
23 files changed, 2373 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/async_recovery.rst b/doc/dev/osd_internals/async_recovery.rst new file mode 100644 index 00000000..8c3105b3 --- /dev/null +++ b/doc/dev/osd_internals/async_recovery.rst @@ -0,0 +1,47 @@ +===================== +Asynchronous Recovery +===================== + +PGs in Ceph maintain a log of writes to allow speedy recovery of data. +Instead of scanning all of the objects to see what is missing on each +osd, we can examine the pg log to see which objects we need to +recover. See :ref:`Log Based PG <log-based-pg>` for more details on this process. + +Until now, this recovery process was synchronous - it blocked writes +to an object until it was recovered. In contrast, backfill could allow +writes to proceed (assuming enough up-to-date copies of the data were +available) by temporarily assigning a different acting set, and +backfilling an OSD outside of the acting set. In some circumstances, +this ends up being significantly better for availability, e.g. if the +pg log contains 3000 writes to different objects. Recovering several +megabytes of an object (or even worse, several megabytes of omap keys, +like rgw bucket indexes) can drastically increase latency for a small +update, and combined with requests spread across many degraded objects +it is a recipe for slow requests. + +To avoid this, we can perform recovery in the background on an OSD out +of the acting set, similar to backfill, but still using the PG log to +determine what needs recovery. This is known as asynchronous recovery. + +Exactly when we perform asynchronous recovery instead of synchronous +recovery is not a clear-cut threshold. There are a few criteria which +need to be met for asynchronous recovery: + +* try to keep min_size replicas available +* use the approximate magnitude of the difference in length of + logs combined with historical missing objects as the cost of recovery +* use the parameter osd_async_recovery_min_cost to determine + when asynchronous recovery is appropriate + +With the existing peering process, when we choose the acting set we +have not fetched the pg log from each peer, we have only the bounds of +it and other metadata from their pg_info_t. It would be more expensive +to fetch and examine every log at this point, so we only consider an +approximate check for log length for now. In Nautilus, we improved +the accounting of missing objects, so post nautilus, this information +is also used to determine the cost of recovery. + +While async recovery is occurring, writes on members of the acting set +may proceed, but we need to send their log entries to the async +recovery targets (just like we do for backfill osds) so that they +can completely catch up. diff --git a/doc/dev/osd_internals/backfill_reservation.rst b/doc/dev/osd_internals/backfill_reservation.rst new file mode 100644 index 00000000..cf9dab4d --- /dev/null +++ b/doc/dev/osd_internals/backfill_reservation.rst @@ -0,0 +1,38 @@ +==================== +Backfill Reservation +==================== + +When a new osd joins a cluster, all pgs containing it must eventually backfill +to it. If all of these backfills happen simultaneously, it would put excessive +load on the osd. osd_max_backfills limits the number of outgoing or +incoming backfills on a single node. The maximum number of outgoing backfills is +osd_max_backfills. The maximum number of incoming backfills is +osd_max_backfills. Therefore there can be a maximum of osd_max_backfills * 2 +simultaneous backfills on one osd. + +Each OSDService now has two AsyncReserver instances: one for backfills going +from the osd (local_reserver) and one for backfills going to the osd +(remote_reserver). An AsyncReserver (common/AsyncReserver.h) manages a queue +by priority of waiting items and a set of current reservation holders. When a +slot frees up, the AsyncReserver queues the Context* associated with the next +item on the highest priority queue in the finisher provided to the constructor. + +For a primary to initiate a backfill, it must first obtain a reservation from +its own local_reserver. Then, it must obtain a reservation from the backfill +target's remote_reserver via a MBackfillReserve message. This process is +managed by substates of Active and ReplicaActive (see the substates of Active +in PG.h). The reservations are dropped either on the Backfilled event, which +is sent on the primary before calling recovery_complete and on the replica on +receipt of the BackfillComplete progress message), or upon leaving Active or +ReplicaActive. + +It's important that we always grab the local reservation before the remote +reservation in order to prevent a circular dependency. + +We want to minimize the risk of data loss by prioritizing the order in +which PGs are recovered. The highest priority is log based recovery +(OSD_RECOVERY_PRIORITY_MAX) since this must always complete before +backfill can start. The next priority is backfill of degraded PGs and +is a function of the degradation. A backfill for a PG missing two +replicas will have a priority higher than a backfill for a PG missing +one replica. The lowest priority is backfill of non-degraded PGs. diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst new file mode 100644 index 00000000..7263cc35 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding.rst @@ -0,0 +1,82 @@ +============================== +Erasure Coded Placement Groups +============================== + +Glossary +-------- + +*chunk* + when the encoding function is called, it returns chunks of the same + size. Data chunks which can be concatenated to reconstruct the original + object and coding chunks which can be used to rebuild a lost chunk. + +*chunk rank* + the index of a chunk when returned by the encoding function. The + rank of the first chunk is 0, the rank of the second chunk is 1 + etc. + +*stripe* + when an object is too large to be encoded with a single call, + each set of chunks created by a call to the encoding function is + called a stripe. + +*shard|strip* + an ordered sequence of chunks of the same rank from the same + object. For a given placement group, each OSD contains shards of + the same rank. When dealing with objects that are encoded with a + single operation, *chunk* is sometime used instead of *shard* + because the shard is made of a single chunk. The *chunks* in a + *shard* are ordered according to the rank of the stripe they belong + to. + +*K* + the number of data *chunks*, i.e. the number of *chunks* in which the + original object is divided. For instance if *K* = 2 a 10KB object + will be divided into *K* objects of 5KB each. + +*M* + the number of coding *chunks*, i.e. the number of additional *chunks* + computed by the encoding functions. If there are 2 coding *chunks*, + it means 2 OSDs can be out without losing data. + +*N* + the number of data *chunks* plus the number of coding *chunks*, + i.e. *K+M*. + +*rate* + the proportion of the *chunks* that contains useful information, i.e. *K/N*. + For instance, for *K* = 9 and *M* = 3 (i.e. *K+M* = *N* = 12) the rate is + *K* = 9 / *N* = 12 = 0.75, i.e. 75% of the chunks contain useful information. + +The definitions are illustrated as follows (PG stands for placement group): +:: + + OSD 40 OSD 33 + +-------------------------+ +-------------------------+ + | shard 0 - PG 10 | | shard 1 - PG 10 | + |+------ object O -------+| |+------ object O -------+| + ||+---------------------+|| ||+---------------------+|| + stripe||| chunk 0 ||| ||| chunk 1 ||| ... + 0 ||| stripe 0 ||| ||| stripe 0 ||| + ||+---------------------+|| ||+---------------------+|| + ||+---------------------+|| ||+---------------------+|| + stripe||| chunk 0 ||| ||| chunk 1 ||| ... + 1 ||| stripe 1 ||| ||| stripe 1 ||| + ||+---------------------+|| ||+---------------------+|| + ||+---------------------+|| ||+---------------------+|| + stripe||| chunk 0 ||| ||| chunk 1 ||| ... + 2 ||| stripe 2 ||| ||| stripe 2 ||| + ||+---------------------+|| ||+---------------------+|| + |+-----------------------+| |+-----------------------+| + | ... | | ... | + +-------------------------+ +-------------------------+ + +Table of content +---------------- + +.. toctree:: + :maxdepth: 1 + + Developer notes <erasure_coding/developer_notes> + Jerasure plugin <erasure_coding/jerasure> + High level design document <erasure_coding/ecbackend> diff --git a/doc/dev/osd_internals/erasure_coding/developer_notes.rst b/doc/dev/osd_internals/erasure_coding/developer_notes.rst new file mode 100644 index 00000000..fca56ce2 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/developer_notes.rst @@ -0,0 +1,223 @@ +============================ +Erasure Code developer notes +============================ + +Introduction +------------ + +Each chapter of this document explains an aspect of the implementation +of the erasure code within Ceph. It is mostly based on examples being +explained to demonstrate how things work. + +Reading and writing encoded chunks from and to OSDs +--------------------------------------------------- + +An erasure coded pool stores each object as K+M chunks. It is divided +into K data chunks and M coding chunks. The pool is configured to have +a size of K+M so that each chunk is stored in an OSD in the acting +set. The rank of the chunk is stored as an attribute of the object. + +Let's say an erasure coded pool is created to use five OSDs ( K+M = +5 ) and sustain the loss of two of them ( M = 2 ). + +When the object *NYAN* containing *ABCDEFGHI* is written to it, the +erasure encoding function splits the content in three data chunks, +simply by dividing the content in three : the first contains *ABC*, +the second *DEF* and the last *GHI*. The content will be padded if the +content length is not a multiple of K. The function also creates two +coding chunks : the fourth with *YXY* and the fifth with *GQC*. Each +chunk is stored in an OSD in the acting set. The chunks are stored in +objects that have the same name ( *NYAN* ) but reside on different +OSDs. The order in which the chunks were created must be preserved and +is stored as an attribute of the object ( shard_t ), in addition to its +name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4* +contains *YXY* and is stored on *OSD3*. + +:: + + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | | | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | +------+ + +----------------->| OSD5 | + +------+ + + + + +When the object *NYAN* is read from the erasure coded pool, the +decoding function reads three chunks : chunk *1* containing *ABC*, +chunk *3* containing *GHI* and chunk *4* containing *YXY* and rebuild +the original content of the object *ABCDEFGHI*. The decoding function +is informed that the chunks *2* and *5* are missing ( they are called +*erasures* ). The chunk *5* could not be read because the *OSD4* is +*out*. + +The decoding function could be called as soon as three chunks are +read : *OSD2* was the slowest and its chunk does not need to be taken into +account. This optimization is not implemented in Firefly. + +:: + + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + ^ + | + | + +------+------+ + | decode(3,2) | + | erasures 2,5| + +-------------->| | + | +-------------+ + | ^ ^ + | | +-----+ + | | | + +--+---+ +------+ +--+---+ +--+---+ + name | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | + +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | + +--+---+ +--+---+ +--+---+ +--+---+ + ^ . ^ ^ + | TOO . | | + | SLOW . +--+---+ | + | ^ | OSD1 | | + | | +------+ | + | | +------+ | + | +-------| OSD2 | | + | +------+ | + | +------+ | + | | OSD3 |-----+ + | +------+ + | +------+ + | | OSD4 | OUT + | +------+ + | +------+ + +------------------| OSD5 | + +------+ + + +Erasure code library +-------------------- + +Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_, +with parameters K+M, object O is encoded by dividing it into chunks O1, +O2, ... OM and computing coding chunks P1, P2, ... PK. Any K chunks +out of the available K+M chunks can be used to obtain the original +object. If data chunk O2 or coding chunk P2 are lost, they can be +repaired using any K chunks out of the K+M chunks. If more than M +chunks are lost, it is not possible to recover the object. + +Reading the original content of object O can be a simple +concatenation of O1, O2, ... OM, because the plugins are using +`systematic codes +<https://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks +must be given to the erasure code library *decode* method to retrieve +the content of the object. + +Performance depend on the parameters to the encoding functions and +is also influenced by the packet sizes used when calling the encoding +functions ( for Cauchy or Liberation for instance ): smaller packets +means more calls and more overhead. + +Although Reed-Solomon is provided as a default, Ceph uses it via an +`abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_ designed to +allow each pool to choose the plugin that implements it using +key=value pairs stored in an `erasure code profile`_. + +.. _erasure code profile: ../../../erasure-coded-pool + +:: + + $ ceph osd erasure-code-profile set myprofile \ + crush-failure-domain=osd + $ ceph osd erasure-code-profile get myprofile + directory=/usr/lib/ceph/erasure-code + k=2 + m=1 + plugin=jerasure + technique=reed_sol_van + crush-failure-domain=osd + $ ceph osd pool create ecpool 12 12 erasure myprofile + +The *plugin* is dynamically loaded from *directory* and expected to +implement the *int __erasure_code_init(char *plugin_name, char *directory)* function +which is responsible for registering an object derived from *ErasureCodePlugin* +in the registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ plugin reads: + +:: + + ErasureCodePluginRegistry &instance = + ErasureCodePluginRegistry::instance(); + instance.add(plugin_name, new ErasureCodePluginExample()); + +The *ErasureCodePlugin* derived object must provide a factory method +from which the concrete implementation of the *ErasureCodeInterface* +object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ reads: + +:: + + virtual int factory(const map<std::string,std::string> ¶meters, + ErasureCodeInterfaceRef *erasure_code) { + *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters)); + return 0; + } + +The *parameters* argument is the list of *key=value* pairs that were +set in the erasure code profile, before the pool was created. + +:: + + ceph osd erasure-code-profile set myprofile \ + directory=<dir> \ # mandatory + plugin=jerasure \ # mandatory + m=10 \ # optional and plugin dependant + k=3 \ # optional and plugin dependant + technique=reed_sol_van \ # optional and plugin dependant + +Notes +----- + +If the objects are large, it may be impractical to encode and decode +them in memory. However, when using *RBD* a 1TB device is divided in +many individual 4MB objects and *RGW* does the same. + +Encoding and decoding is implemented in the OSD. Although it could be +implemented client side for read write, the OSD must be able to encode +and decode on its own when scrubbing. diff --git a/doc/dev/osd_internals/erasure_coding/ecbackend.rst b/doc/dev/osd_internals/erasure_coding/ecbackend.rst new file mode 100644 index 00000000..624ec217 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/ecbackend.rst @@ -0,0 +1,207 @@ +================================= +ECBackend Implementation Strategy +================================= + +Misc initial design notes +========================= + +The initial (and still true for ec pools without the hacky ec +overwrites debug flag enabled) design for ec pools restricted +EC pools to operations which can be easily rolled back: + +- CEPH_OSD_OP_APPEND: We can roll back an append locally by + including the previous object size as part of the PG log event. +- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete + requires that we retain the deleted object until all replicas have + persisted the deletion event. ErasureCoded backend will therefore + need to store objects with the version at which they were created + included in the key provided to the filestore. Old versions of an + object can be pruned when all replicas have committed up to the log + event deleting the object. +- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr + to be set or removed, we can roll back these operations locally. + +Log entries contain a structure explaining how to locally undo the +operation represented by the operation +(see osd_types.h:TransactionInfo::LocalRollBack). + +PGTemp and Crush +---------------- + +Primaries are able to request a temp acting set mapping in order to +allow an up-to-date OSD to serve requests while a new primary is +backfilled (and for other reasons). An erasure coded pg needs to be +able to designate a primary for these reasons without putting it in +the first position of the acting set. It also needs to be able to +leave holes in the requested acting set. + +Core Changes: + +- OSDMap::pg_to_*_osds needs to separately return a primary. For most + cases, this can continue to be acting[0]. +- MOSDPGTemp (and related OSD structures) needs to be able to specify + a primary as well as an acting set. +- Much of the existing code base assumes that acting[0] is the primary + and that all elements of acting are valid. This needs to be cleaned + up since the acting set may contain holes. + +Distinguished acting set positions +---------------------------------- + +With the replicated strategy, all replicas of a PG are +interchangeable. With erasure coding, different positions in the +acting set have different pieces of the erasure coding scheme and are +not interchangeable. Worse, crush might cause chunk 2 to be written +to an OSD which happens already to contain an (old) copy of chunk 4. +This means that the OSD and PG messages need to work in terms of a +type like pair<shard_t, pg_t> in order to distinguish different pg +chunks on a single OSD. + +Because the mapping of object name to object in the filestore must +be 1-to-1, we must ensure that the objects in chunk 2 and the objects +in chunk 4 have different names. To that end, the objectstore must +include the chunk id in the object key. + +Core changes: + +- The objectstore `ghobject_t needs to also include a chunk id + <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like + tuple<hobject_t, gen_t, shard_t>. +- coll_t needs to include a shard_t. +- The OSD pg_map and similar pg mappings need to work in terms of a + spg_t (essentially + pair<pg_t, shard_t>). Similarly, pg->pg messages need to include + a shard_t +- For client->PG messages, the OSD will need a way to know which PG + chunk should get the message since the OSD may contain both a + primary and non-primary chunk for the same pg + +Object Classes +-------------- + +Reads from object classes will return ENOTSUP on ec pools by invoking +a special SYNC read. + +Scrub +----- + +The main catch, however, for ec pools is that sending a crc32 of the +stored chunk on a replica isn't particularly helpful since the chunks +on different replicas presumably store different data. Because we +don't support overwrites except via DELETE, however, we have the +option of maintaining a crc32 on each chunk through each append. +Thus, each replica instead simply computes a crc32 of its own stored +chunk and compares it with the locally stored checksum. The replica +then reports to the primary whether the checksums match. + +With overwrites, all scrubs are disabled for now until we work out +what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst). + +Crush +----- + +If crush is unable to generate a replacement for a down member of an +acting set, the acting set should have a hole at that position rather +than shifting the other elements of the acting set out of position. + +========= +ECBackend +========= + +MAIN OPERATION OVERVIEW +======================= + +A RADOS put operation can span +multiple stripes of a single object. There must be code that +tessellates the application level write into a set of per-stripe write +operations -- some whole-stripes and up to two partial +stripes. Without loss of generality, for the remainder of this +document we will focus exclusively on writing a single stripe (whole +or partial). We will use the symbol "W" to represent the number of +blocks within a stripe that are being written, i.e., W <= K. + +There are three data flows for handling a write into an EC stripe. The +choice of which of the three data flows to choose is based on the size +of the write operation and the arithmetic properties of the selected +parity-generation algorithm. + +(1) whole stripe is written/overwritten +(2) a read-modify-write operation is performed. + +WHOLE STRIPE WRITE +------------------ + +This is the simple case, and is already performed in the existing code +(for appends, that is). The primary receives all of the data for the +stripe in the RADOS request, computes the appropriate parity blocks +and send the data and parity blocks to their destination shards which +write them. This is essentially the current EC code. + +READ-MODIFY-WRITE +----------------- + +The primary determines which of the K-W blocks are to be unmodified, +and reads them from the shards. Once all of the data is received it is +combined with the received new data and new parity blocks are +computed. The modified blocks are sent to their respective shards and +written. The RADOS operation is acknowledged. + +OSD Object Write and Consistency +-------------------------------- + +Regardless of the algorithm chosen above, writing of the data is a two +phase process: commit and rollforward. The primary sends the log +entries with the operation described (see +osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack). +In all cases, the "commit" is performed in place, possibly leaving some +information required for a rollback in a write-aside object. The +rollforward phase occurs once all acting set replicas have committed +the commit (sorry, overloaded term) and removes the rollback information. + +In the case of overwrites of exsting stripes, the rollback information +has the form of a sparse object containing the old values of the +overwritten extents populated using clone_range. This is essentially +a place-holder implementation, in real life, bluestore will have an +efficient primitive for this. + +The rollforward part can be delayed since we report the operation as +committed once all replicas have committed. Currently, whenever we +send a write, we also indicate that all previously committed +operations should be rolled forward (see +ECBackend::try_reads_to_commit). If there aren't any in the pipeline +when we arrive at the waiting_rollforward queue, we start a dummy +write to move things along (see the Pipeline section later on and +ECBackend::try_finish_rmw). + +ExtentCache +----------- + +It's pretty important to be able to pipeline writes on the same +object. For this reason, there is a cache of extents written by +cacheable operations. Each extent remains pinned until the operations +referring to it are committed. The pipeline prevents rmw operations +from running until uncacheable transactions (clones, etc) are flushed +from the pipeline. + +See ExtentCache.h for a detailed explanation of how the cache +states correspond to the higher level invariants about the conditions +under which cuncurrent operations can refer to the same object. + +Pipeline +-------- + +Reading src/osd/ExtentCache.h should have given a good idea of how +operations might overlap. There are several states involved in +processing a write operation and an important invariant which +isn't enforced by PrimaryLogPG at a higher level which need to be +managed by ECBackend. The important invariant is that we can't +have uncacheable and rmw operations running at the same time +on the same object. For simplicity, we simply enforce that any +operation which contains an rmw operation must wait until +all in-progress uncacheable operations complete. + +There are improvements to be made here in the future. + +For more details, see ECBackend::waiting_* and +ECBackend::try_<from>_to_<to>. + diff --git a/doc/dev/osd_internals/erasure_coding/jerasure.rst b/doc/dev/osd_internals/erasure_coding/jerasure.rst new file mode 100644 index 00000000..27669a0b --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/jerasure.rst @@ -0,0 +1,33 @@ +=============== +jerasure plugin +=============== + +Introduction +------------ + +The parameters interpreted by the jerasure plugin are: + +:: + + ceph osd erasure-code-profile set myprofile \ + directory=<dir> \ # plugin directory absolute path + plugin=jerasure \ # plugin name (only jerasure) + k=<k> \ # data chunks (default 2) + m=<m> \ # coding chunks (default 2) + technique=<technique> \ # coding technique + +The coding techniques can be chosen among *reed_sol_van*, +*reed_sol_r6_op*, *cauchy_orig*, *cauchy_good*, *liberation*, +*blaum_roth* and *liber8tion*. + +The *src/erasure-code/jerasure* directory contains the +implementation. It is a wrapper around the code found at +`https://github.com/ceph/jerasure <https://github.com/ceph/jerasure>`_ +and `https://github.com/ceph/gf-complete +<https://github.com/ceph/gf-complete>`_ , pinned to the latest stable +version in *.gitmodules*. These repositories are copies of the +upstream repositories `http://jerasure.org/jerasure/jerasure +<http://jerasure.org/jerasure/jerasure>`_ and +`http://jerasure.org/jerasure/gf-complete +<http://jerasure.org/jerasure/gf-complete>`_ . The difference +between the two, if any, should match pull requests against upstream. diff --git a/doc/dev/osd_internals/erasure_coding/proposals.rst b/doc/dev/osd_internals/erasure_coding/proposals.rst new file mode 100644 index 00000000..793f55e5 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/proposals.rst @@ -0,0 +1,385 @@ +:orphan: + +================================= +Proposed Next Steps for ECBackend +================================= + +PARITY-DELTA-WRITE +------------------ + +RMW operations current require 4 network hops (2 round trips). In +principle, for some codes, we can reduce this to 3 by sending the +update to the replicas holding the data blocks and having them +compute a delta to forward onto the parity blocks. + +The primary reads the current values of the "W" blocks and then uses +the new values of the "W" blocks to compute parity-deltas for each of +the parity blocks. The W blocks and the parity delta-blocks are sent +to their respective shards. + +The choice of whether to use a read-modify-write or a +parity-delta-write is complex policy issue that is TBD in the details +and is likely to be heavily dependant on the computational costs +associated with a parity-delta vs. a regular parity-generation +operation. However, it is believed that the parity-delta scheme is +likely to be the preferred choice, when available. + +The internal interface to the erasure coding library plug-ins needs to +be extended to support the ability to query if parity-delta +computation is possible for a selected algorithm as well as an +interface to the actual parity-delta computation algorithm when +available. + +Stripe Cache +------------ + +It may be a good idea to extend the current ExtentCache usage to +cache some data past when the pinning operation releases it. +One application pattern that is important to optimize is the small +block sequential write operation (think of the journal of a journaling +file system or a database transaction log). Regardless of the chosen +redundancy algorithm, it is advantageous for the primary to +retain/buffer recently read/written portions of a stripe in order to +reduce network traffic. The dynamic contents of this cache may be used +in the determination of whether a read-modify-write or a +parity-delta-write is performed. The sizing of this cache is TBD, but +we should plan on allowing at least a few full stripes per active +client. Limiting the cache occupancy on a per-client basis will reduce +the noisy neighbor problem. + +Recovery and Rollback Details +============================= + +Implementing a Rollback-able Prepare Operation +---------------------------------------------- + +The prepare operation is implemented at each OSD through a simulation +of a versioning or copy-on-write capability for modifying a portion of +an object. + +When a prepare operation is performed, the new data is written into a +temporary object. The PG log for the +operation will contain a reference to the temporary object so that it +can be located for recovery purposes as well as a record of all of the +shards which are involved in the operation. + +In order to avoid fragmentation (and hence, future read performance), +creation of the temporary object needs special attention. The name of +the temporary object affects its location within the KV store. Right +now its unclear whether it's desirable for the name to locate near the +base object or whether a separate subset of keyspace should be used +for temporary objects. Sam believes that colocation with the base +object is preferred (he suggests using the generation counter of the +ghobject for temporaries). Whereas Allen believes that using a +separate subset of keyspace is desirable since these keys are +ephemeral and we don't want to actually colocate them with the base +object keys. Perhaps some modeling here can help resolve this +issue. The data of the temporary object wants to be located as close +to the data of the base object as possible. This may be best performed +by adding a new ObjectStore creation primitive that takes the base +object as an additional parameter that is a hint to the allocator. + +Sam: I think that the short lived thing may be a red herring. We'll +be updating the donor and primary objects atomically, so it seems like +we'd want them adjacent in the key space, regardless of the donor's +lifecycle. + +The apply operation moves the data from the temporary object into the +correct position within the base object and deletes the associated +temporary object. This operation is done using a specialized +ObjectStore primitive. In the current ObjectStore interface, this can +be done using the clonerange function followed by a delete, but can be +done more efficiently with a specialized move primitive. +Implementation of the specialized primitive on FileStore can be done +by copying the data. Some file systems have extensions that might also +be able to implement this operation (like a defrag API that swaps +chunks between files). It is expected that NewStore will be able to +support this efficiently and natively (It has been noted that this +sequence requires that temporary object allocations, which tend to be +small, be efficiently converted into blocks for main objects and that +blocks that were formerly inside of main objects must be reusable with +minimal overhead) + +The prepare and apply operations can be separated arbitrarily in +time. If a read operation accesses an object that has been altered by +a prepare operation (but without a corresponding apply operation) it +must return the data after the prepare operation. This is done by +creating an in-memory database of objects which have had a prepare +operation without a corresponding apply operation. All read operations +must consult this in-memory data structure in order to get the correct +data. It should explicitly recognized that it is likely that there +will be multiple prepare operations against a single base object and +the code must handle this case correctly. This code is implemented as +a layer between ObjectStore and all existing readers. Annoyingly, +we'll want to trash this state when the interval changes, so the first +thing that needs to happen after activation is that the primary and +replicas apply up to last_update so that the empty cache will be +correct. + +During peering, it is now obvious that an unapplied prepare operation +can easily be rolled back simply by deleting the associated temporary +object and removing that entry from the in-memory data structure. + +Partial Application Peering/Recovery modifications +-------------------------------------------------- + +Some writes will be small enough to not require updating all of the +shards holding data blocks. For write amplification minization +reasons, it would be best to avoid writing to those shards at all, +and delay even sending the log entries until the next write which +actually hits that shard. + +The delaying (buffering) of the transmission of the prepare and apply +operations for witnessing OSDs creates new situations that peering +must handle. In particular the logic for determining the authoritative +last_update value (and hence the selection of the OSD which has the +authoritative log) must be modified to account for the valid but +missing (i.e., delayed/buffered) pglog entries to which the +authoritative OSD was only a witness to. + +Because a partial write might complete without persisting a log entry +on every replica, we have to do a bit more work to determine an +authoritative last_update. The constraint (as with a replicated PG) +is that last_update >= the most recent log entry for which a commit +was sent to the client (call this actual_last_update). Secondarily, +we want last_update to be as small as possible since any log entry +past actual_last_update (we do not apply a log entry until we have +sent the commit to the client) must be able to be rolled back. Thus, +the smaller a last_update we choose, the less recovery will need to +happen (we can always roll back, but rolling a replica forward may +require an object rebuild). Thus, we will set last_update to 1 before +the oldest log entry we can prove cannot have been committed. In +current master, this is simply the last_update of the shortest log +from that interval (because that log did not persist any entry past +that point -- a precondition for sending a commit to the client). For +this design, we must consider the possibility that any log is missing +at its head log entries in which it did not participate. Thus, we +must determine the most recent interval in which we went active +(essentially, this is what find_best_info currently does). We then +pull the log from each live osd from that interval back to the minimum +last_update among them. Then, we extend all logs from the +authoritative interval until each hits an entry in which it should +have participated, but did not record. The shortest of these extended +logs must therefore contain any log entry for which we sent a commit +to the client -- and the last entry gives us our last_update. + +Deep scrub support +------------------ + +The simple answer here is probably our best bet. EC pools can't use +the omap namespace at all right now. The simplest solution would be +to take a prefix of the omap space and pack N M byte L bit checksums +into each key/value. The prefixing seems like a sensible precaution +against eventually wanting to store something else in the omap space. +It seems like any write will need to read at least the blocks +containing the modified range. However, with a code able to compute +parity deltas, we may not need to read a whole stripe. Even without +that, we don't want to have to write to blocks not participating in +the write. Thus, each shard should store checksums only for itself. +It seems like you'd be able to store checksums for all shards on the +parity blocks, but there may not be distinguished parity blocks which +are modified on all writes (LRC or shec provide two examples). L +should probably have a fixed number of options (16, 32, 64?) and be +configurable per-pool at pool creation. N, M should be likewise be +configurable at pool creation with sensible defaults. + +We need to handle online upgrade. I think the right answer is that +the first overwrite to an object with an append only checksum +removes the append only checksum and writes in whatever stripe +checksums actually got written. The next deep scrub then writes +out the full checksum omap entries. + +RADOS Client Acknowledgement Generation Optimization +==================================================== + +Now that the recovery scheme is understood, we can discuss the +generation of of the RADOS operation acknowledgement (ACK) by the +primary ("sufficient" from above). It is NOT required that the primary +wait for all shards to complete their respective prepare +operations. Using our example where the RADOS operations writes only +"W" chunks of the stripe, the primary will generate and send W+M +prepare operations (possibly including a send-to-self). The primary +need only wait for enough shards to be written to ensure recovery of +the data, Thus after writing W + M chunks you can afford the lost of M +chunks. Hence the primary can generate the RADOS ACK after W+M-M => W +of those prepare operations are completed. + +Inconsistent object_info_t versions +=================================== + +A natural consequence of only writing the blocks which actually +changed is that we don't want to update the object_info_t of the +objects which didn't. I actually think it would pose a problem to do +so: pg ghobject namespaces are generally large, and unless the osd is +seeing a bunch of overwrites on a small set of objects, I'd expect +each write to be far enough apart in the backing ghobject_t->data +mapping to each constitute a random metadata update. Thus, we have to +accept that not every shard will have the current version in its +object_info_t. We can't even bound how old the version on a +particular shard will happen to be. In particular, the primary does +not necessarily have the current version. One could argue that the +parity shards would always have the current version, but not every +code necessarily has designated parity shards which see every write +(certainly LRC, iirc shec, and even with a more pedestrian code, it +might be desirable to rotate the shards based on object hash). Even +if you chose to designate a shard as witnessing all writes, the pg +might be degraded with that particular shard missing. This is a bit +tricky, currently reads and writes implicitly return the most recent +version of the object written. On reads, we'd have to read K shards +to answer that question. We can get around that by adding a "don't +tell me the current version" flag. Writes are more problematic: we +need an object_info from the most recent write in order to form the +new object_info and log_entry. + +A truly terrifying option would be to eliminate version and +prior_version entirely from the object_info_t. There are a few +specific purposes it serves: + +#. On OSD startup, we prime the missing set by scanning backwards + from last_update to last_complete comparing the stored object's + object_info_t to the version of most recent log entry. +#. During backfill, we compare versions between primary and target + to avoid some pushes. We use it elsewhere as well +#. While pushing and pulling objects, we verify the version. +#. We return it on reads and writes and allow the librados user to + assert it atomically on writesto allow the user to deal with write + races (used extensively by rbd). + +Case (3) isn't actually essential, just convenient. Oh well. (4) +is more annoying. Writes are easy since we know the version. Reads +are tricky because we may not need to read from all of the replicas. +Simplest solution is to add a flag to rados operations to just not +return the user version on read. We can also just not support the +user version assert on ec for now (I think? Only user is rgw bucket +indices iirc, and those will always be on replicated because they use +omap). + +We can avoid (1) by maintaining the missing set explicitly. It's +already possible for there to be a missing object without a +corresponding log entry (Consider the case where the most recent write +is to an object which has not been updated in weeks. If that write +becomes divergent, the written object needs to be marked missing based +on the prior_version which is not in the log.) THe PGLog already has +a way of handling those edge cases (see divergent_priors). We'd +simply expand that to contain the entire missing set and maintain it +atomically with the log and the objects. This isn't really an +unreasonable option, the additional keys would be fewer than the +existing log keys + divergent_priors and aren't updated in the fast +write path anyway. + +The second case is a bit trickier. It's really an optimization for +the case where a pg became not in the acting set long enough for the +logs to no longer overlap but not long enough for the PG to have +healed and removed the old copy. Unfortunately, this describes the +case where a node was taken down for maintenance with noout set. It's +probably not acceptable to re-backfill the whole OSD in such a case, +so we need to be able to quickly determine whether a particular shard +is up to date given a valid acting set of other shards. + +Let ordinary writes which do not change the object size not touch the +object_info at all. That means that the object_info version won't +match the pg log entry version. Include in the pg_log_entry_t the +current object_info version as well as which shards participated (as +mentioned above). In addition to the object_info_t attr, record on +each shard s a vector recording for each other shard s' the most +recent write which spanned both s and s'. Operationally, we maintain +an attr on each shard containing that vector. A write touching S +updates the version stamp entry for each shard in S on each shard in +S's attribute (and leaves the rest alone). If we have a valid acting +set during backfill, we must have a witness of every write which +completed -- so taking the max of each entry over all of the acting +set shards must give us the current version for each shard. During +recovery, we set the attribute on the recovery target to that max +vector (Question: with LRC, we may not need to touch much of the +acting set to recover a particular shard -- can we just use the max of +the shards we used to recovery, or do we need to grab the version +vector from the rest of the acting set as well? I'm not sure, not a +big deal anyway, I think). + +The above lets us perform blind writes without knowing the current +object version (log entry version, that is) while still allowing us to +avoid backfilling up to date objects. The only catch is that our +backfill scans will can all replicas, not just the primary and the +backfill targets. + +It would be worth adding into scrub the ability to check the +consistency of the gathered version vectors -- probably by just +taking 3 random valid subsets and verifying that they generate +the same authoritative version vector. + +Implementation Strategy +======================= + +It goes without saying that it would be unwise to attempt to do all of +this in one massive PR. It's also not a good idea to merge code which +isn't being tested. To that end, it's worth thinking a bit about +which bits can be tested on their own (perhaps with a bit of temporary +scaffolding). + +We can implement the overwrite friendly checksumming scheme easily +enough with the current implementation. We'll want to enable it on a +per-pool basis (probably using a flag which we'll later repurpose for +actual overwrite support). We can enable it in some of the ec +thrashing tests in the suite. We can also add a simple test +validating the behavior of turning it on for an existing ec pool +(later, we'll want to be able to convert append-only ec pools to +overwrite ec pools, so that test will simply be expanded as we go). +The flag should be gated by the experimental feature flag since we +won't want to support this as a valid configuration -- testing only. +We need to upgrade append only ones in place during deep scrub. + +Similarly, we can implement the unstable extent cache with the current +implementation, it even lets us cut out the readable ack the replicas +send to the primary after the commit which lets it release the lock. +Same deal, implement, gate with experimental flag, add to some of the +automated tests. I don't really see a reason not to use the same flag +as above. + +We can certainly implement the move-range primitive with unit tests +before there are any users. Adding coverage to the existing +objectstore tests would suffice here. + +Explicit missing set can be implemented now, same deal as above -- +might as well even use the same feature bit. + +The TPC protocol outlined above can actually be implemented an append +only EC pool. Same deal as above, can even use the same feature bit. + +The RADOS flag to suppress the read op user version return can be +implemented immediately. Mostly just needs unit tests. + +The version vector problem is an interesting one. For append only EC +pools, it would be pointless since all writes increase the size and +therefore update the object_info. We could do it for replicated pools +though. It's a bit silly since all "shards" see all writes, but it +would still let us implement and partially test the augmented backfill +code as well as the extra pg log entry fields -- this depends on the +explicit pg log entry branch having already merged. It's not entirely +clear to me that this one is worth doing separately. It's enough code +that I'd really prefer to get it done independently, but it's also a +fair amount of scaffolding that will be later discarded. + +PGLog entries need to be able to record the participants and log +comparison needs to be modified to extend logs with entries they +wouldn't have witnessed. This logic should be abstracted behind +PGLog so it can be unittested -- that would let us test it somewhat +before the actual ec overwrites code merges. + +Whatever needs to happen to the ec plugin interface can probably be +done independently of the rest of this (pending resolution of +questions below). + +The actual nuts and bolts of performing the ec overwrite it seems to +me can't be productively tested (and therefore implemented) until the +above are complete, so best to get all of the supporting code in +first. + +Open Questions +============== + +Is there a code we should be using that would let us compute a parity +delta without rereading and reencoding the full stripe? If so, is it +the kind of thing we need to design for now, or can it be reasonably +put off? + +What needs to happen to the EC plugin interface? diff --git a/doc/dev/osd_internals/index.rst b/doc/dev/osd_internals/index.rst new file mode 100644 index 00000000..7e82914a --- /dev/null +++ b/doc/dev/osd_internals/index.rst @@ -0,0 +1,10 @@ +============================== +OSD developer documentation +============================== + +.. rubric:: Contents + +.. toctree:: + :glob: + + * diff --git a/doc/dev/osd_internals/last_epoch_started.rst b/doc/dev/osd_internals/last_epoch_started.rst new file mode 100644 index 00000000..8ed5c980 --- /dev/null +++ b/doc/dev/osd_internals/last_epoch_started.rst @@ -0,0 +1,60 @@ +====================== +last_epoch_started +====================== + +info.last_epoch_started records an activation epoch e for interval i +such that all writes committed in i or earlier are reflected in the +local info/log and no writes after i are reflected in the local +info/log. Since no committed write is ever divergent, even if we +get an authoritative log/info with an older info.last_epoch_started, +we can leave our info.last_epoch_started alone since no writes could +have committed in any intervening interval (See PG::proc_master_log). + +info.history.last_epoch_started records a lower bound on the most +recent interval in which the pg as a whole went active and accepted +writes. On a particular osd, it is also an upper bound on the +activation epoch of intervals in which writes in the local pg log +occurred (we update it before accepting writes). Because all +committed writes are committed by all acting set osds, any +non-divergent writes ensure that history.last_epoch_started was +recorded by all acting set members in the interval. Once peering has +queried one osd from each interval back to some seen +history.last_epoch_started, it follows that no interval after the max +history.last_epoch_started can have reported writes as committed +(since we record it before recording client writes in an interval). +Thus, the minimum last_update across all infos with +info.last_epoch_started >= MAX(history.last_epoch_started) must be an +upper bound on writes reported as committed to the client. + +We update info.last_epoch_started with the initial activation message, +but we only update history.last_epoch_started after the new +info.last_epoch_started is persisted (possibly along with the first +write). This ensures that we do not require an osd with the most +recent info.last_epoch_started until all acting set osds have recorded +it. + +In find_best_info, we do include info.last_epoch_started values when +calculating the max_last_epoch_started_found because we want to avoid +designating a log entry divergent which in a prior interval would have +been non-divergent since it might have been used to serve a read. In +activate(), we use the peer's last_epoch_started value as a bound on +how far back divergent log entries can be found. + +However, in a case like + +.. code:: + + calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556 + calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556 + calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556 + calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556 + +since osd.1 is the only one which recorded info.les=477 while 4,0 +which were the acting set in that interval did not (4 restarted and 0 +did not get the message in time) the pg is marked incomplete when +either 4 or 0 would have been valid choices. To avoid this, we do not +consider info.les for incomplete peers when calculating +min_last_epoch_started_found. It would not have been in the acting +set, so we must have another osd from that interval anyway (if +maybe_went_rw). If that osd does not remember that info.les, then we +cannot have served reads. diff --git a/doc/dev/osd_internals/log_based_pg.rst b/doc/dev/osd_internals/log_based_pg.rst new file mode 100644 index 00000000..a1e27539 --- /dev/null +++ b/doc/dev/osd_internals/log_based_pg.rst @@ -0,0 +1,208 @@ +.. _log-based-pg: + +============ +Log Based PG +============ + +Background +========== + +Why PrimaryLogPG? +----------------- + +Currently, consistency for all ceph pool types is ensured by primary +log-based replication. This goes for both erasure-coded and +replicated pools. + +Primary log-based replication +----------------------------- + +Reads must return data written by any write which completed (where the +client could possibly have received a commit message). There are lots +of ways to handle this, but ceph's architecture makes it easy for +everyone at any map epoch to know who the primary is. Thus, the easy +answer is to route all writes for a particular pg through a single +ordering primary and then out to the replicas. Though we only +actually need to serialize writes on a single object (and even then, +the partial ordering only really needs to provide an ordering between +writes on overlapping regions), we might as well serialize writes on +the whole PG since it lets us represent the current state of the PG +using two numbers: the epoch of the map on the primary in which the +most recent write started (this is a bit stranger than it might seem +since map distribution itself is asynchronous -- see Peering and the +concept of interval changes) and an increasing per-pg version number +-- this is referred to in the code with type eversion_t and stored as +pg_info_t::last_update. Furthermore, we maintain a log of "recent" +operations extending back at least far enough to include any +*unstable* writes (writes which have been started but not committed) +and objects which aren't uptodate locally (see recovery and +backfill). In practice, the log will extend much further +(osd_pg_min_log_entries when clean, osd_pg_max_log_entries when not +clean) because it's handy for quickly performing recovery. + +Using this log, as long as we talk to a non-empty subset of the OSDs +which must have accepted any completed writes from the most recent +interval in which we accepted writes, we can determine a conservative +log which must contain any write which has been reported to a client +as committed. There is some freedom here, we can choose any log entry +between the oldest head remembered by an element of that set (any +newer cannot have completed without that log containing it) and the +newest head remembered (clearly, all writes in the log were started, +so it's fine for us to remember them) as the new head. This is the +main point of divergence between replicated pools and ec pools in +PG/PrimaryLogPG: replicated pools try to choose the newest valid +option to avoid the client needing to replay those operations and +instead recover the other copies. EC pools instead try to choose +the *oldest* option available to them. + +The reason for this gets to the heart of the rest of the differences +in implementation: one copy will not generally be enough to +reconstruct an ec object. Indeed, there are encodings where some log +combinations would leave unrecoverable objects (as with a 4+2 encoding +where 3 of the replicas remember a write, but the other 3 do not -- we +don't have 3 copies of either version). For this reason, log entries +representing *unstable* writes (writes not yet committed to the +client) must be rollbackable using only local information on ec pools. +Log entries in general may therefore be rollbackable (and in that case, +via a delayed application or via a set of instructions for rolling +back an inplace update) or not. Replicated pool log entries are +never able to be rolled back. + +For more details, see PGLog.h/cc, osd_types.h:pg_log_t, +osd_types.h:pg_log_entry_t, and peering in general. + +ReplicatedBackend/ECBackend unification strategy +================================================ + +PGBackend +--------- + +So, the fundamental difference between replication and erasure coding +is that replication can do destructive updates while erasure coding +cannot. It would be really annoying if we needed to have two entire +implementations of PrimaryLogPG, one for each of the two, if there +are really only a few fundamental differences: + +#. How reads work -- async only, requires remote reads for ec +#. How writes work -- either restricted to append, or must write aside and do a + tpc +#. Whether we choose the oldest or newest possible head entry during peering +#. A bit of extra information in the log entry to enable rollback + +and so many similarities + +#. All of the stats and metadata for objects +#. The high level locking rules for mixing client IO with recovery and scrub +#. The high level locking rules for mixing reads and writes without exposing + uncommitted state (which might be rolled back or forgotten later) +#. The process, metadata, and protocol needed to determine the set of osds + which participated in the most recent interval in which we accepted writes +#. etc. + +Instead, we choose a few abstractions (and a few kludges) to paper over the differences: + +#. PGBackend +#. PGTransaction +#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting +#. Various bits of the write pipeline disallow some operations based on pool + type -- like omap operations, class operation reads, and writes which are + not aligned appends (officially, so far) for ec +#. Misc other kludges here and there + +PGBackend and PGTransaction enable abstraction of differences 1, 2, +and the addition of 4 as needed to the log entries. + +The replicated implementation is in ReplicatedBackend.h/cc and doesn't +require much explanation, I think. More detail on the ECBackend can be +found in doc/dev/osd_internals/erasure_coding/ecbackend.rst. + +PGBackend Interface Explanation +=============================== + +Note: this is from a design document from before the original firefly +and is probably out of date w.r.t. some of the method names. + +Readable vs Degraded +-------------------- + +For a replicated pool, an object is readable iff it is present on +the primary (at the right version). For an ec pool, we need at least +M shards present to do a read, and we need it on the primary. For +this reason, PGBackend needs to include some interfaces for determining +when recovery is required to serve a read vs a write. This also +changes the rules for when peering has enough logs to prove that it + +Core Changes: + +- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate + | objects to allow the user to make these determinations. + +Client Reads +------------ + +Reads with the replicated strategy can always be satisfied +synchronously out of the primary OSD. With an erasure coded strategy, +the primary will need to request data from some number of replicas in +order to satisfy a read. PGBackend will therefore need to provide +separate objects_read_sync and objects_read_async interfaces where +the former won't be implemented by the ECBackend. + +PGBackend interfaces: + +- objects_read_sync +- objects_read_async + +Scrub +----- + +We currently have two scrub modes with different default frequencies: + +#. [shallow] scrub: compares the set of objects and metadata, but not + the contents +#. deep scrub: compares the set of objects, metadata, and a crc32 of + the object contents (including omap) + +The primary requests a scrubmap from each replica for a particular +range of objects. The replica fills out this scrubmap for the range +of objects including, if the scrub is deep, a crc32 of the contents of +each object. The primary gathers these scrubmaps from each replica +and performs a comparison identifying inconsistent objects. + +Most of this can work essentially unchanged with erasure coded PG with +the caveat that the PGBackend implementation must be in charge of +actually doing the scan. + + +PGBackend interfaces: + +- be_* + +Recovery +-------- + +The logic for recovering an object depends on the backend. With +the current replicated strategy, we first pull the object replica +to the primary and then concurrently push it out to the replicas. +With the erasure coded strategy, we probably want to read the +minimum number of replica chunks required to reconstruct the object +and push out the replacement chunks concurrently. + +Another difference is that objects in erasure coded pg may be +unrecoverable without being unfound. The "unfound" concept +should probably then be renamed to unrecoverable. Also, the +PGBackend implementation will have to be able to direct the search +for pg replicas with unrecoverable object chunks and to be able +to determine whether a particular object is recoverable. + + +Core changes: + +- s/unfound/unrecoverable + +PGBackend interfaces: + +- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_ +- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_ +- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_ +- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_ +- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_ diff --git a/doc/dev/osd_internals/map_message_handling.rst b/doc/dev/osd_internals/map_message_handling.rst new file mode 100644 index 00000000..a5013c22 --- /dev/null +++ b/doc/dev/osd_internals/map_message_handling.rst @@ -0,0 +1,131 @@ +=========================== +Map and PG Message handling +=========================== + +Overview +-------- +The OSD handles routing incoming messages to PGs, creating the PG if necessary +in some cases. + +PG messages generally come in two varieties: + + 1. Peering Messages + 2. Ops/SubOps + +There are several ways in which a message might be dropped or delayed. It is +important that the message delaying does not result in a violation of certain +message ordering requirements on the way to the relevant PG handling logic: + + 1. Ops referring to the same object must not be reordered. + 2. Peering messages must not be reordered. + 3. Subops must not be reordered. + +MOSDMap +------- +MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the +OSD must perform several tasks: + + 1. Persist the new maps to the filestore. + Several PG operations rely on having access to maps dating back to the last + time the PG was clean. + 2. Update and persist the superblock. + 3. Update OSD state related to the current map. + 4. Expose new maps to PG processes via *OSDService*. + 5. Remove PGs due to pool removal. + 6. Queue dummy events to trigger PG map catchup. + +Each PG asynchronously catches up to the currently published map during +process_peering_events before processing the event. As a result, different +PGs may have different views as to the "current" map. + +One consequence of this design is that messages containing submessages from +multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage +with the PG's epoch as well as tagging the message as a whole with the OSD's +current published epoch. + +MOSDPGOp/MOSDPGSubOp +-------------------- +See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op + +MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used +between OSDs to coordinate most non peering activities including replicating +MOSDPGOp operations. + +OSD::require_same_or_newer map checks that the current OSDMap is at least +as new as the map epoch indicated on the message. If not, the message is +queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this +cannot violate the above conditions since any two messages will be queued +in order of receipt and if a message is received with epoch e0, a later message +from the same source must be at epoch at least e0. Note that two PGs from +the same OSD count for these purposes as different sources for single PG +messages. That is, messages from different PGs may be reordered. + + +MOSDPGOps follow the following process: + + 1. OSD::handle_op: validates permissions and crush mapping. + discard the request if they are not connected and the client cannot get the reply ( See OSD::op_is_discardable ) + See OSDService::handle_misdirected_op + See PG::op_has_sufficient_caps + See OSD::require_same_or_newer_map + 2. OSD::enqueue_op + +MOSDSubOps follow the following process: + + 1. OSD::handle_sub_op checks that sender is an OSD + 2. OSD::enqueue_op + +OSD::enqueue_op calls PG::queue_op which checks waiting_for_map before calling OpWQ::queue which adds the op to the queue of the PG responsible for handling it. + +OSD::dequeue_op is then eventually called, with a lock on the PG. At +this time, the op is passed to PG::do_request, which checks that: + + 1. the PG map is new enough (PG::must_delay_op) + 2. the client requesting the op has enough permissions (PG::op_has_sufficient_caps) + 3. the op is not to be discarded (PG::can_discard_{request,op,subop,scan,backfill}) + 4. the PG is active (PG::flushed boolean) + 5. the op is a CEPH_MSG_OSD_OP and the PG is in PG_STATE_ACTIVE state and not in PG_STATE_REPLAY + +If these conditions are not met, the op is either discarded or queued for later processing. If all conditions are met, the op is processed according to its type: + + 1. CEPH_MSG_OSD_OP is handled by PG::do_op + 2. MSG_OSD_SUBOP is handled by PG::do_sub_op + 3. MSG_OSD_SUBOPREPLY is handled by PG::do_sub_op_reply + 4. MSG_OSD_PG_SCAN is handled by PG::do_scan + 5. MSG_OSD_PG_BACKFILL is handled by PG::do_backfill + +CEPH_MSG_OSD_OP processing +-------------------------- + +PrimaryLogPG::do_op handles CEPH_MSG_OSD_OP op and will queue it + + 1. in wait_for_all_missing if it is a CEPH_OSD_OP_PGLS for a designated snapid and some object updates are still missing + 2. in waiting_for_active if the op may write but the scrubber is working + 3. in waiting_for_missing_object if the op requires an object or a snapdir or a specific snap that is still missing + 4. in waiting_for_degraded_object if the op may write an object or a snapdir that is degraded, or if another object blocks it ("blocked_by") + 5. in waiting_for_backfill_pos if the op requires an object that will be available after the backfill is complete + 6. in waiting_for_ack if an ack from another OSD is expected + 7. in waiting_for_ondisk if the op is waiting for a write to complete + +Peering Messages +---------------- +See OSD::handle_pg_(notify|info|log|query) + +Peering messages are tagged with two epochs: + + 1. epoch_sent: map epoch at which the message was sent + 2. query_epoch: map epoch at which the message triggering the message was sent + +These are the same in cases where there was no triggering message. We discard +a peering message if the message's query_epoch if the PG in question has entered +a new epoch (See PG::old_peering_evt, PG::queue_peering_event). Notifies, +infos, notifies, and logs are all handled as PG::RecoveryMachine events and +are wrapped by PG::queue_* by PG::CephPeeringEvts, which include the created +state machine event along with epoch_sent and query_epoch in order to +generically check PG::old_peering_message upon insertion and removal from the +queue. + +Note, notifies, logs, and infos can trigger the creation of a PG. See +OSD::get_or_create_pg. + + diff --git a/doc/dev/osd_internals/osd_overview.rst b/doc/dev/osd_internals/osd_overview.rst new file mode 100644 index 00000000..192ddf8c --- /dev/null +++ b/doc/dev/osd_internals/osd_overview.rst @@ -0,0 +1,106 @@ +=== +OSD +=== + +Concepts +-------- + +*Messenger* + See src/msg/Messenger.h + + Handles sending and receipt of messages on behalf of the OSD. The OSD uses + two messengers: + + 1. cluster_messenger - handles traffic to other OSDs, monitors + 2. client_messenger - handles client traffic + + This division allows the OSD to be configured with different interfaces for + client and cluster traffic. + +*Dispatcher* + See src/msg/Dispatcher.h + + OSD implements the Dispatcher interface. Of particular note is ms_dispatch, + which serves as the entry point for messages received via either the client + or cluster messenger. Because there are two messengers, ms_dispatch may be + called from at least two threads. The osd_lock is always held during + ms_dispatch. + +*WorkQueue* + See src/common/WorkQueue.h + + The WorkQueue class abstracts the process of queueing independent tasks + for asynchronous execution. Each OSD process contains workqueues for + distinct tasks: + + 1. OpWQ: handles ops (from clients) and subops (from other OSDs). + Runs in the op_tp threadpool. + 2. PeeringWQ: handles peering tasks and pg map advancement + Runs in the op_tp threadpool. + See Peering + 3. CommandWQ: handles commands (pg query, etc) + Runs in the command_tp threadpool. + 4. RecoveryWQ: handles recovery tasks. + Runs in the recovery_tp threadpool. + 5. SnapTrimWQ: handles snap trimming + Runs in the disk_tp threadpool. + See SnapTrimmer + 6. ScrubWQ: handles primary scrub path + Runs in the disk_tp threadpool. + See Scrub + 7. ScrubFinalizeWQ: handles primary scrub finalize + Runs in the disk_tp threadpool. + See Scrub + 8. RepScrubWQ: handles replica scrub path + Runs in the disk_tp threadpool + See Scrub + 9. RemoveWQ: Asynchronously removes old pg directories + Runs in the disk_tp threadpool + See PGRemoval + +*ThreadPool* + See src/common/WorkQueue.h + See also above. + + There are 4 OSD threadpools: + + 1. op_tp: handles ops and subops + 2. recovery_tp: handles recovery tasks + 3. disk_tp: handles disk intensive tasks + 4. command_tp: handles commands + +*OSDMap* + See src/osd/OSDMap.h + + The crush algorithm takes two inputs: a picture of the cluster + with status information about which nodes are up/down and in/out, + and the pgid to place. The former is encapsulated by the OSDMap. + Maps are numbered by *epoch* (epoch_t). These maps are passed around + within the OSD as std::tr1::shared_ptr<const OSDMap>. + + See MapHandling + +*PG* + See src/osd/PG.* src/osd/PrimaryLogPG.* + + Objects in rados are hashed into *PGs* and *PGs* are placed via crush onto + OSDs. The PG structure is responsible for handling requests pertaining to + a particular *PG* as well as for maintaining relevant metadata and controlling + recovery. + +*OSDService* + See src/osd/OSD.cc OSDService + + The OSDService acts as a broker between PG threads and OSD state which allows + PGs to perform actions using OSD services such as workqueues and messengers. + This is still a work in progress. Future cleanups will focus on moving such + state entirely from the OSD into the OSDService. + +Overview +-------- + See src/ceph_osd.cc + + The OSD process represents one leaf device in the crush hierarchy. There + might be one OSD process per physical machine, or more than one if, for + example, the user configures one OSD instance per disk. + diff --git a/doc/dev/osd_internals/osd_throttles.rst b/doc/dev/osd_internals/osd_throttles.rst new file mode 100644 index 00000000..fbacb078 --- /dev/null +++ b/doc/dev/osd_internals/osd_throttles.rst @@ -0,0 +1,93 @@ +============= +OSD Throttles +============= + +There are three significant throttles in the filestore: wbthrottle, +op_queue_throttle, and a throttle based on journal usage. + +WBThrottle +---------- +The WBThrottle is defined in src/os/filestore/WBThrottle.[h,cc] and +included in FileStore as FileStore::wbthrottle. The intention is to +bound the amount of outstanding IO we need to do to flush the journal. +At the same time, we don't want to necessarily do it inline in case we +might be able to combine several IOs on the same object close together +in time. Thus, in FileStore::_write, we queue the fd for asynchronous +flushing and block in FileStore::_do_op if we have exceeded any hard +limits until the background flusher catches up. + +The relevant config options are filestore_wbthrottle*. There are +different defaults for xfs and btrfs. Each set has hard and soft +limits on bytes (total dirty bytes), ios (total dirty ios), and +inodes (total dirty fds). The WBThrottle will begin flushing +when any of these hits the soft limit and will block in throttle() +while any has exceeded the hard limit. + +Tighter soft limits will cause writeback to happen more quickly, +but may cause the OSD to miss oportunities for write coalescing. +Tighter hard limits may cause a reduction in latency variance by +reducing time spent flushing the journal, but may reduce writeback +parallelism. + +op_queue_throttle +----------------- +The op queue throttle is intended to bound the amount of queued but +uncompleted work in the filestore by delaying threads calling +queue_transactions more and more based on how many ops and bytes are +currently queued. The throttle is taken in queue_transactions and +released when the op is applied to the filesystem. This period +includes time spent in the journal queue, time spent writing to the +journal, time spent in the actual op queue, time spent waiting for the +wbthrottle to open up (thus, the wbthrottle can push back indirectly +on the queue_transactions caller), and time spent actually applying +the op to the filesystem. A BackoffThrottle is used to gradually +delay the queueing thread after each throttle becomes more than +filestore_queue_low_threshhold full (a ratio of +filestore_queue_max_(bytes|ops)). The throttles will block once the +max value is reached (filestore_queue_max_(bytes|ops)). + +The significant config options are: +filestore_queue_low_threshhold +filestore_queue_high_threshhold +filestore_expected_throughput_ops +filestore_expected_throughput_bytes +filestore_queue_high_delay_multiple +filestore_queue_max_delay_multiple + +While each throttle is at less than low_threshold of the max, +no delay happens. Between low and high, the throttle will +inject a per-op delay (per op or byte) ramping from 0 at low to +high_delay_multiple/expected_throughput at high. From high to +1, the delay will ramp from high_delay_multiple/expected_throughput +to max_delay_multiple/expected_throughput. + +filestore_queue_high_delay_multiple and +filestore_queue_max_delay_multiple probably do not need to be +changed. + +Setting these properly should help to smooth out op latencies by +mostly avoiding the hard limit. + +See FileStore::throttle_ops and FileSTore::thottle_bytes. + +journal usage throttle +---------------------- +See src/os/filestore/JournalThrottle.h/cc + +The intention of the journal usage throttle is to gradually slow +down queue_transactions callers as the journal fills up in order +to smooth out hiccup during filestore syncs. JournalThrottle +wraps a BackoffThrottle and tracks journaled but not flushed +journal entries so that the throttle can be released when the +journal is flushed. The configs work very similarly to the +op_queue_throttle. + +The significant config options are: +journal_throttle_low_threshhold +journal_throttle_high_threshhold +filestore_expected_throughput_ops +filestore_expected_throughput_bytes +journal_throttle_high_multiple +journal_throttle_max_multiple + +.. literalinclude:: osd_throttles.txt diff --git a/doc/dev/osd_internals/osd_throttles.txt b/doc/dev/osd_internals/osd_throttles.txt new file mode 100644 index 00000000..0332377e --- /dev/null +++ b/doc/dev/osd_internals/osd_throttles.txt @@ -0,0 +1,21 @@ + Messenger throttle (number and size) + |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| + FileStore op_queue throttle (number and size, includes a soft throttle based on filestore_expected_throughput_(ops|bytes)) + |--------------------------------------------------------| + WBThrottle + |---------------------------------------------------------------------------------------------------------| + Journal (size, includes a soft throttle based on filestore_expected_throughput_bytes) + |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| + |----------------------------------------------------------------------------------------------------> flushed ----------------> synced + | +Op: Read Header --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> op_applied -------------------------------------------------------------> Complete + | | +SubOp: --Messenger--> ReadHeader --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> sub_op_applied - + | + |-----------------------------> flushed ----------------> synced + |------------------------------------------------------------------------------------------| + Journal (size) + |---------------------------------| + WBThrottle + |-----------------------------------------------------| + FileStore op_queue throttle (number and size) diff --git a/doc/dev/osd_internals/osdmap_versions.txt b/doc/dev/osd_internals/osdmap_versions.txt new file mode 100644 index 00000000..12fab5ae --- /dev/null +++ b/doc/dev/osd_internals/osdmap_versions.txt @@ -0,0 +1,259 @@ +releases: + + <0.48 pre-argonaut, dev + 0.48 argonaut + 0.56 bobtail + 0.61 cuttlefish + 0.67 dumpling + 0.72 emperor + 0.80 firefly + 0.87 giant + 0.94 hammer + 9.1.0 infernalis rc + 9.2.0 infernalis + 10.2.0 jewel + 11.2.0 kraken + 12.2.0 luminous + 13.2.0 mimic + 14.2.0 nautilus (to-be) + +osdmap: + +type / v / cv / ev / commit / version / date + +map / 1 / - / - / 017788a6ecb570038632de31904dd2e1314dc7b7 / 0.11 / 2009 +inc / 1 / - / - / + * initial +map / 2 / - / - / 020350e19a5dc03cd6cedd7494e434295580615f / 0.13 / 2009 +inc / 2 / - / - / + * pg_temp +map / 3 / - / - / 1ebcebf6fff056a0c0bdf82dde69356e271be27e / 0.19 / 2009 +inc / 3 / - / - / + * heartbeat_addr +map / 4 / - / - / 3ced5e7de243edeccfd20a90ec2034206c920795 / 0.19 / 2010 +inc / 4 / - / - / + * pools removed from map +map / 5 / - / 5 / c4892bed6f49df396df3cbf9ed561c7315bd2442 / 0.20 / 2010 +inc / 5 / - / 5 / + * pool names moved to first part of encoding + * adds CEPH_OSDMAP_INC_VERSION_EXT (for extended part of map) + * adds CEPH_OSDMAP_VERSION_EXT (for extended part of map) + * adds 'ev' (extended version) during encode() and decode +map / 5 / - / 5 / bc9cb9311f1b946898b5256eab500856fccf5c83 / 0.22 / 2010 +inc / 5 / - / 6 / + * separate up client/osd + * increments CEPH_OSDMAP_INC_VERSION_EXT to 6 + * CEPH_OSDMAP_INC_VERSION stays at 5 +map / 5 / - / 6 / 7f70112052c7fc3ba46f9e475fa575d85e8b16b2 / 0.22 / 2010 +inc / 5 / - / 6 / + * add osd_cluster_addr to full map + * increments CEPH_OSDMAP_VERSION_EXT to 6 + * CEPH_OSDMAP_VERSION stays at 5 +map / 5 / - / 7 / 2ced4e24aef64f2bc7d55b73abb888c124512eac / 0.28 / 2011 +inc / 5 / - / 7 / + * add cluster_snapshot field + * increments CEPH_OSDMAP_VERSION_EXT to 7 + * increments CEPH_OSDMAP_INC_VERSION_EXT to 7 + * CEPH_OSDMAP_INC_VERSION stays at 5 + * CEPH_OSDMAP_VERSION stays at 5 +map / 6 / - / 7 / d1ce410842ca51fad3aa100a52815a39e5fe6af6 / 0.35 / 2011 +inc / 6 / - / 7 / + * encode/decode old + new versions + * adds encode_client_old() (basically transitioning from uint32 to + uint64) + * bumps osdmap version to 6, old clients stay at 5 + * starts using in-function versions (i.e., _u16 v = 6) +map / 6 / - / 7 / b297d1edecaf31a48cff6c37df2ee266e51cdec1 / 0.38 / 2011 +inc / 6 / - / 7 / + * make encoding conditional based on features + * essentially checks whether features & CEPH_FEATURE_PGID64 and opts + to either use encode_client_old() or encode() +map / 6 / - / 7 / 0f0c59478894c9ca7fa04fc32e854648192a9fae / 0.38 / 2011 +inc / 6 / - / 7 / + * move stuff from osdmap.h to osdmap.cc +map / 6 / 8 / ca4311e5e39cec8fad85fad3e67eea968707e9eb / 0.47 / 2012 +inc / 6 / 8 / + * store uuid per osd + * bumps osdmap::incremental extended version to 8; in function + * bumps osdmap's extended version to 8; in function +map / 6 / - / 8 / 5125daa6d78e173a8dbc75723a8fdcd279a44bcd / 0.47 / 2012 +inc / 6 / - / 8 / + * drop defines + * drops defines for CEPH_OSDMAP_*_VERSION from rados.h +map / 6 / 9 / e9f051ef3c49a080b24d7811a16aefb64beacbbd / 0.53 / 2012 +inc / 6 / 9 / + * add osd_xinfo_t + * osdmap::incremental ext version bumped to 9 + * osdmap's ext version bumped to 9 + * because we're adding osd_xinfo_t to the map +map / 6 / - / 10 / 1fee4ccd5277b52292e255daf458330eef5f0255 / 0.64 / 2013 +inc / 6 / - / 10 / + * encode front hb addr + * osdmap::incremental ext version bumped to 10 + * osdmap's ext versiont bumped to 10 + * because we're adding osd_addrs->hb_front_addr to map + +// below we have the change to ENCODE_START() for osdmap and others +// this means client-usable data and extended osd data get to have their +// own ENCODE_START()'s, hence their versions start at 1 again. + +map / 7 / 1 / 1 / 3d7c69fb0986337dc72e466dc39d93e5ab406062 / 0.77 / 2014 +inc / 7 / 1 / 1 / b55c45e85dbd5d2513a4c56b3b74dcafd03f20b1 / 0.77 / 2014 + * introduces ENCODE_START() approach to osdmap, and the 'features' + argument we currently see in ::encode() functions + * same, but for osdmap::incremental +map / 7 / 1 / 1 / b9208b47745fdd53d36b682bebfc01e913347092 / 0.77 / 2014 +inc / 7 / 1 / 2 / + * include features argument in incremental. +map / 7 / 2 / 1 / cee914290c5540eb1fb9d70faac70a581381c29b / 0.78 / 2014 +inc / 7 / 2 / 2 / + * add osd_primary_affinity +map / 7 / 3 / 1 / c4f8f265955d54f33c79cde02c1ab2fe69ab1ab0 / 0.78 / 2014 +inc / 7 / 3 / 2 / + * add new/old erasure code profiles +map / 8 / 3 / 1 / 3dcf5b9636bb9e0cd6484d18f151b457e1a0c328 / 0.91 / 2014 +inc / 8 / 3 / 2 / + * encode crc +map / 8 / 3 / 1 / 04679c5451e353c966f6ed00b33fa97be8072a79 / 9.1.0 / 2015 +inc / 8 / 3 / 2 / + * simply ensures encode_features are filled to CEPH_FEATURE_PGID64 when + decoding an incremental if struct_v >= 6; else keeps it at zero. + * otherwise, if we get an incremental from hammer (which has + struct_v = 6) we would be decoding it as if it were a map from before + CEPH_FEATURES_PGID64 (which was introduced in 0.35, pre-argonaut) +map / 8 / 3 / 2 / 5c6b9d9dcd0a225e3a2b154c20a623868c269346 / 12.0.1 / 2017 +inc / 8 / 3 / 3 / + * add (near)full_ratio + * used to live in pgmap, moving to osdmap for luminous + * conditional on SERVER_LUMINOUS feature being present + * osdmap::incremental::encode(): conditional on ev >= 3 + * osdmap::incremental::decode(): conditional on ev >= 3, else -1 + * osdmap::encode(): conditional on ev >= 2 + * osdmap::decode(): conditional on ev >= 0, else 0 +map / 8 / 4 / 2 / 27d6f4373bafa24450f6dbb4e4252c2d9c2c1448 / 12.0.2 / 2017 +inc / 8 / 4 / 3 / + * add pg_remap and pg_remap_items + * first forces a pg to map to a particular value; second replaces + specific osds with specific other osds in crush mapping. + * inc conditional on SERVER_LUMINOUS feature being present + * osdmap::incremental::encode(): conditional on cv >= 4 + * osdmap::incremental::decode(): conditional on cv >= 4 + * map conditional on OSDMAP_REMAP feature being present + * osdmap::encode(): if not feature, cv = 3; encode on cv >= 4 + * osdmap::decode(): conditional on cv >= 4 +map / 8 / 4 / 3 / 27d6f4373bafa24450f6dbb4e4252c2d9c2c1448 / 12.0.2 / 2017 +inc / 8 / 4 / 4 / + * handle backfillfull_ratio like nearfull and full + * inc: + * osdmap::incremental::encode(): conditional on ev >= 3 + * osdmap::incremental::decode(): conditional on ev >= 4, else -1 + * map: + * osdmap::encode(): conditional on ev >= 2 + * osdmap::decode(): conditional on ev >= 3, else 0 +map / 8 / 4 / 3 / a1c66468232002c9f36033226f5db0a5751e8d18 / 12.0.3 / 2017 +inc / 8 / 4 / 4 / + * add require_min_compat_client field + * inc: + * osdmap::incremental::encode() conditional on ev >= 4 + * osdmap::incremental::decode() conditional on ev >= 4 + map: + * osdmap::encode() conditional on ev >= 3 + * osdmap::decode() conditional on ev >= 3 +map / 8 / 4 / 4 / 4a09e9431de3084b1ca98af11b28f822fde4ffbe / 12.0.3 / 2017 +inc / 8 / 4 / 5 / + * bumps encoding version for require_min_compat_client + * otherwise osdmap::decode() would throw exception when decoding + old maps + * inc: + * osdmap::incremental::encode() no conditional on ev >= 3 + * osdmap::incremental::decode() conditional on ev >= 5 + * map: + * osdmap::encode() conditional on ev >= 2 + * osdmap::decode() conditional on ev >= 4 +map / 8 / 4 / 5 / 3d4c4d9d9da07e1456331c43acc998d2008ca8ea / 12.1.0 / 2017 +inc / 8 / 4 / 6 / + * add require_osd_release numeric field + * new_require_min_compat_client: + * osdmap::incremental::encode() conditional on ev >= 5 + * osdmap::encode() conditional on ev >= 4 + * require_osd_release: + * osdmap::incremental::encode() conditional on ev >= 6 + * osdmap::incremental::decode() conditional on ev >= 6 (else, -1) + * osdmap::encode() conditional on ev >= 5 + * osdmap::decode() conditional on ev >= 5 (else, -1) +map / 8 / 4 / 5 / f22997e24bda4e6476e15d5d4ad9737861f9741f / 12.1.0 / 2017 +inc / 8 / 4 / 6 / + * switch (require_)min_compat_client to integers instead of strings + * osdmap::incremental::encode() conditional on ev >= 6 + * osdmap::incremental::decode(): + * if ev == 5, decode string and translate to release number + * if ev >= 6, decode integer + * osdmap::encode() conditional on ev >= 4 + * osdmap::decode(): + * if ev == 4, decode string and translate to release number + * if ev >= 5, decode integer +map / 8 / 4 / 6 / a8fb39b57884d96201fa502b17bc9395ec38c1b3 / 12.1.0 / 2017 +inc / 8 / 5 / 6 / + * make incremental's `new_state` 32 bits instead of 8 bits + * implies forcing 8 bits on + * osdmap::incremental::encode_client_old() + * osdmap::incremental::encode_classic() + * osdmap::incremental::decode_classic() + * osdmap::incremental::encode() conditional on cv >= 5, else force 8b. + * osdmap::incremental::decode() conditional on cv >= 5, else force 8b. +map / 8 / 5 / 6 / 3c1e58215bbb98f71aae30904f9010a57a58da81 / 12.1.0 / 2017 +inc / 8 / 5 / 6 / + * same as above +map / 8 / 6 / 6 / 48158ec579b708772fae82daaa6cb5dcaf5ac5dd / 12.1.0 / 2017 +inc / 8 / 5 / 6 / + * add crush_version + * osdmap::encode() conditional on cv >= 6 + * osdmap::decode() conditional on cv >= 6 +map / 8 / 7 / 6 / 553048fbf97af999783deb7e992c8ecfa5e55500 / 13.0.2 / 2017 +inc / 8 / 6 / 6 / + * track newly removed and purged snaps in each epoch + * new_removed_snaps + * new_purged_snaps + * osdmap::encode() conditional on cv >= 7 + * if SERVER_MIMIC not in features, cv = 6 + * osdmap::decode() conditional cv >= 7 +map / 8 / 8 / 6 / f99c2a9fec65ad3ce275ef24bd167ee03275d3d7 / 14.0.1 / 2018 +inc / 8 / 7 / 6 / + * fix pre-addrvec compat + * osdmap::encode() conditional on cv >= 8, else encode client addrs + one by one in a loop. + * osdmap::decode() just bumps version (?) +map / 8 / 8 / 7 / 9fb1e521c7c75c124b0dbf193e8b65ff1b5f461e / 14.0.1 / 2018 +inc / 8 / 7 / 7 / + * make cluster addrs into addrvecs too + * this will allow single-step upgrade from msgr1 to msgr2 +map / 8 / 9 / 7 / d414f0b43a69f3c2db8e454d795be881496237c6 / 14.0.1 / 2018 +inc / 8 / 8 / 7 / + * store last_up_change and last_in_change + * osdmap::encode() conditional on cv >= 9 + * osdmap::decode() conditional on cv >= 9 + + + +osd_info_t: +v / commit / version / date / reason + +1 / e574c84a6a0c5a5070dc72d5f5d3d17914ef824a / 0.19 / 2010 / add struct_v + +osd_xinfo_t: +v / commit / version / date + +1 / e9f051ef3c49a080b24d7811a16aefb64beacbbd / 0.53 / 2012 + * add osd_xinfo_t +2 / 31743d50a109a463d664ec9cf764d5405db507bd / 0.75 / 2013 + * add features bit mask to osd_xinfo_t +3 / 87722a42c286d4d12190b86b6d06d388e2953ba0 / 0.82 / 2014 + * remember osd weight when auto-marking osds out + +rados.h: +v / commit / version / date / reason + +- / 147c6f51e34a875ab65624df04baa8ef89296ddd / 0.19 / 2010 / move versions + 3 / CEPH_OSDMAP_INC_VERSION + 3 / CEPH_OSDMAP_VERSION + 2 / CEPH_PG_POOL_VERSION diff --git a/doc/dev/osd_internals/pg.rst b/doc/dev/osd_internals/pg.rst new file mode 100644 index 00000000..40553639 --- /dev/null +++ b/doc/dev/osd_internals/pg.rst @@ -0,0 +1,31 @@ +==== +PG +==== + +Concepts +-------- + +*Peering Interval* + See PG::start_peering_interval. + See PG::acting_up_affected + See PG::RecoveryState::Reset + + A peering interval is a maximal set of contiguous map epochs in which the + up and acting sets did not change. PG::RecoveryMachine represents a + transition from one interval to another as passing through + RecoveryState::Reset. On PG::RecoveryState::AdvMap PG::acting_up_affected can + cause the pg to transition to Reset. + + +Peering Details and Gotchas +--------------------------- +For an overview of peering, see `Peering <../../peering>`_. + + * PG::flushed defaults to false and is set to false in + PG::start_peering_interval. Upon transitioning to PG::RecoveryState::Started + we send a transaction through the pg op sequencer which, upon complete, + sends a FlushedEvt which sets flushed to true. The primary cannot go + active until this happens (See PG::RecoveryState::WaitFlushedPeering). + Replicas can go active but cannot serve ops (writes or reads). + This is necessary because we cannot read our ondisk state until unstable + transactions from the previous interval have cleared. diff --git a/doc/dev/osd_internals/pg_removal.rst b/doc/dev/osd_internals/pg_removal.rst new file mode 100644 index 00000000..c5fe0e1a --- /dev/null +++ b/doc/dev/osd_internals/pg_removal.rst @@ -0,0 +1,56 @@ +========== +PG Removal +========== + +See OSD::_remove_pg, OSD::RemoveWQ + +There are two ways for a pg to be removed from an OSD: + + 1. MOSDPGRemove from the primary + 2. OSD::advance_map finds that the pool has been removed + +In either case, our general strategy for removing the pg is to +atomically set the metadata objects (pg->log_oid, pg->biginfo_oid) to +backfill and asynchronously remove the pg collections. We do not do +this inline because scanning the collections to remove the objects is +an expensive operation. + +OSDService::deleting_pgs tracks all pgs in the process of being +deleted. Each DeletingState object in deleting_pgs lives while at +least one reference to it remains. Each item in RemoveWQ carries a +reference to the DeletingState for the relevant pg such that +deleting_pgs.lookup(pgid) will return a null ref only if there are no +collections currently being deleted for that pg. + +The DeletingState for a pg also carries information about the status +of the current deletion and allows the deletion to be cancelled. +The possible states are: + + 1. QUEUED: the PG is in the RemoveWQ + 2. CLEARING_DIR: the PG's contents are being removed synchronously + 3. DELETING_DIR: the PG's directories and metadata being queued for removal + 4. DELETED_DIR: the final removal transaction has been queued + 5. CANCELED: the deletion has been cancelled + +In 1 and 2, the deletion can be cancelled. Each state transition +method (and check_canceled) returns false if deletion has been +cancelled and true if the state transition was successful. Similarly, +try_stop_deletion() returns true if it succeeds in cancelling the +deletion. Additionally, try_stop_deletion() in the event that it +fails to stop the deletion will not return until the final removal +transaction is queued. This ensures that any operations queued after +that point will be ordered after the pg deletion. + +OSD::_create_lock_pg must handle two cases: + + 1. Either there is no DeletingStateRef for the pg, or it failed to cancel + 2. We succeeded in cancelling the deletion. + +In case 1., we proceed as if there were no deletion occurring, except that +we avoid writing to the PG until the deletion finishes. In case 2., we +proceed as in case 1., except that we first mark the PG as backfilling. + +Similarly, OSD::osr_registry ensures that the OpSequencers for those +pgs can be reused for a new pg if created before the old one is fully +removed, ensuring that operations on the new pg are sequenced properly +with respect to operations on the old one. diff --git a/doc/dev/osd_internals/pgpool.rst b/doc/dev/osd_internals/pgpool.rst new file mode 100644 index 00000000..45a252bd --- /dev/null +++ b/doc/dev/osd_internals/pgpool.rst @@ -0,0 +1,22 @@ +================== +PGPool +================== + +PGPool is a structure used to manage and update the status of removed +snapshots. It does this by maintaining two fields, cached_removed_snaps - the +current removed snap set and newly_removed_snaps - newly removed snaps in the +last epoch. In OSD::load_pgs the osd map is recovered from the pg's file store +and passed down to OSD::_get_pool where a PGPool object is initialised with the +map. + +With each new map we receive we call PGPool::update with the new map. In that +function we build a list of newly removed snaps +(pg_pool_t::build_removed_snaps) and merge that with our cached_removed_snaps. +This function included checks to make sure we only do this update when things +have changed or there has been a map gap. + +When we activate the pg we initialise the snap trim queue from +cached_removed_snaps and subtract the purged_snaps we have already purged +leaving us with the list of snaps that need to be trimmed. Trimming is later +performed asynchronously by the snap_trim_wq. + diff --git a/doc/dev/osd_internals/recovery_reservation.rst b/doc/dev/osd_internals/recovery_reservation.rst new file mode 100644 index 00000000..a24ac1b1 --- /dev/null +++ b/doc/dev/osd_internals/recovery_reservation.rst @@ -0,0 +1,83 @@ +==================== +Recovery Reservation +==================== + +Recovery reservation extends and subsumes backfill reservation. The +reservation system from backfill recovery is used for local and remote +reservations. + +When a PG goes active, first it determines what type of recovery is +necessary, if any. It may need log-based recovery, backfill recovery, +both, or neither. + +In log-based recovery, the primary first acquires a local reservation +from the OSDService's local_reserver. Then a MRemoteReservationRequest +message is sent to each replica in order of OSD number. These requests +will always be granted (i.e., cannot be rejected), but they may take +some time to be granted if the remotes have already granted all their +remote reservation slots. + +After all reservations are acquired, log-based recovery proceeds as it +would without the reservation system. + +After log-based recovery completes, the primary releases all remote +reservations. The local reservation remains held. The primary then +determines whether backfill is necessary. If it is not necessary, the +primary releases its local reservation and waits in the Recovered state +for all OSDs to indicate that they are clean. + +If backfill recovery occurs after log-based recovery, the local +reservation does not need to be reacquired since it is still held from +before. If it occurs immediately after activation (log-based recovery +not possible/necessary), the local reservation is acquired according to +the typical process. + +Once the primary has its local reservation, it requests a remote +reservation from the backfill target. This reservation CAN be rejected, +for instance if the OSD is too full (backfillfull_ratio osd setting). +If the reservation is rejected, the primary drops its local +reservation, waits (osd_backfill_retry_interval), and then retries. It +will retry indefinitely. + +Once the primary has the local and remote reservations, backfill +proceeds as usual. After backfill completes the remote reservation is +dropped. + +Finally, after backfill (or log-based recovery if backfill was not +necessary), the primary drops the local reservation and enters the +Recovered state. Once all the PGs have reported they are clean, the +primary enters the Clean state and marks itself active+clean. + +----------------- +Dump Reservations +----------------- + +An OSD daemon command dumps total local and remote reservations:: + + ceph daemon osd.<id> dump_recovery_reservations + + +-------------- +Things to Note +-------------- + +We always grab the local reservation first, to prevent a circular +dependency. We grab remote reservations in order of OSD number for the +same reason. + +The recovery reservation state chart controls the PG state as reported +to the monitor. The state chart can set: + + - recovery_wait: waiting for local/remote reservations + - recovering: recovering + - recovery_toofull: recovery stopped, OSD(s) above full ratio + - backfill_wait: waiting for remote backfill reservations + - backfilling: backfilling + - backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio + + +-------- +See Also +-------- + +The Active substate of the automatically generated OSD state diagram. diff --git a/doc/dev/osd_internals/scrub.rst b/doc/dev/osd_internals/scrub.rst new file mode 100644 index 00000000..f20dc042 --- /dev/null +++ b/doc/dev/osd_internals/scrub.rst @@ -0,0 +1,41 @@ + +Scrub internals and diagnostics +=============================== + +Scrubbing Behavior Table +------------------------ + ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Flags | none | noscrub | nodeep_scrub | noscrub/nodeep_scrub | ++=================================================+==========+===========+===============+======================+ +| Periodic tick | S | X | S | X | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Periodic tick after osd_deep_scrub_interval | D | D | S | X | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Initiated scrub | S | S | S | S | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Initiated scrub after osd_deep_scrub_interval | D | D | S | S | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Initiated deep scrub | D | D | D | D | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ + +- X = Do nothing +- S = Do regular scrub +- D = Do deep scrub + +State variables +--------------- + +- Periodic tick state is !must_scrub && !must_deep_scrub && !time_for_deep +- Periodic tick after osd_deep_scrub_interval state is !must_scrub && !must_deep_scrub && time_for_deep +- Initiated scrub state is must_scrub && !must_deep_scrub && !time_for_deep +- Initiated scrub after osd_deep_scrub_interval state is must scrub && !must_deep_scrub && time_for_deep +- Initiated deep scrub state is must_scrub && must_deep_scrub + +Scrub Reservations +------------------ + +An OSD daemon command dumps total local and remote reservations:: + + ceph daemon osd.<id> dump_scrub_reservations + diff --git a/doc/dev/osd_internals/snaps.rst b/doc/dev/osd_internals/snaps.rst new file mode 100644 index 00000000..ed9538ba --- /dev/null +++ b/doc/dev/osd_internals/snaps.rst @@ -0,0 +1,128 @@ +====== +Snaps +====== + +Overview +-------- +Rados supports two related snapshotting mechanisms: + + 1. *pool snaps*: snapshots are implicitly applied to all objects + in a pool + 2. *self managed snaps*: the user must provide the current *SnapContext* + on each write. + +These two are mutually exclusive, only one or the other can be used on +a particular pool. + +The *SnapContext* is the set of snapshots currently defined for an object +as well as the most recent snapshot (the *seq*) requested from the mon for +sequencing purposes (a *SnapContext* with a newer *seq* is considered to +be more recent). + +The difference between *pool snaps* and *self managed snaps* from the +OSD's point of view lies in whether the *SnapContext* comes to the OSD +via the client's MOSDOp or via the most recent OSDMap. + +See OSD::make_writeable + +Ondisk Structures +----------------- +Each object has in the pg collection a *head* object (or *snapdir*, which we +will come to shortly) and possibly a set of *clone* objects. +Each hobject_t has a snap field. For the *head* (the only writeable version +of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the +snap field is set to the *seq* of the *SnapContext* at their creation. +When the OSD services a write, it first checks whether the most recent +*clone* is tagged with a snapid prior to the most recent snap represented +in the *SnapContext*. If so, at least one snapshot has occurred between +the time of the write and the time of the last clone. Therefore, prior +to performing the mutation, the OSD creates a new clone for servicing +reads on snaps between the snapid of the last clone and the most recent +snapid. + +The *head* object contains a *SnapSet* encoded in an attribute, which tracks + + 1. The full set of snaps defined for the object + 2. The full set of clones which currently exist + 3. Overlapping intervals between clones for tracking space usage + 4. Clone size + +If the *head* is deleted while there are still clones, a *snapdir* object +is created instead to house the *SnapSet*. + +Additionally, the *object_info_t* on each clone includes a vector of snaps +for which clone is defined. + +Snap Removal +------------ +To remove a snapshot, a request is made to the *Monitor* cluster to +add the snapshot id to the list of purged snaps (or to remove it from +the set of pool snaps in the case of *pool snaps*). In either case, +the *PG* adds the snap to its *snap_trimq* for trimming. + +A clone can be removed when all of its snaps have been removed. In +order to determine which clones might need to be removed upon snap +removal, we maintain a mapping from snap to *hobject_t* using the +*SnapMapper*. + +See PrimaryLogPG::SnapTrimmer, SnapMapper + +This trimming is performed asynchronously by the snap_trim_wq while the +pg is clean and not scrubbing. + + #. The next snap in PG::snap_trimq is selected for trimming + #. We determine the next object for trimming out of PG::snap_mapper. + For each object, we create a log entry and repop updating the + object info and the snap set (including adjusting the overlaps). + If the object is a clone which no longer belongs to any live snapshots, + it is removed here. (See PrimaryLogPG::trim_object() when new_snaps + is empty.) + #. We also locally update our *SnapMapper* instance with the object's + new snaps. + #. The log entry containing the modification of the object also + contains the new set of snaps, which the replica uses to update + its own *SnapMapper* instance. + #. The primary shares the info with the replica, which persists + the new set of purged_snaps along with the rest of the info. + + + +Recovery +-------- +Because the trim operations are implemented using repops and log entries, +normal pg peering and recovery maintain the snap trimmer operations with +the caveat that push and removal operations need to update the local +*SnapMapper* instance. If the purged_snaps update is lost, we merely +retrim a now empty snap. + +SnapMapper +---------- +*SnapMapper* is implemented on top of map_cacher<string, bufferlist>, +which provides an interface over a backing store such as the filesystem +with async transactions. While transactions are incomplete, the map_cacher +instance buffers unstable keys allowing consistent access without having +to flush the filestore. *SnapMapper* provides two mappings: + + 1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone + object + 2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot + as one of its snaps + +Assumption: there are lots of hobjects and relatively few snaps. The +first encoding has a stringification of the object as the key and an +encoding of the set of snaps as a value. The second mapping, because there +might be many hobjects for a single snap, is stored as a collection of keys +of the form stringify(snap)_stringify(object) such that stringify(snap) +is constant length. These keys have a bufferlist encoding +pair<snapid, hobject_t> as a value. Thus, creating or trimming a single +object does not involve reading all objects for any snap. Additionally, +upon construction, the *SnapMapper* is provided with a mask for filtering +the objects in the single SnapMapper keyspace belonging to that pg. + +Split +----- +The snapid_t -> hobject_t key entries are arranged such that for any pg, +up to 8 prefixes need to be checked to determine all hobjects in a particular +snap for a particular pg. Upon split, the prefixes to check on the parent +are adjusted such that only the objects remaining in the pg will be visible. +The children will immediately have the correct mapping. diff --git a/doc/dev/osd_internals/watch_notify.rst b/doc/dev/osd_internals/watch_notify.rst new file mode 100644 index 00000000..8c2ce09b --- /dev/null +++ b/doc/dev/osd_internals/watch_notify.rst @@ -0,0 +1,81 @@ +============ +Watch Notify +============ + +See librados for the watch/notify interface. + +Overview +-------- +The object_info (See osd/osd_types.h) tracks the set of watchers for +a particular object persistently in the object_info_t::watchers map. +In order to track notify progress, we also maintain some ephemeral +structures associated with the ObjectContext. + +Each Watch has an associated Watch object (See osd/Watch.h). The +ObjectContext for a watched object will have a (strong) reference +to one Watch object per watch, and each Watch object holds a +reference to the corresponding ObjectContext. This circular reference +is deliberate and is broken when the Watch state is discarded on +a new peering interval or removed upon timeout expiration or an +unwatch operation. + +A watch tracks the associated connection via a strong +ConnectionRef Watch::conn. The associated connection has a +WatchConState stashed in the OSD::Session for tracking associated +Watches in order to be able to notify them upon ms_handle_reset() +(via WatchConState::reset()). + +Each Watch object tracks the set of currently un-acked notifies. +start_notify() on a Watch object adds a reference to a new in-progress +Notify to the Watch and either: + +* if the Watch is *connected*, sends a Notify message to the client +* if the Watch is *unconnected*, does nothing. + +When the Watch becomes connected (in PrimaryLogPG::do_osd_op_effects), +Notifies are resent to all remaining tracked Notify objects. + +Each Notify object tracks the set of un-notified Watchers via +calls to complete_watcher(). Once the remaining set is empty or the +timeout expires (cb, registered in init()) a notify completion +is sent to the client. + +Watch Lifecycle +--------------- +A watch may be in one of 5 states: + +1. Non existent. +2. On disk, but not registered with an object context. +3. Connected +4. Disconnected, callback registered with timer +5. Disconnected, callback in queue for scrub or is_degraded + +Case 2 occurs between when an OSD goes active and the ObjectContext +for an object with watchers is loaded into memory due to an access. +During Case 2, no state is registered for the watch. Case 2 +transitions to Case 4 in PrimaryLogPG::populate_obc_watchers() during +PrimaryLogPG::find_object_context. Case 1 becomes case 3 via +OSD::do_osd_op_effects due to a watch operation. Case 4,5 become case +3 in the same way. Case 3 becomes case 4 when the connection resets +on a watcher's session. + +Cases 4&5 can use some explanation. Normally, when a Watch enters Case +4, a callback is registered with the OSDService::watch_timer to be +called at timeout expiration. At the time that the callback is +called, however, the pg might be in a state where it cannot write +to the object in order to remove the watch (i.e., during a scrub +or while the object is degraded). In that case, we use +Watch::get_delayed_cb() to generate another Context for use from +the callbacks_for_degraded_object and Scrubber::callbacks lists. +In either case, Watch::unregister_cb() does the right thing +(SafeTimer::cancel_event() is harmless for contexts not registered +with the timer). + +Notify Lifecycle +---------------- +The notify timeout is simpler: a timeout callback is registered when +the notify is init()'d. If all watchers ack notifies before the +timeout occurs, the timeout is canceled and the client is notified +of the notify completion. Otherwise, the timeout fires, the Notify +object pings each Watch via cancel_notify to remove itself, and +sends the notify completion to the client early. diff --git a/doc/dev/osd_internals/wbthrottle.rst b/doc/dev/osd_internals/wbthrottle.rst new file mode 100644 index 00000000..9b67efbb --- /dev/null +++ b/doc/dev/osd_internals/wbthrottle.rst @@ -0,0 +1,28 @@ +================== +Writeback Throttle +================== + +Previously, the filestore had a problem when handling large numbers of +small ios. We throttle dirty data implicitly via the journal, but +a large number of inodes can be dirtied without filling the journal +resulting in a very long sync time when the sync finally does happen. +The flusher was not an adequate solution to this problem since it +forced writeback of small writes too eagerly killing performance. + +WBThrottle tracks unflushed io per hobject_t and ::fsyncs in lru +order once the start_flusher threshold is exceeded for any of +dirty bytes, dirty ios, or dirty inodes. While any of these exceed +the hard_limit, we block on throttle() in _do_op. + +See src/os/WBThrottle.h, src/osd/WBThrottle.cc + +To track the open FDs through the writeback process, there is now an +fdcache to cache open fds. lfn_open now returns a cached FDRef which +implicitly closes the fd once all references have expired. + +Filestore syncs have a sideeffect of flushing all outstanding objects +in the wbthrottle. + +lfn_unlink clears the cached FDRef and wbthrottle entries for the +unlinked object when the last link is removed and asserts that all +outstanding FDRefs for that object are dead. |