Diffstat (limited to 'doc/dev/osd_internals')
27 files changed, 3779 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/async_recovery.rst b/doc/dev/osd_internals/async_recovery.rst new file mode 100644 index 000000000..aea5b70db --- /dev/null +++ b/doc/dev/osd_internals/async_recovery.rst @@ -0,0 +1,53 @@ +=====================
+Asynchronous Recovery
+=====================
+
+Ceph Placement Groups (PGs) maintain a log of write transactions to
+facilitate speedy recovery of data. During recovery, each of these PG logs
+is used to determine which content in each OSD is missing or outdated.
+This obviates the need to scan all RADOS objects.
+See :ref:`Log Based PG <log-based-pg>` for more details on this process.
+
+Prior to the Nautilus release, this recovery process was synchronous: it
+blocked writes to a RADOS object until it was recovered. In contrast,
+backfill could allow writes to proceed (assuming enough up-to-date replicas
+were available) by temporarily assigning a different acting set, and
+backfilling an OSD outside of the acting set. In some circumstances
+this ends up being significantly better for availability, e.g. if the
+PG log contains 3000 writes to disjoint objects. When the PG log contains
+thousands of entries, it could actually be faster (though not as safe) to
+trade backfill for recovery by deleting and redeploying the containing
+OSD than to iterate through the PG log. Recovering several megabytes
+of RADOS object data (or even worse, several megabytes of omap keys,
+notably RGW bucket indexes) can drastically increase latency for a small
+update, and combined with requests spread across many degraded objects
+it is a recipe for slow requests.
+
+To avoid this we can perform recovery in the background on an OSD
+out-of-band of the live acting set, similar to backfill, but still using
+the PG log to determine what needs to be done. This is known as *asynchronous
+recovery*.
+
+The threshold for performing asynchronous recovery instead of synchronous
+recovery is not clear-cut.
There are a few criteria which +need to be met for asynchronous recovery: + +* Try to keep ``min_size`` replicas available +* Use the approximate magnitude of the difference in length of + logs combined with historical missing objects to estimate the cost of + recovery +* Use the parameter ``osd_async_recovery_min_cost`` to determine + when asynchronous recovery is appropriate + +With the existing peering process, when we choose the acting set we +have not fetched the PG log from each peer; we have only the bounds of +it and other metadata from their ``pg_info_t``. It would be more expensive +to fetch and examine every log at this point, so we only consider an +approximate check for log length for now. In Nautilus, we improved +the accounting of missing objects, so post-Nautilus this information +is also used to determine the cost of recovery. + +While async recovery is occurring, writes to members of the acting set +may proceed, but we need to send their log entries to the async +recovery targets (just like we do for backfill OSDs) so that they +can completely catch up. diff --git a/doc/dev/osd_internals/backfill_reservation.rst b/doc/dev/osd_internals/backfill_reservation.rst new file mode 100644 index 000000000..3c380dcf6 --- /dev/null +++ b/doc/dev/osd_internals/backfill_reservation.rst @@ -0,0 +1,93 @@ +==================== +Backfill Reservation +==================== + +When a new OSD joins a cluster all PGs with it in their acting sets must +eventually backfill. If all of these backfills happen simultaneously +they will present excessive load on the OSD: the "thundering herd" +effect. + +The ``osd_max_backfills`` tunable limits the number of outgoing or +incoming backfills that are active on a given OSD. Note that this limit is +applied separately to incoming and to outgoing backfill operations. +Thus there can be as many as ``osd_max_backfills * 2`` backfill operations +in flight on each OSD. 
This subtlety is often missed, and Ceph
+operators can be puzzled as to why more ops are observed than expected.
+
+Each ``OSDService`` now has two AsyncReserver instances: one for backfills going
+from the OSD (``local_reserver``) and one for backfills going to the OSD
+(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
+manages a priority-ordered queue of waiting items and a set of current
+reservation holders. When a slot frees up, the ``AsyncReserver`` queues the
+``Context*`` associated with the next item on the highest priority queue in
+the finisher provided to the constructor.
+
+For a primary to initiate a backfill, it must first obtain a reservation from
+its own ``local_reserver``. Then it must obtain a reservation from the backfill
+target's ``remote_reserver`` via an ``MBackfillReserve`` message. This process is
+managed by sub-states of ``Active`` and ``ReplicaActive`` (see the sub-states
+of ``Active`` in PG.h). The reservations are dropped either on the ``Backfilled``
+event, which is sent on the primary before calling ``recovery_complete``
+and on the replica on receipt of the ``BackfillComplete`` progress message,
+or upon leaving ``Active`` or ``ReplicaActive``.
+
+It's important to always grab the local reservation before the remote
+reservation in order to prevent a circular dependency.
+
+We minimize the risk of data loss by prioritizing the order in
+which PGs are recovered. Admins can override the default order by using
+``force-recovery`` or ``force-backfill``. A ``force-recovery`` with op
+priority ``255`` will start before a ``force-backfill`` op at priority ``254``.
+
+If recovery is needed because a PG is below ``min_size``, a base priority of
+``220`` is used. This is incremented by the number of OSDs short of the pool's
+``min_size`` as well as a value relative to the pool's ``recovery_priority``.
+The resultant priority is capped at ``253`` so that it does not confound forced
+ops as described above.
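To make the priority arithmetic concrete, here is a toy Python sketch of the inactive-recovery rule just described. The constant and function names are invented for illustration (they are not the actual Ceph code), and the pool's ``recovery_priority`` is added directly here for simplicity, whereas the real implementation derives a bounded adjustment from it:

```python
# Toy sketch of the inactive-recovery priority rule (names invented).
OSD_RECOVERY_INACTIVE_PRIORITY_BASE = 220  # base when PG is below min_size
OSD_FORCED_BACKFILL_PRIORITY = 254         # forced ops must always rank higher

def inactive_recovery_priority(osds_short_of_min_size: int,
                               pool_recovery_priority: int) -> int:
    """Base 220, incremented by how far below min_size the PG is and by
    a pool-relative value, capped at 253 so forced ops still win."""
    prio = (OSD_RECOVERY_INACTIVE_PRIORITY_BASE
            + osds_short_of_min_size
            + pool_recovery_priority)
    return min(prio, 253)
```

Note how the cap at ``253`` guarantees that even a heavily boosted op never outranks a ``force-backfill`` (``254``) or ``force-recovery`` (``255``).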
Under ordinary circumstances a recovery op is +prioritized at ``180`` plus a value relative to the pool's ``recovery_priority``. +The resultant priority is capped at ``219``. + +If backfill is needed because the number of acting OSDs is less than +the pool's ``min_size``, a priority of ``220`` is used. The number of OSDs +short of the pool's ``min_size`` is added as well as a value relative to +the pool's ``recovery_priority``. The total priority is limited to ``253``. + +If backfill is needed because a PG is undersized, +a priority of ``140`` is used. The number of OSDs below the size of the pool is +added as well as a value relative to the pool's ``recovery_priority``. The +resultant priority is capped at ``179``. If a backfill op is +needed because a PG is degraded, a priority of ``140`` is used. A value +relative to the pool's ``recovery_priority`` is added. The resultant priority +is capped at ``179`` . Under ordinary circumstances a +backfill op priority of ``100`` is used. A value relative to the pool's +``recovery_priority`` is added. The total priority is capped at ``139``. + +.. list-table:: Backfill and Recovery op priorities + :widths: 20 20 20 + :header-rows: 1 + + * - Description + - Base priority + - Maximum priority + * - Backfill + - 100 + - 139 + * - Degraded Backfill + - 140 + - 179 + * - Recovery + - 180 + - 219 + * - Inactive Recovery + - 220 + - 253 + * - Inactive Backfill + - 220 + - 253 + * - force-backfill + - 254 + - + * - force-recovery + - 255 + - + diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst new file mode 100644 index 000000000..40064961b --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding.rst @@ -0,0 +1,87 @@ +============================== +Erasure Coded Placement Groups +============================== + +Glossary +-------- + +*chunk* + When the encoding function is called, it returns chunks of the same + size as each other. 
There are two kinds of chunks: (1) *data
+  chunks*, which can be concatenated to reconstruct the original
+  object, and (2) *coding chunks*, which can be used to rebuild a
+  lost chunk.
+
+*chunk rank*
+  The index of a chunk, as determined by the encoding function. The
+  rank of the first chunk is 0, the rank of the second chunk is 1,
+  and so on.
+
+*K*
+  The number of data chunks into which an object is divided. For
+  example, if *K* = 2, then a 10KB object is divided into two chunks
+  of 5KB each.
+
+*M*
+  The number of coding chunks computed by the encoding function. *M*
+  is equal to the number of OSDs that can be missing from the cluster
+  without the cluster suffering data loss. For example, if there are
+  two coding chunks, then two OSDs can be missing without data loss.
+
+*N*
+  The number of data chunks plus the number of coding chunks: that
+  is, *K* + *M*.
+
+*rate*
+  The proportion of the total chunks containing useful information:
+  that is, *K* divided by *N*. For example, suppose that *K* = 9 and
+  *M* = 3. This would mean that *N* = 12 (because *K* + *M* = 9 + 3).
+  Therefore, the *rate* (*K* / *N*) would be 9 / 12 = 0.75. In other
+  words, 75% of the chunks would contain useful information.
+
+*shard* (also called *strip*)
+  An ordered sequence of chunks of the same rank from the same object. For a
+  given placement group, each OSD contains shards of the same rank. In the
+  special case in which an object is encoded with only one call to the
+  encoding function, the term *chunk* may be used instead of *shard* because
+  the shard is made of a single chunk. The chunks in a shard are ordered
+  according to the rank of the stripe (see *stripe* below) they belong to.
+
+
+*stripe*
+  If an object is so large that encoding it requires more than one
+  call to the encoding function, each of these calls creates a set of
+  chunks called a *stripe*.
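A small Python sketch of the *K*/*M*/*N*/*rate* arithmetic defined above (purely illustrative; these helpers do not exist in Ceph):

```python
def n_and_rate(k: int, m: int):
    """N = K + M; rate = K / N (the proportion of chunks holding data)."""
    n = k + m
    return n, k / n

def chunk_size(object_size: int, k: int) -> int:
    """Size of each data chunk when an object is divided into K chunks.
    Padding is ignored in this toy helper, so sizes must divide evenly."""
    assert object_size % k == 0
    return object_size // k
```

For the glossary's examples: ``n_and_rate(9, 3)`` gives ``(12, 0.75)``, and a 10KB object with *K* = 2 yields two 5KB chunks.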
+ +The definitions are illustrated as follows (PG stands for placement group): +:: + + OSD 40 OSD 33 + +-------------------------+ +-------------------------+ + | shard 0 - PG 10 | | shard 1 - PG 10 | + |+------ object O -------+| |+------ object O -------+| + ||+---------------------+|| ||+---------------------+|| + stripe||| chunk 0 ||| ||| chunk 1 ||| ... + 0 ||| stripe 0 ||| ||| stripe 0 ||| + ||+---------------------+|| ||+---------------------+|| + ||+---------------------+|| ||+---------------------+|| + stripe||| chunk 0 ||| ||| chunk 1 ||| ... + 1 ||| stripe 1 ||| ||| stripe 1 ||| + ||+---------------------+|| ||+---------------------+|| + ||+---------------------+|| ||+---------------------+|| + stripe||| chunk 0 ||| ||| chunk 1 ||| ... + 2 ||| stripe 2 ||| ||| stripe 2 ||| + ||+---------------------+|| ||+---------------------+|| + |+-----------------------+| |+-----------------------+| + | ... | | ... | + +-------------------------+ +-------------------------+ + +Table of contents +----------------- + +.. toctree:: + :maxdepth: 1 + + Developer notes <erasure_coding/developer_notes> + Jerasure plugin <erasure_coding/jerasure> + High level design document <erasure_coding/ecbackend> diff --git a/doc/dev/osd_internals/erasure_coding/developer_notes.rst b/doc/dev/osd_internals/erasure_coding/developer_notes.rst new file mode 100644 index 000000000..586b4b71b --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/developer_notes.rst @@ -0,0 +1,223 @@ +============================ +Erasure Code developer notes +============================ + +Introduction +------------ + +Each chapter of this document explains an aspect of the implementation +of the erasure code within Ceph. It is mostly based on examples being +explained to demonstrate how things work. + +Reading and writing encoded chunks from and to OSDs +--------------------------------------------------- + +An erasure coded pool stores each object as K+M chunks. 
It is divided
+into K data chunks and M coding chunks. The pool is configured to have
+a size of K+M so that each chunk is stored in an OSD in the acting
+set. The rank of the chunk is stored as an attribute of the object.
+
+Let's say an erasure coded pool is created to use five OSDs (K+M = 5)
+and sustain the loss of two of them (M = 2).
+
+When the object *NYAN* containing *ABCDEFGHI* is written to it, the
+erasure encoding function splits the content into three data chunks,
+simply by dividing the content into thirds: the first contains *ABC*,
+the second *DEF* and the last *GHI*. The content will be padded if the
+content length is not a multiple of K. The function also creates two
+coding chunks: the fourth with *YXY* and the fifth with *QGC*. Each
+chunk is stored in an OSD in the acting set. The chunks are stored in
+objects that have the same name (*NYAN*) but reside on different
+OSDs. The order in which the chunks were created must be preserved and
+is stored as an attribute of the object (shard_t), in addition to its
+name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4*
+contains *YXY* and is stored on *OSD3*.
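The split described above can be sketched in Python. This is a toy illustration only: the coding chunks are omitted because the real plugin computes them with an erasure code such as Reed-Solomon, and the 1-based keys simply mirror the chunk numbering used in this example:

```python
def split_into_data_chunks(content: bytes, k: int) -> dict:
    """Divide object content into K data chunks, padding with zero bytes
    if the length is not a multiple of K; keys are 1-based chunk ranks."""
    chunk_len = -(-len(content) // k)            # ceil(len / K)
    padded = content.ljust(chunk_len * k, b'\0')
    return {rank + 1: padded[rank * chunk_len:(rank + 1) * chunk_len]
            for rank in range(k)}

# The NYAN example: K = 3 data chunks out of ABCDEFGHI.
chunks = split_into_data_chunks(b"ABCDEFGHI", 3)
```

Here ``chunks`` maps rank 1 to ``ABC``, rank 2 to ``DEF``, and rank 3 to ``GHI``, matching the figure below.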
+ +:: + + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | | | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | +------+ + +----------------->| OSD5 | + +------+ + + + + +When the object *NYAN* is read from the erasure coded pool, the +decoding function reads three chunks : chunk *1* containing *ABC*, +chunk *3* containing *GHI* and chunk *4* containing *YXY* and rebuild +the original content of the object *ABCDEFGHI*. The decoding function +is informed that the chunks *2* and *5* are missing ( they are called +*erasures* ). The chunk *5* could not be read because the *OSD4* is +*out*. + +The decoding function could be called as soon as three chunks are +read : *OSD2* was the slowest and its chunk does not need to be taken into +account. This optimization is not implemented in Firefly. 
+
+::
+
+    +-------------------+
+    name | NYAN |
+    +-------------------+
+    content | ABCDEFGHI |
+    +--------+----------+
+             ^
+             |
+             |
+    +------+------+
+    | decode(3,2) |
+    | erasures 2,5|
+    +-------------->| |
+    | +-------------+
+    | ^ ^
+    | | +-----+
+    | | |
+    +--+---+ +------+ +--+---+ +--+---+
+    name | NYAN | | NYAN | | NYAN | | NYAN |
+    +------+ +------+ +------+ +------+
+    shard | 1 | | 2 | | 3 | | 4 |
+    +------+ +------+ +------+ +------+
+    content | ABC | | DEF | | GHI | | YXY |
+    +--+---+ +--+---+ +--+---+ +--+---+
+    ^ . ^ ^
+    | TOO . | |
+    | SLOW . +--+---+ |
+    | ^ | OSD1 | |
+    | | +------+ |
+    | | +------+ |
+    | +-------| OSD2 | |
+    | +------+ |
+    | +------+ |
+    | | OSD3 |-----+
+    | +------+
+    | +------+
+    | | OSD4 | OUT
+    | +------+
+    | +------+
+    +------------------| OSD5 |
+    +------+
+
+
+Erasure code library
+--------------------
+
+Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_,
+with parameters K+M, object O is encoded by dividing it into chunks O1,
+O2, ... OK and computing coding chunks P1, P2, ... PM. Any K chunks
+out of the available K+M chunks can be used to obtain the original
+object. If data chunk O2 or coding chunk P2 is lost, it can be
+repaired using any K chunks out of the K+M chunks. If more than M
+chunks are lost, it is not possible to recover the object.
+
+Reading the original content of object O can be a simple
+concatenation of O1, O2, ... OK, because the plugins are using
+`systematic codes
+<https://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks
+must be given to the erasure code library *decode* method to retrieve
+the content of the object.
+
+Performance depends on the parameters to the encoding functions and
+is also influenced by the packet sizes used when calling the encoding
+functions (for Cauchy or Liberation, for instance): smaller packets
+mean more calls and more overhead.
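To make the "any K chunks out of K+M" property concrete, here is a toy systematic code in Python with M = 1, using a single XOR parity chunk. Ceph's actual plugins implement Reed-Solomon, which generalizes this to M > 1; the names here are invented for illustration:

```python
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_chunks):
    """Systematic encode: the K data chunks pass through unchanged and
    one XOR parity chunk is appended (K = len(data_chunks), M = 1)."""
    return list(data_chunks) + [reduce(_xor, data_chunks)]

def decode(chunks, erased_rank):
    """Rebuild a single erased chunk by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != erased_rank]
    repaired = list(chunks)
    repaired[erased_rank] = reduce(_xor, survivors)
    return repaired

stripe = encode([b"ABC", b"DEF", b"GHI"])   # K=3 data chunks + 1 parity
damaged = stripe[:1] + [None] + stripe[2:]  # chunk of rank 1 erased
```

Because the code is systematic, reading the object is just the concatenation of the data chunks; the decode step is only needed when an erasure occurs.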
+
+Although Reed-Solomon is provided as a default, Ceph uses it via an
+`abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_ designed to
+allow each pool to choose the plugin that implements it using
+key=value pairs stored in an `erasure code profile`_.
+
+.. _erasure code profile: ../../../erasure-coded-pool
+
+::
+
+    $ ceph osd erasure-code-profile set myprofile \
+        crush-failure-domain=osd
+    $ ceph osd erasure-code-profile get myprofile
+    directory=/usr/lib/ceph/erasure-code
+    k=2
+    m=1
+    plugin=jerasure
+    technique=reed_sol_van
+    crush-failure-domain=osd
+    $ ceph osd pool create ecpool erasure myprofile
+
+The *plugin* is dynamically loaded from *directory* and expected to
+implement the *int __erasure_code_init(char *plugin_name, char *directory)* function,
+which is responsible for registering an object derived from *ErasureCodePlugin*
+in the registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ plugin reads:
+
+::
+
+    ErasureCodePluginRegistry &instance =
+        ErasureCodePluginRegistry::instance();
+    instance.add(plugin_name, new ErasureCodePluginExample());
+
+The *ErasureCodePlugin* derived object must provide a factory method
+from which the concrete implementation of the *ErasureCodeInterface*
+object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ reads:
+
+::
+
+    virtual int factory(const map<std::string,std::string> &parameters,
+                        ErasureCodeInterfaceRef *erasure_code) {
+      *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
+      return 0;
+    }
+
+The *parameters* argument is the list of *key=value* pairs that were
+set in the erasure code profile, before the pool was created.
+
+::
+
+    ceph osd erasure-code-profile set myprofile \
+        directory=<dir> \         # mandatory
+        plugin=jerasure \         # mandatory
+        m=10 \                    # optional and plugin dependent
+        k=3 \                     # optional and plugin dependent
+        technique=reed_sol_van \  # optional and plugin dependent
+
+Notes
+-----
+
+If the objects are large, it may be impractical to encode and decode
+them in memory. However, when using *RBD* a 1TB device is divided into
+many individual 4MB objects and *RGW* does the same.
+
+Encoding and decoding are implemented in the OSD. Although they could be
+implemented client-side for reads and writes, the OSD must be able to encode
+and decode on its own when scrubbing. diff --git a/doc/dev/osd_internals/erasure_coding/ecbackend.rst b/doc/dev/osd_internals/erasure_coding/ecbackend.rst new file mode 100644 index 000000000..877a08a38 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/ecbackend.rst @@ -0,0 +1,206 @@ +=================================
+ECBackend Implementation Strategy
+=================================
+
+Miscellaneous initial design notes
+==================================
+
+The initial (and still true for EC pools without the hacky EC
+overwrites debug flag enabled) design for EC pools restricted
+EC pools to operations that can be easily rolled back:
+
+- CEPH_OSD_OP_APPEND: We can roll back an append locally by
+  including the previous object size as part of the PG log event.
+- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
+  requires that we retain the deleted object until all replicas have
+  persisted the deletion event. The erasure coded backend will therefore
+  need to store objects with the version at which they were created
+  included in the key provided to the filestore. Old versions of an
+  object can be pruned when all replicas have committed up to the log
+  event deleting the object.
+- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
+  to be set or removed, we can roll back these operations locally.
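The CEPH_OSD_OP_APPEND case above can be sketched in Python (invented names and a dict standing in for the object store; these are not the actual osd_types.h structures):

```python
class PGLogEntry:
    """Toy log entry: remembers the object size before an append so the
    append can be rolled back locally (names invented for illustration)."""
    def __init__(self, prev_size: int):
        self.prev_size = prev_size

def do_append(store: dict, oid: str, data: bytes, log: list) -> None:
    # Record the previous object size in the log entry, then append.
    log.append(PGLogEntry(prev_size=len(store.get(oid, b""))))
    store[oid] = store.get(oid, b"") + data

def rollback_last_append(store: dict, oid: str, log: list) -> None:
    # Undo the append locally by truncating back to the recorded size.
    entry = log.pop()
    store[oid] = store[oid][:entry.prev_size]
```

The key point is that the rollback needs only locally available information (the previous size), so no other replica has to be consulted.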
+
+Log entries contain a structure explaining how to locally undo the
+operation represented by the log entry
+(see osd_types.h:TransactionInfo::LocalRollBack).
+
+PGTemp and Crush
+----------------
+
+Primaries are able to request a temp acting set mapping in order to
+allow an up-to-date OSD to serve requests while a new primary is
+backfilled (and for other reasons). An erasure coded PG needs to be
+able to designate a primary for these reasons without putting it in
+the first position of the acting set. It also needs to be able to
+leave holes in the requested acting set.
+
+Core Changes:
+
+- OSDMap::pg_to_*_osds needs to separately return a primary. For most
+  cases, this can continue to be acting[0].
+- MOSDPGTemp (and related OSD structures) needs to be able to specify
+  a primary as well as an acting set.
+- Much of the existing code base assumes that acting[0] is the primary
+  and that all elements of acting are valid. This needs to be cleaned
+  up since the acting set may contain holes.
+
+Distinguished acting set positions
+----------------------------------
+
+With the replicated strategy, all replicas of a PG are
+interchangeable. With erasure coding, different positions in the
+acting set have different pieces of the erasure coding scheme and are
+not interchangeable. Worse, crush might cause chunk 2 to be written
+to an OSD which happens already to contain an (old) copy of chunk 4.
+This means that the OSD and PG messages need to work in terms of a
+type like pair<shard_t, pg_t> in order to distinguish different PG
+chunks on a single OSD.
+
+Because the mapping of an object name to object in the filestore must
+be 1-to-1, we must ensure that the objects in chunk 2 and the objects
+in chunk 4 have different names. To that end, the object store must
+include the chunk id in the object key.
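A toy Python sketch of this naming requirement, loosely modeled on the ghobject_t tuple described in the core changes (the field names and NO_SHARD sentinel are simplifications, not the real definitions):

```python
from collections import namedtuple

# Loosely modeled on ghobject_t: (object name, generation, shard id).
# NO_SHARD stands in for the replicated-pool case where no shard applies.
NO_SHARD = -1
GHObject = namedtuple("GHObject", ["name", "gen", "shard"])

# Chunk 2 and an (old) chunk 4 of the same object on the same OSD get
# distinct keys because the shard id participates in the key:
chunk2 = GHObject("NYAN", gen=0, shard=2)
chunk4 = GHObject("NYAN", gen=0, shard=4)
```

Because the shard id is part of the key, the 1-to-1 mapping from key to stored object is preserved even when one OSD holds chunks of different ranks for the same logical object.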
+ +Core changes: + +- The object store `ghobject_t needs to also include a chunk id + <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like + tuple<hobject_t, gen_t, shard_t>. +- coll_t needs to include a shard_t. +- The OSD pg_map and similar PG mappings need to work in terms of a + spg_t (essentially + pair<pg_t, shard_t>). Similarly, pg->pg messages need to include + a shard_t +- For client->PG messages, the OSD will need a way to know which PG + chunk should get the message since the OSD may contain both a + primary and non-primary chunk for the same PG + +Object Classes +-------------- + +Reads from object classes will return ENOTSUP on EC pools by invoking +a special SYNC read. + +Scrub +----- + +The main catch, however, for EC pools is that sending a crc32 of the +stored chunk on a replica isn't particularly helpful since the chunks +on different replicas presumably store different data. Because we +don't support overwrites except via DELETE, however, we have the +option of maintaining a crc32 on each chunk through each append. +Thus, each replica instead simply computes a crc32 of its own stored +chunk and compares it with the locally stored checksum. The replica +then reports to the primary whether the checksums match. + +With overwrites, all scrubs are disabled for now until we work out +what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst). + +Crush +----- + +If crush is unable to generate a replacement for a down member of an +acting set, the acting set should have a hole at that position rather +than shifting the other elements of the acting set out of position. + +========= +ECBackend +========= + +MAIN OPERATION OVERVIEW +======================= + +A RADOS put operation can span +multiple stripes of a single object. There must be code that +tessellates the application level write into a set of per-stripe write +operations -- some whole-stripes and up to two partial +stripes. 
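The tessellation step can be sketched in Python (``stripe_width`` and the return convention are invented for illustration; this is not the actual ECBackend code):

```python
def tessellate(offset: int, length: int, stripe_width: int):
    """Split an application-level write into per-stripe writes.
    Returns (offset, length, is_whole_stripe) tuples: at most two
    partial stripes (head and tail) with whole stripes in between."""
    segments = []
    end = offset + length
    cur = offset
    while cur < end:
        stripe_start = (cur // stripe_width) * stripe_width
        stripe_end = stripe_start + stripe_width
        seg_end = min(end, stripe_end)
        whole = (cur == stripe_start and seg_end == stripe_end)
        segments.append((cur, seg_end - cur, whole))
        cur = seg_end
    return segments
```

For example, with a hypothetical 4096-byte stripe width, a write of 9000 bytes at offset 100 becomes a partial head stripe, one whole stripe, and a partial tail stripe.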
Without loss of generality, for the remainder of this
+document, we will focus exclusively on writing a single stripe (whole
+or partial). We will use the symbol "W" to represent the number of
+blocks within a stripe that are being written, i.e., W <= K.
+
+There are two data flows for handling a write into an EC stripe. The
+choice between the two data flows is based on the size
+of the write operation and the arithmetic properties of the selected
+parity-generation algorithm.
+
+(1) Whole stripe is written/overwritten
+(2) A read-modify-write operation is performed.
+
+WHOLE STRIPE WRITE
+------------------
+
+This is a simple case, and is already performed in the existing code
+(for appends, that is). The primary receives all of the data for the
+stripe in the RADOS request, computes the appropriate parity blocks
+and sends the data and parity blocks to their destination shards, which
+write them. This is essentially the current EC code.
+
+READ-MODIFY-WRITE
+-----------------
+
+The primary determines which of the K-W blocks are to remain unmodified,
+and reads them from the shards. Once all of the data is received it is
+combined with the received new data and new parity blocks are
+computed. The modified blocks are sent to their respective shards and
+written. The RADOS operation is acknowledged.
+
+OSD Object Write and Consistency
+--------------------------------
+
+Regardless of the algorithm chosen above, writing of the data is a
+two-phase process: commit and rollforward. The primary sends the log
+entries with the operation described (see
+osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack)).
+In all cases, the "commit" is performed in place, possibly leaving some
+information required for a rollback in a write-aside object. The
+rollforward phase occurs once all acting set replicas have committed;
+it then removes the rollback information.
+
+In the case of overwrites of existing stripes, the rollback information
+has the form of a sparse object containing the old values of the
+overwritten extents populated using clone_range. This is essentially
+a place-holder implementation; in real life, bluestore will have an
+efficient primitive for this.
+
+The rollforward part can be delayed since we report the operation as
+committed once all replicas have committed. Currently, whenever we
+send a write, we also indicate that all previously committed
+operations should be rolled forward (see
+ECBackend::try_reads_to_commit). If there aren't any in the pipeline
+when we arrive at the waiting_rollforward queue, we start a dummy
+write to move things along (see the Pipeline section later on and
+ECBackend::try_finish_rmw).
+
+ExtentCache
+-----------
+
+It's pretty important to be able to pipeline writes on the same
+object. For this reason, there is a cache of extents written by
+cacheable operations. Each extent remains pinned until the operations
+referring to it are committed. The pipeline prevents rmw operations
+from running until uncacheable transactions (clones, etc.) are flushed
+from the pipeline.
+
+See ExtentCache.h for a detailed explanation of how the cache
+states correspond to the higher level invariants about the conditions
+under which concurrent operations can refer to the same object.
+
+Pipeline
+--------
+
+Reading src/osd/ExtentCache.h should have given a good idea of how
+operations might overlap. There are several states involved in
+processing a write operation and an important invariant which
+isn't enforced by PrimaryLogPG at a higher level which needs to be
+managed by ECBackend. The important invariant is that we can't
+have uncacheable and rmw operations running at the same time
+on the same object. For simplicity, we simply enforce that any
+operation which contains an rmw operation must wait until
+all in-progress uncacheable operations complete.
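The invariant can be modeled with a toy Python scheduler (illustrative only; the names are invented and this is not how ECBackend is structured internally):

```python
from collections import deque

class TinyPipeline:
    """Toy model of the invariant: an rmw op must wait until all
    in-progress uncacheable ops complete (not the real ECBackend)."""
    def __init__(self):
        self.uncacheable_in_progress = 0
        self.waiting_rmw = deque()
        self.running = []

    def submit(self, op, uncacheable=False):
        if uncacheable:
            self.uncacheable_in_progress += 1
            self.running.append(op)
        elif self.uncacheable_in_progress > 0:
            self.waiting_rmw.append(op)   # blocked behind uncacheable ops
        else:
            self.running.append(op)

    def complete_uncacheable(self):
        self.uncacheable_in_progress -= 1
        if self.uncacheable_in_progress == 0:
            # All uncacheable ops flushed: release any waiting rmw ops.
            self.running.extend(self.waiting_rmw)
            self.waiting_rmw.clear()
```

The simplification matches the text: rather than tracking per-object conflicts precisely, any rmw op simply waits for all in-progress uncacheable ops to drain.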
+ +There are improvements to be made here in the future. + +For more details, see ECBackend::waiting_* and +ECBackend::try_<from>_to_<to>. diff --git a/doc/dev/osd_internals/erasure_coding/jerasure.rst b/doc/dev/osd_internals/erasure_coding/jerasure.rst new file mode 100644 index 000000000..ac3636720 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/jerasure.rst @@ -0,0 +1,35 @@ +=============== +jerasure plugin +=============== + +Introduction +------------ + +The parameters interpreted by the ``jerasure`` plugin are: + +:: + + ceph osd erasure-code-profile set myprofile \ + directory=<dir> \ # plugin directory absolute path + plugin=jerasure \ # plugin name (only jerasure) + k=<k> \ # data chunks (default 2) + m=<m> \ # coding chunks (default 2) + technique=<technique> \ # coding technique + +The coding techniques can be chosen among *reed_sol_van*, +*reed_sol_r6_op*, *cauchy_orig*, *cauchy_good*, *liberation*, +*blaum_roth* and *liber8tion*. + +The *src/erasure-code/jerasure* directory contains the +implementation. It is a wrapper around the code found at +`https://github.com/ceph/jerasure <https://github.com/ceph/jerasure>`_ +and `https://github.com/ceph/gf-complete +<https://github.com/ceph/gf-complete>`_ , pinned to the latest stable +version in *.gitmodules*. These repositories are copies of the +upstream repositories `http://jerasure.org/jerasure/jerasure +<http://jerasure.org/jerasure/jerasure>`_ and +`http://jerasure.org/jerasure/gf-complete +<http://jerasure.org/jerasure/gf-complete>`_ . The difference +between the two, if any, should match pull requests against upstream. +Note that as of 2023, the ``jerasure.org`` web site may no longer be +legitimate and/or associated with the original project. 
diff --git a/doc/dev/osd_internals/erasure_coding/proposals.rst b/doc/dev/osd_internals/erasure_coding/proposals.rst new file mode 100644 index 000000000..8a30727b3 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/proposals.rst @@ -0,0 +1,385 @@ +:orphan:
+
+=================================
+Proposed Next Steps for ECBackend
+=================================
+
+PARITY-DELTA-WRITE
+------------------
+
+RMW operations currently require 4 network hops (2 round trips). In
+principle, for some codes, we can reduce this to 3 by sending the
+update to the replicas holding the data blocks and having them
+compute a delta to forward onto the parity blocks.
+
+The primary reads the current values of the "W" blocks and then uses
+the new values of the "W" blocks to compute parity-deltas for each of
+the parity blocks. The W blocks and the parity delta-blocks are sent
+to their respective shards.
+
+The choice of whether to use a read-modify-write or a
+parity-delta-write is a complex policy issue that is TBD in the details
+and is likely to be heavily dependent on the computational costs
+associated with a parity-delta vs. a regular parity-generation
+operation. However, it is believed that the parity-delta scheme is
+likely to be the preferred choice, when available.
+
+The internal interface to the erasure coding library plug-ins needs to
+be extended to support the ability to query if parity-delta
+computation is possible for a selected algorithm as well as an
+interface to the actual parity-delta computation algorithm when
+available.
+
+Stripe Cache
+------------
+
+It may be a good idea to extend the current ExtentCache usage to
+cache some data past the point when the pinning operation releases it.
+One application pattern that is important to optimize is the small
+block sequential write operation (think of the journal of a journaling
+file system or a database transaction log).
Regardless of the chosen
+redundancy algorithm, it is advantageous for the primary to
+retain/buffer recently read/written portions of a stripe in order to
+reduce network traffic. The dynamic contents of this cache may be used
+in the determination of whether a read-modify-write or a
+parity-delta-write is performed. The sizing of this cache is TBD, but
+we should plan on allowing at least a few full stripes per active
+client. Limiting the cache occupancy on a per-client basis will reduce
+the noisy neighbor problem.
+
+Recovery and Rollback Details
+=============================
+
+Implementing a Rollback-able Prepare Operation
+----------------------------------------------
+
+The prepare operation is implemented at each OSD through a simulation
+of a versioning or copy-on-write capability for modifying a portion of
+an object.
+
+When a prepare operation is performed, the new data is written into a
+temporary object. The PG log for the
+operation will contain a reference to the temporary object so that it
+can be located for recovery purposes as well as a record of all of the
+shards which are involved in the operation.
+
+In order to avoid fragmentation (and hence protect future read
+performance), creation of the temporary object needs special attention.
+The name of the temporary object affects its location within the KV
+store. Right now it's unclear whether it's desirable for the name to
+locate near the base object or whether a separate subset of keyspace
+should be used for temporary objects. Sam believes that colocation
+with the base object is preferred (he suggests using the generation
+counter of the ghobject for temporaries). Allen, on the other hand,
+believes that using a separate subset of keyspace is desirable since
+these keys are ephemeral and we don't want to actually colocate them
+with the base object keys. Perhaps some modeling here can help resolve
+this issue.
The data of the temporary object wants to be located as close +to the data of the base object as possible. This may be best performed +by adding a new ObjectStore creation primitive that takes the base +object as an additional parameter that is a hint to the allocator. + +Sam: I think that the short lived thing may be a red herring. We'll +be updating the donor and primary objects atomically, so it seems like +we'd want them adjacent in the key space, regardless of the donor's +lifecycle. + +The apply operation moves the data from the temporary object into the +correct position within the base object and deletes the associated +temporary object. This operation is done using a specialized +ObjectStore primitive. In the current ObjectStore interface, this can +be done using the clonerange function followed by a delete, but can be +done more efficiently with a specialized move primitive. +Implementation of the specialized primitive on FileStore can be done +by copying the data. Some file systems have extensions that might also +be able to implement this operation (like a defrag API that swaps +chunks between files). It is expected that NewStore will be able to +support this efficiently and natively. (It has been noted that this +sequence requires that temporary object allocations, which tend to be +small, be efficiently converted into blocks for main objects and that +blocks that were formerly inside of main objects must be reusable with +minimal overhead.) + +The prepare and apply operations can be separated arbitrarily in +time. If a read operation accesses an object that has been altered by +a prepare operation (but without a corresponding apply operation) it +must return the data after the prepare operation. This is done by +creating an in-memory database of objects which have had a prepare +operation without a corresponding apply operation. All read operations +must consult this in-memory data structure in order to get the correct +data. 
It should be explicitly recognized that it is likely that there +will be multiple prepare operations against a single base object and +the code must handle this case correctly. This code is implemented as +a layer between ObjectStore and all existing readers. Annoyingly, +we'll want to trash this state when the interval changes, so the first +thing that needs to happen after activation is that the primary and +replicas apply up to last_update so that the empty cache will be +correct. + +During peering, it is now obvious that an unapplied prepare operation +can easily be rolled back simply by deleting the associated temporary +object and removing that entry from the in-memory data structure. + +Partial Application Peering/Recovery modifications +-------------------------------------------------- + +Some writes will be small enough to not require updating all of the +shards holding data blocks. For write amplification minimization +reasons, it would be best to avoid writing to those shards at all, +and delay even sending the log entries until the next write which +actually hits that shard. + +The delaying (buffering) of the transmission of the prepare and apply +operations for witnessing OSDs creates new situations that peering +must handle. In particular, the logic for determining the authoritative +last_update value (and hence the selection of the OSD which has the +authoritative log) must be modified to account for the valid but +missing (i.e., delayed/buffered) pglog entries of which the +authoritative OSD was only a witness. + +Because a partial write might complete without persisting a log entry +on every replica, we have to do a bit more work to determine an +authoritative last_update. The constraint (as with a replicated PG) +is that last_update >= the most recent log entry for which a commit +was sent to the client (call this actual_last_update). 
Secondarily, +we want last_update to be as small as possible since any log entry +past actual_last_update (we do not apply a log entry until we have +sent the commit to the client) must be able to be rolled back. Thus, +the smaller a last_update we choose, the less recovery will need to +happen (we can always roll back, but rolling a replica forward may +require an object rebuild). Thus, we will set last_update to one entry +before the oldest log entry we can prove cannot have been committed. In +current master, this is simply the last_update of the shortest log +from that interval (because that log did not persist any entry past +that point -- a precondition for sending a commit to the client). For +this design, we must consider the possibility that any log is missing, +at its head, log entries in which it did not participate. Thus, we +must determine the most recent interval in which we went active +(essentially, this is what find_best_info currently does). We then +pull the log from each live OSD from that interval back to the minimum +last_update among them. Then, we extend all logs from the +authoritative interval until each hits an entry in which it should +have participated, but did not record. The shortest of these extended +logs must therefore contain any log entry for which we sent a commit +to the client -- and the last entry gives us our last_update. + +Deep scrub support +------------------ + +The simple answer here is probably our best bet. EC pools can't use +the omap namespace at all right now. The simplest solution would be +to take a prefix of the omap space and pack N M-byte, L-bit checksums +into each key/value. The prefixing seems like a sensible precaution +against eventually wanting to store something else in the omap space. +It seems like any write will need to read at least the blocks +containing the modified range. However, with a code able to compute +parity deltas, we may not need to read a whole stripe. 
Even without +that, we don't want to have to write to blocks not participating in +the write. Thus, each shard should store checksums only for itself. +It seems like you'd be able to store checksums for all shards on the +parity blocks, but there may not be distinguished parity blocks which +are modified on all writes (LRC or shec provide two examples). L +should probably have a fixed number of options (16, 32, 64?) and be +configurable per-pool at pool creation. N, M should likewise be +configurable at pool creation with sensible defaults. + +We need to handle online upgrade. I think the right answer is that +the first overwrite to an object with an append only checksum +removes the append only checksum and writes in whatever stripe +checksums actually got written. The next deep scrub then writes +out the full checksum omap entries. + +RADOS Client Acknowledgement Generation Optimization +==================================================== + +Now that the recovery scheme is understood, we can discuss the +generation of the RADOS operation acknowledgement (ACK) by the +primary ("sufficient" from above). It is NOT required that the primary +wait for all shards to complete their respective prepare +operations. Using our example where the RADOS operation writes only +"W" chunks of the stripe, the primary will generate and send W+M +prepare operations (possibly including a send-to-self). The primary +need only wait for enough shards to be written to ensure recovery of +the data. Thus, after writing W + M chunks we can afford the loss of M +chunks. Hence the primary can generate the RADOS ACK after W+M-M => W +of those prepare operations are completed. + +Inconsistent object_info_t versions +=================================== + +A natural consequence of only writing the blocks which actually +changed is that we don't want to update the object_info_t of the +objects which didn't. 
I actually think it would pose a problem to do +so: pg ghobject namespaces are generally large, and unless the OSD is +seeing a bunch of overwrites on a small set of objects, I'd expect +each write to be far enough apart in the backing ghobject_t->data +mapping to each constitute a random metadata update. Thus, we have to +accept that not every shard will have the current version in its +object_info_t. We can't even bound how old the version on a +particular shard will happen to be. In particular, the primary does +not necessarily have the current version. One could argue that the +parity shards would always have the current version, but not every +code necessarily has designated parity shards which see every write +(certainly LRC, iirc shec, and even with a more pedestrian code, it +might be desirable to rotate the shards based on object hash). Even +if you chose to designate a shard as witnessing all writes, the pg +might be degraded with that particular shard missing. This is a bit +tricky: currently, reads and writes implicitly return the most recent +version of the object written. On reads, we'd have to read K shards +to answer that question. We can get around that by adding a "don't +tell me the current version" flag. Writes are more problematic: we +need an object_info from the most recent write in order to form the +new object_info and log_entry. + +A truly terrifying option would be to eliminate version and +prior_version entirely from the object_info_t. There are a few +specific purposes it serves: + +#. On OSD startup, we prime the missing set by scanning backwards + from last_update to last_complete comparing the stored object's + object_info_t to the version of the most recent log entry. +#. During backfill, we compare versions between primary and target + to avoid some pushes. We use it elsewhere as well. +#. While pushing and pulling objects, we verify the version. +#. We return it on reads and writes and allow the librados user to + assert it atomically on writes to allow the user to deal with write + races (used extensively by rbd). + +Case (3) isn't actually essential, just convenient. Oh well. (4) +is more annoying. Writes are easy since we know the version. Reads +are tricky because we may not need to read from all of the replicas. +The simplest solution is to add a flag to rados operations to just not +return the user version on read. We can also just not support the +user version assert on ec for now (I think? Only user is rgw bucket +indices iirc, and those will always be on replicated because they use +omap). + +We can avoid (1) by maintaining the missing set explicitly. It's +already possible for there to be a missing object without a +corresponding log entry (consider the case where the most recent write +is to an object which has not been updated in weeks. If that write +becomes divergent, the written object needs to be marked missing based +on the prior_version which is not in the log.) The PGLog already has +a way of handling those edge cases (see divergent_priors). We'd +simply expand that to contain the entire missing set and maintain it +atomically with the log and the objects. This isn't really an +unreasonable option: the additional keys would be fewer than the +existing log keys + divergent_priors and aren't updated in the fast +write path anyway. + +The second case is a bit trickier. It's really an optimization for +the case where a pg became not in the acting set long enough for the +logs to no longer overlap but not long enough for the PG to have +healed and removed the old copy. Unfortunately, this describes the +case where a node was taken down for maintenance with noout set. It's +probably not acceptable to re-backfill the whole OSD in such a case, +so we need to be able to quickly determine whether a particular shard +is up to date given a valid acting set of other shards. 
+ +Let ordinary writes which do not change the object size not touch the +object_info at all. That means that the object_info version won't +match the pg log entry version. Include in the pg_log_entry_t the +current object_info version as well as which shards participated (as +mentioned above). In addition to the object_info_t attr, record on +each shard s a vector recording for each other shard s' the most +recent write which spanned both s and s'. Operationally, we maintain +an attr on each shard containing that vector. A write touching S +updates the version stamp entry for each shard in S on each shard in +S's attribute (and leaves the rest alone). If we have a valid acting +set during backfill, we must have a witness of every write which +completed -- so taking the max of each entry over all of the acting +set shards must give us the current version for each shard. During +recovery, we set the attribute on the recovery target to that max +vector (Question: with LRC, we may not need to touch much of the +acting set to recover a particular shard -- can we just use the max of +the shards we used for recovery, or do we need to grab the version +vector from the rest of the acting set as well? I'm not sure, not a +big deal anyway, I think). + +The above lets us perform blind writes without knowing the current +object version (log entry version, that is) while still allowing us to +avoid backfilling up to date objects. The only catch is that our +backfill scans will scan all replicas, not just the primary and the +backfill targets. + +It would be worth adding into scrub the ability to check the +consistency of the gathered version vectors -- probably by just +taking 3 random valid subsets and verifying that they generate +the same authoritative version vector. + +Implementation Strategy +======================= + +It goes without saying that it would be unwise to attempt to do all of +this in one massive PR. 
It's also not a good idea to merge code which +isn't being tested. To that end, it's worth thinking a bit about +which bits can be tested on their own (perhaps with a bit of temporary +scaffolding). + +We can implement the overwrite friendly checksumming scheme easily +enough with the current implementation. We'll want to enable it on a +per-pool basis (probably using a flag which we'll later repurpose for +actual overwrite support). We can enable it in some of the ec +thrashing tests in the suite. We can also add a simple test +validating the behavior of turning it on for an existing ec pool +(later, we'll want to be able to convert append-only ec pools to +overwrite ec pools, so that test will simply be expanded as we go). +The flag should be gated by the experimental feature flag since we +won't want to support this as a valid configuration -- testing only. +We need to upgrade append-only ones in place during deep scrub. + +Similarly, we can implement the unstable extent cache with the current +implementation; it even lets us cut out the readable ack the replicas +send to the primary after the commit which lets it release the lock. +Same deal: implement, gate with experimental flag, add to some of the +automated tests. I don't really see a reason not to use the same flag +as above. + +We can certainly implement the move-range primitive with unit tests +before there are any users. Adding coverage to the existing +objectstore tests would suffice here. + +The explicit missing set can be implemented now, same deal as above -- +might as well even use the same feature bit. + +The TPC protocol outlined above can actually be implemented on an append +only EC pool. Same deal as above, can even use the same feature bit. + +The RADOS flag to suppress the read op user version return can be +implemented immediately. Mostly just needs unit tests. + +The version vector problem is an interesting one. 
For append only EC +pools, it would be pointless since all writes increase the size and +therefore update the object_info. We could do it for replicated pools +though. It's a bit silly since all "shards" see all writes, but it +would still let us implement and partially test the augmented backfill +code as well as the extra pg log entry fields -- this depends on the +explicit pg log entry branch having already merged. It's not entirely +clear to me that this one is worth doing separately. It's enough code +that I'd really prefer to get it done independently, but it's also a +fair amount of scaffolding that will be later discarded. + +PGLog entries need to be able to record the participants and log +comparison needs to be modified to extend logs with entries they +wouldn't have witnessed. This logic should be abstracted behind +PGLog so it can be unittested -- that would let us test it somewhat +before the actual ec overwrites code merges. + +Whatever needs to happen to the ec plugin interface can probably be +done independently of the rest of this (pending resolution of +questions below). + +The actual nuts and bolts of performing the ec overwrite it seems to +me can't be productively tested (and therefore implemented) until the +above are complete, so best to get all of the supporting code in +first. + +Open Questions +============== + +Is there a code we should be using that would let us compute a parity +delta without rereading and reencoding the full stripe? If so, is it +the kind of thing we need to design for now, or can it be reasonably +put off? + +What needs to happen to the EC plugin interface? diff --git a/doc/dev/osd_internals/index.rst b/doc/dev/osd_internals/index.rst new file mode 100644 index 000000000..7e82914aa --- /dev/null +++ b/doc/dev/osd_internals/index.rst @@ -0,0 +1,10 @@ +============================== +OSD developer documentation +============================== + +.. rubric:: Contents + +.. 
toctree:: + :glob: + + * diff --git a/doc/dev/osd_internals/last_epoch_started.rst b/doc/dev/osd_internals/last_epoch_started.rst new file mode 100644 index 000000000..c31cc66b5 --- /dev/null +++ b/doc/dev/osd_internals/last_epoch_started.rst @@ -0,0 +1,60 @@ +====================== +last_epoch_started +====================== + +``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i`` +such that all writes committed in ``i`` or earlier are reflected in the +local info/log and no writes after ``i`` are reflected in the local +info/log. Since no committed write is ever divergent, even if we +get an authoritative log/info with an older ``info.last_epoch_started``, +we can leave our ``info.last_epoch_started`` alone since no writes could +have committed in any intervening interval (See PG::proc_master_log). + +``info.history.last_epoch_started`` records a lower bound on the most +recent interval in which the PG as a whole went active and accepted +writes. On a particular OSD it is also an upper bound on the +activation epoch of intervals in which writes in the local PG log +occurred: we update it before accepting writes. Because all +committed writes are committed by all acting set OSDs, any +non-divergent writes ensure that ``history.last_epoch_started`` was +recorded by all acting set members in the interval. Once peering has +queried one OSD from each interval back to some seen +``history.last_epoch_started``, it follows that no interval after the max +``history.last_epoch_started`` can have reported writes as committed +(since we record it before recording client writes in an interval). +Thus, the minimum ``last_update`` across all infos with +``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an +upper bound on writes reported as committed to the client. 
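In sketch form, the bound described above amounts to: take the maximum ``history.last_epoch_started`` seen, keep only the infos that activated in (or after) that interval, and take the minimum ``last_update`` among them. The following is a toy illustration with hypothetical dict-based infos, not the actual C++ types:

```python
# Sketch (hypothetical structures, not Ceph code): the conservative
# upper bound on writes reported as committed. Each info carries its
# own last_epoch_started ("les"), the history.last_epoch_started it
# recorded, and an eversion-style (epoch, version) last_update.

def committed_write_bound(infos):
    # Most recent activation epoch any OSD recorded before accepting
    # writes in its interval.
    max_history_les = max(i["history_les"] for i in infos)
    # Only infos activated in (or after) that interval can have seen
    # every committed write; the smallest last_update among them
    # bounds what could have been acked to a client.
    candidates = [i["last_update"] for i in infos
                  if i["les"] >= max_history_les]
    return min(candidates)
```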
+ +We update ``info.last_epoch_started`` with the initial activation message, +but we only update ``history.last_epoch_started`` after the new +``info.last_epoch_started`` is persisted (possibly along with the first +write). This ensures that we do not require an OSD with the most +recent ``info.last_epoch_started`` until all acting set OSDs have recorded +it. + +In ``find_best_info``, we do include ``info.last_epoch_started`` values when +calculating ``max_last_epoch_started_found`` because we want to avoid +designating a log entry divergent which in a prior interval would have +been non-divergent since it might have been used to serve a read. In +``activate()``, we use the peer's ``last_epoch_started`` value as a bound on +how far back divergent log entries can be found. + +However, in a case like + +.. code:: + + calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556 + calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556 + calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556 + calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556 + +since osd.1 is the only one which recorded info.les=477, while osd.4 and osd.0 +(which were the acting set in that interval) did not (osd.4 restarted and osd.0 +did not get the message in time), the PG is marked incomplete when +either osd.4 or osd.0 would have been valid choices. To avoid this, we do not +consider ``info.les`` for incomplete peers when calculating +``min_last_epoch_started_found``. An incomplete peer would not have been in the acting +set, so we must have another OSD from that interval anyway (if +``maybe_went_rw``). If that OSD does not remember that ``info.les``, then we +cannot have served reads. 
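That rule can be condensed into a toy sketch (hypothetical fields, not the real ``find_best_info``): ignore the ``info.les`` of incomplete peers when establishing the activation-epoch bound, so that a straggler like osd.1 above cannot force the PG incomplete when complete peers from les=473 survive.

```python
# Sketch (not Ceph code): incomplete peers (partial log, lb set) do
# not contribute their les when computing the bound an authoritative
# log must meet.

def les_bound(peers):
    # Only peers with a complete log establish the bound.
    return max(p["les"] for p in peers if not p["incomplete"])

def authoritative_candidates(peers):
    bound = les_bound(peers)
    return [p["name"] for p in peers
            if not p["incomplete"] and p["les"] >= bound]
```

With the scenario above, osd.0 and osd.4 (les=473, complete) remain valid authoritative choices even though the incomplete osd.1 recorded les=477.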
diff --git a/doc/dev/osd_internals/log_based_pg.rst b/doc/dev/osd_internals/log_based_pg.rst new file mode 100644 index 000000000..99cffd3d9 --- /dev/null +++ b/doc/dev/osd_internals/log_based_pg.rst @@ -0,0 +1,208 @@ +.. _log-based-pg: + +============ +Log Based PG +============ + +Background +========== + +Why PrimaryLogPG? +----------------- + +Currently, consistency for all Ceph pool types is ensured by primary +log-based replication. This goes for both erasure-coded (EC) and +replicated pools. + +Primary log-based replication +----------------------------- + +Reads must return data written by any write which completed (where the +client could possibly have received a commit message). There are lots +of ways to handle this, but Ceph's architecture makes it easy for +everyone at any map epoch to know who the primary is. Thus, the easy +answer is to route all writes for a particular PG through a single +ordering primary and then out to the replicas. Though we only +actually need to serialize writes on a single RADOS object (and even then, +the partial ordering only really needs to provide an ordering between +writes on overlapping regions), we might as well serialize writes on +the whole PG since it lets us represent the current state of the PG +using two numbers: the epoch of the map on the primary in which the +most recent write started (this is a bit stranger than it might seem +since map distribution itself is asynchronous -- see Peering and the +concept of interval changes) and an increasing per-PG version number +-- this is referred to in the code with type ``eversion_t`` and stored as +``pg_info_t::last_update``. Furthermore, we maintain a log of "recent" +operations extending back at least far enough to include any +*unstable* writes (writes which have been started but not committed) +and objects which aren't up-to-date locally (see recovery and +backfill). 
In practice, the log will extend much further +(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when not +clean) because it's handy for quickly performing recovery. + +Using this log, as long as we talk to a non-empty subset of the OSDs +which must have accepted any completed writes from the most recent +interval in which we accepted writes, we can determine a conservative +log which must contain any write which has been reported to a client +as committed. There is some freedom here: we can choose any log entry +between the oldest head remembered by an element of that set (any +newer cannot have completed without that log containing it) and the +newest head remembered (clearly, all writes in the log were started, +so it's fine for us to remember them) as the new head. This is the +main point of divergence between replicated pools and EC pools in +``PG/PrimaryLogPG``: replicated pools try to choose the newest valid +option to avoid the client needing to replay those operations and +instead recover the other copies. EC pools instead try to choose +the *oldest* option available to them. + +The reason for this gets to the heart of the rest of the differences +in implementation: one copy will not generally be enough to +reconstruct an EC object. Indeed, there are encodings where some log +combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding +where 3 of the replicas remember a write, but the other 3 do not -- we +don't have the 4 shards required to reconstruct either version). For this reason, log entries +representing *unstable* writes (writes not yet committed to the +client) must be rollbackable using only local information on EC pools. +Log entries in general may therefore be rollbackable (and in that case, +via a delayed application or via a set of instructions for rolling +back an in-place update) or not. Replicated pool log entries are +never able to be rolled back. 
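The head-selection difference can be stated compactly. A minimal sketch (``eversion_t`` values modeled as (epoch, version) tuples; not the actual peering code):

```python
# Sketch (not Ceph code): choosing the new log head during peering.
# Replicated pools take the newest head seen (other copies can be
# rolled forward by recovery); EC pools take the oldest, because every
# entry past it must be locally rollbackable -- a single shard cannot
# rebuild the object on its own.

def choose_head(heads, ec_pool):
    # heads: eversion-style (epoch, version) tuples reported by the
    # OSDs of the most recent writing interval.
    return min(heads) if ec_pool else max(heads)
```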
+ +For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``, +``osd_types.h:pg_log_entry_t``, and peering in general. + +ReplicatedBackend/ECBackend unification strategy +================================================ + +PGBackend +--------- + +The fundamental difference between replication and erasure coding +is that replication can do destructive updates while erasure coding +cannot. It would be really annoying if we needed to have two entire +implementations of ``PrimaryLogPG`` since there +are really only a few fundamental differences: + +#. How reads work -- async only, requires remote reads for EC +#. How writes work -- either restricted to append, or must write aside and do a + TPC (two-phase commit) +#. Whether we choose the oldest or newest possible head entry during peering +#. A bit of extra information in the log entry to enable rollback + +and so many similarities: + +#. All of the stats and metadata for objects +#. The high level locking rules for mixing client IO with recovery and scrub +#. The high level locking rules for mixing reads and writes without exposing + uncommitted state (which might be rolled back or forgotten later) +#. The process, metadata, and protocol needed to determine the set of osds + which participated in the most recent interval in which we accepted writes +#. etc. + +Instead, we choose a few abstractions (and a few kludges) to paper over the differences: + +#. ``PGBackend`` +#. ``PGTransaction`` +#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and ``calc_ec_acting`` +#. Various bits of the write pipeline disallow some operations based on pool + type -- like omap operations, class operation reads, and writes which are + not aligned appends (officially, so far) for EC +#. Misc other kludges here and there + +``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2 above +and the addition of 4 as needed to the log entries. 
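Difference 1 -- how reads work -- is perhaps easiest to see in a toy sketch (hypothetical Python classes, not the actual C++ interface): the replicated backend can answer from its local copy, while an EC backend must first gather shards from remote OSDs and so offers only an asynchronous path.

```python
# Sketch (hypothetical names modeled on the sync/async split described
# in this document, not the real PGBackend classes).

class ReplicatedBackendSketch:
    def __init__(self, local_store):
        self.local = local_store

    def objects_read_sync(self, oid):
        # One full copy is local: answer immediately.
        return self.local[oid]

class ECBackendSketch:
    def __init__(self, shard_fetchers, k):
        self.fetchers = shard_fetchers  # callables returning one shard each
        self.k = k                      # shards needed to reconstruct

    def objects_read_sync(self, oid):
        raise NotImplementedError("EC reads require remote shards")

    def objects_read_async(self, oid, on_complete):
        # Gather k shards, reconstruct, then invoke the completion.
        shards = [f(oid) for f in self.fetchers[: self.k]]
        on_complete(b"".join(shards))
```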
+ +The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't +require much additional explanation. More detail on the ``ECBackend`` can be +found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``. + +PGBackend Interface Explanation +=============================== + +Note: this is from a design document that predated the Firefly release +and is probably out of date w.r.t. some of the method names. + +Readable vs Degraded +-------------------- + +For a replicated pool, an object is readable IFF it is present on +the primary (at the right version). For an EC pool, we need at least +`k` shards present to perform a read, and we need it on the primary. For +this reason, ``PGBackend`` needs to include some interfaces for determining +when recovery is required to serve a read vs a write. This also +changes the rules for when peering has enough logs to prove that the +PG is recoverable. + +Core Changes: + +- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate`` + | objects to allow the user to make these determinations. + +Client Reads +------------ + +Reads from a replicated pool can always be satisfied +synchronously by the primary OSD. Within an erasure coded pool, +the primary will need to request data from some number of replicas in +order to satisfy a read. ``PGBackend`` will therefore need to provide +separate ``objects_read_sync`` and ``objects_read_async`` interfaces where +the former won't be implemented by the ``ECBackend``. + +``PGBackend`` interfaces: + +- ``objects_read_sync`` +- ``objects_read_async`` + +Scrubs +------ + +We currently have two scrub modes with different default frequencies: + +#. [shallow] scrub: compares the set of objects and metadata, but not + the contents +#. deep scrub: compares the set of objects, metadata, and a CRC32 of + the object contents (including omap) + +The primary requests a scrubmap from each replica for a particular +range of objects. 
The replica fills out this scrubmap for the range +of objects including, if the scrub is deep, a CRC32 of the contents of +each object. The primary gathers these scrubmaps from each replica +and performs a comparison identifying inconsistent objects. + +Most of this can work essentially unchanged with an erasure coded PG with +the caveat that the ``PGBackend`` implementation must be in charge of +actually doing the scan. + + +``PGBackend`` interfaces: + +- ``be_*`` + +Recovery +-------- + +The logic for recovering an object depends on the backend. With +the current replicated strategy, we first pull the object replica +to the primary and then concurrently push it out to the replicas. +With the erasure coded strategy, we probably want to read the +minimum number of replica chunks required to reconstruct the object +and push out the replacement chunks concurrently. + +Another difference is that objects in an erasure coded PG may be +unrecoverable without being unfound. The ``unfound`` state +should probably be renamed to ``unrecoverable``. Also, the +``PGBackend`` implementation will have to be able to direct the search +for PG replicas with unrecoverable object chunks and to be able +to determine whether a particular object is recoverable. 
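For intuition, the erasure-coded recovery shape -- read a minimal set of surviving shards, rebuild only the missing chunk, push only the replacement -- can be sketched with the simplest possible code, a single XOR parity. Real pools use the erasure code plugins (jerasure, etc.), but the flow is the same:

```python
# Sketch (toy XOR parity, not a real erasure code plugin): with k data
# shards and one parity shard, any single lost shard is the XOR of the
# parity with the surviving data shards.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(shards):
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return parity

def recover_shard(surviving, parity):
    # XOR the parity with every surviving data shard; what remains is
    # the missing shard -- the only chunk we need to push out.
    missing = parity
    for s in surviving:
        missing = xor_bytes(missing, s)
    return missing
```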
+ + +Core changes: + +- ``s/unfound/unrecoverable`` + +PGBackend interfaces: + +- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_ +- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_ +- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_ +- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_ +- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_ diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst new file mode 100644 index 000000000..7be4350ea --- /dev/null +++ b/doc/dev/osd_internals/manifest.rst @@ -0,0 +1,589 @@ +======== +Manifest +======== + + +Introduction +============ + +As described in ``../deduplication.rst``, adding transparent redirect +machinery to RADOS would enable a more capable tiering solution +than RADOS currently has with "cache/tiering". + +See ``../deduplication.rst`` + +At a high level, each object has a piece of metadata embedded in +the ``object_info_t`` which can map subsets of the object data payload +to (refcounted) objects in other pools. + +This document exists to detail: + +1. Manifest data structures +2. RADOS operations for manipulating manifests. +3. Status and Plans + + +Intended Usage Model +==================== + +RBD +--- + +For RBD, the primary goal is for either an OSD-internal agent or a +cluster-external agent to be able to transparently shift portions +of the constituent 4MB extents between a dedup pool and a hot base +pool. + +As such, RBD operations (including class operations and snapshots) +must have the same observable results regardless of the current +status of the object. + +Moreover, tiering/dedup operations must interleave with RBD operations +without changing the result. 
+ +Thus, here is a sketch of how I'd expect a tiering agent to perform +basic operations: + +* Demote cold RBD chunk to slow pool: + + 1. Read object, noting current user_version. + 2. In memory, run CDC implementation to fingerprint object. + 3. Write out each resulting extent to an object in the cold pool + using the CAS class. + 4. Submit operation to base pool: + + * ``ASSERT_VER`` with the user version from the read to fail if the + object has been mutated since the read. + * ``SET_CHUNK`` for each of the extents to the corresponding object + in the cold pool. + * ``EVICT_CHUNK`` for each extent to free up space in the base pool. + Results in each chunk being marked ``MISSING``. + + RBD users should then either see the state prior to the demotion or + subsequent to it. + + Note that between 3 and 4, we potentially leak references, so a + periodic scrub would be needed to validate refcounts. + +* Promote cold RBD chunk to fast pool. + + 1. Submit ``TIER_PROMOTE`` + +For clones, all of the above would be identical except that the +initial read would need a ``LIST_SNAPS`` to determine which clones exist +and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include +the ``cloneid``. + +RadosGW +------- + +For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the +manifest machinery in the OSD to hide the distinction between the object +being dedup'd and present in the base pool. + +For writes, RGW could operate as RBD does above, but could +optionally have the freedom to fingerprint prior to doing the write. +In that case, it could immediately write out the target objects to the +CAS pool and then atomically write an object with the corresponding +chunks set. + +Status and Future Work +====================== + +At the moment, initial versions of a manifest data structure along +with IO path support and rados control operations exist. This section +is meant to outline next steps. 
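For orientation, the per-object state that the operations above manipulate can be sketched as follows. This is a hypothetical layout with invented names, not the actual ``object_info_t`` metadata: a chunked manifest maps offsets in the logical object to chunks in another pool, each either present locally or evicted (``MISSING``).

```python
# Sketch (hypothetical, not the real manifest structure): a chunk map
# from offset to (length, pool, oid, state).

MISSING, PRESENT = "missing", "present"

class ManifestSketch:
    def __init__(self):
        self.chunk_map = {}

    def set_chunk(self, off, length, pool, oid):
        # SET_CHUNK: record the mapping, leaving local data intact.
        self.chunk_map[off] = (length, pool, oid, PRESENT)

    def evict_chunk(self, off):
        # EVICT_CHUNK: drop local data for the extent, keep the ref.
        length, pool, oid, _ = self.chunk_map[off]
        self.chunk_map[off] = (length, pool, oid, MISSING)

    def needs_promote(self, off):
        # A read of a MISSING extent must first pull it back in.
        return self.chunk_map[off][3] == MISSING
```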
+
+At a high level, our future work plan is:
+
+- Cleanups: Address immediate inconsistencies and shortcomings outlined
+  in the next section.
+- Testing: RADOS relies heavily on teuthology failure testing to validate
+  features like cache/tiering. We'll need corresponding tests for
+  manifest operations.
+- Snapshots: We want to be able to deduplicate portions of clones
+  below the level of the RADOS snapshot system. As such, the
+  RADOS operations below need to be extended to work correctly on
+  clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
+  corresponding extent in the base pool, and correctly maintain OSD metadata).
+- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
+  cache/tiering implementation, but to do that we need to ensure that we
+  can address the same use cases.
+
+
+Cleanups
+--------
+
+The existing implementation has some things that need to be cleaned up:
+
+* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
+  one couldn't create an object atomically as a redirect.
+* ``SET_CHUNK``:
+
+  * Appears to trigger a new clone as user_modify gets set in
+    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
+    below for some options on how generally to mix these operations
+    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
+    user_modify.
+  * Appears to assume that the corresponding section of the object
+    does not exist (sets ``FLAG_MISSING``) but does not check whether the
+    corresponding extent exists already in the object. Should always
+    leave the extent clean.
+  * Appears to clear the manifest unconditionally if not chunked;
+    that's probably wrong. We should return an error if it's a
+    ``REDIRECT`` ::
+
+      case CEPH_OSD_OP_SET_CHUNK:
+        if (oi.manifest.is_redirect()) {
+          result = -EINVAL;
+          goto fail;
+        }
+
+
+* ``TIER_PROMOTE``:
+
+  * ``SET_REDIRECT`` clears the contents of the object.
``PROMOTE`` appears
+    to copy them back in, but does not unset the redirect or clear the
+    reference. This violates the invariant that a redirect object
+    should be empty in the base pool. In particular, as long as the
+    redirect is set, it appears that all operations will be proxied
+    even after the promote, defeating the purpose. We do want ``PROMOTE``
+    to be able to atomically replace a redirect with the actual
+    object, so the solution is to clear the redirect at the end of the
+    promote.
+  * For a chunked manifest, we appear to flush prior to promoting.
+    Promotion will often be used to prepare an object for low latency
+    reads and writes; accordingly, the only effect should be to read
+    any ``MISSING`` extents into the base pool. No flushing should be done.
+
+* High Level:
+
+  * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
+    at a dedup extent. Writing the mutated extent back to the dedup pool
+    requires writing a new object since the previous one cannot be mutated,
+    just as it would if it hadn't been dedup'd yet. Thus, we should always
+    drop the reference and remove the manifest pointer.
+
+  * There isn't currently a way to "evict" an object region. With the above
+    change to ``SET_CHUNK`` to always retain the existing object region, we
+    need an ``EVICT_CHUNK`` operation to then remove the extent.
+
+
+Testing
+-------
+
+We rely heavily on randomized failure testing. As such, we need
+to extend that testing to include dedup/manifest support as well. Here's
+a short list of the touchpoints:
+
+* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``
+
+  That test, of course, tests the existing cache/tiering machinery. Add
+  additional files to that directory that instead set up a dedup pool. Add
+  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).
+
+* RBD tests
+
+  Add a test that runs an RBD workload concurrently with blind
+  promote/evict operations.
+
+* RGW
+
+  Add a test that runs an RGW workload concurrently with blind
+  promote/evict operations.
+
+
+Snapshots
+---------
+
+Fundamentally, we need to be able to manipulate the manifest
+status of clones because we want to be able to dynamically promote,
+flush (if the state was dirty when the clone was created), and evict
+extents from clones.
+
+As such, the plan is to allow the ``object_manifest_t`` for each clone
+to be independent. Here's an incomplete list of the high level
+tasks:
+
+* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
+  to operate directly on clones.
+* Ensure that recovery checks the object_manifest prior to trying to
+  use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
+  are the two methods that would likely need to be modified.
+
+See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
+support details. I'd like to call out one particular data structure
+we may want to exploit.
+
+The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
+clones as part of leak detection.
+
+An important question is how we deal with the fact that many clones
+will frequently have references to the same backing chunks at the same
+offset. In particular, ``make_writeable`` will generally create a clone
+that shares the same ``object_manifest_t`` references with the exception
+of any extents modified in that transaction. The metadata that
+commits as part of that transaction must therefore map onto the same
+refcount as before, because otherwise we'd have to first increment
+refcounts on backing objects (or risk a reference to a dead object).
+Thus, we introduce a simple convention: consecutive clones which
+share a reference at the same offset share the same refcount. This
+means that a write that invokes ``make_writeable`` may decrease refcounts,
+but not increase them. This has some consequences for removing clones.
+Consider the following sequence ::
+
+  write foo [0, 1024)
+  flush foo ->
+    head: [0, 512) aaa, [512, 1024) bbb
+    refcount(aaa)=1, refcount(bbb)=1
+  snapshot 10
+  write foo [0, 512) ->
+    head: [512, 1024) bbb
+    10  : [0, 512) aaa, [512, 1024) bbb
+    refcount(aaa)=1, refcount(bbb)=1
+  flush foo ->
+    head: [0, 512) ccc, [512, 1024) bbb
+    10  : [0, 512) aaa, [512, 1024) bbb
+    refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
+  snapshot 20
+  write foo [0, 512) (same contents as the original write)
+    head: [512, 1024) bbb
+    20  : [0, 512) ccc, [512, 1024) bbb
+    10  : [0, 512) aaa, [512, 1024) bbb
+    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
+  flush foo
+    head: [0, 512) aaa, [512, 1024) bbb
+    20  : [0, 512) ccc, [512, 1024) bbb
+    10  : [0, 512) aaa, [512, 1024) bbb
+    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
+
+What should the refcount for ``aaa`` be at the end? By our
+above rule, it should be ``2`` since the two ``aaa`` refs are not
+contiguous. However, consider removing clone ``20`` ::
+
+  initial:
+    head: [0, 512) aaa, [512, 1024) bbb
+    20  : [0, 512) ccc, [512, 1024) bbb
+    10  : [0, 512) aaa, [512, 1024) bbb
+    refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
+  trim 20
+    head: [0, 512) aaa, [512, 1024) bbb
+    10  : [0, 512) aaa, [512, 1024) bbb
+    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0
+
+At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
+This means that removing ``20`` needs to check for refs held by
+the clones on either side, which will then match.
+
+See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
+for the logic implementing this rule.
+
+This seems complicated, but it gets us two valuable properties:
+
+1) The refcount change from make_writeable will not block on
+   incrementing a ref
+2) We don't need to load the ``object_manifest_t`` for every clone
+   to determine how to handle removing one -- just the ones
+   immediately preceding and succeeding it.
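The convention can be illustrated with a small model. The sketch below uses a hypothetical helper name and plain dicts; it is not the actual ``calc_refs_to_drop_on_removal`` code, only the rule it implements: a ref is dropped only when neither adjacent clone holds the same chunk at the same offset.

```python
def refs_to_drop_on_removal(removed, adjacent):
    """Return (offset, chunk) pairs whose refcount drops when a clone is trimmed.

    `removed` and each entry of `adjacent` map offset -> backing chunk id.
    Per the convention above, consecutive clones sharing a chunk at the
    same offset share one refcount, so a ref is dropped only when neither
    neighboring clone holds the same chunk at that offset.
    """
    return [(off, chunk) for off, chunk in removed.items()
            if not any(n.get(off) == chunk for n in adjacent)]

# The "trim 20" example from the text:
head    = {0: "aaa", 512: "bbb"}
clone20 = {0: "ccc", 512: "bbb"}
clone10 = {0: "aaa", 512: "bbb"}

# "ccc" loses its ref; "bbb" is shared at offset 512 by both neighbors,
# so its refcount is retained.
print(refs_to_drop_on_removal(clone20, [head, clone10]))  # [(0, 'ccc')]
```

Note that only the two neighboring clones are consulted, which is exactly property 2) above.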
+
+All clone operations will need to consider adjacent ``chunk_maps``
+when adding or removing references.
+
+Data Structures
+===============
+
+Each RADOS object contains an ``object_manifest_t`` embedded within the
+``object_info_t`` (see ``osd_types.h``):
+
+::
+
+  struct object_manifest_t {
+    enum {
+      TYPE_NONE = 0,
+      TYPE_REDIRECT = 1,
+      TYPE_CHUNKED = 2,
+    };
+    uint8_t type;  // redirect, chunked, ...
+    hobject_t redirect_target;
+    std::map<uint64_t, chunk_info_t> chunk_map;
+  };
+
+The ``type`` enum reflects three possible states an object can be in:
+
+1. ``TYPE_NONE``: normal RADOS object
+2. ``TYPE_REDIRECT``: object payload is backed by a single object
+   specified by ``redirect_target``
+3. ``TYPE_CHUNKED``: object payload is distributed among objects with
+   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
+   the offset of the chunk to a ``chunk_info_t`` as shown below, also
+   specifying the ``length``, target ``OID``, and ``flags``.
+
+::
+
+  struct chunk_info_t {
+    typedef enum {
+      FLAG_DIRTY = 1,
+      FLAG_MISSING = 2,
+      FLAG_HAS_REFERENCE = 4,
+      FLAG_HAS_FINGERPRINT = 8,
+    } cflag_t;
+    uint32_t offset;
+    uint32_t length;
+    hobject_t oid;
+    cflag_t flags;   // FLAG_*
+  };
+
+``FLAG_DIRTY`` at this time can be set if an extent with a fingerprint
+is written. This should be changed to drop the fingerprint instead.
+
+
+Request Handling
+================
+
+Similarly to cache/tiering, the initial touchpoint is
+``maybe_handle_manifest_detail``.
+
+For manifest operations listed below, we return ``NOOP`` and continue on to
+dedicated handling within ``do_osd_ops``.
+
+For redirect objects which haven't been promoted (apparently ``oi.size >
+0`` indicates that it's present?) we proxy reads and writes.
+
+For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
+of the ops are reads of extents in the ``object_manifest_t`` ``chunk_map``),
+we proxy requests to those objects.
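A ``can_proxy_chunked_read``-style check boils down to asking whether every requested extent is fully covered by ``chunk_map`` entries. A minimal sketch of that containment test, using simplified types rather than the OSD's actual code:

```python
def covering_chunks(chunk_map, offset, length):
    """Return the (chunk_offset, target_oid) entries covering the read
    [offset, offset + length), or None if the extent is not fully covered.

    `chunk_map` maps chunk offset -> (chunk_length, target_oid), mirroring
    the offset -> chunk_info_t layout of object_manifest_t::chunk_map.
    """
    end, pos, hit = offset + length, offset, []
    for coff in sorted(chunk_map):
        clen, oid = chunk_map[coff]
        if coff <= pos < coff + clen:
            hit.append((coff, oid))
            pos = coff + clen
            if pos >= end:
                return hit
    return None  # a gap: the read cannot be served purely from chunk targets

# Two 512-byte chunks backing a 1024-byte object:
chunk_map = {0: (512, "aaa"), 512: (512, "bbb")}
print(covering_chunks(chunk_map, 256, 512))  # [(0, 'aaa'), (512, 'bbb')]
print(covering_chunks(chunk_map, 768, 512))  # None (runs past offset 1024)
```

If every op in the message resolves to a chunk list this way, the read can be proxied; otherwise the object must be promoted first.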
+
+
+RADOS Interface
+================
+
+To set up deduplication one must provision two pools. One will act as the
+base pool and the other will act as the chunk pool. The base pool needs to be
+configured with the ``fingerprint_algorithm`` option as follows.
+
+::
+
+  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
+  --yes-i-really-mean-it
+
+Create objects ::
+
+  rados -p base_pool put foo ./foo
+  rados -p chunk_pool put foo-chunk ./foo-chunk
+
+Make a manifest object ::
+
+  rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference
+
+Operations:
+
+* ``set-redirect``
+
+  Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
+  in the ``target_pool``.
+  A redirected object will forward all operations from the client to the
+  ``target_object``. ::
+
+    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
+                      uint64_t tgt_version, int flag = 0);
+
+    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
+      <target_object>
+
+  Returns ``ENOENT`` if the object does not exist (TODO: why?)
+  Returns ``EINVAL`` if the object is already a redirect.
+
+  Takes a reference to the target as part of the operation; can possibly leak a ref
+  if the acting set resets and the client dies between taking the ref and
+  recording the redirect.
+
+  Truncates the object, clears omap, and clears xattrs as a side effect.
+
+  At the top of ``do_osd_ops``, does not set user_modify.
+
+  This operation is not a user mutation and does not trigger a clone to be created.
+
+  There are two purposes of ``set_redirect``:
+
+  1. Redirect all operations to the target object (like a proxy)
+  2. Cache when ``tier_promote`` is called (the redirect will be cleared at this time)
+
+* ``set-chunk``
+
+  Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
+  ``target_object``.
::
+
+    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
+                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);
+
+    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
+      <caspool> <target_object> <target-offset>
+
+  Returns ``ENOENT`` if the object does not exist (TODO: why?)
+  Returns ``EINVAL`` if the object is already a redirect.
+  Returns ``EINVAL`` on an ill-formed parameter buffer.
+  Returns ``ENOTSUPP`` if existing mapped chunks overlap with the new chunk mapping.
+
+  Takes references to targets as part of the operation; can possibly leak refs
+  if the acting set resets and the client dies between taking the ref and
+  recording the redirect.
+
+  Truncates the object, clears omap, and clears xattrs as a side effect.
+
+  This operation is not a user mutation and does not trigger a clone to be created.
+
+  TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::
+
+    if (!oi.manifest.is_chunked()) {
+      oi.manifest.clear();
+    }
+
+* ``evict-chunk``
+
+  Clears an extent from an object, leaving only the manifest link between
+  it and the ``target_object``. ::
+
+    void evict_chunk(
+      uint64_t offset, uint64_t length, int flag = 0);
+
+    rados -p base_pool evict-chunk <offset> <length> <object>
+
+  Returns ``EINVAL`` if the extent is not present in the manifest.
+
+  Note: this does not exist yet.
+
+
+* ``tier-promote``
+
+  Promotes the object, ensuring that subsequent reads and writes will be local ::
+
+    void tier_promote();
+
+    rados -p base_pool tier-promote <obj-name>
+
+  Returns ``ENOENT`` if the object does not exist
+
+  For a redirect manifest, copies data to head.
+
+  TODO: Promote on a redirect object needs to clear the redirect.
+
+  For a chunked manifest, reads all ``MISSING`` extents into the base pool;
+  subsequent reads and writes will be served from the base pool.
+
+  Implementation Note: For a chunked manifest, calls ``start_copy`` on itself.
The
+  resulting ``copy_get`` operation will issue reads which will then be redirected by
+  the normal manifest read machinery.
+
+  Does not set the ``user_modify`` flag.
+
+  Future work will involve adding support for specifying a ``clone_id``.
+
+* ``unset-manifest``
+
+  Unset the manifest info in an object that has a manifest. ::
+
+    void unset_manifest();
+
+    rados -p base_pool unset-manifest <obj-name>
+
+  Clears manifest chunks or redirect. Lazily releases references; may
+  leak.
+
+  ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
+  and so will trigger a snapshot. Note, this will be true even for a
+  redirect, though ``SET_REDIRECT`` does not flip ``user_modify``. This should
+  be fixed -- ``unset-manifest`` should not be a ``user_modify``.
+
+* ``tier-flush``
+
+  Flush an object that has chunks to the chunk pool. ::
+
+    void tier_flush();
+
+    rados -p base_pool tier-flush <obj-name>
+
+  Included in the ``user_modify=false`` ``ignorelist``; does not trigger a clone.
+
+  Does not evict the extents.
+
+
+ceph-dedup-tool
+===============
+
+``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
+and fixing the reference count (see ``./refcount.rst``).
+
+* Find an optimal chunk offset
+
+  a. Fixed chunk
+
+     To find an optimal fixed chunk length, you need to run the following command many
+     times while changing the ``chunk_size``. ::
+
+       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
+       --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
+
+  b. Rabin chunk (Rabin-Karp algorithm)
+
+     Rabin-Karp is a string-searching algorithm based
+     on a rolling hash. But a rolling hash is not enough to do deduplication because
+     we don't know the chunk boundaries. So, we need content-based slicing using
+     a rolling hash for content-defined chunking.
+     The current implementation uses the simplest approach: look for chunk boundaries
+     by inspecting the rolling hash for a pattern (such as the
+     lower N bits all being zero).
+
+     Users who want to use deduplication need to find an ideal chunk offset.
+     To find the ideal chunk offset, users should discover
+     the optimal configuration for their data workload via ``ceph-dedup-tool``.
+     This information will then be used for object chunking through
+     the ``set-chunk`` API. ::
+
+       ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
+       --chunk-algorithm rabin --fingerprint-algorithm rabin
+
+     ``ceph-dedup-tool`` has many options for tuning ``rabin chunk``.
+     These are the options for ``rabin chunk``. ::
+
+       --mod-prime <uint64_t>
+       --rabin-prime <uint64_t>
+       --pow <uint64_t>
+       --chunk-mask-bit <uint32_t>
+       --window-size <uint32_t>
+       --min-chunk <uint32_t>
+       --max-chunk <uint64_t>
+
+     Users can refer to the following equation when using the above options for
+     ``rabin chunk``. ::
+
+       rabin_hash =
+         (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
+
+  c. Fixed chunk vs content-defined chunk
+
+     Content-defined chunking may or may not be the optimal solution.
+     For example,
+
+     Data chunk ``A`` : ``abcdefgabcdefgabcdefg``
+
+     Consider deduplicating data chunk ``A``. The ideal chunk is the 7-byte
+     repeat ``abcdefg``; so, if we use fixed chunking, ``7`` is the optimal
+     chunk length. But, in the case of content-based slicing, the optimal
+     chunk length might not be found (the dedup ratio will not be 100%),
+     because we need to find optimal parameters such
+     as the boundary bits, window size, and prime value. This is not as easy
+     as with fixed chunking. But content-defined chunking is very effective
+     in the following case.
+
+
+     Data chunk ``B`` : ``abcdefgabcdefgabcdefg``
+
+     Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``
+
+
+* Fix reference count
+
+  Reference counting for dedup is designed to tolerate false positives: an
+  inconsistency of the form ``(manifest object (no ref), chunk object (has ref))``
+  can occur, but never ``(manifest object (has ref), chunk object (no ref))``.
+  To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::
+
+    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
+
diff --git a/doc/dev/osd_internals/map_message_handling.rst b/doc/dev/osd_internals/map_message_handling.rst
new file mode 100644
index 000000000..f8104f3fd
--- /dev/null
+++ b/doc/dev/osd_internals/map_message_handling.rst
@@ -0,0 +1,131 @@
+===========================
+Map and PG Message handling
+===========================
+
+Overview
+--------
+The OSD handles routing incoming messages to PGs, creating the PG if necessary
+in some cases.
+
+PG messages generally come in two varieties:
+
+  1. Peering Messages
+  2. Ops/SubOps
+
+There are several ways in which a message might be dropped or delayed. It is
+important that the message delaying does not result in a violation of certain
+message ordering requirements on the way to the relevant PG handling logic:
+
+  1. Ops referring to the same object must not be reordered.
+  2. Peering messages must not be reordered.
+  3. Subops must not be reordered.
+
+MOSDMap
+-------
+MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the
+OSD must perform several tasks:
+
+  1. Persist the new maps to the filestore.
+     Several PG operations rely on having access to maps dating back to the last
+     time the PG was clean.
+  2. Update and persist the superblock.
+  3. Update OSD state related to the current map.
+  4. Expose new maps to PG processes via *OSDService*.
+  5. Remove PGs due to pool removal.
+  6. Queue dummy events to trigger PG map catchup.
+
+Each PG asynchronously catches up to the currently published map during
+process_peering_events before processing the event. As a result, different
+PGs may have different views as to the "current" map.
+
+One consequence of this design is that messages containing submessages from
+multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
+with the PG's epoch as well as tagging the message as a whole with the OSD's
+current published epoch.
+
+MOSDPGOp/MOSDPGSubOp
+--------------------
+See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
+
+MOSDPGOps are used by clients to initiate RADOS operations. MOSDSubOps are used
+between OSDs to coordinate most non-peering activities including replicating
+MOSDPGOp operations.
+
+OSD::require_same_or_newer_map checks that the current OSDMap is at least
+as new as the map epoch indicated on the message. If not, the message is
+queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this
+cannot violate the above conditions since any two messages will be queued
+in order of receipt, and if a message is received with epoch e0, a later message
+from the same source must be at epoch at least e0. Note that two PGs from
+the same OSD count for these purposes as different sources for single PG
+messages. That is, messages from different PGs may be reordered.
+
+
+MOSDPGOps follow the following process:
+
+  1. OSD::handle_op: validates permissions and crush mapping.
+     Discards the request if the client is not connected and thus cannot
+     receive the reply (see OSD::op_is_discardable).
+     See OSDService::handle_misdirected_op
+     See PG::op_has_sufficient_caps
+     See OSD::require_same_or_newer_map
+  2. OSD::enqueue_op
+
+MOSDSubOps follow the following process:
+
+  1. OSD::handle_sub_op checks that the sender is an OSD
+  2. OSD::enqueue_op
+
+OSD::enqueue_op calls PG::queue_op which checks waiting_for_map before calling
+OpWQ::queue, which adds the op to the queue of the PG responsible for handling it.
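The ordering argument above (ops whose epoch is too new are parked in OSD::waiting_for_osdmap and replayed in receipt order once the map catches up) can be modeled with a toy gate. This is a simplified sketch with hypothetical names, not the OSD's actual code; for simplicity it parks every op behind earlier waiters, which is a conservative version of the real behavior:

```python
from collections import deque

class MapGate:
    """Toy model of OSD::wait_for_new_map: ops newer than the current
    map epoch wait and are replayed in receipt order, so per-source
    ordering is never violated."""

    def __init__(self, epoch=0):
        self.epoch = epoch
        self.waiting = deque()   # (op_epoch, op), in receipt order
        self.processed = []

    def handle(self, op_epoch, op):
        if op_epoch > self.epoch or self.waiting:
            # Queue behind any earlier waiters to preserve receipt order.
            self.waiting.append((op_epoch, op))
        else:
            self.processed.append(op)

    def note_new_map(self, epoch):
        # On map catchup, drain waiters whose epoch is now satisfied.
        self.epoch = epoch
        while self.waiting and self.waiting[0][0] <= self.epoch:
            self.processed.append(self.waiting.popleft()[1])

gate = MapGate(epoch=5)
gate.handle(7, "op1")   # too new: parked
gate.handle(5, "op2")   # old enough, but must wait behind op1
gate.note_new_map(7)
print(gate.processed)   # ['op1', 'op2'] -- receipt order preserved
```

Because a later message from the same source always carries an epoch at least as large as an earlier one, draining in receipt order is sufficient to keep per-source ordering intact.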
+
+OSD::dequeue_op is then eventually called, with a lock on the PG. At
+this time, the op is passed to PG::do_request, which checks that:
+
+  1. the PG map is new enough (PG::must_delay_op)
+  2. the client requesting the op has enough permissions (PG::op_has_sufficient_caps)
+  3. the op is not to be discarded (PG::can_discard_{request,op,subop,scan,backfill})
+  4. the PG is active (PG::flushed boolean)
+  5. the op is a CEPH_MSG_OSD_OP and the PG is in PG_STATE_ACTIVE state and not in PG_STATE_REPLAY
+
+If these conditions are not met, the op is either discarded or queued for later
+processing. If all conditions are met, the op is processed according to its type:
+
+  1. CEPH_MSG_OSD_OP is handled by PG::do_op
+  2. MSG_OSD_SUBOP is handled by PG::do_sub_op
+  3. MSG_OSD_SUBOPREPLY is handled by PG::do_sub_op_reply
+  4. MSG_OSD_PG_SCAN is handled by PG::do_scan
+  5. MSG_OSD_PG_BACKFILL is handled by PG::do_backfill
+
+CEPH_MSG_OSD_OP processing
+--------------------------
+
+PrimaryLogPG::do_op handles the CEPH_MSG_OSD_OP op and will queue it
+
+  1. in wait_for_all_missing if it is a CEPH_OSD_OP_PGLS for a designated snapid and some object updates are still missing
+  2. in waiting_for_active if the op may write but the scrubber is working
+  3. in waiting_for_missing_object if the op requires an object or a snapdir or a specific snap that is still missing
+  4. in waiting_for_degraded_object if the op may write an object or a snapdir that is degraded, or if another object blocks it ("blocked_by")
+  5. in waiting_for_backfill_pos if the op requires an object that will be available after the backfill is complete
+  6. in waiting_for_ack if an ack from another OSD is expected
+  7. in waiting_for_ondisk if the op is waiting for a write to complete
+
+Peering Messages
+----------------
+See OSD::handle_pg_(notify|info|log|query)
+
+Peering messages are tagged with two epochs:
+
+  1. epoch_sent: map epoch at which the message was sent
+  2.
query_epoch: map epoch at which the message triggering the message was sent
+
+These are the same in cases where there was no triggering message. We discard
+a peering message if the PG in question has entered a new epoch since the
+message's query_epoch (see PG::old_peering_evt, PG::queue_peering_event).
+Notifies, infos, queries, and logs are all handled as PG::PeeringMachine events and
+are wrapped by PG::queue_* into PG::CephPeeringEvts, which include the created
+state machine event along with epoch_sent and query_epoch in order to
+generically check PG::old_peering_message upon insertion and removal from the
+queue.
+
+Note, notifies, logs, and infos can trigger the creation of a PG. See
+OSD::get_or_create_pg.
+
+
diff --git a/doc/dev/osd_internals/mclock_wpq_cmp_study.rst b/doc/dev/osd_internals/mclock_wpq_cmp_study.rst
new file mode 100644
index 000000000..31ad18409
--- /dev/null
+++ b/doc/dev/osd_internals/mclock_wpq_cmp_study.rst
@@ -0,0 +1,476 @@
+=========================================
+ QoS Study with mClock and WPQ Schedulers
+=========================================
+
+Introduction
+============
+
+The mClock scheduler provides three controls for each service using it. In Ceph,
+the services using mClock are, for example, client I/O, background recovery,
+scrub, snap trim, and PG deletes. The three controls, *weight*,
+*reservation* and *limit*, are used for predictable allocation of resources to
+each service in proportion to its weight, subject to the constraint that the
+service receives at least its reservation and no more than its limit. In Ceph,
+these controls are used to allocate IOPS for each service type provided the IOPS
+capacity of each OSD is known. The mClock scheduler is based on
+`the dmClock algorithm`_. See :ref:`dmclock-qos` section for more details.
+
+Ceph's use of mClock was primarily experimental and approached with an
+exploratory mindset.
This is still true with other organizations and individuals
+who continue to either use the codebase or modify it according to their needs.
+
+DmClock exists in its own repository_. Before the Ceph *Pacific* release,
+mClock could be enabled by setting the :confval:`osd_op_queue` Ceph option to
+"mclock_scheduler". Additional mClock parameters like *reservation*, *weight*
+and *limit* for each service type could be set using Ceph options.
+For example, ``osd_mclock_scheduler_client_[res,wgt,lim]`` is one such option.
+See :ref:`dmclock-qos` section for more details. Even with all the mClock
+options set, the full capability of mClock could not be realized due to:
+
+- Unknown OSD capacity in terms of throughput (IOPS).
+- No limit enforcement. In other words, services using mClock were allowed to
+  exceed their limits, resulting in the desired QoS goals not being met.
+- The share of each service type not being distributed across the number of
+  operational shards.
+
+To resolve the above, refinements were made to the mClock scheduler in the Ceph
+code base. See :doc:`/rados/configuration/mclock-config-ref`. With the
+refinements, the usage of mClock is a bit more user-friendly and intuitive. This
+is one step of many to refine and optimize the way mClock is used in Ceph.
+
+Overview
+========
+
+A comparison study was performed as part of efforts to refine the mClock
+scheduler. The study involved running tests with client ops and background
+recovery operations in parallel with the two schedulers. The results were
+collated and then compared. The following statistics were compared between the
+schedulers from the test results for each service type:
+
+- External client
+
+  - Average throughput (IOPS)
+  - Average and percentile (95th, 99th, 99.5th) latency
+
+- Background recovery
+
+  - Average recovery throughput
+  - Number of misplaced objects recovered per second
+
+Test Environment
+================
+
+1.
**Software Configuration**: CentOS 8.1.1911 Linux Kernel 4.18.0-193.6.3.el8_2.x86_64
+2. **CPU**: 2 x Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz
+3. **nproc**: 40
+4. **System Memory**: 64 GiB
+5. **Tuned-adm Profile**: network-latency
+6. **CephVer**: 17.0.0-2125-g94f550a87f (94f550a87fcbda799afe9f85e40386e6d90b232e) quincy (dev)
+7. **Storage**:
+
+   - Intel® NVMe SSD DC P3700 Series (SSDPE2MD800G4) [4 x 800GB]
+   - Seagate Constellation 7200 RPM 64MB Cache SATA 6.0Gb/s HDD (ST91000640NS) [4 x 1TB]
+
+Test Methodology
+================
+
+Ceph cbt_ was used to test the recovery scenarios. A new recovery test to
+generate background recoveries with client I/Os in parallel was created.
+See the next section for the detailed test steps. The test was executed 3 times
+with the default *Weighted Priority Queue (WPQ)* scheduler for comparison
+purposes. This was done to establish a credible mean value against which to
+compare the mClock scheduler results at a later point.
+
+After this, the same test was executed with the mClock scheduler and with different
+mClock profiles, i.e., *high_client_ops*, *balanced* and *high_recovery_ops*, and
+the results were collated for comparison. With each profile, the test was
+executed 3 times, and the average of those runs is reported in this study.
+
+.. note:: Tests with HDDs were performed with and without the bluestore WAL and
+   DB configured. The charts discussed further below help bring out the
+   comparison across the schedulers and their configurations.
+
+Establish Baseline Client Throughput (IOPS)
+===========================================
+
+Before the actual recovery tests, the baseline throughput was established for
+both the SSDs and the HDDs on the test machine by following the steps mentioned
+in the :doc:`/rados/configuration/mclock-config-ref` document under
+the "Benchmarking Test Steps Using CBT" section.
For this study, the following
+baseline throughput for each device type was determined:
+
++--------------------------------------+-------------------------------------------+
+| Device Type                          | Baseline Throughput (@4KiB Random Writes) |
++======================================+===========================================+
+| **NVMe SSD**                         | 21500 IOPS (84 MiB/s)                     |
++--------------------------------------+-------------------------------------------+
+| **HDD (with bluestore WAL & DB)**    | 340 IOPS (1.33 MiB/s)                     |
++--------------------------------------+-------------------------------------------+
+| **HDD (without bluestore WAL & DB)** | 315 IOPS (1.23 MiB/s)                     |
++--------------------------------------+-------------------------------------------+
+
+.. note:: The :confval:`bluestore_throttle_bytes` and
+   :confval:`bluestore_throttle_deferred_bytes` for SSDs were determined to be
+   256 KiB. For HDDs, it was 40 MiB. The above throughput was obtained
+   by running 4 KiB random writes at a queue depth of 64 for 300 seconds.
+
+MClock Profile Allocations
+==========================
+
+The low-level mClock shares per profile are shown in the tables below. For
+parameters like *reservation* and *limit*, the shares are represented as a
+percentage of the total OSD capacity. For the *high_client_ops* profile, the
+*reservation* parameter is set to 50% of the total OSD capacity. Therefore, for
+the NVMe (baseline 21500 IOPS) device, a minimum of 10750 IOPS is reserved for
+client operations. These allocations are made under the hood once
+a profile is enabled.
+
+The *weight* parameter is unitless. See :ref:`dmclock-qos`.
+
+high_client_ops (default)
+`````````````````````````
+
+This profile allocates more reservation and limit to external client ops
+when compared to background recoveries and other internal clients within
+Ceph. This profile is enabled by default.
+
++------------------------+-------------+--------+-------+
+| Service Type           | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client                 | 50%         | 2      | MAX   |
++------------------------+-------------+--------+-------+
+| background recovery    | 25%         | 1      | 100%  |
++------------------------+-------------+--------+-------+
+| background best effort | 25%         | 2      | MAX   |
++------------------------+-------------+--------+-------+
+
+balanced
+````````
+
+This profile allocates equal reservations to client ops and background
+recovery ops. The internal best effort clients get a lower reservation
+but a very high limit so that they can complete quickly if there are
+no competing services.
+
++------------------------+-------------+--------+-------+
+| Service Type           | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client                 | 40%         | 1      | 100%  |
++------------------------+-------------+--------+-------+
+| background recovery    | 40%         | 1      | 150%  |
++------------------------+-------------+--------+-------+
+| background best effort | 20%         | 2      | MAX   |
++------------------------+-------------+--------+-------+
+
+high_recovery_ops
+`````````````````
+
+This profile allocates more reservation to background recoveries when
+compared to external clients and other internal clients within Ceph. For
+example, an admin may enable this profile temporarily to speed up background
+recoveries during non-peak hours.
+
++------------------------+-------------+--------+-------+
+| Service Type           | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client                 | 30%         | 1      | 80%   |
++------------------------+-------------+--------+-------+
+| background recovery    | 60%         | 2      | 200%  |
++------------------------+-------------+--------+-------+
+| background best effort | 1 (MIN)     | 2      | MAX   |
++------------------------+-------------+--------+-------+
+
+custom
+``````
+
+The custom profile allows the user to have complete control of the mClock
+and Ceph config parameters. To use this profile, the user must have a deep
+understanding of the workings of Ceph and the mClock scheduler. All the
+*reservation*, *weight* and *limit* parameters of the different service types
+must be set manually, along with any Ceph option(s). This profile may be used
+for experimental and exploratory purposes, or if the built-in profiles do not
+meet the requirements. In such cases, adequate testing must be performed prior
+to enabling this profile.
+
+
+Recovery Test Steps
+===================
+
+Before bringing up the Ceph cluster, the following mClock configuration
+parameters were set appropriately based on the baseline throughput obtained
+in the previous section:
+
+- :confval:`osd_mclock_max_capacity_iops_hdd`
+- :confval:`osd_mclock_max_capacity_iops_ssd`
+- :confval:`osd_mclock_profile`
+
+See :doc:`/rados/configuration/mclock-config-ref` for more details.
+
+Test Steps (Using cbt)
+``````````````````````
+
+1. Bring up the Ceph cluster with 4 OSDs.
+2. Configure the OSDs with replication factor 3.
+3. Create a recovery pool to populate recovery data.
+4. Create a client pool and prefill some objects in it.
+5. Create the recovery thread and mark an OSD down and out.
+6. After the cluster handles the OSD down event, recovery data is
+   prefilled into the recovery pool. For the tests involving SSDs, prefill 100K
+   4MiB objects into the recovery pool.
For the tests involving HDDs, prefill
+   5K 4MiB objects into the recovery pool.
+7. After the prefill stage is completed, the downed OSD is brought up and in.
+   The backfill phase starts at this point.
+8. As soon as the backfill/recovery starts, the test proceeds to initiate client
+   I/O on the client pool on another thread using a single client.
+9. During step 8 above, statistics related to the client latency and
+   bandwidth are captured by cbt. The test also captures the total number of
+   misplaced objects and the number of misplaced objects recovered per second.
+
+To summarize, the steps above create two pools during the test. Recovery is
+triggered on one pool and client I/O is triggered on the other simultaneously.
+Statistics captured during the tests are discussed below.
+
+
+Non-Default Ceph Recovery Options
+`````````````````````````````````
+
+Apart from the non-default bluestore throttle already mentioned above, the
+following set of Ceph recovery-related options were modified for tests with both
+the WPQ and mClock schedulers.
+
+- :confval:`osd_max_backfills` = 1000
+- :confval:`osd_recovery_max_active` = 1000
+- :confval:`osd_async_recovery_min_cost` = 1
+
+The above options set a high limit on the number of concurrent local and
+remote backfill operations per OSD. Under these conditions the capability of the
+mClock scheduler was tested and the results are discussed below.
+
+Test Results
+============
+
+Test Results With NVMe SSDs
+```````````````````````````
+
+Client Throughput Comparison
+----------------------------
+
+The chart below shows the average client throughput comparison across the
+schedulers and their respective configurations.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_NVMe_SSD_WPQ_vs_mClock.png
+
+
+WPQ(def) in the chart shows the average client throughput obtained
+using the WPQ scheduler with all other Ceph configuration settings set to
+default values.
The default setting for :confval:`osd_max_backfills` limits the number
+of concurrent local and remote backfills or recoveries per OSD to 1. As a
+result, the average client throughput obtained was impressive at just over
+18000 IOPS when compared to the baseline value of 21500 IOPS.
+
+However, with the WPQ scheduler and the non-default options mentioned in
+section `Non-Default Ceph Recovery Options`_, things are quite different as
+shown in the chart for WPQ(BST). In this case, the average client throughput
+drops dramatically to only 2544 IOPS. The non-default recovery options clearly
+had a significant impact on the client throughput; in other words, recovery
+operations overwhelm the client operations. Sections further below discuss the
+recovery rates under these conditions.
+
+With the non-default options, the same test was executed with mClock and with
+the default profile (*high_client_ops*) enabled. As per the profile allocation,
+the reservation goal of 50% (10750 IOPS) was met with an average throughput
+of 11209 IOPS during the course of recovery operations. This is more than 4x
+the throughput obtained with WPQ(BST).
+
+Similar throughput was obtained with the *balanced* (11017 IOPS) and
+*high_recovery_ops* (11153 IOPS) profiles, as seen in the chart above. This
+clearly demonstrates that mClock is able to provide the desired QoS for the
+client with multiple concurrent backfill/recovery operations in progress.
+
+Client Latency Comparison
+-------------------------
+
+The chart below shows the average completion latency (*clat*) along with the
+average 95th, 99th and 99.5th percentiles across the schedulers and their
+respective configurations.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_NVMe_SSD_WPQ_vs_mClock.png
+
+The average *clat* latency obtained with WPQ(Def) was 3.535 msec.
But in this
+case the number of concurrent recoveries was very limited, averaging around
+97 objects/sec or ~388 MiB/s, which was a major contributing factor to the low
+latency seen by the client.
+
+With WPQ(BST) and the non-default recovery options, things are very
+different: the average *clat* latency shoots up to almost 25 msec, which is
+7x worse! This is due to the high number of concurrent recoveries, which was
+measured at ~350 objects/sec or ~1.4 GiB/s, close to the maximum OSD
+bandwidth.
+
+With mClock enabled and with the default *high_client_ops* profile, the average
+*clat* latency was 5.688 msec, which is impressive considering the high number
+of concurrent active background backfills/recoveries. The recovery rate was
+throttled down by mClock to an average of 80 objects/sec or ~320 MiB/s according
+to the minimum profile allocation of 25% of the maximum OSD bandwidth, thus
+allowing the client operations to meet the QoS goal.
+
+With the other profiles, *balanced* and *high_recovery_ops*, the average
+client *clat* latency didn't change much and stayed between 5.7 - 5.8 msec, with
+variations in the average percentile latency as observed in the chart above.
+
+.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_NVMe_SSD_WPQ_vs_mClock.png
+
+Perhaps a more interesting chart is the comparison chart shown above, which
+tracks the average *clat* latency variations through the duration of the test.
+During the initial phase of the test, for about 150 secs, the differences in
+the average latency between the WPQ scheduler and the mClock profiles are quite
+evident. The *high_client_ops* profile shows the lowest latency, followed by
+the *balanced* and *high_recovery_ops* profiles. WPQ(BST) had the highest
+average latency through the course of the test.
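
The absolute IOPS goals quoted in these results are simple fractions of the
measured OSD capacity, using the shares from the profile tables earlier in this
document. The following minimal sketch (illustrative Python, not Ceph code)
reproduces that arithmetic:

```python
# Illustrative arithmetic only (not Ceph code): translate the mClock profile
# shares from the tables above into absolute IOPS for a given OSD capacity.
PROFILES = {
    # service: (reservation fraction, weight, limit fraction; None means MAX)
    "high_client_ops": {
        "client": (0.50, 2, None),
        "background recovery": (0.25, 1, 1.00),
        "background best effort": (0.25, 2, None),
    },
    "balanced": {
        "client": (0.40, 1, 1.00),
        "background recovery": (0.40, 1, 1.50),
        "background best effort": (0.20, 2, None),
    },
}

def iops_allocations(profile, capacity_iops):
    """Return per-service reservation/limit in IOPS for one OSD."""
    result = {}
    for service, (res, weight, lim) in PROFILES[profile].items():
        result[service] = {
            "reservation": round(capacity_iops * res),
            "weight": weight,  # unitless
            "limit": None if lim is None else round(capacity_iops * lim),
        }
    return result

# NVMe baseline from this study: 21500 IOPS
print(iops_allocations("high_client_ops", 21500)["client"]["reservation"])  # 10750
```

The same fractions apply to the HDD baselines (340 and 315 IOPS), which is why
the ~271 IOPS observed later under *high_client_ops* on HDDs comfortably
exceeds the 50% reservation of 170 IOPS.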
+
+Recovery Statistics Comparison
+------------------------------
+
+Another important aspect to consider is how the recovery bandwidth and recovery
+time are affected by mClock profile settings. The chart below outlines the
+recovery rates and times for each mClock profile and how they differ from the
+WPQ scheduler. The total number of objects to be recovered in all the cases was
+around 75000 objects, as observed in the chart below.
+
+.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_NVMe_SSD_WPQ_vs_mClock.png
+
+Intuitively, the *high_client_ops* profile should impact recovery operations
+the most, and this is indeed the case: it took an average of 966 secs for the
+recovery to complete at 80 objects/sec. The recovery bandwidth, as expected,
+was the lowest at an average of ~320 MiB/s.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_NVMe_SSD_WPQ_vs_mClock.png
+
+The *balanced* profile provides a good middle ground by allocating the same
+reservation and weight to client and recovery operations. The recovery rate
+curve falls between the *high_recovery_ops* and *high_client_ops* curves, with
+an average bandwidth of ~480 MiB/s, taking an average of ~647 secs at ~120
+objects/sec to complete the recovery.
+
+The *high_recovery_ops* profile provides the fastest way to complete recovery
+operations at the expense of other operations. The recovery bandwidth was
+nearly 2x that of the *high_client_ops* profile at ~635 MiB/s. The average
+object recovery rate was ~159 objects/sec, and recovery completed the fastest
+in approximately 488 secs.
+
+Test Results With HDDs (WAL and dB configured)
+``````````````````````````````````````````````
+
+The recovery tests were performed on HDDs with the bluestore WAL and dB
+configured on faster NVMe SSDs. The baseline throughput measured was 340 IOPS.
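
As a sanity check on the recovery figures reported in these sections, the
bandwidth and completion-time numbers follow directly from the object recovery
rate and the 4 MiB prefill object size: bandwidth is roughly rate × 4 MiB and
completion time is roughly objects ÷ rate. A quick sketch (illustrative, not
Ceph code), using the NVMe numbers from the section above:

```python
# Illustrative arithmetic only: relate the object recovery rate to recovery
# bandwidth and approximate completion time, given 4 MiB prefill objects.
OBJECT_MIB = 4  # prefill object size from the test steps

def recovery_estimates(total_objects, rate_objects_per_sec):
    return {
        "bandwidth_mib_s": rate_objects_per_sec * OBJECT_MIB,
        "time_s": total_objects / rate_objects_per_sec,
    }

# high_client_ops on NVMe SSDs: ~80 objects/sec over ~75000 objects
est = recovery_estimates(75000, 80)
print(est["bandwidth_mib_s"])  # 320 -> matches the ~320 MiB/s observed
print(round(est["time_s"]))    # 938 -> close to the ~966 secs measured
```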
+
+Client Throughput & Latency Comparison
+--------------------------------------
+
+The average client throughput comparison for WPQ and mClock and its profiles
+is shown in the chart below.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_WALdB_WPQ_vs_mClock.png
+
+With WPQ(Def), the average client throughput obtained was ~308 IOPS, since
+the number of concurrent recoveries was very limited. The average *clat*
+latency was ~208 msec.
+
+However, for WPQ(BST), the concurrent recoveries affected the client
+throughput significantly: it dropped to 146 IOPS, with an average *clat*
+latency of 433 msec.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_WALdB_WPQ_vs_mClock.png
+
+With the *high_client_ops* profile, mClock was able to meet the QoS requirement
+for client operations with an average throughput of 271 IOPS, which is nearly
+80% of the baseline throughput, at an average *clat* latency of 235 msec.
+
+For the *balanced* and *high_recovery_ops* profiles, the average client
+throughput came down marginally to ~248 IOPS and ~240 IOPS respectively. The
+average *clat* latency, as expected, increased to ~258 msec and ~265 msec
+respectively.
+
+.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_WALdB_WPQ_vs_mClock.png
+
+The *clat* latency comparison chart above provides a more comprehensive insight
+into the differences in latency through the course of the test. As observed
+in the NVMe SSD case, the *high_client_ops* profile shows the lowest latency in
+the HDD case as well, followed by the *balanced* and *high_recovery_ops*
+profiles. This is fairly easy to discern between the profiles during the first
+200 secs of the test.
+
+Recovery Statistics Comparison
+------------------------------
+
+The charts below compare the recovery rates and times. The total number of
+objects to be recovered in all the cases using HDDs with WAL and dB was around
+4000 objects, as observed in the chart below.
+
+.. 
image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_WALdB_WPQ_vs_mClock.png
+
+As expected, the *high_client_ops* profile impacts recovery operations the
+most: it took an average of ~1409 secs for the recovery to complete at ~3
+objects/sec. The recovery bandwidth, as expected, was the lowest at ~11 MiB/s.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_WALdB_WPQ_vs_mClock.png
+
+The *balanced* profile, as expected, provides a decent compromise, with an
+average bandwidth of ~16.5 MiB/s, taking an average of ~966 secs at ~4
+objects/sec to complete the recovery.
+
+The *high_recovery_ops* profile is the fastest, with nearly 2x the bandwidth at
+~21 MiB/s when compared to the *high_client_ops* profile. The average object
+recovery rate was ~5 objects/sec, and recovery completed in approximately 747
+secs. This is somewhat similar to the recovery time observed with WPQ(Def) at
+647 secs, with a bandwidth of 23 MiB/s at a rate of 5.8 objects/sec.
+
+Test Results With HDDs (No WAL and dB configured)
+`````````````````````````````````````````````````
+
+The recovery tests were also performed on HDDs without the bluestore WAL and
+dB configured. The baseline throughput measured was 315 IOPS.
+
+This type of configuration without WAL and dB is probably rare,
+but testing was nevertheless performed to get a sense of how mClock performs
+under a very restrictive environment where the OSD capacity is at the lower end.
+The sections and charts below are very similar to the ones presented above and
+are provided here for reference.
+
+Client Throughput & Latency Comparison
+--------------------------------------
+
+The average client throughput, latency and percentiles are compared as before
+in the set of charts shown below.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_NoWALdB_WPQ_vs_mClock.png
+
+.. 
image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png
+
+Recovery Statistics Comparison
+------------------------------
+
+The recovery rates and times are shown in the charts below.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png
+
+.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png
+
+Key Takeaways and Conclusion
+============================
+
+- mClock is able to provide the desired QoS using profiles to allocate proper
+  *reservation*, *weight* and *limit* to the service types.
+- By using the cost per I/O and the cost per byte parameters, mClock can
+  schedule operations appropriately for the different device types (SSD/HDD).
+
+The study so far shows promising results with the refinements made to the mClock
+scheduler. Further refinements to mClock and profile tuning are planned, along
+with improvements based on feedback from broader testing on larger clusters and
+with different workloads.
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
+.. _repository: https://github.com/ceph/dmclock
+.. _cbt: https://github.com/ceph/cbt
diff --git a/doc/dev/osd_internals/osd_overview.rst b/doc/dev/osd_internals/osd_overview.rst
new file mode 100644
index 000000000..192ddf8ca
--- /dev/null
+++ b/doc/dev/osd_internals/osd_overview.rst
@@ -0,0 +1,106 @@
+===
+OSD
+===
+
+Concepts
+--------
+
+*Messenger*
+  See src/msg/Messenger.h
+
+  Handles sending and receipt of messages on behalf of the OSD. The OSD uses
+  two messengers:
+
+  1. cluster_messenger - handles traffic to other OSDs and monitors
+  2. client_messenger - handles client traffic
+
+  This division allows the OSD to be configured with different interfaces for
+  client and cluster traffic.
+
+*Dispatcher*
+  See src/msg/Dispatcher.h
+
+  The OSD implements the Dispatcher interface.
Of particular note is ms_dispatch, + which serves as the entry point for messages received via either the client + or cluster messenger. Because there are two messengers, ms_dispatch may be + called from at least two threads. The osd_lock is always held during + ms_dispatch. + +*WorkQueue* + See src/common/WorkQueue.h + + The WorkQueue class abstracts the process of queueing independent tasks + for asynchronous execution. Each OSD process contains workqueues for + distinct tasks: + + 1. OpWQ: handles ops (from clients) and subops (from other OSDs). + Runs in the op_tp threadpool. + 2. PeeringWQ: handles peering tasks and pg map advancement + Runs in the op_tp threadpool. + See Peering + 3. CommandWQ: handles commands (pg query, etc) + Runs in the command_tp threadpool. + 4. RecoveryWQ: handles recovery tasks. + Runs in the recovery_tp threadpool. + 5. SnapTrimWQ: handles snap trimming + Runs in the disk_tp threadpool. + See SnapTrimmer + 6. ScrubWQ: handles primary scrub path + Runs in the disk_tp threadpool. + See Scrub + 7. ScrubFinalizeWQ: handles primary scrub finalize + Runs in the disk_tp threadpool. + See Scrub + 8. RepScrubWQ: handles replica scrub path + Runs in the disk_tp threadpool + See Scrub + 9. RemoveWQ: Asynchronously removes old pg directories + Runs in the disk_tp threadpool + See PGRemoval + +*ThreadPool* + See src/common/WorkQueue.h + See also above. + + There are 4 OSD threadpools: + + 1. op_tp: handles ops and subops + 2. recovery_tp: handles recovery tasks + 3. disk_tp: handles disk intensive tasks + 4. command_tp: handles commands + +*OSDMap* + See src/osd/OSDMap.h + + The crush algorithm takes two inputs: a picture of the cluster + with status information about which nodes are up/down and in/out, + and the pgid to place. The former is encapsulated by the OSDMap. + Maps are numbered by *epoch* (epoch_t). These maps are passed around + within the OSD as std::tr1::shared_ptr<const OSDMap>. 
+ + See MapHandling + +*PG* + See src/osd/PG.* src/osd/PrimaryLogPG.* + + Objects in rados are hashed into *PGs* and *PGs* are placed via crush onto + OSDs. The PG structure is responsible for handling requests pertaining to + a particular *PG* as well as for maintaining relevant metadata and controlling + recovery. + +*OSDService* + See src/osd/OSD.cc OSDService + + The OSDService acts as a broker between PG threads and OSD state which allows + PGs to perform actions using OSD services such as workqueues and messengers. + This is still a work in progress. Future cleanups will focus on moving such + state entirely from the OSD into the OSDService. + +Overview +-------- + See src/ceph_osd.cc + + The OSD process represents one leaf device in the crush hierarchy. There + might be one OSD process per physical machine, or more than one if, for + example, the user configures one OSD instance per disk. + diff --git a/doc/dev/osd_internals/osdmap_versions.txt b/doc/dev/osd_internals/osdmap_versions.txt new file mode 100644 index 000000000..2bf247dcf --- /dev/null +++ b/doc/dev/osd_internals/osdmap_versions.txt @@ -0,0 +1,259 @@ +releases: + + <0.48 pre-argonaut, dev + 0.48 argonaut + 0.56 bobtail + 0.61 cuttlefish + 0.67 dumpling + 0.72 emperor + 0.80 firefly + 0.87 giant + 0.94 hammer + 9.1.0 infernalis rc + 9.2.0 infernalis + 10.2.0 jewel + 11.2.0 kraken + 12.2.0 luminous + 13.2.0 mimic + 14.2.0 nautilus (to-be) + +osdmap: + +type / v / cv / ev / commit / version / date + +map / 1 / - / - / 017788a6ecb570038632de31904dd2e1314dc7b7 / 0.11 / 2009 +inc / 1 / - / - / + * initial +map / 2 / - / - / 020350e19a5dc03cd6cedd7494e434295580615f / 0.13 / 2009 +inc / 2 / - / - / + * pg_temp +map / 3 / - / - / 1ebcebf6fff056a0c0bdf82dde69356e271be27e / 0.19 / 2009 +inc / 3 / - / - / + * heartbeat_addr +map / 4 / - / - / 3ced5e7de243edeccfd20a90ec2034206c920795 / 0.19 / 2010 +inc / 4 / - / - / + * pools removed from map +map / 5 / - / 5 / c4892bed6f49df396df3cbf9ed561c7315bd2442 / 0.20 
/ 2010 +inc / 5 / - / 5 / + * pool names moved to first part of encoding + * adds CEPH_OSDMAP_INC_VERSION_EXT (for extended part of map) + * adds CEPH_OSDMAP_VERSION_EXT (for extended part of map) + * adds 'ev' (extended version) during encode() and decode +map / 5 / - / 5 / bc9cb9311f1b946898b5256eab500856fccf5c83 / 0.22 / 2010 +inc / 5 / - / 6 / + * separate up client/osd + * increments CEPH_OSDMAP_INC_VERSION_EXT to 6 + * CEPH_OSDMAP_INC_VERSION stays at 5 +map / 5 / - / 6 / 7f70112052c7fc3ba46f9e475fa575d85e8b16b2 / 0.22 / 2010 +inc / 5 / - / 6 / + * add osd_cluster_addr to full map + * increments CEPH_OSDMAP_VERSION_EXT to 6 + * CEPH_OSDMAP_VERSION stays at 5 +map / 5 / - / 7 / 2ced4e24aef64f2bc7d55b73abb888c124512eac / 0.28 / 2011 +inc / 5 / - / 7 / + * add cluster_snapshot field + * increments CEPH_OSDMAP_VERSION_EXT to 7 + * increments CEPH_OSDMAP_INC_VERSION_EXT to 7 + * CEPH_OSDMAP_INC_VERSION stays at 5 + * CEPH_OSDMAP_VERSION stays at 5 +map / 6 / - / 7 / d1ce410842ca51fad3aa100a52815a39e5fe6af6 / 0.35 / 2011 +inc / 6 / - / 7 / + * encode/decode old + new versions + * adds encode_client_old() (basically transitioning from uint32 to + uint64) + * bumps osdmap version to 6, old clients stay at 5 + * starts using in-function versions (i.e., _u16 v = 6) +map / 6 / - / 7 / b297d1edecaf31a48cff6c37df2ee266e51cdec1 / 0.38 / 2011 +inc / 6 / - / 7 / + * make encoding conditional based on features + * essentially checks whether features & CEPH_FEATURE_PGID64 and opts + to either use encode_client_old() or encode() +map / 6 / - / 7 / 0f0c59478894c9ca7fa04fc32e854648192a9fae / 0.38 / 2011 +inc / 6 / - / 7 / + * move stuff from osdmap.h to osdmap.cc +map / 6 / 8 / ca4311e5e39cec8fad85fad3e67eea968707e9eb / 0.47 / 2012 +inc / 6 / 8 / + * store uuid per osd + * bumps osdmap::incremental extended version to 8; in function + * bumps osdmap's extended version to 8; in function +map / 6 / - / 8 / 5125daa6d78e173a8dbc75723a8fdcd279a44bcd / 0.47 / 2012 +inc / 6 / - / 8 / + 
* drop defines + * drops defines for CEPH_OSDMAP_*_VERSION from rados.h +map / 6 / 9 / e9f051ef3c49a080b24d7811a16aefb64beacbbd / 0.53 / 2012 +inc / 6 / 9 / + * add osd_xinfo_t + * osdmap::incremental ext version bumped to 9 + * osdmap's ext version bumped to 9 + * because we're adding osd_xinfo_t to the map +map / 6 / - / 10 / 1fee4ccd5277b52292e255daf458330eef5f0255 / 0.64 / 2013 +inc / 6 / - / 10 / + * encode front hb addr + * osdmap::incremental ext version bumped to 10 + * osdmap's ext version bumped to 10 + * because we're adding osd_addrs->hb_front_addr to map + +// below we have the change to ENCODE_START() for osdmap and others +// this means client-usable data and extended osd data get to have their +// own ENCODE_START()'s, hence their versions start at 1 again. + +map / 7 / 1 / 1 / 3d7c69fb0986337dc72e466dc39d93e5ab406062 / 0.77 / 2014 +inc / 7 / 1 / 1 / b55c45e85dbd5d2513a4c56b3b74dcafd03f20b1 / 0.77 / 2014 + * introduces ENCODE_START() approach to osdmap, and the 'features' + argument we currently see in ::encode() functions + * same, but for osdmap::incremental +map / 7 / 1 / 1 / b9208b47745fdd53d36b682bebfc01e913347092 / 0.77 / 2014 +inc / 7 / 1 / 2 / + * include features argument in incremental. +map / 7 / 2 / 1 / cee914290c5540eb1fb9d70faac70a581381c29b / 0.78 / 2014 +inc / 7 / 2 / 2 / + * add osd_primary_affinity +map / 7 / 3 / 1 / c4f8f265955d54f33c79cde02c1ab2fe69ab1ab0 / 0.78 / 2014 +inc / 7 / 3 / 2 / + * add new/old erasure code profiles +map / 8 / 3 / 1 / 3dcf5b9636bb9e0cd6484d18f151b457e1a0c328 / 0.91 / 2014 +inc / 8 / 3 / 2 / + * encode crc +map / 8 / 3 / 1 / 04679c5451e353c966f6ed00b33fa97be8072a79 / 9.1.0 / 2015 +inc / 8 / 3 / 2 / + * simply ensures encode_features are filled to CEPH_FEATURE_PGID64 when + decoding an incremental if struct_v >= 6; else keeps it at zero. 
+ * otherwise, if we get an incremental from hammer (which has + struct_v = 6) we would be decoding it as if it were a map from before + CEPH_FEATURES_PGID64 (which was introduced in 0.35, pre-argonaut) +map / 8 / 3 / 2 / 5c6b9d9dcd0a225e3a2b154c20a623868c269346 / 12.0.1 / 2017 +inc / 8 / 3 / 3 / + * add (near)full_ratio + * used to live in pgmap, moving to osdmap for luminous + * conditional on SERVER_LUMINOUS feature being present + * osdmap::incremental::encode(): conditional on ev >= 3 + * osdmap::incremental::decode(): conditional on ev >= 3, else -1 + * osdmap::encode(): conditional on ev >= 2 + * osdmap::decode(): conditional on ev >= 0, else 0 +map / 8 / 4 / 2 / 27d6f4373bafa24450f6dbb4e4252c2d9c2c1448 / 12.0.2 / 2017 +inc / 8 / 4 / 3 / + * add pg_remap and pg_remap_items + * first forces a pg to map to a particular value; second replaces + specific osds with specific other osds in crush mapping. + * inc conditional on SERVER_LUMINOUS feature being present + * osdmap::incremental::encode(): conditional on cv >= 4 + * osdmap::incremental::decode(): conditional on cv >= 4 + * map conditional on OSDMAP_REMAP feature being present + * osdmap::encode(): if not feature, cv = 3; encode on cv >= 4 + * osdmap::decode(): conditional on cv >= 4 +map / 8 / 4 / 3 / 27d6f4373bafa24450f6dbb4e4252c2d9c2c1448 / 12.0.2 / 2017 +inc / 8 / 4 / 4 / + * handle backfillfull_ratio like nearfull and full + * inc: + * osdmap::incremental::encode(): conditional on ev >= 3 + * osdmap::incremental::decode(): conditional on ev >= 4, else -1 + * map: + * osdmap::encode(): conditional on ev >= 2 + * osdmap::decode(): conditional on ev >= 3, else 0 +map / 8 / 4 / 3 / a1c66468232002c9f36033226f5db0a5751e8d18 / 12.0.3 / 2017 +inc / 8 / 4 / 4 / + * add require_min_compat_client field + * inc: + * osdmap::incremental::encode() conditional on ev >= 4 + * osdmap::incremental::decode() conditional on ev >= 4 + map: + * osdmap::encode() conditional on ev >= 3 + * osdmap::decode() conditional on ev 
>= 3 +map / 8 / 4 / 4 / 4a09e9431de3084b1ca98af11b28f822fde4ffbe / 12.0.3 / 2017 +inc / 8 / 4 / 5 / + * bumps encoding version for require_min_compat_client + * otherwise osdmap::decode() would throw exception when decoding + old maps + * inc: + * osdmap::incremental::encode() no conditional on ev >= 3 + * osdmap::incremental::decode() conditional on ev >= 5 + * map: + * osdmap::encode() conditional on ev >= 2 + * osdmap::decode() conditional on ev >= 4 +map / 8 / 4 / 5 / 3d4c4d9d9da07e1456331c43acc998d2008ca8ea / 12.1.0 / 2017 +inc / 8 / 4 / 6 / + * add require_osd_release numeric field + * new_require_min_compat_client: + * osdmap::incremental::encode() conditional on ev >= 5 + * osdmap::encode() conditional on ev >= 4 + * require_osd_release: + * osdmap::incremental::encode() conditional on ev >= 6 + * osdmap::incremental::decode() conditional on ev >= 6 (else, -1) + * osdmap::encode() conditional on ev >= 5 + * osdmap::decode() conditional on ev >= 5 (else, -1) +map / 8 / 4 / 5 / f22997e24bda4e6476e15d5d4ad9737861f9741f / 12.1.0 / 2017 +inc / 8 / 4 / 6 / + * switch (require_)min_compat_client to integers instead of strings + * osdmap::incremental::encode() conditional on ev >= 6 + * osdmap::incremental::decode(): + * if ev == 5, decode string and translate to release number + * if ev >= 6, decode integer + * osdmap::encode() conditional on ev >= 4 + * osdmap::decode(): + * if ev == 4, decode string and translate to release number + * if ev >= 5, decode integer +map / 8 / 4 / 6 / a8fb39b57884d96201fa502b17bc9395ec38c1b3 / 12.1.0 / 2017 +inc / 8 / 5 / 6 / + * make incremental's `new_state` 32 bits instead of 8 bits + * implies forcing 8 bits on + * osdmap::incremental::encode_client_old() + * osdmap::incremental::encode_classic() + * osdmap::incremental::decode_classic() + * osdmap::incremental::encode() conditional on cv >= 5, else force 8b. + * osdmap::incremental::decode() conditional on cv >= 5, else force 8b. 
+map / 8 / 5 / 6 / 3c1e58215bbb98f71aae30904f9010a57a58da81 / 12.1.0 / 2017 +inc / 8 / 5 / 6 / + * same as above +map / 8 / 6 / 6 / 48158ec579b708772fae82daaa6cb5dcaf5ac5dd / 12.1.0 / 2017 +inc / 8 / 5 / 6 / + * add crush_version + * osdmap::encode() conditional on cv >= 6 + * osdmap::decode() conditional on cv >= 6 +map / 8 / 7 / 6 / 553048fbf97af999783deb7e992c8ecfa5e55500 / 13.0.2 / 2017 +inc / 8 / 6 / 6 / + * track newly removed and purged snaps in each epoch + * new_removed_snaps + * new_purged_snaps + * osdmap::encode() conditional on cv >= 7 + * if SERVER_MIMIC not in features, cv = 6 + * osdmap::decode() conditional cv >= 7 +map / 8 / 8 / 6 / f99c2a9fec65ad3ce275ef24bd167ee03275d3d7 / 14.0.1 / 2018 +inc / 8 / 7 / 6 / + * fix pre-addrvec compat + * osdmap::encode() conditional on cv >= 8, else encode client addrs + one by one in a loop. + * osdmap::decode() just bumps version (?) +map / 8 / 8 / 7 / 9fb1e521c7c75c124b0dbf193e8b65ff1b5f461e / 14.0.1 / 2018 +inc / 8 / 7 / 7 / + * make cluster addrs into addrvecs too + * this will allow single-step upgrade from msgr1 to msgr2 +map / 8 / 9 / 7 / d414f0b43a69f3c2db8e454d795be881496237c6 / 14.0.1 / 2018 +inc / 8 / 8 / 7 / + * store last_up_change and last_in_change + * osdmap::encode() conditional on cv >= 9 + * osdmap::decode() conditional on cv >= 9 + + + +osd_info_t: +v / commit / version / date / reason + +1 / e574c84a6a0c5a5070dc72d5f5d3d17914ef824a / 0.19 / 2010 / add struct_v + +osd_xinfo_t: +v / commit / version / date + +1 / e9f051ef3c49a080b24d7811a16aefb64beacbbd / 0.53 / 2012 + * add osd_xinfo_t +2 / 31743d50a109a463d664ec9cf764d5405db507bd / 0.75 / 2013 + * add features bit mask to osd_xinfo_t +3 / 87722a42c286d4d12190b86b6d06d388e2953ba0 / 0.82 / 2014 + * remember osd weight when auto-marking osds out + +rados.h: +v / commit / version / date / reason + +- / 147c6f51e34a875ab65624df04baa8ef89296ddd / 0.19 / 2010 / move versions + 3 / CEPH_OSDMAP_INC_VERSION + 3 / CEPH_OSDMAP_VERSION + 2 / 
CEPH_PG_POOL_VERSION
diff --git a/doc/dev/osd_internals/partial_object_recovery.rst b/doc/dev/osd_internals/partial_object_recovery.rst
new file mode 100644
index 000000000..a22f63348
--- /dev/null
+++ b/doc/dev/osd_internals/partial_object_recovery.rst
@@ -0,0 +1,148 @@
+=======================
+Partial Object Recovery
+=======================
+
+Partial Object Recovery improves the efficiency of log-based recovery (as
+opposed to backfill). Log-based recovery calculates the missing_set based on
+pg_log differences.
+
+Without Partial Object Recovery, the whole object is recovered from one OSD
+to another whenever the pg_log indicates that the object was modified,
+regardless of how much of its content was actually modified. That means a 4M
+object with just 4k modified inside is recovered as the whole 4M object
+rather than just the modified 4k of content. In addition, the object map is
+recovered even if it was not modified at all.
+
+Partial Object Recovery is designed to solve this problem. To achieve that,
+two things are needed:
+
+1. logging where the object is modified
+2. logging whether the object_map of an object is modified
+
+The class ObjectCleanRegion is introduced for this purpose:
+clean_offsets is an interval_set<uint64_t>
+used to indicate the unmodified content in an object;
+clean_omap is a bool indicating whether the object_map is modified;
+new_object indicates that the object does not yet exist on the OSD;
+max_num_intervals is an upper bound on the number of intervals in clean_offsets
+so that the memory cost of clean_offsets is always bounded.
+
+The shortest clean interval will be trimmed if the number of intervals
+in clean_offsets exceeds the bound.
+
+  e.g. 
max_num_intervals=2, clean_offsets: {[5~10], [20~5]}
+
+  then a new interval [30~10] will evict the shortest one, [20~5],
+
+  and clean_offsets finally becomes {[5~10], [30~10]}
+
+Procedures for Partial Object Recovery
+======================================
+
+First, OpContext and pg_log_entry_t contain an ObjectCleanRegion.
+In do_osd_ops(), finish_copyfrom() and finish_promote(), the corresponding
+content in the ObjectCleanRegion is marked dirty in order to trace the
+modifications made to an object.
+The ObjectCleanRegion in the OpContext is then recorded in its pg_log_entry_t.
+
+Second, the pg_missing_set must be built and rebuilt correctly.
+When calculating the pg_missing_set during the peering process,
+the ObjectCleanRegion in each pg_log_entry_t is also merged.
+
+  e.g. object aa has pg_log:
+  26'101 {[0~4096, 8192~MAX], false}
+
+  26'104 {[0~8192, 12288~MAX], false}
+
+  28'108 {[0~12288, 16384~MAX], true}
+
+  missing_set for object aa: merging the pg_log entries above yields
+  {[0~4096, 16384~MAX], true},
+  which means 4096~16384 was modified and the object_map was also modified
+  as of version 28'108.
+
+The OSD may also crash after the log is merged.
+Therefore, we need to read_log and rebuild the pg_missing_set. For example,
+the pg_log is:
+
+  object aa: 26'101 {[0~4096, 8192~MAX], false}
+
+  object bb: 26'102 {[0~4096, 8192~MAX], false}
+
+  object cc: 26'103 {[0~4096, 8192~MAX], false}
+
+  object aa: 26'104 {[0~8192, 12288~MAX], false}
+
+  object dd: 26'105 {[0~4096, 8192~MAX], false}
+
+  object aa: 28'108 {[0~12288, 16384~MAX], true}
+
+Suppose bb, cc and dd have been recovered, but aa has not.
+We then need to rebuild the pg_missing_set for object aa,
+and we find that aa was modified at version 28'108.
+If the version in object_info is 26'96 < 28'108,
+whole-object recovery would not need to consider 26'104 and 26'101, because
+the whole object would be recovered anyway.
+Partial Object Recovery, however, also requires us to rebuild the
+ObjectCleanRegion.
+
+Knowing whether the object is modified is not enough.
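
The interval bookkeeping described above — bounded trimming of clean_offsets
and merging clean regions across log entries by intersection — can be sketched
as follows (illustrative Python using (offset, length) pairs, not the actual
interval_set<uint64_t> C++ code; the helper names are assumptions):

```python
# Illustrative sketch, not Ceph's C++ code: clean_offsets modeled as a sorted
# list of (offset, length) pairs.
MAX = 1 << 60          # stand-in for the MAX offset used in the examples
MAX_NUM_INTERVALS = 2  # the bound on clean_offsets from the example above

def add_clean(clean_offsets, interval):
    """Insert a clean interval, evicting the shortest one if over the bound."""
    intervals = clean_offsets + [interval]
    if len(intervals) > MAX_NUM_INTERVALS:
        intervals.remove(min(intervals, key=lambda iv: iv[1]))
    return sorted(intervals)

def intersect(a, b):
    """Merge two clean-region lists: only ranges clean in both remain clean."""
    out = []
    for off1, len1 in a:
        for off2, len2 in b:
            start, end = max(off1, off2), min(off1 + len1, off2 + len2)
            if end > start:
                out.append((start, end - start))
    return sorted(out)

# Trimming example from above: {[5~10], [20~5]} plus [30~10] evicts [20~5]
print(add_clean([(5, 10), (20, 5)], (30, 10)))  # [(5, 10), (30, 10)]

# Merging the clean regions of the three log entries for object aa leaves
# [0~4096] (and the tail past 16384) clean, as in the example above.
merged = intersect([(0, 4096), (8192, MAX)], [(0, 8192), (12288, MAX)])
merged = intersect(merged, [(0, 12288), (16384, MAX)])
print(merged[0])  # (0, 4096)
```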
+
+Therefore, we also need to traverse the earlier pg_log entries: 26'104
+and 26'101 are also newer than object_info (26'96), so the
+pg_missing_set for object aa is rebuilt from all three log entries:
+28'108, 26'104 and 26'101. The logs are merged in the same way as
+described above.
+
+Finally, the push and pull process is completed based on the
+pg_missing_set. copy_subset in recovery_info is updated from the
+ObjectCleanRegion in the pg_missing_set; copy_subset indicates the
+intervals of content that need to be pushed and pulled.
+
+The complicated part here is submit_push_data, where several cases must
+be considered separately. The question is how to deal with the object
+data, which is made up of omap_header, xattrs, omap and data:
+
+case 1: first && complete: since the object is recovered in a single
+PushOp, we would like to preserve the original object and overwrite it
+directly. The object is not removed and replaced with a new one.
+
+ issue 1: As the object is not removed, the old xattrs remain in it,
+ but they may have been updated in the new version. Overwriting the
+ same key or adding new keys is correct, but removed keys would be
+ missed. To solve this issue, we need to remove all the original
+ xattrs of the object and then write the new xattrs.
+
+ issue 2: As the object is not removed, the object_map may or may not
+ be recovered, depending on clean_omap. If the omap is being
+ recovered, we need to remove the old omap of the object for the same
+ reason, since an omap update may also be a deletion. Thus, in this
+ case, we should do:
+
+ 1) clear the xattrs of the object
+ 2) clear the omap of the object if omap recovery is needed
+ 3) truncate the object to recovery_info.size
+ 4) recover the omap_header
+ 5) recover the xattrs, and recover the omap if needed
+ 6) punch zeros in the original object where fiemap reports nothing
+ 7) overwrite the object content that was modified
+ 8) finish recovery
+
+case 2: first && !complete: the object is recovered over multiple
+PushOps.
+
+Here, target_oid indicates a new temp_object in pgid_TEMP, so the
+issues are a bit different.
+
+ issue 1: As the object is newly created, there is no need to deal
+ with xattrs.
+
+ issue 2: As the object is newly created, the object_map may not be
+ transmitted, depending on clean_omap. Therefore, if clean_omap is
+ true, we need to clone the object_map from the original object.
+
+ issue 3: As the object is newly created, unmodified data will not be
+ transmitted. Therefore, we need to clone the unmodified data from the
+ original object. Thus, in this case, we should do:
+
+ 1) remove the temp object
+ 2) create a new temp object
+ 3) set alloc_hint for the new temp object
+ 4) truncate the new temp object to recovery_info.size
+ 5) recover the omap_header
+ 6) clone the object_map from the original object if the omap is clean
+ 7) clone unmodified object data from the original object
+ 8) punch zeros in the new temp object
+ 9) recover the xattrs, and recover the omap if needed
+ 10) overwrite the object content that was modified
+ 11) remove the original object
+ 12) move and rename the new temp object to replace the original object
+ 13) finish recovery diff --git a/doc/dev/osd_internals/past_intervals.rst b/doc/dev/osd_internals/past_intervals.rst new file mode 100644 index 000000000..5b594df1a --- /dev/null +++ b/doc/dev/osd_internals/past_intervals.rst @@ -0,0 +1,93 @@ +=============
+PastIntervals
+=============
+
+Purpose
+-------
+
+There are two situations where we need to consider the set of all acting-set
+OSDs for a PG back to some epoch ``e``:
+
+ * During peering, we need to consider the acting set for every epoch back to
+   ``last_epoch_started``, the last epoch in which the PG completed peering and
+   became active.
+ (see :doc:`/dev/osd_internals/last_epoch_started` for a detailed explanation) + * During recovery, we need to consider the acting set for every epoch back to + ``last_epoch_clean``, the last epoch at which all of the OSDs in the acting + set were fully recovered, and the acting set was full. + +For either of these purposes, we could build such a set by iterating backwards +from the current OSDMap to the relevant epoch. Instead, we maintain a structure +PastIntervals for each PG. + +An ``interval`` is a contiguous sequence of OSDMap epochs where the PG mapping +didn't change. This includes changes to the acting set, the up set, the +primary, and several other parameters fully spelled out in +PastIntervals::check_new_interval. + +Maintenance and Trimming +------------------------ + +The PastIntervals structure stores a record for each ``interval`` back to +last_epoch_clean. On each new ``interval`` (See AdvMap reactions, +PeeringState::should_restart_peering, and PeeringState::start_peering_interval) +each OSD with the PG will add the new ``interval`` to its local PastIntervals. +Activation messages to OSDs which do not already have the PG contain the +sender's PastIntervals so that the recipient needn't rebuild it. (See +PeeringState::activate needs_past_intervals). + +PastIntervals are trimmed in two places. First, when the primary marks the +PG clean, it clears its past_intervals instance +(PeeringState::try_mark_clean()). The replicas will do the same thing when +they receive the info (See PeeringState::update_history). + +The second, more complex, case is in PeeringState::start_peering_interval. In +the event of a "map gap", we assume that the PG actually has gone clean, but we +haven't received a pg_info_t with the updated ``last_epoch_clean`` value yet. +To explain this behavior, we need to discuss OSDMap trimming. + +OSDMap Trimming +--------------- + +OSDMaps are created by the Monitor quorum and gossiped out to the OSDs. 
The
+Monitor cluster also determines when OSDs (and the Monitors) are allowed to
+trim old OSDMap epochs. For the reasons explained above in this document, the
+primary constraint is that we must retain all OSDMaps back to some epoch such
+that all PGs have been clean at that or a later epoch (min_last_epoch_clean).
+(See OSDMonitor::get_trim_to).
+
+The Monitor quorum determines min_last_epoch_clean through MOSDBeacon messages
+sent periodically by each OSD. Each message contains a set of PGs for which
+the OSD is primary at that moment as well as the min_last_epoch_clean across
+that set. The Monitors track these values in OSDMonitor::last_epoch_clean.
+
+There is a subtlety in the min_last_epoch_clean value used by the OSD to
+populate the MOSDBeacon. OSD::collect_pg_stats invokes PG::with_pg_stats to
+obtain the lec value, which actually uses
+pg_stat_t::get_effective_last_epoch_clean() rather than
+info.history.last_epoch_clean. If the PG is currently clean,
+pg_stat_t::get_effective_last_epoch_clean() is the current epoch rather than
+last_epoch_clean -- this works because the PG is clean at that epoch and it
+allows OSDMaps to be trimmed during periods where OSDMaps are being created
+(due to snapshot activity, perhaps), but no PGs are undergoing ``interval``
+changes.
+
+Back to PastIntervals
+---------------------
+
+We can now understand our second trimming case above. If OSDMaps have been
+trimmed up to epoch ``e``, we know that the PG must have been clean at some epoch
+>= ``e`` (indeed, **all** PGs must have been), so we can drop our PastIntervals.
+
+This dependency also pops up in PeeringState::check_past_interval_bounds().
+PeeringState::get_required_past_interval_bounds takes as a parameter
+oldest_epoch, which comes from OSDSuperblock::cluster_osdmap_trim_lower_bound.
+We use cluster_osdmap_trim_lower_bound rather than a specific OSD's oldest_map
+because we don't necessarily trim all of
+MOSDMap::cluster_osdmap_trim_lower_bound.
+In order to avoid doing too much work at once we limit the amount of osdmaps
+trimmed using ``osd_target_transaction_size`` in OSD::trim_maps().
+For this reason, a specific OSD's oldest_map can lag behind
+OSDSuperblock::cluster_osdmap_trim_lower_bound
+for a while.
+
+See https://tracker.ceph.com/issues/49689 for an example. diff --git a/doc/dev/osd_internals/pg.rst b/doc/dev/osd_internals/pg.rst new file mode 100644 index 000000000..397d4ab5d --- /dev/null +++ b/doc/dev/osd_internals/pg.rst @@ -0,0 +1,31 @@ +====
+PG
+====
+
+Concepts
+--------
+
+*Peering Interval*
+  See PG::start_peering_interval.
+  See PG::acting_up_affected.
+  See PG::PeeringState::Reset.
+
+  A peering interval is a maximal set of contiguous map epochs in which the
+  up and acting sets did not change. PG::PeeringMachine represents a
+  transition from one interval to another as passing through
+  PeeringState::Reset. On PG::PeeringState::AdvMap, PG::acting_up_affected can
+  cause the pg to transition to Reset.
+
+
+Peering Details and Gotchas
+---------------------------
+For an overview of peering, see `Peering <../../peering>`_.
+
+ * PG::flushed defaults to false and is set to false in
+   PG::start_peering_interval. Upon transitioning to PG::PeeringState::Started
+   we send a transaction through the pg op sequencer which, upon completion,
+   sends a FlushedEvt which sets flushed to true. The primary cannot go
+   active until this happens (See PG::PeeringState::WaitFlushedPeering).
+   Replicas can go active but cannot serve ops (writes or reads).
+   This is necessary because we cannot read our ondisk state until unstable
+   transactions from the previous interval have cleared.
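The peering-interval definition above (an interval ends whenever the up or acting set changes) can be sketched as follows. This is a simplified Python model, not the actual C++ in PG::start_peering_interval; the real check (PastIntervals::check_new_interval) also considers the primary, pool size, and several other parameters, and the function names here are illustrative only.

```python
def is_new_interval(old_up, old_acting, new_up, new_acting):
    """An interval ends when the up or acting set changes (simplified)."""
    return old_up != new_up or old_acting != new_acting

def split_into_intervals(mappings):
    """Given [(epoch, up, acting), ...] sorted by epoch, return the maximal
    contiguous epoch ranges over which the mapping did not change."""
    intervals = []
    start = prev = last = None
    for epoch, up, acting in mappings:
        if prev is None:
            start = epoch
        elif is_new_interval(prev[0], prev[1], up, acting):
            intervals.append((start, epoch - 1))  # close the previous interval
            start = epoch
        prev = (up, acting)
        last = epoch
    if prev is not None:
        intervals.append((start, last))
    return intervals
```

For example, a mapping change at epoch 12 splits epochs 10..13 into the intervals (10, 11) and (12, 13).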
diff --git a/doc/dev/osd_internals/pg_removal.rst b/doc/dev/osd_internals/pg_removal.rst new file mode 100644 index 000000000..c5fe0e1ab --- /dev/null +++ b/doc/dev/osd_internals/pg_removal.rst @@ -0,0 +1,56 @@ +==========
+PG Removal
+==========
+
+See OSD::_remove_pg, OSD::RemoveWQ
+
+There are two ways for a pg to be removed from an OSD:
+
+  1. MOSDPGRemove from the primary
+  2. OSD::advance_map finds that the pool has been removed
+
+In either case, our general strategy for removing the pg is to
+atomically set the metadata objects (pg->log_oid, pg->biginfo_oid) to
+backfill and asynchronously remove the pg collections. We do not do
+this inline because scanning the collections to remove the objects is
+an expensive operation.
+
+OSDService::deleting_pgs tracks all pgs in the process of being
+deleted. Each DeletingState object in deleting_pgs lives while at
+least one reference to it remains. Each item in RemoveWQ carries a
+reference to the DeletingState for the relevant pg such that
+deleting_pgs.lookup(pgid) will return a null ref only if there are no
+collections currently being deleted for that pg.
+
+The DeletingState for a pg also carries information about the status
+of the current deletion and allows the deletion to be cancelled.
+The possible states are:
+
+  1. QUEUED: the PG is in the RemoveWQ
+  2. CLEARING_DIR: the PG's contents are being removed synchronously
+  3. DELETING_DIR: the PG's directories and metadata are being queued
+     for removal
+  4. DELETED_DIR: the final removal transaction has been queued
+  5. CANCELED: the deletion has been cancelled
+
+In states 1 and 2, the deletion can be cancelled. Each state transition
+method (and check_canceled) returns false if the deletion has been
+cancelled and true if the state transition was successful. Similarly,
+try_stop_deletion() returns true if it succeeds in cancelling the
+deletion.
Additionally, in the event that it
+fails to stop the deletion, try_stop_deletion() will not return until the
+final removal transaction is queued. This ensures that any operations queued
+after that point will be ordered after the pg deletion.
+
+OSD::_create_lock_pg must handle two cases:
+
+  1. Either there is no DeletingStateRef for the pg, or it failed to cancel
+  2. We succeeded in cancelling the deletion.
+
+In case 1., we proceed as if there were no deletion occurring, except that
+we avoid writing to the PG until the deletion finishes. In case 2., we
+proceed as in case 1., except that we first mark the PG as backfilling.
+
+Similarly, OSD::osr_registry ensures that the OpSequencers for those
+pgs can be reused for a new pg if created before the old one is fully
+removed, ensuring that operations on the new pg are sequenced properly
+with respect to operations on the old one. diff --git a/doc/dev/osd_internals/pgpool.rst b/doc/dev/osd_internals/pgpool.rst new file mode 100644 index 000000000..45a252bd4 --- /dev/null +++ b/doc/dev/osd_internals/pgpool.rst @@ -0,0 +1,22 @@ +==================
+PGPool
+==================
+
+PGPool is a structure used to manage and update the status of removed
+snapshots. It does this by maintaining two fields: cached_removed_snaps, the
+current removed snap set, and newly_removed_snaps, the snaps newly removed in
+the last epoch. In OSD::load_pgs the osd map is recovered from the pg's file
+store and passed down to OSD::_get_pool where a PGPool object is initialised
+with the map.
+
+With each new map we receive we call PGPool::update with the new map. In that
+function we build a list of newly removed snaps
+(pg_pool_t::build_removed_snaps) and merge that with our cached_removed_snaps.
+This function includes checks to make sure we only do this update when things
+have changed or there has been a map gap.
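The bookkeeping that PGPool::update performs can be sketched as set arithmetic. This is a toy Python model under stated assumptions: the real code uses interval_set and pg_pool_t::build_removed_snaps, and the class and method names here are hypothetical.

```python
class PGPoolModel:
    """Toy model of PGPool's removed-snaps bookkeeping."""

    def __init__(self):
        self.cached_removed_snaps = set()   # everything removed so far
        self.newly_removed_snaps = set()    # removed in the last epoch

    def update(self, map_removed_snaps):
        # snaps removed in the new map that we had not yet cached
        self.newly_removed_snaps = set(map_removed_snaps) - self.cached_removed_snaps
        self.cached_removed_snaps |= self.newly_removed_snaps

    def init_snap_trimq(self, purged_snaps):
        # on activation: trim everything removed but not yet purged
        return sorted(self.cached_removed_snaps - set(purged_snaps))
```

For instance, seeing a new map with removed snaps {1, 2, 5} after {1, 2} yields newly_removed_snaps == {5}.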
+ +When we activate the pg we initialise the snap trim queue from +cached_removed_snaps and subtract the purged_snaps we have already purged +leaving us with the list of snaps that need to be trimmed. Trimming is later +performed asynchronously by the snap_trim_wq. + diff --git a/doc/dev/osd_internals/recovery_reservation.rst b/doc/dev/osd_internals/recovery_reservation.rst new file mode 100644 index 000000000..a24ac1b15 --- /dev/null +++ b/doc/dev/osd_internals/recovery_reservation.rst @@ -0,0 +1,83 @@ +==================== +Recovery Reservation +==================== + +Recovery reservation extends and subsumes backfill reservation. The +reservation system from backfill recovery is used for local and remote +reservations. + +When a PG goes active, first it determines what type of recovery is +necessary, if any. It may need log-based recovery, backfill recovery, +both, or neither. + +In log-based recovery, the primary first acquires a local reservation +from the OSDService's local_reserver. Then a MRemoteReservationRequest +message is sent to each replica in order of OSD number. These requests +will always be granted (i.e., cannot be rejected), but they may take +some time to be granted if the remotes have already granted all their +remote reservation slots. + +After all reservations are acquired, log-based recovery proceeds as it +would without the reservation system. + +After log-based recovery completes, the primary releases all remote +reservations. The local reservation remains held. The primary then +determines whether backfill is necessary. If it is not necessary, the +primary releases its local reservation and waits in the Recovered state +for all OSDs to indicate that they are clean. + +If backfill recovery occurs after log-based recovery, the local +reservation does not need to be reacquired since it is still held from +before. 
If it occurs immediately after activation (log-based recovery +not possible/necessary), the local reservation is acquired according to +the typical process. + +Once the primary has its local reservation, it requests a remote +reservation from the backfill target. This reservation CAN be rejected, +for instance if the OSD is too full (backfillfull_ratio osd setting). +If the reservation is rejected, the primary drops its local +reservation, waits (osd_backfill_retry_interval), and then retries. It +will retry indefinitely. + +Once the primary has the local and remote reservations, backfill +proceeds as usual. After backfill completes the remote reservation is +dropped. + +Finally, after backfill (or log-based recovery if backfill was not +necessary), the primary drops the local reservation and enters the +Recovered state. Once all the PGs have reported they are clean, the +primary enters the Clean state and marks itself active+clean. + +----------------- +Dump Reservations +----------------- + +An OSD daemon command dumps total local and remote reservations:: + + ceph daemon osd.<id> dump_recovery_reservations + + +-------------- +Things to Note +-------------- + +We always grab the local reservation first, to prevent a circular +dependency. We grab remote reservations in order of OSD number for the +same reason. + +The recovery reservation state chart controls the PG state as reported +to the monitor. The state chart can set: + + - recovery_wait: waiting for local/remote reservations + - recovering: recovering + - recovery_toofull: recovery stopped, OSD(s) above full ratio + - backfill_wait: waiting for remote backfill reservations + - backfilling: backfilling + - backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio + + +-------- +See Also +-------- + +The Active substate of the automatically generated OSD state diagram. 
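The deadlock-avoidance rules above (local reservation first, then remote reservations in order of OSD number; remote slots may queue a request rather than reject it) can be sketched as follows. This is an illustrative Python model, not the actual reserver implementation; the real AsyncReserver also supports priorities and cancellation, and the names here are hypothetical.

```python
def reservation_order(primary_osd, peer_osds):
    """A primary always takes its local reservation first, then remote
    reservations in increasing OSD number, so two primaries can never
    wait on each other in a cycle."""
    return [("local", primary_osd)] + [("remote", o) for o in sorted(peer_osds)]

class SimpleReserver:
    """Minimal sketch of a reservation slot pool with queueing."""

    def __init__(self, max_allowed):
        self.max_allowed = max_allowed
        self.granted = set()
        self.waiting = []

    def request(self, item):
        if len(self.granted) < self.max_allowed:
            self.granted.add(item)
            return True          # granted immediately
        self.waiting.append(item)
        return False             # queued; granted when a slot frees up

    def release(self, item):
        self.granted.discard(item)
        if self.waiting:
            self.granted.add(self.waiting.pop(0))
```

With one slot, a second PG's request queues rather than failing, matching the "will always be granted, but may take some time" behavior of log-based recovery reservations.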
diff --git a/doc/dev/osd_internals/refcount.rst b/doc/dev/osd_internals/refcount.rst new file mode 100644 index 000000000..3324b63e5 --- /dev/null +++ b/doc/dev/osd_internals/refcount.rst @@ -0,0 +1,45 @@ +========
+Refcount
+========
+
+
+Introduction
+============
+
+Deduplication, as described in ../deduplication.rst, needs a way to
+maintain a target pool of deduplicated chunks with atomic
+refcounting. To that end, there exists an osd object class
+refcount responsible for using the object class machinery to
+maintain refcounts on deduped chunks and ultimately remove them
+as the refcount hits 0.
+
+Class Interface
+===============
+
+See cls/refcount/cls_refcount_client*
+
+* cls_refcount_get
+
+  Atomically increments the refcount with the specified tag ::
+
+    void cls_refcount_get(librados::ObjectWriteOperation& op, const string& tag, bool implicit_ref = false);
+
+* cls_refcount_put
+
+  Atomically decrements the refcount specified by the passed tag ::
+
+    void cls_refcount_put(librados::ObjectWriteOperation& op, const string& tag, bool implicit_ref = false);
+
+* cls_refcount_set
+
+  Atomically sets the set of refcounts to the passed list of tags ::
+
+    void cls_refcount_set(librados::ObjectWriteOperation& op, list<string>& refs);
+
+* cls_refcount_read
+
+  Dumps the current set of ref tags for the object ::
+
+    int cls_refcount_read(librados::IoCtx& io_ctx, string& oid, list<string> *refs, bool implicit_ref = false);
+
+ diff --git a/doc/dev/osd_internals/scrub.rst b/doc/dev/osd_internals/scrub.rst new file mode 100644 index 000000000..149509799 --- /dev/null +++ b/doc/dev/osd_internals/scrub.rst @@ -0,0 +1,41 @@ +
+Scrub internals and diagnostics
+===============================
+
+Scrubbing Behavior Table
+------------------------
+
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Flags                                           | none     | noscrub   | nodeep_scrub  | noscrub/nodeep_scrub |
++=================================================+==========+===========+===============+======================+ +| Periodic tick | S | X | S | X | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Periodic tick after osd_deep_scrub_interval | D | D | S | X | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Initiated scrub | S | S | S | S | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Initiated scrub after osd_deep_scrub_interval | D | D | S | S | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ +| Initiated deep scrub | D | D | D | D | ++-------------------------------------------------+----------+-----------+---------------+----------------------+ + +- X = Do nothing +- S = Do regular scrub +- D = Do deep scrub + +State variables +--------------- + +- Periodic tick state is ``!must_scrub && !must_deep_scrub && !time_for_deep`` +- Periodic tick after ``osd_deep_scrub_interval state is !must_scrub && !must_deep_scrub && time_for_deep`` +- Initiated scrub state is ``must_scrub && !must_deep_scrub && !time_for_deep`` +- Initiated scrub after ``osd_deep_scrub_interval`` state is ``must_scrub && !must_deep_scrub && time_for_deep`` +- Initiated deep scrub state is ``must_scrub && must_deep_scrub`` + +Scrub Reservations +------------------ + +An OSD daemon command dumps total local and remote reservations:: + + ceph daemon osd.<id> dump_scrub_reservations + diff --git a/doc/dev/osd_internals/snaps.rst b/doc/dev/osd_internals/snaps.rst new file mode 100644 index 000000000..5ebd0884a --- /dev/null +++ b/doc/dev/osd_internals/snaps.rst @@ -0,0 +1,128 @@ +====== +Snaps +====== + +Overview +-------- +Rados supports two related snapshotting mechanisms: + + 1. 
*pool snaps*: snapshots are implicitly applied to all objects + in a pool + 2. *self managed snaps*: the user must provide the current *SnapContext* + on each write. + +These two are mutually exclusive, only one or the other can be used on +a particular pool. + +The *SnapContext* is the set of snapshots currently defined for an object +as well as the most recent snapshot (the *seq*) requested from the mon for +sequencing purposes (a *SnapContext* with a newer *seq* is considered to +be more recent). + +The difference between *pool snaps* and *self managed snaps* from the +OSD's point of view lies in whether the *SnapContext* comes to the OSD +via the client's MOSDOp or via the most recent OSDMap. + +See OSD::make_writeable + +Ondisk Structures +----------------- +Each object has in the PG collection a *head* object (or *snapdir*, which we +will come to shortly) and possibly a set of *clone* objects. +Each hobject_t has a snap field. For the *head* (the only writeable version +of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the +snap field is set to the *seq* of the *SnapContext* at their creation. +When the OSD services a write, it first checks whether the most recent +*clone* is tagged with a snapid prior to the most recent snap represented +in the *SnapContext*. If so, at least one snapshot has occurred between +the time of the write and the time of the last clone. Therefore, prior +to performing the mutation, the OSD creates a new clone for servicing +reads on snaps between the snapid of the last clone and the most recent +snapid. + +The *head* object contains a *SnapSet* encoded in an attribute, which tracks + + 1. The full set of snaps defined for the object + 2. The full set of clones which currently exist + 3. Overlapping intervals between clones for tracking space usage + 4. Clone size + +If the *head* is deleted while there are still clones, a *snapdir* object +is created instead to house the *SnapSet*. 
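The clone decision described above (before a write, check whether the most recent clone predates the newest snap in the SnapContext) can be sketched as follows. This is a simplified Python model of the logic in OSD::make_writeable, not the actual implementation; it ignores edge cases such as whether the object existed before the snap, and the function name is hypothetical.

```python
CEPH_NOSNAP = 2**64 - 1  # snap id of the writeable head object

def needs_clone(most_recent_clone_snap, snapc_snaps):
    """Before a write, a new clone is needed if a snapshot exists that is
    newer than the most recent clone: reads of that snap must still see
    the pre-write data.  snapc_snaps is the SnapContext's snap list
    (newest first); most_recent_clone_snap is None if no clone exists."""
    if not snapc_snaps:
        return False                 # no snapshots defined, write in place
    newest_snap = snapc_snaps[0]
    if most_recent_clone_snap is None:
        return True                  # first snapshot since the object was written
    return most_recent_clone_snap < newest_snap
```

For example, with an existing clone at snap 3 and a SnapContext listing snap 4, a write must first create a clone to serve reads of snap 4.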
+
+Additionally, the *object_info_t* on each clone includes a vector of snaps
+for which that clone is defined.
+
+Snap Removal
+------------
+To remove a snapshot, a request is made to the *Monitor* cluster to
+add the snapshot id to the list of purged snaps (or to remove it from
+the set of pool snaps in the case of *pool snaps*). In either case,
+the *PG* adds the snap to its *snap_trimq* for trimming.
+
+A clone can be removed when all of its snaps have been removed. In
+order to determine which clones might need to be removed upon snap
+removal, we maintain a mapping from snap to *hobject_t* using the
+*SnapMapper*.
+
+See PrimaryLogPG::SnapTrimmer, SnapMapper
+
+This trimming is performed asynchronously by the snap_trim_wq while the
+PG is clean and not scrubbing.
+
+  #. The next snap in PG::snap_trimq is selected for trimming
+  #. We determine the next object for trimming out of PG::snap_mapper.
+     For each object, we create a log entry and repop updating the
+     object info and the snap set (including adjusting the overlaps).
+     If the object is a clone which no longer belongs to any live snapshots,
+     it is removed here. (See PrimaryLogPG::trim_object() when new_snaps
+     is empty.)
+  #. We also locally update our *SnapMapper* instance with the object's
+     new snaps.
+  #. The log entry containing the modification of the object also
+     contains the new set of snaps, which the replica uses to update
+     its own *SnapMapper* instance.
+  #. The primary shares the info with the replica, which persists
+     the new set of purged_snaps along with the rest of the info.
+
+
+
+Recovery
+--------
+Because the trim operations are implemented using repops and log entries,
+normal PG peering and recovery maintain the snap trimmer operations with
+the caveat that push and removal operations need to update the local
+*SnapMapper* instance. If the purged_snaps update is lost, we merely
+retrim a now empty snap.
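The per-clone trim decision above can be sketched in a few lines. This is a Python sketch of the new_snaps computation in PrimaryLogPG::trim_object (assumption: a clone's surviving snaps are simply its snaps minus the removed ones); it is not the actual code.

```python
def trim_clone(clone_snaps, removed_snaps):
    """Return the clone's surviving snaps after trimming.  An empty
    result means the clone no longer belongs to any live snapshot and
    can be removed (the new_snaps-is-empty case)."""
    return [s for s in clone_snaps if s not in removed_snaps]
```

A clone defined for snaps [2, 3] survives removal of snap 3 (serving snap 2), but is deleted once both snaps are removed.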
+ +SnapMapper +---------- +*SnapMapper* is implemented on top of map_cacher<string, bufferlist>, +which provides an interface over a backing store such as the file system +with async transactions. While transactions are incomplete, the map_cacher +instance buffers unstable keys allowing consistent access without having +to flush the filestore. *SnapMapper* provides two mappings: + + 1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone + object + 2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot + as one of its snaps + +Assumption: there are lots of hobjects and relatively few snaps. The +first encoding has a stringification of the object as the key and an +encoding of the set of snaps as a value. The second mapping, because there +might be many hobjects for a single snap, is stored as a collection of keys +of the form stringify(snap)_stringify(object) such that stringify(snap) +is constant length. These keys have a bufferlist encoding +pair<snapid, hobject_t> as a value. Thus, creating or trimming a single +object does not involve reading all objects for any snap. Additionally, +upon construction, the *SnapMapper* is provided with a mask for filtering +the objects in the single SnapMapper keyspace belonging to that PG. + +Split +----- +The snapid_t -> hobject_t key entries are arranged such that for any PG, +up to 8 prefixes need to be checked to determine all hobjects in a particular +snap for a particular PG. Upon split, the prefixes to check on the parent +are adjusted such that only the objects remaining in the PG will be visible. +The children will immediately have the correct mapping. 
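The key scheme for the second mapping can be illustrated with a fixed-width snap prefix. This is a hypothetical Python sketch of the idea only: the real SnapMapper encoding differs in detail (it also folds in the PG mask described above), and these function names are invented.

```python
def snap_object_key(snap, obj):
    """Fixed-width hex snap prefix keeps all entries for one snap
    contiguous in sorted key order."""
    return "%016X_%s" % (snap, obj)

def objects_for_snap(keys, snap):
    """Enumerate the objects for one snap with a single prefix scan,
    without touching entries for other snaps."""
    prefix = "%016X_" % snap
    return [k[len(prefix):] for k in sorted(keys) if k.startswith(prefix)]
```

Because the snap prefix is constant length, keys for snap 1 sort together ahead of keys for snap 2, so trimming one snap never has to read the others' entries.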
diff --git a/doc/dev/osd_internals/stale_read.rst b/doc/dev/osd_internals/stale_read.rst new file mode 100644 index 000000000..5493bb1f4 --- /dev/null +++ b/doc/dev/osd_internals/stale_read.rst @@ -0,0 +1,102 @@ +Preventing Stale Reads
+======================
+
+We write synchronously to all replicas before sending an ACK to the
+client, which limits the potential for inconsistency
+in the write path. However, by default we serve reads from just
+one replica (the lead/primary OSD for each PG), and the
+client will use whatever OSDMap it has to select the OSD from which to read.
+In most cases, this is fine: either the client map is correct,
+or the OSD that we think is the primary for the object knows that it
+is not the primary anymore, and can feed the client an updated map
+that indicates a newer primary.
+
+The key is to ensure that this is *always* true. In particular, we
+need to ensure that an OSD that is fenced off from its peers and has
+not learned about a map update does not continue to service read
+requests from similarly stale clients at any point after which a new
+primary may have been allowed to make a write.
+
+We accomplish this via a mechanism that works much like a read lease.
+Each pool may have a ``read_lease_interval`` property which defines
+how long this is, although by default we simply set it to
+``osd_pool_default_read_lease_ratio`` (default: .8) times the
+``osd_heartbeat_grace``. (This way the lease will generally have
+expired by the time we mark a failed OSD down.)
+
+readable_until
+--------------
+
+Primary and replica both track a couple of values:
+
+* *readable_until* is how long we are allowed to service (read)
+  requests before *our* "lease" expires.
+* *readable_until_ub* is an upper bound on *readable_until* for any
+  OSD in the acting set.
+
+The primary manages these two values by sending *pg_lease_t* messages
+to replicas that increase the upper bound.
Once all acting OSDs have
+acknowledged they've seen the higher bound, the primary increases its
+own *readable_until* and shares that (in a subsequent *pg_lease_t*
+message). The resulting invariant is that any acting OSD's
+*readable_until* is always <= any acting OSD's *readable_until_ub*.
+
+In order to avoid any problems with clock skew, we use monotonic
+clocks (which are only accurate locally and unaffected by time
+adjustments) throughout to manage these leases. Peer OSDs calculate
+upper and lower bounds on the deltas between OSD-local clocks,
+allowing the primary to share timestamps based on its local clock
+while replicas translate that to an appropriate bound for their own
+local clocks.
+
+Prior Intervals
+---------------
+
+Whenever there is an interval change, we need to have an upper bound
+on the *readable_until* values for any OSDs in the prior interval.
+All OSDs from that interval have this value (*readable_until_ub*), and
+share it as part of the pg_history_t during peering.
+
+Because peering may involve OSDs that were not already communicating
+before and may not have bounds on their clock deltas, the bound in
+*pg_history_t* is shared as a simple duration before the upper bound
+expires. This means that the bound slips forward in time due to the
+transit time for the peering message, but that is generally quite
+short, and moving the bound later in time is safe since it is an
+*upper* bound.
+
+PG "laggy" state
+----------------
+
+While the PG is active, *pg_lease_t* and *pg_lease_ack_t* messages are
+regularly exchanged. However, if a client request comes in and the
+lease has expired (*readable_until* has passed), the PG will go into a
+*LAGGY* state and the request will be blocked. Once the lease is
+renewed, the request(s) will be requeued.
+
+PG "wait" state
+---------------
+
+If peering completes but the prior interval's OSDs may still be
+readable, the PG will go into the *WAIT* state until sufficient time
+has passed.
Any OSD requests will block during that period. Recovery
+may proceed while in this state, since the logical, user-visible
+content of objects does not change.
+
+Dead OSDs
+---------
+
+Generally speaking, we need to wait until prior intervals' OSDs *know*
+that they should no longer be readable. If an OSD is known to have
+crashed (e.g., because the process is no longer running, which we may
+infer because we get an ECONNREFUSED error), then we can infer that it
+is not readable.
+
+Similarly, if an OSD is marked down, gets a map update telling it so,
+and then informs the monitor that it knows it was marked down, we can
+similarly infer that it is no longer serving requests for a prior interval.
+
+When a PG is in the *WAIT* state, it will watch new maps for OSDs'
+*dead_epoch* value indicating they are aware of their dead-ness. If
+all down OSDs from the prior interval are so aware, we can exit the WAIT
+state early. diff --git a/doc/dev/osd_internals/watch_notify.rst b/doc/dev/osd_internals/watch_notify.rst new file mode 100644 index 000000000..8c2ce09ba --- /dev/null +++ b/doc/dev/osd_internals/watch_notify.rst @@ -0,0 +1,81 @@ +============
+Watch Notify
+============
+
+See librados for the watch/notify interface.
+
+Overview
+--------
+The object_info (See osd/osd_types.h) tracks the set of watchers for
+a particular object persistently in the object_info_t::watchers map.
+In order to track notify progress, we also maintain some ephemeral
+structures associated with the ObjectContext.
+
+Each Watch has an associated Watch object (See osd/Watch.h). The
+ObjectContext for a watched object will have a (strong) reference
+to one Watch object per watch, and each Watch object holds a
+reference to the corresponding ObjectContext. This circular reference
+is deliberate and is broken when the Watch state is discarded on
+a new peering interval or removed upon timeout expiration or an
+unwatch operation.
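The deliberate Watch/ObjectContext cycle and its explicit teardown can be sketched as follows. This is a toy Python model (the real objects are C++ with intrusive refcounting); the class and method names here are illustrative, loosely echoing Watch::discard_state.

```python
class ObjectContext:
    def __init__(self):
        self.watchers = {}          # (cookie, entity) -> Watch

class Watch:
    def __init__(self, obc, cookie, entity):
        self.obc = obc              # strong back-reference: completes the cycle
        self.key = (cookie, entity)
        obc.watchers[self.key] = self

    def discard_state(self):
        """Explicitly break the cycle, as happens on a new peering
        interval, timeout expiration, or an unwatch operation."""
        self.obc.watchers.pop(self.key, None)
        self.obc = None
```

Neither side ever drops the other implicitly; the cycle exists until discard_state() severs both references.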
A watch tracks the associated connection via a strong
ConnectionRef Watch::conn. The associated connection has a
WatchConState stashed in the OSD::Session for tracking associated
Watches in order to be able to notify them upon ms_handle_reset()
(via WatchConState::reset()).

Each Watch object tracks the set of currently un-acked notifies.
start_notify() on a Watch object adds a reference to a new in-progress
Notify to the Watch and either:

* if the Watch is *connected*, sends a Notify message to the client
* if the Watch is *unconnected*, does nothing.

When the Watch becomes connected (in PrimaryLogPG::do_osd_op_effects),
Notifies are resent for all remaining tracked Notify objects.

Each Notify object tracks the set of un-notified Watchers via
calls to complete_watcher(). Once the remaining set is empty or the
timeout expires (cb, registered in init()), a notify completion
is sent to the client.

Watch Lifecycle
---------------
A watch may be in one of five states:

1. Non-existent
2. On disk, but not registered with an object context
3. Connected
4. Disconnected, with a callback registered with the timer
5. Disconnected, with a callback in queue for scrub or is_degraded

Case 2 occurs between when an OSD goes active and when the ObjectContext
for an object with watchers is loaded into memory due to an access.
During case 2, no state is registered for the watch. Case 2
transitions to case 4 in PrimaryLogPG::populate_obc_watchers() during
PrimaryLogPG::find_object_context. Case 1 becomes case 3 via
OSD::do_osd_op_effects due to a watch operation. Cases 4 and 5 become
case 3 in the same way. Case 3 becomes case 4 when the connection resets
on a watcher's session.

Cases 4 and 5 deserve some explanation. Normally, when a Watch enters
case 4, a callback is registered with the OSDService::watch_timer to be
called at timeout expiration.
At the time that the callback is called, however, the PG might be in a
state where it cannot write to the object in order to remove the watch
(i.e., during a scrub or while the object is degraded). In that case, we
use Watch::get_delayed_cb() to generate another Context for use from
the callbacks_for_degraded_object and Scrubber::callbacks lists.
In either case, Watch::unregister_cb() does the right thing
(SafeTimer::cancel_event() is harmless for contexts not registered
with the timer).

Notify Lifecycle
----------------
The notify timeout is simpler: a timeout callback is registered when
the notify is init()'d. If all watchers ack the notify before the
timeout occurs, the timeout is canceled and the client is notified
of the notify completion. Otherwise, the timeout fires, the Notify
object pings each Watch via cancel_notify to remove itself, and
sends the notify completion to the client early.
diff --git a/doc/dev/osd_internals/wbthrottle.rst b/doc/dev/osd_internals/wbthrottle.rst
new file mode 100644
index 000000000..9b67efbb6
--- /dev/null
+++ b/doc/dev/osd_internals/wbthrottle.rst
@@ -0,0 +1,28 @@

==================
Writeback Throttle
==================

Previously, the filestore had a problem when handling large numbers of
small I/Os. We throttle dirty data implicitly via the journal, but a
large number of inodes can be dirtied without filling the journal,
resulting in a very long sync time when the sync finally does happen.
The flusher was not an adequate solution to this problem, since it
forced writeback of small writes too eagerly, killing performance.

WBThrottle tracks unflushed I/O per hobject_t and ::fsync()s in LRU
order once the start_flusher threshold is exceeded for any of
dirty bytes, dirty I/Os, or dirty inodes. While any of these exceeds
its hard_limit, we block on throttle() in _do_op.
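The two-threshold behavior above can be sketched as follows. The names (``queue_wb``, ``must_block``) and the limit values are illustrative, not the real WBThrottle interface or its defaults; the sketch only shows the soft-start/hard-block structure.

```cpp
#include <cassert>
#include <cstdint>

struct Limits { uint64_t start, hard; };

struct WBThrottleSketch {
  // hypothetical limits: flushing starts at `start`, writers block at `hard`
  Limits bytes{1 << 20, 4 << 20}, ios{64, 256}, inodes{16, 64};
  uint64_t dirty_bytes = 0, dirty_ios = 0, dirty_inodes = 0;

  void queue_wb(uint64_t len, bool new_inode) {   // an unflushed write lands
    dirty_bytes += len;
    dirty_ios += 1;
    if (new_inode) dirty_inodes += 1;
  }
  bool should_start_flush() const {  // flusher wakes past any start threshold
    return dirty_bytes >= bytes.start || dirty_ios >= ios.start ||
           dirty_inodes >= inodes.start;
  }
  bool must_block() const {          // _do_op would block in throttle() here
    return dirty_bytes >= bytes.hard || dirty_ios >= ios.hard ||
           dirty_inodes >= inodes.hard;
  }
  void flushed(uint64_t len, bool inode_clean) {  // an fsync completed
    dirty_bytes -= len;
    dirty_ios -= 1;
    if (inode_clean) dirty_inodes -= 1;
  }
};
```

Keeping three independent counters is the point: a workload of many tiny writes can exhaust the inode or I/O budget long before the byte budget, which is exactly the failure mode the journal-based throttle missed.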
See src/os/WBThrottle.h, src/osd/WBThrottle.cc.

To track open FDs through the writeback process, there is now an
fdcache to cache open fds. lfn_open now returns a cached FDRef which
implicitly closes the fd once all references have expired.

Filestore syncs have a side effect of flushing all outstanding objects
in the wbthrottle.

lfn_unlink clears the cached FDRef and wbthrottle entries for the
unlinked object when the last link is removed, and asserts that all
outstanding FDRefs for that object are dead.
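The FDRef behavior described above, closing the fd only when the last reference expires, maps naturally onto a shared_ptr with a custom deleter. This is a hypothetical sketch (``open_cached`` is an invented name, not the actual lfn_open/FDRef implementation), showing only the reference-counted close semantics:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <memory>

// A shared handle to an open file descriptor; the fd is closed
// automatically when the last FDRef copy is destroyed.
using FDRef = std::shared_ptr<int>;

FDRef open_cached(const char *path) {
  int fd = ::open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  // the deleter runs exactly once, when the final reference expires
  return FDRef(new int(fd), [](int *p) { ::close(*p); delete p; });
}
```

With this shape, the fdcache and any in-flight flush can each hold an FDRef to the same object; neither has to coordinate the close, which is what lets lfn_unlink simply drop the cached reference and assert that no others remain.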