Diffstat (limited to 'doc/dev/osd_internals')
-rw-r--r--  doc/dev/osd_internals/async_recovery.rst                 |  53
-rw-r--r--  doc/dev/osd_internals/backfill_reservation.rst           |  93
-rw-r--r--  doc/dev/osd_internals/erasure_coding.rst                 |  87
-rw-r--r--  doc/dev/osd_internals/erasure_coding/developer_notes.rst | 223
-rw-r--r--  doc/dev/osd_internals/erasure_coding/ecbackend.rst       | 206
-rw-r--r--  doc/dev/osd_internals/erasure_coding/jerasure.rst        |  35
-rw-r--r--  doc/dev/osd_internals/erasure_coding/proposals.rst       | 385
-rw-r--r--  doc/dev/osd_internals/index.rst                          |  10
-rw-r--r--  doc/dev/osd_internals/last_epoch_started.rst             |  60
-rw-r--r--  doc/dev/osd_internals/log_based_pg.rst                   | 208
-rw-r--r--  doc/dev/osd_internals/manifest.rst                       | 589
-rw-r--r--  doc/dev/osd_internals/map_message_handling.rst           | 131
-rw-r--r--  doc/dev/osd_internals/mclock_wpq_cmp_study.rst           | 476
-rw-r--r--  doc/dev/osd_internals/osd_overview.rst                   | 106
-rw-r--r--  doc/dev/osd_internals/osdmap_versions.txt                | 259
-rw-r--r--  doc/dev/osd_internals/partial_object_recovery.rst        | 148
-rw-r--r--  doc/dev/osd_internals/past_intervals.rst                 |  93
-rw-r--r--  doc/dev/osd_internals/pg.rst                             |  31
-rw-r--r--  doc/dev/osd_internals/pg_removal.rst                     |  56
-rw-r--r--  doc/dev/osd_internals/pgpool.rst                         |  22
-rw-r--r--  doc/dev/osd_internals/recovery_reservation.rst           |  83
-rw-r--r--  doc/dev/osd_internals/refcount.rst                       |  45
-rw-r--r--  doc/dev/osd_internals/scrub.rst                          |  41
-rw-r--r--  doc/dev/osd_internals/snaps.rst                          | 128
-rw-r--r--  doc/dev/osd_internals/stale_read.rst                     | 102
-rw-r--r--  doc/dev/osd_internals/watch_notify.rst                   |  81
-rw-r--r--  doc/dev/osd_internals/wbthrottle.rst                     |  28
27 files changed, 3779 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/async_recovery.rst b/doc/dev/osd_internals/async_recovery.rst
new file mode 100644
index 000000000..aea5b70db
--- /dev/null
+++ b/doc/dev/osd_internals/async_recovery.rst
@@ -0,0 +1,53 @@
+=====================
+Asynchronous Recovery
+=====================
+
+Ceph Placement Groups (PGs) maintain a log of write transactions to
+facilitate speedy recovery of data. During recovery, each of these PG logs
+is used to determine which content in each OSD is missing or outdated.
+This obviates the need to scan all RADOS objects.
+See :ref:`Log Based PG <log-based-pg>` for more details on this process.
+
+Prior to the Nautilus release this recovery process was synchronous: it
+blocked writes to a RADOS object until it was recovered. In contrast,
+backfill could allow writes to proceed (assuming enough up-to-date replicas
+were available) by temporarily assigning a different acting set, and
+backfilling an OSD outside of the acting set. In some circumstances
+this ends up being significantly better for availability, e.g. if the
+PG log contains 3000 writes to disjoint objects. When the PG log contains
+thousands of entries, it could actually be faster (though not as safe) to
+trade backfill for recovery by deleting and redeploying the containing
+OSD than to iterate through the PG log. Recovering several megabytes
+of RADOS object data (or even worse, several megabytes of omap keys,
+notably RGW bucket indexes) can drastically increase latency for a small
+update, and combined with requests spread across many degraded objects
+it is a recipe for slow requests.
+
+To avoid this we can perform recovery in the background on an OSD
+out-of-band of the live acting set, similar to backfill, but still using
+the PG log to determine what needs to be done. This is known as *asynchronous
+recovery*.
+
+The threshold for performing asynchronous recovery instead of synchronous
+recovery is not clear-cut. There are a few criteria which
+need to be met for asynchronous recovery:
+
+* Try to keep ``min_size`` replicas available
+* Use the approximate magnitude of the difference in length of
+ logs combined with historical missing objects to estimate the cost of
+ recovery
+* Use the parameter ``osd_async_recovery_min_cost`` to determine
+ when asynchronous recovery is appropriate
+
+With the existing peering process, when we choose the acting set we
+have not fetched the PG log from each peer; we have only the bounds of
+it and other metadata from their ``pg_info_t``. It would be more expensive
+to fetch and examine every log at this point, so we rely on an
+approximate check of log length for now. In Nautilus, we improved
+the accounting of missing objects, so post-Nautilus this information
+is also used to determine the cost of recovery.
+
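+As a rough illustration of the cost check only (this is not the actual
+peering code; the function and variable names below are invented for the
+example), the decision reduces to comparing an estimated cost against
+``osd_async_recovery_min_cost``:
+
+::
+
+    // Illustrative sketch, not PeeringState code.
+    #include <cstdint>
+
+    // approx_missing: historical missing-object count for the candidate OSD
+    // log_gap: approximate difference between the log bounds of the primary
+    //          and of the candidate, taken from their pg_info_t
+    bool prefer_async_recovery(uint64_t approx_missing,
+                               uint64_t log_gap,
+                               uint64_t osd_async_recovery_min_cost)
+    {
+      // A large estimated cost means log-based recovery would block writes
+      // for too long, so recover this OSD asynchronously instead.
+      return approx_missing + log_gap > osd_async_recovery_min_cost;
+    }
+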
+While async recovery is occurring, writes to members of the acting set
+may proceed, but we need to send their log entries to the async
+recovery targets (just like we do for backfill OSDs) so that they
+can completely catch up.
diff --git a/doc/dev/osd_internals/backfill_reservation.rst b/doc/dev/osd_internals/backfill_reservation.rst
new file mode 100644
index 000000000..3c380dcf6
--- /dev/null
+++ b/doc/dev/osd_internals/backfill_reservation.rst
@@ -0,0 +1,93 @@
+====================
+Backfill Reservation
+====================
+
+When a new OSD joins a cluster, all PGs with it in their acting sets must
+eventually backfill. If all of these backfills happen simultaneously
+they will present excessive load on the OSD: the "thundering herd"
+effect.
+
+The ``osd_max_backfills`` tunable limits the number of outgoing or
+incoming backfills that are active on a given OSD. Note that this limit is
+applied separately to incoming and to outgoing backfill operations.
+Thus there can be as many as ``osd_max_backfills * 2`` backfill operations
+in flight on each OSD. This subtlety is often missed, and Ceph
+operators can be puzzled as to why more ops are observed than expected.
+
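+For example, the limit can be inspected and adjusted at runtime (the value
+shown here is only an example):
+
+::
+
+    $ ceph config get osd osd_max_backfills
+    $ ceph config set osd osd_max_backfills 2
+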
+Each ``OSDService`` now has two ``AsyncReserver`` instances: one for backfills going
+from the OSD (``local_reserver``) and one for backfills going to the OSD
+(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
+manages a queue by priority of waiting items and a set of current reservation
+holders. When a slot frees up, the ``AsyncReserver`` queues the ``Context*``
+associated with the next item on the highest priority queue in the finisher
+provided to the constructor.
+
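+The following self-contained toy model illustrates the idea (it is not the
+real class; see ``common/AsyncReserver.h`` for the actual interface, which
+works with ``Context*`` callbacks and a ``Finisher``):
+
+::
+
+    // Toy model of a priority-ordered reservation queue.
+    #include <functional>
+    #include <map>
+    #include <queue>
+    #include <set>
+    #include <string>
+    #include <utility>
+
+    class ToyReserver {
+      size_t max_allowed;                 // e.g. osd_max_backfills
+      std::set<std::string> granted;      // current reservation holders
+      // waiting items, highest priority first
+      std::map<unsigned,
+               std::queue<std::pair<std::string, std::function<void()>>>,
+               std::greater<unsigned>> waiting;
+
+      void do_queues() {
+        while (granted.size() < max_allowed && !waiting.empty()) {
+          auto &q = waiting.begin()->second;
+          auto [item, on_reserved] = q.front();
+          q.pop();
+          if (q.empty())
+            waiting.erase(waiting.begin());
+          granted.insert(item);
+          on_reserved();                  // the real code queues a Context* in a Finisher
+        }
+      }
+
+    public:
+      explicit ToyReserver(size_t max) : max_allowed(max) {}
+
+      void request_reservation(const std::string &item, unsigned prio,
+                               std::function<void()> on_reserved) {
+        waiting[prio].push({item, std::move(on_reserved)});
+        do_queues();
+      }
+
+      void cancel_reservation(const std::string &item) {
+        granted.erase(item);              // frees a slot for the next waiter
+        do_queues();
+      }
+    };
+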
+For a primary to initiate a backfill it must first obtain a reservation from
+its own ``local_reserver``. Then it must obtain a reservation from the backfill
+target's ``remote_reserver`` via a ``MBackfillReserve`` message. This process is
+managed by sub-states of ``Active`` and ``ReplicaActive`` (see the sub-states
+of ``Active`` in PG.h). The reservations are dropped either on the ``Backfilled``
+event (which is sent on the primary before calling ``recovery_complete``
+and on the replica on receipt of the ``BackfillComplete`` progress message),
+or upon leaving ``Active`` or ``ReplicaActive``.
+
+It's important to always grab the local reservation before the remote
+reservation in order to prevent a circular dependency.
+
+We minimize the risk of data loss by prioritizing the order in
+which PGs are recovered. Admins can override the default order by using
+``force-recovery`` or ``force-backfill``. A ``force-recovery`` with op
+priority ``255`` will start before a ``force-backfill`` op at priority ``254``.
+
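+For example, specific PGs can be forced (and the override later cancelled)
+with:
+
+::
+
+    $ ceph pg force-recovery <pgid> [<pgid>...]
+    $ ceph pg force-backfill <pgid> [<pgid>...]
+    $ ceph pg cancel-force-recovery <pgid> [<pgid>...]
+    $ ceph pg cancel-force-backfill <pgid> [<pgid>...]
+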
+If recovery is needed because a PG is below ``min_size`` a base priority of
+``220`` is used. This is incremented by the number of OSDs short of the pool's
+``min_size`` as well as a value relative to the pool's ``recovery_priority``.
+The resultant priority is capped at ``253`` so that it does not confound forced
+ops as described above. Under ordinary circumstances a recovery op is
+prioritized at ``180`` plus a value relative to the pool's ``recovery_priority``.
+The resultant priority is capped at ``219``.
+
+If backfill is needed because the number of acting OSDs is less than
+the pool's ``min_size``, a priority of ``220`` is used. The number of OSDs
+short of the pool's ``min_size`` is added as well as a value relative to
+the pool's ``recovery_priority``. The total priority is limited to ``253``.
+
+If backfill is needed because a PG is undersized,
+a priority of ``140`` is used. The number of OSDs below the pool's size is
+added, as well as a value relative to the pool's ``recovery_priority``; the
+resultant priority is capped at ``179``. If a backfill op is needed because a
+PG is degraded, a priority of ``140`` is likewise used, plus a value relative
+to the pool's ``recovery_priority``, again capped at ``179``. Under ordinary
+circumstances a backfill op priority of ``100`` is used, plus a value relative
+to the pool's ``recovery_priority``, capped at ``139``.
+
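+Schematically (this is illustrative only, not the actual OSD code), each of
+these cases amounts to adding a pool-relative value to a base priority and
+clamping the result below the next band, e.g. for degraded backfill:
+
+::
+
+    #include <algorithm>
+
+    // Degraded backfill: base 140, capped at 179 so it never reaches the
+    // recovery band that starts at 180.  Illustrative only.
+    unsigned degraded_backfill_priority(int pool_recovery_priority)
+    {
+      int prio = 140 + pool_recovery_priority;
+      return static_cast<unsigned>(std::clamp(prio, 0, 179));
+    }
+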
+.. list-table:: Backfill and Recovery op priorities
+ :widths: 20 20 20
+ :header-rows: 1
+
+ * - Description
+ - Base priority
+ - Maximum priority
+ * - Backfill
+ - 100
+ - 139
+ * - Degraded Backfill
+ - 140
+ - 179
+ * - Recovery
+ - 180
+ - 219
+ * - Inactive Recovery
+ - 220
+ - 253
+ * - Inactive Backfill
+ - 220
+ - 253
+ * - force-backfill
+ - 254
+ -
+ * - force-recovery
+ - 255
+ -
+
diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst
new file mode 100644
index 000000000..40064961b
--- /dev/null
+++ b/doc/dev/osd_internals/erasure_coding.rst
@@ -0,0 +1,87 @@
+==============================
+Erasure Coded Placement Groups
+==============================
+
+Glossary
+--------
+
+*chunk*
+ When the encoding function is called, it returns chunks of the same
+ size as each other. There are two kinds of chunks: (1) *data
+ chunks*, which can be concatenated to reconstruct the original
+ object, and (2) *coding chunks*, which can be used to rebuild a
+ lost chunk.
+
+*chunk rank*
+ The index of a chunk, as determined by the encoding function. The
+ rank of the first chunk is 0, the rank of the second chunk is 1,
+ and so on.
+
+*K*
+ The number of data chunks into which an object is divided. For
+ example, if *K* = 2, then a 10KB object is divided into two objects
+ of 5KB each.
+
+*M*
+ The number of coding chunks computed by the encoding function. *M*
+ is equal to the number of OSDs that can be missing from the cluster
+ without the cluster suffering data loss. For example, if there are
+ two coding chunks, then two OSDs can be missing without data loss.
+
+*N*
+ The number of data chunks plus the number of coding chunks: that
+ is, *K* + *M*.
+
+*rate*
+ The proportion of the total chunks containing useful information:
+ that is, *K* divided by *N*. For example, suppose that *K* = 9 and
+ *M* = 3. This would mean that *N* = 12 (because *K* + *M* = 9 + 3).
+ Therefore, the *rate* (*K* / *N*) would be 9 / 12 = 0.75. In other
+ words, 75% of the chunks would contain useful information.
+
+*shard* (also called *strip*)
+ An ordered sequence of chunks of the same rank from the same object. For a
+ given placement group, each OSD contains shards of the same rank. In the
+ special case in which an object is encoded with only one call to the
+ encoding function, the term *chunk* may be used instead of *shard* because
+ the shard is made of a single chunk. The chunks in a shard are ordered
+ according to the rank of the stripe (see *stripe* below) they belong to.
+
+
+*stripe*
+ If an object is so large that encoding it requires more than one
+ call to the encoding function, each of these calls creates a set of
+ chunks called a *stripe*.
+
+The definitions are illustrated as follows (PG stands for placement group):
+::
+
+ OSD 40 OSD 33
+ +-------------------------+ +-------------------------+
+ | shard 0 - PG 10 | | shard 1 - PG 10 |
+ |+------ object O -------+| |+------ object O -------+|
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 0 ||| stripe 0 ||| ||| stripe 0 |||
+ ||+---------------------+|| ||+---------------------+||
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 1 ||| stripe 1 ||| ||| stripe 1 |||
+ ||+---------------------+|| ||+---------------------+||
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 2 ||| stripe 2 ||| ||| stripe 2 |||
+ ||+---------------------+|| ||+---------------------+||
+ |+-----------------------+| |+-----------------------+|
+ | ... | | ... |
+ +-------------------------+ +-------------------------+
+
+Table of contents
+-----------------
+
+.. toctree::
+ :maxdepth: 1
+
+ Developer notes <erasure_coding/developer_notes>
+ Jerasure plugin <erasure_coding/jerasure>
+ High level design document <erasure_coding/ecbackend>
diff --git a/doc/dev/osd_internals/erasure_coding/developer_notes.rst b/doc/dev/osd_internals/erasure_coding/developer_notes.rst
new file mode 100644
index 000000000..586b4b71b
--- /dev/null
+++ b/doc/dev/osd_internals/erasure_coding/developer_notes.rst
@@ -0,0 +1,223 @@
+============================
+Erasure Code developer notes
+============================
+
+Introduction
+------------
+
+Each chapter of this document explains an aspect of the implementation
+of erasure coding within Ceph. It relies mostly on worked examples
+to demonstrate how things work.
+
+Reading and writing encoded chunks from and to OSDs
+---------------------------------------------------
+
+An erasure coded pool stores each object as K+M chunks: K data
+chunks and M coding chunks. The pool is configured to have
+a size of K+M so that each chunk is stored in an OSD in the acting
+set. The rank of the chunk is stored as an attribute of the object.
+
+Let's say an erasure coded pool is created to use five OSDs ( K+M =
+5 ) and sustain the loss of two of them ( M = 2 ).
+
+When the object *NYAN* containing *ABCDEFGHI* is written to it, the
+erasure encoding function splits the content into three data chunks,
+simply by dividing the content into three: the first contains *ABC*,
+the second *DEF* and the last *GHI*. The content will be padded if the
+content length is not a multiple of K. The function also creates two
+coding chunks: the fourth with *YXY* and the fifth with *QGC*. Each
+chunk is stored in an OSD in the acting set. The chunks are stored in
+objects that have the same name (*NYAN*) but reside on different
+OSDs. The order in which the chunks were created must be preserved and
+is stored as an attribute of the object (shard_t), in addition to its
+name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4*
+contains *YXY* and is stored on *OSD3*.
+
+::
+
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ |
+ |
+ v
+ +------+------+
+ +---------------+ encode(3,2) +-----------+
+ | +--+--+---+---+ |
+ | | | | |
+ | +-------+ | +-----+ |
+ | | | | |
+ +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
+ name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 | | 5 |
+ +------+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY | | QGC |
+ +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
+ | | | | |
+ | | | | |
+ | | +--+---+ | |
+ | | | OSD1 | | |
+ | | +------+ | |
+ | | +------+ | |
+ | +------>| OSD2 | | |
+ | +------+ | |
+ | +------+ | |
+ | | OSD3 |<----+ |
+ | +------+ |
+ | +------+ |
+ | | OSD4 |<--------------+
+ | +------+
+ | +------+
+ +----------------->| OSD5 |
+ +------+
+
+
+
+
+When the object *NYAN* is read from the erasure coded pool, the
+decoding function reads three chunks: chunk *1* containing *ABC*,
+chunk *3* containing *GHI* and chunk *4* containing *YXY*, and rebuilds
+the original content of the object, *ABCDEFGHI*. The decoding function
+is informed that the chunks *2* and *5* are missing (they are called
+*erasures*). Chunk *5* could not be read because *OSD4* is
+*out*.
+
+The decoding function can be called as soon as three chunks are
+read: *OSD2* was the slowest and its chunk does not need to be taken into
+account. This optimization is not implemented in Firefly.
+
+::
+
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ ^
+ |
+ |
+ +------+------+
+ | decode(3,2) |
+ | erasures 2,5|
+ +-------------->| |
+ | +-------------+
+ | ^ ^
+ | | +-----+
+ | | |
+ +--+---+ +------+ +--+---+ +--+---+
+ name | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 |
+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY |
+ +--+---+ +--+---+ +--+---+ +--+---+
+ ^ . ^ ^
+ | TOO . | |
+ | SLOW . +--+---+ |
+ | ^ | OSD1 | |
+ | | +------+ |
+ | | +------+ |
+ | +-------| OSD2 | |
+ | +------+ |
+ | +------+ |
+ | | OSD3 |-----+
+ | +------+
+ | +------+
+ | | OSD4 | OUT
+ | +------+
+ | +------+
+ +------------------| OSD5 |
+ +------+
+
+
+Erasure code library
+--------------------
+
+Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_,
+with parameters K+M, object O is encoded by dividing it into chunks O1,
+O2, ... OK and computing coding chunks P1, P2, ... PM. Any K chunks
+out of the available K+M chunks can be used to obtain the original
+object. If data chunk O2 or coding chunk P2 is lost, it can be
+repaired using any K chunks out of the K+M chunks. If more than M
+chunks are lost, it is not possible to recover the object.
+
+Reading the original content of object O can be a simple
+concatenation of O1, O2, ... OK, because the plugins use
+`systematic codes
+<https://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks
+must be given to the erasure code library's *decode* method to retrieve
+the content of the object.
+
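+As a toy illustration of a systematic code (this is not the Ceph plugin
+API -- the real plugins implement ``ErasureCodeInterface`` -- and the
+function names here are invented), a K=2, M=1 XOR code keeps the data
+readable by concatenation while still being able to repair one erasure:
+
+::
+
+    // Toy K=2, M=1 systematic code: chunks 0 and 1 are the data halves,
+    // chunk 2 is their XOR.
+    #include <cassert>
+    #include <string>
+    #include <vector>
+
+    std::vector<std::string> encode(const std::string &in)  // even-length input
+    {
+      std::string a = in.substr(0, in.size() / 2);
+      std::string b = in.substr(in.size() / 2);
+      std::string p(a.size(), '\0');
+      for (size_t i = 0; i < a.size(); ++i)
+        p[i] = a[i] ^ b[i];
+      return {a, b, p};                   // data, data, coding
+    }
+
+    // Rebuild lost data chunk 0 or 1 from the surviving chunks.
+    std::string decode_lost(int lost, const std::vector<std::string> &chunks)
+    {
+      const std::string &other = chunks[1 - lost];
+      std::string out(other.size(), '\0');
+      for (size_t i = 0; i < out.size(); ++i)
+        out[i] = other[i] ^ chunks[2][i];
+      return out;
+    }
+
+    int main()
+    {
+      auto chunks = encode("ABCDEFGHIJ");
+      assert(chunks[0] + chunks[1] == "ABCDEFGHIJ");  // systematic: concatenation
+      assert(decode_lost(1, chunks) == chunks[1]);    // repair an erasure
+    }
+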
+Performance depends on the parameters to the encoding functions and
+is also influenced by the packet sizes used when calling the encoding
+functions (for Cauchy or Liberation, for instance): smaller packets
+mean more calls and more overhead.
+
+Although Reed-Solomon is provided as a default, Ceph uses it via an
+`abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_ designed to
+allow each pool to choose the plugin that implements it using
+key=value pairs stored in an `erasure code profile`_.
+
+.. _erasure code profile: ../../../erasure-coded-pool
+
+::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ crush-failure-domain=osd
+ $ ceph osd erasure-code-profile get myprofile
+ directory=/usr/lib/ceph/erasure-code
+ k=2
+ m=1
+ plugin=jerasure
+ technique=reed_sol_van
+ crush-failure-domain=osd
+ $ ceph osd pool create ecpool erasure myprofile
+
+The *plugin* is dynamically loaded from *directory* and expected to
+implement the *int __erasure_code_init(char *plugin_name, char *directory)* function
+which is responsible for registering an object derived from *ErasureCodePlugin*
+in the registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ plugin reads:
+
+::
+
+ ErasureCodePluginRegistry &instance =
+ ErasureCodePluginRegistry::instance();
+ instance.add(plugin_name, new ErasureCodePluginExample());
+
+The *ErasureCodePlugin* derived object must provide a factory method
+from which the concrete implementation of the *ErasureCodeInterface*
+object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ reads:
+
+::
+
+ virtual int factory(const map<std::string,std::string> &parameters,
+ ErasureCodeInterfaceRef *erasure_code) {
+ *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
+ return 0;
+ }
+
+The *parameters* argument is the list of *key=value* pairs that were
+set in the erasure code profile, before the pool was created.
+
+::
+
+ ceph osd erasure-code-profile set myprofile \
+ directory=<dir> \ # mandatory
+ plugin=jerasure \ # mandatory
+ m=10 \ # optional and plugin dependent
+ k=3 \ # optional and plugin dependent
+ technique=reed_sol_van \ # optional and plugin dependent
+
+Notes
+-----
+
+If the objects are large, it may be impractical to encode and decode
+them in memory. However, when using *RBD* a 1TB device is divided into
+many individual 4MB objects and *RGW* does the same.
+
+Encoding and decoding is implemented in the OSD. Although it could be
+implemented client side for reads and writes, the OSD must be able to encode
+and decode on its own when scrubbing.
diff --git a/doc/dev/osd_internals/erasure_coding/ecbackend.rst b/doc/dev/osd_internals/erasure_coding/ecbackend.rst
new file mode 100644
index 000000000..877a08a38
--- /dev/null
+++ b/doc/dev/osd_internals/erasure_coding/ecbackend.rst
@@ -0,0 +1,206 @@
+=================================
+ECBackend Implementation Strategy
+=================================
+
+Miscellaneous initial design notes
+==================================
+
+The initial design for EC pools (still in effect for EC pools without
+the hacky EC overwrites debug flag enabled) restricted
+them to operations that can be easily rolled back:
+
+- CEPH_OSD_OP_APPEND: We can roll back an append locally by
+ including the previous object size as part of the PG log event.
+- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
+ requires that we retain the deleted object until all replicas have
+  persisted the deletion event. The erasure coded backend will therefore
+ need to store objects with the version at which they were created
+ included in the key provided to the filestore. Old versions of an
+ object can be pruned when all replicas have committed up to the log
+ event deleting the object.
+- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
+ to be set or removed, we can roll back these operations locally.
+
+Log entries contain a structure explaining how to locally undo the
+operation represented by the log entry
+(see osd_types.h:TransactionInfo::LocalRollBack).
+
+PGTemp and Crush
+----------------
+
+Primaries are able to request a temp acting set mapping in order to
+allow an up-to-date OSD to serve requests while a new primary is
+backfilled (and for other reasons). An erasure coded PG needs to be
+able to designate a primary for these reasons without putting it in
+the first position of the acting set. It also needs to be able to
+leave holes in the requested acting set.
+
+Core Changes:
+
+- OSDMap::pg_to_*_osds needs to separately return a primary. For most
+ cases, this can continue to be acting[0].
+- MOSDPGTemp (and related OSD structures) needs to be able to specify
+ a primary as well as an acting set.
+- Much of the existing code base assumes that acting[0] is the primary
+ and that all elements of acting are valid. This needs to be cleaned
+ up since the acting set may contain holes.
+
+Distinguished acting set positions
+----------------------------------
+
+With the replicated strategy, all replicas of a PG are
+interchangeable. With erasure coding, different positions in the
+acting set have different pieces of the erasure coding scheme and are
+not interchangeable. Worse, crush might cause chunk 2 to be written
+to an OSD which happens already to contain an (old) copy of chunk 4.
+This means that the OSD and PG messages need to work in terms of a
+type like pair<shard_t, pg_t> in order to distinguish different PG
+chunks on a single OSD.
+
+Because the mapping of an object name to object in the filestore must
+be 1-to-1, we must ensure that the objects in chunk 2 and the objects
+in chunk 4 have different names. To that end, the object store must
+include the chunk id in the object key.
+
+Core changes:
+
+- The object store `ghobject_t needs to also include a chunk id
+ <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like
+ tuple<hobject_t, gen_t, shard_t>.
+- coll_t needs to include a shard_t.
+- The OSD pg_map and similar PG mappings need to work in terms of a
+ spg_t (essentially
+ pair<pg_t, shard_t>). Similarly, pg->pg messages need to include
+ a shard_t
+- For client->PG messages, the OSD will need a way to know which PG
+ chunk should get the message since the OSD may contain both a
+ primary and non-primary chunk for the same PG
+
+Object Classes
+--------------
+
+Reads from object classes will return ENOTSUP on EC pools by invoking
+a special SYNC read.
+
+Scrub
+-----
+
+The main catch for EC pools is that sending a crc32 of the
+stored chunk on a replica isn't particularly helpful since the chunks
+on different replicas presumably store different data. Because we
+don't support overwrites except via DELETE, however, we have the
+option of maintaining a crc32 on each chunk through each append.
+Thus, each replica instead simply computes a crc32 of its own stored
+chunk and compares it with the locally stored checksum. The replica
+then reports to the primary whether the checksums match.
+
+With overwrites, all scrubs are disabled for now until we work out
+what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst).
+
+Crush
+-----
+
+If crush is unable to generate a replacement for a down member of an
+acting set, the acting set should have a hole at that position rather
+than shifting the other elements of the acting set out of position.
+
+=========
+ECBackend
+=========
+
+MAIN OPERATION OVERVIEW
+=======================
+
+A RADOS put operation can span
+multiple stripes of a single object. There must be code that
+tessellates the application level write into a set of per-stripe write
+operations -- some whole-stripes and up to two partial
+stripes. Without loss of generality, for the remainder of this
+document, we will focus exclusively on writing a single stripe (whole
+or partial). We will use the symbol "W" to represent the number of
+blocks within a stripe that are being written, i.e., W <= K.
+
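+As an illustrative sketch (not the actual ECBackend code), tessellation is
+just arithmetic on the stripe width:
+
+::
+
+    // Split an application write (off, len) into per-stripe pieces.
+    // stripe_width = K * chunk_size.  Illustrative only.
+    #include <algorithm>
+    #include <cstdint>
+    #include <vector>
+
+    struct StripeWrite {
+      uint64_t stripe;   // stripe index within the object
+      uint64_t offset;   // offset within that stripe
+      uint64_t len;      // whole stripe iff offset == 0 && len == stripe_width
+    };
+
+    std::vector<StripeWrite> tessellate(uint64_t off, uint64_t len,
+                                        uint64_t stripe_width)
+    {
+      std::vector<StripeWrite> out;
+      while (len > 0) {
+        uint64_t in_off = off % stripe_width;
+        uint64_t n = std::min(len, stripe_width - in_off); // partial head/tail
+        out.push_back({off / stripe_width, in_off, n});
+        off += n;
+        len -= n;
+      }
+      return out;
+    }
+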
+There are two data flows for handling a write into an EC stripe. The
+choice between them is based on the size
+of the write operation and the arithmetic properties of the selected
+parity-generation algorithm.
+
+(1) Whole stripe is written/overwritten
+(2) A read-modify-write operation is performed.
+
+WHOLE STRIPE WRITE
+------------------
+
+This is a simple case, and is already performed in the existing code
+(for appends, that is). The primary receives all of the data for the
+stripe in the RADOS request, computes the appropriate parity blocks
+and sends the data and parity blocks to their destination shards, which
+write them. This is essentially the current EC code.
+
+READ-MODIFY-WRITE
+-----------------
+
+The primary determines which of the K-W blocks are to remain unmodified,
+and reads them from the shards. Once all of the data is received it is
+combined with the received new data and new parity blocks are
+computed. The modified blocks are sent to their respective shards and
+written. The RADOS operation is acknowledged.
+
+OSD Object Write and Consistency
+--------------------------------
+
+Regardless of the algorithm chosen above, writing the data is a
+two-phase process: commit and rollforward. The primary sends the log
+entries with the operation described (see
+osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack)).
+In all cases, the "commit" is performed in place, possibly leaving some
+information required for a rollback in a write-aside object. The
+rollforward phase occurs once all acting set replicas have committed;
+it then removes the rollback information.
+
+In the case of overwrites of existing stripes, the rollback information
+has the form of a sparse object containing the old values of the
+overwritten extents, populated using clone_range. This is essentially
+a place-holder implementation; in practice, BlueStore will have an
+efficient primitive for this.
+
+The rollforward part can be delayed since we report the operation as
+committed once all replicas have committed. Currently, whenever we
+send a write, we also indicate that all previously committed
+operations should be rolled forward (see
+ECBackend::try_reads_to_commit). If there aren't any in the pipeline
+when we arrive at the waiting_rollforward queue, we start a dummy
+write to move things along (see the Pipeline section later on and
+ECBackend::try_finish_rmw).
+
+ExtentCache
+-----------
+
+It's pretty important to be able to pipeline writes on the same
+object. For this reason, there is a cache of extents written by
+cacheable operations. Each extent remains pinned until the operations
+referring to it are committed. The pipeline prevents rmw operations
+from running until uncacheable transactions (clones, etc) are flushed
+from the pipeline.
+
+See ExtentCache.h for a detailed explanation of how the cache
+states correspond to the higher level invariants about the conditions
+under which concurrent operations can refer to the same object.
+
+Pipeline
+--------
+
+Reading src/osd/ExtentCache.h should give a good idea of how
+operations might overlap. There are several states involved in
+processing a write operation, and an important invariant which
+isn't enforced by PrimaryLogPG at a higher level needs to be
+managed by ECBackend. That invariant is that we can't
+have uncacheable and rmw operations running at the same time
+on the same object. For simplicity, we simply enforce that any
+operation which contains an rmw operation must wait until
+all in-progress uncacheable operations complete.
+
+There are improvements to be made here in the future.
+
+For more details, see ECBackend::waiting_* and
+ECBackend::try_<from>_to_<to>.
diff --git a/doc/dev/osd_internals/erasure_coding/jerasure.rst b/doc/dev/osd_internals/erasure_coding/jerasure.rst
new file mode 100644
index 000000000..ac3636720
--- /dev/null
+++ b/doc/dev/osd_internals/erasure_coding/jerasure.rst
@@ -0,0 +1,35 @@
+===============
+jerasure plugin
+===============
+
+Introduction
+------------
+
+The parameters interpreted by the ``jerasure`` plugin are:
+
+::
+
+ ceph osd erasure-code-profile set myprofile \
+ directory=<dir> \ # plugin directory absolute path
+ plugin=jerasure \ # plugin name (only jerasure)
+ k=<k> \ # data chunks (default 2)
+ m=<m> \ # coding chunks (default 2)
+ technique=<technique> \ # coding technique
+
+The coding techniques can be chosen among *reed_sol_van*,
+*reed_sol_r6_op*, *cauchy_orig*, *cauchy_good*, *liberation*,
+*blaum_roth* and *liber8tion*.
+
+The *src/erasure-code/jerasure* directory contains the
+implementation. It is a wrapper around the code found at
+`https://github.com/ceph/jerasure <https://github.com/ceph/jerasure>`_
+and `https://github.com/ceph/gf-complete
+<https://github.com/ceph/gf-complete>`_ , pinned to the latest stable
+version in *.gitmodules*. These repositories are copies of the
+upstream repositories `http://jerasure.org/jerasure/jerasure
+<http://jerasure.org/jerasure/jerasure>`_ and
+`http://jerasure.org/jerasure/gf-complete
+<http://jerasure.org/jerasure/gf-complete>`_ . The difference
+between the two, if any, should match pull requests against upstream.
+Note that as of 2023, the ``jerasure.org`` web site may no longer be
+legitimate and/or associated with the original project.
diff --git a/doc/dev/osd_internals/erasure_coding/proposals.rst b/doc/dev/osd_internals/erasure_coding/proposals.rst
new file mode 100644
index 000000000..8a30727b3
--- /dev/null
+++ b/doc/dev/osd_internals/erasure_coding/proposals.rst
@@ -0,0 +1,385 @@
+:orphan:
+
+=================================
+Proposed Next Steps for ECBackend
+=================================
+
+PARITY-DELTA-WRITE
+------------------
+
+RMW operations currently require 4 network hops (2 round trips). In
+principle, for some codes, we can reduce this to 3 by sending the
+update to the replicas holding the data blocks and having them
+compute a delta to forward onto the parity blocks.
+
+The primary reads the current values of the "W" blocks and then uses
+the new values of the "W" blocks to compute parity-deltas for each of
+the parity blocks. The W blocks and the parity delta-blocks are sent
+to their respective shards.
+
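+For the simplest possible case -- a single XOR parity shard -- the delta rule
+is just ``P_new = P_old xor D_old xor D_new``; Reed-Solomon codes generalize
+this with a Galois-field coefficient per parity shard. A self-contained
+illustration (not Ceph code):
+
+::
+
+    // The data shard computes delta = D_old xor D_new and forwards it;
+    // the parity shard applies it without seeing the rest of the stripe.
+    #include <cassert>
+    #include <cstdint>
+    #include <vector>
+
+    using Block = std::vector<uint8_t>;
+
+    Block xor_blocks(const Block &a, const Block &b)
+    {
+      Block out(a.size());
+      for (size_t i = 0; i < a.size(); ++i)
+        out[i] = a[i] ^ b[i];
+      return out;
+    }
+
+    int main()
+    {
+      Block d_old = {1, 2, 3}, d_new = {7, 2, 9}, other = {4, 5, 6};
+      Block p_old = xor_blocks(d_old, other);      // parity over the stripe
+      Block delta = xor_blocks(d_old, d_new);      // computed on the data shard
+      Block p_new = xor_blocks(p_old, delta);      // applied on the parity shard
+      assert(p_new == xor_blocks(d_new, other));   // matches full recomputation
+    }
+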
+The choice of whether to use a read-modify-write or a
+parity-delta-write is a complex policy issue that is TBD in the details
+and is likely to be heavily dependent on the computational costs
+associated with a parity-delta vs. a regular parity-generation
+operation. However, it is believed that the parity-delta scheme is
+likely to be the preferred choice, when available.
+
+The internal interface to the erasure coding library plug-ins needs to
+be extended to support the ability to query if parity-delta
+computation is possible for a selected algorithm as well as an
+interface to the actual parity-delta computation algorithm when
+available.
+
+Stripe Cache
+------------
+
+It may be a good idea to extend the current ExtentCache usage to
+cache some data past when the pinning operation releases it.
+One application pattern that is important to optimize is the small
+block sequential write operation (think of the journal of a journaling
+file system or a database transaction log). Regardless of the chosen
+redundancy algorithm, it is advantageous for the primary to
+retain/buffer recently read/written portions of a stripe in order to
+reduce network traffic. The dynamic contents of this cache may be used
+in the determination of whether a read-modify-write or a
+parity-delta-write is performed. The sizing of this cache is TBD, but
+we should plan on allowing at least a few full stripes per active
+client. Limiting the cache occupancy on a per-client basis will reduce
+the noisy neighbor problem.
+
+Recovery and Rollback Details
+=============================
+
+Implementing a Rollback-able Prepare Operation
+----------------------------------------------
+
+The prepare operation is implemented at each OSD through a simulation
+of a versioning or copy-on-write capability for modifying a portion of
+an object.
+
+When a prepare operation is performed, the new data is written into a
+temporary object. The PG log for the
+operation will contain a reference to the temporary object so that it
+can be located for recovery purposes as well as a record of all of the
+shards which are involved in the operation.
+
+In order to avoid fragmentation (and hence, future read performance),
+creation of the temporary object needs special attention. The name of
+the temporary object affects its location within the KV store. Right
+now its unclear whether it's desirable for the name to locate near the
+base object or whether a separate subset of keyspace should be used
+for temporary objects. Sam believes that colocation with the base
+object is preferred (he suggests using the generation counter of the
+ghobject for temporaries). Whereas Allen believes that using a
+separate subset of keyspace is desirable since these keys are
+ephemeral and we don't want to actually colocate them with the base
+object keys. Perhaps some modeling here can help resolve this
+issue. The data of the temporary object wants to be located as close
+to the data of the base object as possible. This may be best performed
+by adding a new ObjectStore creation primitive that takes the base
+object as an additional parameter that is a hint to the allocator.
+
+Sam: I think that the short lived thing may be a red herring. We'll
+be updating the donor and primary objects atomically, so it seems like
+we'd want them adjacent in the key space, regardless of the donor's
+lifecycle.
+
+The apply operation moves the data from the temporary object into the
+correct position within the base object and deletes the associated
+temporary object. This operation is done using a specialized
+ObjectStore primitive. In the current ObjectStore interface, this can
+be done using the clonerange function followed by a delete, but can be
+done more efficiently with a specialized move primitive.
+Implementation of the specialized primitive on FileStore can be done
+by copying the data. Some file systems have extensions that might also
+be able to implement this operation (like a defrag API that swaps
+chunks between files). It is expected that NewStore will be able to
+support this efficiently and natively (It has been noted that this
+sequence requires that temporary object allocations, which tend to be
+small, be efficiently converted into blocks for main objects and that
+blocks that were formerly inside of main objects must be reusable with
+minimal overhead).
+
+The prepare and apply operations can be separated arbitrarily in
+time. If a read operation accesses an object that has been altered by
+a prepare operation (but without a corresponding apply operation) it
+must return the data after the prepare operation. This is done by
+creating an in-memory database of objects which have had a prepare
+operation without a corresponding apply operation. All read operations
+must consult this in-memory data structure in order to get the correct
+data. It should be explicitly recognized that it is likely that there
+will be multiple prepare operations against a single base object and
+the code must handle this case correctly. This code is implemented as
+a layer between ObjectStore and all existing readers. Annoyingly,
+we'll want to trash this state when the interval changes, so the first
+thing that needs to happen after activation is that the primary and
+replicas apply up to last_update so that the empty cache will be
+correct.
+
+During peering, it is now obvious that an unapplied prepare operation
+can easily be rolled back simply by deleting the associated temporary
+object and removing that entry from the in-memory data structure.
+
+Partial Application Peering/Recovery modifications
+--------------------------------------------------
+
+Some writes will be small enough to not require updating all of the
+shards holding data blocks. For write amplification minimization
+reasons, it would be best to avoid writing to those shards at all,
+and delay even sending the log entries until the next write which
+actually hits that shard.
+
+The delaying (buffering) of the transmission of the prepare and apply
+operations for witnessing OSDs creates new situations that peering
+must handle. In particular the logic for determining the authoritative
+last_update value (and hence the selection of the OSD which has the
+authoritative log) must be modified to account for the valid but
+missing (i.e., delayed/buffered) pglog entries to which the
+authoritative OSD was only a witness.
+
+Because a partial write might complete without persisting a log entry
+on every replica, we have to do a bit more work to determine an
+authoritative last_update. The constraint (as with a replicated PG)
+is that last_update >= the most recent log entry for which a commit
+was sent to the client (call this actual_last_update). Secondarily,
+we want last_update to be as small as possible since any log entry
+past actual_last_update (we do not apply a log entry until we have
+sent the commit to the client) must be able to be rolled back. Thus,
+the smaller a last_update we choose, the less recovery will need to
+happen (we can always roll back, but rolling a replica forward may
+require an object rebuild). Thus, we will set last_update to 1 before
+the oldest log entry we can prove cannot have been committed. In
+current master, this is simply the last_update of the shortest log
+from that interval (because that log did not persist any entry past
+that point -- a precondition for sending a commit to the client). For
+this design, we must consider the possibility that any log is missing
+at its head log entries in which it did not participate. Thus, we
+must determine the most recent interval in which we went active
+(essentially, this is what find_best_info currently does). We then
+pull the log from each live osd from that interval back to the minimum
+last_update among them. Then, we extend all logs from the
+authoritative interval until each hits an entry in which it should
+have participated, but did not record. The shortest of these extended
+logs must therefore contain any log entry for which we sent a commit
+to the client -- and the last entry gives us our last_update.
+
+Deep scrub support
+------------------
+
+The simple answer here is probably our best bet. EC pools can't use
+the omap namespace at all right now. The simplest solution would be
+to take a prefix of the omap space and pack N M byte L bit checksums
+into each key/value. The prefixing seems like a sensible precaution
+against eventually wanting to store something else in the omap space.
+It seems like any write will need to read at least the blocks
+containing the modified range. However, with a code able to compute
+parity deltas, we may not need to read a whole stripe. Even without
+that, we don't want to have to write to blocks not participating in
+the write. Thus, each shard should store checksums only for itself.
+It seems like you'd be able to store checksums for all shards on the
+parity blocks, but there may not be distinguished parity blocks which
+are modified on all writes (LRC or shec provide two examples). L
+should probably have a fixed number of options (16, 32, 64?) and be
+configurable per-pool at pool creation. N, M should likewise be
+configurable at pool creation with sensible defaults.
+
+We need to handle online upgrade. I think the right answer is that
+the first overwrite to an object with an append only checksum
+removes the append only checksum and writes in whatever stripe
+checksums actually got written. The next deep scrub then writes
+out the full checksum omap entries.
+
+RADOS Client Acknowledgement Generation Optimization
+====================================================
+
+Now that the recovery scheme is understood, we can discuss the
+generation of the RADOS operation acknowledgement (ACK) by the
+primary ("sufficient" from above). It is NOT required that the primary
+wait for all shards to complete their respective prepare
+operations. Using our example where the RADOS operation writes only
+"W" chunks of the stripe, the primary will generate and send W+M
+prepare operations (possibly including a send-to-self). The primary
+need only wait for enough shards to be written to ensure recovery of
+the data. Thus after writing W + M chunks you can afford the loss of M
+chunks. Hence the primary can generate the RADOS ACK after W+M-M => W
+of those prepare operations are completed.
+
+Inconsistent object_info_t versions
+===================================
+
+A natural consequence of only writing the blocks which actually
+changed is that we don't want to update the object_info_t of the
+objects which didn't. I actually think it would pose a problem to do
+so: pg ghobject namespaces are generally large, and unless the osd is
+seeing a bunch of overwrites on a small set of objects, I'd expect
+each write to be far enough apart in the backing ghobject_t->data
+mapping to each constitute a random metadata update. Thus, we have to
+accept that not every shard will have the current version in its
+object_info_t. We can't even bound how old the version on a
+particular shard will happen to be. In particular, the primary does
+not necessarily have the current version. One could argue that the
+parity shards would always have the current version, but not every
+code necessarily has designated parity shards which see every write
+(certainly LRC, iirc shec, and even with a more pedestrian code, it
+might be desirable to rotate the shards based on object hash). Even
+if you chose to designate a shard as witnessing all writes, the pg
+might be degraded with that particular shard missing. This is a bit
+tricky: currently, reads and writes implicitly return the most recent
+version of the object written. On reads, we'd have to read K shards
+to answer that question. We can get around that by adding a "don't
+tell me the current version" flag. Writes are more problematic: we
+need an object_info from the most recent write in order to form the
+new object_info and log_entry.
+
+A truly terrifying option would be to eliminate version and
+prior_version entirely from the object_info_t. There are a few
+specific purposes it serves:
+
+#. On OSD startup, we prime the missing set by scanning backwards
+ from last_update to last_complete comparing the stored object's
+   object_info_t to the version of the most recent log entry.
+#. During backfill, we compare versions between primary and target
+ to avoid some pushes. We use it elsewhere as well
+#. While pushing and pulling objects, we verify the version.
+#. We return it on reads and writes and allow the librados user to
+   assert it atomically on writes to allow the user to deal with write
+ races (used extensively by rbd).
+
+Case (3) isn't actually essential, just convenient. Oh well. (4)
+is more annoying. Writes are easy since we know the version. Reads
+are tricky because we may not need to read from all of the replicas.
+Simplest solution is to add a flag to rados operations to just not
+return the user version on read. We can also just not support the
+user version assert on ec for now (I think? Only user is rgw bucket
+indices iirc, and those will always be on replicated because they use
+omap).
+
+We can avoid (1) by maintaining the missing set explicitly. It's
+already possible for there to be a missing object without a
+corresponding log entry (Consider the case where the most recent write
+is to an object which has not been updated in weeks. If that write
+becomes divergent, the written object needs to be marked missing based
+on the prior_version which is not in the log.) The PGLog already has
+a way of handling those edge cases (see divergent_priors). We'd
+simply expand that to contain the entire missing set and maintain it
+atomically with the log and the objects. This isn't really an
+unreasonable option; the additional keys would be fewer than the
+existing log keys + divergent_priors and aren't updated in the fast
+write path anyway.
+
+The second case is a bit trickier. It's really an optimization for
+the case where a pg became not in the acting set long enough for the
+logs to no longer overlap but not long enough for the PG to have
+healed and removed the old copy. Unfortunately, this describes the
+case where a node was taken down for maintenance with noout set. It's
+probably not acceptable to re-backfill the whole OSD in such a case,
+so we need to be able to quickly determine whether a particular shard
+is up to date given a valid acting set of other shards.
+
+Let ordinary writes which do not change the object size not touch the
+object_info at all. That means that the object_info version won't
+match the pg log entry version. Include in the pg_log_entry_t the
+current object_info version as well as which shards participated (as
+mentioned above). In addition to the object_info_t attr, record on
+each shard s a vector recording for each other shard s' the most
+recent write which spanned both s and s'. Operationally, we maintain
+an attr on each shard containing that vector. A write touching S
+updates the version stamp entry for each shard in S on each shard in
+S's attribute (and leaves the rest alone). If we have a valid acting
+set during backfill, we must have a witness of every write which
+completed -- so taking the max of each entry over all of the acting
+set shards must give us the current version for each shard. During
+recovery, we set the attribute on the recovery target to that max
+vector (Question: with LRC, we may not need to touch much of the
+acting set to recover a particular shard -- can we just use the max of
+the shards we used to recover, or do we need to grab the version
+vector from the rest of the acting set as well? I'm not sure, not a
+big deal anyway, I think).
+
+The above lets us perform blind writes without knowing the current
+object version (log entry version, that is) while still allowing us to
+avoid backfilling up to date objects. The only catch is that our
+backfill scans will scan all replicas, not just the primary and the
+backfill targets.
+
+It would be worth adding into scrub the ability to check the
+consistency of the gathered version vectors -- probably by just
+taking 3 random valid subsets and verifying that they generate
+the same authoritative version vector.
+
+Implementation Strategy
+=======================
+
+It goes without saying that it would be unwise to attempt to do all of
+this in one massive PR. It's also not a good idea to merge code which
+isn't being tested. To that end, it's worth thinking a bit about
+which bits can be tested on their own (perhaps with a bit of temporary
+scaffolding).
+
+We can implement the overwrite friendly checksumming scheme easily
+enough with the current implementation. We'll want to enable it on a
+per-pool basis (probably using a flag which we'll later repurpose for
+actual overwrite support). We can enable it in some of the ec
+thrashing tests in the suite. We can also add a simple test
+validating the behavior of turning it on for an existing ec pool
+(later, we'll want to be able to convert append-only ec pools to
+overwrite ec pools, so that test will simply be expanded as we go).
+The flag should be gated by the experimental feature flag since we
+won't want to support this as a valid configuration -- testing only.
+We need to upgrade append only ones in place during deep scrub.
+
+Similarly, we can implement the unstable extent cache with the current
+implementation; it even lets us cut out the readable ack the replicas
+send to the primary after the commit which lets it release the lock.
+Same deal, implement, gate with experimental flag, add to some of the
+automated tests. I don't really see a reason not to use the same flag
+as above.
+
+We can certainly implement the move-range primitive with unit tests
+before there are any users. Adding coverage to the existing
+objectstore tests would suffice here.
+
+Explicit missing set can be implemented now, same deal as above --
+might as well even use the same feature bit.
+
+The TPC protocol outlined above can actually be implemented for an append
+only EC pool. Same deal as above, can even use the same feature bit.
+
+The RADOS flag to suppress the read op user version return can be
+implemented immediately. Mostly just needs unit tests.
+
+The version vector problem is an interesting one. For append only EC
+pools, it would be pointless since all writes increase the size and
+therefore update the object_info. We could do it for replicated pools
+though. It's a bit silly since all "shards" see all writes, but it
+would still let us implement and partially test the augmented backfill
+code as well as the extra pg log entry fields -- this depends on the
+explicit pg log entry branch having already merged. It's not entirely
+clear to me that this one is worth doing separately. It's enough code
+that I'd really prefer to get it done independently, but it's also a
+fair amount of scaffolding that will be later discarded.
+
+PGLog entries need to be able to record the participants and log
+comparison needs to be modified to extend logs with entries they
+wouldn't have witnessed. This logic should be abstracted behind
+PGLog so it can be unittested -- that would let us test it somewhat
+before the actual ec overwrites code merges.
+
+Whatever needs to happen to the ec plugin interface can probably be
+done independently of the rest of this (pending resolution of
+questions below).
+
+The actual nuts and bolts of performing the ec overwrite it seems to
+me can't be productively tested (and therefore implemented) until the
+above are complete, so best to get all of the supporting code in
+first.
+
+Open Questions
+==============
+
+Is there a code we should be using that would let us compute a parity
+delta without rereading and reencoding the full stripe? If so, is it
+the kind of thing we need to design for now, or can it be reasonably
+put off?
+
+What needs to happen to the EC plugin interface?
diff --git a/doc/dev/osd_internals/index.rst b/doc/dev/osd_internals/index.rst
new file mode 100644
index 000000000..7e82914aa
--- /dev/null
+++ b/doc/dev/osd_internals/index.rst
@@ -0,0 +1,10 @@
+==============================
+OSD developer documentation
+==============================
+
+.. rubric:: Contents
+
+.. toctree::
+ :glob:
+
+ *
diff --git a/doc/dev/osd_internals/last_epoch_started.rst b/doc/dev/osd_internals/last_epoch_started.rst
new file mode 100644
index 000000000..c31cc66b5
--- /dev/null
+++ b/doc/dev/osd_internals/last_epoch_started.rst
@@ -0,0 +1,60 @@
+======================
+last_epoch_started
+======================
+
+``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i``
+such that all writes committed in ``i`` or earlier are reflected in the
+local info/log and no writes after ``i`` are reflected in the local
+info/log. Since no committed write is ever divergent, even if we
+get an authoritative log/info with an older ``info.last_epoch_started``,
+we can leave our ``info.last_epoch_started`` alone since no writes could
+have committed in any intervening interval (See PG::proc_master_log).
+
+``info.history.last_epoch_started`` records a lower bound on the most
+recent interval in which the PG as a whole went active and accepted
+writes. On a particular OSD it is also an upper bound on the
+activation epoch of intervals in which writes in the local PG log
+occurred: we update it before accepting writes. Because all
+committed writes are committed by all acting set OSDs, any
+non-divergent writes ensure that ``history.last_epoch_started`` was
+recorded by all acting set members in the interval. Once peering has
+queried one OSD from each interval back to some seen
+``history.last_epoch_started``, it follows that no interval after the max
+``history.last_epoch_started`` can have reported writes as committed
+(since we record it before recording client writes in an interval).
+Thus, the minimum ``last_update`` across all infos with
+``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an
+upper bound on writes reported as committed to the client.
+
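+Schematically (a simplified sketch, not the actual peering code), the bound
+can be computed like this:
+
+::
+
+    // pg_info fields are simplified to plain integers for illustration.
+    #include <algorithm>
+    #include <cstdint>
+    #include <vector>
+
+    struct SimpleInfo {
+      uint64_t last_epoch_started;   // info.last_epoch_started
+      uint64_t history_les;          // info.history.last_epoch_started
+      uint64_t last_update;          // simplified to a single counter
+    };
+
+    uint64_t committed_upper_bound(const std::vector<SimpleInfo> &infos)
+    {
+      uint64_t max_history_les = 0;
+      for (const auto &i : infos)
+        max_history_les = std::max(max_history_les, i.history_les);
+
+      uint64_t bound = UINT64_MAX;
+      for (const auto &i : infos)
+        if (i.last_epoch_started >= max_history_les)
+          bound = std::min(bound, i.last_update);   // min over qualifying infos
+      return bound;
+    }
+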
+We update ``info.last_epoch_started`` with the initial activation message,
+but we only update ``history.last_epoch_started`` after the new
+``info.last_epoch_started`` is persisted (possibly along with the first
+write). This ensures that we do not require an OSD with the most
+recent ``info.last_epoch_started`` until all acting set OSDs have recorded
+it.
+
+In ``find_best_info``, we do include ``info.last_epoch_started`` values when
+calculating ``max_last_epoch_started_found`` because we want to avoid
+designating a log entry divergent which in a prior interval would have
+been non-divergent since it might have been used to serve a read. In
+``activate()``, we use the peer's ``last_epoch_started`` value as a bound on
+how far back divergent log entries can be found.
+
+However, in a case like
+
+.. code::
+
+ calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556
+
+since osd.1 is the only one which recorded info.les=477, while osd.4 and osd.0
+(which were the acting set in that interval) did not (osd.4 restarted and osd.0
+did not get the message in time), the PG is marked incomplete even though
+either osd.4 or osd.0 would have been a valid choice. To avoid this, we do not
+consider ``info.les`` for incomplete peers when calculating
+``min_last_epoch_started_found``. It would not have been in the acting
+set, so we must have another OSD from that interval anyway (if
+``maybe_went_rw``). If that OSD does not remember that ``info.les``, then we
+cannot have served reads.
diff --git a/doc/dev/osd_internals/log_based_pg.rst b/doc/dev/osd_internals/log_based_pg.rst
new file mode 100644
index 000000000..99cffd3d9
--- /dev/null
+++ b/doc/dev/osd_internals/log_based_pg.rst
@@ -0,0 +1,208 @@
+.. _log-based-pg:
+
+============
+Log Based PG
+============
+
+Background
+==========
+
+Why PrimaryLogPG?
+-----------------
+
+Currently, consistency for all Ceph pool types is ensured by primary
+log-based replication. This goes for both erasure-coded (EC) and
+replicated pools.
+
+Primary log-based replication
+-----------------------------
+
+Reads must return data written by any write which completed (where the
+client could possibly have received a commit message). There are lots
+of ways to handle this, but Ceph's architecture makes it easy for
+everyone at any map epoch to know who the primary is. Thus, the easy
+answer is to route all writes for a particular PG through a single
+ordering primary and then out to the replicas. Though we only
+actually need to serialize writes on a single RADOS object (and even then,
+the partial ordering only really needs to provide an ordering between
+writes on overlapping regions), we might as well serialize writes on
+the whole PG since it lets us represent the current state of the PG
+using two numbers: the epoch of the map on the primary in which the
+most recent write started (this is a bit stranger than it might seem
+since map distribution itself is asynchronous -- see Peering and the
+concept of interval changes) and an increasing per-PG version number
+-- this is referred to in the code with type ``eversion_t`` and stored as
+``pg_info_t::last_update``. Furthermore, we maintain a log of "recent"
+operations extending back at least far enough to include any
+*unstable* writes (writes which have been started but not committed)
+and objects which aren't up-to-date locally (see recovery and
+backfill). In practice, the log will extend much further
+(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when not
+clean) because it's handy for quickly performing recovery.
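+
+As a rough illustration, the per-PG version described above is just an
+(epoch, version) pair ordered lexicographically; the sketch below is a
+simplified stand-in for the real ``eversion_t`` in ``osd_types.h``::
+
+  // Simplified sketch of the (epoch, version) pair used to order PG updates.
+  #include <cstdint>
+  #include <tuple>
+
+  struct eversion_sketch {
+    uint32_t epoch = 0;    // map epoch in which the write was started
+    uint64_t version = 0;  // monotonically increasing per-PG counter
+
+    bool operator<(const eversion_sketch& rhs) const {
+      // A later epoch always orders after an earlier one; within an epoch
+      // the per-PG version breaks the tie.
+      return std::tie(epoch, version) < std::tie(rhs.epoch, rhs.version);
+    }
+  };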
+
+Using this log, as long as we talk to a non-empty subset of the OSDs
+which must have accepted any completed writes from the most recent
+interval in which we accepted writes, we can determine a conservative
+log which must contain any write which has been reported to a client
+as committed. There is some freedom here: we can choose any log entry
+between the oldest head remembered by an element of that set (any
+newer cannot have completed without that log containing it) and the
+newest head remembered (clearly, all writes in the log were started,
+so it's fine for us to remember them) as the new head. This is the
+main point of divergence between replicated pools and EC pools in
+``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
+option to avoid the client needing to replay those operations and
+instead recover the other copies. EC pools instead try to choose
+the *oldest* option available to them.
+
+The reason for this gets to the heart of the rest of the differences
+in implementation: one copy will not generally be enough to
+reconstruct an EC object. Indeed, there are encodings where some log
+combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding
+where 3 of the shards remember a write, but the other 3 do not -- neither
+version is present on the 4 shards needed to reconstruct it). For this
+reason, log entries
+representing *unstable* writes (writes not yet committed to the
+client) must be rollbackable using only local information on EC pools.
+Log entries in general may therefore be rollbackable (and in that case,
+via a delayed application or via a set of instructions for rolling
+back an in-place update) or not. Replicated pool log entries are
+never able to be rolled back.
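+
+A very rough sketch of the rollback information a log entry might carry is
+shown below; the names are illustrative only (in the real code this role is
+played by the mod description embedded in ``pg_log_entry_t``)::
+
+  // Illustrative sketch only: a log entry either carries enough local
+  // information to undo the update, or it does not (as on replicated pools).
+  #include <cstdint>
+  #include <vector>
+
+  struct rollback_step {
+    enum class kind { truncate_append, restore_attrs, delete_object };
+    kind op;
+    uint64_t old_size = 0;  // e.g. size to truncate back to after an append
+  };
+
+  struct log_entry_sketch {
+    uint64_t epoch = 0;
+    uint64_t version = 0;
+    bool rollbackable = false;         // replicated pool entries: always false
+    std::vector<rollback_step> steps;  // instructions to undo an in-place update
+
+    bool can_rollback() const { return rollbackable; }
+  };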
+
+For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
+``osd_types.h:pg_log_entry_t``, and peering in general.
+
+ReplicatedBackend/ECBackend unification strategy
+================================================
+
+PGBackend
+---------
+
+The fundamental difference between replication and erasure coding
+is that replication can do destructive updates while erasure coding
+cannot. It would be really annoying if we needed to have two entire
+implementations of ``PrimaryLogPG`` since there
+are really only a few fundamental differences:
+
+#. How reads work -- async only, requires remote reads for EC
+#. How writes work -- either restricted to append, or must write aside and do a
+   two-phase commit (tpc)
+#. Whether we choose the oldest or newest possible head entry during peering
+#. A bit of extra information in the log entry to enable rollback
+
+and so many similarities
+
+#. All of the stats and metadata for objects
+#. The high level locking rules for mixing client IO with recovery and scrub
+#. The high level locking rules for mixing reads and writes without exposing
+ uncommitted state (which might be rolled back or forgotten later)
+#. The process, metadata, and protocol needed to determine the set of osds
+ which participated in the most recent interval in which we accepted writes
+#. etc.
+
+Instead, we choose a few abstractions (and a few kludges) to paper over the differences:
+
+#. ``PGBackend``
+#. ``PGTransaction``
+#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and ``calc_ec_acting``
+#. Various bits of the write pipeline disallow some operations based on pool
+ type -- like omap operations, class operation reads, and writes which are
+ not aligned appends (officially, so far) for EC
+#. Misc other kludges here and there
+
+``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2 above
+and the addition of 4 as needed to the log entries.
+
+The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
+require much additional explanation. More detail on the ``ECBackend`` can be
+found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.
+
+PGBackend Interface Explanation
+===============================
+
+Note: this is from a design document that predated the Firefly release
+and is probably out of date w.r.t. some of the method names.
+
+Readable vs Degraded
+--------------------
+
+For a replicated pool, an object is readable IFF it is present on
+the primary (at the right version). For an EC pool, we need at least
+``k`` data shards present to perform a read, and we need it on the primary. For
+this reason, ``PGBackend`` needs to include some interfaces for determining
+when recovery is required to serve a read vs a write. This also
+changes the rules for when peering has enough logs to prove that the PG is
+recoverable.
+
+Core Changes:
+
+- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
+ | objects to allow the user to make these determinations.
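+
+As a rough sketch of what such predicate objects could look like (types and
+names here are simplified and hypothetical; the real interfaces live in
+``PGBackend.h``), a replicated backend needs only one intact copy to recover
+an object, whereas an EC backend needs at least ``k`` shards::
+
+  // Illustrative sketch: predicates the backend hands out so generic PG code
+  // can ask "given this set of shards, is the object recoverable?" without
+  // knowing the pool type.
+  #include <cstddef>
+  #include <set>
+
+  using shard_id = int;  // stand-in for pg_shard_t
+
+  struct IsRecoverableSketch {
+    virtual ~IsRecoverableSketch() = default;
+    virtual bool operator()(const std::set<shard_id>& have) const = 0;
+  };
+
+  // Replicated pools: any single surviving copy is enough.
+  struct ReplicatedRecoverable : IsRecoverableSketch {
+    bool operator()(const std::set<shard_id>& have) const override {
+      return !have.empty();
+    }
+  };
+
+  // EC pools: need at least k distinct shards to reconstruct the object.
+  struct ECRecoverable : IsRecoverableSketch {
+    explicit ECRecoverable(std::size_t k) : k(k) {}
+    bool operator()(const std::set<shard_id>& have) const override {
+      return have.size() >= k;
+    }
+    std::size_t k;
+  };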
+
+Client Reads
+------------
+
+Reads from a replicated pool can always be satisfied
+synchronously by the primary OSD. Within an erasure coded pool,
+the primary will need to request data from some number of replicas in
+order to satisfy a read. ``PGBackend`` will therefore need to provide
+separate ``objects_read_sync`` and ``objects_read_async`` interfaces where
+the former won't be implemented by the ``ECBackend``.
+
+``PGBackend`` interfaces:
+
+- ``objects_read_sync``
+- ``objects_read_async``
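+
+A hedged sketch of the shape of these two read interfaces follows; the
+signatures are simplified and illustrative, not the real declarations in
+``PGBackend.h``::
+
+  // Illustrative sketch only.
+  #include <cstdint>
+  #include <functional>
+  #include <string>
+
+  struct buffer_sketch { std::string data; };  // stand-in for bufferlist
+
+  struct PGBackendReadSketch {
+    virtual ~PGBackendReadSketch() = default;
+
+    // Synchronous read: only meaningful when the data is entirely local,
+    // so the replicated backend implements it and the EC backend does not.
+    virtual int objects_read_sync(const std::string& oid, uint64_t off,
+                                  uint64_t len, buffer_sketch* out) = 0;
+
+    // Asynchronous read: the EC backend gathers the needed shards from
+    // other OSDs and invokes the completion once the data is reassembled.
+    virtual void objects_read_async(
+        const std::string& oid, uint64_t off, uint64_t len,
+        std::function<void(int result, buffer_sketch data)> on_complete) = 0;
+  };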
+
+Scrubs
+------
+
+We currently have two scrub modes with different default frequencies:
+
+#. [shallow] scrub: compares the set of objects and metadata, but not
+ the contents
+#. deep scrub: compares the set of objects, metadata, and a CRC32 of
+ the object contents (including omap)
+
+The primary requests a scrubmap from each replica for a particular
+range of objects. The replica fills out this scrubmap for the range
+of objects including, if the scrub is deep, a CRC32 of the contents of
+each object. The primary gathers these scrubmaps from each replica
+and performs a comparison identifying inconsistent objects.
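+
+As a simplified illustration of the comparison step for a replicated pool
+(names hypothetical; the real code builds ``ScrubMap`` objects and compares
+much more metadata via the ``be_*`` methods), assume each shard reports an
+object-to-digest map::
+
+  // Illustrative sketch: report objects whose recorded digest differs
+  // between shards, or which are missing on some shard.
+  #include <cstddef>
+  #include <cstdint>
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  using scrubmap_sketch = std::map<std::string, uint32_t>;  // object -> crc32
+
+  std::vector<std::string> find_inconsistent(
+      const std::vector<scrubmap_sketch>& shard_maps) {
+    std::vector<std::string> bad;
+    if (shard_maps.empty())
+      return bad;
+    const auto& authoritative = shard_maps.front();  // e.g. the primary's map
+    for (const auto& [obj, crc] : authoritative) {
+      for (std::size_t i = 1; i < shard_maps.size(); ++i) {
+        auto it = shard_maps[i].find(obj);
+        if (it == shard_maps[i].end() || it->second != crc) {
+          bad.push_back(obj);  // missing on a shard, or digest mismatch
+          break;
+        }
+      }
+    }
+    return bad;
+  }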
+
+Most of this can work essentially unchanged with an erasure coded PG, with
+the caveat that the ``PGBackend`` implementation must be in charge of
+actually doing the scan.
+
+
+``PGBackend`` interfaces:
+
+- ``be_*``
+
+Recovery
+--------
+
+The logic for recovering an object depends on the backend. With
+the current replicated strategy, we first pull the object replica
+to the primary and then concurrently push it out to the replicas.
+With the erasure coded strategy, we probably want to read the
+minimum number of replica chunks required to reconstruct the object
+and push out the replacement chunks concurrently.
+
+Another difference is that objects in an erasure coded PG may be
+unrecoverable without being unfound. The ``unfound`` state
+should probably be renamed to ``unrecoverable``. Also, the
+``PGBackend`` implementation will have to be able to direct the search
+for PG replicas with unrecoverable object chunks and to be able
+to determine whether a particular object is recoverable.
+
+
+Core changes:
+
+- ``s/unfound/unrecoverable``
+
+``PGBackend`` interfaces:
+
+- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
+- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
+- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
+- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
+- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_
diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst
new file mode 100644
index 000000000..7be4350ea
--- /dev/null
+++ b/doc/dev/osd_internals/manifest.rst
@@ -0,0 +1,589 @@
+========
+Manifest
+========
+
+
+Introduction
+============
+
+As described in ``../deduplication.rst``, adding transparent redirect
+machinery to RADOS would enable a more capable tiering solution
+than RADOS currently has with "cache/tiering".
+
+See ``../deduplication.rst``
+
+At a high level, each object has a piece of metadata embedded in
+the ``object_info_t`` which can map subsets of the object data payload
+to (refcounted) objects in other pools.
+
+This document exists to detail:
+
+1. Manifest data structures
+2. Rados operations for manipulating manifests.
+3. Status and Plans
+
+
+Intended Usage Model
+====================
+
+RBD
+---
+
+For RBD, the primary goal is for either an OSD-internal agent or a
+cluster-external agent to be able to transparently shift portions
+of the constituent 4MB extents between a dedup pool and a hot base
+pool.
+
+As such, RBD operations (including class operations and snapshots)
+must have the same observable results regardless of the current
+status of the object.
+
+Moreover, tiering/dedup operations must interleave with RBD operations
+without changing the result.
+
+Thus, here is a sketch of how I'd expect a tiering agent to perform
+basic operations:
+
+* Demote cold RBD chunk to slow pool:
+
+ 1. Read object, noting current user_version.
+ 2. In memory, run CDC implementation to fingerprint object.
+ 3. Write out each resulting extent to an object in the cold pool
+ using the CAS class.
+ 4. Submit operation to base pool:
+
+ * ``ASSERT_VER`` with the user version from the read to fail if the
+ object has been mutated since the read.
+ * ``SET_CHUNK`` for each of the extents to the corresponding object
+      in the cold pool.
+ * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
+ Results in each chunk being marked ``MISSING``.
+
+ RBD users should then either see the state prior to the demotion or
+ subsequent to it.
+
+ Note that between 3 and 4, we potentially leak references, so a
+ periodic scrub would be needed to validate refcounts.
+
+* Promote cold RBD chunk to fast pool.
+
+ 1. Submit ``TIER_PROMOTE``
+
+For clones, all of the above would be identical except that the
+initial read would need a ``LIST_SNAPS`` to determine which clones exist
+and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
+the ``cloneid``.
+
+RadosGW
+-------
+
+For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
+manifest machinery in the OSD to hide the distinction between an object
+that has been dedup'd and one fully present in the base pool.
+
+For writes, RGW could operate as RBD does above, but could
+optionally have the freedom to fingerprint prior to doing the write.
+In that case, it could immediately write out the target objects to the
+CAS pool and then atomically write an object with the corresponding
+chunks set.
+
+Status and Future Work
+======================
+
+At the moment, initial versions of a manifest data structure along
+with IO path support and rados control operations exist. This section
+is meant to outline next steps.
+
+At a high level, our future work plan is:
+
+- Cleanups: Address immediate inconsistencies and shortcomings outlined
+ in the next section.
+- Testing: Rados relies heavily on teuthology failure testing to validate
+ features like cache/tiering. We'll need corresponding tests for
+ manifest operations.
+- Snapshots: We want to be able to deduplicate portions of clones
+ below the level of the rados snapshot system. As such, the
+ rados operations below need to be extended to work correctly on
+ clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
+ corresponding extent in the base pool, and correctly maintain OSD metadata).
+- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
+ cache/tiering implementation, but to do that we need to ensure that we
+ can address the same use cases.
+
+
+Cleanups
+--------
+
+The existing implementation has some things that need to be cleaned up:
+
+* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
+ one couldn't create an object atomically as a redirect.
+* ``SET_CHUNK``:
+
+ * Appears to trigger a new clone as user_modify gets set in
+ ``do_osd_ops``. This probably isn't desirable, see Snapshots section
+ below for some options on how generally to mix these operations
+ with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
+ user_modify.
+ * Appears to assume that the corresponding section of the object
+ does not exist (sets ``FLAG_MISSING``) but does not check whether the
+ corresponding extent exists already in the object. Should always
+ leave the extent clean.
+ * Appears to clear the manifest unconditionally if not chunked,
+ that's probably wrong. We should return an error if it's a
+ ``REDIRECT`` ::
+
+ case CEPH_OSD_OP_SET_CHUNK:
+ if (oi.manifest.is_redirect()) {
+ result = -EINVAL;
+ goto fail;
+ }
+
+
+* ``TIER_PROMOTE``:
+
+ * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
+ to copy them back in, but does not unset the redirect or clear the
+ reference. This violates the invariant that a redirect object
+ should be empty in the base pool. In particular, as long as the
+ redirect is set, it appears that all operations will be proxied
+ even after the promote defeating the purpose. We do want ``PROMOTE``
+ to be able to atomically replace a redirect with the actual
+ object, so the solution is to clear the redirect at the end of the
+ promote.
+ * For a chunked manifest, we appear to flush prior to promoting.
+ Promotion will often be used to prepare an object for low latency
+ reads and writes, accordingly, the only effect should be to read
+ any ``MISSING`` extents into the base pool. No flushing should be done.
+
+* High Level:
+
+ * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
+ at a dedup extent. Writing the mutated extent back to the dedup pool
+ requires writing a new object since the previous one cannot be mutated,
+ just as it would if it hadn't been dedup'd yet. Thus, we should always
+ drop the reference and remove the manifest pointer.
+
+ * There isn't currently a way to "evict" an object region. With the above
+ change to ``SET_CHUNK`` to always retain the existing object region, we
+ need an ``EVICT_CHUNK`` operation to then remove the extent.
+
+
+Testing
+-------
+
+We rely really heavily on randomized failure testing. As such, we need
+to extend that testing to include dedup/manifest support as well. Here's
+a short list of the touchpoints:
+
+* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``
+
+ That test, of course, tests the existing cache/tiering machinery. Add
+ additional files to that directory that instead setup a dedup pool. Add
+ support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).
+
+* RBD tests
+
+ Add a test that runs an RBD workload concurrently with blind
+ promote/evict operations.
+
+* RGW
+
+ Add a test that runs a rgw workload concurrently with blind
+ promote/evict operations.
+
+
+Snapshots
+---------
+
+Fundamentally we need to be able to manipulate the manifest
+status of clones because we want to be able to dynamically promote,
+flush (if the state was dirty when the clone was created), and evict
+extents from clones.
+
+As such, the plan is to allow the ``object_manifest_t`` for each clone
+to be independent. Here's an incomplete list of the high level
+tasks:
+
+* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
+  to operate directly on clones.
+* Ensure that recovery checks the object_manifest prior to trying to
+ use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
+ are the two methods that would likely need to be modified.
+
+See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
+support details. I'd like to call out one particular data structure
+we may want to exploit.
+
+The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
+clones as part of leak detection.
+
+An important question is how we deal with the fact that many clones
+will frequently have references to the same backing chunks at the same
+offset. In particular, ``make_writeable`` will generally create a clone
+that shares the same ``object_manifest_t`` references with the exception
+of any extents modified in that transaction. The metadata that
+commits as part of that transaction must therefore map onto the same
+refcount as before because otherwise we'd have to first increment
+refcounts on backing objects (or risk a reference to a dead object).
+Thus, we introduce a simple convention: consecutive clones which
+share a reference at the same offset share the same refcount. This
+means that a write that invokes ``make_writeable`` may decrease refcounts,
+but not increase them. This has some consequences for removing clones.
+Consider the following sequence ::
+
+ write foo [0, 1024)
+ flush foo ->
+ head: [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=1, refcount(bbb)=1
+ snapshot 10
+ write foo [0, 512) ->
+ head: [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=1, refcount(bbb)=1
+ flush foo ->
+ head: [0, 512) ccc, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
+ snapshot 20
+ write foo [0, 512) (same contents as the original write)
+ head: [512, 1024) bbb
+ 20 : [0, 512) ccc, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=?, refcount(bbb)=1
+ flush foo
+ head: [0, 512) aaa, [512, 1024) bbb
+ 20 : [0, 512) ccc, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
+
+What should the refcount for ``aaa`` be at the end? By the
+above rule, it should be ``2`` since the two ``aaa`` refs are not
+contiguous. However, consider removing clone ``20`` ::
+
+ initial:
+ head: [0, 512) aaa, [512, 1024) bbb
+ 20 : [0, 512) ccc, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
+ trim 20
+ head: [0, 512) aaa, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0
+
+At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
+This means that removing ``20`` needs to check for refs held by
+the clones on either side which will then match.
+
+See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
+for the logic implementing this rule.
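+
+A minimal sketch of that rule follows (types and names simplified and
+hypothetical): when a clone is removed, a reference is dropped for a chunk
+held by the removed clone only if neither neighbouring clone maps the same
+offset to the same backing chunk, and one additional reference is dropped
+wherever the two neighbours match each other but the removed clone did not,
+since their two refs collapse into one once they become adjacent::
+
+  // Illustrative sketch of the "adjacent clones share the refcount" rule.
+  #include <cstdint>
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  using chunk_map_sketch = std::map<uint64_t, std::string>;  // offset -> backing oid
+
+  std::vector<std::string> refs_to_drop_on_removal(
+      const chunk_map_sketch& removed,
+      const chunk_map_sketch& previous_clone,
+      const chunk_map_sketch& next_clone) {
+    std::vector<std::string> to_drop;
+
+    auto same_at = [](const chunk_map_sketch& m, uint64_t off,
+                      const std::string& oid) {
+      auto it = m.find(off);
+      return it != m.end() && it->second == oid;
+    };
+
+    // Chunks held by the removed clone: the ref survives only if an adjacent
+    // clone shares the same backing chunk at the same offset.
+    for (const auto& [offset, oid] : removed) {
+      if (!same_at(previous_clone, offset, oid) &&
+          !same_at(next_clone, offset, oid))
+        to_drop.push_back(oid);
+    }
+
+    // Offsets where both neighbours reference the same backing chunk but the
+    // removed clone did not: their two refs collapse into one.
+    for (const auto& [offset, oid] : previous_clone) {
+      if (same_at(next_clone, offset, oid) && !same_at(removed, offset, oid))
+        to_drop.push_back(oid);
+    }
+    return to_drop;
+  }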
+
+This seems complicated, but it gets us two valuable properties:
+
+1) The refcount change from make_writeable will not block on
+ incrementing a ref
+2) We don't need to load the ``object_manifest_t`` for every clone
+ to determine how to handle removing one -- just the ones
+ immediately preceding and succeeding it.
+
+All clone operations will need to consider adjacent ``chunk_maps``
+when adding or removing references.
+
+Data Structures
+===============
+
+Each RADOS object contains an ``object_manifest_t`` embedded within the
+``object_info_t`` (see ``osd_types.h``):
+
+::
+
+ struct object_manifest_t {
+ enum {
+ TYPE_NONE = 0,
+ TYPE_REDIRECT = 1,
+ TYPE_CHUNKED = 2,
+ };
+ uint8_t type; // redirect, chunked, ...
+ hobject_t redirect_target;
+ std::map<uint64_t, chunk_info_t> chunk_map;
+  };
+
+The ``type`` enum reflects three possible states an object can be in:
+
+1. ``TYPE_NONE``: normal RADOS object
+2. ``TYPE_REDIRECT``: object payload is backed by a single object
+ specified by ``redirect_target``
+3. ``TYPE_CHUNKED``: object payload is distributed among objects with
+   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
+   the offset of the chunk to a ``chunk_info_t`` as shown below, also
+   specifying the ``length``, target ``oid``, and ``flags``.
+
+::
+
+ struct chunk_info_t {
+ typedef enum {
+ FLAG_DIRTY = 1,
+ FLAG_MISSING = 2,
+ FLAG_HAS_REFERENCE = 4,
+ FLAG_HAS_FINGERPRINT = 8,
+ } cflag_t;
+ uint32_t offset;
+ uint32_t length;
+ hobject_t oid;
+    cflag_t flags;   // FLAG_*
+  };
+
+
+``FLAG_DIRTY`` at this time can be set if an extent with a fingerprint
+is written. This should be changed to drop the fingerprint instead.
+
+
+Request Handling
+================
+
+Similarly to cache/tiering, the initial touchpoint is
+``maybe_handle_manifest_detail``.
+
+For manifest operations listed below, we return ``NOOP`` and continue onto
+dedicated handling within ``do_osd_ops``.
+
+For redirect objects which haven't been promoted (apparently ``oi.size >
+0`` indicates that it's present?) we proxy reads and writes.
+
+For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
+of the ops are reads of extents in the ``object_manifest_t chunk_map``),
+we proxy requests to those objects.
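+
+A rough sketch of that containment check (simplified and illustrative; the
+real check also validates the op types themselves)::
+
+  // Illustrative sketch: a read extent can be proxied to chunk objects only
+  // if it falls entirely within extents recorded in the chunk_map.
+  #include <cstdint>
+  #include <map>
+
+  // offset within the object -> chunk length (target oid, flags omitted)
+  bool extent_covered_by_chunks(const std::map<uint64_t, uint64_t>& chunks,
+                                uint64_t off, uint64_t len) {
+    const uint64_t end = off + len;
+    while (off < end) {
+      auto it = chunks.upper_bound(off);   // first chunk starting after off
+      if (it == chunks.begin())
+        return false;                      // nothing starts at or before off
+      --it;                                // chunk starting at or before off
+      const uint64_t chunk_end = it->first + it->second;
+      if (off >= chunk_end)
+        return false;                      // off falls in a gap between chunks
+      off = chunk_end;                     // advance past this chunk
+    }
+    return true;
+  }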
+
+
+RADOS Interface
+================
+
+To set up deduplication one must provision two pools. One will act as the
+base pool and the other will act as the chunk pool. The base pool needs to be
+configured with the ``fingerprint_algorithm`` option as follows.
+
+::
+
+ ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
+ --yes-i-really-mean-it
+
+Create objects ::
+
+ rados -p base_pool put foo ./foo
+ rados -p chunk_pool put foo-chunk ./foo-chunk
+
+Make a manifest object ::
+
+  rados -p base_pool set-chunk foo $START_OFFSET $LENGTH --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference
+
+Operations:
+
+* ``set-redirect``
+
+ Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
+ in the ``target_pool``.
+ A redirected object will forward all operations from the client to the
+ ``target_object``. ::
+
+ void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
+ uint64_t tgt_version, int flag = 0);
+
+ rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
+ <target_object>
+
+  Returns ``ENOENT`` if the object does not exist (TODO: why?).
+ Returns ``EINVAL`` if the object already is a redirect.
+
+ Takes a reference to target as part of operation, can possibly leak a ref
+ if the acting set resets and the client dies between taking the ref and
+ recording the redirect.
+
+ Truncates object, clears omap, and clears xattrs as a side effect.
+
+ At the top of ``do_osd_ops``, does not set user_modify.
+
+ This operation is not a user mutation and does not trigger a clone to be created.
+
+ There are two purposes of ``set_redirect``:
+
+  1. Redirect all operations to the target object (like a proxy)
+ 2. Cache when ``tier_promote`` is called (redirect will be cleared at this time).
+
+* ``set-chunk``
+
+ Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
+ ``target_object``. ::
+
+ void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
+ std::string tgt_oid, uint64_t tgt_offset, int flag = 0);
+
+ rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
+ <caspool> <target_object> <target-offset>
+
+  Returns ``ENOENT`` if the object does not exist (TODO: why?).
+  Returns ``EINVAL`` if the object already is a redirect.
+  Returns ``EINVAL`` on an ill-formed parameter buffer.
+  Returns ``ENOTSUPP`` if existing mapped chunks overlap with the new chunk mapping.
+
+ Takes references to targets as part of operation, can possibly leak refs
+ if the acting set resets and the client dies between taking the ref and
+ recording the redirect.
+
+ Truncates object, clears omap, and clears xattrs as a side effect.
+
+ This operation is not a user mutation and does not trigger a clone to be created.
+
+ TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::
+
+ if (!oi.manifest.is_chunked()) {
+ oi.manifest.clear();
+ }
+
+* ``evict-chunk``
+
+ Clears an extent from an object leaving only the manifest link between
+ it and the ``target_object``. ::
+
+ void evict_chunk(
+ uint64_t offset, uint64_t length, int flag = 0);
+
+ rados -p base_pool evict-chunk <offset> <length> <object>
+
+ Returns ``EINVAL`` if the extent is not present in the manifest.
+
+ Note: this does not exist yet.
+
+
+* ``tier-promote``
+
+  Promotes the object, ensuring that subsequent reads and writes will be local ::
+
+ void tier_promote();
+
+ rados -p base_pool tier-promote <obj-name>
+
+ Returns ``ENOENT`` if the object does not exist
+
+ For a redirect manifest, copies data to head.
+
+ TODO: Promote on a redirect object needs to clear the redirect.
+
+  For a chunked manifest, reads all ``MISSING`` extents into the base pool;
+  subsequent reads and writes will be served from the base pool.
+
+ Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
+ resulting ``copy_get`` operation will issue reads which will then be redirected by
+ the normal manifest read machinery.
+
+ Does not set the ``user_modify`` flag.
+
+ Future work will involve adding support for specifying a ``clone_id``.
+
+* ``unset-manifest``
+
+  Unset the manifest info in an object that has a manifest. ::
+
+ void unset_manifest();
+
+ rados -p base_pool unset-manifest <obj-name>
+
+ Clears manifest chunks or redirect. Lazily releases references, may
+ leak.
+
+ ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
+ and so will trigger a snapshot. Note, this will be true even for a
+ redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
+ be fixed -- ``unset-manifest`` should not be a ``user_modify``.
+
+* ``tier-flush``
+
+ Flush the object which has chunks to the chunk pool. ::
+
+ void tier_flush();
+
+ rados -p base_pool tier-flush <obj-name>
+
+ Included in the ``user_modify=false`` ``ignorelist``, does not trigger a clone.
+
+ Does not evict the extents.
+
+
+ceph-dedup-tool
+===============
+
+``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
+and fixing the reference count (see ``./refcount.rst``).
+
+* Find an optimal chunk offset
+
+ a. Fixed chunk
+
+    To find a good fixed chunk length, run the following command multiple
+    times while varying the ``chunk_size``. ::
+
+ ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
+ --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
+
+  b. Rabin chunk (Rabin-Karp algorithm)
+
+   Rabin-Karp is a string-searching algorithm based
+   on a rolling hash. A rolling hash alone is not enough for deduplication
+   because we don't know where the chunk boundaries are, so we use
+   content-based slicing (content-defined chunking) on top of the rolling hash.
+   The current implementation uses the simplest approach: look for chunk boundaries
+   by inspecting the rolling hash for a pattern (such as the
+   lower N bits all being zero). A minimal illustrative sketch of this
+   boundary test appears at the end of this section.
+
+   Users who want to use deduplication need to find an ideal chunk offset.
+   To do so, they should discover the optimal configuration for their data
+   workload via ``ceph-dedup-tool``.
+   This information will then be used for object chunking through
+   the ``set-chunk`` API. ::
+
+ ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
+ --chunk-algorithm rabin --fingerprint-algorithm rabin
+
+   ``ceph-dedup-tool`` has many options for tuning ``rabin chunk``.
+   The ``rabin chunk`` options are as follows. ::
+
+ --mod-prime <uint64_t>
+ --rabin-prime <uint64_t>
+ --pow <uint64_t>
+ --chunk-mask-bit <uint32_t>
+ --window-size <uint32_t>
+ --min-chunk <uint32_t>
+ --max-chunk <uint64_t>
+
+   Users should refer to the following equation when using the above options for ``rabin chunk``. ::
+
+ rabin_hash =
+ (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
+
+  c. Fixed chunk vs content-defined chunk
+
+   Content-defined chunking may or may not be the optimal solution.
+   For example,
+
+   Data chunk ``A`` : ``abcdefgabcdefgabcdefg``
+
+   Consider deduplicating data chunk ``A``. The ideal chunk boundary repeats
+   every ``7`` bytes (``abcdefg``), so with fixed chunking a chunk length of
+   ``7`` is optimal. With content-based slicing, however, the optimal chunk
+   length may not be found (the dedup ratio will not reach 100%) because
+   suitable parameters such as the boundary bit, window size and prime value
+   must be discovered first. This is not as easy as fixed chunking.
+   Content-defined chunking is, however, very effective in the following case.
+
+   Data chunk ``B`` : ``abcdefgabcdefgabcdefg``
+
+   Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``
+
+   ``C`` only prepends a single byte ``T`` to ``B``, which shifts every
+   fixed-size chunk boundary so fixed chunking finds no duplicate chunks,
+   while content-defined chunking re-synchronizes on the same content-derived
+   boundaries and still detects the shared data.
+
+
+* Fix reference count
+
+  The key idea behind reference counting for dedup is to err on the side of
+  false positives: an inconsistency of the form
+  ``(manifest object (no ref), chunk object (has ref))`` can occur, but never
+  ``(manifest object (has ref), chunk object (no ref))``.
+  To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::
+
+ ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
+
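+Returning to the Rabin chunking discussed above, the following is a minimal,
+illustrative sketch of content-defined chunking with a simple rolling-style
+hash and a "lower N bits are zero" boundary test. It is not the polynomial or
+parameterization used by ``ceph-dedup-tool``; it only demonstrates the idea. ::
+
+  // Illustrative content-defined chunking sketch.
+  #include <cstddef>
+  #include <cstdint>
+  #include <string>
+  #include <vector>
+
+  std::vector<std::string> cdc_split(const std::string& data,
+                                     std::size_t min_chunk = 4,
+                                     std::size_t max_chunk = 64,
+                                     uint64_t mask = 0x7 /* low 3 bits zero */) {
+    std::vector<std::string> chunks;
+    std::size_t start = 0;
+    uint64_t hash = 0;
+    for (std::size_t i = 0; i < data.size(); ++i) {
+      // Simplified accumulating hash; a real implementation also subtracts
+      // the byte leaving the window (see the rabin_hash equation above).
+      hash = hash * 131 + static_cast<unsigned char>(data[i]);
+      const std::size_t len = i - start + 1;
+      const bool boundary =
+          (len >= min_chunk && (hash & mask) == 0) || len >= max_chunk;
+      if (boundary) {
+        chunks.push_back(data.substr(start, len));
+        start = i + 1;
+        hash = 0;
+      }
+    }
+    if (start < data.size())
+      chunks.push_back(data.substr(start));
+    return chunks;
+  }
+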
diff --git a/doc/dev/osd_internals/map_message_handling.rst b/doc/dev/osd_internals/map_message_handling.rst
new file mode 100644
index 000000000..f8104f3fd
--- /dev/null
+++ b/doc/dev/osd_internals/map_message_handling.rst
@@ -0,0 +1,131 @@
+===========================
+Map and PG Message handling
+===========================
+
+Overview
+--------
+The OSD handles routing incoming messages to PGs, creating the PG if necessary
+in some cases.
+
+PG messages generally come in two varieties:
+
+ 1. Peering Messages
+ 2. Ops/SubOps
+
+There are several ways in which a message might be dropped or delayed. It is
+important that the message delaying does not result in a violation of certain
+message ordering requirements on the way to the relevant PG handling logic:
+
+ 1. Ops referring to the same object must not be reordered.
+ 2. Peering messages must not be reordered.
+ 3. Subops must not be reordered.
+
+MOSDMap
+-------
+MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the
+OSD must perform several tasks:
+
+ 1. Persist the new maps to the filestore.
+ Several PG operations rely on having access to maps dating back to the last
+ time the PG was clean.
+ 2. Update and persist the superblock.
+ 3. Update OSD state related to the current map.
+ 4. Expose new maps to PG processes via *OSDService*.
+ 5. Remove PGs due to pool removal.
+ 6. Queue dummy events to trigger PG map catchup.
+
+Each PG asynchronously catches up to the currently published map during
+process_peering_events before processing the event. As a result, different
+PGs may have different views as to the "current" map.
+
+One consequence of this design is that messages containing submessages from
+multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
+with the PG's epoch as well as tagging the message as a whole with the OSD's
+current published epoch.
+
+MOSDPGOp/MOSDPGSubOp
+--------------------
+See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
+
+MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used
+between OSDs to coordinate most non peering activities including replicating
+MOSDPGOp operations.
+
+OSD::require_same_or_newer_map checks that the current OSDMap is at least
+as new as the map epoch indicated on the message. If not, the message is
+queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this
+cannot violate the above conditions since any two messages will be queued
+in order of receipt and if a message is received with epoch e0, a later message
+from the same source must be at epoch at least e0. Note that two PGs from
+the same OSD count for these purposes as different sources for single PG
+messages. That is, messages from different PGs may be reordered.
+
+
+MOSDPGOps follow the following process:
+
+ 1. OSD::handle_op: validates permissions and crush mapping.
+    Discards the request if the client is no longer connected and so cannot receive the reply (see OSD::op_is_discardable).
+ See OSDService::handle_misdirected_op
+ See PG::op_has_sufficient_caps
+ See OSD::require_same_or_newer_map
+ 2. OSD::enqueue_op
+
+MOSDSubOps follow the following process:
+
+ 1. OSD::handle_sub_op checks that sender is an OSD
+ 2. OSD::enqueue_op
+
+OSD::enqueue_op calls PG::queue_op which checks waiting_for_map before calling OpWQ::queue which adds the op to the queue of the PG responsible for handling it.
+
+OSD::dequeue_op is then eventually called, with a lock on the PG. At
+this time, the op is passed to PG::do_request, which checks that:
+
+ 1. the PG map is new enough (PG::must_delay_op)
+ 2. the client requesting the op has enough permissions (PG::op_has_sufficient_caps)
+ 3. the op is not to be discarded (PG::can_discard_{request,op,subop,scan,backfill})
+ 4. the PG is active (PG::flushed boolean)
+ 5. the op is a CEPH_MSG_OSD_OP and the PG is in PG_STATE_ACTIVE state and not in PG_STATE_REPLAY
+
+If these conditions are not met, the op is either discarded or queued for later processing. If all conditions are met, the op is processed according to its type:
+
+ 1. CEPH_MSG_OSD_OP is handled by PG::do_op
+ 2. MSG_OSD_SUBOP is handled by PG::do_sub_op
+ 3. MSG_OSD_SUBOPREPLY is handled by PG::do_sub_op_reply
+ 4. MSG_OSD_PG_SCAN is handled by PG::do_scan
+ 5. MSG_OSD_PG_BACKFILL is handled by PG::do_backfill
+
+CEPH_MSG_OSD_OP processing
+--------------------------
+
+PrimaryLogPG::do_op handles a CEPH_MSG_OSD_OP op and will queue it
+
+ 1. in wait_for_all_missing if it is a CEPH_OSD_OP_PGLS for a designated snapid and some object updates are still missing
+ 2. in waiting_for_active if the op may write but the scrubber is working
+ 3. in waiting_for_missing_object if the op requires an object or a snapdir or a specific snap that is still missing
+ 4. in waiting_for_degraded_object if the op may write an object or a snapdir that is degraded, or if another object blocks it ("blocked_by")
+ 5. in waiting_for_backfill_pos if the op requires an object that will be available after the backfill is complete
+ 6. in waiting_for_ack if an ack from another OSD is expected
+ 7. in waiting_for_ondisk if the op is waiting for a write to complete
+
+Peering Messages
+----------------
+See OSD::handle_pg_(notify|info|log|query)
+
+Peering messages are tagged with two epochs:
+
+ 1. epoch_sent: map epoch at which the message was sent
+ 2. query_epoch: map epoch at which the message triggering the message was sent
+
+These are the same in cases where there was no triggering message. We discard
+a peering message if the PG in question has entered a new epoch since the
+message's query_epoch (see PG::old_peering_evt, PG::queue_peering_event).
+Notifies, infos, queries, and logs are all handled as PG::PeeringMachine events
+and are wrapped by PG::queue_* into PG::CephPeeringEvts, which include the
+created state machine event along with epoch_sent and query_epoch in order to
+generically check PG::old_peering_message upon insertion into and removal from
+the queue.
+
+Note, notifies, logs, and infos can trigger the creation of a PG. See
+OSD::get_or_create_pg.
+
+
diff --git a/doc/dev/osd_internals/mclock_wpq_cmp_study.rst b/doc/dev/osd_internals/mclock_wpq_cmp_study.rst
new file mode 100644
index 000000000..31ad18409
--- /dev/null
+++ b/doc/dev/osd_internals/mclock_wpq_cmp_study.rst
@@ -0,0 +1,476 @@
+=========================================
+ QoS Study with mClock and WPQ Schedulers
+=========================================
+
+Introduction
+============
+
+The mClock scheduler provides three controls for each service using it. In Ceph,
+the services using mClock are for example client I/O, background recovery,
+scrub, snap trim and PG deletes. The three controls, *reservation*, *weight*
+and *limit*, are used for predictable allocation of resources to
+each service in proportion to its weight, subject to the constraint that the
+service receives at least its reservation and no more than its limit. In Ceph,
+these controls are used to allocate IOPS for each service type provided the IOPS
+capacity of each OSD is known. The mClock scheduler is based on
+`the dmClock algorithm`_. See :ref:`dmclock-qos` section for more details.
+
+Ceph's use of mClock was primarily experimental and approached with an
+exploratory mindset. This is still true with other organizations and individuals
+who continue to either use the codebase or modify it according to their needs.
+
+DmClock exists in its own repository_. Before the Ceph *Pacific* release,
+mClock could be enabled by setting the :confval:`osd_op_queue` Ceph option to
+"mclock_scheduler". Additional mClock parameters like *reservation*, *weight*
+and *limit* for each service type could be set using Ceph options.
+For example, ``osd_mclock_scheduler_client_[res,wgt,lim]`` is one such option.
+See :ref:`dmclock-qos` section for more details. Even with all the mClock
+options set, the full capability of mClock could not be realized due to:
+
+- Unknown OSD capacity in terms of throughput (IOPS).
+- No limit enforcement. In other words, services using mClock were allowed to
+ exceed their limits resulting in the desired QoS goals not being met.
+- Share of each service type not distributed across the number of operational
+ shards.
+
+To resolve the above, refinements were made to the mClock scheduler in the Ceph
+code base. See :doc:`/rados/configuration/mclock-config-ref`. With the
+refinements, the usage of mClock is a bit more user-friendly and intuitive. This
+is one step of many to refine and optimize the way mClock is used in Ceph.
+
+Overview
+========
+
+A comparison study was performed as part of efforts to refine the mClock
+scheduler. The study involved running tests with client ops and background
+recovery operations in parallel with the two schedulers. The results were
+collated and then compared. The following statistics were compared between the
+schedulers from the test results for each service type:
+
+- External client
+
+ - Average throughput(IOPS),
+ - Average and percentile(95th, 99th, 99.5th) latency,
+
+- Background recovery
+
+ - Average recovery throughput,
+ - Number of misplaced objects recovered per second
+
+Test Environment
+================
+
+1. **Software Configuration**: CentOS 8.1.1911 Linux Kernel 4.18.0-193.6.3.el8_2.x86_64
+2. **CPU**: 2 x Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz
+3. **nproc**: 40
+4. **System Memory**: 64 GiB
+5. **Tuned-adm Profile**: network-latency
+6. **CephVer**: 17.0.0-2125-g94f550a87f (94f550a87fcbda799afe9f85e40386e6d90b232e) quincy (dev)
+7. **Storage**:
+
+ - Intel® NVMe SSD DC P3700 Series (SSDPE2MD800G4) [4 x 800GB]
+ - Seagate Constellation 7200 RPM 64MB Cache SATA 6.0Gb/s HDD (ST91000640NS) [4 x 1TB]
+
+Test Methodology
+================
+
+Ceph cbt_ was used to test the recovery scenarios. A new recovery test to
+generate background recoveries with client I/Os in parallel was created.
+See the next section for the detailed test steps. The test was executed 3 times
+with the default *Weighted Priority Queue (WPQ)* scheduler for comparison
+purposes. This was done to establish a credible mean value to compare
+the mClock scheduler results at a later point.
+
+After this, the same test was executed with mClock scheduler and with different
+mClock profiles i.e., *high_client_ops*, *balanced* and *high_recovery_ops* and
+the results collated for comparison. With each profile, the test was
+executed 3 times, and the average of those runs is reported in this study.
+
+.. note:: Tests with HDDs were performed with and without the bluestore WAL and
+ dB configured. The charts discussed further below help bring out the
+ comparison across the schedulers and their configurations.
+
+Establish Baseline Client Throughput (IOPS)
+===========================================
+
+Before the actual recovery tests, the baseline throughput was established for
+both the SSDs and the HDDs on the test machine by following the steps mentioned
+in the :doc:`/rados/configuration/mclock-config-ref` document under
+the "Benchmarking Test Steps Using CBT" section. For this study, the following
+baseline throughput for each device type was determined:
+
++--------------------------------------+-------------------------------------------+
+| Device Type | Baseline Throughput(@4KiB Random Writes) |
++======================================+===========================================+
+| **NVMe SSD** | 21500 IOPS (84 MiB/s) |
++--------------------------------------+-------------------------------------------+
+| **HDD (with bluestore WAL & dB)** | 340 IOPS (1.33 MiB/s) |
++--------------------------------------+-------------------------------------------+
+| **HDD (without bluestore WAL & dB)** | 315 IOPS (1.23 MiB/s) |
++--------------------------------------+-------------------------------------------+
+
+.. note:: The :confval:`bluestore_throttle_bytes` and
+ :confval:`bluestore_throttle_deferred_bytes` for SSDs were determined to be
+ 256 KiB. For HDDs, it was 40MiB. The above throughput was obtained
+ by running 4 KiB random writes at a queue depth of 64 for 300 secs.
+
+MClock Profile Allocations
+==========================
+
+The low-level mClock shares per profile are shown in the tables below. For
+parameters like *reservation* and *limit*, the shares are represented as a
+percentage of the total OSD capacity. For the *high_client_ops* profile, the
+*reservation* parameter is set to 50% of the total OSD capacity. Therefore, for
+the NVMe(baseline 21500 IOPS) device, a minimum of 10750 IOPS is reserved for
+client operations. These allocations are made under the hood once
+a profile is enabled.
+
+The *weight* parameter is unitless. See :ref:`dmclock-qos`.
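+
+As a worked example using the NVMe baseline above, the *high_client_ops*
+shares (see the table in the next subsection) translate into roughly the
+following IOPS allocations::
+
+  reservation_iops = baseline_iops x reservation_fraction
+
+  client:                 21500 x 0.50 = 10750 IOPS reserved
+  background recovery:    21500 x 0.25 =  5375 IOPS reserved
+  background best effort: 21500 x 0.25 =  5375 IOPS reserved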
+
+high_client_ops(default)
+````````````````````````
+
+This profile allocates more reservation and limit to external clients ops
+when compared to background recoveries and other internal clients within
+Ceph. This profile is enabled by default.
+
++------------------------+-------------+--------+-------+
+| Service Type | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client | 50% | 2 | MAX |
++------------------------+-------------+--------+-------+
+| background recovery | 25% | 1 | 100% |
++------------------------+-------------+--------+-------+
+| background best effort | 25% | 2 | MAX |
++------------------------+-------------+--------+-------+
+
+balanced
+`````````
+
+This profile allocates equal reservations to client ops and background
+recovery ops. The internal best effort clients get a lower reservation
+but a very high limit so that they can complete quickly if there are
+no competing services.
+
++------------------------+-------------+--------+-------+
+| Service Type | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client | 40% | 1 | 100% |
++------------------------+-------------+--------+-------+
+| background recovery | 40% | 1 | 150% |
++------------------------+-------------+--------+-------+
+| background best effort | 20% | 2 | MAX |
++------------------------+-------------+--------+-------+
+
+high_recovery_ops
+`````````````````
+
+This profile allocates more reservation to background recoveries when
+compared to external clients and other internal clients within Ceph. For
+example, an admin may enable this profile temporarily to speed up background
+recoveries during non-peak hours.
+
++------------------------+-------------+--------+-------+
+| Service Type | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client | 30% | 1 | 80% |
++------------------------+-------------+--------+-------+
+| background recovery | 60% | 2 | 200% |
++------------------------+-------------+--------+-------+
+| background best effort | 1 (MIN) | 2 | MAX |
++------------------------+-------------+--------+-------+
+
+custom
+```````
+
+The custom profile allows the user to have complete control of the mClock
+and Ceph config parameters. To use this profile, the user must have a deep
+understanding of the workings of Ceph and the mClock scheduler. All the
+*reservation*, *weight* and *limit* parameters of the different service types
+must be set manually along with any Ceph option(s). This profile may be used
+for experimental and exploratory purposes or if the built-in profiles do not
+meet the requirements. In such cases, adequate testing must be performed prior
+to enabling this profile.
+
+
+Recovery Test Steps
+===================
+
+Before bringing up the Ceph cluster, the following mClock configuration
+parameters were set appropriately based on the obtained baseline throughput
+from the previous section:
+
+- :confval:`osd_mclock_max_capacity_iops_hdd`
+- :confval:`osd_mclock_max_capacity_iops_ssd`
+- :confval:`osd_mclock_profile`
+
+See :doc:`/rados/configuration/mclock-config-ref` for more details.
+
+Test Steps(Using cbt)
+`````````````````````
+
+1. Bring up the Ceph cluster with 4 osds.
+2. Configure the OSDs with replication factor 3.
+3. Create a recovery pool to populate recovery data.
+4. Create a client pool and prefill some objects in it.
+5. Create the recovery thread and mark an OSD down and out.
+6. After the cluster handles the OSD down event, recovery data is
+ prefilled into the recovery pool. For the tests involving SSDs, prefill 100K
+ 4MiB objects into the recovery pool. For the tests involving HDDs, prefill
+ 5K 4MiB objects into the recovery pool.
+7. After the prefill stage is completed, the downed OSD is brought up and in.
+ The backfill phase starts at this point.
+8. As soon as the backfill/recovery starts, the test proceeds to initiate client
+ I/O on the client pool on another thread using a single client.
+9. During step 8 above, statistics related to the client latency and
+ bandwidth are captured by cbt. The test also captures the total number of
+ misplaced objects and the number of misplaced objects recovered per second.
+
+To summarize, the steps above create 2 pools during the test. Recovery is
+triggered on one pool and client I/O is triggered on the other simultaneously.
+Statistics captured during the tests are discussed below.
+
+
+Non-Default Ceph Recovery Options
+`````````````````````````````````
+
+Apart from the non-default bluestore throttle already mentioned above, the
+following set of Ceph recovery related options were modified for tests with both
+the WPQ and mClock schedulers.
+
+- :confval:`osd_max_backfills` = 1000
+- :confval:`osd_recovery_max_active` = 1000
+- :confval:`osd_async_recovery_min_cost` = 1
+
+The above options set a high limit on the number of concurrent local and
+remote backfill operations per OSD. Under these conditions the capability of the
+mClock scheduler was tested and the results are discussed below.
+
+Test Results
+============
+
+Test Results With NVMe SSDs
+```````````````````````````
+
+Client Throughput Comparison
+----------------------------
+
+The chart below shows the average client throughput comparison across the
+schedulers and their respective configurations.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_NVMe_SSD_WPQ_vs_mClock.png
+
+
+WPQ(def) in the chart shows the average client throughput obtained
+using the WPQ scheduler with all other Ceph configuration settings set to
+default values. The default setting for :confval:`osd_max_backfills` limits the number
+of concurrent local and remote backfills or recoveries per OSD to 1. As a
+result, the average client throughput obtained is impressive at just over 18000
+IOPS when compared to the baseline value which is 21500 IOPS.
+
+However, with WPQ scheduler along with non-default options mentioned in section
+`Non-Default Ceph Recovery Options`_, things are quite different as shown in the
+chart for WPQ(BST). In this case, the average client throughput obtained drops
+dramatically to only 2544 IOPS. The non-default recovery options clearly had a
+significant impact on the client throughput. In other words, recovery operations
+overwhelm the client operations. Sections further below discuss the recovery
+rates under these conditions.
+
+With the non-default options, the same test was executed with mClock and with
+the default profile (*high_client_ops*) enabled. As per the profile allocation,
+the reservation goal of 50% (10750 IOPS) is being met with an average throughput
+of 11209 IOPS during the course of recovery operations. This is more than 4x
+the throughput obtained with WPQ(BST).
+
+Similar throughput with the *balanced* (11017 IOPS) and *high_recovery_ops*
+(11153 IOPS) profile was obtained as seen in the chart above. This clearly
+demonstrates that mClock is able to provide the desired QoS for the client
+with multiple concurrent backfill/recovery operations in progress.
+
+Client Latency Comparison
+-------------------------
+
+The chart below shows the average completion latency (*clat*) along with the
+average 95th, 99th and 99.5th percentiles across the schedulers and their
+respective configurations.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_NVMe_SSD_WPQ_vs_mClock.png
+
+The average *clat* latency obtained with WPQ(Def) was 3.535 msec. But in this
+case the number of concurrent recoveries was very much limited at an average of
+around 97 objects/sec or ~388 MiB/s, which was a major contributing factor to
+the low latency seen by the client.
+
+With WPQ(BST) and with the non-default recovery options, things are very
+different with the average *clat* latency shooting up to an average of almost
+25 msec, which is 7x worse. This is due to the high number of concurrent
+recoveries which was measured to be ~350 objects/sec or ~1.4 GiB/s which is
+close to the maximum OSD bandwidth.
+
+With mClock enabled and with the default *high_client_ops* profile, the average
+*clat* latency was 5.688 msec which is impressive considering the high number
+of concurrent active background backfill/recoveries. The recovery rate was
+throttled down by mClock to an average of 80 objects/sec or ~320 MiB/s according
+to the minimum profile allocation of 25% of the maximum OSD bandwidth thus
+allowing the client operations to meet the QoS goal.
+
+With the other profiles like *balanced* and *high_recovery_ops*, the average
+client *clat* latency didn't change much and stayed between 5.7 - 5.8 msec with
+variations in the average percentile latency as observed from the chart above.
+
+.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_NVMe_SSD_WPQ_vs_mClock.png
+
+Perhaps a more interesting chart is the comparison chart shown above that
+tracks the average *clat* latency variations through the duration of the test.
+The chart shows the differences in the average latency between the WPQ
+scheduler and the mClock profiles. During the initial phase of the test, for about 150
+secs, the differences in the average latency between the WPQ scheduler and
+across the profiles of mClock scheduler are quite evident and self explanatory.
+The *high_client_ops* profile shows the lowest latency followed by *balanced*
+and *high_recovery_ops* profiles. The WPQ(BST) had the highest average latency
+through the course of the test.
+
+Recovery Statistics Comparison
+------------------------------
+
+Another important aspect to consider is how the recovery bandwidth and recovery
+time are affected by mClock profile settings. The chart below outlines the
+recovery rates and times for each mClock profile and how they differ with the
+WPQ scheduler. The total number of objects to be recovered in all the cases was
+around 75000 objects as observed in the chart below.
+
+.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_NVMe_SSD_WPQ_vs_mClock.png
+
+Intuitively, the *high_client_ops* should impact recovery operations the most
+and this is indeed the case as it took an average of 966 secs for the
+recovery to complete at 80 Objects/sec. The recovery bandwidth as expected was
+the lowest at an average of ~320 MiB/s.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_NVMe_SSD_WPQ_vs_mClock.png
+
+The *balanced* profile provides a good middle ground by allocating the same
+reservation and weight to client and recovery operations. The recovery rate
+curve falls between the *high_recovery_ops* and *high_client_ops* curves with
+an average bandwidth of ~480 MiB/s and taking an average of ~647 secs at ~120
+Objects/sec to complete the recovery.
+
+The *high_recovery_ops* profile provides the fastest way to complete recovery
+operations at the expense of other operations. The recovery bandwidth was
+nearly 2x the bandwidth at ~635 MiB/s when compared to the bandwidth observed
+using the *high_client_ops* profile. The average object recovery rate was ~159
+objects/sec and completed the fastest in approximately 488 secs.
+
+Test Results With HDDs (WAL and dB configured)
+``````````````````````````````````````````````
+
+The recovery tests were performed on HDDs with bluestore WAL and dB configured
+on faster NVMe SSDs. The baseline throughput measured was 340 IOPS.
+
+Client Throughput & latency Comparison
+--------------------------------------
+
+The average client throughput comparison for WPQ and mClock and its profiles
+are shown in the chart below.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_WALdB_WPQ_vs_mClock.png
+
+With WPQ(Def), the average client throughput obtained was ~308 IOPS since the
+number of concurrent recoveries was very much limited. The average *clat*
+latency was ~208 msec.
+
+However, for WPQ(BST), client throughput is affected significantly by the
+concurrent recoveries, dropping to 146 IOPS with an average *clat* latency of 433 msec.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_WALdB_WPQ_vs_mClock.png
+
+With the *high_client_ops* profile, mClock was able to meet the QoS requirement
+for client operations with an average throughput of 271 IOPS which is nearly
+80% of the baseline throughput at an average *clat* latency of 235 msecs.
+
+For *balanced* and *high_recovery_ops* profiles, the average client throughput
+came down marginally to ~248 IOPS and ~240 IOPS respectively. The average *clat*
+latency as expected increased to ~258 msec and ~265 msec respectively.
+
+.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_WALdB_WPQ_vs_mClock.png
+
+The *clat* latency comparison chart above provides a more comprehensive insight
+into the differences in latency through the course of the test. As observed
+with the NVMe SSD case, *high_client_ops* profile shows the lowest latency in
+the HDD case as well followed by the *balanced* and *high_recovery_ops* profile.
+It's fairly easy to discern this between the profiles during the first 200 secs
+of the test.
+
+Recovery Statistics Comparison
+------------------------------
+
+The charts below compare the recovery rates and times. The total number of
+objects to be recovered in all the cases using HDDs with WAL and dB was around
+4000 objects as observed in the chart below.
+
+.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_WALdB_WPQ_vs_mClock.png
+
+As expected, the *high_client_ops* impacts recovery operations the most as it
+took an average of ~1409 secs for the recovery to complete at ~3 Objects/sec.
+The recovery bandwidth as expected was the lowest at ~11 MiB/s.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_WALdB_WPQ_vs_mClock.png
+
+The *balanced* profile as expected provides a decent compromise with an
+average bandwidth of ~16.5 MiB/s and taking an average of ~966 secs at ~4
+Objects/sec to complete the recovery.
+
+The *high_recovery_ops* profile is the fastest with nearly 2x the bandwidth at
+~21 MiB/s when compared to the *high_client_ops* profile. The average object
+recovery rate was ~5 objects/sec and completed in approximately 747 secs. This
+is somewhat similar to the recovery time observed with WPQ(Def) at 647 secs with
+a bandwidth of 23 MiB/s and at a rate of 5.8 objects/sec.
+
+Test Results With HDDs (No WAL and dB configured)
+`````````````````````````````````````````````````
+
+The recovery tests were also performed on HDDs without bluestore WAL and dB
+configured. The baseline throughput measured was 315 IOPS.
+
+A configuration without WAL and dB is probably rare, but testing was
+nevertheless performed to get a sense of how mClock performs in a very
+restrictive environment where the OSD capacity is at the lower end.
+The sections and charts below are very similar to the ones presented above and
+are provided here for reference.
+
+Client Throughput & latency Comparison
+--------------------------------------
+
+The average client throughput, latency and percentiles are compared as before
+in the set of charts shown below.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png
+
+.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_NoWALdB_WPQ_vs_mClock.png
+
+.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png
+
+Recovery Statistics Comparison
+------------------------------
+
+The recovery rates and times are shown in the charts below.
+
+.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png
+
+.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png
+
+Key Takeaways and Conclusion
+============================
+
+- mClock is able to provide the desired QoS using profiles to allocate proper
+ *reservation*, *weight* and *limit* to the service types.
+- By using the cost per I/O and the cost per byte parameters, mClock can
+  schedule operations appropriately for the different device types (SSD/HDD).
+
+The study so far shows promising results with the refinements made to the mClock
+scheduler. Further refinements to mClock and profile tuning are planned, guided
+by feedback from broader testing on larger clusters and with different
+workloads.
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
+.. _repository: https://github.com/ceph/dmclock
+.. _cbt: https://github.com/ceph/cbt
diff --git a/doc/dev/osd_internals/osd_overview.rst b/doc/dev/osd_internals/osd_overview.rst
new file mode 100644
index 000000000..192ddf8ca
--- /dev/null
+++ b/doc/dev/osd_internals/osd_overview.rst
@@ -0,0 +1,106 @@
+===
+OSD
+===
+
+Concepts
+--------
+
+*Messenger*
+ See src/msg/Messenger.h
+
+ Handles sending and receipt of messages on behalf of the OSD. The OSD uses
+ two messengers:
+
+ 1. cluster_messenger - handles traffic to other OSDs, monitors
+ 2. client_messenger - handles client traffic
+
+ This division allows the OSD to be configured with different interfaces for
+ client and cluster traffic.
+
+*Dispatcher*
+ See src/msg/Dispatcher.h
+
+ OSD implements the Dispatcher interface. Of particular note is ms_dispatch,
+ which serves as the entry point for messages received via either the client
+ or cluster messenger. Because there are two messengers, ms_dispatch may be
+ called from at least two threads. The osd_lock is always held during
+ ms_dispatch.
+
+*WorkQueue*
+ See src/common/WorkQueue.h
+
+ The WorkQueue class abstracts the process of queueing independent tasks
+ for asynchronous execution. Each OSD process contains workqueues for
+ distinct tasks:
+
+ 1. OpWQ: handles ops (from clients) and subops (from other OSDs).
+ Runs in the op_tp threadpool.
+ 2. PeeringWQ: handles peering tasks and pg map advancement
+ Runs in the op_tp threadpool.
+ See Peering
+ 3. CommandWQ: handles commands (pg query, etc)
+ Runs in the command_tp threadpool.
+ 4. RecoveryWQ: handles recovery tasks.
+ Runs in the recovery_tp threadpool.
+ 5. SnapTrimWQ: handles snap trimming
+ Runs in the disk_tp threadpool.
+ See SnapTrimmer
+ 6. ScrubWQ: handles primary scrub path
+ Runs in the disk_tp threadpool.
+ See Scrub
+ 7. ScrubFinalizeWQ: handles primary scrub finalize
+ Runs in the disk_tp threadpool.
+ See Scrub
+ 8. RepScrubWQ: handles replica scrub path
+ Runs in the disk_tp threadpool
+ See Scrub
+ 9. RemoveWQ: Asynchronously removes old pg directories
+ Runs in the disk_tp threadpool
+ See PGRemoval
+
+*ThreadPool*
+ See src/common/WorkQueue.h
+ See also above.
+
+ There are 4 OSD threadpools:
+
+ 1. op_tp: handles ops and subops
+ 2. recovery_tp: handles recovery tasks
+ 3. disk_tp: handles disk intensive tasks
+ 4. command_tp: handles commands
+
+*OSDMap*
+ See src/osd/OSDMap.h
+
+ The crush algorithm takes two inputs: a picture of the cluster
+ with status information about which nodes are up/down and in/out,
+ and the pgid to place. The former is encapsulated by the OSDMap.
+ Maps are numbered by *epoch* (epoch_t). These maps are passed around
+ within the OSD as std::tr1::shared_ptr<const OSDMap>.
+
+ See MapHandling
+
+*PG*
+ See src/osd/PG.* src/osd/PrimaryLogPG.*
+
+ Objects in rados are hashed into *PGs* and *PGs* are placed via crush onto
+ OSDs. The PG structure is responsible for handling requests pertaining to
+ a particular *PG* as well as for maintaining relevant metadata and controlling
+ recovery.
+
+*OSDService*
+ See src/osd/OSD.cc OSDService
+
+ The OSDService acts as a broker between PG threads and OSD state which allows
+ PGs to perform actions using OSD services such as workqueues and messengers.
+ This is still a work in progress. Future cleanups will focus on moving such
+ state entirely from the OSD into the OSDService.
+
+Overview
+--------
+ See src/ceph_osd.cc
+
+ The OSD process represents one leaf device in the crush hierarchy. There
+ might be one OSD process per physical machine, or more than one if, for
+ example, the user configures one OSD instance per disk.
+
diff --git a/doc/dev/osd_internals/osdmap_versions.txt b/doc/dev/osd_internals/osdmap_versions.txt
new file mode 100644
index 000000000..2bf247dcf
--- /dev/null
+++ b/doc/dev/osd_internals/osdmap_versions.txt
@@ -0,0 +1,259 @@
+releases:
+
+ <0.48 pre-argonaut, dev
+ 0.48 argonaut
+ 0.56 bobtail
+ 0.61 cuttlefish
+ 0.67 dumpling
+ 0.72 emperor
+ 0.80 firefly
+ 0.87 giant
+ 0.94 hammer
+ 9.1.0 infernalis rc
+ 9.2.0 infernalis
+ 10.2.0 jewel
+ 11.2.0 kraken
+ 12.2.0 luminous
+ 13.2.0 mimic
+ 14.2.0 nautilus (to-be)
+
+osdmap:
+
+type / v / cv / ev / commit / version / date
+
+map / 1 / - / - / 017788a6ecb570038632de31904dd2e1314dc7b7 / 0.11 / 2009
+inc / 1 / - / - /
+ * initial
+map / 2 / - / - / 020350e19a5dc03cd6cedd7494e434295580615f / 0.13 / 2009
+inc / 2 / - / - /
+ * pg_temp
+map / 3 / - / - / 1ebcebf6fff056a0c0bdf82dde69356e271be27e / 0.19 / 2009
+inc / 3 / - / - /
+ * heartbeat_addr
+map / 4 / - / - / 3ced5e7de243edeccfd20a90ec2034206c920795 / 0.19 / 2010
+inc / 4 / - / - /
+ * pools removed from map
+map / 5 / - / 5 / c4892bed6f49df396df3cbf9ed561c7315bd2442 / 0.20 / 2010
+inc / 5 / - / 5 /
+ * pool names moved to first part of encoding
+ * adds CEPH_OSDMAP_INC_VERSION_EXT (for extended part of map)
+ * adds CEPH_OSDMAP_VERSION_EXT (for extended part of map)
+ * adds 'ev' (extended version) during encode() and decode
+map / 5 / - / 5 / bc9cb9311f1b946898b5256eab500856fccf5c83 / 0.22 / 2010
+inc / 5 / - / 6 /
+ * separate up client/osd
+ * increments CEPH_OSDMAP_INC_VERSION_EXT to 6
+ * CEPH_OSDMAP_INC_VERSION stays at 5
+map / 5 / - / 6 / 7f70112052c7fc3ba46f9e475fa575d85e8b16b2 / 0.22 / 2010
+inc / 5 / - / 6 /
+ * add osd_cluster_addr to full map
+ * increments CEPH_OSDMAP_VERSION_EXT to 6
+ * CEPH_OSDMAP_VERSION stays at 5
+map / 5 / - / 7 / 2ced4e24aef64f2bc7d55b73abb888c124512eac / 0.28 / 2011
+inc / 5 / - / 7 /
+ * add cluster_snapshot field
+ * increments CEPH_OSDMAP_VERSION_EXT to 7
+ * increments CEPH_OSDMAP_INC_VERSION_EXT to 7
+ * CEPH_OSDMAP_INC_VERSION stays at 5
+ * CEPH_OSDMAP_VERSION stays at 5
+map / 6 / - / 7 / d1ce410842ca51fad3aa100a52815a39e5fe6af6 / 0.35 / 2011
+inc / 6 / - / 7 /
+ * encode/decode old + new versions
+ * adds encode_client_old() (basically transitioning from uint32 to
+ uint64)
+ * bumps osdmap version to 6, old clients stay at 5
+ * starts using in-function versions (i.e., _u16 v = 6)
+map / 6 / - / 7 / b297d1edecaf31a48cff6c37df2ee266e51cdec1 / 0.38 / 2011
+inc / 6 / - / 7 /
+ * make encoding conditional based on features
+ * essentially checks whether features & CEPH_FEATURE_PGID64 and opts
+ to either use encode_client_old() or encode()
+map / 6 / - / 7 / 0f0c59478894c9ca7fa04fc32e854648192a9fae / 0.38 / 2011
+inc / 6 / - / 7 /
+ * move stuff from osdmap.h to osdmap.cc
+map / 6 / - / 8 / ca4311e5e39cec8fad85fad3e67eea968707e9eb / 0.47 / 2012
+inc / 6 / - / 8 /
+ * store uuid per osd
+ * bumps osdmap::incremental extended version to 8; in function
+ * bumps osdmap's extended version to 8; in function
+map / 6 / - / 8 / 5125daa6d78e173a8dbc75723a8fdcd279a44bcd / 0.47 / 2012
+inc / 6 / - / 8 /
+ * drop defines
+ * drops defines for CEPH_OSDMAP_*_VERSION from rados.h
+map / 6 / - / 9 / e9f051ef3c49a080b24d7811a16aefb64beacbbd / 0.53 / 2012
+inc / 6 / - / 9 /
+ * add osd_xinfo_t
+ * osdmap::incremental ext version bumped to 9
+ * osdmap's ext version bumped to 9
+ * because we're adding osd_xinfo_t to the map
+map / 6 / - / 10 / 1fee4ccd5277b52292e255daf458330eef5f0255 / 0.64 / 2013
+inc / 6 / - / 10 /
+ * encode front hb addr
+ * osdmap::incremental ext version bumped to 10
+ * osdmap's ext version bumped to 10
+ * because we're adding osd_addrs->hb_front_addr to map
+
+// below we have the change to ENCODE_START() for osdmap and others
+// this means client-usable data and extended osd data get to have their
+// own ENCODE_START()'s, hence their versions start at 1 again.
+
+map / 7 / 1 / 1 / 3d7c69fb0986337dc72e466dc39d93e5ab406062 / 0.77 / 2014
+inc / 7 / 1 / 1 / b55c45e85dbd5d2513a4c56b3b74dcafd03f20b1 / 0.77 / 2014
+ * introduces ENCODE_START() approach to osdmap, and the 'features'
+ argument we currently see in ::encode() functions
+ * same, but for osdmap::incremental
+map / 7 / 1 / 1 / b9208b47745fdd53d36b682bebfc01e913347092 / 0.77 / 2014
+inc / 7 / 1 / 2 /
+ * include features argument in incremental.
+map / 7 / 2 / 1 / cee914290c5540eb1fb9d70faac70a581381c29b / 0.78 / 2014
+inc / 7 / 2 / 2 /
+ * add osd_primary_affinity
+map / 7 / 3 / 1 / c4f8f265955d54f33c79cde02c1ab2fe69ab1ab0 / 0.78 / 2014
+inc / 7 / 3 / 2 /
+ * add new/old erasure code profiles
+map / 8 / 3 / 1 / 3dcf5b9636bb9e0cd6484d18f151b457e1a0c328 / 0.91 / 2014
+inc / 8 / 3 / 2 /
+ * encode crc
+map / 8 / 3 / 1 / 04679c5451e353c966f6ed00b33fa97be8072a79 / 9.1.0 / 2015
+inc / 8 / 3 / 2 /
+ * simply ensures encode_features are filled to CEPH_FEATURE_PGID64 when
+ decoding an incremental if struct_v >= 6; else keeps it at zero.
+ * otherwise, if we get an incremental from hammer (which has
+ struct_v = 6) we would be decoding it as if it were a map from before
+ CEPH_FEATURES_PGID64 (which was introduced in 0.35, pre-argonaut)
+map / 8 / 3 / 2 / 5c6b9d9dcd0a225e3a2b154c20a623868c269346 / 12.0.1 / 2017
+inc / 8 / 3 / 3 /
+ * add (near)full_ratio
+ * used to live in pgmap, moving to osdmap for luminous
+ * conditional on SERVER_LUMINOUS feature being present
+ * osdmap::incremental::encode(): conditional on ev >= 3
+ * osdmap::incremental::decode(): conditional on ev >= 3, else -1
+ * osdmap::encode(): conditional on ev >= 2
+ * osdmap::decode(): conditional on ev >= 0, else 0
+map / 8 / 4 / 2 / 27d6f4373bafa24450f6dbb4e4252c2d9c2c1448 / 12.0.2 / 2017
+inc / 8 / 4 / 3 /
+ * add pg_remap and pg_remap_items
+ * first forces a pg to map to a particular value; second replaces
+ specific osds with specific other osds in crush mapping.
+ * inc conditional on SERVER_LUMINOUS feature being present
+ * osdmap::incremental::encode(): conditional on cv >= 4
+ * osdmap::incremental::decode(): conditional on cv >= 4
+ * map conditional on OSDMAP_REMAP feature being present
+ * osdmap::encode(): if not feature, cv = 3; encode on cv >= 4
+ * osdmap::decode(): conditional on cv >= 4
+map / 8 / 4 / 3 / 27d6f4373bafa24450f6dbb4e4252c2d9c2c1448 / 12.0.2 / 2017
+inc / 8 / 4 / 4 /
+ * handle backfillfull_ratio like nearfull and full
+ * inc:
+ * osdmap::incremental::encode(): conditional on ev >= 3
+ * osdmap::incremental::decode(): conditional on ev >= 4, else -1
+ * map:
+ * osdmap::encode(): conditional on ev >= 2
+ * osdmap::decode(): conditional on ev >= 3, else 0
+map / 8 / 4 / 3 / a1c66468232002c9f36033226f5db0a5751e8d18 / 12.0.3 / 2017
+inc / 8 / 4 / 4 /
+ * add require_min_compat_client field
+ * inc:
+ * osdmap::incremental::encode() conditional on ev >= 4
+ * osdmap::incremental::decode() conditional on ev >= 4
+ * map:
+ * osdmap::encode() conditional on ev >= 3
+ * osdmap::decode() conditional on ev >= 3
+map / 8 / 4 / 4 / 4a09e9431de3084b1ca98af11b28f822fde4ffbe / 12.0.3 / 2017
+inc / 8 / 4 / 5 /
+ * bumps encoding version for require_min_compat_client
+ * otherwise osdmap::decode() would throw exception when decoding
+ old maps
+ * inc:
+ * osdmap::incremental::encode() no conditional on ev >= 3
+ * osdmap::incremental::decode() conditional on ev >= 5
+ * map:
+ * osdmap::encode() conditional on ev >= 2
+ * osdmap::decode() conditional on ev >= 4
+map / 8 / 4 / 5 / 3d4c4d9d9da07e1456331c43acc998d2008ca8ea / 12.1.0 / 2017
+inc / 8 / 4 / 6 /
+ * add require_osd_release numeric field
+ * new_require_min_compat_client:
+ * osdmap::incremental::encode() conditional on ev >= 5
+ * osdmap::encode() conditional on ev >= 4
+ * require_osd_release:
+ * osdmap::incremental::encode() conditional on ev >= 6
+ * osdmap::incremental::decode() conditional on ev >= 6 (else, -1)
+ * osdmap::encode() conditional on ev >= 5
+ * osdmap::decode() conditional on ev >= 5 (else, -1)
+map / 8 / 4 / 5 / f22997e24bda4e6476e15d5d4ad9737861f9741f / 12.1.0 / 2017
+inc / 8 / 4 / 6 /
+ * switch (require_)min_compat_client to integers instead of strings
+ * osdmap::incremental::encode() conditional on ev >= 6
+ * osdmap::incremental::decode():
+ * if ev == 5, decode string and translate to release number
+ * if ev >= 6, decode integer
+ * osdmap::encode() conditional on ev >= 4
+ * osdmap::decode():
+ * if ev == 4, decode string and translate to release number
+ * if ev >= 5, decode integer
+map / 8 / 4 / 6 / a8fb39b57884d96201fa502b17bc9395ec38c1b3 / 12.1.0 / 2017
+inc / 8 / 5 / 6 /
+ * make incremental's `new_state` 32 bits instead of 8 bits
+ * implies forcing 8 bits on
+ * osdmap::incremental::encode_client_old()
+ * osdmap::incremental::encode_classic()
+ * osdmap::incremental::decode_classic()
+ * osdmap::incremental::encode() conditional on cv >= 5, else force 8b.
+ * osdmap::incremental::decode() conditional on cv >= 5, else force 8b.
+map / 8 / 5 / 6 / 3c1e58215bbb98f71aae30904f9010a57a58da81 / 12.1.0 / 2017
+inc / 8 / 5 / 6 /
+ * same as above
+map / 8 / 6 / 6 / 48158ec579b708772fae82daaa6cb5dcaf5ac5dd / 12.1.0 / 2017
+inc / 8 / 5 / 6 /
+ * add crush_version
+ * osdmap::encode() conditional on cv >= 6
+ * osdmap::decode() conditional on cv >= 6
+map / 8 / 7 / 6 / 553048fbf97af999783deb7e992c8ecfa5e55500 / 13.0.2 / 2017
+inc / 8 / 6 / 6 /
+ * track newly removed and purged snaps in each epoch
+ * new_removed_snaps
+ * new_purged_snaps
+ * osdmap::encode() conditional on cv >= 7
+ * if SERVER_MIMIC not in features, cv = 6
+ * osdmap::decode() conditional cv >= 7
+map / 8 / 8 / 6 / f99c2a9fec65ad3ce275ef24bd167ee03275d3d7 / 14.0.1 / 2018
+inc / 8 / 7 / 6 /
+ * fix pre-addrvec compat
+ * osdmap::encode() conditional on cv >= 8, else encode client addrs
+ one by one in a loop.
+ * osdmap::decode() just bumps version (?)
+map / 8 / 8 / 7 / 9fb1e521c7c75c124b0dbf193e8b65ff1b5f461e / 14.0.1 / 2018
+inc / 8 / 7 / 7 /
+ * make cluster addrs into addrvecs too
+ * this will allow single-step upgrade from msgr1 to msgr2
+map / 8 / 9 / 7 / d414f0b43a69f3c2db8e454d795be881496237c6 / 14.0.1 / 2018
+inc / 8 / 8 / 7 /
+ * store last_up_change and last_in_change
+ * osdmap::encode() conditional on cv >= 9
+ * osdmap::decode() conditional on cv >= 9
+
+
+
+osd_info_t:
+v / commit / version / date / reason
+
+1 / e574c84a6a0c5a5070dc72d5f5d3d17914ef824a / 0.19 / 2010 / add struct_v
+
+osd_xinfo_t:
+v / commit / version / date
+
+1 / e9f051ef3c49a080b24d7811a16aefb64beacbbd / 0.53 / 2012
+ * add osd_xinfo_t
+2 / 31743d50a109a463d664ec9cf764d5405db507bd / 0.75 / 2013
+ * add features bit mask to osd_xinfo_t
+3 / 87722a42c286d4d12190b86b6d06d388e2953ba0 / 0.82 / 2014
+ * remember osd weight when auto-marking osds out
+
+rados.h:
+v / commit / version / date / reason
+
+- / 147c6f51e34a875ab65624df04baa8ef89296ddd / 0.19 / 2010 / move versions
+ 3 / CEPH_OSDMAP_INC_VERSION
+ 3 / CEPH_OSDMAP_VERSION
+ 2 / CEPH_PG_POOL_VERSION
diff --git a/doc/dev/osd_internals/partial_object_recovery.rst b/doc/dev/osd_internals/partial_object_recovery.rst
new file mode 100644
index 000000000..a22f63348
--- /dev/null
+++ b/doc/dev/osd_internals/partial_object_recovery.rst
@@ -0,0 +1,148 @@
+=======================
+Partial Object Recovery
+=======================
+
+Partial Object Recovery improves the efficiency of log-based recovery (vs
+backfill). Original log-based recovery calculates missing_set based on pg_log
+differences.
+
+With the original approach, the whole object is recovered from one OSD to
+another if the pg_log indicates that the object was modified, regardless of
+how much of its content actually changed. That means a 4M object with only 4k
+modified inside is recovered in its entirety rather than just the modified 4k
+of content. In addition, the object map is also recovered even if it was not
+modified at all.
+
+Partial Object Recovery is designed to solve the problem mentioned above.
+In order to achieve the goals, two things should be done:
+
+1. log where the object is modified
+2. log whether the object_map of the object is modified
+
+The class ObjectCleanRegion is introduced for this purpose.
+clean_offsets is an interval_set<uint64_t>
+used to indicate the unmodified content in an object.
+clean_omap is a bool indicating whether the object_map has been modified.
+new_object means that the object does not yet exist on the OSD.
+max_num_intervals is an upper bound on the number of intervals in clean_offsets
+so that the memory cost of clean_offsets is always bounded.
+
+The shortest clean interval is trimmed if the number of intervals
+in clean_offsets exceeds this bound.
+
+ e.g. max_num_intervals=2, clean_offsets: {[5~10], [20~5]}
+
+ then a new interval [30~10] will evict the shortest one, [20~5]
+
+ finally, clean_offsets becomes {[5~10], [30~10]}
+
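+Below is a minimal, self-contained sketch of the bookkeeping described above.
+It is illustrative only: the real ObjectCleanRegion lives in src/osd/osd_types.h
+and uses interval_set<uint64_t>, whereas this sketch stands in a plain std::map
+and only models the interval-trimming behaviour::
+
+    #include <cstdint>
+    #include <iostream>
+    #include <map>
+
+    // Illustrative stand-in: clean_offsets maps offset -> length of each
+    // clean (unmodified) interval of the object.
+    struct CleanRegionSketch {
+      std::map<uint64_t, uint64_t> clean_offsets;  // offset -> length
+      bool clean_omap = true;        // object_map untouched so far
+      unsigned max_num_intervals;    // upper bound on interval count
+
+      explicit CleanRegionSketch(unsigned max_intervals)
+        : max_num_intervals(max_intervals) {}
+
+      // Record a new clean interval; if we now exceed max_num_intervals,
+      // evict the shortest interval so memory use stays bounded.
+      void add_clean(uint64_t off, uint64_t len) {
+        clean_offsets[off] = len;
+        if (clean_offsets.size() > max_num_intervals) {
+          auto shortest = clean_offsets.begin();
+          for (auto it = clean_offsets.begin(); it != clean_offsets.end(); ++it)
+            if (it->second < shortest->second)
+              shortest = it;
+          clean_offsets.erase(shortest);
+        }
+      }
+    };
+
+    int main() {
+      CleanRegionSketch r(2);
+      r.add_clean(5, 10);   // {[5~10]}
+      r.add_clean(20, 5);   // {[5~10], [20~5]}
+      r.add_clean(30, 10);  // [20~5] is the shortest and gets evicted
+      for (auto& [off, len] : r.clean_offsets)
+        std::cout << "[" << off << "~" << len << "] ";
+      std::cout << "\n";    // prints: [5~10] [30~10]
+    }
+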
+Procedures for Partial Object Recovery
+======================================
+
+Firstly, OpContext and pg_log_entry_t should contain an ObjectCleanRegion.
+In do_osd_ops(), finish_copyfrom() and finish_promote(), the corresponding
+content in the ObjectCleanRegion should be marked dirty so that the modification
+of the object is traced. The ObjectCleanRegion in the OpContext is then copied
+into its pg_log_entry_t.
+
+Secondly, the pg_missing_set must build and rebuild correctly.
+When calculating the pg_missing_set during the peering process,
+the ObjectCleanRegion in each pg_log_entry_t is also merged.
+
+ e.g. object aa has pg_log:
+ 26'101 {[0~4096, 8192~MAX], false}
+
+ 26'104 {[0~8192, 12288~MAX], false}
+
+ 28'108 {[0~12288, 16384~MAX], true}
+
+ missing_set for object aa: merging the pg_log above --> {[0~4096, 16384~MAX], true},
+ which means 4096~16384 is modified and the object_map is also modified as of version 28'108
+
+Also, the OSD may crash after merging the log.
+Therefore, we need to read_log and rebuild the pg_missing_set. For example, the pg_log is:
+
+ object aa: 26'101 {[0~4096, 8192~MAX], false}
+
+ object bb: 26'102 {[0~4096, 8192~MAX], false}
+
+ object cc: 26'103 {[0~4096, 8192~MAX], false}
+
+ object aa: 26'104 {[0~8192, 12288~MAX], false}
+
+ object dd: 26'105 {[0~4096, 8192~MAX], false}
+
+ object aa: 28'108 {[0~12288, 16384~MAX], true}
+
+Suppose bb, cc and dd are recovered, but aa is not.
+We then need to rebuild the pg_missing_set for object aa
+and find that aa was modified at version 28'108.
+With full-object recovery, if the version in object_info is 26'96 < 28'108,
+we would not need to consider 26'104 and 26'101 because the whole object would be recovered.
+Partial Object Recovery, however, also requires us to rebuild the ObjectCleanRegion.
+
+Knowing that the object was modified is not enough.
+
+Therefore, we also need to traverse the earlier pg_log entries
+(26'104 and 26'101 are also > object_info 26'96)
+and rebuild the pg_missing_set for object aa based on those three entries: 28'108, 26'104, 26'101.
+The logs are merged in the same way as described above.
+
+Finally, the push and pull process is completed based on the pg_missing_set.
+copy_subset in recovery_info is updated based on the ObjectCleanRegion in the pg_missing_set;
+copy_subset indicates the intervals of content that need to be pushed and pulled.
+
+The complicated part here is submit_push_data,
+where several cases should be considered separately.
+What we need to consider is how to deal with the object data,
+which consists of omap_header, xattrs, omap and data:
+
+case 1: first && complete: since object recovery is finished in a single PushOp,
+we would like to preserve the original object and overwrite it directly.
+The object is not removed and no new object is created.
+
+ issue 1: As the object is not removed, old xattrs remain in the old object
+ but may have been updated in the new one. Overwriting the same key or adding new keys is correct,
+ but removed keys would be left behind.
+ To solve this, we need to remove all original xattrs in the object, and then write the new xattrs.
+
+ issue 2: As the object is not removed,
+ the object_map may need to be recovered depending on clean_omap.
+ Therefore, if the omap is being recovered, we need to remove the old omap of the object for the same reason,
+ since an omap update may also be a deletion.
+ Thus, in this case, we should do:
+
+ 1) clear xattrs of the object
+ 2) clear omap of the object if omap recovery is needed
+ 3) truncate the object to recovery_info.size
+ 4) recover omap_header
+ 5) recover xattrs, and recover omap if needed
+ 6) punch zeros in the original object where fiemap reports nothing
+ 7) overwrite the object content which was modified
+ 8) finish recovery
+
+case 2: first && !complete: object recovery is done over multiple PushOps.
+Here, target_oid will indicate a new temp_object in pgid_TEMP,
+so the issues are a bit different.
+
+ issue 1: As the object is newly created, there is no need to deal with xattrs
+
+ issue 2: As the object is newly created,
+ the object_map may not be transmitted depending on clean_omap.
+ Therefore, if clean_omap is true, we need to clone the object_map from the original object.
+
+ issue 3: As the object is newly created, unmodified data will not be transmitted.
+ Therefore, we need to clone the unmodified data from the original object.
+ Thus, in this case, we should do:
+
+ 1) remove the temp object
+ 2) create a new temp object
+ 3) set alloc_hint for the new temp object
+ 4) truncate new temp object to recovery_info.size
+ 5) recover omap_header
+ 6) clone the object_map from the original object if the omap is clean
+ 7) clone unmodified object_data from the original object
+ 8) punch zeros in the new temp object
+ 9) recover xattrs, and recover omap if needed
+ 10) overwrite the object content which was modified
+ 11) remove the original object
+ 12) move and rename the new temp object to replace the original object
+ 13) finish recovery
diff --git a/doc/dev/osd_internals/past_intervals.rst b/doc/dev/osd_internals/past_intervals.rst
new file mode 100644
index 000000000..5b594df1a
--- /dev/null
+++ b/doc/dev/osd_internals/past_intervals.rst
@@ -0,0 +1,93 @@
+=============
+PastIntervals
+=============
+
+Purpose
+-------
+
+There are two situations where we need to consider the set of all acting-set
+OSDs for a PG back to some epoch ``e``:
+
+ * During peering, we need to consider the acting set for every epoch back to
+ ``last_epoch_started``, the last epoch in which the PG completed peering and
+ became active.
+ (see :doc:`/dev/osd_internals/last_epoch_started` for a detailed explanation)
+ * During recovery, we need to consider the acting set for every epoch back to
+ ``last_epoch_clean``, the last epoch at which all of the OSDs in the acting
+ set were fully recovered, and the acting set was full.
+
+For either of these purposes, we could build such a set by iterating backwards
+from the current OSDMap to the relevant epoch. Instead, we maintain a structure
+PastIntervals for each PG.
+
+An ``interval`` is a contiguous sequence of OSDMap epochs where the PG mapping
+didn't change. This includes changes to the acting set, the up set, the
+primary, and several other parameters fully spelled out in
+PastIntervals::check_new_interval.
+
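+As an illustration only (the real logic in PastIntervals::check_new_interval
+compares many more fields, such as pool size and min_size), a new interval
+begins whenever the mapping produced by consecutive OSDMap epochs differs::
+
+    #include <vector>
+
+    // Simplified stand-in for the per-epoch PG mapping.
+    struct PGMappingSketch {
+      std::vector<int> up;       // up set
+      std::vector<int> acting;   // acting set
+      int up_primary = -1;
+      int acting_primary = -1;
+
+      bool operator==(const PGMappingSketch& o) const {
+        return up == o.up && acting == o.acting &&
+               up_primary == o.up_primary &&
+               acting_primary == o.acting_primary;
+      }
+    };
+
+    // A new interval starts at epoch e+1 if the mapping at e+1 differs
+    // from the mapping at e.
+    bool is_new_interval(const PGMappingSketch& prev,
+                         const PGMappingSketch& next) {
+      return !(prev == next);
+    }
+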
+Maintenance and Trimming
+------------------------
+
+The PastIntervals structure stores a record for each ``interval`` back to
+last_epoch_clean. On each new ``interval`` (See AdvMap reactions,
+PeeringState::should_restart_peering, and PeeringState::start_peering_interval)
+each OSD with the PG will add the new ``interval`` to its local PastIntervals.
+Activation messages to OSDs which do not already have the PG contain the
+sender's PastIntervals so that the recipient needn't rebuild it. (See
+PeeringState::activate needs_past_intervals).
+
+PastIntervals are trimmed in two places. First, when the primary marks the
+PG clean, it clears its past_intervals instance
+(PeeringState::try_mark_clean()). The replicas will do the same thing when
+they receive the info (See PeeringState::update_history).
+
+The second, more complex, case is in PeeringState::start_peering_interval. In
+the event of a "map gap", we assume that the PG actually has gone clean, but we
+haven't received a pg_info_t with the updated ``last_epoch_clean`` value yet.
+To explain this behavior, we need to discuss OSDMap trimming.
+
+OSDMap Trimming
+---------------
+
+OSDMaps are created by the Monitor quorum and gossiped out to the OSDs. The
+Monitor cluster also determines when OSDs (and the Monitors) are allowed to
+trim old OSDMap epochs. For the reasons explained above in this document, the
+primary constraint is that we must retain all OSDMaps back to some epoch such
+that all PGs have been clean at that or a later epoch (min_last_epoch_clean).
+(See OSDMonitor::get_trim_to).
+
+The Monitor quorum determines min_last_epoch_clean through MOSDBeacon messages
+sent periodically by each OSDs. Each message contains a set of PGs for which
+the OSD is primary at that moment as well as the min_last_epoch_clean across
+that set. The Monitors track these values in OSDMonitor::last_epoch_clean.
+
+There is a subtlety in the min_last_epoch_clean value used by the OSD to
+populate the MOSDBeacon. OSD::collect_pg_stats invokes PG::with_pg_stats to
+obtain the lec value, which actually uses
+pg_stat_t::get_effective_last_epoch_clean() rather than
+info.history.last_epoch_clean. If the PG is currently clean,
+pg_stat_t::get_effective_last_epoch_clean() is the current epoch rather than
+last_epoch_clean -- this works because the PG is clean at that epoch and it
+allows OSDMaps to be trimmed during periods where OSDMaps are being created
+(due to snapshot activity, perhaps), but no PGs are undergoing ``interval``
+changes.
+
+Back to PastIntervals
+---------------------
+
+We can now understand our second trimming case above. If OSDMaps have been
+trimmed up to epoch ``e``, we know that the PG must have been clean at some epoch
+>= ``e`` (indeed, **all** PGs must have been), so we can drop our PastIntervals.
+
+This dependency also pops up in PeeringState::check_past_interval_bounds().
+PeeringState::get_required_past_interval_bounds takes as a parameter
+oldest_epoch, which comes from OSDSuperblock::cluster_osdmap_trim_lower_bound.
+We use cluster_osdmap_trim_lower_bound rather than a specific osd's oldest_map
+because an OSD does not necessarily trim all the way up to
+MOSDMap::cluster_osdmap_trim_lower_bound at once.
+In order to avoid doing too much work at once we limit the number of osdmaps
+trimmed using ``osd_target_transaction_size`` in OSD::trim_maps().
+For this reason, a specific OSD's oldest_map can lag behind
+OSDSuperblock::cluster_osdmap_trim_lower_bound
+for a while.
+
+See https://tracker.ceph.com/issues/49689 for an example.
diff --git a/doc/dev/osd_internals/pg.rst b/doc/dev/osd_internals/pg.rst
new file mode 100644
index 000000000..397d4ab5d
--- /dev/null
+++ b/doc/dev/osd_internals/pg.rst
@@ -0,0 +1,31 @@
+====
+PG
+====
+
+Concepts
+--------
+
+*Peering Interval*
+ See PG::start_peering_interval.
+ See PG::acting_up_affected
+ See PG::PeeringState::Reset
+
+ A peering interval is a maximal set of contiguous map epochs in which the
+ up and acting sets did not change. PG::PeeringMachine represents a
+ transition from one interval to another as passing through
+ PeeringState::Reset. On PG::PeeringState::AdvMap PG::acting_up_affected can
+ cause the pg to transition to Reset.
+
+
+Peering Details and Gotchas
+---------------------------
+For an overview of peering, see `Peering <../../peering>`_.
+
+ * PG::flushed defaults to false and is set to false in
+ PG::start_peering_interval. Upon transitioning to PG::PeeringState::Started
+   we send a transaction through the pg op sequencer which, upon completion,
+ sends a FlushedEvt which sets flushed to true. The primary cannot go
+ active until this happens (See PG::PeeringState::WaitFlushedPeering).
+ Replicas can go active but cannot serve ops (writes or reads).
+ This is necessary because we cannot read our ondisk state until unstable
+ transactions from the previous interval have cleared.
diff --git a/doc/dev/osd_internals/pg_removal.rst b/doc/dev/osd_internals/pg_removal.rst
new file mode 100644
index 000000000..c5fe0e1ab
--- /dev/null
+++ b/doc/dev/osd_internals/pg_removal.rst
@@ -0,0 +1,56 @@
+==========
+PG Removal
+==========
+
+See OSD::_remove_pg, OSD::RemoveWQ
+
+There are two ways for a pg to be removed from an OSD:
+
+ 1. MOSDPGRemove from the primary
+ 2. OSD::advance_map finds that the pool has been removed
+
+In either case, our general strategy for removing the pg is to
+atomically set the metadata objects (pg->log_oid, pg->biginfo_oid) to
+backfill and asynchronously remove the pg collections. We do not do
+this inline because scanning the collections to remove the objects is
+an expensive operation.
+
+OSDService::deleting_pgs tracks all pgs in the process of being
+deleted. Each DeletingState object in deleting_pgs lives while at
+least one reference to it remains. Each item in RemoveWQ carries a
+reference to the DeletingState for the relevant pg such that
+deleting_pgs.lookup(pgid) will return a null ref only if there are no
+collections currently being deleted for that pg.
+
+The DeletingState for a pg also carries information about the status
+of the current deletion and allows the deletion to be cancelled.
+The possible states are:
+
+ 1. QUEUED: the PG is in the RemoveWQ
+ 2. CLEARING_DIR: the PG's contents are being removed synchronously
+ 3. DELETING_DIR: the PG's directories and metadata are being queued for removal
+ 4. DELETED_DIR: the final removal transaction has been queued
+ 5. CANCELED: the deletion has been cancelled
+
+In 1 and 2, the deletion can be cancelled. Each state transition
+method (and check_canceled) returns false if deletion has been
+cancelled and true if the state transition was successful. Similarly,
+try_stop_deletion() returns true if it succeeds in cancelling the
+deletion. Additionally, try_stop_deletion() in the event that it
+fails to stop the deletion will not return until the final removal
+transaction is queued. This ensures that any operations queued after
+that point will be ordered after the pg deletion.
+
+OSD::_create_lock_pg must handle two cases:
+
+ 1. Either there is no DeletingStateRef for the pg, or it failed to cancel
+ 2. We succeeded in cancelling the deletion.
+
+In case 1., we proceed as if there were no deletion occurring, except that
+we avoid writing to the PG until the deletion finishes. In case 2., we
+proceed as in case 1., except that we first mark the PG as backfilling.
+
+Similarly, OSD::osr_registry ensures that the OpSequencers for those
+pgs can be reused for a new pg if created before the old one is fully
+removed, ensuring that operations on the new pg are sequenced properly
+with respect to operations on the old one.
diff --git a/doc/dev/osd_internals/pgpool.rst b/doc/dev/osd_internals/pgpool.rst
new file mode 100644
index 000000000..45a252bd4
--- /dev/null
+++ b/doc/dev/osd_internals/pgpool.rst
@@ -0,0 +1,22 @@
+==================
+PGPool
+==================
+
+PGPool is a structure used to manage and update the status of removed
+snapshots. It does this by maintaining two fields: cached_removed_snaps, the
+current set of removed snaps, and newly_removed_snaps, the snaps newly removed
+in the last epoch. In OSD::load_pgs the osd map is recovered from the pg's file store
+and passed down to OSD::_get_pool where a PGPool object is initialised with the
+map.
+
+With each new map we receive we call PGPool::update with the new map. In that
+function we build a list of newly removed snaps
+(pg_pool_t::build_removed_snaps) and merge that with our cached_removed_snaps.
+This function includes checks to make sure we only do this update when things
+have changed or there has been a map gap.
+
+When we activate the pg we initialise the snap trim queue from
+cached_removed_snaps and subtract the purged_snaps we have already purged
+leaving us with the list of snaps that need to be trimmed. Trimming is later
+performed asynchronously by the snap_trim_wq.
+
diff --git a/doc/dev/osd_internals/recovery_reservation.rst b/doc/dev/osd_internals/recovery_reservation.rst
new file mode 100644
index 000000000..a24ac1b15
--- /dev/null
+++ b/doc/dev/osd_internals/recovery_reservation.rst
@@ -0,0 +1,83 @@
+====================
+Recovery Reservation
+====================
+
+Recovery reservation extends and subsumes backfill reservation. The
+reservation system from backfill recovery is used for local and remote
+reservations.
+
+When a PG goes active, first it determines what type of recovery is
+necessary, if any. It may need log-based recovery, backfill recovery,
+both, or neither.
+
+In log-based recovery, the primary first acquires a local reservation
+from the OSDService's local_reserver. Then a MRemoteReservationRequest
+message is sent to each replica in order of OSD number. These requests
+will always be granted (i.e., cannot be rejected), but they may take
+some time to be granted if the remotes have already granted all their
+remote reservation slots.
+
+After all reservations are acquired, log-based recovery proceeds as it
+would without the reservation system.
+
+After log-based recovery completes, the primary releases all remote
+reservations. The local reservation remains held. The primary then
+determines whether backfill is necessary. If it is not necessary, the
+primary releases its local reservation and waits in the Recovered state
+for all OSDs to indicate that they are clean.
+
+If backfill recovery occurs after log-based recovery, the local
+reservation does not need to be reacquired since it is still held from
+before. If it occurs immediately after activation (log-based recovery
+not possible/necessary), the local reservation is acquired according to
+the typical process.
+
+Once the primary has its local reservation, it requests a remote
+reservation from the backfill target. This reservation CAN be rejected,
+for instance if the OSD is too full (backfillfull_ratio osd setting).
+If the reservation is rejected, the primary drops its local
+reservation, waits (osd_backfill_retry_interval), and then retries. It
+will retry indefinitely.
+
+Once the primary has the local and remote reservations, backfill
+proceeds as usual. After backfill completes the remote reservation is
+dropped.
+
+Finally, after backfill (or log-based recovery if backfill was not
+necessary), the primary drops the local reservation and enters the
+Recovered state. Once all the PGs have reported they are clean, the
+primary enters the Clean state and marks itself active+clean.
+
+-----------------
+Dump Reservations
+-----------------
+
+An OSD daemon command dumps total local and remote reservations::
+
+ ceph daemon osd.<id> dump_recovery_reservations
+
+
+--------------
+Things to Note
+--------------
+
+We always grab the local reservation first, to prevent a circular
+dependency. We grab remote reservations in order of OSD number for the
+same reason.
+
+The recovery reservation state chart controls the PG state as reported
+to the monitor. The state chart can set:
+
+ - recovery_wait: waiting for local/remote reservations
+ - recovering: recovering
+ - recovery_toofull: recovery stopped, OSD(s) above full ratio
+ - backfill_wait: waiting for remote backfill reservations
+ - backfilling: backfilling
+ - backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio
+
+
+--------
+See Also
+--------
+
+The Active substate of the automatically generated OSD state diagram.
diff --git a/doc/dev/osd_internals/refcount.rst b/doc/dev/osd_internals/refcount.rst
new file mode 100644
index 000000000..3324b63e5
--- /dev/null
+++ b/doc/dev/osd_internals/refcount.rst
@@ -0,0 +1,45 @@
+========
+Refcount
+========
+
+
+Introduction
+============
+
+Deduplication, as described in ../deduplication.rst, needs a way to
+maintain a target pool of deduplicated chunks with atomic
+refcounting. To that end, there exists an osd object class
+refcount responsible for using the object class machinery to
+maintain refcounts on deduped chunks and ultimately remove them
+as the refcount hits 0.
+
+Class Interface
+===============
+
+See cls/refcount/cls_refcount_client*
+
+* cls_refcount_get
+
+ Atomically increments the refcount with specified tag ::
+
+ void cls_refcount_get(librados::ObjectWriteOperation& op, const string& tag, bool implicit_ref = false);
+
+* cls_refcount_put
+
+ Atomically decrements the refcount specified by passed tag ::
+
+ void cls_refcount_put(librados::ObjectWriteOperation& op, const string& tag, bool implicit_ref = false);
+
+* cls_refcount_set
+
+ Atomically sets the set of refcounts with passed list of tags ::
+
+ void cls_refcount_set(librados::ObjectWriteOperation& op, list<string>& refs);
+
+* cls_refcount_read
+
+ Dumps the current set of ref tags for the object ::
+
+ int cls_refcount_read(librados::IoCtx& io_ctx, string& oid, list<string> *refs, bool implicit_ref = false);
+
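+A hypothetical client-side usage sketch follows. The pool name, object name and
+include paths are assumptions; the cls calls are the ones documented above::
+
+    #include <rados/librados.hpp>
+    #include "cls/refcount/cls_refcount_client.h"  // assumed in-tree include path
+
+    int main() {
+      librados::Rados cluster;
+      cluster.init("admin");            // connect as client.admin
+      cluster.conf_read_file(nullptr);  // read the default ceph.conf
+      cluster.connect();
+
+      librados::IoCtx ioctx;
+      cluster.ioctx_create("dedup-chunks", ioctx);  // hypothetical pool name
+
+      // Take a reference on a chunk object under a caller-chosen tag.
+      librados::ObjectWriteOperation get_op;
+      cls_refcount_get(get_op, "manifest-obj-1");
+      ioctx.operate("chunk-0xdeadbeef", &get_op);
+
+      // Later, drop the reference; the class removes the chunk once the
+      // last tag is released.
+      librados::ObjectWriteOperation put_op;
+      cls_refcount_put(put_op, "manifest-obj-1");
+      ioctx.operate("chunk-0xdeadbeef", &put_op);
+
+      cluster.shutdown();
+      return 0;
+    }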
+
diff --git a/doc/dev/osd_internals/scrub.rst b/doc/dev/osd_internals/scrub.rst
new file mode 100644
index 000000000..149509799
--- /dev/null
+++ b/doc/dev/osd_internals/scrub.rst
@@ -0,0 +1,41 @@
+
+Scrub internals and diagnostics
+===============================
+
+Scrubbing Behavior Table
+------------------------
+
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Flags | none | noscrub | nodeep_scrub | noscrub/nodeep_scrub |
++=================================================+==========+===========+===============+======================+
+| Periodic tick | S | X | S | X |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Periodic tick after osd_deep_scrub_interval | D | D | S | X |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated scrub | S | S | S | S |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated scrub after osd_deep_scrub_interval | D | D | S | S |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated deep scrub | D | D | D | D |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+
+- X = Do nothing
+- S = Do regular scrub
+- D = Do deep scrub
+
+State variables
+---------------
+
+- Periodic tick state is ``!must_scrub && !must_deep_scrub && !time_for_deep``
+- Periodic tick after ``osd_deep_scrub_interval`` state is ``!must_scrub && !must_deep_scrub && time_for_deep``
+- Initiated scrub state is ``must_scrub && !must_deep_scrub && !time_for_deep``
+- Initiated scrub after ``osd_deep_scrub_interval`` state is ``must_scrub && !must_deep_scrub && time_for_deep``
+- Initiated deep scrub state is ``must_scrub && must_deep_scrub``
+
+Scrub Reservations
+------------------
+
+An OSD daemon command dumps total local and remote reservations::
+
+ ceph daemon osd.<id> dump_scrub_reservations
+
diff --git a/doc/dev/osd_internals/snaps.rst b/doc/dev/osd_internals/snaps.rst
new file mode 100644
index 000000000..5ebd0884a
--- /dev/null
+++ b/doc/dev/osd_internals/snaps.rst
@@ -0,0 +1,128 @@
+======
+Snaps
+======
+
+Overview
+--------
+Rados supports two related snapshotting mechanisms:
+
+ 1. *pool snaps*: snapshots are implicitly applied to all objects
+ in a pool
+ 2. *self managed snaps*: the user must provide the current *SnapContext*
+ on each write.
+
+These two are mutually exclusive, only one or the other can be used on
+a particular pool.
+
+The *SnapContext* is the set of snapshots currently defined for an object
+as well as the most recent snapshot (the *seq*) requested from the mon for
+sequencing purposes (a *SnapContext* with a newer *seq* is considered to
+be more recent).
+
+The difference between *pool snaps* and *self managed snaps* from the
+OSD's point of view lies in whether the *SnapContext* comes to the OSD
+via the client's MOSDOp or via the most recent OSDMap.
+
+See OSD::make_writeable
+
+Ondisk Structures
+-----------------
+Each object has in the PG collection a *head* object (or *snapdir*, which we
+will come to shortly) and possibly a set of *clone* objects.
+Each hobject_t has a snap field. For the *head* (the only writeable version
+of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the
+snap field is set to the *seq* of the *SnapContext* at their creation.
+When the OSD services a write, it first checks whether the most recent
+*clone* is tagged with a snapid prior to the most recent snap represented
+in the *SnapContext*. If so, at least one snapshot has occurred between
+the time of the write and the time of the last clone. Therefore, prior
+to performing the mutation, the OSD creates a new clone for servicing
+reads on snaps between the snapid of the last clone and the most recent
+snapid.
+
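+A simplified sketch of that decision, assuming trivial stand-in types (the
+actual logic lives in the write path, see OSD::make_writeable)::
+
+    #include <cstdint>
+    #include <vector>
+
+    using snapid_t = uint64_t;
+
+    // The SnapContext as described above: the most recent snap (seq) plus
+    // the set of existing snaps.
+    struct SnapContextSketch {
+      snapid_t seq;
+      std::vector<snapid_t> snaps;
+    };
+
+    // Returns true if a snapshot has been taken since the newest clone was
+    // created, i.e. the write must first create a new clone to preserve
+    // the pre-write contents for reads on the intervening snaps.
+    bool needs_clone(snapid_t newest_clone_snap,
+                     const SnapContextSketch& snapc) {
+      return newest_clone_snap < snapc.seq;
+    }
+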
+The *head* object contains a *SnapSet* encoded in an attribute, which tracks
+
+ 1. The full set of snaps defined for the object
+ 2. The full set of clones which currently exist
+ 3. Overlapping intervals between clones for tracking space usage
+ 4. Clone size
+
+If the *head* is deleted while there are still clones, a *snapdir* object
+is created instead to house the *SnapSet*.
+
+Additionally, the *object_info_t* on each clone includes a vector of snaps
+for which that clone is defined.
+
+Snap Removal
+------------
+To remove a snapshot, a request is made to the *Monitor* cluster to
+add the snapshot id to the list of purged snaps (or to remove it from
+the set of pool snaps in the case of *pool snaps*). In either case,
+the *PG* adds the snap to its *snap_trimq* for trimming.
+
+A clone can be removed when all of its snaps have been removed. In
+order to determine which clones might need to be removed upon snap
+removal, we maintain a mapping from snap to *hobject_t* using the
+*SnapMapper*.
+
+See PrimaryLogPG::SnapTrimmer, SnapMapper
+
+This trimming is performed asynchronously by the snap_trim_wq while the
+PG is clean and not scrubbing.
+
+ #. The next snap in PG::snap_trimq is selected for trimming
+ #. We determine the next object for trimming out of PG::snap_mapper.
+ For each object, we create a log entry and repop updating the
+ object info and the snap set (including adjusting the overlaps).
+ If the object is a clone which no longer belongs to any live snapshots,
+ it is removed here. (See PrimaryLogPG::trim_object() when new_snaps
+ is empty.)
+ #. We also locally update our *SnapMapper* instance with the object's
+ new snaps.
+ #. The log entry containing the modification of the object also
+ contains the new set of snaps, which the replica uses to update
+ its own *SnapMapper* instance.
+ #. The primary shares the info with the replica, which persists
+ the new set of purged_snaps along with the rest of the info.
+
+
+
+Recovery
+--------
+Because the trim operations are implemented using repops and log entries,
+normal PG peering and recovery maintain the snap trimmer operations with
+the caveat that push and removal operations need to update the local
+*SnapMapper* instance. If the purged_snaps update is lost, we merely
+retrim a now empty snap.
+
+SnapMapper
+----------
+*SnapMapper* is implemented on top of map_cacher<string, bufferlist>,
+which provides an interface over a backing store such as the file system
+with async transactions. While transactions are incomplete, the map_cacher
+instance buffers unstable keys allowing consistent access without having
+to flush the filestore. *SnapMapper* provides two mappings:
+
+ 1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone
+ object
+ 2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot
+    as one of their snaps
+
+Assumption: there are lots of hobjects and relatively few snaps. The
+first encoding has a stringification of the object as the key and an
+encoding of the set of snaps as a value. The second mapping, because there
+might be many hobjects for a single snap, is stored as a collection of keys
+of the form stringify(snap)_stringify(object) such that stringify(snap)
+is constant length. These keys have a bufferlist encoding
+pair<snapid, hobject_t> as a value. Thus, creating or trimming a single
+object does not involve reading all objects for any snap. Additionally,
+upon construction, the *SnapMapper* is provided with a mask for filtering
+the objects in the single SnapMapper keyspace belonging to that PG.
+
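+An illustrative sketch of that key scheme follows. The exact prefixes and
+formatting used by SnapMapper differ; the point is only that the snap portion
+is fixed width, so a prefix scan over ``<snap>_`` visits exactly the objects
+that have that snap::
+
+    #include <cstdint>
+    #include <iomanip>
+    #include <iostream>
+    #include <sstream>
+    #include <string>
+
+    // Build a key of the form stringify(snap)_stringify(object), with the
+    // snap encoded as fixed-width hex.
+    std::string snap_to_object_key(uint64_t snap, const std::string& object) {
+      std::ostringstream os;
+      os << std::hex << std::setw(16) << std::setfill('0') << snap
+         << "_" << object;
+      return os.str();
+    }
+
+    int main() {
+      std::cout << snap_to_object_key(0x2a, "rbd_data.1234.0000") << "\n";
+      // 000000000000002a_rbd_data.1234.0000
+    }
+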
+Split
+-----
+The snapid_t -> hobject_t key entries are arranged such that for any PG,
+up to 8 prefixes need to be checked to determine all hobjects in a particular
+snap for a particular PG. Upon split, the prefixes to check on the parent
+are adjusted such that only the objects remaining in the PG will be visible.
+The children will immediately have the correct mapping.
diff --git a/doc/dev/osd_internals/stale_read.rst b/doc/dev/osd_internals/stale_read.rst
new file mode 100644
index 000000000..5493bb1f4
--- /dev/null
+++ b/doc/dev/osd_internals/stale_read.rst
@@ -0,0 +1,102 @@
+Preventing Stale Reads
+======================
+
+We write synchronously to all replicas before sending an ACK to the
+client, which limits the potential for inconsistency
+in the write path. However, by default we serve reads from just
+one replica (the lead/primary OSD for each PG), and the
+client will use whatever OSDMap it has to select the OSD from which to read.
+In most cases, this is fine: either the client map is correct,
+or the OSD that we think is the primary for the object knows that it
+is not the primary anymore, and can feed the client an updated map
+that indicates a newer primary.
+
+The key is to ensure that this is *always* true. In particular, we
+need to ensure that an OSD that is fenced off from its peers and has
+not learned about a map update does not continue to service read
+requests from similarly stale clients at any point after which a new
+primary may have been allowed to make a write.
+
+We accomplish this via a mechanism that works much like a read lease.
+Each pool may have a ``read_lease_interval`` property which defines
+how long this is, although by default we simply set it to
+``osd_pool_default_read_lease_ratio`` (default: .8) times the
+``osd_heartbeat_grace``. (This way the lease will generally have
+expired by the time we mark a failed OSD down.)
+
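+As a worked example (assuming the stock osd_heartbeat_grace of 20 seconds;
+check your cluster's configuration), the effective lease with the default
+ratio of .8 works out to 16 seconds::
+
+    #include <chrono>
+    #include <iostream>
+
+    int main() {
+      // Assumed defaults: osd_heartbeat_grace = 20s,
+      // osd_pool_default_read_lease_ratio = 0.8 (from the text above).
+      const double heartbeat_grace = 20.0;
+      const double read_lease_ratio = 0.8;
+
+      const std::chrono::duration<double> lease(heartbeat_grace * read_lease_ratio);
+
+      // readable_until is tracked on a monotonic clock, so wall-clock jumps
+      // cannot extend or shorten the lease.
+      auto readable_until = std::chrono::steady_clock::now() +
+          std::chrono::duration_cast<std::chrono::steady_clock::duration>(lease);
+      (void)readable_until;
+
+      std::cout << "lease = " << lease.count() << "s\n";  // 16s
+    }
+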
+readable_until
+--------------
+
+Primary and replica both track a couple of values:
+
+* *readable_until* is how long we are allowed to service (read)
+ requests before *our* "lease" expires.
+* *readable_until_ub* is an upper bound on *readable_until* for any
+ OSD in the acting set.
+
+The primary manages these two values by sending *pg_lease_t* messages
+to replicas that increase the upper bound. Once all acting OSDs have
+acknowledged they've seen the higher bound, the primary increases its
+own *readable_until* and shares that (in a subsequent *pg_lease_t*
+message). The resulting invariant is that any acting OSDs'
+*readable_until* is always <= any acting OSDs' *readable_until_ub*.
+
+In order to avoid any problems with clock skew, we use monotonic
+clocks (which are only accurate locally and unaffected by time
+adjustments) throughout to manage these leases. Peer OSDs calculate
+upper and lower bounds on the deltas between OSD-local clocks,
+allowing the primary to share timestamps based on its local clock
+while replicas translate that to an appropriate bound for their own
+local clocks.
+
+Prior Intervals
+---------------
+
+Whenever there is an interval change, we need to have an upper bound
+on the *readable_until* values for any OSDs in the prior interval.
+All OSDs from that interval have this value (*readable_until_ub*), and
+share it as part of the pg_history_t during peering.
+
+Because peering may involve OSDs that were not already communicating
+before and may not have bounds on their clock deltas, the bound in
+*pg_history_t* is shared as a simple duration before the upper bound
+expires. This means that the bound slips forward in time due to the
+transit time for the peering message, but that is generally quite
+short, and moving the bound later in time is safe since it is an
+*upper* bound.
+
+PG "laggy" state
+----------------
+
+While the PG is active, *pg_lease_t* and *pg_lease_ack_t* messages are
+regularly exchanged. However, if a client request comes in and the
+lease has expired (*readable_until* has passed), the PG will go into a
+*LAGGY* state and the request will be blocked. Once the lease is renewed,
+the request(s) will be requeued.
+
+PG "wait" state
+---------------
+
+If peering completes but the prior interval's OSDs may still be
+readable, the PG will go into the *WAIT* state until sufficient time
+has passed. Any OSD requests will block during that period. Recovery
+may proceed while in this state, since the logical, user-visible
+content of objects does not change.
+
+Dead OSDs
+---------
+
+Generally speaking, we need to wait until prior intervals' OSDs *know*
+that they should no longer be readable. If an OSD is known to have
+crashed (e.g., because the process is no longer running, which we may
+infer because we get a ECONNREFUSED error), then we can infer that it
+is not readable.
+
+Similarly, if an OSD is marked down, gets a map update telling it so,
+and then informs the monitor that it knows it was marked down, we can
+similarly infer that it is not still serving requests for a prior interval.
+
+When a PG is in the *WAIT* state, it will watch new maps for OSDs'
+*dead_epoch* value indicating they are aware of their dead-ness. If
+all down OSDs from prior interval are so aware, we can exit the WAIT
+state early.
diff --git a/doc/dev/osd_internals/watch_notify.rst b/doc/dev/osd_internals/watch_notify.rst
new file mode 100644
index 000000000..8c2ce09ba
--- /dev/null
+++ b/doc/dev/osd_internals/watch_notify.rst
@@ -0,0 +1,81 @@
+============
+Watch Notify
+============
+
+See librados for the watch/notify interface.
+
+Overview
+--------
+The object_info (See osd/osd_types.h) tracks the set of watchers for
+a particular object persistently in the object_info_t::watchers map.
+In order to track notify progress, we also maintain some ephemeral
+structures associated with the ObjectContext.
+
+Each Watch has an associated Watch object (See osd/Watch.h). The
+ObjectContext for a watched object will have a (strong) reference
+to one Watch object per watch, and each Watch object holds a
+reference to the corresponding ObjectContext. This circular reference
+is deliberate and is broken when the Watch state is discarded on
+a new peering interval or removed upon timeout expiration or an
+unwatch operation.
+
+A watch tracks the associated connection via a strong
+ConnectionRef Watch::conn. The associated connection has a
+WatchConState stashed in the OSD::Session for tracking associated
+Watches in order to be able to notify them upon ms_handle_reset()
+(via WatchConState::reset()).
+
+Each Watch object tracks the set of currently un-acked notifies.
+start_notify() on a Watch object adds a reference to a new in-progress
+Notify to the Watch and either:
+
+* if the Watch is *connected*, sends a Notify message to the client
+* if the Watch is *unconnected*, does nothing.
+
+When the Watch becomes connected (in PrimaryLogPG::do_osd_op_effects),
+Notifies are resent to all remaining tracked Notify objects.
+
+Each Notify object tracks the set of un-notified Watchers via
+calls to complete_watcher(). Once the remaining set is empty or the
+timeout expires (cb, registered in init()) a notify completion
+is sent to the client.
+
+Watch Lifecycle
+---------------
+A watch may be in one of 5 states:
+
+1. Non existent.
+2. On disk, but not registered with an object context.
+3. Connected
+4. Disconnected, callback registered with timer
+5. Disconnected, callback in queue for scrub or is_degraded
+
+Case 2 occurs between when an OSD goes active and the ObjectContext
+for an object with watchers is loaded into memory due to an access.
+During Case 2, no state is registered for the watch. Case 2
+transitions to Case 4 in PrimaryLogPG::populate_obc_watchers() during
+PrimaryLogPG::find_object_context. Case 1 becomes case 3 via
+OSD::do_osd_op_effects due to a watch operation. Case 4,5 become case
+3 in the same way. Case 3 becomes case 4 when the connection resets
+on a watcher's session.
+
+Cases 4 and 5 deserve some explanation. Normally, when a Watch enters Case
+4, a callback is registered with the OSDService::watch_timer to be
+called at timeout expiration. At the time that the callback is
+called, however, the pg might be in a state where it cannot write
+to the object in order to remove the watch (i.e., during a scrub
+or while the object is degraded). In that case, we use
+Watch::get_delayed_cb() to generate another Context for use from
+the callbacks_for_degraded_object and Scrubber::callbacks lists.
+In either case, Watch::unregister_cb() does the right thing
+(SafeTimer::cancel_event() is harmless for contexts not registered
+with the timer).
+
+Notify Lifecycle
+----------------
+The notify timeout is simpler: a timeout callback is registered when
+the notify is init()'d. If all watchers ack notifies before the
+timeout occurs, the timeout is canceled and the client is notified
+of the notify completion. Otherwise, the timeout fires, the Notify
+object pings each Watch via cancel_notify to remove itself, and
+sends the notify completion to the client early.
diff --git a/doc/dev/osd_internals/wbthrottle.rst b/doc/dev/osd_internals/wbthrottle.rst
new file mode 100644
index 000000000..9b67efbb6
--- /dev/null
+++ b/doc/dev/osd_internals/wbthrottle.rst
@@ -0,0 +1,28 @@
+==================
+Writeback Throttle
+==================
+
+Previously, the filestore had a problem when handling large numbers of
+small ios. We throttle dirty data implicitly via the journal, but
+a large number of inodes can be dirtied without filling the journal,
+resulting in a very long sync time when the sync finally does happen.
+The flusher was not an adequate solution to this problem since it
+forced writeback of small writes too eagerly, killing performance.
+
+WBThrottle tracks unflushed io per hobject_t and ::fsyncs in lru
+order once the start_flusher threshold is exceeded for any of
+dirty bytes, dirty ios, or dirty inodes. While any of these exceed
+the hard_limit, we block on throttle() in _do_op.
+
+See src/os/WBThrottle.h, src/osd/WBThrottle.cc
+
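+A minimal sketch of the throttling idea, with made-up example thresholds (the
+real implementation in WBThrottle additionally tracks per-object state and
+flushes in lru order)::
+
+    #include <cstdint>
+
+    // Track totals of unflushed io and decide when to start flushing and
+    // when to block incoming ops, per the start_flusher / hard_limit split
+    // described above.
+    struct WBThrottleSketch {
+      uint64_t dirty_bytes = 0, dirty_ios = 0, dirty_inodes = 0;
+
+      // soft thresholds: once any is exceeded, begin flushing in lru order
+      uint64_t start_bytes = 10 << 20, start_ios = 500, start_inodes = 500;
+      // hard limits: while any is exceeded, block new ops in _do_op
+      uint64_t hard_bytes = 100 << 20, hard_ios = 5000, hard_inodes = 5000;
+
+      bool should_start_flushing() const {
+        return dirty_bytes > start_bytes || dirty_ios > start_ios ||
+               dirty_inodes > start_inodes;
+      }
+
+      bool must_block() const {
+        return dirty_bytes > hard_bytes || dirty_ios > hard_ios ||
+               dirty_inodes > hard_inodes;
+      }
+    };
+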
+To track the open FDs through the writeback process, there is now an
+fdcache to cache open fds. lfn_open now returns a cached FDRef which
+implicitly closes the fd once all references have expired.
+
+Filestore syncs have the side effect of flushing all outstanding objects
+in the wbthrottle.
+
+lfn_unlink clears the cached FDRef and wbthrottle entries for the
+unlinked object when the last link is removed and asserts that all
+outstanding FDRefs for that object are dead.