Diffstat (limited to 'doc/dev/crimson/pipeline.rst')
-rw-r--r--  doc/dev/crimson/pipeline.rst | 97
1 file changed, 97 insertions, 0 deletions
diff --git a/doc/dev/crimson/pipeline.rst b/doc/dev/crimson/pipeline.rst
new file mode 100644
index 000000000..e9115c6d7
--- /dev/null
+++ b/doc/dev/crimson/pipeline.rst
@@ -0,0 +1,97 @@
+==============================
+The ``ClientRequest`` pipeline
+==============================
+
+In crimson, exactly as in the classical OSD, a client request has data and
+ordering dependencies that must be satisfied before its processing (or, more
+precisely, a particular phase of it) can begin. As one of the goals behind
+crimson is to preserve compatibility with the existing OSD incarnation, the
+same semantics must be assured. An obvious example of such a data dependency
+is that an OSD needs a version of the OSDMap that matches the one used by
+the client (``Message::get_min_epoch()``).
+
+If a dependency is not satisfied, the processing stops. It is crucial to note
+that the same must happen to all other requests that are sequenced after it
+(due to their ordering requirements).
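+
+For instance, the OSDMap dependency above boils down to an epoch comparison
+of roughly this shape (a schematic sketch only -- apart from
+``Message::get_min_epoch()`` and ``epoch_t``, the names are invented for
+illustration and this is not the actual crimson code):
+
+.. code-block:: cpp
+
+   #include <cstdint>
+
+   using epoch_t = std::uint32_t;  // as defined in Ceph's include/types.h
+
+   // The data dependency: the request (and everything sequenced after
+   // it) has to wait whenever the OSD's map is older than the epoch
+   // the client encoded into the message via get_min_epoch().
+   bool must_wait_for_map(epoch_t osd_epoch, epoch_t client_min_epoch) {
+     return osd_epoch < client_min_epoch;
+   }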
+
+There are a few cases in which a client request can block:
+
+``ClientRequest::ConnectionPipeline::await_map``
+  wait until a particular OSDMap version is available at the OSD level
+``ClientRequest::ConnectionPipeline::get_pg``
+  wait until a particular PG becomes available on the OSD
+``ClientRequest::PGPipeline::await_map``
+  wait for the PG to be advanced to a particular epoch
+``ClientRequest::PGPipeline::wait_for_active``
+  wait for the PG to become *active* (i.e. to have ``is_active()`` asserted)
+``ClientRequest::PGPipeline::recover_missing``
+  wait for an object to be recovered (i.e. to leave the ``missing`` set)
+``ClientRequest::PGPipeline::get_obc``
+  wait for an object to become available for locking; the ``obc`` will be
+  locked before the operation is allowed to continue
+``ClientRequest::PGPipeline::process``
+  wait while any other ``MOSDOp`` message is being handled against this PG
+
+At any moment, a ``ClientRequest`` being served should be in one and only one
+of the phases described above. Similarly, an object denoting a particular
+phase can host no more than a single ``ClientRequest`` at the same time. At
+the low level this is achieved with a combination of a barrier and an
+exclusive lock; together they implement the semantics of a semaphore with a
+single slot for these exclusive phases.
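+
+A minimal, self-contained model of this single-slot semantic is sketched
+below. It is an illustration only: plain C++ threads stand in for crimson's
+future-based machinery, and all the names in it are made up:
+
+.. code-block:: cpp
+
+   #include <condition_variable>
+   #include <mutex>
+
+   // Toy model of one exclusive pipeline phase: a ticket barrier that
+   // fixes the admission order plus a single slot, i.e. a semaphore
+   // with exactly one slot.
+   class ExclusivePhase {
+     std::mutex m;
+     std::condition_variable cv;
+     bool occupied = false;      // the single slot
+     unsigned next_ticket = 0;   // barrier: admission strictly in order
+     unsigned now_serving = 0;
+
+   public:
+     void enter() {
+       std::unique_lock l{m};
+       const unsigned ticket = next_ticket++;
+       // Admitted only when it is our turn and the slot is free.
+       cv.wait(l, [&] { return ticket == now_serving && !occupied; });
+       occupied = true;
+     }
+
+     void exit() {
+       std::unique_lock l{m};
+       occupied = false;
+       ++now_serving;
+       cv.notify_all();
+     }
+   };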
+
+As the execution advances, the request enters the next phase and leaves the
+current one, freeing it for another ``ClientRequest`` instance. All these
+phases form a pipeline which ensures that the order is preserved.
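+
+In the toy model above, advancing would look as follows: the request queues
+up for the next phase first and frees the current one only afterwards, so no
+later request can overtake it in between (again, ``advance`` is a made-up
+name, not crimson's API):
+
+.. code-block:: cpp
+
+   // Move a request from one exclusive phase to the next one.
+   void advance(ExclusivePhase& current, ExclusivePhase& next) {
+     next.enter();    // blocks until the next phase admits us, in order
+     current.exit();  // only now a successor may enter the current phase
+   }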
+
+These pipeline phases are divided into two ordering domains:
+``ConnectionPipeline`` and ``PGPipeline``. The former ensures order across
+a client connection while the latter does so across a PG. That is, requests
+originating from the same connection are executed in the same order in which
+they were sent by the client. The same applies to the PG domain: when
+requests from multiple connections reach a PG, they are executed in the same
+order in which they entered the first blocking phase of the ``PGPipeline``.
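+
+Continuing the toy model, the two domains could be sketched as two groups of
+phases: a request walks its connection's pipeline first, and its PG-wide
+order is fixed at the moment it enters the first ``PGPipeline`` phase. The
+phase names below follow the list above; everything else is invented:
+
+.. code-block:: cpp
+
+   #include <cstdio>
+   #include <thread>
+
+   struct ConnectionPipeline {
+     ExclusivePhase await_map, get_pg;
+   };
+
+   struct PGPipeline {
+     ExclusivePhase await_map, wait_for_active, recover_missing,
+                    get_obc, process;
+   };
+
+   void serve(const char* name, ConnectionPipeline& conn, PGPipeline& pg) {
+     conn.await_map.enter();                    // connection-wide order
+     advance(conn.await_map, conn.get_pg);
+     advance(conn.get_pg, pg.await_map);        // PG-wide order fixed here
+     advance(pg.await_map, pg.wait_for_active);
+     advance(pg.wait_for_active, pg.recover_missing);
+     advance(pg.recover_missing, pg.get_obc);
+     advance(pg.get_obc, pg.process);
+     std::printf("%s processed\n", name);
+     pg.process.exit();
+   }
+
+   int main() {
+     ConnectionPipeline conn_a, conn_b;  // two client connections
+     PGPipeline pg;                      // both requests target one PG
+     std::thread a{serve, "req-a", std::ref(conn_a), std::ref(pg)};
+     std::thread b{serve, "req-b", std::ref(conn_b), std::ref(pg)};
+     a.join();
+     b.join();
+   }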
+
+Comparison with the classical OSD
+----------------------------------
+As the audience of this document is Ceph developers, it seems reasonable to
+match the phases of crimson's ``ClientRequest`` pipeline with the blocking
+stages in the classical OSD. The names in the right column are the names of
+the containers (lists and maps) used to implement these stages; they are
+also documented in the ``PG.h`` header.
+
++----------------------------------------+--------------------------------------+
+| crimson | ceph-osd waiting list |
++========================================+======================================+
+|``ConnectionPipeline::await_map`` | ``OSDShardPGSlot::waiting`` and |
+|``ConnectionPipeline::get_pg`` | ``OSDShardPGSlot::waiting_peering`` |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::await_map`` | ``PG::waiting_for_map`` |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::wait_for_active`` | ``PG::waiting_for_peered`` |
+| +--------------------------------------+
+| | ``PG::waiting_for_flush`` |
+| +--------------------------------------+
+| | ``PG::waiting_for_active`` |
++----------------------------------------+--------------------------------------+
+|To be done (``PG_STATE_LAGGY``) | ``PG::waiting_for_readable`` |
++----------------------------------------+--------------------------------------+
+|To be done | ``PG::waiting_for_scrub`` |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::recover_missing`` | ``PG::waiting_for_unreadable_object``|
+| +--------------------------------------+
+| | ``PG::waiting_for_degraded_object`` |
++----------------------------------------+--------------------------------------+
+|To be done (proxying) | ``PG::waiting_for_blocked_object`` |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::get_obc`` | *obc rwlocks* |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::process`` | ``PG::lock`` (roughly) |
++----------------------------------------+--------------------------------------+
+
+
+As a final word, it might be worth emphasizing that the ordering
+implementations in both the classical OSD and crimson are stricter than the
+theoretical minimum required by the RADOS protocol. For instance, read
+operations targeting the same object could be parallelized at the price of
+extra complexity, but they are not -- simplicity has won.