summaryrefslogtreecommitdiffstats
path: root/doc/dev/rados-client-protocol.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-07 18:45:59 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-07 18:45:59 +0000
commit19fcec84d8d7d21e796c7624e521b60d28ee21ed (patch)
tree42d26aa27d1e3f7c0b8bd3fd14e7d7082f5008dc /doc/dev/rados-client-protocol.rst
parentInitial commit. (diff)
downloadceph-19fcec84d8d7d21e796c7624e521b60d28ee21ed.tar.xz
ceph-19fcec84d8d7d21e796c7624e521b60d28ee21ed.zip
Adding upstream version 16.2.11+ds.upstream/16.2.11+dsupstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r--doc/dev/rados-client-protocol.rst117
1 files changed, 117 insertions, 0 deletions
diff --git a/doc/dev/rados-client-protocol.rst b/doc/dev/rados-client-protocol.rst
new file mode 100644
index 000000000..920c65f39
--- /dev/null
+++ b/doc/dev/rados-client-protocol.rst
@@ -0,0 +1,117 @@
+RADOS client protocol
+=====================
+
+This is very incomplete, but one must start somewhere.
+
+Basics
+------
+
+Requests are MOSDOp messages. Replies are MOSDOpReply messages.
+
+An object request is targeted at an hobject_t, which includes a pool,
+hash value, object name, placement key (usually empty), and snapid.
+
+The hash value is a 32-bit hash value, normally generated by hashing
+the object name. The hobject_t can be arbitrarily constructed,
+though, with any hash value and name. Note that in the MOSDOp these
+components are spread across several fields and not logically
+assembled in an actual hobject_t member (mainly historical reasons).
+
+A request can also target a PG. In this case, the *ps* value matches
+a specific PG, the object name is empty, and (hopefully) the ops in
+the request are PG ops.
+
+Either way, the request ultimately targets a PG, either by using the
+explicit pgid or by folding the hash value onto the current number of
+pgs in the pool. The client sends the request to the primary for the
+associated PG.
+
+Each request is assigned a unique tid.
+
+Resends
+-------
+
+If there is a connection drop, the client will resend any outstanding
+requests.
+
+Any time there is a PG mapping change such that the primary changes,
+the client is responsible for resending the request. Note that
+although there may be an interval change from the OSD's perspective
+(triggering PG peering), if the primary doesn't change then the client
+need not resend.
+
+There are a few exceptions to this rule:
+
+ * There is a last_force_op_resend field in the pg_pool_t in the
+ OSDMap. If this changes, then the clients are forced to resend any
+ outstanding requests. (This happens when tiering is adjusted, for
+ example.)
+ * Some requests are such that they are resent on *any* PG interval
+ change, as defined by pg_interval_t's is_new_interval() (the same
+ criteria used by peering in the OSD).
+ * If the PAUSE OSDMap flag is set and unset.
+
+Each time a request is sent to the OSD the *attempt* field is incremented. The
+first time it is 0, the next 1, etc.
+
+Backoff
+-------
+
+Ordinarily the OSD will simply queue any requests it can't immediately
+process in memory until such time as it can. This can become
+problematic because the OSD limits the total amount of RAM consumed by
+incoming messages: if either of the thresholds for the number of
+messages or the number of bytes is reached, new messages will not be
+read off the network socket, causing backpressure through the network.
+
+In some cases, though, the OSD knows or expects that a PG or object
+will be unavailable for some time and does not want to consume memory
+by queuing requests. In these cases it can send a MOSDBackoff message
+to the client.
+
+A backoff request has four properties:
+
+#. the op code (block, unblock, or ack-block)
+#. *id*, a unique id assigned within this session
+#. hobject_t begin
+#. hobject_t end
+
+There are two types of backoff: a *PG* backoff will plug all requests
+targeting an entire PG at the client, as described by a range of the
+hash/hobject_t space [begin,end), while an *object* backoff will plug
+all requests targeting a single object (begin == end).
+
+When the client receives a *block* backoff message, it is now
+responsible for *not* sending any requests for hobject_ts described by
+the backoff. The backoff remains in effect until the backoff is
+cleared (via an 'unblock' message) or the OSD session is closed. A
+*ack_block* message is sent back to the OSD immediately to acknowledge
+receipt of the backoff.
+
+When an unblock is
+received, it will reference a specific id that the client previous had
+blocked. However, the range described by the unblock may be smaller
+than the original range, as the PG may have split on the OSD. The unblock
+should *only* unblock the range specified in the unblock message. Any requests
+that fall within the unblock request range are reexamined and, if no other
+installed backoff applies, resent.
+
+On the OSD, Backoffs are also tracked across ranges of the hash space, and
+exist in three states:
+
+#. new
+#. acked
+#. deleting
+
+A newly installed backoff is set to *new* and a message is sent to the
+client. When the *ack-block* message is received it is changed to the
+*acked* state. The OSD may process other messages from the client that
+are covered by the backoff in the *new* state, but once the backoff is
+*acked* it should never see a blocked request unless there is a bug.
+
+If the OSD wants to a remove a backoff in the *acked* state it can
+simply remove it and notify the client. If the backoff is in the
+*new* state it must move it to the *deleting* state and continue to
+use it to discard client requests until the *ack-block* message is
+received, at which point it can finally be removed. This is necessary to
+preserve the order of operations processed by the OSD.