author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
commit | e6918187568dbd01842d8d1d2c808ce16a894239 (patch) | |
tree | 64f88b554b444a49f656b6c656111a145cbbaa28 /doc/dev/rados-client-protocol.rst | |
parent | Initial commit. (diff) | |
Adding upstream version 18.2.2. (ref: upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/dev/rados-client-protocol.rst')
-rw-r--r-- | doc/dev/rados-client-protocol.rst | 117 |
1 files changed, 117 insertions, 0 deletions
diff --git a/doc/dev/rados-client-protocol.rst b/doc/dev/rados-client-protocol.rst
new file mode 100644
index 000000000..920c65f39
--- /dev/null
+++ b/doc/dev/rados-client-protocol.rst
@@ -0,0 +1,117 @@
+RADOS client protocol
+=====================
+
+This is very incomplete, but one must start somewhere.
+
+Basics
+------
+
+Requests are MOSDOp messages. Replies are MOSDOpReply messages.
+
+An object request is targeted at an hobject_t, which includes a pool,
+hash value, object name, placement key (usually empty), and snapid.
+
+The hash value is a 32-bit value, normally generated by hashing
+the object name. The hobject_t can be arbitrarily constructed,
+though, with any hash value and name. Note that in the MOSDOp these
+components are spread across several fields and not logically
+assembled in an actual hobject_t member (mainly historical reasons).
+
+A request can also target a PG. In this case, the *ps* value matches
+a specific PG, the object name is empty, and (hopefully) the ops in
+the request are PG ops.
+
+Either way, the request ultimately targets a PG, either by using the
+explicit pgid or by folding the hash value onto the current number of
+pgs in the pool. The client sends the request to the primary for the
+associated PG.
+
+Each request is assigned a unique tid.
+
+Resends
+-------
+
+If there is a connection drop, the client will resend any outstanding
+requests.
+
+Any time there is a PG mapping change such that the primary changes,
+the client is responsible for resending the request. Note that
+although there may be an interval change from the OSD's perspective
+(triggering PG peering), if the primary doesn't change then the client
+need not resend.
+
+There are a few exceptions to this rule:
+
+  * There is a last_force_op_resend field in the pg_pool_t in the
+    OSDMap. If this changes, then the clients are forced to resend any
+    outstanding requests. (This happens when tiering is adjusted, for
+    example.)
+  * Some requests are such that they are resent on *any* PG interval
+    change, as defined by pg_interval_t's is_new_interval() (the same
+    criteria used by peering in the OSD).
+  * If the PAUSE OSDMap flag is set and unset.
+
+Each time a request is sent to the OSD the *attempt* field is incremented. The
+first time it is 0, the next 1, etc.
+
+Backoff
+-------
+
+Ordinarily the OSD will simply queue any requests it can't immediately
+process in memory until such time as it can. This can become
+problematic because the OSD limits the total amount of RAM consumed by
+incoming messages: if either of the thresholds for the number of
+messages or the number of bytes is reached, new messages will not be
+read off the network socket, causing backpressure through the network.
+
+In some cases, though, the OSD knows or expects that a PG or object
+will be unavailable for some time and does not want to consume memory
+by queuing requests. In these cases it can send a MOSDBackoff message
+to the client.
+
+A backoff request has four properties:
+
+#. the op code (block, unblock, or ack-block)
+#. *id*, a unique id assigned within this session
+#. hobject_t begin
+#. hobject_t end
+
+There are two types of backoff: a *PG* backoff will plug all requests
+targeting an entire PG at the client, as described by a range of the
+hash/hobject_t space [begin,end), while an *object* backoff will plug
+all requests targeting a single object (begin == end).
+
+When the client receives a *block* backoff message, it is now
+responsible for *not* sending any requests for hobject_ts described by
+the backoff. The backoff remains in effect until the backoff is
+cleared (via an *unblock* message) or the OSD session is closed. An
+*ack-block* message is sent back to the OSD immediately to acknowledge
+receipt of the backoff.
+
+When an unblock is
+received, it will reference a specific id that the client had previously
+blocked. However, the range described by the unblock may be smaller
+than the original range, as the PG may have split on the OSD. The unblock
+should *only* unblock the range specified in the unblock message. Any requests
+that fall within the unblock request range are reexamined and, if no other
+installed backoff applies, resent.
+
+On the OSD, backoffs are also tracked across ranges of the hash space, and
+exist in three states:
+
+#. new
+#. acked
+#. deleting
+
+A newly installed backoff is set to *new* and a message is sent to the
+client. When the *ack-block* message is received it is changed to the
+*acked* state. The OSD may process other messages from the client that
+are covered by the backoff in the *new* state, but once the backoff is
+*acked* it should never see a blocked request unless there is a bug.
+
+If the OSD wants to remove a backoff in the *acked* state it can
+simply remove it and notify the client. If the backoff is in the
+*new* state it must move it to the *deleting* state and continue to
+use it to discard client requests until the *ack-block* message is
+received, at which point it can finally be removed. This is necessary to
+preserve the order of operations processed by the OSD.
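
The client-side bookkeeping described in the Backoff section can be illustrated with a small sketch. It follows only the text above and is not Ceph's actual client code: positions in the hash/hobject_t space are modelled as plain integers, and the names `Backoff`, `ClientBackoffs`, `on_block`, `on_unblock`, and `is_blocked` are hypothetical. It shows a block being installed and acknowledged, and an unblock clearing only the range it names.

```python
from dataclasses import dataclass


@dataclass
class Backoff:
    """One installed backoff; ints stand in for hobject_t (an assumption)."""
    backoff_id: int
    begin: int
    end: int

    def covers(self, key: int) -> bool:
        # Object backoff: begin == end plugs exactly one object.
        if self.begin == self.end:
            return key == self.begin
        # PG backoff: half-open range [begin, end).
        return self.begin <= key < self.end


class ClientBackoffs:
    """Hypothetical per-session table of backoffs kept by a client."""

    def __init__(self):
        self.installed = []

    def on_block(self, backoff_id, begin, end):
        """Install the backoff and return the ack-block reply to send."""
        self.installed.append(Backoff(backoff_id, begin, end))
        return ("ack-block", backoff_id, begin, end)

    def on_unblock(self, backoff_id, begin, end):
        """Clear only the part of backoff `backoff_id` named by the unblock."""
        unblock = Backoff(backoff_id, begin, end)
        kept = []
        for b in self.installed:
            if b.backoff_id != backoff_id:
                kept.append(b)
            elif b.begin == b.end:
                # Object backoff: drop it only if the unblock names it.
                if not unblock.covers(b.begin):
                    kept.append(b)
            else:
                # Range backoff: keep whatever lies outside [begin, end).
                left_end = min(b.end, begin)
                right_begin = max(b.begin, end)
                if b.begin < left_end:
                    kept.append(Backoff(backoff_id, b.begin, left_end))
                if right_begin < b.end:
                    kept.append(Backoff(backoff_id, right_begin, b.end))
        self.installed = kept

    def is_blocked(self, key):
        """True if any installed backoff still covers `key`."""
        return any(b.covers(key) for b in self.installed)


table = ClientBackoffs()
ack = table.on_block(1, 0, 100)   # PG backoff over [0, 100)
table.on_unblock(1, 0, 50)        # the OSD unblocks only the lower half
# A pending request is resent only if no installed backoff still applies.
pending = [10, 75]
resend = [k for k in pending if not table.is_blocked(k)]
```

After the partial unblock, the request at position 10 becomes eligible for resend while the one at 75 stays plugged, which mirrors the rule that an unblock clears only the range it specifies.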