summaryrefslogtreecommitdiffstats
path: root/doc/dev/msgr2.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
commite6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree64f88b554b444a49f656b6c656111a145cbbaa28 /doc/dev/msgr2.rst
parentInitial commit. (diff)
downloadceph-e6918187568dbd01842d8d1d2c808ce16a894239.tar.xz
ceph-e6918187568dbd01842d8d1d2c808ce16a894239.zip
Adding upstream version 18.2.2.upstream/18.2.2
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/dev/msgr2.rst')
-rw-r--r--doc/dev/msgr2.rst956
1 files changed, 956 insertions, 0 deletions
diff --git a/doc/dev/msgr2.rst b/doc/dev/msgr2.rst
new file mode 100644
index 000000000..ecd6c8258
--- /dev/null
+++ b/doc/dev/msgr2.rst
@@ -0,0 +1,956 @@
+.. _msgr2-protocol:
+
+msgr2 protocol (msgr2.0 and msgr2.1)
+====================================
+
+This is a revision of the legacy Ceph on-wire protocol that was
+implemented by the SimpleMessenger. It addresses performance and
+security issues.
+
+Goals
+-----
+
+This protocol revision has several goals relative to the original protocol:
+
+* *Flexible handshaking*. The original protocol did not have a
+ sufficiently flexible protocol negotiation that allows for features
+ that were not required.
+* *Encryption*. We will incorporate encryption over the wire.
+* *Performance*. We would like to provide for protocol features
+ (e.g., padding) that keep computation and memory copies out of the
+ fast path where possible.
+* *Signing*. We will allow for traffic to be signed (but not
+ necessarily encrypted). This is not implemented.
+
+Definitions
+-----------
+
+* *client* (C): the party initiating a (TCP) connection
+* *server* (S): the party accepting a (TCP) connection
+* *connection*: an instance of a (TCP) connection between two processes.
+* *entity*: a ceph entity instantiation, e.g. 'osd.0'. each entity
+ has one or more unique entity_addr_t's by virtue of the 'nonce'
+ field, which is typically a pid or random value.
+* *session*: a stateful session between two entities in which message
+ exchange is ordered and lossless. A session might span multiple
+ connections if there is an interruption (TCP connection disconnect).
+* *frame*: a discrete message sent between the peers. Each frame
+ consists of a tag (type code), payload, and (if signing
+ or encryption is enabled) some other fields. See below for the
+ structure.
+* *tag*: a type code associated with a frame. The tag
+ determines the structure of the payload.
+
+Phases
+------
+
+A connection has four distinct phases:
+
+#. banner
+#. authentication frame exchange
+#. message flow handshake frame exchange
+#. message frame exchange
+
+Banner
+------
+
+Both the client and server, upon connecting, send a banner::
+
+ "ceph v2\n"
+ __le16 banner payload length
+ banner payload
+
+A banner payload has the form::
+
+ __le64 peer_supported_features
+ __le64 peer_required_features
+
+This is a new, distinct feature bit namespace (CEPH_MSGR2_*).
+Currently, only CEPH_MSGR2_FEATURE_REVISION_1 is defined. It is
+supported but not required, so that msgr2.0 and msgr2.1 peers
+can talk to each other.
+
+If the remote party advertises required features we don't support, we
+can disconnect.
+
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | send banner |
+ |----+ +----|
+ | | | |
+ | +-------+--->|
+ | send banner| |
+ |<-----------+ |
+ | |
+
+Frame format
+------------
+
+After the banners are exchanged, all further communication happens
+in frames. The exact format of the frame depends on the connection
+mode (msgr2.0-crc, msgr2.0-secure, msgr2.1-crc or msgr2.1-secure).
+All connections start in crc mode (either msgr2.0-crc or msgr2.1-crc,
+depending on peer_supported_features from the banner).
+
+Each frame has a 32-byte preamble::
+
+ __u8 tag
+ __u8 number of segments
+ {
+ __le32 segment length
+ __le16 segment alignment
+ } * 4
+ __u8 flags
+ reserved (1 byte)
+ __le32 preamble crc
+
+An empty frame has one empty segment. A non-empty frame can have
+between one and four segments, all segments except the last may be
+empty.
+
+If there are less than four segments, unused (trailing) segment
+length and segment alignment fields are zeroed.
+
+### Currently supported flags
+
+ 1. FRAME_EARLY_DATA_COMPRESSED (see :ref:`msgr-post-compression`)
+
+The reserved bytes are zeroed.
+
+The preamble checksum is CRC32-C. It covers everything up to
+itself (28 bytes) and is calculated and verified irrespective of
+the connection mode (i.e. even if the frame is encrypted).
+
+### msgr2.0-crc mode
+
+A msgr2.0-crc frame has the form::
+
+ preamble (32 bytes)
+ {
+ segment payload
+ } * number of segments
+ epilogue (17 bytes)
+
+where epilogue is::
+
+ __u8 late_flags
+ {
+ __le32 segment crc
+ } * 4
+
+late_flags is used for frame abortion. After transmitting the
+preamble and the first segment, the sender can fill the remaining
+segments with zeros and set a flag to indicate that the receiver must
+drop the frame. This allows the sender to avoid extra buffering
+when a frame that is being put on the wire is revoked (i.e. yanked
+out of the messenger): payload buffers can be unpinned and handed
+back to the user immediately, without making a copy or blocking
+until the whole frame is transmitted. Currently this is used only
+by the kernel client, see ceph_msg_revoke().
+
+The segment checksum is CRC32-C. For "used" empty segments, it is
+set to (__le32)-1. For unused (trailing) segments, it is zeroed.
+
+The crcs are calculated just to protect against bit errors.
+No authenticity guarantees are provided, unlike in msgr1 which
+attempted to provide some authenticity guarantee by optionally
+signing segment lengths and crcs with the session key.
+
+Issues:
+
+1. As part of introducing a structure for a generic frame with
+ variable number of segments suitable for both control and
+ message frames, msgr2.0 moved the crc of the first segment of
+ the message frame (ceph_msg_header2) into the epilogue.
+
+ As a result, ceph_msg_header2 can no longer be safely
+ interpreted before the whole frame is read off the wire.
+ This is a regression from msgr1, because in order to scatter
+ the payload directly into user-provided buffers and thus avoid
+ extra buffering and copying when receiving message frames,
+ ceph_msg_header2 must be available in advance -- it stores
+ the transaction id which the user buffers are keyed on.
+ The implementation has to choose between forgoing this
+ optimization or acting on an unverified segment.
+
+2. late_flags is not covered by any crc. Since it stores the
+ abort flag, a single bit flip can result in a completed frame
+ being dropped (causing the sender to hang waiting for a reply)
+ or, worse, in an aborted frame with garbage segment payloads
+ being dispatched.
+
+ This was the case with msgr1 and got carried over to msgr2.0.
+
+### msgr2.1-crc mode
+
+Differences from msgr2.0-crc:
+
+1. The crc of the first segment is stored at the end of the
+ first segment, not in the epilogue. The epilogue stores up to
+ three crcs, not up to four.
+
+ If the first segment is empty, (__le32)-1 crc is not generated.
+
+2. The epilogue is generated only if the frame has more than one
+ segment (i.e. at least one of second to fourth segments is not
+ empty). Rationale: If the frame has only one segment, it cannot
+ be aborted and there are no crcs to store in the epilogue.
+
+3. Unchecksummed late_flags is replaced with late_status which
+ builds in bit error detection by using a 4-bit nibble per flag
+ and two code words that are Hamming Distance = 4 apart (and not
+ all zeros or ones). This comes at the expense of having only
+ one reserved flag, of course.
+
+Some example frames:
+
+* A 0+0+0+0 frame (empty, no epilogue)::
+
+ preamble (32 bytes)
+
+* A 20+0+0+0 frame (no epilogue)::
+
+ preamble (32 bytes)
+ segment1 payload (20 bytes)
+ __le32 segment1 crc
+
+* A 0+70+0+0 frame::
+
+ preamble (32 bytes)
+ segment2 payload (70 bytes)
+ epilogue (13 bytes)
+
+* A 20+70+0+350 frame::
+
+ preamble (32 bytes)
+ segment1 payload (20 bytes)
+ __le32 segment1 crc
+ segment2 payload (70 bytes)
+ segment4 payload (350 bytes)
+ epilogue (13 bytes)
+
+where epilogue is::
+
+ __u8 late_status
+ {
+ __le32 segment crc
+ } * 3
+
+Hello
+-----
+
+* TAG_HELLO: client->server and server->client::
+
+ __u8 entity_type
+ entity_addr_t peer_socket_address
+
+ - We immediately share our entity type and the address of the peer (which can be useful
+ for detecting our effective IP address, especially in the presence of NAT).
+
+
+Authentication
+--------------
+
+* TAG_AUTH_REQUEST: client->server::
+
+ __le32 method; // CEPH_AUTH_{NONE, CEPHX, ...}
+ __le32 num_preferred_modes;
+ list<__le32> mode // CEPH_CON_MODE_*
+ method specific payload
+
+* TAG_AUTH_BAD_METHOD server -> client: reject client-selected auth method::
+
+ __le32 method
+ __le32 negative error result code
+ __le32 num_methods
+ list<__le32> allowed_methods // CEPH_AUTH_{NONE, CEPHX, ...}
+ __le32 num_modes
+ list<__le32> allowed_modes // CEPH_CON_MODE_*
+
+ - Returns the attempted auth method, and error code (-EOPNOTSUPP if
+ the method is unsupported), and the list of allowed authentication
+ methods.
+
+* TAG_AUTH_REPLY_MORE: server->client::
+
+ __le32 len;
+ method specific payload
+
+* TAG_AUTH_REQUEST_MORE: client->server::
+
+ __le32 len;
+ method specific payload
+
+* TAG_AUTH_DONE: (server->client)::
+
+ __le64 global_id
+ __le32 connection mode // CEPH_CON_MODE_*
+ method specific payload
+
+ - The server is the one to decide authentication has completed and what
+ the final connection mode will be.
+
+
+Example of authentication phase interaction when the client uses an
+allowed authentication method:
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | auth request |
+ |---------------->|
+ |<----------------|
+ | auth more|
+ | |
+ |auth more |
+ |---------------->|
+ |<----------------|
+ | auth done|
+
+
+Example of authentication phase interaction when the client uses a forbidden
+authentication method as the first attempt:
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | auth request |
+ |---------------->|
+ |<----------------|
+ | bad method |
+ | |
+ | auth request |
+ |---------------->|
+ |<----------------|
+ | auth more|
+ | |
+ | auth more |
+ |---------------->|
+ |<----------------|
+ | auth done|
+
+
+Post-auth frame format
+----------------------
+
+Depending on the negotiated connection mode from TAG_AUTH_DONE, the
+connection either stays in crc mode or switches to the corresponding
+secure mode (msgr2.0-secure or msgr2.1-secure).
+
+### msgr2.0-secure mode
+
+A msgr2.0-secure frame has the form::
+
+ {
+ preamble (32 bytes)
+ {
+ segment payload
+ zero padding (out to 16 bytes)
+ } * number of segments
+ epilogue (16 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+where epilogue is::
+
+ __u8 late_flags
+ zero padding (15 bytes)
+
+late_flags has the same meaning as in msgr2.0-crc mode.
+
+Each segment and the epilogue are zero padded out to 16 bytes.
+Technically, GCM doesn't require any padding because Counter mode
+(the C in GCM) essentially turns a block cipher into a stream cipher.
+But, if the overall input length is not a multiple of 16 bytes, some
+implicit zero padding would occur internally because GHASH function
+used by GCM for generating auth tags only works on 16-byte blocks.
+
+Issues:
+
+1. The sender encrypts the whole frame using a single nonce
+ and generating a single auth tag. Because segment lengths are
+ stored in the preamble, the receiver has no choice but to decrypt
+ and interpret the preamble without verifying the auth tag -- it
+ can't even tell how much to read off the wire to get the auth tag
+ otherwise! This creates a decryption oracle, which, in conjunction
+ with Counter mode malleability, could lead to recovery of sensitive
+ information.
+
+ This issue extends to the first segment of the message frame as
+ well. As in msgr2.0-crc mode, ceph_msg_header2 cannot be safely
+ interpreted before the whole frame is read off the wire.
+
+2. Deterministic nonce construction with a 4-byte counter field
+ followed by an 8-byte fixed field is used. The initial values are
+ taken from the connection secret -- a random byte string generated
+ during the authentication phase. Because the counter field is
+ only four bytes long, it can wrap and then repeat in under a day,
+ leading to GCM nonce reuse and therefore a potential complete
+ loss of both authenticity and confidentiality for the connection.
+ This was addressed by disconnecting before the counter repeats
+ (CVE-2020-1759).
+
+### msgr2.1-secure mode
+
+Differences from msgr2.0-secure:
+
+1. The preamble, the first segment and the rest of the frame are
+ encrypted separately, using separate nonces and generating
+ separate auth tags. This gets rid of unverified plaintext use
+ and keeps msgr2.1-secure mode close to msgr2.1-crc mode, allowing
+ the implementation to receive message frames in a similar fashion
+ (little to no buffering, same scatter/gather logic, etc).
+
+ In order to reduce the number of en/decryption operations per
+ frame, the preamble is grown by a fixed size inline buffer (48
+ bytes) that the first segment is inlined into, either fully or
+ partially. The preamble auth tag covers both the preamble and the
+ inline buffer, so if the first segment is small enough to be fully
+ inlined, it becomes available after a single decryption operation.
+
+2. As in msgr2.1-crc mode, the epilogue is generated only if the
+ frame has more than one segment. The rationale is even stronger,
+ as it would require an extra en/decryption operation.
+
+3. For consistency with msgr2.1-crc mode, late_flags is replaced
+ with late_status (the built-in bit error detection isn't really
+ needed in secure mode).
+
+4. In accordance with `NIST Recommendation for GCM`_, deterministic
+ nonce construction with a 4-byte fixed field followed by an 8-byte
+ counter field is used. An 8-byte counter field should never repeat
+ but the nonce reuse protection put in place for msgr2.0-secure mode
+ is still there.
+
+ The initial values are the same as in msgr2.0-secure mode.
+
+ .. _`NIST Recommendation for GCM`: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf
+
+As in msgr2.0-secure mode, each segment is zero padded out to
+16 bytes. If the first segment is fully inlined, its padding goes
+to the inline buffer. Otherwise, the padding is on the remainder.
+The corollary to this is that the inline buffer is consumed in
+16-byte chunks.
+
+The unused portion of the inline buffer is zeroed.
+
+Some example frames:
+
+* A 0+0+0+0 frame (empty, nothing to inline, no epilogue)::
+
+ {
+ preamble (32 bytes)
+ zero padding (48 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+* A 20+0+0+0 frame (first segment fully inlined, no epilogue)::
+
+ {
+ preamble (32 bytes)
+ segment1 payload (20 bytes)
+ zero padding (28 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+* A 0+70+0+0 frame (nothing to inline)::
+
+ {
+ preamble (32 bytes)
+ zero padding (48 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment2 payload (70 bytes)
+ zero padding (10 bytes)
+ epilogue (16 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+* A 20+70+0+350 frame (first segment fully inlined)::
+
+ {
+ preamble (32 bytes)
+ segment1 payload (20 bytes)
+ zero padding (28 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment2 payload (70 bytes)
+ zero padding (10 bytes)
+ segment4 payload (350 bytes)
+ zero padding (2 bytes)
+ epilogue (16 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+* A 105+0+0+0 frame (first segment partially inlined, no epilogue)::
+
+ {
+ preamble (32 bytes)
+ segment1 payload (48 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment1 payload remainder (57 bytes)
+ zero padding (7 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+* A 105+70+0+350 frame (first segment partially inlined)::
+
+ {
+ preamble (32 bytes)
+ segment1 payload (48 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment1 payload remainder (57 bytes)
+ zero padding (7 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment2 payload (70 bytes)
+ zero padding (10 bytes)
+ segment4 payload (350 bytes)
+ zero padding (2 bytes)
+ epilogue (16 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+where epilogue is::
+
+ __u8 late_status
+ zero padding (15 bytes)
+
+late_status has the same meaning as in msgr2.1-crc mode.
+
+Compression
+-----------
+Compression handshake is implemented using msgr2 feature-based handshaking.
+In this phase, the client will indicate the server if on-wire-compression can be used for message transmitting,
+in addition to the list of supported compression methods. If on-wire-compression is enabled for both client and server,
+the server will choose a compression method based on client's request and its' own preferences.
+Once the handshake is completed, both peers have setup their compression handlers (if desired).
+
+* TAG_COMPRESSION_REQUEST (client->server): declares compression capabilities and requirements::
+
+ bool is_compress
+ std::vector<uint32_t> preferred_methods
+
+ - if the client identifies that both peers support compression feature, it initiates the handshake.
+ - is_compress flag indicates whether the client's configuration is to use compression.
+ - preferred_methods is a list of compression algorithms that are supported by the client.
+
+* TAG_COMPRESSION_DONE (server->client) : determines on compression settings::
+
+ bool is_compress
+ uint32_t method
+
+ - the server determines whether compression is possible according to the configuration.
+ - if it is possible, it will pick the most prioritized compression method that is also supported by the client.
+ - if none exists, it will determine that session between the peers will be handled without compression.
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | compression request |
+ |----------------------->|
+ |<-----------------------|
+ | compression done |
+
+# msgr2.x-secure mode
+
+Combining compression with encryption introduces security implications.
+Compression will not be possible when using secure mode, unless configured specifically by an admin.
+
+.. _msgr-post-compression:
+
+Post-compression frame format
+-----------------------------
+Depending on the negotiated connection mode from TAG_COMPRESSION_DONE, the connection is able to accept/send compressed frames or process all frames as decompressed.
+
+# msgr2.x-force mode
+
+All subsequent frames that will be sent via the connection will be compressed if compression requirements are met (e.g, the frames size).
+
+For compressed frames, the sending peer will enable the FRAME_EARLY_DATA_COMPRESSED flag, thus allowing the accepting peer to detect it and decompress the frame.
+
+# msgr2.x-none mode
+
+FRAME_EARLY_DATA_COMPRESSED flag will be disabled in preamble.
+
+
+Message flow handshake
+----------------------
+
+In this phase the peers identify each other and (if desired) reconnect to
+an established session.
+
+* TAG_CLIENT_IDENT (client->server): identify ourselves::
+
+ __le32 num_addrs
+ entity_addrvec_t*num_addrs entity addrs
+ entity_addr_t target entity addr
+ __le64 gid (numeric part of osd.0, client.123456, ...)
+ __le64 global_seq
+ __le64 features supported (CEPH_FEATURE_* bitmask)
+ __le64 features required (CEPH_FEATURE_* bitmask)
+ __le64 flags (CEPH_MSG_CONNECT_* bitmask)
+ __le64 cookie
+
+ - client will send first, server will reply with same. if this is a
+ new session, the client and server can proceed to the message exchange.
+ - the target addr is who the client is trying to connect *to*, so
+ that the server side can close the connection if the client is
+ talking to the wrong daemon.
+ - type.gid (entity_name_t) is set here, by combining the type shared in the hello
+ frame with the gid here. this means we don't need it
+ in the header of every message. it also means that we can't send
+ messages "from" other entity_name_t's. the current
+ implementations set this at the top of _send_message etc so this
+ shouldn't break any existing functionality. implementation will
+ likely want to mask this against what the authenticated credential
+ allows.
+ - cookie is the client cookie used to identify a session, and can be used
+ to reconnect to an existing session.
+ - we've dropped the 'protocol_version' field from msgr1
+
+* TAG_IDENT_MISSING_FEATURES (server->client): complain about a TAG_IDENT
+ with too few features::
+
+ __le64 features we require that the peer didn't advertise
+
+* TAG_SERVER_IDENT (server->client): accept client ident and identify server::
+
+ __le32 num_addrs
+ entity_addrvec_t*num_addrs entity addrs
+ __le64 gid (numeric part of osd.0, client.123456, ...)
+ __le64 global_seq
+ __le64 features supported (CEPH_FEATURE_* bitmask)
+ __le64 features required (CEPH_FEATURE_* bitmask)
+ __le64 flags (CEPH_MSG_CONNECT_* bitmask)
+ __le64 cookie
+
+ - The server cookie can be used by the client if it is later disconnected
+ and wants to reconnect and resume the session.
+
+* TAG_RECONNECT (client->server): reconnect to an established session::
+
+ __le32 num_addrs
+ entity_addr_t * num_addrs
+ __le64 client_cookie
+ __le64 server_cookie
+ __le64 global_seq
+ __le64 connect_seq
+ __le64 msg_seq (the last msg seq received)
+
+* TAG_RECONNECT_OK (server->client): acknowledge a reconnect attempt::
+
+ __le64 msg_seq (last msg seq received)
+
+ - once the client receives this, the client can proceed to message exchange.
+ - once the server sends this, the server can proceed to message exchange.
+
+* TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq
+
+* TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq
+
+* TAG_RECONNECT_WAIT (server only): fail reconnect due to connect race.
+
+ - Indicates that the server is already connecting to the client, and
+ that direction should win the race. The client should wait for that
+ connection to complete.
+
+* TAG_RESET_SESSION (server only): ask client to reset session::
+
+ __u8 full
+
+ - full flag indicates whether peer should do a full reset, i.e., drop
+ message queue.
+
+
+Example of failure scenarios:
+
+* First client's client_ident message is lost, and then client reconnects.
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | |
+ c_cookie(a) | client_ident(a) |
+ |-------------X |
+ | |
+ | client_ident(a) |
+ |-------------------->|
+ |<--------------------|
+ | server_ident(b) | s_cookie(b)
+ | |
+ | session established |
+ | |
+
+
+* Server's server_ident message is lost, and then client reconnects.
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | |
+ c_cookie(a) | client_ident(a) |
+ |-------------------->|
+ | X------------|
+ | server_ident(b) | s_cookie(b)
+ | |
+ | |
+ | client_ident(a) |
+ |-------------------->|
+ |<--------------------|
+ | server_ident(c) | s_cookie(c)
+ | |
+ | session established |
+ | |
+
+
+* Server's server_ident message is lost, and then server reconnects.
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | |
+ c_cookie(a) | client_ident(a) |
+ |-------------------->|
+ | X------------|
+ | server_ident(b) | s_cookie(b)
+ | |
+ | |
+ | reconnect(a, b) |
+ |<--------------------|
+ |-------------------->|
+ | reset_session(F) |
+ | |
+ | client_ident(a) | c_cookie(a)
+ |<--------------------|
+ |-------------------->|
+ s_cookie(c) | server_ident(c) |
+ | |
+
+
+* Connection failure after session is established, and then client reconnects.
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | |
+ c_cookie(a) | session established | s_cookie(b)
+ |<------------------->|
+ | X------------|
+ | |
+ | reconnect(a, b) |
+ |-------------------->|
+ |<--------------------|
+ | reconnect_ok |
+ | |
+
+
+* Connection failure after session is established because server reset,
+ and then client reconnects.
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | |
+ c_cookie(a) | session established | s_cookie(b)
+ |<------------------->|
+ | X------------| reset
+ | |
+ | reconnect(a, b) |
+ |-------------------->|
+ |<--------------------|
+ | reset_session(RC*) |
+ | |
+ c_cookie(c) | client_ident(c) |
+ |-------------------->|
+ |<--------------------|
+ | server_ident(d) | s_cookie(d)
+ | |
+
+RC* means that the reset session full flag depends on the policy.resetcheck
+of the connection.
+
+
+* Connection failure after session is established because client reset,
+ and then client reconnects.
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | |
+ c_cookie(a) | session established | s_cookie(b)
+ |<------------------->|
+ reset | X------------|
+ | |
+ c_cookie(c) | client_ident(c) |
+ |-------------------->|
+ |<--------------------| reset if policy.resetcheck
+ | server_ident(d) | s_cookie(d)
+ | |
+
+
+Message exchange
+----------------
+
+Once a session is established, we can exchange messages.
+
+* TAG_MSG: a message::
+
+ ceph_msg_header2
+ front
+ middle
+ data_pre_padding
+ data
+
+ - The ceph_msg_header2 is modified from ceph_msg_header:
+ * include an ack_seq. This avoids the need for a TAG_ACK
+ message most of the time.
+ * remove the src field, which we now get from the message flow
+ handshake (TAG_IDENT).
+ * specifies the data_pre_padding length, which can be used to
+ adjust the alignment of the data payload. (NOTE: is this is
+ useful?)
+
+* TAG_ACK: acknowledge receipt of message(s)::
+
+ __le64 seq
+
+ - This is only used for stateful sessions.
+
+* TAG_KEEPALIVE2: check for connection liveness::
+
+ ceph_timespec stamp
+
+ - Time stamp is local to sender.
+
+* TAG_KEEPALIVE2_ACK: reply to a keepalive2::
+
+ ceph_timestamp stamp
+
+ - Time stamp is from the TAG_KEEPALIVE2 we are responding to.
+
+* TAG_CLOSE: terminate a connection
+
+ Indicates that a connection should be terminated. This is equivalent
+ to a hangup or reset (i.e., should trigger ms_handle_reset). It
+ isn't strictly necessary or useful as we could just disconnect the
+ TCP connection.
+
+
+Example of protocol interaction (WIP)
+_____________________________________
+
+
+.. ditaa::
+
+ +---------+ +--------+
+ | Client | | Server |
+ +---------+ +--------+
+ | send banner |
+ |----+ +------|
+ | | | |
+ | +-------+----->|
+ | send banner| |
+ |<-----------+ |
+ | |
+ | send new stream |
+ |------------------>|
+ | auth request |
+ |------------------>|
+ |<------------------|
+ | bad method |
+ | |
+ | auth request |
+ |------------------>|
+ |<------------------|
+ | auth more |
+ | |
+ | auth more |
+ |------------------>|
+ |<------------------|
+ | auth done |
+ | |
+
+
+.. graphviz::
+ :caption: client side state machine
+
+ digraph lossy_client {
+ node [shape = doublecircle]; "start_connect" "closed";
+ node [shape = oval];
+ start_connect -> banner_connecting [label = "<connected>"];
+ subgraph hello_banner {
+ banner_connecting -> hello_connecting [label = "banner exchange"];
+ hello_connecting -> banner_connecting [label = "hello exchange"];
+ label = "hello banner exchange";
+ color = blue;
+ }
+ banner_connecting -> auth_connecting [label = "<exchange done>"];
+ auth_connecting -> auth_connecting [label = "auth reply more"];
+ auth_connecting -> auth_connecting [label = "auth bad method"];
+ auth_connecting -> auth_connecting_sign [label = "auth done"];
+ auth_connecting_sign -> session_connecting [label = "auth signature"];
+ session_connecting -> wait [label = "wait"];
+ wait -> start_connect [label = "<backoff>"];
+ session_connecting -> closed [label = "ident missing features"];
+ session_connecting -> ready [label = "server ident", tooltip = "set peer_name, peer_addr and connection features"];
+ ready -> ready [label = "keep alive"];
+ }
+
+.. graphviz::
+ :caption: server side state machine
+
+ digraph lossy_server {
+ node [shape = doublecircle]; "start_accept" "closed";
+ node [shape = oval];
+ start_accept -> banner_accepting [label = "<accepted>"];
+ subgraph hello_banner {
+ banner_accepting -> hello_accepting [label = "banner exchange"];
+ hello_accepting -> banner_accepting [label = "hello exchange"];
+ label = "hello banner exchange";
+ color = blue;
+ };
+ banner_accepting -> auth_accepting [label = "<exchange done>"];
+ auth_accepting -> auth_accepting_more [label = "auth_request => 0"];
+ auth_accepting -> auth_accepting_sign [label = "auth_request => 1"];
+ auth_accepting_more -> auth_accepting_more [label = "auth_request => 0"];
+ auth_accepting_more -> auth_accepting_sign [label = "auth_request => 1"];
+ auth_accepting_more -> standby [label = "auth_request => EBUSY"];
+ auth_accepting_more -> auth_accepting_more [label = "auth_request => *"];
+ auth_accepting -> standby [label = "auth_request => EBUSY"];
+ auth_accepting -> auth_accepting [label = "send <auth bad method>"];
+ auth_accepting_sign -> session_accepting [label = "auth signature"];
+ session_accepting -> session_accepting [label = "reconnect"];
+ session_accepting -> closed [label = "ident missing features"];
+ session_accepting -> ready [label = "client ident", tooltip = "set connection features"];
+ ready -> ready [label = "keep alive"];
+ }