diff --git a/doc/dev/peering.rst b/doc/dev/peering.rst
new file mode 100644
index 000000000..3960e14ca
--- /dev/null
+++ b/doc/dev/peering.rst
@@ -0,0 +1,270 @@
+======================
+Peering
+======================
+
+Concepts
+--------
+
+*Peering*
+ the process of bringing all of the OSDs that store
+ a Placement Group (PG) into agreement about the state
+ of all of the objects (and their metadata) in that PG.
+ Note that agreeing on the state does not mean that
+ they all have the latest contents.
+
+*Acting set*
+ the ordered list of OSDs who are (or were as of some epoch)
+ responsible for a particular PG.
+
+*Up set*
+ the ordered list of OSDs responsible for a particular PG for
+ a particular epoch according to CRUSH. Normally this
+ is the same as the *acting set*, except when the *acting set* has been
+ explicitly overridden via *PG temp* in the OSDMap.
+
+*PG temp*
+ a temporary placement group acting set used while backfilling the
+ primary OSD. Say acting is [0,1,2] and we are
+ active+clean. Something happens and acting is now [3,1,2]. osd.3 is
+ empty and can't serve reads even though it is the primary. osd.3 will
+ see that and request a *PG temp* of [1,2,3] from the monitors using a
+ MOSDPGTemp message, so that osd.1 temporarily becomes the
+ primary. osd.1 will select osd.3 as a backfill peer and continue to
+ serve reads and writes while osd.3 is backfilled. When backfilling
+ is complete, the *PG temp* is discarded, the acting set changes back
+ to [3,1,2], and osd.3 becomes the primary.
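+
+ A minimal sketch of this override, in Python rather than Ceph's C++
+ (the ``pg_temp`` table and the pgid ``"1.7f"`` are illustrative, not
+ a real API)::
+
+   def acting_set(pgid, up_set, pg_temp):
+       """Return the acting set for a PG: the *PG temp* override if
+       one is present in the OSDMap, otherwise the CRUSH up set."""
+       return pg_temp.get(pgid, up_set)
+
+   # osd.3 just replaced osd.0 but holds no data yet, so it asked the
+   # monitors (via MOSDPGTemp) for a temporary acting set.
+   up = [3, 1, 2]                      # what CRUSH says
+   pg_temp = {"1.7f": [1, 2, 3]}       # osd.1 acts as primary meanwhile
+   print(acting_set("1.7f", up, pg_temp))   # -> [1, 2, 3]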
+
+*current interval* or *past interval*
+ a sequence of OSD map epochs during which the *acting set* and *up
+ set* for a particular PG do not change.
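+
+ The interval boundaries can be computed by walking the map history;
+ here is a hypothetical sketch (the actual Ceph code is C++)::
+
+   def split_into_intervals(history):
+       """Group consecutive epochs whose up and acting sets are
+       unchanged.  ``history`` is a list of (epoch, up, acting)."""
+       intervals = []
+       for epoch, up, acting in history:
+           last = intervals[-1] if intervals else None
+           if last and last["up"] == up and last["acting"] == acting:
+               last["last"] = epoch            # interval continues
+           else:
+               intervals.append({"first": epoch, "last": epoch,
+                                 "up": up, "acting": acting})
+       return intervals
+
+   history = [(10, [0, 1, 2], [0, 1, 2]),
+              (11, [0, 1, 2], [0, 1, 2]),
+              (12, [3, 1, 2], [1, 2, 3])]  # acting changed: new interval
+   print(split_into_intervals(history))    # intervals 10-11 and 12-12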
+
+*primary*
+ the (by convention first) member of the *acting set*,
+ which is responsible for coordinating *peering*, and is
+ the only OSD that will accept client-initiated
+ writes to objects in a placement group.
+
+*replica*
+ a non-primary OSD in the *acting set* for a placement group
+ (and who has been recognized as such and *activated* by the primary).
+
+*stray*
+ an OSD who is not a member of the current *acting set*, but
+ has not yet been told that it can delete its copies of a
+ particular placement group.
+
+*recovery*
+ ensuring that copies of all of the objects in a PG
+ are on all of the OSDs in the *acting set*. Once
+ *peering* has been performed, the primary can start
+ accepting write operations, and *recovery* can proceed
+ in the background.
+
+*PG info*
+ basic metadata about the PG's creation epoch, the version
+ for the most recent write to the PG, *last epoch started*, *last
+ epoch clean*, and the beginning of the *current interval*. Any
+ inter-OSD communication about PGs includes the *PG info*, such that
+ any OSD that knows a PG exists (or once existed) also has a lower
+ bound on *last epoch clean* or *last epoch started*.
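+
+ As a rough illustration, the fields described above could be modeled
+ like this (names are paraphrases, not the real C++ struct)::
+
+   from dataclasses import dataclass
+
+   @dataclass
+   class PGInfo:
+       created: int              # epoch in which the PG was created
+       last_update: int          # version of the most recent write
+       last_epoch_started: int   # last epoch peering completed
+       last_epoch_clean: int     # last epoch recovery completed
+       same_interval_since: int  # first epoch of the current interval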
+
+*PG log*
+ a list of recent updates made to objects in a PG.
+ Note that these logs can be truncated after all OSDs
+ in the *acting set* have acknowledged up to a certain
+ point.
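+
+ A sketch of that truncation rule, assuming hypothetical per-OSD
+ acknowledgment versions::
+
+   def trim_log(log, acked_through):
+       """Entries every acting-set member has acknowledged can go;
+       only the tail past the slowest peer must be kept."""
+       trim_to = min(acked_through.values())
+       return [e for e in log if e["version"] > trim_to]
+
+   log = [{"version": 5, "object": "a"},
+          {"version": 6, "object": "b"},
+          {"version": 7, "object": "a"}]
+   acked_through = {"osd.0": 7, "osd.1": 6, "osd.2": 6}
+   print(trim_log(log, acked_through))  # only the version-7 entry stays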
+
+*missing set*
+ Each OSD notes update log entries; if they imply updates to
+ the contents of an object that the OSD does not yet have, it adds
+ that object to a list of needed updates. This list is called the
+ *missing set* for that <OSD,PG>.
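+
+ A sketch of that bookkeeping (illustrative data layout, not Ceph's)::
+
+   def missing_set(pg_log, stored_version):
+       """Objects whose newest logged update is newer than what this
+       OSD stores; maps object name -> needed version."""
+       missing = {}
+       for entry in pg_log:             # ordered oldest to newest
+           obj, version = entry["object"], entry["version"]
+           if stored_version.get(obj, 0) < version:
+               missing[obj] = version
+       return missing
+
+   pg_log = [{"version": 6, "object": "b"},
+             {"version": 7, "object": "a"}]
+   stored = {"a": 7}                    # "b" was never written here
+   print(missing_set(pg_log, stored))   # -> {"b": 6}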
+
+*authoritative history*
+ a complete and fully ordered set of operations that, if
+ performed, would bring an OSD's copy of a placement group
+ up to date.
+
+*epoch*
+ a (monotonically increasing) OSD map version number
+
+*last epoch started*
+ the last epoch at which all nodes in the *acting set*
+ for a particular placement group agreed on an
+ *authoritative history*. At this point, *peering* is
+ deemed to have been successful.
+
+*up_thru*
+ before a primary can successfully complete the *peering* process,
+ it must inform a monitor that it is alive through the current
+ OSD map epoch by having the monitor set its *up_thru* in the OSD
+ map. This helps peering ignore previous *acting sets* for which
+ peering never completed after certain sequences of failures, such as
+ the second interval below (a sketch of this check follows the list):
+
+ - *acting set* = [A,B]
+ - *acting set* = [A]
+ - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
+ - *acting set* = [B] (B restarts, A does not)
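+
+ A sketch of the check, assuming per-OSD *up_thru* values taken from
+ the OSD map (illustrative, not the real implementation)::
+
+   def maybe_completed_peering(interval, up_thru):
+       """An old interval can be ignored unless its primary's up_thru
+       reached the interval's first epoch, i.e. peering could have
+       completed (and writes been accepted) before the next change."""
+       if not interval["acting"]:
+           return False                 # nobody acting: nothing written
+       primary = interval["acting"][0]
+       return up_thru.get(primary, 0) >= interval["first"]
+
+   # The [A] interval above: A died before the monitor recorded its
+   # up_thru, so nothing was acknowledged there and B alone may peer.
+   print(maybe_completed_peering({"first": 20, "acting": ["A"]},
+                                 {"A": 15}))   # False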
+
+*last epoch clean*
+ the last epoch at which all nodes in the *acting set*
+ for a particular placement group were completely
+ up to date (both PG logs and object contents).
+ At this point, *recovery* is deemed to have been
+ completed.
+
+Description of the Peering Process
+----------------------------------
+
+The *Golden Rule* is that no write operation to any PG
+is acknowledged to a client until it has been persisted
+by all members of the *acting set* for that PG. This means
+that if we can communicate with at least one member of
+each *acting set* since the last successful *peering*, someone
+will have a record of every (acknowledged) operation
+since the last successful *peering*.
+It follows that the current primary should be able to construct and
+disseminate a new *authoritative history*.
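+
+The rule itself is easy to state in code; a toy version (illustrative
+names; the real primary tracks this per operation)::
+
+   def can_ack(persisted_on, acting):
+       """Acknowledge a write to the client only once every member of
+       the acting set has persisted it."""
+       return all(osd in persisted_on for osd in acting)
+
+   acting = [3, 1, 2]
+   print(can_ack({3, 1}, acting))      # False: osd.2 not yet durable
+   print(can_ack({3, 1, 2}, acting))   # True: safe to ack the client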
+
+It is also important to appreciate the role of the OSD map
+(list of all known OSDs and their states, as well as some
+information about the placement groups) in the *peering*
+process:
+
+ When OSDs go up or down (or get added or removed)
+ this has the potential to affect the *acting sets*
+ of many placement groups.
+
+ Before a primary successfully completes the *peering*
+ process, the OSD map must reflect that the OSD was alive
+ and well as of the first epoch in the *current interval*.
+
+ Changes can only be made after successful *peering*.
+
+Thus, a new primary can use the latest OSD map along with a recent
+history of past maps to generate a set of *past intervals* to
+determine which OSDs must be consulted before we can successfully
+*peer*. The set of past intervals is bounded by *last epoch started*,
+the most recent *past interval* for which we know *peering* completed.
+The process by which an OSD discovers that a PG exists in the first
+place is the exchange of *PG info* messages, so the OSD always has some
+lower bound on *last epoch started*.
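+
+A sketch of that pruning, reusing the interval records from the earlier
+example (illustrative, not Ceph's actual structures)::
+
+   def intervals_to_probe(past_intervals, last_epoch_started):
+       """Keep only intervals that might hold acknowledged writes we
+       lack: those ending at or after the last successful peering."""
+       return [i for i in past_intervals
+               if i["last"] >= last_epoch_started]
+
+   past = [{"first": 10, "last": 14, "acting": [0, 1]},
+           {"first": 15, "last": 19, "acting": [1, 2]},
+           {"first": 20, "last": 24, "acting": [2, 3]}]
+   print(intervals_to_probe(past, 15))  # drops the 10-14 interval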
+
+The high level process (condensed in the sketch after this list) is for
+the current PG primary to:
+
+ 1. get a recent OSD map (to identify the members of all of the
+ interesting *acting sets*, and to confirm that we are still the
+ primary).
+
+ #. generate a list of *past intervals* since *last epoch started*.
+ Consider the subset of those for which *up_thru* was greater than
+ the first interval epoch, according to the last interval epoch's
+ OSD map; that is, the subset for which *peering* could have
+ completed before the *acting set* changed to another set of OSDs.
+
+ Successful *peering* will require that we be able to contact at
+ least one OSD from each of *past interval*'s *acting set*.
+
+ #. ask every node in that list for its *PG info*, which includes the most
+ recent write made to the PG, and a value for *last epoch started*. If
+ we learn about a *last epoch started* that is newer than our own, we can
+ prune older *past intervals* and reduce the peer OSDs we need to contact.
+
+ #. if anyone else has (in its PG log) operations that I do not have,
+ instruct them to send me the missing log entries so that the primary's
+ *PG log* is up to date (i.e., includes the newest write).
+
+ #. for each member of the current *acting set*:
+
+ a. ask it for copies of all PG log entries since *last epoch started*
+ so that I can verify that they agree with mine (or know what
+ objects I will be telling it to delete).
+
+ If the cluster failed before an operation was persisted by all
+ members of the *acting set*, and the subsequent *peering* did not
+ remember that operation, and a node that did remember that
+ operation later rejoined, its logs would record a different
+ (divergent) history than the *authoritative history* that was
+ reconstructed in the *peering* after the failure.
+
+ Since the *divergent* events were not recorded in other logs
+ from that *acting set*, they were not acknowledged to the client,
+ and there is no harm in discarding them (so that all OSDs agree
+ on the *authoritative history*). But, we will have to instruct
+ any OSD that stores data from a divergent update to delete the
+ affected (and now deemed to be apocryphal) objects.
+
+ #. ask it for its *missing set* (object updates recorded
+ in its PG log, but for which it does not have the new data).
+ This is the list of objects that must be fully replicated
+ before we can accept writes.
+
+ #. at this point, the primary's PG log contains an *authoritative history* of
+ the placement group, and the OSD now has sufficient
+ information to bring any other OSD in the *acting set* up to date.
+
+ #. if the primary's *up_thru* value in the current OSD map is not greater than
+ or equal to the first epoch in the *current interval*, send a request to the
+ monitor to update it, and wait until we receive an updated OSD map that
+ reflects the change.
+
+ #. for each member of the current *acting set*:
+
+ a. send them log updates to bring their PG logs into agreement with
+ my own (*authoritative history*) ... which may involve deciding
+ to delete divergent objects.
+
+ #. await acknowledgment that they have persisted the PG log entries.
+
+ #. at this point all OSDs in the *acting set* agree on all of the meta-data,
+ and would (in any future *peering*) return identical accounts of all
+ updates.
+
+ a. start accepting client write operations (because we have unanimous
+ agreement on the state of the objects into which those updates are
+ being accepted). Note, however, that if a client tries to write to an
+ object, that object will be promoted to the front of the recovery queue,
+ and the write will be applied after it is fully replicated to the current
+ *acting set*.
+
+ #. update the *last epoch started* value in our local *PG info*, and instruct
+ other *acting set* OSDs to do the same.
+
+ #. start pulling object data updates that other OSDs have, but I do not. We may
+ need to query OSDs from additional *past intervals* prior to *last epoch started*
+ (the last time *peering* completed) and following *last epoch clean* (the last epoch that
+ recovery completed) in order to find copies of all objects.
+
+ #. start pushing object data updates to other OSDs that do not yet have them.
+
+ We push these updates from the primary (rather than having the replicas
+ pull them) because this allows the primary to ensure that a replica has
+ the current contents before sending it an update write. It also makes
+ it possible for a single read (from the primary) to be used to write
+ the data to multiple replicas. If each replica did its own pulls,
+ the data might have to be read multiple times.
+
+ #. once all replicas store copies of all objects (that
+ existed prior to the start of this epoch) we can update *last
+ epoch clean* in the *PG info*, and we can dismiss all of the
+ *stray* replicas, allowing them to delete their copies of objects
+ for which they are no longer in the *acting set*.
+
+ We could not dismiss the *strays* prior to this because it was possible
+ that one of those *strays* might hold the sole surviving copy of an
+ old object (all of whose copies disappeared before they could be
+ replicated on members of the current *acting set*).
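+
+The following toy sketch condenses steps 3 through 6 above. It is not
+Ceph code: it ignores *up_thru*, *PG temp*, and all monitor and OSDMap
+traffic, and real authoritative-log selection is subtler than "newest
+last_update wins"::
+
+   def peer(pg_logs, last_updates, acting):
+       """pg_logs: osd -> ordered list of (version, object) entries;
+       last_updates: osd -> newest version that OSD has seen."""
+       # steps 3-4: adopt the log of the peer with the newest write
+       # as the authoritative history
+       best = max(acting, key=lambda osd: last_updates[osd])
+       authoritative = pg_logs[best]
+       # steps 5-6: per acting-set member, find divergent entries (to
+       # delete) and missing entries (to pull before serving reads)
+       plans = {}
+       for osd in acting:
+           log = pg_logs[osd]
+           plans[osd] = {
+               "delete": [e for e in log if e not in authoritative],
+               "pull": [e for e in authoritative if e not in log],
+           }
+       # steps 7-12 would persist the log on every peer, bump up_thru
+       # and *last epoch started*, then begin accepting client writes
+       return authoritative, plans
+
+   logs = {0: [(5, "a"), (6, "b")],
+           1: [(5, "a"), (6, "b"), (7, "c")],
+           2: [(5, "a"), (6, "x")]}     # (6, "x") is a divergent write
+   history, plans = peer(logs, {0: 6, 1: 7, 2: 6}, acting=[1, 0, 2])
+   print(plans[2])  # delete (6, "x"); pull (6, "b") and (7, "c")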
+
+Generate a State Model
+----------------------
+
+Use the `gen_state_diagram.py <https://github.com/ceph/ceph/blob/master/doc/scripts/gen_state_diagram.py>`_ script to generate a copy of the latest peering state model::
+
+ $ git clone https://github.com/ceph/ceph.git
+ $ cd ceph
+ $ cat src/osd/PeeringState.h src/osd/PeeringState.cc | doc/scripts/gen_state_diagram.py > doc/dev/peering_graph.generated.dot
+ $ sed -i 's/7,7/1080,1080/' doc/dev/peering_graph.generated.dot
+ $ dot -Tsvg doc/dev/peering_graph.generated.dot > doc/dev/peering_graph.generated.svg
+
+Sample state model:
+
+.. graphviz:: peering_graph.generated.dot