summaryrefslogtreecommitdiffstats
path: root/doc/dev/osd_internals/snaps.rst
diff options
context:
space:
mode:
Diffstat (limited to 'doc/dev/osd_internals/snaps.rst')
-rw-r--r--doc/dev/osd_internals/snaps.rst128
1 files changed, 128 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/snaps.rst b/doc/dev/osd_internals/snaps.rst
new file mode 100644
index 000000000..5ebd0884a
--- /dev/null
+++ b/doc/dev/osd_internals/snaps.rst
@@ -0,0 +1,128 @@
+======
+Snaps
+======
+
+Overview
+--------
+Rados supports two related snapshotting mechanisms:
+
+ 1. *pool snaps*: snapshots are implicitly applied to all objects
+ in a pool
+ 2. *self managed snaps*: the user must provide the current *SnapContext*
+ on each write.
+
+These two are mutually exclusive, only one or the other can be used on
+a particular pool.
+
+The *SnapContext* is the set of snapshots currently defined for an object
+as well as the most recent snapshot (the *seq*) requested from the mon for
+sequencing purposes (a *SnapContext* with a newer *seq* is considered to
+be more recent).
+
+The difference between *pool snaps* and *self managed snaps* from the
+OSD's point of view lies in whether the *SnapContext* comes to the OSD
+via the client's MOSDOp or via the most recent OSDMap.
+
+See OSD::make_writeable
+
+Ondisk Structures
+-----------------
+Each object has in the PG collection a *head* object (or *snapdir*, which we
+will come to shortly) and possibly a set of *clone* objects.
+Each hobject_t has a snap field. For the *head* (the only writeable version
+of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the
+snap field is set to the *seq* of the *SnapContext* at their creation.
+When the OSD services a write, it first checks whether the most recent
+*clone* is tagged with a snapid prior to the most recent snap represented
+in the *SnapContext*. If so, at least one snapshot has occurred between
+the time of the write and the time of the last clone. Therefore, prior
+to performing the mutation, the OSD creates a new clone for servicing
+reads on snaps between the snapid of the last clone and the most recent
+snapid.
+
+The *head* object contains a *SnapSet* encoded in an attribute, which tracks
+
+ 1. The full set of snaps defined for the object
+ 2. The full set of clones which currently exist
+ 3. Overlapping intervals between clones for tracking space usage
+ 4. Clone size
+
+If the *head* is deleted while there are still clones, a *snapdir* object
+is created instead to house the *SnapSet*.
+
+Additionally, the *object_info_t* on each clone includes a vector of snaps
+for which clone is defined.
+
+Snap Removal
+------------
+To remove a snapshot, a request is made to the *Monitor* cluster to
+add the snapshot id to the list of purged snaps (or to remove it from
+the set of pool snaps in the case of *pool snaps*). In either case,
+the *PG* adds the snap to its *snap_trimq* for trimming.
+
+A clone can be removed when all of its snaps have been removed. In
+order to determine which clones might need to be removed upon snap
+removal, we maintain a mapping from snap to *hobject_t* using the
+*SnapMapper*.
+
+See PrimaryLogPG::SnapTrimmer, SnapMapper
+
+This trimming is performed asynchronously by the snap_trim_wq while the
+PG is clean and not scrubbing.
+
+ #. The next snap in PG::snap_trimq is selected for trimming
+ #. We determine the next object for trimming out of PG::snap_mapper.
+ For each object, we create a log entry and repop updating the
+ object info and the snap set (including adjusting the overlaps).
+ If the object is a clone which no longer belongs to any live snapshots,
+ it is removed here. (See PrimaryLogPG::trim_object() when new_snaps
+ is empty.)
+ #. We also locally update our *SnapMapper* instance with the object's
+ new snaps.
+ #. The log entry containing the modification of the object also
+ contains the new set of snaps, which the replica uses to update
+ its own *SnapMapper* instance.
+ #. The primary shares the info with the replica, which persists
+ the new set of purged_snaps along with the rest of the info.
+
+
+
+Recovery
+--------
+Because the trim operations are implemented using repops and log entries,
+normal PG peering and recovery maintain the snap trimmer operations with
+the caveat that push and removal operations need to update the local
+*SnapMapper* instance. If the purged_snaps update is lost, we merely
+retrim a now empty snap.
+
+SnapMapper
+----------
+*SnapMapper* is implemented on top of map_cacher<string, bufferlist>,
+which provides an interface over a backing store such as the file system
+with async transactions. While transactions are incomplete, the map_cacher
+instance buffers unstable keys allowing consistent access without having
+to flush the filestore. *SnapMapper* provides two mappings:
+
+ 1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone
+ object
+ 2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot
+ as one of its snaps
+
+Assumption: there are lots of hobjects and relatively few snaps. The
+first encoding has a stringification of the object as the key and an
+encoding of the set of snaps as a value. The second mapping, because there
+might be many hobjects for a single snap, is stored as a collection of keys
+of the form stringify(snap)_stringify(object) such that stringify(snap)
+is constant length. These keys have a bufferlist encoding
+pair<snapid, hobject_t> as a value. Thus, creating or trimming a single
+object does not involve reading all objects for any snap. Additionally,
+upon construction, the *SnapMapper* is provided with a mask for filtering
+the objects in the single SnapMapper keyspace belonging to that PG.
+
+Split
+-----
+The snapid_t -> hobject_t key entries are arranged such that for any PG,
+up to 8 prefixes need to be checked to determine all hobjects in a particular
+snap for a particular PG. Upon split, the prefixes to check on the parent
+are adjusted such that only the objects remaining in the PG will be visible.
+The children will immediately have the correct mapping.