author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-23 16:45:13 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-23 16:45:13 +0000
commit: 389020e14594e4894e28d1eb9103c210b142509e
tree: 2ba734cdd7a243f46dda7c3d0cc88c2293d9699f /doc/dev/osd_internals
parent: Adding upstream version 18.2.2.
Adding upstream version 18.2.3. [upstream/18.2.3]
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/dev/osd_internals')
-rw-r--r-- doc/dev/osd_internals/manifest.rst | 2
-rw-r--r-- doc/dev/osd_internals/snaps.rst | 121
2 files changed, 118 insertions, 5 deletions
diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst
index 7be4350ea..43c23fa71 100644
--- a/doc/dev/osd_internals/manifest.rst
+++ b/doc/dev/osd_internals/manifest.rst
@@ -218,6 +218,8 @@ we may want to exploit.
The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.
+.. _osd-make-writeable:
+
An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
diff --git a/doc/dev/osd_internals/snaps.rst b/doc/dev/osd_internals/snaps.rst
index 5ebd0884a..736d0add5 100644
--- a/doc/dev/osd_internals/snaps.rst
+++ b/doc/dev/osd_internals/snaps.rst
@@ -23,12 +23,11 @@ The difference between *pool snaps* and *self managed snaps* from the
OSD's point of view lies in whether the *SnapContext* comes to the OSD
via the client's MOSDOp or via the most recent OSDMap.
-See OSD::make_writeable
+See :ref:`manifest.rst <osd-make-writeable>` for more information.
Ondisk Structures
-----------------
-Each object has in the PG collection a *head* object (or *snapdir*, which we
-will come to shortly) and possibly a set of *clone* objects.
+Each object has in the PG collection a *head* object and possibly a set of *clone* objects.
Each hobject_t has a snap field. For the *head* (the only writeable version
of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the
snap field is set to the *seq* of the *SnapContext* at their creation.
@@ -47,8 +46,12 @@ The *head* object contains a *SnapSet* encoded in an attribute, which tracks
3. Overlapping intervals between clones for tracking space usage
4. Clone size
-If the *head* is deleted while there are still clones, a *snapdir* object
-is created instead to house the *SnapSet*.
+The *head* can't be deleted while there are still clones. Instead, it is
+marked as whiteout (``object_info_t::FLAG_WHITEOUT``) in order to house the
+*SnapSet* contained in it.
+In that case, the *head* object no longer logically exists.
+
+See: should_whiteout()
Additionally, the *object_info_t* on each clone includes a vector of snaps
for which clone is defined.
@@ -126,3 +129,111 @@ up to 8 prefixes need to be checked to determine all hobjects in a particular
snap for a particular PG. Upon split, the prefixes to check on the parent
are adjusted such that only the objects remaining in the PG will be visible.
The children will immediately have the correct mapping.
+
+clone_overlap
+-------------
+Each SnapSet attached to the *head* object contains the overlapping intervals
+between clone objects for optimizing space.
+The overlapping intervals are stored in the ``clone_overlap`` map. Each element in the
+map stores a snap ID and the corresponding overlap with the next newer clone (or with the *head*, for the newest clone).
+
+See the following example using a 4 byte object:
+
++--------+---------+
+| object | content |
++========+=========+
+| head   | [AAAA]  |
++--------+---------+
+
+listsnaps output is as follows:
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| head    | -     | 4    |         |
++---------+-------+------+---------+
+
+After taking a snapshot (ID 1) and re-writing the first 2 bytes of the object,
+the clone created will overlap with the new *head* object in its last 2 bytes.
+
++------------+---------+
+| object     | content |
++============+=========+
+| head       | [BBAA]  |
++------------+---------+
+| clone ID 1 | [AAAA]  |
++------------+---------+
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| 1       | 1     | 4    | [2~2]   |
++---------+-------+------+---------+
+| head    | -     | 4    |         |
++---------+-------+------+---------+
+
+After taking another snapshot (ID 2) and re-writing only the first byte of the object,
+the clone created (ID 2) will overlap with the new *head* object in its last 3 bytes,
+while the oldest clone (ID 1) will still overlap with the newest clone in its last 2 bytes.
+
++------------+---------+
+| object     | content |
++============+=========+
+| head       | [CBAA]  |
++------------+---------+
+| clone ID 2 | [BBAA]  |
++------------+---------+
+| clone ID 1 | [AAAA]  |
++------------+---------+
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| 1       | 1     | 4    | [2~2]   |
++---------+-------+------+---------+
+| 2       | 2     | 4    | [1~3]   |
++---------+-------+------+---------+
+| head    | -     | 4    |         |
++---------+-------+------+---------+
+
+If the *head* object is then completely re-written (all 4 bytes),
+the only overlap that remains is the one between the two clones.
+
++------------+---------+
+| object     | content |
++============+=========+
+| head       | [DDDD]  |
++------------+---------+
+| clone ID 2 | [BBAA]  |
++------------+---------+
+| clone ID 1 | [AAAA]  |
++------------+---------+
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| 1       | 1     | 4    | [2~2]   |
++---------+-------+------+---------+
+| 2       | 2     | 4    |         |
++---------+-------+------+---------+
+| head    | -     | 4    |         |
++---------+-------+------+---------+
+
+Lastly, after the last snap (ID 2) is removed and snaptrim kicks in,
+no overlapping intervals will remain:
+
++------------+---------+
+| object     | content |
++============+=========+
+| head       | [DDDD]  |
++------------+---------+
+| clone ID 1 | [AAAA]  |
++------------+---------+
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| 1       | 1     | 4    |         |
++---------+-------+------+---------+
+| head    | -     | 4    |         |
++---------+-------+------+---------+
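
The ``clone_overlap`` walkthrough added to snaps.rst can be replayed with a small Python model. This is a minimal sketch and not Ceph code: the ``SnapObject`` class, its method names, and the per-byte offset sets are hypothetical stand-ins for the SnapSet and its interval sets. The trim rule used here (intersecting the removed clone's overlap into the next older clone's overlap) is an assumption chosen because it reproduces the final table of the example. Overlaps are printed in the same ``offset~length`` form as the tables, with an empty string for an empty overlap.

```python
def format_intervals(offsets):
    """Render a set of byte offsets as offset~length runs, e.g. {2, 3} -> "[2~2]"."""
    runs = []
    for off in sorted(offsets):
        if runs and runs[-1][0] + runs[-1][1] == off:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([off, 1])     # start a new run
    if not runs:
        return ""                     # empty overlap, shown as a blank table cell
    return "[" + ",".join(f"{start}~{length}" for start, length in runs) + "]"


class SnapObject:
    """Toy model of a head object and its SnapSet (hypothetical, not Ceph's API)."""

    def __init__(self, data):
        self.head = bytearray(data)
        self.seq = 0                  # newest snap ID this object has seen
        self.clones = {}              # clone ID -> frozen content
        self.clone_overlap = {}       # clone ID -> offsets shared with the next newer
                                      # clone (or with head, for the newest clone)

    def write(self, offset, data, snap_seq):
        if snap_seq > self.seq:
            # A snapshot was taken since the last write: preserve the current
            # head as a clone that initially overlaps head completely.
            self.clones[snap_seq] = bytes(self.head)
            self.clone_overlap[snap_seq] = set(range(len(self.head)))
            self.seq = snap_seq
        if self.clones:
            # Only the newest clone is compared against head, so only its
            # overlap shrinks by the range being overwritten.
            newest = max(self.clones)
            self.clone_overlap[newest] -= set(range(offset, offset + len(data)))
        self.head[offset:offset + len(data)] = data

    def trim(self, clone_id):
        # Assumed trim rule: the next older clone now neighbours whatever the
        # removed clone neighboured, so it keeps only the bytes shared by both.
        older = [c for c in self.clones if c < clone_id]
        if older:
            self.clone_overlap[max(older)] &= self.clone_overlap[clone_id]
        del self.clones[clone_id]
        del self.clone_overlap[clone_id]

    def listsnaps(self):
        return {c: format_intervals(self.clone_overlap[c]) for c in sorted(self.clones)}


obj = SnapObject(b"AAAA")
obj.write(0, b"BB", snap_seq=1)
print(obj.listsnaps())    # {1: '[2~2]'}
obj.write(0, b"C", snap_seq=2)
print(obj.listsnaps())    # {1: '[2~2]', 2: '[1~3]'}
obj.write(0, b"DDDD", snap_seq=2)
print(obj.listsnaps())    # {1: '[2~2]', 2: ''}
obj.trim(2)
print(obj.listsnaps())    # {1: ''}
```

Tracking individual byte offsets keeps the model short; a real implementation would store ranges rather than one entry per byte.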