author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
commit: e6918187568dbd01842d8d1d2c808ce16a894239
tree: 64f88b554b444a49f656b6c656111a145cbbaa28 /doc/dev/osd_internals/last_epoch_started.rst
parent: Initial commit.
Adding upstream version 18.2.2. (upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/dev/osd_internals/last_epoch_started.rst')
-rw-r--r-- doc/dev/osd_internals/last_epoch_started.rst 60
1 files changed, 60 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/last_epoch_started.rst b/doc/dev/osd_internals/last_epoch_started.rst
new file mode 100644
index 000000000..c31cc66b5
--- /dev/null
+++ b/doc/dev/osd_internals/last_epoch_started.rst
@@ -0,0 +1,60 @@
+======================
+last_epoch_started
+======================
+
+``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i``
+such that all writes committed in ``i`` or earlier are reflected in the
+local info/log and no writes after ``i`` are reflected in the local
+info/log. Since no committed write is ever divergent, even if we
+get an authoritative log/info with an older ``info.last_epoch_started``,
+we can leave our ``info.last_epoch_started`` alone since no writes could
+have committed in any intervening interval (see ``PG::proc_master_log``).
+
+``info.history.last_epoch_started`` records a lower bound on the most
+recent interval in which the PG as a whole went active and accepted
+writes. On a particular OSD it is also an upper bound on the
+activation epoch of intervals in which writes in the local PG log
+occurred: we update it before accepting writes. Because all
+committed writes are committed by all acting set OSDs, any
+non-divergent writes ensure that ``history.last_epoch_started`` was
+recorded by all acting set members in the interval. Once peering has
+queried one OSD from each interval back to some seen
+``history.last_epoch_started``, it follows that no interval after the max
+``history.last_epoch_started`` can have reported writes as committed
+(since we record it before recording client writes in an interval).
+Thus, the minimum ``last_update`` across all infos with
+``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an
+upper bound on writes reported as committed to the client.
+
+We update ``info.last_epoch_started`` with the initial activation message,
+but we only update ``history.last_epoch_started`` after the new
+``info.last_epoch_started`` is persisted (possibly along with the first
+write). This ensures that we do not require an OSD with the most
+recent ``info.last_epoch_started`` until all acting set OSDs have recorded
+it.
+
+In ``find_best_info``, we include ``info.last_epoch_started`` values when
+calculating ``max_last_epoch_started_found`` because we want to avoid
+marking as divergent a log entry that was non-divergent in a prior
+interval, since it might have been used to serve a read. In
+``activate()``, we use the peer's ``last_epoch_started`` value as a bound on
+how far back divergent log entries can be found.
+
+However, in a case like
+
+.. code::
+
+ calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556
+
+since osd.1 is the only one that recorded ``info.les=477``, while osd.4 and
+osd.0 (which were the acting set in that interval) did not (osd.4 restarted
+and osd.0 did not get the message in time), the PG is marked incomplete even
+though either osd.4 or osd.0 would have been a valid choice. To avoid this,
+we do not consider ``info.les`` for incomplete peers when calculating
+``max_last_epoch_started_found``. An incomplete peer would not have been in
+the acting set, so we must have another OSD from that interval anyway (if
+``maybe_went_rw``). If that OSD does not remember that ``info.les``, then we
+cannot have served reads.
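
Note on the ``calc_acting`` example in the added file: the rule it describes can be illustrated with a small Python sketch. This is not the Ceph implementation. ``PeerInfo``, ``max_les_found``, ``eligible_peers`` and the ``skip_incomplete_les`` switch are hypothetical names introduced here, and the selection step is reduced to the three checks the text mentions (``info.les`` bound, ``last_update`` bound, not incomplete). The sample data is transcribed from the four ``calc_acting`` lines, treating the ``lb 0//0//-1`` field on osd.1 as the marker of an incomplete peer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (epoch, version) pair; tuples compare lexicographically, which is
# assumed here to be close enough to how 473'302 style versions order.
EVersion = Tuple[int, int]


@dataclass
class PeerInfo:
    """Hypothetical, simplified stand-in for a peer's PG info."""
    last_update: EVersion      # "v 473'302" -> (473, 302)
    info_les: int              # info.last_epoch_started ("local-les")
    history_les: int           # info.history.last_epoch_started ("les/c 473/...")
    incomplete: bool = False   # assumed True when the log line carries "lb ..."


def max_les_found(infos: Dict[int, PeerInfo],
                  skip_incomplete_les: bool = True) -> int:
    """Largest last_epoch_started seen across all peer infos.

    history.les always counts; info.les counts only for peers that are
    not incomplete, unless skip_incomplete_les is turned off.
    """
    best = 0
    for info in infos.values():
        best = max(best, info.history_les)
        if not (skip_incomplete_les and info.incomplete):
            best = max(best, info.info_les)
    return best


def eligible_peers(infos: Dict[int, PeerInfo],
                   skip_incomplete_les: bool = True) -> List[int]:
    """OSD ids that could still be chosen as the authoritative info."""
    bound = max_les_found(infos, skip_incomplete_les)
    # Minimum last_update among infos with info.les >= bound: the upper
    # bound on writes that may have been reported as committed.
    acceptable = [i.last_update for i in infos.values() if i.info_les >= bound]
    if not acceptable:
        return []
    min_acceptable = min(acceptable)
    return sorted(
        osd for osd, i in infos.items()
        if not i.incomplete
        and i.info_les >= bound
        and i.last_update >= min_acceptable
    )


# Values transcribed from the calc_acting lines in the added file.
EXAMPLE: Dict[int, PeerInfo] = {
    0: PeerInfo(last_update=(473, 302), info_les=473, history_les=473),
    1: PeerInfo(last_update=(473, 302), info_les=477, history_les=473,
                incomplete=True),
    4: PeerInfo(last_update=(473, 302), info_les=473, history_les=473),
    5: PeerInfo(last_update=(0, 0), info_les=0, history_les=473),
}
```

With the sample data, ``eligible_peers(EXAMPLE)`` yields osd.0 and osd.4, because the bound stays at 473. Passing ``skip_incomplete_les=False`` raises the bound to 477, which only the incomplete osd.1 satisfies, so no peer remains eligible. That corresponds to the "PG is marked incomplete" outcome the text sets out to avoid.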