author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
commit | e6918187568dbd01842d8d1d2c808ce16a894239 (patch) | |
tree | 64f88b554b444a49f656b6c656111a145cbbaa28 /doc/dev/osd_internals/last_epoch_started.rst | |
parent | Initial commit. (diff) | |
Adding upstream version 18.2.2. (upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/dev/osd_internals/last_epoch_started.rst')
-rw-r--r-- | doc/dev/osd_internals/last_epoch_started.rst | 60 |
1 files changed, 60 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/last_epoch_started.rst b/doc/dev/osd_internals/last_epoch_started.rst
new file mode 100644
index 000000000..c31cc66b5
--- /dev/null
+++ b/doc/dev/osd_internals/last_epoch_started.rst
@@ -0,0 +1,60 @@
+======================
+last_epoch_started
+======================
+
+``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i``
+such that all writes committed in ``i`` or earlier are reflected in the
+local info/log and no writes after ``i`` are reflected in the local
+info/log. Since no committed write is ever divergent, even if we
+get an authoritative log/info with an older ``info.last_epoch_started``,
+we can leave our ``info.last_epoch_started`` alone since no writes could
+have committed in any intervening interval (See PG::proc_master_log).
+
+``info.history.last_epoch_started`` records a lower bound on the most
+recent interval in which the PG as a whole went active and accepted
+writes. On a particular OSD it is also an upper bound on the
+activation epoch of intervals in which writes in the local PG log
+occurred: we update it before accepting writes. Because all
+committed writes are committed by all acting set OSDs, any
+non-divergent writes ensure that ``history.last_epoch_started`` was
+recorded by all acting set members in the interval. Once peering has
+queried one OSD from each interval back to some seen
+``history.last_epoch_started``, it follows that no interval after the max
+``history.last_epoch_started`` can have reported writes as committed
+(since we record it before recording client writes in an interval).
+Thus, the minimum ``last_update`` across all infos with
+``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an
+upper bound on writes reported as committed to the client.
+
+We update ``info.last_epoch_started`` with the initial activation message,
+but we only update ``history.last_epoch_started`` after the new
+``info.last_epoch_started`` is persisted (possibly along with the first
+write). This ensures that we do not require an OSD with the most
+recent ``info.last_epoch_started`` until all acting set OSDs have recorded
+it.
+
+In ``find_best_info``, we do include ``info.last_epoch_started`` values when
+calculating ``max_last_epoch_started_found`` because we want to avoid
+designating a log entry divergent which in a prior interval would have
+been non-divergent since it might have been used to serve a read. In
+``activate()``, we use the peer's ``last_epoch_started`` value as a bound on
+how far back divergent log entries can be found.
+
+However, in a case like
+
+.. code::
+
+  calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+  calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
+  calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+  calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556
+
+since osd.1 is the only one which recorded info.les=477, while osd.4,osd.0
+(which were the acting set in that interval) did not (osd.4 restarted and osd.0
+did not get the message in time), the PG is marked incomplete when
+either osd.4 or osd.0 would have been valid choices. To avoid this, we do not
+consider ``info.les`` for incomplete peers when calculating
+``min_last_epoch_started_found``. It would not have been in the acting
+set, so we must have another OSD from that interval anyway (if
+``maybe_went_rw``). If that OSD does not remember that ``info.les``, then we
+cannot have served reads.
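
Reviewer note: the selection rule described in the added document can be illustrated with a small model. The following is a minimal sketch in Python, not the Ceph implementation. ``PeerInfo``, ``max_les_found``, ``commit_upper_bound`` and ``usable_peers`` are hypothetical names introduced here for illustration. The sample values are read off the ``calc_acting`` lines above, under three assumptions: ``local-les`` is ``info.last_epoch_started``, the first number of ``les/c`` is ``history.last_epoch_started``, and a peer whose line shows an ``lb`` field is incomplete.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (epoch, version); for example 473'302 becomes (473, 302).
Version = Tuple[int, int]


@dataclass
class PeerInfo:
    """Hypothetical, reduced stand-in for one peer's PG info."""
    osd: int
    last_update: Version
    info_les: int      # assumed to be "local-les" in the log line
    history_les: int   # assumed to be the first number of "les/c"
    incomplete: bool   # assumed True when the log line shows "lb ..."


def max_les_found(infos: List[PeerInfo], skip_incomplete: bool = True) -> int:
    """Largest last_epoch_started seen across all peers.

    history_les always counts. info_les counts only for peers that are
    not incomplete when skip_incomplete is True.
    """
    best = 0
    for p in infos:
        best = max(best, p.history_les)
        if not (skip_incomplete and p.incomplete):
            best = max(best, p.info_les)
    return best


def commit_upper_bound(infos: List[PeerInfo], max_les: int) -> Optional[Version]:
    """Minimum last_update among infos with info_les >= max_les."""
    candidates = [p.last_update for p in infos if p.info_les >= max_les]
    return min(candidates) if candidates else None


def usable_peers(infos: List[PeerInfo], max_les: int) -> List[int]:
    """Complete peers that are recent enough to be chosen."""
    return [p.osd for p in infos if not p.incomplete and p.info_les >= max_les]


# Values transcribed from the calc_acting example in the document.
infos = [
    PeerInfo(osd=0, last_update=(473, 302), info_les=473, history_les=473, incomplete=False),
    PeerInfo(osd=1, last_update=(473, 302), info_les=477, history_les=473, incomplete=True),
    PeerInfo(osd=4, last_update=(473, 302), info_les=473, history_les=473, incomplete=False),
    PeerInfo(osd=5, last_update=(0, 0), info_les=0, history_les=473, incomplete=False),
]

with_fix = max_les_found(infos)                            # ignores osd.1's info_les
without_fix = max_les_found(infos, skip_incomplete=False)  # counts osd.1's info_les

print(with_fix, usable_peers(infos, with_fix))
print(without_fix, usable_peers(infos, without_fix))
print(commit_upper_bound(infos, with_fix))
```

In this model, counting osd.1's ``info.les`` raises the threshold to a value that only the incomplete peer satisfies, so no complete peer qualifies. Skipping it keeps osd.0 and osd.4 eligible, which matches the outcome the document argues for.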