
======================
last_epoch_started
======================

``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i``
such that all writes committed in ``i`` or earlier are reflected in the
local info/log and no writes after ``i`` are reflected in the local
info/log.  Since no committed write is ever divergent, even if we
get an authoritative log/info with an older ``info.last_epoch_started``,
we can leave our ``info.last_epoch_started`` alone since no writes could
have committed in any intervening interval (see ``PG::proc_master_log``).

``info.history.last_epoch_started`` records a lower bound on the most
recent interval in which the PG as a whole went active and accepted
writes.  On a particular OSD it is also an upper bound on the
activation epoch of intervals in which writes in the local PG log
occurred:  we update it before accepting writes.  Because all
committed writes are committed by all acting set OSDs, any
non-divergent writes ensure that ``history.last_epoch_started`` was
recorded by all acting set members in the interval.  Once peering has
queried one OSD from each interval back to some seen
``history.last_epoch_started``, it follows that no interval after the max
``history.last_epoch_started`` can have reported writes as committed
(since we record it before recording client writes in an interval).
Thus, the minimum ``last_update`` across all infos with
``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an
upper bound on writes reported as committed to the client.
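
The bound described above can be illustrated with a minimal sketch. It is not
the Ceph implementation: the ``Info`` record, the ``committed_upper_bound``
helper, and the sample values are hypothetical, and ``last_update`` is
simplified to a single integer instead of an ``epoch'version`` pair.

.. code:: python

  from dataclasses import dataclass


  @dataclass
  class Info:
      """Hypothetical, simplified stand-in for one OSD's PG info."""
      osd: int
      last_update: int   # simplified to a plain integer
      info_les: int      # info.last_epoch_started
      history_les: int   # info.history.last_epoch_started


  def committed_upper_bound(infos):
      """Return the minimum last_update over all infos whose
      info.last_epoch_started is at least the largest
      history.last_epoch_started seen, or None if no info qualifies."""
      max_history_les = max(i.history_les for i in infos)
      candidates = [i.last_update for i in infos
                    if i.info_les >= max_history_les]
      return min(candidates) if candidates else None


  # Made-up sample data: osd 2 last activated in an older interval.
  infos = [
      Info(osd=0, last_update=302, info_les=473, history_les=473),
      Info(osd=1, last_update=300, info_les=473, history_les=473),
      Info(osd=2, last_update=250, info_les=460, history_les=460),
  ]
  bound = committed_upper_bound(infos)

With this sample data, osd 2 is excluded because its
``info.last_epoch_started`` is older than the maximum
``history.last_epoch_started``, so ``bound`` is 300 rather than 250.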

We update ``info.last_epoch_started`` with the initial activation message,
but we only update ``history.last_epoch_started`` after the new
``info.last_epoch_started`` is persisted (possibly along with the first
write).  This ensures that we do not require an OSD with the most
recent ``info.last_epoch_started`` until all acting set OSDs have recorded
it.
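
The ordering above can be sketched as a toy model. This is an illustration
under stated assumptions, not how the OSD code is structured: the ``ToyPG``
class and its ``activate`` and ``persist`` methods are hypothetical names, and
persistence is reduced to copying a value between dictionaries.

.. code:: python

  class ToyPG:
      """Hypothetical model of the update order for one PG."""

      def __init__(self, acting):
          # In-memory info.last_epoch_started per acting set member.
          self.info_les = {osd: 0 for osd in acting}
          # Value of info.last_epoch_started that each member has persisted.
          self.persisted_les = {osd: 0 for osd in acting}
          # history.last_epoch_started for the PG.
          self.history_les = 0

      def activate(self, osd, epoch):
          """The activation message updates info.last_epoch_started."""
          self.info_les[osd] = epoch

      def persist(self, osd):
          """Persist info.last_epoch_started on one member, then advance
          history.last_epoch_started only as far as every member has
          persisted."""
          self.persisted_les[osd] = self.info_les[osd]
          oldest = min(self.persisted_les.values())
          if oldest > self.history_les:
              self.history_les = oldest


  pg = ToyPG(acting=[0, 4])
  pg.activate(0, 477)
  pg.activate(4, 477)
  pg.persist(0)
  after_first_persist = pg.history_les
  pg.persist(4)
  after_second_persist = pg.history_les

In this model ``history_les`` stays at 0 after only osd 0 has persisted the new
value and moves to 477 once osd 4 has persisted it as well.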

In ``find_best_info``, we do include ``info.last_epoch_started`` values when
calculating ``max_last_epoch_started_found`` because we want to avoid
designating a log entry divergent which in a prior interval would have
been non-divergent since it might have been used to serve a read.  In
``activate()``, we use the peer's ``last_epoch_started`` value as a bound on
how far back divergent log entries can be found.

However, in a case like

.. code::

  calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
  calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556

since osd.1 is the only one which recorded ``info.les=477``, while osd.4 and
osd.0 (which were the acting set in that interval) did not (osd.4 restarted and
osd.0 did not get the message in time), the PG is marked incomplete even though
either osd.4 or osd.0 would have been a valid choice. To avoid this, we do not
consider ``info.les`` for incomplete peers when calculating
``max_last_epoch_started_found``.  An incomplete peer would not have been in
the acting set, so we must have another OSD from that interval anyway (if
``maybe_went_rw``).  If that OSD does not remember that ``info.les``, then we
cannot have served reads.
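
The effect of skipping incomplete peers can be checked against the
``calc_acting`` output above with a small sketch. It is a simplification, not
the ``find_best_info`` code: ``PeerInfo``, ``max_les_found``, and
``usable_peers`` are hypothetical names, the ``incomplete`` flag is an
assumption read off the ``lb`` field in the log line for osd.1, and the other
selection criteria are left out.

.. code:: python

  from dataclasses import dataclass


  @dataclass
  class PeerInfo:
      """Hypothetical, simplified view of one calc_acting line."""
      osd: int
      info_les: int      # local-les in the log output
      history_les: int   # first number of "les/c"
      incomplete: bool   # assumed True when the line shows an "lb" field


  def max_les_found(infos, skip_incomplete=True):
      """Largest last_epoch_started seen across the peers. history.les
      always counts; info.les counts only for complete peers when
      skip_incomplete is set."""
      max_les = 0
      for i in infos:
          max_les = max(max_les, i.history_les)
          if skip_incomplete and i.incomplete:
              continue
          max_les = max(max_les, i.info_les)
      return max_les


  def usable_peers(infos, skip_incomplete=True):
      """Complete peers whose info.les reaches the bound."""
      bound = max_les_found(infos, skip_incomplete)
      return [i.osd for i in infos
              if not i.incomplete and i.info_les >= bound]


  peers = [
      PeerInfo(osd=0, info_les=473, history_les=473, incomplete=False),
      PeerInfo(osd=1, info_les=477, history_les=473, incomplete=True),
      PeerInfo(osd=4, info_les=473, history_les=473, incomplete=False),
      PeerInfo(osd=5, info_les=0, history_les=473, incomplete=False),
  ]

With ``skip_incomplete=True`` the bound is 473 and ``usable_peers(peers)``
returns ``[0, 4]``. With ``skip_incomplete=False`` the bound rises to 477, only
the incomplete osd.1 reaches it, and the list is empty. That empty list
corresponds to the PG being marked incomplete.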