Preventing Stale Reads
======================

We write synchronously to all replicas before sending an ACK to the
client, which limits the potential for inconsistency in the write path.
However, by default we serve reads from just one replica (the
lead/primary OSD for each PG), and the client will use whatever OSDMap
it has to select the OSD from which to read. In most cases, this is
fine: either the client map is correct, or the OSD that we think is the
primary for the object knows that it is not the primary anymore, and
can feed the client an updated map that indicates a newer primary.

The key is to ensure that this is *always* true. In particular, we need
to ensure that an OSD that is fenced off from its peers and has not
learned about a map update does not continue to service read requests
from similarly stale clients at any point after which a new primary may
have been allowed to make a write.

We accomplish this via a mechanism that works much like a read lease.
Each pool may have a ``read_lease_interval`` property which defines how
long the lease lasts, although by default we simply set it to
``osd_pool_default_read_lease_ratio`` (default: .8) times the
``osd_heartbeat_grace``. (This way the lease will generally have
expired by the time we mark a failed OSD down.)

readable_until
--------------

Primary and replica both track a couple of values:

* *readable_until* is how long we are allowed to service (read)
  requests before *our* "lease" expires.
* *readable_until_ub* is an upper bound on *readable_until* for any OSD
  in the acting set.

The primary manages these two values by sending *pg_lease_t* messages
to replicas that increase the upper bound. Once all acting OSDs have
acknowledged they've seen the higher bound, the primary increases its
own *readable_until* and shares that (in a subsequent *pg_lease_t*
message). The resulting invariant is that any acting OSD's
*readable_until* is always <= any acting OSD's *readable_until_ub*.
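
The two-step renewal described above can be illustrated with a small
Python model. This is only a sketch under stated assumptions, not Ceph
code: the names ``PrimaryLease``, ``propose``, ``ack`` and
``can_serve_reads`` are hypothetical, timestamps are plain floats
standing in for monotonic clock readings, and the 20-second grace value
in the usage note is just an example input.

```python
from dataclasses import dataclass, field


def default_read_lease_interval(heartbeat_grace: float, ratio: float = 0.8) -> float:
    """Lease length used when the pool sets no explicit read_lease_interval."""
    return ratio * heartbeat_grace


@dataclass
class PrimaryLease:
    """Toy model of the primary's lease bookkeeping (hypothetical names)."""
    replicas: frozenset
    interval: float
    readable_until: float = 0.0     # when the primary's own lease expires
    readable_until_ub: float = 0.0  # upper bound already shared with replicas
    _pending: float = 0.0           # bound proposed but not yet fully acked
    _acked: set = field(default_factory=set)

    def propose(self, now: float) -> float:
        """Step 1: raise the upper bound; the caller sends it to replicas."""
        self._pending = now + self.interval
        self.readable_until_ub = max(self.readable_until_ub, self._pending)
        self._acked = set()
        self._maybe_advance()  # covers the case of an acting set with no replicas
        return self.readable_until_ub

    def ack(self, replica: str) -> None:
        """Step 2: record an acknowledgement from one replica."""
        if replica in self.replicas:
            self._acked.add(replica)
            self._maybe_advance()

    def _maybe_advance(self) -> None:
        # Only once every replica has seen the higher bound may the
        # primary extend its own readable_until up to that bound.
        if self._acked == set(self.replicas):
            self.readable_until = max(self.readable_until, self._pending)

    def can_serve_reads(self, now: float) -> bool:
        return now < self.readable_until
```

For example, ``default_read_lease_interval(20.0)`` gives a lease of
about 16 seconds. Because ``readable_until`` is only ever raised to a
value that was first folded into ``readable_until_ub``, the model keeps
``readable_until <= readable_until_ub`` at every step.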

In order to avoid any problems with clock skew, we use monotonic clocks
(which are only accurate locally and unaffected by time adjustments)
throughout to manage these leases. Peer OSDs calculate upper and lower
bounds on the deltas between OSD-local clocks, allowing the primary to
share timestamps based on its local clock while replicas translate that
to an appropriate bound for their own local clocks.

Prior Intervals
---------------

Whenever there is an interval change, we need to have an upper bound on
the *readable_until* values for any OSDs in the prior interval. All
OSDs from that interval have this value (*readable_until_ub*), and
share it as part of the *pg_history_t* during peering.

Because peering may involve OSDs that were not already communicating
before and may not have bounds on their clock deltas, the bound in
*pg_history_t* is shared as a simple duration before the upper bound
expires. This means that the bound slips forward in time due to the
transit time for the peering message, but that is generally quite
short, and moving the bound later in time is safe since it is an
*upper* bound.

PG "laggy" state
----------------

While the PG is active, *pg_lease_t* and *pg_lease_ack_t* messages are
regularly exchanged. However, if a client request comes in and the
lease has expired (*readable_until* has passed), the PG will go into a
*LAGGY* state and the request will be blocked. Once the lease is
renewed, the request(s) will be requeued.

PG "wait" state
---------------

If peering completes but the prior interval's OSDs may still be
readable, the PG will go into the *WAIT* state until sufficient time
has passed. Any OSD requests will block during that period. Recovery
may proceed while in this state, since the logical, user-visible
content of objects does not change.

Dead OSDs
---------

Generally speaking, we need to wait until prior intervals' OSDs *know*
that they should no longer be readable.
If an OSD is known to have crashed (e.g., because the process is no
longer running, which we may infer because we get an ECONNREFUSED
error), then we can infer that it is not readable.

Similarly, if an OSD is marked down, gets a map update telling it so,
and then informs the monitor that it knows it was marked down, we can
infer that it is not still serving requests for a prior interval.

When a PG is in the *WAIT* state, it will watch new maps for OSDs'
*dead_epoch* value indicating they are aware of their dead-ness. If all
down OSDs from the prior interval are so aware, we can exit the *WAIT*
state early.
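
The decision to leave the *WAIT* state can be sketched in Python as
follows. This is an illustrative model rather than the actual
implementation: ``OsdState``, ``knows_it_is_dead`` and
``can_leave_wait`` are hypothetical names, and comparing *dead_epoch*
against the epoch at which the OSD last came up is an assumption about
how "aware of being down" might be tested.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class OsdState:
    """Per-OSD facts read from the latest map (hypothetical model)."""
    is_up: bool
    up_from: int     # epoch at which the OSD last came up
    dead_epoch: int  # epoch at which it acknowledged being down (0 if never)


def knows_it_is_dead(osd: OsdState) -> bool:
    # Assumption: the OSD is aware it is down if it is marked down and
    # its acknowledgement is newer than the last time it came up.
    return (not osd.is_up) and osd.dead_epoch > osd.up_from


def can_leave_wait(now: float,
                   prior_readable_until_ub: float,
                   prior_down_osds: Iterable[OsdState]) -> bool:
    """True once no OSD from the prior interval can still serve reads."""
    # Normal exit: the prior interval's upper bound has passed.
    if now >= prior_readable_until_ub:
        return True
    # Early exit: every down OSD from the prior interval knows it is down.
    # With an empty set this sketch conservatively waits for the timeout.
    osds = list(prior_down_osds)
    return bool(osds) and all(knows_it_is_dead(o) for o in osds)
```

A caller would re-evaluate ``can_leave_wait`` whenever a new map
arrives or the local monotonic clock advances past the prior bound.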