diff options
Diffstat (limited to 'doc/dev/osd_internals/backfill_reservation.rst')
-rw-r--r-- | doc/dev/osd_internals/backfill_reservation.rst | 93 |
1 files changed, 93 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/backfill_reservation.rst b/doc/dev/osd_internals/backfill_reservation.rst new file mode 100644 index 000000000..3c380dcf6 --- /dev/null +++ b/doc/dev/osd_internals/backfill_reservation.rst @@ -0,0 +1,93 @@ +==================== +Backfill Reservation +==================== + +When a new OSD joins a cluster all PGs with it in their acting sets must +eventually backfill. If all of these backfills happen simultaneously +they will present excessive load on the OSD: the "thundering herd" +effect. + +The ``osd_max_backfills`` tunable limits the number of outgoing or +incoming backfills that are active on a given OSD. Note that this limit is +applied separately to incoming and to outgoing backfill operations. +Thus there can be as many as ``osd_max_backfills * 2`` backfill operations +in flight on each OSD. This subtlety is often missed, and Ceph +operators can be puzzled as to why more ops are observed than expected. + +Each ``OSDService`` now has two AsyncReserver instances: one for backfills going +from the OSD (``local_reserver``) and one for backfills going to the OSD +(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``) +manages a queue by priority of waiting items and a set of current reservation +holders. When a slot frees up, the ``AsyncReserver`` queues the ``Context*`` +associated with the next item on the highest priority queue in the finisher +provided to the constructor. + +For a primary to initiate a backfill it must first obtain a reservation from +its own ``local_reserver``. Then it must obtain a reservation from the backfill +target's ``remote_reserver`` via a ``MBackfillReserve`` message. This process is +managed by sub-states of ``Active`` and ``ReplicaActive`` (see the sub-states +of ``Active`` in PG.h). The reservations are dropped either on the ``Backfilled`` +event, which is sent on the primary before calling ``recovery_complete`` +and on the replica on receipt of the ``BackfillComplete`` progress message), +or upon leaving ``Active`` or ``ReplicaActive``. + +It's important to always grab the local reservation before the remote +reservation in order to prevent a circular dependency. + +We minimize the risk of data loss by prioritizing the order in +which PGs are recovered. Admins can override the default order by using +``force-recovery`` or ``force-backfill``. A ``force-recovery`` with op +priority ``255`` will start before a ``force-backfill`` op at priority ``254``. + +If recovery is needed because a PG is below ``min_size`` a base priority of +``220`` is used. This is incremented by the number of OSDs short of the pool's +``min_size`` as well as a value relative to the pool's ``recovery_priority``. +The resultant priority is capped at ``253`` so that it does not confound forced +ops as described above. Under ordinary circumstances a recovery op is +prioritized at ``180`` plus a value relative to the pool's ``recovery_priority``. +The resultant priority is capped at ``219``. + +If backfill is needed because the number of acting OSDs is less than +the pool's ``min_size``, a priority of ``220`` is used. The number of OSDs +short of the pool's ``min_size`` is added as well as a value relative to +the pool's ``recovery_priority``. The total priority is limited to ``253``. + +If backfill is needed because a PG is undersized, +a priority of ``140`` is used. The number of OSDs below the size of the pool is +added as well as a value relative to the pool's ``recovery_priority``. The +resultant priority is capped at ``179``. If a backfill op is +needed because a PG is degraded, a priority of ``140`` is used. A value +relative to the pool's ``recovery_priority`` is added. The resultant priority +is capped at ``179`` . Under ordinary circumstances a +backfill op priority of ``100`` is used. A value relative to the pool's +``recovery_priority`` is added. The total priority is capped at ``139``. + +.. list-table:: Backfill and Recovery op priorities + :widths: 20 20 20 + :header-rows: 1 + + * - Description + - Base priority + - Maximum priority + * - Backfill + - 100 + - 139 + * - Degraded Backfill + - 140 + - 179 + * - Recovery + - 180 + - 219 + * - Inactive Recovery + - 220 + - 253 + * - Inactive Backfill + - 220 + - 253 + * - force-backfill + - 254 + - + * - force-recovery + - 255 + - + |