====================
Backfill Reservation
====================

When a new OSD joins a cluster all PGs with it in their acting sets must
eventually backfill. If all of these backfills happen simultaneously
they will present excessive load on the OSD: the "thundering herd"
effect.

The ``osd_max_backfills`` tunable limits the number of outgoing or
incoming backfills that are active on a given OSD. Note that this limit is
applied separately to incoming and to outgoing backfill operations. Thus
there can be as many as ``osd_max_backfills * 2`` backfill operations
in flight on each OSD. This subtlety is often missed, and Ceph
operators can be puzzled as to why more ops are observed than expected.

Each ``OSDService`` now has two ``AsyncReserver`` instances: one for backfills
going from the OSD (``local_reserver``) and one for backfills going to the OSD
(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
manages a priority queue of waiting items and a set of current reservation
holders. When a slot frees up, the ``AsyncReserver`` queues the ``Context*``
associated with the next item on the highest priority queue in the finisher
provided to the constructor.

For a primary to initiate a backfill it must first obtain a reservation from
its own ``local_reserver``. Then it must obtain a reservation from the
backfill target's ``remote_reserver`` via an ``MBackfillReserve`` message.
This process is managed by sub-states of ``Active`` and ``ReplicaActive``
(see the sub-states of ``Active`` in PG.h). The reservations are dropped
either on the ``Backfilled`` event (which is sent on the primary before
calling ``recovery_complete`` and on the replica on receipt of the
``BackfillComplete`` progress message), or upon leaving ``Active`` or
``ReplicaActive``.

It's important to always grab the local reservation before the remote
reservation in order to prevent a circular dependency.

We minimize the risk of data loss by prioritizing the order in which PGs are
recovered. Admins can override the default order by using ``force-recovery``
or ``force-backfill``. A ``force-recovery`` op with priority ``255`` will
start before a ``force-backfill`` op at priority ``254``.

If recovery is needed because a PG is below ``min_size``, a base priority of
``220`` is used. This is incremented by the number of OSDs short of the pool's
``min_size`` as well as a value relative to the pool's ``recovery_priority``.
The resultant priority is capped at ``253`` so that it does not confound
forced ops as described above. Under ordinary circumstances a recovery op is
prioritized at ``180`` plus a value relative to the pool's
``recovery_priority``. The resultant priority is capped at ``219``.

If backfill is needed because the number of acting OSDs is less than the
pool's ``min_size``, a base priority of ``220`` is used. The number of OSDs
short of the pool's ``min_size`` is added, as well as a value relative to the
pool's ``recovery_priority``. The total priority is limited to ``253``.

If backfill is needed because a PG is undersized, a base priority of ``140``
is used. The number of OSDs below the size of the pool is added, as well as a
value relative to the pool's ``recovery_priority``. The resultant priority is
capped at ``179``.

If backfill is needed because a PG is degraded, a base priority of ``140`` is
used. A value relative to the pool's ``recovery_priority`` is added. The
resultant priority is capped at ``179``.

Under ordinary circumstances a backfill op priority of ``100`` is used. A
value relative to the pool's ``recovery_priority`` is added. The total
priority is capped at ``139``.
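
The following is a minimal sketch of how these bands combine a base priority,
per-PG boosts, and the pool's ``recovery_priority`` under a cap. The constant
names, the ``clamp_priority`` helper, and the example values are illustrative
only; they are not the identifiers used in the Ceph source, but the numbers
mirror the bands described above.

.. code-block:: cpp

   #include <algorithm>
   #include <cstdio>

   // Illustrative constants mirroring the priority bands described above;
   // these names are hypothetical, not the ones used in the Ceph tree.
   constexpr unsigned BACKFILL_PRIORITY_BASE          = 100;  // capped at 139
   constexpr unsigned DEGRADED_BACKFILL_PRIORITY_BASE = 140;  // capped at 179
   constexpr unsigned RECOVERY_PRIORITY_BASE          = 180;  // capped at 219
   constexpr unsigned INACTIVE_PRIORITY_BASE          = 220;  // below min_size
   constexpr unsigned PRIORITY_MAX                    = 253;  // stays below forced ops
   constexpr unsigned FORCED_BACKFILL_PRIORITY        = 254;
   constexpr unsigned FORCED_RECOVERY_PRIORITY        = 255;

   // Combine a band's base priority with per-PG boosts (e.g. the number of
   // OSDs short of min_size) and the pool's recovery_priority adjustment,
   // then clamp the result so one band never spills into the next.
   unsigned clamp_priority(unsigned base, unsigned boost, unsigned cap) {
     return std::min(base + boost, cap);
   }

   int main() {
     unsigned pool_recovery_priority = 3;  // hypothetical pool setting

     // Ordinary backfill: base 100, capped at 139.
     unsigned backfill = clamp_priority(BACKFILL_PRIORITY_BASE,
                                        pool_recovery_priority, 139);

     // Recovery on a PG below min_size: base 220 plus the shortfall,
     // capped at 253 so it never reaches the forced-op priorities.
     unsigned osds_short_of_min_size = 2;
     unsigned inactive_recovery = clamp_priority(
         INACTIVE_PRIORITY_BASE,
         osds_short_of_min_size + pool_recovery_priority,
         PRIORITY_MAX);

     std::printf("backfill=%u inactive_recovery=%u forced_recovery=%u\n",
                 backfill, inactive_recovery, FORCED_RECOVERY_PRIORITY);
     return 0;
   }

The table below summarizes the same bands.
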
.. list-table:: Backfill and Recovery op priorities
   :widths: 20 20 20
   :header-rows: 1

   * - Description
     - Base priority
     - Maximum priority
   * - Backfill
     - 100
     - 139
   * - Degraded Backfill
     - 140
     - 179
   * - Recovery
     - 180
     - 219
   * - Inactive Recovery
     - 220
     - 253
   * - Inactive Backfill
     - 220
     - 253
   * - force-backfill
     - 254
     -
   * - force-recovery
     - 255
     -
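
To make the reservation flow described earlier more concrete, the following
self-contained sketch mimics the pattern ``AsyncReserver`` implements: callers
queue a callback at a priority, at most a fixed number of reservations are
held at once, and the highest-priority waiter is granted a slot when a holder
releases one. The ``SimpleReserver`` class, its method names, and the use of
``std::function`` invoked inline are all simplifications for illustration;
the real class in ``common/AsyncReserver.h`` hands the granted ``Context*`` to
the finisher provided to its constructor, as described above.

.. code-block:: cpp

   #include <deque>
   #include <functional>
   #include <iostream>
   #include <map>
   #include <set>
   #include <string>
   #include <utility>

   // Simplified stand-in for the AsyncReserver pattern: a bounded set of
   // reservation holders plus priority-ordered queues of waiters.
   class SimpleReserver {
     unsigned max_allowed;
     // Waiters bucketed by priority; higher priorities are served first.
     std::map<unsigned,
              std::deque<std::pair<std::string, std::function<void()>>>,
              std::greater<unsigned>> queues;
     std::set<std::string> holders;

     void grant_next() {
       while (holders.size() < max_allowed && !queues.empty()) {
         auto &q = queues.begin()->second;
         auto [item, on_reserved] = q.front();
         q.pop_front();
         if (q.empty())
           queues.erase(queues.begin());
         holders.insert(item);
         on_reserved();  // the real class queues the Context* on a finisher
       }
     }

   public:
     explicit SimpleReserver(unsigned max) : max_allowed(max) {}

     // Ask for a reservation; the callback fires once a slot is available.
     void request_reservation(const std::string &item, unsigned prio,
                              std::function<void()> on_reserved) {
       queues[prio].emplace_back(item, std::move(on_reserved));
       grant_next();
     }

     // Drop a reservation, letting the next highest-priority waiter proceed.
     void cancel_reservation(const std::string &item) {
       holders.erase(item);
       grant_next();
     }
   };

   int main() {
     // One slot, mirroring osd_max_backfills = 1 for a single direction.
     SimpleReserver local_reserver(1);

     // The primary grabs its local reservation first...
     local_reserver.request_reservation("pg 1.0", /*prio=*/100, [] {
       std::cout << "pg 1.0: local reservation granted, request remote next\n";
     });
     // ...while a later, higher-priority PG waits for the slot to free up.
     local_reserver.request_reservation("pg 1.1", /*prio=*/180, [] {
       std::cout << "pg 1.1: local reservation granted\n";
     });

     // When pg 1.0 finishes backfill it drops its reservation and pg 1.1 runs.
     local_reserver.cancel_reservation("pg 1.0");
     return 0;
   }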