====================
Backfill Reservation
====================

When a new OSD joins a cluster all PGs with it in their acting sets must
eventually backfill. If all of these backfills happen simultaneously
they will present excessive load on the OSD: the "thundering herd"
effect.

The ``osd_max_backfills`` tunable limits the number of outgoing or
incoming backfills that are active on a given OSD. Note that this limit is
applied separately to incoming and to outgoing backfill operations. Thus
there can be as many as ``osd_max_backfills * 2`` backfill operations
in flight on each OSD. This subtlety is often missed, and Ceph
operators can be puzzled as to why more ops are observed than expected.

Each ``OSDService`` now has two ``AsyncReserver`` instances: one for backfills
going from the OSD (``local_reserver``) and one for backfills going to the OSD
(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
manages a priority queue of waiting items and a set of current reservation
holders. When a slot frees up, the ``AsyncReserver`` queues the ``Context*``
associated with the next item on the highest priority queue in the finisher
provided to the constructor.

For a primary to initiate a backfill it must first obtain a reservation from
its own ``local_reserver``. Then it must obtain a reservation from the
backfill target's ``remote_reserver`` via an ``MBackfillReserve`` message.
This process is managed by sub-states of ``Active`` and ``ReplicaActive``
(see the sub-states of ``Active`` in PG.h). The reservations are dropped
either on the ``Backfilled`` event (which is sent on the primary before
calling ``recovery_complete`` and on the replica on receipt of the
``BackfillComplete`` progress message), or upon leaving ``Active`` or
``ReplicaActive``.

It's important to always grab the local reservation before the remote
reservation in order to prevent a circular dependency.

We minimize the risk of data loss by prioritizing the order in which PGs are
recovered. Admins can override the default order by using ``force-recovery``
or ``force-backfill``. A ``force-recovery`` op with priority ``255`` will
start before a ``force-backfill`` op at priority ``254``.

If recovery is needed because a PG is below ``min_size``, a base priority of
``220`` is used. This is incremented by the number of OSDs short of the pool's
``min_size`` as well as a value relative to the pool's ``recovery_priority``.
The resultant priority is capped at ``253`` so that it does not confound
forced ops as described above. Under ordinary circumstances a recovery op is
prioritized at ``180`` plus a value relative to the pool's
``recovery_priority``. The resultant priority is capped at ``219``.

If backfill is needed because the number of acting OSDs is less than the
pool's ``min_size``, a base priority of ``220`` is used. The number of OSDs
short of the pool's ``min_size`` is added, as well as a value relative to the
pool's ``recovery_priority``. The total priority is limited to ``253``.

If backfill is needed because a PG is undersized, a base priority of ``140``
is used. The number of OSDs below the size of the pool is added, as well as a
value relative to the pool's ``recovery_priority``. The resultant priority is
capped at ``179``.

If backfill is needed because a PG is degraded, a base priority of ``140`` is
used. A value relative to the pool's ``recovery_priority`` is added. The
resultant priority is capped at ``179``.

Under ordinary circumstances a backfill op priority of ``100`` is used. A
value relative to the pool's ``recovery_priority`` is added. The total
priority is capped at ``139``.
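
The following is a minimal sketch of how these bands combine a base priority,
per-PG boosts, and the pool's ``recovery_priority`` under a cap. The constant
names, the ``clamp_priority`` helper, and the example values are illustrative
only; they are not the identifiers used in the Ceph source, but the numbers
mirror the bands described above.

.. code-block:: cpp

   #include <algorithm>
   #include <cstdio>

   // Illustrative constants mirroring the priority bands described above;
   // these names are hypothetical, not the ones used in the Ceph tree.
   constexpr unsigned BACKFILL_PRIORITY_BASE          = 100;  // capped at 139
   constexpr unsigned DEGRADED_BACKFILL_PRIORITY_BASE = 140;  // capped at 179
   constexpr unsigned RECOVERY_PRIORITY_BASE          = 180;  // capped at 219
   constexpr unsigned INACTIVE_PRIORITY_BASE          = 220;  // below min_size
   constexpr unsigned PRIORITY_MAX                    = 253;  // stays below forced ops
   constexpr unsigned FORCED_BACKFILL_PRIORITY        = 254;
   constexpr unsigned FORCED_RECOVERY_PRIORITY        = 255;

   // Combine a band's base priority with per-PG boosts (e.g. the number of
   // OSDs short of min_size) and the pool's recovery_priority adjustment,
   // then clamp the result so one band never spills into the next.
   unsigned clamp_priority(unsigned base, unsigned boost, unsigned cap) {
     return std::min(base + boost, cap);
   }

   int main() {
     unsigned pool_recovery_priority = 3;  // hypothetical pool setting

     // Ordinary backfill: base 100, capped at 139.
     unsigned backfill = clamp_priority(BACKFILL_PRIORITY_BASE,
                                        pool_recovery_priority, 139);

     // Recovery on a PG below min_size: base 220 plus the shortfall,
     // capped at 253 so it never reaches the forced-op priorities.
     unsigned osds_short_of_min_size = 2;
     unsigned inactive_recovery = clamp_priority(
         INACTIVE_PRIORITY_BASE,
         osds_short_of_min_size + pool_recovery_priority,
         PRIORITY_MAX);

     std::printf("backfill=%u inactive_recovery=%u forced_recovery=%u\n",
                 backfill, inactive_recovery, FORCED_RECOVERY_PRIORITY);
     return 0;
   }

The table below summarizes the same bands.
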
.. list-table:: Backfill and Recovery op priorities
   :widths: 20 20 20
   :header-rows: 1

   * - Description
     - Base priority
     - Maximum priority
   * - Backfill
     - 100
     - 139
   * - Degraded Backfill
     - 140
     - 179
   * - Recovery
     - 180
     - 219
   * - Inactive Recovery
     - 220
     - 253
   * - Inactive Backfill
     - 220
     - 253
   * - force-backfill
     - 254
     -
   * - force-recovery
     - 255
     -
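
To make the reservation flow described earlier more concrete, the following
self-contained sketch mimics the pattern ``AsyncReserver`` implements: callers
queue a callback at a priority, at most a fixed number of reservations are
held at once, and the highest-priority waiter is granted a slot when a holder
releases one. The ``SimpleReserver`` class, its method names, and the use of
``std::function`` invoked inline are all simplifications for illustration;
the real class in ``common/AsyncReserver.h`` hands the granted ``Context*`` to
the finisher provided to its constructor, as described above.

.. code-block:: cpp

   #include <deque>
   #include <functional>
   #include <iostream>
   #include <map>
   #include <set>
   #include <string>
   #include <utility>

   // Simplified stand-in for the AsyncReserver pattern: a bounded set of
   // reservation holders plus priority-ordered queues of waiters.
   class SimpleReserver {
     unsigned max_allowed;
     // Waiters bucketed by priority; higher priorities are served first.
     std::map<unsigned,
              std::deque<std::pair<std::string, std::function<void()>>>,
              std::greater<unsigned>> queues;
     std::set<std::string> holders;

     void grant_next() {
       while (holders.size() < max_allowed && !queues.empty()) {
         auto &q = queues.begin()->second;
         auto [item, on_reserved] = q.front();
         q.pop_front();
         if (q.empty())
           queues.erase(queues.begin());
         holders.insert(item);
         on_reserved();  // the real class queues the Context* on a finisher
       }
     }

   public:
     explicit SimpleReserver(unsigned max) : max_allowed(max) {}

     // Ask for a reservation; the callback fires once a slot is available.
     void request_reservation(const std::string &item, unsigned prio,
                              std::function<void()> on_reserved) {
       queues[prio].emplace_back(item, std::move(on_reserved));
       grant_next();
     }

     // Drop a reservation, letting the next highest-priority waiter proceed.
     void cancel_reservation(const std::string &item) {
       holders.erase(item);
       grant_next();
     }
   };

   int main() {
     // One slot, mirroring osd_max_backfills = 1 for a single direction.
     SimpleReserver local_reserver(1);

     // The primary grabs its local reservation first...
     local_reserver.request_reservation("pg 1.0", /*prio=*/100, [] {
       std::cout << "pg 1.0: local reservation granted, request remote next\n";
     });
     // ...while a later, higher-priority PG waits for the slot to free up.
     local_reserver.request_reservation("pg 1.1", /*prio=*/180, [] {
       std::cout << "pg 1.1: local reservation granted\n";
     });

     // When pg 1.0 finishes backfill it drops its reservation and pg 1.1 runs.
     local_reserver.cancel_reservation("pg 1.0");
     return 0;
   }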