1 files changed, 187 insertions, 0 deletions
diff --git a/doc/cephfs/standby.rst b/doc/cephfs/standby.rst
new file mode 100644
index 000000000..367c6762b
--- /dev/null
+++ b/doc/cephfs/standby.rst
@@ -0,0 +1,187 @@
+.. _mds-standby:
+
+Terminology
+-----------
+
+A Ceph cluster may have zero or more CephFS *file systems*.  Each CephFS has
+a human-readable name (set at creation time with ``fs new``) and an integer
+ID.  The ID is called the file system cluster ID, or *FSCID*.
+
+Each CephFS file system has a number of *ranks*, numbered beginning with zero.
+By default there is one rank per file system.  A rank may be thought of as a
+metadata shard.  Management of ranks is described in :doc:`/cephfs/multimds`.
+
+Each CephFS ``ceph-mds`` daemon starts without a rank.  It may be assigned one
+by the cluster's monitors.  A daemon may hold only one rank at a time, and it
+gives up that rank only when the ``ceph-mds`` process stops.
+
+If a rank is not associated with any daemon, that rank is considered ``failed``.
+Once a rank is assigned to a daemon, the rank is considered ``up``.
+
+Each ``ceph-mds`` daemon has a *name* that is assigned statically by the
+administrator when the daemon is first configured.  The *name* is typically
+the hostname of the machine on which the process runs.
+
+A ``ceph-mds`` daemon may be assigned to a specific file system by
+setting its ``mds_join_fs`` configuration option to the file system's
+``name``.
+
+When a ``ceph-mds`` daemon starts, it is also assigned an integer ``GID``,
+which is unique to that particular daemon process.  In other words, when a
+``ceph-mds`` daemon is restarted, it runs as a new process and is assigned a
+*new* ``GID`` that is different from that of the previous process.
+
+Referring to MDS daemons
+------------------------
+
+Most administrative commands that refer to a ``ceph-mds`` daemon (MDS)
+accept a flexible argument format that may specify a ``rank``, a ``GID``
+or a ``name``.
+
+Where a ``rank`` is used, it may optionally be qualified by
+a leading file system ``name`` or ``FSCID``.  If a daemon is a standby (i.e.
+it is not currently assigned a ``rank``), then it may only be
+referred to by ``GID`` or ``name``.
+
+For example, say we have an MDS daemon with ``name`` 'myhost' and
+``GID`` 5446, which is assigned ``rank`` 0 for the file system 'myfs'
+with ``FSCID`` 3.  Any of the following are suitable forms of the ``fail``
+command:
+
+::
+
+    ceph mds fail 5446     # GID
+    ceph mds fail myhost   # Daemon name
+    ceph mds fail 0        # Unqualified rank
+    ceph mds fail 3:0      # FSCID and rank
+    ceph mds fail myfs:0   # File System name and rank
+
+Managing failover
+-----------------
+
+If an MDS daemon stops communicating with the cluster's monitors, the monitors
+will wait ``mds_beacon_grace`` seconds (default 15) before marking the daemon as
+*laggy*.  If a standby MDS is available, the monitors will immediately replace the
+laggy daemon.
+
+Each file system may specify a minimum number of standby daemons in order to be
+considered healthy. This number includes daemons in the ``standby-replay`` state
+waiting for a ``rank`` to fail. (Note that a ``standby-replay`` daemon will not
+be assigned to take over a failure for another ``rank`` or a failure in a
+different CephFS file system.) Standby daemons that are not in the
+``standby-replay`` state count towards every file system's standby count.
+Set the desired number of standby daemons for a file system with:
+
+::
+
+    ceph fs set <fs name> standby_count_wanted <count>
+
+Setting ``count`` to 0 will disable the health check.
+
+
+.. _mds-standby-replay:
+
+Configuring standby-replay
+--------------------------
+
+Each CephFS file system may be configured to add ``standby-replay`` daemons.
+These standby daemons follow the active MDS's metadata journal in order to
+reduce failover time in the event that the active MDS becomes unavailable. Each
+active MDS may have only one ``standby-replay`` daemon following it.
+
+To configure ``standby-replay`` on a file system, use the following command:
+
+::
+
+    ceph fs set <fs name> allow_standby_replay <bool>
+
+Once set, the monitors will assign available standby daemons to follow the
+active MDSs in that file system.
+
+Once an MDS has entered the ``standby-replay`` state, it will only be used as a
+standby for the ``rank`` that it is following. If another ``rank`` fails, this
+``standby-replay`` daemon will not be used as a replacement, even if no other
+standbys are available. For this reason, it is advised that if ``standby-replay``
+is used then *every* active MDS should have a ``standby-replay`` daemon.
+
+.. _mds-join-fs:
+
+Configuring MDS file system affinity
+------------------------------------
+
+You might elect to dedicate an MDS to a particular file system. Or, perhaps you
+have MDSs that run on better hardware that should be preferred over a last-resort
+standby on modest or over-provisioned systems. To configure this preference,
+CephFS provides an MDS configuration option called ``mds_join_fs``, which
+enforces this affinity.
+
+When failing over MDS daemons, a cluster's monitors will prefer standby daemons with
+``mds_join_fs`` equal to the ``name`` of the file system with the failed ``rank``.  If
+no standby exists with ``mds_join_fs`` equal to the file system ``name``, the monitors
+will choose an unqualified standby (no setting for ``mds_join_fs``) as the replacement
+or, as a last resort, any other available standby. Note that this does not change the
+behavior that ``standby-replay`` daemons are always selected before
+other standbys.
+
+Furthermore, the monitors will regularly examine the CephFS file systems, even when
+they are stable, to check whether a standby with stronger affinity is available to
+replace an MDS with lower affinity. This process also applies to ``standby-replay``
+daemons: if a regular standby has stronger affinity than the ``standby-replay`` MDS,
+it will replace the ``standby-replay`` MDS.
+
+For example, given this stable and healthy file system:
+
+::
+
+    $ ceph fs dump
+    dumped fsmap epoch 399
+    ...
+    Filesystem 'cephfs' (27)
+    ...
+    e399
+    max_mds 1
+    in      0
+    up      {0=20384}
+    failed
+    damaged
+    stopped
+    ...
+    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]
+
+    Standby daemons:
+
+    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
+
+
+You may set ``mds_join_fs`` on the standby to enforce your preference: ::
+
+    $ ceph config set mds.b mds_join_fs cephfs
+
+After automatic failover: ::
+
+    $ ceph fs dump
+    dumped fsmap epoch 405
+    e405
+    ...
+    Filesystem 'cephfs' (27)
+    ...
+    max_mds 1
+    in      0
+    up      {0=10420}
+    failed
+    damaged
+    stopped
+    ...
+    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
+
+    Standby daemons:
+
+    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]
+
+Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
+output, the file system name from ``mds_join_fs`` is replaced by the file system
+identifier, or FSCID (27). If the file system is recreated with the same name, the
+standby will follow the new file system as expected.
+
+Finally, if the file system is degraded or undersized, no failover will occur
+to enforce ``mds_join_fs``.