Diffstat (limited to 'doc/rados/operations')
-rw-r--r--  doc/rados/operations/add-or-rm-mons.rst           458
-rw-r--r--  doc/rados/operations/add-or-rm-osds.rst           419
-rw-r--r--  doc/rados/operations/balancer.rst                 221
-rw-r--r--  doc/rados/operations/bluestore-migration.rst      357
-rw-r--r--  doc/rados/operations/cache-tiering.rst            557
-rw-r--r--  doc/rados/operations/change-mon-elections.rst     100
-rw-r--r--  doc/rados/operations/control.rst                  665
-rw-r--r--  doc/rados/operations/crush-map-edits.rst          746
-rw-r--r--  doc/rados/operations/crush-map.rst               1147
-rw-r--r--  doc/rados/operations/data-placement.rst            47
-rw-r--r--  doc/rados/operations/devices.rst                  227
-rw-r--r--  doc/rados/operations/erasure-code-clay.rst        240
-rw-r--r--  doc/rados/operations/erasure-code-isa.rst         107
-rw-r--r--  doc/rados/operations/erasure-code-jerasure.rst    123
-rw-r--r--  doc/rados/operations/erasure-code-lrc.rst         388
-rw-r--r--  doc/rados/operations/erasure-code-profile.rst     128
-rw-r--r--  doc/rados/operations/erasure-code-shec.rst        145
-rw-r--r--  doc/rados/operations/erasure-code.rst             272
-rw-r--r--  doc/rados/operations/health-checks.rst           1619
-rw-r--r--  doc/rados/operations/index.rst                     99
-rw-r--r--  doc/rados/operations/monitoring-osd-pg.rst        556
-rw-r--r--  doc/rados/operations/monitoring.rst               644
-rw-r--r--  doc/rados/operations/operating.rst                174
-rw-r--r--  doc/rados/operations/pg-concepts.rst              104
-rw-r--r--  doc/rados/operations/pg-repair.rst                118
-rw-r--r--  doc/rados/operations/pg-states.rst                118
-rw-r--r--  doc/rados/operations/placement-groups.rst         897
-rw-r--r--  doc/rados/operations/pools.rst                    751
-rw-r--r--  doc/rados/operations/read-balancer.rst             64
-rw-r--r--  doc/rados/operations/stretch-mode.rst             262
-rw-r--r--  doc/rados/operations/upmap.rst                    113
-rw-r--r--  doc/rados/operations/user-management.rst          840
32 files changed, 12706 insertions, 0 deletions
diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst
new file mode 100644
index 000000000..3688bb798
--- /dev/null
+++ b/doc/rados/operations/add-or-rm-mons.rst
@@ -0,0 +1,458 @@
+.. _adding-and-removing-monitors:
+
+==========================
+ Adding/Removing Monitors
+==========================
+
+It is possible to add monitors to a running cluster as long as redundancy is
+maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor
+Bootstrap`_.
+
+.. _adding-monitors:
+
+Adding Monitors
+===============
+
+Ceph monitors serve as the single source of truth for the cluster map. It is
+possible to run a cluster with only one monitor, but for a production cluster
+it is recommended to have at least three monitors provisioned and in quorum.
+Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus
+about maps and about other critical information across the cluster. Due to the
+nature of Paxos, Ceph is able to maintain quorum (and thus establish
+consensus) only if a majority of the monitors are ``active``.
+
+It is best to run an odd number of monitors. This is because a cluster that is
+running an odd number of monitors is more resilient than a cluster running an
+even number. For example, in a two-monitor deployment, no failures can be
+tolerated if quorum is to be maintained; in a three-monitor deployment, one
+failure can be tolerated; in a four-monitor deployment, one failure can be
+tolerated; and in a five-monitor deployment, two failures can be tolerated. In
+general, a cluster running an odd number of monitors is best because it avoids
+what is called the *split brain* phenomenon. In short, Ceph is able to operate
+only if a majority of monitors are ``active`` and able to communicate with each
+other (for example, a single monitor, two out of two monitors, two out of three
+monitors, three out of five monitors, or the like).
+
+For small or non-critical deployments of multi-node Ceph clusters, it is
+recommended to deploy three monitors. For larger clusters or for clusters that
+are intended to survive a double failure, it is recommended to deploy five
+monitors. Only in rare circumstances is there any justification for deploying
+seven or more monitors.
+
+It is possible to run a monitor on the same host that is running an OSD.
+However, this approach has disadvantages. For example, ``fsync`` issues with
+the kernel might weaken performance, and if the node crashes, is rebooted, or
+is taken down for maintenance, then the monitor and the OSD daemons become
+unavailable at the same time, causing disruption. Because of these risks, it is
+instead recommended to run monitors and managers on dedicated hosts.
+
+.. note:: A *majority* of monitors in your cluster must be able to
+ reach each other in order for quorum to be established.
+
+Deploying your Hardware
+-----------------------
+
+Some operators choose to add a new monitor host at the same time that they add
+a new monitor. For details on the minimum recommendations for monitor hardware,
+see `Hardware Recommendations`_. Before adding a monitor host to the cluster,
+make sure that there is an up-to-date version of Linux installed.
+
+Add the newly installed monitor host to a rack in your cluster, connect the
+host to the network, and make sure that the host has network connectivity.
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+
+Installing the Required Software
+--------------------------------
+
+In manually deployed clusters, it is necessary to install Ceph packages
+manually. For details, see `Installing Packages`_. Configure SSH so that it can
+be used by a user that has passwordless authentication and root permissions.
+
+.. _Installing Packages: ../../../install/install-storage-cluster
+
+
+.. _Adding a Monitor (Manual):
+
+Adding a Monitor (Manual)
+-------------------------
+
+The procedure in this section creates a ``ceph-mon`` data directory, retrieves
+both the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to
+the cluster. The procedure might result in a Ceph cluster that contains only
+two monitor daemons. To add more monitors until there are enough ``ceph-mon``
+daemons to establish quorum, repeat the procedure.
+
+This is a good point at which to define the new monitor's ``id``. Monitors have
+often been named with single letters (``a``, ``b``, ``c``, etc.), but you are
+free to define the ``id`` however you see fit. In this document, ``{mon-id}``
+refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if
+``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is
+``a``.
+
+#. Create a data directory on the machine that will host the new monitor:
+
+ .. prompt:: bash $
+
+ ssh {new-mon-host}
+ sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}
+
+#. Create a temporary directory ``{tmp}`` that will contain the files needed
+ during this procedure. This directory should be different from the data
+ directory created in the previous step. Because this is a temporary
+ directory, it can be removed after the procedure is complete:
+
+ .. prompt:: bash $
+
+ mkdir {tmp}
+
+#. Retrieve the keyring for your monitors (``{tmp}`` is the path to the
+ retrieved keyring and ``{key-filename}`` is the name of the file that
+ contains the retrieved monitor key):
+
+ .. prompt:: bash $
+
+ ceph auth get mon. -o {tmp}/{key-filename}
+
+#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map
+ and ``{map-filename}`` is the name of the file that contains the retrieved
+ monitor map):
+
+ .. prompt:: bash $
+
+ ceph mon getmap -o {tmp}/{map-filename}
+
+#. Prepare the monitor's data directory, which was created in the first step.
+ The following command must specify the path to the monitor map (so that
+ information about a quorum of monitors and their ``fsid``\s can be
+ retrieved) and specify the path to the monitor keyring:
+
+ .. prompt:: bash $
+
+ sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}
+
+#. Start the new monitor. It will automatically join the cluster. To provide
+ information to the daemon about which address to bind to, use either the
+ ``--public-addr {ip}`` option or the ``--public-network {network}`` option.
+ For example:
+
+ .. prompt:: bash $
+
+ ceph-mon -i {mon-id} --public-addr {ip:port}
+
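+As a concrete illustration of the steps above, adding a hypothetical monitor
+``mon.d`` on a host named ``newhost`` with the address ``10.0.0.4`` might look
+like the following (the monitor id, hostname, address, and temporary directory
+used here are all assumptions; substitute your own values):
+
+.. prompt:: bash $
+
+   ssh newhost
+   sudo mkdir /var/lib/ceph/mon/ceph-d
+   mkdir /tmp/newmon
+   ceph auth get mon. -o /tmp/newmon/keyring
+   ceph mon getmap -o /tmp/newmon/monmap
+   sudo ceph-mon -i d --mkfs --monmap /tmp/newmon/monmap --keyring /tmp/newmon/keyring
+   ceph-mon -i d --public-addr 10.0.0.4:6789
+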
+.. _removing-monitors:
+
+Removing Monitors
+=================
+
+When monitors are removed from a cluster, it is important to remember
+that Ceph monitors use Paxos to maintain consensus about the cluster
+map. Such consensus is possible only if the number of monitors is sufficient
+to establish quorum.
+
+
+.. _Removing a Monitor (Manual):
+
+Removing a Monitor (Manual)
+---------------------------
+
+The procedure in this section removes a ``ceph-mon`` daemon from the cluster.
+The procedure might result in a Ceph cluster that contains a number of monitors
+insufficient to maintain quorum, so plan carefully. When replacing an old
+monitor with a new monitor, add the new monitor first, wait for quorum to be
+established, and then remove the old monitor. This ensures that quorum is not
+lost.
+
+
+#. Stop the monitor:
+
+ .. prompt:: bash $
+
+ service ceph -a stop mon.{mon-id}
+
+#. Remove the monitor from the cluster:
+
+ .. prompt:: bash $
+
+ ceph mon remove {mon-id}
+
+#. Remove the monitor entry from the ``ceph.conf`` file:
+
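+   For example, a hypothetical entry of the following form (the monitor id,
+   hostname, and address shown here are assumptions) would be deleted::
+
+       [mon.c]
+       host = host03
+       addr = 10.0.0.3:6789
+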
+.. _rados-mon-remove-from-unhealthy:
+
+
+Removing Monitors from an Unhealthy Cluster
+-------------------------------------------
+
+The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy
+cluster (for example, a cluster whose monitors are unable to form a quorum).
+
+#. Stop all ``ceph-mon`` daemons on all monitor hosts:
+
+ .. prompt:: bash $
+
+ ssh {mon-host}
+ systemctl stop ceph-mon.target
+
+ Repeat this step on every monitor host.
+
+#. Identify a surviving monitor and log in to the monitor's host:
+
+ .. prompt:: bash $
+
+ ssh {mon-host}
+
+#. Extract a copy of the ``monmap`` file by running a command of the following
+ form:
+
+ .. prompt:: bash $
+
+ ceph-mon -i {mon-id} --extract-monmap {map-path}
+
+ Here is a more concrete example. In this example, ``hostname`` is the
+   ``{mon-id}`` and ``/tmp/monmap`` is the ``{map-path}``:
+
+ .. prompt:: bash $
+
+ ceph-mon -i `hostname` --extract-monmap /tmp/monmap
+
+#. Remove the non-surviving or otherwise problematic monitors:
+
+ .. prompt:: bash $
+
+ monmaptool {map-path} --rm {mon-id}
+
+ For example, suppose that there are three monitors |---| ``mon.a``, ``mon.b``,
+ and ``mon.c`` |---| and that only ``mon.a`` will survive:
+
+ .. prompt:: bash $
+
+ monmaptool /tmp/monmap --rm b
+ monmaptool /tmp/monmap --rm c
+
+#. Inject the modified monmap, from which the problematic monitors have been
+   removed, into the surviving monitor(s):
+
+ .. prompt:: bash $
+
+ ceph-mon -i {mon-id} --inject-monmap {map-path}
+
+ Continuing with the above example, inject a map into monitor ``mon.a`` by
+ running the following command:
+
+ .. prompt:: bash $
+
+ ceph-mon -i a --inject-monmap /tmp/monmap
+
+
+#. Start only the surviving monitors.
+
+#. Verify that the monitors form a quorum by running the command ``ceph -s``.
+
+#. The data directory of the removed monitors is in ``/var/lib/ceph/mon``:
+ either archive this data directory in a safe location or delete this data
+ directory. However, do not delete it unless you are confident that the
+ remaining monitors are healthy and sufficiently redundant. Make sure that
+ there is enough room for the live DB to expand and compact, and make sure
+ that there is also room for an archived copy of the DB. The archived copy
+ can be compressed.
+
+.. _Changing a Monitor's IP address:
+
+Changing a Monitor's IP Address
+===============================
+
+.. important:: Existing monitors are not supposed to change their IP addresses.
+
+Monitors are critical components of a Ceph cluster. The entire system can work
+properly only if the monitors maintain quorum, and quorum can be established
+only if the monitors have discovered each other by means of their IP addresses.
+Ceph has strict requirements on the discovery of monitors.
+
+Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons
+to discover monitors, the monitor map is used by monitors to discover each
+other. This is why it is necessary to obtain the current ``monmap`` at the time
+a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_,
+the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id}
+--mkfs`` command. The following sections explain the consistency requirements
+for Ceph monitors, and also explain a number of safe ways to change a monitor's
+IP address.
+
+
+Consistency Requirements
+------------------------
+
+When a monitor discovers other monitors in the cluster, it always refers to the
+local copy of the monitor map. Using the monitor map instead of using the
+``ceph.conf`` file avoids errors that could break the cluster (for example,
+typos or other slight errors in ``ceph.conf`` when a monitor address or port is
+specified). Because monitors use monitor maps for discovery and because they
+share monitor maps with Ceph clients and other Ceph daemons, the monitor map
+provides monitors with a strict guarantee that their consensus is valid.
+
+Strict consistency also applies to updates to the monmap. As with any other
+updates on the monitor, changes to the monmap always run through a distributed
+consensus algorithm called `Paxos`_. The monitors must agree on each update to
+the monmap, such as adding or removing a monitor, to ensure that each monitor
+in the quorum has the same version of the monmap. Updates to the monmap are
+incremental so that monitors have the latest agreed upon version, and a set of
+previous versions, allowing a monitor that has an older version of the monmap
+to catch up with the current state of the cluster.
+
+There are additional advantages to using the monitor map rather than
+``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not
+automatically updated and distributed, its use would bring certain risks:
+monitors might use an outdated ``ceph.conf`` file, might fail to recognize a
+specific monitor, might fall out of quorum, and might develop a situation in
+which `Paxos`_ is unable to accurately ascertain the current state of the
+system. Because of these risks, any changes to an existing monitor's IP address
+must be made with great care.
+
+.. _operations_add_or_rm_mons_changing_mon_ip:
+
+Changing a Monitor's IP Address (Preferred Method)
+--------------------------------------------------
+
+If a monitor's IP address is changed only in the ``ceph.conf`` file, there is
+no guarantee that the other monitors in the cluster will receive the update.
+For this reason, the preferred method to change a monitor's IP address is as
+follows: add a new monitor with the desired IP address (as described in `Adding
+a Monitor (Manual)`_), make sure that the new monitor successfully joins the
+quorum, remove the monitor that is using the old IP address, and update the
+``ceph.conf`` file to ensure that clients and other daemons are made aware of
+the new monitor's IP address.
+
+For example, suppose that there are three monitors in place::
+
+ [mon.a]
+ host = host01
+ addr = 10.0.0.1:6789
+ [mon.b]
+ host = host02
+ addr = 10.0.0.2:6789
+ [mon.c]
+ host = host03
+ addr = 10.0.0.3:6789
+
+To change ``mon.c`` so that its name is ``host04`` and its IP address is
+``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new
+monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing
+``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing
+a Monitor (Manual)`_ to remove ``mon.c``. To move all three monitors to new IP
+addresses, repeat this process.
+
+Changing a Monitor's IP address (Advanced Method)
+-------------------------------------------------
+
+There are cases in which the method outlined in :ref:`Changing a Monitor's IP
+Address (Preferred Method) <operations_add_or_rm_mons_changing_mon_ip>` cannot
+be used. For example, it might be necessary to move the cluster's monitors to a
+different network, to a different part of the datacenter, or to a different
+datacenter altogether. It is still possible to change the monitors' IP
+addresses, but a different method must be used.
+
+For such cases, a new monitor map with updated IP addresses for every monitor
+in the cluster must be generated and injected on each monitor. Although this
+method is not particularly easy, such a major migration is unlikely to be a
+routine task. As stated at the beginning of this section, existing monitors are
+not supposed to change their IP addresses.
+
+Continue with the monitor configuration in the example from :ref:`Changing a
+Monitor's IP Address (Preferred Method)
+<operations_add_or_rm_mons_changing_mon_ip>`. Suppose that all of the monitors
+are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that
+these networks are unable to communicate. Carry out the following procedure:
+
+#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor
+ map, and ``{filename}`` is the name of the file that contains the retrieved
+ monitor map):
+
+ .. prompt:: bash $
+
+ ceph mon getmap -o {tmp}/{filename}
+
+#. Check the contents of the monitor map:
+
+ .. prompt:: bash $
+
+ monmaptool --print {tmp}/{filename}
+
+ ::
+
+ monmaptool: monmap file {tmp}/{filename}
+ epoch 1
+ fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
+ last_changed 2012-12-17 02:46:41.591248
+ created 2012-12-17 02:46:41.591248
+ 0: 10.0.0.1:6789/0 mon.a
+ 1: 10.0.0.2:6789/0 mon.b
+ 2: 10.0.0.3:6789/0 mon.c
+
+#. Remove the existing monitors from the monitor map:
+
+ .. prompt:: bash $
+
+ monmaptool --rm a --rm b --rm c {tmp}/{filename}
+
+ ::
+
+ monmaptool: monmap file {tmp}/{filename}
+ monmaptool: removing a
+ monmaptool: removing b
+ monmaptool: removing c
+ monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
+
+#. Add the new monitor locations to the monitor map:
+
+ .. prompt:: bash $
+
+ monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}
+
+ ::
+
+ monmaptool: monmap file {tmp}/{filename}
+ monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)
+
+#. Check the new contents of the monitor map:
+
+ .. prompt:: bash $
+
+ monmaptool --print {tmp}/{filename}
+
+ ::
+
+ monmaptool: monmap file {tmp}/{filename}
+ epoch 1
+ fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
+ last_changed 2012-12-17 02:46:41.591248
+ created 2012-12-17 02:46:41.591248
+ 0: 10.1.0.1:6789/0 mon.a
+ 1: 10.1.0.2:6789/0 mon.b
+ 2: 10.1.0.3:6789/0 mon.c
+
+At this point, we assume that the monitors (and stores) have been installed at
+the new location. Next, propagate the modified monitor map to the new monitors,
+and inject the modified monitor map into each new monitor.
+
+#. Make sure all of your monitors have been stopped. Never inject into a
+ monitor while the monitor daemon is running.
+
+#. Inject the monitor map:
+
+ .. prompt:: bash $
+
+ ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}
+
+#. Restart all of the monitors.
+
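+   For example, on each monitor host you might restart the monitor service
+   with a command of the following form (a sketch; the exact unit name
+   depends on how your cluster was deployed):
+
+   .. prompt:: bash $
+
+      sudo systemctl start ceph-mon.target
+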
+Migration to the new location is now complete. The monitors should operate
+successfully.
+
+
+
+.. _Manual Deployment: ../../../install/manual-deployment
+.. _Monitor Bootstrap: ../../../dev/mon-bootstrap
+.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
+
+.. |---| unicode:: U+2014 .. EM DASH
+ :trim:
diff --git a/doc/rados/operations/add-or-rm-osds.rst b/doc/rados/operations/add-or-rm-osds.rst
new file mode 100644
index 000000000..1a6621148
--- /dev/null
+++ b/doc/rados/operations/add-or-rm-osds.rst
@@ -0,0 +1,419 @@
+======================
+ Adding/Removing OSDs
+======================
+
+When a cluster is up and running, it is possible to add or remove OSDs.
+
+Adding OSDs
+===========
+
+OSDs can be added to a cluster in order to expand the cluster's capacity and
+resilience. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on one
+storage drive within a host machine. But if your host machine has multiple
+storage drives, you may map one ``ceph-osd`` daemon for each drive on the
+machine.
+
+It's a good idea to check the capacity of your cluster so that you know when it
+approaches its capacity limits. If your cluster has reached its ``near full``
+ratio, then you should add OSDs to expand your cluster's capacity.
+
+.. warning:: Do not add an OSD after your cluster has reached its ``full
+ ratio``. OSD failures that occur after the cluster reaches its ``near full
+ ratio`` might cause the cluster to exceed its ``full ratio``.
+
+
+Deploying your Hardware
+-----------------------
+
+If you are also adding a new host when adding a new OSD, see `Hardware
+Recommendations`_ for details on minimum recommendations for OSD hardware. To
+add an OSD host to your cluster, begin by making sure that an appropriate
+version of Linux has been installed on the host machine and that all initial
+preparations for your storage drives have been carried out. For details, see
+`Filesystem Recommendations`_.
+
+Next, add your OSD host to a rack in your cluster, connect the host to the
+network, and ensure that the host has network connectivity. For details, see
+`Network Configuration Reference`_.
+
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations
+.. _Network Configuration Reference: ../../configuration/network-config-ref
+
+Installing the Required Software
+--------------------------------
+
+If your cluster has been manually deployed, you will need to install Ceph
+software packages manually. For details, see `Installing Ceph (Manual)`_.
+Configure SSH for the appropriate user to have both passwordless authentication
+and root permissions.
+
+.. _Installing Ceph (Manual): ../../../install
+
+
+Adding an OSD (Manual)
+----------------------
+
+The following procedure sets up a ``ceph-osd`` daemon, configures this OSD to
+use one drive, and configures the cluster to distribute data to the OSD. If
+your host machine has multiple drives, you may add an OSD for each drive on the
+host by repeating this procedure.
+
+As the following procedure will demonstrate, adding an OSD involves creating a
+metadata directory for it, configuring a data storage drive, adding the OSD to
+the cluster, and then adding it to the CRUSH map.
+
+When you add the OSD to the CRUSH map, you will need to consider the weight you
+assign to the new OSD. Since storage drive capacities increase over time, newer
+OSD hosts are likely to have larger hard drives than the older hosts in the
+cluster have and therefore might have greater weight as well.
+
+.. tip:: Ceph works best with uniform hardware across pools. It is possible to
+ add drives of dissimilar size and then adjust their weights accordingly.
+ However, for best performance, consider a CRUSH hierarchy that has drives of
+ the same type and size. It is better to add larger drives uniformly to
+ existing hosts. This can be done incrementally, replacing smaller drives
+ each time the new drives are added.
+
+#. Create the new OSD by running a command of the following form. If you opt
+ not to specify a UUID in this command, the UUID will be set automatically
+ when the OSD starts up. The OSD number, which is needed for subsequent
+ steps, is found in the command's output:
+
+ .. prompt:: bash $
+
+ ceph osd create [{uuid} [{id}]]
+
+   If the optional parameter ``{id}`` is specified, it will be used as the OSD
+   ID. However, if that ID number is already in use, the command will fail.
+
+ .. warning:: Explicitly specifying the ``{id}`` parameter is not
+ recommended. IDs are allocated as an array, and any skipping of entries
+ consumes extra memory. This memory consumption can become significant if
+ there are large gaps or if clusters are large. By leaving the ``{id}``
+ parameter unspecified, we ensure that Ceph uses the smallest ID number
+ available and that these problems are avoided.
+
+#. Create the default directory for your new OSD by running commands of the
+ following form:
+
+ .. prompt:: bash $
+
+ ssh {new-osd-host}
+ sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
+
+#. If the OSD will be created on a drive other than the OS drive, prepare it
+ for use with Ceph. Run commands of the following form:
+
+ .. prompt:: bash $
+
+ ssh {new-osd-host}
+ sudo mkfs -t {fstype} /dev/{drive}
+ sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number}
+
+#. Initialize the OSD data directory by running commands of the following form:
+
+ .. prompt:: bash $
+
+ ssh {new-osd-host}
+ ceph-osd -i {osd-num} --mkfs --mkkey
+
+ Make sure that the directory is empty before running ``ceph-osd``.
+
+#. Register the OSD authentication key by running a command of the following
+ form:
+
+ .. prompt:: bash $
+
+ ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring
+
+ This presentation of the command has ``ceph-{osd-num}`` in the listed path
+ because many clusters have the name ``ceph``. However, if your cluster name
+ is not ``ceph``, then the string ``ceph`` in ``ceph-{osd-num}`` needs to be
+ replaced with your cluster name. For example, if your cluster name is
+ ``cluster1``, then the path in the command should be
+ ``/var/lib/ceph/osd/cluster1-{osd-num}/keyring``.
+
+#. Add the OSD to the CRUSH map by running the following command. This allows
+ the OSD to begin receiving data. The ``ceph osd crush add`` command can add
+ OSDs to the CRUSH hierarchy wherever you want. If you specify one or more
+ buckets, the command places the OSD in the most specific of those buckets,
+ and it moves that bucket underneath any other buckets that you have
+ specified. **Important:** If you specify only the root bucket, the command
+ will attach the OSD directly to the root, but CRUSH rules expect OSDs to be
+   inside of hosts. If the OSDs are not inside hosts, the OSDs will likely not
+ receive any data.
+
+ .. prompt:: bash $
+
+ ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...]
+
+ Note that there is another way to add a new OSD to the CRUSH map: decompile
+ the CRUSH map, add the OSD to the device list, add the host as a bucket (if
+ it is not already in the CRUSH map), add the device as an item in the host,
+ assign the device a weight, recompile the CRUSH map, and set the CRUSH map.
+ For details, see `Add/Move an OSD`_. This is rarely necessary with recent
+ releases (this sentence was written the month that Reef was released).
+
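+   As a hypothetical illustration of the ``ceph osd crush add`` command shown
+   above (the OSD name, weight, and bucket names here are assumptions), the
+   following places ``osd.12`` with a weight of ``1.0`` under host ``node2``
+   in rack ``rack1``:
+
+   .. prompt:: bash $
+
+      ceph osd crush add osd.12 1.0 host=node2 rack=rack1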
+
+.. _rados-replacing-an-osd:
+
+Replacing an OSD
+----------------
+
+.. note:: If the procedure in this section does not work for you, try the
+ instructions in the ``cephadm`` documentation:
+ :ref:`cephadm-replacing-an-osd`.
+
+Sometimes OSDs need to be replaced: for example, when a disk fails, or when an
+administrator wants to reprovision OSDs with a new back end (perhaps when
+switching from Filestore to BlueStore). Replacing an OSD differs from `Removing
+the OSD`_ in that the replaced OSD's ID and CRUSH map entry must be kept intact
+after the OSD is destroyed for replacement.
+
+
+#. Make sure that it is safe to destroy the OSD:
+
+ .. prompt:: bash $
+
+ while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done
+
+#. Destroy the OSD:
+
+ .. prompt:: bash $
+
+ ceph osd destroy {id} --yes-i-really-mean-it
+
+#. *Optional*: If the disk that you plan to use is not a new disk and has been
+ used before for other purposes, zap the disk:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm zap /dev/sdX
+
+#. Prepare the disk for replacement by using the ID of the OSD that was
+ destroyed in previous steps:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm prepare --osd-id {id} --data /dev/sdX
+
+#. Finally, activate the OSD:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm activate {id} {fsid}
+
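+   The ``{fsid}`` here is the new OSD's own fsid (not the cluster fsid). One
+   way to look it up is to list the logical volumes known to ``ceph-volume``
+   (a sketch; the output format can vary between releases):
+
+   .. prompt:: bash $
+
+      ceph-volume lvm list
+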
+Alternatively, instead of carrying out the final two steps (preparing the disk
+and activating the OSD), you can re-create the OSD by running a single command
+of the following form:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm create --osd-id {id} --data /dev/sdX
+
+Starting the OSD
+----------------
+
+After an OSD is added to Ceph, the OSD is in the cluster. However, until it is
+started, the OSD is considered ``down`` and ``in``. The OSD is not running and
+will be unable to receive data. To start an OSD, either run ``service ceph``
+from your admin host or run a command of the following form to start the OSD
+from its host machine:
+
+ .. prompt:: bash $
+
+ sudo systemctl start ceph-osd@{osd-num}
+
+After the OSD is started, it is considered ``up`` and ``in``.
+
+Observing the Data Migration
+----------------------------
+
+After the new OSD has been added to the CRUSH map, Ceph begins rebalancing the
+cluster by migrating placement groups (PGs) to the new OSD. To observe this
+process by using the `ceph`_ tool, run the following command:
+
+ .. prompt:: bash $
+
+ ceph -w
+
+Or:
+
+ .. prompt:: bash $
+
+ watch ceph status
+
+The PG states will first change from ``active+clean`` to ``active, some
+degraded objects`` and then return to ``active+clean`` when migration
+completes. When you are finished observing, press Ctrl-C to exit.
+
+.. _Add/Move an OSD: ../crush-map#addosd
+.. _ceph: ../monitoring
+
+
+Removing OSDs (Manual)
+======================
+
+It is possible to remove an OSD manually while the cluster is running: you
+might want to do this in order to reduce the size of the cluster or when
+replacing hardware. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on
+one storage drive within a host machine. Alternatively, if your host machine
+has multiple storage drives, you might need to remove multiple ``ceph-osd``
+daemons: one daemon for each drive on the machine.
+
+.. warning:: Before you begin the process of removing an OSD, make sure that
+ your cluster is not near its ``full ratio``. Otherwise the act of removing
+ OSDs might cause the cluster to reach or exceed its ``full ratio``.
+
+
+Taking the OSD ``out`` of the Cluster
+-------------------------------------
+
+OSDs are typically ``up`` and ``in`` before they are removed from the cluster.
+Before the OSD can be removed from the cluster, the OSD must be taken ``out``
+of the cluster so that Ceph can begin rebalancing and copying its data to other
+OSDs. To take an OSD ``out`` of the cluster, run a command of the following
+form:
+
+ .. prompt:: bash $
+
+ ceph osd out {osd-num}
+
+
+Observing the Data Migration
+----------------------------
+
+After the OSD has been taken ``out`` of the cluster, Ceph begins rebalancing
+the cluster by migrating placement groups out of the OSD that was removed. To
+observe this process by using the `ceph`_ tool, run the following command:
+
+ .. prompt:: bash $
+
+ ceph -w
+
+The PG states will change from ``active+clean`` to ``active, some degraded
+objects`` and will then return to ``active+clean`` when migration completes.
+When you are finished observing, press Ctrl-C to exit.
+
+.. note:: Under certain conditions, the action of taking ``out`` an OSD
+ might lead CRUSH to encounter a corner case in which some PGs remain stuck
+ in the ``active+remapped`` state. This problem sometimes occurs in small
+ clusters with few hosts (for example, in a small testing cluster). To
+ address this problem, mark the OSD ``in`` by running a command of the
+ following form:
+
+ .. prompt:: bash $
+
+ ceph osd in {osd-num}
+
+ After the OSD has come back to its initial state, do not mark the OSD
+ ``out`` again. Instead, set the OSD's weight to ``0`` by running a command
+ of the following form:
+
+ .. prompt:: bash $
+
+ ceph osd crush reweight osd.{osd-num} 0
+
+ After the OSD has been reweighted, observe the data migration and confirm
+ that it has completed successfully. The difference between marking an OSD
+ ``out`` and reweighting the OSD to ``0`` has to do with the bucket that
+ contains the OSD. When an OSD is marked ``out``, the weight of the bucket is
+ not changed. But when an OSD is reweighted to ``0``, the weight of the
+ bucket is updated (namely, the weight of the OSD is subtracted from the
+ overall weight of the bucket). When operating small clusters, it can
+ sometimes be preferable to use the above reweight command.
+
+
+Stopping the OSD
+----------------
+
+After you take an OSD ``out`` of the cluster, the OSD might still be running.
+In such a case, the OSD is ``up`` and ``out``. Before it is removed from the
+cluster, the OSD must be stopped by running commands of the following form:
+
+ .. prompt:: bash $
+
+ ssh {osd-host}
+ sudo systemctl stop ceph-osd@{osd-num}
+
+After the OSD has been stopped, it is ``down``.
+
+
+Removing the OSD
+----------------
+
+The following procedure removes an OSD from the cluster map, removes the OSD's
+authentication key, removes the OSD from the OSD map, and removes the OSD from
+the ``ceph.conf`` file. If your host has multiple drives, it might be necessary
+to remove an OSD from each drive by repeating this procedure.
+
+#. Begin by having the cluster forget the OSD. This step removes the OSD from
+ the CRUSH map, removes the OSD's authentication key, and removes the OSD
+ from the OSD map. (The :ref:`purge subcommand <ceph-admin-osd>` was
+ introduced in Luminous. For older releases, see :ref:`the procedure linked
+ here <ceph_osd_purge_procedure_pre_luminous>`.):
+
+ .. prompt:: bash $
+
+ ceph osd purge {id} --yes-i-really-mean-it
+
+
+#. Navigate to the host where the master copy of the cluster's
+ ``ceph.conf`` file is kept:
+
+ .. prompt:: bash $
+
+ ssh {admin-host}
+ cd /etc/ceph
+ vim ceph.conf
+
+#. Remove the OSD entry from your ``ceph.conf`` file (if such an entry
+ exists)::
+
+ [osd.1]
+ host = {hostname}
+
+#. Copy the updated ``ceph.conf`` file from the location on the host where the
+ master copy of the cluster's ``ceph.conf`` is kept to the ``/etc/ceph``
+ directory of the other hosts in your cluster.
+
+.. _ceph_osd_purge_procedure_pre_luminous:
+
+If your Ceph cluster is older than Luminous, you will be unable to use the
+``ceph osd purge`` command. Instead, carry out the following procedure:
+
+#. Remove the OSD from the CRUSH map so that it no longer receives data (for
+ more details, see `Remove an OSD`_):
+
+ .. prompt:: bash $
+
+ ceph osd crush remove {name}
+
+ Instead of removing the OSD from the CRUSH map, you might opt for one of two
+ alternatives: (1) decompile the CRUSH map, remove the OSD from the device
+ list, and remove the device from the host bucket; (2) remove the host bucket
+ from the CRUSH map (provided that it is in the CRUSH map and that you intend
+   to remove the host), recompile the map, and set it.
+
+
+#. Remove the OSD authentication key:
+
+ .. prompt:: bash $
+
+ ceph auth del osd.{osd-num}
+
+#. Remove the OSD:
+
+ .. prompt:: bash $
+
+ ceph osd rm {osd-num}
+
+ For example:
+
+ .. prompt:: bash $
+
+ ceph osd rm 1
+
+.. _Remove an OSD: ../crush-map#removeosd
diff --git a/doc/rados/operations/balancer.rst b/doc/rados/operations/balancer.rst
new file mode 100644
index 000000000..aa4eab93c
--- /dev/null
+++ b/doc/rados/operations/balancer.rst
@@ -0,0 +1,221 @@
+.. _balancer:
+
+Balancer Module
+=======================
+
+The *balancer* can optimize the allocation of placement groups (PGs) across
+OSDs in order to achieve a balanced distribution. The balancer can operate
+either automatically or in a supervised fashion.
+
+
+Status
+------
+
+To check the current status of the balancer, run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer status
+
+
+Automatic balancing
+-------------------
+
+When the balancer is in ``upmap`` mode, the automatic balancing feature is
+enabled by default. For more details, see :ref:`upmap`. To disable the
+balancer, run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer off
+
+The balancer mode can be changed from ``upmap`` mode to ``crush-compat`` mode.
+``crush-compat`` mode is backward compatible with older clients. In
+``crush-compat`` mode, the balancer automatically makes small changes to the
+data distribution in order to ensure that OSDs are utilized equally.
+
+
+Throttling
+----------
+
+If the cluster is degraded (that is, if an OSD has failed and the system hasn't
+healed itself yet), then the balancer will not make any adjustments to the PG
+distribution.
+
+When the cluster is healthy, the balancer will incrementally move a small
+fraction of unbalanced PGs in order to improve distribution. This fraction
+will not exceed a certain threshold that defaults to 5%. To adjust this
+``target_max_misplaced_ratio`` threshold setting, run the following command:
+
+ .. prompt:: bash $
+
+ ceph config set mgr target_max_misplaced_ratio .07 # 7%
+
+The balancer sleeps between runs. To set the number of seconds for this
+interval of sleep, run the following command:
+
+ .. prompt:: bash $
+
+ ceph config set mgr mgr/balancer/sleep_interval 60
+
+To set the time of day (in HHMM format) at which automatic balancing begins,
+run the following command:
+
+ .. prompt:: bash $
+
+ ceph config set mgr mgr/balancer/begin_time 0000
+
+To set the time of day (in HHMM format) at which automatic balancing ends, run
+the following command:
+
+ .. prompt:: bash $
+
+ ceph config set mgr mgr/balancer/end_time 2359
+
+Automatic balancing can be restricted to certain days of the week. To restrict
+it to a specific day of the week or later (as with crontab, ``0`` is Sunday,
+``1`` is Monday, and so on), run the following command:
+
+ .. prompt:: bash $
+
+ ceph config set mgr mgr/balancer/begin_weekday 0
+
+To restrict automatic balancing to a specific day of the week or earlier
+(again, ``0`` is Sunday, ``1`` is Monday, and so on), run the following
+command:
+
+ .. prompt:: bash $
+
+ ceph config set mgr mgr/balancer/end_weekday 6
+
+Automatic balancing can be restricted to certain pools. By default, the value
+of this setting is an empty string, so that all pools are automatically
+balanced. To restrict automatic balancing to specific pools, retrieve their
+numeric pool IDs (by running the :command:`ceph osd pool ls detail` command),
+and then run the following command:
+
+ .. prompt:: bash $
+
+ ceph config set mgr mgr/balancer/pool_ids 1,2,3
+
+
+Modes
+-----
+
+There are two supported balancer modes:
+
+#. **crush-compat**. This mode uses the compat weight-set feature (introduced
+ in Luminous) to manage an alternative set of weights for devices in the
+ CRUSH hierarchy. When the balancer is operating in this mode, the normal
+ weights should remain set to the size of the device in order to reflect the
+ target amount of data intended to be stored on the device. The balancer will
+ then optimize the weight-set values, adjusting them up or down in small
+ increments, in order to achieve a distribution that matches the target
+ distribution as closely as possible. (Because PG placement is a pseudorandom
+ process, it is subject to a natural amount of variation; optimizing the
+ weights serves to counteract that natural variation.)
+
+ Note that this mode is *fully backward compatible* with older clients: when
+ an OSD Map and CRUSH map are shared with older clients, Ceph presents the
+ optimized weights as the "real" weights.
+
+ The primary limitation of this mode is that the balancer cannot handle
+ multiple CRUSH hierarchies with different placement rules if the subtrees of
+ the hierarchy share any OSDs. (Such sharing of OSDs is not typical and,
+ because of the difficulty of managing the space utilization on the shared
+ OSDs, is generally not recommended.)
+
+#. **upmap**. In Luminous and later releases, the OSDMap can store explicit
+ mappings for individual OSDs as exceptions to the normal CRUSH placement
+ calculation. These ``upmap`` entries provide fine-grained control over the
+ PG mapping. This balancer mode optimizes the placement of individual PGs in
+ order to achieve a balanced distribution. In most cases, the resulting
+ distribution is nearly perfect: that is, there is an equal number of PGs on
+ each OSD (±1 PG, since the total number might not divide evenly).
+
+ To use ``upmap``, all clients must be Luminous or newer.
+
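+   One way to check which client releases are connected, and to require a
+   minimum client release before relying on ``upmap``, is sketched below
+   (these are standard Ceph commands, but verify them against your release):
+
+   .. prompt:: bash $
+
+      ceph features
+      ceph osd set-require-min-compat-client luminous
+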
+The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by
+running the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer mode crush-compat
+
+Supervised optimization
+-----------------------
+
+Supervised use of the balancer can be understood in terms of three distinct
+phases:
+
+#. building a plan
+#. evaluating the quality of the data distribution, either for the current PG
+ distribution or for the PG distribution that would result after executing a
+ plan
+#. executing the plan
+
+To evaluate the current distribution, run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer eval
+
+To evaluate the distribution for a single pool, run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer eval <pool-name>
+
+To see the evaluation in greater detail, run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer eval-verbose ...
+
+To instruct the balancer to generate a plan (using the currently configured
+mode), make up a name (any useful identifying string) for the plan, and run the
+following command:
+
+ .. prompt:: bash $
+
+ ceph balancer optimize <plan-name>
+
+To see the contents of a plan, run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer show <plan-name>
+
+To display all plans, run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer ls
+
+To discard an old plan, run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer rm <plan-name>
+
+To see currently recorded plans, examine the output of the following status
+command:
+
+ .. prompt:: bash $
+
+ ceph balancer status
+
+To evaluate the distribution that would result from executing a specific plan,
+run the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer eval <plan-name>
+
+If a plan is expected to improve the distribution (that is, the plan's score is
+lower than the current cluster state's score), you can execute that plan by
+running the following command:
+
+ .. prompt:: bash $
+
+ ceph balancer execute <plan-name>
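+
+Putting these commands together, a hypothetical supervised session might look
+like the following (the plan name ``nightly-plan`` is an assumption; any
+identifying string will do):
+
+.. prompt:: bash $
+
+   ceph balancer off
+   ceph balancer eval
+   ceph balancer optimize nightly-plan
+   ceph balancer show nightly-plan
+   ceph balancer eval nightly-plan
+   ceph balancer execute nightly-plan
+   ceph balancer on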
diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst
new file mode 100644
index 000000000..d24782c46
--- /dev/null
+++ b/doc/rados/operations/bluestore-migration.rst
@@ -0,0 +1,357 @@
+.. _rados_operations_bluestore_migration:
+
+=====================
+ BlueStore Migration
+=====================
+.. warning:: Filestore has been deprecated in the Reef release and is no longer supported.
+ Please migrate to BlueStore.
+
+Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph
+cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs.
+Because BlueStore is superior to Filestore in performance and robustness, and
+because Filestore is not supported by Ceph releases beginning with Reef, users
+deploying Filestore OSDs should transition to BlueStore. There are several
+strategies for making the transition to BlueStore.
+
+BlueStore is so different from Filestore that an individual OSD cannot be
+converted in place. Instead, the conversion process must use either (1) the
+cluster's normal replication and healing support, or (2) tools and strategies
+that copy OSD content from an old (Filestore) device to a new (BlueStore) one.
+
+Deploying new OSDs with BlueStore
+=================================
+
+Use BlueStore when deploying new OSDs (for example, when the cluster is
+expanded). Because this is the default behavior, no specific change is
+needed.
+
+Similarly, use BlueStore for any OSDs that have been reprovisioned after
+a failed drive was replaced.
+
+Converting existing OSDs
+========================
+
+"Mark-``out``" replacement
+--------------------------
+
+The simplest approach is to verify that the cluster is healthy and
+then follow these steps for each Filestore OSD in succession: mark the OSD
+``out``, wait for the data to replicate across the cluster, reprovision the OSD,
+mark the OSD back ``in``, and wait for recovery to complete before proceeding
+to the next OSD. This approach is easy to automate, but it entails unnecessary
+data migration that carries costs in time and SSD wear.
+
+#. Identify a Filestore OSD to replace::
+
+ ID=<osd-id-number>
+ DEVICE=<disk-device>
+
+ #. Determine whether a given OSD is Filestore or BlueStore:
+
+ .. prompt:: bash $
+
+ ceph osd metadata $ID | grep osd_objectstore
+
+ #. Get a current count of Filestore and BlueStore OSDs:
+
+ .. prompt:: bash $
+
+ ceph osd count-metadata osd_objectstore
+
+#. Mark a Filestore OSD ``out``:
+
+ .. prompt:: bash $
+
+ ceph osd out $ID
+
+#. Wait for the data to migrate off this OSD:
+
+ .. prompt:: bash $
+
+ while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
+
+#. Stop the OSD:
+
+ .. prompt:: bash $
+
+ systemctl kill ceph-osd@$ID
+
+ .. _osd_id_retrieval:
+
+#. Note which device the OSD is using:
+
+ .. prompt:: bash $
+
+ mount | grep /var/lib/ceph/osd/ceph-$ID
+
+#. Unmount the OSD:
+
+ .. prompt:: bash $
+
+ umount /var/lib/ceph/osd/ceph-$ID
+
+#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! These commands will destroy
+ the contents of the device; you must be certain that the data on the device is
+ not needed (in other words, that the cluster is healthy) before proceeding:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm zap $DEVICE
+
+#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
+ reprovisioned with the same OSD ID):
+
+ .. prompt:: bash $
+
+ ceph osd destroy $ID --yes-i-really-mean-it
+
+#. Provision a BlueStore OSD in place by using the same OSD ID. This requires
+ you to identify which device to wipe, and to make certain that you target
+ the correct and intended device, using the information that was retrieved in
+ the :ref:`"Note which device the OSD is using" <osd_id_retrieval>` step. BE
+ CAREFUL! Note that you may need to modify these commands when dealing with
+ hybrid OSDs:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
+
+#. Repeat.
+
+You may opt to (1) have the balancing of the replacement BlueStore OSD take
+place concurrently with the draining of the next Filestore OSD, or instead
+(2) follow the same procedure for multiple OSDs in parallel. In either case,
+however, you must ensure that the cluster is fully clean (in other words, that
+all data has all replicas) before destroying any OSDs. If you opt to reprovision
+multiple OSDs in parallel, be **very** careful to destroy OSDs only within a
+single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to
+satisfy this requirement will reduce the redundancy and availability of your
+data and increase the risk of data loss (or even guarantee data loss).
+
+Advantages:
+
+* Simple.
+* Can be done on a device-by-device basis.
+* No spare devices or hosts are required.
+
+Disadvantages:
+
+* Data is copied over the network twice: once to another OSD in the cluster (to
+ maintain the specified number of replicas), and again back to the
+ reprovisioned BlueStore OSD.
+
+"Whole host" replacement
+------------------------
+
+If you have a spare host in the cluster, or sufficient free space to evacuate
+an entire host for use as a spare, then the conversion can be done on a
+host-by-host basis so that each stored copy of the data is migrated only once.
+
+To use this approach, you need an empty host that has no OSDs provisioned.
+There are two ways to do this: either by using a new, empty host that is not
+yet part of the cluster, or by offloading data from an existing host that is
+already part of the cluster.
+
+Using a new, empty host
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Ideally the host will have roughly the same capacity as each of the other hosts
+you will be converting. Add the host to the CRUSH hierarchy, but do not attach
+it to the root:
+
+
+.. prompt:: bash $
+
+ NEWHOST=<empty-host-name>
+ ceph osd crush add-bucket $NEWHOST host
+
+Make sure that Ceph packages are installed on the new host.
+
+Using an existing host
+^^^^^^^^^^^^^^^^^^^^^^
+
+If you would like to use an existing host that is already part of the cluster,
+and if there is sufficient free space on that host so that all of its data can
+be migrated off to other cluster hosts, you can do the following (instead of
+using a new, empty host):
+
+.. prompt:: bash $
+
+ OLDHOST=<existing-cluster-host-to-offload>
+ ceph osd crush unlink $OLDHOST default
+
+where "default" is the immediate ancestor in the CRUSH map. (For
+smaller clusters with unmodified configurations this will normally
+be "default", but it might instead be a rack name.) You should now
+see the host at the top of the OSD tree output with no parent:
+
+.. prompt:: bash $
+
+ bin/ceph osd tree
+
+::
+
+ ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
+ -5 0 host oldhost
+ 10 ssd 1.00000 osd.10 up 1.00000 1.00000
+ 11 ssd 1.00000 osd.11 up 1.00000 1.00000
+ 12 ssd 1.00000 osd.12 up 1.00000 1.00000
+ -1 3.00000 root default
+ -2 3.00000 host foo
+ 0 ssd 1.00000 osd.0 up 1.00000 1.00000
+ 1 ssd 1.00000 osd.1 up 1.00000 1.00000
+ 2 ssd 1.00000 osd.2 up 1.00000 1.00000
+ ...
+
+If everything looks good, jump directly to the :ref:`"Wait for the data
+migration to complete" <bluestore_data_migration_step>` step below and proceed
+from there to clean up the old OSDs.
+
+Migration process
+^^^^^^^^^^^^^^^^^
+
+If you're using a new host, start at :ref:`the first step
+<bluestore_migration_process_first_step>`. If you're using an existing host,
+jump to :ref:`this step <bluestore_data_migration_step>`.
+
+.. _bluestore_migration_process_first_step:
+
+#. Provision new BlueStore OSDs for all devices:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm create --bluestore --data /dev/$DEVICE
+
+#. Verify that the new OSDs have joined the cluster:
+
+ .. prompt:: bash $
+
+ ceph osd tree
+
+ You should see the new host ``$NEWHOST`` with all of the OSDs beneath
+ it, but the host should *not* be nested beneath any other node in the
+ hierarchy (like ``root default``). For example, if ``newhost`` is
+ the empty host, you might see something like::
+
+ $ bin/ceph osd tree
+ ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
+ -5 0 host newhost
+ 10 ssd 1.00000 osd.10 up 1.00000 1.00000
+ 11 ssd 1.00000 osd.11 up 1.00000 1.00000
+ 12 ssd 1.00000 osd.12 up 1.00000 1.00000
+ -1 3.00000 root default
+ -2 3.00000 host oldhost1
+ 0 ssd 1.00000 osd.0 up 1.00000 1.00000
+ 1 ssd 1.00000 osd.1 up 1.00000 1.00000
+ 2 ssd 1.00000 osd.2 up 1.00000 1.00000
+ ...
+
+#. Identify the first target host to convert:
+
+ .. prompt:: bash $
+
+ OLDHOST=<existing-cluster-host-to-convert>
+
+#. Swap the new host into the old host's position in the cluster:
+
+ .. prompt:: bash $
+
+ ceph osd crush swap-bucket $NEWHOST $OLDHOST
+
+ At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on
+ ``$NEWHOST``. If there is a difference between the total capacity of the
+ old hosts and the total capacity of the new hosts, you may also see some
+ data migrate to or from other nodes in the cluster. Provided that the hosts
+ are similarly sized, however, this will be a relatively small amount of
+ data.
+
+ .. _bluestore_data_migration_step:
+
+#. Wait for the data migration to complete:
+
+ .. prompt:: bash $
+
+ while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done
+
+#. Stop all old OSDs on the now-empty ``$OLDHOST``:
+
+ .. prompt:: bash $
+
+ ssh $OLDHOST
+ systemctl kill ceph-osd.target
+ umount /var/lib/ceph/osd/ceph-*
+
+#. Destroy and purge the old OSDs:
+
+ .. prompt:: bash $
+
+ for osd in `ceph osd ls-tree $OLDHOST`; do
+ ceph osd purge $osd --yes-i-really-mean-it
+ done
+
+#. Wipe the old OSDs. This requires you to identify which devices are to be
+ wiped manually. BE CAREFUL! For each device:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm zap $DEVICE
+
+#. Use the now-empty host as the new host, and repeat:
+
+ .. prompt:: bash $
+
+ NEWHOST=$OLDHOST
+
+Advantages:
+
+* Data is copied over the network only once.
+* An entire host's OSDs are converted at once.
+* Can be parallelized, to make possible the conversion of multiple hosts at the same time.
+* No host involved in this process needs to have a spare device.
+
+Disadvantages:
+
+* A spare host is required.
+* An entire host's worth of OSDs will be migrating data at a time. This
+ is likely to impact overall cluster performance.
+* All migrated data still makes one full hop over the network.
+
+Per-OSD device copy
+-------------------
+A single logical OSD can be converted by using the ``copy`` function
+included in ``ceph-objectstore-tool``. This requires that the host have one or more free
+devices to provision a new, empty BlueStore OSD. For
+example, if each host in your cluster has twelve OSDs, then you need a
+thirteenth unused OSD so that each OSD can be converted before the
+previous OSD is reclaimed to convert the next OSD.
+
+Caveats:
+
+* This approach requires that we prepare an empty BlueStore OSD but that we do not allocate
+ a new OSD ID to it. The ``ceph-volume`` tool does not support such an operation. **IMPORTANT:**
+ because the setup of *dmcrypt* is closely tied to the identity of the OSD, this approach does not
+ work with encrypted OSDs.
+
+* The device must be manually partitioned.
+
+* An unsupported user-contributed script that demonstrates this process may be found here:
+ https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash
+
+Advantages:
+
+* Provided that the 'noout' or the 'norecover'/'norebalance' flags are set on the OSD or the
+ cluster while the conversion process is underway, little or no data migrates over the
+ network during the conversion.
+
+Disadvantages:
+
+* Tooling is not fully implemented, supported, or documented.
+
+* Each host must have an appropriate spare or empty device for staging.
+
+* The OSD is offline during the conversion, which means new writes to PGs
+ with the OSD in their acting set may not be ideally redundant until the
+ subject OSD comes up and recovers. This increases the risk of data
+ loss due to an overlapping failure. However, if another OSD fails before
+ conversion and startup have completed, the original Filestore OSD can be
+ started to provide access to its original data.
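+
+If you experiment with this approach, the ``noout``, ``norecover``, and
+``norebalance`` flags mentioned above are ordinary cluster flags. A minimal
+sketch of setting and later clearing one of them:
+
+.. prompt:: bash $
+
+   ceph osd set noout
+   # ... perform the conversion ...
+   ceph osd unset noout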
diff --git a/doc/rados/operations/cache-tiering.rst b/doc/rados/operations/cache-tiering.rst
new file mode 100644
index 000000000..127b0141f
--- /dev/null
+++ b/doc/rados/operations/cache-tiering.rst
@@ -0,0 +1,557 @@
+===============
+ Cache Tiering
+===============
+
+.. warning:: Cache tiering has been deprecated in the Reef release as it
+   has lacked a maintainer for a very long time. This does not mean that it
+   will certainly be removed, but we may choose to remove it without much
+   further notice.
+
+A cache tier provides Ceph Clients with better I/O performance for a subset of
+the data stored in a backing storage tier. Cache tiering involves creating a
+pool of relatively fast/expensive storage devices (e.g., solid state drives)
+configured to act as a cache tier, and a backing pool of either erasure-coded
+or relatively slower/cheaper devices configured to act as an economical storage
+tier. The Ceph objecter handles where to place the objects and the tiering
+agent determines when to flush objects from the cache to the backing storage
+tier. So the cache tier and the backing storage tier are completely transparent
+to Ceph clients.
+
+
+.. ditaa::
+ +-------------+
+ | Ceph Client |
+ +------+------+
+ ^
+ Tiering is |
+ Transparent | Faster I/O
+ to Ceph | +---------------+
+ Client Ops | | |
+ | +----->+ Cache Tier |
+ | | | |
+ | | +-----+---+-----+
+ | | | ^
+ v v | | Active Data in Cache Tier
+ +------+----+--+ | |
+ | Objecter | | |
+ +-----------+--+ | |
+ ^ | | Inactive Data in Storage Tier
+ | v |
+ | +-----+---+-----+
+ | | |
+ +----->| Storage Tier |
+ | |
+ +---------------+
+ Slower I/O
+
+
+The cache tiering agent handles the migration of data between the cache tier
+and the backing storage tier automatically. However, admins have the ability to
+configure how this migration takes place by setting the ``cache-mode``. There are
+two main scenarios:
+
+- **writeback** mode: If the base tier and the cache tier are configured in
+ ``writeback`` mode, Ceph clients receive an ACK from the base tier every time
+ they write data to it. Then the cache tiering agent determines whether
+ ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it
+ has been set and the data has been written more than a specified number of
+ times per interval, the data is promoted to the cache tier.
+
+ When Ceph clients need access to data stored in the base tier, the cache
+ tiering agent reads the data from the base tier and returns it to the client.
+ While data is being read from the base tier, the cache tiering agent consults
+ the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and
+ decides whether to promote that data from the base tier to the cache tier.
+ When data has been promoted from the base tier to the cache tier, the Ceph
+ client is able to perform I/O operations on it using the cache tier. This is
+ well-suited for mutable data (for example, photo/video editing, transactional
+ data).
+
+- **readproxy** mode: This mode will use any objects that already
+ exist in the cache tier, but if an object is not present in the
+ cache the request will be proxied to the base tier. This is useful
+ for transitioning from ``writeback`` mode to a disabled cache as it
+ allows the workload to function properly while the cache is drained,
+ without adding any new objects to the cache.
+
+Other cache modes are:
+
+- **readonly** promotes objects to the cache on read operations only; write
+ operations are forwarded to the base tier. This mode is intended for
+ read-only workloads that do not require consistency to be enforced by the
+ storage system. (**Warning**: when objects are updated in the base tier,
+ Ceph makes **no** attempt to sync these updates to the corresponding objects
+ in the cache. Since this mode is considered experimental, a
+ ``--yes-i-really-mean-it`` option must be passed in order to enable it; see
+ the example after this list.)
+
+- **none** is used to completely disable caching.
+
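+Because ``readonly`` mode is experimental, enabling it requires the explicit
+confirmation flag mentioned above. A minimal sketch, with ``{cachepool}``
+standing in for the name of your cache pool:
+
+.. prompt:: bash $
+
+ ceph osd tier cache-mode {cachepool} readonly --yes-i-really-mean-it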
+
+A word of caution
+=================
+
+Cache tiering will *degrade* performance for most workloads. Users should
+exercise extreme caution before using this feature.
+
+* *Workload dependent*: Whether a cache will improve performance is
+ highly dependent on the workload. Because there is a cost
+ associated with moving objects into or out of the cache, it can only
+ be effective when there is a *large skew* in the access pattern in
+ the data set, such that most of the requests touch a small number of
+ objects. The cache pool should be large enough to capture the
+ working set for your workload to avoid thrashing.
+
+* *Difficult to benchmark*: Most benchmarks that users run to measure
+ performance will show terrible performance with cache tiering, in part
+ because very few of them skew requests toward a small set of objects,
+ because it can take a long time for the cache to "warm up," and because
+ the warm-up cost can be high.
+
+* *Usually slower*: For workloads that are not cache tiering-friendly,
+ performance is often slower than a normal RADOS pool without cache
+ tiering enabled.
+
+* *librados object enumeration*: The librados-level object enumeration
+ API is not meant to be coherent in the presence of a cache tier. If
+ your application uses librados directly and relies on object
+ enumeration, cache tiering will probably not work as expected.
+ (This is not a problem for RGW, RBD, or CephFS.)
+
+* *Complexity*: Enabling cache tiering means that a lot of additional
+ machinery and complexity within the RADOS cluster is being used.
+ This increases the probability that you will encounter a bug in the system
+ that other users have not yet encountered and will put your deployment at a
+ higher level of risk.
+
+Known Good Workloads
+--------------------
+
+* *RGW time-skewed*: If the RGW workload is such that almost all read
+ operations are directed at recently written objects, a simple cache
+ tiering configuration that destages recently written objects from
+ the cache to the base tier after a configurable period can work
+ well.
+
+Known Bad Workloads
+-------------------
+
+The following configurations are *known to work poorly* with cache
+tiering.
+
+* *RBD with replicated cache and erasure-coded base*: This is a common
+ request, but usually does not perform well. Even reasonably skewed
+ workloads still send some small writes to cold objects, and because
+ small writes are not yet supported by the erasure-coded pool, entire
+ (usually 4 MB) objects must be migrated into the cache in order to
+ satisfy a small (often 4 KB) write. Only a handful of users have
+ successfully deployed this configuration, and it only works for them
+ because their data is extremely cold (backups) and they are not in
+ any way sensitive to performance.
+
+* *RBD with replicated cache and base*: RBD with a replicated base
+ tier does better than when the base is erasure coded, but it is
+ still highly dependent on the amount of skew in the workload, and
+ very difficult to validate. The user will need to have a good
+ understanding of their workload and will need to tune the cache
+ tiering parameters carefully.
+
+
+Setting Up Pools
+================
+
+To set up cache tiering, you must have two pools. One will act as the
+backing storage and the other will act as the cache.
+
+
+Setting Up a Backing Storage Pool
+---------------------------------
+
+Setting up a backing storage pool typically involves one of two scenarios:
+
+- **Standard Storage**: In this scenario, the pool stores multiple copies
+ of an object in the Ceph Storage Cluster.
+
+- **Erasure Coding:** In this scenario, the pool uses erasure coding to
+ store data much more efficiently with a small performance tradeoff.
+
+In the standard storage scenario, you can set up a CRUSH rule to establish
+the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD
+Daemons perform optimally when all storage drives in the rule are of the
+same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_
+for details on creating a rule. Once you have created a rule, create
+a backing storage pool.
+
+In the erasure coding scenario, the pool creation arguments will generate the
+appropriate rule automatically. See `Create a Pool`_ for details.
+
+In subsequent examples, we will refer to the backing storage pool
+as ``cold-storage``.
+
+
+Setting Up a Cache Pool
+-----------------------
+
+Setting up a cache pool follows the same procedure as the standard storage
+scenario, but with this difference: the drives for the cache tier are typically
+high-performance drives that reside in their own servers and have their own
+CRUSH rule. When setting up such a rule, it should take into account the hosts
+that have the high-performance drives while omitting the hosts that don't. See
+:ref:`CRUSH Device Class<crush-map-device-class>` for details.
+
+
+In subsequent examples, we will refer to the cache pool as ``hot-storage`` and
+the backing pool as ``cold-storage``.
+
+For cache tier configuration and default values, see
+`Pools - Set Pool Values`_.
+
+
+Creating a Cache Tier
+=====================
+
+Setting up a cache tier involves associating a backing storage pool with
+a cache pool:
+
+.. prompt:: bash $
+
+ ceph osd tier add {storagepool} {cachepool}
+
+For example:
+
+.. prompt:: bash $
+
+ ceph osd tier add cold-storage hot-storage
+
+To set the cache mode, execute the following:
+
+.. prompt:: bash $
+
+ ceph osd tier cache-mode {cachepool} {cache-mode}
+
+For example:
+
+.. prompt:: bash $
+
+ ceph osd tier cache-mode hot-storage writeback
+
+The cache tiers overlay the backing storage tier, so they require one
+additional step: you must direct all client traffic from the storage pool to
+the cache pool. To direct client traffic directly to the cache pool, execute
+the following:
+
+.. prompt:: bash $
+
+ ceph osd tier set-overlay {storagepool} {cachepool}
+
+For example:
+
+.. prompt:: bash $
+
+ ceph osd tier set-overlay cold-storage hot-storage
+
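+To confirm the tier relationship and the overlay afterward, one possible check
+is to inspect the pool details (the ``grep`` filter is merely illustrative,
+and the exact fields shown vary by release):
+
+.. prompt:: bash $
+
+ ceph osd pool ls detail | grep -E 'cold-storage|hot-storage'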
+
+Configuring a Cache Tier
+========================
+
+Cache tiers have several configuration options. You may set
+cache tier configuration options with the following usage:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} {key} {value}
+
+See `Pools - Set Pool Values`_ for details.
+
+
+Target Size and Type
+--------------------
+
+Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} hit_set_type bloom
+
+For example:
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage hit_set_type bloom
+
+The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to
+store, and how much time each HitSet should cover:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} hit_set_count 12
+ ceph osd pool set {cachepool} hit_set_period 14400
+ ceph osd pool set {cachepool} target_max_bytes 1000000000000
+
+.. note:: A larger ``hit_set_count`` results in more RAM consumed by
+ the ``ceph-osd`` process.
+
+Binning accesses over time allows Ceph to determine whether a Ceph client
+accessed an object at least once, or more than once over a time period
+("age" vs "temperature").
+
+The ``min_read_recency_for_promote`` setting defines how many HitSets to check
+for the existence of an object when handling a read operation. The result is
+used to decide whether to promote the object asynchronously. Its value should
+be between 0 and ``hit_set_count``. If it is set to 0, the object is always
+promoted. If it is set to 1, only the current HitSet is checked, and the object
+is promoted only if it is found there. For higher values, the exact number of
+archived HitSets is checked, and the object is promoted if it is found in any
+of the most recent ``min_read_recency_for_promote`` HitSets.
+
+A similar parameter can be set for the write operation, which is
+``min_write_recency_for_promote``:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} min_read_recency_for_promote 2
+ ceph osd pool set {cachepool} min_write_recency_for_promote 2
+
+.. note:: The longer the period and the higher the
+ ``min_read_recency_for_promote`` and
+ ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
+ daemon consumes. In particular, when the agent is active to flush
+ or evict cache objects, all ``hit_set_count`` HitSets are loaded
+ into RAM.
+
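+To review the values currently in effect, the corresponding ``get`` form of the
+command can be used. A brief sketch using the ``hot-storage`` pool from the
+earlier examples:
+
+.. prompt:: bash $
+
+ ceph osd pool get hot-storage hit_set_type
+ ceph osd pool get hot-storage hit_set_count
+ ceph osd pool get hot-storage hit_set_period
+ ceph osd pool get hot-storage min_read_recency_for_promote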
+
+Cache Sizing
+------------
+
+The cache tiering agent performs two main functions:
+
+- **Flushing:** The agent identifies modified (or dirty) objects and forwards
+ them to the storage pool for long-term storage.
+
+- **Evicting:** The agent identifies objects that have not been modified
+ (clean objects) and evicts the least recently used among them from the cache.
+
+
+Absolute Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects based upon the total number
+of bytes or the total number of objects. To specify a maximum number of bytes,
+execute the following:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} target_max_bytes {#bytes}
+
+For example, to flush or evict at 1 TB, execute the following:
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage target_max_bytes 1099511627776
+
+To specify the maximum number of objects, execute the following:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} target_max_objects {#objects}
+
+For example, to flush or evict at 1M objects, execute the following:
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage target_max_objects 1000000
+
+.. note:: Ceph is not able to determine the size of a cache pool automatically,
+ so an absolute size must be configured here; otherwise, flushing and evicting
+ will not work. If you specify both limits, the cache tiering agent will begin
+ flushing or evicting when either threshold is triggered.
+
+.. note:: All client requests will be blocked only when the ``target_max_bytes``
+ or ``target_max_objects`` limit has been reached.
+
+Relative Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects relative to the size of the
+cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
+`Absolute Sizing`_). When modified (dirty) objects reach a certain percentage
+of the cache pool's capacity, the cache tiering agent flushes them to the
+storage pool. To set the ``cache_target_dirty_ratio``, execute the following:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}
+
+For example, setting the value to ``0.4`` will begin flushing modified
+(dirty) objects when they reach 40% of the cache pool's capacity:
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+
+When dirty objects reach a higher percentage of the cache pool's capacity, the
+cache tiering agent flushes them at a higher speed. To set the
+``cache_target_dirty_high_ratio``:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}
+
+For example, setting the value to ``0.6`` will begin aggressively flushing
+dirty objects when they reach 60% of the cache pool's capacity. This value
+should lie between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``:
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+
+When the cache pool reaches a certain percentage of its capacity, the cache
+tiering agent will evict objects to maintain free capacity. To set the
+``cache_target_full_ratio``, execute the following:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}
+
+For example, setting the value to ``0.8`` will begin flushing unmodified
+(clean) objects when they reach 80% of the cache pool's capacity:
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage cache_target_full_ratio 0.8
+
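+Taken together, the three relative thresholds should be ordered so that
+``cache_target_dirty_ratio`` is less than ``cache_target_dirty_high_ratio``,
+which in turn is less than ``cache_target_full_ratio``. One consistent example
+configuration (the specific values are only illustrative):
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+ ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+ ceph osd pool set hot-storage cache_target_full_ratio 0.8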
+
+Cache Age
+---------
+
+You can specify the minimum age of an object before the cache tiering agent
+flushes a recently modified (or dirty) object to the backing storage pool:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} cache_min_flush_age {#seconds}
+
+For example, to flush modified (or dirty) objects after 10 minutes, execute the
+following:
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage cache_min_flush_age 600
+
+You can specify the minimum age of an object before it will be evicted from the
+cache tier:
+
+.. prompt:: bash $
+
+ ceph osd pool set {cachepool} cache_min_evict_age {#seconds}
+
+For example, to evict objects after 30 minutes, execute the following:
+
+.. prompt:: bash $
+
+ ceph osd pool set hot-storage cache_min_evict_age 1800
+
+
+Removing a Cache Tier
+=====================
+
+Removing a cache tier differs depending on whether it is a writeback
+cache or a read-only cache.
+
+
+Removing a Read-Only Cache
+--------------------------
+
+Since a read-only cache does not have modified data, you can disable
+and remove it without losing any recent changes to objects in the cache.
+
+#. Change the cache mode to ``none`` to disable it:
+
+ .. prompt:: bash $
+
+ ceph osd tier cache-mode {cachepool} none
+
+ For example:
+
+ .. prompt:: bash $
+
+ ceph osd tier cache-mode hot-storage none
+
+#. Remove the cache pool from the backing pool:
+
+ .. prompt:: bash $
+
+ ceph osd tier remove {storagepool} {cachepool}
+
+ For example:
+
+ .. prompt:: bash $
+
+ ceph osd tier remove cold-storage hot-storage
+
+
+Removing a Writeback Cache
+--------------------------
+
+Since a writeback cache may have modified data, you must take steps to ensure
+that you do not lose any recent changes to objects in the cache before you
+disable and remove it.
+
+
+#. Change the cache mode to ``proxy`` so that new and modified objects will
+ flush to the backing storage pool:
+
+ .. prompt:: bash $
+
+ ceph osd tier cache-mode {cachepool} proxy
+
+ For example:
+
+ .. prompt:: bash $
+
+ ceph osd tier cache-mode hot-storage proxy
+
+
+#. Ensure that the cache pool has been flushed. This may take a few minutes:
+
+ .. prompt:: bash $
+
+ rados -p {cachepool} ls
+
+ If the cache pool still has objects, you can flush them manually.
+ For example:
+
+ .. prompt:: bash $
+
+ rados -p {cachepool} cache-flush-evict-all
+
+
+#. Remove the overlay so that clients will not direct traffic to the cache:
+
+ .. prompt:: bash $
+
+ ceph osd tier remove-overlay {storagetier}
+
+ For example:
+
+ .. prompt:: bash $
+
+ ceph osd tier remove-overlay cold-storage
+
+
+#. Finally, remove the cache tier pool from the backing storage pool:
+
+ .. prompt:: bash $
+
+ ceph osd tier remove {storagepool} {cachepool}
+
+ For example:
+
+ .. prompt:: bash $
+
+ ceph osd tier remove cold-storage hot-storage
+
+
+.. _Create a Pool: ../pools#create-a-pool
+.. _Pools - Set Pool Values: ../pools#set-pool-values
+.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter
+.. _CRUSH Maps: ../crush-map
+.. _Absolute Sizing: #absolute-sizing
diff --git a/doc/rados/operations/change-mon-elections.rst b/doc/rados/operations/change-mon-elections.rst
new file mode 100644
index 000000000..7418ea363
--- /dev/null
+++ b/doc/rados/operations/change-mon-elections.rst
@@ -0,0 +1,100 @@
+.. _changing_monitor_elections:
+
+=======================================
+Configuring Monitor Election Strategies
+=======================================
+
+By default, the monitors are in ``classic`` mode. We recommend staying in this
+mode unless you have a very specific reason.
+
+If you want to switch modes BEFORE constructing the cluster, change the ``mon
+election default strategy`` option. This option takes an integer value:
+
+* ``1`` for ``classic``
+* ``2`` for ``disallow``
+* ``3`` for ``connectivity``
+
+After your cluster has started running, you can change strategies by running a
+command of the following form:
+
+.. prompt:: bash $
+
+ ceph mon set election_strategy {classic|disallow|connectivity}
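+
+For example, to switch a running cluster to the ``connectivity`` strategy
+(shown here only as an illustration):
+
+.. prompt:: bash $
+
+ ceph mon set election_strategy connectivity
+
+In recent releases, the monmap reported by ``ceph mon dump`` also shows the
+currently active ``election_strategy``.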
+
+Choosing a mode
+===============
+
+The modes other than ``classic`` provide specific features. We recommend staying
+in ``classic`` mode if you don't need these extra features because it is the
+simplest mode.
+
+.. _rados_operations_disallow_mode:
+
+Disallow Mode
+=============
+
+The ``disallow`` mode allows you to mark monitors as disallowed. Disallowed
+monitors participate in the quorum and serve clients, but cannot be elected
+leader. You might want to use this mode for monitors that are far away from
+clients.
+
+To disallow a monitor from being elected leader, run a command of the following
+form:
+
+.. prompt:: bash $
+
+ ceph mon add disallowed_leader {name}
+
+To remove a monitor from the disallowed list and allow it to be elected leader,
+run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph mon rm disallowed_leader {name}
+
+To see the list of disallowed leaders, examine the output of the following
+command:
+
+.. prompt:: bash $
+
+ ceph mon dump
+
+Connectivity Mode
+=================
+
+The ``connectivity`` mode evaluates connection scores that are provided by each
+monitor for its peers and elects the monitor with the highest score. This mode
+is designed to handle network partitioning (also called *net-splits*): network
+partitioning might occur if your cluster is stretched across multiple data
+centers or otherwise has a non-uniform or unbalanced network topology.
+
+The ``connectivity`` mode also supports disallowing monitors from being elected
+leader by using the same commands that were presented in :ref:`Disallow Mode <rados_operations_disallow_mode>`.
+
+Examining connectivity scores
+=============================
+
+The monitors maintain connection scores even if they aren't in ``connectivity``
+mode. To examine a specific monitor's connection scores, run a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph daemon mon.{name} connection scores dump
+
+Scores for an individual connection range from ``0`` to ``1`` inclusive and
+include whether the connection is considered alive or dead (as determined by
+whether it returned its latest ping before timeout).
+
+Connectivity scores are expected to remain valid. However, if during
+troubleshooting you determine that these scores have for some reason become
+invalid, drop the history and reset the scores by running a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph daemon mon.{name} connection scores reset
+
+Resetting connectivity scores carries little risk: monitors will still quickly
+determine whether a connection is alive or dead and trend back to the previous
+scores if those scores were accurate. Nevertheless, resetting scores ought to
+be unnecessary and it is not recommended unless advised by your support team
+or by a developer.
diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst
new file mode 100644
index 000000000..033f831cd
--- /dev/null
+++ b/doc/rados/operations/control.rst
@@ -0,0 +1,665 @@
+.. index:: control, commands
+
+==================
+ Control Commands
+==================
+
+
+Monitor Commands
+================
+
+To issue monitor commands, use the ``ceph`` utility:
+
+.. prompt:: bash $
+
+ ceph [-m monhost] {command}
+
+In most cases, monitor commands have the following form:
+
+.. prompt:: bash $
+
+ ceph {subsystem} {command}
+
+
+System Commands
+===============
+
+To display the current cluster status, run the following commands:
+
+.. prompt:: bash $
+
+ ceph -s
+ ceph status
+
+To display a running summary of cluster status and major events, run the
+following command:
+
+.. prompt:: bash $
+
+ ceph -w
+
+To display the monitor quorum, including which monitors are participating and
+which one is the leader, run the following commands:
+
+.. prompt:: bash $
+
+ ceph mon stat
+ ceph quorum_status
+
+To query the status of a single monitor, including whether it is in the quorum,
+run the following command:
+
+.. prompt:: bash $
+
+ ceph tell mon.[id] mon_status
+
+Here the value of ``[id]`` can be found by consulting the output of ``ceph
+-s``.
+
+
+Authentication Subsystem
+========================
+
+To add an OSD keyring for a specific OSD, run the following command:
+
+.. prompt:: bash $
+
+ ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring}
+
+To list the cluster's keys and their capabilities, run the following command:
+
+.. prompt:: bash $
+
+ ceph auth ls
+
+
+Placement Group Subsystem
+=========================
+
+To display the statistics for all placement groups (PGs), run the following
+command:
+
+.. prompt:: bash $
+
+ ceph pg dump [--format {format}]
+
+Here the valid formats are ``plain`` (default), ``json``, ``json-pretty``,
+``xml``, and ``xml-pretty``. When implementing monitoring tools and other
+tools, it is best to use the ``json`` format. JSON parsing is more
+deterministic than the ``plain`` format (which is more human readable), and the
+layout is much more consistent from release to release. The ``jq`` utility is
+very useful for extracting data from JSON output.
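+
+As a sketch of this approach, the following pipeline lists the IDs of PGs that
+are not ``active+clean``. The exact JSON layout (assumed here to be
+``.pg_map.pg_stats``) varies across releases, so treat the path as
+illustrative:
+
+.. prompt:: bash $
+
+ ceph pg dump --format json | jq -r '.pg_map.pg_stats[] | select(.state != "active+clean") | .pgid'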
+
+To display the statistics for all PGs stuck in a specified state, run the
+following command:
+
+.. prompt:: bash $
+
+ ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}]
+
+Here ``--format`` may be ``plain`` (default), ``json``, ``json-pretty``,
+``xml``, or ``xml-pretty``.
+
+The ``--threshold`` argument determines the time interval (in seconds) for a PG
+to be considered ``stuck`` (default: 300).
+
+PGs might be stuck in any of the following states:
+
+**Inactive**
+
+ PGs are unable to process reads or writes because they are waiting for an
+ OSD that has the most up-to-date data to return to an ``up`` state.
+
+
+**Unclean**
+
+ PGs contain objects that have not been replicated the desired number of
+ times. These PGs have not yet completed the process of recovering.
+
+
+**Stale**
+
+ PGs are in an unknown state, because the OSDs that host them have not
+ reported to the monitor cluster for a certain period of time (specified by
+ the ``mon_osd_report_timeout`` configuration setting).
+
+
+To delete a ``lost`` object or revert an object to its prior state, either by
+reverting it to its previous version or by deleting it because it was just
+created and has no previous version, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg {pgid} mark_unfound_lost revert|delete
+
+
+.. _osd-subsystem:
+
+OSD Subsystem
+=============
+
+To query OSD subsystem status, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd stat
+
+To write a copy of the most recent OSD map to a file (see :ref:`osdmaptool
+<osdmaptool>`), run the following command:
+
+.. prompt:: bash $
+
+ ceph osd getmap -o file
+
+To write a copy of the CRUSH map from the most recent OSD map to a file, run
+the following command:
+
+.. prompt:: bash $
+
+ ceph osd getcrushmap -o file
+
+Note that this command is functionally equivalent to the following two
+commands:
+
+.. prompt:: bash $
+
+ ceph osd getmap -o /tmp/osdmap
+ osdmaptool /tmp/osdmap --export-crush file
+
+To dump the OSD map, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd dump [--format {format}]
+
+The ``--format`` option accepts the following arguments: ``plain`` (default),
+``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON is
+the recommended format for tools, scripting, and other forms of automation.
+
+To dump the OSD map as a tree that lists one OSD per line and displays
+information about the weights and states of the OSDs, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph osd tree [--format {format}]
+
+To find out where a specific RADOS object is stored in the system, run a
+command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd map <pool-name> <object-name>
+
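+For example (the pool and object names here are only placeholders):
+
+.. prompt:: bash $
+
+ ceph osd map mypool myobject
+
+The output reports the PG to which the object maps, along with the PG's ``up``
+and ``acting`` OSD sets.
+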
+To add or move a new OSD (specified by its ID, name, or weight) to a specific
+CRUSH location, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]]
+
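+For example (the bucket names are illustrative and must match buckets that
+exist in your own hierarchy):
+
+.. prompt:: bash $
+
+ ceph osd crush set osd.4 1.0 root=default rack=rack1 host=node1
+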
+To remove an existing OSD from the CRUSH map, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush remove {name}
+
+To remove an existing bucket from the CRUSH map, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush remove {bucket-name}
+
+To move an existing bucket from one position in the CRUSH hierarchy to another,
+run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush move {id} {loc1} [{loc2} ...]
+
+To set the CRUSH weight of a specific OSD (specified by ``{name}``) to
+``{weight}``, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush reweight {name} {weight}
+
+To mark an OSD as ``lost``, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd lost {id} [--yes-i-really-mean-it]
+
+.. warning::
+ This could result in permanent data loss. Use with caution!
+
+To create a new OSD, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd create [{uuid}]
+
+If no UUID is given as part of this command, the UUID will be set automatically
+when the OSD starts up.
+
+To remove one or more specific OSDs, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd rm [{id}...]
+
+To display the current ``max_osd`` parameter in the OSD map, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph osd getmaxosd
+
+To import a specific CRUSH map, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd setcrushmap -i file
+
+To set the ``max_osd`` parameter in the OSD map, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd setmaxosd
+
+The parameter has a default value of 10000. Most operators will never need to
+adjust it.
+
+To mark a specific OSD ``down``, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd down {osd-num}
+
+To mark a specific OSD ``out`` (so that no data will be allocated to it), run
+the following command:
+
+.. prompt:: bash $
+
+ ceph osd out {osd-num}
+
+To mark a specific OSD ``in`` (so that data will be allocated to it), run the
+following command:
+
+.. prompt:: bash $
+
+ ceph osd in {osd-num}
+
+By using the "pause flags" in the OSD map, you can pause or unpause I/O
+requests. If the flags are set, then no I/O requests will be sent to any OSD.
+When the flags are cleared, then pending I/O requests will be resent. To set or
+clear pause flags, run one of the following commands:
+
+.. prompt:: bash $
+
+ ceph osd pause
+ ceph osd unpause
+
+You can assign an override or ``reweight`` weight value to a specific OSD if
+the normal CRUSH distribution seems to be suboptimal. The weight of an OSD
+helps determine the extent of its I/O requests and data storage: two OSDs with
+the same weight will receive approximately the same number of I/O requests and
+store approximately the same amount of data. The ``ceph osd reweight`` command
+assigns an override weight to an OSD. The weight value is in the range 0 to 1,
+and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of
+the data that would otherwise be on this OSD. The command does not change the
+weights of the buckets above the OSD in the CRUSH map. Using the command is
+merely a corrective measure: for example, if one of your OSDs is at 90% and the
+others are at 50%, you could reduce the outlier weight to correct this
+imbalance. To assign an override weight to a specific OSD, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph osd reweight {osd-num} {weight}
+
+.. note:: Any assigned override reweight value will conflict with the balancer.
+ This means that if the balancer is in use, all override reweight values
+ should be ``1.0000`` in order to avoid suboptimal cluster behavior.
+
+A cluster's OSDs can be reweighted in order to maintain balance if some OSDs
+are being disproportionately utilized. Note that override or ``reweight``
+weights have values relative to one another that default to 1.00000; their
+values are not absolute, and these weights must be distinguished from CRUSH
+weights (which reflect the absolute capacity of a bucket, as measured in TiB).
+To reweight OSDs by utilization, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing]
+
+By default, this command adjusts the override weight of OSDs that have ±20% of
+the average utilization, but you can specify a different percentage in the
+``threshold`` argument.
+
+To limit the increment by which any OSD's reweight is to be changed, use the
+``max_change`` argument (default: 0.05). To limit the number of OSDs that are
+to be adjusted, use the ``max_osds`` argument (default: 4). Increasing these
+variables can accelerate the reweighting process, but perhaps at the cost of
+slower client operations (as a result of the increase in data movement).
+
+You can test the ``osd reweight-by-utilization`` command before running it. To
+find out which and how many PGs and OSDs will be affected by a specific use of
+the ``osd reweight-by-utilization`` command, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing]
+
+The ``--no-increasing`` option can be added to the ``reweight-by-utilization``
+and ``test-reweight-by-utilization`` commands in order to prevent any override
+weights that are currently less than 1.00000 from being increased. This option
+can be useful in certain circumstances: for example, when you are hastily
+balancing in order to remedy ``full`` or ``nearfull`` OSDs, or when there are
+OSDs being evacuated or slowly brought into service.
+
+Operators of deployments that utilize Nautilus or newer (or later revisions of
+Luminous and Mimic) and that have no pre-Luminous clients will likely instead
+want to enable the ``balancer`` module for ``ceph-mgr``.
+
+The blocklist can be modified by adding or removing an IP address or a CIDR
+range. If an address is blocklisted, it will be unable to connect to any OSD.
+If an OSD is contained within an IP address or CIDR range that has been
+blocklisted, the OSD will be unable to perform operations on its peers when it
+acts as a client: such blocked operations include tiering and copy-from
+functionality. To add or remove an IP address or CIDR range to the blocklist,
+run one of the following commands:
+
+.. prompt:: bash $
+
+ ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME]
+ ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits]
+
+If you add something to the blocklist with the above ``add`` command, you can
+use the ``TIME`` keyword to specify the length of time (in seconds) that it
+will remain on the blocklist (default: one hour). To add or remove a CIDR
+range, use the ``range`` keyword in the above commands.
+
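+For example, to blocklist a single client address for ten minutes and an
+entire /24 network for the default one hour (the addresses are illustrative):
+
+.. prompt:: bash $
+
+ ceph osd blocklist add 192.168.1.123 600
+ ceph osd blocklist range add 192.168.1.0/24
+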
+Note that these commands are useful primarily in failure testing. Under normal
+conditions, blocklists are maintained automatically and do not need any manual
+intervention.
+
+To create or delete a snapshot of a specific storage pool, run one of the
+following commands:
+
+.. prompt:: bash $
+
+ ceph osd pool mksnap {pool-name} {snap-name}
+ ceph osd pool rmsnap {pool-name} {snap-name}
+
+To create, delete, or rename a specific storage pool, run one of the following
+commands:
+
+.. prompt:: bash $
+
+ ceph osd pool create {pool-name} [pg_num [pgp_num]]
+ ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
+ ceph osd pool rename {old-name} {new-name}
+
+To change a pool setting, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set {pool-name} {field} {value}
+
+The following are valid fields:
+
+ * ``size``: The number of copies of data in the pool.
+ * ``pg_num``: The PG number.
+ * ``pgp_num``: The effective number of PGs when calculating placement.
+ * ``crush_rule``: The rule number for mapping placement.
+
+To retrieve the value of a pool setting, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool get {pool-name} {field}
+
+Valid fields are:
+
+ * ``pg_num``: The PG number.
+ * ``pgp_num``: The effective number of PGs when calculating placement.
+
+To send a scrub command to a specific OSD, or to all OSDs (by using ``*``), run
+the following command:
+
+.. prompt:: bash $
+
+ ceph osd scrub {osd-num}
+
+To send a repair command to a specific OSD, or to all OSDs (by using ``*``),
+run the following command:
+
+.. prompt:: bash $
+
+ ceph osd repair N
+
+You can run a simple throughput benchmark test against a specific OSD. This
+test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB) incrementally,
+in multiple write requests that each have a size of ``BYTES_PER_WRITE``
+(default: 4 MB). The test is not destructive and it will not overwrite existing
+live OSD data, but it might temporarily affect the performance of clients that
+are concurrently accessing the OSD. To launch this benchmark test, run the
+following command:
+
+.. prompt:: bash $
+
+ ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
+
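+For example, to write a total of 100 MB to ``osd.0`` in 4 MB increments (the
+values are illustrative):
+
+.. prompt:: bash $
+
+ ceph tell osd.0 bench 104857600 4194304
+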
+To clear the caches of a specific OSD during the interval between one benchmark
+run and another, run the following command:
+
+.. prompt:: bash $
+
+ ceph tell osd.N cache drop
+
+To retrieve the cache statistics of a specific OSD, run the following command:
+
+.. prompt:: bash $
+
+ ceph tell osd.N cache status
+
+MDS Subsystem
+=============
+
+To change the configuration parameters of a running metadata server, run the
+following command:
+
+.. prompt:: bash $
+
+ ceph tell mds.{mds-id} config set {setting} {value}
+
+For example, to enable debug messages, run the following command:
+
+.. prompt:: bash $
+
+ ceph tell mds.0 config set debug_ms 1
+
+To display the status of all metadata servers, run the following command:
+
+.. prompt:: bash $
+
+ ceph mds stat
+
+To mark the active metadata server (rank ``0``) as failed (and to trigger
+failover to a standby if a standby is present), run the following command:
+
+.. prompt:: bash $
+
+ ceph mds fail 0
+
+.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
+
+
+Mon Subsystem
+=============
+
+To display monitor statistics, run the following command:
+
+.. prompt:: bash $
+
+ ceph mon stat
+
+This command returns output similar to the following:
+
+::
+
+ e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
+
+There is a ``quorum`` list at the end of the output. It lists those monitor
+nodes that are part of the current quorum.
+
+To retrieve this information in a more direct way, run the following command:
+
+.. prompt:: bash $
+
+ ceph quorum_status -f json-pretty
+
+This command returns output similar to the following:
+
+.. code-block:: javascript
+
+ {
+ "election_epoch": 6,
+ "quorum": [
+ 0,
+ 1,
+ 2
+ ],
+ "quorum_names": [
+ "a",
+ "b",
+ "c"
+ ],
+ "quorum_leader_name": "a",
+ "monmap": {
+ "epoch": 2,
+ "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+ "modified": "2016-12-26 14:42:09.288066",
+ "created": "2016-12-26 14:42:03.573585",
+ "features": {
+ "persistent": [
+ "kraken"
+ ],
+ "optional": []
+ },
+ "mons": [
+ {
+ "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:40000\/0",
+ "public_addr": "127.0.0.1:40000\/0"
+ },
+ {
+ "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:40001\/0",
+ "public_addr": "127.0.0.1:40001\/0"
+ },
+ {
+ "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:40002\/0",
+ "public_addr": "127.0.0.1:40002\/0"
+ }
+ ]
+ }
+ }
+
+
+The above will block until a quorum is reached.
+
+To see the status of a specific monitor, run the following command:
+
+.. prompt:: bash $
+
+ ceph tell mon.[name] mon_status
+
+Here the value of ``[name]`` can be found by consulting the output of the
+``ceph quorum_status`` command. This command returns output similar to the
+following:
+
+::
+
+ {
+ "name": "b",
+ "rank": 1,
+ "state": "peon",
+ "election_epoch": 6,
+ "quorum": [
+ 0,
+ 1,
+ 2
+ ],
+ "features": {
+ "required_con": "9025616074522624",
+ "required_mon": [
+ "kraken"
+ ],
+ "quorum_con": "1152921504336314367",
+ "quorum_mon": [
+ "kraken"
+ ]
+ },
+ "outside_quorum": [],
+ "extra_probe_peers": [],
+ "sync_provider": [],
+ "monmap": {
+ "epoch": 2,
+ "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+ "modified": "2016-12-26 14:42:09.288066",
+ "created": "2016-12-26 14:42:03.573585",
+ "features": {
+ "persistent": [
+ "kraken"
+ ],
+ "optional": []
+ },
+ "mons": [
+ {
+ "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:40000\/0",
+ "public_addr": "127.0.0.1:40000\/0"
+ },
+ {
+ "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:40001\/0",
+ "public_addr": "127.0.0.1:40001\/0"
+ },
+ {
+ "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:40002\/0",
+ "public_addr": "127.0.0.1:40002\/0"
+ }
+ ]
+ }
+ }
+
+To see a dump of the monitor state, run the following command:
+
+.. prompt:: bash $
+
+ ceph mon dump
+
+This command returns output similar to the following:
+
+::
+
+ dumped monmap epoch 2
+ epoch 2
+ fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
+ last_changed 2016-12-26 14:42:09.288066
+ created 2016-12-26 14:42:03.573585
+ 0: 127.0.0.1:40000/0 mon.a
+ 1: 127.0.0.1:40001/0 mon.b
+ 2: 127.0.0.1:40002/0 mon.c
diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst
new file mode 100644
index 000000000..46a4a4f74
--- /dev/null
+++ b/doc/rados/operations/crush-map-edits.rst
@@ -0,0 +1,746 @@
+Manually editing the CRUSH Map
+==============================
+
+.. note:: Manually editing the CRUSH map is an advanced administrator
+ operation. For the majority of installations, CRUSH changes can be
+ implemented via the Ceph CLI and do not require manual CRUSH map edits. If
+ you have identified a use case where manual edits *are* necessary with a
+ recent Ceph release, consider contacting the Ceph developers at dev@ceph.io
+ so that future versions of Ceph do not have this problem.
+
+To edit an existing CRUSH map, carry out the following procedure:
+
+#. `Get the CRUSH map`_.
+#. `Decompile`_ the CRUSH map.
+#. Edit at least one of the following sections: `Devices`_, `Buckets`_, and
+ `Rules`_. Use a text editor for this task.
+#. `Recompile`_ the CRUSH map.
+#. `Set the CRUSH map`_.
+
+For details on setting the CRUSH map rule for a specific pool, see `Set Pool
+Values`_.
+
+.. _Get the CRUSH map: #getcrushmap
+.. _Decompile: #decompilecrushmap
+.. _Devices: #crushmapdevices
+.. _Buckets: #crushmapbuckets
+.. _Rules: #crushmaprules
+.. _Recompile: #compilecrushmap
+.. _Set the CRUSH map: #setcrushmap
+.. _Set Pool Values: ../pools#setpoolvalues
+
+.. _getcrushmap:
+
+Get the CRUSH Map
+-----------------
+
+To get the CRUSH map for your cluster, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd getcrushmap -o {compiled-crushmap-filename}
+
+Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have
+specified. Because the CRUSH map is in a compiled form, you must first
+decompile it before you can edit it.
+
+.. _decompilecrushmap:
+
+Decompile the CRUSH Map
+-----------------------
+
+To decompile the CRUSH map, run a command of the following form:
+
+.. prompt:: bash $
+
+ crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
+
+.. _compilecrushmap:
+
+Recompile the CRUSH Map
+-----------------------
+
+To compile the CRUSH map, run a command of the following form:
+
+.. prompt:: bash $
+
+ crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename}
+
+.. _setcrushmap:
+
+Set the CRUSH Map
+-----------------
+
+To set the CRUSH map for your cluster, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd setcrushmap -i {compiled-crushmap-filename}
+
+Ceph loads (``-i``) a compiled CRUSH map from the filename that you have
+specified.
+
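+Putting these steps together, a typical edit cycle looks like the following
+(the filenames are arbitrary):
+
+.. prompt:: bash $
+
+ ceph osd getcrushmap -o crushmap.bin
+ crushtool -d crushmap.bin -o crushmap.txt
+ # edit crushmap.txt with a text editor, then recompile and upload it
+ crushtool -c crushmap.txt -o crushmap-new.bin
+ ceph osd setcrushmap -i crushmap-new.bin
+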
+Sections
+--------
+
+A CRUSH map has six main sections:
+
+#. **tunables:** The preamble at the top of the map describes any *tunables*
+ that are not a part of legacy CRUSH behavior. These tunables correct for old
+ bugs, optimizations, or other changes that have been made over the years to
+ improve CRUSH's behavior.
+
+#. **devices:** Devices are individual OSDs that store data.
+
+#. **types**: Bucket ``types`` define the types of buckets that are used in
+ your CRUSH hierarchy.
+
+#. **buckets:** Buckets consist of a hierarchical aggregation of storage
+ locations (for example, rows, racks, chassis, hosts) and their assigned
+ weights. After the bucket ``types`` have been defined, the CRUSH map defines
+ each node in the hierarchy, its type, and which devices or other nodes it
+ contains.
+
+#. **rules:** Rules define policy about how data is distributed across
+ devices in the hierarchy.
+
+#. **choose_args:** ``choose_args`` are alternative weights associated with
+ the hierarchy that have been adjusted in order to optimize data placement. A
+ single ``choose_args`` map can be used for the entire cluster, or a number
+ of ``choose_args`` maps can be created such that each map is crafted for a
+ particular pool.
+
+
+.. _crushmapdevices:
+
+CRUSH-Map Devices
+-----------------
+
+Devices are individual OSDs that store data. In this section, there is usually
+one device defined for each OSD daemon in your cluster. Devices are identified
+by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where
+``N`` is the device's ``id``).
+
+
+.. _crush-map-device-class:
+
+A device can also have a *device class* associated with it: for example,
+``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted
+by CRUSH rules. This means that device classes allow CRUSH rules to select only
+OSDs that match certain characteristics. For example, you might want an RBD
+pool associated only with SSDs and a different RBD pool associated only with
+HDDs.
+
+To see a list of devices, run the following command:
+
+.. prompt:: bash #
+
+ ceph device ls
+
+The output of this command takes the following form:
+
+::
+
+ device {num} {osd.name} [class {class}]
+
+For example:
+
+.. prompt:: bash #
+
+ ceph device ls
+
+::
+
+ device 0 osd.0 class ssd
+ device 1 osd.1 class hdd
+ device 2 osd.2
+ device 3 osd.3
+
+In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This
+daemon might map to a single storage device, a pair of devices (for example,
+one for data and one for a journal or metadata), or in some cases a small RAID
+device or a partition of a larger storage device.
+
+
+CRUSH-Map Bucket Types
+----------------------
+
+The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a
+hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets)
+typically represent physical locations in a hierarchy. Nodes aggregate other
+nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their
+corresponding storage media.
+
+.. tip:: In the context of CRUSH, the term "bucket" is used to refer to
+ a node in the hierarchy (that is, to a location or a piece of physical
+ hardware). In the context of RADOS Gateway APIs, however, the term
+ "bucket" has a different meaning.
+
+To add a bucket type to the CRUSH map, create a new line under the list of
+bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
+By convention, there is exactly one leaf bucket type and it is ``type 0``;
+however, you may give the leaf bucket any name you like (for example: ``osd``,
+``disk``, ``drive``, ``storage``)::
+
+ # types
+ type {num} {bucket-name}
+
+For example::
+
+ # types
+ type 0 osd
+ type 1 host
+ type 2 chassis
+ type 3 rack
+ type 4 row
+ type 5 pdu
+ type 6 pod
+ type 7 room
+ type 8 datacenter
+ type 9 zone
+ type 10 region
+ type 11 root
+
+.. _crushmapbuckets:
+
+CRUSH-Map Bucket Hierarchy
+--------------------------
+
+The CRUSH algorithm distributes data objects among storage devices according to
+a per-device weight value, approximating a uniform probability distribution.
+CRUSH distributes objects and their replicas according to the hierarchical
+cluster map you define. The CRUSH map represents the available storage devices
+and the logical elements that contain them.
+
+To map placement groups (PGs) to OSDs across failure domains, a CRUSH map
+defines a hierarchical list of bucket types under ``#types`` in the generated
+CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf
+nodes according to their failure domains (for example: hosts, chassis, racks,
+power distribution units, pods, rows, rooms, and data centers). With the
+exception of the leaf nodes that represent OSDs, the hierarchy is arbitrary and
+you may define it according to your own needs.
+
+We recommend adapting your CRUSH map to your preferred hardware-naming
+conventions and using bucket names that clearly reflect the physical
+hardware. Clear naming practice can make it easier to administer the cluster
+and easier to troubleshoot problems when OSDs malfunction (or other hardware
+malfunctions) and the administrator needs access to physical hardware.
+
+
+In the following example, the bucket hierarchy has a leaf bucket named ``osd``
+and two node buckets named ``host`` and ``rack``:
+
+.. ditaa::
+ +-----------+
+ | {o}rack |
+ | Bucket |
+ +-----+-----+
+ |
+ +---------------+---------------+
+ | |
+ +-----+-----+ +-----+-----+
+ | {o}host | | {o}host |
+ | Bucket | | Bucket |
+ +-----+-----+ +-----+-----+
+ | |
+ +-------+-------+ +-------+-------+
+ | | | |
+ +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
+ | osd | | osd | | osd | | osd |
+ | Bucket | | Bucket | | Bucket | | Bucket |
+ +-----------+ +-----------+ +-----------+ +-----------+
+
+.. note:: The higher-numbered ``rack`` bucket type aggregates the
+ lower-numbered ``host`` bucket type.
+
+Because leaf nodes reflect storage devices that have already been declared
+under the ``#devices`` list at the beginning of the CRUSH map, there is no need
+to declare them as bucket instances. The second-lowest bucket type in your
+hierarchy is typically used to aggregate the devices (that is, the
+second-lowest bucket type is usually the computer that contains the storage
+media and is given a name such as ``node``, ``computer``, ``server``, ``host``,
+or ``machine``). In high-density environments, it is common to have multiple hosts
+or nodes in a single chassis (for example, in the cases of blades or twins). It
+is important to anticipate the potential consequences of chassis failure -- for
+example, during the replacement of a chassis in case of a node failure, the
+chassis's hosts or nodes (and their associated OSDs) will be in a ``down``
+state.
+
+To declare a bucket instance, do the following: specify its type, give it a
+unique name (an alphanumeric string), assign it a unique ID expressed as a
+negative integer (this is optional), assign it a weight relative to the total
+capacity and capability of the item(s) in the bucket, assign it a bucket
+algorithm (usually ``straw2``), and specify the bucket algorithm's hash
+(usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A
+bucket may have one or more items. The items may consist of node buckets or
+leaves. Items may have a weight that reflects the relative weight of the item.
+
+To declare a node bucket, use the following syntax::
+
+ [bucket-type] [bucket-name] {
+ id [a unique negative numeric ID]
+ weight [the relative capacity/capability of the item(s)]
+ alg [the bucket type: uniform | list | tree | straw | straw2 ]
+ hash [the hash type: 0 by default]
+ item [item-name] weight [weight]
+ }
+
+For example, in the above diagram, two host buckets (referred to in the
+declaration below as ``node1`` and ``node2``) and one rack bucket (referred to
+in the declaration below as ``rack1``) are defined. The OSDs are declared as
+items within the host buckets::
+
+ host node1 {
+ id -1
+ alg straw2
+ hash 0
+ item osd.0 weight 1.00
+ item osd.1 weight 1.00
+ }
+
+ host node2 {
+ id -2
+ alg straw2
+ hash 0
+ item osd.2 weight 1.00
+ item osd.3 weight 1.00
+ }
+
+ rack rack1 {
+ id -3
+ alg straw2
+ hash 0
+ item node1 weight 2.00
+ item node2 weight 2.00
+ }
+
+.. note:: In this example, the rack bucket does not contain any OSDs. Instead,
+ it contains lower-level host buckets and includes the sum of their weight in
+ the item entry.
+
+
+.. topic:: Bucket Types
+
+ Ceph supports five bucket types. Each bucket type provides a balance between
+ performance and reorganization efficiency, and each is different from the
+ others. If you are unsure of which bucket type to use, use the ``straw2``
+ bucket. For a more technical discussion of bucket types than is offered
+ here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized
+ Placement of Replicated Data`_.
+
+ The bucket types are as follows:
+
+ #. **uniform**: Uniform buckets aggregate devices that have **exactly**
+ the same weight. For example, when hardware is commissioned or
+ decommissioned, it is often done in sets of machines that have exactly
+ the same physical configuration (this can be the case, for example,
+ after bulk purchases). When storage devices have exactly the same
+ weight, you may use the ``uniform`` bucket type, which allows CRUSH to
+ map replicas into uniform buckets in constant time. If your devices have
+ non-uniform weights, you should not use the uniform bucket algorithm.
+
+ #. **list**: List buckets aggregate their content as linked lists. The
+ behavior of list buckets is governed by the :abbr:`RUSH (Replication
+ Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this
+ bucket type, an object is either relocated to the newest device in
+ accordance with an appropriate probability, or it remains on the older
+ devices as before. This results in optimal data migration when items are
+ added to the bucket. The removal of items from the middle or the tail of
+ the list, however, can result in a significant amount of unnecessary
+ data movement. This means that list buckets are most suitable for
+ circumstances in which they **never shrink or very rarely shrink**.
+
+ #. **tree**: Tree buckets use a binary search tree. They are more efficient
+ at dealing with buckets that contain many items than are list buckets.
+ The behavior of tree buckets is governed by the :abbr:`RUSH (Replication
+ Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the
+ placement time to O(log\ :sub:`n`). This means that tree buckets are
+ suitable for managing large sets of devices or nested buckets.
+
+ #. **straw**: Straw buckets allow all items in the bucket to "compete"
+ against each other for replica placement through a process analogous to
+ drawing straws. This is different from the behavior of list buckets and
+ tree buckets, which use a divide-and-conquer strategy that either gives
+ certain items precedence (for example, those at the beginning of a list)
+ or obviates the need to consider entire subtrees of items. Such an
+ approach improves the performance of the replica placement process, but
+ can also introduce suboptimal reorganization behavior when the contents
+ of a bucket change due to an addition, a removal, or the re-weighting of an
+ item.
+
+ #. **straw2**: Straw2 buckets improve on Straw by correctly avoiding
+ any data movement between items when neighbor weights change. For
+ example, if the weight of a given item changes (including during the
+ operations of adding it to the cluster or removing it from the
+ cluster), there will be data movement to or from only that item.
+ Neighbor weights are not taken into account.
+
+
+.. topic:: Hash
+
+ Each bucket uses a hash algorithm. As of Reef, Ceph supports the
+ ``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm,
+ enter ``0`` as your hash setting.
+
+.. _weightingbucketitems:
+
+.. topic:: Weighting Bucket Items
+
+ Ceph expresses bucket weights as doubles, which allows for fine-grained
+ weighting. A weight is the relative difference between device capacities. We
+ recommend using ``1.00`` as the relative weight for a 1 TB storage device.
+ In such a scenario, a weight of ``0.50`` would represent approximately 500
+ GB, and a weight of ``3.00`` would represent approximately 3 TB. Buckets
+ higher in the CRUSH hierarchy have a weight that is the sum of the weight of
+ the leaf items aggregated by the bucket.
+
+
+.. _crushmaprules:
+
+CRUSH Map Rules
+---------------
+
+CRUSH maps have rules that include data placement for a pool: these are
+called "CRUSH rules". The default CRUSH map has one rule for each pool. If you
+are running a large cluster, you might create many pools and each of those
+pools might have its own non-default CRUSH rule.
+
+
+.. note:: In most cases, there is no need to modify the default rule. When a
+ new pool is created, by default the rule will be set to the value ``0``
+ (which indicates the default CRUSH rule, which has the numeric ID ``0``).
+
+CRUSH rules define policy that governs how data is distributed across the devices in
+the hierarchy. The rules define placement as well as replication strategies or
+distribution policies that allow you to specify exactly how CRUSH places data
+replicas. For example, you might create one rule selecting a pair of targets for
+two-way mirroring, another rule for selecting three targets in two different data
+centers for three-way replication, and yet another rule for erasure coding across
+six storage devices. For a detailed discussion of CRUSH rules, see **Section 3.2**
+of `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.
+
+A rule takes the following form::
+
+ rule <rulename> {
+
+ id [a unique integer ID]
+ type [replicated|erasure]
+ step take <bucket-name> [class <device-class>]
+ step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
+ step emit
+ }
+
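+As a concrete sketch of this syntax (the rule name, ID, and device class used
+here are arbitrary), a replicated rule that restricts placement to ``ssd``-class
+OSDs and spreads replicas across hosts might look like this::
+
+ rule fast_ssd {
+ id 1
+ type replicated
+ step take default class ssd
+ step chooseleaf firstn 0 type host
+ step emit
+ }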
+
+``id``
+ :Description: A unique integer that identifies the rule.
+ :Purpose: A component of the rule mask.
+ :Type: Integer
+ :Required: Yes
+ :Default: 0
+
+
+``type``
+ :Description: Denotes the type of replication strategy to be enforced by the
+ rule.
+ :Purpose: A component of the rule mask.
+ :Type: String
+ :Required: Yes
+ :Default: ``replicated``
+ :Valid Values: ``replicated`` or ``erasure``
+
+
+``step take <bucket-name> [class <device-class>]``
+ :Description: Takes a bucket name and iterates down the tree. If
+ the ``device-class`` argument is specified, the argument must
+ match a class assigned to OSDs within the cluster. Only
+ devices belonging to the class are included.
+ :Purpose: A component of the rule.
+ :Required: Yes
+ :Example: ``step take data``
+
+
+
+``step choose firstn {num} type {bucket-type}``
+ :Description: Selects ``num`` buckets of the given type from within the
+ current bucket. ``{num}`` is usually the number of replicas in
+ the pool (in other words, the pool size).
+
+ - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
+ - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
+ - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets.
+
+ :Purpose: A component of the rule.
+ :Prerequisite: Follows ``step take`` or ``step choose``.
+ :Example: ``step choose firstn 1 type row``
+
+
+``step chooseleaf firstn {num} type {bucket-type}``
+   :Description: Selects a set of buckets of the given type and chooses a leaf
+                 node (that is, an OSD) from the subtree of each bucket in
+                 that set of buckets. The number of buckets in the set is
+                 usually the number of replicas in the pool (in other words,
+                 the pool size).
+
+                 - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
+                 - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
+                 - If ``{num} < 0``, choose ``pool-num-replicas + {num}`` buckets (that is, because ``{num}`` is negative, ``|{num}|`` fewer buckets than the pool size).
+
+   :Purpose: A component of the rule. Using ``chooseleaf`` obviates the need to select a device in a separate step.
+   :Prerequisite: Follows ``step take`` or ``step choose``.
+   :Example: ``step chooseleaf firstn 0 type row``
+
+
+``step emit``
+ :Description: Outputs the current value on the top of the stack and empties
+ the stack. Typically used
+ at the end of a rule, but may also be used to choose from different
+ trees in the same rule.
+
+ :Purpose: A component of the rule.
+ :Prerequisite: Follows ``step choose``.
+ :Example: ``step emit``
+
+.. important:: A single CRUSH rule can be assigned to multiple pools, but
+ a single pool cannot have multiple CRUSH rules.
+
+``firstn`` or ``indep``
+
+ :Description: Determines which replacement strategy CRUSH uses when items (OSDs)
+ are marked ``down`` in the CRUSH map. When this rule is used
+ with replicated pools, ``firstn`` is used. When this rule is
+ used with erasure-coded pools, ``indep`` is used.
+
+ Suppose that a PG is stored on OSDs 1, 2, 3, 4, and 5 and then
+ OSD 3 goes down.
+
+ When in ``firstn`` mode, CRUSH simply adjusts its calculation
+ to select OSDs 1 and 2, then selects 3 and discovers that 3 is
+ down, retries and selects 4 and 5, and finally goes on to
+ select a new OSD: OSD 6. The final CRUSH mapping
+ transformation is therefore 1, 2, 3, 4, 5 → 1, 2, 4, 5, 6.
+
+ However, if you were storing an erasure-coded pool, the above
+ sequence would have changed the data that is mapped to OSDs 4,
+ 5, and 6. The ``indep`` mode attempts to avoid this unwanted
+ consequence. When in ``indep`` mode, CRUSH can be expected to
+ select 3, discover that 3 is down, retry, and select 6. The
+ final CRUSH mapping transformation is therefore 1, 2, 3, 4, 5
+ → 1, 2, 6, 4, 5.
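+
+To make the contrast concrete, the following minimal rule is a sketch of what
+a hand-written rule for an erasure-coded pool might look like: it uses
+``indep`` and chooses one OSD from the subtree of each of
+``pool-num-replicas`` hosts. The rule name and ID here are illustrative::
+
+    rule ec_by_host {
+        id 3
+        type erasure
+        step take default
+        step chooseleaf indep 0 type host
+        step emit
+    }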
+
+.. _crush-reclassify:
+
+Migrating from a legacy SSD rule to device classes
+--------------------------------------------------
+
+Prior to the Luminous release's introduction of the *device class* feature, in
+order to write rules that applied to a specialized device type (for example,
+SSD), it was necessary to manually edit the CRUSH map and maintain a parallel
+hierarchy for each device type. The device class feature provides a more
+transparent way to achieve this end.
+
+However, a naive migration from an existing manually-customized per-device map
+to the new device class-based rules would reshuffle all data in the cluster.
+
+The ``crushtool`` utility has several commands that can transform a legacy rule
+and hierarchy and allow you to start using the new device class rules. There
+are three possible types of transformation:
+
+#. ``--reclassify-root <root-name> <device-class>``
+
+ This command examines everything under ``root-name`` in the hierarchy and
+ rewrites any rules that reference the specified root and that have the
+ form ``take <root-name>`` so that they instead have the
+ form ``take <root-name> class <device-class>``. The command also renumbers
+ the buckets in such a way that the old IDs are used for the specified
+ class's "shadow tree" and as a result no data movement takes place.
+
+ For example, suppose you have the following as an existing rule::
+
+ rule replicated_rule {
+ id 0
+ type replicated
+ step take default
+ step chooseleaf firstn 0 type rack
+ step emit
+ }
+
+ If the root ``default`` is reclassified as class ``hdd``, the new rule will
+ be as follows::
+
+ rule replicated_rule {
+ id 0
+ type replicated
+ step take default class hdd
+ step chooseleaf firstn 0 type rack
+ step emit
+ }
+
+#. ``--set-subtree-class <bucket-name> <device-class>``
+
+ This command marks every device in the subtree that is rooted at *bucket-name*
+ with the specified device class.
+
+ This command is typically used in conjunction with the ``--reclassify-root`` option
+ in order to ensure that all devices in that root are labeled with the
+ correct class. In certain circumstances, however, some of those devices
+ are correctly labeled with a different class and must not be relabeled. To
+ manage this difficulty, one can exclude the ``--set-subtree-class``
+ option. The remapping process will not be perfect, because the previous rule
+ had an effect on devices of multiple classes but the adjusted rules will map
+ only to devices of the specified device class. However, when there are not many
+ outlier devices, the resulting level of data movement is often within tolerable
+ limits.
+
+
+#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``
+
+ This command allows you to merge a parallel type-specific hierarchy with the
+ normal hierarchy. For example, many users have maps that resemble the
+ following::
+
+ host node1 {
+ id -2 # do not change unnecessarily
+ # weight 109.152
+ alg straw2
+ hash 0 # rjenkins1
+ item osd.0 weight 9.096
+ item osd.1 weight 9.096
+ item osd.2 weight 9.096
+ item osd.3 weight 9.096
+ item osd.4 weight 9.096
+ item osd.5 weight 9.096
+ ...
+ }
+
+ host node1-ssd {
+ id -10 # do not change unnecessarily
+ # weight 2.000
+ alg straw2
+ hash 0 # rjenkins1
+ item osd.80 weight 2.000
+ ...
+ }
+
+ root default {
+ id -1 # do not change unnecessarily
+ alg straw2
+ hash 0 # rjenkins1
+ item node1 weight 110.967
+ ...
+ }
+
+ root ssd {
+ id -18 # do not change unnecessarily
+ # weight 16.000
+ alg straw2
+ hash 0 # rjenkins1
+ item node1-ssd weight 2.000
+ ...
+ }
+
+ This command reclassifies each bucket that matches a certain
+ pattern. The pattern can be of the form ``%suffix`` or ``prefix%``. For
+ example, in the above example, we would use the pattern
+ ``%-ssd``. For each matched bucket, the remaining portion of the
+ name (corresponding to the ``%`` wildcard) specifies the *base bucket*. All
+ devices in the matched bucket are labeled with the specified
+ device class and then moved to the base bucket. If the base bucket
+ does not exist (for example, ``node12-ssd`` exists but ``node12`` does
+ not), then it is created and linked under the specified
+ *default parent* bucket. In each case, care is taken to preserve
+ the old bucket IDs for the new shadow buckets in order to prevent data
+ movement. Any rules with ``take`` steps that reference the old
+ buckets are adjusted accordingly.
+
+
+#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>``
+
+ The same command can also be used without a wildcard in order to map a
+ single bucket. For example, in the previous example, we want the
+ ``ssd`` bucket to be mapped to the ``default`` bucket.
+
+#. The final command to convert the map that consists of the above fragments
+ resembles the following:
+
+ .. prompt:: bash $
+
+ ceph osd getcrushmap -o original
+ crushtool -i original --reclassify \
+ --set-subtree-class default hdd \
+ --reclassify-root default hdd \
+ --reclassify-bucket %-ssd ssd default \
+ --reclassify-bucket ssd ssd default \
+ -o adjusted
+
+``--compare`` flag
+------------------
+
+A ``--compare`` flag is available to make sure that the conversion performed in
+:ref:`Migrating from a legacy SSD rule to device classes <crush-reclassify>` is
+correct. This flag maps a large sample of inputs through both the original and
+the adjusted CRUSH map and checks that the same results are produced. The
+options that control these
+inputs are the same as the options that apply to the ``--test`` command. For an
+illustration of how this ``--compare`` command applies to the above example,
+see the following:
+
+.. prompt:: bash $
+
+ crushtool -i original --compare adjusted
+
+::
+
+ rule 0 had 0/10240 mismatched mappings (0)
+ rule 1 had 0/10240 mismatched mappings (0)
+ maps appear equivalent
+
+If the command finds any differences, the ratio of remapped inputs is reported
+in the parentheses.
+
+When you are satisfied with the adjusted map, apply it to the cluster by
+running the following command:
+
+.. prompt:: bash $
+
+ ceph osd setcrushmap -i adjusted
+
+Manually Tuning CRUSH
+---------------------
+
+If you have verified that all clients are running recent code, you can adjust
+the CRUSH tunables by extracting the CRUSH map, modifying the values, and
+reinjecting the map into the cluster. The procedure is carried out as follows:
+
+#. Extract the latest CRUSH map:
+
+ .. prompt:: bash $
+
+ ceph osd getcrushmap -o /tmp/crush
+
+#. Adjust tunables. In our tests, the following values appear to result in the
+ best behavior for both large and small clusters. The procedure requires that
+ you specify the ``--enable-unsafe-tunables`` flag in the ``crushtool``
+ command. Use this option with **extreme care**:
+
+ .. prompt:: bash $
+
+ crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
+
+#. Reinject the modified map:
+
+ .. prompt:: bash $
+
+ ceph osd setcrushmap -i /tmp/crush.new
+
+Legacy values
+-------------
+
+To set the legacy values of the CRUSH tunables, run the following command:
+
+.. prompt:: bash $
+
+ crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
+
+The special ``--enable-unsafe-tunables`` flag is required. Be careful when
+running old versions of the ``ceph-osd`` daemon after reverting to legacy
+values, because the feature bit is not perfectly enforced.
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst
new file mode 100644
index 000000000..39151e6d4
--- /dev/null
+++ b/doc/rados/operations/crush-map.rst
@@ -0,0 +1,1147 @@
+============
+ CRUSH Maps
+============
+
+The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
+computes storage locations in order to determine how to store and retrieve
+data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
+through a centralized server or broker. By using an algorithmically-determined
+method of storing and retrieving data, Ceph avoids a single point of failure, a
+performance bottleneck, and a physical limit to its scalability.
+
+CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
+distributing the data across the cluster in accordance with configured
+replication policy and failure domains. For a detailed discussion of CRUSH, see
+`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_
+
+CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a
+hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH
+replicates data within the cluster's pools. By reflecting the underlying
+physical organization of the installation, CRUSH can model (and thereby
+address) the potential for correlated device failures. Some factors relevant
+to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
+power source, shared networking, and failure domains. By encoding this
+information into the CRUSH map, CRUSH placement policies distribute object
+replicas across failure domains while maintaining the desired distribution. For
+example, to address the possibility of concurrent failures, it might be
+desirable to ensure that data replicas are on devices that reside in or rely
+upon different shelves, racks, power supplies, controllers, or physical
+locations.
+
+When OSDs are deployed, they are automatically added to the CRUSH map under a
+``host`` bucket that is named for the node on which the OSDs run. This
+behavior, combined with the configured CRUSH failure domain, ensures that
+replicas or erasure-code shards are distributed across hosts and that the
+failure of a single host or other kinds of failures will not affect
+availability. For larger clusters, administrators must carefully consider their
+choice of failure domain. For example, distributing replicas across racks is
+typical for mid- to large-sized clusters.
+
+
+CRUSH Location
+==============
+
+The location of an OSD within the CRUSH map's hierarchy is referred to as its
+``CRUSH location``. The specification of a CRUSH location takes the form of a
+list of key-value pairs. For example, if an OSD is in a particular row, rack,
+chassis, and host, and is also part of the 'default' CRUSH root (which is the
+case for most clusters), its CRUSH location can be specified as follows::
+
+ root=default row=a rack=a2 chassis=a2a host=a2a1
+
+.. note::
+
+ #. The order of the keys does not matter.
+ #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
+ valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
+ ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
+ types suffice for nearly all clusters, but can be customized by
+ modifying the CRUSH map.
+ #. Not all keys need to be specified. For example, by default, Ceph
+ automatically sets an ``OSD``'s location as ``root=default
+ host=HOSTNAME`` (as determined by the output of ``hostname -s``).
+
+The CRUSH location for an OSD can be modified by adding the ``crush location``
+option in ``ceph.conf``. When this option has been added, every time the OSD
+starts it verifies that it is in the correct location in the CRUSH map and
+moves itself if it is not. To disable this automatic CRUSH map management, add
+the following to the ``ceph.conf`` configuration file in the ``[osd]``
+section::
+
+ osd crush update on start = false
+
+Note that this action is unnecessary in most cases.
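+
+For example, to pin a host's OSDs to a specific position in the hierarchy by
+means of the ``crush location`` option described above, a hypothetical
+``[osd]`` section might contain the following (the row, rack, and chassis
+names are illustrative)::
+
+    [osd]
+    crush location = root=default row=a rack=a2 chassis=a2a host=a2a1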
+
+
+Custom location hooks
+---------------------
+
+A custom location hook can be used to generate a more complete CRUSH location
+on startup. The CRUSH location is determined by, in order of preference:
+
+#. A ``crush location`` option in ``ceph.conf``
+#. A default of ``root=default host=HOSTNAME`` where the hostname is determined
+ by the output of the ``hostname -s`` command
+
+A script can be written to provide additional location fields (for example,
+``rack`` or ``datacenter``) and the hook can be enabled via the following
+config option::
+
+ crush location hook = /path/to/customized-ceph-crush-location
+
+This hook is passed several arguments (see below) and outputs a single line to
+``stdout`` that contains the CRUSH location description. The arguments passed
+to the hook resemble the following::
+
+   --cluster CLUSTER --id ID --type TYPE
+
+Here the cluster name is typically ``ceph``, the ``id`` is the daemon
+identifier or (in the case of OSDs) the OSD number, and the daemon type is
+``osd``, ``mds``, ``mgr``, or ``mon``.
+
+For example, a simple hook that specifies a rack location via a value in the
+file ``/etc/rack`` might be as follows::
+
+ #!/bin/sh
+ echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
+
+
+CRUSH structure
+===============
+
+The CRUSH map consists of (1) a hierarchy that describes the physical topology
+of the cluster and (2) a set of rules that defines data placement policy. The
+hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
+other physical features or groupings: hosts, racks, rows, data centers, and so
+on. The rules determine how replicas are placed in terms of that hierarchy (for
+example, 'three replicas in different racks').
+
+Devices
+-------
+
+Devices are individual OSDs that store data (usually one device for each
+storage drive). Devices are identified by an ``id`` (a non-negative integer)
+and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).
+
+In Luminous and later releases, OSDs can have a *device class* assigned (for
+example, ``hdd`` or ``ssd`` or ``nvme``), allowing them to be targeted by CRUSH
+rules. Device classes are especially useful when mixing device types within
+hosts.
+
+.. _crush_map_default_types:
+
+Types and Buckets
+-----------------
+
+"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
+the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
+*types* that are used to identify these nodes. Default types include:
+
+- ``osd`` (or ``device``)
+- ``host``
+- ``chassis``
+- ``rack``
+- ``row``
+- ``pdu``
+- ``pod``
+- ``room``
+- ``datacenter``
+- ``zone``
+- ``region``
+- ``root``
+
+Most clusters use only a handful of these types, and other types can be defined
+as needed.
+
+The hierarchy is built with devices (normally of type ``osd``) at the leaves
+and non-device types as the internal nodes. The root node is of type ``root``.
+For example:
+
+
+.. ditaa::
+
+ +-----------------+
+ |{o}root default |
+ +--------+--------+
+ |
+ +---------------+---------------+
+ | |
+ +------+------+ +------+------+
+ |{o}host foo | |{o}host bar |
+ +------+------+ +------+------+
+ | |
+ +-------+-------+ +-------+-------+
+ | | | |
+ +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
+ | osd.0 | | osd.1 | | osd.2 | | osd.3 |
+ +-----------+ +-----------+ +-----------+ +-----------+
+
+
+Each node (device or bucket) in the hierarchy has a *weight* that indicates the
+relative proportion of the total data that should be stored by that device or
+hierarchy subtree. Weights are set at the leaves, indicating the size of the
+device. These weights automatically sum in an 'up the tree' direction: that is,
+the weight of the ``root`` node will be the sum of the weights of all devices
+contained under it. Weights are typically measured in tebibytes (TiB).
+
+To get a simple view of the cluster's CRUSH hierarchy, including weights, run
+the following command:
+
+.. prompt:: bash $
+
+ ceph osd tree
+
+Rules
+-----
+
+CRUSH rules define policy governing how data is distributed across the devices
+in the hierarchy. The rules define placement as well as replication strategies
+or distribution policies that allow you to specify exactly how CRUSH places
+data replicas. For example, you might create one rule selecting a pair of
+targets for two-way mirroring, another rule for selecting three targets in two
+different data centers for three-way replication, and yet another rule for
+erasure coding across six storage devices. For a detailed discussion of CRUSH
+rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
+Placement of Replicated Data`_.
+
+CRUSH rules can be created via the command-line by specifying the *pool type*
+that they will govern (replicated or erasure coded), the *failure domain*, and
+optionally a *device class*. In rare cases, CRUSH rules must be created by
+manually editing the CRUSH map.
+
+To see the rules that are defined for the cluster, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush rule ls
+
+To view the contents of the rules, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush rule dump
+
+.. _device_classes:
+
+Device classes
+--------------
+
+Each device can optionally have a *class* assigned. By default, OSDs
+automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
+accordance with the type of device they are backed by.
+
+To explicitly set the device class of one or more OSDs, run a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph osd crush set-device-class <class> <osd-name> [...]
+
+Once a device class has been set, it cannot be changed to another class until
+the old class is unset. To remove the old class of one or more OSDs, run a
+command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd crush rm-device-class <osd-name> [...]
+
+This restriction allows administrators to set device classes that won't be
+changed on OSD restart or by a script.
+
+To create a placement rule that targets a specific device class, run a command
+of the following form:
+
+.. prompt:: bash $
+
+ ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+
+To apply the new placement rule to a specific pool, run a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> crush_rule <rule-name>
+
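+As a hypothetical end-to-end illustration (the OSD IDs, the rule name
+``fast``, and the pool name ``fastpool`` are examples, and the OSDs are
+assumed not to have a class set already), the commands above might be combined
+as follows:
+
+.. prompt:: bash $
+
+   ceph osd crush set-device-class ssd osd.0 osd.1
+   ceph osd crush rule create-replicated fast default host ssd
+   ceph osd pool set fastpool crush_rule fast
+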
+Device classes are implemented by creating one or more "shadow" CRUSH
+hierarchies. For each device class in use, there will be a shadow hierarchy
+that contains only devices of that class. CRUSH rules can then distribute data
+across the relevant shadow hierarchy. This approach is fully backward
+compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
+items displayed, run the following command:
+
+.. prompt:: bash #
+
+ ceph osd crush tree --show-shadow
+
+Some older clusters that were created before the Luminous release rely on
+manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
+clusters, there is a *reclassify* tool available that can help them transition
+to device classes without triggering unwanted data movement (see
+:ref:`crush-reclassify`).
+
+Weight sets
+-----------
+
+A *weight set* is an alternative set of weights to use when calculating data
+placement. The normal weights associated with each device in the CRUSH map are
+set in accordance with the device size and indicate how much data should be
+stored where. However, because CRUSH is a probabilistic pseudorandom placement
+process, there is always some variation from this ideal distribution (in the
+same way that rolling a die sixty times will likely not result in exactly ten
+ones and ten sixes). Weight sets allow the cluster to perform numerical
+optimization based on the specifics of your cluster (for example: hierarchy,
+pools) to achieve a balanced distribution.
+
+Ceph supports two types of weight sets:
+
+#. A **compat** weight set is a single alternative set of weights for each
+ device and each node in the cluster. Compat weight sets cannot be expected
+ to correct all anomalies (for example, PGs for different pools might be of
+ different sizes and have different load levels, but are mostly treated alike
+ by the balancer). However, they have the major advantage of being *backward
+ compatible* with previous versions of Ceph. This means that even though
+ weight sets were first introduced in Luminous v12.2.z, older clients (for
+ example, Firefly) can still connect to the cluster when a compat weight set
+ is being used to balance data.
+
+#. A **per-pool** weight set is more flexible in that it allows placement to
+ be optimized for each data pool. Additionally, weights can be adjusted
+ for each position of placement, allowing the optimizer to correct for a
+ subtle skew of data toward devices with small weights relative to their
+ peers (an effect that is usually apparent only in very large clusters
+ but that can cause balancing problems).
+
+When weight sets are in use, the weights associated with each node in the
+hierarchy are visible in a separate column (labeled either as ``(compat)`` or
+as the pool name) in the output of the following command:
+
+.. prompt:: bash #
+
+ ceph osd tree
+
+If both *compat* and *per-pool* weight sets are in use, data placement for a
+particular pool will use its own per-pool weight set if present. If only
+*compat* weight sets are in use, data placement will use the compat weight set.
+If neither are in use, data placement will use the normal CRUSH weights.
+
+Although weight sets can be set up and adjusted manually, we recommend enabling
+the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
+cluster is running Luminous or a later release.
+
+Modifying the CRUSH map
+=======================
+
+.. _addosd:
+
+Adding/Moving an OSD
+--------------------
+
+.. note:: Under normal conditions, OSDs automatically add themselves to the
+ CRUSH map when they are created. The command in this section is rarely
+ needed.
+
+
+To add or move an OSD in the CRUSH map of a running cluster, run a command of
+the following form:
+
+.. prompt:: bash $
+
+ ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
+
+For details on this command's parameters, see the following:
+
+``name``
+ :Description: The full name of the OSD.
+ :Type: String
+ :Required: Yes
+ :Example: ``osd.0``
+
+
+``weight``
+ :Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in terabytes (TB).
+ :Type: Double
+ :Required: Yes
+ :Example: ``2.0``
+
+
+``root``
+ :Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``).
+ :Type: Key-value pair.
+ :Required: Yes
+ :Example: ``root=default``
+
+
+``bucket-type``
+ :Description: The OSD's location in the CRUSH hierarchy.
+ :Type: Key-value pairs.
+ :Required: No
+ :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
+
+In the following example, the command adds ``osd.0`` to the hierarchy, or moves
+``osd.0`` from a previous location:
+
+.. prompt:: bash $
+
+ ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
+
+
+Adjusting OSD weight
+--------------------
+
+.. note:: Under normal conditions, OSDs automatically add themselves to the
+ CRUSH map with the correct weight when they are created. The command in this
+ section is rarely needed.
+
+To adjust an OSD's CRUSH weight in a running cluster, run a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph osd crush reweight {name} {weight}
+
+For details on this command's parameters, see the following:
+
+``name``
+ :Description: The full name of the OSD.
+ :Type: String
+ :Required: Yes
+ :Example: ``osd.0``
+
+
+``weight``
+ :Description: The CRUSH weight of the OSD.
+ :Type: Double
+ :Required: Yes
+ :Example: ``2.0``
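+
+In the following example, the command sets the CRUSH weight of ``osd.0`` to
+``2.0``:
+
+.. prompt:: bash $
+
+   ceph osd crush reweight osd.0 2.0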
+
+
+.. _removeosd:
+
+Removing an OSD
+---------------
+
+.. note:: OSDs are normally removed from the CRUSH map as a result of the
+   ``ceph osd purge`` command. The command in this section is rarely needed.
+
+To remove an OSD from the CRUSH map of a running cluster, run a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph osd crush remove {name}
+
+For details on the ``name`` parameter, see the following:
+
+``name``
+ :Description: The full name of the OSD.
+ :Type: String
+ :Required: Yes
+ :Example: ``osd.0``
+
+
+Adding a CRUSH Bucket
+---------------------
+
+.. note:: Buckets are implicitly created when an OSD is added and the command
+ that creates it specifies a ``{bucket-type}={bucket-name}`` as part of the
+ OSD's location (provided that a bucket with that name does not already
+ exist). The command in this section is typically used when manually
+ adjusting the structure of the hierarchy after OSDs have already been
+ created. One use of this command is to move a series of hosts to a new
+ rack-level bucket. Another use of this command is to add new ``host``
+ buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive
+ any data until they are ready to receive data. When they are ready, move the
+ buckets to the ``default`` root or to any other root as described below.
+
+To add a bucket in the CRUSH map of a running cluster, run a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph osd crush add-bucket {bucket-name} {bucket-type}
+
+For details on this command's parameters, see the following:
+
+``bucket-name``
+ :Description: The full name of the bucket.
+ :Type: String
+ :Required: Yes
+ :Example: ``rack12``
+
+
+``bucket-type``
+ :Description: The type of the bucket. This type must already exist in the CRUSH hierarchy.
+ :Type: String
+ :Required: Yes
+ :Example: ``rack``
+
+In the following example, the command adds the ``rack12`` bucket to the hierarchy:
+
+.. prompt:: bash $
+
+ ceph osd crush add-bucket rack12 rack
+
+Moving a Bucket
+---------------
+
+To move a bucket to a different location or position in the CRUSH map
+hierarchy, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]
+
+For details on this command's parameters, see the following:
+
+``bucket-name``
+ :Description: The name of the bucket that you are moving.
+ :Type: String
+ :Required: Yes
+ :Example: ``foo-bar-1``
+
+``bucket-type``
+ :Description: The bucket's new location in the CRUSH hierarchy.
+ :Type: Key-value pairs.
+ :Required: No
+ :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
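+
+In the following example, the command moves the ``rack12`` bucket (created in
+the previous section) under the ``default`` root:
+
+.. prompt:: bash $
+
+   ceph osd crush move rack12 root=default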
+
+Removing a Bucket
+-----------------
+
+To remove a bucket from the CRUSH hierarchy, run a command of the following
+form:
+
+.. prompt:: bash $
+
+ ceph osd crush remove {bucket-name}
+
+.. note:: A bucket must already be empty before it is removed from the CRUSH
+ hierarchy. In other words, there must not be OSDs or any other CRUSH buckets
+ within it.
+
+For details on the ``bucket-name`` parameter, see the following:
+
+``bucket-name``
+ :Description: The name of the bucket that is being removed.
+ :Type: String
+ :Required: Yes
+ :Example: ``rack12``
+
+In the following example, the command removes the ``rack12`` bucket from the
+hierarchy:
+
+.. prompt:: bash $
+
+ ceph osd crush remove rack12
+
+Creating a compat weight set
+----------------------------
+
+.. note:: Normally this action is done automatically if needed by the
+ ``balancer`` module (provided that the module is enabled).
+
+To create a *compat* weight set, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush weight-set create-compat
+
+To adjust the weights of the compat weight set, run a command of the following
+form:
+
+.. prompt:: bash $
+
+ ceph osd crush weight-set reweight-compat {name} {weight}
+
+To destroy the compat weight set, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush weight-set rm-compat
+
+Creating per-pool weight sets
+-----------------------------
+
+To create a weight set for a specific pool, run a command of the following
+form:
+
+.. prompt:: bash $
+
+ ceph osd crush weight-set create {pool-name} {mode}
+
+.. note:: Per-pool weight sets can be used only if all servers and daemons are
+ running Luminous v12.2.z or a later release.
+
+For details on this command's parameters, see the following:
+
+``pool-name``
+ :Description: The name of a RADOS pool.
+ :Type: String
+ :Required: Yes
+ :Example: ``rbd``
+
+``mode``
+ :Description: Either ``flat`` or ``positional``. A *flat* weight set
+ assigns a single weight to all devices or buckets. A
+ *positional* weight set has a potentially different
+ weight for each position in the resulting placement
+ mapping. For example: if a pool has a replica count of
+ ``3``, then a positional weight set will have three
+ weights for each device and bucket.
+ :Type: String
+ :Required: Yes
+ :Example: ``flat``
+
+To adjust the weight of an item in a weight set, run a command of the following
+form:
+
+.. prompt:: bash $
+
+ ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
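+
+For example, assuming a hypothetical pool named ``rbd`` with a replica count
+of ``3`` and a ``positional`` weight set, the following command sets the three
+positional weights of ``osd.0``:
+
+.. prompt:: bash $
+
+   ceph osd crush weight-set reweight rbd osd.0 0.9 1.0 1.0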
+
+To list existing weight sets, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush weight-set ls
+
+To remove a weight set, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd crush weight-set rm {pool-name}
+
+
+Creating a rule for a replicated pool
+-------------------------------------
+
+When you create a CRUSH rule for a replicated pool, there is an important
+decision to make: selecting a failure domain. For example, if you select a
+failure domain of ``host``, then CRUSH will ensure that each replica of the
+data is stored on a unique host. Alternatively, if you select a failure domain
+of ``rack``, then each replica of the data will be stored in a different rack.
+Your selection of failure domain should be guided by the size of your cluster
+and by its CRUSH topology.
+
+The entire cluster hierarchy is typically nested beneath a root node that is
+named ``default``. If you have customized your hierarchy, you might want to
+create a rule nested beneath some other node in the hierarchy. In creating
+this rule for the customized hierarchy, the node type doesn't matter, and in
+particular the rule does not have to be nested beneath a ``root`` node.
+
+It is possible to create a rule that restricts data placement to a specific
+*class* of device. By default, Ceph OSDs automatically classify themselves as
+either ``hdd`` or ``ssd`` in accordance with the underlying type of device
+being used. These device classes can be customized. One might set the ``device
+class`` of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might set
+them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that rules
+and pools may be flexibly constrained to use (or avoid using) specific subsets
+of OSDs based on specific requirements.
+
+To create a rule for a replicated pool, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
+
+For details on this command's parameters, see the following:
+
+``name``
+ :Description: The name of the rule.
+ :Type: String
+ :Required: Yes
+ :Example: ``rbd-rule``
+
+``root``
+ :Description: The name of the CRUSH hierarchy node under which data is to be placed.
+ :Type: String
+ :Required: Yes
+ :Example: ``default``
+
+``failure-domain-type``
+ :Description: The type of CRUSH nodes used for the replicas of the failure domain.
+ :Type: String
+ :Required: Yes
+ :Example: ``rack``
+
+``class``
+ :Description: The device class on which data is to be placed.
+ :Type: String
+ :Required: No
+ :Example: ``ssd``
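+
+For example, the following hypothetical command creates a rule named
+``rep-rack-hdd`` that places each replica in a separate rack and restricts
+placement to ``hdd`` devices:
+
+.. prompt:: bash $
+
+   ceph osd crush rule create-replicated rep-rack-hdd default rack hdd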
+
+Creating a rule for an erasure-coded pool
+-----------------------------------------
+
+For an erasure-coded pool, similar decisions need to be made: what the failure
+domain is, which node in the hierarchy data will be placed under (usually
+``default``), and whether placement is restricted to a specific device class.
+However, erasure-code pools are created in a different way: there is a need to
+construct them carefully with reference to the erasure code plugin in use. For
+this reason, these decisions must be incorporated into the **erasure-code
+profile**. A CRUSH rule will then be created from the erasure-code profile,
+either explicitly or automatically when the profile is used to create a pool.
+
+To list the erasure-code profiles, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile ls
+
+To view a specific existing profile, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile get {profile-name}
+
+Under normal conditions, profiles should never be modified; instead, a new
+profile should be created and used when creating either a new pool or a new
+rule for an existing pool.
+
+An erasure-code profile consists of a set of key-value pairs. Most of these
+key-value pairs govern the behavior of the erasure code that encodes data in
+the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH
+rule that is created.
+
+The relevant erasure-code profile properties are as follows:
+
+ * **crush-root**: the name of the CRUSH node under which to place data
+ [default: ``default``].
+ * **crush-failure-domain**: the CRUSH bucket type used in the distribution of
+ erasure-coded shards [default: ``host``].
+ * **crush-device-class**: the device class on which to place data [default:
+ none, which means that all devices are used].
+ * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
+ number of erasure-code shards, affecting the resulting CRUSH rule.
+
+After a profile is defined, you can create a CRUSH rule by running a command
+of the following form:
+
+.. prompt:: bash $
+
+ ceph osd crush rule create-erasure {name} {profile-name}
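+
+For example, the following hypothetical commands define a profile named
+``ec42hdd`` (``k=4``, ``m=2``, a ``rack`` failure domain, and ``hdd`` devices
+only) and then create a CRUSH rule from it:
+
+.. prompt:: bash $
+
+   ceph osd erasure-code-profile set ec42hdd k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
+   ceph osd crush rule create-erasure ec42hdd-rule ec42hdd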
+
+.. note:: When creating a new pool, it is not necessary to create the rule
+ explicitly. If only the erasure-code profile is specified and the rule
+ argument is omitted, then Ceph will create the CRUSH rule automatically.
+
+
+Deleting rules
+--------------
+
+To delete rules that are not in use by pools, run a command of the following
+form:
+
+.. prompt:: bash $
+
+ ceph osd crush rule rm {rule-name}
+
+.. _crush-map-tunables:
+
+Tunables
+========
+
+The CRUSH algorithm that is used to calculate the placement of data has been
+improved over time. In order to support changes in behavior, we have provided
+users with sets of tunables that determine which legacy or optimal version of
+CRUSH is to be used.
+
+In order to use newer tunables, all Ceph clients and daemons must support the
+new major release of CRUSH. Because of this requirement, we have created
+``profiles`` that are named after the Ceph version in which they were
+introduced. For example, the ``firefly`` tunables were first supported by the
+Firefly release and do not work with older clients (for example, clients
+running Dumpling). After a cluster's tunables profile is changed from a legacy
+set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` options
+will prevent older clients that do not support the new CRUSH features from
+connecting to the cluster.
+
+argonaut (legacy)
+-----------------
+
+The legacy CRUSH behavior used by Argonaut and older releases works fine for
+most clusters, provided that not many OSDs have been marked ``out``.
+
+bobtail (CRUSH_TUNABLES2)
+-------------------------
+
+The ``bobtail`` tunable profile provides the following improvements:
+
+ * For hierarchies with a small number of devices in leaf buckets, some PGs
+ might map to fewer than the desired number of replicas, resulting in
+ ``undersized`` PGs. This is known to happen in the case of hierarchies with
+ ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each
+ host.
+
+ * For large clusters, a small percentage of PGs might map to fewer than the
+ desired number of OSDs. This is known to happen when there are multiple
+   hierarchy layers in use (for example, ``row``, ``rack``, ``host``,
+ ``osd``).
+
+ * When one or more OSDs are marked ``out``, data tends to be redistributed
+ to nearby OSDs instead of across the entire hierarchy.
+
+The tunables introduced in the Bobtail release are as follows:
+
+ * ``choose_local_tries``: Number of local retries. The legacy value is ``2``,
+ and the optimal value is ``0``.
+
+ * ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal
+   value is ``0``.
+
+ * ``choose_total_tries``: Total number of attempts to choose an item. The
+ legacy value is ``19``, but subsequent testing indicates that a value of
+ ``50`` is more appropriate for typical clusters. For extremely large
+ clusters, an even larger value might be necessary.
+
+ * ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will
+ retry, or try only once and allow the original placement to retry. The
+ legacy default is ``0``, and the optimal value is ``1``.
+
+Migration impact:
+
+ * Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a
+ moderate amount of data movement. Use caution on a cluster that is already
+ populated with data.
+
+firefly (CRUSH_TUNABLES3)
+-------------------------
+
+chooseleaf_vary_r
+~~~~~~~~~~~~~~~~~
+
+This ``firefly`` tunable profile fixes a problem with ``chooseleaf`` CRUSH step
+behavior. This problem arose when a large fraction of OSDs were marked ``out``, which resulted in PG mappings with too few OSDs.
+
+This profile was introduced in the Firefly release, and adds a new tunable as follows:
+
+ * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start
+ with a non-zero value of ``r``, as determined by the number of attempts the
+ parent has already made. The legacy default value is ``0``, but with this
+ value CRUSH is sometimes unable to find a mapping. The optimal value (in
+ terms of computational cost and correctness) is ``1``.
+
+Migration impact:
+
+ * For existing clusters that store a great deal of data, changing this tunable
+ from ``0`` to ``1`` will trigger a large amount of data migration; a value
+ of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will
+ cause less data to move.
+
+straw_calc_version tunable
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There were problems with the internal weights calculated and stored in the
+CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH
+weight of ``0`` or with a mix of different and unique weights, CRUSH would
+distribute data incorrectly (that is, not in proportion to the weights).
+
+This tunable, introduced in the Firefly release, is as follows:
+
+ * ``straw_calc_version``: A value of ``0`` preserves the old, broken
+ internal-weight calculation; a value of ``1`` fixes the problem.
+
+Migration impact:
+
+ * Changing this tunable to a value of ``1`` and then adjusting a straw bucket
+ (either by adding, removing, or reweighting an item or by using the
+ reweight-all command) can trigger a small to moderate amount of data
+ movement provided that the cluster has hit one of the problematic
+ conditions.
+
+This tunable option is notable in that it has absolutely no impact on the
+required kernel version in the client side.
+
+hammer (CRUSH_V4)
+-----------------
+
+The ``hammer`` tunable profile does not affect the mapping of existing CRUSH
+maps simply by changing the profile. However:
+
+ * There is a new bucket algorithm supported: ``straw2``. This new algorithm
+ fixes several limitations in the original ``straw``. More specifically, the
+ old ``straw`` buckets would change some mappings that should not have
+ changed when a weight was adjusted, while ``straw2`` achieves the original
+ goal of changing mappings only to or from the bucket item whose weight has
+ changed.
+
+ * The ``straw2`` type is the default type for any newly created buckets.
+
+Migration impact:
+
+ * Changing a bucket type from ``straw`` to ``straw2`` will trigger a small
+ amount of data movement, depending on how much the bucket items' weights
+ vary from each other. When the weights are all the same no data will move,
+ and the more variance there is in the weights the more movement there will
+ be.
+
+jewel (CRUSH_TUNABLES5)
+-----------------------
+
+The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a
+result, significantly fewer mappings change when an OSD is marked ``out`` of
+the cluster. This improvement results in significantly less data movement.
+
+The new tunable introduced in the Jewel release is as follows:
+
+ * ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt
+ will use a better value for an inner loop that greatly reduces the number of
+ mapping changes when an OSD is marked ``out``. The legacy value is ``0``,
+ and the new value of ``1`` uses the new approach.
+
+Migration impact:
+
+ * Changing this value on an existing cluster will result in a very large
+ amount of data movement because nearly every PG mapping is likely to change.
+
+Client versions that support CRUSH_TUNABLES2
+--------------------------------------------
+
+ * v0.55 and later, including Bobtail (v0.56.x)
+ * Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients)
+
+Client versions that support CRUSH_TUNABLES3
+--------------------------------------------
+
+ * v0.78 (Firefly) and later
+ * Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients)
+
+Client versions that support CRUSH_V4
+-------------------------------------
+
+ * v0.94 (Hammer) and later
+ * Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients)
+
+Client versions that support CRUSH_TUNABLES5
+--------------------------------------------
+
+ * v10.0.2 (Jewel) and later
+ * Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients)
+
+"Non-optimal tunables" warning
+------------------------------
+
+In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush
+map has non-optimal tunables") if any of the current CRUSH tunables have
+non-optimal values: that is, if any fail to have the optimal values from the
+:ref:`default profile
+<rados_operations_crush_map_default_profile_definition>`. There are two
+different ways to silence the alert:
+
+1. Adjust the CRUSH tunables on the existing cluster so as to render them
+ optimal. Making this adjustment will trigger some data movement
+ (possibly as much as 10%). This approach is generally preferred to the
+ other approach, but special care must be taken in situations where
+ data movement might affect performance: for example, in production clusters.
+ To enable optimal tunables, run the following command:
+
+ .. prompt:: bash $
+
+ ceph osd crush tunables optimal
+
+ There are several potential problems that might make it preferable to revert
+ to the previous values of the tunables. The new values might generate too
+ much load for the cluster to handle, the new values might unacceptably slow
+ the operation of the cluster, or there might be a client-compatibility
+ problem. Such client-compatibility problems can arise when using old-kernel
+ CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to
+ the previous values of the tunables, run the following command:
+
+ .. prompt:: bash $
+
+ ceph osd crush tunables legacy
+
+2. To silence the alert without making any changes to CRUSH,
+ add the following option to the ``[mon]`` section of your ceph.conf file::
+
+ mon_warn_on_legacy_crush_tunables = false
+
+ In order for this change to take effect, you will need to either restart
+ the monitors or run the following command to apply the option to the
+ monitors while they are still running:
+
+ .. prompt:: bash $
+
+ ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false
+
+
+Tuning CRUSH
+------------
+
+When making adjustments to CRUSH tunables, keep the following considerations in
+mind:
+
+ * Adjusting the values of CRUSH tunables will result in the shift of one or
+ more PGs from one storage node to another. If the Ceph cluster is already
+ storing a great deal of data, be prepared for significant data movement.
+ * When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they
+ immediately begin rejecting new connections from clients that do not support
+ the new feature. However, already-connected clients are effectively
+ grandfathered in, and any of these clients that do not support the new
+ feature will malfunction.
+ * If the CRUSH tunables are set to newer (non-legacy) values and subsequently
+ reverted to the legacy values, ``ceph-osd`` daemons will not be required to
+ support any of the newer CRUSH features associated with the newer
+ (non-legacy) values. However, the OSD peering process requires the
+ examination and understanding of old maps. For this reason, **if the cluster
+ has previously used non-legacy CRUSH values, do not run old versions of
+ the** ``ceph-osd`` **daemon** -- even if the latest version of the map has
+ been reverted so as to use the legacy defaults.
+
+The simplest way to adjust CRUSH tunables is to apply them in matched sets
+known as *profiles*. As of the Octopus release, Ceph supports the following
+profiles:
+
+ * ``legacy``: The legacy behavior from argonaut and earlier.
+ * ``argonaut``: The legacy values supported by the argonaut release.
+ * ``bobtail``: The values supported by the bobtail release.
+ * ``firefly``: The values supported by the firefly release.
+ * ``hammer``: The values supported by the hammer release.
+ * ``jewel``: The values supported by the jewel release.
+ * ``optimal``: The best values for the current version of Ceph.
+
+ .. _rados_operations_crush_map_default_profile_definition:
+
+ * ``default``: The default values of a new cluster that has been installed
+ from scratch. These values, which depend on the current version of Ceph, are
+ hardcoded and are typically a mix of optimal and legacy values. These
+ values often correspond to the ``optimal`` profile of either the previous
+ LTS (long-term service) release or the most recent release for which most
+ users are expected to have up-to-date clients.
+
+To apply a profile to a running cluster, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd crush tunables {PROFILE}
+
+This action might trigger a great deal of data movement. Consult release notes
+and documentation before changing the profile on a running cluster. Consider
+throttling recovery and backfill parameters in order to limit the backfill
+resulting from a specific change.
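+
+To review the tunable values that are currently in effect (for example, before
+and after applying a profile), run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush show-tunables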
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
+
+
+Tuning Primary OSD Selection
+============================
+
+When a Ceph client reads or writes data, it first contacts the primary OSD in
+each affected PG's acting set. By default, the first OSD in the acting set is
+the primary OSD (also known as the "lead OSD"). For example, in the acting set
+``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD.
+However, sometimes it is clear that an OSD is not well suited to act as the
+lead as compared with other OSDs (for example, if the OSD has a slow drive or a
+slow controller). To prevent performance bottlenecks (especially on read
+operations) and at the same time maximize the utilization of your hardware, you
+can influence the selection of the primary OSD either by adjusting "primary
+affinity" values, or by crafting a CRUSH rule that selects OSDs that are better
+suited to act as the lead rather than other OSDs.
+
+To determine whether tuning Ceph's selection of primary OSDs will improve
+cluster performance, pool redundancy strategy must be taken into account. For
+replicated pools, this tuning can be especially useful, because by default read
+operations are served from the primary OSD of each PG. For erasure-coded pools,
+however, the speed of read operations can be increased by enabling **fast
+read** (see :ref:`pool-settings`).
+
+.. _rados_ops_primary_affinity:
+
+Primary Affinity
+----------------
+
+**Primary affinity** is a characteristic of an OSD that governs the likelihood
+that a given OSD will be selected as the primary OSD (or "lead OSD") in a given
+acting set. A primary affinity value can be any real number in the range ``0``
+to ``1``, inclusive.
+
+As an example of a common scenario in which it can be useful to adjust primary
+affinity values, let us suppose that a cluster contains a mix of drive sizes:
+for example, suppose it contains some older racks with 1.9 TB SATA SSDs and
+some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned
+twice the number of PGs and will thus serve twice the number of write and read
+operations -- they will be busier than the former. In such a scenario, you
+might make a rough assignment of primary affinity as inversely proportional to
+OSD size. Such an assignment will not be 100% optimal, but it can readily
+achieve a 15% improvement in overall read throughput by means of a more even
+utilization of SATA interface bandwidth and CPU cycles. This example is not
+merely a thought experiment meant to illustrate the theoretical benefits of
+adjusting primary affinity values; this fifteen percent improvement was
+achieved on an actual Ceph cluster.
+
+By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster
+in which every OSD has this default value, all OSDs are equally likely to act
+as a primary OSD.
+
+By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less
+likely to select the OSD as primary in a PG's acting set. To change the weight
+value associated with a specific OSD's primary affinity, run a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph osd primary-affinity <osd-id> <weight>
+
+The primary affinity of an OSD can be set to any real number in the range
+``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as
+primary and ``1`` indicates that the OSD is maximally likely to be used as a
+primary. When the weight is between these extremes, its value indicates roughly
+how likely it is that CRUSH will select the OSD associated with it as a
+primary.
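+
+For example, the following command sets the primary affinity of a hypothetical
+``osd.7`` to ``0.5``, making it roughly half as likely (in the first-order
+sense described below) to be selected as primary:
+
+.. prompt:: bash $
+
+   ceph osd primary-affinity osd.7 0.5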
+
+The process by which CRUSH selects the lead OSD is not a mere function of a
+simple probability determined by relative affinity values. Nevertheless,
+measurable results can be achieved even with first-order approximations of
+desirable primary affinity values.
+
+
+Custom CRUSH Rules
+------------------
+
+Some clusters balance cost and performance by mixing SSDs and HDDs in the same
+replicated pool. By setting the primary affinity of HDD OSDs to ``0``,
+operations will be directed to an SSD OSD in each acting set. Alternatively,
+you can define a CRUSH rule that always selects an SSD OSD as the primary OSD
+and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting
+set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs.
+
+For example, see the following CRUSH rule::
+
+ rule mixed_replicated_rule {
+ id 11
+ type replicated
+ step take default class ssd
+ step chooseleaf firstn 1 type host
+ step emit
+ step take default class hdd
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool,
+this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on
+different hosts, because the first SSD OSD might be colocated with any of the
+``N`` HDD OSDs.
+
+To avoid this extra storage requirement, you might place SSDs and HDDs in
+different hosts. However, taking this approach means that all client requests
+will be received by hosts with SSDs. For this reason, it might be advisable to
+have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the
+latter will under normal circumstances perform only recovery operations. Here
+the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement
+not to contain any of the same servers, as seen in the following CRUSH rule::
+
+ rule mixed_replicated_rule_two {
+ id 1
+ type replicated
+ step take ssd_hosts class ssd
+ step chooseleaf firstn 1 type host
+ step emit
+ step take hdd_hosts class hdd
+ step chooseleaf firstn -1 type host
+ step emit
+ }
+
+.. note:: If a primary SSD OSD fails, then requests to the associated PG will
+ be temporarily served from a slower HDD OSD until the PG's data has been
+ replicated onto the replacement primary SSD OSD.
+
+
diff --git a/doc/rados/operations/data-placement.rst b/doc/rados/operations/data-placement.rst
new file mode 100644
index 000000000..3d3be65ec
--- /dev/null
+++ b/doc/rados/operations/data-placement.rst
@@ -0,0 +1,47 @@
+=========================
+ Data Placement Overview
+=========================
+
+Ceph stores, replicates, and rebalances data objects across a RADOS cluster
+dynamically. Because different users store objects in different pools for
+different purposes on many OSDs, Ceph operations require a certain amount of
+data-placement planning. The main data-placement planning concepts in Ceph
+include:
+
+- **Pools:** Ceph stores data within pools, which are logical groups used for
+ storing objects. Pools manage the number of placement groups, the number of
+ replicas, and the CRUSH rule for the pool. To store data in a pool, it is
+ necessary to be an authenticated user with permissions for the pool. Ceph is
+ able to make snapshots of pools. For additional details, see `Pools`_.
+
+- **Placement Groups:** Ceph maps objects to placement groups. Placement
+ groups (PGs) are shards or fragments of a logical object pool that place
+ objects as a group into OSDs. Placement groups reduce the amount of
+ per-object metadata that is necessary for Ceph to store the data in OSDs. A
+ greater number of placement groups (for example, 100 PGs per OSD as compared
+ with 50 PGs per OSD) leads to better balancing. For additional details, see
+ :ref:`placement groups`.
+
+- **CRUSH Maps:** CRUSH plays a major role in allowing Ceph to scale while
+ avoiding certain pitfalls, such as performance bottlenecks, limitations to
+ scalability, and single points of failure. CRUSH maps provide the physical
+ topology of the cluster to the CRUSH algorithm, so that it can determine both
+ (1) where the data for an object and its replicas should be stored and (2)
+ how to store that data across failure domains so as to improve data safety.
+ For additional details, see `CRUSH Maps`_.
+
+- **Balancer:** The balancer is a feature that automatically optimizes the
+  distribution of placement groups across devices in order to achieve a
+  balanced data distribution, to maximize the amount of data that can be
+  stored in the cluster, and to evenly distribute the workload across OSDs.
+  For additional details, see `Balancer`_.
+
+It is possible to use the default values for each of the above components.
+Default values are recommended for a test cluster's initial setup. However,
+when planning a large Ceph cluster, values should be customized for
+data-placement operations with reference to the different roles played by
+pools, placement groups, and CRUSH.
+
+.. _Pools: ../pools
+.. _CRUSH Maps: ../crush-map
+.. _Balancer: ../balancer
diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst
new file mode 100644
index 000000000..f92f622d5
--- /dev/null
+++ b/doc/rados/operations/devices.rst
@@ -0,0 +1,227 @@
+.. _devices:
+
+Device Management
+=================
+
+Device management allows Ceph to address hardware failure. Ceph tracks hardware
+storage devices (HDDs, SSDs) to see which devices are managed by which daemons.
+Ceph also collects health metrics about these devices. By doing so, Ceph can
+provide tools that predict hardware failure and can automatically respond to
+hardware failure.
+
+Device tracking
+---------------
+
+To see a list of the storage devices that are in use, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph device ls
+
+Alternatively, to list devices by daemon or by host, run a command of one of
+the following forms:
+
+.. prompt:: bash $
+
+ ceph device ls-by-daemon <daemon>
+ ceph device ls-by-host <host>
+
+To see information about the location of a specific device and about how the
+device is being consumed, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph device info <devid>
+
+Identifying physical devices
+----------------------------
+
+To make the replacement of failed disks easier and less error-prone, you can
+(in some cases) "blink" the drive's LEDs on hardware enclosures by running a
+command of the following form::
+
+ device light on|off <devid> [ident|fault] [--force]
+
+.. note:: Using this command to blink the lights might not work. Whether it
+ works will depend upon such factors as your kernel revision, your SES
+ firmware, or the setup of your HBA.
+
+The ``<devid>`` parameter is the device identification. To retrieve this
+information, run the following command:
+
+.. prompt:: bash $
+
+ ceph device ls
+
+The ``[ident|fault]`` parameter determines which kind of light will blink. By
+default, the `identification` light is used.
+
+.. note:: This command works only if the Cephadm or the Rook `orchestrator
+ <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_
+ module is enabled. To see which orchestrator module is enabled, run the
+ following command:
+
+ .. prompt:: bash $
+
+ ceph orch status
+
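+For example, following the command form shown above, you might turn a drive's
+identification LED on and then off again (the device ID here is illustrative;
+use an ID reported by ``ceph device ls``)::
+
+    device light on SanDisk_X400_M.2_2280_512GB_162924424784 ident
+    device light off SanDisk_X400_M.2_2280_512GB_162924424784 ident
+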
+The command that makes the drive's LEDs blink is `lsmcli`. To customize this
+command, configure it via a Jinja2 template by running commands of the
+following forms::
+
+ ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
+ ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"
+
+The following arguments can be used to customize the Jinja2 template:
+
+* ``on``
+ A boolean value.
+* ``ident_fault``
+ A string that contains `ident` or `fault`.
+* ``dev``
+ A string that contains the device ID: for example, `SanDisk_X400_M.2_2280_512GB_162924424784`.
+* ``path``
+ A string that contains the device path: for example, `/dev/sda`.
+
+.. _enabling-monitoring:
+
+Enabling monitoring
+-------------------
+
+Ceph can also monitor the health metrics associated with your device. For
+example, SATA drives implement a standard called SMART that provides a wide
+range of internal metrics about the device's usage and health (for example: the
+number of hours powered on, the number of power cycles, the number of
+unrecoverable read errors). Other device types such as SAS and NVMe present a
+similar set of metrics (via slightly different standards). All of these
+metrics can be collected by Ceph via the ``smartctl`` tool.
+
+You can enable or disable health monitoring by running one of the following
+commands:
+
+.. prompt:: bash $
+
+ ceph device monitoring on
+ ceph device monitoring off
+
+Scraping
+--------
+
+If monitoring is enabled, device metrics will be scraped automatically at
+regular intervals. To configure that interval, run a command of the following
+form:
+
+.. prompt:: bash $
+
+ ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>
+
+By default, device metrics are scraped once every 24 hours.
+
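+For example, to scrape every 12 hours instead (the interval is expressed in
+seconds):
+
+.. prompt:: bash $
+
+   ceph config set mgr mgr/devicehealth/scrape_frequency 43200
+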
+To manually scrape all devices, run the following command:
+
+.. prompt:: bash $
+
+ ceph device scrape-health-metrics
+
+To scrape a single device, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph device scrape-health-metrics <device-id>
+
+To scrape a single daemon's devices, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph device scrape-daemon-health-metrics <who>
+
+To retrieve the stored health metrics for a device (optionally for a specific
+timestamp), run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph device get-health-metrics <devid> [sample-timestamp]
+
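+For example, a hypothetical sequence that scrapes the devices attached to
+``osd.0`` and then retrieves the stored metrics for one of them (the daemon
+name and device ID are illustrative):
+
+.. prompt:: bash $
+
+   ceph device scrape-daemon-health-metrics osd.0
+   ceph device get-health-metrics SanDisk_X400_M.2_2280_512GB_162924424784
+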
+Failure prediction
+------------------
+
+Ceph can predict drive life expectancy and device failures by analyzing the
+health metrics that it collects. The prediction modes are as follows:
+
+* *none*: disable device failure prediction.
+* *local*: use a pre-trained prediction model from the ``ceph-mgr`` daemon.
+
+To configure the prediction mode, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph config set global device_failure_prediction_mode <mode>
+
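+For example, to enable the pre-trained local prediction module:
+
+.. prompt:: bash $
+
+   ceph config set global device_failure_prediction_mode local
+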
+Under normal conditions, failure prediction runs periodically in the
+background. For this reason, life expectancy values might be populated only
+after a significant amount of time has passed. The life expectancy of all
+devices is displayed in the output of the following command:
+
+.. prompt:: bash $
+
+ ceph device ls
+
+To see the metadata of a specific device, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph device info <devid>
+
+To explicitly force prediction of a specific device's life expectancy, run a
+command of the following form:
+
+.. prompt:: bash $
+
+ ceph device predict-life-expectancy <devid>
+
+In addition to Ceph's internal device failure prediction, you might have an
+external source of information about device failures. To inform Ceph of a
+specific device's life expectancy, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph device set-life-expectancy <devid> <from> [<to>]
+
+Life expectancies are expressed as a time interval, so the uncertainty of the
+estimate can be conveyed as a range of time, possibly a wide one. The
+interval's end can be left unspecified.
+
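+For example, the following sketch records an externally determined life
+expectancy for a hypothetical device. The device ID and dates are
+illustrative, and the exact timestamp format accepted may depend on your Ceph
+release:
+
+.. prompt:: bash $
+
+   ceph device set-life-expectancy SanDisk_X400_M.2_2280_512GB_162924424784 2025-06-01 2025-09-01
+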
+Health alerts
+-------------
+
+The ``mgr/devicehealth/warn_threshold`` configuration option controls the
+health check for an expected device failure. If the device is expected to fail
+within the specified time interval, an alert is raised.
+
+To check the stored life expectancy of all devices and generate any appropriate
+health alert, run the following command:
+
+.. prompt:: bash $
+
+ ceph device check-health
+
+Automatic Migration
+-------------------
+
+The ``mgr/devicehealth/self_heal`` option (enabled by default) automatically
+migrates data away from devices that are expected to fail soon. If this option
+is enabled, the module marks such devices ``out`` so that automatic migration
+will occur.
+
+.. note:: The ``mon_osd_min_up_ratio`` configuration option can help prevent
+ this process from cascading to total failure. If the "self heal" module
+ marks ``out`` so many OSDs that the ratio value of ``mon_osd_min_up_ratio``
+ is exceeded, then the cluster raises the ``DEVICE_HEALTH_TOOMANY`` health
+ check. For instructions on what to do in this situation, see
+ :ref:`DEVICE_HEALTH_TOOMANY<rados_health_checks_device_health_toomany>`.
+
+The ``mgr/devicehealth/mark_out_threshold`` configuration option specifies the
+time interval for automatic migration. If a device is expected to fail within
+the specified time interval, it will be automatically marked ``out``.
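+
+For example, a hypothetical sketch of tuning both thresholds, assuming that
+the values are expressed in seconds (here, eight and four weeks respectively):
+
+.. prompt:: bash $
+
+   ceph config set mgr mgr/devicehealth/warn_threshold 4838400
+   ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200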
diff --git a/doc/rados/operations/erasure-code-clay.rst b/doc/rados/operations/erasure-code-clay.rst
new file mode 100644
index 000000000..1cffa32f5
--- /dev/null
+++ b/doc/rados/operations/erasure-code-clay.rst
@@ -0,0 +1,240 @@
+================
+CLAY code plugin
+================
+
+CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings
+in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let:
+
+ d = number of OSDs contacted during repair
+
+If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
+reading from the *d=8* other OSDs to repair. Recovering, say, 1GiB of data then
+requires downloading 8 × 1GiB = 8GiB of information.
+
+However, in the case of the *clay* plugin *d* is configurable within the limits:
+
+ k+1 <= d <= k+m-1
+
+By default, the clay code plugin picks *d=k+m-1* as it provides the greatest savings in terms
+of network bandwidth and disk IO. In the case of the *clay* plugin configured with
+*k=8*, *m=4* and *d=11*, when a single OSD fails, *d=11* OSDs are contacted and
+250MiB is downloaded from each of them, resulting in a total download of 11 × 250MiB = 2.75GiB
+of information. More general parameters are provided below. The benefits are substantial
+when the repair is carried out for a rack that stores information on the order of
+Terabytes.
+
+ +-------------+---------------------------------------------------------+
+ | plugin | total amount of disk IO |
+ +=============+=========================================================+
+ |jerasure,isa | :math:`k S` |
+ +-------------+---------------------------------------------------------+
+ | clay | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` |
+ +-------------+---------------------------------------------------------+
+
+where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have
+used the largest possible value of *d* as this will result in the smallest amount of data download needed
+to achieve recovery from an OSD failure.
+
+Erasure-code profile examples
+=============================
+
+An example configuration that can be used to observe reduced bandwidth usage:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set CLAYprofile \
+ plugin=clay \
+ k=4 m=2 d=5 \
+ crush-failure-domain=host
+ ceph osd pool create claypool erasure CLAYprofile
+
+
+Creating a clay profile
+=======================
+
+To create a new clay code profile:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=clay \
+ k={data-chunks} \
+ m={coding-chunks} \
+ [d={helper-chunks}] \
+ [scalar_mds={plugin-name}] \
+ [technique={technique-name}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each of which is stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``d={helper-chunks}``
+
+:Description: Number of OSDs requested to send data during recovery of
+ a single chunk. *d* needs to be chosen such that
+ k+1 <= d <= k+m-1. The larger the *d*, the better the savings.
+
+:Type: Integer
+:Required: No.
+:Default: k+m-1
+
+``scalar_mds={jerasure|isa|shec}``
+
+:Description: **scalar_mds** specifies the plugin that is used as a
+ building block in the layered construction. It can be
+ one of *jerasure*, *isa*, *shec*
+
+:Type: String
+:Required: No.
+:Default: jerasure
+
+``technique={technique}``
+
+:Description: **technique** specifies the technique that will be picked
+ within the 'scalar_mds' plugin specified. Supported techniques
+ are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
+ 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van',
+ 'cauchy' for isa and 'single', 'multiple' for shec.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van (for jerasure, isa), single (for shec)
+
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the CRUSH rule. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a CRUSH rule step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
+
+Notion of sub-chunks
+====================
+
+The Clay code is able to save disk IO and network bandwidth because it is a
+vector code: it is able to view and manipulate data within a chunk at a finer
+granularity, termed a sub-chunk. The number of sub-chunks within
+a chunk for a Clay code is given by:
+
+ sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1`
+
+
+During repair of an OSD, the helper information requested
+from an available OSD is only a fraction of a chunk. In fact, the number
+of sub-chunks within a chunk that are accessed during repair is given by:
+
+   repair sub-chunk count = :math:`\frac{\text{sub-chunk count}}{q}`
+
+Examples
+--------
+
+#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
+ 8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
+ during repair.
+#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and repair sub-chunk count
+ is 16. A quarter of a chunk is read from an available OSD for repair of a failed
+ chunk.
+
+
+
+How to choose a configuration given a workload
+==============================================
+
+During repair, only a few of the sub-chunks within a chunk are read. These sub-chunks
+are not necessarily stored consecutively within the chunk. For best disk IO
+performance, it is helpful to read contiguous data. For this reason, it is suggested that
+you choose the stripe size such that the sub-chunk size is sufficiently large.
+
+For a given stripe size (which is fixed based on the workload), choose ``k``, ``m``, ``d`` such that:
+
+   sub-chunk size = :math:`\frac{\text{stripe size}}{k \times \text{sub-chunk count}}` = 4KB, 8KB, 12KB ...
+
+#. For large workloads, for which the stripe size is large, it is easy to choose k, m, and d.
+   For example, with a stripe size of 64MB, choosing *k=16*, *m=4* and *d=19* results
+   in a sub-chunk count of 1024 and a sub-chunk size of 4KB.
+#. For small size workloads, *k=4*, *m=2* is a good configuration that provides both network
+ and disk IO benefits.
+
+Comparisons with LRC
+====================
+
+Locally Recoverable Codes (LRC) are also designed to save network bandwidth and
+disk IO during single-OSD recovery. However, the focus of LRC is to keep the
+number of OSDs contacted during repair (d) minimal, and this comes at the cost of storage overhead.
+The *clay* code has a storage overhead of m/k. In the case of an *lrc*, it stores (k+m)/d parities in
+addition to the ``m`` parities, resulting in a storage overhead of (m+(k+m)/d)/k. Both *clay* and *lrc*
+can recover from the failure of any ``m`` OSDs.
+
+ +-----------------+----------------------------------+----------------------------------+
+ | Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
+ +=================+==================================+==================================+
+ | (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
+ +-----------------+----------------------------------+----------------------------------+
+ | (k=16, m=4)     | 4 * S, 0.5625 (d=4)              | 4.75 * S, 0.25 (d=19)            |
+ +-----------------+----------------------------------+----------------------------------+
+
+
+where ``S`` is the amount of data stored on the single OSD being recovered.
diff --git a/doc/rados/operations/erasure-code-isa.rst b/doc/rados/operations/erasure-code-isa.rst
new file mode 100644
index 000000000..9a43f89a2
--- /dev/null
+++ b/doc/rados/operations/erasure-code-isa.rst
@@ -0,0 +1,107 @@
+=======================
+ISA erasure code plugin
+=======================
+
+The *isa* plugin encapsulates the `ISA
+<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_
+library.
+
+Create an isa profile
+=====================
+
+To create a new *isa* erasure code profile:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=isa \
+ technique={reed_sol_van|cauchy} \
+ [k={data-chunks}] \
+ [m={coding-chunks}] \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: No.
+:Default: 7
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 3
+
+``technique={reed_sol_van|cauchy}``
+
+:Description: The ISA plugin comes in two `Reed Solomon
+ <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_
+ forms. If *reed_sol_van* is set, it is `Vandermonde
+ <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if
+ *cauchy* is set, it is `Cauchy
+ <https://en.wikipedia.org/wiki/Cauchy_matrix>`_.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the CRUSH rule. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a CRUSH rule step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
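+For example, a hypothetical *isa* profile and pool (the names and values are
+illustrative) could be created as follows:
+
+.. prompt:: bash $
+
+   ceph osd erasure-code-profile set ISAprofile \
+      plugin=isa \
+      k=4 m=2 \
+      technique=reed_sol_van \
+      crush-failure-domain=host
+   ceph osd pool create isapool erasure ISAprofile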
diff --git a/doc/rados/operations/erasure-code-jerasure.rst b/doc/rados/operations/erasure-code-jerasure.rst
new file mode 100644
index 000000000..8a0207748
--- /dev/null
+++ b/doc/rados/operations/erasure-code-jerasure.rst
@@ -0,0 +1,123 @@
+============================
+Jerasure erasure code plugin
+============================
+
+The *jerasure* plugin is the most generic and flexible plugin; it is
+also the default for Ceph erasure-coded pools.
+
+The *jerasure* plugin encapsulates the `Jerasure
+<https://github.com/ceph/jerasure>`_ library. It is
+recommended to read the ``jerasure`` documentation to
+understand the parameters. Note that the ``jerasure.org``
+web site as of 2023 may no longer be connected to the original
+project or legitimate.
+
+Create a jerasure profile
+=========================
+
+To create a new *jerasure* erasure code profile:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=jerasure \
+ k={data-chunks} \
+ m={coding-chunks} \
+ technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}``
+
+:Description: The most flexible technique is *reed_sol_van*: it is
+              enough to set *k* and *m*. The *cauchy_good* technique
+              can be faster, but you need to choose the *packetsize*
+ carefully. All of *reed_sol_r6_op*, *liberation*,
+ *blaum_roth*, *liber8tion* are *RAID6* equivalents in
+ the sense that they can only be configured with *m=2*.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van
+
+``packetsize={bytes}``
+
+:Description: The encoding will be done on packets of *bytes* size at
+ a time. Choosing the right packet size is difficult. The
+ *jerasure* documentation contains extensive information
+ on this topic.
+
+:Type: Integer
+:Required: No.
+:Default: 2048
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the CRUSH rule. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a CRUSH rule step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
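+For example, a hypothetical *jerasure* profile and pool (the names and values
+are illustrative) could be created as follows:
+
+.. prompt:: bash $
+
+   ceph osd erasure-code-profile set jerasureprofile \
+      plugin=jerasure \
+      k=4 m=2 \
+      technique=reed_sol_van \
+      crush-failure-domain=host
+   ceph osd pool create jerasurepool erasure jerasureprofile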
diff --git a/doc/rados/operations/erasure-code-lrc.rst b/doc/rados/operations/erasure-code-lrc.rst
new file mode 100644
index 000000000..5329603b9
--- /dev/null
+++ b/doc/rados/operations/erasure-code-lrc.rst
@@ -0,0 +1,388 @@
+======================================
+Locally repairable erasure code plugin
+======================================
+
+With the *jerasure* plugin, when an erasure coded object is stored on
+multiple OSDs, recovering from the loss of one OSD requires reading
+from *k* others. For instance if *jerasure* is configured with
+*k=8* and *m=4*, recovering from the loss of one OSD requires reading
+from eight others.
+
+The *lrc* erasure code plugin creates local parity chunks to enable
+recovery using fewer surviving OSDs. For instance if *lrc* is configured with
+*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
+every four OSDs. When a single OSD is lost, it can be recovered with
+only four OSDs instead of eight.
+
+Erasure code profile examples
+=============================
+
+Reduce recovery bandwidth between hosts
+---------------------------------------
+
+Although it is probably not an interesting use case when all hosts are
+connected to the same switch, reduced bandwidth usage can actually be
+observed:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ k=4 m=2 l=3 \
+ crush-failure-domain=host
+ ceph osd pool create lrcpool erasure LRCprofile
+
+
+Reduce recovery bandwidth between racks
+---------------------------------------
+
+In Firefly the bandwidth reduction will only be observed if the primary
+OSD is in the same rack as the lost chunk:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ k=4 m=2 l=3 \
+ crush-locality=rack \
+ crush-failure-domain=host
+ ceph osd pool create lrcpool erasure LRCprofile
+
+
+Create an lrc profile
+=====================
+
+To create a new lrc erasure code profile:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=lrc \
+ k={data-chunks} \
+ m={coding-chunks} \
+ l={locality} \
+ [crush-root={root}] \
+ [crush-locality={bucket-type}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``l={locality}``
+
+:Description: Group the coding and data chunks into sets of size
+ **locality**. For instance, for **k=4** and **m=2**,
+ when **locality=3** two groups of three are created.
+ Each set can be recovered without reading chunks
+ from another set.
+
+:Type: Integer
+:Required: Yes.
+:Example: 3
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the CRUSH rule. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-locality={bucket-type}``
+
+:Description: The type of the CRUSH bucket in which each set of chunks
+ defined by **l** will be stored. For instance, if it is
+ set to **rack**, each group of **l** chunks will be
+ placed in a different rack. It is used to create a
+ CRUSH rule step such as **step choose rack**. If it is not
+ set, no such grouping is done.
+
+:Type: String
+:Required: No.
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a CRUSH rule step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
+Low level plugin configuration
+==============================
+
+The sum of **k** and **m** must be a multiple of the **l** parameter.
+The low-level configuration parameters, however, do not enforce this
+restriction, and it may be advantageous to use them for specific
+purposes. It is, for instance, possible to define two groups, one with 4
+chunks and another with 3 chunks. It is also possible to recursively
+define locality sets: for instance, datacenters, and racks within
+datacenters. The **k/m/l** parameters are implemented by generating a
+low-level configuration.
+
+The *lrc* erasure code plugin recursively applies erasure code
+techniques so that recovering from the loss of some chunks only
+requires a subset of the available chunks, most of the time.
+
+For instance, when three coding steps are described as::
+
+ chunk nr 01234567
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+where *c* are coding chunks calculated from the data chunks *D*, the
+loss of chunk *7* can be recovered with the last four chunks. And the
+loss of chunk *2* chunk can be recovered with the first four
+chunks.
+
+Erasure code profile examples using low level configuration
+===========================================================
+
+Minimal testing
+---------------
+
+The following configuration is strictly equivalent to using a *K=2* *M=1*
+erasure code profile. The *DD* implies *K=2*, the *c* implies *M=1*, and the
+*jerasure* plugin is used by default:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=DD_ \
+ layers='[ [ "DDc", "" ] ]'
+ ceph osd pool create lrcpool erasure LRCprofile
+
+Reduce recovery bandwidth between hosts
+---------------------------------------
+
+Although it is probably not an interesting use case when all hosts are
+connected to the same switch, reduced bandwidth usage can actually be
+observed. It is equivalent to **k=4**, **m=2** and **l=3** although
+the layout of the chunks is different.
+
+::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "" ],
+ [ "cDDD____", "" ],
+ [ "____cDDD", "" ],
+ ]'
+ $ ceph osd pool create lrcpool erasure LRCprofile
+
+
+Reduce recovery bandwidth between racks
+---------------------------------------
+
+In Firefly the reduced bandwidth will only be observed if the primary OSD is in
+the same rack as the lost chunk.
+
+::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "" ],
+ [ "cDDD____", "" ],
+ [ "____cDDD", "" ],
+ ]' \
+ crush-steps='[
+ [ "choose", "rack", 2 ],
+ [ "chooseleaf", "host", 4 ],
+ ]'
+
+ $ ceph osd pool create lrcpool erasure LRCprofile
+
+Testing with different Erasure Code backends
+--------------------------------------------
+
+LRC now uses jerasure as the default EC backend. It is possible to
+specify the EC backend/algorithm on a per layer basis using the low
+level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
+is actually an erasure code profile to be used for this level. The
+example below specifies the ISA backend with the cauchy technique to
+be used in the lrcpool:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=DD_ \
+ layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
+ ceph osd pool create lrcpool erasure LRCprofile
+
+You could also use a different erasure code profile for each
+layer.
+
+::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
+ [ "cDDD____", "plugin=isa" ],
+ [ "____cDDD", "plugin=jerasure" ],
+ ]'
+ $ ceph osd pool create lrcpool erasure LRCprofile
+
+
+
+Erasure coding and decoding algorithm
+=====================================
+
+The steps found in the layers description::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+are applied in order. For instance, if a 4K object is encoded, it will
+first go through *step 1* and be divided in four 1K chunks (the four
+uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
+order. From these, two coding chunks are calculated (the two lowercase
+c). The coding chunks are stored in the chunks 1 and 5, respectively.
+
+The *step 2* re-uses the content created by *step 1* in a similar
+fashion and stores a single coding chunk *c* at position 0. The last four
+chunks, marked with an underscore (*_*) for readability, are ignored.
+
+The *step 3* stores a single coding chunk *c* at position 4. The three
+chunks created by *step 1* are used to compute this coding chunk,
+i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.
+
+If chunk *2* is lost::
+
+ chunk nr 01234567
+
+ step 1 _c D_cDD
+ step 2 cD D____
+ step 3 __ _cDDD
+
+decoding will attempt to recover it by walking the steps in reverse
+order: *step 3* then *step 2* and finally *step 1*.
+
+The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
+and is skipped.
+
+The coding chunk from *step 2*, stored in chunk *0*, allows it to
+recover the content of chunk *2*. There are no more chunks to recover
+and the process stops, without considering *step 1*.
+
+Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
+back chunk *2*.
+
+If chunks *2, 3, 6* are lost::
+
+ chunk nr 01234567
+
+ step 1 _c _c D
+ step 2 cD __ _
+ step 3 __ cD D
+
+The *step 3* can recover the content of chunk *6*::
+
+ chunk nr 01234567
+
+ step 1 _c _cDD
+ step 2 cD ____
+ step 3 __ cDDD
+
+The *step 2* fails to recover and is skipped because there are two
+chunks missing (*2, 3*) and it can only recover from one missing
+chunk.
+
+The coding chunks from *step 1*, stored in chunks *1* and *5*, make it
+possible to recover the content of chunks *2* and *3*::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+Controlling CRUSH placement
+===========================
+
+The default CRUSH rule provides OSDs that are on different hosts. For instance::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+needs exactly *8* OSDs, one for each chunk. If the hosts are in two
+adjacent racks, the first four chunks can be placed in the first rack
+and the last four in the second rack, so that recovering from the loss
+of a single OSD does not require using bandwidth between the two
+racks.
+
+For instance::
+
+ crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'
+
+will create a rule that will select two crush buckets of type
+*rack* and for each of them choose four OSDs, each of them located in
+different buckets of type *host*.
+
+The CRUSH rule can also be manually crafted for finer control.
diff --git a/doc/rados/operations/erasure-code-profile.rst b/doc/rados/operations/erasure-code-profile.rst
new file mode 100644
index 000000000..947b34c1f
--- /dev/null
+++ b/doc/rados/operations/erasure-code-profile.rst
@@ -0,0 +1,128 @@
+.. _erasure-code-profiles:
+
+=====================
+Erasure code profiles
+=====================
+
+Erasure code is defined by a **profile** and is used when creating an
+erasure coded pool and the associated CRUSH rule.
+
+The **default** erasure code profile (which is created when the Ceph
+cluster is initialized) will split the data into 2 equal-sized chunks,
+and have 2 parity chunks of the same size. It will take as much space
+in the cluster as a 2-replica pool but can sustain the data loss of 2
+chunks out of 4. It is described as a profile with **k=2** and **m=2**,
+meaning the information is spread over four OSDs (k+m == 4) and two of
+them can be lost.
+
+To improve redundancy without increasing raw storage requirements, a
+new profile can be created. For instance, a profile with **k=10** and
+**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an
+object on fourteen (k+m=14) OSDs. The object is first divided into
+**10** chunks (if the object is 10MB, each chunk is 1MB) and **4**
+coding chunks are computed, for recovery (each coding chunk has the
+same size as the data chunk, i.e. 1MB). The raw space overhead is only
+40% and the object will not be lost even if four OSDs break at the
+same time.
+
+.. _list of available plugins:
+
+.. toctree::
+ :maxdepth: 1
+
+ erasure-code-jerasure
+ erasure-code-isa
+ erasure-code-lrc
+ erasure-code-shec
+ erasure-code-clay
+
+osd erasure-code-profile set
+============================
+
+To create a new erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ [{directory=directory}] \
+ [{plugin=plugin}] \
+ [{stripe_unit=stripe_unit}] \
+ [{key=value} ...] \
+ [--force]
+
+Where:
+
+``{directory=directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``{plugin=plugin}``
+
+:Description: Use the erasure code **plugin** to compute coding chunks
+ and recover missing chunks. See the `list of available
+ plugins`_ for more information.
+
+:Type: String
+:Required: No.
+:Default: jerasure
+
+``{stripe_unit=stripe_unit}``
+
+:Description: The amount of data in a data chunk, per stripe. For
+ example, a profile with 2 data chunks and stripe_unit=4K
+ would put the range 0-4K in chunk 0, 4K-8K in chunk 1,
+ then 8K-12K in chunk 0 again. This should be a multiple
+ of 4K for best performance. The default value is taken
+ from the monitor config option
+ ``osd_pool_erasure_code_stripe_unit`` when a pool is
+ created. The stripe_width of a pool using this profile
+ will be the number of data chunks multiplied by this
+ stripe_unit.
+
+:Type: String
+:Required: No.
+
+``{key=value}``
+
+:Description: The semantic of the remaining key/value pairs is defined
+ by the erasure code plugin.
+
+:Type: String
+:Required: No.
+
+``--force``
+
+:Description: Override an existing profile by the same name, and allow
+ setting a non-4K-aligned stripe_unit.
+
+:Type: String
+:Required: No.
+
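+For example, a hypothetical profile that combines several of the parameters
+above (the profile name and the values shown are illustrative)::
+
+    ceph osd erasure-code-profile set myec42 \
+       plugin=jerasure \
+       k=4 m=2 \
+       stripe_unit=16K
+    ceph osd erasure-code-profile get myec42
+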
+osd erasure-code-profile rm
+============================
+
+To remove an erasure code profile::
+
+ ceph osd erasure-code-profile rm {name}
+
+If the profile is referenced by a pool, the deletion will fail.
+
+.. warning:: Removing an erasure code profile using ``osd erasure-code-profile rm`` does not automatically delete the CRUSH rule associated with that profile. It is recommended to manually remove the associated CRUSH rule using ``ceph osd crush rule remove {rule-name}`` to avoid unexpected behavior.
+
+osd erasure-code-profile get
+============================
+
+To display an erasure code profile::
+
+ ceph osd erasure-code-profile get {name}
+
+osd erasure-code-profile ls
+===========================
+
+To list the names of all erasure code profiles::
+
+ ceph osd erasure-code-profile ls
+
diff --git a/doc/rados/operations/erasure-code-shec.rst b/doc/rados/operations/erasure-code-shec.rst
new file mode 100644
index 000000000..4e8f59b0b
--- /dev/null
+++ b/doc/rados/operations/erasure-code-shec.rst
@@ -0,0 +1,145 @@
+========================
+SHEC erasure code plugin
+========================
+
+The *shec* plugin encapsulates the `multiple SHEC
+<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_
+library. It allows Ceph to recover data more efficiently than Reed Solomon codes.
+
+Create an SHEC profile
+======================
+
+To create a new *shec* erasure code profile:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=shec \
+ [k={data-chunks}] \
+ [m={coding-chunks}] \
+ [c={durability-estimator}] \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data-chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: No.
+:Default: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding-chunks** for each object and store them on
+ different OSDs. The number of **coding-chunks** does not necessarily
+ equal the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 3
+
+``c={durability-estimator}``
+
+:Description: The number of parity chunks each of which includes each data chunk in its
+ calculation range. The number is used as a **durability estimator**.
+ For instance, if c=2, 2 OSDs can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 2
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the CRUSH rule. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a CRUSH rule step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
+Brief description of SHEC's layouts
+===================================
+
+Space Efficiency
+----------------
+
+Space efficiency is the ratio of data chunks to all chunks in an object,
+represented as k/(k+m).
+In order to improve space efficiency, you should increase k or decrease m:
+
+ space efficiency of SHEC(4,3,2) = :math:`\frac{4}{4+3}` = 0.57
+ SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency
+
+Durability
+----------
+
+The third parameter of SHEC (=c) is a durability estimator, which approximates
+the number of OSDs that can be down without losing data.
+
+``durability estimator of SHEC(4,3,2) = 2``
+
+Recovery Efficiency
+-------------------
+
+A full description of how recovery efficiency is calculated is beyond the scope of this document,
+but, at a minimum, increasing m without increasing c improves recovery efficiency.
+(Note, however, that space efficiency is sacrificed in this case.)
+
+``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency``
+
+Erasure code profile examples
+=============================
+
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set SHECprofile \
+ plugin=shec \
+ k=8 m=4 c=3 \
+ crush-failure-domain=host
+ ceph osd pool create shecpool erasure SHECprofile
diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst
new file mode 100644
index 000000000..e2bd3c296
--- /dev/null
+++ b/doc/rados/operations/erasure-code.rst
@@ -0,0 +1,272 @@
+.. _ecpool:
+
+==============
+ Erasure code
+==============
+
+By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
+replicated-type pools, every object is copied to multiple disks. This
+multiple copying is the method of data protection known as "replication".
+
+By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
+pools use a method of data protection that is different from replication. In
+erasure coding, data is broken into fragments of two kinds: data blocks and
+parity blocks. If a drive fails or becomes corrupted, the parity blocks are
+used to rebuild the data. At scale, erasure coding saves space relative to
+replication.
+
+In this documentation, data blocks are referred to as "data chunks"
+and parity blocks are referred to as "coding chunks".
+
+Erasure codes are also called "forward error correction codes". The
+first forward error correction code was developed in 1950 by Richard
+Hamming at Bell Laboratories.
+
+
+Creating a sample erasure-coded pool
+------------------------------------
+
+The simplest erasure-coded pool is similar to `RAID5
+<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
+requires at least three hosts:
+
+.. prompt:: bash $
+
+ ceph osd pool create ecpool erasure
+
+::
+
+ pool 'ecpool' created
+
+.. prompt:: bash $
+
+ echo ABCDEFGHI | rados --pool ecpool put NYAN -
+ rados --pool ecpool get NYAN -
+
+::
+
+ ABCDEFGHI
+
+Erasure-code profiles
+---------------------
+
+The default erasure-code profile can sustain the overlapping loss of two OSDs
+without losing data. This erasure-code profile is equivalent to a replicated
+pool of size three, but with different storage requirements: instead of
+requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
+profile can be displayed with this command:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile get default
+
+::
+
+ k=2
+ m=2
+ plugin=jerasure
+ crush-failure-domain=host
+ technique=reed_sol_van
+
+.. note::
+ The profile just displayed is for the *default* erasure-coded pool, not the
+ *simplest* erasure-coded pool. These two pools are not the same:
+
+ The default erasure-coded pool has two data chunks (K) and two coding chunks
+ (M). The profile of the default erasure-coded pool is "k=2 m=2".
+
+ The simplest erasure-coded pool has two data chunks (K) and one coding chunk
+ (M). The profile of the simplest erasure-coded pool is "k=2 m=1".
+
+Choosing the right profile is important because the profile cannot be modified
+after the pool is created. If you find that you need an erasure-coded pool with
+a profile different than the one you have created, you must create a new pool
+with a different (and presumably more carefully considered) profile. When the
+new pool is created, all objects from the wrongly configured pool must be moved
+to the newly created pool. There is no way to alter the profile of a pool after
+the pool has been created.
+
+The most important parameters of the profile are *K*, *M*, and
+*crush-failure-domain* because they define the storage overhead and
+the data durability. For example, if the desired architecture must
+sustain the loss of two racks with a storage overhead of 67%,
+the following profile can be defined:
+
+.. prompt:: bash $
+
+ ceph osd erasure-code-profile set myprofile \
+ k=3 \
+ m=2 \
+ crush-failure-domain=rack
+ ceph osd pool create ecpool erasure myprofile
+ echo ABCDEFGHI | rados --pool ecpool put NYAN -
+ rados --pool ecpool get NYAN -
+
+::
+
+ ABCDEFGHI
+
+The *NYAN* object will be divided into three (*K=3*) and two additional
+*chunks* will be created (*M=2*). The value of *M* defines how many
+OSDs can be lost simultaneously without losing any data. The
+*crush-failure-domain=rack* will create a CRUSH rule that ensures
+no two *chunks* are stored in the same rack.
+
+.. ditaa::
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ |
+ |
+ v
+ +------+------+
+ +---------------+ encode(3,2) +-----------+
+ | +--+--+---+---+ |
+ | | | | |
+ | +-------+ | +-----+ |
+ | | | | |
+ +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
+ name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 | | 5 |
+ +------+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY | | QGC |
+ +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
+ | | | | |
+ | | v | |
+ | | +--+---+ | |
+ | | | OSD1 | | |
+ | | +------+ | |
+ | | | |
+ | | +------+ | |
+ | +------>| OSD2 | | |
+ | +------+ | |
+ | | |
+ | +------+ | |
+ | | OSD3 |<----+ |
+ | +------+ |
+ | |
+ | +------+ |
+ | | OSD4 |<--------------+
+ | +------+
+ |
+ | +------+
+ +----------------->| OSD5 |
+ +------+
+
+
+More information can be found in the `erasure-code profiles
+<../erasure-code-profile>`_ documentation.
+
+
+Erasure Coding with Overwrites
+------------------------------
+
+By default, erasure-coded pools work only with operations that
+perform full object writes and appends (for example, RGW).
+
+Since Luminous, partial writes for an erasure-coded pool may be
+enabled with a per-pool setting. This lets RBD and CephFS store their
+data in an erasure-coded pool:
+
+.. prompt:: bash $
+
+ ceph osd pool set ec_pool allow_ec_overwrites true
+
+This can be enabled only on a pool residing on BlueStore OSDs, since
+BlueStore's checksumming is used during deep scrubs to detect bitrot
+or other corruption. Using Filestore with EC overwrites is not only
+unsafe, but it also results in lower performance compared to BlueStore.
+
+Erasure-coded pools do not support omap, so to use them with RBD and
+CephFS you must instruct them to store their data in an EC pool and
+their metadata in a replicated pool. For RBD, this means using the
+erasure-coded pool as the ``--data-pool`` during image creation:
+
+.. prompt:: bash $
+
+ rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
+
+For CephFS, an erasure-coded pool can be set as the default data pool during
+file system creation or via `file layouts <../../../cephfs/file-layouts>`_.
+
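+Putting these steps together, a minimal sketch for RBD (reusing the pool names
+from the examples above; the profile name is illustrative, and the replicated
+pool is assumed to already exist) might look like this:
+
+.. prompt:: bash $
+
+   ceph osd erasure-code-profile set ec_profile k=2 m=2 crush-failure-domain=host
+   ceph osd pool create ec_pool erasure ec_profile
+   ceph osd pool set ec_pool allow_ec_overwrites true
+   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name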
+
+Erasure-coded pools and cache tiering
+-------------------------------------
+
+.. note:: Cache tiering is deprecated in Reef.
+
+Erasure-coded pools require more resources than replicated pools and
+lack some of the functionality supported by replicated pools (for example, omap).
+To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_
+before setting up the erasure-coded pool.
+
+For example, if the pool *hot-storage* is made of fast storage, the following commands
+will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
+mode:
+
+.. prompt:: bash $
+
+ ceph osd tier add ecpool hot-storage
+ ceph osd tier cache-mode hot-storage writeback
+ ceph osd tier set-overlay ecpool hot-storage
+
+The result is that every write and read to the *ecpool* actually uses
+the *hot-storage* pool and benefits from its flexibility and speed.
+
+More information can be found in the `cache tiering
+<../cache-tiering>`_ documentation. Note, however, that cache tiering
+is deprecated and may be removed completely in a future release.
+
+Erasure-coded pool recovery
+---------------------------
+If an erasure-coded pool loses any data shards, it must recover them from others.
+This recovery involves reading from the remaining shards, reconstructing the data, and
+writing new shards.
+
+In Octopus and later releases, erasure-coded pools can recover as long as there are at least *K* shards
+available. (With fewer than *K* shards, you have actually lost data!)
+
+Prior to Octopus, erasure-coded pools required that at least ``min_size`` shards be
+available, even if ``min_size`` was greater than ``K``. This was a conservative
+decision made out of an abundance of caution when designing the new pool
+mode. As a result, however, pools with lost OSDs but without complete data loss were
+unable to recover and go active without manual intervention to temporarily change
+the ``min_size`` setting.
+
+We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes and
+loss of data.
+
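+For example, for the *K=3*, *M=2* profile created earlier (the pool name is
+illustrative), you can inspect and, if necessary, raise ``min_size`` as
+follows:
+
+.. prompt:: bash $
+
+   ceph osd pool get ecpool min_size
+   ceph osd pool set ecpool min_size 4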
+
+
+Glossary
+--------
+
+*chunk*
+ When the encoding function is called, it returns chunks of the same size as each other. There are two
+ kinds of chunks: (1) *data chunks*, which can be concatenated to reconstruct the original object, and
+ (2) *coding chunks*, which can be used to rebuild a lost chunk.
+
+*K*
+ The number of data chunks into which an object is divided. For example, if *K* = 2, then a 10KB object
+   is divided into two chunks of 5KB each.
+
+*M*
+ The number of coding chunks computed by the encoding function. *M* is equal to the number of OSDs that can
+ be missing from the cluster without the cluster suffering data loss. For example, if there are two coding
+ chunks, then two OSDs can be missing without data loss.
+
+Table of contents
+-----------------
+
+.. toctree::
+ :maxdepth: 1
+
+ erasure-code-profile
+ erasure-code-jerasure
+ erasure-code-isa
+ erasure-code-lrc
+ erasure-code-shec
+ erasure-code-clay
diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst
new file mode 100644
index 000000000..d52465602
--- /dev/null
+++ b/doc/rados/operations/health-checks.rst
@@ -0,0 +1,1619 @@
+.. _health-checks:
+
+===============
+ Health checks
+===============
+
+Overview
+========
+
+There is a finite set of health messages that a Ceph cluster can raise. These
+messages are known as *health checks*. Each health check has a unique
+identifier.
+
+The identifier is a terse human-readable string -- that is, the identifier is
+readable in much the same way as a typical variable name. It is intended to
+enable tools (for example, UIs) to make sense of health checks and present them
+in a way that reflects their meaning.
+
+This page lists the health checks that are raised by the monitor and manager
+daemons. In addition to these, you might see health checks that originate
+from MDS daemons (see :ref:`cephfs-health-messages`), and health checks
+that are defined by ``ceph-mgr`` python modules.
+
+Definitions
+===========
+
+Monitor
+-------
+
+DAEMON_OLD_VERSION
+__________________
+
+Warn if one or more old versions of Ceph are running on any daemons. A health
+check is raised if multiple versions are detected. This condition must exist
+for a period of time greater than ``mon_warn_older_version_delay`` (set to one
+week by default) in order for the health check to be raised. This allows most
+upgrades to proceed without the occurrence of a false warning. If the upgrade
+is paused for an extended time period, ``health mute`` can be used by running
+``ceph health mute DAEMON_OLD_VERSION --sticky``. Be sure, however, to run
+``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has finished.
+
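+The ``ceph versions`` command summarizes the versions reported by the running
+daemons and can help identify which daemons are out of date:
+
+.. prompt:: bash $
+
+   ceph versions
+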
+MON_DOWN
+________
+
+One or more monitor daemons are currently down. The cluster requires a majority
+(more than one-half) of the monitors to be available. When one or more monitors
+are down, clients might have a harder time forming their initial connection to
+the cluster, as they might need to try more addresses before they reach an
+operating monitor.
+
+The down monitor daemon should be restarted as soon as possible to reduce the
+risk of a subsequent monitor failure leading to a service outage.
+
+MON_CLOCK_SKEW
+______________
+
+The clocks on the hosts running the ceph-mon monitor daemons are not
+well-synchronized. This health check is raised if the cluster detects a clock
+skew greater than ``mon_clock_drift_allowed``.
+
+This issue is best resolved by synchronizing the clocks by using a tool like
+``ntpd`` or ``chrony``.
+
+If it is impractical to keep the clocks closely synchronized, the
+``mon_clock_drift_allowed`` threshold can also be increased. However, this
+value must stay significantly below the ``mon_lease`` interval in order for the
+monitor cluster to function properly.
+
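+For example, a sketch of raising the threshold slightly (the value is in
+seconds and is illustrative; keep it well below ``mon_lease``):
+
+.. prompt:: bash $
+
+   ceph config set mon mon_clock_drift_allowed 0.1
+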
+MON_MSGR2_NOT_ENABLED
+_____________________
+
+The :confval:`ms_bind_msgr2` option is enabled but one or more monitors are
+not configured to bind to a v2 port in the cluster's monmap. This
+means that features specific to the msgr2 protocol (for example, encryption)
+are unavailable on some or all connections.
+
+In most cases this can be corrected by running the following command:
+
+.. prompt:: bash $
+
+ ceph mon enable-msgr2
+
+After this command is run, any monitor configured to listen on the old default
+port (6789) will continue to listen for v1 connections on 6789 and begin to
+listen for v2 connections on the new default port 3300.
+
+If a monitor is configured to listen for v1 connections on a non-standard port
+(that is, a port other than 6789), then the monmap will need to be modified
+manually.
+
+
+MON_DISK_LOW
+____________
+
+One or more monitors are low on disk space. This health check is raised if the
+percentage of available space on the file system used by the monitor database
+(normally ``/var/lib/ceph/mon``) drops below the percentage value
+``mon_data_avail_warn`` (default: 30%).
+
+This alert might indicate that some other process or user on the system is
+filling up the file system used by the monitor. It might also
+indicate that the monitor database is too large (see ``MON_DISK_BIG``
+below).
+
+If space cannot be freed, the monitor's data directory might need to be
+moved to another storage device or file system (this relocation process must be carried out while the monitor
+daemon is not running).
+
+
+MON_DISK_CRIT
+_____________
+
+One or more monitors are critically low on disk space. This health check is raised if the
+percentage of available space on the file system used by the monitor database
+(normally ``/var/lib/ceph/mon``) drops below the percentage value
+``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above.
+
+MON_DISK_BIG
+____________
+
+The database size for one or more monitors is very large. This health check is
+raised if the size of the monitor database is larger than
+``mon_data_size_warn`` (default: 15 GiB).
+
+A large database is unusual, but does not necessarily indicate a problem.
+Monitor databases might grow in size when there are placement groups that have
+not reached an ``active+clean`` state in a long time.
+
+This alert might also indicate that the monitor's database is not properly
+compacting, an issue that has been observed with some older versions of leveldb
+and rocksdb. Forcing a compaction with ``ceph daemon mon.<id> compact`` might
+shrink the database's on-disk size.
+
+This alert might also indicate that the monitor has a bug that prevents it from
+pruning the cluster metadata that it stores. If the problem persists, please
+report a bug.
+
+To adjust the warning threshold, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set global mon_data_size_warn <size>
+
+
+AUTH_INSECURE_GLOBAL_ID_RECLAIM
+_______________________________
+
+One or more clients or daemons that are connected to the cluster are not
+securely reclaiming their ``global_id`` (a unique number that identifies each
+entity in the cluster) when reconnecting to a monitor. The client is being
+permitted to connect anyway because the
+``auth_allow_insecure_global_id_reclaim`` option is set to ``true`` (which may
+be necessary until all Ceph clients have been upgraded) and because the
+``auth_expose_insecure_global_id_reclaim`` option is set to ``true`` (which
+allows monitors to detect clients with "insecure reclaim" sooner by forcing
+those clients to reconnect immediately after their initial authentication).
+
+To identify which client(s) are using unpatched Ceph client code, run the
+following command:
+
+.. prompt:: bash $
+
+ ceph health detail
+
+If you collect a dump of the clients that are connected to an individual
+monitor and examine the ``global_id_status`` field in the output of the dump,
+you can see the ``global_id`` reclaim behavior of those clients. Here
+``reclaim_insecure`` means that a client is unpatched and is contributing to
+this health check. To effect a client dump, run the following command:
+
+.. prompt:: bash $
+
+ ceph tell mon.\* sessions
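+
+For example, a ``jq`` filter along the following lines can narrow the session
+dump from a single monitor (here the hypothetical ``mon.a``) to the unpatched
+clients; the exact layout of the session dump may vary between releases, so
+treat this as a sketch:
+
+.. prompt:: bash $
+
+   ceph tell mon.a sessions | jq '.[] | select(.global_id_status == "reclaim_insecure")'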
+
+We strongly recommend that all clients in the system be upgraded to a newer
+version of Ceph that correctly reclaims ``global_id`` values. After all clients
+have been updated, run the following command to stop allowing insecure
+reconnections:
+
+.. prompt:: bash $
+
+ ceph config set mon auth_allow_insecure_global_id_reclaim false
+
+If it is impractical to upgrade all clients immediately, you can temporarily
+silence this alert by running the following command:
+
+.. prompt:: bash $
+
+ ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1w # 1 week
+
+Although we do NOT recommend doing so, you can also disable this alert
+indefinitely by running the following command:
+
+.. prompt:: bash $
+
+ ceph config set mon mon_warn_on_insecure_global_id_reclaim false
+
+AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED
+_______________________________________
+
+Ceph is currently configured to allow clients that reconnect to monitors using
+an insecure process to reclaim their previous ``global_id``. Such reclaiming is
+allowed because, by default, ``auth_allow_insecure_global_id_reclaim`` is set
+to ``true``. It might be necessary to leave this setting enabled while existing
+Ceph clients are upgraded to newer versions of Ceph that correctly and securely
+reclaim their ``global_id``.
+
+If the ``AUTH_INSECURE_GLOBAL_ID_RECLAIM`` health check has not also been
+raised and if the ``auth_expose_insecure_global_id_reclaim`` setting has not
+been disabled (it is enabled by default), then there are currently no clients
+connected that need to be upgraded. In that case, it is safe to disable
+``insecure global_id reclaim`` by running the following command:
+
+.. prompt:: bash $
+
+ ceph config set mon auth_allow_insecure_global_id_reclaim false
+
+On the other hand, if there are still clients that need to be upgraded, then
+this alert can be temporarily silenced by running the following command:
+
+.. prompt:: bash $
+
+ ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w # 1 week
+
+Although we do NOT recommend doing so, you can also disable this alert indefinitely
+by running the following command:
+
+.. prompt:: bash $
+
+ ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
+
+
+Manager
+-------
+
+MGR_DOWN
+________
+
+All manager daemons are currently down. The cluster should normally have at
+least one running manager (``ceph-mgr``) daemon. If no manager daemon is
+running, the cluster's ability to monitor itself will be compromised, and parts
+of the management API will become unavailable (for example, the dashboard will
+not work, and most CLI commands that report metrics or runtime state will
+block). However, the cluster will still be able to perform all I/O operations
+and to recover from failures.
+
+The "down" manager daemon should be restarted as soon as possible to ensure
+that the cluster can be monitored (for example, so that the ``ceph -s``
+information is up to date, or so that metrics can be scraped by Prometheus).
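+
+As a sketch (assuming a systemd-managed deployment and a hypothetical manager
+instance named after the host ``mgr-host-1``), the daemon can be restarted as
+follows:
+
+.. prompt:: bash $
+
+   sudo systemctl start ceph-mgr@mgr-host-1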
+
+
+MGR_MODULE_DEPENDENCY
+_____________________
+
+An enabled manager module is failing its dependency check. This health check
+typically comes with an explanatory message from the module about the problem.
+
+For example, a module might report that a required package is not installed: in
+this case, you should install the required package and restart your manager
+daemons.
+
+This health check is applied only to enabled modules. If a module is not
+enabled, you can see whether it is reporting dependency issues in the output of
+``ceph mgr module ls``.
+
+
+MGR_MODULE_ERROR
+________________
+
+A manager module has experienced an unexpected error. Typically, this means
+that an unhandled exception was raised from the module's `serve` function. The
+human-readable description of the error might be obscurely worded if the
+exception did not provide a useful description of itself.
+
+This health check might indicate a bug: please open a Ceph bug report if you
+think you have encountered a bug.
+
+However, if you believe the error is transient, you may restart your manager
+daemon(s) or use ``ceph mgr fail`` on the active daemon in order to force
+failover to another daemon.
+
+OSDs
+----
+
+OSD_DOWN
+________
+
+One or more OSDs are marked "down". The ceph-osd daemon might have been
+stopped, or peer OSDs might be unable to reach the OSD over the network.
+Common causes include a stopped or crashed daemon, a "down" host, or a network
+outage.
+
+Verify that the host is healthy, the daemon is started, and the network is
+functioning. If the daemon has crashed, the daemon log file
+(``/var/log/ceph/ceph-osd.*``) might contain debugging information.
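+
+For example, to list the subtree of OSDs that are currently "down" and then
+start a stopped daemon (here the hypothetical ``osd.1`` on a systemd-managed
+host), commands along the following lines can be used:
+
+.. prompt:: bash $
+
+   ceph osd tree down
+   sudo systemctl start ceph-osd@1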
+
+OSD_<crush type>_DOWN
+_____________________
+
+(for example, OSD_HOST_DOWN, OSD_ROOT_DOWN)
+
+All of the OSDs within a particular CRUSH subtree are marked "down" (for
+example, all OSDs on a host).
+
+OSD_ORPHAN
+__________
+
+An OSD is referenced in the CRUSH map hierarchy, but does not exist.
+
+To remove the OSD from the CRUSH map hierarchy, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd crush rm osd.<id>
+
+OSD_OUT_OF_ORDER_FULL
+_____________________
+
+The utilization thresholds for `nearfull`, `backfillfull`, `full`, and/or
+`failsafe_full` are not ascending. In particular, the following pattern is
+expected: `nearfull < backfillfull`, `backfillfull < full`, and `full <
+failsafe_full`.
+
+To adjust these utilization thresholds, run the following commands:
+
+.. prompt:: bash $
+
+ ceph osd set-nearfull-ratio <ratio>
+ ceph osd set-backfillfull-ratio <ratio>
+ ceph osd set-full-ratio <ratio>
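+
+To review the currently configured ratios before adjusting them, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'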
+
+
+OSD_FULL
+________
+
+One or more OSDs have exceeded the `full` threshold and are preventing the
+cluster from servicing writes.
+
+To check utilization by pool, run the following command:
+
+.. prompt:: bash $
+
+ ceph df
+
+To see the currently defined `full` ratio, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd dump | grep full_ratio
+
+A short-term workaround to restore write availability is to raise the full
+threshold by a small amount. To do so, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd set-full-ratio <ratio>
+
+Additional OSDs should be deployed in order to add new storage to the cluster,
+or existing data should be deleted in order to free up space in the cluster.
+
+OSD_BACKFILLFULL
+________________
+
+One or more OSDs have exceeded the `backfillfull` threshold or *would* exceed
+it if the currently-mapped backfills were to finish, which will prevent data
+from rebalancing to this OSD. This alert is an early warning that
+rebalancing might be unable to complete and that the cluster is approaching
+full.
+
+To check utilization by pool, run the following command:
+
+.. prompt:: bash $
+
+ ceph df
+
+OSD_NEARFULL
+____________
+
+One or more OSDs have exceeded the `nearfull` threshold. This alert is an early
+warning that the cluster is approaching full.
+
+To check utilization by pool, run the following command:
+
+.. prompt:: bash $
+
+ ceph df
+
+OSDMAP_FLAGS
+____________
+
+One or more cluster flags of interest have been set. These flags include:
+
+* *full* - the cluster is flagged as full and cannot serve writes
+* *pauserd*, *pausewr* - there are paused reads or writes
+* *noup* - OSDs are not allowed to start
+* *nodown* - OSD failure reports are being ignored, and that means that the
+ monitors will not mark OSDs "down"
+* *noin* - OSDs that were previously marked ``out`` are not being marked
+ back ``in`` when they start
+* *noout* - "down" OSDs are not automatically being marked ``out`` after the
+ configured interval
+* *nobackfill*, *norecover*, *norebalance* - recovery or data
+ rebalancing is suspended
+* *noscrub*, *nodeep_scrub* - scrubbing is disabled
+* *notieragent* - cache-tiering activity is suspended
+
+With the exception of *full*, these flags can be set or cleared by running the
+following commands:
+
+.. prompt:: bash $
+
+ ceph osd set <flag>
+ ceph osd unset <flag>
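+
+For example, a common maintenance pattern is to prevent "down" OSDs from being
+marked ``out`` while a host is rebooted, and then to clear the flag afterward:
+
+.. prompt:: bash $
+
+   ceph osd set noout
+   ceph osd unset noout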
+
+OSD_FLAGS
+_________
+
+One or more OSDs, CRUSH nodes, or CRUSH device classes have a flag of interest
+set. These flags include:
+
+* *noup*: these OSDs are not allowed to start
+* *nodown*: failure reports for these OSDs will be ignored
+* *noin*: if these OSDs were previously marked ``out`` automatically
+ after a failure, they will not be marked ``in`` when they start
+* *noout*: if these OSDs are "down" they will not automatically be marked
+ ``out`` after the configured interval
+
+To set and clear these flags in batch, run the following commands:
+
+.. prompt:: bash $
+
+ ceph osd set-group <flags> <who>
+ ceph osd unset-group <flags> <who>
+
+For example:
+
+.. prompt:: bash $
+
+ ceph osd set-group noup,noout osd.0 osd.1
+ ceph osd unset-group noup,noout osd.0 osd.1
+ ceph osd set-group noup,noout host-foo
+ ceph osd unset-group noup,noout host-foo
+ ceph osd set-group noup,noout class-hdd
+ ceph osd unset-group noup,noout class-hdd
+
+OLD_CRUSH_TUNABLES
+__________________
+
+The CRUSH map is using very old settings and should be updated. The oldest set
+of tunables that can be used (that is, the oldest client version that can
+connect to the cluster) without raising this health check is determined by the
+``mon_crush_min_required_version`` config option. For more information, see
+:ref:`crush-map-tunables`.
+
+OLD_CRUSH_STRAW_CALC_VERSION
+____________________________
+
+The CRUSH map is using an older, non-optimal method of calculating intermediate
+weight values for ``straw`` buckets.
+
+The CRUSH map should be updated to use the newer method (that is:
+``straw_calc_version=1``). For more information, see :ref:`crush-map-tunables`.
+
+CACHE_POOL_NO_HIT_SET
+_____________________
+
+One or more cache pools are not configured with a *hit set* to track
+utilization. This issue prevents the tiering agent from identifying cold
+objects that are to be flushed and evicted from the cache.
+
+To configure hit sets on the cache pool, run the following commands:
+
+.. prompt:: bash $
+
+ ceph osd pool set <poolname> hit_set_type <type>
+ ceph osd pool set <poolname> hit_set_period <period-in-seconds>
+ ceph osd pool set <poolname> hit_set_count <number-of-hitsets>
+ ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate>
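+
+For example, for a hypothetical cache pool named ``hot-storage``, a Bloom
+filter hit set might be configured roughly as follows (the values shown are
+illustrative, not recommendations):
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage hit_set_type bloom
+   ceph osd pool set hot-storage hit_set_period 14400
+   ceph osd pool set hot-storage hit_set_count 12
+   ceph osd pool set hot-storage hit_set_fpp 0.01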
+
+OSD_NO_SORTBITWISE
+__________________
+
+No pre-Luminous v12.y.z OSDs are running, but the ``sortbitwise`` flag has not
+been set.
+
+The ``sortbitwise`` flag must be set in order for OSDs running Luminous v12.y.z
+or newer to start. To safely set the flag, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd set sortbitwise
+
+OSD_FILESTORE
+__________________
+
+One or more OSDs are running Filestore. The Filestore OSD back end has been
+deprecated; the BlueStore back end has been the default object store since the
+Ceph Luminous release.
+
+The ``mclock_scheduler`` is not supported for Filestore OSDs. For this reason,
+the default ``osd_op_queue`` is set to ``wpq`` for Filestore OSDs and is
+enforced even if the user attempts to change it.
+
+To identify which OSDs are running Filestore, run the following command:
+
+.. prompt:: bash $
+
+   ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}'
+
+**In order to upgrade to Reef or a later release, you must first migrate any
+Filestore OSDs to BlueStore.**
+
+If you are upgrading a pre-Reef release to Reef or later, but it is not
+feasible to migrate Filestore OSDs to BlueStore immediately, you can
+temporarily silence this alert by running the following command:
+
+.. prompt:: bash $
+
+ ceph health mute OSD_FILESTORE
+
+Since this migration can take a considerable amount of time to complete, we
+recommend that you begin the process well in advance of any update to Reef or
+to later releases.
+
+POOL_FULL
+_________
+
+One or more pools have reached their quota and are no longer allowing writes.
+
+To see pool quotas and utilization, run the following command:
+
+.. prompt:: bash $
+
+ ceph df detail
+
+If you opt to raise the pool quota, run the following commands:
+
+.. prompt:: bash $
+
+ ceph osd pool set-quota <poolname> max_objects <num-objects>
+ ceph osd pool set-quota <poolname> max_bytes <num-bytes>
+
+If not, delete some existing data to reduce utilization.
+
+BLUEFS_SPILLOVER
+________________
+
+One or more OSDs that use the BlueStore back end have been allocated `db`
+partitions (that is, storage space for metadata, normally on a faster device),
+but because that space has been filled, metadata has "spilled over" onto the
+slow device. This is not necessarily an error condition or even unexpected
+behavior, but may result in degraded performance. If the administrator had
+expected that all metadata would fit on the faster device, this alert indicates
+that not enough space was provided.
+
+To disable this alert on all OSDs, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set osd bluestore_warn_on_bluefs_spillover false
+
+Alternatively, to disable the alert on a specific OSD, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph config set osd.123 bluestore_warn_on_bluefs_spillover false
+
+To secure more metadata space, you can destroy and reprovision the OSD in
+question. This process involves data migration and recovery.
+
+It might also be possible to expand the LVM logical volume that backs the `db`
+storage. If the underlying LV has been expanded, you must stop the OSD daemon
+and inform BlueFS of the device-size change by running the following command:
+
+.. prompt:: bash $
+
+ ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID
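+
+As an illustrative sketch only (the volume-group and logical-volume names
+below are hypothetical and depend entirely on how the OSD was provisioned),
+the underlying LV might first be grown with standard LVM tooling, after which
+the command above informs BlueFS of the new size:
+
+.. prompt:: bash $
+
+   sudo lvextend -L +20G /dev/ceph-db-vg/osd-123-db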
+
+BLUEFS_AVAILABLE_SPACE
+______________________
+
+To see how much space is free for BlueFS, run the following command:
+
+.. prompt:: bash $
+
+ ceph daemon osd.123 bluestore bluefs available
+
+This will output up to three values: ``BDEV_DB free``, ``BDEV_SLOW free``, and
+``available_from_bluestore``. ``BDEV_DB`` and ``BDEV_SLOW`` report the amount
+of space that has been acquired by BlueFS and is now considered free. The value
+``available_from_bluestore`` indicates the ability of BlueStore to relinquish
+more space to BlueFS. It is normal for this value to differ from the amount of
+BlueStore free space, because the BlueFS allocation unit is typically larger
+than the BlueStore allocation unit. This means that only part of the BlueStore
+free space will be available for BlueFS.
+
+BLUEFS_LOW_SPACE
+_________________
+
+If BlueFS is running low on available free space and there is not much free
+space available from BlueStore (in other words, `available_from_bluestore` has
+a low value), consider reducing the BlueFS allocation unit size. To simulate
+available space when the allocation unit is different, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph daemon osd.123 bluestore bluefs available <alloc-unit-size>
+
+BLUESTORE_FRAGMENTATION
+_______________________
+
+As BlueStore operates, the free space on the underlying storage will become
+fragmented. This is normal and unavoidable, but excessive fragmentation causes
+slowdown. To inspect BlueStore fragmentation, run the following command:
+
+.. prompt:: bash $
+
+ ceph daemon osd.123 bluestore allocator score block
+
+The fragmentation score is given in a [0-1] range:
+
+* [0.0 .. 0.4] - tiny fragmentation
+* [0.4 .. 0.7] - small, acceptable fragmentation
+* [0.7 .. 0.9] - considerable, but safe fragmentation
+* [0.9 .. 1.0] - severe fragmentation, might impact BlueFS's ability to get space from BlueStore
+
+To see a detailed report of free fragments, run the following command:
+
+.. prompt:: bash $
+
+ ceph daemon osd.123 bluestore allocator dump block
+
+For OSD processes that are not currently running, fragmentation can be
+inspected with `ceph-bluestore-tool`. To see the fragmentation score, run the
+following command:
+
+.. prompt:: bash $
+
+ ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score
+
+To dump detailed free chunks, run the following command:
+
+.. prompt:: bash $
+
+ ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump
+
+BLUESTORE_LEGACY_STATFS
+_______________________
+
+One or more OSDs have BlueStore volumes that were created prior to the
+Nautilus release. (In Nautilus, BlueStore tracks its internal usage
+statistics on a granular, per-pool basis.)
+
+If *all* OSDs are older than Nautilus, this means that the per-pool metrics
+are simply unavailable. But if there is a mixture of pre-Nautilus and
+post-Nautilus OSDs, the cluster usage statistics reported by ``ceph df`` will
+be inaccurate.
+
+The old OSDs can be updated to use the new usage-tracking scheme by stopping
+each OSD, running a repair operation, and then restarting the OSD. For example,
+to update ``osd.123``, run the following commands:
+
+.. prompt:: bash $
+
+ systemctl stop ceph-osd@123
+ ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+ systemctl start ceph-osd@123
+
+To disable this alert, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set global bluestore_warn_on_legacy_statfs false
+
+BLUESTORE_NO_PER_POOL_OMAP
+__________________________
+
+One or more OSDs have volumes that were created prior to the Octopus release.
+(In Octopus and later releases, BlueStore tracks omap space utilization by
+pool.)
+
+If there are any BlueStore OSDs that do not have the new tracking enabled, the
+cluster will report an approximate value for per-pool omap usage based on the
+most recent deep scrub.
+
+The OSDs can be updated to track by pool by stopping each OSD, running a repair
+operation, and then restarting the OSD. For example, to update ``osd.123``, run
+the following commands:
+
+.. prompt:: bash $
+
+ systemctl stop ceph-osd@123
+ ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+ systemctl start ceph-osd@123
+
+To disable this alert, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set global bluestore_warn_on_no_per_pool_omap false
+
+BLUESTORE_NO_PER_PG_OMAP
+__________________________
+
+One or more OSDs have volumes that were created prior to the Pacific release.
+(In Pacific and later releases, BlueStore tracks omap space utilization by
+Placement Group (PG).)
+
+Per-PG omap allows faster PG removal when PGs migrate.
+
+The older OSDs can be updated to track by PG by stopping each OSD, running a
+repair operation, and then restarting the OSD. For example, to update
+``osd.123``, run the following commands:
+
+.. prompt:: bash $
+
+ systemctl stop ceph-osd@123
+ ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+ systemctl start ceph-osd@123
+
+To disable this alert, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set global bluestore_warn_on_no_per_pg_omap false
+
+
+BLUESTORE_DISK_SIZE_MISMATCH
+____________________________
+
+One or more BlueStore OSDs have an internal inconsistency between the size of
+the physical device and the metadata that tracks its size. This inconsistency
+can lead to the OSD(s) crashing in the future.
+
+The OSDs that have this inconsistency should be destroyed and reprovisioned. Be
+very careful to execute this procedure on only one OSD at a time, so as to
+minimize the risk of losing any data. To execute this procedure, where ``$N``
+is the OSD that has the inconsistency, run the following commands:
+
+.. prompt:: bash $
+
+ ceph osd out osd.$N
+ while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
+ ceph osd destroy osd.$N
+ ceph-volume lvm zap /path/to/device
+ ceph-volume lvm create --osd-id $N --data /path/to/device
+
+.. note::
+
+   Wait for this recovery procedure to complete on one OSD before running it
+   on the next.
+
+BLUESTORE_NO_COMPRESSION
+________________________
+
+One or more OSDs are unable to load a BlueStore compression plugin. This issue
+might be caused by a broken installation, in which the ``ceph-osd`` binary does
+not match the compression plugins, or by a recent upgrade in which the
+``ceph-osd`` daemon was not restarted.
+
+To resolve this issue, verify that all of the packages on the host that is
+running the affected OSD(s) are correctly installed and that the OSD daemon(s)
+have been restarted. If the problem persists, check the OSD log for information
+about the source of the problem.
+
+BLUESTORE_SPURIOUS_READ_ERRORS
+______________________________
+
+One or more BlueStore OSDs have detected spurious read errors on the main device.
+BlueStore has recovered from these errors by retrying disk reads. This alert
+might indicate issues with underlying hardware, issues with the I/O subsystem,
+or something similar. In theory, such issues can cause permanent data
+corruption. Some observations on the root cause of spurious read errors can be
+found here: https://tracker.ceph.com/issues/22464
+
+This alert does not require an immediate response, but the affected host might
+need additional attention: for example, upgrading the host to the latest
+OS/kernel versions and implementing hardware-resource-utilization monitoring.
+
+To disable this alert on all OSDs, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set osd bluestore_warn_on_spurious_read_errors false
+
+Or, to disable this alert on a specific OSD, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set osd.123 bluestore_warn_on_spurious_read_errors false
+
+Device health
+-------------
+
+DEVICE_HEALTH
+_____________
+
+One or more OSD devices are expected to fail soon, where the warning threshold
+is determined by the ``mgr/devicehealth/warn_threshold`` config option.
+
+Because this alert applies only to OSDs that are currently marked ``in``, the
+appropriate response to this expected failure is (1) to mark the OSD ``out`` so
+that data is migrated off of the OSD, and then (2) to remove the hardware from
+the system. Note that this marking ``out`` is normally done automatically if
+``mgr/devicehealth/self_heal`` is enabled (as determined by
+``mgr/devicehealth/mark_out_threshold``).
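+
+To list the devices known to the cluster, together with the daemons that use
+them and any stored life expectancy, run the following command:
+
+.. prompt:: bash $
+
+   ceph device ls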
+
+To check device health, run the following command:
+
+.. prompt:: bash $
+
+ ceph device info <device-id>
+
+Device life expectancy is set either by a prediction model that the mgr runs or
+by an external tool that is activated by running the following command:
+
+.. prompt:: bash $
+
+ ceph device set-life-expectancy <device-id> <from> <to>
+
+You can change the stored life expectancy manually, but such a change usually
+doesn't accomplish anything. The reason for this is that whichever tool
+originally set the stored life expectancy will probably undo your change by
+setting it again, and a change to the stored value does not affect the actual
+health of the hardware device.
+
+DEVICE_HEALTH_IN_USE
+____________________
+
+One or more devices (that is, OSDs) are expected to fail soon and have been
+marked ``out`` of the cluster (as controlled by
+``mgr/devicehealth/mark_out_threshold``), but they are still participating in
+one or more Placement Groups. This might be because the OSD(s) were marked
+``out`` only recently and data is still migrating, or because data cannot be
+migrated off of the OSD(s) for some reason (for example, the cluster is nearly
+full, or the CRUSH hierarchy is structured so that there isn't another suitable
+OSD to migrate the data to).
+
+This message can be silenced by disabling self-heal behavior (that is, setting
+``mgr/devicehealth/self_heal`` to ``false``), by adjusting
+``mgr/devicehealth/mark_out_threshold``, or by addressing whichever condition
+is preventing data from being migrated off of the ailing OSD(s).
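+
+To confirm whether data is still draining from a given OSD (here the
+hypothetical ``osd.123``), you can list the Placement Groups that still map to
+it:
+
+.. prompt:: bash $
+
+   ceph pg ls-by-osd osd.123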
+
+.. _rados_health_checks_device_health_toomany:
+
+DEVICE_HEALTH_TOOMANY
+_____________________
+
+Too many devices (that is, OSDs) are expected to fail soon, and because
+``mgr/devicehealth/self_heal`` behavior is enabled, marking ``out`` all of the
+ailing OSDs would exceed the cluster's ``mon_osd_min_in_ratio`` ratio. This
+ratio prevents a cascade of too many OSDs from being automatically marked
+``out``.
+
+You should promptly add new OSDs to the cluster to prevent data loss, or
+incrementally replace the failing OSDs.
+
+Alternatively, you can silence this health check by adjusting options including
+``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``. Be
+warned, however, that this will increase the likelihood of unrecoverable data
+loss.
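+
+For example, the ``in`` ratio safeguard could be loosened as follows (the
+value shown is purely illustrative; lowering this guard increases risk, as
+noted above):
+
+.. prompt:: bash $
+
+   ceph config set mon mon_osd_min_in_ratio 0.70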
+
+
+Data health (pools & placement groups)
+--------------------------------------
+
+PG_AVAILABILITY
+_______________
+
+Data availability is reduced. In other words, the cluster is unable to service
+potential read or write requests for at least some data in the cluster. More
+precisely, one or more Placement Groups (PGs) are in a state that does not
+allow I/O requests to be serviced. Any of the following PG states are
+problematic if they do not clear quickly: *peering*, *stale*, *incomplete*, and
+the lack of *active*.
+
+For detailed information about which PGs are affected, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph health detail
+
+In most cases, the root cause of this issue is that one or more OSDs are
+currently ``down``: see ``OSD_DOWN`` above.
+
+To see the state of a specific problematic PG, run the following command:
+
+.. prompt:: bash $
+
+ ceph tell <pgid> query
+
+PG_DEGRADED
+___________
+
+Data redundancy is reduced for some data: in other words, the cluster does not
+have the desired number of replicas for all data (in the case of replicated
+pools) or erasure code fragments (in the case of erasure-coded pools). More
+precisely, one or more Placement Groups (PGs):
+
+* have the *degraded* or *undersized* flag set, which means that there are not
+ enough instances of that PG in the cluster; or
+* have not had the *clean* state set for a long time.
+
+For detailed information about which PGs are affected, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph health detail
+
+In most cases, the root cause of this issue is that one or more OSDs are
+currently "down": see ``OSD_DOWN`` above.
+
+To see the state of a specific problematic PG, run the following command:
+
+.. prompt:: bash $
+
+ ceph tell <pgid> query
+
+
+PG_RECOVERY_FULL
+________________
+
+Data redundancy might be reduced or even put at risk for some data due to a
+lack of free space in the cluster. More precisely, one or more Placement Groups
+have the *recovery_toofull* flag set, which means that the cluster is unable to
+migrate or recover data because one or more OSDs are above the ``full``
+threshold.
+
+For steps to resolve this condition, see *OSD_FULL* above.
+
+PG_BACKFILL_FULL
+________________
+
+Data redundancy might be reduced or even put at risk for some data due to a
+lack of free space in the cluster. More precisely, one or more Placement Groups
+have the *backfill_toofull* flag set, which means that the cluster is unable to
+migrate or recover data because one or more OSDs are above the ``backfillfull``
+threshold.
+
+For steps to resolve this condition, see *OSD_BACKFILLFULL* above.
+
+PG_DAMAGED
+__________
+
+Data scrubbing has discovered problems with data consistency in the cluster.
+More precisely, one or more Placement Groups either (1) have the *inconsistent*
+or ``snaptrim_error`` flag set, which indicates that an earlier data scrub
+operation found a problem, or (2) have the *repair* flag set, which means that
+a repair for such an inconsistency is currently in progress.
+
+For more information, see :doc:`pg-repair`.
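+
+To list the inconsistent PGs in a particular pool (here a hypothetical pool
+named ``rbd``), the ``rados`` utility can be used:
+
+.. prompt:: bash $
+
+   rados list-inconsistent-pg rbd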
+
+OSD_SCRUB_ERRORS
+________________
+
+Recent OSD scrubs have discovered inconsistencies. This alert is generally
+paired with *PG_DAMAGED* (see above).
+
+For more information, see :doc:`pg-repair`.
+
+OSD_TOO_MANY_REPAIRS
+____________________
+
+The count of read repairs has exceeded the config value threshold
+``mon_osd_warn_num_repaired`` (default: ``10``). Because scrub handles errors
+only for data at rest, and because any read error that occurs when another
+replica is available will be repaired immediately so that the client can get
+the object data, there might exist failing disks that are not registering any
+scrub errors. This repair count is maintained as a way of identifying any such
+failing disks.
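+
+To identify which OSD(s) have exceeded the repair threshold, run the following
+command:
+
+.. prompt:: bash $
+
+   ceph health detail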
+
+
+LARGE_OMAP_OBJECTS
+__________________
+
+One or more pools contain large omap objects, as determined by
+``osd_deep_scrub_large_omap_object_key_threshold`` (threshold for the number of
+keys to determine what is considered a large omap object) or
+``osd_deep_scrub_large_omap_object_value_sum_threshold`` (the threshold for the
+summed size in bytes of all key values to determine what is considered a large
+omap object) or both. To find more information on object name, key count, and
+size in bytes, search the cluster log for 'Large omap object found'. This issue
+can be caused by RGW-bucket index objects that do not have automatic resharding
+enabled. For more information on resharding, see :ref:`RGW Dynamic Bucket Index
+Resharding <rgw_dynamic_bucket_index_resharding>`.
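+
+For example, assuming the default cluster-log location on a monitor host, the
+relevant log entries might be found as follows:
+
+.. prompt:: bash $
+
+   grep 'Large omap object found' /var/log/ceph/ceph.log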
+
+To adjust the thresholds mentioned above, run the following commands:
+
+.. prompt:: bash $
+
+ ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys>
+ ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes>
+
+CACHE_POOL_NEAR_FULL
+____________________
+
+A cache-tier pool is nearly full, as determined by the ``target_max_bytes`` and
+``target_max_objects`` properties of the cache pool. Once the pool reaches the
+target threshold, write requests to the pool might block while data is flushed
+and evicted from the cache. This state normally leads to very high latencies
+and poor performance.
+
+To adjust the cache pool's target size, run the following commands:
+
+.. prompt:: bash $
+
+ ceph osd pool set <cache-pool-name> target_max_bytes <bytes>
+ ceph osd pool set <cache-pool-name> target_max_objects <objects>
+
+There might be other reasons that normal cache flush and evict activity are
+throttled: for example, reduced availability of the base tier, reduced
+performance of the base tier, or overall cluster load.
+
+TOO_FEW_PGS
+___________
+
+The number of Placement Groups (PGs) that are in use in the cluster is below
+the configurable threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can
+lead to suboptimal distribution and suboptimal balance of data across the OSDs
+in the cluster, and a reduction of overall performance.
+
+If data pools have not yet been created, this condition is expected.
+
+To address this issue, you can increase the PG count for existing pools or
+create new pools. For more information, see
+:ref:`choosing-number-of-placement-groups`.
+
+POOL_PG_NUM_NOT_POWER_OF_TWO
+____________________________
+
+One or more pools have a ``pg_num`` value that is not a power of two. Although
+this is not strictly incorrect, it does lead to a less balanced distribution of
+data because some Placement Groups will have roughly twice as much data as
+others have.
+
+This is easily corrected by setting the ``pg_num`` value for the affected
+pool(s) to a nearby power of two. To do so, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> pg_num <value>
+
+To disable this health check, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
+
+POOL_TOO_FEW_PGS
+________________
+
+One or more pools should probably have more Placement Groups (PGs), given the
+amount of data that is currently stored in the pool. This issue can lead to
+suboptimal distribution and suboptimal balance of data across the OSDs in the
+cluster, and a reduction of overall performance. This alert is raised only if
+the ``pg_autoscale_mode`` property on the pool is set to ``warn``.
+
+To disable the alert, entirely disable auto-scaling of PGs for the pool by
+running the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> pg_autoscale_mode off
+
+To allow the cluster to automatically adjust the number of PGs for the pool,
+run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> pg_autoscale_mode on
+
+Alternatively, to manually set the number of PGs for the pool to the
+recommended amount, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> pg_num <new-pg-num>
+
+For more information, see :ref:`choosing-number-of-placement-groups` and
+:ref:`pg-autoscaler`.
+
+TOO_MANY_PGS
+____________
+
+The number of Placement Groups (PGs) in use in the cluster is above the
+configurable threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold
+is exceeded, the cluster will not allow new pools to be created, pool `pg_num`
+to be increased, or pool replication to be increased (any of which, if allowed,
+would lead to more PGs in the cluster). A large number of PGs can lead to
+higher memory utilization for OSD daemons, slower peering after cluster state
+changes (for example, OSD restarts, additions, or removals), and higher load on
+the Manager and Monitor daemons.
+
+The simplest way to mitigate the problem is to increase the number of OSDs in
+the cluster by adding more hardware. Note that, because the OSD count that is
+used for the purposes of this health check is the number of ``in`` OSDs,
+marking ``out`` OSDs ``in`` (if there are any ``out`` OSDs available) can also
+help. To do so, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd in <osd id(s)>
+
+For more information, see :ref:`choosing-number-of-placement-groups`.
+
+POOL_TOO_MANY_PGS
+_________________
+
+One or more pools should probably have fewer Placement Groups (PGs), given the
+amount of data that is currently stored in the pool. This issue can lead to
+higher memory utilization for OSD daemons, slower peering after cluster state
+changes (for example, OSD restarts, additions, or removals), and higher load on
+the Manager and Monitor daemons. This alert is raised only if the
+``pg_autoscale_mode`` property on the pool is set to ``warn``.
+
+To disable the alert, entirely disable auto-scaling of PGs for the pool by
+running the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> pg_autoscale_mode off
+
+To allow the cluster to automatically adjust the number of PGs for the pool,
+run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> pg_autoscale_mode on
+
+Alternatively, to manually set the number of PGs for the pool to the
+recommended amount, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> pg_num <new-pg-num>
+
+For more information, see :ref:`choosing-number-of-placement-groups` and
+:ref:`pg-autoscaler`.
+
+
+POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
+____________________________________
+
+One or more pools have a ``target_size_bytes`` property that is set in order to
+estimate the expected size of the pool, but the value(s) of this property are
+greater than the total available storage (either by themselves or in
+combination with other pools).
+
+This alert is usually an indication that the ``target_size_bytes`` value for
+the pool is too large and should be reduced or set to zero. To reduce the
+``target_size_bytes`` value or set it to zero, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> target_size_bytes 0
+
+The above command sets the value of ``target_size_bytes`` to zero. To set the
+value of ``target_size_bytes`` to a non-zero value, replace the ``0`` with that
+non-zero value.
+
+For more information, see :ref:`specifying_pool_target_size`.
+
+POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO
+____________________________________
+
+One or more pools have both ``target_size_bytes`` and ``target_size_ratio`` set
+in order to estimate the expected size of the pool. Only one of these
+properties should be non-zero. If both are set to a non-zero value, then
+``target_size_ratio`` takes precedence and ``target_size_bytes`` is ignored.
+
+To reset ``target_size_bytes`` to zero, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> target_size_bytes 0
+
+For more information, see :ref:`specifying_pool_target_size`.
+
+TOO_FEW_OSDS
+____________
+
+The number of OSDs in the cluster is below the configurable threshold of
+``osd_pool_default_size``. This means that the cluster may be unable to satisfy
+the data-protection policy specified in CRUSH rules and pool settings for some
+or all data.
+
+SMALLER_PGP_NUM
+_______________
+
+One or more pools have a ``pgp_num`` value less than ``pg_num``. This alert is
+normally an indication that the Placement Group (PG) count was increased
+without any increase in the placement behavior.
+
+This disparity is sometimes brought about deliberately, in order to separate
+out the `split` step when the PG count is adjusted from the data migration that
+is needed when ``pgp_num`` is changed.
+
+This issue is normally resolved by setting ``pgp_num`` to match ``pg_num``, so
+as to trigger the data migration, by running the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool> pgp_num <pg-num-value>
+
+MANY_OBJECTS_PER_PG
+___________________
+
+One or more pools have an average number of objects per Placement Group (PG)
+that is significantly higher than the overall cluster average. The specific
+threshold is determined by the ``mon_pg_warn_max_object_skew`` configuration
+value.
+
+This alert is usually an indication that the pool(s) that contain most of the
+data in the cluster have too few PGs, or that other pools that contain less
+data have too many PGs. See *TOO_MANY_PGS* above.
+
+To silence the health check, raise the threshold by adjusting the
+``mon_pg_warn_max_object_skew`` config option on the managers.
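+
+For example, to raise the threshold to twice its default value of 10 (a purely
+illustrative choice), you might run:
+
+.. prompt:: bash $
+
+   ceph config set mgr mon_pg_warn_max_object_skew 20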
+
+The health check will be silenced for a specific pool only if
+``pg_autoscale_mode`` is set to ``on``.
+
+POOL_APP_NOT_ENABLED
+____________________
+
+A pool exists but the pool has not been tagged for use by a particular
+application.
+
+To resolve this issue, tag the pool for use by an application. For
+example, if the pool is used by RBD, run the following command:
+
+.. prompt:: bash $
+
+ rbd pool init <poolname>
+
+Alternatively, if the pool is being used by a custom application (here 'foo'),
+you can label the pool by running the following low-level command:
+
+.. prompt:: bash $
+
+   ceph osd pool application enable <poolname> foo
+
+For more information, see :ref:`associate-pool-to-application`.
+
+POOL_FULL
+_________
+
+One or more pools have reached (or are very close to reaching) their quota. The
+threshold to raise this health check is determined by the
+``mon_pool_quota_crit_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) by running the following
+commands:
+
+.. prompt:: bash $
+
+ ceph osd pool set-quota <pool> max_bytes <bytes>
+ ceph osd pool set-quota <pool> max_objects <objects>
+
+To disable a quota, set the quota value to 0.
+
+POOL_NEAR_FULL
+______________
+
+One or more pools are approaching a configured fullness threshold.
+
+One of the several thresholds that can raise this health check is determined by
+the ``mon_pool_quota_warn_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) by running the following
+commands:
+
+.. prompt:: bash $
+
+ ceph osd pool set-quota <pool> max_bytes <bytes>
+ ceph osd pool set-quota <pool> max_objects <objects>
+
+To disable a quota, set the quota value to 0.
+
+Other thresholds that can raise the two health checks above are
+``mon_osd_nearfull_ratio`` and ``mon_osd_full_ratio``. For details and
+resolution, see :ref:`storage-capacity` and :ref:`no-free-drive-space`.
+
+OBJECT_MISPLACED
+________________
+
+One or more objects in the cluster are not stored on the node that CRUSH would
+prefer that they be stored on. This alert is an indication that data migration
+due to a recent cluster change has not yet completed.
+
+Misplaced data is not a dangerous condition in and of itself; data consistency
+is never at risk, and old copies of objects will not be removed until the
+desired number of new copies (in the desired locations) has been created.
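+
+To monitor the progress of the ongoing data migration, run the following
+command:
+
+.. prompt:: bash $
+
+   ceph -s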
+
+OBJECT_UNFOUND
+______________
+
+One or more objects in the cluster cannot be found. More precisely, the OSDs
+know that a new or updated copy of an object should exist, but no such copy has
+been found on OSDs that are currently online.
+
+Read or write requests to unfound objects will block.
+
+Ideally, a "down" OSD that has a more recent copy of the unfound object can be
+brought back online. To identify candidate OSDs, check the peering state of the
+PG(s) responsible for the unfound object. To see the peering state, run the
+following command:
+
+.. prompt:: bash $
+
+ ceph tell <pgid> query
+
+On the other hand, if the latest copy of the object is not available, the
+cluster can be told to roll back to a previous version of the object. For more
+information, see :ref:`failures-osd-unfound`.
+
+SLOW_OPS
+________
+
+One or more OSD requests or monitor requests are taking a long time to process.
+This alert might be an indication of extreme load, a slow storage device, or a
+software bug.
+
+To query the request queue for the daemon that is causing the slowdown, run the
+following command from the daemon's host:
+
+.. prompt:: bash $
+
+ ceph daemon osd.<id> ops
+
+To see a summary of the slowest recent requests, run the following command:
+
+.. prompt:: bash $
+
+ ceph daemon osd.<id> dump_historic_ops
+
+To see the location of a specific OSD, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd find osd.<id>
+
+PG_NOT_SCRUBBED
+_______________
+
+One or more Placement Groups (PGs) have not been scrubbed recently. PGs are
+normally scrubbed within an interval determined by
+:confval:`osd_scrub_max_interval` globally. This interval can be overridden on
+a per-pool basis by changing the value of the variable
+:confval:`scrub_max_interval`. This health check is raised if a certain
+percentage (determined by ``mon_warn_pg_not_scrubbed_ratio``) of the interval
+has elapsed after the time the scrub was scheduled and no scrub has been
+performed.
+
+PGs will be scrubbed only if they are flagged as ``clean`` (that is, they have
+the correct number of copies in the correct locations). Misplaced or degraded
+PGs will not be flagged as ``clean`` (see *PG_AVAILABILITY* and *PG_DEGRADED*
+above).
+
+To manually initiate a scrub of a clean PG, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg scrub <pgid>
+
+PG_NOT_DEEP_SCRUBBED
+____________________
+
+One or more Placement Groups (PGs) have not been deep scrubbed recently. PGs
+are normally scrubbed every :confval:`osd_deep_scrub_interval` seconds at most.
+This health check is raised if a certain percentage (determined by
+``mon_warn_pg_not_deep_scrubbed_ratio``) of the interval has elapsed after the
+time the scrub was scheduled and no scrub has been performed.
+
+PGs will receive a deep scrub only if they are flagged as ``clean`` (that is,
+they have the correct number of copies in the correct locations). Misplaced or
+degraded PGs might not be flagged as ``clean`` (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+To manually initiate a deep scrub of a clean PG, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg deep-scrub <pgid>
+
+
+PG_SLOW_SNAP_TRIMMING
+_____________________
+
+The snapshot trim queue for one or more PGs has exceeded the configured warning
+threshold. This alert indicates either that an extremely large number of
+snapshots was recently deleted, or that OSDs are unable to trim snapshots
+quickly enough to keep up with the rate of new snapshot deletions.
+
+The warning threshold is determined by the ``mon_osd_snap_trim_queue_warn_on``
+option (default: 32768).
+
+This alert might be raised if OSDs are under excessive load and unable to keep
+up with their background work, or if the OSDs' internal metadata database is
+heavily fragmented and unable to perform. The alert might also indicate some
+other performance issue with the OSDs.
+
+The exact size of the snapshot trim queue is reported by the ``snaptrimq_len``
+field of ``ceph pg ls -f json-detail``.
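+
+For example, a ``jq`` filter along the following lines can surface the queue
+length per PG; this assumes that the JSON output places the per-PG entries
+under a ``pg_stats`` key, which may vary between releases:
+
+.. prompt:: bash $
+
+   ceph pg ls -f json-detail | jq '.pg_stats[] | {pgid, snaptrimq_len}'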
+
+Stretch Mode
+------------
+
+INCORRECT_NUM_BUCKETS_STRETCH_MODE
+__________________________________
+
+Stretch mode currently supports only two dividing buckets that contain OSDs.
+This warning indicates that the number of dividing buckets is not equal to two
+after stretch mode has been enabled. You can expect unpredictable failures and
+MON assertions until the condition is fixed.
+
+We encourage you to fix this by removing additional dividing buckets or by
+bumping the number of dividing buckets to two.
+
+UNEVEN_WEIGHTS_STRETCH_MODE
+___________________________
+
+The two dividing buckets must have equal weights when stretch mode is enabled.
+This warning indicates that the two dividing buckets have uneven weights after
+stretch mode has been enabled. This is not immediately fatal, but you can
+expect Ceph to become confused when trying to process transitions between
+dividing buckets.
+
+We encourage you to fix this by making the weights even on both dividing
+buckets. This can be done by making sure that the combined weight of the OSDs
+in each dividing bucket is the same.
+
+Miscellaneous
+-------------
+
+RECENT_CRASH
+____________
+
+One or more Ceph daemons have crashed recently, and the crash(es) have not yet
+been acknowledged and archived by the administrator. This alert might indicate
+a software bug, a hardware problem (for example, a failing disk), or some other
+problem.
+
+To list recent crashes, run the following command:
+
+.. prompt:: bash $
+
+ ceph crash ls-new
+
+To examine information about a specific crash, run the following command:
+
+.. prompt:: bash $
+
+ ceph crash info <crash-id>
+
+To silence this alert, you can archive the crash (perhaps after the crash
+has been examined by an administrator) by running the following command:
+
+.. prompt:: bash $
+
+ ceph crash archive <crash-id>
+
+Similarly, to archive all recent crashes, run the following command:
+
+.. prompt:: bash $
+
+ ceph crash archive-all
+
+Archived crashes will still be visible by running the command ``ceph crash
+ls``, but not by running the command ``ceph crash ls-new``.
+
+The time period that is considered recent is determined by the option
+``mgr/crash/warn_recent_interval`` (default: two weeks).
+
+To entirely disable this alert, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set mgr/crash/warn_recent_interval 0
+
+RECENT_MGR_MODULE_CRASH
+_______________________
+
+One or more ``ceph-mgr`` modules have crashed recently, and the crash(es) have
+not yet been acknowledged and archived by the administrator. This alert
+usually indicates a software bug in one of the software modules that are
+running inside the ``ceph-mgr`` daemon. The module that experienced the problem
+might be disabled as a result, but other modules are unaffected and continue to
+function as expected.
+
+As with the *RECENT_CRASH* health check, a specific crash can be inspected by
+running the following command:
+
+.. prompt:: bash $
+
+ ceph crash info <crash-id>
+
+To silence this alert, you can archive the crash (perhaps after the crash has
+been examined by an administrator) by running the following command:
+
+.. prompt:: bash $
+
+ ceph crash archive <crash-id>
+
+Similarly, to archive all recent crashes, run the following command:
+
+.. prompt:: bash $
+
+ ceph crash archive-all
+
+Archived crashes will still be visible by running the command ``ceph crash ls``
+but not by running the command ``ceph crash ls-new``.
+
+The time period that is considered recent is determined by the option
+``mgr/crash/warn_recent_interval`` (default: two weeks).
+
+To entirely disable this alert, run the following command:
+
+.. prompt:: bash $
+
+ ceph config set mgr/crash/warn_recent_interval 0
+
+TELEMETRY_CHANGED
+_________________
+
+Telemetry has been enabled, but because the contents of the telemetry report
+have changed in the meantime, telemetry reports will not be sent.
+
+Ceph developers occasionally revise the telemetry feature to include new and
+useful information, or to remove information found to be useless or sensitive.
+If any new information is included in the report, Ceph requires the
+administrator to re-enable telemetry. This requirement ensures that the
+administrator has an opportunity to (re)review the information that will be
+shared.
+
+To review the contents of the telemetry report, run the following command:
+
+.. prompt:: bash $
+
+ ceph telemetry show
+
+Note that the telemetry report consists of several channels that may be
+independently enabled or disabled. For more information, see :ref:`telemetry`.
+
+To re-enable telemetry (and silence the alert), run the following command:
+
+.. prompt:: bash $
+
+ ceph telemetry on
+
+To disable telemetry (and silence the alert), run the following command:
+
+.. prompt:: bash $
+
+ ceph telemetry off
+
+AUTH_BAD_CAPS
+_____________
+
+One or more auth users have capabilities that cannot be parsed by the monitors.
+As a general rule, this alert indicates that the affected user(s) are not
+authorized to perform any action against one or more daemon types.
+
+This alert is most likely to be raised after an upgrade if (1) the capabilities
+were set with an older version of Ceph that did not properly validate the
+syntax of those capabilities, or if (2) the syntax of the capabilities has
+changed.
+
+To remove the user(s) in question, run the following command:
+
+.. prompt:: bash $
+
+ ceph auth rm <entity-name>
+
+(This resolves the health check, but it prevents clients from being able to
+authenticate as the removed user.)
+
+Alternatively, to update the capabilities for the user(s), run the following
+command:
+
+.. prompt:: bash $
+
+   ceph auth caps <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...]
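+
+For example, the capabilities of a hypothetical ``client.foo`` user might be
+reset as follows (the capability strings shown are illustrative only):
+
+.. prompt:: bash $
+
+   ceph auth caps client.foo mon 'allow r' osd 'allow rw pool=foo_pool'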
+
+For more information about auth capabilities, see :ref:`user-management`.
+
+OSD_NO_DOWN_OUT_INTERVAL
+________________________
+
+The ``mon_osd_down_out_interval`` option is set to zero, which means that the
+system does not automatically perform any repair or healing operations when an
+OSD fails. Instead, an administrator or an external orchestrator must manually
+mark "down" OSDs as ``out`` (by running ``ceph osd out <osd-id>``) in order to
+trigger recovery.
+
+This option is normally set to five or ten minutes, which should be enough time
+for a host to power-cycle or reboot.
+
+To silence this alert, set ``mon_warn_on_osd_down_out_interval_zero`` to
+``false`` by running the following command:
+
+.. prompt:: bash $
+
+   ceph config set global mon_warn_on_osd_down_out_interval_zero false
+
+DASHBOARD_DEBUG
+_______________
+
+The Dashboard debug mode is enabled. This means that if there is an error while
+processing a REST API request, the HTTP error response will contain a Python
+traceback. This mode should be disabled in production environments because such
+a traceback might contain and expose sensitive information.
+
+To disable the debug mode, run the following command:
+
+.. prompt:: bash $
+
+ ceph dashboard debug disable
diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
new file mode 100644
index 000000000..15525c1d3
--- /dev/null
+++ b/doc/rados/operations/index.rst
@@ -0,0 +1,99 @@
+.. _rados-operations:
+
+====================
+ Cluster Operations
+====================
+
+.. raw:: html
+
+ <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3>
+
+High-level cluster operations consist primarily of starting, stopping, and
+restarting a cluster with the ``ceph`` service; checking the cluster's health;
+and monitoring an operating cluster.
+
+.. toctree::
+ :maxdepth: 1
+
+ operating
+ health-checks
+ monitoring
+ monitoring-osd-pg
+ user-management
+ pg-repair
+
+.. raw:: html
+
+ </td><td><h3>Data Placement</h3>
+
+Once you have your cluster up and running, you may begin working with data
+placement. Ceph supports petabyte-scale data storage clusters, with storage
+pools and placement groups that distribute data across the cluster using Ceph's
+CRUSH algorithm.
+
+.. toctree::
+ :maxdepth: 1
+
+ data-placement
+ pools
+ erasure-code
+ cache-tiering
+ placement-groups
+ upmap
+ read-balancer
+ balancer
+ crush-map
+ crush-map-edits
+ stretch-mode
+ change-mon-elections
+
+
+
+.. raw:: html
+
+ </td></tr><tr><td><h3>Low-level Operations</h3>
+
+Low-level cluster operations consist of starting, stopping, and restarting a
+particular daemon within a cluster; changing the settings of a particular
+daemon or subsystem; and adding a daemon to the cluster or removing a daemon
+from the cluster. The most common use cases for low-level operations include
+growing or shrinking the Ceph cluster and replacing legacy or failed hardware
+with new hardware.
+
+.. toctree::
+ :maxdepth: 1
+
+ add-or-rm-osds
+ add-or-rm-mons
+ devices
+ bluestore-migration
+ Command Reference <control>
+
+
+
+.. raw:: html
+
+ </td><td><h3>Troubleshooting</h3>
+
+Ceph is still on the leading edge, so you may encounter situations that require
+you to evaluate your Ceph configuration and modify your logging and debugging
+settings to identify and remedy issues you are encountering with your cluster.
+
+.. toctree::
+ :maxdepth: 1
+
+ ../troubleshooting/community
+ ../troubleshooting/troubleshooting-mon
+ ../troubleshooting/troubleshooting-osd
+ ../troubleshooting/troubleshooting-pg
+ ../troubleshooting/log-and-debug
+ ../troubleshooting/cpu-profiling
+ ../troubleshooting/memory-profiling
+
+
+
+
+.. raw:: html
+
+ </td></tr></tbody></table>
+
diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst
new file mode 100644
index 000000000..b0a6767a1
--- /dev/null
+++ b/doc/rados/operations/monitoring-osd-pg.rst
@@ -0,0 +1,556 @@
+=========================
+ Monitoring OSDs and PGs
+=========================
+
+High availability and high reliability require a fault-tolerant approach to
+managing hardware and software issues. Ceph has no single point of failure and
+it can service requests for data even when in a "degraded" mode. Ceph's `data
+placement`_ introduces a layer of indirection to ensure that data doesn't bind
+directly to specific OSDs. For this reason, tracking system faults
+requires finding the `placement group`_ (PG) and the underlying OSDs at the
+root of the problem.
+
+.. tip:: A fault in one part of the cluster might prevent you from accessing a
+ particular object, but that doesn't mean that you are prevented from
+ accessing other objects. When you run into a fault, don't panic. Just
+ follow the steps for monitoring your OSDs and placement groups, and then
+ begin troubleshooting.
+
+Ceph is self-repairing. However, when problems persist, monitoring OSDs and
+placement groups will help you identify the problem.
+
+
+Monitoring OSDs
+===============
+
+An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
+either running and reachable (``up``), or it is not running and not reachable
+(``down``).
+
+If an OSD is ``up``, it may be either ``in`` service (clients can read and
+write data) or ``out`` of service. If an OSD that was ``in`` is set to the
+``out`` state, whether due to a failure or a manual action, Ceph will migrate
+its placement groups to other OSDs in order to maintain the configured
+redundancy.
+
+If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
+If an OSD is ``down``, it should also be ``out``.
+
+.. note:: If an OSD is ``down`` and ``in``, there is a problem: the cluster is
+   not in a healthy state.
+
+.. ditaa::
+
+ +----------------+ +----------------+
+ | | | |
+ | OSD #n In | | OSD #n Up |
+ | | | |
+ +----------------+ +----------------+
+ ^ ^
+ | |
+ | |
+ v v
+ +----------------+ +----------------+
+ | | | |
+ | OSD #n Out | | OSD #n Down |
+ | | | |
+ +----------------+ +----------------+
+
+If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
+you might notice that the cluster does not always show ``HEALTH_OK``. Don't
+panic. There are certain circumstances in which it is expected and normal that
+the cluster will **NOT** show ``HEALTH_OK``:
+
+#. You haven't started the cluster yet.
+#. You have just started or restarted the cluster and it's not ready to show
+ health statuses yet, because the PGs are in the process of being created and
+ the OSDs are in the process of peering.
+#. You have just added or removed an OSD.
+#. You have just modified your cluster map.
+
+Checking to see if OSDs are ``up`` and running is an important aspect of monitoring them:
+whenever the cluster is up and running, every OSD that is ``in`` the cluster should also
+be ``up`` and running. To see if all of the cluster's OSDs are running, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph osd stat
+
+The output provides the following information: the total number of OSDs (x),
+how many OSDs are ``up`` (y), how many OSDs are ``in`` (z), and the map epoch (eNNNN). ::
+
+ x osds: y up, z in; epoch: eNNNN
+
+If the number of OSDs that are ``in`` the cluster is greater than the number of
+OSDs that are ``up``, run the following command to identify the ``ceph-osd``
+daemons that are not running:
+
+.. prompt:: bash $
+
+ ceph osd tree
+
+::
+
+ #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
+ -1 2.00000 pool openstack
+ -3 2.00000 rack dell-2950-rack-A
+ -2 2.00000 host dell-2950-A1
+ 0 ssd 1.00000 osd.0 up 1.00000 1.00000
+ 1 ssd 1.00000 osd.1 down 1.00000 1.00000
+
+.. tip:: Searching through a well-designed CRUSH hierarchy to identify the physical
+ locations of particular OSDs might help you troubleshoot your cluster.
+
+If an OSD is ``down``, start it by running the following command:
+
+.. prompt:: bash $
+
+ sudo systemctl start ceph-osd@1
+
+For problems associated with OSDs that have stopped or won't restart, see `OSD Not Running`_.
+
+
+PG Sets
+=======
+
+When CRUSH assigns a PG to OSDs, it takes note of how many replicas of the PG
+are required by the pool and then assigns each replica to a different OSD.
+For example, if the pool requires three replicas of a PG, CRUSH might assign
+them individually to ``osd.1``, ``osd.2`` and ``osd.3``. CRUSH seeks a
+pseudo-random placement that takes into account the failure domains that you
+have set in your `CRUSH map`_; for this reason, PGs are rarely assigned to
+immediately adjacent OSDs in a large cluster.
+
+Ceph processes client requests with the **Acting Set** of OSDs: this is the set
+of OSDs that currently have a full and working version of a PG shard and that
+are therefore responsible for handling requests. By contrast, the **Up Set** is
+the set of OSDs to which CRUSH currently maps a specific PG: data is moved or
+copied to the **Up Set** (or is planned to be moved or copied there). See
+:ref:`Placement Group Concepts <rados_operations_pg_concepts>`.
+
+Sometimes an OSD in the Acting Set is ``down`` or otherwise unable to
+service requests for objects in the PG. When this kind of situation
+arises, don't panic. Common examples of such a situation include:
+
+- You added or removed an OSD, CRUSH reassigned the PG to
+ other OSDs, and this reassignment changed the composition of the Acting Set and triggered
+ the migration of data by means of a "backfill" process.
+- An OSD was ``down``, was restarted, and is now ``recovering``.
+- An OSD in the Acting Set is ``down`` or unable to service requests,
+ and another OSD has temporarily assumed its duties.
+
+Typically, the Up Set and the Acting Set are identical. When they are not, it
+might indicate that Ceph is migrating the PG (in other words, that the PG has
+been remapped), that an OSD is recovering, or that there is a problem with the
+cluster (in such scenarios, Ceph usually shows a ``HEALTH_WARN`` state with a
+"stuck stale" message).
+
+To retrieve a list of PGs, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg dump
+
+To see which OSDs are within the Acting Set and the Up Set for a specific PG, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg map {pg-num}
+
+The output provides the following information: the osdmap epoch (eNNN), the PG number
+({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set
+(acting[])::
+
+ osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]
+
+.. note:: If the Up Set and the Acting Set do not match, this might indicate
+ that the cluster is rebalancing itself or that there is a problem with
+ the cluster.
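+
+For example, to check the mapping of a hypothetical PG ``1.1701b`` (substitute
+a PG ID from your own cluster), you might run:
+
+.. prompt:: bash $
+
+   ceph pg map 1.1701b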
+
+
+Peering
+=======
+
+Before you can write data to a PG, it must be in an ``active`` state and it
+will preferably be in a ``clean`` state. For Ceph to determine the current
+state of a PG, peering must take place. That is, the primary OSD of the PG
+(that is, the first OSD in the Acting Set) must peer with the secondary and
+tertiary OSDs so that consensus on the current state of the PG can be
+established. In the following diagram, we assume a pool with three replicas of
+the PG:
+
+.. ditaa::
+
+ +---------+ +---------+ +-------+
+ | OSD 1 | | OSD 2 | | OSD 3 |
+ +---------+ +---------+ +-------+
+ | | |
+ | Request To | |
+ | Peer | |
+ |-------------->| |
+ |<--------------| |
+ | Peering |
+ | |
+ | Request To |
+ | Peer |
+ |----------------------------->|
+ |<-----------------------------|
+ | Peering |
+
+The OSDs also report their status to the monitor. For details, see `Configuring Monitor/OSD
+Interaction`_. To troubleshoot peering issues, see `Peering
+Failure`_.
+
+
+Monitoring PG States
+====================
+
+If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
+you might notice that the cluster does not always show ``HEALTH_OK``. After
+first checking to see if the OSDs are running, you should also check PG
+states. There are certain PG-peering-related circumstances in which it is expected
+and normal that the cluster will **NOT** show ``HEALTH_OK``:
+
+#. You have just created a pool and the PGs haven't peered yet.
+#. The PGs are recovering.
+#. You have just added an OSD to or removed an OSD from the cluster.
+#. You have just modified your CRUSH map and your PGs are migrating.
+#. There is inconsistent data in different replicas of a PG.
+#. Ceph is scrubbing a PG's replicas.
+#. Ceph doesn't have enough storage capacity to complete backfilling operations.
+
+If one of these circumstances causes Ceph to show ``HEALTH_WARN``, don't
+panic. In many cases, the cluster will recover on its own. In some cases, however, you
+might need to take action. An important aspect of monitoring PGs is to check their
+status as ``active`` and ``clean``: that is, it is important to ensure that, when the
+cluster is up and running, all PGs are ``active`` and (preferably) ``clean``.
+To see the status of every PG, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg stat
+
+The output provides the following information: the total number of PGs (x), how many
+PGs are in a particular state such as ``active+clean`` (y), and the
+amount of data stored (z). ::
+
+ x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
+
+.. note:: It is common for Ceph to report multiple states for PGs (for example,
+   ``active+clean``, ``active+clean+remapped``, and ``active+clean+scrubbing``).
+
+Here Ceph shows not only the PG states, but also the amount of storage capacity
+used (aa), the amount of storage capacity remaining (bb), and the total storage
+capacity of the cluster. These values can be important in a few cases:
+
+- The cluster is reaching its ``near full ratio`` or ``full ratio``.
+- Data is not being distributed across the cluster due to an error in the
+ CRUSH configuration.
+
+
+.. topic:: Placement Group IDs
+
+ PG IDs consist of the pool number (not the pool name) followed by a period
+   (.) and a hexadecimal number. You can view pool numbers and their names in
+   the output of ``ceph osd lspools``. For example, the first pool that was
+ created corresponds to pool number ``1``. A fully qualified PG ID has the
+ following form::
+
+ {pool-num}.{pg-id}
+
+ It typically resembles the following::
+
+ 1.1701b
+
+
+To retrieve a list of PGs, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg dump
+
+To format the output in JSON format and save it to a file, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg dump -o {filename} --format=json
+
+To query a specific PG, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg {poolnum}.{pg-id} query
+
+Ceph will output the query in JSON format.
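+
+For example, to query a hypothetical PG ``1.1701b`` and print only its state
+string (such as ``active+clean``), you might pipe the JSON through ``jq``
+(this assumes that ``jq`` is installed; the PG ID is illustrative):
+
+.. prompt:: bash $
+
+   ceph pg 1.1701b query | jq -r '.state'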
+
+The following subsections describe the most common PG states in detail.
+
+
+Creating
+--------
+
+PGs are created when you create a pool: the command that creates a pool
+specifies the total number of PGs for that pool, and when the pool is created
+all of those PGs are created as well. Ceph will echo ``creating`` while it is
+creating PGs. After the PG(s) are created, the OSDs that are part of a PG's
+Acting Set will peer. Once peering is complete, the PG status should be
+``active+clean``. This status means that Ceph clients can begin writing to the
+PG.
+
+.. ditaa::
+
+ /-----------\ /-----------\ /-----------\
+ | Creating |------>| Peering |------>| Active |
+ \-----------/ \-----------/ \-----------/
+
+Peering
+-------
+
+When a PG peers, the OSDs that store the replicas of its data converge on an
+agreed state of the data and metadata within that PG. When peering is complete,
+those OSDs agree about the state of that PG. However, completion of the peering
+process does **NOT** mean that each replica has the latest contents.
+
+.. topic:: Authoritative History
+
+ Ceph will **NOT** acknowledge a write operation to a client until that write
+ operation is persisted by every OSD in the Acting Set. This practice ensures
+ that at least one member of the Acting Set will have a record of every
+ acknowledged write operation since the last successful peering operation.
+
+ Given an accurate record of each acknowledged write operation, Ceph can
+ construct a new authoritative history of the PG--that is, a complete and
+ fully ordered set of operations that, if performed, would bring an OSD’s
+ copy of the PG up to date.
+
+
+Active
+------
+
+After Ceph has completed the peering process, a PG should become ``active``.
+The ``active`` state means that the data in the PG is generally available for
+read and write operations in the primary and replica OSDs.
+
+
+Clean
+-----
+
+When a PG is in the ``clean`` state, all OSDs holding its data and metadata
+have successfully peered and there are no stray replicas. Ceph has replicated
+all objects in the PG the correct number of times.
+
+
+Degraded
+--------
+
+When a client writes an object to the primary OSD, the primary OSD is
+responsible for writing the replicas to the replica OSDs. After the primary OSD
+writes the object to storage, the PG will remain in a ``degraded``
+state until the primary OSD has received an acknowledgement from the replica
+OSDs that Ceph created the replica objects successfully.
+
+The reason that a PG can be ``active+degraded`` is that an OSD can be
+``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes
+``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must
+peer again when the OSD comes back online. However, a client can still write a
+new object to a ``degraded`` PG if it is ``active``.
+
+If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the
+``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
+to another OSD. The time between being marked ``down`` and being marked ``out``
+is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
+by default.
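+
+As a hedged example, this interval can be inspected and, if your failure
+domains warrant it, adjusted with the centralized configuration commands (the
+``900`` below is illustrative, not a recommendation):
+
+.. prompt:: bash $
+
+   ceph config get mon mon_osd_down_out_interval      # show the current value
+   ceph config set mon mon_osd_down_out_interval 900  # example: wait 15 minutes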
+
+A PG can also be in the ``degraded`` state because there are one or more
+objects that Ceph expects to find in the PG but that Ceph cannot find. Although
+you cannot read or write to unfound objects, you can still access all of the other
+objects in the ``degraded`` PG.
+
+
+Recovering
+----------
+
+Ceph was designed for fault-tolerance, because hardware and other server
+problems are expected or even routine. When an OSD goes ``down``, its contents
+might fall behind the current state of other replicas in the PGs. When the OSD
+has returned to the ``up`` state, the contents of the PGs must be updated to
+reflect that current state. During that time period, the OSD might be in a
+``recovering`` state.
+
+Recovery is not always trivial, because a hardware failure might cause a
+cascading failure of multiple OSDs. For example, a network switch for a rack or
+cabinet might fail, which can cause the OSDs of a number of host machines to
+fall behind the current state of the cluster. In such a scenario, general
+recovery is possible only if each of the OSDs recovers after the fault has been
+resolved.
+
+Ceph provides a number of settings that determine how the cluster balances the
+resource contention between the need to process new service requests and the
+need to recover data objects and restore the PGs to the current state. The
+``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
+even process some replay requests before starting the recovery process. The
+``osd_recovery_thread_timeout`` setting determines the duration of a thread
+timeout, because multiple OSDs might fail, restart, and re-peer at staggered
+rates. The ``osd_recovery_max_active`` setting limits the number of recovery
+requests an OSD can entertain simultaneously, in order to prevent the OSD from
+failing to serve client requests. The ``osd_recovery_max_chunk`` setting limits
+the size of the recovered data chunks, in order to prevent network congestion.
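+
+As an illustrative sketch (the values are examples, not recommendations), these
+recovery settings can be viewed and changed at runtime with the centralized
+configuration commands:
+
+.. prompt:: bash $
+
+   ceph config get osd osd_recovery_max_active    # show the current limit
+   ceph config set osd osd_recovery_max_active 2  # example: two concurrent recovery ops per OSD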
+
+
+Back Filling
+------------
+
+When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
+already in the cluster to the newly added OSD. It can put excessive load on the
+new OSD to force it to immediately accept the reassigned PGs. Back filling the
+OSD with the PGs allows this process to begin in the background. After the
+backfill operations have completed, the new OSD will begin serving requests as
+soon as it is ready.
+
+During the backfill operations, you might see one of several states:
+``backfill_wait`` indicates that a backfill operation is pending, but is not
+yet underway; ``backfilling`` indicates that a backfill operation is currently
+underway; and ``backfill_toofull`` indicates that a backfill operation was
+requested but couldn't be completed due to insufficient storage capacity. When
+a PG cannot be backfilled, it might be considered ``incomplete``.
+
+The ``backfill_toofull`` state might be transient. It might happen that, as PGs
+are moved around, space becomes available. The ``backfill_toofull`` state is
+similar to ``backfill_wait`` in that backfill operations can proceed as soon as
+conditions change.
+
+Ceph provides a number of settings to manage the load spike associated with the
+reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
+setting specifies the maximum number of concurrent backfills to and from an OSD
+(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
+backfill request if the OSD is approaching its full ratio (default: 90%). This
+setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
+an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
+allows an OSD to retry the request after a certain interval (default: 30
+seconds). OSDs can also set ``osd_backfill_scan_min`` and
+``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
+512, respectively).
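+
+For example, the backfill limits mentioned above can be inspected and adjusted
+at runtime; the values below are illustrative only:
+
+.. prompt:: bash $
+
+   ceph config set osd osd_max_backfills 2   # example: two concurrent backfills per OSD
+   ceph osd set-backfillfull-ratio 0.92      # example: raise the backfillfull threshold to 92%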
+
+
+Remapped
+--------
+
+When the Acting Set that services a PG changes, the data migrates from the old
+Acting Set to the new Acting Set. Because it might take time for the new
+primary OSD to begin servicing requests, the old primary OSD might be required
+to continue servicing requests until the PG data migration is complete. After
+data migration has completed, the mapping uses the primary OSD of the new
+Acting Set.
+
+
+Stale
+-----
+
+Although Ceph uses heartbeats in order to ensure that hosts and daemons are
+running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are
+not reporting statistics in a timely manner (for example, there might be a
+temporary network fault). By default, OSD daemons report their PG, up through,
+boot, and failure statistics every half second (that is, in accordance with a
+value of ``0.5``), which is more frequent than the reports defined by the
+heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report
+to the monitor or if other OSDs have reported the primary OSD ``down``, the
+monitors will mark the PG ``stale``.
+
+When you start your cluster, it is common to see the ``stale`` state until the
+peering process completes. After your cluster has been running for a while,
+however, seeing PGs in the ``stale`` state indicates that the primary OSD for
+those PGs is ``down`` or not reporting PG statistics to the monitor.
+
+
+Identifying Troubled PGs
+========================
+
+As previously noted, a PG is not necessarily having problems just because its
+state is not ``active+clean``. When PGs are stuck, this might indicate that
+Ceph cannot perform self-repairs. The stuck states include:
+
+- **Unclean**: PGs contain objects that have not been replicated the desired
+ number of times. Under normal conditions, it can be assumed that these PGs
+ are recovering.
+- **Inactive**: PGs cannot process reads or writes because they are waiting for
+ an OSD that has the most up-to-date data to come back ``up``.
+- **Stale**: PGs are in an unknown state, because the OSDs that host them have
+ not reported to the monitor cluster for a certain period of time (determined
+ by ``mon_osd_report_timeout``).
+
+To identify stuck PGs, run the following command:
+
+.. prompt:: bash $
+
+ ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
+
+For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs,
+see `Troubleshooting PG Errors`_.
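+
+For example, to list only the PGs that are stuck in the ``inactive`` state:
+
+.. prompt:: bash $
+
+   ceph pg dump_stuck inactive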
+
+
+Finding an Object Location
+==========================
+
+To store object data in the Ceph Object Store, a Ceph client must:
+
+#. Set an object name
+#. Specify a `pool`_
+
+The Ceph client retrieves the latest cluster map, the CRUSH algorithm
+calculates how to map the object to a PG, and then the algorithm calculates how
+to dynamically assign the PG to an OSD. To find the object location given only
+the object name and the pool name, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd map {poolname} {object-name} [namespace]
+
+.. topic:: Exercise: Locate an Object
+
+ As an exercise, let's create an object. We can specify an object name, a path
+ to a test file that contains some object data, and a pool name by using the
+ ``rados put`` command on the command line. For example:
+
+ .. prompt:: bash $
+
+ rados put {object-name} {file-path} --pool=data
+ rados put test-object-1 testfile.txt --pool=data
+
+ To verify that the Ceph Object Store stored the object, run the
+ following command:
+
+ .. prompt:: bash $
+
+ rados -p data ls
+
+ To identify the object location, run the following commands:
+
+ .. prompt:: bash $
+
+ ceph osd map {pool-name} {object-name}
+ ceph osd map data test-object-1
+
+ Ceph should output the object's location. For example::
+
+ osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
+
+ To remove the test object, simply delete it by running the ``rados rm``
+ command. For example:
+
+ .. prompt:: bash $
+
+ rados rm test-object-1 --pool=data
+
+As the cluster evolves, the object location may change dynamically. One benefit
+of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually
+performing the migration. For details, see the `Architecture`_ section.
+
+.. _data placement: ../data-placement
+.. _pool: ../pools
+.. _placement group: ../placement-groups
+.. _Architecture: ../../../architecture
+.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running
+.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors
+.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering
+.. _CRUSH map: ../crush-map
+.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/
+.. _Placement Group Subsystem: ../control#placement-group-subsystem
diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst
new file mode 100644
index 000000000..a9171f2d8
--- /dev/null
+++ b/doc/rados/operations/monitoring.rst
@@ -0,0 +1,644 @@
+======================
+ Monitoring a Cluster
+======================
+
+After you have a running cluster, you can use the ``ceph`` tool to monitor your
+cluster. Monitoring a cluster typically involves checking OSD status, monitor
+status, placement group status, and metadata server status.
+
+Using the command line
+======================
+
+Interactive mode
+----------------
+
+To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
+with no arguments. For example:
+
+.. prompt:: bash $
+
+ ceph
+
+.. prompt:: ceph>
+ :prompts: ceph>
+
+ health
+ status
+ quorum_status
+ mon stat
+
+Non-default paths
+-----------------
+
+If you specified non-default locations for your configuration or keyring when
+you installed the cluster, you may specify their locations to the ``ceph`` tool
+by running the following command:
+
+.. prompt:: bash $
+
+ ceph -c /path/to/conf -k /path/to/keyring health
+
+Checking a Cluster's Status
+===========================
+
+After you start your cluster, and before you start reading and/or writing data,
+you should check your cluster's status.
+
+To check a cluster's status, run the following command:
+
+.. prompt:: bash $
+
+ ceph status
+
+Alternatively, you can run the following command:
+
+.. prompt:: bash $
+
+ ceph -s
+
+In interactive mode, this operation is performed by typing ``status`` and
+pressing **Enter**:
+
+.. prompt:: ceph>
+ :prompts: ceph>
+
+ status
+
+Ceph will print the cluster status. For example, a tiny Ceph "demonstration
+cluster" that is running one instance of each service (monitor, manager, and
+OSD) might print the following:
+
+::
+
+ cluster:
+ id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20
+ health: HEALTH_OK
+
+ services:
+ mon: 3 daemons, quorum a,b,c
+ mgr: x(active)
+ mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby
+ osd: 3 osds: 3 up, 3 in
+
+ data:
+ pools: 2 pools, 16 pgs
+ objects: 21 objects, 2.19K
+ usage: 546 GB used, 384 GB / 931 GB avail
+ pgs: 16 active+clean
+
+
+How Ceph Calculates Data Usage
+------------------------------
+
+The ``usage`` value reflects the *actual* amount of raw storage used. The ``xxx
+GB / xxx GB`` value shows the amount available (the lesser number) out of the
+overall storage capacity of the cluster. The notional number reflects the size
+of the stored data before it is replicated, cloned or snapshotted. Therefore,
+the amount of data actually stored typically exceeds the notional amount
+stored, because Ceph creates replicas of the data and may also use storage
+capacity for cloning and snapshotting.
+
+
+Watching a Cluster
+==================
+
+Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster
+itself maintains a *cluster log* that records high-level events about the
+entire Ceph cluster. These events are logged to disk on monitor servers (in
+the default location ``/var/log/ceph/ceph.log``), and they can be monitored via
+the command line.
+
+To follow the cluster log, run the following command:
+
+.. prompt:: bash $
+
+ ceph -w
+
+Ceph will print the status of the system, followed by each log message as it is
+added. For example:
+
+::
+
+ cluster:
+ id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20
+ health: HEALTH_OK
+
+ services:
+ mon: 3 daemons, quorum a,b,c
+ mgr: x(active)
+ mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby
+ osd: 3 osds: 3 up, 3 in
+
+ data:
+ pools: 2 pools, 16 pgs
+ objects: 21 objects, 2.19K
+ usage: 546 GB used, 384 GB / 931 GB avail
+ pgs: 16 active+clean
+
+
+ 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
+ 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
+ 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available
+
+Instead of printing log lines as they are added, you might want to print only
+the most recent lines. Run ``ceph log last [n]`` to see the most recent ``n``
+lines from the cluster log.
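+
+For example, to print the ten most recent cluster log entries:
+
+.. prompt:: bash $
+
+   ceph log last 10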
+
+Monitoring Health Checks
+========================
+
+Ceph continuously runs various *health checks*. When
+a health check fails, this failure is reflected in the output of ``ceph status`` and
+``ceph health``. The cluster log receives messages that
+indicate when a check has failed and when the cluster has recovered.
+
+For example, when an OSD goes down, the ``health`` section of the status
+output is updated as follows:
+
+::
+
+ health: HEALTH_WARN
+ 1 osds down
+ Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
+
+At the same time, cluster log messages are emitted to record the failure of the
+health checks:
+
+::
+
+ 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
+ 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
+
+When the OSD comes back online, the cluster log records the cluster's return
+to a healthy state:
+
+::
+
+ 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
+ 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
+ 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
+
+Network Performance Checks
+--------------------------
+
+Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon
+availability and network performance. If a single delayed response is detected,
+this might indicate nothing more than a busy OSD. But if multiple delays
+between distinct pairs of OSDs are detected, this might indicate a failed
+network switch, a NIC failure, or a layer 1 failure.
+
+By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a
+health check (a ``HEALTH_WARN``). For example:
+
+::
+
+ HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms)
+
+In the output of the ``ceph health detail`` command, you can see which OSDs are
+experiencing delays and how long the delays are. The output of ``ceph health
+detail`` is limited to ten lines. Here is an example of the output you can
+expect from the ``ceph health detail`` command::
+
+ [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
+ Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving
+ Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.2 [dc1,rack2] 1030.123 msec
+ Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec
+ Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec
+
+To see more detail and to collect a complete dump of network performance
+information, use the ``dump_osd_network`` command. This command is usually sent
+to a Ceph Manager Daemon, but it can be used to collect information about a
+specific OSD's interactions by sending it to that OSD. The default threshold
+for a slow heartbeat is 1 second (1000 milliseconds), but this can be
+overridden by providing a number of milliseconds as an argument.
+
+To show all network performance data with a specified threshold of 0, send the
+following command to the mgr:
+
+.. prompt:: bash $
+
+ ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0
+
+::
+
+ {
+ "threshold": 0,
+ "entries": [
+ {
+ "last update": "Wed Sep 4 17:04:49 2019",
+ "stale": false,
+ "from osd": 2,
+ "to osd": 0,
+ "interface": "front",
+ "average": {
+ "1min": 1.023,
+ "5min": 0.860,
+ "15min": 0.883
+ },
+ "min": {
+ "1min": 0.818,
+ "5min": 0.607,
+ "15min": 0.607
+ },
+ "max": {
+ "1min": 1.164,
+ "5min": 1.173,
+ "15min": 1.544
+ },
+ "last": 0.924
+ },
+ {
+ "last update": "Wed Sep 4 17:04:49 2019",
+ "stale": false,
+ "from osd": 2,
+ "to osd": 0,
+ "interface": "back",
+ "average": {
+ "1min": 0.968,
+ "5min": 0.897,
+ "15min": 0.830
+ },
+ "min": {
+ "1min": 0.860,
+ "5min": 0.563,
+ "15min": 0.502
+ },
+ "max": {
+ "1min": 1.171,
+ "5min": 1.216,
+ "15min": 1.456
+ },
+ "last": 0.845
+ },
+ {
+ "last update": "Wed Sep 4 17:04:48 2019",
+ "stale": false,
+ "from osd": 0,
+ "to osd": 1,
+ "interface": "front",
+ "average": {
+ "1min": 0.965,
+ "5min": 0.811,
+ "15min": 0.850
+ },
+ "min": {
+ "1min": 0.650,
+ "5min": 0.488,
+ "15min": 0.466
+ },
+ "max": {
+ "1min": 1.252,
+ "5min": 1.252,
+ "15min": 1.362
+ },
+ "last": 0.791
+ },
+ ...
+
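+
+As a sketch of the per-OSD variant described above, the same command can be
+sent to an individual OSD's admin socket on the host where that OSD runs (the
+OSD id ``2`` is illustrative):
+
+.. prompt:: bash $
+
+   ceph daemon osd.2 dump_osd_network 0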
+
+
+Muting Health Checks
+--------------------
+
+Health checks can be muted so that they have no effect on the overall
+reported status of the cluster. For example, if the cluster has raised a
+single health check and then you mute that health check, then the cluster will report a status of ``HEALTH_OK``.
+To mute a specific health check, use the health check code that corresponds to that health check (see :ref:`health-checks`), and
+run the following command:
+
+.. prompt:: bash $
+
+ ceph health mute <code>
+
+For example, to mute an ``OSD_DOWN`` health check, run the following command:
+
+.. prompt:: bash $
+
+ ceph health mute OSD_DOWN
+
+Mutes are reported as part of the short and long form of the ``ceph health`` command's output.
+For example, in the above scenario, the cluster would report:
+
+.. prompt:: bash $
+
+ ceph health
+
+::
+
+ HEALTH_OK (muted: OSD_DOWN)
+
+.. prompt:: bash $
+
+ ceph health detail
+
+::
+
+ HEALTH_OK (muted: OSD_DOWN)
+ (MUTED) OSD_DOWN 1 osds down
+ osd.1 is down
+
+A mute can be removed by running the following command:
+
+.. prompt:: bash $
+
+ ceph health unmute <code>
+
+For example:
+
+.. prompt:: bash $
+
+ ceph health unmute OSD_DOWN
+
+A "health mute" can have a TTL (**T**\ime **T**\o **L**\ive)
+associated with it: this means that the mute will automatically expire
+after a specified period of time. The TTL is specified as an optional
+duration argument, as seen in the following examples:
+
+.. prompt:: bash $
+
+ ceph health mute OSD_DOWN 4h # mute for 4 hours
+ ceph health mute MON_DOWN 15m # mute for 15 minutes
+
+Normally, if a muted health check is resolved (for example, if the OSD that raised the ``OSD_DOWN`` health check
+in the example above has come back up), the mute goes away. If the health check comes
+back later, it will be reported in the usual way.
+
+It is possible to make a health mute "sticky": this means that the mute will remain even if the
+health check clears. For example, to make a health mute "sticky", you might run the following command:
+
+.. prompt:: bash $
+
+ ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour
+
+Most health mutes disappear if the unhealthy condition that triggered the health check gets worse.
+For example, suppose that there is one OSD down and the health check is muted. In that case, if
+one or more additional OSDs go down, then the health mute disappears. This behavior occurs in any health check with a threshold value.
+
+
+Checking a Cluster's Usage Stats
+================================
+
+To check a cluster's data usage and data distribution among pools, use the
+``df`` command. This option is similar to Linux's ``df`` command. Run the
+following command:
+
+.. prompt:: bash $
+
+ ceph df
+
+The output of ``ceph df`` resembles the following::
+
+ CLASS SIZE AVAIL USED RAW USED %RAW USED
+ ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00
+ TOTAL 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00
+
+ --- POOLS ---
+ POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
+ device_health_metrics 1 1 242 KiB 15 KiB 227 KiB 4 251 KiB 24 KiB 227 KiB 0 297 GiB N/A N/A 4 0 B 0 B
+ cephfs.a.meta 2 32 6.8 KiB 6.8 KiB 0 B 22 96 KiB 96 KiB 0 B 0 297 GiB N/A N/A 22 0 B 0 B
+ cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B
+ test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B
+
+- **CLASS:** For example, "ssd" or "hdd".
+- **SIZE:** The amount of storage capacity managed by the cluster.
+- **AVAIL:** The amount of free space available in the cluster.
+- **USED:** The amount of raw storage consumed by user data (excluding
+ BlueStore's database).
+- **RAW USED:** The amount of raw storage consumed by user data, internal
+ overhead, and reserved capacity.
+- **%RAW USED:** The percentage of raw storage used. Watch this number in
+ conjunction with ``full ratio`` and ``near full ratio`` to be forewarned when
+ your cluster approaches the fullness thresholds. See `Storage Capacity`_.
+
+
+**POOLS:**
+
+The POOLS section of the output provides a list of pools and the *notional*
+usage of each pool. This section of the output **DOES NOT** reflect replicas,
+clones, or snapshots. For example, if you store an object with 1MB of data,
+then the notional usage will be 1MB, but the actual usage might be 2MB or more
+depending on the number of replicas, clones, and snapshots.
+
+- **ID:** The unique identifier (number) of the pool.
+- **STORED:** The actual amount of data that the user has stored in a pool.
+ This is similar to the USED column in earlier versions of Ceph, but the
+ calculations (for BlueStore!) are more precise (in that gaps are properly
+ handled).
+
+ - **(DATA):** Usage for RBD (RADOS Block Device), CephFS file data, and RGW
+ (RADOS Gateway) object data.
+ - **(OMAP):** Key-value pairs. Used primarily by CephFS and RGW (RADOS
+ Gateway) for metadata storage.
+
+- **OBJECTS:** The notional number of objects stored per pool (that is, the
+ number of objects other than replicas, clones, or snapshots).
+- **USED:** The space allocated for a pool over all OSDs. This includes space
+ for replication, space for allocation granularity, and space for the overhead
+ associated with erasure-coding. Compression savings and object-content gaps
+ are also taken into account. However, BlueStore's database is not included in
+ the amount reported under USED.
+
+ - **(DATA):** Object usage for RBD (RADOS Block Device), CephFS file data,
+ and RGW (RADOS Gateway) object data.
+ - **(OMAP):** Object key-value pairs. Used primarily by CephFS and RGW (RADOS
+ Gateway) for metadata storage.
+
+- **%USED:** The notional percentage of storage used per pool.
+- **MAX AVAIL:** An estimate of the notional amount of data that can be written
+ to this pool.
+- **QUOTA OBJECTS:** The number of quota objects.
+- **QUOTA BYTES:** The number of bytes in the quota objects.
+- **DIRTY:** The number of objects in the cache pool that have been written to
+ the cache pool but have not yet been flushed to the base pool. This field is
+ available only when cache tiering is in use.
+- **USED COMPR:** The amount of space allocated for compressed data. This
+ includes compressed data in addition to all of the space required for
+  replication, allocation granularity, and erasure-coding overhead.
+- **UNDER COMPR:** The amount of data that has passed through compression
+ (summed over all replicas) and that is worth storing in a compressed form.
+
+
+.. note:: The numbers in the POOLS section are notional. They do not include
+ the number of replicas, clones, or snapshots. As a result, the sum of the
+ USED and %USED amounts in the POOLS section of the output will not be equal
+ to the sum of the USED and %USED amounts in the RAW section of the output.
+
+.. note:: The MAX AVAIL value is a complicated function of the replication or
+ the kind of erasure coding used, the CRUSH rule that maps storage to
+ devices, the utilization of those devices, and the configured
+ ``mon_osd_full_ratio`` setting.
+
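+To print the most verbose per-pool statistics that are available, you can run
+the ``detail`` variant of the same command:
+
+.. prompt:: bash $
+
+   ceph df detail
+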
+
+Checking OSD Status
+===================
+
+To check if OSDs are ``up`` and ``in``, run the
+following command:
+
+.. prompt:: bash #
+
+ ceph osd stat
+
+Alternatively, you can run the following command:
+
+.. prompt:: bash #
+
+ ceph osd dump
+
+To view OSDs according to their position in the CRUSH map, run the following
+command:
+
+.. prompt:: bash #
+
+ ceph osd tree
+
+The output of ``ceph osd tree`` displays each host, its OSDs, whether the OSDs
+are ``up``, and their weights. For example:
+
+.. code-block:: bash
+
+ #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
+ -1 3.00000 pool default
+ -3 3.00000 rack mainrack
+ -2 3.00000 host osd-host
+ 0 ssd 1.00000 osd.0 up 1.00000 1.00000
+ 1 ssd 1.00000 osd.1 up 1.00000 1.00000
+ 2 ssd 1.00000 osd.2 up 1.00000 1.00000
+
+See `Monitoring OSDs and Placement Groups`_.
+
+Checking Monitor Status
+=======================
+
+If your cluster has multiple monitors, then you need to perform certain
+"monitor status" checks. After starting the cluster and before reading or
+writing data, you should check quorum status. A quorum must be present when
+multiple monitors are running to ensure proper functioning of your Ceph
+cluster. Check monitor status regularly in order to ensure that all of the
+monitors are running.
+
+To display the monitor map, run the following command:
+
+.. prompt:: bash $
+
+ ceph mon stat
+
+Alternatively, you can run the following command:
+
+.. prompt:: bash $
+
+ ceph mon dump
+
+To check the quorum status for the monitor cluster, run the following command:
+
+.. prompt:: bash $
+
+ ceph quorum_status
+
+Ceph returns the quorum status. For example, a Ceph cluster that consists of
+three monitors might return the following:
+
+.. code-block:: javascript
+
+ { "election_epoch": 10,
+ "quorum": [
+ 0,
+ 1,
+ 2],
+ "quorum_names": [
+ "a",
+ "b",
+ "c"],
+ "quorum_leader_name": "a",
+ "monmap": { "epoch": 1,
+ "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
+ "modified": "2011-12-12 13:28:27.505520",
+ "created": "2011-12-12 13:28:27.505520",
+ "features": {"persistent": [
+ "kraken",
+ "luminous",
+ "mimic"],
+ "optional": []
+ },
+ "mons": [
+ { "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:6789/0",
+ "public_addr": "127.0.0.1:6789/0"},
+ { "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:6790/0",
+ "public_addr": "127.0.0.1:6790/0"},
+ { "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:6791/0",
+ "public_addr": "127.0.0.1:6791/0"}
+ ]
+ }
+ }
+
+Checking MDS Status
+===================
+
+Metadata servers provide metadata services for CephFS. Metadata servers have
+two sets of states: ``up | down`` and ``active | inactive``. To check if your
+metadata servers are ``up`` and ``active``, run the following command:
+
+.. prompt:: bash $
+
+ ceph mds stat
+
+To display details of the metadata servers, run the following command:
+
+.. prompt:: bash $
+
+ ceph fs dump
+
+
+Checking Placement Group States
+===============================
+
+Placement groups (PGs) map objects to OSDs. PGs are monitored in order to
+ensure that they are ``active`` and ``clean``. See `Monitoring OSDs and
+Placement Groups`_.
+
+.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg
+
+.. _rados-monitoring-using-admin-socket:
+
+Using the Admin Socket
+======================
+
+The Ceph admin socket allows you to query a daemon via a socket interface. By
+default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon via
+the admin socket, log in to the host that is running the daemon and run one of
+the two following commands:
+
+.. prompt:: bash $
+
+ ceph daemon {daemon-name}
+ ceph daemon {path-to-socket-file}
+
+For example, the following commands are equivalent to each other:
+
+.. prompt:: bash $
+
+ ceph daemon osd.0 foo
+ ceph daemon /var/run/ceph/ceph-osd.0.asok foo
+
+To view the available admin-socket commands, run the following command:
+
+.. prompt:: bash $
+
+ ceph daemon {daemon-name} help
+
+Admin-socket commands enable you to view and set your configuration at runtime.
+For more on viewing your configuration, see `Viewing a Configuration at
+Runtime`_. There are two methods of setting a configuration value at runtime: (1)
+using the admin socket, which bypasses the monitor and requires a direct login
+to the host in question, and (2) using the ``ceph tell {daemon-type}.{id}
+config set`` command, which relies on the monitor and does not require a direct
+login.
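+
+A minimal sketch of both methods, using an illustrative OSD id and a debug
+option as the example setting (adjust the daemon and option to your needs):
+
+.. prompt:: bash $
+
+   ceph daemon osd.0 config get debug_osd    # method 1: via the local admin socket
+   ceph tell osd.0 config set debug_osd 5/5  # method 2: via the monitors, from any host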
+
+.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime
+.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
diff --git a/doc/rados/operations/operating.rst b/doc/rados/operations/operating.rst
new file mode 100644
index 000000000..f4a2fd988
--- /dev/null
+++ b/doc/rados/operations/operating.rst
@@ -0,0 +1,174 @@
+=====================
+ Operating a Cluster
+=====================
+
+.. index:: systemd; operating a cluster
+
+
+Running Ceph with systemd
+=========================
+
+In all distributions that support systemd (CentOS 7, Fedora, Debian
+8 "Jessie" and later, and SUSE), systemd files (and NOT legacy SysVinit scripts)
+are used to manage Ceph daemons. Ceph daemons therefore behave like any other daemons
+that can be controlled by the ``systemctl`` command, as in the following examples:
+
+.. prompt:: bash $
+
+ sudo systemctl start ceph.target # start all daemons
+ sudo systemctl status ceph-osd@12 # check status of osd.12
+
+To list all of the Ceph systemd units on a node, run the following command:
+
+.. prompt:: bash $
+
+ sudo systemctl status ceph\*.service ceph\*.target
+
+
+Starting all daemons
+--------------------
+
+To start all of the daemons on a Ceph node (regardless of their type), run the
+following command:
+
+.. prompt:: bash $
+
+ sudo systemctl start ceph.target
+
+
+Stopping all daemons
+--------------------
+
+To stop all of the daemons on a Ceph node (regardless of their type), run the
+following command:
+
+.. prompt:: bash $
+
+ sudo systemctl stop ceph\*.service ceph\*.target
+
+
+Starting all daemons by type
+----------------------------
+
+To start all of the daemons of a particular type on a Ceph node, run one of the
+following commands:
+
+.. prompt:: bash $
+
+ sudo systemctl start ceph-osd.target
+ sudo systemctl start ceph-mon.target
+ sudo systemctl start ceph-mds.target
+
+
+Stopping all daemons by type
+----------------------------
+
+To stop all of the daemons of a particular type on a Ceph node, run one of the
+following commands:
+
+.. prompt:: bash $
+
+ sudo systemctl stop ceph-osd\*.service ceph-osd.target
+ sudo systemctl stop ceph-mon\*.service ceph-mon.target
+ sudo systemctl stop ceph-mds\*.service ceph-mds.target
+
+
+Starting a daemon
+-----------------
+
+To start a specific daemon instance on a Ceph node, run one of the
+following commands:
+
+.. prompt:: bash $
+
+ sudo systemctl start ceph-osd@{id}
+ sudo systemctl start ceph-mon@{hostname}
+ sudo systemctl start ceph-mds@{hostname}
+
+For example:
+
+.. prompt:: bash $
+
+ sudo systemctl start ceph-osd@1
+ sudo systemctl start ceph-mon@ceph-server
+ sudo systemctl start ceph-mds@ceph-server
+
+
+Stopping a daemon
+-----------------
+
+To stop a specific daemon instance on a Ceph node, run one of the
+following commands:
+
+.. prompt:: bash $
+
+ sudo systemctl stop ceph-osd@{id}
+ sudo systemctl stop ceph-mon@{hostname}
+ sudo systemctl stop ceph-mds@{hostname}
+
+For example:
+
+.. prompt:: bash $
+
+ sudo systemctl stop ceph-osd@1
+ sudo systemctl stop ceph-mon@ceph-server
+ sudo systemctl stop ceph-mds@ceph-server
+
+
+.. index:: sysvinit; operating a cluster
+
+Running Ceph with SysVinit
+==========================
+
+Each time you start, restart, or stop Ceph daemons (or your entire cluster),
+you must specify at least one option and one command. You can also specify a
+daemon type or a daemon instance. ::
+
+ {commandline} [options] [commands] [daemons]
+
+The ``ceph`` options include:
+
++-----------------+----------+-------------------------------------------------+
+| Option | Shortcut | Description |
++=================+==========+=================================================+
+| ``--verbose`` | ``-v`` | Use verbose logging. |
++-----------------+----------+-------------------------------------------------+
+| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. |
++-----------------+----------+-------------------------------------------------+
+| ``--allhosts`` | ``-a`` | Execute on all nodes listed in ``ceph.conf``. |
+| | | Otherwise, it only executes on ``localhost``. |
++-----------------+----------+-------------------------------------------------+
+| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. |
++-----------------+----------+-------------------------------------------------+
+| ``--norestart`` | ``N/A`` | Do not restart a daemon if it core dumps. |
++-----------------+----------+-------------------------------------------------+
+| ``--conf`` | ``-c`` | Use an alternate configuration file. |
++-----------------+----------+-------------------------------------------------+
+
+The ``ceph`` commands include:
+
++------------------+------------------------------------------------------------+
+| Command | Description |
++==================+============================================================+
+| ``start`` | Start the daemon(s). |
++------------------+------------------------------------------------------------+
+| ``stop`` | Stop the daemon(s). |
++------------------+------------------------------------------------------------+
+| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9``. |
++------------------+------------------------------------------------------------+
+| ``killall`` | Kill all daemons of a particular type. |
++------------------+------------------------------------------------------------+
+| ``cleanlogs`` | Cleans out the log directory. |
++------------------+------------------------------------------------------------+
+| ``cleanalllogs`` | Cleans out **everything** in the log directory. |
++------------------+------------------------------------------------------------+
+
+The ``[daemons]`` option allows the ``ceph`` service to target specific daemon types
+in order to perform subsystem operations. Daemon types include:
+
+- ``mon``
+- ``osd``
+- ``mds``
+
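+As an illustrative sketch only (the exact path of the legacy init script can
+vary by distribution; ``/etc/init.d/ceph`` is assumed here), starting all OSD
+daemons on every host listed in ``ceph.conf`` might look like this:
+
+.. prompt:: bash $
+
+   sudo /etc/init.d/ceph -a start osd
+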
+.. _Valgrind: http://www.valgrind.org/
+.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html
diff --git a/doc/rados/operations/pg-concepts.rst b/doc/rados/operations/pg-concepts.rst
new file mode 100644
index 000000000..83062b53a
--- /dev/null
+++ b/doc/rados/operations/pg-concepts.rst
@@ -0,0 +1,104 @@
+.. _rados_operations_pg_concepts:
+
+==========================
+ Placement Group Concepts
+==========================
+
+When you execute commands like ``ceph -w``, ``ceph osd dump``, and other
+commands related to placement groups, Ceph may return values using some
+of the following terms:
+
+*Peering*
+ The process of bringing all of the OSDs that store
+ a Placement Group (PG) into agreement about the state
+ of all of the objects (and their metadata) in that PG.
+ Note that agreeing on the state does not mean that
+ they all have the latest contents.
+
+*Acting Set*
+ The ordered list of OSDs who are (or were as of some epoch)
+ responsible for a particular placement group.
+
+*Up Set*
+ The ordered list of OSDs responsible for a particular placement
+ group for a particular epoch according to CRUSH. Normally this
+ is the same as the *Acting Set*, except when the *Acting Set* has
+ been explicitly overridden via ``pg_temp`` in the OSD Map.
+
+*Current Interval* or *Past Interval*
+   A sequence of OSD map epochs during which the *Acting Set* and *Up
+   Set* for a particular placement group do not change.
+
+*Primary*
+   The member (by convention the first) of the *Acting Set* that is
+   responsible for coordinating peering and is the only OSD that will accept
+   client-initiated writes to objects in a placement group.
+
+*Replica*
+ A non-primary OSD in the *Acting Set* for a placement group
+ (and who has been recognized as such and *activated* by the primary).
+
+*Stray*
+ An OSD that is not a member of the current *Acting Set*, but
+ has not yet been told that it can delete its copies of a
+ particular placement group.
+
+*Recovery*
+ Ensuring that copies of all of the objects in a placement group
+ are on all of the OSDs in the *Acting Set*. Once *Peering* has
+ been performed, the *Primary* can start accepting write operations,
+ and *Recovery* can proceed in the background.
+
+*PG Info*
+ Basic metadata about the placement group's creation epoch, the version
+ for the most recent write to the placement group, *last epoch started*,
+ *last epoch clean*, and the beginning of the *current interval*. Any
+ inter-OSD communication about placement groups includes the *PG Info*,
+ such that any OSD that knows a placement group exists (or once existed)
+ also has a lower bound on *last epoch clean* or *last epoch started*.
+
+*PG Log*
+ A list of recent updates made to objects in a placement group.
+ Note that these logs can be truncated after all OSDs
+ in the *Acting Set* have acknowledged up to a certain
+ point.
+
+*Missing Set*
+ Each OSD notes update log entries and if they imply updates to
+ the contents of an object, adds that object to a list of needed
+ updates. This list is called the *Missing Set* for that ``<OSD,PG>``.
+
+*Authoritative History*
+ A complete, and fully ordered set of operations that, if
+ performed, would bring an OSD's copy of a placement group
+ up to date.
+
+*Epoch*
+   A (monotonically increasing) OSD map version number.
+
+*Last Epoch Start*
+ The last epoch at which all nodes in the *Acting Set*
+ for a particular placement group agreed on an
+ *Authoritative History*. At this point, *Peering* is
+ deemed to have been successful.
+
+*up_thru*
+ Before a *Primary* can successfully complete the *Peering* process,
+ it must inform a monitor that is alive through the current
+ OSD map *Epoch* by having the monitor set its *up_thru* in the osd
+ map. This helps *Peering* ignore previous *Acting Sets* for which
+ *Peering* never completed after certain sequences of failures, such as
+ the second interval below:
+
+ - *acting set* = [A,B]
+ - *acting set* = [A]
+ - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
+ - *acting set* = [B] (B restarts, A does not)
+
+*Last Epoch Clean*
+ The last *Epoch* at which all nodes in the *Acting set*
+ for a particular placement group were completely
+ up to date (both placement group logs and object contents).
+ At this point, *recovery* is deemed to have been
+ completed.
diff --git a/doc/rados/operations/pg-repair.rst b/doc/rados/operations/pg-repair.rst
new file mode 100644
index 000000000..609318fca
--- /dev/null
+++ b/doc/rados/operations/pg-repair.rst
@@ -0,0 +1,118 @@
+============================
+Repairing PG Inconsistencies
+============================
+Sometimes a Placement Group (PG) might become ``inconsistent``. To return the PG
+to an ``active+clean`` state, you must first determine which of the PGs has become
+inconsistent and then run the ``pg repair`` command on it. This page contains
+commands for diagnosing PGs and the command for repairing PGs that have become
+inconsistent.
+
+.. highlight:: console
+
+Commands for Diagnosing PG Problems
+===================================
+The commands in this section provide various ways of diagnosing broken PGs.
+
+To see an overview of Ceph cluster health, together with details about any
+failed health checks, run the following command:
+
+.. prompt:: bash #
+
+ ceph health detail
+
+To see more detail on the status of the PGs, run the following command:
+
+.. prompt:: bash #
+
+ ceph pg dump --format=json-pretty
+
+To see a list of inconsistent PGs, run the following command:
+
+.. prompt:: bash #
+
+ rados list-inconsistent-pg {pool}
+
+To see a list of inconsistent RADOS objects, run the following command:
+
+.. prompt:: bash #
+
+ rados list-inconsistent-obj {pgid}
+
+To see a list of inconsistent snapsets in a specific PG, run the following
+command:
+
+.. prompt:: bash #
+
+ rados list-inconsistent-snapset {pgid}
+
+
+Commands for Repairing PGs
+==========================
+The form of the command to repair a broken PG is as follows:
+
+.. prompt:: bash #
+
+ ceph pg repair {pgid}
+
+Here ``{pgid}`` represents the id of the affected PG.
+
+For example:
+
+.. prompt:: bash #
+
+ ceph pg repair 1.4
+
+.. note:: PG IDs have the form ``N.xxxxx``, where ``N`` is the number of the
+   pool that contains the PG. The command ``ceph osd lspools`` and the
+   command ``ceph osd dump | grep pool`` return a list of pool numbers.
+
+More Information on PG Repair
+=============================
+Ceph stores and updates the checksums of objects stored in the cluster. When a
+scrub is performed on a PG, the OSD attempts to choose an authoritative copy
+from among its replicas. Only one of the possible cases is consistent. After
+performing a deep scrub, Ceph calculates the checksum of an object that is read
+from disk and compares it to the checksum that was previously recorded. If the
+current checksum and the previously recorded checksum do not match, that
+mismatch is considered to be an inconsistency. In the case of replicated pools,
+any mismatch between the checksum of any replica of an object and the checksum
+of the authoritative copy means that there is an inconsistency. The discovery
+of these inconsistencies causes a PG's state to be set to ``inconsistent``.
+
+The ``pg repair`` command attempts to fix inconsistencies of various kinds. If
+``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of
+the inconsistent copy with the digest of the authoritative copy. If ``pg
+repair`` finds an inconsistent replicated pool, it marks the inconsistent copy
+as missing. In the case of replicated pools, recovery is beyond the scope of
+``pg repair``.
+
+In the case of erasure-coded and BlueStore pools, Ceph will automatically
+perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to
+``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default
+``5``) errors are found.
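+
+As a hedged example, automatic repair during scrubbing can be enabled, and the
+error threshold inspected, with the centralized configuration commands:
+
+.. prompt:: bash #
+
+   ceph config set osd osd_scrub_auto_repair true
+   ceph config get osd osd_scrub_auto_repair_num_errors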
+
+The ``pg repair`` command will not solve every problem. Ceph does not
+automatically repair PGs when they are found to contain inconsistencies.
+
+The checksum of a RADOS object or an omap is not always available. Checksums
+are calculated incrementally. If a replicated object is updated
+non-sequentially, the write operation involved in the update changes the object
+and invalidates its checksum. The whole object is not read while the checksum
+is recalculated. The ``pg repair`` command is able to make repairs even when
+checksums are not available to it, as in the case of Filestore. Users working
+with replicated Filestore pools might prefer manual repair to ``ceph pg
+repair``.
+
+This material is relevant for Filestore, but not for BlueStore, which has its
+own internal checksums. The matched-record checksum and the calculated checksum
+cannot prove that any specific copy is in fact authoritative. If there is no
+checksum available, ``pg repair`` favors the data on the primary, but this
+might not be the uncorrupted replica. Because of this uncertainty, human
+intervention is necessary when an inconsistency is discovered. This
+intervention sometimes involves use of ``ceph-objectstore-tool``.
+
+External Links
+==============
+https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page
+contains a walkthrough of the repair of a PG. It is recommended reading if you
+want to repair a PG but have never done so.
diff --git a/doc/rados/operations/pg-states.rst b/doc/rados/operations/pg-states.rst
new file mode 100644
index 000000000..495229d92
--- /dev/null
+++ b/doc/rados/operations/pg-states.rst
@@ -0,0 +1,118 @@
+========================
+ Placement Group States
+========================
+
+When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``),
+Ceph will report on the status of the placement groups. A placement group has
+one or more states. The optimum state for placement groups in the placement group
+map is ``active + clean``.
+
+*creating*
+ Ceph is still creating the placement group.
+
+*activating*
+ The placement group is peered but not yet active.
+
+*active*
+ Ceph will process requests to the placement group.
+
+*clean*
+  Ceph has replicated all objects in the placement group the correct number of times.
+
+*down*
+ A replica with necessary data is down, so the placement group is offline.
+
+*laggy*
+ A replica is not acknowledging new leases from the primary in a timely fashion; IO is temporarily paused.
+
+*wait*
+ The set of OSDs for this PG has just changed and IO is temporarily paused until the previous interval's leases expire.
+
+*scrubbing*
+ Ceph is checking the placement group metadata for inconsistencies.
+
+*deep*
+ Ceph is checking the placement group data against stored checksums.
+
+*degraded*
+ Ceph has not replicated some objects in the placement group the correct number of times yet.
+
+*inconsistent*
+  Ceph detects inconsistencies in one or more replicas of an object in the placement group
+ (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.).
+
+*peering*
+  The placement group is undergoing the peering process.
+
+*repair*
+ Ceph is checking the placement group and repairing any inconsistencies it finds (if possible).
+
+*recovering*
+ Ceph is migrating/synchronizing objects and their replicas.
+
+*forced_recovery*
+  A high recovery priority for this PG has been enforced by the user.
+
+*recovery_wait*
+  The placement group is waiting in line to start recovery.
+
+*recovery_toofull*
+ A recovery operation is waiting because the destination OSD is over its
+ full ratio.
+
+*recovery_unfound*
+ Recovery stopped due to unfound objects.
+
+*backfilling*
+ Ceph is scanning and synchronizing the entire contents of a placement group
+ instead of inferring what contents need to be synchronized from the logs of
+ recent operations. Backfill is a special case of recovery.
+
+*forced_backfill*
+  A high backfill priority for this PG has been enforced by the user.
+
+*backfill_wait*
+ The placement group is waiting in line to start backfill.
+
+*backfill_toofull*
+ A backfill operation is waiting because the destination OSD is over
+ the backfillfull ratio.
+
+*backfill_unfound*
+ Backfill stopped due to unfound objects.
+
+*incomplete*
+ Ceph detects that a placement group is missing information about
+ writes that may have occurred, or does not have any healthy
+ copies. If you see this state, try to start any failed OSDs that may
+  contain the needed information. In the case of an erasure-coded pool,
+  temporarily reducing ``min_size`` may allow recovery.
+
+*stale*
+ The placement group is in an unknown state - the monitors have not received
+ an update for it since the placement group mapping changed.
+
+*remapped*
+ The placement group is temporarily mapped to a different set of OSDs from what
+ CRUSH specified.
+
+*undersized*
+ The placement group has fewer copies than the configured pool replication level.
+
+*peered*
+ The placement group has peered, but cannot serve client IO due to not having
+ enough copies to reach the pool's configured min_size parameter. Recovery
+ may occur in this state, so the pg may heal up to min_size eventually.
+
+*snaptrim*
+ Trimming snaps.
+
+*snaptrim_wait*
+ Queued to trim snaps.
+
+*snaptrim_error*
+  Snapshot trimming stopped because of an error.
+
+*unknown*
+ The ceph-mgr hasn't yet received any information about the PG's state from an
+ OSD since mgr started up.
diff --git a/doc/rados/operations/placement-groups.rst b/doc/rados/operations/placement-groups.rst
new file mode 100644
index 000000000..dda4a0177
--- /dev/null
+++ b/doc/rados/operations/placement-groups.rst
@@ -0,0 +1,897 @@
+.. _placement groups:
+
+==================
+ Placement Groups
+==================
+
+.. _pg-autoscaler:
+
+Autoscaling placement groups
+============================
+
+Placement groups (PGs) are an internal implementation detail of how Ceph
+distributes data. Autoscaling provides a way to manage PGs, and especially to
+manage the number of PGs present in different pools. When *pg-autoscaling* is
+enabled, the cluster is allowed to make recommendations or automatic
+adjustments with respect to the number of PGs for each pool (``pg_num``) in
+accordance with expected cluster utilization and expected pool utilization.
+
+Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``,
+``on``, or ``warn``:
+
+* ``off``: Disable autoscaling for this pool. It is up to the administrator to
+  choose an appropriate ``pg_num`` for each pool. For more information, see
+ :ref:`choosing-number-of-placement-groups`.
+* ``on``: Enable automated adjustments of the PG count for the given pool.
+* ``warn``: Raise health checks when the PG count is in need of adjustment.
+
+To set the autoscaling mode for an existing pool, run a command of the
+following form:
+
+.. prompt:: bash #
+
+ ceph osd pool set <pool-name> pg_autoscale_mode <mode>
+
+For example, to enable autoscaling on pool ``foo``, run the following command:
+
+.. prompt:: bash #
+
+ ceph osd pool set foo pg_autoscale_mode on
+
+There is also a default ``pg_autoscale_mode`` setting that applies to any
+pools that are created after the initial setup of the cluster. To change this
+default, run a command of the following form:
+
+.. prompt:: bash #
+
+ ceph config set global osd_pool_default_pg_autoscale_mode <mode>
+
+You can disable or enable the autoscaler for all pools with the ``noautoscale``
+flag. By default, this flag is set to ``off``, but you can set it to ``on`` by
+running the following command:
+
+.. prompt:: bash #
+
+ ceph osd pool set noautoscale
+
+To set the ``noautoscale`` flag to ``off``, run the following command:
+
+.. prompt:: bash #
+
+ ceph osd pool unset noautoscale
+
+To get the value of the flag, run the following command:
+
+.. prompt:: bash #
+
+ ceph osd pool get noautoscale
+
+Viewing PG scaling recommendations
+----------------------------------
+
+To view each pool, its relative utilization, and any recommended changes to the
+PG count, run the following command:
+
+.. prompt:: bash #
+
+ ceph osd pool autoscale-status
+
+The output will resemble the following::
+
+ POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
+ a 12900M 3.0 82431M 0.4695 8 128 warn True
+ c 0 3.0 82431M 0.0000 0.2000 0.9884 1.0 1 64 warn True
+ b 0 953.6M 3.0 82431M 0.0347 8 warn False
+
+- **POOL** is the name of the pool.
+
+- **SIZE** is the amount of data stored in the pool.
+
+- **TARGET SIZE** (if present) is the amount of data that is expected to be
+ stored in the pool, as specified by the administrator. The system uses the
+ greater of the two values for its calculation.
+
+- **RATE** is the multiplier for the pool that determines how much raw storage
+ capacity is consumed. For example, a three-replica pool will have a ratio of
+ 3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.
+
+- **RAW CAPACITY** is the total amount of raw storage capacity on the specific
+ OSDs that are responsible for storing the data of the pool (and perhaps the
+ data of other pools).
+
+- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
+  total raw storage capacity. In other words, RATIO is defined as
+ (SIZE * RATE) / RAW CAPACITY.
+
+- **TARGET RATIO** (if present) is the ratio of the expected storage of this
+ pool (that is, the amount of storage that this pool is expected to consume,
+ as specified by the administrator) to the expected storage of all other pools
+ that have target ratios set. If both ``target_size_bytes`` and
+ ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
+ precedence.
+
+- **EFFECTIVE RATIO** is the result of making two adjustments to the target
+ ratio:
+
+ #. Subtracting any capacity expected to be used by pools that have target
+ size set.
+
+ #. Normalizing the target ratios among pools that have target ratio set so
+ that collectively they target cluster capacity. For example, four pools
+ with target_ratio 1.0 would have an effective ratio of 0.25.
+
+ The system's calculations use whichever of these two ratios (that is, the
+ target ratio and the effective ratio) is greater.
+
+- **BIAS** is used as a multiplier to manually adjust a pool's PG count in accordance
+ with prior information about how many PGs a specific pool is expected to
+ have.
+
+- **PG_NUM** is either the current number of PGs associated with the pool or,
+ if a ``pg_num`` change is in progress, the current number of PGs that the
+ pool is working towards.
+
+- **NEW PG_NUM** (if present) is the value that the system is recommending the
+ ``pg_num`` of the pool to be changed to. It is always a power of 2, and it is
+ present only if the recommended value varies from the current value by more
+ than the default factor of ``3``. To adjust this factor (in the following
+ example, it is changed to ``2``), run the following command:
+
+ .. prompt:: bash #
+
+     ceph config set mgr mgr/pg_autoscaler/threshold 2.0
+
+- **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``,
+ ``off``, or ``warn``.
+
+- **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
+ or ``False``. A ``bulk`` pool is expected to be large and should initially
+  have a large number of PGs so that performance does not suffer. On the other
+ hand, a pool that is not ``bulk`` is expected to be small (for example, a
+ ``.mgr`` pool or a meta pool).
+
+.. note::
+
+ If the ``ceph osd pool autoscale-status`` command returns no output at all,
+ there is probably at least one pool that spans multiple CRUSH roots. This
+ 'spanning pool' issue can happen in scenarios like the following:
+ when a new deployment auto-creates the ``.mgr`` pool on the ``default``
+ CRUSH root, subsequent pools are created with rules that constrain them to a
+ specific shadow CRUSH tree. For example, if you create an RBD metadata pool
+ that is constrained to ``deviceclass = ssd`` and an RBD data pool that is
+ constrained to ``deviceclass = hdd``, you will encounter this issue. To
+ remedy this issue, constrain the spanning pool to only one device class. In
+ the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in
+ effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by
+   running the following command:
+
+   .. prompt:: bash #
+
+      ceph osd pool set .mgr crush_rule replicated-ssd
+
+ This intervention will result in a small amount of backfill, but
+ typically this traffic completes quickly.
+
+
+Automated scaling
+-----------------
+
+In the simplest approach to automated scaling, the cluster is allowed to
+automatically scale ``pg_num`` in accordance with usage. Ceph considers the
+total available storage and the target number of PGs for the whole system,
+considers how much data is stored in each pool, and apportions PGs accordingly.
+The system is conservative with its approach, making changes to a pool only
+when the current number of PGs (``pg_num``) varies by more than a factor of 3
+from the recommended number.
+
+The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd``
+parameter (default: 100), which can be adjusted by running the following
+command:
+
+.. prompt:: bash #
+
+ ceph config set global mon_target_pg_per_osd 100
+
+The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
+pool might map to a different CRUSH rule, and each rule might distribute data
+across different devices, Ceph will consider the utilization of each subtree of
+the hierarchy independently. For example, a pool that maps to OSDs of class
+``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
+counts that are determined by how many of these two different device types
+there are.
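+
+For example, the following commands (a sketch; ``{pool-name}`` is a
+placeholder) show which device-class subtrees exist and which CRUSH rule a
+given pool uses, which together determine the subtree that the autoscaler
+considers for that pool:
+
+.. prompt:: bash #
+
+   ceph osd crush tree --show-shadow
+   ceph osd pool get {pool-name} crush_rule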
+
+If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
+with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the
+user in the manager log. The warning states the name of the pool and the set of
+roots that overlap each other. The autoscaler does not scale any pools with
+overlapping roots because this condition can cause problems with the scaling
+process. We recommend constraining each pool so that it belongs to only one
+root (that is, one OSD class) to silence the warning and ensure a successful
+scaling process.
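+
+As a hedged sketch, a pool that currently spans both ``ssd`` and ``hdd``
+devices could be constrained to a single device class by creating a
+class-specific replicated rule and assigning it to the pool (the rule name and
+failure domain below are examples only). Note that changing a pool's CRUSH
+rule will trigger data movement:
+
+.. prompt:: bash #
+
+   ceph osd crush rule create-replicated replicated-ssd default host ssd
+   ceph osd pool set {pool-name} crush_rule replicated-ssd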
+
+.. _managing_bulk_flagged_pools:
+
+Managing pools that are flagged with ``bulk``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
+complement of PGs and then scales down the number of PGs only if the usage
+ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
+then the autoscaler starts the pool with minimal PGs and creates additional PGs
+only if there is more usage in the pool.
+
+To create a pool that will be flagged ``bulk``, run the following command:
+
+.. prompt:: bash #
+
+ ceph osd pool create <pool-name> --bulk
+
+To set or unset the ``bulk`` flag of an existing pool, run the following
+command:
+
+.. prompt:: bash #
+
+ ceph osd pool set <pool-name> bulk <true/false/1/0>
+
+To get the ``bulk`` flag of an existing pool, run the following command:
+
+.. prompt:: bash #
+
+ ceph osd pool get <pool-name> bulk
+
+.. _specifying_pool_target_size:
+
+Specifying expected pool size
+-----------------------------
+
+When a cluster or pool is first created, it consumes only a small fraction of
+the total cluster capacity and appears to the system as if it should need only
+a small number of PGs. However, in some cases, cluster administrators know
+which pools are likely to consume most of the system capacity in the long run.
+When Ceph is provided with this information, a more appropriate number of PGs
+can be used from the beginning, obviating subsequent changes in ``pg_num`` and
+the associated overhead cost of relocating data.
+
+The *target size* of a pool can be specified in two ways: either in relation to
+the absolute size (in bytes) of the pool, or as a weight relative to all other
+pools that have ``target_size_ratio`` set.
+
+For example, to tell the system that ``mypool`` is expected to consume 100 TB,
+run the following command:
+
+.. prompt:: bash #
+
+ ceph osd pool set mypool target_size_bytes 100T
+
+Alternatively, to tell the system that ``mypool`` is expected to consume a
+ratio of 1.0 relative to other pools that have ``target_size_ratio`` set,
+adjust the ``target_size_ratio`` setting of ``mypool`` by running the
+following command:
+
+.. prompt:: bash #
+
+ ceph osd pool set mypool target_size_ratio 1.0
+
+If ``mypool`` is the only pool in the cluster, then it is expected to use 100% of
+the total cluster capacity. However, if the cluster contains a second pool that
+has ``target_size_ratio`` set to 1.0, then both pools are expected to use 50%
+of the total cluster capacity.
+
+The ``ceph osd pool create`` command has two command-line options that can be
+used to set the target size of a pool at creation time: ``--target-size-bytes
+<bytes>`` and ``--target-size-ratio <ratio>``.
+
+Note that if the target-size values that have been specified are impossible
+(for example, a capacity larger than the total cluster), then a health check
+(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
+
+If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a
+pool, then ``target_size_bytes`` will be ignored, ``target_size_ratio`` will
+be used in system calculations, and a health check
+(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``)
+will be raised.
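+
+In that case, clearing one of the two settings resolves the health check; as a
+sketch, setting the unwanted value to ``0`` removes it:
+
+.. prompt:: bash #
+
+   ceph osd pool set mypool target_size_bytes 0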
+
+Specifying bounds on a pool's PGs
+---------------------------------
+
+It is possible to specify both the minimum number and the maximum number of PGs
+for a pool.
+
+Setting a Minimum Number of PGs and a Maximum Number of PGs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a minimum is set, then Ceph will not itself reduce (nor recommend that you
+reduce) the number of PGs to a value below the configured value. Setting a
+minimum serves to establish a lower bound on the amount of parallelism enjoyed
+by a client during I/O, even if a pool is mostly empty.
+
+If a maximum is set, then Ceph will not itself increase (or recommend that you
+increase) the number of PGs to a value above the configured value.
+
+To set the minimum number of PGs for a pool, run a command of the following
+form:
+
+.. prompt:: bash #
+
+ ceph osd pool set <pool-name> pg_num_min <num>
+
+To set the maximum number of PGs for a pool, run a command of the following
+form:
+
+.. prompt:: bash #
+
+ ceph osd pool set <pool-name> pg_num_max <num>
+
+In addition, the ``ceph osd pool create`` command has two command-line options
+that can be used to specify the minimum or maximum PG count of a pool at
+creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``.
+
+.. _preselection:
+
+Preselecting pg_num
+===================
+
+When creating a pool with the following command, you have the option to
+preselect the value of the ``pg_num`` parameter:
+
+.. prompt:: bash #
+
+ ceph osd pool create {pool-name} [pg_num]
+
+If you opt not to specify ``pg_num`` in this command, the cluster uses the PG
+autoscaler to automatically configure the parameter in accordance with the
+amount of data that is stored in the pool (see :ref:`pg-autoscaler` above).
+
+However, your decision of whether or not to specify ``pg_num`` at creation time
+has no effect on whether the parameter will be automatically tuned by the
+cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by
+running a command of the following form:
+
+.. prompt:: bash #
+
+ ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)
+
+Without the balancer, the suggested target is approximately 100 PG replicas on
+each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
+reasonable.
+
+The autoscaler attempts to satisfy the following conditions:
+
+- the number of PGs per OSD should be proportional to the amount of data in the
+ pool
+- there should be 50-100 PGs per OSD, taking into account the replication
+ overhead or erasure-coding fan-out of each PG's replicas across OSDs
+
+Use of Placement Groups
+=======================
+
+A placement group aggregates objects within a pool. The tracking of RADOS
+object placement and object metadata on a per-object basis is computationally
+expensive. It would be infeasible for a system with millions of RADOS
+objects to efficiently track placement on a per-object basis.
+
+.. ditaa::
+ /-----\ /-----\ /-----\ /-----\ /-----\
+ | obj | | obj | | obj | | obj | | obj |
+ \-----/ \-----/ \-----/ \-----/ \-----/
+ | | | | |
+ +--------+--------+ +---+----+
+ | |
+ v v
+ +-----------------------+ +-----------------------+
+ | Placement Group #1 | | Placement Group #2 |
+ | | | |
+ +-----------------------+ +-----------------------+
+ | |
+ +------------------------------+
+ |
+ v
+ +-----------------------+
+ | Pool |
+ | |
+ +-----------------------+
+
+The Ceph client calculates which PG a RADOS object should be in. As part of
+this calculation, the client hashes the object ID and performs an operation
+involving both the number of PGs in the specified pool and the pool ID. For
+details, see `Mapping PGs to OSDs`_.
+
+The contents of a RADOS object belonging to a PG are stored in a set of OSDs.
+For example, in a replicated pool of size two, each PG will store objects on
+two OSDs, as shown below:
+
+.. ditaa::
+ +-----------------------+ +-----------------------+
+ | Placement Group #1 | | Placement Group #2 |
+ | | | |
+ +-----------------------+ +-----------------------+
+ | | | |
+ v v v v
+ /----------\ /----------\ /----------\ /----------\
+ | | | | | | | |
+ | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 |
+ | | | | | | | |
+ \----------/ \----------/ \----------/ \----------/
+
+
+If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then
+filled with copies of all objects in OSD #1. If the pool size is changed from
+two to three, an additional OSD will be assigned to the PG and will receive
+copies of all objects in the PG.
+
+An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is
+shared with other PGs either from the same pool or from other pools. In our
+example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD
+#2 fails, then Placement Group #2 must restore copies of objects (by making use
+of OSD #3).
+
+When the number of PGs increases, several consequences ensue. The new PGs are
+assigned OSDs. The result of the CRUSH function changes, which means that some
+objects from the already-existing PGs are copied to the new PGs and removed
+from the old ones.
+
+Factors Relevant To Specifying pg_num
+=====================================
+
+On the one hand, the criteria of data durability and even distribution across
+OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
+saving CPU resources and minimizing memory usage weigh in favor of a low number
+of PGs.
+
+.. _data durability:
+
+Data durability
+---------------
+
+When an OSD fails, the risk of data loss is increased until replication of the
+data it hosted is restored to the configured level. To illustrate this point,
+let's imagine a scenario that results in permanent data loss in a single PG:
+
+#. The OSD fails and all copies of the object that it contains are lost. For
+ each object within the PG, the number of its replicas suddenly drops from
+ three to two.
+
+#. Ceph starts recovery for this PG by choosing a new OSD on which to re-create
+ the third copy of each object.
+
+#. Another OSD within the same PG fails before the new OSD is fully populated
+ with the third copy. Some objects will then only have one surviving copy.
+
+#. Ceph selects yet another OSD and continues copying objects in order to
+ restore the desired number of copies.
+
+#. A third OSD within the same PG fails before recovery is complete. If this
+ OSD happened to contain the only remaining copy of an object, the object is
+ permanently lost.
+
+In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
+will give each PG three OSDs. Ultimately, each OSD hosts
+:math:`\frac{512 \times 3}{10} \approx 150` PGs. So when the first OSD fails
+in the above scenario, recovery will begin for all 150 PGs at the same time.
+
+The 150 PGs that are being recovered are likely to be homogeneously distributed
+across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
+copies of objects to all other OSDs and also likely to receive some new objects
+to be stored because it has become part of a new PG.
+
+The amount of time it takes for this recovery to complete depends on the
+architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by
+a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s
+switch, and the recovery of a single OSD completes within a certain number of
+minutes. (2) There are two OSDs per machine using HDDs with no SSD WAL+DB and
+a 1 Gb/s switch. In the second setup, recovery will be at least one order of
+magnitude slower.
+
+In such a cluster, the number of PGs has almost no effect on data durability.
+Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no
+slower or faster.
+
+However, an increase in the number of OSDs can increase the speed of recovery.
+Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs. Each OSD now
+participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
+still be required to replicate the same number of objects in order to recover.
+But instead of there being only 10 OSDs that have to copy ~100 GB each, there
+are now 20 OSDs that have to copy only 50 GB each. If the network had
+previously been a bottleneck, recovery now happens twice as fast.
+
+Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
+~38 PGs. And if an OSD dies, recovery will take place faster than before unless
+it is blocked by another bottleneck. Now, however, suppose that our cluster
+grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery
+will happen across at most :math:`\approx 21 = (7 \times 3)` OSDs
+associated with these PGs. This means that recovery will take longer than when
+there were only 40 OSDs. For this reason, the number of PGs should be
+increased.
+
+No matter how brief the recovery time is, there is always a chance that an
+additional OSD will fail while recovery is in progress. Consider the cluster
+with 10 OSDs described above: if any of the OSDs fail, then :math:`\approx 17`
+(approximately 150 divided by 9) PGs will have only one remaining copy. And if
+any of the 8 remaining OSDs fail, then 2 (approximately 17 divided by 8) PGs
+are likely to lose their remaining objects. This is one reason why setting
+``size=2`` is risky.
+
+When the number of OSDs in the cluster increases to 20, the number of PGs that
+would be damaged by the loss of three OSDs significantly decreases. The loss of
+a second OSD degrades only approximately :math:`4` (that is, :math:`\frac{75}{19}`)
+PGs rather than :math:`\approx 17` PGs, and the loss of a third OSD results in
+data loss only if it is one of the 4 OSDs that contains the remaining copy.
+This means -- assuming that the probability of losing one OSD during recovery
+is 0.0001% -- that the probability of data loss when three OSDs are lost is
+:math:`\approx 17 \times 10 \times 0.0001\%` in the cluster with 10 OSDs, and
+only :math:`\approx 4 \times 20 \times 0.0001\%` in the cluster with 20 OSDs.
+
+In summary, the greater the number of OSDs, the faster the recovery and the
+lower the risk of permanently losing a PG due to cascading failures. As far as
+data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't
+much matter whether there are 512 or 4096 PGs.
+
+.. note:: It can take a long time for an OSD that has been recently added to
+ the cluster to be populated with the PGs assigned to it. However, no object
+ degradation or impact on data durability will result from the slowness of
+ this process since Ceph populates data into the new PGs before removing it
+ from the old PGs.
+
+.. _object distribution:
+
+Object distribution within a pool
+---------------------------------
+
+Under ideal conditions, objects are evenly distributed across PGs. Because
+CRUSH computes the PG for each object but does not know how much data is stored
+in each OSD associated with the PG, the ratio between the number of PGs and the
+number of OSDs can have a significant influence on data distribution.
+
+For example, suppose that there is only a single PG for ten OSDs in a
+three-replica pool. In that case, only three OSDs would be used because CRUSH
+would have no other option. However, if more PGs are available, RADOS objects are
+more likely to be evenly distributed across OSDs. CRUSH makes every effort to
+distribute OSDs evenly across all existing PGs.
+
+As long as there are one or two orders of magnitude more PGs than OSDs, the
+distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for
+10 OSDs, or 1024 PGs for 10 OSDs.
+
+However, uneven data distribution can emerge due to factors other than the
+ratio of PGs to OSDs. For example, since CRUSH does not take into account the
+size of the RADOS objects, the presence of a few very large RADOS objects can
+create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB
+are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will
+consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then
+added to the pool, the three OSDs supporting the PG in which the RADOS object
+has been placed will each be filled with 400 MB + 400 MB = 800 MB but the seven
+other OSDs will still contain only 400 MB.
+
+.. _resource usage:
+
+Memory, CPU and network usage
+-----------------------------
+
+Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
+MONs. These needs must be met at all times and are increased during recovery.
+Indeed, one of the main reasons PGs were developed was to share this overhead
+by clustering objects together.
+
+For this reason, minimizing the number of PGs saves significant resources.
+
+.. _choosing-number-of-placement-groups:
+
+Choosing the Number of PGs
+==========================
+
+.. note:: It is rarely necessary to do the math in this section by hand.
+ Instead, use the ``ceph osd pool autoscale-status`` command in combination
+ with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
+ more information, see :ref:`pg-autoscaler`.
+
+If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in
+order to balance resource usage, data durability, and data distribution. If you
+have fewer than 50 OSDs, follow the guidance in the `preselection`_ section.
+For a single pool, use the following formula to get a baseline value:
+
+ Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}`
+
+Here **pool size** is either the number of replicas for replicated pools or the
+K+M sum for erasure-coded pools. To retrieve this sum, run a command of the
+form ``ceph osd erasure-code-profile get {profile-name}``.
+
+Next, check whether the resulting baseline value is consistent with the way you
+designed your Ceph cluster to maximize `data durability`_ and `object
+distribution`_ and to minimize `resource usage`_.
+
+This value should be **rounded up to the nearest power of two**.
+
+Each pool's ``pg_num`` should be a power of two. Other values are likely to
+result in uneven distribution of data across OSDs. It is best to increase
+``pg_num`` for a pool only when it is feasible and desirable to set the next
+highest power of two. Note that this power of two rule is per-pool; it is
+neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power
+of two.
+
+For example, if you have a cluster with 200 OSDs and a single pool with a size
+of 3 replicas, estimate the number of PGs as follows:
+
+ :math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of 2: 8192.
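+
+As a parallel illustration for an erasure-coded pool (assuming a hypothetical
+profile with ``k=4`` and ``m=2``, so the pool size is :math:`4 + 2 = 6`), the
+same 200 OSDs would yield:
+
+   :math:`\frac{200 \times 100}{6} \approx 3333`. Rounded up to the nearest power of 2: 4096.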
+
+When using multiple data pools to store objects, make sure that you balance the
+number of PGs per pool against the number of PGs per OSD so that you arrive at
+a reasonable total number of PGs. It is important to find a number that
+provides reasonably low variance per OSD without taxing system resources or
+making the peering process too slow.
+
+For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10
+OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD.
+This cluster will not use too many resources. However, in a cluster of 1,000
+pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs
+each. This cluster will require significantly more resources and significantly
+more time for peering.
+
+For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_
+tool.
+
+
+.. _setting the number of placement groups:
+
+Setting the Number of PGs
+=========================
+
+Setting the initial number of PGs in a pool must be done at the time you create
+the pool. See `Create a Pool`_ for details.
+
+However, even after a pool is created, if the ``pg_autoscaler`` is not being
+used to manage ``pg_num`` values, you can change the number of PGs by running a
+command of the following form:
+
+.. prompt:: bash #
+
+ ceph osd pool set {pool-name} pg_num {pg_num}
+
+If you increase the number of PGs, your cluster will not rebalance until you
+increase the number of PGs for placement (``pgp_num``). The ``pgp_num``
+parameter specifies the number of PGs that are to be considered for placement
+by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster,
+but data will not be migrated to the newer PGs until ``pgp_num`` is increased.
+The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To
+increase the number of PGs for placement, run a command of the following form:
+
+.. prompt:: bash #
+
+ ceph osd pool set {pool-name} pgp_num {pgp_num}
+
+If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically.
+In Nautilus and later releases, when the ``pg_autoscaler`` is not used,
+``pgp_num`` is automatically stepped to match ``pg_num``. This process
+manifests as periods of PG remapping and backfill, which is normal and
+expected behavior.
+
+.. _rados_ops_pgs_get_pg_num:
+
+Get the Number of PGs
+=====================
+
+To get the number of PGs in a pool, run a command of the following form:
+
+.. prompt:: bash #
+
+ ceph osd pool get {pool-name} pg_num
+
+
+Get a Cluster's PG Statistics
+=============================
+
+To see the details of the PGs in your cluster, run a command of the following
+form:
+
+.. prompt:: bash #
+
+ ceph pg dump [--format {format}]
+
+Valid formats are ``plain`` (default) and ``json``.
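+
+As an illustrative sketch (this assumes that ``jq`` is installed; the JSON
+field names used here are an assumption and may differ between releases), the
+JSON output can be filtered down to each PG ID and its state:
+
+.. prompt:: bash #
+
+   ceph pg dump --format json | jq -r '.pg_map.pg_stats[] | "\(.pgid) \(.state)"'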
+
+
+Get Statistics for Stuck PGs
+============================
+
+To see the statistics for all PGs that are stuck in a specified state, run a
+command of the following form:
+
+.. prompt:: bash #
+
+ ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]
+
+- **Inactive** PGs cannot process reads or writes because they are waiting for
+ enough OSDs with the most up-to-date data to come ``up`` and ``in``.
+
+- **Undersized** PGs contain objects that have not been replicated the desired
+ number of times. Under normal conditions, it can be assumed that these PGs
+ are recovering.
+
+- **Stale** PGs are in an unknown state -- the OSDs that host them have not
+ reported to the monitor cluster for a certain period of time (determined by
+ ``mon_osd_report_timeout``).
+
+Valid formats are ``plain`` (default) and ``json``. The threshold defines the
+minimum number of seconds the PG is stuck before it is included in the returned
+statistics (default: 300).
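+
+For example, to list PGs that have been stuck ``inactive`` for at least ten
+minutes, run the following command:
+
+.. prompt:: bash #
+
+   ceph pg dump_stuck inactive --threshold 600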
+
+
+Get a PG Map
+============
+
+To get the PG map for a particular PG, run a command of the following form:
+
+.. prompt:: bash #
+
+ ceph pg map {pg-id}
+
+For example:
+
+.. prompt:: bash #
+
+ ceph pg map 1.6c
+
+Ceph will return the PG map, the PG, and the OSD status. The output resembles
+the following::
+
+ osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
+
+
+Get a PG's Statistics
+=====================
+
+To see statistics for a particular PG, run a command of the following form:
+
+.. prompt:: bash #
+
+ ceph pg {pg-id} query
+
+
+Scrub a PG
+==========
+
+To scrub a PG, run a command of the following form:
+
+.. prompt:: bash #
+
+ ceph pg scrub {pg-id}
+
+Ceph checks the primary and replica OSDs, generates a catalog of all objects in
+the PG, and compares the objects against each other in order to ensure that no
+objects are missing or mismatched and that their contents are consistent. If
+the replicas all match, then a final semantic sweep takes place to ensure that
+all snapshot-related object metadata is consistent. Errors are reported in
+logs.
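+
+A deep scrub, which also verifies object data against stored checksums, can be
+requested for a PG in the same way:
+
+.. prompt:: bash #
+
+   ceph pg deep-scrub {pg-id}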
+
+To scrub all PGs from a specific pool, run a command of the following form:
+
+.. prompt:: bash #
+
+ ceph osd pool scrub {pool-name}
+
+
+Prioritize backfill/recovery of PG(s)
+=====================================
+
+You might encounter a situation in which multiple PGs require recovery or
+backfill, but the data in some PGs is more important than the data in others
+(for example, some PGs hold data for images that are used by running machines
+and other PGs are used by inactive machines and hold data that is less
+relevant). In that case, you might want to prioritize recovery or backfill of
+the PGs with especially important data so that the performance of the cluster
+and the availability of their data are restored sooner. To designate specific
+PG(s) as prioritized during recovery, run a command of the following form:
+
+.. prompt:: bash #
+
+ ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+
+To mark specific PG(s) as prioritized during backfill, run a command of the
+following form:
+
+.. prompt:: bash #
+
+ ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+
+These commands instruct Ceph to perform recovery or backfill on the specified
+PGs before processing the other PGs. Prioritization does not interrupt current
+backfills or recovery, but places the specified PGs at the top of the queue so
+that they will be acted upon next. If you change your mind or realize that you
+have prioritized the wrong PGs, run one or both of the following commands:
+
+.. prompt:: bash #
+
+ ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+ ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+
+These commands remove the ``force`` flag from the specified PGs, so that the
+PGs will be processed in their usual order. As in the case of adding the
+``force`` flag, this affects only those PGs that are still queued but does not
+affect PGs currently undergoing recovery.
+
+The ``force`` flag is cleared automatically after recovery or backfill of the
+PGs is complete.
+
+Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that
+is, to perform recovery or backfill on those PGs first), run one or both of the
+following commands:
+
+.. prompt:: bash #
+
+ ceph osd pool force-recovery {pool-name}
+ ceph osd pool force-backfill {pool-name}
+
+These commands can also be cancelled. To revert to the default order, run one
+or both of the following commands:
+
+.. prompt:: bash #
+
+ ceph osd pool cancel-force-recovery {pool-name}
+ ceph osd pool cancel-force-backfill {pool-name}
+
+.. warning:: These commands can break the order of Ceph's internal priority
+ computations, so use them with caution! If you have multiple pools that are
+ currently sharing the same underlying OSDs, and if the data held by certain
+ pools is more important than the data held by other pools, then we recommend
+ that you run a command of the following form to arrange a custom
+ recovery/backfill priority for all pools:
+
+.. prompt:: bash #
+
+ ceph osd pool set {pool-name} recovery_priority {value}
+
+For example, if you have twenty pools, you could make the most important pool
+priority ``20``, and the next most important pool priority ``19``, and so on.
+
+Another option is to set the recovery/backfill priority for only a proper
+subset of pools. In such a scenario, three important pools might (all) be
+assigned priority ``1`` and all other pools would be left without an assigned
+recovery/backfill priority. Another possibility is to select three important
+pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1``
+respectively.
+
+.. important:: Numbers of greater value have higher priority than numbers of
+ lesser value when using ``ceph osd pool set {pool-name} recovery_priority
+ {value}`` to set their recovery/backfill priority. For example, a pool with
+ the recovery/backfill priority ``30`` has a higher priority than a pool with
+ the recovery/backfill priority ``15``.
+
+Reverting Lost RADOS Objects
+============================
+
+If the cluster has lost one or more RADOS objects and you have decided to
+abandon the search for the lost data, you must mark the unfound objects
+``lost``.
+
+If every possible location has been queried and all OSDs are ``up`` and ``in``,
+but certain RADOS objects are still lost, you might have to give up on those
+objects. This situation can arise when rare and unusual combinations of
+failures allow the cluster to learn about writes that were performed before the
+writes themselves were recovered.
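+
+Before marking objects ``lost``, it can be useful to confirm exactly which
+objects remain unfound. To list them, run a command of the following form:
+
+.. prompt:: bash #
+
+   ceph pg {pg-id} list_unfound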
+
+The command to mark a RADOS object ``lost`` supports two options: ``revert``
+and ``delete``. The ``revert`` option will either roll back to a previous
+version of the RADOS object (if it is old enough to have a previous version)
+or forget about it entirely (if it is too new to have a previous version). The
+``delete`` option forgets about the unfound objects entirely. To mark the
+"unfound" objects ``lost``, run a command of the following form:
+
+
+.. prompt:: bash #
+
+ ceph pg {pg-id} mark_unfound_lost revert|delete
+
+.. important:: Use this feature with caution. It might confuse applications
+ that expect the object(s) to exist.
+
+
+.. toctree::
+ :hidden:
+
+ pg-states
+ pg-concepts
+
+
+.. _Create a Pool: ../pools#createpool
+.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
+.. _pgcalc: https://old.ceph.com/pgcalc/
diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst
new file mode 100644
index 000000000..dda9e844e
--- /dev/null
+++ b/doc/rados/operations/pools.rst
@@ -0,0 +1,751 @@
+.. _rados_pools:
+
+=======
+ Pools
+=======
+Pools are logical partitions that are used to store objects.
+
+Pools provide:
+
+- **Resilience**: It is possible to set the number of OSDs that are allowed to
+ fail without any data being lost. If your cluster uses replicated pools, the
+ number of OSDs that can fail without data loss is equal to the number of
+ replicas.
+
+ For example: a typical configuration stores an object and two replicas
+ (copies) of each RADOS object (that is: ``size = 3``), but you can configure
+ the number of replicas on a per-pool basis. For `erasure-coded pools
+ <../erasure-code>`_, resilience is defined as the number of coding chunks
+ (for example, ``m = 2`` in the default **erasure code profile**).
+
+- **Placement Groups**: You can set the number of placement groups (PGs) for
+ the pool. In a typical configuration, the target number of PGs is
+ approximately one hundred PGs per OSD. This provides reasonable balancing
+ without consuming excessive computing resources. When setting up multiple
+ pools, be careful to set an appropriate number of PGs for each pool and for
+ the cluster as a whole. Each PG belongs to a specific pool: when multiple
+ pools use the same OSDs, make sure that the **sum** of PG replicas per OSD is
+ in the desired PG-per-OSD target range. To calculate an appropriate number of
+ PGs for your pools, use the `pgcalc`_ tool.
+
+- **CRUSH Rules**: When data is stored in a pool, the placement of the object
+ and its replicas (or chunks, in the case of erasure-coded pools) in your
+ cluster is governed by CRUSH rules. Custom CRUSH rules can be created for a
+ pool if the default rule does not fit your use case.
+
+- **Snapshots**: The command ``ceph osd pool mksnap`` creates a snapshot of a
+ pool.
+
+Pool Names
+==========
+
+Pool names beginning with ``.`` are reserved for use by Ceph's internal
+operations. Do not create or manipulate pools with these names.
+
+
+List Pools
+==========
+
+There are multiple ways to get the list of pools in your cluster.
+
+To list just your cluster's pool names (good for scripting), execute:
+
+.. prompt:: bash $
+
+ ceph osd pool ls
+
+::
+
+ .rgw.root
+ default.rgw.log
+ default.rgw.control
+ default.rgw.meta
+
+To list your cluster's pools with the pool number, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd lspools
+
+::
+
+ 1 .rgw.root
+ 2 default.rgw.log
+ 3 default.rgw.control
+ 4 default.rgw.meta
+
+To list your cluster's pools with additional information, execute:
+
+.. prompt:: bash $
+
+ ceph osd pool ls detail
+
+::
+
+ pool 1 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
+ pool 2 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
+ pool 3 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 23 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
+ pool 4 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 25 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 4.00
+
+To get even more information, you can execute this command with the ``--format`` (or ``-f``) option and the ``json``, ``json-pretty``, ``xml`` or ``xml-pretty`` value.
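+
+For example:
+
+.. prompt:: bash $
+
+   ceph osd pool ls detail -f json-pretty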
+
+.. _createpool:
+
+Creating a Pool
+===============
+
+Before creating a pool, consult `Pool, PG and CRUSH Config Reference`_. Your
+Ceph configuration file contains a setting (namely,
+``osd_pool_default_pg_num``) that determines the default number of PGs for new
+pools. However, this setting's default value is NOT appropriate for most
+systems. In most cases, you should override this default value when creating
+your pool. For details on PG numbers, see `setting the number of placement
+groups`_.
+
+For example, the following settings in the Ceph configuration file establish
+cluster-wide defaults for newly created pools::
+
+   osd_pool_default_pg_num = 128
+   osd_pool_default_pgp_num = 128
+
+.. note:: In Luminous and later releases, each pool must be associated with the
+ application that will be using the pool. For more information, see
+ `Associating a Pool with an Application`_ below.
+
+To create a pool, run one of the following commands:
+
+.. prompt:: bash $
+
+ ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] [replicated] \
+ [crush-rule-name] [expected-num-objects]
+
+or:
+
+.. prompt:: bash $
+
+ ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] erasure \
+ [erasure-code-profile] [crush-rule-name] [expected_num_objects] [--autoscale-mode=<on,off,warn>]
+
+For a brief description of the elements of the above commands, consult the
+following:
+
+.. describe:: {pool-name}
+
+ The name of the pool. It must be unique.
+
+ :Type: String
+ :Required: Yes.
+
+.. describe:: {pg-num}
+
+ The total number of PGs in the pool. For details on calculating an
+ appropriate number, see :ref:`placement groups`. The default value ``8`` is
+ NOT suitable for most systems.
+
+ :Type: Integer
+ :Required: Yes.
+ :Default: 8
+
+.. describe:: {pgp-num}
+
+ The total number of PGs for placement purposes. This **should be equal to
+ the total number of PGs**, except briefly while ``pg_num`` is being
+ increased or decreased.
+
+ :Type: Integer
+ :Required: Yes. If no value has been specified in the command, then the default value is used (unless a different value has been set in Ceph configuration).
+ :Default: 8
+
+.. describe:: {replicated|erasure}
+
+ The pool type. This can be either **replicated** (to recover from lost OSDs
+ by keeping multiple copies of the objects) or **erasure** (to achieve a kind
+ of `generalized parity RAID <../erasure-code>`_ capability). The
+ **replicated** pools require more raw storage but can implement all Ceph
+ operations. The **erasure** pools require less raw storage but can perform
+ only some Ceph tasks and may provide decreased performance.
+
+ :Type: String
+ :Required: No.
+ :Default: replicated
+
+.. describe:: [crush-rule-name]
+
+ The name of the CRUSH rule to use for this pool. The specified rule must
+ exist; otherwise the command will fail.
+
+ :Type: String
+ :Required: No.
+ :Default: For **replicated** pools, it is the rule specified by the :confval:`osd_pool_default_crush_rule` configuration variable. This rule must exist. For **erasure** pools, it is the ``erasure-code`` rule if the ``default`` `erasure code profile`_ is used or the ``{pool-name}`` rule if not. This rule will be created implicitly if it doesn't already exist.
+
+.. describe:: [erasure-code-profile=profile]
+
+ For **erasure** pools only. Instructs Ceph to use the specified `erasure
+ code profile`_. This profile must be an existing profile as defined by **osd
+ erasure-code-profile set**.
+
+ :Type: String
+ :Required: No.
+
+.. _erasure code profile: ../erasure-code-profile
+
+.. describe:: --autoscale-mode=<on,off,warn>
+
+   - ``on``: the Ceph cluster will automatically adjust the number of PGs in your pool based on actual usage.
+   - ``warn``: the Ceph cluster will only recommend changes (via a health warning) to the number of PGs in your pool based on actual usage.
+   - ``off``: disable autoscaling for this pool. Refer to :ref:`placement groups` for information on choosing ``pg_num`` manually.
+
+ :Type: String
+ :Required: No.
+ :Default: The default behavior is determined by the :confval:`osd_pool_default_pg_autoscale_mode` option.
+
+.. describe:: [expected-num-objects]
+
+ The expected number of RADOS objects for this pool. By setting this value and
+ assigning a negative value to **filestore merge threshold**, you arrange
+ for the PG folder splitting to occur at the time of pool creation and
+ avoid the latency impact that accompanies runtime folder splitting.
+
+ :Type: Integer
+ :Required: No.
+ :Default: 0, no splitting at the time of pool creation.
+
+.. _associate-pool-to-application:
+
+Associating a Pool with an Application
+======================================
+
+Pools need to be associated with an application before they can be used. Pools
+that are intended for use with CephFS and pools that are created automatically
+by RGW are associated automatically. Pools that are intended for use with RBD
+should be initialized with the ``rbd`` tool (see `Block Device Commands`_ for
+more information).
+
+For other cases, you can manually associate a free-form application name to a
+pool by running the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool application enable {pool-name} {application-name}
+
+.. note:: CephFS uses the application name ``cephfs``, RBD uses the
+ application name ``rbd``, and RGW uses the application name ``rgw``.
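+
+To see which applications (if any) are associated with a pool, run a command
+of the following form:
+
+.. prompt:: bash $
+
+   ceph osd pool application get {pool-name}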
+
+Setting Pool Quotas
+===================
+
+To set pool quotas for the maximum number of bytes and/or the maximum number of
+RADOS objects per pool, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}]
+
+For example:
+
+.. prompt:: bash $
+
+ ceph osd pool set-quota data max_objects 10000
+
+To remove a quota, set its value to ``0``.
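+
+To view the quotas that are currently set on a pool, run a command of the
+following form:
+
+.. prompt:: bash $
+
+   ceph osd pool get-quota {pool-name}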
+
+
+Deleting a Pool
+===============
+
+To delete a pool, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
+
+To remove a pool, you must set the ``mon_allow_pool_delete`` flag to ``true``
+in the monitor's configuration. Otherwise, monitors will refuse to remove
+pools.
+
+For more information, see `Monitor Configuration`_.
+
+.. _Monitor Configuration: ../../configuration/mon-config-ref
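+
+As a sketch (assuming that the cluster's monitors are managed through the
+central configuration database), the flag can be enabled just long enough to
+delete the pool and then disabled again:
+
+.. prompt:: bash $
+
+   ceph config set mon mon_allow_pool_delete true
+   ceph osd pool delete {pool-name} {pool-name} --yes-i-really-really-mean-it
+   ceph config set mon mon_allow_pool_delete false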
+
+If there are custom CRUSH rules for a pool that is no longer needed, consider
+deleting those rules. To check which rule a given pool uses, run a command of
+the following form:
+
+.. prompt:: bash $
+
+ ceph osd pool get {pool-name} crush_rule
+
+For example, if the custom rule is "123", check all pools to see whether they
+use the rule by running the following command:
+
+.. prompt:: bash $
+
+ ceph osd dump | grep "^pool" | grep "crush_rule 123"
+
+If no pools use this custom rule, then it is safe to delete the rule from the
+cluster.
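+
+To delete an unused rule, run a command of the following form:
+
+.. prompt:: bash $
+
+   ceph osd crush rule rm {rule-name}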
+
+Similarly, if there are users with permissions restricted to a pool that no
+longer exists, consider deleting those users by running commands of the
+following forms:
+
+.. prompt:: bash $
+
+ ceph auth ls | grep -C 5 {pool-name}
+ ceph auth del {user}
+
+
+Renaming a Pool
+===============
+
+To rename a pool, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd pool rename {current-pool-name} {new-pool-name}
+
+If you rename a pool for which an authenticated user has per-pool capabilities,
+you must update the user's capabilities ("caps") to refer to the new pool name.
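+
+As an illustrative sketch (the user name and capabilities shown here are
+hypothetical), the existing caps can be inspected and then restated with the
+new pool name:
+
+.. prompt:: bash $
+
+   ceph auth get client.app
+   ceph auth caps client.app mon 'allow r' osd 'allow rwx pool={new-pool-name}'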
+
+
+Showing Pool Statistics
+=======================
+
+To show a pool's utilization statistics, run the following command:
+
+.. prompt:: bash $
+
+ rados df
+
+To obtain I/O information for a specific pool or for all pools, run a command
+of the following form:
+
+.. prompt:: bash $
+
+ ceph osd pool stats [{pool-name}]
+
+
+Making a Snapshot of a Pool
+===========================
+
+To make a snapshot of a pool, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd pool mksnap {pool-name} {snap-name}
+
+Removing a Snapshot of a Pool
+=============================
+
+To remove a snapshot of a pool, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd pool rmsnap {pool-name} {snap-name}
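+
+To list the snapshots that currently exist for a pool, the ``rados`` utility
+can be used:
+
+.. prompt:: bash $
+
+   rados -p {pool-name} lssnap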
+
+.. _setpoolvalues:
+
+Setting Pool Values
+===================
+
+To assign values to a pool's configuration keys, run a command of the following
+form:
+
+.. prompt:: bash $
+
+ ceph osd pool set {pool-name} {key} {value}
+
+You may set values for the following keys:
+
+.. _compression_algorithm:
+
+.. describe:: compression_algorithm
+
+ :Description: Sets the inline compression algorithm used in storing data on the underlying BlueStore back end. This key's setting overrides the global setting :confval:`bluestore_compression_algorithm`.
+ :Type: String
+ :Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd``
+
+.. describe:: compression_mode
+
+ :Description: Sets the policy for the inline compression algorithm used in storing data on the underlying BlueStore back end. This key's setting overrides the global setting :confval:`bluestore_compression_mode`.
+ :Type: String
+ :Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
+
+.. describe:: compression_min_blob_size
+
+
+ :Description: Sets the minimum size for the compression of chunks: that is, chunks smaller than this are not compressed. This key's setting overrides the following global settings:
+
+ * :confval:`bluestore_compression_min_blob_size`
+ * :confval:`bluestore_compression_min_blob_size_hdd`
+ * :confval:`bluestore_compression_min_blob_size_ssd`
+
+ :Type: Unsigned Integer
+
+
+.. describe:: compression_max_blob_size
+
+ :Description: Sets the maximum size for chunks: that is, chunks larger than this are broken into smaller blobs of this size before compression is performed.
+ :Type: Unsigned Integer
+
+.. _size:
+
+.. describe:: size
+
+ :Description: Sets the number of replicas for objects in the pool. For further details, see `Setting the Number of RADOS Object Replicas`_. Replicated pools only.
+ :Type: Integer
+
+.. _min_size:
+
+.. describe:: min_size
+
+ :Description: Sets the minimum number of replicas required for I/O. For further details, see `Setting the Number of RADOS Object Replicas`_. For erasure-coded pools, this should be set to a value greater than 'k'. If I/O is allowed at the value 'k', then there is no redundancy and data will be lost in the event of a permanent OSD failure. For more information, see `Erasure Code <../erasure-code>`_
+ :Type: Integer
+ :Version: ``0.54`` and above
+
+.. _pg_num:
+
+.. describe:: pg_num
+
+ :Description: Sets the effective number of PGs to use when calculating data placement.
+ :Type: Integer
+ :Valid Range: ``0`` to ``mon_max_pool_pg_num``. If set to ``0``, the value of ``osd_pool_default_pg_num`` will be used.
+
+.. _pgp_num:
+
+.. describe:: pgp_num
+
+ :Description: Sets the effective number of PGs to use when calculating data placement.
+ :Type: Integer
+ :Valid Range: Between ``1`` and the current value of ``pg_num``.
+
+.. _crush_rule:
+
+.. describe:: crush_rule
+
+ :Description: Sets the CRUSH rule that Ceph uses to map object placement within the pool.
+ :Type: String
+
+.. _allow_ec_overwrites:
+
+.. describe:: allow_ec_overwrites
+
+ :Description: Determines whether writes to an erasure-coded pool are allowed to update only part of a RADOS object. This allows CephFS and RBD to use an EC (erasure-coded) pool for user data (but not for metadata). For more details, see `Erasure Coding with Overwrites`_.
+ :Type: Boolean
+
+ .. versionadded:: 12.2.0
+
+.. describe:: hashpspool
+
+ :Description: Sets and unsets the HASHPSPOOL flag on a given pool.
+ :Type: Integer
+ :Valid Range: 1 sets flag, 0 unsets flag
+
+.. _nodelete:
+
+.. describe:: nodelete
+
+ :Description: Sets and unsets the NODELETE flag on a given pool.
+ :Type: Integer
+ :Valid Range: 1 sets flag, 0 unsets flag
+ :Version: Version ``FIXME``
+
+.. _nopgchange:
+
+.. describe:: nopgchange
+
+ :Description: Sets and unsets the NOPGCHANGE flag on a given pool.
+ :Type: Integer
+ :Valid Range: 1 sets flag, 0 unsets flag
+ :Version: Version ``FIXME``
+
+.. _nosizechange:
+
+.. describe:: nosizechange
+
+ :Description: Sets and unsets the NOSIZECHANGE flag on a given pool.
+ :Type: Integer
+ :Valid Range: 1 sets flag, 0 unsets flag
+ :Version: Version ``FIXME``
+
+.. _bulk:
+
+.. describe:: bulk
+
+ :Description: Sets and unsets the bulk flag on a given pool.
+ :Type: Boolean
+ :Valid Range: ``true``/``1`` sets flag, ``false``/``0`` unsets flag
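+
+For example, to flag a hypothetical pool named ``mypool`` as bulk (the pool
+name is illustrative), you might run:
+
+.. prompt:: bash $
+
+   ceph osd pool set mypool bulk true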
+
+.. _write_fadvise_dontneed:
+
+.. describe:: write_fadvise_dontneed
+
+ :Description: Sets and unsets the WRITE_FADVISE_DONTNEED flag on a given pool.
+ :Type: Integer
+ :Valid Range: ``1`` sets flag, ``0`` unsets flag
+
+.. _noscrub:
+
+.. describe:: noscrub
+
+ :Description: Sets and unsets the NOSCRUB flag on a given pool.
+ :Type: Integer
+ :Valid Range: ``1`` sets flag, ``0`` unsets flag
+
+.. _nodeep-scrub:
+
+.. describe:: nodeep-scrub
+
+ :Description: Sets and unsets the NODEEP_SCRUB flag on a given pool.
+ :Type: Integer
+ :Valid Range: ``1`` sets flag, ``0`` unsets flag
+
+.. _target_max_bytes:
+
+.. describe:: target_max_bytes
+
+ :Description: Ceph will begin flushing or evicting objects when the
+ ``max_bytes`` threshold is triggered.
+ :Type: Integer
+ :Example: ``1000000000000`` #1-TB
+
+.. _target_max_objects:
+
+.. describe:: target_max_objects
+
+ :Description: Ceph will begin flushing or evicting objects when the
+ ``max_objects`` threshold is triggered.
+ :Type: Integer
+ :Example: ``1000000`` #1M objects
+
+.. _fast_read:
+
+.. describe:: fast_read
+
+ :Description: For erasure-coded pools, if this flag is turned ``on``, the
+ read request issues "sub reads" to all shards, and then waits
+ until it receives enough shards to decode before it serves
+ the client. If *jerasure* or *isa* erasure plugins are in
+ use, then after the first *K* replies have returned, the
+ client's request is served immediately using the data decoded
+ from these replies. This approach sacrifices resources in
+ exchange for better performance. This flag is supported only
+ for erasure-coded pools.
+ :Type: Boolean
+   :Default: ``0``
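+
+For example, to enable ``fast_read`` on a hypothetical erasure-coded pool named
+``ecpool`` (the pool name is illustrative), you might run:
+
+.. prompt:: bash $
+
+   ceph osd pool set ecpool fast_read 1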
+
+.. _scrub_min_interval:
+
+.. describe:: scrub_min_interval
+
+ :Description: Sets the minimum interval (in seconds) for successive scrubs of the pool's PGs when the load is low. If the default value of ``0`` is in effect, then the value of ``osd_scrub_min_interval`` from central config is used.
+
+ :Type: Double
+ :Default: ``0``
+
+.. _scrub_max_interval:
+
+.. describe:: scrub_max_interval
+
+ :Description: Sets the maximum interval (in seconds) for scrubs of the pool's PGs regardless of cluster load. If the value of ``scrub_max_interval`` is ``0``, then the value ``osd_scrub_max_interval`` from central config is used.
+
+ :Type: Double
+ :Default: ``0``
+
+.. _deep_scrub_interval:
+
+.. describe:: deep_scrub_interval
+
+ :Description: Sets the interval (in seconds) for pool “deep” scrubs of the pool's PGs. If the value of ``deep_scrub_interval`` is ``0``, the value ``osd_deep_scrub_interval`` from central config is used.
+
+ :Type: Double
+ :Default: ``0``
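+
+For example, to ensure that a hypothetical pool named ``mypool`` is scrubbed at
+least once a day regardless of load (the pool name and the value of 86400
+seconds are illustrative), you might run:
+
+.. prompt:: bash $
+
+   ceph osd pool set mypool scrub_max_interval 86400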
+
+.. _recovery_priority:
+
+.. describe:: recovery_priority
+
+ :Description: Setting this value adjusts a pool's computed reservation priority. This value must be in the range ``-10`` to ``10``. Any pool assigned a negative value will be given a lower priority than any new pools, so users are directed to assign negative values to low-priority pools.
+
+ :Type: Integer
+ :Default: ``0``
+
+
+.. _recovery_op_priority:
+
+.. describe:: recovery_op_priority
+
+ :Description: Sets the recovery operation priority for a specific pool's PGs. This overrides the general priority determined by :confval:`osd_recovery_op_priority`.
+
+ :Type: Integer
+ :Default: ``0``
+
+
+Getting Pool Values
+===================
+
+To get a value from a pool's key, run a command of the following form:
+
+.. prompt:: bash $
+
+ ceph osd pool get {pool-name} {key}
+
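+For example, to read the ``pg_num`` of a hypothetical pool named ``mypool``
+(the pool name is illustrative):
+
+.. prompt:: bash $
+
+   ceph osd pool get mypool pg_num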
+
+You may get values from the following keys:
+
+
+``size``
+
+:Description: See size_.
+
+:Type: Integer
+
+
+``min_size``
+
+:Description: See min_size_.
+
+:Type: Integer
+:Version: ``0.54`` and above
+
+
+``pg_num``
+
+:Description: See pg_num_.
+
+:Type: Integer
+
+
+``pgp_num``
+
+:Description: See pgp_num_.
+
+:Type: Integer
+:Valid Range: Equal to or less than ``pg_num``.
+
+
+``crush_rule``
+
+:Description: See crush_rule_.
+
+
+``target_max_bytes``
+
+:Description: See target_max_bytes_.
+
+:Type: Integer
+
+
+``target_max_objects``
+
+:Description: See target_max_objects_.
+
+:Type: Integer
+
+
+``fast_read``
+
+:Description: See fast_read_.
+
+:Type: Boolean
+
+
+``scrub_min_interval``
+
+:Description: See scrub_min_interval_.
+
+:Type: Double
+
+
+``scrub_max_interval``
+
+:Description: See scrub_max_interval_.
+
+:Type: Double
+
+
+``deep_scrub_interval``
+
+:Description: See deep_scrub_interval_.
+
+:Type: Double
+
+
+``allow_ec_overwrites``
+
+:Description: See allow_ec_overwrites_.
+
+:Type: Boolean
+
+
+``recovery_priority``
+
+:Description: See recovery_priority_.
+
+:Type: Integer
+
+
+``recovery_op_priority``
+
+:Description: See recovery_op_priority_.
+
+:Type: Integer
+
+
+Setting the Number of RADOS Object Replicas
+===========================================
+
+To set the number of data replicas on a replicated pool, run a command of the
+following form:
+
+.. prompt:: bash $
+
+ ceph osd pool set {poolname} size {num-replicas}
+
+.. important:: The ``{num-replicas}`` argument includes the primary object
+ itself. For example, if you want there to be two replicas of the object in
+ addition to the original object (for a total of three instances of the
+   object), specify ``3`` by running the following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set data size 3
+
+You may run the above command for each pool.
+
+.. Note:: An object might accept I/Os in degraded mode with fewer than ``pool
+ size`` replicas. To set a minimum number of replicas required for I/O, you
+ should use the ``min_size`` setting. For example, you might run the
+ following command:
+
+.. prompt:: bash $
+
+ ceph osd pool set data min_size 2
+
+This command ensures that no object in the data pool will receive I/O if it has
+fewer than ``min_size`` (in this case, two) replicas.
+
+
+Getting the Number of Object Replicas
+=====================================
+
+To get the number of object replicas, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd dump | grep 'replicated size'
+
+Ceph will list pools and highlight the ``replicated size`` attribute. By
+default, Ceph creates two replicas of an object (a total of three copies, for a
+size of ``3``).
+
+Managing pools that are flagged with ``--bulk``
+===============================================
+See :ref:`managing_bulk_flagged_pools`.
+
+
+.. _pgcalc: https://old.ceph.com/pgcalc/
+.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
+.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter
+.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups
+.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites
+.. _Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool
diff --git a/doc/rados/operations/read-balancer.rst b/doc/rados/operations/read-balancer.rst
new file mode 100644
index 000000000..0833e4326
--- /dev/null
+++ b/doc/rados/operations/read-balancer.rst
@@ -0,0 +1,64 @@
+.. _read_balancer:
+
+=======================================
+Operating the Read (Primary) Balancer
+=======================================
+
+You might be wondering: How can I improve performance in my Ceph cluster?
+One important data point you can check is the ``read_balance_score`` on each
+of your replicated pools.
+
+This metric, which is available via ``ceph osd pool ls detail`` (see
+:ref:`rados_pools` for more details), indicates how well balanced the primary
+OSDs of each replicated pool are, which in turn affects read performance. In
+most cases, if a ``read_balance_score`` is above 1 (for instance, 1.5), this
+means that your pool has unbalanced primaries and that you may want to try
+improving your read performance with the read balancer.
+
+Online Optimization
+===================
+
+At present, there is no online option for the read balancer. However, we plan to add
+the read balancer as an option to the :ref:`balancer` in the next Ceph version
+so it can be enabled to run automatically in the background like the upmap balancer.
+
+Offline Optimization
+====================
+
+Primaries are updated with an offline optimizer that is built into the
+:ref:`osdmaptool`.
+
+#. Grab the latest copy of your osdmap:
+
+ .. prompt:: bash $
+
+ ceph osd getmap -o om
+
+#. Run the optimizer:
+
+ .. prompt:: bash $
+
+ osdmaptool om --read out.txt --read-pool <pool name> [--vstart]
+
+   It is highly recommended that you run the capacity balancer before running
+   the read balancer to ensure optimal results. See :ref:`upmap` for details on
+   how to balance capacity in a cluster.
+
+#. Apply the changes:
+
+ .. prompt:: bash $
+
+ source out.txt
+
+ In the above example, the proposed changes are written to the output file
+ ``out.txt``. The commands in this procedure are normal Ceph CLI commands
+ that can be run in order to apply the changes to the cluster.
+
+   If you are working in a vstart cluster, you may pass the ``--vstart``
+   parameter as shown above so that the CLI commands are formatted with the
+   ``./bin/`` prefix.
+
+   Note that any time the number of PGs changes (for instance, if the PG
+   autoscaler [:ref:`pg-autoscaler`] kicks in), you should consider rechecking
+   the scores and rerunning the read balancer if needed.
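+
+The proposed changes written to the output file are ordinary CLI commands. For
+the read balancer these are typically ``ceph osd pg-upmap-primary`` commands,
+along the general lines of this illustrative (not real) example::
+
+    ceph osd pg-upmap-primary 2.3 0
+    ceph osd pg-upmap-primary 2.4 5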
+
+To see some details about what the tool is doing, you can pass
+``--debug-osd 10`` to ``osdmaptool``. To see even more details, pass
+``--debug-osd 20`` to ``osdmaptool``.
diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst
new file mode 100644
index 000000000..f797b5b91
--- /dev/null
+++ b/doc/rados/operations/stretch-mode.rst
@@ -0,0 +1,262 @@
+.. _stretch_mode:
+
+================
+Stretch Clusters
+================
+
+A stretch cluster is a cluster that has servers in geographically separated
+data centers, distributed over a WAN. Stretch clusters have LAN-like high-speed
+and low-latency connections, but limited links. Stretch clusters have a higher
+likelihood of (possibly asymmetric) network splits, and a higher likelihood of
+temporary or complete loss of an entire data center (which can represent
+one-third to one-half of the total cluster).
+
+Ceph is designed with the expectation that all parts of its network and cluster
+will be reliable and that failures will be distributed randomly across the
+CRUSH map. Even if a switch goes down and causes the loss of many OSDs, Ceph is
+designed so that the remaining OSDs and monitors will route around such a loss.
+
+Sometimes this cannot be relied upon. If you have a "stretched-cluster"
+deployment in which much of your cluster is behind a single network component,
+you might need to use **stretch mode** to ensure data integrity.
+
+We will here consider two standard configurations: a configuration with two
+data centers (or, in clouds, two availability zones), and a configuration with
+three data centers (or, in clouds, three availability zones).
+
+In the two-site configuration, Ceph expects each of the sites to hold a copy of
+the data, and Ceph also expects there to be a third site that has a tiebreaker
+monitor. This tiebreaker monitor picks a winner if the network connection fails
+and both data centers remain alive.
+
+The tiebreaker monitor can be a VM. It can also have high latency relative to
+the two main sites.
+
+The standard Ceph configuration is able to survive MANY network failures or
+data-center failures without ever compromising data availability. If enough
+Ceph servers are brought back following a failure, the cluster *will* recover.
+If you lose a data center but are still able to form a quorum of monitors and
+still have all the data available, Ceph will maintain availability. (This
+assumes that the cluster has enough copies to satisfy the pools' ``min_size``
+configuration option, or (failing that) that the cluster has CRUSH rules in
+place that will cause the cluster to re-replicate the data until the
+``min_size`` configuration option has been met.)
+
+Stretch Cluster Issues
+======================
+
+Ceph does not permit the compromise of data integrity and data consistency
+under any circumstances. When service is restored after a network failure or a
+loss of Ceph nodes, Ceph will restore itself to a state of normal functioning
+without operator intervention.
+
+Ceph does not permit the compromise of data integrity or data consistency, but
+there are situations in which *data availability* is compromised. These
+situations can occur even when enough servers are available to satisfy Ceph's
+consistency and sizing constraints. In some situations, you might discover that
+your cluster does not satisfy those constraints.
+
+The first category of these failures that we will discuss involves inconsistent
+networks -- if there is a netsplit (a disconnection between two servers that
+splits the network into two pieces), Ceph might be unable to mark OSDs ``down``
+and remove them from the acting PG sets. This failure to mark OSDs ``down``
+will occur, despite the fact that the primary PG is unable to replicate data (a
+situation that, under normal non-netsplit circumstances, would result in the
+marking of affected OSDs as ``down`` and their removal from the PG). If this
+happens, Ceph will be unable to satisfy its durability guarantees and
+consequently IO will not be permitted.
+
+The second category of failures that we will discuss involves the situation in
+which the constraints are not sufficient to guarantee the replication of data
+across data centers, though it might seem that the data is correctly replicated
+across data centers. For example, in a scenario in which there are two data
+centers named Data Center A and Data Center B, and the CRUSH rule targets three
+replicas and places a replica in each data center with a ``min_size`` of ``2``,
+the PG might go active with two replicas in Data Center A and zero replicas in
+Data Center B. In a situation of this kind, the loss of Data Center A means
+that the data is lost and Ceph will not be able to operate on it. This
+situation is surprisingly difficult to avoid using only standard CRUSH rules.
+
+
+Stretch Mode
+============
+Stretch mode is designed to handle deployments in which you cannot guarantee the
+replication of data across two data centers. This kind of situation can arise
+when the cluster's CRUSH rule specifies that three copies are to be made, but
+then a copy is placed in each data center with a ``min_size`` of 2. Under such
+conditions, a placement group can become active with two copies in the first
+data center and no copies in the second data center.
+
+
+Entering Stretch Mode
+---------------------
+
+To enable stretch mode, you must set the location of each monitor, matching
+your CRUSH map. This procedure shows how to do this.
+
+
+#. Place ``mon.a`` in your first data center (the remaining non-tiebreaker
+   monitors are placed in the same way; see the sketch after this procedure):
+
+ .. prompt:: bash $
+
+ ceph mon set_location a datacenter=site1
+
+#. Generate a CRUSH rule that places two copies in each data center.
+ This requires editing the CRUSH map directly:
+
+ .. prompt:: bash $
+
+ ceph osd getcrushmap > crush.map.bin
+ crushtool -d crush.map.bin -o crush.map.txt
+
+#. Edit the ``crush.map.txt`` file to add a new rule. The example below uses
+   ``id 1``; because this map contains only one other rule, that ID is unused
+   here, but you might need to use a different rule ID in your cluster. There
+   are two data-center buckets named ``site1`` and ``site2``:
+
+ ::
+
+ rule stretch_rule {
+ id 1
+ min_size 1
+ max_size 10
+ type replicated
+ step take site1
+ step chooseleaf firstn 2 type host
+ step emit
+ step take site2
+ step chooseleaf firstn 2 type host
+ step emit
+ }
+
+#. Inject the CRUSH map to make the rule available to the cluster:
+
+ .. prompt:: bash $
+
+ crushtool -c crush.map.txt -o crush2.map.bin
+ ceph osd setcrushmap -i crush2.map.bin
+
+#. Run the monitors in connectivity mode. See `Changing Monitor Elections`_.
+
+#. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the
+ tiebreaker monitor and we are splitting across data centers. The tiebreaker
+ monitor must be assigned a data center that is neither ``site1`` nor
+ ``site2``. For this purpose you can create another data-center bucket named
+ ``site3`` in your CRUSH and place ``mon.e`` there:
+
+ .. prompt:: bash $
+
+ ceph mon set_location e datacenter=site3
+ ceph mon enable_stretch_mode e stretch_rule datacenter
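+
+Step 1 above shows the location assignment for only one monitor. Following the
+five-monitor layout described in `Limitations of Stretch Mode`_ below (two
+monitors per data center plus a tiebreaker), a minimal sketch of placing the
+remaining non-tiebreaker monitors might look like this (the monitor names are
+illustrative):
+
+.. prompt:: bash $
+
+   ceph mon set_location b datacenter=site1
+   ceph mon set_location c datacenter=site2
+   ceph mon set_location d datacenter=site2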
+
+When stretch mode is enabled, PGs will become active only when they peer
+across data centers (or across whichever CRUSH bucket type was specified),
+assuming both are alive. Pools will increase in size from the default ``3`` to
+``4``, and two copies will be expected in each site. OSDs will be allowed to
+connect to monitors only if they are in the same data center as the monitors.
+New monitors will not be allowed to join the cluster if they do not specify a
+location.
+
+If all OSDs and monitors in one of the data centers become inaccessible at once,
+the surviving data center enters a "degraded stretch mode". A warning will be
+issued, the ``min_size`` will be reduced to ``1``, and the cluster will be
+allowed to go active with the data in the single remaining site. The pool size
+does not change, so warnings will be generated that report that the pools are
+too small -- but a special stretch mode flag will prevent the OSDs from
+creating extra copies in the remaining data center. This means that the data
+center will keep only two copies, just as before.
+
+When the missing data center comes back, the cluster will enter a "recovery
+stretch mode". This changes the warning and allows peering, but requires OSDs
+only from the data center that was ``up`` throughout the duration of the
+downtime. When all PGs are in a known state, and are neither degraded nor
+incomplete, the cluster transitions back to regular stretch mode, ends the
+warning, restores ``min_size`` to its original value (``2``), requires both
+sites to peer, and no longer requires the site that was up throughout the
+duration of the downtime when peering (which makes failover to the other site
+possible, if needed).
+
+.. _Changing Monitor elections: ../change-mon-elections
+
+Limitations of Stretch Mode
+===========================
+When using stretch mode, OSDs must be located at exactly two sites.
+
+Two monitors should be run in each data center, plus a tiebreaker in a third
+(or in the cloud) for a total of five monitors. While in stretch mode, OSDs
+will connect only to monitors within the data center in which they are located.
+OSDs *DO NOT* connect to the tiebreaker monitor.
+
+Erasure-coded pools cannot be used with stretch mode: attempts to use
+erasure-coded pools with stretch mode will fail, and erasure-coded pools cannot
+be created while stretch mode is active.
+
+To use stretch mode, you will need to create a CRUSH rule that provides two
+replicas in each data center. Ensure that there are four total replicas: two in
+each data center. If pools exist in the cluster that do not have the default
+``size`` or ``min_size``, Ceph will not enter stretch mode. An example of such
+a CRUSH rule is given above.
+
+Because stretch mode runs with ``min_size`` set to ``1`` while a data center is
+down, we recommend enabling stretch mode only when using OSDs on SSDs
+(including NVMe OSDs). Hybrid HDD+SSD or HDD-only OSDs are not recommended
+because of the long time they take to recover after connectivity between data
+centers has been restored. Minimizing recovery time reduces the potential for
+data loss.
+
+In the future, stretch mode might support erasure-coded pools and might support
+deployments that have more than two data centers.
+
+Other commands
+==============
+
+Replacing a failed tiebreaker monitor
+-------------------------------------
+
+Turn on a new monitor and run the following command:
+
+.. prompt:: bash $
+
+ ceph mon set_new_tiebreaker mon.<new_mon_name>
+
+This command protests if the new monitor is in the same location as the
+existing non-tiebreaker monitors. **This command WILL NOT remove the previous
+tiebreaker monitor.** Remove the previous tiebreaker monitor yourself.
+
+Using "--set-crush-location" and not "ceph mon set_location"
+------------------------------------------------------------
+
+If you write your own tooling for deploying Ceph, use the
+``--set-crush-location`` option when booting monitors instead of running ``ceph
+mon set_location``. This option accepts only a single ``bucket=loc`` pair (for
+example, ``ceph-mon --set-crush-location 'datacenter=a'``), and that pair must
+match the bucket type that was specified when running ``enable_stretch_mode``.
+
+Forcing recovery stretch mode
+-----------------------------
+
+When in degraded stretch mode, the cluster will go into "recovery" mode
+automatically when the disconnected data center comes back. If that does not
+happen or you want to enable recovery mode early, run the following command:
+
+.. prompt:: bash $
+
+ ceph osd force_recovery_stretch_mode --yes-i-really-mean-it
+
+Forcing normal stretch mode
+---------------------------
+
+When in recovery mode, the cluster should go back into normal stretch mode when
+the PGs are healthy. If this fails to happen or if you want to force the
+cross-data-center peering early and are willing to risk data downtime (or have
+verified separately that all the PGs can peer, even if they aren't fully
+recovered), run the following command:
+
+.. prompt:: bash $
+
+ ceph osd force_healthy_stretch_mode --yes-i-really-mean-it
+
+This command can be used to remove the ``HEALTH_WARN`` state that recovery
+mode generates.
diff --git a/doc/rados/operations/upmap.rst b/doc/rados/operations/upmap.rst
new file mode 100644
index 000000000..8541680d8
--- /dev/null
+++ b/doc/rados/operations/upmap.rst
@@ -0,0 +1,113 @@
+.. _upmap:
+
+=======================================
+Using pg-upmap
+=======================================
+
+In Luminous v12.2.z and later releases, there is a *pg-upmap* exception table
+in the OSDMap that allows the cluster to explicitly map specific PGs to
+specific OSDs. This allows the cluster to fine-tune the data distribution to,
+in most cases, uniformly distribute PGs across OSDs.
+
+However, there is an important caveat when it comes to this new feature: it
+requires all clients to understand the new *pg-upmap* structure in the OSDMap.
+
+Online Optimization
+===================
+
+Enabling
+--------
+
+In order to use ``pg-upmap``, the cluster cannot have any pre-Luminous clients.
+By default, new clusters enable the *balancer module*, which makes use of
+``pg-upmap``. If you want to use a different balancer or you want to make your
+own custom ``pg-upmap`` entries, you might want to turn off the balancer in
+order to avoid conflict:
+
+.. prompt:: bash $
+
+ ceph balancer off
+
+To allow use of the new feature on an existing cluster, you must restrict the
+cluster to supporting only Luminous (and newer) clients. To do so, run the
+following command:
+
+.. prompt:: bash $
+
+ ceph osd set-require-min-compat-client luminous
+
+This command will fail if any pre-Luminous clients or daemons are connected to
+the monitors. To see which client versions are in use, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph features
+
+Balancer Module
+---------------
+
+The ``balancer`` module for ``ceph-mgr`` will automatically balance the number
+of PGs per OSD. See :ref:`balancer`.
+
+Offline Optimization
+====================
+
+Upmap entries are updated with an offline optimizer that is built into the
+:ref:`osdmaptool`.
+
+#. Grab the latest copy of your osdmap:
+
+ .. prompt:: bash $
+
+ ceph osd getmap -o om
+
+#. Run the optimizer:
+
+ .. prompt:: bash $
+
+ osdmaptool om --upmap out.txt [--upmap-pool <pool>] \
+ [--upmap-max <max-optimizations>] \
+ [--upmap-deviation <max-deviation>] \
+ [--upmap-active]
+
+ It is highly recommended that optimization be done for each pool
+ individually, or for sets of similarly utilized pools. You can specify the
+ ``--upmap-pool`` option multiple times. "Similarly utilized pools" means
+ pools that are mapped to the same devices and that store the same kind of
+ data (for example, RBD image pools are considered to be similarly utilized;
+ an RGW index pool and an RGW data pool are not considered to be similarly
+ utilized).
+
+   The ``max-optimizations`` value determines the maximum number of upmap
+   entries to identify. The default is ``10`` (as is the case with the
+   ``ceph-mgr`` balancer module), but you should use a larger number if you are
+   doing offline optimization. If the tool cannot find any additional changes
+   to make (that is, if the pool distribution is perfect), it will stop early.
+
+   The ``max-deviation`` value defaults to ``5``. If an OSD's PG count varies
+   from the computed target number by no more than this amount, it will be
+   considered perfect.
+
+ The ``--upmap-active`` option simulates the behavior of the active balancer
+ in upmap mode. It keeps cycling until the OSDs are balanced and reports how
+ many rounds have occurred and how long each round takes. The elapsed time
+ for rounds indicates the CPU load that ``ceph-mgr`` consumes when it computes
+ the next optimization plan.
+
+#. Apply the changes:
+
+ .. prompt:: bash $
+
+ source out.txt
+
+ In the above example, the proposed changes are written to the output file
+ ``out.txt``. The commands in this procedure are normal Ceph CLI commands
+ that can be run in order to apply the changes to the cluster.
+
+The above steps can be repeated as many times as necessary to achieve a perfect
+distribution of PGs for each set of pools.
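+
+The proposed changes written to the output file are ordinary CLI commands. For
+the upmap optimizer these are typically ``ceph osd pg-upmap-items`` commands,
+along the general lines of this illustrative (not real) example::
+
+    ceph osd pg-upmap-items 1.7 32 14
+    ceph osd pg-upmap-items 1.a 27 3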
+
+To see some (gory) details about what the tool is doing, you can pass
+``--debug-osd 10`` to ``osdmaptool``. To see even more details, pass
+``--debug-crush 10`` to ``osdmaptool``.
diff --git a/doc/rados/operations/user-management.rst b/doc/rados/operations/user-management.rst
new file mode 100644
index 000000000..130c02002
--- /dev/null
+++ b/doc/rados/operations/user-management.rst
@@ -0,0 +1,840 @@
+.. _user-management:
+
+=================
+ User Management
+=================
+
+This document describes :term:`Ceph Client` users, and describes the process by
+which they perform authentication and authorization so that they can access the
+:term:`Ceph Storage Cluster`. Users are either individuals or system actors
+(for example, applications) that use Ceph clients to interact with the Ceph
+Storage Cluster daemons.
+
+.. ditaa::
+ +-----+
+ | {o} |
+ | |
+ +--+--+ /---------\ /---------\
+ | | Ceph | | Ceph |
+ ---+---*----->| |<------------->| |
+ | uses | Clients | | Servers |
+ | \---------/ \---------/
+ /--+--\
+ | |
+ | |
+ actor
+
+
+When Ceph runs with authentication and authorization enabled (both are enabled
+by default), you must specify a user name and a keyring that contains the
+secret key of the specified user (usually these are specified via the command
+line). If you do not specify a user name, Ceph will use ``client.admin`` as the
+default user name. If you do not specify a keyring, Ceph will look for a
+keyring via the ``keyring`` setting in the Ceph configuration. For example, if
+you execute the ``ceph health`` command without specifying a user or a keyring,
+Ceph will assume that the keyring is in ``/etc/ceph/ceph.client.admin.keyring``
+and will attempt to use that keyring. The following illustrates this behavior:
+
+.. prompt:: bash $
+
+ ceph health
+
+Ceph will interpret the command like this:
+
+.. prompt:: bash $
+
+ ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health
+
+Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid
+re-entry of the user name and secret.
+
+For details on configuring the Ceph Storage Cluster to use authentication, see
+`Cephx Config Reference`_. For details on the architecture of Cephx, see
+`Architecture - High Availability Authentication`_.
+
+Background
+==========
+
+No matter what type of Ceph client is used (for example: Block Device, Object
+Storage, Filesystem, native API), Ceph stores all data as RADOS objects within
+`pools`_. Ceph users must have access to a given pool in order to read and
+write data, and Ceph users must have execute permissions in order to use Ceph's
+administrative commands. The following concepts will help you understand
+Ceph user management.
+
+.. _rados-ops-user:
+
+User
+----
+
+A user is either an individual or a system actor (for example, an application).
+Creating users allows you to control who (or what) can access your Ceph Storage
+Cluster, its pools, and the data within those pools.
+
+Ceph has the concept of a ``type`` of user. For purposes of user management,
+the type will always be ``client``. Ceph identifies users in a
+period-delimited form that consists of the user type and the user ID: for
+example, ``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for
+user typing is that the Cephx protocol is used not only by clients but also by
+non-clients, such as Ceph Monitors, OSDs, and Metadata Servers. Distinguishing
+the user type helps to distinguish between client users and other users. This
+distinction streamlines access control, user monitoring, and traceability.
+
+Sometimes Ceph's user type might seem confusing, because the Ceph command line
+allows you to specify a user with or without the type, depending upon your
+command line usage. If you specify ``--user`` or ``--id``, you can omit the
+type. For example, ``client.user1`` can be entered simply as ``user1``. On the
+other hand, if you specify ``--name`` or ``-n``, you must supply the type and
+name: for example, ``client.user1``. We recommend using the type and name as a
+best practice wherever possible.
+
+.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage
+ user or a Ceph File System user. The Ceph Object Gateway uses a Ceph Storage
+ Cluster user to communicate between the gateway daemon and the storage
+ cluster, but the Ceph Object Gateway has its own user-management
+ functionality for end users. The Ceph File System uses POSIX semantics, and
+ the user space associated with the Ceph File System is not the same as the
+ user space associated with a Ceph Storage Cluster user.
+
+Authorization (Capabilities)
+----------------------------
+
+Ceph uses the term "capabilities" (caps) to describe the permissions granted to
+an authenticated user to exercise the functionality of the monitors, OSDs, and
+metadata servers. Capabilities can also restrict access to data within a pool,
+a namespace within a pool, or a set of pools based on their application tags.
+A Ceph administrative user specifies the capabilities of a user when creating
+or updating that user.
+
+Capability syntax follows this form::
+
+ {daemon-type} '{cap-spec}[, {cap-spec} ...]'
+
+- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access
+ settings, and can be applied in aggregate from pre-defined profiles with
+ ``profile {name}``. For example::
+
+ mon 'allow {access-spec} [network {network/prefix}]'
+
+ mon 'profile {name}'
+
+ The ``{access-spec}`` syntax is as follows: ::
+
+ * | all | [r][w][x]
+
+ The optional ``{network/prefix}`` is a standard network name and prefix
+ length in CIDR notation (for example, ``10.3.0.0/16``). If
+ ``{network/prefix}`` is present, the monitor capability can be used only by
+ clients that connect from the specified network.
+
+- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, and
+ ``class-read`` and ``class-write`` access settings. OSD capabilities can be
+ applied in aggregate from pre-defined profiles with ``profile {name}``. In
+ addition, OSD capabilities allow for pool and namespace settings. ::
+
+ osd 'allow {access-spec} [{match-spec}] [network {network/prefix}]'
+
+ osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]] [network {network/prefix}]'
+
+ There are two alternative forms of the ``{access-spec}`` syntax: ::
+
+ * | all | [r][w][x] [class-read] [class-write]
+
+ class {class name} [{method name}]
+
+ There are two alternative forms of the optional ``{match-spec}`` syntax::
+
+ pool={pool-name} [namespace={namespace-name}] [object_prefix {prefix}]
+
+ [namespace={namespace-name}] tag {application} {key}={value}
+
+ The optional ``{network/prefix}`` is a standard network name and prefix
+ length in CIDR notation (for example, ``10.3.0.0/16``). If
+ ``{network/prefix}`` is present, the OSD capability can be used only by
+ clients that connect from the specified network.
+
+- **Manager Caps:** Manager (``ceph-mgr``) capabilities include ``r``, ``w``,
+ ``x`` access settings, and can be applied in aggregate from pre-defined
+ profiles with ``profile {name}``. For example::
+
+ mgr 'allow {access-spec} [network {network/prefix}]'
+
+ mgr 'profile {name} [{key1} {match-type} {value1} ...] [network {network/prefix}]'
+
+ Manager capabilities can also be specified for specific commands, for all
+ commands exported by a built-in manager service, or for all commands exported
+ by a specific add-on module. For example::
+
+ mgr 'allow command "{command-prefix}" [with {key1} {match-type} {value1} ...] [network {network/prefix}]'
+
+ mgr 'allow service {service-name} {access-spec} [network {network/prefix}]'
+
+ mgr 'allow module {module-name} [with {key1} {match-type} {value1} ...] {access-spec} [network {network/prefix}]'
+
+ The ``{access-spec}`` syntax is as follows: ::
+
+ * | all | [r][w][x]
+
+ The ``{service-name}`` is one of the following: ::
+
+ mgr | osd | pg | py
+
+ The ``{match-type}`` is one of the following: ::
+
+ = | prefix | regex
+
+- **Metadata Server Caps:** For administrators, use ``allow *``. For all other
+ users (for example, CephFS clients), consult :doc:`/cephfs/client-auth`
+
+.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the
+ Ceph Storage Cluster. For this reason, it is not represented as
+ a Ceph Storage Cluster daemon type.
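+
+Putting the syntax together, a capability specification for a hypothetical
+client might look like the following sketch (the user name, pool name, and
+network are illustrative):
+
+.. prompt:: bash $
+
+   ceph auth get-or-create client.foo mon 'allow r network 10.3.0.0/16' osd 'allow rw pool=mypool network 10.3.0.0/16'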
+
+The following entries describe access capabilities.
+
+``allow``
+
+:Description: Precedes access settings for a daemon. Implies ``rw``
+ for MDS only.
+
+
+``r``
+
+:Description: Gives the user read access. Required with monitors to retrieve
+ the CRUSH map.
+
+
+``w``
+
+:Description: Gives the user write access to objects.
+
+
+``x``
+
+:Description: Gives the user the capability to call class methods
+ (that is, both read and write) and to conduct ``auth``
+ operations on monitors.
+
+
+``class-read``
+
+:Description: Gives the user the capability to call class read methods.
+ Subset of ``x``.
+
+
+``class-write``
+
+:Description: Gives the user the capability to call class write methods.
+ Subset of ``x``.
+
+
+``*``, ``all``
+
+:Description: Gives the user read, write, and execute permissions for a
+ particular daemon/pool, as well as the ability to execute
+ admin commands.
+
+
+The following entries describe valid capability profiles:
+
+``profile osd`` (Monitor only)
+
+:Description: Gives a user permissions to connect as an OSD to other OSDs or
+              monitors. Conferred on OSDs in order to enable them to handle
+              replication, heartbeat traffic, and status reporting.
+
+
+``profile mds`` (Monitor only)
+
+:Description: Gives a user permissions to connect as an MDS to other MDSs or
+ monitors.
+
+
+``profile bootstrap-osd`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap an OSD. Conferred on
+ deployment tools such as ``ceph-volume`` and ``cephadm``
+ so that they have permissions to add keys when
+ bootstrapping an OSD.
+
+
+``profile bootstrap-mds`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap a metadata server.
+ Conferred on deployment tools such as ``cephadm``
+ so that they have permissions to add keys when bootstrapping
+ a metadata server.
+
+``profile bootstrap-rbd`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap an RBD user.
+ Conferred on deployment tools such as ``cephadm``
+ so that they have permissions to add keys when bootstrapping
+ an RBD user.
+
+``profile bootstrap-rbd-mirror`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap an ``rbd-mirror`` daemon
+ user. Conferred on deployment tools such as ``cephadm`` so that
+ they have permissions to add keys when bootstrapping an
+ ``rbd-mirror`` daemon.
+
+``profile rbd`` (Manager, Monitor, and OSD)
+
+:Description: Gives a user permissions to manipulate RBD images. When used as a
+ Monitor cap, it provides the user with the minimal privileges
+ required by an RBD client application; such privileges include
+ the ability to blocklist other client users. When used as an OSD
+ cap, it provides an RBD client application with read-write access
+ to the specified pool. The Manager cap supports optional ``pool``
+ and ``namespace`` keyword arguments.
+
+``profile rbd-mirror`` (Monitor only)
+
+:Description: Gives a user permissions to manipulate RBD images and retrieve
+ RBD mirroring config-key secrets. It provides the minimal
+ privileges required for the user to manipulate the ``rbd-mirror``
+ daemon.
+
+``profile rbd-read-only`` (Manager and OSD)
+
+:Description: Gives a user read-only permissions to RBD images. The Manager cap
+ supports optional ``pool`` and ``namespace`` keyword arguments.
+
+``profile simple-rados-client`` (Monitor only)
+
+:Description: Gives a user read-only permissions for monitor, OSD, and PG data.
+ Intended for use by direct librados client applications.
+
+``profile simple-rados-client-with-blocklist`` (Monitor only)
+
+:Description: Gives a user read-only permissions for monitor, OSD, and PG data.
+ Intended for use by direct librados client applications. Also
+ includes permissions to add blocklist entries to build
+ high-availability (HA) applications.
+
+``profile fs-client`` (Monitor only)
+
+:Description: Gives a user read-only permissions for monitor, OSD, PG, and MDS
+ data. Intended for CephFS clients.
+
+``profile role-definer`` (Monitor and Auth)
+
+:Description: Gives a user **all** permissions for the auth subsystem, read-only
+ access to monitors, and nothing else. Useful for automation
+ tools. Do not assign this unless you really, **really** know what
+ you're doing, as the security ramifications are substantial and
+ pervasive.
+
+``profile crash`` (Monitor and MGR)
+
+:Description: Gives a user read-only access to monitors. Used in conjunction
+ with the manager ``crash`` module to upload daemon crash
+ dumps into monitor storage for later analysis.
+
+Pool
+----
+
+A pool is a logical partition where users store data.
+In Ceph deployments, it is common to create a pool as a logical partition for
+similar types of data. For example, when deploying Ceph as a back end for
+OpenStack, a typical deployment would have pools for volumes, images, backups
+and virtual machines, and such users as ``client.glance`` and ``client.cinder``.
+
+Application Tags
+----------------
+
+Access may be restricted to specific pools as defined by their application
+metadata. The ``*`` wildcard may be used for the ``key`` argument, the
+``value`` argument, or both. The ``all`` tag is a synonym for ``*``.
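+
+For example, the following hypothetical command (the user name is illustrative)
+grants read and write access to every pool that is tagged with the ``rbd``
+application, regardless of the tag's key and value:
+
+.. prompt:: bash $
+
+   ceph auth get-or-create client.rbduser mon 'allow r' osd 'allow rw tag rbd *=*'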
+
+Namespace
+---------
+
+Objects within a pool can be associated to a namespace: that is, to a logical group of
+objects within the pool. A user's access to a pool can be associated with a
+namespace so that reads and writes by the user can take place only within the
+namespace. Objects written to a namespace within the pool can be accessed only
+by users who have access to the namespace.
+
+.. note:: Namespaces are primarily useful for applications written on top of
+ ``librados``. In such situations, the logical grouping provided by
+ namespaces can obviate the need to create different pools. In Luminous and
+ later releases, Ceph Object Gateway uses namespaces for various metadata
+ objects.
+
+The rationale for namespaces is this: namespaces are relatively less
+computationally expensive than pools, which can be a computationally expensive
+method of segregating data sets between different authorized users.
+
+For example, a cluster ought to host approximately 100 placement-group replicas
+per OSD. A cluster with 1000 OSDs therefore hosts roughly 100,000
+placement-group replicas in total, which (with three-way replication)
+corresponds to roughly 33,333 placement groups shared among all of its pools.
+
+By contrast, writing an object to a namespace simply associates the namespace
+to the object name without incurring the computational overhead of a separate
+pool. Instead of creating a separate pool for a user or set of users, you can
+use a namespace.
+
+.. note::
+
+ Namespaces are available only when using ``librados``.
+
+
+Access may be restricted to specific RADOS namespaces by use of the ``namespace``
+capability. Limited globbing of namespaces (that is, use of wildcards (``*``)) is supported: if the last character
+of the specified namespace is ``*``, then access is granted to any namespace
+starting with the provided argument.
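+
+For example, the following hypothetical command (the user, pool, and namespace
+names are illustrative) restricts a client's reads and writes to a single
+namespace within one pool:
+
+.. prompt:: bash $
+
+   ceph auth get-or-create client.app1 mon 'allow r' osd 'allow rw pool=mypool namespace=app1-ns'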
+
+Managing Users
+==============
+
+User management functionality provides Ceph Storage Cluster administrators with
+the ability to create, update, and delete users directly in the Ceph Storage
+Cluster.
+
+When you create or delete users in the Ceph Storage Cluster, you might need to
+distribute keys to clients so that they can be added to keyrings. For details, see `Keyring
+Management`_.
+
+Listing Users
+-------------
+
+To list the users in your cluster, run the following command:
+
+.. prompt:: bash $
+
+ ceph auth ls
+
+Ceph will list all users in your cluster. For example, in a two-node
+cluster, ``ceph auth ls`` will provide an output that resembles the following::
+
+ installed auth entries:
+
+ osd.0
+ key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
+ caps: [mon] allow profile osd
+ caps: [osd] allow *
+ osd.1
+ key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
+ caps: [mon] allow profile osd
+ caps: [osd] allow *
+ client.admin
+ key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
+ caps: [mds] allow
+ caps: [mon] allow *
+ caps: [osd] allow *
+ client.bootstrap-mds
+ key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww==
+ caps: [mon] allow profile bootstrap-mds
+ client.bootstrap-osd
+ key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw==
+ caps: [mon] allow profile bootstrap-osd
+
+Note that, according to the ``TYPE.ID`` notation for users, ``osd.0`` is a
+user of type ``osd`` with an ID of ``0``, and ``client.admin`` is a user of
+type ``client`` with an ID of ``admin`` (that is, the default ``client.admin``
+user).
+Note too that each entry has a ``key: <value>`` entry, and also has one or more
+``caps:`` entries.
+
+To save the output of ``ceph auth ls`` to a file, use the ``-o {filename}`` option.
+
+
+Getting a User
+--------------
+
+To retrieve a specific user, key, and capabilities, run the following command:
+
+.. prompt:: bash $
+
+ ceph auth get {TYPE.ID}
+
+For example:
+
+.. prompt:: bash $
+
+ ceph auth get client.admin
+
+To save the output of ``ceph auth get`` to a file, use the ``-o {filename}`` option. Developers may also run the following command:
+
+.. prompt:: bash $
+
+ ceph auth export {TYPE.ID}
+
+The ``auth export`` command is identical to ``auth get``.
+
+.. _rados_ops_adding_a_user:
+
+Adding a User
+-------------
+
+Adding a user creates a user name (that is, ``TYPE.ID``), a secret key, and
+any capabilities specified in the command that creates the user.
+
+A user's key allows the user to authenticate with the Ceph Storage Cluster.
+The user's capabilities authorize the user to read, write, or execute on Ceph
+monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``).
+
+There are a few ways to add a user:
+
+- ``ceph auth add``: This command is the canonical way to add a user. It
+ will create the user, generate a key, and add any specified capabilities.
+
+- ``ceph auth get-or-create``: This command is often the most convenient way
+ to create a user, because it returns a keyfile format with the user name
+ (in brackets) and the key. If the user already exists, this command
+ simply returns the user name and key in the keyfile format. To save the output to
+ a file, use the ``-o {filename}`` option.
+
+- ``ceph auth get-or-create-key``: This command is a convenient way to create
+ a user and return the user's key and nothing else. This is useful for clients that
+ need only the key (for example, libvirt). If the user already exists, this command
+ simply returns the key. To save the output to
+ a file, use the ``-o {filename}`` option.
+
+It is possible, when creating client users, to create a user with no capabilities. A user
+with no capabilities is useless beyond mere authentication, because the client
+cannot retrieve the cluster map from the monitor. However, you might want to create a user
+with no capabilities and wait until later to add capabilities to the user by using the ``ceph auth caps`` command.
+
+A typical user has at least read capabilities on the Ceph monitor and
+read and write capabilities on Ceph OSDs. A user's OSD permissions
+are often restricted so that the user can access only one particular pool.
+In the following example, the four commands (in order):
+
+#. add a client named ``john`` that has read capabilities on the Ceph monitor
+   and read and write capabilities on the pool named ``liverpool``;
+#. authorize a client named ``paul`` to have read capabilities on the Ceph
+   monitor and read and write capabilities on the pool named ``liverpool``;
+#. authorize a client named ``george`` to have read capabilities on the Ceph
+   monitor and read and write capabilities on the pool named ``liverpool``,
+   saving the output to a keyring file named ``george.keyring``;
+#. authorize a client named ``ringo`` to have read capabilities on the Ceph
+   monitor and read and write capabilities on the pool named ``liverpool``,
+   saving the key to a file named ``ringo.key``.
+
+.. prompt:: bash $
+
+ ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring
+ ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key
+
+.. important:: Any user that has capabilities on OSDs will have access to ALL pools in the cluster
+ unless that user's access has been restricted to a proper subset of the pools in the cluster.
+
+
+.. _modify-user-capabilities:
+
+Modifying User Capabilities
+---------------------------
+
+The ``ceph auth caps`` command allows you to specify a user and change that
+user's capabilities. Setting new capabilities will overwrite current capabilities.
+To view current capabilities, run ``ceph auth get USERTYPE.USERID``.
+To add capabilities, run a command of the following form (and be sure to specify the existing capabilities):
+
+.. prompt:: bash $
+
+ ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]']
+
+For example:
+
+.. prompt:: bash $
+
+ ceph auth get client.john
+ ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool'
+ ceph auth caps client.brian-manager mon 'allow *' osd 'allow *'
+
+For additional details on capabilities, see `Authorization (Capabilities)`_.
+
+Deleting a User
+---------------
+
+To delete a user, use ``ceph auth del``:
+
+.. prompt:: bash $
+
+ ceph auth del {TYPE}.{ID}
+
+Here ``{TYPE}`` is either ``client``, ``osd``, ``mon``, or ``mds``,
+and ``{ID}`` is the user name or the ID of the daemon.
+
+
+Printing a User's Key
+---------------------
+
+To print a user's authentication key to standard output, run the following command:
+
+.. prompt:: bash $
+
+ ceph auth print-key {TYPE}.{ID}
+
+Here ``{TYPE}`` is either ``client``, ``osd``, ``mon``, or ``mds``,
+and ``{ID}`` is the user name or the ID of the daemon.
+
+When it is necessary to populate client software with a user's key (as in the case of libvirt),
+you can print the user's key by running the following command:
+
+.. prompt:: bash $
+
+ mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user`
+
+Importing a User
+----------------
+
+To import one or more users, use ``ceph auth import`` and
+specify a keyring as follows:
+
+.. prompt:: bash $
+
+ ceph auth import -i /path/to/keyring
+
+For example:
+
+.. prompt:: bash $
+
+ sudo ceph auth import -i /etc/ceph/ceph.keyring
+
+.. note:: The Ceph storage cluster will add new users, their keys, and their
+ capabilities and will update existing users, their keys, and their
+ capabilities.
+
+Keyring Management
+==================
+
+When you access Ceph via a Ceph client, the Ceph client will look for a local
+keyring. Ceph presets the ``keyring`` setting with four keyring
+names by default. For this reason, you do not have to set the keyring names in your Ceph configuration file
+unless you want to override these defaults (which is not recommended). The four default keyring names are as follows:
+
+- ``/etc/ceph/$cluster.$name.keyring``
+- ``/etc/ceph/$cluster.keyring``
+- ``/etc/ceph/keyring``
+- ``/etc/ceph/keyring.bin``
+
+The ``$cluster`` metavariable found in the first two default keyring names above
+is your Ceph cluster name as defined by the name of the Ceph configuration
+file: for example, if the Ceph configuration file is named ``ceph.conf``,
+then your Ceph cluster name is ``ceph`` and the second name above would be
+``ceph.keyring``. The ``$name`` metavariable is the user type and user ID:
+for example, given the user ``client.admin``, the first name above would be
+``ceph.client.admin.keyring``.
+
+.. note:: When running commands that read or write to ``/etc/ceph``, you might
+ need to use ``sudo`` to run the command as ``root``.
+
+After you create a user (for example, ``client.ringo``), you must get the key and add
+it to a keyring on a Ceph client so that the user can access the Ceph Storage
+Cluster.
+
+The `User Management`_ section details how to list, get, add, modify, and delete
+users directly in the Ceph Storage Cluster. In addition, Ceph provides the
+``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client.
+
+Creating a Keyring
+------------------
+
+When you use the procedures in the `Managing Users`_ section to create users,
+you must provide user keys to the Ceph client(s). This is required so that the Ceph client(s)
+can retrieve the key for the specified user and authenticate that user against the Ceph
+Storage Cluster. Ceph clients access keyrings in order to look up a user name and
+retrieve the user's key.
+
+The ``ceph-authtool`` utility allows you to create a keyring. To create an
+empty keyring, use ``--create-keyring`` or ``-C``. For example:
+
+.. prompt:: bash $
+
+ ceph-authtool --create-keyring /path/to/keyring
+
+When creating a keyring with multiple users, we recommend using the cluster name
+(of the form ``$cluster.keyring``) for the keyring filename and saving the keyring in the
+``/etc/ceph`` directory. By doing this, you ensure that the ``keyring`` configuration default setting
+will pick up the filename without requiring you to specify the filename in the local copy
+of your Ceph configuration file. For example, you can create ``ceph.keyring`` by
+running the following command:
+
+.. prompt:: bash $
+
+ sudo ceph-authtool -C /etc/ceph/ceph.keyring
+
+When creating a keyring with a single user, we recommend using the cluster name,
+the user type, and the user name, and saving the keyring in the ``/etc/ceph`` directory.
+For example, we recommend that the ``client.admin`` user use ``ceph.client.admin.keyring``.
+
+To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means
+that the file will have ``rw`` permissions for the ``root`` user only, which is
+appropriate when the keyring contains administrator keys. However, if you
+intend to use the keyring for a particular user or group of users, be sure to use ``chown`` or ``chmod`` to establish appropriate keyring
+ownership and access.
+
+Adding a User to a Keyring
+--------------------------
+
+When you :ref:`Add a user<rados_ops_adding_a_user>` to the Ceph Storage
+Cluster, you can use the `Getting a User`_ procedure to retrieve a user, key,
+and capabilities and then save the user to a keyring.
+
+If you want to use only one user per keyring, the `Getting a User`_ procedure with
+the ``-o`` option will save the output in the keyring file format. For example,
+to create a keyring for the ``client.admin`` user, run the following command:
+
+.. prompt:: bash $
+
+ sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
+
+Notice that the file format in this command is the file format conventionally used when manipulating the keyrings of individual users.
+
+If you want to import users to a keyring, you can use ``ceph-authtool``
+to specify the destination keyring and the source keyring.
+For example:
+
+.. prompt:: bash $
+
+ sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
+
+Creating a User
+---------------
+
+Ceph provides the `Adding a User`_ function to create a user directly in the Ceph
+Storage Cluster. However, you can also create a user, keys, and capabilities
+directly on a Ceph client keyring, and then import the user to the Ceph
+Storage Cluster. For example:
+
+.. prompt:: bash $
+
+ sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring
+
+For additional details on capabilities, see `Authorization (Capabilities)`_.
+
+You can also create a keyring and add a new user to the keyring simultaneously.
+For example:
+
+.. prompt:: bash $
+
+ sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key
+
+In the above examples, the new user ``client.ringo`` has been added only to the
+keyring. The new user has not been added to the Ceph Storage Cluster.
+
+To add the new user ``client.ringo`` to the Ceph Storage Cluster, run the following command:
+
+.. prompt:: bash $
+
+ sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring
+
+Modifying a User
+----------------
+
+To modify the capabilities of a user record in a keyring, specify the keyring
+and the user, followed by the capabilities. For example:
+
+.. prompt:: bash $
+
+ sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx'
+
+To propagate the change to the Ceph Storage Cluster, you must import the
+updated keyring entry into the Ceph Storage Cluster's copy of the user. To do
+so, run the following command:
+
+.. prompt:: bash $
+
+ sudo ceph auth import -i /etc/ceph/ceph.keyring
+
+For details on updating a Ceph Storage Cluster user from a
+keyring, see `Importing a User`_.
+
+You may also :ref:`Modify user capabilities<modify-user-capabilities>` directly in the cluster, store the
+results to a keyring file, and then import the keyring into your main
+``ceph.keyring`` file.
+
+Command Line Usage
+==================
+
+Ceph supports the following usage for user name and secret:
+
+``--id`` | ``--user``
+
+:Description: Ceph identifies users with a type and an ID: the form of this user identification is ``TYPE.ID``, and examples of the type and ID are
+ ``client.admin`` and ``client.user1``. The ``id``, ``name`` and
+ ``-n`` options allow you to specify the ID portion of the user
+ name (for example, ``admin``, ``user1``, ``foo``). You can specify
+ the user with the ``--id`` and omit the type. For example,
+ to specify user ``client.foo``, run the following commands:
+
+ .. prompt:: bash $
+
+ ceph --id foo --keyring /path/to/keyring health
+ ceph --user foo --keyring /path/to/keyring health
+
+
+``--name`` | ``-n``
+
+:Description: Ceph identifies users with a type and an ID: the form of this user identification is ``TYPE.ID``, and examples of the type and ID are
+ ``client.admin`` and ``client.user1``. The ``--name`` and ``-n``
+ options allow you to specify the fully qualified user name.
+ You are required to specify the user type (typically ``client``) with the
+ user ID. For example:
+
+ .. prompt:: bash $
+
+ ceph --name client.foo --keyring /path/to/keyring health
+ ceph -n client.foo --keyring /path/to/keyring health
+
+
+``--keyring``
+
+:Description: The path to the keyring that contains one or more user names and
+ secrets. The ``--secret`` option provides the same functionality,
+ but it does not work with Ceph RADOS Gateway, which uses
+ ``--secret`` for another purpose. You may retrieve a keyring with
+ ``ceph auth get-or-create`` and store it locally. This is a
+ preferred approach, because you can switch user names without
+ switching the keyring path. For example:
+
+ .. prompt:: bash $
+
+ sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage
+
+
+.. _pools: ../pools
+
+Limitations
+===========
+
+The ``cephx`` protocol authenticates Ceph clients and servers to each other. It
+is not intended to handle authentication of human users or application programs
+that are run on their behalf. If your access control
+needs require that kind of authentication, you will need to have some other mechanism, which is likely to be specific to the
+front end that is used to access the Ceph object store. This other mechanism would ensure that only acceptable users and programs are able to run on the
+machine that Ceph permits to access its object store.
+
+The keys used to authenticate Ceph clients and servers are typically stored in
+a plain text file on a trusted host. Appropriate permissions must be set on the plain text file.
+
+.. important:: Storing keys in plaintext files has security shortcomings, but
+ they are difficult to avoid, given the basic authentication methods Ceph
+ uses in the background. Anyone setting up Ceph systems should be aware of
+ these shortcomings.
+
+In particular, user machines, especially portable machines, should not
+be configured to interact directly with Ceph, since that mode of use would
+require the storage of a plaintext authentication key on an insecure machine.
+Anyone who stole that machine or obtained access to it could
+obtain a key that allows them to authenticate their own machines to Ceph.
+
+Instead of permitting potentially insecure machines to access a Ceph object
+store directly, you should require users to sign in to a trusted machine in
+your environment, using a method that provides sufficient security for your
+purposes. That trusted machine will store the plaintext Ceph keys for the
+human users. A future version of Ceph might address these particular
+authentication issues more fully.
+
+At present, none of the Ceph authentication protocols provide secrecy for
+messages in transit. As a result, an eavesdropper on the wire can hear and understand
+all data sent between clients and servers in Ceph, even if the eavesdropper cannot create or
+alter the data. Similarly, Ceph does not include options to encrypt user data in the
+object store. Users can, of course, hand-encrypt and store their own data in the Ceph
+object store, but Ceph itself provides no features to perform object
+encryption. Anyone storing sensitive data in Ceph should consider
+encrypting their data before providing it to the Ceph system.
+
+
+.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication
+.. _Cephx Config Reference: ../../configuration/auth-config-ref