authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
commite6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree64f88b554b444a49f656b6c656111a145cbbaa28 /doc/rados/configuration
parentInitial commit. (diff)
Adding upstream version 18.2.2.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat
-rw-r--r--doc/rados/configuration/auth-config-ref.rst379
-rw-r--r--doc/rados/configuration/bluestore-config-ref.rst552
-rw-r--r--doc/rados/configuration/ceph-conf.rst715
-rw-r--r--doc/rados/configuration/common.rst207
-rw-r--r--doc/rados/configuration/demo-ceph.conf31
-rw-r--r--doc/rados/configuration/filestore-config-ref.rst377
-rw-r--r--doc/rados/configuration/general-config-ref.rst19
-rw-r--r--doc/rados/configuration/index.rst53
-rw-r--r--doc/rados/configuration/journal-ref.rst39
-rw-r--r--doc/rados/configuration/mclock-config-ref.rst699
-rw-r--r--doc/rados/configuration/mon-config-ref.rst642
-rw-r--r--doc/rados/configuration/mon-lookup-dns.rst58
-rw-r--r--doc/rados/configuration/mon-osd-interaction.rst245
-rw-r--r--doc/rados/configuration/msgr2.rst257
-rw-r--r--doc/rados/configuration/network-config-ref.rst355
-rw-r--r--doc/rados/configuration/osd-config-ref.rst445
-rw-r--r--doc/rados/configuration/pool-pg-config-ref.rst46
-rw-r--r--doc/rados/configuration/pool-pg.conf21
-rw-r--r--doc/rados/configuration/storage-devices.rst93
19 files changed, 5233 insertions, 0 deletions
diff --git a/doc/rados/configuration/auth-config-ref.rst b/doc/rados/configuration/auth-config-ref.rst
new file mode 100644
index 000000000..fc14f4ee6
--- /dev/null
+++ b/doc/rados/configuration/auth-config-ref.rst
@@ -0,0 +1,379 @@
+.. _rados-cephx-config-ref:
+
+========================
+ CephX Config Reference
+========================
+
+The CephX protocol is enabled by default. The cryptographic authentication that
+CephX provides has some computational costs, though they should generally be
+quite low. If the network environment connecting your client and server hosts
+is very safe and you cannot afford authentication, you can disable it.
+**Disabling authentication is not generally recommended**.
+
+.. note:: If you disable authentication, you will be at risk of a
+ man-in-the-middle attack that alters your client/server messages, which
+ could have disastrous security effects.
+
+For information about creating users, see `User Management`_. For details on
+the architecture of CephX, see `Architecture - High Availability
+Authentication`_.
+
+
+Deployment Scenarios
+====================
+
+How you initially configure CephX depends on your scenario. There are two
+common strategies for deploying a Ceph cluster. If you are a first-time Ceph
+user, you should probably take the easiest approach: using ``cephadm`` to
+deploy a cluster. But if your cluster uses other deployment tools (for example,
+Ansible, Chef, Juju, or Puppet), you will need either to use the manual
+deployment procedures or to configure your deployment tool so that it will
+bootstrap your monitor(s).
+
+Manual Deployment
+-----------------
+
+When you deploy a cluster manually, it is necessary to bootstrap the monitors
+manually and to create the ``client.admin`` user and keyring. To bootstrap
+monitors, follow the steps in `Monitor Bootstrapping`_. Follow these steps when
+using third-party deployment tools (for example, Chef, Puppet, and Juju).
+
+
+Enabling/Disabling CephX
+========================
+
+Enabling CephX is possible only if the keys for your monitors, OSDs, and
+metadata servers have already been deployed. If you are simply toggling CephX
+on or off, it is not necessary to repeat the bootstrapping procedures.
+
+Enabling CephX
+--------------
+
+When CephX is enabled, Ceph will look for the keyring in the default search
+path: this path includes ``/etc/ceph/$cluster.$name.keyring``. It is possible
+to override this search-path location by adding a ``keyring`` option in the
+``[global]`` section of your `Ceph configuration`_ file, but this is not
+recommended.
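+
+For illustration only, such an override would look like the following in the
+configuration file (the path shown here is a placeholder, and again this
+approach is not recommended):
+
+.. code-block:: ini
+
+    [global]
+    keyring = /etc/ceph/custom.keyring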
+
+To enable CephX on a cluster for which authentication has been disabled, carry
+out the following procedure. If you (or your deployment utility) have already
+generated the keys, you may skip the steps related to generating keys.
+
+#. Create a ``client.admin`` key, and save a copy of the key for your client
+ host:
+
+ .. prompt:: bash $
+
+ ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring
+
+ **Warning:** This step will clobber any existing
+   ``/etc/ceph/ceph.client.admin.keyring`` file. Do not perform this step if a
+ deployment tool has already generated a keyring file for you. Be careful!
+
+#. Create a monitor keyring and generate a monitor secret key:
+
+ .. prompt:: bash $
+
+ ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
+
+#. For each monitor, copy the monitor keyring into a ``ceph.mon.keyring`` file
+ in the monitor's ``mon data`` directory. For example, to copy the monitor
+ keyring to ``mon.a`` in a cluster called ``ceph``, run the following
+ command:
+
+ .. prompt:: bash $
+
+ cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring
+
+#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter:
+
+ .. prompt:: bash $
+
+ ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring
+
+#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:
+
+ .. prompt:: bash $
+
+ ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring
+
+#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:
+
+ .. prompt:: bash $
+
+ ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring
+
+#. Enable CephX authentication by setting the following options in the
+ ``[global]`` section of your `Ceph configuration`_ file:
+
+ .. code-block:: ini
+
+ auth_cluster_required = cephx
+ auth_service_required = cephx
+ auth_client_required = cephx
+
+#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_.
+
+For details on bootstrapping a monitor manually, see `Manual Deployment`_.
+
+
+
+Disabling CephX
+---------------
+
+The following procedure describes how to disable CephX. If your cluster
+environment is safe, you might want to disable CephX in order to offset the
+computational expense of running authentication. **We do not recommend doing
+so.** However, setup and troubleshooting might be easier if authentication is
+temporarily disabled and subsequently re-enabled.
+
+#. Disable CephX authentication by setting the following options in the
+ ``[global]`` section of your `Ceph configuration`_ file:
+
+ .. code-block:: ini
+
+ auth_cluster_required = none
+ auth_service_required = none
+ auth_client_required = none
+
+#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_.
+
+
+Configuration Settings
+======================
+
+Enablement
+----------
+
+
+``auth_cluster_required``
+
+:Description: If this configuration setting is enabled, the Ceph Storage
+ Cluster daemons (that is, ``ceph-mon``, ``ceph-osd``,
+ ``ceph-mds``, and ``ceph-mgr``) are required to authenticate with
+ each other. Valid settings are ``cephx`` or ``none``.
+
+:Type: String
+:Required: No
+:Default: ``cephx``.
+
+
+``auth_service_required``
+
+:Description: If this configuration setting is enabled, then Ceph clients can
+ access Ceph services only if those clients authenticate with the
+ Ceph Storage Cluster. Valid settings are ``cephx`` or ``none``.
+
+:Type: String
+:Required: No
+:Default: ``cephx``.
+
+
+``auth_client_required``
+
+:Description: If this configuration setting is enabled, then communication
+ between the Ceph client and Ceph Storage Cluster can be
+ established only if the Ceph Storage Cluster authenticates
+ against the Ceph client. Valid settings are ``cephx`` or
+ ``none``.
+
+:Type: String
+:Required: No
+:Default: ``cephx``.
+
+
+.. index:: keys; keyring
+
+Keys
+----
+
+When Ceph is run with authentication enabled, ``ceph`` administrative commands
+and Ceph clients can access the Ceph Storage Cluster only if they use
+authentication keys.
+
+The most common way to make these keys available to ``ceph`` administrative
+commands and Ceph clients is to include a Ceph keyring under the ``/etc/ceph``
+directory. For Octopus and later releases that use ``cephadm``, the filename is
+usually ``ceph.client.admin.keyring``. If the keyring is included in the
+``/etc/ceph`` directory, then it is unnecessary to specify a ``keyring`` entry
+in the Ceph configuration file.
+
+Because the Ceph Storage Cluster's keyring file contains the ``client.admin``
+key, we recommend copying the keyring file to nodes from which you run
+administrative commands.
+
+To perform this step manually, run the following command:
+
+.. prompt:: bash $
+
+ sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
+
+.. tip:: Make sure that the ``ceph.client.admin.keyring`` file has appropriate
+   permissions (for example, ``chmod 644``) set on your client machine.
+
+You can specify the key itself by using the ``key`` setting in the Ceph
+configuration file (this approach is not recommended), or instead specify a
+path to a keyfile by using the ``keyfile`` setting in the Ceph configuration
+file.
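+
+As an illustrative sketch only (the path shown is a placeholder), a ``keyfile``
+entry for the ``client.admin`` user might look like this:
+
+.. code-block:: ini
+
+    [client.admin]
+    keyfile = /etc/ceph/client.admin.secret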
+
+``keyring``
+
+:Description: The path to the keyring file.
+:Type: String
+:Required: No
+:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin``
+
+
+``keyfile``
+
+:Description: The path to a keyfile (that is, a file containing only the key).
+:Type: String
+:Required: No
+:Default: None
+
+
+``key``
+
+:Description: The key (that is, the text string of the key itself). We do not
+ recommend that you use this setting unless you know what you're
+ doing.
+:Type: String
+:Required: No
+:Default: None
+
+
+Daemon Keyrings
+---------------
+
+Administrative users or deployment tools (for example, ``cephadm``) generate
+daemon keyrings in the same way that they generate user keyrings. By default,
+Ceph stores the keyring of a daemon inside that daemon's data directory. The
+default keyring locations and the capabilities that are necessary for the
+daemon to function are shown below.
+
+``ceph-mon``
+
+:Location: ``$mon_data/keyring``
+:Capabilities: ``mon 'allow *'``
+
+``ceph-osd``
+
+:Location: ``$osd_data/keyring``
+:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'``
+
+``ceph-mds``
+
+:Location: ``$mds_data/keyring``
+:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'``
+
+``ceph-mgr``
+
+:Location: ``$mgr_data/keyring``
+:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'``
+
+``radosgw``
+
+:Location: ``$rgw_data/keyring``
+:Capabilities: ``mon 'allow rwx' osd 'allow rwx'``
+
+
+.. note:: The monitor keyring (that is, ``mon.``) contains a key but no
+ capabilities, and this keyring is not part of the cluster ``auth`` database.
+
+The daemon's data-directory locations default to directories of the form::
+
+ /var/lib/ceph/$type/$cluster-$id
+
+For example, ``osd.12`` would have the following data directory::
+
+ /var/lib/ceph/osd/ceph-12
+
+It is possible to override these locations, but it is not recommended.
+
+
+.. index:: signatures
+
+Signatures
+----------
+
+Ceph performs a signature check that provides some limited protection against
+messages being tampered with in flight (for example, by a "man in the middle"
+attack).
+
+As with other parts of Ceph authentication, signatures admit of fine-grained
+control. You can enable or disable signatures for service messages between
+clients and Ceph, and for messages between Ceph daemons.
+
+Note that even when signatures are enabled, data is not encrypted in flight.
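+
+For example, requiring signatures for all CephX traffic could be expressed as
+follows (a sketch that uses only the options described below):
+
+.. code-block:: ini
+
+    [global]
+    # Require signatures on all client and daemon traffic:
+    cephx_require_signatures = true
+    # Or require them only for daemon-to-daemon traffic:
+    # cephx_cluster_require_signatures = true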
+
+``cephx_require_signatures``
+
+:Description: If this configuration setting is set to ``true``, Ceph requires
+ signatures on all message traffic between the Ceph client and the
+ Ceph Storage Cluster, and between daemons within the Ceph Storage
+ Cluster.
+
+.. note::
+
+   Neither Ceph Argonaut nor Linux kernel versions prior to 3.19 support
+   signatures; if one of these older clients is in use,
+   ``cephx_require_signatures`` can be disabled in order to allow the client
+   to connect.
+
+
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``cephx_cluster_require_signatures``
+
+:Description: If this configuration setting is set to ``true``, Ceph requires
+ signatures on all message traffic between Ceph daemons within the
+ Ceph Storage Cluster.
+
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``cephx_service_require_signatures``
+
+:Description: If this configuration setting is set to ``true``, Ceph requires
+ signatures on all message traffic between Ceph clients and the
+ Ceph Storage Cluster.
+
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``cephx_sign_messages``
+
+:Description: If this configuration setting is set to ``true``, and if the Ceph
+ version supports message signing, then Ceph will sign all
+ messages so that they are more difficult to spoof.
+
+:Type: Boolean
+:Default: ``true``
+
+
+Time to Live
+------------
+
+``auth_service_ticket_ttl``
+
+:Description: When the Ceph Storage Cluster sends a ticket for authentication
+ to a Ceph client, the Ceph Storage Cluster assigns that ticket a
+ Time To Live (TTL).
+
+:Type: Double
+:Default: ``60*60``
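+
+As a sketch, the default one-hour TTL could be doubled through the centralized
+configuration database (the ``global`` section is used here only for
+simplicity):
+
+.. prompt:: bash $
+
+   ceph config set global auth_service_ticket_ttl 7200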
+
+
+.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping
+.. _Operating a Cluster: ../../operations/operating
+.. _Manual Deployment: ../../../install/manual-deployment
+.. _Ceph configuration: ../ceph-conf
+.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication
+.. _User Management: ../../operations/user-management
diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst
new file mode 100644
index 000000000..3707be1aa
--- /dev/null
+++ b/doc/rados/configuration/bluestore-config-ref.rst
@@ -0,0 +1,552 @@
+==================================
+ BlueStore Configuration Reference
+==================================
+
+Devices
+=======
+
+BlueStore manages either one, two, or in certain cases three storage devices.
+These *devices* are "devices" in the Linux/Unix sense. This means that they are
+assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
+entire storage drive, or a partition of a storage drive, or a logical volume.
+BlueStore does not create or mount a conventional file system on devices that
+it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.
+
+In the simplest case, BlueStore consumes all of a single storage device. This
+device is known as the *primary device*. The primary device is identified by
+the ``block`` symlink in the data directory.
+
+The data directory is a ``tmpfs`` mount. When this data directory is booted or
+activated by ``ceph-volume``, it is populated with metadata files and links
+that hold information about the OSD: for example, the OSD's identifier, the
+name of the cluster that the OSD belongs to, and the OSD's private keyring.
+
+In more complicated cases, BlueStore is deployed across one or two additional
+devices:
+
+* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
+ directory) can be used to separate out BlueStore's internal journal or
+ write-ahead log. Using a WAL device is advantageous only if the WAL device
+ is faster than the primary device (for example, if the WAL device is an SSD
+ and the primary device is an HDD).
+* A *DB device* (identified as ``block.db`` in the data directory) can be used
+ to store BlueStore's internal metadata. BlueStore (or more precisely, the
+ embedded RocksDB) will put as much metadata as it can on the DB device in
+ order to improve performance. If the DB device becomes full, metadata will
+ spill back onto the primary device (where it would have been located in the
+ absence of the DB device). Again, it is advantageous to provision a DB device
+ only if it is faster than the primary device.
+
+If there is only a small amount of fast storage available (for example, less
+than a gigabyte), we recommend using the available space as a WAL device. But
+if more fast storage is available, it makes more sense to provision a DB
+device. Because the BlueStore journal is always placed on the fastest device
+available, using a DB device provides the same benefit that using a WAL device
+would, while *also* allowing additional metadata to be stored off the primary
+device (provided that it fits). DB devices make this possible because whenever
+a DB device is specified but an explicit WAL device is not, the WAL will be
+implicitly colocated with the DB on the faster device.
+
+To provision a single-device (colocated) BlueStore OSD, run the following
+command:
+
+.. prompt:: bash $
+
+ ceph-volume lvm prepare --bluestore --data <device>
+
+To specify a WAL device or DB device, run the following command:
+
+.. prompt:: bash $
+
+ ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
+
+.. note:: The option ``--data`` can take as its argument any of the
+ following devices: logical volumes specified using *vg/lv* notation,
+ existing logical volumes, and GPT partitions.
+
+
+
+Provisioning strategies
+-----------------------
+
+BlueStore differs from Filestore in that there are several ways to deploy a
+BlueStore OSD. However, the overall deployment strategy for BlueStore can be
+clarified by examining just these two common arrangements:
+
+.. _bluestore-single-type-device-config:
+
+**block (data) only**
+^^^^^^^^^^^^^^^^^^^^^
+If all devices are of the same type (for example, they are all HDDs), and if
+there are no fast devices available for the storage of metadata, then it makes
+sense to specify the block device only and to leave ``block.db`` and
+``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
+``/dev/sda`` device is as follows:
+
+.. prompt:: bash $
+
+ ceph-volume lvm create --bluestore --data /dev/sda
+
+If the devices to be used for a BlueStore OSD are pre-created logical volumes,
+then the :ref:`ceph-volume-lvm` call for a logical volume named
+``ceph-vg/block-lv`` is as follows:
+
+.. prompt:: bash $
+
+ ceph-volume lvm create --bluestore --data ceph-vg/block-lv
+
+.. _bluestore-mixed-device-config:
+
+**block and block.db**
+^^^^^^^^^^^^^^^^^^^^^^
+
+If you have a mix of fast and slow devices (for example, SSD or HDD), then we
+recommend placing ``block.db`` on the faster device while ``block`` (that is,
+the data) is stored on the slower device (that is, the rotational drive).
+
+You must create these volume groups and logical volumes manually, because the
+``ceph-volume`` tool is currently unable to create them automatically.
+
+The following procedure illustrates the manual creation of volume groups and
+logical volumes. For this example, we shall assume four rotational drives
+(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
+to create the volume groups, run the following commands:
+
+.. prompt:: bash $
+
+ vgcreate ceph-block-0 /dev/sda
+ vgcreate ceph-block-1 /dev/sdb
+ vgcreate ceph-block-2 /dev/sdc
+ vgcreate ceph-block-3 /dev/sdd
+
+Next, to create the logical volumes for ``block``, run the following commands:
+
+.. prompt:: bash $
+
+ lvcreate -l 100%FREE -n block-0 ceph-block-0
+ lvcreate -l 100%FREE -n block-1 ceph-block-1
+ lvcreate -l 100%FREE -n block-2 ceph-block-2
+ lvcreate -l 100%FREE -n block-3 ceph-block-3
+
+Because there are four HDDs, there will be four OSDs. Supposing that there is a
+200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
+the following commands:
+
+.. prompt:: bash $
+
+ vgcreate ceph-db-0 /dev/sdx
+ lvcreate -L 50GB -n db-0 ceph-db-0
+ lvcreate -L 50GB -n db-1 ceph-db-0
+ lvcreate -L 50GB -n db-2 ceph-db-0
+ lvcreate -L 50GB -n db-3 ceph-db-0
+
+Finally, to create the four OSDs, run the following commands:
+
+.. prompt:: bash $
+
+ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
+ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
+ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
+ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
+
+After this procedure is finished, there should be four OSDs, ``block`` should
+be on the four HDDs, and each HDD should have a 50GB logical volume
+(specifically, a DB device) on the shared SSD.
+
+Sizing
+======
+When using a :ref:`mixed spinning-and-solid-drive setup
+<bluestore-mixed-device-config>`, it is important to make the ``block.db``
+logical volume for BlueStore large enough. The logical volume provisioned for
+``block.db`` should be *as large as possible*.
+
+It is generally recommended that the size of ``block.db`` be somewhere between
+1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
+the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
+use of ``block.db`` to store metadata (in particular, omap keys). For example,
+if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
+40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
+2% of the ``block`` size.
+
+In older releases, internal level sizes are such that the DB can fully utilize
+only those specific partition / logical volume sizes that correspond to sums of
+L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
+3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
+sizing that accommodates L3 and higher, though DB compaction can be facilitated
+by doubling these figures to 6GB, 60GB, and 600GB.
+
+Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
+for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific
+release brings experimental dynamic-level support. Because of these advances,
+users of older releases might want to plan ahead by provisioning larger DB
+devices today so that the benefits of scale can be realized when upgrades are
+made in the future.
+
+When *not* using a mix of fast and slow devices, there is no requirement to
+create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
+will automatically colocate these devices within the space of ``block``.
+
+Automatic Cache Sizing
+======================
+
+BlueStore can be configured to automatically resize its caches, provided that
+certain conditions are met: TCMalloc must be configured as the memory allocator
+and the ``bluestore_cache_autotune`` configuration option must be enabled (note
+that it is currently enabled by default). When automatic cache sizing is in
+effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
+size (as determined by ``osd_memory_target``). This approach makes use of a
+best-effort algorithm and caches do not shrink smaller than the size defined by
+the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
+with a hierarchy of priorities. But if priority information is not available,
+the values specified in the ``bluestore_cache_meta_ratio`` and
+``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.
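+
+For example, to give every OSD a memory target of roughly 6 GiB (a sketch; the
+value is expressed in bytes and should be tuned to the host's available RAM):
+
+.. prompt:: bash $
+
+   ceph config set osd osd_memory_target 6442450944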
+
+.. confval:: bluestore_cache_autotune
+.. confval:: osd_memory_target
+.. confval:: bluestore_cache_autotune_interval
+.. confval:: osd_memory_base
+.. confval:: osd_memory_expected_fragmentation
+.. confval:: osd_memory_cache_min
+.. confval:: osd_memory_cache_resize_interval
+
+
+Manual Cache Sizing
+===================
+
+The amount of memory consumed by each OSD to be used for its BlueStore cache is
+determined by the ``bluestore_cache_size`` configuration option. If that option
+has not been specified (that is, if it remains at 0), then Ceph uses a
+different configuration option to determine the default memory budget:
+``bluestore_cache_size_hdd`` if the primary device is an HDD, or
+``bluestore_cache_size_ssd`` if the primary device is an SSD.
+
+BlueStore and the rest of the Ceph OSD daemon make every effort to work within
+this memory budget. Note that in addition to the configured cache size, there
+is also memory consumed by the OSD itself. There is additional utilization due
+to memory fragmentation and other allocator overhead.
+
+The configured cache-memory budget can be used to store the following types of
+things:
+
+* Key/Value metadata (that is, RocksDB's internal cache)
+* BlueStore metadata
+* BlueStore data (that is, recently read or recently written object data)
+
+Cache memory usage is governed by the configuration options
+``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction
+of the cache that is reserved for data is governed by both the effective
+BlueStore cache size (which depends on the relevant
+``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary
+device) and the "meta" and "kv" ratios. This data fraction can be calculated
+with the following formula: ``<effective_cache_size> * (1 -
+bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
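+
+For example, if the effective cache size is 3 GB and both ratios are set to
+0.4 (illustrative values, not necessarily the defaults), the data portion of
+the cache would be 3 GB * (1 - 0.4 - 0.4) = 0.6 GB.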
+
+.. confval:: bluestore_cache_size
+.. confval:: bluestore_cache_size_hdd
+.. confval:: bluestore_cache_size_ssd
+.. confval:: bluestore_cache_meta_ratio
+.. confval:: bluestore_cache_kv_ratio
+
+Checksums
+=========
+
+BlueStore checksums all metadata and all data written to disk. Metadata
+checksumming is handled by RocksDB and uses the `crc32c` algorithm. By
+contrast, data checksumming is handled by BlueStore and can use either
+`crc32c`, `xxhash32`, or `xxhash64`. Nonetheless, `crc32c` is the default
+checksum algorithm and it is suitable for most purposes.
+
+Full data checksumming increases the amount of metadata that BlueStore must
+store and manage. Whenever possible (for example, when clients hint that data
+is written and read sequentially), BlueStore will checksum larger blocks. In
+many cases, however, it must store a checksum value (usually 4 bytes) for every
+4 KB block of data.
+
+It is possible to obtain a smaller checksum value by truncating the checksum to
+one or two bytes and reducing the metadata overhead. A drawback of this
+approach is that it increases the probability of a random error going
+undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in
+65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte)
+checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8`
+as the checksum algorithm.
+
+The *checksum algorithm* can be specified either via the per-pool ``csum_type``
+property or via the global :confval:`bluestore_csum_type` configuration
+option. For example:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> csum_type <algorithm>
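+
+The cluster-wide default can likewise be changed through the centralized
+configuration database. For example (a sketch only; the per-pool property
+above is usually sufficient):
+
+.. prompt:: bash $
+
+   ceph config set global bluestore_csum_type xxhash64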
+
+.. confval:: bluestore_csum_type
+
+Inline Compression
+==================
+
+BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`.
+
+Whether data in BlueStore is compressed is determined by two factors: (1) the
+*compression mode* and (2) any client hints associated with a write operation.
+The compression modes are as follows:
+
+* **none**: Never compress data.
+* **passive**: Do not compress data unless the write operation has a
+ *compressible* hint set.
+* **aggressive**: Do compress data unless the write operation has an
+ *incompressible* hint set.
+* **force**: Try to compress data no matter what.
+
+For more information about the *compressible* and *incompressible* I/O hints,
+see :c:func:`rados_set_alloc_hint`.
+
+Note that data in Bluestore will be compressed only if the data chunk will be
+sufficiently reduced in size (as determined by the ``bluestore compression
+required ratio`` setting). No matter which compression modes have been used, if
+the data chunk is too big, then it will be discarded and the original
+(uncompressed) data will be stored instead. For example, if ``bluestore
+compression required ratio`` is set to ``.7``, then data compression will take
+place only if the size of the compressed data is no more than 70% of the size
+of the original data.
+
+The *compression mode*, *compression algorithm*, *compression required ratio*,
+*min blob size*, and *max blob size* settings can be specified either via a
+per-pool property or via a global config option. To specify pool properties,
+run the following commands:
+
+.. prompt:: bash $
+
+ ceph osd pool set <pool-name> compression_algorithm <algorithm>
+ ceph osd pool set <pool-name> compression_mode <mode>
+ ceph osd pool set <pool-name> compression_required_ratio <ratio>
+ ceph osd pool set <pool-name> compression_min_blob_size <size>
+ ceph osd pool set <pool-name> compression_max_blob_size <size>
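+
+Alternatively, cluster-wide defaults can be set via the global options listed
+below. For example (a sketch; choose values appropriate to your workload):
+
+.. prompt:: bash $
+
+   ceph config set global bluestore_compression_mode passive
+   ceph config set global bluestore_compression_algorithm lz4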
+
+.. confval:: bluestore_compression_algorithm
+.. confval:: bluestore_compression_mode
+.. confval:: bluestore_compression_required_ratio
+.. confval:: bluestore_compression_min_blob_size
+.. confval:: bluestore_compression_min_blob_size_hdd
+.. confval:: bluestore_compression_min_blob_size_ssd
+.. confval:: bluestore_compression_max_blob_size
+.. confval:: bluestore_compression_max_blob_size_hdd
+.. confval:: bluestore_compression_max_blob_size_ssd
+
+.. _bluestore-rocksdb-sharding:
+
+RocksDB Sharding
+================
+
+BlueStore maintains several types of internal key-value data, all of which are
+stored in RocksDB. Each data type in BlueStore is assigned a unique prefix.
+Prior to the Pacific release, all key-value data was stored in a single RocksDB
+column family: 'default'. In Pacific and later releases, however, BlueStore can
+divide key-value data into several RocksDB column families. BlueStore achieves
+better caching and more precise compaction when keys are similar: specifically,
+when keys have similar access frequency, similar modification frequency, and a
+similar lifetime. Under such conditions, performance is improved and less disk
+space is required during compaction (because each column family is smaller and
+is able to compact independently of the others).
+
+OSDs deployed in Pacific or later releases use RocksDB sharding by default.
+However, if Ceph has been upgraded to Pacific or a later version from a
+previous version, sharding is disabled on any OSDs that were created before
+Pacific.
+
+To enable sharding and apply the Pacific defaults to a specific OSD, stop the
+OSD and run the following command:
+
+ .. prompt:: bash #
+
+ ceph-bluestore-tool \
+ --path <data path> \
+ --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \
+ reshard
+
+.. confval:: bluestore_rocksdb_cf
+.. confval:: bluestore_rocksdb_cfs
+
+Throttling
+==========
+
+.. confval:: bluestore_throttle_bytes
+.. confval:: bluestore_throttle_deferred_bytes
+.. confval:: bluestore_throttle_cost_per_io
+.. confval:: bluestore_throttle_cost_per_io_hdd
+.. confval:: bluestore_throttle_cost_per_io_ssd
+
+SPDK Usage
+==========
+
+To use the SPDK driver for NVMe devices, you must first prepare your system.
+See `SPDK document`__.
+
+.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples
+
+SPDK offers a script that will configure the device automatically. Run this
+script with root permissions:
+
+.. prompt:: bash $
+
+ sudo src/spdk/scripts/setup.sh
+
+You will need to specify the subject NVMe device's device selector with the
+"spdk:" prefix for ``bluestore_block_path``.
+
+In the following example, you first find the device selector of an Intel NVMe
+SSD by running the following command:
+
+.. prompt:: bash $
+
+   lspci -mm -n -D -d 8086:0953
+
+The form of the device selector is either ``DDDD:BB:DD.FF`` or
+``DDDD.BB.DD.FF``.
+
+Next, supposing that ``0000:01:00.0`` is the device selector found in the
+output of the ``lspci`` command, you can specify the device selector by running
+the following command::
+
+ bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0"
+
+You may also specify a remote NVMeoF target over the TCP transport, as in the
+following example::
+
+ bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"
+
+To run multiple SPDK instances per node, you must make sure each instance uses
+its own DPDK memory by specifying for each instance the amount of DPDK memory
+(in MB) that the instance will use.
+
+In most cases, a single device can be used for data, DB, and WAL. We describe
+this strategy as *colocating* these components. Be sure to enter the below
+settings to ensure that all I/Os are issued through SPDK::
+
+ bluestore_block_db_path = ""
+ bluestore_block_db_size = 0
+ bluestore_block_wal_path = ""
+ bluestore_block_wal_size = 0
+
+If these settings are not entered, then the current implementation will
+populate the SPDK map files with kernel file system symbols and will use the
+kernel driver to issue DB/WAL I/Os.
+
+Minimum Allocation Size
+=======================
+
+There is a configured minimum amount of storage that BlueStore allocates on an
+underlying storage device. In practice, this is the least amount of capacity
+that even a tiny RADOS object can consume on each OSD's primary device. The
+configuration option in question--:confval:`bluestore_min_alloc_size`--derives
+its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or
+:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
+attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
+the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
+(including NVMe devices), Bluestore is initialized with the current value of
+:confval:`bluestore_min_alloc_size_ssd`.
+
+In Mimic and earlier releases, the default values were 64KB for rotational
+media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
+changed the default value for non-rotational media (SSD) to 4KB, and the
+Pacific release changed the default value for rotational media (HDD) to 4KB.
+
+These changes were driven by space amplification that was experienced by Ceph
+RADOS GateWay (RGW) deployments that hosted large numbers of small files
+(S3/Swift objects).
+
+For example, when an RGW client stores a 1 KB S3 object, that object is written
+to a single RADOS object. In accordance with the default
+:confval:`min_alloc_size` value, 4 KB of underlying drive space is allocated.
+This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated but never
+used: this corresponds to 300% overhead or 25% efficiency. Similarly, a 5 KB
+user object will be stored as two RADOS objects, a 4 KB RADOS object and a 1 KB
+RADOS object, with the result that 4KB of device capacity is stranded. In this
+case, however, the overhead percentage is much smaller. Think of this in terms
+of the remainder from a modulus operation. The overhead *percentage* thus
+decreases rapidly as object size increases.
+
+There is an additional subtlety that is easily missed: the amplification
+phenomenon just described takes place for *each* replica. For example, when
+using the default of three copies of data (3R), a 1 KB S3 object actually
+strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used
+instead of replication, the amplification might be even higher: for a ``k=4,
+m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6)
+of device capacity.
+
+When an RGW bucket pool contains many relatively large user objects, the effect
+of this phenomenon is often negligible. However, with deployments that can
+expect a significant fraction of relatively small user objects, the effect
+should be taken into consideration.
+
+The 4KB default value aligns well with conventional HDD and SSD devices.
+However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
+best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
+to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
+storage drives can achieve read performance that is competitive with that of
+conventional TLC SSDs and write performance that is faster than that of HDDs,
+with higher density and lower cost than TLC SSDs.
+
+Note that when creating OSDs on these novel devices, one must be careful to
+apply the non-default value only to appropriate devices, and not to
+conventional HDD and SSD devices. Errors can be avoided through careful ordering
+of OSD creation, with custom OSD device classes, and especially by the use of
+central configuration *masks*.
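+
+For example, a host mask could scope the non-default value to the node that
+holds such drives before its OSDs are created (a sketch; the hostname and the
+16KB value are placeholders that must match your environment and the drive's
+IU):
+
+.. prompt:: bash $
+
+   ceph config set osd/host:qlc-node-01 bluestore_min_alloc_size_ssd 16384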
+
+In Quincy and later releases, you can use the
+:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow
+automatic discovery of the correct value as each OSD is created. Note that the
+use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, `iSCSI`, or
+other device-layering and abstraction technologies might confound the
+determination of correct values. Moreover, OSDs deployed on top of VMware
+storage have sometimes been found to report a ``rotational`` attribute that
+does not match the underlying hardware.
+
+We suggest inspecting such OSDs at startup via logs and admin sockets in order
+to ensure that their behavior is correct. Be aware that this kind of inspection
+might not work as expected with older kernels. To check for this issue,
+examine the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``.
+
+.. note:: When running Reef or a later Ceph release, the ``min_alloc_size``
+ baked into each OSD is conveniently reported by ``ceph osd metadata``.
+
+To inspect a specific OSD, run the following command:
+
+.. prompt:: bash #
+
+ ceph osd metadata osd.1701 | egrep rotational\|alloc
+
+This space amplification might manifest as an unusually high ratio of raw to
+stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR``
+values reported by ``ceph osd df`` that are unusually high in comparison to
+other, ostensibly identical, OSDs. Finally, there might be unexpected balancer
+behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values.
+
+This BlueStore attribute takes effect *only* at OSD creation; if the attribute
+is changed later, a specific OSD's behavior will not change unless and until
+the OSD is destroyed and redeployed with the appropriate option value(s).
+Upgrading to a later Ceph release will *not* change the value used by OSDs that
+were deployed under older releases or with other settings.
+
+.. confval:: bluestore_min_alloc_size
+.. confval:: bluestore_min_alloc_size_hdd
+.. confval:: bluestore_min_alloc_size_ssd
+.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size
+
+DSA (Data Streaming Accelerator) Usage
+======================================
+
+If you want to use the DML library to drive the DSA device for offloading
+read/write operations on persistent memory (PMEM) in BlueStore, you need to
+install `DML`_ and the `idxd-config`_ library. This will work only on machines
+that have a SPR (Sapphire Rapids) CPU.
+
+.. _dml: https://github.com/intel/dml
+.. _idxd-config: https://github.com/intel/idxd-config
+
+After installing the DML software, configure the shared work queues (WQs) with
+reference to the following WQ configuration example:
+
+.. prompt:: bash $
+
+ accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
+ accel-config config-engine dsa0/engine0.1 --group-id=1
+ accel-config enable-device dsa0
+ accel-config enable-wq dsa0/wq0.1
diff --git a/doc/rados/configuration/ceph-conf.rst b/doc/rados/configuration/ceph-conf.rst
new file mode 100644
index 000000000..d8d5c9d03
--- /dev/null
+++ b/doc/rados/configuration/ceph-conf.rst
@@ -0,0 +1,715 @@
+.. _configuring-ceph:
+
+==================
+ Configuring Ceph
+==================
+
+When Ceph services start, the initialization process activates a set of
+daemons that run in the background. A :term:`Ceph Storage Cluster` runs at
+least three types of daemons:
+
+- :term:`Ceph Monitor` (``ceph-mon``)
+- :term:`Ceph Manager` (``ceph-mgr``)
+- :term:`Ceph OSD Daemon` (``ceph-osd``)
+
+Any Ceph Storage Cluster that supports the :term:`Ceph File System` also runs
+at least one :term:`Ceph Metadata Server` (``ceph-mds``). Any Cluster that
+supports :term:`Ceph Object Storage` runs Ceph RADOS Gateway daemons
+(``radosgw``).
+
+Each daemon has a number of configuration options, and each of those options
+has a default value. Adjust the behavior of the system by changing these
+configuration options. Make sure to understand the consequences before
+overriding the default values, as it is possible to significantly degrade the
+performance and stability of your cluster. Remember that default values
+sometimes change between releases. For this reason, it is best to review the
+version of this documentation that applies to your Ceph release.
+
+Option names
+============
+
+Each of the Ceph configuration options has a unique name that consists of words
+formed with lowercase characters and connected with underscore characters
+(``_``).
+
+When option names are specified on the command line, underscore (``_``) and
+dash (``-``) characters can be used interchangeably (for example,
+``--mon-host`` is equivalent to ``--mon_host``).
+
+When option names appear in configuration files, spaces can also be used in
+place of underscores or dashes. However, for the sake of clarity and
+convenience, we suggest that you consistently use underscores, as we do
+throughout this documentation.
+
+Config sources
+==============
+
+Each Ceph daemon, process, and library pulls its configuration from one or more
+of the several sources listed below. Sources that occur later in the list
+override those that occur earlier in the list (when both are present).
+
+- the compiled-in default value
+- the monitor cluster's centralized configuration database
+- a configuration file stored on the local host
+- environment variables
+- command-line arguments
+- runtime overrides that are set by an administrator
+
+One of the first things a Ceph process does on startup is parse the
+configuration options provided via the command line, via the environment, and
+via the local configuration file. Next, the process contacts the monitor
+cluster to retrieve centrally-stored configuration for the entire cluster.
+After a complete view of the configuration is available, the startup of the
+daemon or process will commence.
+
+.. _bootstrap-options:
+
+Bootstrap options
+-----------------
+
+Bootstrap options are configuration options that affect the process's ability
+to contact the monitors, to authenticate, and to retrieve the cluster-stored
+configuration. For this reason, these options might need to be stored locally
+on the node, and set by means of a local configuration file. These options
+include the following:
+
+.. confval:: mon_host
+.. confval:: mon_host_override
+
+- :confval:`mon_dns_srv_name`
+- :confval:`mon_data`, :confval:`osd_data`, :confval:`mds_data`,
+ :confval:`mgr_data`, and similar options that define which local directory
+ the daemon stores its data in.
+- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be
+ used to specify the authentication credential to use to authenticate with the
+ monitor. Note that in most cases the default keyring location is in the data
+ directory specified above.
+
+In most cases, there is no reason to modify the default values of these
+options. However, there is one exception to this: the :confval:`mon_host`
+option that identifies the addresses of the cluster's monitors. But when
+:ref:`DNS is used to identify monitors<mon-dns-lookup>`, a local Ceph
+configuration file can be avoided entirely.
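+
+A minimal local configuration file therefore often contains little more than
+the monitor addresses. For example (a sketch; the addresses are placeholders):
+
+.. code-block:: ini
+
+    [global]
+    mon_host = 10.0.0.1,10.0.0.2,10.0.0.3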
+
+
+Skipping monitor config
+-----------------------
+
+The option ``--no-mon-config`` can be passed in any command in order to skip
+the step that retrieves configuration information from the cluster's monitors.
+Skipping this retrieval step can be useful in cases where configuration is
+managed entirely via configuration files, or when maintenance activity needs to
+be done but the monitor cluster is down.
+
+.. _ceph-conf-file:
+
+Configuration sections
+======================
+
+Each of the configuration options associated with a single process or daemon
+has a single value. However, the values for a configuration option can vary
+across daemon types, and can vary even across different daemons of the same
+type. Ceph options that are stored in the monitor configuration database or in
+local configuration files are grouped into sections |---| so-called "configuration
+sections" |---| to indicate which daemons or clients they apply to.
+
+
+These sections include the following:
+
+.. confsec:: global
+
+ Settings under ``global`` affect all daemons and clients
+ in a Ceph Storage Cluster.
+
+ :example: ``log_file = /var/log/ceph/$cluster-$type.$id.log``
+
+.. confsec:: mon
+
+ Settings under ``mon`` affect all ``ceph-mon`` daemons in
+ the Ceph Storage Cluster, and override the same setting in
+ ``global``.
+
+ :example: ``mon_cluster_log_to_syslog = true``
+
+.. confsec:: mgr
+
+ Settings in the ``mgr`` section affect all ``ceph-mgr`` daemons in
+ the Ceph Storage Cluster, and override the same setting in
+ ``global``.
+
+ :example: ``mgr_stats_period = 10``
+
+.. confsec:: osd
+
+ Settings under ``osd`` affect all ``ceph-osd`` daemons in
+ the Ceph Storage Cluster, and override the same setting in
+ ``global``.
+
+ :example: ``osd_op_queue = wpq``
+
+.. confsec:: mds
+
+ Settings in the ``mds`` section affect all ``ceph-mds`` daemons in
+ the Ceph Storage Cluster, and override the same setting in
+ ``global``.
+
+ :example: ``mds_cache_memory_limit = 10G``
+
+.. confsec:: client
+
+ Settings under ``client`` affect all Ceph clients
+ (for example, mounted Ceph File Systems, mounted Ceph Block Devices)
+ as well as RADOS Gateway (RGW) daemons.
+
+ :example: ``objecter_inflight_ops = 512``
+
+
+Configuration sections can also specify an individual daemon or client name. For example,
+``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names.
+
+
+Any given daemon will draw its settings from the global section, the daemon- or
+client-type section, and the section sharing its name. Settings in the
+most-specific section take precedence: for example, if the same option is
+specified in :confsec:`global`, :confsec:`mon`, and ``mon.foo`` in the same
+source (that is, in the same configuration file), the ``mon.foo`` setting will
+be used.
+
+If multiple values of the same configuration option are specified in the same
+section, the last value specified takes precedence.
+
+Note that values from the local configuration file always take precedence over
+values from the monitor configuration database, regardless of the section in
+which they appear.
+
+.. _ceph-metavariables:
+
+Metavariables
+=============
+
+Metavariables dramatically simplify Ceph storage cluster configuration. When a
+metavariable is set in a configuration value, Ceph expands the metavariable at
+the time the configuration value is used. In this way, Ceph metavariables
+behave similarly to the way that variable expansion works in the Bash shell.
+
+Ceph supports the following metavariables:
+
+.. describe:: $cluster
+
+ Expands to the Ceph Storage Cluster name. Useful when running
+ multiple Ceph Storage Clusters on the same hardware.
+
+ :example: ``/etc/ceph/$cluster.keyring``
+ :default: ``ceph``
+
+.. describe:: $type
+
+ Expands to a daemon or process type (for example, ``mds``, ``osd``, or ``mon``)
+
+ :example: ``/var/lib/ceph/$type``
+
+.. describe:: $id
+
+ Expands to the daemon or client identifier. For
+ ``osd.0``, this would be ``0``; for ``mds.a``, it would
+ be ``a``.
+
+ :example: ``/var/lib/ceph/$type/$cluster-$id``
+
+.. describe:: $host
+
+ Expands to the host name where the process is running.
+
+.. describe:: $name
+
+ Expands to ``$type.$id``.
+
+ :example: ``/var/run/ceph/$cluster-$name.asok``
+
+.. describe:: $pid
+
+ Expands to daemon pid.
+
+ :example: ``/var/run/ceph/$cluster-$name-$pid.asok``
+
+
+Ceph configuration file
+=======================
+
+On startup, Ceph processes search for a configuration file in the
+following locations:
+
+#. ``$CEPH_CONF`` (that is, the path following the ``$CEPH_CONF``
+ environment variable)
+#. ``-c path/path`` (that is, the ``-c`` command line argument)
+#. ``/etc/ceph/$cluster.conf``
+#. ``~/.ceph/$cluster.conf``
+#. ``./$cluster.conf`` (that is, in the current working directory)
+#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf``
+
+Here ``$cluster`` is the cluster's name (default: ``ceph``).
+
+The Ceph configuration file uses an ``ini`` style syntax. You can add "comment
+text" after a pound sign (#) or a semicolon (;). For example:
+
+.. code-block:: ini
+
+   # <--A number sign (#) precedes a comment.
+   ; A comment may be anything.
+   # Comments always follow a semicolon (;) or a pound sign (#) on each line.
+ # The end of the line terminates a comment.
+ # We recommend that you provide comments in your configuration file(s).
+
+
+.. _ceph-conf-settings:
+
+Config file section names
+-------------------------
+
+The configuration file is divided into sections. Each section must begin with a
+valid configuration section name (see `Configuration sections`_, above) that is
+surrounded by square brackets. For example:
+
+.. code-block:: ini
+
+ [global]
+ debug_ms = 0
+
+ [osd]
+ debug_ms = 1
+
+ [osd.1]
+ debug_ms = 10
+
+ [osd.2]
+ debug_ms = 10
+
+Config file option values
+-------------------------
+
+The value of a configuration option is a string. If the string is too long to
+fit on a single line, you can put a backslash (``\``) at the end of the line
+and the backslash will act as a line continuation marker. In such a case, the
+value of the option will be the string after ``=`` in the current line,
+combined with the string in the next line. Here is an example::
+
+ [global]
+ foo = long long ago\
+ long ago
+
+In this example, the value of the "``foo``" option is "``long long ago long
+ago``".
+
+An option value typically ends with either a newline or a comment. For
+example:
+
+.. code-block:: ini
+
+ [global]
+ obscure_one = difficult to explain # I will try harder in next release
+ simpler_one = nothing to explain
+
+In this example, the value of the ``obscure_one`` option is "``difficult to
+explain``" and the value of the ``simpler_one`` option is "``nothing to
+explain``".
+
+When an option value contains spaces, it can be enclosed within single quotes
+or double quotes in order to make its scope clear and in order to make sure
+that the first space in the value is not interpreted as the end of the value.
+For example:
+
+.. code-block:: ini
+
+ [global]
+ line = "to be, or not to be"
+
+In option values, there are four characters that must be escaped: ``=``, ``#``,
+``;``, and ``[``. These characters are permitted to occur in an option value
+only if they are immediately preceded by the backslash character (``\``). For
+example:
+
+.. code-block:: ini
+
+ [global]
+ secret = "i love \# and \["
+
+Each configuration option falls under one of the following types:
+
+.. describe:: int
+
+ 64-bit signed integer. Some SI suffixes are supported, such as "K", "M",
+ "G", "T", "P", and "E" (meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
+ 10\ :sup:`9`, etc.). "B" is the only supported unit string. Thus "1K", "1M",
+ "128B" and "-1" are all valid option values. When a negative value is
+ assigned to a threshold option, this can indicate that the option is
+ "unlimited" -- that is, that there is no threshold or limit in effect.
+
+ :example: ``42``, ``-1``
+
+.. describe:: uint
+
+ This differs from ``integer`` only in that negative values are not
+ permitted.
+
+ :example: ``256``, ``0``
+
+.. describe:: str
+
+ A string encoded in UTF-8. Certain characters are not permitted. Reference
+ the above notes for the details.
+
+ :example: ``"hello world"``, ``"i love \#"``, ``yet-another-name``
+
+.. describe:: boolean
+
+ Typically either of the two values ``true`` or ``false``. However, any
+ integer is permitted: "0" implies ``false``, and any non-zero value implies
+ ``true``.
+
+ :example: ``true``, ``false``, ``1``, ``0``
+
+.. describe:: addr
+
+ A single address, optionally prefixed with ``v1``, ``v2`` or ``any`` for the
+ messenger protocol. If no prefix is specified, the ``v2`` protocol is used.
+ For more details, see :ref:`address_formats`.
+
+ :example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789``
+
+.. describe:: addrvec
+
+ A set of addresses separated by ",". The addresses can be optionally quoted
+ with ``[`` and ``]``.
+
+ :example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]``
+
+.. describe:: uuid
+
+ The string format of a uuid defined by `RFC4122
+ <https://www.ietf.org/rfc/rfc4122.txt>`_. Certain variants are also
+ supported: for more details, see `Boost document
+ <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_.
+
+ :example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6``
+
+.. describe:: size
+
+ 64-bit unsigned integer. Both SI prefixes and IEC prefixes are supported.
+ "B" is the only supported unit string. Negative values are not permitted.
+
+ :example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``.
+
+.. describe:: secs
+
+ Denotes a duration of time. The default unit of time is the second.
+ The following units of time are supported:
+
+ * second: ``s``, ``sec``, ``second``, ``seconds``
+ * minute: ``m``, ``min``, ``minute``, ``minutes``
+ * hour: ``hs``, ``hr``, ``hour``, ``hours``
+ * day: ``d``, ``day``, ``days``
+ * week: ``w``, ``wk``, ``week``, ``weeks``
+ * month: ``mo``, ``month``, ``months``
+ * year: ``y``, ``yr``, ``year``, ``years``
+
+ :example: ``1 m``, ``1m`` and ``1 week``
+
+.. _ceph-conf-database:
+
+Monitor configuration database
+==============================
+
+The monitor cluster manages a database of configuration options that can be
+consumed by the entire cluster. This allows for streamlined central
+configuration management of the entire system. For ease of administration and
+transparency, the vast majority of configuration options can and should be
+stored in this database.
+
+Some settings might need to be stored in local configuration files because they
+affect the ability of the process to connect to the monitors, to authenticate,
+and to fetch configuration information. In most cases this applies only to the
+``mon_host`` option. This issue can be avoided by using :ref:`DNS SRV
+records<mon-dns-lookup>`.
+
+Sections and masks
+------------------
+
+Configuration options stored by the monitor can be stored in a global section,
+in a daemon-type section, or in a specific daemon section. In this, they are
+no different from the options in a configuration file.
+
+In addition, options may have a *mask* associated with them to further restrict
+which daemons or clients the option applies to. Masks take two forms:
+
+#. ``type:location`` where ``type`` is a CRUSH property like ``rack`` or
+ ``host``, and ``location`` is a value for that property. For example,
+ ``host:foo`` would limit the option only to daemons or clients
+ running on a particular host.
+#. ``class:device-class`` where ``device-class`` is the name of a CRUSH
+ device class (for example, ``hdd`` or ``ssd``). For example,
+ ``class:ssd`` would limit the option only to OSDs backed by SSDs.
+ (This mask has no effect on non-OSD daemons or clients.)
+
+In commands that specify a configuration option, the argument of the option (in
+the following examples, this is the "who" string) may be a section name, a
+mask, or a combination of both separated by a slash character (``/``). For
+example, ``osd/rack:foo`` would refer to all OSD daemons in the ``foo`` rack.
+
+When configuration options are shown, the section name and mask are presented
+in separate fields or columns to make them more readable.
+
+Commands
+--------
+
+The following CLI commands are used to configure the cluster:
+
+* ``ceph config dump`` dumps the entire monitor configuration
+ database for the cluster.
+
+* ``ceph config get <who>`` dumps the configuration options stored in
+ the monitor configuration database for a specific daemon or client
+ (for example, ``mds.a``).
+
+* ``ceph config get <who> <option>`` shows either a configuration value
+ stored in the monitor configuration database for a specific daemon or client
+ (for example, ``mds.a``), or, if that value is not present in the monitor
+ configuration database, the compiled-in default value.
+
+* ``ceph config set <who> <option> <value>`` specifies a configuration
+ option in the monitor configuration database.
+
+* ``ceph config show <who>`` shows the configuration for a running daemon.
+ These settings might differ from those stored by the monitors if there are
+ also local configuration files in use or if options have been overridden on
+ the command line or at run time. The source of the values of the options is
+ displayed in the output.
+
+* ``ceph config assimilate-conf -i <input file> -o <output file>`` ingests a
+ configuration file from *input file* and moves any valid options into the
+ monitor configuration database. Any settings that are unrecognized, are
+ invalid, or cannot be controlled by the monitor will be returned in an
+ abbreviated configuration file stored in *output file*. This command is
+ useful for transitioning from legacy configuration files to centralized
+ monitor-based configuration.
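+
+  For example, a legacy ``/etc/ceph/ceph.conf`` might be ingested as follows
+  (the output path shown here is only an illustration):
+
+  .. prompt:: bash $
+
+     ceph config assimilate-conf -i /etc/ceph/ceph.conf -o /etc/ceph/ceph.conf.unrecognized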
+
+Note that ``ceph config get <who> <option>`` does not necessarily return the
+value most recently passed to ``ceph config set <who> <option>``: if the option
+is not present in the monitor configuration database, the command falls back to
+reporting the compiled-in default value. To determine whether a configuration
+option is actually present in the monitor configuration database, run
+``ceph config dump``.
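+
+For example, the following sequence (``osd.0`` and ``debug_ms`` are arbitrary
+choices) sets an option, reads it back, and then confirms that the option is
+actually present in the monitor configuration database:
+
+.. prompt:: bash $
+
+   ceph config set osd.0 debug_ms 1
+   ceph config get osd.0 debug_ms
+   ceph config dump | grep debug_ms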
+
+Help
+====
+
+To get help for a particular option, run the following command:
+
+.. prompt:: bash $
+
+ ceph config help <option>
+
+For example:
+
+.. prompt:: bash $
+
+ ceph config help log_file
+
+::
+
+ log_file - path to log file
+ (std::string, basic)
+ Default (non-daemon):
+ Default (daemon): /var/log/ceph/$cluster-$name.log
+ Can update at runtime: false
+ See also: [log_to_stderr,err_to_stderr,log_to_syslog,err_to_syslog]
+
+or:
+
+.. prompt:: bash $
+
+ ceph config help log_file -f json-pretty
+
+::
+
+ {
+ "name": "log_file",
+ "type": "std::string",
+ "level": "basic",
+ "desc": "path to log file",
+ "long_desc": "",
+ "default": "",
+ "daemon_default": "/var/log/ceph/$cluster-$name.log",
+ "tags": [],
+ "services": [],
+ "see_also": [
+ "log_to_stderr",
+ "err_to_stderr",
+ "log_to_syslog",
+ "err_to_syslog"
+ ],
+ "enum_values": [],
+ "min": "",
+ "max": "",
+ "can_update_at_runtime": false
+ }
+
+The ``level`` property can be ``basic``, ``advanced``, or ``dev``. The ``dev``
+options are intended for use by developers, generally for testing purposes, and
+are not recommended for use by operators.
+
+.. note:: This command uses the configuration schema that is compiled into the
+ running monitors. If you have a mixed-version cluster (as might exist, for
+ example, during an upgrade), you might want to query the option schema from
+ a specific running daemon by running a command of the following form:
+
+.. prompt:: bash $
+
+ ceph daemon <name> config help [option]
+
+Runtime Changes
+===============
+
+In most cases, Ceph permits changes to the configuration of a daemon at
+run time. This can be used for increasing or decreasing the amount of logging
+output, for enabling or disabling debug settings, and for runtime optimization.
+
+Use the ``ceph config set`` command to update configuration options. For
+example, to enable the most verbose debug log level on a specific OSD, run a
+command of the following form:
+
+.. prompt:: bash $
+
+ ceph config set osd.123 debug_ms 20
+
+.. note:: If an option has been customized in a local configuration file, the
+ `central config
+ <https://ceph.io/en/news/blog/2018/new-mimic-centralized-configuration-management/>`_
+ setting will be ignored because it has a lower priority than the local
+ configuration file.
+
+.. note:: Log levels range from 0 to 20.
+
+Override values
+---------------
+
+Options can be set temporarily by using the ``tell`` or ``daemon`` interfaces
+of the Ceph CLI. These *override* values are ephemeral, which means that they
+affect only the current instance of the daemon and revert to persistently
+configured values when the daemon restarts.
+
+Override values can be set in two ways:
+
+#. From any host, send a message to a daemon with a command of the following
+ form:
+
+ .. prompt:: bash $
+
+ ceph tell <name> config set <option> <value>
+
+ For example:
+
+ .. prompt:: bash $
+
+ ceph tell osd.123 config set debug_osd 20
+
+ The ``tell`` command can also accept a wildcard as the daemon identifier.
+ For example, to adjust the debug level on all OSD daemons, run a command of
+ the following form:
+
+ .. prompt:: bash $
+
+ ceph tell osd.* config set debug_osd 20
+
+#. On the host where the daemon is running, connect to the daemon via a socket
+ in ``/var/run/ceph`` by running a command of the following form:
+
+ .. prompt:: bash $
+
+ ceph daemon <name> config set <option> <value>
+
+ For example:
+
+ .. prompt:: bash $
+
+ ceph daemon osd.4 config set debug_osd 20
+
+.. note:: In the output of the ``ceph config show`` command, these temporary
+ values are shown to have a source of ``override``.
+
+
+Viewing runtime settings
+========================
+
+You can see the current settings specified for a running daemon with the ``ceph
+config show`` command. For example, to see the (non-default) settings for the
+daemon ``osd.0``, run the following command:
+
+.. prompt:: bash $
+
+ ceph config show osd.0
+
+To see a specific setting, run the following command:
+
+.. prompt:: bash $
+
+ ceph config show osd.0 debug_osd
+
+To see all settings (including those with default values), run the following
+command:
+
+.. prompt:: bash $
+
+ ceph config show-with-defaults osd.0
+
+You can see all settings for a daemon that is currently running by connecting
+to it on the local host via the admin socket. For example, to dump all
+current settings, run the following command:
+
+.. prompt:: bash $
+
+ ceph daemon osd.0 config show
+
+To see non-default settings and to see where each value came from (for example,
+a config file, the monitor, or an override), run the following command:
+
+.. prompt:: bash $
+
+ ceph daemon osd.0 config diff
+
+To see the value of a single setting, run the following command:
+
+.. prompt:: bash $
+
+ ceph daemon osd.0 config get debug_osd
+
+
+Changes introduced in Octopus
+=============================
+
+The Octopus release changed the way the configuration file is parsed.
+These changes are as follows:
+
+- Repeated configuration options are allowed, and no warnings will be
+ displayed. This means that the setting that comes last in the file is the one
+ that takes effect. Prior to this change, Ceph displayed warning messages when
+ lines containing duplicate options were encountered, such as::
+
+ warning line 42: 'foo' in section 'bar' redefined
+
+- Prior to Octopus, options containing invalid UTF-8 characters were ignored
+ with warning messages. But in Octopus, they are treated as fatal errors.
+- The backslash character ``\`` is used as the line-continuation marker that
+ combines the next line with the current one. Prior to Octopus, there was a
+ requirement that any end-of-line backslash be followed by a non-empty line.
+ But in Octopus, an empty line following a backslash is allowed.
+- In the configuration file, each line specifies an individual configuration
+  option. The option's name and its value are separated with ``=``, and the
+  value may be enclosed within single or double quotes (see the short example
+  after this list). If a line cannot be parsed in this way, as in the following
+  case, the whole file is treated as an invalid configuration file::
+
+      bad option ==== bad value
+
+- Prior to Octopus, if no section name was specified in the configuration file,
+ all options would be set as though they were within the :confsec:`global`
+ section. This approach is discouraged. Since Octopus, any configuration
+ file that has no section name must contain only a single option.
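+
+As a short illustration of the quoting and line-continuation rules above, the
+following hypothetical snippet is valid under the Octopus parser (the values
+themselves are only placeholders):
+
+.. code-block:: ini
+
+   [global]
+   # A value may be enclosed in single or double quotes.
+   log_file = "/var/log/ceph/$cluster-$name.log"
+   # A trailing backslash continues the option onto the next line.
+   public_network = 10.0.0.0/24, \
+                    10.0.1.0/24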
+
+.. |---| unicode:: U+2014 .. EM DASH :trim:
diff --git a/doc/rados/configuration/common.rst b/doc/rados/configuration/common.rst
new file mode 100644
index 000000000..0b373f469
--- /dev/null
+++ b/doc/rados/configuration/common.rst
@@ -0,0 +1,207 @@
+.. _ceph-conf-common-settings:
+
+Common Settings
+===============
+
+The `Hardware Recommendations`_ section provides some hardware guidelines for
+configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph
+Node` to run multiple daemons. For example, a single node with multiple drives
+usually runs one ``ceph-osd`` for each drive. Ideally, each node will be
+assigned to a particular type of process. For example, some nodes might run
+``ceph-osd`` daemons, other nodes might run ``ceph-mds`` daemons, and still
+other nodes might run ``ceph-mon`` daemons.
+
+Each node has a name. The name of a node can be found in its ``host`` setting.
+Monitors also specify a network address and port (that is, a domain name or IP
+address) that can be found in the ``addr`` setting. A basic configuration file
+typically specifies only minimal settings for each instance of monitor daemons.
+For example:
+
+
+.. code-block:: ini
+
+ [global]
+ mon_initial_members = ceph1
+ mon_host = 10.0.0.1
+
+.. important:: The ``host`` setting's value is the short name of the node. It
+ is not an FQDN. It is **NOT** an IP address. To retrieve the name of the
+ node, enter ``hostname -s`` on the command line. Unless you are deploying
+ Ceph manually, do not use ``host`` settings for anything other than initial
+ monitor setup. **DO NOT** specify the ``host`` setting under individual
+ daemons when using deployment tools like ``chef`` or ``cephadm``. Such tools
+ are designed to enter the appropriate values for you in the cluster map.
+
+
+.. _ceph-network-config:
+
+Networks
+========
+
+For more about configuring a network for use with Ceph, see the `Network
+Configuration Reference`_ .
+
+
+Monitors
+========
+
+Ceph production clusters typically provision at least three :term:`Ceph
+Monitor` daemons to ensure availability in the event of a monitor instance
+crash. A minimum of three :term:`Ceph Monitor` daemons ensures that the Paxos
+algorithm is able to determine which version of the :term:`Ceph Cluster Map` is
+the most recent. It makes this determination by consulting a majority of Ceph
+Monitors in the quorum.
+
+.. note:: You may deploy Ceph with a single monitor, but if the instance fails,
+ the lack of other monitors might interrupt data-service availability.
+
+Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and on
+port ``6789`` for the old v1 protocol.
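+
+For example, a ``mon_host`` entry that lists both the v2 and v1 endpoints of a
+single monitor might look like the following sketch (the IP address is a
+placeholder):
+
+.. code-block:: ini
+
+   [global]
+   mon_host = [v2:10.0.0.1:3300,v1:10.0.0.1:6789]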
+
+By default, Ceph expects to store monitor data on the following path::
+
+ /var/lib/ceph/mon/$cluster-$id
+
+You or a deployment tool (for example, ``cephadm``) must create the
+corresponding directory. With metavariables fully expressed and a cluster named
+"ceph", the path specified in the above example evaluates to::
+
+ /var/lib/ceph/mon/ceph-a
+
+For additional details, see the `Monitor Config Reference`_.
+
+.. _Monitor Config Reference: ../mon-config-ref
+
+
+.. _ceph-osd-config:
+
+Authentication
+==============
+
+.. versionadded:: Bobtail 0.56
+
+Authentication is explicitly enabled or disabled in the ``[global]`` section of
+the Ceph configuration file, as shown here:
+
+.. code-block:: ini
+
+ auth_cluster_required = cephx
+ auth_service_required = cephx
+ auth_client_required = cephx
+
+In addition, you should enable message signing. For details, see `Cephx Config
+Reference`_.
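+
+For example, assuming the signing options documented in that reference
+(``cephx_require_signatures`` and ``cephx_sign_messages`` are used here purely
+as an illustration), signing can be requested in the same ``[global]`` section:
+
+.. code-block:: ini
+
+   [global]
+   cephx_require_signatures = true
+   cephx_sign_messages = true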
+
+.. _Cephx Config Reference: ../auth-config-ref
+
+
+.. _ceph-monitor-config:
+
+
+OSDs
+====
+
+By default, Ceph expects to store a Ceph OSD Daemon's data on the following
+path::
+
+ /var/lib/ceph/osd/$cluster-$id
+
+You or a deployment tool (for example, ``cephadm``) must create the
+corresponding directory. With metavariables fully expressed and a cluster named
+"ceph", the path specified in the above example evaluates to::
+
+ /var/lib/ceph/osd/ceph-0
+
+You can override this path using the ``osd_data`` setting. We recommend that
+you do not change the default location. To create the default directory on your
+OSD host, run the following commands:
+
+.. prompt:: bash $
+
+ ssh {osd-host}
+ sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
+
+The ``osd_data`` path ought to lead to a mount point backed by a device that is
+distinct from the device that contains the operating system and the daemons. To
+prepare such a device for use with Ceph and mount it on the directory you just
+created, run the following commands:
+
+.. prompt:: bash $
+
+ ssh {new-osd-host}
+ sudo mkfs -t {fstype} /dev/{disk}
+ sudo mount -o user_xattr /dev/{disk} /var/lib/ceph/osd/ceph-{osd-number}
+
+We recommend using the ``xfs`` file system when running :command:`mkfs`. (The
+``btrfs`` and ``ext4`` file systems are not recommended and are no longer
+tested.)
+
+For additional configuration details, see `OSD Config Reference`_.
+
+
+Heartbeats
+==========
+
+During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons
+and report their findings to the Ceph Monitor. This process does not require
+you to provide any settings. However, if you have network latency issues, you
+might want to modify the default settings.
+
+For additional details, see `Configuring Monitor/OSD Interaction`_.
+
+
+.. _ceph-logging-and-debugging:
+
+Logs / Debugging
+================
+
+You might sometimes encounter issues with Ceph that require you to use Ceph's
+logging and debugging features. For details on log rotation, see `Debugging and
+Logging`_.
+
+.. _Debugging and Logging: ../../troubleshooting/log-and-debug
+
+
+Example ceph.conf
+=================
+
+.. literalinclude:: demo-ceph.conf
+ :language: ini
+
+.. _ceph-runtime-config:
+
+
+
+Naming Clusters (deprecated)
+============================
+
+Each Ceph cluster has an internal name. This internal name is used as part of
+configuration settings, log file names, directory names, and mountpoint names.
+This name defaults to "ceph". Previous
+releases of Ceph allowed one to specify a custom name instead, for example
+"ceph2". This option was intended to facilitate the running of multiple logical
+clusters on the same physical hardware, but in practice it was rarely
+exploited. Custom cluster names should no longer be attempted. Old
+documentation might lead readers to wrongly think that unique cluster names are
+required to use ``rbd-mirror``. They are not required.
+
+Custom cluster names are now considered deprecated and the ability to deploy
+them has already been removed from some tools, although existing custom-name
+deployments continue to operate. The ability to run and manage clusters with
+custom names might be progressively removed by future Ceph releases, so **it is
+strongly recommended to deploy all new clusters with the default name "ceph"**.
+
+Some Ceph CLI commands accept a ``--cluster`` (cluster name) option. This
+option is present only for the sake of backward compatibility. New tools and
+deployments cannot be relied upon to accommodate this option.
+
+If you need to allow multiple clusters to exist on the same host, use
+:ref:`cephadm`, which uses containers to fully isolate each cluster.
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+.. _Network Configuration Reference: ../network-config-ref
+.. _OSD Config Reference: ../osd-config-ref
+.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
diff --git a/doc/rados/configuration/demo-ceph.conf b/doc/rados/configuration/demo-ceph.conf
new file mode 100644
index 000000000..8ba285a42
--- /dev/null
+++ b/doc/rados/configuration/demo-ceph.conf
@@ -0,0 +1,31 @@
+[global]
+fsid = {cluster-id}
+mon_initial_members = {hostname}[, {hostname}]
+mon_host = {ip-address}[, {ip-address}]
+
+#All clusters have a front-side public network.
+#If you have two network interfaces, you can configure a private / cluster
+#network for RADOS object replication, heartbeats, backfill,
+#recovery, etc.
+public_network = {network}[, {network}]
+#cluster_network = {network}[, {network}]
+
+#Clusters require authentication by default.
+auth_cluster_required = cephx
+auth_service_required = cephx
+auth_client_required = cephx
+
+#Choose reasonable number of replicas and placement groups.
+osd_journal_size = {n}
+osd_pool_default_size = {n} # Write an object n times.
+osd_pool_default_min_size = {n} # Allow writing n copies in a degraded state.
+osd_pool_default_pg_autoscale_mode = {mode} # on, off, or warn
+# Only used if autoscaling is off or warn:
+osd_pool_default_pg_num = {n}
+
+#Choose a reasonable crush leaf type.
+#0 for a 1-node cluster.
+#1 for a multi node cluster in a single rack
+#2 for a multi node, multi chassis cluster with multiple hosts in a chassis
+#3 for a multi node cluster with hosts across racks, etc.
+osd_crush_chooseleaf_type = {n}
diff --git a/doc/rados/configuration/filestore-config-ref.rst b/doc/rados/configuration/filestore-config-ref.rst
new file mode 100644
index 000000000..7aefe26b3
--- /dev/null
+++ b/doc/rados/configuration/filestore-config-ref.rst
@@ -0,0 +1,377 @@
+============================
+ Filestore Config Reference
+============================
+
+.. note:: Since the Luminous release of Ceph, Filestore has not been Ceph's
+ default storage back end. Since the Luminous release of Ceph, BlueStore has
+ been Ceph's default storage back end. However, Filestore OSDs are still
+ supported up to Quincy. Filestore OSDs are not supported in Reef. See
+ :ref:`OSD Back Ends <rados_config_storage_devices_osd_backends>`. See
+ :ref:`BlueStore Migration <rados_operations_bluestore_migration>` for
+ instructions explaining how to replace an existing Filestore back end with a
+ BlueStore back end.
+
+
+``filestore_debug_omap_check``
+
+:Description: Debugging check on synchronization. Expensive. For debugging only.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+.. index:: filestore; extended attributes
+
+Extended Attributes
+===================
+
+Extended Attributes (XATTRs) are important for Filestore OSDs. However, certain
+disadvantages can occur when the underlying file system is used for the storage
+of XATTRs: some file systems have limits on the number of bytes that can be
+stored in XATTRs, and your file system might in some cases therefore run slower
+than would an alternative method of storing XATTRs. For this reason, a method
+of storing XATTRs extrinsic to the underlying file system might improve
+performance. To implement such an extrinsic method, refer to the following
+settings.
+
+If the underlying file system has no size limit, then Ceph XATTRs are stored as
+``inline xattr``, using the XATTRs provided by the file system. But if there is
+a size limit (for example, ext4 imposes a limit of 4 KB total), then some Ceph
+XATTRs will be stored in a key/value database when the limit is reached. More
+precisely, this begins to occur when either the
+``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs``
+threshold is reached.
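+
+As an illustration only, the following snippet pins both thresholds explicitly
+instead of relying on the file-system-specific defaults described below (the
+values shown simply mirror the XFS defaults):
+
+.. code-block:: ini
+
+   [osd]
+   filestore_max_inline_xattr_size = 65536
+   filestore_max_inline_xattrs = 10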
+
+
+``filestore_max_inline_xattr_size``
+
+:Description: Defines the maximum size per object of an XATTR that can be
+ stored in the file system (for example, XFS, Btrfs, ext4). The
+ specified size should not be larger than the file system can
+ handle. Using the default value of 0 instructs Filestore to use
+ the value specific to the file system.
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``0``
+
+
+``filestore_max_inline_xattr_size_xfs``
+
+:Description: Defines the maximum size of an XATTR that can be stored in the
+ XFS file system. This setting is used only if
+ ``filestore_max_inline_xattr_size`` == 0.
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``65536``
+
+
+``filestore_max_inline_xattr_size_btrfs``
+
+:Description: Defines the maximum size of an XATTR that can be stored in the
+ Btrfs file system. This setting is used only if
+ ``filestore_max_inline_xattr_size`` == 0.
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``2048``
+
+
+``filestore_max_inline_xattr_size_other``
+
+:Description: Defines the maximum size of an XATTR that can be stored in other file systems.
+ This setting is used only if ``filestore_max_inline_xattr_size`` == 0.
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``512``
+
+
+``filestore_max_inline_xattrs``
+
+:Description: Defines the maximum number of XATTRs per object that can be stored in the file system.
+ Using the default value of 0 instructs Filestore to use the value specific to the file system.
+:Type: 32-bit Integer
+:Required: No
+:Default: ``0``
+
+
+``filestore_max_inline_xattrs_xfs``
+
+:Description: Defines the maximum number of XATTRs per object that can be stored in the XFS file system.
+ This setting is used only if ``filestore_max_inline_xattrs`` == 0.
+:Type: 32-bit Integer
+:Required: No
+:Default: ``10``
+
+
+``filestore_max_inline_xattrs_btrfs``
+
+:Description: Defines the maximum number of XATTRs per object that can be stored in the Btrfs file system.
+ This setting is used only if ``filestore_max_inline_xattrs`` == 0.
+:Type: 32-bit Integer
+:Required: No
+:Default: ``10``
+
+
+``filestore_max_inline_xattrs_other``
+
+:Description: Defines the maximum number of XATTRs per object that can be stored in other file systems.
+ This setting is used only if ``filestore_max_inline_xattrs`` == 0.
+:Type: 32-bit Integer
+:Required: No
+:Default: ``2``
+
+.. index:: filestore; synchronization
+
+Synchronization Intervals
+=========================
+
+Filestore must periodically quiesce writes and synchronize the file system.
+Each synchronization creates a consistent commit point. When the commit point
+is created, Filestore is able to free all journal entries up to that point.
+More-frequent synchronization tends to reduce both synchronization time and
+the amount of data that needs to remain in the journal. Less-frequent
+synchronization allows the backing file system to coalesce small writes and
+metadata updates, potentially increasing synchronization
+efficiency but also potentially increasing tail latency.
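+
+For example, a hypothetical ``[osd]`` section that makes this trade-off
+explicit simply restates the two intervals (the values shown are the defaults):
+
+.. code-block:: ini
+
+   [osd]
+   filestore_min_sync_interval = 0.01
+   filestore_max_sync_interval = 5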
+
+
+``filestore_max_sync_interval``
+
+:Description: Defines the maximum interval (in seconds) for synchronizing Filestore.
+:Type: Double
+:Required: No
+:Default: ``5``
+
+
+``filestore_min_sync_interval``
+
+:Description: Defines the minimum interval (in seconds) for synchronizing Filestore.
+:Type: Double
+:Required: No
+:Default: ``.01``
+
+
+.. index:: filestore; flusher
+
+Flusher
+=======
+
+The Filestore flusher forces data from large writes to be written out using
+``sync_file_range`` prior to the synchronization.
+Ideally, this action reduces the cost of the eventual synchronization. In practice, however, disabling
+``filestore_flusher`` seems in some cases to improve performance.
+
+
+``filestore_flusher``
+
+:Description: Enables the Filestore flusher.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+.. deprecated:: v.65
+
+``filestore_flusher_max_fds``
+
+:Description: Defines the maximum number of file descriptors for the flusher.
+:Type: Integer
+:Required: No
+:Default: ``512``
+
+.. deprecated:: v.65
+
+``filestore_sync_flush``
+
+:Description: Enables the synchronization flusher.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+.. deprecated:: v.65
+
+``filestore_fsync_flushes_journal_data``
+
+:Description: Flushes journal data during file-system synchronization.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+.. index:: filestore; queue
+
+Queue
+=====
+
+The following settings define limits on the size of the Filestore queue:
+
+``filestore_queue_max_ops``
+
+:Description: Defines the maximum number of in-progress operations that Filestore accepts before it blocks the queueing of any new operations.
+:Type: Integer
+:Required: No. Minimal impact on performance.
+:Default: ``50``
+
+
+``filestore_queue_max_bytes``
+
+:Description: Defines the maximum number of bytes permitted per operation.
+:Type: Integer
+:Required: No
+:Default: ``100 << 20``
+
+
+.. index:: filestore; timeouts
+
+Timeouts
+========
+
+``filestore_op_threads``
+
+:Description: Defines the number of file-system operation threads that execute in parallel.
+:Type: Integer
+:Required: No
+:Default: ``2``
+
+
+``filestore_op_thread_timeout``
+
+:Description: Defines the timeout (in seconds) for a file-system operation thread.
+:Type: Integer
+:Required: No
+:Default: ``60``
+
+
+``filestore_op_thread_suicide_timeout``
+
+:Description: Defines the timeout (in seconds) for a commit operation before the commit is cancelled.
+:Type: Integer
+:Required: No
+:Default: ``180``
+
+
+.. index:: filestore; btrfs
+
+B-Tree Filesystem
+=================
+
+
+``filestore_btrfs_snap``
+
+:Description: Enables snapshots for a ``btrfs`` Filestore.
+:Type: Boolean
+:Required: No. Used only for ``btrfs``.
+:Default: ``true``
+
+
+``filestore_btrfs_clone_range``
+
+:Description: Enables cloning ranges for a ``btrfs`` Filestore.
+:Type: Boolean
+:Required: No. Used only for ``btrfs``.
+:Default: ``true``
+
+
+.. index:: filestore; journal
+
+Journal
+=======
+
+
+``filestore_journal_parallel``
+
+:Description: Enables parallel journaling, default for ``btrfs``.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``filestore_journal_writeahead``
+
+:Description: Enables write-ahead journaling, default for XFS.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``filestore_journal_trailing``
+
+:Description: Deprecated. **Never use.**
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+Misc
+====
+
+
+``filestore_merge_threshold``
+
+:Description: Defines the minimum number of files permitted in a subdirectory before the subdirectory is merged into its parent directory.
+ NOTE: A negative value means that subdirectory merging is disabled.
+:Type: Integer
+:Required: No
+:Default: ``-10``
+
+
+``filestore_split_multiple``
+
+:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16``
+ is the maximum number of files permitted in a subdirectory
+ before the subdirectory is split into child directories.
+
+:Type: Integer
+:Required: No
+:Default: ``2``
+
+
+``filestore_split_rand_factor``
+
+:Description: A random factor added to the split threshold to avoid
+ too many (expensive) Filestore splits occurring at the same time.
+ For details, see ``filestore_split_multiple``.
+ To change this setting for an existing OSD, it is necessary to take the OSD
+ offline before running the ``ceph-objectstore-tool apply-layout-settings`` command.
+
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``20``
+
+
+``filestore_update_to``
+
+:Description: Limits automatic upgrades to a specified version of Filestore. Useful in cases in which you want to avoid upgrading to a specific version.
+:Type: Integer
+:Required: No
+:Default: ``1000``
+
+
+``filestore_blackhole``
+
+:Description: Drops any new transactions on the floor, similar to redirecting to NULL.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``filestore_dump_file``
+
+:Description: Defines the file that transaction dumps are stored on.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``filestore_kill_at``
+
+:Description: Injects a failure at the *n*\th opportunity.
+:Type: String
+:Required: No
+:Default: ``false``
+
+
+``filestore_fail_eio``
+
+:Description: Fail/Crash on EIO.
+:Type: Boolean
+:Required: No
+:Default: ``true``
diff --git a/doc/rados/configuration/general-config-ref.rst b/doc/rados/configuration/general-config-ref.rst
new file mode 100644
index 000000000..f4613456a
--- /dev/null
+++ b/doc/rados/configuration/general-config-ref.rst
@@ -0,0 +1,19 @@
+==========================
+ General Config Reference
+==========================
+
+.. confval:: admin_socket
+ :default: /var/run/ceph/$cluster-$name.asok
+.. confval:: pid_file
+.. confval:: chdir
+.. confval:: fatal_signal_handlers
+.. describe:: max_open_files
+
+ If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets
+ the max open FDs at the OS level (i.e., the max # of file
+ descriptors). A suitably large value prevents Ceph Daemons from running out
+ of file descriptors.
+
+ :Type: 64-bit Integer
+ :Required: No
+ :Default: ``0``
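+
+A hypothetical override (the value is an arbitrary example, not a
+recommendation) would be placed in the ``[global]`` section of ``ceph.conf``:
+
+.. code-block:: ini
+
+   [global]
+   max_open_files = 131072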
diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst
new file mode 100644
index 000000000..715b999d1
--- /dev/null
+++ b/doc/rados/configuration/index.rst
@@ -0,0 +1,53 @@
+===============
+ Configuration
+===============
+
+Each Ceph process, daemon, or utility draws its configuration from several
+sources on startup. Such sources can include (1) a local configuration, (2) the
+monitors, (3) the command line, and (4) environment variables.
+
+Configuration options can be set globally so that they apply (1) to all
+daemons, (2) to all daemons or services of a particular type, or (3) to only a
+specific daemon, process, or client.
+
+.. raw:: html
+
+ <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3>
+
+For general object store configuration, refer to the following:
+
+.. toctree::
+ :maxdepth: 1
+
+ Storage devices <storage-devices>
+ ceph-conf
+
+
+.. raw:: html
+
+ </td><td><h3>Reference</h3>
+
+To optimize the performance of your cluster, refer to the following:
+
+.. toctree::
+ :maxdepth: 1
+
+ Common Settings <common>
+ Network Settings <network-config-ref>
+ Messenger v2 protocol <msgr2>
+ Auth Settings <auth-config-ref>
+ Monitor Settings <mon-config-ref>
+ mon-lookup-dns
+ Heartbeat Settings <mon-osd-interaction>
+ OSD Settings <osd-config-ref>
+ DmClock Settings <mclock-config-ref>
+ BlueStore Settings <bluestore-config-ref>
+ FileStore Settings <filestore-config-ref>
+ Journal Settings <journal-ref>
+   Pool, PG & CRUSH Settings <pool-pg-config-ref>
+ General Settings <general-config-ref>
+
+
+.. raw:: html
+
+ </td></tr></tbody></table>
diff --git a/doc/rados/configuration/journal-ref.rst b/doc/rados/configuration/journal-ref.rst
new file mode 100644
index 000000000..5ce5a5e2d
--- /dev/null
+++ b/doc/rados/configuration/journal-ref.rst
@@ -0,0 +1,39 @@
+==========================
+ Journal Config Reference
+==========================
+.. warning:: Filestore has been deprecated in the Reef release and is no longer supported.
+.. index:: journal; journal configuration
+
+Filestore OSDs use a journal for two reasons: speed and consistency. Note
+that since Luminous, BlueStore has been the preferred and default OSD back end.
+This information is provided for pre-existing OSDs and for rare situations where
+Filestore is preferred for new deployments.
+
+- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes
+ quickly. Ceph writes small, random i/o to the journal sequentially, which
+ tends to speed up bursty workloads by allowing the backing file system more
+ time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead
+ to spiky performance with short spurts of high-speed writes followed by
+ periods without any write progress as the file system catches up to the
+ journal.
+
+- **Consistency:** Ceph OSD Daemons require a file system interface that
+ guarantees atomic compound operations. Ceph OSD Daemons write a description
+ of the operation to the journal and apply the operation to the file system.
+ This enables atomic updates to an object (for example, placement group
+  metadata). Every few seconds--between ``filestore_max_sync_interval`` and
+  ``filestore_min_sync_interval``--the Ceph OSD Daemon stops writes and
+ synchronizes the journal with the file system, allowing Ceph OSD Daemons to
+ trim operations from the journal and reuse the space. On failure, Ceph
+ OSD Daemons replay the journal starting after the last synchronization
+ operation.
+
+Ceph OSD Daemons recognize the following journal settings:
+
+.. confval:: journal_dio
+.. confval:: journal_aio
+.. confval:: journal_block_align
+.. confval:: journal_max_write_bytes
+.. confval:: journal_max_write_entries
+.. confval:: journal_align_min_size
+.. confval:: journal_zero_on_create
diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst
new file mode 100644
index 000000000..a338aa6da
--- /dev/null
+++ b/doc/rados/configuration/mclock-config-ref.rst
@@ -0,0 +1,699 @@
+========================
+ mClock Config Reference
+========================
+
+.. index:: mclock; configuration
+
+QoS support in Ceph is implemented using a queuing scheduler based on `the
+dmClock algorithm`_. See :ref:`dmclock-qos` section for more details.
+
+To make the usage of mclock more user-friendly and intuitive, mclock config
+profiles were introduced. The mclock profiles mask the low-level details from
+users, making it easier to configure and use mclock.
+
+The following input parameters are required for a mclock profile to configure
+the QoS related parameters:
+
+* total capacity (IOPS) of each OSD (determined automatically -
+ See `OSD Capacity Determination (Automated)`_)
+
+* the max sequential bandwidth capacity (MiB/s) of each OSD -
+ See *osd_mclock_max_sequential_bandwidth_[hdd|ssd]* option
+
+* an mclock profile type to enable
+
+Using the settings in the specified profile, an OSD determines and applies the
+lower-level mclock and Ceph parameters. The parameters applied by the mclock
+profile make it possible to tune the QoS between client I/O and background
+operations in the OSD.
+
+
+.. index:: mclock; mclock clients
+
+mClock Client Types
+===================
+
+The mclock scheduler handles requests from different types of Ceph services.
+Each service can be considered as a type of client from mclock's perspective.
+Depending on the type of requests handled, mclock clients are classified into
+the buckets shown in the table below:
+
++------------------------+--------------------------------------------------------------+
+| Client Type | Request Types |
++========================+==============================================================+
+| Client | I/O requests issued by external clients of Ceph |
++------------------------+--------------------------------------------------------------+
+| Background recovery | Internal recovery requests |
++------------------------+--------------------------------------------------------------+
+| Background best-effort | Internal backfill, scrub, snap trim and PG deletion requests |
++------------------------+--------------------------------------------------------------+
+
+The mclock profiles allocate parameters like reservation, weight and limit
+(see :ref:`dmclock-qos`) differently for each client type. The next sections
+describe the mclock profiles in greater detail.
+
+
+.. index:: mclock; profile definition
+
+mClock Profiles - Definition and Purpose
+========================================
+
+A mclock profile is *“a configuration setting that when applied on a running
+Ceph cluster enables the throttling of the operations (IOPS) belonging to
+different client classes (background recovery, scrub, snaptrim, client op,
+osd subop)”*.
+
+The mclock profile uses the capacity limits and the mclock profile type selected
+by the user to determine the low-level mclock resource control configuration
+parameters and apply them transparently. Additionally, other Ceph configuration
+parameters are also applied. Please see sections below for more information.
+
+The low-level mclock resource control parameters are the *reservation*,
+*limit*, and *weight* that provide control of the resource shares, as
+described in the :ref:`dmclock-qos` section.
+
+
+.. index:: mclock; profile types
+
+mClock Profile Types
+====================
+
+mclock profiles can be broadly classified into *built-in* and *custom* profiles.
+
+Built-in Profiles
+-----------------
+Users can choose between the following built-in profile types:
+
+.. note:: The values mentioned in the tables below represent the proportion
+ of the total IOPS capacity of the OSD allocated for the service type.
+
+* balanced (default)
+* high_client_ops
+* high_recovery_ops
+
+balanced (*default*)
+^^^^^^^^^^^^^^^^^^^^
+The *balanced* profile is the default mClock profile. This profile allocates
+equal reservation/priority to client operations and background recovery
+operations. Background best-effort ops are given a lower reservation and
+therefore take longer to complete when there are competing operations. This
+profile helps meet the normal/steady-state requirements of the cluster: the
+case when external client performance requirements are not critical and there
+are other background operations that still need attention within the OSD.
+
+But there might be instances that necessitate giving higher allocations to either
+client ops or recovery ops. To deal with such situations, the alternate
+built-in profiles may be enabled by following the steps described in the next sections.
+
++------------------------+-------------+--------+-------+
+| Service Type | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client | 50% | 1 | MAX |
++------------------------+-------------+--------+-------+
+| background recovery | 50% | 1 | MAX |
++------------------------+-------------+--------+-------+
+| background best-effort | MIN | 1 | 90% |
++------------------------+-------------+--------+-------+
+
+high_client_ops
+^^^^^^^^^^^^^^^
+This profile optimizes client performance over background activities by
+allocating more reservation and limit to client operations as compared to
+background operations in the OSD. This profile, for example, may be enabled
+to provide the needed performance for I/O intensive applications for a
+sustained period of time at the cost of slower recoveries. The table shows
+the resource control parameters set by the profile:
+
++------------------------+-------------+--------+-------+
+| Service Type | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client | 60% | 2 | MAX |
++------------------------+-------------+--------+-------+
+| background recovery | 40% | 1 | MAX |
++------------------------+-------------+--------+-------+
+| background best-effort | MIN | 1 | 70% |
++------------------------+-------------+--------+-------+
+
+high_recovery_ops
+^^^^^^^^^^^^^^^^^
+This profile optimizes background recovery performance as compared to external
+clients and other background operations within the OSD. This profile, for
+example, may be enabled by an administrator temporarily to speed up background
+recoveries during non-peak hours. The table shows the resource control
+parameters set by the profile:
+
++------------------------+-------------+--------+-------+
+| Service Type | Reservation | Weight | Limit |
++========================+=============+========+=======+
+| client | 30% | 1 | MAX |
++------------------------+-------------+--------+-------+
+| background recovery | 70% | 2 | MAX |
++------------------------+-------------+--------+-------+
+| background best-effort | MIN | 1 | MAX |
++------------------------+-------------+--------+-------+
+
+.. note:: Across the built-in profiles, internal background best-effort clients
+ of mclock include "backfill", "scrub", "snap trim", and "pg deletion"
+ operations.
+
+
+Custom Profile
+--------------
+This profile gives users complete control over all the mclock configuration
+parameters. This profile should be used with caution and is meant for advanced
+users, who understand mclock and Ceph related configuration options.
+
+
+.. index:: mclock; built-in profiles
+
+mClock Built-in Profiles - Locked Config Options
+=================================================
+The sections below describe the config options that are locked to certain values
+in order to ensure that the mClock scheduler is able to provide predictable QoS.
+
+mClock Config Options
+---------------------
+.. important:: These defaults cannot be changed using any of the config
+   subsystem commands like *config set* or via the *config daemon* or *config
+   tell* interfaces. Although these commands report success, the mclock
+   QoS parameters are reverted to their respective built-in profile defaults.
+   A quick way to verify this behavior is shown after the list below.
+
+When a built-in profile is enabled, the mClock scheduler calculates the low
+level mclock parameters [*reservation*, *weight*, *limit*] based on the profile
+enabled for each client type. The mclock parameters are calculated based on
+the max OSD capacity provided beforehand. As a result, the following mclock
+config parameters cannot be modified when using any of the built-in profiles:
+
+- :confval:`osd_mclock_scheduler_client_res`
+- :confval:`osd_mclock_scheduler_client_wgt`
+- :confval:`osd_mclock_scheduler_client_lim`
+- :confval:`osd_mclock_scheduler_background_recovery_res`
+- :confval:`osd_mclock_scheduler_background_recovery_wgt`
+- :confval:`osd_mclock_scheduler_background_recovery_lim`
+- :confval:`osd_mclock_scheduler_background_best_effort_res`
+- :confval:`osd_mclock_scheduler_background_best_effort_wgt`
+- :confval:`osd_mclock_scheduler_background_best_effort_lim`
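+
+As a quick check (``osd.0`` and the value are placeholders), you can attempt to
+override one of these options while a built-in profile is active and then read
+it back; the value reported should still be the profile-derived default rather
+than the attempted value:
+
+.. prompt:: bash #
+
+   ceph config set osd.0 osd_mclock_scheduler_client_res 0.3
+   ceph config show osd.0 osd_mclock_scheduler_client_res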
+
+Recovery/Backfill Options
+-------------------------
+.. warning:: The recommendation is to not change these options as the built-in
+ profiles are optimized based on them. Changing these defaults can result in
+ unexpected performance outcomes.
+
+The following recovery and backfill related Ceph options are overridden to
+mClock defaults:
+
+- :confval:`osd_max_backfills`
+- :confval:`osd_recovery_max_active`
+- :confval:`osd_recovery_max_active_hdd`
+- :confval:`osd_recovery_max_active_ssd`
+
+The following table shows the mClock defaults, which are the same as the current
+defaults. This is done to maximize the performance of the foreground (client)
+operations:
+
++----------------------------------------+------------------+----------------+
+| Config Option | Original Default | mClock Default |
++========================================+==================+================+
+| :confval:`osd_max_backfills` | 1 | 1 |
++----------------------------------------+------------------+----------------+
+| :confval:`osd_recovery_max_active` | 0 | 0 |
++----------------------------------------+------------------+----------------+
+| :confval:`osd_recovery_max_active_hdd` | 3 | 3 |
++----------------------------------------+------------------+----------------+
+| :confval:`osd_recovery_max_active_ssd` | 10 | 10 |
++----------------------------------------+------------------+----------------+
+
+The above mClock defaults can be modified, if necessary, by enabling
+:confval:`osd_mclock_override_recovery_settings` (default: false). The
+steps for this are discussed in the
+`Steps to Modify mClock Max Backfills/Recovery Limits`_ section.
+
+Sleep Options
+-------------
+If any mClock profile (including "custom") is active, the following Ceph config
+sleep options are disabled (set to 0):
+
+- :confval:`osd_recovery_sleep`
+- :confval:`osd_recovery_sleep_hdd`
+- :confval:`osd_recovery_sleep_ssd`
+- :confval:`osd_recovery_sleep_hybrid`
+- :confval:`osd_scrub_sleep`
+- :confval:`osd_delete_sleep`
+- :confval:`osd_delete_sleep_hdd`
+- :confval:`osd_delete_sleep_ssd`
+- :confval:`osd_delete_sleep_hybrid`
+- :confval:`osd_snap_trim_sleep`
+- :confval:`osd_snap_trim_sleep_hdd`
+- :confval:`osd_snap_trim_sleep_ssd`
+- :confval:`osd_snap_trim_sleep_hybrid`
+
+The above sleep options are disabled to ensure that the mclock scheduler is able
+to determine when to pick the next op from its operation queue and transfer it to
+the operation sequencer. This results in the desired QoS being provided across
+all of its clients.
+
+
+.. index:: mclock; enable built-in profile
+
+Steps to Enable mClock Profile
+==============================
+
+As already mentioned, the default mclock profile is set to *balanced*.
+The other values for the built-in profiles include *high_client_ops* and
+*high_recovery_ops*.
+
+If there is a requirement to change the default profile, then the option
+:confval:`osd_mclock_profile` may be set during runtime by using the following
+command:
+
+ .. prompt:: bash #
+
+ ceph config set osd.N osd_mclock_profile <value>
+
+For example, to change the profile to allow faster recoveries on "osd.0", the
+following command can be used to switch to the *high_recovery_ops* profile:
+
+ .. prompt:: bash #
+
+ ceph config set osd.0 osd_mclock_profile high_recovery_ops
+
+.. note:: The *custom* profile is not recommended unless you are an advanced
+ user.
+
+And that's it! You are ready to run workloads on the cluster and check if the
+QoS requirements are being met.
+
+
+Switching Between Built-in and Custom Profiles
+==============================================
+
+There may be situations requiring switching from a built-in profile to the
+*custom* profile and vice-versa. The following sections outline the steps to
+accomplish this.
+
+Steps to Switch From a Built-in to the Custom Profile
+-----------------------------------------------------
+
+To switch to the *custom* profile on all OSDs, run the following command:
+
+ .. prompt:: bash #
+
+ ceph config set osd osd_mclock_profile custom
+
+After switching to the *custom* profile, the desired mClock configuration
+option may be modified. For example, to change the client reservation IOPS
+ratio for a specific OSD (say osd.0) to 0.5 (or 50%), the following command
+can be used:
+
+ .. prompt:: bash #
+
+ ceph config set osd.0 osd_mclock_scheduler_client_res 0.5
+
+.. important:: Care must be taken to change the reservations of other services
+ like recovery and background best effort accordingly to ensure that the sum
+ of the reservations do not exceed the maximum proportion (1.0) of the IOPS
+ capacity of the OSD.
+
+.. tip:: The reservation and limit parameter allocations are per-shard based on
+ the type of backing device (HDD/SSD) under the OSD. See
+ :confval:`osd_op_num_shards_hdd` and :confval:`osd_op_num_shards_ssd` for
+ more details.
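+
+For example, one purely illustrative way to honor the constraint above on a
+single OSD is to split the reservations explicitly (the ratios below are an
+assumption, not a recommendation):
+
+.. prompt:: bash #
+
+   ceph config set osd.0 osd_mclock_scheduler_client_res 0.5
+   ceph config set osd.0 osd_mclock_scheduler_background_recovery_res 0.3
+   ceph config set osd.0 osd_mclock_scheduler_background_best_effort_res 0.2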
+
+Steps to Switch From the Custom Profile to a Built-in Profile
+-------------------------------------------------------------
+
+Switching from the *custom* profile to a built-in profile requires an
+intermediate step of removing the custom settings from the central config
+database for the changes to take effect.
+
+The following sequence of commands can be used to switch to a built-in profile:
+
+#. Set the desired built-in profile using:
+
+ .. prompt:: bash #
+
+      ceph config set osd osd_mclock_profile <built-in profile>
+
+ For example, to set the built-in profile to ``high_client_ops`` on all
+ OSDs, run the following command:
+
+ .. prompt:: bash #
+
+ ceph config set osd osd_mclock_profile high_client_ops
+#. Determine the existing custom mClock configuration settings in the central
+ config database using the following command:
+
+ .. prompt:: bash #
+
+ ceph config dump
+#. Remove the custom mClock configuration settings determined in the previous
+ step from the central config database:
+
+ .. prompt:: bash #
+
+ ceph config rm osd <mClock Configuration Option>
+
+ For example, to remove the configuration option
+ :confval:`osd_mclock_scheduler_client_res` that was set on all OSDs, run the
+ following command:
+
+ .. prompt:: bash #
+
+ ceph config rm osd osd_mclock_scheduler_client_res
+
+#. After all existing custom mClock configuration settings have been removed
+   from the central config database, the configuration settings pertaining to
+   ``high_client_ops`` will come into effect. For example, to verify the
+   settings on osd.0, use:
+
+ .. prompt:: bash #
+
+ ceph config show osd.0
+
+Switch Temporarily Between mClock Profiles
+------------------------------------------
+
+To switch between mClock profiles on a temporary basis, the following commands
+may be used to override the settings:
+
+.. warning:: This section is for advanced users or for experimental testing. The
+   recommendation is to not use the below commands on a running cluster as they
+   could have unexpected outcomes.
+
+.. note:: The configuration changes on an OSD using the below commands are
+ ephemeral and are lost when it restarts. It is also important to note that
+ the config options overridden using the below commands cannot be modified
+ further using the *ceph config set osd.N ...* command. The changes will not
+ take effect until a given OSD is restarted. This is intentional, as per the
+ config subsystem design. However, any further modification can still be made
+ ephemerally using the commands mentioned below.
+
+#. Run the *injectargs* command as shown to override the mclock settings:
+
+ .. prompt:: bash #
+
+ ceph tell osd.N injectargs '--<mClock Configuration Option>=<value>'
+
+ For example, the following command overrides the
+ :confval:`osd_mclock_profile` option on osd.0:
+
+ .. prompt:: bash #
+
+ ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops'
+
+
+#. An alternate command that can be used is:
+
+ .. prompt:: bash #
+
+ ceph daemon osd.N config set <mClock Configuration Option> <value>
+
+ For example, the following command overrides the
+ :confval:`osd_mclock_profile` option on osd.0:
+
+ .. prompt:: bash #
+
+ ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops
+
+The individual QoS-related config options for the *custom* profile can also be
+modified ephemerally using the above commands.
+
+
+Steps to Modify mClock Max Backfills/Recovery Limits
+====================================================
+
+This section describes the steps to modify the default max backfills or recovery
+limits if the need arises.
+
+.. warning:: This section is for advanced users or for experimental testing. The
+ recommendation is to retain the defaults as is on a running cluster as
+ modifying them could have unexpected performance outcomes. The values may
+ be modified only if the cluster is unable to cope/showing poor performance
+ with the default settings or for performing experiments on a test cluster.
+
+.. important:: The max backfill/recovery options that can be modified are listed
+ in section `Recovery/Backfill Options`_. The modification of the mClock
+ default backfills/recovery limit is gated by the
+ :confval:`osd_mclock_override_recovery_settings` option, which is set to
+ *false* by default. Attempting to modify any default recovery/backfill
+ limits without setting the gating option will reset that option back to the
+ mClock defaults along with a warning message logged in the cluster log. Note
+ that it may take a few seconds for the default value to come back into
+ effect. Verify the limit using the *config show* command as shown below.
+
+#. Set the :confval:`osd_mclock_override_recovery_settings` config option on all
+ osds to *true* using:
+
+ .. prompt:: bash #
+
+ ceph config set osd osd_mclock_override_recovery_settings true
+
+#. Set the desired max backfill/recovery option using:
+
+ .. prompt:: bash #
+
+ ceph config set osd osd_max_backfills <value>
+
+ For example, the following command modifies the :confval:`osd_max_backfills`
+ option on all osds to 5.
+
+ .. prompt:: bash #
+
+ ceph config set osd osd_max_backfills 5
+
+#. Wait for a few seconds and verify the running configuration for a specific
+ OSD using:
+
+ .. prompt:: bash #
+
+ ceph config show osd.N | grep osd_max_backfills
+
+ For example, the following command shows the running configuration of
+ :confval:`osd_max_backfills` on osd.0.
+
+ .. prompt:: bash #
+
+ ceph config show osd.0 | grep osd_max_backfills
+
+#. Reset the :confval:`osd_mclock_override_recovery_settings` config option on
+ all osds to *false* using:
+
+ .. prompt:: bash #
+
+ ceph config set osd osd_mclock_override_recovery_settings false
+
+
+OSD Capacity Determination (Automated)
+======================================
+
+The OSD capacity in terms of total IOPS is determined automatically during OSD
+initialization. This is achieved by running the OSD bench tool and overriding
+the default value of the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option
+depending on the device type. No other action/input is expected from the user
+to set the OSD capacity.
+
+.. note:: If you wish to manually benchmark OSD(s) or manually tune the
+ Bluestore throttle parameters, see section
+ `Steps to Manually Benchmark an OSD (Optional)`_.
+
+You may verify the capacity of an OSD after the cluster is brought up by using
+the following command:
+
+ .. prompt:: bash #
+
+ ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]
+
+For example, the following command shows the max capacity for "osd.0" on a Ceph
+node whose underlying device type is SSD:
+
+ .. prompt:: bash #
+
+ ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
+
+Mitigation of Unrealistic OSD Capacity From Automated Test
+----------------------------------------------------------
+In certain conditions, the OSD bench tool may show unrealistic/inflated results
+depending on the drive configuration and other environment-related conditions.
+To mitigate the performance impact due to this unrealistic capacity, a couple
+of threshold config options depending on the osd's device type are defined and
+used:
+
+- :confval:`osd_mclock_iops_capacity_threshold_hdd` = 500
+- :confval:`osd_mclock_iops_capacity_threshold_ssd` = 80000
+
+The following automated step is performed:
+
+Fallback to using default OSD capacity (automated)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If OSD bench reports a measurement that exceeds the above threshold values
+depending on the underlying device type, the fallback mechanism reverts to the
+default value of :confval:`osd_mclock_max_capacity_iops_hdd` or
+:confval:`osd_mclock_max_capacity_iops_ssd`. The threshold config options
+can be reconfigured based on the type of drive used. Additionally, a cluster
+warning is logged in case the measurement exceeds the threshold. For example, ::
+
+ 2022-10-27T15:30:23.270+0000 7f9b5dbe95c0 0 log_channel(cluster) log [WRN]
+ : OSD bench result of 39546.479392 IOPS exceeded the threshold limit of
+ 25000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000
+ IOPS. The recommendation is to establish the osd's IOPS capacity using other
+ benchmark tools (e.g. Fio) and then override
+ osd_mclock_max_capacity_iops_[hdd|ssd].
+
+If the default capacity doesn't accurately represent the OSD's capacity, the
+following additional step is recommended to address this:
+
+Run custom drive benchmark if defaults are not accurate (manual)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If the default OSD capacity is not accurate, the recommendation is to run a
+custom benchmark using your preferred tool (e.g. Fio) on the drive and then
+override the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option as described
+in the `Specifying Max OSD Capacity`_ section.
+
+This step is highly recommended until an alternate mechanism is worked upon.
+
+Steps to Manually Benchmark an OSD (Optional)
+=============================================
+
+.. note:: These steps are only necessary if you want to override the OSD
+ capacity already determined automatically during OSD initialization.
+ Otherwise, you may skip this section entirely.
+
+.. tip:: If you have already determined the benchmark data and wish to manually
+ override the max osd capacity for an OSD, you may skip to section
+ `Specifying Max OSD Capacity`_.
+
+
+Any existing benchmarking tool (e.g. Fio) can be used for this purpose. In this
+case, the steps use the *Ceph OSD Bench* command described in the next section.
+Regardless of the tool/command used, the steps outlined further below remain the
+same.
+
+As already described in the :ref:`dmclock-qos` section, the number of
+shards and the bluestore's throttle parameters have an impact on the mclock op
+queues. Therefore, it is critical to set these values carefully in order to
+maximize the impact of the mclock scheduler.
+
+:Number of Operational Shards:
+ We recommend using the default number of shards as defined by the
+ configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
+ ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase
+ the impact of the mclock queues.
+
+:Bluestore Throttle Parameters:
+ We recommend using the default values as defined by
+ :confval:`bluestore_throttle_bytes` and
+ :confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be
+ determined during the benchmarking phase as described below.
+
+OSD Bench Command Syntax
+------------------------
+
+The :ref:`osd-subsystem` section describes the OSD bench command. The syntax
+used for benchmarking is shown below:
+
+.. prompt:: bash #
+
+ ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]
+
+where:
+
+* ``TOTAL_BYTES``: Total number of bytes to write
+* ``BYTES_PER_WRITE``: Block size per write
+* ``OBJ_SIZE``: Bytes per object
+* ``NUM_OBJS``: Number of objects to write
+
+Benchmarking Test Steps Using OSD Bench
+---------------------------------------
+
+The steps below use the default number of shards and describe how to determine
+the correct bluestore throttle values (optional).
+
+#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that
+ you wish to benchmark.
+#. Run a simple 4KiB random write workload on an OSD using the following
+ commands:
+
+ .. note:: Note that before running the test, caches must be cleared to get an
+ accurate measurement.
+
+ For example, if you are running the benchmark test on osd.0, run the following
+ commands:
+
+ .. prompt:: bash #
+
+ ceph tell osd.0 cache drop
+
+ .. prompt:: bash #
+
+ ceph tell osd.0 bench 12288000 4096 4194304 100
+
+#. Note the overall throughput (IOPS) obtained from the output of the osd bench
+   command. This value is the baseline throughput (IOPS) when the default
+   bluestore throttle options are in effect.
+#. If the intent is to determine the bluestore throttle values for your
+   environment, then set the two options, :confval:`bluestore_throttle_bytes`
+   and :confval:`bluestore_throttle_deferred_bytes`, to 32 KiB (32768 bytes)
+   each to begin with (see the example after this list). Otherwise, you may
+   skip to the next section.
+#. Run the 4KiB random write test as before using OSD bench.
+#. Note the overall throughput from the output and compare the value
+ against the baseline throughput recorded in step 3.
+#. If the throughput doesn't match the baseline, double the bluestore
+   throttle options and repeat steps 5 through 7 until the obtained
+   throughput is very close to the baseline value.
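+
+As a sketch of step 4 (the OSD id and values shown are illustrative), the
+throttle options can be set at runtime as follows:
+
+.. prompt:: bash #
+
+   ceph config set osd.0 bluestore_throttle_bytes 32768
+
+.. prompt:: bash #
+
+   ceph config set osd.0 bluestore_throttle_deferred_bytes 32768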
+
+For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB
+for both bluestore throttle and deferred bytes was determined to maximize the
+impact of mclock. For HDDs, the corresponding value was 40 MiB, where the
+overall throughput was roughly equal to the baseline throughput. Note that in
+general for HDDs, the bluestore throttle values are expected to be higher when
+compared to SSDs.
+
+
+Specifying Max OSD Capacity
+----------------------------
+
+The steps in this section need be performed only if you want to override the
+max OSD capacity that was automatically set during OSD initialization. The
+option ``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by
+running the following command:
+
+ .. prompt:: bash #
+
+ ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value>
+
+For example, the following command sets the max capacity for a specific OSD
+(say "osd.0") whose underlying device type is HDD to 350 IOPS:
+
+ .. prompt:: bash #
+
+ ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350
+
+Alternatively, you may specify the max capacity for OSDs within the Ceph
+configuration file under the respective [osd.N] section. See
+:ref:`ceph-conf-settings` for more details.
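+
+For example, a minimal sketch of such an entry (the OSD id and value are
+illustrative):
+
+.. code-block:: ini
+
+   [osd.0]
+   osd_mclock_max_capacity_iops_hdd = 350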
+
+
+.. index:: mclock; config settings
+
+mClock Config Options
+=====================
+
+.. confval:: osd_mclock_profile
+.. confval:: osd_mclock_max_capacity_iops_hdd
+.. confval:: osd_mclock_max_capacity_iops_ssd
+.. confval:: osd_mclock_max_sequential_bandwidth_hdd
+.. confval:: osd_mclock_max_sequential_bandwidth_ssd
+.. confval:: osd_mclock_force_run_benchmark_on_init
+.. confval:: osd_mclock_skip_benchmark
+.. confval:: osd_mclock_override_recovery_settings
+.. confval:: osd_mclock_iops_capacity_threshold_hdd
+.. confval:: osd_mclock_iops_capacity_threshold_ssd
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst
new file mode 100644
index 000000000..e0a12d093
--- /dev/null
+++ b/doc/rados/configuration/mon-config-ref.rst
@@ -0,0 +1,642 @@
+.. _monitor-config-reference:
+
+==========================
+ Monitor Config Reference
+==========================
+
+Understanding how to configure a :term:`Ceph Monitor` is an important part of
+building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters
+have at least one monitor**. The monitor complement usually remains fairly
+consistent, but you can add, remove or replace a monitor in a cluster. See
+`Adding/Removing a Monitor`_ for details.
+
+
+.. index:: Ceph Monitor; Paxos
+
+Background
+==========
+
+Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`.
+
+The :term:`Cluster Map` makes it possible for :term:`Ceph client`\s to
+determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph
+Metadata Servers. Clients do this by connecting to one Ceph Monitor and
+retrieving a current cluster map. Ceph clients must connect to a Ceph Monitor
+before they can read from or write to Ceph OSD Daemons or Ceph Metadata
+Servers. A Ceph client that has a current copy of the cluster map and the CRUSH
+algorithm can compute the location of any RADOS object within the cluster. This
+makes it possible for Ceph clients to talk directly to Ceph OSD Daemons. Direct
+communication between clients and Ceph OSD Daemons improves upon traditional
+storage architectures that required clients to communicate with a central
+component. See `Scalability and High Availability`_ for more on this subject.
+
+The Ceph Monitor's primary function is to maintain a master copy of the cluster
+map. Monitors also provide authentication and logging services. All changes in
+the monitor services are written by the Ceph Monitor to a single Paxos
+instance, and Paxos writes the changes to a key/value store. This provides
+strong consistency. Ceph Monitors are able to query the most recent version of
+the cluster map during sync operations, and they use the key/value store's
+snapshots and iterators (using RocksDB) to perform store-wide synchronization.
+
+.. ditaa::
+ /-------------\ /-------------\
+ | Monitor | Write Changes | Paxos |
+ | cCCC +-------------->+ cCCC |
+ | | | |
+ +-------------+ \------+------/
+ | Auth | |
+ +-------------+ | Write Changes
+ | Log | |
+ +-------------+ v
+ | Monitor Map | /------+------\
+ +-------------+ | Key / Value |
+ | OSD Map | | Store |
+ +-------------+ | cCCC |
+ | PG Map | \------+------/
+ +-------------+ ^
+ | MDS Map | | Read Changes
+ +-------------+ |
+ | cCCC |*---------------------+
+ \-------------/
+
+.. index:: Ceph Monitor; cluster map
+
+Cluster Maps
+------------
+
+The cluster map is a composite of maps, including the monitor map, the OSD map,
+the placement group map and the metadata server map. The cluster map tracks a
+number of important things: which processes are ``in`` the Ceph Storage
+Cluster; which of those processes are ``up`` and running or ``down``; whether
+the placement groups are ``active`` or ``inactive`` and ``clean`` or in some
+other state; and other details that reflect the current state of the cluster,
+such as the total amount of storage space and the amount of storage used.
+
+When there is a significant change in the state of the cluster--e.g., a Ceph OSD
+Daemon goes down, a placement group falls into a degraded state, etc.--the
+cluster map gets updated to reflect the current state of the cluster.
+Additionally, the Ceph Monitor also maintains a history of the prior states of
+the cluster. The monitor map, OSD map, placement group map and metadata server
+map each maintain a history of their map versions. We call each version an
+"epoch."
+
+When operating your Ceph Storage Cluster, keeping track of these states is an
+important part of your system administration duties. See `Monitoring a Cluster`_
+and `Monitoring OSDs and PGs`_ for additional details.
+
+.. index:: high availability; quorum
+
+Monitor Quorum
+--------------
+
+Our Configuring Ceph section provides a trivial `Ceph configuration file`_ that
+specifies one monitor in the test cluster. A cluster will run fine with a
+single monitor; however, **a single monitor is a single point of failure**. To
+ensure high availability in a production Ceph Storage Cluster, you should run
+Ceph with multiple monitors so that the failure of a single monitor **WILL NOT**
+bring down your entire cluster.
+
+When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability,
+Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map.
+Reaching consensus requires a majority of the monitors to be running so that a
+quorum can be established (e.g., 1 out of 1; 2 out of 3; 3 out of 5; 4 out of
+6; etc.).
+
+.. confval:: mon_force_quorum_join
+
+.. index:: Ceph Monitor; consistency
+
+Consistency
+-----------
+
+When you add monitor settings to your Ceph configuration file, you need to be
+aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes
+strict consistency requirements** for a Ceph monitor when discovering another
+Ceph Monitor within the cluster. Whereas Ceph Clients and other Ceph daemons
+use the Ceph configuration file to discover monitors, monitors discover each
+other by using the monitor map (monmap), not the Ceph configuration file.
+
+A Ceph Monitor always refers to the local copy of the monmap when discovering
+other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the
+Ceph configuration file avoids errors that could break the cluster (e.g., typos
+in ``ceph.conf`` when specifying a monitor address or port). Since monitors use
+monmaps for discovery and they share monmaps with clients and other Ceph
+daemons, **the monmap provides monitors with a strict guarantee that their
+consensus is valid.**
+
+Strict consistency also applies to updates to the monmap. As with any other
+updates on the Ceph Monitor, changes to the monmap always run through a
+distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on
+each update to the monmap, such as adding or removing a Ceph Monitor, to ensure
+that each monitor in the quorum has the same version of the monmap. Updates to
+the monmap are incremental so that Ceph Monitors have the latest agreed upon
+version, and a set of previous versions. Maintaining a history enables a Ceph
+Monitor that has an older version of the monmap to catch up with the current
+state of the Ceph Storage Cluster.
+
+If Ceph Monitors were to discover each other through the Ceph configuration file
+instead of through the monmap, additional risks would be introduced because
+Ceph configuration files are not updated and distributed automatically. Ceph
+Monitors might inadvertently use an older Ceph configuration file, fail to
+recognize a Ceph Monitor, fall out of a quorum, or develop a situation where
+`Paxos`_ is not able to determine the current state of the system accurately.
+
+
+.. index:: Ceph Monitor; bootstrapping monitors
+
+Bootstrapping Monitors
+----------------------
+
+In most configuration and deployment cases, tools that deploy Ceph help
+bootstrap the Ceph Monitors by generating a monitor map for you (e.g.,
+``cephadm``, etc). A Ceph Monitor requires a few explicit
+settings:
+
+- **Filesystem ID**: The ``fsid`` is the unique identifier for your
+ object store. Since you can run multiple clusters on the same
+ hardware, you must specify the unique ID of the object store when
+ bootstrapping a monitor. Deployment tools usually do this for you
+ (e.g., ``cephadm`` can call a tool like ``uuidgen``), but you
+ may specify the ``fsid`` manually too.
+
+- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within
+ the cluster. It is an alphanumeric value, and by convention the identifier
+ usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This
+ can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.),
+ by a deployment tool, or using the ``ceph`` commandline.
+
+- **Keys**: The monitor must have secret keys. A deployment tool such as
+ ``cephadm`` usually does this for you, but you may
+ perform this step manually too. See `Monitor Keyrings`_ for details.
+
+For additional details on bootstrapping, see `Bootstrapping a Monitor`_.
+
+.. index:: Ceph Monitor; configuring monitors
+
+Configuring Monitors
+====================
+
+To apply configuration settings to the entire cluster, enter the configuration
+settings under ``[global]``. To apply configuration settings to all monitors in
+your cluster, enter the configuration settings under ``[mon]``. To apply
+configuration settings to specific monitors, specify the monitor instance
+(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation.
+
+.. code-block:: ini
+
+ [global]
+
+ [mon]
+
+ [mon.a]
+
+ [mon.b]
+
+ [mon.c]
+
+
+Minimum Configuration
+---------------------
+
+The bare minimum monitor settings in the Ceph configuration file include a
+hostname and a network address for each monitor. You can configure these under
+``[mon]`` or under the entry for a specific monitor.
+
+.. code-block:: ini
+
+ [global]
+ mon_host = 10.0.0.2,10.0.0.3,10.0.0.4
+
+.. code-block:: ini
+
+ [mon.a]
+ host = hostname1
+ mon_addr = 10.0.0.10:6789
+
+See the `Network Configuration Reference`_ for details.
+
+.. note:: This minimum configuration for monitors assumes that a deployment
+ tool generates the ``fsid`` and the ``mon.`` key for you.
+
+Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP addresses of
+monitors. However, if you decide to change the monitor's IP address, you
+must follow a specific procedure. See :ref:`Changing a Monitor's IP address` for
+details.
+
+Monitors can also be found by clients by using DNS SRV records. See `Monitor lookup through DNS`_ for details.
+
+Cluster ID
+----------
+
+Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it
+usually appears under the ``[global]`` section of the configuration file.
+Deployment tools usually generate the ``fsid`` and store it in the monitor map,
+so the value may not appear in a configuration file. The ``fsid`` makes it
+possible to run daemons for multiple clusters on the same hardware.
+
+.. confval:: fsid
+
+.. index:: Ceph Monitor; initial members
+
+Initial Members
+---------------
+
+We recommend running a production Ceph Storage Cluster with at least three Ceph
+Monitors to ensure high availability. When you run multiple monitors, you may
+specify the initial monitors that must be members of the cluster in order to
+establish a quorum. This may reduce the time it takes for your cluster to come
+online.
+
+.. code-block:: ini
+
+ [mon]
+ mon_initial_members = a,b,c
+
+
+.. confval:: mon_initial_members
+
+.. index:: Ceph Monitor; data path
+
+Data
+----
+
+Ceph provides a default path where Ceph Monitors store data. For optimal
+performance in a production Ceph Storage Cluster, we recommend running Ceph
+Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb uses
+``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk
+very often, which can interfere with Ceph OSD Daemon workloads if the data
+store is co-located with the OSD Daemons.
+
+In Ceph versions 0.58 and earlier, Ceph Monitors stored their data in plain
+files. This approach allowed users to inspect monitor data with common tools
+like ``ls`` and ``cat``. However, it did not provide strong consistency.
+
+In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value
+pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents
+recovering Ceph Monitors from running corrupted versions through Paxos, and it
+enables multiple modification operations in a single atomic batch, among other
+advantages.
+
+Generally, we do not recommend changing the default data location. If you modify
+the default location, we recommend that you make it uniform across Ceph Monitors
+by setting it in the ``[mon]`` section of the configuration file.
+
+
+.. confval:: mon_data
+.. confval:: mon_data_size_warn
+.. confval:: mon_data_avail_warn
+.. confval:: mon_data_avail_crit
+.. confval:: mon_warn_on_crush_straw_calc_version_zero
+.. confval:: mon_warn_on_legacy_crush_tunables
+.. confval:: mon_crush_min_required_version
+.. confval:: mon_warn_on_osd_down_out_interval_zero
+.. confval:: mon_warn_on_slow_ping_ratio
+.. confval:: mon_warn_on_slow_ping_time
+.. confval:: mon_warn_on_pool_no_redundancy
+.. confval:: mon_cache_target_full_warn_ratio
+.. confval:: mon_health_to_clog
+.. confval:: mon_health_to_clog_tick_interval
+.. confval:: mon_health_to_clog_interval
+
+.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning
+
+.. _storage-capacity:
+
+Storage Capacity
+----------------
+
+When a Ceph Storage Cluster gets close to its maximum capacity
+(see ``mon_osd_full_ratio``), Ceph prevents you from writing to or reading from
+OSDs as a safety measure to prevent data loss. Therefore, letting a
+production Ceph Storage Cluster approach its full ratio is not a good practice,
+because it sacrifices high availability. The default full ratio is ``.95``, or
+95% of capacity. This is a very aggressive setting for a test cluster with a
+small number of OSDs.
+
+.. tip:: When monitoring your cluster, be alert to warnings related to the
+   ``nearfull`` ratio. Such a warning means that the failure of one or more
+   OSDs could result in a temporary service disruption. Consider adding
+   more OSDs to increase storage capacity.
+
+A common scenario for test clusters involves a system administrator removing an
+OSD from the Ceph Storage Cluster, watching the cluster rebalance, then removing
+another OSD, and another, until at least one OSD eventually reaches the full
+ratio and the cluster locks up. We recommend a bit of capacity
+planning even with a test cluster. Planning enables you to gauge how much spare
+capacity you will need in order to maintain high availability. Ideally, you want
+to plan for a series of Ceph OSD Daemon failures where the cluster can recover
+to an ``active+clean`` state without replacing those OSDs
+immediately. Cluster operation continues in the ``active+degraded`` state, but this
+is not ideal for normal operation and should be addressed promptly.
+
+The following diagram depicts a simplistic Ceph Storage Cluster containing 33
+Ceph Nodes with one OSD per host, each OSD reading from
+and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum
+actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph
+Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow
+Ceph Clients to read and write data. So the Ceph Storage Cluster's operating
+capacity is 95TB, not 99TB.
+
+.. ditaa::
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 |
+ | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+
+It is normal in such a cluster for one or two OSDs to fail. A less frequent but
+reasonable scenario involves a rack's router or power supply failing, which
+brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario,
+you should still strive for a cluster that can remain operational and achieve an
+``active + clean`` state--even if that means adding a few hosts with additional
+OSDs in short order. If the cluster's capacity utilization is too high, you may
+not lose data, but you could still sacrifice data availability while resolving
+an outage within a failure domain if the utilization exceeds the full ratio.
+For this reason, we recommend at least some rough capacity planning.
+
+Identify two numbers for your cluster:
+
+#. The number of OSDs.
+#. The total capacity of the cluster.
+
+If you divide the total capacity of your cluster by the number of OSDs in your
+cluster, you will find the mean capacity of an OSD within your cluster.
+Consider multiplying that number by the number of OSDs you expect to fail
+simultaneously during normal operations (a relatively small number). Finally,
+multiply the capacity of the cluster by the full ratio to arrive at a maximum
+operating capacity; then, subtract the amount of data on the OSDs you expect to
+fail to arrive at a reasonable full ratio. Repeat the foregoing process with a
+higher number of OSD failures (e.g., a rack of OSDs) to arrive at a reasonable
+number for a nearfull ratio.
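+
+As a rough sketch of this arithmetic (the numbers are purely illustrative and
+unrelated to the cluster above), assume 10 OSDs of 4 TB each, a full ratio of
+``0.95``, and a plan for one OSD failing during normal operations::
+
+    total capacity              = 10 x 4 TB          = 40 TB
+    mean OSD capacity           = 40 TB / 10 OSDs    = 4 TB
+    maximum operating capacity  = 40 TB x 0.95       = 38 TB
+    planning target             = 38 TB - (1 x 4 TB) = 34 TB
+
+In other words, you would plan to keep the amount of stored data comfortably
+below roughly 34 TB so that the loss of one OSD does not push the cluster past
+its full ratio.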
+
+The following settings only apply on cluster creation and are then stored in
+the OSDMap. To clarify, in normal operation the values that are used by OSDs
+are those found in the OSDMap, not those in the configuration file or central
+config store.
+
+.. code-block:: ini
+
+ [global]
+ mon_osd_full_ratio = .80
+ mon_osd_backfillfull_ratio = .75
+ mon_osd_nearfull_ratio = .70
+
+
+``mon_osd_full_ratio``
+
+:Description: The threshold percentage of device space utilized before an OSD is
+ considered ``full``.
+
+:Type: Float
+:Default: ``0.95``
+
+
+``mon_osd_backfillfull_ratio``
+
+:Description: The threshold percentage of device space utilized before an OSD is
+ considered too ``full`` to backfill.
+
+:Type: Float
+:Default: ``0.90``
+
+
+``mon_osd_nearfull_ratio``
+
+:Description: The threshold percentage of device space used before an OSD is
+ considered ``nearfull``.
+
+:Type: Float
+:Default: ``0.85``
+
+
+.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you
+ may have an inaccurate CRUSH weight set for the nearfull OSDs.
+
+.. tip:: These settings only apply during cluster creation. Afterwards they
+   need to be changed in the OSDMap using the ``ceph osd set-nearfull-ratio``,
+   ``ceph osd set-backfillfull-ratio``, and ``ceph osd set-full-ratio``
+   commands.
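+
+For example, the ratios can be adjusted on a running cluster as follows (the
+values shown are the defaults and are purely illustrative):
+
+.. prompt:: bash $
+
+   ceph osd set-nearfull-ratio 0.85
+
+.. prompt:: bash $
+
+   ceph osd set-backfillfull-ratio 0.90
+
+.. prompt:: bash $
+
+   ceph osd set-full-ratio 0.95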
+
+.. index:: heartbeat
+
+Heartbeat
+---------
+
+Ceph monitors know about the cluster by requiring reports from each OSD, and by
+receiving reports from OSDs about the status of their neighboring OSDs. Ceph
+provides reasonable default settings for monitor/OSD interaction; however, you
+may modify them as needed. See `Monitor/OSD Interaction`_ for details.
+
+
+.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization
+
+Monitor Store Synchronization
+-----------------------------
+
+When you run a production cluster with multiple monitors (recommended), each
+monitor checks to see if a neighboring monitor has a more recent version of the
+cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers
+higher than the most current epoch in the map of the monitor in question).
+Periodically, one monitor in the cluster may fall behind the other monitors to
+the point where it must leave the quorum, synchronize to retrieve the most
+current information about the cluster, and then rejoin the quorum. For the
+purposes of synchronization, monitors may assume one of three roles:
+
+#. **Leader**: The `Leader` is the first monitor to achieve the most recent
+ Paxos version of the cluster map.
+
+#. **Provider**: The `Provider` is a monitor that has the most recent version
+ of the cluster map, but wasn't the first to achieve the most recent version.
+
+#. **Requester:** A `Requester` is a monitor that has fallen behind the leader
+ and must synchronize in order to retrieve the most recent information about
+ the cluster before it can rejoin the quorum.
+
+These roles enable a leader to delegate synchronization duties to a provider,
+which prevents synchronization requests from overloading the leader--improving
+performance. In the following diagram, the requester has learned that it has
+fallen behind the other monitors. The requester asks the leader to synchronize,
+and the leader tells the requester to synchronize with a provider.
+
+
+.. ditaa::
+ +-----------+ +---------+ +----------+
+ | Requester | | Leader | | Provider |
+ +-----------+ +---------+ +----------+
+ | | |
+ | | |
+ | Ask to Synchronize | |
+ |------------------->| |
+ | | |
+ |<-------------------| |
+ | Tell Requester to | |
+ | Sync with Provider | |
+ | | |
+ | Synchronize |
+ |--------------------+-------------------->|
+ | | |
+ |<-------------------+---------------------|
+ | Send Chunk to Requester |
+ | (repeat as necessary) |
+ | Requester Acks Chuck to Provider |
+ |--------------------+-------------------->|
+ | |
+ | Sync Complete |
+ | Notification |
+ |------------------->|
+ | |
+ |<-------------------|
+ | Ack |
+ | |
+
+
+Synchronization always occurs when a new monitor joins the cluster. During
+runtime operations, monitors may receive updates to the cluster map at different
+times. This means the leader and provider roles may migrate from one monitor to
+another. If this happens while synchronizing (e.g., a provider falls behind the
+leader), the provider can terminate synchronization with a requester.
+
+Once synchronization is complete, Ceph performs trimming across the cluster.
+Trimming requires that the placement groups are ``active+clean``.
+
+
+.. confval:: mon_sync_timeout
+.. confval:: mon_sync_max_payload_size
+.. confval:: paxos_max_join_drift
+.. confval:: paxos_stash_full_interval
+.. confval:: paxos_propose_interval
+.. confval:: paxos_min
+.. confval:: paxos_min_wait
+.. confval:: paxos_trim_min
+.. confval:: paxos_trim_max
+.. confval:: paxos_service_trim_min
+.. confval:: paxos_service_trim_max
+.. confval:: paxos_service_trim_max_multiplier
+.. confval:: mon_mds_force_trim_to
+.. confval:: mon_osd_force_trim_to
+.. confval:: mon_osd_cache_size
+.. confval:: mon_election_timeout
+.. confval:: mon_lease
+.. confval:: mon_lease_renew_interval_factor
+.. confval:: mon_lease_ack_timeout_factor
+.. confval:: mon_accept_timeout_factor
+.. confval:: mon_min_osdmap_epochs
+.. confval:: mon_max_log_epochs
+
+
+.. index:: Ceph Monitor; clock
+
+.. _mon-config-ref-clock:
+
+Clock
+-----
+
+Ceph daemons pass critical messages to each other, which must be processed
+before daemons reach a timeout threshold. If the clocks in Ceph monitors
+are not synchronized, it can lead to a number of anomalies. For example:
+
+- Daemons ignoring received messages (e.g., because their timestamps are outdated)
+- Timeouts triggered too soon or too late when a message wasn't received in time
+
+See `Monitor Store Synchronization`_ for details.
+
+
+.. tip:: You must configure NTP or PTP daemons on your Ceph monitor hosts to
+ ensure that the monitor cluster operates with synchronized clocks.
+ It can be advantageous to have monitor hosts sync with each other
+ as well as with multiple quality upstream time sources.
+
+Clock drift may still be noticeable with NTP even though the discrepancy is not
+yet harmful. Ceph's clock drift / clock skew warnings may be triggered even
+though NTP maintains a reasonable level of synchronization. Increasing the
+allowed clock drift may be tolerable under such circumstances; however, a
+number of factors such as workload, network latency, overrides to default
+timeouts, and the `Monitor Store Synchronization`_ settings may influence
+the level of acceptable clock drift without compromising Paxos guarantees.
+
+Ceph provides the following tunable options to allow you to find
+acceptable values.
+
+.. confval:: mon_tick_interval
+.. confval:: mon_clock_drift_allowed
+.. confval:: mon_clock_drift_warn_backoff
+.. confval:: mon_timecheck_interval
+.. confval:: mon_timecheck_skew_interval
+
+Client
+------
+
+.. confval:: mon_client_hunt_interval
+.. confval:: mon_client_ping_interval
+.. confval:: mon_client_max_log_entries_per_message
+.. confval:: mon_client_bytes
+
+.. _pool-settings:
+
+Pool settings
+=============
+
+Since version v0.94 there has been support for pool flags, which allow or
+disallow changes to be made to pools. Monitors can also disallow the removal of
+pools if appropriately configured. The inconvenience of this guardrail is far
+outweighed by the number of accidental pool (and thus data) deletions it
+prevents.
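+
+For example, a minimal sketch of how an operator might keep pool deletion
+disabled cluster-wide and additionally protect a specific pool (the pool name
+``mypool`` is illustrative):
+
+.. prompt:: bash $
+
+   ceph config set mon mon_allow_pool_delete false
+
+.. prompt:: bash $
+
+   ceph osd pool set mypool nodelete true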
+
+.. confval:: mon_allow_pool_delete
+.. confval:: osd_pool_default_ec_fast_read
+.. confval:: osd_pool_default_flag_hashpspool
+.. confval:: osd_pool_default_flag_nodelete
+.. confval:: osd_pool_default_flag_nopgchange
+.. confval:: osd_pool_default_flag_nosizechange
+
+For more information about the pool flags see :ref:`Pool values <setpoolvalues>`.
+
+Miscellaneous
+=============
+
+.. confval:: mon_max_osd
+.. confval:: mon_globalid_prealloc
+.. confval:: mon_subscribe_interval
+.. confval:: mon_stat_smooth_intervals
+.. confval:: mon_probe_timeout
+.. confval:: mon_daemon_bytes
+.. confval:: mon_max_log_entries_per_event
+.. confval:: mon_osd_prime_pg_temp
+.. confval:: mon_osd_prime_pg_temp_max_time
+.. confval:: mon_osd_prime_pg_temp_max_estimate
+.. confval:: mon_mds_skip_sanity
+.. confval:: mon_max_mdsmap_epochs
+.. confval:: mon_config_key_max_entry_size
+.. confval:: mon_scrub_interval
+.. confval:: mon_scrub_max_keys
+.. confval:: mon_compact_on_start
+.. confval:: mon_compact_on_bootstrap
+.. confval:: mon_compact_on_trim
+.. confval:: mon_cpu_threads
+.. confval:: mon_osd_mapping_pgs_per_chunk
+.. confval:: mon_session_timeout
+.. confval:: mon_osd_cache_size_min
+.. confval:: mon_memory_target
+.. confval:: mon_memory_autotune
+
+.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
+.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys
+.. _Ceph configuration file: ../ceph-conf/#monitors
+.. _Network Configuration Reference: ../network-config-ref
+.. _Monitor lookup through DNS: ../mon-lookup-dns
+.. _ACID: https://en.wikipedia.org/wiki/ACID
+.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons
+.. _Monitoring a Cluster: ../../operations/monitoring
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg
+.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap
+.. _Monitor/OSD Interaction: ../mon-osd-interaction
+.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
diff --git a/doc/rados/configuration/mon-lookup-dns.rst b/doc/rados/configuration/mon-lookup-dns.rst
new file mode 100644
index 000000000..129a083c4
--- /dev/null
+++ b/doc/rados/configuration/mon-lookup-dns.rst
@@ -0,0 +1,58 @@
+.. _mon-dns-lookup:
+
+===============================
+Looking up Monitors through DNS
+===============================
+
+Since Ceph version 11.0.0 (Kraken), RADOS has supported looking up monitors
+through DNS.
+
+The addition of the ability to look up monitors through DNS means that daemons
+and clients do not require a *mon host* configuration directive in their
+``ceph.conf`` configuration file.
+
+With a DNS update, clients and daemons can be made aware of changes
+in the monitor topology. To be more precise and technical, clients look up the
+monitors by using ``DNS SRV TCP`` records.
+
+By default, clients and daemons look for the TCP service called *ceph-mon*,
+which is configured by the *mon_dns_srv_name* configuration directive.
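+
+For example, to have clients and daemons look up a differently named SRV
+record (the service name below is purely illustrative), set the directive in
+the local ``ceph.conf`` of each client and daemon, because it must be known
+before the monitors can be found:
+
+.. code-block:: ini
+
+   [global]
+   mon_dns_srv_name = ceph-mon-internal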
+
+
+.. confval:: mon_dns_srv_name
+
+Example
+-------
+When the DNS search domain is set to *example.com*, a DNS zone file might
+contain the following elements.
+
+First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA).
+
+::
+
+ mon1.example.com. AAAA 2001:db8::100
+ mon2.example.com. AAAA 2001:db8::200
+ mon3.example.com. AAAA 2001:db8::300
+
+::
+
+ mon1.example.com. A 192.168.0.1
+ mon2.example.com. A 192.168.0.2
+ mon3.example.com. A 192.168.0.3
+
+
+With those records in place, we can create the SRV TCP records with the service
+name *ceph-mon* pointing to the three Monitors.
+
+::
+
+ _ceph-mon._tcp.example.com. 60 IN SRV 10 20 6789 mon1.example.com.
+ _ceph-mon._tcp.example.com. 60 IN SRV 10 30 6789 mon2.example.com.
+ _ceph-mon._tcp.example.com. 60 IN SRV 20 50 6789 mon3.example.com.
+
+In this example, all Monitors run on port *6789*, with priorities 10, 10, and
+20 and weights 20, 30, and 50, respectively.
+
+Monitor clients choose a monitor by referencing the SRV records. If a cluster
+has multiple Monitor SRV records with the same priority value, clients and
+daemons will load balance the connections to Monitors in proportion to the
+values of the SRV weight fields.
+
+For the above example, this will result in approximately 40% of the clients and
+daemons connecting to mon1 and 60% of them connecting to mon2. However, if
+neither of them is reachable, mon3 will be considered as a fallback.
diff --git a/doc/rados/configuration/mon-osd-interaction.rst b/doc/rados/configuration/mon-osd-interaction.rst
new file mode 100644
index 000000000..8cf09707d
--- /dev/null
+++ b/doc/rados/configuration/mon-osd-interaction.rst
@@ -0,0 +1,245 @@
+=====================================
+ Configuring Monitor/OSD Interaction
+=====================================
+
+.. index:: heartbeat
+
+After you have completed your initial Ceph configuration, you may deploy and run
+Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the
+:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage
+Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring
+reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph
+OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph
+Monitor doesn't receive reports, or if it receives reports of changes in the
+Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph
+Cluster Map`.
+
+Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon
+interaction. However, you may override the defaults. The following sections
+describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of
+monitoring the Ceph Storage Cluster.
+
+.. index:: heartbeat interval
+
+OSDs Check Heartbeats
+=====================
+
+Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random
+intervals of less than 6 seconds. If a neighboring Ceph OSD Daemon doesn't
+show a heartbeat within a 20-second grace period, the Ceph OSD Daemon may
+consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph
+Monitor, which will update the Ceph Cluster Map. You may change this grace
+period by adding an ``osd heartbeat grace`` setting under the ``[mon]``
+and ``[osd]`` or ``[global]`` sections of your Ceph configuration file,
+or by setting the value at runtime.
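+
+For example, a runtime override might look like this (the 30-second value is
+purely illustrative):
+
+.. prompt:: bash $
+
+   ceph config set global osd_heartbeat_grace 30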
+
+
+.. ditaa::
+ +---------+ +---------+
+ | OSD 1 | | OSD 2 |
+ +---------+ +---------+
+ | |
+ |----+ Heartbeat |
+ | | Interval |
+ |<---+ Exceeded |
+ | |
+ | Check |
+ | Heartbeat |
+ |------------------->|
+ | |
+ |<-------------------|
+ | Heart Beating |
+ | |
+ |----+ Heartbeat |
+ | | Interval |
+ |<---+ Exceeded |
+ | |
+ | Check |
+ | Heartbeat |
+ |------------------->|
+ | |
+ |----+ Grace |
+ | | Period |
+ |<---+ Exceeded |
+ | |
+ |----+ Mark |
+ | | OSD 2 |
+ |<---+ Down |
+
+
+.. index:: OSD down report
+
+OSDs Report Down OSDs
+=====================
+
+By default, two Ceph OSD Daemons from different hosts must report to the Ceph
+Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors
+acknowledge that the reported Ceph OSD Daemon is ``down``. But there is a
+chance that all the OSDs reporting the failure are hosted in a rack with a bad
+switch that has trouble connecting to another OSD. To avoid this sort of false
+alarm, we consider the peers reporting a failure a proxy for a potential
+"subcluster" of the overall cluster that is similarly laggy. This is clearly
+not true in all cases, but will sometimes help us localize the grace correction
+to a subset of the system that is unhappy. ``mon osd reporter subtree level``
+is used to group the peers into the "subcluster" by their common ancestor type
+in the CRUSH map. By default, only two reports from different subtrees are
+required to report another Ceph OSD Daemon ``down``. You can change the number
+of reporters from unique subtrees and the common ancestor type required to
+report a Ceph OSD Daemon ``down`` to a Ceph Monitor by adding the ``mon osd min
+down reporters`` and ``mon osd reporter subtree level`` settings under the
+``[mon]`` section of your Ceph configuration file, or by setting the value at
+runtime.
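+
+For example, a sketch that requires three reporters from distinct racks (the
+values are illustrative):
+
+.. code-block:: ini
+
+   [mon]
+   mon_osd_min_down_reporters = 3
+   mon_osd_reporter_subtree_level = rack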
+
+
+.. ditaa::
+
+ +---------+ +---------+ +---------+
+ | OSD 1 | | OSD 2 | | Monitor |
+ +---------+ +---------+ +---------+
+ | | |
+ | OSD 3 Is Down | |
+ |---------------+--------------->|
+ | | |
+ | | |
+ | | OSD 3 Is Down |
+ | |--------------->|
+ | | |
+ | | |
+ | | |---------+ Mark
+ | | | | OSD 3
+ | | |<--------+ Down
+
+
+.. index:: peering failure
+
+OSDs Report Peering Failure
+===========================
+
+If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its
+Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for
+the most recent copy of the cluster map every 30 seconds. You can change the
+Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval``
+setting under the ``[osd]`` section of your Ceph configuration file, or by
+setting the value at runtime.
+
+.. ditaa::
+
+ +---------+ +---------+ +-------+ +---------+
+ | OSD 1 | | OSD 2 | | OSD 3 | | Monitor |
+ +---------+ +---------+ +-------+ +---------+
+ | | | |
+ | Request To | | |
+ | Peer | | |
+ |-------------->| | |
+ |<--------------| | |
+ | Peering | |
+ | | |
+ | Request To | |
+ | Peer | |
+ |----------------------------->| |
+ | |
+ |----+ OSD Monitor |
+ | | Heartbeat |
+ |<---+ Interval Exceeded |
+ | |
+ | Failed to Peer with OSD 3 |
+ |-------------------------------------------->|
+ |<--------------------------------------------|
+ | Receive New Cluster Map |
+
+
+.. index:: OSD status
+
+OSDs Report Their Status
+========================
+
+If a Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will
+consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout``
+elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor within 5 seconds of
+a reportable event such as a failure, a change in placement group stats, a
+change in ``up_thru``, or a boot. You can change the Ceph OSD
+Daemon minimum report interval by adding an ``osd mon report interval``
+setting under the ``[osd]`` section of your Ceph configuration file, or by
+setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph
+Monitor every 120 seconds irrespective of whether any notable changes occur.
+You can change the Ceph Monitor report interval by adding an ``osd mon report
+interval max`` setting under the ``[osd]`` section of your Ceph configuration
+file, or by setting the value at runtime.
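+
+For example, a runtime override of the minimum report interval might look like
+this (the value is illustrative):
+
+.. prompt:: bash $
+
+   ceph config set osd osd_mon_report_interval 10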
+
+
+.. ditaa::
+
+ +---------+ +---------+
+ | OSD 1 | | Monitor |
+ +---------+ +---------+
+ | |
+ |----+ Report Min |
+ | | Interval |
+ |<---+ Exceeded |
+ | |
+ |----+ Reportable |
+ | | Event |
+ |<---+ Occurs |
+ | |
+ | Report To |
+ | Monitor |
+ |------------------->|
+ | |
+ |----+ Report Max |
+ | | Interval |
+ |<---+ Exceeded |
+ | |
+ | Report To |
+ | Monitor |
+ |------------------->|
+ | |
+ |----+ Monitor |
+ | | Fails |
+ |<---+ |
+ +----+ Monitor OSD
+ | | Report Timeout
+ |<---+ Exceeded
+ |
+ +----+ Mark
+ | | OSD 1
+ |<---+ Down
+
+
+
+
+Configuration Settings
+======================
+
+When modifying heartbeat settings, you should include them in the ``[global]``
+section of your configuration file.
+
+.. index:: monitor heartbeat
+
+Monitor Settings
+----------------
+
+.. confval:: mon_osd_min_up_ratio
+.. confval:: mon_osd_min_in_ratio
+.. confval:: mon_osd_laggy_halflife
+.. confval:: mon_osd_laggy_weight
+.. confval:: mon_osd_laggy_max_interval
+.. confval:: mon_osd_adjust_heartbeat_grace
+.. confval:: mon_osd_adjust_down_out_interval
+.. confval:: mon_osd_auto_mark_in
+.. confval:: mon_osd_auto_mark_auto_out_in
+.. confval:: mon_osd_auto_mark_new_in
+.. confval:: mon_osd_down_out_interval
+.. confval:: mon_osd_down_out_subtree_limit
+.. confval:: mon_osd_report_timeout
+.. confval:: mon_osd_min_down_reporters
+.. confval:: mon_osd_reporter_subtree_level
+
+.. index:: OSD heartbeat
+
+OSD Settings
+------------
+
+.. confval:: osd_heartbeat_interval
+.. confval:: osd_heartbeat_grace
+.. confval:: osd_mon_heartbeat_interval
+.. confval:: osd_mon_heartbeat_stat_stale
+.. confval:: osd_mon_report_interval
diff --git a/doc/rados/configuration/msgr2.rst b/doc/rados/configuration/msgr2.rst
new file mode 100644
index 000000000..33fe4e022
--- /dev/null
+++ b/doc/rados/configuration/msgr2.rst
@@ -0,0 +1,257 @@
+.. _msgr2:
+
+Messenger v2
+============
+
+What is it
+----------
+
+The messenger v2 protocol, or msgr2, is the second major revision of
+Ceph's on-wire protocol. It brings with it several key features:
+
+* A *secure* mode that encrypts all data passing over the network
+* Improved encapsulation of authentication payloads, enabling future
+ integration of new authentication modes like Kerberos
+* Improved earlier feature advertisement and negotiation, enabling
+ future protocol revisions
+
+Ceph daemons can now bind to multiple ports, allowing both legacy Ceph
+clients and new v2-capable clients to connect to the same cluster.
+
+By default, monitors now bind to the new IANA-assigned port ``3300``
+(ce4h or 0xce4) for the new v2 protocol, while also binding to the
+old default port ``6789`` for the legacy v1 protocol.
+
+.. _address_formats:
+
+Address formats
+---------------
+
+Prior to Nautilus, all network addresses were rendered like
+``1.2.3.4:567/89012`` where there was an IP address, a port, and a
+nonce to uniquely identify a client or daemon on the network.
+Starting with Nautilus, we now have three different address types:
+
+* **v2**: ``v2:1.2.3.4:578/89012`` identifies a daemon binding to a
+ port speaking the new v2 protocol
+* **v1**: ``v1:1.2.3.4:578/89012`` identifies a daemon binding to a
+ port speaking the legacy v1 protocol. Any address that was
+ previously shown with any prefix is now shown as a ``v1:`` address.
+* **TYPE_ANY**: ``any:1.2.3.4:578/89012`` identifies a client that can
+  speak either version of the protocol. Prior to Nautilus, clients would appear as
+ ``1.2.3.4:0/123456``, where the port of 0 indicates they are clients
+ and do not accept incoming connections. Starting with Nautilus,
+ these clients are now internally represented by a **TYPE_ANY**
+ address, and still shown with no prefix, because they may
+ connect to daemons using the v2 or v1 protocol, depending on what
+ protocol(s) the daemons are using.
+
+Because daemons now bind to multiple ports, they are now described by
+a vector of addresses instead of a single address. For example,
+dumping the monitor map on a Nautilus cluster now includes lines
+like::
+
+ epoch 1
+ fsid 50fcf227-be32-4bcb-8b41-34ca8370bd16
+ last_changed 2019-02-25 11:10:46.700821
+ created 2019-02-25 11:10:46.700821
+ min_mon_release 14 (nautilus)
+ 0: [v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0] mon.foo
+ 1: [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] mon.bar
+ 2: [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] mon.baz
+
+The bracketed list or vector of addresses means that the same daemon can be
+reached on multiple ports (and protocols). Any client or other daemon
+connecting to that daemon will use the v2 protocol (listed first) if
+possible; otherwise it will fall back to the legacy v1 protocol. Legacy
+clients will only see the v1 addresses and will continue to connect as
+they did before, with the v1 protocol.
+
+Starting in Nautilus, the ``mon_host`` configuration option and ``-m
+<mon-host>`` command line options support the same bracketed address
+vector syntax.
+
+
+Bind configuration options
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two new configuration options control whether the v1 and/or v2
+protocol is used:
+
+ * :confval:`ms_bind_msgr1` [default: true] controls whether a daemon binds
+ to a port speaking the v1 protocol
+ * :confval:`ms_bind_msgr2` [default: true] controls whether a daemon binds
+ to a port speaking the v2 protocol
+
+Similarly, two options control whether IPv4 and IPv6 addresses are used:
+
+ * :confval:`ms_bind_ipv4` [default: true] controls whether a daemon binds
+ to an IPv4 address
+ * :confval:`ms_bind_ipv6` [default: false] controls whether a daemon binds
+ to an IPv6 address
+
+.. note:: The ability to bind to multiple ports has paved the way for
+ dual-stack IPv4 and IPv6 support. That said, dual-stack operation is
+ not yet supported as of Quincy v17.2.0.
+
+Connection modes
+----------------
+
+The v2 protocol supports two connection modes:
+
+* *crc* mode provides:
+
+ - a strong initial authentication when the connection is established
+ (with cephx, mutual authentication of both parties with protection
+ from a man-in-the-middle or eavesdropper), and
+ - a crc32c integrity check to protect against bit flips due to flaky
+ hardware or cosmic rays
+
+ *crc* mode does *not* provide:
+
+ - secrecy (an eavesdropper on the network can see all
+ post-authentication traffic as it goes by) or
+  - protection from a malicious man-in-the-middle (who can deliberately
+ modify traffic as it goes by, as long as they are careful to
+ adjust the crc32c values to match)
+
+* *secure* mode provides:
+
+ - a strong initial authentication when the connection is established
+ (with cephx, mutual authentication of both parties with protection
+ from a man-in-the-middle or eavesdropper), and
+ - full encryption of all post-authentication traffic, including a
+ cryptographic integrity check.
+
+ In Nautilus, secure mode uses the `AES-GCM
+ <https://en.wikipedia.org/wiki/Galois/Counter_Mode>`_ stream cipher,
+ which is generally very fast on modern processors (e.g., faster than
+ a SHA-256 cryptographic hash).
+
+Connection mode configuration options
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For most connections, there are options that control which modes are used:
+
+.. confval:: ms_cluster_mode
+.. confval:: ms_service_mode
+.. confval:: ms_client_mode
+
+There are a parallel set of options that apply specifically to
+monitors, allowing administrators to set different (usually more
+secure) requirements on communication with the monitors.
+
+.. confval:: ms_mon_cluster_mode
+.. confval:: ms_mon_service_mode
+.. confval:: ms_mon_client_mode
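+
+For example, a sketch that requires the *secure* mode for cluster, service,
+and client connections (the monitor-specific options above can be set
+analogously; make sure all daemons and clients are recent enough to negotiate
+secure mode before doing this):
+
+.. prompt:: bash $
+
+   ceph config set global ms_cluster_mode secure
+
+.. prompt:: bash $
+
+   ceph config set global ms_service_mode secure
+
+.. prompt:: bash $
+
+   ceph config set global ms_client_mode secure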
+
+
+Compression modes
+-----------------
+
+The v2 protocol supports two compression modes:
+
+* *force* mode compresses messages. This can be beneficial in scenarios such
+  as the following:
+
+  - In multi-availability-zone deployments, compressing replication messages
+    between OSDs can reduce latency.
+  - In the public cloud, inter-AZ communication is expensive. Minimizing
+    message size reduces network costs to the cloud provider.
+  - When using instance storage on AWS (and probably on other public clouds as
+    well), instances with NVMe devices provide low network bandwidth relative
+    to the device bandwidth. In this case, network compression can improve
+    overall performance, since the network is clearly the bottleneck.
+
+* *none* mode: messages are transmitted without compression.
+
+
+Compression mode configuration options
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For all connections, there is an option that controls compression usage in secure mode:
+
+.. confval:: ms_compress_secure
+
+There is a parallel set of options that apply specifically to OSDs,
+allowing administrators to set different requirements on communication between OSDs.
+
+.. confval:: ms_osd_compress_mode
+.. confval:: ms_osd_compress_min_size
+.. confval:: ms_osd_compression_algorithm
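+
+For example, a sketch that forces compression of OSD messages of at least
+1 KiB (the values are illustrative):
+
+.. prompt:: bash $
+
+   ceph config set osd ms_osd_compress_mode force
+
+.. prompt:: bash $
+
+   ceph config set osd ms_osd_compress_min_size 1024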
+
+Transitioning from v1-only to v2-plus-v1
+----------------------------------------
+
+By default, ``ms_bind_msgr2`` is true starting with Nautilus 14.2.z.
+However, until the monitors start using v2, only limited services will
+start advertising v2 addresses.
+
+For most users, the monitors are binding to the default legacy port ``6789``
+for the v1 protocol. When this is the case, enabling v2 is as simple as:
+
+.. prompt:: bash $
+
+ ceph mon enable-msgr2
+
+If the monitors are bound to non-standard ports, you will need to
+specify an additional port for v2 explicitly. For example, if your
+monitor ``mon.a`` binds to ``1.2.3.4:1111``, and you want to add v2 on
+port ``1112``:
+
+.. prompt:: bash $
+
+ ceph mon set-addrs a [v2:1.2.3.4:1112,v1:1.2.3.4:1111]
+
+Once the monitors bind to v2, each daemon will start advertising a v2
+address when it is next restarted.
+
+
+.. _msgr2_ceph_conf:
+
+Updating ceph.conf and mon_host
+-------------------------------
+
+Prior to Nautilus, a CLI user or daemon would normally discover the
+monitors via the ``mon_host`` option in ``/etc/ceph/ceph.conf``. Starting
+with Nautilus, the syntax for this option has been expanded to support
+the new bracketed list format. For example, an old line
+like::
+
+ mon_host = 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789
+
+Can be changed to::
+
+ mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0],[v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0],[v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0]
+
+However, when default ports are used (``3300`` and ``6789``), they can
+be omitted::
+
+ mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
+
+Once v2 has been enabled on the monitors, ``ceph.conf`` may need to be
+updated to either specify no ports (this is usually simplest), or
+explicitly specify both the v2 and v1 addresses. Note, however, that
+the new bracketed syntax is only understood by Nautilus and later, so
+do not make that change on hosts that have not yet had their ceph
+packages upgraded.
+
+When you are updating ``ceph.conf``, note that the new ``ceph config
+generate-minimal-conf`` command (which generates a barebones config
+file with just enough information to reach the monitors) and the
+``ceph config assimilate-conf`` command (which moves config file options
+into the monitors' configuration database) may be helpful. For example::
+
+   # ceph config assimilate-conf < /etc/ceph/ceph.conf
+   # ceph config generate-minimal-conf > /etc/ceph/ceph.conf.new
+   # cat /etc/ceph/ceph.conf.new
+   # minimal ceph.conf for 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3
+   [global]
+           fsid = 0e5a806b-0ce5-4bc6-b949-aa6f68f5c2a3
+           mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0]
+   # mv /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf
+
+Protocol
+--------
+
+For a detailed description of the v2 wire protocol, see :ref:`msgr2-protocol`.
diff --git a/doc/rados/configuration/network-config-ref.rst b/doc/rados/configuration/network-config-ref.rst
new file mode 100644
index 000000000..81e85c5d1
--- /dev/null
+++ b/doc/rados/configuration/network-config-ref.rst
@@ -0,0 +1,355 @@
+=================================
+ Network Configuration Reference
+=================================
+
+Network configuration is critical for building a high performance :term:`Ceph
+Storage Cluster`. The Ceph Storage Cluster does not perform request routing or
+dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make
+requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication
+on behalf of Ceph Clients, which means replication and other factors impose
+additional loads on Ceph Storage Cluster networks.
+
+Our Quick Start configurations provide a trivial Ceph configuration file that
+sets monitor IP addresses and daemon host names only. Unless you specify a
+cluster network, Ceph assumes a single "public" network. Ceph functions just
+fine with a public network only, but you may see significant performance
+improvement with a second "cluster" network in a large cluster.
+
+It is possible to run a Ceph Storage Cluster with two networks: a public
+(client, front-side) network and a cluster (private, replication, back-side)
+network. However, this approach complicates network configuration (both
+hardware and software) and does not usually have a significant impact on
+overall performance. For this reason, we recommend that, for resilience and
+capacity, dual-NIC systems either bond their interfaces in an active/active
+configuration or implement a layer 3 multipath strategy with, e.g., FRR.
+
+If, despite the complexity, one still wishes to use two networks, each
+:term:`Ceph Node` will need to have more than one network interface or VLAN. See `Hardware
+Recommendations - Networks`_ for additional details.
+
+.. ditaa::
+ +-------------+
+ | Ceph Client |
+ +----*--*-----+
+ | ^
+ Request | : Response
+ v |
+ /----------------------------------*--*-------------------------------------\
+ | Public Network |
+ \---*--*------------*--*-------------*--*------------*--*------------*--*---/
+ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
+ | | | | | | | | | |
+ | : | : | : | : | :
+ v v v v v v v v v v
+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+
+ | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD |
+ +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+
+ ^ ^ ^ ^ ^ ^
+ The cluster network relieves | | | | | |
+ OSD replication and heartbeat | : | : | :
+ traffic from the public network. v v v v v v
+ /------------------------------------*--*------------*--*------------*--*---\
+ | cCCC Cluster Network |
+ \---------------------------------------------------------------------------/
+
+
+IP Tables
+=========
+
+By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may
+configure this range at your discretion. Before configuring your IP tables,
+check the default ``iptables`` configuration.
+
+.. prompt:: bash $
+
+ sudo iptables -L
+
+Some Linux distributions include rules that reject all inbound requests
+except SSH from all network interfaces. For example::
+
+ REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
+
+You will need to delete these rules on both your public and cluster networks
+initially, and replace them with appropriate rules when you are ready to
+harden the ports on your Ceph Nodes.
+
+
+Monitor IP Tables
+-----------------
+
+Ceph Monitors listen on ports ``3300`` and ``6789`` by
+default. Additionally, Ceph Monitors always operate on the public
+network. When you add the rule using the example below, make sure you
+replace ``{iface}`` with the public network interface (e.g., ``eth0``,
+``eth1``, etc.), ``{ip-address}`` with the IP address of the public
+network and ``{netmask}`` with the netmask for the public network:
+
+.. prompt:: bash $
+
+ sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT
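+
+Because Ceph Monitors also listen on port ``3300`` for the v2 protocol (the
+default since Nautilus), you will likely want a matching rule for that port as
+well. For example:
+
+.. prompt:: bash $
+
+   sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 3300 -j ACCEPT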
+
+
+MDS and Manager IP Tables
+-------------------------
+
+A :term:`Ceph Metadata Server` or :term:`Ceph Manager` listens on the first
+available port on the public network beginning at port 6800. Note that this
+behavior is not deterministic, so if you are running more than one OSD or MDS
+on the same host, or if you restart the daemons within a short window of time,
+the daemons will bind to higher ports. You should open the entire 6800-7300
+range by default. When you add the rule using the example below, make sure
+you replace ``{iface}`` with the public network interface (e.g., ``eth0``,
+``eth1``, etc.), ``{ip-address}`` with the IP address of the public network
+and ``{netmask}`` with the netmask of the public network.
+
+For example:
+
+.. prompt:: bash $
+
+ sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT
+
+
+OSD IP Tables
+-------------
+
+By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node
+beginning at port 6800. Note that this behavior is not deterministic, so if you
+are running more than one OSD or MDS on the same host, or if you restart the
+daemons within a short window of time, the daemons will bind to higher ports.
+Each Ceph OSD Daemon on a Ceph Node may use up to four ports:
+
+#. One for talking to clients and monitors.
+#. One for sending data to other OSDs.
+#. Two for heartbeating on each interface.
+
+.. ditaa::
+ /---------------\
+ | OSD |
+ | +---+----------------+-----------+
+ | | Clients & Monitors | Heartbeat |
+ | +---+----------------+-----------+
+ | |
+ | +---+----------------+-----------+
+ | | Data Replication | Heartbeat |
+ | +---+----------------+-----------+
+ | cCCC |
+ \---------------/
+
+When a daemon fails and restarts without letting go of the port, the restarted
+daemon will bind to a new port. You should open the entire 6800-7300 port range
+to handle this possibility.
+
+If you set up separate public and cluster networks, you must add rules for both
+the public network and the cluster network, because clients will connect using
+the public network and other Ceph OSD Daemons will connect using the cluster
+network. When you add the rule using the example below, make sure you replace
+``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.),
+``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the
+public or cluster network. For example:
+
+.. prompt:: bash $
+
+ sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT
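+
+For instance, on a node with a hypothetical public subnet ``192.168.1.0/24``
+on ``eth0`` and a cluster subnet ``10.10.10.0/24`` on ``eth1``, the pair of
+rules might look like the following sketch (substitute your own interfaces
+and subnets):
+
+.. prompt:: bash $
+
+   sudo iptables -A INPUT -i eth0 -m multiport -p tcp -s 192.168.1.0/24 --dports 6800:7300 -j ACCEPT
+   sudo iptables -A INPUT -i eth1 -m multiport -p tcp -s 10.10.10.0/24 --dports 6800:7300 -j ACCEPT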
+
+.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the
+ Ceph OSD Daemons, you can consolidate the public network configuration step.
+
+
+Ceph Networks
+=============
+
+To configure Ceph networks, you must add a network configuration to the
+``[global]`` section of the configuration file. Our 5-minute Quick Start
+provides a trivial Ceph configuration file that assumes one public network
+with client and server on the same network and subnet. Ceph functions just fine
+with a public network only. However, Ceph allows you to establish much more
+specific criteria, including multiple IP networks and subnet masks for your
+public network. You can also establish a separate cluster network to handle OSD
+heartbeat, object replication and recovery traffic. Don't confuse the IP
+addresses you set in your configuration with the public-facing IP addresses
+network clients may use to access your service. Typical internal IP networks are
+often ``192.168.0.0`` or ``10.0.0.0``.
+
+.. tip:: If you specify more than one IP address and subnet mask for
+ either the public or the cluster network, the subnets within the network
+ must be capable of routing to each other. Additionally, make sure you
+ include each IP address/subnet in your IP tables and open ports for them
+ as necessary.
+
+.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``).
+
+When you have configured your networks, you may restart your cluster or restart
+each daemon. Ceph daemons bind dynamically, so you do not have to restart the
+entire cluster at once if you change your network configuration.
+
+
+Public Network
+--------------
+
+To configure a public network, add the following option to the ``[global]``
+section of your Ceph configuration file.
+
+.. code-block:: ini
+
+ [global]
+ # ... elided configuration
+ public_network = {public-network/netmask}
+
+.. _cluster-network:
+
+Cluster Network
+---------------
+
+If you declare a cluster network, OSDs will route heartbeat, object replication
+and recovery traffic over the cluster network. This may improve performance
+compared to using a single network. To configure a cluster network, add the
+following option to the ``[global]`` section of your Ceph configuration file.
+
+.. code-block:: ini
+
+ [global]
+ # ... elided configuration
+ cluster_network = {cluster-network/netmask}
+
+We prefer that the cluster network is **NOT** reachable from the public network
+or the Internet for added security.
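+
+For example, a minimal sketch that declares both networks might look like the
+following; the subnets shown are placeholders only:
+
+.. code-block:: ini
+
+   [global]
+   public_network = 192.168.1.0/24
+   cluster_network = 10.10.10.0/24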
+
+IPv4/IPv6 Dual Stack Mode
+-------------------------
+
+If you want to run in an IPv4/IPv6 dual stack mode and want to define your public and/or
+cluster networks, then you need to specify both your IPv4 and IPv6 networks for each:
+
+.. code-block:: ini
+
+ [global]
+ # ... elided configuration
+ public_network = {IPv4 public-network/netmask}, {IPv6 public-network/netmask}
+
+This is so that Ceph can find a valid IP address for both address families.
+
+If you want just an IPv4 or an IPv6 stack environment, then make sure you set
+the ``ms_bind_ipv4`` and ``ms_bind_ipv6`` options correctly.
+
+.. note::
+ Binding to IPv4 is enabled by default, so if you just add the option to bind to IPv6
+ you'll actually put yourself into dual stack mode. If you want just IPv6, then disable IPv4 and
+ enable IPv6. See `Bind`_ below.
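+
+For example, a minimal sketch of an IPv6-only configuration using the bind
+options listed under `Bind`_:
+
+.. code-block:: ini
+
+   [global]
+   ms_bind_ipv4 = false
+   ms_bind_ipv6 = true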
+
+Ceph Daemons
+============
+
+Monitor daemons are each configured to bind to a specific IP address. These
+addresses are normally configured by your deployment tool. Other components
+in the Ceph cluster discover the monitors via the ``mon host`` configuration
+option, normally specified in the ``[global]`` section of the ``ceph.conf`` file.
+
+.. code-block:: ini
+
+ [global]
+ mon_host = 10.0.0.2, 10.0.0.3, 10.0.0.4
+
+The ``mon_host`` value can be a list of IP addresses or a name that is
+looked up via DNS. In the case of a DNS name with multiple A or AAAA
+records, all records are probed in order to discover a monitor. Once
+one monitor is reached, all other current monitors are discovered, so
+the ``mon host`` configuration option only needs to be sufficiently up
+to date such that a client can reach one monitor that is currently online.
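+
+For example, instead of listing addresses you might point ``mon_host`` at a
+DNS name that resolves to several A/AAAA records; the name below is a
+placeholder:
+
+.. code-block:: ini
+
+   [global]
+   mon_host = mons.example.com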
+
+The MGR, OSD, and MDS daemons will bind to any available address and
+do not require any special configuration. However, it is possible to
+specify a specific IP address for them to bind to with the ``public
+addr`` (and/or, in the case of OSD daemons, the ``cluster addr``)
+configuration option. For example,
+
+.. code-block:: ini
+
+ [osd.0]
+ public_addr = {host-public-ip-address}
+ cluster_addr = {host-cluster-ip-address}
+
+.. topic:: One NIC OSD in a Two Network Cluster
+
+ Generally, we do not recommend deploying an OSD host with a single network interface in a
+ cluster with two networks. However, you may accomplish this by forcing the
+ OSD host to operate on the public network by adding a ``public_addr`` entry
+ to the ``[osd.n]`` section of the Ceph configuration file, where ``n``
+ refers to the ID of the OSD with one network interface. Additionally, the public
+ network and cluster network must be able to route traffic to each other,
+ which we don't recommend for security reasons.
+
+
+Network Config Settings
+=======================
+
+Network configuration settings are not required. Ceph assumes a public network
+with all hosts operating on it unless you specifically configure a cluster
+network.
+
+
+Public Network
+--------------
+
+The public network configuration allows you to specifically define IP addresses
+and subnets for the public network. You may specifically assign static IP
+addresses or override ``public_network`` settings using the ``public_addr``
+setting for a specific daemon.
+
+.. confval:: public_network
+.. confval:: public_addr
+
+Cluster Network
+---------------
+
+The cluster network configuration allows you to declare a cluster network, and
+specifically define IP addresses and subnets for the cluster network. You may
+specifically assign static IP addresses or override ``cluster_network``
+settings using the ``cluster_addr`` setting for specific OSD daemons.
+
+
+.. confval:: cluster_network
+.. confval:: cluster_addr
+
+Bind
+----
+
+Bind settings set the default port ranges Ceph OSD and MDS daemons use. The
+default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration
+allows you to use the configured port range.
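+
+For example, a sketch that adjusts the daemon port range; the upper bound
+shown is illustrative only:
+
+.. code-block:: ini
+
+   [global]
+   ms_bind_port_min = 6800
+   ms_bind_port_max = 7568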
+
+You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4
+addresses.
+
+.. confval:: ms_bind_port_min
+.. confval:: ms_bind_port_max
+.. confval:: ms_bind_ipv4
+.. confval:: ms_bind_ipv6
+.. confval:: public_bind_addr
+
+TCP
+---
+
+Ceph disables TCP buffering by default.
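+
+For example, if you want to experiment with re-enabling TCP buffering
+(Nagle's algorithm), you could set the following, though the default is
+usually preferable:
+
+.. code-block:: ini
+
+   [global]
+   ms_tcp_nodelay = false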
+
+.. confval:: ms_tcp_nodelay
+.. confval:: ms_tcp_rcvbuf
+
+General Settings
+----------------
+
+.. confval:: ms_type
+.. confval:: ms_async_op_threads
+.. confval:: ms_initial_backoff
+.. confval:: ms_max_backoff
+.. confval:: ms_die_on_bad_msg
+.. confval:: ms_dispatch_throttle_bytes
+.. confval:: ms_inject_socket_failures
+
+
+.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
+.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks
+.. _hardware recommendations: ../../../start/hardware-recommendations
+.. _Monitor / OSD Interaction: ../mon-osd-interaction
+.. _Message Signatures: ../auth-config-ref#signatures
+.. _CIDR: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing
+.. _Nagle's Algorithm: https://en.wikipedia.org/wiki/Nagle's_algorithm
diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst
new file mode 100644
index 000000000..060121200
--- /dev/null
+++ b/doc/rados/configuration/osd-config-ref.rst
@@ -0,0 +1,445 @@
+======================
+ OSD Config Reference
+======================
+
+.. index:: OSD; configuration
+
+You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent
+releases, the central config store), but Ceph OSD
+Daemons can use the default values and a very minimal configuration. A minimal
+Ceph OSD Daemon configuration sets ``host`` and
+uses default values for nearly everything else.
+
+Ceph OSD Daemons are numerically identified in incremental fashion, beginning
+with ``0`` using the following convention. ::
+
+ osd.0
+ osd.1
+ osd.2
+
+In a configuration file, you may specify settings for all Ceph OSD Daemons in
+the cluster by adding configuration settings to the ``[osd]`` section of your
+configuration file. To add settings directly to a specific Ceph OSD Daemon
+(e.g., ``host``), enter it in an OSD-specific section of your configuration
+file. For example:
+
+.. code-block:: ini
+
+ [osd]
+ osd_journal_size = 5120
+
+ [osd.0]
+ host = osd-host-a
+
+ [osd.1]
+ host = osd-host-b
+
+
+.. index:: OSD; config settings
+
+General Settings
+================
+
+The following settings provide a Ceph OSD Daemon's ID, and determine paths to
+data and journals. Ceph deployment scripts typically generate the UUID
+automatically.
+
+.. warning:: **DO NOT** change the default paths for data or journals, as it
+ makes it more problematic to troubleshoot Ceph later.
+
+When using Filestore, the journal size should be at least twice the product of the expected drive
+speed and ``filestore_max_sync_interval``. However, the most common
+practice is to partition the journal drive (often an SSD), and mount it such
+that Ceph uses the entire partition for the journal.
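+
+As a worked example, assuming a drive that sustains roughly 100 MB/s and a
+``filestore_max_sync_interval`` of 5 seconds, the rule of thumb gives
+2 × 100 MB/s × 5 s = 1000 MB, so the journal should be at least about 1 GB.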
+
+.. confval:: osd_uuid
+.. confval:: osd_data
+.. confval:: osd_max_write_size
+.. confval:: osd_max_object_size
+.. confval:: osd_client_message_size_cap
+.. confval:: osd_class_dir
+ :default: $libdir/rados-classes
+
+.. index:: OSD; file system
+
+File System Settings
+====================
+Ceph builds and mounts file systems which are used for Ceph OSDs.
+
+``osd_mkfs_options {fs-type}``
+
+:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.
+
+:Type: String
+:Default for xfs: ``-f -i 2048``
+:Default for other file systems: {empty string}
+
+For example::
+
+    osd_mkfs_options_xfs = -f -d agcount=24
+
+``osd_mount_options {fs-type}``
+
+:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.
+
+:Type: String
+:Default for xfs: ``rw,noatime,inode64``
+:Default for other file systems: ``rw, noatime``
+
+For example::
+
+    osd_mount_options_xfs = rw, noatime, inode64, logbufs=8
+
+
+.. index:: OSD; journal settings
+
+Journal Settings
+================
+
+This section applies only to the older Filestore OSD back end. Since the
+Luminous release, BlueStore has been the default and preferred back end.
+
+By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
+the following path, which is usually a symlink to a device or partition::
+
+ /var/lib/ceph/osd/$cluster-$id/journal
+
+When using a single device type (for example, spinning drives), the journals
+should be *colocated*: the logical volume (or partition) should be in the same
+device as the ``data`` logical volume.
+
+When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
+drives) it makes sense to place the journal on the faster device, while
+``data`` occupies the slower device fully.
+
+The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
+larger, in which case it will need to be set in the ``ceph.conf`` file.
+A value of 10 gigabytes is common in practice::
+
+ osd_journal_size = 10240
+
+
+.. confval:: osd_journal
+.. confval:: osd_journal_size
+
+See `Journal Config Reference`_ for additional details.
+
+
+Monitor OSD Interaction
+=======================
+
+Ceph OSD Daemons check each other's heartbeats and report to monitors
+periodically. Ceph can use default values in many cases. However, if your
+network has latency issues, you may need to adopt longer intervals. See
+`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
+
+
+Data Placement
+==============
+
+See `Pool & PG Config Reference`_ for details.
+
+
+.. index:: OSD; scrubbing
+
+.. _rados_config_scrubbing:
+
+Scrubbing
+=========
+
+One way that Ceph ensures data integrity is by "scrubbing" placement groups.
+Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph
+generates a catalog of all objects in each placement group and compares each
+primary object to its replicas, ensuring that no objects are missing or
+mismatched. Light scrubbing checks the object size and attributes, and is
+usually done daily. Deep scrubbing reads the data and uses checksums to ensure
+data integrity, and is usually done weekly. The frequencies of both light
+scrubbing and deep scrubbing are determined by the cluster's configuration,
+which is fully under your control and subject to the settings explained below
+in this section.
+
+Although scrubbing is important for maintaining data integrity, it can reduce
+the performance of the Ceph cluster. You can adjust the following settings to
+increase or decrease the frequency and depth of scrubbing operations.
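+
+For example, one common adjustment is to restrict scrubbing to off-peak hours
+via the central config store; the hours below are illustrative only:
+
+.. prompt:: bash $
+
+   ceph config set osd osd_scrub_begin_hour 22
+   ceph config set osd osd_scrub_end_hour 6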
+
+
+.. confval:: osd_max_scrubs
+.. confval:: osd_scrub_begin_hour
+.. confval:: osd_scrub_end_hour
+.. confval:: osd_scrub_begin_week_day
+.. confval:: osd_scrub_end_week_day
+.. confval:: osd_scrub_during_recovery
+.. confval:: osd_scrub_load_threshold
+.. confval:: osd_scrub_min_interval
+.. confval:: osd_scrub_max_interval
+.. confval:: osd_scrub_chunk_min
+.. confval:: osd_scrub_chunk_max
+.. confval:: osd_scrub_sleep
+.. confval:: osd_deep_scrub_interval
+.. confval:: osd_scrub_interval_randomize_ratio
+.. confval:: osd_deep_scrub_stride
+.. confval:: osd_scrub_auto_repair
+.. confval:: osd_scrub_auto_repair_num_errors
+
+.. index:: OSD; operations settings
+
+Operations
+==========
+
+.. confval:: osd_op_num_shards
+.. confval:: osd_op_num_shards_hdd
+.. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_queue
+.. confval:: osd_op_queue_cut_off
+.. confval:: osd_client_op_priority
+.. confval:: osd_recovery_op_priority
+.. confval:: osd_scrub_priority
+.. confval:: osd_requested_scrub_priority
+.. confval:: osd_snap_trim_priority
+.. confval:: osd_snap_trim_sleep
+.. confval:: osd_snap_trim_sleep_hdd
+.. confval:: osd_snap_trim_sleep_ssd
+.. confval:: osd_snap_trim_sleep_hybrid
+.. confval:: osd_op_thread_timeout
+.. confval:: osd_op_complaint_time
+.. confval:: osd_op_history_size
+.. confval:: osd_op_history_duration
+.. confval:: osd_op_log_threshold
+.. confval:: osd_op_thread_suicide_timeout
+.. note:: See https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ for
+ more on ``osd_op_thread_suicide_timeout``. Be aware that this is a link to a
+ reworking of a blog post from 2017, and that its conclusion will direct you
+ back to this page "for more information".
+
+.. _dmclock-qos:
+
+QoS Based on mClock
+-------------------
+
+Ceph's use of mClock is now more refined and can be used by following the
+steps described in `mClock Config Reference`_.
+
+Core Concepts
+`````````````
+
+Ceph's QoS support is implemented using a queueing scheduler
+based on `the dmClock algorithm`_. This algorithm allocates the I/O
+resources of the Ceph cluster in proportion to weights, and enforces
+the constraints of minimum reservation and maximum limitation, so that
+the services can compete for the resources fairly. Currently the
+*mclock_scheduler* operation queue divides Ceph services involving I/O
+resources into the following buckets:
+
+- client op: the iops issued by client
+- osd subop: the iops issued by primary OSD
+- snap trim: the snap trimming related requests
+- pg recovery: the recovery related requests
+- pg scrub: the scrub related requests
+
+The resources are partitioned using the following three sets of tags. In other
+words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity if extra capacity is available
+   or the system is oversubscribed.
+
+In Ceph, operations are graded with a "cost", and the resources allocated
+for serving the various services are consumed by these costs. So, for
+example, the more reservation a service has, the more resource it is
+guaranteed to possess whenever it requires it. Assume there are two
+services, recovery and client ops:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that recovery is never serviced at more than 5
+requests per second, even if it asks for more (see the CURRENT
+IMPLEMENTATION NOTE below) and no other service is competing with it.
+Conversely, if clients start to issue a large number of I/O requests,
+they will not exhaust all of the I/O resources either: 1 request per
+second is always allocated to recovery jobs as long as any are pending,
+so recovery is not starved even in a heavily loaded cluster. Meanwhile,
+client ops enjoy a larger portion of the remaining I/O resources,
+because their weight is "9" while the recovery weight is "1". Client ops
+are not clamped by a limit setting, so they can make use of all the
+resources if no recovery is ongoing.
+
+CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
+values. Therefore, if a service crosses the enforced limit, the op remains
+in the operation queue until the limit is restored.
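+
+As a sketch only, the recovery and client profiles above map onto the
+``osd_mclock_scheduler_*`` options listed later in this section. Note that the
+units and defaults of these options differ between releases (IOPS in some,
+fractions of OSD capacity in others), so treat the values as an illustration
+of the mapping rather than as recommended settings:
+
+.. code-block:: ini
+
+   [osd]
+   # recovery: (r:1, l:5, w:1)
+   osd_mclock_scheduler_background_recovery_res = 1
+   osd_mclock_scheduler_background_recovery_lim = 5
+   osd_mclock_scheduler_background_recovery_wgt = 1
+   # client ops: (r:2, l:0, w:9)
+   osd_mclock_scheduler_client_res = 2
+   osd_mclock_scheduler_client_lim = 0
+   osd_mclock_scheduler_client_wgt = 9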
+
+Subtleties of mClock
+````````````````````
+
+The reservation and limit values have a unit of requests per
+second. The weight, however, does not technically have a unit and the
+weights are relative to one another. So if one class of requests has a
+weight of 1 and another a weight of 9, then requests of the latter class
+should be executed at a 9 to 1 ratio relative to the first class.
+However, that will only happen once the reservations are met, and the
+ratio includes the operations executed under the reservation phase.
+
+Even though the weights do not have units, one must be careful in
+choosing their values due to how the algorithm assigns weight tags to
+requests. If the weight is *W*, then for a given class of requests,
+the next one that comes in will have a weight tag of *1/W* plus the
+previous weight tag or the current time, whichever is larger. That
+means if *W* is sufficiently large and therefore *1/W* is sufficiently
+small, the calculated tag may never be assigned as it will get a value
+of the current time. The ultimate lesson is that values for weight
+should not be too large. They should be under the number of requests
+one expects to be serviced each second.
+
+Caveats
+```````
+
+There are some factors that can reduce the impact of the mClock op
+queues within Ceph. First, requests to an OSD are sharded by their
+placement group identifier. Each shard has its own mClock queue and
+these queues neither interact nor share information among them. The
+number of shards can be controlled with the configuration options
+:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
+:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
+impact of the mClock queues, but may have other deleterious effects.
+
+Second, requests are transferred from the operation queue to the
+operation sequencer, in which they go through the phases of
+execution. The operation queue is where mClock resides and mClock
+determines the next op to transfer to the operation sequencer. The
+number of operations allowed in the operation sequencer is a complex
+issue. In general we want to keep enough operations in the sequencer
+so it's always getting work done on some operations while it's waiting
+for disk and network access to complete on other operations. On the
+other hand, once an operation is transferred to the operation
+sequencer, mClock no longer has control over it. Therefore to maximize
+the impact of mClock, we want to keep as few operations in the
+operation sequencer as possible. So we have an inherent tension.
+
+The configuration options that influence the number of operations in
+the operation sequencer are :confval:`bluestore_throttle_bytes`,
+:confval:`bluestore_throttle_deferred_bytes`,
+:confval:`bluestore_throttle_cost_per_io`,
+:confval:`bluestore_throttle_cost_per_io_hdd`, and
+:confval:`bluestore_throttle_cost_per_io_ssd`.
+
+A third factor that affects the impact of the mClock algorithm is that
+we're using a distributed system, where requests are made to multiple
+OSDs and each OSD has (can have) multiple shards. Yet we're currently
+using the mClock algorithm, which is not distributed (note: dmClock is
+the distributed version of mClock).
+
+Various organizations and individuals are currently experimenting with
+mClock as it exists in this code base, along with their own modifications
+to it. We hope you'll share your experiences with your mClock and dmClock
+experiments on the ``ceph-devel`` mailing list.
+
+.. confval:: osd_async_recovery_min_cost
+.. confval:: osd_push_per_object_cost
+.. confval:: osd_mclock_scheduler_client_res
+.. confval:: osd_mclock_scheduler_client_wgt
+.. confval:: osd_mclock_scheduler_client_lim
+.. confval:: osd_mclock_scheduler_background_recovery_res
+.. confval:: osd_mclock_scheduler_background_recovery_wgt
+.. confval:: osd_mclock_scheduler_background_recovery_lim
+.. confval:: osd_mclock_scheduler_background_best_effort_res
+.. confval:: osd_mclock_scheduler_background_best_effort_wgt
+.. confval:: osd_mclock_scheduler_background_best_effort_lim
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
+
+.. index:: OSD; backfilling
+
+Backfilling
+===========
+
+When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
+rebalance the cluster by moving placement groups to or from Ceph OSDs
+to restore balanced utilization. The process of migrating placement groups and
+the objects they contain can reduce the cluster's operational performance
+considerably. To maintain operational performance, Ceph performs this migration
+with 'backfilling', which allows Ceph to set backfill operations to a lower
+priority than requests to read or write data.
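+
+For example, a common way to reduce the impact of backfill on client I/O is to
+lower its concurrency at runtime; note that when the mClock scheduler is
+active, its profiles may cap or override this value:
+
+.. prompt:: bash $
+
+   ceph config set osd osd_max_backfills 1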
+
+
+.. confval:: osd_max_backfills
+.. confval:: osd_backfill_scan_min
+.. confval:: osd_backfill_scan_max
+.. confval:: osd_backfill_retry_interval
+
+.. index:: OSD; osdmap
+
+OSD Map
+=======
+
+OSD maps reflect the OSD daemons operating in the cluster. Over time, the
+number of map epochs increases. Ceph provides some settings to ensure that
+Ceph performs well as the OSD map grows larger.
+
+.. confval:: osd_map_dedup
+.. confval:: osd_map_cache_size
+.. confval:: osd_map_message_max
+
+.. index:: OSD; recovery
+
+Recovery
+========
+
+When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
+begins peering with other Ceph OSD Daemons before writes can occur. See
+`Monitoring OSDs and PGs`_ for details.
+
+If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
+sync with other Ceph OSD Daemons containing more recent versions of objects in
+the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
+mode and seeks to get the latest copy of the data and bring its map back up to
+date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
+and placement groups may be significantly out of date. Also, if a failure domain
+went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
+the same time. This can make the recovery process time consuming and resource
+intensive.
+
+To maintain operational performance, Ceph performs recovery with limits on the
+number of recovery requests, threads, and object chunk sizes, which allows Ceph
+to perform well in a degraded state.
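+
+As a sketch, recovery can likewise be throttled at runtime; as with backfill,
+mClock profiles may cap or override some of these values:
+
+.. prompt:: bash $
+
+   ceph config set osd osd_recovery_max_active 1
+   ceph config set osd osd_recovery_sleep 0.1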
+
+.. confval:: osd_recovery_delay_start
+.. confval:: osd_recovery_max_active
+.. confval:: osd_recovery_max_active_hdd
+.. confval:: osd_recovery_max_active_ssd
+.. confval:: osd_recovery_max_chunk
+.. confval:: osd_recovery_max_single_start
+.. confval:: osd_recover_clone_overlap
+.. confval:: osd_recovery_sleep
+.. confval:: osd_recovery_sleep_hdd
+.. confval:: osd_recovery_sleep_ssd
+.. confval:: osd_recovery_sleep_hybrid
+.. confval:: osd_recovery_priority
+
+Tiering
+=======
+
+.. confval:: osd_agent_max_ops
+.. confval:: osd_agent_max_low_ops
+
+See `cache target dirty high ratio`_ for details on when the tiering agent
+flushes dirty objects in the high speed mode.
+
+Miscellaneous
+=============
+
+.. confval:: osd_default_notify_timeout
+.. confval:: osd_check_for_log_corruption
+.. confval:: osd_delete_sleep
+.. confval:: osd_delete_sleep_hdd
+.. confval:: osd_delete_sleep_ssd
+.. confval:: osd_delete_sleep_hybrid
+.. confval:: osd_command_max_records
+.. confval:: osd_fast_fail_on_connection_refused
+
+.. _pool: ../../operations/pools
+.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
+.. _Pool & PG Config Reference: ../pool-pg-config-ref
+.. _Journal Config Reference: ../journal-ref
+.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
+.. _mClock Config Reference: ../mclock-config-ref
diff --git a/doc/rados/configuration/pool-pg-config-ref.rst b/doc/rados/configuration/pool-pg-config-ref.rst
new file mode 100644
index 000000000..902c80346
--- /dev/null
+++ b/doc/rados/configuration/pool-pg-config-ref.rst
@@ -0,0 +1,46 @@
+.. _rados_config_pool_pg_crush_ref:
+
+======================================
+ Pool, PG and CRUSH Config Reference
+======================================
+
+.. index:: pools; configuration
+
+Ceph uses default values to determine how many placement groups (PGs) will be
+assigned to each pool. We recommend overriding some of the defaults.
+Specifically, we recommend setting a pool's replica size and overriding the
+default number of placement groups. You can set these values when running
+`pool`_ commands. You can also override the defaults by adding new ones in the
+``[global]`` section of your Ceph configuration file.
+
+
+.. literalinclude:: pool-pg.conf
+ :language: ini
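+
+Alternatively, on a running cluster you can set the same defaults through the
+central config store and supply per-pool values at creation time; the pool
+name and PG counts below are placeholders:
+
+.. prompt:: bash $
+
+   ceph config set global osd_pool_default_size 3
+   ceph osd pool create mypool 256 256 replicated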
+
+.. confval:: mon_max_pool_pg_num
+.. confval:: mon_pg_stuck_threshold
+.. confval:: mon_pg_warn_min_per_osd
+.. confval:: mon_pg_warn_min_objects
+.. confval:: mon_pg_warn_min_pool_objects
+.. confval:: mon_pg_check_down_all_threshold
+.. confval:: mon_pg_warn_max_object_skew
+.. confval:: mon_delta_reset_interval
+.. confval:: osd_crush_chooseleaf_type
+.. confval:: osd_crush_initial_weight
+.. confval:: osd_pool_default_crush_rule
+.. confval:: osd_pool_erasure_code_stripe_unit
+.. confval:: osd_pool_default_size
+.. confval:: osd_pool_default_min_size
+.. confval:: osd_pool_default_pg_num
+.. confval:: osd_pool_default_pgp_num
+.. confval:: osd_pool_default_pg_autoscale_mode
+.. confval:: osd_pool_default_flags
+.. confval:: osd_max_pgls
+.. confval:: osd_min_pg_log_entries
+.. confval:: osd_max_pg_log_entries
+.. confval:: osd_default_data_pool_replay_window
+.. confval:: osd_max_pg_per_osd_hard_ratio
+
+.. _pool: ../../operations/pools
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
+.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems
diff --git a/doc/rados/configuration/pool-pg.conf b/doc/rados/configuration/pool-pg.conf
new file mode 100644
index 000000000..6765d37df
--- /dev/null
+++ b/doc/rados/configuration/pool-pg.conf
@@ -0,0 +1,21 @@
+[global]
+
+ # By default, Ceph makes three replicas of RADOS objects. If you want
+ # to maintain four copies of an object instead of the default--a primary
+ # copy and three replica copies--reset the default value as shown in
+ # 'osd_pool_default_size'. If you want to allow Ceph to accept an I/O
+ # operation to a degraded PG, set 'osd_pool_default_min_size' to a
+ # number less than the 'osd_pool_default_size' value.
+
+ osd_pool_default_size = 3 # Write an object three times.
+ osd_pool_default_min_size = 2 # Accept an I/O operation to a PG that has two copies of an object.
+
+ # Note: by default, PG autoscaling is enabled and this value is used only
+ # in specific circumstances. It is, however, still recommended to set it.
+ # Ensure you have a realistic number of placement groups. We recommend
+ # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
+ # divided by the number of replicas (i.e., 'osd_pool_default_size'). So for
+ # 10 OSDs and 'osd_pool_default_size' = 4, we'd recommend approximately
+ # (100 * 10) / 4 = 250.
+ # Always use the nearest power of two.
+ osd_pool_default_pg_num = 256
diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst
new file mode 100644
index 000000000..c83e87da7
--- /dev/null
+++ b/doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,93 @@
+=================
+ Storage Devices
+=================
+
+There are several Ceph daemons in a storage cluster:
+
+.. _rados_configuration_storage-devices_ceph_osd:
+
+* **Ceph OSDs** (Object Storage Daemons) store most of the data
+ in Ceph. Usually each OSD is backed by a single storage device.
+ This can be a traditional hard disk (HDD) or a solid state disk
+ (SSD). OSDs can also be backed by a combination of devices: for
+ example, a HDD for most data and an SSD (or partition of an
+ SSD) for some metadata. The number of OSDs in a cluster is
+ usually a function of the amount of data to be stored, the size
+ of each storage device, and the level and type of redundancy
+ specified (replication or erasure coding).
+* **Ceph Monitor** daemons manage critical cluster state. This
+ includes cluster membership and authentication information.
+ Small clusters require only a few gigabytes of storage to hold
+ the monitor database. In large clusters, however, the monitor
+ database can reach sizes of tens of gigabytes to hundreds of
+ gigabytes.
+* **Ceph Manager** daemons run alongside monitor daemons, providing
+ additional monitoring and providing interfaces to external
+ monitoring and management systems.
+
+.. _rados_config_storage_devices_osd_backends:
+
+OSD Back Ends
+=============
+
+There are two ways that OSDs manage the data they store. As of the Luminous
+12.2.z release, the default (and recommended) back end is *BlueStore*. Prior
+to the Luminous release, the default (and only) back end was *Filestore*.
+
+.. _rados_config_storage_devices_bluestore:
+
+BlueStore
+---------
+
+BlueStore is a special-purpose storage back end designed specifically for
+managing data on disk for Ceph OSD workloads. BlueStore's design is based on
+a decade of experience of supporting and managing Filestore OSDs.
+
+Key BlueStore features include:
+
+* Direct management of storage devices. BlueStore consumes raw block devices or
+ partitions. This avoids intervening layers of abstraction (such as local file
+ systems like XFS) that can limit performance or add complexity.
+* Metadata management with RocksDB. RocksDB's key/value database is embedded
+ in order to manage internal metadata, including the mapping of object
+ names to block locations on disk.
+* Full data and metadata checksumming. By default, all data and
+ metadata written to BlueStore is protected by one or more
+ checksums. No data or metadata is read from disk or returned
+ to the user without being verified.
+* Inline compression. Data can be optionally compressed before being written
+ to disk.
+* Multi-device metadata tiering. BlueStore allows its internal
+ journal (write-ahead log) to be written to a separate, high-speed
+ device (like an SSD, NVMe, or NVDIMM) for increased performance. If
+ a significant amount of faster storage is available, internal
+ metadata can be stored on the faster device.
+* Efficient copy-on-write. RBD and CephFS snapshots rely on a
+ copy-on-write *clone* mechanism that is implemented efficiently in
+ BlueStore. This results in efficient I/O both for regular snapshots
+ and for erasure-coded pools (which rely on cloning to implement
+ efficient two-phase commits).
+
+For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`.
+
+FileStore
+---------
+.. warning:: Filestore has been deprecated in the Reef release and is no longer supported.
+
+
+FileStore is the legacy approach to storing objects in Ceph. It
+relies on a standard file system (normally XFS) in combination with a
+key/value database (traditionally LevelDB, now RocksDB) for some
+metadata.
+
+FileStore was well-tested and widely used in production. However, it
+suffers from many performance deficiencies due to its overall design
+and its reliance on a traditional file system for object data storage.
+
+Although FileStore is capable of functioning on most POSIX-compatible
+file systems (including btrfs and ext4), we recommend that only the
+XFS file system be used with Ceph. Both btrfs and ext4 have known bugs and
+deficiencies and their use may lead to data loss. By default, all Ceph
+provisioning tools use XFS.
+
+For more information, see :doc:`filestore-config-ref`.