1 files changed, 395 insertions, 0 deletions
diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst
new file mode 100644
index 000000000..579056895
--- /dev/null
+++ b/doc/rados/configuration/mclock-config-ref.rst
@@ -0,0 +1,395 @@
+========================
+ mClock Config Reference
+========================
+
+.. index:: mclock; configuration
+
+Mclock profiles mask the low level details from users, making it
+easier for them to configure mclock.
+
+The following input parameters are required for a mclock profile to configure
+the QoS related parameters:
+
+* total capacity (IOPS) of each OSD (determined automatically)
+
+* an mclock profile type to enable
+
+Using the settings in the specified profile, the OSD determines and applies the
+lower-level mclock and Ceph parameters. The parameters applied by the mclock
+profile make it possible to tune the QoS between client I/O, recovery/backfill
+operations, and other background operations (for example, scrub, snap trim, and
+PG deletion). These background activities are considered best-effort internal
+clients of Ceph.
+
+
+.. index:: mclock; profile definition
+
+mClock Profiles - Definition and Purpose
+========================================
+
+A mclock profile is *“a configuration setting that when applied on a running
+Ceph cluster enables the throttling of the operations(IOPS) belonging to
+different client classes (background recovery, scrub, snaptrim, client op,
+osd subop)”*.
+
+The mclock profile uses the capacity limits and the mclock profile type selected
+by the user to determine the low-level mclock resource control parameters.
+
+Depending on the profile type, lower-level mclock resource-control parameters
+and some Ceph-configuration parameters are transparently applied.
+
+The low-level mclock resource control parameters are the *reservation*,
+*limit*, and *weight* that provide control of the resource shares, as
+described in the :ref:`dmclock-qos` section.
+
+
+.. index:: mclock; profile types
+
+mClock Profile Types
+====================
+
+mclock profiles can be broadly classified into two types,
+
+- **Built-in**: Users can choose between the following built-in profile types:
+
+  - **high_client_ops** (*default*):
+    This profile allocates more reservation and limit to external-client ops
+    as compared to background recoveries and other internal clients within
+    Ceph. This profile is enabled by default.
+  - **high_recovery_ops**:
+    This profile allocates more reservation to background recoveries as
+    compared to external clients and other internal clients within Ceph. For
+    example, an admin may enable this profile temporarily to speed-up background
+    recoveries during non-peak hours.
+  - **balanced**:
+    This profile allocates equal reservation to client ops and background
+    recovery ops.
+
+- **Custom**: This profile gives users complete control over all the mclock
+  configuration parameters. Using this profile is not recommended without
+  a deep understanding of mclock and related Ceph-configuration options.
+
+.. note:: Across the built-in profiles, internal clients of mclock (for example
+          "scrub", "snap trim", and "pg deletion") are given slightly lower
+          reservations, but higher weight and no limit. This ensures that
+          these operations are able to complete quickly if there are no other
+          competing services.
+
+
+.. index:: mclock; built-in profiles
+
+mClock Built-in Profiles
+========================
+
+When a built-in profile is enabled, the mClock scheduler calculates the low
+level mclock parameters [*reservation*, *weight*, *limit*] based on the profile
+enabled for each client type. The mclock parameters are calculated based on
+the max OSD capacity provided beforehand. As a result, the following mclock
+config parameters cannot be modified when using any of the built-in profiles:
+
+- ``osd_mclock_scheduler_client_res``
+- ``osd_mclock_scheduler_client_wgt``
+- ``osd_mclock_scheduler_client_lim``
+- ``osd_mclock_scheduler_background_recovery_res``
+- ``osd_mclock_scheduler_background_recovery_wgt``
+- ``osd_mclock_scheduler_background_recovery_lim``
+- ``osd_mclock_scheduler_background_best_effort_res``
+- ``osd_mclock_scheduler_background_best_effort_wgt``
+- ``osd_mclock_scheduler_background_best_effort_lim``
+
+The following Ceph options will not be modifiable by the user:
+
+- ``osd_max_backfills``
+- ``osd_recovery_max_active``
+
+This is because the above options are internally modified by the mclock
+scheduler in order to maximize the impact of the set profile.
+
+By default, the *high_client_ops* profile is enabled to ensure that a larger
+chunk of the bandwidth allocation goes to client ops. Background recovery ops
+are given lower allocation (and therefore take a longer time to complete). But
+there might be instances that necessitate giving higher allocations to either
+client ops or recovery ops. In order to deal with such a situation, you can
+enable one of the alternate built-in profiles by following the steps mentioned
+in the next section.
+
+If any mClock profile (including "custom") is active, the following Ceph config
+sleep options will be disabled,
+
+- ``osd_recovery_sleep``
+- ``osd_recovery_sleep_hdd``
+- ``osd_recovery_sleep_ssd``
+- ``osd_recovery_sleep_hybrid``
+- ``osd_scrub_sleep``
+- ``osd_delete_sleep``
+- ``osd_delete_sleep_hdd``
+- ``osd_delete_sleep_ssd``
+- ``osd_delete_sleep_hybrid``
+- ``osd_snap_trim_sleep``
+- ``osd_snap_trim_sleep_hdd``
+- ``osd_snap_trim_sleep_ssd``
+- ``osd_snap_trim_sleep_hybrid``
+
+The above sleep options are disabled to ensure that mclock scheduler is able to
+determine when to pick the next op from its operation queue and transfer it to
+the operation sequencer. This results in the desired QoS being provided across
+all its clients.
+
+
+.. index:: mclock; enable built-in profile
+
+Steps to Enable mClock Profile
+==============================
+
+As already mentioned, the default mclock profile is set to *high_client_ops*.
+The other values for the built-in profiles include *balanced* and
+*high_recovery_ops*.
+
+If there is a requirement to change the default profile, then the option
+``osd_mclock_profile`` may be set during runtime by using the following
+command:
+
+  .. prompt:: bash #
+
+    ceph config set osd.N osd_mclock_profile <value>
+
+For example, to change the profile to allow faster recoveries on "osd.0", the
+following command can be used to switch to the *high_recovery_ops* profile:
+
+  .. prompt:: bash #
+
+    ceph config set osd.0 osd_mclock_profile high_recovery_ops
+
+.. note:: The *custom* profile is not recommended unless you are an advanced
+          user.
+
+And that's it! You are ready to run workloads on the cluster and check if the
+QoS requirements are being met.
+
+
+OSD Capacity Determination (Automated)
+======================================
+
+The OSD capacity in terms of total IOPS is determined automatically during OSD
+initialization. This is achieved by running the OSD bench tool and overriding
+the default value of ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option
+depending on the device type. No other action/input is expected from the user
+to set the OSD capacity. You may verify the capacity of an OSD after the
+cluster is brought up by using the following command:
+
+  .. prompt:: bash #
+
+    ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]
+
+For example, the following command shows the max capacity for "osd.0" on a Ceph
+node whose underlying device type is SSD:
+
+  .. prompt:: bash #
+
+    ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
+
+
+Steps to Manually Benchmark an OSD (Optional)
+=============================================
+
+.. note:: These steps are only necessary if you want to override the OSD
+          capacity already determined automatically during OSD initialization.
+          Otherwise, you may skip this section entirely.
+
+.. tip:: If you have already determined the benchmark data and wish to manually
+         override the max osd capacity for an OSD, you may skip to section
+         `Specifying  Max OSD Capacity`_.
+
+
+Any existing benchmarking tool can be used for this purpose. In this case, the
+steps use the *Ceph OSD Bench* command described in the next section. Regardless
+of the tool/command used, the steps outlined further below remain the same.
+
+As already described in the :ref:`dmclock-qos` section, the number of
+shards and the bluestore's throttle parameters have an impact on the mclock op
+queues. Therefore, it is critical to set these values carefully in order to
+maximize the impact of the mclock scheduler.
+
+:Number of Operational Shards:
+  We recommend using the default number of shards as defined by the
+  configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
+  ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase
+  the impact of the mclock queues.
+
+:Bluestore Throttle Parameters:
+  We recommend using the default values as defined by
+  ``bluestore_throttle_bytes`` and ``bluestore_throttle_deferred_bytes``. But
+  these parameters may also be determined during the benchmarking phase as
+  described below.
+
+
+OSD Bench Command Syntax
+````````````````````````
+
+The :ref:`osd-subsystem` section describes the OSD bench command. The syntax
+used for benchmarking is shown below :
+
+.. prompt:: bash #
+
+  ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]
+
+where,
+
+* ``TOTAL_BYTES``: Total number of bytes to write
+* ``BYTES_PER_WRITE``: Block size per write
+* ``OBJ_SIZE``: Bytes per object
+* ``NUM_OBJS``: Number of objects to write
+
+Benchmarking Test Steps Using OSD Bench
+```````````````````````````````````````
+
+The steps below use the default shards and detail the steps used to determine
+the correct bluestore throttle values (optional).
+
+#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that
+   you wish to benchmark.
+#. Run a simple 4KiB random write workload on an OSD using the following
+   commands:
+
+   .. note:: Note that before running the test, caches must be cleared to get an
+             accurate measurement.
+
+   For example, if you are running the benchmark test on osd.0, run the following
+   commands:
+
+   .. prompt:: bash #
+
+     ceph tell osd.0 cache drop
+
+   .. prompt:: bash #
+
+     ceph tell osd.0 bench 12288000 4096 4194304 100
+
+#. Note the overall throughput(IOPS) obtained from the output of the osd bench
+   command. This value is the baseline throughput(IOPS) when the default
+   bluestore throttle options are in effect.
+#. If the intent is to determine the bluestore throttle values for your
+   environment, then set the two options, ``bluestore_throttle_bytes``
+   and ``bluestore_throttle_deferred_bytes`` to 32 KiB(32768 Bytes) each
+   to begin with. Otherwise, you may skip to the next section.
+#. Run the 4KiB random write test as before using OSD bench.
+#. Note the overall throughput from the output and compare the value
+   against the baseline throughput recorded in step 3.
+#. If the throughput doesn't match with the baseline, increment the bluestore
+   throttle options by 2x and repeat steps 5 through 7 until the obtained
+   throughput is very close to the baseline value.
+
+For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB
+for both bluestore throttle and deferred bytes was determined to maximize the
+impact of mclock. For HDDs, the corresponding value was 40 MiB, where the
+overall throughput was roughly equal to the baseline throughput. Note that in
+general for HDDs, the bluestore throttle values are expected to be higher when
+compared to SSDs.
+
+
+Specifying  Max OSD Capacity
+````````````````````````````
+
+The steps in this section may be performed only if you want to override the
+max osd capacity automatically set during OSD initialization. The option
+``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the
+following command:
+
+  .. prompt:: bash #
+
+     ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value>
+
+For example, the following command sets the max capacity for a specific OSD
+(say "osd.0") whose underlying device type is HDD to 350 IOPS:
+
+  .. prompt:: bash #
+
+    ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350
+
+Alternatively, you may specify the max capacity for OSDs within the Ceph
+configuration file under the respective [osd.N] section. See
+:ref:`ceph-conf-settings` for more details.
+
+
+.. index:: mclock; config settings
+
+mClock Config Options
+=====================
+
+``osd_mclock_profile``
+
+:Description: This sets the type of mclock profile to use for providing QoS
+              based on operations belonging to different classes (background
+              recovery, scrub, snaptrim, client op, osd subop). Once a built-in
+              profile is enabled, the lower level mclock resource control
+              parameters [*reservation, weight, limit*] and some Ceph
+              configuration parameters are set transparently. Note that the
+              above does not apply for the *custom* profile.
+
+:Type: String
+:Valid Choices: high_client_ops, high_recovery_ops, balanced, custom
+:Default: ``high_client_ops``
+
+``osd_mclock_max_capacity_iops_hdd``
+
+:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for
+              rotational media)
+
+:Type: Float
+:Default: ``315.0``
+
+``osd_mclock_max_capacity_iops_ssd``
+
+:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for
+              solid state media)
+
+:Type: Float
+:Default: ``21500.0``
+
+``osd_mclock_cost_per_io_usec``
+
+:Description: Cost per IO in microseconds to consider per OSD (overrides _ssd
+              and _hdd if non-zero)
+
+:Type: Float
+:Default: ``0.0``
+
+``osd_mclock_cost_per_io_usec_hdd``
+
+:Description: Cost per IO in microseconds to consider per OSD (for rotational
+              media)
+
+:Type: Float
+:Default: ``25000.0``
+
+``osd_mclock_cost_per_io_usec_ssd``
+
+:Description: Cost per IO in microseconds to consider per OSD (for solid state
+              media)
+
+:Type: Float
+:Default: ``50.0``
+
+``osd_mclock_cost_per_byte_usec``
+
+:Description: Cost per byte in microseconds to consider per OSD (overrides _ssd
+              and _hdd if non-zero)
+
+:Type: Float
+:Default: ``0.0``
+
+``osd_mclock_cost_per_byte_usec_hdd``
+
+:Description: Cost per byte in microseconds to consider per OSD (for rotational
+              media)
+
+:Type: Float
+:Default: ``5.2``
+
+``osd_mclock_cost_per_byte_usec_ssd``
+
+:Description: Cost per byte in microseconds to consider per OSD (for solid state
+              media)
+
+:Type: Float
+:Default: ``0.011``