========================
 mClock Config Reference
========================

.. index:: mclock; configuration

QoS support in Ceph is implemented using a queuing scheduler based on `the
dmClock algorithm`_. See :ref:`dmclock-qos` section for more details.

To make the usage of mclock more user-friendly and intuitive, mclock config
profiles are introduced. The mclock profiles mask the low level details from
users, making it easier to configure and use mclock.

The following input parameters are required for a mclock profile to configure
the QoS related parameters:

* total capacity (IOPS) of each OSD (determined automatically -
  See `OSD Capacity Determination (Automated)`_)

* the max sequential bandwidth capacity (MiB/s) of each OSD -
  See *osd_mclock_max_sequential_bandwidth_[hdd|ssd]* option

* an mclock profile type to enable

Using the settings in the specified profile, an OSD determines and applies the
lower-level mclock and Ceph parameters. The parameters applied by the mclock
profile make it possible to tune the QoS between client I/O and background
operations in the OSD.
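
For example, after the cluster is up, the automatically determined inputs for an
OSD can be inspected with commands of the following form (``osd.0`` and the
``_ssd`` suffix are illustrative and depend on your cluster and device type):

  .. prompt:: bash #

    ceph config show osd.0 osd_mclock_max_capacity_iops_ssd

  .. prompt:: bash #

    ceph config show osd.0 osd_mclock_max_sequential_bandwidth_ssd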


.. index:: mclock; mclock clients

mClock Client Types
===================

The mclock scheduler handles requests from different types of Ceph services.
Each service can be considered a type of client from mclock's perspective.
Depending on the type of requests handled, mclock clients are classified into
the buckets shown in the table below:

+------------------------+--------------------------------------------------------------+
|  Client Type           | Request Types                                                |
+========================+==============================================================+
| Client                 | I/O requests issued by external clients of Ceph              |
+------------------------+--------------------------------------------------------------+
| Background recovery    | Internal recovery requests                                   |
+------------------------+--------------------------------------------------------------+
| Background best-effort | Internal backfill, scrub, snap trim and PG deletion requests |
+------------------------+--------------------------------------------------------------+

The mclock profiles allocate parameters like reservation, weight and limit
(see :ref:`dmclock-qos`) differently for each client type. The next sections
describe the mclock profiles in greater detail.


.. index:: mclock; profile definition

mClock Profiles - Definition and Purpose
========================================

A mclock profile is *“a configuration setting that, when applied on a running
Ceph cluster, enables the throttling of the operations (IOPS) belonging to
different client classes (background recovery, scrub, snaptrim, client op,
osd subop)”*.

The mclock profile uses the capacity limits and the mclock profile type selected
by the user to determine the low-level mclock resource control configuration
parameters and apply them transparently. Additionally, other Ceph configuration
parameters are also applied. Please see sections below for more information.

The low-level mclock resource control parameters are the *reservation*,
*limit*, and *weight* that provide control of the resource shares, as
described in the :ref:`dmclock-qos` section.


.. index:: mclock; profile types

mClock Profile Types
====================

mclock profiles can be broadly classified into *built-in* and *custom* profiles.

Built-in Profiles
-----------------
Users can choose between the following built-in profile types:

.. note:: The values mentioned in the tables below represent the proportion
          of the total IOPS capacity of the OSD allocated for the service type.

* balanced (default)
* high_client_ops
* high_recovery_ops

balanced (*default*)
^^^^^^^^^^^^^^^^^^^^
The *balanced* profile is the default mClock profile. This profile allocates
equal reservation/priority to client operations and background recovery
operations. Background best-effort ops are given a lower reservation and
therefore take longer to complete when there are competing operations. This
profile helps meet the normal/steady-state requirements of the cluster: the
case where external client performance is not critical and other background
operations within the OSD still need attention.

There may be instances that require giving higher allocations to either client
ops or recovery ops. To deal with such situations, one of the alternate
built-in profiles may be enabled by following the steps described in the next
sections.

+------------------------+-------------+--------+-------+
|  Service Type          | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 50%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 50%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | 90%   |
+------------------------+-------------+--------+-------+
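
As a rough, hedged illustration of what these proportions mean, assume a
hypothetical OSD whose IOPS capacity was determined to be 21500, and ignore the
per-shard division of allocations described later. The *balanced* profile then
corresponds approximately to::

    client reservation              = 0.50 * 21500 = 10750 IOPS
    background recovery reservation = 0.50 * 21500 = 10750 IOPS
    background best-effort limit    = 0.90 * 21500 = 19350 IOPS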

high_client_ops
^^^^^^^^^^^^^^^
This profile optimizes client performance over background activities by
allocating more reservation and limit to client operations as compared to
background operations in the OSD. This profile, for example, may be enabled
to provide the needed performance for I/O intensive applications for a
sustained period of time at the cost of slower recoveries. The table shows
the resource control parameters set by the profile:

+------------------------+-------------+--------+-------+
|  Service Type          | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 60%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 40%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | 70%   |
+------------------------+-------------+--------+-------+

high_recovery_ops
^^^^^^^^^^^^^^^^^
This profile optimizes background recovery performance as compared to external
clients and other background operations within the OSD. This profile, for
example, may be enabled by an administrator temporarily to speed-up background
recoveries during non-peak hours. The table shows the resource control
parameters set by the profile:

+------------------------+-------------+--------+-------+
|  Service Type          | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 30%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 70%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | MAX   |
+------------------------+-------------+--------+-------+

.. note:: Across the built-in profiles, internal background best-effort clients
          of mclock include "backfill", "scrub", "snap trim", and "pg deletion"
          operations.


Custom Profile
--------------
This profile gives users complete control over all the mclock configuration
parameters. This profile should be used with caution and is meant for advanced
users who understand mclock and the related Ceph configuration options.


.. index:: mclock; built-in profiles

mClock Built-in Profiles - Locked Config Options
=================================================
The sections below describe the config options that are locked to certain values
in order to ensure that the mClock scheduler can provide predictable QoS.

mClock Config Options
---------------------
.. important:: These defaults cannot be changed using any of the config
   subsystem commands like *config set* or via the *config daemon* or *config
   tell* interfaces. Although the above command(s) report success, the mclock
   QoS parameters are reverted to their respective built-in profile defaults.

When a built-in profile is enabled, the mClock scheduler calculates the low
level mclock parameters [*reservation*, *weight*, *limit*] based on the profile
enabled for each client type. The mclock parameters are calculated based on
the max OSD capacity provided beforehand. As a result, the following mclock
config parameters cannot be modified when using any of the built-in profiles:

- :confval:`osd_mclock_scheduler_client_res`
- :confval:`osd_mclock_scheduler_client_wgt`
- :confval:`osd_mclock_scheduler_client_lim`
- :confval:`osd_mclock_scheduler_background_recovery_res`
- :confval:`osd_mclock_scheduler_background_recovery_wgt`
- :confval:`osd_mclock_scheduler_background_recovery_lim`
- :confval:`osd_mclock_scheduler_background_best_effort_res`
- :confval:`osd_mclock_scheduler_background_best_effort_wgt`
- :confval:`osd_mclock_scheduler_background_best_effort_lim`
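
For example, with a built-in profile active, an attempt to override one of
these options appears to succeed but is reverted by the OSD. A hedged way to
observe this on an illustrative OSD (``osd.0``) is:

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_scheduler_client_res 0.5

  .. prompt:: bash #

    ceph config show osd.0 osd_mclock_scheduler_client_res

The value reported by the second command remains the one derived from the
active built-in profile.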

Recovery/Backfill Options
-------------------------
.. warning:: The recommendation is to not change these options as the built-in
   profiles are optimized based on them. Changing these defaults can result in
   unexpected performance outcomes.

The following recovery and backfill related Ceph options are overridden to
mClock defaults:

- :confval:`osd_max_backfills`
- :confval:`osd_recovery_max_active`
- :confval:`osd_recovery_max_active_hdd`
- :confval:`osd_recovery_max_active_ssd`

The following table shows the mClock defaults, which are the same as the
current defaults. This is done to maximize the performance of foreground
(client) operations:

+----------------------------------------+------------------+----------------+
|  Config Option                         | Original Default | mClock Default |
+========================================+==================+================+
| :confval:`osd_max_backfills`           | 1                | 1              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active`     | 0                | 0              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_hdd` | 3                | 3              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_ssd` | 10               | 10             |
+----------------------------------------+------------------+----------------+

The above mClock defaults can be modified, if necessary, by enabling
:confval:`osd_mclock_override_recovery_settings` (default: false). The
steps for this are discussed in the
`Steps to Modify mClock Max Backfills/Recovery Limits`_ section.

Sleep Options
-------------
If any mClock profile (including *custom*) is active, the following Ceph config
sleep options are disabled (set to 0):

- :confval:`osd_recovery_sleep`
- :confval:`osd_recovery_sleep_hdd`
- :confval:`osd_recovery_sleep_ssd`
- :confval:`osd_recovery_sleep_hybrid`
- :confval:`osd_scrub_sleep`
- :confval:`osd_delete_sleep`
- :confval:`osd_delete_sleep_hdd`
- :confval:`osd_delete_sleep_ssd`
- :confval:`osd_delete_sleep_hybrid`
- :confval:`osd_snap_trim_sleep`
- :confval:`osd_snap_trim_sleep_hdd`
- :confval:`osd_snap_trim_sleep_ssd`
- :confval:`osd_snap_trim_sleep_hybrid`

The above sleep options are disabled to ensure that the mclock scheduler is
able to determine when to pick the next op from its operation queue and
transfer it to the operation sequencer. This results in the desired QoS being
provided across all its clients.
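
To confirm that a given sleep option has indeed been zeroed out on an OSD while
an mClock profile is active, a check of the following form may be used
(``osd.0`` is only an example):

  .. prompt:: bash #

    ceph config show osd.0 osd_recovery_sleep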


.. index:: mclock; enable built-in profile

Steps to Enable mClock Profile
==============================

As already mentioned, the default mclock profile is set to *balanced*.
The other values for the built-in profiles include *high_client_ops* and
*high_recovery_ops*.

If there is a requirement to change the default profile, then the option
:confval:`osd_mclock_profile` may be set during runtime by using the following
command:

  .. prompt:: bash #

    ceph config set osd.N osd_mclock_profile <value>

For example, to change the profile to allow faster recoveries on "osd.0", the
following command can be used to switch to the *high_recovery_ops* profile:

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_profile high_recovery_ops

.. note:: The *custom* profile is not recommended unless you are an advanced
          user.

And that's it! You are ready to run workloads on the cluster and check if the
QoS requirements are being met.
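
To confirm which profile is currently in effect on a given OSD (for example
``osd.0``), the following command may be used:

  .. prompt:: bash #

    ceph config show osd.0 osd_mclock_profile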


Switching Between Built-in and Custom Profiles
==============================================

There may be situations requiring switching from a built-in profile to the
*custom* profile and vice-versa. The following sections outline the steps to
accomplish this.

Steps to Switch From a Built-in to the Custom Profile
-----------------------------------------------------

To switch a specific OSD (``osd.N``) to the *custom* profile, the following
command can be used:

  .. prompt:: bash #

    ceph config set osd.N osd_mclock_profile custom

For example, to change the profile to *custom* on all OSDs, the following
command can be used:

  .. prompt:: bash #

    ceph config set osd osd_mclock_profile custom

After switching to the *custom* profile, the desired mClock configuration
option may be modified. For example, to change the client reservation IOPS
ratio for a specific OSD (say osd.0) to 0.5 (or 50%), the following command
can be used:

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_scheduler_client_res 0.5

.. important:: Care must be taken to change the reservations of other services
   like recovery and background best effort accordingly to ensure that the sum
   of the reservations does not exceed the maximum proportion (1.0) of the IOPS
   capacity of the OSD (see the example after the tip below).

.. tip::  The reservation and limit parameter allocations are per-shard based on
   the type of backing device (HDD/SSD) under the OSD. See
   :confval:`osd_op_num_shards_hdd` and :confval:`osd_op_num_shards_ssd` for
   more details.
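
As a hedged sketch (the values are purely illustrative), a consistent *custom*
allocation for ``osd.0`` that keeps the sum of the reservations within the
maximum proportion of 1.0 could look like:

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_scheduler_client_res 0.5

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_scheduler_background_recovery_res 0.3

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_scheduler_background_best_effort_res 0.2

Here 0.5 + 0.3 + 0.2 = 1.0, so the reservations together account for, and do
not exceed, the OSD's full IOPS capacity.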

Steps to Switch From the Custom Profile to a Built-in Profile
-------------------------------------------------------------

Switching from the *custom* profile to a built-in profile requires an
intermediate step of removing the custom settings from the central config
database for the changes to take effect.

The following sequence of commands can be used to switch to a built-in profile:

#. Set the desired built-in profile using:

   .. prompt:: bash #

     ceph config set osd osd_mclock_profile <built-in profile>

   For example, to set the built-in profile to ``high_client_ops`` on all
   OSDs, run the following command:

   .. prompt:: bash #

     ceph config set osd osd_mclock_profile high_client_ops

#. Determine the existing custom mClock configuration settings in the central
   config database using the following command (a filtering tip is given after
   these steps):

   .. prompt:: bash #

     ceph config dump

#. Remove the custom mClock configuration settings determined in the previous
   step from the central config database:

   .. prompt:: bash #

     ceph config rm osd <mClock Configuration Option>

   For example, to remove the configuration option
   :confval:`osd_mclock_scheduler_client_res` that was set on all OSDs, run the
   following command:

   .. prompt:: bash #

     ceph config rm osd osd_mclock_scheduler_client_res

#. After all existing custom mClock configuration settings have been removed
   from the central config database, the configuration settings pertaining to
   ``high_client_ops`` will come into effect. For example, to verify the
   settings on osd.0 use:

   .. prompt:: bash #

     ceph config show osd.0
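
Since ``ceph config dump`` lists every setting in the central config database,
it may help to filter its output for mClock-related entries when performing
step 2 above; a hedged example:

  .. prompt:: bash #

    ceph config dump | grep osd_mclock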

Switch Temporarily Between mClock Profiles
------------------------------------------

To switch between mClock profiles on a temporary basis, the following commands
may be used to override the settings:

.. warning:: This section is for advanced users or for experimental testing. The
   recommendation is to not use the below commands on a running cluster as they
   could have unexpected outcomes.

.. note:: Configuration changes made on an OSD using the below commands are
   ephemeral and are lost when the OSD restarts. It is also important to note
   that config options overridden using the below commands cannot be modified
   further using the *ceph config set osd.N ...* command; such changes will not
   take effect until the OSD is restarted. This is intentional, as per the
   config subsystem design. However, any further modification can still be made
   ephemerally using the commands mentioned below.

#. Run the *injectargs* command as shown to override the mclock settings:

   .. prompt:: bash #

     ceph tell osd.N injectargs '--<mClock Configuration Option>=<value>'

   For example, the following command overrides the
   :confval:`osd_mclock_profile` option on osd.0:

   .. prompt:: bash #

     ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops'


#. An alternate command that can be used is:

   .. prompt:: bash #

     ceph daemon osd.N config set <mClock Configuration Option> <value>

   For example, the following command overrides the
   :confval:`osd_mclock_profile` option on osd.0:

   .. prompt:: bash #

     ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops

The individual QoS-related config options for the *custom* profile can also be
modified ephemerally using the above commands.
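
For example, assuming the *custom* profile is active on ``osd.0``, its client
weight could be overridden ephemerally as shown below (the value is
illustrative only):

  .. prompt:: bash #

    ceph tell osd.0 injectargs '--osd_mclock_scheduler_client_wgt=6'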


Steps to Modify mClock Max Backfills/Recovery Limits
====================================================

This section describes the steps to modify the default max backfills or recovery
limits if the need arises.

.. warning:: This section is for advanced users or for experimental testing. The
   recommendation is to retain the defaults as is on a running cluster, as
   modifying them could have unexpected performance outcomes. The values may
   be modified only if the cluster is unable to cope or shows poor performance
   with the default settings, or for performing experiments on a test cluster.

.. important:: The max backfill/recovery options that can be modified are listed
   in section `Recovery/Backfill Options`_. The modification of the mClock
   default backfills/recovery limit is gated by the
   :confval:`osd_mclock_override_recovery_settings` option, which is set to
   *false* by default. Attempting to modify any default recovery/backfill
   limits without setting the gating option will reset that option back to the
   mClock defaults along with a warning message logged in the cluster log. Note
   that it may take a few seconds for the default value to come back into
   effect. Verify the limit using the *config show* command as shown below.

#. Set the :confval:`osd_mclock_override_recovery_settings` config option on all
   OSDs to *true* using:

   .. prompt:: bash #

     ceph config set osd osd_mclock_override_recovery_settings true

#. Set the desired max backfill/recovery option using:

   .. prompt:: bash #

     ceph config set osd osd_max_backfills <value>

   For example, the following command modifies the :confval:`osd_max_backfills`
   option on all OSDs to 5:

   .. prompt:: bash #

     ceph config set osd osd_max_backfills 5

#. Wait for a few seconds and verify the running configuration for a specific
   OSD using:

   .. prompt:: bash #

     ceph config show osd.N | grep osd_max_backfills

   For example, the following command shows the running configuration of
   :confval:`osd_max_backfills` on osd.0:

   .. prompt:: bash #

     ceph config show osd.0 | grep osd_max_backfills

#. Reset the :confval:`osd_mclock_override_recovery_settings` config option on
   all OSDs to *false* using:

   .. prompt:: bash #

     ceph config set osd osd_mclock_override_recovery_settings false


OSD Capacity Determination (Automated)
======================================

The OSD capacity in terms of total IOPS is determined automatically during OSD
initialization. This is achieved by running the OSD bench tool and overriding
the default value of the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option
depending on the device type. No other action/input is expected from the user
to set the OSD capacity.

.. note:: If you wish to manually benchmark OSD(s) or manually tune the
          Bluestore throttle parameters, see section
          `Steps to Manually Benchmark an OSD (Optional)`_.

You may verify the capacity of an OSD after the cluster is brought up by using
the following command:

  .. prompt:: bash #

    ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]

For example, the following command shows the max capacity for "osd.0" on a Ceph
node whose underlying device type is SSD:

  .. prompt:: bash #

    ceph config show osd.0 osd_mclock_max_capacity_iops_ssd

Mitigation of Unrealistic OSD Capacity From Automated Test
----------------------------------------------------------
In certain conditions, the OSD bench tool may show unrealistic/inflated results
depending on the drive configuration and other environment-related conditions.
To mitigate the performance impact of such an unrealistic capacity, a couple
of threshold config options, depending on the OSD's device type, are defined
and used:

- :confval:`osd_mclock_iops_capacity_threshold_hdd` = 500
- :confval:`osd_mclock_iops_capacity_threshold_ssd` = 80000

The following automated step is performed:

Fallback to using default OSD capacity (automated)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If OSD bench reports a measurement that exceeds the above threshold values
depending on the underlying device type, the fallback mechanism reverts to the
default value of :confval:`osd_mclock_max_capacity_iops_hdd` or
:confval:`osd_mclock_max_capacity_iops_ssd`. The threshold config options
can be reconfigured based on the type of drive used. Additionally, a cluster
warning is logged if the measurement exceeds the threshold. For example::

    2022-10-27T15:30:23.270+0000 7f9b5dbe95c0  0 log_channel(cluster) log [WRN]
    : OSD bench result of 39546.479392 IOPS exceeded the threshold limit of
    25000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000
    IOPS. The recommendation is to establish the osd's IOPS capacity using other
    benchmark tools (e.g. Fio) and then override
    osd_mclock_max_capacity_iops_[hdd|ssd].
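
If the threshold itself is too conservative for the drives in use, it may be
raised. For example (the value shown is purely illustrative), for SSD-backed
OSDs:

  .. prompt:: bash #

    ceph config set osd osd_mclock_iops_capacity_threshold_ssd 120000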

If the default capacity doesn't accurately represent the OSD's capacity, the
following additional step is recommended to address this:

Run custom drive benchmark if defaults are not accurate (manual)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the default OSD capacity is not accurate, the recommendation is to run a
custom benchmark using your preferred tool (e.g. Fio) on the drive and then
override the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option as described
in the `Specifying Max OSD Capacity`_ section.

This step is highly recommended until an alternate mechanism is available.

Steps to Manually Benchmark an OSD (Optional)
=============================================

.. note:: These steps are only necessary if you want to override the OSD
          capacity already determined automatically during OSD initialization.
          Otherwise, you may skip this section entirely.

.. tip:: If you have already determined the benchmark data and wish to manually
         override the max OSD capacity for an OSD, you may skip to the section
         `Specifying Max OSD Capacity`_.


Any existing benchmarking tool (e.g. Fio) can be used for this purpose. However,
the steps below use the *Ceph OSD Bench* command described in the next section.
Regardless of the tool/command used, the steps outlined further below remain the
same.

As already described in the :ref:`dmclock-qos` section, the number of
shards and the bluestore's throttle parameters have an impact on the mclock op
queues. Therefore, it is critical to set these values carefully in order to
maximize the impact of the mclock scheduler.

:Number of Operational Shards:
  We recommend using the default number of shards as defined by the
  configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
  ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase
  the impact of the mclock queues.

:Bluestore Throttle Parameters:
  We recommend using the default values as defined by
  :confval:`bluestore_throttle_bytes` and
  :confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be
  determined during the benchmarking phase as described below.

OSD Bench Command Syntax
------------------------

The :ref:`osd-subsystem` section describes the OSD bench command. The syntax
used for benchmarking is shown below:

.. prompt:: bash #

  ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]

where:

* ``TOTAL_BYTES``: Total number of bytes to write
* ``BYTES_PER_WRITE``: Block size per write
* ``OBJ_SIZE``: Bytes per object
* ``NUM_OBJS``: Number of objects to write
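
For instance, the invocation used in the benchmarking steps below,
``ceph tell osd.0 bench 12288000 4096 4194304 100``, asks the OSD to perform::

    12288000 bytes total / 4096 bytes per write = 3000 writes

spread across 100 objects of 4194304 bytes (4 MiB) each.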

Benchmarking Test Steps Using OSD Bench
---------------------------------------

The steps below use the default number of shards and detail how to determine
the correct bluestore throttle values (optional).

#. Bring up your Ceph cluster and log in to the Ceph node hosting the OSDs that
   you wish to benchmark.
#. Run a simple 4KiB random write workload on an OSD using the following
   commands:

   .. note:: Before running the test, caches must be cleared to get an
             accurate measurement.

   For example, if you are running the benchmark test on osd.0, run the following
   commands:

   .. prompt:: bash #

     ceph tell osd.0 cache drop

   .. prompt:: bash #

     ceph tell osd.0 bench 12288000 4096 4194304 100

#. Note the overall throughput (IOPS) obtained from the output of the OSD bench
   command. This value is the baseline throughput (IOPS) when the default
   bluestore throttle options are in effect.
#. If the intent is to determine the bluestore throttle values for your
   environment, then set the two options, :confval:`bluestore_throttle_bytes`
   and :confval:`bluestore_throttle_deferred_bytes`, to 32 KiB (32768 bytes)
   each to begin with. Otherwise, you may skip to the next section.
#. Run the 4KiB random write test as before using OSD bench.
#. Note the overall throughput from the output and compare the value
   against the baseline throughput recorded in step 3.
#. If the throughput doesn't match the baseline, double the bluestore
   throttle options and repeat steps 5 through 7 until the obtained
   throughput is very close to the baseline value.

For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB
for both bluestore throttle and deferred bytes was determined to maximize the
impact of mclock. For HDDs, the corresponding value was 40 MiB, where the
overall throughput was roughly equal to the baseline throughput. Note that in
general for HDDs, the bluestore throttle values are expected to be higher when
compared to SSDs.
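
Once suitable values have been determined, they can be applied to the OSD. A
hedged example using the 256 KiB value (262144 bytes) mentioned above for an
illustrative ``osd.0``:

  .. prompt:: bash #

    ceph config set osd.0 bluestore_throttle_bytes 262144

  .. prompt:: bash #

    ceph config set osd.0 bluestore_throttle_deferred_bytes 262144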


Specifying Max OSD Capacity
----------------------------

The steps in this section may be performed only if you want to override the
max OSD capacity automatically set during OSD initialization. The option
``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the
following command:

  .. prompt:: bash #

     ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value>

For example, the following command sets the max capacity for a specific OSD
(say "osd.0") whose underlying device type is HDD to 350 IOPS:

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350

Alternatively, you may specify the max capacity for OSDs within the Ceph
configuration file under the respective [osd.N] section. See
:ref:`ceph-conf-settings` for more details.
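
For example, an entry of the following form in the configuration file would set
the same value as the command above (the OSD id and value are illustrative)::

    [osd.0]
    osd_mclock_max_capacity_iops_hdd = 350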


.. index:: mclock; config settings

mClock Config Options
=====================

.. confval:: osd_mclock_profile
.. confval:: osd_mclock_max_capacity_iops_hdd
.. confval:: osd_mclock_max_capacity_iops_ssd
.. confval:: osd_mclock_max_sequential_bandwidth_hdd
.. confval:: osd_mclock_max_sequential_bandwidth_ssd
.. confval:: osd_mclock_force_run_benchmark_on_init
.. confval:: osd_mclock_skip_benchmark
.. confval:: osd_mclock_override_recovery_settings
.. confval:: osd_mclock_iops_capacity_threshold_hdd
.. confval:: osd_mclock_iops_capacity_threshold_ssd

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf