========================
 mClock Config Reference
========================

.. index:: mclock; configuration

mClock profiles mask the low-level details from users, making it
easier for them to configure mclock.

The following input parameters are required for an mclock profile to configure
the QoS-related parameters:

* total capacity (IOPS) of each OSD (determined automatically)

* an mclock profile type to enable

Using the settings in the specified profile, the OSD determines and applies the
lower-level mclock and Ceph parameters. The parameters applied by the mclock
profile make it possible to tune the QoS between client I/O, recovery/backfill
operations, and other background operations (for example, scrub, snap trim, and
PG deletion). These background activities are considered best-effort internal
clients of Ceph.


.. index:: mclock; profile definition

mClock Profiles - Definition and Purpose
========================================

An mclock profile is *“a configuration setting that, when applied on a running
Ceph cluster, enables the throttling of the operations (IOPS) belonging to
different client classes (background recovery, scrub, snaptrim, client op,
osd subop)”*.

The mclock profile uses the capacity limits and the mclock profile type selected
by the user to determine the low-level mclock resource control parameters.

Depending on the profile type, lower-level mclock resource-control parameters
and some Ceph-configuration parameters are transparently applied.

The low-level mclock resource control parameters are the *reservation*,
*limit*, and *weight* that provide control of the resource shares, as
described in the :ref:`dmclock-qos` section.


.. index:: mclock; profile types

mClock Profile Types
====================

mclock profiles can be broadly classified into two types:

- **Built-in**: Users can choose between the following built-in profile types:

  - **high_client_ops** (*default*):
    This profile allocates more reservation and limit to external-client ops
    than to background recoveries and other internal clients within Ceph.
    This profile is enabled by default.
  - **high_recovery_ops**:
    This profile allocates more reservation to background recoveries than to
    external clients and other internal clients within Ceph. For example, an
    admin may enable this profile temporarily to speed up background
    recoveries during off-peak hours.
  - **balanced**:
    This profile allocates equal reservation to client ops and background
    recovery ops.

- **Custom**: This profile gives users complete control over all the mclock
  configuration parameters. Using this profile is not recommended without
  a deep understanding of mclock and the related Ceph configuration options.
  A short sketch of switching to this profile follows this list.
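
For instance, switching an OSD to the *custom* profile and then adjusting one
of the low-level options could look like the following sketch. The weight
value used here is purely a hypothetical illustration, not a recommendation:

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_profile custom

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_scheduler_client_wgt 4  # hypothetical value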

.. note:: Across the built-in profiles, internal clients of mclock (for example
          "scrub", "snap trim", and "pg deletion") are given slightly lower
          reservations, but higher weight and no limit. This ensures that
          these operations are able to complete quickly if there are no other
          competing services.


.. index:: mclock; built-in profiles

mClock Built-in Profiles
========================

When a built-in profile is enabled, the mClock scheduler calculates the
low-level mclock parameters [*reservation*, *weight*, *limit*] based on the
profile enabled for each client type. The mclock parameters are calculated
based on the max OSD capacity provided beforehand. As a result, the following
mclock config parameters cannot be modified when using any of the built-in
profiles:

- ``osd_mclock_scheduler_client_res``
- ``osd_mclock_scheduler_client_wgt``
- ``osd_mclock_scheduler_client_lim``
- ``osd_mclock_scheduler_background_recovery_res``
- ``osd_mclock_scheduler_background_recovery_wgt``
- ``osd_mclock_scheduler_background_recovery_lim``
- ``osd_mclock_scheduler_background_best_effort_res``
- ``osd_mclock_scheduler_background_best_effort_wgt``
- ``osd_mclock_scheduler_background_best_effort_lim``

In addition, the following Ceph options are not modifiable by the user:

- ``osd_max_backfills``
- ``osd_recovery_max_active``

This is because these options are modified internally by the mclock scheduler
in order to maximize the impact of the enabled profile.
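
The current values of these options on a given OSD can be queried like any
other setting (shown here for "osd.0", purely as an illustration):

  .. prompt:: bash #

    ceph config show osd.0 osd_mclock_scheduler_client_res

  .. prompt:: bash #

    ceph config show osd.0 osd_max_backfills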

By default, the *high_client_ops* profile is enabled to ensure that a larger
share of the bandwidth allocation goes to client ops. Background recovery ops
are given a lower allocation (and therefore take longer to complete). But
there might be situations that require giving a higher allocation to either
client ops or recovery ops. To deal with such a situation, you can enable one
of the alternative built-in profiles by following the steps described in the
next section.

If any mClock profile (including "custom") is active, the following Ceph
configuration sleep options are disabled:

- ``osd_recovery_sleep``
- ``osd_recovery_sleep_hdd``
- ``osd_recovery_sleep_ssd``
- ``osd_recovery_sleep_hybrid``
- ``osd_scrub_sleep``
- ``osd_delete_sleep``
- ``osd_delete_sleep_hdd``
- ``osd_delete_sleep_ssd``
- ``osd_delete_sleep_hybrid``
- ``osd_snap_trim_sleep``
- ``osd_snap_trim_sleep_hdd``
- ``osd_snap_trim_sleep_ssd``
- ``osd_snap_trim_sleep_hybrid``

The above sleep options are disabled to ensure that the mclock scheduler is
able to determine when to pick the next op from its operation queue and
transfer it to the operation sequencer. This results in the desired QoS being
provided across all of its clients.
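
A quick way to confirm this on a running OSD is to query one of the sleep
options; with an mClock profile active, it is expected to report a value of 0.
The following is just an illustration for "osd.0":

  .. prompt:: bash #

    ceph config show osd.0 osd_recovery_sleep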


.. index:: mclock; enable built-in profile

Steps to Enable mClock Profile
==============================

As already mentioned, the default mclock profile is set to *high_client_ops*.
The other values for the built-in profiles include *balanced* and
*high_recovery_ops*.

If there is a requirement to change the default profile, then the
``osd_mclock_profile`` option may be set at runtime by using the following
command:

  .. prompt:: bash #

    ceph config set osd.N osd_mclock_profile <value>

For example, to change the profile to allow faster recoveries on "osd.0", the
following command can be used to switch to the *high_recovery_ops* profile:

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_profile high_recovery_ops

.. note:: The *custom* profile is not recommended unless you are an advanced
          user.

And that's it! You are ready to run workloads on the cluster and check if the
QoS requirements are being met.
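
If you want to confirm which profile is currently in effect on a particular
OSD (for example "osd.0"), it can be queried in the same way as any other
option:

  .. prompt:: bash #

    ceph config show osd.0 osd_mclock_profile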


OSD Capacity Determination (Automated)
======================================

The OSD capacity in terms of total IOPS is determined automatically during OSD
initialization. This is achieved by running the OSD bench tool and overriding
the default value of the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option
depending on the device type. No other action or input is expected from the
user to set the OSD capacity. You may verify the capacity of an OSD after the
cluster is brought up by using the following command:

  .. prompt:: bash #

    ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]

For example, the following command shows the max capacity for "osd.0" on a Ceph
node whose underlying device type is SSD:

  .. prompt:: bash #

    ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
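
If you are unsure whether an OSD is treated as rotational (hdd) or solid state
(ssd), and therefore which of the two options applies, the OSD metadata can
help. For example, for "osd.0" (shown here purely as an illustration):

  .. prompt:: bash #

    ceph osd metadata 0 | grep rotational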


Steps to Manually Benchmark an OSD (Optional)
=============================================

.. note:: These steps are only necessary if you want to override the OSD
          capacity already determined automatically during OSD initialization.
          Otherwise, you may skip this section entirely.

.. tip:: If you have already determined the benchmark data and wish to manually
         override the max OSD capacity for an OSD, you may skip to the section
         `Specifying Max OSD Capacity`_.


Any existing benchmarking tool can be used for this purpose. The steps below
use the *Ceph OSD Bench* command described in the next section. Regardless of
the tool or command used, the steps outlined further below remain the same.

As already described in the :ref:`dmclock-qos` section, the number of shards
and the BlueStore throttle parameters have an impact on the mclock op queues.
Therefore, it is critical to set these values carefully in order to maximize
the impact of the mclock scheduler.

:Number of Operational Shards:
  We recommend using the default number of shards as defined by the
  configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
  ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase
  the impact of the mclock queues.

:Bluestore Throttle Parameters:
  We recommend using the default values as defined by
  ``bluestore_throttle_bytes`` and ``bluestore_throttle_deferred_bytes``. But
  these parameters may also be determined during the benchmarking phase as
  described below.


OSD Bench Command Syntax
````````````````````````

The :ref:`osd-subsystem` section describes the OSD bench command. The syntax
used for benchmarking is shown below:

.. prompt:: bash #

  ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]

where:

* ``TOTAL_BYTES``: Total number of bytes to write
* ``BYTES_PER_WRITE``: Block size per write
* ``OBJ_SIZE``: Bytes per object
* ``NUM_OBJS``: Number of objects to write
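
To make these arguments concrete: the invocation used in the test steps below
writes a total of 12288000 bytes in 4096-byte IOs (that is, 3000 writes) spread
across 100 objects of up to 4 MiB (4194304 bytes) each:

.. prompt:: bash #

  ceph tell osd.0 bench 12288000 4096 4194304 100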

Benchmarking Test Steps Using OSD Bench
```````````````````````````````````````

The steps below use the default number of shards and detail how to determine
the correct bluestore throttle values (optional).

#. Bring up your Ceph cluster and log in to the Ceph node hosting the OSDs that
   you wish to benchmark.
#. Run a simple 4KiB random write workload on an OSD using the following
   commands:

   .. note:: Before running the test, caches must be cleared to get an
             accurate measurement.

   For example, if you are running the benchmark test on osd.0, run the following
   commands:

   .. prompt:: bash #

     ceph tell osd.0 cache drop

   .. prompt:: bash #

     ceph tell osd.0 bench 12288000 4096 4194304 100

#. Note the overall throughput (IOPS) obtained from the output of the osd bench
   command. This value is the baseline throughput (IOPS) when the default
   bluestore throttle options are in effect.
#. If the intent is to determine the bluestore throttle values for your
   environment, then set the two options ``bluestore_throttle_bytes``
   and ``bluestore_throttle_deferred_bytes`` to 32 KiB (32768 bytes) each
   to begin with (an example invocation is sketched after this list).
   Otherwise, you may skip to the next section.
#. Run the 4KiB random write test as before using OSD bench.
#. Note the overall throughput from the output and compare the value
   against the baseline throughput recorded in step 3.
#. If the throughput doesn't match the baseline, multiply the bluestore
   throttle options by 2 and repeat steps 5 through 7 until the obtained
   throughput is very close to the baseline value.
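
For reference, the initial 32 KiB (32768 bytes) values from step 4 can be
applied to an OSD as follows (shown here for "osd.0" purely as an illustration;
adjust the OSD id and values to suit your environment):

.. prompt:: bash #

  ceph config set osd.0 bluestore_throttle_bytes 32768

.. prompt:: bash #

  ceph config set osd.0 bluestore_throttle_deferred_bytes 32768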

For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB
for both bluestore throttle and deferred bytes was determined to maximize the
impact of mclock. For HDDs, the corresponding value was 40 MiB, where the
overall throughput was roughly equal to the baseline throughput. Note that in
general for HDDs, the bluestore throttle values are expected to be higher when
compared to SSDs.


Specifying Max OSD Capacity
````````````````````````````

The steps in this section may be performed only if you want to override the
max OSD capacity automatically set during OSD initialization. The option
``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the
following command:

  .. prompt:: bash #

     ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value>

For example, the following command sets the max capacity for a specific OSD
(say "osd.0") whose underlying device type is HDD to 350 IOPS:

  .. prompt:: bash #

    ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350

Alternatively, you may specify the max capacity for OSDs within the Ceph
configuration file under the respective ``[osd.N]`` section. See
:ref:`ceph-conf-settings` for more details.
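
For illustration, a corresponding entry for a hypothetical "osd.0" with a
rotational device might look like this in the configuration file:

.. code-block:: ini

   [osd.0]
   osd_mclock_max_capacity_iops_hdd = 350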


.. index:: mclock; config settings

mClock Config Options
=====================

``osd_mclock_profile``

:Description: This sets the type of mclock profile to use for providing QoS
              based on operations belonging to different classes (background
              recovery, scrub, snaptrim, client op, osd subop). Once a built-in
              profile is enabled, the lower-level mclock resource control
              parameters [*reservation, weight, limit*] and some Ceph
              configuration parameters are set transparently. Note that the
              above does not apply to the *custom* profile.

:Type: String
:Valid Choices: high_client_ops, high_recovery_ops, balanced, custom
:Default: ``high_client_ops``

``osd_mclock_max_capacity_iops_hdd``

:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for
              rotational media)

:Type: Float
:Default: ``315.0``

``osd_mclock_max_capacity_iops_ssd``

:Description: Max IOPS capacity (at 4KiB block size) to consider per OSD (for
              solid state media)

:Type: Float
:Default: ``21500.0``

``osd_mclock_cost_per_io_usec``

:Description: Cost per IO in microseconds to consider per OSD (overrides _ssd
              and _hdd if non-zero)

:Type: Float
:Default: ``0.0``

``osd_mclock_cost_per_io_usec_hdd``

:Description: Cost per IO in microseconds to consider per OSD (for rotational
              media)

:Type: Float
:Default: ``25000.0``

``osd_mclock_cost_per_io_usec_ssd``

:Description: Cost per IO in microseconds to consider per OSD (for solid state
              media)

:Type: Float
:Default: ``50.0``

``osd_mclock_cost_per_byte_usec``

:Description: Cost per byte in microseconds to consider per OSD (overrides _ssd
              and _hdd if non-zero)

:Type: Float
:Default: ``0.0``

``osd_mclock_cost_per_byte_usec_hdd``

:Description: Cost per byte in microseconds to consider per OSD (for rotational
              media)

:Type: Float
:Default: ``5.2``

``osd_mclock_cost_per_byte_usec_ssd``

:Description: Cost per byte in microseconds to consider per OSD (for solid state
              media)

:Type: Float
:Default: ``0.011``