:orphan:

==========================================
 crushtool -- CRUSH map manipulation tool
==========================================

.. program:: crushtool

Synopsis
========

| **crushtool** ( -d *map* | -c *map.txt* | --build --num_osds *numosds*
  *layer1* *...* | --test ) [ -o *outfile* ]


Description
===========

**crushtool** is a utility that lets you create, compile, decompile
and test CRUSH map files.

CRUSH is a pseudo-random data distribution algorithm that efficiently
maps input values (which, in the context of Ceph, correspond to Placement
Groups) across a heterogeneous, hierarchically structured device map.
The algorithm was originally described in detail in the following paper
(although it has evolved somewhat since then)::

    http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

The tool has four modes of operation.

.. option:: --compile|-c map.txt

   will compile a plaintext map.txt into a binary map file.

.. option:: --decompile|-d map

   will take the compiled map and decompile it into a plaintext source
   file, suitable for editing.

.. option:: --build --num_osds {num-osds} layer1 ...

   will create a map with the given layer structure. See below for a
   detailed explanation.

.. option:: --test

   will perform a dry run of a CRUSH mapping for a range of input
   values ``[--min-x,--max-x]`` (default ``[0,1023]``), which can be
   thought of as simulated Placement Groups. See below for a more
   detailed explanation.

Unlike other Ceph tools, **crushtool** does not accept generic options
such as **--debug-crush** from the command line. They can, however, be
provided via the ``CEPH_ARGS`` environment variable. For instance, to
silence all output from the CRUSH subsystem::

    CEPH_ARGS="--debug-crush 0" crushtool ...
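
The **--test** dry run can be pictured with a small simulation. The Python
sketch below is a toy and not CRUSH itself. It assumes ``hashlib.sha256`` as a
stand-in for CRUSH's own hash, and it uses a weighted "longest straw wins" draw
in the spirit of the **straw2** bucket type. The functions ``straw``,
``choose`` and ``simulate`` are hypothetical names used only for illustration.
The sketch maps every value in ``[min_x, max_x]`` (default ``[0, 1023]``, as
with **--test**) to ``num_rep`` distinct devices. It then compares the count
for each device with an assumed expectation of values × replicas × weight /
total weight::

    import hashlib
    import math
    from collections import Counter


    def straw(x, device, weight):
        """Deterministic pseudo-random draw for (x, device), scaled by weight.

        Toy assumption: sha256 stands in for CRUSH's real hash; this is not CRUSH.
        """
        digest = hashlib.sha256(f"{x}:{device}".encode()).digest()
        # Map 8 bytes to u in (0, 1]; the +1 avoids log(0).
        u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
        # log(u) <= 0, so dividing by a larger (positive) weight pulls the draw
        # towards 0 and heavier devices tend to win.
        return math.log(u) / weight


    def choose(x, weights, num_rep):
        """Return num_rep distinct device ids for input value x (longest straws)."""
        ranked = sorted(weights, key=lambda d: straw(x, d, weights[d]), reverse=True)
        return ranked[:num_rep]


    def simulate(weights, num_rep, min_x=0, max_x=1023):
        """Dry run over [min_x, max_x]: count how many values each device stores."""
        stored = Counter()
        for x in range(min_x, max_x + 1):
            stored.update(choose(x, weights, num_rep))
        n = max_x - min_x + 1
        total = sum(weights.values())
        # Assumed expectation: values * replicas * weight / total weight.
        expected = {d: n * num_rep * w / total for d, w in weights.items()}
        return stored, expected


    if __name__ == "__main__":
        devices = {d: 1.0 for d in range(12)}  # 12 equally weighted devices
        stored, expected = simulate(devices, num_rep=10)
        for d in sorted(devices):
            print(f"device {d}: stored : {stored[d]} expected : {expected[d]:.3f}")

With 12 equally weighted devices and 10 replicas, the assumed expectation is
1024 × 10 / 12 ≈ 853.333 per device. This matches the figure shown in the
**--show-utilization** example. Ranking devices by a per-device pseudo-random
key keeps each mapping deterministic for a given input value, so repeated
dry runs give the same result.
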

Running tests with --test
=========================

The test mode will use the input crush map (as specified with **-i map**)
and perform a dry run of CRUSH mapping or random placement (if
**--simulate** is set). On completion, two kinds of reports can be
created:

1) The **--show-...** options output human-readable information
   on stderr.
2) The **--output-csv** option creates CSV files that are
   documented by the **--help-output** option.

Note: Each Placement Group (PG) has an integer ID which can be obtained
from ``ceph pg dump``. For example, PG 2.2f means pool id 2 and PG id
0x2f, that is 47, because the PG part is shown in hexadecimal. The pool
and PG IDs are combined by a function to get a value which is given to
CRUSH to map it to OSDs. **crushtool** does not know about PGs or pools;
it only runs simulations by mapping values in the range
``[--min-x,--max-x]``.


.. option:: --show-statistics

   Displays a summary of the distribution. For instance::

       rule 1 (metadata) num_rep 5 result size == 5: 1024/1024

   shows that rule **1**, which is named **metadata**, successfully
   mapped **1024** values to **result size == 5** devices when trying
   to map them to **num_rep 5** replicas. When it fails to provide the
   required mapping, presumably because the number of **tries** must
   be increased, a breakdown of the failures is displayed. For instance::

       rule 1 (metadata) num_rep 10 result size == 8: 4/1024
       rule 1 (metadata) num_rep 10 result size == 9: 93/1024
       rule 1 (metadata) num_rep 10 result size == 10: 927/1024

   shows that although **num_rep 10** replicas were required, **4**
   out of **1024** values (**4/1024**) were mapped to only
   **result size == 8** devices.

.. option:: --show-mappings

   Displays the mapping of each value in the range ``[--min-x,--max-x]``.
   For instance::

       CRUSH rule 1 x 24 [11,6]

   shows that value **24** is mapped to devices **[11,6]** by rule
   **1**.

   One of the following is required when using the ``--show-mappings`` option:

   (a) ``--num-rep``
   (b) both ``--min-rep`` and ``--max-rep``

   ``--num-rep`` stands for "number of replicas". It indicates the number of
   replicas in a pool and is used to specify an exact number of replicas (for
   example, ``--num-rep 5``). ``--min-rep`` and ``--max-rep`` are used together
   to specify a range of replicas (for example, ``--min-rep 1 --max-rep 10``).

.. option:: --show-bad-mappings

   Displays which values failed to be mapped to the required number of
   devices. For instance::

       bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9]

   shows that when rule **1** was required to map **7** devices, it
   could map only six: **[8,10,2,11,6,9]**.

.. option:: --show-utilization

   Displays the expected and actual utilization for each device, for
   each number of replicas. For instance::

       device 0: stored : 951 expected : 853.333
       device 1: stored : 963 expected : 853.333
       ...

   shows that device **0** stored **951** values and was expected to store
   **853.333**. Implies **--show-statistics**.

.. option:: --show-utilization-all

   Displays the same as **--show-utilization** but does not suppress
   output when the weight of a device is zero.
   Implies **--show-statistics**.

.. option:: --show-choose-tries

   Displays how many attempts were needed to find a device mapping.
   For instance::

       0:     95224
       1:      3745
       2:      2225
       ..

   shows that **95224** mappings succeeded without retries, **3745**
   mappings succeeded with one retry, and so on. There are as many rows
   as the value of the **--set-choose-total-tries** option.

.. option:: --output-csv

   Creates CSV files (in the current directory) containing information
   documented by **--help-output**. The files are named after the rule
   used when collecting the statistics.
   For instance, if the rule
   'metadata' is used, the CSV files will be::

       metadata-absolute_weights.csv
       metadata-device_utilization.csv
       ...

   The first line of each file briefly explains the column layout. For
   instance::

       metadata-absolute_weights.csv
       Device ID, Absolute Weight
       0,1
       ...

.. option:: --output-name NAME

   Prepend **NAME** to the file names generated when **--output-csv**
   is specified. For instance, **--output-name FOO** will create
   files::

       FOO-metadata-absolute_weights.csv
       FOO-metadata-device_utilization.csv
       ...

The **--set-...** options can be used to modify the tunables of the
input crush map. The input crush map is modified in memory. For example,
the bad mapping reported by this run::

    $ crushtool -i mymap --test --show-bad-mappings
    bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9]

could be fixed by increasing **choose-total-tries** as follows::

    $ crushtool -i mymap --test \
        --show-bad-mappings \
        --set-choose-total-tries 500

Building a map with --build
===========================

The build mode will generate hierarchical maps. The **--num_osds**
argument specifies the number of devices (leaves) in the CRUSH hierarchy.
Each layer describes how the layer (or devices) preceding it should be
grouped.

Each layer consists of::

    bucket ( uniform | list | tree | straw | straw2 ) size

The **bucket** is the type of the buckets in the layer
(e.g. "rack"). Each bucket name will be built by appending a unique
number to the **bucket** string (e.g. "rack0", "rack1"...).

The second component is the type of bucket: **straw** should be used
most of the time.

The third component is the maximum size of the bucket. A size of zero
means a bucket of infinite capacity.


Example
=======

Suppose we have two rows with two racks each and 20 nodes per rack. Suppose
each node contains 4 storage devices for Ceph OSD Daemons. This configuration
allows us to deploy 320 Ceph OSD Daemons.
Let's assume a 42U rack with 2U nodes,
leaving an extra 2U for a rack switch.

To reflect our hierarchy of devices, nodes, racks and rows, we would execute
the following::

    $ crushtool -o crushmap --build --num_osds 320 \
        node straw 4 \
        rack straw 20 \
        row straw 2 \
        root straw 0
    # id  weight  type name  reweight
    -87   320     root root
    -85   160       row row0
    -81   80          rack rack0
    -1    4             node node0
    0     1               osd.0  1
    1     1               osd.1  1
    2     1               osd.2  1
    3     1               osd.3  1
    -2    4             node node1
    4     1               osd.4  1
    5     1               osd.5  1
    ...

CRUSH rules are created so that the generated crushmap can be
tested. They are the same rules as the ones created by default when
creating a new Ceph cluster. They can be further edited with::

    # decompile
    crushtool -d crushmap -o map.txt

    # edit
    emacs map.txt

    # recompile
    crushtool -c map.txt -o crushmap

Reclassify
==========

The *reclassify* function allows users to transition from older maps that
maintain parallel hierarchies for OSDs of different types to a modern CRUSH
map that makes use of the *device class* feature. For more information,
see https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#migrating-from-a-legacy-ssd-rule-to-device-classes.

Example output from --test
==========================

See https://github.com/ceph/ceph/blob/master/src/test/cli/crushtool/set-choose.t
for sample ``crushtool --test`` commands and the output they produce.

Availability
============

**crushtool** is part of Ceph, a massively scalable, open-source,
distributed storage system. Please refer to the Ceph documentation at
https://docs.ceph.com for more information.


See also
========

:doc:`ceph <ceph>`\(8),
:doc:`osdmaptool <osdmaptool>`\(8)

Authors
=======

John Wilkins, Sage Weil, Loic Dachary