author     Daniel Baumann <daniel.baumann@progress-linux.org>   2024-05-23 16:45:13 +0000
committer  Daniel Baumann <daniel.baumann@progress-linux.org>   2024-05-23 16:45:13 +0000
commit     389020e14594e4894e28d1eb9103c210b142509e (patch)
tree       2ba734cdd7a243f46dda7c3d0cc88c2293d9699f /doc/rados/operations
parent     Adding upstream version 18.2.2. (diff)
Adding upstream version 18.2.3. (upstream/18.2.3)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/rados/operations')
-rw-r--r--   doc/rados/operations/add-or-rm-mons.rst        | 151
-rw-r--r--   doc/rados/operations/control.rst               |  10
-rw-r--r--   doc/rados/operations/crush-map.rst             |  49
-rw-r--r--   doc/rados/operations/erasure-code-profile.rst  |   4
-rw-r--r--   doc/rados/operations/erasure-code.rst          | 178
-rw-r--r--   doc/rados/operations/index.rst                 |   1
-rw-r--r--   doc/rados/operations/monitoring.rst            |   2
-rw-r--r--   doc/rados/operations/pgcalc/index.rst          |  68
-rw-r--r--   doc/rados/operations/placement-groups.rst      |  32
-rw-r--r--   doc/rados/operations/pools.rst                 |  41
-rw-r--r--   doc/rados/operations/stretch-mode.rst          |  10
11 files changed, 470 insertions, 76 deletions
diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst
index 3688bb798..e97c0b94d 100644
--- a/doc/rados/operations/add-or-rm-mons.rst
+++ b/doc/rados/operations/add-or-rm-mons.rst
@@ -344,12 +344,13 @@ addresses, repeat this process.
 Changing a Monitor's IP address (Advanced Method)
 -------------------------------------------------
 
-There are cases in which the method outlined in :ref"`<Changing a Monitor's IP
-Address (Preferred Method)> operations_add_or_rm_mons_changing_mon_ip` cannot
-be used. For example, it might be necessary to move the cluster's monitors to a
-different network, to a different part of the datacenter, or to a different
-datacenter altogether. It is still possible to change the monitors' IP
-addresses, but a different method must be used.
+There are cases in which the method outlined in
+:ref:`operations_add_or_rm_mons_changing_mon_ip` cannot be used. For example,
+it might be necessary to move the cluster's monitors to a different network, to
+a different part of the datacenter, or to a different datacenter altogether. It
+is still possible to change the monitors' IP addresses, but a different method
+must be used.
+
 
 For such cases, a new monitor map with updated IP addresses for every monitor
 in the cluster must be generated and injected on each monitor. Although this
@@ -357,11 +358,11 @@ method is not particularly easy, such a major migration is unlikely to be a
 routine task. As stated at the beginning of this section, existing monitors are
 not supposed to change their IP addresses.
 
-Continue with the monitor configuration in the example from :ref"`<Changing a
-Monitor's IP Address (Preferred Method)>
-operations_add_or_rm_mons_changing_mon_ip` . Suppose that all of the monitors
-are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that
-these networks are unable to communicate. Carry out the following procedure:
+Continue with the monitor configuration in the example from
+:ref:`operations_add_or_rm_mons_changing_mon_ip`. Suppose that all of the
+monitors are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range,
+and that these networks are unable to communicate. Carry out the following
+procedure:
 
 #. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor
    map, and ``{filename}`` is the name of the file that contains the retrieved
@@ -448,7 +449,135 @@ and inject the modified monitor map into each new monitor.
 Migration to the new location is now complete. The monitors should operate
 successfully.
 
+Using cephadm to change the public network
+==========================================
+
+Overview
+--------
+
+The procedure in this overview section provides only the broad outlines of
+using ``cephadm`` to change the public network.
+
+#. Create backups of all keyrings, configuration files, and the current monmap.
+
+#. Stop the cluster and disable ``ceph.target`` to prevent the daemons from
+   starting.
+
+#. Move the servers and power them on.
+
+#. Change the network setup as desired.
+
+
+Example Procedure
+-----------------
+
+.. note:: In this procedure, the "old network" has addresses of the form
+   ``10.10.10.0/24`` and the "new network" has addresses of the form
+   ``192.168.160.0/24``.
+
+#. Enter the shell of the first monitor:
+
+   .. prompt:: bash #
+
+      cephadm shell --name mon.reef1
+
+#. Extract the current monmap from ``mon.reef1``:
+
+   .. prompt:: bash #
+
+      ceph-mon -i reef1 --extract-monmap monmap
+
+#. Print the content of the monmap:
+
+   .. prompt:: bash #
+
+      monmaptool --print monmap
+
+   ::
+
+      monmaptool: monmap file monmap
+      epoch 5
+      fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
+      last_changed 2024-02-21T09:32:18.292040+0000
+      created 2024-02-21T09:18:27.136371+0000
+      min_mon_release 18 (reef)
+      election_strategy: 1
+      0: [v2:10.10.10.11:3300/0,v1:10.10.10.11:6789/0] mon.reef1
+      1: [v2:10.10.10.12:3300/0,v1:10.10.10.12:6789/0] mon.reef2
+      2: [v2:10.10.10.13:3300/0,v1:10.10.10.13:6789/0] mon.reef3
+
+#. Remove monitors with old addresses:
+
+   .. prompt:: bash #
+
+      monmaptool --rm reef1 --rm reef2 --rm reef3 monmap
+
+#. Add monitors with new addresses:
+
+   .. prompt:: bash #
+
+      monmaptool --addv reef1 [v2:192.168.160.11:3300/0,v1:192.168.160.11:6789/0] --addv reef2 [v2:192.168.160.12:3300/0,v1:192.168.160.12:6789/0] --addv reef3 [v2:192.168.160.13:3300/0,v1:192.168.160.13:6789/0] monmap
+
+#. Verify that the changes to the monmap have been made successfully:
+
+   .. prompt:: bash #
+
+      monmaptool --print monmap
+
+   ::
+
+      monmaptool: monmap file monmap
+      epoch 4
+      fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
+      last_changed 2024-02-21T09:32:18.292040+0000
+      created 2024-02-21T09:18:27.136371+0000
+      min_mon_release 18 (reef)
+      election_strategy: 1
+      0: [v2:192.168.160.11:3300/0,v1:192.168.160.11:6789/0] mon.reef1
+      1: [v2:192.168.160.12:3300/0,v1:192.168.160.12:6789/0] mon.reef2
+      2: [v2:192.168.160.13:3300/0,v1:192.168.160.13:6789/0] mon.reef3
+
+#. Inject the new monmap into the Ceph cluster:
+
+   .. prompt:: bash #
+
+      ceph-mon -i reef1 --inject-monmap monmap
+
+#. Repeat the steps above for all other monitors in the cluster.
+
+#. Update ``/var/lib/ceph/{FSID}/mon.{MON}/config``.
+
+#. Start the monitors.
+
+#. Update the ceph ``public_network``:
+
+   .. prompt:: bash #
+
+      ceph config set mon public_network 192.168.160.0/24
+
+#. Update the configuration files of the managers
+   (``/var/lib/ceph/{FSID}/mgr.{mgr}/config``) and start them. Orchestrator
+   will now be available, but it will attempt to connect to the old network
+   because the host list contains the old addresses.
+
+#. Update the host addresses by running commands of the following form:
+
+   .. prompt:: bash #
+
+      ceph orch host set-addr reef1 192.168.160.11
+      ceph orch host set-addr reef2 192.168.160.12
+      ceph orch host set-addr reef3 192.168.160.13
+
+#. Wait a few minutes for the orchestrator to connect to each host.
+
+#. Reconfigure the OSDs so that their config files are automatically updated:
+
+   .. prompt:: bash #
+
+      ceph orch reconfig osd
+
+*The above procedure was developed by Eugen Block and was successfully tested
+in February 2024 on Ceph version 18.2.1 (Reef).*
 
 .. _Manual Deployment: ../../../install/manual-deployment
 .. _Monitor Bootstrap: ../../../dev/mon-bootstrap
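After the procedure above, it is worth confirming that the monitors have formed quorum on the new network and that the cluster reports the new addresses. A minimal check, assuming an admin keyring is available inside the ``cephadm`` shell:

.. prompt:: bash #

   ceph mon dump
   ceph -s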
diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst
index 033f831cd..32d043f1f 100644
--- a/doc/rados/operations/control.rst
+++ b/doc/rados/operations/control.rst
@@ -474,27 +474,25 @@ following command:
 
    ceph tell mds.{mds-id} config set {setting} {value}
 
-Example:
+Example: to enable debug messages, run the following command:
 
 .. prompt:: bash $
 
    ceph tell mds.0 config set debug_ms 1
 
-To enable debug messages, run the following command:
+To display the status of all metadata servers, run the following command:
 
 .. prompt:: bash $
 
    ceph mds stat
 
-To display the status of all metadata servers, run the following command:
+To mark the active metadata server as failed (and to trigger failover to a
+standby if a standby is present), run the following command:
 
 .. prompt:: bash $
 
    ceph mds fail 0
 
-To mark the active metadata server as failed (and to trigger failover to a
-standby if a standby is present), run the following command:
-
 .. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
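A value changed with ``ceph tell`` can be read back through the same interface, which is a quick way to confirm that the setting took effect; a minimal sketch using the ``debug_ms`` example above (the rank ``0`` is only an example):

.. prompt:: bash $

   ceph tell mds.0 config get debug_ms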
diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst
index 39151e6d4..18f4dcb6d 100644
--- a/doc/rados/operations/crush-map.rst
+++ b/doc/rados/operations/crush-map.rst
@@ -57,53 +57,62 @@ case for most clusters), its CRUSH location can be specified as follows::
    ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined types
    suffice for nearly all clusters, but can be customized by modifying the CRUSH
    map.
-   #. Not all keys need to be specified. For example, by default, Ceph
-      automatically sets an ``OSD``'s location as ``root=default
-      host=HOSTNAME`` (as determined by the output of ``hostname -s``).
 
-The CRUSH location for an OSD can be modified by adding the ``crush location``
-option in ``ceph.conf``. When this option has been added, every time the OSD
+The CRUSH location for an OSD can be set by adding the ``crush_location``
+option in ``ceph.conf``, for example::
+
+   crush_location = root=default row=a rack=a2 chassis=a2a host=a2a1
+
+When this option has been added, every time the OSD
 starts it verifies that it is in the correct location in the CRUSH map and
 moves itself if it is not. To disable this automatic CRUSH map management, add
 the following to the ``ceph.conf`` configuration file in the ``[osd]``
 section::
 
-    osd crush update on start = false
+    osd_crush_update_on_start = false
 
 Note that this action is unnecessary in most cases.
 
+If the ``crush_location`` is not set explicitly,
+a default of ``root=default host=HOSTNAME`` is used for OSDs,
+where the hostname is determined by the output of the ``hostname -s`` command.
+
+.. note:: If you switch from this default to an explicitly set ``crush_location``,
+   do not forget to include ``root=default`` because existing CRUSH rules refer to it.
 
 Custom location hooks
 ---------------------
 
-A custom location hook can be used to generate a more complete CRUSH location
-on startup. The CRUSH location is determined by, in order of preference:
+A custom location hook can be used to generate a more complete CRUSH location
+on startup.
+
+This is useful when some location fields are not known at the time
+``ceph.conf`` is written (for example, fields ``rack`` or ``datacenter``
+when deploying a single configuration across multiple datacenters).
 
-#. A ``crush location`` option in ``ceph.conf``
-#. A default of ``root=default host=HOSTNAME`` where the hostname is determined
-   by the output of the ``hostname -s`` command
+If configured, executed, and parsed successfully, the hook's output replaces
+any previously set CRUSH location.
 
-A script can be written to provide additional location fields (for example,
-``rack`` or ``datacenter``) and the hook can be enabled via the following
-config option::
+The hook can be enabled in ``ceph.conf`` by providing a path to an
+executable file (often a script), for example::
 
-    crush location hook = /path/to/customized-ceph-crush-location
+    crush_location_hook = /path/to/customized-ceph-crush-location
 
 This hook is passed several arguments (see below). The hook outputs a single
-line to ``stdout`` that contains the CRUSH location description. The output
-resembles the following:::
+line to ``stdout`` that contains the CRUSH location description. The arguments
+resemble the following::
 
     --cluster CLUSTER --id ID --type TYPE
 
 Here the cluster name is typically ``ceph``, the ``id`` is the daemon
 identifier or (in the case of OSDs) the OSD number, and the daemon type is
-``osd``, ``mds, ``mgr``, or ``mon``.
+``osd``, ``mds``, ``mgr``, or ``mon``.
 
 For example, a simple hook that specifies a rack location via a value in the
-file ``/etc/rack`` might be as follows::
+file ``/etc/rack`` (assuming it contains no spaces) might be as follows::
 
     #!/bin/sh
-    echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
+    echo "root=default rack=$(cat /etc/rack) host=$(hostname -s)"
 
 
 CRUSH structure
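The one-line hook above assumes that ``/etc/rack`` always exists. A slightly more defensive sketch (the fallback value ``unknown-rack`` is an arbitrary placeholder, not something Ceph defines) could read::

    #!/bin/sh
    # Emit a CRUSH location on stdout, as expected of a crush_location_hook.
    # Fall back to a placeholder rack if /etc/rack is missing or unreadable.
    if [ -r /etc/rack ]; then
        rack="$(cat /etc/rack)"
    else
        rack="unknown-rack"
    fi
    echo "root=default rack=${rack} host=$(hostname -s)"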
diff --git a/doc/rados/operations/erasure-code-profile.rst b/doc/rados/operations/erasure-code-profile.rst
index 947b34c1f..a8f006398 100644
--- a/doc/rados/operations/erasure-code-profile.rst
+++ b/doc/rados/operations/erasure-code-profile.rst
@@ -96,7 +96,9 @@ Where:
 ``--force``
 
 :Description: Override an existing profile by the same name, and allow
-              setting a non-4K-aligned stripe_unit.
+              setting a non-4K-aligned stripe_unit. Overriding an existing
+              profile can be dangerous, and thus ``--yes-i-really-mean-it``
+              must be used as well.
 
 :Type: String
 :Required: No.
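For context, a ``k=4, m=2`` profile like the one used in the overhead example below can be created and attached to a new pool roughly as follows; the profile name ``ec42``, the pool name ``ecpool``, and ``crush-failure-domain=host`` are illustrative choices (a host failure domain assumes at least six hosts):

.. prompt:: bash $

   ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
   ceph osd pool create ecpool erasure ec42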
diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst
index e2bd3c296..e53f348cd 100644
--- a/doc/rados/operations/erasure-code.rst
+++ b/doc/rados/operations/erasure-code.rst
@@ -179,6 +179,8 @@ This can be enabled only on a pool residing on BlueStore OSDs, since
 BlueStore's checksumming is used during deep scrubs to detect bitrot
 or other corruption. Using Filestore with EC overwrites is not only
 unsafe, but it also results in lower performance compared to BlueStore.
+Moreover, Filestore is deprecated and any Filestore OSDs in your cluster
+should be migrated to BlueStore.
 
 Erasure-coded pools do not support omap, so to use them with RBD and
 CephFS you must instruct them to store their data in an EC pool and
@@ -192,6 +194,182 @@ erasure-coded pool as the ``--data-pool`` during image creation:
 
 For CephFS, an erasure-coded pool can be set as the default data pool during
 file system creation or via `file layouts <../../../cephfs/file-layouts>`_.
 
+Erasure-coded pool overhead
+---------------------------
+
+The overhead factor (space amplification) of an erasure-coded pool
+is `(k+m) / k`. For a 4,2 profile, the overhead is
+thus 1.5, which means that 1.5 GiB of underlying storage are used to store
+1 GiB of user data. Contrast with default three-way replication, with
+which the overhead factor is 3.0. Do not mistake erasure coding for a free
+lunch: there is a significant performance tradeoff, especially when using HDDs
+and when performing cluster recovery or backfill.
+
+Below is a table showing the overhead factors for various values of `k` and `m`.
+As `m` increases above 2, the incremental capacity overhead gain quickly
+experiences diminishing returns but the performance impact grows proportionally.
+We recommend that you do not choose a profile with `k` > 4 or `m` > 2 until
+and unless you fully understand the ramifications, including the number of
+failure domains your cluster topology must contain. If you choose `m=1`,
+expect data unavailability during maintenance and data loss if component
+failures overlap.
+
+.. list-table:: Erasure coding overhead
+   :widths: 4 4 4 4 4 4 4 4 4 4 4 4
+   :header-rows: 1
+   :stub-columns: 1
+
+   * -
+     - m=1
+     - m=2
+     - m=3
+     - m=4
+     - m=5
+     - m=6
+     - m=7
+     - m=8
+     - m=9
+     - m=10
+     - m=11
+   * - k=1
+     - 2.00
+     - 3.00
+     - 4.00
+     - 5.00
+     - 6.00
+     - 7.00
+     - 8.00
+     - 9.00
+     - 10.00
+     - 11.00
+     - 12.00
+   * - k=2
+     - 1.50
+     - 2.00
+     - 2.50
+     - 3.00
+     - 3.50
+     - 4.00
+     - 4.50
+     - 5.00
+     - 5.50
+     - 6.00
+     - 6.50
+   * - k=3
+     - 1.33
+     - 1.67
+     - 2.00
+     - 2.33
+     - 2.67
+     - 3.00
+     - 3.33
+     - 3.67
+     - 4.00
+     - 4.33
+     - 4.67
+   * - k=4
+     - 1.25
+     - 1.50
+     - 1.75
+     - 2.00
+     - 2.25
+     - 2.50
+     - 2.75
+     - 3.00
+     - 3.25
+     - 3.50
+     - 3.75
+   * - k=5
+     - 1.20
+     - 1.40
+     - 1.60
+     - 1.80
+     - 2.00
+     - 2.20
+     - 2.40
+     - 2.60
+     - 2.80
+     - 3.00
+     - 3.20
+   * - k=6
+     - 1.16
+     - 1.33
+     - 1.50
+     - 1.66
+     - 1.83
+     - 2.00
+     - 2.17
+     - 2.33
+     - 2.50
+     - 2.66
+     - 2.83
+   * - k=7
+     - 1.14
+     - 1.29
+     - 1.43
+     - 1.58
+     - 1.71
+     - 1.86
+     - 2.00
+     - 2.14
+     - 2.29
+     - 2.43
+     - 2.58
+   * - k=8
+     - 1.13
+     - 1.25
+     - 1.38
+     - 1.50
+     - 1.63
+     - 1.75
+     - 1.88
+     - 2.00
+     - 2.13
+     - 2.25
+     - 2.38
+   * - k=9
+     - 1.11
+     - 1.22
+     - 1.33
+     - 1.44
+     - 1.56
+     - 1.67
+     - 1.78
+     - 1.88
+     - 2.00
+     - 2.11
+     - 2.22
+   * - k=10
+     - 1.10
+     - 1.20
+     - 1.30
+     - 1.40
+     - 1.50
+     - 1.60
+     - 1.70
+     - 1.80
+     - 1.90
+     - 2.00
+     - 2.10
+   * - k=11
+     - 1.09
+     - 1.18
+     - 1.27
+     - 1.36
+     - 1.45
+     - 1.54
+     - 1.63
+     - 1.72
+     - 1.82
+     - 1.91
+     - 2.00
+
+
+
+
+
+
 
 Erasure-coded pools and cache tiering
 -------------------------------------
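The overhead figures in the table above follow directly from ``(k+m)/k``; a one-line sketch that reproduces the ``k=8``, ``m=3`` cell (1.38) and also prints the usable fraction of raw capacity:

.. prompt:: bash $

   awk 'BEGIN { k=8; m=3; printf "overhead=%.2f usable=%.1f%%\n", (k+m)/k, 100*k/(k+m) }'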
diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
index 15525c1d3..91301382d 100644
--- a/doc/rados/operations/index.rst
+++ b/doc/rados/operations/index.rst
@@ -21,6 +21,7 @@ and, monitoring an operating cluster.
     monitoring-osd-pg
     user-management
     pg-repair
+    pgcalc/index
 
 .. raw:: html
 
diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst
index a9171f2d8..d3a258f76 100644
--- a/doc/rados/operations/monitoring.rst
+++ b/doc/rados/operations/monitoring.rst
@@ -517,6 +517,8 @@ multiple monitors are running to ensure proper functioning of your Ceph
 cluster. Check monitor status regularly in order to ensure that all of the
 monitors are running.
 
+.. _display-mon-map:
+
 To display the monitor map, run the following command:
 
 .. prompt:: bash $
 
diff --git a/doc/rados/operations/pgcalc/index.rst b/doc/rados/operations/pgcalc/index.rst
new file mode 100644
index 000000000..1aed87391
--- /dev/null
+++ b/doc/rados/operations/pgcalc/index.rst
@@ -0,0 +1,68 @@
+.. _pgcalc:
+
+
+=======
+PG Calc
+=======
+
+
+.. raw:: html
+
+
+   <link rel="stylesheet" id="wp-job-manager-job-listings-css" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/wp-content/plugins/wp-job-manager/assets/dist/css/job-listings.css" type="text/css" media="all"/>
+   <link rel="stylesheet" id="ceph/googlefont-css" href="https://web.archive.org/web/20230614135557cs_/https://fonts.googleapis.com/css?family=Raleway%3A300%2C400%2C700&ver=5.7.2" type="text/css" media="all"/>
+   <link rel="stylesheet" id="Stylesheet-css" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/wp-content/themes/cephTheme/Resources/Styles/style.min.css" type="text/css" media="all"/>
+   <link rel="stylesheet" id="tablepress-default-css" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/wp-content/plugins/tablepress/css/default.min.css" type="text/css" media="all"/>
+   <link rel="stylesheet" id="jetpack_css-css" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/wp-content/plugins/jetpack/css/jetpack.css" type="text/css" media="all"/>
+   <script type="text/javascript" src="https://web.archive.org/web/20230614135557js_/https://old.ceph.com/wp-content/themes/cephTheme/foundation_framework/js/vendor/jquery.js" id="jquery-js"></script>
+
+   <link rel="stylesheet" href="https://web.archive.org/web/20230614135557cs_/https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.2/themes/smoothness/jquery-ui.css"/>
+   <link rel="stylesheet" href="https://web.archive.org/web/20230614135557cs_/https://old.ceph.com/pgcalc_assets/pgcalc.css"/>
+   <script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.2/jquery-ui.min.js"></script>
+
+   <script src="../../../_static/js/pgcalc.js"></script>
+   <div id="pgcalcdiv">
+   <div id="instructions">
+   <h2>Ceph PGs per Pool Calculator</h2><br/><fieldset><legend>Instructions</legend>
+   <ol>
+   <li>Confirm your understanding of the fields by reading through the Key below.</li>
+   <li>Select a <b>"Ceph Use Case"</b> from the drop down menu.</li>
+   <li>Adjust the values in the <span class="inputColor addBorder" style="font-weight: bold;">"Green"</span> shaded fields below.<br/>
+   <b>Tip:</b> Headers can be clicked to change the value throughout the table.</li>
+   <li>You will see the Suggested PG Count update based on your inputs.</li>
+   <li>Click the <b>"Add Pool"</b> button to create a new line for a new pool.</li>
+   <li>Click the <span class="ui-icon ui-icon-trash" style="display:inline-block;"></span> icon to delete the specific Pool.</li>
+   <li>For more details on the logic used and some important details, see the area below the table.</li>
+   <li>Once all values have been adjusted, click the <b>"Generate Commands"</b> button to get the pool creation commands.</li>
+   </ol></fieldset>
+   </div>
+   <div id="beforeTable"></div>
+   <br/>
+   <p class="validateTips"> </p>
+   <label for="presetType">Ceph Use Case Selector:</label><br/><select id="presetType"></select><button style="margin-left: 200px;" id="btnAddPool" type="button">Add Pool</button><button type="button" id="btnGenCommands" download="commands.txt">Generate Commands</button>
+   <div id="pgsPerPoolTable">
+   <table id="pgsperpool">
+   </table>
+   </div> <!-- id = pgsPerPoolTable -->
+   <br/>
+   <div id="afterTable"></div>
+   <div id="countLogic"><fieldset><legend>Logic behind Suggested PG Count</legend>
+   <br/>
+   <div class="upperFormula">( Target PGs per OSD ) x ( OSD # ) x ( %Data )</div>
+   <div class="lowerFormula">( Size )</div>
+   <ol id="countLogicList">
+   <li>If the value of the above calculation is less than the value of <b>( OSD# ) / ( Size )</b>, then the value is updated to the value of <b>( OSD# ) / ( Size )</b>. This is to ensure even load / data distribution by allocating at least one Primary or Secondary PG to every OSD for every Pool.</li>
+   <li>The output value is then rounded to the <b>nearest power of 2</b>.<br/><b>Tip:</b> The nearest power of 2 provides a marginal improvement in efficiency of the <a href="https://web.archive.org/web/20230614135557/http://ceph.com/docs/master/rados/operations/crush-map/" title="CRUSH Map Details">CRUSH</a> algorithm.</li>
+   <li>If the nearest power of 2 is more than <b>25%</b> below the original value, the next higher power of 2 is used.</li>
+   </ol>
+   <b>Objective</b>
+   <ul><li>The objective of this calculation and the target ranges noted in the "Key" section above are to ensure that there are sufficient Placement Groups for even data distribution throughout the cluster, while not going high enough on the PG per OSD ratio to cause problems during Recovery and/or Backfill operations.</li></ul>
+   <b>Effects of empty or non-active pools:</b>
+   <ul>
+   <li>Empty or otherwise non-active pools should not be considered helpful toward even data distribution throughout the cluster.</li>
+   <li>However, the PGs associated with these empty / non-active pools still consume memory and CPU overhead.</li>
+   </ul>
+   </fieldset>
+   </div>
+   <div id="commands" title="Pool Creation Commands"><code><pre id="commandCode"></pre></code></div>
+   </div>
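The "Logic behind Suggested PG Count" described above can be reproduced outside the calculator. The sketch below uses arbitrary example inputs (100 target PGs per OSD, 10 OSDs, 100% of data, replica size 3) and approximates the nearest power of two by rounding in log space::

    awk 'BEGIN {
        target = 100; osds = 10; pct = 1.0; size = 3   # example inputs only
        pgs = target * osds * pct / size               # raw suggestion
        if (pgs < osds / size) pgs = osds / size       # at least one PG per OSD
        p = 2 ^ int(log(pgs) / log(2) + 0.5)           # nearest power of two (approx.)
        if (p < 0.75 * pgs) p *= 2                     # if more than 25% below, step up
        print p                                        # prints 256 for these inputs
    }'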
diff --git a/doc/rados/operations/placement-groups.rst b/doc/rados/operations/placement-groups.rst
index dda4a0177..93ab1f0c0 100644
--- a/doc/rados/operations/placement-groups.rst
+++ b/doc/rados/operations/placement-groups.rst
@@ -4,6 +4,21 @@
 Placement Groups
 ==================
 
+Placement groups (PGs) are subsets of each logical Ceph pool. Placement groups
+perform the function of placing objects (as a group) into OSDs. Ceph manages
+data internally at placement-group granularity: this scales better than would
+managing individual RADOS objects. A cluster that has a larger number of
+placement groups (for example, 150 per OSD) is better balanced than an
+otherwise identical cluster with a smaller number of placement groups.
+
+Ceph’s internal RADOS objects are each mapped to a specific placement group,
+and each placement group belongs to exactly one Ceph pool.
+
+See Sage Weil's blog post `New in Nautilus: PG merging and autotuning
+<https://ceph.io/en/news/blog/2019/new-in-nautilus-pg-merging-and-autotuning/>`_
+for more information about the relationship of placement groups to pools and to
+objects.
+
 .. _pg-autoscaler:
 
 Autoscaling placement groups
@@ -131,11 +146,11 @@ The output will resemble the following::
   if a ``pg_num`` change is in progress, the current number of PGs that the
   pool is working towards.
 
-- **NEW PG_NUM** (if present) is the value that the system is recommending the
-  ``pg_num`` of the pool to be changed to. It is always a power of 2, and it is
-  present only if the recommended value varies from the current value by more
-  than the default factor of ``3``. To adjust this factor (in the following
-  example, it is changed to ``2``), run the following command:
+- **NEW PG_NUM** (if present) is the value that the system recommends that the
+  ``pg_num`` of the pool should be. It is always a power of two, and it
+  is present only if the recommended value varies from the current value by
+  more than the default factor of ``3``. To adjust this multiple (in the
+  following example, it is changed to ``2``), run the following command:
 
   .. prompt:: bash #
 
@@ -168,7 +183,6 @@ The output will resemble the following::
 
    .. prompt:: bash #
 
      ceph osd pool set .mgr crush_rule replicated-ssd
-     ceph osd pool set pool 1 crush_rule to replicated-ssd
 
 This intervention will result in a small amount of backfill, but typically
 this traffic completes quickly.
@@ -626,15 +640,14 @@ pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs
 each. This cluster will require significantly more resources and significantly
 more time for peering.
 
-For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_
-tool.
-
 .. _setting the number of placement groups:
 
 Setting the Number of PGs
 =========================
 
+:ref:`Placement Group Link <pgcalc>`
+
 Setting the initial number of PGs in a pool must be done at the time you create
 the pool. See `Create a Pool`_ for details.
 
@@ -894,4 +907,3 @@ about it entirely (if it is too new to have a previous version). To mark the
 
 .. _Create a Pool: ../pools#createpool
 .. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
-.. _pgcalc: https://old.ceph.com/pgcalc/
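When a pool is not managed by the autoscaler, the PG count arrived at with the guidance above can be checked and applied manually; the pool name ``mypool`` and the value ``128`` are placeholders:

.. prompt:: bash #

   ceph osd pool autoscale-status
   ceph osd pool set mypool pg_num 128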
However, this setting's default value is NOT appropriate -for most systems. In most cases, you should override this default value when -creating your pool. For details on PG numbers, see `setting the number of -placement groups`_ - -For example: - -.. prompt:: bash $ - - osd_pool_default_pg_num = 128 - osd_pool_default_pgp_num = 128 +Before creating a pool, consult `Pool, PG and CRUSH Config Reference`_. The +Ceph central configuration database in the monitor cluster contains a setting +(namely, ``pg_num``) that determines the number of PGs per pool when a pool has +been created and no per-pool value has been specified. It is possible to change +this value from its default. For more on the subject of setting the number of +PGs per pool, see `setting the number of placement groups`_. .. note:: In Luminous and later releases, each pool must be associated with the application that will be using the pool. For more information, see @@ -742,8 +737,6 @@ Managing pools that are flagged with ``--bulk`` =============================================== See :ref:`managing_bulk_flagged_pools`. - -.. _pgcalc: https://old.ceph.com/pgcalc/ .. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref .. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter .. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst index f797b5b91..787e8cb4d 100644 --- a/doc/rados/operations/stretch-mode.rst +++ b/doc/rados/operations/stretch-mode.rst @@ -121,8 +121,6 @@ your CRUSH map. This procedure shows how to do this. rule stretch_rule { id 1 - min_size 1 - max_size 10 type replicated step take site1 step chooseleaf firstn 2 type host @@ -141,11 +139,15 @@ your CRUSH map. This procedure shows how to do this. #. Run the monitors in connectivity mode. See `Changing Monitor Elections`_. + .. prompt:: bash $ + + ceph mon set election_strategy connectivity + #. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the tiebreaker monitor and we are splitting across data centers. The tiebreaker monitor must be assigned a data center that is neither ``site1`` nor - ``site2``. For this purpose you can create another data-center bucket named - ``site3`` in your CRUSH and place ``mon.e`` there: + ``site2``. This data center **should not** be defined in your CRUSH map, here + we are placing ``mon.e`` in a virtual data center called ``site3``: .. prompt:: bash $ |