From e6918187568dbd01842d8d1d2c808ce16a894239 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 21 Apr 2024 13:54:28 +0200 Subject: Adding upstream version 18.2.2. Signed-off-by: Daniel Baumann --- doc/rados/operations/add-or-rm-mons.rst | 458 +++++++ doc/rados/operations/add-or-rm-osds.rst | 419 ++++++ doc/rados/operations/balancer.rst | 221 ++++ doc/rados/operations/bluestore-migration.rst | 357 ++++++ doc/rados/operations/cache-tiering.rst | 557 ++++++++ doc/rados/operations/change-mon-elections.rst | 100 ++ doc/rados/operations/control.rst | 665 ++++++++++ doc/rados/operations/crush-map-edits.rst | 746 +++++++++++ doc/rados/operations/crush-map.rst | 1147 +++++++++++++++++ doc/rados/operations/data-placement.rst | 47 + doc/rados/operations/devices.rst | 227 ++++ doc/rados/operations/erasure-code-clay.rst | 240 ++++ doc/rados/operations/erasure-code-isa.rst | 107 ++ doc/rados/operations/erasure-code-jerasure.rst | 123 ++ doc/rados/operations/erasure-code-lrc.rst | 388 ++++++ doc/rados/operations/erasure-code-profile.rst | 128 ++ doc/rados/operations/erasure-code-shec.rst | 145 +++ doc/rados/operations/erasure-code.rst | 272 ++++ doc/rados/operations/health-checks.rst | 1619 ++++++++++++++++++++++++ doc/rados/operations/index.rst | 99 ++ doc/rados/operations/monitoring-osd-pg.rst | 556 ++++++++ doc/rados/operations/monitoring.rst | 644 ++++++++++ doc/rados/operations/operating.rst | 174 +++ doc/rados/operations/pg-concepts.rst | 104 ++ doc/rados/operations/pg-repair.rst | 118 ++ doc/rados/operations/pg-states.rst | 118 ++ doc/rados/operations/placement-groups.rst | 897 +++++++++++++ doc/rados/operations/pools.rst | 751 +++++++++++ doc/rados/operations/read-balancer.rst | 64 + doc/rados/operations/stretch-mode.rst | 262 ++++ doc/rados/operations/upmap.rst | 113 ++ doc/rados/operations/user-management.rst | 840 ++++++++++++ 32 files changed, 12706 insertions(+) create mode 100644 doc/rados/operations/add-or-rm-mons.rst create mode 100644 doc/rados/operations/add-or-rm-osds.rst create mode 100644 doc/rados/operations/balancer.rst create mode 100644 doc/rados/operations/bluestore-migration.rst create mode 100644 doc/rados/operations/cache-tiering.rst create mode 100644 doc/rados/operations/change-mon-elections.rst create mode 100644 doc/rados/operations/control.rst create mode 100644 doc/rados/operations/crush-map-edits.rst create mode 100644 doc/rados/operations/crush-map.rst create mode 100644 doc/rados/operations/data-placement.rst create mode 100644 doc/rados/operations/devices.rst create mode 100644 doc/rados/operations/erasure-code-clay.rst create mode 100644 doc/rados/operations/erasure-code-isa.rst create mode 100644 doc/rados/operations/erasure-code-jerasure.rst create mode 100644 doc/rados/operations/erasure-code-lrc.rst create mode 100644 doc/rados/operations/erasure-code-profile.rst create mode 100644 doc/rados/operations/erasure-code-shec.rst create mode 100644 doc/rados/operations/erasure-code.rst create mode 100644 doc/rados/operations/health-checks.rst create mode 100644 doc/rados/operations/index.rst create mode 100644 doc/rados/operations/monitoring-osd-pg.rst create mode 100644 doc/rados/operations/monitoring.rst create mode 100644 doc/rados/operations/operating.rst create mode 100644 doc/rados/operations/pg-concepts.rst create mode 100644 doc/rados/operations/pg-repair.rst create mode 100644 doc/rados/operations/pg-states.rst create mode 100644 doc/rados/operations/placement-groups.rst create mode 100644 doc/rados/operations/pools.rst create mode 100644 
doc/rados/operations/read-balancer.rst create mode 100644 doc/rados/operations/stretch-mode.rst create mode 100644 doc/rados/operations/upmap.rst create mode 100644 doc/rados/operations/user-management.rst (limited to 'doc/rados/operations') diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst new file mode 100644 index 000000000..3688bb798 --- /dev/null +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -0,0 +1,458 @@ +.. _adding-and-removing-monitors: + +========================== + Adding/Removing Monitors +========================== + +It is possible to add monitors to a running cluster as long as redundancy is +maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor +Bootstrap`_. + +.. _adding-monitors: + +Adding Monitors +=============== + +Ceph monitors serve as the single source of truth for the cluster map. It is +possible to run a cluster with only one monitor, but for a production cluster +it is recommended to have at least three monitors provisioned and in quorum. +Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus +about maps and about other critical information across the cluster. Due to the +nature of Paxos, Ceph is able to maintain quorum (and thus establish +consensus) only if a majority of the monitors are ``active``. + +It is best to run an odd number of monitors. This is because a cluster that is +running an odd number of monitors is more resilient than a cluster running an +even number. For example, in a two-monitor deployment, no failures can be +tolerated if quorum is to be maintained; in a three-monitor deployment, one +failure can be tolerated; in a four-monitor deployment, one failure can be +tolerated; and in a five-monitor deployment, two failures can be tolerated. In +general, a cluster running an odd number of monitors is best because it avoids +what is called the *split brain* phenomenon. In short, Ceph is able to operate +only if a majority of monitors are ``active`` and able to communicate with each +other, (for example: there must be a single monitor, two out of two monitors, +two out of three monitors, three out of five monitors, or the like). + +For small or non-critical deployments of multi-node Ceph clusters, it is +recommended to deploy three monitors. For larger clusters or for clusters that +are intended to survive a double failure, it is recommended to deploy five +monitors. Only in rare circumstances is there any justification for deploying +seven or more monitors. + +It is possible to run a monitor on the same host that is running an OSD. +However, this approach has disadvantages: for example: `fsync` issues with the +kernel might weaken performance, monitor and OSD daemons might be inactive at +the same time and cause disruption if the node crashes, is rebooted, or is +taken down for maintenance. Because of these risks, it is instead +recommended to run monitors and managers on dedicated hosts. + +.. note:: A *majority* of monitors in your cluster must be able to + reach each other in order for quorum to be established. + +Deploying your Hardware +----------------------- + +Some operators choose to add a new monitor host at the same time that they add +a new monitor. For details on the minimum recommendations for monitor hardware, +see `Hardware Recommendations`_. Before adding a monitor host to the cluster, +make sure that there is an up-to-date version of Linux installed. 
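+ +Before deploying the new host, it can also be helpful to record the current monitor roster so that, once the new monitor has been added, you can confirm that it joined the quorum; one quick check (assuming a working admin keyring) might look like this: + + .. prompt:: bash $ + + ceph mon stat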
+ +Add the newly installed monitor host to a rack in your cluster, connect the host to the network, and make sure that the host has network connectivity. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations + +Installing the Required Software +-------------------------------- + +In manually deployed clusters, it is necessary to install Ceph packages +manually. For details, see `Installing Packages`_. Configure SSH so that it can +be used by a user that has passwordless authentication and root permissions. + +.. _Installing Packages: ../../../install/install-storage-cluster + + +.. _Adding a Monitor (Manual): + +Adding a Monitor (Manual) +------------------------- + +The procedure in this section creates a ``ceph-mon`` data directory, retrieves +both the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to +the cluster. The procedure might result in a Ceph cluster that contains only +two monitor daemons. To add more monitors until there are enough ``ceph-mon`` +daemons to establish quorum, repeat the procedure. + +This is a good point at which to define the new monitor's ``id``. Monitors have +often been named with single letters (``a``, ``b``, ``c``, etc.), but you are +free to define the ``id`` however you see fit. In this document, ``{mon-id}`` +refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if +``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is +``a``. + +#. Create a data directory on the machine that will host the new monitor: + + .. prompt:: bash $ + + ssh {new-mon-host} + sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} + +#. Create a temporary directory ``{tmp}`` that will contain the files needed + during this procedure. This directory should be different from the data + directory created in the previous step. Because this is a temporary + directory, it can be removed after the procedure is complete: + + .. prompt:: bash $ + + mkdir {tmp} + +#. Retrieve the keyring for your monitors (``{tmp}`` is the path to the + retrieved keyring and ``{key-filename}`` is the name of the file that + contains the retrieved monitor key): + + .. prompt:: bash $ + + ceph auth get mon. -o {tmp}/{key-filename} + +#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map + and ``{map-filename}`` is the name of the file that contains the retrieved + monitor map): + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{map-filename} + +#. Prepare the monitor's data directory, which was created in the first step. + The following command must specify the path to the monitor map (so that + information about a quorum of monitors and their ``fsid``\s can be + retrieved) and specify the path to the monitor keyring: + + .. prompt:: bash $ + + sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} + +#. Start the new monitor. It will automatically join the cluster. To provide + information to the daemon about which address to bind to, use either the + ``--public-addr {ip}`` option or the ``--public-network {network}`` option. + For example: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --public-addr {ip:port} + +.. _removing-monitors: + +Removing Monitors +================= + +When monitors are removed from a cluster, it is important to remember +that Ceph monitors use Paxos to maintain consensus about the cluster +map. Such consensus is possible only if the number of monitors is sufficient +to establish quorum.
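+ +Because removing a monitor can affect quorum, it is worth confirming which monitors are currently in quorum before you begin; for example (again assuming a working admin keyring): + + .. prompt:: bash $ + + ceph quorum_status --format json-pretty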
+ +.. _Removing a Monitor (Manual): + +Removing a Monitor (Manual) +--------------------------- + +The procedure in this section removes a ``ceph-mon`` daemon from the cluster. +The procedure might result in a Ceph cluster that contains a number of monitors +insufficient to maintain quorum, so plan carefully. When replacing an old +monitor with a new monitor, add the new monitor first, wait for quorum to be +established, and then remove the old monitor. This ensures that quorum is not +lost. + + +#. Stop the monitor: + + .. prompt:: bash $ + + service ceph -a stop mon.{mon-id} + +#. Remove the monitor from the cluster: + + .. prompt:: bash $ + + ceph mon remove {mon-id} + +#. Remove the monitor entry from the ``ceph.conf`` file. + +.. _rados-mon-remove-from-unhealthy: + + +Removing Monitors from an Unhealthy Cluster +------------------------------------------- + +The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy +cluster (for example, a cluster whose monitors are unable to form a quorum). + +#. Stop all ``ceph-mon`` daemons on all monitor hosts: + + .. prompt:: bash $ + + ssh {mon-host} + systemctl stop ceph-mon.target + + Repeat this step on every monitor host. + +#. Identify a surviving monitor and log in to the monitor's host: + + .. prompt:: bash $ + + ssh {mon-host} + +#. Extract a copy of the ``monmap`` file by running a command of the following + form: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --extract-monmap {map-path} + + Here is a more concrete example. In this example, ``hostname`` is the + ``{mon-id}`` and ``/tmp/monmap`` is the ``{map-path}``: + + .. prompt:: bash $ + + ceph-mon -i `hostname` --extract-monmap /tmp/monmap + +#. Remove the non-surviving or otherwise problematic monitors: + + .. prompt:: bash $ + + monmaptool {map-path} --rm {mon-id} + + For example, suppose that there are three monitors |---| ``mon.a``, ``mon.b``, + and ``mon.c`` |---| and that only ``mon.a`` will survive: + + .. prompt:: bash $ + + monmaptool /tmp/monmap --rm b + monmaptool /tmp/monmap --rm c + +#. Inject the surviving map, from which the problematic monitors have been + removed, into the surviving monitor(s): + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {map-path} + + Continuing with the above example, inject a map into monitor ``mon.a`` by + running the following command: + + .. prompt:: bash $ + + ceph-mon -i a --inject-monmap /tmp/monmap + + +#. Start only the surviving monitors. + +#. Verify that the monitors form a quorum by running the command ``ceph -s``. + +#. The data directory of the removed monitors is in ``/var/lib/ceph/mon``: + either archive this data directory in a safe location or delete this data + directory. However, do not delete it unless you are confident that the + remaining monitors are healthy and sufficiently redundant. Make sure that + there is enough room for the live DB to expand and compact, and make sure + that there is also room for an archived copy of the DB. The archived copy + can be compressed. + +.. _Changing a Monitor's IP address: + +Changing a Monitor's IP Address +=============================== + +.. important:: Existing monitors are not supposed to change their IP addresses. + +Monitors are critical components of a Ceph cluster. The entire system can work +properly only if the monitors maintain quorum, and quorum can be established +only if the monitors have discovered each other by means of their IP addresses. +Ceph has strict requirements on the discovery of monitors.
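+ +To see the monitor addresses that the cluster currently advertises, you can print the monitor map at any time; for example: + + .. prompt:: bash $ + + ceph mon dump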
+ +Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons +to discover monitors, the monitor map is used by monitors to discover each +other. This is why it is necessary to obtain the current ``monmap`` at the time +a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_, +the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id} +--mkfs`` command. The following sections explain the consistency requirements +for Ceph monitors, and also explain a number of safe ways to change a monitor's +IP address. + + +Consistency Requirements +------------------------ + +When a monitor discovers other monitors in the cluster, it always refers to the +local copy of the monitor map. Using the monitor map instead of using the +``ceph.conf`` file avoids errors that could break the cluster (for example, +typos or other slight errors in ``ceph.conf`` when a monitor address or port is +specified). Because monitors use monitor maps for discovery and because they +share monitor maps with Ceph clients and other Ceph daemons, the monitor map +provides monitors with a strict guarantee that their consensus is valid. + +Strict consistency also applies to updates to the monmap. As with any other +updates on the monitor, changes to the monmap always run through a distributed +consensus algorithm called `Paxos`_. The monitors must agree on each update to +the monmap, such as adding or removing a monitor, to ensure that each monitor +in the quorum has the same version of the monmap. Updates to the monmap are +incremental so that monitors have the latest agreed upon version, and a set of +previous versions, allowing a monitor that has an older version of the monmap +to catch up with the current state of the cluster. + +There are additional advantages to using the monitor map rather than +``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not +automatically updated and distributed, its use would bring certain risks: +monitors might use an outdated ``ceph.conf`` file, might fail to recognize a +specific monitor, might fall out of quorum, and might develop a situation in +which `Paxos`_ is unable to accurately ascertain the current state of the +system. Because of these risks, any changes to an existing monitor's IP address +must be made with great care. + +.. _operations_add_or_rm_mons_changing_mon_ip: + +Changing a Monitor's IP address (Preferred Method) +-------------------------------------------------- + +If a monitor's IP address is changed only in the ``ceph.conf`` file, there is +no guarantee that the other monitors in the cluster will receive the update. +For this reason, the preferred method to change a monitor's IP address is as +follows: add a new monitor with the desired IP address (as described in `Adding +a Monitor (Manual)`_), make sure that the new monitor successfully joins the +quorum, remove the monitor that is using the old IP address, and update the +``ceph.conf`` file to ensure that clients and other daemons are made aware of +the new monitor's IP address. 
+ +For example, suppose that there are three monitors in place:: + + [mon.a] + host = host01 + addr = 10.0.0.1:6789 + [mon.b] + host = host02 + addr = 10.0.0.2:6789 + [mon.c] + host = host03 + addr = 10.0.0.3:6789 + +To change ``mon.c`` so that its name is ``host04`` and its IP address is +``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new +monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing +``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing +a Monitor (Manual)`_ to remove ``mon.c``. To move all three monitors to new IP +addresses, repeat this process. + +Changing a Monitor's IP address (Advanced Method) +------------------------------------------------- + +There are cases in which the method outlined in :ref:`operations_add_or_rm_mons_changing_mon_ip` cannot +be used. For example, it might be necessary to move the cluster's monitors to a +different network, to a different part of the datacenter, or to a different +datacenter altogether. It is still possible to change the monitors' IP +addresses, but a different method must be used. + +For such cases, a new monitor map with updated IP addresses for every monitor +in the cluster must be generated and injected on each monitor. Although this +method is not particularly easy, such a major migration is unlikely to be a +routine task. As stated at the beginning of this section, existing monitors are +not supposed to change their IP addresses. + +Continue with the monitor configuration in the example from +:ref:`operations_add_or_rm_mons_changing_mon_ip`. Suppose that all of the monitors +are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that +these networks are unable to communicate. Carry out the following procedure: + +#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor + map, and ``{filename}`` is the name of the file that contains the retrieved + monitor map): + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{filename} + +#. Check the contents of the monitor map: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.0.0.1:6789/0 mon.a + 1: 10.0.0.2:6789/0 mon.b + 2: 10.0.0.3:6789/0 mon.c + +#. Remove the existing monitors from the monitor map: + + .. prompt:: bash $ + + monmaptool --rm a --rm b --rm c {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: removing a + monmaptool: removing b + monmaptool: removing c + monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) + +#. Add the new monitor locations to the monitor map: + + .. prompt:: bash $ + + monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) + +#. Check the new contents of the monitor map: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.1.0.1:6789/0 mon.a + 1: 10.1.0.2:6789/0 mon.b + 2: 10.1.0.3:6789/0 mon.c + +At this point, we assume that the monitors (and stores) have been installed at +the new location.
Next, propagate the modified monitor map to the new monitors, +and inject the modified monitor map into each new monitor. + +#. Make sure all of your monitors have been stopped. Never inject into a + monitor while the monitor daemon is running. + +#. Inject the monitor map: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} + +#. Restart all of the monitors. + +Migration to the new location is now complete. The monitors should operate +successfully. + + + +.. _Manual Deployment: ../../../install/manual-deployment +.. _Monitor Bootstrap: ../../../dev/mon-bootstrap +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) + +.. |---| unicode:: U+2014 .. EM DASH + :trim: diff --git a/doc/rados/operations/add-or-rm-osds.rst b/doc/rados/operations/add-or-rm-osds.rst new file mode 100644 index 000000000..1a6621148 --- /dev/null +++ b/doc/rados/operations/add-or-rm-osds.rst @@ -0,0 +1,419 @@ +====================== + Adding/Removing OSDs +====================== + +When a cluster is up and running, it is possible to add or remove OSDs. + +Adding OSDs +=========== + +OSDs can be added to a cluster in order to expand the cluster's capacity and +resilience. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on one +storage drive within a host machine. But if your host machine has multiple +storage drives, you may map one ``ceph-osd`` daemon for each drive on the +machine. + +It's a good idea to check the capacity of your cluster so that you know when it +approaches its capacity limits. If your cluster has reached its ``near full`` +ratio, then you should add OSDs to expand your cluster's capacity. + +.. warning:: Do not add an OSD after your cluster has reached its ``full + ratio``. OSD failures that occur after the cluster reaches its ``near full + ratio`` might cause the cluster to exceed its ``full ratio``. + + +Deploying your Hardware +----------------------- + +If you are also adding a new host when adding a new OSD, see `Hardware +Recommendations`_ for details on minimum recommendations for OSD hardware. To +add an OSD host to your cluster, begin by making sure that an appropriate +version of Linux has been installed on the host machine and that all initial +preparations for your storage drives have been carried out. For details, see +`Filesystem Recommendations`_. + +Next, add your OSD host to a rack in your cluster, connect the host to the +network, and ensure that the host has network connectivity. For details, see +`Network Configuration Reference`_. + + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations +.. _Network Configuration Reference: ../../configuration/network-config-ref + +Installing the Required Software +-------------------------------- + +If your cluster has been manually deployed, you will need to install Ceph +software packages manually. For details, see `Installing Ceph (Manual)`_. +Configure SSH for the appropriate user to have both passwordless authentication +and root permissions. + +.. _Installing Ceph (Manual): ../../../install + + +Adding an OSD (Manual) +---------------------- + +The following procedure sets up a ``ceph-osd`` daemon, configures this OSD to +use one drive, and configures the cluster to distribute data to the OSD. If +your host machine has multiple drives, you may add an OSD for each drive on the +host by repeating this procedure. 
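+ +Before you begin, it can be useful to record the current cluster health, utilization, and OSD count so that the addition can be verified afterwards (and so you can watch the ``near full`` ratio discussed above); a quick pre-check of this kind might be: + + .. prompt:: bash $ + + ceph -s + ceph df + ceph osd stat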
+ +As the following procedure will demonstrate, adding an OSD involves creating a +metadata directory for it, configuring a data storage drive, adding the OSD to +the cluster, and then adding it to the CRUSH map. + +When you add the OSD to the CRUSH map, you will need to consider the weight you +assign to the new OSD. Since storage drive capacities increase over time, newer +OSD hosts are likely to have larger hard drives than the older hosts in the +cluster have and therefore might have greater weight as well. + +.. tip:: Ceph works best with uniform hardware across pools. It is possible to + add drives of dissimilar size and then adjust their weights accordingly. + However, for best performance, consider a CRUSH hierarchy that has drives of + the same type and size. It is better to add larger drives uniformly to + existing hosts. This can be done incrementally, replacing smaller drives + each time the new drives are added. + +#. Create the new OSD by running a command of the following form. If you opt + not to specify a UUID in this command, the UUID will be set automatically + when the OSD starts up. The OSD number, which is needed for subsequent + steps, is found in the command's output: + + .. prompt:: bash $ + + ceph osd create [{uuid} [{id}]] + + If the optional parameter {id} is specified it will be used as the OSD ID. + However, if the ID number is already in use, the command will fail. + + .. warning:: Explicitly specifying the ``{id}`` parameter is not + recommended. IDs are allocated as an array, and any skipping of entries + consumes extra memory. This memory consumption can become significant if + there are large gaps or if clusters are large. By leaving the ``{id}`` + parameter unspecified, we ensure that Ceph uses the smallest ID number + available and that these problems are avoided. + +#. Create the default directory for your new OSD by running commands of the + following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +#. If the OSD will be created on a drive other than the OS drive, prepare it + for use with Ceph. Run commands of the following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{drive} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +#. Initialize the OSD data directory by running commands of the following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + ceph-osd -i {osd-num} --mkfs --mkkey + + Make sure that the directory is empty before running ``ceph-osd``. + +#. Register the OSD authentication key by running a command of the following + form: + + .. prompt:: bash $ + + ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring + + This presentation of the command has ``ceph-{osd-num}`` in the listed path + because many clusters have the name ``ceph``. However, if your cluster name + is not ``ceph``, then the string ``ceph`` in ``ceph-{osd-num}`` needs to be + replaced with your cluster name. For example, if your cluster name is + ``cluster1``, then the path in the command should be + ``/var/lib/ceph/osd/cluster1-{osd-num}/keyring``. + +#. Add the OSD to the CRUSH map by running the following command. This allows + the OSD to begin receiving data. The ``ceph osd crush add`` command can add + OSDs to the CRUSH hierarchy wherever you want. 
If you specify one or more + buckets, the command places the OSD in the most specific of those buckets, + and it moves that bucket underneath any other buckets that you have + specified. **Important:** If you specify only the root bucket, the command + will attach the OSD directly to the root, but CRUSH rules expect OSDs to be + inside of hosts. If the OSDs are not inside hosts, the OSDs will likely not + receive any data. + + .. prompt:: bash $ + + ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] + + Note that there is another way to add a new OSD to the CRUSH map: decompile + the CRUSH map, add the OSD to the device list, add the host as a bucket (if + it is not already in the CRUSH map), add the device as an item in the host, + assign the device a weight, recompile the CRUSH map, and set the CRUSH map. + For details, see `Add/Move an OSD`_. This is rarely necessary with recent + releases (this sentence was written the month that Reef was released). + + +.. _rados-replacing-an-osd: + +Replacing an OSD +---------------- + +.. note:: If the procedure in this section does not work for you, try the + instructions in the ``cephadm`` documentation: + :ref:`cephadm-replacing-an-osd`. + +Sometimes OSDs need to be replaced: for example, when a disk fails, or when an +administrator wants to reprovision OSDs with a new back end (perhaps when +switching from Filestore to BlueStore). Replacing an OSD differs from `Removing +the OSD`_ in that the replaced OSD's ID and CRUSH map entry must be kept intact +after the OSD is destroyed for replacement. + + +#. Make sure that it is safe to destroy the OSD: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done + +#. Destroy the OSD: + + .. prompt:: bash $ + + ceph osd destroy {id} --yes-i-really-mean-it + +#. *Optional*: If the disk that you plan to use is not a new disk and has been + used before for other purposes, zap the disk: + + .. prompt:: bash $ + + ceph-volume lvm zap /dev/sdX + +#. Prepare the disk for replacement by using the ID of the OSD that was + destroyed in previous steps: + + .. prompt:: bash $ + + ceph-volume lvm prepare --osd-id {id} --data /dev/sdX + +#. Finally, activate the OSD: + + .. prompt:: bash $ + + ceph-volume lvm activate {id} {fsid} + +Alternatively, instead of carrying out the final two steps (preparing the disk +and activating the OSD), you can re-create the OSD by running a single command +of the following form: + + .. prompt:: bash $ + + ceph-volume lvm create --osd-id {id} --data /dev/sdX + +Starting the OSD +---------------- + +After an OSD is added to Ceph, the OSD is in the cluster. However, until it is +started, the OSD is considered ``down`` and ``in``. The OSD is not running and +will be unable to receive data. To start an OSD, either run ``service ceph`` +from your admin host or run a command of the following form to start the OSD +from its host machine: + + .. prompt:: bash $ + + sudo systemctl start ceph-osd@{osd-num} + +After the OSD is started, it is considered ``up`` and ``in``. + +Observing the Data Migration +---------------------------- + +After the new OSD has been added to the CRUSH map, Ceph begins rebalancing the +cluster by migrating placement groups (PGs) to the new OSD. To observe this +process by using the `ceph`_ tool, run the following command: + + .. prompt:: bash $ + + ceph -w + +Or: + + ..
prompt:: bash $ + + watch ceph status + +The PG states will first change from ``active+clean`` to ``active, some +degraded objects`` and then return to ``active+clean`` when migration +completes. When you are finished observing, press Ctrl-C to exit. + +.. _Add/Move an OSD: ../crush-map#addosd +.. _ceph: ../monitoring + + +Removing OSDs (Manual) +====================== + +It is possible to remove an OSD manually while the cluster is running: you +might want to do this in order to reduce the size of the cluster or when +replacing hardware. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on +one storage drive within a host machine. Alternatively, if your host machine +has multiple storage drives, you might need to remove multiple ``ceph-osd`` +daemons: one daemon for each drive on the machine. + +.. warning:: Before you begin the process of removing an OSD, make sure that + your cluster is not near its ``full ratio``. Otherwise the act of removing + OSDs might cause the cluster to reach or exceed its ``full ratio``. + + +Taking the OSD ``out`` of the Cluster +------------------------------------- + +OSDs are typically ``up`` and ``in`` before they are removed from the cluster. +Before the OSD can be removed from the cluster, the OSD must be taken ``out`` +of the cluster so that Ceph can begin rebalancing and copying its data to other +OSDs. To take an OSD ``out`` of the cluster, run a command of the following +form: + + .. prompt:: bash $ + + ceph osd out {osd-num} + + +Observing the Data Migration +---------------------------- + +After the OSD has been taken ``out`` of the cluster, Ceph begins rebalancing +the cluster by migrating placement groups out of the OSD that was removed. To +observe this process by using the `ceph`_ tool, run the following command: + + .. prompt:: bash $ + + ceph -w + +The PG states will change from ``active+clean`` to ``active, some degraded +objects`` and will then return to ``active+clean`` when migration completes. +When you are finished observing, press Ctrl-C to exit. + +.. note:: Under certain conditions, the action of taking ``out`` an OSD + might lead CRUSH to encounter a corner case in which some PGs remain stuck + in the ``active+remapped`` state. This problem sometimes occurs in small + clusters with few hosts (for example, in a small testing cluster). To + address this problem, mark the OSD ``in`` by running a command of the + following form: + + .. prompt:: bash $ + + ceph osd in {osd-num} + + After the OSD has come back to its initial state, do not mark the OSD + ``out`` again. Instead, set the OSD's weight to ``0`` by running a command + of the following form: + + .. prompt:: bash $ + + ceph osd crush reweight osd.{osd-num} 0 + + After the OSD has been reweighted, observe the data migration and confirm + that it has completed successfully. The difference between marking an OSD + ``out`` and reweighting the OSD to ``0`` has to do with the bucket that + contains the OSD. When an OSD is marked ``out``, the weight of the bucket is + not changed. But when an OSD is reweighted to ``0``, the weight of the + bucket is updated (namely, the weight of the OSD is subtracted from the + overall weight of the bucket). When operating small clusters, it can + sometimes be preferable to use the above reweight command. + + +Stopping the OSD +---------------- + +After you take an OSD ``out`` of the cluster, the OSD might still be running. +In such a case, the OSD is ``up`` and ``out``. 
Before it is removed from the +cluster, the OSD must be stopped by running commands of the following form: + + .. prompt:: bash $ + + ssh {osd-host} + sudo systemctl stop ceph-osd@{osd-num} + +After the OSD has been stopped, it is ``down``. + + +Removing the OSD +---------------- + +The following procedure removes an OSD from the cluster map, removes the OSD's +authentication key, removes the OSD from the OSD map, and removes the OSD from +the ``ceph.conf`` file. If your host has multiple drives, it might be necessary +to remove an OSD from each drive by repeating this procedure. + +#. Begin by having the cluster forget the OSD. This step removes the OSD from + the CRUSH map, removes the OSD's authentication key, and removes the OSD + from the OSD map. (The ``purge`` subcommand was introduced in Luminous. For + older releases, see :ref:`the procedure linked + here <ceph_osd_purge_procedure_pre_luminous>`.): + + .. prompt:: bash $ + + ceph osd purge {id} --yes-i-really-mean-it + + +#. Navigate to the host where the master copy of the cluster's + ``ceph.conf`` file is kept: + + .. prompt:: bash $ + + ssh {admin-host} + cd /etc/ceph + vim ceph.conf + +#. Remove the OSD entry from your ``ceph.conf`` file (if such an entry + exists):: + + [osd.1] + host = {hostname} + +#. Copy the updated ``ceph.conf`` file from the location on the host where the + master copy of the cluster's ``ceph.conf`` is kept to the ``/etc/ceph`` + directory of the other hosts in your cluster. + +.. _ceph_osd_purge_procedure_pre_luminous: + +If your Ceph cluster is older than Luminous, you will be unable to use the +``ceph osd purge`` command. Instead, carry out the following procedure: + +#. Remove the OSD from the CRUSH map so that it no longer receives data (for + more details, see `Remove an OSD`_): + + .. prompt:: bash $ + + ceph osd crush remove {name} + + Instead of removing the OSD from the CRUSH map, you might opt for one of two + alternatives: (1) decompile the CRUSH map, remove the OSD from the device + list, and remove the device from the host bucket; (2) remove the host bucket + from the CRUSH map (provided that it is in the CRUSH map and that you intend + to remove the host), recompile the map, and set it. + + +#. Remove the OSD authentication key: + + .. prompt:: bash $ + + ceph auth del osd.{osd-num} + +#. Remove the OSD: + + .. prompt:: bash $ + + ceph osd rm {osd-num} + + For example: + + .. prompt:: bash $ + + ceph osd rm 1 + +.. _Remove an OSD: ../crush-map#removeosd diff --git a/doc/rados/operations/balancer.rst b/doc/rados/operations/balancer.rst new file mode 100644 index 000000000..aa4eab93c --- /dev/null +++ b/doc/rados/operations/balancer.rst @@ -0,0 +1,221 @@ +.. _balancer: + +Balancer Module +======================= + +The *balancer* can optimize the allocation of placement groups (PGs) across +OSDs in order to achieve a balanced distribution. The balancer can operate +either automatically or in a supervised fashion. + + +Status +------ + +To check the current status of the balancer, run the following command: + + .. prompt:: bash $ + + ceph balancer status + + +Automatic balancing +------------------- + +When the balancer is in ``upmap`` mode, the automatic balancing feature is +enabled by default. For more details, see :ref:`upmap`. To disable the +balancer, run the following command: + + .. prompt:: bash $ + + ceph balancer off + +The balancer mode can be changed from ``upmap`` mode to ``crush-compat`` mode. +``crush-compat`` mode is backward compatible with older clients.
In +``crush-compat`` mode, the balancer automatically makes small changes to the +data distribution in order to ensure that OSDs are utilized equally. + + +Throttling +---------- + +If the cluster is degraded (that is, if an OSD has failed and the system hasn't +healed itself yet), then the balancer will not make any adjustments to the PG +distribution. + +When the cluster is healthy, the balancer will incrementally move a small +fraction of unbalanced PGs in order to improve distribution. This fraction +will not exceed a certain threshold that defaults to 5%. To adjust this +``target_max_misplaced_ratio`` threshold setting, run the following command: + + .. prompt:: bash $ + + ceph config set mgr target_max_misplaced_ratio .07 # 7% + +The balancer sleeps between runs. To set the number of seconds for this +interval of sleep, run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/sleep_interval 60 + +To set the time of day (in HHMM format) at which automatic balancing begins, +run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_time 0000 + +To set the time of day (in HHMM format) at which automatic balancing ends, run +the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_time 2359 + +Automatic balancing can be restricted to certain days of the week. To restrict +it to a specific day of the week or later (as with crontab, ``0`` is Sunday, +``1`` is Monday, and so on), run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_weekday 0 + +To restrict automatic balancing to a specific day of the week or earlier +(again, ``0`` is Sunday, ``1`` is Monday, and so on), run the following +command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_weekday 6 + +Automatic balancing can be restricted to certain pools. By default, the value +of this setting is an empty string, so that all pools are automatically +balanced. To restrict automatic balancing to specific pools, retrieve their +numeric pool IDs (by running the :command:`ceph osd pool ls detail` command), +and then run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/pool_ids 1,2,3 + + +Modes +----- + +There are two supported balancer modes: + +#. **crush-compat**. This mode uses the compat weight-set feature (introduced + in Luminous) to manage an alternative set of weights for devices in the + CRUSH hierarchy. When the balancer is operating in this mode, the normal + weights should remain set to the size of the device in order to reflect the + target amount of data intended to be stored on the device. The balancer will + then optimize the weight-set values, adjusting them up or down in small + increments, in order to achieve a distribution that matches the target + distribution as closely as possible. (Because PG placement is a pseudorandom + process, it is subject to a natural amount of variation; optimizing the + weights serves to counteract that natural variation.) + + Note that this mode is *fully backward compatible* with older clients: when + an OSD Map and CRUSH map are shared with older clients, Ceph presents the + optimized weights as the "real" weights. + + The primary limitation of this mode is that the balancer cannot handle + multiple CRUSH hierarchies with different placement rules if the subtrees of + the hierarchy share any OSDs. 
(Such sharing of OSDs is not typical and, + because of the difficulty of managing the space utilization on the shared + OSDs, is generally not recommended.) + +#. **upmap**. In Luminous and later releases, the OSDMap can store explicit + mappings for individual OSDs as exceptions to the normal CRUSH placement + calculation. These ``upmap`` entries provide fine-grained control over the + PG mapping. This balancer mode optimizes the placement of individual PGs in + order to achieve a balanced distribution. In most cases, the resulting + distribution is nearly perfect: that is, there is an equal number of PGs on + each OSD (±1 PG, since the total number might not divide evenly). + + To use ``upmap``, all clients must be Luminous or newer. + +The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by +running the following command: + + .. prompt:: bash $ + + ceph balancer mode crush-compat + +Supervised optimization +----------------------- + +Supervised use of the balancer can be understood in terms of three distinct +phases: + +#. building a plan +#. evaluating the quality of the data distribution, either for the current PG + distribution or for the PG distribution that would result after executing a + plan +#. executing the plan + +To evaluate the current distribution, run the following command: + + .. prompt:: bash $ + + ceph balancer eval + +To evaluate the distribution for a single pool, run the following command: + + .. prompt:: bash $ + + ceph balancer eval <pool-name> + +To see the evaluation in greater detail, run the following command: + + .. prompt:: bash $ + + ceph balancer eval-verbose ... + +To instruct the balancer to generate a plan (using the currently configured +mode), make up a name (any useful identifying string) for the plan, and run the +following command: + + .. prompt:: bash $ + + ceph balancer optimize <plan-name> + +To see the contents of a plan, run the following command: + + .. prompt:: bash $ + + ceph balancer show <plan-name> + +To display all plans, run the following command: + + .. prompt:: bash $ + + ceph balancer ls + +To discard an old plan, run the following command: + + .. prompt:: bash $ + + ceph balancer rm <plan-name> + +To see currently recorded plans, examine the output of the following status +command: + + .. prompt:: bash $ + + ceph balancer status + +To evaluate the distribution that would result from executing a specific plan, +run the following command: + + .. prompt:: bash $ + + ceph balancer eval <plan-name> + +If a plan is expected to improve the distribution (that is, the plan's score is +lower than the current cluster state's score), you can execute that plan by +running the following command: + + .. prompt:: bash $ + + ceph balancer execute <plan-name> diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst new file mode 100644 index 000000000..d24782c46 --- /dev/null +++ b/doc/rados/operations/bluestore-migration.rst @@ -0,0 +1,357 @@ +.. _rados_operations_bluestore_migration: + +===================== + BlueStore Migration +===================== +.. warning:: Filestore has been deprecated in the Reef release and is no longer supported. + Please migrate to BlueStore. + +Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph +cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs. +Because BlueStore is superior to Filestore in performance and robustness, and +because Filestore is not supported by Ceph releases beginning with Reef, users +deploying Filestore OSDs should transition to BlueStore. There are several +strategies for making the transition to BlueStore.
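+ +Before choosing a strategy, it can help to see at a glance which OSDs are still Filestore; one small sketch (it loops over all OSD IDs and prints those whose object store is reported as Filestore): + + .. prompt:: bash $ + + for osd in $(ceph osd ls); do ceph osd metadata $osd | grep -q '"osd_objectstore": "filestore"' && echo "osd.$osd"; done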
+ +BlueStore is so different from Filestore that an individual OSD cannot be +converted in place. Instead, the conversion process must use either (1) the +cluster's normal replication and healing support, or (2) tools and strategies +that copy OSD content from an old (Filestore) device to a new (BlueStore) one. + +Deploying new OSDs with BlueStore +================================= + +Use BlueStore when deploying new OSDs (for example, when the cluster is +expanded). Because this is the default behavior, no specific change is +needed. + +Similarly, use BlueStore for any OSDs that have been reprovisioned after +a failed drive was replaced. + +Converting existing OSDs +======================== + +"Mark-``out``" replacement +-------------------------- + +The simplest approach is to verify that the cluster is healthy and +then follow these steps for each Filestore OSD in succession: mark the OSD +``out``, wait for the data to replicate across the cluster, reprovision the OSD, +mark the OSD back ``in``, and wait for recovery to complete before proceeding +to the next OSD. This approach is easy to automate, but it entails unnecessary +data migration that carries costs in time and SSD wear. + +#. Identify a Filestore OSD to replace:: + + ID=<osd-id-number> + DEVICE=<disk-device> + + #. Determine whether a given OSD is Filestore or BlueStore: + + .. prompt:: bash $ + + ceph osd metadata $ID | grep osd_objectstore + + #. Get a current count of Filestore and BlueStore OSDs: + + .. prompt:: bash $ + + ceph osd count-metadata osd_objectstore + +#. Mark a Filestore OSD ``out``: + + .. prompt:: bash $ + + ceph osd out $ID + +#. Wait for the data to migrate off this OSD: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done + +#. Stop the OSD: + + .. prompt:: bash $ + + systemctl kill ceph-osd@$ID + + .. _osd_id_retrieval: + +#. Note which device the OSD is using: + + .. prompt:: bash $ + + mount | grep /var/lib/ceph/osd/ceph-$ID + +#. Unmount the OSD: + + .. prompt:: bash $ + + umount /var/lib/ceph/osd/ceph-$ID + +#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! These commands will destroy + the contents of the device; you must be certain that the data on the device is + not needed (in other words, that the cluster is healthy) before proceeding: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be + reprovisioned with the same OSD ID): + + .. prompt:: bash $ + + ceph osd destroy $ID --yes-i-really-mean-it + +#. Provision a BlueStore OSD in place by using the same OSD ID. This requires + you to identify which device to wipe, and to make certain that you target + the correct and intended device, using the information that was retrieved in + the :ref:`"Note which device the OSD is using" <osd_id_retrieval>` step. BE + CAREFUL! Note that you may need to modify these commands when dealing with + hybrid OSDs: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID + +#. Repeat. + +You may opt to (1) have the balancing of the replacement BlueStore OSD take +place concurrently with the draining of the next Filestore OSD, or instead +(2) follow the same procedure for multiple OSDs in parallel. In either case, +however, you must ensure that the cluster is fully clean (in other words, that +all data has all replicas) before destroying any OSDs. If you opt to reprovision +multiple OSDs in parallel, be **very** careful to destroy OSDs only within a +single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to +satisfy this requirement will reduce the redundancy and availability of your +data and increase the risk of data loss (or even guarantee data loss).
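+ +A quick way to confirm that the cluster is fully clean before destroying the next OSD is to check that every PG reports ``active+clean``; for example: + + .. prompt:: bash $ + + ceph pg stat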
+ +Advantages: + +* Simple. +* Can be done on a device-by-device basis. +* No spare devices or hosts are required. + +Disadvantages: + +* Data is copied over the network twice: once to another OSD in the cluster (to + maintain the specified number of replicas), and again back to the + reprovisioned BlueStore OSD. + +"Whole host" replacement +------------------------ + +If you have a spare host in the cluster, or sufficient free space to evacuate +an entire host for use as a spare, then the conversion can be done on a +host-by-host basis so that each stored copy of the data is migrated only once. + +To use this approach, you need an empty host that has no OSDs provisioned. +There are two ways to do this: either by using a new, empty host that is not +yet part of the cluster, or by offloading data from an existing host that is +already part of the cluster. + +Using a new, empty host +^^^^^^^^^^^^^^^^^^^^^^^ + +Ideally the host will have roughly the same capacity as each of the other hosts +you will be converting. Add the host to the CRUSH hierarchy, but do not attach +it to the root: + + +.. prompt:: bash $ + + NEWHOST=<empty-host-name> + ceph osd crush add-bucket $NEWHOST host + +Make sure that Ceph packages are installed on the new host. + +Using an existing host +^^^^^^^^^^^^^^^^^^^^^^ + +If you would like to use an existing host that is already part of the cluster, +and if there is sufficient free space on that host so that all of its data can +be migrated off to other cluster hosts, you can do the following (instead of +using a new, empty host): + +.. prompt:: bash $ + + OLDHOST=<existing-host-with-osds> + ceph osd crush unlink $OLDHOST default + +where "default" is the immediate ancestor in the CRUSH map. (For +smaller clusters with unmodified configurations this will normally +be "default", but it might instead be a rack name.) You should now +see the host at the top of the OSD tree output with no parent: + +.. prompt:: bash $ + + ceph osd tree + +:: + + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host oldhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host foo + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +If everything looks good, jump directly to the :ref:`"Wait for the data +migration to complete" <bluestore_data_migration_step>` step below and proceed +from there to clean up the old OSDs. + +Migration process +^^^^^^^^^^^^^^^^^ + +If you're using a new host, start at :ref:`the first step +<bluestore_migration_process_first_step>`. If you're using an existing host, +jump to :ref:`this step <bluestore_data_migration_step>`. + +.. _bluestore_migration_process_first_step: + +#. Provision new BlueStore OSDs for all devices: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/$DEVICE + +#. Verify that the new OSDs have joined the cluster: + + .. prompt:: bash $ + + ceph osd tree + + You should see the new host ``$NEWHOST`` with all of the OSDs beneath + it, but the host should *not* be nested beneath any other node in the + hierarchy (like ``root default``).
For example, if ``newhost`` is + the empty host, you might see something like:: + + $ ceph osd tree + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host newhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host oldhost1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +#. Identify the first target host to convert: + + .. prompt:: bash $ + + OLDHOST=<existing-host-with-osds> + +#. Swap the new host into the old host's position in the cluster: + + .. prompt:: bash $ + + ceph osd crush swap-bucket $NEWHOST $OLDHOST + + At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on + ``$NEWHOST``. If there is a difference between the total capacity of the + old hosts and the total capacity of the new hosts, you may also see some + data migrate to or from other nodes in the cluster. Provided that the hosts + are similarly sized, however, this will be a relatively small amount of + data. + + .. _bluestore_data_migration_step: + +#. Wait for the data migration to complete: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done + +#. Stop all old OSDs on the now-empty ``$OLDHOST``: + + .. prompt:: bash $ + + ssh $OLDHOST + systemctl kill ceph-osd.target + umount /var/lib/ceph/osd/ceph-* + +#. Destroy and purge the old OSDs: + + .. prompt:: bash $ + + for osd in `ceph osd ls-tree $OLDHOST`; do + ceph osd purge $osd --yes-i-really-mean-it + done + +#. Wipe the old OSDs. This requires you to identify which devices are to be + wiped manually. BE CAREFUL! For each device: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Use the now-empty host as the new host, and repeat: + + .. prompt:: bash $ + + NEWHOST=$OLDHOST + +Advantages: + +* Data is copied over the network only once. +* An entire host's OSDs are converted at once. +* Can be parallelized, to make possible the conversion of multiple hosts at the same time. +* No host involved in this process needs to have a spare device. + +Disadvantages: + +* A spare host is required. +* An entire host's worth of OSDs will be migrating data at a time. This + is likely to impact overall cluster performance. +* All migrated data still makes one full hop over the network. + +Per-OSD device copy +------------------- +A single logical OSD can be converted by using the ``copy`` function +included in ``ceph-objectstore-tool``. This requires that the host have one or more free +devices to provision a new, empty BlueStore OSD. For +example, if each host in your cluster has twelve OSDs, then you need a +thirteenth unused OSD so that each OSD can be converted before the +previous OSD is reclaimed to convert the next OSD. + +Caveats: + +* This approach requires that we prepare an empty BlueStore OSD but that we do not allocate + a new OSD ID to it. The ``ceph-volume`` tool does not support such an operation. **IMPORTANT:** + because the setup of *dmcrypt* is closely tied to the identity of the OSD, this approach does not + work with encrypted OSDs. + +* The device must be manually partitioned.
+ +* An unsupported user-contributed script that demonstrates this process may be found here: + https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash + +Advantages: + +* Provided that the 'noout' or the 'norecover'/'norebalance' flags are set on the OSD or the + cluster while the conversion process is underway, little or no data migrates over the + network during the conversion. + +Disadvantages: + +* Tooling is not fully implemented, supported, or documented. + +* Each host must have an appropriate spare or empty device for staging. + +* The OSD is offline during the conversion, which means new writes to PGs + with the OSD in their acting set may not be ideally redundant until the + subject OSD comes up and recovers. This increases the risk of data + loss due to an overlapping failure. However, if another OSD fails before + conversion and startup have completed, the original Filestore OSD can be + started to provide access to its original data. diff --git a/doc/rados/operations/cache-tiering.rst b/doc/rados/operations/cache-tiering.rst new file mode 100644 index 000000000..127b0141f --- /dev/null +++ b/doc/rados/operations/cache-tiering.rst @@ -0,0 +1,557 @@ +=============== + Cache Tiering +=============== + +.. warning:: Cache tiering has been deprecated in the Reef release as it + has lacked a maintainer for a very long time. This does not mean + it will be certainly removed, but we may choose to remove it + without much further notice. + +A cache tier provides Ceph Clients with better I/O performance for a subset of +the data stored in a backing storage tier. Cache tiering involves creating a +pool of relatively fast/expensive storage devices (e.g., solid state drives) +configured to act as a cache tier, and a backing pool of either erasure-coded +or relatively slower/cheaper devices configured to act as an economical storage +tier. The Ceph objecter handles where to place the objects and the tiering +agent determines when to flush objects from the cache to the backing storage +tier. So the cache tier and the backing storage tier are completely transparent +to Ceph clients. + + +.. ditaa:: + +-------------+ + | Ceph Client | + +------+------+ + ^ + Tiering is | + Transparent | Faster I/O + to Ceph | +---------------+ + Client Ops | | | + | +----->+ Cache Tier | + | | | | + | | +-----+---+-----+ + | | | ^ + v v | | Active Data in Cache Tier + +------+----+--+ | | + | Objecter | | | + +-----------+--+ | | + ^ | | Inactive Data in Storage Tier + | v | + | +-----+---+-----+ + | | | + +----->| Storage Tier | + | | + +---------------+ + Slower I/O + + +The cache tiering agent handles the migration of data between the cache tier +and the backing storage tier automatically. However, admins have the ability to +configure how this migration takes place by setting the ``cache-mode``. There are +two main scenarios: + +- **writeback** mode: If the base tier and the cache tier are configured in + ``writeback`` mode, Ceph clients receive an ACK from the base tier every time + they write data to it. Then the cache tiering agent determines whether + ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it + has been set and the data has been written more than a specified number of + times per interval, the data is promoted to the cache tier. + + When Ceph clients need access to data stored in the base tier, the cache + tiering agent reads the data from the base tier and returns it to the client. 
+  While data is being read from the base tier, the cache tiering agent consults
+  the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and
+  decides whether to promote that data from the base tier to the cache tier.
+  When data has been promoted from the base tier to the cache tier, the Ceph
+  client is able to perform I/O operations on it using the cache tier. This is
+  well-suited for mutable data (for example, photo/video editing, transactional
+  data).
+
+- **readproxy** mode: This mode will use any objects that already
+  exist in the cache tier, but if an object is not present in the
+  cache the request will be proxied to the base tier. This is useful
+  for transitioning from ``writeback`` mode to a disabled cache as it
+  allows the workload to function properly while the cache is drained,
+  without adding any new objects to the cache.
+
+Other cache modes are:
+
+- **readonly** promotes objects to the cache on read operations only; write
+  operations are forwarded to the base tier. This mode is intended for
+  read-only workloads that do not require consistency to be enforced by the
+  storage system. (**Warning**: when objects are updated in the base tier,
+  Ceph makes **no** attempt to sync these updates to the corresponding objects
+  in the cache. Since this mode is considered experimental, a
+  ``--yes-i-really-mean-it`` option must be passed in order to enable it.)
+
+- **none** is used to completely disable caching.
+
+
+A word of caution
+=================
+
+Cache tiering will *degrade* performance for most workloads. Users should use
+extreme caution before using this feature.
+
+* *Workload dependent*: Whether a cache will improve performance is
+  highly dependent on the workload. Because there is a cost
+  associated with moving objects into or out of the cache, it can only
+  be effective when there is a *large skew* in the access pattern in
+  the data set, such that most of the requests touch a small number of
+  objects. The cache pool should be large enough to capture the
+  working set for your workload to avoid thrashing.
+
+* *Difficult to benchmark*: Most benchmarks that users run to measure
+  performance will show terrible performance with cache tiering, in
+  part because very few of them skew requests toward a small set of
+  objects, in part because it can take a long time for the cache to
+  "warm up," and in part because the warm-up cost can be high.
+
+* *Usually slower*: For workloads that are not cache tiering-friendly,
+  performance is often slower than that of a normal RADOS pool without
+  cache tiering enabled.
+
+* *librados object enumeration*: The librados-level object enumeration
+  API is not meant to be coherent in the presence of a cache. If
+  your application is using librados directly and relies on object
+  enumeration, cache tiering will probably not work as expected.
+  (This is not a problem for RGW, RBD, or CephFS.)
+
+* *Complexity*: Enabling cache tiering means that a lot of additional
+  machinery and complexity within the RADOS cluster is being used.
+  This increases the probability that you will encounter a bug in the system
+  that other users have not yet encountered and will put your deployment at a
+  higher level of risk.
+
+Known Good Workloads
+--------------------
+
+* *RGW time-skewed*: If the RGW workload is such that almost all read
+  operations are directed at recently written objects, a simple cache
+  tiering configuration that destages recently written objects from
+  the cache to the base tier after a configurable period can work
+  well. A sketch of such a configuration follows this list.
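+The following is a minimal sketch of such a time-skewed configuration, using
+commands that are described later in this document. The pool name
+(``hot-storage``) and the values shown (a Bloom-filter HitSet and a
+four-hour flush age) are illustrative assumptions, not recommendations::
+
+    # track object recency with Bloom-filter HitSets
+    ceph osd pool set hot-storage hit_set_type bloom
+    ceph osd pool set hot-storage hit_set_count 12
+    ceph osd pool set hot-storage hit_set_period 14400
+    # require objects to be at least four hours old before they are flushed
+    ceph osd pool set hot-storage cache_min_flush_age 14400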
+
+Known Bad Workloads
+-------------------
+
+The following configurations are *known to work poorly* with cache
+tiering.
+
+* *RBD with replicated cache and erasure-coded base*: This is a common
+  request, but it usually does not perform well. Even reasonably skewed
+  workloads still send some small writes to cold objects, and because
+  small writes are not yet supported by the erasure-coded pool, entire
+  (usually 4 MB) objects must be migrated into the cache in order to
+  satisfy a small (often 4 KB) write. Only a handful of users have
+  successfully deployed this configuration, and it works for them only
+  because their data is extremely cold (backups) and they are not in
+  any way sensitive to performance.
+
+* *RBD with replicated cache and base*: RBD with a replicated base
+  tier does better than when the base is erasure coded, but it is
+  still highly dependent on the amount of skew in the workload, and
+  very difficult to validate. The user will need to have a good
+  understanding of their workload and will need to tune the cache
+  tiering parameters carefully.
+
+
+Setting Up Pools
+================
+
+To set up cache tiering, you must have two pools. One will act as the
+backing storage and the other will act as the cache.
+
+
+Setting Up a Backing Storage Pool
+---------------------------------
+
+Setting up a backing storage pool typically involves one of two scenarios:
+
+- **Standard Storage**: In this scenario, the pool stores multiple copies
+  of an object in the Ceph Storage Cluster.
+
+- **Erasure Coding:** In this scenario, the pool uses erasure coding to
+  store data much more efficiently with a small performance tradeoff.
+
+In the standard storage scenario, you can set up a CRUSH rule to establish
+the failure domain (for example: osd, host, chassis, rack, or row). Ceph OSD
+Daemons perform optimally when all storage drives in the rule are of the
+same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_
+for details on creating a rule. Once you have created a rule, create
+a backing storage pool.
+
+In the erasure coding scenario, the pool creation arguments will generate the
+appropriate rule automatically. See `Create a Pool`_ for details.
+
+In subsequent examples, we will refer to the backing storage pool
+as ``cold-storage``.
+
+
+Setting Up a Cache Pool
+-----------------------
+
+Setting up a cache pool follows the same procedure as the standard storage
+scenario, but with this difference: the drives for the cache tier are typically
+high-performance drives that reside in their own servers and have their own
+CRUSH rule. Such a rule should take account of the hosts that have the
+high-performance drives while omitting the hosts that don't. See
+:ref:`CRUSH Device Class <crush-map-device-class>` for details.
+
+
+In subsequent examples, we will refer to the cache pool as ``hot-storage`` and
+the backing pool as ``cold-storage``.
+
+For cache tier configuration and default values, see
+`Pools - Set Pool Values`_.
+
+
+Creating a Cache Tier
+=====================
+
+Setting up a cache tier involves associating a backing storage pool with
+a cache pool:
+
+.. prompt:: bash $
+
+   ceph osd tier add {storagepool} {cachepool}
+
+For example:
+
+.. prompt:: bash $
+
+   ceph osd tier add cold-storage hot-storage
+
+To set the cache mode, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd tier cache-mode {cachepool} {cache-mode}
+
+For example:
+
+.. prompt:: bash $
+
+   ceph osd tier cache-mode hot-storage writeback
+
+The cache tier overlays the backing storage tier, so one additional step is
+required: you must direct all client traffic from the storage pool to the
+cache pool. To direct client traffic to the cache pool, execute the
+following:
+
+.. prompt:: bash $
+
+   ceph osd tier set-overlay {storagepool} {cachepool}
+
+For example:
+
+.. prompt:: bash $
+
+   ceph osd tier set-overlay cold-storage hot-storage
+
+
+Configuring a Cache Tier
+========================
+
+Cache tiers have several configuration options. You may set
+cache tier configuration options with a command of the following form:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} {key} {value}
+
+See `Pools - Set Pool Values`_ for details.
+
+
+Target Size and Type
+--------------------
+
+Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} hit_set_type bloom
+
+For example:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage hit_set_type bloom
+
+The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to
+store, and how much time each HitSet should cover:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} hit_set_count 12
+   ceph osd pool set {cachepool} hit_set_period 14400
+   ceph osd pool set {cachepool} target_max_bytes 1000000000000
+
+.. note:: A larger ``hit_set_count`` results in more RAM consumed by
+   the ``ceph-osd`` process.
+
+Binning accesses over time allows Ceph to determine whether a Ceph client
+accessed an object at least once, or more than once, over a given time period
+("age" vs. "temperature").
+
+The ``min_read_recency_for_promote`` setting defines how many HitSets to
+check for the existence of an object when handling a read operation. The
+result of this check is used to decide whether to promote the object
+asynchronously. Its value should be between 0 and ``hit_set_count``. If it
+is set to 0, the object is always promoted. If it is set to 1, the current
+HitSet is checked: if the object is in the current HitSet, it is promoted;
+otherwise, it is not. For higher values, that exact number of archived
+HitSets is checked, and the object is promoted if it is found in any of the
+most recent ``min_read_recency_for_promote`` HitSets.
+
+A similar parameter, ``min_write_recency_for_promote``, can be set for write
+operations:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} min_read_recency_for_promote 2
+   ceph osd pool set {cachepool} min_write_recency_for_promote 2
+
+.. note:: The longer the period and the higher the
+   ``min_read_recency_for_promote`` and
+   ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
+   daemon consumes. In particular, when the agent is active to flush
+   or evict cache objects, all ``hit_set_count`` HitSets are loaded
+   into RAM.
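+As a worked example of how these settings interact (the numbers here are
+illustrative assumptions, not recommendations): with ``hit_set_count 12``
+and ``hit_set_period 14400`` (four hours), the OSDs retain twelve HitSets
+covering the most recent 48 hours of accesses, and
+``min_read_recency_for_promote 2`` promotes an object on read only if it
+appears in one of the two most recent HitSets -- that is, only if it was
+accessed within roughly the last eight hours::
+
+    ceph osd pool set hot-storage hit_set_type bloom
+    ceph osd pool set hot-storage hit_set_count 12
+    ceph osd pool set hot-storage hit_set_period 14400
+    ceph osd pool set hot-storage min_read_recency_for_promote 2
+    ceph osd pool set hot-storage min_write_recency_for_promote 2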
+
+Cache Sizing
+------------
+
+The cache tiering agent performs two main functions:
+
+- **Flushing:** The agent identifies modified (dirty) objects and forwards
+  them to the storage pool for long-term storage.
+
+- **Evicting:** The agent identifies objects that have not been modified
+  (clean objects) and evicts the least recently used among them from the
+  cache.
+
+
+Absolute Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects based upon the total number
+of bytes or the total number of objects. To specify a maximum number of bytes,
+execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} target_max_bytes {#bytes}
+
+For example, to flush or evict at 1 TB, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage target_max_bytes 1099511627776
+
+To specify the maximum number of objects, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} target_max_objects {#objects}
+
+For example, to flush or evict at 1M objects, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage target_max_objects 1000000
+
+.. note:: Ceph is not able to determine the size of a cache pool
+   automatically, so absolute sizing must be configured here; otherwise,
+   flushing and eviction will not work. If you specify both limits, the
+   cache tiering agent will begin flushing or evicting when either
+   threshold is triggered.
+
+.. note:: All client requests will be blocked only when ``target_max_bytes``
+   or ``target_max_objects`` is reached.
+
+Relative Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects relative to the size of the
+cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
+`Absolute Sizing`_). When the cache pool consists of a certain percentage of
+modified (dirty) objects, the cache tiering agent will flush them to the
+storage pool. To set the ``cache_target_dirty_ratio``, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}
+
+For example, setting the value to ``0.4`` will begin flushing modified
+(dirty) objects when they reach 40% of the cache pool's capacity:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+
+When the dirty objects reach a higher percentage of the cache pool's
+capacity, the agent flushes them at a higher speed. To set the
+``cache_target_dirty_high_ratio``, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}
+
+For example, setting the value to ``0.6`` will begin aggressively flushing
+dirty objects when they reach 60% of the cache pool's capacity. This value
+should be set between the dirty ratio and the full ratio:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+
+When the cache pool reaches a certain percentage of its capacity, the cache
+tiering agent will evict objects to maintain free capacity. To set the
+``cache_target_full_ratio``, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}
+
+For example, setting the value to ``0.8`` will begin evicting unmodified
+(clean) objects when they reach 80% of the cache pool's capacity:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_target_full_ratio 0.8
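+Taken together, the three ratios should satisfy the ordering
+``cache_target_dirty_ratio`` < ``cache_target_dirty_high_ratio`` <
+``cache_target_full_ratio``. The following sketch restates the examples
+above as a single configuration; the values are illustrative, not
+recommendations::
+
+    # flush dirty objects slowly at 40% of capacity, aggressively at 60%,
+    # and evict clean objects at 80%
+    ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+    ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+    ceph osd pool set hot-storage cache_target_full_ratio 0.8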
+
+Cache Age
+---------
+
+You can specify the minimum age of an object before the cache tiering agent
+flushes a recently modified (dirty) object to the backing storage pool:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_min_flush_age {#seconds}
+
+For example, to flush modified (dirty) objects after 10 minutes, execute the
+following:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_min_flush_age 600
+
+You can specify the minimum age of an object before it will be evicted from
+the cache tier:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_min_evict_age {#seconds}
+
+For example, to evict objects after 30 minutes, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_min_evict_age 1800
+
+
+Removing a Cache Tier
+=====================
+
+Removing a cache tier differs depending on whether it is a writeback
+cache or a read-only cache.
+
+
+Removing a Read-Only Cache
+--------------------------
+
+Since a read-only cache does not have modified data, you can disable
+and remove it without losing any recent changes to objects in the cache.
+
+#. Change the cache-mode to ``none`` to disable it:
+
+   .. prompt:: bash $
+
+      ceph osd tier cache-mode {cachepool} none
+
+   For example:
+
+   .. prompt:: bash $
+
+      ceph osd tier cache-mode hot-storage none
+
+#. Remove the cache pool from the backing pool:
+
+   .. prompt:: bash $
+
+      ceph osd tier remove {storagepool} {cachepool}
+
+   For example:
+
+   .. prompt:: bash $
+
+      ceph osd tier remove cold-storage hot-storage
+
+
+Removing a Writeback Cache
+--------------------------
+
+Since a writeback cache may have modified data, you must take steps to ensure
+that you do not lose any recent changes to objects in the cache before you
+disable and remove it.
+
+
+#. Change the cache mode to ``proxy`` so that new and modified objects will
+   flush to the backing storage pool:
+
+   .. prompt:: bash $
+
+      ceph osd tier cache-mode {cachepool} proxy
+
+   For example:
+
+   .. prompt:: bash $
+
+      ceph osd tier cache-mode hot-storage proxy
+
+
+#. Ensure that the cache pool has been flushed. This may take a few minutes:
+
+   .. prompt:: bash $
+
+      rados -p {cachepool} ls
+
+   If the cache pool still has objects, you can flush them manually.
+   For example:
+
+   .. prompt:: bash $
+
+      rados -p {cachepool} cache-flush-evict-all
+
+
+#. Remove the overlay so that clients will not direct traffic to the cache:
+
+   .. prompt:: bash $
+
+      ceph osd tier remove-overlay {storagepool}
+
+   For example:
+
+   .. prompt:: bash $
+
+      ceph osd tier remove-overlay cold-storage
+
+
+#. Finally, remove the cache tier pool from the backing storage pool:
+
+   .. prompt:: bash $
+
+      ceph osd tier remove {storagepool} {cachepool}
+
+   For example:
+
+   .. prompt:: bash $
+
+      ceph osd tier remove cold-storage hot-storage
+
+
+.. _Create a Pool: ../pools#create-a-pool
+.. _Pools - Set Pool Values: ../pools#set-pool-values
+.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter
+.. _CRUSH Maps: ../crush-map
+.. _Absolute Sizing: #absolute-sizing
diff --git a/doc/rados/operations/change-mon-elections.rst b/doc/rados/operations/change-mon-elections.rst
new file mode 100644
index 000000000..7418ea363
--- /dev/null
+++ b/doc/rados/operations/change-mon-elections.rst
@@ -0,0 +1,100 @@
+.. _changing_monitor_elections:
+
+=======================================
+Configuring Monitor Election Strategies
+=======================================
+
+By default, the monitors are in ``classic`` mode. We recommend staying in this
+mode unless you have a very specific reason to change it.
+
+If you want to switch modes BEFORE constructing the cluster, change the ``mon
+election default strategy`` option. This option takes an integer value:
+
+* ``1`` for ``classic``
+* ``2`` for ``disallow``
+* ``3`` for ``connectivity``
+
+After your cluster has started running, you can change strategies by running a
+command of the following form:
+
+   $ ceph mon set election_strategy {classic|disallow|connectivity}
+
+Choosing a mode
+===============
+
+The modes other than ``classic`` provide specific features. We recommend
+staying in ``classic`` mode if you don't need these extra features because it
+is the simplest mode.
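+To check which strategy is currently in effect, inspect the monmap. In
+recent releases the election strategy is included in the output of ``ceph
+mon dump``; the exact output shown here is illustrative::
+
+    $ ceph mon dump | grep election_strategy
+    election_strategy: 1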
+.. _rados_operations_disallow_mode:
+
+Disallow Mode
+=============
+
+The ``disallow`` mode allows you to mark monitors as disallowed. Disallowed
+monitors participate in the quorum and serve clients, but cannot be elected
+leader. You might want to use this mode for monitors that are far away from
+clients.
+
+To disallow a monitor from being elected leader, run a command of the
+following form:
+
+.. prompt:: bash $
+
+   ceph mon add disallowed_leader {name}
+
+To remove a monitor from the disallowed list and allow it to be elected
+leader, run a command of the following form:
+
+.. prompt:: bash $
+
+   ceph mon rm disallowed_leader {name}
+
+To see the list of disallowed leaders, examine the output of the following
+command:
+
+.. prompt:: bash $
+
+   ceph mon dump
+
+Connectivity Mode
+=================
+
+The ``connectivity`` mode evaluates connection scores that are provided by
+each monitor for its peers and elects the monitor with the highest score.
+This mode is designed to handle network partitioning (also called
+*net-splits*): network partitioning might occur if your cluster is stretched
+across multiple data centers or otherwise has a non-uniform or unbalanced
+network topology.
+
+The ``connectivity`` mode also supports disallowing monitors from being
+elected leader by using the same commands that were presented in
+:ref:`Disallow Mode <rados_operations_disallow_mode>`.
+
+Examining connectivity scores
+=============================
+
+The monitors maintain connection scores even if they aren't in
+``connectivity`` mode. To examine a specific monitor's connection scores, run
+a command of the following form:
+
+.. prompt:: bash $
+
+   ceph daemon mon.{name} connection scores dump
+
+Scores for an individual connection range from ``0`` to ``1`` inclusive and
+include whether the connection is considered alive or dead (as determined by
+whether it returned its latest ping before timeout).
+
+Connectivity scores are expected to remain valid. However, if during
+troubleshooting you determine that these scores have for some reason become
+invalid, drop the history and reset the scores by running a command of the
+following form:
+
+.. prompt:: bash $
+
+   ceph daemon mon.{name} connection scores reset
+
+Resetting connectivity scores carries little risk: monitors will still
+quickly determine whether a connection is alive or dead and trend back to the
+previous scores if those scores were accurate. Nevertheless, resetting scores
+ought to be unnecessary and it is not recommended unless advised by your
+support team or by a developer.
diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst
new file mode 100644
index 000000000..033f831cd
--- /dev/null
+++ b/doc/rados/operations/control.rst
@@ -0,0 +1,665 @@
+.. index:: control, commands
+
+==================
+ Control Commands
+==================
+
+
+Monitor Commands
+================
+
+To issue monitor commands, use the ``ceph`` utility:
+
+.. prompt:: bash $
+
+   ceph [-m monhost] {command}
+
+In most cases, monitor commands have the following form:
+
+.. prompt:: bash $
+
+   ceph {subsystem} {command}
+
+
+System Commands
+===============
+
+To display the current cluster status, run the following commands:
+
+.. prompt:: bash $
+
+   ceph -s
+   ceph status
+
+To display a running summary of cluster status and major events, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph -w
+
+To display the monitor quorum, including which monitors are participating and
+which one is the leader, run the following commands:
+
+.. prompt:: bash $
+
+   ceph mon stat
+   ceph quorum_status
+
+To query the status of a single monitor, including whether it is in the
+quorum, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell mon.[id] mon_status
+
+Here the value of ``[id]`` can be found by consulting the output of ``ceph
+-s``.
+
+
+Authentication Subsystem
+========================
+
+To add an OSD keyring for a specific OSD, run the following command:
+
+.. prompt:: bash $
+
+   ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring}
+
+To list the cluster's keys and their capabilities, run the following command:
+
+.. prompt:: bash $
+
+   ceph auth ls
+
+
+Placement Group Subsystem
+=========================
+
+To display the statistics for all placement groups (PGs), run the following
+command:
+
+.. prompt:: bash $
+
+   ceph pg dump [--format {format}]
+
+Here the valid formats are ``plain`` (default), ``json``, ``json-pretty``,
+``xml``, and ``xml-pretty``. When implementing monitoring tools and other
+tools, it is best to use the ``json`` format. JSON parsing is more
+deterministic than the ``plain`` format (which is more human readable), and
+the layout is much more consistent from release to release. The ``jq``
+utility is very useful for extracting data from JSON output.
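+For example, a short ``jq`` filter can extract the IDs of any PGs that are
+not ``active+clean`` from the JSON dump. This is an illustrative sketch: the
+JSON layout (assumed here to be ``.pg_map.pg_stats``) can vary between
+releases, so verify the structure on your own cluster first::
+
+    ceph pg dump --format json | \
+        jq -r '.pg_map.pg_stats[] | select(.state != "active+clean") | .pgid'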
+
+To display the statistics for all PGs stuck in a specified state, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}]
+
+Here ``--format`` may be ``plain`` (default), ``json``, ``json-pretty``,
+``xml``, or ``xml-pretty``.
+
+The ``--threshold`` argument determines the time interval (in seconds) for a
+PG to be considered ``stuck`` (default: 300).
+
+PGs might be stuck in any of the following states:
+
+**Inactive**
+
+    PGs are unable to process reads or writes because they are waiting for an
+    OSD that has the most up-to-date data to return to an ``up`` state.
+
+
+**Unclean**
+
+    PGs contain objects that have not been replicated the desired number of
+    times. These PGs have not yet completed the process of recovering.
+
+
+**Stale**
+
+    PGs are in an unknown state, because the OSDs that host them have not
+    reported to the monitor cluster for a certain period of time (specified
+    by the ``mon_osd_report_timeout`` configuration setting).
+
+
+To delete a ``lost`` object or revert an object to its prior state, either by
+reverting it to its previous version or by deleting it because it was just
+created and has no previous version, run the following command:
+
+.. prompt:: bash $
+
+   ceph pg {pgid} mark_unfound_lost revert|delete
+
+
+.. _osd-subsystem:
+
+OSD Subsystem
+=============
+
+To query OSD subsystem status, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd stat
+
+To write a copy of the most recent OSD map to a file (see :ref:`osdmaptool
+<osdmaptool>`), run the following command:
+
+.. prompt:: bash $
+
+   ceph osd getmap -o file
+
+To write a copy of the CRUSH map from the most recent OSD map to a file, run
+the following command:
+
+.. prompt:: bash $
+
+   ceph osd getcrushmap -o file
+
+Note that this command is functionally equivalent to the following two
+commands:
+
+.. prompt:: bash $
+
+   ceph osd getmap -o /tmp/osdmap
+   osdmaptool /tmp/osdmap --export-crush file
+
+To dump the OSD map, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd dump [--format {format}]
+
+The ``--format`` option accepts the following arguments: ``plain`` (default),
+``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON
+is the recommended format for tools, scripting, and other forms of
+automation.
+
+To dump the OSD map as a tree that lists one OSD per line and displays
+information about the weights and states of the OSDs, run the following
+command:
+
+.. prompt:: bash $
+
+   ceph osd tree [--format {format}]
+
+To find out where a specific RADOS object is stored in the system, run a
+command of the following form:
+
+.. prompt:: bash $
+
+   ceph osd map {pool-name} {object-name}
+
+To add or move a new OSD (specified by its ID, name, and weight) to a
+specific CRUSH location, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]]
+
+To remove an existing OSD from the CRUSH map, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush remove {name}
+
+To remove an existing bucket from the CRUSH map, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush remove {bucket-name}
+
+To move an existing bucket from one position in the CRUSH hierarchy to
+another, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush move {id} {loc1} [{loc2} ...]
+
+To set the CRUSH weight of a specific OSD (specified by ``{name}``) to
+``{weight}``, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush reweight {name} {weight}
+
+To mark an OSD as ``lost``, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd lost {id} [--yes-i-really-mean-it]
+
+.. warning::
+   This could result in permanent data loss. Use with caution!
+
+To create a new OSD, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd create [{uuid}]
+
+If no UUID is given as part of this command, the UUID will be set
+automatically when the OSD starts up.
+
+To remove one or more specific OSDs, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd rm [{id}...]
+
+To display the current ``max_osd`` parameter in the OSD map, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph osd getmaxosd
+
+To import a specific CRUSH map, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd setcrushmap -i file
+
+To set the ``max_osd`` parameter in the OSD map, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd setmaxosd {num}
+
+The parameter has a default value of 10000. Most operators will never need to
+adjust it.
+
+To mark a specific OSD ``down``, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd down {osd-num}
+
+To mark a specific OSD ``out`` (so that no data will be allocated to it), run
+the following command:
+
+.. prompt:: bash $
+
+   ceph osd out {osd-num}
+
+To mark a specific OSD ``in`` (so that data will be allocated to it), run the
+following command:
+
+.. prompt:: bash $
+
+   ceph osd in {osd-num}
+
+By using the "pause flags" in the OSD map, you can pause or unpause I/O
+requests. If the flags are set, then no I/O requests will be sent to any OSD.
+When the flags are cleared, then pending I/O requests will be resent. To set
+or clear pause flags, run one of the following commands:
+
+.. prompt:: bash $
+
+   ceph osd pause
+   ceph osd unpause
+
+You can assign an override or ``reweight`` weight value to a specific OSD if
+the normal CRUSH distribution seems to be suboptimal. The weight of an OSD
+helps determine the extent of its I/O requests and data storage: two OSDs
+with the same weight will receive approximately the same number of I/O
+requests and store approximately the same amount of data. The ``ceph osd
+reweight`` command assigns an override weight to an OSD.
+The weight value is in the range 0 to 1,
+and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of
+the data that would otherwise be on this OSD. The command does not change the
+weights of the buckets above the OSD in the CRUSH map. Using the command is
+merely a corrective measure: for example, if one of your OSDs is at 90% and
+the others are at 50%, you could reduce the outlier weight to correct this
+imbalance. To assign an override weight to a specific OSD, run the following
+command:
+
+.. prompt:: bash $
+
+   ceph osd reweight {osd-num} {weight}
+
+.. note:: Any assigned override reweight value will conflict with the
+   balancer. This means that if the balancer is in use, all override reweight
+   values should be ``1.0000`` in order to avoid suboptimal cluster behavior.
+
+A cluster's OSDs can be reweighted in order to maintain balance if some OSDs
+are being disproportionately utilized. Note that override or ``reweight``
+weights have values relative to one another that default to 1.00000; their
+values are not absolute, and these weights must be distinguished from CRUSH
+weights (which reflect the absolute capacity of a bucket, as measured in
+TiB). To reweight OSDs by utilization, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing]
+
+By default, this command adjusts the override weight of OSDs that deviate
+from the average utilization by more than 20%, but you can specify a
+different percentage in the ``threshold`` argument.
+
+To limit the increment by which any OSD's reweight is to be changed, use the
+``max_change`` argument (default: 0.05). To limit the number of OSDs that are
+to be adjusted, use the ``max_osds`` argument (default: 4). Increasing these
+variables can accelerate the reweighting process, but perhaps at the cost of
+slower client operations (as a result of the increase in data movement).
+
+You can test the ``osd reweight-by-utilization`` command before running it.
+To find out which and how many PGs and OSDs will be affected by a specific
+use of the ``osd reweight-by-utilization`` command, run the following
+command:
+
+.. prompt:: bash $
+
+   ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing]
+
+The ``--no-increasing`` option can be added to the ``reweight-by-utilization``
+and ``test-reweight-by-utilization`` commands in order to prevent any override
+weights that are currently less than 1.00000 from being increased. This
+option can be useful in certain circumstances: for example, when you are
+hastily balancing in order to remedy ``full`` or ``nearfull`` OSDs, or when
+there are OSDs being evacuated or slowly brought into service.
+
+Operators of deployments that utilize Nautilus or newer (or later revisions
+of Luminous and Mimic) and that have no pre-Luminous clients will likely
+instead want to enable the ``balancer`` module for ``ceph-mgr``.
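+As an illustration of a cautious workflow (the argument values here are
+assumptions, not recommendations), first report what would change, and apply
+the same adjustment only if the report looks sensible::
+
+    # dry run: report affected PGs and OSDs without changing anything
+    ceph osd test-reweight-by-utilization 115 0.05 8 --no-increasing
+    # apply the same adjustment
+    ceph osd reweight-by-utilization 115 0.05 8 --no-increasing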
+
+The blocklist can be modified by adding or removing an IP address or a CIDR
+range. If an address is blocklisted, it will be unable to connect to any OSD.
+If an OSD is contained within an IP address or CIDR range that has been
+blocklisted, the OSD will be unable to perform operations on its peers when
+it acts as a client: such blocked operations include tiering and copy-from
+functionality. To add an IP address or CIDR range to the blocklist, or to
+remove it from the blocklist, run one of the following commands:
+
+.. prompt:: bash $
+
+   ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME]
+   ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits]
+
+If you add something to the blocklist with the above ``add`` command, you can
+use the ``TIME`` keyword to specify the length of time (in seconds) that it
+will remain on the blocklist (default: one hour). To add or remove a CIDR
+range, use the ``range`` keyword in the above commands.
+
+Note that these commands are useful primarily in failure testing. Under
+normal conditions, blocklists are maintained automatically and do not need
+any manual intervention.
+
+To create or delete a snapshot of a specific storage pool, run one of the
+following commands:
+
+.. prompt:: bash $
+
+   ceph osd pool mksnap {pool-name} {snap-name}
+   ceph osd pool rmsnap {pool-name} {snap-name}
+
+To create, delete, or rename a specific storage pool, run one of the
+following commands:
+
+.. prompt:: bash $
+
+   ceph osd pool create {pool-name} [pg_num [pgp_num]]
+   ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
+   ceph osd pool rename {old-name} {new-name}
+
+To change a pool setting, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd pool set {pool-name} {field} {value}
+
+The following are valid fields:
+
+   * ``size``: The number of copies of data in the pool.
+   * ``pg_num``: The PG number.
+   * ``pgp_num``: The effective number of PGs when calculating placement.
+   * ``crush_rule``: The rule number for mapping placement.
+
+To retrieve the value of a pool setting, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd pool get {pool-name} {field}
+
+Valid fields are:
+
+   * ``pg_num``: The PG number.
+   * ``pgp_num``: The effective number of PGs when calculating placement.
+
+To send a scrub command to a specific OSD, or to all OSDs (by using ``*``),
+run the following command:
+
+.. prompt:: bash $
+
+   ceph osd scrub {osd-num}
+
+To send a repair command to a specific OSD, or to all OSDs (by using ``*``),
+run the following command:
+
+.. prompt:: bash $
+
+   ceph osd repair N
+
+You can run a simple throughput benchmark test against a specific OSD. This
+test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB)
+incrementally, in multiple write requests that each have a size of
+``BYTES_PER_WRITE`` (default: 4 MB). The test is not destructive and it will
+not overwrite existing live OSD data, but it might temporarily affect the
+performance of clients that are concurrently accessing the OSD. To launch
+this benchmark test, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
+
+To clear the caches of a specific OSD during the interval between one
+benchmark run and another, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell osd.N cache drop
+
+To retrieve the cache statistics of a specific OSD, run the following
+command:
+
+.. prompt:: bash $
+
+   ceph tell osd.N cache status
+
+MDS Subsystem
+=============
+
+To change the configuration parameters of a running metadata server, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph tell mds.{mds-id} config set {setting} {value}
+
+For example, to enable debug messages, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell mds.0 config set debug_ms 1
+
+To display the status of all metadata servers, run the following command:
+
+.. prompt:: bash $
+
+   ceph mds stat
+
+To mark the active metadata server as failed (and to trigger failover to a
+standby if a standby is present), run the following command:
+
+.. prompt:: bash $
+
+   ceph mds fail 0
+
+.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
+
+
+Mon Subsystem
+=============
+
+To display monitor statistics, run the following command:
+
+.. prompt:: bash $
+
+   ceph mon stat
+
+This command returns output similar to the following:
+
+::
+
+   e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
+
+There is a ``quorum`` list at the end of the output. It lists those monitor
+nodes that are part of the current quorum.
+
+To retrieve this information in a more direct way, run the following command:
+
+.. prompt:: bash $
+
+   ceph quorum_status -f json-pretty
+
+This command returns output similar to the following:
+
+.. code-block:: javascript
+
+   {
+       "election_epoch": 6,
+       "quorum": [
+           0,
+           1,
+           2
+       ],
+       "quorum_names": [
+           "a",
+           "b",
+           "c"
+       ],
+       "quorum_leader_name": "a",
+       "monmap": {
+           "epoch": 2,
+           "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+           "modified": "2016-12-26 14:42:09.288066",
+           "created": "2016-12-26 14:42:03.573585",
+           "features": {
+               "persistent": [
+                   "kraken"
+               ],
+               "optional": []
+           },
+           "mons": [
+               {
+                   "rank": 0,
+                   "name": "a",
+                   "addr": "127.0.0.1:40000\/0",
+                   "public_addr": "127.0.0.1:40000\/0"
+               },
+               {
+                   "rank": 1,
+                   "name": "b",
+                   "addr": "127.0.0.1:40001\/0",
+                   "public_addr": "127.0.0.1:40001\/0"
+               },
+               {
+                   "rank": 2,
+                   "name": "c",
+                   "addr": "127.0.0.1:40002\/0",
+                   "public_addr": "127.0.0.1:40002\/0"
+               }
+           ]
+       }
+   }
+
+
+The above command will block until a quorum is reached.
+
+To see the status of a specific monitor, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell mon.[name] mon_status
+
+Here the value of ``[name]`` can be found by consulting the output of the
+``ceph quorum_status`` command. This command returns output similar to the
+following:
+
+::
+
+   {
+       "name": "b",
+       "rank": 1,
+       "state": "peon",
+       "election_epoch": 6,
+       "quorum": [
+           0,
+           1,
+           2
+       ],
+       "features": {
+           "required_con": "9025616074522624",
+           "required_mon": [
+               "kraken"
+           ],
+           "quorum_con": "1152921504336314367",
+           "quorum_mon": [
+               "kraken"
+           ]
+       },
+       "outside_quorum": [],
+       "extra_probe_peers": [],
+       "sync_provider": [],
+       "monmap": {
+           "epoch": 2,
+           "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+           "modified": "2016-12-26 14:42:09.288066",
+           "created": "2016-12-26 14:42:03.573585",
+           "features": {
+               "persistent": [
+                   "kraken"
+               ],
+               "optional": []
+           },
+           "mons": [
+               {
+                   "rank": 0,
+                   "name": "a",
+                   "addr": "127.0.0.1:40000\/0",
+                   "public_addr": "127.0.0.1:40000\/0"
+               },
+               {
+                   "rank": 1,
+                   "name": "b",
+                   "addr": "127.0.0.1:40001\/0",
+                   "public_addr": "127.0.0.1:40001\/0"
+               },
+               {
+                   "rank": 2,
+                   "name": "c",
+                   "addr": "127.0.0.1:40002\/0",
+                   "public_addr": "127.0.0.1:40002\/0"
+               }
+           ]
+       }
+   }
+
+To see a dump of the monitor state, run the following command:
+
+..
prompt:: bash $ + + ceph mon dump + +This command returns output similar to the following: + +:: + + dumped monmap epoch 2 + epoch 2 + fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc + last_changed 2016-12-26 14:42:09.288066 + created 2016-12-26 14:42:03.573585 + 0: 127.0.0.1:40000/0 mon.a + 1: 127.0.0.1:40001/0 mon.b + 2: 127.0.0.1:40002/0 mon.c diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 000000000..46a4a4f74 --- /dev/null +++ b/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,746 @@ +Manually editing the CRUSH Map +============================== + +.. note:: Manually editing the CRUSH map is an advanced administrator + operation. For the majority of installations, CRUSH changes can be + implemented via the Ceph CLI and do not require manual CRUSH map edits. If + you have identified a use case where manual edits *are* necessary with a + recent Ceph release, consider contacting the Ceph developers at dev@ceph.io + so that future versions of Ceph do not have this problem. + +To edit an existing CRUSH map, carry out the following procedure: + +#. `Get the CRUSH map`_. +#. `Decompile`_ the CRUSH map. +#. Edit at least one of the following sections: `Devices`_, `Buckets`_, and + `Rules`_. Use a text editor for this task. +#. `Recompile`_ the CRUSH map. +#. `Set the CRUSH map`_. + +For details on setting the CRUSH map rule for a specific pool, see `Set Pool +Values`_. + +.. _Get the CRUSH map: #getcrushmap +.. _Decompile: #decompilecrushmap +.. _Devices: #crushmapdevices +.. _Buckets: #crushmapbuckets +.. _Rules: #crushmaprules +.. _Recompile: #compilecrushmap +.. _Set the CRUSH map: #setcrushmap +.. _Set Pool Values: ../pools#setpoolvalues + +.. _getcrushmap: + +Get the CRUSH Map +----------------- + +To get the CRUSH map for your cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph osd getcrushmap -o {compiled-crushmap-filename} + +Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have +specified. Because the CRUSH map is in a compiled form, you must first +decompile it before you can edit it. + +.. _decompilecrushmap: + +Decompile the CRUSH Map +----------------------- + +To decompile the CRUSH map, run a command of the following form: + +.. prompt:: bash $ + + crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} + +.. _compilecrushmap: + +Recompile the CRUSH Map +----------------------- + +To compile the CRUSH map, run a command of the following form: + +.. prompt:: bash $ + + crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename} + +.. _setcrushmap: + +Set the CRUSH Map +----------------- + +To set the CRUSH map for your cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph osd setcrushmap -i {compiled-crushmap-filename} + +Ceph loads (``-i``) a compiled CRUSH map from the filename that you have +specified. + +Sections +-------- + +A CRUSH map has six main sections: + +#. **tunables:** The preamble at the top of the map describes any *tunables* + that are not a part of legacy CRUSH behavior. These tunables correct for old + bugs, optimizations, or other changes that have been made over the years to + improve CRUSH's behavior. + +#. **devices:** Devices are individual OSDs that store data. + +#. **types**: Bucket ``types`` define the types of buckets that are used in + your CRUSH hierarchy. + +#. 
**buckets:** Buckets consist of a hierarchical aggregation of storage + locations (for example, rows, racks, chassis, hosts) and their assigned + weights. After the bucket ``types`` have been defined, the CRUSH map defines + each node in the hierarchy, its type, and which devices or other nodes it + contains. + +#. **rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** ``choose_args`` are alternative weights associated with + the hierarchy that have been adjusted in order to optimize data placement. A + single ``choose_args`` map can be used for the entire cluster, or a number + of ``choose_args`` maps can be created such that each map is crafted for a + particular pool. + + +.. _crushmapdevices: + +CRUSH-Map Devices +----------------- + +Devices are individual OSDs that store data. In this section, there is usually +one device defined for each OSD daemon in your cluster. Devices are identified +by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where +``N`` is the device's ``id``). + + +.. _crush-map-device-class: + +A device can also have a *device class* associated with it: for example, +``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted +by CRUSH rules. This means that device classes allow CRUSH rules to select only +OSDs that match certain characteristics. For example, you might want an RBD +pool associated only with SSDs and a different RBD pool associated only with +HDDs. + +To see a list of devices, run the following command: + +.. prompt:: bash # + + ceph device ls + +The output of this command takes the following form: + +:: + + device {num} {osd.name} [class {class}] + +For example: + +.. prompt:: bash # + + ceph device ls + +:: + + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This +daemon might map to a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a small RAID +device or a partition of a larger storage device. + + +CRUSH-Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a +hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets) +typically represent physical locations in a hierarchy. Nodes aggregate other +nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their +corresponding storage media. + +.. tip:: In the context of CRUSH, the term "bucket" is used to refer to + a node in the hierarchy (that is, to a location or a piece of physical + hardware). In the context of RADOS Gateway APIs, however, the term + "bucket" has a different meaning. + +To add a bucket type to the CRUSH map, create a new line under the list of +bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. +By convention, there is exactly one leaf bucket type and it is ``type 0``; +however, you may give the leaf bucket any name you like (for example: ``osd``, +``disk``, ``drive``, ``storage``):: + + # types + type {num} {bucket-name} + +For example:: + + # types + type 0 osd + type 1 host + type 2 chassis + type 3 rack + type 4 row + type 5 pdu + type 6 pod + type 7 room + type 8 datacenter + type 9 zone + type 10 region + type 11 root + +.. 
_crushmapbuckets:
+
+CRUSH-Map Bucket Hierarchy
+--------------------------
+
+The CRUSH algorithm distributes data objects among storage devices according
+to a per-device weight value, approximating a uniform probability
+distribution. CRUSH distributes objects and their replicas according to the
+hierarchical cluster map you define. The CRUSH map represents the available
+storage devices and the logical elements that contain them.
+
+To map placement groups (PGs) to OSDs across failure domains, a CRUSH map
+defines a hierarchical list of bucket types under ``#types`` in the generated
+CRUSH map. The purpose of creating a bucket hierarchy is to segregate the
+leaf nodes according to their failure domains (for example: hosts, chassis,
+racks, power distribution units, pods, rows, rooms, and data centers). With
+the exception of the leaf nodes that represent OSDs, the hierarchy is
+arbitrary and you may define it according to your own needs.
+
+We recommend adapting your CRUSH map to your preferred hardware-naming
+conventions and using bucket names that clearly reflect the physical
+hardware. Clear naming practice can make it easier to administer the cluster
+and easier to troubleshoot problems when OSDs malfunction (or other hardware
+malfunctions) and the administrator needs access to physical hardware.
+
+
+In the following example, the bucket hierarchy has a leaf bucket named
+``osd`` and two node buckets named ``host`` and ``rack``:
+
+.. ditaa::
+                           +-----------+
+                           | {o}rack   |
+                           |   Bucket  |
+                           +-----+-----+
+                                 |
+                 +---------------+---------------+
+                 |                               |
+           +-----+-----+                   +-----+-----+
+           | {o}host   |                   | {o}host   |
+           |   Bucket  |                   |   Bucket  |
+           +-----+-----+                   +-----+-----+
+                 |                               |
+         +-------+-------+               +-------+-------+
+         |               |               |               |
+   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
+   |    osd    |   |    osd    |   |    osd    |   |    osd    |
+   |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
+   +-----------+   +-----------+   +-----------+   +-----------+
+
+.. note:: The higher-numbered ``rack`` bucket type aggregates the
+   lower-numbered ``host`` bucket type.
+
+Because leaf nodes reflect storage devices that have already been declared
+under the ``#devices`` list at the beginning of the CRUSH map, there is no
+need to declare them as bucket instances. The second-lowest bucket type in
+your hierarchy is typically used to aggregate the devices (that is, the
+second-lowest bucket type is usually the computer that contains the storage
+media, and you may use whatever term you prefer to describe it, such as
+``node``, ``computer``, ``server``, ``host``, or ``machine``). In
+high-density environments, it is common to have multiple hosts or nodes in a
+single chassis (for example, in the case of blades or twins). It is important
+to anticipate the potential consequences of chassis failure -- for example,
+if a chassis must be replaced after a node failure, the chassis's hosts or
+nodes (and their associated OSDs) will be in a ``down`` state.
+
+To declare a bucket instance, do the following: specify its type, give it a
+unique name (an alphanumeric string), assign it a unique ID expressed as a
+negative integer (this is optional), assign it a weight relative to the total
+capacity and capability of the item(s) in the bucket, assign it a bucket
+algorithm (usually ``straw2``), and specify the bucket algorithm's hash
+(usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A
+bucket may have one or more items. The items may consist of node buckets or
+leaves. Items may have a weight that reflects the relative weight of the
+item.
+
+To declare a node bucket, use the following syntax::
+
+    [bucket-type] [bucket-name] {
+        id [a unique negative numeric ID]
+        weight [the relative capacity/capability of the item(s)]
+        alg [the bucket algorithm: uniform | list | tree | straw | straw2 ]
+        hash [the hash type: 0 by default]
+        item [item-name] weight [weight]
+    }
+
+For example, in the above diagram, two host buckets (referred to in the
+declaration below as ``node1`` and ``node2``) and one rack bucket (referred to
+in the declaration below as ``rack1``) are defined. The OSDs are declared as
+items within the host buckets::
+
+    host node1 {
+        id -1
+        alg straw2
+        hash 0
+        item osd.0 weight 1.00
+        item osd.1 weight 1.00
+    }
+
+    host node2 {
+        id -2
+        alg straw2
+        hash 0
+        item osd.2 weight 1.00
+        item osd.3 weight 1.00
+    }
+
+    rack rack1 {
+        id -3
+        alg straw2
+        hash 0
+        item node1 weight 2.00
+        item node2 weight 2.00
+    }
+
+.. note:: In this example, the rack bucket does not contain any OSDs. Instead,
+   it contains lower-level host buckets and includes the sum of their weight in
+   the item entry.
+
+
+.. topic:: Bucket Types
+
+   Ceph supports five bucket types. Each bucket type provides a balance between
+   performance and reorganization efficiency, and each is different from the
+   others. If you are unsure of which bucket type to use, use the ``straw2``
+   bucket. For a more technical discussion of bucket types than is offered
+   here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized
+   Placement of Replicated Data`_.
+
+   The bucket types are as follows:
+
+   #. **uniform**: Uniform buckets aggregate devices that have **exactly**
+      the same weight. For example, when hardware is commissioned or
+      decommissioned, it is often done in sets of machines that have exactly
+      the same physical configuration (this can be the case, for example,
+      after bulk purchases). When storage devices have exactly the same
+      weight, you may use the ``uniform`` bucket type, which allows CRUSH to
+      map replicas into uniform buckets in constant time. If your devices have
+      non-uniform weights, you should not use the uniform bucket algorithm.
+
+   #. **list**: List buckets aggregate their content as linked lists. The
+      behavior of list buckets is governed by the :abbr:`RUSH (Replication
+      Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this
+      bucket type, an object is either relocated to the newest device in
+      accordance with an appropriate probability, or it remains on the older
+      devices as before. This results in optimal data migration when items are
+      added to the bucket. The removal of items from the middle or the tail of
+      the list, however, can result in a significant amount of unnecessary
+      data movement. This means that list buckets are most suitable for
+      circumstances in which they **never shrink or very rarely shrink**.
+
+   #. **tree**: Tree buckets use a binary search tree. They are more efficient
+      at dealing with buckets that contain many items than are list buckets.
+      The behavior of tree buckets is governed by the :abbr:`RUSH (Replication
+      Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the
+      placement time to O(log\ :sub:`n`). This means that tree buckets are
+      suitable for managing large sets of devices or nested buckets.
+
+   #. **straw**: Straw buckets allow all items in the bucket to "compete"
+      against each other for replica placement through a process analogous to
+      drawing straws.
+      This is different from the behavior of list buckets and
+      tree buckets, which use a divide-and-conquer strategy that either gives
+      certain items precedence (for example, those at the beginning of a list)
+      or obviates the need to consider entire subtrees of items. Such an
+      approach improves the performance of the replica placement process, but
+      can also introduce suboptimal reorganization behavior when the contents
+      of a bucket change due to an addition, a removal, or the re-weighting of
+      an item.
+
+   #. **straw2**: Straw2 buckets improve on Straw by correctly avoiding
+      any data movement between items when neighbor weights change. For
+      example, if the weight of a given item changes (including during the
+      operations of adding it to the cluster or removing it from the
+      cluster), there will be data movement to or from only that item.
+      Neighbor weights are not taken into account.
+
+
+.. topic:: Hash
+
+   Each bucket uses a hash algorithm. As of Reef, Ceph supports the
+   ``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm,
+   enter ``0`` as your hash setting.
+
+.. _weightingbucketitems:
+
+.. topic:: Weighting Bucket Items
+
+   Ceph expresses bucket weights as doubles, which allows for fine-grained
+   weighting. A weight is the relative difference between device capacities. We
+   recommend using ``1.00`` as the relative weight for a 1 TB storage device.
+   In such a scenario, a weight of ``0.50`` would represent approximately 500
+   GB, and a weight of ``3.00`` would represent approximately 3 TB. Buckets
+   higher in the CRUSH hierarchy have a weight that is the sum of the weight of
+   the leaf items aggregated by the bucket.
+
+
+.. _crushmaprules:
+
+CRUSH Map Rules
+---------------
+
+CRUSH maps have rules that include data placement for a pool: these are
+called "CRUSH rules". The default CRUSH map has one rule for each pool. If you
+are running a large cluster, you might create many pools and each of those
+pools might have its own non-default CRUSH rule.
+
+
+.. note:: In most cases, there is no need to modify the default rule. When a
+   new pool is created, by default the rule will be set to the value ``0``
+   (which indicates the default CRUSH rule, which has the numeric ID ``0``).
+
+CRUSH rules define policy that governs how data is distributed across the
+devices in the hierarchy. The rules define placement as well as replication
+strategies or distribution policies that allow you to specify exactly how
+CRUSH places data replicas. For example, you might create one rule selecting
+a pair of targets for two-way mirroring, another rule for selecting three
+targets in two different data centers for three-way replication, and yet
+another rule for erasure coding across six storage devices. For a detailed
+discussion of CRUSH rules, see **Section 3.2** of `CRUSH - Controlled,
+Scalable, Decentralized Placement of Replicated Data`_.
+
+A rule takes the following form::
+
+    rule <rulename> {
+
+        id [a unique integer ID]
+        type [replicated|erasure]
+        step take <bucket-name> [class <device-class>]
+        step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
+        step emit
+    }
+
+
+``id``
+   :Description: A unique integer that identifies the rule.
+   :Purpose: A component of the rule mask.
+   :Type: Integer
+   :Required: Yes
+   :Default: 0
+
+
+``type``
+   :Description: Denotes the type of replication strategy to be enforced by
+                 the rule.
+   :Purpose: A component of the rule mask.
+   :Type: String
+   :Required: Yes
+   :Default: ``replicated``
+   :Valid Values: ``replicated`` or ``erasure``
+
+
+``step take <bucket-name> [class <device-class>]``
+   :Description: Takes a bucket name and iterates down the tree. If
+                 the ``device-class`` argument is specified, the argument must
+                 match a class assigned to OSDs within the cluster. Only
+                 devices belonging to the class are included.
+   :Purpose: A component of the rule.
+   :Required: Yes
+   :Example: ``step take data``
+
+
+
+``step choose firstn {num} type {bucket-type}``
+   :Description: Selects ``num`` buckets of the given type from within the
+                 current bucket. ``{num}`` is usually the number of replicas in
+                 the pool (in other words, the pool size).
+
+                 - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
+                 - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
+                 - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets.
+
+   :Purpose: A component of the rule.
+   :Prerequisite: Follows ``step take`` or ``step choose``.
+   :Example: ``step choose firstn 1 type row``
+
+
+``step chooseleaf firstn {num} type {bucket-type}``
+   :Description: Selects a set of buckets of the given type and chooses a leaf
+                 node (that is, an OSD) from the subtree of each bucket in that set of buckets. The
+                 number of buckets in the set is usually the number of replicas in
+                 the pool (in other words, the pool size).
+
+                 - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
+                 - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
+                 - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets.
+   :Purpose: A component of the rule. Using ``chooseleaf`` obviates the need to select a device in a separate step.
+   :Prerequisite: Follows ``step take`` or ``step choose``.
+   :Example: ``step chooseleaf firstn 0 type row``
+
+
+``step emit``
+   :Description: Outputs the current value on the top of the stack and empties
+                 the stack. Typically used
+                 at the end of a rule, but may also be used to choose from different
+                 trees in the same rule.
+
+   :Purpose: A component of the rule.
+   :Prerequisite: Follows ``step choose``.
+   :Example: ``step emit``
+
+.. important:: A single CRUSH rule can be assigned to multiple pools, but
+   a single pool cannot have multiple CRUSH rules.
+
+``firstn`` or ``indep``
+
+   :Description: Determines which replacement strategy CRUSH uses when items (OSDs)
+                 are marked ``down`` in the CRUSH map. When this rule is used
+                 with replicated pools, ``firstn`` is used. When this rule is
+                 used with erasure-coded pools, ``indep`` is used.
+
+                 Suppose that a PG is stored on OSDs 1, 2, 3, 4, and 5 and then
+                 OSD 3 goes down.
+
+                 When in ``firstn`` mode, CRUSH simply adjusts its calculation
+                 to select OSDs 1 and 2, then selects 3 and discovers that 3 is
+                 down, retries and selects 4 and 5, and finally goes on to
+                 select a new OSD: OSD 6. The final CRUSH mapping
+                 transformation is therefore 1, 2, 3, 4, 5 → 1, 2, 4, 5, 6.
+
+                 However, if you were storing an erasure-coded pool, the above
+                 sequence would have changed the data that is mapped to OSDs 4,
+                 5, and 6. The ``indep`` mode attempts to avoid this unwanted
+                 consequence. When in ``indep`` mode, CRUSH can be expected to
+                 select 3, discover that 3 is down, retry, and select 6. The
+                 final CRUSH mapping transformation is therefore 1, 2, 3, 4, 5
+                 → 1, 2, 6, 4, 5.
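+To see how a rule actually maps PGs to OSDs, the ``crushtool`` utility can
+enumerate sample mappings from a compiled CRUSH map. This invocation is an
+illustrative sketch; the file name and rule ID are assumptions::
+
+    ceph osd getcrushmap -o crush.bin
+    crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-mappings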
+
+.. _crush-reclassify:
+
+Migrating from a legacy SSD rule to device classes
+--------------------------------------------------
+
+Prior to the Luminous release's introduction of the *device class* feature, in
+order to write rules that applied to a specialized device type (for example,
+SSD), it was necessary to manually edit the CRUSH map and maintain a parallel
+hierarchy for each device type. The device class feature provides a more
+transparent way to achieve this end.
+
+However, if your cluster is migrated from an existing manually-customized
+per-device map to new device class-based rules, all data in the system will be
+reshuffled.
+
+The ``crushtool`` utility has several commands that can transform a legacy rule
+and hierarchy and allow you to start using the new device class rules. There
+are three possible types of transformation:
+
+#. ``--reclassify-root <root-name> <device-class>``
+
+   This command examines everything under ``root-name`` in the hierarchy and
+   rewrites any rules that reference the specified root and that have the
+   form ``take <root-name>`` so that they instead have the
+   form ``take <root-name> class <device-class>``. The command also renumbers
+   the buckets in such a way that the old IDs are used for the specified
+   class's "shadow tree" and as a result no data movement takes place.
+
+   For example, suppose you have the following as an existing rule::
+
+       rule replicated_rule {
+          id 0
+          type replicated
+          step take default
+          step chooseleaf firstn 0 type rack
+          step emit
+       }
+
+   If the root ``default`` is reclassified as class ``hdd``, the new rule will
+   be as follows::
+
+       rule replicated_rule {
+          id 0
+          type replicated
+          step take default class hdd
+          step chooseleaf firstn 0 type rack
+          step emit
+       }
+
+#. ``--set-subtree-class <bucket-name> <device-class>``
+
+   This command marks every device in the subtree that is rooted at *bucket-name*
+   with the specified device class.
+
+   This command is typically used in conjunction with the ``--reclassify-root`` option
+   in order to ensure that all devices in that root are labeled with the
+   correct class. In certain circumstances, however, some of those devices
+   are correctly labeled with a different class and must not be relabeled. To
+   manage this difficulty, one can exclude the ``--set-subtree-class``
+   option. The remapping process will not be perfect, because the previous rule
+   had an effect on devices of multiple classes but the adjusted rules will map
+   only to devices of the specified device class. However, when there are not many
+   outlier devices, the resulting level of data movement is often within tolerable
+   limits.
+
+
+#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``
+
+   This command allows you to merge a parallel type-specific hierarchy with the
+   normal hierarchy. For example, many users have maps that resemble the
+   following::
+
+       host node1 {
+          id -2           # do not change unnecessarily
+          # weight 109.152
+          alg straw2
+          hash 0  # rjenkins1
+          item osd.0 weight 9.096
+          item osd.1 weight 9.096
+          item osd.2 weight 9.096
+          item osd.3 weight 9.096
+          item osd.4 weight 9.096
+          item osd.5 weight 9.096
+          ...
+       }
+
+       host node1-ssd {
+          id -10          # do not change unnecessarily
+          # weight 2.000
+          alg straw2
+          hash 0  # rjenkins1
+          item osd.80 weight 2.000
+          ...
+       }
+
+       root default {
+          id -1           # do not change unnecessarily
+          alg straw2
+          hash 0  # rjenkins1
+          item node1 weight 110.967
+          ...
+       }
+
+       root ssd {
+          id -18          # do not change unnecessarily
+          # weight 16.000
+          alg straw2
+          hash 0  # rjenkins1
+          item node1-ssd weight 2.000
+          ...
+       }
+
+   This command reclassifies each bucket that matches a certain pattern.
+   The pattern can be of the form ``%suffix`` or ``prefix%``. For
+   example, in the above example, we would use the pattern
+   ``%-ssd``. For each matched bucket, the remaining portion of the
+   name (corresponding to the ``%`` wildcard) specifies the *base bucket*. All
+   devices in the matched bucket are labeled with the specified
+   device class and then moved to the base bucket. If the base bucket
+   does not exist (for example, ``node12-ssd`` exists but ``node12`` does
+   not), then it is created and linked under the specified
+   *default parent* bucket. In each case, care is taken to preserve
+   the old bucket IDs for the new shadow buckets in order to prevent data
+   movement. Any rules with ``take`` steps that reference the old
+   buckets are adjusted accordingly.
+
+
+#. ``--reclassify-bucket <bucket-name> <device-class> <default-parent>``
+
+   The same command can also be used without a wildcard in order to map a
+   single bucket. For example, in the previous example, we want the
+   ``ssd`` bucket to be mapped to the ``default`` bucket.
+
+#. The final command to convert the map that consists of the above fragments
+   resembles the following:
+
+   .. prompt:: bash $
+
+      ceph osd getcrushmap -o original
+      crushtool -i original --reclassify \
+        --set-subtree-class default hdd \
+        --reclassify-root default hdd \
+        --reclassify-bucket %-ssd ssd default \
+        --reclassify-bucket ssd ssd default \
+        -o adjusted
+
+``--compare`` flag
+------------------
+
+A ``--compare`` flag is available to make sure that the conversion performed in
+:ref:`Migrating from a legacy SSD rule to device classes <crush-reclassify>` is
+correct. This flag tests a large sample of inputs against the CRUSH map and
+checks that the expected result is output. The options that control these
+inputs are the same as the options that apply to the ``--test`` command. For an
+illustration of how this ``--compare`` command applies to the above example,
+see the following:
+
+.. prompt:: bash $
+
+   crushtool -i original --compare adjusted
+
+::
+
+   rule 0 had 0/10240 mismatched mappings (0)
+   rule 1 had 0/10240 mismatched mappings (0)
+   maps appear equivalent
+
+If the command finds any differences, the ratio of remapped inputs is reported
+in the parentheses.
+
+When you are satisfied with the adjusted map, apply it to the cluster by
+running the following command:
+
+.. prompt:: bash $
+
+   ceph osd setcrushmap -i adjusted
+
+Manually Tuning CRUSH
+---------------------
+
+If you have verified that all clients are running recent code, you can adjust
+the CRUSH tunables by extracting the CRUSH map, modifying the values, and
+reinjecting the map into the cluster. The procedure is carried out as follows:
+
+#. Extract the latest CRUSH map:
+
+   .. prompt:: bash $
+
+      ceph osd getcrushmap -o /tmp/crush
+
+#. Adjust tunables. In our tests, the following values appear to result in the
+   best behavior for both large and small clusters. The procedure requires that
+   you specify the ``--enable-unsafe-tunables`` flag in the ``crushtool``
+   command. Use this option with **extreme care**:
+
+   .. prompt:: bash $
+
+      crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
+
+#. Reinject the modified map:
+
+   .. prompt:: bash $
+
+      ceph osd setcrushmap -i /tmp/crush.new
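+
+To confirm which values are in effect after the reinjection, you can dump the
+cluster's current tunables:
+
+.. prompt:: bash $
+
+   ceph osd crush show-tunables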
+Legacy values
+-------------
+
+To set the legacy values of the CRUSH tunables, run the following command:
+
+.. prompt:: bash $
+
+   crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
+
+The special ``--enable-unsafe-tunables`` flag is required. Be careful when
+running old versions of the ``ceph-osd`` daemon after reverting to legacy
+values, because the feature bit is not perfectly enforced.
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst
new file mode 100644
index 000000000..39151e6d4
--- /dev/null
+++ b/doc/rados/operations/crush-map.rst
@@ -0,0 +1,1147 @@
+============
+ CRUSH Maps
+============
+
+The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
+computes storage locations in order to determine how to store and retrieve
+data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
+through a centralized server or broker. By using an algorithmically-determined
+method of storing and retrieving data, Ceph avoids a single point of failure, a
+performance bottleneck, and a physical limit to its scalability.
+
+CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
+distributing the data across the cluster in accordance with configured
+replication policy and failure domains. For a detailed discussion of CRUSH, see
+`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.
+
+CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a
+hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH
+replicates data within the cluster's pools. By reflecting the underlying
+physical organization of the installation, CRUSH can model (and thereby
+address) the potential for correlated device failures. Some factors relevant
+to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
+power source, shared networking, and failure domains. By encoding this
+information into the CRUSH map, CRUSH placement policies distribute object
+replicas across failure domains while maintaining the desired distribution. For
+example, to address the possibility of concurrent failures, it might be
+desirable to ensure that data replicas are on devices that reside in or rely
+upon different shelves, racks, power supplies, controllers, or physical
+locations.
+
+When OSDs are deployed, they are automatically added to the CRUSH map under a
+``host`` bucket that is named for the node on which the OSDs run. This
+behavior, combined with the configured CRUSH failure domain, ensures that
+replicas or erasure-code shards are distributed across hosts and that the
+failure of a single host or other kinds of failures will not affect
+availability. For larger clusters, administrators must carefully consider their
+choice of failure domain. For example, distributing replicas across racks is
+typical for mid- to large-sized clusters.
+
+
+CRUSH Location
+==============
+
+The location of an OSD within the CRUSH map's hierarchy is referred to as its
+``CRUSH location``. The specification of a CRUSH location takes the form of a
+list of key-value pairs. For example, if an OSD is in a particular row, rack,
+chassis, and host, and is also part of the 'default' CRUSH root (which is the
+case for most clusters), its CRUSH location can be specified as follows::
+
+   root=default row=a rack=a2 chassis=a2a host=a2a1
+
+.. note::
+
+   #. The order of the keys does not matter.
+   #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
+      valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
+      ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
+      types suffice for nearly all clusters, but can be customized by
+      modifying the CRUSH map.
+   #. Not all keys need to be specified. For example, by default, Ceph
+      automatically sets an ``OSD``'s location as ``root=default
+      host=HOSTNAME`` (as determined by the output of ``hostname -s``).
+
+The CRUSH location for an OSD can be modified by adding the ``crush location``
+option in ``ceph.conf``. When this option has been added, every time the OSD
+starts it verifies that it is in the correct location in the CRUSH map and
+moves itself if it is not. To disable this automatic CRUSH map management, add
+the following to the ``ceph.conf`` configuration file in the ``[osd]``
+section::
+
+   osd crush update on start = false
+
+Note that this action is unnecessary in most cases.
+
+
+Custom location hooks
+---------------------
+
+A custom location hook can be used to generate a more complete CRUSH location
+on startup. The CRUSH location is determined by, in order of preference:
+
+#. A ``crush location`` option in ``ceph.conf``
+#. A default of ``root=default host=HOSTNAME`` where the hostname is determined
+   by the output of the ``hostname -s`` command
+
+A script can be written to provide additional location fields (for example,
+``rack`` or ``datacenter``) and the hook can be enabled via the following
+config option::
+
+   crush location hook = /path/to/customized-ceph-crush-location
+
+This hook is passed several arguments (see below). The hook outputs a single
+line to ``stdout`` that contains the CRUSH location description. The output
+resembles the following::
+
+   --cluster CLUSTER --id ID --type TYPE
+
+Here the cluster name is typically ``ceph``, the ``id`` is the daemon
+identifier or (in the case of OSDs) the OSD number, and the daemon type is
+``osd``, ``mds``, ``mgr``, or ``mon``.
+
+For example, a simple hook that specifies a rack location via a value in the
+file ``/etc/rack`` might be as follows::
+
+   #!/bin/sh
+   echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
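+
+Such a hook must also be executable. A minimal sketch of installing it (the
+path is illustrative, and this assumes the centralized configuration database
+is used rather than a ``ceph.conf`` entry):
+
+.. prompt:: bash $
+
+   chmod +x /path/to/customized-ceph-crush-location
+   ceph config set osd crush_location_hook /path/to/customized-ceph-crush-location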
+
+
+CRUSH structure
+===============
+
+The CRUSH map consists of (1) a hierarchy that describes the physical topology
+of the cluster and (2) a set of rules that defines data placement policy. The
+hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
+other physical features or groupings: hosts, racks, rows, data centers, and so
+on. The rules determine how replicas are placed in terms of that hierarchy (for
+example, 'three replicas in different racks').
+
+Devices
+-------
+
+Devices are individual OSDs that store data (usually one device for each
+storage drive). Devices are identified by an ``id`` (a non-negative integer)
+and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).
+
+In Luminous and later releases, OSDs can have a *device class* assigned (for
+example, ``hdd`` or ``ssd`` or ``nvme``), allowing them to be targeted by CRUSH
+rules. Device classes are especially useful when mixing device types within
+hosts.
+
+.. _crush_map_default_types:
+
+Types and Buckets
+-----------------
+
+"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
+the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
+*types* that are used to identify these nodes. Default types include:
+
+- ``osd`` (or ``device``)
+- ``host``
+- ``chassis``
+- ``rack``
+- ``row``
+- ``pdu``
+- ``pod``
+- ``room``
+- ``datacenter``
+- ``zone``
+- ``region``
+- ``root``
+
+Most clusters use only a handful of these types, and other types can be defined
+as needed.
+
+The hierarchy is built with devices (normally of type ``osd``) at the leaves
+and non-device types as the internal nodes. The root node is of type ``root``.
+For example:
+
+
+.. ditaa::
+
+                        +-----------------+
+                        |{o}root default  |
+                        +--------+--------+
+                                 |
+                 +---------------+---------------+
+                 |                               |
+          +------+------+                 +------+------+
+          |{o}host foo  |                 |{o}host bar  |
+          +------+------+                 +------+------+
+                 |                               |
+         +-------+-------+               +-------+-------+
+         |               |               |               |
+   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
+   |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
+   +-----------+   +-----------+   +-----------+   +-----------+
+
+
+Each node (device or bucket) in the hierarchy has a *weight* that indicates the
+relative proportion of the total data that should be stored by that device or
+hierarchy subtree. Weights are set at the leaves, indicating the size of the
+device. These weights automatically sum in an 'up the tree' direction: that is,
+the weight of the ``root`` node will be the sum of the weights of all devices
+contained under it. Weights are typically measured in tebibytes (TiB).
+
+To get a simple view of the cluster's CRUSH hierarchy, including weights, run
+the following command:
+
+.. prompt:: bash $
+
+   ceph osd tree
+
+Rules
+-----
+
+CRUSH rules define policy governing how data is distributed across the devices
+in the hierarchy. The rules define placement as well as replication strategies
+or distribution policies that allow you to specify exactly how CRUSH places
+data replicas. For example, you might create one rule selecting a pair of
+targets for two-way mirroring, another rule for selecting three targets in two
+different data centers for three-way replication, and yet another rule for
+erasure coding across six storage devices. For a detailed discussion of CRUSH
+rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
+Placement of Replicated Data`_.
+
+CRUSH rules can be created via the command-line by specifying the *pool type*
+that they will govern (replicated or erasure coded), the *failure domain*, and
+optionally a *device class*. In rare cases, CRUSH rules must be created by
+manually editing the CRUSH map.
+
+To see the rules that are defined for the cluster, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush rule ls
+
+To view the contents of the rules, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush rule dump
+
+.. _device_classes:
+
+Device classes
+--------------
+
+Each device can optionally have a *class* assigned. By default, OSDs
+automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
+accordance with the type of device they are backed by.
+
+To explicitly set the device class of one or more OSDs, run a command of the
+following form:
+
+.. prompt:: bash $
+
+   ceph osd crush set-device-class <class> <osd-name> [...]
+
+Once a device class has been set, it cannot be changed to another class until
+the old class is unset. To remove the old class of one or more OSDs, run a
+command of the following form:
+
+.. prompt:: bash $
+
+   ceph osd crush rm-device-class <osd-name> [...]
+
+This restriction allows administrators to set device classes that won't be
+changed on OSD restart or by a script.
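+
+For example, to relabel a hypothetical ``osd.0`` so that rules treat it as an
+SSD, you might first remove its automatically assigned class and then set the
+new one:
+
+.. prompt:: bash $
+
+   ceph osd crush rm-device-class osd.0
+   ceph osd crush set-device-class ssd osd.0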
+
+To create a placement rule that targets a specific device class, run a command
+of the following form:
+
+.. prompt:: bash $
+
+   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+
+To apply the new placement rule to a specific pool, run a command of the
+following form:
+
+.. prompt:: bash $
+
+   ceph osd pool set <pool-name> crush_rule <rule-name>
+
+Device classes are implemented by creating one or more "shadow" CRUSH
+hierarchies. For each device class in use, there will be a shadow hierarchy
+that contains only devices of that class. CRUSH rules can then distribute data
+across the relevant shadow hierarchy. This approach is fully backward
+compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
+items displayed, run the following command:
+
+.. prompt:: bash #
+
+   ceph osd crush tree --show-shadow
+
+Some older clusters that were created before the Luminous release rely on
+manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
+clusters, there is a *reclassify* tool available that can help them transition
+to device classes without triggering unwanted data movement (see
+:ref:`crush-reclassify`).
+
+Weight sets
+-----------
+
+A *weight set* is an alternative set of weights to use when calculating data
+placement. The normal weights associated with each device in the CRUSH map are
+set in accordance with the device size and indicate how much data should be
+stored where. However, because CRUSH is a probabilistic pseudorandom placement
+process, there is always some variation from this ideal distribution (in the
+same way that rolling a die sixty times will likely not result in exactly ten
+ones and ten sixes). Weight sets allow the cluster to perform numerical
+optimization based on the specifics of your cluster (for example: hierarchy,
+pools) to achieve a balanced distribution.
+
+Ceph supports two types of weight sets:
+
+#. A **compat** weight set is a single alternative set of weights for each
+   device and each node in the cluster. Compat weight sets cannot be expected
+   to correct all anomalies (for example, PGs for different pools might be of
+   different sizes and have different load levels, but are mostly treated alike
+   by the balancer). However, they have the major advantage of being *backward
+   compatible* with previous versions of Ceph. This means that even though
+   weight sets were first introduced in Luminous v12.2.z, older clients (for
+   example, Firefly) can still connect to the cluster when a compat weight set
+   is being used to balance data.
+
+#. A **per-pool** weight set is more flexible in that it allows placement to
+   be optimized for each data pool. Additionally, weights can be adjusted
+   for each position of placement, allowing the optimizer to correct for a
+   subtle skew of data toward devices with small weights relative to their
+   peers (an effect that is usually apparent only in very large clusters
+   but that can cause balancing problems).
+
+When weight sets are in use, the weights associated with each node in the
+hierarchy are visible in a separate column (labeled either as ``(compat)`` or
+as the pool name) in the output of the following command:
+
+.. prompt:: bash #
+
+   ceph osd tree
+
+If both *compat* and *per-pool* weight sets are in use, data placement for a
+particular pool will use its own per-pool weight set if present. If only
+*compat* weight sets are in use, data placement will use the compat weight set.
+If neither is in use, data placement will use the normal CRUSH weights.
+
+Although weight sets can be set up and adjusted manually, we recommend enabling
+the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
+cluster is running Luminous or a later release.
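+
+A minimal sketch of enabling the balancer in its weight-set-based mode
+(``crush-compat``); consult the balancer documentation before choosing a mode
+for your cluster:
+
+.. prompt:: bash $
+
+   ceph balancer mode crush-compat
+   ceph balancer on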
+
+Modifying the CRUSH map
+=======================
+
+.. _addosd:
+
+Adding/Moving an OSD
+--------------------
+
+.. note:: Under normal conditions, OSDs automatically add themselves to the
+   CRUSH map when they are created. The command in this section is rarely
+   needed.
+
+
+To add or move an OSD in the CRUSH map of a running cluster, run a command of
+the following form:
+
+.. prompt:: bash $
+
+   ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
+
+For details on this command's parameters, see the following:
+
+``name``
+   :Description: The full name of the OSD.
+   :Type: String
+   :Required: Yes
+   :Example: ``osd.0``
+
+
+``weight``
+   :Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in terabytes (TB).
+   :Type: Double
+   :Required: Yes
+   :Example: ``2.0``
+
+
+``root``
+   :Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``).
+   :Type: Key-value pair.
+   :Required: Yes
+   :Example: ``root=default``
+
+
+``bucket-type``
+   :Description: The OSD's location in the CRUSH hierarchy.
+   :Type: Key-value pairs.
+   :Required: No
+   :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
+
+In the following example, the command adds ``osd.0`` to the hierarchy, or moves
+``osd.0`` from a previous location:
+
+.. prompt:: bash $
+
+   ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
+
+
+Adjusting OSD weight
+--------------------
+
+.. note:: Under normal conditions, OSDs automatically add themselves to the
+   CRUSH map with the correct weight when they are created. The command in this
+   section is rarely needed.
+
+To adjust an OSD's CRUSH weight in a running cluster, run a command of the
+following form:
+
+.. prompt:: bash $
+
+   ceph osd crush reweight {name} {weight}
+
+For details on this command's parameters, see the following:
+
+``name``
+   :Description: The full name of the OSD.
+   :Type: String
+   :Required: Yes
+   :Example: ``osd.0``
+
+
+``weight``
+   :Description: The CRUSH weight of the OSD.
+   :Type: Double
+   :Required: Yes
+   :Example: ``2.0``
+
+
+.. _removeosd:
+
+Removing an OSD
+---------------
+
+.. note:: OSDs are normally removed from the CRUSH map as a result of the
+   ``ceph osd purge`` command. This command is rarely needed.
+
+To remove an OSD from the CRUSH map of a running cluster, run a command of the
+following form:
+
+.. prompt:: bash $
+
+   ceph osd crush remove {name}
+
+For details on the ``name`` parameter, see the following:
+
+``name``
+   :Description: The full name of the OSD.
+   :Type: String
+   :Required: Yes
+   :Example: ``osd.0``
+
+
+Adding a CRUSH Bucket
+---------------------
+
+.. note:: Buckets are implicitly created when an OSD is added and the command
+   that creates it specifies a ``{bucket-type}={bucket-name}`` as part of the
+   OSD's location (provided that a bucket with that name does not already
+   exist). The command in this section is typically used when manually
+   adjusting the structure of the hierarchy after OSDs have already been
+   created. One use of this command is to move a series of hosts to a new
+   rack-level bucket.
Another use of this command is to add new ``host`` + buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive + any data until they are ready to receive data. When they are ready, move the + buckets to the ``default`` root or to any other root as described below. + +To add a bucket in the CRUSH map of a running cluster, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush add-bucket {bucket-name} {bucket-type} + +For details on this command's parameters, see the following: + +``bucket-name`` + :Description: The full name of the bucket. + :Type: String + :Required: Yes + :Example: ``rack12`` + + +``bucket-type`` + :Description: The type of the bucket. This type must already exist in the CRUSH hierarchy. + :Type: String + :Required: Yes + :Example: ``rack`` + +In the following example, the command adds the ``rack12`` bucket to the hierarchy: + +.. prompt:: bash $ + + ceph osd crush add-bucket rack12 rack + +Moving a Bucket +--------------- + +To move a bucket to a different location or position in the CRUSH map +hierarchy, run a command of the following form: + +.. prompt:: bash $ + + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + +For details on this command's parameters, see the following: + +``bucket-name`` + :Description: The name of the bucket that you are moving. + :Type: String + :Required: Yes + :Example: ``foo-bar-1`` + +``bucket-type`` + :Description: The bucket's new location in the CRUSH hierarchy. + :Type: Key-value pairs. + :Required: No + :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +Removing a Bucket +----------------- + +To remove a bucket from the CRUSH hierarchy, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +.. note:: A bucket must already be empty before it is removed from the CRUSH + hierarchy. In other words, there must not be OSDs or any other CRUSH buckets + within it. + +For details on the ``bucket-name`` parameter, see the following: + +``bucket-name`` + :Description: The name of the bucket that is being removed. + :Type: String + :Required: Yes + :Example: ``rack12`` + +In the following example, the command removes the ``rack12`` bucket from the +hierarchy: + +.. prompt:: bash $ + + ceph osd crush remove rack12 + +Creating a compat weight set +---------------------------- + +.. note:: Normally this action is done automatically if needed by the + ``balancer`` module (provided that the module is enabled). + +To create a *compat* weight set, run the following command: + +.. prompt:: bash $ + + ceph osd crush weight-set create-compat + +To adjust the weights of the compat weight set, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight-compat {name} {weight} + +To destroy the compat weight set, run the following command: + +.. prompt:: bash $ + + ceph osd crush weight-set rm-compat + +Creating per-pool weight sets +----------------------------- + +To create a weight set for a specific pool, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush weight-set create {pool-name} {mode} + +.. note:: Per-pool weight sets can be used only if all servers and daemons are + running Luminous v12.2.z or a later release. + +For details on this command's parameters, see the following: + +``pool-name`` + :Description: The name of a RADOS pool. + :Type: String + :Required: Yes + :Example: ``rbd`` + +``mode`` + :Description: Either ``flat`` or ``positional``. 
A *flat* weight set
+         assigns a single weight to all devices or buckets. A
+         *positional* weight set has a potentially different
+         weight for each position in the resulting placement
+         mapping. For example: if a pool has a replica count of
+         ``3``, then a positional weight set will have three
+         weights for each device and bucket.
+   :Type: String
+   :Required: Yes
+   :Example: ``flat``
+
+To adjust the weight of an item in a weight set, run a command of the following
+form:
+
+.. prompt:: bash $
+
+   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
+
+To list existing weight sets, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd crush weight-set ls
+
+To remove a weight set, run a command of the following form:
+
+.. prompt:: bash $
+
+   ceph osd crush weight-set rm {pool-name}
+
+
+Creating a rule for a replicated pool
+-------------------------------------
+
+When you create a CRUSH rule for a replicated pool, there is an important
+decision to make: selecting a failure domain. For example, if you select a
+failure domain of ``host``, then CRUSH will ensure that each replica of the
+data is stored on a unique host. Alternatively, if you select a failure domain
+of ``rack``, then each replica of the data will be stored in a different rack.
+Your selection of failure domain should be guided by the size of your cluster
+and by its CRUSH topology.
+
+The entire cluster hierarchy is typically nested beneath a root node that is
+named ``default``. If you have customized your hierarchy, you might want to
+create a rule nested beneath some other node in the hierarchy. In creating
+this rule for the customized hierarchy, the node type doesn't matter, and in
+particular the rule does not have to be nested beneath a ``root`` node.
+
+It is possible to create a rule that restricts data placement to a specific
+*class* of device. By default, Ceph OSDs automatically classify themselves as
+either ``hdd`` or ``ssd`` in accordance with the underlying type of device
+being used. These device classes can be customized. One might set the ``device
+class`` of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might
+set them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that
+rules and pools may be flexibly constrained to use (or avoid using) specific
+subsets of OSDs based on specific requirements.
+
+To create a rule for a replicated pool, run a command of the following form:
+
+.. prompt:: bash $
+
+   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
+
+For details on this command's parameters, see the following:
+
+``name``
+   :Description: The name of the rule.
+   :Type: String
+   :Required: Yes
+   :Example: ``rbd-rule``
+
+``root``
+   :Description: The name of the CRUSH hierarchy node under which data is to be placed.
+   :Type: String
+   :Required: Yes
+   :Example: ``default``
+
+``failure-domain-type``
+   :Description: The type of CRUSH nodes used for the replicas of the failure domain.
+   :Type: String
+   :Required: Yes
+   :Example: ``rack``
+
+``class``
+   :Description: The device class on which data is to be placed.
+   :Type: String
+   :Required: No
+   :Example: ``ssd``
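+
+For example, the following sketch (the rule and pool names are illustrative)
+creates a rule that places replicas on distinct hosts, restricted to SSDs, and
+then assigns it to an existing pool:
+
+.. prompt:: bash $
+
+   ceph osd crush rule create-replicated fast_ssd default host ssd
+   ceph osd pool set mypool crush_rule fast_ssd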
+
+Creating a rule for an erasure-coded pool
+-----------------------------------------
+
+For an erasure-coded pool, similar decisions need to be made: what the failure
+domain is, which node in the hierarchy data will be placed under (usually
+``default``), and whether placement is restricted to a specific device class.
+However, erasure-code pools are created in a different way: there is a need to
+construct them carefully with reference to the erasure code plugin in use. For
+this reason, these decisions must be incorporated into the **erasure-code
+profile**. A CRUSH rule will then be created from the erasure-code profile,
+either explicitly or automatically when the profile is used to create a pool.
+
+To list the erasure-code profiles, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd erasure-code-profile ls
+
+To view a specific existing profile, run a command of the following form:
+
+.. prompt:: bash $
+
+   ceph osd erasure-code-profile get {profile-name}
+
+Under normal conditions, profiles should never be modified; instead, a new
+profile should be created and used when creating either a new pool or a new
+rule for an existing pool.
+
+An erasure-code profile consists of a set of key-value pairs. Most of these
+key-value pairs govern the behavior of the erasure code that encodes data in
+the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH
+rule that is created.
+
+The relevant erasure-code profile properties are as follows:
+
+ * **crush-root**: the name of the CRUSH node under which to place data
+   [default: ``default``].
+ * **crush-failure-domain**: the CRUSH bucket type used in the distribution of
+   erasure-coded shards [default: ``host``].
+ * **crush-device-class**: the device class on which to place data [default:
+   none, which means that all devices are used].
+ * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
+   number of erasure-code shards, affecting the resulting CRUSH rule.
+
+After a profile is defined, you can create a CRUSH rule by running a command
+of the following form:
+
+.. prompt:: bash $
+
+   ceph osd crush rule create-erasure {name} {profile-name}
+
+.. note:: When creating a new pool, it is not necessary to create the rule
+   explicitly. If only the erasure-code profile is specified and the rule
+   argument is omitted, then Ceph will create the CRUSH rule automatically.
+
+
+Deleting rules
+--------------
+
+To delete rules that are not in use by pools, run a command of the following
+form:
+
+.. prompt:: bash $
+
+   ceph osd crush rule rm {rule-name}
+
+.. _crush-map-tunables:
+
+Tunables
+========
+
+The CRUSH algorithm that is used to calculate the placement of data has been
+improved over time. In order to support changes in behavior, we have provided
+users with sets of tunables that determine which legacy or optimal version of
+CRUSH is to be used.
+
+In order to use newer tunables, all Ceph clients and daemons must support the
+new major release of CRUSH. Because of this requirement, we have created
+``profiles`` that are named after the Ceph version in which they were
+introduced. For example, the ``firefly`` tunables were first supported by the
+Firefly release and do not work with older clients (for example, clients
+running Dumpling). After a cluster's tunables profile is changed from a legacy
+set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` options
+will prevent older clients that do not support the new CRUSH features from
+connecting to the cluster.
+
+argonaut (legacy)
+-----------------
+
+The legacy CRUSH behavior used by Argonaut and older releases works fine for
+most clusters, provided that not many OSDs have been marked ``out``.
+
+bobtail (CRUSH_TUNABLES2)
+-------------------------
+
+The ``bobtail`` tunable profile fixes the following misbehaviors of the legacy
+profile:
+
+ * For hierarchies with a small number of devices in leaf buckets, some PGs
+   might map to fewer than the desired number of replicas, resulting in
+   ``undersized`` PGs. This is known to happen in the case of hierarchies with
+   ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each
+   host.
+
+ * For large clusters, a small percentage of PGs might map to fewer than the
+   desired number of OSDs. This is known to happen when there are multiple
+   hierarchy layers in use (for example, ``row``, ``rack``, ``host``,
+   ``osd``).
+
+ * When one or more OSDs are marked ``out``, data tends to be redistributed
+   to nearby OSDs instead of across the entire hierarchy.
+
+The tunables introduced in the Bobtail release are as follows:
+
+ * ``choose_local_tries``: Number of local retries. The legacy value is ``2``,
+   and the optimal value is ``0``.
+
+ * ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal
+   value is ``0``.
+
+ * ``choose_total_tries``: Total number of attempts to choose an item. The
+   legacy value is ``19``, but subsequent testing indicates that a value of
+   ``50`` is more appropriate for typical clusters. For extremely large
+   clusters, an even larger value might be necessary.
+
+ * ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will
+   retry, or try only once and allow the original placement to retry. The
+   legacy default is ``0``, and the optimal value is ``1``.
+
+Migration impact:
+
+ * Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a
+   moderate amount of data movement. Use caution on a cluster that is already
+   populated with data.
+
+firefly (CRUSH_TUNABLES3)
+-------------------------
+
+chooseleaf_vary_r
+~~~~~~~~~~~~~~~~~
+
+The ``firefly`` tunable profile fixes a problem with ``chooseleaf`` CRUSH step
+behavior. This problem arose when a large fraction of OSDs were marked ``out``,
+which resulted in PG mappings with too few OSDs.
+
+This profile was introduced in the Firefly release, and adds a new tunable as
+follows:
+
+ * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start
+   with a non-zero value of ``r``, as determined by the number of attempts the
+   parent has already made. The legacy default value is ``0``, but with this
+   value CRUSH is sometimes unable to find a mapping. The optimal value (in
+   terms of computational cost and correctness) is ``1``.
+
+Migration impact:
+
+ * For existing clusters that store a great deal of data, changing this tunable
+   from ``0`` to ``1`` will trigger a large amount of data migration; a value
+   of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will
+   cause less data to move.
+
+straw_calc_version tunable
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There were problems with the internal weights calculated and stored in the
+CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH
+weight of ``0`` or with a mix of different and unique weights, CRUSH would
+distribute data incorrectly (that is, not in proportion to the weights).
+
+This tunable, introduced in the Firefly release, is as follows:
+
+ * ``straw_calc_version``: A value of ``0`` preserves the old, broken
+   internal-weight calculation; a value of ``1`` fixes the problem.
+
+Migration impact:
+
+ * Changing this tunable to a value of ``1`` and then adjusting a straw bucket
+   (either by adding, removing, or reweighting an item or by using the
+   reweight-all command) can trigger a small to moderate amount of data
+   movement provided that the cluster has hit one of the problematic
+   conditions.
+
+This tunable option is notable in that it has absolutely no impact on the
+required kernel version on the client side.
+
+hammer (CRUSH_V4)
+-----------------
+
+The ``hammer`` tunable profile does not affect the mapping of existing CRUSH
+maps simply by changing the profile. However:
+
+ * There is a new bucket algorithm supported: ``straw2``. This new algorithm
+   fixes several limitations in the original ``straw``. More specifically, the
+   old ``straw`` buckets would change some mappings that should not have
+   changed when a weight was adjusted, while ``straw2`` achieves the original
+   goal of changing mappings only to or from the bucket item whose weight has
+   changed.
+
+ * The ``straw2`` type is the default type for any newly created buckets.
+
+Migration impact:
+
+ * Changing a bucket type from ``straw`` to ``straw2`` will trigger a small
+   amount of data movement, depending on how much the bucket items' weights
+   vary from each other. When the weights are all the same no data will move,
+   and the more variance there is in the weights the more movement there will
+   be.
+
+jewel (CRUSH_TUNABLES5)
+-----------------------
+
+The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a
+result, significantly fewer mappings change when an OSD is marked ``out`` of
+the cluster. This improvement results in significantly less data movement.
+
+The new tunable introduced in the Jewel release is as follows:
+
+ * ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt
+   will use a better value for an inner loop that greatly reduces the number of
+   mapping changes when an OSD is marked ``out``. The legacy value is ``0``,
+   and the new value of ``1`` uses the new approach.
+
+Migration impact:
+
+ * Changing this value on an existing cluster will result in a very large
+   amount of data movement because nearly every PG mapping is likely to change.
+
+Client versions that support CRUSH_TUNABLES2
+--------------------------------------------
+
+ * v0.55 and later, including Bobtail (v0.56.x)
+ * Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients)
+
+Client versions that support CRUSH_TUNABLES3
+--------------------------------------------
+
+ * v0.78 (Firefly) and later
+ * Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients)
+
+Client versions that support CRUSH_V4
+-------------------------------------
+
+ * v0.94 (Hammer) and later
+ * Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients)
+
+Client versions that support CRUSH_TUNABLES5
+--------------------------------------------
+
+ * v10.0.2 (Jewel) and later
+ * Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients)
+
+"Non-optimal tunables" warning
+------------------------------
+
+In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush
+map has non-optimal tunables") if any of the current CRUSH tunables have
+non-optimal values: that is, if any fail to have the optimal values from the
+:ref:`default profile <rados_operations_crush_map_default_profile_definition>`.
+There are two different ways to silence the alert:
+
+1. Adjust the CRUSH tunables on the existing cluster so as to render them
+   optimal.
Making this adjustment will trigger some data movement + (possibly as much as 10%). This approach is generally preferred to the + other approach, but special care must be taken in situations where + data movement might affect performance: for example, in production clusters. + To enable optimal tunables, run the following command: + + .. prompt:: bash $ + + ceph osd crush tunables optimal + + There are several potential problems that might make it preferable to revert + to the previous values of the tunables. The new values might generate too + much load for the cluster to handle, the new values might unacceptably slow + the operation of the cluster, or there might be a client-compatibility + problem. Such client-compatibility problems can arise when using old-kernel + CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to + the previous values of the tunables, run the following command: + + .. prompt:: bash $ + + ceph osd crush tunables legacy + +2. To silence the alert without making any changes to CRUSH, + add the following option to the ``[mon]`` section of your ceph.conf file:: + + mon_warn_on_legacy_crush_tunables = false + + In order for this change to take effect, you will need to either restart + the monitors or run the following command to apply the option to the + monitors while they are still running: + + .. prompt:: bash $ + + ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false + + +Tuning CRUSH +------------ + +When making adjustments to CRUSH tunables, keep the following considerations in +mind: + + * Adjusting the values of CRUSH tunables will result in the shift of one or + more PGs from one storage node to another. If the Ceph cluster is already + storing a great deal of data, be prepared for significant data movement. + * When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they + immediately begin rejecting new connections from clients that do not support + the new feature. However, already-connected clients are effectively + grandfathered in, and any of these clients that do not support the new + feature will malfunction. + * If the CRUSH tunables are set to newer (non-legacy) values and subsequently + reverted to the legacy values, ``ceph-osd`` daemons will not be required to + support any of the newer CRUSH features associated with the newer + (non-legacy) values. However, the OSD peering process requires the + examination and understanding of old maps. For this reason, **if the cluster + has previously used non-legacy CRUSH values, do not run old versions of + the** ``ceph-osd`` **daemon** -- even if the latest version of the map has + been reverted so as to use the legacy defaults. + +The simplest way to adjust CRUSH tunables is to apply them in matched sets +known as *profiles*. As of the Octopus release, Ceph supports the following +profiles: + + * ``legacy``: The legacy behavior from argonaut and earlier. + * ``argonaut``: The legacy values supported by the argonaut release. + * ``bobtail``: The values supported by the bobtail release. + * ``firefly``: The values supported by the firefly release. + * ``hammer``: The values supported by the hammer release. + * ``jewel``: The values supported by the jewel release. + * ``optimal``: The best values for the current version of Ceph. + .. _rados_operations_crush_map_default_profile_definition: + * ``default``: The default values of a new cluster that has been installed + from scratch. 
These values, which depend on the current version of Ceph, are
+   hardcoded and are typically a mix of optimal and legacy values. These
+   values often correspond to the ``optimal`` profile of either the previous
+   LTS (long-term service) release or the most recent release for which most
+   users are expected to have up-to-date clients.
+
+To apply a profile to a running cluster, run a command of the following form:
+
+.. prompt:: bash $
+
+   ceph osd crush tunables {PROFILE}
+
+This action might trigger a great deal of data movement. Consult release notes
+and documentation before changing the profile on a running cluster. Consider
+throttling recovery and backfill parameters in order to limit the backfill
+resulting from a specific change.
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
+
+
+Tuning Primary OSD Selection
+============================
+
+When a Ceph client reads or writes data, it first contacts the primary OSD in
+each affected PG's acting set. By default, the first OSD in the acting set is
+the primary OSD (also known as the "lead OSD"). For example, in the acting set
+``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD.
+However, sometimes it is clear that an OSD is less well suited to act as the
+lead than are other OSDs (for example, if the OSD has a slow drive or a slow
+controller). To prevent performance bottlenecks (especially on read
+operations) and at the same time maximize the utilization of your hardware, you
+can influence the selection of the primary OSD either by adjusting "primary
+affinity" values, or by crafting a CRUSH rule that selects OSDs that are better
+suited to act as the lead rather than other OSDs.
+
+To determine whether tuning Ceph's selection of primary OSDs will improve
+cluster performance, pool redundancy strategy must be taken into account. For
+replicated pools, this tuning can be especially useful, because by default read
+operations are served from the primary OSD of each PG. For erasure-coded pools,
+however, the speed of read operations can be increased by enabling **fast
+read** (see :ref:`pool-settings`).
+
+.. _rados_ops_primary_affinity:
+
+Primary Affinity
+----------------
+
+**Primary affinity** is a characteristic of an OSD that governs the likelihood
+that a given OSD will be selected as the primary OSD (or "lead OSD") in a given
+acting set. A primary affinity value can be any real number in the range ``0``
+to ``1``, inclusive.
+
+As an example of a common scenario in which it can be useful to adjust primary
+affinity values, let us suppose that a cluster contains a mix of drive sizes:
+for example, suppose it contains some older racks with 1.9 TB SATA SSDs and
+some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned
+twice the number of PGs and will thus serve twice the number of write and read
+operations -- they will be busier than the former. In such a scenario, you
+might make a rough assignment of primary affinity as inversely proportional to
+OSD size. Such an assignment will not be 100% optimal, but it can readily
+achieve a 15% improvement in overall read throughput by means of a more even
+utilization of SATA interface bandwidth and CPU cycles. This example is not
+merely a thought experiment meant to illustrate the theoretical benefits of
+adjusting primary affinity values; this fifteen percent improvement was
+achieved on an actual Ceph cluster.
+
+By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster
+in which every OSD has this default value, all OSDs are equally likely to act
+as a primary OSD.
+
+By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less
+likely to select the OSD as primary in a PG's acting set. To change the weight
+value associated with a specific OSD's primary affinity, run a command of the
+following form:
+
+.. prompt:: bash $
+
+   ceph osd primary-affinity <osd-name> <weight>
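+
+For example, to halve the likelihood that a hypothetical ``osd.123`` is
+selected as the lead, you might run:
+
+.. prompt:: bash $
+
+   ceph osd primary-affinity osd.123 0.5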
+
+The primary affinity of an OSD can be set to any real number in the range
+``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as
+primary and ``1`` indicates that the OSD is maximally likely to be used as a
+primary. When the weight is between these extremes, its value indicates roughly
+how likely it is that CRUSH will select the OSD associated with it as a
+primary.
+
+The process by which CRUSH selects the lead OSD is not a mere function of a
+simple probability determined by relative affinity values. Nevertheless,
+measurable results can be achieved even with first-order approximations of
+desirable primary affinity values.
+
+
+Custom CRUSH Rules
+------------------
+
+Some clusters balance cost and performance by mixing SSDs and HDDs in the same
+replicated pool. By setting the primary affinity of HDD OSDs to ``0``,
+operations will be directed to an SSD OSD in each acting set. Alternatively,
+you can define a CRUSH rule that always selects an SSD OSD as the primary OSD
+and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting
+set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs.
+
+For example, see the following CRUSH rule::
+
+   rule mixed_replicated_rule {
+          id 11
+          type replicated
+          step take default class ssd
+          step chooseleaf firstn 1 type host
+          step emit
+          step take default class hdd
+          step chooseleaf firstn 0 type host
+          step emit
+   }
+
+This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool,
+this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on
+different hosts, because the first SSD OSD might be colocated with any of the
+``N`` HDD OSDs.
+
+To avoid this extra storage requirement, you might place SSDs and HDDs in
+different hosts. However, taking this approach means that all client requests
+will be received by hosts with SSDs. For this reason, it might be advisable to
+have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the
+latter will under normal circumstances perform only recovery operations. Here
+the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement
+not to contain any of the same servers, as seen in the following CRUSH rule::
+
+   rule mixed_replicated_rule_two {
+          id 1
+          type replicated
+          step take ssd_hosts class ssd
+          step chooseleaf firstn 1 type host
+          step emit
+          step take hdd_hosts class hdd
+          step chooseleaf firstn -1 type host
+          step emit
+   }
+
+.. note:: If a primary SSD OSD fails, then requests to the associated PG will
+   be temporarily served from a slower HDD OSD until the PG's data has been
+   replicated onto the replacement primary SSD OSD.
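+
+As with any rule, such a mixed rule takes effect for a given pool only once it
+has been assigned to that pool; a sketch, assuming an existing pool named
+``fastpool``:
+
+.. prompt:: bash $
+
+   ceph osd pool set fastpool crush_rule mixed_replicated_rule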
+
+
diff --git a/doc/rados/operations/data-placement.rst b/doc/rados/operations/data-placement.rst
new file mode 100644
index 000000000..3d3be65ec
--- /dev/null
+++ b/doc/rados/operations/data-placement.rst
@@ -0,0 +1,47 @@
+=========================
+ Data Placement Overview
+=========================
+
+Ceph stores, replicates, and rebalances data objects across a RADOS cluster
+dynamically. Because different users store objects in different pools for
+different purposes on many OSDs, Ceph operations require a certain amount of
+data-placement planning. The main data-placement planning concepts in Ceph
+include:
+
+- **Pools:** Ceph stores data within pools, which are logical groups used for
+  storing objects. Pools manage the number of placement groups, the number of
+  replicas, and the CRUSH rule for the pool. To store data in a pool, it is
+  necessary to be an authenticated user with permissions for the pool. Ceph is
+  able to make snapshots of pools. For additional details, see `Pools`_.
+
+- **Placement Groups:** Ceph maps objects to placement groups. Placement
+  groups (PGs) are shards or fragments of a logical object pool that place
+  objects as a group into OSDs. Placement groups reduce the amount of
+  per-object metadata that is necessary for Ceph to store the data in OSDs. A
+  greater number of placement groups (for example, 100 PGs per OSD as compared
+  with 50 PGs per OSD) leads to better balancing. For additional details, see
+  :ref:`placement groups`.
+
+- **CRUSH Maps:** CRUSH plays a major role in allowing Ceph to scale while
+  avoiding certain pitfalls, such as performance bottlenecks, limitations to
+  scalability, and single points of failure. CRUSH maps provide the physical
+  topology of the cluster to the CRUSH algorithm, so that it can determine both
+  (1) where the data for an object and its replicas should be stored and (2)
+  how to store that data across failure domains so as to improve data safety.
+  For additional details, see `CRUSH Maps`_.
+
+- **Balancer:** The balancer is a feature that automatically optimizes the
+  distribution of placement groups across devices in order to achieve a
+  balanced data distribution, to maximize the amount of data that can be
+  stored in the cluster, and to evenly distribute the workload across OSDs.
+
+It is possible to use the default values for each of the above components.
+Default values are recommended for a test cluster's initial setup. However,
+when planning a large Ceph cluster, values should be customized for
+data-placement operations with reference to the different roles played by
+pools, placement groups, and CRUSH.
+
+.. _Pools: ../pools
+.. _CRUSH Maps: ../crush-map
+.. _Balancer: ../balancer
diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst
new file mode 100644
index 000000000..f92f622d5
--- /dev/null
+++ b/doc/rados/operations/devices.rst
@@ -0,0 +1,227 @@
+.. _devices:
+
+Device Management
+=================
+
+Device management allows Ceph to address hardware failure. Ceph tracks hardware
+storage devices (HDDs, SSDs) to see which devices are managed by which daemons.
+Ceph also collects health metrics about these devices. By doing so, Ceph can
+provide tools that predict hardware failure and can automatically respond to
+hardware failure.
+
+Device tracking
+---------------
+
+To see a list of the storage devices that are in use, run the following
+command:
+
+.. prompt:: bash $
+
+   ceph device ls
+
+Alternatively, to list devices by daemon or by host, run a command of one of
+the following forms:
+
+.. prompt:: bash $
+
+   ceph device ls-by-daemon <daemon>
+   ceph device ls-by-host <host>
+
+To see information about the location of a specific device and about how the
+device is being consumed, run a command of the following form:
+
+.. prompt:: bash $
+
+   ceph device info <devid>
+
+Identifying physical devices
+----------------------------
+
+To make the replacement of failed disks easier and less error-prone, you can
+(in some cases) "blink" the drive's LEDs on hardware enclosures by running a
+command of the following form::
+
+   device light on|off <devid> [ident|fault] [--force]
+
+.. note:: Using this command to blink the lights might not work. Whether it
+   works will depend upon such factors as your kernel revision, your SES
+   firmware, or the setup of your HBA.
+
+The ``<devid>`` parameter is the device identification. To retrieve this
+information, run the following command:
+
+.. prompt:: bash $
+
+   ceph device ls
+
+The ``[ident|fault]`` parameter determines which kind of light will blink. By
+default, the `identification` light is used.
+
+.. note:: This command works only if the Cephadm or the Rook `orchestrator `_
+   module is enabled. To see which orchestrator module is enabled, run the
+   following command:
+
+   .. prompt:: bash $
+
+      ceph orch status
+
+The command that makes the drive's LEDs blink is `lsmcli`. To customize this
+command, configure it via a Jinja2 template by running commands of the
+following forms::
+
+   ceph config-key set mgr/cephadm/blink_device_light_cmd "