author     Daniel Baumann <daniel.baumann@progress-linux.org>    2024-04-21 11:54:28 +0000
committer  Daniel Baumann <daniel.baumann@progress-linux.org>    2024-04-21 11:54:28 +0000
commit     e6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree       64f88b554b444a49f656b6c656111a145cbbaa28 /doc/rados/operations
parent     Initial commit. (diff)
Adding upstream version 18.2.2.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat
32 files changed, 12706 insertions, 0 deletions
diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst new file mode 100644 index 000000000..3688bb798 --- /dev/null +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -0,0 +1,458 @@ +.. _adding-and-removing-monitors: + +========================== + Adding/Removing Monitors +========================== + +It is possible to add monitors to a running cluster as long as redundancy is +maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor +Bootstrap`_. + +.. _adding-monitors: + +Adding Monitors +=============== + +Ceph monitors serve as the single source of truth for the cluster map. It is +possible to run a cluster with only one monitor, but for a production cluster +it is recommended to have at least three monitors provisioned and in quorum. +Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus +about maps and about other critical information across the cluster. Due to the +nature of Paxos, Ceph is able to maintain quorum (and thus establish +consensus) only if a majority of the monitors are ``active``. + +It is best to run an odd number of monitors. This is because a cluster that is +running an odd number of monitors is more resilient than a cluster running an +even number. For example, in a two-monitor deployment, no failures can be +tolerated if quorum is to be maintained; in a three-monitor deployment, one +failure can be tolerated; in a four-monitor deployment, one failure can be +tolerated; and in a five-monitor deployment, two failures can be tolerated. In +general, a cluster running an odd number of monitors is best because it avoids +what is called the *split brain* phenomenon. In short, Ceph is able to operate +only if a majority of monitors are ``active`` and able to communicate with each +other, (for example: there must be a single monitor, two out of two monitors, +two out of three monitors, three out of five monitors, or the like). + +For small or non-critical deployments of multi-node Ceph clusters, it is +recommended to deploy three monitors. For larger clusters or for clusters that +are intended to survive a double failure, it is recommended to deploy five +monitors. Only in rare circumstances is there any justification for deploying +seven or more monitors. + +It is possible to run a monitor on the same host that is running an OSD. +However, this approach has disadvantages: for example: `fsync` issues with the +kernel might weaken performance, monitor and OSD daemons might be inactive at +the same time and cause disruption if the node crashes, is rebooted, or is +taken down for maintenance. Because of these risks, it is instead +recommended to run monitors and managers on dedicated hosts. + +.. note:: A *majority* of monitors in your cluster must be able to + reach each other in order for quorum to be established. + +Deploying your Hardware +----------------------- + +Some operators choose to add a new monitor host at the same time that they add +a new monitor. For details on the minimum recommendations for monitor hardware, +see `Hardware Recommendations`_. Before adding a monitor host to the cluster, +make sure that there is an up-to-date version of Linux installed. + +Add the newly installed monitor host to a rack in your cluster, connect the +host to the network, and make sure that the host has network connectivity. + +.. 
_Hardware Recommendations: ../../../start/hardware-recommendations + +Installing the Required Software +-------------------------------- + +In manually deployed clusters, it is necessary to install Ceph packages +manually. For details, see `Installing Packages`_. Configure SSH so that it can +be used by a user that has passwordless authentication and root permissions. + +.. _Installing Packages: ../../../install/install-storage-cluster + + +.. _Adding a Monitor (Manual): + +Adding a Monitor (Manual) +------------------------- + +The procedure in this section creates a ``ceph-mon`` data directory, retrieves +both the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to +the cluster. The procedure might result in a Ceph cluster that contains only +two monitor daemons. To add more monitors until there are enough ``ceph-mon`` +daemons to establish quorum, repeat the procedure. + +This is a good point at which to define the new monitor's ``id``. Monitors have +often been named with single letters (``a``, ``b``, ``c``, etc.), but you are +free to define the ``id`` however you see fit. In this document, ``{mon-id}`` +refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if +``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is +``a``. ??? + +#. Create a data directory on the machine that will host the new monitor: + + .. prompt:: bash $ + + ssh {new-mon-host} + sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} + +#. Create a temporary directory ``{tmp}`` that will contain the files needed + during this procedure. This directory should be different from the data + directory created in the previous step. Because this is a temporary + directory, it can be removed after the procedure is complete: + + .. prompt:: bash $ + + mkdir {tmp} + +#. Retrieve the keyring for your monitors (``{tmp}`` is the path to the + retrieved keyring and ``{key-filename}`` is the name of the file that + contains the retrieved monitor key): + + .. prompt:: bash $ + + ceph auth get mon. -o {tmp}/{key-filename} + +#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map + and ``{map-filename}`` is the name of the file that contains the retrieved + monitor map): + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{map-filename} + +#. Prepare the monitor's data directory, which was created in the first step. + The following command must specify the path to the monitor map (so that + information about a quorum of monitors and their ``fsid``\s can be + retrieved) and specify the path to the monitor keyring: + + .. prompt:: bash $ + + sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} + +#. Start the new monitor. It will automatically join the cluster. To provide + information to the daemon about which address to bind to, use either the + ``--public-addr {ip}`` option or the ``--public-network {network}`` option. + For example: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --public-addr {ip:port} + +.. _removing-monitors: + +Removing Monitors +================= + +When monitors are removed from a cluster, it is important to remember +that Ceph monitors use Paxos to maintain consensus about the cluster +map. Such consensus is possible only if the number of monitors is sufficient +to establish quorum. + + +.. _Removing a Monitor (Manual): + +Removing a Monitor (Manual) +--------------------------- + +The procedure in this section removes a ``ceph-mon`` daemon from the cluster. 
+The procedure might result in a Ceph cluster that contains a number of monitors +insufficient to maintain quorum, so plan carefully. When replacing an old +monitor with a new monitor, add the new monitor first, wait for quorum to be +established, and then remove the old monitor. This ensures that quorum is not +lost. + + +#. Stop the monitor: + + .. prompt:: bash $ + + service ceph -a stop mon.{mon-id} + +#. Remove the monitor from the cluster: + + .. prompt:: bash $ + + ceph mon remove {mon-id} + +#. Remove the monitor entry from the ``ceph.conf`` file: + +.. _rados-mon-remove-from-unhealthy: + + +Removing Monitors from an Unhealthy Cluster +------------------------------------------- + +The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy +cluster (for example, a cluster whose monitors are unable to form a quorum). + +#. Stop all ``ceph-mon`` daemons on all monitor hosts: + + .. prompt:: bash $ + + ssh {mon-host} + systemctl stop ceph-mon.target + + Repeat this step on every monitor host. + +#. Identify a surviving monitor and log in to the monitor's host: + + .. prompt:: bash $ + + ssh {mon-host} + +#. Extract a copy of the ``monmap`` file by running a command of the following + form: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --extract-monmap {map-path} + + Here is a more concrete example. In this example, ``hostname`` is the + ``{mon-id}`` and ``/tmp/monpap`` is the ``{map-path}``: + + .. prompt:: bash $ + + ceph-mon -i `hostname` --extract-monmap /tmp/monmap + +#. Remove the non-surviving or otherwise problematic monitors: + + .. prompt:: bash $ + + monmaptool {map-path} --rm {mon-id} + + For example, suppose that there are three monitors |---| ``mon.a``, ``mon.b``, + and ``mon.c`` |---| and that only ``mon.a`` will survive: + + .. prompt:: bash $ + + monmaptool /tmp/monmap --rm b + monmaptool /tmp/monmap --rm c + +#. Inject the surviving map that includes the removed monitors into the + monmap of the surviving monitor(s): + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {map-path} + + Continuing with the above example, inject a map into monitor ``mon.a`` by + running the following command: + + .. prompt:: bash $ + + ceph-mon -i a --inject-monmap /tmp/monmap + + +#. Start only the surviving monitors. + +#. Verify that the monitors form a quorum by running the command ``ceph -s``. + +#. The data directory of the removed monitors is in ``/var/lib/ceph/mon``: + either archive this data directory in a safe location or delete this data + directory. However, do not delete it unless you are confident that the + remaining monitors are healthy and sufficiently redundant. Make sure that + there is enough room for the live DB to expand and compact, and make sure + that there is also room for an archived copy of the DB. The archived copy + can be compressed. + +.. _Changing a Monitor's IP address: + +Changing a Monitor's IP Address +=============================== + +.. important:: Existing monitors are not supposed to change their IP addresses. + +Monitors are critical components of a Ceph cluster. The entire system can work +properly only if the monitors maintain quorum, and quorum can be established +only if the monitors have discovered each other by means of their IP addresses. +Ceph has strict requirements on the discovery of monitors. + +Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons +to discover monitors, the monitor map is used by monitors to discover each +other. 
This is why it is necessary to obtain the current ``monmap`` at the time +a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_, +the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id} +--mkfs`` command. The following sections explain the consistency requirements +for Ceph monitors, and also explain a number of safe ways to change a monitor's +IP address. + + +Consistency Requirements +------------------------ + +When a monitor discovers other monitors in the cluster, it always refers to the +local copy of the monitor map. Using the monitor map instead of using the +``ceph.conf`` file avoids errors that could break the cluster (for example, +typos or other slight errors in ``ceph.conf`` when a monitor address or port is +specified). Because monitors use monitor maps for discovery and because they +share monitor maps with Ceph clients and other Ceph daemons, the monitor map +provides monitors with a strict guarantee that their consensus is valid. + +Strict consistency also applies to updates to the monmap. As with any other +updates on the monitor, changes to the monmap always run through a distributed +consensus algorithm called `Paxos`_. The monitors must agree on each update to +the monmap, such as adding or removing a monitor, to ensure that each monitor +in the quorum has the same version of the monmap. Updates to the monmap are +incremental so that monitors have the latest agreed upon version, and a set of +previous versions, allowing a monitor that has an older version of the monmap +to catch up with the current state of the cluster. + +There are additional advantages to using the monitor map rather than +``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not +automatically updated and distributed, its use would bring certain risks: +monitors might use an outdated ``ceph.conf`` file, might fail to recognize a +specific monitor, might fall out of quorum, and might develop a situation in +which `Paxos`_ is unable to accurately ascertain the current state of the +system. Because of these risks, any changes to an existing monitor's IP address +must be made with great care. + +.. _operations_add_or_rm_mons_changing_mon_ip: + +Changing a Monitor's IP address (Preferred Method) +-------------------------------------------------- + +If a monitor's IP address is changed only in the ``ceph.conf`` file, there is +no guarantee that the other monitors in the cluster will receive the update. +For this reason, the preferred method to change a monitor's IP address is as +follows: add a new monitor with the desired IP address (as described in `Adding +a Monitor (Manual)`_), make sure that the new monitor successfully joins the +quorum, remove the monitor that is using the old IP address, and update the +``ceph.conf`` file to ensure that clients and other daemons are made aware of +the new monitor's IP address. + +For example, suppose that there are three monitors in place:: + + [mon.a] + host = host01 + addr = 10.0.0.1:6789 + [mon.b] + host = host02 + addr = 10.0.0.2:6789 + [mon.c] + host = host03 + addr = 10.0.0.3:6789 + +To change ``mon.c`` so that its name is ``host04`` and its IP address is +``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new +monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing +``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing +a Monitor (Manual)`_ to remove ``mon.c``. 
To move all three monitors to new IP +addresses, repeat this process. + +Changing a Monitor's IP address (Advanced Method) +------------------------------------------------- + +There are cases in which the method outlined in :ref"`<Changing a Monitor's IP +Address (Preferred Method)> operations_add_or_rm_mons_changing_mon_ip` cannot +be used. For example, it might be necessary to move the cluster's monitors to a +different network, to a different part of the datacenter, or to a different +datacenter altogether. It is still possible to change the monitors' IP +addresses, but a different method must be used. + +For such cases, a new monitor map with updated IP addresses for every monitor +in the cluster must be generated and injected on each monitor. Although this +method is not particularly easy, such a major migration is unlikely to be a +routine task. As stated at the beginning of this section, existing monitors are +not supposed to change their IP addresses. + +Continue with the monitor configuration in the example from :ref"`<Changing a +Monitor's IP Address (Preferred Method)> +operations_add_or_rm_mons_changing_mon_ip` . Suppose that all of the monitors +are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that +these networks are unable to communicate. Carry out the following procedure: + +#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor + map, and ``{filename}`` is the name of the file that contains the retrieved + monitor map): + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{filename} + +#. Check the contents of the monitor map: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.0.0.1:6789/0 mon.a + 1: 10.0.0.2:6789/0 mon.b + 2: 10.0.0.3:6789/0 mon.c + +#. Remove the existing monitors from the monitor map: + + .. prompt:: bash $ + + monmaptool --rm a --rm b --rm c {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: removing a + monmaptool: removing b + monmaptool: removing c + monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) + +#. Add the new monitor locations to the monitor map: + + .. prompt:: bash $ + + monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) + +#. Check the new contents of the monitor map: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.1.0.1:6789/0 mon.a + 1: 10.1.0.2:6789/0 mon.b + 2: 10.1.0.3:6789/0 mon.c + +At this point, we assume that the monitors (and stores) have been installed at +the new location. Next, propagate the modified monitor map to the new monitors, +and inject the modified monitor map into each new monitor. + +#. Make sure all of your monitors have been stopped. Never inject into a + monitor while the monitor daemon is running. + +#. Inject the monitor map: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} + +#. Restart all of the monitors. + +Migration to the new location is now complete. The monitors should operate +successfully. + + + +.. 
_Manual Deployment: ../../../install/manual-deployment +.. _Monitor Bootstrap: ../../../dev/mon-bootstrap +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) + +.. |---| unicode:: U+2014 .. EM DASH + :trim: diff --git a/doc/rados/operations/add-or-rm-osds.rst b/doc/rados/operations/add-or-rm-osds.rst new file mode 100644 index 000000000..1a6621148 --- /dev/null +++ b/doc/rados/operations/add-or-rm-osds.rst @@ -0,0 +1,419 @@ +====================== + Adding/Removing OSDs +====================== + +When a cluster is up and running, it is possible to add or remove OSDs. + +Adding OSDs +=========== + +OSDs can be added to a cluster in order to expand the cluster's capacity and +resilience. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on one +storage drive within a host machine. But if your host machine has multiple +storage drives, you may map one ``ceph-osd`` daemon for each drive on the +machine. + +It's a good idea to check the capacity of your cluster so that you know when it +approaches its capacity limits. If your cluster has reached its ``near full`` +ratio, then you should add OSDs to expand your cluster's capacity. + +.. warning:: Do not add an OSD after your cluster has reached its ``full + ratio``. OSD failures that occur after the cluster reaches its ``near full + ratio`` might cause the cluster to exceed its ``full ratio``. + + +Deploying your Hardware +----------------------- + +If you are also adding a new host when adding a new OSD, see `Hardware +Recommendations`_ for details on minimum recommendations for OSD hardware. To +add an OSD host to your cluster, begin by making sure that an appropriate +version of Linux has been installed on the host machine and that all initial +preparations for your storage drives have been carried out. For details, see +`Filesystem Recommendations`_. + +Next, add your OSD host to a rack in your cluster, connect the host to the +network, and ensure that the host has network connectivity. For details, see +`Network Configuration Reference`_. + + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations +.. _Network Configuration Reference: ../../configuration/network-config-ref + +Installing the Required Software +-------------------------------- + +If your cluster has been manually deployed, you will need to install Ceph +software packages manually. For details, see `Installing Ceph (Manual)`_. +Configure SSH for the appropriate user to have both passwordless authentication +and root permissions. + +.. _Installing Ceph (Manual): ../../../install + + +Adding an OSD (Manual) +---------------------- + +The following procedure sets up a ``ceph-osd`` daemon, configures this OSD to +use one drive, and configures the cluster to distribute data to the OSD. If +your host machine has multiple drives, you may add an OSD for each drive on the +host by repeating this procedure. + +As the following procedure will demonstrate, adding an OSD involves creating a +metadata directory for it, configuring a data storage drive, adding the OSD to +the cluster, and then adding it to the CRUSH map. + +When you add the OSD to the CRUSH map, you will need to consider the weight you +assign to the new OSD. Since storage drive capacities increase over time, newer +OSD hosts are likely to have larger hard drives than the older hosts in the +cluster have and therefore might have greater weight as well. + +.. 
tip:: Ceph works best with uniform hardware across pools. It is possible to + add drives of dissimilar size and then adjust their weights accordingly. + However, for best performance, consider a CRUSH hierarchy that has drives of + the same type and size. It is better to add larger drives uniformly to + existing hosts. This can be done incrementally, replacing smaller drives + each time the new drives are added. + +#. Create the new OSD by running a command of the following form. If you opt + not to specify a UUID in this command, the UUID will be set automatically + when the OSD starts up. The OSD number, which is needed for subsequent + steps, is found in the command's output: + + .. prompt:: bash $ + + ceph osd create [{uuid} [{id}]] + + If the optional parameter {id} is specified it will be used as the OSD ID. + However, if the ID number is already in use, the command will fail. + + .. warning:: Explicitly specifying the ``{id}`` parameter is not + recommended. IDs are allocated as an array, and any skipping of entries + consumes extra memory. This memory consumption can become significant if + there are large gaps or if clusters are large. By leaving the ``{id}`` + parameter unspecified, we ensure that Ceph uses the smallest ID number + available and that these problems are avoided. + +#. Create the default directory for your new OSD by running commands of the + following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +#. If the OSD will be created on a drive other than the OS drive, prepare it + for use with Ceph. Run commands of the following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{drive} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +#. Initialize the OSD data directory by running commands of the following form: + + .. prompt:: bash $ + + ssh {new-osd-host} + ceph-osd -i {osd-num} --mkfs --mkkey + + Make sure that the directory is empty before running ``ceph-osd``. + +#. Register the OSD authentication key by running a command of the following + form: + + .. prompt:: bash $ + + ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring + + This presentation of the command has ``ceph-{osd-num}`` in the listed path + because many clusters have the name ``ceph``. However, if your cluster name + is not ``ceph``, then the string ``ceph`` in ``ceph-{osd-num}`` needs to be + replaced with your cluster name. For example, if your cluster name is + ``cluster1``, then the path in the command should be + ``/var/lib/ceph/osd/cluster1-{osd-num}/keyring``. + +#. Add the OSD to the CRUSH map by running the following command. This allows + the OSD to begin receiving data. The ``ceph osd crush add`` command can add + OSDs to the CRUSH hierarchy wherever you want. If you specify one or more + buckets, the command places the OSD in the most specific of those buckets, + and it moves that bucket underneath any other buckets that you have + specified. **Important:** If you specify only the root bucket, the command + will attach the OSD directly to the root, but CRUSH rules expect OSDs to be + inside of hosts. If the OSDs are not inside hosts, the OSDS will likely not + receive any data. + + .. prompt:: bash $ + + ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] 
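      # Hypothetical example only (the OSD id, weight, and bucket names below are
      # placeholders): place osd.12 with CRUSH weight 1.8 under host "node1" in
      # the default root.
      ceph osd crush add osd.12 1.8 host=node1 root=default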
+ + Note that there is another way to add a new OSD to the CRUSH map: decompile + the CRUSH map, add the OSD to the device list, add the host as a bucket (if + it is not already in the CRUSH map), add the device as an item in the host, + assign the device a weight, recompile the CRUSH map, and set the CRUSH map. + For details, see `Add/Move an OSD`_. This is rarely necessary with recent + releases (this sentence was written the month that Reef was released). + + +.. _rados-replacing-an-osd: + +Replacing an OSD +---------------- + +.. note:: If the procedure in this section does not work for you, try the + instructions in the ``cephadm`` documentation: + :ref:`cephadm-replacing-an-osd`. + +Sometimes OSDs need to be replaced: for example, when a disk fails, or when an +administrator wants to reprovision OSDs with a new back end (perhaps when +switching from Filestore to BlueStore). Replacing an OSD differs from `Removing +the OSD`_ in that the replaced OSD's ID and CRUSH map entry must be kept intact +after the OSD is destroyed for replacement. + + +#. Make sure that it is safe to destroy the OSD: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done + +#. Destroy the OSD: + + .. prompt:: bash $ + + ceph osd destroy {id} --yes-i-really-mean-it + +#. *Optional*: If the disk that you plan to use is not a new disk and has been + used before for other purposes, zap the disk: + + .. prompt:: bash $ + + ceph-volume lvm zap /dev/sdX + +#. Prepare the disk for replacement by using the ID of the OSD that was + destroyed in previous steps: + + .. prompt:: bash $ + + ceph-volume lvm prepare --osd-id {id} --data /dev/sdX + +#. Finally, activate the OSD: + + .. prompt:: bash $ + + ceph-volume lvm activate {id} {fsid} + +Alternatively, instead of carrying out the final two steps (preparing the disk +and activating the OSD), you can re-create the OSD by running a single command +of the following form: + + .. prompt:: bash $ + + ceph-volume lvm create --osd-id {id} --data /dev/sdX + +Starting the OSD +---------------- + +After an OSD is added to Ceph, the OSD is in the cluster. However, until it is +started, the OSD is considered ``down`` and ``in``. The OSD is not running and +will be unable to receive data. To start an OSD, either run ``service ceph`` +from your admin host or run a command of the following form to start the OSD +from its host machine: + + .. prompt:: bash $ + + sudo systemctl start ceph-osd@{osd-num} + +After the OSD is started, it is considered ``up`` and ``in``. + +Observing the Data Migration +---------------------------- + +After the new OSD has been added to the CRUSH map, Ceph begins rebalancing the +cluster by migrating placement groups (PGs) to the new OSD. To observe this +process by using the `ceph`_ tool, run the following command: + + .. prompt:: bash $ + + ceph -w + +Or: + + .. prompt:: bash $ + + watch ceph status + +The PG states will first change from ``active+clean`` to ``active, some +degraded objects`` and then return to ``active+clean`` when migration +completes. When you are finished observing, press Ctrl-C to exit. + +.. _Add/Move an OSD: ../crush-map#addosd +.. _ceph: ../monitoring + + +Removing OSDs (Manual) +====================== + +It is possible to remove an OSD manually while the cluster is running: you +might want to do this in order to reduce the size of the cluster or when +replacing hardware. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on +one storage drive within a host machine. 
Alternatively, if your host machine +has multiple storage drives, you might need to remove multiple ``ceph-osd`` +daemons: one daemon for each drive on the machine. + +.. warning:: Before you begin the process of removing an OSD, make sure that + your cluster is not near its ``full ratio``. Otherwise the act of removing + OSDs might cause the cluster to reach or exceed its ``full ratio``. + + +Taking the OSD ``out`` of the Cluster +------------------------------------- + +OSDs are typically ``up`` and ``in`` before they are removed from the cluster. +Before the OSD can be removed from the cluster, the OSD must be taken ``out`` +of the cluster so that Ceph can begin rebalancing and copying its data to other +OSDs. To take an OSD ``out`` of the cluster, run a command of the following +form: + + .. prompt:: bash $ + + ceph osd out {osd-num} + + +Observing the Data Migration +---------------------------- + +After the OSD has been taken ``out`` of the cluster, Ceph begins rebalancing +the cluster by migrating placement groups out of the OSD that was removed. To +observe this process by using the `ceph`_ tool, run the following command: + + .. prompt:: bash $ + + ceph -w + +The PG states will change from ``active+clean`` to ``active, some degraded +objects`` and will then return to ``active+clean`` when migration completes. +When you are finished observing, press Ctrl-C to exit. + +.. note:: Under certain conditions, the action of taking ``out`` an OSD + might lead CRUSH to encounter a corner case in which some PGs remain stuck + in the ``active+remapped`` state. This problem sometimes occurs in small + clusters with few hosts (for example, in a small testing cluster). To + address this problem, mark the OSD ``in`` by running a command of the + following form: + + .. prompt:: bash $ + + ceph osd in {osd-num} + + After the OSD has come back to its initial state, do not mark the OSD + ``out`` again. Instead, set the OSD's weight to ``0`` by running a command + of the following form: + + .. prompt:: bash $ + + ceph osd crush reweight osd.{osd-num} 0 + + After the OSD has been reweighted, observe the data migration and confirm + that it has completed successfully. The difference between marking an OSD + ``out`` and reweighting the OSD to ``0`` has to do with the bucket that + contains the OSD. When an OSD is marked ``out``, the weight of the bucket is + not changed. But when an OSD is reweighted to ``0``, the weight of the + bucket is updated (namely, the weight of the OSD is subtracted from the + overall weight of the bucket). When operating small clusters, it can + sometimes be preferable to use the above reweight command. + + +Stopping the OSD +---------------- + +After you take an OSD ``out`` of the cluster, the OSD might still be running. +In such a case, the OSD is ``up`` and ``out``. Before it is removed from the +cluster, the OSD must be stopped by running commands of the following form: + + .. prompt:: bash $ + + ssh {osd-host} + sudo systemctl stop ceph-osd@{osd-num} + +After the OSD has been stopped, it is ``down``. + + +Removing the OSD +---------------- + +The following procedure removes an OSD from the cluster map, removes the OSD's +authentication key, removes the OSD from the OSD map, and removes the OSD from +the ``ceph.conf`` file. If your host has multiple drives, it might be necessary +to remove an OSD from each drive by repeating this procedure. + +#. Begin by having the cluster forget the OSD. 
This step removes the OSD from + the CRUSH map, removes the OSD's authentication key, and removes the OSD + from the OSD map. (The :ref:`purge subcommand <ceph-admin-osd>` was + introduced in Luminous. For older releases, see :ref:`the procedure linked + here <ceph_osd_purge_procedure_pre_luminous>`.): + + .. prompt:: bash $ + + ceph osd purge {id} --yes-i-really-mean-it + + +#. Navigate to the host where the master copy of the cluster's + ``ceph.conf`` file is kept: + + .. prompt:: bash $ + + ssh {admin-host} + cd /etc/ceph + vim ceph.conf + +#. Remove the OSD entry from your ``ceph.conf`` file (if such an entry + exists):: + + [osd.1] + host = {hostname} + +#. Copy the updated ``ceph.conf`` file from the location on the host where the + master copy of the cluster's ``ceph.conf`` is kept to the ``/etc/ceph`` + directory of the other hosts in your cluster. + +.. _ceph_osd_purge_procedure_pre_luminous: + +If your Ceph cluster is older than Luminous, you will be unable to use the +``ceph osd purge`` command. Instead, carry out the following procedure: + +#. Remove the OSD from the CRUSH map so that it no longer receives data (for + more details, see `Remove an OSD`_): + + .. prompt:: bash $ + + ceph osd crush remove {name} + + Instead of removing the OSD from the CRUSH map, you might opt for one of two + alternatives: (1) decompile the CRUSH map, remove the OSD from the device + list, and remove the device from the host bucket; (2) remove the host bucket + from the CRUSH map (provided that it is in the CRUSH map and that you intend + to remove the host), recompile the map, and set it: + + +#. Remove the OSD authentication key: + + .. prompt:: bash $ + + ceph auth del osd.{osd-num} + +#. Remove the OSD: + + .. prompt:: bash $ + + ceph osd rm {osd-num} + + For example: + + .. prompt:: bash $ + + ceph osd rm 1 + +.. _Remove an OSD: ../crush-map#removeosd diff --git a/doc/rados/operations/balancer.rst b/doc/rados/operations/balancer.rst new file mode 100644 index 000000000..aa4eab93c --- /dev/null +++ b/doc/rados/operations/balancer.rst @@ -0,0 +1,221 @@ +.. _balancer: + +Balancer Module +======================= + +The *balancer* can optimize the allocation of placement groups (PGs) across +OSDs in order to achieve a balanced distribution. The balancer can operate +either automatically or in a supervised fashion. + + +Status +------ + +To check the current status of the balancer, run the following command: + + .. prompt:: bash $ + + ceph balancer status + + +Automatic balancing +------------------- + +When the balancer is in ``upmap`` mode, the automatic balancing feature is +enabled by default. For more details, see :ref:`upmap`. To disable the +balancer, run the following command: + + .. prompt:: bash $ + + ceph balancer off + +The balancer mode can be changed from ``upmap`` mode to ``crush-compat`` mode. +``crush-compat`` mode is backward compatible with older clients. In +``crush-compat`` mode, the balancer automatically makes small changes to the +data distribution in order to ensure that OSDs are utilized equally. + + +Throttling +---------- + +If the cluster is degraded (that is, if an OSD has failed and the system hasn't +healed itself yet), then the balancer will not make any adjustments to the PG +distribution. + +When the cluster is healthy, the balancer will incrementally move a small +fraction of unbalanced PGs in order to improve distribution. This fraction +will not exceed a certain threshold that defaults to 5%. 
To adjust this +``target_max_misplaced_ratio`` threshold setting, run the following command: + + .. prompt:: bash $ + + ceph config set mgr target_max_misplaced_ratio .07 # 7% + +The balancer sleeps between runs. To set the number of seconds for this +interval of sleep, run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/sleep_interval 60 + +To set the time of day (in HHMM format) at which automatic balancing begins, +run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_time 0000 + +To set the time of day (in HHMM format) at which automatic balancing ends, run +the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_time 2359 + +Automatic balancing can be restricted to certain days of the week. To restrict +it to a specific day of the week or later (as with crontab, ``0`` is Sunday, +``1`` is Monday, and so on), run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_weekday 0 + +To restrict automatic balancing to a specific day of the week or earlier +(again, ``0`` is Sunday, ``1`` is Monday, and so on), run the following +command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_weekday 6 + +Automatic balancing can be restricted to certain pools. By default, the value +of this setting is an empty string, so that all pools are automatically +balanced. To restrict automatic balancing to specific pools, retrieve their +numeric pool IDs (by running the :command:`ceph osd pool ls detail` command), +and then run the following command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/pool_ids 1,2,3 + + +Modes +----- + +There are two supported balancer modes: + +#. **crush-compat**. This mode uses the compat weight-set feature (introduced + in Luminous) to manage an alternative set of weights for devices in the + CRUSH hierarchy. When the balancer is operating in this mode, the normal + weights should remain set to the size of the device in order to reflect the + target amount of data intended to be stored on the device. The balancer will + then optimize the weight-set values, adjusting them up or down in small + increments, in order to achieve a distribution that matches the target + distribution as closely as possible. (Because PG placement is a pseudorandom + process, it is subject to a natural amount of variation; optimizing the + weights serves to counteract that natural variation.) + + Note that this mode is *fully backward compatible* with older clients: when + an OSD Map and CRUSH map are shared with older clients, Ceph presents the + optimized weights as the "real" weights. + + The primary limitation of this mode is that the balancer cannot handle + multiple CRUSH hierarchies with different placement rules if the subtrees of + the hierarchy share any OSDs. (Such sharing of OSDs is not typical and, + because of the difficulty of managing the space utilization on the shared + OSDs, is generally not recommended.) + +#. **upmap**. In Luminous and later releases, the OSDMap can store explicit + mappings for individual OSDs as exceptions to the normal CRUSH placement + calculation. These ``upmap`` entries provide fine-grained control over the + PG mapping. This balancer mode optimizes the placement of individual PGs in + order to achieve a balanced distribution. 
In most cases, the resulting + distribution is nearly perfect: that is, there is an equal number of PGs on + each OSD (±1 PG, since the total number might not divide evenly). + + To use ``upmap``, all clients must be Luminous or newer. + +The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by +running the following command: + + .. prompt:: bash $ + + ceph balancer mode crush-compat + +Supervised optimization +----------------------- + +Supervised use of the balancer can be understood in terms of three distinct +phases: + +#. building a plan +#. evaluating the quality of the data distribution, either for the current PG + distribution or for the PG distribution that would result after executing a + plan +#. executing the plan + +To evaluate the current distribution, run the following command: + + .. prompt:: bash $ + + ceph balancer eval + +To evaluate the distribution for a single pool, run the following command: + + .. prompt:: bash $ + + ceph balancer eval <pool-name> + +To see the evaluation in greater detail, run the following command: + + .. prompt:: bash $ + + ceph balancer eval-verbose ... + +To instruct the balancer to generate a plan (using the currently configured +mode), make up a name (any useful identifying string) for the plan, and run the +following command: + + .. prompt:: bash $ + + ceph balancer optimize <plan-name> + +To see the contents of a plan, run the following command: + + .. prompt:: bash $ + + ceph balancer show <plan-name> + +To display all plans, run the following command: + + .. prompt:: bash $ + + ceph balancer ls + +To discard an old plan, run the following command: + + .. prompt:: bash $ + + ceph balancer rm <plan-name> + +To see currently recorded plans, examine the output of the following status +command: + + .. prompt:: bash $ + + ceph balancer status + +To evaluate the distribution that would result from executing a specific plan, +run the following command: + + .. prompt:: bash $ + + ceph balancer eval <plan-name> + +If a plan is expected to improve the distribution (that is, the plan's score is +lower than the current cluster state's score), you can execute that plan by +running the following command: + + .. prompt:: bash $ + + ceph balancer execute <plan-name> diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst new file mode 100644 index 000000000..d24782c46 --- /dev/null +++ b/doc/rados/operations/bluestore-migration.rst @@ -0,0 +1,357 @@ +.. _rados_operations_bluestore_migration: + +===================== + BlueStore Migration +===================== +.. warning:: Filestore has been deprecated in the Reef release and is no longer supported. + Please migrate to BlueStore. + +Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph +cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs. +Because BlueStore is superior to Filestore in performance and robustness, and +because Filestore is not supported by Ceph releases beginning with Reef, users +deploying Filestore OSDs should transition to BlueStore. There are several +strategies for making the transition to BlueStore. + +BlueStore is so different from Filestore that an individual OSD cannot be +converted in place. Instead, the conversion process must use either (1) the +cluster's normal replication and healing support, or (2) tools and strategies +that copy OSD content from an old (Filestore) device to a new (BlueStore) one. 
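Whichever strategy you choose, it helps to first take stock of how many OSDs still run
Filestore. The following is a minimal sketch rather than part of the official procedure;
it assumes an admin host with the ``ceph`` CLI and, for the per-OSD listing, the ``jq``
utility:

.. prompt:: bash $

   # Summarize object store types across the whole cluster
   ceph osd count-metadata osd_objectstore
   # List the object store backing each individual OSD (assumes jq is installed)
   for id in $(ceph osd ls); do
       echo "osd.$id: $(ceph osd metadata $id | jq -r '.osd_objectstore')"
   done
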
+ +Deploying new OSDs with BlueStore +================================= + +Use BlueStore when deploying new OSDs (for example, when the cluster is +expanded). Because this is the default behavior, no specific change is +needed. + +Similarly, use BlueStore for any OSDs that have been reprovisioned after +a failed drive was replaced. + +Converting existing OSDs +======================== + +"Mark-``out``" replacement +-------------------------- + +The simplest approach is to verify that the cluster is healthy and +then follow these steps for each Filestore OSD in succession: mark the OSD +``out``, wait for the data to replicate across the cluster, reprovision the OSD, +mark the OSD back ``in``, and wait for recovery to complete before proceeding +to the next OSD. This approach is easy to automate, but it entails unnecessary +data migration that carries costs in time and SSD wear. + +#. Identify a Filestore OSD to replace:: + + ID=<osd-id-number> + DEVICE=<disk-device> + + #. Determine whether a given OSD is Filestore or BlueStore: + + .. prompt:: bash $ + + ceph osd metadata $ID | grep osd_objectstore + + #. Get a current count of Filestore and BlueStore OSDs: + + .. prompt:: bash $ + + ceph osd count-metadata osd_objectstore + +#. Mark a Filestore OSD ``out``: + + .. prompt:: bash $ + + ceph osd out $ID + +#. Wait for the data to migrate off this OSD: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done + +#. Stop the OSD: + + .. prompt:: bash $ + + systemctl kill ceph-osd@$ID + + .. _osd_id_retrieval: + +#. Note which device the OSD is using: + + .. prompt:: bash $ + + mount | grep /var/lib/ceph/osd/ceph-$ID + +#. Unmount the OSD: + + .. prompt:: bash $ + + umount /var/lib/ceph/osd/ceph-$ID + +#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! These commands will destroy + the contents of the device; you must be certain that the data on the device is + not needed (in other words, that the cluster is healthy) before proceeding: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be + reprovisioned with the same OSD ID): + + .. prompt:: bash $ + + ceph osd destroy $ID --yes-i-really-mean-it + +#. Provision a BlueStore OSD in place by using the same OSD ID. This requires + you to identify which device to wipe, and to make certain that you target + the correct and intended device, using the information that was retrieved in + the :ref:`"Note which device the OSD is using" <osd_id_retrieval>` step. BE + CAREFUL! Note that you may need to modify these commands when dealing with + hybrid OSDs: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID + +#. Repeat. + +You may opt to (1) have the balancing of the replacement BlueStore OSD take +place concurrently with the draining of the next Filestore OSD, or instead +(2) follow the same procedure for multiple OSDs in parallel. In either case, +however, you must ensure that the cluster is fully clean (in other words, that +all data has all replicas) before destroying any OSDs. If you opt to reprovision +multiple OSDs in parallel, be **very** careful to destroy OSDs only within a +single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to +satisfy this requirement will reduce the redundancy and availability of your +data and increase the risk of data loss (or even guarantee data loss). + +Advantages: + +* Simple. +* Can be done on a device-by-device basis. 
+* No spare devices or hosts are required. + +Disadvantages: + +* Data is copied over the network twice: once to another OSD in the cluster (to + maintain the specified number of replicas), and again back to the + reprovisioned BlueStore OSD. + +"Whole host" replacement +------------------------ + +If you have a spare host in the cluster, or sufficient free space to evacuate +an entire host for use as a spare, then the conversion can be done on a +host-by-host basis so that each stored copy of the data is migrated only once. + +To use this approach, you need an empty host that has no OSDs provisioned. +There are two ways to do this: either by using a new, empty host that is not +yet part of the cluster, or by offloading data from an existing host that is +already part of the cluster. + +Using a new, empty host +^^^^^^^^^^^^^^^^^^^^^^^ + +Ideally the host will have roughly the same capacity as each of the other hosts +you will be converting. Add the host to the CRUSH hierarchy, but do not attach +it to the root: + + +.. prompt:: bash $ + + NEWHOST=<empty-host-name> + ceph osd crush add-bucket $NEWHOST host + +Make sure that Ceph packages are installed on the new host. + +Using an existing host +^^^^^^^^^^^^^^^^^^^^^^ + +If you would like to use an existing host that is already part of the cluster, +and if there is sufficient free space on that host so that all of its data can +be migrated off to other cluster hosts, you can do the following (instead of +using a new, empty host): + +.. prompt:: bash $ + + OLDHOST=<existing-cluster-host-to-offload> + ceph osd crush unlink $OLDHOST default + +where "default" is the immediate ancestor in the CRUSH map. (For +smaller clusters with unmodified configurations this will normally +be "default", but it might instead be a rack name.) You should now +see the host at the top of the OSD tree output with no parent: + +.. prompt:: bash $ + + bin/ceph osd tree + +:: + + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host oldhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host foo + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +If everything looks good, jump directly to the :ref:`"Wait for the data +migration to complete" <bluestore_data_migration_step>` step below and proceed +from there to clean up the old OSDs. + +Migration process +^^^^^^^^^^^^^^^^^ + +If you're using a new host, start at :ref:`the first step +<bluestore_migration_process_first_step>`. If you're using an existing host, +jump to :ref:`this step <bluestore_data_migration_step>`. + +.. _bluestore_migration_process_first_step: + +#. Provision new BlueStore OSDs for all devices: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/$DEVICE + +#. Verify that the new OSDs have joined the cluster: + + .. prompt:: bash $ + + ceph osd tree + + You should see the new host ``$NEWHOST`` with all of the OSDs beneath + it, but the host should *not* be nested beneath any other node in the + hierarchy (like ``root default``). 
For example, if ``newhost`` is + the empty host, you might see something like:: + + $ bin/ceph osd tree + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host newhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host oldhost1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +#. Identify the first target host to convert : + + .. prompt:: bash $ + + OLDHOST=<existing-cluster-host-to-convert> + +#. Swap the new host into the old host's position in the cluster: + + .. prompt:: bash $ + + ceph osd crush swap-bucket $NEWHOST $OLDHOST + + At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on + ``$NEWHOST``. If there is a difference between the total capacity of the + old hosts and the total capacity of the new hosts, you may also see some + data migrate to or from other nodes in the cluster. Provided that the hosts + are similarly sized, however, this will be a relatively small amount of + data. + + .. _bluestore_data_migration_step: + +#. Wait for the data migration to complete: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done + +#. Stop all old OSDs on the now-empty ``$OLDHOST``: + + .. prompt:: bash $ + + ssh $OLDHOST + systemctl kill ceph-osd.target + umount /var/lib/ceph/osd/ceph-* + +#. Destroy and purge the old OSDs: + + .. prompt:: bash $ + + for osd in `ceph osd ls-tree $OLDHOST`; do + ceph osd purge $osd --yes-i-really-mean-it + done + +#. Wipe the old OSDs. This requires you to identify which devices are to be + wiped manually. BE CAREFUL! For each device: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Use the now-empty host as the new host, and repeat: + + .. prompt:: bash $ + + NEWHOST=$OLDHOST + +Advantages: + +* Data is copied over the network only once. +* An entire host's OSDs are converted at once. +* Can be parallelized, to make possible the conversion of multiple hosts at the same time. +* No host involved in this process needs to have a spare device. + +Disadvantages: + +* A spare host is required. +* An entire host's worth of OSDs will be migrating data at a time. This + is likely to impact overall cluster performance. +* All migrated data still makes one full hop over the network. + +Per-OSD device copy +------------------- +A single logical OSD can be converted by using the ``copy`` function +included in ``ceph-objectstore-tool``. This requires that the host have one or more free +devices to provision a new, empty BlueStore OSD. For +example, if each host in your cluster has twelve OSDs, then you need a +thirteenth unused OSD so that each OSD can be converted before the +previous OSD is reclaimed to convert the next OSD. + +Caveats: + +* This approach requires that we prepare an empty BlueStore OSD but that we do not allocate + a new OSD ID to it. The ``ceph-volume`` tool does not support such an operation. **IMPORTANT:** + because the setup of *dmcrypt* is closely tied to the identity of the OSD, this approach does not + work with encrypted OSDs. + +* The device must be manually partitioned. 
+ +* An unsupported user-contributed script that demonstrates this process may be found here: + https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash + +Advantages: + +* Provided that the 'noout' or the 'norecover'/'norebalance' flags are set on the OSD or the + cluster while the conversion process is underway, little or no data migrates over the + network during the conversion. + +Disadvantages: + +* Tooling is not fully implemented, supported, or documented. + +* Each host must have an appropriate spare or empty device for staging. + +* The OSD is offline during the conversion, which means new writes to PGs + with the OSD in their acting set may not be ideally redundant until the + subject OSD comes up and recovers. This increases the risk of data + loss due to an overlapping failure. However, if another OSD fails before + conversion and startup have completed, the original Filestore OSD can be + started to provide access to its original data. diff --git a/doc/rados/operations/cache-tiering.rst b/doc/rados/operations/cache-tiering.rst new file mode 100644 index 000000000..127b0141f --- /dev/null +++ b/doc/rados/operations/cache-tiering.rst @@ -0,0 +1,557 @@ +=============== + Cache Tiering +=============== + +.. warning:: Cache tiering has been deprecated in the Reef release as it + has lacked a maintainer for a very long time. This does not mean + it will be certainly removed, but we may choose to remove it + without much further notice. + +A cache tier provides Ceph Clients with better I/O performance for a subset of +the data stored in a backing storage tier. Cache tiering involves creating a +pool of relatively fast/expensive storage devices (e.g., solid state drives) +configured to act as a cache tier, and a backing pool of either erasure-coded +or relatively slower/cheaper devices configured to act as an economical storage +tier. The Ceph objecter handles where to place the objects and the tiering +agent determines when to flush objects from the cache to the backing storage +tier. So the cache tier and the backing storage tier are completely transparent +to Ceph clients. + + +.. ditaa:: + +-------------+ + | Ceph Client | + +------+------+ + ^ + Tiering is | + Transparent | Faster I/O + to Ceph | +---------------+ + Client Ops | | | + | +----->+ Cache Tier | + | | | | + | | +-----+---+-----+ + | | | ^ + v v | | Active Data in Cache Tier + +------+----+--+ | | + | Objecter | | | + +-----------+--+ | | + ^ | | Inactive Data in Storage Tier + | v | + | +-----+---+-----+ + | | | + +----->| Storage Tier | + | | + +---------------+ + Slower I/O + + +The cache tiering agent handles the migration of data between the cache tier +and the backing storage tier automatically. However, admins have the ability to +configure how this migration takes place by setting the ``cache-mode``. There are +two main scenarios: + +- **writeback** mode: If the base tier and the cache tier are configured in + ``writeback`` mode, Ceph clients receive an ACK from the base tier every time + they write data to it. Then the cache tiering agent determines whether + ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it + has been set and the data has been written more than a specified number of + times per interval, the data is promoted to the cache tier. + + When Ceph clients need access to data stored in the base tier, the cache + tiering agent reads the data from the base tier and returns it to the client. 
+ While data is being read from the base tier, the cache tiering agent consults + the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and + decides whether to promote that data from the base tier to the cache tier. + When data has been promoted from the base tier to the cache tier, the Ceph + client is able to perform I/O operations on it using the cache tier. This is + well-suited for mutable data (for example, photo/video editing, transactional + data). + +- **readproxy** mode: This mode will use any objects that already + exist in the cache tier, but if an object is not present in the + cache the request will be proxied to the base tier. This is useful + for transitioning from ``writeback`` mode to a disabled cache as it + allows the workload to function properly while the cache is drained, + without adding any new objects to the cache. + +Other cache modes are: + +- **readonly** promotes objects to the cache on read operations only; write + operations are forwarded to the base tier. This mode is intended for + read-only workloads that do not require consistency to be enforced by the + storage system. (**Warning**: when objects are updated in the base tier, + Ceph makes **no** attempt to sync these updates to the corresponding objects + in the cache. Since this mode is considered experimental, a + ``--yes-i-really-mean-it`` option must be passed in order to enable it.) + +- **none** is used to completely disable caching. + + +A word of caution +================= + +Cache tiering will *degrade* performance for most workloads. Users should use +extreme caution before using this feature. + +* *Workload dependent*: Whether a cache will improve performance is + highly dependent on the workload. Because there is a cost + associated with moving objects into or out of the cache, it can only + be effective when there is a *large skew* in the access pattern in + the data set, such that most of the requests touch a small number of + objects. The cache pool should be large enough to capture the + working set for your workload to avoid thrashing. + +* *Difficult to benchmark*: Most benchmarks that users run to measure + performance will show terrible performance with cache tiering, in + part because very few of them skew requests toward a small set of + objects, it can take a long time for the cache to "warm up," and + because the warm-up cost can be high. + +* *Usually slower*: For workloads that are not cache tiering-friendly, + performance is often slower than a normal RADOS pool without cache + tiering enabled. + +* *librados object enumeration*: The librados-level object enumeration + API is not meant to be coherent in the presence of the case. If + your application is using librados directly and relies on object + enumeration, cache tiering will probably not work as expected. + (This is not a problem for RGW, RBD, or CephFS.) + +* *Complexity*: Enabling cache tiering means that a lot of additional + machinery and complexity within the RADOS cluster is being used. + This increases the probability that you will encounter a bug in the system + that other users have not yet encountered and will put your deployment at a + higher level of risk. + +Known Good Workloads +-------------------- + +* *RGW time-skewed*: If the RGW workload is such that almost all read + operations are directed at recently written objects, a simple cache + tiering configuration that destages recently written objects from + the cache to the base tier after a configurable period can work + well. 
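
As a rough illustration only (the values are placeholders to be tuned for your own
workload, and ``hot-storage`` is the example cache pool name used elsewhere in this
document), such a time-skewed configuration might combine hit sets with age-based
flushing and eviction:

.. prompt:: bash $

   ceph osd pool set hot-storage hit_set_type bloom
   ceph osd pool set hot-storage hit_set_count 12
   ceph osd pool set hot-storage hit_set_period 3600
   ceph osd pool set hot-storage cache_min_flush_age 1800
   ceph osd pool set hot-storage cache_min_evict_age 3600
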
+ +Known Bad Workloads +------------------- + +The following configurations are *known to work poorly* with cache +tiering. + +* *RBD with replicated cache and erasure-coded base*: This is a common + request, but usually does not perform well. Even reasonably skewed + workloads still send some small writes to cold objects, and because + small writes are not yet supported by the erasure-coded pool, entire + (usually 4 MB) objects must be migrated into the cache in order to + satisfy a small (often 4 KB) write. Only a handful of users have + successfully deployed this configuration, and it only works for them + because their data is extremely cold (backups) and they are not in + any way sensitive to performance. + +* *RBD with replicated cache and base*: RBD with a replicated base + tier does better than when the base is erasure coded, but it is + still highly dependent on the amount of skew in the workload, and + very difficult to validate. The user will need to have a good + understanding of their workload and will need to tune the cache + tiering parameters carefully. + + +Setting Up Pools +================ + +To set up cache tiering, you must have two pools. One will act as the +backing storage and the other will act as the cache. + + +Setting Up a Backing Storage Pool +--------------------------------- + +Setting up a backing storage pool typically involves one of two scenarios: + +- **Standard Storage**: In this scenario, the pool stores multiple copies + of an object in the Ceph Storage Cluster. + +- **Erasure Coding:** In this scenario, the pool uses erasure coding to + store data much more efficiently with a small performance tradeoff. + +In the standard storage scenario, you can setup a CRUSH rule to establish +the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD +Daemons perform optimally when all storage drives in the rule are of the +same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_ +for details on creating a rule. Once you have created a rule, create +a backing storage pool. + +In the erasure coding scenario, the pool creation arguments will generate the +appropriate rule automatically. See `Create a Pool`_ for details. + +In subsequent examples, we will refer to the backing storage pool +as ``cold-storage``. + + +Setting Up a Cache Pool +----------------------- + +Setting up a cache pool follows the same procedure as the standard storage +scenario, but with this difference: the drives for the cache tier are typically +high performance drives that reside in their own servers and have their own +CRUSH rule. When setting up such a rule, it should take account of the hosts +that have the high performance drives while omitting the hosts that don't. See +:ref:`CRUSH Device Class<crush-map-device-class>` for details. + + +In subsequent examples, we will refer to the cache pool as ``hot-storage`` and +the backing pool as ``cold-storage``. + +For cache tier configuration and default values, see +`Pools - Set Pool Values`_. + + +Creating a Cache Tier +===================== + +Setting up a cache tier involves associating a backing storage pool with +a cache pool: + +.. prompt:: bash $ + + ceph osd tier add {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier add cold-storage hot-storage + +To set the cache mode, execute the following: + +.. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} {cache-mode} + +For example: + +.. 
prompt:: bash $ + + ceph osd tier cache-mode hot-storage writeback + +The cache tiers overlay the backing storage tier, so they require one +additional step: you must direct all client traffic from the storage pool to +the cache pool. To direct client traffic directly to the cache pool, execute +the following: + +.. prompt:: bash $ + + ceph osd tier set-overlay {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier set-overlay cold-storage hot-storage + + +Configuring a Cache Tier +======================== + +Cache tiers have several configuration options. You may set +cache tier configuration options with the following usage: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} {key} {value} + +See `Pools - Set Pool Values`_ for details. + + +Target Size and Type +-------------------- + +Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_type bloom + +For example: + +.. prompt:: bash $ + + ceph osd pool set hot-storage hit_set_type bloom + +The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to +store, and how much time each HitSet should cover: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_count 12 + ceph osd pool set {cachepool} hit_set_period 14400 + ceph osd pool set {cachepool} target_max_bytes 1000000000000 + +.. note:: A larger ``hit_set_count`` results in more RAM consumed by + the ``ceph-osd`` process. + +Binning accesses over time allows Ceph to determine whether a Ceph client +accessed an object at least once, or more than once over a time period +("age" vs "temperature"). + +The ``min_read_recency_for_promote`` defines how many HitSets to check for the +existence of an object when handling a read operation. The checking result is +used to decide whether to promote the object asynchronously. Its value should be +between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted. +If it's set to 1, the current HitSet is checked. And if this object is in the +current HitSet, it's promoted. Otherwise not. For the other values, the exact +number of archive HitSets are checked. The object is promoted if the object is +found in any of the most recent ``min_read_recency_for_promote`` HitSets. + +A similar parameter can be set for the write operation, which is +``min_write_recency_for_promote``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} min_read_recency_for_promote 2 + ceph osd pool set {cachepool} min_write_recency_for_promote 2 + +.. note:: The longer the period and the higher the + ``min_read_recency_for_promote`` and + ``min_write_recency_for_promote``values, the more RAM the ``ceph-osd`` + daemon consumes. In particular, when the agent is active to flush + or evict cache objects, all ``hit_set_count`` HitSets are loaded + into RAM. + + +Cache Sizing +------------ + +The cache tiering agent performs two main functions: + +- **Flushing:** The agent identifies modified (or dirty) objects and forwards + them to the storage pool for long-term storage. + +- **Evicting:** The agent identifies objects that haven't been modified + (or clean) and evicts the least recently used among them from the cache. + + +Absolute Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects based upon the total number +of bytes or the total number of objects. To specify a maximum number of bytes, +execute the following: + +.. 
prompt:: bash $
+
+   ceph osd pool set {cachepool} target_max_bytes {#bytes}
+
+For example, to flush or evict at 1 TB, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage target_max_bytes 1099511627776
+
+To specify the maximum number of objects, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} target_max_objects {#objects}
+
+For example, to flush or evict at 1M objects, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage target_max_objects 1000000
+
+.. note:: Ceph is not able to determine the size of a cache pool automatically,
+   so absolute sizing must be configured here; otherwise, flushing and eviction
+   will not work. If you specify both limits, the cache tiering agent will
+   begin flushing or evicting when either threshold is triggered.
+
+.. note:: All client requests will be blocked only when the ``target_max_bytes``
+   or ``target_max_objects`` limit has been reached.
+
+Relative Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects relative to the size of the
+cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
+`Absolute Sizing`_). When the cache pool contains a certain percentage of
+modified (or dirty) objects, the cache tiering agent flushes them to the
+storage pool. To set the ``cache_target_dirty_ratio``, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}
+
+For example, setting the value to ``0.4`` will begin flushing modified
+(dirty) objects when they reach 40% of the cache pool's capacity:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+
+When the dirty objects reach a higher percentage of the cache pool's capacity,
+the agent flushes them at a higher speed. To set the
+``cache_target_dirty_high_ratio``, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}
+
+For example, setting the value to ``0.6`` will begin aggressively flushing
+dirty objects when they reach 60% of the cache pool's capacity. This value
+should lie between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+
+When the cache pool reaches a certain percentage of its capacity, the cache
+tiering agent will evict objects to maintain free capacity. To set the
+``cache_target_full_ratio``, execute the following:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}
+
+For example, setting the value to ``0.8`` will begin evicting unmodified
+(clean) objects when they reach 80% of the cache pool's capacity:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_target_full_ratio 0.8
+
+
+Cache Age
+---------
+
+You can specify the minimum age of an object before the cache tiering agent
+flushes a recently modified (or dirty) object to the backing storage pool:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_min_flush_age {#seconds}
+
+For example, to flush modified (or dirty) objects after 10 minutes, execute the
+following:
+
+.. prompt:: bash $
+
+   ceph osd pool set hot-storage cache_min_flush_age 600
+
+You can specify the minimum age of an object before it will be evicted from the
+cache tier:
+
+.. prompt:: bash $
+
+   ceph osd pool set {cachepool} cache_min_evict_age {#seconds}
+
+For example, to evict objects after 30 minutes, execute the following:
+
+.. 
prompt:: bash $ + + ceph osd pool set hot-storage cache_min_evict_age 1800 + + +Removing a Cache Tier +===================== + +Removing a cache tier differs depending on whether it is a writeback +cache or a read-only cache. + + +Removing a Read-Only Cache +-------------------------- + +Since a read-only cache does not have modified data, you can disable +and remove it without losing any recent changes to objects in the cache. + +#. Change the cache-mode to ``none`` to disable it.: + + .. prompt:: bash + + ceph osd tier cache-mode {cachepool} none + + For example: + + .. prompt:: bash $ + + ceph osd tier cache-mode hot-storage none + +#. Remove the cache pool from the backing pool.: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +Removing a Writeback Cache +-------------------------- + +Since a writeback cache may have modified data, you must take steps to ensure +that you do not lose any recent changes to objects in the cache before you +disable and remove it. + + +#. Change the cache mode to ``proxy`` so that new and modified objects will + flush to the backing storage pool.: + + .. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} proxy + + For example: + + .. prompt:: bash $ + + ceph osd tier cache-mode hot-storage proxy + + +#. Ensure that the cache pool has been flushed. This may take a few minutes: + + .. prompt:: bash $ + + rados -p {cachepool} ls + + If the cache pool still has objects, you can flush them manually. + For example: + + .. prompt:: bash $ + + rados -p {cachepool} cache-flush-evict-all + + +#. Remove the overlay so that clients will not direct traffic to the cache.: + + .. prompt:: bash $ + + ceph osd tier remove-overlay {storagetier} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove-overlay cold-storage + + +#. Finally, remove the cache tier pool from the backing storage pool.: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +.. _Create a Pool: ../pools#create-a-pool +.. _Pools - Set Pool Values: ../pools#set-pool-values +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _CRUSH Maps: ../crush-map +.. _Absolute Sizing: #absolute-sizing diff --git a/doc/rados/operations/change-mon-elections.rst b/doc/rados/operations/change-mon-elections.rst new file mode 100644 index 000000000..7418ea363 --- /dev/null +++ b/doc/rados/operations/change-mon-elections.rst @@ -0,0 +1,100 @@ +.. _changing_monitor_elections: + +======================================= +Configuring Monitor Election Strategies +======================================= + +By default, the monitors are in ``classic`` mode. We recommend staying in this +mode unless you have a very specific reason. + +If you want to switch modes BEFORE constructing the cluster, change the ``mon +election default strategy`` option. This option takes an integer value: + +* ``1`` for ``classic`` +* ``2`` for ``disallow`` +* ``3`` for ``connectivity`` + +After your cluster has started running, you can change strategies by running a +command of the following form: + + $ ceph mon set election_strategy {classic|disallow|connectivity} + +Choosing a mode +=============== + +The modes other than ``classic`` provide specific features. We recommend staying +in ``classic`` mode if you don't need these extra features because it is the +simplest mode. + +.. 
_rados_operations_disallow_mode: + +Disallow Mode +============= + +The ``disallow`` mode allows you to mark monitors as disallowed. Disallowed +monitors participate in the quorum and serve clients, but cannot be elected +leader. You might want to use this mode for monitors that are far away from +clients. + +To disallow a monitor from being elected leader, run a command of the following +form: + +.. prompt:: bash $ + + ceph mon add disallowed_leader {name} + +To remove a monitor from the disallowed list and allow it to be elected leader, +run a command of the following form: + +.. prompt:: bash $ + + ceph mon rm disallowed_leader {name} + +To see the list of disallowed leaders, examine the output of the following +command: + +.. prompt:: bash $ + + ceph mon dump + +Connectivity Mode +================= + +The ``connectivity`` mode evaluates connection scores that are provided by each +monitor for its peers and elects the monitor with the highest score. This mode +is designed to handle network partitioning (also called *net-splits*): network +partitioning might occur if your cluster is stretched across multiple data +centers or otherwise has a non-uniform or unbalanced network topology. + +The ``connectivity`` mode also supports disallowing monitors from being elected +leader by using the same commands that were presented in :ref:`Disallow Mode <rados_operations_disallow_mode>`. + +Examining connectivity scores +============================= + +The monitors maintain connection scores even if they aren't in ``connectivity`` +mode. To examine a specific monitor's connection scores, run a command of the +following form: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores dump + +Scores for an individual connection range from ``0`` to ``1`` inclusive and +include whether the connection is considered alive or dead (as determined by +whether it returned its latest ping before timeout). + +Connectivity scores are expected to remain valid. However, if during +troubleshooting you determine that these scores have for some reason become +invalid, drop the history and reset the scores by running a command of the +following form: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores reset + +Resetting connectivity scores carries little risk: monitors will still quickly +determine whether a connection is alive or dead and trend back to the previous +scores if those scores were accurate. Nevertheless, resetting scores ought to +be unnecessary and it is not recommended unless advised by your support team +or by a developer. diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst new file mode 100644 index 000000000..033f831cd --- /dev/null +++ b/doc/rados/operations/control.rst @@ -0,0 +1,665 @@ +.. index:: control, commands + +================== + Control Commands +================== + + +Monitor Commands +================ + +To issue monitor commands, use the ``ceph`` utility: + +.. prompt:: bash $ + + ceph [-m monhost] {command} + +In most cases, monitor commands have the following form: + +.. prompt:: bash $ + + ceph {subsystem} {command} + + +System Commands +=============== + +To display the current cluster status, run the following commands: + +.. prompt:: bash $ + + ceph -s + ceph status + +To display a running summary of cluster status and major events, run the +following command: + +.. 
prompt:: bash $ + + ceph -w + +To display the monitor quorum, including which monitors are participating and +which one is the leader, run the following commands: + +.. prompt:: bash $ + + ceph mon stat + ceph quorum_status + +To query the status of a single monitor, including whether it is in the quorum, +run the following command: + +.. prompt:: bash $ + + ceph tell mon.[id] mon_status + +Here the value of ``[id]`` can be found by consulting the output of ``ceph +-s``. + + +Authentication Subsystem +======================== + +To add an OSD keyring for a specific OSD, run the following command: + +.. prompt:: bash $ + + ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring} + +To list the cluster's keys and their capabilities, run the following command: + +.. prompt:: bash $ + + ceph auth ls + + +Placement Group Subsystem +========================= + +To display the statistics for all placement groups (PGs), run the following +command: + +.. prompt:: bash $ + + ceph pg dump [--format {format}] + +Here the valid formats are ``plain`` (default), ``json`` ``json-pretty``, +``xml``, and ``xml-pretty``. When implementing monitoring tools and other +tools, it is best to use the ``json`` format. JSON parsing is more +deterministic than the ``plain`` format (which is more human readable), and the +layout is much more consistent from release to release. The ``jq`` utility is +very useful for extracting data from JSON output. + +To display the statistics for all PGs stuck in a specified state, run the +following command: + +.. prompt:: bash $ + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}] + +Here ``--format`` may be ``plain`` (default), ``json``, ``json-pretty``, +``xml``, or ``xml-pretty``. + +The ``--threshold`` argument determines the time interval (in seconds) for a PG +to be considered ``stuck`` (default: 300). + +PGs might be stuck in any of the following states: + +**Inactive** + + PGs are unable to process reads or writes because they are waiting for an + OSD that has the most up-to-date data to return to an ``up`` state. + + +**Unclean** + + PGs contain objects that have not been replicated the desired number of + times. These PGs have not yet completed the process of recovering. + + +**Stale** + + PGs are in an unknown state, because the OSDs that host them have not + reported to the monitor cluster for a certain period of time (specified by + the ``mon_osd_report_timeout`` configuration setting). + + +To delete a ``lost`` object or revert an object to its prior state, either by +reverting it to its previous version or by deleting it because it was just +created and has no previous version, run the following command: + +.. prompt:: bash $ + + ceph pg {pgid} mark_unfound_lost revert|delete + + +.. _osd-subsystem: + +OSD Subsystem +============= + +To query OSD subsystem status, run the following command: + +.. prompt:: bash $ + + ceph osd stat + +To write a copy of the most recent OSD map to a file (see :ref:`osdmaptool +<osdmaptool>`), run the following command: + +.. prompt:: bash $ + + ceph osd getmap -o file + +To write a copy of the CRUSH map from the most recent OSD map to a file, run +the following command: + +.. prompt:: bash $ + + ceph osd getcrushmap -o file + +Note that this command is functionally equivalent to the following two +commands: + +.. prompt:: bash $ + + ceph osd getmap -o /tmp/osdmap + osdmaptool /tmp/osdmap --export-crush file + +To dump the OSD map, run the following command: + +.. 
prompt:: bash $ + + ceph osd dump [--format {format}] + +The ``--format`` option accepts the following arguments: ``plain`` (default), +``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON is +the recommended format for tools, scripting, and other forms of automation. + +To dump the OSD map as a tree that lists one OSD per line and displays +information about the weights and states of the OSDs, run the following +command: + +.. prompt:: bash $ + + ceph osd tree [--format {format}] + +To find out where a specific RADOS object is stored in the system, run a +command of the following form: + +.. prompt:: bash $ + + ceph osd map <pool-name> <object-name> + +To add or move a new OSD (specified by its ID, name, or weight) to a specific +CRUSH location, run the following command: + +.. prompt:: bash $ + + ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] + +To remove an existing OSD from the CRUSH map, run the following command: + +.. prompt:: bash $ + + ceph osd crush remove {name} + +To remove an existing bucket from the CRUSH map, run the following command: + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +To move an existing bucket from one position in the CRUSH hierarchy to another, +run the following command: + +.. prompt:: bash $ + + ceph osd crush move {id} {loc1} [{loc2} ...] + +To set the CRUSH weight of a specific OSD (specified by ``{name}``) to +``{weight}``, run the following command: + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +To mark an OSD as ``lost``, run the following command: + +.. prompt:: bash $ + + ceph osd lost {id} [--yes-i-really-mean-it] + +.. warning:: + This could result in permanent data loss. Use with caution! + +To create a new OSD, run the following command: + +.. prompt:: bash $ + + ceph osd create [{uuid}] + +If no UUID is given as part of this command, the UUID will be set automatically +when the OSD starts up. + +To remove one or more specific OSDs, run the following command: + +.. prompt:: bash $ + + ceph osd rm [{id}...] + +To display the current ``max_osd`` parameter in the OSD map, run the following +command: + +.. prompt:: bash $ + + ceph osd getmaxosd + +To import a specific CRUSH map, run the following command: + +.. prompt:: bash $ + + ceph osd setcrushmap -i file + +To set the ``max_osd`` parameter in the OSD map, run the following command: + +.. prompt:: bash $ + + ceph osd setmaxosd + +The parameter has a default value of 10000. Most operators will never need to +adjust it. + +To mark a specific OSD ``down``, run the following command: + +.. prompt:: bash $ + + ceph osd down {osd-num} + +To mark a specific OSD ``out`` (so that no data will be allocated to it), run +the following command: + +.. prompt:: bash $ + + ceph osd out {osd-num} + +To mark a specific OSD ``in`` (so that data will be allocated to it), run the +following command: + +.. prompt:: bash $ + + ceph osd in {osd-num} + +By using the "pause flags" in the OSD map, you can pause or unpause I/O +requests. If the flags are set, then no I/O requests will be sent to any OSD. +When the flags are cleared, then pending I/O requests will be resent. To set or +clear pause flags, run one of the following commands: + +.. prompt:: bash $ + + ceph osd pause + ceph osd unpause + +You can assign an override or ``reweight`` weight value to a specific OSD if +the normal CRUSH distribution seems to be suboptimal. 
The weight of an OSD +helps determine the extent of its I/O requests and data storage: two OSDs with +the same weight will receive approximately the same number of I/O requests and +store approximately the same amount of data. The ``ceph osd reweight`` command +assigns an override weight to an OSD. The weight value is in the range 0 to 1, +and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of +the data that would otherwise be on this OSD. The command does not change the +weights of the buckets above the OSD in the CRUSH map. Using the command is +merely a corrective measure: for example, if one of your OSDs is at 90% and the +others are at 50%, you could reduce the outlier weight to correct this +imbalance. To assign an override weight to a specific OSD, run the following +command: + +.. prompt:: bash $ + + ceph osd reweight {osd-num} {weight} + +.. note:: Any assigned override reweight value will conflict with the balancer. + This means that if the balancer is in use, all override reweight values + should be ``1.0000`` in order to avoid suboptimal cluster behavior. + +A cluster's OSDs can be reweighted in order to maintain balance if some OSDs +are being disproportionately utilized. Note that override or ``reweight`` +weights have values relative to one another that default to 1.00000; their +values are not absolute, and these weights must be distinguished from CRUSH +weights (which reflect the absolute capacity of a bucket, as measured in TiB). +To reweight OSDs by utilization, run the following command: + +.. prompt:: bash $ + + ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing] + +By default, this command adjusts the override weight of OSDs that have ±20% of +the average utilization, but you can specify a different percentage in the +``threshold`` argument. + +To limit the increment by which any OSD's reweight is to be changed, use the +``max_change`` argument (default: 0.05). To limit the number of OSDs that are +to be adjusted, use the ``max_osds`` argument (default: 4). Increasing these +variables can accelerate the reweighting process, but perhaps at the cost of +slower client operations (as a result of the increase in data movement). + +You can test the ``osd reweight-by-utilization`` command before running it. To +find out which and how many PGs and OSDs will be affected by a specific use of +the ``osd reweight-by-utilization`` command, run the following command: + +.. prompt:: bash $ + + ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing] + +The ``--no-increasing`` option can be added to the ``reweight-by-utilization`` +and ``test-reweight-by-utilization`` commands in order to prevent any override +weights that are currently less than 1.00000 from being increased. This option +can be useful in certain circumstances: for example, when you are hastily +balancing in order to remedy ``full`` or ``nearfull`` OSDs, or when there are +OSDs being evacuated or slowly brought into service. + +Operators of deployments that utilize Nautilus or newer (or later revisions of +Luminous and Mimic) and that have no pre-Luminous clients might likely instead +want to enable the `balancer`` module for ``ceph-mgr``. + +The blocklist can be modified by adding or removing an IP address or a CIDR +range. If an address is blocklisted, it will be unable to connect to any OSD. 
+If an OSD is contained within an IP address or CIDR range that has been +blocklisted, the OSD will be unable to perform operations on its peers when it +acts as a client: such blocked operations include tiering and copy-from +functionality. To add or remove an IP address or CIDR range to the blocklist, +run one of the following commands: + +.. prompt:: bash $ + + ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME] + ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits] + +If you add something to the blocklist with the above ``add`` command, you can +use the ``TIME`` keyword to specify the length of time (in seconds) that it +will remain on the blocklist (default: one hour). To add or remove a CIDR +range, use the ``range`` keyword in the above commands. + +Note that these commands are useful primarily in failure testing. Under normal +conditions, blocklists are maintained automatically and do not need any manual +intervention. + +To create or delete a snapshot of a specific storage pool, run one of the +following commands: + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + ceph osd pool rmsnap {pool-name} {snap-name} + +To create, delete, or rename a specific storage pool, run one of the following +commands: + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [pg_num [pgp_num]] + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + ceph osd pool rename {old-name} {new-name} + +To change a pool setting, run the following command: + +.. prompt:: bash $ + + ceph osd pool set {pool-name} {field} {value} + +The following are valid fields: + + * ``size``: The number of copies of data in the pool. + * ``pg_num``: The PG number. + * ``pgp_num``: The effective number of PGs when calculating placement. + * ``crush_rule``: The rule number for mapping placement. + +To retrieve the value of a pool setting, run the following command: + +.. prompt:: bash $ + + ceph osd pool get {pool-name} {field} + +Valid fields are: + + * ``pg_num``: The PG number. + * ``pgp_num``: The effective number of PGs when calculating placement. + +To send a scrub command to a specific OSD, or to all OSDs (by using ``*``), run +the following command: + +.. prompt:: bash $ + + ceph osd scrub {osd-num} + +To send a repair command to a specific OSD, or to all OSDs (by using ``*``), +run the following command: + +.. prompt:: bash $ + + ceph osd repair N + +You can run a simple throughput benchmark test against a specific OSD. This +test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB) incrementally, +in multiple write requests that each have a size of ``BYTES_PER_WRITE`` +(default: 4 MB). The test is not destructive and it will not overwrite existing +live OSD data, but it might temporarily affect the performance of clients that +are concurrently accessing the OSD. To launch this benchmark test, run the +following command: + +.. prompt:: bash $ + + ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE] + +To clear the caches of a specific OSD during the interval between one benchmark +run and another, run the following command: + +.. prompt:: bash $ + + ceph tell osd.N cache drop + +To retrieve the cache statistics of a specific OSD, run the following command: + +.. prompt:: bash $ + + ceph tell osd.N cache status + +MDS Subsystem +============= + +To change the configuration parameters of a running metadata server, run the +following command: + +.. 
prompt:: bash $
+
+   ceph tell mds.{mds-id} config set {setting} {value}
+
+For example, to enable debug messages, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell mds.0 config set debug_ms 1
+
+To display the status of all metadata servers, run the following command:
+
+.. prompt:: bash $
+
+   ceph mds stat
+
+To mark the active metadata server as failed (and to trigger failover to a
+standby if a standby is present), run the following command:
+
+.. prompt:: bash $
+
+   ceph mds fail 0
+
+.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
+
+
+Mon Subsystem
+=============
+
+To display monitor statistics, run the following command:
+
+.. prompt:: bash $
+
+   ceph mon stat
+
+This command returns output similar to the following:
+
+::
+
+   e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
+
+There is a ``quorum`` list at the end of the output. It lists those monitor
+nodes that are part of the current quorum.
+
+To retrieve this information in a more direct way, run the following command:
+
+.. prompt:: bash $
+
+   ceph quorum_status -f json-pretty
+
+This command returns output similar to the following:
+
+.. code-block:: javascript
+
+   {
+       "election_epoch": 6,
+       "quorum": [
+           0,
+           1,
+           2
+       ],
+       "quorum_names": [
+           "a",
+           "b",
+           "c"
+       ],
+       "quorum_leader_name": "a",
+       "monmap": {
+           "epoch": 2,
+           "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+           "modified": "2016-12-26 14:42:09.288066",
+           "created": "2016-12-26 14:42:03.573585",
+           "features": {
+               "persistent": [
+                   "kraken"
+               ],
+               "optional": []
+           },
+           "mons": [
+               {
+                   "rank": 0,
+                   "name": "a",
+                   "addr": "127.0.0.1:40000\/0",
+                   "public_addr": "127.0.0.1:40000\/0"
+               },
+               {
+                   "rank": 1,
+                   "name": "b",
+                   "addr": "127.0.0.1:40001\/0",
+                   "public_addr": "127.0.0.1:40001\/0"
+               },
+               {
+                   "rank": 2,
+                   "name": "c",
+                   "addr": "127.0.0.1:40002\/0",
+                   "public_addr": "127.0.0.1:40002\/0"
+               }
+           ]
+       }
+   }
+
+
+The ``ceph quorum_status`` command will block until a quorum is reached.
+
+To see the status of a specific monitor, run the following command:
+
+.. prompt:: bash $
+
+   ceph tell mon.[name] mon_status
+
+Here the value of ``[name]`` can be found by consulting the output of the
+``ceph quorum_status`` command. This command returns output similar to the
+following:
+
+::
+
+   {
+     "name": "b",
+     "rank": 1,
+     "state": "peon",
+     "election_epoch": 6,
+     "quorum": [
+       0,
+       1,
+       2
+     ],
+     "features": {
+       "required_con": "9025616074522624",
+       "required_mon": [
+         "kraken"
+       ],
+       "quorum_con": "1152921504336314367",
+       "quorum_mon": [
+         "kraken"
+       ]
+     },
+     "outside_quorum": [],
+     "extra_probe_peers": [],
+     "sync_provider": [],
+     "monmap": {
+       "epoch": 2,
+       "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+       "modified": "2016-12-26 14:42:09.288066",
+       "created": "2016-12-26 14:42:03.573585",
+       "features": {
+         "persistent": [
+           "kraken"
+         ],
+         "optional": []
+       },
+       "mons": [
+         {
+           "rank": 0,
+           "name": "a",
+           "addr": "127.0.0.1:40000\/0",
+           "public_addr": "127.0.0.1:40000\/0"
+         },
+         {
+           "rank": 1,
+           "name": "b",
+           "addr": "127.0.0.1:40001\/0",
+           "public_addr": "127.0.0.1:40001\/0"
+         },
+         {
+           "rank": 2,
+           "name": "c",
+           "addr": "127.0.0.1:40002\/0",
+           "public_addr": "127.0.0.1:40002\/0"
+         }
+       ]
+     }
+   }
+
+To see a dump of the monitor state, run the following command:
+
+.. 
prompt:: bash $ + + ceph mon dump + +This command returns output similar to the following: + +:: + + dumped monmap epoch 2 + epoch 2 + fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc + last_changed 2016-12-26 14:42:09.288066 + created 2016-12-26 14:42:03.573585 + 0: 127.0.0.1:40000/0 mon.a + 1: 127.0.0.1:40001/0 mon.b + 2: 127.0.0.1:40002/0 mon.c diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 000000000..46a4a4f74 --- /dev/null +++ b/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,746 @@ +Manually editing the CRUSH Map +============================== + +.. note:: Manually editing the CRUSH map is an advanced administrator + operation. For the majority of installations, CRUSH changes can be + implemented via the Ceph CLI and do not require manual CRUSH map edits. If + you have identified a use case where manual edits *are* necessary with a + recent Ceph release, consider contacting the Ceph developers at dev@ceph.io + so that future versions of Ceph do not have this problem. + +To edit an existing CRUSH map, carry out the following procedure: + +#. `Get the CRUSH map`_. +#. `Decompile`_ the CRUSH map. +#. Edit at least one of the following sections: `Devices`_, `Buckets`_, and + `Rules`_. Use a text editor for this task. +#. `Recompile`_ the CRUSH map. +#. `Set the CRUSH map`_. + +For details on setting the CRUSH map rule for a specific pool, see `Set Pool +Values`_. + +.. _Get the CRUSH map: #getcrushmap +.. _Decompile: #decompilecrushmap +.. _Devices: #crushmapdevices +.. _Buckets: #crushmapbuckets +.. _Rules: #crushmaprules +.. _Recompile: #compilecrushmap +.. _Set the CRUSH map: #setcrushmap +.. _Set Pool Values: ../pools#setpoolvalues + +.. _getcrushmap: + +Get the CRUSH Map +----------------- + +To get the CRUSH map for your cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph osd getcrushmap -o {compiled-crushmap-filename} + +Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have +specified. Because the CRUSH map is in a compiled form, you must first +decompile it before you can edit it. + +.. _decompilecrushmap: + +Decompile the CRUSH Map +----------------------- + +To decompile the CRUSH map, run a command of the following form: + +.. prompt:: bash $ + + crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} + +.. _compilecrushmap: + +Recompile the CRUSH Map +----------------------- + +To compile the CRUSH map, run a command of the following form: + +.. prompt:: bash $ + + crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename} + +.. _setcrushmap: + +Set the CRUSH Map +----------------- + +To set the CRUSH map for your cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph osd setcrushmap -i {compiled-crushmap-filename} + +Ceph loads (``-i``) a compiled CRUSH map from the filename that you have +specified. + +Sections +-------- + +A CRUSH map has six main sections: + +#. **tunables:** The preamble at the top of the map describes any *tunables* + that are not a part of legacy CRUSH behavior. These tunables correct for old + bugs, optimizations, or other changes that have been made over the years to + improve CRUSH's behavior. + +#. **devices:** Devices are individual OSDs that store data. + +#. **types**: Bucket ``types`` define the types of buckets that are used in + your CRUSH hierarchy. + +#. 
**buckets:** Buckets consist of a hierarchical aggregation of storage + locations (for example, rows, racks, chassis, hosts) and their assigned + weights. After the bucket ``types`` have been defined, the CRUSH map defines + each node in the hierarchy, its type, and which devices or other nodes it + contains. + +#. **rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** ``choose_args`` are alternative weights associated with + the hierarchy that have been adjusted in order to optimize data placement. A + single ``choose_args`` map can be used for the entire cluster, or a number + of ``choose_args`` maps can be created such that each map is crafted for a + particular pool. + + +.. _crushmapdevices: + +CRUSH-Map Devices +----------------- + +Devices are individual OSDs that store data. In this section, there is usually +one device defined for each OSD daemon in your cluster. Devices are identified +by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where +``N`` is the device's ``id``). + + +.. _crush-map-device-class: + +A device can also have a *device class* associated with it: for example, +``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted +by CRUSH rules. This means that device classes allow CRUSH rules to select only +OSDs that match certain characteristics. For example, you might want an RBD +pool associated only with SSDs and a different RBD pool associated only with +HDDs. + +To see a list of devices, run the following command: + +.. prompt:: bash # + + ceph device ls + +The output of this command takes the following form: + +:: + + device {num} {osd.name} [class {class}] + +For example: + +.. prompt:: bash # + + ceph device ls + +:: + + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This +daemon might map to a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a small RAID +device or a partition of a larger storage device. + + +CRUSH-Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a +hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets) +typically represent physical locations in a hierarchy. Nodes aggregate other +nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their +corresponding storage media. + +.. tip:: In the context of CRUSH, the term "bucket" is used to refer to + a node in the hierarchy (that is, to a location or a piece of physical + hardware). In the context of RADOS Gateway APIs, however, the term + "bucket" has a different meaning. + +To add a bucket type to the CRUSH map, create a new line under the list of +bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. +By convention, there is exactly one leaf bucket type and it is ``type 0``; +however, you may give the leaf bucket any name you like (for example: ``osd``, +``disk``, ``drive``, ``storage``):: + + # types + type {num} {bucket-name} + +For example:: + + # types + type 0 osd + type 1 host + type 2 chassis + type 3 rack + type 4 row + type 5 pdu + type 6 pod + type 7 room + type 8 datacenter + type 9 zone + type 10 region + type 11 root + +.. 
_crushmapbuckets: + +CRUSH-Map Bucket Hierarchy +-------------------------- + +The CRUSH algorithm distributes data objects among storage devices according to +a per-device weight value, approximating a uniform probability distribution. +CRUSH distributes objects and their replicas according to the hierarchical +cluster map you define. The CRUSH map represents the available storage devices +and the logical elements that contain them. + +To map placement groups (PGs) to OSDs across failure domains, a CRUSH map +defines a hierarchical list of bucket types under ``#types`` in the generated +CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf +nodes according to their failure domains (for example: hosts, chassis, racks, +power distribution units, pods, rows, rooms, and data centers). With the +exception of the leaf nodes that represent OSDs, the hierarchy is arbitrary and +you may define it according to your own needs. + +We recommend adapting your CRUSH map to your preferred hardware-naming +conventions and using bucket names that clearly reflect the physical +hardware. Clear naming practice can make it easier to administer the cluster +and easier to troubleshoot problems when OSDs malfunction (or other hardware +malfunctions) and the administrator needs access to physical hardware. + + +In the following example, the bucket hierarchy has a leaf bucket named ``osd`` +and two node buckets named ``host`` and ``rack``: + +.. ditaa:: + +-----------+ + | {o}rack | + | Bucket | + +-----+-----+ + | + +---------------+---------------+ + | | + +-----+-----+ +-----+-----+ + | {o}host | | {o}host | + | Bucket | | Bucket | + +-----+-----+ +-----+-----+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd | | osd | | osd | | osd | + | Bucket | | Bucket | | Bucket | | Bucket | + +-----------+ +-----------+ +-----------+ +-----------+ + +.. note:: The higher-numbered ``rack`` bucket type aggregates the + lower-numbered ``host`` bucket type. + +Because leaf nodes reflect storage devices that have already been declared +under the ``#devices`` list at the beginning of the CRUSH map, there is no need +to declare them as bucket instances. The second-lowest bucket type in your +hierarchy is typically used to aggregate the devices (that is, the +second-lowest bucket type is usually the computer that contains the storage +media and, such as ``node``, ``computer``, ``server``, ``host``, or +``machine``). In high-density environments, it is common to have multiple hosts +or nodes in a single chassis (for example, in the cases of blades or twins). It +is important to anticipate the potential consequences of chassis failure -- for +example, during the replacement of a chassis in case of a node failure, the +chassis's hosts or nodes (and their associated OSDs) will be in a ``down`` +state. + +To declare a bucket instance, do the following: specify its type, give it a +unique name (an alphanumeric string), assign it a unique ID expressed as a +negative integer (this is optional), assign it a weight relative to the total +capacity and capability of the item(s) in the bucket, assign it a bucket +algorithm (usually ``straw2``), and specify the bucket algorithm's hash +(usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A +bucket may have one or more items. The items may consist of node buckets or +leaves. Items may have a weight that reflects the relative weight of the item. 
+ +To declare a node bucket, use the following syntax:: + + [bucket-type] [bucket-name] { + id [a unique negative numeric ID] + weight [the relative capacity/capability of the item(s)] + alg [the bucket type: uniform | list | tree | straw | straw2 ] + hash [the hash type: 0 by default] + item [item-name] weight [weight] + } + +For example, in the above diagram, two host buckets (referred to in the +declaration below as ``node1`` and ``node2``) and one rack bucket (referred to +in the declaration below as ``rack1``) are defined. The OSDs are declared as +items within the host buckets:: + + host node1 { + id -1 + alg straw2 + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host node2 { + id -2 + alg straw2 + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + rack rack1 { + id -3 + alg straw2 + hash 0 + item node1 weight 2.00 + item node2 weight 2.00 + } + +.. note:: In this example, the rack bucket does not contain any OSDs. Instead, + it contains lower-level host buckets and includes the sum of their weight in + the item entry. + + +.. topic:: Bucket Types + + Ceph supports five bucket types. Each bucket type provides a balance between + performance and reorganization efficiency, and each is different from the + others. If you are unsure of which bucket type to use, use the ``straw2`` + bucket. For a more technical discussion of bucket types than is offered + here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized + Placement of Replicated Data`_. + + The bucket types are as follows: + + #. **uniform**: Uniform buckets aggregate devices that have **exactly** + the same weight. For example, when hardware is commissioned or + decommissioned, it is often done in sets of machines that have exactly + the same physical configuration (this can be the case, for example, + after bulk purchases). When storage devices have exactly the same + weight, you may use the ``uniform`` bucket type, which allows CRUSH to + map replicas into uniform buckets in constant time. If your devices have + non-uniform weights, you should not use the uniform bucket algorithm. + + #. **list**: List buckets aggregate their content as linked lists. The + behavior of list buckets is governed by the :abbr:`RUSH (Replication + Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this + bucket type, an object is either relocated to the newest device in + accordance with an appropriate probability, or it remains on the older + devices as before. This results in optimal data migration when items are + added to the bucket. The removal of items from the middle or the tail of + the list, however, can result in a significant amount of unnecessary + data movement. This means that list buckets are most suitable for + circumstances in which they **never shrink or very rarely shrink**. + + #. **tree**: Tree buckets use a binary search tree. They are more efficient + at dealing with buckets that contain many items than are list buckets. + The behavior of tree buckets is governed by the :abbr:`RUSH (Replication + Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the + placement time to 0(log\ :sub:`n`). This means that tree buckets are + suitable for managing large sets of devices or nested buckets. + + #. **straw**: Straw buckets allow all items in the bucket to "compete" + against each other for replica placement through a process analogous to + drawing straws. 
This is different from the behavior of list buckets and + tree buckets, which use a divide-and-conquer strategy that either gives + certain items precedence (for example, those at the beginning of a list) + or obviates the need to consider entire subtrees of items. Such an + approach improves the performance of the replica placement process, but + can also introduce suboptimal reorganization behavior when the contents + of a bucket change due an addition, a removal, or the re-weighting of an + item. + + * **straw2**: Straw2 buckets improve on Straw by correctly avoiding + any data movement between items when neighbor weights change. For + example, if the weight of a given item changes (including during the + operations of adding it to the cluster or removing it from the + cluster), there will be data movement to or from only that item. + Neighbor weights are not taken into account. + + +.. topic:: Hash + + Each bucket uses a hash algorithm. As of Reef, Ceph supports the + ``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm, + enter ``0`` as your hash setting. + +.. _weightingbucketitems: + +.. topic:: Weighting Bucket Items + + Ceph expresses bucket weights as doubles, which allows for fine-grained + weighting. A weight is the relative difference between device capacities. We + recommend using ``1.00`` as the relative weight for a 1 TB storage device. + In such a scenario, a weight of ``0.50`` would represent approximately 500 + GB, and a weight of ``3.00`` would represent approximately 3 TB. Buckets + higher in the CRUSH hierarchy have a weight that is the sum of the weight of + the leaf items aggregated by the bucket. + + +.. _crushmaprules: + +CRUSH Map Rules +--------------- + +CRUSH maps have rules that include data placement for a pool: these are +called "CRUSH rules". The default CRUSH map has one rule for each pool. If you +are running a large cluster, you might create many pools and each of those +pools might have its own non-default CRUSH rule. + + +.. note:: In most cases, there is no need to modify the default rule. When a + new pool is created, by default the rule will be set to the value ``0`` + (which indicates the default CRUSH rule, which has the numeric ID ``0``). + +CRUSH rules define policy that governs how data is distributed across the devices in +the hierarchy. The rules define placement as well as replication strategies or +distribution policies that allow you to specify exactly how CRUSH places data +replicas. For example, you might create one rule selecting a pair of targets for +two-way mirroring, another rule for selecting three targets in two different data +centers for three-way replication, and yet another rule for erasure coding across +six storage devices. For a detailed discussion of CRUSH rules, see **Section 3.2** +of `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_. + +A rule takes the following form:: + + rule <rulename> { + + id [a unique integer ID] + type [replicated|erasure] + step take <bucket-name> [class <device-class>] + step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type> + step emit + } + + +``id`` + :Description: A unique integer that identifies the rule. + :Purpose: A component of the rule mask. + :Type: Integer + :Required: Yes + :Default: 0 + + +``type`` + :Description: Denotes the type of replication strategy to be enforced by the + rule. + :Purpose: A component of the rule mask. 
+ :Type: String + :Required: Yes + :Default: ``replicated`` + :Valid Values: ``replicated`` or ``erasure`` + + +``step take <bucket-name> [class <device-class>]`` + :Description: Takes a bucket name and iterates down the tree. If + the ``device-class`` argument is specified, the argument must + match a class assigned to OSDs within the cluster. Only + devices belonging to the class are included. + :Purpose: A component of the rule. + :Required: Yes + :Example: ``step take data`` + + + +``step choose firstn {num} type {bucket-type}`` + :Description: Selects ``num`` buckets of the given type from within the + current bucket. ``{num}`` is usually the number of replicas in + the pool (in other words, the pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available). + - If ``pool-num-replicas > {num} > 0``, choose that many buckets. + - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets. + + :Purpose: A component of the rule. + :Prerequisite: Follows ``step take`` or ``step choose``. + :Example: ``step choose firstn 1 type row`` + + +``step chooseleaf firstn {num} type {bucket-type}`` + :Description: Selects a set of buckets of the given type and chooses a leaf + node (that is, an OSD) from the subtree of each bucket in that set of buckets. The + number of buckets in the set is usually the number of replicas in + the pool (in other words, the pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available). + - If ``pool-num-replicas > {num} > 0``, choose that many buckets. + - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets. + :Purpose: A component of the rule. Using ``chooseleaf`` obviates the need to select a device in a separate step. + :Prerequisite: Follows ``step take`` or ``step choose``. + :Example: ``step chooseleaf firstn 0 type row`` + + +``step emit`` + :Description: Outputs the current value on the top of the stack and empties + the stack. Typically used + at the end of a rule, but may also be used to choose from different + trees in the same rule. + + :Purpose: A component of the rule. + :Prerequisite: Follows ``step choose``. + :Example: ``step emit`` + +.. important:: A single CRUSH rule can be assigned to multiple pools, but + a single pool cannot have multiple CRUSH rules. + +``firstn`` or ``indep`` + + :Description: Determines which replacement strategy CRUSH uses when items (OSDs) + are marked ``down`` in the CRUSH map. When this rule is used + with replicated pools, ``firstn`` is used. When this rule is + used with erasure-coded pools, ``indep`` is used. + + Suppose that a PG is stored on OSDs 1, 2, 3, 4, and 5 and then + OSD 3 goes down. + + When in ``firstn`` mode, CRUSH simply adjusts its calculation + to select OSDs 1 and 2, then selects 3 and discovers that 3 is + down, retries and selects 4 and 5, and finally goes on to + select a new OSD: OSD 6. The final CRUSH mapping + transformation is therefore 1, 2, 3, 4, 5 → 1, 2, 4, 5, 6. + + However, if you were storing an erasure-coded pool, the above + sequence would have changed the data that is mapped to OSDs 4, + 5, and 6. The ``indep`` mode attempts to avoid this unwanted + consequence. When in ``indep`` mode, CRUSH can be expected to + select 3, discover that 3 is down, retry, and select 6. The + final CRUSH mapping transformation is therefore 1, 2, 3, 4, 5 + → 1, 2, 6, 4, 5. + +.. 
_crush-reclassify: + +Migrating from a legacy SSD rule to device classes +-------------------------------------------------- + +Prior to the Luminous release's introduction of the *device class* feature, in +order to write rules that applied to a specialized device type (for example, +SSD), it was necessary to manually edit the CRUSH map and maintain a parallel +hierarchy for each device type. The device class feature provides a more +transparent way to achieve this end. + +However, if your cluster is migrated from an existing manually-customized +per-device map to new device class-based rules, all data in the system will be +reshuffled. + +The ``crushtool`` utility has several commands that can transform a legacy rule +and hierarchy and allow you to start using the new device class rules. There +are three possible types of transformation: + +#. ``--reclassify-root <root-name> <device-class>`` + + This command examines everything under ``root-name`` in the hierarchy and + rewrites any rules that reference the specified root and that have the + form ``take <root-name>`` so that they instead have the + form ``take <root-name> class <device-class>``. The command also renumbers + the buckets in such a way that the old IDs are used for the specified + class's "shadow tree" and as a result no data movement takes place. + + For example, suppose you have the following as an existing rule:: + + rule replicated_rule { + id 0 + type replicated + step take default + step chooseleaf firstn 0 type rack + step emit + } + + If the root ``default`` is reclassified as class ``hdd``, the new rule will + be as follows:: + + rule replicated_rule { + id 0 + type replicated + step take default class hdd + step chooseleaf firstn 0 type rack + step emit + } + +#. ``--set-subtree-class <bucket-name> <device-class>`` + + This command marks every device in the subtree that is rooted at *bucket-name* + with the specified device class. + + This command is typically used in conjunction with the ``--reclassify-root`` option + in order to ensure that all devices in that root are labeled with the + correct class. In certain circumstances, however, some of those devices + are correctly labeled with a different class and must not be relabeled. To + manage this difficulty, one can exclude the ``--set-subtree-class`` + option. The remapping process will not be perfect, because the previous rule + had an effect on devices of multiple classes but the adjusted rules will map + only to devices of the specified device class. However, when there are not many + outlier devices, the resulting level of data movement is often within tolerable + limits. + + +#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>`` + + This command allows you to merge a parallel type-specific hierarchy with the + normal hierarchy. For example, many users have maps that resemble the + following:: + + host node1 { + id -2 # do not change unnecessarily + # weight 109.152 + alg straw2 + hash 0 # rjenkins1 + item osd.0 weight 9.096 + item osd.1 weight 9.096 + item osd.2 weight 9.096 + item osd.3 weight 9.096 + item osd.4 weight 9.096 + item osd.5 weight 9.096 + ... + } + + host node1-ssd { + id -10 # do not change unnecessarily + # weight 2.000 + alg straw2 + hash 0 # rjenkins1 + item osd.80 weight 2.000 + ... + } + + root default { + id -1 # do not change unnecessarily + alg straw2 + hash 0 # rjenkins1 + item node1 weight 110.967 + ... 
+ } + + root ssd { + id -18 # do not change unnecessarily + # weight 16.000 + alg straw2 + hash 0 # rjenkins1 + item node1-ssd weight 2.000 + ... + } + + This command reclassifies each bucket that matches a certain + pattern. The pattern can be of the form ``%suffix`` or ``prefix%``. For + example, in the above example, we would use the pattern + ``%-ssd``. For each matched bucket, the remaining portion of the + name (corresponding to the ``%`` wildcard) specifies the *base bucket*. All + devices in the matched bucket are labeled with the specified + device class and then moved to the base bucket. If the base bucket + does not exist (for example, ``node12-ssd`` exists but ``node12`` does + not), then it is created and linked under the specified + *default parent* bucket. In each case, care is taken to preserve + the old bucket IDs for the new shadow buckets in order to prevent data + movement. Any rules with ``take`` steps that reference the old + buckets are adjusted accordingly. + + +#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>`` + + The same command can also be used without a wildcard in order to map a + single bucket. For example, in the previous example, we want the + ``ssd`` bucket to be mapped to the ``default`` bucket. + +#. The final command to convert the map that consists of the above fragments + resembles the following: + + .. prompt:: bash $ + + ceph osd getcrushmap -o original + crushtool -i original --reclassify \ + --set-subtree-class default hdd \ + --reclassify-root default hdd \ + --reclassify-bucket %-ssd ssd default \ + --reclassify-bucket ssd ssd default \ + -o adjusted + +``--compare`` flag +------------------ + +A ``--compare`` flag is available to make sure that the conversion performed in +:ref:`Migrating from a legacy SSD rule to device classes <crush-reclassify>` is +correct. This flag tests a large sample of inputs against the CRUSH map and +checks that the expected result is output. The options that control these +inputs are the same as the options that apply to the ``--test`` command. For an +illustration of how this ``--compare`` command applies to the above example, +see the following: + +.. prompt:: bash $ + + crushtool -i original --compare adjusted + +:: + + rule 0 had 0/10240 mismatched mappings (0) + rule 1 had 0/10240 mismatched mappings (0) + maps appear equivalent + +If the command finds any differences, the ratio of remapped inputs is reported +in the parentheses. + +When you are satisfied with the adjusted map, apply it to the cluster by +running the following command: + +.. prompt:: bash $ + + ceph osd setcrushmap -i adjusted + +Manually Tuning CRUSH +--------------------- + +If you have verified that all clients are running recent code, you can adjust +the CRUSH tunables by extracting the CRUSH map, modifying the values, and +reinjecting the map into the cluster. The procedure is carried out as follows: + +#. Extract the latest CRUSH map: + + .. prompt:: bash $ + + ceph osd getcrushmap -o /tmp/crush + +#. Adjust tunables. In our tests, the following values appear to result in the + best behavior for both large and small clusters. The procedure requires that + you specify the ``--enable-unsafe-tunables`` flag in the ``crushtool`` + command. Use this option with **extreme care**: + + .. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new + +#. Reinject the modified map: + + .. 
prompt:: bash $ + + ceph osd setcrushmap -i /tmp/crush.new + +Legacy values +------------- + +To set the legacy values of the CRUSH tunables, run the following command: + +.. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy + +The special ``--enable-unsafe-tunables`` flag is required. Be careful when +running old versions of the ``ceph-osd`` daemon after reverting to legacy +values, because the feature bit is not perfectly enforced. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst new file mode 100644 index 000000000..39151e6d4 --- /dev/null +++ b/doc/rados/operations/crush-map.rst @@ -0,0 +1,1147 @@ +============ + CRUSH Maps +============ + +The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm +computes storage locations in order to determine how to store and retrieve +data. CRUSH allows Ceph clients to communicate with OSDs directly rather than +through a centralized server or broker. By using an algorithmically-determined +method of storing and retrieving data, Ceph avoids a single point of failure, a +performance bottleneck, and a physical limit to its scalability. + +CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs, +distributing the data across the cluster in accordance with configured +replication policy and failure domains. For a detailed discussion of CRUSH, see +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ + +CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)` and a +hierarchy of "buckets" (``host``\s, ``rack``\s) and rules that govern how CRUSH +replicates data within the cluster's pools. By reflecting the underlying +physical organization of the installation, CRUSH can model (and thereby +address) the potential for correlated device failures. Some factors relevant +to the CRUSH hierarchy include chassis, racks, physical proximity, a shared +power source, shared networking, and failure domains. By encoding this +information into the CRUSH map, CRUSH placement policies distribute object +replicas across failure domains while maintaining the desired distribution. For +example, to address the possibility of concurrent failures, it might be +desirable to ensure that data replicas are on devices that reside in or rely +upon different shelves, racks, power supplies, controllers, or physical +locations. + +When OSDs are deployed, they are automatically added to the CRUSH map under a +``host`` bucket that is named for the node on which the OSDs run. This +behavior, combined with the configured CRUSH failure domain, ensures that +replicas or erasure-code shards are distributed across hosts and that the +failure of a single host or other kinds of failures will not affect +availability. For larger clusters, administrators must carefully consider their +choice of failure domain. For example, distributing replicas across racks is +typical for mid- to large-sized clusters. + + +CRUSH Location +============== + +The location of an OSD within the CRUSH map's hierarchy is referred to as its +``CRUSH location``. The specification of a CRUSH location takes the form of a +list of key-value pairs. 
For example, if an OSD is in a particular row, rack, +chassis, and host, and is also part of the 'default' CRUSH root (which is the +case for most clusters), its CRUSH location can be specified as follows:: + + root=default row=a rack=a2 chassis=a2a host=a2a1 + +.. note:: + + #. The order of the keys does not matter. + #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default, + valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``, + ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined + types suffice for nearly all clusters, but can be customized by + modifying the CRUSH map. + #. Not all keys need to be specified. For example, by default, Ceph + automatically sets an ``OSD``'s location as ``root=default + host=HOSTNAME`` (as determined by the output of ``hostname -s``). + +The CRUSH location for an OSD can be modified by adding the ``crush location`` +option in ``ceph.conf``. When this option has been added, every time the OSD +starts it verifies that it is in the correct location in the CRUSH map and +moves itself if it is not. To disable this automatic CRUSH map management, add +the following to the ``ceph.conf`` configuration file in the ``[osd]`` +section:: + + osd crush update on start = false + +Note that this action is unnecessary in most cases. + + +Custom location hooks +--------------------- + +A custom location hook can be used to generate a more complete CRUSH location +on startup. The CRUSH location is determined by, in order of preference: + +#. A ``crush location`` option in ``ceph.conf`` +#. A default of ``root=default host=HOSTNAME`` where the hostname is determined + by the output of the ``hostname -s`` command + +A script can be written to provide additional location fields (for example, +``rack`` or ``datacenter``) and the hook can be enabled via the following +config option:: + + crush location hook = /path/to/customized-ceph-crush-location + +This hook is passed several arguments (see below). The hook outputs a single +line to ``stdout`` that contains the CRUSH location description. The output +resembles the following::: + + --cluster CLUSTER --id ID --type TYPE + +Here the cluster name is typically ``ceph``, the ``id`` is the daemon +identifier or (in the case of OSDs) the OSD number, and the daemon type is +``osd``, ``mds, ``mgr``, or ``mon``. + +For example, a simple hook that specifies a rack location via a value in the +file ``/etc/rack`` might be as follows:: + + #!/bin/sh + echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default" + + +CRUSH structure +=============== + +The CRUSH map consists of (1) a hierarchy that describes the physical topology +of the cluster and (2) a set of rules that defines data placement policy. The +hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to +other physical features or groupings: hosts, racks, rows, data centers, and so +on. The rules determine how replicas are placed in terms of that hierarchy (for +example, 'three replicas in different racks'). + +Devices +------- + +Devices are individual OSDs that store data (usually one device for each +storage drive). Devices are identified by an ``id`` (a non-negative integer) +and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``). + +In Luminous and later releases, OSDs can have a *device class* assigned (for +example, ``hdd`` or ``ssd`` or ``nvme``), allowing them to be targeted by CRUSH +rules. Device classes are especially useful when mixing device types within +hosts. + +.. 
_crush_map_default_types: + +Types and Buckets +----------------- + +"Bucket", in the context of CRUSH, is a term for any of the internal nodes in +the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of +*types* that are used to identify these nodes. Default types include: + +- ``osd`` (or ``device``) +- ``host`` +- ``chassis`` +- ``rack`` +- ``row`` +- ``pdu`` +- ``pod`` +- ``room`` +- ``datacenter`` +- ``zone`` +- ``region`` +- ``root`` + +Most clusters use only a handful of these types, and other types can be defined +as needed. + +The hierarchy is built with devices (normally of type ``osd``) at the leaves +and non-device types as the internal nodes. The root node is of type ``root``. +For example: + + +.. ditaa:: + + +-----------------+ + |{o}root default | + +--------+--------+ + | + +---------------+---------------+ + | | + +------+------+ +------+------+ + |{o}host foo | |{o}host bar | + +------+------+ +------+------+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd.0 | | osd.1 | | osd.2 | | osd.3 | + +-----------+ +-----------+ +-----------+ +-----------+ + + +Each node (device or bucket) in the hierarchy has a *weight* that indicates the +relative proportion of the total data that should be stored by that device or +hierarchy subtree. Weights are set at the leaves, indicating the size of the +device. These weights automatically sum in an 'up the tree' direction: that is, +the weight of the ``root`` node will be the sum of the weights of all devices +contained under it. Weights are typically measured in tebibytes (TiB). + +To get a simple view of the cluster's CRUSH hierarchy, including weights, run +the following command: + +.. prompt:: bash $ + + ceph osd tree + +Rules +----- + +CRUSH rules define policy governing how data is distributed across the devices +in the hierarchy. The rules define placement as well as replication strategies +or distribution policies that allow you to specify exactly how CRUSH places +data replicas. For example, you might create one rule selecting a pair of +targets for two-way mirroring, another rule for selecting three targets in two +different data centers for three-way replication, and yet another rule for +erasure coding across six storage devices. For a detailed discussion of CRUSH +rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized +Placement of Replicated Data`_. + +CRUSH rules can be created via the command-line by specifying the *pool type* +that they will govern (replicated or erasure coded), the *failure domain*, and +optionally a *device class*. In rare cases, CRUSH rules must be created by +manually editing the CRUSH map. + +To see the rules that are defined for the cluster, run the following command: + +.. prompt:: bash $ + + ceph osd crush rule ls + +To view the contents of the rules, run the following command: + +.. prompt:: bash $ + + ceph osd crush rule dump + +.. _device_classes: + +Device classes +-------------- + +Each device can optionally have a *class* assigned. By default, OSDs +automatically set their class at startup to `hdd`, `ssd`, or `nvme` in +accordance with the type of device they are backed by. + +To explicitly set the device class of one or more OSDs, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush set-device-class <class> <osd-name> [...] + +Once a device class has been set, it cannot be changed to another class until +the old class is unset. 
To remove the old class of one or more OSDs, run a +command of the following form: + +.. prompt:: bash $ + + ceph osd crush rm-device-class <osd-name> [...] + +This restriction allows administrators to set device classes that won't be +changed on OSD restart or by a script. + +To create a placement rule that targets a specific device class, run a command +of the following form: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class> + +To apply the new placement rule to a specific pool, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> crush_rule <rule-name> + +Device classes are implemented by creating one or more "shadow" CRUSH +hierarchies. For each device class in use, there will be a shadow hierarchy +that contains only devices of that class. CRUSH rules can then distribute data +across the relevant shadow hierarchy. This approach is fully backward +compatible with older Ceph clients. To view the CRUSH hierarchy with shadow +items displayed, run the following command: + +.. prompt:: bash # + + ceph osd crush tree --show-shadow + +Some older clusters that were created before the Luminous release rely on +manually crafted CRUSH maps to maintain per-device-type hierarchies. For these +clusters, there is a *reclassify* tool available that can help them transition +to device classes without triggering unwanted data movement (see +:ref:`crush-reclassify`). + +Weight sets +----------- + +A *weight set* is an alternative set of weights to use when calculating data +placement. The normal weights associated with each device in the CRUSH map are +set in accordance with the device size and indicate how much data should be +stored where. However, because CRUSH is a probabilistic pseudorandom placement +process, there is always some variation from this ideal distribution (in the +same way that rolling a die sixty times will likely not result in exactly ten +ones and ten sixes). Weight sets allow the cluster to perform numerical +optimization based on the specifics of your cluster (for example: hierarchy, +pools) to achieve a balanced distribution. + +Ceph supports two types of weight sets: + +#. A **compat** weight set is a single alternative set of weights for each + device and each node in the cluster. Compat weight sets cannot be expected + to correct all anomalies (for example, PGs for different pools might be of + different sizes and have different load levels, but are mostly treated alike + by the balancer). However, they have the major advantage of being *backward + compatible* with previous versions of Ceph. This means that even though + weight sets were first introduced in Luminous v12.2.z, older clients (for + example, Firefly) can still connect to the cluster when a compat weight set + is being used to balance data. + +#. A **per-pool** weight set is more flexible in that it allows placement to + be optimized for each data pool. Additionally, weights can be adjusted + for each position of placement, allowing the optimizer to correct for a + subtle skew of data toward devices with small weights relative to their + peers (an effect that is usually apparent only in very large clusters + but that can cause balancing problems). + +When weight sets are in use, the weights associated with each node in the +hierarchy are visible in a separate column (labeled either as ``(compat)`` or +as the pool name) in the output of the following command: + +.. 
prompt:: bash # + + ceph osd tree + +If both *compat* and *per-pool* weight sets are in use, data placement for a +particular pool will use its own per-pool weight set if present. If only +*compat* weight sets are in use, data placement will use the compat weight set. +If neither are in use, data placement will use the normal CRUSH weights. + +Although weight sets can be set up and adjusted manually, we recommend enabling +the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the +cluster is running Luminous or a later release. + +Modifying the CRUSH map +======================= + +.. _addosd: + +Adding/Moving an OSD +-------------------- + +.. note:: Under normal conditions, OSDs automatically add themselves to the + CRUSH map when they are created. The command in this section is rarely + needed. + + +To add or move an OSD in the CRUSH map of a running cluster, run a command of +the following form: + +.. prompt:: bash $ + + ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...] + +For details on this command's parameters, see the following: + +``name`` + :Description: The full name of the OSD. + :Type: String + :Required: Yes + :Example: ``osd.0`` + + +``weight`` + :Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in terabytes (TB). + :Type: Double + :Required: Yes + :Example: ``2.0`` + + +``root`` + :Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``). + :Type: Key-value pair. + :Required: Yes + :Example: ``root=default`` + + +``bucket-type`` + :Description: The OSD's location in the CRUSH hierarchy. + :Type: Key-value pairs. + :Required: No + :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +In the following example, the command adds ``osd.0`` to the hierarchy, or moves +``osd.0`` from a previous location: + +.. prompt:: bash $ + + ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 + + +Adjusting OSD weight +-------------------- + +.. note:: Under normal conditions, OSDs automatically add themselves to the + CRUSH map with the correct weight when they are created. The command in this + section is rarely needed. + +To adjust an OSD's CRUSH weight in a running cluster, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +For details on this command's parameters, see the following: + +``name`` + :Description: The full name of the OSD. + :Type: String + :Required: Yes + :Example: ``osd.0`` + + +``weight`` + :Description: The CRUSH weight of the OSD. + :Type: Double + :Required: Yes + :Example: ``2.0`` + + +.. _removeosd: + +Removing an OSD +--------------- + +.. note:: OSDs are normally removed from the CRUSH map as a result of the + `ceph osd purge`` command. This command is rarely needed. + +To remove an OSD from the CRUSH map of a running cluster, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush remove {name} + +For details on the ``name`` parameter, see the following: + +``name`` + :Description: The full name of the OSD. + :Type: String + :Required: Yes + :Example: ``osd.0`` + + +Adding a CRUSH Bucket +--------------------- + +.. note:: Buckets are implicitly created when an OSD is added and the command + that creates it specifies a ``{bucket-type}={bucket-name}`` as part of the + OSD's location (provided that a bucket with that name does not already + exist). 
The command in this section is typically used when manually + adjusting the structure of the hierarchy after OSDs have already been + created. One use of this command is to move a series of hosts to a new + rack-level bucket. Another use of this command is to add new ``host`` + buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive + any data until they are ready to receive data. When they are ready, move the + buckets to the ``default`` root or to any other root as described below. + +To add a bucket in the CRUSH map of a running cluster, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd crush add-bucket {bucket-name} {bucket-type} + +For details on this command's parameters, see the following: + +``bucket-name`` + :Description: The full name of the bucket. + :Type: String + :Required: Yes + :Example: ``rack12`` + + +``bucket-type`` + :Description: The type of the bucket. This type must already exist in the CRUSH hierarchy. + :Type: String + :Required: Yes + :Example: ``rack`` + +In the following example, the command adds the ``rack12`` bucket to the hierarchy: + +.. prompt:: bash $ + + ceph osd crush add-bucket rack12 rack + +Moving a Bucket +--------------- + +To move a bucket to a different location or position in the CRUSH map +hierarchy, run a command of the following form: + +.. prompt:: bash $ + + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + +For details on this command's parameters, see the following: + +``bucket-name`` + :Description: The name of the bucket that you are moving. + :Type: String + :Required: Yes + :Example: ``foo-bar-1`` + +``bucket-type`` + :Description: The bucket's new location in the CRUSH hierarchy. + :Type: Key-value pairs. + :Required: No + :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +Removing a Bucket +----------------- + +To remove a bucket from the CRUSH hierarchy, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +.. note:: A bucket must already be empty before it is removed from the CRUSH + hierarchy. In other words, there must not be OSDs or any other CRUSH buckets + within it. + +For details on the ``bucket-name`` parameter, see the following: + +``bucket-name`` + :Description: The name of the bucket that is being removed. + :Type: String + :Required: Yes + :Example: ``rack12`` + +In the following example, the command removes the ``rack12`` bucket from the +hierarchy: + +.. prompt:: bash $ + + ceph osd crush remove rack12 + +Creating a compat weight set +---------------------------- + +.. note:: Normally this action is done automatically if needed by the + ``balancer`` module (provided that the module is enabled). + +To create a *compat* weight set, run the following command: + +.. prompt:: bash $ + + ceph osd crush weight-set create-compat + +To adjust the weights of the compat weight set, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight-compat {name} {weight} + +To destroy the compat weight set, run the following command: + +.. prompt:: bash $ + + ceph osd crush weight-set rm-compat + +Creating per-pool weight sets +----------------------------- + +To create a weight set for a specific pool, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush weight-set create {pool-name} {mode} + +.. note:: Per-pool weight sets can be used only if all servers and daemons are + running Luminous v12.2.z or a later release. 
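+
+For example, to create a positional weight set for a hypothetical pool named
+``rbd`` (the pool name and mode values here are taken from the parameter
+descriptions below), you might run:
+
+.. prompt:: bash $
+
+   ceph osd crush weight-set create rbd positional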
+ +For details on this command's parameters, see the following: + +``pool-name`` + :Description: The name of a RADOS pool. + :Type: String + :Required: Yes + :Example: ``rbd`` + +``mode`` + :Description: Either ``flat`` or ``positional``. A *flat* weight set + assigns a single weight to all devices or buckets. A + *positional* weight set has a potentially different + weight for each position in the resulting placement + mapping. For example: if a pool has a replica count of + ``3``, then a positional weight set will have three + weights for each device and bucket. + :Type: String + :Required: Yes + :Example: ``flat`` + +To adjust the weight of an item in a weight set, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]} + +To list existing weight sets, run the following command: + +.. prompt:: bash $ + + ceph osd crush weight-set ls + +To remove a weight set, run a command of the following form: + +.. prompt:: bash $ + + ceph osd crush weight-set rm {pool-name} + + +Creating a rule for a replicated pool +------------------------------------- + +When you create a CRUSH rule for a replicated pool, there is an important +decision to make: selecting a failure domain. For example, if you select a +failure domain of ``host``, then CRUSH will ensure that each replica of the +data is stored on a unique host. Alternatively, if you select a failure domain +of ``rack``, then each replica of the data will be stored in a different rack. +Your selection of failure domain should be guided by the size and its CRUSH +topology. + +The entire cluster hierarchy is typically nested beneath a root node that is +named ``default``. If you have customized your hierarchy, you might want to +create a rule nested beneath some other node in the hierarchy. In creating +this rule for the customized hierarchy, the node type doesn't matter, and in +particular the rule does not have to be nested beneath a ``root`` node. + +It is possible to create a rule that restricts data placement to a specific +*class* of device. By default, Ceph OSDs automatically classify themselves as +either ``hdd`` or ``ssd`` in accordance with the underlying type of device +being used. These device classes can be customized. One might set the ``device +class`` of OSDs to ``nvme`` to distinguish the from SATA SSDs, or one might set +them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that rules +and pools may be flexibly constrained to use (or avoid using) specific subsets +of OSDs based on specific requirements. + +To create a rule for a replicated pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}] + +For details on this command's parameters, see the following: + +``name`` + :Description: The name of the rule. + :Type: String + :Required: Yes + :Example: ``rbd-rule`` + +``root`` + :Description: The name of the CRUSH hierarchy node under which data is to be placed. + :Type: String + :Required: Yes + :Example: ``default`` + +``failure-domain-type`` + :Description: The type of CRUSH nodes used for the replicas of the failure domain. + :Type: String + :Required: Yes + :Example: ``rack`` + +``class`` + :Description: The device class on which data is to be placed. 
+ :Type: String + :Required: No + :Example: ``ssd`` + +Creating a rule for an erasure-coded pool +----------------------------------------- + +For an erasure-coded pool, similar decisions need to be made: what the failure +domain is, which node in the hierarchy data will be placed under (usually +``default``), and whether placement is restricted to a specific device class. +However, erasure-code pools are created in a different way: there is a need to +construct them carefully with reference to the erasure code plugin in use. For +this reason, these decisions must be incorporated into the **erasure-code +profile**. A CRUSH rule will then be created from the erasure-code profile, +either explicitly or automatically when the profile is used to create a pool. + +To list the erasure-code profiles, run the following command: + +.. prompt:: bash $ + + ceph osd erasure-code-profile ls + +To view a specific existing profile, run a command of the following form: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get {profile-name} + +Under normal conditions, profiles should never be modified; instead, a new +profile should be created and used when creating either a new pool or a new +rule for an existing pool. + +An erasure-code profile consists of a set of key-value pairs. Most of these +key-value pairs govern the behavior of the erasure code that encodes data in +the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH +rule that is created. + +The relevant erasure-code profile properties are as follows: + + * **crush-root**: the name of the CRUSH node under which to place data + [default: ``default``]. + * **crush-failure-domain**: the CRUSH bucket type used in the distribution of + erasure-coded shards [default: ``host``]. + * **crush-device-class**: the device class on which to place data [default: + none, which means that all devices are used]. + * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the + number of erasure-code shards, affecting the resulting CRUSH rule. + + After a profile is defined, you can create a CRUSH rule by running a command + of the following form: + +.. prompt:: bash $ + + ceph osd crush rule create-erasure {name} {profile-name} + +.. note: When creating a new pool, it is not necessary to create the rule + explicitly. If only the erasure-code profile is specified and the rule + argument is omitted, then Ceph will create the CRUSH rule automatically. + + +Deleting rules +-------------- + +To delete rules that are not in use by pools, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd crush rule rm {rule-name} + +.. _crush-map-tunables: + +Tunables +======== + +The CRUSH algorithm that is used to calculate the placement of data has been +improved over time. In order to support changes in behavior, we have provided +users with sets of tunables that determine which legacy or optimal version of +CRUSH is to be used. + +In order to use newer tunables, all Ceph clients and daemons must support the +new major release of CRUSH. Because of this requirement, we have created +``profiles`` that are named after the Ceph version in which they were +introduced. For example, the ``firefly`` tunables were first supported by the +Firefly release and do not work with older clients (for example, clients +running Dumpling). 
After a cluster's tunables profile is changed from a legacy +set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` options +will prevent older clients that do not support the new CRUSH features from +connecting to the cluster. + +argonaut (legacy) +----------------- + +The legacy CRUSH behavior used by Argonaut and older releases works fine for +most clusters, provided that not many OSDs have been marked ``out``. + +bobtail (CRUSH_TUNABLES2) +------------------------- + +The ``bobtail`` tunable profile provides the following improvements: + + * For hierarchies with a small number of devices in leaf buckets, some PGs + might map to fewer than the desired number of replicas, resulting in + ``undersized`` PGs. This is known to happen in the case of hierarchies with + ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each + host. + + * For large clusters, a small percentage of PGs might map to fewer than the + desired number of OSDs. This is known to happen when there are multiple + hierarchy layers in use (for example,, ``row``, ``rack``, ``host``, + ``osd``). + + * When one or more OSDs are marked ``out``, data tends to be redistributed + to nearby OSDs instead of across the entire hierarchy. + +The tunables introduced in the Bobtail release are as follows: + + * ``choose_local_tries``: Number of local retries. The legacy value is ``2``, + and the optimal value is ``0``. + + * ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal + value is 0. + + * ``choose_total_tries``: Total number of attempts to choose an item. The + legacy value is ``19``, but subsequent testing indicates that a value of + ``50`` is more appropriate for typical clusters. For extremely large + clusters, an even larger value might be necessary. + + * ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will + retry, or try only once and allow the original placement to retry. The + legacy default is ``0``, and the optimal value is ``1``. + +Migration impact: + + * Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a + moderate amount of data movement. Use caution on a cluster that is already + populated with data. + +firefly (CRUSH_TUNABLES3) +------------------------- + +chooseleaf_vary_r +~~~~~~~~~~~~~~~~~ + +This ``firefly`` tunable profile fixes a problem with ``chooseleaf`` CRUSH step +behavior. This problem arose when a large fraction of OSDs were marked ``out``, which resulted in PG mappings with too few OSDs. + +This profile was introduced in the Firefly release, and adds a new tunable as follows: + + * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start + with a non-zero value of ``r``, as determined by the number of attempts the + parent has already made. The legacy default value is ``0``, but with this + value CRUSH is sometimes unable to find a mapping. The optimal value (in + terms of computational cost and correctness) is ``1``. + +Migration impact: + + * For existing clusters that store a great deal of data, changing this tunable + from ``0`` to ``1`` will trigger a large amount of data migration; a value + of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will + cause less data to move. + +straw_calc_version tunable +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There were problems with the internal weights calculated and stored in the +CRUSH map for ``straw`` algorithm buckets. 
When there were buckets with a CRUSH +weight of ``0`` or with a mix of different and unique weights, CRUSH would +distribute data incorrectly (that is, not in proportion to the weights). + +This tunable, introduced in the Firefly release, is as follows: + + * ``straw_calc_version``: A value of ``0`` preserves the old, broken + internal-weight calculation; a value of ``1`` fixes the problem. + +Migration impact: + + * Changing this tunable to a value of ``1`` and then adjusting a straw bucket + (either by adding, removing, or reweighting an item or by using the + reweight-all command) can trigger a small to moderate amount of data + movement provided that the cluster has hit one of the problematic + conditions. + +This tunable option is notable in that it has absolutely no impact on the +required kernel version in the client side. + +hammer (CRUSH_V4) +----------------- + +The ``hammer`` tunable profile does not affect the mapping of existing CRUSH +maps simply by changing the profile. However: + + * There is a new bucket algorithm supported: ``straw2``. This new algorithm + fixes several limitations in the original ``straw``. More specifically, the + old ``straw`` buckets would change some mappings that should not have + changed when a weight was adjusted, while ``straw2`` achieves the original + goal of changing mappings only to or from the bucket item whose weight has + changed. + + * The ``straw2`` type is the default type for any newly created buckets. + +Migration impact: + + * Changing a bucket type from ``straw`` to ``straw2`` will trigger a small + amount of data movement, depending on how much the bucket items' weights + vary from each other. When the weights are all the same no data will move, + and the more variance there is in the weights the more movement there will + be. + +jewel (CRUSH_TUNABLES5) +----------------------- + +The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a +result, significantly fewer mappings change when an OSD is marked ``out`` of +the cluster. This improvement results in significantly less data movement. + +The new tunable introduced in the Jewel release is as follows: + + * ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt + will use a better value for an inner loop that greatly reduces the number of + mapping changes when an OSD is marked ``out``. The legacy value is ``0``, + and the new value of ``1`` uses the new approach. + +Migration impact: + + * Changing this value on an existing cluster will result in a very large + amount of data movement because nearly every PG mapping is likely to change. 
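+
+When weighing the impact of moving between any of these tunable profiles, it
+can be helpful to first inspect the tunables that the cluster is currently
+using. A quick, read-only check (the exact fields in the output vary by
+release) is:
+
+.. prompt:: bash $
+
+   ceph osd crush show-tunables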
+ +Client versions that support CRUSH_TUNABLES2 +-------------------------------------------- + + * v0.55 and later, including Bobtail (v0.56.x) + * Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients) + +Client versions that support CRUSH_TUNABLES3 +-------------------------------------------- + + * v0.78 (Firefly) and later + * Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients) + +Client versions that support CRUSH_V4 +------------------------------------- + + * v0.94 (Hammer) and later + * Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients) + +Client versions that support CRUSH_TUNABLES5 +-------------------------------------------- + + * v10.0.2 (Jewel) and later + * Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients) + +"Non-optimal tunables" warning +------------------------------ + +In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush +map has non-optimal tunables") if any of the current CRUSH tunables have +non-optimal values: that is, if any fail to have the optimal values from the +:ref:` ``default`` profile +<rados_operations_crush_map_default_profile_definition>`. There are two +different ways to silence the alert: + +1. Adjust the CRUSH tunables on the existing cluster so as to render them + optimal. Making this adjustment will trigger some data movement + (possibly as much as 10%). This approach is generally preferred to the + other approach, but special care must be taken in situations where + data movement might affect performance: for example, in production clusters. + To enable optimal tunables, run the following command: + + .. prompt:: bash $ + + ceph osd crush tunables optimal + + There are several potential problems that might make it preferable to revert + to the previous values of the tunables. The new values might generate too + much load for the cluster to handle, the new values might unacceptably slow + the operation of the cluster, or there might be a client-compatibility + problem. Such client-compatibility problems can arise when using old-kernel + CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to + the previous values of the tunables, run the following command: + + .. prompt:: bash $ + + ceph osd crush tunables legacy + +2. To silence the alert without making any changes to CRUSH, + add the following option to the ``[mon]`` section of your ceph.conf file:: + + mon_warn_on_legacy_crush_tunables = false + + In order for this change to take effect, you will need to either restart + the monitors or run the following command to apply the option to the + monitors while they are still running: + + .. prompt:: bash $ + + ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false + + +Tuning CRUSH +------------ + +When making adjustments to CRUSH tunables, keep the following considerations in +mind: + + * Adjusting the values of CRUSH tunables will result in the shift of one or + more PGs from one storage node to another. If the Ceph cluster is already + storing a great deal of data, be prepared for significant data movement. + * When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they + immediately begin rejecting new connections from clients that do not support + the new feature. However, already-connected clients are effectively + grandfathered in, and any of these clients that do not support the new + feature will malfunction. 
+ * If the CRUSH tunables are set to newer (non-legacy) values and subsequently + reverted to the legacy values, ``ceph-osd`` daemons will not be required to + support any of the newer CRUSH features associated with the newer + (non-legacy) values. However, the OSD peering process requires the + examination and understanding of old maps. For this reason, **if the cluster + has previously used non-legacy CRUSH values, do not run old versions of + the** ``ceph-osd`` **daemon** -- even if the latest version of the map has + been reverted so as to use the legacy defaults. + +The simplest way to adjust CRUSH tunables is to apply them in matched sets +known as *profiles*. As of the Octopus release, Ceph supports the following +profiles: + + * ``legacy``: The legacy behavior from argonaut and earlier. + * ``argonaut``: The legacy values supported by the argonaut release. + * ``bobtail``: The values supported by the bobtail release. + * ``firefly``: The values supported by the firefly release. + * ``hammer``: The values supported by the hammer release. + * ``jewel``: The values supported by the jewel release. + * ``optimal``: The best values for the current version of Ceph. + .. _rados_operations_crush_map_default_profile_definition: + * ``default``: The default values of a new cluster that has been installed + from scratch. These values, which depend on the current version of Ceph, are + hardcoded and are typically a mix of optimal and legacy values. These + values often correspond to the ``optimal`` profile of either the previous + LTS (long-term service) release or the most recent release for which most + users are expected to have up-to-date clients. + +To apply a profile to a running cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph osd crush tunables {PROFILE} + +This action might trigger a great deal of data movement. Consult release notes +and documentation before changing the profile on a running cluster. Consider +throttling recovery and backfill parameters in order to limit the backfill +resulting from a specific change. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf + + +Tuning Primary OSD Selection +============================ + +When a Ceph client reads or writes data, it first contacts the primary OSD in +each affected PG's acting set. By default, the first OSD in the acting set is +the primary OSD (also known as the "lead OSD"). For example, in the acting set +``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD. +However, sometimes it is clear that an OSD is not well suited to act as the +lead as compared with other OSDs (for example, if the OSD has a slow drive or a +slow controller). To prevent performance bottlenecks (especially on read +operations) and at the same time maximize the utilization of your hardware, you +can influence the selection of the primary OSD either by adjusting "primary +affinity" values, or by crafting a CRUSH rule that selects OSDs that are better +suited to act as the lead rather than other OSDs. + +To determine whether tuning Ceph's selection of primary OSDs will improve +cluster performance, pool redundancy strategy must be taken into account. For +replicated pools, this tuning can be especially useful, because by default read +operations are served from the primary OSD of each PG. For erasure-coded pools, +however, the speed of read operations can be increased by enabling **fast +read** (see :ref:`pool-settings`). + +.. 
_rados_ops_primary_affinity: + +Primary Affinity +---------------- + +**Primary affinity** is a characteristic of an OSD that governs the likelihood +that a given OSD will be selected as the primary OSD (or "lead OSD") in a given +acting set. A primary affinity value can be any real number in the range ``0`` +to ``1``, inclusive. + +As an example of a common scenario in which it can be useful to adjust primary +affinity values, let us suppose that a cluster contains a mix of drive sizes: +for example, suppose it contains some older racks with 1.9 TB SATA SSDs and +some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned +twice the number of PGs and will thus serve twice the number of write and read +operations -- they will be busier than the former. In such a scenario, you +might make a rough assignment of primary affinity as inversely proportional to +OSD size. Such an assignment will not be 100% optimal, but it can readily +achieve a 15% improvement in overall read throughput by means of a more even +utilization of SATA interface bandwidth and CPU cycles. This example is not +merely a thought experiment meant to illustrate the theoretical benefits of +adjusting primary affinity values; this fifteen percent improvement was +achieved on an actual Ceph cluster. + +By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster +in which every OSD has this default value, all OSDs are equally likely to act +as a primary OSD. + +By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less +likely to select the OSD as primary in a PG's acting set. To change the weight +value associated with a specific OSD's primary affinity, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd primary-affinity <osd-id> <weight> + +The primary affinity of an OSD can be set to any real number in the range +``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as +primary and ``1`` indicates that the OSD is maximally likely to be used as a +primary. When the weight is between these extremes, its value indicates roughly +how likely it is that CRUSH will select the OSD associated with it as a +primary. + +The process by which CRUSH selects the lead OSD is not a mere function of a +simple probability determined by relative affinity values. Nevertheless, +measurable results can be achieved even with first-order approximations of +desirable primary affinity values. + + +Custom CRUSH Rules +------------------ + +Some clusters balance cost and performance by mixing SSDs and HDDs in the same +replicated pool. By setting the primary affinity of HDD OSDs to ``0``, +operations will be directed to an SSD OSD in each acting set. Alternatively, +you can define a CRUSH rule that always selects an SSD OSD as the primary OSD +and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting +set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs. + +For example, see the following CRUSH rule:: + + rule mixed_replicated_rule { + id 11 + type replicated + step take default class ssd + step chooseleaf firstn 1 type host + step emit + step take default class hdd + step chooseleaf firstn 0 type host + step emit + } + +This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool, +this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on +different hosts, because the first SSD OSD might be colocated with any of the +``N`` HDD OSDs. 
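+
+As a usage sketch, such a rule is attached to a replicated pool in the same
+way as any other CRUSH rule (the pool name ``mixedpool`` here is
+hypothetical):
+
+.. prompt:: bash $
+
+   ceph osd pool set mixedpool crush_rule mixed_replicated_rule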
+ +To avoid this extra storage requirement, you might place SSDs and HDDs in +different hosts. However, taking this approach means that all client requests +will be received by hosts with SSDs. For this reason, it might be advisable to +have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the +latter will under normal circumstances perform only recovery operations. Here +the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement +not to contain any of the same servers, as seen in the following CRUSH rule:: + + rule mixed_replicated_rule_two { + id 1 + type replicated + step take ssd_hosts class ssd + step chooseleaf firstn 1 type host + step emit + step take hdd_hosts class hdd + step chooseleaf firstn -1 type host + step emit + } + +.. note:: If a primary SSD OSD fails, then requests to the associated PG will + be temporarily served from a slower HDD OSD until the PG's data has been + replicated onto the replacement primary SSD OSD. + + diff --git a/doc/rados/operations/data-placement.rst b/doc/rados/operations/data-placement.rst new file mode 100644 index 000000000..3d3be65ec --- /dev/null +++ b/doc/rados/operations/data-placement.rst @@ -0,0 +1,47 @@ +========================= + Data Placement Overview +========================= + +Ceph stores, replicates, and rebalances data objects across a RADOS cluster +dynamically. Because different users store objects in different pools for +different purposes on many OSDs, Ceph operations require a certain amount of +data- placement planning. The main data-placement planning concepts in Ceph +include: + +- **Pools:** Ceph stores data within pools, which are logical groups used for + storing objects. Pools manage the number of placement groups, the number of + replicas, and the CRUSH rule for the pool. To store data in a pool, it is + necessary to be an authenticated user with permissions for the pool. Ceph is + able to make snapshots of pools. For additional details, see `Pools`_. + +- **Placement Groups:** Ceph maps objects to placement groups. Placement + groups (PGs) are shards or fragments of a logical object pool that place + objects as a group into OSDs. Placement groups reduce the amount of + per-object metadata that is necessary for Ceph to store the data in OSDs. A + greater number of placement groups (for example, 100 PGs per OSD as compared + with 50 PGs per OSD) leads to better balancing. For additional details, see + :ref:`placement groups`. + +- **CRUSH Maps:** CRUSH plays a major role in allowing Ceph to scale while + avoiding certain pitfalls, such as performance bottlenecks, limitations to + scalability, and single points of failure. CRUSH maps provide the physical + topology of the cluster to the CRUSH algorithm, so that it can determine both + (1) where the data for an object and its replicas should be stored and (2) + how to store that data across failure domains so as to improve data safety. + For additional details, see `CRUSH Maps`_. + +- **Balancer:** The balancer is a feature that automatically optimizes the + distribution of placement groups across devices in order to achieve a + balanced data distribution, in order to maximize the amount of data that can + be stored in the cluster, and in order to evenly distribute the workload + across OSDs. + +It is possible to use the default values for each of the above components. +Default values are recommended for a test cluster's initial setup. 
However, +when planning a large Ceph cluster, values should be customized for +data-placement operations with reference to the different roles played by +pools, placement groups, and CRUSH. + +.. _Pools: ../pools +.. _CRUSH Maps: ../crush-map +.. _Balancer: ../balancer diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst new file mode 100644 index 000000000..f92f622d5 --- /dev/null +++ b/doc/rados/operations/devices.rst @@ -0,0 +1,227 @@ +.. _devices: + +Device Management +================= + +Device management allows Ceph to address hardware failure. Ceph tracks hardware +storage devices (HDDs, SSDs) to see which devices are managed by which daemons. +Ceph also collects health metrics about these devices. By doing so, Ceph can +provide tools that predict hardware failure and can automatically respond to +hardware failure. + +Device tracking +--------------- + +To see a list of the storage devices that are in use, run the following +command: + +.. prompt:: bash $ + + ceph device ls + +Alternatively, to list devices by daemon or by host, run a command of one of +the following forms: + +.. prompt:: bash $ + + ceph device ls-by-daemon <daemon> + ceph device ls-by-host <host> + +To see information about the location of an specific device and about how the +device is being consumed, run a command of the following form: + +.. prompt:: bash $ + + ceph device info <devid> + +Identifying physical devices +---------------------------- + +To make the replacement of failed disks easier and less error-prone, you can +(in some cases) "blink" the drive's LEDs on hardware enclosures by running a +command of the following form:: + + device light on|off <devid> [ident|fault] [--force] + +.. note:: Using this command to blink the lights might not work. Whether it + works will depend upon such factors as your kernel revision, your SES + firmware, or the setup of your HBA. + +The ``<devid>`` parameter is the device identification. To retrieve this +information, run the following command: + +.. prompt:: bash $ + + ceph device ls + +The ``[ident|fault]`` parameter determines which kind of light will blink. By +default, the `identification` light is used. + +.. note:: This command works only if the Cephadm or the Rook `orchestrator + <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ + module is enabled. To see which orchestrator module is enabled, run the + following command: + + .. prompt:: bash $ + + ceph orch status + +The command that makes the drive's LEDs blink is `lsmcli`. To customize this +command, configure it via a Jinja2 template by running commands of the +following forms:: + + ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>" + ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'" + +The following arguments can be used to customize the Jinja2 template: + +* ``on`` + A boolean value. +* ``ident_fault`` + A string that contains `ident` or `fault`. +* ``dev`` + A string that contains the device ID: for example, `SanDisk_X400_M.2_2280_512GB_162924424784`. +* ``path`` + A string that contains the device path: for example, `/dev/sda`. + +.. _enabling-monitoring: + +Enabling monitoring +------------------- + +Ceph can also monitor the health metrics associated with your device. 
For +example, SATA drives implement a standard called SMART that provides a wide +range of internal metrics about the device's usage and health (for example: the +number of hours powered on, the number of power cycles, the number of +unrecoverable read errors). Other device types such as SAS and NVMe present a +similar set of metrics (via slightly different standards). All of these +metrics can be collected by Ceph via the ``smartctl`` tool. + +You can enable or disable health monitoring by running one of the following +commands: + +.. prompt:: bash $ + + ceph device monitoring on + ceph device monitoring off + +Scraping +-------- + +If monitoring is enabled, device metrics will be scraped automatically at +regular intervals. To configure that interval, run a command of the following +form: + +.. prompt:: bash $ + + ceph config set mgr mgr/devicehealth/scrape_frequency <seconds> + +By default, device metrics are scraped once every 24 hours. + +To manually scrape all devices, run the following command: + +.. prompt:: bash $ + + ceph device scrape-health-metrics + +To scrape a single device, run a command of the following form: + +.. prompt:: bash $ + + ceph device scrape-health-metrics <device-id> + +To scrape a single daemon's devices, run a command of the following form: + +.. prompt:: bash $ + + ceph device scrape-daemon-health-metrics <who> + +To retrieve the stored health metrics for a device (optionally for a specific +timestamp), run a command of the following form: + +.. prompt:: bash $ + + ceph device get-health-metrics <devid> [sample-timestamp] + +Failure prediction +------------------ + +Ceph can predict drive life expectancy and device failures by analyzing the +health metrics that it collects. The prediction modes are as follows: + +* *none*: disable device failure prediction. +* *local*: use a pre-trained prediction model from the ``ceph-mgr`` daemon. + +To configure the prediction mode, run a command of the following form: + +.. prompt:: bash $ + + ceph config set global device_failure_prediction_mode <mode> + +Under normal conditions, failure prediction runs periodically in the +background. For this reason, life expectancy values might be populated only +after a significant amount of time has passed. The life expectancy of all +devices is displayed in the output of the following command: + +.. prompt:: bash $ + + ceph device ls + +To see the metadata of a specific device, run a command of the following form: + +.. prompt:: bash $ + + ceph device info <devid> + +To explicitly force prediction of a specific device's life expectancy, run a +command of the following form: + +.. prompt:: bash $ + + ceph device predict-life-expectancy <devid> + +In addition to Ceph's internal device failure prediction, you might have an +external source of information about device failures. To inform Ceph of a +specific device's life expectancy, run a command of the following form: + +.. prompt:: bash $ + + ceph device set-life-expectancy <devid> <from> [<to>] + +Life expectancies are expressed as a time interval. This means that the +uncertainty of the life expectancy can be expressed in the form of a range of +time, and perhaps a wide range of time. The interval's end can be left +unspecified. + +Health alerts +------------- + +The ``mgr/devicehealth/warn_threshold`` configuration option controls the +health check for an expected device failure. If the device is expected to fail +within the specified time interval, an alert is raised. 
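+
+Following the same pattern as the scrape-interval option above, this threshold
+can be adjusted with a command of the following form (the value is a time
+interval in seconds; treat this as a sketch and verify the option name and
+units on your release):
+
+.. prompt:: bash $
+
+   ceph config set mgr mgr/devicehealth/warn_threshold <seconds>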
+ +To check the stored life expectancy of all devices and generate any appropriate +health alert, run the following command: + +.. prompt:: bash $ + + ceph device check-health + +Automatic Migration +------------------- + +The ``mgr/devicehealth/self_heal`` option (enabled by default) automatically +migrates data away from devices that are expected to fail soon. If this option +is enabled, the module marks such devices ``out`` so that automatic migration +will occur. + +.. note:: The ``mon_osd_min_up_ratio`` configuration option can help prevent + this process from cascading to total failure. If the "self heal" module + marks ``out`` so many OSDs that the ratio value of ``mon_osd_min_up_ratio`` + is exceeded, then the cluster raises the ``DEVICE_HEALTH_TOOMANY`` health + check. For instructions on what to do in this situation, see + :ref:`DEVICE_HEALTH_TOOMANY<rados_health_checks_device_health_toomany>`. + +The ``mgr/devicehealth/mark_out_threshold`` configuration option specifies the +time interval for automatic migration. If a device is expected to fail within +the specified time interval, it will be automatically marked ``out``. diff --git a/doc/rados/operations/erasure-code-clay.rst b/doc/rados/operations/erasure-code-clay.rst new file mode 100644 index 000000000..1cffa32f5 --- /dev/null +++ b/doc/rados/operations/erasure-code-clay.rst @@ -0,0 +1,240 @@ +================ +CLAY code plugin +================ + +CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings +in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let: + + d = number of OSDs contacted during repair + +If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires +reading from the *d=8* others to repair. And recovery of say a 1GiB needs +a download of 8 X 1GiB = 8GiB of information. + +However, in the case of the *clay* plugin *d* is configurable within the limits: + + k+1 <= d <= k+m-1 + +By default, the clay code plugin picks *d=k+m-1* as it provides the greatest savings in terms +of network bandwidth and disk IO. In the case of the *clay* plugin configured with +*k=8*, *m=4* and *d=11* when a single OSD fails, d=11 osds are contacted and +250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB +amount of information. More general parameters are provided below. The benefits are substantial +when the repair is carried out for a rack that stores information on the order of +Terabytes. + + +-------------+---------------------------------------------------------+ + | plugin | total amount of disk IO | + +=============+=========================================================+ + |jerasure,isa | :math:`k S` | + +-------------+---------------------------------------------------------+ + | clay | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` | + +-------------+---------------------------------------------------------+ + +where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have +used the largest possible value of *d* as this will result in the smallest amount of data download needed +to achieve recovery from an OSD failure. + +Erasure-code profile examples +============================= + +An example configuration that can be used to observe reduced bandwidth usage: + +.. 
prompt:: bash $ + + ceph osd erasure-code-profile set CLAYprofile \ + plugin=clay \ + k=4 m=2 d=5 \ + crush-failure-domain=host + ceph osd pool create claypool erasure CLAYprofile + + +Creating a clay profile +======================= + +To create a new clay code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=clay \ + k={data-chunks} \ + m={coding-chunks} \ + [d={helper-chunks}] \ + [scalar_mds={plugin-name}] \ + [technique={technique-name}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each of which is stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``d={helper-chunks}`` + +:Description: Number of OSDs requested to send data during recovery of + a single chunk. *d* needs to be chosen such that + k+1 <= d <= k+m-1. The larger the *d*, the better the savings. + +:Type: Integer +:Required: No. +:Default: k+m-1 + +``scalar_mds={jerasure|isa|shec}`` + +:Description: **scalar_mds** specifies the plugin that is used as a + building block in the layered construction. It can be + one of *jerasure*, *isa*, *shec* + +:Type: String +:Required: No. +:Default: jerasure + +``technique={technique}`` + +:Description: **technique** specifies the technique that will be picked + within the 'scalar_mds' plugin specified. Supported techniques + are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig', + 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van', + 'cauchy' for isa and 'single', 'multiple' for shec. + +:Type: String +:Required: No. +:Default: reed_sol_van (for jerasure, isa), single (for shec) + + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + + +Notion of sub-chunks +==================== + +The Clay code is able to save in terms of disk IO, network bandwidth as it +is a vector code and it is able to view and manipulate data within a chunk +at a finer granularity termed as a sub-chunk. 
The number of sub-chunks within +a chunk for a Clay code is given by: + + sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1` + + +During repair of an OSD, the helper information requested +from an available OSD is only a fraction of a chunk. In fact, the number +of sub-chunks within a chunk that are accessed during repair is given by: + + repair sub-chunk count = :math:`\frac{sub---chunk \: count}{q}` + +Examples +-------- + +#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is + 8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read + during repair. +#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and repair sub-chunk count + is 16. A quarter of a chunk is read from an available OSD for repair of a failed + chunk. + + + +How to choose a configuration given a workload +============================================== + +Only a few sub-chunks are read of all the sub-chunks within a chunk. These sub-chunks +are not necessarily stored consecutively within a chunk. For best disk IO +performance, it is helpful to read contiguous data. For this reason, it is suggested that +you choose stripe-size such that the sub-chunk size is sufficiently large. + +For a given stripe-size (that's fixed based on a workload), choose ``k``, ``m``, ``d`` such that: + + sub-chunk size = :math:`\frac{stripe-size}{k sub-chunk count}` = 4KB, 8KB, 12KB ... + +#. For large size workloads for which the stripe size is large, it is easy to choose k, m, d. + For example consider a stripe-size of size 64MB, choosing *k=16*, *m=4* and *d=19* will + result in a sub-chunk count of 1024 and a sub-chunk size of 4KB. +#. For small size workloads, *k=4*, *m=2* is a good configuration that provides both network + and disk IO benefits. + +Comparisons with LRC +==================== + +Locally Recoverable Codes (LRC) are also designed in order to save in terms of network +bandwidth, disk IO during single OSD recovery. However, the focus in LRCs is to keep the +number of OSDs contacted during repair (d) to be minimal, but this comes at the cost of storage overhead. +The *clay* code has a storage overhead m/k. In the case of an *lrc*, it stores (k+m)/d parities in +addition to the ``m`` parities resulting in a storage overhead (m+(k+m)/d)/k. Both *clay* and *lrc* +can recover from the failure of any ``m`` OSDs. + + +-----------------+----------------------------------+----------------------------------+ + | Parameters | disk IO, storage overhead (LRC) | disk IO, storage overhead (CLAY) | + +=================+================+=================+==================================+ + | (k=10, m=4) | 7 * S, 0.6 (d=7) | 3.25 * S, 0.4 (d=13) | + +-----------------+----------------------------------+----------------------------------+ + | (k=16, m=4) | 4 * S, 0.5625 (d=4) | 4.75 * S, 0.25 (d=19) | + +-----------------+----------------------------------+----------------------------------+ + + +where ``S`` is the amount of data stored of single OSD being recovered. diff --git a/doc/rados/operations/erasure-code-isa.rst b/doc/rados/operations/erasure-code-isa.rst new file mode 100644 index 000000000..9a43f89a2 --- /dev/null +++ b/doc/rados/operations/erasure-code-isa.rst @@ -0,0 +1,107 @@ +======================= +ISA erasure code plugin +======================= + +The *isa* plugin encapsulates the `ISA +<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_ +library. 
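+
+As a quick illustration (the ``ISAprofile`` and ``isapool`` names below are
+only examples), a pool that uses this plugin with its default parameters can
+be created as follows:
+
+.. prompt:: bash $
+
+   ceph osd erasure-code-profile set ISAprofile \
+      plugin=isa \
+      crush-failure-domain=host
+   ceph osd pool create isapool erasure ISAprofile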
+ +Create an isa profile +===================== + +To create a new *isa* erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=isa \ + technique={reed_sol_van|cauchy} \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 7 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 3 + +``technique={reed_sol_van|cauchy}`` + +:Description: The ISA plugin comes in two `Reed Solomon + <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_ + forms. If *reed_sol_van* is set, it is `Vandermonde + <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if + *cauchy* is set, it is `Cauchy + <https://en.wikipedia.org/wiki/Cauchy_matrix>`_. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-jerasure.rst b/doc/rados/operations/erasure-code-jerasure.rst new file mode 100644 index 000000000..8a0207748 --- /dev/null +++ b/doc/rados/operations/erasure-code-jerasure.rst @@ -0,0 +1,123 @@ +============================ +Jerasure erasure code plugin +============================ + +The *jerasure* plugin is the most generic and flexible plugin, it is +also the default for Ceph erasure coded pools. + +The *jerasure* plugin encapsulates the `Jerasure +<https://github.com/ceph/jerasure>`_ library. It is +recommended to read the ``jerasure`` documentation to +understand the parameters. Note that the ``jerasure.org`` +web site as of 2023 may no longer be connected to the original +project or legitimate. + +Create a jerasure profile +========================= + +To create a new *jerasure* erasure code profile: + +.. 
prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=jerasure \ + k={data-chunks} \ + m={coding-chunks} \ + technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}`` + +:Description: The more flexible technique is *reed_sol_van* : it is + enough to set *k* and *m*. The *cauchy_good* technique + can be faster but you need to chose the *packetsize* + carefully. All of *reed_sol_r6_op*, *liberation*, + *blaum_roth*, *liber8tion* are *RAID6* equivalents in + the sense that they can only be configured with *m=2*. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``packetsize={bytes}`` + +:Description: The encoding will be done on packets of *bytes* size at + a time. Choosing the right packet size is difficult. The + *jerasure* documentation contains extensive information + on this topic. + +:Type: Integer +:Required: No. +:Default: 2048 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-lrc.rst b/doc/rados/operations/erasure-code-lrc.rst new file mode 100644 index 000000000..5329603b9 --- /dev/null +++ b/doc/rados/operations/erasure-code-lrc.rst @@ -0,0 +1,388 @@ +====================================== +Locally repairable erasure code plugin +====================================== + +With the *jerasure* plugin, when an erasure coded object is stored on +multiple OSDs, recovering from the loss of one OSD requires reading +from *k* others. For instance if *jerasure* is configured with +*k=8* and *m=4*, recovering from the loss of one OSD requires reading +from eight others. + +The *lrc* erasure code plugin creates local parity chunks to enable +recovery using fewer surviving OSDs. 
For instance if *lrc* is configured with +*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for +every four OSDs. When a single OSD is lost, it can be recovered with +only four OSDs instead of eight. + +Erasure code profile examples +============================= + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-failure-domain=host + ceph osd pool create lrcpool erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the bandwidth reduction will only be observed if the primary +OSD is in the same rack as the lost chunk.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-locality=rack \ + crush-failure-domain=host + ceph osd pool create lrcpool erasure LRCprofile + + +Create an lrc profile +===================== + +To create a new lrc erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=lrc \ + k={data-chunks} \ + m={coding-chunks} \ + l={locality} \ + [crush-root={root}] \ + [crush-locality={bucket-type}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``l={locality}`` + +:Description: Group the coding and data chunks into sets of size + **locality**. For instance, for **k=4** and **m=2**, + when **locality=3** two groups of three are created. + Each set can be recovered without reading chunks + from another set. + +:Type: Integer +:Required: Yes. +:Example: 3 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-locality={bucket-type}`` + +:Description: The type of the CRUSH bucket in which each set of chunks + defined by **l** will be stored. For instance, if it is + set to **rack**, each group of **l** chunks will be + placed in a different rack. It is used to create a + CRUSH rule step such as **step choose rack**. If it is not + set, no such grouping is done. + +:Type: String +:Required: No. + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. 
+:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Low level plugin configuration +============================== + +The sum of **k** and **m** must be a multiple of the **l** parameter. +The low level configuration parameters however do not enforce this +restriction and it may be advantageous to use them for specific +purposes. It is for instance possible to define two groups, one with 4 +chunks and another with 3 chunks. It is also possible to recursively +define locality sets, for instance datacenters and racks into +datacenters. The **k/m/l** are implemented by generating a low level +configuration. + +The *lrc* erasure code plugin recursively applies erasure code +techniques so that recovering from the loss of some chunks only +requires a subset of the available chunks, most of the time. + +For instance, when three coding steps are described as:: + + chunk nr 01234567 + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +where *c* are coding chunks calculated from the data chunks *D*, the +loss of chunk *7* can be recovered with the last four chunks. And the +loss of chunk *2* chunk can be recovered with the first four +chunks. + +Erasure code profile examples using low level configuration +=========================================================== + +Minimal testing +--------------- + +It is strictly equivalent to using a *K=2* *M=1* erasure code profile. The *DD* +implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used +by default.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "" ] ]' + ceph osd pool create lrcpool erasure LRCprofile + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed. It is equivalent to **k=4**, **m=2** and **l=3** although +the layout of the chunks is different. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' + $ ceph osd pool create lrcpool erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the reduced bandwidth will only be observed if the primary OSD is in +the same rack as the lost chunk. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' \ + crush-steps='[ + [ "choose", "rack", 2 ], + [ "chooseleaf", "host", 4 ], + ]' + + $ ceph osd pool create lrcpool erasure LRCprofile + +Testing with different Erasure Code backends +-------------------------------------------- + +LRC now uses jerasure as the default EC backend. It is possible to +specify the EC backend/algorithm on a per layer basis using the low +level configuration. The second argument in layers='[ [ "DDc", "" ] ]' +is actually an erasure code profile to be used for this level. 
The +example below specifies the ISA backend with the cauchy technique to +be used in the lrcpool.: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]' + ceph osd pool create lrcpool erasure LRCprofile + +You could also use a different erasure code profile for each +layer. **WARNING: PROMPTS ARE SELECTABLE** + +:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "plugin=isa technique=cauchy" ], + [ "cDDD____", "plugin=isa" ], + [ "____cDDD", "plugin=jerasure" ], + ]' + $ ceph osd pool create lrcpool erasure LRCprofile + + + +Erasure coding and decoding algorithm +===================================== + +The steps found in the layers description:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +are applied in order. For instance, if a 4K object is encoded, it will +first go through *step 1* and be divided in four 1K chunks (the four +uppercase D). They are stored in the chunks 2, 3, 6 and 7, in +order. From these, two coding chunks are calculated (the two lowercase +c). The coding chunks are stored in the chunks 1 and 5, respectively. + +The *step 2* re-uses the content created by *step 1* in a similar +fashion and stores a single coding chunk *c* at position 0. The last four +chunks, marked with an underscore (*_*) for readability, are ignored. + +The *step 3* stores a single coding chunk *c* at position 4. The three +chunks created by *step 1* are used to compute this coding chunk, +i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*. + +If chunk *2* is lost:: + + chunk nr 01234567 + + step 1 _c D_cDD + step 2 cD D____ + step 3 __ _cDDD + +decoding will attempt to recover it by walking the steps in reverse +order: *step 3* then *step 2* and finally *step 1*. + +The *step 3* knows nothing about chunk *2* (i.e. it is an underscore) +and is skipped. + +The coding chunk from *step 2*, stored in chunk *0*, allows it to +recover the content of chunk *2*. There are no more chunks to recover +and the process stops, without considering *step 1*. + +Recovering chunk *2* requires reading chunks *0, 1, 3* and writing +back chunk *2*. + +If chunk *2, 3, 6* are lost:: + + chunk nr 01234567 + + step 1 _c _c D + step 2 cD __ _ + step 3 __ cD D + +The *step 3* can recover the content of chunk *6*:: + + chunk nr 01234567 + + step 1 _c _cDD + step 2 cD ____ + step 3 __ cDDD + +The *step 2* fails to recover and is skipped because there are two +chunks missing (*2, 3*) and it can only recover from one missing +chunk. + +The coding chunk from *step 1*, stored in chunk *1, 5*, allows it to +recover the content of chunk *2, 3*:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +Controlling CRUSH placement +=========================== + +The default CRUSH rule provides OSDs that are on different hosts. For instance:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +needs exactly *8* OSDs, one for each chunk. If the hosts are in two +adjacent racks, the first four chunks can be placed in the first rack +and the last four in the second rack. So that recovering from the loss +of a single OSD does not require using bandwidth between the two +racks. 
+ +For instance:: + + crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]' + +will create a rule that will select two crush buckets of type +*rack* and for each of them choose four OSDs, each of them located in +different buckets of type *host*. + +The CRUSH rule can also be manually crafted for finer control. diff --git a/doc/rados/operations/erasure-code-profile.rst b/doc/rados/operations/erasure-code-profile.rst new file mode 100644 index 000000000..947b34c1f --- /dev/null +++ b/doc/rados/operations/erasure-code-profile.rst @@ -0,0 +1,128 @@ +.. _erasure-code-profiles: + +===================== +Erasure code profiles +===================== + +Erasure code is defined by a **profile** and is used when creating an +erasure coded pool and the associated CRUSH rule. + +The **default** erasure code profile (which is created when the Ceph +cluster is initialized) will split the data into 2 equal-sized chunks, +and have 2 parity chunks of the same size. It will take as much space +in the cluster as a 2-replica pool but can sustain the data loss of 2 +chunks out of 4. It is described as a profile with **k=2** and **m=2**, +meaning the information is spread over four OSD (k+m == 4) and two of +them can be lost. + +To improve redundancy without increasing raw storage requirements, a +new profile can be created. For instance, a profile with **k=10** and +**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an +object on fourteen (k+m=14) OSDs. The object is first divided in +**10** chunks (if the object is 10MB, each chunk is 1MB) and **4** +coding chunks are computed, for recovery (each coding chunk has the +same size as the data chunk, i.e. 1MB). The raw space overhead is only +40% and the object will not be lost even if four OSDs break at the +same time. + +.. _list of available plugins: + +.. toctree:: + :maxdepth: 1 + + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + erasure-code-clay + +osd erasure-code-profile set +============================ + +To create a new erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + [{directory=directory}] \ + [{plugin=plugin}] \ + [{stripe_unit=stripe_unit}] \ + [{key=value} ...] \ + [--force] + +Where: + +``{directory=directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``{plugin=plugin}`` + +:Description: Use the erasure code **plugin** to compute coding chunks + and recover missing chunks. See the `list of available + plugins`_ for more information. + +:Type: String +:Required: No. +:Default: jerasure + +``{stripe_unit=stripe_unit}`` + +:Description: The amount of data in a data chunk, per stripe. For + example, a profile with 2 data chunks and stripe_unit=4K + would put the range 0-4K in chunk 0, 4K-8K in chunk 1, + then 8K-12K in chunk 0 again. This should be a multiple + of 4K for best performance. The default value is taken + from the monitor config option + ``osd_pool_erasure_code_stripe_unit`` when a pool is + created. The stripe_width of a pool using this profile + will be the number of data chunks multiplied by this + stripe_unit. + +:Type: String +:Required: No. + +``{key=value}`` + +:Description: The semantic of the remaining key/value pairs is defined + by the erasure code plugin. + +:Type: String +:Required: No. 
+ +``--force`` + +:Description: Override an existing profile by the same name, and allow + setting a non-4K-aligned stripe_unit. + +:Type: String +:Required: No. + +osd erasure-code-profile rm +============================ + +To remove an erasure code profile:: + + ceph osd erasure-code-profile rm {name} + +If the profile is referenced by a pool, the deletion will fail. + +.. warning:: Removing an erasure code profile using ``osd erasure-code-profile rm`` does not automatically delete the associated CRUSH rule associated with the erasure code profile. It is recommended to manually remove the associated CRUSH rule using ``ceph osd crush rule remove {rule-name}`` to avoid unexpected behavior. + +osd erasure-code-profile get +============================ + +To display an erasure code profile:: + + ceph osd erasure-code-profile get {name} + +osd erasure-code-profile ls +=========================== + +To list the names of all erasure code profiles:: + + ceph osd erasure-code-profile ls + diff --git a/doc/rados/operations/erasure-code-shec.rst b/doc/rados/operations/erasure-code-shec.rst new file mode 100644 index 000000000..4e8f59b0b --- /dev/null +++ b/doc/rados/operations/erasure-code-shec.rst @@ -0,0 +1,145 @@ +======================== +SHEC erasure code plugin +======================== + +The *shec* plugin encapsulates the `multiple SHEC +<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_ +library. It allows ceph to recover data more efficiently than Reed Solomon codes. + +Create an SHEC profile +====================== + +To create a new *shec* erasure code profile: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set {name} \ + plugin=shec \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [c={durability-estimator}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data-chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding-chunks** for each object and store them on + different OSDs. The number of **coding-chunks** does not necessarily + equal the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 3 + +``c={durability-estimator}`` + +:Description: The number of parity chunks each of which includes each data chunk in its + calculation range. The number is used as a **durability estimator**. + For instance, if c=2, 2 OSDs can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 2 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. 
+:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Brief description of SHEC's layouts +=================================== + +Space Efficiency +---------------- + +Space efficiency is a ratio of data chunks to all ones in a object and +represented as k/(k+m). +In order to improve space efficiency, you should increase k or decrease m: + + space efficiency of SHEC(4,3,2) = :math:`\frac{4}{4+3}` = 0.57 + SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency + +Durability +---------- + +The third parameter of SHEC (=c) is a durability estimator, which approximates +the number of OSDs that can be down without losing data. + +``durability estimator of SHEC(4,3,2) = 2`` + +Recovery Efficiency +------------------- + +Describing calculation of recovery efficiency is beyond the scope of this document, +but at least increasing m without increasing c achieves improvement of recovery efficiency. +(However, we must pay attention to the sacrifice of space efficiency in this case.) + +``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency`` + +Erasure code profile examples +============================= + + +.. prompt:: bash $ + + ceph osd erasure-code-profile set SHECprofile \ + plugin=shec \ + k=8 m=4 c=3 \ + crush-failure-domain=host + ceph osd pool create shecpool erasure SHECprofile diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst new file mode 100644 index 000000000..e2bd3c296 --- /dev/null +++ b/doc/rados/operations/erasure-code.rst @@ -0,0 +1,272 @@ +.. _ecpool: + +============== + Erasure code +============== + +By default, Ceph `pools <../pools>`_ are created with the type "replicated". In +replicated-type pools, every object is copied to multiple disks. This +multiple copying is the method of data protection known as "replication". + +By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_ +pools use a method of data protection that is different from replication. In +erasure coding, data is broken into fragments of two kinds: data blocks and +parity blocks. If a drive fails or becomes corrupted, the parity blocks are +used to rebuild the data. At scale, erasure coding saves space relative to +replication. + +In this documentation, data blocks are referred to as "data chunks" +and parity blocks are referred to as "coding chunks". + +Erasure codes are also called "forward error correction codes". The +first forward error correction code was developed in 1950 by Richard +Hamming at Bell Laboratories. + + +Creating a sample erasure-coded pool +------------------------------------ + +The simplest erasure-coded pool is similar to `RAID5 +<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and +requires at least three hosts: + +.. prompt:: bash $ + + ceph osd pool create ecpool erasure + +:: + + pool 'ecpool' created + +.. prompt:: bash $ + + echo ABCDEFGHI | rados --pool ecpool put NYAN - + rados --pool ecpool get NYAN - + +:: + + ABCDEFGHI + +Erasure-code profiles +--------------------- + +The default erasure-code profile can sustain the overlapping loss of two OSDs +without losing data. 
This erasure-code profile is equivalent to a replicated +pool of size three, but with different storage requirements: instead of +requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default +profile can be displayed with this command: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get default + +:: + + k=2 + m=2 + plugin=jerasure + crush-failure-domain=host + technique=reed_sol_van + +.. note:: + The profile just displayed is for the *default* erasure-coded pool, not the + *simplest* erasure-coded pool. These two pools are not the same: + + The default erasure-coded pool has two data chunks (K) and two coding chunks + (M). The profile of the default erasure-coded pool is "k=2 m=2". + + The simplest erasure-coded pool has two data chunks (K) and one coding chunk + (M). The profile of the simplest erasure-coded pool is "k=2 m=1". + +Choosing the right profile is important because the profile cannot be modified +after the pool is created. If you find that you need an erasure-coded pool with +a profile different than the one you have created, you must create a new pool +with a different (and presumably more carefully considered) profile. When the +new pool is created, all objects from the wrongly configured pool must be moved +to the newly created pool. There is no way to alter the profile of a pool after +the pool has been created. + +The most important parameters of the profile are *K*, *M*, and +*crush-failure-domain* because they define the storage overhead and +the data durability. For example, if the desired architecture must +sustain the loss of two racks with a storage overhead of 67%, +the following profile can be defined: + +.. prompt:: bash $ + + ceph osd erasure-code-profile set myprofile \ + k=3 \ + m=2 \ + crush-failure-domain=rack + ceph osd pool create ecpool erasure myprofile + echo ABCDEFGHI | rados --pool ecpool put NYAN - + rados --pool ecpool get NYAN - + +:: + + ABCDEFGHI + +The *NYAN* object will be divided in three (*K=3*) and two additional +*chunks* will be created (*M=2*). The value of *M* defines how many +OSDs can be lost simultaneously without losing any data. The +*crush-failure-domain=rack* will create a CRUSH rule that ensures +no two *chunks* are stored in the same rack. + +.. ditaa:: + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | v | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | + | +------+ + +----------------->| OSD5 | + +------+ + + +More information can be found in the `erasure-code profiles +<../erasure-code-profile>`_ documentation. + + +Erasure Coding with Overwrites +------------------------------ + +By default, erasure-coded pools work only with operations that +perform full object writes and appends (for example, RGW). 
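+
+As an illustration, whole-object writes and appends with the ``rados`` tool
+succeed on such a pool without any additional configuration; partial overwrites
+of existing objects, by contrast, are rejected until overwrites are enabled as
+described below. In the sketch that follows, ``ecpool`` is the example pool
+created above and ``./more-data`` stands in for any local file:
+
+.. prompt:: bash $
+
+   echo ABCDEFGHI | rados --pool ecpool put NYAN -    # full-object write
+   rados --pool ecpool append NYAN ./more-data        # append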
+ +Since Luminous, partial writes for an erasure-coded pool may be +enabled with a per-pool setting. This lets RBD and CephFS store their +data in an erasure-coded pool: + +.. prompt:: bash $ + + ceph osd pool set ec_pool allow_ec_overwrites true + +This can be enabled only on a pool residing on BlueStore OSDs, since +BlueStore's checksumming is used during deep scrubs to detect bitrot +or other corruption. Using Filestore with EC overwrites is not only +unsafe, but it also results in lower performance compared to BlueStore. + +Erasure-coded pools do not support omap, so to use them with RBD and +CephFS you must instruct them to store their data in an EC pool and +their metadata in a replicated pool. For RBD, this means using the +erasure-coded pool as the ``--data-pool`` during image creation: + +.. prompt:: bash $ + + rbd create --size 1G --data-pool ec_pool replicated_pool/image_name + +For CephFS, an erasure-coded pool can be set as the default data pool during +file system creation or via `file layouts <../../../cephfs/file-layouts>`_. + + +Erasure-coded pools and cache tiering +------------------------------------- + +.. note:: Cache tiering is deprecated in Reef. + +Erasure-coded pools require more resources than replicated pools and +lack some of the functionality supported by replicated pools (for example, omap). +To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_ +before setting up the erasure-coded pool. + +For example, if the pool *hot-storage* is made of fast storage, the following commands +will place the *hot-storage* pool as a tier of *ecpool* in *writeback* +mode: + +.. prompt:: bash $ + + ceph osd tier add ecpool hot-storage + ceph osd tier cache-mode hot-storage writeback + ceph osd tier set-overlay ecpool hot-storage + +The result is that every write and read to the *ecpool* actually uses +the *hot-storage* pool and benefits from its flexibility and speed. + +More information can be found in the `cache tiering +<../cache-tiering>`_ documentation. Note, however, that cache tiering +is deprecated and may be removed completely in a future release. + +Erasure-coded pool recovery +--------------------------- +If an erasure-coded pool loses any data shards, it must recover them from others. +This recovery involves reading from the remaining shards, reconstructing the data, and +writing new shards. + +In Octopus and later releases, erasure-coded pools can recover as long as there are at least *K* shards +available. (With fewer than *K* shards, you have actually lost data!) + +Prior to Octopus, erasure-coded pools required that at least ``min_size`` shards be +available, even if ``min_size`` was greater than ``K``. This was a conservative +decision made out of an abundance of caution when designing the new pool +mode. As a result, however, pools with lost OSDs but without complete data loss were +unable to recover and go active without manual intervention to temporarily change +the ``min_size`` setting. + +We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes and +loss of data. + + + +Glossary +-------- + +*chunk* + When the encoding function is called, it returns chunks of the same size as each other. There are two + kinds of chunks: (1) *data chunks*, which can be concatenated to reconstruct the original object, and + (2) *coding chunks*, which can be used to rebuild a lost chunk. + +*K* + The number of data chunks into which an object is divided. 
For example, if *K* = 2, then a 10KB object + is divided into two objects of 5KB each. + +*M* + The number of coding chunks computed by the encoding function. *M* is equal to the number of OSDs that can + be missing from the cluster without the cluster suffering data loss. For example, if there are two coding + chunks, then two OSDs can be missing without data loss. + +Table of contents +----------------- + +.. toctree:: + :maxdepth: 1 + + erasure-code-profile + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + erasure-code-clay diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst new file mode 100644 index 000000000..d52465602 --- /dev/null +++ b/doc/rados/operations/health-checks.rst @@ -0,0 +1,1619 @@ +.. _health-checks: + +=============== + Health checks +=============== + +Overview +======== + +There is a finite set of health messages that a Ceph cluster can raise. These +messages are known as *health checks*. Each health check has a unique +identifier. + +The identifier is a terse human-readable string -- that is, the identifier is +readable in much the same way as a typical variable name. It is intended to +enable tools (for example, UIs) to make sense of health checks and present them +in a way that reflects their meaning. + +This page lists the health checks that are raised by the monitor and manager +daemons. In addition to these, you might see health checks that originate +from MDS daemons (see :ref:`cephfs-health-messages`), and health checks +that are defined by ``ceph-mgr`` python modules. + +Definitions +=========== + +Monitor +------- + +DAEMON_OLD_VERSION +__________________ + +Warn if one or more old versions of Ceph are running on any daemons. A health +check is raised if multiple versions are detected. This condition must exist +for a period of time greater than ``mon_warn_older_version_delay`` (set to one +week by default) in order for the health check to be raised. This allows most +upgrades to proceed without the occurrence of a false warning. If the upgrade +is paused for an extended time period, ``health mute`` can be used by running +``ceph health mute DAEMON_OLD_VERSION --sticky``. Be sure, however, to run +``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has finished. + +MON_DOWN +________ + +One or more monitor daemons are currently down. The cluster requires a majority +(more than one-half) of the monitors to be available. When one or more monitors +are down, clients might have a harder time forming their initial connection to +the cluster, as they might need to try more addresses before they reach an +operating monitor. + +The down monitor daemon should be restarted as soon as possible to reduce the +risk of a subsequent monitor failure leading to a service outage. + +MON_CLOCK_SKEW +______________ + +The clocks on the hosts running the ceph-mon monitor daemons are not +well-synchronized. This health check is raised if the cluster detects a clock +skew greater than ``mon_clock_drift_allowed``. + +This issue is best resolved by synchronizing the clocks by using a tool like +``ntpd`` or ``chrony``. + +If it is impractical to keep the clocks closely synchronized, the +``mon_clock_drift_allowed`` threshold can also be increased. However, this +value must stay significantly below the ``mon_lease`` interval in order for the +monitor cluster to function properly. 
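+
+To inspect the monitors' view of time synchronization and, if necessary, to
+raise the threshold, run commands of the following forms (the value shown is
+only an example and is expressed in seconds):
+
+.. prompt:: bash $
+
+   ceph time-sync-status
+   ceph config set mon mon_clock_drift_allowed 0.1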
+ +MON_MSGR2_NOT_ENABLED +_____________________ + +The :confval:`ms_bind_msgr2` option is enabled but one or more monitors are +not configured to bind to a v2 port in the cluster's monmap. This +means that features specific to the msgr2 protocol (for example, encryption) +are unavailable on some or all connections. + +In most cases this can be corrected by running the following command: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +After this command is run, any monitor configured to listen on the old default +port (6789) will continue to listen for v1 connections on 6789 and begin to +listen for v2 connections on the new default port 3300. + +If a monitor is configured to listen for v1 connections on a non-standard port +(that is, a port other than 6789), then the monmap will need to be modified +manually. + + +MON_DISK_LOW +____________ + +One or more monitors are low on disk space. This health check is raised if the +percentage of available space on the file system used by the monitor database +(normally ``/var/lib/ceph/mon``) drops below the percentage value +``mon_data_avail_warn`` (default: 30%). + +This alert might indicate that some other process or user on the system is +filling up the file system used by the monitor. It might also +indicate that the monitor database is too large (see ``MON_DISK_BIG`` +below). + +If space cannot be freed, the monitor's data directory might need to be +moved to another storage device or file system (this relocation process must be carried out while the monitor +daemon is not running). + + +MON_DISK_CRIT +_____________ + +One or more monitors are critically low on disk space. This health check is raised if the +percentage of available space on the file system used by the monitor database +(normally ``/var/lib/ceph/mon``) drops below the percentage value +``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above. + +MON_DISK_BIG +____________ + +The database size for one or more monitors is very large. This health check is +raised if the size of the monitor database is larger than +``mon_data_size_warn`` (default: 15 GiB). + +A large database is unusual, but does not necessarily indicate a problem. +Monitor databases might grow in size when there are placement groups that have +not reached an ``active+clean`` state in a long time. + +This alert might also indicate that the monitor's database is not properly +compacting, an issue that has been observed with some older versions of leveldb +and rocksdb. Forcing a compaction with ``ceph daemon mon.<id> compact`` might +shrink the database's on-disk size. + +This alert might also indicate that the monitor has a bug that prevents it from +pruning the cluster metadata that it stores. If the problem persists, please +report a bug. + +To adjust the warning threshold, run the following command: + +.. prompt:: bash $ + + ceph config set global mon_data_size_warn <size> + + +AUTH_INSECURE_GLOBAL_ID_RECLAIM +_______________________________ + +One or more clients or daemons that are connected to the cluster are not +securely reclaiming their ``global_id`` (a unique number that identifies each +entity in the cluster) when reconnecting to a monitor. 
The client is being +permitted to connect anyway because the +``auth_allow_insecure_global_id_reclaim`` option is set to ``true`` (which may +be necessary until all Ceph clients have been upgraded) and because the +``auth_expose_insecure_global_id_reclaim`` option is set to ``true`` (which +allows monitors to detect clients with "insecure reclaim" sooner by forcing +those clients to reconnect immediately after their initial authentication). + +To identify which client(s) are using unpatched Ceph client code, run the +following command: + +.. prompt:: bash $ + + ceph health detail + +If you collect a dump of the clients that are connected to an individual +monitor and examine the ``global_id_status`` field in the output of the dump, +you can see the ``global_id`` reclaim behavior of those clients. Here +``reclaim_insecure`` means that a client is unpatched and is contributing to +this health check. To effect a client dump, run the following command: + +.. prompt:: bash $ + + ceph tell mon.\* sessions + +We strongly recommend that all clients in the system be upgraded to a newer +version of Ceph that correctly reclaims ``global_id`` values. After all clients +have been updated, run the following command to stop allowing insecure +reconnections: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +If it is impractical to upgrade all clients immediately, you can temporarily +silence this alert by running the following command: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this alert +indefinitely by running the following command: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim false + +AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED +_______________________________________ + +Ceph is currently configured to allow clients that reconnect to monitors using +an insecure process to reclaim their previous ``global_id``. Such reclaiming is +allowed because, by default, ``auth_allow_insecure_global_id_reclaim`` is set +to ``true``. It might be necessary to leave this setting enabled while existing +Ceph clients are upgraded to newer versions of Ceph that correctly and securely +reclaim their ``global_id``. + +If the ``AUTH_INSECURE_GLOBAL_ID_RECLAIM`` health check has not also been +raised and if the ``auth_expose_insecure_global_id_reclaim`` setting has not +been disabled (it is enabled by default), then there are currently no clients +connected that need to be upgraded. In that case, it is safe to disable +``insecure global_id reclaim`` by running the following command: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +On the other hand, if there are still clients that need to be upgraded, then +this alert can be temporarily silenced by running the following command: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this alert indefinitely +by running the following command: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false + + +Manager +------- + +MGR_DOWN +________ + +All manager daemons are currently down. The cluster should normally have at +least one running manager (``ceph-mgr``) daemon. 
If no manager daemon is +running, the cluster's ability to monitor itself will be compromised, and parts +of the management API will become unavailable (for example, the dashboard will +not work, and most CLI commands that report metrics or runtime state will +block). However, the cluster will still be able to perform all I/O operations +and to recover from failures. + +The "down" manager daemon should be restarted as soon as possible to ensure +that the cluster can be monitored (for example, so that the ``ceph -s`` +information is up to date, or so that metrics can be scraped by Prometheus). + + +MGR_MODULE_DEPENDENCY +_____________________ + +An enabled manager module is failing its dependency check. This health check +typically comes with an explanatory message from the module about the problem. + +For example, a module might report that a required package is not installed: in +this case, you should install the required package and restart your manager +daemons. + +This health check is applied only to enabled modules. If a module is not +enabled, you can see whether it is reporting dependency issues in the output of +`ceph module ls`. + + +MGR_MODULE_ERROR +________________ + +A manager module has experienced an unexpected error. Typically, this means +that an unhandled exception was raised from the module's `serve` function. The +human-readable description of the error might be obscurely worded if the +exception did not provide a useful description of itself. + +This health check might indicate a bug: please open a Ceph bug report if you +think you have encountered a bug. + +However, if you believe the error is transient, you may restart your manager +daemon(s) or use ``ceph mgr fail`` on the active daemon in order to force +failover to another daemon. + +OSDs +---- + +OSD_DOWN +________ + +One or more OSDs are marked "down". The ceph-osd daemon might have been +stopped, or peer OSDs might be unable to reach the OSD over the network. +Common causes include a stopped or crashed daemon, a "down" host, or a network +outage. + +Verify that the host is healthy, the daemon is started, and the network is +functioning. If the daemon has crashed, the daemon log file +(``/var/log/ceph/ceph-osd.*``) might contain debugging information. + +OSD_<crush type>_DOWN +_____________________ + +(for example, OSD_HOST_DOWN, OSD_ROOT_DOWN) + +All of the OSDs within a particular CRUSH subtree are marked "down" (for +example, all OSDs on a host). + +OSD_ORPHAN +__________ + +An OSD is referenced in the CRUSH map hierarchy, but does not exist. + +To remove the OSD from the CRUSH map hierarchy, run the following command: + +.. prompt:: bash $ + + ceph osd crush rm osd.<id> + +OSD_OUT_OF_ORDER_FULL +_____________________ + +The utilization thresholds for `nearfull`, `backfillfull`, `full`, and/or +`failsafe_full` are not ascending. In particular, the following pattern is +expected: `nearfull < backfillfull`, `backfillfull < full`, and `full < +failsafe_full`. + +To adjust these utilization thresholds, run the following commands: + +.. prompt:: bash $ + + ceph osd set-nearfull-ratio <ratio> + ceph osd set-backfillfull-ratio <ratio> + ceph osd set-full-ratio <ratio> + + +OSD_FULL +________ + +One or more OSDs have exceeded the `full` threshold and are preventing the +cluster from servicing writes. + +To check utilization by pool, run the following command: + +.. prompt:: bash $ + + ceph df + +To see the currently defined `full` ratio, run the following command: + +.. 
prompt:: bash $ + + ceph osd dump | grep full_ratio + +A short-term workaround to restore write availability is to raise the full +threshold by a small amount. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd set-full-ratio <ratio> + +Additional OSDs should be deployed in order to add new storage to the cluster, +or existing data should be deleted in order to free up space in the cluster. + +OSD_BACKFILLFULL +________________ + +One or more OSDs have exceeded the `backfillfull` threshold or *would* exceed +it if the currently-mapped backfills were to finish, which will prevent data +from rebalancing to this OSD. This alert is an early warning that +rebalancing might be unable to complete and that the cluster is approaching +full. + +To check utilization by pool, run the following command: + +.. prompt:: bash $ + + ceph df + +OSD_NEARFULL +____________ + +One or more OSDs have exceeded the `nearfull` threshold. This alert is an early +warning that the cluster is approaching full. + +To check utilization by pool, run the following command: + +.. prompt:: bash $ + + ceph df + +OSDMAP_FLAGS +____________ + +One or more cluster flags of interest have been set. These flags include: + +* *full* - the cluster is flagged as full and cannot serve writes +* *pauserd*, *pausewr* - there are paused reads or writes +* *noup* - OSDs are not allowed to start +* *nodown* - OSD failure reports are being ignored, and that means that the + monitors will not mark OSDs "down" +* *noin* - OSDs that were previously marked ``out`` are not being marked + back ``in`` when they start +* *noout* - "down" OSDs are not automatically being marked ``out`` after the + configured interval +* *nobackfill*, *norecover*, *norebalance* - recovery or data + rebalancing is suspended +* *noscrub*, *nodeep_scrub* - scrubbing is disabled +* *notieragent* - cache-tiering activity is suspended + +With the exception of *full*, these flags can be set or cleared by running the +following commands: + +.. prompt:: bash $ + + ceph osd set <flag> + ceph osd unset <flag> + +OSD_FLAGS +_________ + +One or more OSDs or CRUSH {nodes,device classes} have a flag of interest set. +These flags include: + +* *noup*: these OSDs are not allowed to start +* *nodown*: failure reports for these OSDs will be ignored +* *noin*: if these OSDs were previously marked ``out`` automatically + after a failure, they will not be marked ``in`` when they start +* *noout*: if these OSDs are "down" they will not automatically be marked + ``out`` after the configured interval + +To set and clear these flags in batch, run the following commands: + +.. prompt:: bash $ + + ceph osd set-group <flags> <who> + ceph osd unset-group <flags> <who> + +For example: + +.. prompt:: bash $ + + ceph osd set-group noup,noout osd.0 osd.1 + ceph osd unset-group noup,noout osd.0 osd.1 + ceph osd set-group noup,noout host-foo + ceph osd unset-group noup,noout host-foo + ceph osd set-group noup,noout class-hdd + ceph osd unset-group noup,noout class-hdd + +OLD_CRUSH_TUNABLES +__________________ + +The CRUSH map is using very old settings and should be updated. The oldest set +of tunables that can be used (that is, the oldest client version that can +connect to the cluster) without raising this health check is determined by the +``mon_crush_min_required_version`` config option. For more information, see +:ref:`crush-map-tunables`. 
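+
+As a quick check, you can inspect the tunables currently in effect and, if your
+oldest clients support it, switch to a newer profile. This is only a sketch:
+``optimal`` is an example profile, not a recommendation for every cluster, and
+changing tunables can trigger significant data movement:
+
+.. prompt:: bash $
+
+   ceph osd crush show-tunables
+   ceph osd crush tunables optimal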
+ +OLD_CRUSH_STRAW_CALC_VERSION +____________________________ + +The CRUSH map is using an older, non-optimal method of calculating intermediate +weight values for ``straw`` buckets. + +The CRUSH map should be updated to use the newer method (that is: +``straw_calc_version=1``). For more information, see :ref:`crush-map-tunables`. + +CACHE_POOL_NO_HIT_SET +_____________________ + +One or more cache pools are not configured with a *hit set* to track +utilization. This issue prevents the tiering agent from identifying cold +objects that are to be flushed and evicted from the cache. + +To configure hit sets on the cache pool, run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <poolname> hit_set_type <type> + ceph osd pool set <poolname> hit_set_period <period-in-seconds> + ceph osd pool set <poolname> hit_set_count <number-of-hitsets> + ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate> + +OSD_NO_SORTBITWISE +__________________ + +No pre-Luminous v12.y.z OSDs are running, but the ``sortbitwise`` flag has not +been set. + +The ``sortbitwise`` flag must be set in order for OSDs running Luminous v12.y.z +or newer to start. To safely set the flag, run the following command: + +.. prompt:: bash $ + + ceph osd set sortbitwise + +OSD_FILESTORE +__________________ + +Warn if OSDs are running Filestore. The Filestore OSD back end has been +deprecated; the BlueStore back end has been the default object store since the +Ceph Luminous release. + +The 'mclock_scheduler' is not supported for Filestore OSDs. For this reason, +the default 'osd_op_queue' is set to 'wpq' for Filestore OSDs and is enforced +even if the user attempts to change it. + + + +.. prompt:: bash $ + + ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}' + +**In order to upgrade to Reef or a later release, you must first migrate any +Filestore OSDs to BlueStore.** + +If you are upgrading a pre-Reef release to Reef or later, but it is not +feasible to migrate Filestore OSDs to BlueStore immediately, you can +temporarily silence this alert by running the following command: + +.. prompt:: bash $ + + ceph health mute OSD_FILESTORE + +Since this migration can take a considerable amount of time to complete, we +recommend that you begin the process well in advance of any update to Reef or +to later releases. + +POOL_FULL +_________ + +One or more pools have reached their quota and are no longer allowing writes. + +To see pool quotas and utilization, run the following command: + +.. prompt:: bash $ + + ceph df detail + +If you opt to raise the pool quota, run the following commands: + +.. prompt:: bash $ + + ceph osd pool set-quota <poolname> max_objects <num-objects> + ceph osd pool set-quota <poolname> max_bytes <num-bytes> + +If not, delete some existing data to reduce utilization. + +BLUEFS_SPILLOVER +________________ + +One or more OSDs that use the BlueStore back end have been allocated `db` +partitions (that is, storage space for metadata, normally on a faster device), +but because that space has been filled, metadata has "spilled over" onto the +slow device. This is not necessarily an error condition or even unexpected +behavior, but may result in degraded performance. If the administrator had +expected that all metadata would fit on the faster device, this alert indicates +that not enough space was provided. + +To disable this alert on all OSDs, run the following command: + +.. 
prompt:: bash $ + + ceph config set osd bluestore_warn_on_bluefs_spillover false + +Alternatively, to disable the alert on a specific OSD, run the following +command: + +.. prompt:: bash $ + + ceph config set osd.123 bluestore_warn_on_bluefs_spillover false + +To secure more metadata space, you can destroy and reprovision the OSD in +question. This process involves data migration and recovery. + +It might also be possible to expand the LVM logical volume that backs the `db` +storage. If the underlying LV has been expanded, you must stop the OSD daemon +and inform BlueFS of the device-size change by running the following command: + +.. prompt:: bash $ + + ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID + +BLUEFS_AVAILABLE_SPACE +______________________ + +To see how much space is free for BlueFS, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore bluefs available + +This will output up to three values: ``BDEV_DB free``, ``BDEV_SLOW free``, and +``available_from_bluestore``. ``BDEV_DB`` and ``BDEV_SLOW`` report the amount +of space that has been acquired by BlueFS and is now considered free. The value +``available_from_bluestore`` indicates the ability of BlueStore to relinquish +more space to BlueFS. It is normal for this value to differ from the amount of +BlueStore free space, because the BlueFS allocation unit is typically larger +than the BlueStore allocation unit. This means that only part of the BlueStore +free space will be available for BlueFS. + +BLUEFS_LOW_SPACE +_________________ + +If BlueFS is running low on available free space and there is not much free +space available from BlueStore (in other words, `available_from_bluestore` has +a low value), consider reducing the BlueFS allocation unit size. To simulate +available space when the allocation unit is different, run the following +command: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore bluefs available <alloc-unit-size> + +BLUESTORE_FRAGMENTATION +_______________________ + +As BlueStore operates, the free space on the underlying storage will become +fragmented. This is normal and unavoidable, but excessive fragmentation causes +slowdown. To inspect BlueStore fragmentation, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore allocator score block + +The fragmentation score is given in a [0-1] range. +[0.0 .. 0.4] tiny fragmentation +[0.4 .. 0.7] small, acceptable fragmentation +[0.7 .. 0.9] considerable, but safe fragmentation +[0.9 .. 1.0] severe fragmentation, might impact BlueFS's ability to get space from BlueStore + +To see a detailed report of free fragments, run the following command: + +.. prompt:: bash $ + + ceph daemon osd.123 bluestore allocator dump block + +For OSD processes that are not currently running, fragmentation can be +inspected with `ceph-bluestore-tool`. To see the fragmentation score, run the +following command: + +.. prompt:: bash $ + + ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score + +To dump detailed free chunks, run the following command: + +.. prompt:: bash $ + + ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump + +BLUESTORE_LEGACY_STATFS +_______________________ + +One or more OSDs have BlueStore volumes that were created prior to the +Nautilus release. (In Nautilus, BlueStore tracks its internal usage +statistics on a granular, per-pool basis.) 
+
+If *all* OSDs
+are older than Nautilus, this means that the per-pool metrics are
+simply unavailable. But if there is a mixture of pre-Nautilus and
+post-Nautilus OSDs, the cluster usage statistics reported by ``ceph
+df`` will be inaccurate.
+
+The old OSDs can be updated to use the new usage-tracking scheme by stopping
+each OSD, running a repair operation, and then restarting the OSD. For example,
+to update ``osd.123``, run the following commands:
+
+.. prompt:: bash $
+
+   systemctl stop ceph-osd@123
+   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+   systemctl start ceph-osd@123
+
+To disable this alert, run the following command:
+
+.. prompt:: bash $
+
+   ceph config set global bluestore_warn_on_legacy_statfs false
+
+BLUESTORE_NO_PER_POOL_OMAP
+__________________________
+
+One or more OSDs have volumes that were created prior to the Octopus release.
+(In Octopus and later releases, BlueStore tracks omap space utilization by
+pool.)
+
+If there are any BlueStore OSDs that do not have the new tracking enabled, the
+cluster will report an approximate value for per-pool omap usage based on the
+most recent deep scrub.
+
+The OSDs can be updated to track by pool by stopping each OSD, running a repair
+operation, and then restarting the OSD. For example, to update ``osd.123``, run
+the following commands:
+
+.. prompt:: bash $
+
+   systemctl stop ceph-osd@123
+   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+   systemctl start ceph-osd@123
+
+To disable this alert, run the following command:
+
+.. prompt:: bash $
+
+   ceph config set global bluestore_warn_on_no_per_pool_omap false
+
+BLUESTORE_NO_PER_PG_OMAP
+__________________________
+
+One or more OSDs have volumes that were created prior to the Pacific release.
+(In Pacific and later releases, BlueStore tracks omap space utilization by
+Placement Group (PG).)
+
+Per-PG omap allows faster PG removal when PGs migrate.
+
+The older OSDs can be updated to track by PG by stopping each OSD, running a
+repair operation, and then restarting the OSD. For example, to update
+``osd.123``, run the following commands:
+
+.. prompt:: bash $
+
+   systemctl stop ceph-osd@123
+   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+   systemctl start ceph-osd@123
+
+To disable this alert, run the following command:
+
+.. prompt:: bash $
+
+   ceph config set global bluestore_warn_on_no_per_pg_omap false
+
+
+BLUESTORE_DISK_SIZE_MISMATCH
+____________________________
+
+One or more BlueStore OSDs have an internal inconsistency between the size of
+the physical device and the metadata that tracks its size. This inconsistency
+can lead to the OSD(s) crashing in the future.
+
+The OSDs that have this inconsistency should be destroyed and reprovisioned. Be
+very careful to execute this procedure on only one OSD at a time, so as to
+minimize the risk of losing any data. To execute this procedure, where ``$N``
+is the OSD that has the inconsistency, run the following commands:
+
+.. prompt:: bash $
+
+   ceph osd out osd.$N
+   while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
+   ceph osd destroy osd.$N
+   ceph-volume lvm zap /path/to/device
+   ceph-volume lvm create --osd-id $N --data /path/to/device
+
+.. note::
+
+   Wait for this recovery procedure to complete on one OSD before running it
+   on the next.
+
+BLUESTORE_NO_COMPRESSION
+________________________
+
+One or more OSDs are unable to load a BlueStore compression plugin.
This issue +might be caused by a broken installation, in which the ``ceph-osd`` binary does +not match the compression plugins. Or it might be caused by a recent upgrade in +which the ``ceph-osd`` daemon was not restarted. + +To resolve this issue, verify that all of the packages on the host that is +running the affected OSD(s) are correctly installed and that the OSD daemon(s) +have been restarted. If the problem persists, check the OSD log for information +about the source of the problem. + +BLUESTORE_SPURIOUS_READ_ERRORS +______________________________ + +One or more BlueStore OSDs detect spurious read errors on the main device. +BlueStore has recovered from these errors by retrying disk reads. This alert +might indicate issues with underlying hardware, issues with the I/O subsystem, +or something similar. In theory, such issues can cause permanent data +corruption. Some observations on the root cause of spurious read errors can be +found here: https://tracker.ceph.com/issues/22464 + +This alert does not require an immediate response, but the affected host might +need additional attention: for example, upgrading the host to the latest +OS/kernel versions and implementing hardware-resource-utilization monitoring. + +To disable this alert on all OSDs, run the following command: + +.. prompt:: bash $ + + ceph config set osd bluestore_warn_on_spurious_read_errors false + +Or, to disable this alert on a specific OSD, run the following command: + +.. prompt:: bash $ + + ceph config set osd.123 bluestore_warn_on_spurious_read_errors false + +Device health +------------- + +DEVICE_HEALTH +_____________ + +One or more OSD devices are expected to fail soon, where the warning threshold +is determined by the ``mgr/devicehealth/warn_threshold`` config option. + +Because this alert applies only to OSDs that are currently marked ``in``, the +appropriate response to this expected failure is (1) to mark the OSD ``out`` so +that data is migrated off of the OSD, and then (2) to remove the hardware from +the system. Note that this marking ``out`` is normally done automatically if +``mgr/devicehealth/self_heal`` is enabled (as determined by +``mgr/devicehealth/mark_out_threshold``). + +To check device health, run the following command: + +.. prompt:: bash $ + + ceph device info <device-id> + +Device life expectancy is set either by a prediction model that the mgr runs or +by an external tool that is activated by running the following command: + +.. prompt:: bash $ + + ceph device set-life-expectancy <device-id> <from> <to> + +You can change the stored life expectancy manually, but such a change usually +doesn't accomplish anything. The reason for this is that whichever tool +originally set the stored life expectancy will probably undo your change by +setting it again, and a change to the stored value does not affect the actual +health of the hardware device. + +DEVICE_HEALTH_IN_USE +____________________ + +One or more devices (that is, OSDs) are expected to fail soon and have been +marked ``out`` of the cluster (as controlled by +``mgr/devicehealth/mark_out_threshold``), but they are still participating in +one or more Placement Groups. This might be because the OSD(s) were marked +``out`` only recently and data is still migrating, or because data cannot be +migrated off of the OSD(s) for some reason (for example, the cluster is nearly +full, or the CRUSH hierarchy is structured so that there isn't another suitable +OSD to migrate the data to). 
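+
+To check whether such an OSD still holds Placement Groups and whether it can
+yet be removed safely, one option (shown here with a hypothetical ``osd.123``)
+is the following:
+
+.. prompt:: bash $
+
+   ceph pg ls-by-osd osd.123
+   ceph osd safe-to-destroy osd.123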
+ +This message can be silenced by disabling self-heal behavior (that is, setting +``mgr/devicehealth/self_heal`` to ``false``), by adjusting +``mgr/devicehealth/mark_out_threshold``, or by addressing whichever condition +is preventing data from being migrated off of the ailing OSD(s). + +.. _rados_health_checks_device_health_toomany: + +DEVICE_HEALTH_TOOMANY +_____________________ + +Too many devices (that is, OSDs) are expected to fail soon, and because +``mgr/devicehealth/self_heal`` behavior is enabled, marking ``out`` all of the +ailing OSDs would exceed the cluster's ``mon_osd_min_in_ratio`` ratio. This +ratio prevents a cascade of too many OSDs from being automatically marked +``out``. + +You should promptly add new OSDs to the cluster to prevent data loss, or +incrementally replace the failing OSDs. + +Alternatively, you can silence this health check by adjusting options including +``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``. Be +warned, however, that this will increase the likelihood of unrecoverable data +loss. + + +Data health (pools & placement groups) +-------------------------------------- + +PG_AVAILABILITY +_______________ + +Data availability is reduced. In other words, the cluster is unable to service +potential read or write requests for at least some data in the cluster. More +precisely, one or more Placement Groups (PGs) are in a state that does not +allow I/O requests to be serviced. Any of the following PG states are +problematic if they do not clear quickly: *peering*, *stale*, *incomplete*, and +the lack of *active*. + +For detailed information about which PGs are affected, run the following +command: + +.. prompt:: bash $ + + ceph health detail + +In most cases, the root cause of this issue is that one or more OSDs are +currently ``down``: see ``OSD_DOWN`` above. + +To see the state of a specific problematic PG, run the following command: + +.. prompt:: bash $ + + ceph tell <pgid> query + +PG_DEGRADED +___________ + +Data redundancy is reduced for some data: in other words, the cluster does not +have the desired number of replicas for all data (in the case of replicated +pools) or erasure code fragments (in the case of erasure-coded pools). More +precisely, one or more Placement Groups (PGs): + +* have the *degraded* or *undersized* flag set, which means that there are not + enough instances of that PG in the cluster; or +* have not had the *clean* state set for a long time. + +For detailed information about which PGs are affected, run the following +command: + +.. prompt:: bash $ + + ceph health detail + +In most cases, the root cause of this issue is that one or more OSDs are +currently "down": see ``OSD_DOWN`` above. + +To see the state of a specific problematic PG, run the following command: + +.. prompt:: bash $ + + ceph tell <pgid> query + + +PG_RECOVERY_FULL +________________ + +Data redundancy might be reduced or even put at risk for some data due to a +lack of free space in the cluster. More precisely, one or more Placement Groups +have the *recovery_toofull* flag set, which means that the cluster is unable to +migrate or recover data because one or more OSDs are above the ``full`` +threshold. + +For steps to resolve this condition, see *OSD_FULL* above. + +PG_BACKFILL_FULL +________________ + +Data redundancy might be reduced or even put at risk for some data due to a +lack of free space in the cluster. 
More precisely, one or more Placement Groups +have the *backfill_toofull* flag set, which means that the cluster is unable to +migrate or recover data because one or more OSDs are above the ``backfillfull`` +threshold. + +For steps to resolve this condition, see *OSD_BACKFILLFULL* above. + +PG_DAMAGED +__________ + +Data scrubbing has discovered problems with data consistency in the cluster. +More precisely, one or more Placement Groups either (1) have the *inconsistent* +or ``snaptrim_error`` flag set, which indicates that an earlier data scrub +operation found a problem, or (2) have the *repair* flag set, which means that +a repair for such an inconsistency is currently in progress. + +For more information, see :doc:`pg-repair`. + +OSD_SCRUB_ERRORS +________________ + +Recent OSD scrubs have discovered inconsistencies. This alert is generally +paired with *PG_DAMAGED* (see above). + +For more information, see :doc:`pg-repair`. + +OSD_TOO_MANY_REPAIRS +____________________ + +The count of read repairs has exceeded the config value threshold +``mon_osd_warn_num_repaired`` (default: ``10``). Because scrub handles errors +only for data at rest, and because any read error that occurs when another +replica is available will be repaired immediately so that the client can get +the object data, there might exist failing disks that are not registering any +scrub errors. This repair count is maintained as a way of identifying any such +failing disks. + + +LARGE_OMAP_OBJECTS +__________________ + +One or more pools contain large omap objects, as determined by +``osd_deep_scrub_large_omap_object_key_threshold`` (threshold for the number of +keys to determine what is considered a large omap object) or +``osd_deep_scrub_large_omap_object_value_sum_threshold`` (the threshold for the +summed size in bytes of all key values to determine what is considered a large +omap object) or both. To find more information on object name, key count, and +size in bytes, search the cluster log for 'Large omap object found'. This issue +can be caused by RGW-bucket index objects that do not have automatic resharding +enabled. For more information on resharding, see :ref:`RGW Dynamic Bucket Index +Resharding <rgw_dynamic_bucket_index_resharding>`. + +To adjust the thresholds mentioned above, run the following commands: + +.. prompt:: bash $ + + ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys> + ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes> + +CACHE_POOL_NEAR_FULL +____________________ + +A cache-tier pool is nearly full, as determined by the ``target_max_bytes`` and +``target_max_objects`` properties of the cache pool. Once the pool reaches the +target threshold, write requests to the pool might block while data is flushed +and evicted from the cache. This state normally leads to very high latencies +and poor performance. + +To adjust the cache pool's target size, run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <cache-pool-name> target_max_bytes <bytes> + ceph osd pool set <cache-pool-name> target_max_objects <objects> + +There might be other reasons that normal cache flush and evict activity are +throttled: for example, reduced availability of the base tier, reduced +performance of the base tier, or overall cluster load. + +TOO_FEW_PGS +___________ + +The number of Placement Groups (PGs) that are in use in the cluster is below +the configurable threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. 
This can +lead to suboptimal distribution and suboptimal balance of data across the OSDs +in the cluster, and a reduction of overall performance. + +If data pools have not yet been created, this condition is expected. + +To address this issue, you can increase the PG count for existing pools or +create new pools. For more information, see +:ref:`choosing-number-of-placement-groups`. + +POOL_PG_NUM_NOT_POWER_OF_TWO +____________________________ + +One or more pools have a ``pg_num`` value that is not a power of two. Although +this is not strictly incorrect, it does lead to a less balanced distribution of +data because some Placement Groups will have roughly twice as much data as +others have. + +This is easily corrected by setting the ``pg_num`` value for the affected +pool(s) to a nearby power of two. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <value> + +To disable this health check, run the following command: + +.. prompt:: bash $ + + ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false + +POOL_TOO_FEW_PGS +________________ + +One or more pools should probably have more Placement Groups (PGs), given the +amount of data that is currently stored in the pool. This issue can lead to +suboptimal distribution and suboptimal balance of data across the OSDs in the +cluster, and a reduction of overall performance. This alert is raised only if +the ``pg_autoscale_mode`` property on the pool is set to ``warn``. + +To disable the alert, entirely disable auto-scaling of PGs for the pool by +running the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs for the pool, +run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +Alternatively, to manually set the number of PGs for the pool to the +recommended amount, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +For more information, see :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler`. + +TOO_MANY_PGS +____________ + +The number of Placement Groups (PGs) in use in the cluster is above the +configurable threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold +is exceeded, the cluster will not allow new pools to be created, pool `pg_num` +to be increased, or pool replication to be increased (any of which, if allowed, +would lead to more PGs in the cluster). A large number of PGs can lead to +higher memory utilization for OSD daemons, slower peering after cluster state +changes (for example, OSD restarts, additions, or removals), and higher load on +the Manager and Monitor daemons. + +The simplest way to mitigate the problem is to increase the number of OSDs in +the cluster by adding more hardware. Note that, because the OSD count that is +used for the purposes of this health check is the number of ``in`` OSDs, +marking ``out`` OSDs ``in`` (if there are any ``out`` OSDs available) can also +help. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd in <osd id(s)> + +For more information, see :ref:`choosing-number-of-placement-groups`. + +POOL_TOO_MANY_PGS +_________________ + +One or more pools should probably have fewer Placement Groups (PGs), given the +amount of data that is currently stored in the pool. 
This issue can lead to +higher memory utilization for OSD daemons, slower peering after cluster state +changes (for example, OSD restarts, additions, or removals), and higher load on +the Manager and Monitor daemons. This alert is raised only if the +``pg_autoscale_mode`` property on the pool is set to ``warn``. + +To disable the alert, entirely disable auto-scaling of PGs for the pool by +running the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs for the pool, +run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +Alternatively, to manually set the number of PGs for the pool to the +recommended amount, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +For more information, see :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler`. + + +POOL_TARGET_SIZE_BYTES_OVERCOMMITTED +____________________________________ + +One or more pools have a ``target_size_bytes`` property that is set in order to +estimate the expected size of the pool, but the value(s) of this property are +greater than the total available storage (either by themselves or in +combination with other pools). + +This alert is usually an indication that the ``target_size_bytes`` value for +the pool is too large and should be reduced or set to zero. To reduce the +``target_size_bytes`` value or set it to zero, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +The above command sets the value of ``target_size_bytes`` to zero. To set the +value of ``target_size_bytes`` to a non-zero value, replace the ``0`` with that +non-zero value. + +For more information, see :ref:`specifying_pool_target_size`. + +POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO +____________________________________ + +One or more pools have both ``target_size_bytes`` and ``target_size_ratio`` set +in order to estimate the expected size of the pool. Only one of these +properties should be non-zero. If both are set to a non-zero value, then +``target_size_ratio`` takes precedence and ``target_size_bytes`` is ignored. + +To reset ``target_size_bytes`` to zero, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +TOO_FEW_OSDS +____________ + +The number of OSDs in the cluster is below the configurable threshold of +``osd_pool_default_size``. This means that some or all data may not be able to +satisfy the data protection policy specified in CRUSH rules and pool settings. + +SMALLER_PGP_NUM +_______________ + +One or more pools have a ``pgp_num`` value less than ``pg_num``. This alert is +normally an indication that the Placement Group (PG) count was increased +without any increase in the placement behavior. + +This disparity is sometimes brought about deliberately, in order to separate +out the `split` step when the PG count is adjusted from the data migration that +is needed when ``pgp_num`` is changed. + +This issue is normally resolved by setting ``pgp_num`` to match ``pg_num``, so +as to trigger the data migration, by running the following command: + +.. 
prompt:: bash $
+
+   ceph osd pool set <pool> pgp_num <pg-num-value>
+
+MANY_OBJECTS_PER_PG
+___________________
+
+One or more pools have an average number of objects per Placement Group (PG)
+that is significantly higher than the overall cluster average. The specific
+threshold is determined by the ``mon_pg_warn_max_object_skew`` configuration
+value.
+
+This alert is usually an indication that the pool(s) that contain most of the
+data in the cluster have too few PGs, or that other pools that contain less
+data have too many PGs. See *TOO_MANY_PGS* above.
+
+To silence the health check, raise the threshold by adjusting the
+``mon_pg_warn_max_object_skew`` config option on the managers.
+
+The health check will be silenced for a specific pool only if
+``pg_autoscale_mode`` is set to ``on``.
+
+POOL_APP_NOT_ENABLED
+____________________
+
+A pool exists but the pool has not been tagged for use by a particular
+application.
+
+To resolve this issue, tag the pool for use by an application. For
+example, if the pool is used by RBD, run the following command:
+
+.. prompt:: bash $
+
+   rbd pool init <poolname>
+
+Alternatively, if the pool is being used by a custom application (here 'foo'),
+you can label the pool by running the following low-level command:
+
+.. prompt:: bash $
+
+   ceph osd pool application enable <poolname> foo
+
+For more information, see :ref:`associate-pool-to-application`.
+
+POOL_FULL
+_________
+
+One or more pools have reached (or are very close to reaching) their quota. The
+threshold to raise this health check is determined by the
+``mon_pool_quota_crit_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) by running the following
+commands:
+
+.. prompt:: bash $
+
+   ceph osd pool set-quota <pool> max_bytes <bytes>
+   ceph osd pool set-quota <pool> max_objects <objects>
+
+To disable a quota, set the quota value to 0.
+
+POOL_NEAR_FULL
+______________
+
+One or more pools are approaching a configured fullness threshold.
+
+One of the several thresholds that can raise this health check is determined by
+the ``mon_pool_quota_warn_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) by running the following
+commands:
+
+.. prompt:: bash $
+
+   ceph osd pool set-quota <pool> max_bytes <bytes>
+   ceph osd pool set-quota <pool> max_objects <objects>
+
+To disable a quota, set the quota value to 0.
+
+Other thresholds that can raise the two health checks above are
+``mon_osd_nearfull_ratio`` and ``mon_osd_full_ratio``. For details and
+resolution, see :ref:`storage-capacity` and :ref:`no-free-drive-space`.
+
+OBJECT_MISPLACED
+________________
+
+One or more objects in the cluster are not stored on the node that CRUSH would
+prefer that they be stored on. This alert is an indication that data migration
+due to a recent cluster change has not yet completed.
+
+Misplaced data is not a dangerous condition in and of itself; data consistency
+is never at risk, and old copies of objects will not be removed until the
+desired number of new copies (in the desired locations) has been created.
+
+OBJECT_UNFOUND
+______________
+
+One or more objects in the cluster cannot be found. More precisely, the OSDs
+know that a new or updated copy of an object should exist, but no such copy has
+been found on OSDs that are currently online.
+
+Read or write requests to unfound objects will block.
+
+Ideally, a "down" OSD that has a more recent copy of the unfound object can be
+brought back online. 
To identify candidate OSDs, check the peering state of the
+PG(s) responsible for the unfound object. To see the peering state, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph tell <pgid> query
+
+On the other hand, if the latest copy of the object is not available, the
+cluster can be told to roll back to a previous version of the object. For more
+information, see :ref:`failures-osd-unfound`.
+
+SLOW_OPS
+________
+
+One or more OSD requests or monitor requests are taking a long time to process.
+This alert might be an indication of extreme load, a slow storage device, or a
+software bug.
+
+To query the request queue for the daemon that is causing the slowdown, run the
+following command from the daemon's host:
+
+.. prompt:: bash $
+
+   ceph daemon osd.<id> ops
+
+To see a summary of the slowest recent requests, run the following command:
+
+.. prompt:: bash $
+
+   ceph daemon osd.<id> dump_historic_ops
+
+To see the location of a specific OSD, run the following command:
+
+.. prompt:: bash $
+
+   ceph osd find osd.<id>
+
+PG_NOT_SCRUBBED
+_______________
+
+One or more Placement Groups (PGs) have not been scrubbed recently. PGs are
+normally scrubbed within an interval determined by
+:confval:`osd_scrub_max_interval` globally. This interval can be overridden on
+a per-pool basis by changing the value of the variable
+:confval:`scrub_max_interval`. This health check is raised if a certain
+percentage (determined by ``mon_warn_pg_not_scrubbed_ratio``) of the interval
+has elapsed after the time the scrub was scheduled and no scrub has been
+performed.
+
+PGs will be scrubbed only if they are flagged as ``clean`` (which means that
+they are to be cleaned, and not that they have been examined and found to be
+clean). Misplaced or degraded PGs will not be flagged as ``clean`` (see
+*PG_AVAILABILITY* and *PG_DEGRADED* above).
+
+To manually initiate a scrub of a clean PG, run the following command:
+
+.. prompt:: bash $
+
+   ceph pg scrub <pgid>
+
+PG_NOT_DEEP_SCRUBBED
+____________________
+
+One or more Placement Groups (PGs) have not been deep scrubbed recently. PGs
+are normally deep scrubbed at least once every
+:confval:`osd_deep_scrub_interval` seconds. This health check is raised if a
+certain percentage (determined by ``mon_warn_pg_not_deep_scrubbed_ratio``) of
+the interval has elapsed after the time the scrub was scheduled and no scrub
+has been performed.
+
+PGs will receive a deep scrub only if they are flagged as *clean* (which means
+that they are to be cleaned, and not that they have been examined and found to
+be clean). Misplaced or degraded PGs might not be flagged as ``clean`` (see
+*PG_AVAILABILITY* and *PG_DEGRADED* above).
+
+To manually initiate a deep scrub of a clean PG, run the following command:
+
+.. prompt:: bash $
+
+   ceph pg deep-scrub <pgid>
+
+
+PG_SLOW_SNAP_TRIMMING
+_____________________
+
+The snapshot trim queue for one or more PGs has exceeded the configured warning
+threshold. This alert indicates either that an extremely large number of
+snapshots was recently deleted, or that OSDs are unable to trim snapshots
+quickly enough to keep up with the rate of new snapshot deletions.
+
+The warning threshold is determined by the ``mon_osd_snap_trim_queue_warn_on``
+option (default: 32768).
+
+This alert might be raised if OSDs are under excessive load and unable to keep
+up with their background work, or if the OSDs' internal metadata database is
+heavily fragmented and performing poorly. The alert might also indicate some
+other performance issue with the OSDs.
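+
+If the default threshold is too conservative for a snapshot-heavy workload, one
+possible adjustment is to raise the warning threshold named above. This is only
+an illustration, and the value ``65536`` is an arbitrary example:
+
+.. prompt:: bash $
+
+   ceph config set global mon_osd_snap_trim_queue_warn_on 65536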
+
+The exact size of the snapshot trim queue is reported by the ``snaptrimq_len``
+field of ``ceph pg ls -f json-detail``.
+
+Stretch Mode
+------------
+
+INCORRECT_NUM_BUCKETS_STRETCH_MODE
+__________________________________
+
+Stretch mode currently supports only two dividing buckets that contain OSDs.
+This warning indicates that the number of dividing buckets is not equal to two
+after stretch mode has been enabled. You can expect unpredictable failures and
+MON assertions until the condition is fixed.
+
+We encourage you to fix this by removing the additional dividing buckets or by
+bringing the number of dividing buckets up to two.
+
+UNEVEN_WEIGHTS_STRETCH_MODE
+___________________________
+
+The two dividing buckets must have equal weights when stretch mode is enabled.
+This warning indicates that the two dividing buckets have uneven weights after
+stretch mode has been enabled. This is not immediately fatal; however, you can
+expect Ceph to be confused when trying to process transitions between dividing
+buckets.
+
+We encourage you to fix this by making the weights even on both dividing
+buckets. This can be done by making sure that the combined weight of the OSDs
+in each dividing bucket is the same.
+
+Miscellaneous
+-------------
+
+RECENT_CRASH
+____________
+
+One or more Ceph daemons have crashed recently, and the crash(es) have not yet
+been acknowledged and archived by the administrator. This alert might indicate
+a software bug, a hardware problem (for example, a failing disk), or some other
+problem.
+
+To list recent crashes, run the following command:
+
+.. prompt:: bash $
+
+   ceph crash ls-new
+
+To examine information about a specific crash, run the following command:
+
+.. prompt:: bash $
+
+   ceph crash info <crash-id>
+
+To silence this alert, you can archive the crash (perhaps after the crash
+has been examined by an administrator) by running the following command:
+
+.. prompt:: bash $
+
+   ceph crash archive <crash-id>
+
+Similarly, to archive all recent crashes, run the following command:
+
+.. prompt:: bash $
+
+   ceph crash archive-all
+
+Archived crashes will still be visible by running the command ``ceph crash
+ls``, but not by running the command ``ceph crash ls-new``.
+
+The time period that is considered recent is determined by the option
+``mgr/crash/warn_recent_interval`` (default: two weeks).
+
+To entirely disable this alert, run the following command:
+
+.. prompt:: bash $
+
+   ceph config set mgr mgr/crash/warn_recent_interval 0
+
+RECENT_MGR_MODULE_CRASH
+_______________________
+
+One or more ``ceph-mgr`` modules have crashed recently, and the crash(es) have
+not yet been acknowledged and archived by the administrator. This alert
+usually indicates a software bug in one of the software modules that are
+running inside the ``ceph-mgr`` daemon. The module that experienced the problem
+might be disabled as a result, but other modules are unaffected and continue to
+function as expected.
+
+As with the *RECENT_CRASH* health check, a specific crash can be inspected by
+running the following command:
+
+.. prompt:: bash $
+
+   ceph crash info <crash-id>
+
+To silence this alert, you can archive the crash (perhaps after the crash has
+been examined by an administrator) by running the following command:
+
+.. prompt:: bash $
+
+   ceph crash archive <crash-id>
+
+Similarly, to archive all recent crashes, run the following command:
+
+.. prompt:: bash $
+
+   ceph crash archive-all
+
+Archived crashes will still be visible by running the command ``ceph crash ls``
+but not by running the command ``ceph crash ls-new``.
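+
+If the module was disabled as a result of the crash, it can be re-enabled once
+the underlying problem has been addressed. This is only a sketch, and
+``telemetry`` below is a placeholder module name:
+
+.. prompt:: bash $
+
+   ceph mgr module enable telemetry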
+
+The time period that is considered recent is determined by the option
+``mgr/crash/warn_recent_interval`` (default: two weeks).
+
+To entirely disable this alert, run the following command:
+
+.. prompt:: bash $
+
+   ceph config set mgr mgr/crash/warn_recent_interval 0
+
+TELEMETRY_CHANGED
+_________________
+
+Telemetry has been enabled, but because the contents of the telemetry report
+have changed in the meantime, telemetry reports will not be sent.
+
+Ceph developers occasionally revise the telemetry feature to include new and
+useful information, or to remove information found to be useless or sensitive.
+If any new information is included in the report, Ceph requires the
+administrator to re-enable telemetry. This requirement ensures that the
+administrator has an opportunity to (re)review the information that will be
+shared.
+
+To review the contents of the telemetry report, run the following command:
+
+.. prompt:: bash $
+
+   ceph telemetry show
+
+Note that the telemetry report consists of several channels that may be
+independently enabled or disabled. For more information, see :ref:`telemetry`.
+
+To re-enable telemetry (and silence the alert), run the following command:
+
+.. prompt:: bash $
+
+   ceph telemetry on
+
+To disable telemetry (and silence the alert), run the following command:
+
+.. prompt:: bash $
+
+   ceph telemetry off
+
+AUTH_BAD_CAPS
+_____________
+
+One or more auth users have capabilities that cannot be parsed by the monitors.
+As a general rule, this alert indicates that there are one or more daemon types
+that the user is not authorized to use to perform any action.
+
+This alert is most likely to be raised after an upgrade if (1) the capabilities
+were set with an older version of Ceph that did not properly validate the
+syntax of those capabilities, or if (2) the syntax of the capabilities has
+changed.
+
+To remove the user(s) in question, run the following command:
+
+.. prompt:: bash $
+
+   ceph auth rm <entity-name>
+
+(This resolves the health check, but it prevents clients from being able to
+authenticate as the removed user.)
+
+Alternatively, to update the capabilities for the user(s), run the following
+command:
+
+.. prompt:: bash $
+
+   ceph auth caps <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...]
+
+For more information about auth capabilities, see :ref:`user-management`.
+
+OSD_NO_DOWN_OUT_INTERVAL
+________________________
+
+The ``mon_osd_down_out_interval`` option is set to zero, which means that the
+system does not automatically perform any repair or healing operations when an
+OSD fails. Instead, an administrator or an external orchestrator must manually
+mark "down" OSDs as ``out`` (by running ``ceph osd out <osd-id>``) in order to
+trigger recovery.
+
+This option is normally set to five or ten minutes, which should be enough time
+for a host to power-cycle or reboot.
+
+To silence this alert, set ``mon_warn_on_osd_down_out_interval_zero`` to
+``false`` by running the following command:
+
+.. prompt:: bash $
+
+   ceph config set mon mon_warn_on_osd_down_out_interval_zero false
+
+DASHBOARD_DEBUG
+_______________
+
+The Dashboard debug mode is enabled. This means that if there is an error while
+processing a REST API request, the HTTP error response will contain a Python
+traceback. This mode should be disabled in production environments because such
+a traceback might contain and expose sensitive information.
+
+To disable the debug mode, run the following command:
+
+.. 
prompt:: bash $ + + ceph dashboard debug disable diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst new file mode 100644 index 000000000..15525c1d3 --- /dev/null +++ b/doc/rados/operations/index.rst @@ -0,0 +1,99 @@ +.. _rados-operations: + +==================== + Cluster Operations +==================== + +.. raw:: html + + <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3> + +High-level cluster operations consist primarily of starting, stopping, and +restarting a cluster with the ``ceph`` service; checking the cluster's health; +and, monitoring an operating cluster. + +.. toctree:: + :maxdepth: 1 + + operating + health-checks + monitoring + monitoring-osd-pg + user-management + pg-repair + +.. raw:: html + + </td><td><h3>Data Placement</h3> + +Once you have your cluster up and running, you may begin working with data +placement. Ceph supports petabyte-scale data storage clusters, with storage +pools and placement groups that distribute data across the cluster using Ceph's +CRUSH algorithm. + +.. toctree:: + :maxdepth: 1 + + data-placement + pools + erasure-code + cache-tiering + placement-groups + upmap + read-balancer + balancer + crush-map + crush-map-edits + stretch-mode + change-mon-elections + + + +.. raw:: html + + </td></tr><tr><td><h3>Low-level Operations</h3> + +Low-level cluster operations consist of starting, stopping, and restarting a +particular daemon within a cluster; changing the settings of a particular +daemon or subsystem; and, adding a daemon to the cluster or removing a daemon +from the cluster. The most common use cases for low-level operations include +growing or shrinking the Ceph cluster and replacing legacy or failed hardware +with new hardware. + +.. toctree:: + :maxdepth: 1 + + add-or-rm-osds + add-or-rm-mons + devices + bluestore-migration + Command Reference <control> + + + +.. raw:: html + + </td><td><h3>Troubleshooting</h3> + +Ceph is still on the leading edge, so you may encounter situations that require +you to evaluate your Ceph configuration and modify your logging and debugging +settings to identify and remedy issues you are encountering with your cluster. + +.. toctree:: + :maxdepth: 1 + + ../troubleshooting/community + ../troubleshooting/troubleshooting-mon + ../troubleshooting/troubleshooting-osd + ../troubleshooting/troubleshooting-pg + ../troubleshooting/log-and-debug + ../troubleshooting/cpu-profiling + ../troubleshooting/memory-profiling + + + + +.. raw:: html + + </td></tr></tbody></table> + diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst new file mode 100644 index 000000000..b0a6767a1 --- /dev/null +++ b/doc/rados/operations/monitoring-osd-pg.rst @@ -0,0 +1,556 @@ +========================= + Monitoring OSDs and PGs +========================= + +High availability and high reliability require a fault-tolerant approach to +managing hardware and software issues. Ceph has no single point of failure and +it can service requests for data even when in a "degraded" mode. Ceph's `data +placement`_ introduces a layer of indirection to ensure that data doesn't bind +directly to specific OSDs. For this reason, tracking system faults +requires finding the `placement group`_ (PG) and the underlying OSDs at the +root of the problem. + +.. tip:: A fault in one part of the cluster might prevent you from accessing a + particular object, but that doesn't mean that you are prevented from + accessing other objects. 
When you run into a fault, don't panic. Just + follow the steps for monitoring your OSDs and placement groups, and then + begin troubleshooting. + +Ceph is self-repairing. However, when problems persist, monitoring OSDs and +placement groups will help you identify the problem. + + +Monitoring OSDs +=============== + +An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is +either running and reachable (``up``), or it is not running and not reachable +(``down``). + +If an OSD is ``up``, it may be either ``in`` service (clients can read and +write data) or it is ``out`` of service. If the OSD was ``in`` but then due to +a failure or a manual action was set to the ``out`` state, Ceph will migrate +placement groups to the other OSDs to maintin the configured redundancy. + +If an OSD is ``out`` of service, CRUSH will not assign placement groups to it. +If an OSD is ``down``, it will also be ``out``. + +.. note:: If an OSD is ``down`` and ``in``, there is a problem and this + indicates that the cluster is not in a healthy state. + +.. ditaa:: + + +----------------+ +----------------+ + | | | | + | OSD #n In | | OSD #n Up | + | | | | + +----------------+ +----------------+ + ^ ^ + | | + | | + v v + +----------------+ +----------------+ + | | | | + | OSD #n Out | | OSD #n Down | + | | | | + +----------------+ +----------------+ + +If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``, +you might notice that the cluster does not always show ``HEALTH OK``. Don't +panic. There are certain circumstances in which it is expected and normal that +the cluster will **NOT** show ``HEALTH OK``: + +#. You haven't started the cluster yet. +#. You have just started or restarted the cluster and it's not ready to show + health statuses yet, because the PGs are in the process of being created and + the OSDs are in the process of peering. +#. You have just added or removed an OSD. +#. You have just have modified your cluster map. + +Checking to see if OSDs are ``up`` and running is an important aspect of monitoring them: +whenever the cluster is up and running, every OSD that is ``in`` the cluster should also +be ``up`` and running. To see if all of the cluster's OSDs are running, run the following +command: + +.. prompt:: bash $ + + ceph osd stat + +The output provides the following information: the total number of OSDs (x), +how many OSDs are ``up`` (y), how many OSDs are ``in`` (z), and the map epoch (eNNNN). :: + + x osds: y up, z in; epoch: eNNNN + +If the number of OSDs that are ``in`` the cluster is greater than the number of +OSDs that are ``up``, run the following command to identify the ``ceph-osd`` +daemons that are not running: + +.. prompt:: bash $ + + ceph osd tree + +:: + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 2.00000 pool openstack + -3 2.00000 rack dell-2950-rack-A + -2 2.00000 host dell-2950-A1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 down 1.00000 1.00000 + +.. tip:: Searching through a well-designed CRUSH hierarchy to identify the physical + locations of particular OSDs might help you troubleshoot your cluster. + +If an OSD is ``down``, start it by running the following command: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@1 + +For problems associated with OSDs that have stopped or won't restart, see `OSD Not Running`_. + + +PG Sets +======= + +When CRUSH assigns a PG to OSDs, it takes note of how many replicas of the PG +are required by the pool and then assigns each replica to a different OSD. 
+For example, if the pool requires three replicas of a PG, CRUSH might assign +them individually to ``osd.1``, ``osd.2`` and ``osd.3``. CRUSH seeks a +pseudo-random placement that takes into account the failure domains that you +have set in your `CRUSH map`_; for this reason, PGs are rarely assigned to +immediately adjacent OSDs in a large cluster. + +Ceph processes client requests with the **Acting Set** of OSDs: this is the set +of OSDs that currently have a full and working version of a PG shard and that +are therefore responsible for handling requests. By contrast, the **Up Set** is +the set of OSDs that contain a shard of a specific PG. Data is moved or copied +to the **Up Set**, or planned to be moved or copied, to the **Up Set**. See +:ref:`Placement Group Concepts <rados_operations_pg_concepts>`. + +Sometimes an OSD in the Acting Set is ``down`` or otherwise unable to +service requests for objects in the PG. When this kind of situation +arises, don't panic. Common examples of such a situation include: + +- You added or removed an OSD, CRUSH reassigned the PG to + other OSDs, and this reassignment changed the composition of the Acting Set and triggered + the migration of data by means of a "backfill" process. +- An OSD was ``down``, was restarted, and is now ``recovering``. +- An OSD in the Acting Set is ``down`` or unable to service requests, + and another OSD has temporarily assumed its duties. + +Typically, the Up Set and the Acting Set are identical. When they are not, it +might indicate that Ceph is migrating the PG (in other words, that the PG has +been remapped), that an OSD is recovering, or that there is a problem with the +cluster (in such scenarios, Ceph usually shows a "HEALTH WARN" state with a +"stuck stale" message). + +To retrieve a list of PGs, run the following command: + +.. prompt:: bash $ + + ceph pg dump + +To see which OSDs are within the Acting Set and the Up Set for a specific PG, run the following command: + +.. prompt:: bash $ + + ceph pg map {pg-num} + +The output provides the following information: the osdmap epoch (eNNN), the PG number +({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set +(acting[]):: + + osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2] + +.. note:: If the Up Set and the Acting Set do not match, this might indicate + that the cluster is rebalancing itself or that there is a problem with + the cluster. + + +Peering +======= + +Before you can write data to a PG, it must be in an ``active`` state and it +will preferably be in a ``clean`` state. For Ceph to determine the current +state of a PG, peering must take place. That is, the primary OSD of the PG +(that is, the first OSD in the Acting Set) must peer with the secondary and +OSDs so that consensus on the current state of the PG can be established. In +the following diagram, we assume a pool with three replicas of the PG: + +.. ditaa:: + + +---------+ +---------+ +-------+ + | OSD 1 | | OSD 2 | | OSD 3 | + +---------+ +---------+ +-------+ + | | | + | Request To | | + | Peer | | + |-------------->| | + |<--------------| | + | Peering | + | | + | Request To | + | Peer | + |----------------------------->| + |<-----------------------------| + | Peering | + +The OSDs also report their status to the monitor. For details, see `Configuring Monitor/OSD +Interaction`_. To troubleshoot peering issues, see `Peering +Failure`_. 
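+
+As a quick first check (a supplement to the troubleshooting guide referenced
+above, not a replacement for it), you can list PGs that appear to be stuck
+before reaching the ``active`` state:
+
+.. prompt:: bash $
+
+   ceph pg dump_stuck inactive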
+ + +Monitoring PG States +==================== + +If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``, +you might notice that the cluster does not always show ``HEALTH OK``. After +first checking to see if the OSDs are running, you should also check PG +states. There are certain PG-peering-related circumstances in which it is expected +and normal that the cluster will **NOT** show ``HEALTH OK``: + +#. You have just created a pool and the PGs haven't peered yet. +#. The PGs are recovering. +#. You have just added an OSD to or removed an OSD from the cluster. +#. You have just modified your CRUSH map and your PGs are migrating. +#. There is inconsistent data in different replicas of a PG. +#. Ceph is scrubbing a PG's replicas. +#. Ceph doesn't have enough storage capacity to complete backfilling operations. + +If one of these circumstances causes Ceph to show ``HEALTH WARN``, don't +panic. In many cases, the cluster will recover on its own. In some cases, however, you +might need to take action. An important aspect of monitoring PGs is to check their +status as ``active`` and ``clean``: that is, it is important to ensure that, when the +cluster is up and running, all PGs are ``active`` and (preferably) ``clean``. +To see the status of every PG, run the following command: + +.. prompt:: bash $ + + ceph pg stat + +The output provides the following information: the total number of PGs (x), how many +PGs are in a particular state such as ``active+clean`` (y), and the +amount of data stored (z). :: + + x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail + +.. note:: It is common for Ceph to report multiple states for PGs (for example, + ``active+clean``, ``active+clean+remapped``, ``active+clean+scrubbing``. + +Here Ceph shows not only the PG states, but also storage capacity used (aa), +the amount of storage capacity remaining (bb), and the total storage capacity +of the PG. These values can be important in a few cases: + +- The cluster is reaching its ``near full ratio`` or ``full ratio``. +- Data is not being distributed across the cluster due to an error in the + CRUSH configuration. + + +.. topic:: Placement Group IDs + + PG IDs consist of the pool number (not the pool name) followed by a period + (.) and a hexadecimal number. You can view pool numbers and their names from + in the output of ``ceph osd lspools``. For example, the first pool that was + created corresponds to pool number ``1``. A fully qualified PG ID has the + following form:: + + {pool-num}.{pg-id} + + It typically resembles the following:: + + 1.1701b + + +To retrieve a list of PGs, run the following command: + +.. prompt:: bash $ + + ceph pg dump + +To format the output in JSON format and save it to a file, run the following command: + +.. prompt:: bash $ + + ceph pg dump -o {filename} --format=json + +To query a specific PG, run the following command: + +.. prompt:: bash $ + + ceph pg {poolnum}.{pg-id} query + +Ceph will output the query in JSON format. + +The following subsections describe the most common PG states in detail. + + +Creating +-------- + +PGs are created when you create a pool: the command that creates a pool +specifies the total number of PGs for that pool, and when the pool is created +all of those PGs are created as well. Ceph will echo ``creating`` while it is +creating PGs. After the PG(s) are created, the OSDs that are part of a PG's +Acting Set will peer. Once peering is complete, the PG status should be +``active+clean``. 
This status means that Ceph clients begin writing to the +PG. + +.. ditaa:: + + /-----------\ /-----------\ /-----------\ + | Creating |------>| Peering |------>| Active | + \-----------/ \-----------/ \-----------/ + +Peering +------- + +When a PG peers, the OSDs that store the replicas of its data converge on an +agreed state of the data and metadata within that PG. When peering is complete, +those OSDs agree about the state of that PG. However, completion of the peering +process does **NOT** mean that each replica has the latest contents. + +.. topic:: Authoritative History + + Ceph will **NOT** acknowledge a write operation to a client until that write + operation is persisted by every OSD in the Acting Set. This practice ensures + that at least one member of the Acting Set will have a record of every + acknowledged write operation since the last successful peering operation. + + Given an accurate record of each acknowledged write operation, Ceph can + construct a new authoritative history of the PG--that is, a complete and + fully ordered set of operations that, if performed, would bring an OSD’s + copy of the PG up to date. + + +Active +------ + +After Ceph has completed the peering process, a PG should become ``active``. +The ``active`` state means that the data in the PG is generally available for +read and write operations in the primary and replica OSDs. + + +Clean +----- + +When a PG is in the ``clean`` state, all OSDs holding its data and metadata +have successfully peered and there are no stray replicas. Ceph has replicated +all objects in the PG the correct number of times. + + +Degraded +-------- + +When a client writes an object to the primary OSD, the primary OSD is +responsible for writing the replicas to the replica OSDs. After the primary OSD +writes the object to storage, the PG will remain in a ``degraded`` +state until the primary OSD has received an acknowledgement from the replica +OSDs that Ceph created the replica objects successfully. + +The reason that a PG can be ``active+degraded`` is that an OSD can be +``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes +``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must +peer again when the OSD comes back online. However, a client can still write a +new object to a ``degraded`` PG if it is ``active``. + +If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the +``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD +to another OSD. The time between being marked ``down`` and being marked ``out`` +is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds +by default. + +A PG can also be in the ``degraded`` state because there are one or more +objects that Ceph expects to find in the PG but that Ceph cannot find. Although +you cannot read or write to unfound objects, you can still access all of the other +objects in the ``degraded`` PG. + + +Recovering +---------- + +Ceph was designed for fault-tolerance, because hardware and other server +problems are expected or even routine. When an OSD goes ``down``, its contents +might fall behind the current state of other replicas in the PGs. When the OSD +has returned to the ``up`` state, the contents of the PGs must be updated to +reflect that current state. During that time period, the OSD might be in a +``recovering`` state. + +Recovery is not always trivial, because a hardware failure might cause a +cascading failure of multiple OSDs. 
For example, a network switch for a rack or
+cabinet might fail, which can cause the OSDs of a number of host machines to
+fall behind the current state of the cluster. In such a scenario, general
+recovery is possible only if each of the OSDs recovers after the fault has been
+resolved.
+
+Ceph provides a number of settings that determine how the cluster balances the
+resource contention between the need to process new service requests and the
+need to recover data objects and restore the PGs to the current state. The
+``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
+even process some replay requests before starting the recovery process. The
+``osd_recovery_thread_timeout`` setting determines the duration of a thread
+timeout, because multiple OSDs might fail, restart, and re-peer at staggered
+rates. The ``osd_recovery_max_active`` setting limits the number of recovery
+requests an OSD can entertain simultaneously, in order to prevent the OSD from
+failing to serve requests. The ``osd_recovery_max_chunk`` setting limits the
+size of the recovered data chunks, in order to prevent network congestion.
+
+
+Back Filling
+------------
+
+When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
+already in the cluster to the newly added OSD. Forcing the new OSD to accept
+the reassigned PGs immediately can put excessive load on it. Backfilling the
+OSD with the PGs instead allows this process to begin in the background. After
+the backfill operations have completed, the new OSD will begin serving requests
+as soon as it is ready.
+
+During the backfill operations, you might see one of several states:
+``backfill_wait`` indicates that a backfill operation is pending, but is not
+yet underway; ``backfilling`` indicates that a backfill operation is currently
+underway; and ``backfill_toofull`` indicates that a backfill operation was
+requested but couldn't be completed due to insufficient storage capacity. When
+a PG cannot be backfilled, it might be considered ``incomplete``.
+
+The ``backfill_toofull`` state might be transient. It might happen that, as PGs
+are moved around, space becomes available. The ``backfill_toofull`` state is
+similar to ``backfill_wait`` in that backfill operations can proceed as soon as
+conditions change.
+
+Ceph provides a number of settings to manage the load spike associated with the
+reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
+setting specifies the maximum number of concurrent backfills to and from an OSD
+(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
+backfill request if the OSD is approaching its full ratio (default: 90%). This
+setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
+an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
+allows an OSD to retry the request after a certain interval (default: 30
+seconds). OSDs can also set ``osd_backfill_scan_min`` and
+``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
+512, respectively).
+
+
+Remapped
+--------
+
+When the Acting Set that services a PG changes, the data migrates from the old
+Acting Set to the new Acting Set. Because it might take time for the new
+primary OSD to begin servicing requests, the old primary OSD might be required
+to continue servicing requests until the PG data migration is complete. After
+data migration has completed, the mapping uses the primary OSD of the new
+Acting Set.
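+
+To see which PGs are currently involved in one of the transitions described
+above, the PG state can be used as a filter, and the backfill throttle can be
+adjusted at runtime. The following commands are a sketch only: the value ``2``
+for ``osd_max_backfills`` is merely illustrative, and raising it trades client
+I/O performance for faster data migration:
+
+.. prompt:: bash $
+
+   ceph pg ls remapped                      # PGs mapped away from their CRUSH-preferred OSDs
+   ceph pg ls backfilling                   # PGs with a backfill currently underway
+   ceph pg ls backfill_wait                 # PGs queued for backfill
+   ceph config set osd osd_max_backfills 2  # allow two concurrent backfills per OSD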
+ + +Stale +----- + +Although Ceph uses heartbeats in order to ensure that hosts and daemons are +running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are +not reporting statistics in a timely manner (for example, there might be a +temporary network fault). By default, OSD daemons report their PG, up through, +boot, and failure statistics every half second (that is, in accordance with a +value of ``0.5``), which is more frequent than the reports defined by the +heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report +to the monitor or if other OSDs have reported the primary OSD ``down``, the +monitors will mark the PG ``stale``. + +When you start your cluster, it is common to see the ``stale`` state until the +peering process completes. After your cluster has been running for a while, +however, seeing PGs in the ``stale`` state indicates that the primary OSD for +those PGs is ``down`` or not reporting PG statistics to the monitor. + + +Identifying Troubled PGs +======================== + +As previously noted, a PG is not necessarily having problems just because its +state is not ``active+clean``. When PGs are stuck, this might indicate that +Ceph cannot perform self-repairs. The stuck states include: + +- **Unclean**: PGs contain objects that have not been replicated the desired + number of times. Under normal conditions, it can be assumed that these PGs + are recovering. +- **Inactive**: PGs cannot process reads or writes because they are waiting for + an OSD that has the most up-to-date data to come back ``up``. +- **Stale**: PG are in an unknown state, because the OSDs that host them have + not reported to the monitor cluster for a certain period of time (determined + by ``mon_osd_report_timeout``). + +To identify stuck PGs, run the following command: + +.. prompt:: bash $ + + ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] + +For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs, +see `Troubleshooting PG Errors`_. + + +Finding an Object Location +========================== + +To store object data in the Ceph Object Store, a Ceph client must: + +#. Set an object name +#. Specify a `pool`_ + +The Ceph client retrieves the latest cluster map, the CRUSH algorithm +calculates how to map the object to a PG, and then the algorithm calculates how +to dynamically assign the PG to an OSD. To find the object location given only +the object name and the pool name, run a command of the following form: + +.. prompt:: bash $ + + ceph osd map {poolname} {object-name} [namespace] + +.. topic:: Exercise: Locate an Object + + As an exercise, let's create an object. We can specify an object name, a path + to a test file that contains some object data, and a pool name by using the + ``rados put`` command on the command line. For example: + + .. prompt:: bash $ + + rados put {object-name} {file-path} --pool=data + rados put test-object-1 testfile.txt --pool=data + + To verify that the Ceph Object Store stored the object, run the + following command: + + .. prompt:: bash $ + + rados -p data ls + + To identify the object location, run the following commands: + + .. prompt:: bash $ + + ceph osd map {pool-name} {object-name} + ceph osd map data test-object-1 + + Ceph should output the object's location. For example:: + + osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0) + + To remove the test object, simply delete it by running the ``rados rm`` + command. 
For example: + + .. prompt:: bash $ + + rados rm test-object-1 --pool=data + +As the cluster evolves, the object location may change dynamically. One benefit +of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually +performing the migration. For details, see the `Architecture`_ section. + +.. _data placement: ../data-placement +.. _pool: ../pools +.. _placement group: ../placement-groups +.. _Architecture: ../../../architecture +.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running +.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors +.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering +.. _CRUSH map: ../crush-map +.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ +.. _Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst new file mode 100644 index 000000000..a9171f2d8 --- /dev/null +++ b/doc/rados/operations/monitoring.rst @@ -0,0 +1,644 @@ +====================== + Monitoring a Cluster +====================== + +After you have a running cluster, you can use the ``ceph`` tool to monitor your +cluster. Monitoring a cluster typically involves checking OSD status, monitor +status, placement group status, and metadata server status. + +Using the command line +====================== + +Interactive mode +---------------- + +To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line +with no arguments. For example: + +.. prompt:: bash $ + + ceph + +.. prompt:: ceph> + :prompts: ceph> + + health + status + quorum_status + mon stat + +Non-default paths +----------------- + +If you specified non-default locations for your configuration or keyring when +you install the cluster, you may specify their locations to the ``ceph`` tool +by running the following command: + +.. prompt:: bash $ + + ceph -c /path/to/conf -k /path/to/keyring health + +Checking a Cluster's Status +=========================== + +After you start your cluster, and before you start reading and/or writing data, +you should check your cluster's status. + +To check a cluster's status, run the following command: + +.. prompt:: bash $ + + ceph status + +Alternatively, you can run the following command: + +.. prompt:: bash $ + + ceph -s + +In interactive mode, this operation is performed by typing ``status`` and +pressing **Enter**: + +.. prompt:: ceph> + :prompts: ceph> + + status + +Ceph will print the cluster status. For example, a tiny Ceph "demonstration +cluster" that is running one instance of each service (monitor, manager, and +OSD) might print the following: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + +How Ceph Calculates Data Usage +------------------------------ + +The ``usage`` value reflects the *actual* amount of raw storage used. The ``xxx +GB / xxx GB`` value means the amount available (the lesser number) of the +overall storage capacity of the cluster. The notional number reflects the size +of the stored data before it is replicated, cloned or snapshotted. 
Therefore, +the amount of data actually stored typically exceeds the notional amount +stored, because Ceph creates replicas of the data and may also use storage +capacity for cloning and snapshotting. + + +Watching a Cluster +================== + +Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster +itself maintains a *cluster log* that records high-level events about the +entire Ceph cluster. These events are logged to disk on monitor servers (in +the default location ``/var/log/ceph/ceph.log``), and they can be monitored via +the command line. + +To follow the cluster log, run the following command: + +.. prompt:: bash $ + + ceph -w + +Ceph will print the status of the system, followed by each log message as it is +added. For example: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + + 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot + 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x + 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available + +Instead of printing log lines as they are added, you might want to print only +the most recent lines. Run ``ceph log last [n]`` to see the most recent ``n`` +lines from the cluster log. + +Monitoring Health Checks +======================== + +Ceph continuously runs various *health checks*. When +a health check fails, this failure is reflected in the output of ``ceph status`` and +``ceph health``. The cluster log receives messages that +indicate when a check has failed and when the cluster has recovered. + +For example, when an OSD goes down, the ``health`` section of the status +output is updated as follows: + +:: + + health: HEALTH_WARN + 1 osds down + Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded + +At the same time, cluster log messages are emitted to record the failure of the +health checks: + +:: + + 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN) + 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED) + +When the OSD comes back online, the cluster log records the cluster's return +to a healthy state: + +:: + + 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED) + 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized) + 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy + +Network Performance Checks +-------------------------- + +Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon +availability and network performance. If a single delayed response is detected, +this might indicate nothing more than a busy OSD. 
But if multiple delays +between distinct pairs of OSDs are detected, this might indicate a failed +network switch, a NIC failure, or a layer 1 failure. + +By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a +health check (a ``HEALTH_WARN``. For example: + +:: + + HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms) + +In the output of the ``ceph health detail`` command, you can see which OSDs are +experiencing delays and how long the delays are. The output of ``ceph health +detail`` is limited to ten lines. Here is an example of the output you can +expect from the ``ceph health detail`` command:: + + [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms) + Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving + Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.2 [dc1,rack2] 1030.123 msec + Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec + Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec + +To see more detail and to collect a complete dump of network performance +information, use the ``dump_osd_network`` command. This command is usually sent +to a Ceph Manager Daemon, but it can be used to collect information about a +specific OSD's interactions by sending it to that OSD. The default threshold +for a slow heartbeat is 1 second (1000 milliseconds), but this can be +overridden by providing a number of milliseconds as an argument. + +To show all network performance data with a specified threshold of 0, send the +following command to the mgr: + +.. prompt:: bash $ + + ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0 + +:: + + { + "threshold": 0, + "entries": [ + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "front", + "average": { + "1min": 1.023, + "5min": 0.860, + "15min": 0.883 + }, + "min": { + "1min": 0.818, + "5min": 0.607, + "15min": 0.607 + }, + "max": { + "1min": 1.164, + "5min": 1.173, + "15min": 1.544 + }, + "last": 0.924 + }, + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "back", + "average": { + "1min": 0.968, + "5min": 0.897, + "15min": 0.830 + }, + "min": { + "1min": 0.860, + "5min": 0.563, + "15min": 0.502 + }, + "max": { + "1min": 1.171, + "5min": 1.216, + "15min": 1.456 + }, + "last": 0.845 + }, + { + "last update": "Wed Sep 4 17:04:48 2019", + "stale": false, + "from osd": 0, + "to osd": 1, + "interface": "front", + "average": { + "1min": 0.965, + "5min": 0.811, + "15min": 0.850 + }, + "min": { + "1min": 0.650, + "5min": 0.488, + "15min": 0.466 + }, + "max": { + "1min": 1.252, + "5min": 1.252, + "15min": 1.362 + }, + "last": 0.791 + }, + ... + + + +Muting Health Checks +-------------------- + +Health checks can be muted so that they have no effect on the overall +reported status of the cluster. For example, if the cluster has raised a +single health check and then you mute that health check, then the cluster will report a status of ``HEALTH_OK``. +To mute a specific health check, use the health check code that corresponds to that health check (see :ref:`health-checks`), and +run the following command: + +.. prompt:: bash $ + + ceph health mute <code> + +For example, to mute an ``OSD_DOWN`` health check, run the following command: + +.. 
prompt:: bash $ + + ceph health mute OSD_DOWN + +Mutes are reported as part of the short and long form of the ``ceph health`` command's output. +For example, in the above scenario, the cluster would report: + +.. prompt:: bash $ + + ceph health + +:: + + HEALTH_OK (muted: OSD_DOWN) + +.. prompt:: bash $ + + ceph health detail + +:: + + HEALTH_OK (muted: OSD_DOWN) + (MUTED) OSD_DOWN 1 osds down + osd.1 is down + +A mute can be removed by running the following command: + +.. prompt:: bash $ + + ceph health unmute <code> + +For example: + +.. prompt:: bash $ + + ceph health unmute OSD_DOWN + +A "health mute" can have a TTL (**T**\ime **T**\o **L**\ive) +associated with it: this means that the mute will automatically expire +after a specified period of time. The TTL is specified as an optional +duration argument, as seen in the following examples: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN 4h # mute for 4 hours + ceph health mute MON_DOWN 15m # mute for 15 minutes + +Normally, if a muted health check is resolved (for example, if the OSD that raised the ``OSD_DOWN`` health check +in the example above has come back up), the mute goes away. If the health check comes +back later, it will be reported in the usual way. + +It is possible to make a health mute "sticky": this means that the mute will remain even if the +health check clears. For example, to make a health mute "sticky", you might run the following command: + +.. prompt:: bash $ + + ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour + +Most health mutes disappear if the unhealthy condition that triggered the health check gets worse. +For example, suppose that there is one OSD down and the health check is muted. In that case, if +one or more additional OSDs go down, then the health mute disappears. This behavior occurs in any health check with a threshold value. + + +Checking a Cluster's Usage Stats +================================ + +To check a cluster's data usage and data distribution among pools, use the +``df`` command. This option is similar to Linux's ``df`` command. Run the +following command: + +.. prompt:: bash $ + + ceph df + +The output of ``ceph df`` resembles the following:: + + CLASS SIZE AVAIL USED RAW USED %RAW USED + ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 + TOTAL 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 + + --- POOLS --- + POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR + device_health_metrics 1 1 242 KiB 15 KiB 227 KiB 4 251 KiB 24 KiB 227 KiB 0 297 GiB N/A N/A 4 0 B 0 B + cephfs.a.meta 2 32 6.8 KiB 6.8 KiB 0 B 22 96 KiB 96 KiB 0 B 0 297 GiB N/A N/A 22 0 B 0 B + cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B + test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B + +- **CLASS:** For example, "ssd" or "hdd". +- **SIZE:** The amount of storage capacity managed by the cluster. +- **AVAIL:** The amount of free space available in the cluster. +- **USED:** The amount of raw storage consumed by user data (excluding + BlueStore's database). +- **RAW USED:** The amount of raw storage consumed by user data, internal + overhead, and reserved capacity. +- **%RAW USED:** The percentage of raw storage used. Watch this number in + conjunction with ``full ratio`` and ``near full ratio`` to be forewarned when + your cluster approaches the fullness thresholds. See `Storage Capacity`_. 
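+
+The same information can also be requested in a more verbose form or in a
+machine-readable format, which is convenient for scripting. These variants are
+a sketch; the exact columns and fields depend on the Ceph release:
+
+.. prompt:: bash $
+
+   ceph df detail
+   ceph df detail --format=json-pretty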
+ + +**POOLS:** + +The POOLS section of the output provides a list of pools and the *notional* +usage of each pool. This section of the output **DOES NOT** reflect replicas, +clones, or snapshots. For example, if you store an object with 1MB of data, +then the notional usage will be 1MB, but the actual usage might be 2MB or more +depending on the number of replicas, clones, and snapshots. + +- **ID:** The number of the specific node within the pool. +- **STORED:** The actual amount of data that the user has stored in a pool. + This is similar to the USED column in earlier versions of Ceph, but the + calculations (for BlueStore!) are more precise (in that gaps are properly + handled). + + - **(DATA):** Usage for RBD (RADOS Block Device), CephFS file data, and RGW + (RADOS Gateway) object data. + - **(OMAP):** Key-value pairs. Used primarily by CephFS and RGW (RADOS + Gateway) for metadata storage. + +- **OBJECTS:** The notional number of objects stored per pool (that is, the + number of objects other than replicas, clones, or snapshots). +- **USED:** The space allocated for a pool over all OSDs. This includes space + for replication, space for allocation granularity, and space for the overhead + associated with erasure-coding. Compression savings and object-content gaps + are also taken into account. However, BlueStore's database is not included in + the amount reported under USED. + + - **(DATA):** Object usage for RBD (RADOS Block Device), CephFS file data, + and RGW (RADOS Gateway) object data. + - **(OMAP):** Object key-value pairs. Used primarily by CephFS and RGW (RADOS + Gateway) for metadata storage. + +- **%USED:** The notional percentage of storage used per pool. +- **MAX AVAIL:** An estimate of the notional amount of data that can be written + to this pool. +- **QUOTA OBJECTS:** The number of quota objects. +- **QUOTA BYTES:** The number of bytes in the quota objects. +- **DIRTY:** The number of objects in the cache pool that have been written to + the cache pool but have not yet been flushed to the base pool. This field is + available only when cache tiering is in use. +- **USED COMPR:** The amount of space allocated for compressed data. This + includes compressed data in addition to all of the space required for + replication, allocation granularity, and erasure- coding overhead. +- **UNDER COMPR:** The amount of data that has passed through compression + (summed over all replicas) and that is worth storing in a compressed form. + + +.. note:: The numbers in the POOLS section are notional. They do not include + the number of replicas, clones, or snapshots. As a result, the sum of the + USED and %USED amounts in the POOLS section of the output will not be equal + to the sum of the USED and %USED amounts in the RAW section of the output. + +.. note:: The MAX AVAIL value is a complicated function of the replication or + the kind of erasure coding used, the CRUSH rule that maps storage to + devices, the utilization of those devices, and the configured + ``mon_osd_full_ratio`` setting. + + +Checking OSD Status +=================== + +To check if OSDs are ``up`` and ``in``, run the +following command: + +.. prompt:: bash # + + ceph osd stat + +Alternatively, you can run the following command: + +.. prompt:: bash # + + ceph osd dump + +To view OSDs according to their position in the CRUSH map, run the following +command: + +.. 
prompt:: bash # + + ceph osd tree + +To print out a CRUSH tree that displays a host, its OSDs, whether the OSDs are +``up``, and the weight of the OSDs, run the following command: + +.. code-block:: bash + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 3.00000 pool default + -3 3.00000 rack mainrack + -2 3.00000 host osd-host + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + +See `Monitoring OSDs and Placement Groups`_. + +Checking Monitor Status +======================= + +If your cluster has multiple monitors, then you need to perform certain +"monitor status" checks. After starting the cluster and before reading or +writing data, you should check quorum status. A quorum must be present when +multiple monitors are running to ensure proper functioning of your Ceph +cluster. Check monitor status regularly in order to ensure that all of the +monitors are running. + +To display the monitor map, run the following command: + +.. prompt:: bash $ + + ceph mon stat + +Alternatively, you can run the following command: + +.. prompt:: bash $ + + ceph mon dump + +To check the quorum status for the monitor cluster, run the following command: + +.. prompt:: bash $ + + ceph quorum_status + +Ceph returns the quorum status. For example, a Ceph cluster that consists of +three monitors might return the following: + +.. code-block:: javascript + + { "election_epoch": 10, + "quorum": [ + 0, + 1, + 2], + "quorum_names": [ + "a", + "b", + "c"], + "quorum_leader_name": "a", + "monmap": { "epoch": 1, + "fsid": "444b489c-4f16-4b75-83f0-cb8097468898", + "modified": "2011-12-12 13:28:27.505520", + "created": "2011-12-12 13:28:27.505520", + "features": {"persistent": [ + "kraken", + "luminous", + "mimic"], + "optional": [] + }, + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789/0", + "public_addr": "127.0.0.1:6789/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790/0", + "public_addr": "127.0.0.1:6790/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6791/0", + "public_addr": "127.0.0.1:6791/0"} + ] + } + } + +Checking MDS Status +=================== + +Metadata servers provide metadata services for CephFS. Metadata servers have +two sets of states: ``up | down`` and ``active | inactive``. To check if your +metadata servers are ``up`` and ``active``, run the following command: + +.. prompt:: bash $ + + ceph mds stat + +To display details of the metadata servers, run the following command: + +.. prompt:: bash $ + + ceph fs dump + + +Checking Placement Group States +=============================== + +Placement groups (PGs) map objects to OSDs. PGs are monitored in order to +ensure that they are ``active`` and ``clean``. See `Monitoring OSDs and +Placement Groups`_. + +.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg + +.. _rados-monitoring-using-admin-socket: + +Using the Admin Socket +====================== + +The Ceph admin socket allows you to query a daemon via a socket interface. By +default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon via +the admin socket, log in to the host that is running the daemon and run one of +the two following commands: + +.. prompt:: bash $ + + ceph daemon {daemon-name} + ceph daemon {path-to-socket-file} + +For example, the following commands are equivalent to each other: + +.. 
prompt:: bash $ + + ceph daemon osd.0 foo + ceph daemon /var/run/ceph/ceph-osd.0.asok foo + +To view the available admin-socket commands, run the following command: + +.. prompt:: bash $ + + ceph daemon {daemon-name} help + +Admin-socket commands enable you to view and set your configuration at runtime. +For more on viewing your configuration, see `Viewing a Configuration at +Runtime`_. There are two methods of setting configuration value at runtime: (1) +using the admin socket, which bypasses the monitor and requires a direct login +to the host in question, and (2) using the ``ceph tell {daemon-type}.{id} +config set`` command, which relies on the monitor and does not require a direct +login. + +.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime +.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity diff --git a/doc/rados/operations/operating.rst b/doc/rados/operations/operating.rst new file mode 100644 index 000000000..f4a2fd988 --- /dev/null +++ b/doc/rados/operations/operating.rst @@ -0,0 +1,174 @@ +===================== + Operating a Cluster +===================== + +.. index:: systemd; operating a cluster + + +Running Ceph with systemd +========================= + +In all distributions that support systemd (CentOS 7, Fedora, Debian +Jessie 8 and later, and SUSE), systemd files (and NOT legacy SysVinit scripts) +are used to manage Ceph daemons. Ceph daemons therefore behave like any other daemons +that can be controlled by the ``systemctl`` command, as in the following examples: + +.. prompt:: bash $ + + sudo systemctl start ceph.target # start all daemons + sudo systemctl status ceph-osd@12 # check status of osd.12 + +To list all of the Ceph systemd units on a node, run the following command: + +.. prompt:: bash $ + + sudo systemctl status ceph\*.service ceph\*.target + + +Starting all daemons +-------------------- + +To start all of the daemons on a Ceph node (regardless of their type), run the +following command: + +.. prompt:: bash $ + + sudo systemctl start ceph.target + + +Stopping all daemons +-------------------- + +To stop all of the daemons on a Ceph node (regardless of their type), run the +following command: + +.. prompt:: bash $ + + sudo systemctl stop ceph\*.service ceph\*.target + + +Starting all daemons by type +---------------------------- + +To start all of the daemons of a particular type on a Ceph node, run one of the +following commands: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd.target + sudo systemctl start ceph-mon.target + sudo systemctl start ceph-mds.target + + +Stopping all daemons by type +---------------------------- + +To stop all of the daemons of a particular type on a Ceph node, run one of the +following commands: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd\*.service ceph-osd.target + sudo systemctl stop ceph-mon\*.service ceph-mon.target + sudo systemctl stop ceph-mds\*.service ceph-mds.target + + +Starting a daemon +----------------- + +To start a specific daemon instance on a Ceph node, run one of the +following commands: + +.. prompt:: bash $ + + sudo systemctl start ceph-osd@{id} + sudo systemctl start ceph-mon@{hostname} + sudo systemctl start ceph-mds@{hostname} + +For example: + +.. 
prompt:: bash $ + + sudo systemctl start ceph-osd@1 + sudo systemctl start ceph-mon@ceph-server + sudo systemctl start ceph-mds@ceph-server + + +Stopping a daemon +----------------- + +To stop a specific daemon instance on a Ceph node, run one of the +following commands: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd@{id} + sudo systemctl stop ceph-mon@{hostname} + sudo systemctl stop ceph-mds@{hostname} + +For example: + +.. prompt:: bash $ + + sudo systemctl stop ceph-osd@1 + sudo systemctl stop ceph-mon@ceph-server + sudo systemctl stop ceph-mds@ceph-server + + +.. index:: sysvinit; operating a cluster + +Running Ceph with SysVinit +========================== + +Each time you start, restart, or stop Ceph daemons, you must specify at least one option and one command. +Likewise, each time you start, restart, or stop your entire cluster, you must specify at least one option and one command. +In both cases, you can also specify a daemon type or a daemon instance. :: + + {commandline} [options] [commands] [daemons] + +The ``ceph`` options include: + ++-----------------+----------+-------------------------------------------------+ +| Option | Shortcut | Description | ++=================+==========+=================================================+ +| ``--verbose`` | ``-v`` | Use verbose logging. | ++-----------------+----------+-------------------------------------------------+ +| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | ++-----------------+----------+-------------------------------------------------+ +| ``--allhosts`` | ``-a`` | Execute on all nodes listed in ``ceph.conf``. | +| | | Otherwise, it only executes on ``localhost``. | ++-----------------+----------+-------------------------------------------------+ +| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--norestart`` | ``N/A`` | Do not restart a daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--conf`` | ``-c`` | Use an alternate configuration file. | ++-----------------+----------+-------------------------------------------------+ + +The ``ceph`` commands include: + ++------------------+------------------------------------------------------------+ +| Command | Description | ++==================+============================================================+ +| ``start`` | Start the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``stop`` | Stop the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9``. | ++------------------+------------------------------------------------------------+ +| ``killall`` | Kill all daemons of a particular type. | ++------------------+------------------------------------------------------------+ +| ``cleanlogs`` | Cleans out the log directory. | ++------------------+------------------------------------------------------------+ +| ``cleanalllogs`` | Cleans out **everything** in the log directory. | ++------------------+------------------------------------------------------------+ + +The ``[daemons]`` option allows the ``ceph`` service to target specific daemon types +in order to perform subsystem operations. Daemon types include: + +- ``mon`` +- ``osd`` +- ``mds`` + +.. _Valgrind: http://www.valgrind.org/ +.. 
_initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html diff --git a/doc/rados/operations/pg-concepts.rst b/doc/rados/operations/pg-concepts.rst new file mode 100644 index 000000000..83062b53a --- /dev/null +++ b/doc/rados/operations/pg-concepts.rst @@ -0,0 +1,104 @@ +.. _rados_operations_pg_concepts: + +========================== + Placement Group Concepts +========================== + +When you execute commands like ``ceph -w``, ``ceph osd dump``, and other +commands related to placement groups, Ceph may return values using some +of the following terms: + +*Peering* + The process of bringing all of the OSDs that store + a Placement Group (PG) into agreement about the state + of all of the objects (and their metadata) in that PG. + Note that agreeing on the state does not mean that + they all have the latest contents. + +*Acting Set* + The ordered list of OSDs who are (or were as of some epoch) + responsible for a particular placement group. + +*Up Set* + The ordered list of OSDs responsible for a particular placement + group for a particular epoch according to CRUSH. Normally this + is the same as the *Acting Set*, except when the *Acting Set* has + been explicitly overridden via ``pg_temp`` in the OSD Map. + +*Current Interval* or *Past Interval* + A sequence of OSD map epochs during which the *Acting Set* and *Up + Set* for particular placement group do not change. + +*Primary* + The member (and by convention first) of the *Acting Set*, + that is responsible for coordination peering, and is + the only OSD that will accept client-initiated + writes to objects in a placement group. + +*Replica* + A non-primary OSD in the *Acting Set* for a placement group + (and who has been recognized as such and *activated* by the primary). + +*Stray* + An OSD that is not a member of the current *Acting Set*, but + has not yet been told that it can delete its copies of a + particular placement group. + +*Recovery* + Ensuring that copies of all of the objects in a placement group + are on all of the OSDs in the *Acting Set*. Once *Peering* has + been performed, the *Primary* can start accepting write operations, + and *Recovery* can proceed in the background. + +*PG Info* + Basic metadata about the placement group's creation epoch, the version + for the most recent write to the placement group, *last epoch started*, + *last epoch clean*, and the beginning of the *current interval*. Any + inter-OSD communication about placement groups includes the *PG Info*, + such that any OSD that knows a placement group exists (or once existed) + also has a lower bound on *last epoch clean* or *last epoch started*. + +*PG Log* + A list of recent updates made to objects in a placement group. + Note that these logs can be truncated after all OSDs + in the *Acting Set* have acknowledged up to a certain + point. + +*Missing Set* + Each OSD notes update log entries and if they imply updates to + the contents of an object, adds that object to a list of needed + updates. This list is called the *Missing Set* for that ``<OSD,PG>``. + +*Authoritative History* + A complete, and fully ordered set of operations that, if + performed, would bring an OSD's copy of a placement group + up to date. + +*Epoch* + A (monotonically increasing) OSD map version number + +*Last Epoch Start* + The last epoch at which all nodes in the *Acting Set* + for a particular placement group agreed on an + *Authoritative History*. At this point, *Peering* is + deemed to have been successful. 
+ +*up_thru* + Before a *Primary* can successfully complete the *Peering* process, + it must inform a monitor that is alive through the current + OSD map *Epoch* by having the monitor set its *up_thru* in the osd + map. This helps *Peering* ignore previous *Acting Sets* for which + *Peering* never completed after certain sequences of failures, such as + the second interval below: + + - *acting set* = [A,B] + - *acting set* = [A] + - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection) + - *acting set* = [B] (B restarts, A does not) + +*Last Epoch Clean* + The last *Epoch* at which all nodes in the *Acting set* + for a particular placement group were completely + up to date (both placement group logs and object contents). + At this point, *recovery* is deemed to have been + completed. diff --git a/doc/rados/operations/pg-repair.rst b/doc/rados/operations/pg-repair.rst new file mode 100644 index 000000000..609318fca --- /dev/null +++ b/doc/rados/operations/pg-repair.rst @@ -0,0 +1,118 @@ +============================ +Repairing PG Inconsistencies +============================ +Sometimes a Placement Group (PG) might become ``inconsistent``. To return the PG +to an ``active+clean`` state, you must first determine which of the PGs has become +inconsistent and then run the ``pg repair`` command on it. This page contains +commands for diagnosing PGs and the command for repairing PGs that have become +inconsistent. + +.. highlight:: console + +Commands for Diagnosing PG Problems +=================================== +The commands in this section provide various ways of diagnosing broken PGs. + +To see a high-level (low-detail) overview of Ceph cluster health, run the +following command: + +.. prompt:: bash # + + ceph health detail + +To see more detail on the status of the PGs, run the following command: + +.. prompt:: bash # + + ceph pg dump --format=json-pretty + +To see a list of inconsistent PGs, run the following command: + +.. prompt:: bash # + + rados list-inconsistent-pg {pool} + +To see a list of inconsistent RADOS objects, run the following command: + +.. prompt:: bash # + + rados list-inconsistent-obj {pgid} + +To see a list of inconsistent snapsets in a specific PG, run the following +command: + +.. prompt:: bash # + + rados list-inconsistent-snapset {pgid} + + +Commands for Repairing PGs +========================== +The form of the command to repair a broken PG is as follows: + +.. prompt:: bash # + + ceph pg repair {pgid} + +Here ``{pgid}`` represents the id of the affected PG. + +For example: + +.. prompt:: bash # + + ceph pg repair 1.4 + +.. note:: PG IDs have the form ``N.xxxxx``, where ``N`` is the number of the + pool that contains the PG. The command ``ceph osd listpools`` and the + command ``ceph osd dump | grep pool`` return a list of pool numbers. + +More Information on PG Repair +============================= +Ceph stores and updates the checksums of objects stored in the cluster. When a +scrub is performed on a PG, the OSD attempts to choose an authoritative copy +from among its replicas. Only one of the possible cases is consistent. After +performing a deep scrub, Ceph calculates the checksum of an object that is read +from disk and compares it to the checksum that was previously recorded. If the +current checksum and the previously recorded checksum do not match, that +mismatch is considered to be an inconsistency. 
In the case of replicated pools, +any mismatch between the checksum of any replica of an object and the checksum +of the authoritative copy means that there is an inconsistency. The discovery +of these inconsistencies cause a PG's state to be set to ``inconsistent``. + +The ``pg repair`` command attempts to fix inconsistencies of various kinds. If +``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of +the inconsistent copy with the digest of the authoritative copy. If ``pg +repair`` finds an inconsistent replicated pool, it marks the inconsistent copy +as missing. In the case of replicated pools, recovery is beyond the scope of +``pg repair``. + +In the case of erasure-coded and BlueStore pools, Ceph will automatically +perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to +``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default +``5``) errors are found. + +The ``pg repair`` command will not solve every problem. Ceph does not +automatically repair PGs when they are found to contain inconsistencies. + +The checksum of a RADOS object or an omap is not always available. Checksums +are calculated incrementally. If a replicated object is updated +non-sequentially, the write operation involved in the update changes the object +and invalidates its checksum. The whole object is not read while the checksum +is recalculated. The ``pg repair`` command is able to make repairs even when +checksums are not available to it, as in the case of Filestore. Users working +with replicated Filestore pools might prefer manual repair to ``ceph pg +repair``. + +This material is relevant for Filestore, but not for BlueStore, which has its +own internal checksums. The matched-record checksum and the calculated checksum +cannot prove that any specific copy is in fact authoritative. If there is no +checksum available, ``pg repair`` favors the data on the primary, but this +might not be the uncorrupted replica. Because of this uncertainty, human +intervention is necessary when an inconsistency is discovered. This +intervention sometimes involves use of ``ceph-objectstore-tool``. + +External Links +============== +https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page +contains a walkthrough of the repair of a PG. It is recommended reading if you +want to repair a PG but have never done so. diff --git a/doc/rados/operations/pg-states.rst b/doc/rados/operations/pg-states.rst new file mode 100644 index 000000000..495229d92 --- /dev/null +++ b/doc/rados/operations/pg-states.rst @@ -0,0 +1,118 @@ +======================== + Placement Group States +======================== + +When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``), +Ceph will report on the status of the placement groups. A placement group has +one or more states. The optimum state for placement groups in the placement group +map is ``active + clean``. + +*creating* + Ceph is still creating the placement group. + +*activating* + The placement group is peered but not yet active. + +*active* + Ceph will process requests to the placement group. + +*clean* + Ceph replicated all objects in the placement group the correct number of times. + +*down* + A replica with necessary data is down, so the placement group is offline. + +*laggy* + A replica is not acknowledging new leases from the primary in a timely fashion; IO is temporarily paused. 
+ +*wait* + The set of OSDs for this PG has just changed and IO is temporarily paused until the previous interval's leases expire. + +*scrubbing* + Ceph is checking the placement group metadata for inconsistencies. + +*deep* + Ceph is checking the placement group data against stored checksums. + +*degraded* + Ceph has not replicated some objects in the placement group the correct number of times yet. + +*inconsistent* + Ceph detects inconsistencies in the one or more replicas of an object in the placement group + (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.). + +*peering* + The placement group is undergoing the peering process + +*repair* + Ceph is checking the placement group and repairing any inconsistencies it finds (if possible). + +*recovering* + Ceph is migrating/synchronizing objects and their replicas. + +*forced_recovery* + High recovery priority of that PG is enforced by user. + +*recovery_wait* + The placement group is waiting in line to start recover. + +*recovery_toofull* + A recovery operation is waiting because the destination OSD is over its + full ratio. + +*recovery_unfound* + Recovery stopped due to unfound objects. + +*backfilling* + Ceph is scanning and synchronizing the entire contents of a placement group + instead of inferring what contents need to be synchronized from the logs of + recent operations. Backfill is a special case of recovery. + +*forced_backfill* + High backfill priority of that PG is enforced by user. + +*backfill_wait* + The placement group is waiting in line to start backfill. + +*backfill_toofull* + A backfill operation is waiting because the destination OSD is over + the backfillfull ratio. + +*backfill_unfound* + Backfill stopped due to unfound objects. + +*incomplete* + Ceph detects that a placement group is missing information about + writes that may have occurred, or does not have any healthy + copies. If you see this state, try to start any failed OSDs that may + contain the needed information. In the case of an erasure coded pool + temporarily reducing min_size may allow recovery. + +*stale* + The placement group is in an unknown state - the monitors have not received + an update for it since the placement group mapping changed. + +*remapped* + The placement group is temporarily mapped to a different set of OSDs from what + CRUSH specified. + +*undersized* + The placement group has fewer copies than the configured pool replication level. + +*peered* + The placement group has peered, but cannot serve client IO due to not having + enough copies to reach the pool's configured min_size parameter. Recovery + may occur in this state, so the pg may heal up to min_size eventually. + +*snaptrim* + Trimming snaps. + +*snaptrim_wait* + Queued to trim snaps. + +*snaptrim_error* + Error stopped trimming snaps. + +*unknown* + The ceph-mgr hasn't yet received any information about the PG's state from an + OSD since mgr started up. diff --git a/doc/rados/operations/placement-groups.rst b/doc/rados/operations/placement-groups.rst new file mode 100644 index 000000000..dda4a0177 --- /dev/null +++ b/doc/rados/operations/placement-groups.rst @@ -0,0 +1,897 @@ +.. _placement groups: + +================== + Placement Groups +================== + +.. _pg-autoscaler: + +Autoscaling placement groups +============================ + +Placement groups (PGs) are an internal implementation detail of how Ceph +distributes data. 
Autoscaling provides a way to manage PGs, and especially to +manage the number of PGs present in different pools. When *pg-autoscaling* is +enabled, the cluster is allowed to make recommendations or automatic +adjustments with respect to the number of PGs for each pool (``pgp_num``) in +accordance with expected cluster utilization and expected pool utilization. + +Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``, +``on``, or ``warn``: + +* ``off``: Disable autoscaling for this pool. It is up to the administrator to + choose an appropriate ``pgp_num`` for each pool. For more information, see + :ref:`choosing-number-of-placement-groups`. +* ``on``: Enable automated adjustments of the PG count for the given pool. +* ``warn``: Raise health checks when the PG count is in need of adjustment. + +To set the autoscaling mode for an existing pool, run a command of the +following form: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_autoscale_mode <mode> + +For example, to enable autoscaling on pool ``foo``, run the following command: + +.. prompt:: bash # + + ceph osd pool set foo pg_autoscale_mode on + +There is also a ``pg_autoscale_mode`` setting for any pools that are created +after the initial setup of the cluster. To change this setting, run a command +of the following form: + +.. prompt:: bash # + + ceph config set global osd_pool_default_pg_autoscale_mode <mode> + +You can disable or enable the autoscaler for all pools with the ``noautoscale`` +flag. By default, this flag is set to ``off``, but you can set it to ``on`` by +running the following command: + +.. prompt:: bash # + + ceph osd pool set noautoscale + +To set the ``noautoscale`` flag to ``off``, run the following command: + +.. prompt:: bash # + + ceph osd pool unset noautoscale + +To get the value of the flag, run the following command: + +.. prompt:: bash # + + ceph osd pool get noautoscale + +Viewing PG scaling recommendations +---------------------------------- + +To view each pool, its relative utilization, and any recommended changes to the +PG count, run the following command: + +.. prompt:: bash # + + ceph osd pool autoscale-status + +The output will resemble the following:: + + POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK + a 12900M 3.0 82431M 0.4695 8 128 warn True + c 0 3.0 82431M 0.0000 0.2000 0.9884 1.0 1 64 warn True + b 0 953.6M 3.0 82431M 0.0347 8 warn False + +- **POOL** is the name of the pool. + +- **SIZE** is the amount of data stored in the pool. + +- **TARGET SIZE** (if present) is the amount of data that is expected to be + stored in the pool, as specified by the administrator. The system uses the + greater of the two values for its calculation. + +- **RATE** is the multiplier for the pool that determines how much raw storage + capacity is consumed. For example, a three-replica pool will have a ratio of + 3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5. + +- **RAW CAPACITY** is the total amount of raw storage capacity on the specific + OSDs that are responsible for storing the data of the pool (and perhaps the + data of other pools). + +- **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the + total raw storage capacity. In order words, RATIO is defined as + (SIZE * RATE) / RAW CAPACITY. 
+ +- **TARGET RATIO** (if present) is the ratio of the expected storage of this + pool (that is, the amount of storage that this pool is expected to consume, + as specified by the administrator) to the expected storage of all other pools + that have target ratios set. If both ``target_size_bytes`` and + ``target_size_ratio`` are specified, then ``target_size_ratio`` takes + precedence. + +- **EFFECTIVE RATIO** is the result of making two adjustments to the target + ratio: + + #. Subtracting any capacity expected to be used by pools that have target + size set. + + #. Normalizing the target ratios among pools that have target ratio set so + that collectively they target cluster capacity. For example, four pools + with target_ratio 1.0 would have an effective ratio of 0.25. + + The system's calculations use whichever of these two ratios (that is, the + target ratio and the effective ratio) is greater. + +- **BIAS** is used as a multiplier to manually adjust a pool's PG in accordance + with prior information about how many PGs a specific pool is expected to + have. + +- **PG_NUM** is either the current number of PGs associated with the pool or, + if a ``pg_num`` change is in progress, the current number of PGs that the + pool is working towards. + +- **NEW PG_NUM** (if present) is the value that the system is recommending the + ``pg_num`` of the pool to be changed to. It is always a power of 2, and it is + present only if the recommended value varies from the current value by more + than the default factor of ``3``. To adjust this factor (in the following + example, it is changed to ``2``), run the following command: + + .. prompt:: bash # + + ceph osd pool set threshold 2.0 + +- **AUTOSCALE** is the pool's ``pg_autoscale_mode`` and is set to ``on``, + ``off``, or ``warn``. + +- **BULK** determines whether the pool is ``bulk``. It has a value of ``True`` + or ``False``. A ``bulk`` pool is expected to be large and should initially + have a large number of PGs so that performance does not suffer]. On the other + hand, a pool that is not ``bulk`` is expected to be small (for example, a + ``.mgr`` pool or a meta pool). + +.. note:: + + If the ``ceph osd pool autoscale-status`` command returns no output at all, + there is probably at least one pool that spans multiple CRUSH roots. This + 'spanning pool' issue can happen in scenarios like the following: + when a new deployment auto-creates the ``.mgr`` pool on the ``default`` + CRUSH root, subsequent pools are created with rules that constrain them to a + specific shadow CRUSH tree. For example, if you create an RBD metadata pool + that is constrained to ``deviceclass = ssd`` and an RBD data pool that is + constrained to ``deviceclass = hdd``, you will encounter this issue. To + remedy this issue, constrain the spanning pool to only one device class. In + the above scenario, there is likely to be a ``replicated-ssd`` CRUSH rule in + effect, and the ``.mgr`` pool can be constrained to ``ssd`` devices by + running the following commands: + + .. prompt:: bash # + + ceph osd pool set .mgr crush_rule replicated-ssd + ceph osd pool set pool 1 crush_rule to replicated-ssd + + This intervention will result in a small amount of backfill, but + typically this traffic completes quickly. + + +Automated scaling +----------------- + +In the simplest approach to automated scaling, the cluster is allowed to +automatically scale ``pgp_num`` in accordance with usage. 
Ceph considers the +total available storage and the target number of PGs for the whole system, +considers how much data is stored in each pool, and apportions PGs accordingly. +The system is conservative with its approach, making changes to a pool only +when the current number of PGs (``pg_num``) varies by more than a factor of 3 +from the recommended number. + +The target number of PGs per OSD is determined by the ``mon_target_pg_per_osd`` +parameter (default: 100), which can be adjusted by running the following +command: + +.. prompt:: bash # + + ceph config set global mon_target_pg_per_osd 100 + +The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each +pool might map to a different CRUSH rule, and each rule might distribute data +across different devices, Ceph will consider the utilization of each subtree of +the hierarchy independently. For example, a pool that maps to OSDs of class +``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG +counts that are determined by how many of these two different device types +there are. + +If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees +with both ``ssd`` and ``hdd`` devices), the autoscaler issues a warning to the +user in the manager log. The warning states the name of the pool and the set of +roots that overlap each other. The autoscaler does not scale any pools with +overlapping roots because this condition can cause problems with the scaling +process. We recommend constraining each pool so that it belongs to only one +root (that is, one OSD class) to silence the warning and ensure a successful +scaling process. + +.. _managing_bulk_flagged_pools: + +Managing pools that are flagged with ``bulk`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full +complement of PGs and then scales down the number of PGs only if the usage +ratio across the pool is uneven. However, if a pool is not flagged ``bulk``, +then the autoscaler starts the pool with minimal PGs and creates additional PGs +only if there is more usage in the pool. + +To create a pool that will be flagged ``bulk``, run the following command: + +.. prompt:: bash # + + ceph osd pool create <pool-name> --bulk + +To set or unset the ``bulk`` flag of an existing pool, run the following +command: + +.. prompt:: bash # + + ceph osd pool set <pool-name> bulk <true/false/1/0> + +To get the ``bulk`` flag of an existing pool, run the following command: + +.. prompt:: bash # + + ceph osd pool get <pool-name> bulk + +.. _specifying_pool_target_size: + +Specifying expected pool size +----------------------------- + +When a cluster or pool is first created, it consumes only a small fraction of +the total cluster capacity and appears to the system as if it should need only +a small number of PGs. However, in some cases, cluster administrators know +which pools are likely to consume most of the system capacity in the long run. +When Ceph is provided with this information, a more appropriate number of PGs +can be used from the beginning, obviating subsequent changes in ``pg_num`` and +the associated overhead cost of relocating data. + +The *target size* of a pool can be specified in two ways: either in relation to +the absolute size (in bytes) of the pool, or as a weight relative to all other +pools that have ``target_size_ratio`` set. + +For example, to tell the system that ``mypool`` is expected to consume 100 TB, +run the following command: + +.. 
prompt:: bash # + + ceph osd pool set mypool target_size_bytes 100T + +Alternatively, to tell the system that ``mypool`` is expected to consume a +ratio of 1.0 relative to other pools that have ``target_size_ratio`` set, +adjust the ``target_size_ratio`` setting of ``my pool`` by running the +following command: + +.. prompt:: bash # + + ceph osd pool set mypool target_size_ratio 1.0 + +If `mypool` is the only pool in the cluster, then it is expected to use 100% of +the total cluster capacity. However, if the cluster contains a second pool that +has ``target_size_ratio`` set to 1.0, then both pools are expected to use 50% +of the total cluster capacity. + +The ``ceph osd pool create`` command has two command-line options that can be +used to set the target size of a pool at creation time: ``--target-size-bytes +<bytes>`` and ``--target-size-ratio <ratio>``. + +Note that if the target-size values that have been specified are impossible +(for example, a capacity larger than the total cluster), then a health check +(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised. + +If both ``target_size_ratio`` and ``target_size_bytes`` are specified for a +pool, then the latter will be ignored, the former will be used in system +calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) +will be raised. + +Specifying bounds on a pool's PGs +--------------------------------- + +It is possible to specify both the minimum number and the maximum number of PGs +for a pool. + +Setting a Minimum Number of PGs and a Maximum Number of PGs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a minimum is set, then Ceph will not itself reduce (nor recommend that you +reduce) the number of PGs to a value below the configured value. Setting a +minimum serves to establish a lower bound on the amount of parallelism enjoyed +by a client during I/O, even if a pool is mostly empty. + +If a maximum is set, then Ceph will not itself increase (or recommend that you +increase) the number of PGs to a value above the configured value. + +To set the minimum number of PGs for a pool, run a command of the following +form: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_num_min <num> + +To set the maximum number of PGs for a pool, run a command of the following +form: + +.. prompt:: bash # + + ceph osd pool set <pool-name> pg_num_max <num> + +In addition, the ``ceph osd pool create`` command has two command-line options +that can be used to specify the minimum or maximum PG count of a pool at +creation time: ``--pg-num-min <num>`` and ``--pg-num-max <num>``. + +.. _preselection: + +Preselecting pg_num +=================== + +When creating a pool with the following command, you have the option to +preselect the value of the ``pg_num`` parameter: + +.. prompt:: bash # + + ceph osd pool create {pool-name} [pg_num] + +If you opt not to specify ``pg_num`` in this command, the cluster uses the PG +autoscaler to automatically configure the parameter in accordance with the +amount of data that is stored in the pool (see :ref:`pg-autoscaler` above). + +However, your decision of whether or not to specify ``pg_num`` at creation time +has no effect on whether the parameter will be automatically tuned by the +cluster afterwards. As seen above, autoscaling of PGs is enabled or disabled by +running a command of the following form: + +.. 
prompt:: bash # + + ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn) + +Without the balancer, the suggested target is approximately 100 PG replicas on +each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is +reasonable. + +The autoscaler attempts to satisfy the following conditions: + +- the number of PGs per OSD should be proportional to the amount of data in the + pool +- there should be 50-100 PGs per pool, taking into account the replication + overhead or erasure-coding fan-out of each PG's replicas across OSDs + +Use of Placement Groups +======================= + +A placement group aggregates objects within a pool. The tracking of RADOS +object placement and object metadata on a per-object basis is computationally +expensive. It would be infeasible for a system with millions of RADOS +objects to efficiently track placement on a per-object basis. + +.. ditaa:: + /-----\ /-----\ /-----\ /-----\ /-----\ + | obj | | obj | | obj | | obj | | obj | + \-----/ \-----/ \-----/ \-----/ \-----/ + | | | | | + +--------+--------+ +---+----+ + | | + v v + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | + +------------------------------+ + | + v + +-----------------------+ + | Pool | + | | + +-----------------------+ + +The Ceph client calculates which PG a RADOS object should be in. As part of +this calculation, the client hashes the object ID and performs an operation +involving both the number of PGs in the specified pool and the pool ID. For +details, see `Mapping PGs to OSDs`_. + +The contents of a RADOS object belonging to a PG are stored in a set of OSDs. +For example, in a replicated pool of size two, each PG will store objects on +two OSDs, as shown below: + +.. ditaa:: + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | | | + v v v v + /----------\ /----------\ /----------\ /----------\ + | | | | | | | | + | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 | + | | | | | | | | + \----------/ \----------/ \----------/ \----------/ + + +If OSD #2 fails, another OSD will be assigned to Placement Group #1 and then +filled with copies of all objects in OSD #1. If the pool size is changed from +two to three, an additional OSD will be assigned to the PG and will receive +copies of all objects in the PG. + +An OSD assigned to a PG is not owned exclusively by that PG; rather, the OSD is +shared with other PGs either from the same pool or from other pools. In our +example, OSD #2 is shared by Placement Group #1 and Placement Group #2. If OSD +#2 fails, then Placement Group #2 must restore copies of objects (by making use +of OSD #3). + +When the number of PGs increases, several consequences ensue. The new PGs are +assigned OSDs. The result of the CRUSH function changes, which means that some +objects from the already-existing PGs are copied to the new PGs and removed +from the old ones. + +Factors Relevant To Specifying pg_num +===================================== + +On the one hand, the criteria of data durability and even distribution across +OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of +saving CPU resources and minimizing memory usage weigh in favor of a low number +of PGs. + +.. 
_data durability: + +Data durability +--------------- + +When an OSD fails, the risk of data loss is increased until replication of the +data it hosted is restored to the configured level. To illustrate this point, +let's imagine a scenario that results in permanent data loss in a single PG: + +#. The OSD fails and all copies of the object that it contains are lost. For + each object within the PG, the number of its replicas suddenly drops from + three to two. + +#. Ceph starts recovery for this PG by choosing a new OSD on which to re-create + the third copy of each object. + +#. Another OSD within the same PG fails before the new OSD is fully populated + with the third copy. Some objects will then only have one surviving copy. + +#. Ceph selects yet another OSD and continues copying objects in order to + restore the desired number of copies. + +#. A third OSD within the same PG fails before recovery is complete. If this + OSD happened to contain the only remaining copy of an object, the object is + permanently lost. + +In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH +will give each PG three OSDs. Ultimately, each OSD hosts :math:`\frac{(512 * +3)}{10} = ~150` PGs. So when the first OSD fails in the above scenario, +recovery will begin for all 150 PGs at the same time. + +The 150 PGs that are being recovered are likely to be homogeneously distributed +across the 9 remaining OSDs. Each remaining OSD is therefore likely to send +copies of objects to all other OSDs and also likely to receive some new objects +to be stored because it has become part of a new PG. + +The amount of time it takes for this recovery to complete depends on the +architecture of the Ceph cluster. Compare two setups: (1) Each OSD is hosted by +a 1 TB SSD on a single machine, all of the OSDs are connected to a 10 Gb/s +switch, and the recovery of a single OSD completes within a certain number of +minutes. (2) There are two OSDs per machine using HDDs with no SSD WAL+DB and +a 1 Gb/s switch. In the second setup, recovery will be at least one order of +magnitude slower. + +In such a cluster, the number of PGs has almost no effect on data durability. +Whether there are 128 PGs per OSD or 8192 PGs per OSD, the recovery will be no +slower or faster. + +However, an increase in the number of OSDs can increase the speed of recovery. +Suppose our Ceph cluster is expanded from 10 OSDs to 20 OSDs. Each OSD now +participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will +still be required to replicate the same number of objects in order to recover. +But instead of there being only 10 OSDs that have to copy ~100 GB each, there +are now 20 OSDs that have to copy only 50 GB each. If the network had +previously been a bottleneck, recovery now happens twice as fast. + +Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only +~38 PGs. And if an OSD dies, recovery will take place faster than before unless +it is blocked by another bottleneck. Now, however, suppose that our cluster +grows to 200 OSDs. Each OSD will host only ~7 PGs. And if an OSD dies, recovery +will happen across at most :math:`\approx 21 = (7 \times 3)` OSDs +associated with these PGs. This means that recovery will take longer than when +there were only 40 OSDs. For this reason, the number of PGs should be +increased. + +No matter how brief the recovery time is, there is always a chance that an +additional OSD will fail while recovery is in progress. 
Consider the cluster +with 10 OSDs described above: if any of the OSDs fail, then :math:`\approx 17` +(approximately 150 divided by 9) PGs will have only one remaining copy. And if +any of the 8 remaining OSDs fail, then 2 (approximately 17 divided by 8) PGs +are likely to lose their remaining objects. This is one reason why setting +``size=2`` is risky. + +When the number of OSDs in the cluster increases to 20, the number of PGs that +would be damaged by the loss of three OSDs significantly decreases. The loss of +a second OSD degrades only approximately :math:`4` or (:math:`\frac{75}{19}`) +PGs rather than :math:`\approx 17` PGs, and the loss of a third OSD results in +data loss only if it is one of the 4 OSDs that contains the remaining copy. +This means -- assuming that the probability of losing one OSD during recovery +is 0.0001% -- that the probability of data loss when three OSDs are lost is +:math:`\approx 17 \times 10 \times 0.0001%` in the cluster with 10 OSDs, and +only :math:`\approx 4 \times 20 \times 0.0001%` in the cluster with 20 OSDs. + +In summary, the greater the number of OSDs, the faster the recovery and the +lower the risk of permanently losing a PG due to cascading failures. As far as +data durability is concerned, in a cluster with fewer than 50 OSDs, it doesn't +much matter whether there are 512 or 4096 PGs. + +.. note:: It can take a long time for an OSD that has been recently added to + the cluster to be populated with the PGs assigned to it. However, no object + degradation or impact on data durability will result from the slowness of + this process since Ceph populates data into the new PGs before removing it + from the old PGs. + +.. _object distribution: + +Object distribution within a pool +--------------------------------- + +Under ideal conditions, objects are evenly distributed across PGs. Because +CRUSH computes the PG for each object but does not know how much data is stored +in each OSD associated with the PG, the ratio between the number of PGs and the +number of OSDs can have a significant influence on data distribution. + +For example, suppose that there is only a single PG for ten OSDs in a +three-replica pool. In that case, only three OSDs would be used because CRUSH +would have no other option. However, if more PGs are available, RADOS objects are +more likely to be evenly distributed across OSDs. CRUSH makes every effort to +distribute OSDs evenly across all existing PGs. + +As long as there are one or two orders of magnitude more PGs than OSDs, the +distribution is likely to be even. For example: 256 PGs for 3 OSDs, 512 PGs for +10 OSDs, or 1024 PGs for 10 OSDs. + +However, uneven data distribution can emerge due to factors other than the +ratio of PGs to OSDs. For example, since CRUSH does not take into account the +size of the RADOS objects, the presence of a few very large RADOS objects can +create an imbalance. Suppose that one million 4 KB RADOS objects totaling 4 GB +are evenly distributed among 1024 PGs on 10 OSDs. These RADOS objects will +consume 4 GB / 10 = 400 MB on each OSD. If a single 400 MB RADOS object is then +added to the pool, the three OSDs supporting the PG in which the RADOS object +has been placed will each be filled with 400 MB + 400 MB = 800 MB but the seven +other OSDs will still contain only 400 MB. + +.. _resource usage: + +Memory, CPU and network usage +----------------------------- + +Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and +MONs. 
These needs must be met at all times and are increased during recovery. +Indeed, one of the main reasons PGs were developed was to share this overhead +by clustering objects together. + +For this reason, minimizing the number of PGs saves significant resources. + +.. _choosing-number-of-placement-groups: + +Choosing the Number of PGs +========================== + +.. note: It is rarely necessary to do the math in this section by hand. + Instead, use the ``ceph osd pool autoscale-status`` command in combination + with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For + more information, see :ref:`pg-autoscaler`. + +If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in +order to balance resource usage, data durability, and data distribution. If you +have fewer than 50 OSDs, follow the guidance in the `preselection`_ section. +For a single pool, use the following formula to get a baseline value: + + Total PGs = :math:`\frac{OSDs \times 100}{pool \: size}` + +Here **pool size** is either the number of replicas for replicated pools or the +K+M sum for erasure-coded pools. To retrieve this sum, run the command ``ceph +osd erasure-code-profile get``. + +Next, check whether the resulting baseline value is consistent with the way you +designed your Ceph cluster to maximize `data durability`_ and `object +distribution`_ and to minimize `resource usage`_. + +This value should be **rounded up to the nearest power of two**. + +Each pool's ``pg_num`` should be a power of two. Other values are likely to +result in uneven distribution of data across OSDs. It is best to increase +``pg_num`` for a pool only when it is feasible and desirable to set the next +highest power of two. Note that this power of two rule is per-pool; it is +neither necessary nor easy to align the sum of all pools' ``pg_num`` to a power +of two. + +For example, if you have a cluster with 200 OSDs and a single pool with a size +of 3 replicas, estimate the number of PGs as follows: + + :math:`\frac{200 \times 100}{3} = 6667`. Rounded up to the nearest power of 2: 8192. + +When using multiple data pools to store objects, make sure that you balance the +number of PGs per pool against the number of PGs per OSD so that you arrive at +a reasonable total number of PGs. It is important to find a number that +provides reasonably low variance per OSD without taxing system resources or +making the peering process too slow. + +For example, suppose you have a cluster of 10 pools, each with 512 PGs on 10 +OSDs. That amounts to 5,120 PGs distributed across 10 OSDs, or 512 PGs per OSD. +This cluster will not use too many resources. However, in a cluster of 1,000 +pools, each with 512 PGs on 10 OSDs, the OSDs will have to handle ~50,000 PGs +each. This cluster will require significantly more resources and significantly +more time for peering. + +For determining the optimal number of PGs per OSD, we recommend the `PGCalc`_ +tool. + + +.. _setting the number of placement groups: + +Setting the Number of PGs +========================= + +Setting the initial number of PGs in a pool must be done at the time you create +the pool. See `Create a Pool`_ for details. + +However, even after a pool is created, if the ``pg_autoscaler`` is not being +used to manage ``pg_num`` values, you can change the number of PGs by running a +command of the following form: + +.. 
prompt:: bash # + + ceph osd pool set {pool-name} pg_num {pg_num} + +If you increase the number of PGs, your cluster will not rebalance until you +increase the number of PGs for placement (``pgp_num``). The ``pgp_num`` +parameter specifies the number of PGs that are to be considered for placement +by the CRUSH algorithm. Increasing ``pg_num`` splits the PGs in your cluster, +but data will not be migrated to the newer PGs until ``pgp_num`` is increased. +The ``pgp_num`` parameter should be equal to the ``pg_num`` parameter. To +increase the number of PGs for placement, run a command of the following form: + +.. prompt:: bash # + + ceph osd pool set {pool-name} pgp_num {pgp_num} + +If you decrease the number of PGs, then ``pgp_num`` is adjusted automatically. +In releases of Ceph that are Nautilus and later (inclusive), when the +``pg_autoscaler`` is not used, ``pgp_num`` is automatically stepped to match +``pg_num``. This process manifests as periods of remapping of PGs and of +backfill, and is expected behavior and normal. + +.. _rados_ops_pgs_get_pg_num: + +Get the Number of PGs +===================== + +To get the number of PGs in a pool, run a command of the following form: + +.. prompt:: bash # + + ceph osd pool get {pool-name} pg_num + + +Get a Cluster's PG Statistics +============================= + +To see the details of the PGs in your cluster, run a command of the following +form: + +.. prompt:: bash # + + ceph pg dump [--format {format}] + +Valid formats are ``plain`` (default) and ``json``. + + +Get Statistics for Stuck PGs +============================ + +To see the statistics for all PGs that are stuck in a specified state, run a +command of the following form: + +.. prompt:: bash # + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] + +- **Inactive** PGs cannot process reads or writes because they are waiting for + enough OSDs with the most up-to-date data to come ``up`` and ``in``. + +- **Undersized** PGs contain objects that have not been replicated the desired + number of times. Under normal conditions, it can be assumed that these PGs + are recovering. + +- **Stale** PGs are in an unknown state -- the OSDs that host them have not + reported to the monitor cluster for a certain period of time (determined by + ``mon_osd_report_timeout``). + +Valid formats are ``plain`` (default) and ``json``. The threshold defines the +minimum number of seconds the PG is stuck before it is included in the returned +statistics (default: 300). + + +Get a PG Map +============ + +To get the PG map for a particular PG, run a command of the following form: + +.. prompt:: bash # + + ceph pg map {pg-id} + +For example: + +.. prompt:: bash # + + ceph pg map 1.6c + +Ceph will return the PG map, the PG, and the OSD status. The output resembles +the following: + +.. prompt:: bash # + + osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0] + + +Get a PG's Statistics +===================== + +To see statistics for a particular PG, run a command of the following form: + +.. prompt:: bash # + + ceph pg {pg-id} query + + +Scrub a PG +========== + +To scrub a PG, run a command of the following form: + +.. prompt:: bash # + + ceph pg scrub {pg-id} + +Ceph checks the primary and replica OSDs, generates a catalog of all objects in +the PG, and compares the objects against each other in order to ensure that no +objects are missing or mismatched and that their contents are consistent. 
If +the replicas all match, then a final semantic sweep takes place to ensure that +all snapshot-related object metadata is consistent. Errors are reported in +logs. + +To scrub all PGs from a specific pool, run a command of the following form: + +.. prompt:: bash # + + ceph osd pool scrub {pool-name} + + +Prioritize backfill/recovery of PG(s) +===================================== + +You might encounter a situation in which multiple PGs require recovery or +backfill, but the data in some PGs is more important than the data in others +(for example, some PGs hold data for images that are used by running machines +and other PGs are used by inactive machines and hold data that is less +relevant). In that case, you might want to prioritize recovery or backfill of +the PGs with especially important data so that the performance of the cluster +and the availability of their data are restored sooner. To designate specific +PG(s) as prioritized during recovery, run a command of the following form: + +.. prompt:: bash # + + ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +To mark specific PG(s) as prioritized during backfill, run a command of the +following form: + +.. prompt:: bash # + + ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +These commands instruct Ceph to perform recovery or backfill on the specified +PGs before processing the other PGs. Prioritization does not interrupt current +backfills or recovery, but places the specified PGs at the top of the queue so +that they will be acted upon next. If you change your mind or realize that you +have prioritized the wrong PGs, run one or both of the following commands: + +.. prompt:: bash # + + ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +These commands remove the ``force`` flag from the specified PGs, so that the +PGs will be processed in their usual order. As in the case of adding the +``force`` flag, this affects only those PGs that are still queued but does not +affect PGs currently undergoing recovery. + +The ``force`` flag is cleared automatically after recovery or backfill of the +PGs is complete. + +Similarly, to instruct Ceph to prioritize all PGs from a specified pool (that +is, to perform recovery or backfill on those PGs first), run one or both of the +following commands: + +.. prompt:: bash # + + ceph osd pool force-recovery {pool-name} + ceph osd pool force-backfill {pool-name} + +These commands can also be cancelled. To revert to the default order, run one +or both of the following commands: + +.. prompt:: bash # + + ceph osd pool cancel-force-recovery {pool-name} + ceph osd pool cancel-force-backfill {pool-name} + +.. warning:: These commands can break the order of Ceph's internal priority + computations, so use them with caution! If you have multiple pools that are + currently sharing the same underlying OSDs, and if the data held by certain + pools is more important than the data held by other pools, then we recommend + that you run a command of the following form to arrange a custom + recovery/backfill priority for all pools: + +.. prompt:: bash # + + ceph osd pool set {pool-name} recovery_priority {value} + +For example, if you have twenty pools, you could make the most important pool +priority ``20``, and the next most important pool priority ``19``, and so on. + +Another option is to set the recovery/backfill priority for only a proper +subset of pools. 
In such a scenario, three important pools might (all) be +assigned priority ``1`` and all other pools would be left without an assigned +recovery/backfill priority. Another possibility is to select three important +pools and set their recovery/backfill priorities to ``3``, ``2``, and ``1`` +respectively. + +.. important:: Numbers of greater value have higher priority than numbers of + lesser value when using ``ceph osd pool set {pool-name} recovery_priority + {value}`` to set their recovery/backfill priority. For example, a pool with + the recovery/backfill priority ``30`` has a higher priority than a pool with + the recovery/backfill priority ``15``. + +Reverting Lost RADOS Objects +============================ + +If the cluster has lost one or more RADOS objects and you have decided to +abandon the search for the lost data, you must mark the unfound objects +``lost``. + +If every possible location has been queried and all OSDs are ``up`` and ``in``, +but certain RADOS objects are still lost, you might have to give up on those +objects. This situation can arise when rare and unusual combinations of +failures allow the cluster to learn about writes that were performed before the +writes themselves were recovered. + +The command to mark a RADOS object ``lost`` has only one supported option: +``revert``. The ``revert`` option will either roll back to a previous version +of the RADOS object (if it is old enough to have a previous version) or forget +about it entirely (if it is too new to have a previous version). To mark the +"unfound" objects ``lost``, run a command of the following form: + + +.. prompt:: bash # + + ceph pg {pg-id} mark_unfound_lost revert|delete + +.. important:: Use this feature with caution. It might confuse applications + that expect the object(s) to exist. + + +.. toctree:: + :hidden: + + pg-states + pg-concepts + + +.. _Create a Pool: ../pools#createpool +.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds +.. _pgcalc: https://old.ceph.com/pgcalc/ diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst new file mode 100644 index 000000000..dda9e844e --- /dev/null +++ b/doc/rados/operations/pools.rst @@ -0,0 +1,751 @@ +.. _rados_pools: + +======= + Pools +======= +Pools are logical partitions that are used to store objects. + +Pools provide: + +- **Resilience**: It is possible to set the number of OSDs that are allowed to + fail without any data being lost. If your cluster uses replicated pools, the + number of OSDs that can fail without data loss is equal to the number of + replicas. + + For example: a typical configuration stores an object and two replicas + (copies) of each RADOS object (that is: ``size = 3``), but you can configure + the number of replicas on a per-pool basis. For `erasure-coded pools + <../erasure-code>`_, resilience is defined as the number of coding chunks + (for example, ``m = 2`` in the default **erasure code profile**). + +- **Placement Groups**: You can set the number of placement groups (PGs) for + the pool. In a typical configuration, the target number of PGs is + approximately one hundred PGs per OSD. This provides reasonable balancing + without consuming excessive computing resources. When setting up multiple + pools, be careful to set an appropriate number of PGs for each pool and for + the cluster as a whole. Each PG belongs to a specific pool: when multiple + pools use the same OSDs, make sure that the **sum** of PG replicas per OSD is + in the desired PG-per-OSD target range. 
To calculate an appropriate number of + PGs for your pools, use the `pgcalc`_ tool. + +- **CRUSH Rules**: When data is stored in a pool, the placement of the object + and its replicas (or chunks, in the case of erasure-coded pools) in your + cluster is governed by CRUSH rules. Custom CRUSH rules can be created for a + pool if the default rule does not fit your use case. + +- **Snapshots**: The command ``ceph osd pool mksnap`` creates a snapshot of a + pool. + +Pool Names +========== + +Pool names beginning with ``.`` are reserved for use by Ceph's internal +operations. Do not create or manipulate pools with these names. + + +List Pools +========== + +There are multiple ways to get the list of pools in your cluster. + +To list just your cluster's pool names (good for scripting), execute: + +.. prompt:: bash $ + + ceph osd pool ls + +:: + + .rgw.root + default.rgw.log + default.rgw.control + default.rgw.meta + +To list your cluster's pools with the pool number, run the following command: + +.. prompt:: bash $ + + ceph osd lspools + +:: + + 1 .rgw.root + 2 default.rgw.log + 3 default.rgw.control + 4 default.rgw.meta + +To list your cluster's pools with additional information, execute: + +.. prompt:: bash $ + + ceph osd pool ls detail + +:: + + pool 1 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00 + pool 2 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00 + pool 3 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 23 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00 + pool 4 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 25 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 4.00 + +To get even more information, you can execute this command with the ``--format`` (or ``-f``) option and the ``json``, ``json-pretty``, ``xml`` or ``xml-pretty`` value. + +.. _createpool: + +Creating a Pool +=============== + +Before creating a pool, consult `Pool, PG and CRUSH Config Reference`_. Your +Ceph configuration file contains a setting (namely, ``pg_num``) that determines +the number of PGs. However, this setting's default value is NOT appropriate +for most systems. In most cases, you should override this default value when +creating your pool. For details on PG numbers, see `setting the number of +placement groups`_ + +For example: + +.. prompt:: bash $ + + osd_pool_default_pg_num = 128 + osd_pool_default_pgp_num = 128 + +.. note:: In Luminous and later releases, each pool must be associated with the + application that will be using the pool. For more information, see + `Associating a Pool with an Application`_ below. + +To create a pool, run one of the following commands: + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] [replicated] \ + [crush-rule-name] [expected-num-objects] + +or: + +.. 
prompt:: bash $ + + ceph osd pool create {pool-name} [{pg-num} [{pgp-num}]] erasure \ + [erasure-code-profile] [crush-rule-name] [expected_num_objects] [--autoscale-mode=<on,off,warn>] + +For a brief description of the elements of the above commands, consult the +following: + +.. describe:: {pool-name} + + The name of the pool. It must be unique. + + :Type: String + :Required: Yes. + +.. describe:: {pg-num} + + The total number of PGs in the pool. For details on calculating an + appropriate number, see :ref:`placement groups`. The default value ``8`` is + NOT suitable for most systems. + + :Type: Integer + :Required: Yes. + :Default: 8 + +.. describe:: {pgp-num} + + The total number of PGs for placement purposes. This **should be equal to + the total number of PGs**, except briefly while ``pg_num`` is being + increased or decreased. + + :Type: Integer + :Required: Yes. If no value has been specified in the command, then the default value is used (unless a different value has been set in Ceph configuration). + :Default: 8 + +.. describe:: {replicated|erasure} + + The pool type. This can be either **replicated** (to recover from lost OSDs + by keeping multiple copies of the objects) or **erasure** (to achieve a kind + of `generalized parity RAID <../erasure-code>`_ capability). The + **replicated** pools require more raw storage but can implement all Ceph + operations. The **erasure** pools require less raw storage but can perform + only some Ceph tasks and may provide decreased performance. + + :Type: String + :Required: No. + :Default: replicated + +.. describe:: [crush-rule-name] + + The name of the CRUSH rule to use for this pool. The specified rule must + exist; otherwise the command will fail. + + :Type: String + :Required: No. + :Default: For **replicated** pools, it is the rule specified by the :confval:`osd_pool_default_crush_rule` configuration variable. This rule must exist. For **erasure** pools, it is the ``erasure-code`` rule if the ``default`` `erasure code profile`_ is used or the ``{pool-name}`` rule if not. This rule will be created implicitly if it doesn't already exist. + +.. describe:: [erasure-code-profile=profile] + + For **erasure** pools only. Instructs Ceph to use the specified `erasure + code profile`_. This profile must be an existing profile as defined by **osd + erasure-code-profile set**. + + :Type: String + :Required: No. + +.. _erasure code profile: ../erasure-code-profile + +.. describe:: --autoscale-mode=<on,off,warn> + + - ``on``: the Ceph cluster will autotune or recommend changes to the number of PGs in your pool based on actual usage. + - ``warn``: the Ceph cluster will autotune or recommend changes to the number of PGs in your pool based on actual usage. + - ``off``: refer to :ref:`placement groups` for more information. + + :Type: String + :Required: No. + :Default: The default behavior is determined by the :confval:`osd_pool_default_pg_autoscale_mode` option. + +.. describe:: [expected-num-objects] + + The expected number of RADOS objects for this pool. By setting this value and + assigning a negative value to **filestore merge threshold**, you arrange + for the PG folder splitting to occur at the time of pool creation and + avoid the latency impact that accompanies runtime folder splitting. + + :Type: Integer + :Required: No. + :Default: 0, no splitting at the time of pool creation. + +.. 
_associate-pool-to-application: + +Associating a Pool with an Application +====================================== + +Pools need to be associated with an application before they can be used. Pools +that are intended for use with CephFS and pools that are created automatically +by RGW are associated automatically. Pools that are intended for use with RBD +should be initialized with the ``rbd`` tool (see `Block Device Commands`_ for +more information). + +For other cases, you can manually associate a free-form application name to a +pool by running the following command.: + +.. prompt:: bash $ + + ceph osd pool application enable {pool-name} {application-name} + +.. note:: CephFS uses the application name ``cephfs``, RBD uses the + application name ``rbd``, and RGW uses the application name ``rgw``. + +Setting Pool Quotas +=================== + +To set pool quotas for the maximum number of bytes and/or the maximum number of +RADOS objects per pool, run the following command: + +.. prompt:: bash $ + + ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}] + +For example: + +.. prompt:: bash $ + + ceph osd pool set-quota data max_objects 10000 + +To remove a quota, set its value to ``0``. + + +Deleting a Pool +=============== + +To delete a pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + +To remove a pool, you must set the ``mon_allow_pool_delete`` flag to ``true`` +in the monitor's configuration. Otherwise, monitors will refuse to remove +pools. + +For more information, see `Monitor Configuration`_. + +.. _Monitor Configuration: ../../configuration/mon-config-ref + +If there are custom rules for a pool that is no longer needed, consider +deleting those rules. + +.. prompt:: bash $ + + ceph osd pool get {pool-name} crush_rule + +For example, if the custom rule is "123", check all pools to see whether they +use the rule by running the following command: + +.. prompt:: bash $ + + ceph osd dump | grep "^pool" | grep "crush_rule 123" + +If no pools use this custom rule, then it is safe to delete the rule from the +cluster. + +Similarly, if there are users with permissions restricted to a pool that no +longer exists, consider deleting those users by running commands of the +following forms: + +.. prompt:: bash $ + + ceph auth ls | grep -C 5 {pool-name} + ceph auth del {user} + + +Renaming a Pool +=============== + +To rename a pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool rename {current-pool-name} {new-pool-name} + +If you rename a pool for which an authenticated user has per-pool capabilities, +you must update the user's capabilities ("caps") to refer to the new pool name. + + +Showing Pool Statistics +======================= + +To show a pool's utilization statistics, run the following command: + +.. prompt:: bash $ + + rados df + +To obtain I/O information for a specific pool or for all pools, run a command +of the following form: + +.. prompt:: bash $ + + ceph osd pool stats [{pool-name}] + + +Making a Snapshot of a Pool +=========================== + +To make a snapshot of a pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + +Removing a Snapshot of a Pool +============================= + +To remove a snapshot of a pool, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool rmsnap {pool-name} {snap-name} + +.. 
_setpoolvalues: + +Setting Pool Values +=================== + +To assign values to a pool's configuration keys, run a command of the following +form: + +.. prompt:: bash $ + + ceph osd pool set {pool-name} {key} {value} + +You may set values for the following keys: + +.. _compression_algorithm: + +.. describe:: compression_algorithm + + :Description: Sets the inline compression algorithm used in storing data on the underlying BlueStore back end. This key's setting overrides the global setting :confval:`bluestore_compression_algorithm`. + :Type: String + :Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` + +.. describe:: compression_mode + + :Description: Sets the policy for the inline compression algorithm used in storing data on the underlying BlueStore back end. This key's setting overrides the global setting :confval:`bluestore_compression_mode`. + :Type: String + :Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` + +.. describe:: compression_min_blob_size + + + :Description: Sets the minimum size for the compression of chunks: that is, chunks smaller than this are not compressed. This key's setting overrides the following global settings: + + * :confval:`bluestore_compression_min_blob_size` + * :confval:`bluestore_compression_min_blob_size_hdd` + * :confval:`bluestore_compression_min_blob_size_ssd` + + :Type: Unsigned Integer + + +.. describe:: compression_max_blob_size + + :Description: Sets the maximum size for chunks: that is, chunks larger than this are broken into smaller blobs of this size before compression is performed. + :Type: Unsigned Integer + +.. _size: + +.. describe:: size + + :Description: Sets the number of replicas for objects in the pool. For further details, see `Setting the Number of RADOS Object Replicas`_. Replicated pools only. + :Type: Integer + +.. _min_size: + +.. describe:: min_size + + :Description: Sets the minimum number of replicas required for I/O. For further details, see `Setting the Number of RADOS Object Replicas`_. For erasure-coded pools, this should be set to a value greater than 'k'. If I/O is allowed at the value 'k', then there is no redundancy and data will be lost in the event of a permanent OSD failure. For more information, see `Erasure Code <../erasure-code>`_ + :Type: Integer + :Version: ``0.54`` and above + +.. _pg_num: + +.. describe:: pg_num + + :Description: Sets the effective number of PGs to use when calculating data placement. + :Type: Integer + :Valid Range: ``0`` to ``mon_max_pool_pg_num``. If set to ``0``, the value of ``osd_pool_default_pg_num`` will be used. + +.. _pgp_num: + +.. describe:: pgp_num + + :Description: Sets the effective number of PGs to use when calculating data placement. + :Type: Integer + :Valid Range: Between ``1`` and the current value of ``pg_num``. + +.. _crush_rule: + +.. describe:: crush_rule + + :Description: Sets the CRUSH rule that Ceph uses to map object placement within the pool. + :Type: String + +.. _allow_ec_overwrites: + +.. describe:: allow_ec_overwrites + + :Description: Determines whether writes to an erasure-coded pool are allowed to update only part of a RADOS object. This allows CephFS and RBD to use an EC (erasure-coded) pool for user data (but not for metadata). For more details, see `Erasure Coding with Overwrites`_. + :Type: Boolean + + .. versionadded:: 12.2.0 + +.. describe:: hashpspool + + :Description: Sets and unsets the HASHPSPOOL flag on a given pool. + :Type: Integer + :Valid Range: 1 sets flag, 0 unsets flag + +.. _nodelete: + +.. 
describe:: nodelete + + :Description: Sets and unsets the NODELETE flag on a given pool. + :Type: Integer + :Valid Range: 1 sets flag, 0 unsets flag + :Version: Version ``FIXME`` + +.. _nopgchange: + +.. describe:: nopgchange + + :Description: Sets and unsets the NOPGCHANGE flag on a given pool. + :Type: Integer + :Valid Range: 1 sets flag, 0 unsets flag + :Version: Version ``FIXME`` + +.. _nosizechange: + +.. describe:: nosizechange + + :Description: Sets and unsets the NOSIZECHANGE flag on a given pool. + :Type: Integer + :Valid Range: 1 sets flag, 0 unsets flag + :Version: Version ``FIXME`` + +.. _bulk: + +.. describe:: bulk + + :Description: Sets and unsets the bulk flag on a given pool. + :Type: Boolean + :Valid Range: ``true``/``1`` sets flag, ``false``/``0`` unsets flag + +.. _write_fadvise_dontneed: + +.. describe:: write_fadvise_dontneed + + :Description: Sets and unsets the WRITE_FADVISE_DONTNEED flag on a given pool. + :Type: Integer + :Valid Range: ``1`` sets flag, ``0`` unsets flag + +.. _noscrub: + +.. describe:: noscrub + + :Description: Sets and unsets the NOSCRUB flag on a given pool. + :Type: Integer + :Valid Range: ``1`` sets flag, ``0`` unsets flag + +.. _nodeep-scrub: + +.. describe:: nodeep-scrub + + :Description: Sets and unsets the NODEEP_SCRUB flag on a given pool. + :Type: Integer + :Valid Range: ``1`` sets flag, ``0`` unsets flag + +.. _target_max_bytes: + +.. describe:: target_max_bytes + + :Description: Ceph will begin flushing or evicting objects when the + ``max_bytes`` threshold is triggered. + :Type: Integer + :Example: ``1000000000000`` #1-TB + +.. _target_max_objects: + +.. describe:: target_max_objects + + :Description: Ceph will begin flushing or evicting objects when the + ``max_objects`` threshold is triggered. + :Type: Integer + :Example: ``1000000`` #1M objects + +.. _fast_read: + +.. describe:: fast_read + + :Description: For erasure-coded pools, if this flag is turned ``on``, the + read request issues "sub reads" to all shards, and then waits + until it receives enough shards to decode before it serves + the client. If *jerasure* or *isa* erasure plugins are in + use, then after the first *K* replies have returned, the + client's request is served immediately using the data decoded + from these replies. This approach sacrifices resources in + exchange for better performance. This flag is supported only + for erasure-coded pools. + :Type: Boolean + :Defaults: ``0`` + +.. _scrub_min_interval: + +.. describe:: scrub_min_interval + + :Description: Sets the minimum interval (in seconds) for successive scrubs of the pool's PGs when the load is low. If the default value of ``0`` is in effect, then the value of ``osd_scrub_min_interval`` from central config is used. + + :Type: Double + :Default: ``0`` + +.. _scrub_max_interval: + +.. describe:: scrub_max_interval + + :Description: Sets the maximum interval (in seconds) for scrubs of the pool's PGs regardless of cluster load. If the value of ``scrub_max_interval`` is ``0``, then the value ``osd_scrub_max_interval`` from central config is used. + + :Type: Double + :Default: ``0`` + +.. _deep_scrub_interval: + +.. describe:: deep_scrub_interval + + :Description: Sets the interval (in seconds) for pool “deep” scrubs of the pool's PGs. If the value of ``deep_scrub_interval`` is ``0``, the value ``osd_deep_scrub_interval`` from central config is used. + + :Type: Double + :Default: ``0`` + +.. _recovery_priority: + +.. 
describe:: recovery_priority + + :Description: Setting this value adjusts a pool's computed reservation priority. This value must be in the range ``-10`` to ``10``. Any pool assigned a negative value will be given a lower priority than any new pools, so users are directed to assign negative values to low-priority pools. + + :Type: Integer + :Default: ``0`` + + +.. _recovery_op_priority: + +.. describe:: recovery_op_priority + + :Description: Sets the recovery operation priority for a specific pool's PGs. This overrides the general priority determined by :confval:`osd_recovery_op_priority`. + + :Type: Integer + :Default: ``0`` + + +Getting Pool Values +=================== + +To get a value from a pool's key, run a command of the following form: + +.. prompt:: bash $ + + ceph osd pool get {pool-name} {key} + + +You may get values from the following keys: + + +``size`` + +:Description: See size_. + +:Type: Integer + + +``min_size`` + +:Description: See min_size_. + +:Type: Integer +:Version: ``0.54`` and above + + +``pg_num`` + +:Description: See pg_num_. + +:Type: Integer + + +``pgp_num`` + +:Description: See pgp_num_. + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. + + +``crush_rule`` + +:Description: See crush_rule_. + + +``target_max_bytes`` + +:Description: See target_max_bytes_. + +:Type: Integer + + +``target_max_objects`` + +:Description: See target_max_objects_. + +:Type: Integer + + +``fast_read`` + +:Description: See fast_read_. + +:Type: Boolean + + +``scrub_min_interval`` + +:Description: See scrub_min_interval_. + +:Type: Double + + +``scrub_max_interval`` + +:Description: See scrub_max_interval_. + +:Type: Double + + +``deep_scrub_interval`` + +:Description: See deep_scrub_interval_. + +:Type: Double + + +``allow_ec_overwrites`` + +:Description: See allow_ec_overwrites_. + +:Type: Boolean + + +``recovery_priority`` + +:Description: See recovery_priority_. + +:Type: Integer + + +``recovery_op_priority`` + +:Description: See recovery_op_priority_. + +:Type: Integer + + +Setting the Number of RADOS Object Replicas +=========================================== + +To set the number of data replicas on a replicated pool, run a command of the +following form: + +.. prompt:: bash $ + + ceph osd pool set {poolname} size {num-replicas} + +.. important:: The ``{num-replicas}`` argument includes the primary object + itself. For example, if you want there to be two replicas of the object in + addition to the original object (for a total of three instances of the + object) specify ``3`` by running the following command: + +.. prompt:: bash $ + + ceph osd pool set data size 3 + +You may run the above command for each pool. + +.. Note:: An object might accept I/Os in degraded mode with fewer than ``pool + size`` replicas. To set a minimum number of replicas required for I/O, you + should use the ``min_size`` setting. For example, you might run the + following command: + +.. prompt:: bash $ + + ceph osd pool set data min_size 2 + +This command ensures that no object in the data pool will receive I/O if it has +fewer than ``min_size`` (in this case, two) replicas. + + +Getting the Number of Object Replicas +===================================== + +To get the number of object replicas, run the following command: + +.. prompt:: bash $ + + ceph osd dump | grep 'replicated size' + +Ceph will list pools and highlight the ``replicated size`` attribute. By +default, Ceph creates two replicas of an object (a total of three copies, for a +size of ``3``). 
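
For reference, each matching line of output resembles the pool entries shown by
``ceph osd pool ls detail`` earlier in this document. The pool name, ID, and
epoch below are illustrative only, and the exact set of fields varies by
release::

    pool 1 'data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 42 flags hashpspool stripe_width 0

Here ``replicated size 3`` is the value changed by ``ceph osd pool set
{poolname} size {num-replicas}``, and ``min_size 2`` is the corresponding I/O
threshold described above.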
+ +Managing pools that are flagged with ``--bulk`` +=============================================== +See :ref:`managing_bulk_flagged_pools`. + + +.. _pgcalc: https://old.ceph.com/pgcalc/ +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups +.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites +.. _Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool diff --git a/doc/rados/operations/read-balancer.rst b/doc/rados/operations/read-balancer.rst new file mode 100644 index 000000000..0833e4326 --- /dev/null +++ b/doc/rados/operations/read-balancer.rst @@ -0,0 +1,64 @@ +.. _read_balancer: + +======================================= +Operating the Read (Primary) Balancer +======================================= + +You might be wondering: How can I improve performance in my Ceph cluster? +One important data point you can check is the ``read_balance_score`` on each +of your replicated pools. + +This metric, available via ``ceph osd pool ls detail`` (see :ref:`rados_pools` +for more details) indicates read performance, or how balanced the primaries are +for each replicated pool. In most cases, if a ``read_balance_score`` is above 1 +(for instance, 1.5), this means that your pool has unbalanced primaries and that +you may want to try improving your read performance with the read balancer. + +Online Optimization +=================== + +At present, there is no online option for the read balancer. However, we plan to add +the read balancer as an option to the :ref:`balancer` in the next Ceph version +so it can be enabled to run automatically in the background like the upmap balancer. + +Offline Optimization +==================== + +Primaries are updated with an offline optimizer that is built into the +:ref:`osdmaptool`. + +#. Grab the latest copy of your osdmap: + + .. prompt:: bash $ + + ceph osd getmap -o om + +#. Run the optimizer: + + .. prompt:: bash $ + + osdmaptool om --read out.txt --read-pool <pool name> [--vstart] + + It is highly recommended that you run the capacity balancer before running the + balancer to ensure optimal results. See :ref:`upmap` for details on how to balance + capacity in a cluster. + +#. Apply the changes: + + .. prompt:: bash $ + + source out.txt + + In the above example, the proposed changes are written to the output file + ``out.txt``. The commands in this procedure are normal Ceph CLI commands + that can be run in order to apply the changes to the cluster. + + If you are working in a vstart cluster, you may pass the ``--vstart`` parameter + as shown above so the CLI commands are formatted with the `./bin/` prefix. + + Note that any time the number of pgs changes (for instance, if the pg autoscaler [:ref:`pg-autoscaler`] + kicks in), you should consider rechecking the scores and rerunning the balancer if needed. + +To see some details about what the tool is doing, you can pass +``--debug-osd 10`` to ``osdmaptool``. To see even more details, pass +``--debug-osd 20`` to ``osdmaptool``. diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst new file mode 100644 index 000000000..f797b5b91 --- /dev/null +++ b/doc/rados/operations/stretch-mode.rst @@ -0,0 +1,262 @@ +.. 
_stretch_mode: + +================ +Stretch Clusters +================ + + +Stretch Clusters +================ + +A stretch cluster is a cluster that has servers in geographically separated +data centers, distributed over a WAN. Stretch clusters have LAN-like high-speed +and low-latency connections, but limited links. Stretch clusters have a higher +likelihood of (possibly asymmetric) network splits, and a higher likelihood of +temporary or complete loss of an entire data center (which can represent +one-third to one-half of the total cluster). + +Ceph is designed with the expectation that all parts of its network and cluster +will be reliable and that failures will be distributed randomly across the +CRUSH map. Even if a switch goes down and causes the loss of many OSDs, Ceph is +designed so that the remaining OSDs and monitors will route around such a loss. + +Sometimes this cannot be relied upon. If you have a "stretched-cluster" +deployment in which much of your cluster is behind a single network component, +you might need to use **stretch mode** to ensure data integrity. + +We will here consider two standard configurations: a configuration with two +data centers (or, in clouds, two availability zones), and a configuration with +three data centers (or, in clouds, three availability zones). + +In the two-site configuration, Ceph expects each of the sites to hold a copy of +the data, and Ceph also expects there to be a third site that has a tiebreaker +monitor. This tiebreaker monitor picks a winner if the network connection fails +and both data centers remain alive. + +The tiebreaker monitor can be a VM. It can also have high latency relative to +the two main sites. + +The standard Ceph configuration is able to survive MANY network failures or +data-center failures without ever compromising data availability. If enough +Ceph servers are brought back following a failure, the cluster *will* recover. +If you lose a data center but are still able to form a quorum of monitors and +still have all the data available, Ceph will maintain availability. (This +assumes that the cluster has enough copies to satisfy the pools' ``min_size`` +configuration option, or (failing that) that the cluster has CRUSH rules in +place that will cause the cluster to re-replicate the data until the +``min_size`` configuration option has been met.) + +Stretch Cluster Issues +====================== + +Ceph does not permit the compromise of data integrity and data consistency +under any circumstances. When service is restored after a network failure or a +loss of Ceph nodes, Ceph will restore itself to a state of normal functioning +without operator intervention. + +Ceph does not permit the compromise of data integrity or data consistency, but +there are situations in which *data availability* is compromised. These +situations can occur even though there are enough clusters available to satisfy +Ceph's consistency and sizing constraints. In some situations, you might +discover that your cluster does not satisfy those constraints. + +The first category of these failures that we will discuss involves inconsistent +networks -- if there is a netsplit (a disconnection between two servers that +splits the network into two pieces), Ceph might be unable to mark OSDs ``down`` +and remove them from the acting PG sets. 
This failure to mark OSDs ``down``
will occur despite the fact that the primary PG is unable to replicate data (a
situation that, under normal non-netsplit circumstances, would result in the
marking of affected OSDs as ``down`` and their removal from the PG). If this
happens, Ceph will be unable to satisfy its durability guarantees and
consequently IO will not be permitted.

The second category of failures that we will discuss involves the situation in
which the constraints are not sufficient to guarantee the replication of data
across data centers, though it might seem that the data is correctly replicated
across data centers. For example, in a scenario in which there are two data
centers named Data Center A and Data Center B, and the CRUSH rule targets three
replicas and places a replica in each data center with a ``min_size`` of ``2``,
the PG might go active with two replicas in Data Center A and zero replicas in
Data Center B. In a situation of this kind, the loss of Data Center A means
that the data is lost and Ceph will not be able to operate on it. This
situation is surprisingly difficult to avoid using only standard CRUSH rules.


Stretch Mode
============
Stretch mode is designed to handle deployments in which you cannot guarantee the
replication of data across two data centers. This kind of situation can arise
when the cluster's CRUSH rule specifies that three copies are to be made, but
then a copy is placed in each data center with a ``min_size`` of 2. Under such
conditions, a placement group can become active with two copies in the first
data center and no copies in the second data center.


Entering Stretch Mode
---------------------

To enable stretch mode, you must set the location of each monitor, matching
your CRUSH map. This procedure shows how to do this.


#. Place ``mon.a`` in your first data center:

   .. prompt:: bash $

      ceph mon set_location a datacenter=site1

#. Generate a CRUSH rule that places two copies in each data center.
   This requires editing the CRUSH map directly:

   .. prompt:: bash $

      ceph osd getcrushmap > crush.map.bin
      crushtool -d crush.map.bin -o crush.map.txt

#. Edit the ``crush.map.txt`` file to add a new rule. Here there is only one
   other rule (``id 1``), but you might need to use a different rule ID. We
   have two data-center buckets named ``site1`` and ``site2``:

   ::

       rule stretch_rule {
               id 1
               min_size 1
               max_size 10
               type replicated
               step take site1
               step chooseleaf firstn 2 type host
               step emit
               step take site2
               step chooseleaf firstn 2 type host
               step emit
       }

#. Inject the CRUSH map to make the rule available to the cluster:

   .. prompt:: bash $

      crushtool -c crush.map.txt -o crush2.map.bin
      ceph osd setcrushmap -i crush2.map.bin

#. Run the monitors in connectivity mode. See `Changing Monitor Elections`_.

#. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the
   tiebreaker monitor and we are splitting across data centers. The tiebreaker
   monitor must be assigned a data center that is neither ``site1`` nor
   ``site2``. For this purpose you can create another data-center bucket named
   ``site3`` in your CRUSH map and place ``mon.e`` there:

   .. prompt:: bash $

      ceph mon set_location e datacenter=site3
      ceph mon enable_stretch_mode e stretch_rule datacenter

When stretch mode is enabled, PGs will become active only when they peer
across data centers (or across whichever CRUSH bucket type was specified),
assuming both are alive. Pools will increase in size from the default ``3`` to
``4``, and two copies will be expected in each site. OSDs will be allowed to
connect to monitors only if they are in the same data center as the monitors.
New monitors will not be allowed to join the cluster if they do not specify a
location.
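If you want to confirm that these changes have taken effect, one simple check
is to inspect the sizes that the cluster now reports for your replicated
pools: the ``size`` should have been raised to ``4`` as described above. The
pool name ``mypool`` below is only a placeholder; substitute one of your own
replicated pools:

.. prompt:: bash $

   ceph osd pool ls detail
   ceph osd pool get mypool size
   ceph osd pool get mypool min_size
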
If all OSDs and monitors in one of the data centers become inaccessible at once,
the surviving data center enters a "degraded stretch mode". A warning will be
issued, the ``min_size`` will be reduced to ``1``, and the cluster will be
allowed to go active with the data in the single remaining site. The pool size
does not change, so warnings will be generated that report that the pools are
too small -- but a special stretch mode flag will prevent the OSDs from
creating extra copies in the remaining data center. This means that the data
center will keep only two copies, just as before.

When the missing data center comes back, the cluster will enter a "recovery
stretch mode". This changes the warning and allows peering, but requires OSDs
only from the data center that was ``up`` throughout the duration of the
downtime. When all PGs are in a known state, and are neither degraded nor
incomplete, the cluster transitions back to regular stretch mode, ends the
warning, restores ``min_size`` to its original value (``2``), requires both
sites to peer, and no longer requires the site that was up throughout the
duration of the downtime when peering (which makes failover to the other site
possible, if needed).

.. _Changing Monitor elections: ../change-mon-elections

Limitations of Stretch Mode
===========================
When using stretch mode, OSDs must be located at exactly two sites.

Two monitors should be run in each data center, plus a tiebreaker in a third
(or in the cloud) for a total of five monitors. While in stretch mode, OSDs
will connect only to monitors within the data center in which they are located.
OSDs *DO NOT* connect to the tiebreaker monitor.

Erasure-coded pools cannot be used with stretch mode: attempts to use
erasure-coded pools with stretch mode will fail, and erasure-coded pools cannot
be created while stretch mode is active.

To use stretch mode, you will need to create a CRUSH rule that provides two
replicas in each data center, for four replicas in total. If pools exist in the
cluster that do not have the default ``size`` or ``min_size``, Ceph will not
enter stretch mode. An example of such a CRUSH rule is given above.

Because stretch mode runs with ``min_size`` set to ``1`` (or, more directly,
``min_size 1``), we recommend enabling stretch mode only when using OSDs on
SSDs (including NVMe OSDs); this reduces the potential for data loss. Hybrid
HDD+SSD or HDD-only OSDs are not recommended because of the long time they take
to recover after connectivity between data centers has been restored.

In the future, stretch mode might support erasure-coded pools and might support
deployments that have more than two data centers.

Other commands
==============

Replacing a failed tiebreaker monitor
-------------------------------------

Turn on a new monitor and run the following command:

.. prompt:: bash $

   ceph mon set_new_tiebreaker mon.<new_mon_name>

This command protests if the new monitor is in the same location as the
existing non-tiebreaker monitors. **This command WILL NOT remove the previous
tiebreaker monitor.** Remove the previous tiebreaker monitor yourself.
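As a concrete sketch, suppose the tiebreaker ``mon.e`` from the earlier example
has failed and a replacement monitor ``mon.f`` has been deployed (the name
``f`` is hypothetical). Assign the new monitor a location outside the two data
centers, promote it to tiebreaker, and then remove the failed monitor with the
standard monitor-removal command:

.. prompt:: bash $

   ceph mon set_location f datacenter=site3
   ceph mon set_new_tiebreaker mon.f
   ceph mon remove e
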

Using "--set-crush-location" and not "ceph mon set_location"
------------------------------------------------------------

If you write your own tooling for deploying Ceph, use the
``--set-crush-location`` option when booting monitors instead of running ``ceph
mon set_location``. This option accepts only a single ``bucket=loc`` pair (for
example, ``ceph-mon --set-crush-location 'datacenter=a'``), and that pair must
match the bucket type that was specified when running ``enable_stretch_mode``.

Forcing recovery stretch mode
-----------------------------

When in stretch degraded mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that does not
happen or you want to enable recovery mode early, run the following command:

.. prompt:: bash $

   ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

Forcing normal stretch mode
---------------------------

When in recovery mode, the cluster should go back into normal stretch mode when
the PGs are healthy. If this fails to happen or if you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), run the following command:

.. prompt:: bash $

   ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

This command can be used to remove the ``HEALTH_WARN`` state, which recovery
mode generates.

diff --git a/doc/rados/operations/upmap.rst b/doc/rados/operations/upmap.rst new file mode 100644 index 000000000..8541680d8 --- /dev/null +++ b/doc/rados/operations/upmap.rst @@ -0,0 +1,113 @@

.. _upmap:

=======================================
Using pg-upmap
=======================================

In Luminous v12.2.z and later releases, there is a *pg-upmap* exception table
in the OSDMap that allows the cluster to explicitly map specific PGs to
specific OSDs. This allows the cluster to fine-tune the data distribution to,
in most cases, uniformly distribute PGs across OSDs.

However, there is an important caveat: this feature requires all clients to
understand the new *pg-upmap* structure in the OSDMap.

Online Optimization
===================

Enabling
--------

In order to use ``pg-upmap``, the cluster cannot have any pre-Luminous clients.
By default, new clusters enable the *balancer module*, which makes use of
``pg-upmap``. If you want to use a different balancer or you want to make your
own custom ``pg-upmap`` entries, you might want to turn off the balancer in
order to avoid conflict:

.. prompt:: bash $

   ceph balancer off

To allow use of the new feature on an existing cluster, you must restrict the
cluster to supporting only Luminous (and newer) clients. To do so, run the
following command:

.. prompt:: bash $

   ceph osd set-require-min-compat-client luminous

This command will fail if any pre-Luminous clients or daemons are connected to
the monitors. To see which client versions are in use, run the following
command:

.. prompt:: bash $

   ceph features

Balancer Module
---------------

The `balancer` module for ``ceph-mgr`` will automatically balance the number of
PGs per OSD. See :ref:`balancer`.

Offline Optimization
====================

Upmap entries are updated with an offline optimizer that is built into the
:ref:`osdmaptool`.

#. Grab the latest copy of your osdmap:

   .. prompt:: bash $

      ceph osd getmap -o om

#.
Run the optimizer: + + .. prompt:: bash $ + + osdmaptool om --upmap out.txt [--upmap-pool <pool>] \ + [--upmap-max <max-optimizations>] \ + [--upmap-deviation <max-deviation>] \ + [--upmap-active] + + It is highly recommended that optimization be done for each pool + individually, or for sets of similarly utilized pools. You can specify the + ``--upmap-pool`` option multiple times. "Similarly utilized pools" means + pools that are mapped to the same devices and that store the same kind of + data (for example, RBD image pools are considered to be similarly utilized; + an RGW index pool and an RGW data pool are not considered to be similarly + utilized). + + The ``max-optimizations`` value determines the maximum number of upmap + entries to identify. The default is `10` (as is the case with the + ``ceph-mgr`` balancer module), but you should use a larger number if you are + doing offline optimization. If it cannot find any additional changes to + make (that is, if the pool distribution is perfect), it will stop early. + + The ``max-deviation`` value defaults to `5`. If an OSD's PG count varies + from the computed target number by no more than this amount it will be + considered perfect. + + The ``--upmap-active`` option simulates the behavior of the active balancer + in upmap mode. It keeps cycling until the OSDs are balanced and reports how + many rounds have occurred and how long each round takes. The elapsed time + for rounds indicates the CPU load that ``ceph-mgr`` consumes when it computes + the next optimization plan. + +#. Apply the changes: + + .. prompt:: bash $ + + source out.txt + + In the above example, the proposed changes are written to the output file + ``out.txt``. The commands in this procedure are normal Ceph CLI commands + that can be run in order to apply the changes to the cluster. + +The above steps can be repeated as many times as necessary to achieve a perfect +distribution of PGs for each set of pools. + +To see some (gory) details about what the tool is doing, you can pass +``--debug-osd 10`` to ``osdmaptool``. To see even more details, pass +``--debug-crush 10`` to ``osdmaptool``. diff --git a/doc/rados/operations/user-management.rst b/doc/rados/operations/user-management.rst new file mode 100644 index 000000000..130c02002 --- /dev/null +++ b/doc/rados/operations/user-management.rst @@ -0,0 +1,840 @@ +.. _user-management: + +================= + User Management +================= + +This document describes :term:`Ceph Client` users, and describes the process by +which they perform authentication and authorization so that they can access the +:term:`Ceph Storage Cluster`. Users are either individuals or system actors +(for example, applications) that use Ceph clients to interact with the Ceph +Storage Cluster daemons. + +.. ditaa:: + +-----+ + | {o} | + | | + +--+--+ /---------\ /---------\ + | | Ceph | | Ceph | + ---+---*----->| |<------------->| | + | uses | Clients | | Servers | + | \---------/ \---------/ + /--+--\ + | | + | | + actor + + +When Ceph runs with authentication and authorization enabled (both are enabled +by default), you must specify a user name and a keyring that contains the +secret key of the specified user (usually these are specified via the command +line). If you do not specify a user name, Ceph will use ``client.admin`` as the +default user name. If you do not specify a keyring, Ceph will look for a +keyring via the ``keyring`` setting in the Ceph configuration. 
For example, if +you execute the ``ceph health`` command without specifying a user or a keyring, +Ceph will assume that the keyring is in ``/etc/ceph/ceph.client.admin.keyring`` +and will attempt to use that keyring. The following illustrates this behavior: + +.. prompt:: bash $ + + ceph health + +Ceph will interpret the command like this: + +.. prompt:: bash $ + + ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health + +Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid +re-entry of the user name and secret. + +For details on configuring the Ceph Storage Cluster to use authentication, see +`Cephx Config Reference`_. For details on the architecture of Cephx, see +`Architecture - High Availability Authentication`_. + +Background +========== + +No matter what type of Ceph client is used (for example: Block Device, Object +Storage, Filesystem, native API), Ceph stores all data as RADOS objects within +`pools`_. Ceph users must have access to a given pool in order to read and +write data, and Ceph users must have execute permissions in order to use Ceph's +administrative commands. The following concepts will help you understand +Ceph['s] user management. + +.. _rados-ops-user: + +User +---- + +A user is either an individual or a system actor (for example, an application). +Creating users allows you to control who (or what) can access your Ceph Storage +Cluster, its pools, and the data within those pools. + +Ceph has the concept of a ``type`` of user. For purposes of user management, +the type will always be ``client``. Ceph identifies users in a "period- +delimited form" that consists of the user type and the user ID: for example, +``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing +is that the Cephx protocol is used not only by clients but also non-clients, +such as Ceph Monitors, OSDs, and Metadata Servers. Distinguishing the user type +helps to distinguish between client users and other users. This distinction +streamlines access control, user monitoring, and traceability. + +Sometimes Ceph's user type might seem confusing, because the Ceph command line +allows you to specify a user with or without the type, depending upon your +command line usage. If you specify ``--user`` or ``--id``, you can omit the +type. For example, ``client.user1`` can be entered simply as ``user1``. On the +other hand, if you specify ``--name`` or ``-n``, you must supply the type and +name: for example, ``client.user1``. We recommend using the type and name as a +best practice wherever possible. + +.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage + user or a Ceph File System user. The Ceph Object Gateway uses a Ceph Storage + Cluster user to communicate between the gateway daemon and the storage + cluster, but the Ceph Object Gateway has its own user-management + functionality for end users. The Ceph File System uses POSIX semantics, and + the user space associated with the Ceph File System is not the same as the + user space associated with a Ceph Storage Cluster user. + +Authorization (Capabilities) +---------------------------- + +Ceph uses the term "capabilities" (caps) to describe the permissions granted to +an authenticated user to exercise the functionality of the monitors, OSDs, and +metadata servers. Capabilities can also restrict access to data within a pool, +a namespace within a pool, or a set of pools based on their application tags. 
+A Ceph administrative user specifies the capabilities of a user when creating +or updating that user. + +Capability syntax follows this form:: + + {daemon-type} '{cap-spec}[, {cap-spec} ...]' + +- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access + settings, and can be applied in aggregate from pre-defined profiles with + ``profile {name}``. For example:: + + mon 'allow {access-spec} [network {network/prefix}]' + + mon 'profile {name}' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The optional ``{network/prefix}`` is a standard network name and prefix + length in CIDR notation (for example, ``10.3.0.0/16``). If + ``{network/prefix}`` is present, the monitor capability can be used only by + clients that connect from the specified network. + +- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, and + ``class-read`` and ``class-write`` access settings. OSD capabilities can be + applied in aggregate from pre-defined profiles with ``profile {name}``. In + addition, OSD capabilities allow for pool and namespace settings. :: + + osd 'allow {access-spec} [{match-spec}] [network {network/prefix}]' + + osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]] [network {network/prefix}]' + + There are two alternative forms of the ``{access-spec}`` syntax: :: + + * | all | [r][w][x] [class-read] [class-write] + + class {class name} [{method name}] + + There are two alternative forms of the optional ``{match-spec}`` syntax:: + + pool={pool-name} [namespace={namespace-name}] [object_prefix {prefix}] + + [namespace={namespace-name}] tag {application} {key}={value} + + The optional ``{network/prefix}`` is a standard network name and prefix + length in CIDR notation (for example, ``10.3.0.0/16``). If + ``{network/prefix}`` is present, the OSD capability can be used only by + clients that connect from the specified network. + +- **Manager Caps:** Manager (``ceph-mgr``) capabilities include ``r``, ``w``, + ``x`` access settings, and can be applied in aggregate from pre-defined + profiles with ``profile {name}``. For example:: + + mgr 'allow {access-spec} [network {network/prefix}]' + + mgr 'profile {name} [{key1} {match-type} {value1} ...] [network {network/prefix}]' + + Manager capabilities can also be specified for specific commands, for all + commands exported by a built-in manager service, or for all commands exported + by a specific add-on module. For example:: + + mgr 'allow command "{command-prefix}" [with {key1} {match-type} {value1} ...] [network {network/prefix}]' + + mgr 'allow service {service-name} {access-spec} [network {network/prefix}]' + + mgr 'allow module {module-name} [with {key1} {match-type} {value1} ...] {access-spec} [network {network/prefix}]' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The ``{service-name}`` is one of the following: :: + + mgr | osd | pg | py + + The ``{match-type}`` is one of the following: :: + + = | prefix | regex + +- **Metadata Server Caps:** For administrators, use ``allow *``. For all other + users (for example, CephFS clients), consult :doc:`/cephfs/client-auth` + +.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the + Ceph Storage Cluster. For this reason, it is not represented as + a Ceph Storage Cluster daemon type. + +The following entries describe access capabilities. + +``allow`` + +:Description: Precedes access settings for a daemon. Implies ``rw`` + for MDS only. + + +``r`` + +:Description: Gives the user read access. 
Required with monitors to retrieve + the CRUSH map. + + +``w`` + +:Description: Gives the user write access to objects. + + +``x`` + +:Description: Gives the user the capability to call class methods + (that is, both read and write) and to conduct ``auth`` + operations on monitors. + + +``class-read`` + +:Descriptions: Gives the user the capability to call class read methods. + Subset of ``x``. + + +``class-write`` + +:Description: Gives the user the capability to call class write methods. + Subset of ``x``. + + +``*``, ``all`` + +:Description: Gives the user read, write, and execute permissions for a + particular daemon/pool, as well as the ability to execute + admin commands. + + +The following entries describe valid capability profiles: + +``profile osd`` (Monitor only) + +:Description: Gives a user permissions to connect as an OSD to other OSDs or + monitors. Conferred on OSDs in order to enable OSDs to handle replication + heartbeat traffic and status reporting. + + +``profile mds`` (Monitor only) + +:Description: Gives a user permissions to connect as an MDS to other MDSs or + monitors. + + +``profile bootstrap-osd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an OSD. Conferred on + deployment tools such as ``ceph-volume`` and ``cephadm`` + so that they have permissions to add keys when + bootstrapping an OSD. + + +``profile bootstrap-mds`` (Monitor only) + +:Description: Gives a user permissions to bootstrap a metadata server. + Conferred on deployment tools such as ``cephadm`` + so that they have permissions to add keys when bootstrapping + a metadata server. + +``profile bootstrap-rbd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an RBD user. + Conferred on deployment tools such as ``cephadm`` + so that they have permissions to add keys when bootstrapping + an RBD user. + +``profile bootstrap-rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an ``rbd-mirror`` daemon + user. Conferred on deployment tools such as ``cephadm`` so that + they have permissions to add keys when bootstrapping an + ``rbd-mirror`` daemon. + +``profile rbd`` (Manager, Monitor, and OSD) + +:Description: Gives a user permissions to manipulate RBD images. When used as a + Monitor cap, it provides the user with the minimal privileges + required by an RBD client application; such privileges include + the ability to blocklist other client users. When used as an OSD + cap, it provides an RBD client application with read-write access + to the specified pool. The Manager cap supports optional ``pool`` + and ``namespace`` keyword arguments. + +``profile rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to manipulate RBD images and retrieve + RBD mirroring config-key secrets. It provides the minimal + privileges required for the user to manipulate the ``rbd-mirror`` + daemon. + +``profile rbd-read-only`` (Manager and OSD) + +:Description: Gives a user read-only permissions to RBD images. The Manager cap + supports optional ``pool`` and ``namespace`` keyword arguments. + +``profile simple-rados-client`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, and PG data. + Intended for use by direct librados client applications. + +``profile simple-rados-client-with-blocklist`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, and PG data. + Intended for use by direct librados client applications. 
Also + includes permissions to add blocklist entries to build + high-availability (HA) applications. + +``profile fs-client`` (Monitor only) + +:Description: Gives a user read-only permissions for monitor, OSD, PG, and MDS + data. Intended for CephFS clients. + +``profile role-definer`` (Monitor and Auth) + +:Description: Gives a user **all** permissions for the auth subsystem, read-only + access to monitors, and nothing else. Useful for automation + tools. Do not assign this unless you really, **really** know what + you're doing, as the security ramifications are substantial and + pervasive. + +``profile crash`` (Monitor and MGR) + +:Description: Gives a user read-only access to monitors. Used in conjunction + with the manager ``crash`` module to upload daemon crash + dumps into monitor storage for later analysis. + +Pool +---- + +A pool is a logical partition where users store data. +In Ceph deployments, it is common to create a pool as a logical partition for +similar types of data. For example, when deploying Ceph as a back end for +OpenStack, a typical deployment would have pools for volumes, images, backups +and virtual machines, and such users as ``client.glance`` and ``client.cinder``. + +Application Tags +---------------- + +Access may be restricted to specific pools as defined by their application +metadata. The ``*`` wildcard may be used for the ``key`` argument, the +``value`` argument, or both. The ``all`` tag is a synonym for ``*``. + +Namespace +--------- + +Objects within a pool can be associated to a namespace: that is, to a logical group of +objects within the pool. A user's access to a pool can be associated with a +namespace so that reads and writes by the user can take place only within the +namespace. Objects written to a namespace within the pool can be accessed only +by users who have access to the namespace. + +.. note:: Namespaces are primarily useful for applications written on top of + ``librados``. In such situations, the logical grouping provided by + namespaces can obviate the need to create different pools. In Luminous and + later releases, Ceph Object Gateway uses namespaces for various metadata + objects. + +The rationale for namespaces is this: namespaces are relatively less +computationally expensive than pools, which (pools) can be a computationally +expensive method of segregating data sets between different authorized users. + +For example, a pool ought to host approximately 100 placement-group replicas +per OSD. This means that a cluster with 1000 OSDs and three 3R replicated pools +would have (in a single pool) 100,000 placement-group replicas, and that means +that it has 33,333 Placement Groups. + +By contrast, writing an object to a namespace simply associates the namespace +to the object name without incurring the computational overhead of a separate +pool. Instead of creating a separate pool for a user or set of users, you can +use a namespace. + +.. note:: + + Namespaces are available only when using ``librados``. + + +Access may be restricted to specific RADOS namespaces by use of the ``namespace`` +capability. Limited globbing of namespaces (that is, use of wildcards (``*``)) is supported: if the last character +of the specified namespace is ``*``, then access is granted to any namespace +starting with the provided argument. + +Managing Users +============== + +User management functionality provides Ceph Storage Cluster administrators with +the ability to create, update, and delete users directly in the Ceph Storage +Cluster. 
+ +When you create or delete users in the Ceph Storage Cluster, you might need to +distribute keys to clients so that they can be added to keyrings. For details, see `Keyring +Management`_. + +Listing Users +------------- + +To list the users in your cluster, run the following command: + +.. prompt:: bash $ + + ceph auth ls + +Ceph will list all users in your cluster. For example, in a two-node +cluster, ``ceph auth ls`` will provide an output that resembles the following:: + + installed auth entries: + + osd.0 + key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w== + caps: [mon] allow profile osd + caps: [osd] allow * + osd.1 + key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA== + caps: [mon] allow profile osd + caps: [osd] allow * + client.admin + key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw== + caps: [mds] allow + caps: [mon] allow * + caps: [osd] allow * + client.bootstrap-mds + key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww== + caps: [mon] allow profile bootstrap-mds + client.bootstrap-osd + key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw== + caps: [mon] allow profile bootstrap-osd + +Note that, according to the ``TYPE.ID`` notation for users, ``osd.0`` is a +user of type ``osd`` and an ID of ``0``, and ``client.admin`` is a user of type +``client`` and an ID of ``admin`` (that is, the default ``client.admin`` user). +Note too that each entry has a ``key: <value>`` entry, and also has one or more +``caps:`` entries. + +To save the output of ``ceph auth ls`` to a file, use the ``-o {filename}`` option. + + +Getting a User +-------------- + +To retrieve a specific user, key, and capabilities, run the following command: + +.. prompt:: bash $ + + ceph auth get {TYPE.ID} + +For example: + +.. prompt:: bash $ + + ceph auth get client.admin + +To save the output of ``ceph auth get`` to a file, use the ``-o {filename}`` option. Developers may also run the following command: + +.. prompt:: bash $ + + ceph auth export {TYPE.ID} + +The ``auth export`` command is identical to ``auth get``. + +.. _rados_ops_adding_a_user: + +Adding a User +------------- + +Adding a user creates a user name (that is, ``TYPE.ID``), a secret key, and +any capabilities specified in the command that creates the user. + +A user's key allows the user to authenticate with the Ceph Storage Cluster. +The user's capabilities authorize the user to read, write, or execute on Ceph +monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``). + +There are a few ways to add a user: + +- ``ceph auth add``: This command is the canonical way to add a user. It + will create the user, generate a key, and add any specified capabilities. + +- ``ceph auth get-or-create``: This command is often the most convenient way + to create a user, because it returns a keyfile format with the user name + (in brackets) and the key. If the user already exists, this command + simply returns the user name and key in the keyfile format. To save the output to + a file, use the ``-o {filename}`` option. + +- ``ceph auth get-or-create-key``: This command is a convenient way to create + a user and return the user's key and nothing else. This is useful for clients that + need only the key (for example, libvirt). If the user already exists, this command + simply returns the key. To save the output to + a file, use the ``-o {filename}`` option. + +It is possible, when creating client users, to create a user with no capabilities. A user +with no capabilities is useless beyond mere authentication, because the client +cannot retrieve the cluster map from the monitor. 
However, you might want to create a user
with no capabilities and wait until later to add capabilities to the user by using the ``ceph auth caps`` command.

A typical user has at least read capabilities on the Ceph monitor and
read and write capabilities on Ceph OSDs. A user's OSD permissions
are often restricted so that the user can access only one particular pool.
In the following example, the first command adds a client named ``john`` that has
read capabilities on the Ceph monitor and read and write capabilities on the pool
named ``liverpool``; the second creates a client named ``paul`` with the same
capabilities; the third creates a client named ``george`` with the same
capabilities and writes the resulting keyring to the file ``george.keyring``; and
the fourth creates a client named ``ringo`` with the same capabilities and writes
only the secret key to the file ``ringo.key``:

.. prompt:: bash $

   ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool'
   ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool'
   ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring
   ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key

.. important:: Any user that has capabilities on OSDs will have access to ALL pools in the cluster
   unless that user's access has been restricted to a proper subset of the pools in the cluster.


.. _modify-user-capabilities:

Modifying User Capabilities
---------------------------

The ``ceph auth caps`` command allows you to specify a user and change that
user's capabilities. Setting new capabilities will overwrite current capabilities.
To view current capabilities, run ``ceph auth get USERTYPE.USERID``.
To add capabilities, run a command of the following form (and be sure to specify the existing capabilities):

.. prompt:: bash $

   ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]']

For example:

.. prompt:: bash $

   ceph auth get client.john
   ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool'
   ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool'
   ceph auth caps client.brian-manager mon 'allow *' osd 'allow *'

For additional details on capabilities, see `Authorization (Capabilities)`_.

Deleting a User
---------------

To delete a user, use ``ceph auth del``:

.. prompt:: bash $

   ceph auth del {TYPE}.{ID}

Here ``{TYPE}`` is either ``client``, ``osd``, ``mon``, or ``mds``,
and ``{ID}`` is the user name or the ID of the daemon.


Printing a User's Key
---------------------

To print a user's authentication key to standard output, run the following command:

.. prompt:: bash $

   ceph auth print-key {TYPE}.{ID}

Here ``{TYPE}`` is either ``client``, ``osd``, ``mon``, or ``mds``,
and ``{ID}`` is the user name or the ID of the daemon.

When it is necessary to populate client software with a user's key (as in the case of libvirt),
you can print the user's key by running the following command:

..
prompt:: bash $ + + mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user` + +Importing a User +---------------- + +To import one or more users, use ``ceph auth import`` and +specify a keyring as follows: + +.. prompt:: bash $ + + ceph auth import -i /path/to/keyring + +For example: + +.. prompt:: bash $ + + sudo ceph auth import -i /etc/ceph/ceph.keyring + +.. note:: The Ceph storage cluster will add new users, their keys, and their + capabilities and will update existing users, their keys, and their + capabilities. + +Keyring Management +================== + +When you access Ceph via a Ceph client, the Ceph client will look for a local +keyring. Ceph presets the ``keyring`` setting with four keyring +names by default. For this reason, you do not have to set the keyring names in your Ceph configuration file +unless you want to override these defaults (which is not recommended). The four default keyring names are as follows: + +- ``/etc/ceph/$cluster.$name.keyring`` +- ``/etc/ceph/$cluster.keyring`` +- ``/etc/ceph/keyring`` +- ``/etc/ceph/keyring.bin`` + +The ``$cluster`` metavariable found in the first two default keyring names above +is your Ceph cluster name as defined by the name of the Ceph configuration +file: for example, if the Ceph configuration file is named ``ceph.conf``, +then your Ceph cluster name is ``ceph`` and the second name above would be +``ceph.keyring``. The ``$name`` metavariable is the user type and user ID: +for example, given the user ``client.admin``, the first name above would be +``ceph.client.admin.keyring``. + +.. note:: When running commands that read or write to ``/etc/ceph``, you might + need to use ``sudo`` to run the command as ``root``. + +After you create a user (for example, ``client.ringo``), you must get the key and add +it to a keyring on a Ceph client so that the user can access the Ceph Storage +Cluster. + +The `User Management`_ section details how to list, get, add, modify, and delete +users directly in the Ceph Storage Cluster. In addition, Ceph provides the +``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client. + +Creating a Keyring +------------------ + +When you use the procedures in the `Managing Users`_ section to create users, +you must provide user keys to the Ceph client(s). This is required so that the Ceph client(s) +can retrieve the key for the specified user and authenticate that user against the Ceph +Storage Cluster. Ceph clients access keyrings in order to look up a user name and +retrieve the user's key. + +The ``ceph-authtool`` utility allows you to create a keyring. To create an +empty keyring, use ``--create-keyring`` or ``-C``. For example: + +.. prompt:: bash $ + + ceph-authtool --create-keyring /path/to/keyring + +When creating a keyring with multiple users, we recommend using the cluster name +(of the form ``$cluster.keyring``) for the keyring filename and saving the keyring in the +``/etc/ceph`` directory. By doing this, you ensure that the ``keyring`` configuration default setting +will pick up the filename without requiring you to specify the filename in the local copy +of your Ceph configuration file. For example, you can create ``ceph.keyring`` by +running the following command: + +.. prompt:: bash $ + + sudo ceph-authtool -C /etc/ceph/ceph.keyring + +When creating a keyring with a single user, we recommend using the cluster name, +the user type, and the user name, and saving the keyring in the ``/etc/ceph`` directory. 
+For example, we recommend that the ``client.admin`` user use ``ceph.client.admin.keyring``. + +To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means +that the file will have ``rw`` permissions for the ``root`` user only, which is +appropriate when the keyring contains administrator keys. However, if you +intend to use the keyring for a particular user or group of users, be sure to use ``chown`` or ``chmod`` to establish appropriate keyring +ownership and access. + +Adding a User to a Keyring +-------------------------- + +When you :ref:`Add a user<rados_ops_adding_a_user>` to the Ceph Storage +Cluster, you can use the `Getting a User`_ procedure to retrieve a user, key, +and capabilities and then save the user to a keyring. + +If you want to use only one user per keyring, the `Getting a User`_ procedure with +the ``-o`` option will save the output in the keyring file format. For example, +to create a keyring for the ``client.admin`` user, run the following command: + +.. prompt:: bash $ + + sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring + +Notice that the file format in this command is the file format conventionally used when manipulating the keyrings of individual users. + +If you want to import users to a keyring, you can use ``ceph-authtool`` +to specify the destination keyring and the source keyring. +For example: + +.. prompt:: bash $ + + sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring + +Creating a User +--------------- + +Ceph provides the `Adding a User`_ function to create a user directly in the Ceph +Storage Cluster. However, you can also create a user, keys, and capabilities +directly on a Ceph client keyring, and then import the user to the Ceph +Storage Cluster. For example: + +.. prompt:: bash $ + + sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring + +For additional details on capabilities, see `Authorization (Capabilities)`_. + +You can also create a keyring and add a new user to the keyring simultaneously. +For example: + +.. prompt:: bash $ + + sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key + +In the above examples, the new user ``client.ringo`` has been added only to the +keyring. The new user has not been added to the Ceph Storage Cluster. + +To add the new user ``client.ringo`` to the Ceph Storage Cluster, run the following command: + +.. prompt:: bash $ + + sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring + +Modifying a User +---------------- + +To modify the capabilities of a user record in a keyring, specify the keyring +and the user, followed by the capabilities. For example: + +.. prompt:: bash $ + + sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' + +To update the user in the Ceph Storage Cluster, you must update the user +in the keyring to the user entry in the Ceph Storage Cluster. To do so, run the following command: + +.. prompt:: bash $ + + sudo ceph auth import -i /etc/ceph/ceph.keyring + +For details on updating a Ceph Storage Cluster user from a +keyring, see `Importing a User`_ + +You may also :ref:`Modify user capabilities<modify-user-capabilities>` directly in the cluster, store the +results to a keyring file, and then import the keyring into your main +``ceph.keyring`` file. 
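For example, the following sketch (reusing the hypothetical user
``client.ringo`` and pool ``liverpool`` from the earlier examples) updates the
user's capabilities in the cluster, exports the result to a per-user keyring
file, and then merges that file into the main ``/etc/ceph/ceph.keyring``:

.. prompt:: bash $

   sudo ceph auth caps client.ringo mon 'allow r' osd 'allow rw pool=liverpool'
   sudo ceph auth get client.ringo -o /etc/ceph/ceph.client.ringo.keyring
   sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.ringo.keyring
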
+ +Command Line Usage +================== + +Ceph supports the following usage for user name and secret: + +``--id`` | ``--user`` + +:Description: Ceph identifies users with a type and an ID: the form of this user identification is ``TYPE.ID``, and examples of the type and ID are + ``client.admin`` and ``client.user1``. The ``id``, ``name`` and + ``-n`` options allow you to specify the ID portion of the user + name (for example, ``admin``, ``user1``, ``foo``). You can specify + the user with the ``--id`` and omit the type. For example, + to specify user ``client.foo``, run the following commands: + + .. prompt:: bash $ + + ceph --id foo --keyring /path/to/keyring health + ceph --user foo --keyring /path/to/keyring health + + +``--name`` | ``-n`` + +:Description: Ceph identifies users with a type and an ID: the form of this user identification is ``TYPE.ID``, and examples of the type and ID are + ``client.admin`` and ``client.user1``. The ``--name`` and ``-n`` + options allow you to specify the fully qualified user name. + You are required to specify the user type (typically ``client``) with the + user ID. For example: + + .. prompt:: bash $ + + ceph --name client.foo --keyring /path/to/keyring health + ceph -n client.foo --keyring /path/to/keyring health + + +``--keyring`` + +:Description: The path to the keyring that contains one or more user names and + secrets. The ``--secret`` option provides the same functionality, + but it does not work with Ceph RADOS Gateway, which uses + ``--secret`` for another purpose. You may retrieve a keyring with + ``ceph auth get-or-create`` and store it locally. This is a + preferred approach, because you can switch user names without + switching the keyring path. For example: + + .. prompt:: bash $ + + sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage + + +.. _pools: ../pools + +Limitations +=========== + +The ``cephx`` protocol authenticates Ceph clients and servers to each other. It +is not intended to handle authentication of human users or application programs +that are run on their behalf. If your access control +needs require that kind of authentication, you will need to have some other mechanism, which is likely to be specific to the +front end that is used to access the Ceph object store. This other mechanism would ensure that only acceptable users and programs are able to run on the +machine that Ceph permits to access its object store. + +The keys used to authenticate Ceph clients and servers are typically stored in +a plain text file on a trusted host. Appropriate permissions must be set on the plain text file. + +.. important:: Storing keys in plaintext files has security shortcomings, but + they are difficult to avoid, given the basic authentication methods Ceph + uses in the background. Anyone setting up Ceph systems should be aware of + these shortcomings. + +In particular, user machines, especially portable machines, should not +be configured to interact directly with Ceph, since that mode of use would +require the storage of a plaintext authentication key on an insecure machine. +Anyone who stole that machine or obtained access to it could +obtain a key that allows them to authenticate their own machines to Ceph. + +Instead of permitting potentially insecure machines to access a Ceph object +store directly, you should require users to sign in to a trusted machine in +your environment, using a method that provides sufficient security for your +purposes. 
That trusted machine will store the plaintext Ceph keys for the
human users. A future version of Ceph might address these particular
authentication issues more fully.

At present, none of the Ceph authentication protocols provide secrecy for
messages in transit. As a result, an eavesdropper on the wire can hear and understand
all data sent between clients and servers in Ceph, even if the eavesdropper cannot create or
alter the data. Similarly, Ceph does not include options to encrypt user data in the
object store. Users can, of course, hand-encrypt and store their own data in the Ceph
object store, but Ceph itself provides no features to perform object
encryption. Anyone storing sensitive data in Ceph should consider
encrypting their data before providing it to the Ceph system.


.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication
.. _Cephx Config Reference: ../../configuration/auth-config-ref