author    Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-27 18:24:20 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-27 18:24:20 +0000
commit    483eb2f56657e8e7f419ab1a4fab8dce9ade8609 (patch)
tree      e5d88d25d870d5dedacb6bbdbe2a966086a0a5cf /doc/rados/operations
parent    Initial commit. (diff)
download  ceph-483eb2f56657e8e7f419ab1a4fab8dce9ade8609.tar.xz, ceph-483eb2f56657e8e7f419ab1a4fab8dce9ade8609.zip
Adding upstream version 14.2.21.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat
29 files changed, 10195 insertions, 0 deletions
diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst new file mode 100644 index 00000000..ba03839a --- /dev/null +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -0,0 +1,375 @@ +.. _adding-and-removing-monitors: + +========================== + Adding/Removing Monitors +========================== + +When you have a cluster up and running, you may add or remove monitors +from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_ +or `Monitor Bootstrap`_. + +.. _adding-monitors: + +Adding Monitors +=============== + +Ceph monitors are light-weight processes that maintain a master copy of the +cluster map. You can run a cluster with 1 monitor. We recommend at least 3 +monitors for a production cluster. Ceph monitors use a variation of the +`Paxos`_ protocol to establish consensus about maps and other critical +information across the cluster. Due to the nature of Paxos, Ceph requires +a majority of monitors running to establish a quorum (thus establishing +consensus). + +It is advisable, but not mandatory, to run an odd number of monitors. An +odd number of monitors is more resilient to failures than an +even number. For instance, in a 2-monitor deployment, no +failures can be tolerated while still maintaining a quorum; with 3 monitors, +one failure can be tolerated; in a 4-monitor deployment, one failure can +be tolerated; with 5 monitors, two failures can be tolerated. This is +why an odd number is advisable. In summary, Ceph needs a majority of +monitors to be running (and able to communicate with each other), but that +majority can be achieved using a single monitor, or 2 out of 2 monitors, +2 out of 3, 3 out of 4, etc. + +For an initial deployment of a multi-node Ceph cluster, it is advisable to +deploy three monitors, increasing the number two at a time if a valid need +for more than three exists. 
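The majority arithmetic above can be sketched in a few lines. This is a conceptual illustration only, not part of Ceph; the function names are invented for this example:

```python
def quorum_size(num_mons: int) -> int:
    """Smallest majority of num_mons monitors (2 of 3, 3 of 4, 3 of 5, ...)."""
    return num_mons // 2 + 1

def tolerable_failures(num_mons: int) -> int:
    """How many monitors can fail while a quorum can still be formed."""
    return num_mons - quorum_size(num_mons)

# A 4-monitor cluster tolerates no more failures than a 3-monitor one
# (one in either case) -- hence the advice to deploy an odd number and
# grow the count two at a time.
for n in range(1, 6):
    print(n, "monitors:", tolerable_failures(n), "tolerable failure(s)")
```

Running the sketch reproduces the figures in the paragraph above: 2 monitors tolerate zero failures, 3 and 4 tolerate one, and 5 tolerate two.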
+ +Since monitors are light-weight, it is possible to run them on the same +host as an OSD; however, we recommend running them on separate hosts, +because fsync issues with the kernel may impair performance. + +.. note:: A *majority* of monitors in your cluster must be able to + reach each other in order to establish a quorum. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new monitor, see `Hardware +Recommendations`_ for details on minimum recommendations for monitor hardware. +To add a monitor host to your cluster, first make sure you have an up-to-date +version of Linux installed (typically Ubuntu 16.04 or RHEL 7). + +Add your monitor host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Packages`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. _Installing Packages: ../../../install/install-storage-cluster + + +.. _Adding a Monitor (Manual): + +Adding a Monitor (Manual) +------------------------- + +This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map +and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If +this results in only two monitor daemons, you may add more monitors by +repeating this procedure until you have a sufficient number of ``ceph-mon`` +daemons to achieve a quorum. + +At this point you should define your monitor's id. Traditionally, monitors +have been named with single letters (``a``, ``b``, ``c``, ...), but you are +free to define the id as you see fit. 
For the purpose of this document, +please take into account that ``{mon-id}`` should be the id you chose, +without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a`` +on ``mon.a``). + +#. Create the default directory on the machine that will host your + new monitor. :: + + ssh {new-mon-host} + sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} + +#. Create a temporary directory ``{tmp}`` to keep the files needed during + this process. This directory should be different from the monitor's default + directory created in the previous step, and can be removed after all the + steps are executed. :: + + mkdir {tmp} + +#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to + the retrieved keyring, and ``{key-filename}`` is the name of the file + containing the retrieved monitor key. :: + + ceph auth get mon. -o {tmp}/{key-filename} + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{map-filename}`` is the name of the file + containing the retrieved monitor map. :: + + ceph mon getmap -o {tmp}/{map-filename} + +#. Prepare the monitor's data directory created in the first step. You must + specify the path to the monitor map so that you can retrieve the + information about a quorum of monitors and their ``fsid``. You must also + specify a path to the monitor keyring:: + + sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} + + +#. Start the new monitor and it will automatically join the cluster. + The daemon needs to know which address to bind to, via either the + ``--public-addr {ip}`` or ``--public-network {network}`` argument. + For example:: + + ceph-mon -i {mon-id} --public-addr {ip:port} + +.. _removing-monitors: + +Removing Monitors +================= + +When you remove monitors from a cluster, consider that Ceph monitors use +PAXOS to establish consensus about the master cluster map. 
You must have +a sufficient number of monitors to establish a quorum for consensus about +the cluster map. + +.. _Removing a Monitor (Manual): + +Removing a Monitor (Manual) +--------------------------- + +This procedure removes a ``ceph-mon`` daemon from your cluster. If this +procedure results in only two monitor daemons, you may add or remove another +monitor until you have a number of ``ceph-mon`` daemons that can achieve a +quorum. + +#. Stop the monitor. :: + + service ceph -a stop mon.{mon-id} + +#. Remove the monitor from the cluster. :: + + ceph mon remove {mon-id} + +#. Remove the monitor entry from ``ceph.conf``. + + +Removing Monitors from an Unhealthy Cluster +------------------------------------------- + +This procedure removes a ``ceph-mon`` daemon from an unhealthy +cluster, for example a cluster where the monitors cannot form a +quorum. + + +#. Stop all ``ceph-mon`` daemons on all monitor hosts. :: + + ssh {mon-host} + service ceph stop mon || stop ceph-mon-all + # and repeat for all mons + +#. Identify a surviving monitor and log in to that host. :: + + ssh {mon-host} + +#. Extract a copy of the monmap file. :: + + ceph-mon -i {mon-id} --extract-monmap {map-path} + # in most cases, that's + ceph-mon -i `hostname` --extract-monmap /tmp/monmap + +#. Remove the non-surviving or problematic monitors. For example, if + you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where + only ``mon.a`` will survive, follow the example below:: + + monmaptool {map-path} --rm {mon-id} + # for example, + monmaptool /tmp/monmap --rm b + monmaptool /tmp/monmap --rm c + +#. Inject the surviving map with the removed monitors into the + surviving monitor(s). For example, to inject a map into monitor + ``mon.a``, follow the example below:: + + ceph-mon -i {mon-id} --inject-monmap {map-path} + # for example, + ceph-mon -i a --inject-monmap /tmp/monmap + +#. Start only the surviving monitors. + +#. Verify the monitors form a quorum (``ceph -s``). + +#. 
You may wish to archive the removed monitors' data directory in + ``/var/lib/ceph/mon`` in a safe location, or delete it if you are + confident the remaining monitors are healthy and are sufficiently + redundant. + +.. _Changing a Monitor's IP address: + +Changing a Monitor's IP Address +=============================== + +.. important:: Existing monitors are not supposed to change their IP addresses. + +Monitors are critical components of a Ceph cluster, and they need to maintain a +quorum for the whole system to work properly. To establish a quorum, the +monitors need to discover each other. Ceph has strict requirements for +discovering monitors. + +Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors. +However, monitors discover each other using the monitor map, not ``ceph.conf``. +For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you +need to obtain the current monmap for the cluster when creating a new monitor, +as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The +following sections explain the consistency requirements for Ceph monitors, and a +few safe ways to change a monitor's IP address. + + +Consistency Requirements +------------------------ + +A monitor always refers to the local copy of the monmap when discovering other +monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids +errors that could break the cluster (e.g., typos in ``ceph.conf`` when +specifying a monitor address or port). Since monitors use monmaps for discovery +and they share monmaps with clients and other Ceph daemons, the monmap provides +monitors with a strict guarantee that their consensus is valid. + +Strict consistency also applies to updates to the monmap. As with any other +updates on the monitor, changes to the monmap always run through a distributed +consensus algorithm called `Paxos`_. 
The monitors must agree on each update to +the monmap, such as adding or removing a monitor, to ensure that each monitor in +the quorum has the same version of the monmap. Updates to the monmap are +incremental so that monitors have the latest agreed upon version, and a set of +previous versions, allowing a monitor that has an older version of the monmap to +catch up with the current state of the cluster. + +If monitors discovered each other through the Ceph configuration file instead of +through the monmap, it would introduce additional risks because the Ceph +configuration files are not updated and distributed automatically. Monitors +might inadvertently use an older ``ceph.conf`` file, fail to recognize a +monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able +to determine the current state of the system accurately. Consequently, making +changes to an existing monitor's IP address must be done with great care. + + +Changing a Monitor's IP address (The Right Way) +----------------------------------------------- + +Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to +ensure that other monitors in the cluster will receive the update. To change a +monitor's IP address, you must add a new monitor with the IP address you want +to use (as described in `Adding a Monitor (Manual)`_), ensure that the new +monitor successfully joins the quorum; then, remove the monitor that uses the +old IP address. Then, update the ``ceph.conf`` file to ensure that clients and +other daemons know the IP address of the new monitor. + +For example, lets assume there are three monitors in place, such as :: + + [mon.a] + host = host01 + addr = 10.0.0.1:6789 + [mon.b] + host = host02 + addr = 10.0.0.2:6789 + [mon.c] + host = host03 + addr = 10.0.0.3:6789 + +To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the +steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. 
Ensure +that ``mon.d`` is running before removing ``mon.c``, or it will break the +quorum. Remove ``mon.c`` as described in `Removing a Monitor (Manual)`_. Moving +all three monitors would thus require repeating this process as many times as +needed. + + +Changing a Monitor's IP address (The Messy Way) +----------------------------------------------- + +There may come a time when the monitors must be moved to a different network, a +different part of the datacenter, or a different datacenter altogether. While it +is possible to do this, the process becomes a bit more hazardous. + +In such a case, the solution is to generate a new monmap with updated IP +addresses for all the monitors in the cluster, and inject the new map into each +individual monitor. This is not the most user-friendly approach, but we do not +expect this to be something that needs to be done every other week. As it is +clearly stated at the top of this section, monitors are not supposed to change +IP addresses. + +Using the previous monitor configuration as an example, assume you want to move +all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these +networks are unable to communicate. Use the following procedure: + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{map-filename}`` is the name of the file + containing the retrieved monitor map. :: + + ceph mon getmap -o {tmp}/{map-filename} + +#. The following example demonstrates the contents of the monmap. :: + + $ monmaptool --print {tmp}/{map-filename} + + monmaptool: monmap file {tmp}/{map-filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.0.0.1:6789/0 mon.a + 1: 10.0.0.2:6789/0 mon.b + 2: 10.0.0.3:6789/0 mon.c + +#. Remove the existing monitors. 
:: + + $ monmaptool --rm a --rm b --rm c {tmp}/{map-filename} + + monmaptool: monmap file {tmp}/{map-filename} + monmaptool: removing a + monmaptool: removing b + monmaptool: removing c + monmaptool: writing epoch 1 to {tmp}/{map-filename} (0 monitors) + +#. Add the new monitor locations. :: + + $ monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{map-filename} + + monmaptool: monmap file {tmp}/{map-filename} + monmaptool: writing epoch 1 to {tmp}/{map-filename} (3 monitors) + +#. Check new contents. :: + + $ monmaptool --print {tmp}/{map-filename} + + monmaptool: monmap file {tmp}/{map-filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.1.0.1:6789/0 mon.a + 1: 10.1.0.2:6789/0 mon.b + 2: 10.1.0.3:6789/0 mon.c + +At this point, we assume the monitors (and stores) are installed at the new +location. The next step is to propagate the modified monmap to the new +monitors, and inject the modified monmap into each new monitor. + +#. First, make sure to stop all your monitors. Injection must be done while + the daemon is not running. + +#. Inject the monmap. :: + + ceph-mon -i {mon-id} --inject-monmap {tmp}/{map-filename} + +#. Restart the monitors. + +After this step, migration to the new location is complete and +the monitors should operate successfully. + + +.. _Manual Deployment: ../../../install/manual-deployment +.. _Monitor Bootstrap: ../../../dev/mon-bootstrap +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) diff --git a/doc/rados/operations/add-or-rm-osds.rst b/doc/rados/operations/add-or-rm-osds.rst new file mode 100644 index 00000000..90de7548 --- /dev/null +++ b/doc/rados/operations/add-or-rm-osds.rst @@ -0,0 +1,337 @@ +====================== + Adding/Removing OSDs +====================== + +When you have a cluster up and running, you may add OSDs or remove OSDs +from the cluster at runtime. 
+ +Adding OSDs +=========== + +When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an +OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a +host machine. If your host has multiple storage drives, you may map one +``ceph-osd`` daemon for each drive. + +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. As your cluster reaches its ``near +full`` ratio, you should add one or more OSDs to expand your cluster's capacity. + +.. warning:: Do not let your cluster reach its ``full ratio`` before + adding an OSD. OSD failures that occur after the cluster reaches + its ``near full`` ratio may cause the cluster to exceed its + ``full ratio``. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new OSD, see `Hardware +Recommendations`_ for details on minimum recommendations for OSD hardware. To +add an OSD host to your cluster, first make sure you have an up-to-date version +of Linux installed, and you have made some initial preparations for your +storage drives. See `Filesystem Recommendations`_ for details. + +Add your OSD host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. See the `Network Configuration +Reference`_ for details. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations +.. _Network Configuration Reference: ../../configuration/network-config-ref + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Ceph (Manual)`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. 
_Installing Ceph (Manual): ../../../install + + +Adding an OSD (Manual) +---------------------- + +This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive, +and configures the cluster to distribute data to the OSD. If your host has +multiple drives, you may add an OSD for each drive by repeating this procedure. + +To add an OSD, create a data directory for it, mount a drive to that directory, +add the OSD to the cluster, and then add it to the CRUSH map. + +When you add the OSD to the CRUSH map, consider the weight you give to the new +OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger +hard drives than older hosts in the cluster (i.e., they may have greater +weight). + +.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives + of dissimilar size, you can adjust their weights. However, for best + performance, consider a CRUSH hierarchy with drives of the same type/size. + +#. Create the OSD. If no UUID is given, it will be set automatically when the + OSD starts up. The following command will output the OSD number, which you + will need for subsequent steps. :: + + ceph osd create [{uuid} [{id}]] + + If the optional parameter {id} is given it will be used as the OSD id. + Note, in this case the command may fail if the number is already in use. + + .. warning:: In general, explicitly specifying {id} is not recommended. + IDs are allocated as an array, and skipping entries consumes some extra + memory. This can become significant if there are large gaps and/or + clusters are large. If {id} is not specified, the smallest available is + used. + +#. Create the default directory on your new OSD. :: + + ssh {new-osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + + +#. 
If the OSD is for a drive other than the OS drive, prepare it + for use with Ceph, and mount it to the directory you just created:: + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{drive} + sudo mount -o user_xattr /dev/{drive} /var/lib/ceph/osd/ceph-{osd-number} + + +#. Initialize the OSD data directory. :: + + ssh {new-osd-host} + ceph-osd -i {osd-num} --mkfs --mkkey + + The directory must be empty before you can run ``ceph-osd``. + +#. Register the OSD authentication key. The value of ``ceph`` for + ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your + cluster name differs from ``ceph``, use your cluster name instead. :: + + ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring + + +#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The + ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy + wherever you wish. If you specify at least one bucket, the command + will place the OSD into the most specific bucket you specify, *and* it will + move that bucket underneath any other buckets you specify. **Important:** If + you specify only the root bucket, the command will attach the OSD directly + to the root, but CRUSH rules expect OSDs to be inside of hosts. + + Execute the following:: + + ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] + + You may also decompile the CRUSH map, add the OSD to the device list, add the + host as a bucket (if it's not already in the CRUSH map), add the device as an + item in the host, assign it a weight, recompile it and set it. See + `Add/Move an OSD`_ for details. + + +.. _rados-replacing-an-osd: + +Replacing an OSD +---------------- + +When disks fail, or if an administrator wants to reprovision OSDs with a new +backend, for instance, for switching from FileStore to BlueStore, OSDs need to +be replaced. 
Unlike `Removing the OSD`_, a replaced OSD's id and CRUSH map entry +need to be kept intact after the OSD is destroyed for replacement. + +#. Destroy the OSD first:: + + ceph osd destroy {id} --yes-i-really-mean-it + +#. Zap a disk for the new OSD, if the disk was used before for other purposes. + It's not necessary for a new disk:: + + ceph-volume lvm zap /dev/sdX + +#. Prepare the disk for replacement by using the previously destroyed OSD id:: + + ceph-volume lvm prepare --osd-id {id} --data /dev/sdX + +#. And activate the OSD:: + + ceph-volume lvm activate {id} {fsid} + +Alternatively, instead of preparing and activating, the device can be recreated +in one call, like:: + + ceph-volume lvm create --osd-id {id} --data /dev/sdX + + +Starting the OSD +---------------- + +After you add an OSD to Ceph, the OSD is in your configuration. However, +it is not yet running. The OSD is ``down`` and ``in``. You must start +your new OSD before it can begin receiving data. You may use +``service ceph`` from your admin host or start the OSD from its host +machine. + +For Ubuntu Trusty use Upstart. :: + + sudo start ceph-osd id={osd-num} + +For all other distros use systemd. :: + + sudo systemctl start ceph-osd@{osd-num} + + +Once you start your OSD, it is ``up`` and ``in``. + + +Observe the Data Migration +-------------------------- + +Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing +the cluster by migrating placement groups to your new OSD. You can observe this +process with the `ceph`_ tool. :: + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + + +.. _Add/Move an OSD: ../crush-map#addosd +.. _ceph: ../monitoring + + + +Removing OSDs (Manual) +====================== + +When you want to reduce the size of a cluster or replace hardware, you may +remove an OSD at runtime. 
With Ceph, an OSD is generally one Ceph ``ceph-osd`` +daemon for one storage drive within a host machine. If your host has multiple +storage drives, you may need to remove one ``ceph-osd`` daemon for each drive. +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. Ensure that when you remove an OSD, +your cluster is not at its ``near full`` ratio. + +.. warning:: Do not let your cluster reach its ``full ratio`` when + removing an OSD. Removing OSDs could cause the cluster to reach + or exceed its ``full ratio``. + + +Take the OSD out of the Cluster +----------------------------------- + +Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it +out of the cluster so that Ceph can begin rebalancing and copying its data to +other OSDs. :: + + ceph osd out {osd-num} + + +Observe the Data Migration +-------------------------- + +Once you have taken your OSD ``out`` of the cluster, Ceph will begin +rebalancing the cluster by migrating placement groups out of the OSD you +removed. You can observe this process with the `ceph`_ tool. :: + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + +.. note:: Sometimes, typically in a "small" cluster with few hosts (for + instance with a small testing cluster), taking the OSD ``out`` can + trigger a CRUSH corner case where some PGs remain stuck in the + ``active+remapped`` state. If this happens, you should mark + the OSD ``in`` with: + + ``ceph osd in {osd-num}`` + + to return to the initial state and then, instead of marking ``out`` + the OSD, set its weight to 0 with: + + ``ceph osd crush reweight osd.{osd-num} 0`` + + After that, you can observe the data migration which should come to its + end. 
The difference between marking the OSD ``out`` and reweighting it + to 0 is that in the first case the weight of the bucket which contains + the OSD is not changed, whereas in the second case the weight of the bucket + is updated (decreased by the OSD's weight). The reweight command can + sometimes be preferable in the case of a "small" cluster. + + + +Stopping the OSD +---------------- + +After you take an OSD out of the cluster, it may still be running. +That is, the OSD may be ``up`` and ``out``. You must stop +your OSD before you remove it from the configuration. :: + + ssh {osd-host} + sudo systemctl stop ceph-osd@{osd-num} + +Once you stop your OSD, it is ``down``. + + +Removing the OSD +---------------- + +This procedure removes an OSD from a cluster map, removes its authentication +key, removes the OSD from the OSD map, and removes the OSD from the +``ceph.conf`` file. If your host has multiple drives, you may need to remove an +OSD for each drive by repeating this procedure. + +#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH + map, removes its authentication key, and removes it from the OSD map as + well. Please note that the :ref:`purge subcommand <ceph-admin-osd>` was introduced in Luminous; for older + versions, see below. :: + + ceph osd purge {id} --yes-i-really-mean-it + +#. Navigate to the host where you keep the master copy of the cluster's + ``ceph.conf`` file. :: + + ssh {admin-host} + cd /etc/ceph + vim ceph.conf + +#. Remove the OSD entry from your ``ceph.conf`` file (if it exists). :: + + [osd.1] + host = {hostname} + +#. From the host where you keep the master copy of the cluster's ``ceph.conf`` file, + copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of other + hosts in your cluster. + +If your Ceph cluster is older than Luminous, instead of using ``ceph osd purge``, +you need to perform this step manually: + + +#. Remove the OSD from the CRUSH map so that it no longer receives data. 
You may + also decompile the CRUSH map, remove the OSD from the device list, remove the + device as an item in the host bucket or remove the host bucket (if it's in the + CRUSH map and you intend to remove the host), recompile the map and set it. + See `Remove an OSD`_ for details. :: + + ceph osd crush remove {name} + +#. Remove the OSD authentication key. :: + + ceph auth del osd.{osd-num} + + The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. + If your cluster name differs from ``ceph``, use your cluster name instead. + +#. Remove the OSD. :: + + ceph osd rm {osd-num} + #for example + ceph osd rm 1 + + +.. _Remove an OSD: ../crush-map#removeosd diff --git a/doc/rados/operations/balancer.rst b/doc/rados/operations/balancer.rst new file mode 100644 index 00000000..530e0dc4 --- /dev/null +++ b/doc/rados/operations/balancer.rst @@ -0,0 +1,144 @@ + +.. _balancer: + +Balancer +======== + +The *balancer* can optimize the placement of PGs across OSDs in +order to achieve a balanced distribution, either automatically or in a +supervised fashion. + +Status +------ + +The current status of the balancer can be checked at any time with:: + + ceph balancer status + + +Automatic balancing +------------------- + +The automatic balancing can be enabled, using the default settings, with:: + + ceph balancer on + +The balancer can be turned back off again with:: + + ceph balancer off + +This will use the ``crush-compat`` mode, which is backward compatible +with older clients, and will make small changes to the data +distribution over time to ensure that OSDs are equally utilized. + + +Throttling +---------- + +No adjustments will be made to the PG distribution if the cluster is +degraded (e.g., because an OSD has failed and the system has not yet +healed itself). 
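This gating, combined with the misplaced-PG throttle controlled by ``target_max_misplaced_ratio``, can be sketched as a small decision function. This is an illustration with invented names, not the actual mgr balancer code; it assumes only the behavior described in this section (skip balancing while degraded, otherwise keep misplaced PGs below the target, 5% by default):

```python
def balancer_may_run(degraded: bool, misplaced_ratio: float,
                     target_max_misplaced_ratio: float = 0.05) -> bool:
    """Sketch of the balancer's throttle: never adjust PG placement while
    the cluster is degraded, and otherwise only propose changes while the
    fraction of misplaced PGs stays below the target (default 5%)."""
    if degraded:
        # e.g. an OSD has failed and the system has not yet healed itself
        return False
    return misplaced_ratio < target_max_misplaced_ratio
```

For example, with a healthy cluster and 1% of PGs misplaced the balancer may keep working, while a degraded cluster, or one already at 10% misplaced, gets no further adjustments.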
+ +When the cluster is healthy, the balancer will throttle its changes +such that the percentage of PGs that are misplaced (i.e., that need to +be moved) is below a threshold of (by default) 5%. The +``target_max_misplaced_ratio`` threshold can be adjusted with:: + + ceph config set mgr target_max_misplaced_ratio .07 # 7% + + +Modes +----- + +There are currently two supported balancer modes: + +#. **crush-compat**. The CRUSH compat mode uses the compat weight-set + feature (introduced in Luminous) to manage an alternative set of + weights for devices in the CRUSH hierarchy. The normal weights + should remain set to the size of the device to reflect the target + amount of data that we want to store on the device. The balancer + then optimizes the weight-set values, adjusting them up or down in + small increments, in order to achieve a distribution that matches + the target distribution as closely as possible. (Because PG + placement is a pseudorandom process, there is a natural amount of + variation in the placement; by optimizing the weights we + counter-act that natural variation.) + + Notably, this mode is *fully backwards compatible* with older + clients: when an OSDMap and CRUSH map is shared with older clients, + we present the optimized weights as the "real" weights. + + The primary restriction of this mode is that the balancer cannot + handle multiple CRUSH hierarchies with different placement rules if + the subtrees of the hierarchy share any OSDs. (This is normally + not the case, and is generally not a recommended configuration + because it is hard to manage the space utilization on the shared + OSDs.) + +#. **upmap**. Starting with Luminous, the OSDMap can store explicit + mappings for individual OSDs as exceptions to the normal CRUSH + placement calculation. These `upmap` entries provide fine-grained + control over the PG mapping. This CRUSH mode will optimize the + placement of individual PGs in order to achieve a balanced + distribution. 
In most cases, this distribution is "perfect," with + an equal number of PGs on each OSD (+/-1 PG, since they might not + divide evenly). + + Note that using upmap requires that all clients be Luminous or newer. + +The default mode is ``crush-compat``. The mode can be adjusted with:: + + ceph balancer mode upmap + +or:: + + ceph balancer mode crush-compat + +Supervised optimization +----------------------- + +The balancer operation is broken into a few distinct phases: + +#. building a *plan* +#. evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result after executing a *plan* +#. executing the *plan* + +To evaluate and score the current distribution:: + + ceph balancer eval + +You can also evaluate the distribution for a single pool with:: + + ceph balancer eval <pool-name> + +Greater detail for the evaluation can be seen with:: + + ceph balancer eval-verbose ... + +The balancer can generate a plan, using the currently configured mode, with:: + + ceph balancer optimize <plan-name> + +The name is provided by the user and can be any useful identifying string. 
The contents of a plan can be seen with:: + + ceph balancer show <plan-name> + +All plans can be shown with:: + + ceph balancer ls + +Old plans can be discarded with:: + + ceph balancer rm <plan-name> + +Currently recorded plans are shown as part of the status command:: + + ceph balancer status + +The quality of the distribution that would result after executing a plan can be calculated with:: + + ceph balancer eval <plan-name> + +Assuming the plan is expected to improve the distribution (i.e., it has a lower score than the current cluster state), the user can execute that plan with:: + + ceph balancer execute <plan-name> diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst new file mode 100644 index 00000000..91a64c93 --- /dev/null +++ b/doc/rados/operations/bluestore-migration.rst @@ -0,0 +1,292 @@ +===================== + BlueStore Migration +===================== + +Each OSD can run either BlueStore or FileStore, and a single Ceph +cluster can contain a mix of both. Users who have previously deployed +FileStore are likely to want to transition to BlueStore in order to +take advantage of the improved performance and robustness. There are +several strategies for making such a transition. + +An individual OSD cannot be converted in place in isolation, however: +BlueStore and FileStore are simply too different for that to be +practical. "Conversion" will rely either on the cluster's normal +replication and healing support or tools and strategies that copy OSD +content from an old (FileStore) device to a new (BlueStore) one. + + +Deploy new OSDs with BlueStore +============================== + +Any new OSDs (e.g., when the cluster is expanded) can be deployed +using BlueStore. This is the default behavior so no specific change +is needed. + +Similarly, any OSDs that are reprovisioned after replacing a failed drive +can use BlueStore. 
Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to mark out each device in turn, wait for the
data to replicate across the cluster, reprovision the OSD, and mark
it back in again.  It is simple and easy to automate.  However, it requires
more data migration than should be necessary, so it is not optimal.

#. Identify a FileStore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is FileStore or BlueStore with::

     ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of FileStore vs BlueStore OSDs with::

     ceph osd count-metadata osd_objectstore

#. Mark the FileStore OSD out::

     ceph osd out $ID

#. Wait for the data to migrate off the OSD in question::

     while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD::

     systemctl kill ceph-osd@$ID

#. Make note of which device this OSD is using::

     mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD::

     umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data.  Be *EXTREMELY CAREFUL* as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding. ::

     ceph-volume lvm zap $DEVICE

#. Tell the cluster the OSD has been destroyed (and a new OSD can be
   reprovisioned with the same ID)::

     ceph osd destroy $ID --yes-i-really-mean-it

#. Reprovision a BlueStore OSD in its place with the same OSD ID.
   This requires that you identify which device to wipe based on what you saw
   mounted above.  BE CAREFUL! ::

     ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.
You can allow the refilling of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs.  Failure to do so will reduce the redundancy of your data
and increase the risk of (or potentially even cause) data loss.

Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.

First, you need an empty host that has no data.  There are two ways to
obtain one: either by starting with a new, empty host that isn't yet
part of the cluster, or by offloading data from an existing host that
is already in the cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the
same capacity as the other hosts you will be converting (although it
doesn't strictly matter). ::

    NEWHOST=<empty-host-name>

Add the host to the CRUSH hierarchy, but do not attach it to the root::

    ceph osd crush add-bucket $NEWHOST host

Make sure the ceph packages are installed.
Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off,
then you can instead do::

    OLDHOST=<existing-cluster-host-to-offload>
    ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor in the CRUSH map.  (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.)  You should now
see the host at the top of the OSD tree output with no parent::

    $ bin/ceph osd tree
    ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
    -5             0 host oldhost
    10   ssd 1.00000     osd.10        up  1.00000 1.00000
    11   ssd 1.00000     osd.11        up  1.00000 1.00000
    12   ssd 1.00000     osd.12        up  1.00000 1.00000
    -1       3.00000 root default
    -2       3.00000     host foo
     0   ssd 1.00000         osd.0     up  1.00000 1.00000
     1   ssd 1.00000         osd.1     up  1.00000 1.00000
     2   ssd 1.00000         osd.2     up  1.00000 1.00000
    ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1.  For an existing host,
jump to step #5 below.

#. Provision new BlueStore OSDs for all devices::

     ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify OSDs join the cluster with::

     ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in
   the hierarchy (like ``root default``).
   For example, if ``newhost`` is
   the empty host, you might see something like::

     $ bin/ceph osd tree
     ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
     -5             0 host newhost
     10   ssd 1.00000     osd.10        up  1.00000 1.00000
     11   ssd 1.00000     osd.11        up  1.00000 1.00000
     12   ssd 1.00000     osd.12        up  1.00000 1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0   ssd 1.00000         osd.0     up  1.00000 1.00000
      1   ssd 1.00000         osd.1     up  1.00000 1.00000
      2   ssd 1.00000         osd.2     up  1.00000 1.00000
     ...

#. Identify the first target host to convert ::

     OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster::

     ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to OSDs
   on ``$NEWHOST``.  If there is a difference in the total capacity of
   the old and new hosts you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.

#. Wait for data migration to complete::

     while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``::

     ssh $OLDHOST
     systemctl kill ceph-osd.target
     umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs::

     for osd in `ceph osd ls-tree $OLDHOST`; do
         ceph osd purge $osd --yes-i-really-mean-it
     done

#. Wipe the old OSD devices.  This requires that you identify which
   devices are to be wiped manually (BE CAREFUL!).  For each device::

     ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat::

     NEWHOST=$OLDHOST

Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time.  This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``.  This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD.  For
example, if each host in your cluster has 12 OSDs, then you'd need a
13th available device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.

Caveats:

* This strategy requires that a blank BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support.  More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* Tooling not implemented!

* Not documented!

Advantages:

* Little or no data migrates over the network during the conversion.

Disadvantages:

* Tooling not fully implemented.
* Process not documented.
* Each host must have a spare or empty device.
* The OSD is offline during the conversion, which means new writes will
  be written to only a subset of the OSDs.  This increases the risk of data
  loss due to a subsequent failure.  (However, if there is a failure before
  conversion is complete, the original FileStore OSD can be started to provide
  access to its original data.)
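A back-of-envelope comparison makes the trade-off between the three strategies concrete. The host size and OSD count below are illustrative assumptions, not measurements:

```python
# Rough network-traffic comparison of the three conversion strategies
# above, for a hypothetical host with 12 OSDs of 4 TB each.
osds_per_host = 12
osd_size_tb = 4
host_data_tb = osds_per_host * osd_size_tb  # 48 TB stored on the host

# Mark out and replace: each OSD's data is copied off to peer OSDs,
# then copied back onto the reprovisioned BlueStore OSD -> two hops.
mark_out_traffic_tb = 2 * host_data_tb

# Whole host replacement: swap-bucket migrates each stored copy once.
host_swap_traffic_tb = 1 * host_data_tb

# Per-OSD device copy: the copy is host-local, so little or no
# conversion traffic crosses the network.
device_copy_traffic_tb = 0 * host_data_tb

print(mark_out_traffic_tb, host_swap_traffic_tb, device_copy_traffic_tb)
# prints: 96 48 0
```

The arithmetic ignores concurrent client I/O and recovery overhead, but it shows why the whole-host approach halves the migration cost of mark-out-and-replace.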
diff --git a/doc/rados/operations/cache-tiering.rst b/doc/rados/operations/cache-tiering.rst new file mode 100644 index 00000000..237b6e3c --- /dev/null +++ b/doc/rados/operations/cache-tiering.rst @@ -0,0 +1,475 @@ +=============== + Cache Tiering +=============== + +A cache tier provides Ceph Clients with better I/O performance for a subset of +the data stored in a backing storage tier. Cache tiering involves creating a +pool of relatively fast/expensive storage devices (e.g., solid state drives) +configured to act as a cache tier, and a backing pool of either erasure-coded +or relatively slower/cheaper devices configured to act as an economical storage +tier. The Ceph objecter handles where to place the objects and the tiering +agent determines when to flush objects from the cache to the backing storage +tier. So the cache tier and the backing storage tier are completely transparent +to Ceph clients. + + +.. ditaa:: + +-------------+ + | Ceph Client | + +------+------+ + ^ + Tiering is | + Transparent | Faster I/O + to Ceph | +---------------+ + Client Ops | | | + | +----->+ Cache Tier | + | | | | + | | +-----+---+-----+ + | | | ^ + v v | | Active Data in Cache Tier + +------+----+--+ | | + | Objecter | | | + +-----------+--+ | | + ^ | | Inactive Data in Storage Tier + | v | + | +-----+---+-----+ + | | | + +----->| Storage Tier | + | | + +---------------+ + Slower I/O + + +The cache tiering agent handles the migration of data between the cache tier +and the backing storage tier automatically. However, admins have the ability to +configure how this migration takes place by setting the ``cache-mode``. There are +two main scenarios: + +- **writeback** mode: When admins configure tiers with ``writeback`` mode, Ceph + clients write data to the cache tier and receive an ACK from the cache tier. + In time, the data written to the cache tier migrates to the storage tier + and gets flushed from the cache tier. 
Conceptually, the cache tier is + overlaid "in front" of the backing storage tier. When a Ceph client needs + data that resides in the storage tier, the cache tiering agent migrates the + data to the cache tier on read, then it is sent to the Ceph client. + Thereafter, the Ceph client can perform I/O using the cache tier, until the + data becomes inactive. This is ideal for mutable data (e.g., photo/video + editing, transactional data, etc.). + +- **readproxy** mode: This mode will use any objects that already + exist in the cache tier, but if an object is not present in the + cache the request will be proxied to the base tier. This is useful + for transitioning from ``writeback`` mode to a disabled cache as it + allows the workload to function properly while the cache is drained, + without adding any new objects to the cache. + +Other cache modes are: + +- **readonly** promotes objects to the cache on read operations only; write + operations are forwarded to the base tier. This mode is intended for + read-only workloads that do not require consistency to be enforced by the + storage system. (**Warning**: when objects are updated in the base tier, + Ceph makes **no** attempt to sync these updates to the corresponding objects + in the cache. Since this mode is considered experimental, a + ``--yes-i-really-mean-it`` option must be passed in order to enable it.) + +- **none** is used to completely disable caching. + + +A word of caution +================= + +Cache tiering will *degrade* performance for most workloads. Users should use +extreme caution before using this feature. + +* *Workload dependent*: Whether a cache will improve performance is + highly dependent on the workload. Because there is a cost + associated with moving objects into or out of the cache, it can only + be effective when there is a *large skew* in the access pattern in + the data set, such that most of the requests touch a small number of + objects. 
  The cache pool should be large enough to capture the
  working set for your workload to avoid thrashing.

* *Difficult to benchmark*: Most benchmarks that users run to measure
  performance will show terrible performance with cache tiering, in
  part because very few of them skew requests toward a small set of
  objects, because it can take a long time for the cache to "warm up,"
  and because the warm-up cost can be high.

* *Usually slower*: For workloads that are not cache tiering-friendly,
  performance is often slower than a normal RADOS pool without cache
  tiering enabled.

* *librados object enumeration*: The librados-level object enumeration
  API is not meant to be coherent in the presence of the cache.  If
  your application is using librados directly and relies on object
  enumeration, cache tiering will probably not work as expected.
  (This is not a problem for RGW, RBD, or CephFS.)

* *Complexity*: Enabling cache tiering means that a lot of additional
  machinery and complexity within the RADOS cluster is being used.
  This increases the probability that you will encounter a bug in the system
  that other users have not yet encountered and will put your deployment at a
  higher level of risk.

Known Good Workloads
--------------------

* *RGW time-skewed*: If the RGW workload is such that almost all read
  operations are directed at recently written objects, a simple cache
  tiering configuration that destages recently written objects from
  the cache to the base tier after a configurable period can work
  well.

Known Bad Workloads
-------------------

The following configurations are *known to work poorly* with cache
tiering.

* *RBD with replicated cache and erasure-coded base*: This is a common
  request, but usually does not perform well.
  Even reasonably skewed
  workloads still send some small writes to cold objects, and because
  small writes are not yet supported by the erasure-coded pool, entire
  (usually 4 MB) objects must be migrated into the cache in order to
  satisfy a small (often 4 KB) write.  Only a handful of users have
  successfully deployed this configuration, and it only works for them
  because their data is extremely cold (backups) and they are not in
  any way sensitive to performance.

* *RBD with replicated cache and base*: RBD with a replicated base
  tier does better than when the base is erasure coded, but it is
  still highly dependent on the amount of skew in the workload, and
  very difficult to validate.  The user will need to have a good
  understanding of their workload and will need to tune the cache
  tiering parameters carefully.


Setting Up Pools
================

To set up cache tiering, you must have two pools.  One will act as the
backing storage and the other will act as the cache.


Setting Up a Backing Storage Pool
---------------------------------

Setting up a backing storage pool typically involves one of two scenarios:

- **Standard Storage**: In this scenario, the pool stores multiple copies
  of an object in the Ceph Storage Cluster.

- **Erasure Coding:** In this scenario, the pool uses erasure coding to
  store data much more efficiently with a small performance tradeoff.

In the standard storage scenario, you can set up a CRUSH rule to establish
the failure domain (e.g., osd, host, chassis, rack, row, etc.).  Ceph OSD
Daemons perform optimally when all storage drives in the rule are of the
same size, speed (both RPMs and throughput) and type.  See `CRUSH Maps`_
for details on creating a rule.  Once you have created a rule, create
a backing storage pool.

In the erasure coding scenario, the pool creation arguments will generate the
appropriate rule automatically.  See `Create a Pool`_ for details.
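The space efficiency of erasure coding comes from storing k data chunks plus m coding chunks instead of full replicas. The sketch below illustrates the idea with single-parity XOR (m=1); it is a conceptual toy, not Ceph's EC plugin machinery (which uses configurable profiles such as jerasure):

```python
# Toy k+m erasure coding with m=1 (XOR parity), for illustration only.
# Ceph's EC pools use pluggable codes (jerasure, isa, etc.) and
# per-pool profiles; this demonstrates the concept, not the implementation.
from functools import reduce


def encode(data: bytes, k: int):
    """Split data into k equal-size chunks plus one XOR parity chunk."""
    size = -(-len(data) // k)  # ceil division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0")
              for i in range(k)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)
    return chunks + [parity]


def recover(shards, lost_index):
    """Rebuild one lost shard by XOR-ing the k surviving shards."""
    survivors = [s for i, s in enumerate(shards)
                 if i != lost_index and s is not None]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)


shards = encode(b"hello ceph ec", k=2)  # 2 data chunks + 1 parity chunk
shards[0] = None                        # simulate losing one OSD's shard
rebuilt = recover(shards, 0)
assert rebuilt == b"hello c"            # the lost data chunk comes back
```

With k=2, m=1 the storage overhead is (k+m)/k = 1.5x, versus 3x for three-way replication; the cost is that partial overwrites require read-modify-write of whole chunks, which is why small writes to EC pools are expensive.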
+ +In subsequent examples, we will refer to the backing storage pool +as ``cold-storage``. + + +Setting Up a Cache Pool +----------------------- + +Setting up a cache pool follows the same procedure as the standard storage +scenario, but with this difference: the drives for the cache tier are typically +high performance drives that reside in their own servers and have their own +CRUSH rule. When setting up such a rule, it should take account of the hosts +that have the high performance drives while omitting the hosts that don't. See +`Placing Different Pools on Different OSDs`_ for details. + + +In subsequent examples, we will refer to the cache pool as ``hot-storage`` and +the backing pool as ``cold-storage``. + +For cache tier configuration and default values, see +`Pools - Set Pool Values`_. + + +Creating a Cache Tier +===================== + +Setting up a cache tier involves associating a backing storage pool with +a cache pool :: + + ceph osd tier add {storagepool} {cachepool} + +For example :: + + ceph osd tier add cold-storage hot-storage + +To set the cache mode, execute the following:: + + ceph osd tier cache-mode {cachepool} {cache-mode} + +For example:: + + ceph osd tier cache-mode hot-storage writeback + +The cache tiers overlay the backing storage tier, so they require one +additional step: you must direct all client traffic from the storage pool to +the cache pool. To direct client traffic directly to the cache pool, execute +the following:: + + ceph osd tier set-overlay {storagepool} {cachepool} + +For example:: + + ceph osd tier set-overlay cold-storage hot-storage + + +Configuring a Cache Tier +======================== + +Cache tiers have several configuration options. You may set +cache tier configuration options with the following usage:: + + ceph osd pool set {cachepool} {key} {value} + +See `Pools - Set Pool Values`_ for details. 
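The cache tier tracks recent accesses in HitSets, which (as configured in the next section) are Bloom filters: compact, probabilistic set-membership structures that can yield false positives but never false negatives. A minimal sketch of the idea, purely illustrative and unrelated to the OSD's actual implementation:

```python
# Minimal Bloom filter sketch: the kind of structure a cache-tier HitSet
# uses to answer "was this object accessed during this interval?" in a
# few bits per object.  Illustrative only -- Ceph's hit sets live inside
# the OSD with tuned sizes and false-positive rates.
import hashlib


class BloomFilter:
    def __init__(self, nbits=1024, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0  # a big int used as a bit array

    def _positions(self, key: str):
        # Derive nhashes bit positions from salted SHA-256 digests.
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key: str):
        # May report a false positive, but never a false negative.
        return all(self.bits & (1 << pos) for pos in self._positions(key))


hits = BloomFilter()
hits.add("rbd_data.abc123.0000000000000001")
assert "rbd_data.abc123.0000000000000001" in hits  # recorded accesses are found
```

This is why a HitSet can cover a whole ``hit_set_period`` of traffic in little RAM, and why checking several archived HitSets (``min_read_recency_for_promote``) is cheap.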
Target Size and Type
--------------------

Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``::

    ceph osd pool set {cachepool} hit_set_type bloom

For example::

    ceph osd pool set hot-storage hit_set_type bloom

The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to
store, and how much time each HitSet should cover. ::

    ceph osd pool set {cachepool} hit_set_count 12
    ceph osd pool set {cachepool} hit_set_period 14400
    ceph osd pool set {cachepool} target_max_bytes 1000000000000

.. note:: A larger ``hit_set_count`` results in more RAM consumed by
   the ``ceph-osd`` process.

Binning accesses over time allows Ceph to determine whether a Ceph client
accessed an object at least once, or more than once over a time period
("age" vs "temperature").

The ``min_read_recency_for_promote`` defines how many HitSets to check for the
existence of an object when handling a read operation.  The checking result is
used to decide whether to promote the object asynchronously.  Its value should be
between 0 and ``hit_set_count``.  If it's set to 0, the object is always promoted.
If it's set to 1, only the current HitSet is checked, and the object is promoted
if it is present there.  For higher values, the object is promoted if it is
found in any of the most recent ``min_read_recency_for_promote`` HitSets.

A similar parameter can be set for the write operation, which is
``min_write_recency_for_promote``. ::

    ceph osd pool set {cachepool} min_read_recency_for_promote 2
    ceph osd pool set {cachepool} min_write_recency_for_promote 2

.. note:: The longer the period and the higher the
   ``min_read_recency_for_promote`` and
   ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
   daemon consumes.
   In particular, when the agent is actively flushing
   or evicting cache objects, all ``hit_set_count`` HitSets are loaded
   into RAM.


Cache Sizing
------------

The cache tiering agent performs two main functions:

- **Flushing:** The agent identifies modified (or dirty) objects and forwards
  them to the storage pool for long-term storage.

- **Evicting:** The agent identifies objects that haven't been modified
  (or clean) and evicts the least recently used among them from the cache.


Absolute Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects based upon the total number
of bytes or the total number of objects.  To specify a maximum number of bytes,
execute the following::

    ceph osd pool set {cachepool} target_max_bytes {#bytes}

For example, to flush or evict at 1 TB, execute the following::

    ceph osd pool set hot-storage target_max_bytes 1099511627776


To specify the maximum number of objects, execute the following::

    ceph osd pool set {cachepool} target_max_objects {#objects}

For example, to flush or evict at 1M objects, execute the following::

    ceph osd pool set hot-storage target_max_objects 1000000

.. note:: Ceph is not able to determine the size of a cache pool automatically, so
   configuring an absolute size is required here; otherwise,
   flushing and eviction will not work.  If you specify both limits, the cache
   tiering agent will begin flushing or evicting when either threshold is
   triggered.

.. note:: All client requests will be blocked only when ``target_max_bytes`` or
   ``target_max_objects`` is reached.

Relative Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects relative to the size of the
cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
`Absolute Sizing`_).  When the cache pool consists of a certain percentage of
modified (or dirty) objects, the cache tiering agent will flush them to the
storage pool.
To set the ``cache_target_dirty_ratio``, execute the following::

    ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, setting the value to ``0.4`` will begin flushing modified
(dirty) objects when they reach 40% of the cache pool's capacity::

    ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When the dirty objects reach a higher percentage of the cache pool's capacity,
the agent flushes them at a higher speed.  To set the
``cache_target_dirty_high_ratio``::

    ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, setting the value to ``0.6`` will begin aggressively flushing
dirty objects when they reach 60% of the cache pool's capacity.  This value
should be set between ``cache_target_dirty_ratio`` and
``cache_target_full_ratio``::

    ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6

When the cache pool reaches a certain percentage of its capacity, the cache
tiering agent will evict objects to maintain free capacity.  To set the
``cache_target_full_ratio``, execute the following::

    ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}

For example, setting the value to ``0.8`` will begin flushing unmodified
(clean) objects when they reach 80% of the cache pool's capacity::

    ceph osd pool set hot-storage cache_target_full_ratio 0.8


Cache Age
---------

You can specify the minimum age of an object before the cache tiering agent
flushes a recently modified (or dirty) object to the backing storage pool::

    ceph osd pool set {cachepool} cache_min_flush_age {#seconds}

For example, to flush modified (or dirty) objects after 10 minutes, execute
the following::

    ceph osd pool set hot-storage cache_min_flush_age 600

You can specify the minimum age of an object before it will be evicted from
the cache tier::

    ceph osd pool set {cachepool} cache_min_evict_age {#seconds}

For example, to evict objects after 30 minutes, execute the following::

    ceph osd pool set hot-storage cache_min_evict_age 1800


Removing a Cache Tier
=====================

Removing a cache tier differs depending on whether it is a writeback
cache or a read-only cache.


Removing a Read-Only Cache
--------------------------

Since a read-only cache does not have modified data, you can disable
and remove it without losing any recent changes to objects in the cache.

#. Change the cache-mode to ``none`` to disable it. ::

     ceph osd tier cache-mode {cachepool} none

   For example::

     ceph osd tier cache-mode hot-storage none

#. Remove the cache pool from the backing pool. ::

     ceph osd tier remove {storagepool} {cachepool}

   For example::

     ceph osd tier remove cold-storage hot-storage


Removing a Writeback Cache
--------------------------

Since a writeback cache may have modified data, you must take steps to ensure
that you do not lose any recent changes to objects in the cache before you
disable and remove it.


#. Change the cache mode to ``proxy`` so that new and modified objects will
   flush to the backing storage pool. ::

     ceph osd tier cache-mode {cachepool} proxy

   For example::

     ceph osd tier cache-mode hot-storage proxy


#. Ensure that the cache pool has been flushed.  This may take a few minutes::

     rados -p {cachepool} ls

   If the cache pool still has objects, you can flush them manually.
   For example::

     rados -p {cachepool} cache-flush-evict-all


#. Remove the overlay so that clients will not direct traffic to the cache. ::

     ceph osd tier remove-overlay {storagepool}

   For example::

     ceph osd tier remove-overlay cold-storage


#. Finally, remove the cache tier pool from the backing storage pool. ::

     ceph osd tier remove {storagepool} {cachepool}

   For example::

     ceph osd tier remove cold-storage hot-storage


.. _Create a Pool: ../pools#create-a-pool
.. _Pools - Set Pool Values: ../pools#set-pool-values
.. _Placing Different Pools on Different OSDs: ../crush-map-edits/#placing-different-pools-on-different-osds
.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter
.. _CRUSH Maps: ../crush-map
.. _Absolute Sizing: #absolute-sizing
diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst
new file mode 100644
index 00000000..76abca27
--- /dev/null
+++ b/doc/rados/operations/control.rst
@@ -0,0 +1,456 @@
.. index:: control, commands

==================
 Control Commands
==================


Monitor Commands
================

Monitor commands are issued using the ceph utility::

    ceph [-m monhost] {command}

The command is usually (though not always) of the form::

    ceph {subsystem} {command}


System Commands
===============

Execute the following to display the current status of the cluster. ::

    ceph -s
    ceph status

Execute the following to display a running summary of the status of the cluster,
and major events. ::

    ceph -w

Execute the following to show the monitor quorum, including which monitors are
participating and which one is the leader. ::

    ceph quorum_status

Execute the following to query the status of a single monitor, including whether
or not it is in the quorum. ::

    ceph [-m monhost] mon_status


Authentication Subsystem
========================

To add a keyring for an OSD, execute the following::

    ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring}

To list the cluster's keys and their capabilities, execute the following::

    ceph auth ls


Placement Group Subsystem
=========================

To display the statistics for all placement groups, execute the following::

    ceph pg dump [--format {format}]

The valid formats are ``plain`` (default) and ``json``.
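The ``json`` format is convenient for scripting. The sketch below summarizes PG states from a dump; note that the embedded sample is a hand-written, heavily simplified stand-in for the real dump (whose exact schema varies by release), and in practice you would feed in the output of ``ceph pg dump --format json`` instead:

```python
# Illustrative: tally PG state flags from a (simplified) JSON pg dump.
# The sample document below is hypothetical; the real dump nests
# pg_stats deeper and carries many more fields.
import json
from collections import Counter

sample = json.dumps({
    "pg_stats": [
        {"pgid": "1.0", "state": "active+clean"},
        {"pgid": "1.1", "state": "active+clean"},
        {"pgid": "1.2", "state": "active+undersized+degraded"},
    ]
})


def count_states(dump_json: str) -> Counter:
    """Count how many PGs report each individual state flag."""
    states = Counter()
    for pg in json.loads(dump_json)["pg_stats"]:
        for flag in pg["state"].split("+"):  # states are '+'-joined flags
            states[flag] += 1
    return states


summary = count_states(sample)
# with the sample above: 3 active, 2 clean, 1 undersized, 1 degraded
```

Splitting on ``+`` matters because a PG reports a combination of flags (e.g. ``active+undersized+degraded``), so counting whole state strings would hide how many PGs are degraded overall.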
To display the statistics for all placement groups stuck in a specified state,
execute the following::

    ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}]


``--format`` may be ``plain`` (default) or ``json``

``--threshold`` defines how many seconds "stuck" is (default: 300)

**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come back.

**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.

**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by
``mon_osd_report_timeout``).

Revert "lost" objects to their prior state: either roll them back to a
previous version or delete them if they were just created. ::

    ceph pg {pgid} mark_unfound_lost revert|delete


OSD Subsystem
=============

Query OSD subsystem status. ::

    ceph osd stat

Write a copy of the most recent OSD map to a file.  See
:ref:`osdmaptool <osdmaptool>`. ::

    ceph osd getmap -o file

Write a copy of the crush map from the most recent OSD map to
a file. ::

    ceph osd getcrushmap -o file

The foregoing is functionally equivalent to ::

    ceph osd getmap -o /tmp/osdmap
    osdmaptool /tmp/osdmap --export-crush file

Dump the OSD map.  Valid formats for ``-f`` are ``plain`` and ``json``.  If no
``--format`` option is given, the OSD map is dumped as plain text. ::

    ceph osd dump [--format {format}]

Dump the OSD map as a tree with one line per OSD containing weight
and state. ::

    ceph osd tree [--format {format}]

Find out where a specific object is or would be stored in the system::

    ceph osd map <pool-name> <object-name>

Add or move a new item (OSD) with the given id/name/weight at the specified
location.
:: + + ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] + +Remove an existing item (OSD) from the CRUSH map. :: + + ceph osd crush remove {name} + +Remove an existing bucket from the CRUSH map. :: + + ceph osd crush remove {bucket-name} + +Move an existing bucket from one position in the hierarchy to another. :: + + ceph osd crush move {id} {loc1} [{loc2} ...] + +Set the weight of the item given by ``{name}`` to ``{weight}``. :: + + ceph osd crush reweight {name} {weight} + +Mark an OSD as lost. This may result in permanent data loss. Use with caution. :: + + ceph osd lost {id} [--yes-i-really-mean-it] + +Create a new OSD. If no UUID is given, it will be set automatically when the OSD +starts up. :: + + ceph osd create [{uuid}] + +Remove the given OSD(s). :: + + ceph osd rm [{id}...] + +Query the current max_osd parameter in the OSD map. :: + + ceph osd getmaxosd + +Import the given crush map. :: + + ceph osd setcrushmap -i file + +Set the ``max_osd`` parameter in the OSD map. This is necessary when +expanding the storage cluster. :: + + ceph osd setmaxosd + +Mark OSD ``{osd-num}`` down. :: + + ceph osd down {osd-num} + +Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). :: + + ceph osd out {osd-num} + +Mark ``{osd-num}`` in the distribution (i.e. allocated data). :: + + ceph osd in {osd-num} + +Set or clear the pause flags in the OSD map. If set, no IO requests +will be sent to any OSD. Clearing the flags via unpause results in +resending pending requests. :: + + ceph osd pause + ceph osd unpause + +Set the weight of ``{osd-num}`` to ``{weight}``. Two OSDs with the +same weight will receive roughly the same number of I/O requests and +store approximately the same amount of data. ``ceph osd reweight`` +sets an override weight on the OSD. This value is in the range 0 to 1, +and forces CRUSH to re-place (1-weight) of the data that would +otherwise live on this drive. 
It does not change the weights assigned
+to the buckets above the OSD in the crush map, and is a corrective
+measure in case the normal CRUSH distribution is not working out quite
+right. For instance, if one of your OSDs is at 90% and the others are
+at 50%, you could reduce this weight to try to compensate for it. ::
+
+	ceph osd reweight {osd-num} {weight}
+
+Reweights all the OSDs by reducing the weight of OSDs which are
+heavily overused. By default it will adjust the weights downward on
+OSDs which have reached 120% of the average utilization; if you include
+a ``threshold`` argument, that percentage will be used instead. ::
+
+	ceph osd reweight-by-utilization [threshold]
+
+Reports what ``reweight-by-utilization`` would do without actually
+applying any weight changes. ::
+
+	ceph osd test-reweight-by-utilization
+
+Adds/removes the address to/from the blacklist. When adding an address,
+you can specify how long it should be blacklisted in seconds; otherwise,
+it will default to 1 hour. A blacklisted address is prevented from
+connecting to any OSD. Blacklisting is most often used to prevent a
+lagging metadata server from making bad changes to data on the OSDs.
+
+These commands are mostly only useful for failure testing, as
+blacklists are normally maintained automatically and shouldn't need
+manual intervention. ::
+
+	ceph osd blacklist add ADDRESS[:source_port] [TIME]
+	ceph osd blacklist rm ADDRESS[:source_port]
+
+Creates/deletes a snapshot of a pool. ::
+
+	ceph osd pool mksnap {pool-name} {snap-name}
+	ceph osd pool rmsnap {pool-name} {snap-name}
+
+Creates/deletes/renames a storage pool. ::
+
+	ceph osd pool create {pool-name} pg_num [pgp_num]
+	ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
+	ceph osd pool rename {old-name} {new-name}
+
+Changes a pool setting. ::
+
+	ceph osd pool set {pool-name} {field} {value}
+
+Valid fields are:
+
+	* ``size``: Sets the number of copies of data in the pool.
+	* ``pg_num``: The placement group number. 
+ * ``pgp_num``: Effective number when calculating pg placement. + * ``crush_rule``: rule number for mapping placement. + +Get the value of a pool setting. :: + + ceph osd pool get {pool-name} {field} + +Valid fields are: + + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number of placement groups when calculating placement. + + +Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. :: + + ceph osd scrub {osd-num} + +Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. :: + + ceph osd repair N + +Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES`` +in write requests of ``BYTES_PER_WRITE`` each. By default, the test +writes 1 GB in total in 4-MB increments. +The benchmark is non-destructive and will not overwrite existing live +OSD data, but might temporarily affect the performance of clients +concurrently accessing the OSD. :: + + ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE] + +To clear an OSD's caches between benchmark runs, use the 'cache drop' command :: + + ceph tell osd.N cache drop + +To get the cache statistics of an OSD, use the 'cache status' command :: + + ceph tell osd.N cache status + +MDS Subsystem +============= + +Change configuration parameters on a running mds. :: + + ceph tell mds.{mds-id} config set {setting} {value} + +Example:: + + ceph tell mds.0 config set debug_ms 1 + +Enables debug messages. :: + + ceph mds stat + +Displays the status of all metadata servers. :: + + ceph mds fail 0 + +Marks the active MDS as failed, triggering failover to a standby if present. + +.. 
todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap + + +Mon Subsystem +============= + +Show monitor stats:: + + ceph mon stat + + e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c + + +The ``quorum`` list at the end lists monitor nodes that are part of the current quorum. + +This is also available more directly:: + + ceph quorum_status -f json-pretty + +.. code-block:: javascript + + { + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "quorum_names": [ + "a", + "b", + "c" + ], + "quorum_leader_name": "a", + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + + +The above will block until a quorum is reached. + +For a status of just the monitor you connect to (use ``-m HOST:PORT`` +to select):: + + ceph mon_status -f json-pretty + + +.. 
code-block:: javascript + + { + "name": "b", + "rank": 1, + "state": "peon", + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "features": { + "required_con": "9025616074522624", + "required_mon": [ + "kraken" + ], + "quorum_con": "1152921504336314367", + "quorum_mon": [ + "kraken" + ] + }, + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + +A dump of the monitor state:: + + ceph mon dump + + dumped monmap epoch 2 + epoch 2 + fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc + last_changed 2016-12-26 14:42:09.288066 + created 2016-12-26 14:42:03.573585 + 0: 127.0.0.1:40000/0 mon.a + 1: 127.0.0.1:40001/0 mon.b + 2: 127.0.0.1:40002/0 mon.c + diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 00000000..3b838c37 --- /dev/null +++ b/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,697 @@ +Manually editing a CRUSH Map +============================ + +.. note:: Manually editing the CRUSH map is considered an advanced + administrator operation. All CRUSH changes that are + necessary for the overwhelming majority of installations are + possible via the standard ceph CLI and do not require manual + CRUSH map edits. If you have identified a use case where + manual edits *are* necessary, consider contacting the Ceph + developers so that future versions of Ceph can make this + unnecessary. 
+
+To edit an existing CRUSH map:
+
+#. `Get the CRUSH map`_.
+#. `Decompile`_ the CRUSH map.
+#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
+#. `Recompile`_ the CRUSH map.
+#. `Set the CRUSH map`_.
+
+For details on setting the CRUSH map rule for a specific pool, see `Set
+Pool Values`_.
+
+.. _Get the CRUSH map: #getcrushmap
+.. _Decompile: #decompilecrushmap
+.. _Devices: #crushmapdevices
+.. _Buckets: #crushmapbuckets
+.. _Rules: #crushmaprules
+.. _Recompile: #compilecrushmap
+.. _Set the CRUSH map: #setcrushmap
+.. _Set Pool Values: ../pools#setpoolvalues
+
+.. _getcrushmap:
+
+Get a CRUSH Map
+---------------
+
+To get the CRUSH map for your cluster, execute the following::
+
+	ceph osd getcrushmap -o {compiled-crushmap-filename}
+
+Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since
+the CRUSH map is in a compiled form, you must decompile it before you can
+edit it.
+
+.. _decompilecrushmap:
+
+Decompile a CRUSH Map
+---------------------
+
+To decompile a CRUSH map, execute the following::
+
+	crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
+
+
+Sections
+--------
+
+There are six main sections to a CRUSH Map.
+
+#. **tunables:** The preamble at the top of the map describes any *tunables*
+   for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These
+   correct for old bugs, optimizations, or other changes in behavior that have
+   been made over the years to improve CRUSH's behavior.
+
+#. **devices:** Devices are individual ``ceph-osd`` daemons that can
+   store data.
+
+#. **types**: Bucket ``types`` define the types of buckets used in
+   your CRUSH hierarchy. Buckets consist of a hierarchical aggregation
+   of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
+   their assigned weights.
+
+#. **buckets:** Once you define bucket types, you must define each node
+   in the hierarchy, its type, and which devices or other nodes it
+   contains.
+
+#. 
**rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** Choose_args are alternative weights associated with + the hierarchy that have been adjusted to optimize data placement. A single + choose_args map can be used for the entire cluster, or one can be + created for each individual pool. + + +.. _crushmapdevices: + +CRUSH Map Devices +----------------- + +Devices are individual ``ceph-osd`` daemons that can store data. You +will normally have one defined here for each OSD daemon in your +cluster. Devices are identified by an id (a non-negative integer) and +a name, normally ``osd.N`` where ``N`` is the device id. + +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targeted by a +crush rule. + +:: + + # devices + device {num} {osd.name} [class {class}] + +For example:: + + # devices + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a single ``ceph-osd`` daemon. This +is normally a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a +small RAID device. + + + + + +CRUSH Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate +a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent +physical locations in a hierarchy. Nodes aggregate other nodes or leaves. +Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage +media. + +.. tip:: The term "bucket" used in the context of CRUSH means a node in + the hierarchy, i.e. a location or a piece of physical hardware. It + is a different concept from the term "bucket" when used in the + context of RADOS Gateway APIs. + +To add a bucket type to the CRUSH map, create a new line under your list of +bucket types. 
Enter ``type`` followed by a unique numeric ID and a bucket name.
+By convention, there is one leaf bucket and it is ``type 0``; however, you may
+give it any name you like (e.g., osd, disk, drive, storage, etc.)::
+
+	#types
+	type {num} {bucket-name}
+
+For example::
+
+	# types
+	type 0 osd
+	type 1 host
+	type 2 chassis
+	type 3 rack
+	type 4 row
+	type 5 pdu
+	type 6 pod
+	type 7 room
+	type 8 datacenter
+	type 9 region
+	type 10 root
+
+
+
+.. _crushmapbuckets:
+
+CRUSH Map Bucket Hierarchy
+--------------------------
+
+The CRUSH algorithm distributes data objects among storage devices according
+to a per-device weight value, approximating a uniform probability distribution.
+CRUSH distributes objects and their replicas according to the hierarchical
+cluster map you define. Your CRUSH map represents the available storage
+devices and the logical elements that contain them.
+
+To map placement groups to OSDs across failure domains, a CRUSH map defines a
+hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH
+map). The purpose of creating a bucket hierarchy is to segregate the
+leaf nodes by their failure domains, such as hosts, chassis, racks, power
+distribution units, pods, rows, rooms, and data centers. With the exception of
+the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
+you may define it according to your own needs.
+
+We recommend adapting your CRUSH map to your firm's hardware naming conventions
+and using instance names that reflect the physical hardware. Your naming
+practice can make it easier to administer the cluster and troubleshoot
+problems when an OSD and/or other hardware malfunctions and the administrator
+needs access to physical hardware.
+
+In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
+and two node buckets named ``host`` and ``rack`` respectively.
+
+.. 
ditaa:: + +-----------+ + | {o}rack | + | Bucket | + +-----+-----+ + | + +---------------+---------------+ + | | + +-----+-----+ +-----+-----+ + | {o}host | | {o}host | + | Bucket | | Bucket | + +-----+-----+ +-----+-----+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd | | osd | | osd | | osd | + | Bucket | | Bucket | | Bucket | | Bucket | + +-----------+ +-----------+ +-----------+ +-----------+ + +.. note:: The higher numbered ``rack`` bucket type aggregates the lower + numbered ``host`` bucket type. + +Since leaf nodes reflect storage devices declared under the ``#devices`` list +at the beginning of the CRUSH map, you do not need to declare them as bucket +instances. The second lowest bucket type in your hierarchy usually aggregates +the devices (i.e., it's usually the computer containing the storage media, and +uses whatever term you prefer to describe it, such as "node", "computer", +"server," "host", "machine", etc.). In high density environments, it is +increasingly common to see multiple hosts/nodes per chassis. You should account +for chassis failure too--e.g., the need to pull a chassis if a node fails may +result in bringing down numerous hosts/nodes and their OSDs. + +When declaring a bucket instance, you must specify its type, give it a unique +name (string), assign it a unique ID expressed as a negative integer (optional), +specify a weight relative to the total capacity/capability of its item(s), +specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``, +reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. +The items may consist of node buckets or leaves. Items may have a weight that +reflects the relative weight of the item. 
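The weight bookkeeping described above can be sketched in a few lines of Python (an illustrative model only, not Ceph code; the bucket and item names are hypothetical):

```python
# Hypothetical two-host hierarchy: each leaf weight approximates device
# capacity (1.00 per 1 TB), and a bucket's weight is the sum of the
# weights of the items it contains.
hierarchy = {
    "node1": {"osd.0": 1.00, "osd.1": 1.00},
    "node2": {"osd.2": 1.00, "osd.3": 1.00},
}

def bucket_weight(item):
    """Leaves carry their own weight; a bucket's weight is the sum of
    the weights of its items."""
    if isinstance(item, dict):
        return sum(bucket_weight(child) for child in item.values())
    return item

# The rack aggregating both hosts carries the sum of their weights.
rack_weight = bucket_weight(hierarchy)
print(rack_weight)  # each host weighs 2.0, so the rack weighs 4.0
```

This mirrors the rule that a node bucket's item entries record the sum total of the weights beneath them.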
+ +You may declare a node bucket with the following syntax:: + + [bucket-type] [bucket-name] { + id [a unique negative numeric ID] + weight [the relative capacity/capability of the item(s)] + alg [the bucket type: uniform | list | tree | straw ] + hash [the hash type: 0 by default] + item [item-name] weight [weight] + } + +For example, using the diagram above, we would define two host buckets +and one rack bucket. The OSDs are declared as items within the host buckets:: + + host node1 { + id -1 + alg straw + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host node2 { + id -2 + alg straw + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + rack rack1 { + id -3 + alg straw + hash 0 + item node1 weight 2.00 + item node2 weight 2.00 + } + +.. note:: In the foregoing example, note that the rack bucket does not contain + any OSDs. Rather it contains lower level host buckets, and includes the + sum total of their weight in the item entry. + +.. topic:: Bucket Types + + Ceph supports four bucket types, each representing a tradeoff between + performance and reorganization efficiency. If you are unsure of which bucket + type to use, we recommend using a ``straw`` bucket. For a detailed + discussion of bucket types, refer to + `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, + and more specifically to **Section 3.4**. The bucket types are: + + #. **Uniform**: Uniform buckets aggregate devices with **exactly** the same + weight. For example, when firms commission or decommission hardware, they + typically do so with many machines that have exactly the same physical + configuration (e.g., bulk purchases). When storage devices have exactly + the same weight, you may use the ``uniform`` bucket type, which allows + CRUSH to map replicas into uniform buckets in constant time. With + non-uniform weights, you should use another bucket algorithm. + + #. **List**: List buckets aggregate their content as linked lists. 
Based on
+     the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
+     a list is a natural and intuitive choice for an **expanding cluster**:
+     either an object is relocated to the newest device with some appropriate
+     probability, or it remains on the older devices as before. The result is
+     optimal data migration when items are added to the bucket. Items removed
+     from the middle or tail of the list, however, can result in a significant
+     amount of unnecessary movement, making list buckets most suitable for
+     circumstances in which they **never (or very rarely) shrink**.
+
+  #. **Tree**: Tree buckets use a binary search tree. They are more efficient
+     than list buckets when a bucket contains a larger set of items. Based on
+     the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
+     tree buckets reduce the placement time to O(log :sub:`n`), making them
+     suitable for managing much larger sets of devices or nested buckets.
+
+  #. **Straw**: List and Tree buckets use a divide and conquer strategy
+     in a way that either gives certain items precedence (e.g., those
+     at the beginning of a list) or obviates the need to consider entire
+     subtrees of items at all. That improves the performance of the replica
+     placement process, but can also introduce suboptimal reorganization
+     behavior when the contents of a bucket change due to an addition, removal,
+     or re-weighting of an item. The straw bucket type allows all items to
+     fairly “compete” against each other for replica placement through a
+     process analogous to a draw of straws.
+
+  #. **Straw2**: Straw2 buckets improve Straw to correctly avoid any data
+     movement between items when neighbor weights change.
+
+     For example, if the weight of item A changes (including when item A is
+     added anew or removed completely), there will be data movement only to
+     or from item A.
+
+.. topic:: Hash
+
+   Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. 
+ Enter ``0`` as your hash setting to select ``rjenkins1``. + + +.. _weightingbucketitems: + +.. topic:: Weighting Bucket Items + + Ceph expresses bucket weights as doubles, which allows for fine + weighting. A weight is the relative difference between device capacities. We + recommend using ``1.00`` as the relative weight for a 1TB storage device. + In such a scenario, a weight of ``0.5`` would represent approximately 500GB, + and a weight of ``3.00`` would represent approximately 3TB. Higher level + buckets have a weight that is the sum total of the leaf items aggregated by + the bucket. + + A bucket item weight is one dimensional, but you may also calculate your + item weights to reflect the performance of the storage drive. For example, + if you have many 1TB drives where some have relatively low data transfer + rate and the others have a relatively high data transfer rate, you may + weight them differently, even though they have the same capacity (e.g., + a weight of 0.80 for the first set of drives with lower total throughput, + and 1.20 for the second set of drives with higher total throughput). + + +.. _crushmaprules: + +CRUSH Map Rules +--------------- + +CRUSH maps support the notion of 'CRUSH rules', which are the rules that +determine data placement for a pool. The default CRUSH map has a rule for each +pool. For large clusters, you will likely create many pools where each pool may +have its own non-default CRUSH rule. + +.. note:: In most cases, you will not need to modify the default rule. When + you create a new pool, by default the rule will be set to ``0``. + + +CRUSH rules define placement and replication strategies or distribution policies +that allow you to specify exactly how CRUSH places object replicas. 
For +example, you might create a rule selecting a pair of targets for 2-way +mirroring, another rule for selecting three targets in two different data +centers for 3-way mirroring, and yet another rule for erasure coding over six +storage devices. For a detailed discussion of CRUSH rules, refer to +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, +and more specifically to **Section 3.2**. + +A rule takes the following form:: + + rule <rulename> { + + id [a unique whole numeric ID] + type [ replicated | erasure ] + min_size <min-size> + max_size <max-size> + step take <bucket-name> [class <device-class>] + step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type> + step emit + } + + +``id`` + +:Description: A unique whole number for identifying the rule. + +:Purpose: A component of the rule mask. +:Type: Integer +:Required: Yes +:Default: 0 + + +``type`` + +:Description: Describes a rule for either a storage drive (replicated) + or a RAID. + +:Purpose: A component of the rule mask. +:Type: String +:Required: Yes +:Default: ``replicated`` +:Valid Values: Currently only ``replicated`` and ``erasure`` + +``min_size`` + +:Description: If a pool makes fewer replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: ``1`` + +``max_size`` + +:Description: If a pool makes more replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: 10 + + +``step take <bucket-name> [class <device-class>]`` + +:Description: Takes a bucket name, and begins iterating down the tree. + If the ``device-class`` is specified, it must match + a class previously used when defining a device. All + devices that do not belong to the class are excluded. +:Purpose: A component of the rule. 
+:Required: Yes +:Example: ``step take data`` + + +``step choose firstn {num} type {bucket-type}`` + +:Description: Selects the number of buckets of the given type from within the + current bucket. The number is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step choose firstn 1 type row`` + + +``step chooseleaf firstn {num} type {bucket-type}`` + +:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf + node (that is, an OSD) from the subtree of each bucket in the set of buckets. + The number of buckets in the set is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. Usage removes the need to select a device using two steps. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step chooseleaf firstn 0 type row`` + + +``step emit`` + +:Description: Outputs the current value and empties the stack. Typically used + at the end of a rule, but may also be used to pick from different + trees in the same rule. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step choose``. +:Example: ``step emit`` + +.. important:: A given CRUSH rule may be assigned to multiple pools, but it + is not possible for a single pool to have multiple CRUSH rules. + +``firstn`` versus ``indep`` + +:Description: Controls the replacement strategy CRUSH uses when items (OSDs) + are marked down in the CRUSH map. 
If this rule is to be used with + replicated pools it should be ``firstn`` and if it's for + erasure-coded pools it should be ``indep``. + + The reason has to do with how they behave when a + previously-selected device fails. Let's say you have a PG stored + on OSDs 1, 2, 3, 4, 5. Then 3 goes down. + + With the "firstn" mode, CRUSH simply adjusts its calculation to + select 1 and 2, then selects 3 but discovers it's down, so it + retries and selects 4 and 5, and then goes on to select a new + OSD 6. So the final CRUSH mapping change is + 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6. + + But if you're storing an EC pool, that means you just changed the + data mapped to OSDs 4, 5, and 6! So the "indep" mode attempts to + not do that. You can instead expect it, when it selects the failed + OSD 3, to try again and pick out 6, for a final transformation of: + 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5 + +.. _crush-reclassify: + +Migrating from a legacy SSD rule to device classes +-------------------------------------------------- + +It used to be necessary to manually edit your CRUSH map and maintain a +parallel hierarchy for each specialized device type (e.g., SSD) in order to +write rules that apply to those devices. Since the Luminous release, +the *device class* feature has enabled this transparently. + +However, migrating from an existing, manually customized per-device map to +the new device class rules in the trivial way will cause all data in the +system to be reshuffled. + +The ``crushtool`` has a few commands that can transform a legacy rule +and hierarchy so that you can start using the new class-based rules. +There are three types of transformations possible: + +#. ``--reclassify-root <root-name> <device-class>`` + + This will take everything in the hierarchy beneath root-name and + adjust any rules that reference that root via a ``take + <root-name>`` to instead ``take <root-name> class <device-class>``. 
+  It renumbers the buckets in such a way that the old IDs are instead
+  used for the specified class's "shadow tree" so that no data
+  movement takes place.
+
+  For example, imagine you have an existing rule like::
+
+    rule replicated_ruleset {
+       id 0
+       type replicated
+       min_size 1
+       max_size 10
+       step take default
+       step chooseleaf firstn 0 type rack
+       step emit
+    }
+
+  If you reclassify the root `default` as class `hdd`, the rule will
+  become::
+
+    rule replicated_ruleset {
+       id 0
+       type replicated
+       min_size 1
+       max_size 10
+       step take default class hdd
+       step chooseleaf firstn 0 type rack
+       step emit
+    }
+
+#. ``--set-subtree-class <bucket-name> <device-class>``
+
+   This will mark every device in the subtree rooted at *bucket-name*
+   with the specified device class.
+
+   This is normally used in conjunction with the ``--reclassify-root``
+   option to ensure that all devices in that root are labeled with the
+   correct class. In some situations, however, some of those devices
+   (correctly) have a different class and we do not want to relabel
+   them. In such cases, one can exclude the ``--set-subtree-class``
+   option. This means that the remapping process will not be perfect,
+   since the previous rule distributed across devices of multiple
+   classes but the adjusted rules will only map to devices of the
+   specified *device-class*, but that often is an accepted level of
+   data movement when the number of outlier devices is small.
+
+#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``
+
+   This will allow you to merge a parallel type-specific hierarchy with the
+   normal hierarchy. For example, many users have maps like::
+
+	host node1 {
+	   id -2           # do not change unnecessarily
+	   # weight 109.152
+	   alg straw
+	   hash 0  # rjenkins1
+	   item osd.0 weight 9.096
+	   item osd.1 weight 9.096
+	   item osd.2 weight 9.096
+	   item osd.3 weight 9.096
+	   item osd.4 weight 9.096
+	   item osd.5 weight 9.096
+	   ... 
+ } + + host node1-ssd { + id -10 # do not change unnecessarily + # weight 2.000 + alg straw + hash 0 # rjenkins1 + item osd.80 weight 2.000 + ... + } + + root default { + id -1 # do not change unnecessarily + alg straw + hash 0 # rjenkins1 + item node1 weight 110.967 + ... + } + + root ssd { + id -18 # do not change unnecessarily + # weight 16.000 + alg straw + hash 0 # rjenkins1 + item node1-ssd weight 2.000 + ... + } + + This function will reclassify each bucket that matches a + pattern. The pattern can look like ``%suffix`` or ``prefix%``. + For example, in the above example, we would use the pattern + ``%-ssd``. For each matched bucket, the remaining portion of the + name (that matches the ``%`` wildcard) specifies the *base bucket*. + All devices in the matched bucket are labeled with the specified + device class and then moved to the base bucket. If the base bucket + does not exist (e.g., ``node12-ssd`` exists but ``node12`` does + not), then it is created and linked underneath the specified + *default parent* bucket. In each case, we are careful to preserve + the old bucket IDs for the new shadow buckets to prevent data + movement. Any rules with ``take`` steps referencing the old + buckets are adjusted. + +#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>`` + + The same command can also be used without a wildcard to map a + single bucket. For example, in the previous example, we want the + ``ssd`` bucket to be mapped to the ``default`` bucket. 
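The ``%`` pattern matching described above can be sketched as a small Python model (an illustration of the documented behavior, not crushtool's actual implementation):

```python
def reclassify_base(bucket_name, pattern):
    """Resolve a crushtool-style reclassify pattern ('%suffix' or
    'prefix%') against a bucket name; return the base-bucket name that
    the '%' wildcard matched, or None if the pattern does not match."""
    if pattern.startswith("%"):
        suffix = pattern[1:]
        if bucket_name.endswith(suffix) and len(bucket_name) > len(suffix):
            return bucket_name[:-len(suffix)]
    elif pattern.endswith("%"):
        prefix = pattern[:-1]
        if bucket_name.startswith(prefix) and len(bucket_name) > len(prefix):
            return bucket_name[len(prefix):]
    return None

print(reclassify_base("node1-ssd", "%-ssd"))   # -> node1
print(reclassify_base("node12-ssd", "%-ssd"))  # -> node12
print(reclassify_base("node1", "%-ssd"))       # -> None (no match)
```

With the pattern ``%-ssd``, the devices in ``node1-ssd`` would be relabeled and moved into the base bucket ``node1``.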
+
+The final command to convert the map composed of the above fragments would be something like::
+
+  $ ceph osd getcrushmap -o original
+  $ crushtool -i original --reclassify \
+      --set-subtree-class default hdd \
+      --reclassify-root default hdd \
+      --reclassify-bucket %-ssd ssd default \
+      --reclassify-bucket ssd ssd default \
+      -o adjusted
+
+In order to ensure that the conversion is correct, there is a ``--compare`` command that will test a large sample of inputs to the CRUSH map and ensure that the same result comes back out. These inputs are controlled by the same options that apply to the ``--test`` command. For the above example::
+
+  $ crushtool -i original --compare adjusted
+  rule 0 had 0/10240 mismatched mappings (0)
+  rule 1 had 0/10240 mismatched mappings (0)
+  maps appear equivalent
+
+If there were differences, you would see what ratio of inputs were remapped
+in the parentheses.
+
+If you are satisfied with the adjusted map, you can apply it to the cluster with something like::
+
+  ceph osd setcrushmap -i adjusted
+
+Tuning CRUSH, the hard way
+--------------------------
+
+If you can ensure that all clients are running recent code, you can
+adjust the tunables by extracting the CRUSH map, modifying the values,
+and reinjecting it into the cluster.
+
+* Extract the latest CRUSH map::
+
+	ceph osd getcrushmap -o /tmp/crush
+
+* Adjust tunables. These values appear to offer the best behavior
+  for both large and small clusters we tested with. You will need to
+  additionally specify the ``--enable-unsafe-tunables`` argument to
+  ``crushtool`` for this to work. 
Please use this option with + extreme care.:: + + crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new + +* Reinject modified map:: + + ceph osd setcrushmap -i /tmp/crush.new + +Legacy values +------------- + +For reference, the legacy values for the CRUSH tunables can be set +with:: + + crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy + +Again, the special ``--enable-unsafe-tunables`` option is required. +Further, as noted above, be careful running old versions of the +``ceph-osd`` daemon after reverting to legacy values as the feature +bit is not perfectly enforced. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst new file mode 100644 index 00000000..12f7184f --- /dev/null +++ b/doc/rados/operations/crush-map.rst @@ -0,0 +1,969 @@ +============ + CRUSH Maps +============ + +The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm +determines how to store and retrieve data by computing data storage locations. +CRUSH empowers Ceph clients to communicate with OSDs directly rather than +through a centralized server or broker. With an algorithmically determined +method of storing and retrieving data, Ceph avoids a single point of failure, a +performance bottleneck, and a physical limit to its scalability. + +CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly +store and retrieve data in OSDs with a uniform distribution of data across the +cluster. 
For a detailed discussion of CRUSH, see +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ + +CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of +'buckets' for aggregating the devices into physical locations, and a list of +rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By +reflecting the underlying physical organization of the installation, CRUSH can +model—and thereby address—potential sources of correlated device failures. +Typical sources include physical proximity, a shared power source, and a shared +network. By encoding this information into the cluster map, CRUSH placement +policies can separate object replicas across different failure domains while +still maintaining the desired distribution. For example, to address the +possibility of concurrent failures, it may be desirable to ensure that data +replicas are on devices using different shelves, racks, power supplies, +controllers, and/or physical locations. + +When you deploy OSDs they are automatically placed within the CRUSH map under a +``host`` node named with the hostname for the host they are running on. This, +combined with the default CRUSH failure domain, ensures that replicas or erasure +code shards are separated across hosts and a single host failure will not +affect availability. For larger clusters, however, administrators should carefully consider their choice of failure domain. Separating replicas across racks, +for example, is common for mid- to large-sized clusters. + + +CRUSH Location +============== + +The location of an OSD in terms of the CRUSH map's hierarchy is +referred to as a ``crush location``. This location specifier takes the +form of a list of key and value pairs describing a position. 
For +example, if an OSD is in a particular row, rack, chassis and host, and +is part of the 'default' CRUSH tree (this is the case for the vast +majority of clusters), its crush location could be described as:: + + root=default row=a rack=a2 chassis=a2a host=a2a1 + +Note: + +#. Note that the order of the keys does not matter. +#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default + these include root, datacenter, room, row, pod, pdu, rack, chassis and host, + but those types can be customized to be anything appropriate by modifying + the CRUSH map. +#. Not all keys need to be specified. For example, by default, Ceph + automatically sets a ``ceph-osd`` daemon's location to be + ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``). + +The crush location for an OSD is normally expressed via the ``crush location`` +config option being set in the ``ceph.conf`` file. Each time the OSD starts, +it verifies it is in the correct location in the CRUSH map and, if it is not, +it moves itself. To disable this automatic CRUSH map management, add the +following to your configuration file in the ``[osd]`` section:: + + osd crush update on start = false + + +Custom location hooks +--------------------- + +A customized location hook can be used to generate a more complete +crush location on startup. The crush location is based on, in order +of preference: + +#. A ``crush location`` option in ceph.conf. +#. A default of ``root=default host=HOSTNAME`` where the hostname is + generated with the ``hostname -s`` command. + +This is not useful by itself, as the OSD itself has the exact same +behavior. 
However, a script can be written to provide additional +location fields (for example, the rack or datacenter), and then the +hook enabled via the config option:: + + crush location hook = /path/to/customized-ceph-crush-location + +This hook is passed several arguments (below) and should output a single line +to stdout with the CRUSH location description.:: + + --cluster CLUSTER --id ID --type TYPE + +where the cluster name is typically 'ceph', the id is the daemon +identifier (e.g., the OSD number or daemon identifier), and the daemon +type is ``osd``, ``mds``, or similar. + +For example, a simple hook that additionally specified a rack location +based on a hypothetical file ``/etc/rack`` might be:: + + #!/bin/sh + echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default" + + +CRUSH structure +=============== + +The CRUSH map consists of, loosely speaking, a hierarchy describing +the physical topology of the cluster, and a set of rules defining +policy about how we place data on those devices. The hierarchy has +devices (``ceph-osd`` daemons) at the leaves, and internal nodes +corresponding to other physical features or groupings: hosts, racks, +rows, datacenters, and so on. The rules describe how replicas are +placed in terms of that hierarchy (e.g., 'three replicas in different +racks'). + +Devices +------- + +Devices are individual ``ceph-osd`` daemons that can store data. You +will normally have one defined here for each OSD daemon in your +cluster. Devices are identified by an id (a non-negative integer) and +a name, normally ``osd.N`` where ``N`` is the device id. + +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targeted by a +crush rule. + +Types and Buckets +----------------- + +A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, +racks, rows, etc. The CRUSH map defines a series of *types* that are +used to describe these nodes. 
By default, these types include: + +- osd (or device) +- host +- chassis +- rack +- row +- pdu +- pod +- room +- datacenter +- region +- root + +Most clusters make use of only a handful of these types, and others +can be defined as needed. + +The hierarchy is built with devices (normally type ``osd``) at the +leaves, interior nodes with non-device types, and a root node of type +``root``. For example, + +.. ditaa:: + + +-----------------+ + |{o}root default | + +--------+--------+ + | + +---------------+---------------+ + | | + +------+------+ +------+------+ + |{o}host foo | |{o}host bar | + +------+------+ +------+------+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd.0 | | osd.1 | | osd.2 | | osd.3 | + +-----------+ +-----------+ +-----------+ +-----------+ + +Each node (device or bucket) in the hierarchy has a *weight* +associated with it, indicating the relative proportion of the total +data that device or hierarchy subtree should store. Weights are set +at the leaves, indicating the size of the device, and automatically +sum up the tree from there, such that the weight of the default node +will be the total of all devices contained beneath it. Normally +weights are in units of terabytes (TB). + +You can get a simple view the CRUSH hierarchy for your cluster, +including the weights, with:: + + ceph osd crush tree + +Rules +----- + +Rules define policy about how data is distributed across the devices +in the hierarchy. + +CRUSH rules define placement and replication strategies or +distribution policies that allow you to specify exactly how CRUSH +places object replicas. For example, you might create a rule selecting +a pair of targets for 2-way mirroring, another rule for selecting +three targets in two different data centers for 3-way mirroring, and +yet another rule for erasure coding over six storage devices. 
For a +detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, +Scalable, Decentralized Placement of Replicated Data`_, and more +specifically to **Section 3.2**. + +In almost all cases, CRUSH rules can be created via the CLI by +specifying the *pool type* they will be used for (replicated or +erasure coded), the *failure domain*, and optionally a *device class*. +In rare cases rules must be written by hand by manually editing the +CRUSH map. + +You can see what rules are defined for your cluster with:: + + ceph osd crush rule ls + +You can view the contents of the rules with:: + + ceph osd crush rule dump + +Device classes +-------------- + +Each device can optionally have a *class* associated with it. By +default, OSDs automatically set their class on startup to either +`hdd`, `ssd`, or `nvme` based on the type of device they are backed +by. + +The device class for one or more OSDs can be explicitly set with:: + + ceph osd crush set-device-class <class> <osd-name> [...] + +Once a device class is set, it cannot be changed to another class +until the old class is unset with:: + + ceph osd crush rm-device-class <osd-name> [...] + +This allows administrators to set device classes without the class +being changed on OSD restart or by some other script. + +A placement rule that targets a specific device class can be created with:: + + ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class> + +A pool can then be changed to use the new rule with:: + + ceph osd pool set <pool-name> crush_rule <rule-name> + +Device classes are implemented by creating a "shadow" CRUSH hierarchy +for each device class in use that contains only devices of that class. +Rules can then distribute data over the shadow hierarchy. One nice +thing about this approach is that it is fully backward compatible with +old Ceph clients. 
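As a toy illustration of what a per-class shadow hierarchy contains, the following sketch filters a small tree down to one device class. This is purely illustrative Python, not Ceph's internal representation; the hosts, device classes, and weights are hypothetical:

```python
# Illustrative sketch of deriving a per-class "shadow" tree from one
# hierarchy: keep only devices of the requested class, and recompute
# each bucket weight as the sum of its remaining leaf weights.

def shadow_tree(hierarchy, device_class):
    """Return a filtered copy of the hierarchy for one device class."""
    shadow = {}
    for host, devices in hierarchy.items():
        kept = [(name, w) for name, cls, w in devices if cls == device_class]
        if kept:
            shadow[host] = {"devices": kept, "weight": sum(w for _, w in kept)}
    return shadow

hierarchy = {  # host -> [(osd name, device class, weight in TB)]
    "foo": [("osd.0", "hdd", 4.0), ("osd.1", "ssd", 1.0)],
    "bar": [("osd.2", "hdd", 4.0), ("osd.3", "ssd", 1.0)],
}

ssd = shadow_tree(hierarchy, "ssd")
# Each host appears in the ssd shadow tree with only its ssd device,
# so a rule restricted to class ssd never sees the hdd devices.
assert ssd["foo"]["weight"] == 1.0
```

A rule restricted to a device class simply walks the corresponding shadow tree instead of the full one, which is why old clients need no changes.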
You can view the CRUSH hierarchy with shadow items
+with::
+
+  ceph osd crush tree --show-shadow
+
+For older clusters created before Luminous that relied on manually
+crafted CRUSH maps to maintain per-device-type hierarchies, there is a
+*reclassify* tool available to help transition to device classes
+without triggering data movement (see :ref:`crush-reclassify`).
+
+
+Weight sets
+-----------
+
+A *weight set* is an alternative set of weights to use when
+calculating data placement.  The normal weights associated with each
+device in the CRUSH map are set based on the device size and indicate
+how much data we *should* be storing where.  However, because CRUSH is
+based on a pseudorandom placement process, there is always some
+variation from this ideal distribution, in the same way that rolling a
+die sixty times will not result in exactly 10 ones and 10
+sixes.  Weight sets allow the cluster to perform a numerical optimization
+based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
+a balanced distribution.
+
+There are two types of weight sets supported:
+
+ #. A **compat** weight set is a single alternative set of weights for
+    each device and node in the cluster.  This is not well-suited for
+    correcting all anomalies (for example, placement groups for
+    different pools may be different sizes and have different load
+    levels, but will be mostly treated the same by the balancer).
+    However, compat weight sets have the huge advantage that they are
+    *backward compatible* with previous versions of Ceph, which means
+    that even though weight sets were first introduced in Luminous
+    v12.2.z, older clients (e.g., firefly) can still connect to the
+    cluster when a compat weight set is being used to balance data.
+ #. A **per-pool** weight set is more flexible in that it allows
+    placement to be optimized for each data pool.
Additionally,
+    weights can be adjusted for each position of placement, allowing
+    the optimizer to correct for a subtle skew of data toward devices
+    with small weights relative to their peers (an effect that is
+    usually only apparent in very large clusters but which can cause
+    balancing problems).
+
+When weight sets are in use, the weights associated with each node in
+the hierarchy are visible as a separate column (labeled either
+``(compat)`` or the pool name) in the output of the command::
+
+  ceph osd crush tree
+
+When both *compat* and *per-pool* weight sets are in use, data
+placement for a particular pool will use its own per-pool weight set
+if present.  If not, it will use the compat weight set if present.  If
+neither is present, it will use the normal CRUSH weights.
+
+Although weight sets can be set up and manipulated by hand, it is
+recommended that the *balancer* module be enabled to do so
+automatically.
+
+
+Modifying the CRUSH map
+=======================
+
+.. _addosd:
+
+Add/Move an OSD
+---------------
+
+.. note:: OSDs are normally automatically added to the CRUSH map when
+   the OSD is created.  This command is rarely needed.
+
+To add or move an OSD in the CRUSH map of a running cluster::
+
+  ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
+
+Where:
+
+``name``
+
+:Description: The full name of the OSD.
+:Type: String
+:Required: Yes
+:Example: ``osd.0``
+
+
+``weight``
+
+:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
+:Type: Double
+:Required: Yes
+:Example: ``2.0``
+
+
+``root``
+
+:Description: The root node of the tree in which the OSD resides (normally ``default``).
+:Type: Key/value pair.
+:Required: Yes
+:Example: ``root=default``
+
+
+``bucket-type``
+
+:Description: You may specify the OSD's location in the CRUSH hierarchy.
+:Type: Key/value pairs.
+:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + + +The following example adds ``osd.0`` to the hierarchy, or moves the +OSD from a previous location. :: + + ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 + + +Adjust OSD weight +----------------- + +.. note: Normally OSDs automatically add themselves to the CRUSH map + with the correct weight when they are created. This command + is rarely needed. + +To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute +the following:: + + ceph osd crush reweight {name} {weight} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +``weight`` + +:Description: The CRUSH weight for the OSD. +:Type: Double +:Required: Yes +:Example: ``2.0`` + + +.. _removeosd: + +Remove an OSD +------------- + +.. note: OSDs are normally removed from the CRUSH as part of the + ``ceph osd purge`` command. This command is rarely needed. + +To remove an OSD from the CRUSH map of a running cluster, execute the +following:: + + ceph osd crush remove {name} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +Add a Bucket +------------ + +.. note: Buckets are normally implicitly created when an OSD is added + that specifies a ``{bucket-type}={bucket-name}`` as part of its + location and a bucket with that name does not already exist. This + command is typically used when manually adjusting the structure of the + hierarchy after OSDs have been created (for example, to move a + series of hosts underneath a new rack-level bucket). + +To add a bucket in the CRUSH map of a running cluster, execute the +``ceph osd crush add-bucket`` command:: + + ceph osd crush add-bucket {bucket-name} {bucket-type} + +Where: + +``bucket-name`` + +:Description: The full name of the bucket. 
+:Type: String +:Required: Yes +:Example: ``rack12`` + + +``bucket-type`` + +:Description: The type of the bucket. The type must already exist in the hierarchy. +:Type: String +:Required: Yes +:Example: ``rack`` + + +The following example adds the ``rack12`` bucket to the hierarchy:: + + ceph osd crush add-bucket rack12 rack + +Move a Bucket +------------- + +To move a bucket to a different location or position in the CRUSH map +hierarchy, execute the following:: + + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + +Where: + +``bucket-name`` + +:Description: The name of the bucket to move/reposition. +:Type: String +:Required: Yes +:Example: ``foo-bar-1`` + +``bucket-type`` + +:Description: You may specify the bucket's location in the CRUSH hierarchy. +:Type: Key/value pairs. +:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +Remove a Bucket +--------------- + +To remove a bucket from the CRUSH map hierarchy, execute the following:: + + ceph osd crush remove {bucket-name} + +.. note:: A bucket must be empty before removing it from the CRUSH hierarchy. + +Where: + +``bucket-name`` + +:Description: The name of the bucket that you'd like to remove. +:Type: String +:Required: Yes +:Example: ``rack12`` + +The following example removes the ``rack12`` bucket from the hierarchy:: + + ceph osd crush remove rack12 + +Creating a compat weight set +---------------------------- + +.. note: This step is normally done automatically by the ``balancer`` + module when enabled. 
+ +To create a *compat* weight set:: + + ceph osd crush weight-set create-compat + +Weights for the compat weight set can be adjusted with:: + + ceph osd crush weight-set reweight-compat {name} {weight} + +The compat weight set can be destroyed with:: + + ceph osd crush weight-set rm-compat + +Creating per-pool weight sets +----------------------------- + +To create a weight set for a specific pool,:: + + ceph osd crush weight-set create {pool-name} {mode} + +.. note:: Per-pool weight sets require that all servers and daemons + run Luminous v12.2.z or later. + +Where: + +``pool-name`` + +:Description: The name of a RADOS pool +:Type: String +:Required: Yes +:Example: ``rbd`` + +``mode`` + +:Description: Either ``flat`` or ``positional``. A *flat* weight set + has a single weight for each device or bucket. A + *positional* weight set has a potentially different + weight for each position in the resulting placement + mapping. For example, if a pool has a replica count of + 3, then a positional weight set will have three weights + for each device and bucket. +:Type: String +:Required: Yes +:Example: ``flat`` + +To adjust the weight of an item in a weight set:: + + ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]} + +To list existing weight sets,:: + + ceph osd crush weight-set ls + +To remove a weight set,:: + + ceph osd crush weight-set rm {pool-name} + +Creating a rule for a replicated pool +------------------------------------- + +For a replicated pool, the primary decision when creating the CRUSH +rule is what the failure domain is going to be. For example, if a +failure domain of ``host`` is selected, then CRUSH will ensure that +each replica of the data is stored on a different host. If ``rack`` +is selected, then each replica will be stored in a different rack. +What failure domain you choose primarily depends on the size of your +cluster and how your hierarchy is structured. 
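To make the failure-domain choice concrete, the following toy sketch shows the invariant that a rule with a given failure domain enforces for each placement. This is illustrative Python only, not the CRUSH algorithm, and the OSD-to-host/rack layout is hypothetical:

```python
# Toy check of what a failure domain guarantees: no two replicas of a
# placement group may share the same bucket at that level of the
# hierarchy.  Not the CRUSH algorithm, just the invariant it enforces.

LOCATIONS = {  # osd -> its position in a hypothetical hierarchy
    "osd.0": {"host": "foo", "rack": "a"},
    "osd.1": {"host": "foo", "rack": "a"},
    "osd.2": {"host": "bar", "rack": "a"},
    "osd.3": {"host": "baz", "rack": "b"},
}

def separated(placement, failure_domain):
    """True if no two replicas share a failure-domain bucket."""
    buckets = [LOCATIONS[osd][failure_domain] for osd in placement]
    return len(set(buckets)) == len(buckets)

# osd.0 and osd.1 share a host, so only distinct-host placements
# satisfy a host failure domain:
assert separated(["osd.0", "osd.2", "osd.3"], "host")
assert not separated(["osd.0", "osd.1", "osd.3"], "host")
# A rack failure domain is stricter here: osd.0 and osd.2 share rack "a".
assert not separated(["osd.0", "osd.2", "osd.3"], "rack")
```

A larger failure domain protects against bigger correlated failures, but requires at least as many buckets at that level as the pool has replicas.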
+ +Normally, the entire cluster hierarchy is nested beneath a root node +named ``default``. If you have customized your hierarchy, you may +want to create a rule nested at some other node in the hierarchy. It +doesn't matter what type is associated with that node (it doesn't have +to be a ``root`` node). + +It is also possible to create a rule that restricts data placement to +a specific *class* of device. By default, Ceph OSDs automatically +classify themselves as either ``hdd`` or ``ssd``, depending on the +underlying type of device being used. These classes can also be +customized. + +To create a replicated rule,:: + + ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}] + +Where: + +``name`` + +:Description: The name of the rule +:Type: String +:Required: Yes +:Example: ``rbd-rule`` + +``root`` + +:Description: The name of the node under which data should be placed. +:Type: String +:Required: Yes +:Example: ``default`` + +``failure-domain-type`` + +:Description: The type of CRUSH nodes across which we should separate replicas. +:Type: String +:Required: Yes +:Example: ``rack`` + +``class`` + +:Description: The device class data should be placed on. +:Type: String +:Required: No +:Example: ``ssd`` + +Creating a rule for an erasure coded pool +----------------------------------------- + +For an erasure-coded pool, the same basic decisions need to be made as +with a replicated pool: what is the failure domain, what node in the +hierarchy will data be placed under (usually ``default``), and will +placement be restricted to a specific device class. Erasure code +pools are created a bit differently, however, because they need to be +constructed carefully based on the erasure code being used. For this reason, +you must include this information in the *erasure code profile*. A CRUSH +rule will then be created from that either explicitly or automatically when +the profile is used to create a pool. 
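The dependence on the profile is mostly arithmetic: a rule built for a ``k``\ +\ ``m`` code must choose ``k+m`` distinct failure-domain buckets, one per shard, and the pool can then survive the loss of ``m`` of them. A small sketch (illustrative Python; the profile dict is a stand-in for the concept, not Ceph's data structure):

```python
# Rough arithmetic behind why erasure-code CRUSH rules depend on the
# profile: each of the k+m shards must land in a distinct
# failure-domain bucket.  Illustrative only.

def shards_needed(profile):
    # One bucket per data shard (k) plus one per coding shard (m).
    return profile["k"] + profile["m"]

def failures_tolerated(profile):
    # The pool remains available with up to m shards lost.
    return profile["m"]

profile = {"k": 4, "m": 2, "crush-failure-domain": "host"}
# A 4+2 profile with a host failure domain needs 6 distinct hosts,
# and tolerates the loss of any 2 of them.
assert shards_needed(profile) == 6
assert failures_tolerated(profile) == 2
```

This is why changing ``k`` or ``m`` requires a new profile and a new rule rather than an edit to an existing one.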
+ +The erasure code profiles can be listed with:: + + ceph osd erasure-code-profile ls + +An existing profile can be viewed with:: + + ceph osd erasure-code-profile get {profile-name} + +Normally profiles should never be modified; instead, a new profile +should be created and used when creating a new pool or creating a new +rule for an existing pool. + +An erasure code profile consists of a set of key=value pairs. Most of +these control the behavior of the erasure code that is encoding data +in the pool. Those that begin with ``crush-``, however, affect the +CRUSH rule that is created. + +The erasure code profile properties of interest are: + + * **crush-root**: the name of the CRUSH node to place data under [default: ``default``]. + * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``]. + * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used]. + * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule. + +Once a profile is defined, you can create a CRUSH rule with:: + + ceph osd crush rule create-erasure {name} {profile-name} + +.. note: When creating a new pool, it is not actually necessary to + explicitly create the rule. If the erasure code profile alone is + specified and the rule argument is left off then Ceph will create + the CRUSH rule automatically. + +Deleting rules +-------------- + +Rules that are not in use by pools can be deleted with:: + + ceph osd crush rule rm {rule-name} + + +.. _crush-map-tunables: + +Tunables +======== + +Over time, we have made (and continue to make) improvements to the +CRUSH algorithm used to calculate the placement of data. In order to +support the change in behavior, we have introduced a series of tunable +options that control whether the legacy or improved variation of the +algorithm is used. 
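To see why retry-style tunables matter, consider the following toy model. CRUSH-like placement draws pseudo-randomly and retries on collisions, so a retry budget that is too small can fail to find enough distinct targets. This is illustrative Python only, not the CRUSH algorithm:

```python
import random

# Toy model of a "total tries" tunable: draw pseudo-randomly from a
# small set of items and retry on collisions until enough distinct
# targets are found or the retry budget is exhausted.

def pick_distinct(items, want, total_tries, seed):
    rng = random.Random(seed)
    chosen = []
    for _ in range(total_tries):
        if len(chosen) == want:
            break
        candidate = rng.choice(items)
        if candidate not in chosen:
            chosen.append(candidate)
    return chosen

osds = ["osd.0", "osd.1", "osd.2"]
# With a generous retry budget the mapping completes...
assert len(pick_distinct(osds, 3, 50, seed=1)) == 3
# ...but a tight budget comes up short for some inputs, which is the
# kind of misbehavior the bobtail-era tunables address.
short = [len(pick_distinct(osds, 3, 3, seed=s)) for s in range(100)]
assert min(short) < 3
```

The real tunables change analogous retry limits inside CRUSH, which is why both endpoints must agree on them: client and OSD must compute identical mappings from the same map.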
+ +In order to use newer tunables, both clients and servers must support +the new version of CRUSH. For this reason, we have created +``profiles`` that are named after the Ceph version in which they were +introduced. For example, the ``firefly`` tunables are first supported +in the firefly release, and will not work with older (e.g., dumpling) +clients. Once a given set of tunables are changed from the legacy +default behavior, the ``ceph-mon`` and ``ceph-osd`` will prevent older +clients who do not support the new CRUSH features from connecting to +the cluster. + +argonaut (legacy) +----------------- + +The legacy CRUSH behavior used by argonaut and older releases works +fine for most clusters, provided there are not too many OSDs that have +been marked out. + +bobtail (CRUSH_TUNABLES2) +------------------------- + +The bobtail tunable profile fixes a few key misbehaviors: + + * For hierarchies with a small number of devices in the leaf buckets, + some PGs map to fewer than the desired number of replicas. This + commonly happens for hierarchies with "host" nodes with a small + number (1-3) of OSDs nested beneath each one. + + * For large clusters, some small percentages of PGs map to less than + the desired number of OSDs. This is more prevalent when there are + several layers of the hierarchy (e.g., row, rack, host, osd). + + * When some OSDs are marked out, the data tends to get redistributed + to nearby OSDs instead of across the entire hierarchy. + +The new tunables are: + + * ``choose_local_tries``: Number of local retries. Legacy value is + 2, optimal value is 0. + + * ``choose_local_fallback_tries``: Legacy value is 5, optimal value + is 0. + + * ``choose_total_tries``: Total number of attempts to choose an item. + Legacy value was 19, subsequent testing indicates that a value of + 50 is more appropriate for typical clusters. For extremely large + clusters, a larger value might be necessary. 
+ + * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt + will retry, or only try once and allow the original placement to + retry. Legacy default is 0, optimal value is 1. + +Migration impact: + + * Moving from argonaut to bobtail tunables triggers a moderate amount + of data movement. Use caution on a cluster that is already + populated with data. + +firefly (CRUSH_TUNABLES3) +------------------------- + +The firefly tunable profile fixes a problem +with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG +mappings with too few results when too many OSDs have been marked out. + +The new tunable is: + + * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will + start with a non-zero value of r, based on how many attempts the + parent has already made. Legacy default is 0, but with this value + CRUSH is sometimes unable to find a mapping. The optimal value (in + terms of computational cost and correctness) is 1. + +Migration impact: + + * For existing clusters that have lots of existing data, changing + from 0 to 1 will cause a lot of data to move; a value of 4 or 5 + will allow CRUSH to find a valid mapping but will make less data + move. + +straw_calc_version tunable (introduced with Firefly too) +-------------------------------------------------------- + +There were some problems with the internal weights calculated and +stored in the CRUSH map for ``straw`` buckets. Specifically, when +there were items with a CRUSH weight of 0 or both a mix of weights and +some duplicated weights CRUSH would distribute data incorrectly (i.e., +not in proportion to the weights). + +The new tunable is: + + * ``straw_calc_version``: A value of 0 preserves the old, broken + internal weight calculation; a value of 1 fixes the behavior. 
+ +Migration impact: + + * Moving to straw_calc_version 1 and then adjusting a straw bucket + (by adding, removing, or reweighting an item, or by using the + reweight-all command) can trigger a small to moderate amount of + data movement *if* the cluster has hit one of the problematic + conditions. + +This tunable option is special because it has absolutely no impact +concerning the required kernel version in the client side. + +hammer (CRUSH_V4) +----------------- + +The hammer tunable profile does not affect the +mapping of existing CRUSH maps simply by changing the profile. However: + + * There is a new bucket type (``straw2``) supported. The new + ``straw2`` bucket type fixes several limitations in the original + ``straw`` bucket. Specifically, the old ``straw`` buckets would + change some mappings that should have changed when a weight was + adjusted, while ``straw2`` achieves the original goal of only + changing mappings to or from the bucket item whose weight has + changed. + + * ``straw2`` is the default for any newly created buckets. + +Migration impact: + + * Changing a bucket type from ``straw`` to ``straw2`` will result in + a reasonably small amount of data movement, depending on how much + the bucket item weights vary from each other. When the weights are + all the same no data will move, and when item weights vary + significantly there will be more movement. + +jewel (CRUSH_TUNABLES5) +----------------------- + +The jewel tunable profile improves the +overall behavior of CRUSH such that significantly fewer mappings +change when an OSD is marked out of the cluster. + +The new tunable is: + + * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will + use a better value for an inner loop that greatly reduces the number + of mapping changes when an OSD is marked out. The legacy value is 0, + while the new value of 1 uses the new approach. 
+ +Migration impact: + + * Changing this value on an existing cluster will result in a very + large amount of data movement as almost every PG mapping is likely + to change. + + + + +Which client versions support CRUSH_TUNABLES +-------------------------------------------- + + * argonaut series, v0.48.1 or later + * v0.49 or later + * Linux kernel version v3.6 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES2 +--------------------------------------------- + + * v0.55 or later, including bobtail series (v0.56.x) + * Linux kernel version v3.9 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES3 +--------------------------------------------- + + * v0.78 (firefly) or later + * Linux kernel version v3.15 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_V4 +-------------------------------------- + + * v0.94 (hammer) or later + * Linux kernel version v4.1 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES5 +--------------------------------------------- + + * v10.0.2 (jewel) or later + * Linux kernel version v4.5 or later (for the file system and RBD kernel clients) + +Warning when tunables are non-optimal +------------------------------------- + +Starting with version v0.74, Ceph will issue a health warning if the +current CRUSH tunables don't include all the optimal values from the +``default`` profile (see below for the meaning of the ``default`` profile). +To make this warning go away, you have two options: + +1. Adjust the tunables on the existing cluster. Note that this will + result in some data movement (possibly as much as 10%). This is the + preferred route, but should be taken with care on a production cluster + where the data movement may affect performance. 
You can enable optimal + tunables with:: + + ceph osd crush tunables optimal + + If things go poorly (e.g., too much load) and not very much + progress has been made, or there is a client compatibility problem + (old kernel cephfs or rbd clients, or pre-bobtail librados + clients), you can switch back with:: + + ceph osd crush tunables legacy + +2. You can make the warning go away without making any changes to CRUSH by + adding the following option to your ceph.conf ``[mon]`` section:: + + mon warn on legacy crush tunables = false + + For the change to take effect, you will need to restart the monitors, or + apply the option to running monitors with:: + + ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false + + +A few important points +---------------------- + + * Adjusting these values will result in the shift of some PGs between + storage nodes. If the Ceph cluster is already storing a lot of + data, be prepared for some fraction of the data to move. + * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the + feature bits of new connections as soon as they get + the updated map. However, already-connected clients are + effectively grandfathered in, and will misbehave if they do not + support the new feature. + * If the CRUSH tunables are set to non-legacy values and then later + changed back to the default values, ``ceph-osd`` daemons will not be + required to support the feature. However, the OSD peering process + requires examining and understanding old maps. Therefore, you + should not run old versions of the ``ceph-osd`` daemon + if the cluster has previously used non-legacy CRUSH values, even if + the latest version of the map has been switched back to using the + legacy defaults. + +Tuning CRUSH +------------ + +The simplest way to adjust the crush tunables is by changing to a known +profile. Those are: + + * ``legacy``: the legacy behavior from argonaut and earlier. 
+
+ * ``argonaut``: the legacy values supported by the original argonaut release
+ * ``bobtail``: the values supported by the bobtail release
+ * ``firefly``: the values supported by the firefly release
+ * ``hammer``: the values supported by the hammer release
+ * ``jewel``: the values supported by the jewel release
+ * ``optimal``: the best (i.e., optimal) values of the current version of Ceph
+ * ``default``: the default values of a new cluster installed from
+   scratch.  These values, which depend on the current version of Ceph,
+   are hard coded and are generally a mix of optimal and legacy values.
+   These values generally match the ``optimal`` profile of the previous
+   LTS release, or the most recent release for which we expect most
+   users to have up-to-date clients.
+
+You can select a profile on a running cluster with the command::
+
+  ceph osd crush tunables {PROFILE}
+
+Note that this may result in some data movement.
+
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
+
+
+Primary Affinity
+================
+
+When a Ceph Client reads or writes data, it always contacts the primary OSD in
+the acting set.  For set ``[2, 3, 4]``, ``osd.2`` is the primary.  Sometimes an
+OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
+a slow disk or a slow controller).  To prevent performance bottlenecks
+(especially on read operations) while maximizing utilization of your hardware,
+you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
+the OSD as a primary in an acting set. ::
+
+  ceph osd primary-affinity <osd-id> <weight>
+
+Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary).  You
+may set the OSD primary affinity in the range ``0-1``, where ``0`` means that
+the OSD may **NOT** be used as a primary and ``1`` means that an OSD may be
+used as a primary.
When the weight is ``< 1``, it is less likely that CRUSH will select +the Ceph OSD Daemon to act as a primary. + + + diff --git a/doc/rados/operations/data-placement.rst b/doc/rados/operations/data-placement.rst new file mode 100644 index 00000000..bd9bd7ec --- /dev/null +++ b/doc/rados/operations/data-placement.rst @@ -0,0 +1,43 @@ +========================= + Data Placement Overview +========================= + +Ceph stores, replicates and rebalances data objects across a RADOS cluster +dynamically. With many different users storing objects in different pools for +different purposes on countless OSDs, Ceph operations require some data +placement planning. The main data placement planning concepts in Ceph include: + +- **Pools:** Ceph stores data within pools, which are logical groups for storing + objects. Pools manage the number of placement groups, the number of replicas, + and the CRUSH rule for the pool. To store data in a pool, you must have + an authenticated user with permissions for the pool. Ceph can snapshot pools. + See `Pools`_ for additional details. + +- **Placement Groups:** Ceph maps objects to placement groups (PGs). + Placement groups (PGs) are shards or fragments of a logical object pool + that place objects as a group into OSDs. Placement groups reduce the amount + of per-object metadata when Ceph stores the data in OSDs. A larger number of + placement groups (e.g., 100 per OSD) leads to better balancing. See + `Placement Groups`_ for additional details. + +- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without + performance bottlenecks, without limitations to scalability, and without a + single point of failure. CRUSH maps provide the physical topology of the + cluster to the CRUSH algorithm to determine where the data for an object + and its replicas should be stored, and how to do so across failure domains + for added data safety among other things. See `CRUSH Maps`_ for additional + details. 
+ +- **Balancer:** The balancer is a feature that will automatically optimize the + distribution of PGs across devices to achieve a balanced data distribution, + maximizing the amount of data that can be stored in the cluster and evenly + distributing the workload across OSDs. + +When you initially set up a test cluster, you can use the default values. Once +you begin planning for a large Ceph cluster, refer to pools, placement groups +and CRUSH for data placement operations. + +.. _Pools: ../pools +.. _Placement Groups: ../placement-groups +.. _CRUSH Maps: ../crush-map +.. _Balancer: ../balancer diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst new file mode 100644 index 00000000..b3db1e3d --- /dev/null +++ b/doc/rados/operations/devices.rst @@ -0,0 +1,133 @@ +Device Management +================= + +Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by +which daemons, and collects health metrics about those devices in order to +provide tools to predict and/or automatically respond to hardware failure. + +Device tracking +--------------- + +You can query which storage devices are in use with:: + + ceph device ls + +You can also list devices by daemon or by host:: + + ceph device ls-by-daemon <daemon> + ceph device ls-by-host <host> + +For any individual device, you can query information about its +location and how it is being consumed with:: + + ceph device info <devid> + + +Enabling monitoring +------------------- + +Ceph can also monitor health metrics associated with your device. For +example, SATA hard disks implement a standard called SMART that +provides a wide range of internal metrics about the device's usage and +health, like the number of hours powered on, number of power cycles, +or unrecoverable read errors. Other device types like SAS and NVMe +implement a similar set of metrics (via slightly different standards). +All of these can be collected by Ceph via the ``smartctl`` tool. 
+ +You can enable or disable health monitoring with:: + + ceph device monitoring on + +or:: + + ceph device monitoring off + + +Scraping +-------- + +If monitoring is enabled, metrics will automatically be scraped at regular intervals. That interval can be configured with:: + + ceph config set mgr mgr/devicehealth/scrape_frequency <seconds> + +The default is to scrape once every 24 hours. + +You can manually trigger a scrape of all devices with:: + + ceph device scrape-health-metrics + +A single device can be scraped with:: + + ceph device scrape-health-metrics <device-id> + +Or a single daemon's devices can be scraped with:: + + ceph device scrape-daemon-health-metrics <who> + +The stored health metrics for a device can be retrieved (optionally +for a specific timestamp) with:: + + ceph device get-health-metrics <devid> [sample-timestamp] + +Failure prediction +------------------ + +Ceph can predict life expectancy and device failures based on the +health metrics it collects. There are three modes: + +* *none*: disable device failure prediction. +* *local*: use a pre-trained prediction model from the ceph-mgr daemon +* *cloud*: share device health and performance metrics an external + cloud service run by ProphetStor, using either their free service or + a paid service with more accurate predictions + +The prediction mode can be configured with:: + + ceph config set global device_failure_prediction_mode <mode> + +Prediction normally runs in the background on a periodic basis, so it +may take some time before life expectancy values are populated. 
You +can see the life expectancy of all devices in output from:: + + ceph device ls + +You can also query the metadata for a specific device with:: + + ceph device info <devid> + +You can explicitly force prediction of a device's life expectancy with:: + + ceph device predict-life-expectancy <devid> + +If you are not using Ceph's internal device failure prediction but +have some external source of information about device failures, you +can inform Ceph of a device's life expectancy with:: + + ceph device set-life-expectancy <devid> <from> [<to>] + +Life expectancies are expressed as a time interval so that +uncertainty can be expressed in the form of a wide interval. The +interval end can also be left unspecified. + +Health alerts +------------- + +The ``mgr/devicehealth/warn_threshold`` controls how soon an expected +device failure must be before we generate a health warning. + +The stored life expectancy of all devices can be checked, and any +appropriate health alerts generated, with:: + + ceph device check-health + +Automatic Mitigation +-------------------- + +If the ``mgr/devicehealth/self_heal`` option is enabled (it is by +default), then for devices that are expected to fail soon the module +will automatically migrate data away from them by marking the devices +"out". + +The ``mgr/devicehealth/mark_out_threshold`` controls how soon an +expected device failure must be before we automatically mark an osd +"out". diff --git a/doc/rados/operations/erasure-code-clay.rst b/doc/rados/operations/erasure-code-clay.rst new file mode 100644 index 00000000..70f4ffff --- /dev/null +++ b/doc/rados/operations/erasure-code-clay.rst @@ -0,0 +1,235 @@ +================ +CLAY code plugin +================ + +CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings +in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. 
Let: + + d = number of OSDs contacted during repair + +If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires +reading from the *d=8* others to repair. Recovering, say, 1GiB of data then requires +downloading 8 X 1GiB = 8GiB of information. + +However, in the case of the *clay* plugin, *d* is configurable within the limits: + + k+1 <= d <= k+m-1 + +By default, the clay code plugin picks *d=k+m-1* as it provides the greatest savings in terms +of network bandwidth and disk IO. In the case of the *clay* plugin configured with +*k=8*, *m=4* and *d=11*, when a single OSD fails, *d=11* OSDs are contacted and +250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB +of information. More general parameters are provided below. The benefits are substantial +when the repair is carried out for a rack that stores information on the order of +terabytes. + + +-------------+---------------------------+ + | plugin | total amount of disk IO | + +=============+===========================+ + |jerasure,isa | k*S | + +-------------+---------------------------+ + | clay | d*S/(d-k+1) = (k+m-1)*S/m | + +-------------+---------------------------+ + +where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have +used the largest possible value of *d* as this will result in the smallest amount of data download needed +to achieve recovery from an OSD failure.
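The figures above can be reproduced with a small sketch of the two repair-cost formulas (amounts in GiB; ``S`` is the data stored on the failed OSD; the function names are illustrative, not part of any Ceph API):

```python
# Repair traffic for a classical MDS code (jerasure/isa) vs the clay code.
# jerasure/isa reads k full chunks; clay contacts d helpers, each of
# which sends only S / (d - k + 1) worth of data.

def mds_repair_traffic(k, S):
    """Classical Reed-Solomon repair reads k full chunks: k * S."""
    return k * S

def clay_repair_traffic(k, m, S, d=None):
    """Clay repair contacts d helpers, each sending S / (d - k + 1)."""
    if d is None:
        d = k + m - 1            # default picked by the clay plugin
    assert k + 1 <= d <= k + m - 1
    return d * S / (d - k + 1)

S = 1.0  # GiB on the failed OSD
print(mds_repair_traffic(8, S))      # 8.0 GiB with k=8, m=4 jerasure
print(clay_repair_traffic(8, 4, S))  # 2.75 GiB with clay and d=11
```

Choosing a smaller *d* within the allowed range trades away some of the savings: with *d=9* the same repair downloads 4.5 GiB.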
+ +Erasure-code profile examples +============================= + +An example configuration that can be used to observe reduced bandwidth usage:: + + $ ceph osd erasure-code-profile set CLAYprofile \ + plugin=clay \ + k=4 m=2 d=5 \ + crush-failure-domain=host + $ ceph osd pool create claypool 12 12 erasure CLAYprofile + + +Creating a clay profile +======================= + +To create a new clay code profile:: + + ceph osd erasure-code-profile set {name} \ + plugin=clay \ + k={data-chunks} \ + m={coding-chunks} \ + [d={helper-chunks}] \ + [scalar_mds={plugin-name}] \ + [technique={technique-name}] \ + [crush-failure-domain={bucket-type}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each of which is stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``d={helper-chunks}`` + +:Description: Number of OSDs requested to send data during recovery of + a single chunk. *d* needs to be chosen such that + k+1 <= d <= k+m-1. Larger the *d*, the better the savings. + +:Type: Integer +:Required: No. +:Default: k+m-1 + +``scalar_mds={jerasure|isa|shec}`` + +:Description: **scalar_mds** specifies the plugin that is used as a + building block in the layered construction. It can be + one of *jerasure*, *isa*, *shec* + +:Type: String +:Required: No. +:Default: jerasure + +``technique={technique}`` + +:Description: **technique** specifies the technique that will be picked + within the 'scalar_mds' plugin specified. Supported techniques + are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig', + 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van', + 'cauchy' for isa and 'single', 'multiple' for shec. 
+ +:Type: String +:Required: No. +:Default: reed_sol_van (for jerasure, isa), single (for shec) + + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + + +Notion of sub-chunks +==================== + +The Clay code saves disk IO and network bandwidth because it is a vector +code: it can view and manipulate data within a chunk at a finer +granularity, termed a sub-chunk. The number of sub-chunks within +a chunk for a Clay code is given by: + + sub-chunk count = q\ :sup:`(k+m)/q`, where q=d-k+1 + + +During repair of an OSD, the helper information requested +from an available OSD is only a fraction of a chunk. In fact, the number +of sub-chunks within a chunk that are accessed during repair is given by: + + repair sub-chunk count = sub-chunk count / q + +Examples +-------- + +#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is + 8 and the repair sub-chunk count is 4.
Therefore, only half of a chunk is read + during repair. +#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and repair sub-chunk count + is 16. A quarter of a chunk is read from an available OSD for repair of a failed + chunk. + + + +How to choose a configuration given a workload +============================================== + +Only a few sub-chunks are read of all the sub-chunks within a chunk. These sub-chunks +are not necessarily stored consecutively within a chunk. For best disk IO +performance, it is helpful to read contiguous data. For this reason, it is suggested that +you choose stripe-size such that the sub-chunk size is sufficiently large. + +For a given stripe-size (that's fixed based on a workload), choose ``k``, ``m``, ``d`` such that:: + + sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ... + +#. For large size workloads for which the stripe size is large, it is easy to choose k, m, d. + For example consider a stripe-size of size 64MB, choosing *k=16*, *m=4* and *d=19* will + result in a sub-chunk count of 1024 and a sub-chunk size of 4KB. +#. For small size workloads, *k=4*, *m=2* is a good configuration that provides both network + and disk IO benefits. + +Comparisons with LRC +==================== + +Locally Recoverable Codes (LRC) are also designed in order to save in terms of network +bandwidth, disk IO during single OSD recovery. However, the focus in LRCs is to keep the +number of OSDs contacted during repair (d) to be minimal, but this comes at the cost of storage overhead. +The *clay* code has a storage overhead m/k. In the case of an *lrc*, it stores (k+m)/d parities in +addition to the ``m`` parities resulting in a storage overhead (m+(k+m)/d)/k. Both *clay* and *lrc* +can recover from the failure of any ``m`` OSDs. 
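The trade-off summarized in the table below follows directly from these formulas; a quick numeric sketch (``S`` normalized to 1, helper function names are illustrative only):

```python
# Repair disk IO and storage overhead for LRC vs CLAY, per the formulas
# in the text: LRC reads d chunks and has overhead (m + (k+m)/d)/k;
# CLAY reads d*S/(d-k+1) and has overhead m/k.

def lrc_cost(k, m, d, S=1.0):
    disk_io = d * S                      # d OSDs each send a full chunk
    overhead = (m + (k + m) / d) / k     # extra local parity chunks
    return disk_io, overhead

def clay_cost(k, m, d, S=1.0):
    disk_io = d * S / (d - k + 1)        # d helpers send partial chunks
    overhead = m / k
    return disk_io, overhead

print(lrc_cost(10, 4, 7))    # (7.0, 0.6)
print(clay_cost(10, 4, 13))  # (3.25, 0.4)
print(lrc_cost(16, 4, 4))    # (4.0, 0.5625)
print(clay_cost(16, 4, 19))  # (4.75, 0.25)
```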
+ + +-----------------+----------------------------------+----------------------------------+ + | Parameters | disk IO, storage overhead (LRC) | disk IO, storage overhead (CLAY) | + +=================+==================================+==================================+ + | (k=10, m=4) | 7 * S, 0.6 (d=7) | 3.25 * S, 0.4 (d=13) | + +-----------------+----------------------------------+----------------------------------+ + | (k=16, m=4) | 4 * S, 0.5625 (d=4) | 4.75 * S, 0.25 (d=19) | + +-----------------+----------------------------------+----------------------------------+ + + +where ``S`` is the amount of data stored on a single OSD being recovered. diff --git a/doc/rados/operations/erasure-code-isa.rst b/doc/rados/operations/erasure-code-isa.rst new file mode 100644 index 00000000..3db07c6b --- /dev/null +++ b/doc/rados/operations/erasure-code-isa.rst @@ -0,0 +1,105 @@ +======================= +ISA erasure code plugin +======================= + +The *isa* plugin encapsulates the `ISA +<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_ +library. It only runs on Intel processors. + +Create an isa profile +===================== + +To create a new *isa* erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + plugin=isa \ + technique={reed_sol_van|cauchy} \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 7 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No.
+:Default: 3 + +``technique={reed_sol_van|cauchy}`` + +:Description: The ISA plugin comes in two `Reed Solomon + <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_ + forms. If *reed_sol_van* is set, it is `Vandermonde + <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if + *cauchy* is set, it is `Cauchy + <https://en.wikipedia.org/wiki/Cauchy_matrix>`_. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For intance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-jerasure.rst b/doc/rados/operations/erasure-code-jerasure.rst new file mode 100644 index 00000000..878bcac5 --- /dev/null +++ b/doc/rados/operations/erasure-code-jerasure.rst @@ -0,0 +1,119 @@ +============================ +Jerasure erasure code plugin +============================ + +The *jerasure* plugin is the most generic and flexible plugin, it is +also the default for Ceph erasure coded pools. 
+ +The *jerasure* plugin encapsulates the `Jerasure +<http://jerasure.org>`_ library. It is +recommended to read the *jerasure* documentation to get a better +understanding of the parameters. + +Create a jerasure profile +========================= + +To create a new *jerasure* erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + plugin=jerasure \ + k={data-chunks} \ + m={coding-chunks} \ + technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split into **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}`` + +:Description: The most flexible technique is *reed_sol_van*: it is + enough to set *k* and *m*. The *cauchy_good* technique + can be faster but you need to choose the *packetsize* + carefully. All of *reed_sol_r6_op*, *liberation*, + *blaum_roth*, *liber8tion* are *RAID6* equivalents in + the sense that they can only be configured with *m=2*. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``packetsize={bytes}`` + +:Description: The encoding will be done on packets of *bytes* size at + a time. Choosing the right packet size is difficult. The + *jerasure* documentation contains extensive information + on this topic. + +:Type: Integer +:Required: No.
+:Default: 2048 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/doc/rados/operations/erasure-code-lrc.rst b/doc/rados/operations/erasure-code-lrc.rst new file mode 100644 index 00000000..eeb07a38 --- /dev/null +++ b/doc/rados/operations/erasure-code-lrc.rst @@ -0,0 +1,371 @@ +====================================== +Locally repairable erasure code plugin +====================================== + +With the *jerasure* plugin, when an erasure coded object is stored on +multiple OSDs, recovering from the loss of one OSD requires reading +from all the others. For instance if *jerasure* is configured with +*k=8* and *m=4*, losing one OSD requires reading from the eleven +others to repair. + +The *lrc* erasure code plugin creates local parity chunks to be able +to recover using less OSDs. For instance if *lrc* is configured with +*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for +every four OSDs. 
When a single OSD is lost, it can be recovered with +only four OSDs instead of eleven. + +Erasure code profile examples +============================= + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-failure-domain=host + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the reduced bandwidth will only be observed if the primary +OSD is in the same rack as the lost chunk.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-locality=rack \ + crush-failure-domain=host + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + + +Create an lrc profile +===================== + +To create a new lrc erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + plugin=lrc \ + k={data-chunks} \ + m={coding-chunks} \ + l={locality} \ + [crush-root={root}] \ + [crush-locality={bucket-type}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``l={locality}`` + +:Description: Group the coding and data chunks into sets of size + **locality**. For instance, for **k=4** and **m=2**, + when **locality=3** two groups of three are created. 
+ Each set can be recovered without reading chunks + from another set. + +:Type: Integer +:Required: Yes. +:Example: 3 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the CRUSH rule. For instance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-locality={bucket-type}`` + +:Description: The type of the crush bucket in which each set of chunks + defined by **l** will be stored. For instance, if it is + set to **rack**, each group of **l** chunks will be + placed in a different rack. It is used to create a + CRUSH rule step such as **step choose rack**. If it is not + set, no such grouping is done. + +:Type: String +:Required: No. + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a CRUSH rule step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Low level plugin configuration +============================== + +The sum of **k** and **m** must be a multiple of the **l** parameter. +The low level configuration parameters do not impose such a +restriction and it may be more convenient to use it for specific +purposes. It is for instance possible to define two groups, one with 4 +chunks and another with 3 chunks. 
It is also possible to define +locality sets recursively, for instance racks within +datacenters. The **k/m/l** parameters are implemented by generating such a low level +configuration. + +The *lrc* erasure code plugin recursively applies erasure code +techniques so that recovering from the loss of some chunks only +requires a subset of the available chunks, most of the time. + +For instance, when three coding steps are described as:: + + chunk nr 01234567 + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +where *c* are coding chunks calculated from the data chunks *D*, the +loss of chunk *7* can be recovered with the last four chunks, and the +loss of chunk *2* can be recovered with the first four +chunks. + +Erasure code profile examples using low level configuration +=========================================================== + +Minimal testing +--------------- + +It is strictly equivalent to using the default erasure code profile. The *DD* +implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used +by default.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "" ] ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed.
It is equivalent to **k=4**, **m=2** and **l=3** although +the layout of the chunks is different:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the reduced bandwidth will only be observed if the primary +OSD is in the same rack as the lost chunk.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' \ + crush-steps='[ + [ "choose", "rack", 2 ], + [ "chooseleaf", "host", 4 ], + ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + +Testing with different Erasure Code backends +-------------------------------------------- + +LRC now uses jerasure as the default EC backend. It is possible to +specify the EC backend/algorithm on a per layer basis using the low +level configuration. The second argument in layers='[ [ "DDc", "" ] ]' +is actually an erasure code profile to be used for this level. 
The +example below specifies the ISA backend with the cauchy technique to +be used in the lrcpool.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + +You could also use a different erasure code profile for each +layer.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "plugin=isa technique=cauchy" ], + [ "cDDD____", "plugin=isa" ], + [ "____cDDD", "plugin=jerasure" ], + ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + + + +Erasure coding and decoding algorithm +===================================== + +The steps found in the layers description:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +are applied in order. For instance, if a 4K object is encoded, it will +first go through *step 1* and be divided into four 1K chunks (the four +uppercase D). They are stored in the chunks 2, 3, 6 and 7, in +order. From these, two coding chunks are calculated (the two lowercase +c). The coding chunks are stored in the chunks 1 and 5, respectively. + +The *step 2* re-uses the content created by *step 1* in a similar +fashion and stores a single coding chunk *c* at position 0. The last four +chunks, marked with an underscore (*_*) for readability, are ignored. + +The *step 3* stores a single coding chunk *c* at position 4. The three +chunks created by *step 1* are used to compute this coding chunk, +i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*. + +If chunk *2* is lost:: + + chunk nr 01234567 + + step 1 _c D_cDD + step 2 cD D____ + step 3 __ _cDDD + +decoding will attempt to recover it by walking the steps in reverse +order: *step 3* then *step 2* and finally *step 1*. + +The *step 3* knows nothing about chunk *2* (i.e. it is an underscore) +and is skipped.
The coding chunk from *step 2*, stored in chunk *0*, allows it to
recover the content of chunk *2*. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3, 6* are lost::

    chunk nr    01234567

    step 1      _c  _c D
    step 2      cD  __ _
    step 3      __  cD D

*Step 3* can recover the content of chunk *6*::

    chunk nr    01234567

    step 1      _c  _cDD
    step 2      cD  ____
    step 3      __  cDDD

*Step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow it to
recover the content of chunks *2* and *3*::

    chunk nr    01234567

    step 1      _cDD_cDD
    step 2      cDDD____
    step 3      ____cDDD

Controlling CRUSH placement
===========================

The default CRUSH rule provides OSDs that are on different hosts. For instance::

    chunk nr    01234567

    step 1      _cDD_cDD
    step 2      cDDD____
    step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

    crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a rule that will select two crush buckets of type
*rack* and for each of them choose four OSDs, each of them located in
different buckets of type *host*.

The CRUSH rule can also be manually crafted for finer control.

diff --git a/doc/rados/operations/erasure-code-profile.rst b/doc/rados/operations/erasure-code-profile.rst

.. _erasure-code-profiles:

=====================
Erasure code profiles
=====================

Erasure code is defined by a **profile** and is used when creating an
erasure coded pool and the associated CRUSH rule.

The **default** erasure code profile (which is created when the Ceph
cluster is initialized) provides the same level of redundancy as two
copies but requires 25% less disk space. It is described as a profile
with **k=2** and **m=1**, meaning the information is spread over three
OSDs (k+m == 3) and one of them can be lost.

To improve redundancy without increasing raw storage requirements, a
new profile can be created. For instance, a profile with **k=10** and
**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an
object on fourteen (k+m=14) OSDs. The object is first divided into
**10** chunks (if the object is 10MB, each chunk is 1MB) and **4**
coding chunks are computed, for recovery (each coding chunk has the
same size as the data chunk, i.e. 1MB). The raw space overhead is only
40% and the object will not be lost even if four OSDs break at the
same time.

.. _list of available plugins:

.. toctree::
    :maxdepth: 1

    erasure-code-jerasure
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec
    erasure-code-clay

osd erasure-code-profile set
============================

To create a new erasure code profile::

    ceph osd erasure-code-profile set {name} \
         [{directory=directory}] \
         [{plugin=plugin}] \
         [{stripe_unit=stripe_unit}] \
         [{key=value} ...] \
         [--force]

Where:

``{directory=directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``{plugin=plugin}``

:Description: Use the erasure code **plugin** to compute coding chunks
              and recover missing chunks. See the `list of available
              plugins`_ for more information.

:Type: String
:Required: No.
:Default: jerasure

``{stripe_unit=stripe_unit}``

:Description: The amount of data in a data chunk, per stripe. For
              example, a profile with 2 data chunks and stripe_unit=4K
              would put the range 0-4K in chunk 0, 4K-8K in chunk 1,
              then 8K-12K in chunk 0 again. This should be a multiple
              of 4K for best performance. The default value is taken
              from the monitor config option
              ``osd_pool_erasure_code_stripe_unit`` when a pool is
              created. The stripe_width of a pool using this profile
              will be the number of data chunks multiplied by this
              stripe_unit.

:Type: String
:Required: No.

``{key=value}``

:Description: The semantics of the remaining key/value pairs are defined
              by the erasure code plugin.

:Type: String
:Required: No.

``--force``

:Description: Override an existing profile by the same name, and allow
              setting a non-4K-aligned stripe_unit.

:Type: String
:Required: No.

osd erasure-code-profile rm
===========================

To remove an erasure code profile::

    ceph osd erasure-code-profile rm {name}

If the profile is referenced by a pool, the deletion will fail.

osd erasure-code-profile get
============================

To display an erasure code profile::

    ceph osd erasure-code-profile get {name}

osd erasure-code-profile ls
===========================

To list the names of all erasure code profiles::

    ceph osd erasure-code-profile ls

diff --git a/doc/rados/operations/erasure-code-shec.rst b/doc/rados/operations/erasure-code-shec.rst

========================
SHEC erasure code plugin
========================

The *shec* plugin encapsulates the `multiple SHEC
<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_
library. It allows Ceph to recover data more efficiently than Reed Solomon codes.
Create an SHEC profile
======================

To create a new *shec* erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=shec \
         [k={data-chunks}] \
         [m={coding-chunks}] \
         [c={durability-estimator}] \
         [crush-root={root}] \
         [crush-failure-domain={bucket-type}] \
         [crush-device-class={device-class}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: No.
:Default: 4

``m={coding-chunks}``

:Description: Compute **coding-chunks** for each object and store them on
              different OSDs. The number of **coding-chunks** does not necessarily
              equal the number of OSDs that can be down without losing data.

:Type: Integer
:Required: No.
:Default: 3

``c={durability-estimator}``

:Description: The number of parity chunks each of which includes each data chunk in its
              calculation range. The number is used as a **durability estimator**.
              For instance, if c=2, 2 OSDs can be down without losing data.

:Type: Integer
:Required: No.
:Default: 2

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.

Brief description of SHEC's layouts
===================================

Space Efficiency
----------------

Space efficiency is the ratio of data chunks to all chunks in an object,
represented as k/(k+m).
In order to improve space efficiency, you should increase k or decrease m.

::

    space efficiency of SHEC(4,3,2) = 4/(4+3) = 0.57
    SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency

Durability
----------

The third parameter of SHEC (=c) is a durability estimator, which approximates
the number of OSDs that can be down without losing data.

``durability estimator of SHEC(4,3,2) = 2``

Recovery Efficiency
-------------------

Describing the calculation of recovery efficiency is beyond the scope of this
document, but at least increasing m without increasing c improves recovery
efficiency. (However, we must pay attention to the sacrifice of space
efficiency in this case.)

``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency``

Erasure code profile examples
=============================

::

    $ ceph osd erasure-code-profile set SHECprofile \
         plugin=shec \
         k=8 m=4 c=3 \
         crush-failure-domain=host
    $ ceph osd pool create shecpool 256 256 erasure SHECprofile

diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst

.. _ecpool:

=============
 Erasure code
=============

A Ceph pool is associated with a type to sustain the loss of an OSD
(i.e. a disk since most of the time there is one OSD per disk).
The default choice when `creating a pool <../pools>`_ is *replicated*,
meaning every object is copied on multiple disks. The `Erasure Code
<https://en.wikipedia.org/wiki/Erasure_code>`_ pool type can be used
instead to save space.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts::

    $ ceph osd pool create ecpool 12 12 erasure
    pool 'ecpool' created
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

.. note:: the 12 in *pool create* stands for
          `the number of placement groups <../pools>`_.

Erasure code profiles
---------------------

The default erasure code profile sustains the loss of a single OSD. It
is equivalent to a replicated pool of size two but requires 1.5TB
instead of 2TB to store 1TB of data. The default profile can be
displayed with::

    $ ceph osd erasure-code-profile get default
    k=2
    m=1
    plugin=jerasure
    crush-failure-domain=host
    technique=reed_sol_van

Choosing the right profile is important because it cannot be modified
after the pool is created: a new pool with a different profile needs
to be created and all objects from the previous pool moved to the new
pool.

The most important parameters of the profile are *K*, *M* and
*crush-failure-domain* because they define the storage overhead and
the data durability.
For instance, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined::

    $ ceph osd erasure-code-profile set myprofile \
         k=3 \
         m=2 \
         crush-failure-domain=rack
    $ ceph osd pool create ecpool 12 12 erasure myprofile
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

The *NYAN* object will be divided into three (*K=3*) and two additional
*chunks* will be created (*M=2*). The value of *M* defines how many
OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.

.. ditaa::
                            +-------------------+
                       name |       NYAN        |
                            +-------------------+
                    content |     ABCDEFGHI     |
                            +--------+----------+
                                     |
                                     |
                                     v
                              +------+------+
              +---------------+ encode(3,2) +-------------+
              |               +--+--+---+---+             |
              |                  |  |   |                 |
              |          +-------+  |   +------+          |
              |          |          |          |          |
           +--v---+   +--v---+   +--v---+   +--v---+   +--v---+
     name  | NYAN |   | NYAN |   | NYAN |   | NYAN |   | NYAN |
           +------+   +------+   +------+   +------+   +------+
    shard  |  1   |   |  2   |   |  3   |   |  4   |   |  5   |
           +------+   +------+   +------+   +------+   +------+
  content  | ABC  |   | DEF  |   | GHI  |   | YXY  |   | QGC  |
           +--+---+   +--+---+   +--+---+   +--+---+   +--+---+
              |          |          |          |          |
              |          |          v          |          |
              |          |       +--+---+      |          |
              |          |       | OSD1 |      |          |
              |          |       +------+      |          |
              |          |                     |          |
              |          |       +------+      |          |
              |          +------>| OSD2 |      |          |
              |                  +------+      |          |
              |                                |          |
              |                  +------+      |          |
              |                  | OSD3 |<-----+          |
              |                  +------+                 |
              |                                           |
              |                  +------+                 |
              |                  | OSD4 |<----------------+
              |                  +------+
              |
              |                  +------+
              +----------------->| OSD5 |
                                 +------+


More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools only work with uses like RGW that
perform full object writes and appends.
Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool::

    ceph osd pool set ec_pool allow_ec_overwrites true

This can only be enabled on a pool residing on bluestore OSDs, since
bluestore's checksumming is used to detect bitrot or other corruption
during deep-scrub. In addition to being unsafe, using filestore with
ec overwrites yields low performance compared to bluestore.

Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation::

    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.


Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.

For instance, if the pool *hot-storage* is made of fast storage::

    $ ceph osd tier add ecpool hot-storage
    $ ceph osd tier cache-mode hot-storage writeback
    $ ceph osd tier set-overlay ecpool hot-storage

will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode so that every write and read to the *ecpool* actually uses the
*hot-storage* pool and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation.

Glossary
--------

*chunk*
   when the encoding function is called, it returns chunks of the same
   size.
   Data chunks which can be concatenated to reconstruct the original
   object and coding chunks which can be used to rebuild a lost chunk.

*K*
   the number of data *chunks*, i.e. the number of *chunks* in which the
   original object is divided. For instance if *K* = 2 a 10KB object
   will be divided into *K* chunks of 5KB each.

*M*
   the number of coding *chunks*, i.e. the number of additional *chunks*
   computed by the encoding functions. If there are 2 coding *chunks*,
   it means 2 OSDs can be out without losing data.


Table of contents
-----------------

.. toctree::
    :maxdepth: 1

    erasure-code-profile
    erasure-code-jerasure
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec
    erasure-code-clay

diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst

=============
Health checks
=============

Overview
========

There is a finite set of possible health messages that a Ceph cluster can
raise -- these are defined as *health checks* which have unique identifiers.

The identifier is a terse pseudo-human-readable (i.e. like a variable name)
string. It is intended to enable tools (such as UIs) to make sense of
health checks, and present them in a way that reflects their meaning.

This page lists the health checks that are raised by the monitor and manager
daemons. In addition to these, you may also see health checks that originate
from MDS daemons (see :ref:`cephfs-health-messages`), and health checks
that are defined by ceph-mgr python modules.

Definitions
===========

Monitor
-------

MON_DOWN
________

One or more monitor daemons are currently down. The cluster requires a
majority (more than 1/2) of the monitors in order to function.
When one or more monitors are down, clients may have a harder time forming
their initial connection to the cluster, as they may need to try more
addresses before they reach an operating monitor.

The down monitor daemon should generally be restarted as soon as
possible to reduce the risk of a subsequent monitor failure leading to
a service outage.

MON_CLOCK_SKEW
______________

The clocks on the hosts running the ceph-mon monitor daemons are not
sufficiently well synchronized. This health alert is raised if the
cluster detects a clock skew greater than ``mon_clock_drift_allowed``.

This is best resolved by synchronizing the clocks using a tool like
``ntpd`` or ``chrony``.

If it is impractical to keep the clocks closely synchronized, the
``mon_clock_drift_allowed`` threshold can also be increased, but this
value must stay significantly below the ``mon_lease`` interval in
order for the monitor cluster to function properly.

MON_MSGR2_NOT_ENABLED
_____________________

The ``ms_bind_msgr2`` option is enabled but one or more monitors is
not configured to bind to a v2 port in the cluster's monmap. This
means that features specific to the msgr2 protocol (e.g., encryption)
are not available on some or all connections.

In most cases this can be corrected by issuing the command::

    ceph mon enable-msgr2

That command will change any monitor configured for the old default
port 6789 to continue to listen for v1 connections on 6789 and also
listen for v2 connections on the new default port 3300.

If a monitor is configured to listen for v1 connections on a
non-standard port (not 6789), then the monmap will need to be modified
manually.

AUTH_INSECURE_GLOBAL_ID_RECLAIM
_______________________________

One or more clients or daemons are connected to the cluster that are
not securely reclaiming their global_id (a unique number identifying
each entity in the cluster) when reconnecting to a monitor.
The client is being permitted to connect anyway because the
``auth_allow_insecure_global_id_reclaim`` option is set to ``true`` (which may
be necessary until all ceph clients have been upgraded) and the
``auth_expose_insecure_global_id_reclaim`` option is set to ``true`` (which
allows monitors to detect clients with insecure reclaim early by forcing them to
reconnect right after they first authenticate).

You can identify which client(s) are using unpatched ceph client code with::

    ceph health detail

A client's global_id reclaim behavior can also be seen in the
``global_id_status`` field in the dump of clients connected to an
individual monitor (``reclaim_insecure`` means the client is
unpatched and is contributing to this health alert)::

    ceph tell mon.\* sessions

We strongly recommend that all clients in the system are upgraded to a
newer version of Ceph that correctly reclaims global_id values. Once
all clients have been updated, you can stop allowing insecure reconnections
with::

    ceph config set mon auth_allow_insecure_global_id_reclaim false

Although we do NOT recommend doing so, you can disable this warning indefinitely
with::

    ceph config set mon mon_warn_on_insecure_global_id_reclaim false

AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED
_______________________________________

Ceph is currently configured to allow clients to reconnect to monitors using
an insecure process to reclaim their previous global_id because the setting
``auth_allow_insecure_global_id_reclaim`` is set to ``true``. It may be necessary to
leave this setting enabled while existing Ceph clients are upgraded to newer
versions of Ceph that correctly and securely reclaim their global_id.
If the ``AUTH_INSECURE_GLOBAL_ID_RECLAIM`` health alert has not also been raised and
the ``auth_expose_insecure_global_id_reclaim`` setting has not been disabled (it is
on by default), then there are currently no clients connected that need to be
upgraded, and it is safe to disallow insecure global_id reclaim with::

    ceph config set mon auth_allow_insecure_global_id_reclaim false

Although we do NOT recommend doing so, you can disable this warning indefinitely
with::

    ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false


Manager
-------

MGR_MODULE_DEPENDENCY
_____________________

An enabled manager module is failing its dependency check. This health check
should come with an explanatory message from the module about the problem.

For example, a module might report that a required package is not installed:
install the required package and restart your manager daemons.

This health check is only applied to enabled modules. If a module is
not enabled, you can see whether it is reporting dependency issues in
the output of `ceph mgr module ls`.


MGR_MODULE_ERROR
________________

A manager module has experienced an unexpected error. Typically,
this means an unhandled exception was raised from the module's `serve`
function. The human readable description of the error may be obscurely
worded if the exception did not provide a useful description of itself.

This health check may indicate a bug: please open a Ceph bug report if you
think you have encountered a bug.

If you believe the error is transient, you may restart your manager
daemon(s), or use `ceph mgr fail` on the active daemon to prompt
a failover to another daemon.


OSDs
----

OSD_DOWN
________

One or more OSDs are marked down. The ceph-osd daemon may have been
stopped, or peer OSDs may be unable to reach the OSD over the network.
Common causes include a stopped or crashed daemon, a down host, or a
network outage.
Verify that the host is healthy, the daemon is started, and the network is
functioning. If the daemon has crashed, the daemon log file
(``/var/log/ceph/ceph-osd.*``) may contain debugging information.

OSD_<crush type>_DOWN
_____________________

(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN)

All the OSDs within a particular CRUSH subtree are marked down, for example
all OSDs on a host.

OSD_ORPHAN
__________

An OSD is referenced in the CRUSH map hierarchy but does not exist.

The OSD can be removed from the CRUSH hierarchy with::

    ceph osd crush rm osd.<id>

OSD_OUT_OF_ORDER_FULL
_____________________

The utilization thresholds for `backfillfull`, `nearfull`, `full`,
and/or `failsafe_full` are not ascending. In particular, we expect
`backfillfull < nearfull`, `nearfull < full`, and `full <
failsafe_full`.

The thresholds can be adjusted with::

    ceph osd set-backfillfull-ratio <ratio>
    ceph osd set-nearfull-ratio <ratio>
    ceph osd set-full-ratio <ratio>


OSD_FULL
________

One or more OSDs have exceeded the `full` threshold and are preventing
the cluster from servicing writes.

Utilization by pool can be checked with::

    ceph df

The currently defined `full` ratio can be seen with::

    ceph osd dump | grep full_ratio

A short-term workaround to restore write availability is to raise the full
threshold by a small amount::

    ceph osd set-full-ratio <ratio>

New storage should be added to the cluster by deploying more OSDs, or
existing data should be deleted in order to free up space.

OSD_BACKFILLFULL
________________

One or more OSDs have exceeded the `backfillfull` threshold, which
prevents data from being rebalanced to the device. This is
an early warning that rebalancing may not be able to complete and that
the cluster is approaching full.

Utilization by pool can be checked with::

    ceph df

OSD_NEARFULL
____________

One or more OSDs have exceeded the `nearfull` threshold.
This is an early
warning that the cluster is approaching full.

Utilization by pool can be checked with::

    ceph df

OSDMAP_FLAGS
____________

One or more cluster flags of interest have been set. These flags include:

* *full* - the cluster is flagged as full and cannot serve writes
* *pauserd*, *pausewr* - paused reads or writes
* *noup* - OSDs are not allowed to start
* *nodown* - OSD failure reports are being ignored, such that the
  monitors will not mark OSDs `down`
* *noin* - OSDs that were previously marked `out` will not be marked
  back `in` when they start
* *noout* - down OSDs will not automatically be marked out after the
  configured interval
* *nobackfill*, *norecover*, *norebalance* - recovery or data
  rebalancing is suspended
* *noscrub*, *nodeep_scrub* - scrubbing is disabled
* *notieragent* - cache tiering activity is suspended

With the exception of *full*, these flags can be set or cleared with::

    ceph osd set <flag>
    ceph osd unset <flag>

OSD_FLAGS
_________

One or more OSDs or CRUSH {nodes, device classes} have a flag of interest set.
These flags include:

* *noup*: these OSDs are not allowed to start
* *nodown*: failure reports for these OSDs will be ignored
* *noin*: if these OSDs were previously marked `out` automatically
  after a failure, they will not be marked in when they start
* *noout*: if these OSDs are down they will not automatically be marked
  `out` after the configured interval

These flags can be set and cleared in batch with::

    ceph osd set-group <flags> <who>
    ceph osd unset-group <flags> <who>

For example::

    ceph osd set-group noup,noout osd.0 osd.1
    ceph osd unset-group noup,noout osd.0 osd.1
    ceph osd set-group noup,noout host-foo
    ceph osd unset-group noup,noout host-foo
    ceph osd set-group noup,noout class-hdd
    ceph osd unset-group noup,noout class-hdd

OLD_CRUSH_TUNABLES
__________________

The CRUSH map is using very old settings and should be updated.
The oldest tunables that can be used (i.e., the oldest client version that
can connect to the cluster) without triggering this health warning are
determined by the ``mon_crush_min_required_version`` config option.
See :ref:`crush-map-tunables` for more information.

OLD_CRUSH_STRAW_CALC_VERSION
____________________________

The CRUSH map is using an older, non-optimal method for calculating
intermediate weight values for ``straw`` buckets.

The CRUSH map should be updated to use the newer method
(``straw_calc_version=1``). See
:ref:`crush-map-tunables` for more information.

CACHE_POOL_NO_HIT_SET
_____________________

One or more cache pools are not configured with a *hit set* to track
utilization, which will prevent the tiering agent from identifying
cold objects to flush and evict from the cache.

Hit sets can be configured on the cache pool with::

    ceph osd pool set <poolname> hit_set_type <type>
    ceph osd pool set <poolname> hit_set_period <period-in-seconds>
    ceph osd pool set <poolname> hit_set_count <number-of-hitsets>
    ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate>

OSD_NO_SORTBITWISE
__________________

No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
been set.

The ``sortbitwise`` flag must be set before luminous v12.y.z or newer
OSDs can start. You can safely set the flag with::

    ceph osd set sortbitwise

POOL_FULL
_________

One or more pools have reached their quota and are no longer allowing writes.

Pool quotas and utilization can be seen with::

    ceph df detail

You can either raise the pool quota with::

    ceph osd pool set-quota <poolname> max_objects <num-objects>
    ceph osd pool set-quota <poolname> max_bytes <num-bytes>

or delete some existing data to reduce utilization.
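Quota checks like this can also be scripted. Below is a minimal sketch (not part of Ceph) that flags pools approaching their quota from the parsed JSON of ``ceph df detail --format json``; the field names used here (``pools``, ``stats``, ``objects``, ``bytes_used``, ``quota_objects``, ``quota_bytes``) are assumptions about the JSON layout and may differ between releases:

```python
import json

def pools_near_quota(df_detail, threshold=0.9):
    """Return names of pools using more than `threshold` of either their
    object or byte quota (a quota of 0 means no quota is set)."""
    flagged = []
    for pool in df_detail.get("pools", []):
        stats = pool.get("stats", {})
        for used_key, quota_key in (("objects", "quota_objects"),
                                    ("bytes_used", "quota_bytes")):
            quota = stats.get(quota_key, 0)
            if quota and stats.get(used_key, 0) >= threshold * quota:
                flagged.append(pool["name"])
                break
    return flagged

# Hypothetical sample output; real field names may vary by release.
sample = json.loads("""
{"pools": [
  {"name": "rbd",    "stats": {"objects": 95, "quota_objects": 100,
                               "bytes_used": 10, "quota_bytes": 0}},
  {"name": "ecpool", "stats": {"objects": 10, "quota_objects": 0,
                               "bytes_used": 10, "quota_bytes": 1000}}
]}
""")
print(pools_near_quota(sample))  # → ['rbd'] (95% of its object quota)
```

In a real deployment the input would come from ``subprocess`` running ``ceph df detail --format json``; the pool flagged here would be a candidate for a quota raise or data deletion as described above.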
BLUEFS_SPILLOVER
________________

One or more OSDs that use the BlueStore backend have been allocated
`db` partitions (storage space for metadata, normally on a faster
device), but that space has filled, such that metadata has "spilled
over" onto the normal slow device. This isn't necessarily an error
condition or even unexpected, but if the administrator's expectation
was that all metadata would fit on the faster device, it indicates
that not enough space was provided.

This warning can be disabled on all OSDs with::

    ceph config set osd bluestore_warn_on_bluefs_spillover false

Alternatively, it can be disabled on a specific OSD with::

    ceph config set osd.123 bluestore_warn_on_bluefs_spillover false

To provide more metadata space, the OSD in question could be destroyed and
reprovisioned. This will involve data migration and recovery.

It may also be possible to expand the LVM logical volume backing the
`db` storage. If the underlying LV has been expanded, the OSD daemon
needs to be stopped and BlueFS informed of the device size change with::

    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID

BLUEFS_AVAILABLE_SPACE
______________________

To check how much space is free for BlueFS do::

    ceph daemon osd.123 bluestore bluefs available

This will output up to 3 values: `BDEV_DB free`, `BDEV_SLOW free` and
`available_from_bluestore`. `BDEV_DB` and `BDEV_SLOW` report the amount of
space that has been acquired by BlueFS and is considered free. The value
`available_from_bluestore` denotes the ability of BlueStore to relinquish
more space to BlueFS. It is normal for this value to differ from the amount
of BlueStore free space, as the BlueFS allocation unit is typically larger
than the BlueStore allocation unit. This means that only part of the
BlueStore free space will be acceptable for BlueFS.
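The rounding effect behind this difference can be illustrated with a small sketch (illustrative only, not Ceph code): BlueFS can only take space in whole allocation units, so each free BlueStore extent contributes only its largest allocation-unit-aligned portion:

```python
def usable_by_bluefs(free_extent_sizes, bluefs_alloc_unit):
    """Sum of free extent sizes rounded down to the BlueFS allocation
    unit; extents smaller than one unit contribute nothing."""
    return sum((size // bluefs_alloc_unit) * bluefs_alloc_unit
               for size in free_extent_sizes)

MiB = 1024 * 1024
# Three free extents totalling 3.5 MiB of BlueStore free space...
extents = [1 * MiB, 2 * MiB, 512 * 1024]
# ...but with a 1 MiB BlueFS allocation unit only 3 MiB is usable:
# the 512 KiB extent is smaller than one allocation unit.
print(usable_by_bluefs(extents, 1 * MiB))  # → 3145728 (3 MiB)
```

This is why `available_from_bluestore` can be noticeably smaller than the raw BlueStore free space on a fragmented device, and why reducing the BlueFS allocation unit (next section) can help.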
BLUEFS_LOW_SPACE
________________

If BlueFS is running low on available free space and there is little
`available_from_bluestore`, one can consider reducing the BlueFS allocation
unit size. To simulate available space when the allocation unit is different
do::

    ceph daemon osd.123 bluestore bluefs available <alloc-unit-size>

BLUESTORE_FRAGMENTATION
_______________________

As BlueStore works, free space on the underlying storage will become
fragmented. This is normal and unavoidable, but excessive fragmentation
will cause slowdown. To inspect BlueStore fragmentation do::

    ceph daemon osd.123 bluestore allocator score block

The score is given in the [0-1] range:

* [0.0 .. 0.4] tiny fragmentation
* [0.4 .. 0.7] small, acceptable fragmentation
* [0.7 .. 0.9] considerable, but safe fragmentation
* [0.9 .. 1.0] severe fragmentation, may impact BlueFS's ability to get space from BlueStore

If a detailed report of free fragments is required do::

    ceph daemon osd.123 bluestore allocator dump block

When handling an OSD process that is not running, fragmentation can be
inspected with `ceph-bluestore-tool`.
Get the fragmentation score::

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score

And dump detailed free chunks::

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump

BLUESTORE_LEGACY_STATFS
_______________________

In the Nautilus release, BlueStore tracks its internal usage
statistics on a per-pool granular basis, and one or more OSDs have
BlueStore volumes that were created prior to Nautilus. If *all* OSDs
are older than Nautilus, this just means that the per-pool metrics are
not available. However, if there is a mix of pre-Nautilus and
post-Nautilus OSDs, the cluster usage statistics reported by ``ceph
df`` will not be accurate.

The old OSDs can be updated to use the new usage tracking scheme by stopping
each OSD, running a repair operation, and then restarting it.
For example, if ``osd.123`` needed to be updated::

    systemctl stop ceph-osd@123
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
    systemctl start ceph-osd@123

This warning can be disabled with::

    ceph config set global bluestore_warn_on_legacy_statfs false


BLUESTORE_DISK_SIZE_MISMATCH
____________________________

One or more OSDs using BlueStore have an internal inconsistency between the
size of the physical device and the metadata tracking its size. This can
lead to the OSD crashing in the future.

The OSDs in question should be destroyed and reprovisioned. Care should be
taken to do this one OSD at a time, and in a way that doesn't put any data at
risk. For example, if osd ``$N`` has the error::

    ceph osd out osd.$N
    while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
    ceph osd destroy osd.$N
    ceph-volume lvm zap /path/to/device
    ceph-volume lvm create --osd-id $N --data /path/to/device


Device health
-------------

DEVICE_HEALTH
_____________

One or more devices are expected to fail soon, where the warning
threshold is controlled by the ``mgr/devicehealth/warn_threshold``
config option.

This warning only applies to OSDs that are currently marked "in", so
the expected response to this failure is to mark the device "out" so
that data is migrated off of the device, and then to remove the
hardware from the system. Note that the marking out is normally done
automatically if ``mgr/devicehealth/self_heal`` is enabled based on
the ``mgr/devicehealth/mark_out_threshold``.
+
+Device health can be checked with::
+
+  ceph device info <device-id>
+
+Device life expectancy is set by a prediction model run by
+the mgr or by an external tool via the command::
+
+  ceph device set-life-expectancy <device-id> <from> <to>
+
+You can change the stored life expectancy manually, but that usually
+doesn't accomplish anything as whatever tool originally set it will
+probably set it again, and changing the stored value does not affect
+the actual health of the hardware device.
+
+DEVICE_HEALTH_IN_USE
+____________________
+
+One or more devices is expected to fail soon and has been marked "out"
+of the cluster based on ``mgr/devicehealth/mark_out_threshold``, but it
+is still participating in one or more PGs.  This may be because it was
+only recently marked "out" and data is still migrating, or because data
+cannot be migrated off for some reason (e.g., the cluster is nearly
+full, or the CRUSH hierarchy is such that there isn't another suitable
+OSD to migrate the data to).
+
+This message can be silenced by disabling the self heal behavior
+(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
+``mgr/devicehealth/mark_out_threshold``, or by addressing what is
+preventing data from being migrated off of the ailing device.
+
+DEVICE_HEALTH_TOOMANY
+_____________________
+
+Too many devices are expected to fail soon and the
+``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
+out all of the ailing devices would exceed the cluster's
+``mon_osd_min_in_ratio`` ratio that prevents too many OSDs from being
+automatically marked "out".
+
+This generally indicates that too many devices in your cluster are
+expected to fail soon and you should take action to add newer
+(healthier) devices before too many devices fail and data is lost.
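
The interaction with ``mon_osd_min_in_ratio`` can be illustrated with a
small sketch. This is a hypothetical helper, not Ceph code, and the 0.75
default shown here is only illustrative:

```python
# Hypothetical sketch of the ratio guard described above; not Ceph code.
# mon_osd_min_in_ratio is the minimum fraction of OSDs that must stay "in".
def can_mark_out(num_osds: int, num_in: int, num_to_mark_out: int,
                 mon_osd_min_in_ratio: float = 0.75) -> bool:
    """Return True if marking the ailing OSDs "out" keeps the "in"
    fraction at or above mon_osd_min_in_ratio."""
    remaining_in = num_in - num_to_mark_out
    return remaining_in / num_osds >= mon_osd_min_in_ratio

# 10 OSDs, all "in": marking 2 out leaves 80% in, so self-heal proceeds.
print(can_mark_out(10, 10, 2))   # True
# Marking 4 out would leave only 60% in: DEVICE_HEALTH_TOOMANY is raised
# instead of marking them out.
print(can_mark_out(10, 10, 4))   # False
```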
+
+The health message can also be silenced by adjusting parameters like
+``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
+but be warned that this will increase the likelihood of unrecoverable
+data loss in the cluster.
+
+
+Data health (pools & placement groups)
+--------------------------------------
+
+PG_AVAILABILITY
+_______________
+
+Data availability is reduced, meaning that the cluster is unable to
+service potential read or write requests for some data in the cluster.
+Specifically, one or more PGs is in a state that does not allow IO
+requests to be serviced.  Problematic PG states include *peering*,
+*stale*, *incomplete*, and the lack of *active* (if those conditions do not clear
+quickly).
+
+Detailed information about which PGs are affected is available from::
+
+  ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+  ceph tell <pgid> query
+
+PG_DEGRADED
+___________
+
+Data redundancy is reduced for some data, meaning the cluster does not
+have the desired number of replicas for all data (for replicated
+pools) or erasure code fragments (for erasure coded pools).
+Specifically, one or more PGs:
+
+* has the *degraded* or *undersized* flag set, meaning there are not
+  enough instances of that placement group in the cluster;
+* has not had the *clean* flag set for some time.
+
+Detailed information about which PGs are affected is available from::
+
+  ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+  ceph tell <pgid> query
+
+
+PG_RECOVERY_FULL
+________________
+
+Data redundancy may be reduced or at risk for some data due to a lack
+of free space in the cluster.  Specifically, one or more PGs has the
+*recovery_toofull* flag set, meaning that the
+cluster is unable to migrate or recover data because one or more OSDs
+is above the *full* threshold.
+
+See the discussion for *OSD_FULL* above for steps to resolve this condition.
+
+PG_BACKFILL_FULL
+________________
+
+Data redundancy may be reduced or at risk for some data due to a lack
+of free space in the cluster.  Specifically, one or more PGs has the
+*backfill_toofull* flag set, meaning that the
+cluster is unable to migrate or recover data because one or more OSDs
+is above the *backfillfull* threshold.
+
+See the discussion for *OSD_BACKFILLFULL* above for
+steps to resolve this condition.
+
+PG_DAMAGED
+__________
+
+Data scrubbing has discovered some problems with data consistency in
+the cluster.  Specifically, one or more PGs has the *inconsistent* or
+*snaptrim_error* flag set, indicating that an earlier scrub operation
+found a problem, or the *repair* flag set, meaning a repair
+for such an inconsistency is currently in progress.
+
+See :doc:`pg-repair` for more information.
+
+OSD_SCRUB_ERRORS
+________________
+
+Recent OSD scrubs have uncovered inconsistencies. This error is generally
+paired with *PG_DAMAGED* (see above).
+
+See :doc:`pg-repair` for more information.
+
+OSD_TOO_MANY_REPAIRS
+____________________
+
+When a read error occurs and another replica is available, it is used to
+repair the error immediately, so that the client can get the object data.
+Scrub handles errors for data at rest.  In order to identify possible
+failing disks that aren't seeing scrub errors, a count of read repairs is
+maintained.  If this count exceeds the configured threshold
+``mon_osd_warn_num_repaired`` (default: 10), this health warning is
+generated.
+
+In order to allow clearing of the warning, a new command
+``ceph tell osd.# clear_shards_repaired [count]`` has been added.
+By default it will set the repair count to 0.  If the administrator wants
+the warning to re-appear as soon as any additional repairs are performed,
+a count equal to the value of ``mon_osd_warn_num_repaired`` can be
+supplied to the command.
+This command will be replaced in future releases by the health mute/unmute feature.
+
+LARGE_OMAP_OBJECTS
+__________________
+
+One or more pools contain large omap objects as determined by
+``osd_deep_scrub_large_omap_object_key_threshold`` (threshold for number of keys
+to determine a large omap object) or
+``osd_deep_scrub_large_omap_object_value_sum_threshold`` (the threshold for
+summed size (bytes) of all key values to determine a large omap object) or both.
+More information on the object name, key count, and size in bytes can be found
+by searching the cluster log for 'Large omap object found'. Large omap objects
+can be caused by RGW bucket index objects that do not have automatic resharding
+enabled. Please see :ref:`RGW Dynamic Bucket Index Resharding
+<rgw_dynamic_bucket_index_resharding>` for more information on resharding.
+
+The thresholds can be adjusted with::
+
+  ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys>
+  ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes>
+
+CACHE_POOL_NEAR_FULL
+____________________
+
+A cache tier pool is nearly full.  Full in this context is determined
+by the ``target_max_bytes`` and ``target_max_objects`` properties on
+the cache pool.  Once the pool reaches the target threshold, write
+requests to the pool may block while data is flushed and evicted
+from the cache, a state that normally leads to very high latencies and
+poor performance.
+
+The cache pool target size can be adjusted with::
+
+  ceph osd pool set <cache-pool-name> target_max_bytes <bytes>
+  ceph osd pool set <cache-pool-name> target_max_objects <objects>
+
+Normal cache flush and evict activity may also be throttled due to reduced
+availability or performance of the base tier, or overall cluster load.
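
How ``target_max_bytes`` and ``target_max_objects`` combine can be shown
with a short sketch. This is a hypothetical helper, not Ceph code; the pool
counts as full relative to whichever target it reaches first:

```python
# Hypothetical sketch of the cache-pool fullness test; not Ceph code.
def cache_pool_utilization(used_bytes, num_objects,
                           target_max_bytes=0, target_max_objects=0):
    """Return the pool's utilization as a fraction of its cache targets.

    A target of 0 means "no limit" for that dimension, matching the
    semantics of the pool properties."""
    ratios = []
    if target_max_bytes:
        ratios.append(used_bytes / target_max_bytes)
    if target_max_objects:
        ratios.append(num_objects / target_max_objects)
    return max(ratios, default=0.0)

# 45 GiB used of a 50 GiB byte target, well under the object target:
# the byte dimension dominates at 90%, i.e. near-full territory.
util = cache_pool_utilization(45 << 30, 1_000_000,
                              target_max_bytes=50 << 30,
                              target_max_objects=4_000_000)
print(f"{util:.0%}")  # 90%
```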
+ +TOO_FEW_PGS +___________ + +The number of PGs in use in the cluster is below the configurable +threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead +to suboptimal distribution and balance of data across the OSDs in +the cluster, and similarly reduce overall performance. + +This may be an expected condition if data pools have not yet been +created. + +The PG count for existing pools can be increased or new pools can be created. +Please refer to :ref:`choosing-number-of-placement-groups` for more +information. + +POOL_PG_NUM_NOT_POWER_OF_TWO +____________________________ + +One or more pools has a ``pg_num`` value that is not a power of two. +Although this is not strictly incorrect, it does lead to a less +balanced distribution of data because some PGs have roughly twice as +much data as others. + +This is easily corrected by setting the ``pg_num`` value for the +affected pool(s) to a nearby power of two:: + + ceph osd pool set <pool-name> pg_num <value> + +This health warning can be disabled with:: + + ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false + +POOL_TOO_FEW_PGS +________________ + +One or more pools should probably have more PGs, based on the amount +of data that is currently stored in the pool. This can lead to +suboptimal distribution and balance of data across the OSDs in the +cluster, and similarly reduce overall performance. This warning is +generated if the ``pg_autoscale_mode`` property on the pool is set to +``warn``. 
+
+To disable the warning, you can disable auto-scaling of PGs for the
+pool entirely with::
+
+  ceph osd pool set <pool-name> pg_autoscale_mode off
+
+To allow the cluster to automatically adjust the number of PGs,::
+
+  ceph osd pool set <pool-name> pg_autoscale_mode on
+
+You can also manually set the number of PGs for the pool to the
+recommended amount with::
+
+  ceph osd pool set <pool-name> pg_num <new-pg-num>
+
+Please refer to :ref:`choosing-number-of-placement-groups` and
+:ref:`pg-autoscaler` for more information.
+
+TOO_MANY_PGS
+____________
+
+The number of PGs in use in the cluster is above the configurable
+threshold of ``mon_max_pg_per_osd`` PGs per OSD.  If this threshold is
+exceeded, the cluster will not allow new pools to be created, pool
+``pg_num`` to be increased, or pool replication to be increased (any of
+which would lead to more PGs in the cluster).  A large number of PGs can
+lead to higher memory utilization for OSD daemons, slower peering after
+cluster state changes (like OSD restarts, additions, or removals), and
+higher load on the Manager and Monitor daemons.
+
+The simplest way to mitigate the problem is to increase the number of
+OSDs in the cluster by adding more hardware.  Note that the OSD count
+used for the purposes of this health check is the number of "in" OSDs,
+so marking "out" OSDs "in" (if there are any) can also help::
+
+  ceph osd in <osd id(s)>
+
+Please refer to :ref:`choosing-number-of-placement-groups` for more
+information.
+
+POOL_TOO_MANY_PGS
+_________________
+
+One or more pools should probably have fewer PGs, based on the amount
+of data that is currently stored in the pool.  This can lead to higher
+memory utilization for OSD daemons, slower peering after cluster state
+changes (like OSD restarts, additions, or removals), and higher load
+on the Manager and Monitor daemons.  This warning is generated if the
+``pg_autoscale_mode`` property on the pool is set to ``warn``.
+ +To disable the warning, you can disable auto-scaling of PGs for the +pool entirely with:: + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs,:: + + ceph osd pool set <pool-name> pg_autoscale_mode on + +You can also manually set the number of PGs for the pool to the +recommended amount with:: + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +Please refer to :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler` for more information. + +POOL_TARGET_SIZE_BYTES_OVERCOMMITTED +____________________________________ + +One or more pools have a ``target_size_bytes`` property set to +estimate the expected size of the pool, +but the value(s) exceed the total available storage (either by +themselves or in combination with other pools' actual usage). + +This is usually an indication that the ``target_size_bytes`` value for +the pool is too large and should be reduced or set to zero with:: + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO +____________________________________ + +One or more pools have both ``target_size_bytes`` and +``target_size_ratio`` set to estimate the expected size of the pool. +Only one of these properties should be non-zero. If both are set, +``target_size_ratio`` takes precedence and ``target_size_bytes`` is +ignored. + +To reset ``target_size_bytes`` to zero:: + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +TOO_FEW_OSDS +____________ + +The number of OSDs in the cluster is below the configurable +threshold of ``osd_pool_default_size``. + +SMALLER_PGP_NUM +_______________ + +One or more pools has a ``pgp_num`` value less than ``pg_num``. This +is normally an indication that the PG count was increased without +also increasing the placement behavior. 
+
+This is sometimes done deliberately to separate out the ``split`` step
+when the PG count is adjusted from the data migration that is needed
+when ``pgp_num`` is changed.
+
+This is normally resolved by setting ``pgp_num`` to match ``pg_num``,
+triggering the data migration, with::
+
+  ceph osd pool set <pool> pgp_num <pg-num-value>
+
+MANY_OBJECTS_PER_PG
+___________________
+
+One or more pools has an average number of objects per PG that is
+significantly higher than the overall cluster average.  The specific
+threshold is controlled by the ``mon_pg_warn_max_object_skew``
+configuration value.
+
+This is usually an indication that the pool(s) containing most of the
+data in the cluster have too few PGs, and/or that other pools that do
+not contain as much data have too many PGs.  See the discussion of
+*TOO_MANY_PGS* above.
+
+The threshold can be raised to silence the health warning by adjusting
+the ``mon_pg_warn_max_object_skew`` config option on the monitors.
+
+
+POOL_APP_NOT_ENABLED
+____________________
+
+A pool exists that contains one or more objects but has not been
+tagged for use by a particular application.
+
+Resolve this warning by labeling the pool for use by an application.  For
+example, if the pool is used by RBD,::
+
+  rbd pool init <poolname>
+
+If the pool is being used by a custom application 'foo', you can also label
+it via the low-level command::
+
+  ceph osd pool application enable <poolname> foo
+
+For more information, see :ref:`associate-pool-to-application`.
+
+POOL_FULL
+_________
+
+One or more pools has reached (or is very close to reaching) its
+quota.  The threshold to trigger this error condition is controlled by
+the ``mon_pool_quota_crit_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+  ceph osd pool set-quota <pool> max_bytes <bytes>
+  ceph osd pool set-quota <pool> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
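
The relationship between pool usage, the quota, and the warn/critical
thresholds (``mon_pool_quota_warn_threshold`` and
``mon_pool_quota_crit_threshold``) can be sketched as follows. This is a
hypothetical helper, not Ceph code, and the percentage defaults shown here
are illustrative placeholders:

```python
# Hypothetical sketch of the pool quota checks; not Ceph code.
# warn_pct / crit_pct stand in for mon_pool_quota_warn_threshold and
# mon_pool_quota_crit_threshold; the defaults here are illustrative only.
def pool_quota_health(used, quota, warn_pct=66, crit_pct=90):
    """Classify pool usage against its quota; a quota of 0 disables it."""
    if not quota:
        return "OK"
    pct = 100 * used / quota
    if pct >= crit_pct:
        return "POOL_FULL"
    if pct >= warn_pct:
        return "POOL_NEAR_FULL"
    return "OK"

print(pool_quota_health(950, 1000))  # POOL_FULL (95% of quota)
print(pool_quota_health(700, 1000))  # POOL_NEAR_FULL
print(pool_quota_health(100, 0))     # OK -- a quota of 0 means no quota
```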
+
+POOL_NEAR_FULL
+______________
+
+One or more pools is approaching its quota.  The threshold to trigger
+this warning condition is controlled by the
+``mon_pool_quota_warn_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+  ceph osd pool set-quota <pool> max_bytes <bytes>
+  ceph osd pool set-quota <pool> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
+
+OBJECT_MISPLACED
+________________
+
+One or more objects in the cluster is not stored on the node the
+cluster would like it to be stored on.  This is an indication that
+data migration due to some recent cluster change has not yet completed.
+
+Misplaced data is not a dangerous condition in and of itself; data
+consistency is never at risk, and old copies of objects are never
+removed until the desired number of new copies (in the desired
+locations) are present.
+
+OBJECT_UNFOUND
+______________
+
+One or more objects in the cluster cannot be found.  Specifically, the
+OSDs know that a new or updated copy of an object should exist, but a
+copy of that version of the object has not been found on OSDs that are
+currently online.
+
+Read or write requests to unfound objects will block.
+
+Ideally, a down OSD can be brought back online that has the more
+recent copy of the unfound object.  Candidate OSDs can be identified from the
+peering state for the PG(s) responsible for the unfound object::
+
+  ceph tell <pgid> query
+
+If the latest copy of the object is not available, the cluster can be
+told to roll back to a previous version of the object.  See
+:ref:`failures-osd-unfound` for more information.
+
+SLOW_OPS
+________
+
+One or more OSD requests is taking a long time to process.  This can
+be an indication of extreme load, a slow storage device, or a software
+bug.
+
+The request queue on the OSD(s) in question can be queried with the
+following command, executed from the OSD host::
+
+  ceph daemon osd.<id> ops
+
+A summary of the slowest recent requests can be seen with::
+
+  ceph daemon osd.<id> dump_historic_ops
+
+The location of an OSD can be found with::
+
+  ceph osd find osd.<id>
+
+PG_NOT_SCRUBBED
+_______________
+
+One or more PGs has not been scrubbed recently.  PGs are normally
+scrubbed every ``mon_scrub_interval`` seconds, and this warning
+triggers when the ``mon_warn_pg_not_scrubbed_ratio`` fraction of that
+interval has elapsed without a scrub since one was due.
+
+PGs will not scrub if they are not flagged as *clean*, which may
+happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+You can manually initiate a scrub of a clean PG with::
+
+  ceph pg scrub <pgid>
+
+PG_NOT_DEEP_SCRUBBED
+____________________
+
+One or more PGs has not been deep scrubbed recently.  PGs are normally
+deep scrubbed every ``osd_deep_scrub_interval`` seconds, and this warning
+triggers when the ``mon_warn_pg_not_deep_scrubbed_ratio`` fraction of that
+interval has elapsed without a deep scrub since one was due.
+
+PGs will not (deep) scrub if they are not flagged as *clean*, which may
+happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+You can manually initiate a deep scrub of a clean PG with::
+
+  ceph pg deep-scrub <pgid>
+
+
+Miscellaneous
+-------------
+
+RECENT_CRASH
+____________
+
+One or more Ceph daemons has crashed recently, and the crash has not
+yet been archived (acknowledged) by the administrator.  This may
+indicate a software bug, a hardware problem (e.g., a failing disk), or
+some other problem.
+ +New crashes can be listed with:: + + ceph crash ls-new + +Information about a specific crash can be examined with:: + + ceph crash info <crash-id> + +This warning can be silenced by "archiving" the crash (perhaps after +being examined by an administrator) so that it does not generate this +warning:: + + ceph crash archive <crash-id> + +Similarly, all new crashes can be archived with:: + + ceph crash archive-all + +Archived crashes will still be visible via ``ceph crash ls`` but not +``ceph crash ls-new``. + +The time period for what "recent" means is controlled by the option +``mgr/crash/warn_recent_interval`` (default: two weeks). + +These warnings can be disabled entirely with:: + + ceph config set mgr/crash/warn_recent_interval 0 + +TELEMETRY_CHANGED +_________________ + +Telemetry has been enabled, but the contents of the telemetry report +have changed since that time, so telemetry reports will not be sent. + +The Ceph developers periodically revise the telemetry feature to +include new and useful information, or to remove information found to +be useless or sensitive. If any new information is included in the +report, Ceph will require the administrator to re-enable telemetry to +ensure they have an opportunity to (re)review what information will be +shared. + +To review the contents of the telemetry report,:: + + ceph telemetry show + +Note that the telemetry report consists of several optional channels +that may be independently enabled or disabled. For more information, see +:ref:`telemetry`. + +To re-enable telemetry (and make this warning go away),:: + + ceph telemetry on + +To disable telemetry (and make this warning go away),:: + + ceph telemetry off + +DASHBOARD_DEBUG +_______________ + +The Dashboard debug mode is enabled. This means, if there is an error +while processing a REST API request, the HTTP error response contains +a Python traceback. 
This behavior should be disabled in production
+environments because such a traceback might contain and expose sensitive
+information.
+
+The debug mode can be disabled with::
+
+  ceph dashboard debug disable
diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
new file mode 100644
index 00000000..8e93886b
--- /dev/null
+++ b/doc/rados/operations/index.rst
@@ -0,0 +1,94 @@
+====================
+ Cluster Operations
+====================
+
+.. raw:: html
+
+   <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3>
+
+High-level cluster operations consist primarily of starting, stopping, and
+restarting a cluster with the ``ceph`` service; checking the cluster's health;
+and, monitoring an operating cluster.
+
+.. toctree::
+   :maxdepth: 1
+
+   operating
+   health-checks
+   monitoring
+   monitoring-osd-pg
+   user-management
+   pg-repair
+
+.. raw:: html
+
+   </td><td><h3>Data Placement</h3>
+
+Once you have your cluster up and running, you may begin working with data
+placement. Ceph supports petabyte-scale data storage clusters, with storage
+pools and placement groups that distribute data across the cluster using Ceph's
+CRUSH algorithm.
+
+.. toctree::
+   :maxdepth: 1
+
+   data-placement
+   pools
+   erasure-code
+   cache-tiering
+   placement-groups
+   balancer
+   upmap
+   crush-map
+   crush-map-edits
+
+
+
+.. raw:: html
+
+   </td></tr><tr><td><h3>Low-level Operations</h3>
+
+Low-level cluster operations consist of starting, stopping, and restarting a
+particular daemon within a cluster; changing the settings of a particular
+daemon or subsystem; and, adding a daemon to the cluster or removing a daemon
+from the cluster. The most common use cases for low-level operations include
+growing or shrinking the Ceph cluster and replacing legacy or failed hardware
+with new hardware.
+
+.. toctree::
+   :maxdepth: 1
+
+   add-or-rm-osds
+   add-or-rm-mons
+   devices
+   bluestore-migration
+   Command Reference <control>
+
+
+
+.. raw:: html
+
+   </td><td><h3>Troubleshooting</h3>
+
+Ceph is still on the leading edge, so you may encounter situations that require
+you to evaluate your Ceph configuration and modify your logging and debugging
+settings to identify and remedy issues you are encountering with your cluster.
+
+.. toctree::
+   :maxdepth: 1
+
+   ../troubleshooting/community
+   ../troubleshooting/troubleshooting-mon
+   ../troubleshooting/troubleshooting-osd
+   ../troubleshooting/troubleshooting-pg
+   ../troubleshooting/log-and-debug
+   ../troubleshooting/cpu-profiling
+   ../troubleshooting/memory-profiling
+
+
+
+
+.. raw:: html
+
+   </td></tr></tbody></table>
+
diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst
new file mode 100644
index 00000000..08b70dd4
--- /dev/null
+++ b/doc/rados/operations/monitoring-osd-pg.rst
@@ -0,0 +1,518 @@
+=========================
+ Monitoring OSDs and PGs
+=========================
+
+High availability and high reliability require a fault-tolerant approach to
+managing hardware and software issues. Ceph has no single point-of-failure, and
+can service requests for data in a "degraded" mode. Ceph's `data placement`_
+introduces a layer of indirection to ensure that data doesn't bind directly to
+particular OSD addresses. This means that tracking down system faults requires
+finding the `placement group`_ and the underlying OSDs at the root of the problem.
+
+.. tip:: A fault in one part of the cluster may prevent you from accessing a
+   particular object, but that doesn't mean that you cannot access other objects.
+   When you run into a fault, don't panic. Just follow the steps for monitoring
+   your OSDs and placement groups. Then, begin troubleshooting.
+
+Ceph is generally self-repairing. 
However, when problems persist, monitoring
+OSDs and placement groups will help you identify the problem.
+
+
+Monitoring OSDs
+===============
+
+An OSD's status is either in the cluster (``in``) or out of the cluster
+(``out``); and, it is either up and running (``up``), or it is down and not
+running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster
+(you can read and write data) or it is ``out`` of the cluster. If it was
+``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate
+placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will
+not assign placement groups to the OSD. If an OSD is ``down``, it should also be
+``out``.
+
+.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
+   will not be in a healthy state.
+
+.. ditaa::
+
+           +----------------+        +----------------+
+           |                |        |                |
+           |   OSD #n In    |        |   OSD #n Up    |
+           |                |        |                |
+           +----------------+        +----------------+
+                   ^                         ^
+                   |                         |
+                   |                         |
+                   v                         v
+           +----------------+        +----------------+
+           |                |        |                |
+           |   OSD #n Out   |        |   OSD #n Down  |
+           |                |        |                |
+           +----------------+        +----------------+
+
+If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
+you may notice that the cluster does not always echo back ``HEALTH OK``. Don't
+panic. With respect to OSDs, you should expect that the cluster will **NOT**
+echo ``HEALTH OK`` in a few expected circumstances:
+
+#. You haven't started the cluster yet (it won't respond).
+#. You have just started or restarted the cluster and it's not ready yet,
+   because the placement groups are getting created and the OSDs are in
+   the process of peering.
+#. You just added or removed an OSD.
+#. You have just modified your cluster map.
+
+An important aspect of monitoring OSDs is to ensure that, when the cluster
+is up and running, all OSDs that are ``in`` the cluster are ``up`` and
+running, too. 
To see if all OSDs are running, execute:: + + ceph osd stat + +The result should tell you the total number of OSDs (x), +how many are ``up`` (y), how many are ``in`` (z) and the map epoch (eNNNN). :: + + x osds: y up, z in; epoch: eNNNN + +If the number of OSDs that are ``in`` the cluster is more than the number of +OSDs that are ``up``, execute the following command to identify the ``ceph-osd`` +daemons that are not running:: + + ceph osd tree + +:: + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 2.00000 pool openstack + -3 2.00000 rack dell-2950-rack-A + -2 2.00000 host dell-2950-A1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 down 1.00000 1.00000 + +.. tip:: The ability to search through a well-designed CRUSH hierarchy may help + you troubleshoot your cluster by identifying the physical locations faster. + +If an OSD is ``down``, start it:: + + sudo systemctl start ceph-osd@1 + +See `OSD Not Running`_ for problems associated with OSDs that stopped, or won't +restart. + + +PG Sets +======= + +When CRUSH assigns placement groups to OSDs, it looks at the number of replicas +for the pool and assigns the placement group to OSDs such that each replica of +the placement group gets assigned to a different OSD. For example, if the pool +requires three replicas of a placement group, CRUSH may assign them to +``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a +pseudo-random placement that will take into account failure domains you set in +your `CRUSH map`_, so you will rarely see placement groups assigned to nearest +neighbor OSDs in a large cluster. We refer to the set of OSDs that should +contain the replicas of a particular placement group as the **Acting Set**. In +some cases, an OSD in the Acting Set is ``down`` or otherwise not able to +service requests for objects in the placement group. When these situations +arise, don't panic. Common examples include: + +- You added or removed an OSD. 
Then, CRUSH reassigned the placement group to + other OSDs--thereby changing the composition of the Acting Set and spawning + the migration of data with a "backfill" process. +- An OSD was ``down``, was restarted, and is now ``recovering``. +- An OSD in the Acting Set is ``down`` or unable to service requests, + and another OSD has temporarily assumed its duties. + +Ceph processes a client request using the **Up Set**, which is the set of OSDs +that will actually handle the requests. In most cases, the Up Set and the Acting +Set are virtually identical. When they are not, it may indicate that Ceph is +migrating data, an OSD is recovering, or that there is a problem (i.e., Ceph +usually echoes a "HEALTH WARN" state with a "stuck stale" message in such +scenarios). + +To retrieve a list of placement groups, execute:: + + ceph pg dump + +To view which OSDs are within the Acting Set or the Up Set for a given placement +group, execute:: + + ceph pg map {pg-num} + +The result should tell you the osdmap epoch (eNNN), the placement group number +({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set +(acting[]). :: + + osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2] + +.. note:: If the Up Set and Acting Set do not match, this may be an indicator + that the cluster rebalancing itself or of a potential problem with + the cluster. + + +Peering +======= + +Before you can write data to a placement group, it must be in an ``active`` +state, and it **should** be in a ``clean`` state. For Ceph to determine the +current state of a placement group, the primary OSD of the placement group +(i.e., the first OSD in the acting set), peers with the secondary and tertiary +OSDs to establish agreement on the current state of the placement group +(assuming a pool with 3 replicas of the PG). + + +.. 
ditaa:: + + +---------+ +---------+ +-------+ + | OSD 1 | | OSD 2 | | OSD 3 | + +---------+ +---------+ +-------+ + | | | + | Request To | | + | Peer | | + |-------------->| | + |<--------------| | + | Peering | + | | + | Request To | + | Peer | + |----------------------------->| + |<-----------------------------| + | Peering | + +The OSDs also report their status to the monitor. See `Configuring Monitor/OSD +Interaction`_ for details. To troubleshoot peering issues, see `Peering +Failure`_. + + +Monitoring Placement Group States +================================= + +If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, +you may notice that the cluster does not always echo back ``HEALTH OK``. After +you check to see if the OSDs are running, you should also check placement group +states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a +number of placement group peering-related circumstances: + +#. You have just created a pool and placement groups haven't peered yet. +#. The placement groups are recovering. +#. You have just added an OSD to or removed an OSD from the cluster. +#. You have just modified your CRUSH map and your placement groups are migrating. +#. There is inconsistent data in different replicas of a placement group. +#. Ceph is scrubbing a placement group's replicas. +#. Ceph doesn't have enough storage capacity to complete backfilling operations. + +If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't +panic. In many cases, the cluster will recover on its own. In some cases, you +may need to take action. An important aspect of monitoring placement groups is +to ensure that when the cluster is up and running that all placement groups are +``active``, and preferably in the ``clean`` state. 
To see the status of all
+placement groups, execute::
+
+  ceph pg stat
+
+The result should tell you the total number of placement groups (x), how many
+placement groups are in a particular state such as ``active+clean`` (y) and the
+amount of data stored (z). ::
+
+  x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
+
+.. note:: It is common for Ceph to report multiple states for placement groups.
+
+In addition to the placement group states, Ceph will also echo back the amount of
+storage capacity used (aa), the amount of storage capacity remaining (bb), and the total
+storage capacity for the placement group. These numbers can be important in a
+few cases:
+
+- You are reaching your ``near full ratio`` or ``full ratio``.
+- Your data is not getting distributed across the cluster due to an
+  error in your CRUSH configuration.
+
+
+.. topic:: Placement Group IDs
+
+   Placement group IDs consist of the pool number (not pool name) followed
+   by a period (.) and the placement group ID--a hexadecimal number. You
+   can view pool numbers and their names from the output of ``ceph osd
+   lspools``. For example, the first pool created corresponds to
+   pool number ``1``. A fully qualified placement group ID has the
+   following form::
+
+     {pool-num}.{pg-id}
+
+   And it typically looks like this::
+
+     1.1f
+
+
+To retrieve a list of placement groups, execute the following::
+
+  ceph pg dump
+
+You can also format the output in JSON format and save it to a file::
+
+  ceph pg dump -o {filename} --format=json
+
+To query a particular placement group, execute the following::
+
+  ceph pg {poolnum}.{pg-id} query
+
+Ceph will output the query in JSON format.
+
+The following subsections describe the common pg states in detail.
+
+Creating
+--------
+
+When you create a pool, it will create the number of placement groups you
+specified.  Ceph will echo ``creating`` when it is creating one or more
+placement groups. 
Once they are created, the OSDs that are part of a placement +group's Acting Set will peer. Once peering is complete, the placement group +status should be ``active+clean``, which means a Ceph client can begin writing +to the placement group. + +.. ditaa:: + + /-----------\ /-----------\ /-----------\ + | Creating |------>| Peering |------>| Active | + \-----------/ \-----------/ \-----------/ + +Peering +------- + +When Ceph is Peering a placement group, Ceph is bringing the OSDs that +store the replicas of the placement group into **agreement about the state** +of the objects and metadata in the placement group. When Ceph completes peering, +this means that the OSDs that store the placement group agree about the current +state of the placement group. However, completion of the peering process does +**NOT** mean that each replica has the latest contents. + +.. topic:: Authoritative History + + Ceph will **NOT** acknowledge a write operation to a client, until + all OSDs of the acting set persist the write operation. This practice + ensures that at least one member of the acting set will have a record + of every acknowledged write operation since the last successful + peering operation. + + With an accurate record of each acknowledged write operation, Ceph can + construct and disseminate a new authoritative history of the placement + group--a complete, and fully ordered set of operations that, if performed, + would bring an OSD’s copy of a placement group up to date. + + +Active +------ + +Once Ceph completes the peering process, a placement group may become +``active``. The ``active`` state means that the data in the placement group is +generally available in the primary placement group and the replicas for read +and write operations. + + +Clean +----- + +When a placement group is in the ``clean`` state, the primary OSD and the +replica OSDs have successfully peered and there are no stray replicas for the +placement group. 
Ceph replicated all objects in the placement group the correct +number of times. + + +Degraded +-------- + +When a client writes an object to the primary OSD, the primary OSD is +responsible for writing the replicas to the replica OSDs. After the primary OSD +writes the object to storage, the placement group will remain in a ``degraded`` +state until the primary OSD has received an acknowledgement from the replica +OSDs that Ceph created the replica objects successfully. + +The reason a placement group can be ``active+degraded`` is that an OSD may be +``active`` even though it doesn't hold all of the objects yet. If an OSD goes +``down``, Ceph marks each placement group assigned to the OSD as ``degraded``. +The OSDs must peer again when the OSD comes back online. However, a client can +still write a new object to a ``degraded`` placement group if it is ``active``. + +If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the +``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD +to another OSD. The time between being marked ``down`` and being marked ``out`` +is controlled by ``mon osd down out interval``, which is set to ``600`` seconds +by default. + +A placement group can also be ``degraded``, because Ceph cannot find one or more +objects that Ceph thinks should be in the placement group. While you cannot +read or write to unfound objects, you can still access all of the other objects +in the ``degraded`` placement group. + + +Recovering +---------- + +Ceph was designed for fault-tolerance at a scale where hardware and software +problems are ongoing. When an OSD goes ``down``, its contents may fall behind +the current state of other replicas in the placement groups. When the OSD is +back ``up``, the contents of the placement groups must be updated to reflect the +current state. During that time period, the OSD may reflect a ``recovering`` +state. 
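The ``down`` to ``out`` transition described in the Degraded section above is time-driven; a toy sketch of that decision, using the documented 600-second default for ``mon osd down out interval``, might look like:

```python
DEFAULT_DOWN_OUT_INTERVAL = 600  # seconds; default of ``mon osd down out interval``

def should_mark_out(seconds_down: float,
                    down_out_interval: float = DEFAULT_DOWN_OUT_INTERVAL) -> bool:
    """Illustrative only: an OSD that stays ``down`` longer than the
    configured interval becomes a candidate to be marked ``out``,
    triggering remapping of its data to other OSDs."""
    return seconds_down > down_out_interval
```

Raising the interval gives flapping OSDs more time to return before the cluster starts moving their data; lowering it trades extra data movement for faster restoration of full redundancy.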
+ +Recovery is not always trivial, because a hardware failure might cause a +cascading failure of multiple OSDs. For example, a network switch for a rack or +cabinet may fail, which can cause the OSDs of a number of host machines to fall +behind the current state of the cluster. Each one of the OSDs must recover once +the fault is resolved. + +Ceph provides a number of settings to balance the resource contention between +new service requests and the need to recover data objects and restore the +placement groups to the current state. The ``osd recovery delay start`` setting +allows an OSD to restart, re-peer and even process some replay requests before +starting the recovery process. The ``osd +recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail, +restart and re-peer at staggered rates. The ``osd recovery max active`` setting +limits the number of recovery requests an OSD will entertain simultaneously to +prevent the OSD from failing to serve requests. The ``osd recovery max chunk`` setting +limits the size of the recovered data chunks to prevent network congestion. + + +Back Filling +------------ + +When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs +in the cluster to the newly added OSD. Forcing the new OSD to accept the +reassigned placement groups immediately can put excessive load on the new OSD. +Back filling the OSD with the placement groups allows this process to begin in +the background. Once backfilling is complete, the new OSD will begin serving +requests when it is ready. + +During the backfill operations, you may see one of several states: +``backfill_wait`` indicates that a backfill operation is pending, but is not +underway yet; ``backfilling`` indicates that a backfill operation is underway; +and ``backfill_toofull`` indicates that a backfill operation was requested, +but couldn't be completed due to insufficient storage capacity.
When a +placement group cannot be backfilled, it may be considered ``incomplete``. + +The ``backfill_toofull`` state may be transient. It is possible that as PGs +are moved around, space may become available. ``backfill_toofull`` is +similar to ``backfill_wait`` in that backfill can proceed as soon as +conditions change. + +Ceph provides a number of settings to manage the load spike associated with +reassigning placement groups to an OSD (especially a new OSD). By default, +``osd_max_backfills`` sets the maximum number of concurrent backfills to and from +an OSD to 1. The ``backfill full ratio`` enables an OSD to refuse a +backfill request if the OSD is approaching its full ratio (90%, by default), +which can be changed with the ``ceph osd set-backfillfull-ratio`` command. +If an OSD refuses a backfill request, the ``osd backfill retry interval`` +enables an OSD to retry the request (after 30 seconds, by default). OSDs can +also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan +intervals (64 and 512, by default). + + +Remapped +-------- + +When the Acting Set that services a placement group changes, the data migrates +from the old acting set to the new acting set. It may take some time for a new +primary OSD to service requests, so it may ask the old primary to continue to +service requests until the placement group migration is complete. Once data +migration completes, the mapping uses the primary OSD of the new acting set. + + +Stale +----- + +While Ceph uses heartbeats to ensure that hosts and daemons are running, the +``ceph-osd`` daemons may also get into a ``stuck`` state where they are not +reporting statistics in a timely manner (e.g., a temporary network fault). By +default, OSD daemons report their placement group, up through, boot and failure +statistics every half second (i.e., ``0.5``), which is more frequent than the +heartbeat thresholds.
If the **Primary OSD** of a placement group's acting set +fails to report to the monitor or if other OSDs have reported the primary OSD +``down``, the monitors will mark the placement group ``stale``. + +When you start your cluster, it is common to see the ``stale`` state until +the peering process completes. After your cluster has been running for a while, +seeing placement groups in the ``stale`` state indicates that the primary OSD +for those placement groups is ``down`` or not reporting placement group statistics +to the monitor. + + +Identifying Troubled PGs +======================== + +As previously noted, a placement group is not necessarily problematic just +because its state is not ``active+clean``. Generally, Ceph's ability to +self-repair may not be working when placement groups get stuck. The stuck states +include: + +- **Unclean**: Placement groups contain objects that are not replicated the + desired number of times. They should be recovering. +- **Inactive**: Placement groups cannot process reads or writes because they + are waiting for an OSD with the most up-to-date data to come back ``up``. +- **Stale**: Placement groups are in an unknown state, because the OSDs that + host them have not reported to the monitor cluster in a while (configured + by ``mon osd report timeout``). + +To identify stuck placement groups, execute the following:: + + ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] + +See `Placement Group Subsystem`_ for additional details. To troubleshoot +stuck placement groups, see `Troubleshooting PG Errors`_. + + +Finding an Object Location +========================== + +To store object data in the Ceph Object Store, a Ceph client must: + +#. Set an object name +#. Specify a `pool`_ + +The Ceph client retrieves the latest cluster map and the CRUSH algorithm +calculates how to map the object to a `placement group`_, and then calculates +how to assign the placement group to an OSD dynamically.
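As a rough illustration of that mapping step, the sketch below hashes an object name into a PG within a pool. This is **not** Ceph's actual algorithm (Ceph uses the rjenkins hash and a "stable mod" inside CRUSH, so real IDs will differ); it only shows the shape of the ``{pool-num}.{pg-id}`` computation:

```python
from zlib import crc32

def object_to_pg(pool_num: int, object_name: str, pg_num: int) -> str:
    """Simplified illustration: hash the object name, reduce it modulo the
    pool's PG count, and prefix the pool number to form a
    ``{pool-num}.{pg-id}`` style ID (pg-id in hexadecimal). Real Ceph IDs,
    as printed by ``ceph osd map``, are computed differently."""
    pg_id = crc32(object_name.encode()) % pg_num
    return f"{pool_num}.{pg_id:x}"
```

The key property the sketch shares with the real algorithm is determinism: any client with the same cluster map computes the same placement without consulting a central lookup table.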
To find the object +location, all you need is the object name and the pool name. For example:: + + ceph osd map {poolname} {object-name} [namespace] + +.. topic:: Exercise: Locate an Object + + As an exercise, lets create an object. Specify an object name, a path to a + test file containing some object data and a pool name using the + ``rados put`` command on the command line. For example:: + + rados put {object-name} {file-path} --pool=data + rados put test-object-1 testfile.txt --pool=data + + To verify that the Ceph Object Store stored the object, execute the following:: + + rados -p data ls + + Now, identify the object location:: + + ceph osd map {pool-name} {object-name} + ceph osd map data test-object-1 + + Ceph should output the object's location. For example:: + + osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0) + + To remove the test object, simply delete it using the ``rados rm`` command. + For example:: + + rados rm test-object-1 --pool=data + + +As the cluster evolves, the object location may change dynamically. One benefit +of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform +the migration manually. See the `Architecture`_ section for details. + +.. _data placement: ../data-placement +.. _pool: ../pools +.. _placement group: ../placement-groups +.. _Architecture: ../../../architecture +.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running +.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors +.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering +.. _CRUSH map: ../crush-map +.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ +.. 
_Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst new file mode 100644 index 00000000..294e922d --- /dev/null +++ b/doc/rados/operations/monitoring.rst @@ -0,0 +1,475 @@ +====================== + Monitoring a Cluster +====================== + +Once you have a running cluster, you may use the ``ceph`` tool to monitor your +cluster. Monitoring a cluster typically involves checking OSD status, monitor +status, placement group status and metadata server status. + +Using the command line +====================== + +Interactive mode +---------------- + +To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line +with no arguments. For example:: + + ceph + ceph> health + ceph> status + ceph> quorum_status + ceph> mon_status + +Non-default paths +----------------- + +If you specified non-default locations for your configuration or keyring, +you may specify their locations:: + + ceph -c /path/to/conf -k /path/to/keyring health + +Checking a Cluster's Status +=========================== + +After you start your cluster, and before you start reading and/or +writing data, check your cluster's status first. + +To check a cluster's status, execute the following:: + + ceph status + +Or:: + + ceph -s + +In interactive mode, type ``status`` and press **Enter**. :: + + ceph> status + +Ceph will print the cluster status. For example, a tiny Ceph demonstration +cluster with one of each service may print the following: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + +.. 
topic:: How Ceph Calculates Data Usage + + The ``usage`` value reflects the *actual* amount of raw storage used. The + ``xxx GB / xxx GB`` value means the amount available (the lesser number) + of the overall storage capacity of the cluster. The notional number reflects + the size of the stored data before it is replicated, cloned or snapshotted. + Therefore, the amount of data actually stored typically exceeds the notional + amount stored, because Ceph creates replicas of the data and may also use + storage capacity for cloning and snapshotting. + + +Watching a Cluster +================== + +In addition to local logging by each daemon, Ceph clusters maintain +a *cluster log* that records high level events about the whole system. +This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by +default), but can also be monitored via the command line. + +To follow the cluster log, use the following command + +:: + + ceph -w + +Ceph will print the status of the system, followed by each log message as it +is emitted. For example: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 3 daemons, quorum a,b,c + mgr: x(active) + mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby + osd: 3 osds: 3 up, 3 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2.19K + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + + 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot + 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x + 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available + + +In addition to using ``ceph -w`` to print log lines as they are emitted, +use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster +log. 
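Cluster log lines like those shown above end with a fixed ``[LEVEL] message`` shape, which makes them easy to filter in scripts. A small illustrative parser (not part of Ceph):

```python
import re

# The [INF]/[WRN]/[ERR] tags are the ones that appear in ``ceph -w`` output.
LOG_RE = re.compile(r"\[(INF|WRN|ERR)\]\s+(.*)$")

def log_severity(line: str):
    """Return (severity, message) from a cluster-log line, or None if the
    line carries no recognizable severity tag."""
    m = LOG_RE.search(line)
    return (m.group(1), m.group(2)) if m else None
```

Filtering the output of ``ceph log last [n]`` for ``WRN`` and ``ERR`` entries this way is a common first step when scripting cluster monitoring.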
+ +Monitoring Health Checks +======================== + +Ceph continuously runs various *health checks* against its own status. When +a health check fails, this is reflected in the output of ``ceph status`` (or +``ceph health``). In addition, messages are sent to the cluster log to +indicate when a check fails, and when the cluster recovers. + +For example, when an OSD goes down, the ``health`` section of the status +output may be updated as follows: + +:: + + health: HEALTH_WARN + 1 osds down + Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded + +At this time, cluster log messages are also emitted to record the failure of the +health checks: + +:: + + 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN) + 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED) + +When the OSD comes back online, the cluster log records the cluster's return +to a healthy state: + +:: + + 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED) + 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized) + 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy + +Network Performance Checks +-------------------------- + +Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We +also use the response times to monitor network performance.
+While it is possible that a busy OSD could delay a ping response, we can assume +that if a network switch fails, multiple delays will be detected between distinct pairs of OSDs. + +By default we will warn about ping times which exceed 1 second (1000 milliseconds). + +:: + + HEALTH_WARN Long heartbeat ping times on back interface seen, longest is 1118.001 msec + +The health detail will show which combinations of OSDs are seeing the delays and by how much. There is a limit of 10 +detail line items. + +:: + + [WRN] OSD_SLOW_PING_TIME_BACK: Long heartbeat ping times on back interface seen, longest is 1118.001 msec + Slow heartbeat ping on back interface from osd.0 to osd.1 1118.001 msec + Slow heartbeat ping on back interface from osd.0 to osd.2 1030.123 msec + Slow heartbeat ping on back interface from osd.2 to osd.1 1015.321 msec + Slow heartbeat ping on back interface from osd.1 to osd.0 1010.456 msec + +To see even more detail and a complete dump of network performance information, the ``dump_osd_network`` command can be used. Typically, this would be +sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to any OSD. The current threshold, which defaults to 1 second +(1000 milliseconds), can be overridden as an argument in milliseconds. + +The following command will show all gathered network performance data by specifying a threshold of 0 and sending to the mgr.
+ +:: + + $ ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0 + { + "threshold": 0, + "entries": [ + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "front", + "average": { + "1min": 1.023, + "5min": 0.860, + "15min": 0.883 + }, + "min": { + "1min": 0.818, + "5min": 0.607, + "15min": 0.607 + }, + "max": { + "1min": 1.164, + "5min": 1.173, + "15min": 1.544 + }, + "last": 0.924 + }, + { + "last update": "Wed Sep 4 17:04:49 2019", + "stale": false, + "from osd": 2, + "to osd": 0, + "interface": "back", + "average": { + "1min": 0.968, + "5min": 0.897, + "15min": 0.830 + }, + "min": { + "1min": 0.860, + "5min": 0.563, + "15min": 0.502 + }, + "max": { + "1min": 1.171, + "5min": 1.216, + "15min": 1.456 + }, + "last": 0.845 + }, + { + "last update": "Wed Sep 4 17:04:48 2019", + "stale": false, + "from osd": 0, + "to osd": 1, + "interface": "front", + "average": { + "1min": 0.965, + "5min": 0.811, + "15min": 0.850 + }, + "min": { + "1min": 0.650, + "5min": 0.488, + "15min": 0.466 + }, + "max": { + "1min": 1.252, + "5min": 1.252, + "15min": 1.362 + }, + "last": 0.791 + }, + ... + + +Detecting configuration issues +============================== + +In addition to the health checks that Ceph continuously runs on its +own status, there are some configuration issues that may only be detected +by an external tool. + +Use the `ceph-medic`_ tool to run these additional checks on your Ceph +cluster's configuration. + +Checking a Cluster's Usage Stats +================================ + +To check a cluster's data usage and data distribution among pools, you can +use the ``df`` option. It is similar to Linux ``df``. Execute +the following:: + + ceph df + +The **RAW STORAGE** section of the output provides an overview of the +amount of storage that is managed by your cluster. + +- **CLASS:** The class of OSD device (or the total for the cluster) +- **SIZE:** The amount of storage capacity managed by the cluster. 
+- **AVAIL:** The amount of free space available in the cluster. +- **USED:** The amount of raw storage consumed by user data. +- **RAW USED:** The amount of raw storage consumed by user data, internal overhead, or reserved capacity. +- **%RAW USED:** The percentage of raw storage used. Use this number in + conjunction with the ``full ratio`` and ``near full ratio`` to ensure that + you are not reaching your cluster's capacity. See `Storage Capacity`_ for + additional details. + +The **POOLS** section of the output provides a list of pools and the notional +usage of each pool. The output from this section **DOES NOT** reflect replicas, +clones or snapshots. For example, if you store an object with 1MB of data, the +notional usage will be 1MB, but the actual usage may be 2MB or more depending +on the number of replicas, clones and snapshots. + +- **NAME:** The name of the pool. +- **ID:** The pool ID. +- **USED:** The notional amount of data stored in kilobytes, unless the number + appends **M** for megabytes or **G** for gigabytes. +- **%USED:** The notional percentage of storage used per pool. +- **MAX AVAIL:** An estimate of the notional amount of data that can be written + to this pool. +- **OBJECTS:** The notional number of objects stored per pool. + +.. note:: The numbers in the **POOLS** section are notional. They are not + inclusive of the number of replicas, snapshots or clones. As a result, + the sum of the **USED** and **%USED** amounts will not add up to the + **USED** and **%USED** amounts in the **RAW** section of the + output. + +.. note:: The **MAX AVAIL** value is a complicated function of the + replication or erasure code used, the CRUSH rule that maps storage + to devices, the utilization of those devices, and the configured + mon_osd_full_ratio. 
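The notional-versus-raw distinction above can be made concrete with a trivial calculation: for a replicated pool, raw consumption is roughly the notional size multiplied by the replica count, and clones, snapshots and internal overhead only add to it. A sketch:

```python
def raw_used(notional_bytes: int, replicas: int) -> int:
    """Approximate raw storage consumed by a replicated pool: the POOLS
    section's notional USED value times the replication factor. Ignores
    clone/snapshot and internal overhead, which increase the real figure."""
    return notional_bytes * replicas

# The document's example: a 1 MB object stored with two replicas
# consumes about 2 MB of raw capacity.
```

This is also why the sum of per-pool **USED** values never matches the cluster-wide **RAW USED** figure.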
+ + + +Checking OSD Status +=================== + +You can check OSDs to ensure they are ``up`` and ``in`` by executing:: + + ceph osd stat + +Or:: + + ceph osd dump + +You can also view OSDs according to their position in the CRUSH map. :: + + ceph osd tree + +Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up +and their weight. :: + + #ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -1 3.00000 pool default + -3 3.00000 rack mainrack + -2 3.00000 host osd-host + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + +For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. + +Checking Monitor Status +======================= + +If your cluster has multiple monitors (likely), you should check the monitor +quorum status after you start the cluster and before reading and/or writing data. A +quorum must be present when multiple monitors are running. You should also check +monitor status periodically to ensure that they are running. + +To display the monitor map, execute the following:: + + ceph mon stat + +Or:: + + ceph mon dump + +To check the quorum status for the monitor cluster, execute the following:: + + ceph quorum_status + +Ceph will return the quorum status. For example, a Ceph cluster consisting of +three monitors may return the following: + +..
code-block:: javascript + + { "election_epoch": 10, + "quorum": [ + 0, + 1, + 2], + "quorum_names": [ + "a", + "b", + "c"], + "quorum_leader_name": "a", + "monmap": { "epoch": 1, + "fsid": "444b489c-4f16-4b75-83f0-cb8097468898", + "modified": "2011-12-12 13:28:27.505520", + "created": "2011-12-12 13:28:27.505520", + "features": {"persistent": [ + "kraken", + "luminous", + "mimic"], + "optional": [] + }, + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789/0", + "public_addr": "127.0.0.1:6789/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790/0", + "public_addr": "127.0.0.1:6790/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6791/0", + "public_addr": "127.0.0.1:6791/0"} + ] + } + } + +Checking MDS Status +=================== + +Metadata servers provide metadata services for CephFS. Metadata servers have +two sets of states: ``up | down`` and ``active | inactive``. To ensure your +metadata servers are ``up`` and ``active``, execute the following:: + + ceph mds stat + +To display details of the metadata cluster, execute the following:: + + ceph fs dump + + +Checking Placement Group States +=============================== + +Placement groups map objects to OSDs. When you monitor your +placement groups, you will want them to be ``active`` and ``clean``. +For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. + +.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg + + +Using the Admin Socket +====================== + +The Ceph admin socket allows you to query a daemon via a socket interface. +By default, Ceph sockets reside under ``/var/run/ceph``. 
To access a daemon +via the admin socket, log in to the host running the daemon and use the +following command:: + + ceph daemon {daemon-name} + ceph daemon {path-to-socket-file} + +For example, the following are equivalent:: + + ceph daemon osd.0 foo + ceph daemon /var/run/ceph/ceph-osd.0.asok foo + +To view the available admin socket commands, execute the following command:: + + ceph daemon {daemon-name} help + +The admin socket command enables you to show and set your configuration at +runtime. See `Viewing a Configuration at Runtime`_ for details. + +Additionally, you can set configuration values at runtime directly (i.e., the +admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id} +config set``, which relies on the monitor but doesn't require you to log in +directly to the host in question). + +.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime +.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity +.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/ diff --git a/doc/rados/operations/operating.rst b/doc/rados/operations/operating.rst new file mode 100644 index 00000000..fcfd5268 --- /dev/null +++ b/doc/rados/operations/operating.rst @@ -0,0 +1,235 @@ +===================== + Operating a Cluster +===================== + +.. index:: systemd; operating a cluster + + +Running Ceph with systemd +========================== + +For all distributions that support systemd (CentOS 7, Fedora, Debian +Jessie 8 and later, SUSE), Ceph daemons are now managed using native +systemd files instead of the legacy sysvinit scripts.
For example:: + + sudo systemctl start ceph.target # start all daemons + sudo systemctl status ceph-osd@12 # check status of osd.12 + +To list the Ceph systemd units on a node, execute:: + + sudo systemctl status ceph\*.service ceph\*.target + +Starting all Daemons +-------------------- + +To start all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo systemctl start ceph.target + + +Stopping all Daemons +-------------------- + +To stop all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo systemctl stop ceph\*.service ceph\*.target + + +Starting all Daemons by Type +---------------------------- + +To start all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo systemctl start ceph-osd.target + sudo systemctl start ceph-mon.target + sudo systemctl start ceph-mds.target + + +Stopping all Daemons by Type +---------------------------- + +To stop all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo systemctl stop ceph-mon\*.service ceph-mon.target + sudo systemctl stop ceph-osd\*.service ceph-osd.target + sudo systemctl stop ceph-mds\*.service ceph-mds.target + + +Starting a Daemon +----------------- + +To start a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo systemctl start ceph-osd@{id} + sudo systemctl start ceph-mon@{hostname} + sudo systemctl start ceph-mds@{hostname} + +For example:: + + sudo systemctl start ceph-osd@1 + sudo systemctl start ceph-mon@ceph-server + sudo systemctl start ceph-mds@ceph-server + + +Stopping a Daemon +----------------- + +To stop a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo systemctl stop ceph-osd@{id} + sudo systemctl stop ceph-mon@{hostname} + sudo systemctl stop ceph-mds@{hostname} + +For example:: + + sudo systemctl stop ceph-osd@1 + sudo systemctl stop ceph-mon@ceph-server + sudo systemctl stop ceph-mds@ceph-server + + +.. 
index:: Ceph service; Upstart; operating a cluster + + + +Starting all Daemons +-------------------- + +To start all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo start ceph-all + + +Stopping all Daemons +-------------------- + +To stop all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo stop ceph-all + + +Starting all Daemons by Type +---------------------------- + +To start all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo start ceph-osd-all + sudo start ceph-mon-all + sudo start ceph-mds-all + + +Stopping all Daemons by Type +---------------------------- + +To stop all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo stop ceph-osd-all + sudo stop ceph-mon-all + sudo stop ceph-mds-all + + +Starting a Daemon +----------------- + +To start a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo start ceph-osd id={id} + sudo start ceph-mon id={hostname} + sudo start ceph-mds id={hostname} + +For example:: + + sudo start ceph-osd id=1 + sudo start ceph-mon id=ceph-server + sudo start ceph-mds id=ceph-server + + +Stopping a Daemon +----------------- + +To stop a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo stop ceph-osd id={id} + sudo stop ceph-mon id={hostname} + sudo stop ceph-mds id={hostname} + +For example:: + + sudo stop ceph-osd id=1 + sudo stop ceph-mon id=ceph-server + sudo stop ceph-mds id=ceph-server + + +.. index:: Ceph service; sysvinit; operating a cluster + + +Running Ceph +============ + +Each time you **start**, **restart**, or **stop** Ceph daemons (or your +entire cluster) you must specify at least one option and one command. You may +also specify a daemon type or a daemon instance.
:: + + {commandline} [options] [commands] [daemons] + + +The ``ceph`` options include: + ++-----------------+----------+-------------------------------------------------+ +| Option | Shortcut | Description | ++=================+==========+=================================================+ +| ``--verbose`` | ``-v`` | Use verbose logging. | ++-----------------+----------+-------------------------------------------------+ +| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | ++-----------------+----------+-------------------------------------------------+ +| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` | +| | | Otherwise, it only executes on ``localhost``. | ++-----------------+----------+-------------------------------------------------+ +| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--conf`` | ``-c`` | Use an alternate configuration file. | ++-----------------+----------+-------------------------------------------------+ + +The ``ceph`` commands include: + ++------------------+------------------------------------------------------------+ +| Command | Description | ++==================+============================================================+ +| ``start`` | Start the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``stop`` | Stop the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` | ++------------------+------------------------------------------------------------+ +| ``killall`` | Kill all daemons of a particular type. 
| ++------------------+------------------------------------------------------------+ +| ``cleanlogs`` | Cleans out the log directory. | ++------------------+------------------------------------------------------------+ +| ``cleanalllogs`` | Cleans out **everything** in the log directory. | ++------------------+------------------------------------------------------------+ + +For subsystem operations, the ``ceph`` service can target specific daemon types +by adding a particular daemon type for the ``[daemons]`` option. Daemon types +include: + +- ``mon`` +- ``osd`` +- ``mds`` + + + +.. _Valgrind: http://www.valgrind.org/ +.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html diff --git a/doc/rados/operations/pg-concepts.rst b/doc/rados/operations/pg-concepts.rst new file mode 100644 index 00000000..636d6bf9 --- /dev/null +++ b/doc/rados/operations/pg-concepts.rst @@ -0,0 +1,102 @@ +========================== + Placement Group Concepts +========================== + +When you execute commands like ``ceph -w``, ``ceph osd dump``, and other +commands related to placement groups, Ceph may return values using some +of the following terms: + +*Peering* + The process of bringing all of the OSDs that store + a Placement Group (PG) into agreement about the state + of all of the objects (and their metadata) in that PG. + Note that agreeing on the state does not mean that + they all have the latest contents. + +*Acting Set* + The ordered list of OSDs who are (or were as of some epoch) + responsible for a particular placement group. + +*Up Set* + The ordered list of OSDs responsible for a particular placement + group for a particular epoch according to CRUSH. Normally this + is the same as the *Acting Set*, except when the *Acting Set* has + been explicitly overridden via ``pg_temp`` in the OSD Map. 
+
+*Current Interval* or *Past Interval*
+  A sequence of OSD map epochs during which the *Acting Set* and *Up
+  Set* for a particular placement group do not change.
+
+*Primary*
+  The member (and, by convention, the first member) of the *Acting Set*,
+  which is responsible for coordinating peering, and is
+  the only OSD that will accept client-initiated
+  writes to objects in a placement group.
+
+*Replica*
+  A non-primary OSD in the *Acting Set* for a placement group
+  (one that has been recognized as such and *activated* by the primary).
+
+*Stray*
+  An OSD that is not a member of the current *Acting Set*, but
+  has not yet been told that it can delete its copies of a
+  particular placement group.
+
+*Recovery*
+  Ensuring that copies of all of the objects in a placement group
+  are on all of the OSDs in the *Acting Set*.  Once *Peering* has
+  been performed, the *Primary* can start accepting write operations,
+  and *Recovery* can proceed in the background.
+
+*PG Info*
+  Basic metadata about the placement group's creation epoch, the version
+  for the most recent write to the placement group, *last epoch started*,
+  *last epoch clean*, and the beginning of the *current interval*.  Any
+  inter-OSD communication about placement groups includes the *PG Info*,
+  such that any OSD that knows a placement group exists (or once existed)
+  also has a lower bound on *last epoch clean* or *last epoch started*.
+
+*PG Log*
+  A list of recent updates made to objects in a placement group.
+  Note that these logs can be truncated after all OSDs
+  in the *Acting Set* have acknowledged up to a certain
+  point.
+
+*Missing Set*
+  Each OSD notes update log entries and, if they imply updates to
+  the contents of an object, adds that object to a list of needed
+  updates.  This list is called the *Missing Set* for that ``<OSD,PG>``.
+
+*Authoritative History*
+  A complete and fully ordered set of operations that, if
+  performed, would bring an OSD's copy of a placement group
+  up to date. 
+
+*Epoch*
+  A (monotonically increasing) OSD map version number.
+
+*Last Epoch Start*
+  The last epoch at which all nodes in the *Acting Set*
+  for a particular placement group agreed on an
+  *Authoritative History*.  At this point, *Peering* is
+  deemed to have been successful.
+
+*up_thru*
+  Before a *Primary* can successfully complete the *Peering* process,
+  it must inform a monitor that it is alive through the current
+  OSD map *Epoch* by having the monitor set its *up_thru* in the osd
+  map.  This helps *Peering* ignore previous *Acting Sets* for which
+  *Peering* never completed after certain sequences of failures, such as
+  the second interval below:
+
+  - *acting set* = [A,B]
+  - *acting set* = [A]
+  - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
+  - *acting set* = [B] (B restarts, A does not)
+
+*Last Epoch Clean*
+  The last *Epoch* at which all nodes in the *Acting set*
+  for a particular placement group were completely
+  up to date (both placement group logs and object contents).
+  At this point, *recovery* is deemed to have been
+  completed.
diff --git a/doc/rados/operations/pg-repair.rst b/doc/rados/operations/pg-repair.rst
new file mode 100644
index 00000000..0d6692a3
--- /dev/null
+++ b/doc/rados/operations/pg-repair.rst
@@ -0,0 +1,4 @@
+Repairing PG inconsistencies
+============================
+
+
diff --git a/doc/rados/operations/pg-states.rst b/doc/rados/operations/pg-states.rst
new file mode 100644
index 00000000..c38a683f
--- /dev/null
+++ b/doc/rados/operations/pg-states.rst
@@ -0,0 +1,112 @@
+========================
+ Placement Group States
+========================
+
+When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``),
+Ceph will report on the status of the placement groups. A placement group has
+one or more states. The optimum state for placement groups in the placement group
+map is ``active + clean``.
+
+*creating*
+  Ceph is still creating the placement group. 
+
+*activating*
+  The placement group is peered but not yet active.
+
+*active*
+  Ceph will process requests to the placement group.
+
+*clean*
+  Ceph replicated all objects in the placement group the correct number of times.
+
+*down*
+  A replica with necessary data is down, so the placement group is offline.
+
+*scrubbing*
+  Ceph is checking the placement group metadata for inconsistencies.
+
+*deep*
+  Ceph is checking the placement group data against stored checksums.
+
+*degraded*
+  Ceph has not yet replicated some objects in the placement group the correct number of times.
+
+*inconsistent*
+  Ceph detects inconsistencies in one or more replicas of an object in the placement group
+  (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.).
+
+*peering*
+  The placement group is undergoing the peering process.
+
+*repair*
+  Ceph is checking the placement group and repairing any inconsistencies it finds (if possible).
+
+*recovering*
+  Ceph is migrating/synchronizing objects and their replicas.
+
+*forced_recovery*
+  The user has enforced a high recovery priority for this PG.
+
+*recovery_wait*
+  The placement group is waiting in line to start recovery.
+
+*recovery_toofull*
+  A recovery operation is waiting because the destination OSD is over its
+  full ratio.
+
+*recovery_unfound*
+  Recovery stopped due to unfound objects.
+
+*backfilling*
+  Ceph is scanning and synchronizing the entire contents of a placement group
+  instead of inferring what contents need to be synchronized from the logs of
+  recent operations.  Backfill is a special case of recovery.
+
+*forced_backfill*
+  The user has enforced a high backfill priority for this PG.
+
+*backfill_wait*
+  The placement group is waiting in line to start backfill.
+
+*backfill_toofull*
+  A backfill operation is waiting because the destination OSD is over
+  the backfillfull ratio.
+
+*backfill_unfound*
+  Backfill stopped due to unfound objects. 
+
+*incomplete*
+  Ceph detects that a placement group is missing information about
+  writes that may have occurred, or does not have any healthy
+  copies. If you see this state, try to start any failed OSDs that may
+  contain the needed information. In the case of an erasure coded pool,
+  temporarily reducing min_size may allow recovery.
+
+*stale*
+  The placement group is in an unknown state - the monitors have not received
+  an update for it since the placement group mapping changed.
+
+*remapped*
+  The placement group is temporarily mapped to a different set of OSDs from what
+  CRUSH specified.
+
+*undersized*
+  The placement group has fewer copies than the configured pool replication level.
+
+*peered*
+  The placement group has peered, but cannot serve client IO due to not having
+  enough copies to reach the pool's configured min_size parameter.  Recovery
+  may occur in this state, so the PG may heal up to min_size eventually.
+
+*snaptrim*
+  Trimming snaps.
+
+*snaptrim_wait*
+  Queued to trim snaps.
+
+*snaptrim_error*
+  Error stopped trimming snaps.
+
+*unknown*
+  The ceph-mgr hasn't yet received any information about the PG's state from an
+  OSD since mgr started up.
diff --git a/doc/rados/operations/placement-groups.rst b/doc/rados/operations/placement-groups.rst
new file mode 100644
index 00000000..adb5df0a
--- /dev/null
+++ b/doc/rados/operations/placement-groups.rst
@@ -0,0 +1,678 @@
+==================
+ Placement Groups
+==================
+
+.. _pg-autoscaler:
+
+Autoscaling placement groups
+============================
+
+Placement groups (PGs) are an internal implementation detail of how
+Ceph distributes data. You can allow the cluster to either make
+recommendations or automatically tune PGs based on how the cluster is
+used by enabling *pg-autoscaling*.
+
+Each pool in the system has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.
+
+* ``off``: Disable autoscaling for this pool. 
It is up to the administrator to choose an appropriate PG number for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
+* ``on``: Enable automated adjustments of the PG count for the given pool.
+* ``warn``: Raise health alerts when the PG count should be adjusted.
+
+To set the autoscaling mode for existing pools::
+
+  ceph osd pool set <pool-name> pg_autoscale_mode <mode>
+
+For example, to enable autoscaling on pool ``foo``::
+
+  ceph osd pool set foo pg_autoscale_mode on
+
+You can also configure the default ``pg_autoscale_mode`` that is
+applied to any pools that are created in the future with::
+
+  ceph config set global osd_pool_default_pg_autoscale_mode <mode>
+
+Viewing PG scaling recommendations
+----------------------------------
+
+You can view each pool, its relative utilization, and any suggested changes to
+the PG count with this command::
+
+  ceph osd pool autoscale-status
+
+Output will be something like::
+
+   POOL    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
+   a     12900M                3.0        82431M  0.4695                                      8         128  warn
+   c         0                 3.0        82431M  0.0000        0.2000           0.9884       1          64  warn
+   b         0        953.6M   3.0        82431M  0.0347                                      8              warn
+
+**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
+present, is the amount of data the administrator has specified that
+they expect to eventually be stored in this pool.  The system uses
+the larger of the two values for its calculation.
+
+**RATE** is the multiplier for the pool that determines how much raw
+storage capacity is consumed.  For example, a 3 replica pool will
+have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a
+ratio of 1.5.
+
+**RAW CAPACITY** is the total amount of raw storage capacity on the
+OSDs that are responsible for storing this pool's (and perhaps other
+pools') data.  **RATIO** is the ratio of that total capacity that
+this pool is consuming (i.e., ratio = size * rate / raw capacity). 
+ +**TARGET RATIO**, if present, is the ratio of storage that the +administrator has specified that they expect this pool to consume +relative to other pools with target ratios set. +If both target size bytes and ratio are specified, the +ratio takes precedence. + +**EFFECTIVE RATIO** is the target ratio after adjusting in two ways: + +1. subtracting any capacity expected to be used by pools with target size set +2. normalizing the target ratios among pools with target ratio set so + they collectively target the rest of the space. For example, 4 + pools with target_ratio 1.0 would have an effective ratio of 0.25. + +The system uses the larger of the actual ratio and the effective ratio +for its calculation. + +**PG_NUM** is the current number of PGs for the pool (or the current +number of PGs that the pool is working towards, if a ``pg_num`` +change is in progress). **NEW PG_NUM**, if present, is what the +system believes the pool's ``pg_num`` should be changed to. It is +always a power of 2, and will only be present if the "ideal" value +varies from the current value by more than a factor of 3. + +The final column, **AUTOSCALE**, is the pool ``pg_autoscale_mode``, +and will be either ``on``, ``off``, or ``warn``. + + +Automated scaling +----------------- + +Allowing the cluster to automatically scale PGs based on usage is the +simplest approach. Ceph will look at the total available storage and +target number of PGs for the whole system, look at how much data is +stored in each pool, and try to apportion the PGs accordingly. The +system is relatively conservative with its approach, only making +changes to a pool when the current number of PGs (``pg_num``) is more +than 3 times off from what it thinks it should be. 
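The apportioning-and-threshold behavior described above can be sketched in a few lines of Python. This is only an illustration of the heuristic as the text explains it, not the autoscaler's actual implementation; the function names and parameters (``suggest_pg_num``, ``target_pg_per_osd``, and so on) are invented for the example.

```python
# Sketch (not Ceph source code) of the autoscaler heuristic described above:
# apportion a cluster-wide PG budget to a pool by its share of the data,
# round to a power of two, and only suggest a change when the current
# pg_num is more than 3x off from the ideal value.

def next_power_of_two(n):
    """Smallest power of two >= n."""
    return 1 if n <= 1 else 2 ** (n - 1).bit_length()

def suggest_pg_num(pool_bytes, total_bytes, replica_size, num_osds,
                   current_pg_num, target_pg_per_osd=100):
    ratio = pool_bytes / total_bytes          # pool's share of stored data
    pg_budget = num_osds * target_pg_per_osd  # cluster-wide PG target
    ideal = next_power_of_two(max(1, int(ratio * pg_budget / replica_size)))
    # Conservative: only recommend a change when >3x off either way.
    if ideal > current_pg_num * 3 or ideal * 3 < current_pg_num:
        return ideal
    return current_pg_num
```

For instance, a pool holding all of the data on a 10-OSD cluster with 3x replication and ``pg_num`` 8 would be nudged toward 512 PGs, while a pool already at 256 PGs would be left alone.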
+
+The target number of PGs per OSD is based on the
+``mon_target_pg_per_osd`` configurable (default: 100), which can be
+adjusted with::
+
+  ceph config set global mon_target_pg_per_osd 100
+
+The autoscaler analyzes pools and adjusts on a per-subtree basis.
+Because each pool may map to a different CRUSH rule, and each rule may
+distribute data across different devices, Ceph will consider
+utilization of each subtree of the hierarchy independently.  For
+example, a pool that maps to OSDs of class `ssd` and a pool that maps
+to OSDs of class `hdd` will each have optimal PG counts that depend on
+the number of those respective device types.
+
+
+.. _specifying_pool_target_size:
+
+Specifying expected pool size
+-----------------------------
+
+When a cluster or pool is first created, it will consume a small
+fraction of the total cluster capacity and will appear to the system
+as if it should only need a small number of placement groups.
+However, in most cases cluster administrators have a good idea which
+pools are expected to consume most of the system capacity over time.
+By providing this information to Ceph, a more appropriate number of
+PGs can be used from the beginning, preventing subsequent changes in
+``pg_num`` and the overhead associated with moving data around when
+those adjustments are made.
+
+The *target size* of a pool can be specified in two ways: either in
+terms of the absolute size of the pool (i.e., bytes), or as a weight
+relative to other pools with a ``target_size_ratio`` set.
+
+For example::
+
+  ceph osd pool set mypool target_size_bytes 100T
+
+will tell the system that `mypool` is expected to consume 100 TiB of
+space.  Alternatively::
+
+  ceph osd pool set mypool target_size_ratio 1.0
+
+will tell the system that `mypool` is expected to consume 1.0 relative
+to the other pools with ``target_size_ratio`` set.  If `mypool` is the
+only pool in the cluster, this means an expected use of 100% of the
+total capacity. 
If there is a second pool with ``target_size_ratio``
+1.0, both pools would expect to use 50% of the cluster capacity.
+
+You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.
+
+Note that if impossible target size values are specified (for example,
+a capacity larger than the total cluster capacity) then a health warning
+(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
+
+If both ``target_size_ratio`` and ``target_size_bytes`` are specified
+for a pool, only the ratio will be considered, and a health warning
+(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.
+
+Specifying bounds on a pool's PGs
+---------------------------------
+
+It is also possible to specify a minimum number of PGs for a pool.
+This is useful for establishing a lower bound on the amount of
+parallelism clients will see when doing IO, even when a pool is mostly
+empty.  Setting the lower bound prevents Ceph from reducing (or
+recommending you reduce) the PG number below the configured number.
+
+You can set the minimum number of PGs for a pool with::
+
+  ceph osd pool set <pool-name> pg_num_min <num>
+
+You can also specify the minimum PG count at pool creation time with
+the optional ``--pg-num-min <num>`` argument to the ``ceph osd pool
+create`` command.
+
+.. _preselection:
+
+A preselection of pg_num
+========================
+
+When creating a new pool with::
+
+  ceph osd pool create {pool-name} pg_num
+
+it is mandatory to choose the value of ``pg_num`` because it cannot (currently) be
+calculated automatically. 
Here are a few values commonly used:
+
+- With less than 5 OSDs, set ``pg_num`` to 128
+
+- With between 5 and 10 OSDs, set ``pg_num`` to 512
+
+- With between 10 and 50 OSDs, set ``pg_num`` to 1024
+
+- If you have more than 50 OSDs, you need to understand the tradeoffs
+  and how to calculate the ``pg_num`` value yourself
+
+- To calculate the ``pg_num`` value yourself, the `pgcalc`_ tool can help
+
+As the number of OSDs increases, choosing the right value for ``pg_num``
+becomes more important because it has a significant influence on the
+behavior of the cluster as well as the durability of the data when
+something goes wrong (i.e. the probability that a catastrophic event
+leads to data loss).
+
+How are Placement Groups used?
+===============================
+
+A placement group (PG) aggregates objects within a pool because
+tracking object placement and object metadata on a per-object basis is
+computationally expensive--i.e., a system with millions of objects
+cannot realistically track placement on a per-object basis.
+
+.. ditaa::
+           /-----\  /-----\  /-----\  /-----\  /-----\
+           | obj |  | obj |  | obj |  | obj |  | obj |
+           \-----/  \-----/  \-----/  \-----/  \-----/
+              |        |        |        |        |
+              +--------+--------+        +---+----+
+              |                              |
+              v                              v
+   +-----------------------+      +-----------------------+
+   |  Placement Group #1   |      |  Placement Group #2   |
+   |                       |      |                       |
+   +-----------------------+      +-----------------------+
+               |                              |
+               +------------------------------+
+                              |
+                              v
+                   +-----------------------+
+                   |        Pool           |
+                   |                       |
+                   +-----------------------+
+
+The Ceph client will calculate which placement group an object should
+be in. It does this by hashing the object ID and applying an operation
+based on the number of PGs in the defined pool and the ID of the pool.
+See `Mapping PGs to OSDs`_ for details.
+
+The object's contents within a placement group are stored in a set of
+OSDs. For instance, in a replicated pool of size two, each placement
+group will store objects on two OSDs, as shown below.
+
+.. 
ditaa:: + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | | | + v v v v + /----------\ /----------\ /----------\ /----------\ + | | | | | | | | + | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 | + | | | | | | | | + \----------/ \----------/ \----------/ \----------/ + + +Should OSD #2 fail, another will be assigned to Placement Group #1 and +will be filled with copies of all objects in OSD #1. If the pool size +is changed from two to three, an additional OSD will be assigned to +the placement group and will receive copies of all objects in the +placement group. + +Placement groups do not own the OSD; they share it with other +placement groups from the same pool or even other pools. If OSD #2 +fails, the Placement Group #2 will also have to restore copies of +objects, using OSD #3. + +When the number of placement groups increases, the new placement +groups will be assigned OSDs. The result of the CRUSH function will +also change and some objects from the former placement groups will be +copied over to the new Placement Groups and removed from the old ones. + +Placement Groups Tradeoffs +========================== + +Data durability and even distribution among all OSDs call for more +placement groups but their number should be reduced to the minimum to +save CPU and memory. + +.. _data durability: + +Data durability +--------------- + +After an OSD fails, the risk of data loss increases until the data it +contained is fully recovered. Let's imagine a scenario that causes +permanent data loss in a single placement group: + +- The OSD fails and all copies of the object it contains are lost. + For all objects within the placement group the number of replica + suddenly drops from three to two. + +- Ceph starts recovery for this placement group by choosing a new OSD + to re-create the third copy of all objects. 
+
+- Another OSD, within the same placement group, fails before the new
+  OSD is fully populated with the third copy. Some objects will then
+  only have one surviving copy.
+
+- Ceph picks yet another OSD and keeps copying objects to restore the
+  desired number of copies.
+
+- A third OSD, within the same placement group, fails before recovery
+  is complete. If this OSD contained the only remaining copy of an
+  object, it is permanently lost.
+
+In a cluster containing 10 OSDs with 512 placement groups in a three
+replica pool, CRUSH will give each placement group three OSDs. In the
+end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement
+Groups. When the first OSD fails, the above scenario will therefore
+start recovery for all 150 placement groups at the same time.
+
+The 150 placement groups being recovered are likely to be
+homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
+therefore likely to send copies of objects to all others and also
+receive some new objects to be stored because they became part of a
+new placement group.
+
+The amount of time it takes for this recovery to complete entirely
+depends on the architecture of the Ceph cluster. Let's say each OSD is
+hosted by a 1TB SSD on a single machine and all of them are connected
+to a 10Gb/s switch and the recovery for a single OSD completes within
+M minutes. If there are two OSDs per machine using spinners with no
+SSD journal and a 1Gb/s switch, it will at least be an order of
+magnitude slower.
+
+In a cluster of this size, the number of placement groups has almost
+no influence on data durability. It could be 128 or 8192 and the
+recovery would not be slower or faster.
+
+However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
+is likely to speed up recovery and therefore improve data durability
+significantly. 
Each OSD now participates in only ~75 placement groups
+instead of ~150 when there were only 10 OSDs and it will still require
+all 19 remaining OSDs to perform the same amount of object copies in
+order to recover. But where 10 OSDs had to copy approximately 100GB
+each, they now have to copy 50GB each instead. If the network was the
+bottleneck, recovery will happen twice as fast. In other words,
+recovery goes faster when the number of OSDs increases.
+
+If this cluster grows to 40 OSDs, each of them will only host ~35
+placement groups. If an OSD dies, recovery will keep going faster
+unless it is blocked by another bottleneck. However, if this cluster
+grows to 200 OSDs, each of them will only host ~7 placement groups. If
+an OSD dies, recovery will involve at most ~21 (7 * 3) OSDs
+in these placement groups: recovery will take longer than when there
+were 40 OSDs, meaning the number of placement groups should be
+increased.
+
+No matter how short the recovery time is, there is a chance for a
+second OSD to fail while it is in progress. In the 10-OSD cluster
+described above, if any of them fails, then ~17 placement groups
+(i.e. ~150 / 9 placement groups being recovered) will only have one
+surviving copy. And if any of the 8 remaining OSDs fails, the last
+objects of two placement groups are likely to be lost (i.e. ~17 / 8
+placement groups with only one remaining copy being recovered).
+
+When the size of the cluster grows to 20 OSDs, the number of Placement
+Groups damaged by the loss of three OSDs drops. The second OSD lost
+will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
+instead of ~17 and the third OSD lost will only lose data if it is one
+of the four OSDs containing the surviving copy. In other words, if the
+probability of losing one OSD is 0.0001% during the recovery time
+frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
+0.0001% in the cluster with 20 OSDs. 
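The back-of-the-envelope arithmetic in the preceding paragraphs can be reproduced with a small script. This is purely illustrative (the helper names are made up for this sketch), but it matches the ~150 PGs per OSD and ~17 single-copy PGs figures derived above.

```python
# Illustrative arithmetic for the durability discussion above.

def pgs_per_osd(pg_num, pool_size, num_osds):
    # Each PG is stored on `pool_size` OSDs, so the hosting load per
    # OSD is pg_num * pool_size / num_osds.
    return pg_num * pool_size / num_osds

def single_copy_pgs(pg_num, pool_size, num_osds):
    # After one OSD fails, its ~pgs_per_osd PGs recover spread over the
    # remaining OSDs; a second failure leaves roughly this many PGs
    # with a single surviving copy (three-replica case).
    return pgs_per_osd(pg_num, pool_size, num_osds) / (num_osds - 1)

print(round(pgs_per_osd(512, 3, 10)))    # 154, the "~150" above
print(round(single_copy_pgs(512, 3, 10)))  # 17
print(round(pgs_per_osd(512, 3, 20)))    # 77, the "~75" above
print(round(single_copy_pgs(512, 3, 20)))  # 4
```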
+
+In a nutshell, more OSDs mean faster recovery and a lower risk of
+cascading failures leading to the permanent loss of a Placement
+Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
+cluster with fewer than 50 OSDs as far as data durability is concerned.
+
+Note: It may take a long time for a new OSD added to the cluster to be
+populated with placement groups that were assigned to it. However,
+there is no degradation of any object and it has no impact on the
+durability of the data contained in the cluster.
+
+.. _object distribution:
+
+Object distribution within a pool
+---------------------------------
+
+Ideally objects are evenly distributed in each placement group. Since
+CRUSH computes the placement group for each object, but does not
+actually know how much data is stored in each OSD within this
+placement group, the ratio between the number of placement groups and
+the number of OSDs may influence the distribution of the data
+significantly.
+
+For instance, if there was a single placement group for ten OSDs in a
+three replica pool, only three OSDs would be used because CRUSH would
+have no other choice. When more placement groups are available,
+objects are more likely to be evenly spread among them. CRUSH also
+makes every effort to evenly spread OSDs among all existing Placement
+Groups.
+
+As long as there are one or two orders of magnitude more Placement
+Groups than OSDs, the distribution should be even. For instance, 256
+placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs,
+etc.
+
+Uneven data distribution can be caused by factors other than the ratio
+between OSDs and placement groups. Since CRUSH does not take into
+account the size of the objects, a few very large objects may create
+an imbalance. Let's say one million 4K objects totaling 4GB are evenly
+spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10
+= 400MB on each OSD. 
If one 400MB object is added to the pool, the
+three OSDs supporting the placement group in which the object has been
+placed will be filled with 400MB + 400MB = 800MB while the seven
+others will remain occupied with only 400MB.
+
+.. _resource usage:
+
+Memory, CPU and network usage
+-----------------------------
+
+For each placement group, OSDs and MONs need memory, network and CPU
+at all times and even more during recovery. Sharing this overhead by
+clustering objects within a placement group is one of the main reasons
+they exist.
+
+Minimizing the number of placement groups saves significant amounts of
+resources.
+
+.. _choosing-number-of-placement-groups:
+
+Choosing the number of Placement Groups
+=======================================
+
+.. note:: It is rarely necessary to do this math by hand.  Instead, use the ``ceph osd pool autoscale-status`` command in combination with the ``target_size_bytes`` or ``target_size_ratio`` pool properties.  See :ref:`pg-autoscaler` for more information.
+
+If you have more than 50 OSDs, we recommend approximately 50-100
+placement groups per OSD to balance out resource usage, data
+durability and distribution. If you have fewer than 50 OSDs, choosing
+among the `preselection`_ above is best. For a single pool of objects,
+you can use the following formula to get a baseline::
+
+                (OSDs * 100)
+   Total PGs =  ------------
+                 pool size
+
+Where **pool size** is either the number of replicas for replicated
+pools or the K+M sum for erasure coded pools (as returned by **ceph
+osd erasure-code-profile get**).
+
+You should then check if the result makes sense with the way you
+designed your Ceph cluster to maximize `data durability`_,
+`object distribution`_ and minimize `resource usage`_.
+
+The result should always be **rounded up to the nearest power of two**.
+
+Only a power of two will evenly balance the number of objects among
+placement groups. Other values will result in an uneven distribution of
+data across your OSDs. 
Their use should be limited to incrementally
+stepping from one power of two to another.
+
+As an example, for a cluster with 200 OSDs and a pool size of 3
+replicas, you would estimate your number of PGs as follows::
+
+   (200 * 100)
+   ----------- = 6667. Nearest power of 2: 8192
+        3
+
+When using multiple data pools for storing objects, you need to ensure
+that you balance the number of placement groups per pool with the
+number of placement groups per OSD so that you arrive at a reasonable
+total number of placement groups that provides reasonably low variance
+per OSD without taxing system resources or making the peering process
+too slow.
+
+For instance, a cluster of 10 pools, each with 512 placement groups, on
+ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
+that is 512 placement groups per OSD. That does not use too many
+resources. However, if 1,000 pools were created with 512 placement
+groups each, the OSDs will handle ~50,000 placement groups each and it
+would require significantly more resources and time for peering.
+
+You may find the `PGCalc`_ tool helpful.
+
+
+.. _setting the number of placement groups:
+
+Set the Number of Placement Groups
+==================================
+
+To set the number of placement groups in a pool, you must specify the
+number of placement groups at the time you create the pool.
+See `Create a Pool`_ for details.  Even after a pool is created, you can change the number of placement groups with::
+
+  ceph osd pool set {pool-name} pg_num {pg_num}
+
+After you increase the number of placement groups, you must also
+increase the number of placement groups for placement (``pgp_num``)
+before your cluster will rebalance. The ``pgp_num`` will be the number of
+placement groups that will be considered for placement by the CRUSH
+algorithm. Increasing ``pg_num`` splits the placement groups but data
+will not be migrated to the newer placement groups until placement
+groups for placement, i.e. 
``pgp_num`` is increased. The ``pgp_num`` +should be equal to the ``pg_num``. To increase the number of +placement groups for placement, execute the following:: + + ceph osd pool set {pool-name} pgp_num {pgp_num} + +When decreasing the number of PGs, ``pgp_num`` is adjusted +automatically for you. + +Get the Number of Placement Groups +================================== + +To get the number of placement groups in a pool, execute the following:: + + ceph osd pool get {pool-name} pg_num + + +Get a Cluster's PG Statistics +============================= + +To get the statistics for the placement groups in your cluster, execute the following:: + + ceph pg dump [--format {format}] + +Valid formats are ``plain`` (default) and ``json``. + + +Get Statistics for Stuck PGs +============================ + +To get the statistics for all placement groups stuck in a specified state, +execute the following:: + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] + +**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD +with the most up-to-date data to come up and in. + +**Unclean** Placement groups contain objects that are not replicated the desired number +of times. They should be recovering. + +**Stale** Placement groups are in an unknown state - the OSDs that host them have not +reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``). + +Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number +of seconds the placement group is stuck before including it in the returned statistics +(default 300 seconds). 
+ + +Get a PG Map +============ + +To get the placement group map for a particular placement group, execute the following:: + + ceph pg map {pg-id} + +For example:: + + ceph pg map 1.6c + +Ceph will return the placement group map, the placement group, and the OSD status:: + + osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0] + + +Get a PG's Statistics +===================== + +To retrieve statistics for a particular placement group, execute the following:: + + ceph pg {pg-id} query + + +Scrub a Placement Group +======================= + +To scrub a placement group, execute the following:: + + ceph pg scrub {pg-id} + +Ceph checks the primary and any replica nodes, generates a catalog of all objects +in the placement group, and compares them to ensure that no objects are missing +or mismatched and that their contents are consistent. Assuming the replicas all +match, a final semantic sweep ensures that all of the snapshot-related object +metadata is consistent. Errors are reported via logs. + +To scrub all placement groups from a specific pool, execute the following:: + + ceph osd pool scrub {pool-name} + +Prioritize backfill/recovery of Placement Group(s) +================================================== + +You may run into a situation where a number of placement groups require +recovery and/or backfill, and some of those groups hold more important data +than others (for example, some PGs may hold data for images used by running +machines while other PGs are used by inactive machines or less relevant data). +In that case, you may want to prioritize recovery of those groups so that the +performance and/or availability of the data stored on them is restored +earlier. To do this (that is, mark particular placement groups as prioritized +during backfill or recovery), execute the following:: + + ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+ +This will cause Ceph to perform recovery or backfill on the specified placement +groups first, before other placement groups. This does not interrupt currently +ongoing backfills or recovery, but causes the specified PGs to be processed +as soon as possible. If you change your mind or prioritized the wrong groups, +use:: + + ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +This will remove the "force" flag from those PGs, and they will be processed +in the default order. Again, this doesn't affect placement groups currently +being processed, only those that are still queued. + +The "force" flag is cleared automatically after recovery or backfill of the +group is done. + +Similarly, you may use the following commands to force Ceph to perform recovery +or backfill on all placement groups from a specified pool first:: + + ceph osd pool force-recovery {pool-name} + ceph osd pool force-backfill {pool-name} + +or:: + + ceph osd pool cancel-force-recovery {pool-name} + ceph osd pool cancel-force-backfill {pool-name} + +to restore the default recovery or backfill priority if you change your mind. + +Note that these commands could possibly break the ordering of Ceph's internal +priority computations, so use them with caution! +In particular, if you have multiple pools that are currently sharing the same +underlying OSDs, and some pools hold more important data than others, +we recommend that you use the following command to re-arrange all pools' +recovery/backfill priorities in a better order:: + + ceph osd pool set {pool-name} recovery_priority {value} + +For example, if you have 10 pools, you could make the most important one +priority 10, the next 9, and so on. Or you could leave most pools alone and +have, say, 3 important pools all at priority 1, or at priorities 3, 2, 1 +respectively.
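The descending-priority scheme just described (most important pool gets the highest value) is easy to script. The sketch below only builds the corresponding ``ceph osd pool set`` command strings for a ranked list of pools; it does not run anything, and the pool names in the example are hypothetical.

```python
def recovery_priority_commands(pools_most_important_first):
    """Build 'ceph osd pool set <pool> recovery_priority <n>' command strings,
    assigning descending priorities so the most important pool ranks highest.
    Nothing is executed; callers can review and run the output themselves."""
    n = len(pools_most_important_first)
    return ["ceph osd pool set %s recovery_priority %d" % (pool, n - i)
            for i, pool in enumerate(pools_most_important_first)]

# hypothetical pool names, most important first
for cmd in recovery_priority_commands(["vm-images", "backups", "scratch"]):
    print(cmd)
```

Keep the generated values within the documented -10 to 10 range for ``recovery_priority`` if you rank more than ten pools.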
+ +Revert Lost +=========== + +If the cluster has lost one or more objects, and you have decided to +abandon the search for the lost data, you must mark the unfound objects +as ``lost``. + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. This is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. + +Currently the only supported option is "revert", which will either roll back to +a previous version of the object or (if it was a new object) forget about it +entirely. To mark the "unfound" objects as "lost", execute the following:: + + ceph pg {pg-id} mark_unfound_lost revert|delete + +.. important:: Use this feature with caution, because it may confuse + applications that expect the object(s) to exist. + + +.. toctree:: + :hidden: + + pg-states + pg-concepts + + +.. _Create a Pool: ../pools#createpool +.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds +.. _pgcalc: http://ceph.com/pgcalc/ diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst new file mode 100644 index 00000000..18fdd3ee --- /dev/null +++ b/doc/rados/operations/pools.rst @@ -0,0 +1,835 @@ +======= + Pools +======= + +When you first deploy a cluster without creating a pool, Ceph uses the default +pools for storing data. A pool provides you with: + +- **Resilience**: You can set how many OSDs are allowed to fail without losing data. + For replicated pools, it is the desired number of copies/replicas of an object. + A typical configuration stores an object and one additional copy + (i.e., ``size = 2``), but you can determine the number of copies/replicas. + For `erasure coded pools <../erasure-code>`_, it is the number of coding chunks + (i.e. ``m=2`` in the **erasure code profile**) + +- **Placement Groups**: You can set the number of placement groups for the pool.
+ A typical configuration uses approximately 100 placement groups per OSD to + provide optimal balancing without using up too many computing resources. When + setting up multiple pools, be careful to ensure you set a reasonable number of + placement groups for both the pool and the cluster as a whole. + +- **CRUSH Rules**: When you store data in a pool, placement of the object + and its replicas (or chunks for erasure coded pools) in your cluster is governed + by CRUSH rules. You can create a custom CRUSH rule for your pool if the default + rule is not appropriate for your use case. + +- **Snapshots**: When you create snapshots with ``ceph osd pool mksnap``, + you effectively take a snapshot of a particular pool. + +To organize data into pools, you can list, create, and remove pools. +You can also view the utilization statistics for each pool. + +List Pools +========== + +To list your cluster's pools, execute:: + + ceph osd lspools + + +.. _createpool: + +Create a Pool +============= + +Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_. +Ideally, you should override the default value for the number of placement +groups in your Ceph configuration file, as the default is NOT ideal. +For details on placement group numbers refer to `setting the number of placement groups`_ + +.. note:: Starting with Luminous, all pools need to be associated to the + application using the pool. See `Associate Pool to Application`_ below for + more information. + +For example:: + + osd pool default pg num = 100 + osd pool default pgp num = 100 + +To create a pool, execute:: + + ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] \ + [crush-rule-name] [expected-num-objects] + ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure \ + [erasure-code-profile] [crush-rule-name] [expected_num_objects] + +Where: + +``{pool-name}`` + +:Description: The name of the pool. It must be unique. +:Type: String +:Required: Yes. 
+ +``{pg-num}`` + +:Description: The total number of placement groups for the pool. See `Placement + Groups`_ for details on calculating a suitable number. The + default value ``8`` is NOT suitable for most systems. + +:Type: Integer +:Required: Yes. +:Default: 8 + +``{pgp-num}`` + +:Description: The total number of placement groups for placement purposes. This + **should be equal to the total number of placement groups**, except + for placement group splitting scenarios. + +:Type: Integer +:Required: Yes. Picks up default or Ceph configuration value if not specified. +:Default: 8 + +``{replicated|erasure}`` + +:Description: The pool type which may either be **replicated** to + recover from lost OSDs by keeping multiple copies of the + objects or **erasure** to get a kind of + `generalized RAID5 <../erasure-code>`_ capability. + The **replicated** pools require more + raw storage but implement all Ceph operations. The + **erasure** pools require less raw storage but only + implement a subset of the available operations. + +:Type: String +:Required: No. +:Default: replicated + +``[crush-rule-name]`` + +:Description: The name of a CRUSH rule to use for this pool. The specified + rule must exist. + +:Type: String +:Required: No. +:Default: For **replicated** pools it is the rule specified by the ``osd + pool default crush rule`` config variable. This rule must exist. + For **erasure** pools it is ``erasure-code`` if the ``default`` + `erasure code profile`_ is used or ``{pool-name}`` otherwise. This + rule will be created implicitly if it doesn't exist already. + + +``[erasure-code-profile=profile]`` + +.. _erasure code profile: ../erasure-code-profile + +:Description: For **erasure** pools only. Use the `erasure code profile`_. It + must be an existing profile as defined by + **osd erasure-code-profile set**. + +:Type: String +:Required: No. + +When you create a pool, set the number of placement groups to a reasonable value +(e.g., ``100``). 
Consider the total number of placement groups per OSD too. +Placement groups are computationally expensive, so performance will degrade when +you have many pools with many placement groups (e.g., 50 pools with 100 +placement groups each). The point of diminishing returns depends upon the power +of the OSD host. + +See `Placement Groups`_ for details on calculating an appropriate number of +placement groups for your pool. + +.. _Placement Groups: ../placement-groups + +``[expected-num-objects]`` + +:Description: The expected number of objects for this pool. Setting this value + (together with a negative **filestore merge threshold**) causes PG folder + splitting to happen at pool creation time, avoiding the latency + impact of runtime folder splitting. + +:Type: Integer +:Required: No. +:Default: 0, no splitting at pool creation time. + +.. _associate-pool-to-application: + +Associate Pool to Application +============================= + +Pools need to be associated with an application before use. Pools that will be +used with CephFS or pools that are automatically created by RGW are +automatically associated. Pools that are intended for use with RBD should be +initialized using the ``rbd`` tool (see `Block Device Commands`_ for more +information). + +For other cases, you can manually associate a free-form application name with +a pool:: + + ceph osd pool application enable {pool-name} {application-name} + +.. note:: CephFS uses the application name ``cephfs``, RBD uses the + application name ``rbd``, and RGW uses the application name ``rgw``. + +Set Pool Quotas +=============== + +You can set pool quotas for the maximum number of bytes and/or the maximum +number of objects per pool:: + + ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}] + +For example:: + + ceph osd pool set-quota data max_objects 10000 + +To remove a quota, set its value to ``0``.
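Because ``max_bytes`` in ``set-quota`` takes a raw byte count, a small converter from human-readable sizes can help when scripting quotas. This helper is our own illustration for building the command string, not part of the Ceph CLI.

```python
def to_bytes(size):
    """Convert a human-readable size like '10GiB' to a raw byte count
    suitable for 'ceph osd pool set-quota ... max_bytes'."""
    units = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
    for suffix in sorted(units, key=len, reverse=True):  # match 'GiB' before 'B'
        if size.endswith(suffix):
            return int(float(size[:-len(suffix)]) * units[suffix])
    return int(size)  # no suffix: assume the value is already in bytes

# build (but do not run) a quota command for a hypothetical 'data' pool
print("ceph osd pool set-quota data max_bytes %d" % to_bytes("10GiB"))
```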
+ + +Delete a Pool +============= + +To delete a pool, execute:: + + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + + +To remove a pool the mon_allow_pool_delete flag must be set to true in the Monitor's +configuration. Otherwise they will refuse to remove a pool. + +See `Monitor Configuration`_ for more information. + +.. _Monitor Configuration: ../../configuration/mon-config-ref + +If you created your own rules for a pool you created, you should consider +removing them when you no longer need your pool:: + + ceph osd pool get {pool-name} crush_rule + +If the rule was "123", for example, you can check the other pools like so:: + + ceph osd dump | grep "^pool" | grep "crush_rule 123" + +If no other pools use that custom rule, then it's safe to delete that +rule from the cluster. + +If you created users with permissions strictly for a pool that no longer +exists, you should consider deleting those users too:: + + ceph auth ls | grep -C 5 {pool-name} + ceph auth del {user} + + +Rename a Pool +============= + +To rename a pool, execute:: + + ceph osd pool rename {current-pool-name} {new-pool-name} + +If you rename a pool and you have per-pool capabilities for an authenticated +user, you must update the user's capabilities (i.e., caps) with the new pool +name. + +Show Pool Statistics +==================== + +To show a pool's utilization statistics, execute:: + + rados df + +Additionally, to obtain I/O information for a specific pool or all, execute:: + + ceph osd pool stats [{pool-name}] + + +Make a Snapshot of a Pool +========================= + +To make a snapshot of a pool, execute:: + + ceph osd pool mksnap {pool-name} {snap-name} + +Remove a Snapshot of a Pool +=========================== + +To remove a snapshot of a pool, execute:: + + ceph osd pool rmsnap {pool-name} {snap-name} + +.. 
_setpoolvalues: + + +Set Pool Values +=============== + +To set a value to a pool, execute the following:: + + ceph osd pool set {pool-name} {key} {value} + +You may set values for the following keys: + +.. _compression_algorithm: + +``compression_algorithm`` + +:Description: Sets inline compression algorithm to use for underlying BlueStore. This setting overrides the `global setting <http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression algorithm``. + +:Type: String +:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` + +``compression_mode`` + +:Description: Sets the policy for the inline compression algorithm for underlying BlueStore. This setting overrides the `global setting <http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression mode``. + +:Type: String +:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` + +``compression_min_blob_size`` + +:Description: Chunks smaller than this are never compressed. This setting overrides the `global setting <http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression min blob *``. + +:Type: Unsigned Integer + +``compression_max_blob_size`` + +:Description: Chunks larger than this are broken into smaller blobs sizing + ``compression_max_blob_size`` before being compressed. + +:Type: Unsigned Integer + +.. _size: + +``size`` + +:Description: Sets the number of replicas for objects in the pool. + See `Set the Number of Object Replicas`_ for further details. + Replicated pools only. + +:Type: Integer + +.. _min_size: + +``min_size`` + +:Description: Sets the minimum number of replicas required for I/O. + See `Set the Number of Object Replicas`_ for further details. + Replicated pools only. + +:Type: Integer +:Version: ``0.54`` and above + +.. 
_pg_num: + +``pg_num`` + +:Description: The effective number of placement groups to use when calculating + data placement. +:Type: Integer +:Valid Range: Greater than the current value of ``pg_num``. + +.. _pgp_num: + +``pgp_num`` + +:Description: The effective number of placement groups for placement to use + when calculating data placement. + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. + +.. _crush_rule: + +``crush_rule`` + +:Description: The rule to use for mapping object placement in the cluster. +:Type: String + +.. _allow_ec_overwrites: + +``allow_ec_overwrites`` + +:Description: Whether writes to an erasure coded pool can update part + of an object, so that CephFS and RBD can use it. See + `Erasure Coding with Overwrites`_ for more details. +:Type: Boolean +:Version: ``12.2.0`` and above + +.. _hashpspool: + +``hashpspool`` + +:Description: Set/Unset HASHPSPOOL flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _nodelete: + +``nodelete`` + +:Description: Set/Unset NODELETE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _nopgchange: + +``nopgchange`` + +:Description: Set/Unset NOPGCHANGE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _nosizechange: + +``nosizechange`` + +:Description: Set/Unset NOSIZECHANGE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _write_fadvise_dontneed: + +``write_fadvise_dontneed`` + +:Description: Set/Unset WRITE_FADVISE_DONTNEED flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _noscrub: + +``noscrub`` + +:Description: Set/Unset NOSCRUB flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _nodeep-scrub: + +``nodeep-scrub`` + +:Description: Set/Unset NODEEP_SCRUB flag on a given pool. 
+:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _hit_set_type: + +``hit_set_type`` + +:Description: Enables hit set tracking for cache pools. + See `Bloom Filter`_ for additional information. + +:Type: String +:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` +:Default: ``bloom``. Other values are for testing. + +.. _hit_set_count: + +``hit_set_count`` + +:Description: The number of hit sets to store for cache pools. The higher + the number, the more RAM consumed by the ``ceph-osd`` daemon. + +:Type: Integer +:Valid Range: ``1``. Agent doesn't handle > 1 yet. + +.. _hit_set_period: + +``hit_set_period`` + +:Description: The duration of a hit set period in seconds for cache pools. + The higher the number, the more RAM consumed by the + ``ceph-osd`` daemon. + +:Type: Integer +:Example: ``3600`` 1hr + +.. _hit_set_fpp: + +``hit_set_fpp`` + +:Description: The false positive probability for the ``bloom`` hit set type. + See `Bloom Filter`_ for additional information. + +:Type: Double +:Valid Range: 0.0 - 1.0 +:Default: ``0.05`` + +.. _cache_target_dirty_ratio: + +``cache_target_dirty_ratio`` + +:Description: The percentage of the cache pool containing modified (dirty) + objects before the cache tiering agent will flush them to the + backing storage pool. + +:Type: Double +:Default: ``.4`` + +.. _cache_target_dirty_high_ratio: + +``cache_target_dirty_high_ratio`` + +:Description: The percentage of the cache pool containing modified (dirty) + objects before the cache tiering agent will flush them to the + backing storage pool with a higher speed. + +:Type: Double +:Default: ``.6`` + +.. _cache_target_full_ratio: + +``cache_target_full_ratio`` + +:Description: The percentage of the cache pool containing unmodified (clean) + objects before the cache tiering agent will evict them from the + cache pool. + +:Type: Double +:Default: ``.8`` + +.. 
_target_max_bytes: + +``target_max_bytes`` + +:Description: Ceph will begin flushing or evicting objects when the + ``max_bytes`` threshold is triggered. + +:Type: Integer +:Example: ``1000000000000`` #1-TB + +.. _target_max_objects: + +``target_max_objects`` + +:Description: Ceph will begin flushing or evicting objects when the + ``max_objects`` threshold is triggered. + +:Type: Integer +:Example: ``1000000`` #1M objects + + +``hit_set_grade_decay_rate`` + +:Description: Temperature decay rate between two successive hit_sets. +:Type: Integer +:Valid Range: 0 - 100 +:Default: ``20`` + + +``hit_set_search_last_n`` + +:Description: Count at most N appearances in hit_sets for temperature calculation. +:Type: Integer +:Valid Range: 0 - hit_set_count +:Default: ``1`` + + +.. _cache_min_flush_age: + +``cache_min_flush_age`` + +:Description: The time (in seconds) before the cache tiering agent will flush + an object from the cache pool to the storage pool. + +:Type: Integer +:Example: ``600`` 10min + +.. _cache_min_evict_age: + +``cache_min_evict_age`` + +:Description: The time (in seconds) before the cache tiering agent will evict + an object from the cache pool. + +:Type: Integer +:Example: ``1800`` 30min + +.. _fast_read: + +``fast_read`` + +:Description: On an erasure-coded pool, if this flag is turned on, the read request + issues sub-reads to all shards, and waits until it receives enough + shards to decode before serving the client. With the jerasure and isa + erasure plugins, once the first K replies return, the client's request is + served immediately using the data decoded from these replies. This + trades some resources for better performance. Currently this + flag is only supported for erasure-coded pools. + +:Type: Boolean +:Default: ``0`` + +.. _scrub_min_interval: + +``scrub_min_interval`` + +:Description: The minimum interval in seconds for pool scrubbing when + load is low. If it is 0, the value ``osd_scrub_min_interval`` + from config is used. 
+ +:Type: Double +:Default: ``0`` + +.. _scrub_max_interval: + +``scrub_max_interval`` + +:Description: The maximum interval in seconds for pool scrubbing + irrespective of cluster load. If it is 0, the value + osd_scrub_max_interval from config is used. + +:Type: Double +:Default: ``0`` + +.. _deep_scrub_interval: + +``deep_scrub_interval`` + +:Description: The interval in seconds for pool “deep” scrubbing. If it + is 0, the value osd_deep_scrub_interval from config is used. + +:Type: Double +:Default: ``0`` + + +.. _recovery_priority: + +``recovery_priority`` + +:Description: When a value is set it will increase or decrease the computed + reservation priority. This value must be in the range -10 to + 10. Use a negative priority for less important pools so they + have lower priority than any new pools. + +:Type: Integer +:Default: ``0`` + + +.. _recovery_op_priority: + +``recovery_op_priority`` + +:Description: Specify the recovery operation priority for this pool instead of ``osd_recovery_op_priority``. + +:Type: Integer +:Default: ``0`` + + +Get Pool Values +=============== + +To get a value from a pool, execute the following:: + + ceph osd pool get {pool-name} {key} + +You may get values for the following keys: + +``size`` + +:Description: see size_ + +:Type: Integer + +``min_size`` + +:Description: see min_size_ + +:Type: Integer +:Version: ``0.54`` and above + +``pg_num`` + +:Description: see pg_num_ + +:Type: Integer + + +``pgp_num`` + +:Description: see pgp_num_ + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. 
+ + +``crush_rule`` + +:Description: see crush_rule_ + + +``hit_set_type`` + +:Description: see hit_set_type_ + +:Type: String +:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` + +``hit_set_count`` + +:Description: see hit_set_count_ + +:Type: Integer + + +``hit_set_period`` + +:Description: see hit_set_period_ + +:Type: Integer + + +``hit_set_fpp`` + +:Description: see hit_set_fpp_ + +:Type: Double + + +``cache_target_dirty_ratio`` + +:Description: see cache_target_dirty_ratio_ + +:Type: Double + + +``cache_target_dirty_high_ratio`` + +:Description: see cache_target_dirty_high_ratio_ + +:Type: Double + + +``cache_target_full_ratio`` + +:Description: see cache_target_full_ratio_ + +:Type: Double + + +``target_max_bytes`` + +:Description: see target_max_bytes_ + +:Type: Integer + + +``target_max_objects`` + +:Description: see target_max_objects_ + +:Type: Integer + + +``cache_min_flush_age`` + +:Description: see cache_min_flush_age_ + +:Type: Integer + + +``cache_min_evict_age`` + +:Description: see cache_min_evict_age_ + +:Type: Integer + + +``fast_read`` + +:Description: see fast_read_ + +:Type: Boolean + + +``scrub_min_interval`` + +:Description: see scrub_min_interval_ + +:Type: Double + + +``scrub_max_interval`` + +:Description: see scrub_max_interval_ + +:Type: Double + + +``deep_scrub_interval`` + +:Description: see deep_scrub_interval_ + +:Type: Double + + +``allow_ec_overwrites`` + +:Description: see allow_ec_overwrites_ + +:Type: Boolean + + +``recovery_priority`` + +:Description: see recovery_priority_ + +:Type: Integer + + +``recovery_op_priority`` + +:Description: see recovery_op_priority_ + +:Type: Integer + + +Set the Number of Object Replicas +================================= + +To set the number of object replicas on a replicated pool, execute the following:: + + ceph osd pool set {poolname} size {num-replicas} + +.. important:: The ``{num-replicas}`` includes the object itself. 
+ If you want the object and two copies of the object for a total of + three instances of the object, specify ``3``. + +For example:: + + ceph osd pool set data size 3 + +You may execute this command for each pool. **Note:** An object might accept +I/Os in degraded mode with fewer than ``pool size`` replicas. To set a minimum +number of required replicas for I/O, you should use the ``min_size`` setting. +For example:: + + ceph osd pool set data min_size 2 + +This ensures that no object in the data pool will receive I/O with fewer than +``min_size`` replicas. + + +Get the Number of Object Replicas +================================= + +To get the number of object replicas, execute the following:: + + ceph osd dump | grep 'replicated size' + +Ceph will list the pools, with the ``replicated size`` attribute highlighted. +By default, Ceph creates two replicas of an object (a total of three copies, or +a size of 3). + + + +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups +.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites +.. _Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool + diff --git a/doc/rados/operations/upmap.rst b/doc/rados/operations/upmap.rst new file mode 100644 index 00000000..8bad0d95 --- /dev/null +++ b/doc/rados/operations/upmap.rst @@ -0,0 +1,85 @@ +Using the pg-upmap +================== + +Starting in Luminous v12.2.z there is a new *pg-upmap* exception table +in the OSDMap that allows the cluster to explicitly map specific PGs to +specific OSDs. This allows the cluster to fine-tune the data +distribution to, in most cases, perfectly distribute PGs across OSDs. + +The key caveat to this new mechanism is that it requires that all +clients understand the new *pg-upmap* structure in the OSDMap.
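To make "a perfect distribution of PGs" concrete: balance is just the spread of PG counts per OSD, which pg-upmap entries can even out by re-mapping individual PGs. The sketch below counts PGs per OSD for a toy, hypothetical mapping; it is an illustration of the metric, not a Ceph API.

```python
from collections import Counter

def pgs_per_osd(pg_to_osds):
    """Count how many PGs each OSD serves, given pgid -> list of OSD ids
    (the 'up' set). A balanced cluster has nearly equal counts per OSD."""
    counts = Counter()
    for osds in pg_to_osds.values():
        counts.update(osds)
    return counts

# toy mapping of four 2-replica PGs across three OSDs (hypothetical)
mapping = {"1.0": [0, 1], "1.1": [1, 2], "1.2": [2, 0], "1.3": [0, 1]}
counts = pgs_per_osd(mapping)
print(sorted(counts.items()))  # OSDs 0 and 1 carry more PGs than OSD 2
```

An upmap entry re-mapping one of OSD 0's or OSD 1's PGs to OSD 2 would flatten this spread, which is what the balancer and ``osdmaptool`` optimizations described below compute automatically.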
+ +Enabling +-------- + +To allow use of the feature, you must tell the cluster that it only +needs to support luminous (and newer) clients with:: + + ceph osd set-require-min-compat-client luminous + +This command will fail if any pre-luminous clients or daemons are +connected to the monitors. You can see what client versions are in +use with:: + + ceph features + +Balancer module +----------------- + +The new `balancer` module for ceph-mgr will automatically balance +the number of PGs per OSD. See ``Balancer`` + + +Offline optimization +-------------------- + +Upmap entries are updated with an offline optimizer built into ``osdmaptool``. + +#. Grab the latest copy of your osdmap:: + + ceph osd getmap -o om + +#. Run the optimizer:: + + osdmaptool om --upmap out.txt [--upmap-pool <pool>] + [--upmap-max <max-optimizations>] [--upmap-deviation <max-deviation>] + [--upmap-active] + + It is highly recommended that optimization be done for each pool + individually, or for sets of similarly-utilized pools. You can + specify the ``--upmap-pool`` option multiple times. "Similar pools" + means pools that are mapped to the same devices and store the same + kind of data (e.g., RBD image pools, yes; RGW index pool and RGW + data pool, no). + + The ``max-optimizations`` value is the maximum number of upmap entries to + identify in the run. The default is `10` like the ceph-mgr balancer module, + but you should use a larger number if you are doing offline optimization. + If it cannot find any additional changes to make it will stop early + (i.e., when the pool distribution is perfect). + + The ``max-deviation`` value defaults to `5`. If an OSD PG count + varies from the computed target number by less than or equal + to this amount it will be considered perfect. + + The ``--upmap-active`` option simulates the behavior of the active + balancer in upmap mode. It keeps cycling until the OSDs are balanced + and reports how many rounds and how long each round is taking. 
The + elapsed time for rounds indicates the CPU load ceph-mgr will be + consuming when it tries to compute the next optimization plan. + +#. Apply the changes:: + + source out.txt + + The proposed changes are written to the output file ``out.txt`` in + the example above. These are normal ceph CLI commands that can be + run to apply the changes to the cluster. + + +The above steps can be repeated as many times as necessary to achieve +a perfect distribution of PGs for each set of pools. + +You can see some (gory) details about what the tool is doing by +passing ``--debug-osd 10`` and even more with ``--debug-crush 10`` +to ``osdmaptool``. diff --git a/doc/rados/operations/user-management.rst b/doc/rados/operations/user-management.rst new file mode 100644 index 00000000..30c034b4 --- /dev/null +++ b/doc/rados/operations/user-management.rst @@ -0,0 +1,756 @@ +.. _user-management: + +================= + User Management +================= + +This document describes :term:`Ceph Client` users, and their authentication and +authorization with the :term:`Ceph Storage Cluster`. Users are either +individuals or system actors such as applications, which use Ceph clients to +interact with the Ceph Storage Cluster daemons. + +.. ditaa:: + +-----+ + | {o} | + | | + +--+--+ /---------\ /---------\ + | | Ceph | | Ceph | + ---+---*----->| |<------------->| | + | uses | Clients | | Servers | + | \---------/ \---------/ + /--+--\ + | | + | | + actor + + +When Ceph runs with authentication and authorization enabled (enabled by +default), you must specify a user name and a keyring containing the secret key +of the specified user (usually via the command line). If you do not specify a +user name, Ceph will use ``client.admin`` as the default user name. If you do +not specify a keyring, Ceph will look for a keyring via the ``keyring`` setting +in the Ceph configuration. 
For example, if you execute the ``ceph health`` +command without specifying a user or keyring:: + + ceph health + +Ceph interprets the command like this:: + + ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health + +Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid +re-entry of the user name and secret. + +For details on configuring the Ceph Storage Cluster to use authentication, +see `Cephx Config Reference`_. For details on the architecture of Cephx, see +`Architecture - High Availability Authentication`_. + + +Background +========== + +Irrespective of the type of Ceph client (e.g., Block Device, Object Storage, +Filesystem, native API, etc.), Ceph stores all data as objects within `pools`_. +Ceph users must have access to pools in order to read and write data. +Additionally, Ceph users must have execute permissions to use Ceph's +administrative commands. The following concepts will help you understand Ceph +user management. + + +User +---- + +A user is either an individual or a system actor such as an application. +Creating users allows you to control who (or what) can access your Ceph Storage +Cluster, its pools, and the data within pools. + +Ceph has the notion of a ``type`` of user. For the purposes of user management, +the type will always be ``client``. Ceph identifies users in period (.) +delimited form consisting of the user type and the user ID: for example, +``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing +is that Ceph Monitors, OSDs, and Metadata Servers also use the Cephx protocol, +but they are not clients. Distinguishing the user type helps to distinguish +between client users and other users--streamlining access control, user +monitoring and traceability. + +Sometimes Ceph's user type may seem confusing, because the Ceph command line +allows you to specify a user with or without the type, depending upon your +command line usage. 
If you specify ``--user`` or ``--id``, you can omit the +type. So ``client.user1`` can be entered simply as ``user1``. If you specify +``--name`` or ``-n``, you must specify the type and name, such as +``client.user1``. We recommend using the type and name as a best practice +wherever possible. + +.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage + user or a Ceph Filesystem user. The Ceph Object Gateway uses a Ceph Storage + Cluster user to communicate between the gateway daemon and the storage + cluster, but the gateway has its own user management functionality for end + users. The Ceph Filesystem uses POSIX semantics. The user space associated + with the Ceph Filesystem is not the same as a Ceph Storage Cluster user. + + + +Authorization (Capabilities) +---------------------------- + +Ceph uses the term "capabilities" (caps) to describe authorizing an +authenticated user to exercise the functionality of the monitors, OSDs and +metadata servers. Capabilities can also restrict access to data within a pool, +a namespace within a pool, or a set of pools based on their application tags. +A Ceph administrative user sets a user's capabilities when creating or updating +a user. + +Capability syntax follows the form:: + + {daemon-type} '{cap-spec}[, {cap-spec} ...]' + +- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access + settings or ``profile {name}``. For example:: + + mon 'allow {access-spec} [network {network/prefix}]' + + mon 'profile {name}' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The optional ``{network/prefix}`` is a standard network name and + prefix length in CIDR notation (e.g., ``10.3.0.0/16``). If present, + the use of this capability is restricted to clients connecting from + this network. + +- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, ``class-read``, + ``class-write`` access settings or ``profile {name}``. 
Additionally, OSD + capabilities also allow for pool and namespace settings. :: + + osd 'allow {access-spec} [{match-spec}] [network {network/prefix}]' + + osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]] [network {network/prefix}]' + + The ``{access-spec}`` syntax is either of the following: :: + + * | all | [r][w][x] [class-read] [class-write] + + class {class name} [{method name}] + + The optional ``{match-spec}`` syntax is either of the following: :: + + pool={pool-name} [namespace={namespace-name}] [object_prefix {prefix}] + + [namespace={namespace-name}] tag {application} {key}={value} + + The optional ``{network/prefix}`` is a standard network name and + prefix length in CIDR notation (e.g., ``10.3.0.0/16``). If present, + the use of this capability is restricted to clients connecting from + this network. + +- **Manager Caps:** Manager (``ceph-mgr``) capabilities include + ``r``, ``w``, ``x`` access settings or ``profile {name}``. For example: :: + + mgr 'allow {access-spec} [network {network/prefix}]' + + mgr 'profile {name} [{key1} {match-type} {value1} ...] [network {network/prefix}]' + + Manager capabilities can also be specified for specific commands, + all commands exported by a built-in manager service, or all commands + exported by a specific add-on module. For example: :: + + mgr 'allow command "{command-prefix}" [with {key1} {match-type} {value1} ...] [network {network/prefix}]' + + mgr 'allow service {service-name} {access-spec} [network {network/prefix}]' + + mgr 'allow module {module-name} [with {key1} {match-type} {value1} ...] {access-spec} [network {network/prefix}]' + + The ``{access-spec}`` syntax is as follows: :: + + * | all | [r][w][x] + + The ``{service-name}`` is one of the following: :: + + mgr | osd | pg | py + + The ``{match-type}`` is one of the following: :: + + = | prefix | regex + +- **Metadata Server Caps:** For administrators, use ``allow *``. 
For all
+other users, such as CephFS clients, consult :doc:`/cephfs/client-auth`.
+
+
+.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the
+   Ceph Storage Cluster, so it is not represented as a Ceph Storage
+   Cluster daemon type.
+
+The following entries describe each access capability.
+
+``allow``
+
+:Description: Precedes access settings for a daemon. Implies ``rw``
+              for MDS only.
+
+
+``r``
+
+:Description: Gives the user read access. Required with monitors to retrieve
+              the CRUSH map.
+
+
+``w``
+
+:Description: Gives the user write access to objects.
+
+
+``x``
+
+:Description: Gives the user the capability to call class methods
+              (i.e., both read and write) and to conduct ``auth``
+              operations on monitors.
+
+
+``class-read``
+
+:Description: Gives the user the capability to call class read methods.
+              Subset of ``x``.
+
+
+``class-write``
+
+:Description: Gives the user the capability to call class write methods.
+              Subset of ``x``.
+
+
+``*``, ``all``
+
+:Description: Gives the user read, write and execute permissions for a
+              particular daemon/pool, and the ability to execute
+              admin commands.
+
+The following entries describe valid capability profiles:
+
+``profile osd`` (Monitor only)
+
+:Description: Gives a user permissions to connect as an OSD to other OSDs or
+              monitors. Conferred on OSDs to enable OSDs to handle replication
+              heartbeat traffic and status reporting.
+
+
+``profile mds`` (Monitor only)
+
+:Description: Gives a user permissions to connect as an MDS to other MDSs or
+              monitors.
+
+
+``profile bootstrap-osd`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap an OSD. Conferred on
+              deployment tools such as ``ceph-volume``, ``ceph-deploy``, etc.
+              so that they have permissions to add keys, etc. when
+              bootstrapping an OSD.
+
+
+``profile bootstrap-mds`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap a metadata server.
+ Conferred on deployment tools such as ``ceph-deploy``, etc. + so they have permissions to add keys, etc. when bootstrapping + a metadata server. + +``profile bootstrap-rbd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an RBD user. + Conferred on deployment tools such as ``ceph-deploy``, etc. + so they have permissions to add keys, etc. when bootstrapping + an RBD user. + +``profile bootstrap-rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an ``rbd-mirror`` daemon + user. Conferred on deployment tools such as ``ceph-deploy``, etc. + so they have permissions to add keys, etc. when bootstrapping + an ``rbd-mirror`` daemon. + +``profile rbd`` (Manager, Monitor, and OSD) + +:Description: Gives a user permissions to manipulate RBD images. When used + as a Monitor cap, it provides the minimal privileges required + by an RBD client application; this includes the ability + to blacklist other client users. When used as an OSD cap, it + provides read-write access to the specified pool to an + RBD client application. The Manager cap supports optional + ``pool`` and ``namespace`` keyword arguments. + +``profile rbd-mirror`` (Monitor only) + +:Description: Gives a user permissions to manipulate RBD images and retrieve + RBD mirroring config-key secrets. It provides the minimal + privileges required for the ``rbd-mirror`` daemon. + +``profile rbd-read-only`` (Manager and OSD) + +:Description: Gives a user read-only permissions to RBD images. The Manager + cap supports optional ``pool`` and ``namespace`` keyword + arguments. + + +Pool +---- + +A pool is a logical partition where users store data. +In Ceph deployments, it is common to create a pool as a logical partition for +similar types of data. For example, when deploying Ceph as a backend for +OpenStack, a typical deployment would have pools for volumes, images, backups +and virtual machines, and users such as ``client.glance``, ``client.cinder``, +etc. 
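For deployments like the OpenStack example above, each service user is typically created with OSD capabilities restricted to its own pool. The following Python sketch shows one way such capability strings might be assembled before being passed to ``ceph auth get-or-create``; ``build_caps`` is a hypothetical helper for illustration, not part of any Ceph API:

```python
# Hypothetical helper (not a Ceph API): assemble the per-daemon capability
# strings, in the cap syntax shown above, for a user confined to one pool.
def build_caps(pool, osd_access="allow rwx", mon_access="allow r",
               namespace=None):
    osd_cap = "{} pool={}".format(osd_access, pool)
    if namespace is not None:
        osd_cap += " namespace={}".format(namespace)
    return {"mon": mon_access, "osd": osd_cap}

# e.g. caps for a Glance-style user confined to an "images" pool:
print(build_caps("images"))
# {'mon': 'allow r', 'osd': 'allow rwx pool=images'}
```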
+
+Application Tags
+----------------
+
+Access may be restricted to specific pools as defined by their application
+metadata. The ``*`` wildcard may be used for the ``key`` argument, the
+``value`` argument, or both. ``all`` is a synonym for ``*``.
+
+Namespace
+---------
+
+Objects within a pool can be associated with a namespace--a logical group of
+objects within the pool. A user's access to a pool can be associated with a
+namespace such that reads and writes by the user take place only within the
+namespace. Objects written to a namespace within the pool can only be accessed
+by users who have access to the namespace.
+
+.. note:: Namespaces are primarily useful for applications written on top of
+   ``librados`` where the logical grouping can alleviate the need to create
+   different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various
+   metadata objects.
+
+The rationale for namespaces is that pools can be a computationally expensive
+method of segregating data sets for the purposes of authorizing separate sets
+of users. For example, a pool should have ~100 placement groups per OSD. So an
+exemplary cluster with 1000 OSDs would have 100,000 placement groups for one
+pool. Each pool would create another 100,000 placement groups in the exemplary
+cluster. By contrast, writing an object to a namespace simply associates the
+namespace with the object name without the computational overhead of a separate
+pool. Rather than creating a separate pool for a user or set of users, you may
+use a namespace. **Note:** Only available using ``librados`` at this time.
+
+Access may be restricted to specific RADOS namespaces using the ``namespace``
+capability. Limited globbing of namespaces is supported; if the last character
+of the specified namespace is ``*``, then access is granted to any namespace
+starting with the provided argument.
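The prefix-globbing rule just described can be stated precisely in a few lines. This is an illustrative Python sketch of the matching logic only, not the actual OSD implementation:

```python
# Sketch of the namespace cap matching rule described above (illustrative,
# not Ceph source): a trailing '*' grants access to every namespace that
# begins with the preceding characters; otherwise the namespace must match
# exactly.
def namespace_allowed(cap_namespace, request_namespace):
    if cap_namespace.endswith("*"):
        return request_namespace.startswith(cap_namespace[:-1])
    return request_namespace == cap_namespace

assert namespace_allowed("tenant-a-*", "tenant-a-photos")   # prefix glob
assert not namespace_allowed("tenant-a", "tenant-a-photos") # exact match required
```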
+
+
+Managing Users
+==============
+
+User management functionality provides Ceph Storage Cluster administrators with
+the ability to create, update and delete users directly in the Ceph Storage
+Cluster.
+
+When you create or delete users in the Ceph Storage Cluster, you may need to
+distribute keys to clients so that they can be added to keyrings. See `Keyring
+Management`_ for details.
+
+
+List Users
+----------
+
+To list the users in your cluster, execute the following::
+
+    ceph auth ls
+
+Ceph will list out all users in your cluster. For example, in a two-node
+exemplary cluster, ``ceph auth ls`` will output something that looks like
+this::
+
+    installed auth entries:
+
+    osd.0
+            key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
+            caps: [mon] allow profile osd
+            caps: [osd] allow *
+    osd.1
+            key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
+            caps: [mon] allow profile osd
+            caps: [osd] allow *
+    client.admin
+            key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
+            caps: [mds] allow
+            caps: [mon] allow *
+            caps: [osd] allow *
+    client.bootstrap-mds
+            key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww==
+            caps: [mon] allow profile bootstrap-mds
+    client.bootstrap-osd
+            key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw==
+            caps: [mon] allow profile bootstrap-osd
+
+
+Note that the ``TYPE.ID`` notation for users applies such that ``osd.0`` is a
+user of type ``osd`` and its ID is ``0``, ``client.admin`` is a user of type
+``client`` and its ID is ``admin`` (i.e., the default ``client.admin`` user).
+Note also that each entry has a ``key: <value>`` entry, and one or more
+``caps:`` entries.
+
+You may use the ``-o {filename}`` option with ``ceph auth ls`` to
+save the output to a file.
+
+
+Get a User
+----------
+
+To retrieve a specific user, key and capabilities, execute the
+following::
+
+    ceph auth get {TYPE.ID}
+
+For example::
+
+    ceph auth get client.admin
+
+You may also use the ``-o {filename}`` option with ``ceph auth get`` to
+save the output to a file.
Developers may also execute the following::
+
+    ceph auth export {TYPE.ID}
+
+The ``auth export`` command is identical to ``auth get``.
+
+
+
+Add a User
+----------
+
+Adding a user creates a username (i.e., ``TYPE.ID``), a secret key and
+any capabilities included in the command you use to create the user.
+
+A user's key enables the user to authenticate with the Ceph Storage Cluster.
+The user's capabilities authorize the user to read, write, or execute on Ceph
+monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``).
+
+There are a few ways to add a user:
+
+- ``ceph auth add``: This command is the canonical way to add a user. It
+  will create the user, generate a key and add any specified capabilities.
+
+- ``ceph auth get-or-create``: This command is often the most convenient way
+  to create a user, because it returns a keyfile format with the user name
+  (in brackets) and the key. If the user already exists, this command
+  simply returns the user name and key in the keyfile format. You may use the
+  ``-o {filename}`` option to save the output to a file.
+
+- ``ceph auth get-or-create-key``: This command is a convenient way to create
+  a user and return the user's key (only). This is useful for clients that
+  need the key only (e.g., libvirt). If the user already exists, this command
+  simply returns the key. You may use the ``-o {filename}`` option to save the
+  output to a file.
+
+When creating client users, you may create a user with no capabilities. A user
+with no capabilities is useless beyond mere authentication, because the client
+cannot retrieve the cluster map from the monitor. However, you can create a
+user with no capabilities if you wish to add capabilities later using
+the ``ceph auth caps`` command.
+
+A typical user has at least read capabilities on the Ceph monitor and
+read and write capability on Ceph OSDs. Additionally, a user's OSD permissions
+are often restricted to accessing a particular pool.
:: + + ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring + ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key + + +.. important:: If you provide a user with capabilities to OSDs, but you DO NOT + restrict access to particular pools, the user will have access to ALL + pools in the cluster! + + +.. _modify-user-capabilities: + +Modify User Capabilities +------------------------ + +The ``ceph auth caps`` command allows you to specify a user and change the +user's capabilities. Setting new capabilities will overwrite current capabilities. +To view current capabilities run ``ceph auth get USERTYPE.USERID``. To add +capabilities, you should also specify the existing capabilities when using the form:: + + ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]'] + +For example:: + + ceph auth get client.john + ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool' + ceph auth caps client.brian-manager mon 'allow *' osd 'allow *' + +See `Authorization (Capabilities)`_ for additional details on capabilities. + + +Delete a User +------------- + +To delete a user, use ``ceph auth del``:: + + ceph auth del {TYPE}.{ID} + +Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or ID of the daemon. 
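Tooling that wraps these commands often validates the ``{TYPE}.{ID}`` argument before invoking ``ceph auth del``. A minimal sketch of such a check follows; ``parse_entity`` is a hypothetical helper, not part of Ceph:

```python
# Hypothetical validation helper: split an entity name such as
# 'client.john' into its type and ID, rejecting unknown types.
VALID_TYPES = ("client", "osd", "mon", "mds")

def parse_entity(name):
    entity_type, sep, entity_id = name.partition(".")
    if not sep or not entity_id or entity_type not in VALID_TYPES:
        raise ValueError("expected {TYPE}.{ID}, got %r" % name)
    return entity_type, entity_id

print(parse_entity("client.john"))  # ('client', 'john')
```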
+ + +Print a User's Key +------------------ + +To print a user's authentication key to standard output, execute the following:: + + ceph auth print-key {TYPE}.{ID} + +Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or ID of the daemon. + +Printing a user's key is useful when you need to populate client +software with a user's key (e.g., libvirt). :: + + mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user` + + +Import a User(s) +---------------- + +To import one or more users, use ``ceph auth import`` and +specify a keyring:: + + ceph auth import -i /path/to/keyring + +For example:: + + sudo ceph auth import -i /etc/ceph/ceph.keyring + + +.. note:: The ceph storage cluster will add new users, their keys and their + capabilities and will update existing users, their keys and their + capabilities. + + +Keyring Management +================== + +When you access Ceph via a Ceph client, the Ceph client will look for a local +keyring. Ceph presets the ``keyring`` setting with the following four keyring +names by default so you don't have to set them in your Ceph configuration file +unless you want to override the defaults (not recommended): + +- ``/etc/ceph/$cluster.$name.keyring`` +- ``/etc/ceph/$cluster.keyring`` +- ``/etc/ceph/keyring`` +- ``/etc/ceph/keyring.bin`` + +The ``$cluster`` metavariable is your Ceph cluster name as defined by the +name of the Ceph configuration file (i.e., ``ceph.conf`` means the cluster name +is ``ceph``; thus, ``ceph.keyring``). The ``$name`` metavariable is the user +type and user ID (e.g., ``client.admin``; thus, ``ceph.client.admin.keyring``). + +.. note:: When executing commands that read or write to ``/etc/ceph``, you may + need to use ``sudo`` to execute the command as ``root``. 
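The default keyring search order above is simply each template with the two metavariables expanded. A short Python sketch of that expansion (illustrative only; Ceph performs this internally):

```python
# Expand the $cluster and $name metavariables into the four default
# keyring paths, in the order listed above.
from string import Template

DEFAULT_KEYRING_TEMPLATES = (
    "/etc/ceph/$cluster.$name.keyring",
    "/etc/ceph/$cluster.keyring",
    "/etc/ceph/keyring",
    "/etc/ceph/keyring.bin",
)

def default_keyring_paths(cluster="ceph", name="client.admin"):
    return [Template(t).substitute(cluster=cluster, name=name)
            for t in DEFAULT_KEYRING_TEMPLATES]

print(default_keyring_paths()[0])  # /etc/ceph/ceph.client.admin.keyring
```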
+
+After you create a user (e.g., ``client.ringo``), you must get the key and add
+it to a keyring on a Ceph client so that the user can access the Ceph Storage
+Cluster.
+
+The `User Management`_ section details how to list, get, add, modify and delete
+users directly in the Ceph Storage Cluster. However, Ceph also provides the
+``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client.
+
+
+Create a Keyring
+----------------
+
+When you use the procedures in the `Managing Users`_ section to create users,
+you need to provide user keys to the Ceph client(s) so that the Ceph client
+can retrieve the key for the specified user and authenticate with the Ceph
+Storage Cluster. Ceph clients access keyrings to look up a user name and
+retrieve the user's key.
+
+The ``ceph-authtool`` utility allows you to create a keyring. To create an
+empty keyring, use ``--create-keyring`` or ``-C``. For example::
+
+    ceph-authtool --create-keyring /path/to/keyring
+
+When creating a keyring with multiple users, we recommend using the cluster name
+(e.g., ``$cluster.keyring``) for the keyring filename and saving it in the
+``/etc/ceph`` directory so that the ``keyring`` configuration default setting
+will pick up the filename without requiring you to specify it in the local copy
+of your Ceph configuration file. For example, create ``ceph.keyring`` by
+executing the following::
+
+    sudo ceph-authtool -C /etc/ceph/ceph.keyring
+
+When creating a keyring with a single user, we recommend using the cluster name,
+the user type and the user name, and saving it in the ``/etc/ceph`` directory.
+For example, ``ceph.client.admin.keyring`` for the ``client.admin`` user.
+
+To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means
+the file will have ``rw`` permissions for the ``root`` user only, which is
+appropriate when the keyring contains administrator keys.
However, if you
+intend to use the keyring for a particular user or group of users, ensure
+that you execute ``chown`` or ``chmod`` to establish appropriate keyring
+ownership and access.
+
+
+Add a User to a Keyring
+-----------------------
+
+When you `Add a User`_ to the Ceph Storage Cluster, you can use the `Get a
+User`_ procedure to retrieve a user, key and capabilities and save the user to a
+keyring.
+
+When you only want to use one user per keyring, the `Get a User`_ procedure with
+the ``-o`` option will save the output in the keyring file format. For example,
+to create a keyring for the ``client.admin`` user, execute the following::
+
+    sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
+
+Notice that we use the recommended file format for an individual user.
+
+When you want to import users to a keyring, you can use ``ceph-authtool``
+to specify the destination keyring and the source keyring.
+For example::
+
+    sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
+
+
+Create a User
+-------------
+
+Ceph provides the `Add a User`_ function to create a user directly in the Ceph
+Storage Cluster. However, you can also create a user, keys and capabilities
+directly on a Ceph client keyring. Then, you can import the user to the Ceph
+Storage Cluster. For example::
+
+    sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring
+
+See `Authorization (Capabilities)`_ for additional details on capabilities.
+
+You can also create a keyring and add a new user to the keyring simultaneously.
+For example::
+
+    sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key
+
+In the foregoing scenarios, the new user ``client.ringo`` exists only in the
+keyring. You must still add the new user to the Ceph Storage Cluster.
::
+
+    sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring
+
+
+Modify a User
+-------------
+
+To modify the capabilities of a user record in a keyring, specify the keyring,
+and the user followed by the capabilities. For example::
+
+    sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx'
+
+To propagate the updated user to the Ceph Storage Cluster, you must import the
+modified keyring into the cluster. ::
+
+    sudo ceph auth import -i /etc/ceph/ceph.keyring
+
+See `Import a User(s)`_ for details on updating a Ceph Storage Cluster user
+from a keyring.
+
+You may also `Modify User Capabilities`_ directly in the cluster, store the
+results to a keyring file; then, import the keyring into your main
+``ceph.keyring`` file.
+
+
+Command Line Usage
+==================
+
+Ceph supports the following usage for user name and secret:
+
+``--id`` | ``--user``
+
+:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or
+              ``client.admin``, ``client.user1``). The ``--id`` and ``--user``
+              options enable you to specify the ID portion of the user
+              name (e.g., ``admin``, ``user1``, ``foo``, etc.), omitting the
+              type. For example, to specify user ``client.foo`` enter the
+              following::
+
+               ceph --id foo --keyring /path/to/keyring health
+               ceph --user foo --keyring /path/to/keyring health
+
+
+``--name`` | ``-n``
+
+:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or
+              ``client.admin``, ``client.user1``). The ``--name`` and ``-n``
+              options enable you to specify the fully qualified user name.
+              You must specify the user type (typically ``client``) with the
+              user ID. For example::
+
+               ceph --name client.foo --keyring /path/to/keyring health
+               ceph -n client.foo --keyring /path/to/keyring health
+
+
+``--keyring``
+
+:Description: The path to the keyring containing one or more user names and
+              secrets.
The ``--secret`` option provides the same functionality,
+              but it does not work with Ceph RADOS Gateway, which uses
+              ``--secret`` for another purpose. You may retrieve a keyring with
+              ``ceph auth get-or-create`` and store it locally. This is a
+              preferred approach, because you can switch user names without
+              switching the keyring path. For example::
+
+               sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage
+
+
+.. _pools: ../pools
+
+
+Limitations
+===========
+
+The ``cephx`` protocol authenticates Ceph clients and servers to each other. It
+is not intended to handle authentication of human users or application programs
+run on their behalf. If your access control needs require that kind of
+authentication, you must use another mechanism, which is likely to be specific
+to the front end used to access the Ceph object store. This other mechanism has
+the role of ensuring that only acceptable users and programs are able to run on
+the machine that Ceph will permit to access its object store.
+
+The keys used to authenticate Ceph clients and servers are typically stored in
+a plain text file with appropriate permissions on a trusted host.
+
+.. important:: Storing keys in plaintext files has security shortcomings, but
+   they are difficult to avoid, given the basic authentication methods Ceph
+   uses in the background. Those setting up Ceph systems should be aware of
+   these shortcomings.
+
+In particular, arbitrary user machines, especially portable machines, should not
+be configured to interact directly with Ceph, since that mode of use would
+require the storage of a plaintext authentication key on an insecure machine.
+Anyone who stole that machine or obtained surreptitious access to it could
+obtain the key that will allow them to authenticate their own machines to Ceph.
+ +Rather than permitting potentially insecure machines to access a Ceph object +store directly, users should be required to sign in to a trusted machine in +your environment using a method that provides sufficient security for your +purposes. That trusted machine will store the plaintext Ceph keys for the +human users. A future version of Ceph may address these particular +authentication issues more fully. + +At the moment, none of the Ceph authentication protocols provide secrecy for +messages in transit. Thus, an eavesdropper on the wire can hear and understand +all data sent between clients and servers in Ceph, even if it cannot create or +alter them. Further, Ceph does not include options to encrypt user data in the +object store. Users can hand-encrypt and store their own data in the Ceph +object store, of course, but Ceph provides no features to perform object +encryption itself. Those storing sensitive data in Ceph should consider +encrypting their data before providing it to the Ceph system. + + +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _Cephx Config Reference: ../../configuration/auth-config-ref |