From 19fcec84d8d7d21e796c7624e521b60d28ee21ed Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 7 Apr 2024 20:45:59 +0200 Subject: Adding upstream version 16.2.11+ds. Signed-off-by: Daniel Baumann --- doc/rados/operations/add-or-rm-mons.rst | 446 +++++++ doc/rados/operations/add-or-rm-osds.rst | 386 ++++++ doc/rados/operations/balancer.rst | 206 ++++ doc/rados/operations/bluestore-migration.rst | 338 ++++++ doc/rados/operations/cache-tiering.rst | 552 +++++++++ doc/rados/operations/change-mon-elections.rst | 88 ++ doc/rados/operations/control.rst | 601 +++++++++ doc/rados/operations/crush-map-edits.rst | 747 ++++++++++++ doc/rados/operations/crush-map.rst | 1126 +++++++++++++++++ doc/rados/operations/data-placement.rst | 43 + doc/rados/operations/devices.rst | 208 ++++ doc/rados/operations/erasure-code-clay.rst | 240 ++++ doc/rados/operations/erasure-code-isa.rst | 107 ++ doc/rados/operations/erasure-code-jerasure.rst | 121 ++ doc/rados/operations/erasure-code-lrc.rst | 388 ++++++ doc/rados/operations/erasure-code-profile.rst | 126 ++ doc/rados/operations/erasure-code-shec.rst | 145 +++ doc/rados/operations/erasure-code.rst | 262 ++++ doc/rados/operations/health-checks.rst | 1549 ++++++++++++++++++++++++ doc/rados/operations/index.rst | 98 ++ doc/rados/operations/monitoring-osd-pg.rst | 553 +++++++++ doc/rados/operations/monitoring.rst | 647 ++++++++++ doc/rados/operations/operating.rst | 255 ++++ doc/rados/operations/pg-concepts.rst | 102 ++ doc/rados/operations/pg-repair.rst | 81 ++ doc/rados/operations/pg-states.rst | 118 ++ doc/rados/operations/placement-groups.rst | 798 ++++++++++++ doc/rados/operations/pools.rst | 900 ++++++++++++++ doc/rados/operations/stretch-mode.rst | 215 ++++ doc/rados/operations/upmap.rst | 105 ++ doc/rados/operations/user-management.rst | 823 +++++++++++++ 31 files changed, 12374 insertions(+) create mode 100644 doc/rados/operations/add-or-rm-mons.rst create mode 100644 doc/rados/operations/add-or-rm-osds.rst create mode 100644 
doc/rados/operations/balancer.rst create mode 100644 doc/rados/operations/bluestore-migration.rst create mode 100644 doc/rados/operations/cache-tiering.rst create mode 100644 doc/rados/operations/change-mon-elections.rst create mode 100644 doc/rados/operations/control.rst create mode 100644 doc/rados/operations/crush-map-edits.rst create mode 100644 doc/rados/operations/crush-map.rst create mode 100644 doc/rados/operations/data-placement.rst create mode 100644 doc/rados/operations/devices.rst create mode 100644 doc/rados/operations/erasure-code-clay.rst create mode 100644 doc/rados/operations/erasure-code-isa.rst create mode 100644 doc/rados/operations/erasure-code-jerasure.rst create mode 100644 doc/rados/operations/erasure-code-lrc.rst create mode 100644 doc/rados/operations/erasure-code-profile.rst create mode 100644 doc/rados/operations/erasure-code-shec.rst create mode 100644 doc/rados/operations/erasure-code.rst create mode 100644 doc/rados/operations/health-checks.rst create mode 100644 doc/rados/operations/index.rst create mode 100644 doc/rados/operations/monitoring-osd-pg.rst create mode 100644 doc/rados/operations/monitoring.rst create mode 100644 doc/rados/operations/operating.rst create mode 100644 doc/rados/operations/pg-concepts.rst create mode 100644 doc/rados/operations/pg-repair.rst create mode 100644 doc/rados/operations/pg-states.rst create mode 100644 doc/rados/operations/placement-groups.rst create mode 100644 doc/rados/operations/pools.rst create mode 100644 doc/rados/operations/stretch-mode.rst create mode 100644 doc/rados/operations/upmap.rst create mode 100644 doc/rados/operations/user-management.rst (limited to 'doc/rados/operations') diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst new file mode 100644 index 000000000..359fa7676 --- /dev/null +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -0,0 +1,446 @@ +.. 
_adding-and-removing-monitors: + +========================== + Adding/Removing Monitors +========================== + +When you have a cluster up and running, you may add or remove monitors +from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_ +or `Monitor Bootstrap`_. + +.. _adding-monitors: + +Adding Monitors +=============== + +Ceph monitors are lightweight processes that are the single source of truth +for the cluster map. You can run a cluster with 1 monitor but we recommend at least 3 +for a production cluster. Ceph monitors use a variation of the +`Paxos`_ algorithm to establish consensus about maps and other critical +information across the cluster. Due to the nature of Paxos, Ceph requires +a majority of monitors to be active to establish a quorum (thus establishing +consensus). + +It is advisable to run an odd number of monitors. An +odd number of monitors is more resilient than an +even number. For instance, with a two monitor deployment, no +failures can be tolerated and still maintain a quorum; with three monitors, +one failure can be tolerated; in a four monitor deployment, one failure can +be tolerated; with five monitors, two failures can be tolerated. This avoids +the dreaded *split brain* phenomenon, and is why an odd number is best. +In short, Ceph needs a majority of +monitors to be active (and able to communicate with each other), but that +majority can be achieved using a single monitor, or 2 out of 2 monitors, +2 out of 3, 3 out of 4, etc. + +For small or non-critical deployments of multi-node Ceph clusters, it is +advisable to deploy three monitors, and to increase the number of monitors +to five for larger clusters or to survive a double failure. There is rarely +justification for seven or more. + +Since monitors are lightweight, it is possible to run them on the same +host as OSDs; however, we recommend running them on separate hosts, +because `fsync` issues with the kernel may impair performance. 
+Dedicated monitor nodes also minimize disruption since monitor and OSD +daemons are not inactive at the same time when a node crashes or is +taken down for maintenance. + +Dedicated +monitor nodes also make for cleaner maintenance by avoiding both OSDs and +a mon going down if a node is rebooted, taken down, or crashes. + +.. note:: A *majority* of monitors in your cluster must be able to + reach each other in order to establish a quorum. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new monitor, see `Hardware +Recommendations`_ for details on minimum recommendations for monitor hardware. +To add a monitor host to your cluster, first make sure you have an up-to-date +version of Linux installed (typically Ubuntu 16.04 or RHEL 7). + +Add your monitor host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Packages`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. _Installing Packages: ../../../install/install-storage-cluster + + +.. _Adding a Monitor (Manual): + +Adding a Monitor (Manual) +------------------------- + +This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map +and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If +this results in only two monitor daemons, you may add more monitors by +repeating this procedure until you have a sufficient number of ``ceph-mon`` +daemons to achieve a quorum. + +At this point you should define your monitor's id. Traditionally, monitors +have been named with single letters (``a``, ``b``, ``c``, ...), but you are +free to define the id as you see fit. 
For the purpose of this document, +please take into account that ``{mon-id}`` should be the id you chose, +without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a`` +on ``mon.a``). + +#. Create the default directory on the machine that will host your + new monitor: + + .. prompt:: bash $ + + ssh {new-mon-host} + sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} + +#. Create a temporary directory ``{tmp}`` to keep the files needed during + this process. This directory should be different from the monitor's default + directory created in the previous step, and can be removed after all the + steps are executed: + + .. prompt:: bash $ + + mkdir {tmp} + +#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to + the retrieved keyring, and ``{key-filename}`` is the name of the file + containing the retrieved monitor key: + + .. prompt:: bash $ + + ceph auth get mon. -o {tmp}/{key-filename} + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{map-filename}`` is the name of the file + containing the retrieved monitor map: + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{map-filename} + +#. Prepare the monitor's data directory created in the first step. You must + specify the path to the monitor map so that you can retrieve the + information about a quorum of monitors and their ``fsid``. You must also + specify a path to the monitor keyring: + + .. prompt:: bash $ + + sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} + + +#. Start the new monitor and it will automatically join the cluster. + The daemon needs to know which address to bind to, via either the + ``--public-addr {ip}`` or ``--public-network {network}`` argument. + For example: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --public-addr {ip:port} + +.. 
_removing-monitors: + +Removing Monitors +================= + +When you remove monitors from a cluster, consider that Ceph monitors use +Paxos to establish consensus about the master cluster map. You must have +a sufficient number of monitors to establish a quorum for consensus about +the cluster map. + +.. _Removing a Monitor (Manual): + +Removing a Monitor (Manual) +--------------------------- + +This procedure removes a ``ceph-mon`` daemon from your cluster. If this +procedure results in only two monitor daemons, you may add or remove another +monitor until you have a number of ``ceph-mon`` daemons that can achieve a +quorum. + +#. Stop the monitor: + + .. prompt:: bash $ + + service ceph -a stop mon.{mon-id} + +#. Remove the monitor from the cluster: + + .. prompt:: bash $ + + ceph mon remove {mon-id} + +#. Remove the monitor entry from ``ceph.conf``. + +.. _rados-mon-remove-from-unhealthy: + +Removing Monitors from an Unhealthy Cluster +------------------------------------------- + +This procedure removes a ``ceph-mon`` daemon from an unhealthy +cluster, for example a cluster where the monitors cannot form a +quorum. + + +#. Stop all ``ceph-mon`` daemons on all monitor hosts: + + .. prompt:: bash $ + + ssh {mon-host} + systemctl stop ceph-mon.target + + Repeat for all monitor hosts. + +#. Identify a surviving monitor and log in to that host: + + .. prompt:: bash $ + + ssh {mon-host} + +#. Extract a copy of the monmap file: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --extract-monmap {map-path} + + In most cases, this command will be: + + .. prompt:: bash $ + + ceph-mon -i `hostname` --extract-monmap /tmp/monmap + +#. Remove the non-surviving or problematic monitors. For example, if + you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where + only ``mon.a`` will survive, follow the example below: + + .. prompt:: bash $ + + monmaptool {map-path} --rm {mon-id} + + For example, + + .. 
prompt:: bash $ + + monmaptool /tmp/monmap --rm b + monmaptool /tmp/monmap --rm c + +#. Inject the surviving map with the removed monitors into the + surviving monitor(s). For example, to inject a map into monitor + ``mon.a``, follow the example below: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {map-path} + + For example: + + .. prompt:: bash $ + + ceph-mon -i a --inject-monmap /tmp/monmap + +#. Start only the surviving monitors. + +#. Verify the monitors form a quorum (``ceph -s``). + +#. You may wish to archive the removed monitors' data directory in + ``/var/lib/ceph/mon`` in a safe location, or delete it if you are + confident the remaining monitors are healthy and are sufficiently + redundant. + +.. _Changing a Monitor's IP address: + +Changing a Monitor's IP Address +=============================== + +.. important:: Existing monitors are not supposed to change their IP addresses. + +Monitors are critical components of a Ceph cluster, and they need to maintain a +quorum for the whole system to work properly. To establish a quorum, the +monitors need to discover each other. Ceph has strict requirements for +discovering monitors. + +Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors. +However, monitors discover each other using the monitor map, not ``ceph.conf``. +For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you +need to obtain the current monmap for the cluster when creating a new monitor, +as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The +following sections explain the consistency requirements for Ceph monitors, and a +few safe ways to change a monitor's IP address. + + +Consistency Requirements +------------------------ + +A monitor always refers to the local copy of the monmap when discovering other +monitors in the cluster. 
Using the monmap instead of ``ceph.conf`` avoids +errors that could break the cluster (e.g., typos in ``ceph.conf`` when +specifying a monitor address or port). Since monitors use monmaps for discovery +and they share monmaps with clients and other Ceph daemons, the monmap provides +monitors with a strict guarantee that their consensus is valid. + +Strict consistency also applies to updates to the monmap. As with any other +updates on the monitor, changes to the monmap always run through a distributed +consensus algorithm called `Paxos`_. The monitors must agree on each update to +the monmap, such as adding or removing a monitor, to ensure that each monitor in +the quorum has the same version of the monmap. Updates to the monmap are +incremental so that monitors have the latest agreed upon version, and a set of +previous versions, allowing a monitor that has an older version of the monmap to +catch up with the current state of the cluster. + +If monitors discovered each other through the Ceph configuration file instead of +through the monmap, it would introduce additional risks because the Ceph +configuration files are not updated and distributed automatically. Monitors +might inadvertently use an older ``ceph.conf`` file, fail to recognize a +monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able +to determine the current state of the system accurately. Consequently, making +changes to an existing monitor's IP address must be done with great care. + + +Changing a Monitor's IP address (The Right Way) +----------------------------------------------- + +Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to +ensure that other monitors in the cluster will receive the update. 
To change a +monitor's IP address, you must add a new monitor with the IP address you want +to use (as described in `Adding a Monitor (Manual)`_), ensure that the new +monitor successfully joins the quorum, and then remove the monitor that uses the +old IP address. Finally, update the ``ceph.conf`` file to ensure that clients and +other daemons know the IP address of the new monitor. + +For example, let's assume there are three monitors in place, such as :: + + [mon.a] + host = host01 + addr = 10.0.0.1:6789 + [mon.b] + host = host02 + addr = 10.0.0.2:6789 + [mon.c] + host = host03 + addr = 10.0.0.3:6789 + +To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the +steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure +that ``mon.d`` is running before removing ``mon.c``, or it will break the +quorum. Remove ``mon.c`` as described in `Removing a Monitor (Manual)`_. Moving +all three monitors would thus require repeating this process as many times as +needed. + + +Changing a Monitor's IP address (The Messy Way) +----------------------------------------------- + +There may come a time when the monitors must be moved to a different network, a +different part of the datacenter or a different datacenter altogether. While it +is possible to do this, the process becomes a bit more hazardous. + +In such a case, the solution is to generate a new monmap with updated IP +addresses for all the monitors in the cluster, and inject the new map into each +individual monitor. This is not the most user-friendly approach, but we do not +expect this to be something that needs to be done every other week. As +stated at the top of this section, monitors are not supposed to change +IP addresses. + +Using the previous monitor configuration as an example, assume you want to move +all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these +networks are unable to communicate. Use the following procedure: + +#. 
Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{filename}`` is the name of the file + containing the retrieved monitor map: + + .. prompt:: bash $ + + ceph mon getmap -o {tmp}/{filename} + +#. The following example demonstrates the contents of the monmap: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.0.0.1:6789/0 mon.a + 1: 10.0.0.2:6789/0 mon.b + 2: 10.0.0.3:6789/0 mon.c + +#. Remove the existing monitors: + + .. prompt:: bash $ + + monmaptool --rm a --rm b --rm c {tmp}/{filename} + + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: removing a + monmaptool: removing b + monmaptool: removing c + monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) + +#. Add the new monitor locations: + + .. prompt:: bash $ + + monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} + + + :: + + monmaptool: monmap file {tmp}/{filename} + monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) + +#. Check new contents: + + .. prompt:: bash $ + + monmaptool --print {tmp}/{filename} + + :: + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.1.0.1:6789/0 mon.a + 1: 10.1.0.2:6789/0 mon.b + 2: 10.1.0.3:6789/0 mon.c + +At this point, we assume the monitors (and stores) are installed at the new +location. The next step is to propagate the modified monmap to the new +monitors, and inject the modified monmap into each new monitor. + +#. First, make sure to stop all your monitors. Injection must be done while + the daemon is not running. + +#. Inject the monmap: + + .. prompt:: bash $ + + ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} + +#. 
Restart the monitors. + +After this step, migration to the new location is complete and +the monitors should operate successfully. + + +.. _Manual Deployment: ../../../install/manual-deployment +.. _Monitor Bootstrap: ../../../dev/mon-bootstrap +.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) diff --git a/doc/rados/operations/add-or-rm-osds.rst b/doc/rados/operations/add-or-rm-osds.rst new file mode 100644 index 000000000..315552859 --- /dev/null +++ b/doc/rados/operations/add-or-rm-osds.rst @@ -0,0 +1,386 @@ +====================== + Adding/Removing OSDs +====================== + +When you have a cluster up and running, you may add OSDs or remove OSDs +from the cluster at runtime. + +Adding OSDs +=========== + +When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an +OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a +host machine. If your host has multiple storage drives, you may map one +``ceph-osd`` daemon for each drive. + +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. As your cluster reaches its ``near +full`` ratio, you should add one or more OSDs to expand your cluster's capacity. + +.. warning:: Do not let your cluster reach its ``full ratio`` before + adding an OSD. OSD failures that occur after the cluster reaches + its ``near full`` ratio may cause the cluster to exceed its + ``full ratio``. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new OSD, see `Hardware +Recommendations`_ for details on minimum recommendations for OSD hardware. To +add an OSD host to your cluster, first make sure you have an up-to-date version +of Linux installed, and you have made some initial preparations for your +storage drives. See `Filesystem Recommendations`_ for details. 
+ +Add your OSD host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. See the `Network Configuration +Reference`_ for details. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations +.. _Network Configuration Reference: ../../configuration/network-config-ref + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Ceph (Manual)`_ for details. +You should configure SSH for a user with password-less authentication +and root permissions. + +.. _Installing Ceph (Manual): ../../../install + + +Adding an OSD (Manual) +---------------------- + +This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive, +and configures the cluster to distribute data to the OSD. If your host has +multiple drives, you may add an OSD for each drive by repeating this procedure. + +To add an OSD, create a data directory for it, mount a drive to that directory, +add the OSD to the cluster, and then add it to the CRUSH map. + +When you add the OSD to the CRUSH map, consider the weight you give to the new +OSD. Hard drive capacities increase over time, so newer OSD hosts may have larger +hard drives than older hosts in the cluster (i.e., they may have greater +weight). + +.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives + of dissimilar size, you can adjust their weights. However, for best + performance, consider a CRUSH hierarchy with drives of the same type/size. + +#. Create the OSD. If no UUID is given, it will be set automatically when the + OSD starts up. The following command will output the OSD number, which you + will need for subsequent steps: + + .. prompt:: bash $ + + ceph osd create [{uuid} [{id}]] + + If the optional parameter {id} is given it will be used as the OSD id. 
+ Note that in this case the command may fail if the number is already in use. + + .. warning:: In general, explicitly specifying {id} is not recommended. + IDs are allocated as an array, and skipping entries consumes some extra + memory. This can become significant if there are large gaps and/or + clusters are large. If {id} is not specified, the smallest available id is + used. + +#. Create the default directory on your new OSD: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-num} + +#. If the OSD is for a drive other than the OS drive, prepare it + for use with Ceph, and mount it to the directory you just created: + + .. prompt:: bash $ + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{drive} + sudo mount -o user_xattr /dev/{drive} /var/lib/ceph/osd/ceph-{osd-num} + +#. Initialize the OSD data directory: + + .. prompt:: bash $ + + ssh {new-osd-host} + ceph-osd -i {osd-num} --mkfs --mkkey + + The directory must be empty before you can run ``ceph-osd``. + +#. Register the OSD authentication key. The value of ``ceph`` for + ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your + cluster name differs from ``ceph``, use your cluster name instead: + + .. prompt:: bash $ + + ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring + +#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The + ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy + wherever you wish. If you specify at least one bucket, the command + will place the OSD into the most specific bucket you specify, *and* it will + move that bucket underneath any other buckets you specify. **Important:** If + you specify only the root bucket, the command will attach the OSD directly + to the root, but CRUSH rules expect OSDs to be inside of hosts. + + Execute the following: + + .. prompt:: bash $ + + ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] 
+ + You may also decompile the CRUSH map, add the OSD to the device list, add the + host as a bucket (if it's not already in the CRUSH map), add the device as an + item in the host, assign it a weight, recompile it and set it. See + `Add/Move an OSD`_ for details. + + +.. _rados-replacing-an-osd: + +Replacing an OSD +---------------- + +.. note:: If the instructions in this section do not work for you, try the + instructions in the cephadm documentation: :ref:`cephadm-replacing-an-osd`. + +When disks fail, or if an administrator wants to reprovision OSDs with a new +backend, for instance when switching from FileStore to BlueStore, OSDs need to +be replaced. Unlike in `Removing the OSD`_, the id and CRUSH map entry of a replaced OSD +need to be kept intact after the OSD is destroyed for replacement. + +#. Make sure it is safe to destroy the OSD: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done + +#. Destroy the OSD first: + + .. prompt:: bash $ + + ceph osd destroy {id} --yes-i-really-mean-it + +#. Zap the disk for the new OSD if the disk was previously used for other purposes. + This is not necessary for a new disk: + + .. prompt:: bash $ + + ceph-volume lvm zap /dev/sdX + +#. Prepare the disk for replacement by using the previously destroyed OSD id: + + .. prompt:: bash $ + + ceph-volume lvm prepare --osd-id {id} --data /dev/sdX + +#. Activate the OSD: + + .. prompt:: bash $ + + ceph-volume lvm activate {id} {fsid} + +Alternatively, instead of preparing and activating, the device can be recreated +in one call: + + .. prompt:: bash $ + + ceph-volume lvm create --osd-id {id} --data /dev/sdX + + +Starting the OSD +---------------- + +After you add an OSD to Ceph, the OSD is in your configuration. However, +it is not yet running. The OSD is ``down`` and ``in``. You must start +your new OSD before it can begin receiving data. You may use +``service ceph`` from your admin host or start the OSD from its host +machine: + + .. 
prompt:: bash $ + + sudo systemctl start ceph-osd@{osd-num} + + +Once you start your OSD, it is ``up`` and ``in``. + + +Observe the Data Migration +-------------------------- + +Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing +the cluster by migrating placement groups to your new OSD. You can observe this +process with the `ceph`_ tool: + + .. prompt:: bash $ + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + +.. _Add/Move an OSD: ../crush-map#addosd +.. _ceph: ../monitoring + + + +Removing OSDs (Manual) +====================== + +When you want to reduce the size of a cluster or replace hardware, you may +remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd`` +daemon for one storage drive within a host machine. If your host has multiple +storage drives, you may need to remove one ``ceph-osd`` daemon for each drive. +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. Ensure that your cluster is not at +its ``near full`` ratio when you remove an OSD. + +.. warning:: Do not let your cluster reach its ``full ratio`` when + removing an OSD. Removing OSDs could cause the cluster to reach + or exceed its ``full ratio``. + + +Take the OSD out of the Cluster +----------------------------------- + +Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it +out of the cluster so that Ceph can begin rebalancing and copying its data to +other OSDs: + + .. prompt:: bash $ + + ceph osd out {osd-num} + + +Observe the Data Migration +-------------------------- + +Once you have taken your OSD ``out`` of the cluster, Ceph will begin +rebalancing the cluster by migrating placement groups out of the OSD you +removed. You can observe this process with the `ceph`_ tool: + + .. 
prompt:: bash $ + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + +.. note:: Sometimes, typically in a "small" cluster with few hosts (for + instance, a small testing cluster), taking the OSD ``out`` can + trigger a CRUSH corner case in which some PGs remain stuck in the + ``active+remapped`` state. If this happens, mark + the OSD ``in`` with: + + .. prompt:: bash $ + + ceph osd in {osd-num} + + to return to the initial state and then, instead of marking the OSD + ``out``, set its weight to 0 with: + + .. prompt:: bash $ + + ceph osd crush reweight osd.{osd-num} 0 + + After that, you can observe the data migration, which should run to + completion. The difference between marking the OSD ``out`` and reweighting it + to 0 is that in the first case the weight of the bucket that contains + the OSD is not changed, whereas in the second case the weight of the bucket + is updated (decreased by the OSD weight). The reweight command may + sometimes be preferable in the case of a "small" cluster. + + + +Stopping the OSD +---------------- + +After you take an OSD out of the cluster, it may still be running. +That is, the OSD may be ``up`` and ``out``. You must stop +your OSD before you remove it from the configuration: + + .. prompt:: bash $ + + ssh {osd-host} + sudo systemctl stop ceph-osd@{osd-num} + +Once you stop your OSD, it is ``down``. + + +Removing the OSD +---------------- + +This procedure removes an OSD from a cluster map, removes its authentication +key, removes the OSD from the OSD map, and removes the OSD from the +``ceph.conf`` file. If your host has multiple drives, you may need to remove an +OSD for each drive by repeating this procedure. + +#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH + map and removes its authentication key. 
The OSD is also removed from the OSD map. +Note that the :ref:`purge subcommand ` was introduced in Luminous; for older + versions, see below: + + .. prompt:: bash $ + + ceph osd purge {id} --yes-i-really-mean-it + +#. Navigate to the host where you keep the master copy of the cluster's + ``ceph.conf`` file: + + .. prompt:: bash $ + + ssh {admin-host} + cd /etc/ceph + vim ceph.conf + +#. Remove the OSD entry from your ``ceph.conf`` file (if it exists):: + + [osd.1] + host = {hostname} + +#. From the host where you keep the master copy of the cluster's ``ceph.conf`` + file, copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of + other hosts in your cluster. + +If your Ceph cluster is older than Luminous, instead of using ``ceph osd +purge``, you need to perform the following steps manually: + + +#. Remove the OSD from the CRUSH map so that it no longer receives data. You may + also decompile the CRUSH map, remove the OSD from the device list, remove the + device as an item in the host bucket or remove the host bucket (if it's in the + CRUSH map and you intend to remove the host), recompile the map and set it. + See `Remove an OSD`_ for details: + + .. prompt:: bash $ + + ceph osd crush remove {name} + +#. Remove the OSD authentication key: + + .. prompt:: bash $ + + ceph auth del osd.{osd-num} + + The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the + ``$cluster-$id``. If your cluster name differs from ``ceph``, use your + cluster name instead. + +#. Remove the OSD: + + .. prompt:: bash $ + + ceph osd rm {osd-num} + + for example: + + .. prompt:: bash $ + + ceph osd rm 1 + +.. _Remove an OSD: ../crush-map#removeosd diff --git a/doc/rados/operations/balancer.rst b/doc/rados/operations/balancer.rst new file mode 100644 index 000000000..b02a8914d --- /dev/null +++ b/doc/rados/operations/balancer.rst @@ -0,0 +1,206 @@ +.. 
_balancer: + +Balancer +======== + +The *balancer* can optimize the placement of PGs across OSDs in +order to achieve a balanced distribution, either automatically or in a +supervised fashion. + +Status +------ + +The current status of the balancer can be checked at any time with: + + .. prompt:: bash $ + + ceph balancer status + + +Automatic balancing +------------------- + +The automatic balancing feature is enabled by default in ``upmap`` +mode. Please refer to :ref:`upmap` for more details. The balancer can be +turned off with: + + .. prompt:: bash $ + + ceph balancer off + +The balancer mode can be changed to ``crush-compat`` mode, which is +backward compatible with older clients, and will make small changes to +the data distribution over time to ensure that OSDs are equally utilized. + + +Throttling +---------- + +No adjustments will be made to the PG distribution if the cluster is +degraded (e.g., because an OSD has failed and the system has not yet +healed itself). + +When the cluster is healthy, the balancer will throttle its changes +such that the percentage of PGs that are misplaced (i.e., that need to +be moved) is below a threshold of (by default) 5%. The +``target_max_misplaced_ratio`` threshold can be adjusted with: + + .. prompt:: bash $ + + ceph config set mgr target_max_misplaced_ratio .07 # 7% + +Set the number of seconds to sleep in between runs of the automatic balancer: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/sleep_interval 60 + +Set the time of day to begin automatic balancing in HHMM format: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_time 0000 + +Set the time of day to finish automatic balancing in HHMM format: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_time 2359 + +Restrict automatic balancing to this day of the week or later. +Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on: + + .. 
prompt:: bash $ + + ceph config set mgr mgr/balancer/begin_weekday 0 + +Restrict automatic balancing to this day of the week or earlier. +Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/end_weekday 6 + +Pool IDs to which the automatic balancing will be limited. +The default for this is an empty string, meaning all pools will be balanced. +The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` command: + + .. prompt:: bash $ + + ceph config set mgr mgr/balancer/pool_ids 1,2,3 + + +Modes +----- + +There are currently two supported balancer modes: + +#. **crush-compat**. The CRUSH compat mode uses the compat weight-set + feature (introduced in Luminous) to manage an alternative set of + weights for devices in the CRUSH hierarchy. The normal weights + should remain set to the size of the device to reflect the target + amount of data that we want to store on the device. The balancer + then optimizes the weight-set values, adjusting them up or down in + small increments, in order to achieve a distribution that matches + the target distribution as closely as possible. (Because PG + placement is a pseudorandom process, there is a natural amount of + variation in the placement; by optimizing the weights we + counter-act that natural variation.) + + Notably, this mode is *fully backwards compatible* with older + clients: when an OSDMap and CRUSH map is shared with older clients, + we present the optimized weights as the "real" weights. + + The primary restriction of this mode is that the balancer cannot + handle multiple CRUSH hierarchies with different placement rules if + the subtrees of the hierarchy share any OSDs. (This is normally + not the case, and is generally not a recommended configuration + because it is hard to manage the space utilization on the shared + OSDs.) + +#. **upmap**. 
Starting with Luminous, the OSDMap can store explicit + mappings for individual OSDs as exceptions to the normal CRUSH + placement calculation. These ``upmap`` entries provide fine-grained + control over the PG mapping. This balancer mode will optimize the + placement of individual PGs in order to achieve a balanced + distribution. In most cases, this distribution is "perfect," with + an equal number of PGs on each OSD (+/-1 PG, since they might not + divide evenly). + + Note that using upmap requires that all clients be Luminous or newer. + +The default mode is ``upmap``. The mode can be adjusted with: + + .. prompt:: bash $ + + ceph balancer mode crush-compat + +Supervised optimization +----------------------- + +The balancer operation is broken into a few distinct phases: + +#. building a *plan* +#. evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result after executing a *plan* +#. executing the *plan* + +To evaluate and score the current distribution: + + .. prompt:: bash $ + + ceph balancer eval + +You can also evaluate the distribution for a single pool with: + + .. prompt:: bash $ + + ceph balancer eval {pool-name} + +Greater detail for the evaluation can be seen with: + + .. prompt:: bash $ + + ceph balancer eval-verbose ... + +The balancer can generate a plan, using the currently configured mode, with: + + .. prompt:: bash $ + + ceph balancer optimize {plan-name} + +The name is provided by the user and can be any useful identifying string. The contents of a plan can be seen with: + + .. prompt:: bash $ + + ceph balancer show {plan-name} + +All plans can be shown with: + + .. prompt:: bash $ + + ceph balancer ls + +Old plans can be discarded with: + + .. prompt:: bash $ + + ceph balancer rm {plan-name} + +Currently recorded plans are shown as part of the status command: + + .. prompt:: bash $ + + ceph balancer status + +The quality of the distribution that would result after executing a plan can be calculated with: + + .. 
prompt:: bash $ + + ceph balancer eval {plan-name} + +Assuming the plan is expected to improve the distribution (i.e., it has a lower score than the current cluster state), the user can execute that plan with: + + .. prompt:: bash $ + + ceph balancer execute {plan-name} + diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst new file mode 100644 index 000000000..1ac5f2b13 --- /dev/null +++ b/doc/rados/operations/bluestore-migration.rst @@ -0,0 +1,338 @@ +===================== + BlueStore Migration +===================== + +Each OSD can run either BlueStore or FileStore, and a single Ceph +cluster can contain a mix of both. Users who have previously deployed +FileStore are likely to want to transition to BlueStore in order to +take advantage of the improved performance and robustness. There are +several strategies for making such a transition. + +An individual OSD cannot be converted in place in isolation, however: +BlueStore and FileStore are simply too different for that to be +practical. "Conversion" will rely either on the cluster's normal +replication and healing support or tools and strategies that copy OSD +content from an old (FileStore) device to a new (BlueStore) one. + + +Deploy new OSDs with BlueStore +============================== + +Any new OSDs (e.g., when the cluster is expanded) can be deployed +using BlueStore. This is the default behavior so no specific change +is needed. + +Similarly, any OSDs that are reprovisioned after replacing a failed drive +can use BlueStore. + +Convert existing OSDs +===================== + +Mark out and replace +-------------------- + +The simplest approach is to mark out each device in turn, wait for the +data to replicate across the cluster, reprovision the OSD, and mark +it back in again. It is simple and easy to automate. However, it requires +more data migration than should be necessary, so it is not optimal. + +#. 
Identify a FileStore OSD to replace:: + + ID={osd-id-number} + DEVICE={disk-device} + + You can tell whether a given OSD is FileStore or BlueStore with: + + .. prompt:: bash $ + + ceph osd metadata $ID | grep osd_objectstore + + You can get a current count of FileStore vs. BlueStore OSDs with: + + .. prompt:: bash $ + + ceph osd count-metadata osd_objectstore + +#. Mark the FileStore OSD out: + + .. prompt:: bash $ + + ceph osd out $ID + +#. Wait for the data to migrate off the OSD in question: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done + +#. Stop the OSD: + + .. prompt:: bash $ + + systemctl kill ceph-osd@$ID + +#. Make note of which device this OSD is using: + + .. prompt:: bash $ + + mount | grep /var/lib/ceph/osd/ceph-$ID + +#. Unmount the OSD: + + .. prompt:: bash $ + + umount /var/lib/ceph/osd/ceph-$ID + +#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy + the contents of the device; be certain the data on the device is + not needed (i.e., that the cluster is healthy) before proceeding: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Tell the cluster the OSD has been destroyed (and a new OSD can be + reprovisioned with the same ID): + + .. prompt:: bash $ + + ceph osd destroy $ID --yes-i-really-mean-it + +#. Reprovision a BlueStore OSD in its place with the same OSD ID. + This requires you to identify which device to wipe based on what you saw + mounted above. BE CAREFUL!: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID + +#. Repeat. + +You can allow the refilling of the replacement OSD to happen +concurrently with the draining of the next OSD, or follow the same +procedure for multiple OSDs in parallel, as long as you ensure the +cluster is fully clean (all data has all replicas) before destroying +any OSDs. Failure to do so will reduce the redundancy of your data +and increase the risk of (or potentially even cause) data loss. + +Advantages: + +* Simple. 
+* Can be done on a device-by-device basis. +* No spare devices or hosts are required. + +Disadvantages: + +* Data is copied over the network twice: once to some other OSD in the + cluster (to maintain the desired number of replicas), and then again + back to the reprovisioned BlueStore OSD. + + +Whole host replacement +---------------------- + +If you have a spare host in the cluster, or have sufficient free space +to evacuate an entire host in order to use it as a spare, then the +conversion can be done on a host-by-host basis with each stored copy of +the data migrating only once. + +First, you need to have an empty host that has no data. There are two ways to do this: either by starting with a new, empty host that isn't yet part of the cluster, or by offloading data from an existing host that is in the cluster. + +Use a new, empty host +^^^^^^^^^^^^^^^^^^^^^ + +Ideally the host should have roughly the +same capacity as other hosts you will be converting (although it +doesn't strictly matter). :: + + NEWHOST={empty-host-name} + +Add the host to the CRUSH hierarchy, but do not attach it to the root: + +.. prompt:: bash $ + + ceph osd crush add-bucket $NEWHOST host + +Make sure the ceph packages are installed. + +Use an existing host +^^^^^^^^^^^^^^^^^^^^ + +If you would like to use an existing host +that is already part of the cluster, and there is sufficient free +space on that host so that all of its data can be migrated off, +then you can instead do:: + + OLDHOST={existing-cluster-host-to-offload} + +.. prompt:: bash $ + + ceph osd crush unlink $OLDHOST default + +where "default" is the immediate ancestor in the CRUSH map. (For +smaller clusters with unmodified configurations this will normally +be "default", but it might also be a rack name.) You should now +see the host at the top of the OSD tree output with no parent: + +.. 
prompt:: bash $ + + bin/ceph osd tree + +:: + + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host oldhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host foo + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +If everything looks good, jump directly to the "Wait for data +migration to complete" step below and proceed from there to clean up +the old OSDs. + +Migration process +^^^^^^^^^^^^^^^^^ + +If you're using a new host, start at step #1. For an existing host, +jump to step #5 below. + +#. Provision new BlueStore OSDs for all devices: + + .. prompt:: bash $ + + ceph-volume lvm create --bluestore --data /dev/$DEVICE + +#. Verify OSDs join the cluster with: + + .. prompt:: bash $ + + ceph osd tree + + You should see the new host ``$NEWHOST`` with all of the OSDs beneath + it, but the host should *not* be nested beneath any other node in + the hierarchy (like ``root default``). For example, if ``newhost`` is + the empty host, you might see something like:: + + $ bin/ceph osd tree + ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF + -5 0 host newhost + 10 ssd 1.00000 osd.10 up 1.00000 1.00000 + 11 ssd 1.00000 osd.11 up 1.00000 1.00000 + 12 ssd 1.00000 osd.12 up 1.00000 1.00000 + -1 3.00000 root default + -2 3.00000 host oldhost1 + 0 ssd 1.00000 osd.0 up 1.00000 1.00000 + 1 ssd 1.00000 osd.1 up 1.00000 1.00000 + 2 ssd 1.00000 osd.2 up 1.00000 1.00000 + ... + +#. Identify the first target host to convert: + + .. prompt:: bash $ + + OLDHOST={existing-cluster-host-to-convert} + +#. Swap the new host into the old host's position in the cluster: + + .. prompt:: bash $ + + ceph osd crush swap-bucket $NEWHOST $OLDHOST + + At this point all data on ``$OLDHOST`` will start migrating to OSDs + on ``$NEWHOST``. 
If there is a difference in the total capacity of + the old and new hosts you may also see some data migrate to or from + other nodes in the cluster, but as long as the hosts are similarly + sized this will be a relatively small amount of data. + +#. Wait for data migration to complete: + + .. prompt:: bash $ + + while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done + +#. Stop all old OSDs on the now-empty ``$OLDHOST``: + + .. prompt:: bash $ + + ssh $OLDHOST + systemctl kill ceph-osd.target + umount /var/lib/ceph/osd/ceph-* + +#. Destroy and purge the old OSDs: + + .. prompt:: bash $ + + for osd in `ceph osd ls-tree $OLDHOST`; do + ceph osd purge $osd --yes-i-really-mean-it + done + +#. Wipe the old OSD devices. This requires you to identify which + devices are to be wiped manually (BE CAREFUL!). For each device: + + .. prompt:: bash $ + + ceph-volume lvm zap $DEVICE + +#. Use the now-empty host as the new host, and repeat:: + + NEWHOST=$OLDHOST + +Advantages: + +* Data is copied over the network only once. +* Converts an entire host's OSDs at once. +* Can be parallelized to convert multiple hosts at a time. +* No spare devices are required on each host. + +Disadvantages: + +* A spare host is required. +* An entire host's worth of OSDs will be migrating data at a time. This + is likely to impact overall cluster performance. +* All migrated data still makes one full hop over the network. + + +Per-OSD device copy +------------------- + +A single logical OSD can be converted by using the ``copy`` function +of ``ceph-objectstore-tool``. This requires that the host have a free +device (or devices) to provision a new, empty BlueStore OSD. For +example, if each host in your cluster has 12 OSDs, then you'd need a +13th available device so that each OSD can be converted in turn before the +old device is reclaimed to convert the next OSD.
+ +Caveats: + +* This strategy requires that a blank BlueStore OSD be prepared + without allocating a new OSD ID, something that the ``ceph-volume`` + tool doesn't support. More importantly, the setup of *dmcrypt* is + closely tied to the OSD identity, which means that this approach + does not work with encrypted OSDs. + +* The device must be manually partitioned. + +* Tooling not implemented! + +* Not documented! + +Advantages: + +* Little or no data migrates over the network during the conversion. + +Disadvantages: + +* Tooling not fully implemented. +* Process not documented. +* Each host must have a spare or empty device. +* The OSD is offline during the conversion, which means new writes will + be written to only a subset of the OSDs. This increases the risk of data + loss due to a subsequent failure. (However, if there is a failure before + conversion is complete, the original FileStore OSD can be started to provide + access to its original data.) diff --git a/doc/rados/operations/cache-tiering.rst b/doc/rados/operations/cache-tiering.rst new file mode 100644 index 000000000..8056ace47 --- /dev/null +++ b/doc/rados/operations/cache-tiering.rst @@ -0,0 +1,552 @@ +=============== + Cache Tiering +=============== + +A cache tier provides Ceph Clients with better I/O performance for a subset of +the data stored in a backing storage tier. Cache tiering involves creating a +pool of relatively fast/expensive storage devices (e.g., solid state drives) +configured to act as a cache tier, and a backing pool of either erasure-coded +or relatively slower/cheaper devices configured to act as an economical storage +tier. The Ceph objecter handles where to place the objects and the tiering +agent determines when to flush objects from the cache to the backing storage +tier. So the cache tier and the backing storage tier are completely transparent +to Ceph clients. + + +.. 
ditaa:: + +-------------+ + | Ceph Client | + +------+------+ + ^ + Tiering is | + Transparent | Faster I/O + to Ceph | +---------------+ + Client Ops | | | + | +----->+ Cache Tier | + | | | | + | | +-----+---+-----+ + | | | ^ + v v | | Active Data in Cache Tier + +------+----+--+ | | + | Objecter | | | + +-----------+--+ | | + ^ | | Inactive Data in Storage Tier + | v | + | +-----+---+-----+ + | | | + +----->| Storage Tier | + | | + +---------------+ + Slower I/O + + +The cache tiering agent handles the migration of data between the cache tier +and the backing storage tier automatically. However, admins have the ability to +configure how this migration takes place by setting the ``cache-mode``. There are +two main scenarios: + +- **writeback** mode: If the base tier and the cache tier are configured in + ``writeback`` mode, Ceph clients receive an ACK from the base tier every time + they write data to it. Then the cache tiering agent determines whether + ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it + has been set and the data has been written more than a specified number of + times per interval, the data is promoted to the cache tier. + + When Ceph clients need access to data stored in the base tier, the cache + tiering agent reads the data from the base tier and returns it to the client. + While data is being read from the base tier, the cache tiering agent consults + the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and + decides whether to promote that data from the base tier to the cache tier. + When data has been promoted from the base tier to the cache tier, the Ceph + client is able to perform I/O operations on it using the cache tier. This is + well-suited for mutable data (for example, photo/video editing, transactional + data). 
+ +- **readproxy** mode: This mode will use any objects that already + exist in the cache tier, but if an object is not present in the + cache the request will be proxied to the base tier. This is useful + for transitioning from ``writeback`` mode to a disabled cache as it + allows the workload to function properly while the cache is drained, + without adding any new objects to the cache. + +Other cache modes are: + +- **readonly** promotes objects to the cache on read operations only; write + operations are forwarded to the base tier. This mode is intended for + read-only workloads that do not require consistency to be enforced by the + storage system. (**Warning**: when objects are updated in the base tier, + Ceph makes **no** attempt to sync these updates to the corresponding objects + in the cache. Since this mode is considered experimental, a + ``--yes-i-really-mean-it`` option must be passed in order to enable it.) + +- **none** is used to completely disable caching. + + +A word of caution +================= + +Cache tiering will *degrade* performance for most workloads. Users should use +extreme caution before using this feature. + +* *Workload dependent*: Whether a cache will improve performance is + highly dependent on the workload. Because there is a cost + associated with moving objects into or out of the cache, it can only + be effective when there is a *large skew* in the access pattern in + the data set, such that most of the requests touch a small number of + objects. The cache pool should be large enough to capture the + working set for your workload to avoid thrashing. + +* *Difficult to benchmark*: Most benchmarks that users run to measure + performance will show terrible performance with cache tiering, in + part because very few of them skew requests toward a small set of + objects, it can take a long time for the cache to "warm up," and + because the warm-up cost can be high. 
+ +* *Usually slower*: For workloads that are not cache tiering-friendly, + performance is often slower than a normal RADOS pool without cache + tiering enabled. + +* *librados object enumeration*: The librados-level object enumeration + API is not meant to be coherent in the presence of the cache. If + your application is using librados directly and relies on object + enumeration, cache tiering will probably not work as expected. + (This is not a problem for RGW, RBD, or CephFS.) + +* *Complexity*: Enabling cache tiering means that a lot of additional + machinery and complexity within the RADOS cluster is being used. + This increases the probability that you will encounter a bug in the system + that other users have not yet encountered and will put your deployment at a + higher level of risk. + +Known Good Workloads +-------------------- + +* *RGW time-skewed*: If the RGW workload is such that almost all read + operations are directed at recently written objects, a simple cache + tiering configuration that destages recently written objects from + the cache to the base tier after a configurable period can work + well. + +Known Bad Workloads +------------------- + +The following configurations are *known to work poorly* with cache +tiering. + +* *RBD with replicated cache and erasure-coded base*: This is a common + request, but usually does not perform well. Even reasonably skewed + workloads still send some small writes to cold objects, and because + small writes are not yet supported by the erasure-coded pool, entire + (usually 4 MB) objects must be migrated into the cache in order to + satisfy a small (often 4 KB) write. Only a handful of users have + successfully deployed this configuration, and it only works for them + because their data is extremely cold (backups) and they are not in + any way sensitive to performance.
+ +* *RBD with replicated cache and base*: RBD with a replicated base + tier does better than when the base is erasure coded, but it is + still highly dependent on the amount of skew in the workload, and + very difficult to validate. The user will need to have a good + understanding of their workload and will need to tune the cache + tiering parameters carefully. + + +Setting Up Pools +================ + +To set up cache tiering, you must have two pools. One will act as the +backing storage and the other will act as the cache. + + +Setting Up a Backing Storage Pool +--------------------------------- + +Setting up a backing storage pool typically involves one of two scenarios: + +- **Standard Storage**: In this scenario, the pool stores multiple copies + of an object in the Ceph Storage Cluster. + +- **Erasure Coding:** In this scenario, the pool uses erasure coding to + store data much more efficiently with a small performance tradeoff. + +In the standard storage scenario, you can set up a CRUSH rule to establish +the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD +Daemons perform optimally when all storage drives in the rule are of the +same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_ +for details on creating a rule. Once you have created a rule, create +a backing storage pool. + +In the erasure coding scenario, the pool creation arguments will generate the +appropriate rule automatically. See `Create a Pool`_ for details. + +In subsequent examples, we will refer to the backing storage pool +as ``cold-storage``. + + +Setting Up a Cache Pool +----------------------- + +Setting up a cache pool follows the same procedure as the standard storage +scenario, but with this difference: the drives for the cache tier are typically +high performance drives that reside in their own servers and have their own +CRUSH rule.
When setting up such a rule, it should take account of the hosts +that have the high performance drives while omitting the hosts that don't. See +:ref:`CRUSH Device Class` for details. + + +In subsequent examples, we will refer to the cache pool as ``hot-storage`` and +the backing pool as ``cold-storage``. + +For cache tier configuration and default values, see +`Pools - Set Pool Values`_. + + +Creating a Cache Tier +===================== + +Setting up a cache tier involves associating a backing storage pool with +a cache pool: + +.. prompt:: bash $ + + ceph osd tier add {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier add cold-storage hot-storage + +To set the cache mode, execute the following: + +.. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} {cache-mode} + +For example: + +.. prompt:: bash $ + + ceph osd tier cache-mode hot-storage writeback + +The cache tiers overlay the backing storage tier, so they require one +additional step: you must direct all client traffic from the storage pool to +the cache pool. To direct client traffic directly to the cache pool, execute +the following: + +.. prompt:: bash $ + + ceph osd tier set-overlay {storagepool} {cachepool} + +For example: + +.. prompt:: bash $ + + ceph osd tier set-overlay cold-storage hot-storage + + +Configuring a Cache Tier +======================== + +Cache tiers have several configuration options. You may set +cache tier configuration options with the following usage: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} {key} {value} + +See `Pools - Set Pool Values`_ for details. + + +Target Size and Type +-------------------- + +Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_type bloom + +For example: + +.. 
prompt:: bash $ + + ceph osd pool set hot-storage hit_set_type bloom + +The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to +store, and how much time each HitSet should cover: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} hit_set_count 12 + ceph osd pool set {cachepool} hit_set_period 14400 + ceph osd pool set {cachepool} target_max_bytes 1000000000000 + +.. note:: A larger ``hit_set_count`` results in more RAM consumed by + the ``ceph-osd`` process. + +Binning accesses over time allows Ceph to determine whether a Ceph client +accessed an object at least once, or more than once over a time period +("age" vs "temperature"). + +The ``min_read_recency_for_promote`` defines how many HitSets to check for the +existence of an object when handling a read operation. The checking result is +used to decide whether to promote the object asynchronously. Its value should be +between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted. +If it's set to 1, only the current HitSet is checked: the object is promoted if +it is in the current HitSet, and not promoted otherwise. For other values, that +number of the most recent HitSets (including archived ones) is checked. The object is promoted if it is +found in any of the most recent ``min_read_recency_for_promote`` HitSets. + +A similar parameter can be set for the write operation, which is +``min_write_recency_for_promote``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} min_read_recency_for_promote 2 + ceph osd pool set {cachepool} min_write_recency_for_promote 2 + +.. note:: The longer the period and the higher the + ``min_read_recency_for_promote`` and + ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd`` + daemon consumes. In particular, when the agent is active to flush + or evict cache objects, all ``hit_set_count`` HitSets are loaded + into RAM.
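The read-promotion rule described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not Ceph's implementation: the function names ``should_promote`` and ``tracked_window_seconds`` are hypothetical, and plain Python sets stand in for the Bloom-filter HitSets (a real Bloom filter is probabilistic and can report false positives).

```python
from typing import List, Set


def should_promote(obj: str, hit_sets: List[Set[str]], min_recency: int) -> bool:
    """Model of the recency check for promotion (hypothetical helper).

    ``hit_sets`` is ordered newest first; index 0 is the current HitSet.
    ``min_recency`` plays the role of ``min_read_recency_for_promote``.
    """
    if min_recency == 0:
        # A recency of 0 means the object is always promoted.
        return True
    # Promote if the object appears in any of the N most recent HitSets.
    return any(obj in hs for hs in hit_sets[:min_recency])


def tracked_window_seconds(hit_set_count: int, hit_set_period: int) -> int:
    """Total time span covered by all retained HitSets (count x period)."""
    return hit_set_count * hit_set_period


if __name__ == "__main__":
    # Newest HitSet first: obj-b was last seen one period ago.
    history = [{"obj-a"}, {"obj-b"}, set()]
    print(should_promote("obj-b", history, 1))  # only the current HitSet is checked
    print(should_promote("obj-b", history, 2))  # the two most recent are checked
    print(tracked_window_seconds(12, 14400))    # values from the example above
```

With the example values shown earlier (``hit_set_count`` 12 and ``hit_set_period`` 14400), the retained HitSets together cover 172800 seconds, or 48 hours, of access history.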
+ + +Cache Sizing +------------ + +The cache tiering agent performs two main functions: + +- **Flushing:** The agent identifies modified (or dirty) objects and forwards + them to the storage pool for long-term storage. + +- **Evicting:** The agent identifies objects that haven't been modified + (clean objects) and evicts the least recently used among them from the cache. + + +Absolute Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects based upon the total number +of bytes or the total number of objects. To specify a maximum number of bytes, +execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} target_max_bytes {#bytes} + +For example, to flush or evict at 1 TB, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage target_max_bytes 1099511627776 + +To specify the maximum number of objects, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} target_max_objects {#objects} + +For example, to flush or evict at 1M objects, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage target_max_objects 1000000 + +.. note:: Ceph is not able to determine the size of a cache pool automatically, so + an absolute size must be configured here; otherwise the + flush/evict will not work. If you specify both limits, the cache tiering + agent will begin flushing or evicting when either threshold is triggered. + +.. note:: All client requests will be blocked only when ``target_max_bytes`` or + ``target_max_objects`` is reached. + +Relative Sizing +~~~~~~~~~~~~~~~ + +The cache tiering agent can flush or evict objects relative to the size of the +cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in +`Absolute sizing`_). When the cache pool consists of a certain percentage of +modified (or dirty) objects, the cache tiering agent will flush them to the +storage pool. To set the ``cache_target_dirty_ratio``, execute the following: + +.. 
prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0} + +For example, setting the value to ``0.4`` will begin flushing modified +(dirty) objects when they reach 40% of the cache pool's capacity: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_dirty_ratio 0.4 + +When dirty objects reach a higher percentage of the cache pool's capacity, the +agent flushes them at a higher speed. To set the ``cache_target_dirty_high_ratio``: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0} + +For example, setting the value to ``0.6`` will begin aggressively flushing dirty +objects when they reach 60% of the cache pool's capacity. The value should be +set between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6 + +When the cache pool reaches a certain percentage of its capacity, the cache +tiering agent will evict objects to maintain free capacity. To set the +``cache_target_full_ratio``, execute the following: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0} + +For example, setting the value to ``0.8`` will begin evicting unmodified +(clean) objects when the cache pool reaches 80% of its capacity: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_target_full_ratio 0.8 + + +Cache Age +--------- + +You can specify the minimum age of an object before the cache tiering agent +flushes a recently modified (or dirty) object to the backing storage pool: + +.. prompt:: bash $ + + ceph osd pool set {cachepool} cache_min_flush_age {#seconds} + +For example, to flush modified (or dirty) objects after 10 minutes, execute the +following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_min_flush_age 600 + +You can specify the minimum age of an object before it will be evicted from the +cache tier: + +.. 
prompt:: bash $ + + ceph osd pool set {cache-tier} cache_min_evict_age {#seconds} + +For example, to evict objects after 30 minutes, execute the following: + +.. prompt:: bash $ + + ceph osd pool set hot-storage cache_min_evict_age 1800 + + +Removing a Cache Tier +===================== + +Removing a cache tier differs depending on whether it is a writeback +cache or a read-only cache. + + +Removing a Read-Only Cache +-------------------------- + +Since a read-only cache does not have modified data, you can disable +and remove it without losing any recent changes to objects in the cache. + +#. Change the cache-mode to ``none`` to disable it: + + .. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} none + + For example: + + .. prompt:: bash $ + + ceph osd tier cache-mode hot-storage none + +#. Remove the cache pool from the backing pool: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +Removing a Writeback Cache +-------------------------- + +Since a writeback cache may have modified data, you must take steps to ensure +that you do not lose any recent changes to objects in the cache before you +disable and remove it. + + +#. Change the cache mode to ``proxy`` so that new and modified objects will + flush to the backing storage pool: + + .. prompt:: bash $ + + ceph osd tier cache-mode {cachepool} proxy + + For example: + + .. prompt:: bash $ + + ceph osd tier cache-mode hot-storage proxy + + +#. Ensure that the cache pool has been flushed. This may take a few minutes: + + .. prompt:: bash $ + + rados -p {cachepool} ls + + If the cache pool still has objects, you can flush them manually. + For example: + + .. prompt:: bash $ + + rados -p {cachepool} cache-flush-evict-all + + +#. Remove the overlay so that clients will not direct traffic to the cache: + + .. prompt:: bash $ + + ceph osd tier remove-overlay {storagetier} + + For example: + + .. 
prompt:: bash $ + + ceph osd tier remove-overlay cold-storage + + +#. Finally, remove the cache tier pool from the backing storage pool: + + .. prompt:: bash $ + + ceph osd tier remove {storagepool} {cachepool} + + For example: + + .. prompt:: bash $ + + ceph osd tier remove cold-storage hot-storage + + +.. _Create a Pool: ../pools#create-a-pool +.. _Pools - Set Pool Values: ../pools#set-pool-values +.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter +.. _CRUSH Maps: ../crush-map +.. _Absolute Sizing: #absolute-sizing diff --git a/doc/rados/operations/change-mon-elections.rst b/doc/rados/operations/change-mon-elections.rst new file mode 100644 index 000000000..eba730bdc --- /dev/null +++ b/doc/rados/operations/change-mon-elections.rst @@ -0,0 +1,88 @@ +.. _changing_monitor_elections: + +===================================== +Configure Monitor Election Strategies +===================================== + +By default, the monitors will use the ``classic`` mode. We +recommend that you stay in this mode unless you have a very specific reason. + +If you want to switch modes BEFORE constructing the cluster, change +the ``mon election default strategy`` option. This option is an integer value: + +* 1 for "classic" +* 2 for "disallow" +* 3 for "connectivity" + +Once your cluster is running, you can change strategies by running :: + + $ ceph mon set election_strategy {classic|disallow|connectivity} + +Choosing a mode +=============== +The modes other than classic provide different features. We recommend +you stay in classic mode if you don't need the extra features as it is +the simplest mode. + +The disallow Mode +================= +This mode lets you mark monitors as disallowed, in which case they will +participate in the quorum and serve clients, but cannot be elected leader. You +may wish to use this if you have some monitors which are known to be far away +from clients. +You can disallow a leader by running: + +.. 
prompt:: bash $ + + ceph mon add disallowed_leader {name} + +You can remove a monitor from the disallowed list, and allow it to become +a leader again, by running: + +.. prompt:: bash $ + + ceph mon rm disallowed_leader {name} + +The list of disallowed_leaders is included when you run: + +.. prompt:: bash $ + + ceph mon dump + +The connectivity Mode +===================== +This mode evaluates connection scores provided by each monitor for its +peers and elects the monitor with the highest score. This mode is designed +to handle network partitioning or *net-splits*, which may happen if your cluster +is stretched across multiple data centers or otherwise has a non-uniform +or unbalanced network topology. + +This mode also supports disallowing monitors from being the leader +using the same commands as above in disallow. + +Examining connectivity scores +============================= +The monitors maintain connection scores even if they aren't in +the connectivity election mode. You can examine the scores a monitor +has by running: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores dump + +Scores for individual connections range from 0-1 inclusive, and also +include whether the connection is considered alive or dead (determined by +whether it returned its latest ping within the timeout). + +While this would be an unexpected occurrence, if for some reason you experience +problems and troubleshooting makes you think your scores have become invalid, +you can forget history and reset them by running: + +.. prompt:: bash $ + + ceph daemon mon.{name} connection scores reset + +While resetting scores has low risk (monitors will still quickly determine +if a connection is alive or dead, and trend back to the previous scores if they +were accurate!), it should also not be needed and is not recommended unless +requested by your support team or a developer. 
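The strategy numbering and the ``ceph mon set election_strategy`` command described earlier in this file can be sketched in a few lines of Python. This is only an illustrative helper, not part of Ceph: the names ``ELECTION_STRATEGIES``, ``strategy_number``, and ``set_strategy_argv`` are hypothetical, and the sketch only builds an argument list rather than running anything.

```python
# Integer values for the ``mon election default strategy`` option,
# copied from the list earlier in this file.
ELECTION_STRATEGIES = {"classic": 1, "disallow": 2, "connectivity": 3}


def strategy_number(name):
    """Return the integer used by ``mon election default strategy``."""
    try:
        return ELECTION_STRATEGIES[name]
    except KeyError:
        raise ValueError(f"unknown election strategy: {name!r}") from None


def set_strategy_argv(name):
    """Build the argv for changing strategy on a running cluster.

    Hypothetical helper: it validates the name and returns the command
    as a list, leaving execution (for example via subprocess) to the caller.
    """
    strategy_number(name)  # raises ValueError on an unknown name
    return ["ceph", "mon", "set", "election_strategy", name]


if __name__ == "__main__":
    print(strategy_number("connectivity"))
    print(" ".join(set_strategy_argv("disallow")))
```

Validating the name before building the command keeps a typo from reaching the cluster; the integer form is only needed when setting the option before the cluster is constructed.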
diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst new file mode 100644 index 000000000..d7a512618 --- /dev/null +++ b/doc/rados/operations/control.rst @@ -0,0 +1,601 @@ +.. index:: control, commands + +================== + Control Commands +================== + + +Monitor Commands +================ + +Monitor commands are issued using the ``ceph`` utility: + +.. prompt:: bash $ + + ceph [-m monhost] {command} + +The command is usually (though not always) of the form: + +.. prompt:: bash $ + + ceph {subsystem} {command} + + +System Commands +=============== + +Execute the following to display the current cluster status. : + +.. prompt:: bash $ + + ceph -s + ceph status + +Execute the following to display a running summary of cluster status +and major events. : + +.. prompt:: bash $ + + ceph -w + +Execute the following to show the monitor quorum, including which monitors are +participating and which one is the leader. : + +.. prompt:: bash $ + + ceph mon stat + ceph quorum_status + +Execute the following to query the status of a single monitor, including whether +or not it is in the quorum. : + +.. prompt:: bash $ + + ceph tell mon.[id] mon_status + +where the value of ``[id]`` can be determined, e.g., from ``ceph -s``. + + +Authentication Subsystem +======================== + +To add a keyring for an OSD, execute the following: + +.. prompt:: bash $ + + ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring} + +To list the cluster's keys and their capabilities, execute the following: + +.. prompt:: bash $ + + ceph auth ls + + +Placement Group Subsystem +========================= + +To display the statistics for all placement groups (PGs), execute the following: + +.. prompt:: bash $ + + ceph pg dump [--format {format}] + +The valid formats are ``plain`` (default), ``json`` ``json-pretty``, ``xml``, and ``xml-pretty``. +When implementing monitoring and other tools, it is best to use ``json`` format. 
+JSON parsing is more deterministic than the human-oriented ``plain``, and the layout is much +less variable from release to release. The ``jq`` utility can be invaluable when extracting +data from JSON output. + +To display the statistics for all placement groups stuck in a specified state, +execute the following: + +.. prompt:: bash $ + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}] + + +``--format`` may be ``plain`` (default), ``json``, ``json-pretty``, ``xml``, or ``xml-pretty``. + +``--threshold`` defines how many seconds "stuck" is (default: 300) + +**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD +with the most up-to-date data to come back. + +**Unclean** Placement groups contain objects that are not replicated the desired number +of times. They should be recovering. + +**Stale** Placement groups are in an unknown state - the OSDs that host them have not +reported to the monitor cluster in a while (configured by +``mon_osd_report_timeout``). + +Delete "lost" objects or revert them to their prior state, either a previous version +or delete them if they were just created. : + +.. prompt:: bash $ + + ceph pg {pgid} mark_unfound_lost revert|delete + + +.. _osd-subsystem: + +OSD Subsystem +============= + +Query OSD subsystem status. : + +.. prompt:: bash $ + + ceph osd stat + +Write a copy of the most recent OSD map to a file. See +:ref:`osdmaptool `. : + +.. prompt:: bash $ + + ceph osd getmap -o file + +Write a copy of the crush map from the most recent OSD map to +file. : + +.. prompt:: bash $ + + ceph osd getcrushmap -o file + +The foregoing is functionally equivalent to : + +.. prompt:: bash $ + + ceph osd getmap -o /tmp/osdmap + osdmaptool /tmp/osdmap --export-crush file + +Dump the OSD map. Valid formats for ``-f`` are ``plain``, ``json``, ``json-pretty``, +``xml``, and ``xml-pretty``. 
If no ``--format`` option is given, the OSD map is +dumped as plain text. As above, JSON format is best for tools, scripting, and other automation. : + +.. prompt:: bash $ + + ceph osd dump [--format {format}] + +Dump the OSD map as a tree with one line per OSD containing weight +and state. : + +.. prompt:: bash $ + + ceph osd tree [--format {format}] + +Find out where a specific object is or would be stored in the system: + +.. prompt:: bash $ + + ceph osd map + +Add or move a new item (OSD) with the given id/name/weight at the specified +location. : + +.. prompt:: bash $ + + ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] + +Remove an existing item (OSD) from the CRUSH map. : + +.. prompt:: bash $ + + ceph osd crush remove {name} + +Remove an existing bucket from the CRUSH map. : + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +Move an existing bucket from one position in the hierarchy to another. : + +.. prompt:: bash $ + + ceph osd crush move {id} {loc1} [{loc2} ...] + +Set the weight of the item given by ``{name}`` to ``{weight}``. : + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +Mark an OSD as ``lost``. This may result in permanent data loss. Use with caution. : + +.. prompt:: bash $ + + ceph osd lost {id} [--yes-i-really-mean-it] + +Create a new OSD. If no UUID is given, it will be set automatically when the OSD +starts up. : + +.. prompt:: bash $ + + ceph osd create [{uuid}] + +Remove the given OSD(s). : + +.. prompt:: bash $ + + ceph osd rm [{id}...] + +Query the current ``max_osd`` parameter in the OSD map. : + +.. prompt:: bash $ + + ceph osd getmaxosd + +Import the given crush map. : + +.. prompt:: bash $ + + ceph osd setcrushmap -i file + +Set the ``max_osd`` parameter in the OSD map. This defaults to 10000 now so +most admins will never need to adjust this. : + +.. prompt:: bash $ + + ceph osd setmaxosd + +Mark OSD ``{osd-num}`` down. : + +.. 
prompt:: bash $ + + ceph osd down {osd-num} + +Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). : + +.. prompt:: bash $ + + ceph osd out {osd-num} + +Mark OSD ``{osd-num}`` in the distribution (i.e. allocated data). : + +.. prompt:: bash $ + + ceph osd in {osd-num} + +Set or clear the pause flags in the OSD map. If set, no IO requests +will be sent to any OSD. Clearing the flags via unpause results in +resending pending requests. : + +.. prompt:: bash $ + + ceph osd pause + ceph osd unpause + +Set the override weight (reweight) of ``{osd-num}`` to ``{weight}``. Two OSDs with the +same weight will receive roughly the same number of I/O requests and +store approximately the same amount of data. ``ceph osd reweight`` +sets an override weight on the OSD. This value is in the range 0 to 1, +and forces CRUSH to re-place (1-weight) of the data that would +otherwise live on this drive. It does not change weights assigned +to the buckets above the OSD in the crush map, and is a corrective +measure in case the normal CRUSH distribution is not working out quite +right. For instance, if one of your OSDs is at 90% and the others are +at 50%, you could reduce this weight to compensate. : + +.. prompt:: bash $ + + ceph osd reweight {osd-num} {weight} + +Balance OSD fullness by reducing the override weight of OSDs which are +overly utilized. Note that these override aka ``reweight`` values +default to 1.00000 and are relative only to each other; they are not absolute. +It is crucial to distinguish them from CRUSH weights, which reflect the +absolute capacity of a bucket in TiB. By default this command adjusts +override weight on OSDs which have + or - 20% of the average utilization, +but if you include a ``threshold`` that percentage will be used instead. : + +.. 
prompt:: bash $ + + ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing] + +To limit the step by which any OSD's reweight will be changed, specify +``max_change`` which defaults to 0.05. To limit the number of OSDs that will +be adjusted, specify ``max_osds`` as well; the default is 4. Increasing these +parameters can speed leveling of OSD utilization, at the potential cost of +greater impact on client operations due to more data moving at once. + +To determine which and how many PGs and OSDs will be affected by a given invocation +you can test before executing. : + +.. prompt:: bash $ + + ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing] + +Adding ``--no-increasing`` to either command prevents increasing any +override weights that are currently < 1.00000. This can be useful when +you are balancing in a hurry to remedy ``full`` or ``nearfull`` OSDs or +when some OSDs are being evacuated or slowly brought into service. + +Deployments utilizing Nautilus (or later revisions of Luminous and Mimic) +that have no pre-Luminous clients may wish to instead enable the +``balancer`` module for ``ceph-mgr``. + +Add/remove an IP address or CIDR range to/from the blocklist. +When adding to the blocklist, +you can specify how long it should be blocklisted in seconds; otherwise, +it will default to 1 hour. A blocklisted address is prevented from +connecting to any OSD. If you blocklist an IP or range containing an OSD, be aware +that OSD will also be prevented from performing operations on its peers where it +acts as a client. (This includes tiering and copy-from functionality.) + +If you want to blocklist a range (in CIDR format), you may do so by +including the ``range`` keyword. + +These commands are mostly only useful for failure testing, as +blocklists are normally maintained automatically and shouldn't need +manual intervention. : + +.. 
prompt:: bash $ + + ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME] + ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits] + +Creates/deletes a snapshot of a pool. : + +.. prompt:: bash $ + + ceph osd pool mksnap {pool-name} {snap-name} + ceph osd pool rmsnap {pool-name} {snap-name} + +Creates/deletes/renames a storage pool. : + +.. prompt:: bash $ + + ceph osd pool create {pool-name} [pg_num [pgp_num]] + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + ceph osd pool rename {old-name} {new-name} + +Changes a pool setting. : + +.. prompt:: bash $ + + ceph osd pool set {pool-name} {field} {value} + +Valid fields are: + + * ``size``: Sets the number of copies of data in the pool. + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number when calculating pg placement. + * ``crush_rule``: rule number for mapping placement. + +Get the value of a pool setting. : + +.. prompt:: bash $ + + ceph osd pool get {pool-name} {field} + +Valid fields are: + + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number of placement groups when calculating placement. + + +Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. : + +.. prompt:: bash $ + + ceph osd scrub {osd-num} + +Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. : + +.. prompt:: bash $ + + ceph osd repair N + +Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES`` +in write requests of ``BYTES_PER_WRITE`` each. By default, the test +writes 1 GB in total in 4-MB increments. +The benchmark is non-destructive and will not overwrite existing live +OSD data, but might temporarily affect the performance of clients +concurrently accessing the OSD. : + +.. prompt:: bash $ + + ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE] + +To clear an OSD's caches between benchmark runs, use the 'cache drop' command : + +.. 
prompt:: bash $ + + ceph tell osd.N cache drop + +To get the cache statistics of an OSD, use the 'cache status' command : + +.. prompt:: bash $ + + ceph tell osd.N cache status + +MDS Subsystem +============= + +Change configuration parameters on a running mds. : + +.. prompt:: bash $ + + ceph tell mds.{mds-id} config set {setting} {value} + +Example: + +.. prompt:: bash $ + + ceph tell mds.0 config set debug_ms 1 + +Enables debug messages. : + +.. prompt:: bash $ + + ceph mds stat + +Displays the status of all metadata servers. : + +.. prompt:: bash $ + + ceph mds fail 0 + +Marks the active MDS as failed, triggering failover to a standby if present. + +.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap + + +Mon Subsystem +============= + +Show monitor stats: + +.. prompt:: bash $ + + ceph mon stat + +:: + + e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c + + +The ``quorum`` list at the end lists monitor nodes that are part of the current quorum. + +This is also available more directly: + +.. prompt:: bash $ + + ceph quorum_status -f json-pretty + +.. code-block:: javascript + + { + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "quorum_names": [ + "a", + "b", + "c" + ], + "quorum_leader_name": "a", + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + + +The above will block until a quorum is reached. 
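Because the JSON form is the one recommended for tooling, the quorum output above lends itself to a short parsing sketch. The following Python example is an illustration under stated assumptions: ``SAMPLE`` is a trimmed copy of the output shown above, and ``summarize_quorum`` is a hypothetical helper name. In practice the text would come from running ``ceph quorum_status -f json``.

```python
import json

# Trimmed copy of the sample ``ceph quorum_status`` output shown above.
# Only the fields used below are kept; a real cluster returns more.
SAMPLE = """
{
  "election_epoch": 6,
  "quorum": [0, 1, 2],
  "quorum_names": ["a", "b", "c"],
  "quorum_leader_name": "a"
}
"""


def summarize_quorum(raw):
    """Return (leader_name, member_names) from quorum_status JSON text."""
    status = json.loads(raw)
    return status["quorum_leader_name"], list(status["quorum_names"])


if __name__ == "__main__":
    leader, members = summarize_quorum(SAMPLE)
    print(f"leader={leader} members={','.join(members)}")
```

Reading named fields such as ``quorum_leader_name`` avoids depending on the layout of the one-line ``ceph mon stat`` summary.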
+ +For a status of just a single monitor: + +.. prompt:: bash $ + + ceph tell mon.[name] mon_status + +where the value of ``[name]`` can be taken from ``ceph quorum_status``. Sample +output:: + + { + "name": "b", + "rank": 1, + "state": "peon", + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "features": { + "required_con": "9025616074522624", + "required_mon": [ + "kraken" + ], + "quorum_con": "1152921504336314367", + "quorum_mon": [ + "kraken" + ] + }, + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + +A dump of the monitor state: + + .. prompt:: bash $ + + ceph mon dump + + :: + + dumped monmap epoch 2 + epoch 2 + fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc + last_changed 2016-12-26 14:42:09.288066 + created 2016-12-26 14:42:03.573585 + 0: 127.0.0.1:40000/0 mon.a + 1: 127.0.0.1:40001/0 mon.b + 2: 127.0.0.1:40002/0 mon.c + diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 000000000..18553e47d --- /dev/null +++ b/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,747 @@ +Manually editing a CRUSH Map +============================ + +.. note:: Manually editing the CRUSH map is an advanced + administrator operation. All CRUSH changes that are + necessary for the overwhelming majority of installations are + possible via the standard ceph CLI and do not require manual + CRUSH map edits. 
If you have identified a use case where + manual edits *are* necessary with recent Ceph releases, consider + contacting the Ceph developers so that future versions of Ceph + can obviate your corner case. + +To edit an existing CRUSH map: + +#. `Get the CRUSH map`_. +#. `Decompile`_ the CRUSH map. +#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. +#. `Recompile`_ the CRUSH map. +#. `Set the CRUSH map`_. + +For details on setting the CRUSH map rule for a specific pool, see `Set +Pool Values`_. + +.. _Get the CRUSH map: #getcrushmap +.. _Decompile: #decompilecrushmap +.. _Devices: #crushmapdevices +.. _Buckets: #crushmapbuckets +.. _Rules: #crushmaprules +.. _Recompile: #compilecrushmap +.. _Set the CRUSH map: #setcrushmap +.. _Set Pool Values: ../pools#setpoolvalues + +.. _getcrushmap: + +Get a CRUSH Map +--------------- + +To get the CRUSH map for your cluster, execute the following: + +.. prompt:: bash $ + + ceph osd getcrushmap -o {compiled-crushmap-filename} + +Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since +the CRUSH map is in a compiled form, you must decompile it first before you can +edit it. + +.. _decompilecrushmap: + +Decompile a CRUSH Map +--------------------- + +To decompile a CRUSH map, execute the following: + +.. prompt:: bash $ + + crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} + +.. _compilecrushmap: + +Recompile a CRUSH Map +--------------------- + +To compile a CRUSH map, execute the following: + +.. prompt:: bash $ + + crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename} + +.. _setcrushmap: + +Set the CRUSH Map +----------------- + +To set the CRUSH map for your cluster, execute the following: + +.. prompt:: bash $ + + ceph osd setcrushmap -i {compiled-crushmap-filename} + +Ceph will load (-i) a compiled CRUSH map from the filename you specified. + +Sections +-------- + +There are six main sections to a CRUSH Map. + +#. 
**tunables:** The preamble at the top of the map describes any *tunables* + that differ from the historical / legacy CRUSH behavior. These + correct for old bugs, optimizations, or other changes that have + been made over the years to improve CRUSH's behavior. + +#. **devices:** Devices are individual OSDs that store data. + +#. **types**: Bucket ``types`` define the types of buckets used in + your CRUSH hierarchy. Buckets consist of a hierarchical aggregation + of storage locations (e.g., rows, racks, chassis, hosts, etc.) and + their assigned weights. + +#. **buckets:** Once you define bucket types, you must define each node + in the hierarchy, its type, and which devices or other nodes it + contains. + +#. **rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** Choose_args are alternative weights associated with + the hierarchy that have been adjusted to optimize data placement. A single + choose_args map can be used for the entire cluster, or one can be + created for each individual pool. + + +.. _crushmapdevices: + +CRUSH Map Devices +----------------- + +Devices are individual OSDs that store data. Usually one is defined here for each +OSD daemon in your +cluster. Devices are identified by an ``id`` (a non-negative integer) and +a ``name``, normally ``osd.N`` where ``N`` is the device id. + +.. _crush-map-device-class: + +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targeted by a +crush rule. + +.. prompt:: bash # + + devices + +:: + + device {num} {osd.name} [class {class}] + +For example: + +.. prompt:: bash # + + devices + +:: + + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a single ``ceph-osd`` daemon. 
This +is normally a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a +small RAID device. + +CRUSH Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate +a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent +physical locations in a hierarchy. Nodes aggregate other nodes or leaves. +Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage +media. + +.. tip:: The term "bucket" used in the context of CRUSH means a node in + the hierarchy, i.e. a location or a piece of physical hardware. It + is a different concept from the term "bucket" when used in the + context of RADOS Gateway APIs. + +To add a bucket type to the CRUSH map, create a new line under your list of +bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. +By convention, there is one leaf bucket and it is ``type 0``; however, you may +give it any name you like (e.g., osd, disk, drive, storage):: + + # types + type {num} {bucket-name} + +For example:: + + # types + type 0 osd + type 1 host + type 2 chassis + type 3 rack + type 4 row + type 5 pdu + type 6 pod + type 7 room + type 8 datacenter + type 9 zone + type 10 region + type 11 root + + + +.. _crushmapbuckets: + +CRUSH Map Bucket Hierarchy +-------------------------- + +The CRUSH algorithm distributes data objects among storage devices according +to a per-device weight value, approximating a uniform probability distribution. +CRUSH distributes objects and their replicas according to the hierarchical +cluster map you define. Your CRUSH map represents the available storage +devices and the logical elements that contain them. + +To map placement groups to OSDs across failure domains, a CRUSH map defines a +hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH +map). 
The purpose of creating a bucket hierarchy is to segregate the +leaf nodes by their failure domains, such as hosts, chassis, racks, power +distribution units, pods, rows, rooms, and data centers. With the exception of +the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and +you may define it according to your own needs. + +We recommend adapting your CRUSH map to your firm's hardware naming conventions +and using instance names that reflect the physical hardware. Your naming +practice can make it easier to administer the cluster and troubleshoot +problems when an OSD and/or other hardware malfunctions and the administrator +need access to physical hardware. + +In the following example, the bucket hierarchy has a leaf bucket named ``osd``, +and two node buckets named ``host`` and ``rack`` respectively. + +.. ditaa:: + +-----------+ + | {o}rack | + | Bucket | + +-----+-----+ + | + +---------------+---------------+ + | | + +-----+-----+ +-----+-----+ + | {o}host | | {o}host | + | Bucket | | Bucket | + +-----+-----+ +-----+-----+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd | | osd | | osd | | osd | + | Bucket | | Bucket | | Bucket | | Bucket | + +-----------+ +-----------+ +-----------+ +-----------+ + +.. note:: The higher numbered ``rack`` bucket type aggregates the lower + numbered ``host`` bucket type. + +Since leaf nodes reflect storage devices declared under the ``#devices`` list +at the beginning of the CRUSH map, you do not need to declare them as bucket +instances. The second lowest bucket type in your hierarchy usually aggregates +the devices (i.e., it's usually the computer containing the storage media, and +uses whatever term you prefer to describe it, such as "node", "computer", +"server," "host", "machine", etc.). In high density environments, it is +increasingly common to see multiple hosts/nodes per chassis. 
You should account +for chassis failure too--e.g., the need to pull a chassis if a node fails may +result in bringing down numerous hosts/nodes and their OSDs. + +When declaring a bucket instance, you must specify its type, give it a unique +name (string), assign it a unique ID expressed as a negative integer (optional), +specify a weight relative to the total capacity/capability of its item(s), +specify the bucket algorithm (usually ``straw2``), and the hash (usually ``0``, +reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. +The items may consist of node buckets or leaves. Items may have a weight that +reflects the relative weight of the item. + +You may declare a node bucket with the following syntax:: + + [bucket-type] [bucket-name] { + id [a unique negative numeric ID] + weight [the relative capacity/capability of the item(s)] + alg [the bucket type: uniform | list | tree | straw | straw2 ] + hash [the hash type: 0 by default] + item [item-name] weight [weight] + } + +For example, using the diagram above, we would define two host buckets +and one rack bucket. The OSDs are declared as items within the host buckets:: + + host node1 { + id -1 + alg straw2 + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host node2 { + id -2 + alg straw2 + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + rack rack1 { + id -3 + alg straw2 + hash 0 + item node1 weight 2.00 + item node2 weight 2.00 + } + +.. note:: In the foregoing example, note that the rack bucket does not contain + any OSDs. Rather it contains lower level host buckets, and includes the + sum total of their weight in the item entry. + +.. topic:: Bucket Types + + Ceph supports five bucket types, each representing a tradeoff between + performance and reorganization efficiency. If you are unsure of which bucket + type to use, we recommend using a ``straw2`` bucket. 
For a detailed + discussion of bucket types, refer to + `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, + and more specifically to **Section 3.4**. The bucket types are: + + #. **uniform**: Uniform buckets aggregate devices with **exactly** the same + weight. For example, when firms commission or decommission hardware, they + typically do so with many machines that have exactly the same physical + configuration (e.g., bulk purchases). When storage devices have exactly + the same weight, you may use the ``uniform`` bucket type, which allows + CRUSH to map replicas into uniform buckets in constant time. With + non-uniform weights, you should use another bucket algorithm. + + #. **list**: List buckets aggregate their content as linked lists. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, + a list is a natural and intuitive choice for an **expanding cluster**: + either an object is relocated to the newest device with some appropriate + probability, or it remains on the older devices as before. The result is + optimal data migration when items are added to the bucket. Items removed + from the middle or tail of the list, however, can result in a significant + amount of unnecessary movement, making list buckets most suitable for + circumstances in which they **never (or very rarely) shrink**. + + #. **tree**: Tree buckets use a binary search tree. They are more efficient + than list buckets when a bucket contains a larger set of items. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, + tree buckets reduce the placement time to O(log :sub:`n`), making them + suitable for managing much larger sets of devices or nested buckets. + + #. **straw**: List and Tree buckets use a divide and conquer strategy + in a way that either gives certain items precedence (e.g., those + at the beginning of a list) or obviates the need to consider entire + subtrees of items at all. 
That improves the performance of the replica + placement process, but can also introduce suboptimal reorganization + behavior when the contents of a bucket change due to an addition, removal, + or re-weighting of an item. The straw bucket type allows all items to + fairly “compete” against each other for replica placement through a + process analogous to a draw of straws. + + #. **straw2**: Straw2 buckets improve Straw to correctly avoid any data + movement between items when neighbor weights change. + + For example, when the weight of item A changes (including when A is added + anew or removed completely), data moves only to or from item A. + +.. topic:: Hash + + Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. + Enter ``0`` as your hash setting to select ``rjenkins1``. + + +.. _weightingbucketitems: + +.. topic:: Weighting Bucket Items + + Ceph expresses bucket weights as doubles, which allows for fine + weighting. A weight expresses the relative difference between device capacities. We + recommend using ``1.00`` as the relative weight for a 1TB storage device. + In such a scenario, a weight of ``0.5`` would represent approximately 500GB, + and a weight of ``3.00`` would represent approximately 3TB. Higher level + buckets have a weight that is the sum total of the leaf items aggregated by + the bucket. + + A bucket item weight is one dimensional, but you may also calculate your + item weights to reflect the performance of the storage drive. For example, + if you have many 1TB drives where some have relatively low data transfer + rate and the others have a relatively high data transfer rate, you may + weight them differently, even though they have the same capacity (e.g., + a weight of 0.80 for the first set of drives with lower total throughput, + and 1.20 for the second set of drives with higher total throughput). + + +.. 
_crushmaprules: + +CRUSH Map Rules +--------------- + +CRUSH maps support the notion of 'CRUSH rules', which are the rules that +determine data placement for a pool. The default CRUSH map has a rule for each +pool. For large clusters, you will likely create many pools where each pool may +have its own non-default CRUSH rule. + +.. note:: In most cases, you will not need to modify the default rule. When + you create a new pool, by default the rule will be set to ``0``. + + +CRUSH rules define placement and replication strategies or distribution policies +that allow you to specify exactly how CRUSH places object replicas. For +example, you might create a rule selecting a pair of targets for 2-way +mirroring, another rule for selecting three targets in two different data +centers for 3-way mirroring, and yet another rule for erasure coding over six +storage devices. For a detailed discussion of CRUSH rules, refer to +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, +and more specifically to **Section 3.2**. + +A rule takes the following form:: + + rule { + + id [a unique whole numeric ID] + type [ replicated | erasure ] + min_size + max_size + step take [class ] + step [choose|chooseleaf] [firstn|indep] type + step emit + } + + +``id`` + +:Description: A unique whole number for identifying the rule. + +:Purpose: A component of the rule mask. +:Type: Integer +:Required: Yes +:Default: 0 + + +``type`` + +:Description: Describes whether the rule is for a replicated pool + or an erasure-coded pool. + +:Purpose: A component of the rule mask. +:Type: String +:Required: Yes +:Default: ``replicated`` +:Valid Values: Currently only ``replicated`` and ``erasure`` + +``min_size`` + +:Description: If a pool makes fewer replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. 
+:Required: Yes +:Default: ``1`` + +``max_size`` + +:Description: If a pool makes more replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: 10 + + +``step take [class ]`` + +:Description: Takes a bucket name, and begins iterating down the tree. + If the ``device-class`` is specified, it must match + a class previously used when defining a device. All + devices that do not belong to the class are excluded. +:Purpose: A component of the rule. +:Required: Yes +:Example: ``step take data`` + + +``step choose firstn {num} type {bucket-type}`` + +:Description: Selects the number of buckets of the given type from within the + current bucket. The number is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step choose firstn 1 type row`` + + +``step chooseleaf firstn {num} type {bucket-type}`` + +:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf + node (that is, an OSD) from the subtree of each bucket in the set of buckets. + The number of buckets in the set is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. Usage removes the need to select a device using two steps. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step chooseleaf firstn 0 type row`` + + +``step emit`` + +:Description: Outputs the current value and empties the stack. 
Typically used + at the end of a rule, but may also be used to pick from different + trees in the same rule. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step choose``. +:Example: ``step emit`` + +.. important:: A given CRUSH rule may be assigned to multiple pools, but it + is not possible for a single pool to have multiple CRUSH rules. + +``firstn`` versus ``indep`` + +:Description: Controls the replacement strategy CRUSH uses when items (OSDs) + are marked down in the CRUSH map. If this rule is to be used with + replicated pools it should be ``firstn`` and if it's for + erasure-coded pools it should be ``indep``. + + The reason has to do with how they behave when a + previously-selected device fails. Let's say you have a PG stored + on OSDs 1, 2, 3, 4, 5. Then 3 goes down. + + With the "firstn" mode, CRUSH simply adjusts its calculation to + select 1 and 2, then selects 3 but discovers it's down, so it + retries and selects 4 and 5, and then goes on to select a new + OSD 6. So the final CRUSH mapping change is + 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6. + + But if you're storing an EC pool, that means you just changed the + data mapped to OSDs 4, 5, and 6! So the "indep" mode attempts to + not do that. You can instead expect it, when it selects the failed + OSD 3, to try again and pick out 6, for a final transformation of: + 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5 + +.. _crush-reclassify: + +Migrating from a legacy SSD rule to device classes +-------------------------------------------------- + +It used to be necessary to manually edit your CRUSH map and maintain a +parallel hierarchy for each specialized device type (e.g., SSD) in order to +write rules that apply to those devices. Since the Luminous release, +the *device class* feature has enabled this transparently. + +However, migrating from an existing, manually customized per-device map to +the new device class rules in the trivial way will cause all data in the +system to be reshuffled. 
+ +The ``crushtool`` has a few commands that can transform a legacy rule +and hierarchy so that you can start using the new class-based rules. +There are three types of transformations possible: + +#. ``--reclassify-root `` + + This will take everything in the hierarchy beneath root-name and + adjust any rules that reference that root via a ``take + `` to instead ``take class ``. + It renumbers the buckets in such a way that the old IDs are instead + used for the specified class's "shadow tree" so that no data + movement takes place. + + For example, imagine you have an existing rule like:: + + rule replicated_ruleset { + id 0 + type replicated + min_size 1 + max_size 10 + step take default + step chooseleaf firstn 0 type rack + step emit + } + + If you reclassify the root `default` as class `hdd`, the rule will + become:: + + rule replicated_ruleset { + id 0 + type replicated + min_size 1 + max_size 10 + step take default class hdd + step chooseleaf firstn 0 type rack + step emit + } + +#. ``--set-subtree-class `` + + This will mark every device in the subtree rooted at *bucket-name* + with the specified device class. + + This is normally used in conjunction with the ``--reclassify-root`` + option to ensure that all devices in that root are labeled with the + correct class. In some situations, however, some of those devices + (correctly) have a different class and we do not want to relabel + them. In such cases, one can exclude the ``--set-subtree-class`` + option. This means that the remapping process will not be perfect, + since the previous rule distributed across devices of multiple + classes but the adjusted rules will only map to devices of the + specified *device-class*, but that often is an accepted level of + data movement when the number of outlier devices is small. + +#. ``--reclassify-bucket `` + + This will allow you to merge a parallel type-specific hierarchy with the normal hierarchy. 
For example, many users have maps like:: + + host node1 { + id -2 # do not change unnecessarily + # weight 109.152 + alg straw2 + hash 0 # rjenkins1 + item osd.0 weight 9.096 + item osd.1 weight 9.096 + item osd.2 weight 9.096 + item osd.3 weight 9.096 + item osd.4 weight 9.096 + item osd.5 weight 9.096 + ... + } + + host node1-ssd { + id -10 # do not change unnecessarily + # weight 2.000 + alg straw2 + hash 0 # rjenkins1 + item osd.80 weight 2.000 + ... + } + + root default { + id -1 # do not change unnecessarily + alg straw2 + hash 0 # rjenkins1 + item node1 weight 110.967 + ... + } + + root ssd { + id -18 # do not change unnecessarily + # weight 16.000 + alg straw2 + hash 0 # rjenkins1 + item node1-ssd weight 2.000 + ... + } + + This function will reclassify each bucket that matches a + pattern. The pattern can look like ``%suffix`` or ``prefix%``. + For example, in the above example, we would use the pattern + ``%-ssd``. For each matched bucket, the remaining portion of the + name (that matches the ``%`` wildcard) specifies the *base bucket*. + All devices in the matched bucket are labeled with the specified + device class and then moved to the base bucket. If the base bucket + does not exist (e.g., ``node12-ssd`` exists but ``node12`` does + not), then it is created and linked underneath the specified + *default parent* bucket. In each case, we are careful to preserve + the old bucket IDs for the new shadow buckets to prevent data + movement. Any rules with ``take`` steps referencing the old + buckets are adjusted. + +#. ``--reclassify-bucket `` + + The same command can also be used without a wildcard to map a + single bucket. For example, in the previous example, we want the + ``ssd`` bucket to be mapped to the ``default`` bucket. + +The final command to convert the map comprising the above fragments would be something like: + +.. 
prompt:: bash $ + + ceph osd getcrushmap -o original + crushtool -i original --reclassify \ + --set-subtree-class default hdd \ + --reclassify-root default hdd \ + --reclassify-bucket %-ssd ssd default \ + --reclassify-bucket ssd ssd default \ + -o adjusted + +In order to ensure that the conversion is correct, there is a ``--compare`` command that will test a large sample of inputs against the CRUSH map and check that the same result is output. These inputs are controlled by the same options that apply to the ``--test`` command. For the above example: + +.. prompt:: bash $ + + crushtool -i original --compare adjusted + +:: + + rule 0 had 0/10240 mismatched mappings (0) + rule 1 had 0/10240 mismatched mappings (0) + maps appear equivalent + +If there were differences, the ratio of remapped inputs would be reported in +the parentheses. + +When you are satisfied with the adjusted map, apply it to the cluster with a command of the form: + +.. prompt:: bash $ + + ceph osd setcrushmap -i adjusted + +Tuning CRUSH, the hard way +-------------------------- + +If you can ensure that all clients are running recent code, you can +adjust the tunables by extracting the CRUSH map, modifying the values, +and reinjecting it into the cluster. + +* Extract the latest CRUSH map: + + .. prompt:: bash $ + + ceph osd getcrushmap -o /tmp/crush + +* Adjust tunables. These values appear to offer the best behavior + for both large and small clusters we tested with. You will need to + additionally specify the ``--enable-unsafe-tunables`` argument to + ``crushtool`` for this to work. Please use this option with + extreme care: + + .. prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new + +* Reinject the modified map: + + .. prompt:: bash $ + + ceph osd setcrushmap -i /tmp/crush.new + +Legacy values +------------- + +For reference, the legacy values for the CRUSH tunables can be set +with: + +.. 
prompt:: bash $ + + crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy + +Again, the special ``--enable-unsafe-tunables`` option is required. +Further, as noted above, be careful running old versions of the +``ceph-osd`` daemon after reverting to legacy values as the feature +bit is not perfectly enforced. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst new file mode 100644 index 000000000..f22ebb24e --- /dev/null +++ b/doc/rados/operations/crush-map.rst @@ -0,0 +1,1126 @@ +============ + CRUSH Maps +============ + +The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm +determines how to store and retrieve data by computing storage locations. +CRUSH empowers Ceph clients to communicate with OSDs directly rather than +through a centralized server or broker. With an algorithmically determined +method of storing and retrieving data, Ceph avoids a single point of failure, a +performance bottleneck, and a physical limit to its scalability. + +CRUSH uses a map of your cluster (the CRUSH map) to pseudo-randomly +map data to OSDs, distributing it across the cluster according to configured +replication policy and failure domain. For a detailed discussion of CRUSH, see +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ + +CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy +of 'buckets' for aggregating devices and buckets, and +rules that govern how CRUSH replicates data within the cluster's pools. By +reflecting the underlying physical organization of the installation, CRUSH can +model (and thereby address) the potential for correlated device failures. 
+Typical factors include chassis, racks, physical proximity, a shared power +source, and shared networking. By encoding this information into the cluster +map, CRUSH placement +policies distribute object replicas across failure domains while +maintaining the desired distribution. For example, to address the +possibility of concurrent failures, it may be desirable to ensure that data +replicas are on devices using different shelves, racks, power supplies, +controllers, and/or physical locations. + +When you deploy OSDs they are automatically added to the CRUSH map under a +``host`` bucket named for the node on which they run. This, +combined with the configured CRUSH failure domain, ensures that replicas or +erasure code shards are distributed across hosts and that a single host or other +failure will not affect availability. For larger clusters, administrators must +carefully consider their choice of failure domain. Separating replicas across racks, +for example, is typical for mid- to large-sized clusters. + + +CRUSH Location +============== + +The location of an OSD within the CRUSH map's hierarchy is +referred to as a ``CRUSH location``. This location specifier takes the +form of a list of key and value pairs. For +example, if an OSD is in a particular row, rack, chassis and host, and +is part of the 'default' CRUSH root (which is the case for most +clusters), its CRUSH location could be described as:: + + root=default row=a rack=a2 chassis=a2a host=a2a1 + +Note: + +#. Note that the order of the keys does not matter. +#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default + these include ``root``, ``datacenter``, ``room``, ``row``, ``pod``, ``pdu``, + ``rack``, ``chassis`` and ``host``. + These defined types suffice for almost all clusters, but can be customized + by modifying the CRUSH map. +#. Not all keys need to be specified. 
For example, by default, Ceph + automatically sets an ``OSD``'s location to be + ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``). + +The CRUSH location for an OSD can be defined by adding the ``crush location`` +option in ``ceph.conf``. Each time the OSD starts, +it verifies it is in the correct location in the CRUSH map and, if it is not, +it moves itself. To disable this automatic CRUSH map management, add the +following to your configuration file in the ``[osd]`` section:: + + osd crush update on start = false + +Note that in most cases you will not need to manually configure this. + + +Custom location hooks +--------------------- + +A customized location hook can be used to generate a more complete +CRUSH location on startup. The CRUSH location is based on, in order +of preference: + +#. A ``crush location`` option in ``ceph.conf`` +#. A default of ``root=default host=HOSTNAME`` where the hostname is + derived from the ``hostname -s`` command + +A script can be written to provide additional +location fields (for example, ``rack`` or ``datacenter``) and the +hook enabled via the config option:: + + crush location hook = /path/to/customized-ceph-crush-location + +This hook is passed several arguments (below) and should output a single line +to ``stdout`` with the CRUSH location description.:: + + --cluster CLUSTER --id ID --type TYPE + +where the cluster name is typically ``ceph``, the ``id`` is the daemon +identifier (e.g., the OSD number or daemon identifier), and the daemon +type is ``osd``, ``mds``, etc. + +For example, a simple hook that additionally specifies a rack location +based on a value in the file ``/etc/rack`` might be:: + + #!/bin/sh + echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default" + + +CRUSH structure +=============== + +The CRUSH map consists of a hierarchy that describes +the physical topology of the cluster and a set of rules defining +data placement policy. 
The hierarchy has +devices (OSDs) at the leaves, and internal nodes +corresponding to other physical features or groupings: hosts, racks, +rows, datacenters, and so on. The rules describe how replicas are +placed in terms of that hierarchy (e.g., 'three replicas in different +racks'). + +Devices +------- + +Devices are individual OSDs that store data, usually one for each storage drive. +Devices are identified by an ``id`` +(a non-negative integer) and a ``name``, normally ``osd.N`` where ``N`` is the device id. + +Since the Luminous release, devices may also have a *device class* assigned (e.g., +``hdd`` or ``ssd`` or ``nvme``), allowing them to be conveniently targeted by +CRUSH rules. This is especially useful when mixing device types within hosts. + +.. _crush_map_default_types: + +Types and Buckets +----------------- + +A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, +racks, rows, etc. The CRUSH map defines a series of *types* that are +used to describe these nodes. Default types include: + +- ``osd`` (or ``device``) +- ``host`` +- ``chassis`` +- ``rack`` +- ``row`` +- ``pdu`` +- ``pod`` +- ``room`` +- ``datacenter`` +- ``zone`` +- ``region`` +- ``root`` + +Most clusters use only a handful of these types, and others +can be defined as needed. + +The hierarchy is built with devices (normally type ``osd``) at the +leaves, interior nodes with non-device types, and a root node of type +``root``. For example, + +.. 
ditaa:: + + +-----------------+ + |{o}root default | + +--------+--------+ + | + +---------------+---------------+ + | | + +------+------+ +------+------+ + |{o}host foo | |{o}host bar | + +------+------+ +------+------+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd.0 | | osd.1 | | osd.2 | | osd.3 | + +-----------+ +-----------+ +-----------+ +-----------+ + +Each node (device or bucket) in the hierarchy has a *weight* +that indicates the relative proportion of the total +data that the device or hierarchy subtree should store. Weights are set +at the leaves, indicating the size of the device, and automatically +sum up the tree, such that the weight of the ``root`` node +will be the total of all devices contained beneath it. Normally +weights are in units of terabytes (TB). + +You can get a simple view of the CRUSH hierarchy for your cluster, +including weights, with: + +.. prompt:: bash $ + + ceph osd tree + +Rules +----- + +CRUSH Rules define policy about how data is distributed across the devices +in the hierarchy. They define placement and replication strategies or +distribution policies that allow you to specify exactly how CRUSH +places data replicas. For example, you might create a rule selecting +a pair of targets for two-way mirroring, another rule for selecting +three targets in two different data centers for three-way mirroring, and +yet another rule for erasure coding (EC) across six storage devices. For a +detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, +Scalable, Decentralized Placement of Replicated Data`_, and more +specifically to **Section 3.2**. + +CRUSH rules can be created via the CLI by +specifying the *pool type* they will be used for (replicated or +erasure coded), the *failure domain*, and optionally a *device class*. +In rare cases rules must be written by hand by manually editing the +CRUSH map. 
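To make the idea of a failure domain concrete, the following is a minimal Python sketch of what a rule with a ``rack`` failure domain achieves: every replica of an object lands in a different rack. It is an illustration only and is not the CRUSH algorithm. The hierarchy, the bucket and OSD names, and the ``place`` and ``_score`` helpers are all hypothetical, and item weights are ignored for brevity.

```python
import hashlib

# Hypothetical toy hierarchy: rack -> host -> OSDs (all names are illustrative).
HIERARCHY = {
    "rack1": {"node1": ["osd.0", "osd.1"], "node2": ["osd.2", "osd.3"]},
    "rack2": {"node3": ["osd.4", "osd.5"]},
    "rack3": {"node4": ["osd.6", "osd.7"]},
}


def _score(*parts):
    """Return a deterministic pseudo-random integer derived from the strings."""
    digest = hashlib.sha256("/".join(parts).encode()).hexdigest()
    return int(digest, 16)


def place(object_name, replicas):
    """Pick one OSD from each of `replicas` distinct racks.

    This only mimics the *effect* of a rule whose failure domain is ``rack``;
    it is not how CRUSH computes placements, and it ignores weights.
    """
    if replicas > len(HIERARCHY):
        raise ValueError("not enough racks for the requested replica count")
    # Rank the racks by a per-object score and keep the top `replicas` racks.
    racks = sorted(
        HIERARCHY, key=lambda rack: _score(object_name, rack), reverse=True
    )[:replicas]
    chosen = []
    for rack in racks:
        # Flatten all OSDs beneath the rack, then pick the highest-scoring one.
        osds = [osd for host_osds in HIERARCHY[rack].values() for osd in host_osds]
        chosen.append(max(osds, key=lambda osd: _score(object_name, rack, osd)))
    return chosen
```

Calling ``place("my-object", 3)`` returns three OSD names, one per rack, and returns the same names every time for the same object name. Like CRUSH, the sketch computes locations from the map rather than looking them up in a central table.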
+ +You can see what rules are defined for your cluster with: + +.. prompt:: bash $ + + ceph osd crush rule ls + +You can view the contents of the rules with: + +.. prompt:: bash $ + + ceph osd crush rule dump + +Device classes +-------------- + +Each device can optionally have a *class* assigned. By +default, OSDs automatically set their class at startup to +`hdd`, `ssd`, or `nvme` based on the type of device they are backed +by. + +The device class for one or more OSDs can be explicitly set with: + +.. prompt:: bash $ + + ceph osd crush set-device-class [...] + +Once a device class is set, it cannot be changed to another class +until the old class is unset with: + +.. prompt:: bash $ + + ceph osd crush rm-device-class [...] + +This allows administrators to set device classes without the class +being changed on OSD restart or by some other script. + +A placement rule that targets a specific device class can be created with: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated + +A pool can then be changed to use the new rule with: + +.. prompt:: bash $ + + ceph osd pool set crush_rule + +Device classes are implemented by creating a "shadow" CRUSH hierarchy +for each device class in use that contains only devices of that class. +CRUSH rules can then distribute data over the shadow hierarchy. +This approach is fully backward compatible with +old Ceph clients. You can view the CRUSH hierarchy with shadow items +with: + +.. prompt:: bash $ + + ceph osd crush tree --show-shadow + +For older clusters created before Luminous that relied on manually +crafted CRUSH maps to maintain per-device-type hierarchies, there is a +*reclassify* tool available to help transition to device classes +without triggering data movement (see :ref:`crush-reclassify`). + + +Weights sets +------------ + +A *weight set* is an alternative set of weights to use when +calculating data placement. 
The normal weights associated with each +device in the CRUSH map are set based on the device size and indicate +how much data we *should* be storing where. However, because CRUSH is +a "probabilistic" pseudorandom placement process, there is always some +variation from this ideal distribution, in the same way that rolling a +die sixty times will rarely result in rolling exactly 10 ones and 10 +sixes. Weight sets allow the cluster to perform numerical optimization +based on the specifics of your cluster (hierarchy, pools, etc.) to achieve +a balanced distribution. + +There are two types of weight sets supported: + + #. A **compat** weight set is a single alternative set of weights for + each device and node in the cluster. This is not well-suited for + correcting for all anomalies (for example, placement groups for + different pools may be different sizes and have different load + levels, but will be mostly treated the same by the balancer). + However, compat weight sets have the huge advantage that they are + *backward compatible* with previous versions of Ceph, which means + that even though weight sets were first introduced in Luminous + v12.2.z, older clients (e.g., firefly) can still connect to the + cluster when a compat weight set is being used to balance data. + #. A **per-pool** weight set is more flexible in that it allows + placement to be optimized for each data pool. Additionally, + weights can be adjusted for each position of placement, allowing + the optimizer to correct for a subtle skew of data toward devices + with small weights relative to their peers (an effect that is + usually only apparent in very large clusters but which can cause + balancing problems). + +When weight sets are in use, the weights associated with each node in +the hierarchy are visible as a separate column (labeled either +``(compat)`` or the pool name) from the command: + +.. 
prompt:: bash $ + + ceph osd tree + +When both *compat* and *per-pool* weight sets are in use, data +placement for a particular pool will use its own per-pool weight set +if present. If not, it will use the compat weight set if present. If +neither is present, it will use the normal CRUSH weights. + +Although weight sets can be set up and manipulated by hand, it is +recommended that the ``ceph-mgr`` *balancer* module be enabled to do so +automatically when running Luminous or later releases. + + +Modifying the CRUSH map +======================= + +.. _addosd: + +Add/Move an OSD +--------------- + +.. note:: OSDs are normally automatically added to the CRUSH map when + the OSD is created. This command is rarely needed. + +To add or move an OSD in the CRUSH map of a running cluster: + +.. prompt:: bash $ + + ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...] + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +``weight`` + +:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB). +:Type: Double +:Required: Yes +:Example: ``2.0`` + + +``root`` + +:Description: The root node of the tree in which the OSD resides (normally ``default``) +:Type: Key/value pair. +:Required: Yes +:Example: ``root=default`` + + +``bucket-type`` + +:Description: You may specify the OSD's location in the CRUSH hierarchy. +:Type: Key/value pairs. +:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + + +The following example adds ``osd.0`` to the hierarchy, or moves the +OSD from a previous location: + +.. prompt:: bash $ + + ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 + + +Adjust OSD weight +----------------- + +.. note:: Normally OSDs automatically add themselves to the CRUSH map + with the correct weight when they are created. This command + is rarely needed. 
+ +To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute +the following: + +.. prompt:: bash $ + + ceph osd crush reweight {name} {weight} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +``weight`` + +:Description: The CRUSH weight for the OSD. +:Type: Double +:Required: Yes +:Example: ``2.0`` + + +.. _removeosd: + +Remove an OSD +------------- + +.. note:: OSDs are normally removed from the CRUSH map as part of the + ``ceph osd purge`` command. This command is rarely needed. + +To remove an OSD from the CRUSH map of a running cluster, execute the +following: + +.. prompt:: bash $ + + ceph osd crush remove {name} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +Add a Bucket +------------ + +.. note:: Buckets are implicitly created when an OSD is added + that specifies a ``{bucket-type}={bucket-name}`` as part of its + location, if a bucket with that name does not already exist. This + command is typically used when manually adjusting the structure of the + hierarchy after OSDs have been created. One use is to move a + series of hosts underneath a new rack-level bucket; another is to + add new ``host`` buckets (OSD nodes) to a dummy ``root`` so that they don't + receive data until you're ready, at which time you would move them to the + ``default`` or other root as described below. + +To add a bucket in the CRUSH map of a running cluster, execute the +``ceph osd crush add-bucket`` command: + +.. prompt:: bash $ + + ceph osd crush add-bucket {bucket-name} {bucket-type} + +Where: + +``bucket-name`` + +:Description: The full name of the bucket. +:Type: String +:Required: Yes +:Example: ``rack12`` + + +``bucket-type`` + +:Description: The type of the bucket. The type must already exist in the hierarchy. 
+:Type: String +:Required: Yes +:Example: ``rack`` + + +The following example adds the ``rack12`` bucket to the hierarchy: + +.. prompt:: bash $ + + ceph osd crush add-bucket rack12 rack + +Move a Bucket +------------- + +To move a bucket to a different location or position in the CRUSH map +hierarchy, execute the following: + +.. prompt:: bash $ + + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + +Where: + +``bucket-name`` + +:Description: The name of the bucket to move/reposition. +:Type: String +:Required: Yes +:Example: ``foo-bar-1`` + +``bucket-type`` + +:Description: You may specify the bucket's location in the CRUSH hierarchy. +:Type: Key/value pairs. +:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +Remove a Bucket +--------------- + +To remove a bucket from the CRUSH hierarchy, execute the following: + +.. prompt:: bash $ + + ceph osd crush remove {bucket-name} + +.. note:: A bucket must be empty before removing it from the CRUSH hierarchy. + +Where: + +``bucket-name`` + +:Description: The name of the bucket that you'd like to remove. +:Type: String +:Required: Yes +:Example: ``rack12`` + +The following example removes the ``rack12`` bucket from the hierarchy: + +.. prompt:: bash $ + + ceph osd crush remove rack12 + +Creating a compat weight set +---------------------------- + +.. note: This step is normally done automatically by the ``balancer`` + module when enabled. + +To create a *compat* weight set: + +.. prompt:: bash $ + + ceph osd crush weight-set create-compat + +Weights for the compat weight set can be adjusted with: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight-compat {name} {weight} + +The compat weight set can be destroyed with: + +.. prompt:: bash $ + + ceph osd crush weight-set rm-compat + +Creating per-pool weight sets +----------------------------- + +To create a weight set for a specific pool: + +.. 
prompt:: bash $ + + ceph osd crush weight-set create {pool-name} {mode} + +.. note:: Per-pool weight sets require that all servers and daemons + run Luminous v12.2.z or later. + +Where: + +``pool-name`` + +:Description: The name of a RADOS pool +:Type: String +:Required: Yes +:Example: ``rbd`` + +``mode`` + +:Description: Either ``flat`` or ``positional``. A *flat* weight set + has a single weight for each device or bucket. A + *positional* weight set has a potentially different + weight for each position in the resulting placement + mapping. For example, if a pool has a replica count of + 3, then a positional weight set will have three weights + for each device and bucket. +:Type: String +:Required: Yes +:Example: ``flat`` + +To adjust the weight of an item in a weight set: + +.. prompt:: bash $ + + ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]} + +To list existing weight sets: + +.. prompt:: bash $ + + ceph osd crush weight-set ls + +To remove a weight set: + +.. prompt:: bash $ + + ceph osd crush weight-set rm {pool-name} + +Creating a rule for a replicated pool +------------------------------------- + +For a replicated pool, the primary decision when creating the CRUSH +rule is what the failure domain is going to be. For example, if a +failure domain of ``host`` is selected, then CRUSH will ensure that +each replica of the data is stored on a unique host. If ``rack`` +is selected, then each replica will be stored in a different rack. +What failure domain you choose primarily depends on the size and +topology of your cluster. + +In most cases the entire cluster hierarchy is nested beneath a root node +named ``default``. If you have customized your hierarchy, you may +want to create a rule nested at some other node in the hierarchy. It +doesn't matter what type is associated with that node (it doesn't have +to be a ``root`` node). + +It is also possible to create a rule that restricts data placement to +a specific *class* of device. 
By default, Ceph OSDs automatically +classify themselves as either ``hdd`` or ``ssd``, depending on the +underlying type of device being used. These classes can also be +customized. + +To create a replicated rule: + +.. prompt:: bash $ + + ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}] + +Where: + +``name`` + +:Description: The name of the rule +:Type: String +:Required: Yes +:Example: ``rbd-rule`` + +``root`` + +:Description: The name of the node under which data should be placed. +:Type: String +:Required: Yes +:Example: ``default`` + +``failure-domain-type`` + +:Description: The type of CRUSH nodes across which we should separate replicas. +:Type: String +:Required: Yes +:Example: ``rack`` + +``class`` + +:Description: The device class on which data should be placed. +:Type: String +:Required: No +:Example: ``ssd`` + +Creating a rule for an erasure coded pool +----------------------------------------- + +For an erasure-coded (EC) pool, the same basic decisions need to be made: +what is the failure domain, which node in the +hierarchy will data be placed under (usually ``default``), and will +placement be restricted to a specific device class. Erasure code +pools are created a bit differently, however, because they need to be +constructed carefully based on the erasure code being used. For this reason, +you must include this information in the *erasure code profile*. A CRUSH +rule will then be created from that either explicitly or automatically when +the profile is used to create a pool. + +The erasure code profiles can be listed with: + +.. prompt:: bash $ + + ceph osd erasure-code-profile ls + +An existing profile can be viewed with: + +.. prompt:: bash $ + + ceph osd erasure-code-profile get {profile-name} + +Normally profiles should never be modified; instead, a new profile +should be created and used when creating a new pool or creating a new +rule for an existing pool. 
+ +An erasure code profile consists of a set of key=value pairs. Most of +these control the behavior of the erasure code that is encoding data +in the pool. Those that begin with ``crush-``, however, affect the +CRUSH rule that is created. + +The erasure code profile properties of interest are: + + * **crush-root**: the name of the CRUSH node under which to place data [default: ``default``]. + * **crush-failure-domain**: the CRUSH bucket type across which to distribute erasure-coded shards [default: ``host``]. + * **crush-device-class**: the device class on which to place data [default: none, meaning all devices are used]. + * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule. + +Once a profile is defined, you can create a CRUSH rule with: + +.. prompt:: bash $ + + ceph osd crush rule create-erasure {name} {profile-name} + +.. note:: When creating a new pool, it is not actually necessary to + explicitly create the rule. If the erasure code profile alone is + specified and the rule argument is left off then Ceph will create + the CRUSH rule automatically. + +Deleting rules +-------------- + +Rules that are not in use by pools can be deleted with: + +.. prompt:: bash $ + + ceph osd crush rule rm {rule-name} + + +.. _crush-map-tunables: + +Tunables +======== + +Over time, we have made (and continue to make) improvements to the +CRUSH algorithm used to calculate the placement of data. In order to +support the change in behavior, we have introduced a series of tunable +options that control whether the legacy or improved variation of the +algorithm is used. + +In order to use newer tunables, both clients and servers must support +the new version of CRUSH. For this reason, we have created +``profiles`` that are named after the Ceph version in which they were +introduced.
For example, the ``firefly`` tunables are first supported +by the Firefly release, and will not work with older (e.g., Dumpling) +clients. Once a given set of tunables is changed from the legacy +default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent older +clients that do not support the new CRUSH features from connecting to +the cluster. + +argonaut (legacy) +----------------- + +The legacy CRUSH behavior used by Argonaut and older releases works +fine for most clusters, provided there are not many OSDs that have +been marked out. + +bobtail (CRUSH_TUNABLES2) +------------------------- + +The ``bobtail`` tunable profile fixes a few key misbehaviors: + + * For hierarchies with a small number of devices in the leaf buckets, + some PGs map to fewer than the desired number of replicas. This + commonly happens for hierarchies with "host" nodes with a small + number (1-3) of OSDs nested beneath each one. + + * For large clusters, a small percentage of PGs map to fewer than + the desired number of OSDs. This is more prevalent when there are + multiple hierarchy layers in use (e.g., ``row``, ``rack``, ``host``, ``osd``). + + * When some OSDs are marked out, the data tends to get redistributed + to nearby OSDs instead of across the entire hierarchy. + +The new tunables are: + + * ``choose_local_tries``: Number of local retries. Legacy value is + 2, optimal value is 0. + + * ``choose_local_fallback_tries``: Legacy value is 5, optimal value + is 0. + + * ``choose_total_tries``: Total number of attempts to choose an item. + Legacy value was 19; subsequent testing indicates that a value of + 50 is more appropriate for typical clusters. For extremely large + clusters, a larger value might be necessary. + + * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt + will retry, or only try once and allow the original placement to + retry. Legacy default is 0, optimal value is 1.
+ +Migration impact: + + * Moving from ``argonaut`` to ``bobtail`` tunables triggers a moderate amount + of data movement. Use caution on a cluster that is already + populated with data. + +firefly (CRUSH_TUNABLES3) +------------------------- + +The ``firefly`` tunable profile fixes a problem +with ``chooseleaf`` CRUSH rule behavior that tends to result in PG +mappings with too few results when too many OSDs have been marked out. + +The new tunable is: + + * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will + start with a non-zero value of ``r``, based on how many attempts the + parent has already made. Legacy default is ``0``, but with this value + CRUSH is sometimes unable to find a mapping. The optimal value (in + terms of computational cost and correctness) is ``1``. + +Migration impact: + + * For existing clusters that house lots of data, changing + from ``0`` to ``1`` will cause a lot of data to move; a value of ``4`` or ``5`` + will allow CRUSH to still find a valid mapping but will cause less data + to move. + +straw_calc_version tunable (introduced with Firefly too) +-------------------------------------------------------- + +There were some problems with the internal weights calculated and +stored in the CRUSH map for ``straw`` algorithm buckets. Specifically, when +there were items with a CRUSH weight of ``0``, or both a mix of different and +unique weights, CRUSH would distribute data incorrectly (i.e., +not in proportion to the weights). + +The new tunable is: + + * ``straw_calc_version``: A value of ``0`` preserves the old, broken + internal weight calculation; a value of ``1`` fixes the behavior. + +Migration impact: + + * Moving to straw_calc_version ``1`` and then adjusting a straw bucket + (by adding, removing, or reweighting an item, or by using the + reweight-all command) can trigger a small to moderate amount of + data movement *if* the cluster has hit one of the problematic + conditions. 
+ +This tunable option is special because it has no impact +on the kernel version required on the client side. + +hammer (CRUSH_V4) +----------------- + +The ``hammer`` tunable profile does not affect the +mapping of existing CRUSH maps simply by changing the profile. However: + + * A new bucket algorithm (``straw2``) is supported. The new + ``straw2`` bucket algorithm fixes several limitations in the original + ``straw``. Specifically, the old ``straw`` buckets would + change some mappings that should not have changed when a weight was + adjusted, while ``straw2`` achieves the original goal of only + changing mappings to or from the bucket item whose weight has + changed. + + * ``straw2`` is the default for any newly created buckets. + +Migration impact: + + * Changing a bucket type from ``straw`` to ``straw2`` will result in + a reasonably small amount of data movement, depending on how much + the bucket item weights vary from each other. When the weights are + all the same no data will move, and when item weights vary + significantly there will be more movement. + +jewel (CRUSH_TUNABLES5) +----------------------- + +The ``jewel`` tunable profile improves the +overall behavior of CRUSH such that significantly fewer mappings +change when an OSD is marked out of the cluster. This results in +significantly less data movement. + +The new tunable is: + + * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will + use a better value for an inner loop that greatly reduces the number + of mapping changes when an OSD is marked out. The legacy value is ``0``, + while the new value of ``1`` uses the new approach. + +Migration impact: + + * Changing this value on an existing cluster will result in a very + large amount of data movement as almost every PG mapping is likely + to change.
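The legacy and optimal values given in the profile descriptions above can be gathered into one table and compared against a cluster's current settings. The following is a minimal Python sketch, not part of Ceph: the names ``TUNABLES``, ``non_optimal``, ``legacy_values`` and ``example`` are hypothetical, and the input dictionary is assumed to use the same key names as the tunables listed above (for instance, values copied by hand from the output of ``ceph osd crush show-tunables``).

```python
# Legacy and optimal values exactly as stated in the profile descriptions.
# Each entry is (legacy, optimal).
TUNABLES = {
    "choose_local_tries": (2, 0),           # bobtail
    "choose_local_fallback_tries": (5, 0),  # bobtail
    "choose_total_tries": (19, 50),         # bobtail
    "chooseleaf_descend_once": (0, 1),      # bobtail
    "chooseleaf_vary_r": (0, 1),            # firefly
    "straw_calc_version": (0, 1),           # introduced with firefly
    "chooseleaf_stable": (0, 1),            # jewel
}


def non_optimal(current):
    """Return {name: (current_value, optimal_value)} for every known
    tunable whose current value differs from the optimal one."""
    return {
        name: (current[name], optimal)
        for name, (_legacy, optimal) in TUNABLES.items()
        if name in current and current[name] != optimal
    }


# Hypothetical inputs: an all-legacy cluster, and one that has adopted the
# bobtail and firefly values but not straw_calc_version or chooseleaf_stable.
legacy_values = {name: values[0] for name, values in TUNABLES.items()}
example = dict(
    legacy_values,
    choose_local_tries=0,
    choose_local_fallback_tries=0,
    choose_total_tries=50,
    chooseleaf_descend_once=1,
    chooseleaf_vary_r=1,
)

print(non_optimal(example))
```

A tunable reported by this check is only a hint. Whether to change it still depends on the migration impact and client support described for each profile.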
+ + + + +Which client versions support CRUSH_TUNABLES +-------------------------------------------- + + * argonaut series, v0.48.1 or later + * v0.49 or later + * Linux kernel version v3.6 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES2 +--------------------------------------------- + + * v0.55 or later, including bobtail series (v0.56.x) + * Linux kernel version v3.9 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES3 +--------------------------------------------- + + * v0.78 (firefly) or later + * Linux kernel version v3.15 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_V4 +-------------------------------------- + + * v0.94 (hammer) or later + * Linux kernel version v4.1 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES5 +--------------------------------------------- + + * v10.0.2 (jewel) or later + * Linux kernel version v4.5 or later (for the file system and RBD kernel clients) + +Warning when tunables are non-optimal +------------------------------------- + +Starting with version v0.74, Ceph will issue a health warning if the +current CRUSH tunables don't include all the optimal values from the +``default`` profile (see below for the meaning of the ``default`` profile). +To make this warning go away, you have two options: + +1. Adjust the tunables on the existing cluster. Note that this will + result in some data movement (possibly as much as 10%). This is the + preferred route, but should be taken with care on a production cluster + where the data movement may affect performance. You can enable optimal + tunables with: + + .. 
prompt:: bash $ + + ceph osd crush tunables optimal + + If things go poorly (e.g., too much load) and not very much + progress has been made, or there is a client compatibility problem + (old kernel CephFS or RBD clients, or pre-Bobtail ``librados`` + clients), you can switch back with: + + .. prompt:: bash $ + + ceph osd crush tunables legacy + +2. You can make the warning go away without making any changes to CRUSH by + adding the following option to your ceph.conf ``[mon]`` section:: + + mon warn on legacy crush tunables = false + + For the change to take effect, you will need to restart the monitors, or + apply the option to running monitors with: + + .. prompt:: bash $ + + ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false + + +A few important points +---------------------- + + * Adjusting these values will result in the shift of some PGs between + storage nodes. If the Ceph cluster is already storing a lot of + data, be prepared for some fraction of the data to move. + * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the + feature bits of new connections as soon as they get + the updated map. However, already-connected clients are + effectively grandfathered in, and will misbehave if they do not + support the new feature. + * If the CRUSH tunables are set to non-legacy values and then later + changed back to the default values, ``ceph-osd`` daemons will not be + required to support the feature. However, the OSD peering process + requires examining and understanding old maps. Therefore, you + should not run old versions of the ``ceph-osd`` daemon + if the cluster has previously used non-legacy CRUSH values, even if + the latest version of the map has been switched back to using the + legacy defaults. + +Tuning CRUSH +------------ + +The simplest way to adjust CRUSH tunables is by applying them in matched +sets known as *profiles*. 
As of the Octopus release these are: + + * ``legacy``: the legacy behavior from argonaut and earlier. + * ``argonaut``: the legacy values supported by the original argonaut release + * ``bobtail``: the values supported by the bobtail release + * ``firefly``: the values supported by the firefly release + * ``hammer``: the values supported by the hammer release + * ``jewel``: the values supported by the jewel release + * ``optimal``: the best (i.e. optimal) values of the current version of Ceph + * ``default``: the default values of a new cluster installed from + scratch. These values, which depend on the current version of Ceph, + are hardcoded and are generally a mix of optimal and legacy values. + These values generally match the ``optimal`` profile of the previous + LTS release, or the most recent release for which we generally expect + most users to have up-to-date clients for. + +You can apply a profile to a running cluster with the command: + +.. prompt:: bash $ + + ceph osd crush tunables {PROFILE} + +Note that this may result in data movement, potentially quite a bit. Study +release notes and documentation carefully before changing the profile on a +running cluster, and consider throttling recovery/backfill parameters to +limit the impact of a bolus of backfill. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf + + +Primary Affinity +================ + +When a Ceph Client reads or writes data, it first contacts the primary OSD in +each affected PG's acting set. By default, the first OSD in the acting set is +the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is +listed first and thus is the primary (aka lead) OSD. Sometimes we know that an +OSD is less well suited to act as the lead than are other OSDs (e.g., it has +a slow drive or a slow controller). 
To prevent performance bottlenecks +(especially on read operations) while maximizing utilization of your hardware, +you can influence the selection of primary OSDs by adjusting primary affinity +values, or by crafting a CRUSH rule that selects preferred OSDs first. + +Tuning primary OSD selection is mainly useful for replicated pools, because +by default read operations are served from the primary OSD for each PG. +For erasure coded (EC) pools, a way to speed up read operations is to enable +**fast read** as described in :ref:`pool-settings`. + +A common scenario for primary affinity is when a cluster contains +a mix of drive sizes, for example older racks with 1.9 TB SATA SSDs and newer racks with +3.84 TB SATA SSDs. On average the latter will be assigned double the number of +PGs and will serve double the number of write and read operations, so +they'll be busier than the former. A rough assignment of primary affinity +inversely proportional to OSD size won't be 100% optimal, but it can readily +achieve a 15% improvement in overall read throughput by utilizing SATA +interface bandwidth and CPU cycles more evenly. + +By default, all Ceph OSDs have primary affinity of ``1``, which indicates that +any OSD may act as a primary with equal probability. + +You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to +choose the OSD as primary in a PG's acting set: + +.. prompt:: bash $ + + ceph osd primary-affinity <osd-id> <weight> + +You may set an OSD's primary affinity to a real number in the range ``[0-1]``, +where ``0`` indicates that the OSD may **NOT** be used as a primary and ``1`` +indicates that an OSD may be used as a primary. When the weight is between +these extremes, it is less likely that CRUSH will select that OSD as a primary. +The process for selecting the lead OSD is more nuanced than a simple +probability based on relative affinity values, but measurable results can be +achieved even with first-order approximations of desirable values.
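The "inversely proportional to OSD size" assignment mentioned above can be sketched in a few lines of Python. This is only a first-order approximation under stated assumptions: the helper name ``rough_primary_affinity`` and the OSD names and sizes are hypothetical, and the smallest OSD is given an affinity of ``1`` so that every value stays within the allowed range.

```python
def rough_primary_affinity(sizes_tb):
    """Map each OSD to an affinity inversely proportional to its size.

    The smallest OSD gets 1.0; an OSD twice as large gets 0.5, and so on.
    Sizes must be positive numbers (any consistent unit).
    """
    smallest = min(sizes_tb.values())
    return {osd: round(smallest / size, 2) for osd, size in sizes_tb.items()}


# Hypothetical cluster: two older 1.92 TB OSDs and one newer 3.84 TB OSD.
sizes = {"osd.0": 1.92, "osd.1": 1.92, "osd.2": 3.84}

# Print the commands rather than running them, so they can be reviewed first.
for osd, affinity in rough_primary_affinity(sizes).items():
    print(f"ceph osd primary-affinity {osd} {affinity}")
```

As the text notes, primary selection is more nuanced than a simple probability, so values computed this way are a starting point to be checked against observed load.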
+ +Custom CRUSH Rules +------------------ + +There are occasional clusters that balance cost and performance by mixing SSDs +and HDDs in the same replicated pool. By setting the primary affinity of HDD +OSDs to ``0`` one can direct operations to the SSD in each acting set. An +alternative is to define a CRUSH rule that always selects an SSD OSD as the +first OSD, then selects HDDs for the remaining OSDs. Thus, each PG's acting +set will contain exactly one SSD OSD as the primary with the balance on HDDs. + +For example, the CRUSH rule below:: + + rule mixed_replicated_rule { + id 11 + type replicated + min_size 1 + max_size 10 + step take default class ssd + step chooseleaf firstn 1 type host + step emit + step take default class hdd + step chooseleaf firstn 0 type host + step emit + } + +chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool +this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different +hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD +OSDs. + +This extra storage requirement can be avoided by placing SSDs and HDDs in +different hosts with the tradeoff that hosts with SSDs will receive all client +requests. You may thus consider faster CPU(s) for SSD hosts and more modest +ones for HDD nodes, since the latter will normally only service recovery +operations. Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` strictly +must not contain the same servers:: + + rule mixed_replicated_rule_two { + id 1 + type replicated + min_size 1 + max_size 10 + step take ssd_hosts class ssd + step chooseleaf firstn 1 type host + step emit + step take hdd_hosts class hdd + step chooseleaf firstn -1 type host + step emit + } + + +Note also that on failure of an SSD, requests to a PG will be served temporarily +from a (slower) HDD OSD until the PG's data has been replicated onto the replacement +primary SSD OSD. 
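The difference between the two rules above comes down to the ``firstn`` argument of the second ``chooseleaf`` step. The Python sketch below counts how many OSDs each rule asks for, using the usual ``firstn`` convention (a positive number selects that many items, ``0`` selects as many as the pool has replicas, and a negative number selects the replica count minus that many). The function names are hypothetical, and the sketch counts requested items only; it does not model how CRUSH picks them.

```python
def firstn_count(n, pool_size):
    """Number of items a 'firstn n' step requests for a pool of pool_size replicas."""
    if n > 0:
        return n
    # n == 0 -> pool_size; n < 0 -> pool_size - |n| (never below zero)
    return max(pool_size + n, 0)


def osds_requested(firstn_args, pool_size):
    """Total items requested by a rule, given the firstn argument of each step."""
    return sum(firstn_count(n, pool_size) for n in firstn_args)


# firstn arguments of the chooseleaf steps in the two rules above.
mixed_replicated_rule = [1, 0]       # one SSD, then pool_size HDDs
mixed_replicated_rule_two = [1, -1]  # one SSD, then pool_size - 1 HDDs

for size in (2, 3):
    print(size,
          osds_requested(mixed_replicated_rule, size),
          osds_requested(mixed_replicated_rule_two, size))
```

For a three-replica pool the first rule requests four OSDs (``N+1``) and the second requests three, which matches the extra storage requirement described above and how separate SSD and HDD roots avoid it.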
+ diff --git a/doc/rados/operations/data-placement.rst b/doc/rados/operations/data-placement.rst new file mode 100644 index 000000000..bd9bd7ec7 --- /dev/null +++ b/doc/rados/operations/data-placement.rst @@ -0,0 +1,43 @@ +========================= + Data Placement Overview +========================= + +Ceph stores, replicates and rebalances data objects across a RADOS cluster +dynamically. With many different users storing objects in different pools for +different purposes on countless OSDs, Ceph operations require some data +placement planning. The main data placement planning concepts in Ceph include: + +- **Pools:** Ceph stores data within pools, which are logical groups for storing + objects. Pools manage the number of placement groups, the number of replicas, + and the CRUSH rule for the pool. To store data in a pool, you must have + an authenticated user with permissions for the pool. Ceph can snapshot pools. + See `Pools`_ for additional details. + +- **Placement Groups:** Ceph maps objects to placement groups (PGs). + Placement groups (PGs) are shards or fragments of a logical object pool + that place objects as a group into OSDs. Placement groups reduce the amount + of per-object metadata when Ceph stores the data in OSDs. A larger number of + placement groups (e.g., 100 per OSD) leads to better balancing. See + `Placement Groups`_ for additional details. + +- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without + performance bottlenecks, without limitations to scalability, and without a + single point of failure. CRUSH maps provide the physical topology of the + cluster to the CRUSH algorithm to determine where the data for an object + and its replicas should be stored, and how to do so across failure domains + for added data safety among other things. See `CRUSH Maps`_ for additional + details. 
+ +- **Balancer:** The balancer is a feature that will automatically optimize the + distribution of PGs across devices to achieve a balanced data distribution, + maximizing the amount of data that can be stored in the cluster and evenly + distributing the workload across OSDs. + +When you initially set up a test cluster, you can use the default values. Once +you begin planning for a large Ceph cluster, refer to pools, placement groups +and CRUSH for data placement operations. + +.. _Pools: ../pools +.. _Placement Groups: ../placement-groups +.. _CRUSH Maps: ../crush-map +.. _Balancer: ../balancer diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst new file mode 100644 index 000000000..1b6eaebde --- /dev/null +++ b/doc/rados/operations/devices.rst @@ -0,0 +1,208 @@ +.. _devices: + +Device Management +================= + +Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by +which daemons, and collects health metrics about those devices in order to +provide tools to predict and/or automatically respond to hardware failure. + +Device tracking +--------------- + +You can query which storage devices are in use with: + +.. prompt:: bash $ + + ceph device ls + +You can also list devices by daemon or by host: + +.. prompt:: bash $ + + ceph device ls-by-daemon + ceph device ls-by-host + +For any individual device, you can query information about its +location and how it is being consumed with: + +.. prompt:: bash $ + + ceph device info + +Identifying physical devices +---------------------------- + +You can blink the drive LEDs on hardware enclosures to make the replacement of +failed disks easy and less error-prone. Use the following command:: + + device light on|off [ident|fault] [--force] + +The ```` parameter is the device identification. You can obtain this +information using the following command: + +.. prompt:: bash $ + + ceph device ls + +The ``[ident|fault]`` parameter is used to set the kind of light to blink. 
+By default, the `identification` light is used. + +.. note:: + This command needs the Cephadm or the Rook `orchestrator `_ module enabled. + The orchestrator module enabled is shown by executing the following command: + + .. prompt:: bash $ + + ceph orch status + +The command behind the scene to blink the drive LEDs is `lsmcli`. If you need +to customize this command you can configure this via a Jinja2 template:: + + ceph config-key set mgr/cephadm/blink_device_light_cmd "