From 19fcec84d8d7d21e796c7624e521b60d28ee21ed Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 7 Apr 2024 20:45:59 +0200
Subject: Adding upstream version 16.2.11+ds.

Signed-off-by: Daniel Baumann
---
 doc/rados/operations/bluestore-migration.rst | 338 +++++++++++++++++++++++++++
 1 file changed, 338 insertions(+)
 create mode 100644 doc/rados/operations/bluestore-migration.rst

diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst
new file mode 100644
index 000000000..1ac5f2b13
--- /dev/null
+++ b/doc/rados/operations/bluestore-migration.rst
@@ -0,0 +1,338 @@

=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or FileStore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
FileStore are likely to want to transition to BlueStore in order to
take advantage of the improved performance and robustness. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place in isolation, however:
BlueStore and FileStore are simply too different for that to be
practical. "Conversion" will rely either on the cluster's normal
replication and healing support or on tools and strategies that copy OSD
content from an old (FileStore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

Any new OSDs (e.g., when the cluster is expanded) can be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
can use BlueStore.

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to mark out each device in turn, wait for the
data to replicate across the cluster, reprovision the OSD, and mark
it back in again. This is easy to automate, but it requires more data
migration than is strictly necessary, so it is not optimal.

#. Identify a FileStore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<device-path>

   You can tell whether a given OSD is FileStore or BlueStore with:

   .. prompt:: bash $

      ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of FileStore vs. BlueStore OSDs with:

   .. prompt:: bash $

      ceph osd count-metadata osd_objectstore

#. Mark the FileStore OSD out:

   .. prompt:: bash $

      ceph osd out $ID

#. Wait for the data to migrate off the OSD in question:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD:

   .. prompt:: bash $

      systemctl kill ceph-osd@$ID

#. Make note of which device this OSD is using:

   .. prompt:: bash $

      mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD:

   .. prompt:: bash $

      umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*, as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
   reprovisioned with the same ID):

   .. prompt:: bash $

      ceph osd destroy $ID --yes-i-really-mean-it

#. Reprovision a BlueStore OSD in its place with the same OSD ID.
   This requires that you identify which device to wipe based on what you saw
   mounted above. BE CAREFUL:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.

You can allow the refilling of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. Failure to do so will reduce the redundancy of your data
and increase the risk of (or potentially even cause) data loss.
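Because each step above is a single command, the procedure is easy to script.
The following is a minimal sketch only, not a supported tool: it assumes it is
run on the OSD host itself, that the cluster is healthy before each
conversion, and that the OSD IDs and device paths passed at the bottom are
placeholders you replace based on the ``mount`` output noted above::

    #!/bin/bash
    # Hypothetical helper: convert one FileStore OSD in place,
    # following the mark-out-and-replace steps documented above.
    convert_osd() {
        local id=$1 device=$2

        ceph osd out $id

        # Wait until all data has been replicated elsewhere.
        while ! ceph osd safe-to-destroy $id; do sleep 60; done

        systemctl kill ceph-osd@$id
        umount /var/lib/ceph/osd/ceph-$id

        # DESTROYS the device contents; double-check the id/device pairing!
        ceph-volume lvm zap $device
        ceph osd destroy $id --yes-i-really-mean-it
        ceph-volume lvm create --bluestore --data $device --osd-id $id
    }

    # Example (placeholder values): convert two OSDs, one after the other.
    convert_osd 3 /dev/sdc
    convert_osd 4 /dev/sdd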
Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis so that each stored copy of
the data migrates only once.

First, you need an empty host that has no data. There are two ways to get
one: either start with a new, empty host that isn't yet part of the
cluster, or offload the data from an existing host that is already in the
cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the same capacity as the other hosts
you will be converting (although it doesn't strictly matter). ::

  NEWHOST=<empty-host-name>

Add the host to the CRUSH hierarchy, but do not attach it to the root:

.. prompt:: bash $

   ceph osd crush add-bucket $NEWHOST host

Make sure that the Ceph packages are installed on the new host.

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space elsewhere in the cluster so that all of the host's data can be
migrated off, then you can instead do::

  OLDHOST=<existing-host-to-offload>

.. prompt:: bash $

   ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
see the host at the top of the OSD tree output with no parent:

.. prompt:: bash $

   ceph osd tree

::

    ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
    -5             0 host oldhost
    10   ssd 1.00000     osd.10        up  1.00000 1.00000
    11   ssd 1.00000     osd.11        up  1.00000 1.00000
    12   ssd 1.00000     osd.12        up  1.00000 1.00000
    -1       3.00000 root default
    -2       3.00000     host foo
     0   ssd 1.00000         osd.0     up  1.00000 1.00000
     1   ssd 1.00000         osd.1     up  1.00000 1.00000
     2   ssd 1.00000         osd.2     up  1.00000 1.00000
    ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.
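If you want to confirm programmatically that the unlinked host is no longer
reachable from the rest of the CRUSH tree, you can compare its OSDs against
those still under the root. This is only a sketch; it assumes the root bucket
is named ``default`` and that ``$OLDHOST`` is set as above::

    # OSDs that belong to the host that was just unlinked
    ceph osd ls-tree $OLDHOST | sort > /tmp/host-osds

    # OSDs still reachable under the "default" root
    ceph osd ls-tree default | sort > /tmp/root-osds

    # No output means no overlap, i.e. the host is fully detached.
    comm -12 /tmp/host-osds /tmp/root-osds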
Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. If you're using an existing
host, jump to step #5 below.

#. Provision new BlueStore OSDs for all devices:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the OSDs have joined the cluster:

   .. prompt:: bash $

      ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
     -5             0 host newhost
     10   ssd 1.00000     osd.10        up  1.00000 1.00000
     11   ssd 1.00000     osd.11        up  1.00000 1.00000
     12   ssd 1.00000     osd.12        up  1.00000 1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0   ssd 1.00000         osd.0     up  1.00000 1.00000
      1   ssd 1.00000         osd.1     up  1.00000 1.00000
      2   ssd 1.00000         osd.2     up  1.00000 1.00000
     ...

#. Identify the first target host to convert:

   .. prompt:: bash $

      OLDHOST=<existing-host-to-convert>

#. Swap the new host into the old host's position in the cluster:

   .. prompt:: bash $

      ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to the OSDs
   on ``$NEWHOST``. If there is a difference in the total capacity of
   the old and new hosts you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.

#. Wait for the data migration to complete:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``:

   .. prompt:: bash $

      ssh $OLDHOST
      systemctl kill ceph-osd.target
      umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs:

   .. prompt:: bash $

      for osd in `ceph osd ls-tree $OLDHOST`; do
          ceph osd purge $osd --yes-i-really-mean-it
      done

#. Wipe the old OSDs' devices. This requires that you identify manually
   which devices are to be wiped (BE CAREFUL!). For each device:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat::

     NEWHOST=$OLDHOST
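Steps #3 through #9 can be strung together into a rough per-host pass. The
following is only a sketch of that idea, not a supported tool: it assumes
``$NEWHOST`` has already been provisioned with BlueStore OSDs as described
above, that passwordless SSH to the old host works, and that the host names
are placeholders you fill in. Zapping and reprovisioning the old devices
(step #8) is deliberately left as a manual action::

    NEWHOST=<empty-host-name>              # spare host provisioned in steps #1-2
    OLDHOST=<existing-host-to-convert>

    # Move the new host into the old host's CRUSH position; data starts migrating.
    ceph osd crush swap-bucket $NEWHOST $OLDHOST

    # Wait until every OSD on the old host is safe to destroy.
    while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do
        sleep 60
    done

    # Stop and unmount the old OSDs, then purge them from the cluster.
    ssh $OLDHOST 'systemctl kill ceph-osd.target && umount /var/lib/ceph/osd/ceph-*'
    for osd in $(ceph osd ls-tree $OLDHOST); do
        ceph osd purge $osd --yes-i-really-mean-it
    done

    # After zapping its devices by hand (BE CAREFUL!) and reprovisioning them
    # with BlueStore, the now-empty host becomes the spare for the next pass:
    NEWHOST=$OLDHOST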
Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) on which to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has 12 OSDs, then you'd need a
13th available device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.

Caveats:

* This strategy requires that a blank BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* Tooling not implemented!

* Not documented!

Advantages:

* Little or no data migrates over the network during the conversion.

Disadvantages:

* Tooling not fully implemented.
* Process not documented.
* Each host must have a spare or empty device.
* The OSD is offline during the conversion, which means new writes will
  be written to only a subset of the OSDs. This increases the risk of data
  loss due to a subsequent failure. (However, if there is a failure before
  the conversion is complete, the original FileStore OSD can be started again
  to provide access to its original data.)