diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst
new file mode 100644
index 000000000..1ac5f2b13
--- /dev/null
+++ b/doc/rados/operations/bluestore-migration.rst
@@ -0,0 +1,338 @@
+=====================
+ BlueStore Migration
+=====================
+
+Each OSD can run either BlueStore or FileStore, and a single Ceph
+cluster can contain a mix of both. Users who have previously deployed
+FileStore are likely to want to transition to BlueStore in order to
+take advantage of the improved performance and robustness. There are
+several strategies for making such a transition.
+
+An individual OSD cannot, however, be converted in place:
+BlueStore and FileStore are simply too different for that to be
+practical. "Conversion" relies either on the cluster's normal
+replication and healing support, or on tools and strategies that copy
+OSD content from an old (FileStore) device to a new (BlueStore) one.
+
+
+Deploy new OSDs with BlueStore
+==============================
+
+Any new OSDs (e.g., when the cluster is expanded) can be deployed
+using BlueStore. This is the default behavior, so no specific change
+is needed.
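+
+For example, provisioning a new OSD with ``ceph-volume`` creates a
+BlueStore OSD by default (the device name below is hypothetical):
+
+.. prompt:: bash $
+
+   ceph-volume lvm create --data /dev/sdX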
+
+Similarly, any OSDs that are reprovisioned after replacing a failed drive
+can use BlueStore.
+
+Convert existing OSDs
+=====================
+
+Mark out and replace
+--------------------
+
+The simplest approach is to mark out each device in turn, wait for the
+data to replicate across the cluster, reprovision the OSD, and mark
+it back in again. It is easy to automate. However, it requires more
+data migration than is strictly necessary, so it is not optimal.
+
+#. Identify a FileStore OSD to replace::
+
+ ID=<osd-id-number>
+ DEVICE=<disk-device>
+
+ You can tell whether a given OSD is FileStore or BlueStore with:
+
+ .. prompt:: bash $
+
+ ceph osd metadata $ID | grep osd_objectstore
+
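+   A FileStore OSD will report something like::
+
+     "osd_objectstore": "filestore",
+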
+   You can get a current count of FileStore and BlueStore OSDs with:
+
+ .. prompt:: bash $
+
+ ceph osd count-metadata osd_objectstore
+
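+   The output tallies OSDs by object store type; a hypothetical
+   cluster in mid-migration might report something like::
+
+     {
+         "filestore": 10,
+         "bluestore": 2
+     }
+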
+#. Mark the filestore OSD out:
+
+ .. prompt:: bash $
+
+ ceph osd out $ID
+
+#. Wait for the data to migrate off the OSD in question:
+
+ .. prompt:: bash $
+
+ while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
+
+#. Stop the OSD:
+
+ .. prompt:: bash $
+
+ systemctl kill ceph-osd@$ID
+
+#. Make note of which device this OSD is using:
+
+ .. prompt:: bash $
+
+ mount | grep /var/lib/ceph/osd/ceph-$ID
+
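+   The output shows the backing device; for example (with a
+   hypothetical device name)::
+
+     /dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime)
+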
+#. Unmount the OSD:
+
+ .. prompt:: bash $
+
+ umount /var/lib/ceph/osd/ceph-$ID
+
+#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
+ the contents of the device; be certain the data on the device is
+ not needed (i.e., that the cluster is healthy) before proceeding:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm zap $DEVICE
+
+#. Tell the cluster the OSD has been destroyed (and a new OSD can be
+ reprovisioned with the same ID):
+
+ .. prompt:: bash $
+
+ ceph osd destroy $ID --yes-i-really-mean-it
+
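+   The OSD should now be reported with a ``destroyed`` status in the
+   tree output:
+
+   .. prompt:: bash $
+
+      ceph osd tree | grep osd.$ID
+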
+#. Reprovision a BlueStore OSD in its place with the same OSD ID.
+   This requires you to identify which device to wipe based on what
+   you saw mounted above. BE CAREFUL!:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
+
+#. Repeat.
+
+You can allow the refilling of the replacement OSD to happen
+concurrently with the draining of the next OSD, or follow the same
+procedure for multiple OSDs in parallel, as long as you ensure the
+cluster is fully clean (all data has all replicas) before destroying
+any OSDs. Failure to do so will reduce the redundancy of your data
+and increase the risk of (or potentially even cause) data loss.
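+
+For example, a minimal sketch that drains several OSDs concurrently
+(the OSD IDs below are hypothetical, and this assumes you have
+verified that the cluster can tolerate all of them being out at once):
+
+.. prompt:: bash $
+
+   for ID in 3 7 11 ; do ceph osd out $ID ; done
+   while ! ceph osd safe-to-destroy 3 7 11 ; do sleep 60 ; done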
+
+Advantages:
+
+* Simple.
+* Can be done on a device-by-device basis.
+* No spare devices or hosts are required.
+
+Disadvantages:
+
+* Data is copied over the network twice: once to some other OSD in the
+ cluster (to maintain the desired number of replicas), and then again
+ back to the reprovisioned BlueStore OSD.
+
+
+Whole host replacement
+----------------------
+
+If you have a spare host in the cluster, or have sufficient free space
+to evacuate an entire host in order to use it as a spare, then the
+conversion can be done on a host-by-host basis with each stored copy of
+the data migrating only once.
+
+First, you need an empty host that has no data. There are two ways to
+obtain one: either start with a new, empty host that isn't yet part of
+the cluster, or offload the data from an existing host that is already
+in the cluster.
+
+Use a new, empty host
+^^^^^^^^^^^^^^^^^^^^^
+
+Ideally the host should have roughly the
+same capacity as other hosts you will be converting (although it
+doesn't strictly matter). ::
+
+ NEWHOST=<empty-host-name>
+
+Add the host to the CRUSH hierarchy, but do not attach it to the root:
+
+.. prompt:: bash $
+
+ ceph osd crush add-bucket $NEWHOST host
+
+Make sure the ceph packages are installed.
+
+Use an existing host
+^^^^^^^^^^^^^^^^^^^^
+
+If you would like to use an existing host
+that is already part of the cluster, and there is sufficient free
+space on that host so that all of its data can be migrated off,
+then you can instead do::
+
+ OLDHOST=<existing-cluster-host-to-offload>
+
+.. prompt:: bash $
+
+ ceph osd crush unlink $OLDHOST default
+
+where ``default`` here is the immediate ancestor of the host in the
+CRUSH map. (For smaller clusters with unmodified configurations this
+will normally be "default", but it might also be a rack name.) You
+should now see the host at the top of the OSD tree output with no
+parent:
+
+.. prompt:: bash $
+
+   ceph osd tree
+
+::
+
+ ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
+ -5 0 host oldhost
+ 10 ssd 1.00000 osd.10 up 1.00000 1.00000
+ 11 ssd 1.00000 osd.11 up 1.00000 1.00000
+ 12 ssd 1.00000 osd.12 up 1.00000 1.00000
+ -1 3.00000 root default
+ -2 3.00000 host foo
+ 0 ssd 1.00000 osd.0 up 1.00000 1.00000
+ 1 ssd 1.00000 osd.1 up 1.00000 1.00000
+ 2 ssd 1.00000 osd.2 up 1.00000 1.00000
+ ...
+
+If everything looks good, jump directly to the "Wait for data
+migration to complete" step below and proceed from there to clean up
+the old OSDs.
+
+Migration process
+^^^^^^^^^^^^^^^^^
+
+If you're using a new host, start at step #1. For an existing host,
+jump to step #5 below.
+
+#. Provision new BlueStore OSDs for all devices:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm create --bluestore --data /dev/$DEVICE
+
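+   Repeat for each data device on the host; for example, with
+   hypothetical device names::
+
+     for DEVICE in sdb sdc sdd ; do
+         ceph-volume lvm create --bluestore --data /dev/$DEVICE
+     done
+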
+#. Verify OSDs join the cluster with:
+
+ .. prompt:: bash $
+
+ ceph osd tree
+
+   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
+   it, but the host should *not* be nested beneath any other node in
+   the hierarchy (like ``root default``). For example, if ``newhost``
+   is the empty host, you might see something like::
+
+     $ ceph osd tree
+ ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
+ -5 0 host newhost
+ 10 ssd 1.00000 osd.10 up 1.00000 1.00000
+ 11 ssd 1.00000 osd.11 up 1.00000 1.00000
+ 12 ssd 1.00000 osd.12 up 1.00000 1.00000
+ -1 3.00000 root default
+ -2 3.00000 host oldhost1
+ 0 ssd 1.00000 osd.0 up 1.00000 1.00000
+ 1 ssd 1.00000 osd.1 up 1.00000 1.00000
+ 2 ssd 1.00000 osd.2 up 1.00000 1.00000
+ ...
+
+#. Identify the first target host to convert:
+
+ .. prompt:: bash $
+
+ OLDHOST=<existing-cluster-host-to-convert>
+
+#. Swap the new host into the old host's position in the cluster:
+
+ .. prompt:: bash $
+
+ ceph osd crush swap-bucket $NEWHOST $OLDHOST
+
+ At this point all data on ``$OLDHOST`` will start migrating to OSDs
+ on ``$NEWHOST``. If there is a difference in the total capacity of
+ the old and new hosts you may also see some data migrate to or from
+ other nodes in the cluster, but as long as the hosts are similarly
+ sized this will be a relatively small amount of data.
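+
+   You can monitor the progress of this migration with:
+
+   .. prompt:: bash $
+
+      ceph -s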
+
+#. Wait for data migration to complete:
+
+ .. prompt:: bash $
+
+ while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done
+
+#. Stop all old OSDs on the now-empty ``$OLDHOST``:
+
+ .. prompt:: bash $
+
+ ssh $OLDHOST
+ systemctl kill ceph-osd.target
+ umount /var/lib/ceph/osd/ceph-*
+
+#. Destroy and purge the old OSDs:
+
+ .. prompt:: bash $
+
+ for osd in `ceph osd ls-tree $OLDHOST`; do
+ ceph osd purge $osd --yes-i-really-mean-it
+ done
+
+#. Wipe the old OSD devices. This requires you to manually identify
+   which devices are to be wiped (BE CAREFUL!). For each device:
+
+ .. prompt:: bash $
+
+ ceph-volume lvm zap $DEVICE
+
+#. Use the now-empty host as the new host, and repeat::
+
+ NEWHOST=$OLDHOST
+
+Advantages:
+
+* Data is copied over the network only once.
+* Converts an entire host's OSDs at once.
+* Can be parallelized to convert multiple hosts at a time.
+* No spare devices are required on each host.
+
+Disadvantages:
+
+* A spare host is required.
+* An entire host's worth of OSDs will be migrating data at a time. This
+  is likely to impact overall cluster performance.
+* All migrated data still makes one full hop over the network.
+
+
+Per-OSD device copy
+-------------------
+
+A single logical OSD can be converted by using the ``copy`` function
+of ``ceph-objectstore-tool``. This requires that the host have a free
+device (or devices) to provision a new, empty BlueStore OSD. For
+example, if each host in your cluster has 12 OSDs, then you'd need a
+13th available device so that each OSD can be converted in turn before the
+old device is reclaimed to convert the next OSD.
+
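+A very rough sketch of the kind of per-PG export and import such a
+copy would involve is shown below. This is conceptual only: the
+procedure is neither documented nor supported, both OSDs must be
+stopped while it runs, and the paths and IDs are hypothetical::
+
+  # List the PGs held by the old (stopped) FileStore OSD.
+  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID --op list-pgs
+
+  # For each PG, export it from the old OSD and import it into the
+  # new (also stopped) BlueStore OSD.
+  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID \
+      --pgid $PGID --op export --file /tmp/$PGID.export
+  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$NEWID \
+      --op import --file /tmp/$PGID.export
+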
+Caveats:
+
+* This strategy requires that a blank BlueStore OSD be prepared
+ without allocating a new OSD ID, something that the ``ceph-volume``
+ tool doesn't support. More importantly, the setup of *dmcrypt* is
+ closely tied to the OSD identity, which means that this approach
+ does not work with encrypted OSDs.
+
+* The device must be manually partitioned.
+
+* Tooling not implemented!
+
+* Not documented!
+
+Advantages:
+
+* Little or no data migrates over the network during the conversion.
+
+Disadvantages:
+
+* Tooling not fully implemented.
+* Process not documented.
+* Each host must have a spare or empty device.
+* The OSD is offline during the conversion, which means new writes will
+ be written to only a subset of the OSDs. This increases the risk of data
+ loss due to a subsequent failure. (However, if there is a failure before
+ conversion is complete, the original FileStore OSD can be started to provide
+ access to its original data.)