.. _rados_operations_bluestore_migration:

=====================
 BlueStore Migration
=====================
.. warning:: Filestore has been deprecated in the Reef release and is no longer
   supported. Please migrate to BlueStore.

Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph
cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs.
Because BlueStore is superior to Filestore in performance and robustness, and
because Filestore is not supported by Ceph releases beginning with Reef, users
deploying Filestore OSDs should transition to BlueStore. There are several
strategies for making the transition to BlueStore.

BlueStore is so different from Filestore that an individual OSD cannot be
converted in place. Instead, the conversion process must use either (1) the
cluster's normal replication and healing support, or (2) tools and strategies
that copy OSD content from an old (Filestore) device to a new (BlueStore) one.

Deploying new OSDs with BlueStore
=================================

Use BlueStore when deploying new OSDs (for example, when the cluster is
expanded). Because this is the default behavior, no specific change is
needed.

Similarly, use BlueStore for any OSDs that have been reprovisioned after
a failed drive was replaced.
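
Because BlueStore is the default object store, no extra flags are needed when a
replacement OSD is provisioned with ``ceph-volume``; the explicit
``--bluestore`` flag shown here is optional, and ``<device>`` is a placeholder
for the actual device:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data /dev/<device>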

Converting existing OSDs
========================

"Mark-``out``" replacement
--------------------------

The simplest approach is to verify that the cluster is healthy and
then follow these steps for each Filestore OSD in succession: mark the OSD
``out``, wait for the data to replicate across the cluster, reprovision the OSD, 
mark the OSD back ``in``, and wait for recovery to complete before proceeding
to the next OSD. This approach is easy to automate, but it entails unnecessary
data migration that carries costs in time and SSD wear.

#. Identify a Filestore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   #. Determine whether a given OSD is Filestore or BlueStore:

      .. prompt:: bash $

         ceph osd metadata $ID | grep osd_objectstore

   #. Get a current count of Filestore and BlueStore OSDs:

      .. prompt:: bash $

         ceph osd count-metadata osd_objectstore

#. Mark a Filestore OSD ``out``:

   .. prompt:: bash $

      ceph osd out $ID

#. Wait for the data to migrate off this OSD:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD:

   .. prompt:: bash $

      systemctl kill ceph-osd@$ID

   .. _osd_id_retrieval: 

#. Note which device the OSD is using:

   .. prompt:: bash $

      mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD:

   .. prompt:: bash $

      umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! This command will destroy
   the contents of the device; you must be certain that the data on the device
   is not needed (in other words, that the cluster is healthy) before
   proceeding:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
   reprovisioned with the same OSD ID):

   .. prompt:: bash $

      ceph osd destroy $ID --yes-i-really-mean-it

#. Provision a BlueStore OSD in place by using the same OSD ID. This requires
   you to identify the correct device and to make certain that you target the
   intended device, using the information that was retrieved in the
   :ref:`"Note which device the OSD is using" <osd_id_retrieval>` step.  BE
   CAREFUL! Note that you may need to modify this command when dealing with
   hybrid OSDs (see the sketch after this list):

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.
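
For hybrid OSDs that place the BlueStore DB (and optionally the WAL) on a
separate, faster device, ``ceph-volume`` accepts additional device arguments.
A minimal sketch, in which ``$DB_DEVICE`` is a placeholder for the fast device
or logical volume:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data $DEVICE --block.db $DB_DEVICE --osd-id $ID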

You may opt to (1) have the balancing of the replacement BlueStore OSD take
place concurrently with the draining of the next Filestore OSD, or instead
(2) follow the same procedure for multiple OSDs in parallel. In either case,
however, you must ensure that the cluster is fully clean (in other words, that
all data has all replicas) before destroying any OSDs. If you opt to reprovision
multiple OSDs in parallel, be **very** careful to destroy OSDs only within a
single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to
satisfy this requirement will reduce the redundancy and availability of your
data and increase the risk of data loss (or even guarantee data loss).
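
Because ``ceph osd safe-to-destroy`` accepts more than one OSD ID, one way to
gate a parallel batch is to wait until every OSD in the batch is reported safe
at the same time. A minimal sketch, assuming that the IDs in ``$IDS`` all
belong to OSDs within a single CRUSH failure domain:

.. prompt:: bash $

   IDS="<osd-id-number> <osd-id-number> <osd-id-number>"
   while ! ceph osd safe-to-destroy $IDS ; do sleep 60 ; done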

Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to another OSD in the cluster (to
  maintain the specified number of replicas), and again back to the
  reprovisioned BlueStore OSD.

"Whole host" replacement
------------------------

If you have a spare host in the cluster, or sufficient free space to evacuate
an entire host for use as a spare, then the conversion can be done on a
host-by-host basis so that each stored copy of the data is migrated only once.

To use this approach, you need an empty host that has no OSDs provisioned.
There are two ways to do this: either by using a new, empty host that is not
yet part of the cluster, or by offloading data from an existing host that is
already part of the cluster.

Using a new, empty host
^^^^^^^^^^^^^^^^^^^^^^^

Ideally the host will have roughly the same capacity as each of the other hosts
you will be converting.  Add the host to the CRUSH hierarchy, but do not attach
it to the root:


.. prompt:: bash $

   NEWHOST=<empty-host-name>
   ceph osd crush add-bucket $NEWHOST host

Make sure that Ceph packages are installed on the new host.
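
How the packages get onto the host depends on your distribution and on how the
rest of the cluster was deployed; the following is only a sketch, assuming a
Debian or Ubuntu host that already has the appropriate Ceph repository
configured (on recent releases ``ceph-volume`` is packaged separately and may
need to be installed as well):

.. prompt:: bash $

   apt install ceph-osd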

Using an existing host
^^^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host that is already part of the cluster,
and if there is sufficient free space elsewhere in the cluster so that all of
that host's data can be migrated off to other cluster hosts, you can do the
following (instead of using a new, empty host):

.. prompt:: bash $

   OLDHOST=<existing-cluster-host-to-offload>
   ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might instead be a rack name.) You should now
see the host at the top of the OSD tree output with no parent:

.. prompt:: bash $

   ceph osd tree

::

  ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
  -5             0 host oldhost
  10   ssd 1.00000     osd.10        up  1.00000 1.00000
  11   ssd 1.00000     osd.11        up  1.00000 1.00000
  12   ssd 1.00000     osd.12        up  1.00000 1.00000
  -1       3.00000 root default
  -2       3.00000     host foo
   0   ssd 1.00000         osd.0     up  1.00000 1.00000
   1   ssd 1.00000         osd.1     up  1.00000 1.00000
   2   ssd 1.00000         osd.2     up  1.00000 1.00000
  ...

If everything looks good, jump directly to the :ref:`"Wait for the data
migration to complete" <bluestore_data_migration_step>` step below and proceed
from there to clean up the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at :ref:`the first step
<bluestore_migration_process_first_step>`. If you're using an existing host,
jump to :ref:`this step <bluestore_data_migration_step>`.

.. _bluestore_migration_process_first_step:

#. Provision new BlueStore OSDs for all devices:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the new OSDs have joined the cluster:

   .. prompt:: bash $

      ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``).  For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
     -5             0 host newhost
     10   ssd 1.00000     osd.10        up  1.00000 1.00000
     11   ssd 1.00000     osd.11        up  1.00000 1.00000
     12   ssd 1.00000     osd.12        up  1.00000 1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0   ssd 1.00000         osd.0     up  1.00000 1.00000
      1   ssd 1.00000         osd.1     up  1.00000 1.00000
      2   ssd 1.00000         osd.2     up  1.00000 1.00000
     ...

#. Identify the first target host to convert:

   .. prompt:: bash $

      OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster:

   .. prompt:: bash $

      ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on
   ``$NEWHOST``.  If there is a difference between the total capacity of the
   old host and the total capacity of the new host, you may also see some
   data migrate to or from other nodes in the cluster. Provided that the hosts
   are similarly sized, however, this will be a relatively small amount of
   data.

   .. _bluestore_data_migration_step:

#. Wait for the data migration to complete:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``:

   .. prompt:: bash $

      ssh $OLDHOST
      systemctl kill ceph-osd.target
      umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs:

   .. prompt:: bash $

      for osd in $(ceph osd ls-tree $OLDHOST); do
         ceph osd purge $osd --yes-i-really-mean-it
      done

#. Wipe the old OSDs. This requires you to manually identify which devices are
   to be wiped. BE CAREFUL! For each device:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat:

   .. prompt:: bash $

      NEWHOST=$OLDHOST

Advantages:

* Data is copied over the network only once.
* An entire host's OSDs are converted at once.
* Can be parallelized, making it possible to convert multiple hosts at the same time.
* No host involved in this process needs to have a spare device.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.

Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function included
in ``ceph-objectstore-tool``. This requires that the host have one or more free
devices on which to provision a new, empty BlueStore OSD. For example, if each
host in your cluster has twelve OSDs, then you need a thirteenth, unused
device: each OSD is converted onto the spare device, and the device it vacates
then becomes the spare for the next conversion. (A sketch of the copy step
itself appears after the caveats below.)

Caveats:

* This approach requires that we prepare an empty BlueStore OSD but that we do not allocate
  a new OSD ID to it. The ``ceph-volume`` tool does not support such an operation. **IMPORTANT:**
  because the setup of *dmcrypt* is closely tied to the identity of the OSD, this approach does not
  work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that demonstrates this process may be found here:
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash
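
One way to perform such a copy is the ``dup`` operation of
``ceph-objectstore-tool``, which copies the contents of one object store into
another. The invocation below is only a sketch: the source OSD must be stopped,
``$NEW_OSD_PATH`` is a placeholder for a BlueStore data directory that has
already been prepared (see the first caveat above), and the exact options
should be verified against ``ceph-objectstore-tool --help`` for your release:

.. prompt:: bash $

   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID --target-data-path $NEW_OSD_PATH --op dup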

Advantages:

* Provided that the ``noout`` or the ``norecover``/``norebalance`` flags are set
  on the OSD or the cluster while the conversion process is underway, little or
  no data migrates over the network during the conversion. A sketch of setting
  and clearing the cluster-wide flags is shown below.
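
A minimal sketch of setting and clearing two of these flags cluster-wide (the
per-OSD variants ``ceph osd add-noout`` and ``ceph osd rm-noout`` can be used
instead to limit the effect to the OSD being converted):

.. prompt:: bash $

   ceph osd set noout
   ceph osd set norebalance

Then, after the converted OSD has been started and has recovered:

.. prompt:: bash $

   ceph osd unset norebalance
   ceph osd unset noout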

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
  
* Each host must have an appropriate spare or empty device for staging.
  
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and startup have completed, the original Filestore OSD can be
  started to provide access to its original data.