Frequently asked questions -- Debian mdadm
==========================================

Also see /usr/share/doc/mdadm/README.recipes.gz .

The latest version of this FAQ is available here:
  http://anonscm.debian.org/gitweb/?p=pkg-mdadm/mdadm.git;a=blob_plain;f=debian/FAQ;hb=HEAD

0. What does MD stand for?
~~~~~~~~~~~~~~~~~~~~~~~~~~
  MD is an abbreviation for "multiple device" (also often called "multi-
  disk"). The Linux MD implementation implements various strategies for
  combining multiple (typically but not necessarily physical) block devices
  into single logical ones. The most common use case is commonly known as
  "Software RAID". Linux supports RAID levels 1, 4, 5, 6 and 10 as well
  as the "pseudo" RAID level 0.
  In addition, the MD implementation covers linear and multipath
  configurations.

  Most people refer to MD as RAID. Since the original name of the RAID
  configuration software is "md"adm, I chose to use MD consistently instead.

1. How do I overwrite ("zero") the superblock?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mdadm --zero-superblock /dev/sdXY

  Note that this is a destructive operation. It does not actually delete any
  data, but the device will have lost its "authority". You cannot assemble the
  array with it anymore and if you add the device to another array, the
  synchronisation process *will* *overwrite* all data on the device.

  Nevertheless, sometimes it is necessary to zero the superblock:

  - If you want to re-use a device (e.g. a HDD or SSD) that has been part of an
    array (with a different superblock version and/or location) in another one.
    In this case you zero the superblock before you assemble the array or add
    the device to a new array.

  - If you are trying to prevent a device from being recognised as part of an
    array. Say for instance you are trying to change an array spanning sd[ab]1
    to sd[bc]1 (maybe because sda is failing or too slow), then automatic
    (scan) assembly will still recognise sda1 as a valid device. You can limit
    the devices to scan with the DEVICE keyword in the configuration file, but
    this may not be what you want. Instead, zeroing the superblock will
    (permanently) prevent a device from being considered as part of an array.

  WARNING: Depending on which superblock version you use, it won't work to just
           overwrite the first few MiBs of the block device with 0x0 (e.g. via
           dd), since the superblock may be at other locations (especially the
           end of the device).
           Therefore always use mdadm --zero-superblock .
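
  As a sanity check (the device name below is only an example), re-examine the
  device after zeroing; mdadm should no longer detect a superblock on it:

    mdadm --zero-superblock /dev/sdb1
    mdadm --examine /dev/sdb1    # should report that no superblock was found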

2. How do I change the preferred minor of an MD array (RAID)?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  See item 12 in /usr/share/doc/mdadm/README.recipes.gz and read the mdadm(8)
  manpage (search for 'preferred').

3. How does mdadm determine which /dev/mdX or /dev/md/X to use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  The logic used by mdadm to determine the device node name in the mdadm
  --examine output (which is used to generate mdadm.conf) depends on several
  factors. Here's how mdadm determines it:

  It first checks the superblock version of a given array (or each array in
  turn when iterating all of them). Run

    mdadm --detail /dev/mdX | sed -ne 's,.*Version : ,,p'

  to determine the superblock version of a running array, or

    mdadm --examine /dev/sdXY | sed -ne 's,.*Version : ,,p'

  to determine the superblock version from a component device of an array.

  Version 0 superblocks (00.90.XX)
  ''''''''''''''''''''''''''''''''
    You need to know the preferred minor number stored in the superblock,
    so run either of

      mdadm --detail /dev/mdX | sed -ne 's,.*Preferred Minor : ,,p'
      mdadm --examine /dev/sdXY | sed -ne 's,.*Preferred Minor : ,,p'

    Let's call the resulting number MINOR. Also see FAQ 2 further up.

    Given MINOR, mdadm will output /dev/md<MINOR> if the device node
    /dev/md<MINOR> exists.
    Otherwise, it outputs /dev/md/<MINOR>.

  Version 1 superblocks (01.XX.XX)
  ''''''''''''''''''''''''''''''''
    Version 1 superblocks actually seem to ignore preferred minors and instead
    use the value of the name field in the superblock. Unless specified
    explicitly during creation (-N|--name) the name is determined from the
    device name used, using the following regexp: 's,/dev/md/?(.*),$1,', thus:

      /dev/md0     -> 0
      /dev/md/0    -> 0
      /dev/md_d0   -> _d0 (d0 in later versions)
      /dev/md/d0   -> d0
      /dev/md/name -> name
      (/dev/name does not seem to work)

    mdadm will append the name to '/dev/md/', so it will always output device
    names under the /dev/md/ directory. Newer versions can create a symlink
    from /dev/mdX. See the symlinks option in mdadm.conf(5) and mdadm(8).
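
    To see which name is currently stored in a version-1 superblock, something
    like the following should work (a sketch; it assumes the usual 'Name :'
    field in the --examine output):

      mdadm --examine /dev/sdXY | sed -ne 's,.*Name : ,,p'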

    If you want to change the name, you can do so during assembly:

      mdadm -A -U name -N newname /dev/mdX /dev/sd[abc]X

    I know this all sounds inconsistent and upstream has some work to do.
    We're on it.

4. Which RAID level should I use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Many people seem to prefer RAID4/5/6 because they make more efficient use of
  space. For example, if you have devices of size X, then in order to get 2X of
  storage, you need 3 devices for RAID5, but 4 if you use RAID10 or RAID1+0
  (or RAID6).

  This gain in usable space comes at a price: performance; RAID1/10 can be up
  to four times faster than RAID4/5/6.

  At the same time, however, RAID4/5/6 provide somewhat better redundancy in
  the event of two failing devices. In a RAID10 configuration, if one device is
  already dead, the array survives a second failure only if it hits one of the
  two devices in the other RAID1 pair; it does not survive if the remaining
  device in the already degraded pair fails (see next item, 4b). A RAID6 across
  four devices can cope with any two
  devices failing. However, RAID6 is noticeably slower than RAID5. RAID5 and
  RAID4 do not differ much, but can only handle single-device failures.

  If you can afford the extra devices (storage *is* cheap these days), I suggest
  RAID1/10 over RAID4/5/6. If you don't care about performance but need as
  much space as possible, go with RAID4/5/6, but make sure to have backups.
  Heck, make sure to have backups whatever you do.

  Let it be said, however, that I thoroughly regret putting my primary
  workstation on RAID5. Anything device-intensive brings the system to its
  knees; I will have to migrate to RAID10 at one point.

  Please also consult /usr/share/doc/mdadm/RAID5_versus_RAID10.txt.gz,
  https://en.wikipedia.org/wiki/Standard_RAID_levels and perhaps even
  https://en.wikipedia.org/wiki/Non-standard_RAID_levels .

4b. Can a 4-device RAID10 survive two device failures?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  I am assuming that you are talking about a setup with two copies of each
  block, so --layout=near2/far2/offset2:

  In two thirds of the cases, yes[0], and it does not matter which layout you
  use. When you assemble 4 devices into a RAID10, you essentially stripe a RAID0
  across two RAID1, so the four devices A,B,C,D become two pairs: A,B and C,D.
  If A fails, the RAID10 can only survive if the second failing device is either
  C or D; if B fails, your array is dead.

  Thus, if you see a device failing, replace it as soon as possible!

  If you need to handle two failing devices out of a set of four, you have to
  use RAID6, or store more than two copies of each block (see the --layout
  option in the mdadm(8) manpage).

  See also question 18 further down.

  [0] It's actually (n-2)/(n-1), where n is the number of devices. I am not
      a mathematician, see http://aput.net/~jheiss/raid10/, which gives the
      chance of *failure* as 1/(n-1), so the chance of success is 1-1/(n-1), or
      (n-2)/(n-1), or 2/3 in the four device example.
     (Thanks to Per Olofsson for clarifying this in #493577).

5. How to convert RAID5 to RAID10?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  To convert a 3-device RAID5 to RAID10, you need a spare device (either a hot
  spare already attached as a fourth device in the array, or a new one). Then
  you remove the spare
  and one of the three devices from the RAID5, create a degraded RAID10 across
  them, create the filesystem and copy the data (or do a raw copy), then add the
  other two devices to the new RAID10. However, mdadm cannot create a RAID10
  with half of its devices missing if you list them in the obvious order:

    mdadm --create -l 10 -n4 -pn2 /dev/md1 /dev/sd[cd] missing missing

  For reasons that may be answered by question 20 further down, mdadm actually
  cares about the order of devices you give it. If you intersperse the "missing"
  keywords with the physical devices, it should work:

    mdadm --create -l 10 -n4 -pn2 /dev/md1 /dev/sdc missing /dev/sdd missing

  or even

    mdadm --create -l 10 -n4 -pn2 /dev/md1 missing /dev/sd[cd] missing
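
  Putting the whole conversion together, a rough sketch could look like the
  following (all device and array names are examples: the RAID5 is assumed to
  be /dev/md0 on sdb1/sdc1/sdd1 with sde1 as a spare; copy the data with
  whatever tool you prefer):

    mdadm /dev/md0 --remove /dev/sde1                   # take out the spare
    mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1  # degrade the RAID5
    mdadm --zero-superblock /dev/sdd1 /dev/sde1
    mdadm --create -l 10 -n4 -pn2 /dev/md1 /dev/sdd1 missing /dev/sde1 missing
    mkfs.ext4 /dev/md1                      # or whatever filesystem you prefer
    # ... mount both arrays and copy the data across ...
    mdadm --stop /dev/md0
    mdadm --zero-superblock /dev/sdb1 /dev/sdc1
    mdadm /dev/md1 --add /dev/sdb1 /dev/sdc1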

  Also see item (4b) further up, and this thread:
    http://thread.gmane.org/gmane.linux.raid/13469/focus=13472

6. What is the difference between RAID1+0 and RAID10?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  RAID1+0 is a form of RAID in which a RAID0 is striped across two RAID1
  arrays. To assemble it, you create two RAID1 arrays and then create a RAID0
  array with the two md arrays.

  The Linux kernel provides the RAID10 level to do pretty much exactly the
  same for you, but with greater flexibility (and somewhat improved
  performance). While RAID1+0 makes sense with 4 devices, RAID10 can be
  configured to work with only 3 devices. Also, RAID10 has a little less
  overhead than RAID1+0, in which data passes through the md layer twice.

  I prefer RAID10 over RAID1+0.

6b. What's the difference between RAID1+0 and RAID0+1?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  In short: RAID1+0 stripes across two mirrored arrays, while RAID0+1 mirrors
  two striped arrays. The two terms are, however, frequently mixed up.

  The Linux MD driver supports RAID10, which is equivalent to the above
  RAID1+0 definition.

  Compared to RAID0+1, RAID1+0/10 has a greater chance of surviving two device
  failures, its performance suffers less when in degraded state, and it resyncs
  faster after replacing a failed device.

  See http://aput.net/~jheiss/raid10/ for more details.

7. Which RAID10 layout scheme should I use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  RAID10 gives you the choice between three ways of laying out chunks on the
  devices: near, far and offset.

  The examples below explain the chunk distribution for each of these layouts
  with 2 copies per chunk, using either an even number of devices (here: 4)
  or an odd number of devices (here: 5).

  For simplicity we assume that the chunk size matches the block size of the
  underlying devices and also the RAID10 device exported by the kernel
  (e.g. /dev/md/name). The chunk numbers map therefore directly to the block
  addresses in the exported RAID10 device.

  The decimal numbers below (0, 1, 2, …) are the RAID10 chunks. Due to the
  foregoing assumption they are also the block addresses in the exported RAID10
  device. Identical numbers refer to copies of a chunk or block, but on different
  underlying devices. The hexadecimal numbers (0x00, 0x01, 0x02, …) refer to the
  block addresses in the underlying devices.

  "near" layout with 2 copies per chunk (--layout=n2):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  The chunk copies are placed "as close to each other as possible".

  With an even number of devices, they lie at the same offset on each device.
  This is the classic RAID1+0 setup, i.e. two groups of mirrored devices that
  together form a striped RAID0.

  device1 device2 device3 device4          device1 device2 device3 device4 device5
  ─────── ─────── ─────── ───────          ─────── ─────── ─────── ─────── ───────
     0       0       1       1      0x00      0       0       1       1       2
     2       2       3       3      0x01      2       3       3       4       4
     ⋯       ⋯       ⋯       ⋯       ⋯        ⋯       ⋯       ⋯       ⋯       ⋯
     ⋮       ⋮       ⋮       ⋮       ⋮        ⋮       ⋮       ⋮       ⋮       ⋮
     ⋯       ⋯       ⋯       ⋯       ⋯        ⋯       ⋯       ⋯       ⋯       ⋯
    254     254     255     255     0x80     317     318     318     319     319
  ╰──────┬──────╯ ╰──────┬──────╯
       RAID1           RAID1
  ╰──────────────┬──────────────╯
               RAID0

  "far" layout with 2 copies per chunk (--layout=f2):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  The chunk copies are placed "as far from each other as possible".

  Here, a complete sequence of chunks is striped over all devices. Then a second
  copy of that sequence is placed after it. With more than 2 copies per chunk
  (f3, f4, …), further copies are added in the same way.

  It is undesirable, however, to place copies of the same chunks on the same
  devices. That is prevented by a cyclic permutation of each such stripe.

  device1 device2 device3 device4          device1 device2 device3 device4 device5
  ─────── ─────── ─────── ───────          ─────── ─────── ─────── ─────── ───────
     0       1       2       3      0x00      0       1       2       3       4    ╮
     4       5       6       7      0x01      5       6       7       8       9    ├ ▒
     ⋯       ⋯       ⋯       ⋯       ⋯        ⋯       ⋯       ⋯       ⋯       ⋯    ┆
     ⋮       ⋮       ⋮       ⋮       ⋮        ⋮       ⋮       ⋮       ⋮       ⋮    ┆
     ⋯       ⋯       ⋯       ⋯       ⋯        ⋯       ⋯       ⋯       ⋯       ⋯    ┆
    252     253     254     255     0x40     315     316     317     318     319   ╯
     3       0       1       2      0x41      4       0       1       2       3    ╮
     7       4       5       6      0x42      9       5       6       7       8    ├ ▒ₚ
     ⋯       ⋯       ⋯       ⋯       ⋯        ⋯       ⋯       ⋯       ⋯       ⋯    ┆
     ⋮       ⋮       ⋮       ⋮       ⋮        ⋮       ⋮       ⋮       ⋮       ⋮    ┆
     ⋯       ⋯       ⋯       ⋯       ⋯        ⋯       ⋯       ⋯       ⋯       ⋯    ┆
    255     252     253     254     0x80     319     315     316     317     318   ╯

  Each ▒ in the diagram represents a complete sequence of chunks. ▒ₚ is a cyclic
  permutation.

  A major advantage of the "far" layout is that sequential reads can be spread
  out over different devices, which makes the setup similar to RAID0 in terms of
  speed. Writes, however, incur extra seeking and are substantially slower.

  "offset" layout with 2 copies per chunk (--layout=o2):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Here, a number of consecutive chunks are bundled on each device during the
  striping operation. The number of consecutive chunks equals the number of
  devices. Next, a copy of the same chunks is striped in a different pattern.
  With more than 2 copies per chunk (o3, o4, …), further copies are added in
  the same way.

  A cyclic permutation in the pattern prevents copies of the same chunks from
  landing on the same devices.

  device1 device2 device3 device4          device1 device2 device3 device4 device5
  ─────── ─────── ─────── ───────          ─────── ─────── ─────── ─────── ───────
     0       1       2       3      0x00      0       1       2       3       4     ) AA
     3       0       1       2      0x01      4       0       1       2       3     ) AAₚ
     4       5       6       7      0x02      5       6       7       8       9     ) AB
     7       4       5       6      0x03      9       5       6       7       8     ) ABₚ
     ⋯       ⋯       ⋯       ⋯       ⋯        ⋯       ⋯       ⋯       ⋯       ⋯     ) ⋯
     ⋮       ⋮       ⋮       ⋮       ⋮        ⋮       ⋮       ⋮       ⋮       ⋮       ⋮
     ⋯       ⋯       ⋯       ⋯       ⋯        ⋯       ⋯       ⋯       ⋯       ⋯     ) ⋯
    251     252     253     254     0x79     314     315     316     317     318    ) EX
    254     251     252     253     0x80     318     314     315     316     317    ) EXₚ

  With AA, AB, …, AZ, BA, … being the sets of consecutive chunks and
  AAₚ, ABₚ, …, AZₚ, BAₚ, … their cyclic permutations.

  The read characteristics are probably similar to the "far" layout when a
  suitably large chunk size is chosen, but with less seeking for writes.

  Upstream and the Debian maintainer do not understand all the nuances and
  implications. The "offset" layout was only added because the Common RAID
  Disk Data Format (DDF) supports it, and standard compliance is our
  goal.

  See the md(4) manpage for more details.
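
  To pick one of these layouts at creation time, pass it via --layout; for
  example (device names and chunk size are only placeholders):

    mdadm --create /dev/md0 -l 10 -n 4 --layout=f2 --chunk=512 /dev/sd[abcd]1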

8. (One of) my RAID arrays is busy and cannot be stopped. What gives?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  It is perfectly normal for mdadm to report the array holding the root
  filesystem as busy on shutdown. The reason for this is that the root
  filesystem must be mounted to be able to stop the array (otherwise
  /sbin/mdadm is not available), but to stop the array, the root filesystem
  would have to be unmounted. Catch 22. The kernel actually stops the array
  just before halting, so all is well.

  If mdadm cannot stop other arrays on your system, check that these arrays
  are not in use anymore. Common causes for busy/locked arrays are listed
  below, followed by example commands:

    * The array contains a mounted filesystem (check the `mount' output)
    * The array is used as a swap backend (check /proc/swaps)
    * The array is used by the device-mapper (check with `dmsetup')
      * LVM
      * dm-crypt
      * EVMS
    * The array contains a swap partition used for suspend-to-ram
      (check /etc/initramfs-tools/conf.d/resume)
    * The array is used by a process (check with `lsof')
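
  For example, to run these checks against a hypothetical /dev/md0:

    mount | grep /dev/md0     # mounted filesystems
    grep md0 /proc/swaps      # swap backends
    dmsetup deps              # device-mapper devices and what they sit on
    lsof /dev/md0             # processes using the device directly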

9. Should I use RAID0 (or linear)?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  No. Unless you know what you're doing and keep backups, or use it for data
  that can be lost.

9b. Why not?
~~~~~~~~~~~~
  RAID0 has zero redundancy. If you stripe a RAID0 across X devices, you
  increase the likelihood of complete loss of the filesystem by a factor of X.

  The same applies to LVM by the way (when LVs are placed over X PVs).

  If you want to (or must) use LVM or RAID0, stripe it across RAID1 arrays
  (RAID10/RAID1+0, or LVM on RAID1), and keep backups!

10. Can I cancel a running array check (checkarray)?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  See the -x option in the `/usr/share/mdadm/checkarray --help` output.
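
  For example (assuming -x cancels a running check, as described in the --help
  output; the array name is a placeholder):

    /usr/share/mdadm/checkarray -x /dev/md2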

11. mdadm warns about duplicate/similar superblocks; what gives?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  In certain configurations, especially if your last partition extends all the
  way to the end of the device, mdadm may display a warning like:

    mdadm: WARNING /dev/sdXY and /dev/sdX appear to have very similar
    superblocks. If they are really different, please --zero the superblock on
    one. If they are the same or overlap, please remove one from the DEVICE
    list in mdadm.conf.

  There are two ways to solve this:

  (a) recreate the arrays with version-1 superblocks, which is not always an
      option -- you cannot yet upgrade version-0 to version-1 superblocks for
      existing arrays.

  (b) instead of 'DEVICE partitions', list exactly those devices that are
      components of MD arrays on your system. So instead of:

        DEVICE partitions

      for example:

        DEVICE /dev/sd[ab]* /dev/sdc[123]

12. mdadm -E / mkconf report different arrays with the same device
    name / minor number. What gives?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  In almost all cases, mdadm updates the super-minor field in an array's
  superblock when assembling the array. It does *not* do this for RAID0
  arrays. Thus, you may end up seeing something like this when you run
  mdadm -E or mkconf:

    ARRAY /dev/md0 level=raid0 num-devices=2 UUID=abcd...
    ARRAY /dev/md0 level=raid1 num-devices=2 UUID=dcba...

  Note how the two arrays have different UUIDs but both appear as /dev/md0.

  The solution in this case is to explicitly tell mdadm to update the
  superblock of the RAID0 array. Assuming that the RAID0 array in the above
  example should really be /dev/md1:

    mdadm --stop /dev/md1
    mdadm --assemble --update=super-minor --uuid=abcd... /dev/md1

  See question 2 of this FAQ, and also http://bugs.debian.org/386315 and
  recipe #12 in README.recipes .

13. Can a MD array be partitioned?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Since kernel 2.6.28, MD arrays can be partitioned like any other block
  device.

  Prior to 2.6.28, for an MD array to be able to hold partitions, it had to be
  created as a "partitionable array", using the configuration auto=part on the
  command line or in the configuration file, or by using the standard naming
  scheme (md_d* or md/d*) for partitionable arrays:

    mdadm --create --auto=yes ... /dev/md_d0 ...
    # see mdadm(8) manpage about the values of the --auto keyword
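
  With 2.6.28 or later, you can simply partition the array like any other block
  device, e.g. (array name and sizes are examples only):

    parted /dev/md0 -- mklabel gpt mkpart primary ext4 1MiB 100%
    # the new partition then appears as /dev/md0p1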

14. When would I partition an array?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  This answer by Doug Ledford is shamelessly adapted from [0] (with
  permission):

    First, not all MD types make sense to be split up, e.g. multipath. For
    those types, when a device fails, the *entire* device is considered to have
    failed, but with different arrays you won't switch over to the next path
    until each MD array has attempted to access the bad path. This can have
    obvious bad consequences for certain array types that do automatic
    failover from one port to another (you can end up getting the array in
    a loop of switching ports repeatedly to satisfy the fact that one array
    failed over during a path down, then the path came back up, and another
    array stayed on the old path because it didn't send any commands during
    the path down time period).

    Second, convenience. Assume you have a 6 device RAID5 array. If a device
    fails and you are using a partitioned MD array, then all the partitions on
    the device will already be handled without using that device. No need to
    manually fail any still active array members from other arrays.

    Third, safety. Again with the RAID5 array. If you use multiple arrays on
    a single device, and that device fails, but it only failed on one array, then
    you now need to manually fail that device from the other arrays before
    shutting down or hot swapping the device. Generally speaking, that's not
    a big deal, but people do occasionally have fat finger syndrome and this
    is a good opportunity for someone to accidentally fail the wrong device, and
    when you then go to remove the device you create a two device failure instead
    of one and now you are in real trouble.

    Fourth, to respond to what you wrote about the arrays being independent of
    each other as part of the reason to partition: I would argue that's not
    true. If your goal is to salvage as much use from a failing device as
    possible, then
    OK. But, generally speaking, people that have something of value on their
    devices don't want to salvage any part of a failing device, they want that
    device gone and replaced immediately. There simply is little to no value in
    an already malfunctioning device. Devices are too cheap and the data stored
    on them too valuable to risk losing something in an effort to further
    utilize broken hardware. This of course is written with the understanding
    that the latest MD RAID code will do read error rewrites to compensate for
    minor device issues, so anything that will throw a device out of an array is
    more than just a minor sector glitch.

  [0] http://thread.gmane.org/gmane.linux.raid/13594/focus=13597

15. How can I start a dirty degraded array?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  A degraded array (e.g. a RAID5 with only two devices) that has not been
  properly stopped cannot be assembled just like that; mdadm will refuse and
  complain about a "dirty degraded array", for good reasons.

  The solution might be to force-assemble it, and then to start it. Please see
  recipes 4 and 4b of /usr/share/doc/mdadm/README.recipes.gz and make sure you
  know what you're doing.
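
  A very rough sketch of what those recipes describe (device names are
  examples; read the recipes before trying this):

    mdadm --assemble --force /dev/md0 /dev/sd[abc]1
    mdadm --run /dev/md0    # only needed if the array comes up inactive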

16. How can I influence the speed with which an array is resynchronised?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  For each array, the MD subsystem exports parameters governing the
  synchronisation speed via sysfs. The values are in kB/sec.

    /sys/block/mdX/md/sync_speed     -- the current speed
    /sys/block/mdX/md/sync_speed_max -- the maximum speed
    /sys/block/mdX/md/sync_speed_min -- the guaranteed minimum speed
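
  For example, to cap the resync of a hypothetical /dev/md0 at roughly 100 MB/s
  while guaranteeing at least 20 MB/s (values in kB/sec):

    echo 100000 > /sys/block/md0/md/sync_speed_max
    echo 20000 > /sys/block/md0/md/sync_speed_min

  The system-wide defaults live in /proc/sys/dev/raid/speed_limit_max and
  /proc/sys/dev/raid/speed_limit_min.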

17. When I create a new array, why does it resynchronise at first?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  See the mdadm(8) manpage:
    When creating a RAID5 array, mdadm will automatically create a degraded
    array with an extra spare drive. This is because building the spare into
    a degraded array is in general faster than resyncing the parity on
    a non-degraded, but not clean, array. This feature can be over-ridden with
    the --force option.

  This also applies to RAID levels 4 and 6.

  It does not make much sense for RAID levels 1 and 10 and can thus be
  overridden with the --force and --assume-clean options, but it is not
  recommended. Read the manpage.

18. How many failed devices can a RAID10 handle?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  (see also question 4b)

  The following table shows how many devices you can lose and still have an
  operational array. In some cases, you *can* lose more than the given number
  of devices, but there is no guarantee that the array survives. Thus, the
  following is the guaranteed number of failed devices a RAID10 array survives
  and the maximum number of failed devices the array can (but is not guaranteed
  to) handle, given the number of devices used and the number of data block
  copies. Note that 2 copies means original + 1 copy. Thus, if you only have
  one copy (the original), you cannot handle any failures.

  devices \ copies      1            2            3            4
        1              0/0          0/0          0/0          0/0
        2              0/0          1/1          1/1          1/1
        3              0/0          1/1          2/2          2/2
        4              0/0          1/2          2/2          3/3
        5              0/0          1/2          2/2          3/3
        6              0/0          1/3          2/3          3/3
        7              0/0          1/3          2/3          3/3
        8              0/0          1/4          2/3          3/4

  Note: I have not really verified the above information. Please don't count
  on it. If a device fails, replace it as soon as possible. Corrections welcome.

19. What should I do if a device fails?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Replace it as soon as possible.

  For physical devices without hot-swap capability, for example:

    mdadm --remove /dev/md0 /dev/sda1
    poweroff
    <replace device and start the machine>
    mdadm --add /dev/md0 /dev/sda1
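
  If your hardware does support hot-swapping, a rough equivalent without the
  reboot could be (again, device names are examples):

    mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
    # physically replace the device and recreate the partition(s) on it
    mdadm /dev/md0 --add /dev/sda1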

20. So how do I find out which other device(s) can fail without killing the
    array?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Did you read the previous question and its answer?

  For cases when you have two copies of each block, the question is easily
  answered by looking at the output of /proc/mdstat. For instance on a 4 device
  array:

    md3 : active raid10 sdg7[3] sde7[0] sdh7[2] sdf7[1]

  you know that sde7/sdf7 form one pair and sdg7/sdh7 the other.

  If sdh now fails, this will become

    md3 : active raid10 sdg7[3] sde7[0] sdh7[4](F) sdf7[1]

  So now the second pair is degraded; the array could take another failure in
  the first pair, but if sdg now also fails, you're history.

  Now go and read question 19.

  For cases with more copies per block, it becomes more complicated. Let's
  think of a 7 device array with three copies:

    md5 : active raid10 sdg7[6] sde7[4] sdb7[5] sdf7[2] sda7[3] sdc7[1] sdd7[0]

  Each mirror now has 7/3 = 2.33 devices to it, so in order to determine groups,
  you need to round up. Note how, in the diagram below, the devices are arranged
  in increasing order of their indices (the number in brackets in /proc/mdstat):

  device:   -sdd7-  -sdc7-  -sdf7-  -sda7-  -sde7-  -sdb7-  -sdg7-
  group:    [      one       ][      two       ][     three      ]

  Basically this means that after two devices failed, you need to make sure that
  the third failed device doesn't destroy all copies of any given block. And
  that's not always easy as it depends on the layout chosen: whether the
  blocks are near (same offset within each group), far (spread apart in a way
  to maximise the mean distance), or offset (offset by size/n within each
  block).

  I'll leave it up to you to figure things out. Now go read question 19.
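
  If the /proc/mdstat output is hard to read, mdadm --detail lists every
  component together with its RaidDevice number, which makes the grouping
  easier to work out (the array name is an example):

    mdadm --detail /dev/md3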

21. Why does the kernel speak of 'resync' when using checkarray?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Please see README.checkarray and http://thread.gmane.org/gmane.linux.raid/11864 .

  In short: it's a bug. checkarray is actually not a resync, but the kernel
  does not distinguish between them.

22. Can I prioritise the sync process and sync certain arrays before others?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Upon start, md will resynchronise any unclean arrays, starting in somewhat
  random order. Sometimes it's desirable to sync e.g. /dev/md3 first (because
  it's the most important), but while /dev/md1 is synchronising, /dev/md3 will
  be DELAYED (see /proc/mdstat); this only happens if the arrays share the
  same physical components.

  It is possible to delay the synchronisation via /sys:

    echo idle >/sys/block/md1/md/sync_action

  This will cause md1 to go idle and MD to synchronise md3 (or whatever is
  queued next; repeat the above for other devices if necessary). MD will also
  realise that md1 is still not in sync and queue it for resynchronisation,
  so it will sync automatically when its turn has come.

23. mdadm's init script fails because it cannot find any arrays. What gives?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  This does not happen anymore: if no arrays are present in the configuration
  file, no arrays will be started.

24. What happened to mdrun? How do I replace it?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mdrun used to be the sledgehammer approach to assembling arrays. It has
  accumulated several problems over the years (e.g. Debian bug #354705) and
  thus has been deprecated and removed with the 2.6.7-2 version of this package.

  If you are still using mdrun, please ensure that you have a valid
  /etc/mdadm/mdadm.conf file (run /usr/share/mdadm/mkconf --generate to get
  one), and run

    mdadm --assemble --scan --auto=yes

  instead of mdrun.

25. Why are my arrays marked auto-read-only in /proc/mdstat?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Arrays are kept read-only until the first write occurs. This allows md to
  skip lengthy resynchronisation for arrays that have not been properly shut
  down, but which also have not changed.

26. Why doesn't mdadm find arrays specified in the config file and causes the
    boot to fail?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  My boot process dies at an early stage and drops me into the busybox shell.
  The last relevant output seems to be from mdadm and is something like

    "/dev/md2 does not exist"

  or

    "No devices listed in conf file found"

  Why does mdadm break my system?

  Short answer: It doesn't; the underlying devices are simply not yet available
  when mdadm runs during the early boot process.

  Long answer: It doesn't, but the drivers of those devices incorrectly
  communicate to the kernel that the devices are ready, when in fact they are
  not. I consider this a bug in those drivers. Please consider reporting it.

  Workaround: there is nothing mdadm can or will do about this. Fortunately,
  the initramfs provides a workaround, documented at
  http://wiki.debian.org/InitramfsDebug. Please append rootdelay=10 (which sets
  a delay of 10 seconds before trying to mount the root filesystem) to the
  kernel command line and check whether the boot now works.
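
  On a system booting with GRUB 2, a sketch of how to make the setting
  permanent (standard Debian paths assumed):

    # in /etc/default/grub:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=10"
    # then regenerate the boot configuration:
    update-grub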

 -- martin f. krafft <madduck@debian.org>  Wed, 13 May 2009 09:59:53 +0200