summaryrefslogtreecommitdiffstats
path: root/doc/rados/troubleshooting/troubleshooting-mon.rst
blob: 1170da7c33f690363aa29d412bfe479cef08122f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
.. _rados-troubleshooting-mon:

==========================
 Troubleshooting Monitors
==========================

.. index:: monitor, high availability

Even if a cluster experiences monitor-related problems, the cluster is not
necessarily in danger of going down. If a cluster has lost multiple monitors,
it can still remain up and running as long as there are enough surviving
monitors to form a quorum.
   
If your cluster is having monitor-related problems, we recommend that you
consult the following troubleshooting information.

Initial Troubleshooting
=======================

The first steps in the process of troubleshooting Ceph Monitors involve making
sure that the Monitors are running and that they are able to communicate with
the network and on the network. Follow the steps in this section to rule out
the simplest causes of Monitor malfunction.

#. **Make sure that the Monitors are running.**

    Make sure that the Monitor (*mon*) daemon processes (``ceph-mon``) are
    running. It might be the case that the mons have not be restarted after an
    upgrade. Checking for this simple oversight can save hours of painstaking
    troubleshooting. 
    
    It is also important to make sure that the manager daemons (``ceph-mgr``)
    are running. Remember that typical cluster configurations provide one
    Manager (``ceph-mgr``) for each Monitor (``ceph-mon``).

    .. note:: In releases prior to v1.12.5, Rook will not run more than two
       managers.

#. **Make sure that you can reach the Monitor nodes.**

    In certain rare cases, ``iptables`` rules might be blocking access to
    Monitor nodes or TCP ports. These rules might be left over from earlier
    stress testing or rule development. To check for the presence of such
    rules, SSH into each Monitor node and use ``telnet`` or ``nc`` or a similar
    tool to attempt to connect to each of the other Monitor nodes on ports
    ``tcp/3300`` and ``tcp/6789``. 

#. **Make sure that the "ceph status" command runs and receives a reply from the cluster.**

    If the ``ceph status`` command receives a reply from the cluster, then the
    cluster is up and running. Monitors answer to a ``status`` request only if
    there is a formed quorum. Confirm that one or more ``mgr`` daemons are
    reported as running. In a cluster with no deficiencies, ``ceph status``
    will report that all ``mgr`` daemons are running.

    If the ``ceph status`` command does not receive a reply from the cluster,
    then there are probably not enough Monitors ``up`` to form a quorum. If the
    ``ceph -s`` command is run with no further options specified, it connects
    to an arbitrarily selected Monitor. In certain cases, however, it might be
    helpful to connect to a specific Monitor (or to several specific Monitors
    in sequence) by adding the ``-m`` flag to the command: for example, ``ceph
    status -m mymon1``.

#. **None of this worked. What now?**

    If the above solutions have not resolved your problems, you might find it
    helpful to examine each individual Monitor in turn. Even if no quorum has
    been formed, it is possible to contact each Monitor individually and
    request its status by using the ``ceph tell mon.ID mon_status`` command
    (here ``ID`` is the Monitor's identifier).

    Run the ``ceph tell mon.ID mon_status`` command for each Monitor in the
    cluster. For more on this command's output, see :ref:`Understanding
    mon_status
    <rados_troubleshoting_troubleshooting_mon_understanding_mon_status>`.

    There is also an alternative method for contacting each individual Monitor:
    SSH into each Monitor node and query the daemon's admin socket. See
    :ref:`Using the Monitor's Admin
    Socket<rados_troubleshoting_troubleshooting_mon_using_admin_socket>`.

.. _rados_troubleshoting_troubleshooting_mon_using_admin_socket:

Using the monitor's admin socket
================================

A monitor's admin socket allows you to interact directly with a specific daemon
by using a Unix socket file. This file is found in the monitor's ``run``
directory. The admin socket's default directory is
``/var/run/ceph/ceph-mon.ID.asok``, but this can be overridden and the admin
socket might be elsewhere, especially if your cluster's daemons are deployed in
containers. If you cannot find it, either check your ``ceph.conf`` for an
alternative path or run the following command:
    
.. prompt:: bash $

   ceph-conf --name mon.ID --show-config-value admin_socket

The admin socket is available for use only when the monitor daemon is running.
Whenever the monitor has been properly shut down, the admin socket is removed.
However, if the monitor is not running and the admin socket persists, it is
likely that the monitor has been improperly shut down.  In any case, if the
monitor is not running, it will be impossible to use the admin socket, and the
``ceph`` command is likely to return ``Error 111: Connection Refused``.

To access the admin socket, run a ``ceph tell`` command of the following form
(specifying the daemon that you are interested in):

.. prompt:: bash $

   ceph tell mon.<id> mon_status

This command passes a ``help`` command to the specific running monitor daemon
``<id>`` via its admin socket. If you know the full path to the admin socket
file, this can be done more directly by running the following command:

.. prompt:: bash $

   ceph --admin-daemon <full_path_to_asok_file> <command>

Running ``ceph help`` shows all supported commands that are available through
the admin socket. See especially ``config get``, ``config show``, ``mon stat``,
and ``quorum_status``.

.. _rados_troubleshoting_troubleshooting_mon_understanding_mon_status:

Understanding mon_status
========================

The status of the monitor (as reported by the ``ceph tell mon.X mon_status``
command) can always be obtained via the admin socket. This command outputs a
great deal of information about the monitor (including the information found in
the output of the ``quorum_status`` command).

To understand this command's output, let us consider the following example, in
which we see the output of ``ceph tell mon.c mon_status``::

  { "name": "c",
    "rank": 2,
    "state": "peon",
    "election_epoch": 38,
    "quorum": [
          1,
          2],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": { "epoch": 3,
        "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
        "modified": "2013-10-30 04:12:01.945629",
        "created": "2013-10-29 14:14:41.914786",
        "mons": [
              { "rank": 0,
                "name": "a",
                "addr": "127.0.0.1:6789\/0"},
              { "rank": 1,
                "name": "b",
                "addr": "127.0.0.1:6790\/0"},
              { "rank": 2,
                "name": "c",
                "addr": "127.0.0.1:6795\/0"}]}}

It is clear that there are three monitors in the monmap (*a*, *b*, and *c*),
the quorum is formed by only two monitors, and *c* is in the quorum as a
*peon*.

**Which monitor is out of the quorum?**

  The answer is **a** (that is, ``mon.a``).

**Why?**

  When the ``quorum`` set is examined, there are clearly two monitors in the
  set: *1* and *2*. But these are not monitor names. They are monitor ranks, as
  established in the current ``monmap``. The ``quorum`` set does not include
  the monitor that has rank 0, and according to the ``monmap`` that monitor is
  ``mon.a``.

**How are monitor ranks determined?**

  Monitor ranks are calculated (or recalculated) whenever monitors are added or
  removed. The calculation of ranks follows a simple rule: the **greater** the
  ``IP:PORT`` combination, the **lower** the rank. In this case, because
  ``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations,
  ``mon.a`` has the highest rank: namely, rank 0.
  

Most Common Monitor Issues
===========================

The Cluster Has Quorum but at Least One Monitor is Down
-------------------------------------------------------

When the cluster has quorum but at least one monitor is down, ``ceph health
detail`` returns a message similar to the following::

      $ ceph health detail
      [snip]
      mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

**How do I troubleshoot a Ceph cluster that has quorum but also has at least one monitor down?**

  #. Make sure that ``mon.a`` is running.

  #. Make sure that you can connect to ``mon.a``'s node from the
     other Monitor nodes. Check the TCP ports as well. Check ``iptables`` and
     ``nf_conntrack`` on all nodes and make sure that you are not
     dropping/rejecting connections.

  If this initial troubleshooting doesn't solve your problem, then further
  investigation is necessary.

  First, check the problematic monitor's ``mon_status`` via the admin
  socket as explained in `Using the monitor's admin socket`_ and
  `Understanding mon_status`_.

  If the Monitor is out of the quorum, then its state will be one of the
  following: ``probing``, ``electing`` or ``synchronizing``. If the state of
  the Monitor is ``leader`` or ``peon``, then the Monitor believes itself to be
  in quorum but the rest of the cluster believes that it is not in quorum. It
  is possible that a Monitor that is in one of the ``probing``, ``electing``,
  or ``synchronizing`` states has entered the quorum during the process of
  troubleshooting. Check ``ceph status`` again to determine whether the Monitor
  has entered quorum during your troubleshooting. If the Monitor remains out of
  the quorum, then proceed with the investigations described in this section of
  the documentation.
  

**What does it mean when a Monitor's state is ``probing``?**

  If ``ceph health detail`` shows that a Monitor's state is
  ``probing``, then the Monitor is still looking for the other Monitors. Every
  Monitor remains in this state for some time when it is started. When a
  Monitor has connected to the other Monitors specified in the ``monmap``, it
  ceases to be in the ``probing`` state. The amount of time that a Monitor is
  in the ``probing`` state depends upon the parameters of the cluster of which
  it is a part. For example, when a Monitor is a part of a single-monitor
  cluster (never do this in production), the monitor passes through the probing
  state almost instantaneously. In a multi-monitor cluster, the Monitors stay
  in the ``probing`` state until they find enough monitors to form a quorum
  |---| this means that if two out of three Monitors in the cluster are
  ``down``, the one remaining Monitor stays in the ``probing``  state
  indefinitely until you bring one of the other monitors up.

  If quorum has been established, then the Monitor daemon should be able to
  find the other Monitors quickly, as long as they can be reached. If a Monitor
  is stuck in the ``probing`` state and you have exhausted the procedures above
  that describe the troubleshooting of communications between the Monitors,
  then it is possible that the problem Monitor is trying to reach the other
  Monitors at a wrong address. ``mon_status`` outputs the ``monmap`` that is
  known to the monitor: determine whether the other Monitors' locations as
  specified in the ``monmap`` match the locations of the Monitors in the
  network. If they do not, see `Recovering a Monitor's Broken monmap`_.
  If the locations of the Monitors as specified in the ``monmap`` match the
  locations of the Monitors in the network, then the persistent
  ``probing`` state could  be related to severe clock skews amongst the monitor
  nodes.  See `Clock Skews`_.  If the information in `Clock Skews`_ does not
  bring the Monitor out of the ``probing`` state, then prepare your system logs
  and ask the Ceph community for help. See `Preparing your logs`_ for
  information about the proper preparation of logs.


**What does it mean when a Monitor's state is ``electing``?**

  If ``ceph health detail`` shows that a Monitor's state is ``electing``, the
  monitor is in the middle of an election. Elections typically complete
  quickly, but sometimes the monitors can get stuck in what is known as an
  *election storm*. See :ref:`Monitor Elections <dev_mon_elections>` for more
  on monitor elections.
  
  The presence of election storm might indicate clock skew among the monitor
  nodes. See `Clock Skews`_ for more information. 
  
  If your clocks are properly synchronized, search the mailing lists and bug
  tracker for issues similar to your issue. The ``electing`` state is not
  likely to persist. In versions of Ceph after the release of Cuttlefish, there
  is no obvious reason other than clock skew that explains why an ``electing``
  state would persist.  
  
  It is possible to investigate the cause of a persistent ``electing`` state if
  you put the problematic Monitor into a ``down`` state while you investigate.
  This is possible only if there are enough surviving Monitors to form quorum. 

**What does it mean when a Monitor's state is ``synchronizing``?**

  If ``ceph health detail`` shows that the Monitor is ``synchronizing``, the
  monitor is catching up with the rest of the cluster so that it can join the
  quorum. The amount of time that it takes for the Monitor to synchronize with
  the rest of the quorum is a function of the size of the cluster's monitor
  store, the cluster's size, and the state of the cluster. Larger and degraded
  clusters generally keep Monitors in the ``synchronizing`` state longer than
  do smaller, new clusters.

  A Monitor that changes its state from ``synchronizing`` to ``electing`` and
  then back to ``synchronizing`` indicates a problem: the cluster state may be
  advancing (that is, generating new maps) too fast for the synchronization
  process to keep up with the pace of the creation of the new maps. This issue
  presented more frequently prior to the Cuttlefish release than it does in
  more recent releases, because the synchronization process has since been
  refactored and enhanced to avoid this dynamic. If you experience this in
  later versions, report the issue in the `Ceph bug tracker
  <https://tracker.ceph.com>`_. Prepare and provide logs to substantiate any
  bug you raise. See `Preparing your logs`_ for information about the proper
  preparation of logs.

**What does it mean when a Monitor's state is ``leader`` or ``peon``?**

  If ``ceph health detail`` shows that the Monitor is in the ``leader`` state
  or in the ``peon`` state, it is likely that clock skew is present. Follow the
  instructions in `Clock Skews`_. If you have followed those instructions and
  ``ceph health detail`` still shows that the Monitor is in the ``leader``
  state or the ``peon`` state, report the issue in the `Ceph bug tracker
  <https://tracker.ceph.com>`_. If you raise an issue, provide logs to
  substantiate it. See `Preparing your logs`_ for information about the
  proper preparation of logs.


Recovering a Monitor's Broken ``monmap``
----------------------------------------

This is how a ``monmap`` usually looks, depending on the number of
monitors::


      epoch 3
      fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
      last_changed 2013-10-30 04:12:01.945629
      created 2013-10-29 14:14:41.914786
      0: 127.0.0.1:6789/0 mon.a
      1: 127.0.0.1:6790/0 mon.b
      2: 127.0.0.1:6795/0 mon.c
      
This may not be what you have however. For instance, in some versions of
early Cuttlefish there was a bug that could cause your ``monmap``
to be nullified.  Completely filled with zeros. This means that not even
``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros.
It's also possible to end up with a monitor with a severely outdated monmap,
notably if the node has been down for months while you fight with your vendor's
TAC.  The subject ``ceph-mon`` daemon might be unable to find the surviving
monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``,
then remove ``mon.a``, then add a new monitor ``mon.e`` and remove
``mon.b``; you will end up with a totally different monmap from the one
``mon.c`` knows).

In this situation you have two possible solutions:

Scrap the monitor and redeploy

  You should only take this route if you are positive that you won't
  lose the information kept by that monitor; that you have other monitors
  and that they are running just fine so that your new monitor is able
  to synchronize from the remaining monitors. Keep in mind that destroying
  a monitor, if there are no other copies of its contents, may lead to
  loss of data.

Inject a monmap into the monitor

  These are the basic steps:

  Retrieve the ``monmap`` from the surviving monitors and inject it into the
  monitor whose ``monmap`` is corrupted or lost.

  Implement this solution by carrying out the following procedure:

  1. Is there a quorum of monitors? If so, retrieve the ``monmap`` from the
     quorum::

      $ ceph mon getmap -o /tmp/monmap

  2. If there is no quorum, then retrieve the ``monmap`` directly from another
     monitor that has been stopped (in this example, the other monitor has
     the ID ``ID-FOO``)::

      $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap

  3. Stop the monitor you are going to inject the monmap into.

  4. Inject the monmap::

      $ ceph-mon -i ID --inject-monmap /tmp/monmap

  5. Start the monitor

  .. warning:: Injecting ``monmaps`` can cause serious problems because doing
     so will overwrite the latest existing ``monmap`` stored on the monitor. Be
     careful!

Clock Skews
-----------

The Paxos consensus algorithm requires close time synchroniziation, which means
that clock skew among the monitors in the quorum can have a serious effect on
monitor operation. The resulting behavior can be puzzling. To avoid this issue,
run a clock synchronization tool on your monitor nodes: for example, use
``Chrony`` or the legacy ``ntpd`` utility. Configure each monitor nodes so that
the `iburst` option is in effect and so that each monitor has multiple peers,
including the following: 

* Each other
* Internal ``NTP`` servers
* Multiple external, public pool servers

.. note:: The ``iburst`` option sends a burst of eight packets instead of the
   usual single packet, and is used during the process of getting two peers
   into initial synchronization.

Furthermore, it is advisable to synchronize *all* nodes in your cluster against
internal and external servers, and perhaps even against your monitors. Run
``NTP`` servers on bare metal: VM-virtualized clocks are not suitable for
steady timekeeping. See `https://www.ntp.org <https://www.ntp.org>`_ for more
information about the Network Time Protocol (NTP). Your organization might
already have quality internal ``NTP`` servers available.  Sources for ``NTP``
server appliances include the following:

* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_

Clock Skew Questions and Answers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What's the maximum tolerated clock skew?**

  By default, monitors allow clocks to drift up to a maximum of 0.05 seconds
  (50 milliseconds).

**Can I increase the maximum tolerated clock skew?**

  Yes, but we strongly recommend against doing so. The maximum tolerated clock
  skew is configurable via the ``mon-clock-drift-allowed`` option, but it is
  almost certainly a bad idea to make changes to this option. The clock skew
  maximum is in place because clock-skewed monitors cannot be relied upon. The
  current default value has proven its worth at alerting the user before the
  monitors encounter serious problems. Changing this value might cause
  unforeseen effects on the stability of the monitors and overall cluster
  health.

**How do I know whether there is a clock skew?**

  The monitors will warn you via the cluster status ``HEALTH_WARN``. When clock
  skew is present, the ``ceph health detail`` and ``ceph status`` commands
  return an output resembling the following::

      mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

  In this example, the monitor ``mon.c`` has been flagged as suffering from 
  clock skew.

  In Luminous and later releases, it is possible to check for a clock skew by
  running the ``ceph time-sync-status`` command. Note that the lead monitor
  typically has the numerically lowest IP address. It will always show ``0``:
  the reported offsets of other monitors are relative to the lead monitor, not
  to any external reference source.

**What should I do if there is a clock skew?**

  Synchronize your clocks. Using an NTP client might help. However, if you
  are already using an NTP client and you still encounter clock skew problems,
  determine whether the NTP server that you are using is remote to your network
  or instead hosted on your network. Hosting your own NTP servers tends to
  mitigate clock skew problems.


Client Can't Connect or Mount
-----------------------------

Check your IP tables. Some operating-system install utilities add a ``REJECT``
rule to ``iptables``. ``iptables`` rules will reject all clients other than
``ssh`` that try to connect to the host. If your monitor host's IP tables have
a ``REJECT`` rule in place, clients that are connecting from a separate node
will fail and will raise a timeout error. Any ``iptables`` rules that reject
clients trying to connect to Ceph daemons must be addressed. For example::

    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

It might also be necessary to add rules to iptables on your Ceph hosts to
ensure that clients are able to access the TCP ports associated with your Ceph
monitors (default: port 6789) and Ceph OSDs (default: 6800 through 7300). For
example::

    iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT


Monitor Store Failures
======================

Symptoms of store corruption
----------------------------

Ceph monitors store the :term:`Cluster Map` in a key-value store.  If key-value
store corruption causes a monitor to fail, then the monitor log might contain
one of the following error messages::

  Corruption: error in middle of record

or::

  Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb

Recovery using healthy monitor(s)
---------------------------------

If there are surviving monitors, we can always :ref:`replace
<adding-and-removing-monitors>` the corrupted monitor with a new one. After the
new monitor boots, it will synchronize with a healthy peer. After the new
monitor is fully synchronized, it will be able to serve clients.

.. _mon-store-recovery-using-osds:

Recovery using OSDs
-------------------

Even if all monitors fail at the same time, it is possible to recover the
monitor store by using information stored in OSDs. You are encouraged to deploy
at least three (and preferably five) monitors in a Ceph cluster. In such a
deployment, complete monitor failure is unlikely. However, unplanned power loss
in a data center whose disk settings or filesystem settings are improperly
configured could cause the underlying filesystem to fail and this could kill
all of the monitors. In such a case, data in the OSDs can be used to recover
the monitors.  The following is such a script and can be used to recover the
monitors:


.. code-block:: bash

  ms=/root/mon-store
  mkdir $ms
  
  # collect the cluster map from stopped OSDs
  for host in $hosts; do
    rsync -avz $ms/. user@$host:$ms.remote
    rm -rf $ms
    ssh user@$host <<EOF
      for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
      done
  EOF
    rsync -avz user@$host:$ms.remote/. $ms
  done
  
  # rebuild the monitor store from the collected map, if the cluster does not
  # use cephx authentication, we can skip the following steps to update the
  # keyring with the caps, and there is no need to pass the "--keyring" option.
  # i.e. just use "ceph-monstore-tool $ms rebuild" instead
  ceph-authtool /path/to/admin.keyring -n mon. \
    --cap mon 'allow *'
  ceph-authtool /path/to/admin.keyring -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
  # add one or more ceph-mgr's key to the keyring. in this case, an encoded key
  # for mgr.x is added, you can find the encoded key in
  # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
  # deployed
  ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \
    --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
  # If your monitors' ids are not sorted by ip address, please specify them in order.
  # For example. if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is  10.0.0.4,
  # please passing "--mon-ids b a c".
  # In addition, if your monitors' ids are not single characters like 'a', 'b', 'c', please
  # specify them in the command line by passing them as arguments of the "--mon-ids"
  # option. if you are not sure, please check your ceph.conf to see if there is any
  # sections named like '[mon.foo]'. don't pass the "--mon-ids" option, if you are
  # using DNS SRV for looking up monitors.
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma
  
  # make a backup of the corrupted store.db just in case!  repeat for
  # all monitors.
  mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted

  # move rebuild store.db into place.  repeat for all monitors.
  mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
  chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

This script performs the following steps:

#. Collects the map from each OSD host.
#. Rebuilds the store.
#. Fills the entities in the keyring file with appropriate capabilities.
#. Replaces the corrupted store on ``mon.foo`` with the recovered copy.


Known limitations
~~~~~~~~~~~~~~~~~

The above recovery tool is unable to recover the following information:

- **Certain added keyrings**: All of the OSD keyrings added using the ``ceph
  auth add`` command are recovered from the OSD's copy, and the
  ``client.admin`` keyring is imported using ``ceph-monstore-tool``. However,
  the MDS keyrings and all other keyrings will be missing in the recovered
  monitor store. You might need to manually re-add them.

- **Creating pools**: If any RADOS pools were in the process of being created,
  that state is lost. The recovery tool operates on the assumption that all
  pools have already been created. If there are PGs that are stuck in the
  'unknown' state after the recovery for a partially created pool, you can
  force creation of the *empty* PG by running the ``ceph osd force-create-pg``
  command. Note that this will create an *empty* PG, so take this action only
  if you know the pool is empty.

- **MDS Maps**: The MDS maps are lost.


Everything Failed! Now What?
============================

Reaching out for help
---------------------

You can find help on IRC in #ceph and #ceph-devel on OFTC (server
irc.oftc.net), or at ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make
sure that you have prepared your logs and that you have them ready upon
request.

See https://ceph.io/en/community/connect/ for current (as of October 2023)
information on getting in contact with the upstream Ceph community.


Preparing your logs
-------------------

The default location for monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``.
However, if they are not there, you can find their current location by running
the following command:

.. prompt:: bash

   ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is determined by the debug levels in the
cluster's configuration files. If Ceph is using the default debug levels, then
your logs might be missing important information that would help the upstream
Ceph community address your issue.

To make sure your monitor logs contain relevant information, you can raise
debug levels. Here we are interested in information from the monitors.  As with
other components, the monitors have different parts that output their debug
information on different subsystems.

If you are an experienced Ceph troubleshooter, we recommend raising the debug
levels of the most relevant subsystems. Of course, this approach might not be
easy for beginners. In most cases, however, enough information to address the
issue will be secured if the following debug levels are entered::

      debug_mon = 10
      debug_ms = 1

Sometimes these debug levels do not yield enough information. In such cases,
members of the upstream Ceph community might ask you to make additional changes
to these or to other debug levels. In any case, it is better for us to receive
at least some useful information than to receive an empty log.


Do I need to restart a monitor to adjust debug levels?
------------------------------------------------------

No, restarting a monitor is not necessary. Debug levels may be adjusted by
using two different methods, depending on whether or not there is a quorum:

There is a quorum

  Either inject the debug option into the specific monitor that needs to 
  be debugged::

        ceph tell mon.FOO config set debug_mon 10/10

  Or inject it into all monitors at once::

        ceph tell mon.* config set debug_mon 10/10


There is no quorum

  Use the admin socket of the specific monitor that needs to be debugged
  and directly adjust the monitor's configuration options::

      ceph daemon mon.FOO config set debug_mon 10/10


To return the debug levels to their default values, run the above commands
using the debug level ``1/10`` rather than ``10/10``. To check a monitor's
current values, use the admin socket and run either of the following commands:

  .. prompt:: bash

     ceph daemon mon.FOO config show

or:

  .. prompt:: bash

     ceph daemon mon.FOO config get 'OPTION_NAME'



I Reproduced the problem with appropriate debug levels. Now what?
-----------------------------------------------------------------

We prefer that you send us only the portions of your logs that are relevant to
your monitor problems. Of course, it might not be easy for you to determine
which portions are relevant so we are willing to accept complete and
unabridged logs. However, we request that you avoid sending logs containing
hundreds of thousands of lines with no additional clarifying information. One
common-sense way of making our task easier is to write down the current time
and date when you are reproducing the problem and then extract portions of your
logs based on that information.

Finally, reach out to us on the mailing lists or IRC or Slack, or by filing a
new issue on the `tracker`_.

.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new

.. |---|   unicode:: U+2014 .. EM DASH
   :trim: