.. _rados-troubleshooting-mon: ========================== Troubleshooting Monitors ========================== .. index:: monitor, high availability Even if a cluster experiences monitor-related problems, the cluster is not necessarily in danger of going down. If a cluster has lost multiple monitors, it can still remain up and running as long as there are enough surviving monitors to form a quorum. If your cluster is having monitor-related problems, we recommend that you consult the following troubleshooting information. Initial Troubleshooting ======================= The first steps in the process of troubleshooting Ceph Monitors involve making sure that the Monitors are running and that they are able to communicate with the network and on the network. Follow the steps in this section to rule out the simplest causes of Monitor malfunction. #. **Make sure that the Monitors are running.** Make sure that the Monitor (*mon*) daemon processes (``ceph-mon``) are running. It might be the case that the mons have not be restarted after an upgrade. Checking for this simple oversight can save hours of painstaking troubleshooting. It is also important to make sure that the manager daemons (``ceph-mgr``) are running. Remember that typical cluster configurations provide one Manager (``ceph-mgr``) for each Monitor (``ceph-mon``). .. note:: In releases prior to v1.12.5, Rook will not run more than two managers. #. **Make sure that you can reach the Monitor nodes.** In certain rare cases, ``iptables`` rules might be blocking access to Monitor nodes or TCP ports. These rules might be left over from earlier stress testing or rule development. To check for the presence of such rules, SSH into each Monitor node and use ``telnet`` or ``nc`` or a similar tool to attempt to connect to each of the other Monitor nodes on ports ``tcp/3300`` and ``tcp/6789``. #. **Make sure that the "ceph status" command runs and receives a reply from the cluster.** If the ``ceph status`` command receives a reply from the cluster, then the cluster is up and running. Monitors answer to a ``status`` request only if there is a formed quorum. Confirm that one or more ``mgr`` daemons are reported as running. In a cluster with no deficiencies, ``ceph status`` will report that all ``mgr`` daemons are running. If the ``ceph status`` command does not receive a reply from the cluster, then there are probably not enough Monitors ``up`` to form a quorum. If the ``ceph -s`` command is run with no further options specified, it connects to an arbitrarily selected Monitor. In certain cases, however, it might be helpful to connect to a specific Monitor (or to several specific Monitors in sequence) by adding the ``-m`` flag to the command: for example, ``ceph status -m mymon1``. #. **None of this worked. What now?** If the above solutions have not resolved your problems, you might find it helpful to examine each individual Monitor in turn. Even if no quorum has been formed, it is possible to contact each Monitor individually and request its status by using the ``ceph tell mon.ID mon_status`` command (here ``ID`` is the Monitor's identifier). Run the ``ceph tell mon.ID mon_status`` command for each Monitor in the cluster. For more on this command's output, see :ref:`Understanding mon_status `. There is also an alternative method for contacting each individual Monitor: SSH into each Monitor node and query the daemon's admin socket. See :ref:`Using the Monitor's Admin Socket`. .. _rados_troubleshoting_troubleshooting_mon_using_admin_socket: Using the monitor's admin socket ================================ A monitor's admin socket allows you to interact directly with a specific daemon by using a Unix socket file. This file is found in the monitor's ``run`` directory. The admin socket's default directory is ``/var/run/ceph/ceph-mon.ID.asok``, but this can be overridden and the admin socket might be elsewhere, especially if your cluster's daemons are deployed in containers. If you cannot find it, either check your ``ceph.conf`` for an alternative path or run the following command: .. prompt:: bash $ ceph-conf --name mon.ID --show-config-value admin_socket The admin socket is available for use only when the monitor daemon is running. Whenever the monitor has been properly shut down, the admin socket is removed. However, if the monitor is not running and the admin socket persists, it is likely that the monitor has been improperly shut down. In any case, if the monitor is not running, it will be impossible to use the admin socket, and the ``ceph`` command is likely to return ``Error 111: Connection Refused``. To access the admin socket, run a ``ceph tell`` command of the following form (specifying the daemon that you are interested in): .. prompt:: bash $ ceph tell mon. mon_status This command passes a ``help`` command to the specific running monitor daemon ```` via its admin socket. If you know the full path to the admin socket file, this can be done more directly by running the following command: .. prompt:: bash $ ceph --admin-daemon Running ``ceph help`` shows all supported commands that are available through the admin socket. See especially ``config get``, ``config show``, ``mon stat``, and ``quorum_status``. .. _rados_troubleshoting_troubleshooting_mon_understanding_mon_status: Understanding mon_status ======================== The status of the monitor (as reported by the ``ceph tell mon.X mon_status`` command) can always be obtained via the admin socket. This command outputs a great deal of information about the monitor (including the information found in the output of the ``quorum_status`` command). To understand this command's output, let us consider the following example, in which we see the output of ``ceph tell mon.c mon_status``:: { "name": "c", "rank": 2, "state": "peon", "election_epoch": 38, "quorum": [ 1, 2], "outside_quorum": [], "extra_probe_peers": [], "sync_provider": [], "monmap": { "epoch": 3, "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", "modified": "2013-10-30 04:12:01.945629", "created": "2013-10-29 14:14:41.914786", "mons": [ { "rank": 0, "name": "a", "addr": "127.0.0.1:6789\/0"}, { "rank": 1, "name": "b", "addr": "127.0.0.1:6790\/0"}, { "rank": 2, "name": "c", "addr": "127.0.0.1:6795\/0"}]}} It is clear that there are three monitors in the monmap (*a*, *b*, and *c*), the quorum is formed by only two monitors, and *c* is in the quorum as a *peon*. **Which monitor is out of the quorum?** The answer is **a** (that is, ``mon.a``). **Why?** When the ``quorum`` set is examined, there are clearly two monitors in the set: *1* and *2*. But these are not monitor names. They are monitor ranks, as established in the current ``monmap``. The ``quorum`` set does not include the monitor that has rank 0, and according to the ``monmap`` that monitor is ``mon.a``. **How are monitor ranks determined?** Monitor ranks are calculated (or recalculated) whenever monitors are added or removed. The calculation of ranks follows a simple rule: the **greater** the ``IP:PORT`` combination, the **lower** the rank. In this case, because ``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations, ``mon.a`` has the highest rank: namely, rank 0. Most Common Monitor Issues =========================== The Cluster Has Quorum but at Least One Monitor is Down ------------------------------------------------------- When the cluster has quorum but at least one monitor is down, ``ceph health detail`` returns a message similar to the following:: $ ceph health detail [snip] mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) **How do I troubleshoot a Ceph cluster that has quorum but also has at least one monitor down?** #. Make sure that ``mon.a`` is running. #. Make sure that you can connect to ``mon.a``'s node from the other Monitor nodes. Check the TCP ports as well. Check ``iptables`` and ``nf_conntrack`` on all nodes and make sure that you are not dropping/rejecting connections. If this initial troubleshooting doesn't solve your problem, then further investigation is necessary. First, check the problematic monitor's ``mon_status`` via the admin socket as explained in `Using the monitor's admin socket`_ and `Understanding mon_status`_. If the Monitor is out of the quorum, then its state will be one of the following: ``probing``, ``electing`` or ``synchronizing``. If the state of the Monitor is ``leader`` or ``peon``, then the Monitor believes itself to be in quorum but the rest of the cluster believes that it is not in quorum. It is possible that a Monitor that is in one of the ``probing``, ``electing``, or ``synchronizing`` states has entered the quorum during the process of troubleshooting. Check ``ceph status`` again to determine whether the Monitor has entered quorum during your troubleshooting. If the Monitor remains out of the quorum, then proceed with the investigations described in this section of the documentation. **What does it mean when a Monitor's state is ``probing``?** If ``ceph health detail`` shows that a Monitor's state is ``probing``, then the Monitor is still looking for the other Monitors. Every Monitor remains in this state for some time when it is started. When a Monitor has connected to the other Monitors specified in the ``monmap``, it ceases to be in the ``probing`` state. The amount of time that a Monitor is in the ``probing`` state depends upon the parameters of the cluster of which it is a part. For example, when a Monitor is a part of a single-monitor cluster (never do this in production), the monitor passes through the probing state almost instantaneously. In a multi-monitor cluster, the Monitors stay in the ``probing`` state until they find enough monitors to form a quorum |---| this means that if two out of three Monitors in the cluster are ``down``, the one remaining Monitor stays in the ``probing`` state indefinitely until you bring one of the other monitors up. If quorum has been established, then the Monitor daemon should be able to find the other Monitors quickly, as long as they can be reached. If a Monitor is stuck in the ``probing`` state and you have exhausted the procedures above that describe the troubleshooting of communications between the Monitors, then it is possible that the problem Monitor is trying to reach the other Monitors at a wrong address. ``mon_status`` outputs the ``monmap`` that is known to the monitor: determine whether the other Monitors' locations as specified in the ``monmap`` match the locations of the Monitors in the network. If they do not, see `Recovering a Monitor's Broken monmap`_. If the locations of the Monitors as specified in the ``monmap`` match the locations of the Monitors in the network, then the persistent ``probing`` state could be related to severe clock skews amongst the monitor nodes. See `Clock Skews`_. If the information in `Clock Skews`_ does not bring the Monitor out of the ``probing`` state, then prepare your system logs and ask the Ceph community for help. See `Preparing your logs`_ for information about the proper preparation of logs. **What does it mean when a Monitor's state is ``electing``?** If ``ceph health detail`` shows that a Monitor's state is ``electing``, the monitor is in the middle of an election. Elections typically complete quickly, but sometimes the monitors can get stuck in what is known as an *election storm*. See :ref:`Monitor Elections ` for more on monitor elections. The presence of election storm might indicate clock skew among the monitor nodes. See `Clock Skews`_ for more information. If your clocks are properly synchronized, search the mailing lists and bug tracker for issues similar to your issue. The ``electing`` state is not likely to persist. In versions of Ceph after the release of Cuttlefish, there is no obvious reason other than clock skew that explains why an ``electing`` state would persist. It is possible to investigate the cause of a persistent ``electing`` state if you put the problematic Monitor into a ``down`` state while you investigate. This is possible only if there are enough surviving Monitors to form quorum. **What does it mean when a Monitor's state is ``synchronizing``?** If ``ceph health detail`` shows that the Monitor is ``synchronizing``, the monitor is catching up with the rest of the cluster so that it can join the quorum. The amount of time that it takes for the Monitor to synchronize with the rest of the quorum is a function of the size of the cluster's monitor store, the cluster's size, and the state of the cluster. Larger and degraded clusters generally keep Monitors in the ``synchronizing`` state longer than do smaller, new clusters. A Monitor that changes its state from ``synchronizing`` to ``electing`` and then back to ``synchronizing`` indicates a problem: the cluster state may be advancing (that is, generating new maps) too fast for the synchronization process to keep up with the pace of the creation of the new maps. This issue presented more frequently prior to the Cuttlefish release than it does in more recent releases, because the synchronization process has since been refactored and enhanced to avoid this dynamic. If you experience this in later versions, report the issue in the `Ceph bug tracker `_. Prepare and provide logs to substantiate any bug you raise. See `Preparing your logs`_ for information about the proper preparation of logs. **What does it mean when a Monitor's state is ``leader`` or ``peon``?** If ``ceph health detail`` shows that the Monitor is in the ``leader`` state or in the ``peon`` state, it is likely that clock skew is present. Follow the instructions in `Clock Skews`_. If you have followed those instructions and ``ceph health detail`` still shows that the Monitor is in the ``leader`` state or the ``peon`` state, report the issue in the `Ceph bug tracker `_. If you raise an issue, provide logs to substantiate it. See `Preparing your logs`_ for information about the proper preparation of logs. Recovering a Monitor's Broken ``monmap`` ---------------------------------------- This is how a ``monmap`` usually looks, depending on the number of monitors:: epoch 3 fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 last_changed 2013-10-30 04:12:01.945629 created 2013-10-29 14:14:41.914786 0: 127.0.0.1:6789/0 mon.a 1: 127.0.0.1:6790/0 mon.b 2: 127.0.0.1:6795/0 mon.c This may not be what you have however. For instance, in some versions of early Cuttlefish there was a bug that could cause your ``monmap`` to be nullified. Completely filled with zeros. This means that not even ``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros. It's also possible to end up with a monitor with a severely outdated monmap, notably if the node has been down for months while you fight with your vendor's TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, then remove ``mon.a``, then add a new monitor ``mon.e`` and remove ``mon.b``; you will end up with a totally different monmap from the one ``mon.c`` knows). In this situation you have two possible solutions: Scrap the monitor and redeploy You should only take this route if you are positive that you won't lose the information kept by that monitor; that you have other monitors and that they are running just fine so that your new monitor is able to synchronize from the remaining monitors. Keep in mind that destroying a monitor, if there are no other copies of its contents, may lead to loss of data. Inject a monmap into the monitor These are the basic steps: Retrieve the ``monmap`` from the surviving monitors and inject it into the monitor whose ``monmap`` is corrupted or lost. Implement this solution by carrying out the following procedure: 1. Is there a quorum of monitors? If so, retrieve the ``monmap`` from the quorum:: $ ceph mon getmap -o /tmp/monmap 2. If there is no quorum, then retrieve the ``monmap`` directly from another monitor that has been stopped (in this example, the other monitor has the ID ``ID-FOO``):: $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap 3. Stop the monitor you are going to inject the monmap into. 4. Inject the monmap:: $ ceph-mon -i ID --inject-monmap /tmp/monmap 5. Start the monitor .. warning:: Injecting ``monmaps`` can cause serious problems because doing so will overwrite the latest existing ``monmap`` stored on the monitor. Be careful! Clock Skews ----------- The Paxos consensus algorithm requires close time synchroniziation, which means that clock skew among the monitors in the quorum can have a serious effect on monitor operation. The resulting behavior can be puzzling. To avoid this issue, run a clock synchronization tool on your monitor nodes: for example, use ``Chrony`` or the legacy ``ntpd`` utility. Configure each monitor nodes so that the `iburst` option is in effect and so that each monitor has multiple peers, including the following: * Each other * Internal ``NTP`` servers * Multiple external, public pool servers .. note:: The ``iburst`` option sends a burst of eight packets instead of the usual single packet, and is used during the process of getting two peers into initial synchronization. Furthermore, it is advisable to synchronize *all* nodes in your cluster against internal and external servers, and perhaps even against your monitors. Run ``NTP`` servers on bare metal: VM-virtualized clocks are not suitable for steady timekeeping. See `https://www.ntp.org `_ for more information about the Network Time Protocol (NTP). Your organization might already have quality internal ``NTP`` servers available. Sources for ``NTP`` server appliances include the following: * Microsemi (formerly Symmetricom) `https://microsemi.com `_ * EndRun `https://endruntechnologies.com `_ * Netburner `https://www.netburner.com `_ Clock Skew Questions and Answers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **What's the maximum tolerated clock skew?** By default, monitors allow clocks to drift up to a maximum of 0.05 seconds (50 milliseconds). **Can I increase the maximum tolerated clock skew?** Yes, but we strongly recommend against doing so. The maximum tolerated clock skew is configurable via the ``mon-clock-drift-allowed`` option, but it is almost certainly a bad idea to make changes to this option. The clock skew maximum is in place because clock-skewed monitors cannot be relied upon. The current default value has proven its worth at alerting the user before the monitors encounter serious problems. Changing this value might cause unforeseen effects on the stability of the monitors and overall cluster health. **How do I know whether there is a clock skew?** The monitors will warn you via the cluster status ``HEALTH_WARN``. When clock skew is present, the ``ceph health detail`` and ``ceph status`` commands return an output resembling the following:: mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s) In this example, the monitor ``mon.c`` has been flagged as suffering from clock skew. In Luminous and later releases, it is possible to check for a clock skew by running the ``ceph time-sync-status`` command. Note that the lead monitor typically has the numerically lowest IP address. It will always show ``0``: the reported offsets of other monitors are relative to the lead monitor, not to any external reference source. **What should I do if there is a clock skew?** Synchronize your clocks. Using an NTP client might help. However, if you are already using an NTP client and you still encounter clock skew problems, determine whether the NTP server that you are using is remote to your network or instead hosted on your network. Hosting your own NTP servers tends to mitigate clock skew problems. Client Can't Connect or Mount ----------------------------- Check your IP tables. Some operating-system install utilities add a ``REJECT`` rule to ``iptables``. ``iptables`` rules will reject all clients other than ``ssh`` that try to connect to the host. If your monitor host's IP tables have a ``REJECT`` rule in place, clients that are connecting from a separate node will fail and will raise a timeout error. Any ``iptables`` rules that reject clients trying to connect to Ceph daemons must be addressed. For example:: REJECT all -- anywhere anywhere reject-with icmp-host-prohibited It might also be necessary to add rules to iptables on your Ceph hosts to ensure that clients are able to access the TCP ports associated with your Ceph monitors (default: port 6789) and Ceph OSDs (default: 6800 through 7300). For example:: iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT Monitor Store Failures ====================== Symptoms of store corruption ---------------------------- Ceph monitors store the :term:`Cluster Map` in a key-value store. If key-value store corruption causes a monitor to fail, then the monitor log might contain one of the following error messages:: Corruption: error in middle of record or:: Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb Recovery using healthy monitor(s) --------------------------------- If there are surviving monitors, we can always :ref:`replace ` the corrupted monitor with a new one. After the new monitor boots, it will synchronize with a healthy peer. After the new monitor is fully synchronized, it will be able to serve clients. .. _mon-store-recovery-using-osds: Recovery using OSDs ------------------- Even if all monitors fail at the same time, it is possible to recover the monitor store by using information stored in OSDs. You are encouraged to deploy at least three (and preferably five) monitors in a Ceph cluster. In such a deployment, complete monitor failure is unlikely. However, unplanned power loss in a data center whose disk settings or filesystem settings are improperly configured could cause the underlying filesystem to fail and this could kill all of the monitors. In such a case, data in the OSDs can be used to recover the monitors. The following is such a script and can be used to recover the monitors: .. code-block:: bash ms=/root/mon-store mkdir $ms # collect the cluster map from stopped OSDs for host in $hosts; do rsync -avz $ms/. user@$host:$ms.remote rm -rf $ms ssh user@$host <