Diffstat (limited to 'doc/rados/troubleshooting')
-rw-r--r-- | doc/rados/troubleshooting/community.rst           |  28
-rw-r--r-- | doc/rados/troubleshooting/cpu-profiling.rst        |  67
-rw-r--r-- | doc/rados/troubleshooting/index.rst                |  19
-rw-r--r-- | doc/rados/troubleshooting/log-and-debug.rst        | 599
-rw-r--r-- | doc/rados/troubleshooting/memory-profiling.rst     | 142
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-mon.rst  | 613
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-osd.rst  | 620
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-pg.rst   | 693
8 files changed, 2781 insertions(+), 0 deletions(-)
diff --git a/doc/rados/troubleshooting/community.rst b/doc/rados/troubleshooting/community.rst new file mode 100644 index 000000000..f816584ae --- /dev/null +++ b/doc/rados/troubleshooting/community.rst @@ -0,0 +1,28 @@ +==================== + The Ceph Community +==================== + +The Ceph community is an excellent source of information and help. For +operational issues with Ceph releases we recommend you `subscribe to the +ceph-users email list`_. When you no longer want to receive emails, you can +`unsubscribe from the ceph-users email list`_. + +You may also `subscribe to the ceph-devel email list`_. You should do so if +your issue is: + +- Likely related to a bug +- Related to a development release package +- Related to a development testing package +- Related to your own builds + +If you no longer want to receive emails from the ``ceph-devel`` email list, you +may `unsubscribe from the ceph-devel email list`_. + +.. tip:: The Ceph community is growing rapidly, and community members can help + you if you provide them with detailed information about your problem. You + can attach the output of the ``ceph report`` command to help people understand your issues. + +.. _subscribe to the ceph-devel email list: mailto:dev-join@ceph.io +.. _unsubscribe from the ceph-devel email list: mailto:dev-leave@ceph.io +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@ceph.io +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@ceph.io diff --git a/doc/rados/troubleshooting/cpu-profiling.rst b/doc/rados/troubleshooting/cpu-profiling.rst new file mode 100644 index 000000000..159f7998d --- /dev/null +++ b/doc/rados/troubleshooting/cpu-profiling.rst @@ -0,0 +1,67 @@ +=============== + CPU Profiling +=============== + +If you built Ceph from source and compiled Ceph for use with `oprofile`_ +you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. + + +Initializing oprofile +===================== + +The first time you use ``oprofile`` you need to initialize it. Locate the +``vmlinux`` image corresponding to the kernel you are now running. :: + + ls /boot + sudo opcontrol --init + sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 + + +Starting oprofile +================= + +To start ``oprofile`` execute the following command:: + + opcontrol --start + +Once you start ``oprofile``, you may run some tests with Ceph. + + +Stopping oprofile +================= + +To stop ``oprofile`` execute the following command:: + + opcontrol --stop + + +Retrieving oprofile Results +=========================== + +To retrieve the top ``cmon`` results, execute the following command:: + + opreport -gal ./cmon | less + + +To retrieve the top ``cmon`` results with call graphs attached, execute the +following command:: + + opreport -cal ./cmon | less + +.. important:: After reviewing results, you should reset ``oprofile`` before + running it again. Resetting ``oprofile`` removes data from the session + directory. + + +Resetting oprofile +================== + +To reset ``oprofile``, execute the following command:: + + sudo opcontrol --reset + +.. important:: You should reset ``oprofile`` after analyzing data so that + you do not commingle results from different tests. + +.. _oprofile: http://oprofile.sourceforge.net/about/ +.. 
_Installing Oprofile: ../../../dev/cpu-profiler diff --git a/doc/rados/troubleshooting/index.rst b/doc/rados/troubleshooting/index.rst new file mode 100644 index 000000000..80d14f3ce --- /dev/null +++ b/doc/rados/troubleshooting/index.rst @@ -0,0 +1,19 @@ +================= + Troubleshooting +================= + +Ceph is still on the leading edge, so you may encounter situations that require +you to examine your configuration, modify your logging output, troubleshoot +monitors and OSDs, profile memory and CPU usage, and reach out to the +Ceph community for help. + +.. toctree:: + :maxdepth: 1 + + community + log-and-debug + troubleshooting-mon + troubleshooting-osd + troubleshooting-pg + memory-profiling + cpu-profiling diff --git a/doc/rados/troubleshooting/log-and-debug.rst b/doc/rados/troubleshooting/log-and-debug.rst new file mode 100644 index 000000000..71170149b --- /dev/null +++ b/doc/rados/troubleshooting/log-and-debug.rst @@ -0,0 +1,599 @@ +======================= + Logging and Debugging +======================= + +Typically, when you add debugging to your Ceph configuration, you do so at +runtime. You can also add Ceph debug logging to your Ceph configuration file if +you are encountering issues when starting your cluster. You may view Ceph log +files under ``/var/log/ceph`` (the default location). + +.. tip:: When debug output slows down your system, the latency can hide + race conditions. + +Logging is resource intensive. If you are encountering a problem in a specific +area of your cluster, enable logging for that area of the cluster. For example, +if your OSDs are running fine, but your metadata servers are not, you should +start by enabling debug logging for the specific metadata server instance(s) +giving you trouble. Enable logging for each subsystem as needed. + +.. important:: Verbose logging can generate over 1GB of data per hour. If your + OS disk reaches its capacity, the node will stop working. + +If you enable or increase the rate of Ceph logging, ensure that you have +sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for +details on rotating log files. When your system is running well, remove +unnecessary debugging settings to ensure your cluster runs optimally. Logging +debug output messages is relatively slow, and a waste of resources when +operating your cluster. + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + +Runtime +======= + +If you would like to see the configuration settings at runtime, you must log +in to a host with a running daemon and execute the following:: + + ceph daemon {daemon-name} config show | less + +For example,:: + + ceph daemon osd.0 config show | less + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the +``ceph tell`` command to inject arguments into the runtime configuration:: + + ceph tell {daemon-type}.{daemon id or *} config set {name} {value} + +Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply +the runtime setting to all daemons of a particular type with ``*``, or specify +a specific daemon's ID. For example, to increase +debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: + + ceph tell osd.0 config set debug_osd 0/5 + +The ``ceph tell`` command goes through the monitors. If you cannot bind to the +monitor, you can still make the change by logging into the host of the daemon +whose configuration you'd like to change using ``ceph daemon``. 
+For example:: + + sudo ceph daemon osd.0 config set debug_osd 0/5 + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + + +Boot Time +========= + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must +add settings to your Ceph configuration file. Subsystems common to each daemon +may be set under ``[global]`` in your configuration file. Subsystems for +particular daemons are set under the daemon section in your configuration file +(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example:: + + [global] + debug ms = 1/5 + + [mon] + debug mon = 20 + debug paxos = 1/5 + debug auth = 2 + + [osd] + debug osd = 1/5 + debug filestore = 1/5 + debug journal = 1 + debug monc = 5/20 + + [mds] + debug mds = 1 + debug mds balancer = 1 + + +See `Subsystem, Log and Debug Settings`_ for details. + + +Accelerating Log Rotation +========================= + +If your OS disk is relatively full, you can accelerate log rotation by modifying +the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting +after the rotation frequency to accelerate log rotation (via cronjob) if your +logs exceed the size setting. For example, the default setting looks like +this:: + + rotate 7 + weekly + compress + sharedscripts + +Modify it by adding a ``size`` setting. :: + + rotate 7 + weekly + size 500M + compress + sharedscripts + +Then, start the crontab editor for your user space. :: + + crontab -e + +Finally, add an entry to check the ``etc/logrotate.d/ceph`` file. :: + + 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 + +The preceding example checks the ``etc/logrotate.d/ceph`` file every 30 minutes. + + +Valgrind +======== + +Debugging may also require you to track down memory and threading issues. +You can run a single daemon, a type of daemon, or the whole cluster with +Valgrind. You should only use Valgrind when developing or debugging Ceph. +Valgrind is computationally expensive, and will slow down your system otherwise. +Valgrind messages are logged to ``stderr``. + + +Subsystem, Log and Debug Settings +================================= + +In most cases, you will enable debug logging output via subsystems. + +Ceph Subsystems +--------------- + +Each subsystem has a logging level for its output logs, and for its logs +in-memory. You may set different values for each of these subsystems by setting +a log file level and a memory level for debug logging. Ceph's logging levels +operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is +verbose [#]_ . In general, the logs in-memory are not sent to the output log unless: + +- a fatal signal is raised or +- an ``assert`` in source code is triggered or +- upon requested. Please consult `document on admin socket <http://docs.ceph.com/en/latest/man/8/ceph/#daemon>`_ for more details. + +A debug logging setting can take a single value for the log level and the +memory level, which sets them both as the same value. For example, if you +specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level +of ``5``. You may also specify them separately. The first setting is the log +level, and the second setting is the memory level. You must separate them with +a forward slash (/). For example, if you want to set the ``ms`` subsystem's +debug logging level to ``1`` and its memory level to ``5``, you would specify it +as ``debug ms = 1/5``. For example: + + + +.. 
code-block:: ini + + debug {subsystem} = {log-level}/{memory-level} + #for example + debug mds balancer = 1/20 + + +The following table provides a list of Ceph subsystems and their default log and +memory levels. Once you complete your logging efforts, restore the subsystems +to their default level or to a level suitable for normal operations. + + ++--------------------+-----------+--------------+ +| Subsystem | Log Level | Memory Level | ++====================+===========+==============+ +| ``default`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``lockdep`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``context`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``crush`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``mds`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds balancer`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds locker`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log expire`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds migrator`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``buffer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``timer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``filer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``striper`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``objecter`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``rados`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd mirror`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd replay`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``journaler`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objectcacher`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``client`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``osd`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``optracker`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objclass`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``filestore`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``journal`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``ms`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``mon`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``monc`` | 0 | 10 | ++--------------------+-----------+--------------+ +| ``paxos`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``tp`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``auth`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``crypto`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``finisher`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``reserver`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``heartbeatmap`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``perfcounter`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw sync`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``civetweb`` | 
1 | 10 | ++--------------------+-----------+--------------+ +| ``javaclient`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``asok`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``throttle`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``refs`` | 0 | 0 | ++--------------------+-----------+--------------+ +| ``compressor`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bluestore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bluefs`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bdev`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``kstore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rocksdb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``leveldb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``memdb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``fuse`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mgr`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mgrc`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``dpdk`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``eventtrace`` | 1 | 5 | ++--------------------+-----------+--------------+ + + +Logging Settings +---------------- + +Logging and debugging settings are not required in a Ceph configuration file, +but you may override default settings as needed. Ceph supports the following +settings: + + +``log file`` + +:Description: The location of the logging file for your cluster. +:Type: String +:Required: No +:Default: ``/var/log/ceph/$cluster-$name.log`` + + +``log max new`` + +:Description: The maximum number of new log files. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``log max recent`` + +:Description: The maximum number of recent events to include in a log file. +:Type: Integer +:Required: No +:Default: ``10000`` + + +``log to file`` + +:Description: Determines if logging messages should appear in a file. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to stderr`` + +:Description: Determines if logging messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to stderr`` + +:Description: Determines if error messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to syslog`` + +:Description: Determines if logging messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to syslog`` + +:Description: Determines if error messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``log flush on exit`` + +:Description: Determines if Ceph should flush the log files after exit. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``clog to monitors`` + +:Description: Determines if ``clog`` messages should be sent to monitors. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to syslog`` + +:Description: Determines if ``clog`` messages should be sent to syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log to syslog`` + +:Description: Determines if the cluster log should be output to the syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log file`` + +:Description: The locations of the cluster's log files. There are two channels in + Ceph: ``cluster`` and ``audit``. 
This option represents a mapping + from channels to log files, where the log entries of that + channel are sent to. The ``default`` entry is a fallback + mapping for channels not explicitly specified. So, the following + default setting will send cluster log to ``$cluster.log``, and + send audit log to ``$cluster.audit.log``, where ``$cluster`` will + be replaced with the actual cluster name. +:Type: String +:Required: No +:Default: ``default=/var/log/ceph/$cluster.$channel.log,cluster=/var/log/ceph/$cluster.log`` + + + +OSD +--- + + +``osd debug drop ping probability`` + +:Description: ? +:Type: Double +:Required: No +:Default: 0 + + +``osd debug drop ping duration`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create probability`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create duration`` + +:Description: ? +:Type: Double +:Required: No +:Default: 1 + + +``osd min pg log entries`` + +:Description: The minimum number of log entries for placement groups. +:Type: 32-bit Unsigned Integer +:Required: No +:Default: 250 + + +``osd op log threshold`` + +:Description: How many op log messages to show up in one pass. +:Type: Integer +:Required: No +:Default: 5 + + + +Filestore +--------- + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. This is an expensive operation. +:Type: Boolean +:Required: No +:Default: ``false`` + + +MDS +--- + + +``mds debug scatterstat`` + +:Description: Ceph will assert that various recursive stat invariants are true + (for developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug frag`` + +:Description: Ceph will verify directory fragmentation invariants when + convenient (developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug auth pins`` + +:Description: The debug auth pin invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug subtrees`` + +:Description: The debug subtree invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + + +RADOS Gateway +------------- + + +``rgw log nonexistent bucket`` + +:Description: Should we log a non-existent buckets? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw log object name`` + +:Description: Should an object's name be logged. // man date to see codes (a subset are supported) +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%H-%i-%n`` + + +``rgw log object name utc`` + +:Description: Object log name contains UTC? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw enable ops log`` + +:Description: Enables logging of every RGW operation. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``rgw enable usage log`` + +:Description: Enable logging of RGW's bandwidth usage. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw usage log flush threshold`` + +:Description: Threshold to flush pending log data. +:Type: Integer +:Required: No +:Default: ``1024`` + + +``rgw usage log tick interval`` + +:Description: Flush pending log data every ``s`` seconds. +:Type: Integer +:Required: No +:Default: 30 + + +``rgw intent log object name`` + +:Description: +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%i-%n`` + + +``rgw intent log object name utc`` + +:Description: Include a UTC timestamp in the intent log object name. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. 
[#] there are levels >20 in some rare cases and that they are extremely verbose. diff --git a/doc/rados/troubleshooting/memory-profiling.rst b/doc/rados/troubleshooting/memory-profiling.rst new file mode 100644 index 000000000..e2396e2fd --- /dev/null +++ b/doc/rados/troubleshooting/memory-profiling.rst @@ -0,0 +1,142 @@ +================== + Memory Profiling +================== + +Ceph MON, OSD and MDS can generate heap profiles using +``tcmalloc``. To generate heap profiles, ensure you have +``google-perftools`` installed:: + + sudo apt-get install google-perftools + +The profiler dumps output to your ``log file`` directory (i.e., +``/var/log/ceph``). See `Logging and Debugging`_ for details. +To view the profiler logs with Google's performance tools, execute the +following:: + + google-pprof --text {path-to-daemon} {log-path/filename} + +For example:: + + $ ceph tell osd.0 heap start_profiler + $ ceph tell osd.0 heap dump + osd.0 tcmalloc heap stats:------------------------------------------------ + MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application + MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist + MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist + MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist + MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists + MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) + MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used + MALLOC: + MALLOC: 231 Spans in use + MALLOC: 56 Thread heaps in use + MALLOC: 8192 Tcmalloc page size + ------------------------------------------------ + Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). + Bytes released to the OS take up virtual address space but no physical memory. + $ google-pprof --text \ + /usr/bin/ceph-osd \ + /var/log/ceph/ceph-osd.0.profile.0001.heap + Total: 3.7 MB + 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry + 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create + 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe + 0.0 0.4% 99.2% 0.0 0.6% decode_message + ... + +Another heap dump on the same daemon will add another file. It is +convenient to compare to a previous heap dump to show what has grown +in the interval. For instance:: + + $ google-pprof --text --base out/osd.0.profile.0001.heap \ + ceph-osd out/osd.0.profile.0003.heap + Total: 0.2 MB + 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry + 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create + 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op + 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate + +Refer to `Google Heap Profiler`_ for additional details. + +Once you have the heap profiler installed, start your cluster and +begin using the heap profiler. You may enable or disable the heap +profiler at runtime, or ensure that it runs continuously. For the +following commandline usage, replace ``{daemon-type}`` with ``mon``, +``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or +the MON or MDS id. 
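
Before stepping through each command individually in the sections below, it may
help to see how they fit together. The following is a minimal sketch of a
complete profiling session for ``osd.0``; the dump filenames under
``/var/log/ceph`` are illustrative and may differ on your system, so use the
paths reported by the ``heap dump`` output::

    # begin collecting allocation data for osd.0
    ceph tell osd.0 heap start_profiler

    # run the workload you want to measure, then write a heap dump;
    # repeat after more workload to get a second dump to compare against
    ceph tell osd.0 heap dump
    ceph tell osd.0 heap dump

    # stop profiling once enough samples have been collected
    ceph tell osd.0 heap stop_profiler

    # compare the two dumps to see what grew in the interval
    google-pprof --text --base /var/log/ceph/ceph-osd.0.profile.0001.heap \
        /usr/bin/ceph-osd /var/log/ceph/ceph-osd.0.profile.0002.heap | less
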
+ + +Starting the Profiler +--------------------- + +To start the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap start_profiler + +For example:: + + ceph tell osd.1 heap start_profiler + +Alternatively the profile can be started when the daemon starts +running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in +the environment. + +Printing Stats +-------------- + +To print out statistics, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stats + +For example:: + + ceph tell osd.0 heap stats + +.. note:: Printing stats does not require the profiler to be running and does + not dump the heap allocation information to a file. + + +Dumping Heap Information +------------------------ + +To dump heap information, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap dump + +For example:: + + ceph tell mds.a heap dump + +.. note:: Dumping heap information only works when the profiler is running. + + +Releasing Memory +---------------- + +To release memory that ``tcmalloc`` has allocated but which is not being used by +the Ceph daemon itself, execute the following:: + + ceph tell {daemon-type}{daemon-id} heap release + +For example:: + + ceph tell osd.2 heap release + + +Stopping the Profiler +--------------------- + +To stop the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stop_profiler + +For example:: + + ceph tell osd.0 heap stop_profiler + +.. _Logging and Debugging: ../log-and-debug +.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst new file mode 100644 index 000000000..dc575f761 --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -0,0 +1,613 @@ +.. _rados-troubleshooting-mon: + +================================= + Troubleshooting Monitors +================================= + +.. index:: monitor, high availability + +When a cluster encounters monitor-related troubles there's a tendency to +panic, and sometimes with good reason. Losing one or more monitors doesn't +necessarily mean that your cluster is down, so long as a majority are up, +running, and form a quorum. +Regardless of how bad the situation is, the first thing you should do is to +calm down, take a breath, and step through the below troubleshooting steps. + + +Initial Troubleshooting +======================== + + +**Are the monitors running?** + + First of all, we need to make sure the monitor (*mon*) daemon processes + (``ceph-mon``) are running. You would be amazed by how often Ceph admins + forget to start the mons, or to restart them after an upgrade. There's no + shame, but try to not lose a couple of hours looking for a deeper problem. + When running Kraken or later releases also ensure that the manager + daemons (``ceph-mgr``) are running, usually alongside each ``ceph-mon``. + + +**Are you able to reach to the mon nodes?** + + Doesn't happen often, but sometimes there are ``iptables`` rules that + block accesse to mon nodes or TCP ports. These may be leftovers from + prior stress-testing or rule development. Try SSHing into + the server and, if that succeeds, try connecting to the monitor's ports + (``tcp/3300`` and ``tcp/6789``) using a ``telnet``, ``nc``, or similar tools. + +**Does ceph -s run and obtain a reply from the cluster?** + + If the answer is yes then your cluster is up and running. 
One thing you + can take for granted is that the monitors will only answer to a ``status`` + request if there is a formed quorum. Also check that at least one ``mgr`` + daemon is reported as running, ideally all of them. + + If ``ceph -s`` hangs without obtaining a reply from the cluster + or showing ``fault`` messages, then it is likely that your monitors + are either down completely or just a fraction are up -- a fraction + insufficient to form a majority quorum. This check will connect to an + arbitrary mon; in rare cases it may be illuminating to bind to specific + mons in sequence by adding e.g. ``-m mymon1`` to the command. + +**What if ceph -s doesn't come back?** + + If you haven't gone through all the steps so far, please go back and do. + + You can contact each monitor individually asking them for their status, + regardless of a quorum being formed. This can be achieved using + ``ceph tell mon.ID mon_status``, ID being the monitor's identifier. You should + perform this for each monitor in the cluster. In section `Understanding + mon_status`_ we will explain how to interpret the output of this command. + + You may instead SSH into each mon node and query the daemon's admin socket. + + +Using the monitor's admin socket +================================= + +The admin socket allows you to interact with a given daemon directly using a +Unix socket file. This file can be found in your monitor's ``run`` directory. +By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok`` +but this may be elsewhere if you have overridden the default directory. If you +don't find it there, check your ``ceph.conf`` for an alternative path or +run:: + + ceph-conf --name mon.ID --show-config-value admin_socket + +Bear in mind that the admin socket will be available only while the monitor +daemon is running. When the monitor is properly shut down, the admin socket +will be removed. If however the monitor is not running and the admin socket +persists, it is likely that the monitor was improperly shut down. +Regardless, if the monitor is not running, you will not be able to use the +admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``. + +Accessing the admin socket is as simple as running ``ceph tell`` on the daemon +you are interested in. For example:: + + ceph tell mon.<id> mon_status + +Under the hood, this passes the command ``help`` to the running MON daemon +``<id>`` via its "admin socket", which is a file ending in ``.asok`` +somewhere under ``/var/run/ceph``. Once you know the full path to the file, +you can even do this yourself:: + + ceph --admin-daemon <full_path_to_asok_file> <command> + +Using ``help`` as the command to the ``ceph`` tool will show you the +supported commands available through the admin socket. Please take a look +at ``config get``, ``config show``, ``mon stat`` and ``quorum_status``, +as those can be enlightening when troubleshooting a monitor. + + +Understanding mon_status +========================= + +``mon_status`` can always be obtained via the admin socket. This command will +output a multitude of information about the monitor, including the same output +you would get with ``quorum_status``. 
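
When several monitors are misbehaving it is often useful to collect this output
from every monitor at once and compare the ``state``, ``rank`` and ``quorum``
fields side by side. A minimal sketch, assuming monitors named ``a``, ``b`` and
``c`` and the ``jq`` utility for extracting fields (adjust the IDs to match
your cluster)::

    for id in a b c; do
        echo "== mon.${id} =="
        ceph tell mon.${id} mon_status | jq '{name, rank, state, quorum}'
    done

If ``ceph tell`` cannot reach a monitor, run the equivalent
``ceph daemon mon.ID mon_status`` on that monitor's host instead, as described
in `Using the monitor's admin socket`_.
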
+ +Take the following example output of ``ceph tell mon.c mon_status``:: + + + { "name": "c", + "rank": 2, + "state": "peon", + "election_epoch": 38, + "quorum": [ + 1, + 2], + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { "epoch": 3, + "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", + "modified": "2013-10-30 04:12:01.945629", + "created": "2013-10-29 14:14:41.914786", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6795\/0"}]}} + +A couple of things are obvious: we have three monitors in the monmap (*a*, *b* +and *c*), the quorum is formed by only two monitors, and *c* is in the quorum +as a *peon*. + +Which monitor is out of the quorum? + + The answer would be **a**. + +Why? + + Take a look at the ``quorum`` set. We have two monitors in this set: *1* + and *2*. These are not monitor names. These are monitor ranks, as established + in the current monmap. We are missing the monitor with rank 0, and according + to the monmap that would be ``mon.a``. + +By the way, how are ranks established? + + Ranks are (re)calculated whenever you add or remove monitors and follow a + simple rule: the **greater** the ``IP:PORT`` combination, the **lower** the + rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all + the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0. + +Most Common Monitor Issues +=========================== + +Have Quorum but at least one Monitor is down +--------------------------------------------- + +When this happens, depending on the version of Ceph you are running, +you should be seeing something similar to:: + + $ ceph health detail + [snip] + mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) + +How to troubleshoot this? + + First, make sure ``mon.a`` is running. + + Second, make sure you are able to connect to ``mon.a``'s node from the + other mon nodes. Check the TCP ports as well. Check ``iptables`` and + ``nf_conntrack`` on all nodes and ensure that you are not + dropping/rejecting connections. + + If this initial troubleshooting doesn't solve your problems, then it's + time to go deeper. + + First, check the problematic monitor's ``mon_status`` via the admin + socket as explained in `Using the monitor's admin socket`_ and + `Understanding mon_status`_. + + If the monitor is out of the quorum, its state should be one of + ``probing``, ``electing`` or ``synchronizing``. If it happens to be either + ``leader`` or ``peon``, then the monitor believes to be in quorum, while + the remaining cluster is sure it is not; or maybe it got into the quorum + while we were troubleshooting the monitor, so check you ``ceph -s`` again + just to make sure. Proceed if the monitor is not yet in the quorum. + +What if the state is ``probing``? + + This means the monitor is still looking for the other monitors. Every time + you start a monitor, the monitor will stay in this state for some time + while trying to connect the rest of the monitors specified in the ``monmap``. + The time a monitor will spend in this state can vary. For instance, when on + a single-monitor cluster (never do this in production), + the monitor will pass through the probing state almost instantaneously. 
+ In a multi-monitor cluster, the monitors will stay in this state until they + find enough monitors to form a quorum -- this means that if you have 2 out + of 3 monitors down, the one remaining monitor will stay in this state + indefinitely until you bring one of the other monitors up. + + If you have a quorum the starting daemon should be able to find the + other monitors quickly, as long as they can be reached. If your + monitor is stuck probing and you have gone through with all the communication + troubleshooting, then there is a fair chance that the monitor is trying + to reach the other monitors on a wrong address. ``mon_status`` outputs the + ``monmap`` known to the monitor: check if the other monitor's locations + match reality. If they don't, jump to + `Recovering a Monitor's Broken monmap`_; if they do, then it may be related + to severe clock skews amongst the monitor nodes and you should refer to + `Clock Skews`_ first, but if that doesn't solve your problem then it is + the time to prepare some logs and reach out to the community (please refer + to `Preparing your logs`_ on how to best prepare your logs). + + +What if state is ``electing``? + + This means the monitor is in the middle of an election. With recent Ceph + releases these typically complete quickly, but at times the monitors can + get stuck in what is known as an *election storm*. This can indicate + clock skew among the monitor nodes; jump to + `Clock Skews`_ for more information. If all your clocks are properly + synchronized, you should search the mailing lists and tracker. + This is not a state that is likely to persist and aside from + (*really*) old bugs there is not an obvious reason besides clock skews on + why this would happen. Worst case, if there are enough surviving mons, + down the problematic one while you investigate. + +What if state is ``synchronizing``? + + This means the monitor is catching up with the rest of the cluster in + order to join the quorum. Time to synchronize is a function of the size + of your monitor store and thus of cluster size and state, so if you have a + large or degraded cluster this may take a while. + + If you notice that the monitor jumps from ``synchronizing`` to + ``electing`` and then back to ``synchronizing``, then you do have a + problem: the cluster state may be advancing (i.e., generating new maps) + too fast for the synchronization process to keep up. This was a more common + thing in early days (Cuttlefish), but since then the synchronization process + has been refactored and enhanced to avoid this dynamic. If you experience + this in later versions please let us know via a bug tracker. And bring some logs + (see `Preparing your logs`_). + +What if state is ``leader`` or ``peon``? + + This should not happen: famous last words. If it does, however, it likely + has a lot to do with clock skew -- see `Clock Skews`_. If you are not + suffering from clock skew, then please prepare your logs (see + `Preparing your logs`_) and reach out to the community. + + +Recovering a Monitor's Broken ``monmap`` +---------------------------------------- + +This is how a ``monmap`` usually looks, depending on the number of +monitors:: + + + epoch 3 + fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 + last_changed 2013-10-30 04:12:01.945629 + created 2013-10-29 14:14:41.914786 + 0: 127.0.0.1:6789/0 mon.a + 1: 127.0.0.1:6790/0 mon.b + 2: 127.0.0.1:6795/0 mon.c + +This may not be what you have however. 
For instance, in some versions of +early Cuttlefish there was a bug that could cause your ``monmap`` +to be nullified. Completely filled with zeros. This means that not even +``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros. +It's also possible to end up with a monitor with a severely outdated monmap, +notably if the node has been down for months while you fight with your vendor's +TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving +monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, +then remove ``mon.a``, then add a new monitor ``mon.e`` and remove +``mon.b``; you will end up with a totally different monmap from the one +``mon.c`` knows). + +In this situation you have two possible solutions: + +Scrap the monitor and redeploy + + You should only take this route if you are positive that you won't + lose the information kept by that monitor; that you have other monitors + and that they are running just fine so that your new monitor is able + to synchronize from the remaining monitors. Keep in mind that destroying + a monitor, if there are no other copies of its contents, may lead to + loss of data. + +Inject a monmap into the monitor + + Usually the safest path. You should grab the monmap from the remaining + monitors and inject it into the monitor with the corrupted/lost monmap. + + These are the basic steps: + + 1. Is there a formed quorum? If so, grab the monmap from the quorum:: + + $ ceph mon getmap -o /tmp/monmap + + 2. No quorum? Grab the monmap directly from another monitor (this + assumes the monitor you are grabbing the monmap from has id ID-FOO + and has been stopped):: + + $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap + + 3. Stop the monitor you are going to inject the monmap into. + + 4. Inject the monmap:: + + $ ceph-mon -i ID --inject-monmap /tmp/monmap + + 5. Start the monitor + + Please keep in mind that the ability to inject monmaps is a powerful + feature that can cause havoc with your monitors if misused as it will + overwrite the latest, existing monmap kept by the monitor. + + +Clock Skews +------------ + +Monitor operation can be severely affected by clock skew among the quorum's +mons, as the PAXOS consensus algorithm requires tight time alignment. +Skew can result in weird behavior with no obvious +cause. To avoid such issues, you must run a clock synchronization tool +on your monitor nodes: ``Chrony`` or the legacy ``ntpd``. Be sure to +configure the mon nodes with the `iburst` option and multiple peers: + +* Each other +* Internal ``NTP`` servers +* Multiple external, public pool servers + +For good measure, *all* nodes in your cluster should also sync against +internal and external servers, and perhaps even your mons. ``NTP`` servers +should run on bare metal; VM virtualized clocks are not suitable for steady +timekeeping. Visit `https://www.ntp.org <https://www.ntp.org>`_ for more info. Your +organization may already have quality internal ``NTP`` servers you can use. +Sources for ``NTP`` server appliances include: + +* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_ +* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_ +* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_ + + +What's the maximum tolerated clock skew? 
+ + By default the monitors will allow clocks to drift up to 0.05 seconds (50 ms). + + +Can I increase the maximum tolerated clock skew? + + The maximum tolerated clock skew is configurable via the + ``mon-clock-drift-allowed`` option, and + although you *CAN* you almost certainly *SHOULDN'T*. The clock skew mechanism + is in place because clock-skewed monitors are liely to misbehave. We, as + developers and QA aficionados, are comfortable with the current default + value, as it will alert the user before the monitors get out hand. Changing + this value may cause unforeseen effects on the + stability of the monitors and overall cluster health. + +How do I know there's a clock skew? + + The monitors will warn you via the cluster status ``HEALTH_WARN``. ``ceph health + detail`` or ``ceph status`` should show something like:: + + mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s) + + That means that ``mon.c`` has been flagged as suffering from a clock skew. + + On releases beginning with Luminous you can issue the + ``ceph time-sync-status`` command to check status. Note that the lead mon + is typically the one with the numerically lowest IP address. It will always + show ``0``: the reported offsets of other mons are relative to + the lead mon, not to any external reference source. + + +What should I do if there's a clock skew? + + Synchronize your clocks. Running an NTP client may help. If you are already + using one and you hit this sort of issues, check if you are using some NTP + server remote to your network and consider hosting your own NTP server on + your network. This last option tends to reduce the amount of issues with + monitor clock skews. + + +Client Can't Connect or Mount +------------------------------ + +Check your IP tables. Some OS install utilities add a ``REJECT`` rule to +``iptables``. The rule rejects all clients trying to connect to the host except +for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in +place, clients connecting from a separate node will fail to mount with a timeout +error. You need to address ``iptables`` rules that reject clients trying to +connect to Ceph daemons. For example, you would need to address rules that look +like this appropriately:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You may also need to add rules to IP tables on your Ceph hosts to ensure +that clients can access the ports associated with your Ceph monitors (i.e., port +6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For +example:: + + iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT + +Monitor Store Failures +====================== + +Symptoms of store corruption +---------------------------- + +Ceph monitor stores the :term:`Cluster Map` in a key/value store such as LevelDB. If +a monitor fails due to the key/value store corruption, following error messages +might be found in the monitor log:: + + Corruption: error in middle of record + +or:: + + Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb + +Recovery using healthy monitor(s) +--------------------------------- + +If there are any survivors, we can always :ref:`replace <adding-and-removing-monitors>` the corrupted one with a +new one. After booting up, the new joiner will sync up with a healthy +peer, and once it is fully sync'ed, it will be able to serve the clients. + +.. 
_mon-store-recovery-using-osds: + +Recovery using OSDs +------------------- + +But what if all monitors fail at the same time? Since users are encouraged to +deploy at least three (and preferably five) monitors in a Ceph cluster, the chance of simultaneous +failure is rare. But unplanned power-downs in a data center with improperly +configured disk/fs settings could fail the underlying file system, and hence +kill all the monitors. In this case, we can recover the monitor store with the +information stored in OSDs.:: + + ms=/root/mon-store + mkdir $ms + + # collect the cluster map from stopped OSDs + for host in $hosts; do + rsync -avz $ms/. user@$host:$ms.remote + rm -rf $ms + ssh user@$host <<EOF + for osd in /var/lib/ceph/osd/ceph-*; do + ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote + done + EOF + rsync -avz user@$host:$ms.remote/. $ms + done + + # rebuild the monitor store from the collected map, if the cluster does not + # use cephx authentication, we can skip the following steps to update the + # keyring with the caps, and there is no need to pass the "--keyring" option. + # i.e. just use "ceph-monstore-tool $ms rebuild" instead + ceph-authtool /path/to/admin.keyring -n mon. \ + --cap mon 'allow *' + ceph-authtool /path/to/admin.keyring -n client.admin \ + --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' + # add one or more ceph-mgr's key to the keyring. in this case, an encoded key + # for mgr.x is added, you can find the encoded key in + # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is + # deployed + ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \ + --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *' + # if your monitors' ids are not single characters like 'a', 'b', 'c', please + # specify them in the command line by passing them as arguments of the "--mon-ids" + # option. if you are not sure, please check your ceph.conf to see if there is any + # sections named like '[mon.foo]'. don't pass the "--mon-ids" option, if you are + # using DNS SRV for looking up monitors. + ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma + + # make a backup of the corrupted store.db just in case! repeat for + # all monitors. + mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted + + # move rebuild store.db into place. repeat for all monitors. + mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db + chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db + +The steps above + +#. collect the map from all OSD hosts, +#. then rebuild the store, +#. fill the entities in keyring file with appropriate caps +#. replace the corrupted store on ``mon.foo`` with the recovered copy. + +Known limitations +~~~~~~~~~~~~~~~~~ + +Following information are not recoverable using the steps above: + +- **some added keyrings**: all the OSD keyrings added using ``ceph auth add`` command + are recovered from the OSD's copy. And the ``client.admin`` keyring is imported + using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing + in the recovered monitor store. You might need to re-add them manually. + +- **creating pools**: If any RADOS pools were in the process of being creating, that state is lost. The recovery tool assumes that all pools have been created. 
If there are PGs that are stuck in the 'unknown' after the recovery for a partially created pool, you can force creation of the *empty* PG with the ``ceph osd force-create-pg`` command. Note that this will create an *empty* PG, so only do this if you know the pool is empty. + +- **MDS Maps**: the MDS maps are lost. + + + +Everything Failed! Now What? +============================= + +Reaching out for help +---------------------- + +You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net) +and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make +sure you have grabbed your logs and have them ready if someone asks: the faster +the interaction and lower the latency in response, the better chances everyone's +time is optimized. + + +Preparing your logs +--------------------- + +Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We +may want them. However, your logs may not have the necessary information. If +you don't find your monitor logs at their default location, you can check +where they should be by running:: + + ceph-conf --name mon.FOO --show-config-value log_file + +The amount of information in the logs are subject to the debug levels being +enforced by your configuration files. If you have not enforced a specific +debug level then Ceph is using the default levels and your logs may not +contain important information to track down you issue. +A first step in getting relevant information into your logs will be to raise +debug levels. In this case we will be interested in the information from the +monitor. +Similarly to what happens on other components, different parts of the monitor +will output their debug information on different subsystems. + +You will have to raise the debug levels of those subsystems more closely +related to your issue. This may not be an easy task for someone unfamiliar +with troubleshooting Ceph. For most situations, setting the following options +on your monitors will be enough to pinpoint a potential source of the issue:: + + debug mon = 10 + debug ms = 1 + +If we find that these debug levels are not enough, there's a chance we may +ask you to raise them or even define other debug subsystems to obtain infos +from -- but at least we started off with some useful information, instead +of a massively empty log without much to go on with. + +Do I need to restart a monitor to adjust debug levels? +------------------------------------------------------ + +No. You may do it in one of two ways: + +You have quorum + + Either inject the debug option into the monitor you want to debug:: + + ceph tell mon.FOO config set debug_mon 10/10 + + or into all monitors at once:: + + ceph tell mon.* config set debug_mon 10/10 + +No quorum + + Use the monitor's admin socket and directly adjust the configuration + options:: + + ceph daemon mon.FOO config set debug_mon 10/10 + + +Going back to default values is as easy as rerunning the above commands +using the debug level ``1/10`` instead. You can check your current +values using the admin socket and the following commands:: + + ceph daemon mon.FOO config show + +or:: + + ceph daemon mon.FOO config get 'OPTION_NAME' + + +Reproduced the problem with appropriate debug levels. Now what? +---------------------------------------------------------------- + +Ideally you would send us only the relevant portions of your logs. +We realise that figuring out the corresponding portion may not be the +easiest of tasks. 
Therefore, we won't hold it to you if you provide the +full log, but common sense should be employed. If your log has hundreds +of thousands of lines, it may get tricky to go through the whole thing, +specially if we are not aware at which point, whatever your issue is, +happened. For instance, when reproducing, keep in mind to write down +current time and date and to extract the relevant portions of your logs +based on that. + +Finally, you should reach out to us on the mailing lists, on IRC or file +a new issue on the `tracker`_. + +.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst new file mode 100644 index 000000000..cc852d73d --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-osd.rst @@ -0,0 +1,620 @@ +====================== + Troubleshooting OSDs +====================== + +Before troubleshooting your OSDs, first check your monitors and network. If +you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph shows +``HEALTH_OK``, it means that the monitors have a quorum. +If you don't have a monitor quorum or if there are errors with the monitor +status, `address the monitor issues first <../troubleshooting-mon>`_. +Check your networks to ensure they +are running properly, because networks may have a significant impact on OSD +operation and performance. Look for dropped packets on the host side +and CRC errors on the switch side. + +Obtaining Data About OSDs +========================= + +A good first step in troubleshooting your OSDs is to obtain topology information in +addition to the information you collected while `monitoring your OSDs`_ +(e.g., ``ceph osd tree``). + + +Ceph Logs +--------- + +If you haven't changed the default path, you can find Ceph log files at +``/var/log/ceph``:: + + ls /var/log/ceph + +If you don't see enough log detail you can change your logging level. See +`Logging and Debugging`_ for details to ensure that Ceph performs adequately +under high logging volume. + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. For details, list +the sockets for your Ceph daemons:: + + ls /var/run/ceph + +Then, execute the following, replacing ``{daemon-name}`` with an actual +daemon (e.g., ``osd.0``):: + + ceph daemon osd.0 help + +Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``):: + + ceph daemon {socket-file} help + +The admin socket, among other things, allows you to: + +- List your configuration at runtime +- Dump historic operations +- Dump the operation priority queue state +- Dump operations in flight +- Dump perfcounters + +Display Freespace +----------------- + +Filesystem issues may arise. To display your file system's free space, execute +``df``. :: + + df -h + +Execute ``df --help`` for additional usage. + +I/O Statistics +-------------- + +Use `iostat`_ to identify I/O-related issues. :: + + iostat -x + +Diagnostic Messages +------------------- + +To retrieve diagnostic messages from the kernel, use ``dmesg`` with ``less``, ``more``, ``grep`` +or ``tail``. For example:: + + dmesg | grep scsi + +Stopping w/out Rebalancing +========================== + +Periodically, you may need to perform maintenance on a subset of your cluster, +or resolve a problem that affects a failure domain (e.g., a rack). 
If you do not +want CRUSH to automatically rebalance the cluster as you stop OSDs for +maintenance, set the cluster to ``noout`` first:: + + ceph osd set noout + +On Luminous or newer releases it is safer to set the flag only on affected OSDs. +You can do this individually :: + + ceph osd add-noout osd.0 + ceph osd rm-noout osd.0 + +Or an entire CRUSH bucket at a time. Say you're going to take down +``prod-ceph-data1701`` to add RAM :: + + ceph osd set-group noout prod-ceph-data1701 + +Once the flag is set you can stop the OSDs and any other colocated Ceph +services within the failure domain that requires maintenance work. :: + + systemctl stop ceph\*.service ceph\*.target + +.. note:: Placement groups within the OSDs you stop will become ``degraded`` + while you are addressing issues with within the failure domain. + +Once you have completed your maintenance, restart the OSDs and any other +daemons. If you rebooted the host as part of the maintenance, these should +come back on their own without intervention. :: + + sudo systemctl start ceph.target + +Finally, you must unset the cluster-wide``noout`` flag:: + + ceph osd unset noout + ceph osd unset-group noout prod-ceph-data1701 + +Note that most Linux distributions that Ceph supports today employ ``systemd`` +for service management. For other or older operating systems you may need +to issue equivalent ``service`` or ``start``/``stop`` commands. + +.. _osd-not-running: + +OSD Not Running +=============== + +Under normal circumstances, simply restarting the ``ceph-osd`` daemon will +allow it to rejoin the cluster and recover. + +An OSD Won't Start +------------------ + +If you start your cluster and an OSD won't start, check the following: + +- **Configuration File:** If you were not able to get OSDs running from + a new installation, check your configuration file to ensure it conforms + (e.g., ``host`` not ``hostname``, etc.). + +- **Check Paths:** Check the paths in your configuration, and the actual + paths themselves for data and metadata (journals, WAL, DB). If you separate the OSD data from + the metadata and there are errors in your configuration file or in the + actual mounts, you may have trouble starting OSDs. If you want to store the + metadata on a separate block device, you should partition or LVM your + drive and assign one partition per OSD. + +- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be + hitting the default maximum number of threads (e.g., usually 32k), especially + during recovery. You can increase the number of threads using ``sysctl`` to + see if increasing the maximum number of threads to the maximum possible + number of threads allowed (i.e., 4194303) will help. For example:: + + sysctl -w kernel.pid_max=4194303 + + If increasing the maximum thread count resolves the issue, you can make it + permanent by including a ``kernel.pid_max`` setting in a file under ``/etc/sysctl.d`` or + within the master ``/etc/sysctl.conf`` file. For example:: + + kernel.pid_max = 4194303 + +- **Check ``nf_conntrack``:** This connection tracking and limiting system + is the bane of many production Ceph clusters, and can be insidious in that + everything is fine at first. As cluster topology and client workload + grow, mysterious and intermittent connection failures and performance + glitches manifest, becoming worse over time and at certain times of day. + Check ``syslog`` history for table fillage events. 
You can mitigate this
+  bother by raising ``nf_conntrack_max`` to a much higher value via ``sysctl``.
+  Be sure to raise ``nf_conntrack_buckets`` accordingly to
+  ``nf_conntrack_max / 4``, which may require action outside of ``sysctl``, e.g.
+  ``echo 131072 > /sys/module/nf_conntrack/parameters/hashsize``.
+  More interdictive but fussier is to blacklist the associated kernel modules
+  to disable processing altogether. This is fragile in that the modules
+  vary among kernel versions, as does the order in which they must be listed.
+  Even when blacklisted there are situations in which ``iptables`` or ``docker``
+  may activate connection tracking anyway, so a "set and forget" strategy for
+  the tunables is advised. On modern systems this will not consume appreciable
+  resources.
+
+- **Kernel Version:** Identify the kernel version and distribution you
+  are using. Ceph uses some third party tools by default, which may be
+  buggy or may conflict with certain distributions and/or kernel
+  versions (e.g., Google ``gperftools`` and ``TCMalloc``). Check the
+  `OS recommendations`_ and the release notes for each Ceph version
+  to ensure you have addressed any issues related to your kernel.
+
+- **Segment Fault:** If there is a segment fault, increase log levels
+  and start the problematic daemon(s) again. If segment faults recur,
+  search the Ceph bug tracker `https://tracker.ceph.com/projects/ceph <https://tracker.ceph.com/projects/ceph/>`_
+  and the ``dev`` and ``ceph-users`` mailing list archives `https://ceph.io/resources <https://ceph.io/resources>`_.
+  If this is truly a new and unique
+  failure, post to the ``dev`` email list and provide the specific Ceph
+  release being run, ``ceph.conf`` (with secrets XXX'd out),
+  your monitor status output, and excerpts from your log file(s).
+
+An OSD Failed
+-------------
+
+When a ``ceph-osd`` process dies, surviving ``ceph-osd`` daemons will report
+to the mons that it appears down, which will in turn surface the new status
+via the ``ceph health`` command::
+
+    ceph health
+    HEALTH_WARN 1/3 in osds are down
+
+Specifically, you will get a warning whenever there are OSDs marked ``in``
+and ``down``. You can identify which are ``down`` with::
+
+    ceph health detail
+    HEALTH_WARN 1/3 in osds are down
+    osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+
+or ::
+
+    ceph osd tree down
+
+If there is a drive
+failure or other fault preventing ``ceph-osd`` from functioning or
+restarting, an error message should be present in its log file under
+``/var/log/ceph``.
+
+If the daemon stopped because of a heartbeat failure or ``suicide timeout``,
+the underlying drive or filesystem may be unresponsive. Check ``dmesg``
+and `syslog` output for drive or other kernel errors. You may need to
+specify something like ``dmesg -T`` to get timestamps, otherwise it's
+easy to mistake old errors for new.
+
+If the problem is a software error (failed assertion or other
+unexpected error), search the archives and tracker as above, and
+report it to the `ceph-devel`_ email list if there's no clear fix or
+existing bug.
+
+.. _no-free-drive-space:
+
+No Free Drive Space
+-------------------
+
+Ceph prevents you from writing to a full OSD so that you don't lose data.
+In an operational cluster, you should receive a warning when your cluster's OSDs
+and pools approach the full ratio. The ``mon osd full ratio`` defaults to
+``0.95``, or 95% of capacity before it stops clients from writing data.
+The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
+capacity, above which backfills will not start. The
+OSD nearfull ratio defaults to ``0.85``, or 85% of capacity,
+at which point it generates a health warning.
+
+Note that individual OSDs within a cluster will vary in how much data Ceph
+allocates to them. This utilization can be displayed for each OSD with ::
+
+    ceph osd df
+
+Overall cluster / pool fullness can be checked with ::
+
+    ceph df
+
+Pay close attention to the **most full** OSDs, not the percentage of raw space
+used as reported by ``ceph df``. It only takes one outlier OSD filling up to
+fail writes to its pool. The space available to each pool as reported by
+``ceph df`` considers the ratio settings relative to the *most full* OSD that
+is part of a given pool. The distribution can be flattened by progressively
+moving data from overfull to underfull OSDs using the ``reweight-by-utilization``
+command. With Ceph releases beginning with later revisions of Luminous one can also
+exploit the ``ceph-mgr`` ``balancer`` module to perform this task automatically
+and rather effectively.
+
+The ratios can be adjusted:
+
+::
+
+    ceph osd set-nearfull-ratio <float[0.0-1.0]>
+    ceph osd set-full-ratio <float[0.0-1.0]>
+    ceph osd set-backfillfull-ratio <float[0.0-1.0]>
+
+Full cluster issues can arise when an OSD fails either as a test or organically
+within a small and/or very full or unbalanced cluster. When an OSD or node
+holds an outsize percentage of the cluster's data, the ``nearfull`` and ``full``
+ratios may be exceeded as a result of component failures or even natural growth.
+If you are testing how Ceph reacts to OSD failures on a small
+cluster, you should leave ample free disk space and consider temporarily
+lowering the OSD ``full ratio``, OSD ``backfillfull ratio``, and
+OSD ``nearfull ratio``.
+
+Full ``ceph-osds`` will be reported by ``ceph health``::
+
+    ceph health
+    HEALTH_WARN 1 nearfull osd(s)
+
+Or::
+
+    ceph health detail
+    HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
+    osd.3 is full at 97%
+    osd.4 is backfill full at 91%
+    osd.2 is near full at 87%
+
+The best way to deal with a full cluster is to add capacity via new OSDs, enabling
+the cluster to redistribute data to newly available storage.
+
+If you cannot start a legacy Filestore OSD because it is full, you may reclaim
+some space by deleting a few placement group directories in the full OSD.
+
+.. important:: If you choose to delete a placement group directory on a full OSD,
+   **DO NOT** delete the same placement group directory on another full OSD, or
+   **YOU WILL LOSE DATA**. You **MUST** maintain at least one copy of your data on
+   at least one OSD. This is a rare and extreme intervention, and is not to be
+   undertaken lightly.
+
+See `Monitor Config Reference`_ for additional details.
+
+OSDs are Slow/Unresponsive
+==========================
+
+A common issue involves slow or unresponsive OSDs. Ensure that you
+have eliminated other troubleshooting possibilities before delving into OSD
+performance issues. For example, ensure that your networks are working properly
+and that your OSDs are running. Check to see if OSDs are throttling recovery traffic.
+
+.. tip:: Newer versions of Ceph provide better recovery handling by preventing
+   recovering OSDs from using up so many system resources that ``up`` and ``in``
+   OSDs become unavailable or otherwise slow.
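+
+If you suspect recovery or backfill throttling is involved, you can inspect the
+throttle settings on a given OSD. The following is only a sketch: the option
+names ``osd_max_backfills`` and ``osd_recovery_max_active`` exist on recent
+releases, but their defaults and the preferred way to query them vary between
+versions, and ``osd.0`` is just an example daemon id::
+
+    # Read the values from the central config database (Mimic and later).
+    ceph config get osd.0 osd_max_backfills
+    ceph config get osd.0 osd_recovery_max_active
+
+    # Ask the running daemon what it is actually using; run this on the host
+    # where osd.0 lives, since it goes through the admin socket.
+    ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active'
+
+Comparing the configured values against what the daemon reports is a quick way
+to spot overrides that were applied at runtime and then forgotten.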
+ +Networking Issues +----------------- + +Ceph is a distributed storage system, so it relies upon networks for OSD peering +and replication, recovery from faults, and periodic heartbeats. Networking +issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for +details. + +Ensure that Ceph processes and Ceph-dependent processes are connected and/or +listening. :: + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +Check network statistics. :: + + netstat -s + +Drive Configuration +------------------- + +A SAS or SATA storage drive should only house one OSD; NVMe drives readily +handle two or more. Read and write throughput can bottleneck if other processes +share the drive, including journals / metadata, operating systems, Ceph monitors, +`syslog` logs, other OSDs, and non-Ceph processes. + +Ceph acknowledges writes *after* journaling, so fast SSDs are an +attractive option to accelerate the response time--particularly when +using the ``XFS`` or ``ext4`` file systems for legacy Filestore OSDs. +By contrast, the ``Btrfs`` +file system can write and journal simultaneously. (Note, however, that +we recommend against using ``Btrfs`` for production deployments.) + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Running a journal in a separate partition + may help, but you should prefer a separate physical drive. + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your drives for bad blocks, fragmentation, and other errors that can cause +performance to drop substantially. Invaluable tools include ``dmesg``, ``syslog`` +logs, and ``smartctl`` (from the ``smartmontools`` package). + +Co-resident Monitors/OSDs +------------------------- + +Monitors are relatively lightweight processes, but they issue lots of +``fsync()`` calls, +which can interfere with other workloads, particularly if monitors run on the +same drive as an OSD. Additionally, if you run monitors on the same host as +OSDs, you may incur performance issues related to: + +- Running an older kernel (pre-3.0) +- Running a kernel with no ``syncfs(2)`` syscall. + +In these cases, multiple OSDs running on the same host can drag each other down +by doing lots of commits. That often leads to the bursty writes. + +Co-resident Processes +--------------------- + +Spinning up co-resident processes (convergence) such as a cloud-based solution, virtual +machines and other applications that write data to Ceph while operating on the +same hardware as OSDs can introduce significant OSD latency. Generally, we +recommend optimizing hosts for use with Ceph and using other hosts for other +processes. The practice of separating Ceph operations from other applications +may help improve performance and may streamline troubleshooting and maintenance. + +Logging Levels +-------------- + +If you turned logging levels up to track an issue and then forgot to turn +logging levels back down, the OSD may be putting a lot of logs onto the disk. If +you intend to keep logging levels high, you may consider mounting a drive to the +default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``). + +Recovery Throttling +------------------- + +Depending upon your configuration, Ceph may reduce recovery rates to maintain +performance or it may increase recovery rates to the point that recovery +impacts OSD performance. Check to see if the OSD is recovering. + +Kernel Version +-------------- + +Check the kernel version you are running. 
Older kernels may not receive +new backports that Ceph depends upon for better performance. + +Kernel Issues with SyncFS +------------------------- + +Try running one OSD per host to see if performance improves. Old kernels +might not have a recent enough version of ``glibc`` to support ``syncfs(2)``. + +Filesystem Issues +----------------- + +Currently, we recommend deploying clusters with the BlueStore back end. +When running a pre-Luminous release or if you have a specific reason to deploy +OSDs with the previous Filestore backend, we recommend ``XFS``. + +We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has +many attractive features, but bugs may lead to +performance issues and spurious ENOSPC errors. We do not recommend +``ext4`` for Filestore OSDs because ``xattr`` limitations break support for long +object names, which are needed for RGW. + +For more information, see `Filesystem Recommendations`_. + +.. _Filesystem Recommendations: ../configuration/filesystem-recommendations + +Insufficient RAM +---------------- + +We recommend a *minimum* of 4GB of RAM per OSD daemon and suggest rounding up +from 6-8GB. You may notice that during normal operations, ``ceph-osd`` +processes only use a fraction of that amount. +Unused RAM makes it tempting to use the excess RAM for co-resident +applications or to skimp on each node's memory capacity. However, +when OSDs experience recovery their memory utilization spikes. If +there is insufficient RAM available, OSD performance will slow considerably +and the daemons may even crash or be killed by the Linux ``OOM Killer``. + +Blocked Requests or Slow Requests +--------------------------------- + +If a ``ceph-osd`` daemon is slow to respond to a request, messages will be logged +noting ops that are taking too long. The warning threshold +defaults to 30 seconds and is configurable via the ``osd op complaint time`` +setting. When this happens, the cluster log will receive messages. + +Legacy versions of Ceph complain about ``old requests``:: + + osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops + +New versions of Ceph complain about ``slow requests``:: + + {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs + {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610] + +Possible causes include: + +- A failing drive (check ``dmesg`` output) +- A bug in the kernel file system (check ``dmesg`` output) +- An overloaded cluster (check system load, iostat, etc.) +- A bug in the ``ceph-osd`` daemon. + +Possible solutions: + +- Remove VMs from Ceph hosts +- Upgrade kernel +- Upgrade Ceph +- Restart OSDs +- Replace failed or failing components + +Debugging Slow Requests +----------------------- + +If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id> dump_ops_in_flight``, +you will see a set of operations and a list of events each operation went +through. These are briefly described below. + +Events from the Messenger layer: + +- ``header_read``: When the messenger first started reading the message off the wire. +- ``throttled``: When the messenger tried to acquire memory throttle space to read + the message into memory. 
+- ``all_read``: When the messenger finished reading the message off the wire.
+- ``dispatched``: When the messenger gave the message to the OSD.
+- ``initiated``: This is identical to ``header_read``. The existence of both is a
+  historical oddity.
+
+Events from the OSD as it processes ops:
+
+- ``queued_for_pg``: The op has been put into the queue for processing by its PG.
+- ``reached_pg``: The PG has started doing the op.
+- ``waiting for \*``: The op is waiting for some other work to complete before it
+  can proceed (e.g. a new OSDMap; for its object target to scrub; for the PG to
+  finish peering; all as specified in the message).
+- ``started``: The op has been accepted as something the OSD should do and
+  is now being performed.
+- ``waiting for subops from``: The op has been sent to replica OSDs.
+
+Events from ``Filestore``:
+
+- ``commit_queued_for_journal_write``: The op has been given to the FileStore.
+- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and waiting
+  to be persisted (as the next disk write).
+- ``journaled_completion_queued``: The op was journaled to disk and its callback
+  queued for invocation.
+
+Events from the OSD after data has been given to underlying storage:
+
+- ``op_commit``: The op has been committed (i.e. written to journal) by the
+  primary OSD.
+- ``op_applied``: The op has been `write()'en <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (i.e. applied in memory but not flushed out to disk) on the primary.
+- ``sub_op_applied``: ``op_applied``, but for a replica's "subop".
+- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools).
+- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it
+  hears about the above, but for a particular replica (i.e. ``<X>``).
+- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops).
+
+Many of these events are seemingly redundant, but cross important boundaries in
+the internal code (such as passing data across locks into new threads).
+
+Flapping OSDs
+=============
+
+When OSDs peer and check heartbeats, they use the cluster (back-end)
+network when it's available. See `Monitor/OSD Interaction`_ for details.
+
+We have traditionally recommended separate *public* (front-end) and *private*
+(cluster / back-end / replication) networks:
+
+#. Segregation of heartbeat and replication / recovery traffic (private)
+   from client and OSD <-> mon traffic (public). This helps keep one
+   from DoS-ing the other, which could in turn result in a cascading failure.
+
+#. Additional throughput for both public and private traffic.
+
+When common networking technologies were 100Mb/s and 1Gb/s, this separation
+was often critical. With today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s
+networks, the above capacity concerns are often diminished or even obviated.
+For example, if your OSD nodes have two network ports, dedicating one to
+the public and the other to the private network means no path redundancy.
+This degrades your ability to weather network maintenance and failures without
+significant cluster or client impact. Consider instead using both links
+for just a public network: with bonding (LACP) or equal-cost routing (e.g. FRR)
+you reap the benefits of increased throughput headroom, fault tolerance, and
+reduced OSD flapping.
+
+When a private network (or even a single host link) fails or degrades while the
+public network operates normally, OSDs may not handle this situation well.
What +happens is that OSDs use the public network to report each other ``down`` to +the monitors, while marking themselves ``up``. The monitors then send out, +again on the public network, an updated cluster map with affected OSDs marked +`down`. These OSDs reply to the monitors "I'm not dead yet!", and the cycle +repeats. We call this scenario 'flapping`, and it can be difficult to isolate +and remediate. With no private network, this irksome dynamic is avoided: +OSDs are generally either ``up`` or ``down`` without flapping. + +If something does cause OSDs to 'flap' (repeatedly getting marked ``down`` and +then ``up`` again), you can force the monitors to halt the flapping by +temporarily freezing their states:: + + ceph osd set noup # prevent OSDs from getting marked up + ceph osd set nodown # prevent OSDs from getting marked down + +These flags are recorded in the osdmap:: + + ceph osd dump | grep flags + flags no-up,no-down + +You can clear the flags with:: + + ceph osd unset noup + ceph osd unset nodown + +Two other flags are supported, ``noin`` and ``noout``, which prevent +booting OSDs from being marked ``in`` (allocated data) or protect OSDs +from eventually being marked ``out`` (regardless of what the current value for +``mon osd down out interval`` is). + +.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the + sense that once the flags are cleared, the action they were blocking + should occur shortly after. The ``noin`` flag, on the other hand, + prevents OSDs from being marked ``in`` on boot, and any daemons that + started while the flag was set will remain that way. + +.. note:: The causes and effects of flapping can be somewhat mitigated through + careful adjustments to the ``mon_osd_down_out_subtree_limit``, + ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``. + Derivation of optimal settings depends on cluster size, topology, and the + Ceph release in use. Their interactions are subtle and beyond the scope of + this document. + + +.. _iostat: https://en.wikipedia.org/wiki/Iostat +.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging +.. _Logging and Debugging: ../log-and-debug +.. _Debugging and Logging: ../debug +.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction +.. _Monitor Config Reference: ../../configuration/mon-config-ref +.. _monitoring your OSDs: ../../operations/monitoring-osd-pg +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _OS recommendations: ../../../start/os-recommendations +.. _ceph-devel: ceph-devel@vger.kernel.org diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst new file mode 100644 index 000000000..f5e5054ba --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -0,0 +1,693 @@ +===================== + Troubleshooting PGs +===================== + +Placement Groups Never Get Clean +================================ + +When you create a cluster and your cluster remains in ``active``, +``active+remapped`` or ``active+degraded`` status and never achieves an +``active+clean`` status, you likely have a problem with your configuration. 
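+
+As a first check, it can help to see which states the PGs are actually stuck in
+and what the cluster reports as the reason. This is only a sketch of the usual
+first commands; the exact output format differs between releases::
+
+    # Run these from a node with an admin keyring.
+    ceph health detail
+    ceph pg stat
+    ceph pg dump_stuck unclean
+
+The output usually points toward one of the situations described below: too few
+OSDs for the configured replica count, a pool size of 1, or errors in the CRUSH
+map.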
+ +You may need to review settings in the `Pool, PG and CRUSH Config Reference`_ +and make appropriate adjustments. + +As a general rule, you should run your cluster with more than one OSD and a +pool size greater than 1 object replica. + +.. _one-node-cluster: + +One Node Cluster +---------------- + +Ceph no longer provides documentation for operating on a single node, because +you would never deploy a system designed for distributed computing on a single +node. Additionally, mounting client kernel modules on a single node containing a +Ceph daemon may cause a deadlock due to issues with the Linux kernel itself +(unless you use VMs for the clients). You can experiment with Ceph in a 1-node +configuration, in spite of the limitations as described herein. + +If you are trying to create a cluster on a single node, you must change the +default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning +``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration +file before you create your monitors and OSDs. This tells Ceph that an OSD +can peer with another OSD on the same host. If you are trying to set up a +1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, +Ceph will try to peer the PGs of one OSD with the PGs of another OSD on +another node, chassis, rack, row, or even datacenter depending on the setting. + +.. tip:: DO NOT mount kernel clients directly on the same node as your + Ceph Storage Cluster, because kernel conflicts can arise. However, you + can mount kernel clients within virtual machines (VMs) on a single node. + +If you are creating OSDs using a single disk, you must create directories +for the data manually first. + + +Fewer OSDs than Replicas +------------------------ + +If you have brought up two OSDs to an ``up`` and ``in`` state, but you still +don't see ``active + clean`` placement groups, you may have an +``osd pool default size`` set to greater than ``2``. + +There are a few ways to address this situation. If you want to operate your +cluster in an ``active + degraded`` state with two replicas, you can set the +``osd pool default min size`` to ``2`` so that you can write objects in +an ``active + degraded`` state. You may also set the ``osd pool default size`` +setting to ``2`` so that you only have two stored replicas (the original and +one replica), in which case the cluster should achieve an ``active + clean`` +state. + +.. note:: You can make the changes at runtime. If you make the changes in + your Ceph configuration file, you may need to restart your cluster. + + +Pool Size = 1 +------------- + +If you have the ``osd pool default size`` set to ``1``, you will only have +one copy of the object. OSDs rely on other OSDs to tell them which objects +they should have. If a first OSD has a copy of an object and there is no +second copy, then no second OSD can tell the first OSD that it should have +that copy. For each placement group mapped to the first OSD (see +``ceph pg dump``), you can force the first OSD to notice the placement groups +it needs by running:: + + ceph osd force-create-pg <pgid> + + +CRUSH Map Errors +---------------- + +Another candidate for placement groups remaining unclean involves errors +in your CRUSH map. + + +Stuck Placement Groups +====================== + +It is normal for placement groups to enter states like "degraded" or "peering" +following a failure. Normally these states indicate the normal progression +through the failure recovery process. 
However, if a placement group stays in one +of these states for a long time this may be an indication of a larger problem. +For this reason, the monitor will warn when placement groups get "stuck" in a +non-optimal state. Specifically, we check for: + +* ``inactive`` - The placement group has not been ``active`` for too long + (i.e., it hasn't been able to service read/write requests). + +* ``unclean`` - The placement group has not been ``clean`` for too long + (i.e., it hasn't been able to completely recover from a previous failure). + +* ``stale`` - The placement group status has not been updated by a ``ceph-osd``, + indicating that all nodes storing this placement group may be ``down``. + +You can explicitly list stuck placement groups with one of:: + + ceph pg dump_stuck stale + ceph pg dump_stuck inactive + ceph pg dump_stuck unclean + +For stuck ``stale`` placement groups, it is normally a matter of getting the +right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement +groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For +stuck ``unclean`` placement groups, there is usually something preventing +recovery from completing, like unfound objects (see +:ref:`failures-osd-unfound`); + + + +.. _failures-osd-peering: + +Placement Group Down - Peering Failure +====================================== + +In certain cases, the ``ceph-osd`` `Peering` process can run into +problems, preventing a PG from becoming active and usable. For +example, ``ceph health`` might report:: + + ceph health detail + HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down + ... + pg 0.5 is down+peering + pg 1.4 is down+peering + ... + osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 + +We can query the cluster to determine exactly why the PG is marked ``down`` with:: + + ceph pg 0.5 query + +.. code-block:: javascript + + { "state": "down+peering", + ... + "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to +down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` +and things will recover. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk +failure), we can tell the cluster that it is ``lost`` and to cope as +best it can. + +.. important:: This is dangerous in that the cluster cannot + guarantee that the other copies of the data are consistent + and up to date. + +To instruct Ceph to continue anyway:: + + ceph osd lost 1 + +Recovery will proceed. + + +.. 
_failures-osd-unfound: + +Unfound Objects +=============== + +Under certain combinations of failures Ceph may complain about +``unfound`` objects:: + + ceph health detail + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer +copies of existing objects) exist, but it hasn't found copies of them. +One example of how this might come about for a PG whose data is on ceph-osds +1 and 2: + +* 1 goes down +* 2 handles some writes, alone +* 1 comes up +* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. +* Before the new objects are copied, 2 goes down. + +Now 1 knows that these object exist, but there is no live ``ceph-osd`` who +has a copy. In this case, IO to those objects will block, and the +cluster will hope that the failed node comes back soon; this is +assumed to be preferable to returning an IO error to the user. + +First, you can identify which objects are unfound with:: + + ceph pg 2.4 list_unfound [starting offset, in json] + +.. code-block:: javascript + + { + "num_missing": 1, + "num_unfound": 1, + "objects": [ + { + "oid": { + "oid": "object", + "key": "", + "snapid": -2, + "hash": 2249616407, + "max": 0, + "pool": 2, + "namespace": "" + }, + "need": "43'251", + "have": "0'0", + "flags": "none", + "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1", + "locations": [ + "0(3)", + "4(2)" + ] + } + ], + "state": "NotRecovering", + "available_might_have_unfound": true, + "might_have_unfound": [ + { + "osd": "2(4)", + "status": "osd is down" + } + ], + "more": false + } + +If there are too many objects to list in a single result, the ``more`` +field will be true and you can query for more. (Eventually the +command line tool will hide this from you, but not yet.) + +Second, you can identify which OSDs have been probed or might contain +data. + +At the end of the listing (before ``more`` is false), ``might_have_unfound`` is provided +when ``available_might_have_unfound`` is true. This is equivalent to the output +of ``ceph pg #.# query``. This eliminates the need to use ``query`` directly. +The ``might_have_unfound`` information given behaves the same way as described below for ``query``. +The only difference is that OSDs that have ``already probed`` status are ignored. + +Use of ``query``:: + + ceph pg 2.4 query + +.. code-block:: javascript + + "recovery_state": [ + { "name": "Started\/Primary\/Active", + "enter_time": "2012-03-06 15:15:46.713212", + "might_have_unfound": [ + { "osd": 1, + "status": "osd is down"}]}, + +In this case, for example, the cluster knows that ``osd.1`` might have +data, but it is ``down``. The full range of possible states include: + +* already probed +* querying +* OSD is down +* not queried (yet) + +Sometimes it simply takes some time for the cluster to query possible +locations. + +It is possible that there are other locations where the object can +exist that are not listed. For example, if a ceph-osd is stopped and +taken out of the cluster, the cluster fully recovers, and due to some +future set of failures ends up with an unfound object, it won't +consider the long-departed ceph-osd as a potential location to +consider. (This scenario, however, is unlikely.) + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. 
This, again, is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. To mark the "unfound" objects as "lost":: + + ceph pg 2.5 mark_unfound_lost revert|delete + +This the final argument specifies how the cluster should deal with +lost objects. + +The "delete" option will forget about them entirely. + +The "revert" option (not available for erasure coded pools) will +either roll back to a previous version of the object or (if it was a +new object) forget about it entirely. Use this with caution, as it +may confuse applications that expected the object to exist. + + +Homeless Placement Groups +========================= + +It is possible for all OSDs that had copies of a given placement groups to fail. +If that's the case, that subset of the object store is unavailable, and the +monitor will receive no status updates for those placement groups. To detect +this situation, the monitor marks any placement group whose primary OSD has +failed as ``stale``. For example:: + + ceph health + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + +You can identify which placement groups are ``stale``, and what the last OSDs to +store them were, with:: + + ceph health detail + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + ... + pg 2.5 is stuck stale+active+remapped, last acting [2,0] + ... + osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 + osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 + osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 + +If we want to get placement group 2.5 back online, for example, this tells us that +it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd`` +daemons will allow the cluster to recover that placement group (and, presumably, +many others). + + +Only a Few OSDs Receive Data +============================ + +If you have many nodes in your cluster and only a few of them receive data, +`check`_ the number of placement groups in your pool. Since placement groups get +mapped to OSDs, a small number of placement groups will not distribute across +your cluster. Try creating a pool with a placement group count that is a +multiple of the number of OSDs. See `Placement Groups`_ for details. The default +placement group count for pools is not useful, but you can change it `here`_. + + +Can't Write Data +================ + +If your cluster is up, but some OSDs are down and you cannot write data, +check to ensure that you have the minimum number of OSDs running for the +placement group. If you don't have the minimum number of OSDs running, +Ceph will not allow you to write data because there is no guarantee +that Ceph can replicate your data. See ``osd pool default min size`` +in the `Pool, PG and CRUSH Config Reference`_ for details. + + +PGs Inconsistent +================ + +If you receive an ``active + clean + inconsistent`` state, this may happen +due to an error during scrubbing. As always, we can identify the inconsistent +placement group(s) with:: + + $ ceph health detail + HEALTH_ERR 1 pgs inconsistent; 2 scrub errors + pg 0.6 is active+clean+inconsistent, acting [0,1,2] + 2 scrub errors + +Or if you prefer inspecting the output in a programmatic way:: + + $ rados list-inconsistent-pg rbd + ["0.6"] + +There is only one consistent state, but in the worst case, we could have +different inconsistencies in multiple perspectives found in more than one +objects. 
If an object named ``foo`` in PG ``0.6`` is truncated, we will have:: + + $ rados list-inconsistent-obj 0.6 --format=json-pretty + +.. code-block:: javascript + + { + "epoch": 14, + "inconsistents": [ + { + "object": { + "name": "foo", + "nspace": "", + "locator": "", + "snap": "head", + "version": 1 + }, + "errors": [ + "data_digest_mismatch", + "size_mismatch" + ], + "union_shard_errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])", + "shards": [ + { + "osd": 0, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 1, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 2, + "errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "size": 0, + "omap_digest": "0xffffffff", + "data_digest": "0xffffffff" + } + ] + } + ] + } + +In this case, we can learn from the output: + +* The only inconsistent object is named ``foo``, and it is its head that has + inconsistencies. +* The inconsistencies fall into two categories: + + * ``errors``: these errors indicate inconsistencies between shards without a + determination of which shard(s) are bad. Check for the ``errors`` in the + `shards` array, if available, to pinpoint the problem. + + * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is + different from the ones of OSD.0 and OSD.1 + * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while + the size reported by OSD.0 and OSD.1 is 968. + * ``union_shard_errors``: the union of all shard specific ``errors`` in + ``shards`` array. The ``errors`` are set for the given shard that has the + problem. They include errors like ``read_error``. The ``errors`` ending in + ``oi`` indicate a comparison with ``selected_object_info``. Look at the + ``shards`` array to determine which shard has which error(s). + + * ``data_digest_mismatch_info``: the digest stored in the object-info is not + ``0xffffffff``, which is calculated from the shard read from OSD.2 + * ``size_mismatch_info``: the size stored in the object-info is different + from the one read from OSD.2. The latter is 0. + +You can repair the inconsistent placement group by executing:: + + ceph pg repair {placement-group-ID} + +Which overwrites the `bad` copies with the `authoritative` ones. In most cases, +Ceph is able to choose authoritative copies from all available replicas using +some predefined criteria. But this does not always work. For example, the stored +data digest could be missing, and the calculated digest will be ignored when +choosing the authoritative copies. So, please use the above command with caution. + +If ``read_error`` is listed in the ``errors`` attribute of a shard, the +inconsistency is likely due to disk errors. You might want to check your disk +used by that OSD. + +If you receive ``active + clean + inconsistent`` states periodically due to +clock skew, you may consider configuring your `NTP`_ daemons on your +monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph +`Clock Settings`_ for additional details. + + +Erasure Coded PGs are not active+clean +====================================== + +When CRUSH fails to find enough OSDs to map to a PG, it will show as a +``2147483647`` which is ITEM_NONE or ``no OSD found``. 
For instance:: + + [2,1,6,0,5,8,2147483647,7,4] + +Not enough OSDs +--------------- + +If the Ceph cluster only has 8 OSDs and the erasure coded pool needs +9, that is what it will show. You can either create another erasure +coded pool that requires less OSDs:: + + ceph osd erasure-code-profile set myprofile k=5 m=3 + ceph osd pool create erasurepool erasure myprofile + +or add a new OSDs and the PG will automatically use them. + +CRUSH constraints cannot be satisfied +------------------------------------- + +If the cluster has enough OSDs, it is possible that the CRUSH rule +imposes constraints that cannot be satisfied. If there are 10 OSDs on +two hosts and the CRUSH rule requires that no two OSDs from the +same host are used in the same PG, the mapping may fail because only +two OSDs will be found. You can check the constraint by displaying ("dumping") +the rule:: + + $ ceph osd crush rule ls + [ + "replicated_rule", + "erasurepool"] + $ ceph osd crush rule dump erasurepool + { "rule_id": 1, + "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + + +You can resolve the problem by creating a new pool in which PGs are allowed +to have OSDs residing on the same host with:: + + ceph osd erasure-code-profile set myprofile crush-failure-domain=osd + ceph osd pool create erasurepool erasure myprofile + +CRUSH gives up too soon +----------------------- + +If the Ceph cluster has just enough OSDs to map the PG (for instance a +cluster with a total of 9 OSDs and an erasure coded pool that requires +9 OSDs per PG), it is possible that CRUSH gives up before finding a +mapping. It can be resolved by: + +* lowering the erasure coded pool requirements to use less OSDs per PG + (that requires the creation of another pool as erasure code profiles + cannot be dynamically modified). + +* adding more OSDs to the cluster (that does not require the erasure + coded pool to be modified, it will become clean automatically) + +* use a handmade CRUSH rule that tries more times to find a good + mapping. This can be done by setting ``set_choose_tries`` to a value + greater than the default. + +You should first verify the problem with ``crushtool`` after +extracting the crushmap from the cluster so your experiments do not +modify the Ceph cluster and only work on a local files:: + + $ ceph osd crush rule dump erasurepool + { "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + $ ceph osd getcrushmap > crush.map + got crush map from osdmap epoch 13 + $ crushtool -i crush.map --test --show-bad-mappings \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] + bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] + bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] + +Where ``--num-rep`` is the number of OSDs the erasure code CRUSH +rule needs, ``--rule`` is the value of the ``ruleset`` field +displayed by ``ceph osd crush rule dump``. The test will try mapping +one million values (i.e. the range defined by ``[--min-x,--max-x]``) +and must display at least one bad mapping. 
If it outputs nothing it +means all mappings are successful and you can stop right there: the +problem is elsewhere. + +The CRUSH rule can be edited by decompiling the crush map:: + + $ crushtool --decompile crush.map > crush.txt + +and adding the following line to the rule:: + + step set_choose_tries 100 + +The relevant part of the ``crush.txt`` file should look something +like:: + + rule erasurepool { + ruleset 1 + type erasure + min_size 3 + max_size 20 + step set_chooseleaf_tries 5 + step set_choose_tries 100 + step take default + step chooseleaf indep 0 type host + step emit + } + +It can then be compiled and tested again:: + + $ crushtool --compile crush.txt -o better-crush.map + +When all mappings succeed, an histogram of the number of tries that +were necessary to find all of them can be displayed with the +``--show-choose-tries`` option of ``crushtool``:: + + $ crushtool -i better-crush.map --test --show-bad-mappings \ + --show-choose-tries \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + ... + 11: 42 + 12: 44 + 13: 54 + 14: 45 + 15: 35 + 16: 34 + 17: 30 + 18: 25 + 19: 19 + 20: 22 + 21: 20 + 22: 17 + 23: 13 + 24: 16 + 25: 13 + 26: 11 + 27: 11 + 28: 13 + 29: 11 + 30: 10 + 31: 6 + 32: 5 + 33: 10 + 34: 3 + 35: 7 + 36: 5 + 37: 2 + 38: 5 + 39: 5 + 40: 2 + 41: 5 + 42: 4 + 43: 1 + 44: 2 + 45: 2 + 46: 3 + 47: 1 + 48: 0 + ... + 102: 0 + 103: 1 + 104: 0 + ... + +It took 11 tries to map 42 PGs, 12 tries to map 44 PGs etc. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped). + +.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups +.. _here: ../../configuration/pool-pg-config-ref +.. _Placement Groups: ../../operations/placement-groups +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _NTP: https://en.wikipedia.org/wiki/Network_Time_Protocol +.. _The Network Time Protocol: http://www.ntp.org/ +.. _Clock Settings: ../../configuration/mon-config-ref/#clock + + |