Diffstat
-rw-r--r-- | doc/rados/troubleshooting/community.rst | 37
-rw-r--r-- | doc/rados/troubleshooting/cpu-profiling.rst | 80
-rw-r--r-- | doc/rados/troubleshooting/index.rst | 19
-rw-r--r-- | doc/rados/troubleshooting/log-and-debug.rst | 430
-rw-r--r-- | doc/rados/troubleshooting/memory-profiling.rst | 203
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-mon.rst | 713
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-osd.rst | 787
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-pg.rst | 782
8 files changed, 3051 insertions, 0 deletions
diff --git a/doc/rados/troubleshooting/community.rst b/doc/rados/troubleshooting/community.rst new file mode 100644 index 000000000..c0d7be10c --- /dev/null +++ b/doc/rados/troubleshooting/community.rst @@ -0,0 +1,37 @@ +==================== + The Ceph Community +==================== + +Ceph-users email list +===================== + +The Ceph community is an excellent source of information and help. For +operational issues with Ceph, we recommend that you `subscribe to the ceph-users +email list`_. When you no longer want to receive emails, you can `unsubscribe +from the ceph-users email list`_. + +Ceph-devel email list +===================== + +You can also `subscribe to the ceph-devel email list`_. You should do so if +your issue is: + +- Likely related to a bug +- Related to a development release package +- Related to a development testing package +- Related to your own builds + +If you no longer want to receive emails from the ``ceph-devel`` email list, you +can `unsubscribe from the ceph-devel email list`_. + +Ceph report +=========== + +.. tip:: Community members can help you if you provide them with detailed + information about your problem. Attach the output of the ``ceph report`` + command to help people understand your issues. + +.. _subscribe to the ceph-devel email list: mailto:dev-join@ceph.io +.. _unsubscribe from the ceph-devel email list: mailto:dev-leave@ceph.io +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@ceph.io +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@ceph.io diff --git a/doc/rados/troubleshooting/cpu-profiling.rst b/doc/rados/troubleshooting/cpu-profiling.rst new file mode 100644 index 000000000..b7fdd1d41 --- /dev/null +++ b/doc/rados/troubleshooting/cpu-profiling.rst @@ -0,0 +1,80 @@ +=============== + CPU Profiling +=============== + +If you built Ceph from source and compiled Ceph for use with `oprofile`_, +you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details.
+ + +Initializing oprofile +===================== + +``oprofile`` must be initialized the first time it is used. Locate the +``vmlinux`` image that corresponds to the kernel you are running: + +.. prompt:: bash $ + + ls /boot + sudo opcontrol --init + sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 + + +Starting oprofile +================= + +Run the following command to start ``oprofile``: + +.. prompt:: bash $ + + opcontrol --start + + +Stopping oprofile +================= + +Run the following command to stop ``oprofile``: + +.. prompt:: bash $ + + opcontrol --stop + + +Retrieving oprofile Results +=========================== + +Run the following command to retrieve the top ``cmon`` results: + +.. prompt:: bash $ + + opreport -gal ./cmon | less + + +Run the following command to retrieve the top ``cmon`` results, with call +graphs attached: + +.. prompt:: bash $ + + opreport -cal ./cmon | less + +.. important:: After you have reviewed the results, reset ``oprofile`` before + running it again. The act of resetting ``oprofile`` removes data from the + session directory. + + +Resetting oprofile +================== + +Run the following command to reset ``oprofile``: + +.. prompt:: bash $ + + sudo opcontrol --reset + +.. important:: Reset ``oprofile`` after analyzing data. This ensures that + results from prior tests do not get mixed in with the results of the current + test. + +.. _oprofile: http://oprofile.sourceforge.net/about/ +.. 
_Installing Oprofile: ../../../dev/cpu-profiler + + diff --git a/doc/rados/troubleshooting/index.rst b/doc/rados/troubleshooting/index.rst new file mode 100644 index 000000000..b481ee1dc --- /dev/null +++ b/doc/rados/troubleshooting/index.rst @@ -0,0 +1,19 @@ +================= + Troubleshooting +================= + +You may encounter situations that require you to examine your configuration, +consult the documentation, modify your logging output, troubleshoot monitors +and OSDs, profile memory and CPU usage, and, in the last resort, reach out to +the Ceph community for help. + +.. toctree:: + :maxdepth: 1 + + community + log-and-debug + troubleshooting-mon + troubleshooting-osd + troubleshooting-pg + memory-profiling + cpu-profiling diff --git a/doc/rados/troubleshooting/log-and-debug.rst b/doc/rados/troubleshooting/log-and-debug.rst new file mode 100644 index 000000000..929c3f53f --- /dev/null +++ b/doc/rados/troubleshooting/log-and-debug.rst @@ -0,0 +1,430 @@ +======================= + Logging and Debugging +======================= + +Ceph component debug log levels can be adjusted at runtime, while services are +running. In some circumstances you might want to adjust debug log levels in +``ceph.conf`` or in the central config store. Increased debug logging can be +useful if you are encountering issues when operating your cluster. By default, +Ceph log files are in ``/var/log/ceph``. + +.. tip:: Remember that debug output can slow down your system, and that this + latency sometimes hides race conditions. + +Debug logging is resource intensive. If you encounter a problem in a specific +component of your cluster, begin troubleshooting by enabling logging for only +that component of the cluster. For example, if your OSDs are running without +errors, but your metadata servers are not, enable logging for any specific +metadata server instances that are having problems. Continue by enabling +logging for each subsystem only as needed. + +.. 
important:: Verbose logging sometimes generates over 1 GB of data per hour. + If the disk that your operating system runs on (your "OS disk") reaches its + capacity, the node associated with that disk will stop working. + +Whenever you enable or increase the rate of debug logging, make sure that you +have ample capacity for log files, as this may dramatically increase their +size. For details on rotating log files, see `Accelerating Log Rotation`_. +When your system is running well again, remove unnecessary debugging settings +in order to ensure that your cluster runs optimally. Logging debug-output +messages is a slow process and a potential waste of your cluster's resources. + +For details on available settings, see `Subsystem, Log and Debug Settings`_. + +Runtime +======= + +To see the configuration settings at runtime, log in to a host that has a +running daemon and run a command of the following form: + +.. prompt:: bash $ + + ceph daemon {daemon-name} config show | less + +For example: + +.. prompt:: bash $ + + ceph daemon osd.0 config show | less + +To activate Ceph's debugging output (that is, the ``dout()`` logging function) +at runtime, inject arguments into the runtime configuration by running a ``ceph +tell`` command of the following form: + +.. prompt:: bash $ + + ceph tell {daemon-type}.{daemon id or *} config set {name} {value} + +Here ``{daemon-type}`` is ``osd``, ``mon``, or ``mds``. Apply the runtime +setting either to a specific daemon (by specifying its ID) or to all daemons of +a particular type (by using the ``*`` operator). For example, to increase +debug logging for a specific ``ceph-osd`` daemon named ``osd.0``, run the +following command: + +.. prompt:: bash $ + + ceph tell osd.0 config set debug_osd 0/5 + +The ``ceph tell`` command goes through the monitors. 
However, if you are unable +to bind to the monitor, there is another method that can be used to activate +Ceph's debugging output: use the ``ceph daemon`` command to log in to the host +of a specific daemon and change the daemon's configuration. For example: + +.. prompt:: bash $ + + sudo ceph daemon osd.0 config set debug_osd 0/5 + +For details on available settings, see `Subsystem, Log and Debug Settings`_. + + +Boot Time +========= + +To activate Ceph's debugging output (that is, the ``dout()`` logging function) +at boot time, you must add settings to your Ceph configuration file. +Subsystems that are common to all daemons are set under ``[global]`` in the +configuration file. Subsystems for a specific daemon are set under the relevant +daemon section in the configuration file (for example, ``[mon]``, ``[osd]``, +``[mds]``). Here is an example that shows possible debugging settings in a Ceph +configuration file: + +.. code-block:: ini + + [global] + debug_ms = 1/5 + + [mon] + debug_mon = 20 + debug_paxos = 1/5 + debug_auth = 2 + + [osd] + debug_osd = 1/5 + debug_filestore = 1/5 + debug_journal = 1 + debug_monc = 5/20 + + [mds] + debug_mds = 1 + debug_mds_balancer = 1 + + +For details, see `Subsystem, Log and Debug Settings`_. + + +Accelerating Log Rotation +========================= + +If your log filesystem is nearly full, you can accelerate log rotation by +modifying the Ceph log rotation file at ``/etc/logrotate.d/ceph``. To increase +the frequency of log rotation (which will guard against a filesystem reaching +capacity), add a ``size`` directive after the ``weekly`` frequency directive. +To smooth out volume spikes, consider changing ``weekly`` to ``daily`` and +consider changing ``rotate`` to ``30``. The procedure for adding the size +setting is shown immediately below. + +#. Note the default settings of the ``/etc/logrotate.d/ceph`` file:: + + rotate 7 + weekly + compress + sharedscripts + +#. 
Modify them by adding a ``size`` setting:: + + rotate 7 + weekly + size 500M + compress + sharedscripts + +#. Start the crontab editor for your user space: + + .. prompt:: bash $ + + crontab -e + +#. Add an entry to crontab that instructs cron to check the + ``/etc/logrotate.d/ceph`` file:: + + 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 + +In this example, the ``/etc/logrotate.d/ceph`` file will be checked every 30 +minutes. + +Valgrind +======== + +When you are debugging your cluster's performance, you might find it necessary +to track down memory and threading issues. The Valgrind tool suite can be used +to detect problems in a specific daemon, in a particular type of daemon, or in +the entire cluster. Because Valgrind is computationally expensive, it should be +used only when developing or debugging Ceph, and it will slow down your system +if used at other times. Valgrind messages are logged to ``stderr``. + + +Subsystem, Log and Debug Settings +================================= + +Debug logging output is typically enabled via subsystems. + +Ceph Subsystems +--------------- + +For each subsystem, there is a logging level for its output logs (a so-called +"log level") and a logging level for its in-memory logs (a so-called "memory +level"). Different values may be set for these two logging levels in each +subsystem. Ceph's logging levels operate on a scale of ``1`` to ``20``, where +``1`` is terse and ``20`` is verbose [#f1]_. As a general rule, the in-memory +logs are not sent to the output log unless one or more of the following +conditions obtain: + +- a fatal signal is raised or +- an ``assert`` in source code is triggered or +- a request is made via the admin socket. See the `admin socket + documentation <http://docs.ceph.com/en/latest/man/8/ceph/#daemon>`_ for more + details. + +.. warning:: + .. [#f1] In certain rare cases, there are logging levels that can take a value greater than 20. The resulting logs are extremely verbose.
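The pairing of a log level with a memory level can be made concrete with a
short sketch. The following Python helper is hypothetical (it is not part of
Ceph); it only illustrates how a debug value such as ``1/5`` splits into a log
level and a memory level, and how a single value applies to both:

```python
def parse_debug_setting(value: str) -> tuple[int, int]:
    """Split a Ceph-style debug value into (log_level, memory_level).

    A pair such as "1/5" sets the log level and the memory level
    separately; a single value such as "5" applies to both levels.
    """
    if "/" in value:
        log_level, memory_level = value.split("/", 1)
        return int(log_level), int(memory_level)
    level = int(value)
    return level, level

print(parse_debug_setting("1/5"))  # (1, 5)
print(parse_debug_setting("5"))    # (5, 5)
```

For example, ``debug ms = 1/5`` would parse to a log level of ``1`` and a
memory level of ``5``.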
+ +Log levels and memory levels can be set either together or separately. If a +subsystem is assigned a single value, then that value determines both the log +level and the memory level. For example, ``debug ms = 5`` will give the ``ms`` +subsystem a log level of ``5`` and a memory level of ``5``. On the other hand, +if a subsystem is assigned two values that are separated by a forward slash +(/), then the first value determines the log level and the second value +determines the memory level. For example, ``debug ms = 1/5`` will give the +``ms`` subsystem a log level of ``1`` and a memory level of ``5``. See the +following: + +.. code-block:: ini + + debug {subsystem} = {log-level}/{memory-level} + #for example + debug mds balancer = 1/20 + +The following table provides a list of Ceph subsystems and their default log and +memory levels. Once you complete your logging efforts, restore the subsystems +to their default level or to a level suitable for normal operations. + ++--------------------------+-----------+--------------+ +| Subsystem | Log Level | Memory Level | ++==========================+===========+==============+ +| ``default`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``lockdep`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``context`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``crush`` | 1 | 1 | ++--------------------------+-----------+--------------+ +| ``mds`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mds balancer`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mds log`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mds log expire`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mds migrator`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``buffer`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``timer`` | 0 | 1 | 
++--------------------------+-----------+--------------+ +| ``filer`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``striper`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``objecter`` | 0 | 1 | ++--------------------------+-----------+--------------+ +| ``rados`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``rbd`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``rbd mirror`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``rbd replay`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``rbd pwl`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``journaler`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``objectcacher`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``immutable obj cache`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``client`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``osd`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``optracker`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``objclass`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``filestore`` | 1 | 3 | ++--------------------------+-----------+--------------+ +| ``journal`` | 1 | 3 | ++--------------------------+-----------+--------------+ +| ``ms`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``mon`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``monc`` | 0 | 10 | ++--------------------------+-----------+--------------+ +| ``paxos`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``tp`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``auth`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``crypto`` | 1 | 5 | 
++--------------------------+-----------+--------------+ +| ``finisher`` | 1 | 1 | ++--------------------------+-----------+--------------+ +| ``reserver`` | 1 | 1 | ++--------------------------+-----------+--------------+ +| ``heartbeatmap`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``perfcounter`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw sync`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw datacache`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw access`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rgw dbstore`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``javaclient`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``asok`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``throttle`` | 1 | 1 | ++--------------------------+-----------+--------------+ +| ``refs`` | 0 | 0 | ++--------------------------+-----------+--------------+ +| ``compressor`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``bluestore`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``bluefs`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``bdev`` | 1 | 3 | ++--------------------------+-----------+--------------+ +| ``kstore`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``rocksdb`` | 4 | 5 | ++--------------------------+-----------+--------------+ +| ``leveldb`` | 4 | 5 | ++--------------------------+-----------+--------------+ +| ``fuse`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``mgr`` | 2 | 5 | ++--------------------------+-----------+--------------+ +| ``mgrc`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``dpdk`` | 1 | 5 | 
++--------------------------+-----------+--------------+ +| ``eventtrace`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``prioritycache`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``test`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``cephfs mirror`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``cephsqlite`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore onode`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore odata`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore omap`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore tm`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore t`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore cleaner`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore epm`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore lba`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore fixedkv tree``| 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore cache`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore journal`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore device`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``seastore backref`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``alienstore`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``mclock`` | 1 | 5 | ++--------------------------+-----------+--------------+ +| ``cyanstore`` | 0 | 5 | ++--------------------------+-----------+--------------+ +| ``ceph exporter`` | 1 
| 5 | ++--------------------------+-----------+--------------+ +| ``memstore`` | 1 | 5 | ++--------------------------+-----------+--------------+ + + +Logging Settings +---------------- + +It is not necessary to specify logging and debugging settings in the Ceph +configuration file, but you may override default settings when needed. Ceph +supports the following settings: + +.. confval:: log_file +.. confval:: log_max_new +.. confval:: log_max_recent +.. confval:: log_to_file +.. confval:: log_to_stderr +.. confval:: err_to_stderr +.. confval:: log_to_syslog +.. confval:: err_to_syslog +.. confval:: log_flush_on_exit +.. confval:: clog_to_monitors +.. confval:: clog_to_syslog +.. confval:: mon_cluster_log_to_syslog +.. confval:: mon_cluster_log_file + +OSD +--- + +.. confval:: osd_debug_drop_ping_probability +.. confval:: osd_debug_drop_ping_duration + +Filestore +--------- + +.. confval:: filestore_debug_omap_check + +MDS +--- + +- :confval:`mds_debug_scatterstat` +- :confval:`mds_debug_frag` +- :confval:`mds_debug_auth_pins` +- :confval:`mds_debug_subtrees` + +RADOS Gateway +------------- + +- :confval:`rgw_log_nonexistent_bucket` +- :confval:`rgw_log_object_name` +- :confval:`rgw_log_object_name_utc` +- :confval:`rgw_enable_ops_log` +- :confval:`rgw_enable_usage_log` +- :confval:`rgw_usage_log_flush_threshold` +- :confval:`rgw_usage_log_tick_interval` diff --git a/doc/rados/troubleshooting/memory-profiling.rst b/doc/rados/troubleshooting/memory-profiling.rst new file mode 100644 index 000000000..8e58f2d76 --- /dev/null +++ b/doc/rados/troubleshooting/memory-profiling.rst @@ -0,0 +1,203 @@ +================== + Memory Profiling +================== + +Ceph Monitor, OSD, and MDS can report ``TCMalloc`` heap profiles. Install +``google-perftools`` if you want to generate these. Your OS distribution might +package this under a different name (for example, ``gperftools``), and your OS +distribution might use a different package manager. 
Run a command similar to +this one to install ``google-perftools``: + +.. prompt:: bash + + sudo apt-get install google-perftools + +The profiler dumps output to your ``log file`` directory (``/var/log/ceph``). +See `Logging and Debugging`_ for details. + +To view the profiler logs with Google's performance tools, run the following +command: + +.. prompt:: bash + + google-pprof --text {path-to-daemon} {log-path/filename} + +For example:: + + $ ceph tell osd.0 heap start_profiler + $ ceph tell osd.0 heap dump + osd.0 tcmalloc heap stats:------------------------------------------------ + MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application + MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist + MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist + MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist + MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists + MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) + MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used + MALLOC: + MALLOC: 231 Spans in use + MALLOC: 56 Thread heaps in use + MALLOC: 8192 Tcmalloc page size + ------------------------------------------------ + Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). + Bytes released to the OS take up virtual address space but no physical memory. + $ google-pprof --text \ + /usr/bin/ceph-osd \ + /var/log/ceph/ceph-osd.0.profile.0001.heap + Total: 3.7 MB + 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry + 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create + 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe + 0.0 0.4% 99.2% 0.0 0.6% decode_message + ... + +Performing another heap dump on the same daemon creates another file. 
It is +convenient to compare the new file to a file created by a previous heap dump to +show what has grown in the interval. For example:: + + $ google-pprof --text --base out/osd.0.profile.0001.heap \ + ceph-osd out/osd.0.profile.0003.heap + Total: 0.2 MB + 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry + 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create + 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op + 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate + +See `Google Heap Profiler`_ for additional details. + +After you have installed the heap profiler, start your cluster and begin using +the heap profiler. You can enable or disable the heap profiler at runtime, or +ensure that it runs continuously. When running commands based on the examples +that follow, do the following: + +#. replace ``{daemon-type}`` with ``mon``, ``osd`` or ``mds`` +#. replace ``{daemon-id}`` with the OSD number or the MON ID or the MDS ID + + +Starting the Profiler +--------------------- + +To start the heap profiler, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap start_profiler + +For example: + +.. prompt:: bash + + ceph tell osd.1 heap start_profiler + +Alternatively, if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in the +environment, the profile will be started when the daemon starts running. + +Printing Stats +-------------- + +To print out statistics, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap stats + +For example: + +.. prompt:: bash + + ceph tell osd.0 heap stats + +.. note:: The reporting of stats with this command does not require the + profiler to be running and does not dump the heap allocation information to + a file. + + +Dumping Heap Information +------------------------ + +To dump heap information, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap dump + +For example: + +.. 
prompt:: bash + + ceph tell mds.a heap dump + +.. note:: Dumping heap information works only when the profiler is running. + + +Releasing Memory +---------------- + +To release memory that ``tcmalloc`` has allocated but which is not being used +by the Ceph daemon itself, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap release + +For example: + +.. prompt:: bash + + ceph tell osd.2 heap release + + +Stopping the Profiler +--------------------- + +To stop the heap profiler, run a command of the following form: + +.. prompt:: bash + + ceph tell {daemon-type}.{daemon-id} heap stop_profiler + +For example: + +.. prompt:: bash + + ceph tell osd.0 heap stop_profiler + +.. _Logging and Debugging: ../log-and-debug +.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html + +Alternative Methods of Memory Profiling +---------------------------------------- + +Running Massif heap profiler with Valgrind +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Massif heap profiler tool can be used with Valgrind to measure how much +heap memory is used. This method is well-suited to troubleshooting RadosGW. + +See the `Massif documentation +<https://valgrind.org/docs/manual/ms-manual.html>`_ for more information. + +Install Valgrind from the package manager for your distribution, then start the +Ceph daemon you want to troubleshoot: + +.. prompt:: bash + + sudo -u ceph valgrind --max-threads=1024 --tool=massif /usr/bin/radosgw -f --cluster ceph --name NAME --setuser ceph --setgroup ceph + +When this command has completed its run, a file with a name of the form +``massif.out.<pid>`` will be saved in your current working directory. To run +the command above, the user who runs it must have write permissions in the +current directory. + +Run the ``ms_print`` command to get a graph and statistics from the collected +data in the ``massif.out.<pid>`` file: + +.. 
prompt:: bash + + ms_print massif.out.12345 + +The output of this command is helpful when submitting a bug report. diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst new file mode 100644 index 000000000..1170da7c3 --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -0,0 +1,713 @@ +.. _rados-troubleshooting-mon: + +========================== + Troubleshooting Monitors +========================== + +.. index:: monitor, high availability + +Even if a cluster experiences monitor-related problems, the cluster is not +necessarily in danger of going down. If a cluster has lost multiple monitors, +it can still remain up and running as long as there are enough surviving +monitors to form a quorum. + +If your cluster is having monitor-related problems, we recommend that you +consult the following troubleshooting information. + +Initial Troubleshooting +======================= + +The first steps in the process of troubleshooting Ceph Monitors involve making +sure that the Monitors are running and that they are able to communicate on +the network. Follow the steps in this section to rule out +the simplest causes of Monitor malfunction. + +#. **Make sure that the Monitors are running.** + + Make sure that the Monitor (*mon*) daemon processes (``ceph-mon``) are + running. It might be the case that the mons have not been restarted after an + upgrade. Checking for this simple oversight can save hours of painstaking + troubleshooting. + + It is also important to make sure that the manager daemons (``ceph-mgr``) + are running. Remember that typical cluster configurations provide one + Manager (``ceph-mgr``) for each Monitor (``ceph-mon``). + + .. note:: In releases prior to v1.12.5, Rook will not run more than two + managers. + +#. 
**Make sure that you can reach the Monitor nodes.** + + In certain rare cases, ``iptables`` rules might be blocking access to + Monitor nodes or TCP ports. These rules might be left over from earlier + stress testing or rule development. To check for the presence of such + rules, SSH into each Monitor node and use ``telnet`` or ``nc`` or a similar + tool to attempt to connect to each of the other Monitor nodes on ports + ``tcp/3300`` and ``tcp/6789``. + +#. **Make sure that the "ceph status" command runs and receives a reply from the cluster.** + + If the ``ceph status`` command receives a reply from the cluster, then the + cluster is up and running. Monitors answer to a ``status`` request only if + there is a formed quorum. Confirm that one or more ``mgr`` daemons are + reported as running. In a cluster with no deficiencies, ``ceph status`` + will report that all ``mgr`` daemons are running. + + If the ``ceph status`` command does not receive a reply from the cluster, + then there are probably not enough Monitors ``up`` to form a quorum. If the + ``ceph -s`` command is run with no further options specified, it connects + to an arbitrarily selected Monitor. In certain cases, however, it might be + helpful to connect to a specific Monitor (or to several specific Monitors + in sequence) by adding the ``-m`` flag to the command: for example, ``ceph + status -m mymon1``. + +#. **None of this worked. What now?** + + If the above solutions have not resolved your problems, you might find it + helpful to examine each individual Monitor in turn. Even if no quorum has + been formed, it is possible to contact each Monitor individually and + request its status by using the ``ceph tell mon.ID mon_status`` command + (here ``ID`` is the Monitor's identifier). + + Run the ``ceph tell mon.ID mon_status`` command for each Monitor in the + cluster. 
For more on this command's output, see :ref:`Understanding + mon_status + <rados_troubleshoting_troubleshooting_mon_understanding_mon_status>`. + + There is also an alternative method for contacting each individual Monitor: + SSH into each Monitor node and query the daemon's admin socket. See + :ref:`Using the Monitor's Admin + Socket<rados_troubleshoting_troubleshooting_mon_using_admin_socket>`. + +.. _rados_troubleshoting_troubleshooting_mon_using_admin_socket: + +Using the monitor's admin socket +================================ + +A monitor's admin socket allows you to interact directly with a specific daemon +by using a Unix socket file. This file is found in the monitor's ``run`` +directory. The admin socket's default path is +``/var/run/ceph/ceph-mon.ID.asok``, but this can be overridden and the admin +socket might be elsewhere, especially if your cluster's daemons are deployed in +containers. If you cannot find it, either check your ``ceph.conf`` for an +alternative path or run the following command: + +.. prompt:: bash $ + + ceph-conf --name mon.ID --show-config-value admin_socket + +The admin socket is available for use only when the monitor daemon is running. +Whenever the monitor has been properly shut down, the admin socket is removed. +However, if the monitor is not running and the admin socket persists, it is +likely that the monitor has been improperly shut down. In any case, if the +monitor is not running, it will be impossible to use the admin socket, and the +``ceph`` command is likely to return ``Error 111: Connection Refused``. + +To access the admin socket, run a ``ceph tell`` command of the following form +(specifying the daemon that you are interested in): + +.. prompt:: bash $ + + ceph tell mon.<id> mon_status + +This command passes the ``mon_status`` command to the specific running monitor daemon +``<id>`` via its admin socket. 
If you know the full path to the admin socket +file, this can be done more directly by running the following command: + +.. prompt:: bash $ + + ceph --admin-daemon <full_path_to_asok_file> <command> + +Running ``ceph help`` shows all supported commands that are available through +the admin socket. See especially ``config get``, ``config show``, ``mon stat``, +and ``quorum_status``. + +.. _rados_troubleshoting_troubleshooting_mon_understanding_mon_status: + +Understanding mon_status +======================== + +The status of the monitor (as reported by the ``ceph tell mon.X mon_status`` +command) can always be obtained via the admin socket. This command outputs a +great deal of information about the monitor (including the information found in +the output of the ``quorum_status`` command). + +To understand this command's output, let us consider the following example, in +which we see the output of ``ceph tell mon.c mon_status``:: + + { "name": "c", + "rank": 2, + "state": "peon", + "election_epoch": 38, + "quorum": [ + 1, + 2], + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { "epoch": 3, + "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", + "modified": "2013-10-30 04:12:01.945629", + "created": "2013-10-29 14:14:41.914786", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6795\/0"}]}} + +It is clear that there are three monitors in the monmap (*a*, *b*, and *c*), +the quorum is formed by only two monitors, and *c* is in the quorum as a +*peon*. + +**Which monitor is out of the quorum?** + + The answer is **a** (that is, ``mon.a``). + +**Why?** + + When the ``quorum`` set is examined, there are clearly two monitors in the + set: *1* and *2*. But these are not monitor names. They are monitor ranks, as + established in the current ``monmap``. 
The ``quorum`` set does not include + the monitor that has rank 0, and according to the ``monmap`` that monitor is + ``mon.a``. + +**How are monitor ranks determined?** + + Monitor ranks are calculated (or recalculated) whenever monitors are added or + removed. The calculation of ranks follows a simple rule: the **greater** the + ``IP:PORT`` combination, the **lower** the rank. In this case, because + ``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations, + ``mon.a`` has the highest rank: namely, rank 0. + + +Most Common Monitor Issues +=========================== + +The Cluster Has Quorum but at Least One Monitor is Down +------------------------------------------------------- + +When the cluster has quorum but at least one monitor is down, ``ceph health +detail`` returns a message similar to the following:: + + $ ceph health detail + [snip] + mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) + +**How do I troubleshoot a Ceph cluster that has quorum but also has at least one monitor down?** + + #. Make sure that ``mon.a`` is running. + + #. Make sure that you can connect to ``mon.a``'s node from the + other Monitor nodes. Check the TCP ports as well. Check ``iptables`` and + ``nf_conntrack`` on all nodes and make sure that you are not + dropping/rejecting connections. + + If this initial troubleshooting doesn't solve your problem, then further + investigation is necessary. + + First, check the problematic monitor's ``mon_status`` via the admin + socket as explained in `Using the monitor's admin socket`_ and + `Understanding mon_status`_. + + If the Monitor is out of the quorum, then its state will be one of the + following: ``probing``, ``electing`` or ``synchronizing``. If the state of + the Monitor is ``leader`` or ``peon``, then the Monitor believes itself to be + in quorum but the rest of the cluster believes that it is not in quorum. 
It + is possible that a Monitor that is in one of the ``probing``, ``electing``, + or ``synchronizing`` states has entered the quorum during the process of + troubleshooting. Check ``ceph status`` again to determine whether the Monitor + has entered quorum during your troubleshooting. If the Monitor remains out of + the quorum, then proceed with the investigations described in this section of + the documentation. + + +**What does it mean when a Monitor's state is ``probing``?** + + If ``ceph health detail`` shows that a Monitor's state is + ``probing``, then the Monitor is still looking for the other Monitors. Every + Monitor remains in this state for some time when it is started. When a + Monitor has connected to the other Monitors specified in the ``monmap``, it + ceases to be in the ``probing`` state. The amount of time that a Monitor is + in the ``probing`` state depends upon the parameters of the cluster of which + it is a part. For example, when a Monitor is a part of a single-monitor + cluster (never do this in production), the monitor passes through the probing + state almost instantaneously. In a multi-monitor cluster, the Monitors stay + in the ``probing`` state until they find enough monitors to form a quorum + |---| this means that if two out of three Monitors in the cluster are + ``down``, the one remaining Monitor stays in the ``probing`` state + indefinitely until you bring one of the other monitors up. + + If quorum has been established, then the Monitor daemon should be able to + find the other Monitors quickly, as long as they can be reached. If a Monitor + is stuck in the ``probing`` state and you have exhausted the procedures above + that describe the troubleshooting of communications between the Monitors, + then it is possible that the problem Monitor is trying to reach the other + Monitors at a wrong address. 
``mon_status`` outputs the ``monmap`` that is
  known to the monitor: determine whether the other Monitors' locations as
  specified in the ``monmap`` match the locations of the Monitors in the
  network. If they do not, see `Recovering a Monitor's Broken monmap`_.
  If the locations of the Monitors as specified in the ``monmap`` match the
  locations of the Monitors in the network, then the persistent
  ``probing`` state could be related to severe clock skews amongst the monitor
  nodes. See `Clock Skews`_. If the information in `Clock Skews`_ does not
  bring the Monitor out of the ``probing`` state, then prepare your system logs
  and ask the Ceph community for help. See `Preparing your logs`_ for
  information about the proper preparation of logs.


**What does it mean when a Monitor's state is ``electing``?**

  If ``ceph health detail`` shows that a Monitor's state is ``electing``, the
  monitor is in the middle of an election. Elections typically complete
  quickly, but sometimes the monitors can get stuck in what is known as an
  *election storm*. See :ref:`Monitor Elections <dev_mon_elections>` for more
  on monitor elections.

  The presence of an election storm might indicate clock skew among the
  monitor nodes. See `Clock Skews`_ for more information.

  If your clocks are properly synchronized, search the mailing lists and bug
  tracker for issues similar to yours. The ``electing`` state is not
  likely to persist. In versions of Ceph after the release of Cuttlefish, there
  is no obvious reason other than clock skew that explains why an ``electing``
  state would persist.

  It is possible to investigate the cause of a persistent ``electing`` state if
  you put the problematic Monitor into a ``down`` state while you investigate.
  This is possible only if there are enough surviving Monitors to form quorum.
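A quick way to survey the states described above is to loop over the monitors and extract the ``state`` field from each ``mon_status`` reply. The sketch below uses a sample JSON string in place of a live reply; the monitor ID ``c`` and the sample content are illustrative only.

```shell
# Sketch: report a monitor's state. On a live cluster, replace the sample
# string with the output of: ceph tell mon.c mon_status
sample='{ "name": "c", "rank": 2, "state": "electing", "election_epoch": 38 }'
state=$(printf '%s' "$sample" | sed -n 's/.*"state": *"\([a-z]*\)".*/\1/p')
echo "mon.c state: $state"
```

Repeating this for each monitor ID quickly shows which daemons are stuck in ``probing``, ``electing``, or ``synchronizing``.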
+ +**What does it mean when a Monitor's state is ``synchronizing``?** + + If ``ceph health detail`` shows that the Monitor is ``synchronizing``, the + monitor is catching up with the rest of the cluster so that it can join the + quorum. The amount of time that it takes for the Monitor to synchronize with + the rest of the quorum is a function of the size of the cluster's monitor + store, the cluster's size, and the state of the cluster. Larger and degraded + clusters generally keep Monitors in the ``synchronizing`` state longer than + do smaller, new clusters. + + A Monitor that changes its state from ``synchronizing`` to ``electing`` and + then back to ``synchronizing`` indicates a problem: the cluster state may be + advancing (that is, generating new maps) too fast for the synchronization + process to keep up with the pace of the creation of the new maps. This issue + presented more frequently prior to the Cuttlefish release than it does in + more recent releases, because the synchronization process has since been + refactored and enhanced to avoid this dynamic. If you experience this in + later versions, report the issue in the `Ceph bug tracker + <https://tracker.ceph.com>`_. Prepare and provide logs to substantiate any + bug you raise. See `Preparing your logs`_ for information about the proper + preparation of logs. + +**What does it mean when a Monitor's state is ``leader`` or ``peon``?** + + If ``ceph health detail`` shows that the Monitor is in the ``leader`` state + or in the ``peon`` state, it is likely that clock skew is present. Follow the + instructions in `Clock Skews`_. If you have followed those instructions and + ``ceph health detail`` still shows that the Monitor is in the ``leader`` + state or the ``peon`` state, report the issue in the `Ceph bug tracker + <https://tracker.ceph.com>`_. If you raise an issue, provide logs to + substantiate it. See `Preparing your logs`_ for information about the + proper preparation of logs. 
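The clock-skew condition mentioned above can be checked mechanically. This sketch parses a sample ``ceph health detail`` warning line (the address and values are hypothetical) and compares the reported skew against the 0.05 s default described under `Clock Skews`_:

```shell
# Sketch: extract the reported skew and compare it to the 0.05s default.
# The sample line is illustrative; capture `ceph health detail` on a live cluster.
line='mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)'
skew=$(printf '%s' "$line" | sed -n 's/.*clock skew \([0-9.]*\)s.*/\1/p')
if awk -v s="$skew" 'BEGIN { exit !(s > 0.05) }'; then
  echo "mon.c skew ${skew}s exceeds the 0.05s default"
fi
```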
+ + +Recovering a Monitor's Broken ``monmap`` +---------------------------------------- + +This is how a ``monmap`` usually looks, depending on the number of +monitors:: + + + epoch 3 + fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 + last_changed 2013-10-30 04:12:01.945629 + created 2013-10-29 14:14:41.914786 + 0: 127.0.0.1:6789/0 mon.a + 1: 127.0.0.1:6790/0 mon.b + 2: 127.0.0.1:6795/0 mon.c + +This may not be what you have however. For instance, in some versions of +early Cuttlefish there was a bug that could cause your ``monmap`` +to be nullified. Completely filled with zeros. This means that not even +``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros. +It's also possible to end up with a monitor with a severely outdated monmap, +notably if the node has been down for months while you fight with your vendor's +TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving +monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, +then remove ``mon.a``, then add a new monitor ``mon.e`` and remove +``mon.b``; you will end up with a totally different monmap from the one +``mon.c`` knows). + +In this situation you have two possible solutions: + +Scrap the monitor and redeploy + + You should only take this route if you are positive that you won't + lose the information kept by that monitor; that you have other monitors + and that they are running just fine so that your new monitor is able + to synchronize from the remaining monitors. Keep in mind that destroying + a monitor, if there are no other copies of its contents, may lead to + loss of data. + +Inject a monmap into the monitor + + These are the basic steps: + + Retrieve the ``monmap`` from the surviving monitors and inject it into the + monitor whose ``monmap`` is corrupted or lost. + + Implement this solution by carrying out the following procedure: + + 1. Is there a quorum of monitors? 
If so, retrieve the ``monmap`` from the
      quorum::

       $ ceph mon getmap -o /tmp/monmap

   2. If there is no quorum, then retrieve the ``monmap`` directly from another
      monitor that has been stopped (in this example, the other monitor has
      the ID ``ID-FOO``)::

       $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap

   3. Stop the monitor you are going to inject the monmap into.

   4. Inject the monmap::

       $ ceph-mon -i ID --inject-monmap /tmp/monmap

   5. Start the monitor.

   .. warning:: Injecting ``monmaps`` can cause serious problems because doing
      so will overwrite the latest existing ``monmap`` stored on the monitor. Be
      careful!

Clock Skews
-----------

The Paxos consensus algorithm requires close time synchronization, which means
that clock skew among the monitors in the quorum can have a serious effect on
monitor operation. The resulting behavior can be puzzling. To avoid this issue,
run a clock synchronization tool on your monitor nodes: for example, use
``Chrony`` or the legacy ``ntpd`` utility. Configure each monitor node so that
the ``iburst`` option is in effect and so that each monitor has multiple peers,
including the following:

* Each other
* Internal ``NTP`` servers
* Multiple external, public pool servers

.. note:: The ``iburst`` option sends a burst of eight packets instead of the
   usual single packet, and is used during the process of getting two peers
   into initial synchronization.

Furthermore, it is advisable to synchronize *all* nodes in your cluster against
internal and external servers, and perhaps even against your monitors. Run
``NTP`` servers on bare metal: VM-virtualized clocks are not suitable for
steady timekeeping. See `https://www.ntp.org <https://www.ntp.org>`_ for more
information about the Network Time Protocol (NTP). Your organization might
already have quality internal ``NTP`` servers available.
Sources for ``NTP`` +server appliances include the following: + +* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_ +* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_ +* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_ + +Clock Skew Questions and Answers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**What's the maximum tolerated clock skew?** + + By default, monitors allow clocks to drift up to a maximum of 0.05 seconds + (50 milliseconds). + +**Can I increase the maximum tolerated clock skew?** + + Yes, but we strongly recommend against doing so. The maximum tolerated clock + skew is configurable via the ``mon-clock-drift-allowed`` option, but it is + almost certainly a bad idea to make changes to this option. The clock skew + maximum is in place because clock-skewed monitors cannot be relied upon. The + current default value has proven its worth at alerting the user before the + monitors encounter serious problems. Changing this value might cause + unforeseen effects on the stability of the monitors and overall cluster + health. + +**How do I know whether there is a clock skew?** + + The monitors will warn you via the cluster status ``HEALTH_WARN``. When clock + skew is present, the ``ceph health detail`` and ``ceph status`` commands + return an output resembling the following:: + + mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s) + + In this example, the monitor ``mon.c`` has been flagged as suffering from + clock skew. + + In Luminous and later releases, it is possible to check for a clock skew by + running the ``ceph time-sync-status`` command. Note that the lead monitor + typically has the numerically lowest IP address. 
It will always show ``0``: + the reported offsets of other monitors are relative to the lead monitor, not + to any external reference source. + +**What should I do if there is a clock skew?** + + Synchronize your clocks. Using an NTP client might help. However, if you + are already using an NTP client and you still encounter clock skew problems, + determine whether the NTP server that you are using is remote to your network + or instead hosted on your network. Hosting your own NTP servers tends to + mitigate clock skew problems. + + +Client Can't Connect or Mount +----------------------------- + +Check your IP tables. Some operating-system install utilities add a ``REJECT`` +rule to ``iptables``. ``iptables`` rules will reject all clients other than +``ssh`` that try to connect to the host. If your monitor host's IP tables have +a ``REJECT`` rule in place, clients that are connecting from a separate node +will fail and will raise a timeout error. Any ``iptables`` rules that reject +clients trying to connect to Ceph daemons must be addressed. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +It might also be necessary to add rules to iptables on your Ceph hosts to +ensure that clients are able to access the TCP ports associated with your Ceph +monitors (default: port 6789) and Ceph OSDs (default: 6800 through 7300). For +example:: + + iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT + + +Monitor Store Failures +====================== + +Symptoms of store corruption +---------------------------- + +Ceph monitors store the :term:`Cluster Map` in a key-value store. 
If key-value
store corruption causes a monitor to fail, then the monitor log might contain
one of the following error messages::

  Corruption: error in middle of record

or::

  Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb

Recovery using healthy monitor(s)
---------------------------------

If there are surviving monitors, we can always :ref:`replace
<adding-and-removing-monitors>` the corrupted monitor with a new one. After the
new monitor boots, it will synchronize with a healthy peer. After the new
monitor is fully synchronized, it will be able to serve clients.

.. _mon-store-recovery-using-osds:

Recovery using OSDs
-------------------

Even if all monitors fail at the same time, it is possible to recover the
monitor store by using information stored in OSDs. You are encouraged to deploy
at least three (and preferably five) monitors in a Ceph cluster. In such a
deployment, complete monitor failure is unlikely. However, unplanned power loss
in a data center whose disk settings or filesystem settings are improperly
configured could cause the underlying filesystem to fail and this could kill
all of the monitors. In such a case, data in the OSDs can be used to recover
the monitors. The following script can be used to recover the monitors:


.. code-block:: bash

   ms=/root/mon-store
   mkdir $ms

   # collect the cluster map from stopped OSDs
   for host in $hosts; do
     rsync -avz $ms/. user@$host:$ms.remote
     rm -rf $ms
     ssh user@$host <<EOF
   for osd in /var/lib/ceph/osd/ceph-*; do
     ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
   done
   EOF
     rsync -avz user@$host:$ms.remote/. $ms
   done

   # rebuild the monitor store from the collected map. if the cluster does not
   # use cephx authentication, we can skip the following steps to update the
   # keyring with the caps, and there is no need to pass the "--keyring" option,
   # i.e. just use "ceph-monstore-tool $ms rebuild" instead
   ceph-authtool /path/to/admin.keyring -n mon. \
     --cap mon 'allow *'
   ceph-authtool /path/to/admin.keyring -n client.admin \
     --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
   # add one or more ceph-mgr's key to the keyring. in this case, an encoded key
   # for mgr.x is added, you can find the encoded key in
   # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
   # deployed
   ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \
     --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
   # If your monitors' ids are not sorted by ip address, please specify them in order.
   # For example, if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is 10.0.0.4,
   # please pass "--mon-ids b a c".
   # In addition, if your monitors' ids are not single characters like 'a', 'b', 'c',
   # specify them on the command line as arguments of the "--mon-ids" option. If you
   # are not sure, check your ceph.conf to see whether there are any sections named
   # like '[mon.foo]'. Don't pass the "--mon-ids" option if you are using DNS SRV
   # for looking up monitors.
   ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma

   # make a backup of the corrupted store.db just in case! repeat for
   # all monitors.
   mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted

   # move the rebuilt store.db into place. repeat for all monitors.
   mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
   chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

This script performs the following steps:

#. Collects the map from each OSD host.
#. Rebuilds the store.
#. Fills the entities in the keyring file with appropriate capabilities.
#. Replaces the corrupted store on ``mon.foo`` with the recovered copy.
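As the comments in the script note, the order of ``--mon-ids`` matters when monitor IDs are not sorted by IP address, and one way to find the IDs is to look for ``[mon.foo]`` sections in ``ceph.conf``. This is a sketch of that lookup over a hypothetical sample configuration; point the ``sed`` command at your real ``/etc/ceph/ceph.conf``.

```shell
# Sketch: list [mon.<id>] section names from a ceph.conf.
# The file below is a hypothetical sample for illustration.
cat > /tmp/sample-ceph.conf <<'EOF'
[global]
fsid = 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
[mon.alpha]
mon addr = 10.0.0.3
[mon.beta]
mon addr = 10.0.0.2
EOF
sed -n 's/^\[mon\.\(.*\)\]$/\1/p' /tmp/sample-ceph.conf
```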
+ + +Known limitations +~~~~~~~~~~~~~~~~~ + +The above recovery tool is unable to recover the following information: + +- **Certain added keyrings**: All of the OSD keyrings added using the ``ceph + auth add`` command are recovered from the OSD's copy, and the + ``client.admin`` keyring is imported using ``ceph-monstore-tool``. However, + the MDS keyrings and all other keyrings will be missing in the recovered + monitor store. You might need to manually re-add them. + +- **Creating pools**: If any RADOS pools were in the process of being created, + that state is lost. The recovery tool operates on the assumption that all + pools have already been created. If there are PGs that are stuck in the + 'unknown' state after the recovery for a partially created pool, you can + force creation of the *empty* PG by running the ``ceph osd force-create-pg`` + command. Note that this will create an *empty* PG, so take this action only + if you know the pool is empty. + +- **MDS Maps**: The MDS maps are lost. + + +Everything Failed! Now What? +============================ + +Reaching out for help +--------------------- + +You can find help on IRC in #ceph and #ceph-devel on OFTC (server +irc.oftc.net), or at ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make +sure that you have prepared your logs and that you have them ready upon +request. + +See https://ceph.io/en/community/connect/ for current (as of October 2023) +information on getting in contact with the upstream Ceph community. + + +Preparing your logs +------------------- + +The default location for monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``. +However, if they are not there, you can find their current location by running +the following command: + +.. prompt:: bash + + ceph-conf --name mon.FOO --show-config-value log_file + +The amount of information in the logs is determined by the debug levels in the +cluster's configuration files. 
If Ceph is using the default debug levels, then +your logs might be missing important information that would help the upstream +Ceph community address your issue. + +To make sure your monitor logs contain relevant information, you can raise +debug levels. Here we are interested in information from the monitors. As with +other components, the monitors have different parts that output their debug +information on different subsystems. + +If you are an experienced Ceph troubleshooter, we recommend raising the debug +levels of the most relevant subsystems. Of course, this approach might not be +easy for beginners. In most cases, however, enough information to address the +issue will be secured if the following debug levels are entered:: + + debug_mon = 10 + debug_ms = 1 + +Sometimes these debug levels do not yield enough information. In such cases, +members of the upstream Ceph community might ask you to make additional changes +to these or to other debug levels. In any case, it is better for us to receive +at least some useful information than to receive an empty log. + + +Do I need to restart a monitor to adjust debug levels? +------------------------------------------------------ + +No, restarting a monitor is not necessary. Debug levels may be adjusted by +using two different methods, depending on whether or not there is a quorum: + +There is a quorum + + Either inject the debug option into the specific monitor that needs to + be debugged:: + + ceph tell mon.FOO config set debug_mon 10/10 + + Or inject it into all monitors at once:: + + ceph tell mon.* config set debug_mon 10/10 + + +There is no quorum + + Use the admin socket of the specific monitor that needs to be debugged + and directly adjust the monitor's configuration options:: + + ceph daemon mon.FOO config set debug_mon 10/10 + + +To return the debug levels to their default values, run the above commands +using the debug level ``1/10`` rather than ``10/10``. 
To check a monitor's +current values, use the admin socket and run either of the following commands: + + .. prompt:: bash + + ceph daemon mon.FOO config show + +or: + + .. prompt:: bash + + ceph daemon mon.FOO config get 'OPTION_NAME' + + + +I Reproduced the problem with appropriate debug levels. Now what? +----------------------------------------------------------------- + +We prefer that you send us only the portions of your logs that are relevant to +your monitor problems. Of course, it might not be easy for you to determine +which portions are relevant so we are willing to accept complete and +unabridged logs. However, we request that you avoid sending logs containing +hundreds of thousands of lines with no additional clarifying information. One +common-sense way of making our task easier is to write down the current time +and date when you are reproducing the problem and then extract portions of your +logs based on that information. + +Finally, reach out to us on the mailing lists or IRC or Slack, or by filing a +new issue on the `tracker`_. + +.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new + +.. |---| unicode:: U+2014 .. EM DASH + :trim: diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst new file mode 100644 index 000000000..035947d7e --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-osd.rst @@ -0,0 +1,787 @@ +====================== + Troubleshooting OSDs +====================== + +Before troubleshooting the cluster's OSDs, check the monitors +and the network. + +First, determine whether the monitors have a quorum. Run the ``ceph health`` +command or the ``ceph -s`` command and if Ceph shows ``HEALTH_OK`` then there +is a monitor quorum. + +If the monitors don't have a quorum or if there are errors with the monitor +status, address the monitor issues before proceeding by consulting the material +in `Troubleshooting Monitors <../troubleshooting-mon>`_. 
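The quorum check described above can be scripted as a simple gate before any OSD work. This is a sketch only: the ``health`` value is hard-coded for illustration, where a live cluster would supply it via ``ceph health``.

```shell
# Sketch: proceed with OSD checks only when the monitors report quorum.
# On a live cluster, use: health=$(ceph health)
health="HEALTH_OK"
case "$health" in
  HEALTH_OK) echo "monitor quorum looks healthy; continue with OSD checks" ;;
  *)         echo "address monitor issues first: $health" ;;
esac
```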
+ +Next, check your networks to make sure that they are running properly. Networks +can have a significant impact on OSD operation and performance. Look for +dropped packets on the host side and CRC errors on the switch side. + + +Obtaining Data About OSDs +========================= + +When troubleshooting OSDs, it is useful to collect different kinds of +information about the OSDs. Some information comes from the practice of +`monitoring OSDs`_ (for example, by running the ``ceph osd tree`` command). +Additional information concerns the topology of your cluster, and is discussed +in the following sections. + + +Ceph Logs +--------- + +Ceph log files are stored under ``/var/log/ceph``. Unless the path has been +changed (or you are in a containerized environment that stores logs in a +different location), the log files can be listed by running the following +command: + +.. prompt:: bash + + ls /var/log/ceph + +If there is not enough log detail, change the logging level. To ensure that +Ceph performs adequately under high logging volume, see `Logging and +Debugging`_. + + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. First, list the +sockets of Ceph's daemons by running the following command: + +.. prompt:: bash + + ls /var/run/ceph + +Next, run a command of the following form (replacing ``{daemon-name}`` with the +name of a specific daemon: for example, ``osd.0``): + +.. prompt:: bash + + ceph daemon {daemon-name} help + +Alternatively, run the command with a ``{socket-file}`` specified (a "socket +file" is a specific file in ``/var/run/ceph``): + +.. prompt:: bash + + ceph daemon {socket-file} help + +The admin socket makes many tasks possible, including: + +- Listing Ceph configuration at runtime +- Dumping historic operations +- Dumping the operation priority queue state +- Dumping operations in flight +- Dumping perfcounters + +Display Free Space +------------------ + +Filesystem issues may arise. 
To display your filesystems' free space, run the
following command:

.. prompt:: bash

   df -h

To see this command's supported syntax and options, run ``df --help``.

I/O Statistics
--------------

The `iostat`_ tool can be used to identify I/O-related issues. Run the
following command:

.. prompt:: bash

   iostat -x


Diagnostic Messages
-------------------

To retrieve diagnostic messages from the kernel, run the ``dmesg`` command and
filter the output with ``less``, ``more``, ``grep``, or ``tail``. For
example:

.. prompt:: bash

   dmesg | grep scsi

Stopping without Rebalancing
============================

It might occasionally be necessary to perform maintenance on a subset of your
cluster or to resolve a problem that affects a failure domain (for example, a
rack). However, when you stop OSDs for maintenance, you might want to prevent
CRUSH from automatically rebalancing the cluster. To avert this rebalancing
behavior, set the cluster to ``noout`` by running the following command:

.. prompt:: bash

   ceph osd set noout

.. warning:: This is more a thought exercise offered for the purpose of giving
   the reader a sense of failure domains and CRUSH behavior than a suggestion
   that anyone in the post-Luminous world run ``ceph osd set noout``. When the
   OSDs return to an ``up`` state, rebalancing will resume and the change
   introduced by the ``ceph osd set noout`` command will be reverted.

In Luminous and later releases, however, it is a safer approach to flag only
affected OSDs. To set or unset the ``noout`` flag for a specific OSD, run
commands like the following:

.. prompt:: bash

   ceph osd add-noout osd.0
   ceph osd rm-noout osd.0

It is also possible to flag an entire CRUSH bucket. For example, if you plan to
take down ``prod-ceph-data1701`` in order to add RAM, you might run the
following command:

..
prompt:: bash + + ceph osd set-group noout prod-ceph-data1701 + +After the flag is set, stop the OSDs and any other colocated +Ceph services within the failure domain that requires maintenance work:: + + systemctl stop ceph\*.service ceph\*.target + +.. note:: When an OSD is stopped, any placement groups within the OSD are + marked as ``degraded``. + +After the maintenance is complete, it will be necessary to restart the OSDs +and any other daemons that have stopped. However, if the host was rebooted as +part of the maintenance, they do not need to be restarted and will come back up +automatically. To restart OSDs or other daemons, use a command of the following +form: + +.. prompt:: bash + + sudo systemctl start ceph.target + +Finally, unset the ``noout`` flag as needed by running commands like the +following: + +.. prompt:: bash + + ceph osd unset noout + ceph osd unset-group noout prod-ceph-data1701 + +Many contemporary Linux distributions employ ``systemd`` for service +management. However, for certain operating systems (especially older ones) it +might be necessary to issue equivalent ``service`` or ``start``/``stop`` +commands. + + +.. _osd-not-running: + +OSD Not Running +=============== + +Under normal conditions, restarting a ``ceph-osd`` daemon will allow it to +rejoin the cluster and recover. + + +An OSD Won't Start +------------------ + +If the cluster has started but an OSD isn't starting, check the following: + +- **Configuration File:** If you were not able to get OSDs running from a new + installation, check your configuration file to ensure it conforms to the + standard (for example, make sure that it says ``host`` and not ``hostname``, + etc.). + +- **Check Paths:** Ensure that the paths specified in the configuration + correspond to the paths for data and metadata that actually exist (for + example, the paths to the journals, the WAL, and the DB). 
Separate the OSD + data from the metadata in order to see whether there are errors in the + configuration file and in the actual mounts. If so, these errors might + explain why OSDs are not starting. To store the metadata on a separate block + device, partition or LVM the drive and assign one partition per OSD. + +- **Check Max Threadcount:** If the cluster has a node with an especially high + number of OSDs, it might be hitting the default maximum number of threads + (usually 32768). This is especially likely to happen during recovery. + Increasing the maximum number of threads to the maximum possible number of + threads allowed (4194303) might help with the problem. To increase the number + of threads to the maximum, run the following command: + + .. prompt:: bash + + sysctl -w kernel.pid_max=4194303 + + If this increase resolves the issue, you must make the increase permanent by + including a ``kernel.pid_max`` setting either in a file under + ``/etc/sysctl.d`` or within the master ``/etc/sysctl.conf`` file. For + example:: + + kernel.pid_max = 4194303 + +- **Check ``nf_conntrack``:** This connection-tracking and connection-limiting + system causes problems for many production Ceph clusters. The problems often + emerge slowly and subtly. As cluster topology and client workload grow, + mysterious and intermittent connection failures and performance glitches + occur more and more, especially at certain times of the day. To begin taking + the measure of your problem, check the ``syslog`` history for "table full" + events. One way to address this kind of problem is as follows: First, use the + ``sysctl`` utility to assign ``nf_conntrack_max`` a much higher value. Next, + raise the value of ``nf_conntrack_buckets`` so that ``nf_conntrack_buckets`` + × 8 = ``nf_conntrack_max``; this action might require running commands + outside of ``sysctl`` (for example, ``echo 131072 > + /sys/module/nf_conntrack/parameters/hashsize``).
Another way to address the + problem is to blacklist the associated kernel modules in order to disable + processing altogether. This approach is powerful, but fragile. The modules + and the order in which the modules must be listed can vary among kernel + versions. Even when blacklisted, ``iptables`` and ``docker`` might sometimes + activate connection tracking anyway, so we advise a "set and forget" strategy + for the tunables. On modern systems, this approach will not consume + appreciable resources. + +- **Kernel Version:** Identify the kernel version and distribution that are in + use. By default, Ceph uses third-party tools that might be buggy or come into + conflict with certain distributions or kernel versions (for example, Google's + ``gperftools`` and ``TCMalloc``). Check the `OS recommendations`_ and the + release notes for each Ceph version in order to make sure that you have + addressed any issues related to your kernel. + +- **Segment Fault:** If there is a segment fault, increase log levels and + restart the problematic daemon(s). If segment faults recur, search the Ceph + bug tracker `https://tracker.ceph.com/projects/ceph + <https://tracker.ceph.com/projects/ceph/>`_ and the ``dev`` and + ``ceph-users`` mailing list archives `https://ceph.io/resources + <https://ceph.io/resources>`_ to see if others have experienced and reported + these issues. If this truly is a new and unique failure, post to the ``dev`` + email list and provide the following information: the specific Ceph release + being run, ``ceph.conf`` (with secrets XXX'd out), your monitor status + output, and excerpts from your log file(s). + + +An OSD Failed +------------- + +When an OSD fails, this means that a ``ceph-osd`` process is unresponsive or +has died and that the corresponding OSD has been marked ``down``.
Surviving +``ceph-osd`` daemons will report to the monitors that the OSD appears to be +down, and a new status will be visible in the output of the ``ceph health`` +command, as in the following example: + +.. prompt:: bash + + ceph health + +:: + + HEALTH_WARN 1/3 in osds are down + +This health alert is raised whenever there are one or more OSDs marked ``in`` +and ``down``. To see which OSDs are ``down``, add ``detail`` to the command as in +the following example: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +Alternatively, run the following command: + +.. prompt:: bash + + ceph osd tree down + +If there is a drive failure or another fault that is preventing a given +``ceph-osd`` daemon from functioning or restarting, then there should be an +error message present in its log file under ``/var/log/ceph``. + +If the ``ceph-osd`` daemon stopped because of a heartbeat failure or a +``suicide timeout`` error, then the underlying drive or filesystem might be +unresponsive. Check ``dmesg`` output and ``syslog`` output for drive errors or +kernel errors. It might be necessary to specify certain flags (for example, +``dmesg -T`` to see human-readable timestamps) in order to avoid mistaking old +errors for new errors. + +If an entire host's OSDs are ``down``, check to see if there is a network +error or a hardware issue with the host. + +If the OSD problem is the result of a software error (for example, a failed +assertion or another unexpected error), search for reports of the issue in the +`bug tracker <https://tracker.ceph.com/projects/ceph>`_, the `dev mailing list +archives <https://lists.ceph.io/hyperkitty/list/dev@ceph.io/>`_, and the +`ceph-users mailing list archives +<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/>`_. If there is no +clear fix or existing bug, then :ref:`report the problem to the ceph-devel +email list <Get Involved>`.
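If you are scripting health checks, the ``down`` OSD IDs can be pulled out of saved ``ceph health detail`` output. A minimal sketch, using a hard-coded sample (mirroring the example output above) in place of a live cluster query:

```shell
# Hypothetical sample of "ceph health detail" output; on a real cluster
# this text would come from running the command itself.
health_detail='HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080'

# Print the first field (osd.N) of every "is down since epoch" line.
down_osds=$(printf '%s\n' "$health_detail" | awk '/is down since epoch/ {print $1}')
echo "$down_osds"
```

On a real cluster you would pipe ``ceph health detail`` into the same filter.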
+ + +.. _no-free-drive-space: + +No Free Drive Space +------------------- + +If an OSD is full, Ceph prevents data loss by ensuring that no new data is +written to the OSD. In a properly running cluster, health checks are raised +when the cluster's OSDs and pools approach certain "fullness" ratios. The +``mon_osd_full_ratio`` threshold defaults to ``0.95`` (or 95% of capacity): +this is the point above which clients are prevented from writing data. The +``mon_osd_backfillfull_ratio`` threshold defaults to ``0.90`` (or 90% of +capacity): this is the point above which backfills will not start. The +``mon_osd_nearfull_ratio`` threshold defaults to ``0.85`` (or 85% of capacity): +this is the point at which the cluster raises the ``OSD_NEARFULL`` health check. + +OSDs within a cluster will vary in how much data is allocated to them by Ceph. +To check "fullness" by displaying data utilization for every OSD, run the +following command: + +.. prompt:: bash + + ceph osd df + +To check "fullness" by displaying a cluster's overall data usage and data +distribution among pools, run the following command: + +.. prompt:: bash + + ceph df + +When examining the output of the ``ceph df`` command, pay special attention to +the **most full** OSDs, as opposed to the percentage of raw space used. If a +single outlier OSD becomes full, all writes to this OSD's pool might fail as a +result. When ``ceph df`` reports the space available to a pool, it considers +the ratio settings relative to the *most full* OSD that is part of the pool. To +flatten the distribution, two approaches are available: (1) Using the +``reweight-by-utilization`` command to progressively move data from excessively +full OSDs or move data to insufficiently full OSDs, and (2) in later revisions +of Luminous and subsequent releases, exploiting the ``ceph-mgr`` ``balancer`` +module to perform the same task automatically. + +To adjust the "fullness" ratios, run a command or commands of the following +form: + +..
prompt:: bash + + ceph osd set-nearfull-ratio <float[0.0-1.0]> + ceph osd set-full-ratio <float[0.0-1.0]> + ceph osd set-backfillfull-ratio <float[0.0-1.0]> + +Sometimes full cluster issues arise because an OSD has failed. This can happen +either because of a test or because the cluster is small, very full, or +unbalanced. When an OSD or node holds an excessive percentage of the cluster's +data, component failures or natural growth can result in the ``nearfull`` and +``full`` ratios being exceeded. When testing Ceph's resilience to OSD failures +on a small cluster, it is advised to leave ample free disk space and to +consider temporarily lowering the OSD ``full ratio``, OSD ``backfillfull +ratio``, and OSD ``nearfull ratio``. + +The "fullness" status of OSDs is visible in the output of the ``ceph health`` +command, as in the following example: + +.. prompt:: bash + + ceph health + +:: + + HEALTH_WARN 1 nearfull osd(s) + +For details, add the ``detail`` command as in the following example: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s) + osd.3 is full at 97% + osd.4 is backfill full at 91% + osd.2 is near full at 87% + +To address full cluster issues, it is recommended to add capacity by adding +OSDs. Adding new OSDs allows the cluster to redistribute data to newly +available storage. Also search for orphaned objects left behind by ``rados +bench`` runs, which can waste space. + +If a legacy Filestore OSD cannot be started because it is full, it is possible +to reclaim space by deleting a small number of placement group directories in +the full OSD. + +.. important:: If you choose to delete a placement group directory on a full + OSD, **DO NOT** delete the same placement group directory on another full + OSD. **OTHERWISE YOU WILL LOSE DATA**. You **MUST** maintain at least one + copy of your data on at least one OSD. Deleting placement group directories + is a rare and extreme intervention. It is not to be undertaken lightly.
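The three thresholds above can be applied mechanically to per-OSD utilization figures. A minimal sketch, classifying hard-coded sample utilizations (standing in for real ``ceph osd df`` output) against the default ratios:

```shell
# Hypothetical per-OSD utilization fractions; on a live cluster these
# figures would come from "ceph osd df" instead.
osd_use='osd.2 0.87
osd.3 0.97
osd.4 0.91'

# Classify each OSD against the default full (0.95), backfillfull (0.90),
# and nearfull (0.85) ratios.
result=$(printf '%s\n' "$osd_use" | awk '
  $2 >= 0.95 { print $1, "full"; next }
  $2 >= 0.90 { print $1, "backfillfull"; next }
  $2 >= 0.85 { print $1, "nearfull" }')
echo "$result"
```

If you have changed the ratios with the commands above, substitute your values for the defaults in the sketch.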
+ +See `Monitor Config Reference`_ for more information. + + +OSDs are Slow/Unresponsive +========================== + +OSDs are sometimes slow or unresponsive. When troubleshooting this common +problem, it is advised to eliminate other possibilities before investigating +OSD performance issues. For example, be sure to confirm that your network(s) +are working properly, to verify that your OSDs are running, and to check +whether OSDs are throttling recovery traffic. + +.. tip:: In pre-Luminous releases of Ceph, ``up`` and ``in`` OSDs were + sometimes not available or were otherwise slow because recovering OSDs were + consuming system resources. Newer releases provide better recovery handling + by preventing this phenomenon. + + +Networking Issues +----------------- + +As a distributed storage system, Ceph relies upon networks for OSD peering and +replication, recovery from faults, and periodic heartbeats. Networking issues +can cause OSD latency and flapping OSDs. For more information, see `Flapping +OSDs`_. + +To make sure that Ceph processes and Ceph-dependent processes are connected and +listening, run the following commands: + +.. prompt:: bash + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +To check network statistics, run the following command: + +.. prompt:: bash + + netstat -s + +Drive Configuration +------------------- + +A SAS or SATA storage drive should house only one OSD, but an NVMe drive can +easily house two or more. However, it is possible for read and write throughput +to bottleneck if other processes share the drive. Such processes include: +journals / metadata, operating systems, Ceph monitors, ``syslog`` logs, other +OSDs, and non-Ceph processes. + +Because Ceph acknowledges writes *after* journaling, fast SSDs are an +attractive option for accelerating response time -- particularly when using the +``XFS`` or ``ext4`` filesystems for legacy FileStore OSDs.
By contrast, the +``Btrfs`` file system can write and journal simultaneously. (However, use of +``Btrfs`` is not recommended for production deployments.) + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Throughput might be improved somewhat by + running a journal in a separate partition, but it is better still to run + such a journal in a separate physical drive. + +.. warning:: Reef does not support FileStore. Releases after Reef do not + support FileStore. Any information that mentions FileStore is pertinent only + to the Quincy release of Ceph and to releases prior to Quincy. + + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your drives for bad blocks, fragmentation, and other errors that can +cause significantly degraded performance. Tools that are useful in checking for +drive errors include ``dmesg``, ``syslog`` logs, and ``smartctl`` (found in the +``smartmontools`` package). + +.. note:: ``smartmontools`` 7.0 and later provides NVMe stat passthrough and + JSON output. + + +Co-resident Monitors/OSDs +------------------------- + +Although monitors are relatively lightweight processes, performance issues can +result when monitors are run on the same host machine as an OSD. Monitors issue +many ``fsync()`` calls and this can interfere with other workloads. The danger +of performance issues is especially acute when the monitors are co-resident on +the same storage drive as an OSD. In addition, if the monitors are running an +older kernel (pre-3.0) or a kernel with no ``syncfs(2)`` syscall, then multiple +OSDs running on the same host might make so many commits as to undermine each +other's performance. This problem sometimes results in what is called "bursty +writes".
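To spot co-resident monitor and OSD daemons when auditing a host, you can count daemon processes. A minimal sketch, with a hard-coded sample process listing standing in for real ``ps`` output:

```shell
# Hypothetical process listing; on a real host you might use
# "ps -eo comm,args" and filter for ceph daemons instead.
ps_output='ceph-mon --id a
ceph-osd --id 3
ceph-osd --id 4'

# Count monitor and OSD processes in the listing.
mons=$(printf '%s\n' "$ps_output" | grep -c '^ceph-mon')
osds=$(printf '%s\n' "$ps_output" | grep -c '^ceph-osd')
if [ "$mons" -gt 0 ] && [ "$osds" -gt 0 ]; then
  echo "mon and $osds OSD(s) share this host"
fi
```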
+ + +Co-resident Processes +--------------------- + +Significant OSD latency can result from processes that write data to Ceph (for +example, cloud-based solutions and virtual machines) while operating on the +same hardware as OSDs. For this reason, making such processes co-resident with +OSDs is not generally recommended. Instead, the recommended practice is to +optimize certain hosts for use with Ceph and use other hosts for other +processes. This practice of separating Ceph operations from other applications +might help improve performance and might also streamline troubleshooting and +maintenance. + +Running co-resident processes on the same hardware is sometimes called +"convergence". When using Ceph, engage in convergence only with expertise and +after consideration. + + +Logging Levels +-------------- + +Performance issues can result from high logging levels. Operators sometimes +raise logging levels in order to track an issue and then forget to lower them +afterwards. In such a situation, OSDs might consume valuable system resources to +write needlessly verbose logs onto the disk. Anyone who does want to use high logging +levels is advised to consider mounting a drive to the default path for logging +(for example, ``/var/log/ceph/$cluster-$name.log``). + +Recovery Throttling +------------------- + +Depending upon your configuration, Ceph may reduce recovery rates to maintain +client or OSD performance, or it may increase recovery rates to the point that +recovery impacts client or OSD performance. Check to see if the client or OSD +is recovering. + + +Kernel Version +-------------- + +Check the kernel version that you are running. Older kernels may lack updates +that improve Ceph performance. + + +Kernel Issues with SyncFS +------------------------- + +If you have kernel issues with SyncFS, try running one OSD per host to see if +performance improves. Old kernels might not have a recent enough version of +``glibc`` to support ``syncfs(2)``. 
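When auditing hosts for the kernel concerns above, comparing the kernel release against a minimum major version is often enough. A minimal sketch, using a hard-coded release string in place of ``uname -r``:

```shell
# Hypothetical kernel release string; substitute "$(uname -r)" on a real host.
kernel_release='5.4.0-150-generic'
min_major=3

# The major version is everything before the first dot.
major=${kernel_release%%.*}
if [ "$major" -ge "$min_major" ]; then
  echo "kernel $kernel_release: recent enough for syncfs(2)"
else
  echo "kernel $kernel_release: pre-3.0, consider one OSD per host"
fi
```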
+ + +Filesystem Issues +----------------- + +In post-Luminous releases, we recommend deploying clusters with the BlueStore +back end. When running a pre-Luminous release, or if you have a specific +reason to deploy OSDs with the previous Filestore backend, we recommend +``XFS``. + +We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has +many attractive features, but bugs may lead to performance issues and spurious +ENOSPC errors. We do not recommend ``ext4`` for Filestore OSDs because +``xattr`` limitations break support for long object names, which are needed for +RGW. + +For more information, see `Filesystem Recommendations`_. + +.. _Filesystem Recommendations: ../configuration/filesystem-recommendations + +Insufficient RAM +---------------- + +We recommend a *minimum* of 4GB of RAM per OSD daemon and we suggest rounding +up from 6GB to 8GB. During normal operations, you may notice that ``ceph-osd`` +processes use only a fraction of that amount. You might be tempted to use the +excess RAM for co-resident applications or to skimp on each node's memory +capacity. However, when OSDs experience recovery their memory utilization +spikes. If there is insufficient RAM available during recovery, OSD performance +will slow considerably and the daemons may even crash or be killed by the Linux +``OOM Killer``. + + +Blocked Requests or Slow Requests +--------------------------------- + +When a ``ceph-osd`` daemon is slow to respond to a request, the cluster log +receives messages reporting ops that are taking too long. The warning threshold +defaults to 30 seconds and is configurable via the ``osd_op_complaint_time`` +setting. 
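When scanning cluster logs for these complaints, the blocked duration can be extracted for graphing or alerting. A minimal sketch, run against a hard-coded sample warning line (the exact wording varies by release):

```shell
# Hypothetical cluster-log line of the "slow requests" form.
log='2024-01-01 osd.12 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs'

# Capture the number between "oldest blocked for >" and "secs".
secs=$(printf '%s\n' "$log" | sed -n 's/.*oldest blocked for > \([0-9.]*\) secs.*/\1/p')
echo "$secs"
```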
+ +Legacy versions of Ceph complain about ``old requests``:: + + osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops + +Newer versions of Ceph complain about ``slow requests``:: + + {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs + {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610] + +Possible causes include: + +- A failing drive (check ``dmesg`` output) +- A bug in the kernel file system (check ``dmesg`` output) +- An overloaded cluster (check system load, iostat, etc.) +- A bug in the ``ceph-osd`` daemon. + +Possible solutions: + +- Remove VMs from Ceph hosts +- Upgrade kernel +- Upgrade Ceph +- Restart OSDs +- Replace failed or failing components + +Debugging Slow Requests +----------------------- + +If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id> +dump_ops_in_flight``, you will see a set of operations and a list of events +each operation went through. These are briefly described below. + +Events from the Messenger layer: + +- ``header_read``: The time that the messenger first started reading the message off the wire. +- ``throttled``: The time that the messenger tried to acquire memory throttle space to read + the message into memory. +- ``all_read``: The time that the messenger finished reading the message off the wire. +- ``dispatched``: The time that the messenger gave the message to the OSD. +- ``initiated``: This is identical to ``header_read``. The existence of both is a + historical oddity. + +Events from the OSD as it processes ops: + +- ``queued_for_pg``: The op has been put into the queue for processing by its PG. 
+- ``reached_pg``: The PG has started performing the op. +- ``waiting for \*``: The op is waiting for some other work to complete before + it can proceed (for example, a new OSDMap; the scrubbing of its object + target; the completion of a PG's peering; all as specified in the message). +- ``started``: The op has been accepted as something the OSD should do and + is now being performed. +- ``waiting for subops from``: The op has been sent to replica OSDs. + +Events from ``FileStore``: + +- ``commit_queued_for_journal_write``: The op has been given to the FileStore. +- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and is waiting + to be persisted (as the next disk write). +- ``journaled_completion_queued``: The op was journaled to disk and its callback + has been queued for invocation. + +Events from the OSD after data has been given to underlying storage: + +- ``op_commit``: The op has been committed (that is, written to journal) by the + primary OSD. +- ``op_applied``: The op has been `write()'en + <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (that is, + applied in memory but not flushed out to disk) on the primary. +- ``sub_op_applied``: ``op_applied``, but for a replica's "subop". +- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools). +- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it + hears about the above, but for a particular replica (i.e. ``<X>``). +- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops). + +Some of these events may appear redundant, but they cross important boundaries +in the internal code (such as passing data across locks into new threads). + + +Flapping OSDs +============= + +"Flapping" is the term for the phenomenon of an OSD being repeatedly marked +``up`` and then ``down`` in rapid succession. This section explains how to +recognize flapping, and how to mitigate it.
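One rough way to recognize flapping is to count how often each OSD changes state within a short log window. A minimal sketch over hard-coded sample lines (real cluster-log wording differs by release):

```shell
# Hypothetical state-change lines distilled from a cluster log window.
log='osd.7 marked down
osd.7 marked up
osd.7 marked down
osd.3 marked down'

# An OSD with three or more transitions in the window is suspect.
flappers=$(printf '%s\n' "$log" | awk '{count[$1]++} END {for (o in count) if (count[o] >= 3) print o}')
echo "$flappers"
```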
+ +When OSDs peer and check heartbeats, they use the cluster (back-end) network +when it is available. See `Monitor/OSD Interaction`_ for details. + +The upstream Ceph community has traditionally recommended separate *public* +(front-end) and *private* (cluster / back-end / replication) networks. This +provides the following benefits: + +#. Segregation of (1) heartbeat traffic and replication/recovery traffic + (private) from (2) traffic from clients and between OSDs and monitors + (public). This helps keep one stream of traffic from DoS-ing the other, + which could in turn result in a cascading failure. + +#. Additional throughput for both public and private traffic. + +In the past, when common networking technologies were measured in a range +encompassing 100Mb/s and 1Gb/s, this separation was often critical. But with +today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s networks, the above capacity concerns +are often diminished or even obviated. For example, if your OSD nodes have two +network ports, dedicating one to the public and the other to the private +network means that you have no path redundancy. This degrades your ability to +endure network maintenance and network failures without significant cluster or +client impact. In situations like this, consider instead using both links for +only a public network: with bonding (LACP) or equal-cost routing (for example, +FRR) you reap the benefits of increased throughput headroom, fault tolerance, +and reduced OSD flapping. + +When a private network (or even a single host link) fails or degrades while the +public network continues operating normally, OSDs may not handle this situation +well. In such situations, OSDs use the public network to report each other +``down`` to the monitors, while marking themselves ``up``. The monitors then +send out -- again on the public network -- an updated cluster map with the +affected OSDs marked ``down``. These OSDs reply to the monitors "I'm not dead +yet!", and the cycle repeats.
We call this scenario "flapping", and it can be +difficult to isolate and remediate. Without a private network, this irksome +dynamic is avoided: OSDs are generally either ``up`` or ``down`` without +flapping. + +If something does cause OSDs to "flap" (repeatedly being marked ``down`` and +then ``up`` again), you can force the monitors to halt the flapping by +temporarily freezing their states: + +.. prompt:: bash + + ceph osd set noup # prevent OSDs from getting marked up + ceph osd set nodown # prevent OSDs from getting marked down + +These flags are recorded in the osdmap: + +.. prompt:: bash + + ceph osd dump | grep flags + +:: + + flags no-up,no-down + +You can clear these flags with: + +.. prompt:: bash + + ceph osd unset noup + ceph osd unset nodown + +Two other flags are available, ``noin`` and ``noout``, which prevent booting +OSDs from being marked ``in`` (allocated data) or protect OSDs from eventually +being marked ``out`` (regardless of the current value of +``mon_osd_down_out_interval``). + +.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the sense that + after the flags are cleared, the action that they were blocking should be + possible shortly thereafter. But the ``noin`` flag prevents OSDs from being + marked ``in`` on boot, and any daemons that started while the flag was set + will remain that way. + +.. note:: The causes and effects of flapping can be mitigated somewhat by + making careful adjustments to ``mon_osd_down_out_subtree_limit``, + ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``. + Derivation of optimal settings depends on cluster size, topology, and the + Ceph release in use. The interaction of all of these factors is subtle and + is beyond the scope of this document. + + +.. _iostat: https://en.wikipedia.org/wiki/Iostat +.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging +.. _Logging and Debugging: ../log-and-debug +.. _Debugging and Logging: ../debug +..
_Monitor/OSD Interaction: ../../configuration/mon-osd-interaction +.. _Monitor Config Reference: ../../configuration/mon-config-ref +.. _monitoring your OSDs: ../../operations/monitoring-osd-pg + +.. _monitoring OSDs: ../../operations/monitoring-osd-pg/#monitoring-osds + +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _OS recommendations: ../../../start/os-recommendations +.. _ceph-devel: ceph-devel@vger.kernel.org diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst new file mode 100644 index 000000000..74d04bd9f --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -0,0 +1,782 @@ +==================== + Troubleshooting PGs +==================== + +Placement Groups Never Get Clean +================================ + +If, after you have created your cluster, any Placement Groups (PGs) remain in +the ``active`` status, the ``active+remapped`` status, or the +``active+degraded`` status and never achieve an ``active+clean`` status, you +likely have a problem with your configuration. + +In such a situation, it may be necessary to review the settings in the `Pool, +PG and CRUSH Config Reference`_ and make appropriate adjustments. + +As a general rule, run your cluster with more than one OSD and a pool size +greater than two object replicas. + +.. _one-node-cluster: + +One Node Cluster +---------------- + +Ceph no longer provides documentation for operating on a single node. Systems +designed for distributed computing by definition do not run on a single node.
+The mounting of client kernel modules on a single node that contains a Ceph +daemon may cause a deadlock due to issues with the Linux kernel itself (unless +VMs are used as clients). You can experiment with Ceph in a one-node +configuration, in spite of the limitations as described herein. + +To create a cluster on a single node, you must change the +``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning +``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration +file before you create your monitors and OSDs. This tells Ceph that an OSD is +permitted to place another OSD on the same host. If you are trying to set up a +single-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``, +Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on +another node, chassis, rack, row, or datacenter depending on the setting. + +.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph + Storage Cluster. Kernel conflicts can arise. However, you can mount kernel + clients within virtual machines (VMs) on a single node. + +If you are creating OSDs using a single disk, you must manually create +directories for the data first. + + +Fewer OSDs than Replicas +------------------------ + +If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not +in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set +to greater than ``2``. + +There are a few ways to address this situation. If you want to operate your +cluster in an ``active + degraded`` state with two replicas, you can set the +``osd_pool_default_min_size`` to ``2`` so that you can write objects in an +``active + degraded`` state. You may also set the ``osd_pool_default_size`` +setting to ``2`` so that you have only two stored replicas (the original and +one replica). In such a case, the cluster should achieve an ``active + clean`` +state. + +..
note:: You can make the changes while the cluster is running. If you make + the changes in your Ceph configuration file, you might need to restart your + cluster. + + +Pool Size = 1 +------------- + +If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy +of the object. OSDs rely on other OSDs to tell them which objects they should +have. If one OSD has a copy of an object and there is no second copy, then +there is no second OSD to tell the first OSD that it should have that copy. For +each placement group mapped to the first OSD (see ``ceph pg dump``), you can +force the first OSD to notice the placement groups it needs by running a +command of the following form: + +.. prompt:: bash + + ceph osd force-create-pg <pgid> + + +CRUSH Map Errors +---------------- + +If any placement groups in your cluster are unclean, then there might be errors +in your CRUSH map. + + +Stuck Placement Groups +====================== + +It is normal for placement groups to enter "degraded" or "peering" states after +a component failure. Normally, these states reflect the expected progression +through the failure recovery process. However, a placement group that stays in +one of these states for a long time might be an indication of a larger problem. +For this reason, the Ceph Monitors will warn when placement groups get "stuck" +in a non-optimal state. Specifically, we check for: + +* ``inactive`` - The placement group has not been ``active`` for too long (that + is, it hasn't been able to service read/write requests). + +* ``unclean`` - The placement group has not been ``clean`` for too long (that + is, it hasn't been able to completely recover from a previous failure). + +* ``stale`` - The placement group status has not been updated by a + ``ceph-osd``. This indicates that all nodes storing this placement group may + be ``down``. + +List stuck placement groups by running one of the following commands: + +.. 
prompt:: bash + + ceph pg dump_stuck stale + ceph pg dump_stuck inactive + ceph pg dump_stuck unclean + +- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd`` + daemons are not running. +- Stuck ``inactive`` placement groups usually indicate a peering problem (see + :ref:`failures-osd-peering`). +- Stuck ``unclean`` placement groups usually indicate that something is + preventing recovery from completing, possibly unfound objects (see + :ref:`failures-osd-unfound`); + + + +.. _failures-osd-peering: + +Placement Group Down - Peering Failure +====================================== + +In certain cases, the ``ceph-osd`` `peering` process can run into problems, +which can prevent a PG from becoming active and usable. In such a case, running +the command ``ceph health detail`` will report something similar to the following: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down + ... + pg 0.5 is down+peering + pg 1.4 is down+peering + ... + osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 + +Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form: + +.. prompt:: bash + + ceph pg 0.5 query + +.. code-block:: javascript + + { "state": "down+peering", + ... 
+ "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to down +``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that +particular ``ceph-osd`` and recovery will proceed. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if +there has been a disk failure), the cluster can be informed that the OSD is +``lost`` and the cluster can be instructed that it must cope as best it can. + +.. important:: Informing the cluster that an OSD has been lost is dangerous + because the cluster cannot guarantee that the other copies of the data are + consistent and up to date. + +To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery +anyway, run a command of the following form: + +.. prompt:: bash + + ceph osd lost 1 + +Recovery will proceed. + + +.. _failures-osd-unfound: + +Unfound Objects +=============== + +Under certain combinations of failures, Ceph may complain about ``unfound`` +objects, as in this example: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer copies of +existing objects) exist, but it hasn't found copies of them. 
Here is an
+example of how this might come about for a PG whose data is on two OSDs,
+which we will call "1" and "2":
+
+* 1 goes down
+* 2 handles some writes, alone
+* 1 comes up
+* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
+* Before the new objects are copied, 2 goes down.
+
+At this point, 1 knows that these objects exist, but there is no live
+``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
+will block, and the cluster will hope that the failed node comes back soon.
+This is assumed to be preferable to returning an IO error to the user.
+
+.. note:: The situation described immediately above is one reason that setting
+   ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
+   data loss.
+
+Identify which objects are unfound by running a command of the following form:
+
+.. prompt:: bash
+
+   ceph pg 2.4 list_unfound [starting offset, in json]
+
+.. code-block:: javascript
+
+   {
+       "num_missing": 1,
+       "num_unfound": 1,
+       "objects": [
+           {
+               "oid": {
+                   "oid": "object",
+                   "key": "",
+                   "snapid": -2,
+                   "hash": 2249616407,
+                   "max": 0,
+                   "pool": 2,
+                   "namespace": ""
+               },
+               "need": "43'251",
+               "have": "0'0",
+               "flags": "none",
+               "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
+               "locations": [
+                   "0(3)",
+                   "4(2)"
+               ]
+           }
+       ],
+       "state": "NotRecovering",
+       "available_might_have_unfound": true,
+       "might_have_unfound": [
+           {
+               "osd": "2(4)",
+               "status": "osd is down"
+           }
+       ],
+       "more": false
+   }
+
+If there are too many objects to list in a single result, the ``more`` field
+will be ``true`` and you can query for more. (Eventually the command line tool
+will hide this from you, but not yet.)
+
+Now you can identify which OSDs have been probed or might contain data.
+
+At the end of the listing (before ``more: false``), ``might_have_unfound`` is
+provided when ``available_might_have_unfound`` is true. This is equivalent to
+the output of ``ceph pg #.# query``, and it eliminates the need to use
+``query`` directly. The ``might_have_unfound`` information behaves the same
+way as the corresponding output of ``query``, which is described below. The
+only difference is that OSDs that have the status of ``already probed`` are
+ignored.
+
+Use of ``query``:
+
+.. prompt:: bash
+
+   ceph pg 2.4 query
+
+.. code-block:: javascript
+
+   "recovery_state": [
+       { "name": "Started\/Primary\/Active",
+         "enter_time": "2012-03-06 15:15:46.713212",
+         "might_have_unfound": [
+           { "osd": 1,
+             "status": "osd is down"}]},
+
+In this case, the cluster knows that ``osd.1`` might have data, but it is
+``down``. Here is the full range of possible states:
+
+* already probed
+* querying
+* OSD is down
+* not queried (yet)
+
+Sometimes it simply takes some time for the cluster to query possible
+locations.
+
+It is possible that there are other locations where the object might exist
+that are not listed. For example: if an OSD is stopped and taken out of the
+cluster and then the cluster fully recovers, and then through a subsequent set
+of failures the cluster ends up with an unfound object, the cluster will
+ignore the removed OSD. (This scenario, however, is unlikely.)
+
+If all possible locations have been queried and objects are still lost, you
+may have to give up on the lost objects. This, again, is possible only when
+unusual combinations of failures have occurred that allow the cluster to learn
+about writes that were performed before the writes themselves have been
+recovered. To mark the "unfound" objects as "lost", run a command of the
+following form:
+
+.. prompt:: bash
+
+   ceph pg 2.4 mark_unfound_lost revert|delete
+
+Here the final argument (``revert|delete``) specifies how the cluster should
+deal with lost objects.
+
+The ``delete`` option will cause the cluster to forget about them entirely.
+ +The ``revert`` option (which is not available for erasure coded pools) will +either roll back to a previous version of the object or (if it was a new +object) forget about the object entirely. Use ``revert`` with caution, as it +may confuse applications that expect the object to exist. + +Homeless Placement Groups +========================= + +It is possible that every OSD that has copies of a given placement group fails. +If this happens, then the subset of the object store that contains those +placement groups becomes unavailable and the monitor will receive no status +updates for those placement groups. The monitor marks as ``stale`` any +placement group whose primary OSD has failed. For example: + +.. prompt:: bash + + ceph health + +:: + + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + +Identify which placement groups are ``stale`` and which were the last OSDs to +store the ``stale`` placement groups by running the following command: + +.. prompt:: bash + + ceph health detail + +:: + + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + ... + pg 2.5 is stuck stale+active+remapped, last acting [2,0] + ... + osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 + osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 + osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 + +This output indicates that placement group 2.5 (``pg 2.5``) was last managed by +``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover +that placement group. + + +Only a Few OSDs Receive Data +============================ + +If only a few of the nodes in the cluster are receiving data, check the number +of placement groups in the pool as instructed in the :ref:`Placement Groups +<rados_ops_pgs_get_pg_num>` documentation. 
Since placement groups get mapped to
+OSDs by dividing the number of placement groups in the cluster by the number
+of OSDs in the cluster, a small number of placement groups (the remainder of
+that division) is sometimes not distributed across the cluster. In situations
+like this, create a pool with a placement group count that is a multiple of
+the number of OSDs. See `Placement Groups`_ for details. See the :ref:`Pool,
+PG, and CRUSH Config Reference <rados_config_pool_pg_crush_ref>` for
+instructions on changing the default values used to determine how many
+placement groups are assigned to each pool.
+
+
+Can't Write Data
+================
+
+If the cluster is up, but some OSDs are down and you cannot write data, make
+sure that you have the minimum number of OSDs running in the pool. If you
+don't have the minimum number of OSDs running in the pool, Ceph will not allow
+you to write data to it because there is no guarantee that Ceph can replicate
+your data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
+Config Reference <rados_config_pool_pg_crush_ref>` for details.
+
+
+PGs Inconsistent
+================
+
+If the command ``ceph health detail`` returns an ``active + clean +
+inconsistent`` state, this might indicate an error during scrubbing. Identify
+the inconsistent placement group or placement groups by running the following
+command:
+
+.. prompt:: bash
+
+   ceph health detail
+
+::
+
+   HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
+   pg 0.6 is active+clean+inconsistent, acting [0,1,2]
+   2 scrub errors
+
+Alternatively, run this command if you prefer to inspect the output in a
+programmatic way:
+
+.. prompt:: bash
+
+   rados list-inconsistent-pg rbd
+
+::
+
+   ["0.6"]
+
+There is only one ``inconsistent`` state, but in the worst case there could
+be multiple inconsistent objects, each inconsistent in a different way.
If an object named ``foo`` in PG ``0.6`` is truncated, the output of
+``rados list-inconsistent-obj`` will look something like this:
+
+.. prompt:: bash
+
+   rados list-inconsistent-obj 0.6 --format=json-pretty
+
+.. code-block:: javascript
+
+   {
+       "epoch": 14,
+       "inconsistents": [
+           {
+               "object": {
+                   "name": "foo",
+                   "nspace": "",
+                   "locator": "",
+                   "snap": "head",
+                   "version": 1
+               },
+               "errors": [
+                   "data_digest_mismatch",
+                   "size_mismatch"
+               ],
+               "union_shard_errors": [
+                   "data_digest_mismatch_info",
+                   "size_mismatch_info"
+               ],
+               "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
+               "shards": [
+                   {
+                       "osd": 0,
+                       "errors": [],
+                       "size": 968,
+                       "omap_digest": "0xffffffff",
+                       "data_digest": "0xe978e67f"
+                   },
+                   {
+                       "osd": 1,
+                       "errors": [],
+                       "size": 968,
+                       "omap_digest": "0xffffffff",
+                       "data_digest": "0xe978e67f"
+                   },
+                   {
+                       "osd": 2,
+                       "errors": [
+                           "data_digest_mismatch_info",
+                           "size_mismatch_info"
+                       ],
+                       "size": 0,
+                       "omap_digest": "0xffffffff",
+                       "data_digest": "0xffffffff"
+                   }
+               ]
+           }
+       ]
+   }
+
+In this case, the output indicates the following:
+
+* The only inconsistent object is named ``foo``, and its head has
+  inconsistencies.
+* The inconsistencies fall into two categories:
+
+  #. ``errors``: these errors indicate inconsistencies between shards, without
+     an indication of which shard(s) are bad. Check for the ``errors`` in the
+     ``shards`` array, if available, to pinpoint the problem.
+
+     * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
+       is different from the digests of the replica reads of ``OSD.0`` and
+       ``OSD.1``.
+     * ``size_mismatch``: the size of the replica read from ``OSD.2`` is
+       ``0``, but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.
+
+  #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
+     ``shards`` array. The ``errors`` are set for the shard with the problem.
+     These errors include ``read_error`` and other similar errors. The
+     ``errors`` ending in ``_info`` indicate a comparison with
+     ``selected_object_info``. Examine the ``shards`` array to determine
+     which shard has which error or errors.
+
+     * ``data_digest_mismatch_info``: the digest ``0xffffffff`` calculated
+       from the shard read from ``OSD.2`` differs from the digest
+       ``e978e67f`` stored in the object info.
+     * ``size_mismatch_info``: the size stored in the object info is
+       different from the size read from ``OSD.2``. The latter is ``0``.
+
+.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute,
+   the inconsistency is likely due to physical storage errors. In cases like
+   this, check the storage used by that OSD.
+
+   Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
+   repair.
+
+To repair the inconsistent placement group, run a command of the following
+form:
+
+.. prompt:: bash
+
+   ceph pg repair {placement-group-ID}
+
+.. warning:: This command overwrites the "bad" copies with "authoritative"
+   copies. In most cases, Ceph is able to choose authoritative copies from all
+   the available replicas by using some predefined criteria. This, however,
+   does not work in every case. For example, it might be the case that the
+   stored data digest is missing, which means that the calculated digest is
+   ignored when Ceph chooses the authoritative copies. Be aware of this, and
+   use the above command with caution.
+
+If you receive ``active + clean + inconsistent`` states periodically due to
+clock skew, consider configuring the `NTP
+<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your
+monitor hosts to act as peers. See `The Network Time Protocol
+<http://www.ntp.org>`_ and Ceph :ref:`Clock Settings <mon-config-ref-clock>`
+for more information.
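When many objects are inconsistent, it can help to reduce the JSON report to just the shards that carry per-shard errors. The following is a minimal Python sketch (not part of Ceph; the embedded sample is an abridged version of the ``rados list-inconsistent-obj`` report shown above):

```python
import json

# Abridged sample in the format produced by:
#   rados list-inconsistent-obj <pgid> --format=json-pretty
report = json.loads("""
{
  "epoch": 14,
  "inconsistents": [
    {
      "object": {"name": "foo", "nspace": "", "snap": "head", "version": 1},
      "errors": ["data_digest_mismatch", "size_mismatch"],
      "union_shard_errors": ["data_digest_mismatch_info", "size_mismatch_info"],
      "shards": [
        {"osd": 0, "errors": [], "size": 968},
        {"osd": 1, "errors": [], "size": 968},
        {"osd": 2, "errors": ["data_digest_mismatch_info", "size_mismatch_info"], "size": 0}
      ]
    }
  ]
}
""")

def bad_shards(report):
    """Map each inconsistent object name to the OSDs whose shards carry
    per-shard errors (these are the shards that pinpoint the problem)."""
    return {
        item["object"]["name"]: [s["osd"] for s in item["shards"] if s["errors"]]
        for item in report["inconsistents"]
    }

print(bad_shards(report))  # -> {'foo': [2]}
```

In this sample only the shard on OSD ``2`` reports errors, which matches the ``union_shard_errors`` analysis above: the other shards agree with each other and with the selected object info.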
+
+
+Erasure Coded PGs are not active+clean
+======================================
+
+If CRUSH fails to find enough OSDs to map to a PG, it will show as
+``2147483647``, which is ``ITEM_NONE`` or ``no OSD found``. For example::
+
+   [2,1,6,0,5,8,2147483647,7,4]
+
+Not enough OSDs
+---------------
+
+If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
+OSDs, the cluster will show "Not enough OSDs". In this case, either create
+another erasure coded pool that requires fewer OSDs, by running commands of
+the following form:
+
+.. prompt:: bash
+
+   ceph osd erasure-code-profile set myprofile k=5 m=3
+   ceph osd pool create erasurepool erasure myprofile
+
+or add new OSDs, and the PG will automatically use them.
+
+CRUSH constraints cannot be satisfied
+-------------------------------------
+
+If the cluster has enough OSDs, it is possible that the CRUSH rule is
+imposing constraints that cannot be satisfied. If there are ten OSDs on two
+hosts and the CRUSH rule requires that no two OSDs from the same host are
+used in the same PG, the mapping may fail because only two OSDs will be
+found. Check the constraint by displaying ("dumping") the rule, as shown
+here:
+
+.. prompt:: bash
+
+   ceph osd crush rule ls
+
+::
+
+   [
+       "replicated_rule",
+       "erasurepool"]
+   $ ceph osd crush rule dump erasurepool
+   { "rule_id": 1,
+     "rule_name": "erasurepool",
+     "type": 3,
+     "steps": [
+       { "op": "take",
+         "item": -1,
+         "item_name": "default"},
+       { "op": "chooseleaf_indep",
+         "num": 0,
+         "type": "host"},
+       { "op": "emit"}]}
+
+Resolve this problem by creating a new pool in which PGs are allowed to have
+OSDs residing on the same host by running the following commands:
+
+.. 
prompt:: bash + + ceph osd erasure-code-profile set myprofile crush-failure-domain=osd + ceph osd pool create erasurepool erasure myprofile + +CRUSH gives up too soon +----------------------- + +If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster +with a total of nine OSDs and an erasure coded pool that requires nine OSDs per +PG), it is possible that CRUSH gives up before finding a mapping. This problem +can be resolved by: + +* lowering the erasure coded pool requirements to use fewer OSDs per PG (this + requires the creation of another pool, because erasure code profiles cannot + be modified dynamically). + +* adding more OSDs to the cluster (this does not require the erasure coded pool + to be modified, because it will become clean automatically) + +* using a handmade CRUSH rule that tries more times to find a good mapping. + This can be modified for an existing CRUSH rule by setting + ``set_choose_tries`` to a value greater than the default. + +First, verify the problem by using ``crushtool`` after extracting the crushmap +from the cluster. This ensures that your experiments do not modify the Ceph +cluster and that they operate only on local files: + +.. 
prompt:: bash + + ceph osd crush rule dump erasurepool + +:: + + { "rule_id": 1, + "rule_name": "erasurepool", + "type": 3, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + $ ceph osd getcrushmap > crush.map + got crush map from osdmap epoch 13 + $ crushtool -i crush.map --test --show-bad-mappings \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] + bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] + bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] + +Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule +needs, ``--rule`` is the value of the ``rule_id`` field that was displayed by +``ceph osd crush rule dump``. This test will attempt to map one million values +(in this example, the range defined by ``[--min-x,--max-x]``) and must display +at least one bad mapping. If this test outputs nothing, all mappings have been +successful and you can be assured that the problem with your cluster is not +caused by bad mappings. + +Changing the value of set_choose_tries +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +#. Decompile the CRUSH map to edit the CRUSH rule by running the following + command: + + .. prompt:: bash + + crushtool --decompile crush.map > crush.txt + +#. Add the following line to the rule:: + + step set_choose_tries 100 + + The relevant part of the ``crush.txt`` file will resemble this:: + + rule erasurepool { + id 1 + type erasure + step set_chooseleaf_tries 5 + step set_choose_tries 100 + step take default + step chooseleaf indep 0 type host + step emit + } + +#. Recompile and retest the CRUSH rule: + + .. prompt:: bash + + crushtool --compile crush.txt -o better-crush.map + +#. 
When all mappings succeed, display a histogram of the number of tries that
+   were necessary to find all of the mappings by using the
+   ``--show-choose-tries`` option of the ``crushtool`` command, as in the
+   following example:
+
+   .. prompt:: bash
+
+      crushtool -i better-crush.map --test --show-bad-mappings \
+         --show-choose-tries \
+         --rule 1 \
+         --num-rep 9 \
+         --min-x 1 --max-x $((1024 * 1024))
+
+   ::
+
+      ...
+      11: 42
+      12: 44
+      13: 54
+      14: 45
+      15: 35
+      16: 34
+      17: 30
+      18: 25
+      19: 19
+      20: 22
+      21: 20
+      22: 17
+      23: 13
+      24: 16
+      25: 13
+      26: 11
+      27: 11
+      28: 13
+      29: 11
+      30: 10
+      31: 6
+      32: 5
+      33: 10
+      34: 3
+      35: 7
+      36: 5
+      37: 2
+      38: 5
+      39: 5
+      40: 2
+      41: 5
+      42: 4
+      43: 1
+      44: 2
+      45: 2
+      46: 3
+      47: 1
+      48: 0
+      ...
+      102: 0
+      103: 1
+      104: 0
+      ...
+
+   This output indicates that it took eleven tries to map forty-two PGs,
+   twelve tries to map forty-four PGs, and so on. The highest number of tries
+   is the minimum value of ``set_choose_tries`` that prevents bad mappings
+   (for example, ``103`` in the above output, because it did not take more
+   than 103 tries for any PG to be mapped).
+
+.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
+.. _Placement Groups: ../../operations/placement-groups
+.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref