diff options
Diffstat (limited to 'doc/rados/troubleshooting')
-rw-r--r-- | doc/rados/troubleshooting/community.rst | 29 | ||||
-rw-r--r-- | doc/rados/troubleshooting/cpu-profiling.rst | 67 | ||||
-rw-r--r-- | doc/rados/troubleshooting/index.rst | 19 | ||||
-rw-r--r-- | doc/rados/troubleshooting/log-and-debug.rst | 591 | ||||
-rw-r--r-- | doc/rados/troubleshooting/memory-profiling.rst | 142 | ||||
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-mon.rst | 582 | ||||
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-osd.rst | 546 | ||||
-rw-r--r-- | doc/rados/troubleshooting/troubleshooting-pg.rst | 666 |
8 files changed, 2642 insertions, 0 deletions
diff --git a/doc/rados/troubleshooting/community.rst b/doc/rados/troubleshooting/community.rst new file mode 100644 index 00000000..9faad131 --- /dev/null +++ b/doc/rados/troubleshooting/community.rst @@ -0,0 +1,29 @@ +==================== + The Ceph Community +==================== + +The Ceph community is an excellent source of information and help. For +operational issues with Ceph releases we recommend you `subscribe to the +ceph-users email list`_. When you no longer want to receive emails, you can +`unsubscribe from the ceph-users email list`_. + +You may also `subscribe to the ceph-devel email list`_. You should do so if +your issue is: + +- Likely related to a bug +- Related to a development release package +- Related to a development testing package +- Related to your own builds + +If you no longer want to receive emails from the ``ceph-devel`` email list, you +may `unsubscribe from the ceph-devel email list`_. + +.. tip:: The Ceph community is growing rapidly, and community members can help + you if you provide them with detailed information about your problem. You + can attach the output of the ``ceph report`` command to help people understand your issues. + +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file diff --git a/doc/rados/troubleshooting/cpu-profiling.rst b/doc/rados/troubleshooting/cpu-profiling.rst new file mode 100644 index 00000000..159f7998 --- /dev/null +++ b/doc/rados/troubleshooting/cpu-profiling.rst @@ -0,0 +1,67 @@ +=============== + CPU Profiling +=============== + +If you built Ceph from source and compiled Ceph for use with `oprofile`_ +you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. + + +Initializing oprofile +===================== + +The first time you use ``oprofile`` you need to initialize it. Locate the +``vmlinux`` image corresponding to the kernel you are now running. :: + + ls /boot + sudo opcontrol --init + sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 + + +Starting oprofile +================= + +To start ``oprofile`` execute the following command:: + + opcontrol --start + +Once you start ``oprofile``, you may run some tests with Ceph. + + +Stopping oprofile +================= + +To stop ``oprofile`` execute the following command:: + + opcontrol --stop + + +Retrieving oprofile Results +=========================== + +To retrieve the top ``cmon`` results, execute the following command:: + + opreport -gal ./cmon | less + + +To retrieve the top ``cmon`` results with call graphs attached, execute the +following command:: + + opreport -cal ./cmon | less + +.. important:: After reviewing results, you should reset ``oprofile`` before + running it again. Resetting ``oprofile`` removes data from the session + directory. + + +Resetting oprofile +================== + +To reset ``oprofile``, execute the following command:: + + sudo opcontrol --reset + +.. important:: You should reset ``oprofile`` after analyzing data so that + you do not commingle results from different tests. + +.. _oprofile: http://oprofile.sourceforge.net/about/ +.. _Installing Oprofile: ../../../dev/cpu-profiler diff --git a/doc/rados/troubleshooting/index.rst b/doc/rados/troubleshooting/index.rst new file mode 100644 index 00000000..80d14f3c --- /dev/null +++ b/doc/rados/troubleshooting/index.rst @@ -0,0 +1,19 @@ +================= + Troubleshooting +================= + +Ceph is still on the leading edge, so you may encounter situations that require +you to examine your configuration, modify your logging output, troubleshoot +monitors and OSDs, profile memory and CPU usage, and reach out to the +Ceph community for help. + +.. toctree:: + :maxdepth: 1 + + community + log-and-debug + troubleshooting-mon + troubleshooting-osd + troubleshooting-pg + memory-profiling + cpu-profiling diff --git a/doc/rados/troubleshooting/log-and-debug.rst b/doc/rados/troubleshooting/log-and-debug.rst new file mode 100644 index 00000000..b923cf14 --- /dev/null +++ b/doc/rados/troubleshooting/log-and-debug.rst @@ -0,0 +1,591 @@ +======================= + Logging and Debugging +======================= + +Typically, when you add debugging to your Ceph configuration, you do so at +runtime. You can also add Ceph debug logging to your Ceph configuration file if +you are encountering issues when starting your cluster. You may view Ceph log +files under ``/var/log/ceph`` (the default location). + +.. tip:: When debug output slows down your system, the latency can hide + race conditions. + +Logging is resource intensive. If you are encountering a problem in a specific +area of your cluster, enable logging for that area of the cluster. For example, +if your OSDs are running fine, but your metadata servers are not, you should +start by enabling debug logging for the specific metadata server instance(s) +giving you trouble. Enable logging for each subsystem as needed. + +.. important:: Verbose logging can generate over 1GB of data per hour. If your + OS disk reaches its capacity, the node will stop working. + +If you enable or increase the rate of Ceph logging, ensure that you have +sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for +details on rotating log files. When your system is running well, remove +unnecessary debugging settings to ensure your cluster runs optimally. Logging +debug output messages is relatively slow, and a waste of resources when +operating your cluster. + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + +Runtime +======= + +If you would like to see the configuration settings at runtime, you must log +in to a host with a running daemon and execute the following:: + + ceph daemon {daemon-name} config show | less + +For example,:: + + ceph daemon osd.0 config show | less + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the +``ceph tell`` command to inject arguments into the runtime configuration:: + + ceph tell {daemon-type}.{daemon id or *} config set {name} {value} + +Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply +the runtime setting to all daemons of a particular type with ``*``, or specify +a specific daemon's ID. For example, to increase +debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: + + ceph tell osd.0 config set debug_osd 0/5 + +The ``ceph tell`` command goes through the monitors. If you cannot bind to the +monitor, you can still make the change by logging into the host of the daemon +whose configuration you'd like to change using ``ceph daemon``. +For example:: + + sudo ceph daemon osd.0 config set debug_osd 0/5 + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + + +Boot Time +========= + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must +add settings to your Ceph configuration file. Subsystems common to each daemon +may be set under ``[global]`` in your configuration file. Subsystems for +particular daemons are set under the daemon section in your configuration file +(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example:: + + [global] + debug ms = 1/5 + + [mon] + debug mon = 20 + debug paxos = 1/5 + debug auth = 2 + + [osd] + debug osd = 1/5 + debug filestore = 1/5 + debug journal = 1 + debug monc = 5/20 + + [mds] + debug mds = 1 + debug mds balancer = 1 + + +See `Subsystem, Log and Debug Settings`_ for details. + + +Accelerating Log Rotation +========================= + +If your OS disk is relatively full, you can accelerate log rotation by modifying +the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting +after the rotation frequency to accelerate log rotation (via cronjob) if your +logs exceed the size setting. For example, the default setting looks like +this:: + + rotate 7 + weekly + compress + sharedscripts + +Modify it by adding a ``size`` setting. :: + + rotate 7 + weekly + size 500M + compress + sharedscripts + +Then, start the crontab editor for your user space. :: + + crontab -e + +Finally, add an entry to check the ``etc/logrotate.d/ceph`` file. :: + + 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 + +The preceding example checks the ``etc/logrotate.d/ceph`` file every 30 minutes. + + +Valgrind +======== + +Debugging may also require you to track down memory and threading issues. +You can run a single daemon, a type of daemon, or the whole cluster with +Valgrind. You should only use Valgrind when developing or debugging Ceph. +Valgrind is computationally expensive, and will slow down your system otherwise. +Valgrind messages are logged to ``stderr``. + + +Subsystem, Log and Debug Settings +================================= + +In most cases, you will enable debug logging output via subsystems. + +Ceph Subsystems +--------------- + +Each subsystem has a logging level for its output logs, and for its logs +in-memory. You may set different values for each of these subsystems by setting +a log file level and a memory level for debug logging. Ceph's logging levels +operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is +verbose [#]_ . In general, the logs in-memory are not sent to the output log unless: + +- a fatal signal is raised or +- an ``assert`` in source code is triggered or +- upon requested. Please consult `document on admin socket <http://docs.ceph.com/docs/master/man/8/ceph/#daemon>`_ for more details. + +A debug logging setting can take a single value for the log level and the +memory level, which sets them both as the same value. For example, if you +specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level +of ``5``. You may also specify them separately. The first setting is the log +level, and the second setting is the memory level. You must separate them with +a forward slash (/). For example, if you want to set the ``ms`` subsystem's +debug logging level to ``1`` and its memory level to ``5``, you would specify it +as ``debug ms = 1/5``. For example: + + + +.. code-block:: ini + + debug {subsystem} = {log-level}/{memory-level} + #for example + debug mds balancer = 1/20 + + +The following table provides a list of Ceph subsystems and their default log and +memory levels. Once you complete your logging efforts, restore the subsystems +to their default level or to a level suitable for normal operations. + + ++--------------------+-----------+--------------+ +| Subsystem | Log Level | Memory Level | ++====================+===========+==============+ +| ``default`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``lockdep`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``context`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``crush`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``mds`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds balancer`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds locker`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log expire`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds migrator`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``buffer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``timer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``filer`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``striper`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``objecter`` | 0 | 1 | ++--------------------+-----------+--------------+ +| ``rados`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd mirror`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd replay`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``journaler`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objectcacher`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``client`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``osd`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``optracker`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objclass`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``filestore`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``journal`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``ms`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``mon`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``monc`` | 0 | 10 | ++--------------------+-----------+--------------+ +| ``paxos`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``tp`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``auth`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``crypto`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``finisher`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``reserver`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``heartbeatmap`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``perfcounter`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw sync`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``civetweb`` | 1 | 10 | ++--------------------+-----------+--------------+ +| ``javaclient`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``asok`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``throttle`` | 1 | 1 | ++--------------------+-----------+--------------+ +| ``refs`` | 0 | 0 | ++--------------------+-----------+--------------+ +| ``compressor`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bluestore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bluefs`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``bdev`` | 1 | 3 | ++--------------------+-----------+--------------+ +| ``kstore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rocksdb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``leveldb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``memdb`` | 4 | 5 | ++--------------------+-----------+--------------+ +| ``fuse`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mgr`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mgrc`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``dpdk`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``eventtrace`` | 1 | 5 | ++--------------------+-----------+--------------+ + + +Logging Settings +---------------- + +Logging and debugging settings are not required in a Ceph configuration file, +but you may override default settings as needed. Ceph supports the following +settings: + + +``log file`` + +:Description: The location of the logging file for your cluster. +:Type: String +:Required: No +:Default: ``/var/log/ceph/$cluster-$name.log`` + + +``log max new`` + +:Description: The maximum number of new log files. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``log max recent`` + +:Description: The maximum number of recent events to include in a log file. +:Type: Integer +:Required: No +:Default: ``10000`` + + +``log to stderr`` + +:Description: Determines if logging messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``err to stderr`` + +:Description: Determines if error messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to syslog`` + +:Description: Determines if logging messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to syslog`` + +:Description: Determines if error messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``log flush on exit`` + +:Description: Determines if Ceph should flush the log files after exit. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to monitors`` + +:Description: Determines if ``clog`` messages should be sent to monitors. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to syslog`` + +:Description: Determines if ``clog`` messages should be sent to syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log to syslog`` + +:Description: Determines if the cluster log should be output to the syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log file`` + +:Description: The locations of the cluster's log files. There are two channels in + Ceph: ``cluster`` and ``audit``. This option represents a mapping + from channels to log files, where the log entries of that + channel are sent to. The ``default`` entry is a fallback + mapping for channels not explicitly specified. So, the following + default setting will send cluster log to ``$cluster.log``, and + send audit log to ``$cluster.audit.log``, where ``$cluster`` will + be replaced with the actual cluster name. +:Type: String +:Required: No +:Default: ``default=/var/log/ceph/$cluster.$channel.log,cluster=/var/log/ceph/$cluster.log`` + + + +OSD +--- + + +``osd debug drop ping probability`` + +:Description: ? +:Type: Double +:Required: No +:Default: 0 + + +``osd debug drop ping duration`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create probability`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create duration`` + +:Description: ? +:Type: Double +:Required: No +:Default: 1 + + +``osd min pg log entries`` + +:Description: The minimum number of log entries for placement groups. +:Type: 32-bit Unsigned Integer +:Required: No +:Default: 1000 + + +``osd op log threshold`` + +:Description: How many op log messages to show up in one pass. +:Type: Integer +:Required: No +:Default: 5 + + + +Filestore +--------- + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. This is an expensive operation. +:Type: Boolean +:Required: No +:Default: ``false`` + + +MDS +--- + + +``mds debug scatterstat`` + +:Description: Ceph will assert that various recursive stat invariants are true + (for developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug frag`` + +:Description: Ceph will verify directory fragmentation invariants when + convenient (developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug auth pins`` + +:Description: The debug auth pin invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug subtrees`` + +:Description: The debug subtree invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + + +RADOS Gateway +------------- + + +``rgw log nonexistent bucket`` + +:Description: Should we log a non-existent buckets? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw log object name`` + +:Description: Should an object's name be logged. // man date to see codes (a subset are supported) +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%H-%i-%n`` + + +``rgw log object name utc`` + +:Description: Object log name contains UTC? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw enable ops log`` + +:Description: Enables logging of every RGW operation. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``rgw enable usage log`` + +:Description: Enable logging of RGW's bandwidth usage. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw usage log flush threshold`` + +:Description: Threshold to flush pending log data. +:Type: Integer +:Required: No +:Default: ``1024`` + + +``rgw usage log tick interval`` + +:Description: Flush pending log data every ``s`` seconds. +:Type: Integer +:Required: No +:Default: 30 + + +``rgw intent log object name`` + +:Description: +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%i-%n`` + + +``rgw intent log object name utc`` + +:Description: Include a UTC timestamp in the intent log object name. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. [#] there are levels >20 in some rare cases and that they are extremely verbose. diff --git a/doc/rados/troubleshooting/memory-profiling.rst b/doc/rados/troubleshooting/memory-profiling.rst new file mode 100644 index 00000000..e2396e2f --- /dev/null +++ b/doc/rados/troubleshooting/memory-profiling.rst @@ -0,0 +1,142 @@ +================== + Memory Profiling +================== + +Ceph MON, OSD and MDS can generate heap profiles using +``tcmalloc``. To generate heap profiles, ensure you have +``google-perftools`` installed:: + + sudo apt-get install google-perftools + +The profiler dumps output to your ``log file`` directory (i.e., +``/var/log/ceph``). See `Logging and Debugging`_ for details. +To view the profiler logs with Google's performance tools, execute the +following:: + + google-pprof --text {path-to-daemon} {log-path/filename} + +For example:: + + $ ceph tell osd.0 heap start_profiler + $ ceph tell osd.0 heap dump + osd.0 tcmalloc heap stats:------------------------------------------------ + MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application + MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist + MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist + MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist + MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists + MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) + MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used + MALLOC: + MALLOC: 231 Spans in use + MALLOC: 56 Thread heaps in use + MALLOC: 8192 Tcmalloc page size + ------------------------------------------------ + Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). + Bytes released to the OS take up virtual address space but no physical memory. + $ google-pprof --text \ + /usr/bin/ceph-osd \ + /var/log/ceph/ceph-osd.0.profile.0001.heap + Total: 3.7 MB + 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry + 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create + 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe + 0.0 0.4% 99.2% 0.0 0.6% decode_message + ... + +Another heap dump on the same daemon will add another file. It is +convenient to compare to a previous heap dump to show what has grown +in the interval. For instance:: + + $ google-pprof --text --base out/osd.0.profile.0001.heap \ + ceph-osd out/osd.0.profile.0003.heap + Total: 0.2 MB + 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry + 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create + 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op + 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate + +Refer to `Google Heap Profiler`_ for additional details. + +Once you have the heap profiler installed, start your cluster and +begin using the heap profiler. You may enable or disable the heap +profiler at runtime, or ensure that it runs continuously. For the +following commandline usage, replace ``{daemon-type}`` with ``mon``, +``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or +the MON or MDS id. + + +Starting the Profiler +--------------------- + +To start the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap start_profiler + +For example:: + + ceph tell osd.1 heap start_profiler + +Alternatively the profile can be started when the daemon starts +running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in +the environment. + +Printing Stats +-------------- + +To print out statistics, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stats + +For example:: + + ceph tell osd.0 heap stats + +.. note:: Printing stats does not require the profiler to be running and does + not dump the heap allocation information to a file. + + +Dumping Heap Information +------------------------ + +To dump heap information, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap dump + +For example:: + + ceph tell mds.a heap dump + +.. note:: Dumping heap information only works when the profiler is running. + + +Releasing Memory +---------------- + +To release memory that ``tcmalloc`` has allocated but which is not being used by +the Ceph daemon itself, execute the following:: + + ceph tell {daemon-type}{daemon-id} heap release + +For example:: + + ceph tell osd.2 heap release + + +Stopping the Profiler +--------------------- + +To stop the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stop_profiler + +For example:: + + ceph tell osd.0 heap stop_profiler + +.. _Logging and Debugging: ../log-and-debug +.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst new file mode 100644 index 00000000..38cc4966 --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -0,0 +1,582 @@ +================================= + Troubleshooting Monitors +================================= + +.. index:: monitor, high availability + +When a cluster encounters monitor-related troubles there's a tendency to +panic, and some times with good reason. You should keep in mind that losing +a monitor, or a bunch of them, don't necessarily mean that your cluster is +down, as long as a majority is up, running and with a formed quorum. +Regardless of how bad the situation is, the first thing you should do is to +calm down, take a breath and try answering our initial troubleshooting script. + + +Initial Troubleshooting +======================== + + +**Are the monitors running?** + + First of all, we need to make sure the monitors are running. You would be + amazed by how often people forget to run the monitors, or restart them after + an upgrade. There's no shame in that, but let's try not losing a couple of + hours chasing an issue that is not there. + +**Are you able to connect to the monitor's servers?** + + Doesn't happen often, but sometimes people do have ``iptables`` rules that + block accesses to monitor servers or monitor ports. Usually leftovers from + monitor stress-testing that were forgotten at some point. Try ssh'ing into + the server and, if that succeeds, try connecting to the monitor's port + using you tool of choice (telnet, nc,...). + +**Does ceph -s run and obtain a reply from the cluster?** + + If the answer is yes then your cluster is up and running. One thing you + can take for granted is that the monitors will only answer to a ``status`` + request if there is a formed quorum. + + If ``ceph -s`` blocked however, without obtaining a reply from the cluster + or showing a lot of ``fault`` messages, then it is likely that your monitors + are either down completely or just a portion is up -- a portion that is not + enough to form a quorum (keep in mind that a quorum if formed by a majority + of monitors). + +**What if ceph -s doesn't finish?** + + If you haven't gone through all the steps so far, please go back and do. + + For those running on Emperor 0.72-rc1 and forward, you will be able to + contact each monitor individually asking them for their status, regardless + of a quorum being formed. This can be achieved using ``ceph ping mon.ID``, + ID being the monitor's identifier. You should perform this for each monitor + in the cluster. In section `Understanding mon_status`_ we will explain how + to interpret the output of this command. + + For the rest of you who don't tread on the bleeding edge, you will need to + ssh into the server and use the monitor's admin socket. Please jump to + `Using the monitor's admin socket`_. + +For other specific issues, keep on reading. + + +Using the monitor's admin socket +================================= + +The admin socket allows you to interact with a given daemon directly using a +Unix socket file. This file can be found in your monitor's ``run`` directory. +By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok`` +but this can vary if you defined it otherwise. If you don't find it there, +please check your ``ceph.conf`` for an alternative path or run:: + + ceph-conf --name mon.ID --show-config-value admin_socket + +Please bear in mind that the admin socket will only be available while the +monitor is running. When the monitor is properly shutdown, the admin socket +will be removed. If however the monitor is not running and the admin socket +still persists, it is likely that the monitor was improperly shutdown. +Regardless, if the monitor is not running, you will not be able to use the +admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``. + +Accessing the admin socket is as simple as telling the ``ceph`` tool to use +the ``asok`` file. In pre-Dumpling Ceph, this can be achieved by:: + + ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok <command> + +while in Dumpling and beyond you can use the alternate (and recommended) +format:: + + ceph daemon mon.<id> <command> + +Using ``help`` as the command to the ``ceph`` tool will show you the +supported commands available through the admin socket. Please take a look +at ``config get``, ``config show``, ``mon_status`` and ``quorum_status``, +as those can be enlightening when troubleshooting a monitor. + + +Understanding mon_status +========================= + +``mon_status`` can be obtained through the ``ceph`` tool when you have +a formed quorum, or via the admin socket if you don't. This command will +output a multitude of information about the monitor, including the same +output you would get with ``quorum_status``. + +Take the following example of ``mon_status``:: + + + { "name": "c", + "rank": 2, + "state": "peon", + "election_epoch": 38, + "quorum": [ + 1, + 2], + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { "epoch": 3, + "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", + "modified": "2013-10-30 04:12:01.945629", + "created": "2013-10-29 14:14:41.914786", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6795\/0"}]}} + +A couple of things are obvious: we have three monitors in the monmap (*a*, *b* +and *c*), the quorum is formed by only two monitors, and *c* is in the quorum +as a *peon*. + +Which monitor is out of the quorum? + + The answer would be **a**. + +Why? + + Take a look at the ``quorum`` set. We have two monitors in this set: *1* + and *2*. These are not monitor names. These are monitor ranks, as established + in the current monmap. We are missing the monitor with rank 0, and according + to the monmap that would be ``mon.a``. + +By the way, how are ranks established? + + Ranks are (re)calculated whenever you add or remove monitors and follow a + simple rule: the **greater** the ``IP:PORT`` combination, the **lower** the + rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all + the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0. + +Most Common Monitor Issues +=========================== + +Have Quorum but at least one Monitor is down +--------------------------------------------- + +When this happens, depending on the version of Ceph you are running, +you should be seeing something similar to:: + + $ ceph health detail + [snip] + mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) + +How to troubleshoot this? + + First, make sure ``mon.a`` is running. + + Second, make sure you are able to connect to ``mon.a``'s server from the + other monitors' servers. Check the ports as well. Check ``iptables`` on + all your monitor nodes and make sure you are not dropping/rejecting + connections. + + If this initial troubleshooting doesn't solve your problems, then it's + time to go deeper. + + First, check the problematic monitor's ``mon_status`` via the admin + socket as explained in `Using the monitor's admin socket`_ and + `Understanding mon_status`_. + + Considering the monitor is out of the quorum, its state should be one of + ``probing``, ``electing`` or ``synchronizing``. If it happens to be either + ``leader`` or ``peon``, then the monitor believes to be in quorum, while + the remaining cluster is sure it is not; or maybe it got into the quorum + while we were troubleshooting the monitor, so check you ``ceph -s`` again + just to make sure. Proceed if the monitor is not yet in the quorum. + +What if the state is ``probing``? + + This means the monitor is still looking for the other monitors. Every time + you start a monitor, the monitor will stay in this state for some time + while trying to find the rest of the monitors specified in the ``monmap``. + The time a monitor will spend in this state can vary. For instance, when on + a single-monitor cluster, the monitor will pass through the probing state + almost instantaneously, since there are no other monitors around. On a + multi-monitor cluster, the monitors will stay in this state until they + find enough monitors to form a quorum -- this means that if you have 2 out + of 3 monitors down, the one remaining monitor will stay in this state + indefinitely until you bring one of the other monitors up. + + If you have a quorum, however, the monitor should be able to find the + remaining monitors pretty fast, as long as they can be reached. If your + monitor is stuck probing and you have gone through with all the communication + troubleshooting, then there is a fair chance that the monitor is trying + to reach the other monitors on a wrong address. ``mon_status`` outputs the + ``monmap`` known to the monitor: check if the other monitor's locations + match reality. If they don't, jump to + `Recovering a Monitor's Broken monmap`_; if they do, then it may be related + to severe clock skews amongst the monitor nodes and you should refer to + `Clock Skews`_ first, but if that doesn't solve your problem then it is + the time to prepare some logs and reach out to the community (please refer + to `Preparing your logs`_ on how to best prepare your logs). + + +What if state is ``electing``? + + This means the monitor is in the middle of an election. These should be + fast to complete, but at times the monitors can get stuck electing. This + is usually a sign of a clock skew among the monitor nodes; jump to + `Clock Skews`_ for more infos on that. If all your clocks are properly + synchronized, it is best if you prepare some logs and reach out to the + community. This is not a state that is likely to persist and aside from + (*really*) old bugs there is not an obvious reason besides clock skews on + why this would happen. + +What if state is ``synchronizing``? + + This means the monitor is synchronizing with the rest of the cluster in + order to join the quorum. The synchronization process is as faster as + smaller your monitor store is, so if you have a big store it may + take a while. Don't worry, it should be finished soon enough. + + However, if you notice that the monitor jumps from ``synchronizing`` to + ``electing`` and then back to ``synchronizing``, then you do have a + problem: the cluster state is advancing (i.e., generating new maps) way + too fast for the synchronization process to keep up. This used to be a + thing in early Cuttlefish, but since then the synchronization process was + quite refactored and enhanced to avoid just this sort of behavior. If this + happens in later versions let us know. And bring some logs + (see `Preparing your logs`_). + +What if state is ``leader`` or ``peon``? + + This should not happen. There is a chance this might happen however, and + it has a lot to do with clock skews -- see `Clock Skews`_. If you are not + suffering from clock skews, then please prepare your logs (see + `Preparing your logs`_) and reach out to us. + + +Recovering a Monitor's Broken monmap +------------------------------------- + +This is how a ``monmap`` usually looks like, depending on the number of +monitors:: + + + epoch 3 + fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 + last_changed 2013-10-30 04:12:01.945629 + created 2013-10-29 14:14:41.914786 + 0: 127.0.0.1:6789/0 mon.a + 1: 127.0.0.1:6790/0 mon.b + 2: 127.0.0.1:6795/0 mon.c + +This may not be what you have however. For instance, in some versions of +early Cuttlefish there was this one bug that could cause your ``monmap`` +to be nullified. Completely filled with zeros. This means that not even +``monmaptool`` would be able to read it because it would find it hard to +make sense of only-zeros. Some other times, you may end up with a monitor +with a severely outdated monmap, thus being unable to find the remaining +monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, +then remove ``mon.a``, then add a new monitor ``mon.e`` and remove +``mon.b``; you will end up with a totally different monmap from the one +``mon.c`` knows). + +In this sort of situations, you have two possible solutions: + +Scrap the monitor and create a new one + + You should only take this route if you are positive that you won't + lose the information kept by that monitor; that you have other monitors + and that they are running just fine so that your new monitor is able + to synchronize from the remaining monitors. Keep in mind that destroying + a monitor, if there are no other copies of its contents, may lead to + loss of data. + +Inject a monmap into the monitor + + Usually the safest path. You should grab the monmap from the remaining + monitors and inject it into the monitor with the corrupted/lost monmap. + + These are the basic steps: + + 1. Is there a formed quorum? If so, grab the monmap from the quorum:: + + $ ceph mon getmap -o /tmp/monmap + + 2. No quorum? Grab the monmap directly from another monitor (this + assumes the monitor you are grabbing the monmap from has id ID-FOO + and has been stopped):: + + $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap + + 3. Stop the monitor you are going to inject the monmap into. + + 4. Inject the monmap:: + + $ ceph-mon -i ID --inject-monmap /tmp/monmap + + 5. Start the monitor + + Please keep in mind that the ability to inject monmaps is a powerful + feature that can cause havoc with your monitors if misused as it will + overwrite the latest, existing monmap kept by the monitor. + + +Clock Skews +------------ + +Monitors can be severely affected by significant clock skews across the +monitor nodes. This usually translates into weird behavior with no obvious +cause. To avoid such issues, you should run a clock synchronization tool +on your monitor nodes. + + +What's the maximum tolerated clock skew? + + By default the monitors will allow clocks to drift up to ``0.05 seconds``. + + +Can I increase the maximum tolerated clock skew? + + This value is configurable via the ``mon-clock-drift-allowed`` option, and + although you *CAN* it doesn't mean you *SHOULD*. The clock skew mechanism + is in place because clock skewed monitor may not properly behave. We, as + developers and QA aficionados, are comfortable with the current default + value, as it will alert the user before the monitors get out hand. Changing + this value without testing it first may cause unforeseen effects on the + stability of the monitors and overall cluster healthiness, although there is + no risk of dataloss. + + +How do I know there's a clock skew? + + The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph health + detail`` should show something in the form of:: + + mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s) + + That means that ``mon.c`` has been flagged as suffering from a clock skew. + + +What should I do if there's a clock skew? + + Synchronize your clocks. Running an NTP client may help. If you are already + using one and you hit this sort of issues, check if you are using some NTP + server remote to your network and consider hosting your own NTP server on + your network. This last option tends to reduce the amount of issues with + monitor clock skews. + + +Client Can't Connect or Mount +------------------------------ + +Check your IP tables. Some OS install utilities add a ``REJECT`` rule to +``iptables``. The rule rejects all clients trying to connect to the host except +for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in +place, clients connecting from a separate node will fail to mount with a timeout +error. You need to address ``iptables`` rules that reject clients trying to +connect to Ceph daemons. For example, you would need to address rules that look +like this appropriately:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You may also need to add rules to IP tables on your Ceph hosts to ensure +that clients can access the ports associated with your Ceph monitors (i.e., port +6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For +example:: + + iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT + +Monitor Store Failures +====================== + +Symptoms of store corruption +---------------------------- + +Ceph monitor stores the :term:`cluster map` in a key/value store such as LevelDB. If +a monitor fails due to the key/value store corruption, following error messages +might be found in the monitor log:: + + Corruption: error in middle of record + +or:: + + Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb + +Recovery using healthy monitor(s) +--------------------------------- + +If there are any survivors, we can always :ref:`replace <adding-and-removing-monitors>` the corrupted one with a +new one. After booting up, the new joiner will sync up with a healthy +peer, and once it is fully sync'ed, it will be able to serve the clients. + +Recovery using OSDs +------------------- + +But what if all monitors fail at the same time? Since users are encouraged to +deploy at least three (and preferably five) monitors in a Ceph cluster, the chance of simultaneous +failure is rare. But unplanned power-downs in a data center with improperly +configured disk/fs settings could fail the underlying filesystem, and hence +kill all the monitors. In this case, we can recover the monitor store with the +information stored in OSDs.:: + + ms=/root/mon-store + mkdir $ms + + # collect the cluster map from OSDs + for host in $hosts; do + rsync -avz $ms/. user@host:$ms.remote + rm -rf $ms + ssh user@host <<EOF + for osd in /var/lib/ceph/osd/ceph-*; do + ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms.remote + done + EOF + rsync -avz user@host:$ms.remote/. $ms + done + + # rebuild the monitor store from the collected map, if the cluster does not + # use cephx authentication, we can skip the following steps to update the + # keyring with the caps, and there is no need to pass the "--keyring" option. + # i.e. just use "ceph-monstore-tool $ms rebuild" instead + ceph-authtool /path/to/admin.keyring -n mon. \ + --cap mon 'allow *' + ceph-authtool /path/to/admin.keyring -n client.admin \ + --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' + # add one or more ceph-mgr's key to the keyring. in this case, an encoded key + # for mgr.x is added, you can find the encoded key in + # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is + # deployed + ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \ + --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *' + # if your monitors' ids are not single characters like 'a', 'b', 'c', please + # specify them in the command line by passing them as arguments of the "--mon-ids" + # option. if you are not sure, please check your ceph.conf to see if there is any + # sections named like '[mon.foo]'. don't pass the "--mon-ids" option, if you are + # using DNS SRV for looking up monitors. + ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma + + # make a backup of the corrupted store.db just in case! repeat for + # all monitors. + mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted + + # move rebuild store.db into place. repeat for all monitors. + mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db + chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db + +The steps above + +#. collect the map from all OSD hosts, +#. then rebuild the store, +#. fill the entities in keyring file with appropriate caps +#. replace the corrupted store on ``mon.foo`` with the recovered copy. + +Known limitations +~~~~~~~~~~~~~~~~~ + +Following information are not recoverable using the steps above: + +- **some added keyrings**: all the OSD keyrings added using ``ceph auth add`` command + are recovered from the OSD's copy. And the ``client.admin`` keyring is imported + using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing + in the recovered monitor store. You might need to re-add them manually. + +- **creating pools**: If any RADOS pools were in the process of being creating, that state is lost. The recovery tool assumes that all pools have been created. If there are PGs that are stuck in the 'unknown' after the recovery for a partially created pool, you can force creation of the *empty* PG with the ``ceph osd force-create-pg`` command. Note that this will create an *empty* PG, so only do this if you know the pool is empty. + +- **MDS Maps**: the MDS maps are lost. + + + +Everything Failed! Now What? +============================= + +Reaching out for help +---------------------- + +You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net) +and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make +sure you have grabbed your logs and have them ready if someone asks: the faster +the interaction and lower the latency in response, the better chances everyone's +time is optimized. + + +Preparing your logs +--------------------- + +Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We +may want them. However, your logs may not have the necessary information. If +you don't find your monitor logs at their default location, you can check +where they should be by running:: + + ceph-conf --name mon.FOO --show-config-value log_file + +The amount of information in the logs are subject to the debug levels being +enforced by your configuration files. If you have not enforced a specific +debug level then Ceph is using the default levels and your logs may not +contain important information to track down you issue. +A first step in getting relevant information into your logs will be to raise +debug levels. In this case we will be interested in the information from the +monitor. +Similarly to what happens on other components, different parts of the monitor +will output their debug information on different subsystems. + +You will have to raise the debug levels of those subsystems more closely +related to your issue. This may not be an easy task for someone unfamiliar +with troubleshooting Ceph. For most situations, setting the following options +on your monitors will be enough to pinpoint a potential source of the issue:: + + debug mon = 10 + debug ms = 1 + +If we find that these debug levels are not enough, there's a chance we may +ask you to raise them or even define other debug subsystems to obtain infos +from -- but at least we started off with some useful information, instead +of a massively empty log without much to go on with. + +Do I need to restart a monitor to adjust debug levels? +------------------------------------------------------ + +No. You may do it in one of two ways: + +You have quorum + + Either inject the debug option into the monitor you want to debug:: + + ceph tell mon.FOO config set debug_mon 10/10 + + or into all monitors at once:: + + ceph tell mon.* config set debug_mon 10/10 + +No quorum + + Use the monitor's admin socket and directly adjust the configuration + options:: + + ceph daemon mon.FOO config set debug_mon 10/10 + + +Going back to default values is as easy as rerunning the above commands +using the debug level ``1/10`` instead. You can check your current +values using the admin socket and the following commands:: + + ceph daemon mon.FOO config show + +or:: + + ceph daemon mon.FOO config get 'OPTION_NAME' + + +Reproduced the problem with appropriate debug levels. Now what? +---------------------------------------------------------------- + +Ideally you would send us only the relevant portions of your logs. +We realise that figuring out the corresponding portion may not be the +easiest of tasks. Therefore, we won't hold it to you if you provide the +full log, but common sense should be employed. If your log has hundreds +of thousands of lines, it may get tricky to go through the whole thing, +specially if we are not aware at which point, whatever your issue is, +happened. For instance, when reproducing, keep in mind to write down +current time and date and to extract the relevant portions of your logs +based on that. + +Finally, you should reach out to us on the mailing lists, on IRC or file +a new issue on the `tracker`_. + +.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst new file mode 100644 index 00000000..eb9fec06 --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-osd.rst @@ -0,0 +1,546 @@ +====================== + Troubleshooting OSDs +====================== + +Before troubleshooting your OSDs, check your monitors and network first. If +you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns +a health status, it means that the monitors have a quorum. +If you don't have a monitor quorum or if there are errors with the monitor +status, `address the monitor issues first <../troubleshooting-mon>`_. +Check your networks to ensure they +are running properly, because networks may have a significant impact on OSD +operation and performance. + + + +Obtaining Data About OSDs +========================= + +A good first step in troubleshooting your OSDs is to obtain information in +addition to the information you collected while `monitoring your OSDs`_ +(e.g., ``ceph osd tree``). + + +Ceph Logs +--------- + +If you haven't changed the default path, you can find Ceph log files at +``/var/log/ceph``:: + + ls /var/log/ceph + +If you don't get enough log detail, you can change your logging level. See +`Logging and Debugging`_ for details to ensure that Ceph performs adequately +under high logging volume. + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. For details, list +the sockets for your Ceph processes:: + + ls /var/run/ceph + +Then, execute the following, replacing ``{daemon-name}`` with an actual +daemon (e.g., ``osd.0``):: + + ceph daemon osd.0 help + +Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``):: + + ceph daemon {socket-file} help + + +The admin socket, among other things, allows you to: + +- List your configuration at runtime +- Dump historic operations +- Dump the operation priority queue state +- Dump operations in flight +- Dump perfcounters + + +Display Freespace +----------------- + +Filesystem issues may arise. To display your filesystem's free space, execute +``df``. :: + + df -h + +Execute ``df --help`` for additional usage. + + +I/O Statistics +-------------- + +Use `iostat`_ to identify I/O-related issues. :: + + iostat -x + + +Diagnostic Messages +------------------- + +To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep`` +or ``tail``. For example:: + + dmesg | grep scsi + + +Stopping w/out Rebalancing +========================== + +Periodically, you may need to perform maintenance on a subset of your cluster, +or resolve a problem that affects a failure domain (e.g., a rack). If you do not +want CRUSH to automatically rebalance the cluster as you stop OSDs for +maintenance, set the cluster to ``noout`` first:: + + ceph osd set noout + +Once the cluster is set to ``noout``, you can begin stopping the OSDs within the +failure domain that requires maintenance work. :: + + stop ceph-osd id={num} + +.. note:: Placement groups within the OSDs you stop will become ``degraded`` + while you are addressing issues with within the failure domain. + +Once you have completed your maintenance, restart the OSDs. :: + + start ceph-osd id={num} + +Finally, you must unset the cluster from ``noout``. :: + + ceph osd unset noout + + + +.. _osd-not-running: + +OSD Not Running +=============== + +Under normal circumstances, simply restarting the ``ceph-osd`` daemon will +allow it to rejoin the cluster and recover. + +An OSD Won't Start +------------------ + +If you start your cluster and an OSD won't start, check the following: + +- **Configuration File:** If you were not able to get OSDs running from + a new installation, check your configuration file to ensure it conforms + (e.g., ``host`` not ``hostname``, etc.). + +- **Check Paths:** Check the paths in your configuration, and the actual + paths themselves for data and journals. If you separate the OSD data from + the journal data and there are errors in your configuration file or in the + actual mounts, you may have trouble starting OSDs. If you want to store the + journal on a block device, you should partition your journal disk and assign + one partition per OSD. + +- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be + hitting the default maximum number of threads (e.g., usually 32k), especially + during recovery. You can increase the number of threads using ``sysctl`` to + see if increasing the maximum number of threads to the maximum possible + number of threads allowed (i.e., 4194303) will help. For example:: + + sysctl -w kernel.pid_max=4194303 + + If increasing the maximum thread count resolves the issue, you can make it + permanent by including a ``kernel.pid_max`` setting in the + ``/etc/sysctl.conf`` file. For example:: + + kernel.pid_max = 4194303 + +- **Kernel Version:** Identify the kernel version and distribution you + are using. Ceph uses some third party tools by default, which may be + buggy or may conflict with certain distributions and/or kernel + versions (e.g., Google perftools). Check the `OS recommendations`_ + to ensure you have addressed any issues related to your kernel. + +- **Segment Fault:** If there is a segment fault, turn your logging up + (if it is not already), and try again. If it segment faults again, + contact the ceph-devel email list and provide your Ceph configuration + file, your monitor output and the contents of your log file(s). + + + +An OSD Failed +------------- + +When a ``ceph-osd`` process dies, the monitor will learn about the failure +from surviving ``ceph-osd`` daemons and report it via the ``ceph health`` +command:: + + ceph health + HEALTH_WARN 1/3 in osds are down + +Specifically, you will get a warning whenever there are ``ceph-osd`` +processes that are marked ``in`` and ``down``. You can identify which +``ceph-osds`` are ``down`` with:: + + ceph health detail + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +If there is a disk +failure or other fault preventing ``ceph-osd`` from functioning or +restarting, an error message should be present in its log file in +``/var/log/ceph``. + +If the daemon stopped because of a heartbeat failure, the underlying +kernel file system may be unresponsive. Check ``dmesg`` output for disk +or other kernel errors. + +If the problem is a software error (failed assertion or other +unexpected error), it should be reported to the `ceph-devel`_ email list. + + +No Free Drive Space +------------------- + +Ceph prevents you from writing to a full OSD so that you don't lose data. +In an operational cluster, you should receive a warning when your cluster +is getting near its full ratio. The ``mon osd full ratio`` defaults to +``0.95``, or 95% of capacity before it stops clients from writing data. +The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90 % of +capacity when it blocks backfills from starting. The +OSD nearfull ratio defaults to ``0.85``, or 85% of capacity +when it generates a health warning. + +Changing it can be done using: + +:: + + ceph osd set-nearfull-ratio <float[0.0-1.0]> + + +Full cluster issues usually arise when testing how Ceph handles an OSD +failure on a small cluster. When one node has a high percentage of the +cluster's data, the cluster can easily eclipse its nearfull and full ratio +immediately. If you are testing how Ceph reacts to OSD failures on a small +cluster, you should leave ample free disk space and consider temporarily +lowering the OSD ``full ratio``, OSD ``backfillfull ratio`` and +OSD ``nearfull ratio`` using these commands: + +:: + + ceph osd set-nearfull-ratio <float[0.0-1.0]> + ceph osd set-full-ratio <float[0.0-1.0]> + ceph osd set-backfillfull-ratio <float[0.0-1.0]> + +Full ``ceph-osds`` will be reported by ``ceph health``:: + + ceph health + HEALTH_WARN 1 nearfull osd(s) + +Or:: + + ceph health detail + HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s) + osd.3 is full at 97% + osd.4 is backfill full at 91% + osd.2 is near full at 87% + +The best way to deal with a full cluster is to add new ``ceph-osds``, allowing +the cluster to redistribute data to the newly available storage. + +If you cannot start an OSD because it is full, you may delete some data by deleting +some placement group directories in the full OSD. + +.. important:: If you choose to delete a placement group directory on a full OSD, + **DO NOT** delete the same placement group directory on another full OSD, or + **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on + at least one OSD. + +See `Monitor Config Reference`_ for additional details. + + +OSDs are Slow/Unresponsive +========================== + +A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you +have eliminated other troubleshooting possibilities before delving into OSD +performance issues. For example, ensure that your network(s) is working properly +and your OSDs are running. Check to see if OSDs are throttling recovery traffic. + +.. tip:: Newer versions of Ceph provide better recovery handling by preventing + recovering OSDs from using up system resources so that ``up`` and ``in`` + OSDs are not available or are otherwise slow. + + +Networking Issues +----------------- + +Ceph is a distributed storage system, so it depends upon networks to peer with +OSDs, replicate objects, recover from faults and check heartbeats. Networking +issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for +details. + +Ensure that Ceph processes and Ceph-dependent processes are connected and/or +listening. :: + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +Check network statistics. :: + + netstat -s + + +Drive Configuration +------------------- + +A storage drive should only support one OSD. Sequential read and sequential +write throughput can bottleneck if other processes share the drive, including +journals, operating systems, monitors, other OSDs and non-Ceph processes. + +Ceph acknowledges writes *after* journaling, so fast SSDs are an +attractive option to accelerate the response time--particularly when +using the ``XFS`` or ``ext4`` filesystems. By contrast, the ``btrfs`` +filesystem can write and journal simultaneously. (Note, however, that +we recommend against using ``btrfs`` for production deployments.) + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Running a journal in a separate partition + may help, but you should prefer a separate physical drive. + + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your disks for bad sectors and fragmentation. This can cause total throughput +to drop substantially. + + +Co-resident Monitors/OSDs +------------------------- + +Monitors are generally light-weight processes, but they do lots of ``fsync()``, +which can interfere with other workloads, particularly if monitors run on the +same drive as your OSDs. Additionally, if you run monitors on the same host as +the OSDs, you may incur performance issues related to: + +- Running an older kernel (pre-3.0) +- Running a kernel with no syncfs(2) syscall. + +In these cases, multiple OSDs running on the same host can drag each other down +by doing lots of commits. That often leads to the bursty writes. + + +Co-resident Processes +--------------------- + +Spinning up co-resident processes such as a cloud-based solution, virtual +machines and other applications that write data to Ceph while operating on the +same hardware as OSDs can introduce significant OSD latency. Generally, we +recommend optimizing a host for use with Ceph and using other hosts for other +processes. The practice of separating Ceph operations from other applications +may help improve performance and may streamline troubleshooting and maintenance. + + +Logging Levels +-------------- + +If you turned logging levels up to track an issue and then forgot to turn +logging levels back down, the OSD may be putting a lot of logs onto the disk. If +you intend to keep logging levels high, you may consider mounting a drive to the +default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``). + + +Recovery Throttling +------------------- + +Depending upon your configuration, Ceph may reduce recovery rates to maintain +performance or it may increase recovery rates to the point that recovery +impacts OSD performance. Check to see if the OSD is recovering. + + +Kernel Version +-------------- + +Check the kernel version you are running. Older kernels may not receive +new backports that Ceph depends upon for better performance. + + +Kernel Issues with SyncFS +------------------------- + +Try running one OSD per host to see if performance improves. Old kernels +might not have a recent enough version of ``glibc`` to support ``syncfs(2)``. + + +Filesystem Issues +----------------- + +Currently, we recommend deploying clusters with XFS. + +We recommend against using btrfs or ext4. The btrfs filesystem has +many attractive features, but bugs in the filesystem may lead to +performance issues and spurious ENOSPC errors. We do not recommend +ext4 because xattr size limitations break our support for long object +names (needed for RGW). + +For more information, see `Filesystem Recommendations`_. + +.. _Filesystem Recommendations: ../configuration/filesystem-recommendations + + +Insufficient RAM +---------------- + +We recommend 1GB of RAM per OSD daemon. You may notice that during normal +operations, the OSD only uses a fraction of that amount (e.g., 100-200MB). +Unused RAM makes it tempting to use the excess RAM for co-resident applications, +VMs and so forth. However, when OSDs go into recovery mode, their memory +utilization spikes. If there is no RAM available, the OSD performance will slow +considerably. + + +Old Requests or Slow Requests +----------------------------- + +If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages +complaining about requests that are taking too long. The warning threshold +defaults to 30 seconds, and is configurable via the ``osd op complaint time`` +option. When this happens, the cluster log will receive messages. + +Legacy versions of Ceph complain about ``old requests``:: + + osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops + +New versions of Ceph complain about ``slow requests``:: + + {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs + {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610] + + +Possible causes include: + +- A bad drive (check ``dmesg`` output) +- A bug in the kernel file system (check ``dmesg`` output) +- An overloaded cluster (check system load, iostat, etc.) +- A bug in the ``ceph-osd`` daemon. + +Possible solutions: + +- Remove VMs from Ceph hosts +- Upgrade kernel +- Upgrade Ceph +- Restart OSDs + +Debugging Slow Requests +----------------------- + +If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id> dump_ops_in_flight``, +you will see a set of operations and a list of events each operation went +through. These are briefly described below. + +Events from the Messenger layer: + +- ``header_read``: When the messenger first started reading the message off the wire. +- ``throttled``: When the messenger tried to acquire memory throttle space to read + the message into memory. +- ``all_read``: When the messenger finished reading the message off the wire. +- ``dispatched``: When the messenger gave the message to the OSD. +- ``initiated``: This is identical to ``header_read``. The existence of both is a + historical oddity. + +Events from the OSD as it prepares operations: + +- ``queued_for_pg``: The op has been put into the queue for processing by its PG. +- ``reached_pg``: The PG has started doing the op. +- ``waiting for \*``: The op is waiting for some other work to complete before it + can proceed (e.g. a new OSDMap; for its object target to scrub; for the PG to + finish peering; all as specified in the message). +- ``started``: The op has been accepted as something the OSD should do and + is now being performed. +- ``waiting for subops from``: The op has been sent to replica OSDs. + +Events from the FileStore: + +- ``commit_queued_for_journal_write``: The op has been given to the FileStore. +- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and waiting + to be persisted (as the next disk write). +- ``journaled_completion_queued``: The op was journaled to disk and its callback + queued for invocation. + +Events from the OSD after stuff has been given to local disk: + +- ``op_commit``: The op has been committed (i.e. written to journal) by the + primary OSD. +- ``op_applied``: The op has been `write()'en <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (i.e. applied in memory but not flushed out to disk) on the primary. +- ``sub_op_applied``: ``op_applied``, but for a replica's "subop". +- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools). +- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it + hears about the above, but for a particular replica (i.e. ``<X>``). +- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops). + +Many of these events are seemingly redundant, but cross important boundaries in +the internal code (such as passing data across locks into new threads). + +Flapping OSDs +============= + +We recommend using both a public (front-end) network and a cluster (back-end) +network so that you can better meet the capacity requirements of object +replication. Another advantage is that you can run a cluster network such that +it is not connected to the internet, thereby preventing some denial of service +attacks. When OSDs peer and check heartbeats, they use the cluster (back-end) +network when it's available. See `Monitor/OSD Interaction`_ for details. + +However, if the cluster (back-end) network fails or develops significant latency +while the public (front-end) network operates optimally, OSDs currently do not +handle this situation well. What happens is that OSDs mark each other ``down`` +on the monitor, while marking themselves ``up``. We call this scenario +'flapping`. + +If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and +then ``up`` again), you can force the monitors to stop the flapping with:: + + ceph osd set noup # prevent OSDs from getting marked up + ceph osd set nodown # prevent OSDs from getting marked down + +These flags are recorded in the osdmap structure:: + + ceph osd dump | grep flags + flags no-up,no-down + +You can clear the flags with:: + + ceph osd unset noup + ceph osd unset nodown + +Two other flags are supported, ``noin`` and ``noout``, which prevent +booting OSDs from being marked ``in`` (allocated data) or protect OSDs +from eventually being marked ``out`` (regardless of what the current value for +``mon osd down out interval`` is). + +.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the + sense that once the flags are cleared, the action they were blocking + should occur shortly after. The ``noin`` flag, on the other hand, + prevents OSDs from being marked ``in`` on boot, and any daemons that + started while the flag was set will remain that way. + + + + + + +.. _iostat: https://en.wikipedia.org/wiki/Iostat +.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging +.. _Logging and Debugging: ../log-and-debug +.. _Debugging and Logging: ../debug +.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction +.. _Monitor Config Reference: ../../configuration/mon-config-ref +.. _monitoring your OSDs: ../../operations/monitoring-osd-pg +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _OS recommendations: ../../../start/os-recommendations +.. _ceph-devel: ceph-devel@vger.kernel.org diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst new file mode 100644 index 00000000..a11b972f --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -0,0 +1,666 @@ +===================== + Troubleshooting PGs +===================== + +Placement Groups Never Get Clean +================================ + +When you create a cluster and your cluster remains in ``active``, +``active+remapped`` or ``active+degraded`` status and never achieve an +``active+clean`` status, you likely have a problem with your configuration. + +You may need to review settings in the `Pool, PG and CRUSH Config Reference`_ +and make appropriate adjustments. + +As a general rule, you should run your cluster with more than one OSD and a +pool size greater than 1 object replica. + +One Node Cluster +---------------- + +Ceph no longer provides documentation for operating on a single node, because +you would never deploy a system designed for distributed computing on a single +node. Additionally, mounting client kernel modules on a single node containing a +Ceph daemon may cause a deadlock due to issues with the Linux kernel itself +(unless you use VMs for the clients). You can experiment with Ceph in a 1-node +configuration, in spite of the limitations as described herein. + +If you are trying to create a cluster on a single node, you must change the +default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning +``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration +file before you create your monitors and OSDs. This tells Ceph that an OSD +can peer with another OSD on the same host. If you are trying to set up a +1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, +Ceph will try to peer the PGs of one OSD with the PGs of another OSD on +another node, chassis, rack, row, or even datacenter depending on the setting. + +.. tip:: DO NOT mount kernel clients directly on the same node as your + Ceph Storage Cluster, because kernel conflicts can arise. However, you + can mount kernel clients within virtual machines (VMs) on a single node. + +If you are creating OSDs using a single disk, you must create directories +for the data manually first. For example:: + + ceph-deploy osd create --data {disk} {host} + + +Fewer OSDs than Replicas +------------------------ + +If you have brought up two OSDs to an ``up`` and ``in`` state, but you still +don't see ``active + clean`` placement groups, you may have an +``osd pool default size`` set to greater than ``2``. + +There are a few ways to address this situation. If you want to operate your +cluster in an ``active + degraded`` state with two replicas, you can set the +``osd pool default min size`` to ``2`` so that you can write objects in +an ``active + degraded`` state. You may also set the ``osd pool default size`` +setting to ``2`` so that you only have two stored replicas (the original and +one replica), in which case the cluster should achieve an ``active + clean`` +state. + +.. note:: You can make the changes at runtime. If you make the changes in + your Ceph configuration file, you may need to restart your cluster. + + +Pool Size = 1 +------------- + +If you have the ``osd pool default size`` set to ``1``, you will only have +one copy of the object. OSDs rely on other OSDs to tell them which objects +they should have. If a first OSD has a copy of an object and there is no +second copy, then no second OSD can tell the first OSD that it should have +that copy. For each placement group mapped to the first OSD (see +``ceph pg dump``), you can force the first OSD to notice the placement groups +it needs by running:: + + ceph osd force-create-pg <pgid> + + +CRUSH Map Errors +---------------- + +Another candidate for placement groups remaining unclean involves errors +in your CRUSH map. + + +Stuck Placement Groups +====================== + +It is normal for placement groups to enter states like "degraded" or "peering" +following a failure. Normally these states indicate the normal progression +through the failure recovery process. However, if a placement group stays in one +of these states for a long time this may be an indication of a larger problem. +For this reason, the monitor will warn when placement groups get "stuck" in a +non-optimal state. Specifically, we check for: + +* ``inactive`` - The placement group has not been ``active`` for too long + (i.e., it hasn't been able to service read/write requests). + +* ``unclean`` - The placement group has not been ``clean`` for too long + (i.e., it hasn't been able to completely recover from a previous failure). + +* ``stale`` - The placement group status has not been updated by a ``ceph-osd``, + indicating that all nodes storing this placement group may be ``down``. + +You can explicitly list stuck placement groups with one of:: + + ceph pg dump_stuck stale + ceph pg dump_stuck inactive + ceph pg dump_stuck unclean + +For stuck ``stale`` placement groups, it is normally a matter of getting the +right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement +groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For +stuck ``unclean`` placement groups, there is usually something preventing +recovery from completing, like unfound objects (see +:ref:`failures-osd-unfound`); + + + +.. _failures-osd-peering: + +Placement Group Down - Peering Failure +====================================== + +In certain cases, the ``ceph-osd`` `Peering` process can run into +problems, preventing a PG from becoming active and usable. For +example, ``ceph health`` might report:: + + ceph health detail + HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down + ... + pg 0.5 is down+peering + pg 1.4 is down+peering + ... + osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 + +We can query the cluster to determine exactly why the PG is marked ``down`` with:: + + ceph pg 0.5 query + +.. code-block:: javascript + + { "state": "down+peering", + ... + "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to +down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` +and things will recover. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk +failure), we can tell the cluster that it is ``lost`` and to cope as +best it can. + +.. important:: This is dangerous in that the cluster cannot + guarantee that the other copies of the data are consistent + and up to date. + +To instruct Ceph to continue anyway:: + + ceph osd lost 1 + +Recovery will proceed. + + +.. _failures-osd-unfound: + +Unfound Objects +=============== + +Under certain combinations of failures Ceph may complain about +``unfound`` objects:: + + ceph health detail + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer +copies of existing objects) exist, but it hasn't found copies of them. +One example of how this might come about for a PG whose data is on ceph-osds +1 and 2: + +* 1 goes down +* 2 handles some writes, alone +* 1 comes up +* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. +* Before the new objects are copied, 2 goes down. + +Now 1 knows that these object exist, but there is no live ``ceph-osd`` who +has a copy. In this case, IO to those objects will block, and the +cluster will hope that the failed node comes back soon; this is +assumed to be preferable to returning an IO error to the user. + +First, you can identify which objects are unfound with:: + + ceph pg 2.4 list_unfound [starting offset, in json] + +.. code-block:: javascript + + { "offset": { "oid": "", + "key": "", + "snapid": 0, + "hash": 0, + "max": 0}, + "num_missing": 0, + "num_unfound": 0, + "objects": [ + { "oid": "object 1", + "key": "", + "hash": 0, + "max": 0 }, + ... + ], + "more": 0} + +If there are too many objects to list in a single result, the ``more`` +field will be true and you can query for more. (Eventually the +command line tool will hide this from you, but not yet.) + +Second, you can identify which OSDs have been probed or might contain +data:: + + ceph pg 2.4 query + +.. code-block:: javascript + + "recovery_state": [ + { "name": "Started\/Primary\/Active", + "enter_time": "2012-03-06 15:15:46.713212", + "might_have_unfound": [ + { "osd": 1, + "status": "osd is down"}]}, + +In this case, for example, the cluster knows that ``osd.1`` might have +data, but it is ``down``. The full range of possible states include: + +* already probed +* querying +* OSD is down +* not queried (yet) + +Sometimes it simply takes some time for the cluster to query possible +locations. + +It is possible that there are other locations where the object can +exist that are not listed. For example, if a ceph-osd is stopped and +taken out of the cluster, the cluster fully recovers, and due to some +future set of failures ends up with an unfound object, it won't +consider the long-departed ceph-osd as a potential location to +consider. (This scenario, however, is unlikely.) + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. This, again, is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. To mark the "unfound" objects as "lost":: + + ceph pg 2.5 mark_unfound_lost revert|delete + +This the final argument specifies how the cluster should deal with +lost objects. + +The "delete" option will forget about them entirely. + +The "revert" option (not available for erasure coded pools) will +either roll back to a previous version of the object or (if it was a +new object) forget about it entirely. Use this with caution, as it +may confuse applications that expected the object to exist. + + +Homeless Placement Groups +========================= + +It is possible for all OSDs that had copies of a given placement groups to fail. +If that's the case, that subset of the object store is unavailable, and the +monitor will receive no status updates for those placement groups. To detect +this situation, the monitor marks any placement group whose primary OSD has +failed as ``stale``. For example:: + + ceph health + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + +You can identify which placement groups are ``stale``, and what the last OSDs to +store them were, with:: + + ceph health detail + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + ... + pg 2.5 is stuck stale+active+remapped, last acting [2,0] + ... + osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 + osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 + osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 + +If we want to get placement group 2.5 back online, for example, this tells us that +it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd`` +daemons will allow the cluster to recover that placement group (and, presumably, +many others). + + +Only a Few OSDs Receive Data +============================ + +If you have many nodes in your cluster and only a few of them receive data, +`check`_ the number of placement groups in your pool. Since placement groups get +mapped to OSDs, a small number of placement groups will not distribute across +your cluster. Try creating a pool with a placement group count that is a +multiple of the number of OSDs. See `Placement Groups`_ for details. The default +placement group count for pools is not useful, but you can change it `here`_. + + +Can't Write Data +================ + +If your cluster is up, but some OSDs are down and you cannot write data, +check to ensure that you have the minimum number of OSDs running for the +placement group. If you don't have the minimum number of OSDs running, +Ceph will not allow you to write data because there is no guarantee +that Ceph can replicate your data. See ``osd pool default min size`` +in the `Pool, PG and CRUSH Config Reference`_ for details. + + +PGs Inconsistent +================ + +If you receive an ``active + clean + inconsistent`` state, this may happen +due to an error during scrubbing. As always, we can identify the inconsistent +placement group(s) with:: + + $ ceph health detail + HEALTH_ERR 1 pgs inconsistent; 2 scrub errors + pg 0.6 is active+clean+inconsistent, acting [0,1,2] + 2 scrub errors + +Or if you prefer inspecting the output in a programmatic way:: + + $ rados list-inconsistent-pg rbd + ["0.6"] + +There is only one consistent state, but in the worst case, we could have +different inconsistencies in multiple perspectives found in more than one +objects. If an object named ``foo`` in PG ``0.6`` is truncated, we will have:: + + $ rados list-inconsistent-obj 0.6 --format=json-pretty + +.. code-block:: javascript + + { + "epoch": 14, + "inconsistents": [ + { + "object": { + "name": "foo", + "nspace": "", + "locator": "", + "snap": "head", + "version": 1 + }, + "errors": [ + "data_digest_mismatch", + "size_mismatch" + ], + "union_shard_errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])", + "shards": [ + { + "osd": 0, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 1, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 2, + "errors": [ + "data_digest_mismatch_info", + "size_mismatch_info" + ], + "size": 0, + "omap_digest": "0xffffffff", + "data_digest": "0xffffffff" + } + ] + } + ] + } + +In this case, we can learn from the output: + +* The only inconsistent object is named ``foo``, and it is its head that has + inconsistencies. +* The inconsistencies fall into two categories: + + * ``errors``: these errors indicate inconsistencies between shards without a + determination of which shard(s) are bad. Check for the ``errors`` in the + `shards` array, if available, to pinpoint the problem. + + * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is + different from the ones of OSD.0 and OSD.1 + * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while + the size reported by OSD.0 and OSD.1 is 968. + * ``union_shard_errors``: the union of all shard specific ``errors`` in + ``shards`` array. The ``errors`` are set for the given shard that has the + problem. They include errors like ``read_error``. The ``errors`` ending in + ``oi`` indicate a comparison with ``selected_object_info``. Look at the + ``shards`` array to determine which shard has which error(s). + + * ``data_digest_mismatch_info``: the digest stored in the object-info is not + ``0xffffffff``, which is calculated from the shard read from OSD.2 + * ``size_mismatch_info``: the size stored in the object-info is different + from the one read from OSD.2. The latter is 0. + +You can repair the inconsistent placement group by executing:: + + ceph pg repair {placement-group-ID} + +Which overwrites the `bad` copies with the `authoritative` ones. In most cases, +Ceph is able to choose authoritative copies from all available replicas using +some predefined criteria. But this does not always work. For example, the stored +data digest could be missing, and the calculated digest will be ignored when +choosing the authoritative copies. So, please use the above command with caution. + +If ``read_error`` is listed in the ``errors`` attribute of a shard, the +inconsistency is likely due to disk errors. You might want to check your disk +used by that OSD. + +If you receive ``active + clean + inconsistent`` states periodically due to +clock skew, you may consider configuring your `NTP`_ daemons on your +monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph +`Clock Settings`_ for additional details. + + +Erasure Coded PGs are not active+clean +====================================== + +When CRUSH fails to find enough OSDs to map to a PG, it will show as a +``2147483647`` which is ITEM_NONE or ``no OSD found``. For instance:: + + [2,1,6,0,5,8,2147483647,7,4] + +Not enough OSDs +--------------- + +If the Ceph cluster only has 8 OSDs and the erasure coded pool needs +9, that is what it will show. You can either create another erasure +coded pool that requires less OSDs:: + + ceph osd erasure-code-profile set myprofile k=5 m=3 + ceph osd pool create erasurepool 16 16 erasure myprofile + +or add a new OSDs and the PG will automatically use them. + +CRUSH constraints cannot be satisfied +------------------------------------- + +If the cluster has enough OSDs, it is possible that the CRUSH rule +imposes constraints that cannot be satisfied. If there are 10 OSDs on +two hosts and the CRUSH rule requires that no two OSDs from the +same host are used in the same PG, the mapping may fail because only +two OSDs will be found. You can check the constraint by displaying ("dumping") +the rule:: + + $ ceph osd crush rule ls + [ + "replicated_rule", + "erasurepool"] + $ ceph osd crush rule dump erasurepool + { "rule_id": 1, + "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + + +You can resolve the problem by creating a new pool in which PGs are allowed +to have OSDs residing on the same host with:: + + ceph osd erasure-code-profile set myprofile crush-failure-domain=osd + ceph osd pool create erasurepool 16 16 erasure myprofile + +CRUSH gives up too soon +----------------------- + +If the Ceph cluster has just enough OSDs to map the PG (for instance a +cluster with a total of 9 OSDs and an erasure coded pool that requires +9 OSDs per PG), it is possible that CRUSH gives up before finding a +mapping. It can be resolved by: + +* lowering the erasure coded pool requirements to use less OSDs per PG + (that requires the creation of another pool as erasure code profiles + cannot be dynamically modified). + +* adding more OSDs to the cluster (that does not require the erasure + coded pool to be modified, it will become clean automatically) + +* use a handmade CRUSH rule that tries more times to find a good + mapping. This can be done by setting ``set_choose_tries`` to a value + greater than the default. + +You should first verify the problem with ``crushtool`` after +extracting the crushmap from the cluster so your experiments do not +modify the Ceph cluster and only work on a local files:: + + $ ceph osd crush rule dump erasurepool + { "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + $ ceph osd getcrushmap > crush.map + got crush map from osdmap epoch 13 + $ crushtool -i crush.map --test --show-bad-mappings \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] + bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] + bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] + +Where ``--num-rep`` is the number of OSDs the erasure code CRUSH +rule needs, ``--rule`` is the value of the ``ruleset`` field +displayed by ``ceph osd crush rule dump``. The test will try mapping +one million values (i.e. the range defined by ``[--min-x,--max-x]``) +and must display at least one bad mapping. If it outputs nothing it +means all mappings are successful and you can stop right there: the +problem is elsewhere. + +The CRUSH rule can be edited by decompiling the crush map:: + + $ crushtool --decompile crush.map > crush.txt + +and adding the following line to the rule:: + + step set_choose_tries 100 + +The relevant part of of the ``crush.txt`` file should look something +like:: + + rule erasurepool { + ruleset 1 + type erasure + min_size 3 + max_size 20 + step set_chooseleaf_tries 5 + step set_choose_tries 100 + step take default + step chooseleaf indep 0 type host + step emit + } + +It can then be compiled and tested again:: + + $ crushtool --compile crush.txt -o better-crush.map + +When all mappings succeed, an histogram of the number of tries that +were necessary to find all of them can be displayed with the +``--show-choose-tries`` option of ``crushtool``:: + + $ crushtool -i better-crush.map --test --show-bad-mappings \ + --show-choose-tries \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + ... + 11: 42 + 12: 44 + 13: 54 + 14: 45 + 15: 35 + 16: 34 + 17: 30 + 18: 25 + 19: 19 + 20: 22 + 21: 20 + 22: 17 + 23: 13 + 24: 16 + 25: 13 + 26: 11 + 27: 11 + 28: 13 + 29: 11 + 30: 10 + 31: 6 + 32: 5 + 33: 10 + 34: 3 + 35: 7 + 36: 5 + 37: 2 + 38: 5 + 39: 5 + 40: 2 + 41: 5 + 42: 4 + 43: 1 + 44: 2 + 45: 2 + 46: 3 + 47: 1 + 48: 0 + ... + 102: 0 + 103: 1 + 104: 0 + ... + +It took 11 tries to map 42 PGs, 12 tries to map 44 PGs etc. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped). + +.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups +.. _here: ../../configuration/pool-pg-config-ref +.. _Placement Groups: ../../operations/placement-groups +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _NTP: https://en.wikipedia.org/wiki/Network_Time_Protocol +.. _The Network Time Protocol: http://www.ntp.org/ +.. _Clock Settings: ../../configuration/mon-config-ref/#clock + + |