Adding upstream version 18.2.2.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
commit: e6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree: 64f88b554b444a49f656b6c656111a145cbbaa28 /doc/rados/troubleshooting
parent: Initial commit. (diff)
8 files changed, 3051 insertions, 0 deletions
diff --git a/doc/rados/troubleshooting/community.rst b/doc/rados/troubleshooting/community.rst
new file mode 100644
index 000000000..c0d7be10c
--- /dev/null
+++ b/doc/rados/troubleshooting/community.rst
@@ -0,0 +1,37 @@
+====================
+ The Ceph Community
+====================
+
+Ceph-users email list
+=====================
+
+The Ceph community is an excellent source of information and help. For
+operational issues with Ceph we recommend that you `subscribe to the ceph-users
+email list`_. When you no longer want to receive emails, you can `unsubscribe
+from the ceph-users email list`_.
+
+Ceph-devel email list
+=====================
+
+You can also `subscribe to the ceph-devel email list`_. You should do so if
+your issue is:
+
+- Likely related to a bug
+- Related to a development release package
+- Related to a development testing package
+- Related to your own builds
+
+If you no longer want to receive emails from the ``ceph-devel`` email list, you
+can `unsubscribe from the ceph-devel email list`_.
+
+Ceph report
+===========
+
+.. tip:: Community members can help you if you provide them with detailed
+   information about your problem. Attach the output of the ``ceph report``
+   command to help people understand your issues.
+
+.. _subscribe to the ceph-devel email list: mailto:dev-join@ceph.io
+.. _unsubscribe from the ceph-devel email list: mailto:dev-leave@ceph.io
+.. _subscribe to the ceph-users email list: mailto:ceph-users-join@ceph.io
+.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@ceph.io
diff --git a/doc/rados/troubleshooting/cpu-profiling.rst b/doc/rados/troubleshooting/cpu-profiling.rst
new file mode 100644
index 000000000..b7fdd1d41
--- /dev/null
+++ b/doc/rados/troubleshooting/cpu-profiling.rst
@@ -0,0 +1,80 @@
+===============
+ CPU Profiling
+===============
+
+If you built Ceph from source and compiled it for use with `oprofile`_,
+you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details.
+
+
+Initializing oprofile
+=====================
+
+``oprofile`` must be initialized the first time it is used. Locate the
+``vmlinux`` image that corresponds to the kernel you are running:
+
+.. prompt:: bash $
+
+   ls /boot
+   sudo opcontrol --init
+   sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6
+
+
+Starting oprofile
+=================
+
+Run the following command to start ``oprofile``: 
+
+.. prompt:: bash $
+
+   opcontrol --start
+
+
+Stopping oprofile
+=================
+
+Run the following command to stop ``oprofile``: 
+
+.. prompt:: bash $
+
+   opcontrol --stop
+    
+    
+Retrieving oprofile Results
+===========================
+
+Run the following command to retrieve the top ``cmon`` results: 
+
+.. prompt:: bash $
+
+   opreport -gal ./cmon | less    
+    
+
+Run the following command to retrieve the top ``cmon`` results, with call
+graphs attached: 
+
+.. prompt:: bash $
+
+   opreport -cal ./cmon | less    
+    
+.. important:: After you have reviewed the results, reset ``oprofile`` before
+   running it again. The act of resetting ``oprofile`` removes data from the
+   session directory.
+
+
+Resetting oprofile
+==================
+
+Run the following command to reset ``oprofile``:  
+
+.. prompt:: bash $
+
+   sudo opcontrol --reset   
+   
+.. important:: Reset ``oprofile`` after analyzing data. This ensures that 
+   results from prior tests do not get mixed in with the results of the current
+   test. 
+
+.. _oprofile: http://oprofile.sourceforge.net/about/
+.. _Installing Oprofile: ../../../dev/cpu-profiler
+
+
diff --git a/doc/rados/troubleshooting/index.rst b/doc/rados/troubleshooting/index.rst
new file mode 100644
index 000000000..b481ee1dc
--- /dev/null
+++ b/doc/rados/troubleshooting/index.rst
@@ -0,0 +1,19 @@
+=================
+ Troubleshooting
+=================
+
+You may encounter situations that require you to examine your configuration,
+consult the documentation, modify your logging output, troubleshoot monitors
+and OSDs, profile memory and CPU usage, and, in the last resort, reach out to
+the Ceph community for help.
+
+.. toctree::
+   :maxdepth: 1
+   
+   community
+   log-and-debug
+   troubleshooting-mon
+   troubleshooting-osd
+   troubleshooting-pg
+   memory-profiling
+   cpu-profiling
diff --git a/doc/rados/troubleshooting/log-and-debug.rst b/doc/rados/troubleshooting/log-and-debug.rst
new file mode 100644
index 000000000..929c3f53f
--- /dev/null
+++ b/doc/rados/troubleshooting/log-and-debug.rst
@@ -0,0 +1,430 @@
+=======================
+ Logging and Debugging
+=======================
+
+Ceph component debug log levels can be adjusted at runtime, while services are
+running. In some circumstances you might want to adjust debug log levels in
+``ceph.conf`` or in the central config store. Increased debug logging can be
+useful if you are encountering issues when operating your cluster.  By default,
+Ceph log files are in ``/var/log/ceph``.
+
+.. tip:: Remember that debug output can slow down your system, and that this
+   latency sometimes hides race conditions.
+
+Debug logging is resource intensive. If you encounter a problem in a specific
+component of your cluster, begin troubleshooting by enabling logging for only
+that component of the cluster. For example, if your OSDs are running without
+errors, but your metadata servers are not, enable logging for any specific
+metadata server instances that are having problems. Continue by enabling
+logging for each subsystem only as needed.
+
+.. important:: Verbose logging sometimes generates over 1 GB of data per hour.
+   If the disk that your operating system runs on (your "OS disk") reaches its
+   capacity, the node associated with that disk will stop working.
+
+Whenever you enable or increase the rate of debug logging, make sure that you
+have ample capacity for log files, as this may dramatically increase their
+size.  For details on rotating log files, see `Accelerating Log Rotation`_.
+When your system is running well again, remove unnecessary debugging settings
+in order to ensure that your cluster runs optimally. Logging debug-output
+messages is a slow process and a potential waste of your cluster's resources.
+
+For details on available settings, see `Subsystem, Log and Debug Settings`_.
+
+Runtime
+=======
+
+To see the configuration settings at runtime, log in to a host that has a
+running daemon and run a command of the following form:
+
+.. prompt:: bash $
+
+   ceph daemon {daemon-name} config show | less
+
+For example:
+
+.. prompt:: bash $
+
+   ceph daemon osd.0 config show | less
+
+To activate Ceph's debugging output (that is, the ``dout()`` logging function)
+at runtime, inject arguments into the runtime configuration by running a ``ceph
+tell`` command of the following form:
+
+..  prompt:: bash $
+
+    ceph tell {daemon-type}.{daemon id or *} config set {name} {value}
+
+Here ``{daemon-type}`` is ``osd``, ``mon``, or ``mds``. Apply the runtime
+setting either to a specific daemon (by specifying its ID) or to all daemons of
+a particular type (by using the ``*`` operator).  For example, to increase
+debug logging for a specific ``ceph-osd`` daemon named ``osd.0``, run the
+following command:
+
+..  prompt:: bash $
+
+    ceph tell osd.0 config set debug_osd 0/5
+
+The ``ceph tell`` command goes through the monitors. However, if you are unable
+to bind to the monitor, there is another method that can be used to activate
+Ceph's debugging output: log in to the host of a specific daemon and use the
+``ceph daemon`` command to change that daemon's configuration. For example:
+
+.. prompt:: bash $
+
+   sudo ceph daemon osd.0 config set debug_osd 0/5
+
+For details on available settings, see `Subsystem, Log and Debug Settings`_.
+
+
+Boot Time
+=========
+
+To activate Ceph's debugging output (that is, the ``dout()`` logging function)
+at boot time, you must add settings to your Ceph configuration file.
+Subsystems that are common to all daemons are set under ``[global]`` in the
+configuration file. Subsystems for a specific daemon are set under the relevant
+daemon section in the configuration file (for example, ``[mon]``, ``[osd]``,
+``[mds]``). Here is an example that shows possible debugging settings in a Ceph
+configuration file:
+
+.. code-block:: ini
+
+    [global]
+        debug_ms = 1/5
+        
+    [mon]
+        debug_mon = 20
+        debug_paxos = 1/5
+        debug_auth = 2
+         
+    [osd]
+        debug_osd = 1/5
+        debug_filestore = 1/5
+        debug_journal = 1
+        debug_monc = 5/20
+         
+    [mds]
+        debug_mds = 1
+        debug_mds_balancer = 1
+
+
+For details, see `Subsystem, Log and Debug Settings`_.
+
+
+Accelerating Log Rotation
+=========================
+
+If your log filesystem is nearly full, you can accelerate log rotation by
+modifying the Ceph log rotation file at ``/etc/logrotate.d/ceph``. To increase
+the frequency of log rotation (which will guard against a filesystem reaching
+capacity), add a ``size`` directive after the ``weekly`` frequency directive.
+To smooth out volume spikes, consider changing ``weekly`` to ``daily`` and
+consider changing ``rotate`` to ``30``. The procedure for adding the size
+setting is shown immediately below. 
+
+#. Note the default settings of the ``/etc/logrotate.d/ceph`` file::
+
+      rotate 7
+      weekly
+      compress
+      sharedscripts
+
+#. Modify them by adding a ``size`` setting::
+
+      rotate 7
+      weekly
+      size 500M
+      compress
+      sharedscripts
+
+#. Start the crontab editor for your user space:
+
+   .. prompt:: bash $
+
+      crontab -e
+
+#. Add an entry to crontab that instructs cron to check the
+   ``/etc/logrotate.d/ceph`` file::
+
+      30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1
+
+In this example, the ``/etc/logrotate.d/ceph`` file will be checked once an hour,
+at 30 minutes past the hour. To check every 30 minutes, use ``*/30`` instead.
+
+Valgrind
+========
+
+When you are debugging your cluster's performance, you might find it necessary
+to track down memory and threading issues. The Valgrind tool suite can be used
+to detect problems in a specific daemon, in a particular type of daemon, or in
+the entire cluster. Because Valgrind is computationally expensive, it should be
+used only when developing or debugging Ceph, and it will slow down your system
+if used at other times. Valgrind messages are logged to ``stderr``. 
+
+
+Subsystem, Log and Debug Settings
+=================================
+
+Debug logging output is typically enabled via subsystems. 
+
+Ceph Subsystems
+---------------
+
+For each subsystem, there is a logging level for its output logs (a so-called
+"log level") and a logging level for its in-memory logs (a so-called "memory
+level"). Different values may be set for these two logging levels in each
+subsystem. Ceph's logging levels operate on a scale of ``1`` to ``20``, where
+``1`` is terse and ``20`` is verbose [#f1]_.  As a general rule, the in-memory
+logs are not sent to the output log unless one or more of the following
+conditions obtain:
+
+- a fatal signal is raised, or
+- an ``assert`` in source code is triggered, or
+- the logs are requested. For details, see the `documentation on the admin socket
+  <http://docs.ceph.com/en/latest/man/8/ceph/#daemon>`_.
+
+.. warning ::
+   .. [#f1] In certain rare cases, there are logging levels that can take a value greater than 20. The resulting logs are extremely verbose.
+
+Log levels and memory levels can be set either together or separately. If a
+subsystem is assigned a single value, then that value determines both the log
+level and the memory level. For example, ``debug ms = 5`` will give the ``ms``
+subsystem a log level of ``5`` and a memory level of ``5``.  On the other hand,
+if a subsystem is assigned two values that are separated by a forward slash
+(/), then the first value determines the log level and the second value
+determines the memory level. For example, ``debug ms = 1/5`` will give the
+``ms`` subsystem a log level of ``1`` and a memory level of ``5``. See the
+following:
+
+.. code-block:: ini 
+
+    debug {subsystem} = {log-level}/{memory-level}
+    #for example
+    debug mds balancer = 1/20
+
+The following table provides a list of Ceph subsystems and their default log and
+memory levels. Once you complete your logging efforts, restore the subsystems
+to their default level or to a level suitable for normal operations.
+
++--------------------------+-----------+--------------+
+| Subsystem                | Log Level | Memory Level |
++==========================+===========+==============+
+| ``default``              |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``lockdep``              |     0     |      1       |
++--------------------------+-----------+--------------+
+| ``context``              |     0     |      1       |
++--------------------------+-----------+--------------+
+| ``crush``                |     1     |      1       |
++--------------------------+-----------+--------------+
+| ``mds``                  |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``mds balancer``         |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``mds log``              |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``mds log expire``       |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``mds migrator``         |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``buffer``               |     0     |      1       |
++--------------------------+-----------+--------------+
+| ``timer``                |     0     |      1       |
++--------------------------+-----------+--------------+
+| ``filer``                |     0     |      1       |
++--------------------------+-----------+--------------+
+| ``striper``              |     0     |      1       |
++--------------------------+-----------+--------------+
+| ``objecter``             |     0     |      1       |
++--------------------------+-----------+--------------+
+| ``rados``                |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``rbd``                  |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``rbd mirror``           |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``rbd replay``           |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``rbd pwl``              |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``journaler``            |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``objectcacher``         |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``immutable obj cache``  |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``client``               |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``osd``                  |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``optracker``            |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``objclass``             |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``filestore``            |     1     |      3       |
++--------------------------+-----------+--------------+
+| ``journal``              |     1     |      3       |
++--------------------------+-----------+--------------+
+| ``ms``                   |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``mon``                  |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``monc``                 |     0     |      10      |
++--------------------------+-----------+--------------+
+| ``paxos``                |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``tp``                   |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``auth``                 |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``crypto``               |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``finisher``             |     1     |      1       |
++--------------------------+-----------+--------------+
+| ``reserver``             |     1     |      1       |
++--------------------------+-----------+--------------+
+| ``heartbeatmap``         |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``perfcounter``          |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``rgw``                  |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``rgw sync``             |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``rgw datacache``        |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``rgw access``           |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``rgw dbstore``          |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``javaclient``           |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``asok``                 |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``throttle``             |     1     |      1       |
++--------------------------+-----------+--------------+
+| ``refs``                 |     0     |      0       |
++--------------------------+-----------+--------------+
+| ``compressor``           |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``bluestore``            |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``bluefs``               |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``bdev``                 |     1     |      3       |
++--------------------------+-----------+--------------+
+| ``kstore``               |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``rocksdb``              |     4     |      5       |
++--------------------------+-----------+--------------+
+| ``leveldb``              |     4     |      5       |
++--------------------------+-----------+--------------+
+| ``fuse``                 |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``mgr``                  |     2     |      5       |
++--------------------------+-----------+--------------+
+| ``mgrc``                 |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``dpdk``                 |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``eventtrace``           |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``prioritycache``        |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``test``                 |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``cephfs mirror``        |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``cephsqlite``           |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore``             |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore onode``       |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore odata``       |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore omap``        |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore tm``          |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore t``           |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore cleaner``     |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore epm``         |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore lba``         |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore fixedkv tree``|     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore cache``       |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore journal``     |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore device``      |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``seastore backref``     |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``alienstore``           |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``mclock``               |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``cyanstore``            |     0     |      5       |
++--------------------------+-----------+--------------+
+| ``ceph exporter``        |     1     |      5       |
++--------------------------+-----------+--------------+
+| ``memstore``             |     1     |      5       |
++--------------------------+-----------+--------------+
+
+
+Logging Settings
+----------------
+
+It is not necessary to specify logging and debugging settings in the Ceph
+configuration file, but you may override default settings when needed. Ceph
+supports the following settings:
+
+.. confval:: log_file
+.. confval:: log_max_new
+.. confval:: log_max_recent
+.. confval:: log_to_file
+.. confval:: log_to_stderr
+.. confval:: err_to_stderr
+.. confval:: log_to_syslog
+.. confval:: err_to_syslog
+.. confval:: log_flush_on_exit
+.. confval:: clog_to_monitors
+.. confval:: clog_to_syslog
+.. confval:: mon_cluster_log_to_syslog
+.. confval:: mon_cluster_log_file
+
+OSD
+---
+
+.. confval:: osd_debug_drop_ping_probability
+.. confval:: osd_debug_drop_ping_duration
+
+Filestore
+---------
+
+.. confval:: filestore_debug_omap_check
+
+MDS
+---
+
+- :confval:`mds_debug_scatterstat`
+- :confval:`mds_debug_frag`
+- :confval:`mds_debug_auth_pins`
+- :confval:`mds_debug_subtrees`
+
+RADOS Gateway
+-------------
+
+- :confval:`rgw_log_nonexistent_bucket`
+- :confval:`rgw_log_object_name`
+- :confval:`rgw_log_object_name_utc`
+- :confval:`rgw_enable_ops_log`
+- :confval:`rgw_enable_usage_log`
+- :confval:`rgw_usage_log_flush_threshold`
+- :confval:`rgw_usage_log_tick_interval`
diff --git a/doc/rados/troubleshooting/memory-profiling.rst b/doc/rados/troubleshooting/memory-profiling.rst
new file mode 100644
index 000000000..8e58f2d76
--- /dev/null
+++ b/doc/rados/troubleshooting/memory-profiling.rst
@@ -0,0 +1,203 @@
+==================
+ Memory Profiling
+==================
+
+Ceph Monitor, OSD, and MDS can report ``TCMalloc`` heap profiles. Install
+``google-perftools`` if you want to generate these. Your OS distribution might
+package this under a different name (for example, ``gperftools``), and your OS
+distribution might use a different package manager. Run a command similar to
+this one to install ``google-perftools``: 
+
+.. prompt:: bash 
+
+    sudo apt-get install google-perftools
+
+The profiler dumps output to your ``log file`` directory (``/var/log/ceph``).
+See `Logging and Debugging`_ for details.
+
+To view the profiler logs with Google's performance tools, run the following
+command:
+
+.. prompt:: bash
+
+    google-pprof --text {path-to-daemon}  {log-path/filename}
+
+For example::
+
+    $ ceph tell osd.0 heap start_profiler
+    $ ceph tell osd.0 heap dump
+    osd.0 tcmalloc heap stats:------------------------------------------------
+    MALLOC:        2632288 (    2.5 MiB) Bytes in use by application
+    MALLOC: +       499712 (    0.5 MiB) Bytes in page heap freelist
+    MALLOC: +       543800 (    0.5 MiB) Bytes in central cache freelist
+    MALLOC: +       327680 (    0.3 MiB) Bytes in transfer cache freelist
+    MALLOC: +      1239400 (    1.2 MiB) Bytes in thread cache freelists
+    MALLOC: +      1142936 (    1.1 MiB) Bytes in malloc metadata
+    MALLOC:   ------------
+    MALLOC: =      6385816 (    6.1 MiB) Actual memory used (physical + swap)
+    MALLOC: +            0 (    0.0 MiB) Bytes released to OS (aka unmapped)
+    MALLOC:   ------------
+    MALLOC: =      6385816 (    6.1 MiB) Virtual address space used
+    MALLOC:
+    MALLOC:            231              Spans in use
+    MALLOC:             56              Thread heaps in use
+    MALLOC:           8192              Tcmalloc page size
+    ------------------------------------------------
+    Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
+    Bytes released to the OS take up virtual address space but no physical memory.
+    $ google-pprof --text \
+                   /usr/bin/ceph-osd  \
+                   /var/log/ceph/ceph-osd.0.profile.0001.heap
+     Total: 3.7 MB
+     1.9  51.1%  51.1%      1.9  51.1% ceph::log::Log::create_entry
+     1.8  47.3%  98.4%      1.8  47.3% std::string::_Rep::_S_create
+     0.0   0.4%  98.9%      0.0   0.6% SimpleMessenger::add_accept_pipe
+     0.0   0.4%  99.2%      0.0   0.6% decode_message
+     ...
+
+Performing another heap dump on the same daemon creates another file. It is
+convenient to compare the new file to a file created by a previous heap dump to
+show what has grown in the interval. For example::
+
+    $ google-pprof --text --base out/osd.0.profile.0001.heap \
+          ceph-osd out/osd.0.profile.0003.heap
+     Total: 0.2 MB
+     0.1  50.3%  50.3%      0.1  50.3% ceph::log::Log::create_entry
+     0.1  46.6%  96.8%      0.1  46.6% std::string::_Rep::_S_create
+     0.0   0.9%  97.7%      0.0  26.1% ReplicatedPG::do_op
+     0.0   0.8%  98.5%      0.0   0.8% __gnu_cxx::new_allocator::allocate
+
+See `Google Heap Profiler`_ for additional details.
+
+After you have installed the heap profiler, start your cluster and begin using
+the heap profiler. You can enable or disable the heap profiler at runtime, or
+ensure that it runs continuously. When running commands based on the examples
+that follow, do the following:
+
+#. replace ``{daemon-type}`` with ``mon``, ``osd`` or ``mds`` 
+#. replace ``{daemon-id}`` with the OSD number or the MON ID or the MDS ID 
+
+
+Starting the Profiler
+---------------------
+
+To start the heap profiler, run a command of the following form: 
+
+.. prompt:: bash
+
+   ceph tell {daemon-type}.{daemon-id} heap start_profiler
+
+For example:
+
+.. prompt:: bash
+
+   ceph tell osd.1 heap start_profiler
+
+Alternatively, if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in the
+environment, the profiler will be started when the daemon starts running.
+
+Printing Stats
+--------------
+
+To print out statistics, run a command of the following form:
+
+.. prompt:: bash
+
+   ceph tell {daemon-type}.{daemon-id} heap stats
+
+For example:
+
+.. prompt:: bash
+
+   ceph tell osd.0 heap stats
+
+.. note:: The reporting of stats with this command does not require the
+   profiler to be running and does not dump the heap allocation information to
+   a file.
+
+
+Dumping Heap Information
+------------------------
+
+To dump heap information, run a command of the following form:
+
+.. prompt:: bash
+
+   ceph tell {daemon-type}.{daemon-id} heap dump
+
+For example:
+
+.. prompt:: bash
+
+   ceph tell mds.a heap dump
+
+.. note:: Dumping heap information works only when the profiler is running.
+
+
+Releasing Memory
+----------------
+
+To release memory that ``tcmalloc`` has allocated but which is not being used
+by the Ceph daemon itself, run a command of the following form:
+
+.. prompt:: bash
+
+   ceph tell {daemon-type}.{daemon-id} heap release
+
+For example:
+
+.. prompt:: bash
+
+    ceph tell osd.2 heap release
+
+
+Stopping the Profiler
+---------------------
+
+To stop the heap profiler, run a command of the following form:
+
+.. prompt:: bash
+
+   ceph tell {daemon-type}.{daemon-id} heap stop_profiler
+
+For example:
+
+.. prompt:: bash
+
+   ceph tell osd.0 heap stop_profiler
+
+.. _Logging and Debugging: ../log-and-debug
+.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html
+
+Alternative Methods of Memory Profiling
+----------------------------------------
+
+Running Massif heap profiler with Valgrind
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Massif heap profiler tool can be used with Valgrind to measure how much
+heap memory is used. This method is well-suited to troubleshooting RadosGW.
+
+See the `Massif documentation
+<https://valgrind.org/docs/manual/ms-manual.html>`_ for more information.
+
+Install Valgrind from the package manager for your distribution, then start the
+Ceph daemon you want to troubleshoot:
+
+.. prompt:: bash
+
+   sudo -u ceph valgrind --max-threads=1024 --tool=massif /usr/bin/radosgw -f --cluster ceph --name NAME --setuser ceph --setgroup ceph
+
+When this command has completed its run, a file with a name of the form
+``massif.out.<pid>`` will be saved in your current working directory. To run
+the command above, the user who runs it must have write permissions in the
+current directory.
+
+Run the ``ms_print`` command to get a graph and statistics from the collected
+data in the ``massif.out.<pid>`` file:
+
+.. prompt:: bash
+
+   ms_print massif.out.12345
+
+The output of this command is helpful when submitting a bug report.
diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst
new file mode 100644
index 000000000..1170da7c3
--- /dev/null
+++ b/doc/rados/troubleshooting/troubleshooting-mon.rst
@@ -0,0 +1,713 @@
+.. _rados-troubleshooting-mon:
+
+==========================
+ Troubleshooting Monitors
+==========================
+
+.. index:: monitor, high availability
+
+Even if a cluster experiences monitor-related problems, the cluster is not
+necessarily in danger of going down. If a cluster has lost multiple monitors,
+it can still remain up and running as long as there are enough surviving
+monitors to form a quorum.
+   
+If your cluster is having monitor-related problems, we recommend that you
+consult the following troubleshooting information.
+
+Initial Troubleshooting
+=======================
+
+The first steps in the process of troubleshooting Ceph Monitors involve making
+sure that the Monitors are running and that they are able to communicate over
+the network. Follow the steps in this section to rule out
+the simplest causes of Monitor malfunction.
+
+#. **Make sure that the Monitors are running.**
+
+    Make sure that the Monitor (*mon*) daemon processes (``ceph-mon``) are
+    running. It might be the case that the mons have not been restarted after an
+    upgrade. Checking for this simple oversight can save hours of painstaking
+    troubleshooting. 
+    
+    It is also important to make sure that the manager daemons (``ceph-mgr``)
+    are running. Remember that typical cluster configurations provide one
+    Manager (``ceph-mgr``) for each Monitor (``ceph-mon``).
+
+    .. note:: In releases prior to v1.12.5, Rook will not run more than two
+       managers.
+
+#. **Make sure that you can reach the Monitor nodes.**
+
+    In certain rare cases, ``iptables`` rules might be blocking access to
+    Monitor nodes or TCP ports. These rules might be left over from earlier
+    stress testing or rule development. To check for the presence of such
+    rules, SSH into each Monitor node and use ``telnet`` or ``nc`` or a similar
+    tool to attempt to connect to each of the other Monitor nodes on ports
+    ``tcp/3300`` and ``tcp/6789``. 
+
+#. **Make sure that the "ceph status" command runs and receives a reply from the cluster.**
+
+    If the ``ceph status`` command receives a reply from the cluster, then the
+    cluster is up and running. Monitors respond to a ``status`` request only if
+    a quorum has been formed. Confirm that one or more ``mgr`` daemons are
+    reported as running. In a cluster with no deficiencies, ``ceph status``
+    will report that all ``mgr`` daemons are running.
+
+    If the ``ceph status`` command does not receive a reply from the cluster,
+    then there are probably not enough Monitors ``up`` to form a quorum. If the
+    ``ceph -s`` command is run with no further options specified, it connects
+    to an arbitrarily selected Monitor. In certain cases, however, it might be
+    helpful to connect to a specific Monitor (or to several specific Monitors
+    in sequence) by adding the ``-m`` flag to the command: for example, ``ceph
+    status -m mymon1``.
+
+#. **None of this worked. What now?**
+
+    If the above solutions have not resolved your problems, you might find it
+    helpful to examine each individual Monitor in turn. Even if no quorum has
+    been formed, it is possible to contact each Monitor individually and
+    request its status by using the ``ceph tell mon.ID mon_status`` command
+    (here ``ID`` is the Monitor's identifier).
+
+    Run the ``ceph tell mon.ID mon_status`` command for each Monitor in the
+    cluster. For more on this command's output, see :ref:`Understanding
+    mon_status
+    <rados_troubleshoting_troubleshooting_mon_understanding_mon_status>`.
+
+    There is also an alternative method for contacting each individual Monitor:
+    SSH into each Monitor node and query the daemon's admin socket. See
+    :ref:`Using the Monitor's Admin
+    Socket<rados_troubleshoting_troubleshooting_mon_using_admin_socket>`.
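+
+The reachability check in step 2 can be sketched as a small shell helper. This
+is a minimal sketch and not part of Ceph: the function names
+``mon_port_open`` and ``check_mon_peers`` are hypothetical, and the helper
+assumes that ``bash`` and the coreutils ``timeout`` command are available on
+the Monitor node.
+

```shell
# Hypothetical helper: succeed if a TCP connection to HOST ($1) on PORT ($2)
# can be opened within three seconds. Relies on bash's /dev/tcp pseudo-device.
mon_port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Check the two Monitor ports named above (3300 and 6789) on every host
# passed as an argument and print one result line per host and port.
check_mon_peers() {
    for host in "$@"; do
        for port in 3300 6789; do
            if mon_port_open "$host" "$port"; then
                echo "$host:$port reachable"
            else
                echo "$host:$port UNREACHABLE"
            fi
        done
    done
}
```

+
+Run it from each Monitor node against the other Monitor nodes, for example
+``check_mon_peers mon2 mon3`` (hypothetical hostnames). An ``UNREACHABLE``
+line only shows that the connection failed; it does not say whether a firewall
+rule or a stopped daemon is the cause.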
+
+.. _rados_troubleshoting_troubleshooting_mon_using_admin_socket:
+
+Using the monitor's admin socket
+================================
+
+A monitor's admin socket allows you to interact directly with a specific daemon
+by using a Unix socket file. This file is found in the monitor's ``run``
+directory. The admin socket's default path is
+``/var/run/ceph/ceph-mon.ID.asok``, but this can be overridden and the admin
+socket might be elsewhere, especially if your cluster's daemons are deployed in
+containers. If you cannot find it, either check your ``ceph.conf`` for an
+alternative path or run the following command:
+    
+.. prompt:: bash $
+
+   ceph-conf --name mon.ID --show-config-value admin_socket
+
+The admin socket is available for use only when the monitor daemon is running.
+Whenever the monitor has been properly shut down, the admin socket is removed.
+However, if the monitor is not running and the admin socket persists, it is
+likely that the monitor has been improperly shut down.  In any case, if the
+monitor is not running, it will be impossible to use the admin socket, and the
+``ceph`` command is likely to return ``Error 111: Connection Refused``.
+
+To access the admin socket, run a ``ceph tell`` command of the following form
+(specifying the daemon that you are interested in):
+
+.. prompt:: bash $
+
+   ceph tell mon.<id> mon_status
+
+This command passes a ``mon_status`` command to the specific running monitor
+daemon ``<id>`` via its admin socket. If you know the full path to the admin
+socket file, this can be done more directly by running the following command:
+
+.. prompt:: bash $
+
+   ceph --admin-daemon <full_path_to_asok_file> <command>
+
+Running ``ceph help`` shows all supported commands that are available through
+the admin socket. See especially ``config get``, ``config show``, ``mon stat``,
+and ``quorum_status``.
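+
+The direct admin-socket form can be wrapped in a helper that first checks
+whether the socket file exists. This is a minimal sketch under stated
+assumptions: the function names are hypothetical, and the path is the default
+one given above, which containerized deployments usually do not use.
+

```shell
# Hypothetical helper: print the default admin socket path for Monitor ID ($1).
default_mon_asok() {
    echo "/var/run/ceph/ceph-mon.$1.asok"
}

# Hypothetical helper: send a command to a Monitor through its admin socket,
# but only if the socket file exists (-S tests for a Unix socket).
mon_asok_query() {
    asok=$(default_mon_asok "$1")
    shift
    if [ -S "$asok" ]; then
        ceph --admin-daemon "$asok" "$@"
    else
        echo "no admin socket at $asok (is the monitor running?)" >&2
        return 1
    fi
}
```

+
+For example, ``mon_asok_query a mon_status`` queries ``mon.a`` on the local
+node, or reports that no socket was found.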
+
+.. _rados_troubleshoting_troubleshooting_mon_understanding_mon_status:
+
+Understanding mon_status
+========================
+
+The status of the monitor (as reported by the ``ceph tell mon.X mon_status``
+command) can always be obtained via the admin socket. This command outputs a
+great deal of information about the monitor (including the information found in
+the output of the ``quorum_status`` command).
+
+To understand this command's output, let us consider the following example, in
+which we see the output of ``ceph tell mon.c mon_status``::
+
+  { "name": "c",
+    "rank": 2,
+    "state": "peon",
+    "election_epoch": 38,
+    "quorum": [
+          1,
+          2],
+    "outside_quorum": [],
+    "extra_probe_peers": [],
+    "sync_provider": [],
+    "monmap": { "epoch": 3,
+        "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
+        "modified": "2013-10-30 04:12:01.945629",
+        "created": "2013-10-29 14:14:41.914786",
+        "mons": [
+              { "rank": 0,
+                "name": "a",
+                "addr": "127.0.0.1:6789\/0"},
+              { "rank": 1,
+                "name": "b",
+                "addr": "127.0.0.1:6790\/0"},
+              { "rank": 2,
+                "name": "c",
+                "addr": "127.0.0.1:6795\/0"}]}}
+
+It is clear that there are three monitors in the monmap (*a*, *b*, and *c*),
+the quorum is formed by only two monitors, and *c* is in the quorum as a
+*peon*.
+
+**Which monitor is out of the quorum?**
+
+  The answer is **a** (that is, ``mon.a``).
+
+**Why?**
+
+  When the ``quorum`` set is examined, there are clearly two monitors in the
+  set: *1* and *2*. But these are not monitor names. They are monitor ranks, as
+  established in the current ``monmap``. The ``quorum`` set does not include
+  the monitor that has rank 0, and according to the ``monmap`` that monitor is
+  ``mon.a``.
+
+**How are monitor ranks determined?**
+
+  Monitor ranks are calculated (or recalculated) whenever monitors are added or
+  removed. The calculation of ranks follows a simple rule: monitors are
+  ordered by their ``IP:PORT`` combinations, and the **lower** the ``IP:PORT``
+  combination, the **lower** the rank number. In this case, because
+  ``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations,
+  ``mon.a`` has the numerically lowest rank: namely, rank 0.
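+
+The reasoning above (compare the ranks in ``quorum`` with the ranks in the
+``monmap``) can be automated. The following is a rough sketch, assuming that
+the JSON has the same layout as the example above, with ``rank`` immediately
+followed by ``name`` in each ``mons`` entry. The function name
+``mons_outside_quorum`` is hypothetical, and a real JSON parser such as ``jq``
+would be more robust than this text processing.
+

```shell
# Hypothetical helper: read mon_status JSON on stdin and print every monitor
# whose rank is listed in the monmap but missing from the "quorum" array.
mons_outside_quorum() {
    # Remove all whitespace so that the patterns below see one compact line.
    compact=$(tr -d ' \n\t')
    # Extract the comma-separated rank list, for example "1,2".
    quorum=$(printf '%s' "$compact" |
        sed -n 's/.*"quorum":\[\([0-9,]*\)\].*/\1/p')
    # Turn each monmap entry into "RANK NAME" and test it against the list.
    printf '%s\n' "$compact" |
        grep -o '"rank":[0-9]*,"name":"[^"]*"' |
        sed 's/"rank":\([0-9]*\),"name":"\([^"]*\)"/\1 \2/' |
        while read -r rank name; do
            case ",$quorum," in
                *",$rank,"*) ;;
                *) echo "mon.$name (rank $rank)" ;;
            esac
        done
}
```

+
+A possible invocation is ``ceph tell mon.c mon_status | mons_outside_quorum``.
+For the example output above, this prints ``mon.a (rank 0)``.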
+  
+
+Most Common Monitor Issues
+===========================
+
+The Cluster Has Quorum but at Least One Monitor is Down
+-------------------------------------------------------
+
+When the cluster has quorum but at least one monitor is down, ``ceph health
+detail`` returns a message similar to the following::
+
+      $ ceph health detail
+      [snip]
+      mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)
+
+**How do I troubleshoot a Ceph cluster that has quorum but also has at least one monitor down?**
+
+  #. Make sure that ``mon.a`` is running.
+
+  #. Make sure that you can connect to ``mon.a``'s node from the
+     other Monitor nodes. Check the TCP ports as well. Check ``iptables`` and
+     ``nf_conntrack`` on all nodes and make sure that you are not
+     dropping/rejecting connections.
+
+  If this initial troubleshooting doesn't solve your problem, then further
+  investigation is necessary.
+
+  First, check the problematic monitor's ``mon_status`` via the admin
+  socket as explained in `Using the monitor's admin socket`_ and
+  `Understanding mon_status`_.
+
+  If the Monitor is out of the quorum, then its state will be one of the
+  following: ``probing``, ``electing`` or ``synchronizing``. If the state of
+  the Monitor is ``leader`` or ``peon``, then the Monitor believes itself to be
+  in quorum but the rest of the cluster believes that it is not in quorum. It
+  is possible that a Monitor that is in one of the ``probing``, ``electing``,
+  or ``synchronizing`` states has entered the quorum during the process of
+  troubleshooting. Check ``ceph status`` again to determine whether the Monitor
+  has entered quorum during your troubleshooting. If the Monitor remains out of
+  the quorum, then proceed with the investigations described in this section of
+  the documentation.
+  
+
+**What does it mean when a Monitor's state is ``probing``?**
+
+  If ``ceph health detail`` shows that a Monitor's state is
+  ``probing``, then the Monitor is still looking for the other Monitors. Every
+  Monitor remains in this state for some time when it is started. When a
+  Monitor has connected to the other Monitors specified in the ``monmap``, it
+  ceases to be in the ``probing`` state. The amount of time that a Monitor is
+  in the ``probing`` state depends upon the parameters of the cluster of which
+  it is a part. For example, when a Monitor is a part of a single-monitor
+  cluster (never do this in production), the monitor passes through the probing
+  state almost instantaneously. In a multi-monitor cluster, the Monitors stay
+  in the ``probing`` state until they find enough monitors to form a quorum
+  |---| this means that if two out of three Monitors in the cluster are
+  ``down``, the one remaining Monitor stays in the ``probing``  state
+  indefinitely until you bring one of the other monitors up.
+
+  If quorum has been established, then the Monitor daemon should be able to
+  find the other Monitors quickly, as long as they can be reached. If a Monitor
+  is stuck in the ``probing`` state and you have exhausted the procedures above
+  that describe the troubleshooting of communications between the Monitors,
+  then it is possible that the problem Monitor is trying to reach the other
+  Monitors at a wrong address. ``mon_status`` outputs the ``monmap`` that is
+  known to the monitor: determine whether the other Monitors' locations as
+  specified in the ``monmap`` match the locations of the Monitors in the
+  network. If they do not, see `Recovering a Monitor's Broken monmap`_.
+  If the locations of the Monitors as specified in the ``monmap`` match the
+  locations of the Monitors in the network, then the persistent
+  ``probing`` state could  be related to severe clock skews amongst the monitor
+  nodes.  See `Clock Skews`_.  If the information in `Clock Skews`_ does not
+  bring the Monitor out of the ``probing`` state, then prepare your system logs
+  and ask the Ceph community for help. See `Preparing your logs`_ for
+  information about the proper preparation of logs.
+
+
+**What does it mean when a Monitor's state is ``electing``?**
+
+  If ``ceph health detail`` shows that a Monitor's state is ``electing``, the
+  monitor is in the middle of an election. Elections typically complete
+  quickly, but sometimes the monitors can get stuck in what is known as an
+  *election storm*. See :ref:`Monitor Elections <dev_mon_elections>` for more
+  on monitor elections.
+  
+  The presence of an election storm might indicate clock skew among the monitor
+  nodes. See `Clock Skews`_ for more information. 
+  
+  If your clocks are properly synchronized, search the mailing lists and bug
+  tracker for issues similar to your issue. The ``electing`` state is not
+  likely to persist. In versions of Ceph after the release of Cuttlefish, there
+  is no obvious reason other than clock skew that explains why an ``electing``
+  state would persist.  
+  
+  It is possible to investigate the cause of a persistent ``electing`` state if
+  you put the problematic Monitor into a ``down`` state while you investigate.
+  This is possible only if there are enough surviving Monitors to form quorum. 
+
+**What does it mean when a Monitor's state is ``synchronizing``?**
+
+  If ``ceph health detail`` shows that the Monitor is ``synchronizing``, the
+  monitor is catching up with the rest of the cluster so that it can join the
+  quorum. The amount of time that it takes for the Monitor to synchronize with
+  the rest of the quorum is a function of the size of the cluster's monitor
+  store, the cluster's size, and the state of the cluster. Larger and degraded
+  clusters generally keep Monitors in the ``synchronizing`` state longer than
+  do smaller, new clusters.
+
+  A Monitor that changes its state from ``synchronizing`` to ``electing`` and
+  then back to ``synchronizing`` indicates a problem: the cluster state may be
+  advancing (that is, generating new maps) too fast for the synchronization
+  process to keep up with the pace of the creation of the new maps. This issue
+  presented more frequently prior to the Cuttlefish release than it does in
+  more recent releases, because the synchronization process has since been
+  refactored and enhanced to avoid this dynamic. If you experience this in
+  later versions, report the issue in the `Ceph bug tracker
+  <https://tracker.ceph.com>`_. Prepare and provide logs to substantiate any
+  bug you raise. See `Preparing your logs`_ for information about the proper
+  preparation of logs.
+
+**What does it mean when a Monitor's state is ``leader`` or ``peon``?**
+
+  If ``ceph health detail`` shows that the Monitor is in the ``leader`` state
+  or in the ``peon`` state, it is likely that clock skew is present. Follow the
+  instructions in `Clock Skews`_. If you have followed those instructions and
+  ``ceph health detail`` still shows that the Monitor is in the ``leader``
+  state or the ``peon`` state, report the issue in the `Ceph bug tracker
+  <https://tracker.ceph.com>`_. If you raise an issue, provide logs to
+  substantiate it. See `Preparing your logs`_ for information about the
+  proper preparation of logs.
+
+
+Recovering a Monitor's Broken ``monmap``
+----------------------------------------
+
+This is how a ``monmap`` usually looks, depending on the number of
+monitors::
+
+
+      epoch 3
+      fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
+      last_changed 2013-10-30 04:12:01.945629
+      created 2013-10-29 14:14:41.914786
+      0: 127.0.0.1:6789/0 mon.a
+      1: 127.0.0.1:6790/0 mon.b
+      2: 127.0.0.1:6795/0 mon.c
+      
+However, your ``monmap`` might not look like this. For instance, some early
+Cuttlefish versions had a bug that could cause the ``monmap`` to be nullified:
+that is, completely filled with zeros. In that case, not even ``monmaptool``
+can make sense of the file. It is also possible for a monitor to end up with a
+severely outdated ``monmap``, notably if its node has been down for months
+while you work through a support case with your vendor's TAC. Such a
+``ceph-mon`` daemon might be unable to find the surviving monitors. For
+example, suppose that ``mon.c`` is down and that, in the meantime, you add a
+new monitor ``mon.d``, remove ``mon.a``, add a new monitor ``mon.e``, and
+remove ``mon.b``. The resulting ``monmap`` is totally different from the one
+that ``mon.c`` knows.
+
+In this situation you have two possible solutions:
+
+Scrap the monitor and redeploy
+
+  You should only take this route if you are positive that you won't
+  lose the information kept by that monitor; that you have other monitors
+  and that they are running just fine so that your new monitor is able
+  to synchronize from the remaining monitors. Keep in mind that destroying
+  a monitor, if there are no other copies of its contents, may lead to
+  loss of data.
+
+Inject a monmap into the monitor
+
+  These are the basic steps:
+
+  Retrieve the ``monmap`` from the surviving monitors and inject it into the
+  monitor whose ``monmap`` is corrupted or lost.
+
+  Implement this solution by carrying out the following procedure:
+
+  1. Is there a quorum of monitors? If so, retrieve the ``monmap`` from the
+     quorum::
+
+      $ ceph mon getmap -o /tmp/monmap
+
+  2. If there is no quorum, then retrieve the ``monmap`` directly from another
+     monitor that has been stopped (in this example, the other monitor has
+     the ID ``ID-FOO``)::
+
+      $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap
+
+  3. Stop the monitor you are going to inject the monmap into.
+
+  4. Inject the monmap::
+
+      $ ceph-mon -i ID --inject-monmap /tmp/monmap
+
+  5. Start the monitor.
+
+  .. warning:: Injecting ``monmaps`` can cause serious problems because doing
+     so will overwrite the latest existing ``monmap`` stored on the monitor. Be
+     careful!
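+
+Because injection overwrites the stored ``monmap``, it can help to rehearse
+the procedure before running it. The following is a minimal sketch, not an
+official tool: the helper names and the ``DRY_RUN`` variable are hypothetical,
+and the ``systemctl`` unit name ``ceph-mon@ID`` is an assumption that holds
+only for package-based (non-containerized) deployments.
+

```shell
# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: inject the monmap file MAP ($2) into Monitor ID ($1).
inject_monmap() {
    id=$1
    map=$2
    # Stop early if monmaptool cannot read the map that is to be injected.
    run monmaptool --print "$map" || return 1
    run systemctl stop "ceph-mon@$id" || return 1
    # Best effort: keep a copy of the map that is about to be overwritten.
    # This can fail if the stored map is corrupted, so errors are ignored.
    run ceph-mon -i "$id" --extract-monmap "$map.old"
    run ceph-mon -i "$id" --inject-monmap "$map" || return 1
    run systemctl start "ceph-mon@$id"
}
```

+
+With ``DRY_RUN=1`` set, ``inject_monmap c /tmp/monmap`` only prints the five
+commands in order, which makes it possible to review them before running the
+procedure for real.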
+
+Clock Skews
+-----------
+
+The Paxos consensus algorithm requires close time synchronization, which means
+that clock skew among the monitors in the quorum can have a serious effect on
+monitor operation. The resulting behavior can be puzzling. To avoid this issue,
+run a clock synchronization tool on your monitor nodes: for example, use
+``Chrony`` or the legacy ``ntpd`` utility. Configure each monitor node so that
+the ``iburst`` option is in effect and so that each monitor has multiple peers,
+including the following: 
+
+* Each other
+* Internal ``NTP`` servers
+* Multiple external, public pool servers
+
+.. note:: The ``iburst`` option sends a burst of eight packets instead of the
+   usual single packet, and is used during the process of getting two peers
+   into initial synchronization.
+
+Furthermore, it is advisable to synchronize *all* nodes in your cluster against
+internal and external servers, and perhaps even against your monitors. Run
+``NTP`` servers on bare metal: VM-virtualized clocks are not suitable for
+steady timekeeping. See `https://www.ntp.org <https://www.ntp.org>`_ for more
+information about the Network Time Protocol (NTP). Your organization might
+already have quality internal ``NTP`` servers available.  Sources for ``NTP``
+server appliances include the following:
+
+* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
+* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
+* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_
+
+Clock Skew Questions and Answers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**What's the maximum tolerated clock skew?**
+
+  By default, monitors allow clocks to drift up to a maximum of 0.05 seconds
+  (50 milliseconds).
+
+**Can I increase the maximum tolerated clock skew?**
+
+  Yes, but we strongly recommend against doing so. The maximum tolerated clock
+  skew is configurable via the ``mon-clock-drift-allowed`` option, but it is
+  almost certainly a bad idea to make changes to this option. The clock skew
+  maximum is in place because clock-skewed monitors cannot be relied upon. The
+  current default value has proven its worth at alerting the user before the
+  monitors encounter serious problems. Changing this value might cause
+  unforeseen effects on the stability of the monitors and overall cluster
+  health.
+
+**How do I know whether there is a clock skew?**
+
+  The monitors will warn you via the cluster status ``HEALTH_WARN``. When clock
+  skew is present, the ``ceph health detail`` and ``ceph status`` commands
+  return an output resembling the following::
+
+      mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)
+
+  In this example, the monitor ``mon.c`` has been flagged as suffering from 
+  clock skew.
+
+  In Luminous and later releases, it is possible to check for a clock skew by
+  running the ``ceph time-sync-status`` command. Note that the lead monitor
+  typically has the numerically lowest IP address. It will always show ``0``:
+  the reported offsets of other monitors are relative to the lead monitor, not
+  to any external reference source.
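+
+When several monitors are affected, the warning lines can be reduced to a
+short list of monitor names and skew values. This is a minimal sketch that
+assumes the warning format shown in the example above; the function name
+``skewed_mons`` is hypothetical.
+

```shell
# Hypothetical helper: read health output on stdin and print "NAME SKEW"
# (skew in seconds) for every line that reports a clock skew.
skewed_mons() {
    awk '/clock skew/ {
        for (i = 1; i < NF; i++) {
            if ($i == "skew") {
                skew = $(i + 1)
                sub(/s$/, "", skew)   # drop the trailing unit, "0.08235s"
            }
        }
        print $1, skew
    }'
}
```

+
+A possible invocation is ``ceph health detail | skewed_mons``. For the example
+line above, the helper prints ``mon.c 0.08235``.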
+
+**What should I do if there is a clock skew?**
+
+  Synchronize your clocks. Using an NTP client might help. However, if you
+  are already using an NTP client and you still encounter clock skew problems,
+  determine whether the NTP server that you are using is remote to your network
+  or instead hosted on your network. Hosting your own NTP servers tends to
+  mitigate clock skew problems.
+
+
+Client Can't Connect or Mount
+-----------------------------
+
+Check your ``iptables`` rules. Some operating-system install utilities add a
+``REJECT`` rule to ``iptables``. Such a rule rejects all clients other than
+``ssh`` that try to connect to the host. If your monitor host's ``iptables``
+configuration has a ``REJECT`` rule in place, clients that connect from a
+separate node will fail with a timeout error. Any ``iptables`` rules that
+reject clients trying to connect to Ceph daemons must be addressed. For
+example::
+
+    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
+
+It might also be necessary to add rules to ``iptables`` on your Ceph hosts to
+ensure that clients are able to access the TCP ports associated with your Ceph
+monitors (default: ports 3300 and 6789) and Ceph OSDs (default: 6800 through
+7300). For example::
+
+    iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 3300,6789,6800:7300 -j ACCEPT
+
+
+Monitor Store Failures
+======================
+
+Symptoms of store corruption
+----------------------------
+
+Ceph monitors store the :term:`Cluster Map` in a key-value store.  If key-value
+store corruption causes a monitor to fail, then the monitor log might contain
+one of the following error messages::
+
+  Corruption: error in middle of record
+
+or::
+
+  Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb
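+
+To find out which monitor logs contain such messages, the log files can be
+searched for the ``Corruption:`` prefix. This is a minimal sketch; the
+function name ``scan_mon_logs_for_corruption`` is hypothetical, and the log
+files to search are passed as arguments.
+

```shell
# Hypothetical helper: print every "Corruption:" line found in the given log
# files, prefixed with the name of the file that contains it.
scan_mon_logs_for_corruption() {
    # /dev/null is appended so that grep prints file names even when only
    # one log file is given.
    grep 'Corruption:' "$@" /dev/null
}
```

+
+For example, ``scan_mon_logs_for_corruption /var/log/ceph/ceph-mon.*.log``
+searches the monitor logs in the default log directory. The helper returns a
+non-zero status when no matching line is found.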
+
+Recovery using healthy monitor(s)
+---------------------------------
+
+If there are surviving monitors, we can always :ref:`replace
+<adding-and-removing-monitors>` the corrupted monitor with a new one. After the
+new monitor boots, it will synchronize with a healthy peer. After the new
+monitor is fully synchronized, it will be able to serve clients.
+
+.. _mon-store-recovery-using-osds:
+
+Recovery using OSDs
+-------------------
+
+Even if all monitors fail at the same time, it is possible to recover the
+monitor store by using information stored in OSDs. You are encouraged to deploy
+at least three (and preferably five) monitors in a Ceph cluster. In such a
+deployment, complete monitor failure is unlikely. However, unplanned power loss
+in a data center whose disk settings or filesystem settings are improperly
+configured could cause the underlying filesystem to fail and this could kill
+all of the monitors. In such a case, data in the OSDs can be used to recover
+the monitors. The following script can be used for this purpose:
+
+
+.. code-block:: bash
+
+  ms=/root/mon-store
+  mkdir $ms
+  
+  # collect the cluster map from stopped OSDs ($hosts must be set to the
+  # space-separated list of OSD hosts before this loop runs)
+  for host in $hosts; do
+    rsync -avz $ms/. user@$host:$ms.remote
+    rm -rf $ms
+    ssh user@$host <<EOF
+      for osd in /var/lib/ceph/osd/ceph-*; do
+        ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
+      done
+  EOF
+    rsync -avz user@$host:$ms.remote/. $ms
+  done
+  
+  # rebuild the monitor store from the collected map, if the cluster does not
+  # use cephx authentication, we can skip the following steps to update the
+  # keyring with the caps, and there is no need to pass the "--keyring" option.
+  # i.e. just use "ceph-monstore-tool $ms rebuild" instead
+  ceph-authtool /path/to/admin.keyring -n mon. \
+    --cap mon 'allow *'
+  ceph-authtool /path/to/admin.keyring -n client.admin \
+    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
+  # add one or more ceph-mgr's key to the keyring. in this case, an encoded key
+  # for mgr.x is added, you can find the encoded key in
+  # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
+  # deployed
+  ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \
+    --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
+  # If your monitors' ids are not sorted by IP address, specify them in order.
+  # For example, if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is 10.0.0.4,
+  # pass "--mon-ids b a c".
+  # In addition, if your monitors' ids are not single characters like 'a', 'b', 'c',
+  # specify them on the command line by passing them as arguments of the "--mon-ids"
+  # option. If you are not sure, check your ceph.conf to see whether it contains any
+  # sections with names like '[mon.foo]'. Do not pass the "--mon-ids" option if you
+  # are using DNS SRV for looking up monitors.
+  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma
+  
+  # make a backup of the corrupted store.db just in case!  repeat for
+  # all monitors.
+  mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted
+
+  # move the rebuilt store.db into place.  repeat for all monitors.
+  mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
+  chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db
+
+This script performs the following steps:
+
+#. Collects the map from each OSD host.
+#. Fills the entities in the keyring file with appropriate capabilities.
+#. Rebuilds the store.
+#. Replaces the corrupted store on ``mon.foo`` with the recovered copy.
+
+
+Known limitations
+~~~~~~~~~~~~~~~~~
+
+The above recovery tool is unable to recover the following information:
+
+- **Certain added keyrings**: All of the OSD keyrings added using the ``ceph
+  auth add`` command are recovered from the OSD's copy, and the
+  ``client.admin`` keyring is imported using ``ceph-monstore-tool``. However,
+  the MDS keyrings and all other keyrings will be missing in the recovered
+  monitor store. You might need to manually re-add them.
+
+- **Creating pools**: If any RADOS pools were in the process of being created,
+  that state is lost. The recovery tool operates on the assumption that all
+  pools have already been created. If there are PGs that are stuck in the
+  'unknown' state after the recovery for a partially created pool, you can
+  force creation of the *empty* PG by running the ``ceph osd force-create-pg``
+  command. Note that this will create an *empty* PG, so take this action only
+  if you know the pool is empty.
+
+- **MDS Maps**: The MDS maps are lost.
+
+
+Everything Failed! Now What?
+============================
+
+Reaching out for help
+---------------------
+
+You can find help on IRC in #ceph and #ceph-devel on OFTC (server
+irc.oftc.net), or at ``dev@ceph.io`` and ``ceph-users@ceph.io``. Make
+sure that you have prepared your logs and that you have them ready upon
+request.
+
+See https://ceph.io/en/community/connect/ for current (as of October 2023)
+information on getting in contact with the upstream Ceph community.
+
+
+Preparing your logs
+-------------------
+
+The default location for monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``.
+However, if they are not there, you can find their current location by running
+the following command:
+
+.. prompt:: bash
+
+   ceph-conf --name mon.FOO --show-config-value log_file
+
+The amount of information in the logs is determined by the debug levels in the
+cluster's configuration files. If Ceph is using the default debug levels, then
+your logs might be missing important information that would help the upstream
+Ceph community address your issue.
+
+To make sure your monitor logs contain relevant information, you can raise
+debug levels. Here we are interested in information from the monitors.  As with
+other components, the monitors have different parts that output their debug
+information on different subsystems.
+
+If you are an experienced Ceph troubleshooter, we recommend raising the debug
+levels of the most relevant subsystems. Of course, this approach might not be
+easy for beginners. In most cases, however, enough information to address the
+issue will be secured if the following debug levels are entered::
+
+      debug_mon = 10
+      debug_ms = 1
+
+Sometimes these debug levels do not yield enough information. In such cases,
+members of the upstream Ceph community might ask you to make additional changes
+to these or to other debug levels. In any case, it is better for us to receive
+at least some useful information than to receive an empty log.
+
+
+Do I need to restart a monitor to adjust debug levels?
+------------------------------------------------------
+
+No, restarting a monitor is not necessary. Debug levels may be adjusted by
+using two different methods, depending on whether or not there is a quorum:
+
+There is a quorum
+
+  Either inject the debug option into the specific monitor that needs to 
+  be debugged::
+
+        ceph tell mon.FOO config set debug_mon 10/10
+
+  Or inject it into all monitors at once::
+
+        ceph tell mon.* config set debug_mon 10/10
+
+
+There is no quorum
+
+  Use the admin socket of the specific monitor that needs to be debugged
+  and directly adjust the monitor's configuration options::
+
+      ceph daemon mon.FOO config set debug_mon 10/10
+
+
+To return the debug levels to their default values, run the above commands
+using the debug level ``1/10`` rather than ``10/10``. To check a monitor's
+current values, use the admin socket and run either of the following commands:
+
+  .. prompt:: bash
+
+     ceph daemon mon.FOO config show
+
+or:
+
+  .. prompt:: bash
+
+     ceph daemon mon.FOO config get 'OPTION_NAME'
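+
+The choice between the two methods can be scripted. The following is a
+minimal sketch, assuming that a failed ``ceph quorum_status`` call means that
+there is no quorum; the function name ``set_mon_option`` is hypothetical, and
+the ten-second ``--connect-timeout`` value is an arbitrary choice.
+

```shell
# Hypothetical helper: set OPTION ($2) to VALUE ($3) on Monitor ID ($1).
# Uses "ceph tell" when the cluster answers, and falls back to the local
# admin socket ("ceph daemon") when it does not.
set_mon_option() {
    id=$1
    option=$2
    value=$3
    if ceph --connect-timeout 10 quorum_status >/dev/null 2>&1; then
        ceph tell "mon.$id" config set "$option" "$value"
    else
        # No reply from the cluster: this must run on the Monitor's own node.
        ceph daemon "mon.$id" config set "$option" "$value"
    fi
}
```

+
+For example, ``set_mon_option FOO debug_mon 10/10`` followed by
+``set_mon_option FOO debug_ms 1`` applies the debug levels suggested in
+`Preparing your logs`_.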
+
+
+
+I Reproduced the problem with appropriate debug levels. Now what?
+-----------------------------------------------------------------
+
+We prefer that you send us only the portions of your logs that are relevant to
+your monitor problems. Of course, it might not be easy for you to determine
+which portions are relevant so we are willing to accept complete and
+unabridged logs. However, we request that you avoid sending logs containing
+hundreds of thousands of lines with no additional clarifying information. One
+common-sense way of making our task easier is to write down the current time
+and date when you are reproducing the problem and then extract portions of your
+logs based on that information.
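+
+If you noted the time at which you reproduced the problem, the matching part
+of a log can be cut out with ``awk``. This is a minimal sketch that assumes
+that each log line starts with an ISO 8601 timestamp without spaces (for
+example ``2023-10-30T04:12:01.945+0000``), so that plain string comparison
+orders the lines correctly. The function name ``extract_log_window`` is
+hypothetical.
+

```shell
# Hypothetical helper: print the lines of log file $3 whose leading timestamp
# lies between FROM ($1) and TO ($2), compared as strings.
extract_log_window() {
    awk -v from="$1" -v to="$2" '$1 >= from && $1 <= to' "$3"
}
```

+
+For example, ``extract_log_window 2023-10-30T04:11 2023-10-30T04:15
+/var/log/ceph/ceph-mon.FOO.log`` prints only the lines logged within those
+four minutes. Logs that use a different timestamp format need a different
+comparison.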
+
+Finally, reach out to us on the mailing lists or IRC or Slack, or by filing a
+new issue on the `tracker`_.
+
+.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new
+
+.. |---|   unicode:: U+2014 .. EM DASH
+   :trim:
diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst
new file mode 100644
index 000000000..035947d7e
--- /dev/null
+++ b/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -0,0 +1,787 @@
+======================
+ Troubleshooting OSDs
+======================
+
+Before troubleshooting the cluster's OSDs, check the monitors
+and the network. 
+
+First, determine whether the monitors have a quorum. Run the ``ceph health``
+command or the ``ceph -s`` command. If Ceph shows ``HEALTH_OK``, then there
+is a monitor quorum. 
+
+If the monitors don't have a quorum or if there are errors with the monitor
+status, address the monitor issues before proceeding by consulting the material
+in `Troubleshooting Monitors <../troubleshooting-mon>`_.
+
+Next, check your networks to make sure that they are running properly. Networks
+can have a significant impact on OSD operation and performance. Look for
+dropped packets on the host side and CRC errors on the switch side.
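+
+As a first look at host-side packet drops, per-interface counters can be
+inspected as sketched below. The interface name ``eth0`` is a placeholder used
+only for illustration; substitute the interfaces that carry Ceph traffic.
+
+.. prompt:: bash
+
+   ip -s link show eth0    # RX/TX "errors" and "dropped" counters
+   netstat -i              # per-interface error and drop counters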
+
+
+Obtaining Data About OSDs
+=========================
+
+When troubleshooting OSDs, it is useful to collect different kinds of
+information about the OSDs. Some information comes from the practice of
+`monitoring OSDs`_ (for example, by running the ``ceph osd tree`` command).
+Additional information concerns the topology of your cluster, and is discussed
+in the following sections.
+
+
+Ceph Logs
+---------
+
+Ceph log files are stored under ``/var/log/ceph``. Unless the path has been
+changed (or you are in a containerized environment that stores logs in a
+different location), the log files can be listed by running the following
+command:
+
+.. prompt:: bash
+
+   ls /var/log/ceph
+
+If there is not enough log detail, change the logging level. To ensure that
+Ceph performs adequately under high logging volume, see `Logging and
+Debugging`_.
+
+
+
+Admin Socket
+------------
+
+Use the admin socket tool to retrieve runtime information. First, list the
+sockets of Ceph's daemons by running the following command:
+
+.. prompt:: bash
+
+   ls /var/run/ceph
+
+Next, run a command of the following form (replacing ``{daemon-name}`` with the
+name of a specific daemon: for example, ``osd.0``):
+
+.. prompt:: bash
+
+   ceph daemon {daemon-name} help
+
+Alternatively, run the command with a ``{socket-file}`` specified (a "socket
+file" is a specific file in ``/var/run/ceph``):
+
+.. prompt:: bash
+
+   ceph daemon {socket-file} help
+
+The admin socket makes many tasks possible, including:
+
+- Listing Ceph configuration at runtime
+- Dumping historic operations
+- Dumping the operation priority queue state
+- Dumping operations in flight
+- Dumping perfcounters
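+
+As an illustration, the tasks listed above correspond to admin socket commands
+like the following. This sketch assumes an OSD named ``osd.0`` that is running
+on the local host. The set of available commands varies by release, so confirm
+them against the output of ``help``.
+
+.. prompt:: bash
+
+   ceph daemon osd.0 config show           # runtime configuration
+   ceph daemon osd.0 dump_historic_ops     # recently completed operations
+   ceph daemon osd.0 dump_op_pq_state      # operation priority queue state
+   ceph daemon osd.0 dump_ops_in_flight    # operations currently in flight
+   ceph daemon osd.0 perf dump             # perfcounters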
+
+Display Free Space
+------------------
+
+Filesystem issues may arise. To display your filesystems' free space, run the
+following command:
+
+.. prompt:: bash
+
+   df -h
+
+To see this command's supported syntax and options, run ``df --help``.
+
+I/O Statistics
+--------------
+
+The `iostat`_ tool can be used to identify I/O-related issues. Run the
+following command:
+
+.. prompt:: bash
+
+   iostat -x
+
+
+Diagnostic Messages
+-------------------
+
+To retrieve diagnostic messages from the kernel, run the ``dmesg`` command and
+filter or page the output with ``less``, ``more``, ``grep``, or ``tail``. For
+example:
+
+.. prompt:: bash
+
+    dmesg | grep scsi
+
+Stopping without Rebalancing
+============================
+
+It might occasionally be necessary to perform maintenance on a subset of your
+cluster or to resolve a problem that affects a failure domain (for example, a
+rack).  However, when you stop OSDs for maintenance, you might want to prevent
+CRUSH from automatically rebalancing the cluster. To avert this rebalancing
+behavior, set the cluster to ``noout`` by running the following command:
+
+.. prompt:: bash
+
+   ceph osd set noout
+
+.. warning:: This is more a thought exercise offered for the purpose of giving
+   the reader a sense of failure domains and CRUSH behavior than a suggestion
+   that anyone in the post-Luminous world run ``ceph osd set noout``. When the
+   OSDs return to an ``up`` state, rebalancing will resume and the change
+   introduced by the ``ceph osd set noout`` command will be reverted.
+
+In Luminous and later releases, however, it is a safer approach to flag only
+affected OSDs.  To add a ``noout`` flag to a specific OSD or to remove it, run
+commands like the following:
+
+.. prompt:: bash
+
+   ceph osd add-noout osd.0
+   ceph osd rm-noout  osd.0
+
+It is also possible to flag an entire CRUSH bucket. For example, if you plan to
+take down ``prod-ceph-data1701`` in order to add RAM, you might run the
+following command:
+
+.. prompt:: bash
+
+   ceph osd set-group noout prod-ceph-data1701
+
+After the flag is set, stop the OSDs and any other colocated
+Ceph services within the failure domain that requires maintenance work::
+
+   systemctl stop ceph\*.service ceph\*.target
+
+.. note:: When an OSD is stopped, any placement groups within the OSD are
+   marked as ``degraded``.
+
+After the maintenance is complete, it will be necessary to restart the OSDs
+and any other daemons that have stopped. However, if the host was rebooted as
+part of the maintenance, they do not need to be restarted and will come back up
+automatically. To restart OSDs or other daemons, use a command of the following
+form:
+
+.. prompt:: bash
+
+   sudo systemctl start ceph.target
+
+Finally, unset the ``noout`` flag as needed by running commands like the
+following:
+
+.. prompt:: bash
+
+   ceph osd unset noout
+   ceph osd unset-group noout prod-ceph-data1701
+
+Many contemporary Linux distributions employ ``systemd`` for service
+management.  However, for certain operating systems (especially older ones) it
+might be necessary to issue equivalent ``service`` or ``start``/``stop``
+commands.
+
+
+.. _osd-not-running:
+
+OSD Not Running
+===============
+
+Under normal conditions, restarting a ``ceph-osd`` daemon will allow it to
+rejoin the cluster and recover.
+
+
+An OSD Won't Start
+------------------
+
+If the cluster has started but an OSD isn't starting, check the following:
+
+- **Configuration File:** If you were not able to get OSDs running from a new
+  installation, check your configuration file to ensure it conforms to the
+  standard (for example, make sure that it says ``host`` and not ``hostname``,
+  etc.).
+
+- **Check Paths:** Ensure that the paths specified in the configuration
+  correspond to the paths for data and metadata that actually exist (for
+  example, the paths to the journals, the WAL, and the DB). Separate the OSD
+  data from the metadata in order to see whether there are errors in the
+  configuration file and in the actual mounts. If so, these errors might
+  explain why OSDs are not starting. To store the metadata on a separate block
+  device, partition the drive or use LVM, and assign one partition per OSD.
+
+- **Check Max Threadcount:** If the cluster has a node with an especially high
+  number of OSDs, it might be hitting the default maximum number of threads
+  (usually 32768).  This is especially likely to happen during recovery.
+  Raising the limit to the maximum value that the kernel allows (4194303)
+  might help with the problem. To raise the limit to this maximum, run the
+  following command:
+
+  .. prompt:: bash
+
+     sysctl -w kernel.pid_max=4194303
+
+  If this increase resolves the issue, you must make the increase permanent by
+  including a ``kernel.pid_max`` setting either in a file under
+  ``/etc/sysctl.d`` or within the master ``/etc/sysctl.conf`` file. For
+  example::
+
+     kernel.pid_max = 4194303
+
+- **Check ``nf_conntrack``:** This connection-tracking and connection-limiting
+  system causes problems for many production Ceph clusters. The problems often
+  emerge slowly and subtly. As cluster topology and client workload grow,
+  mysterious and intermittent connection failures and performance glitches
+  occur more and more, especially at certain times of the day. To begin taking
+  the measure of your problem, check the ``syslog`` history for "table full"
+  events. One way to address this kind of problem is as follows: First, use the
+  ``sysctl`` utility to assign ``nf_conntrack_max`` a much higher value. Next,
+  raise the value of ``nf_conntrack_buckets`` so that ``nf_conntrack_buckets``
+  × 8 = ``nf_conntrack_max``; this action might require running commands
+  outside of ``sysctl`` (for example, ``echo 131072 >
+  /sys/module/nf_conntrack/parameters/hashsize``). Another way to address the
+  problem is to blacklist the associated kernel modules in order to disable
+  processing altogether. This approach is powerful, but fragile. The modules
+  and the order in which the modules must be listed can vary among kernel
+  versions. Even when blacklisted, ``iptables`` and ``docker`` might sometimes
+  activate connection tracking anyway, so we advise a "set and forget" strategy
+  for the tunables. On modern systems, this approach will not consume
+  appreciable resources.
+
+- **Kernel Version:** Identify the kernel version and distribution that are in
+  use. By default, Ceph uses third-party tools that might be buggy or come into
+  conflict with certain distributions or kernel versions (for example, Google's
+  ``gperftools`` and ``TCMalloc``). Check the `OS recommendations`_ and the
+  release notes for each Ceph version in order to make sure that you have
+  addressed any issues related to your kernel.
+
+- **Segmentation Fault:** If there is a segmentation fault, increase log levels
+  and restart the problematic daemon(s). If segmentation faults recur, search
+  the Ceph bug tracker `https://tracker.ceph.com/projects/ceph
+  <https://tracker.ceph.com/projects/ceph/>`_ and the ``dev`` and
+  ``ceph-users`` mailing list archives `https://ceph.io/resources
+  <https://ceph.io/resources>`_ to see if others have experienced and reported
+  these issues. If this truly is a new and unique failure, post to the ``dev``
+  email list and provide the following information: the specific Ceph release
+  being run, ``ceph.conf`` (with secrets XXX'd out), your monitor status
+  output, and excerpts from your log file(s).
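+
+The ``nf_conntrack`` adjustment described in the list above might be sketched
+as follows. The value ``1048576`` is an assumption, chosen only because it
+equals the example ``hashsize`` of ``131072`` multiplied by 8; choose values
+that suit your hosts. These commands require root privileges, and the
+``sysctl`` key is present only while the ``nf_conntrack`` module is loaded.
+
+.. prompt:: bash
+
+   dmesg -T | grep -i 'table full'          # recent kernel messages only
+   sysctl net.netfilter.nf_conntrack_max    # show the current limit
+   sysctl -w net.netfilter.nf_conntrack_max=1048576
+   echo 131072 > /sys/module/nf_conntrack/parameters/hashsize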
+
+
+An OSD Failed
+-------------
+
+When an OSD fails, this means that a ``ceph-osd`` process is unresponsive or
+has died and that the corresponding OSD has been marked ``down``. Surviving
+``ceph-osd`` daemons will report to the monitors that the OSD appears to be
+down, and a new status will be visible in the output of the ``ceph health``
+command, as in the following example:
+
+.. prompt:: bash
+
+   ceph health
+
+::
+
+   HEALTH_WARN 1/3 in osds are down
+
+This health alert is raised whenever there are one or more OSDs marked ``in``
+and ``down``. To see which OSDs are ``down``, add ``detail`` to the command as in
+the following example:
+
+.. prompt:: bash
+
+   ceph health detail
+
+::
+
+   HEALTH_WARN 1/3 in osds are down
+   osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+
+Alternatively, run the following command:
+
+.. prompt:: bash
+
+    ceph osd tree down
+
+If there is a drive failure or another fault that is preventing a given
+``ceph-osd`` daemon from functioning or restarting, then there should be an
+error message present in its log file under ``/var/log/ceph``.
+
+If the ``ceph-osd`` daemon stopped because of a heartbeat failure or a
+``suicide timeout`` error, then the underlying drive or filesystem might be
+unresponsive. Check ``dmesg`` output and ``syslog`` output for drive errors or
+kernel errors. It might be necessary to specify certain flags (for example,
+``dmesg -T`` to see human-readable timestamps) in order to avoid mistaking old
+errors for new errors.
+
+If an entire host's OSDs are ``down``, check to see if there is a network
+error or a hardware issue with the host.
+
+If the OSD problem is the result of a software error (for example, a failed
+assertion or another unexpected error), search for reports of the issue in the
+`bug tracker <https://tracker.ceph.com/projects/ceph>`_, the `dev mailing list
+archives <https://lists.ceph.io/hyperkitty/list/dev@ceph.io/>`_, and the
+`ceph-users mailing list archives
+<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/>`_.  If there is no
+clear fix or existing bug, then :ref:`report the problem to the ceph-devel
+email list <Get Involved>`.
+
+
+.. _no-free-drive-space:
+
+No Free Drive Space
+-------------------
+
+If an OSD is full, Ceph prevents data loss by ensuring that no new data is
+written to the OSD. In a properly running cluster, health checks are raised
+when the cluster's OSDs and pools approach certain "fullness" ratios. The
+``mon_osd_full_ratio`` threshold defaults to ``0.95`` (or 95% of capacity):
+this is the point above which clients are prevented from writing data. The
+``mon_osd_backfillfull_ratio`` threshold defaults to ``0.90`` (or 90% of
+capacity): this is the point above which backfills will not start. The
+``mon_osd_nearfull_ratio`` threshold defaults to ``0.85`` (or 85% of capacity):
+this is the point at which Ceph raises the ``OSD_NEARFULL`` health check.
+
+OSDs within a cluster will vary in how much data is allocated to them by Ceph.
+To check "fullness" by displaying data utilization for every OSD, run the
+following command:
+
+.. prompt:: bash
+
+   ceph osd df
+
+To check "fullness" by displaying a cluster’s overall data usage and data
+distribution among pools, run the following command:
+
+.. prompt:: bash
+
+   ceph df 
+
+When examining the output of the ``ceph df`` command, pay special attention to
+the **most full** OSDs, as opposed to the percentage of raw space used. If a
+single outlier OSD becomes full, all writes to this OSD's pool might fail as a
+result. When ``ceph df`` reports the space available to a pool, it considers
+the ratio settings relative to the *most full* OSD that is part of the pool. To
+flatten the distribution, two approaches are available: (1) Using the
+``reweight-by-utilization`` command to progressively move data from excessively
+full OSDs or move data to insufficiently full OSDs, and (2) in later revisions
+of Luminous and subsequent releases, exploiting the ``ceph-mgr`` ``balancer``
+module to perform the same task automatically.
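+
+The two approaches above might look like the following in practice. This is a
+sketch rather than a recommended procedure: review the dry-run output before
+reweighting, and note that ``upmap`` is only one of the balancer's modes and
+requires that all clients be Luminous or later.
+
+.. prompt:: bash
+
+   ceph osd test-reweight-by-utilization    # dry run: report proposed changes
+   ceph osd reweight-by-utilization         # apply the reweighting
+   ceph balancer status                     # check whether the balancer is active
+   ceph balancer mode upmap
+   ceph balancer on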
+
+To adjust the "fullness" ratios, run a command or commands of the following
+form:
+
+.. prompt:: bash
+
+   ceph osd set-nearfull-ratio <float[0.0-1.0]>
+   ceph osd set-full-ratio <float[0.0-1.0]>
+   ceph osd set-backfillfull-ratio <float[0.0-1.0]>
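+
+To see the ratios currently in effect before changing them, the values stored
+in the OSD map can be inspected. A minimal sketch:
+
+.. prompt:: bash
+
+   ceph osd dump | grep ratio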
+
+Sometimes full cluster issues arise because an OSD has failed. This can happen
+either because of a test or because the cluster is small, very full, or
+unbalanced. When an OSD or node holds an excessive percentage of the cluster's
+data, component failures or natural growth can result in the ``nearfull`` and
+``full`` ratios being exceeded.  When testing Ceph's resilience to OSD failures
+on a small cluster, it is advised to leave ample free disk space and to
+consider temporarily lowering the OSD ``full ratio``, OSD ``backfillfull
+ratio``, and OSD ``nearfull ratio``.
+
+The "fullness" status of OSDs is visible in the output of the ``ceph health``
+command, as in the following example:
+
+.. prompt:: bash
+
+   ceph health
+
+::
+
+  HEALTH_WARN 1 nearfull osd(s)
+
+For details, add ``detail`` to the command as in the following example:
+
+.. prompt:: bash
+
+    ceph health detail
+
+::
+
+    HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
+    osd.3 is full at 97%
+    osd.4 is backfill full at 91%
+    osd.2 is near full at 87%
+
+To address full cluster issues, it is recommended to add capacity by adding
+OSDs. Adding new OSDs allows the cluster to redistribute data to newly
+available storage. Also search for leftover ``rados bench`` objects that are
+wasting space.
+
+If a legacy Filestore OSD cannot be started because it is full, it is possible
+to reclaim space by deleting a small number of placement group directories in
+the full OSD.
+
+.. important:: If you choose to delete a placement group directory on a full
+   OSD, **DO NOT** delete the same placement group directory on another full
+   OSD. **OTHERWISE YOU WILL LOSE DATA**. You **MUST** maintain at least one
+   copy of your data on at least one OSD. Deleting placement group directories
+   is a rare and extreme intervention. It is not to be undertaken lightly.
+
+See `Monitor Config Reference`_ for more information.
+
+
+OSDs are Slow/Unresponsive
+==========================
+
+OSDs are sometimes slow or unresponsive. When troubleshooting this common
+problem, it is advised to eliminate other possibilities before investigating
+OSD performance issues. For example, be sure to confirm that your network(s)
+are working properly, to verify that your OSDs are running, and to check
+whether OSDs are throttling recovery traffic.
+
+.. tip:: In pre-Luminous releases of Ceph, ``up`` and ``in`` OSDs were
+   sometimes not available or were otherwise slow because recovering OSDs were
+   consuming system resources. Newer releases provide better recovery handling
+   by preventing this phenomenon.
+
+
+Networking Issues
+-----------------
+
+As a distributed storage system, Ceph relies upon networks for OSD peering and
+replication, recovery from faults, and periodic heartbeats. Networking issues
+can cause OSD latency and flapping OSDs. For more information, see `Flapping
+OSDs`_.
+
+To make sure that Ceph processes and Ceph-dependent processes are connected and
+listening, run the following commands:
+
+.. prompt:: bash
+
+   netstat -a | grep ceph
+   netstat -l | grep ceph
+   sudo netstat -p | grep ceph
+
+To check network statistics, run the following command:
+
+.. prompt:: bash
+
+   netstat -s
+
+Drive Configuration
+-------------------
+
+A SAS or SATA storage drive should house only one OSD, but an NVMe drive can
+easily house two or more. However, it is possible for read and write throughput
+to bottleneck if other processes share the drive. Such processes include:
+journals / metadata, operating systems, Ceph monitors, ``syslog`` logs, other
+OSDs, and non-Ceph processes.
+
+Because Ceph acknowledges writes *after* journaling, fast SSDs are an
+attractive option for accelerating response time -- particularly when using the
+``XFS`` or ``ext4`` filesystems for legacy FileStore OSDs.  By contrast, the
+``Btrfs`` file system can write and journal simultaneously. (However, use of
+``Btrfs`` is not recommended for production deployments.)
+
+.. note:: Partitioning a drive does not change its total throughput or
+   sequential read/write limits. Throughput might be improved somewhat by
+   running a journal in a separate partition, but it is better still to run
+   such a journal in a separate physical drive.
+   
+.. warning:: Reef does not support FileStore. Releases after Reef do not
+   support FileStore. Any information that mentions FileStore is pertinent only
+   to the Quincy release of Ceph and to releases prior to Quincy.
+
+
+Bad Sectors / Fragmented Disk
+-----------------------------
+
+Check your drives for bad blocks, fragmentation, and other errors that can
+cause significantly degraded performance. Tools that are useful in checking for
+drive errors include ``dmesg``, ``syslog`` logs, and ``smartctl`` (found in the
+``smartmontools`` package).
+
+.. note:: ``smartmontools`` 7.0 and later provide NVMe stat passthrough and
+   JSON output.
+
+
+Co-resident Monitors/OSDs
+-------------------------
+
+Although monitors are relatively lightweight processes, performance issues can
+result when monitors are run on the same host machine as an OSD. Monitors issue
+many ``fsync()`` calls and this can interfere with other workloads. The danger
+of performance issues is especially acute when the monitors are co-resident on
+the same storage drive as an OSD. In addition, if the monitors are running an
+older kernel (pre-3.0) or a kernel with no ``syncfs(2)`` syscall, then multiple
+OSDs running on the same host might make so many commits as to undermine each
+other's performance.  This problem sometimes results in what are called
+"bursty writes".
+
+
+Co-resident Processes
+---------------------
+
+Significant OSD latency can result from processes that write data to Ceph (for
+example, cloud-based solutions and virtual machines) while operating on the
+same hardware as OSDs. For this reason, making such processes co-resident with
+OSDs is not generally recommended. Instead, the recommended practice is to
+optimize certain hosts for use with Ceph and use other hosts for other
+processes. This practice of separating Ceph operations from other applications
+might help improve performance and might also streamline troubleshooting and
+maintenance.
+
+Running co-resident processes on the same hardware is sometimes called
+"convergence". When using Ceph, engage in convergence only with expertise and
+after consideration.
+
+
+Logging Levels
+--------------
+
+Performance issues can result from high logging levels. Operators sometimes
+raise logging levels in order to track an issue and then forget to lower them
+afterwards. In such a situation, OSDs might consume valuable system resources to
+write needlessly verbose logs onto the disk. Anyone who does want to use high logging
+levels is advised to consider mounting a drive to the default path for logging
+(for example, ``/var/log/ceph/$cluster-$name.log``).
+
+Recovery Throttling
+-------------------
+
+Depending upon your configuration, Ceph may reduce recovery rates to maintain
+client or OSD performance, or it may increase recovery rates to the point that
+recovery impacts client or OSD performance. Check to see if the OSD is
+recovering.
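+
+One way to check for ongoing recovery and to inspect the related throttles is
+sketched below. The two option names are common recovery tunables and are
+offered as examples only; in releases that use the mClock scheduler, they
+might be managed automatically.
+
+.. prompt:: bash
+
+   ceph -s                                  # look for recovery or backfill activity
+   ceph config get osd osd_max_backfills
+   ceph config get osd osd_recovery_max_active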
+
+
+Kernel Version
+--------------
+
+Check the kernel version that you are running. Older kernels may lack updates
+that improve Ceph performance. 
+
+
+Kernel Issues with SyncFS
+-------------------------
+
+If you have kernel issues with SyncFS, try running one OSD per host to see if
+performance improves. Old kernels might lack the ``syncfs(2)`` syscall, and
+old versions of ``glibc`` might lack the wrapper needed to call it.
+
+
+Filesystem Issues
+-----------------
+
+In post-Luminous releases, we recommend deploying clusters with the BlueStore
+back end.  When running a pre-Luminous release, or if you have a specific
+reason to deploy OSDs with the previous Filestore backend, we recommend
+``XFS``.
+
+We recommend against using ``Btrfs`` or ``ext4``.  The ``Btrfs`` filesystem has
+many attractive features, but bugs may lead to performance issues and spurious
+ENOSPC errors.  We do not recommend ``ext4`` for Filestore OSDs because
+``xattr`` limitations break support for long object names, which are needed for
+RGW.
+
+For more information, see `Filesystem Recommendations`_.
+
+.. _Filesystem Recommendations: ../configuration/filesystem-recommendations
+
+Insufficient RAM
+----------------
+
+We recommend a *minimum* of 4GB of RAM per OSD daemon and we suggest rounding
+up from 6GB to 8GB. During normal operations, you may notice that ``ceph-osd``
+processes use only a fraction of that amount.  You might be tempted to use the
+excess RAM for co-resident applications or to skimp on each node's memory
+capacity. However, when OSDs experience recovery, their memory utilization
+spikes. If there is insufficient RAM available during recovery, OSD performance
+will slow considerably and the daemons may even crash or be killed by the Linux
+``OOM Killer``.
+
+
+Blocked Requests or Slow Requests
+---------------------------------
+
+When a ``ceph-osd`` daemon is slow to respond to a request, the cluster log
+receives messages reporting ops that are taking too long. The warning threshold
+defaults to 30 seconds and is configurable via the ``osd_op_complaint_time``
+setting.
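+
+The threshold currently in effect can be read from the configuration database
+or from a running daemon. This sketch assumes an OSD named ``osd.0`` that is
+running on the local host:
+
+.. prompt:: bash
+
+   ceph config get osd osd_op_complaint_time
+   ceph daemon osd.0 config get osd_op_complaint_time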
+
+Legacy versions of Ceph complain about ``old requests``::
+
+    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+
+Newer versions of Ceph complain about ``slow requests``::
+
+    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+    {date} {osd.num}  [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+
+Possible causes include:
+
+- A failing drive (check ``dmesg`` output)
+- A bug in the kernel file system (check ``dmesg`` output)
+- An overloaded cluster (check system load, ``iostat``, etc.)
+- A bug in the ``ceph-osd`` daemon
+
+Possible solutions:
+
+- Remove VMs from Ceph hosts
+- Upgrade kernel
+- Upgrade Ceph
+- Restart OSDs
+- Replace failed or failing components
+
+Debugging Slow Requests
+-----------------------
+
+If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id>
+dump_ops_in_flight``, you will see a set of operations and a list of events
+each operation went through. These are briefly described below.
+
+Events from the Messenger layer:
+
+- ``header_read``: The time that the messenger first started reading the message off the wire.
+- ``throttled``: The time that the messenger tried to acquire memory throttle space to read
+  the message into memory.
+- ``all_read``: The time that the messenger finished reading the message off the wire.
+- ``dispatched``: The time that the messenger gave the message to the OSD.
+- ``initiated``: This is identical to ``header_read``. The existence of both is a
+  historical oddity.
+
+Events from the OSD as it processes ops:
+
+- ``queued_for_pg``: The op has been put into the queue for processing by its PG.
+- ``reached_pg``: The PG has started performing the op.
+- ``waiting for *``: The op is waiting for some other work to complete before
+  it can proceed (for example, a new OSDMap; the scrubbing of its object
+  target; the completion of a PG's peering; all as specified in the message).
+- ``started``: The op has been accepted as something the OSD should do and 
+  is now being performed.
+- ``waiting for subops from``: The op has been sent to replica OSDs.
+
+Events from ``FileStore``:
+
+- ``commit_queued_for_journal_write``: The op has been given to the FileStore.
+- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and is waiting
+  to be persisted (as the next disk write).
+- ``journaled_completion_queued``: The op was journaled to disk and its callback
+  has been queued for invocation.
+
+Events from the OSD after data has been given to underlying storage:
+
+- ``op_commit``: The op has been committed (that is, written to journal) by the
+  primary OSD.
+- ``op_applied``: The op has been `write()'en
+  <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (that is,
+  applied in memory but not flushed out to disk) on the primary.
+- ``sub_op_applied``: ``op_applied``, but for a replica's "subop".
+- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools).
+- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it
+  hears about the above, but for a particular replica (i.e. ``<X>``).
+- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops).
+
+Some of these events may appear redundant, but they cross important boundaries
+in the internal code (such as passing data across locks into new threads).
+
+
+Flapping OSDs
+=============
+
+"Flapping" is the term for the phenomenon of an OSD being repeatedly marked
+``up`` and then ``down`` in rapid succession.  This section explains how to
+recognize flapping, and how to mitigate it.
+
+When OSDs peer and check heartbeats, they use the cluster (back-end) network
+when it is available. See `Monitor/OSD Interaction`_ for details.
+
+The upstream Ceph community has traditionally recommended separate *public*
+(front-end) and *private* (cluster / back-end / replication) networks. This
+provides the following benefits:
+
+#. Segregation of (1) heartbeat traffic and replication/recovery traffic
+   (private) from (2) traffic from clients and between OSDs and monitors
+   (public). This helps keep one stream of traffic from DoS-ing the other,
+   which could in turn result in a cascading failure.
+
+#. Additional throughput for both public and private traffic.
+
+In the past, when common networking technologies were measured in a range
+encompassing 100Mb/s and 1Gb/s, this separation was often critical. But with
+today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s networks, the above capacity concerns
+are often diminished or even obviated.  For example, if your OSD nodes have two
+network ports, dedicating one to the public and the other to the private
+network means that you have no path redundancy.  This degrades your ability to
+endure network maintenance and network failures without significant cluster or
+client impact. In situations like this, consider instead using both links for
+only a public network: with bonding (LACP) or equal-cost routing (for example,
+FRR) you reap the benefits of increased throughput headroom, fault tolerance,
+and reduced OSD flapping.
+
+When a private network (or even a single host link) fails or degrades while the
+public network continues operating normally, OSDs may not handle this situation
+well. In such situations, OSDs use the public network to report each other
+``down`` to the monitors, while marking themselves ``up``. The monitors then
+send out (again on the public network) an updated cluster map with the
+affected OSDs marked ``down``. These OSDs reply to the monitors "I'm not dead
+yet!", and the cycle repeats. We call this scenario "flapping", and it can be
+difficult to isolate and remediate. Without a private network, this irksome
+dynamic is avoided: OSDs are generally either ``up`` or ``down`` without
+flapping.
+
+If something does cause OSDs to 'flap' (repeatedly being marked ``down`` and
+then ``up`` again), you can force the monitors to halt the flapping by
+temporarily freezing their states:
+
+.. prompt:: bash
+
+   ceph osd set noup      # prevent OSDs from getting marked up
+   ceph osd set nodown    # prevent OSDs from getting marked down
+
+These flags are recorded in the osdmap:
+
+.. prompt:: bash
+
+   ceph osd dump | grep flags
+
+::
+
+   flags noup,nodown
+
+You can clear these flags with:
+
+.. prompt:: bash
+
+   ceph osd unset noup
+   ceph osd unset nodown
+
+Two other flags are available: ``noin``, which prevents booting OSDs from
+being marked ``in`` (allocated data), and ``noout``, which protects OSDs from
+eventually being marked ``out`` (regardless of the current value of
+``mon_osd_down_out_interval``).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the sense that
+   after the flags are cleared, the action that they were blocking should be
+   possible shortly thereafter. But the ``noin`` flag prevents OSDs from being
+   marked ``in`` on boot, and any daemons that started while the flag was set
+   will not be marked ``in`` automatically after the flag is cleared.
+
+.. note:: The causes and effects of flapping can be mitigated somewhat by
+   making careful adjustments to ``mon_osd_down_out_subtree_limit``,
+   ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``.
+   Derivation of optimal settings depends on cluster size, topology, and the
+   Ceph release in use. The interaction of all of these factors is subtle and
+   is beyond the scope of this document.
+
+
+.. _iostat: https://en.wikipedia.org/wiki/Iostat
+.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
+.. _Logging and Debugging: ../log-and-debug
+.. _Debugging and Logging: ../debug
+.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
+.. _Monitor Config Reference: ../../configuration/mon-config-ref
+.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
+
+.. _monitoring OSDs: ../../operations/monitoring-osd-pg/#monitoring-osds
+
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
+.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
+.. _OS recommendations: ../../../start/os-recommendations
+.. _ceph-devel: ceph-devel@vger.kernel.org
diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst
new file mode 100644
index 000000000..74d04bd9f
--- /dev/null
+++ b/doc/rados/troubleshooting/troubleshooting-pg.rst
@@ -0,0 +1,782 @@
+====================
+ Troubleshooting PGs
+====================
+
+Placement Groups Never Get Clean
+================================
+
+If, after you have created your cluster, any Placement Groups (PGs) remain in
+the ``active`` status, the ``active+remapped`` status, or the
+``active+degraded`` status and never achieve an ``active+clean`` status, you
+likely have a problem with your configuration.
+
+In such a situation, it may be necessary to review the settings in the `Pool,
+PG and CRUSH Config Reference`_ and make appropriate adjustments.
+
+As a general rule, run your cluster with more than one OSD and a pool size
+greater than two object replicas.
+
+.. _one-node-cluster:
+
+One Node Cluster
+----------------
+
+Ceph no longer provides documentation for operating on a single node.  Systems
+designed for distributed computing by definition do not run on a single node.
+The mounting of client kernel modules on a single node that contains a Ceph
+daemon may cause a deadlock due to issues with the Linux kernel itself (unless
+VMs are used as clients). You can experiment with Ceph in a one-node
+configuration, in spite of the limitations as described herein.
+
+To create a cluster on a single node, you must change the
+``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
+``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
+file before you create your monitors and OSDs. This tells Ceph that an OSD is
+permitted to peer with another OSD on the same host. If you are trying to set
+up a single-node cluster and ``osd_crush_chooseleaf_type`` is greater than
+``0``, Ceph will attempt to peer the PGs of one OSD with the PGs of another
+OSD on another node, chassis, rack, row, or datacenter, depending on the
+setting.
+
+.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph
+   Storage Cluster. Kernel conflicts can arise. However, you can mount kernel
+   clients within virtual machines (VMs) on a single node.
+
+If you are creating OSDs using a single disk, you must manually create
+directories for the data first.
+
+
+Fewer OSDs than Replicas
+------------------------
+
+If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not
+in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set
+to greater than ``2``.
+
+There are a few ways to address this situation. If you want to operate your
+cluster in an ``active + degraded`` state with two replicas, you can set the
+``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
+``active + degraded`` state. You may also set the ``osd_pool_default_size``
+setting to ``2`` so that you have only two stored replicas (the original and
+one replica). In such a case, the cluster should achieve an ``active + clean``
+state.
+
+.. note:: You can make the changes while the cluster is running. If you make
+   the changes in your Ceph configuration file, you might need to restart your
+   cluster.
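+
+The relationship between pool size, ``min_size``, and the number of available
+OSDs can be sketched as follows. This is a simplified illustration, not Ceph's
+actual logic: the function name ``expected_pg_state`` is hypothetical, the
+failure domain is assumed to be ``osd``, and the returned labels are reduced to
+the states named in this section.
+
+.. code-block:: python
+
+   def expected_pg_state(pool_size: int, min_size: int, osds_up_in: int) -> str:
+       """Rough expectation for a replicated pool (assumed failure domain: osd)."""
+       # A PG cannot hold more replicas than there are OSDs to place them on.
+       replicas = min(pool_size, osds_up_in)
+       if replicas == pool_size:
+           return "active+clean"
+       if replicas >= min_size:
+           return "active+degraded"  # writable, but under-replicated
+       return "inactive"             # too few replicas to accept I/O
+
+
+   # Two OSDs and a pool size of 3: the PGs cannot become clean.
+   print(expected_pg_state(pool_size=3, min_size=2, osds_up_in=2))
+   # Reducing the pool size to 2 lets the same cluster reach active+clean.
+   print(expected_pg_state(pool_size=2, min_size=1, osds_up_in=2))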
+
+
+Pool Size = 1
+-------------
+
+If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
+of the object. OSDs rely on other OSDs to tell them which objects they should
+have. If one OSD has a copy of an object and there is no second copy, then
+there is no second OSD to tell the first OSD that it should have that copy. For
+each placement group mapped to the first OSD (see ``ceph pg dump``), you can
+force the first OSD to notice the placement groups it needs by running a
+command of the following form:
+
+.. prompt:: bash
+
+   ceph osd force-create-pg <pgid>
+
+
+CRUSH Map Errors
+----------------
+
+If any placement groups in your cluster are unclean, then there might be errors
+in your CRUSH map.
+
+
+Stuck Placement Groups
+======================
+
+It is normal for placement groups to enter "degraded" or "peering" states after
+a component failure. Normally, these states reflect the expected progression
+through the failure recovery process. However, a placement group that stays in
+one of these states for a long time might be an indication of a larger problem.
+For this reason, the Ceph Monitors will warn when placement groups get "stuck"
+in a non-optimal state. Specifically, we check for:
+
+* ``inactive`` - The placement group has not been ``active`` for too long (that
+  is, it hasn't been able to service read/write requests).
+
+* ``unclean`` - The placement group has not been ``clean`` for too long (that
+  is, it hasn't been able to completely recover from a previous failure).
+
+* ``stale`` - The placement group status has not been updated by a
+  ``ceph-osd``.  This indicates that all nodes storing this placement group may
+  be ``down``.
+
+List stuck placement groups by running one of the following commands:
+
+.. prompt:: bash
+
+   ceph pg dump_stuck stale
+   ceph pg dump_stuck inactive
+   ceph pg dump_stuck unclean
+
+- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
+  daemons are not running.
+- Stuck ``inactive`` placement groups usually indicate a peering problem (see
+  :ref:`failures-osd-peering`).
+- Stuck ``unclean`` placement groups usually indicate that something is
+  preventing recovery from completing, possibly unfound objects (see
+  :ref:`failures-osd-unfound`).
+
+
+
+.. _failures-osd-peering:
+
+Placement Group Down - Peering Failure
+======================================
+
+In certain cases, the ``ceph-osd`` `peering` process can run into problems,
+which can prevent a PG from becoming active and usable. In such a case, running
+the command ``ceph health detail`` will report something similar to the following:
+
+.. prompt:: bash
+
+   ceph health detail
+
+::
+
+    HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
+    ...
+    pg 0.5 is down+peering
+    pg 1.4 is down+peering
+    ...
+    osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
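+
+When the list of affected PGs is long, the ``pg <id> is <state>`` lines can be
+grouped by state before deciding which PG to query. The following is a minimal
+sketch that assumes the plain-text line format shown above; the helper name
+``pgs_by_state`` is hypothetical and is not part of any Ceph tool.
+
+.. code-block:: python
+
+   import re
+   from collections import defaultdict
+
+   # Matches lines such as "pg 0.5 is down+peering" or
+   # "pg 2.5 is stuck stale+active+remapped, last acting [2,0]".
+   PG_LINE = re.compile(r"^pg (\d+\.[0-9a-f]+) is (?:stuck )?([a-z+_]+)")
+
+
+   def pgs_by_state(health_detail: str) -> dict:
+       """Group PG IDs by the state reported for them."""
+       groups = defaultdict(list)
+       for line in health_detail.splitlines():
+           match = PG_LINE.match(line.strip())
+           if match:
+               pgid, state = match.groups()
+               groups[state].append(pgid)
+       return dict(groups)
+
+
+   SAMPLE = """\
+   pg 0.5 is down+peering
+   pg 1.4 is down+peering
+   osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
+   """
+   print(pgs_by_state(SAMPLE))  # {'down+peering': ['0.5', '1.4']}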
+
+Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form:
+
+.. prompt:: bash
+
+   ceph pg 0.5 query
+
+.. code-block:: javascript
+
+ { "state": "down+peering",
+   ...
+   "recovery_state": [
+        { "name": "Started\/Primary\/Peering\/GetInfo",
+          "enter_time": "2012-03-06 14:40:16.169679",
+          "requested_info_from": []},
+        { "name": "Started\/Primary\/Peering",
+          "enter_time": "2012-03-06 14:40:16.169659",
+          "probing_osds": [
+                0,
+                1],
+          "blocked": "peering is blocked due to down osds",
+          "down_osds_we_would_probe": [
+                1],
+          "peering_blocked_by": [
+                { "osd": 1,
+                  "current_lost_at": 0,
+                  "comment": "starting or marking this osd lost may let us proceed"}]},
+        { "name": "Started",
+          "enter_time": "2012-03-06 14:40:16.169513"}
+    ]
+ }
+
+The ``recovery_state`` section tells us that peering is blocked due to down
+``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
+particular ``ceph-osd`` and recovery will proceed.
+
+Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
+there has been a disk failure), the cluster can be informed that the OSD is
+``lost`` and the cluster can be instructed that it must cope as best it can.
+
+.. important:: Informing the cluster that an OSD has been lost is dangerous
+   because the cluster cannot guarantee that the other copies of the data are
+   consistent and up to date.
+
+To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
+anyway, run a command of the following form:
+
+.. prompt:: bash
+
+   ceph osd lost 1
+
+Recovery will proceed.
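+
+The OSDs that block peering can also be read from the ``ceph pg query`` output
+programmatically. The sketch below assumes the JSON layout shown earlier in
+this section (abbreviated here); the function name ``blocking_osds`` is
+hypothetical.
+
+.. code-block:: python
+
+   import json
+
+
+   def blocking_osds(pg_query_json: str) -> list:
+       """Return the OSD IDs named in any ``peering_blocked_by`` entry."""
+       query = json.loads(pg_query_json)
+       osds = []
+       for stage in query.get("recovery_state", []):
+           for blocker in stage.get("peering_blocked_by", []):
+               osds.append(blocker["osd"])
+       return osds
+
+
+   # Abbreviated copy of the example output; "\/" is a valid JSON escape.
+   SAMPLE = r"""
+   { "state": "down+peering",
+     "recovery_state": [
+       { "name": "Started\/Primary\/Peering",
+         "blocked": "peering is blocked due to down osds",
+         "down_osds_we_would_probe": [1],
+         "peering_blocked_by": [
+           { "osd": 1,
+             "current_lost_at": 0,
+             "comment": "starting or marking this osd lost may let us proceed"}]},
+       { "name": "Started" }
+     ]
+   }
+   """
+   print(blocking_osds(SAMPLE))  # [1]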
+
+
+.. _failures-osd-unfound:
+
+Unfound Objects
+===============
+
+Under certain combinations of failures, Ceph may complain about ``unfound``
+objects, as in this example:
+
+.. prompt:: bash
+
+   ceph health detail
+
+::
+
+   HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
+   pg 2.4 is active+degraded, 78 unfound
+
+This means that the storage cluster knows that some objects (or newer copies of
+existing objects) exist, but it hasn't found copies of them.  Here is an
+example of how this might come about for a PG whose data is on two OSDs, which
+we will call "1" and "2":
+
+* 1 goes down
+* 2 handles some writes, alone
+* 1 comes up
+* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
+* Before the new objects are copied, 2 goes down.
+
+At this point, 1 knows that these objects exist, but there is no live
+``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
+will block, and the cluster will hope that the failed node comes back soon.
+This is assumed to be preferable to returning an IO error to the user.
+
+.. note:: The situation described immediately above is one reason that setting
+   ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
+   data loss.
+
+Identify which objects are unfound by running a command of the following form:
+
+.. prompt:: bash
+
+   ceph pg 2.4 list_unfound [starting offset, in json]
+
+.. code-block:: javascript
+
+  {
+    "num_missing": 1,
+    "num_unfound": 1,
+    "objects": [
+        {
+            "oid": {
+                "oid": "object",
+                "key": "",
+                "snapid": -2,
+                "hash": 2249616407,
+                "max": 0,
+                "pool": 2,
+                "namespace": ""
+            },
+            "need": "43'251",
+            "have": "0'0",
+            "flags": "none",
+            "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
+            "locations": [
+                "0(3)",
+                "4(2)"
+            ]
+        }
+    ],
+    "state": "NotRecovering",
+    "available_might_have_unfound": true,
+    "might_have_unfound": [
+        {
+            "osd": "2(4)",
+            "status": "osd is down"
+        }
+    ],
+    "more": false
+  }
+
+If there are too many objects to list in a single result, the ``more`` field
+will be true and you can query for more.  (Eventually the command line tool
+will hide this from you, but not yet.)
+
+Now you can identify which OSDs have been probed or might contain data.
+
+At the end of the listing (before ``more: false``), ``might_have_unfound`` is
+provided when ``available_might_have_unfound`` is true.  This is equivalent to
+the output of ``ceph pg #.# query`` and eliminates the need to use ``query``
+directly.  The ``might_have_unfound`` information behaves the same way as it
+does in the output of ``query``, which is described below.  The only
+difference is that OSDs that have the status of ``already probed`` are ignored.
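+
+The fields of interest in the ``list_unfound`` output can be pulled out with a
+few lines of code. This is a minimal sketch that assumes the JSON layout shown
+above (abbreviated in the sample); the function name ``summarize_unfound`` is
+hypothetical.
+
+.. code-block:: python
+
+   import json
+
+
+   def summarize_unfound(list_unfound_json: str) -> dict:
+       """Collect unfound object names and the OSDs that might hold them."""
+       report = json.loads(list_unfound_json)
+       return {
+           "objects": [entry["oid"]["oid"] for entry in report.get("objects", [])],
+           "candidates": [
+               (peer["osd"], peer["status"])
+               for peer in report.get("might_have_unfound", [])
+           ],
+           # True means that the listing was truncated and must be continued.
+           "more": report.get("more", False),
+       }
+
+
+   SAMPLE = """
+   { "num_missing": 1, "num_unfound": 1,
+     "objects": [ { "oid": { "oid": "object", "pool": 2 },
+                    "need": "43'251", "have": "0'0" } ],
+     "available_might_have_unfound": true,
+     "might_have_unfound": [ { "osd": "2(4)", "status": "osd is down" } ],
+     "more": false }
+   """
+   print(summarize_unfound(SAMPLE))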
+
+Use of ``query``:
+
+.. prompt:: bash
+
+   ceph pg 2.4 query
+
+.. code-block:: javascript
+
+   "recovery_state": [
+        { "name": "Started\/Primary\/Active",
+          "enter_time": "2012-03-06 15:15:46.713212",
+          "might_have_unfound": [
+                { "osd": 1,
+                  "status": "osd is down"}]},
+
+In this case, the cluster knows that ``osd.1`` might have data, but it is
+``down``. Here is the full range of possible states:
+
+* already probed
+* querying
+* osd is down
+* not queried (yet)
+
+Sometimes it simply takes some time for the cluster to query possible
+locations.
+
+It is possible that there are other locations where the object might exist that
+are not listed. For example: if an OSD is stopped and taken out of the cluster
+and then the cluster fully recovers, and then through a subsequent set of
+failures the cluster ends up with an unfound object, the cluster will ignore
+the removed OSD. (This scenario, however, is unlikely.)
+
+If all possible locations have been queried and objects are still lost, you may
+have to give up on the lost objects. This, again, is possible only when unusual
+combinations of failures have occurred that allow the cluster to learn about
+writes that were performed before the writes themselves have been recovered. To
+mark the "unfound" objects as "lost", run a command of the following form:
+
+.. prompt:: bash
+
+   ceph pg 2.5 mark_unfound_lost revert|delete
+
+Here the final argument (``revert|delete``) specifies how the cluster should
+deal with lost objects.
+
+The ``delete`` option will cause the cluster to forget about them entirely.
+
+The ``revert`` option (which is not available for erasure coded pools) will
+either roll back to a previous version of the object or (if it was a new
+object) forget about the object entirely. Use ``revert`` with caution, as it
+may confuse applications that expect the object to exist.
+
+Homeless Placement Groups
+=========================
+
+It is possible that every OSD that has copies of a given placement group fails.
+If this happens, then the subset of the object store that contains those
+placement groups becomes unavailable and the monitor will receive no status
+updates for those placement groups. The monitor marks as ``stale`` any
+placement group whose primary OSD has failed. For example:
+
+.. prompt:: bash
+
+   ceph health
+
+::
+
+    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+
+Identify which placement groups are ``stale`` and which were the last OSDs to
+store the ``stale`` placement groups by running the following command:
+
+.. prompt:: bash
+
+   ceph health detail
+
+::
+
+   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+   ...
+   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+   ...
+   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
+   osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
+   osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
+
+This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
+``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
+that placement group.
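+
+The ``last acting`` sets can be extracted from the ``ceph health detail`` text
+to build a list of OSDs to restart. The sketch below assumes the line format
+shown above; the helper name ``osds_to_restart`` is hypothetical.
+
+.. code-block:: python
+
+   import re
+
+   # Matches "pg 2.5 is stuck stale+active+remapped, last acting [2,0]".
+   STALE_LINE = re.compile(
+       r"^pg (\S+) is stuck (\S*stale\S*), last acting \[([\d,]*)\]"
+   )
+
+
+   def osds_to_restart(health_detail: str) -> dict:
+       """Map each stale PG to the OSD IDs in its ``last acting`` set."""
+       result = {}
+       for line in health_detail.splitlines():
+           match = STALE_LINE.match(line.strip())
+           if match:
+               pgid, _state, acting = match.groups()
+               result[pgid] = [int(osd) for osd in acting.split(",") if osd]
+       return result
+
+
+   SAMPLE = """\
+   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
+   """
+   print(osds_to_restart(SAMPLE))  # {'2.5': [2, 0]}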
+
+
+Only a Few OSDs Receive Data
+============================
+
+If only a few of the nodes in the cluster are receiving data, check the number
+of placement groups in the pool as instructed in the :ref:`Placement Groups
+<rados_ops_pgs_get_pg_num>` documentation. Since placement groups get mapped to
+OSDs in an operation involving dividing the number of placement groups in the
+cluster by the number of OSDs in the cluster, a small number of placement
+groups (the remainder, in this operation) are sometimes not distributed across
+the cluster. In situations like this, create a pool with a placement group
+count that is a multiple of the number of OSDs. See `Placement Groups`_ for
+details. See the :ref:`Pool, PG, and CRUSH Config Reference
+<rados_config_pool_pg_crush_ref>` for instructions on changing the default
+values used to determine how many placement groups are assigned to each pool.
+
+
+Can't Write Data
+================
+
+If the cluster is up, but some OSDs are down and you cannot write data, make
+sure that you have the minimum number of OSDs running in the pool. If you don't
+have the minimum number of OSDs running in the pool, Ceph will not allow you to
+write data to it because there is no guarantee that Ceph can replicate your
+data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
+Config Reference <rados_config_pool_pg_crush_ref>` for details.
+
+
+PGs Inconsistent
+================
+
+If the command ``ceph health detail`` returns an ``active + clean +
+inconsistent`` state, this might indicate an error during scrubbing. Identify
+the inconsistent placement group or placement groups by running the following
+command:
+
+.. prompt:: bash
+
+   ceph health detail
+
+::
+
+    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
+    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
+    2 scrub errors
+
+Alternatively, run this command if you prefer to inspect the output in a
+programmatic way:
+
+.. prompt:: bash
+
+   rados list-inconsistent-pg rbd
+
+::
+
+    ["0.6"]
+
+There is only one consistent state, but in the worst case, different
+inconsistencies might be found from multiple perspectives in more than one
+object. If an object named ``foo`` in PG ``0.6`` is truncated, the output of
+``rados list-inconsistent-obj 0.6`` will look something like this:
+
+.. prompt:: bash
+
+   rados list-inconsistent-obj 0.6 --format=json-pretty
+
+.. code-block:: javascript
+
+    {
+        "epoch": 14,
+        "inconsistents": [
+            {
+                "object": {
+                    "name": "foo",
+                    "nspace": "",
+                    "locator": "",
+                    "snap": "head",
+                    "version": 1
+                },
+                "errors": [
+                    "data_digest_mismatch",
+                    "size_mismatch"
+                ],
+                "union_shard_errors": [
+                    "data_digest_mismatch_info",
+                    "size_mismatch_info"
+                ],
+                "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
+                "shards": [
+                    {
+                        "osd": 0,
+                        "errors": [],
+                        "size": 968,
+                        "omap_digest": "0xffffffff",
+                        "data_digest": "0xe978e67f"
+                    },
+                    {
+                        "osd": 1,
+                        "errors": [],
+                        "size": 968,
+                        "omap_digest": "0xffffffff",
+                        "data_digest": "0xe978e67f"
+                    },
+                    {
+                        "osd": 2,
+                        "errors": [
+                            "data_digest_mismatch_info",
+                            "size_mismatch_info"
+                        ],
+                        "size": 0,
+                        "omap_digest": "0xffffffff",
+                        "data_digest": "0xffffffff"
+                    }
+                ]
+            }
+        ]
+    }
+
+In this case, the output indicates the following:
+
+* The only inconsistent object is named ``foo``, and its head has
+  inconsistencies.
+* The inconsistencies fall into two categories:
+
+  #. ``errors``: these errors indicate inconsistencies between shards, without
+     an indication of which shard(s) are bad. Check for the ``errors`` in the
+     ``shards`` array, if available, to pinpoint the problem.
+
+     * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
+       is different from the digests of the replica reads of ``OSD.0`` and
+       ``OSD.1``
+     * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
+       but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.
+
+  #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
+     ``shards`` array. The ``errors`` are set for the shard with the problem.
+     These errors include ``read_error`` and other similar errors. The
+     ``errors`` ending in ``info`` indicate a comparison with
+     ``selected_object_info``. Examine the ``shards`` array to determine
+     which shard has which error or errors.
+
+     * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
+       is not ``0xffffffff``, which is calculated from the shard read from
+       ``OSD.2``
+     * ``size_mismatch_info``: the size stored in the ``object-info`` is
+       different from the size read from ``OSD.2``. The latter is ``0``.
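+
+The shards that carry errors can be picked out of the
+``rados list-inconsistent-obj`` output as follows. This is a minimal sketch
+that assumes the JSON layout shown above (abbreviated in the sample); the
+function name ``bad_shards`` is hypothetical.
+
+.. code-block:: python
+
+   import json
+
+
+   def bad_shards(list_inconsistent_obj_json: str) -> dict:
+       """Map each inconsistent object to {osd_id: errors} for failing shards."""
+       report = json.loads(list_inconsistent_obj_json)
+       result = {}
+       for item in report.get("inconsistents", []):
+           name = item["object"]["name"]
+           result[name] = {
+               shard["osd"]: shard["errors"]
+               for shard in item.get("shards", [])
+               if shard["errors"]  # skip shards whose error list is empty
+           }
+       return result
+
+
+   SAMPLE = """
+   { "epoch": 14,
+     "inconsistents": [
+       { "object": { "name": "foo", "snap": "head", "version": 1 },
+         "errors": ["data_digest_mismatch", "size_mismatch"],
+         "shards": [
+           { "osd": 0, "errors": [], "size": 968 },
+           { "osd": 1, "errors": [], "size": 968 },
+           { "osd": 2,
+             "errors": ["data_digest_mismatch_info", "size_mismatch_info"],
+             "size": 0 } ] } ] }
+   """
+   print(bad_shards(SAMPLE))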
+
+.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
+   inconsistency is likely due to physical storage errors. In cases like this,
+   check the storage used by that OSD. 
+   
+   Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
+   repair.
+
+To repair the inconsistent placement group, run a command of the following
+form:
+
+.. prompt:: bash
+
+   ceph pg repair {placement-group-ID}
+    
+.. warning:: This command overwrites the "bad" copies with "authoritative"
+   copies. In most cases, Ceph is able to choose authoritative copies from all
+   the available replicas by using some predefined criteria. This, however,
+   does not work in every case. For example, it might be the case that the
+   stored data digest is missing, which means that the calculated digest is
+   ignored when Ceph chooses the authoritative copies. Be aware of this, and
+   use the above command with caution.
+
+
+If you receive ``active + clean + inconsistent`` states periodically due to
+clock skew, consider configuring the `NTP
+<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
+hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
+and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.
+
+
+Erasure Coded PGs are not active+clean
+======================================
+
+If CRUSH fails to find enough OSDs to map to a PG, it will show as a
+``2147483647`` which is ``ITEM_NONE`` or ``no OSD found``. For example::
+
+     [2,1,6,0,5,8,2147483647,7,4]
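+
+The unmapped positions in such a set can be located with a short check. This
+is an illustrative sketch; the constant and function names are chosen here
+and do not come from a Ceph library.
+
+.. code-block:: python
+
+   ITEM_NONE = 2147483647  # 0x7fffffff, shown when CRUSH finds no OSD
+
+
+   def unmapped_positions(osd_set: list) -> list:
+       """Return the indexes in the set for which no OSD was found."""
+       return [index for index, osd in enumerate(osd_set) if osd == ITEM_NONE]
+
+
+   print(unmapped_positions([2, 1, 6, 0, 5, 8, 2147483647, 7, 4]))  # [6]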
+
+Not enough OSDs
+---------------
+
+If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
+OSDs, the cluster will show "Not enough OSDs". In this case, you can either
+create another erasure coded pool that requires fewer OSDs by running commands
+of the following form:
+
+.. prompt:: bash
+
+     ceph osd erasure-code-profile set myprofile k=5 m=3
+     ceph osd pool create erasurepool erasure myprofile
+
+or add new OSDs, and the PG will automatically use them.
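+
+The arithmetic behind this choice is that each PG of an erasure coded pool
+needs ``k + m`` OSDs. The following sketch checks whether a profile fits a
+cluster; the function name ``profile_fits`` is hypothetical, and the ``k=6
+m=3`` profile is only an assumed example of a pool that needs nine OSDs.
+
+.. code-block:: python
+
+   def profile_fits(k: int, m: int, osds_available: int) -> bool:
+       """Each PG needs one OSD per data chunk (k) and per coding chunk (m)."""
+       return k + m <= osds_available
+
+
+   print(profile_fits(k=6, m=3, osds_available=8))  # False: nine OSDs needed
+   print(profile_fits(k=5, m=3, osds_available=8))  # True: eight OSDs needed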
+
+CRUSH constraints cannot be satisfied
+-------------------------------------
+
+If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
+constraints that cannot be satisfied. If there are ten OSDs on two hosts and
+the CRUSH rule requires that no two OSDs from the same host are used in the
+same PG, the mapping may fail because only two OSDs will be found. Check the
+constraint by displaying ("dumping") the rule, as shown here:
+
+.. prompt:: bash
+
+   ceph osd crush rule ls
+
+::
+
+    [
+        "replicated_rule",
+        "erasurepool"]
+    $ ceph osd crush rule dump erasurepool
+    { "rule_id": 1,
+      "rule_name": "erasurepool",
+      "type": 3,
+      "steps": [
+            { "op": "take",
+              "item": -1,
+              "item_name": "default"},
+            { "op": "chooseleaf_indep",
+              "num": 0,
+              "type": "host"},
+            { "op": "emit"}]}
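+
+The constraint can be checked against the cluster layout once the rule has
+been dumped. The sketch below assumes the JSON layout shown above and a
+hand-written count of buckets per type; the function names are hypothetical,
+and only the first ``choose`` or ``chooseleaf`` step is considered.
+
+.. code-block:: python
+
+   import json
+
+
+   def failure_domain(rule_dump_json: str):
+       """Return the bucket type used by the first choose/chooseleaf step."""
+       rule = json.loads(rule_dump_json)
+       for step in rule["steps"]:
+           if step["op"].startswith("choose"):
+               return step["type"]
+       return None
+
+
+   def can_map(rule_dump_json: str, osds_per_pg: int, buckets_by_type: dict) -> bool:
+       """True if there are enough distinct buckets of the failure-domain type."""
+       domain = failure_domain(rule_dump_json)
+       return buckets_by_type.get(domain, 0) >= osds_per_pg
+
+
+   SAMPLE_RULE = """
+   { "rule_id": 1,
+     "rule_name": "erasurepool",
+     "type": 3,
+     "steps": [
+       { "op": "take", "item": -1, "item_name": "default"},
+       { "op": "chooseleaf_indep", "num": 0, "type": "host"},
+       { "op": "emit"}]}
+   """
+   # Ten OSDs on two hosts: a rule that separates by host cannot place nine.
+   print(failure_domain(SAMPLE_RULE))                        # host
+   print(can_map(SAMPLE_RULE, 9, {"host": 2, "osd": 10}))    # False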
+
+
+Resolve this problem by creating a new pool in which PGs are allowed to have
+OSDs residing on the same host by running the following commands:
+
+.. prompt:: bash
+
+   ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
+   ceph osd pool create erasurepool erasure myprofile
+
+CRUSH gives up too soon
+-----------------------
+
+If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
+with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
+PG), it is possible that CRUSH gives up before finding a mapping. This problem
+can be resolved by:
+
+* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
+  requires the creation of another pool, because erasure code profiles cannot
+  be modified dynamically).
+
+* adding more OSDs to the cluster (this does not require the erasure coded pool
+  to be modified, because it will become clean automatically)
+
+* using a handmade CRUSH rule that tries more times to find a good mapping.
+  This can be modified for an existing CRUSH rule by setting
+  ``set_choose_tries`` to a value greater than the default.
+
+First, verify the problem by using ``crushtool`` after extracting the crushmap
+from the cluster. This ensures that your experiments do not modify the Ceph
+cluster and that they operate only on local files:
+
+.. prompt:: bash
+
+   ceph osd crush rule dump erasurepool
+
+::
+
+    { "rule_id": 1,
+      "rule_name": "erasurepool",
+      "type": 3,
+      "steps": [
+            { "op": "take",
+              "item": -1,
+              "item_name": "default"},
+            { "op": "chooseleaf_indep",
+              "num": 0,
+              "type": "host"},
+            { "op": "emit"}]}
+    $ ceph osd getcrushmap > crush.map
+    got crush map from osdmap epoch 13
+    $ crushtool -i crush.map --test --show-bad-mappings \
+       --rule 1 \
+       --num-rep 9 \
+       --min-x 1 --max-x $((1024 * 1024))
+    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
+    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
+    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
+
+Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
+needs, and ``--rule`` is the value of the ``rule_id`` field that was displayed
+by ``ceph osd crush rule dump``. This test attempts to map one million values
+(in this example, the range defined by ``[--min-x,--max-x]``). If CRUSH is
+giving up too soon, the test displays at least one bad mapping. If this test
+outputs nothing, all mappings have been successful and you can be assured that
+the problem with your cluster is not caused by bad mappings.
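+
+When the test prints many lines, the bad mappings can be counted with a short
+script. This is a minimal sketch that assumes the ``bad mapping`` line format
+shown above; the function name ``bad_mappings`` is hypothetical.
+
+.. code-block:: python
+
+   import re
+
+   ITEM_NONE = 2147483647  # placeholder printed where no OSD was found
+   BAD_LINE = re.compile(
+       r"^bad mapping rule (\d+) x (\d+) num_rep (\d+) result \[([\d,]+)\]"
+   )
+
+
+   def bad_mappings(crushtool_output: str) -> list:
+       """Return (x, number_of_unmapped_slots) for each bad mapping line."""
+       found = []
+       for line in crushtool_output.splitlines():
+           match = BAD_LINE.match(line.strip())
+           if match:
+               result = [int(value) for value in match.group(4).split(",")]
+               found.append((int(match.group(2)), result.count(ITEM_NONE)))
+       return found
+
+
+   SAMPLE = """\
+   bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
+   bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
+   bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
+   """
+   print(bad_mappings(SAMPLE))  # [(43, 1), (79, 1), (173, 1)]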
+
+Changing the value of set_choose_tries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. Decompile the CRUSH map to edit the CRUSH rule by running the following
+   command:
+
+   .. prompt:: bash
+
+      crushtool --decompile crush.map > crush.txt
+
+#. Add the following line to the rule::
+
+      step set_choose_tries 100
+
+   The relevant part of the ``crush.txt`` file will resemble this::
+
+      rule erasurepool {
+              id 1
+              type erasure
+              step set_chooseleaf_tries 5
+              step set_choose_tries 100
+              step take default
+              step chooseleaf indep 0 type host
+              step emit
+      }
+
+#. Recompile and retest the CRUSH rule:
+
+   .. prompt:: bash
+
+      crushtool --compile crush.txt -o better-crush.map
+
+#. When all mappings succeed, display a histogram of the number of tries that
+   were necessary to find all of the mappings by using the
+   ``--show-choose-tries`` option of the ``crushtool`` command, as in the
+   following example:
+
+   .. prompt:: bash
+
+      crushtool -i better-crush.map --test --show-bad-mappings \
+       --show-choose-tries \
+       --rule 1 \
+       --num-rep 9 \
+       --min-x 1 --max-x $((1024 * 1024))
+
+   ::
+
+      ...
+      11:        42
+      12:        44
+      13:        54
+      14:        45
+      15:        35
+      16:        34
+      17:        30
+      18:        25
+      19:        19
+      20:        22
+      21:        20
+      22:        17
+      23:        13
+      24:        16
+      25:        13
+      26:        11
+      27:        11
+      28:        13
+      29:        11
+      30:        10
+      31:         6
+      32:         5
+      33:        10
+      34:         3
+      35:         7
+      36:         5
+      37:         2
+      38:         5
+      39:         5
+      40:         2
+      41:         5
+      42:         4
+      43:         1
+      44:         2
+      45:         2
+      46:         3
+      47:         1
+      48:         0
+      ...
+      102:         0
+      103:         1
+      104:         0
+      ...
+
+   This output indicates that it took eleven tries to map forty-two PGs, twelve
+   tries to map forty-four PGs, and so on. The highest number of tries is the
+   minimum value of ``set_choose_tries`` that prevents bad mappings (for
+   example, ``103`` in the above output, because it did not take more than 103
+   tries for any PG to be mapped).
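+
+Reading the minimum value off a long histogram can be automated. The sketch
+below assumes the ``tries: count`` line format shown above and skips any line
+that does not fit it; the function name ``minimum_choose_tries`` is
+hypothetical.
+
+.. code-block:: python
+
+   def minimum_choose_tries(histogram_text: str) -> int:
+       """Return the highest number of tries that has a non-zero count."""
+       highest = 0
+       for line in histogram_text.splitlines():
+           tries, separator, count = line.partition(":")
+           if not separator:
+               continue  # for example, the "..." placeholder lines
+           try:
+               tries_value = int(tries.strip())
+               count_value = int(count.strip())
+           except ValueError:
+               continue
+           if count_value > 0:
+               highest = max(highest, tries_value)
+       return highest
+
+
+   SAMPLE = """\
+   ...
+   46:         3
+   47:         1
+   48:         0
+   ...
+   102:         0
+   103:         1
+   104:         0
+   ...
+   """
+   print(minimum_choose_tries(SAMPLE))  # 103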
+
+.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
+.. _Placement Groups: ../../operations/placement-groups
+.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref