Diffstat (limited to 'doc/cephadm/troubleshooting.rst')
-rw-r--r-- | doc/cephadm/troubleshooting.rst | 370 |
1 file changed, 370 insertions, 0 deletions
diff --git a/doc/cephadm/troubleshooting.rst b/doc/cephadm/troubleshooting.rst
new file mode 100644
index 000000000..9a534f633
--- /dev/null
+++ b/doc/cephadm/troubleshooting.rst
@@ -0,0 +1,370 @@

Troubleshooting
===============

You might need to investigate why a cephadm command failed
or why a certain service no longer runs properly.

Cephadm deploys daemons as containers, which means that
troubleshooting those containerized daemons can work
differently than troubleshooting traditional, non-containerized
daemons.

Here are some tools and commands to help you troubleshoot
your Ceph environment.

.. _cephadm-pause:

Pausing or disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving badly, you can
pause most of the Ceph cluster's background activity by running
the following command:

.. prompt:: bash #

   ceph orch pause

This stops all changes in the Ceph cluster, but cephadm will
still periodically check hosts to refresh its inventory of
daemons and devices. You can disable cephadm completely by
running the following commands:

.. prompt:: bash #

   ceph orch set backend ''
   ceph mgr module disable cephadm

These commands disable all of the ``ceph orch ...`` CLI commands.
All previously deployed daemon containers continue to exist and
will start as they did before you ran these commands.

See :ref:`cephadm-spec-unmanaged` for information on disabling
individual services.


Per-service and per-daemon events
---------------------------------

To help with debugging failed daemon deployments, cephadm stores
events per service and per daemon. These events often contain
information relevant to troubleshooting your Ceph cluster.

Listing service events
~~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain service, run a
command of the following form:

.. prompt:: bash #

   ceph orch ls --service_name=<service-name> --format yaml

This will return something in the following form:

.. code-block:: yaml

   service_type: alertmanager
   service_name: alertmanager
   placement:
     hosts:
     - unknown_host
   status:
     ...
     running: 1
     size: 1
   events:
   - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
   - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
     place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

Listing daemon events
~~~~~~~~~~~~~~~~~~~~~

To see the events associated with a certain daemon, run a
command of the following form:

.. prompt:: bash #

   ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

This will return something in the following form:

.. code-block:: yaml

   daemon_type: mds
   daemon_id: cephfs.hostname.ppdhsz
   hostname: hostname
   status_desc: running
   ...
   events:
   - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
     mds.cephfs.hostname.ppdhsz on host 'hostname'"


Checking cephadm logs
---------------------

To learn how to monitor cephadm logs as they are generated, read :ref:`watching_cephadm_logs`.
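As a quick illustration of what :ref:`watching_cephadm_logs` describes, cephadm log
messages can usually be followed live from any host that has a client keyring. The
exact flags below may vary between releases, so treat this as a sketch rather than a
complete reference::

    # follow cephadm cluster log messages as they are emitted
    ceph -W cephadm

    # also include debug-level messages
    ceph -W cephadm --watch-debug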
If your Ceph cluster has been configured to log events to files, a cephadm log file
called ``ceph.cephadm.log`` will exist on all monitor hosts (see :ref:`cephadm-logs`
for a more complete explanation).

Gathering log files
-------------------

Use ``journalctl`` to gather the log files of all daemons:

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run::

    cephadm logs --name <name-of-daemon>

Note: this only works when run on the same host on which the daemon is running. To
get the logs of a daemon that is running on a different host, add the ``--fsid`` option::

    cephadm logs --fsid <fsid> --name <name-of-daemon>

where ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.

To fetch all log files of all daemons on a given host, run::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > $name;
    done

Collecting systemd status
-------------------------

To print the state of a systemd unit, run::

    systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";


To fetch the state of all daemons on a given host, run::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done


List all downloaded container images
------------------------------------

To list all container images that are downloaded on a host:

.. note:: ``Image`` might also be called ``ImageID``.

::

    podman ps -a --format json | jq '.[].Image'
    "docker.io/library/centos:8"
    "registry.opensuse.org/opensuse/leap:15.2"


Manually running containers
---------------------------

Cephadm writes small wrappers that run containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the
container execution command.
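A minimal sketch of how this can be used (the names here are illustrative;
``<service-name>`` typically takes the form ``<daemon-type>.<daemon-id>``, for
example ``mon.host1``)::

    # show the exact container command line that systemd uses to start the daemon
    cat /var/lib/ceph/<cluster-fsid>/<service-name>/unit.run

    # the wrapper is a plain shell script, so it can usually also be executed directly
    # in the foreground for debugging (stop the corresponding systemd unit first so
    # that two copies of the daemon are not running at once)
    bash /var/lib/ceph/<cluster-fsid>/<service-name>/unit.run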
.. _cephadm-ssh-errors:

SSH errors
----------

Error message::

    execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
    ...
    raise OrchestratorError(msg) from e
    orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
    Please make sure that the host is reachable and accepts connections using the cephadm SSH key
    ...

Things users can do:

1. Ensure cephadm has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
     obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the SSH config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that we can connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the authorized_keys file, run the following commands::

    [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
    [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys

Failed to infer CIDR network error
----------------------------------

If you see this error::

    ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

    Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form::

    ceph config set mon public_network <mon_network>
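For example, if the monitor IPs are on a hypothetical ``10.1.2.0/24`` subnet, the
command would be (the subnet here is purely illustrative)::

    ceph config set mon public_network 10.1.2.0/24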
For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.

Accessing the admin socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the
MONs (see :ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

    [root@mon1 ~]# cephadm enter --name <daemon-name>
    [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show

Calling miscellaneous ceph tools
--------------------------------

To call miscellaneous tools like ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, you can run them by calling
``cephadm shell --name <daemon-name>`` like so::

    root@myhostname # cephadm unit --name mon.myhostname stop
    root@myhostname # cephadm shell --name mon.myhostname
    [ceph: root@myhostname /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-myhostname get monmap > monmap
    [ceph: root@myhostname /]# monmaptool --print monmap
    monmaptool: monmap file monmap
    epoch 1
    fsid 28596f44-3b56-11ec-9034-482ae35a5fbb
    last_changed 2021-11-01T20:57:19.755111+0000
    created 2021-11-01T20:57:19.755111+0000
    min_mon_release 17 (quincy)
    election_strategy: 1
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

This command sets up the environment in a way that is suitable
for extended daemon maintenance and for running the daemon interactively.

.. _cephadm-restore-quorum:

Restoring the MON quorum
------------------------

If the Ceph MONs cannot form a quorum, cephadm is not able to
manage the cluster until quorum is restored.

In order to restore the MON quorum, remove unhealthy MONs
from the monmap by following these steps:

1. Stop all MONs. For each MON host::

     ssh {mon-host}
     cephadm unit --name mon.`hostname` stop


2. Identify a surviving monitor and log in to that host::

     ssh {mon-host}
     cephadm enter --name mon.`hostname`

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.

.. _cephadm-manually-deploy-mgr:

Manually deploying a MGR daemon
-------------------------------

cephadm requires a MGR daemon in order to manage the cluster. If the last MGR
of a cluster was removed, follow these steps to manually deploy a MGR daemon
(named ``mgr.hostname.smfvfd`` in this example) on one of the hosts of your cluster.

Disable the cephadm scheduler to prevent cephadm from removing the new
MGR. See :ref:`cephadm-enable-cli`::

    ceph config-key set mgr/cephadm/pause true

Then get or create the auth entry for the new MGR::

    ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

Get the ceph.conf::

    ceph config generate-minimal-conf

Get the container image::

    ceph config get "mgr.hostname.smfvfd" container_image

Create a file ``config-json.json`` which contains the information necessary to deploy
the daemon:

.. code-block:: json

   {
     "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
     "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
   }

Deploy the daemon::

    cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

Analyzing core dumps
--------------------

When a Ceph daemon crashes, cephadm supports analyzing the core dump. To enable core dumps, run:

.. prompt:: bash #

   ulimit -c unlimited

Core dumps will now be written to ``/var/lib/systemd/coredump``.

.. note::

   Core dumps are not namespaced by the kernel, which means
   they will be written to ``/var/lib/systemd/coredump`` on
   the container host.

Now, wait for the crash to happen again. (To simulate the crash of a daemon, you can
run e.g. ``killall -3 ceph-mon``.)

Install debug packages by entering the cephadm shell and installing ``ceph-debuginfo``::

    # cephadm shell --mount /var/lib/systemd/coredump
    [ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd
    [ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
    [ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
    (gdb) bt
    #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
    #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
    #3  0x0000563085ca3d7e in main ()
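If the host uses ``systemd-coredump`` (the service that writes dumps into
``/var/lib/systemd/coredump`` in the first place), the ``coredumpctl`` tool can be a
convenient way to confirm that a dump was actually captured before entering the shell.
A small sketch, assuming the crashed daemon was a monitor::

    # list recorded core dumps for ceph-mon processes on this host
    coredumpctl list ceph-mon

    # show metadata (PID, signal, timestamp, storage path) for matching dumps
    coredumpctl info ceph-mon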