.. _mgr-cephadm-monitoring:

Monitoring Services
===================

Ceph Dashboard uses `Prometheus <https://prometheus.io/>`_, `Grafana
<https://grafana.com/>`_, and related tools to store and visualize detailed
metrics on cluster utilization and performance. Ceph users have three options:

#. Have cephadm deploy and configure these services. This is the default
   when bootstrapping a new cluster unless the ``--skip-monitoring-stack``
   option is used.
#. Deploy and configure these services manually. This is recommended for users
   with existing prometheus services in their environment (and in cases where
   Ceph is running in Kubernetes with Rook).
#. Skip the monitoring stack completely. Some Ceph dashboard graphs will
   not be available.

The monitoring stack consists of `Prometheus <https://prometheus.io/>`_,
Prometheus exporters (:ref:`mgr-prometheus`, `Node exporter
<https://prometheus.io/docs/guides/node-exporter/>`_), `Prometheus Alert
Manager <https://prometheus.io/docs/alerting/alertmanager/>`_, and `Grafana
<https://grafana.com/>`_.

.. note::

   Prometheus' security model presumes that untrusted users have access to the
   Prometheus HTTP endpoint and logs. Untrusted users have access to all the
   (meta)data Prometheus collects that is contained in the database, plus a
   variety of operational and debugging information.

   However, Prometheus' HTTP API is limited to read-only operations.
   Configurations can *not* be changed using the API and secrets are not
   exposed. Moreover, Prometheus has some built-in measures to mitigate the
   impact of denial-of-service attacks.

   Please see `Prometheus' security model
   <https://prometheus.io/docs/operating/security/>`_ for more detailed
   information.

Deploying monitoring with cephadm
---------------------------------

The default behavior of ``cephadm`` is to deploy a basic monitoring stack. It
is, however, possible to have a Ceph cluster without a monitoring stack and to
add one to it later. You might have ended up in this state because you passed
the ``--skip-monitoring-stack`` option to ``cephadm`` during installation of
the cluster, or because you converted an existing cluster (which had no
monitoring stack) to cephadm management.

To set up monitoring on a Ceph cluster that has no monitoring, follow the
steps below (a spec-file alternative to these ``ceph orch apply`` commands is
sketched after the list):

#. Deploy a node-exporter service on every node of the cluster. The
   node-exporter provides host-level metrics like CPU and memory utilization:

   .. prompt:: bash #

      ceph orch apply node-exporter

#. Deploy alertmanager:

   .. prompt:: bash #

      ceph orch apply alertmanager

#. Deploy Prometheus. A single Prometheus instance is sufficient, but
   for high availability (HA) you might want to deploy two:

   .. prompt:: bash #

      ceph orch apply prometheus

   or

   .. prompt:: bash #

      ceph orch apply prometheus --placement 'count:2'

#. Deploy grafana:

   .. prompt:: bash #

      ceph orch apply grafana
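Each of these services can also be deployed from a service specification file
instead of the command line. The following is a minimal sketch, assuming the
spec is saved locally as ``monitoring.yaml``; the filename and the two-daemon
placement are illustrative choices, not requirements:

.. code-block:: yaml

   # Deploy two Prometheus daemons for HA. The other monitoring
   # services (node-exporter, alertmanager, grafana) can be declared
   # in the same file as additional YAML documents separated by "---".
   service_type: prometheus
   placement:
     count: 2

Apply it with:

.. prompt:: bash #

   ceph orch apply -i monitoring.yaml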
.. _cephadm-monitoring-centralized-logs:

Centralized Logging in Ceph
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ceph now provides centralized logging with Loki & Promtail. Centralized Log
Management (CLM) consolidates all log data and pushes it to a central
repository, with an accessible and easy-to-use interface. Centralized logging
is designed to make your life easier. Some of the advantages are:

#. **Linear event timeline**: it is easier to troubleshoot issues by analyzing
   a single chain of events than thousands of separate logs from a hundred
   nodes.
#. **Real-time live log monitoring**: it is impractical to follow logs from
   thousands of different sources.
#. **Flexible retention policies**: with per-daemon logs, log rotation is
   usually set to a short interval (1-2 weeks) to save disk usage.
#. **Increased security & backup**: logs can contain sensitive information and
   expose usage patterns. Additionally, centralized logging allows for HA of
   the log data.

Centralized Logging in Ceph is implemented using two new services: ``loki``
and ``promtail``.

Loki: a log aggregation system used to query logs. It can be configured as a
datasource in Grafana.

Promtail: an agent that gathers logs from the system and makes them available
to Loki.

These two services are not deployed by default in a Ceph cluster. To enable
centralized logging, follow the steps in :ref:`centralized-logging`.

.. _cephadm-monitoring-networks-ports:

Networks and Ports
~~~~~~~~~~~~~~~~~~

All monitoring services can have the network and port they bind to configured
with a yaml service specification. By default, cephadm uses the ``https``
protocol when configuring Grafana daemons unless the user explicitly sets the
protocol to ``http``.

Example spec file:

.. code-block:: yaml

    service_type: grafana
    service_name: grafana
    placement:
      count: 1
    networks:
    - 192.169.142.0/24
    spec:
      port: 4200
      protocol: http

.. _cephadm_monitoring-images:

Using custom images
~~~~~~~~~~~~~~~~~~~

It is possible to install or upgrade monitoring components based on other
images. To do so, the name of the image to be used needs to be stored in the
configuration first. The following configuration options are available:

- ``container_image_prometheus``
- ``container_image_grafana``
- ``container_image_alertmanager``
- ``container_image_node_exporter``

Custom images can be set with the ``ceph config`` command:

.. code-block:: bash

     ceph config set mgr mgr/cephadm/<option_name> <value>

For example:

.. code-block:: bash

     ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1

If monitoring stack daemon(s) of the type whose image you've changed were
already running, you must redeploy the daemon(s) in order to have them
actually use the new image.

For example, if you had changed the prometheus image:

.. prompt:: bash #

     ceph orch redeploy prometheus

.. note::

     By setting a custom image, the default value will be overridden (but not
     overwritten). The default value changes when updates become available.
     By setting a custom image, you will not be able to update the component
     you have set the custom image for automatically. You will need to
     manually update the configuration (image name and tag) to be able to
     install updates.

     If you choose to go back to the default value instead, you can reset the
     custom image you have set before. After that, the default value will be
     used again. Use ``ceph config rm`` to reset the configuration option:

     .. code-block:: bash

          ceph config rm mgr mgr/cephadm/<option_name>

     For example:

     .. code-block:: bash

          ceph config rm mgr mgr/cephadm/container_image_prometheus
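To confirm which image a daemon is actually running after a redeploy, you can
inspect the daemon list. This is a sketch; the output columns vary by release,
and the ``--daemon_type`` filter may not be available on older versions:

.. prompt:: bash #

   ceph orch ps --daemon_type prometheus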
See also :ref:`cephadm-airgap`.

.. _cephadm-overwrite-jinja2-templates:

Using custom configuration files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By overriding cephadm templates, it is possible to completely customize the
configuration files for monitoring services.

Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. Starting from version
17.2.3, cephadm supports Prometheus http service discovery and uses this
endpoint for the definition and management of the embedded Prometheus
service. The endpoint listens on ``https://<mgr-ip>:8765/sd/`` (the port is
configurable through the variable ``service_discovery_port``) and returns
scrape target information in `http_sd_config format
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config>`_.

Customers with an external monitoring stack can use the `ceph-mgr` service
discovery endpoint to get scraping configuration. The root certificate of the
server can be obtained with the following command:

   .. prompt:: bash #

      ceph orch sd dump cert

The configuration of Prometheus, Grafana, or Alertmanager may be customized
by storing a Jinja2 template for each service. This template will be
evaluated every time a service of that kind is deployed or reconfigured. That
way, the custom configuration is preserved and automatically applied on
future deployments of these services.

.. note::

   The configuration of the custom template is also preserved when the
   default configuration of cephadm changes. If the updated configuration is
   to be used, the custom template needs to be migrated *manually* after each
   upgrade of Ceph.

Option names
""""""""""""

The following templates for files that will be generated by cephadm can be
overridden. These are the names to be used when storing with ``ceph
config-key set``:

- ``services/alertmanager/alertmanager.yml``
- ``services/grafana/ceph-dashboard.yml``
- ``services/grafana/grafana.ini``
- ``services/prometheus/prometheus.yml``
- ``services/prometheus/alerting/custom_alerts.yml``
- ``services/loki.yml``
- ``services/promtail.yml``

You can look up the file templates that are currently used by cephadm in
``src/pybind/mgr/cephadm/templates``:

- ``services/alertmanager/alertmanager.yml.j2``
- ``services/grafana/ceph-dashboard.yml.j2``
- ``services/grafana/grafana.ini.j2``
- ``services/prometheus/prometheus.yml.j2``
- ``services/loki.yml.j2``
- ``services/promtail.yml.j2``

Usage
"""""

The following command applies a single-line value:

.. code-block:: bash

  ceph config-key set mgr/cephadm/<option_name> <value>

To set the contents of a file as the template, use the ``-i`` argument:

.. code-block:: bash

  ceph config-key set mgr/cephadm/<option_name> -i $PWD/<filename>

.. note::

  When using files as input to ``config-key``, an absolute path to the file
  must be used.


Then the configuration file for the service needs to be recreated.
This is done using ``reconfig``. For more details see the following example.

Example
"""""""

.. code-block:: bash

  # set the contents of ./prometheus.yml.j2 as template
  ceph config-key set mgr/cephadm/services/prometheus/prometheus.yml \
    -i $PWD/prometheus.yml.j2

  # reconfig the prometheus service
  ceph orch reconfig prometheus

.. code-block:: bash

  # set additional custom alerting rules for Prometheus
  ceph config-key set mgr/cephadm/services/prometheus/alerting/custom_alerts.yml \
    -i $PWD/custom_alerts.yml

Note that custom alerting rules are not parsed by Jinja, so escaping is not
an issue there.
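For reference, ``custom_alerts.yml`` uses standard Prometheus alerting-rule
syntax. The following is a minimal sketch; the group name, alert name,
expression, and threshold are all illustrative:

.. code-block:: yaml

   groups:
     - name: custom
       rules:
         # Illustrative rule: fire when any scrape target has been
         # unreachable for five minutes.
         - alert: TargetDown
           expr: up == 0
           for: 5m
           labels:
             severity: critical
           annotations:
             summary: "Target {{ $labels.instance }} is down"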
Deploying monitoring without cephadm
------------------------------------

If you have an existing prometheus monitoring infrastructure, or would like
to manage it yourself, you need to configure it to integrate with your Ceph
cluster.

* Enable the prometheus module in the ceph-mgr daemon:

  .. code-block:: bash

     ceph mgr module enable prometheus

  By default, ceph-mgr presents prometheus metrics on port 9283 on each host
  running a ceph-mgr daemon. Configure prometheus to scrape these.

To make this integration easier, cephadm provides a service discovery
endpoint at ``https://<mgr-ip>:8765/sd/``. This endpoint can be used by an
external Prometheus server to retrieve target information for a specific
service. The information returned by this endpoint uses the format specified
by the Prometheus `http_sd_config option
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config>`_.

Here's an example prometheus job definition that uses the cephadm service
discovery endpoint:

  .. code-block:: yaml

    - job_name: 'ceph-exporter'
      http_sd_configs:
      - url: http://<mgr-ip>:8765/sd/prometheus/sd-config?service=ceph-exporter


* To enable the dashboard's prometheus-based alerting, see
  :ref:`dashboard-alerting`.

* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.

Disabling monitoring
--------------------

To disable monitoring and remove the software that supports it, run the
following commands:

.. code-block:: console

  $ ceph orch rm grafana
  $ ceph orch rm prometheus --force   # this will delete metrics data collected so far
  $ ceph orch rm node-exporter
  $ ceph orch rm alertmanager
  $ ceph mgr module disable prometheus

See also :ref:`orch-rm`.

Setting up RBD-Image monitoring
-------------------------------

Due to performance reasons, monitoring of RBD images is disabled by default.
For more information please see :ref:`prometheus-rbd-io-statistics`. If
disabled, the overview and details dashboards will stay empty in Grafana and
the metrics will not be visible in Prometheus.

Setting up Prometheus
---------------------

Setting Prometheus Retention Size and Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm can configure Prometheus TSDB retention by specifying
``retention_time`` and ``retention_size`` values in the Prometheus service
spec. The retention time value defaults to 15 days (15d). Users can set a
different value/unit where supported units are: 'y', 'w', 'd', 'h', 'm' and
's'. The retention size value defaults to 0 (disabled). Supported units in
this case are: 'B', 'KB', 'MB', 'GB', 'TB', 'PB' and 'EB'.

In the following example spec, we set the retention time to 1 year and the
size to 1GB:

.. code-block:: yaml

  service_type: prometheus
  placement:
    count: 1
  spec:
    retention_time: "1y"
    retention_size: "1GB"

.. note::

  If you already had Prometheus daemon(s) deployed and are updating an
  existing spec, as opposed to doing a fresh Prometheus deployment, you must
  also tell cephadm to redeploy the Prometheus daemon(s) to put this change
  into effect. This can be done with a ``ceph orch redeploy prometheus``
  command.
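To confirm that the retention settings (or any other spec changes) were
stored as intended, the applied specification can be exported back as YAML.
A sketch, assuming the service was deployed as ``prometheus`` above:

.. prompt:: bash #

   ceph orch ls prometheus --export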
Setting up Grafana
------------------

Manually setting the Grafana URL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm automatically configures Prometheus, Grafana, and Alertmanager in
all cases except one.

In some setups, the Dashboard user's browser might not be able to access the
Grafana URL that is configured in Ceph Dashboard. This can happen when the
cluster and the accessing user are in different DNS zones.

If this is the case, you can use a configuration option for Ceph Dashboard
to set the URL that the user's browser will use to access Grafana. This
value will never be altered by cephadm. To set this configuration option,
issue the following command:

   .. prompt:: bash $

     ceph dashboard set-grafana-frontend-api-url <grafana-server-api>

It might take a minute or two for services to be deployed. After the
services have been deployed, you should see something like this when you
issue the command ``ceph orch ls``:

.. code-block:: console

  $ ceph orch ls
  NAME           RUNNING  REFRESHED  IMAGE NAME                                      IMAGE ID      SPEC
  alertmanager   1/1      6s ago     docker.io/prom/alertmanager:latest              0881eb8f169f  present
  crash          2/2      6s ago     docker.io/ceph/daemon-base:latest-master-devel  mix           present
  grafana        1/1      0s ago     docker.io/pcuzner/ceph-grafana-el8:latest       f77afcf0bcf6  absent
  node-exporter  2/2      6s ago     docker.io/prom/node-exporter:latest             e5a616e4b9cf  present
  prometheus     1/1      6s ago     docker.io/prom/prometheus:latest                e935122ab143  present

Configuring SSL/TLS for Grafana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``cephadm`` deploys Grafana using the certificate defined in the ceph
key/value store. If no certificate is specified, ``cephadm`` generates a
self-signed certificate during the deployment of the Grafana service. Each
certificate is specific to the host it was generated on.

A custom certificate can be configured using the following commands:

.. prompt:: bash #

  ceph config-key set mgr/cephadm/{hostname}/grafana_key -i $PWD/key.pem
  ceph config-key set mgr/cephadm/{hostname}/grafana_crt -i $PWD/certificate.pem

Where ``hostname`` is the hostname of the host where the grafana service is
deployed.

If you have already deployed Grafana, run ``reconfig`` on the service to
update its configuration:

.. prompt:: bash #

  ceph orch reconfig grafana

The ``reconfig`` command also sets the proper URL for Ceph Dashboard.
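If you do not already have a certificate and key pair to supply, a
self-signed pair can be generated with ``openssl``. This is a sketch; the
subject name and validity period are placeholders to adjust for your
environment:

.. prompt:: bash #

   openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
     -keyout key.pem -out certificate.pem -subj "/CN=grafana.example.com"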
Setting the initial admin password
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Grafana will not create an initial admin user. In order to
create the admin user, please create a file ``grafana.yaml`` with this
content:

.. code-block:: yaml

  service_type: grafana
  spec:
    initial_admin_password: mypassword

Then apply this specification:

.. code-block:: bash

  ceph orch apply -i grafana.yaml
  ceph orch redeploy grafana

Grafana will now create an admin user called ``admin`` with the given
password.

Turning off anonymous access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, cephadm allows anonymous users (users who have not provided any
login information) limited, viewer-only access to the grafana dashboard. In
order to set up grafana to allow viewing only by logged-in users, you can
set ``anonymous_access: False`` in your grafana spec:

.. code-block:: yaml

  service_type: grafana
  placement:
    hosts:
    - host1
  spec:
    anonymous_access: False
    initial_admin_password: "mypassword"

Since deploying grafana with anonymous access set to false without an
initial admin password set would make the dashboard inaccessible, cephadm
requires setting the ``initial_admin_password`` when ``anonymous_access`` is
set to false.


Setting up Alertmanager
-----------------------

Adding Alertmanager webhooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add new webhooks to the Alertmanager configuration, add additional
webhook urls like so:

.. code-block:: yaml

  service_type: alertmanager
  spec:
    user_data:
      default_webhook_urls:
      - "https://foo"
      - "https://bar"

Where ``default_webhook_urls`` is a list of additional URLs that are added
to the default receivers' ``<webhook_configs>`` configuration.

Run ``reconfig`` on the service to update its configuration:

.. prompt:: bash #

  ceph orch reconfig alertmanager

Turn on Certificate Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are using certificates for alertmanager and want to make sure these
certs are verified, you should set the "secure" option to true in your
alertmanager spec (this defaults to false):

.. code-block:: yaml

  service_type: alertmanager
  spec:
    secure: true

If you already had alertmanager daemons running before applying the spec,
you must reconfigure them to update their configuration:

.. prompt:: bash #

  ceph orch reconfig alertmanager

Further Reading
---------------

* :ref:`mgr-prometheus`