.. _mgr-cephadm-monitoring:

Monitoring Services
===================

Ceph Dashboard uses `Prometheus <https://prometheus.io/>`_, `Grafana
<https://grafana.com/>`_, and related tools to store and visualize detailed
metrics on cluster utilization and performance. Ceph users have three options:

#. Have cephadm deploy and configure these services. This is the default
   when bootstrapping a new cluster unless the ``--skip-monitoring-stack``
   option is used.
#. Deploy and configure these services manually. This is recommended for users
   with existing prometheus services in their environment (and in cases where
   Ceph is running in Kubernetes with Rook).
#. Skip the monitoring stack completely. Some Ceph dashboard graphs will
   not be available.

The monitoring stack consists of `Prometheus <https://prometheus.io/>`_,
Prometheus exporters (:ref:`mgr-prometheus`, `Node exporter
<https://prometheus.io/docs/guides/node-exporter/>`_), `Prometheus Alert
Manager <https://prometheus.io/docs/alerting/alertmanager/>`_, and `Grafana
<https://grafana.com/>`_.

.. note::

   Prometheus' security model presumes that untrusted users have access to the
   Prometheus HTTP endpoint and logs. Untrusted users have access to all the
   (meta)data Prometheus collects that is contained in the database, plus a
   variety of operational and debugging information.

   However, Prometheus' HTTP API is limited to read-only operations.
   Configurations can *not* be changed using the API and secrets are not
   exposed. Moreover, Prometheus has some built-in measures to mitigate the
   impact of denial-of-service attacks.

   Please see `Prometheus' security model
   <https://prometheus.io/docs/operating/security/>`_ for more detailed
   information.

Deploying monitoring with cephadm
---------------------------------

The default behavior of ``cephadm`` is to deploy a basic monitoring stack. It
is, however, possible to have a Ceph cluster without a monitoring stack and to
add one to it later. You might have ended up in this state because you passed
the ``--skip-monitoring-stack`` option to ``cephadm`` during installation of
the cluster, or because you converted an existing cluster (which had no
monitoring stack) to cephadm management.

To set up monitoring on a Ceph cluster that has no monitoring, follow the
steps below (a spec-file alternative to these ``ceph orch apply`` commands is
sketched after the list):

#. Deploy a node-exporter service on every node of the cluster. The
   node-exporter provides host-level metrics like CPU and memory utilization:

   .. prompt:: bash #

      ceph orch apply node-exporter

#. Deploy alertmanager:

   .. prompt:: bash #

      ceph orch apply alertmanager

#. Deploy Prometheus. A single Prometheus instance is sufficient, but
   for high availability (HA) you might want to deploy two:

   .. prompt:: bash #

      ceph orch apply prometheus

   or

   .. prompt:: bash #

      ceph orch apply prometheus --placement 'count:2'

#. Deploy grafana:

   .. prompt:: bash #

      ceph orch apply grafana
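Each of these services can also be deployed from a service specification file
instead of the command line. The following is a minimal sketch, assuming the
spec is saved locally as ``monitoring.yaml``; the filename and the two-daemon
placement are illustrative choices, not requirements:

.. code-block:: yaml

   # Deploy two Prometheus daemons for HA. The other monitoring
   # services (node-exporter, alertmanager, grafana) can be declared
   # in the same file as additional YAML documents separated by "---".
   service_type: prometheus
   placement:
     count: 2

Apply it with:

.. prompt:: bash #

   ceph orch apply -i monitoring.yaml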
.. _cephadm-monitoring-centralized-logs:

Centralized Logging in Ceph
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ceph now provides centralized logging with Loki & Promtail. Centralized Log
Management (CLM) consolidates all log data and pushes it to a central
repository, with an accessible and easy-to-use interface. Centralized logging
is designed to make your life easier. Some of the advantages are:

#. **Linear event timeline**: it is easier to troubleshoot issues by analyzing
   a single chain of events than thousands of separate logs from a hundred
   nodes.
#. **Real-time live log monitoring**: it is impractical to follow logs from
   thousands of different sources.
#. **Flexible retention policies**: with per-daemon logs, log rotation is
   usually set to a short interval (1-2 weeks) to save disk usage.
#. **Increased security & backup**: logs can contain sensitive information and
   expose usage patterns. Additionally, centralized logging allows for HA of
   the log data.

Centralized Logging in Ceph is implemented using two new services: ``loki``
and ``promtail``.

Loki: a log aggregation system used to query logs. It can be configured as a
datasource in Grafana.

Promtail: an agent that gathers logs from the system and makes them available
to Loki.

These two services are not deployed by default in a Ceph cluster. To enable
centralized logging, follow the steps in :ref:`centralized-logging`.

.. _cephadm-monitoring-networks-ports:

Networks and Ports
~~~~~~~~~~~~~~~~~~

All monitoring services can have the network and port they bind to configured
with a yaml service specification. By default, cephadm uses the ``https``
protocol when configuring Grafana daemons unless the user explicitly sets the
protocol to ``http``.

Example spec file:

.. code-block:: yaml

    service_type: grafana
    service_name: grafana
    placement:
      count: 1
    networks:
    - 192.169.142.0/24
    spec:
      port: 4200
      protocol: http

.. _cephadm_monitoring-images:

Using custom images
~~~~~~~~~~~~~~~~~~~

It is possible to install or upgrade monitoring components based on other
images. To do so, the name of the image to be used needs to be stored in the
configuration first. The following configuration options are available:

- ``container_image_prometheus``
- ``container_image_grafana``
- ``container_image_alertmanager``
- ``container_image_node_exporter``

Custom images can be set with the ``ceph config`` command:

.. code-block:: bash

     ceph config set mgr mgr/cephadm/<option_name> <value>

For example:

.. code-block:: bash

     ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1

If monitoring stack daemon(s) of the type whose image you've changed were
already running, you must redeploy the daemon(s) in order to have them
actually use the new image.

For example, if you had changed the prometheus image:

.. prompt:: bash #

     ceph orch redeploy prometheus

.. note::

     By setting a custom image, the default value will be overridden (but not
     overwritten). The default value changes when updates become available.
     By setting a custom image, you will not be able to update the component
     you have set the custom image for automatically. You will need to
     manually update the configuration (image name and tag) to be able to
     install updates.

     If you choose to go back to the default value instead, you can reset the
     custom image you have set before. After that, the default value will be
     used again. Use ``ceph config rm`` to reset the configuration option:

     .. code-block:: bash

          ceph config rm mgr mgr/cephadm/<option_name>

     For example:

     .. code-block:: bash

          ceph config rm mgr mgr/cephadm/container_image_prometheus
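To confirm which image a daemon is actually running after a redeploy, you can
inspect the daemon list. This is a sketch; the output columns vary by release,
and the ``--daemon_type`` filter may not be available on older versions:

.. prompt:: bash #

   ceph orch ps --daemon_type prometheus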
See also :ref:`cephadm-airgap`.

.. _cephadm-overwrite-jinja2-templates:

Using custom configuration files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By overriding cephadm templates, it is possible to completely customize the
configuration files for monitoring services.

Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. Starting from version
17.2.3, cephadm supports Prometheus http service discovery and uses this
endpoint for the definition and management of the embedded Prometheus
service. The endpoint listens on ``https://<mgr-ip>:8765/sd/`` (the port is
configurable through the variable ``service_discovery_port``) and returns
scrape target information in `http_sd_config format
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config>`_.

Customers with an external monitoring stack can use the `ceph-mgr` service
discovery endpoint to get scraping configuration. The root certificate of the
server can be obtained with the following command:

   .. prompt:: bash #

      ceph orch sd dump cert

The configuration of Prometheus, Grafana, or Alertmanager may be customized
by storing a Jinja2 template for each service. This template will be
evaluated every time a service of that kind is deployed or reconfigured. That
way, the custom configuration is preserved and automatically applied on
future deployments of these services.

.. note::

   The configuration of the custom template is also preserved when the
   default configuration of cephadm changes. If the updated configuration is
   to be used, the custom template needs to be migrated *manually* after each
   upgrade of Ceph.

Option names
""""""""""""

The following templates for files that will be generated by cephadm can be
overridden. These are the names to be used when storing with ``ceph
config-key set``:

- ``services/alertmanager/alertmanager.yml``
- ``services/grafana/ceph-dashboard.yml``
- ``services/grafana/grafana.ini``
- ``services/prometheus/prometheus.yml``
- ``services/prometheus/alerting/custom_alerts.yml``
- ``services/loki.yml``
- ``services/promtail.yml``

You can look up the file templates that are currently used by cephadm in
``src/pybind/mgr/cephadm/templates``:

- ``services/alertmanager/alertmanager.yml.j2``
- ``services/grafana/ceph-dashboard.yml.j2``
- ``services/grafana/grafana.ini.j2``
- ``services/prometheus/prometheus.yml.j2``
- ``services/loki.yml.j2``
- ``services/promtail.yml.j2``

Usage
"""""

The following command applies a single-line value:

.. code-block:: bash

  ceph config-key set mgr/cephadm/<option_name> <value>

To set the contents of a file as the template, use the ``-i`` argument:

.. code-block:: bash

  ceph config-key set mgr/cephadm/<option_name> -i $PWD/<filename>

.. note::

  When using files as input to ``config-key``, an absolute path to the file
  must be used.


Then the configuration file for the service needs to be recreated.
This is done using ``reconfig``. For more details see the following example.

Example
"""""""

.. code-block:: bash

  # set the contents of ./prometheus.yml.j2 as template
  ceph config-key set mgr/cephadm/services/prometheus/prometheus.yml \
    -i $PWD/prometheus.yml.j2

  # reconfig the prometheus service
  ceph orch reconfig prometheus

.. code-block:: bash

  # set additional custom alerting rules for Prometheus
  ceph config-key set mgr/cephadm/services/prometheus/alerting/custom_alerts.yml \
    -i $PWD/custom_alerts.yml

Note that custom alerting rules are not parsed by Jinja, so escaping is not
an issue there.
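For reference, ``custom_alerts.yml`` uses standard Prometheus alerting-rule
syntax. The following is a minimal sketch; the group name, alert name,
expression, and threshold are all illustrative:

.. code-block:: yaml

   groups:
     - name: custom
       rules:
         # Illustrative rule: fire when any scrape target has been
         # unreachable for five minutes.
         - alert: TargetDown
           expr: up == 0
           for: 5m
           labels:
             severity: critical
           annotations:
             summary: "Target {{ $labels.instance }} is down"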
Deploying monitoring without cephadm
------------------------------------

If you have an existing prometheus monitoring infrastructure, or would like
to manage it yourself, you need to configure it to integrate with your Ceph
cluster.

* Enable the prometheus module in the ceph-mgr daemon:

  .. code-block:: bash

     ceph mgr module enable prometheus

  By default, ceph-mgr presents prometheus metrics on port 9283 on each host
  running a ceph-mgr daemon. Configure prometheus to scrape these.

To make this integration easier, cephadm provides a service discovery
endpoint at ``https://<mgr-ip>:8765/sd/``. This endpoint can be used by an
external Prometheus server to retrieve target information for a specific
service. The information returned by this endpoint uses the format specified
by the Prometheus `http_sd_config option
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config>`_.

Here's an example prometheus job definition that uses the cephadm service
discovery endpoint:

  .. code-block:: yaml

    - job_name: 'ceph-exporter'
      http_sd_configs:
      - url: http://<mgr-ip>:8765/sd/prometheus/sd-config?service=ceph-exporter


* To enable the dashboard's prometheus-based alerting, see
  :ref:`dashboard-alerting`.

* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.

Disabling monitoring
--------------------

To disable monitoring and remove the software that supports it, run the
following commands:

.. code-block:: console

  $ ceph orch rm grafana
  $ ceph orch rm prometheus --force   # this will delete metrics data collected so far
  $ ceph orch rm node-exporter
  $ ceph orch rm alertmanager
  $ ceph mgr module disable prometheus

See also :ref:`orch-rm`.

Setting up RBD-Image monitoring
-------------------------------

Due to performance reasons, monitoring of RBD images is disabled by default.
For more information please see :ref:`prometheus-rbd-io-statistics`. If
disabled, the overview and details dashboards will stay empty in Grafana and
the metrics will not be visible in Prometheus.

Setting up Prometheus
---------------------

Setting Prometheus Retention Size and Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm can configure Prometheus TSDB retention by specifying
``retention_time`` and ``retention_size`` values in the Prometheus service
spec. The retention time value defaults to 15 days (15d). Users can set a
different value/unit where supported units are: 'y', 'w', 'd', 'h', 'm' and
's'. The retention size value defaults to 0 (disabled). Supported units in
this case are: 'B', 'KB', 'MB', 'GB', 'TB', 'PB' and 'EB'.

In the following example spec, we set the retention time to 1 year and the
size to 1GB:

.. code-block:: yaml

  service_type: prometheus
  placement:
    count: 1
  spec:
    retention_time: "1y"
    retention_size: "1GB"

.. note::

  If you already had Prometheus daemon(s) deployed and are updating an
  existing spec, as opposed to doing a fresh Prometheus deployment, you must
  also tell cephadm to redeploy the Prometheus daemon(s) to put this change
  into effect. This can be done with a ``ceph orch redeploy prometheus``
  command.
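To confirm that the retention settings (or any other spec changes) were
stored as intended, the applied specification can be exported back as YAML.
A sketch, assuming the service was deployed as ``prometheus`` above:

.. prompt:: bash #

   ceph orch ls prometheus --export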
Setting up Grafana
------------------

Manually setting the Grafana URL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm automatically configures Prometheus, Grafana, and Alertmanager in
all cases except one.

In some setups, the Dashboard user's browser might not be able to access the
Grafana URL that is configured in Ceph Dashboard. This can happen when the
cluster and the accessing user are in different DNS zones.

If this is the case, you can use a configuration option for Ceph Dashboard
to set the URL that the user's browser will use to access Grafana. This
value will never be altered by cephadm. To set this configuration option,
issue the following command:

   .. prompt:: bash $

     ceph dashboard set-grafana-frontend-api-url <grafana-server-api>

It might take a minute or two for services to be deployed. After the
services have been deployed, you should see something like this when you
issue the command ``ceph orch ls``:

.. code-block:: console

  $ ceph orch ls
  NAME           RUNNING  REFRESHED  IMAGE NAME                                      IMAGE ID      SPEC
  alertmanager   1/1      6s ago     docker.io/prom/alertmanager:latest              0881eb8f169f  present
  crash          2/2      6s ago     docker.io/ceph/daemon-base:latest-master-devel  mix           present
  grafana        1/1      0s ago     docker.io/pcuzner/ceph-grafana-el8:latest       f77afcf0bcf6  absent
  node-exporter  2/2      6s ago     docker.io/prom/node-exporter:latest             e5a616e4b9cf  present
  prometheus     1/1      6s ago     docker.io/prom/prometheus:latest                e935122ab143  present

Configuring SSL/TLS for Grafana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``cephadm`` deploys Grafana using the certificate defined in the ceph
key/value store. If no certificate is specified, ``cephadm`` generates a
self-signed certificate during the deployment of the Grafana service. Each
certificate is specific to the host it was generated on.

A custom certificate can be configured using the following commands:

.. prompt:: bash #

  ceph config-key set mgr/cephadm/{hostname}/grafana_key -i $PWD/key.pem
  ceph config-key set mgr/cephadm/{hostname}/grafana_crt -i $PWD/certificate.pem

Where ``hostname`` is the hostname of the host where the grafana service is
deployed.

If you have already deployed Grafana, run ``reconfig`` on the service to
update its configuration:

.. prompt:: bash #

  ceph orch reconfig grafana

The ``reconfig`` command also sets the proper URL for Ceph Dashboard.
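If you do not already have a certificate and key pair to supply, a
self-signed pair can be generated with ``openssl``. This is a sketch; the
subject name and validity period are placeholders to adjust for your
environment:

.. prompt:: bash #

   openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
     -keyout key.pem -out certificate.pem -subj "/CN=grafana.example.com"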
Setting the initial admin password
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Grafana will not create an initial admin user. In order to
create the admin user, please create a file ``grafana.yaml`` with this
content:

.. code-block:: yaml

  service_type: grafana
  spec:
    initial_admin_password: mypassword

Then apply this specification:

.. code-block:: bash

  ceph orch apply -i grafana.yaml
  ceph orch redeploy grafana

Grafana will now create an admin user called ``admin`` with the given
password.

Turning off anonymous access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, cephadm allows anonymous users (users who have not provided any
login information) limited, viewer-only access to the grafana dashboard. In
order to set up grafana to allow viewing only by logged-in users, you can
set ``anonymous_access: False`` in your grafana spec:

.. code-block:: yaml

  service_type: grafana
  placement:
    hosts:
    - host1
  spec:
    anonymous_access: False
    initial_admin_password: "mypassword"

Since deploying grafana with anonymous access set to false without an
initial admin password set would make the dashboard inaccessible, cephadm
requires setting the ``initial_admin_password`` when ``anonymous_access`` is
set to false.


Setting up Alertmanager
-----------------------

Adding Alertmanager webhooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add new webhooks to the Alertmanager configuration, add additional
webhook urls like so:

.. code-block:: yaml

  service_type: alertmanager
  spec:
    user_data:
      default_webhook_urls:
      - "https://foo"
      - "https://bar"

Where ``default_webhook_urls`` is a list of additional URLs that are added
to the default receivers' ``<webhook_configs>`` configuration.

Run ``reconfig`` on the service to update its configuration:

.. prompt:: bash #

  ceph orch reconfig alertmanager

Turn on Certificate Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are using certificates for alertmanager and want to make sure these
certs are verified, you should set the "secure" option to true in your
alertmanager spec (this defaults to false):

.. code-block:: yaml

  service_type: alertmanager
  spec:
    secure: true

If you already had alertmanager daemons running before applying the spec,
you must reconfigure them to update their configuration:

.. prompt:: bash #

  ceph orch reconfig alertmanager

Further Reading
---------------

* :ref:`mgr-prometheus`