author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
commit | 483eb2f56657e8e7f419ab1a4fab8dce9ade8609 (patch) | |
tree | e5d88d25d870d5dedacb6bbdbe2a966086a0a5cf /doc/mgr | |
parent | Initial commit. (diff) | |
Adding upstream version 14.2.21.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r-- | doc/mgr/administrator.rst | 169 | ||||
-rw-r--r-- | doc/mgr/alerts.rst | 58 | ||||
-rw-r--r-- | doc/mgr/ansible.rst | 121 | ||||
-rw-r--r-- | doc/mgr/crash.rst | 83 | ||||
-rw-r--r-- | doc/mgr/dashboard.rst | 1030 | ||||
-rw-r--r-- | doc/mgr/dashboard_plugins/feature_toggles.inc.rst | 44 | ||||
-rw-r--r-- | doc/mgr/deepsea.rst | 79 | ||||
-rw-r--r-- | doc/mgr/diskprediction.rst | 353 | ||||
-rw-r--r-- | doc/mgr/hello.rst | 39 | ||||
-rw-r--r-- | doc/mgr/index.rst | 49 | ||||
-rw-r--r-- | doc/mgr/influx.rst | 165 | ||||
-rw-r--r-- | doc/mgr/insights.rst | 52 | ||||
-rw-r--r-- | doc/mgr/iostat.rst | 32 | ||||
-rw-r--r-- | doc/mgr/localpool.rst | 35 | ||||
-rw-r--r-- | doc/mgr/modules.rst | 389 | ||||
-rw-r--r-- | doc/mgr/orchestrator_cli.rst | 295 | ||||
-rw-r--r-- | doc/mgr/orchestrator_modules.rst | 285 | ||||
-rw-r--r-- | doc/mgr/prometheus.rst | 314 | ||||
-rw-r--r-- | doc/mgr/restful.rst | 156 | ||||
-rw-r--r-- | doc/mgr/rook.rst | 37 | ||||
-rw-r--r-- | doc/mgr/ssh.rst | 45 | ||||
-rw-r--r-- | doc/mgr/telegraf.rst | 88 | ||||
-rw-r--r-- | doc/mgr/telemetry.rst | 158 | ||||
-rw-r--r-- | doc/mgr/zabbix.rst | 144 |
24 files changed, 4220 insertions, 0 deletions
diff --git a/doc/mgr/administrator.rst b/doc/mgr/administrator.rst new file mode 100644 index 00000000..ccffe807 --- /dev/null +++ b/doc/mgr/administrator.rst @@ -0,0 +1,169 @@ +.. _mgr-administrator-guide: + +ceph-mgr administrator's guide +============================== + +Manual setup +------------ + +Usually, you would set up a ceph-mgr daemon using a tool such +as ceph-ansible. These instructions describe how to set up +a ceph-mgr daemon manually. + +First, create an authentication key for your daemon:: + + ceph auth get-or-create mgr.$name mon 'allow profile mgr' osd 'allow *' mds 'allow *' + +Place that key into ``mgr data`` path, which for a cluster "ceph" +and mgr $name "foo" would be ``/var/lib/ceph/mgr/ceph-foo``. + +Start the ceph-mgr daemon:: + + ceph-mgr -i $name + +Check that the mgr has come up by looking at the output +of ``ceph status``, which should now include a mgr status line:: + + mgr active: $name + +Client authentication +--------------------- + +The manager is a new daemon which requires new CephX capabilities. If you upgrade +a cluster from an old version of Ceph, or use the default install/deploy tools, +your admin client should get this capability automatically. If you use tooling from +elsewhere, you may get EACCES errors when invoking certain ceph cluster commands. +To fix that, add a "mgr allow \*" stanza to your client's cephx capabilities by +`Modifying User Capabilities`_. + +High availability +----------------- + +In general, you should set up a ceph-mgr on each of the hosts +running a ceph-mon daemon to achieve the same level of availability. + +By default, whichever ceph-mgr instance comes up first will be made +active by the monitors, and the others will be standbys. There is +no requirement for quorum among the ceph-mgr daemons. + +If the active daemon fails to send a beacon to the monitors for +more than ``mon mgr beacon grace`` (default 30s), then it will be replaced +by a standby. 
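The beacon rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Ceph's implementation: the helper name ``mgr_considered_failed`` and the use of plain float timestamps are hypothetical, and only the 30-second default for ``mon mgr beacon grace`` is taken from the text.

```python
# Illustrative sketch only: a hypothetical helper, not part of Ceph.
DEFAULT_BEACON_GRACE = 30.0  # seconds; documented default of "mon mgr beacon grace"


def mgr_considered_failed(last_beacon: float, now: float,
                          grace: float = DEFAULT_BEACON_GRACE) -> bool:
    """Return True if no beacon was seen for more than `grace` seconds."""
    return (now - last_beacon) > grace
```

With the default grace period, a manager whose last beacon arrived 31 seconds ago would be treated as failed, while one that sent a beacon 30 seconds ago would not.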
+ +If you want to pre-empt failover, you can explicitly mark a ceph-mgr +daemon as failed using ``ceph mgr fail <mgr name>``. + +Using modules +------------- + +Use the command ``ceph mgr module ls`` to see which modules are +available, and which are currently enabled. Enable or disable modules +using the commands ``ceph mgr module enable <module>`` and +``ceph mgr module disable <module>`` respectively. + +If a module is *enabled* then the active ceph-mgr daemon will load +and execute it. In the case of modules that provide a service, +such as an HTTP server, the module may publish its address when it +is loaded. To see the addresses of such modules, use the command +``ceph mgr services``. + +Some modules may also implement a special standby mode which runs on +standby ceph-mgr daemons as well as the active daemon. This enables +modules that provide services to redirect their clients to the active +daemon, if the client tries to connect to a standby. + +Consult the documentation pages for individual manager modules for more +information about what functionality each module provides. + +Here is an example of enabling the :term:`Dashboard` module: + +:: + + $ ceph mgr module ls + { + "enabled_modules": [ + "restful", + "status" + ], + "disabled_modules": [ + "dashboard" + ] + } + + $ ceph mgr module enable dashboard + $ ceph mgr module ls + { + "enabled_modules": [ + "restful", + "status", + "dashboard" + ], + "disabled_modules": [ + ] + } + + $ ceph mgr services + { + "dashboard": "http://myserver.com:7789/", + "restful": "https://myserver.com:8789/" + } + + +The first time the cluster starts, it uses the ``mgr_initial_modules`` +setting to override which modules to enable. However, this setting +is ignored through the rest of the lifetime of the cluster: only +use it for bootstrapping. 
For example, before starting your +monitor daemons for the first time, you might add a section like +this to your ``ceph.conf``: + +:: + + [mon] + mgr initial modules = dashboard balancer + +Calling module commands +----------------------- + +Where a module implements command line hooks, the commands will +be accessible as ordinary Ceph commands:: + + ceph <command | help> + +If you would like to see the list of commands handled by the +manager (where normal ``ceph help`` would show all mon and mgr commands), +you can send a command directly to the manager daemon:: + + ceph tell mgr help + +Note that it is not necessary to address a particular mgr instance, +simply ``mgr`` will pick the current active daemon. + +Configuration +------------- + +``mgr module path`` + +:Description: Path to load modules from +:Type: String +:Default: ``"<library dir>/mgr"`` + +``mgr data`` + +:Description: Path to load daemon data (such as keyring) +:Type: String +:Default: ``"/var/lib/ceph/mgr/$cluster-$id"`` + +``mgr tick period`` + +:Description: How many seconds between mgr beacons to monitors, and other + periodic checks. +:Type: Integer +:Default: ``5`` + +``mon mgr beacon grace`` + +:Description: How long after last beacon should a mgr be considered failed +:Type: Integer +:Default: ``30`` + +.. _Modifying User Capabilities: ../../rados/operations/user-management/#modify-user-capabilities diff --git a/doc/mgr/alerts.rst b/doc/mgr/alerts.rst new file mode 100644 index 00000000..319d9d92 --- /dev/null +++ b/doc/mgr/alerts.rst @@ -0,0 +1,58 @@ +Alerts module +============= + +The alerts module can send simple alert messages about cluster health +via e-mail. In the future, it will support other notification methods +as well. + +:note: This module is *not* intended to be a robust monitoring + solution. The fact that it is run as part of the Ceph cluster + itself is fundamentally limiting in that a failure of the + ceph-mgr daemon prevents alerts from being sent. 
This module + can, however, be useful for standalone clusters that exist in + environments where no monitoring infrastructure + exists. + +Enabling +-------- + +The *alerts* module is enabled with:: + + ceph mgr module enable alerts + +Configuration +------------- + +To configure SMTP, all of the following config options must be set:: + + ceph config set mgr mgr/alerts/smtp_host *<smtp-server>* + ceph config set mgr mgr/alerts/smtp_destination *<email-address-to-send-to>* + ceph config set mgr mgr/alerts/smtp_sender *<from-email-address>* + +By default, the module will use SSL and port 465. To change that,:: + + ceph config set mgr mgr/alerts/smtp_ssl false # if not SSL + ceph config set mgr mgr/alerts/smtp_port *<port-number>* # if not 465 + +To authenticate to the SMTP server, you must set the user and password:: + + ceph config set mgr mgr/alerts/smtp_user *<username>* + ceph config set mgr mgr/alerts/smtp_password *<password>* + +By default, the name in the ``From:`` line is simply ``Ceph``. To +change that (e.g., to identify which cluster this is),:: + + ceph config set mgr mgr/alerts/smtp_from_name 'Ceph Cluster Foo' + +By default, the module will check the cluster health once per minute +and, if there is a change, send a message. To change that +frequency,:: + + ceph config set mgr mgr/alerts/interval *<interval>* # e.g., "5m" for 5 minutes + +Commands +-------- + +To force an alert to be sent immediately,:: + + ceph alerts send diff --git a/doc/mgr/ansible.rst b/doc/mgr/ansible.rst new file mode 100644 index 00000000..e81e67ba --- /dev/null +++ b/doc/mgr/ansible.rst @@ -0,0 +1,121 @@ + +.. _ansible-module: + +==================== +Ansible Orchestrator +==================== + +This module is a :ref:`Ceph orchestrator <orchestrator-modules>` module that uses `Ansible Runner Service <https://github.com/pcuzner/ansible-runner-service>`_ (a RESTful API server) to execute Ansible playbooks in order to satisfy the different operations supported. 
+ +For the moment, these operations are: + +- Get an inventory of the Ceph cluster nodes and all the storage devices present in each node +- ... +- ... + + +Usage +===== + +Enable the module: + +:: + + # ceph mgr module enable ansible + +Disable the module: + +:: + + # ceph mgr module disable ansible + + +Enable the Ansible orchestrator module and use it with the :ref:`CLI <orchestrator-cli-module>`: + +:: + + ceph mgr module enable ansible + ceph orchestrator set backend ansible + + +Configuration +============= + +The module must be configured the first time it is enabled. + +This can be done in one monitor node via the configuration key facility on a +cluster-wide level (so they apply to all manager instances) as follows:: + + + # ceph config set mgr mgr/ansible/server_addr <ip_address/server_name> + # ceph config set mgr mgr/ansible/server_port <port> + # ceph config set mgr mgr/ansible/username <username> + # ceph config set mgr mgr/ansible/password <password> + # ceph config set mgr mgr/ansible/verify_server <verify_server_value> + +Where: + + * <ip_address/server_name>: The IP address or hostname of the server where the Ansible Runner Service is available. + * <port>: The port number where the Ansible Runner Service is listening. + * <username>: The username of one authorized user in the Ansible Runner Service. + * <password>: The password of the authorized user. + * <verify_server_value>: Either a boolean, in which case it controls whether the server's TLS certificate is verified, or a string, in which case it must be a path to a CA bundle to use in the verification. Defaults to ``True``. + + +Debugging +========= + +Any kind of incident with this orchestrator module can be debugged using the Ceph manager logs. + +Set the right log level in order to debug properly. Remember that the Python log levels debug, info, warn, err are mapped into the Ceph severities 20, 4, 1 and 0 respectively. 
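The level mapping stated above can be written down as a small lookup table. The sketch below is illustrative only: the names ``PYTHON_TO_CEPH_SEVERITY`` and ``visible_at`` are hypothetical, and the filtering rule (a message is shown when its Ceph severity is less than or equal to the configured ``debug_mgr`` level) is an assumption about how the debug level is applied, not something stated in this document.

```python
import logging

# Mapping described in the text: Python log level -> Ceph severity.
# Names are hypothetical and used for illustration only.
PYTHON_TO_CEPH_SEVERITY = {
    logging.DEBUG: 20,
    logging.INFO: 4,
    logging.WARNING: 1,
    logging.ERROR: 0,
}


def visible_at(debug_mgr: int) -> list:
    """List the Python level names that would be logged at a given debug_mgr level.

    Assumption: a message is emitted when its Ceph severity is <= debug_mgr.
    """
    return [
        logging.getLevelName(level)
        for level, severity in PYTHON_TO_CEPH_SEVERITY.items()
        if severity <= debug_mgr
    ]
```

Under that assumption, a ``debug_mgr`` level of 1 would show only warnings and errors, whereas a level of 20 would also include info and debug messages.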
+ +Run the following commands on the active manager node (the output of ``ceph -s`` on a monitor node shows which manager is active): + +* Check current debug level:: + + [mgr0 ~]# ceph daemon mgr.mgr0 config show | grep debug_mgr + "debug_mgr": "1/5", + "debug_mgrc": "1/5", + +* Change the log level to "debug":: + + [mgr0 ~]# ceph daemon mgr.mgr0 config set debug_mgr 20/5 + { + "success": "" + } + +* Restore "info" log level:: + + [mgr0 ~]# ceph daemon mgr.mgr0 config set debug_mgr 1/5 + { + "success": "" + } + + +Operations +========== + +**Inventory:** + +Get the list of storage devices installed in all the cluster nodes. The output format is:: + + [host: + device_name (type_of_device , size_in_bytes)] + +Example:: + + [root@mon0 ~]# ceph orchestrator device ls + 192.168.121.160: + vda (hdd, 44023414784b) + sda (hdd, 53687091200b) + sdb (hdd, 53687091200b) + sdc (hdd, 53687091200b) + 192.168.121.36: + vda (hdd, 44023414784b) + 192.168.121.201: + vda (hdd, 44023414784b) + 192.168.121.70: + vda (hdd, 44023414784b) + sda (hdd, 53687091200b) + sdb (hdd, 53687091200b) + sdc (hdd, 53687091200b) diff --git a/doc/mgr/crash.rst b/doc/mgr/crash.rst new file mode 100644 index 00000000..76e0ce94 --- /dev/null +++ b/doc/mgr/crash.rst @@ -0,0 +1,83 @@ +Crash Module +============ +The crash module collects information about daemon crashdumps and stores +it in the Ceph cluster for later analysis. + +Daemon crashdumps are dumped in /var/lib/ceph/crash by default; this can +be configured with the option 'crash dir'. Crash directories are named by +time and date and a randomly-generated UUID, and contain a metadata file +'meta' and a recent log file, with a "crash_id" that is the same. +This module allows the metadata about those dumps to be persisted in +the monitors' storage. + +Enabling +-------- + +The *crash* module is enabled with:: + + ceph mgr module enable crash + +Commands +-------- +:: + + ceph crash post -i <metafile> + +Save a crash dump. 
The metadata file is a JSON blob stored in the crash +dir as ``meta``. As usual, the ceph command can be invoked with ``-i -``, +and will read from stdin. + +:: + + ceph crash rm <crashid> + +Remove a specific crash dump. + +:: + + ceph crash ls + +List the timestamp/uuid crashids for all new and archived crash info. + +:: + + ceph crash ls-new + +List the timestamp/uuid crashids for all new crash info. + +:: + + ceph crash stat + +Show a summary of saved crash info grouped by age. + +:: + + ceph crash info <crashid> + +Show all details of a saved crash. + +:: + + ceph crash prune <keep> + +Remove saved crashes older than 'keep' days. <keep> must be an integer. + +:: + + ceph crash archive <crashid> + +Archive a crash report so that it is no longer considered for the ``RECENT_CRASH`` health check and does not appear in the ``crash ls-new`` output (it will still appear in the ``crash ls`` output). + +:: + + ceph crash archive-all + +Archive all new crash reports. + + +Options +------- + +* ``mgr/crash/warn_recent_interval`` [default: 2 weeks] controls what constitutes "recent" for the purposes of raising the ``RECENT_CRASH`` health warning. +* ``mgr/crash/retain_interval`` [default: 1 year] controls how long crash reports are retained by the cluster before they are automatically purged. diff --git a/doc/mgr/dashboard.rst b/doc/mgr/dashboard.rst new file mode 100644 index 00000000..f92bd19d --- /dev/null +++ b/doc/mgr/dashboard.rst @@ -0,0 +1,1030 @@ +.. _mgr-dashboard: + +Ceph Dashboard +============== + +Overview +-------- + +The Ceph Dashboard is a built-in web-based Ceph management and monitoring +application to administer various aspects and objects of the cluster. It is +implemented as a :ref:`ceph-manager-daemon` module. + +The original Ceph Dashboard that was shipped with Ceph Luminous started +out as a simple read-only view into various run-time information and performance +data of a Ceph cluster. It used a very simple architecture to achieve the +original goal. 
However, there was a growing demand for adding more web-based +management capabilities, to make it easier to administer Ceph for users that +prefer a WebUI over using the command line. + +The new :term:`Ceph Dashboard` module is a replacement of the previous one and +adds a built-in web based monitoring and administration application to the Ceph +Manager. The architecture and functionality of this new plugin is derived from +and inspired by the `openATTIC Ceph management and monitoring tool +<https://openattic.org/>`_. The development is actively driven by the team +behind openATTIC at `SUSE <https://www.suse.com/>`_, with a lot of support from +companies like `Red Hat <https://redhat.com/>`_ and other members of the Ceph +community. + +The dashboard module's backend code uses the CherryPy framework and a custom +REST API implementation. The WebUI implementation is based on +Angular/TypeScript, merging both functionality from the original dashboard as +well as adding new functionality originally developed for the standalone version +of openATTIC. The Ceph Dashboard module is implemented as a web +application that visualizes information and statistics about the Ceph cluster +using a web server hosted by ``ceph-mgr``. + +Feature Overview +^^^^^^^^^^^^^^^^ + +The dashboard provides the following features: + +* **Multi-User and Role Management**: The dashboard supports multiple user + accounts with different permissions (roles). The user accounts and roles + can be modified on both the command line and via the WebUI. + See :ref:`dashboard-user-role-management` for details. +* **Single Sign-On (SSO)**: the dashboard supports authentication + via an external identity provider using the SAML 2.0 protocol. See + :ref:`dashboard-sso-support` for details. +* **SSL/TLS support**: All HTTP communication between the web browser and the + dashboard is secured via SSL. 
A self-signed certificate can be created with + a built-in command, but it's also possible to import custom certificates + signed and issued by a CA. See :ref:`dashboard-ssl-tls-support` for details. +* **Auditing**: the dashboard backend can be configured to log all PUT, POST + and DELETE API requests in the Ceph audit log. See :ref:`dashboard-auditing` + for instructions on how to enable this feature. +* **Internationalization (I18N)**: the dashboard can be used in different + languages that can be selected at run-time. + +Currently, Ceph Dashboard is capable of monitoring and managing the following +aspects of your Ceph cluster: + +* **Overall cluster health**: Display overall cluster status, performance + and capacity metrics. +* **Embedded Grafana Dashboards**: Ceph Dashboard is capable of embedding + `Grafana`_ dashboards in many locations, to display additional information + and performance metrics gathered by the :ref:`mgr-prometheus`. See + :ref:`dashboard-grafana` for details on how to configure this functionality. +* **Cluster logs**: Display the latest updates to the cluster's event and + audit log files. Log entries can be filtered by priority, date or keyword. +* **Hosts**: Display a list of all hosts associated to the cluster, which + services are running and which version of Ceph is installed. +* **Performance counters**: Display detailed service-specific statistics for + each running service. +* **Monitors**: List all MONs, their quorum status, open sessions. +* **Monitoring**: Enables creation, re-creation, editing and expiration of + Prometheus' Silences, lists the alerting configuration of Prometheus and + currently firing alerts. Also shows notifications for firing alerts. Needs + configuration. +* **Configuration Editor**: Display all available configuration options, + their description, type and default values and edit the current values. +* **Pools**: List all Ceph pools and their details (e.g. 
applications, + placement groups, replication size, EC profile, CRUSH ruleset, etc.) +* **OSDs**: List all OSDs, their status and usage statistics as well as + detailed information like attributes (OSD map), metadata, performance + counters and usage histograms for read/write operations. Mark OSDs + up/down/out, purge and reweight OSDs, perform scrub operations, modify + various scrub-related configuration options, select different profiles to + adjust the level of backfilling activity. +* **iSCSI**: List all hosts that run the TCMU runner service, display all + images and their performance characteristics (read/write ops, traffic). + Create, modify and delete iSCSI targets (via ``ceph-iscsi``). See + :ref:`dashboard-iscsi-management` for instructions on how to configure this + feature. +* **RBD**: List all RBD images and their properties (size, objects, features). + Create, copy, modify and delete RBD images. Define various I/O or bandwidth + limitation settings on a global, per-pool or per-image level. Create, delete + and rollback snapshots of selected images, protect/unprotect these snapshots + against modification. Copy or clone snapshots, flatten cloned images. +* **RBD mirroring**: Enable and configure RBD mirroring to a remote Ceph server. + Lists all active sync daemons and their status, pools and RBD images including + their synchronization state. +* **CephFS**: List all active filesystem clients and associated pools, + including their usage statistics. +* **Object Gateway**: List all active object gateways and their performance + counters. Display and manage (add/edit/delete) object gateway users and their + details (e.g. quotas) as well as the users' buckets and their details (e.g. + owner, quotas). See :ref:`dashboard-enabling-object-gateway` for configuration + instructions. +* **NFS**: Manage NFS exports of CephFS filesystems and RGW S3 buckets via NFS + Ganesha. 
See :ref:`dashboard-nfs-ganesha-management` for details on how to + enable this functionality. +* **Ceph Manager Modules**: Enable and disable all Ceph Manager modules, change + the module-specific configuration settings. + + +Supported Browsers +^^^^^^^^^^^^^^^^^^ + +Ceph Dashboard is primarily tested and developed using the following web +browsers: + ++----------------------------------------------+----------+ +| Browser | Versions | ++==============================================+==========+ +| `Chrome <https://www.google.com/chrome/>`_ | 68+ | ++----------------------------------------------+----------+ +| `Firefox <http://www.mozilla.org/firefox/>`_ | 61+ | ++----------------------------------------------+----------+ + +While Ceph Dashboard might work in older browsers, we cannot guarantee it and +recommend you to update your browser to the latest version. + +Enabling +-------- + +If you have installed ``ceph-mgr-dashboard`` from distribution packages, the +package management system should have taken care of installing all the required +dependencies. + +If you're installing Ceph from source and want to start the dashboard from your +development environment, please see the files ``README.rst`` and ``HACKING.rst`` +in directory ``src/pybind/mgr/dashboard`` of the source code. + +Within a running Ceph cluster, the Ceph Dashboard is enabled with:: + + $ ceph mgr module enable dashboard + +Configuration +------------- + +.. _dashboard-ssl-tls-support: + +SSL/TLS Support +^^^^^^^^^^^^^^^ + +All HTTP connections to the dashboard are secured with SSL/TLS by default. + +To get the dashboard up and running quickly, you can generate and install a +self-signed certificate using the following built-in command:: + + $ ceph dashboard create-self-signed-cert + +Note that most web browsers will complain about such self-signed certificates +and require explicit confirmation before establishing a secure connection to the +dashboard. 
+ +To properly secure a deployment and to remove the certificate warning, a +certificate that is issued by a certificate authority (CA) should be used. + +For example, a key pair can be generated with a command similar to:: + + $ openssl req -new -nodes -x509 \ + -subj "/O=IT/CN=ceph-mgr-dashboard" -days 3650 \ + -keyout dashboard.key -out dashboard.crt -extensions v3_ca + +The ``dashboard.crt`` file should then be signed by a CA. Once that is done, you +can enable it for all Ceph manager instances by running the following commands:: + + $ ceph dashboard set-ssl-certificate -i dashboard.crt + $ ceph dashboard set-ssl-certificate-key -i dashboard.key + +If different certificates are desired for each manager instance for some reason, +the name of the instance can be included as follows (where ``$name`` is the name +of the ``ceph-mgr`` instance, usually the hostname):: + + $ ceph dashboard set-ssl-certificate $name -i dashboard.crt + $ ceph dashboard set-ssl-certificate-key $name -i dashboard.key + +SSL can also be disabled by setting this configuration value:: + + $ ceph config set mgr mgr/dashboard/ssl false + +This might be useful if the dashboard will be running behind a proxy which does +not support SSL for its upstream servers or other situations where SSL is not +wanted or required. + +.. warning:: + + Use caution when disabling SSL as usernames and passwords will be sent to the + dashboard unencrypted. + + +.. note:: + + You need to restart the Ceph manager processes manually after changing the SSL + certificate and key. This can be accomplished by either running ``ceph mgr + fail mgr`` or by disabling and re-enabling the dashboard module (which also + triggers the manager to respawn itself):: + + $ ceph mgr module disable dashboard + $ ceph mgr module enable dashboard + +Host Name and Port +^^^^^^^^^^^^^^^^^^ + +Like most web applications, dashboard binds to a TCP/IP address and TCP port. 
+ +By default, the ``ceph-mgr`` daemon hosting the dashboard (i.e., the currently +active manager) will bind to TCP port 8443 or 8080 when SSL is disabled. + +If no specific address has been configured, the web app will bind to ``::``, +which corresponds to all available IPv4 and IPv6 addresses. + +These defaults can be changed via the configuration key facility on a +cluster-wide level (so they apply to all manager instances) as follows:: + + $ ceph config set mgr mgr/dashboard/server_addr $IP + $ ceph config set mgr mgr/dashboard/server_port $PORT + $ ceph config set mgr mgr/dashboard/ssl_server_port $PORT + +Since each ``ceph-mgr`` hosts its own instance of dashboard, it may also be +necessary to configure them separately. The IP address and port for a specific +manager instance can be changed with the following commands:: + + $ ceph config set mgr mgr/dashboard/$name/server_addr $IP + $ ceph config set mgr mgr/dashboard/$name/server_port $PORT + $ ceph config set mgr mgr/dashboard/$name/ssl_server_port $PORT + +Replace ``$name`` with the ID of the ceph-mgr instance hosting the dashboard web +app. + +.. note:: + + The command ``ceph mgr services`` will show you all endpoints that are + currently configured. Look for the ``dashboard`` key to obtain the URL for + accessing the dashboard. + +Username and Password +^^^^^^^^^^^^^^^^^^^^^ + +In order to be able to log in, you need to create a user account and associate +it with at least one role. We provide a set of predefined *system roles* that +you can use. For more details please refer to the `User and Role Management`_ +section. + +To create a user with the administrator role you can use the following +commands:: + + $ ceph dashboard ac-user-create <username> -i <file-containing-password> administrator + +.. 
_dashboard-enabling-object-gateway: + +Enabling the Object Gateway Management Frontend +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To use the Object Gateway management functionality of the dashboard, you will +need to provide the login credentials of a user with the ``system`` flag +enabled. + +If you do not have a user which shall be used for providing those credentials, +you will also need to create one:: + + $ radosgw-admin user create --uid=<user_id> --display-name=<display_name> \ + --system + +Take note of the keys ``access_key`` and ``secret_key`` in the output of this +command. + +The credentials of an existing user can also be obtained by using +`radosgw-admin`:: + + $ radosgw-admin user info --uid=<user_id> + +Finally, provide the credentials to the dashboard:: + + $ ceph dashboard set-rgw-api-access-key -i <file-containing-access-key> + $ ceph dashboard set-rgw-api-secret-key -i <file-containing-secret-key> + +In a typical default configuration with a single RGW endpoint, this is all you +have to do to get the Object Gateway management functionality working. The +dashboard will try to automatically determine the host and port of the Object +Gateway by obtaining this information from the Ceph Manager's service map. + +If multiple zones are used, it will automatically determine the host within the +master zone group and master zone. 
This should be sufficient for most setups, +but in some circumstances you might want to set the host and port manually:: + + $ ceph dashboard set-rgw-api-host <host> + $ ceph dashboard set-rgw-api-port <port> + +In addition to the settings mentioned so far, the following settings also +exist, and you may need to use them:: + + $ ceph dashboard set-rgw-api-scheme <scheme> # http or https + $ ceph dashboard set-rgw-api-admin-resource <admin_resource> + $ ceph dashboard set-rgw-api-user-id <user_id> + +If you are using a self-signed certificate in your Object Gateway setup, then +you should disable certificate verification in the dashboard to avoid refused +connections, e.g. caused by certificates signed by an unknown CA or not matching +the host name:: + + $ ceph dashboard set-rgw-api-ssl-verify False + +If the Object Gateway takes too long to process requests and the dashboard runs +into timeouts, then you can set the timeout value to your needs:: + + $ ceph dashboard set-rest-requests-timeout <seconds> + +The default value is 45 seconds. + +.. _dashboard-iscsi-management: + +Enabling iSCSI Management +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The Ceph Dashboard can manage iSCSI targets using the REST API provided by the +`rbd-target-api` service of the :ref:`ceph-iscsi`. Please make sure that it's +installed and enabled on the iSCSI gateways. + +.. note:: + The iSCSI management functionality of Ceph Dashboard depends on the latest + version 3 of the `ceph-iscsi <https://github.com/ceph/ceph-iscsi>`_ project. + Make sure that your operating system provides the correct version, otherwise + the dashboard won't enable the management features. + +If the ceph-iscsi REST API is configured in HTTPS mode and it is using a self-signed +certificate, then you need to configure the dashboard to avoid SSL certificate +verification when accessing the ceph-iscsi API. 
+ +To disable API SSL verification, run the following command:: + + $ ceph dashboard set-iscsi-api-ssl-verification false + +The available iSCSI gateways must be defined using the following commands:: + + $ ceph dashboard iscsi-gateway-list + $ # Gateway URL format for a new gateway: <scheme>://<username>:<password>@<host>[:port] + $ ceph dashboard iscsi-gateway-add -i <file-containing-gateway-url> [<gateway_name>] + $ ceph dashboard iscsi-gateway-rm <gateway_name> + + +.. _dashboard-grafana: + +Enabling the Embedding of Grafana Dashboards +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +`Grafana`_ requires data from `Prometheus <https://prometheus.io/>`_. Although +Grafana can use other data sources, the Grafana dashboards we provide contain +queries that are specific to Prometheus. Our Grafana dashboards therefore +require Prometheus as the data source. The Ceph :ref:`mgr-prometheus` also only +exports its data in Prometheus' common format. The Grafana dashboards rely +on metric names from the Prometheus module and `Node exporter +<https://prometheus.io/docs/guides/node-exporter/>`_. The Node exporter is a +separate application that provides machine metrics. + +.. note:: + + Prometheus' security model presumes that untrusted users have access to the + Prometheus HTTP endpoint and logs. Untrusted users have access to all the + (meta)data Prometheus collects that is contained in the database, plus a + variety of operational and debugging information. + + However, Prometheus' HTTP API is limited to read-only operations. + Configurations can *not* be changed using the API and secrets are not + exposed. Moreover, Prometheus has some built-in measures to mitigate the + impact of denial of service attacks. + + Please see `Prometheus' Security model + <https://prometheus.io/docs/operating/security/>`_ for more detailed + information. 
+ +Grafana and Prometheus are likely going to be bundled and installed by some +orchestration tools alongside Ceph in the near future, but currently, you will have +to install and configure both manually. After you have installed Prometheus and +Grafana on your preferred hosts, proceed with the following steps. + +#. Enable the Ceph Exporter, which comes as a Ceph Manager module, by running:: + + $ ceph mgr module enable prometheus + +More details can be found in the documentation of the :ref:`mgr-prometheus`. + +#. Add the corresponding scrape configuration to Prometheus. This may look + like:: + + global: + scrape_interval: 5s + + scrape_configs: + - job_name: 'prometheus' + static_configs: + - targets: ['localhost:9090'] + - job_name: 'ceph' + static_configs: + - targets: ['localhost:9283'] + - job_name: 'node-exporter' + static_configs: + - targets: ['localhost:9100'] + +#. Add Prometheus as a data source to Grafana. + +#. Install the `vonage-status-panel` and `grafana-piechart-panel` plugins using:: + + grafana-cli plugins install vonage-status-panel + grafana-cli plugins install grafana-piechart-panel + +#. Add the Dashboards to Grafana: + + Dashboards can be added to Grafana by importing dashboard JSON files. + The following command can be used to download the JSON files:: + + wget https://raw.githubusercontent.com/ceph/ceph/master/monitoring/grafana/dashboards/<Dashboard-name>.json + + You can find all the dashboard JSON files `here <https://github.com/ceph/ceph/tree/ + master/monitoring/grafana/dashboards>`_ . + + For example, for the ceph-cluster overview you can use:: + + wget https://raw.githubusercontent.com/ceph/ceph/master/monitoring/grafana/dashboards/ceph-cluster.json + +#. Configure Grafana in `/etc/grafana/grafana.ini` to enable anonymous mode:: + + [auth.anonymous] + enabled = true + org_name = Main Org. + org_role = Viewer + +After you have set up Grafana and Prometheus, you will need to configure the +connection information that the Ceph Dashboard will use to access Grafana. 
+ +You need to tell the dashboard on which URL the Grafana instance is +running/deployed:: + + $ ceph dashboard set-grafana-api-url <grafana-server-url> # default: '' + +The format of the URL is: `<protocol>://<IP-address>:<port>` + +.. note:: + Ceph Dashboard embeds the Grafana dashboards via ``iframe`` HTML elements. + If Grafana is configured without SSL/TLS support, most browsers will block the + embedding of insecure content into a secured web page, if the SSL support in + the dashboard has been enabled (which is the default configuration). If you + can't see the embedded Grafana dashboards after enabling them as outlined + above, check your browser's documentation on how to unblock mixed content. + Alternatively, consider enabling SSL/TLS support in Grafana. + +If you are using a self-signed certificate in your Grafana setup, then you should +disable certificate verification in the dashboard to avoid refused connections, +e.g. caused by certificates signed by an unknown CA or not matching the host name:: + + $ ceph dashboard set-grafana-api-ssl-verify False + +You can also access the Grafana instance directly to monitor your cluster. + +Alternative URL for Browsers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The Ceph Dashboard backend requires the Grafana URL to be able to verify the +existence of Grafana Dashboards before the frontend even loads them. Due to the +nature of how Grafana is implemented in Ceph Dashboard, this means that two +working connections are required in order to be able to see Grafana graphs in +Ceph Dashboard: + +- The backend (Ceph Mgr module) needs to verify the existence of the requested + graph. If this request succeeds, it lets the frontend know that it can safely + access Grafana. +- The frontend then requests the Grafana graphs directly from the user's + browser using an iframe. The Grafana instance is accessed directly without any + detour through Ceph Dashboard.
+ +Now, it might be the case that your environment makes it difficult for the +user's browser to directly access the URL configured in Ceph Dashboard. To solve +this issue, a separate URL can be configured which will solely be used to tell +the frontend (the user's browser) which URL it should use to access Grafana. + +To change the URL that is returned to the frontend issue the following command:: + + $ ceph dashboard set-grafana-frontend-api-url <grafana-server-url> + +If no value is set for that option, it will simply fall back to the value of the +GRAFANA_API_URL option. If set, it will instruct the browser to use this URL to +access Grafana. + +.. _dashboard-sso-support: + +Enabling Single Sign-On (SSO) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The Ceph Dashboard supports external authentication of users via the +`SAML 2.0 <https://en.wikipedia.org/wiki/SAML_2.0>`_ protocol. You need to create +the user accounts and associate them with the desired roles first, as authorization +is still performed by the Dashboard. However, the authentication process can be +performed by an existing Identity Provider (IdP). + +.. note:: + Ceph Dashboard SSO support relies on onelogin's + `python-saml <https://pypi.org/project/python-saml/>`_ library. + Please ensure that this library is installed on your system, either by using + your distribution's package management or via Python's `pip` installer. 
+ +To configure SSO on Ceph Dashboard, you should use the following command:: + + $ ceph dashboard sso setup saml2 <ceph_dashboard_base_url> <idp_metadata> {<idp_username_attribute>} {<idp_entity_id>} {<sp_x_509_cert>} {<sp_private_key>} + +Parameters: + +* **<ceph_dashboard_base_url>**: Base URL where Ceph Dashboard is accessible (e.g., `https://cephdashboard.local`) +* **<idp_metadata>**: URL, file path or content of the IdP metadata XML (e.g., `https://myidp/metadata`) +* **<idp_username_attribute>** *(optional)*: Attribute that should be used to get the username from the authentication response. Defaults to `uid`. +* **<idp_entity_id>** *(optional)*: Use this when more than one entity id exists on the IdP metadata. +* **<sp_x_509_cert> / <sp_private_key>** *(optional)*: File path or content of the certificate that should be used by Ceph Dashboard (Service Provider) for signing and encryption. + +.. note:: + The issuer value of SAML requests will follow this pattern: **<ceph_dashboard_base_url>**/auth/saml2/metadata + +To display the current SAML 2.0 configuration, use the following command:: + + $ ceph dashboard sso show saml2 + +.. note:: + For more information about `onelogin_settings`, please check the `onelogin documentation <https://github.com/onelogin/python-saml>`_. + +To disable SSO:: + + $ ceph dashboard sso disable + +To check if SSO is enabled:: + + $ ceph dashboard sso status + +To enable SSO:: + + $ ceph dashboard sso enable saml2 + +Enabling Prometheus Alerting +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Using Prometheus for monitoring, you have to define `alerting rules +<https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules>`_. +To manage them you need to use the `Alertmanager +<https://prometheus.io/docs/alerting/alertmanager>`_. +If you are not using the Alertmanager yet, please `install it +<https://github.com/prometheus/alertmanager#install>`_ as it's mandatory in +order to receive and manage alerts from Prometheus. 
+ +The Alertmanager capabilities can be consumed by the dashboard in three different +ways: + +#. Use the notification receiver of the dashboard. + +#. Use the Prometheus Alertmanager API. + +#. Use both sources simultaneously. + +All three methods are going to notify you about alerts. You won't be notified +twice if you use both sources, but you need to consume at least the Alertmanager API +in order to manage silences. + +#. Use the notification receiver of the dashboard: + + This allows you to get notifications as `configured + <https://prometheus.io/docs/alerting/configuration/>`_ from the Alertmanager. + You will get notified inside the dashboard once a notification is sent out, + but you are not able to manage alerts. + + Add the dashboard receiver and the new route to your Alertmanager configuration. + This should look like:: + + route: + receiver: 'ceph-dashboard' + ... + receivers: + - name: 'ceph-dashboard' + webhook_configs: + - url: '<url-to-dashboard>/api/prometheus_receiver' + + + Please make sure that the Alertmanager considers the SSL certificate of the + dashboard valid. For more information about the correct + configuration, check out the `<http_config> documentation + <https://prometheus.io/docs/alerting/configuration/#%3Chttp_config%3E>`_. + +#. Use the API of Prometheus and the Alertmanager + + This allows you to manage alerts and silences. This will enable the "Active + Alerts", "All Alerts" as well as the "Silences" tabs in the "Monitoring" + section of the "Cluster" menu entry. + + Alerts can be sorted by name, job, severity, state and start time. + Unfortunately, it's not possible to know when an alert + was sent out through a notification by the Alertmanager based on your + configuration. That's why the dashboard will notify the user of any visible + change to an alert and will report the changed alert. + + Silences can be sorted by id, creator, status, start, updated and end time.
+ Silences can be created in various ways, it's also possible to expire them. + + #. Create from scratch + + #. Based on a selected alert + + #. Recreate from expired silence + + #. Update a silence (which will recreate and expire it (default Alertmanager behaviour)) + + To use it, specify the host and port of the Alertmanager server:: + + $ ceph dashboard set-alertmanager-api-host <alertmanager-host:port> # default: '' + + For example:: + + $ ceph dashboard set-alertmanager-api-host 'http://localhost:9093' + + To be able to see all configured alerts, you will need to configure the URL + to the Prometheus API. Using this API, the UI will also help you in verifying + that a new silence will match a corresponding alert. + + :: + + $ ceph dashboard set-prometheus-api-host <prometheus-host:port> # default: '' + + For example:: + + $ ceph dashboard set-prometheus-api-host 'http://localhost:9090' + + After setting up the hosts, you have to refresh the dashboard in your browser window. + +#. Use both methods + + The different behaviors of both methods are configured in a way that they + should not disturb each other through annoying duplicated notifications + popping up. + +Accessing the Dashboard +^^^^^^^^^^^^^^^^^^^^^^^ + +You can now access the dashboard using your (JavaScript-enabled) web browser, by +pointing it to any of the host names or IP addresses and the selected TCP port +where a manager instance is running: e.g., ``httpS://<$IP>:<$PORT>/``. + +You should then be greeted by the dashboard login page, requesting your +previously defined username and password. Select the **Keep me logged in** +checkbox if you want to skip the username/password request when accessing the +dashboard in the future. + +.. _dashboard-user-role-management: + +User and Role Management +------------------------ + +User Accounts +^^^^^^^^^^^^^ + +Ceph Dashboard supports managing multiple user accounts. 
Each user account +consists of a username, a password (stored in hashed form using ``bcrypt``), +an optional name, and an optional email address. + +User accounts are stored in MON's configuration database, and are globally +shared across all ceph-mgr instances. + +We provide a set of CLI commands to manage user accounts: + +- *Show User(s)*:: + + $ ceph dashboard ac-user-show [<username>] + +- *Create User*:: + + $ ceph dashboard ac-user-create <username> -i <file-containing-password> [<rolename>] [<name>] [<email>] + +- *Delete User*:: + + $ ceph dashboard ac-user-delete <username> + +- *Change Password*:: + + $ ceph dashboard ac-user-set-password <username> -i <file-containing-password> + +- *Modify User (name and email)*:: + + $ ceph dashboard ac-user-set-info <username> <name> <email> + + +User Roles and Permissions +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +User accounts are also associated with a set of roles that define which +dashboard functionality can be accessed by the user. + +The Dashboard functionality/modules are grouped within a *security scope*. +Security scopes are predefined and static. The currently available security +scopes are: + +- **hosts**: includes all features related to the ``Hosts`` menu + entry. +- **config-opt**: includes all features related to management of Ceph + configuration options. +- **pool**: includes all features related to pool management. +- **osd**: includes all features related to OSD management. +- **monitor**: includes all features related to Monitor management. +- **rbd-image**: includes all features related to RBD image + management. +- **rbd-mirroring**: includes all features related to RBD-Mirroring + management. +- **iscsi**: includes all features related to iSCSI management. +- **rgw**: includes all features related to Rados Gateway management. +- **cephfs**: includes all features related to CephFS management. +- **manager**: includes all features related to Ceph Manager + management.
+- **log**: includes all features related to Ceph logs management. +- **grafana**: includes all features related to Grafana proxy. +- **prometheus**: includes all features related to Prometheus alert management. +- **dashboard-settings**: allows changing dashboard settings. + +A *role* specifies a set of mappings between a *security scope* and a set of +*permissions*. There are four types of permissions: + +- **read** +- **create** +- **update** +- **delete** + +See below for an example of a role specification based on a Python dictionary:: + + # example of a role + { + 'role': 'my_new_role', + 'description': 'My new role', + 'scopes_permissions': { + 'pool': ['read', 'create'], + 'rbd-image': ['read', 'create', 'update', 'delete'] + } + } + +The above role dictates that a user has *read* and *create* permissions for +features related to pool management, and has full permissions for +features related to RBD image management. + +The Dashboard already provides a set of predefined roles that we call +*system roles*, and can be used right away in a fresh Ceph Dashboard +installation. + +The list of system roles is: + +- **administrator**: provides full permissions for all security scopes. +- **read-only**: provides *read* permission for all security scopes except + the dashboard settings. +- **block-manager**: provides full permissions for *rbd-image*, + *rbd-mirroring*, and *iscsi* scopes. +- **rgw-manager**: provides full permissions for the *rgw* scope. +- **cluster-manager**: provides full permissions for the *hosts*, *osd*, + *monitor*, *manager*, and *config-opt* scopes. +- **pool-manager**: provides full permissions for the *pool* scope. +- **cephfs-manager**: provides full permissions for the *cephfs* scope. + +The list of currently available roles can be retrieved by the following +command:: + + $ ceph dashboard ac-role-show [<rolename>] + +It is also possible to create new roles using CLI commands.
The available +commands to manage roles are the following: + +- *Create Role*:: + + $ ceph dashboard ac-role-create <rolename> [<description>] + +- *Delete Role*:: + + $ ceph dashboard ac-role-delete <rolename> + +- *Add Scope Permissions to Role*:: + + $ ceph dashboard ac-role-add-scope-perms <rolename> <scopename> <permission> [<permission>...] + +- *Delete Scope Permission from Role*:: + + $ ceph dashboard ac-role-del-scope-perms <rolename> <scopename> + +To associate roles to users, the following CLI commands are available: + +- *Set User Roles*:: + + $ ceph dashboard ac-user-set-roles <username> <rolename> [<rolename>...] + +- *Add Roles To User*:: + + $ ceph dashboard ac-user-add-roles <username> <rolename> [<rolename>...] + +- *Delete Roles from User*:: + + $ ceph dashboard ac-user-del-roles <username> <rolename> [<rolename>...] + + +Example of User and Custom Role Creation +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In this section we show a full example of the commands that need to be used +in order to create a user account that should be able to manage RBD images, +view and create Ceph pools, and have read-only access to any other scopes. + +1. *Create the user*:: + + $ ceph dashboard ac-user-create bob -i <file-containing-password> + +2. *Create role and specify scope permissions*:: + + $ ceph dashboard ac-role-create rbd/pool-manager + $ ceph dashboard ac-role-add-scope-perms rbd/pool-manager rbd-image read create update delete + $ ceph dashboard ac-role-add-scope-perms rbd/pool-manager pool read create + +3. *Associate roles to user*:: + + $ ceph dashboard ac-user-set-roles bob rbd/pool-manager read-only + + +Proxy Configuration +------------------- + +In a Ceph cluster with multiple ceph-mgr instances, only the dashboard running +on the currently active ceph-mgr daemon will serve incoming requests.
Accessing +the dashboard's TCP port on any of the other ceph-mgr instances that are +currently on standby will perform an HTTP redirect (303) to the currently active +manager's dashboard URL. This way, you can point your browser to any of the +ceph-mgr instances in order to access the dashboard. + +If you want to establish a fixed URL to reach the dashboard or if you don't want +to allow direct connections to the manager nodes, you could set up a proxy that +automatically forwards incoming requests to the currently active ceph-mgr +instance. + +Configuring a URL Prefix +^^^^^^^^^^^^^^^^^^^^^^^^ + +If you are accessing the dashboard via a reverse proxy configuration, +you may wish to serve it under a URL prefix. To get the dashboard +to use hyperlinks that include your prefix, you can set the +``url_prefix`` setting: + +:: + + ceph config set mgr mgr/dashboard/url_prefix $PREFIX + +so you can access the dashboard at ``http://$IP:$PORT/$PREFIX/``. + +Disable the redirection +^^^^^^^^^^^^^^^^^^^^^^^ + +If the dashboard is behind a load-balancing proxy like `HAProxy <https://www.haproxy.org/>`_ +you might want to disable the redirection behaviour to prevent situations in which +internal (unresolvable) URLs are published to the frontend client. Use the +following command to get the dashboard to respond with an HTTP error (500 by default) +instead of redirecting to the active dashboard:: + + $ ceph config set mgr mgr/dashboard/standby_behaviour "error" + +To reset the setting to the default redirection behaviour, use the following command:: + + $ ceph config set mgr mgr/dashboard/standby_behaviour "redirect" + +Configure the error status code +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When the redirection behaviour is disabled, you may want to customize the HTTP status +code of standby dashboards.
To do so you need to run the command:: + + $ ceph config set mgr mgr/dashboard/standby_error_status_code 503 + +HAProxy example configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Below you will find an example configuration for SSL/TLS pass through using +`HAProxy <https://www.haproxy.org/>`_. + +Please note that the configuration works under the following conditions. +If the dashboard fails over, the front-end client might receive an HTTP redirect +(303) response and will be redirected to an unresolvable host. This happens when +the failover occurs between two HAProxy health checks. In this situation the +previously active dashboard node will now respond with a 303 which points to +the new active node. To prevent that situation you should consider disabling +the redirection behaviour on standby nodes. + +:: + + defaults + log global + option log-health-checks + timeout connect 5s + timeout client 50s + timeout server 450s + + frontend dashboard_front + mode http + bind *:80 + option httplog + redirect scheme https code 301 if !{ ssl_fc } + + frontend dashboard_front_ssl + mode tcp + bind *:443 + option tcplog + default_backend dashboard_back_ssl + + backend dashboard_back_ssl + mode tcp + option httpchk GET / + http-check expect status 200 + server x <HOST>:<PORT> check-ssl check verify none + server y <HOST>:<PORT> check-ssl check verify none + server z <HOST>:<PORT> check-ssl check verify none + +.. _dashboard-auditing: + +Auditing API Requests +--------------------- + +The REST API is capable of logging PUT, POST and DELETE requests to the Ceph +audit log. This feature is disabled by default, but can be enabled with the +following command:: + + $ ceph dashboard set-audit-api-enabled <true|false> + +If enabled, the following parameters are logged for each request: + +* from - The origin of the request, e.g. https://[::1]:44410 +* path - The REST API path, e.g. /api/auth +* method - e.g.
PUT, POST or DELETE +* user - The name of the user, otherwise 'None' + +The logging of the request payload (the arguments and their values) is enabled +by default. Execute the following command to disable this behaviour:: + + $ ceph dashboard set-audit-api-log-payload <true|false> + +A log entry may look like this:: + + 2018-10-22 15:27:01.302514 mgr.x [INF] [DASHBOARD] from='https://[::ffff:127.0.0.1]:37022' path='/api/rgw/user/klaus' method='PUT' user='admin' params='{"max_buckets": "1000", "display_name": "Klaus Mustermann", "uid": "klaus", "suspended": "0", "email": "klaus.mustermann@ceph.com"}' + +.. _dashboard-nfs-ganesha-management: + +NFS-Ganesha Management +---------------------- + +Ceph Dashboard can manage `NFS Ganesha <http://nfs-ganesha.github.io/>`_ exports that use +CephFS or RadosGW as their backstore. + +To enable this feature in Ceph Dashboard there are some assumptions that need +to be met regarding the way NFS-Ganesha services are configured. + +The dashboard manages NFS-Ganesha config files stored in RADOS objects on the Ceph Cluster. +NFS-Ganesha must store part of its configuration in the Ceph cluster. + +These configuration files must follow some conventions. +Each export block must be stored in its own RADOS object named +``export-<id>``, where ``<id>`` must match the ``Export_ID`` attribute of the +export configuration. Then, for each NFS-Ganesha service daemon there should +exist a RADOS object named ``conf-<daemon_id>``, where ``<daemon_id>`` is an +arbitrary string that should uniquely identify the daemon instance (e.g., the +hostname where the daemon is running). +Each ``conf-<daemon_id>`` object contains the RADOS URLs to the exports that +the NFS-Ganesha daemon should serve. These URLs are of the form:: + + %url rados://<pool_name>[/<namespace>]/export-<id> + +Both the ``conf-<daemon_id>`` and ``export-<id>`` objects must be stored in the +same RADOS pool/namespace.
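
The naming convention described above can be sketched in a few lines of Python. This is only an illustration of how the object names and ``%url`` lines fit together: the helper names (``export_object_name``, ``conf_object_name``, ``export_url_line``) and the sample pool, namespace and IDs are hypothetical and are not part of Ceph or NFS-Ganesha.

```python
def export_object_name(export_id):
    """Name of the RADOS object holding one export block (export-<id>)."""
    return "export-{}".format(export_id)


def conf_object_name(daemon_id):
    """Name of the per-daemon RADOS object (conf-<daemon_id>)."""
    return "conf-{}".format(daemon_id)


def export_url_line(pool_name, export_id, namespace=None):
    """Build one '%url rados://<pool_name>[/<namespace>]/export-<id>' line."""
    location = pool_name if namespace is None else "{}/{}".format(pool_name, namespace)
    return "%url rados://{}/{}".format(location, export_object_name(export_id))


# Hypothetical values: a daemon on host "node1" serving exports 1 and 2
# from pool "nfs_pool", namespace "ganesha".
lines = [export_url_line("nfs_pool", i, "ganesha") for i in (1, 2)]
print(conf_object_name("node1"))
print("\n".join(lines))
```

Because both kinds of objects live in the same pool/namespace, a single pool/namespace pair is enough to derive every object name used by one NFS-Ganesha cluster.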
+ + +Configuring NFS-Ganesha in the Dashboard +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To enable the management of NFS-Ganesha exports in Ceph Dashboard, we only +need to tell the Dashboard in which RADOS pool and namespace the +configuration objects are stored. Then, Ceph Dashboard can access the objects +by following the naming convention described above. + +The Dashboard command to configure the NFS-Ganesha configuration objects +location is:: + + $ ceph dashboard set-ganesha-clusters-rados-pool-namespace <pool_name>[/<namespace>] + +After running the above command, Ceph Dashboard is able to find the NFS-Ganesha +configuration objects and we can start managing the exports through the Web UI. + + +Support for Multiple NFS-Ganesha Clusters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Ceph Dashboard also supports the management of NFS-Ganesha exports belonging +to different NFS-Ganesha clusters. An NFS-Ganesha cluster is a group of +NFS-Ganesha service daemons sharing the same exports. Different NFS-Ganesha +clusters are independent and don't share the exports configuration between each +other. + +Each NFS-Ganesha cluster should store its configuration objects in a +different RADOS pool/namespace to isolate the configuration from each other. + +To specify the locations of the configuration of each NFS-Ganesha cluster we +can use the same command as above but with a different value pattern:: + + $ ceph dashboard set-ganesha-clusters-rados-pool-namespace <cluster_id>:<pool_name>[/<namespace>](,<cluster_id>:<pool_name>[/<namespace>])* + +The ``<cluster_id>`` is an arbitrary string that should uniquely identify the +NFS-Ganesha cluster. + +When configuring the Ceph Dashboard with multiple NFS-Ganesha clusters, the +Web UI will automatically let you choose the cluster to which an export belongs. + + +Plug-ins +-------- + +Dashboard plug-ins make it possible to extend the functionality of the dashboard in a modular +and loosely coupled approach. + +..
_Grafana: https://grafana.com/ + +.. include:: dashboard_plugins/feature_toggles.inc.rst diff --git a/doc/mgr/dashboard_plugins/feature_toggles.inc.rst b/doc/mgr/dashboard_plugins/feature_toggles.inc.rst new file mode 100644 index 00000000..b0244866 --- /dev/null +++ b/doc/mgr/dashboard_plugins/feature_toggles.inc.rst @@ -0,0 +1,44 @@ +.. _dashboard-feature-toggles: + +Feature Toggles +^^^^^^^^^^^^^^^ + +This plug-in allows you to enable or disable some features of the Ceph Dashboard +on demand. When a feature becomes disabled: + +- Its front-end elements (web pages, menu entries, charts, etc.) will become hidden. +- Its associated REST API endpoints will reject any further requests (404, Not Found Error). + +The main purpose of this plug-in is to allow ad-hoc customizations of the workflows exposed +by the dashboard. Additionally, it could allow for dynamically enabling experimental +features with minimal configuration burden and no service impact. + +The list of features that can be enabled/disabled is: + +- **Block (RBD)**: + - Image Management: ``rbd`` + - Mirroring: ``mirroring`` + - iSCSI: ``iscsi`` +- **Filesystem (Cephfs)**: ``cephfs`` +- **Objects (RGW)**: ``rgw`` (including daemon, user and bucket management). + +By default all features come enabled. + +To retrieve a list of features and their current statuses:: + + $ ceph dashboard feature status + Feature 'cephfs': 'enabled' + Feature 'iscsi': 'enabled' + Feature 'mirroring': 'enabled' + Feature 'rbd': 'enabled' + Feature 'rgw': 'enabled' + +To enable or disable one or more features:: + + $ ceph dashboard feature disable iscsi mirroring + Feature 'iscsi': disabled + Feature 'mirroring': disabled + +After a feature status has changed, the REST API endpoints immediately respond to +that change, while for the front-end UI elements, it may take up to 20 seconds to +reflect it.
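
If the feature status needs to be consumed by a script, the textual output shown above can be turned into a dictionary. The following is a minimal sketch that assumes the output keeps the ``Feature '<name>': <state>`` shape from the examples (with or without quotes around the state); the function name ``parse_feature_status`` is hypothetical and is not part of the dashboard.

```python
import re

# Matches both "Feature 'rbd': 'enabled'" and "Feature 'iscsi': disabled".
_FEATURE_LINE = re.compile(r"^Feature '(?P<name>[^']+)': '?(?P<state>enabled|disabled)'?$")


def parse_feature_status(text):
    """Map each feature name to True (enabled) or False (disabled)."""
    status = {}
    for line in text.splitlines():
        match = _FEATURE_LINE.match(line.strip())
        if match:  # lines that do not look like feature entries are ignored
            status[match.group("name")] = match.group("state") == "enabled"
    return status


# Sample text copied from the command output shown above.
sample = "Feature 'cephfs': 'enabled'\nFeature 'iscsi': disabled\n"
features = parse_feature_status(sample)
```

Keep in mind that front-end elements may lag behind the reported status by up to 20 seconds, so a script should rely on this output rather than on what the UI currently shows.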
diff --git a/doc/mgr/deepsea.rst b/doc/mgr/deepsea.rst new file mode 100644 index 00000000..da83aef7 --- /dev/null +++ b/doc/mgr/deepsea.rst @@ -0,0 +1,79 @@ + +================================ +DeepSea orchestrator integration +================================ + +DeepSea (https://github.com/SUSE/DeepSea) is a collection of `Salt +<https://github.com/saltstack/salt>`_ state files, runners and modules for +deploying and managing Ceph. + +The ``deepsea`` module provides integration between Ceph's orchestrator +framework (used by modules such as ``dashboard`` to control cluster services) +and DeepSea. + +Orchestrator modules only provide services to other modules, which in turn +provide user interfaces. To try out the deepsea module, you might like +to use the :ref:`Orchestrator CLI <orchestrator-cli-module>` module. + +Requirements +------------ + +- A salt-master node with DeepSea 0.9.9 or later installed, and the salt-api + service running. +- Ideally, several salt-minion nodes against which at least DeepSea's stages 0 + through 2 have been run (this is the minimum required for the orchestrator's + inventory and status functions to return interesting information). + +Configuration +------------- + +Four configuration keys must be set in order for the module to talk to +salt-api: + +- salt_api_url +- salt_api_username +- salt_api_password +- salt_api_eauth (default is "sharedsecret") + +These all need to match the salt-api configuration on the salt master (see +eauth.conf, salt-api.conf and sharedsecret.conf in /etc/salt/master.d/ on the +salt-master node). + +Configuration keys +^^^^^^^^^^^^^^^^^^^ + +Configuration keys can be set on any machine with the proper cephx credentials, +these are usually Monitors where the *client.admin* key is present. 
+ +:: + + ceph deepsea config-set <key> <value> + +For example: + +:: + + ceph deepsea config-set salt_api_url http://admin.example.com:8000/ + ceph deepsea config-set salt_api_username admin + ceph deepsea config-set salt_api_password 12345 + +The current configuration of the module can also be shown: + +:: + + ceph deepsea config-show + +Debugging +--------- + +Should you want to debug the deepsea module, increase the logging level for +ceph-mgr and check the logs. + +:: + + [mgr] + debug mgr = 20 + +With the log level set to 20, the module will print out all the data received +from the salt event bus. All log messages will be prefixed with *mgr[deepsea]* +for easy filtering. diff --git a/doc/mgr/diskprediction.rst b/doc/mgr/diskprediction.rst new file mode 100644 index 00000000..779cda5d --- /dev/null +++ b/doc/mgr/diskprediction.rst @@ -0,0 +1,353 @@ +===================== +Diskprediction Module +===================== + +The *diskprediction* module supports two modes: cloud mode and local mode. In cloud mode, the disk and Ceph operating status information is collected from Ceph cluster and sent to a cloud-based DiskPrediction server over the Internet. DiskPrediction server analyzes the data and provides the analytics and prediction results of performance and disk health states for Ceph clusters. + +Local mode doesn't require any external server for data analysis and output results. In local mode, the *diskprediction* module uses an internal predictor module for disk prediction service, and then returns the disk prediction result to the Ceph system. 
+ +| Local predictor: 70% accuracy +| Cloud predictor for free: 95% accuracy + +Enabling +======== + +Run the following commands to enable the *diskprediction* module in the Ceph +environment:: + + ceph mgr module enable diskprediction_cloud + ceph mgr module enable diskprediction_local + + +Select the prediction mode:: + + ceph config set global device_failure_prediction_mode local + +or:: + + ceph config set global device_failure_prediction_mode cloud + +To disable prediction:: + + ceph config set global device_failure_prediction_mode none + + +Connection settings +=================== +The connection settings are used for the connection between Ceph and the DiskPrediction server. + +Local Mode +---------- + +The *diskprediction* module leverages the Ceph device health check to collect disk health metrics, uses an internal predictor module to produce the disk failure prediction, and returns the result to Ceph. Thus, no connection settings are required in local mode. The local predictor module requires at least six datasets of device health metrics to implement the prediction. + +Run the following command to predict device life expectancy using the local predictor. + +:: + + ceph device predict-life-expectancy <device id> + + +Cloud Mode +---------- + +User registration is required in cloud mode. Users have to sign up for an account at https://www.diskprophet.com/#/ to receive the following DiskPrediction server information for connection settings. + +**Certificate file path**: After user registration is confirmed, the system will send a confirmation email including a certificate file download link. Download the certificate file and save it to the Ceph system. Run the following command to verify the file. Without certificate file verification, the connection settings cannot be completed. + +**DiskPrediction server**: The DiskPrediction server name. It could be an IP address if required.
+ +**Connection account**: An account name used to set up the connection between Ceph and the DiskPrediction server + +**Connection password**: The password used to set up the connection between Ceph and the DiskPrediction server + +Run the following command to complete connection setup. + +:: + + ceph device set-cloud-prediction-config <diskprediction_server> <connection_account> <connection_password> <certificate file path> + + +You can use the following command to display the connection settings: + +:: + + ceph device show-prediction-config + + +Additional optional configuration settings are the following: + +:diskprediction_upload_metrics_interval: How often Ceph performance metrics are sent to the DiskPrediction server. Default is 10 minutes. +:diskprediction_upload_smart_interval: How often Ceph physical device info is sent to the DiskPrediction server. Default is 12 hours. +:diskprediction_retrieve_prediction_interval: How often Ceph retrieves physical device prediction data from the DiskPrediction server. Default is 12 hours. + + + +Diskprediction Data +=================== + +The *diskprediction* module actively sends/retrieves the following data to/from the DiskPrediction server.
+ + +Metrics Data +------------- +- Ceph cluster status + ++----------------------+-----------------------------------------+ +|key |Description | ++======================+=========================================+ +|cluster_health |Ceph health check status | ++----------------------+-----------------------------------------+ +|num_mon |Number of monitor node | ++----------------------+-----------------------------------------+ +|num_mon_quorum |Number of monitors in quorum | ++----------------------+-----------------------------------------+ +|num_osd |Total number of OSD | ++----------------------+-----------------------------------------+ +|num_osd_up |Number of OSDs that are up | ++----------------------+-----------------------------------------+ +|num_osd_in |Number of OSDs that are in cluster | ++----------------------+-----------------------------------------+ +|osd_epoch |Current epoch of OSD map | ++----------------------+-----------------------------------------+ +|osd_bytes |Total capacity of cluster in bytes | ++----------------------+-----------------------------------------+ +|osd_bytes_used |Number of used bytes on cluster | ++----------------------+-----------------------------------------+ +|osd_bytes_avail |Number of available bytes on cluster | ++----------------------+-----------------------------------------+ +|num_pool |Number of pools | ++----------------------+-----------------------------------------+ +|num_pg |Total number of placement groups | ++----------------------+-----------------------------------------+ +|num_pg_active_clean |Number of placement groups in | +| |active+clean state | ++----------------------+-----------------------------------------+ +|num_pg_active |Number of placement groups in active | +| |state | ++----------------------+-----------------------------------------+ +|num_pg_peering |Number of placement groups in peering | +| |state | ++----------------------+-----------------------------------------+ +|num_object 
|Total number of objects on cluster | ++----------------------+-----------------------------------------+ +|num_object_degraded |Number of degraded (missing replicas) | +| |objects | ++----------------------+-----------------------------------------+ +|num_object_misplaced |Number of misplaced (wrong location in | +| |the cluster) objects | ++----------------------+-----------------------------------------+ +|num_object_unfound |Number of unfound objects | ++----------------------+-----------------------------------------+ +|num_bytes |Total number of bytes of all objects | ++----------------------+-----------------------------------------+ +|num_mds_up |Number of MDSs that are up | ++----------------------+-----------------------------------------+ +|num_mds_in |Number of MDS that are in cluster | ++----------------------+-----------------------------------------+ +|num_mds_failed |Number of failed MDS | ++----------------------+-----------------------------------------+ +|mds_epoch |Current epoch of MDS map | ++----------------------+-----------------------------------------+ + + +- Ceph mon/osd performance counts + +Mon: + ++----------------------+-----------------------------------------+ +|key |Description | ++======================+=========================================+ +|num_sessions |Current number of opened monitor sessions| ++----------------------+-----------------------------------------+ +|session_add |Number of created monitor sessions | ++----------------------+-----------------------------------------+ +|session_rm |Number of remove_session calls in monitor| ++----------------------+-----------------------------------------+ +|session_trim |Number of trimmed monitor sessions | ++----------------------+-----------------------------------------+ +|num_elections |Number of elections monitor took part in | ++----------------------+-----------------------------------------+ +|election_call |Number of elections started by monitor | 
++----------------------+-----------------------------------------+ +|election_win |Number of elections won by monitor | ++----------------------+-----------------------------------------+ +|election_lose |Number of elections lost by monitor | ++----------------------+-----------------------------------------+ + +Osd: + ++----------------------+-----------------------------------------+ +|key |Description | ++======================+=========================================+ +|op_wip |Replication operations currently being | +| |processed (primary) | ++----------------------+-----------------------------------------+ +|op_in_bytes |Client operations total write size | ++----------------------+-----------------------------------------+ +|op_r |Client read operations | ++----------------------+-----------------------------------------+ +|op_out_bytes |Client operations total read size | ++----------------------+-----------------------------------------+ +|op_w |Client write operations | ++----------------------+-----------------------------------------+ +|op_latency |Latency of client operations (including | +| |queue time) | ++----------------------+-----------------------------------------+ +|op_process_latency |Latency of client operations (excluding | +| |queue time) | ++----------------------+-----------------------------------------+ +|op_r_latency |Latency of read operation (including | +| |queue time) | ++----------------------+-----------------------------------------+ +|op_r_process_latency |Latency of read operation (excluding | +| |queue time) | ++----------------------+-----------------------------------------+ +|op_w_in_bytes |Client data written | ++----------------------+-----------------------------------------+ +|op_w_latency |Latency of write operation (including | +| |queue time) | ++----------------------+-----------------------------------------+ +|op_w_process_latency |Latency of write operation (excluding | +| |queue time) | 
++----------------------+-----------------------------------------+ +|op_rw |Client read-modify-write operations | ++----------------------+-----------------------------------------+ +|op_rw_in_bytes |Client read-modify-write operations write| +| |in | ++----------------------+-----------------------------------------+ +|op_rw_out_bytes |Client read-modify-write operations read | +| |out | ++----------------------+-----------------------------------------+ +|op_rw_latency |Latency of read-modify-write operation | +| |(including queue time) | ++----------------------+-----------------------------------------+ +|op_rw_process_latency |Latency of read-modify-write operation | +| |(excluding queue time) | ++----------------------+-----------------------------------------+ + + +- Ceph pool statistics + ++----------------------+-----------------------------------------+ +|key |Description | ++======================+=========================================+ +|bytes_used |Per pool bytes used | ++----------------------+-----------------------------------------+ +|max_avail |Max available number of bytes in the pool| ++----------------------+-----------------------------------------+ +|objects |Number of objects in the pool | ++----------------------+-----------------------------------------+ +|wr_bytes |Number of bytes written in the pool | ++----------------------+-----------------------------------------+ +|dirty |Number of bytes dirty in the pool | ++----------------------+-----------------------------------------+ +|rd_bytes |Number of bytes read in the pool | ++----------------------+-----------------------------------------+ +|stored_raw |Bytes used in pool including copies made | ++----------------------+-----------------------------------------+ + +- Ceph physical device metadata + ++----------------------+-----------------------------------------+ +|key |Description | ++======================+=========================================+ +|disk_domain_id |Physical 
device identifier | ++----------------------+-----------------------------------------+ +|disk_name |Device attachment name | ++----------------------+-----------------------------------------+ +|disk_wwn |Device wwn | ++----------------------+-----------------------------------------+ +|model |Device model name | ++----------------------+-----------------------------------------+ +|serial_number |Device serial number | ++----------------------+-----------------------------------------+ +|size |Device size | ++----------------------+-----------------------------------------+ +|vendor |Device vendor name | ++----------------------+-----------------------------------------+ + +- Correlation information for each Ceph object +- The module agent information +- The module agent cluster information +- The module agent host information + + +SMART Data +----------- +- Ceph physical device SMART data (provided by Ceph *devicehealth* module) + + +Prediction Data +---------------- +- Ceph physical device prediction data + + +Receiving predicted health status from a Ceph OSD disk drive +============================================================ + +You can retrieve the predicted health status of a Ceph OSD disk drive by using the +following command. 
+ +:: + + ceph device get-predicted-status <device id> + + +The get-predicted-status command returns: + + +:: + + { + "near_failure": "Good", + "disk_wwn": "5000011111111111", + "serial_number": "111111111", + "predicted": "2018-05-30 18:33:12", + "attachment": "sdb" + } + + ++--------------------+-----------------------------------------------------+ +|Attribute | Description | ++====================+=====================================================+ +|near_failure | The disk failure prediction state: | +| | Good/Warning/Bad/Unknown | ++--------------------+-----------------------------------------------------+ +|disk_wwn | Disk WWN number | ++--------------------+-----------------------------------------------------+ +|serial_number | Disk serial number | ++--------------------+-----------------------------------------------------+ +|predicted | Predicted date | ++--------------------+-----------------------------------------------------+ +|attachment | device name on the local system | ++--------------------+-----------------------------------------------------+ + +The *near_failure* attribute for disk failure prediction state indicates disk life expectancy in the following table. + ++--------------------+-----------------------------------------------------+ +|near_failure | Life expectancy (weeks) | ++====================+=====================================================+ +|Good | > 6 weeks | ++--------------------+-----------------------------------------------------+ +|Warning | 2 weeks ~ 6 weeks | ++--------------------+-----------------------------------------------------+ +|Bad | < 2 weeks | ++--------------------+-----------------------------------------------------+ + + +Debugging +========= + +If you want to debug the DiskPrediction module mapping to Ceph logging level, +use the following command. 
+ +:: + + [mgr] + + debug mgr = 20 + +With logging set to debug for the manager the module will print out logging +message with prefix *mgr[diskprediction]* for easy filtering. + diff --git a/doc/mgr/hello.rst b/doc/mgr/hello.rst new file mode 100644 index 00000000..725355fc --- /dev/null +++ b/doc/mgr/hello.rst @@ -0,0 +1,39 @@ +Hello World Module +================== + +This is a simple module skeleton for documentation purposes. + +Enabling +-------- + +The *hello* module is enabled with:: + + ceph mgr module enable hello + +To check that it is enabled, run:: + + ceph mgr module ls + +After editing the module file (found in ``src/pybind/mgr/hello/module.py``), you can see changes by running:: + + ceph mgr module disable hello + ceph mgr module enable hello + +or:: + + init-ceph restart mgr + +To execute the module, run:: + + ceph hello + +The log is found at:: + + build/out/mgr.x.log + + +Documenting +----------- + +After adding a new mgr module, be sure to add its documentation to ``doc/mgr/module_name.rst``. +Also, add a link to your new module into ``doc/mgr/index.rst``. diff --git a/doc/mgr/index.rst b/doc/mgr/index.rst new file mode 100644 index 00000000..6b377d1b --- /dev/null +++ b/doc/mgr/index.rst @@ -0,0 +1,49 @@ +.. _ceph-manager-daemon: + +=================== +Ceph Manager Daemon +=================== + +The :term:`Ceph Manager` daemon (ceph-mgr) runs alongside monitor daemons, +to provide additional monitoring and interfaces to external monitoring +and management systems. + +Since the 12.x (*luminous*) Ceph release, the ceph-mgr daemon is required for +normal operations. The ceph-mgr daemon is an optional component in +the 11.x (*kraken*) Ceph release. + +By default, the manager daemon requires no additional configuration, beyond +ensuring it is running. If there is no mgr daemon running, you will +see a health warning to that effect, and some of the other information +in the output of `ceph status` will be missing or stale until a mgr is started. 
+ +Use your normal deployment tools, such as ceph-ansible or ceph-deploy, to +set up ceph-mgr daemons on each of your mon nodes. It is not mandatory +to place mgr daemons on the same nodes as mons, but it is almost always +sensible. + +.. toctree:: + :maxdepth: 1 + + Installation and Configuration <administrator> + Writing modules <modules> + Writing orchestrator plugins <orchestrator_modules> + Dashboard module <dashboard> + Alerts module <alerts> + DiskPrediction module <diskprediction> + Local pool module <localpool> + RESTful module <restful> + Zabbix module <zabbix> + Prometheus module <prometheus> + Influx module <influx> + Hello module <hello> + Telegraf module <telegraf> + Telemetry module <telemetry> + Iostat module <iostat> + Crash module <crash> + Orchestrator CLI module <orchestrator_cli> + Rook module <rook> + DeepSea module <deepsea> + Insights module <insights> + Ansible module <ansible> + SSH orchestrator <ssh> diff --git a/doc/mgr/influx.rst b/doc/mgr/influx.rst new file mode 100644 index 00000000..eab9494a --- /dev/null +++ b/doc/mgr/influx.rst @@ -0,0 +1,165 @@ +============= +Influx Module +============= + +The influx module continuously collects and sends time series data to an +InfluxDB database. + +The influx module was introduced in the 13.x *Mimic* release. + +-------- +Enabling +-------- + +To enable the module, use the following command: + +:: + + ceph mgr module enable influx + +If you wish to subsequently disable the module, you can use the equivalent +*disable* command: + +:: + + ceph mgr module disable influx + +------------- +Configuration +------------- + +For the influx module to send statistics to an InfluxDB server, it +is necessary to configure the server's address and some authentication +credentials. + +Set configuration values using the following command: + +:: + + ceph config set mgr mgr/influx/<key> <value> + + +The most important settings are ``hostname``, ``username`` and ``password``. 
+For example, a typical configuration might look like this: + +:: + + ceph config set mgr mgr/influx/hostname influx.mydomain.com + ceph config set mgr mgr/influx/username admin123 + ceph config set mgr mgr/influx/password p4ssw0rd + +Additional optional configuration settings are: + +:interval: Time between reports to InfluxDB. Default 30 seconds. +:database: InfluxDB database name. Default "ceph". You will need to create this database and grant write privileges to the configured username or the username must have admin privileges to create it. +:port: InfluxDB server port. Default 8086 +:ssl: Use https connection for InfluxDB server. Use "true" or "false". Default false +:verify_ssl: Verify https cert for InfluxDB server. Use "true" or "false". Default true +:threads: How many worker threads should be spawned for sending data to InfluxDB. Default is 5 +:batch_size: How big batches of data points should be when sending to InfluxDB. Default is 5000 + +--------- +Debugging +--------- + +By default, a few debugging statements as well as error statements have been set to print in the log files. Users can add more if necessary. +To make use of the debugging option in the module: + +- Add this to the ceph.conf file.:: + + [mgr] + debug_mgr = 20 + +- Use this command ``ceph tell mgr.<mymonitor> influx self-test``. +- Check the log files. Users may find it easier to filter the log files using *mgr[influx]*. + +-------------------- +Interesting counters +-------------------- + +The following tables describe a subset of the values output by +this module. 
+ +^^^^^ +Pools +^^^^^ + ++---------------+-----------------------------------------------------+ +|Counter | Description | ++===============+=====================================================+ +|stored | Bytes stored in the pool not including copies | ++---------------+-----------------------------------------------------+ +|max_avail | Max available number of bytes in the pool | ++---------------+-----------------------------------------------------+ +|objects | Number of objects in the pool | ++---------------+-----------------------------------------------------+ +|wr_bytes | Number of bytes written in the pool | ++---------------+-----------------------------------------------------+ +|dirty | Number of bytes dirty in the pool | ++---------------+-----------------------------------------------------+ +|rd_bytes | Number of bytes read in the pool | ++---------------+-----------------------------------------------------+ +|stored_raw | Bytes used in pool including copies made | ++---------------+-----------------------------------------------------+ + +^^^^ +OSDs +^^^^ + ++------------+------------------------------------+ +|Counter | Description | ++============+====================================+ +|op_w | Client write operations | ++------------+------------------------------------+ +|op_in_bytes | Client operations total write size | ++------------+------------------------------------+ +|op_r | Client read operations | ++------------+------------------------------------+ +|op_out_bytes| Client operations total read size | ++------------+------------------------------------+ + + ++------------------------+--------------------------------------------------------------------------+ +|Counter | Description | ++========================+==========================================================================+ +|op_wip | Replication operations currently being processed (primary) | 
++------------------------+--------------------------------------------------------------------------+ +|op_latency | Latency of client operations (including queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_process_latency | Latency of client operations (excluding queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_prepare_latency | Latency of client operations (excluding queue time and wait for finished)| ++------------------------+--------------------------------------------------------------------------+ +|op_r_latency | Latency of read operation (including queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_r_process_latency | Latency of read operation (excluding queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_w_in_bytes | Client data written | ++------------------------+--------------------------------------------------------------------------+ +|op_w_latency | Latency of write operation (including queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_w_process_latency | Latency of write operation (excluding queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_w_prepare_latency | Latency of write operations (excluding queue time and wait for finished) | ++------------------------+--------------------------------------------------------------------------+ +|op_rw | Client read-modify-write operations | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_in_bytes | Client read-modify-write operations write in | 
++------------------------+--------------------------------------------------------------------------+ +|op_rw_out_bytes | Client read-modify-write operations read out | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_latency | Latency of read-modify-write operation (including queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_process_latency | Latency of read-modify-write operation (excluding queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_prepare_latency | Latency of read-modify-write operations (excluding queue time | +| | and wait for finished) | ++------------------------+--------------------------------------------------------------------------+ +|op_before_queue_op_lat | Latency of IO before calling queue (before really queue into ShardedOpWq)| +| | op_before_dequeue_op_lat | ++------------------------+--------------------------------------------------------------------------+ +|op_before_dequeue_op_lat| Latency of IO before calling dequeue_op(already dequeued and get PG lock)| ++------------------------+--------------------------------------------------------------------------+ + +Latency counters are measured in microseconds unless otherwise specified in the description. + diff --git a/doc/mgr/insights.rst b/doc/mgr/insights.rst new file mode 100644 index 00000000..37b8903f --- /dev/null +++ b/doc/mgr/insights.rst @@ -0,0 +1,52 @@ +Insights Module +=============== + +The insights module collects and exposes system information to the Insights Core +data analysis framework. It is intended to replace explicit interrogation of +Ceph CLIs and daemon admin sockets, reducing the API surface that Insights +depends on. The insights report contains the following: + +* **Health reports**. 
In addition to reporting the current health of the + cluster, the insights module reports a summary of the last 24 hours of health + checks. This feature is important for catching cluster health issues that are + transient and may not be present at the moment the report is generated. Health + checks are deduplicated to avoid unbounded data growth. + +* **Crash reports**. A summary of any daemon crashes in the past 24 hours is + included in the insights report. Crashes are reported as the number of crashes + per daemon type (e.g. `ceph-osd`) within the time window. Full details of a + crash may be obtained using the `crash module`_. + +* Software version, storage utilization, cluster maps, placement group summary, + monitor status, cluster configuration, and OSD metadata. + +Enabling +-------- + +The *insights* module is enabled with:: + + ceph mgr module enable insights + +Commands +-------- +:: + + ceph insights + +Generate the full report. + +:: + + ceph insights prune-health <hours> + +Remove historical health data older than <hours>. Passing `0` for <hours> will +clear all health data. + +This command is useful for cleaning the health history before automated nightly +reports are generated, which may contain spurious health checks accumulated +while performing system maintenance, or other health checks that have been +resolved. There is no need to prune health data to reclaim storage space; +garbage collection is performed regularly to remove old health data from +persistent storage. + +.. _crash module: ../crash diff --git a/doc/mgr/iostat.rst b/doc/mgr/iostat.rst new file mode 100644 index 00000000..f9f84938 --- /dev/null +++ b/doc/mgr/iostat.rst @@ -0,0 +1,32 @@ +.. _mgr-iostat-overview: + +iostat +====== + +This module shows the current throughput and IOPS done on the Ceph cluster. 
+ +Enabling +-------- + +To check if the *iostat* module is enabled, run:: + + ceph mgr module ls + +The module can be enabled with:: + + ceph mgr module enable iostat + +To execute the module, run:: + + ceph iostat + +To change the frequency at which the statistics are printed, use the ``-p`` +option:: + + ceph iostat -p <period in seconds> + +For example, use the following command to print the statistics every 5 seconds:: + + ceph iostat -p 5 + +To stop the module, press Ctrl-C. diff --git a/doc/mgr/localpool.rst b/doc/mgr/localpool.rst new file mode 100644 index 00000000..4948dc75 --- /dev/null +++ b/doc/mgr/localpool.rst @@ -0,0 +1,35 @@ +Local Pool Module +================= + +The *localpool* module can automatically create RADOS pools that are +localized to a subset of the overall cluster. For example, by default, it will +create a pool for each distinct rack in the cluster. This can be useful for some +deployments that want to distribute some data locally as well as globally across the cluster. + +Enabling +-------- + +The *localpool* module is enabled with:: + + ceph mgr module enable localpool + +Configuring +----------- + +The *localpool* module understands the following options: + +* **subtree** (default: `rack`): which CRUSH subtree type the module + should create a pool for. +* **failure_domain** (default: `host`): what failure domain we should + separate data replicas across. +* **pg_num** (default: `128`): number of PGs to create for each pool. +* **num_rep** (default: `3`): number of replicas for each pool. + (Currently, pools are always replicated.) +* **min_size** (default: none): value to set min_size to (unchanged from Ceph's default if this option is not set). +* **prefix** (default: `by-$subtreetype-`): prefix for the pool name. + +These options are set via the config-key interface. 
For example, to +change the replication level to 2x with only 64 PGs, :: + + ceph config set mgr mgr/localpool/num_rep 2 + ceph config set mgr mgr/localpool/pg_num 64 diff --git a/doc/mgr/modules.rst b/doc/mgr/modules.rst new file mode 100644 index 00000000..662565c0 --- /dev/null +++ b/doc/mgr/modules.rst @@ -0,0 +1,389 @@ + + +.. _mgr-module-dev: + +ceph-mgr module developer's guide +================================= + +.. warning:: + + This is developer documentation, describing Ceph internals that + are only relevant to people writing ceph-mgr modules. + +Creating a module +----------------- + +In pybind/mgr/, create a python module. Within your module, create a class +that inherits from ``MgrModule``. For ceph-mgr to detect your module, your +directory must contain a file called `module.py`. + +The most important methods to override are: + +* a ``serve`` member function for server-type modules. This + function should block forever. +* a ``notify`` member function if your module needs to + take action when new cluster data is available. +* a ``handle_command`` member function if your module + exposes CLI commands. + +Some modules interface with external orchestrators to deploy +Ceph services. These also inherit from ``Orchestrator``, which adds +additional methods to the base ``MgrModule`` class. See +:ref:`Orchestrator modules <orchestrator-modules>` for more on +creating these modules. + +Installing a module +------------------- + +Once your module is present in the location set by the +``mgr module path`` configuration setting, you can enable it +via the ``ceph mgr module enable`` command:: + + ceph mgr module enable mymodule + +Note that the MgrModule interface is not stable, so any modules maintained +outside of the Ceph tree are liable to break when run against any newer +or older versions of Ceph. 
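As a rough illustration of the methods listed under "Creating a module", a minimal ``module.py`` might look like the sketch below. It is not taken from the Ceph tree. The stand-in ``MgrModule`` base class exists only so that the sketch runs outside ceph-mgr (a real module would use ``from mgr_module import MgrModule``), and the class name ``MyModule``, the command prefix ``mymodule status`` and the ``notifications`` counter are hypothetical names chosen for illustration. Returning a negated errno value on failure is an assumption based on how in-tree modules such as ``hello`` commonly behave.

```python
import errno
import threading


class MgrModule(object):
    """Stand-in base class so this sketch runs outside ceph-mgr.

    In a real module, remove this class and use:
        from mgr_module import MgrModule
    """

    def __init__(self, *args, **kwargs):
        pass


class MyModule(MgrModule):
    # Hypothetical command; the "cmd" syntax is described under
    # "Exposing commands".
    COMMANDS = [
        {
            "cmd": "mymodule status",
            "desc": "Report how many notifications this module has seen",
            "perm": "r",
        },
    ]

    def __init__(self, *args, **kwargs):
        super(MyModule, self).__init__(*args, **kwargs)
        self._stop = threading.Event()
        self.notifications = 0  # illustrative state only

    def serve(self):
        # Server-type modules block here until asked to stop.
        self._stop.wait()

    def shutdown(self):
        # Unblock serve() so the module can exit cleanly.
        self._stop.set()

    def notify(self, notify_type, notify_id):
        # Called when cluster state changes; the data itself is not passed in.
        self.notifications += 1

    def handle_command(self, inbuf, cmd):
        # Every command returns a (retval, stdout, stderr) tuple.
        if cmd["prefix"] == "mymodule status":
            return 0, "notifications seen: %d" % self.notifications, ""
        return -errno.EINVAL, "", "unknown command: %s" % cmd["prefix"]
```

Using a ``threading.Event`` keeps ``serve`` blocked without busy-waiting while still letting ``shutdown`` release it promptly.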
+ +Logging +------- + +``MgrModule`` instances have a ``log`` property which is a logger instance that +sends log messages into the Ceph logging layer where they will be recorded +in the mgr daemon's log file. + +Use it the same way you would any other python logger. The python +log levels debug, info, warn, err are mapped into the Ceph +severities 20, 4, 1 and 0 respectively. + +Exposing commands +----------------- + +Set the ``COMMANDS`` class attribute of your module to a list of dicts +like this:: + + COMMANDS = [ + { + "cmd": "foobar name=myarg,type=CephString", + "desc": "Do something awesome", + "perm": "rw", + # optional: + "poll": "true" + } + ] + +The ``cmd`` part of each entry is parsed in the same way as internal +Ceph mon and admin socket commands (see mon/MonCommands.h in +the Ceph source for examples). Note that the "poll" field is optional, +and is set to False by default; this indicates to the ``ceph`` CLI +that it should call this command repeatedly and output results (see +``ceph -h`` and its ``--period`` option). + +Each command is expected to return a tuple ``(retval, stdout, stderr)``. +``retval`` is an integer representing a libc error code (e.g. EINVAL, +EPERM, or 0 for no error), ``stdout`` is a string containing any +non-error output, and ``stderr`` is a string containing any progress or +error explanation output. Either or both of the two strings may be empty. + +Implement the ``handle_command`` function to respond to the commands +when they are sent: + + +.. py:currentmodule:: mgr_module +.. automethod:: MgrModule.handle_command + +Configuration options +--------------------- + +Modules can load and store configuration options using the +``set_module_option`` and ``get_module_option`` methods. + +.. note:: Use ``set_module_option`` and ``get_module_option`` to + manage user-visible configuration options that are not blobs (like + certificates). 
If you want to persist module-internal data or + binary configuration data, consider using the `KV store`_. + +You must declare your available configuration options in the +``MODULE_OPTIONS`` class attribute, like this: + +:: + + MODULE_OPTIONS = [ + { + "name": "my_option" + } + ] + +If you try to use set_module_option or get_module_option on options not declared +in ``MODULE_OPTIONS``, an exception will be raised. + +You may choose to provide setter commands in your module to perform +high level validation. Users can also modify configuration using +the normal `ceph config set` command, where the configuration options +for a mgr module are named like `mgr/<module name>/<option>`. + +If a configuration option is different depending on which node the mgr +is running on, then use *localized* configuration ( +``get_localized_module_option``, ``set_localized_module_option``). +This may be necessary for options such as what address to listen on. +Localized options may also be set externally with ``ceph config set``, +where the key name is like ``mgr/<module name>/<mgr id>/<option>``. + +If you need to load and store data (e.g. something larger, binary, or multiline), +use the KV store instead of configuration options (see next section). + +Hints for using config options: + +* Reads are fast: ceph-mgr keeps a local in-memory copy, so in many cases + you can just do a get_module_option every time you use an option, rather than + copying it out into a variable. +* Writes block until the value is persisted (i.e. round trip to the monitor), + but reads from another thread will see the new value immediately. +* If a user has used `config set` from the command line, then the new + value will become visible to `get_module_option` almost immediately. The + mon->mgr update is asynchronous, so `config set` may return a fraction + of a second before the new value is visible on the mgr. +* To delete a config value (i.e. 
revert to default), just pass ``None`` to + set_module_option. + +.. automethod:: MgrModule.get_module_option +.. automethod:: MgrModule.set_module_option +.. automethod:: MgrModule.get_localized_module_option +.. automethod:: MgrModule.set_localized_module_option + +KV store +-------- + +Modules have access to a private (per-module) key value store, which +is implemented using the monitor's "config-key" commands. Use +the ``set_store`` and ``get_store`` methods to access the KV store from +your module. + +The KV store commands work in a similar way to the configuration +commands. Reads are fast, operating from a local cache. Writes block +on persistence and do a round trip to the monitor. + +This data can be accessed from outside of ceph-mgr using the +``ceph config-key [get|set]`` commands. Key names follow the same +conventions as configuration options. Note that any values updated +from outside of ceph-mgr will not be seen by running modules until +the next restart. Users should be discouraged from accessing module KV +data externally -- if it is necessary for users to populate data, modules +should provide special commands to set the data via the module. + +Use the ``get_store_prefix`` function to enumerate keys within +a particular prefix (i.e. all keys starting with a particular substring). + + +.. automethod:: MgrModule.get_store +.. automethod:: MgrModule.set_store +.. automethod:: MgrModule.get_localized_store +.. automethod:: MgrModule.set_localized_store +.. automethod:: MgrModule.get_store_prefix + + +Accessing cluster data +---------------------- + +Modules have access to the in-memory copies of the Ceph cluster's +state that the mgr maintains. Accessor functions are exposed +as members of MgrModule. + +Calls that access the cluster or daemon state are generally going +from Python into native C++ routines. There is some overhead to this, +but much less than, for example, calling into a REST API or calling into +an SQL database. 
+ +There are no consistency rules about access to cluster structures or +daemon metadata. For example, an OSD might exist in OSDMap but +have no metadata, or vice versa. On a healthy cluster these +will be very rare transient states, but modules should be written +to cope with the possibility. + +Note that these accessors must not be called in the module's ``__init__`` +function. Doing so will result in a circular locking exception. + +.. automethod:: MgrModule.get +.. automethod:: MgrModule.get_server +.. automethod:: MgrModule.list_servers +.. automethod:: MgrModule.get_metadata +.. automethod:: MgrModule.get_daemon_status +.. automethod:: MgrModule.get_perf_schema +.. automethod:: MgrModule.get_counter +.. automethod:: MgrModule.get_mgr_id + +Exposing health checks +---------------------- + +Modules can raise first-class Ceph health checks, which will be reported +in the output of ``ceph status`` and in other places that report on the +cluster's health. + +If you use ``set_health_checks`` to report a problem, be sure to call +it again with an empty dict to clear your health check when the problem +goes away. + +.. automethod:: MgrModule.set_health_checks + +What if the mons are down? +-------------------------- + +The manager daemon gets much of its state (such as the cluster maps) +from the monitor. If the monitor cluster is inaccessible, whichever +manager was active will continue to run, with the latest state it saw +still in memory. + +However, if you are creating a module that shows the cluster state +to the user then you may well not want to mislead them by showing +them that out-of-date state. + +To check if the manager daemon currently has a connection to +the monitor cluster, use this function: + +.. 
automethod:: MgrModule.have_mon_connection + +Reporting if your module cannot run +----------------------------------- + +If your module cannot be run for any reason (such as a missing dependency), +then you can report that by implementing the ``can_run`` function. + +.. automethod:: MgrModule.can_run + +Note that this will only work properly if your module can always be imported: +if you are importing a dependency that may be absent, then do it in a +try/except block so that your module can be loaded far enough to use +``can_run`` even if the dependency is absent. + +Sending commands +---------------- + +A non-blocking facility is provided for sending monitor commands +to the cluster. + +.. automethod:: MgrModule.send_command + +Receiving notifications +----------------------- + +The manager daemon calls the ``notify`` function on all active modules +when certain important pieces of cluster state are updated, such as the +cluster maps. + +The actual data is not passed into this function, rather it is a cue for +the module to go and read the relevant structure if it is interested. Most +modules ignore most types of notification: to ignore a notification +simply return from this function without doing anything. + +.. automethod:: MgrModule.notify + +Accessing RADOS or CephFS +------------------------- + +If you want to use the librados python API to access data stored in +the Ceph cluster, you can access the ``rados`` attribute of your +``MgrModule`` instance. This is an instance of ``rados.Rados`` which +has been constructed for you using the existing Ceph context (an internal +detail of the C++ Ceph code) of the mgr daemon. + +Always use this specially constructed librados instance instead of +constructing one by hand. + +Similarly, if you are using libcephfs to access the filesystem, then +use the libcephfs ``create_with_rados`` to construct it from the +``MgrModule.rados`` librados instance, and thereby inherit the correct context. 
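A minimal sketch of the librados pattern, with a pool name and object name chosen purely for illustration: the helper opens an I/O context on the module's ``rados`` handle and reads one object. ``open_ioctx``, ``read`` and ``close`` are librados Python calls; the ``Fake*`` classes are hypothetical stand-ins so that the snippet can run without a cluster.

```python
def read_object(module, pool, oid, length=8192):
    """Read up to `length` bytes of object `oid` from `pool`.

    Uses the rados handle that ceph-mgr built for the module instead of
    constructing a new rados.Rados() by hand.
    """
    ioctx = module.rados.open_ioctx(pool)
    try:
        return ioctx.read(oid, length)
    finally:
        # Always release the I/O context, even if the read fails.
        ioctx.close()


class FakeIoctx:
    """Hypothetical stand-in for rados.Ioctx, for illustration only."""

    def __init__(self, objects):
        self.objects = objects

    def read(self, key, length=8192, offset=0):
        return self.objects[key][offset:offset + length]

    def close(self):
        pass


class FakeRados:
    """Hypothetical stand-in for the module's rados.Rados instance."""

    def __init__(self, pools):
        self.pools = pools

    def open_ioctx(self, pool):
        return FakeIoctx(self.pools[pool])


class FakeModule:
    """Hypothetical stand-in for a MgrModule exposing a `rados` attribute."""

    def __init__(self, pools):
        self.rados = FakeRados(pools)


module = FakeModule({"mypool": {"greeting": b"hello"}})
print(read_object(module, "mypool", "greeting"))
```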
+ +Remember that your module may be running while other parts of the cluster +are down: do not assume that librados or libcephfs calls will return +promptly -- consider whether to use timeouts or to block if the rest of +the cluster is not fully available. + +Implementing standby mode +------------------------- + +For some modules, it is useful to run on standby manager daemons as well +as on the active daemon. For example, an HTTP server can usefully +serve HTTP redirect responses from the standby managers so that +the user can point his browser at any of the manager daemons without +having to worry about which one is active. + +Standby manager daemons look for a subclass of ``StandbyModule`` +in each module. If the class is not found then the module is not +used at all on standby daemons. If the class is found, then +its ``serve`` method is called. Implementations of ``StandbyModule`` +must inherit from ``mgr_module.MgrStandbyModule``. + +The interface of ``MgrStandbyModule`` is much restricted compared to +``MgrModule`` -- none of the Ceph cluster state is available to +the module. ``serve`` and ``shutdown`` methods are used in the same +way as a normal module class. The ``get_active_uri`` method enables +the standby module to discover the address of its active peer in +order to make redirects. See the ``MgrStandbyModule`` definition +in the Ceph source code for the full list of methods. + +For an example of how to use this interface, look at the source code +of the ``dashboard`` module. + +Communicating between modules +----------------------------- + +Modules can invoke member functions of other modules. + +.. automethod:: MgrModule.remote + +Be sure to handle ``ImportError`` to deal with the case that the desired +module is not enabled. + +If the remote method raises a python exception, this will be converted +to a RuntimeError on the calling side, where the message string describes +the exception that was originally thrown. 
If your logic intends +to handle certain errors cleanly, it is better to modify the remote method +to return an error value instead of raising an exception. + +At time of writing, inter-module calls are implemented without +copies or serialization, so when you return a python object, you're +returning a reference to that object to the calling module. It +is recommended *not* to rely on this reference passing, as in future the +implementation may change to serialize arguments and return +values. + + +Logging +------- + +Use your module's ``log`` attribute as your logger. This is a logger +configured to output via the ceph logging framework, to the local ceph-mgr +log files. + +Python log severities are mapped to ceph severities as follows: + +* DEBUG is 20 +* INFO is 4 +* WARN is 1 +* ERR is 0 + +Shutting down cleanly +--------------------- + +If a module implements the ``serve()`` method, it should also implement +the ``shutdown()`` method to shut down cleanly: misbehaving modules +may otherwise prevent clean shutdown of ceph-mgr. + +Limitations +----------- + +It is not possible to call back into C++ code from a module's +``__init__()`` method. For example, calling ``self.get_module_option()`` at +this point will result in an assertion failure in ceph-mgr. For modules +that implement the ``serve()`` method, it usually makes sense to do most +initialization inside that method instead. + +Is something missing? +--------------------- + +The ceph-mgr python interface is not set in stone. If you have a need +that is not satisfied by the current interface, please bring it up +on the ceph-devel mailing list. While it is desired to avoid bloating +the interface, it is not generally very hard to expose existing data +to the Python code when there is a good reason. + diff --git a/doc/mgr/orchestrator_cli.rst b/doc/mgr/orchestrator_cli.rst new file mode 100644 index 00000000..6ee37845 --- /dev/null +++ b/doc/mgr/orchestrator_cli.rst @@ -0,0 +1,295 @@ + +.. 
_orchestrator-cli-module: + +================ +Orchestrator CLI +================ + +This module provides a command line interface (CLI) to orchestrator +modules (ceph-mgr modules which interface with external orchestration services). + +As the orchestrator CLI unifies different external orchestrators, a common nomenclature +for the orchestrator module is needed. + ++--------------------------------------+---------------------------------------+ +| host | hostname (not DNS name) of the | +| | physical host. Not the podname, | +| | container name, or hostname inside | +| | the container. | ++--------------------------------------+---------------------------------------+ +| service type | The type of the service. e.g., nfs, | +| | mds, osd, mon, rgw, mgr, iscsi | ++--------------------------------------+---------------------------------------+ +| service | A logical service, typically | +| | comprised of multiple service | +| | instances on multiple hosts for HA | +| | | +| | * ``fs_name`` for mds type | +| | * ``rgw_zone`` for rgw type | +| | * ``ganesha_cluster_id`` for nfs type | ++--------------------------------------+---------------------------------------+ +| service instance | A single instance of a service. 
| +| | Usually a daemon, but maybe not | +| | (e.g., might be a kernel service | +| | like LIO or knfsd or whatever) | +| | | +| | This identifier should | +| | uniquely identify the instance | ++--------------------------------------+---------------------------------------+ +| daemon | A running process on a host; use | +| | “service instance” instead | ++--------------------------------------+---------------------------------------+ + +The relation between the names is the following: + +* a service belongs to a service type +* a service instance belongs to a service type +* a service instance belongs to a single service group + +Configuration +============= + +To enable the orchestrator, please select the orchestrator module to use +with the ``set backend`` command:: + + ceph orchestrator set backend <module> + +For example, to enable the Rook orchestrator module and use it with the CLI:: + + ceph mgr module enable rook + ceph orchestrator set backend rook + +You can then check that the backend is properly configured:: + + ceph orchestrator status + +Disable the Orchestrator +~~~~~~~~~~~~~~~~~~~~~~~~ + +To disable the orchestrator again, use the empty string ``""``:: + + ceph orchestrator set backend "" + ceph mgr module disable rook + +Usage +===== + +.. warning:: + + The orchestrator CLI is unfinished and a work in progress. Some commands will not + exist, or return a different result. + +.. note:: + + Orchestrator modules may only implement a subset of the commands listed below. + Also, the implementation of the commands is orchestrator module dependent and will + differ between implementations. + +Status +~~~~~~ + +:: + + ceph orchestrator status + +Show current orchestrator mode and high-level status (whether the module is able +to talk to it). + +Also show any in-progress actions. 
+ +Host Management +~~~~~~~~~~~~~~~ + +List hosts associated with the cluster:: + + ceph orchestrator host ls + +Add and remove hosts:: + + ceph orchestrator host add <host> + ceph orchestrator host rm <host> + +OSD Management +~~~~~~~~~~~~~~ + +List Devices +^^^^^^^^^^^^ + +Print a list of discovered devices, grouped by node and optionally +filtered to a particular node: + +:: + + ceph orchestrator device ls [--host=...] [--refresh] + +Create OSDs +^^^^^^^^^^^ + +Create OSDs on a group of devices on a single host:: + + ceph orchestrator osd create <host>:<drive> + ceph orchestrator osd create -i <path-to-drive-group.json> + + +The output of ``osd create`` is not specified and may vary between orchestrator backends. + +Where ``drive.group.json`` is a JSON file containing the fields defined in :class:`orchestrator.DriveGroupSpec` + + +Decommission an OSD +^^^^^^^^^^^^^^^^^^^ +:: + + ceph orchestrator osd rm <osd-id> [osd-id...] + +Removes one or more OSDs from the cluster and the host, if the OSDs are marked as +``destroyed``. + + +.. + Blink Device Lights + ^^^^^^^^^^^^^^^^^^^ + :: + + ceph orchestrator device ident-on <host> <devname> + ceph orchestrator device ident-off <host> <devname> + ceph orchestrator device fault-on <host> <devname> + ceph orchestrator device fault-off <host> <devname> + + ceph orchestrator osd ident-on {primary,journal,db,wal,all} <osd-id> + ceph orchestrator osd ident-off {primary,journal,db,wal,all} <osd-id> + ceph orchestrator osd fault-on {primary,journal,db,wal,all} <osd-id> + ceph orchestrator osd fault-off {primary,journal,db,wal,all} <osd-id> + + Where ``journal`` is the filestore journal, ``wal`` is the write ahead log of + bluestore and ``all`` stands for all devices associated with the osd + + +Monitor and manager management +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Creates or removes MONs or MGRs from the cluster. Orchestrator may return an +error if it doesn't know how to do this transition. 
+ +Update the number of monitor nodes:: + + ceph orchestrator mon update <num> [host, host:network...] + +Each host can optionally specify a network for the monitor to listen on. + +Update the number of manager nodes:: + + ceph orchestrator mgr update <num> [host...] + +.. + .. note:: + + The host lists are the new full list of mon/mgr hosts + + .. note:: + + specifying hosts is optional for some orchestrator modules + and mandatory for others (e.g. Ansible). + + +Service Status +~~~~~~~~~~~~~~ + +Print a list of services known to the orchestrator. The list can be limited to +services on a particular host with the optional --host parameter and/or +services of a particular type via the optional --svc_type parameter +(mon, osd, mgr, mds, rgw): + +:: + + ceph orchestrator service ls [--host host] [--svc_type type] [--refresh] + +Discover the status of a particular service:: + + ceph orchestrator service ls --svc_type type --svc_id <name> [--refresh] + + +Query the status of a particular service instance (mon, osd, mds, rgw). For OSDs +the id is the numeric OSD ID; for MDS services it is the filesystem name:: + + ceph orchestrator service-instance status <type> <instance-name> [--refresh] + + + +Stateless services (MDS/RGW/NFS/rbd-mirror/iSCSI) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The orchestrator is not responsible for configuring the services. Please look into the corresponding +documentation for details. + +The ``name`` parameter is an identifier of the group of instances: + +* a CephFS filesystem for a group of MDS daemons, +* a zone name for a group of RGWs + +Sizing: the ``size`` parameter gives the number of daemons in the cluster +(e.g. the number of MDS daemons for a particular CephFS filesystem). 
+ +Creating/growing/shrinking/removing services:: + + ceph orchestrator {mds,rgw} update <name> <size> [host…] + ceph orchestrator {mds,rgw} add <name> + ceph orchestrator nfs update <name> <size> [host…] + ceph orchestrator nfs add <name> <pool> [--namespace=<namespace>] + ceph orchestrator {mds,rgw,nfs} rm <name> + +e.g., ``ceph orchestrator mds update myfs 3 host1 host2 host3`` + +Start/stop/reload:: + + ceph orchestrator service {stop,start,reload} <type> <name> + + ceph orchestrator service-instance {start,stop,reload} <type> <instance-name> + + +Current Implementation Status +============================= + +This is an overview of the current implementation status of the orchestrators. + +=================================== ========= ====== ========= ===== + Command Ansible Rook DeepSea SSH +=================================== ========= ====== ========= ===== + host add ⚪ ⚪ ⚪ ✔️ + host ls ⚪ ⚪ ⚪ ✔️ + host rm ⚪ ⚪ ⚪ ✔️ + mgr update ⚪ ⚪ ⚪ ✔️ + mon update ⚪ ✔️ ⚪ ✔️ + osd create ✔️ ✔️ ⚪ ✔️ + osd device {ident,fault}-{on,off} ⚪ ⚪ ⚪ ⚪ + osd rm ✔️ ⚪ ⚪ ⚪ + device {ident,fault}-{on,off} ⚪ ⚪ ⚪ ⚪ + device ls ✔️ ✔️ ✔️ ✔️ + service ls ⚪ ✔️ ✔️ ⚪ + service-instance status ⚪ ⚪ ⚪ ⚪ + iscsi {stop,start,reload} ⚪ ⚪ ⚪ ⚪ + iscsi add ⚪ ⚪ ⚪ ⚪ + iscsi rm ⚪ ⚪ ⚪ ⚪ + iscsi update ⚪ ⚪ ⚪ ⚪ + mds {stop,start,reload} ⚪ ⚪ ⚪ ⚪ + mds add ⚪ ✔️ ⚪ ⚪ + mds rm ⚪ ✔️ ⚪ ⚪ + mds update ⚪ ⚪ ⚪ ⚪ + nfs {stop,start,reload} ⚪ ⚪ ⚪ ⚪ + nfs add ⚪ ✔️ ⚪ ⚪ + nfs rm ⚪ ✔️ ⚪ ⚪ + nfs update ⚪ ⚪ ⚪ ⚪ + rbd-mirror {stop,start,reload} ⚪ ⚪ ⚪ ⚪ + rbd-mirror add ⚪ ⚪ ⚪ ⚪ + rbd-mirror rm ⚪ ⚪ ⚪ ⚪ + rbd-mirror update ⚪ ⚪ ⚪ ⚪ + rgw {stop,start,reload} ⚪ ⚪ ⚪ ⚪ + rgw add ⚪ ✔️ ⚪ ⚪ + rgw rm ⚪ ✔️ ⚪ ⚪ + rgw update ⚪ ⚪ ⚪ ⚪ +=================================== ========= ====== ========= ===== + +where + +* ⚪ = not yet implemented +* ❌ = not applicable +* ✔ = implemented diff --git a/doc/mgr/orchestrator_modules.rst b/doc/mgr/orchestrator_modules.rst new file mode 100644 index 00000000..6e600571 --- /dev/null +++ 
b/doc/mgr/orchestrator_modules.rst @@ -0,0 +1,285 @@ + + +.. _orchestrator-modules: + +.. py:currentmodule:: orchestrator + +ceph-mgr orchestrator modules +============================= + +.. warning:: + + This is developer documentation, describing Ceph internals that + are only relevant to people writing ceph-mgr orchestrator modules. + +In this context, *orchestrator* refers to some external service that +provides the ability to discover devices and create Ceph services. This +includes external projects such as ceph-ansible, DeepSea, and Rook. + +An *orchestrator module* is a ceph-mgr module (:ref:`mgr-module-dev`) +which implements common management operations using a particular +orchestrator. + +Orchestrator modules subclass the ``Orchestrator`` class: this class is +an interface, it only provides method definitions to be implemented +by subclasses. The purpose of defining this common interface +for different orchestrators is to enable common UI code, such as +the dashboard, to work with various different backends. + + +.. graphviz:: + + digraph G { + subgraph cluster_1 { + volumes [label="mgr/volumes"] + rook [label="mgr/rook"] + dashboard [label="mgr/dashboard"] + orchestrator_cli [label="mgr/orchestrator_cli"] + orchestrator [label="Orchestrator Interface"] + ansible [label="mgr/ansible"] + ssh [label="mgr/ssh"] + deepsea [label="mgr/deepsea"] + + label = "ceph-mgr"; + } + + volumes -> orchestrator + dashboard -> orchestrator + orchestrator_cli -> orchestrator + orchestrator -> rook -> rook_io + orchestrator -> ansible -> ceph_ansible + orchestrator -> deepsea -> suse_deepsea + orchestrator -> ssh + + + rook_io [label="Rook"] + ceph_ansible [label="ceph-ansible"] + suse_deepsea [label="DeepSea"] + + rankdir="TB"; + } + +Behind all the abstraction, the purpose of orchestrator modules is simple: +enable Ceph to do things like discover available hardware, create and +destroy OSDs, and run MDS and RGW services. 
+ +A tutorial is not included here: for full and concrete examples, see +the existing implemented orchestrator modules in the Ceph source tree. + +Glossary +-------- + +Stateful service + a daemon that uses local storage, such as OSD or mon. + +Stateless service + a daemon that doesn't use any local storage, such + as an MDS, RGW, nfs-ganesha, iSCSI gateway. + +Label + arbitrary string tags that may be applied by administrators + to nodes. Typically administrators use labels to indicate + which nodes should run which kinds of service. Labels are + advisory (from human input) and do not guarantee that nodes + have particular physical capabilities. + +Drive group + collection of block devices with common/shared OSD + formatting (typically one or more SSDs acting as + journals/dbs for a group of HDDs). + +Placement + choice of which node is used to run a service. + +Key Concepts +------------ + +The underlying orchestrator remains the source of truth for information +about whether a service is running, what is running where, which +nodes are available, etc. Orchestrator modules should avoid taking +any internal copies of this information, and read it directly from +the orchestrator backend as much as possible. + +Bootstrapping nodes and adding them to the underlying orchestration +system is outside the scope of Ceph's orchestrator interface. Ceph +can only work on nodes when the orchestrator is already aware of them. + +Calls to orchestrator modules are all asynchronous, and return *completion* +objects (see below) rather than returning values immediately. + +Where possible, placement of stateless services should be left up to the +orchestrator. + +Completions and batching +------------------------ + +All methods that read or modify the state of the system can potentially +be long running. To handle that, all such methods return a *completion* +object (a *ReadCompletion* or a *WriteCompletion*). 
Orchestrator modules +must implement the *wait* method: this takes a list of completions, and +is responsible for checking if they're finished, and advancing the underlying +operations as needed. + +Each orchestrator module implements its own underlying mechanisms +for completions. This might involve running the underlying operations +in threads, or batching the operations up before later executing +in one go in the background. If implementing such a batching pattern, the +module would do no work on any operation until it appeared in a list +of completions passed into *wait*. + +*WriteCompletion* objects have a two-stage execution. First they become +*persistent*, meaning that the write has made it to the orchestrator +itself, and been persisted there (e.g. a manifest file has been updated). +If ceph-mgr crashed at this point, the operation would still eventually take +effect. Second, the completion becomes *effective*, meaning that the operation has really happened (e.g. a service has actually been started). + +.. automethod:: Orchestrator.wait + +.. autoclass:: _Completion + :members: + +.. autoclass:: ReadCompletion + :members: + +.. autoclass:: WriteCompletion + :members: + +Placement +--------- + +In general, stateless services do not require any specific placement +rules, as they can run anywhere that sufficient system resources +are available. However, some orchestrators may not include the +functionality to choose a location in this way, so we can optionally +specify a location when creating a stateless service. + +OSD services generally require a specific placement choice, as this +will determine which storage devices are used. + +Error Handling +-------------- + +The main goal of error handling within orchestrator modules is to provide debug information to +assist users when dealing with deployment errors. + +.. autoclass:: OrchestratorError +.. autoclass:: NoOrchestrator +.. 
autoclass:: OrchestratorValidationError + + +In detail, orchestrators need to explicitly deal with different kinds of errors: + +1. No orchestrator configured + + See :class:`NoOrchestrator`. + +2. An orchestrator doesn't implement a specific method. + + For example, an Orchestrator doesn't support ``add_host``. + + In this case, a ``NotImplementedError`` is raised. + +3. Missing features within implemented methods. + + E.g. optional parameters to a command that are not supported by the + backend (e.g. the hosts field of the :func:`Orchestrator.update_mons` command with the rook backend). + + See :class:`OrchestratorValidationError`. + +4. Input validation errors + + The ``orchestrator_cli`` module and other calling modules are supposed to + provide meaningful error messages. + + See :class:`OrchestratorValidationError`. + +5. Errors when actually executing commands + + The resulting Completion should contain an error string that assists in understanding the + problem. In addition, :func:`_Completion.is_errored` is set to ``True``. + +6. Invalid configuration in the orchestrator modules + + This can be handled in the same way as case 5. + + +All other errors are unexpected orchestrator issues and thus should raise an exception, which is then +logged into the mgr log file. If there is a completion object at that point, +:func:`_Completion.result` may contain an error message. + + +Excluded functionality +---------------------- + +- Ceph's orchestrator interface is not a general purpose framework for + managing Linux servers -- it is deliberately constrained to manage + the Ceph cluster's services only. +- Multipathed storage is not handled (multipathing is unnecessary for + Ceph clusters). Each drive is assumed to be visible only on + a single node. + +Host management +--------------- + +.. automethod:: Orchestrator.add_host +.. automethod:: Orchestrator.remove_host +.. automethod:: Orchestrator.get_hosts + +Inventory and status +-------------------- + +.. 
automethod:: Orchestrator.get_inventory +.. autoclass:: InventoryFilter +.. autoclass:: InventoryNode + +.. autoclass:: InventoryDevice + :members: + +.. automethod:: Orchestrator.describe_service +.. autoclass:: ServiceDescription + +Service Actions +--------------- + +.. automethod:: Orchestrator.service_action + +OSD management +-------------- + +.. automethod:: Orchestrator.create_osds +.. automethod:: Orchestrator.replace_osds +.. automethod:: Orchestrator.remove_osds + +.. autoclass:: DeviceSelection + :members: + +.. autoclass:: DriveGroupSpec + :members: + :exclude-members: from_json + +Stateless Services +------------------ + +.. automethod:: Orchestrator.add_stateless_service +.. automethod:: Orchestrator.update_stateless_service +.. automethod:: Orchestrator.remove_stateless_service + +Upgrades +-------- + +.. automethod:: Orchestrator.upgrade_available +.. automethod:: Orchestrator.upgrade_start +.. automethod:: Orchestrator.upgrade_status +.. autoclass:: UpgradeSpec +.. autoclass:: UpgradeStatusSpec + +Utility +------- + +.. automethod:: Orchestrator.available + +Client Modules +-------------- + +.. autoclass:: OrchestratorClientMixin + :members: diff --git a/doc/mgr/prometheus.rst b/doc/mgr/prometheus.rst new file mode 100644 index 00000000..87296be3 --- /dev/null +++ b/doc/mgr/prometheus.rst @@ -0,0 +1,314 @@ +.. _mgr-prometheus: + +================= +Prometheus Module +================= + +Provides a Prometheus exporter to pass on Ceph performance counters +from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport +messages from all MgrClient processes (mons and OSDs, for instance) +with performance counter schema data and actual counter data, and keeps +a circular buffer of the last N samples. This module creates an HTTP +endpoint (like all Prometheus exporters) and retrieves the latest sample +of every counter when polled (or "scraped" in Prometheus terminology). 
+The HTTP path and query parameters are ignored; all extant counters +for all reporting entities are returned in text exposition format. +(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.) + +Enabling prometheus output +========================== + +The *prometheus* module is enabled with:: + + ceph mgr module enable prometheus + +Configuration +------------- + +.. note:: + + The Prometheus manager module needs to be restarted for configuration changes to be applied. + +By default the module will accept HTTP requests on port ``9283`` on all IPv4 +and IPv6 addresses on the host. The listen address and port are both +configurable with ``ceph config set``, using the keys +``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port +is registered with Prometheus's `registry +<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_. + +:: + + ceph config set mgr mgr/prometheus/server_addr 0.0.0.0 + ceph config set mgr mgr/prometheus/server_port 9283 + +.. warning:: + + The ``scrape_interval`` of this module should always be set to match + Prometheus' scrape interval to work properly and not cause any issues. + +The Prometheus manager module is, by default, configured with a scrape interval +of 15 seconds. The scrape interval in the module is used for caching purposes +and to determine when a cache is stale. + +It is not recommended to use a scrape interval below 10 seconds. It is +recommended to use 15 seconds as the scrape interval, though in some cases it +might be useful to increase the scrape interval. + +To set a different scrape interval in the Prometheus module, set +``scrape_interval`` to the desired value:: + + ceph config set mgr mgr/prometheus/scrape_interval 20 + +On large clusters (>1000 OSDs), the time to fetch the metrics may become +significant. 
Without the cache, the Prometheus manager module could, +especially in conjunction with multiple Prometheus instances, overload the +manager and lead to unresponsive or crashing Ceph manager instances. Hence, +the cache is enabled by default and cannot be disabled. This means that there +is a possibility that the cache becomes stale. The cache is considered stale +when the time to fetch the metrics from Ceph exceeds the configured +``scrape_interval``. + +If that is the case, **a warning will be logged** and the module will either + +* respond with a 503 HTTP status code (service unavailable) or, +* it will return the content of the cache, even though it might be stale. + +This behavior can be configured. By default, it will return a 503 HTTP status +code (service unavailable). You can set other options using the ``ceph config +set`` commands. + +To tell the module to respond with possibly stale data, set it to ``return``:: + + ceph config set mgr mgr/prometheus/stale_cache_strategy return + +To tell the module to respond with "service unavailable", set it to ``fail``:: + + ceph config set mgr mgr/prometheus/stale_cache_strategy fail + +.. _prometheus-rbd-io-statistics: + +RBD IO statistics +----------------- + +The module can optionally collect RBD per-image IO statistics by enabling +dynamic OSD performance counters. The statistics are gathered for all images +in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools`` +configuration parameter. The parameter is a comma or space separated list +of ``pool[/namespace]`` entries. If the namespace is not specified the +statistics are collected for all namespaces in the pool. + +Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:: + + ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN" + +The module makes the list of all available images scanning the specified +pools and namespaces and refreshes it periodically. 
The period is +configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval`` +parameter (in sec) and is 300 sec (5 minutes) by default. The module will +force refresh earlier if it detects statistics from a previously unknown +RBD image. + +Example to turn up the sync interval to 10 minutes:: + + ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600 + +Statistic names and labels +========================== + +The names of the stats are exactly as Ceph names them, with +illegal characters ``.``, ``-`` and ``::`` translated to ``_``, +and ``ceph_`` prefixed to all names. + + +All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123" +that identifies the type and ID of the daemon they come from. Some +statistics can come from different types of daemon, so when querying +e.g. an OSD's RocksDB stats, you would probably want to filter +on ceph_daemon starting with "osd" to avoid mixing in the monitor +rocksdb stats. + + +The *cluster* statistics (i.e. those global to the Ceph cluster) +have labels appropriate to what they report on. For example, +metrics relating to pools have a ``pool_id`` label. + + +The long running averages that represent the histograms from core Ceph +are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics. +This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_ +and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_. + +Pool and OSD metadata series +---------------------------- + +Special series are output to enable displaying and querying on +certain metadata fields. 
+ +Pools have a ``ceph_pool_metadata`` field like this: + +:: + + ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0 + +OSDs have a ``ceph_osd_metadata`` field like this: + +:: + + ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0 + + +Correlating drive statistics with node_exporter +----------------------------------------------- + +The prometheus output from Ceph is designed to be used in conjunction +with the generic host monitoring from the Prometheus node_exporter. + +To enable correlation of Ceph OSD statistics with node_exporter's +drive statistics, special series are output like this: + +:: + + ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"} + +To use this to get disk statistics by OSD ID, use either the ``and`` operator or +the ``*`` operator in your prometheus query. All metadata metrics (like +``ceph_disk_occupation``) have the value 1, so they act neutrally with ``*``. Using ``*`` +allows the use of the ``group_left`` and ``group_right`` grouping modifiers, so that +the resulting metric has additional labels from one side of the query. + +See the +`prometheus documentation`__ for more information about constructing queries. + +__ https://prometheus.io/docs/prometheus/latest/querying/basics + +The goal is to run a query like + +:: + + rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"} + +Out of the box the above query will not return any metrics since the ``instance`` labels of +both metrics don't match. The ``instance`` label of ``ceph_disk_occupation`` +will be the currently active MGR node. + +The following two sections outline two approaches to remedy this. + +Use label_replace +================= + +The ``label_replace`` function (cf. 
+`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_) +can add a label to, or alter a label of, a metric within a query. + +To correlate an OSD and its disks write rate, the following query can be used: + +:: + + label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"} + +Configuring Prometheus server +============================= + +honor_labels +------------ + +To enable Ceph to output properly-labeled data relating to any host, +use the ``honor_labels`` setting when adding the ceph-mgr endpoints +to your prometheus configuration. + +This allows Ceph to export the proper ``instance`` label without prometheus +overwriting it. Without this setting, Prometheus applies an ``instance`` label +that includes the hostname and port of the endpoint that the series came from. +Because Ceph clusters have multiple manager daemons, this results in an +``instance`` label that changes spuriously when the active manager daemon +changes. + +If this is undesirable a custom ``instance`` label can be set in the +Prometheus target configuration: you might wish to set it to the hostname +of your first mgr daemon, or something completely arbitrary like "ceph_cluster". + +node_exporter hostname labels +----------------------------- + +Set your ``instance`` labels to match what appears in Ceph's OSD metadata +in the ``instance`` field. This is generally the short hostname of the node. + +This is only necessary if you want to correlate Ceph stats with host stats, +but you may find it useful to do it in all cases in case you want to do +the correlation in the future. + +Example configuration +--------------------- + +This example shows a single node configuration running ceph-mgr and +node_exporter on a server called ``senta04``. 
Note that this requires adding the +appropriate ``instance`` label to every ``node_exporter`` target individually. + +This is just an example: there are other ways to configure prometheus +scrape targets and label rewrite rules. + +prometheus.yml +~~~~~~~~~~~~~~ + +:: + + global: + scrape_interval: 15s + evaluation_interval: 15s + + scrape_configs: + - job_name: 'node' + file_sd_configs: + - files: + - node_targets.yml + - job_name: 'ceph' + honor_labels: true + file_sd_configs: + - files: + - ceph_targets.yml + + +ceph_targets.yml +~~~~~~~~~~~~~~~~ + + +:: + + [ + { + "targets": [ "senta04.mydomain.com:9283" ], + "labels": {} + } + ] + + +node_targets.yml +~~~~~~~~~~~~~~~~ + +:: + + [ + { + "targets": [ "senta04.mydomain.com:9100" ], + "labels": { + "instance": "senta04" + } + } + ] + + +Notes +===== + +Counters and gauges are exported; currently histograms and long-running +averages are not. It's possible that Ceph's 2-D histograms could be +reduced to two separate 1-D histograms, and that long-running averages +could be exported as Prometheus' Summary type. + +Timestamps, as with many Prometheus exporters, are established by +the server's scrape time (Prometheus expects that it is polling the +actual counter process synchronously). It is possible to supply a +timestamp along with the stat report, but the Prometheus team strongly +advises against this. This means that timestamps will be delayed by +an unpredictable amount; it's not clear if this will be problematic, +but it's worth knowing about. diff --git a/doc/mgr/restful.rst b/doc/mgr/restful.rst new file mode 100644 index 00000000..271e37bb --- /dev/null +++ b/doc/mgr/restful.rst @@ -0,0 +1,156 @@ +Restful Module +============== + +The RESTful module offers REST API access to the status of the cluster +over an SSL-secured connection. 
+ +Enabling +-------- + +The *restful* module is enabled with:: + + ceph mgr module enable restful + +You will also need to configure an SSL certificate below before the +API endpoint is available. By default the module will accept HTTPS +requests on port ``8003`` on all IPv4 and IPv6 addresses on the host. + +Securing +-------- + +All connections to *restful* are secured with SSL. You can generate a +self-signed certificate with the command:: + + ceph restful create-self-signed-cert + +Note that with a self-signed certificate most clients will need a flag +to allow a connection and/or suppress warning messages. For example, +if the ``ceph-mgr`` daemon is on the same host,:: + + curl -k https://localhost:8003/ + +To properly secure a deployment, a certificate that is signed by the +organization's certificate authority should be used. For example, a key pair +can be generated with a command similar to:: + + openssl req -new -nodes -x509 \ + -subj "/O=IT/CN=ceph-mgr-restful" \ + -days 3650 -keyout restful.key -out restful.crt -extensions v3_ca + +The ``restful.crt`` should then be signed by your organization's CA +(certificate authority). Once that is done, you can set it with:: + + ceph config-key set mgr/restful/$name/crt -i restful.crt + ceph config-key set mgr/restful/$name/key -i restful.key + +where ``$name`` is the name of the ``ceph-mgr`` instance (usually the +hostname). If all manager instances are to share the same certificate, +you can leave off the ``$name`` portion:: + + ceph config-key set mgr/restful/crt -i restful.crt + ceph config-key set mgr/restful/key -i restful.key + + +Configuring IP and port +----------------------- + +Like any other RESTful API endpoint, *restful* binds to an IP and +port. By default, the currently active ``ceph-mgr`` daemon will bind +to port 8003 and any available IPv4 or IPv6 address on the host. + +Since each ``ceph-mgr`` hosts its own instance of *restful*, it may +also be necessary to configure them separately. 
The IP and port +can be changed via the configuration key facility:: + + ceph config set mgr mgr/restful/$name/server_addr $IP + ceph config set mgr mgr/restful/$name/server_port $PORT + +where ``$name`` is the ID of the ceph-mgr daemon (usually the hostname). + +These settings can also be configured cluster-wide rather than per +manager. For example,:: + + ceph config set mgr mgr/restful/server_addr $IP + ceph config set mgr mgr/restful/server_port $PORT + +If the port is not configured, *restful* will bind to port ``8003``. +If the address is not configured, *restful* will bind to ``::``, +which corresponds to all available IPv4 and IPv6 addresses. + +Load balancer +------------- + +Please note that *restful* will *only* start on the manager which +is active at that moment. Query the Ceph cluster status to see which +manager is active (e.g., ``ceph mgr dump``). In order to make the +API available via a consistent URL regardless of which manager +daemon is currently active, you may want to set up a load balancer +front-end to direct traffic to whichever manager endpoint is +available. + +Available methods +----------------- + +You can navigate to the ``/doc`` endpoint for the full list of available +endpoints and the HTTP methods implemented for each endpoint. 
+ +For example, if you want to use the PATCH method of the ``/osd/<id>`` +endpoint to mark the OSD with id ``1`` as ``up``, you can use the +following curl command:: + + echo -En '{"up": true}' | curl --request PATCH --data @- --silent --insecure --user <user> 'https://<ceph-mgr>:<port>/osd/1' + +or you can use python to do so:: + + $ python + >>> import requests + >>> result = requests.patch( + ... 'https://<ceph-mgr>:<port>/osd/1', + ... json={"up": True}, + ... auth=("<user>", "<password>") + ... ) + >>> print(result.json()) + +Some of the other endpoints implemented in the *restful* module include + +* ``/config/cluster``: **GET** +* ``/config/osd``: **GET**, **PATCH** +* ``/crush/rule``: **GET** +* ``/mon``: **GET** +* ``/osd``: **GET** +* ``/pool``: **GET**, **POST** +* ``/pool/<arg>``: **DELETE**, **GET**, **PATCH** +* ``/request``: **DELETE**, **GET**, **POST** +* ``/request/<arg>``: **DELETE**, **GET** +* ``/server``: **GET** + +The ``/request`` endpoint +------------------------- + +You can use the ``/request`` endpoint to poll the state of a request +you scheduled with any **DELETE**, **POST** or **PATCH** method. These +methods are asynchronous by default because they may take a long time +to finish execution. You can modify this behaviour by appending +``?wait=1`` to the request URL. The returned request will then always +be completed. + +The **POST** method of the ``/request`` endpoint provides a passthrough +for the ceph mon commands as defined in ``src/mon/MonCommands.h``. +Let's consider the following command:: + + COMMAND("osd ls " \ + "name=epoch,type=CephInt,range=0,req=false", \ + "show all OSD ids", "osd", "r", "cli,rest") + +The **prefix** is **osd ls**. The optional argument's name is **epoch** +and it is of type ``CephInt``, i.e. ``integer``. 
This means that you +need to make the following **POST** request to schedule the command:: + + $ python + >>> import requests + >>> result = requests.post( + ... 'https://<ceph-mgr>:<port>/request', + ... json={'prefix': 'osd ls', 'epoch': 0}, + ... auth=("<user>", "<password>") + ... ) + >>> print(result.json()) diff --git a/doc/mgr/rook.rst b/doc/mgr/rook.rst new file mode 100644 index 00000000..483772e4 --- /dev/null +++ b/doc/mgr/rook.rst @@ -0,0 +1,37 @@ + +============================= +Rook orchestrator integration +============================= + +Rook (https://rook.io/) is an orchestration tool that can run Ceph inside +a Kubernetes cluster. + +The ``rook`` module provides integration between Ceph's orchestrator framework +(used by modules such as ``dashboard`` to control cluster services) and +Rook. + +Orchestrator modules only provide services to other modules, which in turn +provide user interfaces. To try out the rook module, you might like +to use the :ref:`Orchestrator CLI <orchestrator-cli-module>` module. + +Requirements +------------ + +- Running ceph-mon and ceph-mgr services that were set up with Rook in + Kubernetes. +- Rook 0.9 or newer. + +Configuration +------------- + +Because a Rook cluster's ceph-mgr daemon is running as a Kubernetes pod, +the ``rook`` module can connect to the Kubernetes API without any explicit +configuration. + +Development +----------- + +If you are a developer, please see :ref:`kubernetes-dev` for instructions +on setting up a development environment to work with this. + + diff --git a/doc/mgr/ssh.rst b/doc/mgr/ssh.rst new file mode 100644 index 00000000..1d1e9663 --- /dev/null +++ b/doc/mgr/ssh.rst @@ -0,0 +1,45 @@ +================ +SSH orchestrator +================ + +The SSH orchestrator is an orchestrator module that does not rely on a separate +system such as Rook or Ansible, but rather manages nodes in a cluster by +establishing an SSH connection and issuing explicit management commands. 
+ +Orchestrator modules only provide services to other modules, which in turn +provide user interfaces. To try out the SSH module, you might like +to use the :ref:`Orchestrator CLI <orchestrator-cli-module>` module. + +Requirements +------------ + +- The Python `remoto` library version 0.35 or newer + +Configuration +------------- + +The SSH orchestrator can be configured to use an SSH configuration file. This is +useful for specifying private keys and other SSH connection options. + +:: + + # ceph config set mgr mgr/ssh/ssh_config_file /path/to/config + +An SSH configuration file can be provided without requiring an accessible file +system path as the method above does. + +:: + + # ceph ssh set-ssh-config -i /path/to/config + +To clear this value use the command: + +:: + + # ceph ssh clear-ssh-config + +Development +----------- + +Instructions for setting up a development environment can be found in the Ceph +source tree at `src/pybind/mgr/ssh/README.md`. diff --git a/doc/mgr/telegraf.rst b/doc/mgr/telegraf.rst new file mode 100644 index 00000000..5944f725 --- /dev/null +++ b/doc/mgr/telegraf.rst @@ -0,0 +1,88 @@ +=============== +Telegraf Module +=============== +The Telegraf module collects and sends statistics series to a Telegraf agent. + +The Telegraf agent can buffer, aggregate, parse and process the data before +sending it to an output which can be InfluxDB, ElasticSearch and many more. + +Currently the only way to send statistics to Telegraf from this module is to +use the socket listener. The module can send statistics over UDP, TCP or +a UNIX socket. + +The Telegraf module was introduced in the 13.x *Mimic* release. 
+ +-------- +Enabling +-------- + +To enable the module, use the following command: + +:: + + ceph mgr module enable telegraf + +If you wish to subsequently disable the module, you can use the corresponding +*disable* command: + +:: + + ceph mgr module disable telegraf + +------------- +Configuration +------------- + +For the telegraf module to send statistics to a Telegraf agent it is +required to configure the address to send the statistics to. + +Set configuration values using the following command: + +:: + + ceph telegraf config-set <key> <value> + + +The most important settings are ``address`` and ``interval``. + +For example, a typical configuration might look like this: + +:: + + ceph telegraf config-set address udp://:8094 + ceph telegraf config-set interval 10 + +The default values for these configuration keys are: + +- address: unixgram:///tmp/telegraf.sock +- interval: 15 + +---------------- +Socket Listener +---------------- +The module only supports sending data to Telegraf through the socket listener +of the Telegraf module using the Influx data format. + +A typical Telegraf configuration might be: + + + [[inputs.socket_listener]] + # service_address = "tcp://:8094" + # service_address = "tcp://127.0.0.1:http" + # service_address = "tcp4://:8094" + # service_address = "tcp6://:8094" + # service_address = "tcp6://[2001:db8::1]:8094" + service_address = "udp://:8094" + # service_address = "udp4://:8094" + # service_address = "udp6://:8094" + # service_address = "unix:///tmp/telegraf.sock" + # service_address = "unixgram:///tmp/telegraf.sock" + data_format = "influx" + +In this case the `address` configuration option for the module would need to be set +to: + + udp://:8094 + + +Refer to the Telegraf documentation for more configuration options. diff --git a/doc/mgr/telemetry.rst b/doc/mgr/telemetry.rst new file mode 100644 index 00000000..37fa8214 --- /dev/null +++ b/doc/mgr/telemetry.rst @@ -0,0 +1,158 @@ +.. 
_telemetry: + +Telemetry Module +================ + +The telemetry module sends anonymous data about the cluster back to the Ceph +developers to help understand how Ceph is used and what problems users may +be experiencing. + +Channels +-------- + +The telemetry report is broken down into several "channels," each with +a different type of information. Assuming telemetry has been enabled, +individual channels can be turned on and off. (If telemetry is off, +the per-channel setting has no effect.) + +* **basic** (default: on): Basic information about the cluster + + - capacity of the cluster + - number of monitors, managers, OSDs, MDSs, radosgws, or other daemons + - software version currently being used + - number and types of RADOS pools and CephFS file systems + - names of configuration options that have been changed from their + default (but *not* their values) + +* **crash** (default: on): Information about daemon crashes, including + + - type of daemon + - version of the daemon + - operating system (OS distribution, kernel version) + - stack trace identifying where in the Ceph code the crash occurred + +* **device** (default: on): Information about device metrics, including + + - anonymized SMART metrics + +* **ident** (default: off): User-provided identifying information about + the cluster + + - cluster description + - contact email address + +The data being reported does *not* contain any sensitive +data like pool names, object names, object contents, hostnames, or device +serial numbers. + +It contains counters and statistics on how the cluster has been +deployed, the version of Ceph, the distribution of the hosts and other +parameters which help the project to gain a better understanding of +the way Ceph is used. + +Data is sent over HTTPS to *telemetry.ceph.com*. + +Enabling the module +------------------- + +The module must first be enabled. 
Note that even if the module is +enabled, telemetry is still "off" by default, so simply enabling the +module will *NOT* result in any data being shared.:: + + ceph mgr module enable telemetry + +Sample report +------------- + +You can look at what data is reported at any time with the command:: + + ceph telemetry show + +To protect your privacy, device reports are generated separately, and data such +as hostname and device serial number is anonymized. The device telemetry is +sent to a different endpoint and does not associate the device data with a +particular cluster. To see a preview of the device report use the command:: + + ceph telemetry show-device + +Please note: In order to generate the device report we use Smartmontools +version 7.0 and up, which supports JSON output. +If you have any concerns about privacy with regard to the information included in +this report, please contact the Ceph developers. + +Channels +-------- + +Individual channels can be enabled or disabled with:: + + ceph config set mgr mgr/telemetry/channel_ident false + ceph config set mgr mgr/telemetry/channel_basic false + ceph config set mgr mgr/telemetry/channel_crash false + ceph config set mgr mgr/telemetry/channel_device false + ceph telemetry show + ceph telemetry show-device + +Enabling Telemetry +------------------ + +To allow the *telemetry* module to start sharing data:: + + ceph telemetry on + +Please note: Telemetry data is licensed under the Community Data License +Agreement - Sharing - Version 1.0 (https://cdla.io/sharing-1-0/). Hence, +telemetry module can be enabled only after you add '--license sharing-1-0' to +the 'ceph telemetry on' command. + +Telemetry can be disabled at any time with:: + + ceph telemetry off + +Interval +-------- + +The module compiles and sends a new report every 24 hours by default. 
+You can adjust this interval with:: + + ceph config set mgr mgr/telemetry/interval 72 # report every three days + +Status +------ + +To see the current configuration:: + + ceph telemetry status + +Manually sending telemetry +-------------------------- + +To send telemetry data ad hoc:: + + ceph telemetry send + +In case telemetry is not enabled (with 'ceph telemetry on'), you need to add +'--license sharing-1-0' to the 'ceph telemetry send' command. + +Sending telemetry through a proxy +--------------------------------- + +If the cluster cannot directly connect to the configured telemetry +endpoint (default *telemetry.ceph.com*), you can configure an HTTP/HTTPS +proxy server with:: + + ceph config set mgr mgr/telemetry/proxy https://10.0.0.1:8080 + +You can also include a *user:pass* if needed:: + + ceph config set mgr mgr/telemetry/proxy https://ceph:telemetry@10.0.0.1:8080 + + +Contact and Description +----------------------- + +A contact and description can be added to the report. This is +completely optional, and disabled by default.:: + + ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>' + ceph config set mgr mgr/telemetry/description 'My first Ceph cluster' + ceph config set mgr mgr/telemetry/channel_ident true diff --git a/doc/mgr/zabbix.rst b/doc/mgr/zabbix.rst new file mode 100644 index 00000000..1aa3ebfc --- /dev/null +++ b/doc/mgr/zabbix.rst @@ -0,0 +1,144 @@ +Zabbix Module +============= + +The Zabbix module actively sends information to a Zabbix server, such as: + +- Ceph status +- I/O operations +- I/O bandwidth +- OSD status +- Storage utilization + +Requirements +------------ + +The module requires that the *zabbix_sender* executable is present on *all* +machines running ceph-mgr. It can be installed on most distributions using +the package manager. + +Dependencies +^^^^^^^^^^^^ +Installing zabbix_sender can be done under Ubuntu or CentOS using either apt +or dnf. 
+ +On Ubuntu Xenial: + +:: + + apt install zabbix-agent + +On Fedora: + +:: + + dnf install zabbix-sender + + +Enabling +-------- +You can enable the *zabbix* module with: + +:: + + ceph mgr module enable zabbix + +Configuration +------------- + +The two main configuration keys for the module are: + +- zabbix_host +- identifier (optional) + +The parameter *zabbix_host* controls the hostname of the Zabbix server to which +*zabbix_sender* will send the items. This can be an IP address if required by +your installation. + +The *identifier* parameter controls the identifier/hostname to use as source +when sending items to Zabbix. This should match the name of the *Host* in +your Zabbix server. + +When the *identifier* parameter is not configured, the ceph-<fsid> of the cluster +will be used when sending data to Zabbix. + +For example, this would be *ceph-c4d32a99-9e80-490f-bd3a-1d22d8a7d354*. + +Additional configuration keys which can be configured and their default values: + +- zabbix_port: 10051 +- zabbix_sender: /usr/bin/zabbix_sender +- interval: 60 + +Configuration keys +^^^^^^^^^^^^^^^^^^^ + +Configuration keys can be set on any machine with the proper cephx credentials; +these are usually Monitors where the *client.admin* key is present. + +:: + + ceph zabbix config-set <key> <value> + +For example: + +:: + + ceph zabbix config-set zabbix_host zabbix.localdomain + ceph zabbix config-set identifier ceph.eu-ams02.local + +The current configuration of the module can also be shown: + +:: + + ceph zabbix config-show + + +Template +^^^^^^^^ +A `template <https://raw.githubusercontent.com/ceph/ceph/9c54334b615362e0a60442c2f41849ed630598ab/src/pybind/mgr/zabbix/zabbix_template.xml>`_ +(XML) to be used on the Zabbix server can be found in the source directory of the module. + +This template contains all items and a few triggers. You can customize the triggers afterwards to fit your needs. 
+ + +Multiple Zabbix servers +^^^^^^^^^^^^^^^^^^^^^^^ +It is possible to instruct the zabbix module to send data to multiple Zabbix servers. + +The *zabbix_host* parameter can be set to multiple hostnames separated by commas. +Hostnames (or IP addresses) can be followed by a colon and a port number. If a port +number is not present, the module will use the port number defined in *zabbix_port*. + +For example: + +:: + + ceph zabbix config-set zabbix_host "zabbix1,zabbix2:2222,zabbix3:3333" + + +Manually sending data +--------------------- +If needed, the module can be asked to send data immediately instead of waiting for +the interval. + +This can be done with this command: + +:: + + ceph zabbix send + +The module will now send its latest data to the Zabbix server. + +Debugging +--------- + +Should you want to debug the Zabbix module, increase the logging level for +ceph-mgr and check the logs. + +:: + + [mgr] + debug mgr = 20 + +With logging set to debug for the manager, the module will print various logging +lines prefixed with *mgr[zabbix]* for easy filtering.
\ No newline at end of file
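
The multiple-server ``zabbix_host`` format described in the Zabbix module documentation above (comma-separated hosts, each optionally followed by a colon and a port, falling back to *zabbix_port* otherwise) can be illustrated with a short Python sketch. This is not the module's actual implementation: the function name ``parse_zabbix_hosts`` is hypothetical, and the sketch assumes hostnames or IPv4 addresses only, so IPv6 literals that contain colons are not handled.

```python
# Documented default for the zabbix_port configuration key.
DEFAULT_ZABBIX_PORT = 10051


def parse_zabbix_hosts(value, default_port=DEFAULT_ZABBIX_PORT):
    """Split a comma-separated zabbix_host value into (host, port) pairs.

    Hypothetical helper for illustration only; entries without an
    explicit port fall back to default_port.
    """
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            # Ignore empty entries such as a trailing comma.
            continue
        host, sep, port = entry.rpartition(":")
        if sep:
            # "host:port" form: use the explicit port.
            servers.append((host, int(port)))
        else:
            # Bare "host" form: use the configured default port.
            servers.append((entry, default_port))
    return servers


if __name__ == "__main__":
    for host, port in parse_zabbix_hosts("zabbix1,zabbix2:2222,zabbix3:3333"):
        print(host, port)
```

With the example value from the documentation, ``zabbix1`` gets the default port, while ``zabbix2`` and ``zabbix3`` keep their explicit ports.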