Diffstat
35 files changed, 6125 insertions, 0 deletions
diff --git a/doc/mgr/administrator.rst b/doc/mgr/administrator.rst new file mode 100644 index 000000000..d59b013aa --- /dev/null +++ b/doc/mgr/administrator.rst @@ -0,0 +1,178 @@ +.. _mgr-administrator-guide: + +ceph-mgr administrator's guide +============================== + +Manual setup +------------ + +Usually, you would set up a ceph-mgr daemon using a tool such +as ceph-ansible. These instructions describe how to set up +a ceph-mgr daemon manually. + +First, create an authentication key for your daemon:: + + ceph auth get-or-create mgr.$name mon 'allow profile mgr' osd 'allow *' mds 'allow *' + +Place that key as file named ``keyring`` into ``mgr data`` path, which for a cluster "ceph" +and mgr $name "foo" would be ``/var/lib/ceph/mgr/ceph-foo`` respective ``/var/lib/ceph/mgr/ceph-foo/keyring``. + +Start the ceph-mgr daemon:: + + ceph-mgr -i $name + +Check that the mgr has come up by looking at the output +of ``ceph status``, which should now include a mgr status line:: + + mgr active: $name + +Client authentication +--------------------- + +The manager is a new daemon which requires new CephX capabilities. If you upgrade +a cluster from an old version of Ceph, or use the default install/deploy tools, +your admin client should get this capability automatically. If you use tooling from +elsewhere, you may get EACCES errors when invoking certain ceph cluster commands. +To fix that, add a "mgr allow \*" stanza to your client's cephx capabilities by +`Modifying User Capabilities`_. + +High availability +----------------- + +In general, you should set up a ceph-mgr on each of the hosts +running a ceph-mon daemon to achieve the same level of availability. + +By default, whichever ceph-mgr instance comes up first will be made +active by the monitors, and the others will be standbys. There is +no requirement for quorum among the ceph-mgr daemons. + +If the active daemon fails to send a beacon to the monitors for +more than :confval:`mon_mgr_beacon_grace`, then it will be replaced +by a standby. + +If you want to preempt failover, you can explicitly mark a ceph-mgr +daemon as failed using ``ceph mgr fail <mgr name>``. + +Performance and Scalability +--------------------------- + +All the mgr modules share a cache that can be enabled with +``ceph config set mgr mgr_ttl_cache_expire_seconds <seconds>``, where seconds +is the time to live of the cached python objects. + +It is recommended to enable the cache with a 10 seconds TTL when there are 500+ +osds or 10k+ pgs as internal structures might increase in size, and cause latency +issues when requesting large structures. As an example, an OSDMap with 1000 osds +has a approximate size of 4MiB. With heavy load, on a 3000 osd cluster there has +been a 1.5x improvement enabling the cache. + +Furthermore, you can run ``ceph daemon mgr.${MGRNAME} perf dump`` to retrieve perf +counters of a mgr module. In ``mgr.cache_hit`` and ``mgr.cache_miss`` you'll find the +hit/miss ratio of the mgr cache. + +Using modules +------------- + +Use the command ``ceph mgr module ls`` to see which modules are +available, and which are currently enabled. Use ``ceph mgr module ls --format=json-pretty`` +to view detailed metadata about disabled modules. Enable or disable modules +using the commands ``ceph mgr module enable <module>`` and +``ceph mgr module disable <module>`` respectively. + +If a module is *enabled* then the active ceph-mgr daemon will load +and execute it. 
In the case of modules that provide a service, +such as an HTTP server, the module may publish its address when it +is loaded. To see the addresses of such modules, use the command +``ceph mgr services``. + +Some modules may also implement a special standby mode which runs on +standby ceph-mgr daemons as well as the active daemon. This enables +modules that provide services to redirect their clients to the active +daemon, if the client tries to connect to a standby. + +Consult the documentation pages for individual manager modules for more +information about what functionality each module provides. + +Here is an example of enabling the :term:`Dashboard` module: + +.. code-block:: console + + $ ceph mgr module ls + { + "enabled_modules": [ + "restful", + "status" + ], + "disabled_modules": [ + "dashboard" + ] + } + + $ ceph mgr module enable dashboard + $ ceph mgr module ls + { + "enabled_modules": [ + "restful", + "status", + "dashboard" + ], + "disabled_modules": [ + ] + } + + $ ceph mgr services + { + "dashboard": "http://myserver.com:7789/", + "restful": "https://myserver.com:8789/" + } + + +The first time the cluster starts, it uses the :confval:`mgr_initial_modules` +setting to override which modules to enable. However, this setting +is ignored through the rest of the lifetime of the cluster: only +use it for bootstrapping. For example, before starting your +monitor daemons for the first time, you might add a section like +this to your ``ceph.conf``: + +.. code-block:: ini + + [mon] + mgr_initial_modules = dashboard balancer + +Module Pool +----------- + +The manager creates a pool for use by its module to store state. The name of +this pool is ``.mgr`` (with the leading ``.`` indicating a reserved pool +name). + +.. note:: + + Prior to Quincy, the ``devicehealth`` module created a + ``device_health_metrics`` pool to store device SMART statistics. With + Quincy, this pool is automatically renamed to be the common manager module + pool. + + +Calling module commands +----------------------- + +Where a module implements command line hooks, the commands will +be accessible as ordinary Ceph commands. Ceph will automatically incorporate +module commands into the standard CLI interface and route them appropriately to +the module.:: + + ceph <command | help> + +Configuration +------------- + +.. confval:: mgr_module_path +.. confval:: mgr_initial_modules +.. confval:: mgr_disabled_modules +.. confval:: mgr_standby_modules +.. confval:: mgr_data +.. confval:: mgr_tick_period +.. confval:: mon_mgr_beacon_grace + +.. _Modifying User Capabilities: ../../rados/operations/user-management/#modify-user-capabilities diff --git a/doc/mgr/alerts.rst b/doc/mgr/alerts.rst new file mode 100644 index 000000000..319d9d927 --- /dev/null +++ b/doc/mgr/alerts.rst @@ -0,0 +1,58 @@ +Alerts module +============= + +The alerts module can send simple alert messages about cluster health +via e-mail. In the future, it will support other notification methods +as well. + +:note: This module is *not* intended to be a robust monitoring + solution. The fact that it is run as part of the Ceph cluster + itself is fundamentally limiting in that a failure of the + ceph-mgr daemon prevents alerts from being sent. This module + can, however, be useful for standalone clusters that exist in + environments where existing monitoring infrastructure does not + exist. 
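Putting the settings from the sections below together, a minimal working configuration might look like the following sketch (the SMTP host and e-mail addresses are placeholders, not defaults)::

    ceph mgr module enable alerts
    ceph config set mgr mgr/alerts/smtp_host smtp.example.com
    ceph config set mgr mgr/alerts/smtp_destination ops@example.com
    ceph config set mgr mgr/alerts/smtp_sender ceph-alerts@example.com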
+ +Enabling +-------- + +The *alerts* module is enabled with:: + + ceph mgr module enable alerts + +Configuration +------------- + +To configure SMTP, all of the following config options must be set:: + + ceph config set mgr mgr/alerts/smtp_host *<smtp-server>* + ceph config set mgr mgr/alerts/smtp_destination *<email-address-to-send-to>* + ceph config set mgr mgr/alerts/smtp_sender *<from-email-address>* + +By default, the module will use SSL and port 465. To change that,:: + + ceph config set mgr mgr/alerts/smtp_ssl false # if not SSL + ceph config set mgr mgr/alerts/smtp_port *<port-number>* # if not 465 + +To authenticate to the SMTP server, you must set the user and password:: + + ceph config set mgr mgr/alerts/smtp_user *<username>* + ceph config set mgr mgr/alerts/smtp_password *<password>* + +By default, the name in the ``From:`` line is simply ``Ceph``. To +change that (e.g., to identify which cluster this is),:: + + ceph config set mgr mgr/alerts/smtp_from_name 'Ceph Cluster Foo' + +By default, the module will check the cluster health once per minute +and, if there is a change, send a message. To change that +frequency,:: + + ceph config set mgr mgr/alerts/interval *<interval>* # e.g., "5m" for 5 minutes + +Commands +-------- + +To force an alert to be send immediately,:: + + ceph alerts send diff --git a/doc/mgr/capacity-card.png b/doc/mgr/capacity-card.png Binary files differnew file mode 100644 index 000000000..59a70348c --- /dev/null +++ b/doc/mgr/capacity-card.png diff --git a/doc/mgr/ceph_api/index.rst b/doc/mgr/ceph_api/index.rst new file mode 100644 index 000000000..1cdc9a97b --- /dev/null +++ b/doc/mgr/ceph_api/index.rst @@ -0,0 +1,92 @@ +.. _mgr ceph api: + +================ +Ceph RESTful API +================ + +Introduction +============ +The **Ceph RESTful API** (henceforth **Ceph API**) is provided by the +:ref:`mgr-dashboard` module. The Ceph API +service is available at the same URL as the regular Ceph Dashboard, under the +``/api`` base path (please refer to :ref:`dashboard-host-name-and-port`):: + + http://<server_addr>:<server_port>/api + +or, if HTTPS is enabled (please refer to :ref:`dashboard-ssl-tls-support`):: + + https://<server_addr>:<ssl_server_port>/api + +The Ceph API leverages the following standards: + +* `HTTP 1.1 <https://tools.ietf.org/html/rfc7231>`_ for API syntax and semantics, +* `JSON <https://tools.ietf.org/html/rfc8259>`_ for content encoding, +* `HTTP Content Negotiation <https://tools.ietf.org/html/rfc2295>`_ and `MIME <https://tools.ietf.org/html/rfc2045>`_ for versioning, +* `OAuth 2.0 <https://tools.ietf.org/html/rfc6750>`_ and `JWT <https://tools.ietf.org/html/rfc7519>`_ for authentication and authorization. + +.. warning:: + Some endpoints are still under active development, and should be carefully + used since new Ceph releases could bring backward incompatible changes. + + +Authentication and Authorization +================================ + +Requests to the Ceph API pass through two access control checkpoints: + +* **Authentication**: ensures that the request is performed on behalf of an existing and valid user account. +* **Authorization**: ensures that the previously authenticated user can in fact perform a specific action (create, read, update or delete) on the target endpoint. + +So, prior to start consuming the Ceph API, a valid JSON Web Token (JWT) has to +be obtained, and it may then be reused for subsequent requests. The +``/api/auth`` endpoint will provide the valid token: + +.. 
prompt:: bash $ + + curl -X POST "https://example.com:8443/api/auth" \ + -H "Accept: application/vnd.ceph.api.v1.0+json" \ + -H "Content-Type: application/json" \ + -d '{"username": <username>, "password": <password>}' + +:: + + { "token": "<redacted_token>", ...} + +The token obtained must be passed together with every API request in the +``Authorization`` HTTP header:: + + curl -H "Authorization: Bearer <token>" ... + +Authentication and authorization can be further configured from the +Ceph CLI, the Ceph-Dashboard UI and the Ceph API itself (please refer to +:ref:`dashboard-user-role-management`). + +Versioning +========== + +One of the main goals of the Ceph API is to keep a stable interface. For this +purpose, Ceph API is built upon the following principles: + +* **Mandatory**: in order to avoid implicit defaults, all endpoints require an explicit default version (starting with ``1.0``). +* **Per-endpoint**: as this API wraps many different Ceph components, this allows for a finer-grained change control. + * **Content/MIME Type**: the version expected from a specific endpoint is stated by the ``Accept: application/vnd.ceph.api.v<major>.<minor>+json`` HTTP header. If the current Ceph API server is not able to address that specific major version, a `415 - Unsupported Media Type <https://tools.ietf.org/html/rfc7231#section-6.5.13>`_ response will be returned. +* **Semantic Versioning**: with a ``major.minor`` version: + * Major changes are backward incompatible: they might result in non-additive changes to the request and/or response formats of a specific endpoint. + * Minor changes are backward/forward compatible: they basically consists of additive changes to the request or response formats of a specific endpoint. + +An example: + +.. prompt:: bash $ + + curl -X GET "https://example.com:8443/api/osd" \ + -H "Accept: application/vnd.ceph.api.v1.0+json" \ + -H "Authorization: Bearer <token>" + + +Specification +============= + +.. openapi:: ../../../src/pybind/mgr/dashboard/openapi.yaml + :group: + :examples: + :encoding: utf-8 diff --git a/doc/mgr/cli_api.rst b/doc/mgr/cli_api.rst new file mode 100644 index 000000000..81a99ae44 --- /dev/null +++ b/doc/mgr/cli_api.rst @@ -0,0 +1,39 @@ +CLI API Commands Module +======================= + +The CLI API module exposes most ceph-mgr python API via CLI. Furthermore, this API can be +benchmarked for further testing. 
+ +Enabling +-------- + +The *cli api commands* module is enabled with:: + + ceph mgr module enable cli_api + +To check that it is enabled, run:: + + ceph mgr module ls | grep cli_api + +Usage +-------- + +To run a mgr module command, run:: + + ceph mgr cli <command> <param> + +For example, use the following command to print the list of servers:: + + ceph mgr cli list_servers + +List all available mgr module commands with:: + + ceph mgr cli --help + +To benchmark a command, run:: + + ceph mgr cli_benchmark <number of calls> <number of threads> <command> <param> + +For example, use the following command to benchmark the command to get osd_map:: + + ceph mgr cli_benchmark 100 10 get osd_map diff --git a/doc/mgr/cluster-utilization-card.png b/doc/mgr/cluster-utilization-card.png Binary files differnew file mode 100644 index 000000000..fc0fd9ed1 --- /dev/null +++ b/doc/mgr/cluster-utilization-card.png diff --git a/doc/mgr/crash.rst b/doc/mgr/crash.rst new file mode 100644 index 000000000..656c7a2d4 --- /dev/null +++ b/doc/mgr/crash.rst @@ -0,0 +1,103 @@ +Crash Module +============ +The crash module collects information about daemon crashdumps and stores +it in the Ceph cluster for later analysis. + +Enabling +-------- + +The *crash* module is enabled with:: + + ceph mgr module enable crash + +The *crash* upload key is generated with:: + + ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash' + +On each node, you should store this key in +``/etc/ceph/ceph.client.crash.keyring``. + + +Automated collection +-------------------- + +Daemon crashdumps are dumped in ``/var/lib/ceph/crash`` by default; this can +be configured with the option 'crash dir'. Crash directories are named by +time and date and a randomly-generated UUID, and contain a metadata file +'meta' and a recent log file, with a "crash_id" that is the same. + +These crashes can be automatically submitted and persisted in the monitors' +storage by using ``ceph-crash.service``. +It watches the crashdump directory and uploads them with ``ceph crash post``. + +``ceph-crash`` tries some authentication names: ``client.crash.$hostname``, +``client.crash`` and ``client.admin``. +In order to successfully upload with ``ceph crash post``, these need +the suitable permissions: ``mon profile crash`` and ``mgr profile crash`` +and a keyring needs to be in ``/etc/ceph``. + + +Commands +-------- +:: + + ceph crash post -i <metafile> + +Save a crash dump. The metadata file is a JSON blob stored in the crash +dir as ``meta``. As usual, the ceph command can be invoked with ``-i -``, +and will read from stdin. + +:: + + ceph crash rm <crashid> + +Remove a specific crash dump. + +:: + + ceph crash ls + +List the timestamp/uuid crashids for all new and archived crash info. + +:: + + ceph crash ls-new + +List the timestamp/uuid crashids for all newcrash info. + +:: + + ceph crash stat + +Show a summary of saved crash info grouped by age. + +:: + + ceph crash info <crashid> + +Show all details of a saved crash. + +:: + + ceph crash prune <keep> + +Remove saved crashes older than 'keep' days. <keep> must be an integer. + +:: + + ceph crash archive <crashid> + +Archive a crash report so that it is no longer considered for the ``RECENT_CRASH`` health check and does not appear in the ``crash ls-new`` output (it will still appear in the ``crash ls`` output). + +:: + + ceph crash archive-all + +Archive all new crash reports. 
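As an illustration, a typical triage session built from the commands above might look like this (``<crashid>`` stands for one of the timestamp/uuid identifiers printed by ``ceph crash ls-new``)::

    # list crashes that have not been archived yet
    ceph crash ls-new
    # inspect one report in detail
    ceph crash info <crashid>
    # acknowledge it so it no longer triggers the RECENT_CRASH health check
    ceph crash archive <crashid>
    # or acknowledge every new report at once
    ceph crash archive-all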
+ + +Options +------- + +* ``mgr/crash/warn_recent_interval`` [default: 2 weeks] controls what constitutes "recent" for the purposes of raising the ``RECENT_CRASH`` health warning. +* ``mgr/crash/retain_interval`` [default: 1 year] controls how long crash reports are retained by the cluster before they are automatically purged. diff --git a/doc/mgr/dashboard-landing-page.png b/doc/mgr/dashboard-landing-page.png Binary files differnew file mode 100644 index 000000000..77a1fe12b --- /dev/null +++ b/doc/mgr/dashboard-landing-page.png diff --git a/doc/mgr/dashboard.rst b/doc/mgr/dashboard.rst new file mode 100644 index 000000000..696676aeb --- /dev/null +++ b/doc/mgr/dashboard.rst @@ -0,0 +1,1655 @@ +.. _mgr-dashboard: + +Ceph Dashboard +============== + +Overview +-------- + +The Ceph Dashboard is a built-in web-based Ceph management and monitoring +application through which you can inspect and administer various aspects +and resources within the cluster. It is implemented as a :ref:`ceph-manager-daemon` module. + +The original Ceph Dashboard that was shipped with Ceph Luminous started +out as a simple read-only view into run-time information and performance +data of Ceph clusters. It used a very simple architecture to achieve the +original goal. However, there was growing demand for richer web-based +management capabilities, to make it easier to administer Ceph for users that +prefer a WebUI over the CLI. + +The new :term:`Ceph Dashboard` module adds web-based monitoring and +administration to the Ceph Manager. The architecture and functionality of this new +module are derived from +and inspired by the `openATTIC Ceph management and monitoring tool +<https://openattic.org/>`_. Development is actively driven by the +openATTIC team at `SUSE <https://www.suse.com/>`_, with support from +companies including `Red Hat <https://redhat.com/>`_ and members of the Ceph +community. + +The dashboard module's backend code uses the CherryPy framework and implements +a custom REST API. The WebUI implementation is based on +Angular/TypeScript and includes both functionality from the original dashboard +and new features originally developed for the standalone version +of openATTIC. The Ceph Dashboard module is implemented as an +application that provides a graphical representation of information and statistics +through a web server hosted by ``ceph-mgr``. + +Feature Overview +^^^^^^^^^^^^^^^^ + +The dashboard provides the following features: + +* **Multi-User and Role Management**: The dashboard supports multiple user + accounts with different permissions (roles). User accounts and roles + can be managed via both the command line and the WebUI. The dashboard + supports various methods to enhance password security. Password + complexity rules may be configured, requiring users to change their password + after the first login or after a configurable time period. See + :ref:`dashboard-user-role-management` for details. +* **Single Sign-On (SSO)**: The dashboard supports authentication + via an external identity provider using the SAML 2.0 protocol. See + :ref:`dashboard-sso-support` for details. +* **SSL/TLS support**: All HTTP communication between the web browser and the + dashboard is secured via SSL. A self-signed certificate can be created with + a built-in command, but it's also possible to import custom certificates + signed and issued by a CA. See :ref:`dashboard-ssl-tls-support` for details. 
+* **Auditing**: The dashboard backend can be configured to log all ``PUT``, ``POST`` + and ``DELETE`` API requests in the Ceph audit log. See :ref:`dashboard-auditing` + for instructions on how to enable this feature. +* **Internationalization (I18N)**: The language used for dashboard text can be + selected at run-time. + +The Ceph Dashboard offers the following monitoring and management capabilities: + +* **Overall cluster health**: Display performance and capacity metrics as well + as cluster status. +* **Embedded Grafana Dashboards**: Ceph Dashboard + `Grafana`_ dashboards may be embedded in external applications and web pages + to surface information and performance metrics gathered by + the :ref:`mgr-prometheus` module. See + :ref:`dashboard-grafana` for details on how to configure this functionality. +* **Cluster logs**: Display the latest updates to the cluster's event and + audit log files. Log entries can be filtered by priority, date or keyword. +* **Hosts**: Display a list of all cluster hosts along with their + storage drives, which services are running, and which version of Ceph is + installed. +* **Performance counters**: Display detailed service-specific statistics for + each running service. +* **Monitors**: List all Mons, their quorum status, and open sessions. +* **Monitoring**: Enable creation, re-creation, editing, and expiration of + Prometheus' silences, list the alerting configuration and all + configured and firing alerts. Show notifications for firing alerts. +* **Configuration Editor**: Display all available configuration options, + their descriptions, types, default and currently set values. These may be edited as well. +* **Pools**: List Ceph pools and their details (e.g. applications, + pg-autoscaling, placement groups, replication size, EC profile, CRUSH + rules, quotas etc.) +* **OSDs**: List OSDs, their status and usage statistics as well as + detailed information like attributes (OSD map), metadata, performance + counters and usage histograms for read/write operations. Mark OSDs + up/down/out, purge and reweight OSDs, perform scrub operations, modify + various scrub-related configuration options, select profiles to + adjust the level of backfilling activity. List all drives associated with an + OSD. Set and change the device class of an OSD, display and sort OSDs by + device class. Deploy OSDs on new drives and hosts. +* **Device management**: List all hosts known by the orchestrator. List all + drives attached to a host and their properties. Display drive + health predictions and SMART data. Blink enclosure LEDs. +* **iSCSI**: List all hosts that run the TCMU runner service, display all + images and their performance characteristics (read/write ops, traffic). + Create, modify, and delete iSCSI targets (via ``ceph-iscsi``). Display the + iSCSI gateway status and info about active initiators. + See :ref:`dashboard-iscsi-management` for instructions on how to configure + this feature. +* **RBD**: List all RBD images and their properties (size, objects, features). + Create, copy, modify and delete RBD images (incl. snapshots) and manage RBD + namespaces. Define various I/O or bandwidth limitation settings on a global, + per-pool or per-image level. Create, delete and rollback snapshots of selected + images, protect/unprotect these snapshots against modification. Copy or clone + snapshots, flatten cloned images. +* **RBD mirroring**: Enable and configure RBD mirroring to a remote Ceph server. 
+ List active daemons and their status, pools and RBD images including + sync progress. +* **CephFS**: List active file system clients and associated pools, + including usage statistics. Evict active CephFS clients. Manage CephFS + quotas and snapshots. Browse a CephFS directory structure. +* **Object Gateway**: List all active object gateways and their performance + counters. Display and manage (add/edit/delete) object gateway users and their + details (e.g. quotas) as well as the users' buckets and their details (e.g. + placement targets, owner, quotas, versioning, multi-factor authentication). + See :ref:`dashboard-enabling-object-gateway` for configuration instructions. +* **NFS**: Manage NFS exports of CephFS file systems and RGW S3 buckets via NFS + Ganesha. See :ref:`dashboard-nfs-ganesha-management` for details on how to + enable this functionality. +* **Ceph Manager Modules**: Enable and disable Ceph Manager modules, manage + module-specific configuration settings. + +Overview of the Dashboard Landing Page +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The landing page of Ceph Dashboard serves as the home page and features metrics +such as the overall cluster status, performance, and capacity. It provides real-time +updates on any changes in the cluster and allows quick access to other sections of the dashboard. + +.. image:: dashboard-landing-page.png + + +.. note:: + You can change the landing page to the previous version from: + ``Cluster >> Manager Modules >> Dashboard >> Edit``. + Editing the ``FEATURE_TOGGLE_DASHBOARD`` option will change the landing page, from one view to another. + + Note that the previous version of the landing page will be disabled in future releases. + +.. _dashboard-landing-page-details: + +Details +""""""" +Provides an overview of the cluster configuration, displaying various critical aspects of the cluster. + +.. image:: details-card.png + +.. _dashboard-landing-page-status: + +Status +"""""" +Provides a visual indication of cluster health, and displays cluster alerts grouped by severity. + +.. image:: status-card-open.png + +.. _dashboard-landing-page-capacity: + +Capacity +"""""""" +* **Used**: Displays the used capacity out of the total physical capacity provided by storage nodes (OSDs) +* **Warning**: Displays the `nearfull` threshold of the OSDs +* **Danger**: Displays the `full` threshold of the OSDs + +.. image:: capacity-card.png + +.. _dashboard-landing-page-inventory: + +Inventory +""""""""" +An inventory for all assets within the cluster. +Provides direct access to subpages of the dashboard from each item of this card. + +.. image:: inventory-card.png + +.. _dashboard-landing-page-performance: + +Cluster Utilization +""""""""""""""""""" +* **Used Capacity**: Total capacity used of the cluster. The maximum value of the chart is the maximum capacity of the cluster. +* **IOPS (Input/Output Operations Per Second)**: Number of read and write operations. +* **Latency**: Amount of time that it takes to process a read or a write request. +* **Client Throughput**: Amount of data that clients read or write to the cluster. +* **Recovery Throughput**: Amount of recovery data that clients read or write to the cluster. + + +.. 
image:: cluster-utilization-card.png + +Supported Browsers +^^^^^^^^^^^^^^^^^^ + +Ceph Dashboard is primarily tested and developed using the following web +browsers: + ++---------------------------------------------------------------+---------------------------------------+ +| Browser | Versions | ++===============================================================+=======================================+ +| `Chrome <https://www.google.com/chrome/>`_ and | latest 2 major versions | +| `Chromium <https://www.chromium.org/>`_ based browsers | | ++---------------------------------------------------------------+---------------------------------------+ +| `Firefox <https://www.mozilla.org/firefox/>`_ | latest 2 major versions | ++---------------------------------------------------------------+---------------------------------------+ +| `Firefox ESR <https://www.mozilla.org/firefox/enterprise/>`_ | latest major version | ++---------------------------------------------------------------+---------------------------------------+ + +While Ceph Dashboard might work in older browsers, we cannot guarantee compatibility and +recommend keeping your browser up to date. + +Enabling +-------- + +If you have installed ``ceph-mgr-dashboard`` from distribution packages, the +package management system should take care of installing all required +dependencies. + +If you're building Ceph from source and want to start the dashboard from your +development environment, please see the files ``README.rst`` and ``HACKING.rst`` +in the source directory ``src/pybind/mgr/dashboard``. + +Within a running Ceph cluster, the Ceph Dashboard is enabled with: + +.. prompt:: bash $ + + ceph mgr module enable dashboard + +Configuration +------------- + +.. _dashboard-ssl-tls-support: + +SSL/TLS Support +^^^^^^^^^^^^^^^ + +All HTTP connections to the dashboard are secured with SSL/TLS by default. + +To get the dashboard up and running quickly, you can generate and install a +self-signed certificate: + +.. prompt:: bash $ + + ceph dashboard create-self-signed-cert + +Note that most web browsers will complain about self-signed certificates +and require explicit confirmation before establishing a secure connection to the +dashboard. + +To properly secure a deployment and to remove the warning, a +certificate that is issued by a certificate authority (CA) should be used. + +For example, a key pair can be generated with a command similar to: + +.. prompt:: bash $ + + openssl req -new -nodes -x509 \ + -subj "/O=IT/CN=ceph-mgr-dashboard" -days 3650 \ + -keyout dashboard.key -out dashboard.crt -extensions v3_ca + +The ``dashboard.crt`` file should then be signed by a CA. Once that is done, you +can enable it for Ceph manager instances by running the following commands: + +.. prompt:: bash $ + + ceph dashboard set-ssl-certificate -i dashboard.crt + ceph dashboard set-ssl-certificate-key -i dashboard.key + +If unique certificates are desired for each manager instance, +the name of the instance can be included as follows (where ``$name`` is the name +of the ``ceph-mgr`` instance, usually the hostname): + +.. prompt:: bash $ + + ceph dashboard set-ssl-certificate $name -i dashboard.crt + ceph dashboard set-ssl-certificate-key $name -i dashboard.key + +SSL can also be disabled by setting this configuration value: + +.. 
prompt:: bash $ + + ceph config set mgr mgr/dashboard/ssl false + +This might be useful if the dashboard will be running behind a proxy which does +not support SSL for its upstream servers or other situations where SSL is not +wanted or required. See :ref:`dashboard-proxy-configuration` for more details. + +.. warning:: + + Use caution when disabling SSL as usernames and passwords will be sent to the + dashboard unencrypted. + + +.. note:: + + You must restart Ceph manager processes after changing the SSL + certificate and key. This can be accomplished by either running ``ceph mgr + fail mgr`` or by disabling and re-enabling the dashboard module (which also + triggers the manager to respawn itself): + + .. prompt:: bash $ + + ceph mgr module disable dashboard + ceph mgr module enable dashboard + +.. _dashboard-host-name-and-port: + +Host Name and Port +^^^^^^^^^^^^^^^^^^ + +Like most web applications, the dashboard binds to a TCP/IP address and TCP port. + +By default, the ``ceph-mgr`` daemon hosting the dashboard (i.e., the currently +active manager) will bind to TCP port 8443 or 8080 when SSL is disabled. + +If no specific address has been configured, the web app will bind to ``::``, +which corresponds to all available IPv4 and IPv6 addresses. + +These defaults can be changed via the configuration key facility on a +cluster-wide level (so they apply to all manager instances) as follows: + +.. prompt:: bash $ + + ceph config set mgr mgr/dashboard/server_addr $IP + ceph config set mgr mgr/dashboard/server_port $PORT + ceph config set mgr mgr/dashboard/ssl_server_port $PORT + +Since each ``ceph-mgr`` hosts its own instance of the dashboard, it may be +necessary to configure them separately. The IP address and port for a specific +manager instance can be changed with the following commands: + +.. prompt:: bash $ + + ceph config set mgr mgr/dashboard/$name/server_addr $IP + ceph config set mgr mgr/dashboard/$name/server_port $PORT + ceph config set mgr mgr/dashboard/$name/ssl_server_port $PORT + +Replace ``$name`` with the ID of the ceph-mgr instance hosting the dashboard. + +.. note:: + + The command ``ceph mgr services`` will show you all endpoints that are + currently configured. Look for the ``dashboard`` key to obtain the URL for + accessing the dashboard. + +Username and Password +^^^^^^^^^^^^^^^^^^^^^ + +In order to be able to log in, you need to create a user account and associate +it with at least one role. We provide a set of predefined *system roles* that +you can use. For more details please refer to the `User and Role Management`_ +section. + +To create a user with the administrator role you can use the following +commands: + +.. prompt:: bash $ + + ceph dashboard ac-user-create <username> -i <file-containing-password> administrator + +Account Lock-out +^^^^^^^^^^^^^^^^ + +It disables a user account if a user repeatedly enters the wrong credentials +for multiple times. It is enabled by default to prevent brute-force or dictionary +attacks. The user can get or set the default number of lock-out attempts using +these commands respectively: + +.. prompt:: bash $ + + ceph dashboard get-account-lockout-attempts + ceph dashboard set-account-lockout-attempts <value:int> + +.. warning:: + + This feature can be disabled by setting the default number of lock-out attempts to 0. + However, by disabling this feature, the account is more vulnerable to brute-force or + dictionary based attacks. This can be disabled by: + + .. 
prompt:: bash $ + + ceph dashboard set-account-lockout-attempts 0 + +Enable a Locked User +^^^^^^^^^^^^^^^^^^^^ + +If a user account is disabled as a result of multiple invalid login attempts, then +it needs to be manually enabled by the administrator. This can be done by the following +command: + +.. prompt:: bash $ + + ceph dashboard ac-user-enable <username> + +Accessing the Dashboard +^^^^^^^^^^^^^^^^^^^^^^^ + +You can now access the dashboard using your (JavaScript-enabled) web browser, by +pointing it to any of the host names or IP addresses and the selected TCP port +where a manager instance is running: e.g., ``http(s)://<$IP>:<$PORT>/``. + +The dashboard page displays and requests a previously defined username and +password. + +.. _dashboard-enabling-object-gateway: + +Enabling the Object Gateway Management Frontend +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When RGW is deployed with cephadm, the RGW credentials used by the +dashboard will be automatically configured. You can also manually force the +credentials to be set up with: + +.. prompt:: bash $ + + ceph dashboard set-rgw-credentials + +This will create an RGW user with uid ``dashboard`` for each realm in +the system. + +If you've configured a custom 'admin' resource in your RGW admin API, you should set it here also: + +.. prompt:: bash $ + + ceph dashboard set-rgw-api-admin-resource <admin_resource> + +If you are using a self-signed certificate in your Object Gateway setup, +you should disable certificate verification in the dashboard to avoid refused +connections, e.g. caused by certificates signed by unknown CA or not matching +the host name: + +.. prompt:: bash $ + + ceph dashboard set-rgw-api-ssl-verify False + +If the Object Gateway takes too long to process requests and the dashboard runs +into timeouts, you can set the timeout value to your needs: + +.. prompt:: bash $ + + ceph dashboard set-rest-requests-timeout <seconds> + +The default value is 45 seconds. + +.. _dashboard-iscsi-management: + +Enabling iSCSI Management +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The Ceph Dashboard can manage iSCSI targets using the REST API provided by the +``rbd-target-api`` service of the :ref:`ceph-iscsi`. Please make sure that it is +installed and enabled on the iSCSI gateways. + +.. note:: + + The iSCSI management functionality of Ceph Dashboard depends on the latest + version 3 of the `ceph-iscsi <https://github.com/ceph/ceph-iscsi>`_ project. + Make sure that your operating system provides the correct version, otherwise + the dashboard will not enable the management features. + +If the ``ceph-iscsi`` REST API is configured in HTTPS mode and its using a self-signed +certificate, you need to configure the dashboard to avoid SSL certificate +verification when accessing ceph-iscsi API. + +To disable API SSL verification run the following command: + +.. prompt:: bash $ + + ceph dashboard set-iscsi-api-ssl-verification false + +The available iSCSI gateways must be defined using the following commands: + +.. prompt:: bash $ + + ceph dashboard iscsi-gateway-list + # Gateway URL format for a new gateway: <scheme>://<username>:<password>@<host>[:port] + ceph dashboard iscsi-gateway-add -i <file-containing-gateway-url> [<gateway_name>] + ceph dashboard iscsi-gateway-rm <gateway_name> + + +.. _dashboard-grafana: + +Enabling the Embedding of Grafana Dashboards +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +`Grafana`_ pulls data from `Prometheus <https://prometheus.io/>`_. 
Although +Grafana can use other data sources, the Grafana dashboards we provide contain +queries that are specific to Prometheus. Our Grafana dashboards therefore +require Prometheus as the data source. The Ceph :ref:`mgr-prometheus` +module exports its data in the Prometheus exposition format. These Grafana +dashboards rely on metric names from the Prometheus module and `Node exporter +<https://prometheus.io/docs/guides/node-exporter/>`_. The Node exporter is a +separate application that provides machine metrics. + +.. note:: + + Prometheus' security model presumes that untrusted users have access to the + Prometheus HTTP endpoint and logs. Untrusted users have access to all the + (meta)data Prometheus collects that is contained in the database, plus a + variety of operational and debugging information. + + However, Prometheus' HTTP API is limited to read-only operations. + Configurations can *not* be changed using the API and secrets are not + exposed. Moreover, Prometheus has some built-in measures to mitigate the + impact of denial of service attacks. + + Please see `Prometheus' Security model + <https://prometheus.io/docs/operating/security/>` for more detailed + information. + +Installation and Configuration using cephadm +"""""""""""""""""""""""""""""""""""""""""""" + +Grafana and Prometheus can be installed using :ref:`cephadm`. They will +automatically be configured by ``cephadm``. Please see +:ref:`mgr-cephadm-monitoring` documentation for more details on how to use +``cephadm`` for installing and configuring Prometheus and Grafana. + +Manual Installation and Configuration +""""""""""""""""""""""""""""""""""""" + +The following process describes how to configure Grafana and Prometheus +manually. After you have installed Prometheus, Grafana, and the Node exporter +on appropriate hosts, proceed with the following steps. + +#. Enable the Ceph Exporter which comes as Ceph Manager module by running: + + .. prompt:: bash $ + + ceph mgr module enable prometheus + + More details can be found in the documentation of the :ref:`mgr-prometheus`. + +#. Add the corresponding scrape configuration to Prometheus. This may look + like:: + + global: + scrape_interval: 5s + + scrape_configs: + - job_name: 'prometheus' + static_configs: + - targets: ['localhost:9090'] + - job_name: 'ceph' + static_configs: + - targets: ['localhost:9283'] + - job_name: 'node-exporter' + static_configs: + - targets: ['localhost:9100'] + + .. note:: + + Please note that in the above example, Prometheus is configured + to scrape data from itself (port 9090), the Ceph manager module + `prometheus` (port 9283), which exports Ceph internal data, and the Node + Exporter (port 9100), which provides OS and hardware metrics for each host. + + Depending on your configuration, you may need to change the hostname in + or add additional configuration entries for the Node + Exporter. It is unlikely that you will need to change the default TCP ports. + + Moreover, you don't *need* to have more than one target for Ceph specific + data, provided by the `prometheus` mgr module. But it is recommended to + configure Prometheus to scrape Ceph specific data from all existing Ceph + managers. This enables a built-in high availability mechanism, so that + services run on a manager host will be restarted automatically on a different + manager host if one Ceph Manager goes down. + +#. Add Prometheus as data source to Grafana `using the Grafana Web UI <https://grafana.com/docs/grafana/latest/features/datasources/add-a-data-source/>`_. + + .. 
IMPORTANT:: + The data source must be named "Dashboard1". + +#. Install the `vonage-status-panel and grafana-piechart-panel` plugins using: + + .. prompt:: bash $ + + grafana-cli plugins install vonage-status-panel + grafana-cli plugins install grafana-piechart-panel + +#. Add Dashboards to Grafana: + + Dashboards can be added to Grafana by importing dashboard JSON files. + Use the following command to download the JSON files: + + .. prompt:: bash $ + + wget https://raw.githubusercontent.com/ceph/ceph/main/monitoring/ceph-mixin/dashboards_out/<Dashboard-name>.json + + You can find various dashboard JSON files `here <https://github.com/ceph/ceph/tree/ + main/monitoring/ceph-mixin/dashboards_out>`_. + + For Example, for ceph-cluster overview you can use: + + .. prompt:: bash $ + + wget https://raw.githubusercontent.com/ceph/ceph/main/monitoring/ceph-mixin/dashboards_out/ceph-cluster.json + + You may also author your own dashboards. + +#. Configure anonymous mode in ``/etc/grafana/grafana.ini``:: + + [auth.anonymous] + enabled = true + org_name = Main Org. + org_role = Viewer + + In newer versions of Grafana (starting with 6.2.0-beta1) a new setting named + ``allow_embedding`` has been introduced. This setting must be explicitly + set to ``true`` for the Grafana integration in Ceph Dashboard to work, as the + default is ``false``. + + :: + + [security] + allow_embedding = true + +Enabling RBD-Image monitoring +""""""""""""""""""""""""""""" + +Monitoring of RBD images is disabled by default, as it can significantly impact +performance. For more information please see :ref:`prometheus-rbd-io-statistics`. +When disabled, the overview and details dashboards will be empty in Grafana and +metrics will not be visible in Prometheus. + +Configuring Dashboard +""""""""""""""""""""" + +After you have set up Grafana and Prometheus, you will need to configure the +connection information that the Ceph Dashboard will use to access Grafana. + +You need to tell the dashboard on which URL the Grafana instance is +running/deployed: + +.. prompt:: bash $ + + ceph dashboard set-grafana-api-url <grafana-server-url> # default: '' + +The format of url is : `<protocol>:<IP-address>:<port>` + +.. note:: + + The Ceph Dashboard embeds Grafana dashboards via ``iframe`` HTML elements. + If Grafana is configured without SSL/TLS support, most browsers will block the + embedding of insecure content if SSL support is + enabled for the dashboard (which is the default). If you + can't see the embedded Grafana dashboards after enabling them as outlined + above, check your browser's documentation on how to unblock mixed content. + Alternatively, consider enabling SSL/TLS support in Grafana. + +If you are using a self-signed certificate for Grafana, +disable certificate verification in the dashboard to avoid refused connections, +which can be a result of certificates signed by an unknown CA or that do not +match the host name: + +.. prompt:: bash $ + + ceph dashboard set-grafana-api-ssl-verify False + +You can also access Grafana directly to monitor your cluster. + +.. note:: + + Ceph Dashboard configuration information can also be unset. For example, to + clear the Grafana API URL we configured above: + + .. prompt:: bash $ + + ceph dashboard reset-grafana-api-url + +Alternative URL for Browsers +"""""""""""""""""""""""""""" + +The Ceph Dashboard backend requires the Grafana URL to be able to verify the +existence of Grafana Dashboards before the frontend even loads them. 
Due to the +nature of how Grafana is implemented in Ceph Dashboard, this means that two +working connections are required in order to be able to see Grafana graphs in +Ceph Dashboard: + +- The backend (Ceph Mgr module) needs to verify the existence of the requested + graph. If this request succeeds, it lets the frontend know that it can safely + access Grafana. +- The frontend then requests the Grafana graphs directly from the user's + browser using an iframe. The Grafana instance is accessed directly without any + detour through Ceph Dashboard. + +Now, it might be the case that your environment makes it difficult for the +user's browser to directly access the URL configured in Ceph Dashboard. To solve +this issue, a separate URL can be configured which will solely be used to tell +the frontend (the user's browser) which URL it should use to access Grafana. +This setting won't ever be changed automatically, unlike the GRAFANA_API_URL +which is set by :ref:`cephadm` (only if cephadm is used to deploy monitoring +services). + +To change the URL that is returned to the frontend issue the following command: + +.. prompt:: bash $ + + ceph dashboard set-grafana-frontend-api-url <grafana-server-url> + +If no value is set for that option, it will simply fall back to the value of the +GRAFANA_API_URL option. If set, it will instruct the browser to use this URL to +access Grafana. + +.. _dashboard-sso-support: + +Enabling Single Sign-On (SSO) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The Ceph Dashboard supports external authentication of users via the +`SAML 2.0 <https://en.wikipedia.org/wiki/SAML_2.0>`_ protocol. You need to +first create user accounts and associate them with desired roles, as +authorization is performed by the Dashboard. However, the authentication +process can be performed by an existing Identity Provider (IdP). + +.. note:: + + Ceph Dashboard SSO support relies on onelogin's + `python-saml <https://pypi.org/project/python-saml/>`_ library. + Please ensure that this library is installed on your system, either by using + your distribution's package management or via Python's `pip` installer. + +To configure SSO on Ceph Dashboard, you should use the following command: + +.. prompt:: bash $ + + ceph dashboard sso setup saml2 <ceph_dashboard_base_url> <idp_metadata> {<idp_username_attribute>} {<idp_entity_id>} {<sp_x_509_cert>} {<sp_private_key>} + +Parameters: + +* **<ceph_dashboard_base_url>**: Base URL where Ceph Dashboard is accessible (e.g., `https://cephdashboard.local`) +* **<idp_metadata>**: URL to remote (`http://`, `https://`) or local (`file://`) path or content of the IdP metadata XML (e.g., `https://myidp/metadata`, `file:///home/myuser/metadata.xml`). +* **<idp_username_attribute>** *(optional)*: Attribute that should be used to get the username from the authentication response. Defaults to `uid`. +* **<idp_entity_id>** *(optional)*: Use this when more than one entity id exists on the IdP metadata. +* **<sp_x_509_cert> / <sp_private_key>** *(optional)*: File path of the certificate that should be used by Ceph Dashboard (Service Provider) for signing and encryption (these file paths should be accessible from the active ceph-mgr instance). + +.. note:: + + The issuer value of SAML requests will follow this pattern: **<ceph_dashboard_base_url>**/auth/saml2/metadata + +To display the current SAML 2.0 configuration, use the following command: + +.. prompt:: bash $ + + ceph dashboard sso show saml2 + +.. 
note:: + + For more information about `onelogin_settings`, please check the `onelogin documentation <https://github.com/onelogin/python-saml>`_. + +To disable SSO: + +.. prompt:: bash $ + + ceph dashboard sso disable + +To check if SSO is enabled: + +.. prompt:: bash $ + + ceph dashboard sso status + +To enable SSO: + +.. prompt:: bash $ + + ceph dashboard sso enable saml2 + +.. _dashboard-alerting: + +Enabling Prometheus Alerting +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To use Prometheus for alerting you must define `alerting rules +<https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules>`_. +These are managed by the `Alertmanager +<https://prometheus.io/docs/alerting/alertmanager>`_. +If you are not yet using the Alertmanager, `install it +<https://github.com/prometheus/alertmanager#install>`_ as it receives +and manages alerts from Prometheus. + +Alertmanager capabilities can be consumed by the dashboard in three different +ways: + +#. Use the notification receiver of the dashboard. + +#. Use the Prometheus Alertmanager API. + +#. Use both sources simultaneously. + +All three methods notify you about alerts. You won't be notified +twice if you use both sources, but you need to consume at least the Alertmanager API +in order to manage silences. + +1. Use the notification receiver of the dashboard + + This allows you to get notifications as `configured + <https://prometheus.io/docs/alerting/configuration/>`_ from the Alertmanager. + You will get notified inside the dashboard once a notification is send out, + but you are not able to manage alerts. + + Add the dashboard receiver and the new route to your Alertmanager + configuration. This should look like:: + + route: + receiver: 'ceph-dashboard' + ... + receivers: + - name: 'ceph-dashboard' + webhook_configs: + - url: '<url-to-dashboard>/api/prometheus_receiver' + + + Ensure that the Alertmanager considers your SSL certificate in terms + of the dashboard as valid. For more information about the correct + configuration checkout the `<http_config> documentation + <https://prometheus.io/docs/alerting/configuration/#%3Chttp_config%3E>`_. + +2. Use the API of Prometheus and the Alertmanager + + This allows you to manage alerts and silences and will enable the "Active + Alerts", "All Alerts" as well as the "Silences" tabs in the "Monitoring" + section of the "Cluster" menu entry. + + Alerts can be sorted by name, job, severity, state and start time. + Unfortunately it's not possible to know when an alert was sent out through a + notification by the Alertmanager based on your configuration, that's why the + dashboard will notify the user on any visible change to an alert and will + notify the changed alert. + + Silences can be sorted by id, creator, status, start, updated and end time. + Silences can be created in various ways, it's also possible to expire them. + + #. Create from scratch + + #. Based on a selected alert + + #. Recreate from expired silence + + #. Update a silence (which will recreate and expire it (default Alertmanager behaviour)) + + To use it, specify the host and port of the Alertmanager server: + + .. prompt:: bash $ + + ceph dashboard set-alertmanager-api-host <alertmanager-host:port> # default: '' + + For example: + + .. prompt:: bash $ + + ceph dashboard set-alertmanager-api-host 'http://localhost:9093' + + To be able to see all configured alerts, you will need to configure the URL to + the Prometheus API. 
Using this API, the UI will also help you in verifying + that a new silence will match a corresponding alert. + + + .. prompt:: bash $ + + ceph dashboard set-prometheus-api-host <prometheus-host:port> # default: '' + + For example: + + .. prompt:: bash $ + + ceph dashboard set-prometheus-api-host 'http://localhost:9090' + + After setting up the hosts, refresh your browser's dashboard window or tab. + +3. Use both methods + + The behaviors of both methods are configured in a way that they + should not disturb each other, through annoying duplicated notifications + may pop up. + +If you are using a self-signed certificate in your Prometheus or your +Alertmanager setup, you should disable certificate verification in the +dashboard to avoid refused connections caused by certificates signed by +an unknown CA or that do not match the host name. + +- For Prometheus: + +.. prompt:: bash $ + + ceph dashboard set-prometheus-api-ssl-verify False + +- For Alertmanager: + +.. prompt:: bash $ + + ceph dashboard set-alertmanager-api-ssl-verify False + +.. _dashboard-user-role-management: + +User and Role Management +------------------------ + +Password Policy +^^^^^^^^^^^^^^^ + +By default the password policy feature is enabled, which includes the +following checks: + +- Is the password longer than N characters? +- Are the old and new password the same? + +The password policy feature can be switched on or off completely: + +.. prompt:: bash $ + + ceph dashboard set-pwd-policy-enabled <true|false> + +The following individual checks can also be switched on or off: + +.. prompt:: bash $ + + ceph dashboard set-pwd-policy-check-length-enabled <true|false> + ceph dashboard set-pwd-policy-check-oldpwd-enabled <true|false> + ceph dashboard set-pwd-policy-check-username-enabled <true|false> + ceph dashboard set-pwd-policy-check-exclusion-list-enabled <true|false> + ceph dashboard set-pwd-policy-check-complexity-enabled <true|false> + ceph dashboard set-pwd-policy-check-sequential-chars-enabled <true|false> + ceph dashboard set-pwd-policy-check-repetitive-chars-enabled <true|false> + +Additionally the following options are available to configure password +policy. + +- Minimum password length (defaults to 8): + +.. prompt:: bash $ + + ceph dashboard set-pwd-policy-min-length <N> + +- Minimum password complexity (defaults to 10): + + .. prompt:: bash $ + + ceph dashboard set-pwd-policy-min-complexity <N> + + Password complexity is calculated by classifying each character in + the password. The complexity count starts by 0. A character is rated by + the following rules in the given order. + + - Increase by 1 if the character is a digit. + - Increase by 1 if the character is a lower case ASCII character. + - Increase by 2 if the character is an upper case ASCII character. + - Increase by 3 if the character is a special character like ``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``. + - Increase by 5 if the character has not been classified by one of the previous rules. + +- A list of comma separated words that are not allowed to be used in a + password: + + .. prompt:: bash $ + + ceph dashboard set-pwd-policy-exclusion-list <word>[,...] + + +User Accounts +^^^^^^^^^^^^^ + +The Ceph Dashboard supports multiple user accounts. Each user account +consists of a username, a password (stored in encrypted form using ``bcrypt``), +an optional name, and an optional email address. + +If a new user is created via the Web UI, it is possible to set an option that the +user must assign a new password when they log in for the first time. 
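The same first-login password change can be requested when creating an account from the command line; a minimal sketch using the account-management commands described below (``bob`` and the password file path are placeholders):

.. prompt:: bash $

   ceph dashboard ac-user-create --pwd_update_required bob -i /tmp/dashboard-password.txt read-only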
+ +User accounts are stored in the monitors' configuration database, and are +available to all ``ceph-mgr`` instances. + +We provide a set of CLI commands to manage user accounts: + +- *Show User(s)*: + + .. prompt:: bash $ + + ceph dashboard ac-user-show [<username>] + +- *Create User*: + + .. prompt:: bash $ + + ceph dashboard ac-user-create [--enabled] [--force-password] [--pwd_update_required] <username> -i <file-containing-password> [<rolename>] [<name>] [<email>] [<pwd_expiration_date>] + + To bypass password policy checks use the `force-password` option. + Add the option `pwd_update_required` so that a newly created user has + to change their password after the first login. + +- *Delete User*: + + .. prompt:: bash $ + + ceph dashboard ac-user-delete <username> + +- *Change Password*: + + .. prompt:: bash $ + + ceph dashboard ac-user-set-password [--force-password] <username> -i <file-containing-password> + +- *Change Password Hash*: + + .. prompt:: bash $ + + ceph dashboard ac-user-set-password-hash <username> -i <file-containing-password-hash> + + The hash must be a bcrypt hash and salt, e.g. ``$2b$12$Pt3Vq/rDt2y9glTPSV.VFegiLkQeIpddtkhoFetNApYmIJOY8gau2``. + This can be used to import users from an external database. + +- *Modify User (name, and email)*: + + .. prompt:: bash $ + + ceph dashboard ac-user-set-info <username> <name> <email> + +- *Disable User*: + + .. prompt:: bash $ + + ceph dashboard ac-user-disable <username> + +- *Enable User*: + + .. prompt:: bash $ + + ceph dashboard ac-user-enable <username> + +User Roles and Permissions +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +User accounts are associated with a set of roles that define which +dashboard functionality can be accessed. + +The Dashboard functionality/modules are grouped within a *security scope*. +Security scopes are predefined and static. The current available security +scopes are: + +- **hosts**: includes all features related to the ``Hosts`` menu + entry. +- **config-opt**: includes all features related to management of Ceph + configuration options. +- **pool**: includes all features related to pool management. +- **osd**: includes all features related to OSD management. +- **monitor**: includes all features related to monitor management. +- **rbd-image**: includes all features related to RBD image + management. +- **rbd-mirroring**: includes all features related to RBD mirroring + management. +- **iscsi**: includes all features related to iSCSI management. +- **rgw**: includes all features related to RADOS Gateway (RGW) management. +- **cephfs**: includes all features related to CephFS management. +- **nfs-ganesha**: includes all features related to NFS Ganesha management. +- **manager**: include all features related to Ceph Manager + management. +- **log**: include all features related to Ceph logs management. +- **grafana**: include all features related to Grafana proxy. +- **prometheus**: include all features related to Prometheus alert management. +- **dashboard-settings**: allows to change dashboard settings. + +A *role* specifies a set of mappings between a *security scope* and a set of +*permissions*. 
There are four types of permissions: + +- **read** +- **create** +- **update** +- **delete** + +See below for an example of a role specification, in the form of a Python dictionary:: + + # example of a role + { + 'role': 'my_new_role', + 'description': 'My new role', + 'scopes_permissions': { + 'pool': ['read', 'create'], + 'rbd-image': ['read', 'create', 'update', 'delete'] + } + } + +The above role dictates that a user has *read* and *create* permissions for +features related to pool management, and has full permissions for +features related to RBD image management. + +The Dashboard provides a set of predefined roles that we call +*system roles*, which can be used right away by a fresh Ceph Dashboard +installation. + +The list of system roles are: + +- **administrator**: allows full permissions for all security scopes. +- **read-only**: allows *read* permission for all security scopes except + dashboard settings. +- **block-manager**: allows full permissions for *rbd-image*, + *rbd-mirroring*, and *iscsi* scopes. +- **rgw-manager**: allows full permissions for the *rgw* scope +- **cluster-manager**: allows full permissions for the *hosts*, *osd*, + *monitor*, *manager*, and *config-opt* scopes. +- **pool-manager**: allows full permissions for the *pool* scope. +- **cephfs-manager**: allows full permissions for the *cephfs* scope. + +The list of available roles can be retrieved with the following command: + +.. prompt:: bash $ + + ceph dashboard ac-role-show [<rolename>] + +You can also use the CLI to create new roles. The available commands are the +following: + +- *Create Role*: + + .. prompt:: bash $ + + ceph dashboard ac-role-create <rolename> [<description>] + +- *Delete Role*: + + .. prompt:: bash $ + + ceph dashboard ac-role-delete <rolename> + +- *Add Scope Permissions to Role*: + + .. prompt:: bash $ + + ceph dashboard ac-role-add-scope-perms <rolename> <scopename> <permission> [<permission>...] + +- *Delete Scope Permission from Role*: + + .. prompt:: bash $ + + ceph dashboard ac-role-del-scope-perms <rolename> <scopename> + +To assign roles to users, the following commands are available: + +- *Set User Roles*: + + .. prompt:: bash $ + + ceph dashboard ac-user-set-roles <username> <rolename> [<rolename>...] + +- *Add Roles To User*: + + .. prompt:: bash $ + + ceph dashboard ac-user-add-roles <username> <rolename> [<rolename>...] + +- *Delete Roles from User*: + + .. prompt:: bash $ + + ceph dashboard ac-user-del-roles <username> <rolename> [<rolename>...] + + +Example of User and Custom Role Creation +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In this section we show a complete example of the commands that +create a user account that can manage RBD images, view and create Ceph pools, +and has read-only access to other scopes. + +1. *Create the user*: + + .. prompt:: bash $ + + ceph dashboard ac-user-create bob -i <file-containing-password> + +2. *Create role and specify scope permissions*: + + .. prompt:: bash $ + + ceph dashboard ac-role-create rbd/pool-manager + ceph dashboard ac-role-add-scope-perms rbd/pool-manager rbd-image read create update delete + ceph dashboard ac-role-add-scope-perms rbd/pool-manager pool read create + +3. *Associate roles to user*: + + .. prompt:: bash $ + + ceph dashboard ac-user-set-roles bob rbd/pool-manager read-only + +.. 
_dashboard-proxy-configuration: + +Proxy Configuration +------------------- + +In a Ceph cluster with multiple ``ceph-mgr`` instances, only the dashboard +running on the currently active ``ceph-mgr`` daemon will serve incoming requests. +Connections to the dashboard's TCP port on standby ``ceph-mgr`` instances +will receive an HTTP redirect (303) to the active manager's dashboard URL. +This enables you to point your browser to any ``ceph-mgr`` instance in +order to access the dashboard. + +If you want to establish a fixed URL to reach the dashboard or if you don't want +to allow direct connections to the manager nodes, you could set up a proxy that +automatically forwards incoming requests to the active ``ceph-mgr`` +instance. + +Configuring a URL Prefix +^^^^^^^^^^^^^^^^^^^^^^^^ + +If you are accessing the dashboard via a reverse proxy, +you may wish to service it under a URL prefix. To get the dashboard +to use hyperlinks that include your prefix, you can set the +``url_prefix`` setting: + +.. prompt:: bash $ + + ceph config set mgr mgr/dashboard/url_prefix $PREFIX + +so you can access the dashboard at ``http://$IP:$PORT/$PREFIX/``. + +Disable the redirection +^^^^^^^^^^^^^^^^^^^^^^^ + +If the dashboard is behind a load-balancing proxy like `HAProxy <https://www.haproxy.org/>`_ +you might want to disable redirection to prevent situations in which +internal (unresolvable) URLs are published to the frontend client. Use the +following command to get the dashboard to respond with an HTTP error (500 by default) +instead of redirecting to the active dashboard: + +.. prompt:: bash $ + + ceph config set mgr mgr/dashboard/standby_behaviour "error" + +To reset the setting to default redirection, use the following command: + +.. prompt:: bash $ + + ceph config set mgr mgr/dashboard/standby_behaviour "redirect" + +Configure the error status code +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When redirection is disabled, you may want to customize the HTTP status +code of standby dashboards. To do so you need to run the command: + +.. prompt:: bash $ + + ceph config set mgr mgr/dashboard/standby_error_status_code 503 + +Resolve IP address to hostname before redirect +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The redirect from a standby to the active dashboard is done via the IP +address. This is done because resolving IP addresses to hostnames can be error +prone in containerized environments. It is also the reason why the option is +disabled by default. +However, in some situations it might be helpful to redirect via the hostname. +For example if the configured TLS certificate matches only the hostnames. To +activate the redirection via the hostname run the following command:: + + $ ceph config set mgr mgr/dashboard/redirect_resolve_ip_addr True + +You can disable it again by:: + + $ ceph config set mgr mgr/dashboard/redirect_resolve_ip_addr False + +HAProxy example configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Below you will find an example configuration for SSL/TLS passthrough using +`HAProxy <https://www.haproxy.org/>`_. + +Please note that this configuration works under the following conditions. +If the dashboard fails over, the front-end client might receive a HTTP redirect +(303) response and will be redirected to an unresolvable host. This happens when +failover occurs between two HAProxy health checks. In this situation the +previously active dashboard node will now respond with a 303 which points to +the new active node. 
To prevent that situation you should consider disabling +redirection on standby nodes. + +:: + + defaults + log global + option log-health-checks + timeout connect 5s + timeout client 50s + timeout server 450s + + frontend dashboard_front + mode http + bind *:80 + option httplog + redirect scheme https code 301 if !{ ssl_fc } + + frontend dashboard_front_ssl + mode tcp + bind *:443 + option tcplog + default_backend dashboard_back_ssl + + backend dashboard_back_ssl + mode tcp + option httpchk GET / + http-check expect status 200 + server x <HOST>:<PORT> ssl check verify none + server y <HOST>:<PORT> ssl check verify none + server z <HOST>:<PORT> ssl check verify none + +.. _dashboard-auditing: + +Auditing API Requests +--------------------- + +The REST API can log PUT, POST and DELETE requests to the Ceph +audit log. This feature is disabled by default, but can be enabled with the +following command: + +.. prompt:: bash $ + + ceph dashboard set-audit-api-enabled <true|false> + +If enabled, the following parameters are logged per each request: + +* from - The origin of the request, e.g. https://[::1]:44410 +* path - The REST API path, e.g. /api/auth +* method - e.g. PUT, POST or DELETE +* user - The name of the user, otherwise 'None' + +The logging of the request payload (the arguments and their values) is enabled +by default. Execute the following command to disable this behaviour: + +.. prompt:: bash $ + + ceph dashboard set-audit-api-log-payload <true|false> + +A log entry may look like this:: + + 2018-10-22 15:27:01.302514 mgr.x [INF] [DASHBOARD] from='https://[::ffff:127.0.0.1]:37022' path='/api/rgw/user/klaus' method='PUT' user='admin' params='{"max_buckets": "1000", "display_name": "Klaus Mustermann", "uid": "klaus", "suspended": "0", "email": "klaus.mustermann@ceph.com"}' + +.. _dashboard-nfs-ganesha-management: + +NFS-Ganesha Management +---------------------- + +The dashboard requires enabling the NFS module which will be used to manage +NFS clusters and NFS exports. For more information check :ref:`mgr-nfs`. + +Plug-ins +-------- + +Plug-ins extend the functionality of the Ceph Dashboard in a modular +and loosely coupled fashion. + +.. _Grafana: https://grafana.com/ + +.. include:: dashboard_plugins/feature_toggles.inc.rst +.. include:: dashboard_plugins/debug.inc.rst +.. include:: dashboard_plugins/motd.inc.rst + + +Troubleshooting the Dashboard +----------------------------- + +Locating the Dashboard +^^^^^^^^^^^^^^^^^^^^^^ + +If you are unsure of the location of the Ceph Dashboard, run the following command: + +.. prompt:: bash $ + + ceph mgr services | jq .dashboard + +:: + + "https://host:port" + +The command returns the URL where the Ceph Dashboard is located: ``https://<host>:<port>/`` + +.. note:: + + Many Ceph tools return results in JSON format. We suggest that + you install the `jq <https://stedolan.github.io/jq>`_ command-line + utility to facilitate working with JSON data. + + +Accessing the Dashboard +^^^^^^^^^^^^^^^^^^^^^^^ + +If you are unable to access the Ceph Dashboard, run the following +commands: + +#. Verify the Ceph Dashboard module is enabled: + + .. prompt:: bash $ + + ceph mgr module ls | jq .enabled_modules + + Ensure the Ceph Dashboard module is listed in the return value of the + command. Example snipped output from the command above:: + + [ + "dashboard", + "iostat", + "restful" + ] + +#. If it is not listed, activate the module with the following command: + + .. prompt:: bash $ + + ceph mgr module enable dashboard + +#. 
Check the Ceph Dashboard and/or ``ceph-mgr`` log files for any errors. + + * Check if ``ceph-mgr`` log messages are written to a file by: + + .. prompt:: bash $ + + ceph config get mgr log_to_file + + :: + + true + + * Get the location of the log file (it's ``/var/log/ceph/<cluster-name>-<daemon-name>.log`` + by default): + + .. prompt:: bash $ + + ceph config get mgr log_file + + :: + + /var/log/ceph/$cluster-$name.log + +#. Ensure the SSL/TSL support is configured properly: + + * Check if the SSL/TSL support is enabled: + + .. prompt:: bash $ + + ceph config get mgr mgr/dashboard/ssl + + * If the command returns ``true``, verify a certificate exists by: + + .. prompt:: bash $ + + ceph config-key get mgr/dashboard/crt + + and: + + .. prompt:: bash $ + + ceph config-key get mgr/dashboard/key + + * If it doesn't return ``true``, run the following command to generate a self-signed + certificate or follow the instructions outlined in + :ref:`dashboard-ssl-tls-support`: + + .. prompt:: bash $ + + ceph dashboard create-self-signed-cert + + +Trouble Logging into the Dashboard +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If you are unable to log into the Ceph Dashboard and you receive the following +error, run through the procedural checks below: + +.. image:: ../images/dashboard/invalid-credentials.png + :align: center + +#. Check that your user credentials are correct. If you are seeing the + notification message above when trying to log into the Ceph Dashboard, it + is likely you are using the wrong credentials. Double check your username + and password, and ensure that your keyboard's caps lock is not enabled by accident. + +#. If your user credentials are correct, but you are experiencing the same + error, check that the user account exists: + + .. prompt:: bash $ + + ceph dashboard ac-user-show <username> + + This command returns your user data. If the user does not exist, it will + print:: + + Error ENOENT: User <username> does not exist + +#. Check if the user is enabled: + + .. prompt:: bash $ + + ceph dashboard ac-user-show <username> | jq .enabled + + :: + + true + + Check if ``enabled`` is set to ``true`` for your user. If not the user is + not enabled, run: + + .. prompt:: bash $ + + ceph dashboard ac-user-enable <username> + +Please see :ref:`dashboard-user-role-management` for more information. + + +A Dashboard Feature is Not Working +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When an error occurs on the backend, you will usually receive an error +notification on the frontend. Run through the following scenarios to debug. + +#. Check the Ceph Dashboard and ``ceph-mgr`` logfile(s) for any errors. These can + found by searching for keywords, such as *500 Internal Server Error*, + followed by ``traceback``. The end of a traceback contains more details about + what exact error occurred. +#. Check your web browser's JavaScript Console for any errors. + + +Ceph Dashboard Logs +^^^^^^^^^^^^^^^^^^^ + +Dashboard Debug Flag +"""""""""""""""""""" + +With this flag enabled, error traceback is included in backend responses. + +To enable this flag via the Ceph Dashboard, navigate from *Cluster* to *Manager +modules*. Select *Dashboard module* and click the edit button. Click the +*debug* checkbox and update. + +To enable it via the CLI, run the following command: + +.. prompt:: bash $ + + ceph dashboard debug enable + + +Setting Logging Level of Dashboard Module +""""""""""""""""""""""""""""""""""""""""" + +Setting the logging level to debug makes the log more verbose and helpful for +debugging. + +#. 
Increase the logging level of manager daemons: + + .. prompt:: bash $ + + ceph tell mgr config set debug_mgr 20 + +#. Adjust the logging level of the Ceph Dashboard module via the Dashboard or + CLI: + + * Navigate from *Cluster* to *Manager modules*. Select *Dashboard module* + and click the edit button. Modify the ``log_level`` configuration. + * To adjust it via the CLI, run the following command: + + .. prompt:: bash $ + + bin/ceph config set mgr mgr/dashboard/log_level debug + +3. High log levels can result in considerable log volume, which can +easily fill up your filesystem. Set a calendar reminder for an hour, a day, +or a week in the future to revert this temporary logging increase. This looks +something like this: + + .. prompt:: bash $ + + ceph config log + + :: + + ... + --- 11 --- 2020-11-07 11:11:11.960659 --- mgr.x/dashboard/log_level = debug --- + ... + + .. prompt:: bash $ + + ceph config reset 11 + +.. _centralized-logging: + +Enable Centralized Logging in Dashboard +""""""""""""""""""""""""""""""""""""""" + +To learn more about centralized logging, see :ref:`cephadm-monitoring-centralized-logs` + +1. Create the Loki service on any particular host using "Create Services" option. + +2. Similarly create the Promtail service which will be by default deployed + on all the running hosts. + +3. To see debug-level messages as well as info-level events, run the following command via CLI: + + .. prompt:: bash $ + + ceph config set mgr mgr/cephadm/log_to_cluster_level debug + +4. To enable logging to files, run the following commands via CLI: + + .. prompt:: bash $ + + ceph config set global log_to_file true + ceph config set global mon_cluster_log_to_file true + +5. Click on the Daemon Logs tab under Cluster -> Logs. + +6. You can find some pre-defined labels there on clicking the Log browser button such as filename, + job etc that can help you query the logs at one go. + +7. You can query the logs with LogQL for advanced search and perform some + calculations as well - https://grafana.com/docs/loki/latest/logql/. + + +Reporting issues from Dashboard +""""""""""""""""""""""""""""""" + +Ceph-Dashboard provides two ways to create an issue in the Ceph Issue Tracker, +either using the Ceph command line interface or by using the Ceph Dashboard +user interface. + +To create an issue in the Ceph Issue Tracker, a user needs to have an account +on the issue tracker. Under the ``my account`` tab in the Ceph Issue Tracker, +the user can see their API access key. This key is used for authentication +when creating a new issue. To store the Ceph API access key, in the CLI run: + +.. prompt:: bash $ + + ``ceph dashboard set-issue-tracker-api-key -i <file-containing-key>`` + +Then on successful update, you can create an issue using: + +.. prompt:: bash $ + + ``ceph dashboard create issue <project> <tracker_type> <subject> <description>`` + +The available projects to create an issue on are: +#. dashboard +#. block +#. object +#. file_system +#. ceph_manager +#. orchestrator +#. ceph_volume +#. core_ceph + +The available tracker types are: +#. bug +#. feature + +The subject and description are then set by the user. + +The user can also create an issue using the Dashboard user interface. The settings +icon drop down menu on the top right of the navigation bar has the option to +``Raise an issue``. On clicking it, a modal dialog opens that has the option to +select the project and tracker from their respective drop down menus. The subject +and multiline description are added by the user. 
The user can then submit the issue. diff --git a/doc/mgr/dashboard_plugins/debug.inc.rst b/doc/mgr/dashboard_plugins/debug.inc.rst new file mode 100644 index 000000000..883419cbf --- /dev/null +++ b/doc/mgr/dashboard_plugins/debug.inc.rst @@ -0,0 +1,43 @@ +.. _dashboard-debug: + +Debug +^^^^^ + +This plugin allows to customize the behaviour of the dashboard according to the +debug mode. It can be enabled, disabled or checked with the following command: + +.. prompt:: bash $ + + ceph dashboard debug status + +:: + + Debug: 'disabled' + +.. prompt:: bash $ + + ceph dashboard debug enable + +:: + + Debug: 'enabled' + +.. prompt:: bash $ + + ceph dashboard debug disable + +:: + + Debug: 'disabled' + +By default, it's disabled. This is the recommended setting for production +deployments. If required, debug mode can be enabled without need of restarting. +Currently, disabled debug mode equals to CherryPy ``production`` environment, +while when enabled, it uses ``test_suite`` defaults (please refer to +`CherryPy Environments +<https://docs.cherrypy.org/en/latest/config.html#environments>`_ for more +details). + +It also adds request uuid (``unique_id``) to Cherrypy on versions that don't +support this. It additionally prints the ``unique_id`` to error responses and +log messages. diff --git a/doc/mgr/dashboard_plugins/feature_toggles.inc.rst b/doc/mgr/dashboard_plugins/feature_toggles.inc.rst new file mode 100644 index 000000000..7c96b0faa --- /dev/null +++ b/doc/mgr/dashboard_plugins/feature_toggles.inc.rst @@ -0,0 +1,56 @@ +.. _dashboard-feature-toggles: + +Feature Toggles +^^^^^^^^^^^^^^^ + +This plug-in allows to enable or disable some features from the Ceph Dashboard +on-demand. When a feature becomes disabled: + +- Its front-end elements (web pages, menu entries, charts, etc.) will become hidden. +- Its associated REST API endpoints will reject any further requests (404, Not Found Error). + +The main purpose of this plug-in is to allow ad-hoc customizations of the workflows exposed +by the dashboard. Additionally, it could allow for dynamically enabling experimental +features with minimal configuration burden and no service impact. + +The list of features that can be enabled/disabled is: + +- **Block (RBD)**: + - Image Management: ``rbd`` + - Mirroring: ``mirroring`` + - iSCSI: ``iscsi`` +- **Filesystem (Cephfs)**: ``cephfs`` +- **Objects (RGW)**: ``rgw`` (including daemon, user and bucket management). +- **NFS**: ``nfs-ganesha`` exports. + +By default all features come enabled. + +To retrieve a list of features and their current statuses: + +.. prompt:: bash $ + + ceph dashboard feature status + +:: + + Feature 'cephfs': 'enabled' + Feature 'iscsi': 'enabled' + Feature 'mirroring': 'enabled' + Feature 'rbd': 'enabled' + Feature 'rgw': 'enabled' + Feature 'nfs': 'enabled' + +To enable or disable the status of a single or multiple features: + +.. prompt:: bash $ + + ceph dashboard feature disable iscsi mirroring + +:: + + Feature 'iscsi': disabled + Feature 'mirroring': disabled + +After a feature status has changed, the API REST endpoints immediately respond to +that change, while for the front-end UI elements, it may take up to 20 seconds to +reflect it. diff --git a/doc/mgr/dashboard_plugins/motd.inc.rst b/doc/mgr/dashboard_plugins/motd.inc.rst new file mode 100644 index 000000000..0f9cc199a --- /dev/null +++ b/doc/mgr/dashboard_plugins/motd.inc.rst @@ -0,0 +1,36 @@ +.. 
_dashboard-motd: + +Message of the day (MOTD) +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Displays a configured `message of the day` at the top of the Ceph Dashboard. + +The importance of a MOTD can be configured by its severity, which is +`info`, `warning` or `danger`. The MOTD can expire after a given time, +this means it will not be displayed in the UI anymore. Use the following +syntax to specify the expiration time: `Ns|m|h|d|w` for seconds, minutes, +hours, days and weeks. If the MOTD should expire after 2 hours, use `2h` +or `5w` for 5 weeks. Use `0` to configure a MOTD that does not expire. + +To configure a MOTD, run the following command: + +.. prompt:: bash $ + + ceph dashboard motd set <severity:info|warning|danger> <expires> <message> + +To show the configured MOTD: + +.. prompt:: bash $ + + ceph dashboard motd get + +To clear the configured MOTD run: + +.. prompt:: bash $ + + ceph dashboard motd clear + +A MOTD with a `info` or `warning` severity can be closed by the user. The +`info` MOTD is not displayed anymore until the local storage cookies are +cleared or a new MOTD with a different severity is displayed. A MOTD with +a 'warning' severity will be displayed again in a new session. diff --git a/doc/mgr/details-card.png b/doc/mgr/details-card.png Binary files differnew file mode 100644 index 000000000..0c219a890 --- /dev/null +++ b/doc/mgr/details-card.png diff --git a/doc/mgr/diskprediction.rst b/doc/mgr/diskprediction.rst new file mode 100644 index 000000000..f4b697511 --- /dev/null +++ b/doc/mgr/diskprediction.rst @@ -0,0 +1,59 @@ +.. _diskprediction: + +===================== +Diskprediction Module +===================== + +The *diskprediction* module leverages Ceph device health check to collect disk health metrics and uses internal predictor module to produce the disk failure prediction and returns back to Ceph. It doesn't require any external server for data analysis and output results. Its internal predictor's accuracy is around 70%. + +Enabling +======== + +Run the following command to enable the *diskprediction_local* module in the Ceph +environment:: + + ceph mgr module enable diskprediction_local + + +To enable the local predictor:: + + ceph config set global device_failure_prediction_mode local + +To disable prediction:: + + ceph config set global device_failure_prediction_mode none + + +*diskprediction_local* requires at least six datasets of device health metrics to +make prediction of the devices' life expectancy. And these health metrics are +collected only if health monitoring is :ref:`enabled <enabling-monitoring>`. + +Run the following command to retrieve the life expectancy of given device. + +:: + + ceph device predict-life-expectancy <device id> + +Configuration +============= + +The module performs the prediction on a daily basis by default. You can adjust +this interval with:: + + ceph config set mgr mgr/diskprediction_local/predict_interval <interval-in-seconds> + +Debugging +========= + +If you want to debug the DiskPrediction module mapping to Ceph logging level, +use the following command. + +:: + + [mgr] + + debug mgr = 20 + +With logging set to debug for the manager the module will print out logging +message with prefix *mgr[diskprediction]* for easy filtering. + diff --git a/doc/mgr/hello.rst b/doc/mgr/hello.rst new file mode 100644 index 000000000..725355fc9 --- /dev/null +++ b/doc/mgr/hello.rst @@ -0,0 +1,39 @@ +Hello World Module +================== + +This is a simple module skeleton for documentation purposes. 
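For orientation, a stripped-down module of this kind might look roughly like the sketch
below. This is only an illustration of the general shape of a manager module, not the
actual source (which lives in ``src/pybind/mgr/hello/module.py``); see
:ref:`mgr-module-dev` for the full developer guide.

.. code-block:: python

   from mgr_module import CLIReadCommand, HandleCommandResult, MgrModule


   class Hello(MgrModule):
       """A minimal manager module exposing a single read-only command."""

       @CLIReadCommand('hello')
       def hello(self) -> HandleCommandResult:
           """Say hello to the world"""
           return HandleCommandResult(stdout='Hello world!')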
+ +Enabling +-------- + +The *hello* module is enabled with:: + + ceph mgr module enable hello + +To check that it is enabled, run:: + + ceph mgr module ls + +After editing the module file (found in ``src/pybind/mgr/hello/module.py``), you can see changes by running:: + + ceph mgr module disable hello + ceph mgr module enable hello + +or:: + + init-ceph restart mgr + +To execute the module, run:: + + ceph hello + +The log is found at:: + + build/out/mgr.x.log + + +Documenting +----------- + +After adding a new mgr module, be sure to add its documentation to ``doc/mgr/module_name.rst``. +Also, add a link to your new module into ``doc/mgr/index.rst``. diff --git a/doc/mgr/index.rst b/doc/mgr/index.rst new file mode 100644 index 000000000..4d20d5098 --- /dev/null +++ b/doc/mgr/index.rst @@ -0,0 +1,52 @@ +.. _ceph-manager-daemon: + +=================== +Ceph Manager Daemon +=================== + +The :term:`Ceph Manager` daemon (ceph-mgr) runs alongside monitor daemons, +to provide additional monitoring and interfaces to external monitoring +and management systems. + +Since the 12.x (*luminous*) Ceph release, the ceph-mgr daemon is required for +normal operations. The ceph-mgr daemon is an optional component in +the 11.x (*kraken*) Ceph release. + +By default, the manager daemon requires no additional configuration, beyond +ensuring it is running. If there is no mgr daemon running, you will +see a health warning to that effect, and some of the other information +in the output of `ceph status` will be missing or stale until a mgr is started. + +Use your normal deployment tools, such as ceph-ansible or cephadm, to +set up ceph-mgr daemons on each of your mon nodes. It is not mandatory +to place mgr daemons on the same nodes as mons, but it is almost always +sensible. + +.. toctree:: + :maxdepth: 1 + + Installation and Configuration <administrator> + Writing modules <modules> + Writing orchestrator plugins <orchestrator_modules> + Dashboard module <dashboard> + Ceph RESTful API <ceph_api/index> + Alerts module <alerts> + DiskPrediction module <diskprediction> + Local pool module <localpool> + RESTful module <restful> + Zabbix module <zabbix> + Prometheus module <prometheus> + Influx module <influx> + Hello module <hello> + Telegraf module <telegraf> + Telemetry module <telemetry> + Iostat module <iostat> + Crash module <crash> + Insights module <insights> + Orchestrator module <orchestrator> + Rook module <rook> + RGW module <rgw> + MDS Autoscaler module <mds_autoscaler> + NFS module <nfs> + Progress Module <progress> + CLI API Commands module <cli_api> diff --git a/doc/mgr/influx.rst b/doc/mgr/influx.rst new file mode 100644 index 000000000..2622d3919 --- /dev/null +++ b/doc/mgr/influx.rst @@ -0,0 +1,173 @@ +============= +Influx Module +============= + +.. mgr_module:: influx + +The influx module continuously collects and sends time series data to an +influxdb database. + +The influx module was introduced in the 13.x *Mimic* release. + +-------- +Enabling +-------- + +To enable the module, use the following command: + +.. prompt:: bash $ + + ceph mgr module enable influx + +If you wish to subsequently disable the module, you can use the equivalent +*disable* command: + +.. prompt:: bash $ + + ceph mgr module disable influx + +------------- +Configuration +------------- + +For the influx module to send statistics to an InfluxDB server, it +is necessary to configure the servers address and some authentication +credentials. + +Set configuration values using the following command: + +.. 
prompt:: bash $ + + ceph config set mgr mgr/influx/<key> <value> + + +The most important settings are :confval:`mgr/influx/hostname`, +:confval:`mgr/influx/username` and :confval:`mgr/influx/password`. +For example, a typical configuration might look like this: + +.. prompt:: bash $ + + ceph config set mgr mgr/influx/hostname influx.mydomain.com + ceph config set mgr mgr/influx/username admin123 + ceph config set mgr mgr/influx/password p4ssw0rd + +Following is the list of all configuration settings: + +.. confval:: hostname +.. confval:: username +.. confval:: password +.. confval:: interval +.. confval:: database +.. confval:: port +.. confval:: ssl +.. confval:: verify_ssl +.. confval:: threads +.. confval:: batch_size + +--------- +Debugging +--------- + +By default, a few debugging statements as well as error statements have been set to print in the log files. Users can add more if necessary. +To make use of the debugging option in the module: + +- Add this to the ceph.conf file. + + .. code-block:: ini + + [mgr] + debug_mgr = 20 + +- Use this command ``ceph influx self-test``. +- Check the log files. Users may find it easier to filter the log files using *mgr[influx]*. + +-------------------- +Interesting counters +-------------------- + +The following tables describe a subset of the values output by +this module. + +^^^^^ +Pools +^^^^^ + ++---------------+-----------------------------------------------------+ +|Counter | Description | ++===============+=====================================================+ +|stored | Bytes stored in the pool not including copies | ++---------------+-----------------------------------------------------+ +|max_avail | Max available number of bytes in the pool | ++---------------+-----------------------------------------------------+ +|objects | Number of objects in the pool | ++---------------+-----------------------------------------------------+ +|wr_bytes | Number of bytes written in the pool | ++---------------+-----------------------------------------------------+ +|dirty | Number of bytes dirty in the pool | ++---------------+-----------------------------------------------------+ +|rd_bytes | Number of bytes read in the pool | ++---------------+-----------------------------------------------------+ +|stored_raw | Bytes used in pool including copies made | ++---------------+-----------------------------------------------------+ + +^^^^ +OSDs +^^^^ + ++------------+------------------------------------+ +|Counter | Description | ++============+====================================+ +|op_w | Client write operations | ++------------+------------------------------------+ +|op_in_bytes | Client operations total write size | ++------------+------------------------------------+ +|op_r | Client read operations | ++------------+------------------------------------+ +|op_out_bytes| Client operations total read size | ++------------+------------------------------------+ + + ++------------------------+--------------------------------------------------------------------------+ +|Counter | Description | ++========================+==========================================================================+ +|op_wip | Replication operations currently being processed (primary) | ++------------------------+--------------------------------------------------------------------------+ +|op_latency | Latency of client operations (including queue time) | ++------------------------+--------------------------------------------------------------------------+ 
+|op_process_latency | Latency of client operations (excluding queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_prepare_latency | Latency of client operations (excluding queue time and wait for finished)| ++------------------------+--------------------------------------------------------------------------+ +|op_r_latency | Latency of read operation (including queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_r_process_latency | Latency of read operation (excluding queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_w_in_bytes | Client data written | ++------------------------+--------------------------------------------------------------------------+ +|op_w_latency | Latency of write operation (including queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_w_process_latency | Latency of write operation (excluding queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_w_prepare_latency | Latency of write operations (excluding queue time and wait for finished) | ++------------------------+--------------------------------------------------------------------------+ +|op_rw | Client read-modify-write operations | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_in_bytes | Client read-modify-write operations write in | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_out_bytes | Client read-modify-write operations read out | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_latency | Latency of read-modify-write operation (including queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_process_latency | Latency of read-modify-write operation (excluding queue time) | ++------------------------+--------------------------------------------------------------------------+ +|op_rw_prepare_latency | Latency of read-modify-write operations (excluding queue time | +| | and wait for finished) | ++------------------------+--------------------------------------------------------------------------+ +|op_before_queue_op_lat | Latency of IO before calling queue (before really queue into ShardedOpWq)| +| | op_before_dequeue_op_lat | ++------------------------+--------------------------------------------------------------------------+ +|op_before_dequeue_op_lat| Latency of IO before calling dequeue_op(already dequeued and get PG lock)| ++------------------------+--------------------------------------------------------------------------+ + +Latency counters are measured in microseconds unless otherwise specified in the description. + diff --git a/doc/mgr/insights.rst b/doc/mgr/insights.rst new file mode 100644 index 000000000..37b8903f1 --- /dev/null +++ b/doc/mgr/insights.rst @@ -0,0 +1,52 @@ +Insights Module +=============== + +The insights module collects and exposes system information to the Insights Core +data analysis framework. It is intended to replace explicit interrogation of +Ceph CLIs and daemon admin sockets, reducing the API surface that Insights +depends on. 
The insights reports contains the following: + +* **Health reports**. In addition to reporting the current health of the + cluster, the insights module reports a summary of the last 24 hours of health + checks. This feature is important for catching cluster health issues that are + transient and may not be present at the moment the report is generated. Health + checks are deduplicated to avoid unbounded data growth. + +* **Crash reports**. A summary of any daemon crashes in the past 24 hours is + included in the insights report. Crashes are reported as the number of crashes + per daemon type (e.g. `ceph-osd`) within the time window. Full details of a + crash may be obtained using the `crash module`_. + +* Software version, storage utilization, cluster maps, placement group summary, + monitor status, cluster configuration, and OSD metadata. + +Enabling +-------- + +The *insights* module is enabled with:: + + ceph mgr module enable insights + +Commands +-------- +:: + + ceph insights + +Generate the full report. + +:: + + ceph insights prune-health <hours> + +Remove historical health data older than <hours>. Passing `0` for <hours> will +clear all health data. + +This command is useful for cleaning the health history before automated nightly +reports are generated, which may contain spurious health checks accumulated +while performing system maintenance, or other health checks that have been +resolved. There is no need to prune health data to reclaim storage space; +garbage collection is performed regularly to remove old health data from +persistent storage. + +.. _crash module: ../crash diff --git a/doc/mgr/inventory-card.png b/doc/mgr/inventory-card.png Binary files differnew file mode 100644 index 000000000..54317fc9f --- /dev/null +++ b/doc/mgr/inventory-card.png diff --git a/doc/mgr/iostat.rst b/doc/mgr/iostat.rst new file mode 100644 index 000000000..f9f849383 --- /dev/null +++ b/doc/mgr/iostat.rst @@ -0,0 +1,32 @@ +.. _mgr-iostat-overview: + +iostat +====== + +This module shows the current throughput and IOPS done on the Ceph cluster. + +Enabling +-------- + +To check if the *iostat* module is enabled, run:: + + ceph mgr module ls + +The module can be enabled with:: + + ceph mgr module enable iostat + +To execute the module, run:: + + ceph iostat + +To change the frequency at which the statistics are printed, use the ``-p`` +option:: + + ceph iostat -p <period in seconds> + +For example, use the following command to print the statistics every 5 seconds:: + + ceph iostat -p 5 + +To stop the module, press Ctrl-C. diff --git a/doc/mgr/localpool.rst b/doc/mgr/localpool.rst new file mode 100644 index 000000000..2812925ca --- /dev/null +++ b/doc/mgr/localpool.rst @@ -0,0 +1,39 @@ +Local Pool Module +================= + +.. mgr_module:: localpool + +The *localpool* module can automatically create RADOS pools that are +localized to a subset of the overall cluster. For example, by default, it will +create a pool for each distinct ``rack`` in the cluster. This can be useful for +deployments where it is desirable to distribute some data locally and other data +globally across the cluster. One use-case is measuring performance and testing +behavior of specific drive, NIC, or chassis models in isolation. + +Enabling +-------- + +The *localpool* module is enabled with:: + + ceph mgr module enable localpool + +Configuring +----------- + +The *localpool* module understands the following options: + +.. confval:: subtree +.. confval:: failure_domain +.. confval:: pg_num +.. confval:: num_rep +.. 
confval:: min_size +.. confval:: prefix + :default: by-$subtreetype- + +These options are set via the config-key interface. For example, to +change the replication level to 2x with only 64 PGs, :: + + ceph config set mgr mgr/localpool/num_rep 2 + ceph config set mgr mgr/localpool/pg_num 64 + +.. mgr_module:: None diff --git a/doc/mgr/mds_autoscaler.rst b/doc/mgr/mds_autoscaler.rst new file mode 100644 index 000000000..46fc44155 --- /dev/null +++ b/doc/mgr/mds_autoscaler.rst @@ -0,0 +1,23 @@ +MDS Autoscaler Module +===================== + +The MDS Autoscaler Module monitors file systems to ensure sufficient MDS +daemons are available. It works by adjusting the placement specification for +the orchestrator backend of the MDS service. To enable, use: + +.. sh: + + ceph mgr module enable mds_autoscaler + +The module will monitor the following file system settings to inform +placement count adjustments: + +- ``max_mds`` file system setting +- ``standby_count_wanted`` file system setting + +The Ceph monitor daemons are still responsible for promoting or stopping MDS +according to these settings. The ``mds_autoscaler`` simply adjusts the +number of MDS which are spawned by the orchestrator. + +.. note: There is no CLI or module configurations as of now. Enable or disable + the module to turn on or off. diff --git a/doc/mgr/modules.rst b/doc/mgr/modules.rst new file mode 100644 index 000000000..667664139 --- /dev/null +++ b/doc/mgr/modules.rst @@ -0,0 +1,735 @@ + + +.. _mgr-module-dev: + +ceph-mgr module developer's guide +================================= + +.. warning:: + + This is developer documentation, describing Ceph internals that + are only relevant to people writing ceph-mgr modules. + +Creating a module +----------------- + +In pybind/mgr/, create a python module. Within your module, create a class +that inherits from ``MgrModule``. For ceph-mgr to detect your module, your +directory must contain a file called `module.py`. + +The most important methods to override are: + +* a ``serve`` member function for server-type modules. This + function should block forever. +* a ``notify`` member function if your module needs to + take action when new cluster data is available. +* a ``handle_command`` member function if your module + exposes CLI commands. But this approach for exposing commands + is deprecated. For more details, see :ref:`mgr-module-exposing-commands`. + +Some modules interface with external orchestrators to deploy +Ceph services. These also inherit from ``Orchestrator``, which adds +additional methods to the base ``MgrModule`` class. See +:ref:`Orchestrator modules <orchestrator-modules>` for more on +creating these modules. + +Installing a module +------------------- + +Once your module is present in the location set by the +``mgr module path`` configuration setting, you can enable it +via the ``ceph mgr module enable`` command:: + + ceph mgr module enable mymodule + +Note that the MgrModule interface is not stable, so any modules maintained +outside of the Ceph tree are liable to break when run against any newer +or older versions of Ceph. + +.. _mgr module dev logging: + +Logging +------- + +Logging in Ceph manager modules is done as in any other Python program. Just +import the ``logging`` package and get a logger instance with the +``logging.getLogger`` function. + +Each module has a ``log_level`` option that specifies the current Python +logging level of the module. 
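As a minimal sketch (the class name and message are hypothetical), obtaining and using
such a logger inside a module looks like ordinary Python logging:

.. code:: python

   import logging

   from mgr_module import MgrModule

   logger = logging.getLogger(__name__)


   class MyModule(MgrModule):
       def serve(self) -> None:
           # Emitted through the Ceph logging layer at the module's current log_level.
           logger.info("serve() starting")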
+To change or query the logging level of the module use the following Ceph +commands:: + + ceph config get mgr mgr/<module_name>/log_level + ceph config set mgr mgr/<module_name>/log_level <info|debug|critical|error|warning|> + +The logging level used upon the module's start is determined by the current +logging level of the mgr daemon, unless if the ``log_level`` option was +previously set with the ``config set ...`` command. The mgr daemon logging +level is mapped to the module python logging level as follows: + +* <= 0 is CRITICAL +* <= 1 is WARNING +* <= 4 is INFO +* <= +inf is DEBUG + +We can unset the module log level and fallback to the mgr daemon logging level +by running the following command:: + + ceph config set mgr mgr/<module_name>/log_level '' + +By default, modules' logging messages are processed by the Ceph logging layer +where they will be recorded in the mgr daemon's log file. +But it's also possible to send a module's logging message to it's own file. + +The module's log file will be located in the same directory as the mgr daemon's +log file with the following name pattern:: + + <mgr_daemon_log_file_name>.<module_name>.log + +To enable the file logging on a module use the following command:: + + ceph config set mgr mgr/<module_name>/log_to_file true + +When the module's file logging is enabled, module's logging messages stop +being written to the mgr daemon's log file and are only written to the +module's log file. + +It's also possible to check the status and disable the file logging with the +following commands:: + + ceph config get mgr mgr/<module_name>/log_to_file + ceph config set mgr mgr/<module_name>/log_to_file false + + + +.. _mgr-module-exposing-commands: + +Exposing commands +----------------- + +There are two approaches for exposing a command. The first method involves using +the ``@CLICommand`` decorator to decorate the methods needed to handle a command. +The second method uses a ``COMMANDS`` attribute defined for the module class. + + +The CLICommand approach +~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: python + + @CLICommand('antigravity send to blackhole', + perm='rw') + def send_to_blackhole(self, oid: str, blackhole: Optional[str] = None, inbuf: Optional[str] = None): + ''' + Send the specified object to black hole + ''' + obj = self.find_object(oid) + if obj is None: + return HandleCommandResult(-errno.ENOENT, stderr=f"object '{oid}' not found") + if blackhole is not None and inbuf is not None: + try: + location = self.decrypt(blackhole, passphrase=inbuf) + except ValueError: + return HandleCommandResult(-errno.EINVAL, stderr='unable to decrypt location') + else: + location = blackhole + self.send_object_to(obj, location) + return HandleCommandResult(stdout=f"the black hole swallowed '{oid}'") + +The first parameter passed to ``CLICommand`` is the "name" of the command. +Since there are lots of commands in Ceph, we tend to group related commands +with a common prefix. In this case, "antigravity" is used for this purpose. +As the author is probably designing a module which is also able to launch +rockets into the deep space. + +The `type annotations <https://www.python.org/dev/peps/pep-0484/>`_ for the +method parameters are mandatory here, so the usage of the command can be +properly reported to the ``ceph`` CLI, and the manager daemon can convert +the serialized command parameters sent by the clients to the expected type +before passing them to the handler method. 
With properly implemented types, +one can also perform some sanity checks against the parameters! + +The names of the parameters are part of the command interface, so please +try to take the backward compatibility into consideration when changing +them. But you **cannot** change name of ``inbuf`` parameter, it is used +to pass the content of the file specified by ``ceph --in-file`` option. + +The docstring of the method is used for the description of the command. + +The manager daemon cooks the usage of the command from these ingredients, +like:: + + antigravity send to blackhole <oid> [<blackhole>] Send the specified object to black hole + +as part of the output of ``ceph --help``. + +In addition to ``@CLICommand``, you could also use ``@CLIReadCommand`` or +``@CLIWriteCommand`` if your command only requires read permissions or +write permissions respectively. + + +The COMMANDS Approach +~~~~~~~~~~~~~~~~~~~~~ + +This method uses the ``COMMANDS`` class attribute of your module to define +a list of dicts like this:: + + COMMANDS = [ + { + "cmd": "foobar name=myarg,type=CephString", + "desc": "Do something awesome", + "perm": "rw", + # optional: + "poll": "true" + } + ] + +The ``cmd`` part of each entry is parsed in the same way as internal +Ceph mon and admin socket commands (see mon/MonCommands.h in +the Ceph source for examples). Note that the "poll" field is optional, +and is set to False by default; this indicates to the ``ceph`` CLI +that it should call this command repeatedly and output results (see +``ceph -h`` and its ``--period`` option). + +Each command is expected to return a tuple ``(retval, stdout, stderr)``. +``retval`` is an integer representing a libc error code (e.g. EINVAL, +EPERM, or 0 for no error), ``stdout`` is a string containing any +non-error output, and ``stderr`` is a string containing any progress or +error explanation output. Either or both of the two strings may be empty. + +Implement the ``handle_command`` function to respond to the commands +when they are sent: + + +.. py:currentmodule:: mgr_module +.. automethod:: MgrModule.handle_command + + +Responses and Formatting +~~~~~~~~~~~~~~~~~~~~~~~~ + +Functions that handle manager commands are expected to return a three element +tuple with the type signature ``Tuple[int, str, str]``. The first element is a +return value/error code, where zero indicates no error and a negative `errno`_ +is typically used for error conditions. The second element corresponds to the +command's "output". The third element corresponds to the command's "error +output" (akin to stderr) and is frequently used to report textual error details +when the return code is non-zero. The ``mgr_module.HandleCommandResult`` type +can also be used in lieu of a response tuple. + +.. _`errno`: https://man7.org/linux/man-pages/man3/errno.3.html + +When the implementation of a command raises an exception one of two possible +approaches to handling the exception exist. First, the command function can do +nothing and let the exception bubble up to the manager. When this happens the +manager will automatically set a return code to -EINVAL and record a trace-back +in the error output. This trace-back can be very long in some cases. The second +approach is to handle an exception within a try-except block and convert the +exception to an error code that better fits the exception (converting a +KeyError to -ENOENT, for example). In this case the error output may also be +set to something more specific and actionable by the one calling the command. 
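A brief sketch of the second approach (the ``self.find_widget`` helper and the command
itself are hypothetical):

.. code:: python

   import errno

   from mgr_module import HandleCommandResult

   def handle_widget_get(self, name: str) -> HandleCommandResult:
       try:
           widget = self.find_widget(name)   # hypothetical lookup that may raise KeyError
       except KeyError:
           return HandleCommandResult(-errno.ENOENT,
                                      stderr=f"widget '{name}' does not exist")
       return HandleCommandResult(stdout=widget.describe())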
+ +In many cases, especially in more recent versions of Ceph, manager commands are +designed to return structured output to the caller. Structured output includes +machine-parsable data such as JSON, YAML, XML, etc. JSON is the most common +structured output format returned by manager commands. As of Ceph Reef, there +are a number of new decorators available from the ``object_format`` module that +help manage formatting output and handling exceptions automatically. The +intent is that most of the implementation of a manager command can be written in +an idiomatic (aka "Pythonic") style and the decorators will take care of most of +the work needed to format the output and return manager response tuples. + +In most cases, net new code should use the ``Responder`` decorator. Example: + +.. code:: python + + @CLICommand('antigravity list wormholes', perm='r') + @Responder() + def list_wormholes(self, oid: str, details: bool = False) -> List[Dict[str, Any]]: + '''List wormholes associated with the supplied oid. + ''' + with self.open_wormhole_db() as db: + wormholes = db.query(oid=oid) + if not details: + return [{'name': wh.name} for wh in wormholes] + return [{'name': wh.name, 'age': wh.get_age(), 'destination': wh.dest} + for wh in wormholes] + +Formatting +++++++++++ + +The ``Responder`` decorator automatically takes care of converting Python +objects into a response tuple with formatted output. By default, this decorator +can automatically return JSON and YAML. When invoked from the command line the +``--format`` flag can be used to select the response format. If left +unspecified, JSON will be returned. The automatic formatting can be applied to +any basic Python type: lists, dicts, str, int, etc. Other objects can be +formatted automatically if they meet the ``SimpleDataProvider`` protocol - they +provide a ``to_simplified`` method. The ``to_simplified`` function must return +a simplified representation of the object made out of basic types. + +.. code:: python + + class MyCleverObject: + def to_simplified(self) -> Dict[str, int]: + # returns a python object(s) made up from basic types + return {"gravitons": 999, "tachyons": 404} + + @CLICommand('antigravity list wormholes', perm='r') + @Responder() + def list_wormholes(self, oid: str, details: bool = False) -> MyCleverObject: + '''List wormholes associated with the supplied oid. + ''' + ... + +The behavior of the automatic output formatting can be customized and extednted +to other types of formatting (XML, Plain Text, etc). As this is a complex +topic, please refer to the module documentation for the ``object_format`` +module. + + + +Error Handling +++++++++++++++ + +Additionally, the ``Responder`` decorator can automatically handle converting +some exceptions into response tuples. Any raised exception inheriting from +``ErrorResponseBase`` will be automatically converted into a response tuple. +The common approach will be to use ``ErrorResponse``, an exception type that +can be used directly and has arguments for the error output and return value or +it can be constructed from an existing exception using the ``wrap`` +classmethod. The wrap classmethod will automatically use the exception text and +if available the ``errno`` property of other exceptions. + +Converting our previous example to use this exception handling approach: + +.. 
code:: python + + @CLICommand('antigravity list wormholes', perm='r') + @Responder() + def list_wormholes(self, oid: str, details: bool = False) -> List[Dict[str, Any]]: + '''List wormholes associated with the supplied oid. + ''' + try: + with self.open_wormhole_db() as db: + wormholes = db.query(oid=oid) + except UnknownOIDError: + raise ErrorResponse(f"Unknown oid: {oid}", return_value=-errno.ENOENT) + except WormholeDBError as err: + raise ErrorResponse.wrap(err) + if not details: + return [{'name': wh.name} for wh in wormholes] + return [{'name': wh.name, 'age': wh.get_age(), 'destination': wh.dest} + for wh in wormholes] + + +.. note:: Because the decorator can not determine the difference between a + programming mistake and an expected error condition it does not try to + catch all exceptions. + + + +Additional Decorators ++++++++++++++++++++++ + +The ``object_format`` module provides additional decorators to complement +``Responder`` but for cases where ``Responder`` is insufficient or too "heavy +weight". + +The ``ErrorResponseHandler`` decorator exists for cases where you *must* still +return a manager response tuple but want to handle errors as exceptions (as in +typical Python code). In short, it works like ``Responder`` but only with +regards to exceptions. Just like ``Responder`` it handles exceptions that +inherit from ``ErrorResponseBase``. This can be useful in cases where you need +to return raw data in the output. Example: + +.. code:: python + + @CLICommand('antigravity dump config', perm='r') + @ErrorResponseHandler() + def dump_config(self, oid: str) -> Tuple[int, str, str]: + '''Dump configuration + ''' + # we have no control over what data is inside the blob! + try: + blob = self.fetch_raw_config_blob(oid) + return 0, blob, '' + except KeyError: + raise ErrorResponse("Blob does not exist", return_value=-errno.ENOENT) + + +The ``EmptyResponder`` decorator exists for cases where, on a success +condition, no output should be generated at all. If you used ``Responder`` and +default JSON formatting you may always see outputs like ``{}`` or ``[]`` if the +command completes without error. Instead, ``EmptyResponder`` helps you create +manager commands that obey the `Rule of Silence`_ when the command has no +interesting output to emit on success. The functions that ``EmptyResponder`` +decorate should always return ``None``. Like both ``Responder`` and +``ErrorResponseHandler`` exceptions that inhert from ``ErrorResponseBase`` will +be automatically processed. Example: + +.. code:: python + + @CLICommand('antigravity create wormhole', perm='rw') + @EmptyResponder() + def create_wormhole(self, oid: str, name: str) -> None: + '''Create a new wormhole. + ''' + try: + with self.open_wormhole_db() as db: + wh = Wormhole(name) + db.insert(oid=oid, wormhole=wh) + except UnknownOIDError: + raise ErrorResponse(f"Unknown oid: {oid}", return_value=-errno.ENOENT) + except InvalidWormholeError as err: + raise ErrorResponse.wrap(err) + except WormholeDBError as err: + raise ErrorResponse.wrap(err) + + +.. _`Rule of Silence`: http://www.linfo.org/rule_of_silence.html + + +Configuration options +--------------------- + +Modules can load and store configuration options using the +``set_module_option`` and ``get_module_option`` methods. + +.. note:: Use ``set_module_option`` and ``get_module_option`` to + manage user-visible configuration options that are not blobs (like + certificates). If you want to persist module-internal data or + binary configuration data consider using the `KV store`_. 
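To make the distinction drawn in the note above concrete, here is a hedged sketch
(the option and key names are made up) of a module method that reads a small
user-visible option and persists a larger piece of state:

.. code:: python

   import json

   # Inside a MgrModule subclass; 'refresh_interval' would have to be
   # declared in MODULE_OPTIONS as described below.
   def refresh(self) -> None:
       interval = self.get_module_option('refresh_interval')   # small, user-visible setting
       report = {'generated_by': self.get_mgr_id()}
       self.set_store('last_report', json.dumps(report))       # larger blob goes to the KV store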
+ +You must declare your available configuration options in the +``MODULE_OPTIONS`` class attribute, like this: + +.. code-block:: python + + MODULE_OPTIONS = [ + Option(name="my_option") + ] + +If you try to use set_module_option or get_module_option on options not declared +in ``MODULE_OPTIONS``, an exception will be raised. + +You may choose to provide setter commands in your module to perform +high level validation. Users can also modify configuration using +the normal `ceph config set` command, where the configuration options +for a mgr module are named like `mgr/<module name>/<option>`. + +If a configuration option is different depending on which node the mgr +is running on, then use *localized* configuration ( +``get_localized_module_option``, ``set_localized_module_option``). +This may be necessary for options such as what address to listen on. +Localized options may also be set externally with ``ceph config set``, +where they key name is like ``mgr/<module name>/<mgr id>/<option>`` + +If you need to load and store data (e.g. something larger, binary, or multiline), +use the KV store instead of configuration options (see next section). + +Hints for using config options: + +* Reads are fast: ceph-mgr keeps a local in-memory copy, so in many cases + you can just do a get_module_option every time you use a option, rather than + copying it out into a variable. +* Writes block until the value is persisted (i.e. round trip to the monitor), + but reads from another thread will see the new value immediately. +* If a user has used `config set` from the command line, then the new + value will become visible to `get_module_option` immediately, although the + mon->mgr update is asynchronous, so `config set` will return a fraction + of a second before the new value is visible on the mgr. +* To delete a config value (i.e. revert to default), just pass ``None`` to + set_module_option. + +.. automethod:: MgrModule.get_module_option +.. automethod:: MgrModule.set_module_option +.. automethod:: MgrModule.get_localized_module_option +.. automethod:: MgrModule.set_localized_module_option + +KV store +-------- + +Modules have access to a private (per-module) key value store, which +is implemented using the monitor's "config-key" commands. Use +the ``set_store`` and ``get_store`` methods to access the KV store from +your module. + +The KV store commands work in a similar way to the configuration +commands. Reads are fast, operating from a local cache. Writes block +on persistence and do a round trip to the monitor. + +This data can be access from outside of ceph-mgr using the +``ceph config-key [get|set]`` commands. Key names follow the same +conventions as configuration options. Note that any values updated +from outside of ceph-mgr will not be seen by running modules until +the next restart. Users should be discouraged from accessing module KV +data externally -- if it is necessary for users to populate data, modules +should provide special commands to set the data via the module. + +Use the ``get_store_prefix`` function to enumerate keys within +a particular prefix (i.e. all keys starting with a particular substring). + + +.. automethod:: MgrModule.get_store +.. automethod:: MgrModule.set_store +.. automethod:: MgrModule.get_localized_store +.. automethod:: MgrModule.set_localized_store +.. automethod:: MgrModule.get_store_prefix + + +Accessing cluster data +---------------------- + +Modules have access to the in-memory copies of the Ceph cluster's +state that the mgr maintains. 
Accessor functions as exposed +as members of MgrModule. + +Calls that access the cluster or daemon state are generally going +from Python into native C++ routines. There is some overhead to this, +but much less than for example calling into a REST API or calling into +an SQL database. + +There are no consistency rules about access to cluster structures or +daemon metadata. For example, an OSD might exist in OSDMap but +have no metadata, or vice versa. On a healthy cluster these +will be very rare transient states, but modules should be written +to cope with the possibility. + +Note that these accessors must not be called in the modules ``__init__`` +function. This will result in a circular locking exception. + +.. automethod:: MgrModule.get +.. automethod:: MgrModule.get_server +.. automethod:: MgrModule.list_servers +.. automethod:: MgrModule.get_metadata +.. automethod:: MgrModule.get_daemon_status +.. automethod:: MgrModule.get_perf_schema +.. automethod:: MgrModule.get_counter +.. automethod:: MgrModule.get_mgr_id +.. automethod:: MgrModule.get_daemon_health_metrics + +Exposing health checks +---------------------- + +Modules can raise first class Ceph health checks, which will be reported +in the output of ``ceph status`` and in other places that report on the +cluster's health. + +If you use ``set_health_checks`` to report a problem, be sure to call +it again with an empty dict to clear your health check when the problem +goes away. + +.. automethod:: MgrModule.set_health_checks + +What if the mons are down? +-------------------------- + +The manager daemon gets much of its state (such as the cluster maps) +from the monitor. If the monitor cluster is inaccessible, whichever +manager was active will continue to run, with the latest state it saw +still in memory. + +However, if you are creating a module that shows the cluster state +to the user then you may well not want to mislead them by showing +them that out of date state. + +To check if the manager daemon currently has a connection to +the monitor cluster, use this function: + +.. automethod:: MgrModule.have_mon_connection + +Reporting if your module cannot run +----------------------------------- + +If your module cannot be run for any reason (such as a missing dependency), +then you can report that by implementing the ``can_run`` function. + +.. automethod:: MgrModule.can_run + +Note that this will only work properly if your module can always be imported: +if you are importing a dependency that may be absent, then do it in a +try/except block so that your module can be loaded far enough to use +``can_run`` even if the dependency is absent. + +Sending commands +---------------- + +A non-blocking facility is provided for sending monitor commands +to the cluster. + +.. automethod:: MgrModule.send_command + +Receiving notifications +----------------------- + +The manager daemon calls the ``notify`` function on all active modules +when certain important pieces of cluster state are updated, such as the +cluster maps. + +The actual data is not passed into this function, rather it is a cue for +the module to go and read the relevant structure if it is interested. Most +modules ignore most types of notification: to ignore a notification +simply return from this function without doing anything. + +.. 
automethod:: MgrModule.notify + +Accessing RADOS or CephFS +------------------------- + +If you want to use the librados python API to access data stored in +the Ceph cluster, you can access the ``rados`` attribute of your +``MgrModule`` instance. This is an instance of ``rados.Rados`` which +has been constructed for you using the existing Ceph context (an internal +detail of the C++ Ceph code) of the mgr daemon. + +Always use this specially constructed librados instance instead of +constructing one by hand. + +Similarly, if you are using libcephfs to access the file system, then +use the libcephfs ``create_with_rados`` to construct it from the +``MgrModule.rados`` librados instance, and thereby inherit the correct context. + +Remember that your module may be running while other parts of the cluster +are down: do not assume that librados or libcephfs calls will return +promptly -- consider whether to use timeouts or to block if the rest of +the cluster is not fully available. + +Implementing standby mode +------------------------- + +For some modules, it is useful to run on standby manager daemons as well +as on the active daemon. For example, an HTTP server can usefully +serve HTTP redirect responses from the standby managers so that +the user can point his browser at any of the manager daemons without +having to worry about which one is active. + +Standby manager daemons look for a subclass of ``StandbyModule`` +in each module. If the class is not found then the module is not +used at all on standby daemons. If the class is found, then +its ``serve`` method is called. Implementations of ``StandbyModule`` +must inherit from ``mgr_module.MgrStandbyModule``. + +The interface of ``MgrStandbyModule`` is much restricted compared to +``MgrModule`` -- none of the Ceph cluster state is available to +the module. ``serve`` and ``shutdown`` methods are used in the same +way as a normal module class. The ``get_active_uri`` method enables +the standby module to discover the address of its active peer in +order to make redirects. See the ``MgrStandbyModule`` definition +in the Ceph source code for the full list of methods. + +For an example of how to use this interface, look at the source code +of the ``dashboard`` module. + +Communicating between modules +----------------------------- + +Modules can invoke member functions of other modules. + +.. automethod:: MgrModule.remote + +Be sure to handle ``ImportError`` to deal with the case that the desired +module is not enabled. + +If the remote method raises a python exception, this will be converted +to a RuntimeError on the calling side, where the message string describes +the exception that was originally thrown. If your logic intends +to handle certain errors cleanly, it is better to modify the remote method +to return an error value instead of raising an exception. + +At time of writing, inter-module calls are implemented without +copies or serialization, so when you return a python object, you're +returning a reference to that object to the calling module. It +is recommend *not* to rely on this reference passing, as in future the +implementation may change to serialize arguments and return +values. + + +Shutting down cleanly +--------------------- + +If a module implements the ``serve()`` method, it should also implement +the ``shutdown()`` method to shutdown cleanly: misbehaving modules +may otherwise prevent clean shutdown of ceph-mgr. 
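
For example, a long-running module might pair ``serve()`` and ``shutdown()``
around a ``threading.Event``. The sketch below is illustrative only (the class
name and the 30-second interval are arbitrary), but real modules follow the
same pattern:

.. code:: python

    import threading

    from mgr_module import MgrModule


    class MyModule(MgrModule):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Do not call back into ceph-mgr (e.g. get_module_option) here;
            # defer such work to serve().
            self._shutdown_event = threading.Event()

        def serve(self):
            # Main loop: wake up periodically until shutdown() is requested.
            while not self._shutdown_event.is_set():
                self.log.debug("performing periodic work")
                self._shutdown_event.wait(30)

        def shutdown(self):
            # Called by ceph-mgr when unloading the module; unblock serve()
            # promptly so the daemon can shut down cleanly.
            self._shutdown_event.set()
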

Limitations
-----------

It is not possible to call back into C++ code from a module's
``__init__()`` method. For example, calling ``self.get_module_option()`` at
this point will result in an assertion failure in ceph-mgr. For modules
that implement the ``serve()`` method, it usually makes sense to do most
initialization inside that method instead.

Debugging
---------

You can always use the :ref:`mgr module dev logging` facility for debugging
a ceph-mgr module, but some of us might miss `PDB`_ and the interactive
Python interpreter. We can have them as well when developing ceph-mgr
modules: ``ceph_mgr_repl.py`` can drop you into an interactive shell talking
to the ``selftest`` module. With this tool, one can peek and poke the
ceph-mgr module, and use all the exposed facilities in much the same way as
the ordinary Python command line interpreter. To use ``ceph_mgr_repl.py``,
we need to

#. ready a Ceph cluster
#. enable the ``selftest`` module
#. set up the necessary environment variables
#. launch the tool

.. _PDB: https://docs.python.org/3/library/pdb.html

The following is a sample session, in which the Ceph version is queried by
entering ``print(mgr.version)`` at the prompt, and the ``timeit`` module is
then imported to measure the execution time of ``mgr.get_mgr_id()``.

.. code-block:: console

   $ cd build
   $ MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh -n -x
   $ bin/ceph mgr module enable selftest
   $ ../src/pybind/ceph_mgr_repl.py --show-env
   $ export PYTHONPATH=/home/me/ceph/src/pybind:/home/me/ceph/build/lib/cython_modules/lib.3:/home/me/ceph/src/python-common:$PYTHONPATH
   $ export LD_LIBRARY_PATH=/home/me/ceph/build/lib:$LD_LIBRARY_PATH
   $ ../src/pybind/ceph_mgr_repl.py
   Python 3.9.2 (default, Feb 28 2021, 17:03:44)
   [GCC 10.2.1 20210110] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   (MgrModuleInteractiveConsole)
   [mgr self-test eval] >>> print(mgr.version)
   ceph version Development (no_version) quincy (dev)
   [mgr self-test eval] >>> from timeit import timeit
   [mgr self-test eval] >>> timeit(mgr.get_mgr_id)
   0.16303414600042743
   [mgr self-test eval] >>>

If you want to "talk" to a ceph-mgr module other than ``selftest`` using this
tool, you can either add a command to the module you want to debug, exactly
like the ``mgr self-test eval`` command was added to ``selftest``, or make
this simpler by promoting the ``eval()`` method to a dedicated `Mixin`_
class, inheriting your ``MgrModule`` subclass from it, and defining a command
for it. Assuming the prefix of the command is ``mgr my-module eval``, one can
then run

.. prompt:: bash $

   ../src/pybind/ceph_mgr_repl.py --prefix "mgr my-module eval"


.. _Mixin: https://en.wikipedia.org/wiki/Mixin

Is something missing?
---------------------

The ceph-mgr python interface is not set in stone. If you have a need
that is not satisfied by the current interface, please bring it up
on the ceph-devel mailing list. While it is desired to avoid bloating
the interface, it is not generally very hard to expose existing data
to the Python code when there is a good reason.
+ diff --git a/doc/mgr/nfs.rst b/doc/mgr/nfs.rst new file mode 100644 index 000000000..7e6637684 --- /dev/null +++ b/doc/mgr/nfs.rst @@ -0,0 +1,680 @@ +.. _mgr-nfs: + +============================= +CephFS & RGW Exports over NFS +============================= + +CephFS namespaces and RGW buckets can be exported over NFS protocol +using the `NFS-Ganesha NFS server`_. + +The ``nfs`` manager module provides a general interface for managing +NFS exports of either CephFS directories or RGW buckets. Exports can +be managed either via the CLI ``ceph nfs export ...`` commands +or via the dashboard. + +The deployment of the nfs-ganesha daemons can also be managed +automatically if either the :ref:`cephadm` or :ref:`mgr-rook` +orchestrators are enabled. If neither are in use (e.g., Ceph is +deployed via an external orchestrator like Ansible or Puppet), the +nfs-ganesha daemons must be manually deployed; for more information, +see :ref:`nfs-ganesha-config`. + +.. note:: Starting with Ceph Pacific, the ``nfs`` mgr module must be enabled. + +NFS Cluster management +====================== + +.. _nfs-module-cluster-create: + +Create NFS Ganesha Cluster +-------------------------- + +.. code:: bash + + $ ceph nfs cluster create <cluster_id> [<placement>] [--ingress] [--virtual_ip <value>] [--ingress-mode {default|keepalive-only|haproxy-standard|haproxy-protocol}] [--port <int>] + +This creates a common recovery pool for all NFS Ganesha daemons, new user based on +``cluster_id``, and a common NFS Ganesha config RADOS object. + +.. note:: Since this command also brings up NFS Ganesha daemons using a ceph-mgr + orchestrator module (see :doc:`/mgr/orchestrator`) such as cephadm or rook, at + least one such module must be enabled for it to work. + + Currently, NFS Ganesha daemon deployed by cephadm listens on the standard + port. So only one daemon will be deployed on a host. + +``<cluster_id>`` is an arbitrary string by which this NFS Ganesha cluster will be +known (e.g., ``mynfs``). + +``<placement>`` is an optional string signifying which hosts should have NFS Ganesha +daemon containers running on them and, optionally, the total number of NFS +Ganesha daemons on the cluster (should you want to have more than one NFS Ganesha +daemon running per node). For example, the following placement string means +"deploy NFS Ganesha daemons on nodes host1 and host2 (one daemon per host):: + + "host1,host2" + +and this placement specification says to deploy single NFS Ganesha daemon each +on nodes host1 and host2 (for a total of two NFS Ganesha daemons in the +cluster):: + + "2 host1,host2" + +NFS can be deployed on a port other than 2049 (the default) with ``--port <port>``. + +To deploy NFS with a high-availability front-end (virtual IP and load balancer), add the +``--ingress`` flag and specify a virtual IP address. This will deploy a combination +of keepalived and haproxy to provide an high-availability NFS frontend for the NFS +service. + +.. note:: The ingress implementation is not yet complete. Enabling + ingress will deploy multiple ganesha instances and balance + load across them, but a host failure will not immediately + cause cephadm to deploy a replacement daemon before the NFS + grace period expires. This high-availability functionality + is expected to be completed by the Quincy release (March + 2022). + +For more details, refer :ref:`orchestrator-cli-placement-spec` but keep +in mind that specifying the placement via a YAML file is not supported. 
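
For example (the cluster names, hosts, and virtual IP below are illustrative
only), a basic two-daemon cluster and an ingress-backed cluster might be
created with:

.. code:: bash

   $ ceph nfs cluster create mynfs "2 host1,host2"
   $ ceph nfs cluster create mynfs-ha "2 host1,host2" --ingress --virtual_ip 192.168.10.10
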
+ +Deployment of NFS daemons and the ingress service is asynchronous: the +command may return before the services have completely started. You may +wish to check that these services do successfully start and stay running. +When using cephadm orchestration, these commands check service status: + +.. code:: bash + + $ ceph orch ls --service_name=nfs.<cluster_id> + $ ceph orch ls --service_name=ingress.nfs.<cluster_id> + + +Ingress +------- + +The core *nfs* service will deploy one or more nfs-ganesha daemons, +each of which will provide a working NFS endpoint. The IP for each +NFS endpoint will depend on which host the nfs-ganesha daemons are +deployed. By default, daemons are placed semi-randomly, but users can +also explicitly control where daemons are placed; see +:ref:`orchestrator-cli-placement-spec`. + +When a cluster is created with ``--ingress``, an *ingress* service is +additionally deployed to provide load balancing and high-availability +for the NFS servers. A virtual IP is used to provide a known, stable +NFS endpoint that all clients can use to mount. Ceph will take care +of the details of NFS redirecting traffic on the virtual IP to the +appropriate backend NFS servers, and redeploying NFS servers when they +fail. + +An optional ``--ingress-mode`` parameter can be provided to choose +how the *ingress* service is configured: + +- Setting ``--ingress-mode keepalive-only`` deploys a simplified *ingress* + service that provides a virtual IP with the nfs server directly binding to + that virtual IP and leaves out any sort of load balancing or traffic + redirection. This setup will restrict users to deploying only 1 nfs daemon + as multiple cannot bind to the same port on the virtual IP. +- Setting ``--ingress-mode haproxy-standard`` deploys a full *ingress* service + to provide load balancing and high-availability using HAProxy and keepalived. + Client IP addresses are not visible to the back-end NFS server and IP level + restrictions on NFS exports will not function. +- Setting ``--ingress-mode haproxy-protocol`` deploys a full *ingress* service + to provide load balancing and high-availability using HAProxy and keepalived. + Client IP addresses are visible to the back-end NFS server and IP level + restrictions on NFS exports are usable. This mode requires NFS Ganesha version + 5.0 or later. +- Setting ``--ingress-mode default`` is equivalent to not providing any other + ingress mode by name. When no other ingress mode is specified by name + the default ingress mode used is ``haproxy-standard``. + +Ingress can be added to an existing NFS service (e.g., one initially created +without the ``--ingress`` flag), and the basic NFS service can +also be modified after the fact to include non-default options, by modifying +the services directly. For more information, see :ref:`cephadm-ha-nfs`. + +Show NFS Cluster IP(s) +---------------------- + +To examine an NFS cluster's IP endpoints, including the IPs for the individual NFS +daemons, and the virtual IP (if any) for the ingress service, + +.. code:: bash + + $ ceph nfs cluster info [<cluster_id>] + +.. note:: This will not work with the rook backend. Instead, expose the port with + the kubectl patch command and fetch the port details with kubectl get services + command:: + + $ kubectl patch service -n rook-ceph -p '{"spec":{"type": "NodePort"}}' rook-ceph-nfs-<cluster-name>-<node-id> + $ kubectl get services -n rook-ceph rook-ceph-nfs-<cluster-name>-<node-id> + + +Delete NFS Ganesha Cluster +-------------------------- + +.. 
code:: bash + + $ ceph nfs cluster rm <cluster_id> + +This deletes the deployed cluster. + + +Removal of NFS daemons and the ingress service is asynchronous: the +command may return before the services have been completely deleted. You may +wish to check that these services are no longer reported. When using cephadm +orchestration, these commands check service status: + +.. code:: bash + + $ ceph orch ls --service_name=nfs.<cluster_id> + $ ceph orch ls --service_name=ingress.nfs.<cluster_id> + + +Updating an NFS Cluster +----------------------- + +In order to modify cluster parameters (like the port or placement), you need to +use the orchestrator interface to update the NFS service spec. The safest way to do +that is to export the current spec, modify it, and then re-apply it. For example, +to modify the ``nfs.foo`` service, + +.. code:: bash + + $ ceph orch ls --service-name nfs.foo --export > nfs.foo.yaml + $ vi nfs.foo.yaml + $ ceph orch apply -i nfs.foo.yaml + +For more information about the NFS service spec, see :ref:`deploy-cephadm-nfs-ganesha`. + +List NFS Ganesha Clusters +------------------------- + +.. code:: bash + + $ ceph nfs cluster ls + +This lists deployed clusters. + +.. _nfs-cluster-set: + +Set Customized NFS Ganesha Configuration +---------------------------------------- + +.. code:: bash + + $ ceph nfs cluster config set <cluster_id> -i <config_file> + +With this the nfs cluster will use the specified config and it will have +precedence over default config blocks. + +Example use cases include: + +#. Changing log level. The logging level can be adjusted with the following config + fragment:: + + LOG { + COMPONENTS { + ALL = FULL_DEBUG; + } + } + +#. Adding custom export block. + + The following sample block creates a single export. This export will not be + managed by `ceph nfs export` interface:: + + EXPORT { + Export_Id = 100; + Transports = TCP; + Path = /; + Pseudo = /ceph/; + Protocols = 4; + Access_Type = RW; + Attr_Expiration_Time = 0; + Squash = None; + FSAL { + Name = CEPH; + Filesystem = "filesystem name"; + User_Id = "user id"; + Secret_Access_Key = "secret key"; + } + } + +.. note:: User specified in FSAL block should have proper caps for NFS-Ganesha + daemons to access ceph cluster. User can be created in following way using + `auth get-or-create`:: + + # ceph auth get-or-create client.<user_id> mon 'allow r' osd 'allow rw pool=.nfs namespace=<nfs_cluster_name>, allow rw tag cephfs data=<fs_name>' mds 'allow rw path=<export_path>' + +View Customized NFS Ganesha Configuration +----------------------------------------- + +.. code:: bash + + $ ceph nfs cluster config get <cluster_id> + +This will output the user defined configuration (if any). + +Reset NFS Ganesha Configuration +------------------------------- + +.. code:: bash + + $ ceph nfs cluster config reset <cluster_id> + +This removes the user defined configuration. + +.. note:: With a rook deployment, ganesha pods must be explicitly restarted + for the new config blocks to be effective. + + +Export Management +================= + +.. warning:: Currently, the nfs interface is not integrated with dashboard. Both + dashboard and nfs interface have different export requirements and + create exports differently. Management of dashboard created exports is not + supported. + +Create CephFS Export +-------------------- + +.. code:: bash + + $ ceph nfs export create cephfs --cluster-id <cluster_id> --pseudo-path <pseudo_path> --fsname <fsname> [--readonly] [--path=/path/in/cephfs] [--client_addr <value>...] 
[--squash <value>] [--sectype <value>...] + +This creates export RADOS objects containing the export block, where + +``<cluster_id>`` is the NFS Ganesha cluster ID. + +``<pseudo_path>`` is the export position within the NFS v4 Pseudo Filesystem where the export will be available on the server. It must be an absolute path and unique. + +``<fsname>`` is the name of the FS volume used by the NFS Ganesha cluster +that will serve this export. + +``<path>`` is the path within cephfs. Valid path should be given and default +path is '/'. It need not be unique. Subvolume path can be fetched using: + +.. code:: + + $ ceph fs subvolume getpath <vol_name> <subvol_name> [--group_name <subvol_group_name>] + +``<client_addr>`` is the list of client address for which these export +permissions will be applicable. By default all clients can access the export +according to specified export permissions. See the `NFS-Ganesha Export Sample`_ +for permissible values. + +``<squash>`` defines the kind of user id squashing to be performed. The default +value is `no_root_squash`. See the `NFS-Ganesha Export Sample`_ for +permissible values. + +``<sectype>`` specifies which authentication methods will be used when +connecting to the export. Valid values include "krb5p", "krb5i", "krb5", "sys", +and "none". More than one value can be supplied. The flag may be specified +multiple times (example: ``--sectype=krb5p --sectype=krb5i``) or multiple +values may be separated by a comma (example: ``--sectype krb5p,krb5i``). The +server will negotatiate a supported security type with the client preferring +the supplied methods left-to-right. + +.. note:: Specifying values for sectype that require Kerberos will only function on servers + that are configured to support Kerberos. Setting up NFS-Ganesha to support Kerberos + is outside the scope of this document. + +.. note:: Export creation is supported only for NFS Ganesha clusters deployed using nfs interface. + +Create RGW Export +----------------- + +There are two kinds of RGW exports: + +- a *user* export will export all buckets owned by an + RGW user, where the top-level directory of the export is a list of buckets. +- a *bucket* export will export a single bucket, where the top-level directory contains + the objects in the bucket. + +RGW bucket export +^^^^^^^^^^^^^^^^^ + +To export a *bucket*: + +.. code:: + + $ ceph nfs export create rgw --cluster-id <cluster_id> --pseudo-path <pseudo_path> --bucket <bucket_name> [--user-id <user-id>] [--readonly] [--client_addr <value>...] [--squash <value>] [--sectype <value>...] + +For example, to export *mybucket* via NFS cluster *mynfs* at the pseudo-path */bucketdata* to any host in the ``192.168.10.0/24`` network + +.. code:: + + $ ceph nfs export create rgw --cluster-id mynfs --pseudo-path /bucketdata --bucket mybucket --client_addr 192.168.10.0/24 + +.. note:: Export creation is supported only for NFS Ganesha clusters deployed using nfs interface. + +``<cluster_id>`` is the NFS Ganesha cluster ID. + +``<pseudo_path>`` is the export position within the NFS v4 Pseudo Filesystem where the export will be available on the server. It must be an absolute path and unique. + +``<bucket_name>`` is the name of the bucket that will be exported. + +``<user_id>`` is optional, and specifies which RGW user will be used for read and write +operations to the bucket. If it is not specified, the user who owns the bucket will be +used. + +.. note:: Currently, if multi-site RGW is enabled, Ceph can only export RGW buckets in the default realm. 
+ +``<client_addr>`` is the list of client address for which these export +permissions will be applicable. By default all clients can access the export +according to specified export permissions. See the `NFS-Ganesha Export Sample`_ +for permissible values. + +``<squash>`` defines the kind of user id squashing to be performed. The default +value is `no_root_squash`. See the `NFS-Ganesha Export Sample`_ for +permissible values. + +``<sectype>`` specifies which authentication methods will be used when +connecting to the export. Valid values include "krb5p", "krb5i", "krb5", "sys", +and "none". More than one value can be supplied. The flag may be specified +multiple times (example: ``--sectype=krb5p --sectype=krb5i``) or multiple +values may be separated by a comma (example: ``--sectype krb5p,krb5i``). The +server will negotatiate a supported security type with the client preferring +the supplied methods left-to-right. + +.. note:: Specifying values for sectype that require Kerberos will only function on servers + that are configured to support Kerberos. Setting up NFS-Ganesha to support Kerberos + is outside the scope of this document. + +RGW user export +^^^^^^^^^^^^^^^ + +To export an RGW *user*: + +.. code:: + + $ ceph nfs export create rgw --cluster-id <cluster_id> --pseudo-path <pseudo_path> --user-id <user-id> [--readonly] [--client_addr <value>...] [--squash <value>] + +For example, to export *myuser* via NFS cluster *mynfs* at the pseudo-path */myuser* to any host in the ``192.168.10.0/24`` network + +.. code:: + + $ ceph nfs export create rgw --cluster-id mynfs --pseudo-path /bucketdata --user-id myuser --client_addr 192.168.10.0/24 + + +Delete Export +------------- + +.. code:: bash + + $ ceph nfs export rm <cluster_id> <pseudo_path> + +This deletes an export in an NFS Ganesha cluster, where: + +``<cluster_id>`` is the NFS Ganesha cluster ID. + +``<pseudo_path>`` is the pseudo root path (must be an absolute path). + +List Exports +------------ + +.. code:: bash + + $ ceph nfs export ls <cluster_id> [--detailed] + +It lists exports for a cluster, where: + +``<cluster_id>`` is the NFS Ganesha cluster ID. + +With the ``--detailed`` option enabled it shows entire export block. + +Get Export +---------- + +.. code:: bash + + $ ceph nfs export info <cluster_id> <pseudo_path> + +This displays export block for a cluster based on pseudo root name, +where: + +``<cluster_id>`` is the NFS Ganesha cluster ID. + +``<pseudo_path>`` is the pseudo root path (must be an absolute path). + + +Create or update export via JSON specification +---------------------------------------------- + +An existing export can be dumped in JSON format with: + +.. prompt:: bash # + + ceph nfs export info *<cluster_id>* *<pseudo_path>* + +An export can be created or modified by importing a JSON description in the +same format: + +.. prompt:: bash # + + ceph nfs export apply *<cluster_id>* -i <json_file> + +For example,:: + + $ ceph nfs export info mynfs /cephfs > update_cephfs_export.json + $ cat update_cephfs_export.json + { + "export_id": 1, + "path": "/", + "cluster_id": "mynfs", + "pseudo": "/cephfs", + "access_type": "RW", + "squash": "no_root_squash", + "security_label": true, + "protocols": [ + 4 + ], + "transports": [ + "TCP" + ], + "fsal": { + "name": "CEPH", + "user_id": "nfs.mynfs.1", + "fs_name": "a", + "sec_label_xattr": "" + }, + "clients": [] + } + +The imported JSON can be a single dict describing a single export, or a JSON list +containing multiple export dicts. 
+ +The exported JSON can be modified and then reapplied. Below, *pseudo* +and *access_type* are modified. When modifying an export, the +provided JSON should fully describe the new state of the export (just +as when creating a new export), with the exception of the +authentication credentials, which will be carried over from the +previous state of the export where possible. + +:: + + $ ceph nfs export apply mynfs -i update_cephfs_export.json + $ cat update_cephfs_export.json + { + "export_id": 1, + "path": "/", + "cluster_id": "mynfs", + "pseudo": "/cephfs_testing", + "access_type": "RO", + "squash": "no_root_squash", + "security_label": true, + "protocols": [ + 4 + ], + "transports": [ + "TCP" + ], + "fsal": { + "name": "CEPH", + "user_id": "nfs.mynfs.1", + "fs_name": "a", + "sec_label_xattr": "" + }, + "clients": [] + } + +An export can also be created or updated by injecting a Ganesha NFS EXPORT config +fragment. For example,:: + + $ ceph nfs export apply mynfs -i update_cephfs_export.conf + $ cat update_cephfs_export.conf + EXPORT { + FSAL { + name = "CEPH"; + filesystem = "a"; + } + export_id = 1; + path = "/"; + pseudo = "/a"; + access_type = "RW"; + squash = "none"; + attr_expiration_time = 0; + security_label = true; + protocols = 4; + transports = "TCP"; + } + + +Mounting +======== + +After the exports are successfully created and NFS Ganesha daemons are +deployed, exports can be mounted with: + +.. code:: bash + + $ mount -t nfs <ganesha-host-name>:<pseudo_path> <mount-point> + +For example, if the NFS cluster was created with ``--ingress --virtual-ip 192.168.10.10`` +and the export's pseudo-path was ``/foo``, the export can be mounted at ``/mnt`` with: + +.. code:: bash + + $ mount -t nfs 192.168.10.10:/foo /mnt + +If the NFS service is running on a non-standard port number: + +.. code:: bash + + $ mount -t nfs -o port=<ganesha-port> <ganesha-host-name>:<ganesha-pseudo_path> <mount-point> + +.. note:: Only NFS v4.0+ is supported. + +Troubleshooting +=============== + +Checking NFS-Ganesha logs with + +1) ``cephadm``: The NFS daemons can be listed with: + + .. code:: bash + + $ ceph orch ps --daemon-type nfs + + You can via the logs for a specific daemon (e.g., ``nfs.mynfs.0.0.myhost.xkfzal``) on + the relevant host with: + + .. code:: bash + + # cephadm logs --fsid <fsid> --name nfs.mynfs.0.0.myhost.xkfzal + +2) ``rook``: + + .. code:: bash + + $ kubectl logs -n rook-ceph rook-ceph-nfs-<cluster_id>-<node_id> nfs-ganesha + +The NFS log level can be adjusted using `nfs cluster config set` command (see :ref:`nfs-cluster-set`). + + +.. _nfs-ganesha-config: + + +Manual Ganesha deployment +========================= + +It may be possible to deploy and manage the NFS ganesha daemons without +orchestration frameworks such as cephadm or rook. + +.. note:: Manual configuration is not tested or fully documented; your + mileage may vary. If you make this work, please help us by + updating this documentation. + +Limitations +------------ + +If no orchestrator module is enabled for the Ceph Manager the NFS cluster +management commands, such as those starting with ``ceph nfs cluster``, will not +function. However, commands that manage NFS exports, like those prefixed with +``ceph nfs export`` are expected to work as long as the necessary RADOS objects +have already been created. The exact RADOS objects required are not documented +at this time as support for this feature is incomplete. 
A curious reader can +find some details about the object by reading the source code for the +``mgr/nfs`` module (found in the ceph source tree under +``src/pybind/mgr/nfs``). + + +Requirements +------------ + +The following packages are required to enable CephFS and RGW exports with nfs-ganesha: + +- ``nfs-ganesha``, ``nfs-ganesha-ceph``, ``nfs-ganesha-rados-grace`` and + ``nfs-ganesha-rados-urls`` packages (version 3.3 and above) + +Ganesha Configuration Hierarchy +------------------------------- + +Cephadm and rook start each nfs-ganesha daemon with a minimal +`bootstrap` configuration file that pulls from a shared `common` +configuration stored in the ``.nfs`` RADOS pool and watches the common +config for changes. Each export is written to a separate RADOS object +that is referenced by URL from the common config. + +.. ditaa:: + + rados://$pool/$namespace/export-$i rados://$pool/$namespace/userconf-nfs.$cluster_id + (export config) (user config) + + +----------+ +----------+ +----------+ +---------------------------+ + | | | | | | | | + | export-1 | | export-2 | | export-3 | | userconf-nfs.$cluster_id | + | | | | | | | | + +----+-----+ +----+-----+ +-----+----+ +-------------+-------------+ + ^ ^ ^ ^ + | | | | + +--------------------------------+-------------------------+ + %url | + | + +--------+--------+ + | | rados://$pool/$namespace/conf-nfs.$svc + | conf+nfs.$svc | (common config) + | | + +--------+--------+ + ^ + | + watch_url | + +----------------------------------------------+ + | | | + | | | RADOS + +----------------------------------------------------------------------------------+ + | | | CONTAINER + watch_url | watch_url | watch_url | + | | | + +--------+-------+ +--------+-------+ +-------+--------+ + | | | | | | /etc/ganesha/ganesha.conf + | nfs.$svc.a | | nfs.$svc.b | | nfs.$svc.c | (bootstrap config) + | | | | | | + +----------------+ +----------------+ +----------------+ + + +.. _NFS-Ganesha NFS Server: https://github.com/nfs-ganesha/nfs-ganesha/wiki +.. _NFS-Ganesha Export Sample: https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/config_samples/export.txt diff --git a/doc/mgr/orchestrator.rst b/doc/mgr/orchestrator.rst new file mode 100644 index 000000000..7cb6ddeec --- /dev/null +++ b/doc/mgr/orchestrator.rst @@ -0,0 +1,240 @@ + +.. _orchestrator-cli-module: + +================ +Orchestrator CLI +================ + +This module provides a command line interface (CLI) for orchestrator modules. +Orchestrator modules are ``ceph-mgr`` plugins that interface with external +orchestration services. + +Definition of Terms +=================== + +The orchestrator CLI unifies multiple external orchestrators, so we need a +common nomenclature for the orchestrator module: + ++--------------------------------------+---------------------------------------+ +| *host* | hostname (not the DNS name) of the | +| | physical host. Not the podname, | +| | container name, or hostname inside | +| | the container. | ++--------------------------------------+---------------------------------------+ +| *service type* | The type of the service. e.g., nfs, | +| | mds, osd, mon, rgw, mgr, iscsi | ++--------------------------------------+---------------------------------------+ +| *service* | A logical service. 
Typically | +| | comprised of multiple service | +| | instances on multiple hosts for HA | +| | | +| | * ``fs_name`` for mds type | +| | * ``rgw_zone`` for rgw type | +| | * ``ganesha_cluster_id`` for nfs type | ++--------------------------------------+---------------------------------------+ +| *daemon* | A single instance of a service. | +| | Usually a daemon, but maybe not | +| | (e.g., might be a kernel service | +| | like LIO or knfsd or whatever) | +| | | +| | This identifier should | +| | uniquely identify the instance. | ++--------------------------------------+---------------------------------------+ + +Here is how the names relate: + +* A *service* has a specific *service type*. +* A *daemon* is a physical instance of a *service type*. + +.. note:: + + Orchestrator modules might implement only a subset of the commands listed + below. The implementation of the commands may differ between modules. + +Status +====== + +.. prompt:: bash $ + + ceph orch status [--detail] + +This command shows the current orchestrator mode and its high-level status +(whether the orchestrator plugin is available and operational). + + +.. + Turn On Device Lights + ^^^^^^^^^^^^^^^^^^^^^ + :: + + ceph orch device ident-on <dev_id> + ceph orch device ident-on <dev_name> <host> + ceph orch device fault-on <dev_id> + ceph orch device fault-on <dev_name> <host> + + ceph orch device ident-off <dev_id> [--force=true] + ceph orch device ident-off <dev_id> <host> [--force=true] + ceph orch device fault-off <dev_id> [--force=true] + ceph orch device fault-off <dev_id> <host> [--force=true] + + where ``dev_id`` is the device id as listed in ``osd metadata``, + ``dev_name`` is the name of the device on the system and ``host`` is the host as + returned by ``orchestrator host ls`` + + ceph orch osd ident-on {primary,journal,db,wal,all} <osd-id> + ceph orch osd ident-off {primary,journal,db,wal,all} <osd-id> + ceph orch osd fault-on {primary,journal,db,wal,all} <osd-id> + ceph orch osd fault-off {primary,journal,db,wal,all} <osd-id> + + where ``journal`` is the filestore journal device, ``wal`` is the bluestore + write ahead log device, and ``all`` stands for all devices associated with the OSD + + +.. _orchestrator-cli-stateless-services: + +Stateless services (MDS/RGW/NFS/rbd-mirror/iSCSI) +================================================= + +.. note:: + + The orchestrator will not configure the services. See the relevant + documentation for details about how to configure particular services. + +The ``name`` parameter identifies the kind of the group of instances. The +following short list explains the meaning of the ``name`` parameter: + +* A CephFS file system identifies a group of MDS daemons. +* A zone name identifies a group of RGWs. + +Creating/growing/shrinking/removing services: + +.. prompt:: bash $ + + ceph orch apply mds <fs_name> [--placement=<placement>] [--dry-run] + ceph orch apply rgw <name> [--realm=<realm>] [--zone=<zone>] [--port=<port>] [--ssl] [--placement=<placement>] [--dry-run] + ceph orch apply nfs <name> <pool> [--namespace=<namespace>] [--placement=<placement>] [--dry-run] + ceph orch rm <service_name> [--force] + +where ``placement`` is a :ref:`orchestrator-cli-placement-spec`. + +e.g., ``ceph orch apply mds myfs --placement="3 host1 host2 host3"`` + +Service Commands: + +.. prompt:: bash $ + + ceph orch <start|stop|restart|redeploy|reconfig> <service_name> + +.. note:: These commands apply only to cephadm containerized daemons. + +Options +======= + +.. 
option:: start + + Start the daemon on the corresponding host. + +.. option:: stop + + Stop the daemon on the corresponding host. + +.. option:: restart + + Restart the daemon on the corresponding host. + +.. option:: redeploy + + Redeploy the Ceph daemon on the corresponding host. This will recreate the daemon directory + structure under ``/var/lib/ceph/<fsid>/<daemon-name>`` (if it doesn't exist), refresh its + configuration files, regenerate its unit-files and restarts the systemd daemon. + +.. option:: reconfig + + Reconfigure the daemon on the corresponding host. This will refresh configuration files then restart the daemon. + + .. note:: this command assumes the daemon directory ``/var/lib/ceph/<fsid>/<daemon-name>`` already exists. + + +Configuring the Orchestrator CLI +================================ + +Enable the orchestrator by using the ``set backend`` command to select the orchestrator module that will be used: + +.. prompt:: bash $ + + ceph orch set backend <module> + +Example - Configuring the Orchestrator CLI +------------------------------------------ + +For example, to enable the Rook orchestrator module and use it with the CLI: + +.. prompt:: bash $ + + ceph mgr module enable rook + ceph orch set backend rook + +Confirm that the backend is properly configured: + +.. prompt:: bash $ + + ceph orch status + +Disable the Orchestrator +------------------------ + +To disable the orchestrator, use the empty string ``""``: + +.. prompt:: bash $ + + ceph orch set backend "" + ceph mgr module disable rook + +Current Implementation Status +============================= + +This is an overview of the current implementation status of the orchestrators. + +=================================== ====== ========= + Command Rook Cephadm +=================================== ====== ========= + apply iscsi ⚪ ✔ + apply mds ✔ ✔ + apply mgr ⚪ ✔ + apply mon ✔ ✔ + apply nfs ✔ ✔ + apply osd ✔ ✔ + apply rbd-mirror ✔ ✔ + apply cephfs-mirror ⚪ ✔ + apply grafana ⚪ ✔ + apply prometheus ❌ ✔ + apply alertmanager ❌ ✔ + apply node-exporter ❌ ✔ + apply rgw ✔ ✔ + apply container ⚪ ✔ + apply snmp-gateway ❌ ✔ + host add ⚪ ✔ + host ls ✔ ✔ + host rm ⚪ ✔ + host maintenance enter ❌ ✔ + host maintenance exit ❌ ✔ + daemon status ⚪ ✔ + daemon {stop,start,...} ⚪ ✔ + device {ident,fault}-(on,off} ⚪ ✔ + device ls ✔ ✔ + iscsi add ⚪ ✔ + mds add ⚪ ✔ + nfs add ⚪ ✔ + rbd-mirror add ⚪ ✔ + rgw add ⚪ ✔ + ls ✔ ✔ + ps ✔ ✔ + status ✔ ✔ + upgrade ❌ ✔ +=================================== ====== ========= + +where + +* ⚪ = not yet implemented +* ❌ = not applicable +* ✔ = implemented diff --git a/doc/mgr/orchestrator_modules.rst b/doc/mgr/orchestrator_modules.rst new file mode 100644 index 000000000..a28b43059 --- /dev/null +++ b/doc/mgr/orchestrator_modules.rst @@ -0,0 +1,332 @@ + + +.. _orchestrator-modules: + +.. py:currentmodule:: orchestrator + +ceph-mgr orchestrator modules +============================= + +.. warning:: + + This is developer documentation, describing Ceph internals that + are only relevant to people writing ceph-mgr orchestrator modules. + +In this context, *orchestrator* refers to some external service that +provides the ability to discover devices and create Ceph services. This +includes external projects such as Rook. + +An *orchestrator module* is a ceph-mgr module (:ref:`mgr-module-dev`) +which implements common management operations using a particular +orchestrator. 
+ +Orchestrator modules subclass the ``Orchestrator`` class: this class is +an interface, it only provides method definitions to be implemented +by subclasses. The purpose of defining this common interface +for different orchestrators is to enable common UI code, such as +the dashboard, to work with various different backends. + + +.. graphviz:: + + digraph G { + subgraph cluster_1 { + volumes [label="mgr/volumes"] + rook [label="mgr/rook"] + dashboard [label="mgr/dashboard"] + orchestrator_cli [label="mgr/orchestrator"] + orchestrator [label="Orchestrator Interface"] + cephadm [label="mgr/cephadm"] + + label = "ceph-mgr"; + } + + volumes -> orchestrator + dashboard -> orchestrator + orchestrator_cli -> orchestrator + orchestrator -> rook -> rook_io + orchestrator -> cephadm + + + rook_io [label="Rook"] + + rankdir="TB"; + } + +Behind all the abstraction, the purpose of orchestrator modules is simple: +enable Ceph to do things like discover available hardware, create and +destroy OSDs, and run MDS and RGW services. + +A tutorial is not included here: for full and concrete examples, see +the existing implemented orchestrator modules in the Ceph source tree. + +Glossary +-------- + +Stateful service + a daemon that uses local storage, such as OSD or mon. + +Stateless service + a daemon that doesn't use any local storage, such + as an MDS, RGW, nfs-ganesha, iSCSI gateway. + +Label + arbitrary string tags that may be applied by administrators + to hosts. Typically administrators use labels to indicate + which hosts should run which kinds of service. Labels are + advisory (from human input) and do not guarantee that hosts + have particular physical capabilities. + +Drive group + collection of block devices with common/shared OSD + formatting (typically one or more SSDs acting as + journals/dbs for a group of HDDs). + +Placement + choice of which host is used to run a service. + +Key Concepts +------------ + +The underlying orchestrator remains the source of truth for information +about whether a service is running, what is running where, which +hosts are available, etc. Orchestrator modules should avoid taking +any internal copies of this information, and read it directly from +the orchestrator backend as much as possible. + +Bootstrapping hosts and adding them to the underlying orchestration +system is outside the scope of Ceph's orchestrator interface. Ceph +can only work on hosts when the orchestrator is already aware of them. + +Where possible, placement of stateless services should be left up to the +orchestrator. + +Completions and batching +------------------------ + +All methods that read or modify the state of the system can potentially +be long running. Therefore the module needs to schedule those operations. + +Each orchestrator module implements its own underlying mechanisms +for completions. This might involve running the underlying operations +in threads, or batching the operations up before later executing +in one go in the background. If implementing such a batching pattern, the +module would do no work on any operation until it appeared in a list +of completions passed into *process*. + +Error Handling +-------------- + +The main goal of error handling within orchestrator modules is to provide debug information to +assist users when dealing with deployment errors. + +.. autoclass:: OrchestratorError +.. autoclass:: NoOrchestrator +.. autoclass:: OrchestratorValidationError + + +In detail, orchestrators need to explicitly deal with different kinds of errors: + +1. 
No orchestrator configured + + See :class:`NoOrchestrator`. + +2. An orchestrator doesn't implement a specific method. + + For example, an Orchestrator doesn't support ``add_host``. + + In this case, a ``NotImplementedError`` is raised. + +3. Missing features within implemented methods. + + E.g. optional parameters to a command that are not supported by the + backend (e.g. the hosts field in :func:`Orchestrator.apply_mons` command with the rook backend). + + See :class:`OrchestratorValidationError`. + +4. Input validation errors + + The ``orchestrator`` module and other calling modules are supposed to + provide meaningful error messages. + + See :class:`OrchestratorValidationError`. + +5. Errors when actually executing commands + + The resulting Completion should contain an error string that assists in understanding the + problem. In addition, :func:`Completion.is_errored` is set to ``True`` + +6. Invalid configuration in the orchestrator modules + + This can be tackled similar to 5. + + +All other errors are unexpected orchestrator issues and thus should raise an exception that are then +logged into the mgr log file. If there is a completion object at that point, +:func:`Completion.result` may contain an error message. + + +Excluded functionality +---------------------- + +- Ceph's orchestrator interface is not a general purpose framework for + managing linux servers -- it is deliberately constrained to manage + the Ceph cluster's services only. +- Multipathed storage is not handled (multipathing is unnecessary for + Ceph clusters). Each drive is assumed to be visible only on + a single host. + +Host management +--------------- + +.. automethod:: Orchestrator.add_host +.. automethod:: Orchestrator.remove_host +.. automethod:: Orchestrator.get_hosts +.. automethod:: Orchestrator.update_host_addr +.. automethod:: Orchestrator.add_host_label +.. automethod:: Orchestrator.remove_host_label + +.. autoclass:: HostSpec + +Devices +------- + +.. automethod:: Orchestrator.get_inventory +.. autoclass:: InventoryFilter + +.. py:currentmodule:: ceph.deployment.inventory + +.. autoclass:: Devices + :members: + +.. autoclass:: Device + :members: + +.. py:currentmodule:: orchestrator + +Placement +--------- + +A :ref:`orchestrator-cli-placement-spec` defines the placement of +daemons of a specific service. + +In general, stateless services do not require any specific placement +rules as they can run anywhere that sufficient system resources +are available. However, some orchestrators may not include the +functionality to choose a location in this way. Optionally, you can +specify a location when creating a stateless service. + + +.. py:currentmodule:: ceph.deployment.service_spec + +.. autoclass:: PlacementSpec + :members: + +.. py:currentmodule:: orchestrator + + +Services +-------- + +.. autoclass:: ServiceDescription + +.. py:currentmodule:: ceph.deployment.service_spec + +.. autoclass:: ServiceSpec + :members: + :private-members: + :noindex: + +.. py:currentmodule:: orchestrator + +.. automethod:: Orchestrator.describe_service + +.. automethod:: Orchestrator.service_action +.. automethod:: Orchestrator.remove_service + + +Daemons +------- + +.. automethod:: Orchestrator.list_daemons +.. automethod:: Orchestrator.remove_daemons +.. automethod:: Orchestrator.daemon_action + +.. autoclass:: DaemonDescription +.. autoclass:: DaemonDescriptionStatus + +OSD management +-------------- + +.. automethod:: Orchestrator.create_osds + +.. automethod:: Orchestrator.blink_device_light +.. 
autoclass:: DeviceLightLoc + +.. _orchestrator-osd-replace: + +OSD Replacement +^^^^^^^^^^^^^^^ + +See :ref:`rados-replacing-an-osd` for the underlying process. + +Replacing OSDs is fundamentally a two-staged process, as users need to +physically replace drives. The orchestrator therefore exposes this two-staged process. + +Phase one is a call to :meth:`Orchestrator.remove_daemons` with ``destroy=True`` in order to mark +the OSD as destroyed. + + +Phase two is a call to :meth:`Orchestrator.create_osds` with a Drive Group with + +.. py:currentmodule:: ceph.deployment.drive_group + +:attr:`DriveGroupSpec.osd_id_claims` set to the destroyed OSD ids. + +.. py:currentmodule:: orchestrator + +Services +-------- + +.. automethod:: Orchestrator.add_daemon +.. automethod:: Orchestrator.apply_mon +.. automethod:: Orchestrator.apply_mgr +.. automethod:: Orchestrator.apply_mds +.. automethod:: Orchestrator.apply_rbd_mirror + +.. py:currentmodule:: ceph.deployment.service_spec + +.. autoclass:: RGWSpec + :noindex: + +.. py:currentmodule:: orchestrator + +.. automethod:: Orchestrator.apply_rgw + +.. py:currentmodule:: ceph.deployment.service_spec + +.. autoclass:: NFSServiceSpec + +.. py:currentmodule:: orchestrator + +.. automethod:: Orchestrator.apply_nfs + +Upgrades +-------- + +.. automethod:: Orchestrator.upgrade_available +.. automethod:: Orchestrator.upgrade_start +.. automethod:: Orchestrator.upgrade_status +.. autoclass:: UpgradeStatusSpec + +Utility +------- + +.. automethod:: Orchestrator.available +.. automethod:: Orchestrator.get_feature_set + +Client Modules +-------------- + +.. autoclass:: OrchestratorClientMixin + :members: diff --git a/doc/mgr/progress.rst b/doc/mgr/progress.rst new file mode 100644 index 000000000..77a8a408a --- /dev/null +++ b/doc/mgr/progress.rst @@ -0,0 +1,58 @@ +Progress Module +=============== + +The progress module is used to inform users about the recovery progress of PGs +(Placement Groups) that are affected by events such as (1) OSDs being marked +in or out and (2) ``pg_autoscaler`` trying to match the target PG number. + +The ``ceph -s`` command returns something called " Global Recovery Progress", +which reports the overall recovery progress of PGs and is based on the number +of PGs that are in the ``active+clean`` state. + +Enabling +-------- + +The *progress* module is enabled by default, but it can be enabled manually by +running the following command: + +.. prompt:: bash # + + ceph progress on + +The module can be disabled at anytime by running the following command: + +.. prompt:: bash # + + ceph progress off + +Commands +-------- + +Show the summary of all the ongoing and completed events and their duration: + +.. prompt:: bash # + + ceph progress + +Show the summary of ongoing and completed events in JSON format: + +.. prompt:: bash # + + ceph progress json + +Clear all ongoing and completed events: + +.. prompt:: bash # + + ceph progress clear + +PG Recovery Event +----------------- + +An event for each PG affected by recovery event can be shown in +`ceph progress` This is completely optional, and disabled by default +due to CPU overheard: + +.. prompt:: bash # + + ceph config set mgr mgr/progress/allow_pg_recovery_event true diff --git a/doc/mgr/prometheus.rst b/doc/mgr/prometheus.rst new file mode 100644 index 000000000..25a7b0d08 --- /dev/null +++ b/doc/mgr/prometheus.rst @@ -0,0 +1,446 @@ +.. 
_mgr-prometheus:

=================
Prometheus Module
=================

This module provides a Prometheus exporter to pass on Ceph performance
counters from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport
messages from all MgrClient processes (mons and OSDs, for instance)
with performance counter schema data and actual counter data, and keeps
a circular buffer of the last N samples. This module creates an HTTP
endpoint (like all Prometheus exporters) and retrieves the latest sample
of every counter when polled (or "scraped" in Prometheus terminology).
The HTTP path and query parameters are ignored; all extant counters
for all reporting entities are returned in text exposition format.
(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)

Enabling prometheus output
==========================

The *prometheus* module is enabled with:

.. prompt:: bash $

   ceph mgr module enable prometheus

Configuration
-------------

.. note::

   The Prometheus manager module needs to be restarted for configuration
   changes to be applied.

.. mgr_module:: prometheus
.. confval:: server_addr
.. confval:: server_port
.. confval:: scrape_interval
.. confval:: cache
.. confval:: stale_cache_strategy
.. confval:: rbd_stats_pools
.. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour
.. confval:: standby_error_status_code
.. confval:: exclude_perf_counters

By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config set``, with keys
``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
   ceph config set mgr mgr/prometheus/server_port 9283

.. warning::

   The :confval:`mgr/prometheus/scrape_interval` of this module should always
   be set to match the Prometheus scrape interval so that the module works
   properly and does not cause issues.

The scrape interval in the module is used for caching purposes
and to determine when a cache is stale.

A scrape interval below 10 seconds is not recommended; 15 seconds is the
recommended value, though in some cases it might be useful to increase the
scrape interval.

To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value:

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/scrape_interval 20

On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.

If that is the case, **a warning will be logged** and the module will either

* respond with a 503 HTTP status code (service unavailable), or
* return the content of the cache, even though it might be stale.

This behavior can be configured.
By default, it will return a 503 HTTP status +code (service unavailable). You can set other options using the ``ceph config +set`` commands. + +To tell the module to respond with possibly stale data, set it to ``return``: + +.. prompt:: bash $ + + ceph config set mgr mgr/prometheus/stale_cache_strategy return + +To tell the module to respond with "service unavailable", set it to ``fail``: + +.. prompt:: bash $ + + ceph config set mgr mgr/prometheus/stale_cache_strategy fail + +If you are confident that you don't require the cache, you can disable it: + +.. prompt:: bash $ + + ceph config set mgr mgr/prometheus/cache false + +If you are using the prometheus module behind some kind of reverse proxy or +loadbalancer, you can simplify discovering the active instance by switching +to ``error``-mode: + +.. prompt:: bash $ + + ceph config set mgr mgr/prometheus/standby_behaviour error + +If set, the prometheus module will respond with a HTTP error when requesting ``/`` +from the standby instance. The default error code is 500, but you can configure +the HTTP response code with: + +.. prompt:: bash $ + + ceph config set mgr mgr/prometheus/standby_error_status_code 503 + +Valid error codes are between 400-599. + +To switch back to the default behaviour, simply set the config key to ``default``: + +.. prompt:: bash $ + + ceph config set mgr mgr/prometheus/standby_behaviour default + +.. _prometheus-rbd-io-statistics: + +Ceph Health Checks +------------------ + +The mgr/prometheus module also tracks and maintains a history of Ceph health checks, +exposing them to the Prometheus server as discrete metrics. This allows Prometheus +alert rules to be configured for specific health check events. + +The metrics take the following form; + +:: + + # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active) + # TYPE ceph_health_detail gauge + ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0 + ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0 + ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0 + +The health check history is made available through the following commands; + +:: + + healthcheck history ls [--format {plain|json|json-pretty}] + healthcheck history clear + +The ``ls`` command provides an overview of the health checks that the cluster has +encountered, or since the last ``clear`` command was issued. The example below; + +:: + + [ceph: root@c8-node1 /]# ceph healthcheck history ls + Healthcheck Name First Seen (UTC) Last seen (UTC) Count Active + OSDMAP_FLAGS 2021/09/16 03:17:47 2021/09/16 22:07:40 2 No + OSD_DOWN 2021/09/17 00:11:59 2021/09/17 00:11:59 1 Yes + PG_DEGRADED 2021/09/17 00:11:59 2021/09/17 00:11:59 1 Yes + 3 health check(s) listed + + +RBD IO statistics +----------------- + +The module can optionally collect RBD per-image IO statistics by enabling +dynamic OSD performance counters. The statistics are gathered for all images +in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools`` +configuration parameter. The parameter is a comma or space separated list +of ``pool[/namespace]`` entries. If the namespace is not specified the +statistics are collected for all namespaces in the pool. + +Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``: + +.. prompt:: bash $ + + ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN" + +The wildcard can be used to indicate all pools or namespaces: + +.. 
+
+   ceph config set mgr mgr/prometheus/rbd_stats_pools "*"
+
+The module builds the list of all available images by scanning the specified
+pools and namespaces and refreshes it periodically. The period is
+configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
+parameter (in seconds); the default is 300 seconds (5 minutes). The module
+will force a refresh earlier if it detects statistics from a previously
+unknown RBD image.
+
+Example of increasing the refresh interval to 10 minutes:
+
+.. prompt:: bash $
+
+   ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
+
+Ceph daemon performance counters metrics
+-----------------------------------------
+
+With the introduction of the ``ceph-exporter`` daemon, the prometheus module
+will no longer export Ceph daemon perf counters as prometheus metrics by
+default. However, one may re-enable exporting these metrics by setting the
+module option ``exclude_perf_counters`` to ``false``:
+
+.. prompt:: bash $
+
+   ceph config set mgr mgr/prometheus/exclude_perf_counters false
+
+Statistic names and labels
+==========================
+
+The names of the stats are exactly as Ceph names them, with
+illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
+and ``ceph_`` prefixed to all names.
+
+All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
+that identifies the type and ID of the daemon they come from. Some
+statistics can come from different types of daemon, so when querying
+e.g. an OSD's RocksDB stats, you would probably want to filter
+on ``ceph_daemon`` starting with "osd" to avoid mixing in the monitor
+RocksDB stats.
+
+The *cluster* statistics (i.e. those global to the Ceph cluster)
+have labels appropriate to what they report on. For example,
+metrics relating to pools have a ``pool_id`` label.
+
+The long running averages that represent the histograms from core Ceph
+are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
+This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
+and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.
+
+Pool and OSD metadata series
+----------------------------
+
+Special series are output to enable displaying and querying on
+certain metadata fields.
+
+Pools have a ``ceph_pool_metadata`` field like this:
+
+::
+
+    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0
+
+OSDs have a ``ceph_osd_metadata`` field like this:
+
+::
+
+    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0
+
+Correlating drive statistics with node_exporter
+-----------------------------------------------
+
+The prometheus output from Ceph is designed to be used in conjunction
+with the generic host monitoring from the Prometheus node_exporter.
+
+To enable correlation of Ceph OSD statistics with node_exporter's
+drive statistics, special series are output like this:
+
+::
+
+    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
+
+To use this to get disk statistics by OSD ID, use either the ``and`` operator or
+the ``*`` operator in your prometheus query. All metadata metrics (like
+``ceph_disk_occupation_human``) have the value 1, so they act as a neutral
+element with ``*``. Using ``*`` allows you to use the ``group_left`` and
+``group_right`` grouping modifiers, so that the resulting metric has
+additional labels from one side of the query.
+
+See the
+`prometheus documentation`__ for more information about constructing queries.
+
+__ https://prometheus.io/docs/prometheus/latest/querying/basics
+
+The goal is to run a query like the following:
+
+::
+
+    rate(node_disk_written_bytes_total[30s]) and
+    on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
+
+Out of the box, the above query will not return any metrics since the
+``instance`` labels of both metrics don't match. The ``instance`` label of
+``ceph_disk_occupation_human`` will be the currently active MGR node.
+
+The following two sections outline two approaches to remedy this.
+
+.. note::
+
+    If you need to group on the `ceph_daemon` label instead of `device` and
+    `instance` labels, using `ceph_disk_occupation_human` may not work reliably.
+    It is advised that you use `ceph_disk_occupation` instead.
+
+    The difference is that `ceph_disk_occupation_human` may group several OSDs
+    into the value of a single `ceph_daemon` label in cases where multiple OSDs
+    share a disk.
+
+Use label_replace
+=================
+
+The ``label_replace`` function (see the
+`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
+can add a label to, or alter a label of, a metric within a query.
+
+To correlate an OSD and the write rate of its disk, the following query can
+be used:
+
+::
+
+    label_replace(
+        rate(node_disk_written_bytes_total[30s]),
+        "exported_instance",
+        "$1",
+        "instance",
+        "(.*):.*"
+    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
+
+Configuring Prometheus server
+=============================
+
+honor_labels
+------------
+
+To enable Ceph to output properly-labeled data relating to any host,
+use the ``honor_labels`` setting when adding the ceph-mgr endpoints
+to your prometheus configuration.
+
+This allows Ceph to export the proper ``instance`` label without prometheus
+overwriting it. Without this setting, Prometheus applies an ``instance`` label
+that includes the hostname and port of the endpoint that the series came from.
+Because Ceph clusters have multiple manager daemons, this results in an
+``instance`` label that changes spuriously when the active manager daemon
+changes.
+
+If this is undesirable, a custom ``instance`` label can be set in the
+Prometheus target configuration: you might wish to set it to the hostname
+of your first mgr daemon, or something completely arbitrary like "ceph_cluster".
+
+node_exporter hostname labels
+-----------------------------
+
+Set your ``instance`` labels to match what appears in Ceph's OSD metadata
+in the ``instance`` field. This is generally the short hostname of the node.
+
+This is only necessary if you want to correlate Ceph stats with host stats,
+but you may find it useful to do it in all cases in case you want to do
+the correlation in the future.
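+
+With ``honor_labels`` and matching ``instance`` labels in place, the
+``*``/``group_left`` style of correlation mentioned above can be written as
+follows. This is only an illustrative sketch; the OSD id and the 30s rate
+window are arbitrary::
+
+    rate(node_disk_written_bytes_total[30s])
+      * on (device, instance) group_left (ceph_daemon)
+        ceph_disk_occupation_human{ceph_daemon="osd.0"}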
+ +Example configuration +--------------------- + +This example shows a single node configuration running ceph-mgr and +node_exporter on a server called ``senta04``. Note that this requires one +to add an appropriate and unique ``instance`` label to each ``node_exporter`` target. + +This is just an example: there are other ways to configure prometheus +scrape targets and label rewrite rules. + +prometheus.yml +~~~~~~~~~~~~~~ + +:: + + global: + scrape_interval: 15s + evaluation_interval: 15s + + scrape_configs: + - job_name: 'node' + file_sd_configs: + - files: + - node_targets.yml + - job_name: 'ceph' + honor_labels: true + file_sd_configs: + - files: + - ceph_targets.yml + + +ceph_targets.yml +~~~~~~~~~~~~~~~~ + + +:: + + [ + { + "targets": [ "senta04.mydomain.com:9283" ], + "labels": {} + } + ] + + +node_targets.yml +~~~~~~~~~~~~~~~~ + +:: + + [ + { + "targets": [ "senta04.mydomain.com:9100" ], + "labels": { + "instance": "senta04" + } + } + ] + + +Notes +===== + +Counters and gauges are exported; currently histograms and long-running +averages are not. It's possible that Ceph's 2-D histograms could be +reduced to two separate 1-D histograms, and that long-running averages +could be exported as Prometheus' Summary type. + +Timestamps, as with many Prometheus exporters, are established by +the server's scrape time (Prometheus expects that it is polling the +actual counter process synchronously). It is possible to supply a +timestamp along with the stat report, but the Prometheus team strongly +advises against this. This means that timestamps will be delayed by +an unpredictable amount; it's not clear if this will be problematic, +but it's worth knowing about. diff --git a/doc/mgr/restful.rst b/doc/mgr/restful.rst new file mode 100644 index 000000000..d684399fc --- /dev/null +++ b/doc/mgr/restful.rst @@ -0,0 +1,189 @@ +Restful Module +============== + +RESTful module offers the REST API access to the status of the cluster +over an SSL-secured connection. + +Enabling +-------- + +The *restful* module is enabled with:: + + ceph mgr module enable restful + +You will also need to configure an SSL certificate below before the +API endpoint is available. By default the module will accept HTTPS +requests on port ``8003`` on all IPv4 and IPv6 addresses on the host. + +Securing +-------- + +All connections to *restful* are secured with SSL. You can generate a +self-signed certificate with the command:: + + ceph restful create-self-signed-cert + +Note that with a self-signed certificate most clients will need a flag +to allow a connection and/or suppress warning messages. For example, +if the ``ceph-mgr`` daemon is on the same host,:: + + curl -k https://localhost:8003/ + +To properly secure a deployment, a certificate that is signed by the +organization's certificate authority should be used. For example, a key pair +can be generated with a command similar to:: + + openssl req -new -nodes -x509 \ + -subj "/O=IT/CN=ceph-mgr-restful" \ + -days 3650 -keyout restful.key -out restful.crt -extensions v3_ca + +The ``restful.crt`` should then be signed by your organization's CA +(certificate authority). Once that is done, you can set it with:: + + ceph config-key set mgr/restful/$name/crt -i restful.crt + ceph config-key set mgr/restful/$name/key -i restful.key + +where ``$name`` is the name of the ``ceph-mgr`` instance (usually the +hostname). 
If all manager instances are to share the same certificate, +you can leave off the ``$name`` portion:: + + ceph config-key set mgr/restful/crt -i restful.crt + ceph config-key set mgr/restful/key -i restful.key + + +Configuring IP and port +----------------------- + +Like any other RESTful API endpoint, *restful* binds to an IP and +port. By default, the currently active ``ceph-mgr`` daemon will bind +to port 8003 and any available IPv4 or IPv6 address on the host. + +Since each ``ceph-mgr`` hosts its own instance of *restful*, it may +also be necessary to configure them separately. The IP and port +can be changed via the configuration key facility:: + + ceph config set mgr mgr/restful/$name/server_addr $IP + ceph config set mgr mgr/restful/$name/server_port $PORT + +where ``$name`` is the ID of the ceph-mgr daemon (usually the hostname). + +These settings can also be configured cluster-wide and not manager +specific. For example,:: + + ceph config set mgr mgr/restful/server_addr $IP + ceph config set mgr mgr/restful/server_port $PORT + +If the port is not configured, *restful* will bind to port ``8003``. +If the address it not configured, the *restful* will bind to ``::``, +which corresponds to all available IPv4 and IPv6 addresses. + +.. _creating-an-api-user: + +Creating an API User +----------------------- + +To create an API user, please run the following command:: + + ceph restful create-key <username> + +Replace ``<username>`` with the desired name of the user. For example, to create a user named +``api``:: + + $ ceph restful create-key api + 52dffd92-a103-4a10-bfce-5b60f48f764e + +The UUID generated from ``ceph restful create-key api`` acts as the key for the user. + +To list all of your API keys, please run the following command:: + + ceph restful list-keys + +The ``ceph restful list-keys`` command will output in JSON:: + + { + "api": "52dffd92-a103-4a10-bfce-5b60f48f764e" + } + +You can use ``curl`` in order to test your user with the API. Here is an example:: + + curl -k https://api:52dffd92-a103-4a10-bfce-5b60f48f764e@<ceph-mgr>:<port>/server + +In the case above, we are using ``GET`` to fetch information from the ``server`` endpoint. + +Load balancer +------------- + +Please note that *restful* will *only* start on the manager which +is active at that moment. Query the Ceph cluster status to see which +manager is active (e.g., ``ceph mgr dump``). In order to make the +API available via a consistent URL regardless of which manager +daemon is currently active, you may want to set up a load balancer +front-end to direct traffic to whichever manager endpoint is +available. + +Available methods +----------------- + +You can navigate to the ``/doc`` endpoint for full list of available +endpoints and HTTP methods implemented for each endpoint. 
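+
+As a quick check, the ``/doc`` listing can be fetched with the ``api`` user and
+key created earlier; the hostname and port below are placeholders::
+
+    curl -k https://api:52dffd92-a103-4a10-bfce-5b60f48f764e@<ceph-mgr>:<port>/doc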
+ +For example, if you want to use the PATCH method of the ``/osd/<id>`` +endpoint to set the state ``up`` of the OSD id ``1``, you can use the +following curl command:: + + echo -En '{"up": true}' | curl --request PATCH --data @- --silent --insecure --user <user> 'https://<ceph-mgr>:<port>/osd/1' + +or you can use python to do so:: + + $ python + >> import requests + >> result = requests.patch( + 'https://<ceph-mgr>:<port>/osd/1', + json={"up": True}, + auth=("<user>", "<password>") + ) + >> print result.json() + +Some of the other endpoints implemented in the *restful* module include + +* ``/config/cluster``: **GET** +* ``/config/osd``: **GET**, **PATCH** +* ``/crush/rule``: **GET** +* ``/mon``: **GET** +* ``/osd``: **GET** +* ``/pool``: **GET**, **POST** +* ``/pool/<arg>``: **DELETE**, **GET**, **PATCH** +* ``/request``: **DELETE**, **GET**, **POST** +* ``/request/<arg>``: **DELETE**, **GET** +* ``/server``: **GET** + +The ``/request`` endpoint +------------------------- + +You can use the ``/request`` endpoint to poll the state of a request +you scheduled with any **DELETE**, **POST** or **PATCH** method. These +methods are by default asynchronous since it may take longer for them +to finish execution. You can modify this behaviour by appending +``?wait=1`` to the request url. The returned request will then always +be completed. + +The **POST** method of the ``/request`` method provides a passthrough +for the ceph mon commands as defined in ``src/mon/MonCommands.h``. +Let's consider the following command:: + + COMMAND("osd ls " \ + "name=epoch,type=CephInt,range=0,req=false", \ + "show all OSD ids", "osd", "r", "cli,rest") + +The **prefix** is **osd ls**. The optional argument's name is **epoch** +and it is of type ``CephInt``, i.e. ``integer``. This means that you +need to do the following **POST** request to schedule the command:: + + $ python + >> import requests + >> result = requests.post( + 'https://<ceph-mgr>:<port>/request', + json={'prefix': 'osd ls', 'epoch': 0}, + auth=("<user>", "<password>") + ) + >> print result.json() diff --git a/doc/mgr/rgw.rst b/doc/mgr/rgw.rst new file mode 100644 index 000000000..a3f53280a --- /dev/null +++ b/doc/mgr/rgw.rst @@ -0,0 +1,141 @@ +.. _mgr-rgw-module: + +RGW Module +============ +The rgw module provides a simple interface to deploy RGW multisite. +It helps with bootstrapping and configuring RGW realm, zonegroup and +the different related entities. + +Enabling +-------- + +The *rgw* module is enabled with:: + + ceph mgr module enable rgw + + +RGW Realm Operations +----------------------- + +Bootstrapping RGW realm creates a new RGW realm entity, a new zonegroup, +and a new zone. It configures a new system user that can be used for +multisite sync operations. Under the hood this module instructs the +orchestrator to create and deploy the corresponding RGW daemons. The module +supports both passing the arguments through the cmd line or as a spec file: + +.. prompt:: bash # + + ceph rgw realm bootstrap [--realm-name] [--zonegroup-name] [--zone-name] [--port] [--placement] [--start-radosgw] + +The command supports providing the configuration through a spec file (`-i option`): + +.. prompt:: bash # + + ceph rgw realm bootstrap -i myrgw.yaml + +Following is an example of RGW multisite spec file: + +.. code-block:: yaml + + rgw_realm: myrealm + rgw_zonegroup: myzonegroup + rgw_zone: myzone + placement: + hosts: + - ceph-node-1 + - ceph-node-2 + spec: + rgw_frontend_port: 5500 + +.. 
note:: The spec file used by RGW has the same format as the one used by the orchestrator. Thus, + the user can provide any orchestration supported rgw parameters including advanced + configuration features such as SSL certificates etc. + +Users can also specify custom zone endpoints in the spec (or through the cmd line). In this case, no +cephadm daemons will be launched. Following is an example RGW spec file with zone endpoints: + +.. code-block:: yaml + + rgw_realm: myrealm + rgw_zonegroup: myzonegroup + rgw_zone: myzone + zone_endpoints: http://<rgw_host1>:<rgw_port1>, http://<rgw_host2>:<rgw_port2> + + +Realm Credentials Token +----------------------- + +Users can list the available tokens for the created (or already existing) realms. +The token is a base64 string that encapsulates the realm information and its +master zone endpoint authentication data. Following is an example of +the `ceph rgw realm tokens` output: + +.. prompt:: bash # + + ceph rgw realm tokens | jq + +.. code-block:: json + + [ + { + "realm": "myrealm1", + "token": "ewogICAgInJlYWxtX25hbWUiOiAibXlyZWFs....NHlBTFhoIgp9" + }, + { + "realm": "myrealm2", + "token": "ewogICAgInJlYWxtX25hbWUiOiAibXlyZWFs....RUU12ZDB0Igp9" + } + ] + +User can use the token to pull a realm to create secondary zone on a +different cluster that syncs with the master zone on the primary cluster +by using `ceph rgw zone create` command and providing the corresponding token. + +Following is an example of zone spec file: + +.. code-block:: yaml + + rgw_zone: my-secondary-zone + rgw_realm_token: <token> + placement: + hosts: + - ceph-node-1 + - ceph-node-2 + spec: + rgw_frontend_port: 5500 + + +.. prompt:: bash # + + ceph rgw zone create -i zone-spec.yaml + +.. note:: The spec file used by RGW has the same format as the one used by the orchestrator. Thus, + the user can provide any orchestration supported rgw parameters including advanced + configuration features such as SSL certificates etc. + +Commands +-------- +:: + + ceph rgw realm bootstrap -i spec.yaml + +Create a new realm + zonegroup + zone and deploy rgw daemons via the +orchestrator using the information specified in the YAML file. + +:: + + ceph rgw realm tokens + +List the tokens of all the available realms + +:: + + ceph rgw zone create -i spec.yaml + +Join an existing realm by creating a new secondary zone (using the realm token) + +:: + + ceph rgw admin [*] + +RGW admin command diff --git a/doc/mgr/rook.rst b/doc/mgr/rook.rst new file mode 100644 index 000000000..1ae369623 --- /dev/null +++ b/doc/mgr/rook.rst @@ -0,0 +1,39 @@ + +.. _mgr-rook: + +==== +Rook +==== + +Rook (https://rook.io/) is an orchestration tool that can run Ceph inside +a Kubernetes cluster. + +The ``rook`` module provides integration between Ceph's orchestrator framework +(used by modules such as ``dashboard`` to control cluster services) and +Rook. + +Orchestrator modules only provide services to other modules, which in turn +provide user interfaces. To try out the rook module, you might like +to use the :ref:`Orchestrator CLI <orchestrator-cli-module>` module. + +Requirements +------------ + +- Running ceph-mon and ceph-mgr services that were set up with Rook in + Kubernetes. +- Rook 0.9 or newer. + +Configuration +------------- + +Because a Rook cluster's ceph-mgr daemon is running as a Kubernetes pod, +the ``rook`` module can connect to the Kubernetes API without any explicit +configuration. 
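+
+As a minimal sketch (assuming the module is available in your ceph-mgr build),
+pointing the orchestrator framework at Rook with the Orchestrator CLI module
+mentioned above and checking the result looks like this::
+
+    ceph mgr module enable rook
+    ceph orch set backend rook
+    ceph orch status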
+ +Development +----------- + +If you are a developer, please see :ref:`kubernetes-dev` for instructions +on setting up a development environment to work with this. + + diff --git a/doc/mgr/status-card-open.png b/doc/mgr/status-card-open.png Binary files differnew file mode 100644 index 000000000..4ea20921b --- /dev/null +++ b/doc/mgr/status-card-open.png diff --git a/doc/mgr/telegraf.rst b/doc/mgr/telegraf.rst new file mode 100644 index 000000000..781ff5592 --- /dev/null +++ b/doc/mgr/telegraf.rst @@ -0,0 +1,91 @@ +=============== +Telegraf Module +=============== +The Telegraf module collects and sends statistics series to a Telegraf agent. + +The Telegraf agent can buffer, aggregate, parse and process the data before +sending it to an output which can be InfluxDB, ElasticSearch and many more. + +Currently the only way to send statistics to Telegraf from this module is to +use the socket listener. The module can send statistics over UDP, TCP or +a UNIX socket. + +The Telegraf module was introduced in the 13.x *Mimic* release. + +-------- +Enabling +-------- + +To enable the module, use the following command: + +:: + + ceph mgr module enable telegraf + +If you wish to subsequently disable the module, you can use the corresponding +*disable* command: + +:: + + ceph mgr module disable telegraf + +------------- +Configuration +------------- + +For the telegraf module to send statistics to a Telegraf agent it is +required to configure the address to send the statistics to. + +Set configuration values using the following command: + +:: + + ceph telegraf config-set <key> <value> + + +The most important settings are ``address`` and ``interval``. + +For example, a typical configuration might look like this: + +:: + + ceph telegraf config-set address udp://:8094 + ceph telegraf config-set interval 10 + +The default values for these configuration keys are: + +- address: unixgram:///tmp/telegraf.sock +- interval: 15 + +---------------- +Socket Listener +---------------- +The module only supports sending data to Telegraf through the socket listener +of the Telegraf module using the Influx data format. + +A typical Telegraf configuration might be + +:: + + [[inputs.socket_listener]] + # service_address = "tcp://:8094" + # service_address = "tcp://127.0.0.1:http" + # service_address = "tcp4://:8094" + # service_address = "tcp6://:8094" + # service_address = "tcp6://[2001:db8::1]:8094" + service_address = "udp://:8094" + # service_address = "udp4://:8094" + # service_address = "udp6://:8094" + # service_address = "unix:///tmp/telegraf.sock" + # service_address = "unixgram:///tmp/telegraf.sock" + data_format = "influx" + +In this case the `address` configuration option for the module would need to be set +to: + +:: + + udp://:8094 + + +Refer to the Telegraf documentation for more configuration options. diff --git a/doc/mgr/telemetry.rst b/doc/mgr/telemetry.rst new file mode 100644 index 000000000..90d45766c --- /dev/null +++ b/doc/mgr/telemetry.rst @@ -0,0 +1,292 @@ +.. _telemetry: + +Telemetry Module +================ + +The telemetry module sends anonymous data about the cluster back to the Ceph +developers to help understand how Ceph is used and what problems users may +be experiencing. + +This data is visualized on `public dashboards <https://telemetry-public.ceph.com/>`_ +that allow the community to quickly see summary statistics on how many clusters +are reporting, their total capacity and OSD count, and version distribution +trends. 
+ +Channels +-------- + +The telemetry report is broken down into several "channels", each with +a different type of information. Assuming telemetry has been enabled, +individual channels can be turned on and off. (If telemetry is off, +the per-channel setting has no effect.) + +* **basic** (default: on): Basic information about the cluster + + - capacity of the cluster + - number of monitors, managers, OSDs, MDSs, object gateways, or other daemons + - software version currently being used + - number and types of RADOS pools and CephFS file systems + - names of configuration options that have been changed from their + default (but *not* their values) + +* **crash** (default: on): Information about daemon crashes, including + + - type of daemon + - version of the daemon + - operating system (OS distribution, kernel version) + - stack trace identifying where in the Ceph code the crash occurred + +* **device** (default: on): Information about device metrics, including + + - anonymized SMART metrics + +* **ident** (default: off): User-provided identifying information about + the cluster + + - cluster description + - contact email address + +* **perf** (default: off): Various performance metrics of a cluster, which can be used to + + - reveal overall cluster health + - identify workload patterns + - troubleshoot issues with latency, throttling, memory management, etc. + - monitor cluster performance by daemon + +The data being reported does *not* contain any sensitive +data like pool names, object names, object contents, hostnames, or device +serial numbers. + +It contains counters and statistics on how the cluster has been +deployed, the version of Ceph, the distribution of the hosts and other +parameters which help the project to gain a better understanding of +the way Ceph is used. + +Data is sent secured to *https://telemetry.ceph.com*. + +Individual channels can be enabled or disabled with:: + + ceph telemetry enable channel basic + ceph telemetry enable channel crash + ceph telemetry enable channel device + ceph telemetry enable channel ident + ceph telemetry enable channel perf + + ceph telemetry disable channel basic + ceph telemetry disable channel crash + ceph telemetry disable channel device + ceph telemetry disable channel ident + ceph telemetry disable channel perf + +Multiple channels can be enabled or disabled with:: + + ceph telemetry enable channel basic crash device ident perf + ceph telemetry disable channel basic crash device ident perf + +Channels can be enabled or disabled all at once with:: + + ceph telemetry enable channel all + ceph telemetry disable channel all + +Please note that telemetry should be on for these commands to take effect. + +List all channels with:: + + ceph telemetry channel ls + + NAME ENABLED DEFAULT DESC + basic ON ON Share basic cluster information (size, version) + crash ON ON Share metadata about Ceph daemon crashes (version, stack straces, etc) + device ON ON Share device health metrics (e.g., SMART data, minus potentially identifying info like serial numbers) + ident OFF OFF Share a user-provided description and/or contact email for the cluster + perf ON OFF Share various performance metrics of a cluster + + +Enabling Telemetry +------------------ + +To allow the *telemetry* module to start sharing data:: + + ceph telemetry on + +Please note: Telemetry data is licensed under the Community Data License +Agreement - Sharing - Version 1.0 (https://cdla.io/sharing-1-0/). 
Hence, +telemetry module can be enabled only after you add '--license sharing-1-0' to +the 'ceph telemetry on' command. +Once telemetry is on, please consider enabling channels which are off by +default, such as the 'perf' channel. 'ceph telemetry on' output will list the +exact command to enable these channels. + +Telemetry can be disabled at any time with:: + + ceph telemetry off + +Sample report +------------- + +You can look at what data is reported at any time with the command:: + + ceph telemetry show + +If telemetry is off, you can preview a sample report with:: + + ceph telemetry preview + +Generating a sample report might take a few moments in big clusters (clusters +with hundreds of OSDs or more). + +To protect your privacy, device reports are generated separately, and data such +as hostname and device serial number is anonymized. The device telemetry is +sent to a different endpoint and does not associate the device data with a +particular cluster. To see a preview of the device report use the command:: + + ceph telemetry show-device + +If telemetry is off, you can preview a sample device report with:: + + ceph telemetry preview-device + +Please note: In order to generate the device report we use Smartmontools +version 7.0 and up, which supports JSON output. +If you have any concerns about privacy with regard to the information included in +this report, please contact the Ceph developers. + +In case you prefer to have a single output of both reports, and telemetry is on, use:: + + ceph telemetry show-all + +If you would like to view a single output of both reports, and telemetry is off, use:: + + ceph telemetry preview-all + +**Sample report by channel** + +When telemetry is on you can see what data is reported by channel with:: + + ceph telemetry show <channel_name> + +Please note: If telemetry is on, and <channel_name> is disabled, the command +above will output a sample report by that channel, according to the collections +the user is enrolled to. However this data is not reported, since the channel +is disabled. + +If telemetry is off you can preview a sample report by channel with:: + + ceph telemetry preview <channel_name> + +Collections +----------- + +Collections represent different aspects of data that we collect within a channel. + +List all collections with:: + + ceph telemetry collection ls + + NAME STATUS DESC + basic_base NOT REPORTING: NOT OPTED-IN Basic information about the cluster (capacity, number and type of daemons, version, etc.) + basic_mds_metadata NOT REPORTING: NOT OPTED-IN MDS metadata + basic_pool_options_bluestore NOT REPORTING: NOT OPTED-IN Per-pool bluestore config options + basic_pool_usage NOT REPORTING: NOT OPTED-IN Default pool application and usage statistics + basic_rook_v01 NOT REPORTING: NOT OPTED-IN Basic Rook deployment data + basic_usage_by_class NOT REPORTING: NOT OPTED-IN Default device class usage statistics + crash_base NOT REPORTING: NOT OPTED-IN Information about daemon crashes (daemon type and version, backtrace, etc.) 
+ device_base NOT REPORTING: NOT OPTED-IN Information about device health metrics + ident_base NOT REPORTING: NOT OPTED-IN, CHANNEL ident IS OFF User-provided identifying information about the cluster + perf_memory_metrics NOT REPORTING: NOT OPTED-IN, CHANNEL perf IS OFF Heap stats and mempools for mon and mds + perf_perf NOT REPORTING: NOT OPTED-IN, CHANNEL perf IS OFF Information about performance counters of the cluster + +Where: + +**NAME**: Collection name; prefix indicates the channel the collection belongs to. + +**STATUS**: Indicates whether the collection metrics are reported; this is +determined by the status (enabled / disabled) of the channel the collection +belongs to, along with the enrollment status of the collection (whether the user +is opted-in to this collection). + +**DESC**: General description of the collection. + +See the diff between the collections you are enrolled to, and the new, +available collections with:: + + ceph telemetry diff + +Enroll to the most recent collections with:: + + ceph telemetry on + +Then enable new channels that are off with:: + + ceph telemetry enable channel <channel_name> + +Interval +-------- + +The module compiles and sends a new report every 24 hours by default. +You can adjust this interval with:: + + ceph config set mgr mgr/telemetry/interval 72 # report every three days + +Status +-------- + +The see the current configuration:: + + ceph telemetry status + +Manually sending telemetry +-------------------------- + +To ad hoc send telemetry data:: + + ceph telemetry send + +In case telemetry is not enabled (with 'ceph telemetry on'), you need to add +'--license sharing-1-0' to 'ceph telemetry send' command. + +Sending telemetry through a proxy +--------------------------------- + +If the cluster cannot directly connect to the configured telemetry +endpoint (default *telemetry.ceph.com*), you can configure a HTTP/HTTPS +proxy server with:: + + ceph config set mgr mgr/telemetry/proxy https://10.0.0.1:8080 + +You can also include a *user:pass* if needed:: + + ceph config set mgr mgr/telemetry/proxy https://ceph:telemetry@10.0.0.1:8080 + + +Contact and Description +----------------------- + +A contact and description can be added to the report. This is +completely optional, and disabled by default.:: + + ceph config set mgr mgr/telemetry/contact 'John Doe <john.doe@example.com>' + ceph config set mgr mgr/telemetry/description 'My first Ceph cluster' + ceph config set mgr mgr/telemetry/channel_ident true + +Leaderboard +----------- + +To participate in a leaderboard in the `public dashboards +<https://telemetry-public.ceph.com/>`_, run the following command: + +.. prompt:: bash $ + + ceph config set mgr mgr/telemetry/leaderboard true + +The leaderboard displays basic information about the cluster. This includes the +total storage capacity and the number of OSDs. To add a description of the +cluster, run a command of the following form: + +.. prompt:: bash $ + + ceph config set mgr mgr/telemetry/leaderboard_description 'Ceph cluster for Computational Biology at the University of XYZ' + +If the ``ident`` channel is enabled, its details will not be displayed in the +leaderboard. 
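+
+For reference, a complete opt-in that also enables the optional channels and
+identifying details described above might look like this; the contact and
+description are placeholders::
+
+    ceph telemetry on --license sharing-1-0
+    ceph telemetry enable channel perf
+    ceph config set mgr mgr/telemetry/channel_ident true
+    ceph config set mgr mgr/telemetry/contact 'Jane Roe <jane.roe@example.com>'
+    ceph config set mgr mgr/telemetry/description 'Test cluster'
+    ceph telemetry show ident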
+ diff --git a/doc/mgr/zabbix.rst b/doc/mgr/zabbix.rst new file mode 100644 index 000000000..f044b7a79 --- /dev/null +++ b/doc/mgr/zabbix.rst @@ -0,0 +1,153 @@ +Zabbix Module +============= + +The Zabbix module actively sends information to a Zabbix server like: + +- Ceph status +- I/O operations +- I/O bandwidth +- OSD status +- Storage utilization + +Requirements +------------ + +The module requires that the *zabbix_sender* executable is present on *all* +machines running ceph-mgr. It can be installed on most distributions using +the package manager. + +Dependencies +^^^^^^^^^^^^ +Installing zabbix_sender can be done under Ubuntu or CentOS using either apt +or dnf. + +On Ubuntu Xenial: + +:: + + apt install zabbix-agent + +On Fedora: + +:: + + dnf install zabbix-sender + + +Enabling +-------- +You can enable the *zabbix* module with: + +:: + + ceph mgr module enable zabbix + +Configuration +------------- + +Two configuration keys are vital for the module to work: + +- zabbix_host +- identifier (optional) + +The parameter *zabbix_host* controls the hostname of the Zabbix server to which +*zabbix_sender* will send the items. This can be a IP-Address if required by +your installation. + +The *identifier* parameter controls the identifier/hostname to use as source +when sending items to Zabbix. This should match the name of the *Host* in +your Zabbix server. + +When the *identifier* parameter is not configured the ceph-<fsid> of the cluster +will be used when sending data to Zabbix. + +This would for example be *ceph-c4d32a99-9e80-490f-bd3a-1d22d8a7d354* + +Additional configuration keys which can be configured and their default values: + +- zabbix_port: 10051 +- zabbix_sender: /usr/bin/zabbix_sender +- interval: 60 +- discovery_interval: 100 + +Configuration keys +^^^^^^^^^^^^^^^^^^^ + +Configuration keys can be set on any machine with the proper cephx credentials, +these are usually Monitors where the *client.admin* key is present. + +:: + + ceph zabbix config-set <key> <value> + +For example: + +:: + + ceph zabbix config-set zabbix_host zabbix.localdomain + ceph zabbix config-set identifier ceph.eu-ams02.local + +The current configuration of the module can also be shown: + +:: + + ceph zabbix config-show + + +Template +^^^^^^^^ +A `template <https://raw.githubusercontent.com/ceph/ceph/master/src/pybind/mgr/zabbix/zabbix_template.xml>`_. +(XML) to be used on the Zabbix server can be found in the source directory of the module. + +This template contains all items and a few triggers. You can customize the triggers afterwards to fit your needs. + + +Multiple Zabbix servers +^^^^^^^^^^^^^^^^^^^^^^^ +It is possible to instruct zabbix module to send data to multiple Zabbix servers. + +Parameter *zabbix_host* can be set with multiple hostnames separated by commas. +Hostnames (or IP addresses) can be followed by colon and port number. If a port +number is not present module will use the port number defined in *zabbix_port*. + +For example: + +:: + + ceph zabbix config-set zabbix_host "zabbix1,zabbix2:2222,zabbix3:3333" + + +Manually sending data +--------------------- +If needed the module can be asked to send data immediately instead of waiting for +the interval. + +This can be done with this command: + +:: + + ceph zabbix send + +The module will now send its latest data to the Zabbix server. + +Items discovery is accomplished also via zabbix_sender, and runs every `discovery_interval * interval` seconds. 
If you wish to launch discovery
+manually, this can be done with the following command:
+
+::
+
+    ceph zabbix discovery
+
+Debugging
+---------
+
+Should you want to debug the Zabbix module, increase the logging level for
+ceph-mgr and check the logs.
+
+::
+
+    [mgr]
+    debug mgr = 20
+
+With logging set to debug for the manager, the module will print various logging
+lines prefixed with *mgr[zabbix]* for easy filtering.
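+
+If you prefer not to edit ``ceph.conf``, the same log level can typically be
+raised through the configuration database instead; this is a sketch using the
+standard ``debug_mgr`` option::
+
+    ceph config set mgr debug_mgr 20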