Diffstat (limited to 'doc/rados/operations/health-checks.rst')
-rw-r--r-- | doc/rados/operations/health-checks.rst | 1619
1 file changed, 1619 insertions, 0 deletions
diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst new file mode 100644 index 000000000..d52465602 --- /dev/null +++ b/doc/rados/operations/health-checks.rst @@ -0,0 +1,1619 @@ +.. _health-checks: + +=============== + Health checks +=============== + +Overview +======== + +There is a finite set of health messages that a Ceph cluster can raise. These +messages are known as *health checks*. Each health check has a unique +identifier. + +The identifier is a terse human-readable string -- that is, the identifier is +readable in much the same way as a typical variable name. It is intended to +enable tools (for example, UIs) to make sense of health checks and present them +in a way that reflects their meaning. + +This page lists the health checks that are raised by the monitor and manager +daemons. In addition to these, you might see health checks that originate +from MDS daemons (see :ref:`cephfs-health-messages`), and health checks +that are defined by ``ceph-mgr`` python modules. + +Definitions +=========== + +Monitor +------- + +DAEMON_OLD_VERSION +__________________ + +Warn if one or more old versions of Ceph are running on any daemons. A health +check is raised if multiple versions are detected. This condition must exist +for a period of time greater than ``mon_warn_older_version_delay`` (set to one +week by default) in order for the health check to be raised. This allows most +upgrades to proceed without the occurrence of a false warning. If the upgrade +is paused for an extended time period, ``health mute`` can be used by running +``ceph health mute DAEMON_OLD_VERSION --sticky``. Be sure, however, to run +``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has finished. + +MON_DOWN +________ + +One or more monitor daemons are currently down. The cluster requires a majority +(more than one-half) of the monitors to be available. When one or more monitors +are down, clients might have a harder time forming their initial connection to +the cluster, as they might need to try more addresses before they reach an +operating monitor. + +The down monitor daemon should be restarted as soon as possible to reduce the +risk of a subsequent monitor failure leading to a service outage. + +MON_CLOCK_SKEW +______________ + +The clocks on the hosts running the ceph-mon monitor daemons are not +well-synchronized. This health check is raised if the cluster detects a clock +skew greater than ``mon_clock_drift_allowed``. + +This issue is best resolved by synchronizing the clocks by using a tool like +``ntpd`` or ``chrony``. + +If it is impractical to keep the clocks closely synchronized, the +``mon_clock_drift_allowed`` threshold can also be increased. However, this +value must stay significantly below the ``mon_lease`` interval in order for the +monitor cluster to function properly. + +MON_MSGR2_NOT_ENABLED +_____________________ + +The :confval:`ms_bind_msgr2` option is enabled but one or more monitors are +not configured to bind to a v2 port in the cluster's monmap. This +means that features specific to the msgr2 protocol (for example, encryption) +are unavailable on some or all connections. + +In most cases this can be corrected by running the following command: + +.. prompt:: bash $ + + ceph mon enable-msgr2 + +After this command is run, any monitor configured to listen on the old default +port (6789) will continue to listen for v1 connections on 6789 and begin to +listen for v2 connections on the new default port 3300. 
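
To verify which protocols each monitor currently advertises, you can dump the
monmap and review each monitor's address vector (the addresses shown in the
explanation below are only illustrative):

.. prompt:: bash $

   ceph mon dump

Monitors that have been switched to msgr2 are listed with both a ``v2:`` and a
``v1:`` address, for example ``[v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0]``.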
+ +If a monitor is configured to listen for v1 connections on a non-standard port +(that is, a port other than 6789), then the monmap will need to be modified +manually. + + +MON_DISK_LOW +____________ + +One or more monitors are low on disk space. This health check is raised if the +percentage of available space on the file system used by the monitor database +(normally ``/var/lib/ceph/mon``) drops below the percentage value +``mon_data_avail_warn`` (default: 30%). + +This alert might indicate that some other process or user on the system is +filling up the file system used by the monitor. It might also +indicate that the monitor database is too large (see ``MON_DISK_BIG`` +below). + +If space cannot be freed, the monitor's data directory might need to be +moved to another storage device or file system (this relocation process must be carried out while the monitor +daemon is not running). + + +MON_DISK_CRIT +_____________ + +One or more monitors are critically low on disk space. This health check is raised if the +percentage of available space on the file system used by the monitor database +(normally ``/var/lib/ceph/mon``) drops below the percentage value +``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above. + +MON_DISK_BIG +____________ + +The database size for one or more monitors is very large. This health check is +raised if the size of the monitor database is larger than +``mon_data_size_warn`` (default: 15 GiB). + +A large database is unusual, but does not necessarily indicate a problem. +Monitor databases might grow in size when there are placement groups that have +not reached an ``active+clean`` state in a long time. + +This alert might also indicate that the monitor's database is not properly +compacting, an issue that has been observed with some older versions of leveldb +and rocksdb. Forcing a compaction with ``ceph daemon mon.<id> compact`` might +shrink the database's on-disk size. + +This alert might also indicate that the monitor has a bug that prevents it from +pruning the cluster metadata that it stores. If the problem persists, please +report a bug. + +To adjust the warning threshold, run the following command: + +.. prompt:: bash $ + + ceph config set global mon_data_size_warn <size> + + +AUTH_INSECURE_GLOBAL_ID_RECLAIM +_______________________________ + +One or more clients or daemons that are connected to the cluster are not +securely reclaiming their ``global_id`` (a unique number that identifies each +entity in the cluster) when reconnecting to a monitor. The client is being +permitted to connect anyway because the +``auth_allow_insecure_global_id_reclaim`` option is set to ``true`` (which may +be necessary until all Ceph clients have been upgraded) and because the +``auth_expose_insecure_global_id_reclaim`` option is set to ``true`` (which +allows monitors to detect clients with "insecure reclaim" sooner by forcing +those clients to reconnect immediately after their initial authentication). + +To identify which client(s) are using unpatched Ceph client code, run the +following command: + +.. prompt:: bash $ + + ceph health detail + +If you collect a dump of the clients that are connected to an individual +monitor and examine the ``global_id_status`` field in the output of the dump, +you can see the ``global_id`` reclaim behavior of those clients. Here +``reclaim_insecure`` means that a client is unpatched and is contributing to +this health check. To effect a client dump, run the following command: + +.. 
prompt:: bash $ + + ceph tell mon.\* sessions + +We strongly recommend that all clients in the system be upgraded to a newer +version of Ceph that correctly reclaims ``global_id`` values. After all clients +have been updated, run the following command to stop allowing insecure +reconnections: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +If it is impractical to upgrade all clients immediately, you can temporarily +silence this alert by running the following command: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this alert +indefinitely by running the following command: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim false + +AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED +_______________________________________ + +Ceph is currently configured to allow clients that reconnect to monitors using +an insecure process to reclaim their previous ``global_id``. Such reclaiming is +allowed because, by default, ``auth_allow_insecure_global_id_reclaim`` is set +to ``true``. It might be necessary to leave this setting enabled while existing +Ceph clients are upgraded to newer versions of Ceph that correctly and securely +reclaim their ``global_id``. + +If the ``AUTH_INSECURE_GLOBAL_ID_RECLAIM`` health check has not also been +raised and if the ``auth_expose_insecure_global_id_reclaim`` setting has not +been disabled (it is enabled by default), then there are currently no clients +connected that need to be upgraded. In that case, it is safe to disable +``insecure global_id reclaim`` by running the following command: + +.. prompt:: bash $ + + ceph config set mon auth_allow_insecure_global_id_reclaim false + +On the other hand, if there are still clients that need to be upgraded, then +this alert can be temporarily silenced by running the following command: + +.. prompt:: bash $ + + ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w # 1 week + +Although we do NOT recommend doing so, you can also disable this alert indefinitely +by running the following command: + +.. prompt:: bash $ + + ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false + + +Manager +------- + +MGR_DOWN +________ + +All manager daemons are currently down. The cluster should normally have at +least one running manager (``ceph-mgr``) daemon. If no manager daemon is +running, the cluster's ability to monitor itself will be compromised, and parts +of the management API will become unavailable (for example, the dashboard will +not work, and most CLI commands that report metrics or runtime state will +block). However, the cluster will still be able to perform all I/O operations +and to recover from failures. + +The "down" manager daemon should be restarted as soon as possible to ensure +that the cluster can be monitored (for example, so that the ``ceph -s`` +information is up to date, or so that metrics can be scraped by Prometheus). + + +MGR_MODULE_DEPENDENCY +_____________________ + +An enabled manager module is failing its dependency check. This health check +typically comes with an explanatory message from the module about the problem. + +For example, a module might report that a required package is not installed: in +this case, you should install the required package and restart your manager +daemons. + +This health check is applied only to enabled modules. 
If a module is not +enabled, you can see whether it is reporting dependency issues in the output of +`ceph module ls`. + + +MGR_MODULE_ERROR +________________ + +A manager module has experienced an unexpected error. Typically, this means +that an unhandled exception was raised from the module's `serve` function. The +human-readable description of the error might be obscurely worded if the +exception did not provide a useful description of itself. + +This health check might indicate a bug: please open a Ceph bug report if you +think you have encountered a bug. + +However, if you believe the error is transient, you may restart your manager +daemon(s) or use ``ceph mgr fail`` on the active daemon in order to force +failover to another daemon. + +OSDs +---- + +OSD_DOWN +________ + +One or more OSDs are marked "down". The ceph-osd daemon might have been +stopped, or peer OSDs might be unable to reach the OSD over the network. +Common causes include a stopped or crashed daemon, a "down" host, or a network +outage. + +Verify that the host is healthy, the daemon is started, and the network is +functioning. If the daemon has crashed, the daemon log file +(``/var/log/ceph/ceph-osd.*``) might contain debugging information. + +OSD_<crush type>_DOWN +_____________________ + +(for example, OSD_HOST_DOWN, OSD_ROOT_DOWN) + +All of the OSDs within a particular CRUSH subtree are marked "down" (for +example, all OSDs on a host). + +OSD_ORPHAN +__________ + +An OSD is referenced in the CRUSH map hierarchy, but does not exist. + +To remove the OSD from the CRUSH map hierarchy, run the following command: + +.. prompt:: bash $ + + ceph osd crush rm osd.<id> + +OSD_OUT_OF_ORDER_FULL +_____________________ + +The utilization thresholds for `nearfull`, `backfillfull`, `full`, and/or +`failsafe_full` are not ascending. In particular, the following pattern is +expected: `nearfull < backfillfull`, `backfillfull < full`, and `full < +failsafe_full`. + +To adjust these utilization thresholds, run the following commands: + +.. prompt:: bash $ + + ceph osd set-nearfull-ratio <ratio> + ceph osd set-backfillfull-ratio <ratio> + ceph osd set-full-ratio <ratio> + + +OSD_FULL +________ + +One or more OSDs have exceeded the `full` threshold and are preventing the +cluster from servicing writes. + +To check utilization by pool, run the following command: + +.. prompt:: bash $ + + ceph df + +To see the currently defined `full` ratio, run the following command: + +.. prompt:: bash $ + + ceph osd dump | grep full_ratio + +A short-term workaround to restore write availability is to raise the full +threshold by a small amount. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd set-full-ratio <ratio> + +Additional OSDs should be deployed in order to add new storage to the cluster, +or existing data should be deleted in order to free up space in the cluster. + +OSD_BACKFILLFULL +________________ + +One or more OSDs have exceeded the `backfillfull` threshold or *would* exceed +it if the currently-mapped backfills were to finish, which will prevent data +from rebalancing to this OSD. This alert is an early warning that +rebalancing might be unable to complete and that the cluster is approaching +full. + +To check utilization by pool, run the following command: + +.. prompt:: bash $ + + ceph df + +OSD_NEARFULL +____________ + +One or more OSDs have exceeded the `nearfull` threshold. This alert is an early +warning that the cluster is approaching full. + +To check utilization by pool, run the following command: + +.. 
prompt:: bash $ + + ceph df + +OSDMAP_FLAGS +____________ + +One or more cluster flags of interest have been set. These flags include: + +* *full* - the cluster is flagged as full and cannot serve writes +* *pauserd*, *pausewr* - there are paused reads or writes +* *noup* - OSDs are not allowed to start +* *nodown* - OSD failure reports are being ignored, and that means that the + monitors will not mark OSDs "down" +* *noin* - OSDs that were previously marked ``out`` are not being marked + back ``in`` when they start +* *noout* - "down" OSDs are not automatically being marked ``out`` after the + configured interval +* *nobackfill*, *norecover*, *norebalance* - recovery or data + rebalancing is suspended +* *noscrub*, *nodeep_scrub* - scrubbing is disabled +* *notieragent* - cache-tiering activity is suspended + +With the exception of *full*, these flags can be set or cleared by running the +following commands: + +.. prompt:: bash $ + + ceph osd set <flag> + ceph osd unset <flag> + +OSD_FLAGS +_________ + +One or more OSDs or CRUSH {nodes,device classes} have a flag of interest set. +These flags include: + +* *noup*: these OSDs are not allowed to start +* *nodown*: failure reports for these OSDs will be ignored +* *noin*: if these OSDs were previously marked ``out`` automatically + after a failure, they will not be marked ``in`` when they start +* *noout*: if these OSDs are "down" they will not automatically be marked + ``out`` after the configured interval + +To set and clear these flags in batch, run the following commands: + +.. prompt:: bash $ + + ceph osd set-group <flags> <who> + ceph osd unset-group <flags> <who> + +For example: + +.. prompt:: bash $ + + ceph osd set-group noup,noout osd.0 osd.1 + ceph osd unset-group noup,noout osd.0 osd.1 + ceph osd set-group noup,noout host-foo + ceph osd unset-group noup,noout host-foo + ceph osd set-group noup,noout class-hdd + ceph osd unset-group noup,noout class-hdd + +OLD_CRUSH_TUNABLES +__________________ + +The CRUSH map is using very old settings and should be updated. The oldest set +of tunables that can be used (that is, the oldest client version that can +connect to the cluster) without raising this health check is determined by the +``mon_crush_min_required_version`` config option. For more information, see +:ref:`crush-map-tunables`. + +OLD_CRUSH_STRAW_CALC_VERSION +____________________________ + +The CRUSH map is using an older, non-optimal method of calculating intermediate +weight values for ``straw`` buckets. + +The CRUSH map should be updated to use the newer method (that is: +``straw_calc_version=1``). For more information, see :ref:`crush-map-tunables`. + +CACHE_POOL_NO_HIT_SET +_____________________ + +One or more cache pools are not configured with a *hit set* to track +utilization. This issue prevents the tiering agent from identifying cold +objects that are to be flushed and evicted from the cache. + +To configure hit sets on the cache pool, run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <poolname> hit_set_type <type> + ceph osd pool set <poolname> hit_set_period <period-in-seconds> + ceph osd pool set <poolname> hit_set_count <number-of-hitsets> + ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate> + +OSD_NO_SORTBITWISE +__________________ + +No pre-Luminous v12.y.z OSDs are running, but the ``sortbitwise`` flag has not +been set. + +The ``sortbitwise`` flag must be set in order for OSDs running Luminous v12.y.z +or newer to start. 
To safely set the flag, run the following command:

.. prompt:: bash $

   ceph osd set sortbitwise

OSD_FILESTORE
_____________

Warn if OSDs are running Filestore. The Filestore OSD back end has been
deprecated; the BlueStore back end has been the default object store since the
Ceph Luminous release.

The ``mclock_scheduler`` is not supported for Filestore OSDs. For this reason,
the default ``osd_op_queue`` is set to ``wpq`` for Filestore OSDs and is
enforced even if the user attempts to change it.

To identify any OSDs that are running Filestore, run the following command:

.. prompt:: bash $

   ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}'

**In order to upgrade to Reef or a later release, you must first migrate any
Filestore OSDs to BlueStore.**

If you are upgrading a pre-Reef release to Reef or later, but it is not
feasible to migrate Filestore OSDs to BlueStore immediately, you can
temporarily silence this alert by running the following command:

.. prompt:: bash $

   ceph health mute OSD_FILESTORE

Since this migration can take a considerable amount of time to complete, we
recommend that you begin the process well in advance of any update to Reef or
to later releases.

POOL_FULL
_________

One or more pools have reached their quota and are no longer allowing writes.

To see pool quotas and utilization, run the following command:

.. prompt:: bash $

   ceph df detail

If you opt to raise the pool quota, run the following commands:

.. prompt:: bash $

   ceph osd pool set-quota <poolname> max_objects <num-objects>
   ceph osd pool set-quota <poolname> max_bytes <num-bytes>

If not, delete some existing data to reduce utilization.

BLUEFS_SPILLOVER
________________

One or more OSDs that use the BlueStore back end have been allocated `db`
partitions (that is, storage space for metadata, normally on a faster device),
but because that space has been filled, metadata has "spilled over" onto the
slow device. This is not necessarily an error condition or even unexpected
behavior, but may result in degraded performance. If the administrator had
expected that all metadata would fit on the faster device, this alert indicates
that not enough space was provided.

To disable this alert on all OSDs, run the following command:

.. prompt:: bash $

   ceph config set osd bluestore_warn_on_bluefs_spillover false

Alternatively, to disable the alert on a specific OSD, run the following
command:

.. prompt:: bash $

   ceph config set osd.123 bluestore_warn_on_bluefs_spillover false

To secure more metadata space, you can destroy and reprovision the OSD in
question. This process involves data migration and recovery.

It might also be possible to expand the LVM logical volume that backs the `db`
storage. If the underlying LV has been expanded, you must stop the OSD daemon
and inform BlueFS of the device-size change by running the following command:

.. prompt:: bash $

   ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID

BLUEFS_AVAILABLE_SPACE
______________________

To see how much space is free for BlueFS, run the following command:

.. prompt:: bash $

   ceph daemon osd.123 bluestore bluefs available

This will output up to three values: ``BDEV_DB free``, ``BDEV_SLOW free``, and
``available_from_bluestore``. ``BDEV_DB`` and ``BDEV_SLOW`` report the amount
of space that has been acquired by BlueFS and is now considered free. The value
``available_from_bluestore`` indicates the ability of BlueStore to relinquish
more space to BlueFS. It is normal for this value to differ from the amount of
BlueStore free space, because the BlueFS allocation unit is typically larger
than the BlueStore allocation unit. This means that only part of the BlueStore
free space will be available for BlueFS.

BLUEFS_LOW_SPACE
________________

If BlueFS is running low on available free space and there is not much free
space available from BlueStore (in other words, ``available_from_bluestore``
has a low value), consider reducing the BlueFS allocation unit size. To
simulate available space when the allocation unit is different, run the
following command:

.. prompt:: bash $

   ceph daemon osd.123 bluestore bluefs available <alloc-unit-size>

BLUESTORE_FRAGMENTATION
_______________________

As BlueStore operates, the free space on the underlying storage will become
fragmented. This is normal and unavoidable, but excessive fragmentation causes
slowdown. To inspect BlueStore fragmentation, run the following command:

.. prompt:: bash $

   ceph daemon osd.123 bluestore allocator score block

The fragmentation score is given in a [0-1] range:

* [0.0 .. 0.4] tiny fragmentation
* [0.4 .. 0.7] small, acceptable fragmentation
* [0.7 .. 0.9] considerable, but safe fragmentation
* [0.9 .. 1.0] severe fragmentation, might impact BlueFS's ability to get space from BlueStore

To see a detailed report of free fragments, run the following command:

.. prompt:: bash $

   ceph daemon osd.123 bluestore allocator dump block

For OSD processes that are not currently running, fragmentation can be
inspected with ``ceph-bluestore-tool``. To see the fragmentation score, run the
following command:

.. prompt:: bash $

   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score

To dump detailed free chunks, run the following command:

.. prompt:: bash $

   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump

BLUESTORE_LEGACY_STATFS
_______________________

One or more OSDs have BlueStore volumes that were created prior to the
Nautilus release. (In Nautilus, BlueStore tracks its internal usage
statistics on a granular, per-pool basis.)

If *all* OSDs are older than Nautilus, this means that the per-pool metrics
are simply unavailable. But if there is a mixture of pre-Nautilus and
post-Nautilus OSDs, the cluster usage statistics reported by ``ceph df`` will
be inaccurate.

The old OSDs can be updated to use the new usage-tracking scheme by stopping
each OSD, running a repair operation, and then restarting the OSD. For example,
to update ``osd.123``, run the following commands:

.. prompt:: bash $

   systemctl stop ceph-osd@123
   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
   systemctl start ceph-osd@123

To disable this alert, run the following command:

.. prompt:: bash $

   ceph config set global bluestore_warn_on_legacy_statfs false

BLUESTORE_NO_PER_POOL_OMAP
__________________________

One or more OSDs have volumes that were created prior to the Octopus release.
(In Octopus and later releases, BlueStore tracks omap space utilization by
pool.)

If there are any BlueStore OSDs that do not have the new tracking enabled, the
cluster will report an approximate value for per-pool omap usage based on the
most recent deep scrub.

The OSDs can be updated to track by pool by stopping each OSD, running a repair
operation, and then restarting the OSD. For example, to update ``osd.123``, run
the following commands:

.. prompt:: bash $

   systemctl stop ceph-osd@123
   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
   systemctl start ceph-osd@123

To disable this alert, run the following command:

.. prompt:: bash $

   ceph config set global bluestore_warn_on_no_per_pool_omap false

BLUESTORE_NO_PER_PG_OMAP
________________________

One or more OSDs have volumes that were created prior to Pacific. (In Pacific
and later releases, BlueStore tracks omap space utilization by Placement Group
(PG).)

Per-PG omap allows faster PG removal when PGs migrate.

The older OSDs can be updated to track by PG by stopping each OSD, running a
repair operation, and then restarting the OSD. For example, to update
``osd.123``, run the following commands:

.. prompt:: bash $

   systemctl stop ceph-osd@123
   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
   systemctl start ceph-osd@123

To disable this alert, run the following command:

.. prompt:: bash $

   ceph config set global bluestore_warn_on_no_per_pg_omap false


BLUESTORE_DISK_SIZE_MISMATCH
____________________________

One or more BlueStore OSDs have an internal inconsistency between the size of
the physical device and the metadata that tracks its size. This inconsistency
can lead to the OSD(s) crashing in the future.

The OSDs that have this inconsistency should be destroyed and reprovisioned. Be
very careful to execute this procedure on only one OSD at a time, so as to
minimize the risk of losing any data. To execute this procedure, where ``$N``
is the OSD that has the inconsistency, run the following commands:

.. prompt:: bash $

   ceph osd out osd.$N
   while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
   ceph osd destroy osd.$N
   ceph-volume lvm zap /path/to/device
   ceph-volume lvm create --osd-id $N --data /path/to/device

.. note::

   Wait for this recovery procedure to complete on one OSD before running it
   on the next.

BLUESTORE_NO_COMPRESSION
________________________

One or more OSDs are unable to load a BlueStore compression plugin. This issue
might be caused by a broken installation, in which the ``ceph-osd`` binary does
not match the compression plugins. Or it might be caused by a recent upgrade in
which the ``ceph-osd`` daemon was not restarted.

To resolve this issue, verify that all of the packages on the host that is
running the affected OSD(s) are correctly installed and that the OSD daemon(s)
have been restarted. If the problem persists, check the OSD log for information
about the source of the problem.

BLUESTORE_SPURIOUS_READ_ERRORS
______________________________

One or more BlueStore OSDs have detected spurious read errors on the main
device. BlueStore has recovered from these errors by retrying disk reads. This
alert might indicate issues with underlying hardware, issues with the I/O
subsystem, or something similar. In theory, such issues can cause permanent
data corruption. Some observations on the root cause of spurious read errors
can be found here: https://tracker.ceph.com/issues/22464

This alert does not require an immediate response, but the affected host might
need additional attention: for example, upgrading the host to the latest
OS/kernel versions and implementing hardware-resource-utilization monitoring.
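
As a starting point for such an investigation, you might list the devices on
the affected OSD's host and review their SMART data. In the sketch below,
``<hostname>`` and ``/dev/sdX`` are placeholders for your environment, and
``smartctl`` is provided by the ``smartmontools`` package:

.. prompt:: bash $

   ceph device ls-by-host <hostname>
   sudo smartctl -a /dev/sdX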
+ +To disable this alert on all OSDs, run the following command: + +.. prompt:: bash $ + + ceph config set osd bluestore_warn_on_spurious_read_errors false + +Or, to disable this alert on a specific OSD, run the following command: + +.. prompt:: bash $ + + ceph config set osd.123 bluestore_warn_on_spurious_read_errors false + +Device health +------------- + +DEVICE_HEALTH +_____________ + +One or more OSD devices are expected to fail soon, where the warning threshold +is determined by the ``mgr/devicehealth/warn_threshold`` config option. + +Because this alert applies only to OSDs that are currently marked ``in``, the +appropriate response to this expected failure is (1) to mark the OSD ``out`` so +that data is migrated off of the OSD, and then (2) to remove the hardware from +the system. Note that this marking ``out`` is normally done automatically if +``mgr/devicehealth/self_heal`` is enabled (as determined by +``mgr/devicehealth/mark_out_threshold``). + +To check device health, run the following command: + +.. prompt:: bash $ + + ceph device info <device-id> + +Device life expectancy is set either by a prediction model that the mgr runs or +by an external tool that is activated by running the following command: + +.. prompt:: bash $ + + ceph device set-life-expectancy <device-id> <from> <to> + +You can change the stored life expectancy manually, but such a change usually +doesn't accomplish anything. The reason for this is that whichever tool +originally set the stored life expectancy will probably undo your change by +setting it again, and a change to the stored value does not affect the actual +health of the hardware device. + +DEVICE_HEALTH_IN_USE +____________________ + +One or more devices (that is, OSDs) are expected to fail soon and have been +marked ``out`` of the cluster (as controlled by +``mgr/devicehealth/mark_out_threshold``), but they are still participating in +one or more Placement Groups. This might be because the OSD(s) were marked +``out`` only recently and data is still migrating, or because data cannot be +migrated off of the OSD(s) for some reason (for example, the cluster is nearly +full, or the CRUSH hierarchy is structured so that there isn't another suitable +OSD to migrate the data to). + +This message can be silenced by disabling self-heal behavior (that is, setting +``mgr/devicehealth/self_heal`` to ``false``), by adjusting +``mgr/devicehealth/mark_out_threshold``, or by addressing whichever condition +is preventing data from being migrated off of the ailing OSD(s). + +.. _rados_health_checks_device_health_toomany: + +DEVICE_HEALTH_TOOMANY +_____________________ + +Too many devices (that is, OSDs) are expected to fail soon, and because +``mgr/devicehealth/self_heal`` behavior is enabled, marking ``out`` all of the +ailing OSDs would exceed the cluster's ``mon_osd_min_in_ratio`` ratio. This +ratio prevents a cascade of too many OSDs from being automatically marked +``out``. + +You should promptly add new OSDs to the cluster to prevent data loss, or +incrementally replace the failing OSDs. + +Alternatively, you can silence this health check by adjusting options including +``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``. Be +warned, however, that this will increase the likelihood of unrecoverable data +loss. + + +Data health (pools & placement groups) +-------------------------------------- + +PG_AVAILABILITY +_______________ + +Data availability is reduced. 
In other words, the cluster is unable to service +potential read or write requests for at least some data in the cluster. More +precisely, one or more Placement Groups (PGs) are in a state that does not +allow I/O requests to be serviced. Any of the following PG states are +problematic if they do not clear quickly: *peering*, *stale*, *incomplete*, and +the lack of *active*. + +For detailed information about which PGs are affected, run the following +command: + +.. prompt:: bash $ + + ceph health detail + +In most cases, the root cause of this issue is that one or more OSDs are +currently ``down``: see ``OSD_DOWN`` above. + +To see the state of a specific problematic PG, run the following command: + +.. prompt:: bash $ + + ceph tell <pgid> query + +PG_DEGRADED +___________ + +Data redundancy is reduced for some data: in other words, the cluster does not +have the desired number of replicas for all data (in the case of replicated +pools) or erasure code fragments (in the case of erasure-coded pools). More +precisely, one or more Placement Groups (PGs): + +* have the *degraded* or *undersized* flag set, which means that there are not + enough instances of that PG in the cluster; or +* have not had the *clean* state set for a long time. + +For detailed information about which PGs are affected, run the following +command: + +.. prompt:: bash $ + + ceph health detail + +In most cases, the root cause of this issue is that one or more OSDs are +currently "down": see ``OSD_DOWN`` above. + +To see the state of a specific problematic PG, run the following command: + +.. prompt:: bash $ + + ceph tell <pgid> query + + +PG_RECOVERY_FULL +________________ + +Data redundancy might be reduced or even put at risk for some data due to a +lack of free space in the cluster. More precisely, one or more Placement Groups +have the *recovery_toofull* flag set, which means that the cluster is unable to +migrate or recover data because one or more OSDs are above the ``full`` +threshold. + +For steps to resolve this condition, see *OSD_FULL* above. + +PG_BACKFILL_FULL +________________ + +Data redundancy might be reduced or even put at risk for some data due to a +lack of free space in the cluster. More precisely, one or more Placement Groups +have the *backfill_toofull* flag set, which means that the cluster is unable to +migrate or recover data because one or more OSDs are above the ``backfillfull`` +threshold. + +For steps to resolve this condition, see *OSD_BACKFILLFULL* above. + +PG_DAMAGED +__________ + +Data scrubbing has discovered problems with data consistency in the cluster. +More precisely, one or more Placement Groups either (1) have the *inconsistent* +or ``snaptrim_error`` flag set, which indicates that an earlier data scrub +operation found a problem, or (2) have the *repair* flag set, which means that +a repair for such an inconsistency is currently in progress. + +For more information, see :doc:`pg-repair`. + +OSD_SCRUB_ERRORS +________________ + +Recent OSD scrubs have discovered inconsistencies. This alert is generally +paired with *PG_DAMAGED* (see above). + +For more information, see :doc:`pg-repair`. + +OSD_TOO_MANY_REPAIRS +____________________ + +The count of read repairs has exceeded the config value threshold +``mon_osd_warn_num_repaired`` (default: ``10``). 
Because scrub handles errors +only for data at rest, and because any read error that occurs when another +replica is available will be repaired immediately so that the client can get +the object data, there might exist failing disks that are not registering any +scrub errors. This repair count is maintained as a way of identifying any such +failing disks. + + +LARGE_OMAP_OBJECTS +__________________ + +One or more pools contain large omap objects, as determined by +``osd_deep_scrub_large_omap_object_key_threshold`` (threshold for the number of +keys to determine what is considered a large omap object) or +``osd_deep_scrub_large_omap_object_value_sum_threshold`` (the threshold for the +summed size in bytes of all key values to determine what is considered a large +omap object) or both. To find more information on object name, key count, and +size in bytes, search the cluster log for 'Large omap object found'. This issue +can be caused by RGW-bucket index objects that do not have automatic resharding +enabled. For more information on resharding, see :ref:`RGW Dynamic Bucket Index +Resharding <rgw_dynamic_bucket_index_resharding>`. + +To adjust the thresholds mentioned above, run the following commands: + +.. prompt:: bash $ + + ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys> + ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes> + +CACHE_POOL_NEAR_FULL +____________________ + +A cache-tier pool is nearly full, as determined by the ``target_max_bytes`` and +``target_max_objects`` properties of the cache pool. Once the pool reaches the +target threshold, write requests to the pool might block while data is flushed +and evicted from the cache. This state normally leads to very high latencies +and poor performance. + +To adjust the cache pool's target size, run the following commands: + +.. prompt:: bash $ + + ceph osd pool set <cache-pool-name> target_max_bytes <bytes> + ceph osd pool set <cache-pool-name> target_max_objects <objects> + +There might be other reasons that normal cache flush and evict activity are +throttled: for example, reduced availability of the base tier, reduced +performance of the base tier, or overall cluster load. + +TOO_FEW_PGS +___________ + +The number of Placement Groups (PGs) that are in use in the cluster is below +the configurable threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can +lead to suboptimal distribution and suboptimal balance of data across the OSDs +in the cluster, and a reduction of overall performance. + +If data pools have not yet been created, this condition is expected. + +To address this issue, you can increase the PG count for existing pools or +create new pools. For more information, see +:ref:`choosing-number-of-placement-groups`. + +POOL_PG_NUM_NOT_POWER_OF_TWO +____________________________ + +One or more pools have a ``pg_num`` value that is not a power of two. Although +this is not strictly incorrect, it does lead to a less balanced distribution of +data because some Placement Groups will have roughly twice as much data as +others have. + +This is easily corrected by setting the ``pg_num`` value for the affected +pool(s) to a nearby power of two. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <value> + +To disable this health check, run the following command: + +.. 
prompt:: bash $ + + ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false + +POOL_TOO_FEW_PGS +________________ + +One or more pools should probably have more Placement Groups (PGs), given the +amount of data that is currently stored in the pool. This issue can lead to +suboptimal distribution and suboptimal balance of data across the OSDs in the +cluster, and a reduction of overall performance. This alert is raised only if +the ``pg_autoscale_mode`` property on the pool is set to ``warn``. + +To disable the alert, entirely disable auto-scaling of PGs for the pool by +running the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs for the pool, +run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +Alternatively, to manually set the number of PGs for the pool to the +recommended amount, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +For more information, see :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler`. + +TOO_MANY_PGS +____________ + +The number of Placement Groups (PGs) in use in the cluster is above the +configurable threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold +is exceeded, the cluster will not allow new pools to be created, pool `pg_num` +to be increased, or pool replication to be increased (any of which, if allowed, +would lead to more PGs in the cluster). A large number of PGs can lead to +higher memory utilization for OSD daemons, slower peering after cluster state +changes (for example, OSD restarts, additions, or removals), and higher load on +the Manager and Monitor daemons. + +The simplest way to mitigate the problem is to increase the number of OSDs in +the cluster by adding more hardware. Note that, because the OSD count that is +used for the purposes of this health check is the number of ``in`` OSDs, +marking ``out`` OSDs ``in`` (if there are any ``out`` OSDs available) can also +help. To do so, run the following command: + +.. prompt:: bash $ + + ceph osd in <osd id(s)> + +For more information, see :ref:`choosing-number-of-placement-groups`. + +POOL_TOO_MANY_PGS +_________________ + +One or more pools should probably have fewer Placement Groups (PGs), given the +amount of data that is currently stored in the pool. This issue can lead to +higher memory utilization for OSD daemons, slower peering after cluster state +changes (for example, OSD restarts, additions, or removals), and higher load on +the Manager and Monitor daemons. This alert is raised only if the +``pg_autoscale_mode`` property on the pool is set to ``warn``. + +To disable the alert, entirely disable auto-scaling of PGs for the pool by +running the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode off + +To allow the cluster to automatically adjust the number of PGs for the pool, +run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_autoscale_mode on + +Alternatively, to manually set the number of PGs for the pool to the +recommended amount, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> pg_num <new-pg-num> + +For more information, see :ref:`choosing-number-of-placement-groups` and +:ref:`pg-autoscaler`. 
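
Before changing ``pg_num`` on individual pools, it can help to review the PG
autoscaler's view of the cluster, which lists each pool's current and suggested
PG counts (the exact output columns vary slightly between releases):

.. prompt:: bash $

   ceph osd pool autoscale-status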
+ + +POOL_TARGET_SIZE_BYTES_OVERCOMMITTED +____________________________________ + +One or more pools have a ``target_size_bytes`` property that is set in order to +estimate the expected size of the pool, but the value(s) of this property are +greater than the total available storage (either by themselves or in +combination with other pools). + +This alert is usually an indication that the ``target_size_bytes`` value for +the pool is too large and should be reduced or set to zero. To reduce the +``target_size_bytes`` value or set it to zero, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +The above command sets the value of ``target_size_bytes`` to zero. To set the +value of ``target_size_bytes`` to a non-zero value, replace the ``0`` with that +non-zero value. + +For more information, see :ref:`specifying_pool_target_size`. + +POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO +____________________________________ + +One or more pools have both ``target_size_bytes`` and ``target_size_ratio`` set +in order to estimate the expected size of the pool. Only one of these +properties should be non-zero. If both are set to a non-zero value, then +``target_size_ratio`` takes precedence and ``target_size_bytes`` is ignored. + +To reset ``target_size_bytes`` to zero, run the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool-name> target_size_bytes 0 + +For more information, see :ref:`specifying_pool_target_size`. + +TOO_FEW_OSDS +____________ + +The number of OSDs in the cluster is below the configurable threshold of +``osd_pool_default_size``. This means that some or all data may not be able to +satisfy the data protection policy specified in CRUSH rules and pool settings. + +SMALLER_PGP_NUM +_______________ + +One or more pools have a ``pgp_num`` value less than ``pg_num``. This alert is +normally an indication that the Placement Group (PG) count was increased +without any increase in the placement behavior. + +This disparity is sometimes brought about deliberately, in order to separate +out the `split` step when the PG count is adjusted from the data migration that +is needed when ``pgp_num`` is changed. + +This issue is normally resolved by setting ``pgp_num`` to match ``pg_num``, so +as to trigger the data migration, by running the following command: + +.. prompt:: bash $ + + ceph osd pool set <pool> pgp_num <pg-num-value> + +MANY_OBJECTS_PER_PG +___________________ + +One or more pools have an average number of objects per Placement Group (PG) +that is significantly higher than the overall cluster average. The specific +threshold is determined by the ``mon_pg_warn_max_object_skew`` configuration +value. + +This alert is usually an indication that the pool(s) that contain most of the +data in the cluster have too few PGs, or that other pools that contain less +data have too many PGs. See *TOO_MANY_PGS* above. + +To silence the health check, raise the threshold by adjusting the +``mon_pg_warn_max_object_skew`` config option on the managers. + +The health check will be silenced for a specific pool only if +``pg_autoscale_mode`` is set to ``on``. + +POOL_APP_NOT_ENABLED +____________________ + +A pool exists but the pool has not been tagged for use by a particular +application. + +To resolve this issue, tag the pool for use by an application. For +example, if the pool is used by RBD, run the following command: + +.. 
prompt:: bash $

   rbd pool init <poolname>

Alternatively, if the pool is being used by a custom application (here 'foo'),
you can label the pool by running the following low-level command:

.. prompt:: bash $

   ceph osd pool application enable <poolname> foo

For more information, see :ref:`associate-pool-to-application`.

POOL_FULL
_________

One or more pools have reached (or are very close to reaching) their quota. The
threshold to raise this health check is determined by the
``mon_pool_quota_crit_threshold`` configuration option.

Pool quotas can be adjusted up or down (or removed) by running the following
commands:

.. prompt:: bash $

   ceph osd pool set-quota <pool> max_bytes <bytes>
   ceph osd pool set-quota <pool> max_objects <objects>

To disable a quota, set the quota value to 0.

POOL_NEAR_FULL
______________

One or more pools are approaching a configured fullness threshold.

One of the several thresholds that can raise this health check is determined by
the ``mon_pool_quota_warn_threshold`` configuration option.

Pool quotas can be adjusted up or down (or removed) by running the following
commands:

.. prompt:: bash $

   ceph osd pool set-quota <pool> max_bytes <bytes>
   ceph osd pool set-quota <pool> max_objects <objects>

To disable a quota, set the quota value to 0.

Other thresholds that can raise the two health checks above are
``mon_osd_nearfull_ratio`` and ``mon_osd_full_ratio``. For details and
resolution, see :ref:`storage-capacity` and :ref:`no-free-drive-space`.

OBJECT_MISPLACED
________________

One or more objects in the cluster are not stored on the node that CRUSH would
prefer that they be stored on. This alert is an indication that data migration
due to a recent cluster change has not yet completed.

Misplaced data is not a dangerous condition in and of itself; data consistency
is never at risk, and old copies of objects will not be removed until the
desired number of new copies (in the desired locations) has been created.

OBJECT_UNFOUND
______________

One or more objects in the cluster cannot be found. More precisely, the OSDs
know that a new or updated copy of an object should exist, but no such copy has
been found on OSDs that are currently online.

Read or write requests to unfound objects will block.

Ideally, a "down" OSD that has a more recent copy of the unfound object can be
brought back online. To identify candidate OSDs, check the peering state of the
PG(s) responsible for the unfound object. To see the peering state, run the
following command:

.. prompt:: bash $

   ceph tell <pgid> query

On the other hand, if the latest copy of the object is not available, the
cluster can be told to roll back to a previous version of the object. For more
information, see :ref:`failures-osd-unfound`.

SLOW_OPS
________

One or more OSD requests or monitor requests are taking a long time to process.
This alert might be an indication of extreme load, a slow storage device, or a
software bug.

To query the request queue for the daemon that is causing the slowdown, run the
following command from the daemon's host:

.. prompt:: bash $

   ceph daemon osd.<id> ops

To see a summary of the slowest recent requests, run the following command:

.. prompt:: bash $

   ceph daemon osd.<id> dump_historic_ops

To see the location of a specific OSD, run the following command:

.. prompt:: bash $

   ceph osd find osd.<id>

PG_NOT_SCRUBBED
_______________

One or more Placement Groups (PGs) have not been scrubbed recently. PGs are
normally scrubbed within an interval determined by
:confval:`osd_scrub_max_interval` globally. This interval can be overridden on
a per-pool basis by changing the value of the variable
:confval:`scrub_max_interval`. This health check is raised if a certain
percentage (determined by ``mon_warn_pg_not_scrubbed_ratio``) of the interval
has elapsed after the time the scrub was scheduled and no scrub has been
performed.

PGs will be scrubbed only if they are flagged as ``clean`` (which means that
they are to be cleaned, and not that they have been examined and found to be
clean). Misplaced or degraded PGs will not be flagged as ``clean`` (see
*PG_AVAILABILITY* and *PG_DEGRADED* above).

To manually initiate a scrub of a clean PG, run the following command:

.. prompt:: bash $

   ceph pg scrub <pgid>

PG_NOT_DEEP_SCRUBBED
____________________

One or more Placement Groups (PGs) have not been deep scrubbed recently. PGs
are normally scrubbed every :confval:`osd_deep_scrub_interval` seconds at most.
This health check is raised if a certain percentage (determined by
``mon_warn_pg_not_deep_scrubbed_ratio``) of the interval has elapsed after the
time the scrub was scheduled and no scrub has been performed.

PGs will receive a deep scrub only if they are flagged as *clean* (which means
that they are to be cleaned, and not that they have been examined and found to
be clean). Misplaced or degraded PGs might not be flagged as ``clean`` (see
*PG_AVAILABILITY* and *PG_DEGRADED* above).

To manually initiate a deep scrub of a clean PG, run the following command:

.. prompt:: bash $

   ceph pg deep-scrub <pgid>


PG_SLOW_SNAP_TRIMMING
_____________________

The snapshot trim queue for one or more PGs has exceeded the configured warning
threshold. This alert indicates either that an extremely large number of
snapshots was recently deleted, or that OSDs are unable to trim snapshots
quickly enough to keep up with the rate of new snapshot deletions.

The warning threshold is determined by the ``mon_osd_snap_trim_queue_warn_on``
option (default: 32768).

This alert might be raised if OSDs are under excessive load and unable to keep
up with their background work, or if the OSDs' internal metadata database is
heavily fragmented and unable to perform. The alert might also indicate some
other performance issue with the OSDs.

The exact size of the snapshot trim queue is reported by the ``snaptrimq_len``
field of ``ceph pg ls -f json-detail``.

Stretch Mode
------------

INCORRECT_NUM_BUCKETS_STRETCH_MODE
__________________________________

Stretch mode currently supports only two dividing buckets that contain OSDs.
This warning indicates that the number of dividing buckets is not equal to two
after stretch mode has been enabled. You can expect unpredictable failures and
MON assertions until the condition is fixed.

We encourage you to fix this by removing any additional dividing buckets, or by
increasing the number of dividing buckets to two.

UNEVEN_WEIGHTS_STRETCH_MODE
___________________________

The two dividing buckets must have equal weights when stretch mode is enabled.
This warning indicates that the two dividing buckets have uneven weights after
stretch mode has been enabled. This is not immediately fatal; however, you can
expect Ceph to be confused when trying to process transitions between dividing
buckets.
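
To compare the weights of the two dividing buckets, you can inspect the CRUSH
hierarchy and check that the ``WEIGHT`` column of the two bucket rows matches
(this sketch assumes the dividing buckets are CRUSH ``datacenter`` buckets, as
in a typical stretch-mode deployment):

.. prompt:: bash $

   ceph osd tree | grep datacenter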

We encourage you to fix this by making the weights even on both dividing
buckets. This can be done by making sure the combined weight of the OSDs on
each dividing bucket is the same.

Miscellaneous
-------------

RECENT_CRASH
____________

One or more Ceph daemons have crashed recently, and the crash(es) have not yet
been acknowledged and archived by the administrator. This alert might indicate
a software bug, a hardware problem (for example, a failing disk), or some other
problem.

To list recent crashes, run the following command:

.. prompt:: bash $

   ceph crash ls-new

To examine information about a specific crash, run the following command:

.. prompt:: bash $

   ceph crash info <crash-id>

To silence this alert, you can archive the crash (perhaps after the crash
has been examined by an administrator) by running the following command:

.. prompt:: bash $

   ceph crash archive <crash-id>

Similarly, to archive all recent crashes, run the following command:

.. prompt:: bash $

   ceph crash archive-all

Archived crashes will still be visible by running the command ``ceph crash
ls``, but not by running the command ``ceph crash ls-new``.

The time period that is considered recent is determined by the option
``mgr/crash/warn_recent_interval`` (default: two weeks).

To entirely disable this alert, run the following command:

.. prompt:: bash $

   ceph config set mgr mgr/crash/warn_recent_interval 0

RECENT_MGR_MODULE_CRASH
_______________________

One or more ``ceph-mgr`` modules have crashed recently, and the crash(es) have
not yet been acknowledged and archived by the administrator. This alert
usually indicates a software bug in one of the software modules that are
running inside the ``ceph-mgr`` daemon. The module that experienced the problem
might be disabled as a result, but other modules are unaffected and continue to
function as expected.

As with the *RECENT_CRASH* health check, a specific crash can be inspected by
running the following command:

.. prompt:: bash $

   ceph crash info <crash-id>

To silence this alert, you can archive the crash (perhaps after the crash has
been examined by an administrator) by running the following command:

.. prompt:: bash $

   ceph crash archive <crash-id>

Similarly, to archive all recent crashes, run the following command:

.. prompt:: bash $

   ceph crash archive-all

Archived crashes will still be visible by running the command ``ceph crash ls``
but not by running the command ``ceph crash ls-new``.

The time period that is considered recent is determined by the option
``mgr/crash/warn_recent_interval`` (default: two weeks).

To entirely disable this alert, run the following command:

.. prompt:: bash $

   ceph config set mgr mgr/crash/warn_recent_interval 0

TELEMETRY_CHANGED
_________________

Telemetry has been enabled, but because the contents of the telemetry report
have changed in the meantime, telemetry reports will not be sent.

Ceph developers occasionally revise the telemetry feature to include new and
useful information, or to remove information found to be useless or sensitive.
If any new information is included in the report, Ceph requires the
administrator to re-enable telemetry. This requirement ensures that the
administrator has an opportunity to (re)review the information that will be
shared.

To review the contents of the telemetry report, run the following command:

.. prompt:: bash $

   ceph telemetry show

Note that the telemetry report consists of several channels that may be
independently enabled or disabled. For more information, see :ref:`telemetry`.

To re-enable telemetry (and silence the alert), run the following command:

.. prompt:: bash $

   ceph telemetry on

To disable telemetry (and silence the alert), run the following command:

.. prompt:: bash $

   ceph telemetry off

AUTH_BAD_CAPS
_____________

One or more auth users have capabilities that cannot be parsed by the monitors.
As a general rule, this alert indicates that there are one or more daemon types
that the user is not authorized to use to perform any action.

This alert is most likely to be raised after an upgrade if (1) the capabilities
were set with an older version of Ceph that did not properly validate the
syntax of those capabilities, or if (2) the syntax of the capabilities has
changed.

To remove the user(s) in question, run the following command:

.. prompt:: bash $

   ceph auth rm <entity-name>

(This resolves the health check, but it prevents clients from being able to
authenticate as the removed user.)

Alternatively, to update the capabilities for the user(s), run the following
command:

.. prompt:: bash $

   ceph auth caps <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...]

For more information about auth capabilities, see :ref:`user-management`.

OSD_NO_DOWN_OUT_INTERVAL
________________________

The ``mon_osd_down_out_interval`` option is set to zero, which means that the
system does not automatically perform any repair or healing operations when an
OSD fails. Instead, an administrator or an external orchestrator must manually
mark "down" OSDs as ``out`` (by running ``ceph osd out <osd-id>``) in order to
trigger recovery.

This option is normally set to five or ten minutes, which should be enough time
for a host to power-cycle or reboot.

To silence this alert, set ``mon_warn_on_osd_down_out_interval_zero`` to
``false`` by running the following command:

.. prompt:: bash $

   ceph config set mon mon_warn_on_osd_down_out_interval_zero false

DASHBOARD_DEBUG
_______________

The Dashboard debug mode is enabled. This means that if there is an error while
processing a REST API request, the HTTP error response will contain a Python
traceback. This mode should be disabled in production environments because such
a traceback might contain and expose sensitive information.

To disable the debug mode, run the following command:

.. prompt:: bash $

   ceph dashboard debug disable
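
To confirm whether debug mode is currently enabled, the dashboard module also
provides a ``status`` subcommand (available in recent Ceph releases):

.. prompt:: bash $

   ceph dashboard debug status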