summaryrefslogtreecommitdiffstats
path: root/doc/dev/cephadm/compliance-check.rst
blob: eea462445adf0e499cbd76a1c816dd0bcecc23e1 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
================
Compliance Check
================

The stability and reliability of a Ceph cluster is dependent not just upon the Ceph daemons, but
also the OS and hardware that Ceph is installed on. This document is intended to promote a design 
discussion for providing a "compliance" feature within mgr/cephadm, which would be responsible for
identifying common platform-related issues that could impact Ceph stability and operation.

The ultimate goal of these checks is to identify issues early and raise a healthcheck WARN
event, to alert the Administrator to the issue.

Prerequisites
=============
In order to effectively analyse the hosts that Ceph is deployed to, this feature requires a cache
of host-related metadata. The metadata is already available from cephadm's HostFacts class and the
``gather-facts`` cephadm command. For the purposes of this document, we will assume that this
data is available within the mgr/cephadm "cache" structure.

Some checks will require that the host status is also populated e.g. ONLINE, OFFLINE, MAINTENANCE

Administrator Interaction
=========================
Not all users will require this feature, and must be able to 'opt out'. For this reason,
mgr/cephadm must provide controls, such as the following;

.. code-block::

   ceph cephadm compliance enable | disable | status [--format json]
   ceph cephadm compliance ls [--format json]
   ceph cephadm compliance enable-check <name>
   ceph cephadm compliance disable-check <name>
   ceph cephadm compliance set-check-interval <int>
   ceph cephadm compliance get-check-interval

The status option would show the enabled/disabled state of the feature, along with the
check-interval.

The ``ls`` subcommand would show all checks in the following format;

``check-name status description``

Proposed Integration
====================
The compliance checks are not required to run all the time, but instead should run at discrete
intervals. The interval would be configurable under via the :code:`set-check-interval`
subcommand (default would be every 12 hours)


mgr/cephadm currently executes an event driven (time based) serve loop to act on deploy/remove and
reconcile activity. In order to execute the compliance checks, the compliance check code would be 
called from this main serve loop - when the :code:`set-check-interval` is met.


Proposed Checks
===============
All checks would push any errors to a list, so multiple issues can be escalated to the Admin at
the same time. The list below provides a description of each check, with the text following the 
name indicating a shortname version *(the shortname is the reference for command Interaction
when enabling or disabling a check)*


OS Consistency (OS)
___________________
* all hosts must use same vendor
* all hosts must be on the same major release (this check would only be applicable to distributions that
  offer a long-term-support strategy (RHEL, CentOS, SLES, Ubuntu etc)


*src: gather-facts output*

Linux Kernel Security Mode (LSM)
________________________________
* All hosts should have a consistent SELINUX/AppArmor configuration

*src: gather-facts output*

Services Check (SERVICES)
_________________________
Hosts that are in an ONLINE state should adhere to the following;

* all daemons (systemd units) should be enabled
* all daemons should be running (not dead)

*src: list_daemons output*

Support Status (SUPPORT)
________________________
If support status has been detected, it should be consistent across all hosts. At this point
support status is available only for Red Hat machines.

*src: gather-facts output*

Network : MTU (MTU)
________________________________
All network interfaces on the same Ceph network (public/cluster) should have the same MTU

*src: gather-facts output*

Network : LinkSpeed (LINKSPEED)
____________________________________________
All network interfaces on the same Ceph network (public/cluster) should have the same Linkspeed

*src: gather-facts output*

Network : Consistency (INTERFACE)
______________________________________________
All hosts with OSDs should have consistent network configuration - eg. if some hosts do
not separate cluster/public traffic but others do, that is an anomaly that would generate a
compliance check warning.

*src: gather-facts output*

Notification Strategy
=====================
If any of the checks fail, mgr/cephadm would raise a WARN level alert

Futures
=======
The checks highlighted here serve only as a starting point, and we should expect to expand
on the checks over time.