.. _hardware-monitoring:

Hardware monitoring
===================

`node-proxy` is the internal name of the agent that inventories a machine's hardware, reports the status of its various components and enables the operator to perform some actions.
It gathers details from the Redfish API, then processes the data and pushes it to the agent endpoint in the Ceph manager daemon.

.. graphviz::

     digraph G {
         node [shape=record];
         mgr [label="{<mgr> ceph manager}"];
         dashboard [label="<dashboard> ceph dashboard"];
         agent [label="<agent> agent"];
         redfish [label="<redfish> redfish"];
     
         agent -> redfish [label=" 1." color=green];
         agent -> mgr [label=" 2." color=orange];
         dashboard:dashboard -> mgr [label=" 3." color=lightgreen];
         node [shape=plaintext];
         legend [label=<<table border="0" cellborder="1" cellspacing="0">
             <tr><td bgcolor="lightgrey">Legend</td></tr>
             <tr><td align="left">1. Collects data from Redfish API</td></tr>
             <tr><td align="left">2. Pushes data to ceph mgr</td></tr>
             <tr><td align="left">3. Queries ceph mgr</td></tr>
         </table>>];
     }
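
For illustration, the kind of information the agent collects can be
reproduced manually by querying the Redfish service on a machine's BMC.
A minimal sketch, using the placeholder address and credentials from the
spec example below:

.. code-block:: bash

  # list the systems exposed by the Redfish service (standard Redfish
  # collection endpoint); -k skips the BMC's usually self-signed certificate
  curl -sk -u admin:p@ssword https://20.20.20.10/redfish/v1/Systems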


Limitations
-----------

For the time being, the `node-proxy` agent relies on the Redfish API.
This implies that both the `node-proxy` agent and the `ceph-mgr` daemon must be able to access the Out-Of-Band network.
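
A quick way to check this requirement is to verify that the Redfish service
root answers from the host running the ``ceph-mgr`` daemon, for instance
(again with the placeholder address used in this document):

.. code-block:: bash

  # the Redfish service root does not require authentication and should be
  # reachable over the Out-Of-Band network
  curl -sk https://20.20.20.10/redfish/v1/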


Deploying the agent
-------------------

| The first step is to provide the out-of-band management tool credentials.
| This can be done when adding the host, with a service spec file:

.. code-block:: bash

  # cat host.yml
  ---
  service_type: host
  hostname: node-10
  addr: 10.10.10.10
  oob:
    addr: 20.20.20.10
    username: admin
    password: p@ssword

Apply the spec:

.. code-block:: bash

  # ceph orch apply -i host.yml
  Added host 'node-10' with addr '10.10.10.10'

Deploy the agent by enabling the hardware monitoring option:

.. code-block:: bash

  # ceph config set mgr mgr/cephadm/hw_monitoring true
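
Once this option is set, cephadm takes care of deploying the agent on the
hosts for which ``oob`` settings were provided. The deployment can then be
followed with the orchestrator; a possible check, assuming the daemon type
matches the agent's internal name:

.. code-block:: bash

  # list the node-proxy daemons known to the orchestrator
  ceph orch ps --daemon-type node-proxy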

CLI
---

| **orch** **hardware** **status** [hostname] [--category CATEGORY] [--format plain | json]

Supported categories are:

* summary (default)
* memory
* storage
* processors
* network
* power
* fans
* firmwares
* criticals
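
Any of these reports can also be rendered as JSON for scripted consumption,
for example:

.. code-block:: bash

  # same data as the plain-text reports shown below, machine readable
  ceph orch hardware status node-10 --category memory --format json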

Examples
********


hardware health status summary
++++++++++++++++++++++++++++++++

.. code-block:: bash

  # ceph orch hardware status
  +------------+---------+-----+-----+--------+-------+------+
  |    HOST    | STORAGE | CPU | NET | MEMORY | POWER | FANS |
  +------------+---------+-----+-----+--------+-------+------+
  |   node-10  |    ok   |  ok |  ok |   ok   |   ok  |  ok  |
  +------------+---------+-----+-----+--------+-------+------+


storage devices report
++++++++++++++++++++++

.. code-block:: bash

  # ceph orch hardware status node-10 --category storage
  +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
  |    HOST    |                          NAME                          |      MODEL       |      SIZE      | PROTOCOL |       SN       | STATUS |  STATE  |
  +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
  |   node-10  | Disk 8 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT99QLL    |   OK   | Enabled |
  |   node-10  | Disk 10 in Backplane 1 of Storage Controller in Slot 2 | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT98ZYX    |   OK   | Enabled |
  |   node-10  | Disk 11 in Backplane 1 of Storage Controller in Slot 2 | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT98ZWB    |   OK   | Enabled |
  |   node-10  | Disk 9 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT98ZC9    |   OK   | Enabled |
  |   node-10  | Disk 3 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT9903Y    |   OK   | Enabled |
  |   node-10  | Disk 1 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT9901E    |   OK   | Enabled |
  |   node-10  | Disk 7 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT98ZQJ    |   OK   | Enabled |
  |   node-10  | Disk 2 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT99PA2    |   OK   | Enabled |
  |   node-10  | Disk 4 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT99PFG    |   OK   | Enabled |
  |   node-10  | Disk 0 in Backplane 0 of Storage Controller in Slot 2  | MZ7L33T8HBNAAD3  | 3840755981824  |   SATA   | S6M5NE0T800539 |   OK   | Enabled |
  |   node-10  | Disk 1 in Backplane 0 of Storage Controller in Slot 2  | MZ7L33T8HBNAAD3  | 3840755981824  |   SATA   | S6M5NE0T800554 |   OK   | Enabled |
  |   node-10  | Disk 6 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT98ZER    |   OK   | Enabled |
  |   node-10  | Disk 0 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT98ZEJ    |   OK   | Enabled |
  |   node-10  | Disk 5 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 |   SATA   |    ZVT99QMH    |   OK   | Enabled |
  |   node-10  |           Disk 0 on AHCI Controller in SL 6            |  MTFDDAV240TDU   |  240057409536  |   SATA   |  22373BB1E0F8  |   OK   | Enabled |
  |   node-10  |           Disk 1 on AHCI Controller in SL 6            |  MTFDDAV240TDU   |  240057409536  |   SATA   |  22373BB1E0D5  |   OK   | Enabled |
  +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+



firmware details
+++++++++++++++++

.. code-block:: bash

  # ceph orch hardware status node-10 --category firmwares
  +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+
  |    HOST    |                                 COMPONENT                                  |                             NAME                             |         DATE         |   VERSION   | STATUS |
  +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+
  |   node-10  |               current-107649-7.03__raid.backplane.firmware.0               |                         Backplane 0                          | 2022-12-05T00:00:00Z |     7.03    |   OK   |
  
  
  ... omitted output ...
  
  
  |   node-10  |               previous-25227-6.10.30.20__idrac.embedded.1-1                |             Integrated Remote Access Controller              |      00:00:00Z       |  6.10.30.20 |   OK   |
  +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+


hardware critical warnings report
+++++++++++++++++++++++++++++++++

.. code-block:: bash

  # ceph orch hardware status --category criticals
  +------------+-----------+------------+----------+-----------------+
  |    HOST    | COMPONENT |    NAME    |  STATUS  |      STATE      |
  +------------+-----------+------------+----------+-----------------+
  |   node-10  |   power   | PS2 Status | critical |    unplugged    |
  +------------+-----------+------------+----------+-----------------+


Developers
----------

.. py:currentmodule:: cephadm.agent
.. autoclass:: NodeProxyEndpoint
.. automethod:: NodeProxyEndpoint.__init__
.. automethod:: NodeProxyEndpoint.oob
.. automethod:: NodeProxyEndpoint.data
.. automethod:: NodeProxyEndpoint.fullreport
.. automethod:: NodeProxyEndpoint.summary
.. automethod:: NodeProxyEndpoint.criticals
.. automethod:: NodeProxyEndpoint.memory
.. automethod:: NodeProxyEndpoint.storage
.. automethod:: NodeProxyEndpoint.network
.. automethod:: NodeProxyEndpoint.power
.. automethod:: NodeProxyEndpoint.processors
.. automethod:: NodeProxyEndpoint.fans
.. automethod:: NodeProxyEndpoint.firmwares
.. automethod:: NodeProxyEndpoint.led