=====================
Diskprediction Module
=====================

The *diskprediction* module supports two modes: cloud mode and local mode. In cloud mode, disk and Ceph operating status information is collected from the Ceph cluster and sent to a cloud-based DiskPrediction server over the Internet. The DiskPrediction server analyzes the data and returns analytics and predictions of performance and disk health states for the Ceph cluster.

Local mode requires no external server for data analysis or results. In local mode, the *diskprediction* module uses an internal predictor module for the disk prediction service and returns the prediction results directly to the Ceph system.

| Local predictor: 70% accuracy
| Cloud predictor for free: 95% accuracy

Enabling
========

Run the following commands to enable the *diskprediction* modules in the Ceph
environment::

    ceph mgr module enable diskprediction_cloud
    ceph mgr module enable diskprediction_local


Select the prediction mode::

    ceph config set global device_failure_prediction_mode local

or::
  
    ceph config set global device_failure_prediction_mode cloud

To disable prediction::

  ceph config set global device_failure_prediction_mode none


Connection settings
===================
The connection settings are used for the connection between Ceph and the DiskPrediction server.

Local Mode
----------

The *diskprediction* module leverages the Ceph device health check to collect disk health metrics and uses an internal predictor module to produce disk failure predictions, which are returned to Ceph. Thus, no connection settings are required in local mode. The local predictor module requires at least six datasets of device health metrics to make a prediction.

Run the following command to use the local predictor to predict device life expectancy.

::

    ceph device predict-life-expectancy <device id>


Cloud Mode 
----------

User registration is required in cloud mode. Users must sign up for an account at https://www.diskprophet.com/#/ to receive the following DiskPrediction server information for the connection settings.

**Certificate file path**: After user registration is confirmed, the system sends a confirmation email that includes a certificate file download link. Download the certificate file and save it on the Ceph system. Run the following command to verify the file. Without certificate file verification, the connection settings cannot be completed.

**DiskPrediction server**: The DiskPrediction server name. An IP address can be used instead, if required.

**Connection account**: An account name used to set up the connection between Ceph and the DiskPrediction server.

**Connection password**: The password used to set up the connection between Ceph and the DiskPrediction server.

Run the following command to complete the connection setup.

::

    ceph device set-cloud-prediction-config <diskprediction_server> <connection_account> <connection_password> <certificate file path>
	

You can use the following command to display the connection settings:

::

    ceph device show-prediction-config


The following additional configuration settings are optional:

:diskprediction_upload_metrics_interval: How often Ceph performance metrics are sent to the DiskPrediction server.  Default is 10 minutes.
:diskprediction_upload_smart_interval: How often Ceph physical device information is sent to the DiskPrediction server.  Default is 12 hours.
:diskprediction_retrieve_prediction_interval: How often Ceph retrieves physical device prediction data from the DiskPrediction server.  Default is 12 hours.
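These intervals can be changed at runtime with ``ceph config set``. The ``mgr/diskprediction_cloud/`` option prefix below follows the usual manager-module option naming convention and is an assumption, not taken from this document; verify the exact key on your release:

```shell
# Example: send performance metrics every 20 minutes instead of 10.
# The mgr/<module>/<option> key format and the unit of the value are
# assumptions; confirm the actual option name with `ceph config ls`.
ceph config set mgr mgr/diskprediction_cloud/diskprediction_upload_metrics_interval 20
```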



Diskprediction Data
===================

The *diskprediction* module sends the following data to, and retrieves prediction results from, the DiskPrediction server.


Metrics Data
-------------
- Ceph cluster status

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|cluster_health        |Ceph health check status                 |
+----------------------+-----------------------------------------+
|num_mon               |Number of monitor nodes                  |
+----------------------+-----------------------------------------+
|num_mon_quorum        |Number of monitors in quorum             |
+----------------------+-----------------------------------------+
|num_osd               |Total number of OSDs                     |
+----------------------+-----------------------------------------+
|num_osd_up            |Number of OSDs that are up               |
+----------------------+-----------------------------------------+
|num_osd_in            |Number of OSDs that are in cluster       |
+----------------------+-----------------------------------------+
|osd_epoch             |Current epoch of OSD map                 |
+----------------------+-----------------------------------------+
|osd_bytes             |Total capacity of cluster in bytes       |
+----------------------+-----------------------------------------+
|osd_bytes_used        |Number of used bytes on cluster          |
+----------------------+-----------------------------------------+
|osd_bytes_avail       |Number of available bytes on cluster     |
+----------------------+-----------------------------------------+
|num_pool              |Number of pools                          |
+----------------------+-----------------------------------------+
|num_pg                |Total number of placement groups         |
+----------------------+-----------------------------------------+
|num_pg_active_clean   |Number of placement groups in            |
|                      |active+clean state                       |
+----------------------+-----------------------------------------+
|num_pg_active         |Number of placement groups in active     |
|                      |state                                    |
+----------------------+-----------------------------------------+
|num_pg_peering        |Number of placement groups in peering    |
|                      |state                                    |
+----------------------+-----------------------------------------+
|num_object            |Total number of objects on cluster       |
+----------------------+-----------------------------------------+
|num_object_degraded   |Number of degraded (missing replicas)    |
|                      |objects                                  |
+----------------------+-----------------------------------------+
|num_object_misplaced  |Number of misplaced (wrong location in   |
|                      |the cluster) objects                     |
+----------------------+-----------------------------------------+
|num_object_unfound    |Number of unfound objects                |
+----------------------+-----------------------------------------+
|num_bytes             |Total number of bytes of all objects     |
+----------------------+-----------------------------------------+
|num_mds_up            |Number of MDSs that are up               |
+----------------------+-----------------------------------------+
|num_mds_in            |Number of MDSs that are in cluster       |
+----------------------+-----------------------------------------+
|num_mds_failed        |Number of failed MDSs                    |
+----------------------+-----------------------------------------+
|mds_epoch             |Current epoch of MDS map                 |
+----------------------+-----------------------------------------+


- Ceph mon/osd performance counters

Mon:

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|num_sessions          |Current number of opened monitor sessions|
+----------------------+-----------------------------------------+
|session_add           |Number of created monitor sessions       |
+----------------------+-----------------------------------------+
|session_rm            |Number of remove_session calls in monitor|
+----------------------+-----------------------------------------+
|session_trim          |Number of trimmed monitor sessions       |
+----------------------+-----------------------------------------+
|num_elections         |Number of elections monitor took part in |
+----------------------+-----------------------------------------+
|election_call         |Number of elections started by monitor   |
+----------------------+-----------------------------------------+
|election_win          |Number of elections won by monitor       |
+----------------------+-----------------------------------------+
|election_lose         |Number of elections lost by monitor      |
+----------------------+-----------------------------------------+

Osd:

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|op_wip                |Replication operations currently being   |
|                      |processed (primary)                      |
+----------------------+-----------------------------------------+
|op_in_bytes           |Client operations total write size       |
+----------------------+-----------------------------------------+
|op_r                  |Client read operations                   |
+----------------------+-----------------------------------------+
|op_out_bytes          |Client operations total read size        |
+----------------------+-----------------------------------------+
|op_w                  |Client write operations                  |
+----------------------+-----------------------------------------+
|op_latency            |Latency of client operations (including  |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_process_latency    |Latency of client operations (excluding  |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_r_latency          |Latency of read operation (including     |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_r_process_latency  |Latency of read operation (excluding     |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_w_in_bytes         |Client data written                      |
+----------------------+-----------------------------------------+
|op_w_latency          |Latency of write operation (including    |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_w_process_latency  |Latency of write operation (excluding    |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_rw                 |Client read-modify-write operations      |
+----------------------+-----------------------------------------+
|op_rw_in_bytes        |Client read-modify-write operations write|
|                      |in                                       |
+----------------------+-----------------------------------------+
|op_rw_out_bytes       |Client read-modify-write operations read |
|                      |out                                      |
+----------------------+-----------------------------------------+
|op_rw_latency         |Latency of read-modify-write operation   |
|                      |(including queue time)                   |
+----------------------+-----------------------------------------+
|op_rw_process_latency |Latency of read-modify-write operation   |
|                      |(excluding queue time)                   |
+----------------------+-----------------------------------------+


- Ceph pool statistics

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|bytes_used            |Per pool bytes used                      |
+----------------------+-----------------------------------------+
|max_avail             |Max available number of bytes in the pool|
+----------------------+-----------------------------------------+
|objects               |Number of objects in the pool            |
+----------------------+-----------------------------------------+
|wr_bytes              |Number of bytes written in the pool      |
+----------------------+-----------------------------------------+
|dirty                 |Number of bytes dirty in the pool        |
+----------------------+-----------------------------------------+
|rd_bytes              |Number of bytes read in the pool         |
+----------------------+-----------------------------------------+
|stored_raw            |Bytes used in pool including copies made |
+----------------------+-----------------------------------------+

- Ceph physical device metadata

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|disk_domain_id        |Physical device identifier               |
+----------------------+-----------------------------------------+
|disk_name             |Device attachment name                   |
+----------------------+-----------------------------------------+
|disk_wwn              |Device wwn                               |
+----------------------+-----------------------------------------+
|model                 |Device model name                        |
+----------------------+-----------------------------------------+
|serial_number         |Device serial number                     |
+----------------------+-----------------------------------------+
|size                  |Device size                              |
+----------------------+-----------------------------------------+
|vendor                |Device vendor name                       |
+----------------------+-----------------------------------------+

- Ceph object correlation information
- The module agent information
- The module agent cluster information
- The module agent host information


SMART Data
-----------
- Ceph physical device SMART data (provided by Ceph *devicehealth* module)
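The SMART data consumed here can also be examined directly through the *devicehealth* infrastructure that this module builds on, for example:

```shell
# List the devices Ceph knows about, then dump the SMART metrics
# recorded for one of them by the devicehealth module.
ceph device ls
ceph device get-health-metrics <device id>
```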


Prediction Data
----------------
- Ceph physical device prediction data
 

Receiving predicted health status from a Ceph OSD disk drive
============================================================

You can receive the predicted health status of a Ceph OSD disk drive by using the
following command.

::

    ceph device get-predicted-status <device id>


The ``get-predicted-status`` command returns:

::

    {
        "near_failure": "Good",
        "disk_wwn": "5000011111111111",
        "serial_number": "111111111",
        "predicted": "2018-05-30 18:33:12",
        "attachment": "sdb"
    }


+--------------------+-----------------------------------------------------+
|Attribute           | Description                                         |
+====================+=====================================================+
|near_failure        | The disk failure prediction state:                  |
|                    | Good/Warning/Bad/Unknown                            |
+--------------------+-----------------------------------------------------+
|disk_wwn            | Disk WWN number                                     |
+--------------------+-----------------------------------------------------+
|serial_number       | Disk serial number                                  |
+--------------------+-----------------------------------------------------+
|predicted           | Predicted date                                      |
+--------------------+-----------------------------------------------------+
|attachment          | Device name on the local system                     |
+--------------------+-----------------------------------------------------+

The *near_failure* attribute indicates the predicted disk life expectancy, as mapped in the following table.

+--------------------+-----------------------------------------------------+
|near_failure        | Life expectancy (weeks)                             |
+====================+=====================================================+
|Good                | > 6 weeks                                           |
+--------------------+-----------------------------------------------------+
|Warning             | 2 weeks ~ 6 weeks                                   |
+--------------------+-----------------------------------------------------+
|Bad                 | < 2 weeks                                           |
+--------------------+-----------------------------------------------------+
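For scripting, the JSON returned by ``get-predicted-status`` can be filtered with a tool such as ``jq``; the pipeline below is a sketch and assumes ``jq`` is installed on the host:

```shell
# Extract just the prediction state and the local device name
# from the JSON shown above.
ceph device get-predicted-status <device id> | jq '{state: .near_failure, device: .attachment}'
```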
 

Debugging
=========

If you want to debug the *diskprediction* module, set the manager debug level
in the ``[mgr]`` section of the Ceph configuration file.

::

    [mgr]

        debug mgr = 20

With logging set to debug for the manager, the module prints log messages
with the prefix *mgr[diskprediction]* for easy filtering.