.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, so that
the user knows when something bad has happened to a PCI device. It aims to:

  * Provide alert debug information.
  * Enable self-healing.
  * Provide a way to gather all needed debugging information when a problem
    needs vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, Rx/Tx
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * Out-of-box (OOB) initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
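
The reporter/callback relationship described above can be sketched as a small
userspace model. This is only an illustration: ``struct reporter``,
``struct reporter_ops``, ``reporter_init()``, ``reporter_report()`` and the
"tx" example driver are hypothetical names invented for this sketch and are
not the kernel's ``devlink`` API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace model; these are NOT the kernel's devlink types. */
struct reporter;

/* Per-reporter callbacks supplied by the driver; each one is optional. */
struct reporter_ops {
    const char *name;                                            /* error/health type */
    int (*recover)(struct reporter *r, void *ctx);               /* recovery procedure */
    int (*diagnose)(struct reporter *r, char *buf, size_t len);  /* diagnostics */
    int (*dump)(struct reporter *r, char *buf, size_t len);      /* object dump */
};

struct reporter {
    const struct reporter_ops *ops;
    void *priv;              /* driver-private data handed back to callbacks */
    unsigned error_count;    /* reports seen so far */
    unsigned recover_count;  /* successful recoveries so far */
};

/* Stand-in for a driver creating a reporter for one error/health type. */
static void reporter_init(struct reporter *r, const struct reporter_ops *ops,
                          void *priv)
{
    assert(r && ops && ops->name);
    r->ops = ops;
    r->priv = priv;
    r->error_count = 0;
    r->recover_count = 0;
}

/* Core-side handling of one report: count it, then run recover if provided. */
static int reporter_report(struct reporter *r, void *ctx)
{
    int err;

    r->error_count++;
    if (!r->ops->recover)
        return 0;
    err = r->ops->recover(r, ctx);
    if (!err)
        r->recover_count++;
    return err;
}

/* Example driver side: a hypothetical "tx" reporter for a stalled queue. */
struct tx_queue {
    int stalled;
};

static int tx_recover(struct reporter *r, void *ctx)
{
    struct tx_queue *q = r->priv;

    (void)ctx;
    q->stalled = 0;  /* pretend the queue was reset */
    return 0;
}

static int tx_diagnose(struct reporter *r, char *buf, size_t len)
{
    struct tx_queue *q = r->priv;

    return snprintf(buf, len, "stalled=%d", q->stalled);
}

/* No dump callback is set here, so .dump stays NULL. */
static const struct reporter_ops tx_ops = {
    .name = "tx",
    .recover = tx_recover,
    .diagnose = tx_diagnose,
};
```

In this model a driver would call ``reporter_init(&r, &tx_ops, &queue)`` once,
and ``reporter_report(&r, NULL)`` whenever it detects an error. A different
part of the driver could register another reporter with its own ops table.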

Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery
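
The actions above can be modelled in a few lines of standalone C. This is a
simplified sketch that follows the list as written; ``struct health_state``,
its fields and ``health_report()`` are hypothetical names, timestamps are
plain milliseconds, recovery is assumed to always succeed, and the exact
ordering and conditions used by the kernel may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-reporter state; field names are illustrative only. */
struct health_state {
    bool     auto_recover;      /* auto-recovery configuration */
    uint64_t grace_period_ms;   /* minimum gap between auto-recoveries */
    uint64_t last_recovery_ms;  /* time of the last recovery */
    unsigned error_count;       /* statistics: reports received */
    unsigned recover_count;     /* statistics: recoveries performed */
    bool     has_dump;          /* a dump is already stored */
    bool     healthy;           /* health status */
};

/*
 * Handle one error report arriving at time now_ms.
 * Returns true if an auto-recovery attempt was made.
 */
static bool health_report(struct health_state *s, uint64_t now_ms)
{
    assert(s);

    /* 1. A real implementation would emit a trace event here. */

    /* 2. Update health status and statistics. */
    s->error_count++;
    s->healthy = false;

    /* 3. Take a dump only if no dump is stored yet. */
    if (!s->has_dump)
        s->has_dump = true;

    /* 4. Auto-recovery depends on configuration and the grace period. */
    if (!s->auto_recover)
        return false;
    if (s->recover_count > 0 &&
        now_ms - s->last_recovery_ms < s->grace_period_ms)
        return false;  /* too soon after the last recovery */

    s->recover_count++;
    s->last_recovery_ms = now_ms;
    s->healthy = true;  /* assumption: recovery succeeded */
    return true;
}
```

With a 500 ms grace period, a report at t=1000 triggers recovery, a second
report at t=1200 is only counted and leaves the reporter in the error state,
and a report at t=1500 is recovered again because the grace period has
elapsed.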

User Interface
==============

The user can access/change each reporter's parameters and driver-specific
callbacks via ``devlink``, e.g. per error type (per health reporter):

  * Configure reporter's generic parameters (like: disable/enable auto recovery)
  * Invoke recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90
   :header-rows: 1

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per device and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers a reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves diagnostics data from a reporter on a device.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       The dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
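
The single-dump semantics of ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` described in the table can be
illustrated with a small userspace model. ``struct dump_slot``,
``dump_get()`` and ``dump_clear()`` are hypothetical names for this sketch.
In a real reporter the dump content comes from the driver's dump callback
rather than from a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical single-slot dump store, one per reporter. */
struct dump_slot {
    bool     valid;       /* true while a dump is stored */
    char     data[64];    /* stand-in for reporter-defined dump output */
    unsigned generation;  /* how many dumps have been generated so far */
};

/* Models DUMP_GET: return the stored dump, generating one if none exists. */
static const char *dump_get(struct dump_slot *d)
{
    assert(d);
    if (!d->valid) {
        d->generation++;
        snprintf(d->data, sizeof d->data, "dump #%u", d->generation);
        d->valid = true;
    }
    return d->data;
}

/* Models DUMP_CLEAR: drop the stored dump so the next get creates a new one. */
static void dump_clear(struct dump_slot *d)
{
    assert(d);
    d->valid = false;
}
```

Calling ``dump_get()`` twice in a row returns the same stored dump. Only
after ``dump_clear()`` does the next ``dump_get()`` generate a fresh one.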

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
     mlx5_core                             devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+