diff options
Diffstat (limited to '')
-rw-r--r-- | Documentation/networking/devlink/mlx5.rst | 288 |
1 files changed, 288 insertions, 0 deletions
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst new file mode 100644 index 0000000000..702f204a3d --- /dev/null +++ b/Documentation/networking/devlink/mlx5.rst @@ -0,0 +1,288 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +mlx5 devlink support +==================== + +This document describes the devlink features implemented by the ``mlx5`` +device driver. + +Parameters +========== + +.. list-table:: Generic parameters implemented + + * - Name + - Mode + - Validation + * - ``enable_roce`` + - driverinit + - Type: Boolean + + If the device supports RoCE disablement, RoCE enablement state controls + device support for RoCE capability. Otherwise, the control occurs in the + driver stack. When RoCE is disabled at the driver level, only raw + ethernet QPs are supported. + * - ``io_eq_size`` + - driverinit + - The range is between 64 and 4096. + * - ``event_eq_size`` + - driverinit + - The range is between 64 and 4096. + * - ``max_macs`` + - driverinit + - The range is between 1 and 2^31. Only power of 2 values are supported. + +The ``mlx5`` driver also implements the following driver-specific +parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``flow_steering_mode`` + - string + - runtime + - Controls the flow steering mode of the driver + + * ``dmfs`` Device managed flow steering. In DMFS mode, the HW + steering entities are created and managed through firmware. + * ``smfs`` Software managed flow steering. In SMFS mode, the HW + steering entities are created and manage through the driver without + firmware intervention. + + SMFS mode is faster and provides better rule insertion rate compared to + default DMFS mode. + * - ``fdb_large_groups`` + - u32 + - driverinit + - Control the number of large groups (size > 1) in the FDB table. + + * The default value is 15, and the range is between 1 and 1024. + * - ``esw_multiport`` + - Boolean + - runtime + - Control MultiPort E-Switch shared fdb mode. + + An experimental mode where a single E-Switch is used and all the vports + and physical ports on the NIC are connected to it. + + An example is to send traffic from a VF that is created on PF0 to an + uplink that is natively associated with the uplink of PF1 + + Note: Future devices, ConnectX-8 and onward, will eventually have this + as the default to allow forwarding between all NIC ports in a single + E-switch environment and the dual E-switch mode will likely get + deprecated. + + Default: disabled + * - ``esw_port_metadata`` + - Boolean + - runtime + - When applicable, disabling eswitch metadata can increase packet rate up + to 20% depending on the use case and packet sizes. + + Eswitch port metadata state controls whether to internally tag packets + with metadata. Metadata tagging must be enabled for multi-port RoCE, + failover between representors and stacked devices. By default metadata is + enabled on the supported devices in E-switch. Metadata is applicable only + for E-switch in switchdev mode and users may disable it when NONE of the + below use cases will be in use: + 1. HCA is in Dual/multi-port RoCE mode. + 2. VF/SF representor bonding (Usually used for Live migration) + 3. Stacked devices + + When metadata is disabled, the above use cases will fail to initialize if + users try to enable them. + * - ``hairpin_num_queues`` + - u32 + - driverinit + - We refer to a TC NIC rule that involves forwarding as "hairpin". + Hairpin queues are mlx5 hardware specific implementation for hardware + forwarding of such packets. + + Control the number of hairpin queues. + * - ``hairpin_queue_size`` + - u32 + - driverinit + - Control the size (in packets) of the hairpin queues. + +The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` + +Info versions +============= + +The ``mlx5`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``fw.psid`` + - fixed + - Used to represent the board id of the device. + * - ``fw.version`` + - stored, running + - Three digit major.minor.subminor firmware version number. + +Health reporters +================ + +tx reporter +----------- +The tx reporter is responsible for reporting and recovering of the following three error scenarios: + +- tx timeout + Report on kernel tx timeout detection. + Recover by searching lost interrupts. +- tx error completion + Report on error tx completion. + Recover by flushing the tx queue and reset it. +- tx PTP port timestamping CQ unhealthy + Report too many CQEs never delivered on port ts CQ. + Recover by flushing and re-creating all PTP channels. + +tx reporter also support on demand diagnose callback, on which it provides +real time information of its send queues status. + +User commands examples: + +- Diagnose send queues status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter tx + +.. note:: + This command has valid output only when interface is up, otherwise the command has empty output. + +- Show number of tx errors indicated, number of recover flows ended successfully, + is autorecover enabled and graceful period from last recover:: + + $ devlink health show pci/0000:82:00.0 reporter tx + +rx reporter +----------- +The rx reporter is responsible for reporting and recovering of the following two error scenarios: + +- rx queues' initialization (population) timeout + Population of rx queues' descriptors on ring initialization is done + in napi context via triggering an irq. In case of a failure to get + the minimum amount of descriptors, a timeout would occur, and + descriptors could be recovered by polling the EQ (Event Queue). +- rx completions with errors (reported by HW on interrupt context) + Report on rx completion error. + Recover (if needed) by flushing the related queue and reset it. + +rx reporter also supports on demand diagnose callback, on which it +provides real time information of its receive queues' status. + +- Diagnose rx queues' status and corresponding completion queue:: + + $ devlink health diagnose pci/0000:82:00.0 reporter rx + +.. note:: + This command has valid output only when interface is up. Otherwise, the command has empty output. + +- Show number of rx errors indicated, number of recover flows ended successfully, + is autorecover enabled, and graceful period from last recover:: + + $ devlink health show pci/0000:82:00.0 reporter rx + +fw reporter +----------- +The fw reporter implements `diagnose` and `dump` callbacks. +It follows symptoms of fw error such as fw syndrome by triggering +fw core dump and storing it into the dump buffer. +The fw reporter diagnose command can be triggered any time by the user to check +current fw status. + +User commands examples: + +- Check fw heath status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter fw + +- Read FW core dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.0 reporter fw + +.. note:: + This command can run only on the PF which has fw tracer ownership, + running it on other PF or any VF will return "Operation not permitted". + +fw fatal reporter +----------------- +The fw fatal reporter implements `dump` and `recover` callbacks. +It follows fatal errors indications by CR-space dump and recover flow. +The CR-space dump uses vsc interface which is valid even if the FW command +interface is not functional, which is the case in most FW fatal errors. +The recover function runs recover flow which reloads the driver and triggers fw +reset if needed. +On firmware error, the health buffer is dumped into the dmesg. The log +level is derived from the error's severity (given in health buffer). + +User commands examples: + +- Run fw recover flow manually:: + + $ devlink health recover pci/0000:82:00.0 reporter fw_fatal + +- Read FW CR-space dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal + +.. note:: + This command can run only on PF. + +vnic reporter +------------- +The vnic reporter implements only the `diagnose` callback. +It is responsible for querying the vnic diagnostic counters from fw and displaying +them in realtime. + +Description of the vnic counters: + +- total_q_under_processor_handle + number of queues in an error state due to + an async error or errored command. +- send_queue_priority_update_flow + number of QP/SQ priority/SL update events. +- cq_overrun + number of times CQ entered an error state due to an overflow. +- async_eq_overrun + number of times an EQ mapped to async events was overrun. + comp_eq_overrun number of times an EQ mapped to completion events was + overrun. +- quota_exceeded_command + number of commands issued and failed due to quota exceeded. +- invalid_command + number of commands issued and failed dues to any reason other than quota + exceeded. +- nic_receive_steering_discard + number of packets that completed RX flow + steering but were discarded due to a mismatch in flow table. +- generated_pkt_steering_fail + number of packets generated by the VNIC experiencing unexpected steering + failure (at any point in steering flow). +- handled_pkt_steering_fail + number of packets handled by the VNIC experiencing unexpected steering + failure (at any point in steering flow owned by the VNIC, including the FDB + for the eswitch owner). + +User commands examples: + +- Diagnose PF/VF vnic counters:: + + $ devlink health diagnose pci/0000:82:00.1 reporter vnic + +- Diagnose representor vnic counters (performed by supplying devlink port of the + representor, which can be obtained via devlink port command):: + + $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic + +.. note:: + This command can run over all interfaces such as PF/VF and representor ports. |