summaryrefslogtreecommitdiffstats
path: root/Documentation/networking/device_drivers/ethernet/mellanox
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking/device_drivers/ethernet/mellanox')
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst1305
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst25
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst168
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst281
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst229
5 files changed, 2008 insertions, 0 deletions
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
new file mode 100644
index 000000000..f69ee1ebe
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -0,0 +1,1305 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+================
+Ethtool counters
+================
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents
+========
+
+- `Overview`_
+- `Groups`_
+- `Types`_
+- `Descriptions`_
+
+Overview
+========
+
+There are several counter groups based on where the counter is being counted. In
+addition, each group of counters may have different counter types.
+
+These counter groups are based on which component in a networking setup,
+illustrated below, that they describe::
+
+ ----------------------------------------
+ | |
+ ---------------------------------------- ---------------------------------------- |
+ | Hypervisor | | VM | |
+ | | | | |
+ | ------------------- --------------- | | ------------------- --------------- | |
+ | | Ethernet driver | | RDMA driver | | | | Ethernet driver | | RDMA driver | | |
+ | ------------------- --------------- | | ------------------- --------------- | |
+ | | | | | | | | |
+ | ------------------- | | ------------------- | |
+ | | | | | |--
+ ---------------------------------------- ----------------------------------------
+ | |
+ ------------- -----------------------------
+ | |
+ ------ ------ ------ ------ ------ ------ ------
+ -----| PF |----------------------| VF |-| VF |-| VF |----- --| PF |--- --| PF |--- --| PF |---
+ | ------ ------ ------ ------ | | ------ | | ------ | | ------ |
+ | | | | | | | |
+ | | | | | | | |
+ | | | | | | | |
+ | eSwitch | | eSwitch | | eSwitch | | eSwitch |
+ ---------------------------------------------------------- ----------- ----------- -----------
+ -------------------------------------------------------------------------------
+ | |
+ | |
+ | Uplink (no counters) |
+ -------------------------------------------------------------------------------
+ ---------------------------------------------------------------
+ | |
+ | |
+ | MPFS (no counters) |
+ ---------------------------------------------------------------
+ |
+ |
+ | Port
+
+Groups
+======
+
+Ring
+ Software counters populated by the driver stack.
+
+Netdev
+ An aggregation of software ring counters.
+
+vPort counters
+ Traffic counters and drops due to steering or no buffers. May indicate issues
+ with NIC. These counters include Ethernet traffic counters (including Raw
+ Ethernet) and RDMA/RoCE traffic counters.
+
+Physical port counters
+ Counters that collect statistics about the PFs and VFs. May indicate issues
+ with NIC, link, or network. This measuring point holds information on
+ standardized counters like IEEE 802.3, RFC2863, RFC 2819, RFC 3635 and
+ additional counters like flow control, FEC and more. Physical port counters
+ are not exposed to virtual machines.
+
+Priority Port Counters
+ A set of the physical port counters, per priority per port.
+
+Types
+=====
+
+Counters are divided into three types.
+
+Traffic Informative Counters
+ Counters which count traffic. These counters can be used for load estimation
+ or for general debug.
+
+Traffic Acceleration Counters
+ Counters which count traffic that was accelerated by Mellanox driver or by
+ hardware. The counters are an additional layer to the informative counter set,
+ and the same traffic is counted in both informative and acceleration counters.
+
+.. [#accel] Traffic acceleration counter.
+
+Error Counters
+ Increment of these counters might indicate a problem. Each of these counters
+ has an explanation and correction action.
+
+Statistic can be fetched via the `ip link` or `ethtool` commands. `ethtool`
+provides more detailed information.::
+
+ ip –s link show <if-name>
+ ethtool -S <if-name>
+
+Descriptions
+============
+
+XSK, PTP, and QoS counters that are similar to counters defined previously will
+not be separately listed. For example, `ptp_tx[i]_packets` will not be
+explicitly documented since `tx[i]_packets` describes the behavior of both
+counters, except `ptp_tx[i]_packets` is only counted when precision time
+protocol is used.
+
+Ring / Netdev Counter
+----------------------------
+The following counters are available per ring or software port.
+
+These counters provide information on the amount of traffic that was accelerated
+by the NIC. The counters are counting the accelerated traffic in addition to the
+standard counters which counts it (i.e. accelerated traffic is counted twice).
+
+The counter names in the table below refers to both ring and port counters. The
+notation for ring counters includes the [i] index without the braces. The
+notation for port counters doesn't include the [i]. A counter name
+`rx[i]_packets` will be printed as `rx0_packets` for ring 0 and `rx_packets` for
+the software port.
+
+.. flat-table:: Ring / Software Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx[i]_packets`
+ - The number of packets received on ring i.
+ - Informative
+
+ * - `rx[i]_bytes`
+ - The number of bytes received on ring i.
+ - Informative
+
+ * - `tx[i]_packets`
+ - The number of packets transmitted on ring i.
+ - Informative
+
+ * - `tx[i]_bytes`
+ - The number of bytes transmitted on ring i.
+ - Informative
+
+ * - `tx[i]_recover`
+ - The number of times the SQ was recovered.
+ - Error
+
+ * - `tx[i]_cqes`
+ - Number of CQEs events on SQ issued on ring i.
+ - Informative
+
+ * - `tx[i]_cqe_err`
+ - The number of error CQEs encountered on the SQ for ring i.
+ - Error
+
+ * - `tx[i]_tso_packets`
+ - The number of TSO packets transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_bytes`
+ - The number of TSO bytes transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_inner_packets`
+ - The number of TSO packets which are indicated to be carry internal
+ encapsulation transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_inner_bytes`
+ - The number of TSO bytes which are indicated to be carry internal
+ encapsulation transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_gro_packets`
+ - Number of received packets processed using hardware-accelerated GRO. The
+ number of hardware GRO offloaded packets received on ring i.
+ - Acceleration
+
+ * - `rx[i]_gro_bytes`
+ - Number of received bytes processed using hardware-accelerated GRO. The
+ number of hardware GRO offloaded bytes received on ring i.
+ - Acceleration
+
+ * - `rx[i]_gro_skbs`
+ - The number of receive SKBs constructed while performing
+ hardware-accelerated GRO.
+ - Informative
+
+ * - `rx[i]_gro_match_packets`
+ - Number of received packets processed using hardware-accelerated GRO that
+ met the flow table match criteria.
+ - Informative
+
+ * - `rx[i]_gro_large_hds`
+ - Number of receive packets using hardware-accelerated GRO that have large
+ headers that require additional memory to be allocated.
+ - Informative
+
+ * - `rx[i]_lro_packets`
+ - The number of LRO packets received on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_lro_bytes`
+ - The number of LRO bytes received on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_ecn_mark`
+ - The number of received packets where the ECN mark was turned on.
+ - Informative
+
+ * - `rx_oversize_pkts_buffer`
+ - The number of dropped received packets due to length which arrived to RQ
+ and exceed software buffer size allocated by the device for incoming
+ traffic. It might imply that the device MTU is larger than the software
+ buffers size.
+ - Error
+
+ * - `rx_oversize_pkts_sw_drop`
+ - Number of received packets dropped in software because the CQE data is
+ larger than the MTU size.
+ - Error
+
+ * - `rx[i]_csum_unnecessary`
+ - Packets received with a `CHECKSUM_UNNECESSARY` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_unnecessary_inner`
+ - Packets received with inner encapsulation with a `CHECKSUM_UNNECESSARY`
+ on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_none`
+ - Packets received with a `CHECKSUM_NONE` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_complete`
+ - Packets received with a `CHECKSUM_COMPLETE` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_complete_tail`
+ - Number of received packets that had checksum calculation computed,
+ potentially needed padding, and were able to do so with
+ `CHECKSUM_PARTIAL`.
+ - Informative
+
+ * - `rx[i]_csum_complete_tail_slow`
+ - Number of received packets that need padding larger than eight bytes for
+ the checksum.
+ - Informative
+
+ * - `tx[i]_csum_partial`
+ - Packets transmitted with a `CHECKSUM_PARTIAL` on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_csum_partial_inner`
+ - Packets transmitted with inner encapsulation with a `CHECKSUM_PARTIAL` on
+ ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_csum_none`
+ - Packets transmitted with no hardware checksum acceleration on ring i.
+ - Informative
+
+ * - `tx[i]_stopped` / `tx_queue_stopped` [#ring_global]_
+ - Events where SQ was full on ring i. If this counter is increased, check
+ the amount of buffers allocated for transmission.
+ - Informative
+
+ * - `tx[i]_wake` / `tx_queue_wake` [#ring_global]_
+ - Events where SQ was full and has become not full on ring i.
+ - Informative
+
+ * - `tx[i]_dropped` / `tx_queue_dropped` [#ring_global]_
+ - Packets transmitted that were dropped due to DMA mapping failure on
+ ring i. If this counter is increased, check the amount of buffers
+ allocated for transmission.
+ - Error
+
+ * - `tx[i]_nop`
+ - The number of nop WQEs (empty WQEs) inserted to the SQ (related to
+ ring i) due to the reach of the end of the cyclic buffer. When reaching
+ near to the end of cyclic buffer the driver may add those empty WQEs to
+ avoid handling a state the a WQE start in the end of the queue and ends
+ in the beginning of the queue. This is a normal condition.
+ - Informative
+
+ * - `tx[i]_added_vlan_packets`
+ - The number of packets sent where vlan tag insertion was offloaded to the
+ hardware.
+ - Acceleration
+
+ * - `rx[i]_removed_vlan_packets`
+ - The number of packets received where vlan tag stripping was offloaded to
+ the hardware.
+ - Acceleration
+
+ * - `rx[i]_wqe_err`
+ - The number of wrong opcodes received on ring i.
+ - Error
+
+ * - `rx[i]_mpwqe_frag`
+ - The number of WQEs that failed to allocate compound page and hence
+ fragmented MPWQE’s (Multi Packet WQEs) were used on ring i. If this
+ counter raise, it may suggest that there is no enough memory for large
+ pages, the driver allocated fragmented pages. This is not abnormal
+ condition.
+ - Informative
+
+ * - `rx[i]_mpwqe_filler_cqes`
+ - The number of filler CQEs events that were issued on ring i.
+ - Informative
+
+ * - `rx[i]_mpwqe_filler_strides`
+ - The number of strides consumed by filler CQEs on ring i.
+ - Informative
+
+ * - `tx[i]_mpwqe_blks`
+ - The number of send blocks processed from Multi-Packet WQEs (mpwqe).
+ - Informative
+
+ * - `tx[i]_mpwqe_pkts`
+ - The number of send packets processed from Multi-Packet WQEs (mpwqe).
+ - Informative
+
+ * - `rx[i]_cqe_compress_blks`
+ - The number of receive blocks with CQE compression on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_cqe_compress_pkts`
+ - The number of receive packets with CQE compression on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_add`
+ - The number of aRFS flow rules added to the device for direct RQ steering
+ on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_request_in`
+ - Number of flow rules that have been requested to move into ring i for
+ direct RQ steering [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_request_out`
+ - Number of flow rules that have been requested to move out of ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_expired`
+ - Number of flow rules that have been expired and removed [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_err`
+ - Number of flow rules that failed to be added to the flow table.
+ - Error
+
+ * - `rx[i]_recover`
+ - The number of times the RQ was recovered.
+ - Error
+
+ * - `tx[i]_xmit_more`
+ - The number of packets sent with `xmit_more` indication set on the skbuff
+ (no doorbell).
+ - Acceleration
+
+ * - `ch[i]_poll`
+ - The number of invocations of NAPI poll of channel i.
+ - Informative
+
+ * - `ch[i]_arm`
+ - The number of times the NAPI poll function completed and armed the
+ completion queues on channel i.
+ - Informative
+
+ * - `ch[i]_aff_change`
+ - The number of times the NAPI poll function explicitly stopped execution
+ on a CPU due to a change in affinity, on channel i.
+ - Informative
+
+ * - `ch[i]_events`
+ - The number of hard interrupt events on the completion queues of channel i.
+ - Informative
+
+ * - `ch[i]_eq_rearm`
+ - The number of times the EQ was recovered.
+ - Error
+
+ * - `ch[i]_force_irq`
+ - Number of times NAPI is triggered by XSK wakeups by posting a NOP to
+ ICOSQ.
+ - Acceleration
+
+ * - `rx[i]_congst_umr`
+ - The number of times an outstanding UMR request is delayed due to
+ congestion, on ring i.
+ - Informative
+
+ * - `rx_pp_alloc_fast`
+ - Number of successful fast path allocations.
+ - Informative
+
+ * - `rx_pp_alloc_slow`
+ - Number of slow path order-0 allocations.
+ - Informative
+
+ * - `rx_pp_alloc_slow_high_order`
+ - Number of slow path high order allocations.
+ - Informative
+
+ * - `rx_pp_alloc_empty`
+ - Counter is incremented when ptr ring is empty, so a slow path allocation
+ was forced.
+ - Informative
+
+ * - `rx_pp_alloc_refill`
+ - Counter is incremented when an allocation which triggered a refill of the
+ cache.
+ - Informative
+
+ * - `rx_pp_alloc_waive`
+ - Counter is incremented when pages obtained from the ptr ring that cannot
+ be added to the cache due to a NUMA mismatch.
+ - Informative
+
+ * - `rx_pp_recycle_cached`
+ - Counter is incremented when recycling placed page in the page pool cache.
+ - Informative
+
+ * - `rx_pp_recycle_cache_full`
+ - Counter is incremented when page pool cache was full.
+ - Informative
+
+ * - `rx_pp_recycle_ring`
+ - Counter is incremented when page placed into the ptr ring.
+ - Informative
+
+ * - `rx_pp_recycle_ring_full`
+ - Counter is incremented when page released from page pool because the ptr
+ ring was full.
+ - Informative
+
+ * - `rx_pp_recycle_released_ref`
+ - Counter is incremented when page released (and not recycled) because
+ refcnt > 1.
+ - Informative
+
+ * - `rx[i]_xsk_buff_alloc_err`
+ - The number of times allocating an skb or XSK buffer failed in the XSK RQ
+ context.
+ - Error
+
+ * - `rx[i]_xdp_tx_xmit`
+ - The number of packets forwarded back to the port due to XDP program
+ `XDP_TX` action (bouncing). these packets are not counted by other
+ software counters. These packets are counted by physical port and vPort
+ counters.
+ - Informative
+
+ * - `rx[i]_xdp_tx_mpwqe`
+ - Number of multi-packet WQEs transmitted by the netdev and `XDP_TX`-ed by
+ the netdev during the RQ context.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_inlnw`
+ - Number of WQE data segments transmitted where the data could be inlined
+ in the WQE and then `XDP_TX`-ed during the RQ context.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_nops`
+ - Number of NOP WQEBBs (WQE building blocks) received posted to the XDP SQ.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_full`
+ - The number of packets that should have been forwarded back to the port
+ due to `XDP_TX` action but were dropped due to full tx queue. These packets
+ are not counted by other software counters. These packets are counted by
+ physical port and vPort counters. You may open more rx queues and spread
+ traffic rx over all queues and/or increase rx ring size.
+ - Error
+
+ * - `rx[i]_xdp_tx_err`
+ - The number of times an `XDP_TX` error such as frame too long and frame
+ too short occurred on `XDP_TX` ring of RX ring.
+ - Error
+
+ * - `rx[i]_xdp_tx_cqes` / `rx_xdp_tx_cqe` [#ring_global]_
+ - The number of completions received on the CQ of the `XDP_TX` ring.
+ - Informative
+
+ * - `rx[i]_xdp_drop`
+ - The number of packets dropped due to XDP program `XDP_DROP` action. these
+ packets are not counted by other software counters. These packets are
+ counted by physical port and vPort counters.
+ - Informative
+
+ * - `rx[i]_xdp_redirect`
+ - The number of times an XDP redirect action was triggered on ring i.
+ - Acceleration
+
+ * - `tx[i]_xdp_xmit`
+ - The number of packets redirected to the interface(due to XDP redirect).
+ These packets are not counted by other software counters. These packets
+ are counted by physical port and vPort counters.
+ - Informative
+
+ * - `tx[i]_xdp_full`
+ - The number of packets redirected to the interface(due to XDP redirect),
+ but were dropped due to full tx queue. these packets are not counted by
+ other software counters. you may enlarge tx queues.
+ - Informative
+
+ * - `tx[i]_xdp_mpwqe`
+ - Number of multi-packet WQEs offloaded onto the NIC that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_inlnw`
+ - Number of WQE data segments where the data could be inlined in the WQE
+ where the data segments were `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_nops`
+ - Number of NOP WQEBBs (WQE building blocks) posted to the SQ that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_err`
+ - The number of packets redirected to the interface(due to XDP redirect)
+ but were dropped due to error such as frame too long and frame too short.
+ - Error
+
+ * - `tx[i]_xdp_cqes`
+ - The number of completions received for packets redirected to the
+ interface(due to XDP redirect) on the CQ.
+ - Informative
+
+ * - `tx[i]_xsk_xmit`
+ - The number of packets transmitted using XSK zerocopy functionality.
+ - Acceleration
+
+ * - `tx[i]_xsk_mpwqe`
+ - Number of multi-packet WQEs offloaded onto the NIC that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xsk_inlnw`
+ - Number of WQE data segments where the data could be inlined in the WQE
+ that are transmitted using XSK zerocopy.
+ - Acceleration
+
+ * - `tx[i]_xsk_full`
+ - Number of times doorbell is rung in XSK zerocopy mode when SQ is full.
+ - Error
+
+ * - `tx[i]_xsk_err`
+ - Number of errors that occurred in XSK zerocopy mode such as if the data
+ size is larger than the MTU size.
+ - Error
+
+ * - `tx[i]_xsk_cqes`
+ - Number of CQEs processed in XSK zerocopy mode.
+ - Acceleration
+
+ * - `tx_tls_ctx`
+ - Number of TLS TX HW offload contexts added to device for encryption.
+ - Acceleration
+
+ * - `tx_tls_del`
+ - Number of TLS TX HW offload contexts removed from device (connection
+ closed).
+ - Acceleration
+
+ * - `tx_tls_pool_alloc`
+ - Number of times a unit of work is successfully allocated in the TLS HW
+ offload pool.
+ - Acceleration
+
+ * - `tx_tls_pool_free`
+ - Number of times a unit of work is freed in the TLS HW offload pool.
+ - Acceleration
+
+ * - `rx_tls_ctx`
+ - Number of TLS RX HW offload contexts added to device for decryption.
+ - Acceleration
+
+ * - `rx_tls_del`
+ - Number of TLS RX HW offload contexts deleted from device (connection has
+ finished).
+ - Acceleration
+
+ * - `rx[i]_tls_decrypted_packets`
+ - Number of successfully decrypted RX packets which were part of a TLS
+ stream.
+ - Acceleration
+
+ * - `rx[i]_tls_decrypted_bytes`
+ - Number of TLS payload bytes in RX packets which were successfully
+ decrypted.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_pkt`
+ - Number of received TLS packets with a resync request.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_start`
+ - Number of times the TLS async resync request was started.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_end`
+ - Number of times the TLS async resync request properly ended with
+ providing the HW tracked tcp-seq.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_skip`
+ - Number of times the TLS async resync request procedure was started but
+ not properly ended.
+ - Error
+
+ * - `rx[i]_tls_resync_res_ok`
+ - Number of times the TLS resync response call to the driver was
+ successfully handled.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_res_retry`
+ - Number of times the TLS resync response call to the driver was
+ reattempted when ICOSQ is full.
+ - Error
+
+ * - `rx[i]_tls_resync_res_skip`
+ - Number of times the TLS resync response call to the driver was terminated
+ unsuccessfully.
+ - Error
+
+ * - `rx[i]_tls_err`
+ - Number of times when CQE TLS offload was problematic.
+ - Error
+
+ * - `tx[i]_tls_encrypted_packets`
+ - The number of send packets that are TLS encrypted by the kernel.
+ - Acceleration
+
+ * - `tx[i]_tls_encrypted_bytes`
+ - The number of send bytes that are TLS encrypted by the kernel.
+ - Acceleration
+
+ * - `tx[i]_tls_ooo`
+ - Number of times out of order TLS SQE fragments were handled on ring i.
+ - Acceleration
+
+ * - `tx[i]_tls_dump_packets`
+ - Number of TLS decrypted packets copied over from NIC over DMA.
+ - Acceleration
+
+ * - `tx[i]_tls_dump_bytes`
+ - Number of TLS decrypted bytes copied over from NIC over DMA.
+ - Acceleration
+
+ * - `tx[i]_tls_resync_bytes`
+ - Number of TLS bytes requested to be resynchronized in order to be
+ decrypted.
+ - Acceleration
+
+ * - `tx[i]_tls_skip_no_sync_data`
+ - Number of TLS send data that can safely be skipped / do not need to be
+ decrypted.
+ - Acceleration
+
+ * - `tx[i]_tls_drop_no_sync_data`
+ - Number of TLS send data that were dropped due to retransmission of TLS
+ data.
+ - Acceleration
+
+ * - `ptp_cq[i]_abort`
+ - Number of times a CQE has to be skipped in precision time protocol due to
+ a skew between the port timestamp and CQE timestamp being greater than
+ 128 seconds.
+ - Error
+
+ * - `ptp_cq[i]_abort_abs_diff_ns`
+ - Accumulation of time differences between the port timestamp and CQE
+ timestamp when the difference is greater than 128 seconds in precision
+ time protocol.
+ - Error
+
+ * - `ptp_cq[i]_late_cqe`
+ - Number of times a CQE has been delivered on the PTP timestamping CQ when
+ the CQE was not expected since a certain amount of time had elapsed where
+ the device typically ensures not posting the CQE.
+ - Error
+
+.. [#ring_global] The corresponding ring and global counters do not share the
+ same name (i.e. do not follow the common naming scheme).
+
+vPort Counters
+--------------
+Counters on the NIC port that is connected to a eSwitch.
+
+.. flat-table:: vPort Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_vport_unicast_packets`
+ - Unicast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_unicast_bytes`
+ - Unicast bytes received, steered to a port including Raw Ethernet QP/DPDK
+ traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_unicast_packets`
+ - Unicast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_unicast_bytes`
+ - Unicast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_multicast_packets`
+ - Multicast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_multicast_bytes`
+ - Multicast bytes received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_multicast_packets`
+ - Multicast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_multicast_bytes`
+ - Multicast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_broadcast_packets`
+ - Broadcast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_broadcast_bytes`
+ - Broadcast bytes received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_broadcast_packets`
+ - Broadcast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_broadcast_bytes`
+ - Broadcast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_rdma_unicast_packets`
+ - RDMA unicast packets received, steered to a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_unicast_bytes`
+ - RDMA unicast bytes received, steered to a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_unicast_packets`
+ - RDMA unicast packets transmitted, steered from a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_unicast_bytes`
+ - RDMA unicast bytes transmitted, steered from a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_multicast_packets`
+ - RDMA multicast packets received, steered to a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_multicast_bytes`
+ - RDMA multicast bytes received, steered to a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_multicast_packets`
+ - RDMA multicast packets transmitted, steered from a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_multicast_bytes`
+ - RDMA multicast bytes transmitted, steered from a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `vport_loopback_packets`
+ - Unicast, multicast and broadcast packets that were loop-back (received
+ and transmitted), IB/Eth [#accel]_.
+ - Acceleration
+
+ * - `vport_loopback_bytes`
+ - Unicast, multicast and broadcast bytes that were loop-back (received
+ and transmitted), IB/Eth [#accel]_.
+ - Acceleration
+
+ * - `rx_steer_missed_packets`
+ - Number of packets that was received by the NIC, however was discarded
+ because it did not match any flow in the NIC flow table.
+ - Error
+
+ * - `rx_packets`
+ - Representor only: packets received, that were handled by the hypervisor.
+ - Informative
+
+ * - `rx_bytes`
+ - Representor only: bytes received, that were handled by the hypervisor.
+ - Informative
+
+ * - `tx_packets`
+ - Representor only: packets transmitted, that were handled by the
+ hypervisor.
+ - Informative
+
+ * - `tx_bytes`
+ - Representor only: bytes transmitted, that were handled by the hypervisor.
+ - Informative
+
+ * - `dev_internal_queue_oob`
+ - The number of dropped packets due to lack of receive WQEs for an internal
+ device RQ.
+ - Error
+
+Physical Port Counters
+----------------------
+The physical port counters are the counters on the external port connecting the
+adapter to the network. This measuring point holds information on standardized
+counters like IEEE 802.3, RFC2863, RFC 2819, RFC 3635 and additional counters
+like flow control, FEC and more.
+
+.. flat-table:: Physical Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_packets_phy`
+ - The number of packets received on the physical port. This counter doesn’t
+ include packets that were discarded due to FCS, frame size and similar
+ errors.
+ - Informative
+
+ * - `tx_packets_phy`
+ - The number of packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_bytes_phy`
+ - The number of bytes received on the physical port, including Ethernet
+ header and FCS.
+ - Informative
+
+ * - `tx_bytes_phy`
+ - The number of bytes transmitted on the physical port.
+ - Informative
+
+ * - `rx_multicast_phy`
+ - The number of multicast packets received on the physical port.
+ - Informative
+
+ * - `tx_multicast_phy`
+ - The number of multicast packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_broadcast_phy`
+ - The number of broadcast packets received on the physical port.
+ - Informative
+
+ * - `tx_broadcast_phy`
+ - The number of broadcast packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_crc_errors_phy`
+ - The number of dropped received packets due to FCS (Frame Check Sequence)
+ error on the physical port. If this counter is increased in high rate,
+ check the link quality using `rx_symbol_error_phy` and
+ `rx_corrected_bits_phy` counters below.
+ - Error
+
+ * - `rx_in_range_len_errors_phy`
+ - The number of received packets dropped due to length/type errors on a
+ physical port.
+ - Error
+
+ * - `rx_out_of_range_len_phy`
+ - The number of received packets dropped due to length greater than allowed
+ on a physical port. If this counter is increasing, it implies that the
+ peer connected to the adapter has a larger MTU configured. Using same MTU
+ configuration shall resolve this issue.
+ - Error
+
+ * - `rx_oversize_pkts_phy`
+ - The number of dropped received packets due to length which exceed MTU
+ size on a physical port. If this counter is increasing, it implies that
+ the peer connected to the adapter has a larger MTU configured. Using same
+ MTU configuration shall resolve this issue.
+ - Error
+
+ * - `rx_symbol_err_phy`
+ - The number of received packets dropped due to physical coding errors
+ (symbol errors) on a physical port.
+ - Error
+
+ * - `rx_mac_control_phy`
+ - The number of MAC control packets received on the physical port.
+ - Informative
+
+ * - `tx_mac_control_phy`
+ - The number of MAC control packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_pause_ctrl_phy`
+ - The number of link layer pause packets received on a physical port. If
+ this counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from to the adapter.
+ - Informative
+
+ * - `tx_pause_ctrl_phy`
+ - The number of link layer pause packets transmitted on a physical port. If
+ this counter is increasing, it implies that the NIC is congested and
+ cannot absorb the traffic coming from the network.
+ - Informative
+
+ * - `rx_unsupported_op_phy`
+ - The number of MAC control packets received with unsupported opcode on a
+ physical port.
+ - Error
+
+ * - `rx_discards_phy`
+ - The number of received packets dropped due to lack of buffers on a
+ physical port. If this counter is increasing, it implies that the adapter
+ is congested and cannot absorb the traffic coming from the network.
+ - Error
+
+ * - `tx_discards_phy`
+ - The number of packets which were discarded on transmission, even no
+ errors were detected. the drop might occur due to link in down state,
+ head of line drop, pause from the network, etc.
+ - Error
+
+ * - `tx_errors_phy`
+ - The number of transmitted packets dropped due to a length which exceed
+ MTU size on a physical port.
+ - Error
+
+ * - `rx_undersize_pkts_phy`
+ - The number of received packets dropped due to length which is shorter
+ than 64 bytes on a physical port. If this counter is increasing, it
+ implies that the peer connected to the adapter has a non-standard MTU
+ configured or malformed packet had arrived.
+ - Error
+
+ * - `rx_fragments_phy`
+ - The number of received packets dropped due to a length which is shorter
+ than 64 bytes and has FCS error on a physical port. If this counter is
+ increasing, it implies that the peer connected to the adapter has a
+ non-standard MTU configured.
+ - Error
+
+ * - `rx_jabbers_phy`
+ - The number of received packets d due to a length which is longer than 64
+ bytes and had FCS error on a physical port.
+ - Error
+
+ * - `rx_64_bytes_phy`
+ - The number of packets received on the physical port with size of 64 bytes.
+ - Informative
+
+ * - `rx_65_to_127_bytes_phy`
+ - The number of packets received on the physical port with size of 65 to
+ 127 bytes.
+ - Informative
+
+ * - `rx_128_to_255_bytes_phy`
+ - The number of packets received on the physical port with size of 128 to
+ 255 bytes.
+ - Informative
+
+ * - `rx_256_to_511_bytes_phy`
+ - The number of packets received on the physical port with size of 256 to
+ 512 bytes.
+ - Informative
+
+ * - `rx_512_to_1023_bytes_phy`
+ - The number of packets received on the physical port with size of 512 to
+ 1023 bytes.
+ - Informative
+
+ * - `rx_1024_to_1518_bytes_phy`
+ - The number of packets received on the physical port with size of 1024 to
+ 1518 bytes.
+ - Informative
+
+ * - `rx_1519_to_2047_bytes_phy`
+ - The number of packets received on the physical port with size of 1519 to
+ 2047 bytes.
+ - Informative
+
+ * - `rx_2048_to_4095_bytes_phy`
+ - The number of packets received on the physical port with size of 2048 to
+ 4095 bytes.
+ - Informative
+
+ * - `rx_4096_to_8191_bytes_phy`
+ - The number of packets received on the physical port with size of 4096 to
+ 8191 bytes.
+ - Informative
+
+ * - `rx_8192_to_10239_bytes_phy`
+ - The number of packets received on the physical port with size of 8192 to
+ 10239 bytes.
+ - Informative
+
+ * - `link_down_events_phy`
+ - The number of times where the link operative state changed to down. In
+ case this counter is increasing it may imply on port flapping. You may
+ need to replace the cable/transceiver.
+ - Error
+
+ * - `rx_out_of_buffer`
+ - Number of times receive queue had no software buffers allocated for the
+ adapter's incoming traffic.
+ - Error
+
+ * - `module_bus_stuck`
+ - The number of times that module's I\ :sup:`2`\C bus (data or clock)
+ short-wire was detected. You may need to replace the cable/transceiver.
+ - Error
+
+ * - `module_high_temp`
+ - The number of times that the module temperature was too high. If this
+ issue persist, you may need to check the ambient temperature or replace
+ the cable/transceiver module.
+ - Error
+
+ * - `module_bad_shorted`
+ - The number of times that the module cables were shorted. You may need to
+ replace the cable/transceiver module.
+ - Error
+
+ * - `module_unplug`
+ - The number of times that module was ejected.
+ - Informative
+
+ * - `rx_buffer_passed_thres_phy`
+ - The number of events where the port receive buffer was over 85% full.
+ - Informative
+
+ * - `tx_pause_storm_warning_events`
+ - The number of times the device was sending pauses for a long period of
+ time.
+ - Informative
+
+ * - `tx_pause_storm_error_events`
+ - The number of times the device was sending pauses for a long period of
+ time, reaching time out and disabling transmission of pause frames. on
+ the period where pause frames were disabled, drop could have been
+ occurred.
+ - Error
+
+ * - `rx[i]_buff_alloc_err`
+ - Failed to allocate a buffer to received packet (or SKB) on ring i.
+ - Error
+
+ * - `rx_bits_phy`
+ - This counter provides information on the total amount of traffic that
+ could have been received and can be used as a guideline to measure the
+ ratio of errored traffic in `rx_pcs_symbol_err_phy` and
+ `rx_corrected_bits_phy`.
+ - Informative
+
+ * - `rx_pcs_symbol_err_phy`
+ - This counter counts the number of symbol errors that wasn’t corrected by
+ FEC correction algorithm or that FEC algorithm was not active on this
+ interface. If this counter is increasing, it implies that the link
+ between the NIC and the network is suffering from high BER, and that
+ traffic is lost. You may need to replace the cable/transceiver. The error
+ rate is the number of `rx_pcs_symbol_err_phy` divided by the number of
+ `rx_bits_phy` on a specific time frame.
+ - Error
+
+ * - `rx_corrected_bits_phy`
+ - The number of corrected bits on this port according to active FEC
+ (RS/FC). If this counter is increasing, it implies that the link between
+ the NIC and the network is suffering from high BER. The corrected bit
+ rate is the number of `rx_corrected_bits_phy` divided by the number of
+ `rx_bits_phy` on a specific time frame.
+ - Error
+
+ * - `rx_err_lane_[l]_phy`
+ - This counter counts the number of physical raw errors per lane l index.
+ The counter counts errors before FEC corrections. If this counter is
+ increasing, it implies that the link between the NIC and the network is
+ suffering from high BER, and that traffic might be lost. You may need to
+ replace the cable/transceiver. Please check in accordance with
+ `rx_corrected_bits_phy`.
+ - Error
+
+ * - `rx_global_pause`
+ - The number of pause packets received on the physical port. If this
+ counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from the adapter. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `rx_global_pause_duration`
+ - The duration of pause received (in microSec) on the physical port. The
+ counter represents the time the port did not send any traffic. If this
+ counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from the adapter. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `tx_global_pause`
+ - The number of pause packets transmitted on a physical port. If this
+ counter is increasing, it implies that the adapter is congested and
+ cannot absorb the traffic coming from the network. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `tx_global_pause_duration`
+ - The duration of pause transmitter (in microSec) on the physical port.
+ Note: This counter is only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `rx_global_pause_transition`
+ - The number of times a transition from Xoff to Xon on the physical port
+ has occurred. Note: This counter is only enabled when global pause mode
+ is enabled.
+ - Informative
+
+ * - `rx_if_down_packets`
+ - The number of received packets that were dropped due to interface down.
+ - Informative
+
+Priority Port Counters
+----------------------
+The following counters are physical port counters that are counted per L2
+priority (0-7).
+
+**Note:** `p` in the counter name represents the priority.
+
+.. flat-table:: Priority Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_prio[p]_bytes`
+ - The number of bytes received with priority p on the physical port.
+ - Informative
+
+ * - `rx_prio[p]_packets`
+ - The number of packets received with priority p on the physical port.
+ - Informative
+
+ * - `tx_prio[p]_bytes`
+ - The number of bytes transmitted on priority p on the physical port.
+ - Informative
+
+ * - `tx_prio[p]_packets`
+ - The number of packets transmitted on priority p on the physical port.
+ - Informative
+
+ * - `rx_prio[p]_pause`
+ - The number of pause packets received with priority p on a physical port.
+ If this counter is increasing, it implies that the network is congested
+ and cannot absorb the traffic coming from the adapter. Note: This counter
+ is available only if PFC was enabled on priority p.
+ - Informative
+
+ * - `rx_prio[p]_pause_duration`
+ - The duration of pause received (in microSec) on priority p on the
+ physical port. The counter represents the time the port did not send any
+ traffic on this priority. If this counter is increasing, it implies that
+ the network is congested and cannot absorb the traffic coming from the
+ adapter. Note: This counter is available only if PFC was enabled on
+ priority p.
+ - Informative
+
+ * - `rx_prio[p]_pause_transition`
+ - The number of times a transition from Xoff to Xon on priority p on the
+ physical port has occurred. Note: This counter is available only if PFC
+ was enabled on priority p.
+ - Informative
+
+ * - `tx_prio[p]_pause`
+ - The number of pause packets transmitted on priority p on a physical port.
+ If this counter is increasing, it implies that the adapter is congested
+ and cannot absorb the traffic coming from the network. Note: This counter
+ is available only if PFC was enabled on priority p.
+ - Informative
+
+ * - `tx_prio[p]_pause_duration`
+ - The duration of pause transmitter (in microSec) on priority p on the
+ physical port. Note: This counter is available only if PFC was enabled on
+ priority p.
+ - Informative
+
+ * - `rx_prio[p]_buf_discard`
+ - The number of packets discarded by device due to lack of per host receive
+ buffers.
+ - Informative
+
+ * - `rx_prio[p]_cong_discard`
+ - The number of packets discarded by device due to per host congestion.
+ - Informative
+
+ * - `rx_prio[p]_marked`
+ - The number of packets ecn marked by device due to per host congestion.
+ - Informative
+
+ * - `rx_prio[p]_discards`
+ - The number of packets discarded by device due to lack of receive buffers.
+ - Informative
+
+Device Counters
+---------------
+.. flat-table:: Device Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_pci_signal_integrity`
+ - Counts physical layer PCIe signal integrity errors, the number of
+ transitions to recovery due to Framing errors and CRC (dlp and tlp). If
+ this counter is raising, try moving the adapter card to a different slot
+ to rule out a bad PCI slot. Validate that you are running with the latest
+ firmware available and latest server BIOS version.
+ - Error
+
+ * - `tx_pci_signal_integrity`
+ - Counts physical layer PCIe signal integrity errors, the number of
+ transition to recovery initiated by the other side (moving to recovery
+ due to getting TS/EIEOS). If this counter is raising, try moving the
+ adapter card to a different slot to rule out a bad PCI slot. Validate
+ that you are running with the latest firmware available and latest server
+ BIOS version.
+ - Error
+
+ * - `outbound_pci_buffer_overflow`
+ - The number of packets dropped due to pci buffer overflow. If this counter
+ is raising in high rate, it might indicate that the receive traffic rate
+ for a host is larger than the PCIe bus and therefore a congestion occurs.
+ - Informative
+
+ * - `outbound_pci_stalled_rd`
+ - The percentage (in the range 0...100) of time within the last second that
+ the NIC had outbound non-posted reads requests but could not perform the
+ operation due to insufficient posted credits.
+ - Informative
+
+ * - `outbound_pci_stalled_wr`
+ - The percentage (in the range 0...100) of time within the last second that
+ the NIC had outbound posted writes requests but could not perform the
+ operation due to insufficient posted credits.
+ - Informative
+
+ * - `outbound_pci_stalled_rd_events`
+ - The number of seconds where `outbound_pci_stalled_rd` was above 30%.
+ - Informative
+
+ * - `outbound_pci_stalled_wr_events`
+ - The number of seconds where `outbound_pci_stalled_wr` was above 30%.
+ - Informative
+
+ * - `dev_out_of_buffer`
+ - The number of times the device owned queue had not enough buffers
+ allocated.
+ - Error
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
new file mode 100644
index 000000000..581a91caa
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+Mellanox ConnectX(R) mlx5 core VPI Network Driver
+=================================================
+
+:Copyright: |copy| 2019, Mellanox Technologies LTD.
+:Copyright: |copy| 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ kconfig
+ switchdev
+ tracepoints
+ counters
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
new file mode 100644
index 000000000..0a42c3395
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
@@ -0,0 +1,168 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+=======================================
+Enabling the driver and kconfig options
+=======================================
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
+| at build time via kernel Kconfig flags.
+| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
+| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
+| For the list of advanced features, please see below.
+
+**CONFIG_MLX5_BRIDGE=(y/n)**
+
+| Enable :ref:`Ethernet Bridging (BRIDGE) offloading support <mlx5_bridge_offload>`.
+| This will provide the ability to add representors of mlx5 uplink and VF
+| ports to Bridge and offloading rules for traffic between such ports.
+| Supports VLANs (trunk and access modes).
+
+
+**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
+
+| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
+| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
+
+
+**CONFIG_MLX5_CORE_EN=(y/n)**
+
+| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
+| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be
+| built-in into mlx5_core.ko.
+
+
+**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
+
+| Enables `Data Center Bridging (DCB) Support <https://enterprise-support.nvidia.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
+
+
+**CONFIG_MLX5_CORE_IPOIB=(y/n)**
+
+| IPoIB offloads & acceleration support.
+| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
+| IPoIB ulp netdevice.
+
+
+**CONFIG_MLX5_CLS_ACT=(y/n)**
+
+| Enables offload support for TC classifier action (NET_CLS_ACT).
+| Works in both native NIC mode and Switchdev SRIOV mode.
+| Flow-based classifiers, such as those registered through
+| `tc-flower(8)`, are processed by the device, rather than the
+| host. Actions that would then overwrite matching classification
+| results would then be instant due to the offload.
+
+
+**CONFIG_MLX5_EN_ARFS=(y/n)**
+
+| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
+| https://enterprise-support.nvidia.com/s/article/howto-configure-arfs-on-connectx-4
+
+
+**CONFIG_MLX5_EN_IPSEC=(y/n)**
+
+| Enables :ref:`IPSec XFRM cryptography-offload acceleration <xfrm_device>`.
+
+
+**CONFIG_MLX5_EN_MACSEC=(y/n)**
+
+| Build support for MACsec cryptography-offload acceleration in the NIC.
+
+
+**CONFIG_MLX5_EN_RXNFC=(y/n)**
+
+| Enables ethtool receive network flow classification, which allows user defined
+| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
+
+
+**CONFIG_MLX5_EN_TLS=(y/n)**
+
+| TLS cryptography-offload acceleration.
+
+
+**CONFIG_MLX5_ESWITCH=(y/n)**
+
+| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
+| and switching for the enabled VFs and PF in two available modes:
+| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://enterprise-support.nvidia.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet>`_.
+| 2) :ref:`Switchdev mode (eswitch offloads) <switchdev>`.
+
+
+**CONFIG_MLX5_FPGA=(y/n)**
+
+| Build support for the Innova family of network cards by Mellanox Technologies.
+| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
+| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
+| building sandbox-specific client drivers.
+
+
+**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
+
+| Provides low-level InfiniBand/RDMA and `RoCE <https://enterprise-support.nvidia.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
+
+
+**CONFIG_MLX5_MPFS=(y/n)**
+
+| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
+| MPFs is required for when `Multi-Host <https://www.nvidia.com/en-us/networking/multi-host/>`_ configuration is enabled to allow passing
+| user configured unicast MAC addresses to the requesting PF.
+
+
+**CONFIG_MLX5_SF=(y/n)**
+
+| Build support for subfunction.
+| Subfunctons are more light weight than PCI SRIOV VFs. Choosing this option
+| will enable support for creating subfunction devices.
+
+
+**CONFIG_MLX5_SF_MANAGER=(y/n)**
+
+| Build support for subfuction port in the NIC. A Mellanox subfunction
+| port is managed through devlink. A subfunction supports RDMA, netdevice
+| and vdpa device. It is similar to a SRIOV VF but it doesn't require
+| SRIOV support.
+
+
+**CONFIG_MLX5_SW_STEERING=(y/n)**
+
+| Build support for software-managed steering in the NIC.
+
+
+**CONFIG_MLX5_TC_CT=(y/n)**
+
+| Support offloading connection tracking rules via tc ct action.
+
+
+**CONFIG_MLX5_TC_SAMPLE=(y/n)**
+
+| Support offloading sample rules via tc sample action.
+
+
+**CONFIG_MLX5_VDPA=(y/n)**
+
+| Support library for Mellanox VDPA drivers. Provides code that is
+| common for all types of VDPA drivers. The following drivers are planned:
+| net, block.
+
+
+**CONFIG_MLX5_VDPA_NET=(y/n)**
+
+| VDPA network driver for ConnectX6 and newer. Provides offloading
+| of virtio net datapath such that descriptors put on the ring will
+| be executed by the hardware. It also supports a variety of stateless
+| offloads depending on the actual device used and firmware version.
+
+
+**CONFIG_MLX5_VFIO_PCI=(y/n)**
+
+| This provides migration support for MLX5 devices using the VFIO framework.
+
+
+**External options** ( Choose if the corresponding mlx5 feature is required )
+
+- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
+- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
+- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
new file mode 100644
index 000000000..b617e93d7
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
@@ -0,0 +1,281 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+=========
+Switchdev
+=========
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+.. _mlx5_bridge_offload:
+
+Bridge offload
+==============
+
+The mlx5 driver implements support for offloading bridge rules when in switchdev
+mode. Linux bridge FDBs are automatically offloaded when mlx5 switchdev
+representor is attached to bridge.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
+
+ $ ip link set enp8s0f0 master bridge1
+
+VLANs
+-----
+
+Following bridge VLAN functions are supported by mlx5:
+
+- VLAN filtering (including multiple VLANs per port)::
+
+ $ ip link set bridge1 type bridge vlan_filtering 1
+ $ bridge vlan add dev enp8s0f0 vid 2-3
+
+- VLAN push on bridge ingress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 pvid
+
+- VLAN pop on bridge egress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 untagged
+
+Subfunction
+===========
+
+Subfunction which are spawned over the E-switch are created only with devlink
+device, and by default all the SF auxiliary devices are disabled.
+This will allow user to configure the SF before the SF have been fully probed,
+which will save time.
+
+Usage example:
+
+- Create SF::
+
+ $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
+ $ devlink port function set pci/0000:08:00.0/32768 hw_addr 00:00:00:00:00:11 state active
+
+- Enable ETH auxiliary device::
+
+ $ devlink dev param set auxiliary/mlx5_core.sf.1 name enable_eth value true cmode driverinit
+
+- Now, in order to fully probe the SF, use devlink reload::
+
+ $ devlink dev reload auxiliary/mlx5_core.sf.1
+
+mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`).
+
+mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
+
+A subfunction has its own function capabilities and its own resources. This
+means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
+queues are neither shared nor stolen from the parent PCI function.
+
+When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
+resources neither shared nor stolen from the parent PCI function.
+
+A subfunction has a dedicated window in PCI BAR space that is not shared
+with the other subfunctions or the parent PCI function. This ensures that all
+devices (netdev, rdma, vdpa, etc.) of the subfunction accesses only assigned
+PCI BAR space.
+
+A subfunction supports eswitch representation through which it supports tc
+offloads. The user configures eswitch to send/receive packets from/to
+the subfunction port.
+
+Subfunctions share PCI level resources such as PCI MSI-X IRQs with
+other subfunctions and/or with its parent PCI function.
+
+Example mlx5 software, system, and device view::
+
+ _______
+ | admin |
+ | user |----------
+ |_______| |
+ | |
+ ____|____ __|______ _________________
+ | | | | | |
+ | devlink | | tc tool | | user |
+ | tool | |_________| | applications |
+ |_________| | |_________________|
+ | | | |
+ | | | | Userspace
+ +---------|-------------|-------------------|----------|--------------------+
+ | | +----------+ +----------+ Kernel
+ | | | netdev | | rdma dev |
+ | | +----------+ +----------+
+ (devlink port add/del | ^ ^
+ port function set) | | |
+ | | +---------------|
+ _____|___ | | _______|_______
+ | | | | | mlx5 class |
+ | devlink | +------------+ | | drivers |
+ | kernel | | rep netdev | | |(mlx5_core,ib) |
+ |_________| +------------+ | |_______________|
+ | | | ^
+ (devlink ops) | | (probe/remove)
+ _________|________ | | ____|________
+ | subfunction | | +---------------+ | subfunction |
+ | management driver|----- | subfunction |---| driver |
+ | (mlx5_core) | | auxiliary dev | | (mlx5_core) |
+ |__________________| +---------------+ |_____________|
+ | ^
+ (sf add/del, vhca events) |
+ | (device add/del)
+ _____|____ ____|________
+ | | | subfunction |
+ | PCI NIC |--- activate/deactivate events--->| host driver |
+ |__________| | (mlx5_core) |
+ |_____________|
+
+Subfunction is created using devlink port interface.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Add a devlink port of subfunction flavour::
+
+ $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+ pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Show a devlink port of the subfunction::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Delete a devlink port of subfunction after use::
+
+ $ devlink port del pci/0000:06:00.0/32768
+
+Function attributes
+===================
+
+The mlx5 driver provides a mechanism to setup PCI VF/SF function attributes in
+a unified way for SmartNIC and non-SmartNIC.
+
+This is supported only when the eswitch mode is set to switchdev. Port function
+configuration of the PCI VF/SF is supported through devlink eswitch port.
+
+Port function attributes should be set before PCI VF/SF is enumerated by the
+driver.
+
+MAC address setup
+-----------------
+
+mlx5 driver support devlink port function attr mechanism to setup MAC
+address. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+RoCE capability setup
+~~~~~~~~~~~~~~~~~~~~~
+Not all mlx5 PCI devices/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves 1 Mbytes worth of system memory per
+PCI devices/SF.
+
+mlx5 driver support devlink port function attr mechanism to setup RoCE
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+migratable capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+mlx5 driver support devlink port function attr mechanism to setup migratable
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+IPsec crypto capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform IPsec crypto offloading need
+to explicitly enable the VF ipsec_crypto capability. Enabling IPsec capability
+for VFs is supported starting with ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+mlx5 driver support devlink port function attr mechanism to setup ipsec_crypto
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+IPsec packet capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform IPsec packet offloading need
+to explicitly enable the VF ipsec_packet capability. Enabling IPsec capability
+for VFs is supported starting with ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+mlx5 driver support devlink port function attr mechanism to setup ipsec_packet
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+SF state setup
+--------------
+
+To use the SF, the user must activate the SF using the SF function state
+attribute.
+
+- Get the state of the SF identified by its unique devlink port index::
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state inactive opstate detached
+
+- Activate the function and verify its state is active::
+
+ $ devlink port function set ens2f0npf0sf88 state active
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state active opstate detached
+
+Upon function activation, the PF driver instance gets the event from the device
+that a particular SF was activated. It's the cue to put the device on bus, probe
+it and instantiate the devlink instance and class specific auxiliary devices
+for it.
+
+- Show the auxiliary device and port of the subfunction::
+
+ $ devlink dev show
+ devlink dev show auxiliary/mlx5_core.sf.4
+
+ $ devlink port show auxiliary/mlx5_core.sf.4/1
+ auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
+
+ $ rdma link show mlx5_0/1
+ link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
+
+ $ rdma dev show
+ 8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
+ 13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
+
+- Subfunction auxiliary device and class device hierarchy::
+
+ mlx5_core.sf.4
+ (subfunction auxiliary device)
+ /\
+ / \
+ / \
+ / \
+ / \
+ mlx5_core.eth.4 mlx5_core.rdma.4
+ (sf eth aux dev) (sf rdma aux dev)
+ | |
+ | |
+ p0sf88 mlx5_0
+ (sf netdev) (sf rdma device)
+
+Additionally, the SF port also gets the event when the driver attaches to the
+auxiliary device of the subfunction. This results in changing the operational
+state of the function. This provides visibility to the user to decide when is it
+safe to delete the SF port for graceful termination of the subfunction.
+
+- Show the SF port operational state::
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state active opstate attached
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst
new file mode 100644
index 000000000..da8e53ceb
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst
@@ -0,0 +1,229 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+===========
+Tracepoints
+===========
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+mlx5 driver provides internal tracepoints for tracking and debugging using
+kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
+
+For the list of support mlx5 events, check /sys/kernel/tracing/events/mlx5/.
+
+tc and eswitch offloads tracepoints:
+
+- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
+
+- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
+
+ $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
+
+- mlx5e_stats_flower: trace flower stats request::
+
+ $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
+
+- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
+
+- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
+
+ $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
+
+Bridge offloads tracepoints:
+
+- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
+
+- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
+
+- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
+ mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
+
+- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
+
+- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
+
+- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_init >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
+
+- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
+
+Eswitch QoS tracepoints:
+
+- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
+
+- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
+
+ $ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
+
+- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
+
+ $ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
+
+- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
+
+ $ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
+
+SF tracepoints:
+
+- mlx5_sf_add: trace addition of the SF port::
+
+ $ echo mlx5:mlx5_sf_add >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_free: trace freeing of the SF port::
+
+ $ echo mlx5:mlx5_sf_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_activate: trace activation of the SF port::
+
+ $ echo mlx5:mlx5_sf_activate >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-29841 [008] ..... 3669.635095: mlx5_sf_activate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_deactivate: trace deactivation of the SF port::
+
+ $ echo mlx5:mlx5_sf_deactivate >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-29994 [008] ..... 4015.969467: mlx5_sf_deactivate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_update_state: trace state updates for SF contexts::
+
+ $ echo mlx5:mlx5_sf_update_state >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:3-29490 [009] ..... 4141.453530: mlx5_sf_update_state: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 state=2
+
+- mlx5_sf_vhca_event: trace SF vhca event and state::
+
+ $ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
+
+- mlx5_sf_dev_add: trace SF device add event::
+
+ $ echo mlx5:mlx5_sf_dev_add>> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_dev_del: trace SF device delete event::
+
+ $ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88