Diffstat (limited to 'Documentation/networking')
22 files changed, 766 insertions, 49 deletions
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index f7a73421eb..e774b48de9 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -444,6 +444,18 @@ arp_missed_max The default value is 2, and the allowable range is 1 - 255. +coupled_control + + Specifies whether the LACP state machine's MUX in the 802.3ad mode + should have separate Collecting and Distributing states. + + This is by implementing the independent control state machine per + IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control + state machine. + + The default value is 1. This setting does not separate the Collecting + and Distributing states, maintaining the bond in coupled control. + downdelay Specifies the time, in milliseconds, to wait before disabling diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst index ba14e7b078..ef8b73e157 100644 --- a/Documentation/networking/bridge.rst +++ b/Documentation/networking/bridge.rst @@ -324,7 +324,7 @@ Contact Info The code is currently maintained by Roopa Prabhu <roopa@nvidia.com> and Nikolay Aleksandrov <razor@blackwall.org>. Bridge bugs and enhancements are discussed on the linux-netdev mailing list netdev@vger.kernel.org and -bridge@lists.linux-foundation.org. +bridge@lists.linux.dev. 
The list is open to anyone interested: http://vger.kernel.org/vger-lists.html#netdev diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst index d7e1ada905..62519d38c5 100644 --- a/Documentation/networking/can.rst +++ b/Documentation/networking/can.rst @@ -444,6 +444,24 @@ definitions are specified for CAN specific MTUs in include/linux/can.h: #define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame +Returned Message Flags +---------------------- + +When using the system call recvmsg(2) on a RAW or a BCM socket, the +msg->msg_flags field may contain the following flags: + +MSG_DONTROUTE: + set when the received frame was created on the local host. + +MSG_CONFIRM: + set when the frame was sent via the socket it is received on. + This flag can be interpreted as a 'transmission confirmation' when the + CAN driver supports the echo of frames on driver level, see + :ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`. + (Note: In order to receive such messages on a RAW socket, + CAN_RAW_RECV_OWN_MSGS must be set.) + + .. _socketcan-raw-sockets: RAW Protocol Sockets with can_filters (SOCK_RAW) @@ -693,22 +711,6 @@ where the CAN_INV_FILTER flag is set in order to notch single CAN IDs or CAN ID ranges from the incoming traffic. -RAW Socket Returned Message Flags -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When using recvmsg() call, the msg->msg_flags may contain following flags: - -MSG_DONTROUTE: - set when the received frame was created on the local host. - -MSG_CONFIRM: - set when the frame was sent via the socket it is received on. - This flag can be interpreted as a 'transmission confirmation' when the - CAN driver supports the echo of frames on driver level, see - :ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`. - In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set. 
- - Broadcast Manager Protocol Sockets (SOCK_DGRAM) ----------------------------------------------- diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst index b842bcb142..a4c7d0c65f 100644 --- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst +++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst @@ -211,10 +211,16 @@ Documentation/networking/net_dim.rst RX copybreak ============ + The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK and can be configured by the ETHTOOL_STUNABLE command of the SIOCETHTOOL ioctl. +This option controls the maximum packet length for which the RX +descriptor it was received on would be recycled. When a packet smaller +than RX copybreak bytes is received, it is copied into a new memory +buffer and the RX descriptor is returned to HW. + Statistics ========== diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst index 43de285b8a..6932d8c043 100644 --- a/Documentation/networking/device_drivers/ethernet/index.rst +++ b/Documentation/networking/device_drivers/ethernet/index.rst @@ -42,6 +42,7 @@ Contents: intel/ice marvell/octeontx2 marvell/octeon_ep + marvell/octeon_ep_vf mellanox/mlx5/index microsoft/netvsc neterion/s2io diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst index 5038e54586..934752f675 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst @@ -368,15 +368,28 @@ more options for Receive Side Scaling (RSS) hash byte configuration. 
# ethtool -N <ethX> rx-flow-hash <type> <option> Where <type> is: - tcp4 signifying TCP over IPv4 - udp4 signifying UDP over IPv4 - tcp6 signifying TCP over IPv6 - udp6 signifying UDP over IPv6 + tcp4 signifying TCP over IPv4 + udp4 signifying UDP over IPv4 + gtpc4 signifying GTP-C over IPv4 + gtpc4t signifying GTP-C (include TEID) over IPv4 + gtpu4 signifying GTP-U over IPV4 + gtpu4e signifying GTP-U and Extension Header over IPV4 + gtpu4u signifying GTP-U PSC Uplink over IPV4 + gtpu4d signifying GTP-U PSC Downlink over IPV4 + tcp6 signifying TCP over IPv6 + udp6 signifying UDP over IPv6 + gtpc6 signifying GTP-C over IPv6 + gtpc6t signifying GTP-C (include TEID) over IPv6 + gtpu6 signifying GTP-U over IPV6 + gtpu6e signifying GTP-U and Extension Header over IPV6 + gtpu6u signifying GTP-U PSC Uplink over IPV6 + gtpu6d signifying GTP-U PSC Downlink over IPV6 And <option> is one or more of: s Hash on the IP source address of the Rx packet. d Hash on the IP destination address of the Rx packet. f Hash on bytes 0 and 1 of the Layer 4 header of the Rx packet. n Hash on bytes 2 and 3 of the Layer 4 header of the Rx packet. + e Hash on GTP Packet on TEID (4bytes) of the Rx packet. Accelerated Receive Flow Steering (aRFS) diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst new file mode 100644 index 0000000000..603133d0b9 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst @@ -0,0 +1,24 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +======================================================================= +Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC VF +======================================================================= + +Network driver for Marvell's Octeon PCI EndPoint NIC VF. +Copyright (c) 2020 Marvell International Ltd. 
+ +Overview +======== +This driver implements networking functionality of Marvell's Octeon PCI +EndPoint NIC VF. + +Supported Devices +================= +Currently, this driver support following devices: + * Network controller: Cavium, Inc. Device b203 + * Network controller: Cavium, Inc. Device b403 + * Network controller: Cavium, Inc. Device b103 + * Network controller: Cavium, Inc. Device b903 + * Network controller: Cavium, Inc. Device ba03 + * Network controller: Cavium, Inc. Device bc03 + * Network controller: Cavium, Inc. Device bd03 diff --git a/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst index 6ec7d686ef..05fe2b11bb 100644 --- a/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst +++ b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst @@ -99,6 +99,12 @@ Minimal SR-IOV support is currently offered and can be enabled by setting the sysfs 'sriov_numvfs' value, if supported by your particular firmware configuration. +XDP +--- + +Support for XDP includes the basics, plus Jumbo frames, Redirect and +ndo_xmit. There is no current support for zero-copy sockets or HW offload. 
+ Statistics ========== @@ -138,6 +144,12 @@ Driver port specific:: rx_csum_none: 0 rx_csum_complete: 3 rx_csum_error: 0 + xdp_drop: 0 + xdp_aborted: 0 + xdp_pass: 0 + xdp_tx: 0 + xdp_redirect: 0 + xdp_frames: 0 Driver queue specific:: @@ -149,9 +161,12 @@ Driver queue specific:: tx_0_frags: 0 tx_0_tso: 0 tx_0_tso_bytes: 0 + tx_0_hwstamp_valid: 0 + tx_0_hwstamp_invalid: 0 tx_0_csum_none: 3 tx_0_csum: 0 tx_0_vlan_inserted: 0 + tx_0_xdp_frames: 0 rx_0_pkts: 2 rx_0_bytes: 120 rx_0_dma_map_err: 0 @@ -159,8 +174,15 @@ Driver queue specific:: rx_0_csum_none: 0 rx_0_csum_complete: 0 rx_0_csum_error: 0 + rx_0_hwstamp_valid: 0 + rx_0_hwstamp_invalid: 0 rx_0_dropped: 0 rx_0_vlan_stripped: 0 + rx_0_xdp_drop: 0 + rx_0_xdp_aborted: 0 + rx_0_xdp_pass: 0 + rx_0_xdp_tx: 0 + rx_0_xdp_redirect: 0 Firmware port specific:: diff --git a/Documentation/networking/device_drivers/wwan/t7xx.rst b/Documentation/networking/device_drivers/wwan/t7xx.rst index dd5b731957..f346f5f85f 100644 --- a/Documentation/networking/device_drivers/wwan/t7xx.rst +++ b/Documentation/networking/device_drivers/wwan/t7xx.rst @@ -39,6 +39,34 @@ command and receive response: - open the AT control channel using a UART tool or a special user tool +Sysfs +===== +The driver provides sysfs interfaces to userspace. + +t7xx_mode +--------- +The sysfs interface provides userspace with access to the device mode, this interface +supports read and write operations. + +Device mode: + +- ``unknown`` represents that device in unknown status +- ``ready`` represents that device in ready status +- ``reset`` represents that device in reset status +- ``fastboot_switching`` represents that device in fastboot switching status +- ``fastboot_download`` represents that device in fastboot download status +- ``fastboot_dump`` represents that device in fastboot dump status + +Read from userspace to get the current device mode. + +:: + $ cat /sys/bus/pci/devices/${bdf}/t7xx_mode + +Write from userspace to set the device mode. 
+ +:: + $ echo fastboot_switching > /sys/bus/pci/devices/${bdf}/t7xx_mode + Management application development ================================== The driver and userspace interfaces are described below. The MBIM protocol is @@ -97,6 +125,20 @@ The driver exposes an AT port by implementing AT WWAN Port. The userspace end of the control port is a /dev/wwan0at0 character device. Application shall use this interface to issue AT commands. +fastboot port userspace ABI +--------------------------- + +/dev/wwan0fastboot0 character device +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The driver exposes a fastboot protocol interface by implementing +fastboot WWAN Port. The userspace end of the fastboot channel pipe is a +/dev/wwan0fastboot0 character device. Application shall use this interface for +fastboot protocol communication. + +Please note that driver needs to be reloaded to export /dev/wwan0fastboot0 +port, because device needs a cold reset after enter ``fastboot_switching`` +mode. + The MediaTek's T700 modem supports the 3GPP TS 27.007 [4] specification. References @@ -118,3 +160,7 @@ speak the Mobile Interface Broadband Model (MBIM) protocol"* [4] *Specification # 27.007 - 3GPP* - https://www.3gpp.org/DynaReport/27007.htm + +[5] *fastboot "a mechanism for communicating with bootloaders"* + +- https://android.googlesource.com/platform/system/core/+/refs/heads/main/fastboot/README.md diff --git a/Documentation/networking/devlink/devlink-eswitch-attr.rst b/Documentation/networking/devlink/devlink-eswitch-attr.rst new file mode 100644 index 0000000000..08bb39ab15 --- /dev/null +++ b/Documentation/networking/devlink/devlink-eswitch-attr.rst @@ -0,0 +1,76 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +Devlink E-Switch Attribute +========================== + +Devlink E-Switch supports two modes of operation: legacy and switchdev. +Legacy mode operates based on traditional MAC/VLAN steering rules. 
Switching +decisions are made based on MAC addresses, VLANs, etc. There is limited ability +to offload switching rules to hardware. + +On the other hand, switchdev mode allows for more advanced offloading +capabilities of the E-Switch to hardware. In switchdev mode, more switching +rules and logic can be offloaded to the hardware switch ASIC. It enables +representor netdevices that represent the slow path of virtual functions (VFs) +or scalable-functions (SFs) of the device. See more information about +:ref:`Documentation/networking/switchdev.rst <switchdev>` and +:ref:`Documentation/networking/representors.rst <representors>`. + +In addition, the devlink E-Switch also comes with other attributes listed +in the following section. + +Attributes Description +====================== + +The following is a list of E-Switch attributes. + +.. list-table:: E-Switch attributes + :widths: 8 5 45 + + * - Name + - Type + - Description + * - ``mode`` + - enum + - The mode of the device. The mode can be one of the following: + + * ``legacy`` operates based on traditional MAC/VLAN steering + rules. + * ``switchdev`` allows for more advanced offloading capabilities of + the E-Switch to hardware. + * - ``inline-mode`` + - enum + - Some HWs need the VF driver to put part of the packet + headers on the TX descriptor so the e-switch can do proper + matching and steering. Support for both switchdev mode and legacy mode. + + * ``none`` none. + * ``link`` L2 mode. + * ``network`` L3 mode. + * ``transport`` L4 mode. + * - ``encap-mode`` + - enum + - The encapsulation mode of the device. Support for both switchdev mode + and legacy mode. The mode can be one of the following: + + * ``none`` Disable encapsulation support. + * ``basic`` Enable encapsulation support. + +Example Usage +============= + +.. 
code:: shell + + # enable switchdev mode + $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev + + # set inline-mode and encap-mode + $ devlink dev eswitch set pci/0000:08:00.0 inline-mode none encap-mode basic + + # display devlink device eswitch attributes + $ devlink dev eswitch show pci/0000:08:00.0 + pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic + + # enable encap-mode with legacy mode + $ devlink dev eswitch set pci/0000:08:00.0 mode legacy inline-mode none encap-mode basic diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index e14d7a701b..948c8c44e2 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -67,6 +67,7 @@ general. devlink-selftests devlink-trap devlink-linecard + devlink-eswitch-attr Driver-specific documentation ----------------------------- diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 702f204a3d..4569854074 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -97,6 +97,10 @@ parameters. When metadata is disabled, the above use cases will fail to initialize if users try to enable them. + + Note: Setting this parameter does not take effect immediately. Setting + must happen in legacy mode and eswitch port metadata takes effect after + enabling switchdev mode. * - ``hairpin_num_queues`` - u32 - driverinit @@ -246,7 +250,7 @@ them in realtime. Description of the vnic counters: -- total_q_under_processor_handle +- total_error_queues number of queues in an error state due to an async error or errored command. - send_queue_priority_update_flow @@ -255,7 +259,8 @@ Description of the vnic counters: number of times CQ entered an error state due to an overflow. - async_eq_overrun number of times an EQ mapped to async events was overrun. 
- comp_eq_overrun number of times an EQ mapped to completion events was +- comp_eq_overrun + number of times an EQ mapped to completion events was overrun. - quota_exceeded_command number of commands issued and failed due to quota exceeded. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 69f3d6dcd9..473d72c36d 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -74,6 +74,7 @@ Contents: mpls-sysctl mptcp-sysctl multiqueue + multi-pf-netdev napi net_cachelines/index netconsole diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 7afff42612..bd50df6a5a 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2503,7 +2503,7 @@ use_tempaddr - INTEGER temp_valid_lft - INTEGER valid lifetime (in seconds) for temporary addresses. If less than the - minimum required lifetime (typically 5 seconds), temporary addresses + minimum required lifetime (typically 5-7 seconds), temporary addresses will not be created. Default: 172800 (2 days) @@ -2511,7 +2511,7 @@ temp_valid_lft - INTEGER temp_prefered_lft - INTEGER Preferred lifetime (in seconds) for temporary addresses. If temp_prefered_lft is less than the minimum required lifetime (typically - 5 seconds), temporary addresses will not be created. If + 5-7 seconds), the preferred lifetime is the minimum required. If temp_prefered_lft is greater than temp_valid_lft, the preferred lifetime is temp_valid_lft. @@ -2535,6 +2535,16 @@ max_desync_factor - INTEGER Default: 600 +regen_min_advance - INTEGER + How far in advance (in seconds), at minimum, to create a new temporary + address before the current one is deprecated. This value is added to + the amount of time that may be required for duplicate address detection + to determine when to create a new address. 
Linux permits setting this + value to less than the default of 2 seconds, but a value less than 2 + does not conform to RFC 8981. + + Default: 2 + regen_max_retry - INTEGER Number of attempts before give up attempting to generate valid temporary addresses. diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst index 7f383e99db..8496b467de 100644 --- a/Documentation/networking/l2tp.rst +++ b/Documentation/networking/l2tp.rst @@ -386,12 +386,19 @@ Sample userspace code: - Create session PPPoX data socket:: + /* Input: the L2TP tunnel UDP socket `tunnel_fd`, which needs to be + * bound already (both sockname and peername), otherwise it will not be + * ready. + */ + struct sockaddr_pppol2tp sax; - int fd; + int session_fd; + int ret; + + session_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP); + if (session_fd < 0) + return -errno; - /* Note, the tunnel socket must be bound already, else it - * will not be ready - */ sax.sa_family = AF_PPPOX; sax.sa_protocol = PX_PROTO_OL2TP; sax.pppol2tp.fd = tunnel_fd; @@ -406,12 +413,128 @@ Sample userspace code: /* session_fd is the fd of the session's PPPoL2TP socket. * tunnel_fd is the fd of the tunnel UDP / L2TPIP socket. */ - fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax)); - if (fd < 0 ) { + ret = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax)); + if (ret < 0 ) { + close(session_fd); + return -errno; + } + + return session_fd; + +L2TP control packets will still be available for read on `tunnel_fd`. + + - Create PPP channel:: + + /* Input: the session PPPoX data socket `session_fd` which was created + * as described above. 
+ */ + + int ppp_chan_fd; + int chindx; + int ret; + + ret = ioctl(session_fd, PPPIOCGCHAN, &chindx); + if (ret < 0) + return -errno; + + ppp_chan_fd = open("/dev/ppp", O_RDWR); + if (ppp_chan_fd < 0) + return -errno; + + ret = ioctl(ppp_chan_fd, PPPIOCATTCHAN, &chindx); + if (ret < 0) { + close(ppp_chan_fd); return -errno; } + + return ppp_chan_fd; + +LCP PPP frames will be available for read on `ppp_chan_fd`. + + - Create PPP interface:: + + /* Input: the PPP channel `ppp_chan_fd` which was created as described + * above. + */ + + int ifunit = -1; + int ppp_if_fd; + int ret; + + ppp_if_fd = open("/dev/ppp", O_RDWR); + if (ppp_if_fd < 0) + return -errno; + + ret = ioctl(ppp_if_fd, PPPIOCNEWUNIT, &ifunit); + if (ret < 0) { + close(ppp_if_fd); + return -errno; + } + + ret = ioctl(ppp_chan_fd, PPPIOCCONNECT, &ifunit); + if (ret < 0) { + close(ppp_if_fd); + return -errno; + } + + return ppp_if_fd; + +IPCP/IPv6CP PPP frames will be available for read on `ppp_if_fd`. + +The ppp<ifunit> interface can then be configured as usual with netlink's +RTM_NEWLINK, RTM_NEWADDR, RTM_NEWROUTE, or ioctl's SIOCSIFMTU, SIOCSIFADDR, +SIOCSIFDSTADDR, SIOCSIFNETMASK, SIOCSIFFLAGS, or with the `ip` command. + + - Bridging L2TP sessions which have PPP pseudowire types (this is also called + L2TP tunnel switching or L2TP multihop) is supported by bridging the PPP + channels of the two L2TP sessions to be bridged:: + + /* Input: the session PPPoX data sockets `session_fd1` and `session_fd2` + * which were created as described further above. 
+ */ + + int ppp_chan_fd; + int chindx1; + int chindx2; + int ret; + + ret = ioctl(session_fd1, PPPIOCGCHAN, &chindx1); + if (ret < 0) + return -errno; + + ret = ioctl(session_fd2, PPPIOCGCHAN, &chindx2); + if (ret < 0) + return -errno; + + ppp_chan_fd = open("/dev/ppp", O_RDWR); + if (ppp_chan_fd < 0) + return -errno; + + ret = ioctl(ppp_chan_fd, PPPIOCATTCHAN, &chindx1); + if (ret < 0) { + close(ppp_chan_fd); + return -errno; + } + + ret = ioctl(ppp_chan_fd, PPPIOCBRIDGECHAN, &chindx2); + close(ppp_chan_fd); + if (ret < 0) + return -errno; + return 0; +It can be noted that when bridging PPP channels, the PPP session is not locally +terminated, and no local PPP interface is created. PPP frames arriving on one +channel are directly passed to the other channel, and vice versa. + +The PPP channel does not need to be kept open. Only the session PPPoX data +sockets need to be kept open. + +More generally, it is also possible in the same way to e.g. bridge a PPPoL2TP +PPP channel with other types of PPP channels, such as PPPoE. + +See more details for the PPP side in ppp_generic.rst. + Old L2TPv2-only API ------------------- diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst new file mode 100644 index 0000000000..2688192258 --- /dev/null +++ b/Documentation/networking/multi-pf-netdev.rst @@ -0,0 +1,174 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +=============== +Multi-PF Netdev +=============== + +Contents +======== + +- `Background`_ +- `Overview`_ +- `mlx5 implementation`_ +- `Channels distribution`_ +- `Observability`_ +- `Steering`_ +- `Mutually exclusive features`_ + +Background +========== + +The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to +the network, each through its own dedicated PCIe interface. Through either a connection harness that +splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single card. 
This +results in eliminating the network traffic traversing over the internal bus between the sockets, +significantly reducing overhead and latency, in addition to reducing CPU utilization and increasing +network throughput. + +Overview +======== + +The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under +one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func, +sysfs entry, and devlink are kept separate. +Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA +traffic and allows apps running on the same netdev from different NUMAs to still feel a sense of +proximity to the device and achieve improved performance. + +mlx5 implementation +=================== + +Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same +NIC and has the socket-direct property enabled, once all PFs are probed, we create a single netdev +to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed. + +The netdev network channels are distributed between all devices, a proper configuration would utilize +the correct close NUMA node when working on a certain app/CPU. + +We pick one PF to be a primary (leader), and it fills a special role. The other devices +(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent +mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of +the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary +to/from the secondaries. + +Currently, we limit the support to PFs only, and up to two PFs (sockets). + +Channels distribution +===================== + +We distribute the channels between the different PFs to achieve local NUMA node performance +on multiple NUMA nodes. 
+ +Each combined channel works against one specific PF, creating all its datapath queues against it. We +distribute channels to PFs in a round-robin policy. + +:: + + Example for 2 PFs and 5 channels: + +--------+--------+ + | ch idx | PF idx | + +--------+--------+ + | 0 | 0 | + | 1 | 1 | + | 2 | 0 | + | 3 | 1 | + | 4 | 0 | + +--------+--------+ + + +The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The +mapping between a channel index and a PF is fixed, no matter how many channels the user configures. +As the channel stats are persistent across channel's closure, changing the mapping every single time +would turn the accumulative stats less representing of the channel's history. + +This is achieved by using the correct core device instance (mdev) in each channel, instead of them +all using the same instance under "priv->mdev". + +Observability +============= +The relation between PF, irq, napi, and queue can be observed via netlink spec:: + + $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}' + [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'}, + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'}, + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'}, + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'}, + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'}, + {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'}, + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'}, + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'}, + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'}, + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}] + + $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}' + [{'id': 543, 'ifindex': 13, 'irq': 42}, + {'id': 542, 'ifindex': 13, 'irq': 41}, + {'id': 541, 'ifindex': 13, 'irq': 40}, + {'id': 540, 'ifindex': 13, 'irq': 39}, + {'id': 539, 
'ifindex': 13, 'irq': 36}] + +Here you can clearly observe our channels distribution policy:: + + $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1 + /proc/irq/36/mlx5_comp1@pci:0000:08:00.0 + /proc/irq/39/mlx5_comp1@pci:0000:09:00.0 + /proc/irq/40/mlx5_comp2@pci:0000:08:00.0 + /proc/irq/41/mlx5_comp2@pci:0000:09:00.0 + /proc/irq/42/mlx5_comp3@pci:0000:08:00.0 + +Steering +======== +Secondary PFs are set to "silent" mode, meaning they are disconnected from the network. + +In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming +traffic to other PFs, via cross-vhca steering capabilities. Still maintain a single default RSS table, +that is capable of pointing to the receive queues of a different PF. + +In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can +go out to the network through it. + +In addition, we set default XPS configuration that, based on the CPU, selects an SQ belonging to the +PF on the same node as the CPU. + +XPS default config example: + +NUMA node(s): 2 +NUMA node0 CPU(s): 0-11 +NUMA node1 CPU(s): 12-23 + +PF0 on node0, PF1 on node1. 
+ +- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001 +- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000 +- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002 +- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000 +- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004 +- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000 +- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008 +- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000 +- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010 +- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000 +- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020 +- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000 +- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040 +- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000 +- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080 +- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000 +- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100 +- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000 +- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200 +- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000 +- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400 +- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000 +- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800 +- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000 + +Mutually exclusive features +=========================== + +The nature of Multi-PF, where different channels work with different PFs, conflicts with +stateful features where the state is maintained in one of the PFs. +For example, in the TLS device-offload feature, special context objects are created per connection +and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence, +we disable this combination for now. 
diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst index 390730a743..d55c2a22ec 100644 --- a/Documentation/networking/netconsole.rst +++ b/Documentation/networking/netconsole.rst @@ -15,6 +15,8 @@ Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015 Release prepend support by Breno Leitao <leitao@debian.org>, Jul 7 2023 +Userdata append support by Matthew Wood <thepacketgeek@gmail.com>, Jan 22 2024 + Please send bug reports to Matt Mackall <mpm@selenic.com> Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com> @@ -171,6 +173,70 @@ You can modify these targets in runtime by creating the following targets:: cat cmdline1/remote_ip 10.0.0.3 +Append User Data +---------------- + +Custom user data can be appended to the end of messages with netconsole +dynamic configuration enabled. User data entries can be modified without +changing the "enabled" attribute of a target. + +Directories (keys) under `userdata` are limited to 53 character length, and +data in `userdata/<key>/value` are limited to 200 bytes:: + + cd /sys/kernel/config/netconsole && mkdir cmdline0 + cd cmdline0 + mkdir userdata/foo + echo bar > userdata/foo/value + mkdir userdata/qux + echo baz > userdata/qux/value + +Messages will now include this additional user data:: + + echo "This is a message" > /dev/kmsg + +Sends:: + + 12,607,22085407756,-;This is a message + foo=bar + qux=baz + +Preview the userdata that will be appended with:: + + cd /sys/kernel/config/netconsole/cmdline0/userdata + for f in `ls userdata`; do echo $f=$(cat userdata/$f/value); done + +If a `userdata` entry is created but no data is written to the `value` file, +the entry will be omitted from netconsole messages:: + + cd /sys/kernel/config/netconsole && mkdir cmdline0 + cd cmdline0 + mkdir userdata/foo + echo bar > userdata/foo/value + mkdir userdata/qux + +The `qux` key is omitted since it has no value:: + + echo "This is a message" > /dev/kmsg + 
12,607,22085407756,-;This is a message + foo=bar + +Delete `userdata` entries with `rmdir`:: + + rmdir /sys/kernel/config/netconsole/cmdline0/userdata/qux + +.. warning:: + When writing strings to user data values, input is broken up per line in + configfs store calls and this can cause confusing behavior:: + + mkdir userdata/testing + printf "val1\nval2" > userdata/testing/value + # userdata store value is called twice, first with "val1\n" then "val2" + # so "val2" is stored, being the last value stored + cat userdata/testing/value + val2 + + It is recommended to not write user data values with newlines. + Extended console: ================= diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst index 9e4cccb90b..c2476917a6 100644 --- a/Documentation/networking/netdevices.rst +++ b/Documentation/networking/netdevices.rst @@ -252,8 +252,8 @@ ndo_eth_ioctl: Context: process ndo_get_stats: - Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU. - Context: atomic (can't sleep under rwlock or RCU) + Synchronization: rtnl_lock() semaphore, or RCU. + Context: atomic (can't sleep under RCU) ndo_start_xmit: Synchronization: __netif_tx_lock spinlock. diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst index decb39c19b..5e23386f69 100644 --- a/Documentation/networking/representors.rst +++ b/Documentation/networking/representors.rst @@ -1,4 +1,5 @@ .. SPDX-License-Identifier: GPL-2.0 +.. _representors: ============================= Network Function Representors diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst index 8054d33f44..5bf285d73e 100644 --- a/Documentation/networking/sfp-phylink.rst +++ b/Documentation/networking/sfp-phylink.rst @@ -231,16 +231,136 @@ this documentation. For further information on these methods, please see the inline documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`. -9. 
Remove calls to of_parse_phandle() for the PHY, - of_phy_register_fixed_link() for fixed links etc. from the probe - function, and replace with: +9. Fill-in the :c:type:`struct phylink_config <phylink_config>` fields with + a reference to the :c:type:`struct device <device>` associated to your + :c:type:`struct net_device <net_device>`: .. code-block:: c - struct phylink *phylink; priv->phylink_config.dev = &dev.dev; priv->phylink_config.type = PHYLINK_NETDEV; + Fill-in the various speeds, pause and duplex modes your MAC can handle: + + .. code-block:: c + + priv->phylink_config.mac_capabilities = MAC_SYM_PAUSE | MAC_10 | MAC_100 | MAC_1000FD; + +10. Some Ethernet controllers work in pair with a PCS (Physical Coding Sublayer) + block, that can handle among other things the encoding/decoding, link + establishment detection and autonegotiation. While some MACs have internal + PCS whose operation is transparent, some other require dedicated PCS + configuration for the link to become functional. In that case, phylink + provides a PCS abstraction through :c:type:`struct phylink_pcs <phylink_pcs>`. + + Identify if your driver has one or more internal PCS blocks, and/or if + your controller can use an external PCS block that might be internally + connected to your controller. + + If your controller doesn't have any internal PCS, you can go to step 11. + + If your Ethernet controller contains one or several PCS blocks, create + one :c:type:`struct phylink_pcs <phylink_pcs>` instance per PCS block within + your driver's private data structure: + + .. code-block:: c + + struct phylink_pcs pcs; + + Populate the relevant :c:type:`struct phylink_pcs_ops <phylink_pcs_ops>` to + configure your PCS. 
Create a :c:func:`pcs_get_state` function that reports + the inband link state, a :c:func:`pcs_config` function to configure your + PCS according to phylink-provided parameters, and a :c:func:`pcs_validate` + function that report to phylink all accepted configuration parameters for + your PCS: + + .. code-block:: c + + struct phylink_pcs_ops foo_pcs_ops = { + .pcs_validate = foo_pcs_validate, + .pcs_get_state = foo_pcs_get_state, + .pcs_config = foo_pcs_config, + }; + + Arrange for PCS link state interrupts to be forwarded into + phylink, via: + + .. code-block:: c + + phylink_pcs_change(pcs, link_is_up); + + where ``link_is_up`` is true if the link is currently up or false + otherwise. If a PCS is unable to provide these interrupts, then + it should set ``pcs->pcs_poll = true;`` when creating the PCS. + +11. If your controller relies on, or accepts the presence of an external PCS + controlled through its own driver, add a pointer to a phylink_pcs instance + in your driver private data structure: + + .. code-block:: c + + struct phylink_pcs *pcs; + + The way of getting an instance of the actual PCS depends on the platform, + some PCS sit on an MDIO bus and are grabbed by passing a pointer to the + corresponding :c:type:`struct mii_bus <mii_bus>` and the PCS's address on + that bus. In this example, we assume the controller attaches to a Lynx PCS + instance: + + .. code-block:: c + + priv->pcs = lynx_pcs_create_mdiodev(bus, 0); + + Some PCS can be recovered based on firmware information: + + .. code-block:: c + + priv->pcs = lynx_pcs_create_fwnode(of_fwnode_handle(node)); + +12. Populate the :c:func:`mac_select_pcs` callback and add it to your + :c:type:`struct phylink_mac_ops <phylink_mac_ops>` set of ops. This function + must return a pointer to the relevant :c:type:`struct phylink_pcs <phylink_pcs>` + that will be used for the requested link configuration: + + .. 
code-block:: c + + static struct phylink_pcs *foo_select_pcs(struct phylink_config *config, + phy_interface_t interface) + { + struct foo_priv *priv = container_of(config, struct foo_priv, + phylink_config); + + if ( /* 'interface' needs a PCS to function */ ) + return priv->pcs; + + return NULL; + } + + See :c:func:`mvpp2_select_pcs` for an example of a driver that has multiple + internal PCS. + +13. Fill-in all the :c:type:`phy_interface_t <phy_interface_t>` (i.e. all MAC to + PHY link modes) that your MAC can output. The following example shows a + configuration for a MAC that can handle all RGMII modes, SGMII and 1000BaseX. + You must adjust these according to what your MAC and all PCS associated + with this MAC are capable of, and not just the interface you wish to use: + + .. code-block:: c + + phy_interface_set_rgmii(priv->phylink_config.supported_interfaces); + __set_bit(PHY_INTERFACE_MODE_SGMII, + priv->phylink_config.supported_interfaces); + __set_bit(PHY_INTERFACE_MODE_1000BASEX, + priv->phylink_config.supported_interfaces); + +14. Remove calls to of_parse_phandle() for the PHY, + of_phy_register_fixed_link() for fixed links etc. from the probe + function, and replace with: + + .. code-block:: c + + struct phylink *phylink; + phylink = phylink_create(&priv->phylink_config, node, phy_mode, &phylink_ops); if (IS_ERR(phylink)) { err = PTR_ERR(phylink); @@ -249,14 +369,14 @@ this documentation. priv->phylink = phylink; - and arrange to destroy the phylink in the probe failure path as - appropriate and the removal path too by calling: + and arrange to destroy the phylink in the probe failure path as + appropriate and the removal path too by calling: - .. code-block:: c + .. code-block:: c phylink_destroy(priv->phylink); -10. Arrange for MAC link state interrupts to be forwarded into +15. Arrange for MAC link state interrupts to be forwarded into phylink, via: .. code-block:: c @@ -264,17 +384,16 @@ this documentation. 
phylink_mac_change(priv->phylink, link_is_up); where ``link_is_up`` is true if the link is currently up or false - otherwise. If a MAC is unable to provide these interrupts, then - it should set ``priv->phylink_config.pcs_poll = true;`` in step 9. + otherwise. -11. Verify that the driver does not call:: +16. Verify that the driver does not call:: netif_carrier_on() netif_carrier_off() - as these will interfere with phylink's tracking of the link state, - and cause phylink to omit calls via the :c:func:`mac_link_up` and - :c:func:`mac_link_down` methods. + as these will interfere with phylink's tracking of the link state, + and cause phylink to omit calls via the :c:func:`mac_link_up` and + :c:func:`mac_link_down` methods. Network drivers should call phylink_stop() and phylink_start() via their suspend/resume paths, which ensures that the appropriate diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst index 551b3cc29a..75e017dfa8 100644 --- a/Documentation/networking/statistics.rst +++ b/Documentation/networking/statistics.rst @@ -41,6 +41,15 @@ If `-s` is specified once the detailed errors won't be shown. `ip` supports JSON formatting via the `-j` option. +Queue statistics +~~~~~~~~~~~~~~~~ + +Queue statistics are accessible via the netdev netlink family. + +Currently no widely distributed CLI exists to access those statistics. +Kernel development tools (ynl) can be used to experiment with them, +see `Documentation/userspace-api/netlink/intro-specs.rst`. + Protocol-specific statistics ---------------------------- @@ -147,6 +156,12 @@ Statistics are reported both in the responses to link information requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`, when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request). +netdev (netlink) +~~~~~~~~~~~~~~~~ + +`netdev` generic netlink family allows accessing page pool and per queue +statistics. 
+ ethtool ------- diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst index 535077cbeb..bfea9d8579 100644 --- a/Documentation/networking/xfrm_device.rst +++ b/Documentation/networking/xfrm_device.rst @@ -71,9 +71,9 @@ Callbacks to implement bool (*xdo_dev_offload_ok) (struct sk_buff *skb, struct xfrm_state *x); void (*xdo_dev_state_advance_esn) (struct xfrm_state *x); + void (*xdo_dev_state_update_stats) (struct xfrm_state *x); /* Solely packet offload callbacks */ - void (*xdo_dev_state_update_curlft) (struct xfrm_state *x); int (*xdo_dev_policy_add) (struct xfrm_policy *x, struct netlink_ext_ack *extack); void (*xdo_dev_policy_delete) (struct xfrm_policy *x); void (*xdo_dev_policy_free) (struct xfrm_policy *x); @@ -191,6 +191,6 @@ xdo_dev_policy_free() on any remaining offloaded states. Outcome of HW handling packets, the XFRM core can't count hard, soft limits. The HW/driver are responsible to perform it and provide accurate data when -xdo_dev_state_update_curlft() is called. In case of one of these limits +xdo_dev_state_update_stats() is called. In case of one of these limits occuried, the driver needs to call to xfrm_state_check_expire() to make sure that XFRM performs rekeying sequence.
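The userdata preview described in the netconsole.rst hunk above could be exercised with a small POSIX shell helper along these lines. This is only a sketch: the function name `preview_userdata` and its directory argument are assumptions made for illustration and are not part of the patch. The preview loop in the patch changes into `userdata` and then lists `userdata` again, so the sketch takes the directory explicitly instead. Skipping empty values mirrors the documented omission of entries that have no data; it does not reproduce how the kernel formats messages.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): print "key=value" for each
# userdata entry found under the directory given as the first argument.
preview_userdata() {
    dir=$1
    for entry in "$dir"/*/; do
        # When there are no entries the glob stays unexpanded; skip it.
        [ -d "$entry" ] || continue
        key=$(basename "$entry")
        # An unreadable value file is treated like a missing entry.
        value=$(cat "$entry/value" 2>/dev/null) || continue
        # Entries without data are skipped, as the documentation says
        # they are omitted from netconsole messages.
        [ -n "$value" ] || continue
        printf '%s=%s\n' "$key" "$value"
    done
}
```

On a system with a `cmdline0` target configured as in the patch, it would be called as `preview_userdata /sys/kernel/config/netconsole/cmdline0/userdata`. Taking the directory as a parameter also lets the helper be tried against an ordinary directory tree without touching configfs.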