summaryrefslogtreecommitdiffstats
path: root/Documentation/networking
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-11 08:27:49 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-11 08:27:49 +0000
commitace9429bb58fd418f0c81d4c2835699bddf6bde6 (patch)
treeb2d64bc10158fdd5497876388cd68142ca374ed3 /Documentation/networking
parentInitial commit. (diff)
downloadlinux-ace9429bb58fd418f0c81d4c2835699bddf6bde6.tar.xz
linux-ace9429bb58fd418f0c81d4c2835699bddf6bde6.zip
Adding upstream version 6.6.15.upstream/6.6.15
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/6lowpan.rst53
-rw-r--r--Documentation/networking/6pack.rst191
-rw-r--r--Documentation/networking/af_xdp.rst849
-rw-r--r--Documentation/networking/alias.rst49
-rw-r--r--Documentation/networking/arcnet-hardware.rst3234
-rw-r--r--Documentation/networking/arcnet.rst594
-rw-r--r--Documentation/networking/atm.rst14
-rw-r--r--Documentation/networking/ax25.rst16
-rw-r--r--Documentation/networking/bareudp.rst58
-rw-r--r--Documentation/networking/batman-adv.rst168
-rw-r--r--Documentation/networking/bonding.rst2925
-rw-r--r--Documentation/networking/bridge.rst21
-rw-r--r--Documentation/networking/caif/caif.rst138
-rw-r--r--Documentation/networking/caif/index.rst12
-rw-r--r--Documentation/networking/caif/linux_caif.rst195
-rw-r--r--Documentation/networking/can.rst1508
-rw-r--r--Documentation/networking/can_ucan_protocol.rst332
-rw-r--r--Documentation/networking/cdc_mbim.rst355
-rw-r--r--Documentation/networking/checksum-offloads.rst143
-rw-r--r--Documentation/networking/dccp.rst219
-rw-r--r--Documentation/networking/dctcp.rst52
-rw-r--r--Documentation/networking/device_drivers/appletalk/cops.rst80
-rw-r--r--Documentation/networking/device_drivers/appletalk/index.rst18
-rw-r--r--Documentation/networking/device_drivers/atm/cxacru-cf.py48
-rw-r--r--Documentation/networking/device_drivers/atm/cxacru.rst120
-rw-r--r--Documentation/networking/device_drivers/atm/fore200e.rst66
-rw-r--r--Documentation/networking/device_drivers/atm/index.rst20
-rw-r--r--Documentation/networking/device_drivers/atm/iphase.rst193
-rw-r--r--Documentation/networking/device_drivers/cable/index.rst18
-rw-r--r--Documentation/networking/device_drivers/cable/sb1000.rst222
-rw-r--r--Documentation/networking/device_drivers/can/can327.rst331
-rw-r--r--Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst638
-rw-r--r--Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg151
-rw-r--r--Documentation/networking/device_drivers/can/freescale/flexcan.rst54
-rw-r--r--Documentation/networking/device_drivers/can/index.rst22
-rw-r--r--Documentation/networking/device_drivers/cellular/index.rst18
-rw-r--r--Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst196
-rw-r--r--Documentation/networking/device_drivers/ethernet/3com/3c509.rst249
-rw-r--r--Documentation/networking/device_drivers/ethernet/3com/vortex.rst459
-rw-r--r--Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst286
-rw-r--r--Documentation/networking/device_drivers/ethernet/amazon/ena.rst351
-rw-r--r--Documentation/networking/device_drivers/ethernet/amd/pds_core.rst139
-rw-r--r--Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst85
-rw-r--r--Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst79
-rw-r--r--Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst556
-rw-r--r--Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst393
-rw-r--r--Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst647
-rw-r--r--Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst171
-rw-r--r--Documentation/networking/device_drivers/ethernet/dec/dmfe.rst71
-rw-r--r--Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst314
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst269
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst161
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst186
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst12
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst194
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst406
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst217
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst51
-rw-r--r--Documentation/networking/device_drivers/ethernet/google/gve.rst175
-rw-r--r--Documentation/networking/device_drivers/ethernet/huawei/hinic.rst128
-rw-r--r--Documentation/networking/device_drivers/ethernet/index.rst64
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e100.rst185
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e1000.rst458
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e1000e.rst378
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/fm10k.rst137
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/i40e.rst766
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/iavf.rst326
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ice.rst1021
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/igb.rst208
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/igbvf.rst60
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst552
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst62
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst36
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst342
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst1305
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst25
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst168
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst281
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst229
-rw-r--r--Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst120
-rw-r--r--Documentation/networking/device_drivers/ethernet/neterion/s2io.rst196
-rw-r--r--Documentation/networking/device_drivers/ethernet/netronome/nfp.rst374
-rw-r--r--Documentation/networking/device_drivers/ethernet/pensando/ionic.rst274
-rw-r--r--Documentation/networking/device_drivers/ethernet/smsc/smc9.rst48
-rw-r--r--Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst700
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst143
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/cpsw.rst587
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst242
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/tlan.rst140
-rw-r--r--Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst202
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst14
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst20
-rw-r--r--Documentation/networking/device_drivers/fddi/defza.rst63
-rw-r--r--Documentation/networking/device_drivers/fddi/index.rst19
-rw-r--r--Documentation/networking/device_drivers/fddi/skfp.rst253
-rw-r--r--Documentation/networking/device_drivers/hamradio/baycom.rst174
-rw-r--r--Documentation/networking/device_drivers/hamradio/index.rst19
-rw-r--r--Documentation/networking/device_drivers/hamradio/z8530drv.rst686
-rw-r--r--Documentation/networking/device_drivers/index.rst28
-rw-r--r--Documentation/networking/device_drivers/qlogic/index.rst18
-rw-r--r--Documentation/networking/device_drivers/qlogic/qlge.rst118
-rw-r--r--Documentation/networking/device_drivers/wifi/index.rst20
-rw-r--r--Documentation/networking/device_drivers/wifi/intel/ipw2100.rst323
-rw-r--r--Documentation/networking/device_drivers/wifi/intel/ipw2200.rst526
-rw-r--r--Documentation/networking/device_drivers/wifi/ray_cs.rst165
-rw-r--r--Documentation/networking/device_drivers/wwan/index.rst19
-rw-r--r--Documentation/networking/device_drivers/wwan/iosm.rst96
-rw-r--r--Documentation/networking/device_drivers/wwan/t7xx.rst120
-rw-r--r--Documentation/networking/devlink/am65-nuss-cpsw-switch.rst26
-rw-r--r--Documentation/networking/devlink/bnxt.rst82
-rw-r--r--Documentation/networking/devlink/devlink-dpipe.rst252
-rw-r--r--Documentation/networking/devlink/devlink-flash.rst121
-rw-r--r--Documentation/networking/devlink/devlink-health.rst138
-rw-r--r--Documentation/networking/devlink/devlink-info.rst215
-rw-r--r--Documentation/networking/devlink/devlink-linecard.rst122
-rw-r--r--Documentation/networking/devlink/devlink-params.rst139
-rw-r--r--Documentation/networking/devlink/devlink-port.rst443
-rw-r--r--Documentation/networking/devlink/devlink-region.rst83
-rw-r--r--Documentation/networking/devlink/devlink-reload.rst81
-rw-r--r--Documentation/networking/devlink/devlink-resource.rst76
-rw-r--r--Documentation/networking/devlink/devlink-selftests.rst38
-rw-r--r--Documentation/networking/devlink/devlink-trap.rst640
-rw-r--r--Documentation/networking/devlink/etas_es58x.rst36
-rw-r--r--Documentation/networking/devlink/hns3.rst25
-rw-r--r--Documentation/networking/devlink/ice.rst395
-rw-r--r--Documentation/networking/devlink/index.rst69
-rw-r--r--Documentation/networking/devlink/ionic.rst29
-rw-r--r--Documentation/networking/devlink/iosm.rst162
-rw-r--r--Documentation/networking/devlink/mlx4.rst56
-rw-r--r--Documentation/networking/devlink/mlx5.rst288
-rw-r--r--Documentation/networking/devlink/mlxsw.rst105
-rw-r--r--Documentation/networking/devlink/mv88e6xxx.rst28
-rw-r--r--Documentation/networking/devlink/netdevsim.rst99
-rw-r--r--Documentation/networking/devlink/nfp.rst65
-rw-r--r--Documentation/networking/devlink/octeontx2.rst42
-rw-r--r--Documentation/networking/devlink/prestera.rst141
-rw-r--r--Documentation/networking/devlink/qed.rst26
-rw-r--r--Documentation/networking/devlink/sfc.rst57
-rw-r--r--Documentation/networking/devlink/ti-cpsw-switch.rst31
-rw-r--r--Documentation/networking/dns_resolver.rst155
-rw-r--r--Documentation/networking/driver.rst127
-rw-r--r--Documentation/networking/dsa/b53.rst183
-rw-r--r--Documentation/networking/dsa/bcm_sf2.rst115
-rw-r--r--Documentation/networking/dsa/configuration.rst458
-rw-r--r--Documentation/networking/dsa/dsa.rst1129
-rw-r--r--Documentation/networking/dsa/index.rst13
-rw-r--r--Documentation/networking/dsa/lan9303.rst37
-rw-r--r--Documentation/networking/dsa/sja1105.rst445
-rw-r--r--Documentation/networking/eql.rst373
-rw-r--r--Documentation/networking/ethtool-netlink.rst2103
-rw-r--r--Documentation/networking/failover.rst18
-rw-r--r--Documentation/networking/fib_trie.rst149
-rw-r--r--Documentation/networking/filter.rst685
-rw-r--r--Documentation/networking/gen_stats.rst129
-rw-r--r--Documentation/networking/generic-hdlc.rst170
-rw-r--r--Documentation/networking/generic_netlink.rst9
-rw-r--r--Documentation/networking/gtp.rst251
-rw-r--r--Documentation/networking/ieee802154.rst182
-rw-r--r--Documentation/networking/ila.rst296
-rw-r--r--Documentation/networking/index.rst132
-rw-r--r--Documentation/networking/ioam6-sysctl.rst26
-rw-r--r--Documentation/networking/ip-sysctl.rst3208
-rw-r--r--Documentation/networking/ip_dynaddr.rst40
-rw-r--r--Documentation/networking/ipddp.rst78
-rw-r--r--Documentation/networking/ipsec.rst46
-rw-r--r--Documentation/networking/ipv6.rst78
-rw-r--r--Documentation/networking/ipvlan.rst189
-rw-r--r--Documentation/networking/ipvs-sysctl.rst332
-rw-r--r--Documentation/networking/j1939.rst460
-rw-r--r--Documentation/networking/kapi.rst159
-rw-r--r--Documentation/networking/kcm.rst290
-rw-r--r--Documentation/networking/l2tp.rst677
-rw-r--r--Documentation/networking/lapb-module.rst305
-rw-r--r--Documentation/networking/mac80211-auth-assoc-deauth.txt95
-rw-r--r--Documentation/networking/mac80211-injection.rst106
-rw-r--r--Documentation/networking/mac80211_hwsim/hostapd.conf11
-rw-r--r--Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst80
-rw-r--r--Documentation/networking/mac80211_hwsim/wpa_supplicant.conf10
-rw-r--r--Documentation/networking/mctp.rst320
-rw-r--r--Documentation/networking/mpls-sysctl.rst57
-rw-r--r--Documentation/networking/mptcp-sysctl.rst84
-rw-r--r--Documentation/networking/msg_zerocopy.rst256
-rw-r--r--Documentation/networking/multiqueue.rst78
-rw-r--r--Documentation/networking/napi.rst255
-rw-r--r--Documentation/networking/net_dim.rst176
-rw-r--r--Documentation/networking/net_failover.rst184
-rw-r--r--Documentation/networking/netconsole.rst248
-rw-r--r--Documentation/networking/netdev-features.rst205
-rw-r--r--Documentation/networking/netdevices.rst299
-rw-r--r--Documentation/networking/netfilter-sysctl.rst17
-rw-r--r--Documentation/networking/netif-msg.rst95
-rw-r--r--Documentation/networking/nexthop-group-resilient.rst293
-rw-r--r--Documentation/networking/nf_conntrack-sysctl.rst232
-rw-r--r--Documentation/networking/nf_flowtable.rst235
-rw-r--r--Documentation/networking/nfc.rst130
-rw-r--r--Documentation/networking/openvswitch.rst251
-rw-r--r--Documentation/networking/operstates.rst187
-rw-r--r--Documentation/networking/packet_mmap.rst1083
-rw-r--r--Documentation/networking/page_pool.rst191
-rw-r--r--Documentation/networking/phonet.rst230
-rw-r--r--Documentation/networking/phy.rst551
-rw-r--r--Documentation/networking/pktgen.rst410
-rw-r--r--Documentation/networking/plip.rst222
-rw-r--r--Documentation/networking/ppp_generic.rst456
-rw-r--r--Documentation/networking/proc_net_tcp.rst57
-rw-r--r--Documentation/networking/radiotap-headers.rst159
-rw-r--r--Documentation/networking/rds.rst448
-rw-r--r--Documentation/networking/regulatory.rst209
-rw-r--r--Documentation/networking/representors.rst261
-rw-r--r--Documentation/networking/rxrpc.rst1174
-rw-r--r--Documentation/networking/scaling.rst523
-rw-r--r--Documentation/networking/sctp.rst42
-rw-r--r--Documentation/networking/secid.rst20
-rw-r--r--Documentation/networking/seg6-sysctl.rst39
-rw-r--r--Documentation/networking/segmentation-offloads.rst184
-rw-r--r--Documentation/networking/sfp-phylink.rst284
-rw-r--r--Documentation/networking/skbuff.rst37
-rw-r--r--Documentation/networking/smc-sysctl.rst61
-rw-r--r--Documentation/networking/snmp_counter.rst1793
-rw-r--r--Documentation/networking/statistics.rst221
-rw-r--r--Documentation/networking/strparser.rst240
-rw-r--r--Documentation/networking/switchdev.rst564
-rw-r--r--Documentation/networking/sysfs-tagging.rst48
-rw-r--r--Documentation/networking/tc-actions-env-rules.rst29
-rw-r--r--Documentation/networking/tc-queue-filters.rst37
-rw-r--r--Documentation/networking/tcp-thin.rst52
-rw-r--r--Documentation/networking/team.rst8
-rw-r--r--Documentation/networking/timestamping.rst802
-rw-r--r--Documentation/networking/tipc.rst215
-rw-r--r--Documentation/networking/tls-handshake.rst222
-rw-r--r--Documentation/networking/tls-offload-layers.svg1
-rw-r--r--Documentation/networking/tls-offload-reorder-bad.svg1
-rw-r--r--Documentation/networking/tls-offload-reorder-good.svg1
-rw-r--r--Documentation/networking/tls-offload.rst539
-rw-r--r--Documentation/networking/tls.rst288
-rw-r--r--Documentation/networking/tproxy.rst109
-rw-r--r--Documentation/networking/tuntap.rst259
-rw-r--r--Documentation/networking/udplite.rst291
-rw-r--r--Documentation/networking/vrf.rst464
-rw-r--r--Documentation/networking/vxlan.rst88
-rw-r--r--Documentation/networking/x25-iface.rst81
-rw-r--r--Documentation/networking/x25.rst46
-rw-r--r--Documentation/networking/xdp-rx-metadata.rst113
-rw-r--r--Documentation/networking/xfrm_device.rst196
-rw-r--r--Documentation/networking/xfrm_proc.rst113
-rw-r--r--Documentation/networking/xfrm_sync.rst189
-rw-r--r--Documentation/networking/xfrm_sysctl.rst11
247 files changed, 65953 insertions, 0 deletions
diff --git a/Documentation/networking/6lowpan.rst b/Documentation/networking/6lowpan.rst
new file mode 100644
index 0000000000..e70a6520cc
--- /dev/null
+++ b/Documentation/networking/6lowpan.rst
@@ -0,0 +1,53 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================
+Netdev private dataroom for 6lowpan interfaces
+==============================================
+
+All 6lowpan able net devices, means all interfaces with ARPHRD_6LOWPAN,
+must have "struct lowpan_priv" placed at beginning of netdev_priv.
+
+The priv_size of each interface should be calculate by::
+
+ dev->priv_size = LOWPAN_PRIV_SIZE(LL_6LOWPAN_PRIV_DATA);
+
+Where LL_PRIV_6LOWPAN_DATA is sizeof linklayer 6lowpan private data struct.
+To access the LL_PRIV_6LOWPAN_DATA structure you can cast::
+
+ lowpan_priv(dev)-priv;
+
+to your LL_6LOWPAN_PRIV_DATA structure.
+
+Before registering the lowpan netdev interface you must run::
+
+ lowpan_netdev_setup(dev, LOWPAN_LLTYPE_FOOBAR);
+
+wheres LOWPAN_LLTYPE_FOOBAR is a define for your 6LoWPAN linklayer type of
+enum lowpan_lltypes.
+
+Example to evaluate the private usually you can do::
+
+ static inline struct lowpan_priv_foobar *
+ lowpan_foobar_priv(struct net_device *dev)
+ {
+ return (struct lowpan_priv_foobar *)lowpan_priv(dev)->priv;
+ }
+
+ switch (dev->type) {
+ case ARPHRD_6LOWPAN:
+ lowpan_priv = lowpan_priv(dev);
+ /* do great stuff which is ARPHRD_6LOWPAN related */
+ switch (lowpan_priv->lltype) {
+ case LOWPAN_LLTYPE_FOOBAR:
+ /* do 802.15.4 6LoWPAN handling here */
+ lowpan_foobar_priv(dev)->bar = foo;
+ break;
+ ...
+ }
+ break;
+ ...
+ }
+
+In case of generic 6lowpan branch ("net/6lowpan") you can remove the check
+on ARPHRD_6LOWPAN, because you can be sure that these function are called
+by ARPHRD_6LOWPAN interfaces.
diff --git a/Documentation/networking/6pack.rst b/Documentation/networking/6pack.rst
new file mode 100644
index 0000000000..bc5bf1f1a9
--- /dev/null
+++ b/Documentation/networking/6pack.rst
@@ -0,0 +1,191 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+6pack Protocol
+==============
+
+This is the 6pack-mini-HOWTO, written by
+
+Andreas Könsgen DG3KQ
+
+:Internet: ajk@comnets.uni-bremen.de
+:AMPR-net: dg3kq@db0pra.ampr.org
+:AX.25: dg3kq@db0ach.#nrw.deu.eu
+
+Last update: April 7, 1998
+
+1. What is 6pack, and what are the advantages to KISS?
+======================================================
+
+6pack is a transmission protocol for data exchange between the PC and
+the TNC over a serial line. It can be used as an alternative to KISS.
+
+6pack has two major advantages:
+
+- The PC is given full control over the radio
+ channel. Special control data is exchanged between the PC and the TNC so
+ that the PC knows at any time if the TNC is receiving data, if a TNC
+ buffer underrun or overrun has occurred, if the PTT is
+ set and so on. This control data is processed at a higher priority than
+ normal data, so a data stream can be interrupted at any time to issue an
+ important event. This helps to improve the channel access and timing
+ algorithms as everything is computed in the PC. It would even be possible
+ to experiment with something completely different from the known CSMA and
+ DAMA channel access methods.
+ This kind of real-time control is especially important to supply several
+ TNCs that are connected between each other and the PC by a daisy chain
+ (however, this feature is not supported yet by the Linux 6pack driver).
+
+- Each packet transferred over the serial line is supplied with a checksum,
+ so it is easy to detect errors due to problems on the serial line.
+ Received packets that are corrupt are not passed on to the AX.25 layer.
+ Damaged packets that the TNC has received from the PC are not transmitted.
+
+More details about 6pack are described in the file 6pack.ps that is located
+in the doc directory of the AX.25 utilities package.
+
+2. Who has developed the 6pack protocol?
+========================================
+
+The 6pack protocol has been developed by Ekki Plicht DF4OR, Henning Rech
+DF9IC and Gunter Jost DK7WJ. A driver for 6pack, written by Gunter Jost and
+Matthias Welwarsky DG2FEF, comes along with the PC version of FlexNet.
+They have also written a firmware for TNCs to perform the 6pack
+protocol (see section 4 below).
+
+3. Where can I get the latest version of 6pack for LinuX?
+=========================================================
+
+At the moment, the 6pack stuff can obtained via anonymous ftp from
+db0bm.automation.fh-aachen.de. In the directory /incoming/dg3kq,
+there is a file named 6pack.tgz.
+
+4. Preparing the TNC for 6pack operation
+========================================
+
+To be able to use 6pack, a special firmware for the TNC is needed. The EPROM
+of a newly bought TNC does not contain 6pack, so you will have to
+program an EPROM yourself. The image file for 6pack EPROMs should be
+available on any packet radio box where PC/FlexNet can be found. The name of
+the file is 6pack.bin. This file is copyrighted and maintained by the FlexNet
+team. It can be used under the terms of the license that comes along
+with PC/FlexNet. Please do not ask me about the internals of this file as I
+don't know anything about it. I used a textual description of the 6pack
+protocol to program the Linux driver.
+
+TNCs contain a 64kByte EPROM, the lower half of which is used for
+the firmware/KISS. The upper half is either empty or is sometimes
+programmed with software called TAPR. In the latter case, the TNC
+is supplied with a DIP switch so you can easily change between the
+two systems. When programming a new EPROM, one of the systems is replaced
+by 6pack. It is useful to replace TAPR, as this software is rarely used
+nowadays. If your TNC is not equipped with the switch mentioned above, you
+can build in one yourself that switches over the highest address pin
+of the EPROM between HIGH and LOW level. After having inserted the new EPROM
+and switched to 6pack, apply power to the TNC for a first test. The connect
+and the status LED are lit for about a second if the firmware initialises
+the TNC correctly.
+
+5. Building and installing the 6pack driver
+===========================================
+
+The driver has been tested with kernel version 2.1.90. Use with older
+kernels may lead to a compilation error because the interface to a kernel
+function has been changed in the 2.1.8x kernels.
+
+How to turn on 6pack support:
+=============================
+
+- In the linux kernel configuration program, select the code maturity level
+ options menu and turn on the prompting for development drivers.
+
+- Select the amateur radio support menu and turn on the serial port 6pack
+ driver.
+
+- Compile and install the kernel and the modules.
+
+To use the driver, the kissattach program delivered with the AX.25 utilities
+has to be modified.
+
+- Do a cd to the directory that holds the kissattach sources. Edit the
+ kissattach.c file. At the top, insert the following lines::
+
+ #ifndef N_6PACK
+ #define N_6PACK (N_AX25+1)
+ #endif
+
+ Then find the line:
+
+ int disc = N_AX25;
+
+ and replace N_AX25 by N_6PACK.
+
+- Recompile kissattach. Rename it to spattach to avoid confusions.
+
+Installing the driver:
+----------------------
+
+- Do an insmod 6pack. Look at your /var/log/messages file to check if the
+ module has printed its initialization message.
+
+- Do a spattach as you would launch kissattach when starting a KISS port.
+ Check if the kernel prints the message '6pack: TNC found'.
+
+- From here, everything should work as if you were setting up a KISS port.
+ The only difference is that the network device that represents
+ the 6pack port is called sp instead of sl or ax. So, sp0 would be the
+ first 6pack port.
+
+Although the driver has been tested on various platforms, I still declare it
+ALPHA. BE CAREFUL! Sync your disks before insmoding the 6pack module
+and spattaching. Watch out if your computer behaves strangely. Read section
+6 of this file about known problems.
+
+Note that the connect and status LEDs of the TNC are controlled in a
+different way than they are when the TNC is used with PC/FlexNet. When using
+FlexNet, the connect LED is on if there is a connection; the status LED is
+on if there is data in the buffer of the PC's AX.25 engine that has to be
+transmitted. Under Linux, the 6pack layer is beyond the AX.25 layer,
+so the 6pack driver doesn't know anything about connects or data that
+has not yet been transmitted. Therefore the LEDs are controlled
+as they are in KISS mode: The connect LED is turned on if data is transferred
+from the PC to the TNC over the serial line, the status LED if data is
+sent to the PC.
+
+6. Known problems
+=================
+
+When testing the driver with 2.0.3x kernels and
+operating with data rates on the radio channel of 9600 Baud or higher,
+the driver may, on certain systems, sometimes print the message '6pack:
+bad checksum', which is due to data loss if the other station sends two
+or more subsequent packets. I have been told that this is due to a problem
+with the serial driver of 2.0.3x kernels. I don't know yet if the problem
+still exists with 2.1.x kernels, as I have heard that the serial driver
+code has been changed with 2.1.x.
+
+When shutting down the sp interface with ifconfig, the kernel crashes if
+there is still an AX.25 connection left over which an IP connection was
+running, even if that IP connection is already closed. The problem does not
+occur when there is a bare AX.25 connection still running. I don't know if
+this is a problem of the 6pack driver or something else in the kernel.
+
+The driver has been tested as a module, not yet as a kernel-builtin driver.
+
+The 6pack protocol supports daisy-chaining of TNCs in a token ring, which is
+connected to one serial port of the PC. This feature is not implemented
+and at least at the moment I won't be able to do it because I do not have
+the opportunity to build a TNC daisy-chain and test it.
+
+Some of the comments in the source code are inaccurate. They are left from
+the SLIP/KISS driver, from which the 6pack driver has been derived.
+I haven't modified or removed them yet -- sorry! The code itself needs
+some cleaning and optimizing. This will be done in a later release.
+
+If you encounter a bug or if you have a question or suggestion concerning the
+driver, feel free to mail me, using the addresses given at the beginning of
+this file.
+
+Have fun!
+
+Andreas
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
new file mode 100644
index 0000000000..dceeb0d763
--- /dev/null
+++ b/Documentation/networking/af_xdp.rst
@@ -0,0 +1,849 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+AF_XDP
+======
+
+Overview
+========
+
+AF_XDP is an address family that is optimized for high performance
+packet processing.
+
+This document assumes that the reader is familiar with BPF and XDP. If
+not, the Cilium project has an excellent reference guide at
+http://cilium.readthedocs.io/en/latest/bpf/.
+
+Using the XDP_REDIRECT action from an XDP program, the program can
+redirect ingress frames to other XDP enabled netdevs, using the
+bpf_redirect_map() function. AF_XDP sockets enable the possibility for
+XDP programs to redirect frames to a memory buffer in a user-space
+application.
+
+An AF_XDP socket (XSK) is created with the normal socket()
+syscall. Associated with each XSK are two rings: the RX ring and the
+TX ring. A socket can receive packets on the RX ring and it can send
+packets on the TX ring. These rings are registered and sized with the
+setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
+to have at least one of these rings for each socket. An RX or TX
+descriptor ring points to a data buffer in a memory area called a
+UMEM. RX and TX can share the same UMEM so that a packet does not have
+to be copied between RX and TX. Moreover, if a packet needs to be kept
+for a while due to a possible retransmit, the descriptor that points
+to that packet can be changed to point to another and reused right
+away. This again avoids copying data.
+
+The UMEM consists of a number of equally sized chunks. A descriptor in
+one of the rings references a frame by referencing its addr. The addr
+is simply an offset within the entire UMEM region. The user space
+allocates memory for this UMEM using whatever means it feels is most
+appropriate (malloc, mmap, huge pages, etc). This memory area is then
+registered with the kernel using the new setsockopt XDP_UMEM_REG. The
+UMEM also has two rings: the FILL ring and the COMPLETION ring. The
+FILL ring is used by the application to send down addr for the kernel
+to fill in with RX packet data. References to these frames will then
+appear in the RX ring once each packet has been received. The
+COMPLETION ring, on the other hand, contains frame addr that the
+kernel has transmitted completely and can now be used again by user
+space, for either TX or RX. Thus, the frame addrs appearing in the
+COMPLETION ring are addrs that were previously transmitted using the
+TX ring. In summary, the RX and FILL rings are used for the RX path
+and the TX and COMPLETION rings are used for the TX path.
+
+The socket is then finally bound with a bind() call to a device and a
+specific queue id on that device, and it is not until bind is
+completed that traffic starts to flow.
+
+The UMEM can be shared between processes, if desired. If a process
+wants to do this, it simply skips the registration of the UMEM and its
+corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
+call and submits the XSK of the process it would like to share UMEM
+with as well as its own newly created XSK socket. The new process will
+then receive frame addr references in its own RX ring that point to
+this shared UMEM. Note that since the ring structures are
+single-consumer / single-producer (for performance reasons), the new
+process has to create its own socket with associated RX and TX rings,
+since it cannot share this with the other process. This is also the
+reason that there is only one set of FILL and COMPLETION rings per
+UMEM. It is the responsibility of a single process to handle the UMEM.
+
+How is then packets distributed from an XDP program to the XSKs? There
+is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
+user-space application can place an XSK at an arbitrary place in this
+map. The XDP program can then redirect a packet to a specific index in
+this map and at this point XDP validates that the XSK in that map was
+indeed bound to that device and ring number. If not, the packet is
+dropped. If the map is empty at that index, the packet is also
+dropped. This also means that it is currently mandatory to have an XDP
+program loaded (and one XSK in the XSKMAP) to be able to get any
+traffic to user space through the XSK.
+
+AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
+driver does not have support for XDP, or XDP_SKB is explicitly chosen
+when loading the XDP program, XDP_SKB mode is employed that uses SKBs
+together with the generic XDP support and copies out the data to user
+space. A fallback mode that works for any network device. On the other
+hand, if the driver has support for XDP, it will be used by the AF_XDP
+code to provide better performance, but there is still a copy of the
+data into user space.
+
+Concepts
+========
+
+In order to use an AF_XDP socket, a number of associated objects need
+to be setup. These objects and their options are explained in the
+following sections.
+
+For an overview on how AF_XDP works, you can also take a look at the
+Linux Plumbers paper from 2018 on the subject:
+http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do
+NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt
+at AF_XDP. Nearly everything changed since then. Jonathan Corbet has
+also written an excellent article on LWN, "Accelerating networking
+with AF_XDP". It can be found at https://lwn.net/Articles/750845/.
+
+UMEM
+----
+
+UMEM is a region of virtual contiguous memory, divided into
+equal-sized frames. An UMEM is associated to a netdev and a specific
+queue id of that netdev. It is created and configured (chunk size,
+headroom, start address and size) by using the XDP_UMEM_REG setsockopt
+system call. A UMEM is bound to a netdev and queue id, via the bind()
+system call.
+
+An AF_XDP is socket linked to a single UMEM, but one UMEM can have
+multiple AF_XDP sockets. To share an UMEM created via one socket A,
+the next socket B can do this by setting the XDP_SHARED_UMEM flag in
+struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
+of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
+
+The UMEM has two single-producer/single-consumer rings that are used
+to transfer ownership of UMEM frames between the kernel and the
+user-space application.
+
+Rings
+-----
+
+There are a four different kind of rings: FILL, COMPLETION, RX and
+TX. All rings are single-producer/single-consumer, so the user-space
+application need explicit synchronization of multiple
+processes/threads are reading/writing to them.
+
+The UMEM uses two rings: FILL and COMPLETION. Each socket associated
+with the UMEM must have an RX queue, TX queue or both. Say, that there
+is a setup with four sockets (all doing TX and RX). Then there will be
+one FILL ring, one COMPLETION ring, four TX rings and four RX rings.
+
+The rings are head(producer)/tail(consumer) based rings. A producer
+writes the data ring at the index pointed out by struct xdp_ring
+producer member, and increasing the producer index. A consumer reads
+the data ring at the index pointed out by struct xdp_ring consumer
+member, and increasing the consumer index.
+
+The rings are configured and created via the _RING setsockopt system
+calls and mmapped to user-space using the appropriate offset to mmap()
+(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
+XDP_UMEM_PGOFF_COMPLETION_RING).
+
+The size of the rings need to be of size power of two.
+
+UMEM Fill Ring
+~~~~~~~~~~~~~~
+
+The FILL ring is used to transfer ownership of UMEM frames from
+user-space to kernel-space. The UMEM addrs are passed in the ring. As
+an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
+16 chunks and can pass addrs between 0 and 64k.
+
+Frames passed to the kernel are used for the ingress path (RX rings).
+
+The user application produces UMEM addrs to this ring. Note that, if
+running the application with aligned chunk mode, the kernel will mask
+the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
+the addr will be masked off, meaning that 2048, 2050 and 3000 refers
+to the same chunk. If the user application is run in the unaligned
+chunks mode, then the incoming addr will be left untouched.
+
+
+UMEM Completion Ring
+~~~~~~~~~~~~~~~~~~~~
+
+The COMPLETION Ring is used transfer ownership of UMEM frames from
+kernel-space to user-space. Just like the FILL ring, UMEM indices are
+used.
+
+Frames passed from the kernel to user-space are frames that has been
+sent (TX ring) and can be used by user-space again.
+
+The user application consumes UMEM addrs from this ring.
+
+
+RX Ring
+~~~~~~~
+
+The RX ring is the receiving side of a socket. Each entry in the ring
+is a struct xdp_desc descriptor. The descriptor contains UMEM offset
+(addr) and the length of the data (len).
+
+If no frames have been passed to kernel via the FILL ring, no
+descriptors will (or can) appear on the RX ring.
+
+The user application consumes struct xdp_desc descriptors from this
+ring.
+
+TX Ring
+~~~~~~~
+
+The TX ring is used to send frames. The struct xdp_desc descriptor is
+filled (index, length and offset) and passed into the ring.
+
+To start the transfer a sendmsg() system call is required. This might
+be relaxed in the future.
+
+The user application produces struct xdp_desc descriptors to this
+ring.
+
+Libbpf
+======
+
+Libbpf is a helper library for eBPF and XDP that makes using these
+technologies a lot simpler. It also contains specific helper functions
+in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
+contains two types of functions: those that can be used to make the
+setup of AF_XDP socket easier and ones that can be used in the data
+plane to access the rings safely and quickly. To see an example on how
+to use this API, please take a look at the sample application in
+samples/bpf/xdpsock_usr.c which uses libbpf for both setup and data
+plane operations.
+
+We recommend that you use this library unless you have become a power
+user. It will make your program a lot simpler.
+
+XSKMAP / BPF_MAP_TYPE_XSKMAP
+============================
+
+On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that
+is used in conjunction with bpf_redirect_map() to pass the ingress
+frame to a socket.
+
+The user application inserts the socket into the map, via the bpf()
+system call.
+
+Note that if an XDP program tries to redirect to a socket that does
+not match the queue configuration and netdev, the frame will be
+dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
+queue 17. Only the XDP program executing for eth0 and queue 17 will
+successfully pass data to the socket. Please refer to the sample
+application (samples/bpf/) in for an example.
+
+Configuration Flags and Socket Options
+======================================
+
+These are the various configuration flags that can be used to control
+and monitor the behavior of AF_XDP sockets.
+
+XDP_COPY and XDP_ZEROCOPY bind flags
+------------------------------------
+
+When you bind to a socket, the kernel will first try to use zero-copy
+copy. If zero-copy is not supported, it will fall back on using copy
+mode, i.e. copying all packets out to user space. But if you would
+like to force a certain mode, you can use the following flags. If you
+pass the XDP_COPY flag to the bind call, the kernel will force the
+socket into copy mode. If it cannot use copy mode, the bind call will
+fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
+socket into zero-copy mode or fail.
+
+XDP_SHARED_UMEM bind flag
+-------------------------
+
+This flag enables you to bind multiple sockets to the same UMEM. It
+works on the same queue id, between queue ids and between
+netdevs/devices. In this mode, each socket has their own RX and TX
+rings as usual, but you are going to have one or more FILL and
+COMPLETION ring pairs. You have to create one of these pairs per
+unique netdev and queue id tuple that you bind to.
+
+Starting with the case were we would like to share a UMEM between
+sockets bound to the same netdev and queue id. The UMEM (tied to the
+fist socket created) will only have a single FILL ring and a single
+COMPLETION ring as there is only on unique netdev,queue_id tuple that
+we have bound to. To use this mode, create the first socket and bind
+it in the normal way. Create a second socket and create an RX and a TX
+ring, or at least one of them, but no FILL or COMPLETION rings as the
+ones from the first socket will be used. In the bind call, set he
+XDP_SHARED_UMEM option and provide the initial socket's fd in the
+sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
+sockets this way.
+
+What socket will then a packet arrive on? This is decided by the XDP
+program. Put all the sockets in the XSK_MAP and just indicate which
+index in the array you would like to send each packet to. A simple
+round-robin example of distributing packets is shown below:
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include "bpf_helpers.h"
+
+ #define MAX_SOCKS 16
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_XSKMAP);
+ __uint(max_entries, MAX_SOCKS);
+ __uint(key_size, sizeof(int));
+ __uint(value_size, sizeof(int));
+ } xsks_map SEC(".maps");
+
+ static unsigned int rr;
+
+ SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+ {
+ rr = (rr + 1) & (MAX_SOCKS - 1);
+
+ return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
+ }
+
+Note, that since there is only a single set of FILL and COMPLETION
+rings, and they are single producer, single consumer rings, you need
+to make sure that multiple processes or threads do not use these rings
+concurrently. There are no synchronization primitives in the
+libbpf code that protects multiple users at this point in time.
+
+Libbpf uses this mode if you create more than one socket tied to the
+same UMEM. However, note that you need to supply the
+XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
+xsk_socket__create calls and load your own XDP program as there is no
+built in one in libbpf that will route the traffic for you.
+
+The second case is when you share a UMEM between sockets that are
+bound to different queue ids and/or netdevs. In this case you have to
+create one FILL ring and one COMPLETION ring for each unique
+netdev,queue_id pair. Let us say you want to create two sockets bound
+to two different queue ids on the same netdev. Create the first socket
+and bind it in the normal way. Create a second socket and create an RX
+and a TX ring, or at least one of them, and then one FILL and
+COMPLETION ring for this socket. Then in the bind call, set he
+XDP_SHARED_UMEM option and provide the initial socket's fd in the
+sxdp_shared_umem_fd field as you registered the UMEM on that
+socket. These two sockets will now share one and the same UMEM.
+
+There is no need to supply an XDP program like the one in the previous
+case where sockets were bound to the same queue id and
+device. Instead, use the NIC's packet steering capabilities to steer
+the packets to the right queue. In the previous example, there is only
+one queue shared among sockets, so the NIC cannot do this steering. It
+can only steer between queues.
+
+In libbpf, you need to use the xsk_socket__create_shared() API as it
+takes a reference to a FILL ring and a COMPLETION ring that will be
+created for you and bound to the shared UMEM. You can use this
+function for all the sockets you create, or you can use it for the
+second and following ones and use xsk_socket__create() for the first
+one. Both methods yield the same result.
+
+Note that a UMEM can be shared between sockets on the same queue id
+and device, as well as between queues on the same device and between
+devices at the same time.
+
+XDP_USE_NEED_WAKEUP bind flag
+-----------------------------
+
+This option adds support for a new flag called need_wakeup that is
+present in the FILL ring and the TX ring, the rings for which user
+space is a producer. When this option is set in the bind call, the
+need_wakeup flag will be set if the kernel needs to be explicitly
+woken up by a syscall to continue processing packets. If the flag is
+zero, no syscall is needed.
+
+If the flag is set on the FILL ring, the application needs to call
+poll() to be able to continue to receive packets on the RX ring. This
+can happen, for example, when the kernel has detected that there are no
+more buffers on the FILL ring and no buffers left on the RX HW ring of
+the NIC. In this case, interrupts are turned off as the NIC cannot
+receive any packets (as there are no buffers to put them in), and the
+need_wakeup flag is set so that user space can put buffers on the
+FILL ring and then call poll() so that the kernel driver can put these
+buffers on the HW ring and start to receive packets.
+
+If the flag is set for the TX ring, it means that the application
+needs to explicitly notify the kernel to send any packets put on the
+TX ring. This can be accomplished either by a poll() call, as in the
+RX path, or by calling sendto().
+
+An example of how to use this flag can be found in
+samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
+would look like this for the TX path:
+
+.. code-block:: c
+
+ if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
+ sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
+
+I.e., only use the syscall if the flag is set.
+
+We recommend that you always enable this mode as it usually leads to
+better performance especially if you run the application and the
+driver on the same core, but also if you use different cores for the
+application and the kernel driver, as it reduces the number of
+syscalls needed for the TX path.
+
+XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts
+------------------------------------------------------
+
+These setsockopts sets the number of descriptors that the RX, TX,
+FILL, and COMPLETION rings respectively should have. It is mandatory
+to set the size of at least one of the RX and TX rings. If you set
+both, you will be able to both receive and send traffic from your
+application, but if you only want to do one of them, you can save
+resources by only setting up one of them. Both the FILL ring and the
+COMPLETION ring are mandatory as you need to have a UMEM tied to your
+socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
+first one does not have a UMEM and should in that case not have any
+FILL or COMPLETION rings created as the ones from the shared UMEM will
+be used. Note, that the rings are single-producer single-consumer, so
+do not try to access them from multiple processes at the same
+time. See the XDP_SHARED_UMEM section.
+
+In libbpf, you can create Rx-only and Tx-only sockets by supplying
+NULL to the rx and tx arguments, respectively, to the
+xsk_socket__create function.
+
+If you create a Tx-only socket, we recommend that you do not put any
+packets on the fill ring. If you do this, drivers might think you are
+going to receive something when you in fact will not, and this can
+negatively impact performance.
+
+XDP_UMEM_REG setsockopt
+-----------------------
+
+This setsockopt registers a UMEM to a socket. This is the area that
+contain all the buffers that packet can reside in. The call takes a
+pointer to the beginning of this area and the size of it. Moreover, it
+also has parameter called chunk_size that is the size that the UMEM is
+divided into. It can only be 2K or 4K at the moment. If you have an
+UMEM area that is 128K and a chunk size of 2K, this means that you
+will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
+area and that your largest packet size can be 2K.
+
+There is also an option to set the headroom of each single buffer in
+the UMEM. If you set this to N bytes, it means that the packet will
+start N bytes into the buffer leaving the first N bytes for the
+application to use. The final option is the flags field, but it will
+be dealt with in separate sections for each UMEM flag.
+
+SO_BINDTODEVICE setsockopt
+--------------------------
+
+This is a generic SOL_SOCKET option that can be used to tie AF_XDP
+socket to a particular network interface. It is useful when a socket
+is created by a privileged process and passed to a non-privileged one.
+Once the option is set, kernel will refuse attempts to bind that socket
+to a different interface. Updating the value requires CAP_NET_RAW.
+
+XDP_STATISTICS getsockopt
+-------------------------
+
+Gets drop statistics of a socket that can be useful for debug
+purposes. The supported statistics are shown below:
+
+.. code-block:: c
+
+ struct xdp_statistics {
+ __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+ __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+ };
+
+XDP_OPTIONS getsockopt
+----------------------
+
+Gets options from an XDP socket. The only one supported so far is
+XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
+
+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+ frame. In the case the packet consists of a single frame, the
+ descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why the reverse logic of end-of-packet (eop) flag found
+in many NICs? Just to preserve compatibility with non-multi-buffer
+applications that have this bit set to false for all packets on Rx,
+and the apps set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+ descriptors/frames of this packet are marked as invalid and not
+ completed. The next descriptor is treated as the start of a new
+ packet, even if this was not the intent (because we cannot guess
+ the intent). As before, if your program is producing invalid
+ descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+ equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+ descriptors accumulated so far are dropped and treated as
+ invalid. To produce an application that will work on any system
+ regardless of this config setting, limit the number of frags to 18,
+ as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is up to what the NIC HW
+ supports. Usually at least five on the NICs we have checked. We
+ consciously chose to not enforce a rigid limit (such as
+ CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+ resulted in copy actions under the hood to fit into what limit the
+ NIC supports. Kind of defeats the purpose of zero-copy mode. How to
+ probe for this limit is explained in the "probe for multi-buffer
+ support" section.
+
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
+Usage
+-----
+
+In order to use AF_XDP sockets two parts are needed. The
+user-space application and the XDP program. For a complete setup and
+usage example, please refer to the sample application. The user-space
+side is xdpsock_user.c and the XDP side is part of libbpf.
+
+The XDP code sample included in tools/lib/bpf/xsk.c is the following:
+
+.. code-block:: c
+
+ SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+ {
+ int index = ctx->rx_queue_index;
+
+ // A set entry here means that the corresponding queue_id
+ // has an active AF_XDP socket bound to it.
+ if (bpf_map_lookup_elem(&xsks_map, &index))
+ return bpf_redirect_map(&xsks_map, index, 0);
+
+ return XDP_PASS;
+ }
+
+A simple but not so performance ring dequeue and enqueue could look
+like this:
+
+.. code-block:: c
+
+ // struct xdp_rxtx_ring {
+ // __u32 *producer;
+ // __u32 *consumer;
+ // struct xdp_desc *desc;
+ // };
+
+ // struct xdp_umem_ring {
+ // __u32 *producer;
+ // __u32 *consumer;
+ // __u64 *desc;
+ // };
+
+ // typedef struct xdp_rxtx_ring RING;
+ // typedef struct xdp_umem_ring RING;
+
+ // typedef struct xdp_desc RING_TYPE;
+ // typedef __u64 RING_TYPE;
+
+ int dequeue_one(RING *ring, RING_TYPE *item)
+ {
+ __u32 entries = *ring->producer - *ring->consumer;
+
+ if (entries == 0)
+ return -1;
+
+ // read-barrier!
+
+ *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
+ (*ring->consumer)++;
+ return 0;
+ }
+
+ int enqueue_one(RING *ring, const RING_TYPE *item)
+ {
+ u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);
+
+ if (free_entries == 0)
+ return -1;
+
+ ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;
+
+ // write-barrier!
+
+ (*ring->producer)++;
+ return 0;
+ }
+
+But please use the libbpf functions as they are optimized and ready to
+use. Will make your life easier.
+
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+ void rx_packets(struct xsk_socket_info *xsk)
+ {
+ static bool new_packet = true;
+ u32 idx_rx = 0, idx_fq = 0;
+ static char *pkt;
+
+ int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+ xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+ for (int i = 0; i < rcvd; i++) {
+ struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+ char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+ bool eop = !(desc->options & XDP_PKT_CONTD);
+
+ if (new_packet)
+ pkt = frag;
+ else
+ add_frag_to_pkt(pkt, frag);
+
+ if (eop)
+ process_pkt(pkt);
+
+ new_packet = eop;
+
+ *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+ }
+
+ xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+ xsk_ring_cons__release(&xsk->rx, rcvd);
+ }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. Also assumes pkts.addr
+points to a valid location in the umem.
+
+.. code-block:: c
+
+ void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+ int batch_size)
+ {
+ u32 idx, i, pkt_nb = 0;
+
+ xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+ for (i = 0; i < batch_size;) {
+ u64 addr = pkts[pkt_nb].addr;
+ u32 len = pkts[pkt_nb].size;
+
+ do {
+ struct xdp_desc *tx_desc;
+
+ tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+ tx_desc->addr = addr;
+
+ if (len > xsk_frame_size) {
+ tx_desc->len = xsk_frame_size;
+ tx_desc->options = XDP_PKT_CONTD;
+ } else {
+ tx_desc->len = len;
+ tx_desc->options = 0;
+ pkt_nb++;
+ }
+ len -= tx_desc->len;
+ addr += xsk_frame_size;
+
+ if (i == batch_size) {
+ /* Remember len, addr, pkt_nb for next iteration.
+ * Skipped for simplicity.
+ */
+ break;
+ }
+ } while (len);
+ }
+
+ xsk_ring_prod__submit(&xsk->tx, i);
+ }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+ one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+ returned number signifies the max number of frags supported.
+
+For an example on how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with full packet at the end. This
+to facilitate extending a zero-copy driver with multi-buffer support.
+
+Sample application
+==================
+
+There is a xdpsock benchmarking/test application included that
+demonstrates how to use AF_XDP sockets with private UMEMs. Say that
+you would like your UDP traffic from port 4242 to end up in queue 16,
+that we will enable AF_XDP on. Here, we use ethtool for this::
+
+ ethtool -N p3p2 rx-flow-hash udp4 fn
+ ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
+ action 16
+
+Running the rxdrop benchmark in XDP_DRV mode can then be done
+using::
+
+ samples/bpf/xdpsock -i p3p2 -q 16 -r -N
+
+For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
+can be displayed with "-h", as usual.
+
+This sample application uses libbpf to make the setup and usage of
+AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
+really used to make something more advanced, take a look at the libbpf
+code in tools/lib/bpf/xsk.[ch].
+
+FAQ
+=======
+
+Q: I am not seeing any traffic on the socket. What am I doing wrong?
+
+A: When a netdev of a physical NIC is initialized, Linux usually
+ allocates one RX and TX queue pair per core. So on a 8 core system,
+ queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
+ bind call or the xsk_socket__create libbpf function call, you
+ specify a specific queue id to bind to and it is only the traffic
+ towards that queue you are going to get on you socket. So in the
+ example above, if you bind to queue 0, you are NOT going to get any
+ traffic that is distributed to queues 1 through 7. If you are
+ lucky, you will see the traffic, but usually it will end up on one
+ of the queues you have not bound to.
+
+ There are a number of ways to solve the problem of getting the
+ traffic you want to the queue id you bound to. If you want to see
+ all the traffic, you can force the netdev to only have 1 queue, queue
+ id 0, and then bind to queue 0. You can use ethtool to do this::
+
+ sudo ethtool -L <interface> combined 1
+
+ If you want to only see part of the traffic, you can program the
+ NIC through ethtool to filter out your traffic to a single queue id
+ that you can bind your XDP socket to. Here is one example in which
+ UDP traffic to and from port 4242 are sent to queue 2::
+
+ sudo ethtool -N <interface> rx-flow-hash udp4 fn
+ sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
+ 4242 action 2
+
+ A number of other ways are possible all up to the capabilities of
+ the NIC you have.
+
+Q: Can I use the XSKMAP to implement a switch between different umems
+ in copy mode?
+
+A: The short answer is no, that is not supported at the moment. The
+ XSKMAP can only be used to switch traffic coming in on queue id X
+ to sockets bound to the same queue id X. The XSKMAP can contain
+ sockets bound to different queue ids, for example X and Y, but only
+ traffic goming in from queue id Y can be directed to sockets bound
+ to the same queue id Y. In zero-copy mode, you should use the
+ switch, or other distribution mechanism, in your NIC to direct
+ traffic to the correct queue id and socket.
+
+Q: My packets are sometimes corrupted. What is wrong?
+
+A: Care has to be taken not to feed the same buffer in the UMEM into
+ more than one ring at the same time. If you for example feed the
+ same buffer into the FILL ring and the TX ring at the same time, the
+ NIC might receive data into the buffer at the same time it is
+ sending it. This will cause some packets to become corrupted. Same
+ thing goes for feeding the same buffer into the FILL rings
+ belonging to different queue ids or netdevs bound with the
+ XDP_SHARED_UMEM flag.
+
+Credits
+=======
+
+- Björn Töpel (AF_XDP core)
+- Magnus Karlsson (AF_XDP core)
+- Alexander Duyck
+- Alexei Starovoitov
+- Daniel Borkmann
+- Jesper Dangaard Brouer
+- John Fastabend
+- Jonathan Corbet (LWN coverage)
+- Michael S. Tsirkin
+- Qi Z Zhang
+- Willem de Bruijn
diff --git a/Documentation/networking/alias.rst b/Documentation/networking/alias.rst
new file mode 100644
index 0000000000..af7c5ee920
--- /dev/null
+++ b/Documentation/networking/alias.rst
@@ -0,0 +1,49 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+IP-Aliasing
+===========
+
+IP-aliases are an obsolete way to manage multiple IP-addresses/masks
+per interface. Newer tools such as iproute2 support multiple
+address/prefixes per interface, but aliases are still supported
+for backwards compatibility.
+
+An alias is formed by adding a colon and a string when running ifconfig.
+This string is usually numeric, but this is not a must.
+
+
+Alias creation
+==============
+
+Alias creation is done by 'magic' interface naming: eg. to create a
+200.1.1.1 alias for eth0 ...
+::
+
+ # ifconfig eth0:0 200.1.1.1 etc,etc....
+ ~~ -> request alias #0 creation (if not yet exists) for eth0
+
+The corresponding route is also set up by this command. Please note:
+The route always points to the base interface.
+
+
+Alias deletion
+==============
+
+The alias is removed by shutting the alias down::
+
+ # ifconfig eth0:0 down
+ ~~~~~~~~~~ -> will delete alias
+
+
+Alias (re-)configuring
+======================
+
+Aliases are not real devices, but programs should be able to configure
+and refer to them as usual (ifconfig, route, etc).
+
+
+Relationship with main device
+=============================
+
+If the base device is shut down the added aliases will be deleted too.
diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst
new file mode 100644
index 0000000000..9822157235
--- /dev/null
+++ b/Documentation/networking/arcnet-hardware.rst
@@ -0,0 +1,3234 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+ARCnet Hardware
+===============
+
+.. note::
+
+ 1) This file is a supplement to arcnet.txt. Please read that for general
+ driver configuration help.
+ 2) This file is no longer Linux-specific. It should probably be moved out
+ of the kernel sources. Ideas?
+
+Because so many people (myself included) seem to have obtained ARCnet cards
+without manuals, this file contains a quick introduction to ARCnet hardware,
+some cabling tips, and a listing of all jumper settings I can find. Please
+e-mail apenwarr@worldvisions.ca with any settings for your particular card,
+or any other information you have!
+
+
+Introduction to ARCnet
+======================
+
+ARCnet is a network type which works in a way similar to popular Ethernet
+networks but which is also different in some very important ways.
+
+First of all, you can get ARCnet cards in at least two speeds: 2.5 Mbps
+(slower than Ethernet) and 100 Mbps (faster than normal Ethernet). In fact,
+there are others as well, but these are less common. The different hardware
+types, as far as I'm aware, are not compatible and so you cannot wire a
+100 Mbps card to a 2.5 Mbps card, and so on. From what I hear, my driver does
+work with 100 Mbps cards, but I haven't been able to verify this myself,
+since I only have the 2.5 Mbps variety. It is probably not going to saturate
+your 100 Mbps card. Stop complaining. :)
+
+You also cannot connect an ARCnet card to any kind of Ethernet card and
+expect it to work.
+
+There are two "types" of ARCnet - STAR topology and BUS topology. This
+refers to how the cards are meant to be wired together. According to most
+available documentation, you can only connect STAR cards to STAR cards and
+BUS cards to BUS cards. That makes sense, right? Well, it's not quite
+true; see below under "Cabling."
+
+Once you get past these little stumbling blocks, ARCnet is actually quite a
+well-designed standard. It uses something called "modified token passing"
+which makes it completely incompatible with so-called "Token Ring" cards,
+but which makes transfers much more reliable than Ethernet does. In fact,
+ARCnet will guarantee that a packet arrives safely at the destination, and
+even if it can't possibly be delivered properly (ie. because of a cable
+break, or because the destination computer does not exist) it will at least
+tell the sender about it.
+
+Because of the carefully defined action of the "token", it will always make
+a pass around the "ring" within a maximum length of time. This makes it
+useful for realtime networks.
+
+In addition, all known ARCnet cards have an (almost) identical programming
+interface. This means that with one ARCnet driver you can support any
+card, whereas with Ethernet each manufacturer uses what is sometimes a
+completely different programming interface, leading to a lot of different,
+sometimes very similar, Ethernet drivers. Of course, always using the same
+programming interface also means that when high-performance hardware
+facilities like PCI bus mastering DMA appear, it's hard to take advantage of
+them. Let's not go into that.
+
+One thing that makes ARCnet cards difficult to program for, however, is the
+limit on their packet sizes; standard ARCnet can only send packets that are
+up to 508 bytes in length. This is smaller than the Internet "bare minimum"
+of 576 bytes, let alone the Ethernet MTU of 1500. To compensate, an extra
+level of encapsulation is defined by RFC1201, which I call "packet
+splitting," that allows "virtual packets" to grow as large as 64K each,
+although they are generally kept down to the Ethernet-style 1500 bytes.
+
+For more information on the advantages and disadvantages (mostly the
+advantages) of ARCnet networks, you might try the "ARCnet Trade Association"
+WWW page:
+
+ http://www.arcnet.com
+
+
+Cabling ARCnet Networks
+=======================
+
+This section was rewritten by
+
+ Vojtech Pavlik <vojtech@suse.cz>
+
+using information from several people, including:
+
+ - Avery Pennraun <apenwarr@worldvisions.ca>
+ - Stephen A. Wood <saw@hallc1.cebaf.gov>
+ - John Paul Morrison <jmorriso@bogomips.ee.ubc.ca>
+ - Joachim Koenig <jojo@repas.de>
+
+and Avery touched it up a bit, at Vojtech's request.
+
+ARCnet (the classic 2.5 Mbps version) can be connected by two different
+types of cabling: coax and twisted pair. The other ARCnet-type networks
+(100 Mbps TCNS and 320 kbps - 32 Mbps ARCnet Plus) use different types of
+cabling (Type1, Fiber, C1, C4, C5).
+
+For a coax network, you "should" use 93 Ohm RG-62 cable. But other cables
+also work fine, because ARCnet is a very stable network. I personally use 75
+Ohm TV antenna cable.
+
+Cards for coax cabling are shipped in two different variants: for BUS and
+STAR network topologies. They are mostly the same. The only difference
+lies in the hybrid chip installed. BUS cards use high impedance output,
+while STAR use low impedance. Low impedance card (STAR) is electrically
+equal to a high impedance one with a terminator installed.
+
+Usually, the ARCnet networks are built up from STAR cards and hubs. There
+are two types of hubs - active and passive. Passive hubs are small boxes
+with four BNC connectors containing four 47 Ohm resistors::
+
+ | | wires
+ R + junction
+ -R-+-R- R 47 Ohm resistors
+ R
+ |
+
+The shielding is connected together. Active hubs are much more complicated;
+they are powered and contain electronics to amplify the signal and send it
+to other segments of the net. They usually have eight connectors. Active
+hubs come in two variants - dumb and smart. The dumb variant just
+amplifies, but the smart one decodes to digital and encodes back all packets
+coming through. This is much better if you have several hubs in the net,
+since many dumb active hubs may worsen the signal quality.
+
+And now to the cabling. What you can connect together:
+
+1. A card to a card. This is the simplest way of creating a 2-computer
+ network.
+
+2. A card to a passive hub. Remember that all unused connectors on the hub
+ must be properly terminated with 93 Ohm (or something else if you don't
+ have the right ones) terminators.
+
+ (Avery's note: oops, I didn't know that. Mine (TV cable) works
+ anyway, though.)
+
+3. A card to an active hub. Here is no need to terminate the unused
+ connectors except some kind of aesthetic feeling. But, there may not be
+ more than eleven active hubs between any two computers. That of course
+ doesn't limit the number of active hubs on the network.
+
+4. An active hub to another.
+
+5. An active hub to passive hub.
+
+Remember that you cannot connect two passive hubs together. The power loss
+implied by such a connection is too high for the net to operate reliably.
+
+An example of a typical ARCnet network::
+
+ R S - STAR type card
+ S------H--------A-------S R - Terminator
+ | | H - Hub
+ | | A - Active hub
+ | S----H----S
+ S |
+ |
+ S
+
+The BUS topology is very similar to the one used by Ethernet. The only
+difference is in cable and terminators: they should be 93 Ohm. Ethernet
+uses 50 Ohm impedance. You use T connectors to put the computers on a single
+line of cable, the bus. You have to put terminators at both ends of the
+cable. A typical BUS ARCnet network looks like::
+
+ RT----T------T------T------T------TR
+ B B B B B B
+
+ B - BUS type card
+ R - Terminator
+ T - T connector
+
+But that is not all! The two types can be connected together. According to
+the official documentation the only way of connecting them is using an active
+hub::
+
+ A------T------T------TR
+ | B B B
+ S---H---S
+ |
+ S
+
+The official docs also state that you can use STAR cards at the ends of
+BUS network in place of a BUS card and a terminator::
+
+ S------T------T------S
+ B B
+
+But, according to my own experiments, you can simply hang a BUS type card
+anywhere in middle of a cable in a STAR topology network. And more - you
+can use the bus card in place of any star card if you use a terminator. Then
+you can build very complicated networks fulfilling all your needs! An
+example::
+
+ S
+ |
+ RT------T-------T------H------S
+ B B B |
+ | R
+ S------A------T-------T-------A-------H------TR
+ | B B | | B
+ | S BT |
+ | | | S----A-----S
+ S------H---A----S | |
+ | | S------T----H---S |
+ S S B R S
+
+A basically different cabling scheme is used with Twisted Pair cabling. Each
+of the TP cards has two RJ (phone-cord style) connectors. The cards are
+then daisy-chained together using a cable connecting every two neighboring
+cards. The ends are terminated with RJ 93 Ohm terminators which plug into
+the empty connectors of cards on the ends of the chain. An example::
+
+ ___________ ___________
+ _R_|_ _|_|_ _|_R_
+ | | | | | |
+ |Card | |Card | |Card |
+ |_____| |_____| |_____|
+
+
+There are also hubs for the TP topology. There is nothing difficult
+involved in using them; you just connect a TP chain to a hub on any end or
+even at both. This way you can create almost any network configuration.
+The maximum of 11 hubs between any two computers on the net applies here as
+well. An example::
+
+ RP-------P--------P--------H-----P------P-----PR
+ |
+ RP-----H--------P--------H-----P------PR
+ | |
+ PR PR
+
+ R - RJ Terminator
+ P - TP Card
+ H - TP Hub
+
+Like any network, ARCnet has a limited cable length. These are the maximum
+cable lengths between two active ends (an active end being an active hub or
+a STAR card).
+
+ ========== ======= ===========
+ RG-62 93 Ohm up to 650 m
+ RG-59/U 75 Ohm up to 457 m
+ RG-11/U 75 Ohm up to 533 m
+ IBM Type 1 150 Ohm up to 200 m
+ IBM Type 3 100 Ohm up to 100 m
+ ========== ======= ===========
+
+The maximum length of all cables connected to a passive hub is limited to 65
+meters for RG-62 cabling; less for others. You can see that using passive
+hubs in a large network is a bad idea. The maximum length of a single "BUS
+Trunk" is about 300 meters for RG-62. The maximum distance between the two
+most distant points of the net is limited to 3000 meters. The maximum length
+of a TP cable between two cards/hubs is 650 meters.
+
+
+Setting the Jumpers
+===================
+
+All ARCnet cards should have a total of four or five different settings:
+
+ - the I/O address: this is the "port" your ARCnet card is on. Probed
+ values in the Linux ARCnet driver are only from 0x200 through 0x3F0. (If
+ your card has additional ones, which is possible, please tell me.) This
+ should not be the same as any other device on your system. According to
+ a doc I got from Novell, MS Windows prefers values of 0x300 or more,
+ eating net connections on my system (at least) otherwise. My guess is
+ this may be because, if your card is at 0x2E0, probing for a serial port
+ at 0x2E8 will reset the card and probably mess things up royally.
+
+ - Avery's favourite: 0x300.
+
+ - the IRQ: on 8-bit cards, it might be 2 (9), 3, 4, 5, or 7.
+ on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15.
+
+ Make sure this is different from any other card on your system. Note
+ that IRQ2 is the same as IRQ9, as far as Linux is concerned. You can
+ "cat /proc/interrupts" for a somewhat complete list of which ones are in
+ use at any given time. Here is a list of common usages from Vojtech
+ Pavlik <vojtech@suse.cz>:
+
+ ("Not on bus" means there is no way for a card to generate this
+ interrupt)
+
+ ====== =========================================================
+ IRQ 0 Timer 0 (Not on bus)
+ IRQ 1 Keyboard (Not on bus)
+ IRQ 2 IRQ Controller 2 (Not on bus, nor does interrupt the CPU)
+ IRQ 3 COM2
+ IRQ 4 COM1
+ IRQ 5 FREE (LPT2 if you have it; sometimes COM3; maybe PLIP)
+ IRQ 6 Floppy disk controller
+ IRQ 7 FREE (LPT1 if you don't use the polling driver; PLIP)
+ IRQ 8 Realtime Clock Interrupt (Not on bus)
+ IRQ 9 FREE (VGA vertical sync interrupt if enabled)
+ IRQ 10 FREE
+ IRQ 11 FREE
+ IRQ 12 FREE
+ IRQ 13 Numeric Coprocessor (Not on bus)
+ IRQ 14 Fixed Disk Controller
+ IRQ 15 FREE (Fixed Disk Controller 2 if you have it)
+ ====== =========================================================
+
+
+ .. note::
+
+ IRQ 9 is used on some video cards for the "vertical retrace"
+ interrupt. This interrupt would have been handy for things like
+ video games, as it occurs exactly once per screen refresh, but
+ unfortunately IBM cancelled this feature starting with the original
+ VGA and thus many VGA/SVGA cards do not support it. For this
+ reason, no modern software uses this interrupt and it can almost
+ always be safely disabled, if your video card supports it at all.
+
+ If your card for some reason CANNOT disable this IRQ (usually there
+ is a jumper), one solution would be to clip the printed circuit
+ contact on the board: it's the fourth contact from the left on the
+ back side. I take no responsibility if you try this.
+
+ - Avery's favourite: IRQ2 (actually IRQ9). Watch that VGA, though.
+
+ - the memory address: Unlike most cards, ARCnets use "shared memory" for
+ copying buffers around. Make SURE it doesn't conflict with any other
+ used memory in your system!
+
+ ::
+
+ A0000 - VGA graphics memory (ok if you don't have VGA)
+ B0000 - Monochrome text mode
+ C0000 \ One of these is your VGA BIOS - usually C0000.
+ E0000 /
+ F0000 - System BIOS
+
+ Anything less than 0xA0000 is, well, a BAD idea since it isn't above
+ 640k.
+
+ - Avery's favourite: 0xD0000
+
+ - the station address: Every ARCnet card has its own "unique" network
+ address from 0 to 255. Unlike Ethernet, you can set this address
+ yourself with a jumper or switch (or on some cards, with special
+ software). Since it's only 8 bits, you can only have 254 ARCnet cards
+ on a network. DON'T use 0 or 255, since these are reserved (although
+ neat stuff will probably happen if you DO use them). By the way, if you
+ haven't already guessed, don't set this the same as any other ARCnet on
+ your network!
+
+ - Avery's favourite: 3 and 4. Not that it matters.
+
+ - There may be ETS1 and ETS2 settings. These may or may not make a
+ difference on your card (many manuals call them "reserved"), but are
+ used to change the delays used when powering up a computer on the
+ network. This is only necessary when wiring VERY long range ARCnet
+ networks, on the order of 4km or so; in any case, the only real
+ requirement here is that all cards on the network with ETS1 and ETS2
+ jumpers have them in the same position. Chris Hindy <chrish@io.org>
+ sent in a chart with actual values for this:
+
+ ======= ======= =============== ====================
+ ET1 ET2 Response Time Reconfiguration Time
+ ======= ======= =============== ====================
+ open open 74.7us 840us
+ open closed 283.4us 1680us
+ closed open 561.8us 1680us
+ closed closed 1118.6us 1680us
+ ======= ======= =============== ====================
+
+ Make sure you set ETS1 and ETS2 to the SAME VALUE for all cards on your
+ network.
+
+Also, on many cards (not mine, though) there are red and green LED's.
+Vojtech Pavlik <vojtech@suse.cz> tells me this is what they mean:
+
+ =============== =============== =====================================
+ GREEN RED Status
+ =============== =============== =====================================
+ OFF OFF Power off
+ OFF Short flashes Cabling problems (broken cable or not
+ terminated)
+ OFF (short) ON Card init
+ ON ON Normal state - everything OK, nothing
+ happens
+ ON Long flashes Data transfer
+ ON OFF Never happens (maybe when wrong ID)
+ =============== =============== =====================================
+
+
+The following is all the specific information people have sent me about
+their own particular ARCnet cards. It is officially a mess, and contains
+huge amounts of duplicated information. I have no time to fix it. If you
+want to, PLEASE DO! Just send me a 'diff -u' of all your changes.
+
+The model # is listed right above specifics for that card, so you should be
+able to use your text viewer's "search" function to find the entry you want.
+If you don't KNOW what kind of card you have, try looking through the
+various diagrams to see if you can tell.
+
+If your model isn't listed and/or has different settings, PLEASE PLEASE
+tell me. I had to figure mine out without the manual, and it WASN'T FUN!
+
+Even if your ARCnet model isn't listed, but has the same jumpers as another
+model that is, please e-mail me to say so.
+
+Cards Listed in this file (in this order, mostly):
+
+ =============== ======================= ====
+ Manufacturer Model # Bits
+ =============== ======================= ====
+ SMC PC100 8
+ SMC PC110 8
+ SMC PC120 8
+ SMC PC130 8
+ SMC PC270E 8
+ SMC PC500 16
+ SMC PC500Longboard 16
+ SMC PC550Longboard 16
+ SMC PC600 16
+ SMC PC710 8
+ SMC? LCS-8830(-T) 8/16
+ Puredata PDI507 8
+ CNet Tech CN120-Series 8
+ CNet Tech CN160-Series 16
+ Lantech? UM9065L chipset 8
+ Acer 5210-003 8
+ Datapoint? LAN-ARC-8 8
+ Topware TA-ARC/10 8
+ Thomas-Conrad 500-6242-0097 REV A 8
+ Waterloo? (C)1985 Waterloo Micro. 8
+ No Name -- 8/16
+ No Name Taiwan R.O.C? 8
+ No Name Model 9058 8
+ Tiara Tiara Lancard? 8
+ =============== ======================= ====
+
+
+* SMC = Standard Microsystems Corp.
+* CNet Tech = CNet Technology, Inc.
+
+Unclassified Stuff
+==================
+
+ - Please send any other information you can find.
+
+ - And some other stuff (more info is welcome!)::
+
+ From: root@ultraworld.xs4all.nl (Timo Hilbrink)
+ To: apenwarr@foxnet.net (Avery Pennarun)
+ Date: Wed, 26 Oct 1994 02:10:32 +0000 (GMT)
+ Reply-To: timoh@xs4all.nl
+
+ [...parts deleted...]
+
+ About the jumpers: On my PC130 there is one more jumper, located near the
+ cable-connector and it's for changing to star or bus topology;
+ closed: star - open: bus
+ On the PC500 are some more jumper-pins, one block labeled with RX,PDN,TXI
+ and another with ALE,LA17,LA18,LA19 these are undocumented..
+
+ [...more parts deleted...]
+
+ --- CUT ---
+
+Standard Microsystems Corp (SMC)
+================================
+
+PC100, PC110, PC120, PC130 (8-bit cards) and PC500, PC600 (16-bit cards)
+------------------------------------------------------------------------
+
+ - mainly from Avery Pennarun <apenwarr@worldvisions.ca>. Values depicted
+ are from Avery's setup.
+ - special thanks to Timo Hilbrink <timoh@xs4all.nl> for noting that PC120,
+ 130, 500, and 600 all have the same switches as Avery's PC100.
+ PC500/600 have several extra, undocumented pins though. (?)
+ - PC110 settings were verified by Stephen A. Wood <saw@cebaf.gov>
+ - Also, the JP- and S-numbers probably don't match your card exactly. Try
+ to find jumpers/switches with the same number of settings - it's
+ probably more reliable.
+
+::
+
+ JP5 [|] : : : :
+ (IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7
+ Put exactly one jumper on exactly one set of pins.
+
+
+ 1 2 3 4 5 6 7 8 9 10
+ S1 /----------------------------------\
+ (I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 |
+ addresses) \----------------------------------/
+ |--| |--------| |--------|
+ (a) (b) (m)
+
+ WARNING. It's very important when setting these which way
+ you're holding the card, and which way you think is '1'!
+
+ If you suspect that your settings are not being made
+ correctly, try reversing the direction or inverting the
+ switch positions.
+
+ a: The first digit of the I/O address.
+ Setting Value
+ ------- -----
+ 00 0
+ 01 1
+ 10 2
+ 11 3
+
+ b: The second digit of the I/O address.
+ Setting Value
+ ------- -----
+ 0000 0
+ 0001 1
+ 0010 2
+ ... ...
+ 1110 E
+ 1111 F
+
+ The I/O address is in the form ab0. For example, if
+ a is 0x2 and b is 0xE, the address will be 0x2E0.
+
+ DO NOT SET THIS LESS THAN 0x200!!!!!
+
+
+ m: The first digit of the memory address.
+ Setting Value
+ ------- -----
+ 0000 0
+ 0001 1
+ 0010 2
+ ... ...
+ 1110 E
+ 1111 F
+
+ The memory address is in the form m0000. For example, if
+ m is D, the address will be 0xD0000.
+
+ DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000!
+
+ 1 2 3 4 5 6 7 8
+ S2 /--------------------------\
+ (Station Address) | 1 1 0 0 0 0 0 0 |
+ \--------------------------/
+
+ Setting Value
+ ------- -----
+ 00000000 00
+ 10000000 01
+ 01000000 02
+ ...
+ 01111111 FE
+ 11111111 FF
+
+ Note that this is binary with the digits reversed!
+
+ DO NOT SET THIS TO 0 OR 255 (0xFF)!
+
+
+PC130E/PC270E (8-bit cards)
+---------------------------
+
+ - from Juergen Seifert <seifert@htwm.de>
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the following Original SMC Manual
+
+ "Configuration Guide for ARCNET(R)-PC130E/PC270 Network
+ Controller Boards Pub. # 900.044A June, 1989"
+
+ARCNET is a registered trademark of the Datapoint Corporation
+SMC is a registered trademark of the Standard Microsystems Corporation
+
+The PC130E is an enhanced version of the PC130 board, is equipped with a
+standard BNC female connector for connection to RG-62/U coax cable.
+Since this board is designed both for point-to-point connection in star
+networks and for connection to bus networks, it is downwardly compatible
+with all the other standard boards designed for coax networks (that is,
+the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and
+PC200 bus topology boards).
+
+The PC270E is an enhanced version of the PC260 board, is equipped with two
+modular RJ11-type jacks for connection to twisted pair wiring.
+It can be used in a star or a daisy-chained network.
+
+::
+
+ 8 7 6 5 4 3 2 1
+ ________________________________________________________________
+ | | S1 | |
+ | |_________________| |
+ | Offs|Base |I/O Addr |
+ | RAM Addr | ___|
+ | ___ ___ CR3 |___|
+ | | \/ | CR4 |___|
+ | | PROM | ___|
+ | | | N | | 8
+ | | SOCKET | o | | 7
+ | |________| d | | 6
+ | ___________________ e | | 5
+ | | | A | S | 4
+ | |oo| EXT2 | | d | 2 | 3
+ | |oo| EXT1 | SMC | d | | 2
+ | |oo| ROM | 90C63 | r |___| 1
+ | |oo| IRQ7 | | |o| _____|
+ | |oo| IRQ5 | | |o| | J1 |
+ | |oo| IRQ4 | | STAR |_____|
+ | |oo| IRQ3 | | | J2 |
+ | |oo| IRQ2 |___________________| |_____|
+ |___ ______________|
+ | |
+ |_____________________________________________|
+
+Legend::
+
+ SMC 90C63 ARCNET Controller / Transceiver /Logic
+ S1 1-3: I/O Base Address Select
+ 4-6: Memory Base Address Select
+ 7-8: RAM Offset Select
+ S2 1-8: Node ID Select
+ EXT Extended Timeout Select
+ ROM ROM Enable Select
+ STAR Selected - Star Topology (PC130E only)
+ Deselected - Bus Topology (PC130E only)
+ CR3/CR4 Diagnostic LEDs
+ J1 BNC RG62/U Connector (PC130E only)
+ J1 6-position Telephone Jack (PC270E only)
+ J2 6-position Telephone Jack (PC270E only)
+
+Setting one of the switches to Off/Open means "1", On/Closed means "0".
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in group S2 are used to set the node ID.
+These switches work in a way similar to the PC100-series cards; see that
+entry for more information.
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The first three switches in switch group S1 are used to select one
+of eight possible I/O Base addresses using the following table::
+
+
+ Switch | Hex I/O
+ 1 2 3 | Address
+ -------|--------
+ 0 0 0 | 260
+ 0 0 1 | 290
+ 0 1 0 | 2E0 (Manufacturer's default)
+ 0 1 1 | 2F0
+ 1 0 0 | 300
+ 1 0 1 | 350
+ 1 1 0 | 380
+ 1 1 1 | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer requires 2K of a 16K block of RAM. The base of this
+16K block can be located in any of eight positions.
+Switches 4-6 of switch group S1 select the Base of the 16K block.
+Within that 16K address space, the buffer may be assigned any one of four
+positions, determined by the offset, switches 7 and 8 of group S1.
+
+::
+
+ Switch | Hex RAM | Hex ROM
+ 4 5 6 7 8 | Address | Address *)
+ -----------|---------|-----------
+ 0 0 0 0 0 | C0000 | C2000
+ 0 0 0 0 1 | C0800 | C2000
+ 0 0 0 1 0 | C1000 | C2000
+ 0 0 0 1 1 | C1800 | C2000
+ | |
+ 0 0 1 0 0 | C4000 | C6000
+ 0 0 1 0 1 | C4800 | C6000
+ 0 0 1 1 0 | C5000 | C6000
+ 0 0 1 1 1 | C5800 | C6000
+ | |
+ 0 1 0 0 0 | CC000 | CE000
+ 0 1 0 0 1 | CC800 | CE000
+ 0 1 0 1 0 | CD000 | CE000
+ 0 1 0 1 1 | CD800 | CE000
+ | |
+ 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
+ 0 1 1 0 1 | D0800 | D2000
+ 0 1 1 1 0 | D1000 | D2000
+ 0 1 1 1 1 | D1800 | D2000
+ | |
+ 1 0 0 0 0 | D4000 | D6000
+ 1 0 0 0 1 | D4800 | D6000
+ 1 0 0 1 0 | D5000 | D6000
+ 1 0 0 1 1 | D5800 | D6000
+ | |
+ 1 0 1 0 0 | D8000 | DA000
+ 1 0 1 0 1 | D8800 | DA000
+ 1 0 1 1 0 | D9000 | DA000
+ 1 0 1 1 1 | D9800 | DA000
+ | |
+ 1 1 0 0 0 | DC000 | DE000
+ 1 1 0 0 1 | DC800 | DE000
+ 1 1 0 1 0 | DD000 | DE000
+ 1 1 0 1 1 | DD800 | DE000
+ | |
+ 1 1 1 0 0 | E0000 | E2000
+ 1 1 1 0 1 | E0800 | E2000
+ 1 1 1 1 0 | E1000 | E2000
+ 1 1 1 1 1 | E1800 | E2000
+
+ *) To enable the 8K Boot PROM install the jumper ROM.
+ The default is jumper ROM not installed.
+
+
+Setting the Timeouts and Interrupt
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The jumpers labeled EXT1 and EXT2 are used to determine the timeout
+parameters. These two jumpers are normally left open.
+
+To select a hardware interrupt level set one (only one!) of the jumpers
+IRQ2, IRQ3, IRQ4, IRQ5, IRQ7. The Manufacturer's default is IRQ2.
+
+
+Configuring the PC130E for Star or Bus Topology
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The single jumper labeled STAR is used to configure the PC130E board for
+star or bus topology.
+When the jumper is installed, the board may be used in a star network, when
+it is removed, the board can be used in a bus topology.
+
+
+Diagnostic LEDs
+^^^^^^^^^^^^^^^
+
+Two diagnostic LEDs are visible on the rear bracket of the board.
+The green LED monitors the network activity: the red one shows the
+board activity::
+
+ Green | Status Red | Status
+ -------|------------------- ---------|-------------------
+ on | normal activity flash/on | data transfer
+ blink | reconfiguration off | no data transfer;
+ off | defective board or | incorrect memory or
+ | node ID is zero | I/O address
+
+
+PC500/PC550 Longboard (16-bit cards)
+------------------------------------
+
+ - from Juergen Seifert <seifert@htwm.de>
+
+
+ .. note::
+
+ There is another Version of the PC500 called Short Version, which
+ is different in hard- and software! The most important differences
+ are:
+
+ - The long board has no Shared memory.
+ - On the long board the selection of the interrupt is done by binary
+ coded switch, on the short board directly by jumper.
+
+[Avery's note: pay special attention to that: the long board HAS NO SHARED
+MEMORY. This means the current Linux-ARCnet driver can't use these cards.
+I have obtained a PC500Longboard and will be doing some experiments on it in
+the future, but don't hold your breath. Thanks again to Juergen Seifert for
+his advice about this!]
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the following Original SMC Manual
+
+ "Configuration Guide for SMC ARCNET-PC500/PC550
+ Series Network Controller Boards Pub. # 900.033 Rev. A
+ November, 1989"
+
+ARCNET is a registered trademark of the Datapoint Corporation
+SMC is a registered trademark of the Standard Microsystems Corporation
+
+The PC500 is equipped with a standard BNC female connector for connection
+to RG-62/U coax cable.
+The board is designed both for point-to-point connection in star networks
+and for connection to bus networks.
+
+The PC550 is equipped with two modular RJ11-type jacks for connection
+to twisted pair wiring.
+It can be used in a star or a daisy-chained (BUS) network.
+
+::
+
+ 1
+ 0 9 8 7 6 5 4 3 2 1 6 5 4 3 2 1
+ ____________________________________________________________________
+ < | SW1 | | SW2 | |
+ > |_____________________| |_____________| |
+ < IRQ |I/O Addr |
+ > ___|
+ < CR4 |___|
+ > CR3 |___|
+ < ___|
+ > N | | 8
+ < o | | 7
+ > d | S | 6
+ < e | W | 5
+ > A | 3 | 4
+ < d | | 3
+ > d | | 2
+ < r |___| 1
+ > |o| _____|
+ < |o| | J1 |
+ > 3 1 JP6 |_____|
+ < |o|o| JP2 | J2 |
+ > |o|o| |_____|
+ < 4 2__ ______________|
+ > | | |
+ <____| |_____________________________________________|
+
+Legend::
+
+ SW1 1-6: I/O Base Address Select
+ 7-10: Interrupt Select
+ SW2 1-6: Reserved for Future Use
+ SW3 1-8: Node ID Select
+ JP2 1-4: Extended Timeout Select
+ JP6 Selected - Star Topology (PC500 only)
+ Deselected - Bus Topology (PC500 only)
+ CR3 Green Monitors Network Activity
+ CR4 Red Monitors Board Activity
+ J1 BNC RG62/U Connector (PC500 only)
+ J1 6-position Telephone Jack (PC550 only)
+ J2 6-position Telephone Jack (PC550 only)
+
+Setting one of the switches to Off/Open means "1", On/Closed means "0".
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in group SW3 are used to set the node ID. Each node
+attached to the network must have an unique node ID which must be
+different from 0.
+Switch 1 serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Value
+ -------|-------
+ 1 | 1
+ 2 | 2
+ 3 | 4
+ 4 | 8
+ 5 | 16
+ 6 | 32
+ 7 | 64
+ 8 | 128
+
+Some Examples::
+
+ Switch | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The first six switches in switch group SW1 are used to select one
+of 32 possible I/O Base addresses using the following table::
+
+ Switch | Hex I/O
+ 6 5 4 3 2 1 | Address
+ -------------|--------
+ 0 1 0 0 0 0 | 200
+ 0 1 0 0 0 1 | 210
+ 0 1 0 0 1 0 | 220
+ 0 1 0 0 1 1 | 230
+ 0 1 0 1 0 0 | 240
+ 0 1 0 1 0 1 | 250
+ 0 1 0 1 1 0 | 260
+ 0 1 0 1 1 1 | 270
+ 0 1 1 0 0 0 | 280
+ 0 1 1 0 0 1 | 290
+ 0 1 1 0 1 0 | 2A0
+ 0 1 1 0 1 1 | 2B0
+ 0 1 1 1 0 0 | 2C0
+ 0 1 1 1 0 1 | 2D0
+ 0 1 1 1 1 0 | 2E0 (Manufacturer's default)
+ 0 1 1 1 1 1 | 2F0
+ 1 1 0 0 0 0 | 300
+ 1 1 0 0 0 1 | 310
+ 1 1 0 0 1 0 | 320
+ 1 1 0 0 1 1 | 330
+ 1 1 0 1 0 0 | 340
+ 1 1 0 1 0 1 | 350
+ 1 1 0 1 1 0 | 360
+ 1 1 0 1 1 1 | 370
+ 1 1 1 0 0 0 | 380
+ 1 1 1 0 0 1 | 390
+ 1 1 1 0 1 0 | 3A0
+ 1 1 1 0 1 1 | 3B0
+ 1 1 1 1 0 0 | 3C0
+ 1 1 1 1 0 1 | 3D0
+ 1 1 1 1 1 0 | 3E0
+ 1 1 1 1 1 1 | 3F0
+
+
+Setting the Interrupt
+^^^^^^^^^^^^^^^^^^^^^
+
+Switches seven through ten of switch group SW1 are used to select the
+interrupt level. The interrupt level is binary coded, so selections
+from 0 to 15 would be possible, but only the following eight values will
+be supported: 3, 4, 5, 7, 9, 10, 11, 12.
+
+::
+
+ Switch | IRQ
+ 10 9 8 7 |
+ ---------|--------
+ 0 0 1 1 | 3
+ 0 1 0 0 | 4
+ 0 1 0 1 | 5
+ 0 1 1 1 | 7
+ 1 0 0 1 | 9 (=2) (default)
+ 1 0 1 0 | 10
+ 1 0 1 1 | 11
+ 1 1 0 0 | 12
+
+
+Setting the Timeouts
+^^^^^^^^^^^^^^^^^^^^
+
+The two jumpers JP2 (1-4) are used to determine the timeout parameters.
+These two jumpers are normally left open.
+Refer to the COM9026 Data Sheet for alternate configurations.
+
+
+Configuring the PC500 for Star or Bus Topology
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The single jumper labeled JP6 is used to configure the PC500 board for
+star or bus topology.
+When the jumper is installed, the board may be used in a star network, when
+it is removed, the board can be used in a bus topology.
+
+
+Diagnostic LEDs
+^^^^^^^^^^^^^^^
+
+Two diagnostic LEDs are visible on the rear bracket of the board.
+The green LED monitors the network activity: the red one shows the
+board activity::
+
+ Green | Status Red | Status
+ -------|------------------- ---------|-------------------
+ on | normal activity flash/on | data transfer
+ blink | reconfiguration off | no data transfer;
+ off | defective board or | incorrect memory or
+ | node ID is zero | I/O address
+
+
+PC710 (8-bit card)
+------------------
+
+ - from J.S. van Oosten <jvoosten@compiler.tdcnet.nl>
+
+Note: this data is gathered by experimenting and looking at info of other
+cards. However, I'm sure I got 99% of the settings right.
+
+The SMC710 card resembles the PC270 card, but is much more basic (i.e. no
+LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing::
+
+ _______________________________________
+ | +---------+ +---------+ |____
+ | | S2 | | S1 | |
+ | +---------+ +---------+ |
+ | |
+ | +===+ __ |
+ | | R | | | X-tal ###___
+ | | O | |__| ####__'|
+ | | M | || ###
+ | +===+ |
+ | |
+ | .. JP1 +----------+ |
+ | .. | big chip | |
+ | .. | 90C63 | |
+ | .. | | |
+ | .. +----------+ |
+ ------- -----------
+ |||||||||||||||||||||
+
+The row of jumpers at JP1 actually consists of 8 jumpers, (sometimes
+labelled) the same as on the PC270, from top to bottom: EXT2, EXT1, ROM,
+IRQ7, IRQ5, IRQ4, IRQ3, IRQ2 (gee, wonder what they would do? :-) )
+
+S1 and S2 perform the same function as on the PC270, only their numbers
+are swapped (S1 is the nodeaddress, S2 sets IO- and RAM-address).
+
+I know it works when connected to a PC110 type ARCnet board.
+
+
+*****************************************************************************
+
+Possibly SMC
+============
+
+LCS-8830(-T) (8 and 16-bit cards)
+---------------------------------
+
+ - from Mathias Katzer <mkatzer@HRZ.Uni-Bielefeld.DE>
+ - Marek Michalkiewicz <marekm@i17linuxb.ists.pwr.wroc.pl> says the
+ LCS-8830 is slightly different from LCS-8830-T. These are 8 bit, BUS
+ only (the JP0 jumper is hardwired), and BNC only.
+
+This is a LCS-8830-T made by SMC, I think ('SMC' only appears on one PLCC,
+nowhere else, not even on the few Xeroxed sheets from the manual).
+
+SMC ARCnet Board Type LCS-8830-T::
+
+ ------------------------------------
+ | |
+ | JP3 88 8 JP2 |
+ | ##### | \ |
+ | ##### ET1 ET2 ###|
+ | 8 ###|
+ | U3 SW 1 JP0 ###| Phone Jacks
+ | -- ###|
+ | | | |
+ | | | SW2 |
+ | | | |
+ | | | ##### |
+ | -- ##### #### BNC Connector
+ | ####
+ | 888888 JP1 |
+ | 234567 |
+ -- -------
+ |||||||||||||||||||||||||||
+ --------------------------
+
+
+ SW1: DIP-Switches for Station Address
+ SW2: DIP-Switches for Memory Base and I/O Base addresses
+
+ JP0: If closed, internal termination on (default open)
+ JP1: IRQ Jumpers
+ JP2: Boot-ROM enabled if closed
+ JP3: Jumpers for response timeout
+
+ U3: Boot-ROM Socket
+
+
+ ET1 ET2 Response Time Idle Time Reconfiguration Time
+
+ 78 86 840
+ X 285 316 1680
+ X 563 624 1680
+ X X 1130 1237 1680
+
+ (X means closed jumper)
+
+ (DIP-Switch downwards means "0")
+
+The station address is binary-coded with SW1.
+
+The I/O base address is coded with DIP-Switches 6,7 and 8 of SW2:
+
+======== ========
+Switches Base
+678 Address
+======== ========
+000 260-26f
+100 290-29f
+010 2e0-2ef
+110 2f0-2ff
+001 300-30f
+101 350-35f
+011 380-38f
+111 3e0-3ef
+======== ========
+
+
+DIP Switches 1-5 of SW2 encode the RAM and ROM Address Range:
+
+======== ============= ================
+Switches RAM ROM
+12345 Address Range Address Range
+======== ============= ================
+00000 C:0000-C:07ff C:2000-C:3fff
+10000 C:0800-C:0fff
+01000 C:1000-C:17ff
+11000 C:1800-C:1fff
+00100 C:4000-C:47ff C:6000-C:7fff
+10100 C:4800-C:4fff
+01100 C:5000-C:57ff
+11100 C:5800-C:5fff
+00010 C:C000-C:C7ff C:E000-C:ffff
+10010 C:C800-C:Cfff
+01010 C:D000-C:D7ff
+11010 C:D800-C:Dfff
+00110 D:0000-D:07ff D:2000-D:3fff
+10110 D:0800-D:0fff
+01110 D:1000-D:17ff
+11110 D:1800-D:1fff
+00001 D:4000-D:47ff D:6000-D:7fff
+10001 D:4800-D:4fff
+01001 D:5000-D:57ff
+11001 D:5800-D:5fff
+00101 D:8000-D:87ff D:A000-D:bfff
+10101 D:8800-D:8fff
+01101 D:9000-D:97ff
+11101 D:9800-D:9fff
+00011 D:C000-D:c7ff D:E000-D:ffff
+10011 D:C800-D:cfff
+01011 D:D000-D:d7ff
+11011 D:D800-D:dfff
+00111 E:0000-E:07ff E:2000-E:3fff
+10111 E:0800-E:0fff
+01111 E:1000-E:17ff
+11111 E:1800-E:1fff
+======== ============= ================
+
+
+PureData Corp
+=============
+
+PDI507 (8-bit card)
+--------------------
+
+ - from Mark Rejhon <mdrejhon@magi.com> (slight modifications by Avery)
+ - Avery's note: I think PDI508 cards (but definitely NOT PDI508Plus cards)
+ are mostly the same as this. PDI508Plus cards appear to be mainly
+ software-configured.
+
+Jumpers:
+
+ There is a jumper array at the bottom of the card, near the edge
+ connector. This array is labelled J1. They control the IRQs and
+ something else. Put only one jumper on the IRQ pins.
+
+ ETS1, ETS2 are for timing on very long distance networks. See the
+ more general information near the top of this file.
+
+ There is a J2 jumper on two pins. A jumper should be put on them,
+ since it was already there when I got the card. I don't know what
+ this jumper is for though.
+
+ There is a two-jumper array for J3. I don't know what it is for,
+ but there were already two jumpers on it when I got the card. It's
+ a six pin grid in a two-by-three fashion. The jumpers were
+ configured as follows::
+
+ .-------.
+ o | o o |
+ :-------: ------> Accessible end of card with connectors
+ o | o o | in this direction ------->
+ `-------'
+
+Carl de Billy <CARL@carainfo.com> explains J3 and J4:
+
+ J3 Diagram::
+
+ .-------.
+ o | o o |
+ :-------: TWIST Technology
+ o | o o |
+ `-------'
+ .-------.
+ | o o | o
+ :-------: COAX Technology
+ | o o | o
+ `-------'
+
+ - If using coax cable in a bus topology the J4 jumper must be removed;
+ place it on one pin.
+
+ - If using bus topology with twisted pair wiring move the J3
+ jumpers so they connect the middle pin and the pins closest to the RJ11
+ Connectors. Also the J4 jumper must be removed; place it on one pin of
+ J4 jumper for storage.
+
+ - If using star topology with twisted pair wiring move the J3
+ jumpers so they connect the middle pin and the pins closest to the RJ11
+ connectors.
+
+
+DIP Switches:
+
+ The DIP switches accessible on the accessible end of the card while
+ it is installed, is used to set the ARCnet address. There are 8
+ switches. Use an address from 1 to 254
+
+ ========== =========================
+ Switch No. ARCnet address
+ 12345678
+ ========== =========================
+ 00000000 FF (Don't use this!)
+ 00000001 FE
+ 00000010 FD
+ ...
+ 11111101 2
+ 11111110 1
+ 11111111 0 (Don't use this!)
+ ========== =========================
+
+ There is another array of eight DIP switches at the top of the
+ card. There are five labelled MS0-MS4 which seem to control the
+ memory address, and another three labelled IO0-IO2 which seem to
+ control the base I/O address of the card.
+
+ This was difficult to test by trial and error, and the I/O addresses
+ are in a weird order. This was tested by setting the DIP switches,
+ rebooting the computer, and attempting to load ARCETHER at various
+ addresses (mostly between 0x200 and 0x400). The address that caused
+ the red transmit LED to blink, is the one that I thought works.
+
+ Also, the address 0x3D0 seem to have a special meaning, since the
+ ARCETHER packet driver loaded fine, but without the red LED
+ blinking. I don't know what 0x3D0 is for though. I recommend using
+ an address of 0x300 since Windows may not like addresses below
+ 0x300.
+
+ ============= ===========
+ IO Switch No. I/O address
+ 210
+ ============= ===========
+ 111 0x260
+ 110 0x290
+ 101 0x2E0
+ 100 0x2F0
+ 011 0x300
+ 010 0x350
+ 001 0x380
+ 000 0x3E0
+ ============= ===========
+
+ The memory switches set a reserved address space of 0x1000 bytes
+ (0x100 segment units, or 4k). For example if I set an address of
+ 0xD000, it will use up addresses 0xD000 to 0xD100.
+
+ The memory switches were tested by booting using QEMM386 stealth,
+ and using LOADHI to see what address automatically became excluded
+ from the upper memory regions, and then attempting to load ARCETHER
+ using these addresses.
+
+ I recommend using an ARCnet memory address of 0xD000, and putting
+ the EMS page frame at 0xC000 while using QEMM stealth mode. That
+ way, you get contiguous high memory from 0xD100 almost all the way
+ the end of the megabyte.
+
+ Memory Switch 0 (MS0) didn't seem to work properly when set to OFF
+ on my card. It could be malfunctioning on my card. Experiment with
+ it ON first, and if it doesn't work, set it to OFF. (It may be a
+ modifier for the 0x200 bit?)
+
+ ============= ============================================
+ MS Switch No.
+ 43210 Memory address
+ ============= ============================================
+ 00001 0xE100 (guessed - was not detected by QEMM)
+ 00011 0xE000 (guessed - was not detected by QEMM)
+ 00101 0xDD00
+ 00111 0xDC00
+ 01001 0xD900
+ 01011 0xD800
+ 01101 0xD500
+ 01111 0xD400
+ 10001 0xD100
+ 10011 0xD000
+ 10101 0xCD00
+ 10111 0xCC00
+ 11001 0xC900 (guessed - crashes tested system)
+ 11011 0xC800 (guessed - crashes tested system)
+ 11101 0xC500 (guessed - crashes tested system)
+ 11111 0xC400 (guessed - crashes tested system)
+ ============= ============================================
+
+CNet Technology Inc. (8-bit cards)
+==================================
+
+120 Series (8-bit cards)
+------------------------
+ - from Juergen Seifert <seifert@htwm.de>
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the following Original CNet Manual
+
+ "ARCNET USER'S MANUAL for
+ CN120A
+ CN120AB
+ CN120TP
+ CN120ST
+ CN120SBT
+ P/N:12-01-0007
+ Revision 3.00"
+
+ARCNET is a registered trademark of the Datapoint Corporation
+
+- P/N 120A ARCNET 8 bit XT/AT Star
+- P/N 120AB ARCNET 8 bit XT/AT Bus
+- P/N 120TP ARCNET 8 bit XT/AT Twisted Pair
+- P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair
+- P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair
+
+::
+
+ __________________________________________________________________
+ | |
+ | ___|
+ | LED |___|
+ | ___|
+ | N | | ID7
+ | o | | ID6
+ | d | S | ID5
+ | e | W | ID4
+ | ___________________ A | 2 | ID3
+ | | | d | | ID2
+ | | | 1 2 3 4 5 6 7 8 d | | ID1
+ | | | _________________ r |___| ID0
+ | | 90C65 || SW1 | ____|
+ | JP 8 7 | ||_________________| | |
+ | |o|o| JP1 | | | J2 |
+ | |o|o| |oo| | | JP 1 1 1 | |
+ | ______________ | | 0 1 2 |____|
+ | | PROM | |___________________| |o|o|o| _____|
+ | > SOCKET | JP 6 5 4 3 2 |o|o|o| | J1 |
+ | |______________| |o|o|o|o|o| |o|o|o| |_____|
+ |_____ |o|o|o|o|o| ______________|
+ | |
+ |_____________________________________________|
+
+Legend::
+
+ 90C65 ARCNET Probe
+ S1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+ S2 1-8: Node ID Select (ID0-ID7)
+ JP1 ROM Enable Select
+ JP2 IRQ2
+ JP3 IRQ3
+ JP4 IRQ4
+ JP5 IRQ5
+ JP6 IRQ7
+ JP7/JP8 ET1, ET2 Timeout Parameters
+ JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only)
+ JP12 Terminator Select (CN120AB/ST/SBT only)
+ J1 BNC RG62/U Connector (all except CN120TP)
+ J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only)
+
+Setting one of the switches to Off means "1", On means "0".
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in SW2 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must be different from 0.
+Switch 1 (ID0) serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ ======= ====== =====
+ Switch Label Value
+ ======= ====== =====
+ 1 ID0 1
+ 2 ID1 2
+ 3 ID2 4
+ 4 ID3 8
+ 5 ID4 16
+ 6 ID5 32
+ 7 ID6 64
+ 8 ID7 128
+ ======= ====== =====
+
+Some Examples::
+
+ Switch | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The last three switches in switch block SW1 are used to select one
+of eight possible I/O Base addresses using the following table::
+
+
+ Switch | Hex I/O
+ 6 7 8 | Address
+ ------------|--------
+ ON ON ON | 260
+ OFF ON ON | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ OFF OFF ON | 2F0
+ ON ON OFF | 300
+ OFF ON OFF | 350
+ ON OFF OFF | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of eight positions. The address of the Boot Prom is
+memory base + 8K or memory base + 0x2000.
+Switches 1-5 of switch block SW1 select the Memory Base address.
+
+::
+
+ Switch | Hex RAM | Hex ROM
+ 1 2 3 4 5 | Address | Address *)
+ --------------------|---------|-----------
+ ON ON ON ON ON | C0000 | C2000
+ ON ON OFF ON ON | C4000 | C6000
+ ON ON ON OFF ON | CC000 | CE000
+ ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default)
+ ON ON ON ON OFF | D4000 | D6000
+ ON ON OFF ON OFF | D8000 | DA000
+ ON ON ON OFF OFF | DC000 | DE000
+ ON ON OFF OFF OFF | E0000 | E2000
+
+ *) To enable the Boot ROM install the jumper JP1
+
+.. note::
+
+ Since the switches 1 and 2 are always set to ON it may be possible
+ that they can be used to add an offset of 2K, 4K or 6K to the base
+ address, but this feature is not documented in the manual and I
+ haven't tested it yet.
+
+
+Setting the Interrupt Line
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To select a hardware interrupt level install one (only one!) of the jumpers
+JP2, JP3, JP4, JP5, JP6. JP2 is the default::
+
+ Jumper | IRQ
+ -------|-----
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+ 6 | 7
+
+
+Setting the Internal Terminator on CN120AB/TP/SBT
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The jumper JP12 is used to enable the internal terminator::
+
+ -----
+ 0 | 0 |
+ ----- ON | | ON
+ | 0 | | 0 |
+ | | OFF ----- OFF
+ | 0 | 0
+ -----
+ Terminator Terminator
+ disabled enabled
+
+
+Selecting the Connector Type on CN120ST/SBT
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ JP10 JP11 JP10 JP11
+ ----- -----
+ 0 0 | 0 | | 0 |
+ ----- ----- | | | |
+ | 0 | | 0 | | 0 | | 0 |
+ | | | | ----- -----
+ | 0 | | 0 | 0 0
+ ----- -----
+ Coaxial Cable Twisted Pair Cable
+ (Default)
+
+
+Setting the Timeout Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The jumpers labeled EXT1 and EXT2 are used to determine the timeout
+parameters. These two jumpers are normally left open.
+
+
+CNet Technology Inc. (16-bit cards)
+===================================
+
+160 Series (16-bit cards)
+-------------------------
+ - from Juergen Seifert <seifert@htwm.de>
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the following Original CNet Manual
+
+ "ARCNET USER'S MANUAL for
+ CN160A CN160AB CN160TP
+ P/N:12-01-0006 Revision 3.00"
+
+ARCNET is a registered trademark of the Datapoint Corporation
+
+- P/N 160A ARCNET 16 bit XT/AT Star
+- P/N 160AB ARCNET 16 bit XT/AT Bus
+- P/N 160TP ARCNET 16 bit XT/AT Twisted Pair
+
+::
+
+ ___________________________________________________________________
+ < _________________________ ___|
+ > |oo| JP2 | | LED |___|
+ < |oo| JP1 | 9026 | LED |___|
+ > |_________________________| ___|
+ < N | | ID7
+ > 1 o | | ID6
+ < 1 2 3 4 5 6 7 8 9 0 d | S | ID5
+ > _______________ _____________________ e | W | ID4
+ < | PROM | | SW1 | A | 2 | ID3
+ > > SOCKET | |_____________________| d | | ID2
+ < |_______________| | IO-Base | MEM | d | | ID1
+ > r |___| ID0
+ < ____|
+ > | |
+ < | J1 |
+ > | |
+ < |____|
+ > 1 1 1 1 |
+ < 3 4 5 6 7 JP 8 9 0 1 2 3 |
+ > |o|o|o|o|o| |o|o|o|o|o|o| |
+ < |o|o|o|o|o| __ |o|o|o|o|o|o| ___________|
+ > | | |
+ <____________| |_______________________________________|
+
+Legend::
+
+ 9026 ARCNET Probe
+ SW1 1-6: Base I/O Address Select
+ 7-10: Base Memory Address Select
+ SW2 1-8: Node ID Select (ID0-ID7)
+ JP1/JP2 ET1, ET2 Timeout Parameters
+ JP3-JP13 Interrupt Select
+ J1 BNC RG62/U Connector (CN160A/AB only)
+ J1 Two 6-position Telephone Jack (CN160TP only)
+ LED
+
+Setting one of the switches to Off means "1", On means "0".
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in SW2 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must be different from 0.
+Switch 1 (ID0) serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Label | Value
+ -------|-------|-------
+ 1 | ID0 | 1
+ 2 | ID1 | 2
+ 3 | ID2 | 4
+ 4 | ID3 | 8
+ 5 | ID4 | 16
+ 6 | ID5 | 32
+ 7 | ID6 | 64
+ 8 | ID7 | 128
+
+Some Examples::
+
+ Switch | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The first six switches in switch block SW1 are used to select the I/O Base
+address using the following table::
+
+ Switch | Hex I/O
+ 1 2 3 4 5 6 | Address
+ ------------------------|--------
+ OFF ON ON OFF OFF ON | 260
+ OFF ON OFF ON ON OFF | 290
+ OFF ON OFF OFF OFF ON | 2E0 (Manufacturer's default)
+ OFF ON OFF OFF OFF OFF | 2F0
+ OFF OFF ON ON ON ON | 300
+ OFF OFF ON OFF ON OFF | 350
+ OFF OFF OFF ON ON ON | 380
+ OFF OFF OFF OFF OFF ON | 3E0
+
+Note: Other IO-Base addresses seem to be selectable, but only the above
+ combinations are documented.
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The switches 7-10 of switch block SW1 are used to select the Memory
+Base address of the RAM (2K) and the PROM::
+
+ Switch | Hex RAM | Hex ROM
+ 7 8 9 10 | Address | Address
+ ----------------|---------|-----------
+ OFF OFF ON ON | C0000 | C8000
+ OFF OFF ON OFF | D0000 | D8000 (Default)
+ OFF OFF OFF ON | E0000 | E8000
+
+.. note::
+
+ Other MEM-Base addresses seem to be selectable, but only the above
+ combinations are documented.
+
+
+Setting the Interrupt Line
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To select a hardware interrupt level install one (only one!) of the jumpers
+JP3 through JP13 using the following table::
+
+ Jumper | IRQ
+ -------|-----------------
+ 3 | 14
+ 4 | 15
+ 5 | 12
+ 6 | 11
+ 7 | 10
+ 8 | 3
+ 9 | 4
+ 10 | 5
+ 11 | 6
+ 12 | 7
+ 13 | 2 (=9) Default!
+
+.. note::
+
+ - Do not use JP11=IRQ6, it may conflict with your Floppy Disk
+ Controller
+ - Use JP3=IRQ14 only, if you don't have an IDE-, MFM-, or RLL-
+ Hard Disk, it may conflict with their controllers
+
+
+Setting the Timeout Parameters
+------------------------------
+
+The jumpers labeled JP1 and JP2 are used to determine the timeout
+parameters. These two jumpers are normally left open.
+
+
+Lantech
+=======
+
+8-bit card, unknown model
+-------------------------
+ - from Vlad Lungu <vlungu@ugal.ro> - his e-mail address seemed broken at
+ the time I tried to reach him. Sorry Vlad, if you didn't get my reply.
+
+::
+
+ ________________________________________________________________
+ | 1 8 |
+ | ___________ __|
+ | | SW1 | LED |__|
+ | |__________| |
+ | ___|
+ | _____________________ |S | 8
+ | | | |W |
+ | | | |2 |
+ | | | |__| 1
+ | | UM9065L | |o| JP4 ____|____
+ | | | |o| | CN |
+ | | | |________|
+ | | | |
+ | |___________________| |
+ | |
+ | |
+ | _____________ |
+ | | | |
+ | | PROM | |ooooo| JP6 |
+ | |____________| |ooooo| |
+ |_____________ _ _|
+ |____________________________________________| |__|
+
+
+UM9065L : ARCnet Controller
+
+SW 1 : Shared Memory Address and I/O Base
+
+::
+
+ ON=0
+
+ 12345|Memory Address
+ -----|--------------
+ 00001| D4000
+ 00010| CC000
+ 00110| D0000
+ 01110| D1000
+ 01101| D9000
+ 10010| CC800
+ 10011| DC800
+ 11110| D1800
+
+It seems that the bits are considered in reverse order. Also, you must
+observe that some of those addresses are unusual and I didn't probe them; I
+used a memory dump in DOS to identify them. For the 00000 configuration and
+some others that I didn't write here the card seems to conflict with the
+video card (an S3 GENDAC). I leave the full decoding of those addresses to
+you.
+
+::
+
+ 678| I/O Address
+ ---|------------
+ 000| 260
+ 001| failed probe
+ 010| 2E0
+ 011| 380
+ 100| 290
+ 101| 350
+ 110| failed probe
+ 111| 3E0
+
+ SW 2 : Node ID (binary coded)
+
+ JP 4 : Boot PROM enable CLOSE - enabled
+ OPEN - disabled
+
+ JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6)
+
+
+Acer
+====
+
+8-bit card, Model 5210-003
+--------------------------
+
+ - from Vojtech Pavlik <vojtech@suse.cz> using portions of the existing
+ arcnet-hardware file.
+
+This is a 90C26 based card. Its configuration seems similar to the SMC
+PC100, but has some additional jumpers I don't know the meaning of.
+
+::
+
+ __
+ | |
+ ___________|__|_________________________
+ | | | |
+ | | BNC | |
+ | |______| ___|
+ | _____________________ |___
+ | | | |
+ | | Hybrid IC | |
+ | | | o|o J1 |
+ | |_____________________| 8|8 |
+ | 8|8 J5 |
+ | o|o |
+ | 8|8 |
+ |__ 8|8 |
+ (|__| LED o|o |
+ | 8|8 |
+ | 8|8 J15 |
+ | |
+ | _____ |
+ | | | _____ |
+ | | | | | ___|
+ | | | | | |
+ | _____ | ROM | | UFS | |
+ | | | | | | | |
+ | | | ___ | | | | |
+ | | | | | |__.__| |__.__| |
+ | | NCR | |XTL| _____ _____ |
+ | | | |___| | | | | |
+ | |90C26| | | | | |
+ | | | | RAM | | UFS | |
+ | | | J17 o|o | | | | |
+ | | | J16 o|o | | | | |
+ | |__.__| |__.__| |__.__| |
+ | ___ |
+ | | |8 |
+ | |SW2| |
+ | | | |
+ | |___|1 |
+ | ___ |
+ | | |10 J18 o|o |
+ | | | o|o |
+ | |SW1| o|o |
+ | | | J21 o|o |
+ | |___|1 |
+ | |
+ |____________________________________|
+
+
+Legend::
+
+ 90C26 ARCNET Chip
+ XTL 20 MHz Crystal
+ SW1 1-6 Base I/O Address Select
+ 7-10 Memory Address Select
+ SW2 1-8 Node ID Select (ID0-ID7)
+ J1-J5 IRQ Select
+ J6-J21 Unknown (Probably extra timeouts & ROM enable ...)
+ LED1 Activity LED
+ BNC Coax connector (STAR ARCnet)
+ RAM 2k of SRAM
+ ROM Boot ROM socket
+ UFS Unidentified Flying Sockets
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in SW2 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must not be 0.
+Switch 1 (ID0) serves as the least significant bit (LSB).
+
+Setting one of the switches to OFF means "1", ON means "0".
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Value
+ -------|-------
+ 1 | 1
+ 2 | 2
+ 3 | 4
+ 4 | 8
+ 5 | 16
+ 6 | 32
+ 7 | 64
+ 8 | 128
+
+Don't set this to 0 or 255; these values are reserved.
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The switches 1 to 6 of switch block SW1 are used to select one
+of 32 possible I/O Base addresses using the following tables::
+
+ | Hex
+ Switch | Value
+ -------|-------
+ 1 | 200
+ 2 | 100
+ 3 | 80
+ 4 | 40
+ 5 | 20
+ 6 | 10
+
+The I/O address is sum of all switches set to "1". Remember that
+the I/O address space below 0x200 is RESERVED for mainboard, so
+switch 1 should be ALWAYS SET TO OFF.
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of sixteen positions. However, the addresses below
+A0000 are likely to cause system hang because there's main RAM.
+
+Jumpers 7-10 of switch block SW1 select the Memory Base address::
+
+ Switch | Hex RAM
+ 7 8 9 10 | Address
+ ----------------|---------
+ OFF OFF OFF OFF | F0000 (conflicts with main BIOS)
+ OFF OFF OFF ON | E0000
+ OFF OFF ON OFF | D0000
+ OFF OFF ON ON | C0000 (conflicts with video BIOS)
+ OFF ON OFF OFF | B0000 (conflicts with mono video)
+ OFF ON OFF ON | A0000 (conflicts with graphics)
+
+
+Setting the Interrupt Line
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means
+shorted, OFF means open::
+
+ Jumper | IRQ
+ 1 2 3 4 5 |
+ ----------------------------
+ ON OFF OFF OFF OFF | 7
+ OFF ON OFF OFF OFF | 5
+ OFF OFF ON OFF OFF | 4
+ OFF OFF OFF ON OFF | 3
+ OFF OFF OFF OFF ON | 2
+
+
+Unknown jumpers & sockets
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+I know nothing about these. I just guess that J16&J17 are timeout
+jumpers and maybe one of J18-J21 selects ROM. Also J6-J10 and
+J11-J15 are connecting IRQ2-7 to some pins on the UFSs. I can't
+guess the purpose.
+
+Datapoint?
+==========
+
+LAN-ARC-8, an 8-bit card
+------------------------
+
+ - from Vojtech Pavlik <vojtech@suse.cz>
+
+This is another SMC 90C65-based ARCnet card. I couldn't identify the
+manufacturer, but it might be DataPoint, because the card has the
+original arcNet logo in its upper right corner.
+
+::
+
+ _______________________________________________________
+ | _________ |
+ | | SW2 | ON arcNet |
+ | |_________| OFF ___|
+ | _____________ 1 ______ 8 | | 8
+ | | | SW1 | XTAL | ____________ | S |
+ | > RAM (2k) | |______|| | | W |
+ | |_____________| | H | | 3 |
+ | _________|_____ y | |___| 1
+ | _________ | | |b | |
+ | |_________| | | |r | |
+ | | SMC | |i | |
+ | | 90C65| |d | |
+ | _________ | | | | |
+ | | SW1 | ON | | |I | |
+ | |_________| OFF |_________|_____/C | _____|
+ | 1 8 | | | |___
+ | ______________ | | | BNC |___|
+ | | | |____________| |_____|
+ | > EPROM SOCKET | _____________ |
+ | |______________| |_____________| |
+ | ______________|
+ | |
+ |________________________________________|
+
+Legend::
+
+ 90C65 ARCNET Chip
+ SW1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+ SW2 1-8: Node ID Select
+ SW3 1-5: IRQ Select
+ 6-7: Extra Timeout
+ 8 : ROM Enable
+ BNC Coax connector
+ XTAL 20 MHz Crystal
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in SW3 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must not be 0.
+Switch 1 serves as the least significant bit (LSB).
+
+Setting one of the switches to Off means "1", On means "0".
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Value
+ -------|-------
+ 1 | 1
+ 2 | 2
+ 3 | 4
+ 4 | 8
+ 5 | 16
+ 6 | 32
+ 7 | 64
+ 8 | 128
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The last three switches in switch block SW1 are used to select one
+of eight possible I/O Base addresses using the following table::
+
+
+ Switch | Hex I/O
+ 6 7 8 | Address
+ ------------|--------
+ ON ON ON | 260
+ OFF ON ON | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ OFF OFF ON | 2F0
+ ON ON OFF | 300
+ OFF ON OFF | 350
+ ON OFF OFF | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of eight positions. The address of the Boot Prom is
+memory base + 0x2000.
+
+Jumpers 3-5 of switch block SW1 select the Memory Base address.
+
+::
+
+ Switch | Hex RAM | Hex ROM
+ 1 2 3 4 5 | Address | Address *)
+ --------------------|---------|-----------
+ ON ON ON ON ON | C0000 | C2000
+ ON ON OFF ON ON | C4000 | C6000
+ ON ON ON OFF ON | CC000 | CE000
+ ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default)
+ ON ON ON ON OFF | D4000 | D6000
+ ON ON OFF ON OFF | D8000 | DA000
+ ON ON ON OFF OFF | DC000 | DE000
+ ON ON OFF OFF OFF | E0000 | E2000
+
+ *) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON.
+
+The switches 1 and 2 probably add 0x0800 and 0x1000 to RAM base address.
+
+
+Setting the Interrupt Line
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Switches 1-5 of the switch block SW3 control the IRQ level::
+
+ Jumper | IRQ
+ 1 2 3 4 5 |
+ ----------------------------
+ ON OFF OFF OFF OFF | 3
+ OFF ON OFF OFF OFF | 4
+ OFF OFF ON OFF OFF | 5
+ OFF OFF OFF ON OFF | 7
+ OFF OFF OFF OFF ON | 2
+
+
+Setting the Timeout Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The switches 6-7 of the switch block SW3 are used to determine the timeout
+parameters. These two switches are normally left in the OFF position.
+
+
+Topware
+=======
+
+8-bit card, TA-ARC/10
+---------------------
+
+ - from Vojtech Pavlik <vojtech@suse.cz>
+
+This is another very similar 90C65 card. Most of the switches and jumpers
+are the same as on other clones.
+
+::
+
+ _____________________________________________________________________
+ | ___________ | | ______ |
+ | |SW2 NODE ID| | | | XTAL | |
+ | |___________| | Hybrid IC | |______| |
+ | ___________ | | __|
+ | |SW1 MEM+I/O| |_________________________| LED1|__|)
+ | |___________| 1 2 |
+ | J3 |o|o| TIMEOUT ______|
+ | ______________ |o|o| | |
+ | | | ___________________ | RJ |
+ | > EPROM SOCKET | | \ |------|
+ |J2 |______________| | | | |
+ ||o| | | |______|
+ ||o| ROM ENABLE | SMC | _________ |
+ | _____________ | 90C65 | |_________| _____|
+ | | | | | | |___
+ | > RAM (2k) | | | | BNC |___|
+ | |_____________| | | |_____|
+ | |____________________| |
+ | ________ IRQ 2 3 4 5 7 ___________ |
+ ||________| |o|o|o|o|o| |___________| |
+ |________ J1|o|o|o|o|o| ______________|
+ | |
+ |_____________________________________________|
+
+Legend::
+
+ 90C65 ARCNET Chip
+ XTAL 20 MHz Crystal
+ SW1 1-5 Base Memory Address Select
+ 6-8 Base I/O Address Select
+ SW2 1-8 Node ID Select (ID0-ID7)
+ J1 IRQ Select
+ J2 ROM Enable
+ J3 Extra Timeout
+ LED1 Activity LED
+ BNC Coax connector (BUS ARCnet)
+ RJ Twisted Pair Connector (daisy chain)
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in SW2 are used to set the node ID. Each node attached to
+the network must have an unique node ID which must not be 0. Switch 1 (ID0)
+serves as the least significant bit (LSB).
+
+Setting one of the switches to Off means "1", On means "0".
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Label | Value
+ -------|-------|-------
+ 1 | ID0 | 1
+ 2 | ID1 | 2
+ 3 | ID2 | 4
+ 4 | ID3 | 8
+ 5 | ID4 | 16
+ 6 | ID5 | 32
+ 7 | ID6 | 64
+ 8 | ID7 | 128
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The last three switches in switch block SW1 are used to select one
+of eight possible I/O Base addresses using the following table::
+
+
+ Switch | Hex I/O
+ 6 7 8 | Address
+ ------------|--------
+ ON ON ON | 260 (Manufacturer's default)
+ OFF ON ON | 290
+ ON OFF ON | 2E0
+ OFF OFF ON | 2F0
+ ON ON OFF | 300
+ OFF ON OFF | 350
+ ON OFF OFF | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of eight positions. The address of the Boot Prom is
+memory base + 0x2000.
+
+Jumpers 3-5 of switch block SW1 select the Memory Base address.
+
+::
+
+ Switch | Hex RAM | Hex ROM
+ 1 2 3 4 5 | Address | Address *)
+ --------------------|---------|-----------
+ ON ON ON ON ON | C0000 | C2000
+ ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default)
+ ON ON ON OFF ON | CC000 | CE000
+ ON ON OFF OFF ON | D0000 | D2000
+ ON ON ON ON OFF | D4000 | D6000
+ ON ON OFF ON OFF | D8000 | DA000
+ ON ON ON OFF OFF | DC000 | DE000
+ ON ON OFF OFF OFF | E0000 | E2000
+
+ *) To enable the Boot ROM short the jumper J2.
+
+The jumpers 1 and 2 probably add 0x0800 and 0x1000 to RAM address.
+
+
+Setting the Interrupt Line
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means
+shorted, OFF means open::
+
+ Jumper | IRQ
+ 1 2 3 4 5 |
+ ----------------------------
+ ON OFF OFF OFF OFF | 2
+ OFF ON OFF OFF OFF | 3
+ OFF OFF ON OFF OFF | 4
+ OFF OFF OFF ON OFF | 5
+ OFF OFF OFF OFF ON | 7
+
+
+Setting the Timeout Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The jumpers J3 are used to set the timeout parameters. These two
+jumpers are normally left open.
+
+Thomas-Conrad
+=============
+
+Model #500-6242-0097 REV A (8-bit card)
+---------------------------------------
+
+ - from Lars Karlsson <100617.3473@compuserve.com>
+
+::
+
+ ________________________________________________________
+ | ________ ________ |_____
+ | |........| |........| |
+ | |________| |________| ___|
+ | SW 3 SW 1 | |
+ | Base I/O Base Addr. Station | |
+ | address | |
+ | ______ switch | |
+ | | | | |
+ | | | |___|
+ | | | ______ |___._
+ | |______| |______| ____| BNC
+ | Jumper- _____| Connector
+ | Main chip block _ __| '
+ | | | | RJ Connector
+ | |_| | with 110 Ohm
+ | |__ Terminator
+ | ___________ __|
+ | |...........| | RJ-jack
+ | |...........| _____ | (unused)
+ | |___________| |_____| |__
+ | Boot PROM socket IRQ-jumpers |_ Diagnostic
+ |________ __ _| LED (red)
+ | | | | | | | | | | | | | | | | | | | | | |
+ | | | | | | | | | | | | | | | | | | | | |________|
+ |
+ |
+
+And here are the settings for some of the switches and jumpers on the cards.
+
+::
+
+ I/O
+
+ 1 2 3 4 5 6 7 8
+
+ 2E0----- 0 0 0 1 0 0 0 1
+ 2F0----- 0 0 0 1 0 0 0 0
+ 300----- 0 0 0 0 1 1 1 1
+ 350----- 0 0 0 0 1 1 1 0
+
+"0" in the above example means switch is off "1" means that it is on.
+
+::
+
+ ShMem address.
+
+ 1 2 3 4 5 6 7 8
+
+ CX00--0 0 1 1 | | |
+ DX00--0 0 1 0 |
+ X000--------- 1 1 |
+ X400--------- 1 0 |
+ X800--------- 0 1 |
+ XC00--------- 0 0
+ ENHANCED----------- 1
+ COMPATIBLE--------- 0
+
+::
+
+ IRQ
+
+
+ 3 4 5 7 2
+ . . . . .
+ . . . . .
+
+
+There is a DIP-switch with 8 switches, used to set the shared memory address
+to be used. The first 6 switches set the address, the 7th doesn't have any
+function, and the 8th switch is used to select "compatible" or "enhanced".
+When I got my two cards, one of them had this switch set to "enhanced". That
+card didn't work at all, it wasn't even recognized by the driver. The other
+card had this switch set to "compatible" and it behaved absolutely normally. I
+guess that the switch on one of the cards, must have been changed accidentally
+when the card was taken out of its former host. The question remains
+unanswered, what is the purpose of the "enhanced" position?
+
+[Avery's note: "enhanced" probably either disables shared memory (use IO
+ports instead) or disables IO ports (use memory addresses instead). This
+varies by the type of card involved. I fail to see how either of these
+enhance anything. Send me more detailed information about this mode, or
+just use "compatible" mode instead.]
+
+Waterloo Microsystems Inc. ??
+=============================
+
+8-bit card (C) 1985
+-------------------
+ - from Robert Michael Best <rmb117@cs.usask.ca>
+
+[Avery's note: these don't work with my driver for some reason. These cards
+SEEM to have settings similar to the PDI508Plus, which is
+software-configured and doesn't work with my driver either. The "Waterloo
+chip" is a boot PROM, probably designed specifically for the University of
+Waterloo. If you have any further information about this card, please
+e-mail me.]
+
+The probe has not been able to detect the card on any of the J2 settings,
+and I tried them again with the "Waterloo" chip removed.
+
+::
+
+ _____________________________________________________________________
+ | \/ \/ ___ __ __ |
+ | C4 C4 |^| | M || ^ ||^| |
+ | -- -- |_| | 5 || || | C3 |
+ | \/ \/ C10 |___|| ||_| |
+ | C4 C4 _ _ | | ?? |
+ | -- -- | \/ || | |
+ | | || | |
+ | | || C1 | |
+ | | || | \/ _____|
+ | | C6 || | C9 | |___
+ | | || | -- | BNC |___|
+ | | || | >C7| |_____|
+ | | || | |
+ | __ __ |____||_____| 1 2 3 6 |
+ || ^ | >C4| |o|o|o|o|o|o| J2 >C4| |
+ || | |o|o|o|o|o|o| |
+ || C2 | >C4| >C4| |
+ || | >C8| |
+ || | 2 3 4 5 6 7 IRQ >C4| |
+ ||_____| |o|o|o|o|o|o| J3 |
+ |_______ |o|o|o|o|o|o| _______________|
+ | |
+ |_____________________________________________|
+
+ C1 -- "COM9026
+ SMC 8638"
+ In a chip socket.
+
+ C2 -- "@Copyright
+ Waterloo Microsystems Inc.
+ 1985"
+ In a chip Socket with info printed on a label covering a round window
+ showing the circuit inside. (The window indicates it is an EPROM chip.)
+
+ C3 -- "COM9032
+ SMC 8643"
+ In a chip socket.
+
+ C4 -- "74LS"
+ 9 total no sockets.
+
+ M5 -- "50006-136
+ 20.000000 MHZ
+ MTQ-T1-S3
+ 0 M-TRON 86-40"
+ Metallic case with 4 pins, no socket.
+
+ C6 -- "MOSTEK@TC8643
+ MK6116N-20
+ MALAYSIA"
+ No socket.
+
+ C7 -- No stamp or label but in a 20 pin chip socket.
+
+ C8 -- "PAL10L8CN
+ 8623"
+ In a 20 pin socket.
+
+ C9 -- "PAl16R4A-2CN
+ 8641"
+ In a 20 pin socket.
+
+ C10 -- "M8640
+ NMC
+ 9306N"
+ In an 8 pin socket.
+
+ ?? -- Some components on a smaller board and attached with 20 pins all
+ along the side closest to the BNC connector. The are coated in a dark
+ resin.
+
+On the board there are two jumper banks labeled J2 and J3. The
+manufacturer didn't put a J1 on the board. The two boards I have both
+came with a jumper box for each bank.
+
+::
+
+ J2 -- Numbered 1 2 3 4 5 6.
+ 4 and 5 are not stamped due to solder points.
+
+ J3 -- IRQ 2 3 4 5 6 7
+
+The board itself has a maple leaf stamped just above the irq jumpers
+and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986
+CORMAN CUSTOM ELECTRONICS CORP." stamped just below the BNC connector.
+Below that "MADE IN CANADA"
+
+No Name
+=======
+
+8-bit cards, 16-bit cards
+-------------------------
+
+ - from Juergen Seifert <seifert@htwm.de>
+
+I have named this ARCnet card "NONAME", since there is no name of any
+manufacturer on the Installation manual nor on the shipping box. The only
+hint to the existence of a manufacturer at all is written in copper,
+it is "Made in Taiwan"
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the Original
+
+ "ARCnet Installation Manual"
+
+::
+
+ ________________________________________________________________
+ | |STAR| BUS| T/P| |
+ | |____|____|____| |
+ | _____________________ |
+ | | | |
+ | | | |
+ | | | |
+ | | SMC | |
+ | | | |
+ | | COM90C65 | |
+ | | | |
+ | | | |
+ | |__________-__________| |
+ | _____|
+ | _______________ | CN |
+ | | PROM | |_____|
+ | > SOCKET | |
+ | |_______________| 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 |
+ | _______________ _______________ |
+ | |o|o|o|o|o|o|o|o| | SW1 || SW2 ||
+ | |o|o|o|o|o|o|o|o| |_______________||_______________||
+ |___ 2 3 4 5 7 E E R Node ID IOB__|__MEM____|
+ | \ IRQ / T T O |
+ |__________________1_2_M______________________|
+
+Legend::
+
+ COM90C65: ARCnet Probe
+ S1 1-8: Node ID Select
+ S2 1-3: I/O Base Address Select
+ 4-6: Memory Base Address Select
+ 7-8: RAM Offset Select
+ ET1, ET2 Extended Timeout Select
+ ROM ROM Enable Select
+ CN RG62 Coax Connector
+ STAR| BUS | T/P Three fields for placing a sign (colored circle)
+ indicating the topology of the card
+
+Setting one of the switches to Off means "1", On means "0".
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in group SW1 are used to set the node ID.
+Each node attached to the network must have an unique node ID which
+must be different from 0.
+Switch 8 serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Value
+ -------|-------
+ 8 | 1
+ 7 | 2
+ 6 | 4
+ 5 | 8
+ 4 | 16
+ 3 | 32
+ 2 | 64
+ 1 | 128
+
+Some Examples::
+
+ Switch | Hex | Decimal
+ 1 2 3 4 5 6 7 8 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The first three switches in switch group SW2 are used to select one
+of eight possible I/O Base addresses using the following table::
+
+ Switch | Hex I/O
+ 1 2 3 | Address
+ ------------|--------
+ ON ON ON | 260
+ ON ON OFF | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ ON OFF OFF | 2F0
+ OFF ON ON | 300
+ OFF ON OFF | 350
+ OFF OFF ON | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer requires 2K of a 16K block of RAM. The base of this
+16K block can be located in any of eight positions.
+Switches 4-6 of switch group SW2 select the Base of the 16K block.
+Within that 16K address space, the buffer may be assigned any one of four
+positions, determined by the offset, switches 7 and 8 of group SW2.
+
+::
+
+ Switch | Hex RAM | Hex ROM
+ 4 5 6 7 8 | Address | Address *)
+ -----------|---------|-----------
+ 0 0 0 0 0 | C0000 | C2000
+ 0 0 0 0 1 | C0800 | C2000
+ 0 0 0 1 0 | C1000 | C2000
+ 0 0 0 1 1 | C1800 | C2000
+ | |
+ 0 0 1 0 0 | C4000 | C6000
+ 0 0 1 0 1 | C4800 | C6000
+ 0 0 1 1 0 | C5000 | C6000
+ 0 0 1 1 1 | C5800 | C6000
+ | |
+ 0 1 0 0 0 | CC000 | CE000
+ 0 1 0 0 1 | CC800 | CE000
+ 0 1 0 1 0 | CD000 | CE000
+ 0 1 0 1 1 | CD800 | CE000
+ | |
+ 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
+ 0 1 1 0 1 | D0800 | D2000
+ 0 1 1 1 0 | D1000 | D2000
+ 0 1 1 1 1 | D1800 | D2000
+ | |
+ 1 0 0 0 0 | D4000 | D6000
+ 1 0 0 0 1 | D4800 | D6000
+ 1 0 0 1 0 | D5000 | D6000
+ 1 0 0 1 1 | D5800 | D6000
+ | |
+ 1 0 1 0 0 | D8000 | DA000
+ 1 0 1 0 1 | D8800 | DA000
+ 1 0 1 1 0 | D9000 | DA000
+ 1 0 1 1 1 | D9800 | DA000
+ | |
+ 1 1 0 0 0 | DC000 | DE000
+ 1 1 0 0 1 | DC800 | DE000
+ 1 1 0 1 0 | DD000 | DE000
+ 1 1 0 1 1 | DD800 | DE000
+ | |
+ 1 1 1 0 0 | E0000 | E2000
+ 1 1 1 0 1 | E0800 | E2000
+ 1 1 1 1 0 | E1000 | E2000
+ 1 1 1 1 1 | E1800 | E2000
+
+ *) To enable the 8K Boot PROM install the jumper ROM.
+ The default is jumper ROM not installed.
+
+
+Setting Interrupt Request Lines (IRQ)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To select a hardware interrupt level set one (only one!) of the jumpers
+IRQ2, IRQ3, IRQ4, IRQ5 or IRQ7. The manufacturer's default is IRQ2.
+
+
+Setting the Timeouts
+^^^^^^^^^^^^^^^^^^^^
+
+The two jumpers labeled ET1 and ET2 are used to determine the timeout
+parameters (response and reconfiguration time). Every node in a network
+must be set to the same timeout values.
+
+::
+
+ ET1 ET2 | Response Time (us) | Reconfiguration Time (ms)
+ --------|--------------------|--------------------------
+ Off Off | 78 | 840 (Default)
+ Off On | 285 | 1680
+ On Off | 563 | 1680
+ On On | 1130 | 1680
+
+On means jumper installed, Off means jumper not installed
+
+
+16-BIT ARCNET
+-------------
+
+The manual of my 8-Bit NONAME ARCnet Card contains another description
+of a 16-Bit Coax / Twisted Pair Card. This description is incomplete,
+because there are missing two pages in the manual booklet. (The table
+of contents reports pages ... 2-9, 2-11, 2-12, 3-1, ... but inside
+the booklet there is a different way of counting ... 2-9, 2-10, A-1,
+(empty page), 3-1, ..., 3-18, A-1 (again), A-2)
+Also the picture of the board layout is not as good as the picture of
+8-Bit card, because there isn't any letter like "SW1" written to the
+picture.
+
+Should somebody have such a board, please feel free to complete this
+description or to send a mail to me!
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the Original
+
+ "ARCnet Installation Manual"
+
+::
+
+ ___________________________________________________________________
+ < _________________ _________________ |
+ > | SW? || SW? | |
+ < |_________________||_________________| |
+ > ____________________ |
+ < | | |
+ > | | |
+ < | | |
+ > | | |
+ < | | |
+ > | | |
+ < | | |
+ > |____________________| |
+ < ____|
+ > ____________________ | |
+ < | | | J1 |
+ > | < | |
+ < |____________________| ? ? ? ? ? ? |____|
+ > |o|o|o|o|o|o| |
+ < |o|o|o|o|o|o| |
+ > |
+ < __ ___________|
+ > | | |
+ <____________| |_______________________________________|
+
+
+Setting one of the switches to Off means "1", On means "0".
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in group SW2 are used to set the node ID.
+Each node attached to the network must have an unique node ID which
+must be different from 0.
+Switch 8 serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Value
+ -------|-------
+ 8 | 1
+ 7 | 2
+ 6 | 4
+ 5 | 8
+ 4 | 16
+ 3 | 32
+ 2 | 64
+ 1 | 128
+
+Some Examples::
+
+ Switch | Hex | Decimal
+ 1 2 3 4 5 6 7 8 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The first three switches in switch group SW1 are used to select one
+of eight possible I/O Base addresses using the following table::
+
+ Switch | Hex I/O
+ 3 2 1 | Address
+ ------------|--------
+ ON ON ON | 260
+ ON ON OFF | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ ON OFF OFF | 2F0
+ OFF ON ON | 300
+ OFF ON OFF | 350
+ OFF OFF ON | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer requires 2K of a 16K block of RAM. The base of this
+16K block can be located in any of eight positions.
+Switches 6-8 of switch group SW1 select the Base of the 16K block.
+Within that 16K address space, the buffer may be assigned any one of four
+positions, determined by the offset, switches 4 and 5 of group SW1::
+
+ Switch | Hex RAM | Hex ROM
+ 8 7 6 5 4 | Address | Address
+ -----------|---------|-----------
+ 0 0 0 0 0 | C0000 | C2000
+ 0 0 0 0 1 | C0800 | C2000
+ 0 0 0 1 0 | C1000 | C2000
+ 0 0 0 1 1 | C1800 | C2000
+ | |
+ 0 0 1 0 0 | C4000 | C6000
+ 0 0 1 0 1 | C4800 | C6000
+ 0 0 1 1 0 | C5000 | C6000
+ 0 0 1 1 1 | C5800 | C6000
+ | |
+ 0 1 0 0 0 | CC000 | CE000
+ 0 1 0 0 1 | CC800 | CE000
+ 0 1 0 1 0 | CD000 | CE000
+ 0 1 0 1 1 | CD800 | CE000
+ | |
+ 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
+ 0 1 1 0 1 | D0800 | D2000
+ 0 1 1 1 0 | D1000 | D2000
+ 0 1 1 1 1 | D1800 | D2000
+ | |
+ 1 0 0 0 0 | D4000 | D6000
+ 1 0 0 0 1 | D4800 | D6000
+ 1 0 0 1 0 | D5000 | D6000
+ 1 0 0 1 1 | D5800 | D6000
+ | |
+ 1 0 1 0 0 | D8000 | DA000
+ 1 0 1 0 1 | D8800 | DA000
+ 1 0 1 1 0 | D9000 | DA000
+ 1 0 1 1 1 | D9800 | DA000
+ | |
+ 1 1 0 0 0 | DC000 | DE000
+ 1 1 0 0 1 | DC800 | DE000
+ 1 1 0 1 0 | DD000 | DE000
+ 1 1 0 1 1 | DD800 | DE000
+ | |
+ 1 1 1 0 0 | E0000 | E2000
+ 1 1 1 0 1 | E0800 | E2000
+ 1 1 1 1 0 | E1000 | E2000
+ 1 1 1 1 1 | E1800 | E2000
+
+
+Setting Interrupt Request Lines (IRQ)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+??????????????????????????????????????
+
+
+Setting the Timeouts
+^^^^^^^^^^^^^^^^^^^^
+
+??????????????????????????????????????
+
+
+8-bit cards ("Made in Taiwan R.O.C.")
+-------------------------------------
+
+ - from Vojtech Pavlik <vojtech@suse.cz>
+
+I have named this ARCnet card "NONAME", since I got only the card with
+no manual at all and the only text identifying the manufacturer is
+"MADE IN TAIWAN R.O.C" printed on the card.
+
+::
+
+ ____________________________________________________________
+ | 1 2 3 4 5 6 7 8 |
+ | |o|o| JP1 o|o|o|o|o|o|o|o| ON |
+ | + o|o|o|o|o|o|o|o| ___|
+ | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7
+ | | | SW1 | | | | ID6
+ | > RAM (2k) | ____________________ | H | | S | ID5
+ | |_____________| | || y | | W | ID4
+ | | || b | | 2 | ID3
+ | | || r | | | ID2
+ | | || i | | | ID1
+ | | 90C65 || d | |___| ID0
+ | SW3 | || | |
+ | |o|o|o|o|o|o|o|o| ON | || I | |
+ | |o|o|o|o|o|o|o|o| | || C | |
+ | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____|
+ | 1 2 3 4 5 6 7 8 | | | |___
+ | ______________ | | | BNC |___|
+ | | | |_____| |_____|
+ | > EPROM SOCKET | |
+ | |______________| |
+ | ______________|
+ | |
+ |_____________________________________________|
+
+Legend::
+
+ 90C65 ARCNET Chip
+ SW1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+ SW2 1-8: Node ID Select (ID0-ID7)
+ SW3 1-5: IRQ Select
+ 6-7: Extra Timeout
+ 8 : ROM Enable
+ JP1 Led connector
+ BNC Coax connector
+
+Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not
+switches.
+
+Setting the jumpers to ON means connecting the upper two pins, off the bottom
+two - or - in case of IRQ setting, connecting none of them at all.
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in SW2 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must not be 0.
+Switch 1 (ID0) serves as the least significant bit (LSB).
+
+Setting one of the switches to Off means "1", On means "0".
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Label | Value
+ -------|-------|-------
+ 1 | ID0 | 1
+ 2 | ID1 | 2
+ 3 | ID2 | 4
+ 4 | ID3 | 8
+ 5 | ID4 | 16
+ 6 | ID5 | 32
+ 7 | ID6 | 64
+ 8 | ID7 | 128
+
+Some Examples::
+
+ Switch | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The last three switches in switch block SW1 are used to select one
+of eight possible I/O Base addresses using the following table::
+
+
+ Switch | Hex I/O
+ 6 7 8 | Address
+ ------------|--------
+ ON ON ON | 260
+ OFF ON ON | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ OFF OFF ON | 2F0
+ ON ON OFF | 300
+ OFF ON OFF | 350
+ ON OFF OFF | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of eight positions. The address of the Boot Prom is
+memory base + 0x2000.
+
+Jumpers 3-5 of jumper block SW1 select the Memory Base address.
+
+::
+
+ Switch | Hex RAM | Hex ROM
+ 1 2 3 4 5 | Address | Address *)
+ --------------------|---------|-----------
+ ON ON ON ON ON | C0000 | C2000
+ ON ON OFF ON ON | C4000 | C6000
+ ON ON ON OFF ON | CC000 | CE000
+ ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default)
+ ON ON ON ON OFF | D4000 | D6000
+ ON ON OFF ON OFF | D8000 | DA000
+ ON ON ON OFF OFF | DC000 | DE000
+ ON ON OFF OFF OFF | E0000 | E2000
+
+ *) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON.
+
+The jumpers 1 and 2 probably add 0x0800, 0x1000 and 0x1800 to RAM adders.
+
+Setting the Interrupt Line
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Jumpers 1-5 of the jumper block SW3 control the IRQ level::
+
+ Jumper | IRQ
+ 1 2 3 4 5 |
+ ----------------------------
+ ON OFF OFF OFF OFF | 2
+ OFF ON OFF OFF OFF | 3
+ OFF OFF ON OFF OFF | 4
+ OFF OFF OFF ON OFF | 5
+ OFF OFF OFF OFF ON | 7
+
+
+Setting the Timeout Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The jumpers 6-7 of the jumper block SW3 are used to determine the timeout
+parameters. These two jumpers are normally left in the OFF position.
+
+
+
+(Generic Model 9058)
+--------------------
+ - from Andrew J. Kroll <ag784@freenet.buffalo.edu>
+ - Sorry this sat in my to-do box for so long, Andrew! (yikes - over a
+ year!)
+
+::
+
+ _____
+ | <
+ | .---'
+ ________________________________________________________________ | |
+ | | SW2 | | |
+ | ___________ |_____________| | |
+ | | | 1 2 3 4 5 6 ___| |
+ | > 6116 RAM | _________ 8 | | |
+ | |___________| |20MHzXtal| 7 | | |
+ | |_________| __________ 6 | S | |
+ | 74LS373 | |- 5 | W | |
+ | _________ | E |- 4 | | |
+ | >_______| ______________|..... P |- 3 | 3 | |
+ | | | : O |- 2 | | |
+ | | | : X |- 1 |___| |
+ | ________________ | | : Y |- | |
+ | | SW1 | | SL90C65 | : |- | |
+ | |________________| | | : B |- | |
+ | 1 2 3 4 5 6 7 8 | | : O |- | |
+ | |_________o____|..../ A |- _______| |
+ | ____________________ | R |- | |------,
+ | | | | D |- | BNC | # |
+ | > 2764 PROM SOCKET | |__________|- |_______|------'
+ | |____________________| _________ | |
+ | >________| <- 74LS245 | |
+ | | |
+ |___ ______________| |
+ |H H H H H H H H H H H H H H H H H H H H H H H| | |
+ |U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U| | |
+ \|
+
+Legend::
+
+ SL90C65 ARCNET Controller / Transceiver /Logic
+ SW1 1-5: IRQ Select
+ 6: ET1
+ 7: ET2
+ 8: ROM ENABLE
+ SW2 1-3: Memory Buffer/PROM Address
+ 3-6: I/O Address Map
+ SW3 1-8: Node ID Select
+ BNC BNC RG62/U Connection
+ *I* have had success using RG59B/U with *NO* terminators!
+ What gives?!
+
+SW1: Timeouts, Interrupt and ROM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To select a hardware interrupt level set one (only one!) of the dip switches
+up (on) SW1...(switches 1-5)
+IRQ3, IRQ4, IRQ5, IRQ7, IRQ2. The Manufacturer's default is IRQ2.
+
+The switches on SW1 labeled EXT1 (switch 6) and EXT2 (switch 7)
+are used to determine the timeout parameters. These two dip switches
+are normally left off (down).
+
+ To enable the 8K Boot PROM position SW1 switch 8 on (UP) labeled ROM.
+ The default is jumper ROM not installed.
+
+
+Setting the I/O Base Address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The last three switches in switch group SW2 are used to select one
+of eight possible I/O Base addresses using the following table::
+
+
+ Switch | Hex I/O
+ 4 5 6 | Address
+ -------|--------
+ 0 0 0 | 260
+ 0 0 1 | 290
+ 0 1 0 | 2E0 (Manufacturer's default)
+ 0 1 1 | 2F0
+ 1 0 0 | 300
+ 1 0 1 | 350
+ 1 1 0 | 380
+ 1 1 1 | 3E0
+
+
+Setting the Base Memory Address (RAM & ROM)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The memory buffer requires 2K of a 16K block of RAM. The base of this
+16K block can be located in any of eight positions.
+Switches 1-3 of switch group SW2 select the Base of the 16K block.
+(0 = DOWN, 1 = UP)
+I could, however, only verify two settings...
+
+
+::
+
+ Switch| Hex RAM | Hex ROM
+ 1 2 3 | Address | Address
+ ------|---------|-----------
+ 0 0 0 | E0000 | E2000
+ 0 0 1 | D0000 | D2000 (Manufacturer's default)
+ 0 1 0 | ????? | ?????
+ 0 1 1 | ????? | ?????
+ 1 0 0 | ????? | ?????
+ 1 0 1 | ????? | ?????
+ 1 1 0 | ????? | ?????
+ 1 1 1 | ????? | ?????
+
+
+Setting the Node ID
+^^^^^^^^^^^^^^^^^^^
+
+The eight switches in group SW3 are used to set the node ID.
+Each node attached to the network must have an unique node ID which
+must be different from 0.
+Switch 1 serves as the least significant bit (LSB).
+switches in the DOWN position are OFF (0) and in the UP position are ON (1)
+
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
+ Switch | Value
+ -------|-------
+ 1 | 1
+ 2 | 2
+ 3 | 4
+ 4 | 8
+ 5 | 16
+ 6 | 32
+ 7 | 64
+ 8 | 128
+
+Some Examples::
+
+ Switch# | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed <-.
+ 0 0 0 0 0 0 0 1 | 1 | 1 |
+ 0 0 0 0 0 0 1 0 | 2 | 2 |
+ 0 0 0 0 0 0 1 1 | 3 | 3 |
+ . . . | | |
+ 0 1 0 1 0 1 0 1 | 55 | 85 |
+ . . . | | + Don't use 0 or 255!
+ 1 0 1 0 1 0 1 0 | AA | 170 |
+ . . . | | |
+ 1 1 1 1 1 1 0 1 | FD | 253 |
+ 1 1 1 1 1 1 1 0 | FE | 254 |
+ 1 1 1 1 1 1 1 1 | FF | 255 <-'
+
+
+Tiara
+=====
+
+(model unknown)
+---------------
+
+ - from Christoph Lameter <christoph@lameter.com>
+
+
+Here is information about my card as far as I could figure it out::
+
+
+ ----------------------------------------------- tiara
+ Tiara LanCard of Tiara Computer Systems.
+
+ +----------------------------------------------+
+ ! ! Transmitter Unit ! !
+ ! +------------------+ -------
+ ! MEM Coax Connector
+ ! ROM 7654321 <- I/O -------
+ ! : : +--------+ !
+ ! : : ! 90C66LJ! +++
+ ! : : ! ! !D Switch to set
+ ! : : ! ! !I the Nodenumber
+ ! : : +--------+ !P
+ ! !++
+ ! 234567 <- IRQ !
+ +------------!!!!!!!!!!!!!!!!!!!!!!!!--------+
+ !!!!!!!!!!!!!!!!!!!!!!!!
+
+- 0 = Jumper Installed
+- 1 = Open
+
+Top Jumper line Bit 7 = ROM Enable 654=Memory location 321=I/O
+
+Settings for Memory Location (Top Jumper Line)
+
+=== ================
+456 Address selected
+=== ================
+000 C0000
+001 C4000
+010 CC000
+011 D0000
+100 D4000
+101 D8000
+110 DC000
+111 E0000
+=== ================
+
+Settings for I/O Address (Top Jumper Line)
+
+=== ====
+123 Port
+=== ====
+000 260
+001 290
+010 2E0
+011 2F0
+100 300
+101 350
+110 380
+111 3E0
+=== ====
+
+Settings for IRQ Selection (Lower Jumper Line)
+
+====== =====
+234567
+====== =====
+011111 IRQ 2
+101111 IRQ 3
+110111 IRQ 4
+111011 IRQ 5
+111110 IRQ 7
+====== =====
+
+Other Cards
+===========
+
+I have no information on other models of ARCnet cards at the moment. Please
+send any and all info to:
+
+ apenwarr@worldvisions.ca
+
+Thanks.
diff --git a/Documentation/networking/arcnet.rst b/Documentation/networking/arcnet.rst
new file mode 100644
index 0000000000..82fce606c0
--- /dev/null
+++ b/Documentation/networking/arcnet.rst
@@ -0,0 +1,594 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+ARCnet
+======
+
+.. note::
+
+ See also arcnet-hardware.txt in this directory for jumper-setting
+ and cabling information if you're like many of us and didn't happen to get a
+ manual with your ARCnet card.
+
+Since no one seems to listen to me otherwise, perhaps a poem will get your
+attention::
+
+ This driver's getting fat and beefy,
+ But my cat is still named Fifi.
+
+Hmm, I think I'm allowed to call that a poem, even though it's only two
+lines. Hey, I'm in Computer Science, not English. Give me a break.
+
+The point is: I REALLY REALLY REALLY REALLY REALLY want to hear from you if
+you test this and get it working. Or if you don't. Or anything.
+
+ARCnet 0.32 ALPHA first made it into the Linux kernel 1.1.80 - this was
+nice, but after that even FEWER people started writing to me because they
+didn't even have to install the patch. <sigh>
+
+Come on, be a sport! Send me a success report!
+
+(hey, that was even better than my original poem... this is getting bad!)
+
+
+.. warning::
+
+ If you don't e-mail me about your success/failure soon, I may be forced to
+ start SINGING. And we don't want that, do we?
+
+ (You know, it might be argued that I'm pushing this point a little too much.
+ If you think so, why not flame me in a quick little e-mail? Please also
+ include the type of card(s) you're using, software, size of network, and
+ whether it's working or not.)
+
+ My e-mail address is: apenwarr@worldvisions.ca
+
+These are the ARCnet drivers for Linux.
+
+This new release (2.91) has been put together by David Woodhouse
+<dwmw2@infradead.org>, in an attempt to tidy up the driver after adding support
+for yet another chipset. Now the generic support has been separated from the
+individual chipset drivers, and the source files aren't quite so packed with
+#ifdefs! I've changed this file a bit, but kept it in the first person from
+Avery, because I didn't want to completely rewrite it.
+
+The previous release resulted from many months of on-and-off effort from me
+(Avery Pennarun), many bug reports/fixes and suggestions from others, and in
+particular a lot of input and coding from Tomasz Motylewski. Starting with
+ARCnet 2.10 ALPHA, Tomasz's all-new-and-improved RFC1051 support has been
+included and seems to be working fine!
+
+
+Where do I discuss these drivers?
+---------------------------------
+
+Tomasz has been so kind as to set up a new and improved mailing list.
+Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR
+REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the
+list, mail to linux-arcnet@tichy.ch.uj.edu.pl.
+
+There are archives of the mailing list at:
+
+ http://epistolary.org/mailman/listinfo.cgi/arcnet
+
+The people on linux-net@vger.kernel.org (now defunct, replaced by
+netdev@vger.kernel.org) have also been known to be very helpful, especially
+when we're talking about ALPHA Linux kernels that may or may not work right
+in the first place.
+
+
+Other Drivers and Info
+----------------------
+
+You can try my ARCNET page on the World Wide Web at:
+
+ http://www.qis.net/~jschmitz/arcnet/
+
+Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you
+might be interested in, which includes several drivers for various cards
+including ARCnet. Try:
+
+ http://www.smc.com/
+
+Performance Technologies makes various network software that supports
+ARCnet:
+
+ http://www.perftech.com/ or ftp to ftp.perftech.com.
+
+Novell makes a networking stack for DOS which includes ARCnet drivers. Try
+FTPing to ftp.novell.com.
+
+You can get the Crynwr packet driver collection (including arcether.com, the
+one you'll want to use with ARCnet cards) from
+oak.oakland.edu:/simtel/msdos/pktdrvr. It won't work perfectly on a 386+
+without patches, though, and also doesn't like several cards. Fixed
+versions are available on my WWW page, or via e-mail if you don't have WWW
+access.
+
+
+Installing the Driver
+---------------------
+
+All you will need to do in order to install the driver is::
+
+ make config
+ (be sure to choose ARCnet in the network devices
+ and at least one chipset driver.)
+ make clean
+ make zImage
+
+If you obtained this ARCnet package as an upgrade to the ARCnet driver in
+your current kernel, you will need to first copy arcnet.c over the one in
+the linux/drivers/net directory.
+
+You will know the driver is installed properly if you get some ARCnet
+messages when you reboot into the new Linux kernel.
+
+There are four chipset options:
+
+ 1. Standard ARCnet COM90xx chipset.
+
+This is the normal ARCnet card, which you've probably got. This is the only
+chipset driver which will autoprobe if not told where the card is.
+It following options on the command line::
+
+ com90xx=[<io>[,<irq>[,<shmem>]]][,<name>] | <name>
+
+If you load the chipset support as a module, the options are::
+
+ io=<io> irq=<irq> shmem=<shmem> device=<name>
+
+To disable the autoprobe, just specify "com90xx=" on the kernel command line.
+To specify the name alone, but allow autoprobe, just put "com90xx=<name>"
+
+ 2. ARCnet COM20020 chipset.
+
+This is the new chipset from SMC with support for promiscuous mode (packet
+sniffing), extra diagnostic information, etc. Unfortunately, there is no
+sensible method of autoprobing for these cards. You must specify the I/O
+address on the kernel command line.
+
+The command line options are::
+
+ com20020=<io>[,<irq>[,<node_ID>[,backplane[,CKP[,timeout]]]]][,name]
+
+If you load the chipset support as a module, the options are::
+
+ io=<io> irq=<irq> node=<node_ID> backplane=<backplane> clock=<CKP>
+ timeout=<timeout> device=<name>
+
+The COM20020 chipset allows you to set the node ID in software, overriding the
+default which is still set in DIP switches on the card. If you don't have the
+COM20020 data sheets, and you don't know what the other three options refer
+to, then they won't interest you - forget them.
+
+ 3. ARCnet COM90xx chipset in IO-mapped mode.
+
+This will also work with the normal ARCnet cards, but doesn't use the shared
+memory. It performs less well than the above driver, but is provided in case
+you have a card which doesn't support shared memory, or (strangely) in case
+you have so many ARCnet cards in your machine that you run out of shmem slots.
+If you don't give the IO address on the kernel command line, then the driver
+will not find the card.
+
+The command line options are::
+
+ com90io=<io>[,<irq>][,<name>]
+
+If you load the chipset support as a module, the options are:
+ io=<io> irq=<irq> device=<name>
+
+ 4. ARCnet RIM I cards.
+
+These are COM90xx chips which are _completely_ memory mapped. The support for
+these is not tested. If you have one, please mail the author with a success
+report. All options must be specified, except the device name.
+Command line options::
+
+ arcrimi=<shmem>,<irq>,<node_ID>[,<name>]
+
+If you load the chipset support as a module, the options are::
+
+ shmem=<shmem> irq=<irq> node=<node_ID> device=<name>
+
+
+Loadable Module Support
+-----------------------
+
+Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet
+support" and to support for your ARCnet chipset if you want to use the
+loadable module. You can also say 'y' to "Generic ARCnet support" and 'm'
+to the chipset support if you wish.
+
+::
+
+ make config
+ make clean
+ make zImage
+ make modules
+
+If you're using a loadable module, you need to use insmod to load it, and
+you can specify various characteristics of your card on the command
+line. (In recent versions of the driver, autoprobing is much more reliable
+and works as a module, so most of this is now unnecessary.)
+
+For example::
+
+ cd /usr/src/linux/modules
+ insmod arcnet.o
+ insmod com90xx.o
+ insmod com20020.o io=0x2e0 device=eth1
+
+
+Using the Driver
+----------------
+
+If you build your kernel with ARCnet COM90xx support included, it should
+probe for your card automatically when you boot. If you use a different
+chipset driver complied into the kernel, you must give the necessary options
+on the kernel command line, as detailed above.
+
+Go read the NET-2-HOWTO and ETHERNET-HOWTO for Linux; they should be
+available where you picked up this driver. Think of your ARCnet as a
+souped-up (or down, as the case may be) Ethernet card.
+
+By the way, be sure to change all references from "eth0" to "arc0" in the
+HOWTOs. Remember that ARCnet isn't a "true" Ethernet, and the device name
+is DIFFERENT.
+
+
+Multiple Cards in One Computer
+------------------------------
+
+Linux has pretty good support for this now, but since I've been busy, the
+ARCnet driver has somewhat suffered in this respect. COM90xx support, if
+compiled into the kernel, will (try to) autodetect all the installed cards.
+
+If you have other cards, with support compiled into the kernel, then you can
+just repeat the options on the kernel command line, e.g.::
+
+ LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260
+
+If you have the chipset support built as a loadable module, then you need to
+do something like this::
+
+ insmod -o arc0 com90xx
+ insmod -o arc1 com20020 io=0x2e0
+ insmod -o arc2 com90xx
+
+The ARCnet drivers will now sort out their names automatically.
+
+
+How do I get it to work with...?
+--------------------------------
+
+NFS:
+ Should be fine linux->linux, just pretend you're using Ethernet cards.
+ oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There
+ is also a DOS-based NFS server called SOSS. It doesn't multitask
+ quite the way Linux does (actually, it doesn't multitask AT ALL) but
+ you never know what you might need.
+
+ With AmiTCP (and possibly others), you may need to set the following
+ options in your Amiga nfstab: MD 1024 MR 1024 MW 1024
+ (Thanks to Christian Gottschling <ferksy@indigo.tng.oche.de>
+ for this.)
+
+ Probably these refer to maximum NFS data/read/write block sizes. I
+ don't know why the defaults on the Amiga didn't work; write to me if
+ you know more.
+
+DOS:
+ If you're using the freeware arcether.com, you might want to install
+ the driver patch from my web page. It helps with PC/TCP, and also
+ can get arcether to load if it timed out too quickly during
+ initialization. In fact, if you use it on a 386+ you REALLY need
+ the patch, really.
+
+Windows:
+ See DOS :) Trumpet Winsock works fine with either the Novell or
+ Arcether client, assuming you remember to load winpkt of course.
+
+LAN Manager and Windows for Workgroups:
+ These programs use protocols that
+ are incompatible with the Internet standard. They try to pretend
+ the cards are Ethernet, and confuse everyone else on the network.
+
+ However, v2.00 and higher of the Linux ARCnet driver supports this
+ protocol via the 'arc0e' device. See the section on "Multiprotocol
+ Support" for more information.
+
+ Using the freeware Samba server and clients for Linux, you can now
+ interface quite nicely with TCP/IP-based WfWg or Lan Manager
+ networks.
+
+Windows 95:
+ Tools are included with Win95 that let you use either the LANMAN
+ style network drivers (NDIS) or Novell drivers (ODI) to handle your
+ ARCnet packets. If you use ODI, you'll need to use the 'arc0'
+ device with Linux. If you use NDIS, then try the 'arc0e' device.
+ See the "Multiprotocol Support" section below if you need arc0e,
+ you're completely insane, and/or you need to build some kind of
+ hybrid network that uses both encapsulation types.
+
+OS/2:
+ I've been told it works under Warp Connect with an ARCnet driver from
+ SMC. You need to use the 'arc0e' interface for this. If you get
+ the SMC driver to work with the TCP/IP stuff included in the
+ "normal" Warp Bonus Pack, let me know.
+
+ ftp.microsoft.com also has a freeware "Lan Manager for OS/2" client
+ which should use the same protocol as WfWg does. I had no luck
+ installing it under Warp, however. Please mail me with any results.
+
+NetBSD/AmiTCP:
+ These use an old version of the Internet standard ARCnet
+ protocol (RFC1051) which is compatible with the Linux driver v2.10
+ ALPHA and above using the arc0s device. (See "Multiprotocol ARCnet"
+ below.) ** Newer versions of NetBSD apparently support RFC1201.
+
+
+Using Multiprotocol ARCnet
+--------------------------
+
+The ARCnet driver v2.10 ALPHA supports three protocols, each on its own
+"virtual network device":
+
+ ====== ===============================================================
+ arc0 RFC1201 protocol, the official Internet standard which just
+ happens to be 100% compatible with Novell's TRXNET driver.
+ Version 1.00 of the ARCnet driver supported _only_ this
+ protocol. arc0 is the fastest of the three protocols (for
+ whatever reason), and allows larger packets to be used
+ because it supports RFC1201 "packet splitting" operations.
+ Unless you have a specific need to use a different protocol,
+ I strongly suggest that you stick with this one.
+
+ arc0e "Ethernet-Encapsulation" which sends packets over ARCnet
+ that are actually a lot like Ethernet packets, including the
+ 6-byte hardware addresses. This protocol is compatible with
+ Microsoft's NDIS ARCnet driver, like the one in WfWg and
+ LANMAN. Because the MTU of 493 is actually smaller than the
+ one "required" by TCP/IP (576), there is a chance that some
+ network operations will not function properly. The Linux
+ TCP/IP layer can compensate in most cases, however, by
+ automatically fragmenting the TCP/IP packets to make them
+ fit. arc0e also works slightly more slowly than arc0, for
+ reasons yet to be determined. (Probably it's the smaller
+ MTU that does it.)
+
+ arc0s The "[s]imple" RFC1051 protocol is the "previous" Internet
+ standard that is completely incompatible with the new
+ standard. Some software today, however, continues to
+ support the old standard (and only the old standard)
+ including NetBSD and AmiTCP. RFC1051 also does not support
+ RFC1201's packet splitting, and the MTU of 507 is still
+ smaller than the Internet "requirement," so it's quite
+ possible that you may run into problems. It's also slower
+ than RFC1201 by about 25%, for the same reason as arc0e.
+
+ The arc0s support was contributed by Tomasz Motylewski
+ and modified somewhat by me. Bugs are probably my fault.
+ ====== ===============================================================
+
+You can choose not to compile arc0e and arc0s into the driver if you want -
+this will save you a bit of memory and avoid confusion when eg. trying to
+use the "NFS-root" stuff in recent Linux kernels.
+
+The arc0e and arc0s devices are created automatically when you first
+ifconfig the arc0 device. To actually use them, though, you need to also
+ifconfig the other virtual devices you need. There are a number of ways you
+can set up your network then:
+
+
+1. Single Protocol.
+
+ This is the simplest way to configure your network: use just one of the
+ two available protocols. As mentioned above, it's a good idea to use
+ only arc0 unless you have a good reason (like some other software, ie.
+ WfWg, that only works with arc0e).
+
+ If you need only arc0, then the following commands should get you going::
+
+ ifconfig arc0 MY.IP.ADD.RESS
+ route add MY.IP.ADD.RESS arc0
+ route add -net SUB.NET.ADD.RESS arc0
+ [add other local routes here]
+
+ If you need arc0e (and only arc0e), it's a little different::
+
+ ifconfig arc0 MY.IP.ADD.RESS
+ ifconfig arc0e MY.IP.ADD.RESS
+ route add MY.IP.ADD.RESS arc0e
+ route add -net SUB.NET.ADD.RESS arc0e
+
+ arc0s works much the same way as arc0e.
+
+
+2. More than one protocol on the same wire.
+
+ Now things start getting confusing. To even try it, you may need to be
+ partly crazy. Here's what *I* did. :) Note that I don't include arc0s in
+ my home network; I don't have any NetBSD or AmiTCP computers, so I only
+ use arc0s during limited testing.
+
+ I have three computers on my home network; two Linux boxes (which prefer
+ RFC1201 protocol, for reasons listed above), and one XT that can't run
+ Linux but runs the free Microsoft LANMAN Client instead.
+
+ Worse, one of the Linux computers (freedom) also has a modem and acts as
+ a router to my Internet provider. The other Linux box (insight) also has
+ its own IP address and needs to use freedom as its default gateway. The
+ XT (patience), however, does not have its own Internet IP address and so
+ I assigned it one on a "private subnet" (as defined by RFC1597).
+
+ To start with, take a simple network with just insight and freedom.
+ Insight needs to:
+
+ - talk to freedom via RFC1201 (arc0) protocol, because I like it
+ more and it's faster.
+ - use freedom as its Internet gateway.
+
+ That's pretty easy to do. Set up insight like this::
+
+ ifconfig arc0 insight
+ route add insight arc0
+ route add freedom arc0 /* I would use the subnet here (like I said
+ to in "single protocol" above),
+ but the rest of the subnet
+ unfortunately lies across the PPP
+ link on freedom, which confuses
+ things. */
+ route add default gw freedom
+
+ And freedom gets configured like so::
+
+ ifconfig arc0 freedom
+ route add freedom arc0
+ route add insight arc0
+ /* and default gateway is configured by pppd */
+
+ Great, now insight talks to freedom directly on arc0, and sends packets
+ to the Internet through freedom. If you didn't know how to do the above,
+ you should probably stop reading this section now because it only gets
+ worse.
+
+ Now, how do I add patience into the network? It will be using LANMAN
+ Client, which means I need the arc0e device. It needs to be able to talk
+ to both insight and freedom, and also use freedom as a gateway to the
+ Internet. (Recall that patience has a "private IP address" which won't
+ work on the Internet; that's okay, I configured Linux IP masquerading on
+ freedom for this subnet).
+
+ So patience (necessarily; I don't have another IP number from my
+ provider) has an IP address on a different subnet than freedom and
+ insight, but needs to use freedom as an Internet gateway. Worse, most
+ DOS networking programs, including LANMAN, have braindead networking
+ schemes that rely completely on the netmask and a 'default gateway' to
+ determine how to route packets. This means that to get to freedom or
+ insight, patience WILL send through its default gateway, regardless of
+ the fact that both freedom and insight (courtesy of the arc0e device)
+ could understand a direct transmission.
+
+ I compensate by giving freedom an extra IP address - aliased 'gatekeeper' -
+ that is on my private subnet, the same subnet that patience is on. I
+ then define gatekeeper to be the default gateway for patience.
+
+ To configure freedom (in addition to the commands above)::
+
+ ifconfig arc0e gatekeeper
+ route add gatekeeper arc0e
+ route add patience arc0e
+
+ This way, freedom will send all packets for patience through arc0e,
+ giving its IP address as gatekeeper (on the private subnet). When it
+ talks to insight or the Internet, it will use its "freedom" Internet IP
+ address.
+
+ You will notice that we haven't configured the arc0e device on insight.
+ This would work, but is not really necessary, and would require me to
+ assign insight another special IP number from my private subnet. Since
+ both insight and patience are using freedom as their default gateway, the
+ two can already talk to each other.
+
+ It's quite fortunate that I set things up like this the first time (cough
+ cough) because it's really handy when I boot insight into DOS. There, it
+ runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet.
+ In this mode it would be impossible for insight to communicate directly
+ with patience, since the Novell stack is incompatible with Microsoft's
+ Ethernet-Encap. Without changing any settings on freedom or patience, I
+ simply set freedom as the default gateway for insight (now in DOS,
+ remember) and all the forwarding happens "automagically" between the two
+ hosts that would normally not be able to communicate at all.
+
+ For those who like diagrams, I have created two "virtual subnets" on the
+ same physical ARCnet wire. You can picture it like this::
+
+
+ [RFC1201 NETWORK] [ETHER-ENCAP NETWORK]
+ (registered Internet subnet) (RFC1597 private subnet)
+
+ (IP Masquerade)
+ /---------------\ * /---------------\
+ | | * | |
+ | +-Freedom-*-Gatekeeper-+ |
+ | | | * | |
+ \-------+-------/ | * \-------+-------/
+ | | |
+ Insight | Patience
+ (Internet)
+
+
+
+It works: what now?
+-------------------
+
+Send mail describing your setup, preferably including driver version, kernel
+version, ARCnet card model, CPU type, number of systems on your network, and
+list of software in use to me at the following address:
+
+ apenwarr@worldvisions.ca
+
+I do send (sometimes automated) replies to all messages I receive. My email
+can be weird (and also usually gets forwarded all over the place along the
+way to me), so if you don't get a reply within a reasonable time, please
+resend.
+
+
+It doesn't work: what now?
+--------------------------
+
+Do the same as above, but also include the output of the ifconfig and route
+commands, as well as any pertinent log entries (ie. anything that starts
+with "arcnet:" and has shown up since the last reboot) in your mail.
+
+If you want to try fixing it yourself (I strongly recommend that you mail me
+about the problem first, since it might already have been solved) you may
+want to try some of the debug levels available. For heavy testing on
+D_DURING or more, it would be a REALLY good idea to kill your klogd daemon
+first! D_DURING displays 4-5 lines for each packet sent or received. D_TX,
+D_RX, and D_SKB actually DISPLAY each packet as it is sent or received,
+which is obviously quite big.
+
+Starting with v2.40 ALPHA, the autoprobe routines have changed
+significantly. In particular, they won't tell you why the card was not
+found unless you turn on the D_INIT_REASONS debugging flag.
+
+Once the driver is running, you can run the arcdump shell script (available
+from me or in the full ARCnet package, if you have it) as root to list the
+contents of the arcnet buffers at any time. To make any sense at all out of
+this, you should grab the pertinent RFCs. (some are listed near the top of
+arcnet.c). arcdump assumes your card is at 0xD0000. If it isn't, edit the
+script.
+
+Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending.
+Ping-pong buffers are implemented both ways.
+
+If your debug level includes D_DURING and you did NOT define SLOW_XMIT_COPY,
+the buffers are cleared to a constant value of 0x42 every time the card is
+reset (which should only happen when you do an ifconfig up, or when Linux
+decides that the driver is broken). During a transmit, unused parts of the
+buffer will be cleared to 0x42 as well. This is to make it easier to figure
+out which bytes are being used by a packet.
+
+You can change the debug level without recompiling the kernel by typing::
+
+ ifconfig arc0 down metric 1xxx
+ /etc/rc.d/rc.inet1
+
+where "xxx" is the debug level you want. For example, "metric 1015" would put
+you at debug level 15. Debug level 7 is currently the default.
+
+Note that the debug level is (starting with v1.90 ALPHA) a binary
+combination of different debug flags; so debug level 7 is really 1+2+4 or
+D_NORMAL+D_EXTRA+D_INIT. To include D_DURING, you would add 16 to this,
+resulting in debug level 23.
+
+If you don't understand that, you probably don't want to know anyway.
+E-mail me about your problem.
+
+
+I want to send money: what now?
+-------------------------------
+
+Go take a nap or something. You'll feel better in the morning.
diff --git a/Documentation/networking/atm.rst b/Documentation/networking/atm.rst
new file mode 100644
index 0000000000..c1df8c0385
--- /dev/null
+++ b/Documentation/networking/atm.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===
+ATM
+===
+
+In order to use anything but the most primitive functions of ATM,
+several user-mode programs are required to assist the kernel. These
+programs and related material can be found via the ATM on Linux Web
+page at http://linux-atm.sourceforge.net/
+
+If you encounter problems with ATM, please report them on the ATM
+on Linux mailing list. Subscription information, archives, etc.,
+can be found on http://linux-atm.sourceforge.net/
diff --git a/Documentation/networking/ax25.rst b/Documentation/networking/ax25.rst
new file mode 100644
index 0000000000..605e72c6c8
--- /dev/null
+++ b/Documentation/networking/ax25.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+AX.25
+=====
+
+To use the amateur radio protocols within Linux you will need to get a
+suitable copy of the AX.25 Utilities. More detailed information about
+AX.25, NET/ROM and ROSE, associated programs and utilities can be
+found on https://linux-ax25.in-berlin.de.
+
+There is a mailing list for discussing Linux amateur radio matters
+called linux-hams@vger.kernel.org. To subscribe to it, send a message to
+majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body
+of the message, the subject field is ignored. You don't need to be
+subscribed to post but of course that means you might miss an answer.
diff --git a/Documentation/networking/bareudp.rst b/Documentation/networking/bareudp.rst
new file mode 100644
index 0000000000..b9d04ee6da
--- /dev/null
+++ b/Documentation/networking/bareudp.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+Bare UDP Tunnelling Module Documentation
+========================================
+
+There are various L3 encapsulation standards using UDP being discussed to
+leverage the UDP based load balancing capability of different networks.
+MPLSoUDP (__ https://tools.ietf.org/html/rfc7510) is one among them.
+
+The Bareudp tunnel module provides a generic L3 encapsulation support for
+tunnelling different L3 protocols like MPLS, IP, NSH etc. inside a UDP tunnel.
+
+Special Handling
+----------------
+The bareudp device supports special handling for MPLS & IP as they can have
+multiple ethertypes.
+MPLS procotcol can have ethertypes ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast).
+IP protocol can have ethertypes ETH_P_IP (v4) & ETH_P_IPV6 (v6).
+This special handling can be enabled only for ethertypes ETH_P_IP & ETH_P_MPLS_UC
+with a flag called multiproto mode.
+
+Usage
+------
+
+1) Device creation & deletion
+
+ a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc
+
+ This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
+ 0x8847 (MPLS traffic). The destination port of the UDP header will be set to
+ 6635.The device will listen on UDP port 6635 to receive traffic.
+
+ b) ip link delete bareudp0
+
+2) Device creation with multiproto mode enabled
+
+The multiproto mode allows bareudp tunnels to handle several protocols of the
+same family. It is currently only available for IP and MPLS. This mode has to
+be enabled explicitly with the "multiproto" flag.
+
+ a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype ipv4 multiproto
+
+ For an IPv4 tunnel the multiproto mode allows the tunnel to also handle
+ IPv6.
+
+ b) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc multiproto
+
+ For MPLS, the multiproto mode allows the tunnel to handle both unicast
+ and multicast MPLS packets.
+
+3) Device Usage
+
+The bareudp device could be used along with OVS or flower filter in TC.
+The OVS or TC flower layer must set the tunnel information in SKB dst field before
+sending packet buffer to the bareudp device for transmission. On reception the
+bareudp device extracts and stores the tunnel information in SKB dst field before
+passing the packet buffer to the network stack.
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst
new file mode 100644
index 0000000000..8a0dcb1894
--- /dev/null
+++ b/Documentation/networking/batman-adv.rst
@@ -0,0 +1,168 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+batman-adv
+==========
+
+Batman advanced is a new approach to wireless networking which does no longer
+operate on the IP basis. Unlike the batman daemon, which exchanges information
+using UDP packets and sets routing tables, batman-advanced operates on ISO/OSI
+Layer 2 only and uses and routes (or better: bridges) Ethernet Frames. It
+emulates a virtual network switch of all nodes participating. Therefore all
+nodes appear to be link local, thus all higher operating protocols won't be
+affected by any changes within the network. You can run almost any protocol
+above batman advanced, prominent examples are: IPv4, IPv6, DHCP, IPX.
+
+Batman advanced was implemented as a Linux kernel driver to reduce the overhead
+to a minimum. It does not depend on any (other) network driver, and can be used
+on wifi as well as ethernet lan, vpn, etc ... (anything with ethernet-style
+layer 2).
+
+
+Configuration
+=============
+
+Load the batman-adv module into your kernel::
+
+ $ insmod batman-adv.ko
+
+The module is now waiting for activation. You must add some interfaces on which
+batman-adv can operate. The batman-adv soft-interface can be created using the
+iproute2 tool ``ip``::
+
+ $ ip link add name bat0 type batadv
+
+To activate a given interface simply attach it to the ``bat0`` interface::
+
+ $ ip link set dev eth0 master bat0
+
+Repeat this step for all interfaces you wish to add. Now batman-adv starts
+using/broadcasting on this/these interface(s).
+
+To deactivate an interface you have to detach it from the "bat0" interface::
+
+ $ ip link set dev eth0 nomaster
+
+The same can also be done using the batctl interface subcommand::
+
+ batctl -m bat0 interface create
+ batctl -m bat0 interface add -M eth0
+
+To detach eth0 and destroy bat0::
+
+ batctl -m bat0 interface del -M eth0
+ batctl -m bat0 interface destroy
+
+There are additional settings for each batadv mesh interface, vlan and hardif
+which can be modified using batctl. Detailed information about this can be found
+in its manual.
+
+For instance, you can check the current originator interval (value
+in milliseconds which determines how often batman-adv sends its broadcast
+packets)::
+
+ $ batctl -M bat0 orig_interval
+ 1000
+
+and also change its value::
+
+ $ batctl -M bat0 orig_interval 3000
+
+In very mobile scenarios, you might want to adjust the originator interval to a
+lower value. This will make the mesh more responsive to topology changes, but
+will also increase the overhead.
+
+Information about the current state can be accessed via the batadv generic
+netlink family. batctl provides a human readable version via its debug tables
+subcommands.
+
+
+Usage
+=====
+
+To make use of your newly created mesh, batman advanced provides a new
+interface "bat0" which you should use from this point on. All interfaces added
+to batman advanced are not relevant any longer because batman handles them for
+you. Basically, one "hands over" the data by using the batman interface and
+batman will make sure it reaches its destination.
+
+The "bat0" interface can be used like any other regular interface. It needs an
+IP address which can be either statically configured or dynamically (by using
+DHCP or similar services)::
+
+ NodeA: ip link set up dev bat0
+ NodeA: ip addr add 192.168.0.1/24 dev bat0
+
+ NodeB: ip link set up dev bat0
+ NodeB: ip addr add 192.168.0.2/24 dev bat0
+ NodeB: ping 192.168.0.1
+
+Note: In order to avoid problems remove all IP addresses previously assigned to
+interfaces now used by batman advanced, e.g.::
+
+ $ ip addr flush dev eth0
+
+
+Logging/Debugging
+=================
+
+All error messages, warnings and information messages are sent to the kernel
+log. Depending on your operating system distribution this can be read in one of
+a number of ways. Try using the commands: ``dmesg``, ``logread``, or looking in
+the files ``/var/log/kern.log`` or ``/var/log/syslog``. All batman-adv messages
+are prefixed with "batman-adv:" So to see just these messages try::
+
+ $ dmesg | grep batman-adv
+
+When investigating problems with your mesh network, it is sometimes necessary to
+see more detailed debug messages. This must be enabled when compiling the
+batman-adv module. When building batman-adv as part of the kernel, use "make
+menuconfig" and enable the option ``B.A.T.M.A.N. debugging``
+(``CONFIG_BATMAN_ADV_DEBUG=y``).
+
+Those additional debug messages can be accessed using the perf infrastructure::
+
+ $ trace-cmd stream -e batadv:batadv_dbg
+
+The additional debug output is by default disabled. It can be enabled during
+run time::
+
+ $ batctl -m bat0 loglevel routes tt
+
+will enable debug messages for when routes and translation table entries change.
+
+Counters for different types of packets entering and leaving the batman-adv
+module are available through ethtool::
+
+ $ ethtool --statistics bat0
+
+
+batctl
+======
+
+As batman advanced operates on layer 2, all hosts participating in the virtual
+switch are completely transparent for all protocols above layer 2. Therefore
+the common diagnosis tools do not work as expected. To overcome these problems,
+batctl was created. At the moment the batctl contains ping, traceroute, tcpdump
+and interfaces to the kernel module settings.
+
+For more information, please see the manpage (``man batctl``).
+
+batctl is available on https://www.open-mesh.org/
+
+
+Contact
+=======
+
+Please send us comments, experiences, questions, anything :)
+
+IRC:
+ #batadv on ircs://irc.hackint.org/
+Mailing-list:
+ b.a.t.m.a.n@lists.open-mesh.org (optional subscription at
+ https://lists.open-mesh.org/mailman3/postorius/lists/b.a.t.m.a.n.lists.open-mesh.org/)
+
+You can also contact the Authors:
+
+* Marek Lindner <mareklindner@neomailbox.ch>
+* Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
new file mode 100644
index 0000000000..f7a73421eb
--- /dev/null
+++ b/Documentation/networking/bonding.rst
@@ -0,0 +1,2925 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Linux Ethernet Bonding Driver HOWTO
+===================================
+
+Latest update: 27 April 2011
+
+Initial release: Thomas Davis <tadavis at lbl.gov>
+
+Corrections, HA extensions: 2000/10/03-15:
+
+ - Willy Tarreau <willy at meta-x.org>
+ - Constantine Gavrilov <const-g at xpert.com>
+ - Chad N. Tindel <ctindel at ieee dot org>
+ - Janice Girouard <girouard at us dot ibm dot com>
+ - Jay Vosburgh <fubar at us dot ibm dot com>
+
+Reorganized and updated Feb 2005 by Jay Vosburgh
+Added Sysfs information: 2006/04/24
+
+ - Mitch Williams <mitch.a.williams at intel.com>
+
+Introduction
+============
+
+The Linux bonding driver provides a method for aggregating
+multiple network interfaces into a single logical "bonded" interface.
+The behavior of the bonded interfaces depends upon the mode; generally
+speaking, modes provide either hot standby or load balancing services.
+Additionally, link integrity monitoring may be performed.
+
+The bonding driver originally came from Donald Becker's
+beowulf patches for kernel 2.0. It has changed quite a bit since, and
+the original tools from extreme-linux and beowulf sites will not work
+with this version of the driver.
+
+For new versions of the driver, updated userspace tools, and
+who to ask for help, please follow the links at the end of this file.
+
+.. Table of Contents
+
+ 1. Bonding Driver Installation
+
+ 2. Bonding Driver Options
+
+ 3. Configuring Bonding Devices
+ 3.1 Configuration with Sysconfig Support
+ 3.1.1 Using DHCP with Sysconfig
+ 3.1.2 Configuring Multiple Bonds with Sysconfig
+ 3.2 Configuration with Initscripts Support
+ 3.2.1 Using DHCP with Initscripts
+ 3.2.2 Configuring Multiple Bonds with Initscripts
+ 3.3 Configuring Bonding Manually with Ifenslave
+ 3.3.1 Configuring Multiple Bonds Manually
+ 3.4 Configuring Bonding Manually via Sysfs
+ 3.5 Configuration with Interfaces Support
+ 3.6 Overriding Configuration for Special Cases
+ 3.7 Configuring LACP for 802.3ad mode in a more secure way
+
+ 4. Querying Bonding Configuration
+ 4.1 Bonding Configuration
+ 4.2 Network Configuration
+
+ 5. Switch Configuration
+
+ 6. 802.1q VLAN Support
+
+ 7. Link Monitoring
+ 7.1 ARP Monitor Operation
+ 7.2 Configuring Multiple ARP Targets
+ 7.3 MII Monitor Operation
+
+ 8. Potential Trouble Sources
+ 8.1 Adventures in Routing
+ 8.2 Ethernet Device Renaming
+ 8.3 Painfully Slow Or No Failed Link Detection By Miimon
+
+ 9. SNMP agents
+
+ 10. Promiscuous mode
+
+ 11. Configuring Bonding for High Availability
+ 11.1 High Availability in a Single Switch Topology
+ 11.2 High Availability in a Multiple Switch Topology
+ 11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
+ 11.2.2 HA Link Monitoring for Multiple Switch Topology
+
+ 12. Configuring Bonding for Maximum Throughput
+ 12.1 Maximum Throughput in a Single Switch Topology
+ 12.1.1 MT Bonding Mode Selection for Single Switch Topology
+ 12.1.2 MT Link Monitoring for Single Switch Topology
+ 12.2 Maximum Throughput in a Multiple Switch Topology
+ 12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
+ 12.2.2 MT Link Monitoring for Multiple Switch Topology
+
+ 13. Switch Behavior Issues
+ 13.1 Link Establishment and Failover Delays
+ 13.2 Duplicated Incoming Packets
+
+ 14. Hardware Specific Considerations
+ 14.1 IBM BladeCenter
+
+ 15. Frequently Asked Questions
+
+ 16. Resources and Links
+
+
+1. Bonding Driver Installation
+==============================
+
+Most popular distro kernels ship with the bonding driver
+already available as a module. If your distro does not, or you
+have need to compile bonding from source (e.g., configuring and
+installing a mainline kernel from kernel.org), you'll need to perform
+the following steps:
+
+1.1 Configure and build the kernel with bonding
+-----------------------------------------------
+
+The current version of the bonding driver is available in the
+drivers/net/bonding subdirectory of the most recent kernel source
+(which is available on http://kernel.org). Most users "rolling their
+own" will want to use the most recent kernel from kernel.org.
+
+Configure kernel with "make menuconfig" (or "make xconfig" or
+"make config"), then select "Bonding driver support" in the "Network
+device support" section. It is recommended that you configure the
+driver as module since it is currently the only way to pass parameters
+to the driver or configure more than one bonding device.
+
+Build and install the new kernel and modules.
+
+1.2 Bonding Control Utility
+---------------------------
+
+It is recommended to configure bonding via iproute2 (netlink)
+or sysfs, the old ifenslave control utility is obsolete.
+
+2. Bonding Driver Options
+=========================
+
+Options for the bonding driver are supplied as parameters to the
+bonding module at load time, or are specified via sysfs.
+
+Module options may be given as command line arguments to the
+insmod or modprobe command, but are usually specified in either the
+``/etc/modprobe.d/*.conf`` configuration files, or in a distro-specific
+configuration file (some of which are detailed in the next section).
+
+Details on bonding support for sysfs is provided in the
+"Configuring Bonding Manually via Sysfs" section, below.
+
+The available bonding driver parameters are listed below. If a
+parameter is not specified the default value is used. When initially
+configuring a bond, it is recommended "tail -f /var/log/messages" be
+run in a separate window to watch for bonding driver error messages.
+
+It is critical that either the miimon or arp_interval and
+arp_ip_target parameters be specified, otherwise serious network
+degradation will occur during link failures. Very few devices do not
+support at least miimon, so there is really no reason not to use it.
+
+Options with textual values will accept either the text name
+or, for backwards compatibility, the option value. E.g.,
+"mode=802.3ad" and "mode=4" set the same mode.
+
+The parameters are as follows:
+
+active_slave
+
+ Specifies the new active slave for modes that support it
+ (active-backup, balance-alb and balance-tlb). Possible values
+ are the name of any currently enslaved interface, or an empty
+ string. If a name is given, the slave and its link must be up in order
+ to be selected as the new active slave. If an empty string is
+ specified, the current active slave is cleared, and a new active
+ slave is selected automatically.
+
+ Note that this is only available through the sysfs interface. No module
+ parameter by this name exists.
+
+ The normal value of this option is the name of the currently
+ active slave, or the empty string if there is no active slave or
+ the current mode does not use an active slave.
+
+ad_actor_sys_prio
+
+ In an AD system, this specifies the system priority. The allowed range
+ is 1 - 65535. If the value is not specified, it takes 65535 as the
+ default value.
+
+ This parameter has effect only in 802.3ad mode and is available through
+ SysFs interface.
+
+ad_actor_system
+
+ In an AD system, this specifies the mac-address for the actor in
+ protocol packet exchanges (LACPDUs). The value cannot be a multicast
+ address. If the all-zeroes MAC is specified, bonding will internally
+ use the MAC of the bond itself. It is preferred to have the
+ local-admin bit set for this mac but driver does not enforce it. If
+ the value is not given then system defaults to using the masters'
+ mac address as actors' system address.
+
+ This parameter has effect only in 802.3ad mode and is available through
+ SysFs interface.
+
+ad_select
+
+ Specifies the 802.3ad aggregation selection logic to use. The
+ possible values and their effects are:
+
+ stable or 0
+
+ The active aggregator is chosen by largest aggregate
+ bandwidth.
+
+ Reselection of the active aggregator occurs only when all
+ slaves of the active aggregator are down or the active
+ aggregator has no slaves.
+
+ This is the default value.
+
+ bandwidth or 1
+
+ The active aggregator is chosen by largest aggregate
+ bandwidth. Reselection occurs if:
+
+ - A slave is added to or removed from the bond
+
+ - Any slave's link state changes
+
+ - Any slave's 802.3ad association state changes
+
+ - The bond's administrative state changes to up
+
+ count or 2
+
+ The active aggregator is chosen by the largest number of
+ ports (slaves). Reselection occurs as described under the
+ "bandwidth" setting, above.
+
+ The bandwidth and count selection policies permit failover of
+ 802.3ad aggregations when partial failure of the active aggregator
+ occurs. This keeps the aggregator with the highest availability
+ (either in bandwidth or in number of ports) active at all times.
+
+ This option was added in bonding version 3.4.0.
+
+ad_user_port_key
+
+ In an AD system, the port-key has three parts as shown below -
+
+ ===== ============
+ Bits Use
+ ===== ============
+ 00 Duplex
+ 01-05 Speed
+ 06-15 User-defined
+ ===== ============
+
+ This defines the upper 10 bits of the port key. The values can be
+ from 0 - 1023. If not given, the system defaults to 0.
+
+ This parameter has effect only in 802.3ad mode and is available through
+ SysFs interface.
+
+all_slaves_active
+
+ Specifies that duplicate frames (received on inactive ports) should be
+ dropped (0) or delivered (1).
+
+ Normally, bonding will drop duplicate frames (received on inactive
+ ports), which is desirable for most users. But there are some times
+ it is nice to allow duplicate frames to be delivered.
+
+ The default value is 0 (drop duplicate frames received on inactive
+ ports).
+
+arp_interval
+
+ Specifies the ARP link monitoring frequency in milliseconds.
+
+ The ARP monitor works by periodically checking the slave
+ devices to determine whether they have sent or received
+ traffic recently (the precise criteria depends upon the
+ bonding mode, and the state of the slave). Regular traffic is
+ generated via ARP probes issued for the addresses specified by
+ the arp_ip_target option.
+
+ This behavior can be modified by the arp_validate option,
+ below.
+
+ If ARP monitoring is used in an etherchannel compatible mode
+ (modes 0 and 2), the switch should be configured in a mode
+ that evenly distributes packets across all links. If the
+ switch is configured to distribute the packets in an XOR
+ fashion, all replies from the ARP targets will be received on
+ the same link which could cause the other team members to
+ fail. ARP monitoring should not be used in conjunction with
+ miimon. A value of 0 disables ARP monitoring. The default
+ value is 0.
+
+arp_ip_target
+
+ Specifies the IP addresses to use as ARP monitoring peers when
+ arp_interval is > 0. These are the targets of the ARP request
+ sent to determine the health of the link to the targets.
+ Specify these values in ddd.ddd.ddd.ddd format. Multiple IP
+ addresses must be separated by a comma. At least one IP
+ address must be given for ARP monitoring to function. The
+ maximum number of targets that can be specified is 16. The
+ default value is no IP addresses.
+
+ns_ip6_target
+
+ Specifies the IPv6 addresses to use as IPv6 monitoring peers when
+ arp_interval is > 0. These are the targets of the NS request
+ sent to determine the health of the link to the targets.
+ Specify these values in ffff:ffff::ffff:ffff format. Multiple IPv6
+ addresses must be separated by a comma. At least one IPv6
+ address must be given for NS/NA monitoring to function. The
+ maximum number of targets that can be specified is 16. The
+ default value is no IPv6 addresses.
+
+arp_validate
+
+ Specifies whether or not ARP probes and replies should be
+ validated in any mode that supports arp monitoring, or whether
+ non-ARP traffic should be filtered (disregarded) for link
+ monitoring purposes.
+
+ Possible values are:
+
+ none or 0
+
+ No validation or filtering is performed.
+
+ active or 1
+
+ Validation is performed only for the active slave.
+
+ backup or 2
+
+ Validation is performed only for backup slaves.
+
+ all or 3
+
+ Validation is performed for all slaves.
+
+ filter or 4
+
+ Filtering is applied to all slaves. No validation is
+ performed.
+
+ filter_active or 5
+
+ Filtering is applied to all slaves, validation is performed
+ only for the active slave.
+
+ filter_backup or 6
+
+ Filtering is applied to all slaves, validation is performed
+ only for backup slaves.
+
+ Validation:
+
+ Enabling validation causes the ARP monitor to examine the incoming
+ ARP requests and replies, and only consider a slave to be up if it
+ is receiving the appropriate ARP traffic.
+
+ For an active slave, the validation checks ARP replies to confirm
+ that they were generated by an arp_ip_target. Since backup slaves
+ do not typically receive these replies, the validation performed
+ for backup slaves is on the broadcast ARP request sent out via the
+ active slave. It is possible that some switch or network
+ configurations may result in situations wherein the backup slaves
+ do not receive the ARP requests; in such a situation, validation
+ of backup slaves must be disabled.
+
+ The validation of ARP requests on backup slaves is mainly helping
+ bonding to decide which slaves are more likely to work in case of
+ the active slave failure, it doesn't really guarantee that the
+ backup slave will work if it's selected as the next active slave.
+
+ Validation is useful in network configurations in which multiple
+ bonding hosts are concurrently issuing ARPs to one or more targets
+ beyond a common switch. Should the link between the switch and
+ target fail (but not the switch itself), the probe traffic
+ generated by the multiple bonding instances will fool the standard
+ ARP monitor into considering the links as still up. Use of
+ validation can resolve this, as the ARP monitor will only consider
+ ARP requests and replies associated with its own instance of
+ bonding.
+
+ Filtering:
+
+ Enabling filtering causes the ARP monitor to only use incoming ARP
+ packets for link availability purposes. Arriving packets that are
+ not ARPs are delivered normally, but do not count when determining
+ if a slave is available.
+
+ Filtering operates by only considering the reception of ARP
+ packets (any ARP packet, regardless of source or destination) when
+ determining if a slave has received traffic for link availability
+ purposes.
+
+ Filtering is useful in network configurations in which significant
+ levels of third party broadcast traffic would fool the standard
+ ARP monitor into considering the links as still up. Use of
+ filtering can resolve this, as only ARP traffic is considered for
+ link availability purposes.
+
+ This option was added in bonding version 3.1.0.
+
+arp_all_targets
+
+ Specifies the quantity of arp_ip_targets that must be reachable
+ in order for the ARP monitor to consider a slave as being up.
+ This option affects only active-backup mode for slaves with
+ arp_validation enabled.
+
+ Possible values are:
+
+ any or 0
+
+ consider the slave up only when any of the arp_ip_targets
+ is reachable
+
+ all or 1
+
+ consider the slave up only when all of the arp_ip_targets
+ are reachable
+
+arp_missed_max
+
+ Specifies the number of arp_interval monitor checks that must
+ fail in order for an interface to be marked down by the ARP monitor.
+
+ In order to provide orderly failover semantics, backup interfaces
+ are permitted an extra monitor check (i.e., they must fail
+ arp_missed_max + 1 times before being marked down).
+
+ The default value is 2, and the allowable range is 1 - 255.
+
+downdelay
+
+ Specifies the time, in milliseconds, to wait before disabling
+ a slave after a link failure has been detected. This option
+ is only valid for the miimon link monitor. The downdelay
+ value should be a multiple of the miimon value; if not, it
+ will be rounded down to the nearest multiple. The default
+ value is 0.
+
+fail_over_mac
+
+ Specifies whether active-backup mode should set all slaves to
+ the same MAC address at enslavement (the traditional
+ behavior), or, when enabled, perform special handling of the
+ bond's MAC address in accordance with the selected policy.
+
+ Possible values are:
+
+ none or 0
+
+ This setting disables fail_over_mac, and causes
+ bonding to set all slaves of an active-backup bond to
+ the same MAC address at enslavement time. This is the
+ default.
+
+ active or 1
+
+ The "active" fail_over_mac policy indicates that the
+ MAC address of the bond should always be the MAC
+ address of the currently active slave. The MAC
+ address of the slaves is not changed; instead, the MAC
+ address of the bond changes during a failover.
+
+ This policy is useful for devices that cannot ever
+ alter their MAC address, or for devices that refuse
+ incoming broadcasts with their own source MAC (which
+ interferes with the ARP monitor).
+
+ The down side of this policy is that every device on
+ the network must be updated via gratuitous ARP,
+ vs. just updating a switch or set of switches (which
+ often takes place for any traffic, not just ARP
+ traffic, if the switch snoops incoming traffic to
+ update its tables) for the traditional method. If the
+ gratuitous ARP is lost, communication may be
+ disrupted.
+
+ When this policy is used in conjunction with the mii
+ monitor, devices which assert link up prior to being
+ able to actually transmit and receive are particularly
+ susceptible to loss of the gratuitous ARP, and an
+ appropriate updelay setting may be required.
+
+ follow or 2
+
+ The "follow" fail_over_mac policy causes the MAC
+ address of the bond to be selected normally (normally
+ the MAC address of the first slave added to the bond).
+ However, the second and subsequent slaves are not set
+ to this MAC address while they are in a backup role; a
+ slave is programmed with the bond's MAC address at
+ failover time (and the formerly active slave receives
+ the newly active slave's MAC address).
+
+ This policy is useful for multiport devices that
+ either become confused or incur a performance penalty
+ when multiple ports are programmed with the same MAC
+ address.
+
+
+ The default policy is none, unless the first slave cannot
+ change its MAC address, in which case the active policy is
+ selected by default.
+
+ This option may be modified via sysfs only when no slaves are
+ present in the bond.
+
+ This option was added in bonding version 3.2.0. The "follow"
+ policy was added in bonding version 3.3.0.
+
+lacp_active
+ Option specifying whether to send LACPDU frames periodically.
+
+ off or 0
+ LACPDU frames acts as "speak when spoken to".
+
+ on or 1
+ LACPDU frames are sent along the configured links
+ periodically. See lacp_rate for more details.
+
+ The default is on.
+
+lacp_rate
+
+ Option specifying the rate in which we'll ask our link partner
+ to transmit LACPDU packets in 802.3ad mode. Possible values
+ are:
+
+ slow or 0
+ Request partner to transmit LACPDUs every 30 seconds
+
+ fast or 1
+ Request partner to transmit LACPDUs every 1 second
+
+ The default is slow.
+
+max_bonds
+
+ Specifies the number of bonding devices to create for this
+ instance of the bonding driver. E.g., if max_bonds is 3, and
+ the bonding driver is not already loaded, then bond0, bond1
+ and bond2 will be created. The default value is 1. Specifying
+ a value of 0 will load bonding, but will not create any devices.
+
+miimon
+
+ Specifies the MII link monitoring frequency in milliseconds.
+ This determines how often the link state of each slave is
+ inspected for link failures. A value of zero disables MII
+ link monitoring. A value of 100 is a good starting point.
+ The use_carrier option, below, affects how the link state is
+ determined. See the High Availability section for additional
+ information. The default value is 100 if arp_interval is not
+ set.
+
+min_links
+
+ Specifies the minimum number of links that must be active before
+ asserting carrier. It is similar to the Cisco EtherChannel min-links
+ feature. This allows setting the minimum number of member ports that
+ must be up (link-up state) before marking the bond device as up
+ (carrier on). This is useful for situations where higher level services
+ such as clustering want to ensure a minimum number of low bandwidth
+ links are active before switchover. This option only affect 802.3ad
+ mode.
+
+ The default value is 0. This will cause carrier to be asserted (for
+ 802.3ad mode) whenever there is an active aggregator, regardless of the
+ number of available links in that aggregator. Note that, because an
+ aggregator cannot be active without at least one available link,
+ setting this option to 0 or to 1 has the exact same effect.
+
+mode
+
+ Specifies one of the bonding policies. The default is
+ balance-rr (round robin). Possible values are:
+
+ balance-rr or 0
+
+ Round-robin policy: Transmit packets in sequential
+ order from the first available slave through the
+ last. This mode provides load balancing and fault
+ tolerance.
+
+ active-backup or 1
+
+ Active-backup policy: Only one slave in the bond is
+ active. A different slave becomes active if, and only
+ if, the active slave fails. The bond's MAC address is
+ externally visible on only one port (network adapter)
+ to avoid confusing the switch.
+
+ In bonding version 2.6.2 or later, when a failover
+ occurs in active-backup mode, bonding will issue one
+ or more gratuitous ARPs on the newly active slave.
+ One gratuitous ARP is issued for the bonding master
+ interface and each VLAN interfaces configured above
+ it, provided that the interface has at least one IP
+ address configured. Gratuitous ARPs issued for VLAN
+ interfaces are tagged with the appropriate VLAN id.
+
+ This mode provides fault tolerance. The primary
+ option, documented below, affects the behavior of this
+ mode.
+
+ balance-xor or 2
+
+ XOR policy: Transmit based on the selected transmit
+ hash policy. The default policy is a simple [(source
+ MAC address XOR'd with destination MAC address XOR
+ packet type ID) modulo slave count]. Alternate transmit
+ policies may be selected via the xmit_hash_policy option,
+ described below.
+
+ This mode provides load balancing and fault tolerance.
+
+ broadcast or 3
+
+ Broadcast policy: transmits everything on all slave
+ interfaces. This mode provides fault tolerance.
+
+ 802.3ad or 4
+
+ IEEE 802.3ad Dynamic link aggregation. Creates
+ aggregation groups that share the same speed and
+ duplex settings. Utilizes all slaves in the active
+ aggregator according to the 802.3ad specification.
+
+ Slave selection for outgoing traffic is done according
+ to the transmit hash policy, which may be changed from
+ the default simple XOR policy via the xmit_hash_policy
+ option, documented below. Note that not all transmit
+ policies may be 802.3ad compliant, particularly in
+ regards to the packet mis-ordering requirements of
+ section 43.2.4 of the 802.3ad standard. Differing
+ peer implementations will have varying tolerances for
+ noncompliance.
+
+ Prerequisites:
+
+ 1. Ethtool support in the base drivers for retrieving
+ the speed and duplex of each slave.
+
+ 2. A switch that supports IEEE 802.3ad Dynamic link
+ aggregation.
+
+ Most switches will require some type of configuration
+ to enable 802.3ad mode.
+
+ balance-tlb or 5
+
+ Adaptive transmit load balancing: channel bonding that
+ does not require any special switch support.
+
+ In tlb_dynamic_lb=1 mode; the outgoing traffic is
+ distributed according to the current load (computed
+ relative to the speed) on each slave.
+
+ In tlb_dynamic_lb=0 mode; the load balancing based on
+ current load is disabled and the load is distributed
+ only using the hash distribution.
+
+ Incoming traffic is received by the current slave.
+ If the receiving slave fails, another slave takes over
+ the MAC address of the failed receiving slave.
+
+ Prerequisite:
+
+ Ethtool support in the base drivers for retrieving the
+ speed of each slave.
+
+ balance-alb or 6
+
+ Adaptive load balancing: includes balance-tlb plus
+ receive load balancing (rlb) for IPV4 traffic, and
+ does not require any special switch support. The
+ receive load balancing is achieved by ARP negotiation.
+ The bonding driver intercepts the ARP Replies sent by
+ the local system on their way out and overwrites the
+ source hardware address with the unique hardware
+ address of one of the slaves in the bond such that
+ different peers use different hardware addresses for
+ the server.
+
+ Receive traffic from connections created by the server
+ is also balanced. When the local system sends an ARP
+ Request the bonding driver copies and saves the peer's
+ IP information from the ARP packet. When the ARP
+ Reply arrives from the peer, its hardware address is
+ retrieved and the bonding driver initiates an ARP
+ reply to this peer assigning it to one of the slaves
+ in the bond. A problematic outcome of using ARP
+ negotiation for balancing is that each time that an
+ ARP request is broadcast it uses the hardware address
+ of the bond. Hence, peers learn the hardware address
+ of the bond and the balancing of receive traffic
+ collapses to the current slave. This is handled by
+ sending updates (ARP Replies) to all the peers with
+ their individually assigned hardware address such that
+ the traffic is redistributed. Receive traffic is also
+ redistributed when a new slave is added to the bond
+ and when an inactive slave is re-activated. The
+ receive load is distributed sequentially (round robin)
+ among the group of highest speed slaves in the bond.
+
+ When a link is reconnected or a new slave joins the
+ bond the receive traffic is redistributed among all
+ active slaves in the bond by initiating ARP Replies
+ with the selected MAC address to each of the
+ clients. The updelay parameter (detailed below) must
+ be set to a value equal or greater than the switch's
+ forwarding delay so that the ARP Replies sent to the
+ peers will not be blocked by the switch.
+
+ Prerequisites:
+
+ 1. Ethtool support in the base drivers for retrieving
+ the speed of each slave.
+
+ 2. Base driver support for setting the hardware
+ address of a device while it is open. This is
+ required so that there will always be one slave in the
+ team using the bond hardware address (the
+ curr_active_slave) while having a unique hardware
+ address for each slave in the bond. If the
+ curr_active_slave fails its hardware address is
+ swapped with the new curr_active_slave that was
+ chosen.
+
+num_grat_arp,
+num_unsol_na
+
+ Specify the number of peer notifications (gratuitous ARPs and
+ unsolicited IPv6 Neighbor Advertisements) to be issued after a
+ failover event. As soon as the link is up on the new slave
+ (possibly immediately) a peer notification is sent on the
+ bonding device and each VLAN sub-device. This is repeated at
+ the rate specified by peer_notif_delay if the number is
+ greater than 1.
+
+ The valid range is 0 - 255; the default value is 1. These options
+ affect only the active-backup mode. These options were added for
+ bonding versions 3.3.0 and 3.4.0 respectively.
+
+ From Linux 3.0 and bonding version 3.7.1, these notifications
+ are generated by the ipv4 and ipv6 code and the numbers of
+ repetitions cannot be set independently.
+
+packets_per_slave
+
+ Specify the number of packets to transmit through a slave before
+ moving to the next one. When set to 0 then a slave is chosen at
+ random.
+
+ The valid range is 0 - 65535; the default value is 1. This option
+ has effect only in balance-rr mode.
+
+peer_notif_delay
+
+ Specify the delay, in milliseconds, between each peer
+ notification (gratuitous ARP and unsolicited IPv6 Neighbor
+ Advertisement) when they are issued after a failover event.
+ This delay should be a multiple of the MII link monitor interval
+ (miimon).
+
+ The valid range is 0 - 300000. The default value is 0, which means
+ to match the value of the MII link monitor interval.
+
+prio
+ Slave priority. A higher number means higher priority.
+ The primary slave has the highest priority. This option also
+ follows the primary_reselect rules.
+
+ This option could only be configured via netlink, and is only valid
+ for active-backup(1), balance-tlb (5) and balance-alb (6) mode.
+ The valid value range is a signed 32 bit integer.
+
+ The default value is 0.
+
+primary
+
+ A string (eth0, eth2, etc) specifying which slave is the
+ primary device. The specified device will always be the
+ active slave while it is available. Only when the primary is
+ off-line will alternate devices be used. This is useful when
+ one slave is preferred over another, e.g., when one slave has
+ higher throughput than another.
+
+ The primary option is only valid for active-backup(1),
+ balance-tlb (5) and balance-alb (6) mode.
+
+primary_reselect
+
+ Specifies the reselection policy for the primary slave. This
+ affects how the primary slave is chosen to become the active slave
+ when failure of the active slave or recovery of the primary slave
+ occurs. This option is designed to prevent flip-flopping between
+ the primary slave and other slaves. Possible values are:
+
+ always or 0 (default)
+
+ The primary slave becomes the active slave whenever it
+ comes back up.
+
+ better or 1
+
+ The primary slave becomes the active slave when it comes
+ back up, if the speed and duplex of the primary slave is
+ better than the speed and duplex of the current active
+ slave.
+
+ failure or 2
+
+ The primary slave becomes the active slave only if the
+ current active slave fails and the primary slave is up.
+
+ The primary_reselect setting is ignored in two cases:
+
+ If no slaves are active, the first slave to recover is
+ made the active slave.
+
+ When initially enslaved, the primary slave is always made
+ the active slave.
+
+ Changing the primary_reselect policy via sysfs will cause an
+ immediate selection of the best active slave according to the new
+ policy. This may or may not result in a change of the active
+ slave, depending upon the circumstances.
+
+ This option was added for bonding version 3.6.0.
+
+tlb_dynamic_lb
+
+ Specifies if dynamic shuffling of flows is enabled in tlb
+ or alb mode. The value has no effect on any other modes.
+
+ The default behavior of tlb mode is to shuffle active flows across
+ slaves based on the load in that interval. This gives nice lb
+ characteristics but can cause packet reordering. If re-ordering is
+ a concern use this variable to disable flow shuffling and rely on
+ load balancing provided solely by the hash distribution.
+ xmit-hash-policy can be used to select the appropriate hashing for
+ the setup.
+
+ The sysfs entry can be used to change the setting per bond device
+ and the initial value is derived from the module parameter. The
+ sysfs entry is allowed to be changed only if the bond device is
+ down.
+
+ The default value is "1" that enables flow shuffling while value "0"
+ disables it. This option was added in bonding driver 3.7.1
+
+
+updelay
+
+ Specifies the time, in milliseconds, to wait before enabling a
+ slave after a link recovery has been detected. This option is
+ only valid for the miimon link monitor. The updelay value
+ should be a multiple of the miimon value; if not, it will be
+ rounded down to the nearest multiple. The default value is 0.
+
+use_carrier
+
+ Specifies whether or not miimon should use MII or ETHTOOL
+ ioctls vs. netif_carrier_ok() to determine the link
+ status. The MII or ETHTOOL ioctls are less efficient and
+ utilize a deprecated calling sequence within the kernel. The
+ netif_carrier_ok() relies on the device driver to maintain its
+ state with netif_carrier_on/off; at this writing, most, but
+ not all, device drivers support this facility.
+
+ If bonding insists that the link is up when it should not be,
+ it may be that your network device driver does not support
+ netif_carrier_on/off. The default state for netif_carrier is
+ "carrier on," so if a driver does not support netif_carrier,
+ it will appear as if the link is always up. In this case,
+ setting use_carrier to 0 will cause bonding to revert to the
+ MII / ETHTOOL ioctl method to determine the link state.
+
+ A value of 1 enables the use of netif_carrier_ok(), a value of
+ 0 will use the deprecated MII / ETHTOOL ioctls. The default
+ value is 1.
+
+xmit_hash_policy
+
+ Selects the transmit hash policy to use for slave selection in
+ balance-xor, 802.3ad, and tlb modes. Possible values are:
+
+ layer2
+
+ Uses XOR of hardware MAC addresses and packet type ID
+ field to generate the hash. The formula is
+
+ hash = source MAC[5] XOR destination MAC[5] XOR packet type ID
+ slave number = hash modulo slave count
+
+ This algorithm will place all traffic to a particular
+ network peer on the same slave.
+
+ This algorithm is 802.3ad compliant.
+
+ layer2+3
+
+ This policy uses a combination of layer2 and layer3
+ protocol information to generate the hash.
+
+ Uses XOR of hardware MAC addresses and IP addresses to
+ generate the hash. The formula is
+
+ hash = source MAC[5] XOR destination MAC[5] XOR packet type ID
+ hash = hash XOR source IP XOR destination IP
+ hash = hash XOR (hash RSHIFT 16)
+ hash = hash XOR (hash RSHIFT 8)
+ And then hash is reduced modulo slave count.
+
+ If the protocol is IPv6 then the source and destination
+ addresses are first hashed using ipv6_addr_hash.
+
+ This algorithm will place all traffic to a particular
+ network peer on the same slave. For non-IP traffic,
+ the formula is the same as for the layer2 transmit
+ hash policy.
+
+ This policy is intended to provide a more balanced
+ distribution of traffic than layer2 alone, especially
+ in environments where a layer3 gateway device is
+ required to reach most destinations.
+
+ This algorithm is 802.3ad compliant.
+
+ layer3+4
+
+ This policy uses upper layer protocol information,
+ when available, to generate the hash. This allows for
+ traffic to a particular network peer to span multiple
+ slaves, although a single connection will not span
+ multiple slaves.
+
+ The formula for unfragmented TCP and UDP packets is
+
+ hash = source port, destination port (as in the header)
+ hash = hash XOR source IP XOR destination IP
+ hash = hash XOR (hash RSHIFT 16)
+ hash = hash XOR (hash RSHIFT 8)
+ hash = hash RSHIFT 1
+ And then hash is reduced modulo slave count.
+
+ If the protocol is IPv6 then the source and destination
+ addresses are first hashed using ipv6_addr_hash.
+
+ For fragmented TCP or UDP packets and all other IPv4 and
+ IPv6 protocol traffic, the source and destination port
+ information is omitted. For non-IP traffic, the
+ formula is the same as for the layer2 transmit hash
+ policy.
+
+ This algorithm is not fully 802.3ad compliant. A
+ single TCP or UDP conversation containing both
+ fragmented and unfragmented packets will see packets
+ striped across two interfaces. This may result in out
+ of order delivery. Most traffic types will not meet
+ this criteria, as TCP rarely fragments traffic, and
+ most UDP traffic is not involved in extended
+ conversations. Other implementations of 802.3ad may
+ or may not tolerate this noncompliance.
+
+ encap2+3
+
+ This policy uses the same formula as layer2+3 but it
+ relies on skb_flow_dissect to obtain the header fields
+ which might result in the use of inner headers if an
+ encapsulation protocol is used. For example this will
+ improve the performance for tunnel users because the
+ packets will be distributed according to the encapsulated
+ flows.
+
+ encap3+4
+
+ This policy uses the same formula as layer3+4 but it
+ relies on skb_flow_dissect to obtain the header fields
+ which might result in the use of inner headers if an
+ encapsulation protocol is used. For example this will
+ improve the performance for tunnel users because the
+ packets will be distributed according to the encapsulated
+ flows.
+
+ vlan+srcmac
+
+ This policy uses a very rudimentary vlan ID and source mac
+ hash to load-balance traffic per-vlan, with failover
+ should one leg fail. The intended use case is for a bond
+ shared by multiple virtual machines, all configured to
+ use their own vlan, to give lacp-like functionality
+ without requiring lacp-capable switching hardware.
+
+ The formula for the hash is simply
+
+ hash = (vlan ID) XOR (source MAC vendor) XOR (source MAC dev)
+
+ The default value is layer2. This option was added in bonding
+ version 2.6.3. In earlier versions of bonding, this parameter
+ does not exist, and the layer2 policy is the only policy. The
+ layer2+3 value was added for bonding version 3.2.2.
+
+resend_igmp
+
+ Specifies the number of IGMP membership reports to be issued after
+ a failover event. One membership report is issued immediately after
+ the failover, subsequent packets are sent in each 200ms interval.
+
+ The valid range is 0 - 255; the default value is 1. A value of 0
+ prevents the IGMP membership report from being issued in response
+ to the failover event.
+
+ This option is useful for bonding modes balance-rr (0), active-backup
+ (1), balance-tlb (5) and balance-alb (6), in which a failover can
+ switch the IGMP traffic from one slave to another. Therefore a fresh
+ IGMP report must be issued to cause the switch to forward the incoming
+ IGMP traffic over the newly selected slave.
+
+ This option was added for bonding version 3.7.0.
+
+lp_interval
+
+ Specifies the number of seconds between instances where the bonding
+ driver sends learning packets to each slaves peer switch.
+
+ The valid range is 1 - 0x7fffffff; the default value is 1. This Option
+ has effect only in balance-tlb and balance-alb modes.
+
+3. Configuring Bonding Devices
+==============================
+
+You can configure bonding using either your distro's network
+initialization scripts, or manually using either iproute2 or the
+sysfs interface. Distros generally use one of three packages for the
+network initialization scripts: initscripts, sysconfig or interfaces.
+Recent versions of these packages have support for bonding, while older
+versions do not.
+
+We will first describe the options for configuring bonding for
+distros using versions of initscripts, sysconfig and interfaces with full
+or partial support for bonding, then provide information on enabling
+bonding without support from the network initialization scripts (i.e.,
+older versions of initscripts or sysconfig).
+
+If you're unsure whether your distro uses sysconfig,
+initscripts or interfaces, or don't know if it's new enough, have no fear.
+Determining this is fairly straightforward.
+
+First, look for a file called interfaces in /etc/network directory.
+If this file is present in your system, then your system use interfaces. See
+Configuration with Interfaces Support.
+
+Else, issue the command::
+
+ $ rpm -qf /sbin/ifup
+
+It will respond with a line of text starting with either
+"initscripts" or "sysconfig," followed by some numbers. This is the
+package that provides your network initialization scripts.
+
+Next, to determine if your installation supports bonding,
+issue the command::
+
+ $ grep ifenslave /sbin/ifup
+
+If this returns any matches, then your initscripts or
+sysconfig has support for bonding.
+
+3.1 Configuration with Sysconfig Support
+----------------------------------------
+
+This section applies to distros using a version of sysconfig
+with bonding support, for example, SuSE Linux Enterprise Server 9.
+
+SuSE SLES 9's networking configuration system does support
+bonding, however, at this writing, the YaST system configuration
+front end does not provide any means to work with bonding devices.
+Bonding devices can be managed by hand, however, as follows.
+
+First, if they have not already been configured, configure the
+slave devices. On SLES 9, this is most easily done by running the
+yast2 sysconfig configuration utility. The goal is for to create an
+ifcfg-id file for each slave device. The simplest way to accomplish
+this is to configure the devices for DHCP (this is only to get the
+file ifcfg-id file created; see below for some issues with DHCP). The
+name of the configuration file for each device will be of the form::
+
+ ifcfg-id-xx:xx:xx:xx:xx:xx
+
+Where the "xx" portion will be replaced with the digits from
+the device's permanent MAC address.
+
+Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been
+created, it is necessary to edit the configuration files for the slave
+devices (the MAC addresses correspond to those of the slave devices).
+Before editing, the file will contain multiple lines, and will look
+something like this::
+
+ BOOTPROTO='dhcp'
+ STARTMODE='on'
+ USERCTL='no'
+ UNIQUE='XNzu.WeZGOGF+4wE'
+ _nm_name='bus-pci-0001:61:01.0'
+
+Change the BOOTPROTO and STARTMODE lines to the following::
+
+ BOOTPROTO='none'
+ STARTMODE='off'
+
+Do not alter the UNIQUE or _nm_name lines. Remove any other
+lines (USERCTL, etc).
+
+Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified,
+it's time to create the configuration file for the bonding device
+itself. This file is named ifcfg-bondX, where X is the number of the
+bonding device to create, starting at 0. The first such file is
+ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig
+network configuration system will correctly start multiple instances
+of bonding.
+
+The contents of the ifcfg-bondX file is as follows::
+
+ BOOTPROTO="static"
+ BROADCAST="10.0.2.255"
+ IPADDR="10.0.2.10"
+ NETMASK="255.255.0.0"
+ NETWORK="10.0.2.0"
+ REMOTE_IPADDR=""
+ STARTMODE="onboot"
+ BONDING_MASTER="yes"
+ BONDING_MODULE_OPTS="mode=active-backup miimon=100"
+ BONDING_SLAVE0="eth0"
+ BONDING_SLAVE1="bus-pci-0000:06:08.1"
+
+Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK
+values with the appropriate values for your network.
+
+The STARTMODE specifies when the device is brought online.
+The possible values are:
+
+ ======== ======================================================
+ onboot The device is started at boot time. If you're not
+ sure, this is probably what you want.
+
+ manual The device is started only when ifup is called
+ manually. Bonding devices may be configured this
+ way if you do not wish them to start automatically
+ at boot for some reason.
+
+ hotplug The device is started by a hotplug event. This is not
+ a valid choice for a bonding device.
+
+ off or The device configuration is ignored.
+ ignore
+ ======== ======================================================
+
+The line BONDING_MASTER='yes' indicates that the device is a
+bonding master device. The only useful value is "yes."
+
+The contents of BONDING_MODULE_OPTS are supplied to the
+instance of the bonding module for this device. Specify the options
+for the bonding mode, link monitoring, and so on here. Do not include
+the max_bonds bonding parameter; this will confuse the configuration
+system if you have multiple bonding devices.
+
+Finally, supply one BONDING_SLAVEn="slave device" for each
+slave. where "n" is an increasing value, one for each slave. The
+"slave device" is either an interface name, e.g., "eth0", or a device
+specifier for the network device. The interface name is easier to
+find, but the ethN names are subject to change at boot time if, e.g.,
+a device early in the sequence has failed. The device specifiers
+(bus-pci-0000:06:08.1 in the example above) specify the physical
+network device, and will not change unless the device's bus location
+changes (for example, it is moved from one PCI slot to another). The
+example above uses one of each type for demonstration purposes; most
+configurations will choose one or the other for all slave devices.
+
+When all configuration files have been modified or created,
+networking must be restarted for the configuration changes to take
+effect. This can be accomplished via the following::
+
+ # /etc/init.d/network restart
+
+Note that the network control script (/sbin/ifdown) will
+remove the bonding module as part of the network shutdown processing,
+so it is not necessary to remove the module by hand if, e.g., the
+module parameters have changed.
+
+Also, at this writing, YaST/YaST2 will not manage bonding
+devices (they do not show bonding interfaces on its list of network
+devices). It is necessary to edit the configuration file by hand to
+change the bonding configuration.
+
+Additional general options and details of the ifcfg file
+format can be found in an example ifcfg template file::
+
+ /etc/sysconfig/network/ifcfg.template
+
+Note that the template does not document the various ``BONDING_*``
+settings described above, but does describe many of the other options.
+
+3.1.1 Using DHCP with Sysconfig
+-------------------------------
+
+Under sysconfig, configuring a device with BOOTPROTO='dhcp'
+will cause it to query DHCP for its IP address information. At this
+writing, this does not function for bonding devices; the scripts
+attempt to obtain the device address from DHCP prior to adding any of
+the slave devices. Without active slaves, the DHCP requests are not
+sent to the network.
+
+3.1.2 Configuring Multiple Bonds with Sysconfig
+-----------------------------------------------
+
+The sysconfig network initialization system is capable of
+handling multiple bonding devices. All that is necessary is for each
+bonding instance to have an appropriately configured ifcfg-bondX file
+(as described above). Do not specify the "max_bonds" parameter to any
+instance of bonding, as this will confuse sysconfig. If you require
+multiple bonding devices with identical parameters, create multiple
+ifcfg-bondX files.
+
+Because the sysconfig scripts supply the bonding module
+options in the ifcfg-bondX file, it is not necessary to add them to
+the system ``/etc/modules.d/*.conf`` configuration files.
+
+3.2 Configuration with Initscripts Support
+------------------------------------------
+
+This section applies to distros using a recent version of
+initscripts with bonding support, for example, Red Hat Enterprise Linux
+version 3 or later, Fedora, etc. On these systems, the network
+initialization scripts have knowledge of bonding, and can be configured to
+control bonding devices. Note that older versions of the initscripts
+package have lower levels of support for bonding; this will be noted where
+applicable.
+
+These distros will not automatically load the network adapter
+driver unless the ethX device is configured with an IP address.
+Because of this constraint, users must manually configure a
+network-script file for all physical adapters that will be members of
+a bondX link. Network script files are located in the directory:
+
+/etc/sysconfig/network-scripts
+
+The file name must be prefixed with "ifcfg-eth" and suffixed
+with the adapter's physical adapter number. For example, the script
+for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0.
+Place the following text in the file::
+
+ DEVICE=eth0
+ USERCTL=no
+ ONBOOT=yes
+ MASTER=bond0
+ SLAVE=yes
+ BOOTPROTO=none
+
+The DEVICE= line will be different for every ethX device and
+must correspond with the name of the file, i.e., ifcfg-eth1 must have
+a device line of DEVICE=eth1. The setting of the MASTER= line will
+also depend on the final bonding interface name chosen for your bond.
+As with other network devices, these typically start at 0, and go up
+one for each device, i.e., the first bonding instance is bond0, the
+second is bond1, and so on.
+
+Next, create a bond network script. The file name for this
+script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is
+the number of the bond. For bond0 the file is named "ifcfg-bond0",
+for bond1 it is named "ifcfg-bond1", and so on. Within that file,
+place the following text::
+
+ DEVICE=bond0
+ IPADDR=192.168.1.1
+ NETMASK=255.255.255.0
+ NETWORK=192.168.1.0
+ BROADCAST=192.168.1.255
+ ONBOOT=yes
+ BOOTPROTO=none
+ USERCTL=no
+
+Be sure to change the networking specific lines (IPADDR,
+NETMASK, NETWORK and BROADCAST) to match your network configuration.
+
+For later versions of initscripts, such as that found with Fedora
+7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible,
+and, indeed, preferable, to specify the bonding options in the ifcfg-bond0
+file, e.g. a line of the format::
+
+ BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254"
+
+will configure the bond with the specified options. The options
+specified in BONDING_OPTS are identical to the bonding module parameters
+except for the arp_ip_target field when using versions of initscripts older
+than and 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When
+using older versions each target should be included as a separate option and
+should be preceded by a '+' to indicate it should be added to the list of
+queried targets, e.g.,::
+
+ arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2
+
+is the proper syntax to specify multiple targets. When specifying
+options via BONDING_OPTS, it is not necessary to edit
+``/etc/modprobe.d/*.conf``.
+
+For even older versions of initscripts that do not support
+BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf, depending upon
+your distro) to load the bonding module with your desired options when the
+bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf
+will load the bonding module, and select its options:
+
+ alias bond0 bonding
+ options bond0 mode=balance-alb miimon=100
+
+Replace the sample parameters with the appropriate set of
+options for your configuration.
+
+Finally run "/etc/rc.d/init.d/network restart" as root. This
+will restart the networking subsystem and your bond link should be now
+up and running.
+
+3.2.1 Using DHCP with Initscripts
+---------------------------------
+
+Recent versions of initscripts (the versions supplied with Fedora
+Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to
+work) have support for assigning IP information to bonding devices via
+DHCP.
+
+To configure bonding for DHCP, configure it as described
+above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp"
+and add a line consisting of "TYPE=Bonding". Note that the TYPE value
+is case sensitive.
+
+3.2.2 Configuring Multiple Bonds with Initscripts
+-------------------------------------------------
+
+Initscripts packages that are included with Fedora 7 and Red Hat
+Enterprise Linux 5 support multiple bonding interfaces by simply
+specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the
+number of the bond. This support requires sysfs support in the kernel,
+and a bonding driver of version 3.0.0 or later. Other configurations may
+not support this method for specifying multiple bonding interfaces; for
+those instances, see the "Configuring Multiple Bonds Manually" section,
+below.
+
+3.3 Configuring Bonding Manually with iproute2
+-----------------------------------------------
+
+This section applies to distros whose network initialization
+scripts (the sysconfig or initscripts package) do not have specific
+knowledge of bonding. One such distro is SuSE Linux Enterprise Server
+version 8.
+
+The general method for these systems is to place the bonding
+module parameters into a config file in /etc/modprobe.d/ (as
+appropriate for the installed distro), then add modprobe and/or
+`ip link` commands to the system's global init script. The name of
+the global init script differs; for sysconfig, it is
+/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local.
+
+For example, if you wanted to make a simple bond of two e100
+devices (presumed to be eth0 and eth1), and have it persist across
+reboots, edit the appropriate file (/etc/init.d/boot.local or
+/etc/rc.d/rc.local), and add the following::
+
+ modprobe bonding mode=balance-alb miimon=100
+ modprobe e100
+ ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
+ ip link set eth0 master bond0
+ ip link set eth1 master bond0
+
+Replace the example bonding module parameters and bond0
+network configuration (IP address, netmask, etc) with the appropriate
+values for your configuration.
+
+Unfortunately, this method will not provide support for the
+ifup and ifdown scripts on the bond devices. To reload the bonding
+configuration, it is necessary to run the initialization script, e.g.,::
+
+ # /etc/init.d/boot.local
+
+or::
+
+ # /etc/rc.d/rc.local
+
+It may be desirable in such a case to create a separate script
+which only initializes the bonding configuration, then call that
+separate script from within boot.local. This allows for bonding to be
+enabled without re-running the entire global init script.
+
+To shut down the bonding devices, it is necessary to first
+mark the bonding device itself as being down, then remove the
+appropriate device driver modules. For our example above, you can do
+the following::
+
+ # ifconfig bond0 down
+ # rmmod bonding
+ # rmmod e100
+
+Again, for convenience, it may be desirable to create a script
+with these commands.
+
+
+3.3.1 Configuring Multiple Bonds Manually
+-----------------------------------------
+
+This section contains information on configuring multiple
+bonding devices with differing options for those systems whose network
+initialization scripts lack support for configuring multiple bonds.
+
+If you require multiple bonding devices, but all with the same
+options, you may wish to use the "max_bonds" module parameter,
+documented above.
+
+To create multiple bonding devices with differing options, it is
+preferable to use bonding parameters exported by sysfs, documented in the
+section below.
+
+For versions of bonding without sysfs support, the only means to
+provide multiple instances of bonding with differing options is to load
+the bonding driver multiple times. Note that current versions of the
+sysconfig network initialization scripts handle this automatically; if
+your distro uses these scripts, no special action is needed. See the
+section Configuring Bonding Devices, above, if you're not sure about your
+network initialization scripts.
+
+To load multiple instances of the module, it is necessary to
+specify a different name for each instance (the module loading system
+requires that every loaded module, even multiple instances of the same
+module, have a unique name). This is accomplished by supplying multiple
+sets of bonding options in ``/etc/modprobe.d/*.conf``, for example::
+
+ alias bond0 bonding
+ options bond0 -o bond0 mode=balance-rr miimon=100
+
+ alias bond1 bonding
+ options bond1 -o bond1 mode=balance-alb miimon=50
+
+will load the bonding module two times. The first instance is
+named "bond0" and creates the bond0 device in balance-rr mode with an
+miimon of 100. The second instance is named "bond1" and creates the
+bond1 device in balance-alb mode with an miimon of 50.
+
+In some circumstances (typically with older distributions),
+the above does not work, and the second bonding instance never sees
+its options. In that case, the second options line can be substituted
+as follows::
+
+ install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \
+ mode=balance-alb miimon=50
+
+This may be repeated any number of times, specifying a new and
+unique name in place of bond1 for each subsequent instance.
+
+It has been observed that some Red Hat supplied kernels are unable
+to rename modules at load time (the "-o bond1" part). Attempts to pass
+that option to modprobe will produce an "Operation not permitted" error.
+This has been reported on some Fedora Core kernels, and has been seen on
+RHEL 4 as well. On kernels exhibiting this problem, it will be impossible
+to configure multiple bonds with differing parameters (as they are older
+kernels, and also lack sysfs support).
+
+3.4 Configuring Bonding Manually via Sysfs
+------------------------------------------
+
+Starting with version 3.0.0, Channel Bonding may be configured
+via the sysfs interface. This interface allows dynamic configuration
+of all bonds in the system without unloading the module. It also
+allows for adding and removing bonds at runtime. Ifenslave is no
+longer required, though it is still supported.
+
+Use of the sysfs interface allows you to use multiple bonds
+with different configurations without having to reload the module.
+It also allows you to use multiple, differently configured bonds when
+bonding is compiled into the kernel.
+
+You must have the sysfs filesystem mounted to configure
+bonding this way. The examples in this document assume that you
+are using the standard mount point for sysfs, e.g. /sys. If your
+sysfs filesystem is mounted elsewhere, you will need to adjust the
+example paths accordingly.
+
+Creating and Destroying Bonds
+-----------------------------
+To add a new bond foo::
+
+ # echo +foo > /sys/class/net/bonding_masters
+
+To remove an existing bond bar::
+
+ # echo -bar > /sys/class/net/bonding_masters
+
+To show all existing bonds::
+
+ # cat /sys/class/net/bonding_masters
+
+.. note::
+
+ due to 4K size limitation of sysfs files, this list may be
+ truncated if you have more than a few hundred bonds. This is unlikely
+ to occur under normal operating conditions.
+
+Adding and Removing Slaves
+--------------------------
+Interfaces may be enslaved to a bond using the file
+/sys/class/net/<bond>/bonding/slaves. The semantics for this file
+are the same as for the bonding_masters file.
+
+To enslave interface eth0 to bond bond0::
+
+ # ifconfig bond0 up
+ # echo +eth0 > /sys/class/net/bond0/bonding/slaves
+
+To free slave eth0 from bond bond0::
+
+ # echo -eth0 > /sys/class/net/bond0/bonding/slaves
+
+When an interface is enslaved to a bond, symlinks between the
+two are created in the sysfs filesystem. In this case, you would get
+/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and
+/sys/class/net/eth0/master pointing to /sys/class/net/bond0.
+
+This means that you can tell quickly whether or not an
+interface is enslaved by looking for the master symlink. Thus:
+# echo -eth0 > /sys/class/net/eth0/master/bonding/slaves
+will free eth0 from whatever bond it is enslaved to, regardless of
+the name of the bond interface.
+
+Changing a Bond's Configuration
+-------------------------------
+Each bond may be configured individually by manipulating the
+files located in /sys/class/net/<bond name>/bonding
+
+The names of these files correspond directly with the command-
+line parameters described elsewhere in this file, and, with the
+exception of arp_ip_target, they accept the same values. To see the
+current setting, simply cat the appropriate file.
+
+A few examples will be given here; for specific usage
+guidelines for each parameter, see the appropriate section in this
+document.
+
+To configure bond0 for balance-alb mode::
+
+ # ifconfig bond0 down
+ # echo 6 > /sys/class/net/bond0/bonding/mode
+ - or -
+ # echo balance-alb > /sys/class/net/bond0/bonding/mode
+
+.. note::
+
+ The bond interface must be down before the mode can be changed.
+
+To enable MII monitoring on bond0 with a 1 second interval::
+
+ # echo 1000 > /sys/class/net/bond0/bonding/miimon
+
+.. note::
+
+ If ARP monitoring is enabled, it will disabled when MII
+ monitoring is enabled, and vice-versa.
+
+To add ARP targets::
+
+ # echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
+ # echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target
+
+.. note::
+
+ up to 16 target addresses may be specified.
+
+To remove an ARP target::
+
+ # echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
+
+To configure the interval between learning packet transmits::
+
+ # echo 12 > /sys/class/net/bond0/bonding/lp_interval
+
+.. note::
+
+ the lp_interval is the number of seconds between instances where
+ the bonding driver sends learning packets to each slaves peer switch. The
+ default interval is 1 second.
+
+Example Configuration
+---------------------
+We begin with the same example that is shown in section 3.3,
+executed with sysfs, and without using ifenslave.
+
+To make a simple bond of two e100 devices (presumed to be eth0
+and eth1), and have it persist across reboots, edit the appropriate
+file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the
+following::
+
+ modprobe bonding
+ modprobe e100
+ echo balance-alb > /sys/class/net/bond0/bonding/mode
+ ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
+ echo 100 > /sys/class/net/bond0/bonding/miimon
+ echo +eth0 > /sys/class/net/bond0/bonding/slaves
+ echo +eth1 > /sys/class/net/bond0/bonding/slaves
+
+To add a second bond, with two e1000 interfaces in
+active-backup mode, using ARP monitoring, add the following lines to
+your init script::
+
+ modprobe e1000
+ echo +bond1 > /sys/class/net/bonding_masters
+ echo active-backup > /sys/class/net/bond1/bonding/mode
+ ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up
+ echo +192.168.2.100 /sys/class/net/bond1/bonding/arp_ip_target
+ echo 2000 > /sys/class/net/bond1/bonding/arp_interval
+ echo +eth2 > /sys/class/net/bond1/bonding/slaves
+ echo +eth3 > /sys/class/net/bond1/bonding/slaves
+
+3.5 Configuration with Interfaces Support
+-----------------------------------------
+
+This section applies to distros which use /etc/network/interfaces file
+to describe network interface configuration, most notably Debian and its
+derivatives.
+
+The ifup and ifdown commands on Debian don't support bonding out of
+the box. The ifenslave-2.6 package should be installed to provide bonding
+support. Once installed, this package will provide ``bond-*`` options
+to be used into /etc/network/interfaces.
+
+Note that ifenslave-2.6 package will load the bonding module and use
+the ifenslave command when appropriate.
+
+Example Configurations
+----------------------
+
+In /etc/network/interfaces, the following stanza will configure bond0, in
+active-backup mode, with eth0 and eth1 as slaves::
+
+ auto bond0
+ iface bond0 inet dhcp
+ bond-slaves eth0 eth1
+ bond-mode active-backup
+ bond-miimon 100
+ bond-primary eth0 eth1
+
+If the above configuration doesn't work, you might have a system using
+upstart for system startup. This is most notably true for recent
+Ubuntu versions. The following stanza in /etc/network/interfaces will
+produce the same result on those systems::
+
+ auto bond0
+ iface bond0 inet dhcp
+ bond-slaves none
+ bond-mode active-backup
+ bond-miimon 100
+
+ auto eth0
+ iface eth0 inet manual
+ bond-master bond0
+ bond-primary eth0 eth1
+
+ auto eth1
+ iface eth1 inet manual
+ bond-master bond0
+ bond-primary eth0 eth1
+
+For a full list of ``bond-*`` supported options in /etc/network/interfaces and
+some more advanced examples tailored to you particular distros, see the files in
+/usr/share/doc/ifenslave-2.6.
+
+3.6 Overriding Configuration for Special Cases
+----------------------------------------------
+
+When using the bonding driver, the physical port which transmits a frame is
+typically selected by the bonding driver, and is not relevant to the user or
+system administrator. The output port is simply selected using the policies of
+the selected bonding mode. On occasion however, it is helpful to direct certain
+classes of traffic to certain physical interfaces on output to implement
+slightly more complex policies. For example, to reach a web server over a
+bonded interface in which eth0 connects to a private network, while eth1
+connects via a public network, it may be desirous to bias the bond to send said
+traffic over eth0 first, using eth1 only as a fall back, while all other traffic
+can safely be sent over either interface. Such configurations may be achieved
+using the traffic control utilities inherent in linux.
+
+By default the bonding driver is multiqueue aware and 16 queues are created
+when the driver initializes (see Documentation/networking/multiqueue.rst
+for details). If more or less queues are desired the module parameter
+tx_queues can be used to change this value. There is no sysfs parameter
+available as the allocation is done at module init time.
+
+The output of the file /proc/net/bonding/bondX has changed so the output Queue
+ID is now printed for each slave::
+
+ Bonding Mode: fault-tolerance (active-backup)
+ Primary Slave: None
+ Currently Active Slave: eth0
+ MII Status: up
+ MII Polling Interval (ms): 0
+ Up Delay (ms): 0
+ Down Delay (ms): 0
+
+ Slave Interface: eth0
+ MII Status: up
+ Link Failure Count: 0
+ Permanent HW addr: 00:1a:a0:12:8f:cb
+ Slave queue ID: 0
+
+ Slave Interface: eth1
+ MII Status: up
+ Link Failure Count: 0
+ Permanent HW addr: 00:1a:a0:12:8f:cc
+ Slave queue ID: 2
+
+The queue_id for a slave can be set using the command::
+
+ # echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
+
+Any interface that needs a queue_id set should set it with multiple calls
+like the one above until proper priorities are set for all interfaces. On
+distributions that allow configuration via initscripts, multiple 'queue_id'
+arguments can be added to BONDING_OPTS to set all needed slave queues.
+
+These queue id's can be used in conjunction with the tc utility to configure
+a multiqueue qdisc and filters to bias certain traffic to transmit on certain
+slave devices. For instance, say we wanted, in the above configuration to
+force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output
+device. The following commands would accomplish this::
+
+ # tc qdisc add dev bond0 handle 1 root multiq
+
+ # tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip \
+ dst 192.168.1.100 action skbedit queue_mapping 2
+
+These commands tell the kernel to attach a multiqueue queue discipline to the
+bond0 interface and filter traffic enqueued to it, such that packets with a dst
+ip of 192.168.1.100 have their output queue mapping value overwritten to 2.
+This value is then passed into the driver, causing the normal output path
+selection policy to be overridden, selecting instead qid 2, which maps to eth1.
+
+Note that qid values begin at 1. Qid 0 is reserved to initiate to the driver
+that normal output policy selection should take place. One benefit to simply
+leaving the qid for a slave to 0 is the multiqueue awareness in the bonding
+driver that is now present. This awareness allows tc filters to be placed on
+slave devices as well as bond devices and the bonding driver will simply act as
+a pass-through for selecting output queues on the slave device rather than
+output port selection.
+
+This feature first appeared in bonding driver version 3.7.0 and support for
+output slave selection was limited to round-robin and active-backup modes.
+
+3.7 Configuring LACP for 802.3ad mode in a more secure way
+----------------------------------------------------------
+
+When using 802.3ad bonding mode, the Actor (host) and Partner (switch)
+exchange LACPDUs. These LACPDUs cannot be sniffed, because they are
+destined to link local mac addresses (which switches/bridges are not
+supposed to forward). However, most of the values are easily predictable
+or are simply the machine's MAC address (which is trivially known to all
+other hosts in the same L2). This implies that other machines in the L2
+domain can spoof LACPDU packets from other hosts to the switch and potentially
+cause mayhem by joining (from the point of view of the switch) another
+machine's aggregate, thus receiving a portion of that hosts incoming
+traffic and / or spoofing traffic from that machine themselves (potentially
+even successfully terminating some portion of flows). Though this is not
+a likely scenario, one could avoid this possibility by simply configuring
+few bonding parameters:
+
+ (a) ad_actor_system : You can set a random mac-address that can be used for
+ these LACPDU exchanges. The value can not be either NULL or Multicast.
+ Also it's preferable to set the local-admin bit. Following shell code
+ generates a random mac-address as described above::
+
+ # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
+ $(( (RANDOM & 0xFE) | 0x02 )) \
+ $(( RANDOM & 0xFF )) \
+ $(( RANDOM & 0xFF )) \
+ $(( RANDOM & 0xFF )) \
+ $(( RANDOM & 0xFF )) \
+ $(( RANDOM & 0xFF )))
+ # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
+
+ (b) ad_actor_sys_prio : Randomize the system priority. The default value
+ is 65535, but system can take the value from 1 - 65535. Following shell
+ code generates random priority and sets it::
+
+ # sys_prio=$(( 1 + RANDOM + RANDOM ))
+ # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
+
+ (c) ad_user_port_key : Use the user portion of the port-key. The default
+ keeps this empty. These are the upper 10 bits of the port-key and value
+ ranges from 0 - 1023. Following shell code generates these 10 bits and
+ sets it::
+
+ # usr_port_key=$(( RANDOM & 0x3FF ))
+ # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
+
+
+4 Querying Bonding Configuration
+=================================
+
+4.1 Bonding Configuration
+-------------------------
+
+Each bonding device has a read-only file residing in the
+/proc/net/bonding directory. The file contents include information
+about the bonding configuration, options and state of each slave.
+
+For example, the contents of /proc/net/bonding/bond0 after the
+driver is loaded with parameters of mode=0 and miimon=1000 is
+generally as follows::
+
+ Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004)
+ Bonding Mode: load balancing (round-robin)
+ Currently Active Slave: eth0
+ MII Status: up
+ MII Polling Interval (ms): 1000
+ Up Delay (ms): 0
+ Down Delay (ms): 0
+
+ Slave Interface: eth1
+ MII Status: up
+ Link Failure Count: 1
+
+ Slave Interface: eth0
+ MII Status: up
+ Link Failure Count: 1
+
+The precise format and contents will change depending upon the
+bonding configuration, state, and version of the bonding driver.
+
+4.2 Network configuration
+-------------------------
+
+The network configuration can be inspected using the ifconfig
+command. Bonding devices will have the MASTER flag set; Bonding slave
+devices will have the SLAVE flag set. The ifconfig output does not
+contain information on which slaves are associated with which masters.
+
+In the example below, the bond0 interface is the master
+(MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of
+bond0 have the same MAC address (HWaddr) as bond0 for all modes except
+TLB and ALB that require a unique MAC address for each slave::
+
+ # /sbin/ifconfig
+ bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
+ UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
+ RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
+ collisions:0 txqueuelen:0
+
+ eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
+ RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
+ collisions:0 txqueuelen:100
+ Interrupt:10 Base address:0x1080
+
+ eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
+ RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
+ collisions:0 txqueuelen:100
+ Interrupt:9 Base address:0x1400
+
+5. Switch Configuration
+=======================
+
+For this section, "switch" refers to whatever system the
+bonded devices are directly connected to (i.e., where the other end of
+the cable plugs into). This may be an actual dedicated switch device,
+or it may be another regular system (e.g., another computer running
+Linux),
+
+The active-backup, balance-tlb and balance-alb modes do not
+require any specific configuration of the switch.
+
+The 802.3ad mode requires that the switch have the appropriate
+ports configured as an 802.3ad aggregation. The precise method used
+to configure this varies from switch to switch, but, for example, a
+Cisco 3550 series switch requires that the appropriate ports first be
+grouped together in a single etherchannel instance, then that
+etherchannel is set to mode "lacp" to enable 802.3ad (instead of
+standard EtherChannel).
+
+The balance-rr, balance-xor and broadcast modes generally
+require that the switch have the appropriate ports grouped together.
+The nomenclature for such a group differs between switches, it may be
+called an "etherchannel" (as in the Cisco example, above), a "trunk
+group" or some other similar variation. For these modes, each switch
+will also have its own configuration options for the switch's transmit
+policy to the bond. Typical choices include XOR of either the MAC or
+IP addresses. The transmit policy of the two peers does not need to
+match. For these three modes, the bonding mode really selects a
+transmit policy for an EtherChannel group; all three will interoperate
+with another EtherChannel group.
+
+
+6. 802.1q VLAN Support
+======================
+
+It is possible to configure VLAN devices over a bond interface
+using the 8021q driver. However, only packets coming from the 8021q
+driver and passing through bonding will be tagged by default. Self
+generated packets, for example, bonding's learning packets or ARP
+packets generated by either ALB mode or the ARP monitor mechanism, are
+tagged internally by bonding itself. As a result, bonding must
+"learn" the VLAN IDs configured above it, and use those IDs to tag
+self generated packets.
+
+For reasons of simplicity, and to support the use of adapters
+that can do VLAN hardware acceleration offloading, the bonding
+interface declares itself as fully hardware offloading capable, it gets
+the add_vid/kill_vid notifications to gather the necessary
+information, and it propagates those actions to the slaves. In case
+of mixed adapter types, hardware accelerated tagged packets that
+should go through an adapter that is not offloading capable are
+"un-accelerated" by the bonding driver so the VLAN tag sits in the
+regular location.
+
+VLAN interfaces *must* be added on top of a bonding interface
+only after enslaving at least one slave. The bonding interface has a
+hardware address of 00:00:00:00:00:00 until the first slave is added.
+If the VLAN interface is created prior to the first enslavement, it
+would pick up the all-zeroes hardware address. Once the first slave
+is attached to the bond, the bond device itself will pick up the
+slave's hardware address, which is then available for the VLAN device.
+
+Also, be aware that a similar problem can occur if all slaves
+are released from a bond that still has one or more VLAN interfaces on
+top of it. When a new slave is added, the bonding interface will
+obtain its hardware address from the first slave, which might not
+match the hardware address of the VLAN interfaces (which was
+ultimately copied from an earlier slave).
+
+There are two methods to insure that the VLAN device operates
+with the correct hardware address if all slaves are removed from a
+bond interface:
+
+1. Remove all VLAN interfaces then recreate them
+
+2. Set the bonding interface's hardware address so that it
+matches the hardware address of the VLAN interfaces.
+
+Note that changing a VLAN interface's HW address would set the
+underlying device -- i.e. the bonding interface -- to promiscuous
+mode, which might not be what you want.
+
+
+7. Link Monitoring
+==================
+
+The bonding driver at present supports two schemes for
+monitoring a slave device's link state: the ARP monitor and the MII
+monitor.
+
+At the present time, due to implementation restrictions in the
+bonding driver itself, it is not possible to enable both ARP and MII
+monitoring simultaneously.
+
+7.1 ARP Monitor Operation
+-------------------------
+
+The ARP monitor operates as its name suggests: it sends ARP
+queries to one or more designated peer systems on the network, and
+uses the response as an indication that the link is operating. This
+gives some assurance that traffic is actually flowing to and from one
+or more peers on the local network.
+
+7.2 Configuring Multiple ARP Targets
+------------------------------------
+
+While ARP monitoring can be done with just one target, it can
+be useful in a High Availability setup to have several targets to
+monitor. In the case of just one target, the target itself may go
+down or have a problem making it unresponsive to ARP requests. Having
+an additional target (or several) increases the reliability of the ARP
+monitoring.
+
+Multiple ARP targets must be separated by commas as follows::
+
+ # example options for ARP monitoring with three targets
+ alias bond0 bonding
+ options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9
+
+For just a single target the options would resemble::
+
+ # example options for ARP monitoring with one target
+ alias bond0 bonding
+ options bond0 arp_interval=60 arp_ip_target=192.168.0.100
+
+
+7.3 MII Monitor Operation
+-------------------------
+
+The MII monitor monitors only the carrier state of the local
+network interface. It accomplishes this in one of three ways: by
+depending upon the device driver to maintain its carrier state, by
+querying the device's MII registers, or by making an ethtool query to
+the device.
+
+If the use_carrier module parameter is 1 (the default value),
+then the MII monitor will rely on the driver for carrier state
+information (via the netif_carrier subsystem). As explained in the
+use_carrier parameter information, above, if the MII monitor fails to
+detect carrier loss on the device (e.g., when the cable is physically
+disconnected), it may be that the driver does not support
+netif_carrier.
+
+If use_carrier is 0, then the MII monitor will first query the
+device's (via ioctl) MII registers and check the link state. If that
+request fails (not just that it returns carrier down), then the MII
+monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain
+the same information. If both methods fail (i.e., the driver either
+does not support or had some error in processing both the MII register
+and ethtool requests), then the MII monitor will assume the link is
+up.
+
+8. Potential Sources of Trouble
+===============================
+
+8.1 Adventures in Routing
+-------------------------
+
+When bonding is configured, it is important that the slave
+devices not have routes that supersede routes of the master (or,
+generally, not have routes at all). For example, suppose the bonding
+device bond0 has two slaves, eth0 and eth1, and the routing table is
+as follows::
+
+ Kernel IP routing table
+ Destination Gateway Genmask Flags MSS Window irtt Iface
+ 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0
+ 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1
+ 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0
+ 127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo
+
+This routing configuration will likely still update the
+receive/transmit times in the driver (needed by the ARP monitor), but
+may bypass the bonding driver (because outgoing traffic to, in this
+case, another host on network 10 would use eth0 or eth1 before bond0).
+
+The ARP monitor (and ARP itself) may become confused by this
+configuration, because ARP requests (generated by the ARP monitor)
+will be sent on one interface (bond0), but the corresponding reply
+will arrive on a different interface (eth0). This reply looks to ARP
+as an unsolicited ARP reply (because ARP matches replies on an
+interface basis), and is discarded. The MII monitor is not affected
+by the state of the routing table.
+
+The solution here is simply to insure that slaves do not have
+routes of their own, and if for some reason they must, those routes do
+not supersede routes of their master. This should generally be the
+case, but unusual configurations or errant manual or automatic static
+route additions may cause trouble.
+
+8.2 Ethernet Device Renaming
+----------------------------
+
+On systems with network configuration scripts that do not
+associate physical devices directly with network interface names (so
+that the same physical device always has the same "ethX" name), it may
+be necessary to add some special logic to config files in
+/etc/modprobe.d/.
+
+For example, given a modules.conf containing the following::
+
+ alias bond0 bonding
+ options bond0 mode=some-mode miimon=50
+ alias eth0 tg3
+ alias eth1 tg3
+ alias eth2 e1000
+ alias eth3 e1000
+
+If neither eth0 and eth1 are slaves to bond0, then when the
+bond0 interface comes up, the devices may end up reordered. This
+happens because bonding is loaded first, then its slave device's
+drivers are loaded next. Since no other drivers have been loaded,
+when the e1000 driver loads, it will receive eth0 and eth1 for its
+devices, but the bonding configuration tries to enslave eth2 and eth3
+(which may later be assigned to the tg3 devices).
+
+Adding the following::
+
+ add above bonding e1000 tg3
+
+causes modprobe to load e1000 then tg3, in that order, when
+bonding is loaded. This command is fully documented in the
+modules.conf manual page.
+
+On systems utilizing modprobe an equivalent problem can occur.
+In this case, the following can be added to config files in
+/etc/modprobe.d/ as::
+
+ softdep bonding pre: tg3 e1000
+
+This will load tg3 and e1000 modules before loading the bonding one.
+Full documentation on this can be found in the modprobe.d and modprobe
+manual pages.
+
+8.3. Painfully Slow Or No Failed Link Detection By Miimon
+---------------------------------------------------------
+
+By default, bonding enables the use_carrier option, which
+instructs bonding to trust the driver to maintain carrier state.
+
+As discussed in the options section, above, some drivers do
+not support the netif_carrier_on/_off link state tracking system.
+With use_carrier enabled, bonding will always see these links as up,
+regardless of their actual state.
+
+Additionally, other drivers do support netif_carrier, but do
+not maintain it in real time, e.g., only polling the link state at
+some fixed interval. In this case, miimon will detect failures, but
+only after some long period of time has expired. If it appears that
+miimon is very slow in detecting link failures, try specifying
+use_carrier=0 to see if that improves the failure detection time. If
+it does, then it may be that the driver checks the carrier state at a
+fixed interval, but does not cache the MII register values (so the
+use_carrier=0 method of querying the registers directly works). If
+use_carrier=0 does not improve the failover, then the driver may cache
+the registers, or the problem may be elsewhere.
+
+Also, remember that miimon only checks for the device's
+carrier state. It has no way to determine the state of devices on or
+beyond other ports of a switch, or if a switch is refusing to pass
+traffic while still maintaining carrier on.
+
+9. SNMP agents
+===============
+
+If running SNMP agents, the bonding driver should be loaded
+before any network drivers participating in a bond. This requirement
+is due to the interface index (ipAdEntIfIndex) being associated to
+the first interface found with a given IP address. That is, there is
+only one ipAdEntIfIndex for each IP address. For example, if eth0 and
+eth1 are slaves of bond0 and the driver for eth0 is loaded before the
+bonding driver, the interface for the IP address will be associated
+with the eth0 interface. This configuration is shown below, the IP
+address 192.168.1.1 has an interface index of 2 which indexes to eth0
+in the ifDescr table (ifDescr.2).
+
+::
+
+ interfaces.ifTable.ifEntry.ifDescr.1 = lo
+ interfaces.ifTable.ifEntry.ifDescr.2 = eth0
+ interfaces.ifTable.ifEntry.ifDescr.3 = eth1
+ interfaces.ifTable.ifEntry.ifDescr.4 = eth2
+ interfaces.ifTable.ifEntry.ifDescr.5 = eth3
+ interfaces.ifTable.ifEntry.ifDescr.6 = bond0
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
+
+This problem is avoided by loading the bonding driver before
+any network drivers participating in a bond. Below is an example of
+loading the bonding driver first, the IP address 192.168.1.1 is
+correctly associated with ifDescr.2.
+
+ interfaces.ifTable.ifEntry.ifDescr.1 = lo
+ interfaces.ifTable.ifEntry.ifDescr.2 = bond0
+ interfaces.ifTable.ifEntry.ifDescr.3 = eth0
+ interfaces.ifTable.ifEntry.ifDescr.4 = eth1
+ interfaces.ifTable.ifEntry.ifDescr.5 = eth2
+ interfaces.ifTable.ifEntry.ifDescr.6 = eth3
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
+
+While some distributions may not report the interface name in
+ifDescr, the association between the IP address and IfIndex remains
+and SNMP functions such as Interface_Scan_Next will report that
+association.
+
+10. Promiscuous mode
+====================
+
+When running network monitoring tools, e.g., tcpdump, it is
+common to enable promiscuous mode on the device, so that all traffic
+is seen (instead of seeing only traffic destined for the local host).
+The bonding driver handles promiscuous mode changes to the bonding
+master device (e.g., bond0), and propagates the setting to the slave
+devices.
+
+For the balance-rr, balance-xor, broadcast, and 802.3ad modes,
+the promiscuous mode setting is propagated to all slaves.
+
+For the active-backup, balance-tlb and balance-alb modes, the
+promiscuous mode setting is propagated only to the active slave.
+
+For balance-tlb mode, the active slave is the slave currently
+receiving inbound traffic.
+
+For balance-alb mode, the active slave is the slave used as a
+"primary." This slave is used for mode-specific control traffic, for
+sending to peers that are unassigned or if the load is unbalanced.
+
+For the active-backup, balance-tlb and balance-alb modes, when
+the active slave changes (e.g., due to a link failure), the
+promiscuous setting will be propagated to the new active slave.
+
+11. Configuring Bonding for High Availability
+=============================================
+
+High Availability refers to configurations that provide
+maximum network availability by having redundant or backup devices,
+links or switches between the host and the rest of the world. The
+goal is to provide the maximum availability of network connectivity
+(i.e., the network always works), even though other configurations
+could provide higher throughput.
+
+11.1 High Availability in a Single Switch Topology
+--------------------------------------------------
+
+If two hosts (or a host and a single switch) are directly
+connected via multiple physical links, then there is no availability
+penalty to optimizing for maximum bandwidth. In this case, there is
+only one switch (or peer), so if it fails, there is no alternative
+access to fail over to. Additionally, the bonding load balance modes
+support link monitoring of their members, so if individual links fail,
+the load will be rebalanced across the remaining devices.
+
+See Section 12, "Configuring Bonding for Maximum Throughput"
+for information on configuring bonding with one peer device.
+
+11.2 High Availability in a Multiple Switch Topology
+----------------------------------------------------
+
+With multiple switches, the configuration of bonding and the
+network changes dramatically. In multiple switch topologies, there is
+a trade off between network availability and usable bandwidth.
+
+Below is a sample network, configured to maximize the
+availability of the network::
+
+ | |
+ |port3 port3|
+ +-----+----+ +-----+----+
+ | |port2 ISL port2| |
+ | switch A +--------------------------+ switch B |
+ | | | |
+ +-----+----+ +-----++---+
+ |port1 port1|
+ | +-------+ |
+ +-------------+ host1 +---------------+
+ eth0 +-------+ eth1
+
+In this configuration, there is a link between the two
+switches (ISL, or inter switch link), and multiple ports connecting to
+the outside world ("port3" on each switch). There is no technical
+reason that this could not be extended to a third switch.
+
+11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
+-------------------------------------------------------------
+
+In a topology such as the example above, the active-backup and
+broadcast modes are the only useful bonding modes when optimizing for
+availability; the other modes require all links to terminate on the
+same peer for them to behave rationally.
+
+active-backup:
+ This is generally the preferred mode, particularly if
+ the switches have an ISL and play together well. If the
+ network configuration is such that one switch is specifically
+ a backup switch (e.g., has lower capacity, higher cost, etc),
+ then the primary option can be used to insure that the
+ preferred link is always used when it is available.
+
+broadcast:
+ This mode is really a special purpose mode, and is suitable
+ only for very specific needs. For example, if the two
+ switches are not connected (no ISL), and the networks beyond
+ them are totally independent. In this case, if it is
+ necessary for some specific one-way traffic to reach both
+ independent networks, then the broadcast mode may be suitable.
+
+11.2.2 HA Link Monitoring Selection for Multiple Switch Topology
+----------------------------------------------------------------
+
+The choice of link monitoring ultimately depends upon your
+switch. If the switch can reliably fail ports in response to other
+failures, then either the MII or ARP monitors should work. For
+example, in the above example, if the "port3" link fails at the remote
+end, the MII monitor has no direct means to detect this. The ARP
+monitor could be configured with a target at the remote end of port3,
+thus detecting that failure without switch support.
+
+In general, however, in a multiple switch topology, the ARP
+monitor can provide a higher level of reliability in detecting end to
+end connectivity failures (which may be caused by the failure of any
+individual component to pass traffic for any reason). Additionally,
+the ARP monitor should be configured with multiple targets (at least
+one for each switch in the network). This will insure that,
+regardless of which switch is active, the ARP monitor has a suitable
+target to query.
+
+Note, also, that of late many switches now support a functionality
+generally referred to as "trunk failover." This is a feature of the
+switch that causes the link state of a particular switch port to be set
+down (or up) when the state of another switch port goes down (or up).
+Its purpose is to propagate link failures from logically "exterior" ports
+to the logically "interior" ports that bonding is able to monitor via
+miimon. Availability and configuration for trunk failover varies by
+switch, but this can be a viable alternative to the ARP monitor when using
+suitable switches.
+
+12. Configuring Bonding for Maximum Throughput
+==============================================
+
+12.1 Maximizing Throughput in a Single Switch Topology
+------------------------------------------------------
+
+In a single switch configuration, the best method to maximize
+throughput depends upon the application and network environment. The
+various load balancing modes each have strengths and weaknesses in
+different environments, as detailed below.
+
+For this discussion, we will break down the topologies into
+two categories. Depending upon the destination of most traffic, we
+categorize them into either "gatewayed" or "local" configurations.
+
+In a gatewayed configuration, the "switch" is acting primarily
+as a router, and the majority of traffic passes through this router to
+other networks. An example would be the following::
+
+
+ +----------+ +----------+
+ | |eth0 port1| | to other networks
+ | Host A +---------------------+ router +------------------->
+ | +---------------------+ | Hosts B and C are out
+ | |eth1 port2| | here somewhere
+ +----------+ +----------+
+
+The router may be a dedicated router device, or another host
+acting as a gateway. For our discussion, the important point is that
+the majority of traffic from Host A will pass through the router to
+some other network before reaching its final destination.
+
+In a gatewayed network configuration, although Host A may
+communicate with many other systems, all of its traffic will be sent
+and received via one other peer on the local network, the router.
+
+Note that the case of two systems connected directly via
+multiple physical links is, for purposes of configuring bonding, the
+same as a gatewayed configuration. In that case, it happens that all
+traffic is destined for the "gateway" itself, not some other network
+beyond the gateway.
+
+In a local configuration, the "switch" is acting primarily as
+a switch, and the majority of traffic passes through this switch to
+reach other stations on the same network. An example would be the
+following::
+
+ +----------+ +----------+ +--------+
+ | |eth0 port1| +-------+ Host B |
+ | Host A +------------+ switch |port3 +--------+
+ | +------------+ | +--------+
+ | |eth1 port2| +------------------+ Host C |
+ +----------+ +----------+port4 +--------+
+
+
+Again, the switch may be a dedicated switch device, or another
+host acting as a gateway. For our discussion, the important point is
+that the majority of traffic from Host A is destined for other hosts
+on the same local network (Hosts B and C in the above example).
+
+In summary, in a gatewayed configuration, traffic to and from
+the bonded device will be to the same MAC level peer on the network
+(the gateway itself, i.e., the router), regardless of its final
+destination. In a local configuration, traffic flows directly to and
+from the final destinations, thus, each destination (Host B, Host C)
+will be addressed directly by their individual MAC addresses.
+
+This distinction between a gatewayed and a local network
+configuration is important because many of the load balancing modes
+available use the MAC addresses of the local network source and
+destination to make load balancing decisions. The behavior of each
+mode is described below.
+
+
+12.1.1 MT Bonding Mode Selection for Single Switch Topology
+-----------------------------------------------------------
+
+This configuration is the easiest to set up and to understand,
+although you will have to decide which bonding mode best suits your
+needs. The trade offs for each mode are detailed below:
+
+balance-rr:
+ This mode is the only mode that will permit a single
+ TCP/IP connection to stripe traffic across multiple
+ interfaces. It is therefore the only mode that will allow a
+ single TCP/IP stream to utilize more than one interface's
+ worth of throughput. This comes at a cost, however: the
+ striping generally results in peer systems receiving packets out
+ of order, causing TCP/IP's congestion control system to kick
+ in, often by retransmitting segments.
+
+ It is possible to adjust TCP/IP's congestion limits by
+ altering the net.ipv4.tcp_reordering sysctl parameter. The
+ usual default value is 3. But keep in mind TCP stack is able
+ to automatically increase this when it detects reorders.
+
+ Note that the fraction of packets that will be delivered out of
+ order is highly variable, and is unlikely to be zero. The level
+ of reordering depends upon a variety of factors, including the
+ networking interfaces, the switch, and the topology of the
+ configuration. Speaking in general terms, higher speed network
+ cards produce more reordering (due to factors such as packet
+ coalescing), and a "many to many" topology will reorder at a
+ higher rate than a "many slow to one fast" configuration.
+
+ Many switches do not support any modes that stripe traffic
+ (instead choosing a port based upon IP or MAC level addresses);
+ for those devices, traffic for a particular connection flowing
+ through the switch to a balance-rr bond will not utilize greater
+ than one interface's worth of bandwidth.
+
+ If you are utilizing protocols other than TCP/IP, UDP for
+ example, and your application can tolerate out of order
+ delivery, then this mode can allow for single stream datagram
+ performance that scales near linearly as interfaces are added
+ to the bond.
+
+ This mode requires the switch to have the appropriate ports
+ configured for "etherchannel" or "trunking."
+
+active-backup:
+ There is not much advantage in this network topology to
+ the active-backup mode, as the inactive backup devices are all
+ connected to the same peer as the primary. In this case, a
+ load balancing mode (with link monitoring) will provide the
+ same level of network availability, but with increased
+ available bandwidth. On the plus side, active-backup mode
+ does not require any configuration of the switch, so it may
+ have value if the hardware available does not support any of
+ the load balance modes.
+
+balance-xor:
+ This mode will limit traffic such that packets destined
+ for specific peers will always be sent over the same
+ interface. Since the destination is determined by the MAC
+ addresses involved, this mode works best in a "local" network
+ configuration (as described above), with destinations all on
+ the same local network. This mode is likely to be suboptimal
+ if all your traffic is passed through a single router (i.e., a
+ "gatewayed" network configuration, as described above).
+
+ As with balance-rr, the switch ports need to be configured for
+ "etherchannel" or "trunking."
+
+broadcast:
+ Like active-backup, there is not much advantage to this
+ mode in this type of network topology.
+
+802.3ad:
+ This mode can be a good choice for this type of network
+ topology. The 802.3ad mode is an IEEE standard, so all peers
+ that implement 802.3ad should interoperate well. The 802.3ad
+ protocol includes automatic configuration of the aggregates,
+ so minimal manual configuration of the switch is needed
+ (typically only to designate that some set of devices is
+ available for 802.3ad). The 802.3ad standard also mandates
+ that frames be delivered in order (within certain limits), so
+ in general single connections will not see misordering of
+ packets. The 802.3ad mode does have some drawbacks: the
+ standard mandates that all devices in the aggregate operate at
+ the same speed and duplex. Also, as with all bonding load
+ balance modes other than balance-rr, no single connection will
+ be able to utilize more than a single interface's worth of
+ bandwidth.
+
+ Additionally, the linux bonding 802.3ad implementation
+ distributes traffic by peer (using an XOR of MAC addresses
+ and packet type ID), so in a "gatewayed" configuration, all
+ outgoing traffic will generally use the same device. Incoming
+ traffic may also end up on a single device, but that is
+ dependent upon the balancing policy of the peer's 802.3ad
+ implementation. In a "local" configuration, traffic will be
+ distributed across the devices in the bond.
+
+ Finally, the 802.3ad mode mandates the use of the MII monitor,
+ therefore, the ARP monitor is not available in this mode.
+
+balance-tlb:
+ The balance-tlb mode balances outgoing traffic by peer.
+ Since the balancing is done according to MAC address, in a
+ "gatewayed" configuration (as described above), this mode will
+ send all traffic across a single device. However, in a
+ "local" network configuration, this mode balances multiple
+ local network peers across devices in a vaguely intelligent
+ manner (not a simple XOR as in balance-xor or 802.3ad mode),
+ so that mathematically unlucky MAC addresses (i.e., ones that
+ XOR to the same value) will not all "bunch up" on a single
+ interface.
+
+ Unlike 802.3ad, interfaces may be of differing speeds, and no
+ special switch configuration is required. On the down side,
+ in this mode all incoming traffic arrives over a single
+ interface, this mode requires certain ethtool support in the
+ network device driver of the slave interfaces, and the ARP
+ monitor is not available.
+
+balance-alb:
+ This mode is everything that balance-tlb is, and more.
+ It has all of the features (and restrictions) of balance-tlb,
+ and will also balance incoming traffic from local network
+ peers (as described in the Bonding Module Options section,
+ above).
+
+ The only additional down side to this mode is that the network
+ device driver must support changing the hardware address while
+ the device is open.
+
+12.1.2 MT Link Monitoring for Single Switch Topology
+----------------------------------------------------
+
+The choice of link monitoring may largely depend upon which
+mode you choose to use. The more advanced load balancing modes do not
+support the use of the ARP monitor, and are thus restricted to using
+the MII monitor (which does not provide as high a level of end to end
+assurance as the ARP monitor).
+
+12.2 Maximum Throughput in a Multiple Switch Topology
+-----------------------------------------------------
+
+Multiple switches may be utilized to optimize for throughput
+when they are configured in parallel as part of an isolated network
+between two or more systems, for example::
+
+ +-----------+
+ | Host A |
+ +-+---+---+-+
+ | | |
+ +--------+ | +---------+
+ | | |
+ +------+---+ +-----+----+ +-----+----+
+ | Switch A | | Switch B | | Switch C |
+ +------+---+ +-----+----+ +-----+----+
+ | | |
+ +--------+ | +---------+
+ | | |
+ +-+---+---+-+
+ | Host B |
+ +-----------+
+
+In this configuration, the switches are isolated from one
+another. One reason to employ a topology such as this is for an
+isolated network with many hosts (a cluster configured for high
+performance, for example), using multiple smaller switches can be more
+cost effective than a single larger switch, e.g., on a network with 24
+hosts, three 24 port switches can be significantly less expensive than
+a single 72 port switch.
+
+If access beyond the network is required, an individual host
+can be equipped with an additional network device connected to an
+external network; this host then additionally acts as a gateway.
+
+12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
+-------------------------------------------------------------
+
+In actual practice, the bonding mode typically employed in
+configurations of this type is balance-rr. Historically, in this
+network configuration, the usual caveats about out of order packet
+delivery are mitigated by the use of network adapters that do not do
+any kind of packet coalescing (via the use of NAPI, or because the
+device itself does not generate interrupts until some number of
+packets has arrived). When employed in this fashion, the balance-rr
+mode allows individual connections between two hosts to effectively
+utilize greater than one interface's bandwidth.
+
+12.2.2 MT Link Monitoring for Multiple Switch Topology
+------------------------------------------------------
+
+Again, in actual practice, the MII monitor is most often used
+in this configuration, as performance is given preference over
+availability. The ARP monitor will function in this topology, but its
+advantages over the MII monitor are mitigated by the volume of probes
+needed as the number of systems involved grows (remember that each
+host in the network is configured with bonding).
+
+13. Switch Behavior Issues
+==========================
+
+13.1 Link Establishment and Failover Delays
+-------------------------------------------
+
+Some switches exhibit undesirable behavior with regard to the
+timing of link up and down reporting by the switch.
+
+First, when a link comes up, some switches may indicate that
+the link is up (carrier available), but not pass traffic over the
+interface for some period of time. This delay is typically due to
+some type of autonegotiation or routing protocol, but may also occur
+during switch initialization (e.g., during recovery after a switch
+failure). If you find this to be a problem, specify an appropriate
+value to the updelay bonding module option to delay the use of the
+relevant interface(s).
+
+Second, some switches may "bounce" the link state one or more
+times while a link is changing state. This occurs most commonly while
+the switch is initializing. Again, an appropriate updelay value may
+help.
+
+Note that when a bonding interface has no active links, the
+driver will immediately reuse the first link that goes up, even if the
+updelay parameter has been specified (the updelay is ignored in this
+case). If there are slave interfaces waiting for the updelay timeout
+to expire, the interface that first went into that state will be
+immediately reused. This reduces down time of the network if the
+value of updelay has been overestimated, and since this occurs only in
+cases with no connectivity, there is no additional penalty for
+ignoring the updelay.
+
+In addition to the concerns about switch timings, if your
+switches take a long time to go into backup mode, it may be desirable
+to not activate a backup interface immediately after a link goes down.
+Failover may be delayed via the downdelay bonding module option.
+
+13.2 Duplicated Incoming Packets
+--------------------------------
+
+NOTE: Starting with version 3.0.2, the bonding driver has logic to
+suppress duplicate packets, which should largely eliminate this problem.
+The following description is kept for reference.
+
+It is not uncommon to observe a short burst of duplicated
+traffic when the bonding device is first used, or after it has been
+idle for some period of time. This is most easily observed by issuing
+a "ping" to some other host on the network, and noticing that the
+output from ping flags duplicates (typically one per slave).
+
+For example, on a bond in active-backup mode with five slaves
+all connected to one switch, the output may appear as follows::
+
+ # ping -n 10.0.4.2
+ PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+ 64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
+ 64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
+ 64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms
+
+This is not due to an error in the bonding driver, rather, it
+is a side effect of how many switches update their MAC forwarding
+tables. Initially, the switch does not associate the MAC address in
+the packet with a particular switch port, and so it may send the
+traffic to all ports until its MAC forwarding table is updated. Since
+the interfaces attached to the bond may occupy multiple ports on a
+single switch, when the switch (temporarily) floods the traffic to all
+ports, the bond device receives multiple copies of the same packet
+(one per slave device).
+
+The duplicated packet behavior is switch dependent, some
+switches exhibit this, and some do not. On switches that display this
+behavior, it can be induced by clearing the MAC forwarding table (on
+most Cisco switches, the privileged command "clear mac address-table
+dynamic" will accomplish this).
+
+14. Hardware Specific Considerations
+====================================
+
+This section contains additional information for configuring
+bonding on specific hardware platforms, or for interfacing bonding
+with particular switches or other devices.
+
+14.1 IBM BladeCenter
+--------------------
+
+This applies to the JS20 and similar systems.
+
+On the JS20 blades, the bonding driver supports only
+balance-rr, active-backup, balance-tlb and balance-alb modes. This is
+largely due to the network topology inside the BladeCenter, detailed
+below.
+
+JS20 network adapter information
+--------------------------------
+
+All JS20s come with two Broadcom Gigabit Ethernet ports
+integrated on the planar (that's "motherboard" in IBM-speak). In the
+BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to
+I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2.
+An add-on Broadcom daughter card can be installed on a JS20 to provide
+two more Gigabit Ethernet ports. These ports, eth2 and eth3, are
+wired to I/O Modules 3 and 4, respectively.
+
+Each I/O Module may contain either a switch or a passthrough
+module (which allows ports to be directly connected to an external
+switch). Some bonding modes require a specific BladeCenter internal
+network topology in order to function; these are detailed below.
+
+Additional BladeCenter-specific networking information can be
+found in two IBM Redbooks (www.ibm.com/redbooks):
+
+- "IBM eServer BladeCenter Networking Options"
+- "IBM eServer BladeCenter Layer 2-7 Network Switching"
+
+BladeCenter networking configuration
+------------------------------------
+
+Because a BladeCenter can be configured in a very large number
+of ways, this discussion will be confined to describing basic
+configurations.
+
+Normally, Ethernet Switch Modules (ESMs) are used in I/O
+modules 1 and 2. In this configuration, the eth0 and eth1 ports of a
+JS20 will be connected to different internal switches (in the
+respective I/O modules).
+
+A passthrough module (OPM or CPM, optical or copper,
+passthrough module) connects the I/O module directly to an external
+switch. By using PMs in I/O module #1 and #2, the eth0 and eth1
+interfaces of a JS20 can be redirected to the outside world and
+connected to a common external switch.
+
+Depending upon the mix of ESMs and PMs, the network will
+appear to bonding as either a single switch topology (all PMs) or as a
+multiple switch topology (one or more ESMs, zero or more PMs). It is
+also possible to connect ESMs together, resulting in a configuration
+much like the example in "High Availability in a Multiple Switch
+Topology," above.
+
+Requirements for specific modes
+-------------------------------
+
+The balance-rr mode requires the use of passthrough modules
+for devices in the bond, all connected to an common external switch.
+That switch must be configured for "etherchannel" or "trunking" on the
+appropriate ports, as is usual for balance-rr.
+
+The balance-alb and balance-tlb modes will function with
+either switch modules or passthrough modules (or a mix). The only
+specific requirement for these modes is that all network interfaces
+must be able to reach all destinations for traffic sent over the
+bonding device (i.e., the network must converge at some point outside
+the BladeCenter).
+
+The active-backup mode has no additional requirements.
+
+Link monitoring issues
+----------------------
+
+When an Ethernet Switch Module is in place, only the ARP
+monitor will reliably detect link loss to an external switch. This is
+nothing unusual, but examination of the BladeCenter cabinet would
+suggest that the "external" network ports are the ethernet ports for
+the system, when it fact there is a switch between these "external"
+ports and the devices on the JS20 system itself. The MII monitor is
+only able to detect link failures between the ESM and the JS20 system.
+
+When a passthrough module is in place, the MII monitor does
+detect failures to the "external" port, which is then directly
+connected to the JS20 system.
+
+Other concerns
+--------------
+
+The Serial Over LAN (SoL) link is established over the primary
+ethernet (eth0) only, therefore, any loss of link to eth0 will result
+in losing your SoL connection. It will not fail over with other
+network traffic, as the SoL system is beyond the control of the
+bonding driver.
+
+It may be desirable to disable spanning tree on the switch
+(either the internal Ethernet Switch Module, or an external switch) to
+avoid fail-over delay issues when using bonding.
+
+
+15. Frequently Asked Questions
+==============================
+
+1. Is it SMP safe?
+-------------------
+
+Yes. The old 2.0.xx channel bonding patch was not SMP safe.
+The new driver was designed to be SMP safe from the start.
+
+2. What type of cards will work with it?
+-----------------------------------------
+
+Any Ethernet type cards (you can even mix cards - a Intel
+EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes,
+devices need not be of the same speed.
+
+Starting with version 3.2.1, bonding also supports Infiniband
+slaves in active-backup mode.
+
+3. How many bonding devices can I have?
+----------------------------------------
+
+There is no limit.
+
+4. How many slaves can a bonding device have?
+----------------------------------------------
+
+This is limited only by the number of network interfaces Linux
+supports and/or the number of network cards you can place in your
+system.
+
+5. What happens when a slave link dies?
+----------------------------------------
+
+If link monitoring is enabled, then the failing device will be
+disabled. The active-backup mode will fail over to a backup link, and
+other modes will ignore the failed link. The link will continue to be
+monitored, and should it recover, it will rejoin the bond (in whatever
+manner is appropriate for the mode). See the sections on High
+Availability and the documentation for each mode for additional
+information.
+
+Link monitoring can be enabled via either the miimon or
+arp_interval parameters (described in the module parameters section,
+above). In general, miimon monitors the carrier state as sensed by
+the underlying network device, and the arp monitor (arp_interval)
+monitors connectivity to another host on the local network.
+
+If no link monitoring is configured, the bonding driver will
+be unable to detect link failures, and will assume that all links are
+always available. This will likely result in lost packets, and a
+resulting degradation of performance. The precise performance loss
+depends upon the bonding mode and network configuration.
+
+6. Can bonding be used for High Availability?
+----------------------------------------------
+
+Yes. See the section on High Availability for details.
+
+7. Which switches/systems does it work with?
+---------------------------------------------
+
+The full answer to this depends upon the desired mode.
+
+In the basic balance modes (balance-rr and balance-xor), it
+works with any system that supports etherchannel (also called
+trunking). Most managed switches currently available have such
+support, and many unmanaged switches as well.
+
+The advanced balance modes (balance-tlb and balance-alb) do
+not have special switch requirements, but do need device drivers that
+support specific features (described in the appropriate section under
+module parameters, above).
+
+In 802.3ad mode, it works with systems that support IEEE
+802.3ad Dynamic Link Aggregation. Most managed and many unmanaged
+switches currently available support 802.3ad.
+
+The active-backup mode should work with any Layer-II switch.
+
+8. Where does a bonding device get its MAC address from?
+---------------------------------------------------------
+
+When using slave devices that have fixed MAC addresses, or when
+the fail_over_mac option is enabled, the bonding device's MAC address is
+the MAC address of the active slave.
+
+For other configurations, if not explicitly configured (with
+ifconfig or ip link), the MAC address of the bonding device is taken from
+its first slave device. This MAC address is then passed to all following
+slaves and remains persistent (even if the first slave is removed) until
+the bonding device is brought down or reconfigured.
+
+If you wish to change the MAC address, you can set it with
+ifconfig or ip link::
+
+ # ifconfig bond0 hw ether 00:11:22:33:44:55
+
+ # ip link set bond0 address 66:77:88:99:aa:bb
+
+The MAC address can be also changed by bringing down/up the
+device and then changing its slaves (or their order)::
+
+ # ifconfig bond0 down ; modprobe -r bonding
+ # ifconfig bond0 .... up
+ # ifenslave bond0 eth...
+
+This method will automatically take the address from the next
+slave that is added.
+
+To restore your slaves' MAC addresses, you need to detach them
+from the bond (``ifenslave -d bond0 eth0``). The bonding driver will
+then restore the MAC addresses that the slaves had before they were
+enslaved.
+
+16. Resources and Links
+=======================
+
+The latest version of the bonding driver can be found in the latest
+version of the linux kernel, found on http://kernel.org
+
+The latest version of this document can be found in the latest kernel
+source (named Documentation/networking/bonding.rst).
+
+Discussions regarding the development of the bonding driver take place
+on the main Linux network mailing list, hosted at vger.kernel.org. The list
+address is:
+
+netdev@vger.kernel.org
+
+The administrative interface (to subscribe or unsubscribe) can
+be found at:
+
+http://vger.kernel.org/vger-lists.html#netdev
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
new file mode 100644
index 0000000000..c859f3c163
--- /dev/null
+++ b/Documentation/networking/bridge.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Ethernet Bridging
+=================
+
+In order to use the Ethernet bridging functionality, you'll need the
+userspace tools.
+
+Documentation for Linux bridging is on:
+ https://wiki.linuxfoundation.org/networking/bridge
+
+The bridge-utilities are maintained at:
+ git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
+
+Additionally, the iproute2 utilities can be used to configure
+bridge devices.
+
+If you still have questions, don't hesitate to post to the mailing list
+(more info https://lists.linux-foundation.org/mailman/listinfo/bridge).
+
diff --git a/Documentation/networking/caif/caif.rst b/Documentation/networking/caif/caif.rst
new file mode 100644
index 0000000000..d922d419c5
--- /dev/null
+++ b/Documentation/networking/caif/caif.rst
@@ -0,0 +1,138 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+
+================
+Using Linux CAIF
+================
+
+
+:Copyright: |copy| ST-Ericsson AB 2010
+
+:Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
+
+Start
+=====
+
+If you have compiled CAIF for modules do::
+
+ $modprobe crc_ccitt
+ $modprobe caif
+ $modprobe caif_socket
+ $modprobe chnl_net
+
+
+Preparing the setup with a STE modem
+====================================
+
+If you are working on integration of CAIF you should make sure
+that the kernel is built with module support.
+
+There are some things that need to be tweaked to get the host TTY correctly
+set up to talk to the modem.
+Since the CAIF stack is running in the kernel and we want to use the existing
+TTY, we are installing our physical serial driver as a line discipline above
+the TTY device.
+
+To achieve this we need to install the N_CAIF ldisc from user space.
+The benefit is that we can hook up to any TTY.
+
+The use of Start-of-frame-extension (STX) must also be set as
+module parameter "ser_use_stx".
+
+Normally Frame Checksum is always used on UART, but this is also provided as a
+module parameter "ser_use_fcs".
+
+::
+
+ $ modprobe caif_serial ser_ttyname=/dev/ttyS0 ser_use_stx=yes
+ $ ifconfig caif_ttyS0 up
+
+PLEASE NOTE:
+ There is a limitation in Android shell.
+ It only accepts one argument to insmod/modprobe!
+
+Trouble shooting
+================
+
+There are debugfs parameters provided for serial communication.
+/sys/kernel/debug/caif_serial/<tty-name>/
+
+* ser_state: Prints the bit-mask status where
+
+ - 0x02 means SENDING, this is a transient state.
+ - 0x10 means FLOW_OFF_SENT, i.e. the previous frame has not been sent
+ and is blocking further send operation. Flow OFF has been propagated
+ to all CAIF Channels using this TTY.
+
+* tty_status: Prints the bit-mask tty status information
+
+ - 0x01 - tty->warned is on.
+ - 0x04 - tty->packed is on.
+ - 0x08 - tty->flow.tco_stopped is on.
+ - 0x10 - tty->hw_stopped is on.
+ - 0x20 - tty->flow.stopped is on.
+
+* last_tx_msg: Binary blob Prints the last transmitted frame.
+
+ This can be printed with::
+
+ $od --format=x1 /sys/kernel/debug/caif_serial/<tty>/last_rx_msg.
+
+ The first two tx messages sent look like this. Note: The initial
+ byte 02 is start of frame extension (STX) used for re-syncing
+ upon errors.
+
+ - Enumeration::
+
+ 0000000 02 05 00 00 03 01 d2 02
+ | | | | | |
+ STX(1) | | | |
+ Length(2)| | |
+ Control Channel(1)
+ Command:Enumeration(1)
+ Link-ID(1)
+ Checksum(2)
+
+ - Channel Setup::
+
+ 0000000 02 07 00 00 00 21 a1 00 48 df
+ | | | | | | | |
+ STX(1) | | | | | |
+ Length(2)| | | | |
+ Control Channel(1)
+ Command:Channel Setup(1)
+ Channel Type(1)
+ Priority and Link-ID(1)
+ Endpoint(1)
+ Checksum(2)
+
+* last_rx_msg: Prints the last transmitted frame.
+
+ The RX messages for LinkSetup look almost identical but they have the
+ bit 0x20 set in the command bit, and Channel Setup has added one byte
+ before Checksum containing Channel ID.
+
+ NOTE:
+ Several CAIF Messages might be concatenated. The maximum debug
+ buffer size is 128 bytes.
+
+Error Scenarios
+===============
+
+- last_tx_msg contains channel setup message and last_rx_msg is empty ->
+ The host seems to be able to send over the UART, at least the CAIF ldisc get
+ notified that sending is completed.
+
+- last_tx_msg contains enumeration message and last_rx_msg is empty ->
+ The host is not able to send the message from UART, the tty has not been
+ able to complete the transmit operation.
+
+- if /sys/kernel/debug/caif_serial/<tty>/tty_status is non-zero there
+ might be problems transmitting over UART.
+
+ E.g. host and modem wiring is not correct you will typically see
+ tty_status = 0x10 (hw_stopped) and ser_state = 0x10 (FLOW_OFF_SENT).
+
+ You will probably see the enumeration message in last_tx_message
+ and empty last_rx_message.
diff --git a/Documentation/networking/caif/index.rst b/Documentation/networking/caif/index.rst
new file mode 100644
index 0000000000..ec29b6f4bd
--- /dev/null
+++ b/Documentation/networking/caif/index.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+CAIF
+====
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ linux_caif
+ caif
diff --git a/Documentation/networking/caif/linux_caif.rst b/Documentation/networking/caif/linux_caif.rst
new file mode 100644
index 0000000000..a0480862ab
--- /dev/null
+++ b/Documentation/networking/caif/linux_caif.rst
@@ -0,0 +1,195 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==========
+Linux CAIF
+==========
+
+Copyright |copy| ST-Ericsson AB 2010
+
+:Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
+:License terms: GNU General Public License (GPL) version 2
+
+
+Introduction
+============
+
+CAIF is a MUX protocol used by ST-Ericsson cellular modems for
+communication between Modem and host. The host processes can open virtual AT
+channels, initiate GPRS Data connections, Video channels and Utility Channels.
+The Utility Channels are general purpose pipes between modem and host.
+
+ST-Ericsson modems support a number of transports between modem
+and host. Currently, UART and Loopback are available for Linux.
+
+
+Architecture
+============
+
+The implementation of CAIF is divided into:
+
+* CAIF Socket Layer and GPRS IP Interface.
+* CAIF Core Protocol Implementation
+* CAIF Link Layer, implemented as NET devices.
+
+::
+
+ RTNL
+ !
+ ! +------+ +------+
+ ! +------+! +------+!
+ ! ! IP !! !Socket!!
+ +-------> !interf!+ ! API !+ <- CAIF Client APIs
+ ! +------+ +------!
+ ! ! !
+ ! +-----------+
+ ! !
+ ! +------+ <- CAIF Core Protocol
+ ! ! CAIF !
+ ! ! Core !
+ ! +------+
+ ! +----------!---------+
+ ! ! ! !
+ ! +------+ +-----+ +------+
+ +--> ! HSI ! ! TTY ! ! USB ! <- Link Layer (Net Devices)
+ +------+ +-----+ +------+
+
+
+
+Implementation
+==============
+
+
+CAIF Core Protocol Layer
+------------------------
+
+CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson.
+It implements the CAIF protocol stack in a layered approach, where
+each layer described in the specification is implemented as a separate layer.
+The architecture is inspired by the design patterns "Protocol Layer" and
+"Protocol Packet".
+
+CAIF structure
+^^^^^^^^^^^^^^
+
+The Core CAIF implementation contains:
+
+ - Simple implementation of CAIF.
+ - Layered architecture (a la Streams), each layer in the CAIF
+ specification is implemented in a separate c-file.
+ - Clients must call configuration function to add PHY layer.
+ - Clients must implement CAIF layer to consume/produce
+ CAIF payload with receive and transmit functions.
+ - Clients must call configuration function to add and connect the
+ Client layer.
+ - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed
+ to the called function (except for framing layers' receive function)
+
+Layered Architecture
+====================
+
+The CAIF protocol can be divided into two parts: Support functions and Protocol
+Implementation. The support functions include:
+
+ - CFPKT CAIF Packet. Implementation of CAIF Protocol Packet. The
+ CAIF Packet has functions for creating, destroying and adding content
+ and for adding/extracting header and trailers to protocol packets.
+
+The CAIF Protocol implementation contains:
+
+ - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol
+ Stack and provides a Client interface for adding Link-Layer and
+ Driver interfaces on top of the CAIF Stack.
+
+ - CFCTRL CAIF Control layer. Encodes and Decodes control messages
+ such as enumeration and channel setup. Also matches request and
+ response messages.
+
+ - CFSERVL General CAIF Service Layer functionality; handles flow
+ control and remote shutdown requests.
+
+ - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual
+ External Interface). This layer encodes/decodes VEI frames.
+
+ - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP
+ traffic), encodes/decodes Datagram frames.
+
+ - CFMUX CAIF Mux layer. Handles multiplexing between multiple
+ physical bearers and multiple channels such as VEI, Datagram, etc.
+ The MUX keeps track of the existing CAIF Channels and
+ Physical Instances and selects the appropriate instance based
+ on Channel-Id and Physical-ID.
+
+ - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length
+ and frame checksum.
+
+ - CFSERL CAIF Serial layer. Handles concatenation/split of frames
+ into CAIF Frames with correct length.
+
+::
+
+ +---------+
+ | Config |
+ | CFCNFG |
+ +---------+
+ !
+ +---------+ +---------+ +---------+
+ | AT | | Control | | Datagram|
+ | CFVEIL | | CFCTRL | | CFDGML |
+ +---------+ +---------+ +---------+
+ \_____________!______________/
+ !
+ +---------+
+ | MUX |
+ | |
+ +---------+
+ _____!_____
+ / \
+ +---------+ +---------+
+ | CFFRML | | CFFRML |
+ | Framing | | Framing |
+ +---------+ +---------+
+ ! !
+ +---------+ +---------+
+ | | | Serial |
+ | | | CFSERL |
+ +---------+ +---------+
+
+
+In this layered approach the following "rules" apply.
+
+ - All layers embed the same structure "struct cflayer"
+ - A layer does not depend on any other layer's private data.
+ - Layers are stacked by setting the pointers::
+
+ layer->up , layer->dn
+
+ - In order to send data upwards, each layer should do::
+
+ layer->up->receive(layer->up, packet);
+
+ - In order to send data downwards, each layer should do::
+
+ layer->dn->transmit(layer->dn, packet);
+
+
+CAIF Socket and IP interface
+============================
+
+The IP interface and CAIF socket API are implemented on top of the
+CAIF Core protocol. The IP Interface and CAIF socket have an instance of
+'struct cflayer', just like the CAIF Core protocol stack.
+Net device and Socket implement the 'receive()' function defined by
+'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and
+receive of packets is handled as by the rest of the layers: the 'dn->transmit()'
+function is called in order to transmit data.
+
+Configuration of Link Layer
+---------------------------
+The Link Layer is implemented as Linux network devices (struct net_device).
+Payload handling and registration is done using standard Linux mechanisms.
+
+The CAIF Protocol relies on a loss-less link layer without implementing
+retransmission. This implies that packet drops must not happen.
+Therefore a flow-control mechanism is implemented where the physical
+interface can initiate flow stop for all CAIF Channels.
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
new file mode 100644
index 0000000000..d7e1ada905
--- /dev/null
+++ b/Documentation/networking/can.rst
@@ -0,0 +1,1508 @@
+===================================
+SocketCAN - Controller Area Network
+===================================
+
+Overview / What is SocketCAN
+============================
+
+The socketcan package is an implementation of CAN protocols
+(Controller Area Network) for Linux. CAN is a networking technology
+which has widespread use in automation, embedded devices, and
+automotive fields. While there have been other CAN implementations
+for Linux based on character devices, SocketCAN uses the Berkeley
+socket API, the Linux network stack and implements the CAN device
+drivers as network interfaces. The CAN socket API has been designed
+as similar as possible to the TCP/IP protocols to allow programmers,
+familiar with network programming, to easily learn how to use CAN
+sockets.
+
+
+.. _socketcan-motivation:
+
+Motivation / Why Using the Socket API
+=====================================
+
+There have been CAN implementations for Linux before SocketCAN so the
+question arises, why we have started another project. Most existing
+implementations come as a device driver for some CAN hardware, they
+are based on character devices and provide comparatively little
+functionality. Usually, there is only a hardware-specific device
+driver which provides a character device interface to send and
+receive raw CAN frames, directly to/from the controller hardware.
+Queueing of frames and higher-level transport protocols like ISO-TP
+have to be implemented in user space applications. Also, most
+character-device implementations support only one single process to
+open the device at a time, similar to a serial interface. Exchanging
+the CAN controller requires employment of another device driver and
+often the need for adaption of large parts of the application to the
+new driver's API.
+
+SocketCAN was designed to overcome all of these limitations. A new
+protocol family has been implemented which provides a socket interface
+to user space applications and which builds upon the Linux network
+layer, enabling use all of the provided queueing functionality. A device
+driver for CAN controller hardware registers itself with the Linux
+network layer as a network device, so that CAN frames from the
+controller can be passed up to the network layer and on to the CAN
+protocol family module and also vice-versa. Also, the protocol family
+module provides an API for transport protocol modules to register, so
+that any number of transport protocols can be loaded or unloaded
+dynamically. In fact, the can core module alone does not provide any
+protocol and cannot be used without loading at least one additional
+protocol module. Multiple sockets can be opened at the same time,
+on different or the same protocol module and they can listen/send
+frames on different or the same CAN IDs. Several sockets listening on
+the same interface for frames with the same CAN ID are all passed the
+same received matching CAN frames. An application wishing to
+communicate using a specific transport protocol, e.g. ISO-TP, just
+selects that protocol when opening the socket, and then can read and
+write application data byte streams, without having to deal with
+CAN-IDs, frames, etc.
+
+Similar functionality visible from user-space could be provided by a
+character device, too, but this would lead to a technically inelegant
+solution for a couple of reasons:
+
+* **Intricate usage:** Instead of passing a protocol argument to
+ socket(2) and using bind(2) to select a CAN interface and CAN ID, an
+ application would have to do all these operations using ioctl(2)s.
+
+* **Code duplication:** A character device cannot make use of the Linux
+ network queueing code, so all that code would have to be duplicated
+ for CAN networking.
+
+* **Abstraction:** In most existing character-device implementations, the
+ hardware-specific device driver for a CAN controller directly
+ provides the character device for the application to work with.
+ This is at least very unusual in Unix systems for both, char and
+ block devices. For example you don't have a character device for a
+ certain UART of a serial interface, a certain sound chip in your
+ computer, a SCSI or IDE controller providing access to your hard
+ disk or tape streamer device. Instead, you have abstraction layers
+ which provide a unified character or block device interface to the
+ application on the one hand, and a interface for hardware-specific
+ device drivers on the other hand. These abstractions are provided
+ by subsystems like the tty layer, the audio subsystem or the SCSI
+ and IDE subsystems for the devices mentioned above.
+
+ The easiest way to implement a CAN device driver is as a character
+ device without such a (complete) abstraction layer, as is done by most
+ existing drivers. The right way, however, would be to add such a
+ layer with all the functionality like registering for certain CAN
+ IDs, supporting several open file descriptors and (de)multiplexing
+ CAN frames between them, (sophisticated) queueing of CAN frames, and
+ providing an API for device drivers to register with. However, then
+ it would be no more difficult, or may be even easier, to use the
+ networking framework provided by the Linux kernel, and this is what
+ SocketCAN does.
+
+The use of the networking framework of the Linux kernel is just the
+natural and most appropriate way to implement CAN for Linux.
+
+
+.. _socketcan-concept:
+
+SocketCAN Concept
+=================
+
+As described in :ref:`socketcan-motivation` the main goal of SocketCAN is to
+provide a socket interface to user space applications which builds
+upon the Linux network layer. In contrast to the commonly known
+TCP/IP and ethernet networking, the CAN bus is a broadcast-only(!)
+medium that has no MAC-layer addressing like ethernet. The CAN-identifier
+(can_id) is used for arbitration on the CAN-bus. Therefore the CAN-IDs
+have to be chosen uniquely on the bus. When designing a CAN-ECU
+network the CAN-IDs are mapped to be sent by a specific ECU.
+For this reason a CAN-ID can be treated best as a kind of source address.
+
+
+.. _socketcan-receive-lists:
+
+Receive Lists
+-------------
+
+The network transparent access of multiple applications leads to the
+problem that different applications may be interested in the same
+CAN-IDs from the same CAN network interface. The SocketCAN core
+module - which implements the protocol family CAN - provides several
+high efficient receive lists for this reason. If e.g. a user space
+application opens a CAN RAW socket, the raw protocol module itself
+requests the (range of) CAN-IDs from the SocketCAN core that are
+requested by the user. The subscription and unsubscription of
+CAN-IDs can be done for specific CAN interfaces or for all(!) known
+CAN interfaces with the can_rx_(un)register() functions provided to
+CAN protocol modules by the SocketCAN core (see :ref:`socketcan-core-module`).
+To optimize the CPU usage at runtime the receive lists are split up
+into several specific lists per device that match the requested
+filter complexity for a given use-case.
+
+
+.. _socketcan-local-loopback1:
+
+Local Loopback of Sent Frames
+-----------------------------
+
+As known from other networking concepts the data exchanging
+applications may run on the same or different nodes without any
+change (except for the according addressing information):
+
+.. code::
+
+ ___ ___ ___ _______ ___
+ | _ | | _ | | _ | | _ _ | | _ |
+ ||A|| ||B|| ||C|| ||A| |B|| ||C||
+ |___| |___| |___| |_______| |___|
+ | | | | |
+ -----------------(1)- CAN bus -(2)---------------
+
+To ensure that application A receives the same information in the
+example (2) as it would receive in example (1) there is need for
+some kind of local loopback of the sent CAN frames on the appropriate
+node.
+
+The Linux network devices (by default) just can handle the
+transmission and reception of media dependent frames. Due to the
+arbitration on the CAN bus the transmission of a low prio CAN-ID
+may be delayed by the reception of a high prio CAN frame. To
+reflect the correct [#f1]_ traffic on the node the loopback of the sent
+data has to be performed right after a successful transmission. If
+the CAN network interface is not capable of performing the loopback for
+some reason the SocketCAN core can do this task as a fallback solution.
+See :ref:`socketcan-local-loopback2` for details (recommended).
+
+The loopback functionality is enabled by default to reflect standard
+networking behaviour for CAN applications. Due to some requests from
+the RT-SocketCAN group the loopback optionally may be disabled for each
+separate socket. See sockopts from the CAN RAW sockets in :ref:`socketcan-raw-sockets`.
+
+.. [#f1] you really like to have this when you're running analyser
+ tools like 'candump' or 'cansniffer' on the (same) node.
+
+
+.. _socketcan-network-problem-notifications:
+
+Network Problem Notifications
+-----------------------------
+
+The use of the CAN bus may lead to several problems on the physical
+and media access control layer. Detecting and logging of these lower
+layer problems is a vital requirement for CAN users to identify
+hardware issues on the physical transceiver layer as well as
+arbitration problems and error frames caused by the different
+ECUs. The occurrence of detected errors are important for diagnosis
+and have to be logged together with the exact timestamp. For this
+reason the CAN interface driver can generate so called Error Message
+Frames that can optionally be passed to the user application in the
+same way as other CAN frames. Whenever an error on the physical layer
+or the MAC layer is detected (e.g. by the CAN controller) the driver
+creates an appropriate error message frame. Error messages frames can
+be requested by the user application using the common CAN filter
+mechanisms. Inside this filter definition the (interested) type of
+errors may be selected. The reception of error messages is disabled
+by default. The format of the CAN error message frame is briefly
+described in the Linux header file "include/uapi/linux/can/error.h".
+
+
+How to use SocketCAN
+====================
+
+Like TCP/IP, you first need to open a socket for communicating over a
+CAN network. Since SocketCAN implements a new protocol family, you
+need to pass PF_CAN as the first argument to the socket(2) system
+call. Currently, there are two CAN protocols to choose from, the raw
+socket protocol and the broadcast manager (BCM). So to open a socket,
+you would write::
+
+ s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
+
+and::
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM);
+
+respectively. After the successful creation of the socket, you would
+normally use the bind(2) system call to bind the socket to a CAN
+interface (which is different from TCP/IP due to different addressing
+- see :ref:`socketcan-concept`). After binding (CAN_RAW) or connecting (CAN_BCM)
+the socket, you can read(2) and write(2) from/to the socket or use
+send(2), sendto(2), sendmsg(2) and the recv* counterpart operations
+on the socket as usual. There are also CAN specific socket options
+described below.
+
+The Classical CAN frame structure (aka CAN 2.0B), the CAN FD frame structure
+and the sockaddr structure are defined in include/linux/can.h:
+
+.. code-block:: C
+
+ struct can_frame {
+ canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */
+ union {
+ /* CAN frame payload length in byte (0 .. CAN_MAX_DLEN)
+ * was previously named can_dlc so we need to carry that
+ * name for legacy support
+ */
+ __u8 len;
+ __u8 can_dlc; /* deprecated */
+ };
+ __u8 __pad; /* padding */
+ __u8 __res0; /* reserved / padding */
+ __u8 len8_dlc; /* optional DLC for 8 byte payload length (9 .. 15) */
+ __u8 data[8] __attribute__((aligned(8)));
+ };
+
+Remark: The len element contains the payload length in bytes and should be
+used instead of can_dlc. The deprecated can_dlc was misleadingly named as
+it always contained the plain payload length in bytes and not the so called
+'data length code' (DLC).
+
+To pass the raw DLC from/to a Classical CAN network device the len8_dlc
+element can contain values 9 .. 15 when the len element is 8 (the real
+payload length for all DLC values greater or equal to 8).
+
+The alignment of the (linear) payload data[] to a 64bit boundary
+allows the user to define their own structs and unions to easily access
+the CAN payload. There is no given byteorder on the CAN bus by
+default. A read(2) system call on a CAN_RAW socket transfers a
+struct can_frame to the user space.
+
+The sockaddr_can structure has an interface index like the
+PF_PACKET socket, that also binds to a specific interface:
+
+.. code-block:: C
+
+ struct sockaddr_can {
+ sa_family_t can_family;
+ int can_ifindex;
+ union {
+ /* transport protocol class address info (e.g. ISOTP) */
+ struct { canid_t rx_id, tx_id; } tp;
+
+ /* J1939 address information */
+ struct {
+ /* 8 byte name when using dynamic addressing */
+ __u64 name;
+
+ /* pgn:
+ * 8 bit: PS in PDU2 case, else 0
+ * 8 bit: PF
+ * 1 bit: DP
+ * 1 bit: reserved
+ */
+ __u32 pgn;
+
+ /* 1 byte address */
+ __u8 addr;
+ } j1939;
+
+ /* reserved for future CAN protocols address information */
+ } can_addr;
+ };
+
+To determine the interface index an appropriate ioctl() has to
+be used (example for CAN_RAW sockets without error checking):
+
+.. code-block:: C
+
+ int s;
+ struct sockaddr_can addr;
+ struct ifreq ifr;
+
+ s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
+
+ strcpy(ifr.ifr_name, "can0" );
+ ioctl(s, SIOCGIFINDEX, &ifr);
+
+ addr.can_family = AF_CAN;
+ addr.can_ifindex = ifr.ifr_ifindex;
+
+ bind(s, (struct sockaddr *)&addr, sizeof(addr));
+
+ (..)
+
+To bind a socket to all(!) CAN interfaces the interface index must
+be 0 (zero). In this case the socket receives CAN frames from every
+enabled CAN interface. To determine the originating CAN interface
+the system call recvfrom(2) may be used instead of read(2). To send
+on a socket that is bound to 'any' interface sendto(2) is needed to
+specify the outgoing interface.
+
+Reading CAN frames from a bound CAN_RAW socket (see above) consists
+of reading a struct can_frame:
+
+.. code-block:: C
+
+ struct can_frame frame;
+
+ nbytes = read(s, &frame, sizeof(struct can_frame));
+
+ if (nbytes < 0) {
+ perror("can raw socket read");
+ return 1;
+ }
+
+ /* paranoid check ... */
+ if (nbytes < sizeof(struct can_frame)) {
+ fprintf(stderr, "read: incomplete CAN frame\n");
+ return 1;
+ }
+
+ /* do something with the received CAN frame */
+
+Writing CAN frames can be done similarly, with the write(2) system call::
+
+ nbytes = write(s, &frame, sizeof(struct can_frame));
+
+When the CAN interface is bound to 'any' existing CAN interface
+(addr.can_ifindex = 0) it is recommended to use recvfrom(2) if the
+information about the originating CAN interface is needed:
+
+.. code-block:: C
+
+ struct sockaddr_can addr;
+ struct ifreq ifr;
+ socklen_t len = sizeof(addr);
+ struct can_frame frame;
+
+ nbytes = recvfrom(s, &frame, sizeof(struct can_frame),
+ 0, (struct sockaddr*)&addr, &len);
+
+ /* get interface name of the received CAN frame */
+ ifr.ifr_ifindex = addr.can_ifindex;
+ ioctl(s, SIOCGIFNAME, &ifr);
+ printf("Received a CAN frame from interface %s", ifr.ifr_name);
+
+To write CAN frames on sockets bound to 'any' CAN interface the
+outgoing interface has to be defined certainly:
+
+.. code-block:: C
+
+ strcpy(ifr.ifr_name, "can0");
+ ioctl(s, SIOCGIFINDEX, &ifr);
+ addr.can_ifindex = ifr.ifr_ifindex;
+ addr.can_family = AF_CAN;
+
+ nbytes = sendto(s, &frame, sizeof(struct can_frame),
+ 0, (struct sockaddr*)&addr, sizeof(addr));
+
+An accurate timestamp can be obtained with an ioctl(2) call after reading
+a message from the socket:
+
+.. code-block:: C
+
+ struct timeval tv;
+ ioctl(s, SIOCGSTAMP, &tv);
+
+The timestamp has a resolution of one microsecond and is set automatically
+at the reception of a CAN frame.
+
+Remark about CAN FD (flexible data rate) support:
+
+Generally the handling of CAN FD is very similar to the formerly described
+examples. The new CAN FD capable CAN controllers support two different
+bitrates for the arbitration phase and the payload phase of the CAN FD frame
+and up to 64 bytes of payload. This extended payload length breaks all the
+kernel interfaces (ABI) which heavily rely on the CAN frame with fixed eight
+bytes of payload (struct can_frame) like the CAN_RAW socket. Therefore e.g.
+the CAN_RAW socket supports a new socket option CAN_RAW_FD_FRAMES that
+switches the socket into a mode that allows the handling of CAN FD frames
+and Classical CAN frames simultaneously (see :ref:`socketcan-rawfd`).
+
+The struct canfd_frame is defined in include/linux/can.h:
+
+.. code-block:: C
+
+ struct canfd_frame {
+ canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */
+ __u8 len; /* frame payload length in byte (0 .. 64) */
+ __u8 flags; /* additional flags for CAN FD */
+ __u8 __res0; /* reserved / padding */
+ __u8 __res1; /* reserved / padding */
+ __u8 data[64] __attribute__((aligned(8)));
+ };
+
+The struct canfd_frame and the existing struct can_frame have the can_id,
+the payload length and the payload data at the same offset inside their
+structures. This allows to handle the different structures very similar.
+When the content of a struct can_frame is copied into a struct canfd_frame
+all structure elements can be used as-is - only the data[] becomes extended.
+
+When introducing the struct canfd_frame it turned out that the data length
+code (DLC) of the struct can_frame was used as a length information as the
+length and the DLC has a 1:1 mapping in the range of 0 .. 8. To preserve
+the easy handling of the length information the canfd_frame.len element
+contains a plain length value from 0 .. 64. So both canfd_frame.len and
+can_frame.len are equal and contain a length information and no DLC.
+For details about the distinction of CAN and CAN FD capable devices and
+the mapping to the bus-relevant data length code (DLC), see :ref:`socketcan-can-fd-driver`.
+
+The length of the two CAN(FD) frame structures define the maximum transfer
+unit (MTU) of the CAN(FD) network interface and skbuff data length. Two
+definitions are specified for CAN specific MTUs in include/linux/can.h:
+
+.. code-block:: C
+
+ #define CAN_MTU (sizeof(struct can_frame)) == 16 => Classical CAN frame
+ #define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame
+
+
+.. _socketcan-raw-sockets:
+
+RAW Protocol Sockets with can_filters (SOCK_RAW)
+------------------------------------------------
+
+Using CAN_RAW sockets is extensively comparable to the commonly
+known access to CAN character devices. To meet the new possibilities
+provided by the multi user SocketCAN approach, some reasonable
+defaults are set at RAW socket binding time:
+
+- The filters are set to exactly one filter receiving everything
+- The socket only receives valid data frames (=> no error message frames)
+- The loopback of sent CAN frames is enabled (see :ref:`socketcan-local-loopback2`)
+- The socket does not receive its own sent frames (in loopback mode)
+
+These default settings may be changed before or after binding the socket.
+To use the referenced definitions of the socket options for CAN_RAW
+sockets, include <linux/can/raw.h>.
+
+
+.. _socketcan-rawfilter:
+
+RAW socket option CAN_RAW_FILTER
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The reception of CAN frames using CAN_RAW sockets can be controlled
+by defining 0 .. n filters with the CAN_RAW_FILTER socket option.
+
+The CAN filter structure is defined in include/linux/can.h:
+
+.. code-block:: C
+
+ struct can_filter {
+ canid_t can_id;
+ canid_t can_mask;
+ };
+
+A filter matches, when:
+
+.. code-block:: C
+
+ <received_can_id> & mask == can_id & mask
+
+which is analogous to known CAN controllers hardware filter semantics.
+The filter can be inverted in this semantic, when the CAN_INV_FILTER
+bit is set in can_id element of the can_filter structure. In
+contrast to CAN controller hardware filters the user may set 0 .. n
+receive filters for each open socket separately:
+
+.. code-block:: C
+
+ struct can_filter rfilter[2];
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = CAN_SFF_MASK;
+ rfilter[1].can_id = 0x200;
+ rfilter[1].can_mask = 0x700;
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter));
+
+To disable the reception of CAN frames on the selected CAN_RAW socket:
+
+.. code-block:: C
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, NULL, 0);
+
+To set the filters to zero filters is quite obsolete as to not read
+data causes the raw socket to discard the received CAN frames. But
+having this 'send only' use-case we may remove the receive list in the
+Kernel to save a little (really a very little!) CPU usage.
+
+CAN Filter Usage Optimisation
+.............................
+
+The CAN filters are processed in per-device filter lists at CAN frame
+reception time. To reduce the number of checks that need to be performed
+while walking through the filter lists the CAN core provides an optimized
+filter handling when the filter subscription focusses on a single CAN ID.
+
+For the possible 2048 SFF CAN identifiers the identifier is used as an index
+to access the corresponding subscription list without any further checks.
+For the 2^29 possible EFF CAN identifiers a 10 bit XOR folding is used as
+hash function to retrieve the EFF table index.
+
+To benefit from the optimized filters for single CAN identifiers the
+CAN_SFF_MASK or CAN_EFF_MASK have to be set into can_filter.mask together
+with set CAN_EFF_FLAG and CAN_RTR_FLAG bits. A set CAN_EFF_FLAG bit in the
+can_filter.mask makes clear that it matters whether a SFF or EFF CAN ID is
+subscribed. E.g. in the example from above:
+
+.. code-block:: C
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = CAN_SFF_MASK;
+
+both SFF frames with CAN ID 0x123 and EFF frames with 0xXXXXX123 can pass.
+
+To filter for only 0x123 (SFF) and 0x12345678 (EFF) CAN identifiers the
+filter has to be defined in this way to benefit from the optimized filters:
+
+.. code-block:: C
+
+ struct can_filter rfilter[2];
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_SFF_MASK);
+ rfilter[1].can_id = 0x12345678 | CAN_EFF_FLAG;
+ rfilter[1].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_EFF_MASK);
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter));
+
+
+RAW Socket Option CAN_RAW_ERR_FILTER
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As described in :ref:`socketcan-network-problem-notifications` the CAN interface driver can generate so
+called Error Message Frames that can optionally be passed to the user
+application in the same way as other CAN frames. The possible
+errors are divided into different error classes that may be filtered
+using the appropriate error mask. To register for every possible
+error condition CAN_ERR_MASK can be used as value for the error mask.
+The values for the error mask are defined in linux/can/error.h:
+
+.. code-block:: C
+
+ can_err_mask_t err_mask = ( CAN_ERR_TX_TIMEOUT | CAN_ERR_BUSOFF );
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_ERR_FILTER,
+ &err_mask, sizeof(err_mask));
+
+
+RAW Socket Option CAN_RAW_LOOPBACK
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To meet multi user needs the local loopback is enabled by default
+(see :ref:`socketcan-local-loopback1` for details). But in some embedded use-cases
+(e.g. when only one application uses the CAN bus) this loopback
+functionality can be disabled (separately for each socket):
+
+.. code-block:: C
+
+ int loopback = 0; /* 0 = disabled, 1 = enabled (default) */
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_LOOPBACK, &loopback, sizeof(loopback));
+
+
+RAW socket option CAN_RAW_RECV_OWN_MSGS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When the local loopback is enabled, all the sent CAN frames are
+looped back to the open CAN sockets that registered for the CAN
+frames' CAN-ID on this given interface to meet the multi user
+needs. The reception of the CAN frames on the same socket that was
+sending the CAN frame is assumed to be unwanted and therefore
+disabled by default. This default behaviour may be changed on
+demand:
+
+.. code-block:: C
+
+ int recv_own_msgs = 1; /* 0 = disabled (default), 1 = enabled */
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS,
+ &recv_own_msgs, sizeof(recv_own_msgs));
+
+Note that reception of a socket's own CAN frames are subject to the same
+filtering as other CAN frames (see :ref:`socketcan-rawfilter`).
+
+.. _socketcan-rawfd:
+
+RAW Socket Option CAN_RAW_FD_FRAMES
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+CAN FD support in CAN_RAW sockets can be enabled with a new socket option
+CAN_RAW_FD_FRAMES which is off by default. When the new socket option is
+not supported by the CAN_RAW socket (e.g. on older kernels), switching the
+CAN_RAW_FD_FRAMES option returns the error -ENOPROTOOPT.
+
+Once CAN_RAW_FD_FRAMES is enabled the application can send both CAN frames
+and CAN FD frames. OTOH the application has to handle CAN and CAN FD frames
+when reading from the socket:
+
+.. code-block:: C
+
+ CAN_RAW_FD_FRAMES enabled: CAN_MTU and CANFD_MTU are allowed
+ CAN_RAW_FD_FRAMES disabled: only CAN_MTU is allowed (default)
+
+Example:
+
+.. code-block:: C
+
+ [ remember: CANFD_MTU == sizeof(struct canfd_frame) ]
+
+ struct canfd_frame cfd;
+
+ nbytes = read(s, &cfd, CANFD_MTU);
+
+ if (nbytes == CANFD_MTU) {
+ printf("got CAN FD frame with length %d\n", cfd.len);
+ /* cfd.flags contains valid data */
+ } else if (nbytes == CAN_MTU) {
+ printf("got Classical CAN frame with length %d\n", cfd.len);
+ /* cfd.flags is undefined */
+ } else {
+ fprintf(stderr, "read: invalid CAN(FD) frame\n");
+ return 1;
+ }
+
+ /* the content can be handled independently from the received MTU size */
+
+ printf("can_id: %X data length: %d data: ", cfd.can_id, cfd.len);
+ for (i = 0; i < cfd.len; i++)
+ printf("%02X ", cfd.data[i]);
+
+When reading with size CANFD_MTU only returns CAN_MTU bytes that have
+been received from the socket a Classical CAN frame has been read into the
+provided CAN FD structure. Note that the canfd_frame.flags data field is
+not specified in the struct can_frame and therefore it is only valid in
+CANFD_MTU sized CAN FD frames.
+
+Implementation hint for new CAN applications:
+
+To build a CAN FD aware application use struct canfd_frame as basic CAN
+data structure for CAN_RAW based applications. When the application is
+executed on an older Linux kernel and switching the CAN_RAW_FD_FRAMES
+socket option returns an error: No problem. You'll get Classical CAN frames
+or CAN FD frames and can process them the same way.
+
+When sending to CAN devices make sure that the device is capable to handle
+CAN FD frames by checking if the device maximum transfer unit is CANFD_MTU.
+The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall.
+
+
+RAW socket option CAN_RAW_JOIN_FILTERS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The CAN_RAW socket can set multiple CAN identifier specific filters that
+lead to multiple filters in the af_can.c filter processing. These filters
+are indenpendent from each other which leads to logical OR'ed filters when
+applied (see :ref:`socketcan-rawfilter`).
+
+This socket option joines the given CAN filters in the way that only CAN
+frames are passed to user space that matched *all* given CAN filters. The
+semantic for the applied filters is therefore changed to a logical AND.
+
+This is useful especially when the filterset is a combination of filters
+where the CAN_INV_FILTER flag is set in order to notch single CAN IDs or
+CAN ID ranges from the incoming traffic.
+
+
+RAW Socket Returned Message Flags
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When using recvmsg() call, the msg->msg_flags may contain following flags:
+
+MSG_DONTROUTE:
+ set when the received frame was created on the local host.
+
+MSG_CONFIRM:
+ set when the frame was sent via the socket it is received on.
+ This flag can be interpreted as a 'transmission confirmation' when the
+ CAN driver supports the echo of frames on driver level, see
+ :ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`.
+ In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set.
+
+
+Broadcast Manager Protocol Sockets (SOCK_DGRAM)
+-----------------------------------------------
+
+The Broadcast Manager protocol provides a command based configuration
+interface to filter and send (e.g. cyclic) CAN messages in kernel space.
+
+Receive filters can be used to down sample frequent messages; detect events
+such as message contents changes, packet length changes, and do time-out
+monitoring of received messages.
+
+Periodic transmission tasks of CAN frames or a sequence of CAN frames can be
+created and modified at runtime; both the message content and the two
+possible transmit intervals can be altered.
+
+A BCM socket is not intended for sending individual CAN frames using the
+struct can_frame as known from the CAN_RAW socket. Instead a special BCM
+configuration message is defined. The basic BCM configuration message used
+to communicate with the broadcast manager and the available operations are
+defined in the linux/can/bcm.h include. The BCM message consists of a
+message header with a command ('opcode') followed by zero or more CAN frames.
+The broadcast manager sends responses to user space in the same form:
+
+.. code-block:: C
+
+ struct bcm_msg_head {
+ __u32 opcode; /* command */
+ __u32 flags; /* special flags */
+ __u32 count; /* run 'count' times with ival1 */
+ struct timeval ival1, ival2; /* count and subsequent interval */
+ canid_t can_id; /* unique can_id for task */
+ __u32 nframes; /* number of can_frames following */
+ struct can_frame frames[0];
+ };
+
+The aligned payload 'frames' uses the same basic CAN frame structure defined
+at the beginning of :ref:`socketcan-rawfd` and in the include/linux/can.h include. All
+messages to the broadcast manager from user space have this structure.
+
+Note a CAN_BCM socket must be connected instead of bound after socket
+creation (example without error checking):
+
+.. code-block:: C
+
+ int s;
+ struct sockaddr_can addr;
+ struct ifreq ifr;
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM);
+
+ strcpy(ifr.ifr_name, "can0");
+ ioctl(s, SIOCGIFINDEX, &ifr);
+
+ addr.can_family = AF_CAN;
+ addr.can_ifindex = ifr.ifr_ifindex;
+
+ connect(s, (struct sockaddr *)&addr, sizeof(addr));
+
+ (..)
+
+The broadcast manager socket is able to handle any number of in flight
+transmissions or receive filters concurrently. The different RX/TX jobs are
+distinguished by the unique can_id in each BCM message. However additional
+CAN_BCM sockets are recommended to communicate on multiple CAN interfaces.
+When the broadcast manager socket is bound to 'any' CAN interface (=> the
+interface index is set to zero) the configured receive filters apply to any
+CAN interface unless the sendto() syscall is used to overrule the 'any' CAN
+interface index. When using recvfrom() instead of read() to retrieve BCM
+socket messages the originating CAN interface is provided in can_ifindex.
+
+
+Broadcast Manager Operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The opcode defines the operation for the broadcast manager to carry out,
+or details the broadcast managers response to several events, including
+user requests.
+
+Transmit Operations (user space to broadcast manager):
+
+TX_SETUP:
+ Create (cyclic) transmission task.
+
+TX_DELETE:
+ Remove (cyclic) transmission task, requires only can_id.
+
+TX_READ:
+ Read properties of (cyclic) transmission task for can_id.
+
+TX_SEND:
+ Send one CAN frame.
+
+Transmit Responses (broadcast manager to user space):
+
+TX_STATUS:
+ Reply to TX_READ request (transmission task configuration).
+
+TX_EXPIRED:
+ Notification when counter finishes sending at initial interval
+ 'ival1'. Requires the TX_COUNTEVT flag to be set at TX_SETUP.
+
+Receive Operations (user space to broadcast manager):
+
+RX_SETUP:
+ Create RX content filter subscription.
+
+RX_DELETE:
+ Remove RX content filter subscription, requires only can_id.
+
+RX_READ:
+ Read properties of RX content filter subscription for can_id.
+
+Receive Responses (broadcast manager to user space):
+
+RX_STATUS:
+ Reply to RX_READ request (filter task configuration).
+
+RX_TIMEOUT:
+ Cyclic message is detected to be absent (timer ival1 expired).
+
+RX_CHANGED:
+ BCM message with updated CAN frame (detected content change).
+ Sent on first message received or on receipt of revised CAN messages.
+
+
+Broadcast Manager Message Flags
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When sending a message to the broadcast manager the 'flags' element may
+contain the following flag definitions which influence the behaviour:
+
+SETTIMER:
+ Set the values of ival1, ival2 and count
+
+STARTTIMER:
+ Start the timer with the actual values of ival1, ival2
+ and count. Starting the timer leads simultaneously to emit a CAN frame.
+
+TX_COUNTEVT:
+ Create the message TX_EXPIRED when count expires
+
+TX_ANNOUNCE:
+ A change of data by the process is emitted immediately.
+
+TX_CP_CAN_ID:
+ Copies the can_id from the message header to each
+ subsequent frame in frames. This is intended as usage simplification. For
+ TX tasks the unique can_id from the message header may differ from the
+ can_id(s) stored for transmission in the subsequent struct can_frame(s).
+
+RX_FILTER_ID:
+ Filter by can_id alone, no frames required (nframes=0).
+
+RX_CHECK_DLC:
+ A change of the DLC leads to an RX_CHANGED.
+
+RX_NO_AUTOTIMER:
+ Prevent automatically starting the timeout monitor.
+
+RX_ANNOUNCE_RESUME:
+ If passed at RX_SETUP and a receive timeout occurred, a
+ RX_CHANGED message will be generated when the (cyclic) receive restarts.
+
+TX_RESET_MULTI_IDX:
+ Reset the index for the multiple frame transmission.
+
+RX_RTR_FRAME:
+ Send reply for RTR-request (placed in op->frames[0]).
+
+CAN_FD_FRAME:
+ The CAN frames following the bcm_msg_head are struct canfd_frame's
+
+Broadcast Manager Transmission Timers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Periodic transmission configurations may use up to two interval timers.
+In this case the BCM sends a number of messages ('count') at an interval
+'ival1', then continuing to send at another given interval 'ival2'. When
+only one timer is needed 'count' is set to zero and only 'ival2' is used.
+When SET_TIMER and START_TIMER flag were set the timers are activated.
+The timer values can be altered at runtime when only SET_TIMER is set.
+
+
+Broadcast Manager message sequence transmission
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Up to 256 CAN frames can be transmitted in a sequence in the case of a cyclic
+TX task configuration. The number of CAN frames is provided in the 'nframes'
+element of the BCM message head. The defined number of CAN frames are added
+as array to the TX_SETUP BCM configuration message:
+
+.. code-block:: C
+
+ /* create a struct to set up a sequence of four CAN frames */
+ struct {
+ struct bcm_msg_head msg_head;
+ struct can_frame frame[4];
+ } mytxmsg;
+
+ (..)
+ mytxmsg.msg_head.nframes = 4;
+ (..)
+
+ write(s, &mytxmsg, sizeof(mytxmsg));
+
+With every transmission the index in the array of CAN frames is increased
+and set to zero at index overflow.
+
+
+Broadcast Manager Receive Filter Timers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The timer values ival1 or ival2 may be set to non-zero values at RX_SETUP.
+When the SET_TIMER flag is set the timers are enabled:
+
+ival1:
+ Send RX_TIMEOUT when a received message is not received again within
+ the given time. When START_TIMER is set at RX_SETUP the timeout detection
+ is activated directly - even without a former CAN frame reception.
+
+ival2:
+ Throttle the received message rate down to the value of ival2. This
+ is useful to reduce messages for the application when the signal inside the
+ CAN frame is stateless as state changes within the ival2 period may get
+ lost.
+
+Broadcast Manager Multiplex Message Receive Filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To filter for content changes in multiplex message sequences an array of more
+than one CAN frames can be passed in a RX_SETUP configuration message. The
+data bytes of the first CAN frame contain the mask of relevant bits that
+have to match in the subsequent CAN frames with the received CAN frame.
+If one of the subsequent CAN frames is matching the bits in that frame data
+mark the relevant content to be compared with the previous received content.
+Up to 257 CAN frames (multiplex filter bit mask CAN frame plus 256 CAN
+filters) can be added as array to the TX_SETUP BCM configuration message:
+
+.. code-block:: C
+
+ /* usually used to clear CAN frame data[] - beware of endian problems! */
+ #define U64_DATA(p) (*(unsigned long long*)(p)->data)
+
+ struct {
+ struct bcm_msg_head msg_head;
+ struct can_frame frame[5];
+ } msg;
+
+ msg.msg_head.opcode = RX_SETUP;
+ msg.msg_head.can_id = 0x42;
+ msg.msg_head.flags = 0;
+ msg.msg_head.nframes = 5;
+ U64_DATA(&msg.frame[0]) = 0xFF00000000000000ULL; /* MUX mask */
+ U64_DATA(&msg.frame[1]) = 0x01000000000000FFULL; /* data mask (MUX 0x01) */
+ U64_DATA(&msg.frame[2]) = 0x0200FFFF000000FFULL; /* data mask (MUX 0x02) */
+ U64_DATA(&msg.frame[3]) = 0x330000FFFFFF0003ULL; /* data mask (MUX 0x33) */
+ U64_DATA(&msg.frame[4]) = 0x4F07FC0FF0000000ULL; /* data mask (MUX 0x4F) */
+
+ write(s, &msg, sizeof(msg));
+
+
+Broadcast Manager CAN FD Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The programming API of the CAN_BCM depends on struct can_frame which is
+given as array directly behind the bcm_msg_head structure. To follow this
+schema for the CAN FD frames a new flag 'CAN_FD_FRAME' in the bcm_msg_head
+flags indicates that the concatenated CAN frame structures behind the
+bcm_msg_head are defined as struct canfd_frame:
+
+.. code-block:: C
+
+ struct {
+ struct bcm_msg_head msg_head;
+ struct canfd_frame frame[5];
+ } msg;
+
+ msg.msg_head.opcode = RX_SETUP;
+ msg.msg_head.can_id = 0x42;
+ msg.msg_head.flags = CAN_FD_FRAME;
+ msg.msg_head.nframes = 5;
+ (..)
+
+When using CAN FD frames for multiplex filtering the MUX mask is still
+expected in the first 64 bit of the struct canfd_frame data section.
+
+
+Connected Transport Protocols (SOCK_SEQPACKET)
+----------------------------------------------
+
+(to be written)
+
+
+Unconnected Transport Protocols (SOCK_DGRAM)
+--------------------------------------------
+
+(to be written)
+
+
+.. _socketcan-core-module:
+
+SocketCAN Core Module
+=====================
+
+The SocketCAN core module implements the protocol family
+PF_CAN. CAN protocol modules are loaded by the core module at
+runtime. The core module provides an interface for CAN protocol
+modules to subscribe needed CAN IDs (see :ref:`socketcan-receive-lists`).
+
+
+can.ko Module Params
+--------------------
+
+- **stats_timer**:
+ To calculate the SocketCAN core statistics
+ (e.g. current/maximum frames per second) this 1 second timer is
+ invoked at can.ko module start time by default. This timer can be
+ disabled by using stattimer=0 on the module commandline.
+
+- **debug**:
+ (removed since SocketCAN SVN r546)
+
+
+procfs content
+--------------
+
+As described in :ref:`socketcan-receive-lists` the SocketCAN core uses several filter
+lists to deliver received CAN frames to CAN protocol modules. These
+receive lists, their filters and the count of filter matches can be
+checked in the appropriate receive list. All entries contain the
+device and a protocol module identifier::
+
+ foo@bar:~$ cat /proc/net/can/rcvlist_all
+
+ receive list 'rx_all':
+ (vcan3: no entry)
+ (vcan2: no entry)
+ (vcan1: no entry)
+ device can_id can_mask function userdata matches ident
+ vcan0 000 00000000 f88e6370 f6c6f400 0 raw
+ (any: no entry)
+
+In this example an application requests any CAN traffic from vcan0::
+
+ rcvlist_all - list for unfiltered entries (no filter operations)
+ rcvlist_eff - list for single extended frame (EFF) entries
+ rcvlist_err - list for error message frames masks
+ rcvlist_fil - list for mask/value filters
+ rcvlist_inv - list for mask/value filters (inverse semantic)
+ rcvlist_sff - list for single standard frame (SFF) entries
+
+Additional procfs files in /proc/net/can::
+
+ stats - SocketCAN core statistics (rx/tx frames, match ratios, ...)
+ reset_stats - manual statistic reset
+ version - prints SocketCAN core and ABI version (removed in Linux 5.10)
+
+
+Writing Own CAN Protocol Modules
+--------------------------------
+
+To implement a new protocol in the protocol family PF_CAN a new
+protocol has to be defined in include/linux/can.h .
+The prototypes and definitions to use the SocketCAN core can be
+accessed by including include/linux/can/core.h .
+In addition to functions that register the CAN protocol and the
+CAN device notifier chain there are functions to subscribe CAN
+frames received by CAN interfaces and to send CAN frames::
+
+ can_rx_register - subscribe CAN frames from a specific interface
+ can_rx_unregister - unsubscribe CAN frames from a specific interface
+ can_send - transmit a CAN frame (optional with local loopback)
+
+For details see the kerneldoc documentation in net/can/af_can.c or
+the source code of net/can/raw.c or net/can/bcm.c .
+
+
+CAN Network Drivers
+===================
+
+Writing a CAN network device driver is much easier than writing a
+CAN character device driver. Similar to other known network device
+drivers you mainly have to deal with:
+
+- TX: Put the CAN frame from the socket buffer to the CAN controller.
+- RX: Put the CAN frame from the CAN controller to the socket buffer.
+
+See e.g. at Documentation/networking/netdevices.rst . The differences
+for writing CAN network device driver are described below:
+
+
+General Settings
+----------------
+
+.. code-block:: C
+
+ dev->type = ARPHRD_CAN; /* the netdevice hardware type */
+ dev->flags = IFF_NOARP; /* CAN has no arp */
+
+ dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> Classical CAN interface */
+
+ or alternative, when the controller supports CAN with flexible data rate:
+ dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */
+
+The struct can_frame or struct canfd_frame is the payload of each socket
+buffer (skbuff) in the protocol family PF_CAN.
+
+
+.. _socketcan-local-loopback2:
+
+Local Loopback of Sent Frames
+-----------------------------
+
+As described in :ref:`socketcan-local-loopback1` the CAN network device driver should
+support a local loopback functionality similar to the local echo
+e.g. of tty devices. In this case the driver flag IFF_ECHO has to be
+set to prevent the PF_CAN core from locally echoing sent frames
+(aka loopback) as fallback solution::
+
+ dev->flags = (IFF_NOARP | IFF_ECHO);
+
+
+CAN Controller Hardware Filters
+-------------------------------
+
+To reduce the interrupt load on deep embedded systems some CAN
+controllers support the filtering of CAN IDs or ranges of CAN IDs.
+These hardware filter capabilities vary from controller to
+controller and have to be identified as not feasible in a multi-user
+networking approach. The use of the very controller specific
+hardware filters could make sense in a very dedicated use-case, as a
+filter on driver level would affect all users in the multi-user
+system. The high efficient filter sets inside the PF_CAN core allow
+to set different multiple filters for each socket separately.
+Therefore the use of hardware filters goes to the category 'handmade
+tuning on deep embedded systems'. The author is running a MPC603e
+@133MHz with four SJA1000 CAN controllers from 2002 under heavy bus
+load without any problems ...
+
+
+Switchable Termination Resistors
+--------------------------------
+
+CAN bus requires a specific impedance across the differential pair,
+typically provided by two 120Ohm resistors on the farthest nodes of
+the bus. Some CAN controllers support activating / deactivating a
+termination resistor(s) to provide the correct impedance.
+
+Query the available resistances::
+
+ $ ip -details link show can0
+ ...
+ termination 120 [ 0, 120 ]
+
+Activate the terminating resistor::
+
+ $ ip link set dev can0 type can termination 120
+
+Deactivate the terminating resistor::
+
+ $ ip link set dev can0 type can termination 0
+
+To enable termination resistor support to a can-controller, either
+implement in the controller's struct can-priv::
+
+ termination_const
+ termination_const_cnt
+ do_set_termination
+
+or add gpio control with the device tree entries from
+Documentation/devicetree/bindings/net/can/can-controller.yaml
+
+
+The Virtual CAN Driver (vcan)
+-----------------------------
+
+Similar to the network loopback devices, vcan offers a virtual local
+CAN interface. A full qualified address on CAN consists of
+
+- a unique CAN Identifier (CAN ID)
+- the CAN bus this CAN ID is transmitted on (e.g. can0)
+
+so in common use cases more than one virtual CAN interface is needed.
+
+The virtual CAN interfaces allow the transmission and reception of CAN
+frames without real CAN controller hardware. Virtual CAN network
+devices are usually named 'vcanX', like vcan0 vcan1 vcan2 ...
+When compiled as a module the virtual CAN driver module is called vcan.ko
+
+Since Linux Kernel version 2.6.24 the vcan driver supports the Kernel
+netlink interface to create vcan network devices. The creation and
+removal of vcan network devices can be managed with the ip(8) tool::
+
+ - Create a virtual CAN network interface:
+ $ ip link add type vcan
+
+ - Create a virtual CAN network interface with a specific name 'vcan42':
+ $ ip link add dev vcan42 type vcan
+
+ - Remove a (virtual CAN) network interface 'vcan42':
+ $ ip link del vcan42
+
+
+The CAN Network Device Driver Interface
+---------------------------------------
+
+The CAN network device driver interface provides a generic interface
+to setup, configure and monitor CAN network devices. The user can then
+configure the CAN device, like setting the bit-timing parameters, via
+the netlink interface using the program "ip" from the "IPROUTE2"
+utility suite. The following chapter describes briefly how to use it.
+Furthermore, the interface uses a common data structure and exports a
+set of common functions, which all real CAN network device drivers
+should use. Please have a look to the SJA1000 or MSCAN driver to
+understand how to use them. The name of the module is can-dev.ko.
+
+
+Netlink interface to set/get devices properties
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The CAN device must be configured via netlink interface. The supported
+netlink message types are defined and briefly described in
+"include/linux/can/netlink.h". CAN link support for the program "ip"
+of the IPROUTE2 utility suite is available and it can be used as shown
+below:
+
+Setting CAN device properties::
+
+ $ ip link set can0 type can help
+ Usage: ip link set DEVICE type can
+ [ bitrate BITRATE [ sample-point SAMPLE-POINT] ] |
+ [ tq TQ prop-seg PROP_SEG phase-seg1 PHASE-SEG1
+ phase-seg2 PHASE-SEG2 [ sjw SJW ] ]
+
+ [ dbitrate BITRATE [ dsample-point SAMPLE-POINT] ] |
+ [ dtq TQ dprop-seg PROP_SEG dphase-seg1 PHASE-SEG1
+ dphase-seg2 PHASE-SEG2 [ dsjw SJW ] ]
+
+ [ loopback { on | off } ]
+ [ listen-only { on | off } ]
+ [ triple-sampling { on | off } ]
+ [ one-shot { on | off } ]
+ [ berr-reporting { on | off } ]
+ [ fd { on | off } ]
+ [ fd-non-iso { on | off } ]
+ [ presume-ack { on | off } ]
+ [ cc-len8-dlc { on | off } ]
+
+ [ restart-ms TIME-MS ]
+ [ restart ]
+
+ Where: BITRATE := { 1..1000000 }
+ SAMPLE-POINT := { 0.000..0.999 }
+ TQ := { NUMBER }
+ PROP-SEG := { 1..8 }
+ PHASE-SEG1 := { 1..8 }
+ PHASE-SEG2 := { 1..8 }
+ SJW := { 1..4 }
+ RESTART-MS := { 0 | NUMBER }
+
+Display CAN device details and statistics::
+
+ $ ip -details -statistics link show can0
+ 2: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP qlen 10
+ link/can
+ can <TRIPLE-SAMPLING> state ERROR-ACTIVE restart-ms 100
+ bitrate 125000 sample_point 0.875
+ tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
+ sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
+ clock 8000000
+ re-started bus-errors arbit-lost error-warn error-pass bus-off
+ 41 17457 0 41 42 41
+ RX: bytes packets errors dropped overrun mcast
+ 140859 17608 17457 0 0 0
+ TX: bytes packets errors dropped carrier collsns
+ 861 112 0 41 0 0
+
+More info to the above output:
+
+"<TRIPLE-SAMPLING>"
+ Shows the list of selected CAN controller modes: LOOPBACK,
+ LISTEN-ONLY, or TRIPLE-SAMPLING.
+
+"state ERROR-ACTIVE"
+ The current state of the CAN controller: "ERROR-ACTIVE",
+ "ERROR-WARNING", "ERROR-PASSIVE", "BUS-OFF" or "STOPPED"
+
+"restart-ms 100"
+ Automatic restart delay time. If set to a non-zero value, a
+ restart of the CAN controller will be triggered automatically
+ in case of a bus-off condition after the specified delay time
+ in milliseconds. By default it's off.
+
+"bitrate 125000 sample-point 0.875"
+ Shows the real bit-rate in bits/sec and the sample-point in the
+ range 0.000..0.999. If the calculation of bit-timing parameters
+ is enabled in the kernel (CONFIG_CAN_CALC_BITTIMING=y), the
+ bit-timing can be defined by setting the "bitrate" argument.
+ Optionally the "sample-point" can be specified. By default it's
+ 0.000 assuming CIA-recommended sample-points.
+
+"tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1"
+ Shows the time quanta in ns, propagation segment, phase buffer
+ segment 1 and 2 and the synchronisation jump width in units of
+ tq. They allow to define the CAN bit-timing in a hardware
+ independent format as proposed by the Bosch CAN 2.0 spec (see
+ chapter 8 of http://www.semiconductors.bosch.de/pdf/can2spec.pdf).
+
+"sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1 clock 8000000"
+ Shows the bit-timing constants of the CAN controller, here the
+ "sja1000". The minimum and maximum values of the time segment 1
+ and 2, the synchronisation jump width in units of tq, the
+ bitrate pre-scaler and the CAN system clock frequency in Hz.
+ These constants could be used for user-defined (non-standard)
+ bit-timing calculation algorithms in user-space.
+
+"re-started bus-errors arbit-lost error-warn error-pass bus-off"
+ Shows the number of restarts, bus and arbitration lost errors,
+ and the state changes to the error-warning, error-passive and
+ bus-off state. RX overrun errors are listed in the "overrun"
+ field of the standard network statistics.
+
+Setting the CAN Bit-Timing
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The CAN bit-timing parameters can always be defined in a hardware
+independent format as proposed in the Bosch CAN 2.0 specification
+specifying the arguments "tq", "prop_seg", "phase_seg1", "phase_seg2"
+and "sjw"::
+
+ $ ip link set canX type can tq 125 prop-seg 6 \
+ phase-seg1 7 phase-seg2 2 sjw 1
+
+If the kernel option CONFIG_CAN_CALC_BITTIMING is enabled, CIA
+recommended CAN bit-timing parameters will be calculated if the bit-
+rate is specified with the argument "bitrate"::
+
+ $ ip link set canX type can bitrate 125000
+
+Note that this works fine for the most common CAN controllers with
+standard bit-rates but may *fail* for exotic bit-rates or CAN system
+clock frequencies. Disabling CONFIG_CAN_CALC_BITTIMING saves some
+space and allows user-space tools to solely determine and set the
+bit-timing parameters. The CAN controller specific bit-timing
+constants can be used for that purpose. They are listed by the
+following command::
+
+ $ ip -details link show can0
+ ...
+ sja1000: clock 8000000 tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
+
+
+Starting and Stopping the CAN Network Device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A CAN network device is started or stopped as usual with the command
+"ifconfig canX up/down" or "ip link set canX up/down". Be aware that
+you *must* define proper bit-timing parameters for real CAN devices
+before you can start it to avoid error-prone default settings::
+
+ $ ip link set canX up type can bitrate 125000
+
+A device may enter the "bus-off" state if too many errors occurred on
+the CAN bus. Then no more messages are received or sent. An automatic
+bus-off recovery can be enabled by setting the "restart-ms" to a
+non-zero value, e.g.::
+
+ $ ip link set canX type can restart-ms 100
+
+Alternatively, the application may realize the "bus-off" condition
+by monitoring CAN error message frames and do a restart when
+appropriate with the command::
+
+ $ ip link set canX type can restart
+
+Note that a restart will also create a CAN error message frame (see
+also :ref:`socketcan-network-problem-notifications`).
+
+
+.. _socketcan-can-fd-driver:
+
+CAN FD (Flexible Data Rate) Driver Support
+------------------------------------------
+
+CAN FD capable CAN controllers support two different bitrates for the
+arbitration phase and the payload phase of the CAN FD frame. Therefore a
+second bit timing has to be specified in order to enable the CAN FD bitrate.
+
+Additionally CAN FD capable CAN controllers support up to 64 bytes of
+payload. The representation of this length in can_frame.len and
+canfd_frame.len for userspace applications and inside the Linux network
+layer is a plain value from 0 .. 64 instead of the CAN 'data length code'.
+The data length code was a 1:1 mapping to the payload length in the Classical
+CAN frames anyway. The payload length to the bus-relevant DLC mapping is
+only performed inside the CAN drivers, preferably with the helper
+functions can_fd_dlc2len() and can_fd_len2dlc().
+
+The CAN netdevice driver capabilities can be distinguished by the network
+devices maximum transfer unit (MTU)::
+
+ MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => Classical CAN device
+ MTU = 72 (CANFD_MTU) => sizeof(struct canfd_frame) => CAN FD capable device
+
+The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall.
+N.B. CAN FD capable devices can also handle and send Classical CAN frames.
+
+When configuring CAN FD capable CAN controllers an additional 'data' bitrate
+has to be set. This bitrate for the data phase of the CAN FD frame has to be
+at least the bitrate which was configured for the arbitration phase. This
+second bitrate is specified analogue to the first bitrate but the bitrate
+setting keywords for the 'data' bitrate start with 'd' e.g. dbitrate,
+dsample-point, dsjw or dtq and similar settings. When a data bitrate is set
+within the configuration process the controller option "fd on" can be
+specified to enable the CAN FD mode in the CAN controller. This controller
+option also switches the device MTU to 72 (CANFD_MTU).
+
+The first CAN FD specification presented as whitepaper at the International
+CAN Conference 2012 needed to be improved for data integrity reasons.
+Therefore two CAN FD implementations have to be distinguished today:
+
+- ISO compliant: The ISO 11898-1:2015 CAN FD implementation (default)
+- non-ISO compliant: The CAN FD implementation following the 2012 whitepaper
+
+Finally there are three types of CAN FD controllers:
+
+1. ISO compliant (fixed)
+2. non-ISO compliant (fixed, like the M_CAN IP core v3.0.1 in m_can.c)
+3. ISO/non-ISO CAN FD controllers (switchable, like the PEAK PCAN-USB FD)
+
+The current ISO/non-ISO mode is announced by the CAN controller driver via
+netlink and displayed by the 'ip' tool (controller option FD-NON-ISO).
+The ISO/non-ISO-mode can be altered by setting 'fd-non-iso {on|off}' for
+switchable CAN FD controllers only.
+
+Example configuring 500 kbit/s arbitration bitrate and 4 Mbit/s data bitrate::
+
+ $ ip link set can0 up type can bitrate 500000 sample-point 0.75 \
+ dbitrate 4000000 dsample-point 0.8 fd on
+ $ ip -details link show can0
+ 5: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN \
+ mode DEFAULT group default qlen 10
+ link/can promiscuity 0
+ can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
+ bitrate 500000 sample-point 0.750
+ tq 50 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
+ pcan_usb_pro_fd: tseg1 1..64 tseg2 1..16 sjw 1..16 brp 1..1024 \
+ brp-inc 1
+ dbitrate 4000000 dsample-point 0.800
+ dtq 12 dprop-seg 7 dphase-seg1 8 dphase-seg2 4 dsjw 1
+ pcan_usb_pro_fd: dtseg1 1..16 dtseg2 1..8 dsjw 1..4 dbrp 1..1024 \
+ dbrp-inc 1
+ clock 80000000
+
+Example when 'fd-non-iso on' is added on this switchable CAN FD adapter::
+
+ can <FD,FD-NON-ISO> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
+
+
+Supported CAN Hardware
+----------------------
+
+Please check the "Kconfig" file in "drivers/net/can" to get an actual
+list of the support CAN hardware. On the SocketCAN project website
+(see :ref:`socketcan-resources`) there might be further drivers available, also for
+older kernel versions.
+
+
+.. _socketcan-resources:
+
+SocketCAN Resources
+===================
+
+The Linux CAN / SocketCAN project resources (project site / mailing list)
+are referenced in the MAINTAINERS file in the Linux source tree.
+Search for CAN NETWORK [LAYERS|DRIVERS].
+
+Credits
+=======
+
+- Oliver Hartkopp (PF_CAN core, filters, drivers, bcm, SJA1000 driver)
+- Urs Thuermann (PF_CAN core, kernel integration, socket interfaces, raw, vcan)
+- Jan Kizka (RT-SocketCAN core, Socket-API reconciliation)
+- Wolfgang Grandegger (RT-SocketCAN core & drivers, Raw Socket-API reviews, CAN device driver interface, MSCAN driver)
+- Robert Schwebel (design reviews, PTXdist integration)
+- Marc Kleine-Budde (design reviews, Kernel 2.6 cleanups, drivers)
+- Benedikt Spranger (reviews)
+- Thomas Gleixner (LKML reviews, coding style, posting hints)
+- Andrey Volkov (kernel subtree structure, ioctls, MSCAN driver)
+- Matthias Brukner (first SJA1000 CAN netdevice implementation Q2/2003)
+- Klaus Hitschler (PEAK driver integration)
+- Uwe Koppe (CAN netdevices with PF_PACKET approach)
+- Michael Schulze (driver layer loopback requirement, RT CAN drivers review)
+- Pavel Pisa (Bit-timing calculation)
+- Sascha Hauer (SJA1000 platform driver)
+- Sebastian Haas (SJA1000 EMS PCI driver)
+- Markus Plessing (SJA1000 EMS PCI driver)
+- Per Dalen (SJA1000 Kvaser PCI driver)
+- Sam Ravnborg (reviews, coding style, kbuild help)
diff --git a/Documentation/networking/can_ucan_protocol.rst b/Documentation/networking/can_ucan_protocol.rst
new file mode 100644
index 0000000000..935d872ae8
--- /dev/null
+++ b/Documentation/networking/can_ucan_protocol.rst
@@ -0,0 +1,332 @@
+=================
+The UCAN Protocol
+=================
+
+UCAN is the protocol used by the microcontroller-based USB-CAN
+adapter that is integrated on System-on-Modules from Theobroma Systems
+and that is also available as a standalone USB stick.
+
+The UCAN protocol has been designed to be hardware-independent.
+It is modeled closely after how Linux represents CAN devices
+internally. All multi-byte integers are encoded as Little Endian.
+
+All structures mentioned in this document are defined in
+``drivers/net/can/usb/ucan.c``.
+
+USB Endpoints
+=============
+
+UCAN devices use three USB endpoints:
+
+CONTROL endpoint
+ The driver sends device management commands on this endpoint
+
+IN endpoint
+ The device sends CAN data frames and CAN error frames
+
+OUT endpoint
+ The driver sends CAN data frames on the out endpoint
+
+
+CONTROL Messages
+================
+
+UCAN devices are configured using vendor requests on the control pipe.
+
+To support multiple CAN interfaces in a single USB device all
+configuration commands target the corresponding interface in the USB
+descriptor.
+
+The driver uses ``ucan_ctrl_command_in/out`` and
+``ucan_device_request_in`` to deliver commands to the device.
+
+Setup Packet
+------------
+
+================= =====================================================
+``bmRequestType`` Direction | Vendor | (Interface or Device)
+``bRequest`` Command Number
+``wValue`` Subcommand Number (16 Bit) or 0 if not used
+``wIndex`` USB Interface Index (0 for device commands)
+``wLength`` * Host to Device - Number of bytes to transmit
+ * Device to Host - Maximum Number of bytes to
+ receive. If the device send less. Common ZLP
+ semantics are used.
+================= =====================================================
+
+Error Handling
+--------------
+
+The device indicates failed control commands by stalling the
+pipe.
+
+Device Commands
+---------------
+
+UCAN_DEVICE_GET_FW_STRING
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*Dev2Host; optional*
+
+Request the device firmware string.
+
+
+Interface Commands
+------------------
+
+UCAN_COMMAND_START
+~~~~~~~~~~~~~~~~~~
+
+*Host2Dev; mandatory*
+
+Bring the CAN interface up.
+
+Payload Format
+ ``ucan_ctl_payload_t.cmd_start``
+
+==== ============================
+mode or mask of ``UCAN_MODE_*``
+==== ============================
+
+UCAN_COMMAND_STOP
+~~~~~~~~~~~~~~~~~~
+
+*Host2Dev; mandatory*
+
+Stop the CAN interface
+
+Payload Format
+ *empty*
+
+UCAN_COMMAND_RESET
+~~~~~~~~~~~~~~~~~~
+
+*Host2Dev; mandatory*
+
+Reset the CAN controller (including error counters)
+
+Payload Format
+ *empty*
+
+UCAN_COMMAND_GET
+~~~~~~~~~~~~~~~~
+
+*Host2Dev; mandatory*
+
+Get Information from the Device
+
+Subcommands
+^^^^^^^^^^^
+
+UCAN_COMMAND_GET_INFO
+ Request the device information structure ``ucan_ctl_payload_t.device_info``.
+
+ See the ``device_info`` field for details, and
+ ``uapi/linux/can/netlink.h`` for an explanation of the
+ ``can_bittiming fields``.
+
+ Payload Format
+ ``ucan_ctl_payload_t.device_info``
+
+UCAN_COMMAND_GET_PROTOCOL_VERSION
+
+ Request the device protocol version
+ ``ucan_ctl_payload_t.protocol_version``. The current protocol version is 3.
+
+ Payload Format
+ ``ucan_ctl_payload_t.protocol_version``
+
+.. note:: Devices that do not implement this command use the old
+ protocol version 1
+
+UCAN_COMMAND_SET_BITTIMING
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*Host2Dev; mandatory*
+
+Setup bittiming by sending the structure
+``ucan_ctl_payload_t.cmd_set_bittiming`` (see ``struct bittiming`` for
+details)
+
+Payload Format
+ ``ucan_ctl_payload_t.cmd_set_bittiming``.
+
+UCAN_SLEEP/WAKE
+~~~~~~~~~~~~~~~
+
+*Host2Dev; optional*
+
+Configure sleep and wake modes. Not yet supported by the driver.
+
+UCAN_FILTER
+~~~~~~~~~~~
+
+*Host2Dev; optional*
+
+Setup hardware CAN filters. Not yet supported by the driver.
+
+Allowed interface commands
+--------------------------
+
+================== =================== ==================
+Legal Device State Command New Device State
+================== =================== ==================
+stopped SET_BITTIMING stopped
+stopped START started
+started STOP or RESET stopped
+stopped STOP or RESET stopped
+started RESTART started
+any GET *no change*
+================== =================== ==================
+
+IN Message Format
+=================
+
+A data packet on the USB IN endpoint contains one or more
+``ucan_message_in`` values. If multiple messages are batched in a USB
+data packet, the ``len`` field can be used to jump to the next
+``ucan_message_in`` value (take care to sanity-check the ``len`` value
+against the actual data size).
+
+.. _can_ucan_in_message_len:
+
+``len`` field
+-------------
+
+Each ``ucan_message_in`` must be aligned to a 4-byte boundary (relative
+to the start of the start of the data buffer). That means that there
+may be padding bytes between multiple ``ucan_message_in`` values:
+
+.. code::
+
+ +----------------------------+ < 0
+ | |
+ | struct ucan_message_in |
+ | |
+ +----------------------------+ < len
+ [padding]
+ +----------------------------+ < round_up(len, 4)
+ | |
+ | struct ucan_message_in |
+ | |
+ +----------------------------+
+ [...]
+
+``type`` field
+--------------
+
+The ``type`` field specifies the type of the message.
+
+UCAN_IN_RX
+~~~~~~~~~~
+
+``subtype``
+ zero
+
+Data received from the CAN bus (ID + payload).
+
+UCAN_IN_TX_COMPLETE
+~~~~~~~~~~~~~~~~~~~
+
+``subtype``
+ zero
+
+The CAN device has sent a message to the CAN bus. It answers with a
+list of tuples <echo-ids, flags>.
+
+The echo-id identifies the frame from (echos the id from a previous
+UCAN_OUT_TX message). The flag indicates the result of the
+transmission. Whereas a set Bit 0 indicates success. All other bits
+are reserved and set to zero.
+
+Flow Control
+------------
+
+When receiving CAN messages there is no flow control on the USB
+buffer. The driver has to handle inbound message quickly enough to
+avoid drops. I case the device buffer overflow the condition is
+reported by sending corresponding error frames (see
+:ref:`can_ucan_error_handling`)
+
+
+OUT Message Format
+==================
+
+A data packet on the USB OUT endpoint contains one or more ``struct
+ucan_message_out`` values. If multiple messages are batched into one
+data packet, the device uses the ``len`` field to jump to the next
+ucan_message_out value. Each ucan_message_out must be aligned to 4
+bytes (relative to the start of the data buffer). The mechanism is
+same as described in :ref:`can_ucan_in_message_len`.
+
+.. code::
+
+ +----------------------------+ < 0
+ | |
+ | struct ucan_message_out |
+ | |
+ +----------------------------+ < len
+ [padding]
+ +----------------------------+ < round_up(len, 4)
+ | |
+ | struct ucan_message_out |
+ | |
+ +----------------------------+
+ [...]
+
+``type`` field
+--------------
+
+In protocol version 3 only ``UCAN_OUT_TX`` is defined, others are used
+only by legacy devices (protocol version 1).
+
+UCAN_OUT_TX
+~~~~~~~~~~~
+``subtype``
+ echo id to be replied within a CAN_IN_TX_COMPLETE message
+
+Transmit a CAN frame. (parameters: ``id``, ``data``)
+
+Flow Control
+------------
+
+When the device outbound buffers are full it starts sending *NAKs* on
+the *OUT* pipe until more buffers are available. The driver stops the
+queue when a certain threshold of out packets are incomplete.
+
+.. _can_ucan_error_handling:
+
+CAN Error Handling
+==================
+
+If error reporting is turned on the device encodes errors into CAN
+error frames (see ``uapi/linux/can/error.h``) and sends it using the
+IN endpoint. The driver updates its error statistics and forwards
+it.
+
+Although UCAN devices can suppress error frames completely, in Linux
+the driver is always interested. Hence, the device is always started with
+the ``UCAN_MODE_BERR_REPORT`` set. Filtering those messages for the
+user space is done by the driver.
+
+Bus OFF
+-------
+
+- The device does not recover from bus of automatically.
+- Bus OFF is indicated by an error frame (see ``uapi/linux/can/error.h``)
+- Bus OFF recovery is started by ``UCAN_COMMAND_RESTART``
+- Once Bus OFF recover is completed the device sends an error frame
+ indicating that it is on ERROR-ACTIVE state.
+- During Bus OFF no frames are sent by the device.
+- During Bus OFF transmission requests from the host are completed
+ immediately with the success bit left unset.
+
+Example Conversation
+====================
+
+#) Device is connected to USB
+#) Host sends command ``UCAN_COMMAND_RESET``, subcmd 0
+#) Host sends command ``UCAN_COMMAND_GET``, subcmd ``UCAN_COMMAND_GET_INFO``
+#) Device sends ``UCAN_IN_DEVICE_INFO``
+#) Host sends command ``UCAN_OUT_SET_BITTIMING``
+#) Host sends command ``UCAN_COMMAND_START``, subcmd 0, mode ``UCAN_MODE_BERR_REPORT``
diff --git a/Documentation/networking/cdc_mbim.rst b/Documentation/networking/cdc_mbim.rst
new file mode 100644
index 0000000000..37f968acc4
--- /dev/null
+++ b/Documentation/networking/cdc_mbim.rst
@@ -0,0 +1,355 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+cdc_mbim - Driver for CDC MBIM Mobile Broadband modems
+======================================================
+
+The cdc_mbim driver supports USB devices conforming to the "Universal
+Serial Bus Communications Class Subclass Specification for Mobile
+Broadband Interface Model" [1], which is a further development of
+"Universal Serial Bus Communications Class Subclass Specifications for
+Network Control Model Devices" [2] optimized for Mobile Broadband
+devices, aka "3G/LTE modems".
+
+
+Command Line Parameters
+=======================
+
+The cdc_mbim driver has no parameters of its own. But the probing
+behaviour for NCM 1.0 backwards compatible MBIM functions (an
+"NCM/MBIM function" as defined in section 3.2 of [1]) is affected
+by a cdc_ncm driver parameter:
+
+prefer_mbim
+-----------
+:Type: Boolean
+:Valid Range: N/Y (0-1)
+:Default Value: Y (MBIM is preferred)
+
+This parameter sets the system policy for NCM/MBIM functions. Such
+functions will be handled by either the cdc_ncm driver or the cdc_mbim
+driver depending on the prefer_mbim setting. Setting prefer_mbim=N
+makes the cdc_mbim driver ignore these functions and lets the cdc_ncm
+driver handle them instead.
+
+The parameter is writable, and can be changed at any time. A manual
+unbind/bind is required to make the change effective for NCM/MBIM
+functions bound to the "wrong" driver
+
+
+Basic usage
+===========
+
+MBIM functions are inactive when unmanaged. The cdc_mbim driver only
+provides a userspace interface to the MBIM control channel, and will
+not participate in the management of the function. This implies that a
+userspace MBIM management application always is required to enable a
+MBIM function.
+
+Such userspace applications includes, but are not limited to:
+
+ - mbimcli (included with the libmbim [3] library), and
+ - ModemManager [4]
+
+Establishing a MBIM IP session reequires at least these actions by the
+management application:
+
+ - open the control channel
+ - configure network connection settings
+ - connect to network
+ - configure IP interface
+
+Management application development
+----------------------------------
+The driver <-> userspace interfaces are described below. The MBIM
+control channel protocol is described in [1].
+
+
+MBIM control channel userspace ABI
+==================================
+
+/dev/cdc-wdmX character device
+------------------------------
+The driver creates a two-way pipe to the MBIM function control channel
+using the cdc-wdm driver as a subdriver. The userspace end of the
+control channel pipe is a /dev/cdc-wdmX character device.
+
+The cdc_mbim driver does not process or police messages on the control
+channel. The channel is fully delegated to the userspace management
+application. It is therefore up to this application to ensure that it
+complies with all the control channel requirements in [1].
+
+The cdc-wdmX device is created as a child of the MBIM control
+interface USB device. The character device associated with a specific
+MBIM function can be looked up using sysfs. For example::
+
+ bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc
+ cdc-wdm0
+
+ bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev
+ 180:0
+
+
+USB configuration descriptors
+-----------------------------
+The wMaxControlMessage field of the CDC MBIM functional descriptor
+limits the maximum control message size. The management application is
+responsible for negotiating a control message size complying with the
+requirements in section 9.3.1 of [1], taking this descriptor field
+into consideration.
+
+The userspace application can access the CDC MBIM functional
+descriptor of a MBIM function using either of the two USB
+configuration descriptor kernel interfaces described in [6] or [7].
+
+See also the ioctl documentation below.
+
+
+Fragmentation
+-------------
+The userspace application is responsible for all control message
+fragmentation and defragmentaion, as described in section 9.5 of [1].
+
+
+/dev/cdc-wdmX write()
+---------------------
+The MBIM control messages from the management application *must not*
+exceed the negotiated control message size.
+
+
+/dev/cdc-wdmX read()
+--------------------
+The management application *must* accept control messages of up the
+negotiated control message size.
+
+
+/dev/cdc-wdmX ioctl()
+---------------------
+IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size
+This ioctl returns the wMaxControlMessage field of the CDC MBIM
+functional descriptor for MBIM devices. This is intended as a
+convenience, eliminating the need to parse the USB descriptors from
+userspace.
+
+::
+
+ #include <stdio.h>
+ #include <fcntl.h>
+ #include <sys/ioctl.h>
+ #include <linux/types.h>
+ #include <linux/usb/cdc-wdm.h>
+ int main()
+ {
+ __u16 max;
+ int fd = open("/dev/cdc-wdm0", O_RDWR);
+ if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max))
+ printf("wMaxControlMessage is %d\n", max);
+ }
+
+
+Custom device services
+----------------------
+The MBIM specification allows vendors to freely define additional
+services. This is fully supported by the cdc_mbim driver.
+
+Support for new MBIM services, including vendor specified services, is
+implemented entirely in userspace, like the rest of the MBIM control
+protocol
+
+New services should be registered in the MBIM Registry [5].
+
+
+
+MBIM data channel userspace ABI
+===============================
+
+wwanY network device
+--------------------
+The cdc_mbim driver represents the MBIM data channel as a single
+network device of the "wwan" type. This network device is initially
+mapped to MBIM IP session 0.
+
+
+Multiplexed IP sessions (IPS)
+-----------------------------
+MBIM allows multiplexing up to 256 IP sessions over a single USB data
+channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN
+subdevices of the master wwanY device, mapping MBIM IP session Z to
+VLAN ID Z for all values of Z greater than 0.
+
+The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure
+described in section 10.5.1 of [1].
+
+The userspace management application is responsible for adding new
+VLAN links prior to establishing MBIM IP sessions where the SessionId
+is greater than 0. These links can be added by using the normal VLAN
+kernel interfaces, either ioctl or netlink.
+
+For example, adding a link for a MBIM IP session with SessionId 3::
+
+ ip link add link wwan0 name wwan0.3 type vlan id 3
+
+The driver will automatically map the "wwan0.3" network device to MBIM
+IP session 3.
+
+
+Device Service Streams (DSS)
+----------------------------
+MBIM also allows up to 256 non-IP data streams to be multiplexed over
+the same shared USB data channel. The cdc_mbim driver models these
+sessions as another set of 802.1q VLAN subdevices of the master wwanY
+device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values
+of A.
+
+The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO
+structure described in section 10.5.29 of [1].
+
+The DSS VLAN subdevices are used as a practical interface between the
+shared MBIM data channel and a MBIM DSS aware userspace application.
+It is not intended to be presented as-is to an end user. The
+assumption is that a userspace application initiating a DSS session
+also takes care of the necessary framing of the DSS data, presenting
+the stream to the end user in an appropriate way for the stream type.
+
+The network device ABI requires a dummy ethernet header for every DSS
+data frame being transported. The contents of this header is
+arbitrary, with the following exceptions:
+
+ - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped
+ - RX frames will have the protocol field set to ETH_P_802_3 (but will
+ not be properly formatted 802.3 frames)
+ - RX frames will have the destination address set to the hardware
+ address of the master device
+
+The DSS supporting userspace management application is responsible for
+adding the dummy ethernet header on TX and stripping it on RX.
+
+This is a simple example using tools commonly available, exporting
+DssSessionId 5 as a pty character device pointed to by a /dev/nmea
+symlink::
+
+ ip link add link wwan0 name wwan0.dss5 type vlan id 261
+ ip link set dev wwan0.dss5 up
+ socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea
+
+This is only an example, most suitable for testing out a DSS
+service. Userspace applications supporting specific MBIM DSS services
+are expected to use the tools and programming interfaces required by
+that service.
+
+Note that adding VLAN links for DSS sessions is entirely optional. A
+management application may instead choose to bind a packet socket
+directly to the master network device, using the received VLAN tags to
+map frames to the correct DSS session and adding 18 byte VLAN ethernet
+headers with the appropriate tag on TX. In this case using a socket
+filter is recommended, matching only the DSS VLAN subset. This avoid
+unnecessary copying of unrelated IP session data to userspace. For
+example::
+
+ static struct sock_filter dssfilter[] = {
+ /* use special negative offsets to get VLAN tag */
+ BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */
+
+ /* verify DSS VLAN range */
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG),
+ BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */
+ BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */
+
+ /* verify ethertype */
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1),
+
+ BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */
+ BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */
+ };
+
+
+
+Tagged IP session 0 VLAN
+------------------------
+As described above, MBIM IP session 0 is treated as special by the
+driver. It is initially mapped to untagged frames on the wwanY
+network device.
+
+This mapping implies a few restrictions on multiplexed IPS and DSS
+sessions, which may not always be practical:
+
+ - no IPS or DSS session can use a frame size greater than the MTU on
+ IP session 0
+ - no IPS or DSS session can be in the up state unless the network
+ device representing IP session 0 also is up
+
+These problems can be avoided by optionally making the driver map IP
+session 0 to a VLAN subdevice, similar to all other IP sessions. This
+behaviour is triggered by adding a VLAN link for the magic VLAN ID
+4094. The driver will then immediately start mapping MBIM IP session
+0 to this VLAN, and will drop untagged frames on the master wwanY
+device.
+
+Tip: It might be less confusing to the end user to name this VLAN
+subdevice after the MBIM SessionID instead of the VLAN ID. For
+example::
+
+ ip link add link wwan0 name wwan0.0 type vlan id 4094
+
+
+VLAN mapping
+------------
+
+Summarizing the cdc_mbim driver mapping described above, we have this
+relationship between VLAN tags on the wwanY network device and MBIM
+sessions on the shared USB data channel::
+
+ VLAN ID MBIM type MBIM SessionID Notes
+ ---------------------------------------------------------
+ untagged IPS 0 a)
+ 1 - 255 IPS 1 - 255 <VLANID>
+ 256 - 511 DSS 0 - 255 <VLANID - 256>
+ 512 - 4093 b)
+ 4094 IPS 0 c)
+
+ a) if no VLAN ID 4094 link exists, else dropped
+ b) unsupported VLAN range, unconditionally dropped
+ c) if a VLAN ID 4094 link exists, else dropped
+
+
+
+
+References
+==========
+
+ 1) USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specification for Mobile Broadband
+ Interface Model", Revision 1.0 (Errata 1), May 1, 2013
+
+ - http://www.usb.org/developers/docs/devclass_docs/
+
+ 2) USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specifications for Network Control
+ Model Devices", Revision 1.0 (Errata 1), November 24, 2010
+
+ - http://www.usb.org/developers/docs/devclass_docs/
+
+ 3) libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+
+ - http://www.freedesktop.org/wiki/Software/libmbim/
+
+ 4) ModemManager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+
+ - http://www.freedesktop.org/wiki/Software/ModemManager/
+
+ 5) "MBIM (Mobile Broadband Interface Model) Registry"
+
+ - http://compliance.usb.org/mbim/
+
+ 6) "/sys/kernel/debug/usb/devices output format"
+
+ - Documentation/driver-api/usb/usb.rst
+
+ 7) "/sys/bus/usb/devices/.../descriptors"
+
+ - Documentation/ABI/stable/sysfs-bus-usb
diff --git a/Documentation/networking/checksum-offloads.rst b/Documentation/networking/checksum-offloads.rst
new file mode 100644
index 0000000000..69b23cf687
--- /dev/null
+++ b/Documentation/networking/checksum-offloads.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Checksum Offloads
+=================
+
+
+Introduction
+============
+
+This document describes a set of techniques in the Linux networking stack to
+take advantage of checksum offload capabilities of various NICs.
+
+The following technologies are described:
+
+* TX Checksum Offload
+* LCO: Local Checksum Offload
+* RCO: Remote Checksum Offload
+
+Things that should be documented here but aren't yet:
+
+* RX Checksum Offload
+* CHECKSUM_UNNECESSARY conversion
+
+
+TX Checksum Offload
+===================
+
+The interface for offloading a transmit checksum to a device is explained in
+detail in comments near the top of include/linux/skbuff.h.
+
+In brief, it allows to request the device fill in a single ones-complement
+checksum defined by the sk_buff fields skb->csum_start and skb->csum_offset.
+The device should compute the 16-bit ones-complement checksum (i.e. the
+'IP-style' checksum) from csum_start to the end of the packet, and fill in the
+result at (csum_start + csum_offset).
+
+Because csum_offset cannot be negative, this ensures that the previous value of
+the checksum field is included in the checksum computation, thus it can be used
+to supply any needed corrections to the checksum (such as the sum of the
+pseudo-header for UDP or TCP).
+
+This interface only allows a single checksum to be offloaded. Where
+encapsulation is used, the packet may have multiple checksum fields in
+different header layers, and the rest will have to be handled by another
+mechanism such as LCO or RCO.
+
+CRC32c can also be offloaded using this interface, by means of filling
+skb->csum_start and skb->csum_offset as described above, and setting
+skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
+
+No offloading of the IP header checksum is performed; it is always done in
+software. This is OK because when we build the IP header, we obviously have it
+in cache, so summing it isn't expensive. It's also rather short.
+
+The requirements for GSO are more complicated, because when segmenting an
+encapsulated packet both the inner and outer checksums may need to be edited or
+recomputed for each resulting segment. See the skbuff.h comment (section 'E')
+for more details.
+
+A driver declares its offload capabilities in netdev->hw_features; see
+Documentation/networking/netdev-features.rst for more. Note that a device
+which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
+csum_offset given in the SKB; if it tries to deduce these itself in hardware
+(as some NICs do) the driver should check that the values in the SKB match
+those which the hardware will deduce, and if not, fall back to checksumming in
+software instead (with skb_csum_hwoffload_help() or one of the
+skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in
+include/linux/skbuff.h).
+
+The stack should, for the most part, assume that checksum offload is supported
+by the underlying device. The only place that should check is
+validate_xmit_skb(), and the functions it calls directly or indirectly. That
+function compares the offload features requested by the SKB (which may include
+other offloads besides TX Checksum Offload) and, if they are not supported or
+enabled on the device (determined by netdev->features), performs the
+corresponding offload in software. In the case of TX Checksum Offload, that
+means calling skb_csum_hwoffload_help(skb, features).
+
+
+LCO: Local Checksum Offload
+===========================
+
+LCO is a technique for efficiently computing the outer checksum of an
+encapsulated datagram when the inner checksum is due to be offloaded.
+
+The ones-complement sum of a correctly checksummed TCP or UDP packet is equal
+to the complement of the sum of the pseudo header, because everything else gets
+'cancelled out' by the checksum field. This is because the sum was
+complemented before being written to the checksum field.
+
+More generally, this holds in any case where the 'IP-style' ones complement
+checksum is used, and thus any checksum that TX Checksum Offload supports.
+
+That is, if we have set up TX Checksum Offload with a start/offset pair, we
+know that after the device has filled in that checksum, the ones complement sum
+from csum_start to the end of the packet will be equal to the complement of
+whatever value we put in the checksum field beforehand. This allows us to
+compute the outer checksum without looking at the payload: we simply stop
+summing when we get to csum_start, then add the complement of the 16-bit word
+at (csum_start + csum_offset).
+
+Then, when the true inner checksum is filled in (either by hardware or by
+skb_checksum_help()), the outer checksum will become correct by virtue of the
+arithmetic.
+
+LCO is performed by the stack when constructing an outer UDP header for an
+encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for the
+IPv6 equivalents, in udp6_set_csum().
+
+It is also performed when constructing an IPv4 GRE header, in
+net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
+constructing an IPv6 GRE header; the GRE checksum is computed over the whole
+packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be possible to use
+LCO here as IPv6 GRE still uses an IP-style checksum.
+
+All of the LCO implementations use a helper function lco_csum(), in
+include/linux/skbuff.h.
+
+LCO can safely be used for nested encapsulations; in this case, the outer
+encapsulation layer will sum over both its own header and the 'middle' header.
+This does mean that the 'middle' header will get summed multiple times, but
+there doesn't seem to be a way to avoid that without incurring bigger costs
+(e.g. in SKB bloat).
+
+
+RCO: Remote Checksum Offload
+============================
+
+RCO is a technique for eliding the inner checksum of an encapsulated datagram,
+allowing the outer checksum to be offloaded. It does, however, involve a
+change to the encapsulation protocols, which the receiver must also support.
+For this reason, it is disabled by default.
+
+RCO is detailed in the following Internet-Drafts:
+
+* https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
+* https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
+
+In Linux, RCO is implemented individually in each encapsulation protocol, and
+most tunnel types have flags controlling its use. For instance, VXLAN has the
+flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that RCO should be
+used when transmitting to a given remote destination.
diff --git a/Documentation/networking/dccp.rst b/Documentation/networking/dccp.rst
new file mode 100644
index 0000000000..91e5c33ba3
--- /dev/null
+++ b/Documentation/networking/dccp.rst
@@ -0,0 +1,219 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+DCCP protocol
+=============
+
+
+.. Contents
+ - Introduction
+ - Missing features
+ - Socket options
+ - Sysctl variables
+ - IOCTLs
+ - Other tunables
+ - Notes
+
+
+Introduction
+============
+Datagram Congestion Control Protocol (DCCP) is an unreliable, connection
+oriented protocol designed to solve issues present in UDP and TCP, particularly
+for real-time and multimedia (streaming) traffic.
+It divides into a base protocol (RFC 4340) and pluggable congestion control
+modules called CCIDs. Like pluggable TCP congestion control, at least one CCID
+needs to be enabled in order for the protocol to function properly. In the Linux
+implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as
+the TCP-friendly CCID3 (RFC 4342), are optional.
+For a brief introduction to CCIDs and suggestions for choosing a CCID to match
+given applications, see section 10 of RFC 4340.
+
+It has a base protocol and pluggable congestion control IDs (CCIDs).
+
+DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol
+is at http://www.ietf.org/html.charters/dccp-charter.html
+
+
+Missing features
+================
+The Linux DCCP implementation does not currently support all the features that are
+specified in RFCs 4340...42.
+
+The known bugs are at:
+
+ http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP
+
+For more up-to-date versions of the DCCP implementation, please consider using
+the experimental DCCP test tree; instructions for checking this out are on:
+http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree
+
+
+Socket options
+==============
+DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes
+a policy ID as argument and can only be set before the connection (i.e. changes
+during an established connection are not supported). Currently, two policies are
+defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special,
+and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an
+u32 priority value as ancillary data to sendmsg(), where higher numbers indicate
+a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to
+be formatted using a cmsg(3) message header filled in as follows::
+
+ cmsg->cmsg_level = SOL_DCCP;
+ cmsg->cmsg_type = DCCP_SCM_PRIORITY;
+ cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */
+
+DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero
+value is always interpreted as unbounded queue length. If different from zero,
+the interpretation of this parameter depends on the current dequeuing policy
+(see above): the "simple" policy will enforce a fixed queue size by returning
+EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the
+lowest-priority packet first. The default value for this parameter is
+initialised from /proc/sys/net/dccp/default/tx_qlen.
+
+DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of
+service codes (RFC 4340, sec. 8.1.2); if this socket option is not set,
+the socket will fall back to 0 (which means that no meaningful service code
+is present). On active sockets this is set before connect(); specifying more
+than one code has no effect (all subsequent service codes are ignored). The
+case is different for passive sockets, where multiple service codes (up to 32)
+can be set before calling bind().
+
+DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet
+size (application payload size) in bytes, see RFC 4340, section 14.
+
+DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs
+supported by the endpoint. The option value is an array of type uint8_t whose
+size is passed as option length. The minimum array size is 4 elements, the
+value returned in the optlen argument always reflects the true number of
+built-in CCIDs.
+
+DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same
+time, combining the operation of the next two socket options. This option is
+preferable over the latter two, since often applications will use the same
+type of CCID for both directions; and mixed use of CCIDs is not currently well
+understood. This socket option takes as argument at least one uint8_t value, or
+an array of uint8_t values, which must match available CCIDS (see above). CCIDs
+must be registered on the socket before calling connect() or listen().
+
+DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets
+the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID.
+Please note that the getsockopt argument type here is ``int``, not uint8_t.
+
+DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID.
+
+DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold
+timewait state when closing the connection (RFC 4340, 8.3). The usual case is
+that the closing server sends a CloseReq, whereupon the client holds timewait
+state. When this boolean socket option is on, the server sends a Close instead
+and will enter TIMEWAIT. This option must be set after accept() returns.
+
+DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the
+partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums
+always cover the entire packet and that only fully covered application data is
+accepted by the receiver. Hence, when using this feature on the sender, it must
+be enabled at the receiver, too with suitable choice of CsCov.
+
+DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the
+ range 0..15 are acceptable. The default setting is 0 (full coverage),
+ values between 1..15 indicate partial coverage.
+
+DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it
+ sets a threshold, where again values 0..15 are acceptable. The default
+ of 0 means that all packets with a partial coverage will be discarded.
+ Values in the range 1..15 indicate that packets with minimally such a
+ coverage value are also acceptable. The higher the number, the more
+ restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage
+ settings are inherited to the child socket after accept().
+
+The following two options apply to CCID 3 exclusively and are getsockopt()-only.
+In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned.
+
+DCCP_SOCKOPT_CCID_RX_INFO
+ Returns a ``struct tfrc_rx_info`` in optval; the buffer for optval and
+ optlen must be set to at least sizeof(struct tfrc_rx_info).
+
+DCCP_SOCKOPT_CCID_TX_INFO
+ Returns a ``struct tfrc_tx_info`` in optval; the buffer for optval and
+ optlen must be set to at least sizeof(struct tfrc_tx_info).
+
+On unidirectional connections it is useful to close the unused half-connection
+via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs.
+
+
+Sysctl variables
+================
+Several DCCP default parameters can be managed by the following sysctls
+(sysctl net.dccp.default or /proc/sys/net/dccp/default):
+
+request_retries
+ The number of active connection initiation retries (the number of
+ Requests minus one) before timing out. In addition, it also governs
+ the behaviour of the other, passive side: this variable also sets
+ the number of times DCCP repeats sending a Response when the initial
+ handshake does not progress from RESPOND to OPEN (i.e. when no Ack
+ is received after the initial Request). This value should be greater
+ than 0, suggested is less than 10. Analogue of tcp_syn_retries.
+
+retries1
+ How often a DCCP Response is retransmitted until the listening DCCP
+ side considers its connecting peer dead. Analogue of tcp_retries1.
+
+retries2
+ The number of times a general DCCP packet is retransmitted. This has
+ importance for retransmitted acknowledgments and feature negotiation,
+ data packets are never retransmitted. Analogue of tcp_retries2.
+
+tx_ccid = 2
+ Default CCID for the sender-receiver half-connection. Depending on the
+ choice of CCID, the Send Ack Vector feature is enabled automatically.
+
+rx_ccid = 2
+ Default CCID for the receiver-sender half-connection; see tx_ccid.
+
+seq_window = 100
+ The initial sequence window (sec. 7.5.2) of the sender. This influences
+ the local ackno validity and the remote seqno validity windows (7.5.1).
+ Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set.
+
+tx_qlen = 5
+ The size of the transmit buffer in packets. A value of 0 corresponds
+ to an unbounded transmit buffer.
+
+sync_ratelimit = 125 ms
+ The timeout between subsequent DCCP-Sync packets sent in response to
+ sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit
+ of this parameter is milliseconds; a value of 0 disables rate-limiting.
+
+
+IOCTLS
+======
+FIONREAD
+ Works as in udp(7): returns in the ``int`` argument pointer the size of
+ the next pending datagram in bytes, or 0 when no datagram is pending.
+
+SIOCOUTQ
+ Returns the number of unsent data bytes in the socket send queue as ``int``
+ into the buffer specified by the argument pointer.
+
+Other tunables
+==============
+Per-route rto_min support
+ CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value
+ of the RTO timer. This setting can be modified via the 'rto_min' option
+ of iproute2; for example::
+
+ > ip route change 10.0.0.0/24 rto_min 250j dev wlan0
+ > ip route add 10.0.0.254/32 rto_min 800j dev wlan0
+ > ip route show dev wlan0
+
+ CCID-3 also supports the rto_min setting: it is used to define the lower
+ bound for the expiry of the nofeedback timer. This can be useful on LANs
+ with very low RTTs (e.g., loopback, Gbit ethernet).
+
+
+Notes
+=====
+DCCP does not travel through NAT successfully at present on many boxes. This is
+because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT
+support for DCCP has been added.
diff --git a/Documentation/networking/dctcp.rst b/Documentation/networking/dctcp.rst
new file mode 100644
index 0000000000..4cc8bb2dad
--- /dev/null
+++ b/Documentation/networking/dctcp.rst
@@ -0,0 +1,52 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+DCTCP (DataCenter TCP)
+======================
+
+DCTCP is an enhancement to the TCP congestion control algorithm for data
+center networks and leverages Explicit Congestion Notification (ECN) in
+the data center network to provide multi-bit feedback to the end hosts.
+
+To enable it on end hosts::
+
+ sysctl -w net.ipv4.tcp_congestion_control=dctcp
+ sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional)
+
+All switches in the data center network running DCTCP must support ECN
+marking and be configured for marking when reaching defined switch buffer
+thresholds. The default ECN marking threshold heuristic for DCTCP on
+switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps,
+but might need further careful tweaking.
+
+For more details, see below documents:
+
+Paper:
+
+The algorithm is further described in detail in the following two
+SIGCOMM/SIGMETRICS papers:
+
+ i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
+ Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan:
+
+ "Data Center TCP (DCTCP)", Data Center Networks session"
+
+ Proc. ACM SIGCOMM, New Delhi, 2010.
+
+ http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
+ http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192
+
+ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar:
+
+ "Analysis of DCTCP: Stability, Convergence, and Fairness"
+ Proc. ACM SIGMETRICS, San Jose, 2011.
+
+ http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
+
+IETF informational draft:
+
+ http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
+
+DCTCP site:
+
+ http://simula.stanford.edu/~alizade/Site/DCTCP.html
diff --git a/Documentation/networking/device_drivers/appletalk/cops.rst b/Documentation/networking/device_drivers/appletalk/cops.rst
new file mode 100644
index 0000000000..964ba80599
--- /dev/null
+++ b/Documentation/networking/device_drivers/appletalk/cops.rst
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+The COPS LocalTalk Linux driver (cops.c)
+========================================
+
+By Jay Schulist <jschlst@samba.org>
+
+This driver has two modes and they are: Dayna mode and Tangent mode.
+Each mode corresponds with the type of card. It has been found
+that there are 2 main types of cards and all other cards are
+the same and just have different names or only have minor differences
+such as more IO ports. As this driver is tested it will
+become more clear exactly what cards are supported.
+
+Right now these cards are known to work with the COPS driver. The
+LT-200 cards work in a somewhat more limited capacity than the
+DL200 cards, which work very well and are in use by many people.
+
+TANGENT driver mode:
+ - Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200
+
+DAYNA driver mode:
+ - Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95,
+ - Farallon PhoneNET PC III, Farallon PhoneNET PC II
+
+Other cards possibly supported mode unknown though:
+ - Dayna DL2000 (Full length)
+
+The COPS driver defaults to using Dayna mode. To change the driver's
+mode if you built a driver with dual support use board_type=1 or
+board_type=2 for Dayna or Tangent with insmod.
+
+Operation/loading of the driver
+===============================
+
+Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #)
+If you do not specify any options the driver will try and use the IO = 0x240,
+IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing.
+
+To load multiple COPS driver Localtalk cards you can do one of the following::
+
+ insmod cops io=0x240 irq=5
+ insmod -o cops2 cops io=0x260 irq=3
+
+Or in lilo.conf put something like this::
+
+ append="ether=5,0x240,lt0 ether=3,0x260,lt1"
+
+Then bring up the interface with ifconfig. It will look something like this::
+
+ lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00
+ inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
+ UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1
+ RX packets:0 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0
+
+Netatalk Configuration
+======================
+
+You will need to configure atalkd with something like the following to make
+it work with the cops.c driver.
+
+* For single LTalk card use::
+
+ dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033"
+ lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
+
+* For multiple cards, Ethernet and LocalTalk::
+
+ eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033"
+ lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
+
+* For multiple LocalTalk cards, and an Ethernet card.
+
+* Order seems to matter here, Ethernet last::
+
+ lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1"
+ lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2"
+ eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk"
diff --git a/Documentation/networking/device_drivers/appletalk/index.rst b/Documentation/networking/device_drivers/appletalk/index.rst
new file mode 100644
index 0000000000..c196baeb08
--- /dev/null
+++ b/Documentation/networking/device_drivers/appletalk/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+AppleTalk Device Drivers
+========================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ cops
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/atm/cxacru-cf.py b/Documentation/networking/device_drivers/atm/cxacru-cf.py
new file mode 100644
index 0000000000..b41d298398
--- /dev/null
+++ b/Documentation/networking/device_drivers/atm/cxacru-cf.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python
+# Copyright 2009 Simon Arlott
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by the Free
+# Software Foundation; either version 2 of the License, or (at your option)
+# any later version.
+#
+# This program is distributed in the hope that it will be useful, but WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+# more details.
+#
+# You should have received a copy of the GNU General Public License along with
+# this program; if not, write to the Free Software Foundation, Inc., 59
+# Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+#
+# Usage: cxacru-cf.py < cxacru-cf.bin
+# Output: values string suitable for the sysfs adsl_config attribute
+#
+# Warning: cxacru-cf.bin with MD5 hash cdbac2689969d5ed5d4850f117702110
+# contains mis-aligned values which will stop the modem from being able
+# to make a connection. If the first and last two bytes are removed then
+# the values become valid, but the modulation will be forced to ANSI
+# T1.413 only which may not be appropriate.
+#
+# The original binary format is a packed list of le32 values.
+
+import sys
+import struct
+
+i = 0
+while True:
+ buf = sys.stdin.read(4)
+
+ if len(buf) == 0:
+ break
+ elif len(buf) != 4:
+ sys.stdout.write("\n")
+ sys.stderr.write("Error: read {0} not 4 bytes\n".format(len(buf)))
+ sys.exit(1)
+
+ if i > 0:
+ sys.stdout.write(" ")
+ sys.stdout.write("{0:x}={1}".format(i, struct.unpack("<I", buf)[0]))
+ i += 1
+
+sys.stdout.write("\n")
diff --git a/Documentation/networking/device_drivers/atm/cxacru.rst b/Documentation/networking/device_drivers/atm/cxacru.rst
new file mode 100644
index 0000000000..6088af2ffe
--- /dev/null
+++ b/Documentation/networking/device_drivers/atm/cxacru.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+ATM cxacru device driver
+========================
+
+Firmware is required for this device: http://accessrunner.sourceforge.net/
+
+While it is capable of managing/maintaining the ADSL connection without the
+module loaded, the device will sometimes stop responding after unloading the
+driver and it is necessary to unplug/remove power to the device to fix this.
+
+Note: support for cxacru-cf.bin has been removed. It was not loaded correctly
+so it had no effect on the device configuration. Fixing it could have stopped
+existing devices working when an invalid configuration is supplied.
+
+There is a script cxacru-cf.py to convert an existing file to the sysfs form.
+
+Detected devices will appear as ATM devices named "cxacru". In /sys/class/atm/
+these are directories named cxacruN where N is the device number. A symlink
+named device points to the USB interface device's directory which contains
+several sysfs attribute files for retrieving device statistics:
+
+* adsl_controller_version
+
+* adsl_headend
+* adsl_headend_environment
+
+ - Information about the remote headend.
+
+* adsl_config
+
+ - Configuration writing interface.
+ - Write parameters in hexadecimal format <index>=<value>,
+ separated by whitespace, e.g.:
+
+ "1=0 a=5"
+
+ - Up to 7 parameters at a time will be sent and the modem will restart
+ the ADSL connection when any value is set. These are logged for future
+ reference.
+
+* downstream_attenuation (dB)
+* downstream_bits_per_frame
+* downstream_rate (kbps)
+* downstream_snr_margin (dB)
+
+ - Downstream stats.
+
+* upstream_attenuation (dB)
+* upstream_bits_per_frame
+* upstream_rate (kbps)
+* upstream_snr_margin (dB)
+* transmitter_power (dBm/Hz)
+
+ - Upstream stats.
+
+* downstream_crc_errors
+* downstream_fec_errors
+* downstream_hec_errors
+* upstream_crc_errors
+* upstream_fec_errors
+* upstream_hec_errors
+
+ - Error counts.
+
+* line_startable
+
+ - Indicates that ADSL support on the device
+ is/can be enabled, see adsl_start.
+
+* line_status
+
+ - "initialising"
+ - "down"
+ - "attempting to activate"
+ - "training"
+ - "channel analysis"
+ - "exchange"
+ - "waiting"
+ - "up"
+
+ Changes between "down" and "attempting to activate"
+ if there is no signal.
+
+* link_status
+
+ - "not connected"
+ - "connected"
+ - "lost"
+
+* mac_address
+
+* modulation
+
+ - "" (when not connected)
+ - "ANSI T1.413"
+ - "ITU-T G.992.1 (G.DMT)"
+ - "ITU-T G.992.2 (G.LITE)"
+
+* startup_attempts
+
+ - Count of total attempts to initialise ADSL.
+
+To enable/disable ADSL, the following can be written to the adsl_state file:
+
+ - "start"
+ - "stop
+ - "restart" (stops, waits 1.5s, then starts)
+ - "poll" (used to resume status polling if it was disabled due to failure)
+
+Changes in adsl/line state are reported via kernel log messages::
+
+ [4942145.150704] ATM dev 0: ADSL state: running
+ [4942243.663766] ATM dev 0: ADSL line: down
+ [4942249.665075] ATM dev 0: ADSL line: attempting to activate
+ [4942253.654954] ATM dev 0: ADSL line: training
+ [4942255.666387] ATM dev 0: ADSL line: channel analysis
+ [4942259.656262] ATM dev 0: ADSL line: exchange
+ [2635357.696901] ATM dev 0: ADSL line: up (8128 kb/s down | 832 kb/s up)
diff --git a/Documentation/networking/device_drivers/atm/fore200e.rst b/Documentation/networking/device_drivers/atm/fore200e.rst
new file mode 100644
index 0000000000..55df9ec09a
--- /dev/null
+++ b/Documentation/networking/device_drivers/atm/fore200e.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
+FORE Systems PCA-200E/SBA-200E ATM NIC driver
+=============================================
+
+This driver adds support for the FORE Systems 200E-series ATM adapters
+to the Linux operating system. It is based on the earlier PCA-200E driver
+written by Uwe Dannowski.
+
+The driver simultaneously supports PCA-200E and SBA-200E adapters on
+i386, alpha (untested), powerpc, sparc and sparc64 archs.
+
+The intent is to enable the use of different models of FORE adapters at the
+same time, by hosts that have several bus interfaces (such as PCI+SBUS,
+or PCI+EISA).
+
+Only PCI and SBUS devices are currently supported by the driver, but support
+for other bus interfaces such as EISA should not be too hard to add.
+
+
+Firmware Copyright Notice
+-------------------------
+
+Please read the fore200e_firmware_copyright file present
+in the linux/drivers/atm directory for details and restrictions.
+
+
+Firmware Updates
+----------------
+
+The FORE Systems 200E-series driver is shipped with firmware data being
+uploaded to the ATM adapters at system boot time or at module loading time.
+The supplied firmware images should work with all adapters.
+
+However, if you encounter problems (the firmware doesn't start or the driver
+is unable to read the PROM data), you may consider trying another firmware
+version. Alternative binary firmware images can be found somewhere on the
+ForeThought CD-ROM supplied with your adapter by FORE Systems.
+
+You can also get the latest firmware images from FORE Systems at
+https://en.wikipedia.org/wiki/FORE_Systems. Register TACTics Online and go to
+the 'software updates' pages. The firmware binaries are part of
+the various ForeThought software distributions.
+
+Notice that different versions of the PCA-200E firmware exist, depending
+on the endianness of the host architecture. The driver is shipped with
+both little and big endian PCA firmware images.
+
+Name and location of the new firmware images can be set at kernel
+configuration time:
+
+1. Copy the new firmware binary files (with .bin, .bin1 or .bin2 suffix)
+ to some directory, such as linux/drivers/atm.
+
+2. Reconfigure your kernel to set the new firmware name and location.
+ Expected pathnames are absolute or relative to the drivers/atm directory.
+
+3. Rebuild and re-install your kernel or your module.
+
+
+Feedback
+--------
+
+Feedback is welcome. Please send success stories/bug reports/
+patches/improvement/comments/flames to <lizzi@cnam.fr>.
diff --git a/Documentation/networking/device_drivers/atm/index.rst b/Documentation/networking/device_drivers/atm/index.rst
new file mode 100644
index 0000000000..7b593f031a
--- /dev/null
+++ b/Documentation/networking/device_drivers/atm/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Asynchronous Transfer Mode (ATM) Device Drivers
+===============================================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ cxacru
+ fore200e
+ iphase
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/atm/iphase.rst b/Documentation/networking/device_drivers/atm/iphase.rst
new file mode 100644
index 0000000000..388c7101e2
--- /dev/null
+++ b/Documentation/networking/device_drivers/atm/iphase.rst
@@ -0,0 +1,193 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+ATM (i)Chip IA Linux Driver Source
+==================================
+
+ READ ME FIRST
+
+--------------------------------------------------------------------------------
+
+ Read This Before You Begin!
+
+--------------------------------------------------------------------------------
+
+Description
+===========
+
+This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver
+source release.
+
+The features and limitations of this driver are as follows:
+
+ - A single VPI (VPI value of 0) is supported.
+ - Supports 4K VCs for the server board (with 512K control memory) and 1K
+ VCs for the client board (with 128K control memory).
+ - UBR, ABR and CBR service categories are supported.
+ - Only AAL5 is supported.
+ - Supports setting of PCR on the VCs.
+ - Multiple adapters in a system are supported.
+ - All variants of Interphase ATM PCI (i)Chip adapter cards are supported,
+ including x575 (OC3, control memory 128K , 512K and packet memory 128K,
+ 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See
+ http://www.iphase.com/
+ for details.
+ - Only x86 platforms are supported.
+ - SMP is supported.
+
+
+Before You Start
+================
+
+
+Installation
+------------
+
+1. Installing the adapters in the system
+
+ To install the ATM adapters in the system, follow the steps below.
+
+ a. Login as root.
+ b. Shut down the system and power off the system.
+ c. Install one or more ATM adapters in the system.
+ d. Connect each adapter to a port on an ATM switch. The green 'Link'
+ LED on the front panel of the adapter will be on if the adapter is
+ connected to the switch properly when the system is powered up.
+ e. Power on and boot the system.
+
+2. [ Removed ]
+
+3. Rebuild kernel with ABR support
+
+ [ a. and b. removed ]
+
+ c. Reconfigure the kernel, choose the Interphase ia driver through "make
+ menuconfig" or "make xconfig".
+ d. Rebuild the kernel, loadable modules and the atm tools.
+ e. Install the new built kernel and modules and reboot.
+
+4. Load the adapter hardware driver (ia driver) if it is built as a module
+
+ a. Login as root.
+ b. Change directory to /lib/modules/<kernel-version>/atm.
+ c. Run "insmod suni.o;insmod iphase.o"
+ The yellow 'status' LED on the front panel of the adapter will blink
+ while the driver is loaded in the system.
+ d. To verify that the 'ia' driver is loaded successfully, run the
+ following command::
+
+ cat /proc/atm/devices
+
+ If the driver is loaded successfully, the output of the command will
+ be similar to the following lines::
+
+ Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ...
+ 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 )
+
+ You can also check the system log file /var/log/messages for messages
+ related to the ATM driver.
+
+5. Ia Driver Configuration
+
+5.1 Configuration of adapter buffers
+ The (i)Chip boards have 3 different packet RAM size variants: 128K, 512K and
+ 1M. The RAM size decides the number of buffers and buffer size. The default
+ size and number of buffers are set as following:
+
+ ========= ======= ====== ====== ====== ====== ======
+ Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf
+ RAM size size size size size cnt cnt
+ ========= ======= ====== ====== ====== ====== ======
+ 128K 64K 64K 10K 10K 6 6
+ 512K 256K 256K 10K 10K 25 25
+ 1M 512K 512K 10K 10K 51 51
+ ========= ======= ====== ====== ====== ====== ======
+
+ These setting should work well in most environments, but can be
+ changed by typing the following command::
+
+ insmod <IA_DIR>/ia.o IA_RX_BUF=<RX_CNT> IA_RX_BUF_SZ=<RX_SIZE> \
+ IA_TX_BUF=<TX_CNT> IA_TX_BUF_SZ=<TX_SIZE>
+
+ Where:
+
+ - RX_CNT = number of receive buffers in the range (1-128)
+ - RX_SIZE = size of receive buffers in the range (48-64K)
+ - TX_CNT = number of transmit buffers in the range (1-128)
+ - TX_SIZE = size of transmit buffers in the range (48-64K)
+
+ 1. Transmit and receive buffer size must be a multiple of 4.
+ 2. Care should be taken so that the memory required for the
+ transmit and receive buffers is less than or equal to the
+ total adapter packet memory.
+
+5.2 Turn on ia debug trace
+
+ When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver
+ can provide more debug trace if needed. There is a bit mask variable,
+ IADebugFlag, which controls the output of the traces. You can find the bit
+ map of the IADebugFlag in iphase.h.
+ The debug trace can be turn on through the insmod command line option, for
+ example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug
+ traces together with loading the driver.
+
+6. Ia Driver Test Using ttcp_atm and PVC
+
+ For the PVC setup, the test machines can either be connected back-to-back or
+ through a switch. If connected through the switch, the switch must be
+ configured for the PVC(s).
+
+ a. For UBR test:
+
+ At the test machine intended to receive data, type::
+
+ ttcp_atm -r -a -s 0.100
+
+ At the other test machine, type::
+
+ ttcp_atm -t -a -s 0.100 -n 10000
+
+ Run "ttcp_atm -h" to display more options of the ttcp_atm tool.
+ b. For ABR test:
+
+ It is the same as the UBR testing, but with an extra command option::
+
+ -Pabr:max_pcr=<xxx>
+
+ where:
+
+ xxx = the maximum peak cell rate, from 170 - 353207.
+
+ This option must be set on both the machines.
+
+ c. For CBR test:
+
+ It is the same as the UBR testing, but with an extra command option::
+
+ -Pcbr:max_pcr=<xxx>
+
+ where:
+
+ xxx = the maximum peak cell rate, from 170 - 353207.
+
+ This option may only be set on the transmit machine.
+
+
+Outstanding Issues
+==================
+
+
+
+Contact Information
+-------------------
+
+::
+
+ Customer Support:
+ United States: Telephone: (214) 654-5555
+ Fax: (214) 654-5500
+ E-Mail: intouch@iphase.com
+ Europe: Telephone: 33 (0)1 41 15 44 00
+ Fax: 33 (0)1 41 15 12 13
+ World Wide Web: http://www.iphase.com
+ Anonymous FTP: ftp.iphase.com
diff --git a/Documentation/networking/device_drivers/cable/index.rst b/Documentation/networking/device_drivers/cable/index.rst
new file mode 100644
index 0000000000..cce3c43929
--- /dev/null
+++ b/Documentation/networking/device_drivers/cable/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Cable Modem Device Drivers
+==========================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ sb1000
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cable/sb1000.rst b/Documentation/networking/device_drivers/cable/sb1000.rst
new file mode 100644
index 0000000000..c8582ca403
--- /dev/null
+++ b/Documentation/networking/device_drivers/cable/sb1000.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+SB100 device driver
+===================
+
+sb1000 is a module network device driver for the General Instrument (also known
+as NextLevel) SURFboard1000 internal cable modem board. This is an ISA card
+which is used by a number of cable TV companies to provide cable modem access.
+It's a one-way downstream-only cable modem, meaning that your upstream net link
+is provided by your regular phone modem.
+
+This driver was written by Franco Venturi <fventuri@mediaone.net>. He deserves
+a great deal of thanks for this wonderful piece of code!
+
+Needed tools
+============
+
+Support for this device is now a part of the standard Linux kernel. The
+driver source code file is drivers/net/sb1000.c. In addition to this
+you will need:
+
+1. The "cmconfig" program. This is a utility which supplements "ifconfig"
+ to configure the cable modem and network interface (usually called "cm0");
+
+2. Several PPP scripts which live in /etc/ppp to make connecting via your
+ cable modem easy.
+
+ These utilities can be obtained from:
+
+ http://www.jacksonville.net/~fventuri/
+
+ in Franco's original source code distribution .tar.gz file. Support for
+ the sb1000 driver can be found at:
+
+ - http://web.archive.org/web/%2E/http://home.adelphia.net/~siglercm/sb1000.html
+ - http://web.archive.org/web/%2E/http://linuxpower.cx/~cable/
+
+ along with these utilities.
+
+3. The standard isapnp tools. These are necessary to configure your SB1000
+ card at boot time (or afterwards by hand) since it's a PnP card.
+
+ If you don't have these installed as a standard part of your Linux
+ distribution, you can find them at:
+
+ http://www.roestock.demon.co.uk/isapnptools/
+
+ or check your Linux distribution binary CD or their web site. For help with
+ isapnp, pnpdump, or /etc/isapnp.conf, go to:
+
+ http://www.roestock.demon.co.uk/isapnptools/isapnpfaq.html
+
+Using the driver
+================
+
+To make the SB1000 card work, follow these steps:
+
+1. Run ``make config``, or ``make menuconfig``, or ``make xconfig``, whichever
+ you prefer, in the top kernel tree directory to set up your kernel
+ configuration. Make sure to say "Y" to "Prompt for development drivers"
+ and to say "M" to the sb1000 driver. Also say "Y" or "M" to all the standard
+ networking questions to get TCP/IP and PPP networking support.
+
+2. **BEFORE** you build the kernel, edit drivers/net/sb1000.c. Make sure
+ to redefine the value of READ_DATA_PORT to match the I/O address used
+ by isapnp to access your PnP cards. This is the value of READPORT in
+ /etc/isapnp.conf or given by the output of pnpdump.
+
+3. Build and install the kernel and modules as usual.
+
+4. Boot your new kernel following the usual procedures.
+
+5. Set up to configure the new SB1000 PnP card by capturing the output
+ of "pnpdump" to a file and editing this file to set the correct I/O ports,
+ IRQ, and DMA settings for all your PnP cards. Make sure none of the settings
+ conflict with one another. Then test this configuration by running the
+ "isapnp" command with your new config file as the input. Check for
+ errors and fix as necessary. (As an aside, I use I/O ports 0x110 and
+ 0x310 and IRQ 11 for my SB1000 card and these work well for me. YMMV.)
+ Then save the finished config file as /etc/isapnp.conf for proper
+ configuration on subsequent reboots.
+
+6. Download the original file sb1000-1.1.2.tar.gz from Franco's site or one of
+ the others referenced above. As root, unpack it into a temporary directory
+ and do a ``make cmconfig`` and then ``install -c cmconfig /usr/local/sbin``.
+ Don't do ``make install`` because it expects to find all the utilities built
+ and ready for installation, not just cmconfig.
+
+7. As root, copy all the files under the ppp/ subdirectory in Franco's
+ tar file into /etc/ppp, being careful not to overwrite any files that are
+ already in there. Then modify ppp@gi-on to set the correct login name,
+ phone number, and frequency for the cable modem. Also edit pap-secrets
+ to specify your login name and password and any site-specific information
+ you need.
+
+8. Be sure to modify /etc/ppp/firewall to use ipchains instead of
+ the older ipfwadm commands from the 2.0.x kernels. There's a neat utility to
+ convert ipfwadm commands to ipchains commands:
+
+ http://users.dhp.com/~whisper/ipfwadm2ipchains/
+
+ You may also wish to modify the firewall script to implement a different
+ firewalling scheme.
+
+9. Start the PPP connection via the script /etc/ppp/ppp@gi-on. You must be
+ root to do this. It's better to use a utility like sudo to execute
+ frequently used commands like this with root permissions if possible. If you
+ connect successfully the cable modem interface will come up and you'll see a
+ driver message like this at the console::
+
+ cm0: sb1000 at (0x110,0x310), csn 1, S/N 0x2a0d16d8, IRQ 11.
+ sb1000.c:v1.1.2 6/01/98 (fventuri@mediaone.net)
+
+ The "ifconfig" command should show two new interfaces, ppp0 and cm0.
+
+ The command "cmconfig cm0" will give you information about the cable modem
+ interface.
+
+10. Try pinging a site via ``ping -c 5 www.yahoo.com``, for example. You should
+ see packets received.
+
+11. If you can't get site names (like www.yahoo.com) to resolve into
+ IP addresses (like 204.71.200.67), be sure your /etc/resolv.conf file
+ has no syntax errors and has the right nameserver IP addresses in it.
+ If this doesn't help, try something like ``ping -c 5 204.71.200.67`` to
+ see if the networking is running but the DNS resolution is where the
+ problem lies.
+
+12. If you still have problems, go to the support web sites mentioned above
+ and read the information and documentation there.
+
+Common problems
+===============
+
+1. Packets go out on the ppp0 interface but don't come back on the cm0
+ interface. It looks like I'm connected but I can't even ping any
+ numerical IP addresses. (This happens predominantly on Debian systems due
+ to a default boot-time configuration script.)
+
+Solution
+ As root ``echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter`` so it
+ can share the same IP address as the ppp0 interface. Note that this
+ command should probably be added to the /etc/ppp/cablemodem script
+ *right*between* the "/sbin/ifconfig" and "/sbin/cmconfig" commands.
+ You may need to do this to /proc/sys/net/ipv4/conf/ppp0/rp_filter as well.
+ If you do this to /proc/sys/net/ipv4/conf/default/rp_filter on each reboot
+ (in rc.local or some such) then any interfaces can share the same IP
+ addresses.
+
+2. I get "unresolved symbol" error messages on executing ``insmod sb1000.o``.
+
+Solution
+ You probably have a non-matching kernel source tree and
+ /usr/include/linux and /usr/include/asm header files. Make sure you
+ install the correct versions of the header files in these two directories.
+ Then rebuild and reinstall the kernel.
+
+3. When isapnp runs it reports an error, and my SB1000 card isn't working.
+
+Solution
+ There's a problem with later versions of isapnp using the "(CHECK)"
+ option in the lines that allocate the two I/O addresses for the SB1000 card.
+ This first popped up on RH 6.0. Delete "(CHECK)" for the SB1000 I/O addresses.
+ Make sure they don't conflict with any other pieces of hardware first! Then
+ rerun isapnp and go from there.
+
+4. I can't execute the /etc/ppp/ppp@gi-on file.
+
+Solution
+ As root do ``chmod ug+x /etc/ppp/ppp@gi-on``.
+
+5. The firewall script isn't working (with 2.2.x and higher kernels).
+
+Solution
+ Use the ipfwadm2ipchains script referenced above to convert the
+ /etc/ppp/firewall script from the deprecated ipfwadm commands to ipchains.
+
+6. I'm getting *tons* of firewall deny messages in the /var/kern.log,
+ /var/messages, and/or /var/syslog files, and they're filling up my /var
+ partition!!!
+
+Solution
+ First, tell your ISP that you're receiving DoS (Denial of Service)
+ and/or portscanning (UDP connection attempts) attacks! Look over the deny
+ messages to figure out what the attack is and where it's coming from. Next,
+ edit /etc/ppp/cablemodem and make sure the ",nobroadcast" option is turned on
+ to the "cmconfig" command (uncomment that line). If you're not receiving these
+ denied packets on your broadcast interface (IP address xxx.yyy.zzz.255
+ typically), then someone is attacking your machine in particular. Be careful
+ out there....
+
+7. Everything seems to work fine but my computer locks up after a while
+ (and typically during a lengthy download through the cable modem)!
+
+Solution
+ You may need to add a short delay in the driver to 'slow down' the
+ SURFboard because your PC might not be able to keep up with the transfer rate
+ of the SB1000. To do this, it's probably best to download Franco's
+ sb1000-1.1.2.tar.gz archive and build and install sb1000.o manually. You'll
+ want to edit the 'Makefile' and look for the 'SB1000_DELAY'
+ define. Uncomment those 'CFLAGS' lines (and comment out the default ones)
+ and try setting the delay to something like 60 microseconds with:
+ '-DSB1000_DELAY=60'. Then do ``make`` and as root ``make install`` and try
+ it out. If it still doesn't work or you like playing with the driver, you may
+ try other numbers. Remember though that the higher the delay, the slower the
+ driver (which slows down the rest of the PC too when it is actively
+ used). Thanks to Ed Daiga for this tip!
+
+Credits
+=======
+
+This README came from Franco Venturi's original README file which is
+still supplied with his driver .tar.gz archive. I and all other sb1000 users
+owe Franco a tremendous "Thank you!" Additional thanks goes to Carl Patten
+and Ralph Bonnell who are now managing the Linux SB1000 web site, and to
+the SB1000 users who reported and helped debug the common problems listed
+above.
+
+
+ Clemmitt Sigler
+ csigler@vt.edu
diff --git a/Documentation/networking/device_drivers/can/can327.rst b/Documentation/networking/device_drivers/can/can327.rst
new file mode 100644
index 0000000000..b87bfbe5d5
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/can327.rst
@@ -0,0 +1,331 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-3-Clause)
+
+can327: ELM327 driver for Linux SocketCAN
+==========================================
+
+Authors
+--------
+
+Max Staudt <max@enpas.org>
+
+
+
+Motivation
+-----------
+
+This driver aims to lower the initial cost for hackers interested in
+working with CAN buses.
+
+CAN adapters are expensive, few, and far between.
+ELM327 interfaces are cheap and plentiful.
+Let's use ELM327s as CAN adapters.
+
+
+
+Introduction
+-------------
+
+This driver is an effort to turn abundant ELM327 based OBD interfaces
+into full fledged (as far as possible) CAN interfaces.
+
+Since the ELM327 was never meant to be a stand alone CAN controller,
+the driver has to switch between its modes as quickly as possible in
+order to fake full-duplex operation.
+
+As such, can327 is a best effort driver. However, this is more than
+enough to implement simple request-response protocols (such as OBD II),
+and to monitor broadcast messages on a bus (such as in a vehicle).
+
+Most ELM327s come as nondescript serial devices, attached via USB or
+Bluetooth. The driver cannot recognize them by itself, and as such it
+is up to the user to attach it in form of a TTY line discipline
+(similar to PPP, SLIP, slcan, ...).
+
+This driver is meant for ELM327 versions 1.4b and up, see below for
+known limitations in older controllers and clones.
+
+
+
+Data sheet
+-----------
+
+The official data sheets can be found at ELM electronics' home page:
+
+ https://www.elmelectronics.com/
+
+
+
+How to attach the line discipline
+----------------------------------
+
+Every ELM327 chip is factory programmed to operate at a serial setting
+of 38400 baud/s, 8 data bits, no parity, 1 stopbit.
+
+If you have kept this default configuration, the line discipline can
+be attached on a command prompt as follows::
+
+ sudo ldattach \
+ --debug \
+ --speed 38400 \
+ --eightbits \
+ --noparity \
+ --onestopbit \
+ --iflag -ICRNL,INLCR,-IXOFF \
+ 30 \
+ /dev/ttyUSB0
+
+To change the ELM327's serial settings, please refer to its data
+sheet. This needs to be done before attaching the line discipline.
+
+Once the ldisc is attached, the CAN interface starts out unconfigured.
+Set the speed before starting it::
+
+ # The interface needs to be down to change parameters
+ sudo ip link set can0 down
+ sudo ip link set can0 type can bitrate 500000
+ sudo ip link set can0 up
+
+500000 bit/s is a common rate for OBD-II diagnostics.
+If you're connecting straight to a car's OBD port, this is the speed
+that most cars (but not all!) expect.
+
+After this, you can set out as usual with candump, cansniffer, etc.
+
+
+
+How to check the controller version
+------------------------------------
+
+Use a terminal program to attach to the controller.
+
+After issuing the "``AT WS``" command, the controller will respond with
+its version::
+
+ >AT WS
+
+
+ ELM327 v1.4b
+
+ >
+
+Note that clones may claim to be any version they like.
+It is not indicative of their actual feature set.
+
+
+
+
+Communication example
+----------------------
+
+This is a short and incomplete introduction on how to talk to an ELM327.
+It is here to guide understanding of the controller's and the driver's
+limitation (listed below) as well as manual testing.
+
+
+The ELM327 has two modes:
+
+- Command mode
+- Reception mode
+
+In command mode, it expects one command per line, terminated by CR.
+By default, the prompt is a "``>``", after which a command can be
+entered::
+
+ >ATE1
+ OK
+ >
+
+The init script in the driver switches off several configuration options
+that are only meaningful in the original OBD scenario the chip is meant
+for, and are actually a hindrance for can327.
+
+
+When a command is not recognized, such as by an older version of the
+ELM327, a question mark is printed as a response instead of OK::
+
+ >ATUNKNOWN
+ ?
+ >
+
+At present, can327 does not evaluate this response. See the section
+below on known limitations for details.
+
+
+When a CAN frame is to be sent, the target address is configured, after
+which the frame is sent as a command that consists of the data's hex
+dump::
+
+ >ATSH123
+ OK
+ >DEADBEEF12345678
+ OK
+ >
+
+The above interaction sends the SFF frame "``DE AD BE EF 12 34 56 78``"
+with (11 bit) CAN ID ``0x123``.
+For this to function, the controller must be configured for SFF sending
+mode (using "``AT PB``", see code or datasheet).
+
+
+Once a frame has been sent and wait-for-reply mode is on (``ATR1``,
+configured on ``listen-only=off``), or when the reply timeout expires
+and the driver sets the controller into monitoring mode (``ATMA``),
+the ELM327 will send one line for each received CAN frame, consisting
+of CAN ID, DLC, and data::
+
+ 123 8 DEADBEEF12345678
+
+For EFF (29 bit) CAN frames, the address format is slightly different,
+which can327 uses to tell the two apart::
+
+ 12 34 56 78 8 DEADBEEF12345678
+
+The ELM327 will receive both SFF and EFF frames - the current CAN
+config (``ATPB``) does not matter.
+
+
+If the ELM327's internal UART sending buffer runs full, it will abort
+the monitoring mode, print "BUFFER FULL" and drop back into command
+mode. Note that in this case, unlike with other error messages, the
+error message may appear on the same line as the last (usually
+incomplete) data frame::
+
+ 12 34 56 78 8 DEADBEEF123 BUFFER FULL
+
+
+
+Known limitations of the controller
+------------------------------------
+
+- Clone devices ("v1.5" and others)
+
+ Sending RTR frames is not supported and will be dropped silently.
+
+ Receiving RTR with DLC 8 will appear to be a regular frame with
+ the last received frame's DLC and payload.
+
+ "``AT CSM``" (CAN Silent Monitoring, i.e. don't send CAN ACKs) is
+ not supported, and is hard coded to ON. Thus, frames are not ACKed
+ while listening: "``AT MA``" (Monitor All) will always be "silent".
+ However, immediately after sending a frame, the ELM327 will be in
+ "receive reply" mode, in which it *does* ACK any received frames.
+ Once the bus goes silent, or an error occurs (such as BUFFER FULL),
+ or the receive reply timeout runs out, the ELM327 will end reply
+ reception mode on its own and can327 will fall back to "``AT MA``"
+ in order to keep monitoring the bus.
+
+ Other limitations may apply, depending on the clone and the quality
+ of its firmware.
+
+
+- All versions
+
+ No full duplex operation is supported. The driver will switch
+ between input/output mode as quickly as possible.
+
+ The length of outgoing RTR frames cannot be set. In fact, some
+ clones (tested with one identifying as "``v1.5``") are unable to
+ send RTR frames at all.
+
+ We don't have a way to get real-time notifications on CAN errors.
+ While there is a command (``AT CS``) to retrieve some basic stats,
+ we don't poll it as it would force us to interrupt reception mode.
+
+
+- Versions prior to 1.4b
+
+ These versions do not send CAN ACKs when in monitoring mode (AT MA).
+ However, they do send ACKs while waiting for a reply immediately
+ after sending a frame. The driver maximizes this time to make the
+ controller as useful as possible.
+
+ Starting with version 1.4b, the ELM327 supports the "``AT CSM``"
+ command, and the "listen-only" CAN option will take effect.
+
+
+- Versions prior to 1.4
+
+ These chips do not support the "``AT PB``" command, and thus cannot
+ change bitrate or SFF/EFF mode on-the-fly. This will have to be
+ programmed by the user before attaching the line discipline. See the
+ data sheet for details.
+
+
+- Versions prior to 1.3
+
+ These chips cannot be used at all with can327. They do not support
+ the "``AT D1``" command, which is necessary to avoid parsing conflicts
+ on incoming data, as well as distinction of RTR frame lengths.
+
+ Specifically, this allows for easy distinction of SFF and EFF
+ frames, and to check whether frames are complete. While it is possible
+ to deduce the type and length from the length of the line the ELM327
+ sends us, this method fails when the ELM327's UART output buffer
+ overruns. It may abort sending in the middle of the line, which will
+ then be mistaken for something else.
+
+
+
+Known limitations of the driver
+--------------------------------
+
+- No 8/7 timing.
+
+ ELM327 can only set CAN bitrates that are of the form 500000/n, where
+ n is an integer divisor.
+ However there is an exception: With a separate flag, it may set the
+ speed to be 8/7 of the speed indicated by the divisor.
+ This mode is not currently implemented.
+
+- No evaluation of command responses.
+
+ The ELM327 will reply with OK when a command is understood, and with ?
+ when it is not. The driver does not currently check this, and simply
+ assumes that the chip understands every command.
+ The driver is built such that functionality degrades gracefully
+ nevertheless. See the section on known limitations of the controller.
+
+- No use of hardware CAN ID filtering
+
+ An ELM327's UART sending buffer will easily overflow on heavy CAN bus
+ load, resulting in the "``BUFFER FULL``" message. Using the hardware
+ filters available through "``AT CF xxx``" and "``AT CM xxx``" would be
+ helpful here, however SocketCAN does not currently provide a facility
+ to make use of such hardware features.
+
+
+
+Rationale behind the chosen configuration
+------------------------------------------
+
+``AT E1``
+ Echo on
+
+ We need this to be able to get a prompt reliably.
+
+``AT S1``
+ Spaces on
+
+ We need this to distinguish 11/29 bit CAN addresses received.
+
+ Note:
+ We can usually do this using the line length (odd/even),
+ but this fails if the line is not transmitted fully to
+ the host (BUFFER FULL).
+
+``AT D1``
+ DLC on
+
+ We need this to tell the "length" of RTR frames.
+
+
+
+A note on CAN bus termination
+------------------------------
+
+Your adapter may have resistors soldered in which are meant to terminate
+the bus. This is correct when it is plugged into a OBD-II socket, but
+not helpful when trying to tap into the middle of an existing CAN bus.
+
+If communications don't work with the adapter connected, check for the
+termination resistors on its PCB and try removing them.
diff --git a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
new file mode 100644
index 0000000000..1661d13174
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
@@ -0,0 +1,638 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+CTU CAN FD Driver
+=================
+
+Author: Martin Jerabek <martin.jerabek01@gmail.com>
+
+
+About CTU CAN FD IP Core
+------------------------
+
+`CTU CAN FD <https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+is an open source soft core written in VHDL.
+It originated in 2015 as Ondrej Ille's project
+at the `Department of Measurement <https://meas.fel.cvut.cz/>`_
+of `FEE <http://www.fel.cvut.cz/en/>`_ at `CTU <https://www.cvut.cz/en>`_.
+
+The SocketCAN driver for Xilinx Zynq SoC based MicroZed board
+`Vivado integration <https://gitlab.fel.cvut.cz/canbus/zynq/zynq-can-sja1000-top>`_
+and Intel Cyclone V 5CSEMA4U23C6 based DE0-Nano-SoC Terasic board
+`QSys integration <https://gitlab.fel.cvut.cz/canbus/intel-soc-ctucanfd>`_
+has been developed as well as support for
+`PCIe integration <https://gitlab.fel.cvut.cz/canbus/pcie-ctucanfd>`_ of the core.
+
+In the case of Zynq, the core is connected via the APB system bus, which does
+not have enumeration support, and the device must be specified in Device Tree.
+This kind of devices is called platform device in the kernel and is
+handled by a platform device driver.
+
+The basic functional model of the CTU CAN FD peripheral has been
+accepted into QEMU mainline. See QEMU `CAN emulation support <https://www.qemu.org/docs/master/system/devices/can.html>`_
+for CAN FD buses, host connection and CTU CAN FD core emulation. The development
+version of emulation support can be cloned from ctu-canfd branch of QEMU local
+development `repository <https://gitlab.fel.cvut.cz/canbus/qemu-canbus>`_.
+
+
+About SocketCAN
+---------------
+
+SocketCAN is a standard common interface for CAN devices in the Linux
+kernel. As the name suggests, the bus is accessed via sockets, similarly
+to common network devices. The reasoning behind this is in depth
+described in `Linux SocketCAN <https://www.kernel.org/doc/html/latest/networking/can.html>`_.
+In short, it offers a
+natural way to implement and work with higher layer protocols over CAN,
+in the same way as, e.g., UDP/IP over Ethernet.
+
+Device probe
+~~~~~~~~~~~~
+
+Before going into detail about the structure of a CAN bus device driver,
+let's reiterate how the kernel gets to know about the device at all.
+Some buses, like PCI or PCIe, support device enumeration. That is, when
+the system boots, it discovers all the devices on the bus and reads
+their configuration. The kernel identifies the device via its vendor ID
+and device ID, and if there is a driver registered for this identifier
+combination, its probe method is invoked to populate the driver's
+instance for the given hardware. A similar situation goes with USB, only
+it allows for device hot-plug.
+
+The situation is different for peripherals which are directly embedded
+in the SoC and connected to an internal system bus (AXI, APB, Avalon,
+and others). These buses do not support enumeration, and thus the kernel
+has to learn about the devices from elsewhere. This is exactly what the
+Device Tree was made for.
+
+Device tree
+~~~~~~~~~~~
+
+An entry in device tree states that a device exists in the system, how
+it is reachable (on which bus it resides) and its configuration –
+registers address, interrupts and so on. An example of such a device
+tree is given in .
+
+::
+
+ / {
+ /* ... */
+ amba: amba {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "simple-bus";
+
+ CTU_CAN_FD_0: CTU_CAN_FD@43c30000 {
+ compatible = "ctu,ctucanfd";
+ interrupt-parent = <&intc>;
+ interrupts = <0 30 4>;
+ clocks = <&clkc 15>;
+ reg = <0x43c30000 0x10000>;
+ };
+ };
+ };
+
+
+.. _sec:socketcan:drv:
+
+Driver structure
+~~~~~~~~~~~~~~~~
+
+The driver can be divided into two parts – platform-dependent device
+discovery and set up, and platform-independent CAN network device
+implementation.
+
+.. _sec:socketcan:platdev:
+
+Platform device driver
+^^^^^^^^^^^^^^^^^^^^^^
+
+In the case of Zynq, the core is connected via the AXI system bus, which
+does not have enumeration support, and the device must be specified in
+Device Tree. This kind of devices is called *platform device* in the
+kernel and is handled by a *platform device driver*\ [1]_.
+
+A platform device driver provides the following things:
+
+- A *probe* function
+
+- A *remove* function
+
+- A table of *compatible* devices that the driver can handle
+
+The *probe* function is called exactly once when the device appears (or
+the driver is loaded, whichever happens later). If there are more
+devices handled by the same driver, the *probe* function is called for
+each one of them. Its role is to allocate and initialize resources
+required for handling the device, as well as set up low-level functions
+for the platform-independent layer, e.g., *read_reg* and *write_reg*.
+After that, the driver registers the device to a higher layer, in our
+case as a *network device*.
+
+The *remove* function is called when the device disappears, or the
+driver is about to be unloaded. It serves to free the resources
+allocated in *probe* and to unregister the device from higher layers.
+
+Finally, the table of *compatible* devices states which devices the
+driver can handle. The Device Tree entry ``compatible`` is matched
+against the tables of all *platform drivers*.
+
+.. code:: c
+
+ /* Match table for OF platform binding */
+ static const struct of_device_id ctucan_of_match[] = {
+ { .compatible = "ctu,canfd-2", },
+ { .compatible = "ctu,ctucanfd", },
+ { /* end of list */ },
+ };
+ MODULE_DEVICE_TABLE(of, ctucan_of_match);
+
+ static int ctucan_probe(struct platform_device *pdev);
+ static int ctucan_remove(struct platform_device *pdev);
+
+ static struct platform_driver ctucanfd_driver = {
+ .probe = ctucan_probe,
+ .remove = ctucan_remove,
+ .driver = {
+ .name = DRIVER_NAME,
+ .of_match_table = ctucan_of_match,
+ },
+ };
+ module_platform_driver(ctucanfd_driver);
+
+
+.. _sec:socketcan:netdev:
+
+Network device driver
+^^^^^^^^^^^^^^^^^^^^^
+
+Each network device must support at least these operations:
+
+- Bring the device up: ``ndo_open``
+
+- Bring the device down: ``ndo_close``
+
+- Submit TX frames to the device: ``ndo_start_xmit``
+
+- Signal TX completion and errors to the network subsystem: ISR
+
+- Submit RX frames to the network subsystem: ISR and NAPI
+
+There are two possible event sources: the device and the network
+subsystem. Device events are usually signaled via an interrupt, handled
+in an Interrupt Service Routine (ISR). Handlers for the events
+originating in the network subsystem are then specified in
+``struct net_device_ops``.
+
+When the device is brought up, e.g., by calling ``ip link set can0 up``,
+the driver’s function ``ndo_open`` is called. It should validate the
+interface configuration and configure and enable the device. The
+analogous opposite is ``ndo_close``, called when the device is being
+brought down, be it explicitly or implicitly.
+
+When the system should transmit a frame, it does so by calling
+``ndo_start_xmit``, which enqueues the frame into the device. If the
+device HW queue (FIFO, mailboxes or whatever the implementation is)
+becomes full, the ``ndo_start_xmit`` implementation informs the network
+subsystem that it should stop the TX queue (via ``netif_stop_queue``).
+It is then re-enabled later in ISR when the device has some space
+available again and is able to enqueue another frame.
+
+All the device events are handled in ISR, namely:
+
+#. **TX completion**. When the device successfully finishes transmitting
+ a frame, the frame is echoed locally. On error, an informative error
+ frame [2]_ is sent to the network subsystem instead. In both cases,
+ the software TX queue is resumed so that more frames may be sent.
+
+#. **Error condition**. If something goes wrong (e.g., the device goes
+ bus-off or RX overrun happens), error counters are updated, and
+ informative error frames are enqueued to SW RX queue.
+
+#. **RX buffer not empty**. In this case, read the RX frames and enqueue
+ them to SW RX queue. Usually NAPI is used as a middle layer (see ).
+
+.. _sec:socketcan:napi:
+
+NAPI
+~~~~
+
+The frequency of incoming frames can be high and the overhead to invoke
+the interrupt service routine for each frame can cause significant
+system load. There are multiple mechanisms in the Linux kernel to deal
+with this situation. They evolved over the years of Linux kernel
+development and enhancements. For network devices, the current standard
+is NAPI – *the New API*. It is similar to classical top-half/bottom-half
+interrupt handling in that it only acknowledges the interrupt in the ISR
+and signals that the rest of the processing should be done in softirq
+context. On top of that, it offers the possibility to *poll* for new
+frames for a while. This has a potential to avoid the costly round of
+enabling interrupts, handling an incoming IRQ in ISR, re-enabling the
+softirq and switching context back to softirq.
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
+
+Integrating the core to Xilinx Zynq
+-----------------------------------
+
+The core interfaces a simple subset of the Avalon
+(search for Intel **Avalon Interface Specifications**)
+bus as it was originally used on
+Alterra FPGA chips, yet Xilinx natively interfaces with AXI
+(search for ARM **AMBA AXI and ACE Protocol Specification AXI3,
+AXI4, and AXI4-Lite, ACE and ACE-Lite**).
+The most obvious solution would be to use
+an Avalon/AXI bridge or implement some simple conversion entity.
+However, the core’s interface is half-duplex with no handshake
+signaling, whereas AXI is full duplex with two-way signaling. Moreover,
+even AXI-Lite slave interface is quite resource-intensive, and the
+flexibility and speed of AXI are not required for a CAN core.
+
+Thus a much simpler bus was chosen – APB (Advanced Peripheral Bus)
+(search for ARM **AMBA APB Protocol Specification**).
+APB-AXI bridge is directly available in
+Xilinx Vivado, and the interface adaptor entity is just a few simple
+combinatorial assignments.
+
+Finally, to be able to include the core in a block diagram as a custom
+IP, the core, together with the APB interface, has been packaged as a
+Vivado component.
+
+CTU CAN FD Driver design
+------------------------
+
+The general structure of a CAN device driver has already been examined
+in . The next paragraphs provide a more detailed description of the CTU
+CAN FD core driver in particular.
+
+Low-level driver
+~~~~~~~~~~~~~~~~
+
+The core is not intended to be used solely with SocketCAN, and thus it
+is desirable to have an OS-independent low-level driver. This low-level
+driver can then be used in implementations of OS driver or directly
+either on bare metal or in a user-space application. Another advantage
+is that if the hardware slightly changes, only the low-level driver
+needs to be modified.
+
+The code [3]_ is in part automatically generated and in part written
+manually by the core author, with contributions of the thesis’ author.
+The low-level driver supports operations such as: set bit timing, set
+controller mode, enable/disable, read RX frame, write TX frame, and so
+on.
+
+Configuring bit timing
+~~~~~~~~~~~~~~~~~~~~~~
+
+On CAN, each bit is divided into four segments: SYNC, PROP, PHASE1, and
+PHASE2. Their duration is expressed in multiples of a Time Quantum
+(details in `CAN Specification, Version 2.0 <http://esd.cs.ucr.edu/webres/can20.pdf>`_, chapter 8).
+When configuring
+bitrate, the durations of all the segments (and time quantum) must be
+computed from the bitrate and Sample Point. This is performed
+independently for both the Nominal bitrate and Data bitrate for CAN FD.
+
+SocketCAN is fairly flexible and offers either highly customized
+configuration by setting all the segment durations manually, or a
+convenient configuration by setting just the bitrate and sample point
+(and even that is chosen automatically per Bosch recommendation if not
+specified). However, each CAN controller may have different base clock
+frequency and different width of segment duration registers. The
+algorithm thus needs the minimum and maximum values for the durations
+(and clock prescaler) and tries to optimize the numbers to fit both the
+constraints and the requested parameters.
+
+.. code:: c
+
+ struct can_bittiming_const {
+ char name[16]; /* Name of the CAN controller hardware */
+ __u32 tseg1_min; /* Time segment 1 = prop_seg + phase_seg1 */
+ __u32 tseg1_max;
+ __u32 tseg2_min; /* Time segment 2 = phase_seg2 */
+ __u32 tseg2_max;
+ __u32 sjw_max; /* Synchronisation jump width */
+ __u32 brp_min; /* Bit-rate prescaler */
+ __u32 brp_max;
+ __u32 brp_inc;
+ };
+
+
+[lst:can_bittiming_const]
+
+A curious reader will notice that the durations of the segments PROP_SEG
+and PHASE_SEG1 are not determined separately but rather combined and
+then, by default, the resulting TSEG1 is evenly divided between PROP_SEG
+and PHASE_SEG1. In practice, this has virtually no consequences as the
+sample point is between PHASE_SEG1 and PHASE_SEG2. In CTU CAN FD,
+however, the duration registers ``PROP`` and ``PH1`` have different
+widths (6 and 7 bits, respectively), so the auto-computed values might
+overflow the shorter register and must thus be redistributed among the
+two [4]_.
+
+Handling RX
+~~~~~~~~~~~
+
+Frame reception is handled in NAPI queue, which is enabled from ISR when
+the RXNE (RX FIFO Not Empty) bit is set. Frames are read one by one
+until either no frame is left in the RX FIFO or the maximum work quota
+has been reached for the NAPI poll run (see ). Each frame is then passed
+to the network interface RX queue.
+
+An incoming frame may be either a CAN 2.0 frame or a CAN FD frame. The
+way to distinguish between these two in the kernel is to allocate either
+``struct can_frame`` or ``struct canfd_frame``, the two having different
+sizes. In the controller, the information about the frame type is stored
+in the first word of RX FIFO.
+
+This brings us a chicken-egg problem: we want to allocate the ``skb``
+for the frame, and only if it succeeds, fetch the frame from FIFO;
+otherwise keep it there for later. But to be able to allocate the
+correct ``skb``, we have to fetch the first work of FIFO. There are
+several possible solutions:
+
+#. Read the word, then allocate. If it fails, discard the rest of the
+ frame. When the system is low on memory, the situation is bad anyway.
+
+#. Always allocate ``skb`` big enough for an FD frame beforehand. Then
+ tweak the ``skb`` internals to look like it has been allocated for
+ the smaller CAN 2.0 frame.
+
+#. Add option to peek into the FIFO instead of consuming the word.
+
+#. If the allocation fails, store the read word into driver’s data. On
+ the next try, use the stored word instead of reading it again.
+
+Option 1 is simple enough, but not very satisfying if we could do
+better. Option 2 is not acceptable, as it would require modifying the
+private state of an integral kernel structure. The slightly higher
+memory consumption is just a virtual cherry on top of the “cake”. Option
+3 requires non-trivial HW changes and is not ideal from the HW point of
+view.
+
+Option 4 seems like a good compromise, with its disadvantage being that
+a partial frame may stay in the FIFO for a prolonged time. Nonetheless,
+there may be just one owner of the RX FIFO, and thus no one else should
+see the partial frame (disregarding some exotic debugging scenarios).
+Basides, the driver resets the core on its initialization, so the
+partial frame cannot be “adopted” either. In the end, option 4 was
+selected [5]_.
+
+.. _subsec:ctucanfd:rxtimestamp:
+
+Timestamping RX frames
+^^^^^^^^^^^^^^^^^^^^^^
+
+The CTU CAN FD core reports the exact timestamp when the frame has been
+received. The timestamp is by default captured at the sample point of
+the last bit of EOF but is configurable to be captured at the SOF bit.
+The timestamp source is external to the core and may be up to 64 bits
+wide. At the time of writing, passing the timestamp from kernel to
+userspace is not yet implemented, but is planned in the future.
+
+Handling TX
+~~~~~~~~~~~
+
+The CTU CAN FD core has 4 independent TX buffers, each with its own
+state and priority. When the core wants to transmit, a TX buffer in
+Ready state with the highest priority is selected.
+
+The priorities are 3bit numbers in register TX_PRIORITY
+(nibble-aligned). This should be flexible enough for most use cases.
+SocketCAN, however, supports only one FIFO queue for outgoing
+frames [6]_. The buffer priorities may be used to simulate the FIFO
+behavior by assigning each buffer a distinct priority and *rotating* the
+priorities after a frame transmission is completed.
+
+In addition to priority rotation, the SW must maintain head and tail
+pointers into the FIFO formed by the TX buffers to be able to determine
+which buffer should be used for next frame (``txb_head``) and which
+should be the first completed one (``txb_tail``). The actual buffer
+indices are (obviously) modulo 4 (number of TX buffers), but the
+pointers must be at least one bit wider to be able to distinguish
+between FIFO full and FIFO empty – in this situation,
+:math:`txb\_head \equiv txb\_tail\ (\textrm{mod}\ 4)`. An example of how
+the FIFO is maintained, together with priority rotation, is depicted in
+
+|
+
++------+---+---+---+---+
+| TXB# | 0 | 1 | 2 | 3 |
++======+===+===+===+===+
+| Seq | A | B | C | |
++------+---+---+---+---+
+| Prio | 7 | 6 | 5 | 4 |
++------+---+---+---+---+
+| | | T | | H |
++------+---+---+---+---+
+
+|
+
++------+---+---+---+---+
+| TXB# | 0 | 1 | 2 | 3 |
++======+===+===+===+===+
+| Seq | | B | C | |
++------+---+---+---+---+
+| Prio | 4 | 7 | 6 | 5 |
++------+---+---+---+---+
+| | | T | | H |
++------+---+---+---+---+
+
+|
+
++------+---+---+---+---+----+
+| TXB# | 0 | 1 | 2 | 3 | 0’ |
++======+===+===+===+===+====+
+| Seq | E | B | C | D | |
++------+---+---+---+---+----+
+| Prio | 4 | 7 | 6 | 5 | |
++------+---+---+---+---+----+
+| | | T | | | H |
++------+---+---+---+---+----+
+
+|
+
+.. kernel-figure:: fsm_txt_buffer_user.svg
+
+ TX Buffer states with possible transitions
+
+.. _subsec:ctucanfd:txtimestamp:
+
+Timestamping TX frames
+^^^^^^^^^^^^^^^^^^^^^^
+
+When submitting a frame to a TX buffer, one may specify the timestamp at
+which the frame should be transmitted. The frame transmission may start
+later, but not sooner. Note that the timestamp does not participate in
+buffer prioritization – that is decided solely by the mechanism
+described above.
+
+Support for time-based packet transmission was recently merged to Linux
+v4.19 `Time-based packet transmission <https://lwn.net/Articles/748879/>`_,
+but it remains yet to be researched
+whether this functionality will be practical for CAN.
+
+Also similarly to retrieving the timestamp of RX frames, the core
+supports retrieving the timestamp of TX frames – that is the time when
+the frame was successfully delivered. The particulars are very similar
+to timestamping RX frames and are described in .
+
+Handling RX buffer overrun
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a received frame does no more fit into the hardware RX FIFO in its
+entirety, RX FIFO overrun flag (STATUS[DOR]) is set and Data Overrun
+Interrupt (DOI) is triggered. When servicing the interrupt, care must be
+taken first to clear the DOR flag (via COMMAND[CDO]) and after that
+clear the DOI interrupt flag. Otherwise, the interrupt would be
+immediately [7]_ rearmed.
+
+**Note**: During development, it was discussed whether the internal HW
+pipelining cannot disrupt this clear sequence and whether an additional
+dummy cycle is necessary between clearing the flag and the interrupt. On
+the Avalon interface, it indeed proved to be the case, but APB being
+safe because it uses 2-cycle transactions. Essentially, the DOR flag
+would be cleared, but DOI register’s Preset input would still be high
+the cycle when the DOI clear request would also be applied (by setting
+the register’s Reset input high). As Set had higher priority than Reset,
+the DOI flag would not be reset. This has been already fixed by swapping
+the Set/Reset priority (see issue #187).
+
+Reporting Error Passive and Bus Off conditions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It may be desirable to report when the node reaches *Error Passive*,
+*Error Warning*, and *Bus Off* conditions. The driver is notified about
+error state change by an interrupt (EPI, EWLI), and then proceeds to
+determine the core’s error state by reading its error counters.
+
+There is, however, a slight race condition here – there is a delay
+between the time when the state transition occurs (and the interrupt is
+triggered) and when the error counters are read. When EPI is received,
+the node may be either *Error Passive* or *Bus Off*. If the node goes
+*Bus Off*, it obviously remains in the state until it is reset.
+Otherwise, the node is *or was* *Error Passive*. However, it may happen
+that the read state is *Error Warning* or even *Error Active*. It may be
+unclear whether and what exactly to report in that case, but I
+personally entertain the idea that the past error condition should still
+be reported. Similarly, when EWLI is received but the state is later
+detected to be *Error Passive*, *Error Passive* should be reported.
+
+
+CTU CAN FD Driver Sources Reference
+-----------------------------------
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd.h
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_base.c
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_pci.c
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_platform.c
+ :internal:
+
+CTU CAN FD IP Core and Driver Development Acknowledgment
+---------------------------------------------------------
+
+* Odrej Ille <ondrej.ille@gmail.com>
+
+ * started the project as student at Department of Measurement, FEE, CTU
+ * invested great amount of personal time and enthusiasm to the project over years
+ * worked on more funded tasks
+
+* `Department of Measurement <https://meas.fel.cvut.cz/>`_,
+ `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_,
+ `Czech Technical University <https://www.cvut.cz/en>`_
+
+ * is the main investor into the project over many years
+ * uses project in their CAN/CAN FD diagnostics framework for `Skoda Auto <https://www.skoda-auto.cz/>`_
+
+* `Digiteq Automotive <https://www.digiteqautomotive.com/en>`_
+
+ * funding of the project CAN FD Open Cores Support Linux Kernel Based Systems
+ * negotiated and paid CTU to allow public access to the project
+ * provided additional funding of the work
+
+* `Department of Control Engineering <https://control.fel.cvut.cz/en>`_,
+ `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_,
+ `Czech Technical University <https://www.cvut.cz/en>`_
+
+ * solving the project CAN FD Open Cores Support Linux Kernel Based Systems
+ * providing GitLab management
+ * virtual servers and computational power for continuous integration
+ * providing hardware for HIL continuous integration tests
+
+* `PiKRON Ltd. <http://pikron.com/>`_
+
+ * minor funding to initiate preparation of the project open-sourcing
+
+* Petr Porazil <porazil@pikron.com>
+
+ * design of PCIe transceiver addon board and assembly of boards
+ * design and assembly of MZ_APO baseboard for MicroZed/Zynq based system
+
+* Martin Jerabek <martin.jerabek01@gmail.com>
+
+ * Linux driver development
+ * continuous integration platform architect and GHDL updates
+ * thesis `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_
+
+* Jiri Novak <jnovak@fel.cvut.cz>
+
+ * project initiation, management and use at Department of Measurement, FEE, CTU
+
+* Pavel Pisa <pisa@cmp.felk.cvut.cz>
+
+ * initiate open-sourcing, project coordination, management at Department of Control Engineering, FEE, CTU
+
+* Jaroslav Beran<jara.beran@gmail.com>
+
+ * system integration for Intel SoC, core and driver testing and updates
+
+* Carsten Emde (`OSADL <https://www.osadl.org/>`_)
+
+ * provided OSADL expertise to discuss IP core licensing
+ * pointed to possible deadlock for LGPL and CAN bus possible patent case which lead to relicense IP core design to BSD like license
+
+* Reiner Zitzmann and Holger Zeltwanger (`CAN in Automation <https://www.can-cia.org/>`_)
+
+ * provided suggestions and help to inform community about the project and invited us to events focused on CAN bus future development directions
+
+* Jan Charvat
+
+ * implemented CTU CAN FD functional model for QEMU which has been integrated into QEMU mainline (`docs/system/devices/can.rst <https://www.qemu.org/docs/master/system/devices/can.html>`_)
+ * Bachelor thesis Model of CAN FD Communication Controller for QEMU Emulator
+
+Notes
+-----
+
+
+.. [1]
+ Other buses have their own specific driver interface to set up the
+ device.
+
+.. [2]
+ Not to be mistaken with CAN Error Frame. This is a ``can_frame`` with
+ ``CAN_ERR_FLAG`` set and some error info in its ``data`` field.
+
+.. [3]
+ Available in CTU CAN FD repository
+ `<https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+
+.. [4]
+ As is done in the low-level driver functions
+ ``ctucan_hw_set_nom_bittiming`` and
+ ``ctucan_hw_set_data_bittiming``.
+
+.. [5]
+ At the time of writing this thesis, option 1 is still being used and
+ the modification is queued in gitlab issue #222
+
+.. [6]
+ Strictly speaking, multiple CAN TX queues are supported since v4.19
+ `can: enable multi-queue for SocketCAN devices <https://lore.kernel.org/patchwork/patch/913526/>`_ but no mainline driver is using
+ them yet.
+
+.. [7]
+ Or rather in the next clock cycle
diff --git a/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg
new file mode 100644
index 0000000000..381323423b
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg
@@ -0,0 +1,151 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<svg width="113.611mm" height="86.6873mm" version="1.1" viewBox="0 0 113.611 86.6873" xmlns="http://www.w3.org/2000/svg" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+ <defs>
+ <marker id="marker3667" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3517" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3373" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3199" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3037" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2779" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2477" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2074" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker1964" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker1856" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="Arrow2Mend" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <filter id="filter1204" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <marker id="marker2074-3" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <filter id="filter1204-6" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-4" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1-3" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1-3-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ </defs>
+ <metadata>
+ <rdf:RDF>
+ <cc:Work rdf:about="">
+ <dc:format>image/svg+xml</dc:format>
+ <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+ <dc:title/>
+ </cc:Work>
+ </rdf:RDF>
+ </metadata>
+ <g transform="translate(-49.0277 -104.823)">
+ <g>
+ <path d="m130.534 165.429h-71.1816v-17.5315" fill="none" marker-end="url(#marker2477)" stroke="#28a4ff" stroke-width=".6"/>
+ <path d="m145.034 122.959v-11.5914h-43.1215" fill="none" marker-end="url(#marker3037)" stroke="#28a4ff" stroke-width=".6"/>
+ <rect x="130.679" y="122.933" width="28.2965" height="45.2319" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".499999"/>
+ <path d="m102.044 116.236h23.3126l-0.13388 18.8185h19.9383v3.66603" fill="none" marker-end="url(#marker3199)" stroke="#28a4ff" stroke-width=".6"/>
+ <path d="m59.5006 138.391v-24.2517h20.6338" fill="none" marker-end="url(#marker2779)" stroke="#28a4ff" stroke-width=".6"/>
+ <rect x="78.1389" y="126.411" width="28.0037" height="35.0443" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".5"/>
+ </g>
+ <g fill="#ffcb35" stroke="#000" stroke-linecap="square">
+ <ellipse cx="92.1408" cy="114.239" rx="10.8866" ry="4.39308" stroke-width=".5"/>
+ <ellipse cx="92.1408" cy="134.185" rx="10.8866" ry="4.39308" stroke-width=".499999"/>
+ <ellipse cx="92.1408" cy="152.199" rx="10.8866" ry="4.39308" stroke-width=".499999"/>
+ </g>
+ <g fill="#28a4ff" stroke="#000" stroke-linecap="square" stroke-width=".499999">
+ <ellipse cx="144.827" cy="143.316" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="144.827" cy="159.143" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="59.4364" cy="142.823" rx="7.36455" ry="4.39308"/>
+ <ellipse cx="144.827" cy="129.196" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="143.077" cy="180.53" rx="10.8866" ry="4.39308"/>
+ </g>
+ <ellipse cx="110.386" cy="180.53" rx="10.8866" ry="4.39308" fill="#ffcb35" stroke="#000" stroke-linecap="square" stroke-width=".499999"/>
+ <text x="110.90907" y="179.42688" font-size="3.175px" xml:space="preserve"><tspan x="110.90907" y="179.42688" dy="0.60000002" text-align="center" text-anchor="middle">Accessible</tspan><tspan x="110.90907" y="183.39563"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text>
+ <text x="143.5869" y="179.52795" xml:space="preserve"><tspan x="143.5869" y="179.52795" dy="1 0 0 0 0 0" font-family="sans-serif" font-size="2.82222px" text-align="center" text-anchor="middle" style="font-variant-caps:normal;font-variant-east-asian:normal;font-variant-ligatures:normal;font-variant-numeric:normal">Inaccessible</tspan><tspan x="143.5869" y="183.36786" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text>
+ <g font-size="3.175px">
+ <text x="91.95018" y="115.29005" xml:space="preserve"><tspan x="91.95018" y="115.29005" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Ready</tspan></tspan></text>
+ <text x="145.25127" y="130.49019" xml:space="preserve"><tspan x="145.25127" y="130.49019" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX OK</tspan></tspan></text>
+ <text x="145.31845" y="144.43121" xml:space="preserve"><tspan x="145.31845" y="144.43121" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Aborted</tspan></tspan></text>
+ <text x="145.40399" y="160.36035" xml:space="preserve"><tspan x="145.40399" y="160.36035" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX failed</tspan></tspan></text>
+ <text x="91.823967" y="133.53941" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.823967" y="133.53941" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX in</tspan></tspan><tspan x="91.823967" y="136.39691" text-align="center">progress</tspan></text>
+ <text x="91.648918" y="151.84813" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.648918" y="151.84813" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Abort in</tspan></tspan><tspan x="91.648918" y="154.70563" text-align="center">progress</tspan></text>
+ <text x="59.456043" y="143.91658" xml:space="preserve"><tspan x="59.456043" y="143.91658" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Empty</tspan></tspan></text>
+ </g>
+ <g fill="none">
+ <g stroke="#000">
+ <rect x="52.3943" y="171.63" width="106.581" height="16.601" rx="0" ry="0" stroke-linecap="square" stroke-width=".499999"/>
+ <g stroke-width=".6">
+ <path d="m106.383 159.046h26.4967" marker-end="url(#Arrow2Mend)"/>
+ <path d="m103.138 152.268h41.5564v-3.92426" marker-end="url(#marker1856)"/>
+ <path d="m106.38 129.354h17.7785"/>
+ <path d="m125.818 129.359h7.2418" marker-end="url(#marker1964)"/>
+ </g>
+ <path d="m124.169 129.354a0.959514 0.97091 0 0 1 0.47587-0.84557 0.959514 0.97091 0 0 1 0.96164-3e-3 0.959514 0.97091 0 0 1 0.48149 0.84231" stroke-linecap="square" stroke-width=".600001"/>
+ <path d="m55.7026 180.832h34.8131" marker-end="url(#marker2074)" stroke-width=".6"/>
+ </g>
+ <g>
+ <path d="m55.6464 185.744h34.8131" marker-end="url(#marker2074-3)" stroke="#28a4ff" stroke-width=".600001"/>
+ <g stroke-width=".6">
+ <path d="m94.0487 129.889v-10.6493" marker-end="url(#marker3373)" stroke="#000"/>
+ <path d="m89.7534 118.621v10.662" marker-end="url(#marker3517)" stroke="#000"/>
+ <path d="m92.119 138.812v7.9718" marker-end="url(#marker3667)" stroke="#28a4ff"/>
+ </g>
+ </g>
+ </g>
+ <text transform="matrix(.264583 0 0 .264583 91.8919 139.964)" x="26.959213" y="9.11724" fill="#2aa1ff" filter="url(#filter1204-6-2-9-1-3-1)" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="26.959213" y="9.11724" text-align="center">Set</tspan><tspan x="26.959213" y="22.31724" text-align="center">abort</tspan></text>
+ <text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccessful</tspan></text>
+ <g font-size="12px" stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 68.5988 118.913)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">starts</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">successful</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 107.77 145.476)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">sborted</tspan></text>
+ </g>
+ <g stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 107.574 155.948)" x="38.824219" y="9.1171875" filter="url(#filter1204)" font-size="10.6667px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Retransmit</tspan><tspan x="38.824219" y="20.850557" text-align="center">limit reached or</tspan><tspan x="38.824219" y="32.583927" text-align="center">node went bus off</tspan><tspan x="38.824219" y="44.317299" text-align="center"/></text>
+ <text transform="matrix(.264583 0 0 .264583 60.7127 177.384)" x="38.824539" y="9.1173134" filter="url(#filter1204-6)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824539" y="9.1173134" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Transmission result</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 45.6885 173.226)" x="57.727047" y="9.11724" filter="url(#filter1204-6-9)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Legend:</tspan></text>
+ </g>
+ <g fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 57.0045 182.079)" x="57.727047" y="9.11724" filter="url(#filter1204-6-2)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">SW command</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 57.7865 110.104)" x="40.822609" y="9.11724" filter="url(#filter1204-6-2-9)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="40.822609" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 116.893 107.491)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-4)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 87.5687 166.324)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-1)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set empty</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 106.53 113.074)" x="30.228771" y="8.9063139" filter="url(#filter1204-6-2-9-1-3)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="30.228771" y="8.9063139" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set abort</tspan></text>
+ </g>
+ </g>
+</svg>
diff --git a/Documentation/networking/device_drivers/can/freescale/flexcan.rst b/Documentation/networking/device_drivers/can/freescale/flexcan.rst
new file mode 100644
index 0000000000..106cd28901
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/freescale/flexcan.rst
@@ -0,0 +1,54 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================
+Flexcan CAN Controller driver
+=============================
+
+Authors: Marc Kleine-Budde <mkl@pengutronix.de>,
+Dario Binacchi <dario.binacchi@amarulasolutions.com>
+
+On/off RTR frames reception
+===========================
+
+For most flexcan IP cores the driver supports 2 RX modes:
+
+- FIFO
+- mailbox
+
+The older flexcan cores (integrated into the i.MX25, i.MX28, i.MX35
+and i.MX53 SOCs) only receive RTR frames if the controller is
+configured for RX-FIFO mode.
+
+The RX FIFO mode uses a hardware FIFO with a depth of 6 CAN frames,
+while the mailbox mode uses a software FIFO with a depth of up to 62
+CAN frames. With the help of the bigger buffer, the mailbox mode
+performs better under high system load situations.
+
+As reception of RTR frames is part of the CAN standard, all flexcan
+cores come up in a mode where RTR reception is possible.
+
+With the "rx-rtr" private flag the ability to receive RTR frames can
+be waived at the expense of losing the ability to receive RTR
+messages. This trade off is beneficial in certain use cases.
+
+"rx-rtr" on
+ Receive RTR frames. (default)
+
+ The CAN controller can and will receive RTR frames.
+
+ On some IP cores the controller cannot receive RTR frames in the
+ more performant "RX mailbox" mode and will use "RX FIFO" mode
+ instead.
+
+"rx-rtr" off
+
+ Waive ability to receive RTR frames. (not supported on all IP cores)
+
+ This mode activates the "RX mailbox mode" for better performance, on
+ some IP cores RTR frames cannot be received anymore.
+
+The setting can only be changed if the interface is down::
+
+ ip link set dev can0 down
+ ethtool --set-priv-flags can0 rx-rtr {off|on}
+ ip link set dev can0 up
diff --git a/Documentation/networking/device_drivers/can/index.rst b/Documentation/networking/device_drivers/can/index.rst
new file mode 100644
index 0000000000..6a8a4f74fa
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/index.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Controller Area Network (CAN) Device Drivers
+============================================
+
+Device drivers for CAN devices.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ can327
+ ctu/ctucanfd-driver
+ freescale/flexcan
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cellular/index.rst b/Documentation/networking/device_drivers/cellular/index.rst
new file mode 100644
index 0000000000..fc1812d3fc
--- /dev/null
+++ b/Documentation/networking/device_drivers/cellular/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Cellular Modem Device Drivers
+=============================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ qualcomm/rmnet
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
new file mode 100644
index 0000000000..289c146a82
--- /dev/null
+++ b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Rmnet Driver
+============
+
+1. Introduction
+===============
+
+rmnet driver is used for supporting the Multiplexing and aggregation
+Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
+Technologies, Inc. modems.
+
+This driver can be used to register onto any physical network device in
+IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
+
+Multiplexing allows for creation of logical netdevices (rmnet devices) to
+handle multiple private data networks (PDN) like a default internet, tethering,
+multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
+packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
+routes to the appropriate PDN after removing the MAP header.
+
+Aggregation is required to achieve high data rates. This involves hardware
+sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
+these MAP frames and send them to appropriate PDN's.
+
+2. Packet format
+================
+
+a. MAP packet v1 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 1 2-7 8-15 16-31
+ Function Command / Data Reserved Pad Multiplexer ID Payload length
+
+ Bit 32-x
+ Function Raw bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+b. Map packet v4 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 1 2-7 8-15 16-31
+ Function Command / Data Reserved Pad Multiplexer ID Payload length
+
+ Bit 32-(x-33) (x-32)-x
+ Function Raw bytes Checksum offload header
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+Checksum offload header, has the information about the checksum processing done
+by the hardware.Checksum offload header fields are in big endian format.
+
+Packet format::
+
+ Bit 0-14 15 16-31
+ Function Reserved Valid Checksum start offset
+
+ Bit 31-47 48-64
+ Function Checksum length Checksum value
+
+Reserved bits must be zero when sent and ignored when received.
+
+Valid bit indicates whether the partial checksum is calculated and is valid.
+Set to 1, if its is valid. Set to 0 otherwise.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Checksum start offset, Indicates the offset in bytes from the beginning of the
+IP header, from which modem computed checksum.
+
+Checksum length is the Length in bytes starting from CKSUM_START_OFFSET,
+over which checksum is computed.
+
+Checksum value, indicates the checksum computed.
+
+c. MAP packet v5 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 1 2-7 8-15 16-31
+ Function Command / Data Next header Pad Multiplexer ID Payload length
+
+ Bit 32-x
+ Function Raw bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Next header is used to indicate the presence of another header, currently is
+limited to checksum header.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+d. Checksum offload header v5
+
+Checksum offload header fields are in big endian format.
+
+ Bit 0 - 6 7 8-15 16-31
+ Function Header Type Next Header Checksum Valid Reserved
+
+Header Type is to indicate the type of header, this usually is set to CHECKSUM
+
+Header types
+= ==========================================
+0 Reserved
+1 Reserved
+2 checksum header
+
+Checksum Valid is to indicate whether the header checksum is valid. Value of 1
+implies that checksum is calculated on this packet and is valid, value of 0
+indicates that the calculated packet checksum is invalid.
+
+Reserved bits must be zero when sent and ignored when received.
+
+e. MAP packet v1/v5 (command specific)::
+
+ Bit 0 1 2-7 8 - 15 16 - 31
+ Function Command Reserved Pad Multiplexer ID Payload length
+ Bit 32 - 39 40 - 45 46 - 47 48 - 63
+ Function Command name Reserved Command Type Reserved
+ Bit 64 - 95
+ Function Transaction ID
+ Bit 96 - 127
+ Function Command data
+
+Command 1 indicates disabling flow while 2 is enabling flow
+
+Command types
+
+= ==========================================
+0 for MAP command request
+1 is to acknowledge the receipt of a command
+2 is for unsupported commands
+3 is for error during processing of commands
+= ==========================================
+
+f. Aggregation
+
+Aggregation is multiple MAP packets (can be data or command) delivered to
+rmnet in a single linear skb. rmnet will process the individual
+packets and either ACK the MAP command or deliver the IP packet to the
+network stack as needed
+
+MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
+
+MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
+
+3. Userspace configuration
+==========================
+
+rmnet userspace configuration is done through netlink using iproute2
+https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/
+
+The driver uses rtnl_link_ops for communication.
diff --git a/Documentation/networking/device_drivers/ethernet/3com/3c509.rst b/Documentation/networking/device_drivers/ethernet/3com/3c509.rst
new file mode 100644
index 0000000000..47f706bacd
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/3com/3c509.rst
@@ -0,0 +1,249 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================================================
+Linux and the 3Com EtherLink III Series Ethercards (driver v1.18c and higher)
+=============================================================================
+
+This file contains the instructions and caveats for v1.18c and higher versions
+of the 3c509 driver. You should not use the driver without reading this file.
+
+release 1.0
+
+28 February 2002
+
+Current maintainer (corrections to):
+ David Ruggiero <jdr@farfalle.com>
+
+Introduction
+============
+
+The following are notes and information on using the 3Com EtherLink III series
+ethercards in Linux. These cards are commonly known by the most widely-used
+card's 3Com model number, 3c509. They are all 10mb/s ISA-bus cards and shouldn't
+be (but sometimes are) confused with the similarly-numbered PCI-bus "3c905"
+(aka "Vortex" or "Boomerang") series. Kernel support for the 3c509 family is
+provided by the module 3c509.c, which has code to support all of the following
+models:
+
+ - 3c509 (original ISA card)
+ - 3c509B (later revision of the ISA card; supports full-duplex)
+ - 3c589 (PCMCIA)
+ - 3c589B (later revision of the 3c589; supports full-duplex)
+ - 3c579 (EISA)
+
+Large portions of this documentation were heavily borrowed from the guide
+written the original author of the 3c509 driver, Donald Becker. The master
+copy of that document, which contains notes on older versions of the driver,
+currently resides on Scyld web server: http://www.scyld.com/.
+
+
+Special Driver Features
+=======================
+
+Overriding card settings
+
+The driver allows boot- or load-time overriding of the card's detected IOADDR,
+IRQ, and transceiver settings, although this capability shouldn't generally be
+needed except to enable full-duplex mode (see below). An example of the syntax
+for LILO parameters for doing this::
+
+ ether=10,0x310,3,0x3c509,eth0
+
+This configures the first found 3c509 card for IRQ 10, base I/O 0x310, and
+transceiver type 3 (10base2). The flag "0x3c509" must be set to avoid conflicts
+with other card types when overriding the I/O address. When the driver is
+loaded as a module, only the IRQ may be overridden. For example,
+setting two cards to IRQ10 and IRQ11 is done by using the irq module
+option::
+
+ options 3c509 irq=10,11
+
+
+Full-duplex mode
+================
+
+The v1.18c driver added support for the 3c509B's full-duplex capabilities.
+In order to enable and successfully use full-duplex mode, three conditions
+must be met:
+
+(a) You must have a Etherlink III card model whose hardware supports full-
+duplex operations. Currently, the only members of the 3c509 family that are
+positively known to support full-duplex are the 3c509B (ISA bus) and 3c589B
+(PCMCIA) cards. Cards without the "B" model designation do *not* support
+full-duplex mode; these include the original 3c509 (no "B"), the original
+3c589, the 3c529 (MCA bus), and the 3c579 (EISA bus).
+
+(b) You must be using your card's 10baseT transceiver (i.e., the RJ-45
+connector), not its AUI (thick-net) or 10base2 (thin-net/coax) interfaces.
+AUI and 10base2 network cabling is physically incapable of full-duplex
+operation.
+
+(c) Most importantly, your 3c509B must be connected to a link partner that is
+itself full-duplex capable. This is almost certainly one of two things: a full-
+duplex-capable Ethernet switch (*not* a hub), or a full-duplex-capable NIC on
+another system that's connected directly to the 3c509B via a crossover cable.
+
+Full-duplex mode can be enabled using 'ethtool'.
+
+.. warning::
+
+ Extremely important caution concerning full-duplex mode
+
+ Understand that the 3c509B's hardware's full-duplex support is much more
+ limited than that provide by more modern network interface cards. Although
+ at the physical layer of the network it fully supports full-duplex operation,
+ the card was designed before the current Ethernet auto-negotiation (N-way)
+ spec was written. This means that the 3c509B family ***cannot and will not
+ auto-negotiate a full-duplex connection with its link partner under any
+ circumstances, no matter how it is initialized***. If the full-duplex mode
+ of the 3c509B is enabled, its link partner will very likely need to be
+ independently _forced_ into full-duplex mode as well; otherwise various nasty
+ failures will occur - at the very least, you'll see massive numbers of packet
+ collisions. This is one of very rare circumstances where disabling auto-
+ negotiation and forcing the duplex mode of a network interface card or switch
+ would ever be necessary or desirable.
+
+
+Available Transceiver Types
+===========================
+
+For versions of the driver v1.18c and above, the available transceiver types are:
+
+== =========================================================================
+0 transceiver type from EEPROM config (normally 10baseT); force half-duplex
+1 AUI (thick-net / DB15 connector)
+2 (undefined)
+3 10base2 (thin-net == coax / BNC connector)
+4 10baseT (RJ-45 connector); force half-duplex mode
+8 transceiver type and duplex mode taken from card's EEPROM config settings
+12 10baseT (RJ-45 connector); force full-duplex mode
+== =========================================================================
+
+Prior to driver version 1.18c, only transceiver codes 0-4 were supported. Note
+that the new transceiver codes 8 and 12 are the *only* ones that will enable
+full-duplex mode, no matter what the card's detected EEPROM settings might be.
+This insured that merely upgrading the driver from an earlier version would
+never automatically enable full-duplex mode in an existing installation;
+it must always be explicitly enabled via one of these code in order to be
+activated.
+
+The transceiver type can be changed using 'ethtool'.
+
+
+Interpretation of error messages and common problems
+----------------------------------------------------
+
+Error Messages
+^^^^^^^^^^^^^^
+
+eth0: Infinite loop in interrupt, status 2011.
+These are "mostly harmless" message indicating that the driver had too much
+work during that interrupt cycle. With a status of 0x2011 you are receiving
+packets faster than they can be removed from the card. This should be rare
+or impossible in normal operation. Possible causes of this error report are:
+
+ - a "green" mode enabled that slows the processor down when there is no
+ keyboard activity.
+
+ - some other device or device driver hogging the bus or disabling interrupts.
+ Check /proc/interrupts for excessive interrupt counts. The timer tick
+ interrupt should always be incrementing faster than the others.
+
+No received packets
+^^^^^^^^^^^^^^^^^^^
+
+If a 3c509, 3c562 or 3c589 can successfully transmit packets, but never
+receives packets (as reported by /proc/net/dev or 'ifconfig') you likely
+have an interrupt line problem. Check /proc/interrupts to verify that the
+card is actually generating interrupts. If the interrupt count is not
+increasing you likely have a physical conflict with two devices trying to
+use the same ISA IRQ line. The common conflict is with a sound card on IRQ10
+or IRQ5, and the easiest solution is to move the 3c509 to a different
+interrupt line. If the device is receiving packets but 'ping' doesn't work,
+you have a routing problem.
+
+Tx Carrier Errors Reported in /proc/net/dev
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+If an EtherLink III appears to transmit packets, but the "Tx carrier errors"
+field in /proc/net/dev increments as quickly as the Tx packet count, you
+likely have an unterminated network or the incorrect media transceiver selected.
+
+3c509B card is not detected on machines with an ISA PnP BIOS.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+While the updated driver works with most PnP BIOS programs, it does not work
+with all. This can be fixed by disabling PnP support using the 3Com-supplied
+setup program.
+
+3c509 card is not detected on overclocked machines
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Increase the delay time in id_read_eeprom() from the current value, 500,
+to an absurdly high value, such as 5000.
+
+
+Decoding Status and Error Messages
+----------------------------------
+
+
+The bits in the main status register are:
+
+===== ======================================
+value description
+===== ======================================
+0x01 Interrupt latch
+0x02 Tx overrun, or Rx underrun
+0x04 Tx complete
+0x08 Tx FIFO room available
+0x10 A complete Rx packet has arrived
+0x20 A Rx packet has started to arrive
+0x40 The driver has requested an interrupt
+0x80 Statistics counter nearly full
+===== ======================================
+
+The bits in the transmit (Tx) status word are:
+
+===== ============================================
+value description
+===== ============================================
+0x02 Out-of-window collision.
+0x04 Status stack overflow (normally impossible).
+0x08 16 collisions.
+0x10 Tx underrun (not enough PCI bus bandwidth).
+0x20 Tx jabber.
+0x40 Tx interrupt requested.
+0x80 Status is valid (this should always be set).
+===== ============================================
+
+
+When a transmit error occurs the driver produces a status message such as::
+
+ eth0: Transmit error, Tx status register 82
+
+The two values typically seen here are:
+
+0x82
+^^^^
+
+Out of window collision. This typically occurs when some other Ethernet
+host is incorrectly set to full duplex on a half duplex network.
+
+0x88
+^^^^
+
+16 collisions. This typically occurs when the network is exceptionally busy
+or when another host doesn't correctly back off after a collision. If this
+error is mixed with 0x82 errors it is the result of a host incorrectly set
+to full duplex (see above).
+
+Both of these errors are the result of network problems that should be
+corrected. They do not represent driver malfunction.
+
+
+Revision history (this file)
+============================
+
+28Feb02 v1.0 DR New; major portions based on Becker original 3c509 docs
+
diff --git a/Documentation/networking/device_drivers/ethernet/3com/vortex.rst b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst
new file mode 100644
index 0000000000..a060f84c4f
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst
@@ -0,0 +1,459 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+3Com Vortex device driver
+=========================
+
+Andrew Morton
+
+30 April 2000
+
+
+This document describes the usage and errata of the 3Com "Vortex" device
+driver for Linux, 3c59x.c.
+
+The driver was written by Donald Becker <becker@scyld.com>
+
+Don is no longer the prime maintainer of this version of the driver.
+Please report problems to one or more of:
+
+- Andrew Morton
+- Netdev mailing list <netdev@vger.kernel.org>
+- Linux kernel mailing list <linux-kernel@vger.kernel.org>
+
+Please note the 'Reporting and Diagnosing Problems' section at the end
+of this file.
+
+
+Since kernel 2.3.99-pre6, this driver incorporates the support for the
+3c575-series Cardbus cards which used to be handled by 3c575_cb.c.
+
+This driver supports the following hardware:
+
+ - 3c590 Vortex 10Mbps
+ - 3c592 EISA 10Mbps Demon/Vortex
+ - 3c597 EISA Fast Demon/Vortex
+ - 3c595 Vortex 100baseTx
+ - 3c595 Vortex 100baseT4
+ - 3c595 Vortex 100base-MII
+ - 3c900 Boomerang 10baseT
+ - 3c900 Boomerang 10Mbps Combo
+ - 3c900 Cyclone 10Mbps TPO
+ - 3c900 Cyclone 10Mbps Combo
+ - 3c900 Cyclone 10Mbps TPC
+ - 3c900B-FL Cyclone 10base-FL
+ - 3c905 Boomerang 100baseTx
+ - 3c905 Boomerang 100baseT4
+ - 3c905B Cyclone 100baseTx
+ - 3c905B Cyclone 10/100/BNC
+ - 3c905B-FX Cyclone 100baseFx
+ - 3c905C Tornado
+ - 3c920B-EMB-WNM (ATI Radeon 9100 IGP)
+ - 3c980 Cyclone
+ - 3c980C Python-T
+ - 3cSOHO100-TX Hurricane
+ - 3c555 Laptop Hurricane
+ - 3c556 Laptop Tornado
+ - 3c556B Laptop Hurricane
+ - 3c575 [Megahertz] 10/100 LAN CardBus
+ - 3c575 Boomerang CardBus
+ - 3CCFE575BT Cyclone CardBus
+ - 3CCFE575CT Tornado CardBus
+ - 3CCFE656 Cyclone CardBus
+ - 3CCFEM656B Cyclone+Winmodem CardBus
+ - 3CXFEM656C Tornado+Winmodem CardBus
+ - 3c450 HomePNA Tornado
+ - 3c920 Tornado
+ - 3c982 Hydra Dual Port A
+ - 3c982 Hydra Dual Port B
+ - 3c905B-T4
+ - 3c920B-EMB-WNM Tornado
+
+Module parameters
+=================
+
+There are several parameters which may be provided to the driver when
+its module is loaded. These are usually placed in ``/etc/modprobe.d/*.conf``
+configuration files. Example::
+
+ options 3c59x debug=3 rx_copybreak=300
+
+If you are using the PCMCIA tools (cardmgr) then the options may be
+placed in /etc/pcmcia/config.opts::
+
+ module "3c59x" opts "debug=3 rx_copybreak=300"
+
+
+The supported parameters are:
+
+debug=N
+
+ Where N is a number from 0 to 7. Anything above 3 produces a lot
+ of output in your system logs. debug=1 is default.
+
+options=N1,N2,N3,...
+
+ Each number in the list provides an option to the corresponding
+ network card. So if you have two 3c905's and you wish to provide
+ them with option 0x204 you would use::
+
+ options=0x204,0x204
+
+ The individual options are composed of a number of bitfields which
+ have the following meanings:
+
+ Possible media type settings
+
+ == =================================
+ 0 10baseT
+ 1 10Mbs AUI
+ 2 undefined
+ 3 10base2 (BNC)
+ 4 100base-TX
+ 5 100base-FX
+ 6 MII (Media Independent Interface)
+ 7 Use default setting from EEPROM
+ 8 Autonegotiate
+ 9 External MII
+ 10 Use default setting from EEPROM
+ == =================================
+
+ When generating a value for the 'options' setting, the above media
+ selection values may be OR'ed (or added to) the following:
+
+ ====== =============================================
+ 0x8000 Set driver debugging level to 7
+ 0x4000 Set driver debugging level to 2
+ 0x0400 Enable Wake-on-LAN
+ 0x0200 Force full duplex mode.
+ 0x0010 Bus-master enable bit (Old Vortex cards only)
+ ====== =============================================
+
+ For example::
+
+ insmod 3c59x options=0x204
+
+ will force full-duplex 100base-TX, rather than allowing the usual
+ autonegotiation.
+
+global_options=N
+
+ Sets the ``options`` parameter for all 3c59x NICs in the machine.
+ Entries in the ``options`` array above will override any setting of
+ this.
+
+full_duplex=N1,N2,N3...
+
+ Similar to bit 9 of 'options'. Forces the corresponding card into
+ full-duplex mode. Please use this in preference to the ``options``
+ parameter.
+
+ In fact, please don't use this at all! You're better off getting
+ autonegotiation working properly.
+
+global_full_duplex=N1
+
+ Sets full duplex mode for all 3c59x NICs in the machine. Entries
+ in the ``full_duplex`` array above will override any setting of this.
+
+flow_ctrl=N1,N2,N3...
+
+ Use 802.3x MAC-layer flow control. The 3com cards only support the
+ PAUSE command, which means that they will stop sending packets for a
+ short period if they receive a PAUSE frame from the link partner.
+
+ The driver only allows flow control on a link which is operating in
+ full duplex mode.
+
+ This feature does not appear to work on the 3c905 - only 3c905B and
+ 3c905C have been tested.
+
+ The 3com cards appear to only respond to PAUSE frames which are
+ sent to the reserved destination address of 01:80:c2:00:00:01. They
+ do not honour PAUSE frames which are sent to the station MAC address.
+
+rx_copybreak=M
+
+ The driver preallocates 32 full-sized (1536 byte) network buffers
+ for receiving. When a packet arrives, the driver has to decide
+ whether to leave the packet in its full-sized buffer, or to allocate
+ a smaller buffer and copy the packet across into it.
+
+ This is a speed/space tradeoff.
+
+ The value of rx_copybreak is used to decide when to make the copy.
+ If the packet size is less than rx_copybreak, the packet is copied.
+ The default value for rx_copybreak is 200 bytes.
+
+max_interrupt_work=N
+
+ The driver's interrupt service routine can handle many receive and
+ transmit packets in a single invocation. It does this in a loop.
+ The value of max_interrupt_work governs how many times the interrupt
+ service routine will loop. The default value is 32 loops. If this
+ is exceeded the interrupt service routine gives up and generates a
+ warning message "eth0: Too much work in interrupt".
+
+hw_checksums=N1,N2,N3,...
+
+ Recent 3com NICs are able to generate IPv4, TCP and UDP checksums
+ in hardware. Linux has used the Rx checksumming for a long time.
+ The "zero copy" patch which is planned for the 2.4 kernel series
+ allows you to make use of the NIC's DMA scatter/gather and transmit
+ checksumming as well.
+
+ The driver is set up so that, when the zerocopy patch is applied,
+ all Tornado and Cyclone devices will use S/G and Tx checksums.
+
+ This module parameter has been provided so you can override this
+ decision. If you think that Tx checksums are causing a problem, you
+ may disable the feature with ``hw_checksums=0``.
+
+ If you think your NIC should be performing Tx checksumming and the
+ driver isn't enabling it, you can force the use of hardware Tx
+ checksumming with ``hw_checksums=1``.
+
+ The driver drops a message in the logfiles to indicate whether or
+ not it is using hardware scatter/gather and hardware Tx checksums.
+
+ Scatter/gather and hardware checksums provide considerable
+ performance improvement for the sendfile() system call, but a small
+ decrease in throughput for send(). There is no effect upon receive
+ efficiency.
+
+compaq_ioaddr=N,
+compaq_irq=N,
+compaq_device_id=N
+
+ "Variables to work-around the Compaq PCI BIOS32 problem"....
+
+watchdog=N
+
+ Sets the time duration (in milliseconds) after which the kernel
+ decides that the transmitter has become stuck and needs to be reset.
+ This is mainly for debugging purposes, although it may be advantageous
+ to increase this value on LANs which have very high collision rates.
+ The default value is 5000 (5.0 seconds).
+
+enable_wol=N1,N2,N3,...
+
+ Enable Wake-on-LAN support for the relevant interface. Donald
+ Becker's ``ether-wake`` application may be used to wake suspended
+ machines.
+
+ Also enables the NIC's power management support.
+
+global_enable_wol=N
+
+ Sets enable_wol mode for all 3c59x NICs in the machine. Entries in
+ the ``enable_wol`` array above will override any setting of this.
+
+Media selection
+---------------
+
+A number of the older NICs such as the 3c590 and 3c900 series have
+10base2 and AUI interfaces.
+
+Prior to January, 2001 this driver would autoselect the 10base2 or AUI
+port if it didn't detect activity on the 10baseT port. It would then
+get stuck on the 10base2 port and a driver reload was necessary to
+switch back to 10baseT. This behaviour could not be prevented with a
+module option override.
+
+Later (current) versions of the driver _do_ support locking of the
+media type. So if you load the driver module with
+
+ modprobe 3c59x options=0
+
+it will permanently select the 10baseT port. Automatic selection of
+other media types does not occur.
+
+
+Transmit error, Tx status register 82
+-------------------------------------
+
+This is a common error which is almost always caused by another host on
+the same network being in full-duplex mode, while this host is in
+half-duplex mode. You need to find that other host and make it run in
+half-duplex mode or fix this host to run in full-duplex mode.
+
+As a last resort, you can force the 3c59x driver into full-duplex mode
+with
+
+ options 3c59x full_duplex=1
+
+but this has to be viewed as a workaround for broken network gear and
+should only really be used for equipment which cannot autonegotiate.
+
+
+Additional resources
+--------------------
+
+Details of the device driver implementation are at the top of the source file.
+
+Additional documentation is available at Don Becker's Linux Drivers site:
+
+ http://www.scyld.com/vortex.html
+
+Donald Becker's driver development site:
+
+ http://www.scyld.com/network.html
+
+Donald's vortex-diag program is useful for inspecting the NIC's state:
+
+ http://www.scyld.com/ethercard_diag.html
+
+Donald's mii-diag program may be used for inspecting and manipulating
+the NIC's Media Independent Interface subsystem:
+
+ http://www.scyld.com/ethercard_diag.html#mii-diag
+
+Donald's wake-on-LAN page:
+
+ http://www.scyld.com/wakeonlan.html
+
+3Com's DOS-based application for setting up the NICs EEPROMs:
+
+ ftp://ftp.3com.com/pub/nic/3c90x/3c90xx2.exe
+
+
+Autonegotiation notes
+---------------------
+
+ The driver uses a one-minute heartbeat for adapting to changes in
+ the external LAN environment if link is up and 5 seconds if link is down.
+ This means that when, for example, a machine is unplugged from a hubbed
+ 10baseT LAN plugged into a switched 100baseT LAN, the throughput
+ will be quite dreadful for up to sixty seconds. Be patient.
+
+ Cisco interoperability note from Walter Wong <wcw+@CMU.EDU>:
+
+ On a side note, adding HAS_NWAY seems to share a problem with the
+ Cisco 6509 switch. Specifically, you need to change the spanning
+ tree parameter for the port the machine is plugged into to 'portfast'
+ mode. Otherwise, the negotiation fails. This has been an issue
+ we've noticed for a while but haven't had the time to track down.
+
+ Cisco switches (Jeff Busch <jbusch@deja.com>)
+
+ My "standard config" for ports to which PC's/servers connect directly::
+
+ interface FastEthernet0/N
+ description machinename
+ load-interval 30
+ spanning-tree portfast
+
+ If autonegotiation is a problem, you may need to specify "speed
+ 100" and "duplex full" as well (or "speed 10" and "duplex half").
+
+ WARNING: DO NOT hook up hubs/switches/bridges to these
+ specially-configured ports! The switch will become very confused.
+
+
+Reporting and diagnosing problems
+---------------------------------
+
+Maintainers find that accurate and complete problem reports are
+invaluable in resolving driver problems. We are frequently not able to
+reproduce problems and must rely on your patience and efforts to get to
+the bottom of the problem.
+
+If you believe you have a driver problem here are some of the
+steps you should take:
+
+- Is it really a driver problem?
+
+ Eliminate some variables: try different cards, different
+ computers, different cables, different ports on the switch/hub,
+ different versions of the kernel or of the driver, etc.
+
+- OK, it's a driver problem.
+
+ You need to generate a report. Typically this is an email to the
+ maintainer and/or netdev@vger.kernel.org. The maintainer's
+ email address will be in the driver source or in the MAINTAINERS file.
+
+- The contents of your report will vary a lot depending upon the
+ problem. If it's a kernel crash then you should refer to
+ 'Documentation/admin-guide/reporting-issues.rst'.
+
+ But for most problems it is useful to provide the following:
+
+ - Kernel version, driver version
+
+ - A copy of the banner message which the driver generates when
+ it is initialised. For example:
+
+ eth0: 3Com PCI 3c905C Tornado at 0xa400, 00:50:da:6a:88:f0, IRQ 19
+ 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
+ MII transceiver found at address 24, status 782d.
+ Enabling bus-master transmits and whole-frame receives.
+
+ NOTE: You must provide the ``debug=2`` modprobe option to generate
+ a full detection message. Please do this::
+
+ modprobe 3c59x debug=2
+
+ - If it is a PCI device, the relevant output from 'lspci -vx', eg::
+
+ 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
+ Subsystem: 3Com Corporation: Unknown device 9200
+ Flags: bus master, medium devsel, latency 32, IRQ 19
+ I/O ports at a400 [size=128]
+ Memory at db000000 (32-bit, non-prefetchable) [size=128]
+ Expansion ROM at <unassigned> [disabled] [size=128K]
+ Capabilities: [dc] Power Management version 2
+ 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00
+ 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00
+ 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10
+ 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a
+
+ - A description of the environment: 10baseT? 100baseT?
+ full/half duplex? switched or hubbed?
+
+ - Any additional module parameters which you may be providing to the driver.
+
+ - Any kernel logs which are produced. The more the merrier.
+ If this is a large file and you are sending your report to a
+ mailing list, mention that you have the logfile, but don't send
+ it. If you're reporting direct to the maintainer then just send
+ it.
+
+ To ensure that all kernel logs are available, add the
+ following line to /etc/syslog.conf::
+
+ kern.* /var/log/messages
+
+ Then restart syslogd with::
+
+ /etc/rc.d/init.d/syslog restart
+
+ (The above may vary, depending upon which Linux distribution you use).
+
+ - If your problem is reproducible then that's great. Try the
+ following:
+
+ 1) Increase the debug level. Usually this is done via:
+
+ a) modprobe driver debug=7
+ b) In /etc/modprobe.d/driver.conf:
+ options driver debug=7
+
+ 2) Recreate the problem with the higher debug level,
+ send all logs to the maintainer.
+
+ 3) Download you card's diagnostic tool from Donald
+ Becker's website <http://www.scyld.com/ethercard_diag.html>.
+ Download mii-diag.c as well. Build these.
+
+ a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is
+ working correctly. Save the output.
+
+ b) Run the above commands when the card is malfunctioning. Send
+ both sets of output.
+
+Finally, please be patient and be prepared to do some work. You may
+end up working on this problem for a week or more as the maintainer
+asks more questions, asks for more tests, asks for patches to be
+applied, etc. At the end of it all, the problem may even remain
+unresolved.
diff --git a/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst b/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst
new file mode 100644
index 0000000000..7a7040072e
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst
@@ -0,0 +1,286 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+=======================================
+Altera Triple-Speed Ethernet MAC driver
+=======================================
+
+Copyright |copy| 2008-2014 Altera Corporation
+
+This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers
+using the SGDMA and MSGDMA soft DMA IP components. The driver uses the
+platform bus to obtain component resources. The designs used to test this
+driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board,
+and tested with ARM and NIOS processor hosts separately. The anticipated use
+cases are simple communications between an embedded system and an external peer
+for status and simple configuration of the embedded system.
+
+For more information visit www.altera.com and www.rocketboards.org. Support
+forums for the driver may be found on www.rocketboards.org, and a design used
+to test this driver may be found there as well. Support is also available from
+the maintainer of this driver, found in MAINTAINERS.
+
+The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP
+components that can be assembled and built into an FPGA using the Altera
+Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that
+this driver was tested against. The sopc2dts tool is used to create the
+device tree for the driver, and may be found at rocketboards.org.
+
+The driver probe function examines the device tree and determines if the
+Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The
+probe function then installs the appropriate set of DMA routines to
+initialize, setup transmits, receives, and interrupt handling primitives for
+the respective configurations.
+
+The SGDMA component is to be deprecated in the near future (over the next 1-2
+years as of this writing in early 2014) in favor of the MSGDMA component.
+SGDMA support is included for existing designs and reference in case a
+developer wishes to support their own soft DMA logic and driver support. Any
+new designs should not use the SGDMA.
+
+The SGDMA supports only a single transmit or receive operation at a time, and
+therefore will not perform as well compared to the MSGDMA soft IP. Please
+visit www.altera.com for known, documented SGDMA errata.
+
+Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time.
+Scatter-gather DMA will be added to a future maintenance update to this
+driver.
+
+Jumbo frames are not supported at this time.
+
+The driver limits PHY operations to 10/100Mbps, and has not yet been fully
+tested for 1Gbps. This support will be added in a future maintenance update.
+
+1. Kernel Configuration
+=======================
+
+The kernel configuration option is ALTERA_TSE:
+
+ Device Drivers ---> Network device support ---> Ethernet driver support --->
+ Altera Triple-Speed Ethernet MAC support (ALTERA_TSE)
+
+2. Driver parameters list
+=========================
+
+ - debug: message level (0: no output, 16: all);
+ - dma_rx_num: Number of descriptors in the RX list (default is 64);
+ - dma_tx_num: Number of descriptors in the TX list (default is 64).
+
+3. Command line options
+=======================
+
+Driver parameters can be also passed in command line by using::
+
+ altera_tse=dma_rx_num:128,dma_tx_num:512
+
+4. Driver information and notes
+===============================
+
+4.1. Transmit process
+---------------------
+When the driver's transmit routine is called by the kernel, it sets up a
+transmit descriptor by calling the underlying DMA transmit routine (SGDMA or
+MSGDMA), and initiates a transmit operation. Once the transmit is complete, an
+interrupt is driven by the transmit DMA logic. The driver handles the transmit
+completion in the context of the interrupt handling chain by recycling
+resource required to send and track the requested transmit operation.
+
+4.2. Receive process
+--------------------
+The driver will post receive buffers to the receive DMA logic during driver
+initialization. Receive buffers may or may not be queued depending upon the
+underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able
+to queue receive buffers to the SGDMA receive logic). When a packet is
+received, the DMA logic generates an interrupt. The driver handles a receive
+interrupt by obtaining the DMA receive logic status, reaping receive
+completions until no more receive completions are available.
+
+4.3. Interrupt Mitigation
+-------------------------
+The driver is able to mitigate the number of its DMA interrupts
+using NAPI for receive operations. Interrupt mitigation is not yet supported
+for transmit operations, but will be added in a future maintenance release.
+
+4.4) Ethtool support
+--------------------
+Ethtool is supported. Driver statistics and internal errors can be taken using:
+ethtool -S ethX command. It is possible to dump registers etc.
+
+4.5) PHY Support
+----------------
+The driver is compatible with PAL to work with PHY and GPHY devices.
+
+4.7) List of source files:
+--------------------------
+ - Kconfig
+ - Makefile
+ - altera_tse_main.c: main network device driver
+ - altera_tse_ethtool.c: ethtool support
+ - altera_tse.h: private driver structure and common definitions
+ - altera_msgdma.h: MSGDMA implementation function definitions
+ - altera_sgdma.h: SGDMA implementation function definitions
+ - altera_msgdma.c: MSGDMA implementation
+ - altera_sgdma.c: SGDMA implementation
+ - altera_sgdmahw.h: SGDMA register and descriptor definitions
+ - altera_msgdmahw.h: MSGDMA register and descriptor definitions
+ - altera_utils.c: Driver utility functions
+ - altera_utils.h: Driver utility function definitions
+
+5. Debug Information
+====================
+
+The driver exports debug information such as internal statistics,
+debug information, MAC and DMA registers etc.
+
+A user may use the ethtool support to get statistics:
+e.g. using: ethtool -S ethX (that shows the statistics counters)
+or sees the MAC registers: e.g. using: ethtool -d ethX
+
+The developer can also use the "debug" module parameter to get
+further debug information.
+
+6. Statistics Support
+=====================
+
+The controller and driver support a mix of IEEE standard defined statistics,
+RFC defined statistics, and driver or Altera defined statistics. The four
+specifications containing the standard definitions for these statistics are
+as follows:
+
+ - IEEE 802.3-2012 - IEEE Standard for Ethernet.
+ - RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt.
+ - RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt.
+ - Altera Triple Speed Ethernet User Guide, found at http://www.altera.com
+
+The statistics supported by the TSE and the device driver are as follows:
+
+"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.2. This statistics is the count of frames that are successfully
+transmitted.
+
+"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.5. This statistic is the count of frames that are successfully
+received. This count does not include any error packets such as CRC errors,
+length errors, or alignment errors.
+
+"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE
+802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are
+an integral number of bytes in length and do not pass the CRC test as the frame
+is received.
+
+"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012,
+Section 5.2.2.1.7. This statistic is the count of frames that are not an
+integral number of bytes in length and do not pass the CRC test as the frame is
+received.
+
+"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.8. This statistic is the count of data and pad bytes
+successfully transmitted from the interface.
+
+"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.14. This statistic is the count of data and pad bytes
+successfully received by the controller.
+
+"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE
+802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames
+transmitted from the network controller.
+
+"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE
+802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames
+received by the network controller.
+
+"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is
+a count of the number of packets received containing errors that prevented the
+packet from being delivered to a higher level protocol.
+
+"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic
+is a count of the number of packets that could not be transmitted due to errors.
+
+"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were not addressed
+to the broadcast address or a multicast group.
+
+"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+a multicast address group.
+
+"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+the broadcast address.
+
+"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This
+statistic is the number of outbound packets not transmitted even though an
+error was not detected. An example of a reason this might occur is to free up
+internal buffer space.
+
+"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were not addressed to
+a multicast group or broadcast address.
+
+"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+multicast group.
+
+"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+broadcast address.
+
+"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819.
+This statistic counts the number of packets dropped due to lack of internal
+controller resources.
+
+"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819.
+This statistic counts the total number of bytes received by the controller,
+including error and discarded packets.
+
+"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819.
+This statistic counts the total number of packets received by the controller,
+including error, discarded, unicast, multicast, and broadcast packets.
+
+"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets received less
+than 64 bytes long.
+
+"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets greater than 1518
+bytes long.
+
+"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819.
+This statistic counts the total number of packets received that were 64 octets
+in length.
+
+"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC
+2819. This statistic counts the total number of packets received that were
+between 65 and 127 octets in length inclusive.
+
+"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 128 and 255 octets in length inclusive.
+
+"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 256 and 511 octets in length inclusive.
+
+"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 512 and 1023 octets in length inclusive.
+
+"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define
+in RFC 2819. This statistic is the total number of packets received that were
+between 1024 and 1518 octets in length inclusive.
+
+"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the
+Altera TSE. This statistics counts the number of received good and errored
+frames between the length of 1519 and the maximum frame length configured
+in the frm_length register. See the Altera TSE User Guide for More details.
+
+"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This
+statistic is the total number of packets received that were longer than 1518
+octets, and had either a bad CRC with an integral number of octets (CRC Error)
+or a bad CRC with a non-integral number of octets (Alignment Error).
+
+"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This
+statistic is the total number of packets received that were less than 64 octets
+in length and had either a bad CRC with an integral number of octets (CRC
+error) or a bad CRC with a non-integral number of octets (Alignment Error).
diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
new file mode 100644
index 0000000000..5eaa3ab6c7
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -0,0 +1,351 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================
+Linux kernel driver for Elastic Network Adapter (ENA) family
+============================================================
+
+Overview
+========
+
+ENA is a networking interface designed to make good use of modern CPU
+features and system architectures.
+
+The ENA device exposes a lightweight management interface with a
+minimal set of memory mapped registers and extendible command set
+through an Admin Queue.
+
+The driver supports a range of ENA devices, is link-speed independent
+(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc), and has
+a negotiated and extendible feature set.
+
+Some ENA devices support SR-IOV. This driver is used for both the
+SR-IOV Physical Function (PF) and Virtual Function (VF) devices.
+
+ENA devices enable high speed and low overhead network traffic
+processing by providing multiple Tx/Rx queue pairs (the maximum number
+is advertised by the device via the Admin Queue), a dedicated MSI-X
+interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
+and CPU cacheline optimized data placement.
+
+The ENA driver supports industry standard TCP/IP offload features such as
+checksum offload. Receive-side scaling (RSS) is supported for multi-core
+scaling.
+
+The ENA driver and its corresponding devices implement health
+monitoring mechanisms such as watchdog, enabling the device and driver
+to recover in a manner transparent to the application, as well as
+debug logs.
+
+Some of the ENA devices support a working mode called Low-latency
+Queue (LLQ), which saves several more microseconds.
+
+ENA Source Code Directory Structure
+===================================
+
+================= ======================================================
+ena_com.[ch] Management communication layer. This layer is
+ responsible for the handling all the management
+ (admin) communication between the device and the
+ driver.
+ena_eth_com.[ch] Tx/Rx data path.
+ena_admin_defs.h Definition of ENA management interface.
+ena_eth_io_defs.h Definition of ENA data path interface.
+ena_common_defs.h Common definitions for ena_com layer.
+ena_regs_defs.h Definition of ENA PCI memory-mapped (MMIO) registers.
+ena_netdev.[ch] Main Linux kernel driver.
+ena_ethtool.c ethtool callbacks.
+ena_pci_id_tbl.h Supported device IDs.
+================= ======================================================
+
+Management Interface:
+=====================
+
+ENA management interface is exposed by means of:
+
+- PCIe Configuration Space
+- Device Registers
+- Admin Queue (AQ) and Admin Completion Queue (ACQ)
+- Asynchronous Event Notification Queue (AENQ)
+
+ENA device MMIO Registers are accessed only during driver
+initialization and are not used during further normal device
+operation.
+
+AQ is used for submitting management commands, and the
+results/responses are reported asynchronously through ACQ.
+
+ENA introduces a small set of management commands with room for
+vendor-specific extensions. Most of the management operations are
+framed in a generic Get/Set feature command.
+
+The following admin queue commands are supported:
+
+- Create I/O submission queue
+- Create I/O completion queue
+- Destroy I/O submission queue
+- Destroy I/O completion queue
+- Get feature
+- Set feature
+- Configure AENQ
+- Get statistics
+
+Refer to ena_admin_defs.h for the list of supported Get/Set Feature
+properties.
+
+The Asynchronous Event Notification Queue (AENQ) is a uni-directional
+queue used by the ENA device to send to the driver events that cannot
+be reported using ACQ. AENQ events are subdivided into groups. Each
+group may have multiple syndromes, as shown below
+
+The events are:
+
+==================== ===============
+Group Syndrome
+==================== ===============
+Link state change **X**
+Fatal error **X**
+Notification Suspend traffic
+Notification Resume traffic
+Keep-Alive **X**
+==================== ===============
+
+ACQ and AENQ share the same MSI-X vector.
+
+Keep-Alive is a special mechanism that allows monitoring the device's health.
+A Keep-Alive event is delivered by the device every second.
+The driver maintains a watchdog (WD) handler which logs the current state and
+statistics. If the keep-alive events aren't delivered as expected the WD resets
+the device and the driver.
+
+Data Path Interface
+===================
+
+I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
+SQ correspondingly). Each SQ has a completion queue (CQ) associated
+with it.
+
+The SQs and CQs are implemented as descriptor rings in contiguous
+physical memory.
+
+The ENA driver supports two Queue Operation modes for Tx SQs:
+
+- **Regular mode:**
+ In this mode the Tx SQs reside in the host's memory. The ENA
+ device fetches the ENA Tx descriptors and packet data from host
+ memory.
+
+- **Low Latency Queue (LLQ) mode or "push-mode":**
+ In this mode the driver pushes the transmit descriptors and the
+ first 96 bytes of the packet directly to the ENA device memory
+ space. The rest of the packet payload is fetched by the
+ device. For this operation mode, the driver uses a dedicated PCI
+ device memory BAR, which is mapped with write-combine capability.
+
+ **Note that** not all ENA devices support LLQ, and this feature is negotiated
+ with the device upon initialization. If the ENA device does not
+ support LLQ mode, the driver falls back to the regular mode.
+
+The Rx SQs support only the regular mode.
+
+The driver supports multi-queue for both Tx and Rx. This has various
+benefits:
+
+- Reduced CPU/thread/process contention on a given Ethernet interface.
+- Cache miss rate on completion is reduced, particularly for data
+ cache lines that hold the sk_buff structures.
+- Increased process-level parallelism when handling received packets.
+- Increased data cache hit rate, by steering kernel processing of
+ packets to the CPU, where the application thread consuming the
+ packet is running.
+- In hardware interrupt re-direction.
+
+Interrupt Modes
+===============
+
+The driver assigns a single MSI-X vector per queue pair (for both Tx
+and Rx directions). The driver assigns an additional dedicated MSI-X vector
+for management (for ACQ and AENQ).
+
+Management interrupt registration is performed when the Linux kernel
+probes the adapter, and it is de-registered when the adapter is
+removed. I/O queue interrupt registration is performed when the Linux
+interface of the adapter is opened, and it is de-registered when the
+interface is closed.
+
+The management interrupt is named::
+
+ ena-mgmnt@pci:<PCI domain:bus:slot.function>
+
+and for each queue pair, an interrupt is named::
+
+ <interface name>-Tx-Rx-<queue index>
+
+The ENA device operates in auto-mask and auto-clear interrupt
+modes. That is, once MSI-X is delivered to the host, its Cause bit is
+automatically cleared and the interrupt is masked. The interrupt is
+unmasked by the driver after NAPI processing is complete.
+
+Interrupt Moderation
+====================
+
+ENA driver and device can operate in conventional or adaptive interrupt
+moderation mode.
+
+**In conventional mode** the driver instructs device to postpone interrupt
+posting according to static interrupt delay value. The interrupt delay
+value can be configured through `ethtool(8)`. The following `ethtool`
+parameters are supported by the driver: ``tx-usecs``, ``rx-usecs``
+
+**In adaptive interrupt** moderation mode the interrupt delay value is
+updated by the driver dynamically and adjusted every NAPI cycle
+according to the traffic nature.
+
+Adaptive coalescing can be switched on/off through `ethtool(8)`'s
+:code:`adaptive_rx on|off` parameter.
+
+More information about Adaptive Interrupt Moderation (DIM) can be found in
+Documentation/networking/net_dim.rst
+
+.. _`RX copybreak`:
+
+RX copybreak
+============
+The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
+and can be configured by the ETHTOOL_STUNABLE command of the
+SIOCETHTOOL ioctl.
+
+Statistics
+==========
+
+The user can obtain ENA device and driver statistics using `ethtool`.
+The driver can collect regular or extended statistics (including
+per-queue stats) from the device.
+
+In addition the driver logs the stats to syslog upon device reset.
+
+MTU
+===
+
+The driver supports an arbitrarily large MTU with a maximum that is
+negotiated with the device. The driver configures MTU using the
+SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
+via `ip(8)` and similar legacy tools.
+
+Stateless Offloads
+==================
+
+The ENA driver supports:
+
+- IPv4 header checksum offload
+- TCP/UDP over IPv4/IPv6 checksum offloads
+
+RSS
+===
+
+- The ENA device supports RSS that allows flexible Rx traffic
+ steering.
+- Toeplitz and CRC32 hash functions are supported.
+- Different combinations of L2/L3/L4 fields can be configured as
+ inputs for hash functions.
+- The driver configures RSS settings using the AQ SetFeature command
+ (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
+ ENA_ADMIN_RSS_INDIRECTION_TABLE_CONFIG properties).
+- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
+ function delivered in the Rx CQ descriptor is set in the received
+ SKB.
+- The user can provide a hash key, hash function, and configure the
+ indirection table through `ethtool(8)`.
+
+DATA PATH
+=========
+
+Tx
+--
+
+:code:`ena_start_xmit()` is called by the stack. This function does the following:
+
+- Maps data buffers (``skb->data`` and frags).
+- Populates ``ena_buf`` for the push buffer (if the driver and device are
+ in push mode).
+- Prepares ENA bufs for the remaining frags.
+- Allocates a new request ID from the empty ``req_id`` ring. The request
+ ID is the index of the packet in the Tx info. This is used for
+ out-of-order Tx completions.
+- Adds the packet to the proper place in the Tx ring.
+- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer that converts
+ the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors as
+ needed).
+
+ * This function also copies the ENA descriptors and the push buffer
+ to the Device memory space (if in push mode).
+
+- Writes a doorbell to the ENA device.
+- When the ENA device finishes sending the packet, a completion
+ interrupt is raised.
+- The interrupt handler schedules NAPI.
+- The :code:`ena_clean_tx_irq()` function is called. This function handles the
+ completion descriptors generated by the ENA, with a single
+ completion descriptor per completed packet.
+
+ * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
+ the packet is retrieved via the ``req_id``. The data buffers are
+ unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
+ * The function stops when the completion descriptors are completed or
+ the budget is reached.
+
+Rx
+--
+
+- When a packet is received from the ENA device.
+- The interrupt handler schedules NAPI.
+- The :code:`ena_clean_rx_irq()` function is called. This function calls
+ :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
+ number of descriptors used for a new packet, and zero if
+ no new packet is found.
+- :code:`ena_rx_skb()` checks packet length:
+
+ * If the packet is small (len < rx_copybreak), the driver allocates
+ a SKB for the new packet, and copies the packet payload into the
+ SKB data buffer.
+
+ - In this way the original data buffer is not passed to the stack
+ and is reused for future Rx packets.
+
+ * Otherwise the function unmaps the Rx buffer, sets the first
+ descriptor as `skb`'s linear part and the other descriptors as the
+ `skb`'s frags.
+
+- The new SKB is updated with the necessary information (protocol,
+ checksum hw verify result, etc), and then passed to the network
+ stack, using the NAPI interface function :code:`napi_gro_receive()`.
+
+Dynamic RX Buffers (DRB)
+------------------------
+
+Each RX descriptor in the RX ring is a single memory page (which is either 4KB
+or 16KB long depending on system's configurations).
+To reduce the memory allocations required when dealing with a high rate of small
+packets, the driver tries to reuse the remaining RX descriptor's space if more
+than 2KB of this page remain unused.
+
+A simple example of this mechanism is the following sequence of events:
+
+::
+
+ 1. Driver allocates page-sized RX buffer and passes it to hardware
+ +----------------------+
+ |4KB RX Buffer |
+ +----------------------+
+
+ 2. A 300Bytes packet is received on this buffer
+
+ 3. The driver increases the ref count on this page and returns it back to
+ HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes
+ +----+--------------------+
+ |****|3796 Bytes RX Buffer|
+ +----+--------------------+
+
+This mechanism isn't used when an XDP program is loaded, or when the
+RX packet is less than rx_copybreak bytes (in which case the packet is
+copied out of the RX buffer into the linear part of a new skb allocated
+for it and the RX buffer remains the same size, see `RX copybreak`_).
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst
new file mode 100644
index 0000000000..9e8a16c441
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+========================================================
+Linux Driver for the AMD/Pensando(R) DSC adapter family
+========================================================
+
+Copyright(c) 2023 Advanced Micro Devices, Inc
+
+Identifying the Adapter
+=======================
+
+To find if one or more AMD/Pensando PCI Core devices are installed on the
+host, check for the PCI devices::
+
+ # lspci -d 1dd8:100c
+ b5:00.0 Processing accelerators: Pensando Systems Device 100c
+ b6:00.0 Processing accelerators: Pensando Systems Device 100c
+
+If such devices are listed as above, then the pds_core.ko driver should find
+and configure them for use. There should be log entries in the kernel
+messages such as these::
+
+ $ dmesg | grep pds_core
+ pds_core 0000:b5:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+ pds_core 0000:b5:00.0: FW: 1.60.0-73
+ pds_core 0000:b6:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+ pds_core 0000:b6:00.0: FW: 1.60.0-73
+
+Driver and firmware version information can be gathered with devlink::
+
+ $ devlink dev info pci/0000:b5:00.0
+ pci/0000:b5:00.0:
+ driver pds_core
+ serial_number FLM18420073
+ versions:
+ fixed:
+ asic.id 0x0
+ asic.rev 0x0
+ running:
+ fw 1.51.0-73
+ stored:
+ fw.goldfw 1.15.9-C-22
+ fw.mainfwa 1.60.0-73
+ fw.mainfwb 1.60.0-57
+
+Info versions
+=============
+
+The ``pds_core`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of firmware running on the device
+ * - ``fw.goldfw``
+ - stored
+ - Version of firmware stored in the goldfw slot
+ * - ``fw.mainfwa``
+ - stored
+ - Version of firmware stored in the mainfwa slot
+ * - ``fw.mainfwb``
+ - stored
+ - Version of firmware stored in the mainfwb slot
+ * - ``asic.id``
+ - fixed
+ - The ASIC type for this device
+ * - ``asic.rev``
+ - fixed
+ - The revision of the ASIC for this device
+
+Parameters
+==========
+
+The ``pds_core`` driver implements the following generic
+parameters for controlling the functionality to be made available
+as auxiliary_bus devices.
+
+.. list-table:: Generic parameters implemented
+ :widths: 5 5 8 82
+
+ * - Name
+ - Mode
+ - Type
+ - Description
+ * - ``enable_vnet``
+ - runtime
+ - Boolean
+ - Enables vDPA functionality through an auxiliary_bus device
+
+Firmware Management
+===================
+
+The ``flash`` command can update a the DSC firmware. The downloaded firmware
+will be saved into either of firmware bank 1 or bank 2, whichever is not
+currently in use, and that bank will used for the next boot::
+
+ # devlink dev flash pci/0000:b5:00.0 \
+ file pensando/dsc_fw_1.63.0-22.tar
+
+Health Reporters
+================
+
+The driver supports a devlink health reporter for FW status::
+
+ # devlink health show pci/0000:2b:00.0 reporter fw
+ pci/0000:2b:00.0:
+ reporter fw
+ state healthy error 0 recover 0
+ # devlink health diagnose pci/0000:2b:00.0 reporter fw
+ Status: healthy State: 1 Generation: 0 Recoveries: 0
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet driver support
+ -> AMD devices
+ -> AMD/Pensando Ethernet PDS_CORE Support
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by AMD/Pensando personnel::
+
+ netdev@vger.kernel.org
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst
new file mode 100644
index 0000000000..587927d3de
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst
@@ -0,0 +1,85 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. note: can be edited and viewed with /usr/bin/formiko-vim
+
+==========================================================
+PCI vDPA driver for the AMD/Pensando(R) DSC adapter family
+==========================================================
+
+AMD/Pensando vDPA VF Device Driver
+
+Copyright(c) 2023 Advanced Micro Devices, Inc
+
+Overview
+========
+
+The ``pds_vdpa`` driver is an auxiliary bus driver that supplies
+a vDPA device for use by the virtio network stack. It is used with
+the Pensando Virtual Function devices that offer vDPA and virtio queue
+services. It depends on the ``pds_core`` driver and hardware for the PF
+and VF PCI handling as well as for device configuration services.
+
+Using the device
+================
+
+The ``pds_vdpa`` device is enabled via multiple configuration steps and
+depends on the ``pds_core`` driver to create and enable SR-IOV Virtual
+Function devices. After the VFs are enabled, we enable the vDPA service
+in the ``pds_core`` device to create the auxiliary devices used by pds_vdpa.
+
+Example steps:
+
+.. code-block:: bash
+
+ #!/bin/bash
+
+ modprobe pds_core
+ modprobe vdpa
+ modprobe pds_vdpa
+
+ PF_BDF=`ls /sys/module/pds_core/drivers/pci\:pds_core/*/sriov_numvfs | awk -F / '{print $7}'`
+
+ # Enable vDPA VF auxiliary device(s) in the PF
+ devlink dev param set pci/$PF_BDF name enable_vnet cmode runtime value true
+
+ # Create a VF for vDPA use
+ echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
+
+ # Find the vDPA services/devices available
+ PDS_VDPA_MGMT=`vdpa mgmtdev show | grep vDPA | head -1 | cut -d: -f1`
+
+ # Create a vDPA device for use in virtio network configurations
+ vdpa dev add name vdpa1 mgmtdev $PDS_VDPA_MGMT mac 00:11:22:33:44:55
+
+ # Set up an ethernet interface on the vdpa device
+ modprobe virtio_vdpa
+
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet driver support
+ -> Pensando devices
+ -> Pensando Ethernet PDS_VDPA Support
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+ netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+ drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
new file mode 100644
index 0000000000..7a6bc848a2
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
@@ -0,0 +1,79 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. note: can be edited and viewed with /usr/bin/formiko-vim
+
+==========================================================
+PCI VFIO driver for the AMD/Pensando(R) DSC adapter family
+==========================================================
+
+AMD/Pensando Linux VFIO PCI Device Driver
+Copyright(c) 2023 Advanced Micro Devices, Inc.
+
+Overview
+========
+
+The ``pds-vfio-pci`` module is a PCI driver that supports Live Migration
+capable Virtual Function (VF) devices in the DSC hardware.
+
+Using the device
+================
+
+The pds-vfio-pci device is enabled via multiple configuration steps and
+depends on the ``pds_core`` driver to create and enable SR-IOV Virtual
+Function devices.
+
+Shown below are the steps to bind the driver to a VF and also to the
+associated auxiliary device created by the ``pds_core`` driver. This
+example assumes the pds_core and pds-vfio-pci modules are already
+loaded.
+
+.. code-block:: bash
+ :name: example-setup-script
+
+ #!/bin/bash
+
+ PF_BUS="0000:60"
+ PF_BDF="0000:60:00.0"
+ VF_BDF="0000:60:00.1"
+
+ # Prevent non-vfio VF driver from probing the VF device
+ echo 0 > /sys/class/pci_bus/$PF_BUS/device/$PF_BDF/sriov_drivers_autoprobe
+
+ # Create single VF for Live Migration via pds_core
+ echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
+
+ # Allow the VF to be bound to the pds-vfio-pci driver
+ echo "pds-vfio-pci" > /sys/class/pci_bus/$PF_BUS/device/$VF_BDF/driver_override
+
+ # Bind the VF to the pds-vfio-pci driver
+ echo "$VF_BDF" > /sys/bus/pci/drivers/pds-vfio-pci/bind
+
+After performing the steps above, a file in /dev/vfio/<iommu_group>
+should have been created.
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> VFIO Non-Privileged userspace driver framework
+ -> VFIO support for PDS PCI devices
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+ netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+ drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst
new file mode 100644
index 0000000000..099280a261
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst
@@ -0,0 +1,556 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============================
+Marvell(Aquantia) AQtion Driver
+===============================
+
+For the aQuantia Multi-Gigabit PCI Express Family of Ethernet Adapters
+
+.. Contents
+
+ - Identifying Your Adapter
+ - Configuration
+ - Supported ethtool options
+ - Command Line Parameters
+ - Config file parameters
+ - Support
+ - License
+
+Identifying Your Adapter
+========================
+
+The driver in this release is compatible with AQC-100, AQC-107, AQC-108
+based ethernet adapters.
+
+
+SFP+ Devices (for AQC-100 based adapters)
+-----------------------------------------
+
+This release tested with passive Direct Attach Cables (DAC) and SFP+/LC
+Optical Transceiver.
+
+Configuration
+=============
+
+Viewing Link Messages
+---------------------
+ Link messages will not be displayed to the console if the distribution is
+ restricting system messages. In order to see network driver link messages on
+ your console, set dmesg to eight by entering the following::
+
+ dmesg -n 8
+
+ .. note::
+
+ This setting is not saved across reboots.
+
+Jumbo Frames
+------------
+ The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
+ enabled by changing the MTU to a value larger than the default of 1500.
+ The maximum value for the MTU is 16000. Use the `ip` command to
+ increase the MTU size. For example::
+
+ ip link set mtu 16000 dev enp1s0
+
+ethtool
+-------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The latest
+ ethtool version is required for this functionality.
+
+NAPI
+----
+ NAPI (Rx polling mode) is supported in the atlantic driver.
+
+Supported ethtool options
+=========================
+
+Viewing adapter settings
+------------------------
+
+ ::
+
+ ethtool <ethX>
+
+ Output example::
+
+ Settings for enp1s0:
+ Supported ports: [ TP ]
+ Supported link modes: 100baseT/Full
+ 1000baseT/Full
+ 10000baseT/Full
+ 2500baseT/Full
+ 5000baseT/Full
+ Supported pause frame use: Symmetric
+ Supports auto-negotiation: Yes
+ Supported FEC modes: Not reported
+ Advertised link modes: 100baseT/Full
+ 1000baseT/Full
+ 10000baseT/Full
+ 2500baseT/Full
+ 5000baseT/Full
+ Advertised pause frame use: Symmetric
+ Advertised auto-negotiation: Yes
+ Advertised FEC modes: Not reported
+ Speed: 10000Mb/s
+ Duplex: Full
+ Port: Twisted Pair
+ PHYAD: 0
+ Transceiver: internal
+ Auto-negotiation: on
+ MDI-X: Unknown
+ Supports Wake-on: g
+ Wake-on: d
+ Link detected: yes
+
+
+ .. note::
+
+ AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10.
+ But you can still use these speeds::
+
+ ethtool -s eth0 autoneg off speed 2500
+
+Viewing adapter information
+---------------------------
+
+ ::
+
+ ethtool -i <ethX>
+
+ Output example::
+
+ driver: atlantic
+ version: 5.2.0-050200rc5-generic-kern
+ firmware-version: 3.1.78
+ expansion-rom-version:
+ bus-info: 0000:01:00.0
+ supports-statistics: yes
+ supports-test: no
+ supports-eeprom-access: no
+ supports-register-dump: yes
+ supports-priv-flags: no
+
+
+Viewing Ethernet adapter statistics
+-----------------------------------
+
+ ::
+
+ ethtool -S <ethX>
+
+ Output example::
+
+ NIC statistics:
+ InPackets: 13238607
+ InUCast: 13293852
+ InMCast: 52
+ InBCast: 3
+ InErrors: 0
+ OutPackets: 23703019
+ OutUCast: 23704941
+ OutMCast: 67
+ OutBCast: 11
+ InUCastOctects: 213182760
+ OutUCastOctects: 22698443
+ InMCastOctects: 6600
+ OutMCastOctects: 8776
+ InBCastOctects: 192
+ OutBCastOctects: 704
+ InOctects: 2131839552
+ OutOctects: 226938073
+ InPacketsDma: 95532300
+ OutPacketsDma: 59503397
+ InOctetsDma: 1137102462
+ OutOctetsDma: 2394339518
+ InDroppedDma: 0
+ Queue[0] InPackets: 23567131
+ Queue[0] OutPackets: 20070028
+ Queue[0] InJumboPackets: 0
+ Queue[0] InLroPackets: 0
+ Queue[0] InErrors: 0
+ Queue[1] InPackets: 45428967
+ Queue[1] OutPackets: 11306178
+ Queue[1] InJumboPackets: 0
+ Queue[1] InLroPackets: 0
+ Queue[1] InErrors: 0
+ Queue[2] InPackets: 3187011
+ Queue[2] OutPackets: 13080381
+ Queue[2] InJumboPackets: 0
+ Queue[2] InLroPackets: 0
+ Queue[2] InErrors: 0
+ Queue[3] InPackets: 23349136
+ Queue[3] OutPackets: 15046810
+ Queue[3] InJumboPackets: 0
+ Queue[3] InLroPackets: 0
+ Queue[3] InErrors: 0
+
+Interrupt coalescing support
+----------------------------
+
+ ITR mode, TX/RX coalescing timings could be viewed with::
+
+ ethtool -c <ethX>
+
+ and changed with::
+
+ ethtool -C <ethX> tx-usecs <usecs> rx-usecs <usecs>
+
+ To disable coalescing::
+
+ ethtool -C <ethX> tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1
+
+Wake on LAN support
+-------------------
+
+ WOL support by magic packet::
+
+ ethtool -s <ethX> wol g
+
+ To disable WOL::
+
+ ethtool -s <ethX> wol d
+
+Set and check the driver message level
+--------------------------------------
+
+ Set message level
+
+ ::
+
+ ethtool -s <ethX> msglvl <level>
+
+ Level values:
+
+ ====== =============================
+ 0x0001 general driver status.
+ 0x0002 hardware probing.
+ 0x0004 link state.
+ 0x0008 periodic status check.
+ 0x0010 interface being brought down.
+ 0x0020 interface being brought up.
+ 0x0040 receive error.
+ 0x0080 transmit error.
+ 0x0200 interrupt handling.
+ 0x0400 transmit completion.
+ 0x0800 receive completion.
+ 0x1000 packet contents.
+ 0x2000 hardware status.
+ 0x4000 Wake-on-LAN status.
+ ====== =============================
+
+ By default, the level of debugging messages is set 0x0001(general driver status).
+
+ Check message level
+
+ ::
+
+ ethtool <ethX> | grep "Current message level"
+
+ If you want to disable the output of messages::
+
+ ethtool -s <ethX> msglvl 0
+
+RX flow rules (ntuple filters)
+------------------------------
+
+ There are separate rules supported, that applies in that order:
+
+ 1. 16 VLAN ID rules
+ 2. 16 L2 EtherType rules
+ 3. 8 L3/L4 5-Tuple rules
+
+
+ The driver utilizes the ethtool interface for configuring ntuple filters,
+ via ``ethtool -N <device> <filter>``.
+
+ To enable or disable the RX flow rules::
+
+ ethtool -K ethX ntuple <on|off>
+
+ When disabling ntuple filters, all the user programmed filters are
+ flushed from the driver cache and hardware. All needed filters must
+ be re-added when ntuple is re-enabled.
+
+ Because of the fixed order of the rules, the location of filters is also fixed:
+
+ - Locations 0 - 15 for VLAN ID filters
+ - Locations 16 - 31 for L2 EtherType filters
+ - Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6)
+
+ The L3/L4 5-tuple (protocol, source and destination IP address, source and
+ destination TCP/UDP/SCTP port) is compared against 8 filters. For IPv4, up to
+ 8 source and destination addresses can be matched. For IPv6, up to 2 pairs of
+ addresses can be supported. Source and destination ports are only compared for
+ TCP/UDP/SCTP packets.
+
+ To add a filter that directs packet to queue 5, use
+ ``<-N|-U|--config-nfc|--config-ntuple>`` switch::
+
+ ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 <loc 32>
+
+ - action is the queue number.
+ - loc is the rule number.
+
+ For ``flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6`` you must set the loc
+ number within 32 - 39.
+ For ``flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6`` you can set 8 rules
+ for traffic IPv4 or you can set 2 rules for traffic IPv6. Loc number traffic
+ IPv6 is 32 and 36.
+ At the moment you can not use IPv4 and IPv6 filters at the same time.
+
+ Example filter for IPv6 filter traffic::
+
+ sudo ethtool -N <ethX> flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32
+ sudo ethtool -N <ethX> flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36
+
+ Example filter for IPv4 filter traffic::
+
+ sudo ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32
+ sudo ethtool -N <ethX> flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33
+ sudo ethtool -N <ethX> flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34
+
+ If you set action -1, then all traffic corresponding to the filter will be discarded.
+
+ The maximum value action is 31.
+
+
+ The VLAN filter (VLAN id) is compared against 16 filters.
+ VLAN id must be accompanied by mask 0xF000. That is to distinguish VLAN filter
+ from L2 Ethertype filter with UserPriority since both User Priority and VLAN ID
+ are passed in the same 'vlan' parameter.
+
+ To add a filter that directs packets from VLAN 2001 to queue 5::
+
+ ethtool -N <ethX> flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0
+
+
+ L2 EtherType filters allows filter packet by EtherType field or both EtherType
+ and User Priority (PCP) field of 802.1Q.
+ UserPriority (vlan) parameter must be accompanied by mask 0x1FFF. That is to
+ distinguish VLAN filter from L2 Ethertype filter with UserPriority since both
+ User Priority and VLAN ID are passed in the same 'vlan' parameter.
+
+ To add a filter that directs IP4 packess of priority 3 to queue 3::
+
+ ethtool -N <ethX> flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16
+
+ To see the list of filters currently present::
+
+ ethtool <-u|-n|--show-nfc|--show-ntuple> <ethX>
+
+ Rules may be deleted from the table itself. This is done using::
+
+ sudo ethtool <-N|-U|--config-nfc|--config-ntuple> <ethX> delete <loc>
+
+ - loc is the rule number to be deleted.
+
+ Rx filters is an interface to load the filter table that funnels all flow
+ into queue 0 unless an alternative queue is specified using "action". In that
+ case, any flow that matches the filter criteria will be directed to the
+ appropriate queue. RX filters is supported on all kernels 2.6.30 and later.
+
+RSS for UDP
+-----------
+
+ Currently, NIC does not support RSS for fragmented IP packets, which leads to
+ incorrect working of RSS for fragmented UDP traffic. To disable RSS for UDP the
+ RX Flow L3/L4 rule may be used.
+
+ Example::
+
+ ethtool -N eth0 flow-type udp4 action 0 loc 32
+
+UDP GSO hardware offload
+------------------------
+
+ UDP GSO allows to boost UDP tx rates by offloading UDP headers allocation
+ into hardware. A special userspace socket option is required for this,
+ could be validated with /kernel/tools/testing/selftests/net/::
+
+ udpgso_bench_tx -u -4 -D 10.0.1.1 -s 6300 -S 100
+
+ Will cause sending out of 100 byte sized UDP packets formed from single
+ 6300 bytes user buffer.
+
+ UDP GSO is configured by::
+
+ ethtool -K eth0 tx-udp-segmentation on
+
+Private flags (testing)
+-----------------------
+
+ Atlantic driver supports private flags for hardware custom features::
+
+ $ ethtool --show-priv-flags ethX
+
+ Private flags for ethX:
+ DMASystemLoopback : off
+ PKTSystemLoopback : off
+ DMANetworkLoopback : off
+ PHYInternalLoopback: off
+ PHYExternalLoopback: off
+
+ Example::
+
+ $ ethtool --set-priv-flags ethX DMASystemLoopback on
+
+ DMASystemLoopback: DMA Host loopback.
+ PKTSystemLoopback: Packet buffer host loopback.
+ DMANetworkLoopback: Network side loopback on DMA block.
+ PHYInternalLoopback: Internal loopback on Phy.
+ PHYExternalLoopback: External loopback on Phy (with loopback ethernet cable).
+
+
+Command Line Parameters
+=======================
+The following command line parameters are available on atlantic driver:
+
+aq_itr -Interrupt throttling mode
+---------------------------------
+Accepted values: 0, 1, 0xFFFF
+
+Default value: 0xFFFF
+
+====== ==============================================================
+0 Disable interrupt throttling.
+1 Enable interrupt throttling and use specified tx and rx rates.
+0xFFFF Auto throttling mode. Driver will choose the best RX and TX
+ interrupt throttling settings based on link speed.
+====== ==============================================================
+
+aq_itr_tx - TX interrupt throttle rate
+--------------------------------------
+
+Accepted values: 0 - 0x1FF
+
+Default value: 0
+
+TX side throttling in microseconds. Adapter will setup maximum interrupt delay
+to this value. Minimum interrupt delay will be a half of this value
+
+aq_itr_rx - RX interrupt throttle rate
+--------------------------------------
+
+Accepted values: 0 - 0x1FF
+
+Default value: 0
+
+RX side throttling in microseconds. Adapter will setup maximum interrupt delay
+to this value. Minimum interrupt delay will be a half of this value
+
+.. note::
+
+ ITR settings could be changed in runtime by ethtool -c means (see below)
+
+Config file parameters
+======================
+
+For some fine tuning and performance optimizations,
+some parameters can be changed in the {source_dir}/aq_cfg.h file.
+
+AQ_CFG_RX_PAGEORDER
+-------------------
+
+Default value: 0
+
+RX page order override. That's a power of 2 number of RX pages allocated for
+each descriptor. Received descriptor size is still limited by
+AQ_CFG_RX_FRAME_MAX.
+
+Increasing pageorder makes page reuse better (actual on iommu enabled systems).
+
+AQ_CFG_RX_REFILL_THRES
+----------------------
+
+Default value: 32
+
+RX refill threshold. RX path will not refill freed descriptors until the
+specified number of free descriptors is observed. Larger values may help
+better page reuse but may lead to packet drops as well.
+
+AQ_CFG_VECS_DEF
+---------------
+
+Number of queues
+
+Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX)
+
+Default value: 8
+
+Notice this value will be capped by the number of cores available on the system.
+
+AQ_CFG_IS_RSS_DEF
+-----------------
+
+Enable/disable Receive Side Scaling
+
+This feature allows the adapter to distribute receive processing
+across multiple CPU-cores and to prevent from overloading a single CPU core.
+
+Valid values
+
+== ========
+0 disabled
+1 enabled
+== ========
+
+Default value: 1
+
+AQ_CFG_NUM_RSS_QUEUES_DEF
+-------------------------
+
+Number of queues for Receive Side Scaling
+
+Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF)
+
+Default value: AQ_CFG_VECS_DEF
+
+AQ_CFG_IS_LRO_DEF
+-----------------
+
+Enable/disable Large Receive Offload
+
+This offload enables the adapter to coalesce multiple TCP segments and indicate
+them as a single coalesced unit to the OS networking subsystem.
+
+The system consumes less energy but it also introduces more latency in packets
+processing.
+
+Valid values
+
+== ========
+0 disabled
+1 enabled
+== ========
+
+Default value: 1
+
+AQ_CFG_TX_CLEAN_BUDGET
+----------------------
+
+Maximum descriptors to cleanup on TX at once.
+
+Default value: 256
+
+After the aq_cfg.h file changed the driver must be rebuilt to take effect.
+
+Support
+=======
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to aqn_support@marvell.com
+
+License
+=======
+
+aQuantia Corporation Network Driver
+
+Copyright |copy| 2014 - 2019 aQuantia Corporation.
+
+This program is free software; you can redistribute it and/or modify it
+under the terms and conditions of the GNU General Public License,
+version 2, as published by the Free Software Foundation.
diff --git a/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst b/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst
new file mode 100644
index 0000000000..435dce5fa2
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst
@@ -0,0 +1,393 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=============================================
+Chelsio N210 10Gb Ethernet Network Controller
+=============================================
+
+Driver Release Notes for Linux
+
+Version 2.1.1
+
+June 20, 2005
+
+.. Contents
+
+ INTRODUCTION
+ FEATURES
+ PERFORMANCE
+ DRIVER MESSAGES
+ KNOWN ISSUES
+ SUPPORT
+
+
+Introduction
+============
+
+ This document describes the Linux driver for Chelsio 10Gb Ethernet Network
+ Controller. This driver supports the Chelsio N210 NIC and is backward
+ compatible with the Chelsio N110 model 10Gb NICs.
+
+
+Features
+========
+
+Adaptive Interrupts (adaptive-rx)
+---------------------------------
+
+ This feature provides an adaptive algorithm that adjusts the interrupt
+ coalescing parameters, allowing the driver to dynamically adapt the latency
+ settings to achieve the highest performance during various types of network
+ load.
+
+ The interface used to control this feature is ethtool. Please see the
+ ethtool manpage for additional usage information.
+
+ By default, adaptive-rx is disabled.
+ To enable adaptive-rx::
+
+ ethtool -C <interface> adaptive-rx on
+
+ To disable adaptive-rx, use ethtool::
+
+ ethtool -C <interface> adaptive-rx off
+
+ After disabling adaptive-rx, the timer latency value will be set to 50us.
+ You may set the timer latency after disabling adaptive-rx::
+
+ ethtool -C <interface> rx-usecs <microseconds>
+
+ An example to set the timer latency value to 100us on eth0::
+
+ ethtool -C eth0 rx-usecs 100
+
+ You may also provide a timer latency value while disabling adaptive-rx::
+
+ ethtool -C <interface> adaptive-rx off rx-usecs <microseconds>
+
+ If adaptive-rx is disabled and a timer latency value is specified, the timer
+ will be set to the specified value until changed by the user or until
+ adaptive-rx is enabled.
+
+ To view the status of the adaptive-rx and timer latency values::
+
+ ethtool -c <interface>
+
+
+TCP Segmentation Offloading (TSO) Support
+-----------------------------------------
+
+ This feature, also known as "large send", enables a system's protocol stack
+ to offload portions of outbound TCP processing to a network interface card
+ thereby reducing system CPU utilization and enhancing performance.
+
+ The interface used to control this feature is ethtool version 1.8 or higher.
+ Please see the ethtool manpage for additional usage information.
+
+ By default, TSO is enabled.
+ To disable TSO::
+
+ ethtool -K <interface> tso off
+
+ To enable TSO::
+
+ ethtool -K <interface> tso on
+
+ To view the status of TSO::
+
+ ethtool -k <interface>
+
+
+Performance
+===========
+
+ The following information is provided as an example of how to change system
+ parameters for "performance tuning" an what value to use. You may or may not
+ want to change these system parameters, depending on your server/workstation
+ application. Doing so is not warranted in any way by Chelsio Communications,
+ and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss
+ of data or damage to equipment.
+
+ Your distribution may have a different way of doing things, or you may prefer
+ a different method. These commands are shown only to provide an example of
+ what to do and are by no means definitive.
+
+ Making any of the following system changes will only last until you reboot
+ your system. You may want to write a script that runs at boot-up which
+ includes the optimal settings for your system.
+
+ Setting PCI Latency Timer::
+
+ setpci -d 1425::
+
+* 0x0c.l=0x0000F800
+
+ Disabling TCP timestamp::
+
+ sysctl -w net.ipv4.tcp_timestamps=0
+
+ Disabling SACK::
+
+ sysctl -w net.ipv4.tcp_sack=0
+
+ Setting large number of incoming connection requests::
+
+ sysctl -w net.ipv4.tcp_max_syn_backlog=3000
+
+ Setting maximum receive socket buffer size::
+
+ sysctl -w net.core.rmem_max=1024000
+
+ Setting maximum send socket buffer size::
+
+ sysctl -w net.core.wmem_max=1024000
+
+ Set smp_affinity (on a multiprocessor system) to a single CPU::
+
+ echo 1 > /proc/irq/<interrupt_number>/smp_affinity
+
+ Setting default receive socket buffer size::
+
+ sysctl -w net.core.rmem_default=524287
+
+ Setting default send socket buffer size::
+
+ sysctl -w net.core.wmem_default=524287
+
+ Setting maximum option memory buffers::
+
+ sysctl -w net.core.optmem_max=524287
+
+ Setting maximum backlog (# of unprocessed packets before kernel drops)::
+
+ sysctl -w net.core.netdev_max_backlog=300000
+
+ Setting TCP read buffers (min/default/max)::
+
+ sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
+
+ Setting TCP write buffers (min/pressure/max)::
+
+ sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
+
+ Setting TCP buffer space (min/pressure/max)::
+
+ sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000"
+
+ TCP window size for single connections:
+
+ The receive buffer (RX_WINDOW) size must be at least as large as the
+ Bandwidth-Delay Product of the communication link between the sender and
+ receiver. Due to the variations of RTT, you may want to increase the buffer
+ size up to 2 times the Bandwidth-Delay Product. Reference page 289 of
+ "TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens.
+
+ At 10Gb speeds, use the following formula::
+
+ RX_WINDOW >= 1.25MBytes * RTT(in milliseconds)
+ Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000
+
+ RX_WINDOW sizes of 256KB - 512KB should be sufficient.
+
+ Setting the min, max, and default receive buffer (RX_WINDOW) size::
+
+ sysctl -w net.ipv4.tcp_rmem="<min> <default> <max>"
+
+ TCP window size for multiple connections:
+ The receive buffer (RX_WINDOW) size may be calculated the same as single
+ connections, but should be divided by the number of connections. The
+ smaller window prevents congestion and facilitates better pacing,
+ especially if/when MAC level flow control does not work well or when it is
+ not supported on the machine. Experimentation may be necessary to attain
+ the correct value. This method is provided as a starting point for the
+ correct receive buffer size.
+
+ Setting the min, max, and default receive buffer (RX_WINDOW) size is
+ performed in the same manner as single connection.
+
+
+Driver Messages
+===============
+
+ The following messages are the most common messages logged by syslog. These
+ may be found in /var/log/messages.
+
+ Driver up::
+
+ Chelsio Network Driver - version 2.1.1
+
+ NIC detected::
+
+ eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit
+
+ Link up::
+
+ eth#: link is up at 10 Gbps, full duplex
+
+ Link down::
+
+ eth#: link is down
+
+
+Known Issues
+============
+
+ These issues have been identified during testing. The following information
+ is provided as a workaround to the problem. In some cases, this problem is
+ inherent to Linux or to a particular Linux Distribution and/or hardware
+ platform.
+
+ 1. Large number of TCP retransmits on a multiprocessor (SMP) system.
+
+ On a system with multiple CPUs, the interrupt (IRQ) for the network
+ controller may be bound to more than one CPU. This will cause TCP
+ retransmits if the packet data were to be split across different CPUs
+ and re-assembled in a different order than expected.
+
+ To eliminate the TCP retransmits, set smp_affinity on the particular
+ interrupt to a single CPU. You can locate the interrupt (IRQ) used on
+ the N110/N210 by using ifconfig::
+
+ ifconfig <dev_name> | grep Interrupt
+
+ Set the smp_affinity to a single CPU::
+
+ echo 1 > /proc/irq/<interrupt_number>/smp_affinity
+
+ It is highly suggested that you do not run the irqbalance daemon on your
+ system, as this will change any smp_affinity setting you have applied.
+ The irqbalance daemon runs on a 10 second interval and binds interrupts
+ to the least loaded CPU determined by the daemon. To disable this daemon::
+
+ chkconfig --level 2345 irqbalance off
+
+ By default, some Linux distributions enable the kernel feature,
+ irqbalance, which performs the same function as the daemon. To disable
+ this feature, add the following line to your bootloader::
+
+ noirqbalance
+
+ Example using the Grub bootloader::
+
+ title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp)
+ root (hd0,0)
+ kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance
+ initrd /initrd-2.4.21-27.ELsmp.img
+
+ 2. After running insmod, the driver is loaded and the incorrect network
+ interface is brought up without running ifup.
+
+ When using 2.4.x kernels, including RHEL kernels, the Linux kernel
+ invokes a script named "hotplug". This script is primarily used to
+ automatically bring up USB devices when they are plugged in, however,
+ the script also attempts to automatically bring up a network interface
+ after loading the kernel module. The hotplug script does this by scanning
+ the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking
+ for HWADDR=<mac_address>.
+
+ If the hotplug script does not find the HWADDRR within any of the
+ ifcfg-eth# files, it will bring up the device with the next available
+ interface name. If this interface is already configured for a different
+ network card, your new interface will have incorrect IP address and
+ network settings.
+
+ To solve this issue, you can add the HWADDR=<mac_address> key to the
+ interface config file of your network controller.
+
+ To disable this "hotplug" feature, you may add the driver (module name)
+ to the "blacklist" file located in /etc/hotplug. It has been noted that
+ this does not work for network devices because the net.agent script
+ does not use the blacklist file. Simply remove, or rename, the net.agent
+ script located in /etc/hotplug to disable this feature.
+
+ 3. Transport Protocol (TP) hangs when running heavy multi-connection traffic
+ on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset.
+
+ If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel
+ chipset, you may experience the "133-Mhz Mode Split Completion Data
+ Corruption" bug identified by AMD while using a 133Mhz PCI-X card on the
+ bus PCI-X bus.
+
+ AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel
+ can provide stale data via split completion cycles to a PCI-X card that
+ is operating at 133 Mhz", causing data corruption.
+
+ AMD's provides three workarounds for this problem, however, Chelsio
+ recommends the first option for best performance with this bug:
+
+ For 133Mhz secondary bus operation, limit the transaction length and
+ the number of outstanding transactions, via BIOS configuration
+ programming of the PCI-X card, to the following:
+
+ Data Length (bytes): 1k
+
+ Total allowed outstanding transactions: 2
+
+ Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004,
+ section 56, "133-MHz Mode Split Completion Data Corruption" for more
+ details with this bug and workarounds suggested by AMD.
+
+ It may be possible to work outside AMD's recommended PCI-X settings, try
+ increasing the Data Length to 2k bytes for increased performance. If you
+ have issues with these settings, please revert to the "safe" settings
+ and duplicate the problem before submitting a bug or asking for support.
+
+ .. note::
+
+ The default setting on most systems is 8 outstanding transactions
+ and 2k bytes data length.
+
+ 4. On multiprocessor systems, it has been noted that an application which
+ is handling 10Gb networking can switch between CPUs causing degraded
+ and/or unstable performance.
+
+ If running on an SMP system and taking performance measurements, it
+ is suggested you either run the latest netperf-2.4.0+ or use a binding
+ tool such as Tim Hockin's procstate utilities (runon)
+ <http://www.hockin.org/~thockin/procstate/>.
+
+ Binding netserver and netperf (or other applications) to particular
+ CPUs will have a significant difference in performance measurements.
+ You may need to experiment which CPU to bind the application to in
+ order to achieve the best performance for your system.
+
+ If you are developing an application designed for 10Gb networking,
+ please keep in mind you may want to look at kernel functions
+ sched_setaffinity & sched_getaffinity to bind your application.
+
+ If you are just running user-space applications such as ftp, telnet,
+ etc., you may want to try the runon tool provided by Tim Hockin's
+ procstate utility. You could also try binding the interface to a
+ particular CPU: runon 0 ifup eth0
+
+
+Support
+=======
+
+ If you have problems with the software or hardware, please contact our
+ customer support team via email at support@chelsio.com or check our website
+ at http://www.chelsio.com
+
+-------------------------------------------------------------------------------
+
+::
+
+ Chelsio Communications
+ 370 San Aleso Ave.
+ Suite 100
+ Sunnyvale, CA 94085
+ http://www.chelsio.com
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License, version 2, as
+published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License along
+with this program; if not, write to the Free Software Foundation, Inc.,
+59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+
+THIS SOFTWARE IS PROVIDED ``AS IS`` AND WITHOUT ANY EXPRESS OR IMPLIED
+WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
+
+Copyright |copy| 2003-2005 Chelsio Communications. All rights reserved.
diff --git a/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst b/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst
new file mode 100644
index 0000000000..e5c283940a
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst
@@ -0,0 +1,647 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================
+Cirrus Logic LAN CS8900/CS8920 Ethernet Adapters
+================================================
+
+.. note::
+
+ This document was contributed by Cirrus Logic for kernel 2.2.5. This version
+ has been updated for 2.3.48 by Andrew Morton.
+
+ Still, this is too outdated! A major cleanup is needed here.
+
+Cirrus make a copy of this driver available at their website, as
+described below. In general, you should use the driver version which
+comes with your Linux distribution.
+
+
+Linux Network Interface Driver ver. 2.00 <kernel 2.3.48>
+
+
+.. TABLE OF CONTENTS
+
+ 1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
+ 1.1 Product Overview
+ 1.2 Driver Description
+ 1.2.1 Driver Name
+ 1.2.2 File in the Driver Package
+ 1.3 System Requirements
+ 1.4 Licensing Information
+
+ 2.0 ADAPTER INSTALLATION and CONFIGURATION
+ 2.1 CS8900-based Adapter Configuration
+ 2.2 CS8920-based Adapter Configuration
+
+ 3.0 LOADING THE DRIVER AS A MODULE
+
+ 4.0 COMPILING THE DRIVER
+ 4.1 Compiling the Driver as a Loadable Module
+ 4.2 Compiling the driver to support memory mode
+ 4.3 Compiling the driver to support Rx DMA
+
+ 5.0 TESTING AND TROUBLESHOOTING
+ 5.1 Known Defects and Limitations
+ 5.2 Testing the Adapter
+ 5.2.1 Diagnostic Self-Test
+ 5.2.2 Diagnostic Network Test
+ 5.3 Using the Adapter's LEDs
+ 5.4 Resolving I/O Conflicts
+
+ 6.0 TECHNICAL SUPPORT
+ 6.1 Contacting Cirrus Logic's Technical Support
+ 6.2 Information Required Before Contacting Technical Support
+ 6.3 Obtaining the Latest Driver Version
+ 6.4 Current maintainer
+ 6.5 Kernel boot parameters
+
+
+1. Cirrus Logic LAN CS8900/CS8920 Ethernet Adapters
+===================================================
+
+
+1.1. Product Overview
+=====================
+
+The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow
+IEEE 802.3 standards and support half or full-duplex operation in ISA bus
+computers on 10 Mbps Ethernet networks. The adapters are designed for operation
+in 16-bit ISA or EISA bus expansion slots and are available in
+10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5
+or fiber networks).
+
+CS8920-based adapters are similar to the CS8900-based adapter with additional
+features for Plug and Play (PnP) support and Wakeup Frame recognition. As
+such, the configuration procedures differ somewhat between the two types of
+adapters. Refer to the "Adapter Configuration" section for details on
+configuring both types of adapters.
+
+
+1.2. Driver Description
+=======================
+
+The CS8900/CS8920 Ethernet Adapter driver for Linux supports the Linux
+v2.3.48 or greater kernel. It can be compiled directly into the kernel
+or loaded at run-time as a device driver module.
+
+1.2.1 Driver Name: cs89x0
+
+1.2.2 Files in the Driver Archive:
+
+The files in the driver at Cirrus' website include:
+
+ =================== ====================================================
+ readme.txt this file
+ build batch file to compile cs89x0.c.
+ cs89x0.c driver C code
+ cs89x0.h driver header file
+ cs89x0.o pre-compiled module (for v2.2.5 kernel)
+ config/Config.in sample file to include cs89x0 driver in the kernel.
+ config/Makefile sample file to include cs89x0 driver in the kernel.
+ config/Space.c sample file to include cs89x0 driver in the kernel.
+ =================== ====================================================
+
+
+
+1.3. System Requirements
+------------------------
+
+The following hardware is required:
+
+ * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter
+
+ * IBM or IBM-compatible PC with:
+ * An 80386 or higher processor
+ * 16 bytes of contiguous IO space available between 210h - 370h
+ * One available IRQ (5,10,11,or 12 for the CS8900, 3-7,9-15 for CS8920).
+
+ * Appropriate cable (and connector for AUI, 10BASE-2) for your network
+ topology.
+
+The following software is required:
+
+* LINUX kernel version 2.3.48 or higher
+
+ * CS8900/20 Setup Utility (DOS-based)
+
+ * LINUX kernel sources for your kernel (if compiling into kernel)
+
+ * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel
+ or a module)
+
+
+
+1.4. Licensing Information
+--------------------------
+
+This program is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free Software
+Foundation, version 1.
+
+This program is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+more details.
+
+For a full copy of the GNU General Public License, write to the Free Software
+Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+
+
+2. Adapter Installation and Configuration
+=========================================
+
+Both the CS8900 and CS8920-based adapters can be configured using parameters
+stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup
+Utility if you want to change the adapter's configuration in EEPROM.
+
+When loading the driver as a module, you can specify many of the adapter's
+configuration parameters on the command-line to override the EEPROM's settings
+or for interface configuration when an EEPROM is not used. (CS8920-based
+adapters must use an EEPROM.) See Section 3.0 LOADING THE DRIVER AS A MODULE.
+
+Since the CS8900/20 Setup Utility is a DOS-based application, you must install
+and configure the adapter in a DOS-based system using the CS8900/20 Setup
+Utility before installation in the target LINUX system. (Not required if
+installing a CS8900-based adapter and the default configuration is acceptable.)
+
+
+2.1. CS8900-based Adapter Configuration
+---------------------------------------
+
+CS8900-based adapters shipped from Cirrus Logic have been configured
+with the following "default" settings::
+
+ Operation Mode: Memory Mode
+ IRQ: 10
+ Base I/O Address: 300
+ Memory Base Address: D0000
+ Optimization: DOS Client
+ Transmission Mode: Half-duplex
+ BootProm: None
+ Media Type: Autodetect (3-media cards) or
+ 10BASE-T (10BASE-T only adapter)
+
+You should only change the default configuration settings if conflicts with
+another adapter exists. To change the adapter's configuration, run the
+CS8900/20 Setup Utility.
+
+
+2.2. CS8920-based Adapter Configuration
+---------------------------------------
+
+CS8920-based adapters are shipped from Cirrus Logic configured as Plug
+and Play (PnP) enabled. However, since the cs89x0 driver does NOT
+support PnP, you must install the CS8920 adapter in a DOS-based PC and
+run the CS8900/20 Setup Utility to disable PnP and configure the
+adapter before installation in the target Linux system. Failure to do
+this will leave the adapter inactive and the driver will be unable to
+communicate with the adapter.
+
+::
+
+ ****************************************************************
+ * CS8920-BASED ADAPTERS: *
+ * *
+ * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. *
+ * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST *
+ * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND *
+ * TO ACTIVATE THE ADAPTER. *
+ ****************************************************************
+
+
+
+
+3. Loading the Driver as a Module
+=================================
+
+If the driver is compiled as a loadable module, you can load the driver module
+with the 'modprobe' command. Many of the adapter's configuration parameters can
+be specified as command-line arguments to the load command. This facility
+provides a means to override the EEPROM's settings or for interface
+configuration when an EEPROM is not used.
+
+Example::
+
+ insmod cs89x0.o io=0x200 irq=0xA media=aui
+
+This example loads the module and configures the adapter to use an IO port base
+address of 200h, interrupt 10, and use the AUI media connection. The following
+configuration options are available on the command line::
+
+ io=### - specify IO address (200h-360h)
+ irq=## - specify interrupt level
+ use_dma=1 - Enable DMA
+ dma=# - specify dma channel (Driver is compiled to support
+ Rx DMA only)
+ dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16.
+ media=rj45 - specify media type
+ or media=bnc
+ or media=aui
+ or media=auto
+ duplex=full - specify forced half/full/autonegotiate duplex
+ or duplex=half
+ or duplex=auto
+ debug=# - debug level (only available if the driver was compiled
+ for debugging)
+
+**Notes:**
+
+a) If an EEPROM is present, any specified command-line parameter
+ will override the corresponding configuration value stored in
+ EEPROM.
+
+b) The "io" parameter must be specified on the command-line.
+
+c) The driver's hardware probe routine is designed to avoid
+ writing to I/O space until it knows that there is a cs89x0
+ card at the written addresses. This could cause problems
+ with device probing. To avoid this behaviour, add one
+ to the ``io=`` module parameter. This doesn't actually change
+ the I/O address, but it is a flag to tell the driver
+ to partially initialise the hardware before trying to
+ identify the card. This could be dangerous if you are
+ not sure that there is a cs89x0 card at the provided address.
+
+ For example, to scan for an adapter located at IO base 0x300,
+ specify an IO address of 0x301.
+
+d) The "duplex=auto" parameter is only supported for the CS8920.
+
+e) The minimum command-line configuration required if an EEPROM is
+ not present is:
+
+ io
+ irq
+ media type (no autodetect)
+
+f) The following additional parameters are CS89XX defaults (values
+ used with no EEPROM or command-line argument).
+
+ * DMA Burst = enabled
+ * IOCHRDY Enabled = enabled
+ * UseSA = enabled
+ * CS8900 defaults to half-duplex if not specified on command-line
+ * CS8920 defaults to autoneg if not specified on command-line
+ * Use reset defaults for other config parameters
+ * dma_mode = 0
+
+g) You can use ifconfig to set the adapter's Ethernet address.
+
+h) Many Linux distributions use the 'modprobe' command to load
+ modules. This program uses the '/etc/conf.modules' file to
+ determine configuration information which is passed to a driver
+ module when it is loaded. All the configuration options which are
+ described above may be placed within /etc/conf.modules.
+
+ For example::
+
+ > cat /etc/conf.modules
+ ...
+ alias eth0 cs89x0
+ options cs89x0 io=0x0200 dma=5 use_dma=1
+ ...
+
+ In this example we are telling the module system that the
+ ethernet driver for this machine should use the cs89x0 driver. We
+ are asking 'modprobe' to pass the 'io', 'dma' and 'use_dma'
+ arguments to the driver when it is loaded.
+
+i) Cirrus recommend that the cs89x0 use the ISA DMA channels 5, 6 or
+ 7. You will probably find that other DMA channels will not work.
+
+j) The cs89x0 supports DMA for receiving only. DMA mode is
+ significantly more efficient. Flooding a 400 MHz Celeron machine
+ with large ping packets consumes 82% of its CPU capacity in non-DMA
+ mode. With DMA this is reduced to 45%.
+
+k) If your Linux kernel was compiled with inbuilt plug-and-play
+ support you will be able to find information about the cs89x0 card
+ with the command::
+
+ cat /proc/isapnp
+
+l) If during DMA operation you find erratic behavior or network data
+ corruption you should use your PC's BIOS to slow the EISA bus clock.
+
+m) If the cs89x0 driver is compiled directly into the kernel
+ (non-modular) then its I/O address is automatically determined by
+ ISA bus probing. The IRQ number, media options, etc are determined
+ from the card's EEPROM.
+
+n) If the cs89x0 driver is compiled directly into the kernel, DMA
+ mode may be selected by providing the kernel with a boot option
+ 'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or 7).
+
+ Kernel boot options may be provided on the LILO command line::
+
+ LILO boot: linux cs89x0_dma=5
+
+ or they may be placed in /etc/lilo.conf::
+
+ image=/boot/bzImage-2.3.48
+ append="cs89x0_dma=5"
+ label=linux
+ root=/dev/hda5
+ read-only
+
+ The DMA Rx buffer size is hardwired to 16 kbytes in this mode.
+ (64k mode is not available).
+
+
+4. Compiling the Driver
+=======================
+
+The cs89x0 driver can be compiled directly into the kernel or compiled into
+a loadable device driver module.
+
+Just use the standard way to configure the driver and compile the Kernel.
+
+
+4.1. Compiling the Driver to Support Rx DMA
+-------------------------------------------
+
+The compile-time optionality for DMA was removed in the 2.3 kernel
+series. DMA support is now unconditionally part of the driver. It is
+enabled by the 'use_dma=1' module option.
+
+
+5. Testing and Troubleshooting
+==============================
+
+5.1. Known Defects and Limitations
+----------------------------------
+
+Refer to the RELEASE.TXT file distributed as part of this archive for a list of
+known defects, driver limitations, and work arounds.
+
+
+5.2. Testing the Adapter
+------------------------
+
+Once the adapter has been installed and configured, the diagnostic option of
+the CS8900/20 Setup Utility can be used to test the functionality of the
+adapter and its network connection. Use the diagnostics 'Self Test' option to
+test the functionality of the adapter with the hardware configuration you have
+assigned. You can use the diagnostics 'Network Test' to test the ability of the
+adapter to communicate across the Ethernet with another PC equipped with a
+CS8900/20-based adapter card (it must also be running the CS8900/20 Setup
+Utility).
+
+.. note::
+
+ The Setup Utility's diagnostics are designed to run in a
+ DOS-only operating system environment. DO NOT run the diagnostics
+ from a DOS or command prompt session under Windows 95, Windows NT,
+ OS/2, or other operating system.
+
+To run the diagnostics tests on the CS8900/20 adapter:
+
+ 1. Boot DOS on the PC and start the CS8900/20 Setup Utility.
+
+ 2. The adapter's current configuration is displayed. Hit the ENTER key to
+ get to the main menu.
+
+ 4. Select 'Diagnostics' (ALT-G) from the main menu.
+ * Select 'Self-Test' to test the adapter's basic functionality.
+ * Select 'Network Test' to test the network connection and cabling.
+
+
+5.2.1. Diagnostic Self-test
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The diagnostic self-test checks the adapter's basic functionality as well as
+its ability to communicate across the ISA bus based on the system resources
+assigned during hardware configuration. The following tests are performed:
+
+ * IO Register Read/Write Test
+
+ The IO Register Read/Write test insures that the CS8900/20 can be
+ accessed in IO mode, and that the IO base address is correct.
+
+ * Shared Memory Test
+
+ The Shared Memory test insures the CS8900/20 can be accessed in memory
+ mode and that the range of memory addresses assigned does not conflict
+ with other devices in the system.
+
+ * Interrupt Test
+
+ The Interrupt test insures there are no conflicts with the assigned IRQ
+ signal.
+
+ * EEPROM Test
+
+ The EEPROM test insures the EEPROM can be read.
+
+ * Chip RAM Test
+
+ The Chip RAM test insures the 4K of memory internal to the CS8900/20 is
+ working properly.
+
+ * Internal Loop-back Test
+
+ The Internal Loop Back test insures the adapter's transmitter and
+ receiver are operating properly. If this test fails, make sure the
+ adapter's cable is connected to the network (check for LED activity for
+ example).
+
+ * Boot PROM Test
+
+ The Boot PROM test insures the Boot PROM is present, and can be read.
+ Failure indicates the Boot PROM was not successfully read due to a
+ hardware problem or due to a conflicts on the Boot PROM address
+ assignment. (Test only applies if the adapter is configured to use the
+ Boot PROM option.)
+
+Failure of a test item indicates a possible system resource conflict with
+another device on the ISA bus. In this case, you should use the Manual Setup
+option to reconfigure the adapter by selecting a different value for the system
+resource that failed.
+
+
+5.2.2. Diagnostic Network Test
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The Diagnostic Network Test verifies a working network connection by
+transferring data between two CS8900/20 adapters installed in different PCs
+on the same network. (Note: the diagnostic network test should not be run
+between two nodes across a router.)
+
+This test requires that each of the two PCs have a CS8900/20-based adapter
+installed and have the CS8900/20 Setup Utility running. The first PC is
+configured as a Responder and the other PC is configured as an Initiator.
+Once the Initiator is started, it sends data frames to the Responder which
+returns the frames to the Initiator.
+
+The total number of frames received and transmitted are displayed on the
+Initiator's display, along with a count of the number of frames received and
+transmitted OK or in error. The test can be terminated anytime by the user at
+either PC.
+
+To setup the Diagnostic Network Test:
+
+ 1. Select a PC with a CS8900/20-based adapter and a known working network
+ connection to act as the Responder. Run the CS8900/20 Setup Utility
+ and select 'Diagnostics -> Network Test -> Responder' from the main
+ menu. Hit ENTER to start the Responder.
+
+ 2. Return to the PC with the CS8900/20-based adapter you want to test and
+ start the CS8900/20 Setup Utility.
+
+ 3. From the main menu, Select 'Diagnostic -> Network Test -> Initiator'.
+ Hit ENTER to start the test.
+
+You may stop the test on the Initiator at any time while allowing the Responder
+to continue running. In this manner, you can move to additional PCs and test
+them by starting the Initiator on another PC without having to stop/start the
+Responder.
+
+
+
+5.3. Using the Adapter's LEDs
+-----------------------------
+
+The 2 and 3-media adapters have two LEDs visible on the back end of the board
+located near the 10Base-T connector.
+
+Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T
+connection. (Only applies to 10Base-T. The green LED has no significance for
+a 10Base-2 or AUI connection.)
+
+TX/RX LED: The yellow LED lights briefly each time the adapter transmits or
+receives data. (The yellow LED will appear to "flicker" on a typical network.)
+
+
+5.4. Resolving I/O Conflicts
+----------------------------
+
+An IO conflict occurs when two or more adapter use the same ISA resource (IO
+address, memory address or IRQ). You can usually detect an IO conflict in one
+of four ways after installing and or configuring the CS8900/20-based adapter:
+
+ 1. The system does not boot properly (or at all).
+
+ 2. The driver cannot communicate with the adapter, reporting an "Adapter
+ not found" error message.
+
+ 3. You cannot connect to the network or the driver will not load.
+
+ 4. If you have configured the adapter to run in memory mode but the driver
+ reports it is using IO mode when loading, this is an indication of a
+ memory address conflict.
+
+If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a
+diagnostic self-test. Normally, the ISA resource in conflict will fail the
+self-test. If so, reconfigure the adapter selecting another choice for the
+resource in conflict. Run the diagnostics again to check for further IO
+conflicts.
+
+In some cases, such as when the PC will not boot, it may be necessary to remove
+the adapter and reconfigure it by installing it in another PC to run the
+CS8900/20 Setup Utility. Once reinstalled in the target system, run the
+diagnostics self-test to ensure the new configuration is free of conflicts
+before loading the driver again.
+
+When manually configuring the adapter, keep in mind the typical ISA system
+resource usage as indicated in the tables below.
+
+::
+
+ I/O Address Device IRQ Device
+ ----------- -------- --- --------
+ 200-20F Game I/O adapter 3 COM2, Bus Mouse
+ 230-23F Bus Mouse 4 COM1
+ 270-27F LPT3: third parallel port 5 LPT2
+ 2F0-2FF COM2: second serial port 6 Floppy Disk controller
+ 320-32F Fixed disk controller 7 LPT1
+ 8 Real-time Clock
+ 9 EGA/VGA display adapter
+ 12 Mouse (PS/2)
+ Memory Address Device 13 Math Coprocessor
+ -------------- --------------------- 14 Hard Disk controller
+ A000-BFFF EGA Graphics Adapter
+ A000-C7FF VGA Graphics Adapter
+ B000-BFFF Mono Graphics Adapter
+ B800-BFFF Color Graphics Adapter
+ E000-FFFF AT BIOS
+
+
+
+
+6. Technical Support
+====================
+
+6.1. Contacting Cirrus Logic's Technical Support
+------------------------------------------------
+
+Cirrus Logic's CS89XX Technical Application Support can be reached at::
+
+ Telephone :(800) 888-5016 (from inside U.S. and Canada)
+ :(512) 442-7555 (from outside the U.S. and Canada)
+ Fax :(512) 912-3871
+ Email :ethernet@crystal.cirrus.com
+ WWW :http://www.cirrus.com
+
+
+6.2. Information Required before Contacting Technical Support
+-------------------------------------------------------------
+
+Before contacting Cirrus Logic for technical support, be prepared to provide as
+Much of the following information as possible.
+
+1.) Adapter type (CRD8900, CDB8900, CDB8920, etc.)
+
+2.) Adapter configuration
+
+ * IO Base, Memory Base, IO or memory mode enabled, IRQ, DMA channel
+ * Plug and Play enabled/disabled (CS8920-based adapters only)
+ * Configured for media auto-detect or specific media type (which type).
+
+3.) PC System's Configuration
+
+ * Plug and Play system (yes/no)
+ * BIOS (make and version)
+ * System make and model
+ * CPU (type and speed)
+ * System RAM
+ * SCSI Adapter
+
+4.) Software
+
+ * CS89XX driver and version
+ * Your network operating system and version
+ * Your system's OS version
+ * Version of all protocol support files
+
+5.) Any Error Message displayed.
+
+
+
+6.3 Obtaining the Latest Driver Version
+---------------------------------------
+
+You can obtain the latest CS89XX drivers and support software from Cirrus Logic's
+Web site. You can also contact Cirrus Logic's Technical Support (email:
+ethernet@crystal.cirrus.com) and request that you be registered for automatic
+software-update notification.
+
+Cirrus Logic maintains a web page at http://www.cirrus.com with the
+latest drivers and technical publications.
+
+
+6.4. Current maintainer
+-----------------------
+
+In February 2000 the maintenance of this driver was assumed by Andrew
+Morton.
+
+6.5 Kernel module parameters
+----------------------------
+
+For use in embedded environments with no cs89x0 EEPROM, the kernel boot
+parameter ``cs89x0_media=`` has been implemented. Usage is::
+
+ cs89x0_media=rj45 or
+ cs89x0_media=aui or
+ cs89x0_media=bnc
diff --git a/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst b/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst
new file mode 100644
index 0000000000..14eb0a4d4e
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst
@@ -0,0 +1,171 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+DM9000 Network driver
+=====================
+
+Copyright 2008 Simtec Electronics,
+
+ Ben Dooks <ben@simtec.co.uk> <ben-linux@fluff.org>
+
+
+Introduction
+------------
+
+This file describes how to use the DM9000 platform-device based network driver
+that is contained in the files drivers/net/dm9000.c and drivers/net/dm9000.h.
+
+The driver supports three DM9000 variants, the DM9000E which is the first chip
+supported as well as the newer DM9000A and DM9000B devices. It is currently
+maintained and tested by Ben Dooks, who should be CC: to any patches for this
+driver.
+
+
+Defining the platform device
+----------------------------
+
+The minimum set of resources attached to the platform device are as follows:
+
+ 1) The physical address of the address register
+ 2) The physical address of the data register
+ 3) The IRQ line the device's interrupt pin is connected to.
+
+These resources should be specified in that order, as the ordering of the
+two address regions is important (the driver expects these to be address
+and then data).
+
+An example from arch/arm/mach-s3c/mach-bast.c is::
+
+ static struct resource bast_dm9k_resource[] = {
+ [0] = {
+ .start = S3C2410_CS5 + BAST_PA_DM9000,
+ .end = S3C2410_CS5 + BAST_PA_DM9000 + 3,
+ .flags = IORESOURCE_MEM,
+ },
+ [1] = {
+ .start = S3C2410_CS5 + BAST_PA_DM9000 + 0x40,
+ .end = S3C2410_CS5 + BAST_PA_DM9000 + 0x40 + 0x3f,
+ .flags = IORESOURCE_MEM,
+ },
+ [2] = {
+ .start = IRQ_DM9000,
+ .end = IRQ_DM9000,
+ .flags = IORESOURCE_IRQ | IORESOURCE_IRQ_HIGHLEVEL,
+ }
+ };
+
+ static struct platform_device bast_device_dm9k = {
+ .name = "dm9000",
+ .id = 0,
+ .num_resources = ARRAY_SIZE(bast_dm9k_resource),
+ .resource = bast_dm9k_resource,
+ };
+
+Note the setting of the IRQ trigger flag in bast_dm9k_resource[2].flags,
+as this will generate a warning if it is not present. The trigger from
+the flags field will be passed to request_irq() when registering the IRQ
+handler to ensure that the IRQ is setup correctly.
+
+This shows a typical platform device, without the optional configuration
+platform data supplied. The next example uses the same resources, but adds
+the optional platform data to pass extra configuration data::
+
+ static struct dm9000_plat_data bast_dm9k_platdata = {
+ .flags = DM9000_PLATF_16BITONLY,
+ };
+
+ static struct platform_device bast_device_dm9k = {
+ .name = "dm9000",
+ .id = 0,
+ .num_resources = ARRAY_SIZE(bast_dm9k_resource),
+ .resource = bast_dm9k_resource,
+ .dev = {
+ .platform_data = &bast_dm9k_platdata,
+ }
+ };
+
+The platform data is defined in include/linux/dm9000.h and described below.
+
+
+Platform data
+-------------
+
+Extra platform data for the DM9000 can describe the IO bus width to the
+device, whether or not an external PHY is attached to the device and
+the availability of an external configuration EEPROM.
+
+The flags for the platform data .flags field are as follows:
+
+DM9000_PLATF_8BITONLY
+
+ The IO should be done with 8bit operations.
+
+DM9000_PLATF_16BITONLY
+
+ The IO should be done with 16bit operations.
+
+DM9000_PLATF_32BITONLY
+
+ The IO should be done with 32bit operations.
+
+DM9000_PLATF_EXT_PHY
+
+ The chip is connected to an external PHY.
+
+DM9000_PLATF_NO_EEPROM
+
+ This can be used to signify that the board does not have an
+ EEPROM, or that the EEPROM should be hidden from the user.
+
+DM9000_PLATF_SIMPLE_PHY
+
+ Switch to using the simpler PHY polling method which does not
+ try and read the MII PHY state regularly. This is only available
+ when using the internal PHY. See the section on link state polling
+ for more information.
+
+ The config symbol DM9000_FORCE_SIMPLE_PHY_POLL, Kconfig entry
+ "Force simple NSR based PHY polling" allows this flag to be
+ forced on at build time.
+
+
+PHY Link state polling
+----------------------
+
+The driver keeps track of the link state and informs the network core
+about link (carrier) availability. This is managed by several methods
+depending on the version of the chip and on which PHY is being used.
+
+For the internal PHY, the original (and currently default) method is
+to read the MII state, either when the status changes if we have the
+necessary interrupt support in the chip or every two seconds via a
+periodic timer.
+
+To reduce the overhead for the internal PHY, there is now the option
+of using the DM9000_FORCE_SIMPLE_PHY_POLL config, or DM9000_PLATF_SIMPLE_PHY
+platform data option to read the summary information without the
+expensive MII accesses. This method is faster, but does not print
+as much information.
+
+When using an external PHY, the driver currently has to poll the MII
+link status as there is no method for getting an interrupt on link change.
+
+
+DM9000A / DM9000B
+-----------------
+
+These chips are functionally similar to the DM9000E and are supported easily
+by the same driver. The features are:
+
+ 1) Interrupt on internal PHY state change. This means that the periodic
+ polling of the PHY status may be disabled on these devices when using
+ the internal PHY.
+
+ 2) TCP/UDP checksum offloading, which the driver does not currently support.
+
+
+ethtool
+-------
+
+The driver supports the ethtool interface for access to the driver
+state information, the PHY state and the EEPROM.
diff --git a/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst b/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst
new file mode 100644
index 0000000000..c4cf809cad
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst
@@ -0,0 +1,71 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================================
+Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux
+==============================================================
+
+Note: This driver doesn't have a maintainer.
+
+
+This program is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+
+This driver provides kernel support for Davicom DM9102(A)/DM9132/DM9801 ethernet cards ( CNET
+10/100 ethernet cards uses Davicom chipset too, so this driver supports CNET cards too ).If you
+didn't compile this driver as a module, it will automatically load itself on boot and print a
+line similar to::
+
+ dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17)
+
+If you compiled this driver as a module, you have to load it on boot.You can load it with command::
+
+ insmod dmfe
+
+This way it will autodetect the device mode.This is the suggested way to load the module.Or you can pass
+a mode= setting to module while loading, like::
+
+ insmod dmfe mode=0 # Force 10M Half Duplex
+ insmod dmfe mode=1 # Force 100M Half Duplex
+ insmod dmfe mode=4 # Force 10M Full Duplex
+ insmod dmfe mode=5 # Force 100M Full Duplex
+
+Next you should configure your network interface with a command similar to::
+
+ ifconfig eth0 172.22.3.18
+ ^^^^^^^^^^^
+ Your IP Address
+
+Then you may have to modify the default routing table with command::
+
+ route add default eth0
+
+
+Now your ethernet card should be up and running.
+
+
+TODO:
+
+- Implement pci_driver::suspend() and pci_driver::resume() power management methods.
+- Check on 64 bit boxes.
+- Check and fix on big endian boxes.
+- Test and make sure PCI latency is now correct for all cases.
+
+
+Authors:
+
+Sten Wang <sten_wang@davicom.com.tw > : Original Author
+
+Contributors:
+
+- Marcelo Tosatti <marcelo@conectiva.com.br>
+- Alan Cox <alan@lxorguk.ukuu.org.uk>
+- Jeff Garzik <jgarzik@pobox.com>
+- Vojtech Pavlik <vojtech@suse.cz>
diff --git a/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst b/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst
new file mode 100644
index 0000000000..ccdb5d0d74
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst
@@ -0,0 +1,314 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================================
+D-Link DL2000-based Gigabit Ethernet Adapter Installation
+=========================================================
+
+May 23, 2002
+
+.. Contents
+
+ - Compatibility List
+ - Quick Install
+ - Compiling the Driver
+ - Installing the Driver
+ - Option parameter
+ - Configuration Script Sample
+ - Troubleshooting
+
+
+Compatibility List
+==================
+
+Adapter Support:
+
+- D-Link DGE-550T Gigabit Ethernet Adapter.
+- D-Link DGE-550SX Gigabit Ethernet Adapter.
+- D-Link DL2000-based Gigabit Ethernet Adapter.
+
+
+The driver support Linux kernel 2.4.7 later. We had tested it
+on the environments below.
+
+ . Red Hat v6.2 (update kernel to 2.4.7)
+ . Red Hat v7.0 (update kernel to 2.4.7)
+ . Red Hat v7.1 (kernel 2.4.7)
+ . Red Hat v7.2 (kernel 2.4.7-10)
+
+
+Quick Install
+=============
+Install linux driver as following command::
+
+ 1. make all
+ 2. insmod dl2k.ko
+ 3. ifconfig eth0 up 10.xxx.xxx.xxx netmask 255.0.0.0
+ ^^^^^^^^^^^^^^^\ ^^^^^^^^\
+ IP NETMASK
+
+Now eth0 should active, you can test it by "ping" or get more information by
+"ifconfig". If tested ok, continue the next step.
+
+4. ``cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net``
+5. Add the following line to /etc/modprobe.d/dl2k.conf::
+
+ alias eth0 dl2k
+
+6. Run ``depmod`` to updated module indexes.
+7. Run ``netconfig`` or ``netconf`` to create configuration script ifcfg-eth0
+ located at /etc/sysconfig/network-scripts or create it manually.
+
+ [see - Configuration Script Sample]
+8. Driver will automatically load and configure at next boot time.
+
+Compiling the Driver
+====================
+In Linux, NIC drivers are most commonly configured as loadable modules.
+The approach of building a monolithic kernel has become obsolete. The driver
+can be compiled as part of a monolithic kernel, but is strongly discouraged.
+The remainder of this section assumes the driver is built as a loadable module.
+In the Linux environment, it is a good idea to rebuild the driver from the
+source instead of relying on a precompiled version. This approach provides
+better reliability since a precompiled driver might depend on libraries or
+kernel features that are not present in a given Linux installation.
+
+The 3 files necessary to build Linux device driver are dl2k.c, dl2k.h and
+Makefile. To compile, the Linux installation must include the gcc compiler,
+the kernel source, and the kernel headers. The Linux driver supports Linux
+Kernels 2.4.7. Copy the files to a directory and enter the following command
+to compile and link the driver:
+
+CD-ROM drive
+------------
+
+::
+
+ [root@XXX /] mkdir cdrom
+ [root@XXX /] mount -r -t iso9660 -o conv=auto /dev/cdrom /cdrom
+ [root@XXX /] cd root
+ [root@XXX /root] mkdir dl2k
+ [root@XXX /root] cd dl2k
+ [root@XXX dl2k] cp /cdrom/linux/dl2k.tgz /root/dl2k
+ [root@XXX dl2k] tar xfvz dl2k.tgz
+ [root@XXX dl2k] make all
+
+Floppy disc drive
+-----------------
+
+::
+
+ [root@XXX /] cd root
+ [root@XXX /root] mkdir dl2k
+ [root@XXX /root] cd dl2k
+ [root@XXX dl2k] mcopy a:/linux/dl2k.tgz /root/dl2k
+ [root@XXX dl2k] tar xfvz dl2k.tgz
+ [root@XXX dl2k] make all
+
+Installing the Driver
+=====================
+
+Manual Installation
+-------------------
+
+ Once the driver has been compiled, it must be loaded, enabled, and bound
+ to a protocol stack in order to establish network connectivity. To load a
+ module enter the command::
+
+ insmod dl2k.o
+
+ or::
+
+ insmod dl2k.o <optional parameter> ; add parameter
+
+---------------------------------------------------------
+
+ example::
+
+ insmod dl2k.o media=100mbps_hd
+
+ or::
+
+ insmod dl2k.o media=3
+
+ or::
+
+ insmod dl2k.o media=3,2 ; for 2 cards
+
+---------------------------------------------------------
+
+ Please reference the list of the command line parameters supported by
+ the Linux device driver below.
+
+ The insmod command only loads the driver and gives it a name of the form
+ eth0, eth1, etc. To bring the NIC into an operational state,
+ it is necessary to issue the following command::
+
+ ifconfig eth0 up
+
+ Finally, to bind the driver to the active protocol (e.g., TCP/IP with
+ Linux), enter the following command::
+
+ ifup eth0
+
+ Note that this is meaningful only if the system can find a configuration
+ script that contains the necessary network information. A sample will be
+ given in the next paragraph.
+
+ The commands to unload a driver are as follows::
+
+ ifdown eth0
+ ifconfig eth0 down
+ rmmod dl2k.o
+
+ The following are the commands to list the currently loaded modules and
+ to see the current network configuration::
+
+ lsmod
+ ifconfig
+
+
+Automated Installation
+----------------------
+ This section describes how to install the driver such that it is
+ automatically loaded and configured at boot time. The following description
+ is based on a Red Hat 6.0/7.0 distribution, but it can easily be ported to
+ other distributions as well.
+
+Red Hat v6.x/v7.x
+-----------------
+ 1. Copy dl2k.o to the network modules directory, typically
+ /lib/modules/2.x.x-xx/net or /lib/modules/2.x.x/kernel/drivers/net.
+ 2. Locate the boot module configuration file, most commonly in the
+ /etc/modprobe.d/ directory. Add the following lines::
+
+ alias ethx dl2k
+ options dl2k <optional parameters>
+
+ where ethx will be eth0 if the NIC is the only ethernet adapter, eth1 if
+ one other ethernet adapter is installed, etc. Refer to the table in the
+ previous section for the list of optional parameters.
+ 3. Locate the network configuration scripts, normally the
+ /etc/sysconfig/network-scripts directory, and create a configuration
+ script named ifcfg-ethx that contains network information.
+ 4. Note that for most Linux distributions, Red Hat included, a configuration
+ utility with a graphical user interface is provided to perform steps 2
+ and 3 above.
+
+
+Parameter Description
+=====================
+You can install this driver without any additional parameter. However, if you
+are going to have extensive functions then it is necessary to set extra
+parameter. Below is a list of the command line parameters supported by the
+Linux device
+driver.
+
+
+=============================== ==============================================
+mtu=packet_size Specifies the maximum packet size. default
+ is 1500.
+
+media=media_type Specifies the media type the NIC operates at.
+ autosense Autosensing active media.
+
+ =========== =========================
+ 10mbps_hd 10Mbps half duplex.
+ 10mbps_fd 10Mbps full duplex.
+ 100mbps_hd 100Mbps half duplex.
+ 100mbps_fd 100Mbps full duplex.
+ 1000mbps_fd 1000Mbps full duplex.
+ 1000mbps_hd 1000Mbps half duplex.
+ 0 Autosensing active media.
+ 1 10Mbps half duplex.
+ 2 10Mbps full duplex.
+ 3 100Mbps half duplex.
+ 4 100Mbps full duplex.
+ 5 1000Mbps half duplex.
+ 6 1000Mbps full duplex.
+ =========== =========================
+
+ By default, the NIC operates at autosense.
+ 1000mbps_fd and 1000mbps_hd types are only
+ available for fiber adapter.
+
+vlan=n Specifies the VLAN ID. If vlan=0, the
+ Virtual Local Area Network (VLAN) function is
+ disable.
+
+jumbo=[0|1] Specifies the jumbo frame support. If jumbo=1,
+ the NIC accept jumbo frames. By default, this
+ function is disabled.
+ Jumbo frame usually improve the performance
+ int gigabit.
+ This feature need jumbo frame compatible
+ remote.
+
+rx_coalesce=m Number of rx frame handled each interrupt.
+rx_timeout=n Rx DMA wait time for an interrupt.
+ If set rx_coalesce > 0, hardware only assert
+ an interrupt for m frames. Hardware won't
+ assert rx interrupt until m frames received or
+ reach timeout of n * 640 nano seconds.
+ Set proper rx_coalesce and rx_timeout can
+ reduce congestion collapse and overload which
+ has been a bottleneck for high speed network.
+
+ For example, rx_coalesce=10 rx_timeout=800.
+ that is, hardware assert only 1 interrupt
+ for 10 frames received or timeout of 512 us.
+
+tx_coalesce=n Number of tx frame handled each interrupt.
+ Set n > 1 can reduce the interrupts
+ congestion usually lower performance of
+ high speed network card. Default is 16.
+
+tx_flow=[1|0] Specifies the Tx flow control. If tx_flow=0,
+ the Tx flow control disable else driver
+ autodetect.
+rx_flow=[1|0] Specifies the Rx flow control. If rx_flow=0,
+ the Rx flow control enable else driver
+ autodetect.
+=============================== ==============================================
+
+
+Configuration Script Sample
+===========================
+Here is a sample of a simple configuration script::
+
+ DEVICE=eth0
+ USERCTL=no
+ ONBOOT=yes
+ POOTPROTO=none
+ BROADCAST=207.200.5.255
+ NETWORK=207.200.5.0
+ NETMASK=255.255.255.0
+ IPADDR=207.200.5.2
+
+
+Troubleshooting
+===============
+Q1. Source files contain ^ M behind every line.
+
+ Make sure all files are Unix file format (no LF). Try the following
+ shell command to convert files::
+
+ cat dl2k.c | col -b > dl2k.tmp
+ mv dl2k.tmp dl2k.c
+
+ OR::
+
+ cat dl2k.c | tr -d "\r" > dl2k.tmp
+ mv dl2k.tmp dl2k.c
+
+Q2: Could not find header files (``*.h``)?
+
+ To compile the driver, you need kernel header files. After
+ installing the kernel source, the header files are usually located in
+ /usr/src/linux/include, which is the default include directory configured
+ in Makefile. For some distributions, there is a copy of header files in
+ /usr/src/include/linux and /usr/src/include/asm, that you can change the
+ INCLUDEDIR in Makefile to /usr/include without installing kernel source.
+
+ Note that RH 7.0 didn't provide correct header files in /usr/include,
+ including those files will make a wrong version driver.
+
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst
new file mode 100644
index 0000000000..241c6c6f6e
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+The QorIQ DPAA Ethernet Driver
+==============================
+
+Authors:
+- Madalin Bucur <madalin.bucur@nxp.com>
+- Camelia Groza <camelia.groza@nxp.com>
+
+.. Contents
+
+ - DPAA Ethernet Overview
+ - DPAA Ethernet Supported SoCs
+ - Configuring DPAA Ethernet in your kernel
+ - DPAA Ethernet Frame Processing
+ - DPAA Ethernet Features
+ - DPAA IRQ Affinity and Receive Side Scaling
+ - Debugging
+
+DPAA Ethernet Overview
+======================
+
+DPAA stands for Data Path Acceleration Architecture and it is a
+set of networking acceleration IPs that are available on several
+generations of SoCs, both on PowerPC and ARM64.
+
+The Freescale DPAA architecture consists of a series of hardware blocks
+that support Ethernet connectivity. The Ethernet driver depends upon the
+following drivers in the Linux kernel:
+
+ - Peripheral Access Memory Unit (PAMU) (* needed only for PPC platforms)
+ drivers/iommu/fsl_*
+ - Frame Manager (FMan)
+ drivers/net/ethernet/freescale/fman
+ - Queue Manager (QMan), Buffer Manager (BMan)
+ drivers/soc/fsl/qbman
+
+A simplified view of the dpaa_eth interfaces mapped to FMan MACs::
+
+ dpaa_eth /eth0\ ... /ethN\
+ driver | | | |
+ ------------- ---- ----------- ---- -------------
+ -Ports / Tx Rx \ ... / Tx Rx \
+ FMan | | | |
+ -MACs | MAC0 | | MACN |
+ / dtsec0 \ ... / dtsecN \ (or tgec)
+ / \ / \(or memac)
+ --------- -------------- --- -------------- ---------
+ FMan, FMan Port, FMan SP, FMan MURAM drivers
+ ---------------------------------------------------------
+ FMan HW blocks: MURAM, MACs, Ports, SP
+ ---------------------------------------------------------
+
+The dpaa_eth relation to the QMan, BMan and FMan::
+
+ ________________________________
+ dpaa_eth / eth0 \
+ driver / \
+ --------- -^- -^- -^- --- ---------
+ QMan driver / \ / \ / \ \ / | BMan |
+ |Rx | |Rx | |Tx | |Tx | | driver |
+ --------- |Dfl| |Err| |Cnf| |FQs| | |
+ QMan HW |FQ | |FQ | |FQs| | | | |
+ / \ / \ / \ \ / | |
+ --------- --- --- --- -v- ---------
+ | FMan QMI | |
+ | FMan HW FMan BMI | BMan HW |
+ ----------------------- --------
+
+where the acronyms used above (and in the code) are:
+
+=============== ===========================================================
+DPAA Data Path Acceleration Architecture
+FMan DPAA Frame Manager
+QMan DPAA Queue Manager
+BMan DPAA Buffers Manager
+QMI QMan interface in FMan
+BMI BMan interface in FMan
+FMan SP FMan Storage Profiles
+MURAM Multi-user RAM in FMan
+FQ QMan Frame Queue
+Rx Dfl FQ default reception FQ
+Rx Err FQ Rx error frames FQ
+Tx Cnf FQ Tx confirmation FQs
+Tx FQs transmission frame queues
+dtsec datapath three speed Ethernet controller (10/100/1000 Mbps)
+tgec ten gigabit Ethernet controller (10 Gbps)
+memac multirate Ethernet MAC (10/100/1000/10000)
+=============== ===========================================================
+
+DPAA Ethernet Supported SoCs
+============================
+
+The DPAA drivers enable the Ethernet controllers present on the following SoCs:
+
+PPC
+- P1023
+- P2041
+- P3041
+- P4080
+- P5020
+- P5040
+- T1023
+- T1024
+- T1040
+- T1042
+- T2080
+- T4240
+- B4860
+
+ARM
+- LS1043A
+- LS1046A
+
+Configuring DPAA Ethernet in your kernel
+========================================
+
+To enable the DPAA Ethernet driver, the following Kconfig options are required::
+
+ # common for arch/arm64 and arch/powerpc platforms
+ CONFIG_FSL_DPAA=y
+ CONFIG_FSL_FMAN=y
+ CONFIG_FSL_DPAA_ETH=y
+ CONFIG_FSL_XGMAC_MDIO=y
+
+ # for arch/powerpc only
+ CONFIG_FSL_PAMU=y
+
+ # common options needed for the PHYs used on the RDBs
+ CONFIG_VITESSE_PHY=y
+ CONFIG_REALTEK_PHY=y
+ CONFIG_AQUANTIA_PHY=y
+
+DPAA Ethernet Frame Processing
+==============================
+
+On Rx, buffers for the incoming frames are retrieved from the buffers found
+in the dedicated interface buffer pool. The driver initializes and seeds these
+with one page buffers.
+
+On Tx, all transmitted frames are returned to the driver through Tx
+confirmation frame queues. The driver is then responsible for freeing the
+buffers. In order to do this properly, a backpointer is added to the buffer
+before transmission that points to the skb. When the buffer returns to the
+driver on a confirmation FQ, the skb can be correctly consumed.
+
+DPAA Ethernet Features
+======================
+
+Currently the DPAA Ethernet driver enables the basic features required for
+a Linux Ethernet driver. The support for advanced features will be added
+gradually.
+
+The driver has Rx and Tx checksum offloading for UDP and TCP. Currently the Rx
+checksum offload feature is enabled by default and cannot be controlled through
+ethtool. Also, rx-flow-hash and rx-hashing was added. The addition of RSS
+provides a big performance boost for the forwarding scenarios, allowing
+different traffic flows received by one interface to be processed by different
+CPUs in parallel.
+
+The driver has support for multiple prioritized Tx traffic classes. Priorities
+range from 0 (lowest) to 3 (highest). These are mapped to HW workqueues with
+strict priority levels. Each traffic class contains NR_CPU TX queues. By
+default, only one traffic class is enabled and the lowest priority Tx queues
+are used. Higher priority traffic classes can be enabled with the mqprio
+qdisc. For example, all four traffic classes are enabled on an interface with
+the following command. Furthermore, skb priority levels are mapped to traffic
+classes as follows:
+
+ * priorities 0 to 3 - traffic class 0 (low priority)
+ * priorities 4 to 7 - traffic class 1 (medium-low priority)
+ * priorities 8 to 11 - traffic class 2 (medium-high priority)
+ * priorities 12 to 15 - traffic class 3 (high priority)
+
+::
+
+ tc qdisc add dev <int> root handle 1: \
+ mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1
+
+DPAA IRQ Affinity and Receive Side Scaling
+==========================================
+
+Traffic coming on the DPAA Rx queues or on the DPAA Tx confirmation
+queues is seen by the CPU as ingress traffic on a certain portal.
+The DPAA QMan portal interrupts are affined each to a certain CPU.
+The same portal interrupt services all the QMan portal consumers.
+
+By default the DPAA Ethernet driver enables RSS, making use of the
+DPAA FMan Parser and Keygen blocks to distribute traffic on 128
+hardware frame queues using a hash on IP v4/v6 source and destination
+and L4 source and destination ports, in present in the received frame.
+When RSS is disabled, all traffic received by a certain interface is
+received on the default Rx frame queue. The default DPAA Rx frame
+queues are configured to put the received traffic into a pool channel
+that allows any available CPU portal to dequeue the ingress traffic.
+The default frame queues have the HOLDACTIVE option set, ensuring that
+traffic bursts from a certain queue are serviced by the same CPU.
+This ensures a very low rate of frame reordering. A drawback of this
+is that only one CPU at a time can service the traffic received by a
+certain interface when RSS is not enabled.
+
+To implement RSS, the DPAA Ethernet driver allocates an extra set of
+128 Rx frame queues that are configured to dedicated channels, in a
+round-robin manner. The mapping of the frame queues to CPUs is now
+hardcoded, there is no indirection table to move traffic for a certain
+FQ (hash result) to another CPU. The ingress traffic arriving on one
+of these frame queues will arrive at the same portal and will always
+be processed by the same CPU. This ensures intra-flow order preservation
+and workload distribution for multiple traffic flows.
+
+RSS can be turned off for a certain interface using ethtool, i.e.::
+
+ # ethtool -N fm1-mac9 rx-flow-hash tcp4 ""
+
+To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6::
+
+ # ethtool -N fm1-mac9 rx-flow-hash udp4 sfdn
+
+There is no independent control for individual protocols, any command
+run for one of tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 is
+going to control the rx-flow-hashing for all protocols on that interface.
+
+Besides using the FMan Keygen computed hash for spreading traffic on the
+128 Rx FQs, the DPAA Ethernet driver also sets the skb hash value when
+the NETIF_F_RXHASH feature is on (active by default). This can be turned
+on or off through ethtool, i.e.::
+
+ # ethtool -K fm1-mac9 rx-hashing off
+ # ethtool -k fm1-mac9 | grep hash
+ receive-hashing: off
+ # ethtool -K fm1-mac9 rx-hashing on
+ Actual changes:
+ receive-hashing: on
+ # ethtool -k fm1-mac9 | grep hash
+ receive-hashing: on
+
+Please note that Rx hashing depends upon the rx-flow-hashing being on
+for that interface - turning off rx-flow-hashing will also disable the
+rx-hashing (without ethtool reporting it as off as that depends on the
+NETIF_F_RXHASH feature flag).
+
+Debugging
+=========
+
+The following statistics are exported for each interface through ethtool:
+
+ - interrupt count per CPU
+ - Rx packets count per CPU
+ - Tx packets count per CPU
+ - Tx confirmed packets count per CPU
+ - Tx S/G frames count per CPU
+ - Tx error count per CPU
+ - Rx error count per CPU
+ - Rx error count per type
+ - congestion related statistics:
+
+ - congestion status
+ - time spent in congestion
+ - number of time the device entered congestion
+ - dropped packets count per cause
+
+The driver also exports the following information in sysfs:
+
+ - the FQ IDs for each FQ type
+ /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/fqids
+
+ - the ID of the buffer pool in use
+ /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/bpids
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst
new file mode 100644
index 0000000000..e4ebfe62a1
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst
@@ -0,0 +1,161 @@
+.. include:: <isonum.txt>
+
+===================================
+DPAA2 DPIO (Data Path I/O) Overview
+===================================
+
+:Copyright: |copy| 2016-2018 NXP
+
+This document provides an overview of the Freescale DPAA2 DPIO
+drivers
+
+Introduction
+============
+
+A DPAA2 DPIO (Data Path I/O) is a hardware object that provides
+interfaces to enqueue and dequeue frames to/from network interfaces
+and other accelerators. A DPIO also provides hardware buffer
+pool management for network interfaces.
+
+This document provides an overview the Linux DPIO driver, its
+subcomponents, and its APIs.
+
+See
+Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
+for a general overview of DPAA2 and the general DPAA2 driver architecture
+in Linux.
+
+Driver Overview
+---------------
+
+The DPIO driver is bound to DPIO objects discovered on the fsl-mc bus and
+provides services that:
+
+ A. allow other drivers, such as the Ethernet driver, to enqueue and dequeue
+ frames for their respective objects
+ B. allow drivers to register callbacks for data availability notifications
+ when data becomes available on a queue or channel
+ C. allow drivers to manage hardware buffer pools
+
+The Linux DPIO driver consists of 3 primary components--
+ DPIO object driver-- fsl-mc driver that manages the DPIO object
+
+ DPIO service-- provides APIs to other Linux drivers for services
+
+ QBman portal interface-- sends portal commands, gets responses::
+
+ fsl-mc other
+ bus drivers
+ | |
+ +---+----+ +------+-----+
+ |DPIO obj| |DPIO service|
+ | driver |---| (DPIO) |
+ +--------+ +------+-----+
+ |
+ +------+-----+
+ | QBman |
+ | portal i/f |
+ +------------+
+ |
+ hardware
+
+
+The diagram below shows how the DPIO driver components fit with the other
+DPAA2 Linux driver components::
+
+ +------------+
+ | OS Network |
+ | Stack |
+ +------------+ +------------+
+ | Allocator |. . . . . . . | Ethernet |
+ |(DPMCP,DPBP)| | (DPNI) |
+ +-.----------+ +---+---+----+
+ . . ^ |
+ . . <data avail, | |<enqueue,
+ . . tx confirm> | | dequeue>
+ +-------------+ . | |
+ | DPRC driver | . +--------+ +------------+
+ | (DPRC) | . . |DPIO obj| |DPIO service|
+ +----------+--+ | driver |-| (DPIO) |
+ | +--------+ +------+-----+
+ |<dev add/remove> +------|-----+
+ | | QBman |
+ +----+--------------+ | portal i/f |
+ | MC-bus driver | +------------+
+ | | |
+ | /soc/fsl-mc | |
+ +-------------------+ |
+ |
+ =========================================|=========|========================
+ +-+--DPIO---|-----------+
+ | | |
+ | QBman Portal |
+ +-----------------------+
+
+ ============================================================================
+
+
+DPIO Object Driver (dpio-driver.c)
+----------------------------------
+
+ The dpio-driver component registers with the fsl-mc bus to handle objects of
+ type "dpio". The implementation of probe() handles basic initialization
+ of the DPIO including mapping of the DPIO regions (the QBman SW portal)
+ and initializing interrupts and registering irq handlers. The dpio-driver
+ registers the probed DPIO with dpio-service.
+
+DPIO service (dpio-service.c, dpaa2-io.h)
+------------------------------------------
+
+ The dpio service component provides queuing, notification, and buffers
+ management services to DPAA2 drivers, such as the Ethernet driver. A system
+ will typically allocate 1 DPIO object per CPU to allow queuing operations
+ to happen simultaneously across all CPUs.
+
+ Notification handling
+ dpaa2_io_service_register()
+
+ dpaa2_io_service_deregister()
+
+ dpaa2_io_service_rearm()
+
+ Queuing
+ dpaa2_io_service_pull_fq()
+
+ dpaa2_io_service_pull_channel()
+
+ dpaa2_io_service_enqueue_fq()
+
+ dpaa2_io_service_enqueue_qd()
+
+ dpaa2_io_store_create()
+
+ dpaa2_io_store_destroy()
+
+ dpaa2_io_store_next()
+
+ Buffer pool management
+ dpaa2_io_service_release()
+
+ dpaa2_io_service_acquire()
+
+QBman portal interface (qbman-portal.c)
+---------------------------------------
+
+ The qbman-portal component provides APIs to do the low level hardware
+ bit twiddling for operations such as:
+
+ - initializing Qman software portals
+ - building and sending portal commands
+ - portal interrupt configuration and processing
+
+ The qbman-portal APIs are not public to other drivers, and are
+ only used by dpio-service.
+
+Other (dpaa2-fd.h, dpaa2-global.h)
+----------------------------------
+
+ Frame descriptor and scatter-gather definitions and the APIs used to
+ manipulate them are defined in dpaa2-fd.h.
+
+ Dequeue result struct and parsing APIs are defined in dpaa2-global.h.
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst
new file mode 100644
index 0000000000..682f3986c1
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst
@@ -0,0 +1,186 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============================
+DPAA2 Ethernet driver
+===============================
+
+:Copyright: |copy| 2017-2018 NXP
+
+This file provides documentation for the Freescale DPAA2 Ethernet driver.
+
+Supported Platforms
+===================
+This driver provides networking support for Freescale DPAA2 SoCs, e.g.
+LS2080A, LS2088A, LS1088A.
+
+
+Architecture Overview
+=====================
+Unlike regular NICs, in the DPAA2 architecture there is no single hardware block
+representing network interfaces; instead, several separate hardware resources
+concur to provide the networking functionality:
+
+- network interfaces
+- queues, channels
+- buffer pools
+- MAC/PHY
+
+All hardware resources are allocated and configured through the Management
+Complex (MC) portals. MC abstracts most of these resources as DPAA2 objects
+and exposes ABIs through which they can be configured and controlled. A few
+hardware resources, like queues, do not have a corresponding MC object and
+are treated as internal resources of other objects.
+
+For a more detailed description of the DPAA2 architecture and its object
+abstractions see
+*Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst*.
+
+Each Linux net device is built on top of a Datapath Network Interface (DPNI)
+object and uses Buffer Pools (DPBPs), I/O Portals (DPIOs) and Concentrators
+(DPCONs).
+
+Configuration interface::
+
+ -----------------------
+ | DPAA2 Ethernet Driver |
+ -----------------------
+ . . .
+ . . .
+ . . . . . . . . . . . .
+ . . .
+ . . .
+ ---------- ---------- -----------
+ | DPBP API | | DPNI API | | DPCON API |
+ ---------- ---------- -----------
+ . . . software
+ ======= . ========== . ============ . ===================
+ . . . hardware
+ ------------------------------------------
+ | MC hardware portals |
+ ------------------------------------------
+ . . .
+ . . .
+ ------ ------ -------
+ | DPBP | | DPNI | | DPCON |
+ ------ ------ -------
+
+The DPNIs are network interfaces without a direct one-on-one mapping to PHYs.
+DPBPs represent hardware buffer pools. Packet I/O is performed in the context
+of DPCON objects, using DPIO portals for managing and communicating with the
+hardware resources.
+
+Datapath (I/O) interface::
+
+ -----------------------------------------------
+ | DPAA2 Ethernet Driver |
+ -----------------------------------------------
+ | ^ ^ | |
+ | | | | |
+ enqueue| dequeue| data | dequeue| seed |
+ (Tx) | (Rx, TxC)| avail.| request| buffers|
+ | | notify| | |
+ | | | | |
+ V | | V V
+ -----------------------------------------------
+ | DPIO Driver |
+ -----------------------------------------------
+ | | | | | software
+ | | | | | ================
+ | | | | | hardware
+ -----------------------------------------------
+ | I/O hardware portals |
+ -----------------------------------------------
+ | ^ ^ | |
+ | | | | |
+ | | | V |
+ V | ================ V
+ ---------------------- | -------------
+ queues ---------------------- | | Buffer pool |
+ ---------------------- | -------------
+ =======================
+ Channel
+
+Datapath I/O (DPIO) portals provide enqueue and dequeue services, data
+availability notifications and buffer pool management. DPIOs are shared between
+all DPAA2 objects (and implicitly all DPAA2 kernel drivers) that work with data
+frames, but must be affine to the CPUs for the purpose of traffic distribution.
+
+Frames are transmitted and received through hardware frame queues, which can be
+grouped in channels for the purpose of hardware scheduling. The Ethernet driver
+enqueues TX frames on egress queues and after transmission is complete a TX
+confirmation frame is sent back to the CPU.
+
+When frames are available on ingress queues, a data availability notification
+is sent to the CPU; notifications are raised per channel, so even if multiple
+queues in the same channel have available frames, only one notification is sent.
+After a channel fires a notification, is must be explicitly rearmed.
+
+Each network interface can have multiple Rx, Tx and confirmation queues affined
+to CPUs, and one channel (DPCON) for each CPU that services at least one queue.
+DPCONs are used to distribute ingress traffic to different CPUs via the cores'
+affine DPIOs.
+
+The role of hardware buffer pools is storage of ingress frame data. Each network
+interface has a privately owned buffer pool which it seeds with kernel allocated
+buffers.
+
+
+DPNIs are decoupled from PHYs; a DPNI can be connected to a PHY through a DPMAC
+object or to another DPNI through an internal link, but the connection is
+managed by MC and completely transparent to the Ethernet driver.
+
+::
+
+ --------- --------- ---------
+ | eth if1 | | eth if2 | | eth ifn |
+ --------- --------- ---------
+ . . .
+ . . .
+ . . .
+ ---------------------------
+ | DPAA2 Ethernet Driver |
+ ---------------------------
+ . . .
+ . . .
+ . . .
+ ------ ------ ------ -------
+ | DPNI | | DPNI | | DPNI | | DPMAC |----+
+ ------ ------ ------ ------- |
+ | | | | |
+ | | | | -----
+ =========== ================== | PHY |
+ -----
+
+Creating a Network Interface
+============================
+A net device is created for each DPNI object probed on the MC bus. Each DPNI has
+a number of properties which determine the network interface configuration
+options and associated hardware resources.
+
+DPNI objects (and the other DPAA2 objects needed for a network interface) can be
+added to a container on the MC bus in one of two ways: statically, through a
+Datapath Layout Binary file (DPL) that is parsed by MC at boot time; or created
+dynamically at runtime, via the DPAA2 objects APIs.
+
+
+Features & Offloads
+===================
+Hardware checksum offloading is supported for TCP and UDP over IPv4/6 frames.
+The checksum offloads can be independently configured on RX and TX through
+ethtool.
+
+Hardware offload of unicast and multicast MAC filtering is supported on the
+ingress path and permanently enabled.
+
+Scatter-gather frames are supported on both RX and TX paths. On TX, SG support
+is configurable via ethtool; on RX it is always enabled.
+
+The DPAA2 hardware can process jumbo Ethernet frames of up to 10K bytes.
+
+The Ethernet driver defines a static flow hashing scheme that distributes
+traffic based on a 5-tuple key: src IP, dst IP, IP proto, L4 src port,
+L4 dst port. No user configuration is supported for now.
+
+Hardware specific statistics for the network interface as well as some
+non-standard driver stats can be consulted through ethtool -S option.
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst
new file mode 100644
index 0000000000..62f4a4aff6
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst
@@ -0,0 +1,12 @@
+===================
+DPAA2 Documentation
+===================
+
+.. toctree::
+ :maxdepth: 1
+
+ overview
+ dpio-driver
+ ethernet-driver
+ mac-phy-support
+ switch-driver
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
new file mode 100644
index 0000000000..e2a36d0d88
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
@@ -0,0 +1,194 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=======================
+DPAA2 MAC / PHY support
+=======================
+
+:Copyright: |copy| 2019 NXP
+
+Overview
+--------
+
+The DPAA2 MAC / PHY support consists of a set of APIs that help DPAA2 network
+drivers (dpaa2-eth, dpaa2-ethsw) interact with the PHY library.
+
+DPAA2 Software Architecture
+---------------------------
+
+Among other DPAA2 objects, the fsl-mc bus exports DPNI objects (abstracting a
+network interface) and DPMAC objects (abstracting a MAC). The dpaa2-eth driver
+probes on the DPNI object and connects to and configures a DPMAC object with
+the help of phylink.
+
+Data connections may be established between a DPNI and a DPMAC, or between two
+DPNIs. Depending on the connection type, the netif_carrier_[on/off] is handled
+directly by the dpaa2-eth driver or by phylink.
+
+.. code-block:: none
+
+ Sources of abstracted link state information presented by the MC firmware
+
+ +--------------------------------------+
+ +------------+ +---------+ | xgmac_mdio |
+ | net_device | | phylink |--| +-----+ +-----+ +-----+ +-----+ |
+ +------------+ +---------+ | | PHY | | PHY | | PHY | | PHY | |
+ | | | +-----+ +-----+ +-----+ +-----+ |
+ +------------------------------------+ | External MDIO bus |
+ | dpaa2-eth | +--------------------------------------+
+ +------------------------------------+
+ | | Linux
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ | | MC firmware
+ | /| V
+ +----------+ / | +----------+
+ | | / | | |
+ | | | | | |
+ | DPNI |<------| |<------| DPMAC |
+ | | | | | |
+ | | \ |<---+ | |
+ +----------+ \ | | +----------+
+ \| |
+ |
+ +--------------------------------------+
+ | MC firmware polling MAC PCS for link |
+ | +-----+ +-----+ +-----+ +-----+ |
+ | | PCS | | PCS | | PCS | | PCS | |
+ | +-----+ +-----+ +-----+ +-----+ |
+ | Internal MDIO bus |
+ +--------------------------------------+
+
+
+Depending on an MC firmware configuration setting, each MAC may be in one of two modes:
+
+- DPMAC_LINK_TYPE_FIXED: the link state management is handled exclusively by
+ the MC firmware by polling the MAC PCS. Without the need to register a
+ phylink instance, the dpaa2-eth driver will not bind to the connected dpmac
+ object at all.
+
+- DPMAC_LINK_TYPE_PHY: The MC firmware is left waiting for link state update
+ events, but those are in fact passed strictly between the dpaa2-mac (based on
+ phylink) and its attached net_device driver (dpaa2-eth, dpaa2-ethsw),
+ effectively bypassing the firmware.
+
+Implementation
+--------------
+
+At probe time or when a DPNI's endpoint is dynamically changed, the dpaa2-eth
+is responsible to find out if the peer object is a DPMAC and if this is the
+case, to integrate it with PHYLINK using the dpaa2_mac_connect() API, which
+will do the following:
+
+ - look up the device tree for PHYLINK-compatible of binding (phy-handle)
+ - will create a PHYLINK instance associated with the received net_device
+ - connect to the PHY using phylink_of_phy_connect()
+
+The following phylink_mac_ops callback are implemented:
+
+ - .validate() will populate the supported linkmodes with the MAC capabilities
+ only when the phy_interface_t is RGMII_* (at the moment, this is the only
+ link type supported by the driver).
+
+ - .mac_config() will configure the MAC in the new configuration using the
+ dpmac_set_link_state() MC firmware API.
+
+ - .mac_link_up() / .mac_link_down() will update the MAC link using the same
+ API described above.
+
+At driver unbind() or when the DPNI object is disconnected from the DPMAC, the
+dpaa2-eth driver calls dpaa2_mac_disconnect() which will, in turn, disconnect
+from the PHY and destroy the PHYLINK instance.
+
+In case of a DPNI-DPMAC connection, an 'ip link set dev eth0 up' would start
+the following sequence of operations:
+
+(1) phylink_start() called from .dev_open().
+(2) The .mac_config() and .mac_link_up() callbacks are called by PHYLINK.
+(3) In order to configure the HW MAC, the MC Firmware API
+ dpmac_set_link_state() is called.
+(4) The firmware will eventually setup the HW MAC in the new configuration.
+(5) A netif_carrier_on() call is made directly from PHYLINK on the associated
+ net_device.
+(6) The dpaa2-eth driver handles the LINK_STATE_CHANGE irq in order to
+ enable/disable Rx taildrop based on the pause frame settings.
+
+.. code-block:: none
+
+ +---------+ +---------+
+ | PHYLINK |-------------->| eth0 |
+ +---------+ (5) +---------+
+ (1) ^ |
+ | |
+ | v (2)
+ +-----------------------------------+
+ | dpaa2-eth |
+ +-----------------------------------+
+ | ^ (6)
+ | |
+ v (3) |
+ +---------+---------------+---------+
+ | DPMAC | | DPNI |
+ +---------+ +---------+
+ | MC Firmware |
+ +-----------------------------------+
+ |
+ |
+ v (4)
+ +-----------------------------------+
+ | HW MAC |
+ +-----------------------------------+
+
+In case of a DPNI-DPNI connection, a usual sequence of operations looks like
+the following:
+
+(1) ip link set dev eth0 up
+(2) The dpni_enable() MC API called on the associated fsl_mc_device.
+(3) ip link set dev eth1 up
+(4) The dpni_enable() MC API called on the associated fsl_mc_device.
+(5) The LINK_STATE_CHANGED irq is received by both instances of the dpaa2-eth
+ driver because now the operational link state is up.
+(6) The netif_carrier_on() is called on the exported net_device from
+ link_state_update().
+
+.. code-block:: none
+
+ +---------+ +---------+
+ | eth0 | | eth1 |
+ +---------+ +---------+
+ | ^ ^ |
+ | | | |
+ (1) v | (6) (6) | v (3)
+ +---------+ +---------+
+ |dpaa2-eth| |dpaa2-eth|
+ +---------+ +---------+
+ | ^ ^ |
+ | | | |
+ (2) v | (5) (5) | v (4)
+ +---------+---------------+---------+
+ | DPNI | | DPNI |
+ +---------+ +---------+
+ | MC Firmware |
+ +-----------------------------------+
+
+
+Exported API
+------------
+
+Any DPAA2 driver that drivers endpoints of DPMAC objects should service its
+_EVENT_ENDPOINT_CHANGED irq and connect/disconnect from the associated DPMAC
+when necessary using the below listed API::
+
+ - int dpaa2_mac_connect(struct dpaa2_mac *mac);
+ - void dpaa2_mac_disconnect(struct dpaa2_mac *mac);
+
+A phylink integration is necessary only when the partner DPMAC is not of
+``TYPE_FIXED``. This means it is either of ``TYPE_PHY``, or of
+``TYPE_BACKPLANE`` (the difference being the two that in the ``TYPE_BACKPLANE``
+mode, the MC firmware does not access the PCS registers). One can check for
+this condition using the following helper::
+
+ - static inline bool dpaa2_mac_is_type_phy(struct dpaa2_mac *mac);
+
+Before connection to a MAC, the caller must allocate and populate the
+dpaa2_mac structure with the associated net_device, a pointer to the MC portal
+to be used and the actual fsl_mc_device structure of the DPMAC.
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
new file mode 100644
index 0000000000..1996477292
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
@@ -0,0 +1,406 @@
+.. include:: <isonum.txt>
+
+=========================================================
+DPAA2 (Data Path Acceleration Architecture Gen2) Overview
+=========================================================
+
+:Copyright: |copy| 2015 Freescale Semiconductor Inc.
+:Copyright: |copy| 2018 NXP
+
+This document provides an overview of the Freescale DPAA2 architecture
+and how it is integrated into the Linux kernel.
+
+Introduction
+============
+
+DPAA2 is a hardware architecture designed for high-speeed network
+packet processing. DPAA2 consists of sophisticated mechanisms for
+processing Ethernet packets, queue management, buffer management,
+autonomous L2 switching, virtual Ethernet bridging, and accelerator
+(e.g. crypto) sharing.
+
+A DPAA2 hardware component called the Management Complex (or MC) manages the
+DPAA2 hardware resources. The MC provides an object-based abstraction for
+software drivers to use the DPAA2 hardware.
+The MC uses DPAA2 hardware resources such as queues, buffer pools, and
+network ports to create functional objects/devices such as network
+interfaces, an L2 switch, or accelerator instances.
+The MC provides memory-mapped I/O command interfaces (MC portals)
+which DPAA2 software drivers use to operate on DPAA2 objects.
+
+The diagram below shows an overview of the DPAA2 resource management
+architecture::
+
+ +--------------------------------------+
+ | OS |
+ | DPAA2 drivers |
+ | | |
+ +-----------------------------|--------+
+ |
+ | (create,discover,connect
+ | config,use,destroy)
+ |
+ DPAA2 |
+ +------------------------| mc portal |-+
+ | | |
+ | +- - - - - - - - - - - - -V- - -+ |
+ | | | |
+ | | Management Complex (MC) | |
+ | | | |
+ | +- - - - - - - - - - - - - - - -+ |
+ | |
+ | Hardware Hardware |
+ | Resources Objects |
+ | --------- ------- |
+ | -queues -DPRC |
+ | -buffer pools -DPMCP |
+ | -Eth MACs/ports -DPIO |
+ | -network interface -DPNI |
+ | profiles -DPMAC |
+ | -queue portals -DPBP |
+ | -MC portals ... |
+ | ... |
+ | |
+ +--------------------------------------+
+
+
+The MC mediates operations such as create, discover,
+connect, configuration, and destroy. Fast-path operations
+on data, such as packet transmit/receive, are not mediated by
+the MC and are done directly using memory mapped regions in
+DPIO objects.
+
+Overview of DPAA2 Objects
+=========================
+
+The section provides a brief overview of some key DPAA2 objects.
+A simple scenario is described illustrating the objects involved
+in creating a network interfaces.
+
+DPRC (Datapath Resource Container)
+----------------------------------
+
+A DPRC is a container object that holds all the other
+types of DPAA2 objects. In the example diagram below there
+are 8 objects of 5 types (DPMCP, DPIO, DPBP, DPNI, and DPMAC)
+in the container.
+
+::
+
+ +---------------------------------------------------------+
+ | DPRC |
+ | |
+ | +-------+ +-------+ +-------+ +-------+ +-------+ |
+ | | DPMCP | | DPIO | | DPBP | | DPNI | | DPMAC | |
+ | +-------+ +-------+ +-------+ +---+---+ +---+---+ |
+ | | DPMCP | | DPIO | |
+ | +-------+ +-------+ |
+ | | DPMCP | |
+ | +-------+ |
+ | |
+ +---------------------------------------------------------+
+
+From the point of view of an OS, a DPRC behaves similar to a plug and
+play bus, like PCI. DPRC commands can be used to enumerate the contents
+of the DPRC, discover the hardware objects present (including mappable
+regions and interrupts).
+
+::
+
+ DPRC.1 (bus)
+ |
+ +--+--------+-------+-------+-------+
+ | | | | |
+ DPMCP.1 DPIO.1 DPBP.1 DPNI.1 DPMAC.1
+ DPMCP.2 DPIO.2
+ DPMCP.3
+
+Hardware objects can be created and destroyed dynamically, providing
+the ability to hot plug/unplug objects in and out of the DPRC.
+
+A DPRC has a mappable MMIO region (an MC portal) that can be used
+to send MC commands. It has an interrupt for status events (like
+hotplug).
+All objects in a container share the same hardware "isolation context".
+This means that with respect to an IOMMU the isolation granularity
+is at the DPRC (container) level, not at the individual object
+level.
+
+DPRCs can be defined statically and populated with objects
+via a config file passed to the MC when firmware starts it.
+
+DPAA2 Objects for an Ethernet Network Interface
+-----------------------------------------------
+
+A typical Ethernet NIC is monolithic-- the NIC device contains TX/RX
+queuing mechanisms, configuration mechanisms, buffer management,
+physical ports, and interrupts. DPAA2 uses a more granular approach
+utilizing multiple hardware objects. Each object provides specialized
+functions. Groups of these objects are used by software to provide
+Ethernet network interface functionality. This approach provides
+efficient use of finite hardware resources, flexibility, and
+performance advantages.
+
+The diagram below shows the objects needed for a simple
+network interface configuration on a system with 2 CPUs.
+
+::
+
+ +---+---+ +---+---+
+ CPU0 CPU1
+ +---+---+ +---+---+
+ | |
+ +---+---+ +---+---+
+ DPIO DPIO
+ +---+---+ +---+---+
+ \ /
+ \ /
+ \ /
+ +---+---+
+ DPNI --- DPBP,DPMCP
+ +---+---+
+ |
+ |
+ +---+---+
+ DPMAC
+ +---+---+
+ |
+ port/PHY
+
+Below the objects are described. For each object a brief description
+is provided along with a summary of the kinds of operations the object
+supports and a summary of key resources of the object (MMIO regions
+and IRQs).
+
+DPMAC (Datapath Ethernet MAC)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Represents an Ethernet MAC, a hardware device that connects to an Ethernet
+PHY and allows physical transmission and reception of Ethernet frames.
+
+- MMIO regions: none
+- IRQs: DPNI link change
+- commands: set link up/down, link config, get stats,
+ IRQ config, enable, reset
+
+DPNI (Datapath Network Interface)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Contains TX/RX queues, network interface configuration, and RX buffer pool
+configuration mechanisms. The TX/RX queues are in memory and are identified
+by queue number.
+
+- MMIO regions: none
+- IRQs: link state
+- commands: port config, offload config, queue config,
+ parse/classify config, IRQ config, enable, reset
+
+DPIO (Datapath I/O)
+~~~~~~~~~~~~~~~~~~~
+Provides interfaces to enqueue and dequeue
+packets and do hardware buffer pool management operations. The DPAA2
+architecture separates the mechanism to access queues (the DPIO object)
+from the queues themselves. The DPIO provides an MMIO interface to
+enqueue/dequeue packets. To enqueue something a descriptor is written
+to the DPIO MMIO region, which includes the target queue number.
+There will typically be one DPIO assigned to each CPU. This allows all
+CPUs to simultaneously perform enqueue/dequeued operations. DPIOs are
+expected to be shared by different DPAA2 drivers.
+
+- MMIO regions: queue operations, buffer management
+- IRQs: data availability, congestion notification, buffer
+ pool depletion
+- commands: IRQ config, enable, reset
+
+DPBP (Datapath Buffer Pool)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Represents a hardware buffer pool.
+
+- MMIO regions: none
+- IRQs: none
+- commands: enable, reset
+
+DPMCP (Datapath MC Portal)
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Provides an MC command portal.
+Used by drivers to send commands to the MC to manage
+objects.
+
+- MMIO regions: MC command portal
+- IRQs: command completion
+- commands: IRQ config, enable, reset
+
+Object Connections
+==================
+Some objects have explicit relationships that must
+be configured:
+
+- DPNI <--> DPMAC
+- DPNI <--> DPNI
+- DPNI <--> L2-switch-port
+
+ A DPNI must be connected to something such as a DPMAC,
+ another DPNI, or L2 switch port. The DPNI connection
+ is made via a DPRC command.
+
+::
+
+ +-------+ +-------+
+ | DPNI | | DPMAC |
+ +---+---+ +---+---+
+ | |
+ +==========+
+
+- DPNI <--> DPBP
+
+ A network interface requires a 'buffer pool' (DPBP
+ object) which provides a list of pointers to memory
+ where received Ethernet data is to be copied. The
+ Ethernet driver configures the DPBPs associated with
+ the network interface.
+
+Interrupts
+==========
+All interrupts generated by DPAA2 objects are message
+interrupts. At the hardware level message interrupts
+generated by devices will normally have 3 components--
+1) a non-spoofable 'device-id' expressed on the hardware
+bus, 2) an address, 3) a data value.
+
+In the case of DPAA2 devices/objects, all objects in the
+same container/DPRC share the same 'device-id'.
+For ARM-based SoC this is the same as the stream ID.
+
+
+DPAA2 Linux Drivers Overview
+============================
+
+This section provides an overview of the Linux kernel drivers for
+DPAA2-- 1) the bus driver and associated "DPAA2 infrastructure"
+drivers and 2) functional object drivers (such as Ethernet).
+
+As described previously, a DPRC is a container that holds the other
+types of DPAA2 objects. It is functionally similar to a plug-and-play
+bus controller.
+Each object in the DPRC is a Linux "device" and is bound to a driver.
+The diagram below shows the Linux drivers involved in a networking
+scenario and the objects bound to each driver. A brief description
+of each driver follows.
+
+::
+
+ +------------+
+ | OS Network |
+ | Stack |
+ +------------+ +------------+
+ | Allocator |. . . . . . . | Ethernet |
+ |(DPMCP,DPBP)| | (DPNI) |
+ +-.----------+ +---+---+----+
+ . . ^ |
+ . . <data avail, | | <enqueue,
+ . . tx confirm> | | dequeue>
+ +-------------+ . | |
+ | DPRC driver | . +---+---V----+ +---------+
+ | (DPRC) | . . . . . .| DPIO driver| | MAC |
+ +----------+--+ | (DPIO) | | (DPMAC) |
+ | +------+-----+ +-----+---+
+ |<dev add/remove> | |
+ | | |
+ +--------+----------+ | +--+---+
+ | MC-bus driver | | | PHY |
+ | | | |driver|
+ | /bus/fsl-mc | | +--+---+
+ +-------------------+ | |
+ | |
+ ========================= HARDWARE =========|=================|======
+ DPIO |
+ | |
+ DPNI---DPBP |
+ | |
+ DPMAC |
+ | |
+ PHY ---------------+
+ ============================================|========================
+
+A brief description of each driver is provided below.
+
+MC-bus driver
+-------------
+The MC-bus driver is a platform driver and is probed from a
+node in the device tree (compatible "fsl,qoriq-mc") passed in by boot
+firmware. It is responsible for bootstrapping the DPAA2 kernel
+infrastructure.
+Key functions include:
+
+- registering a new bus type named "fsl-mc" with the kernel,
+ and implementing bus call-backs (e.g. match/uevent/dev_groups)
+- implementing APIs for DPAA2 driver registration and for device
+ add/remove
+- creates an MSI IRQ domain
+- doing a 'device add' to expose the 'root' DPRC, in turn triggering
+ a bind of the root DPRC to the DPRC driver
+
+The binding for the MC-bus device-tree node can be consulted at
+*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt*.
+The sysfs bind/unbind interfaces for the MC-bus can be consulted at
+*Documentation/ABI/testing/sysfs-bus-fsl-mc*.
+
+DPRC driver
+-----------
+The DPRC driver is bound to DPRC objects and does runtime management
+of a bus instance. It performs the initial bus scan of the DPRC
+and handles interrupts for container events such as hot plug by
+re-scanning the DPRC.
+
+Allocator
+---------
+Certain objects such as DPMCP and DPBP are generic and fungible,
+and are intended to be used by other drivers. For example,
+the DPAA2 Ethernet driver needs:
+
+- DPMCPs to send MC commands, to configure network interfaces
+- DPBPs for network buffer pools
+
+The allocator driver registers for these allocatable object types
+and those objects are bound to the allocator when the bus is probed.
+The allocator maintains a pool of objects that are available for
+allocation by other DPAA2 drivers.
+
+DPIO driver
+-----------
+The DPIO driver is bound to DPIO objects and provides services that allow
+other drivers such as the Ethernet driver to enqueue and dequeue data for
+their respective objects.
+Key services include:
+
+- data availability notifications
+- hardware queuing operations (enqueue and dequeue of data)
+- hardware buffer pool management
+
+To transmit a packet the Ethernet driver puts data on a queue and
+invokes a DPIO API. For receive, the Ethernet driver registers
+a data availability notification callback. To dequeue a packet
+a DPIO API is used.
+There is typically one DPIO object per physical CPU for optimum
+performance, allowing different CPUs to simultaneously enqueue
+and dequeue data.
+
+The DPIO driver operates on behalf of all DPAA2 drivers
+active in the kernel-- Ethernet, crypto, compression,
+etc.
+
+Ethernet driver
+---------------
+The Ethernet driver is bound to a DPNI and implements the kernel
+interfaces needed to connect the DPAA2 network interface to
+the network stack.
+Each DPNI corresponds to a Linux network interface.
+
+MAC driver
+----------
+An Ethernet PHY is an off-chip, board specific component and is managed
+by the appropriate PHY driver via an mdio bus. The MAC driver
+plays a role of being a proxy between the PHY driver and the
+MC. It does this proxy via the MC commands to a DPMAC object.
+If the PHY driver signals a link change, the MAC driver notifies
+the MC via a DPMAC command. If a network interface is brought
+up or down, the MC notifies the DPMAC driver via an interrupt and
+the driver can take appropriate action.
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst
new file mode 100644
index 0000000000..8bf411b857
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst
@@ -0,0 +1,217 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===================
+DPAA2 Switch driver
+===================
+
+:Copyright: |copy| 2021 NXP
+
+The DPAA2 Switch driver probes on the Datapath Switch (DPSW) object which can
+be instantiated on the following DPAA2 SoCs and their variants: LS2088A and
+LX2160A.
+
+The driver uses the switch device driver model and exposes each switch port as
+a network interface, which can be included in a bridge or used as a standalone
+interface. Traffic switched between ports is offloaded into the hardware.
+
+The DPSW can have ports connected to DPNIs or to DPMACs for external access.
+::
+
+ [ethA] [ethB] [ethC] [ethD] [ethE] [ethF]
+ : : : : : :
+ : : : : : :
+ [dpaa2-eth] [dpaa2-eth] [ dpaa2-switch ]
+ : : : : : : kernel
+ =============================================================================
+ : : : : : : hardware
+ [DPNI] [DPNI] [============= DPSW =================]
+ | | | | | |
+ | ---------- | [DPMAC] [DPMAC]
+ ------------------------------- | |
+ | |
+ [PHY] [PHY]
+
+Creating an Ethernet Switch
+===========================
+
+The dpaa2-switch driver probes on DPSW devices found on the fsl-mc bus. These
+devices can be either created statically through the boot time configuration
+file - DataPath Layout (DPL) - or at runtime using the DPAA2 object APIs
+(incorporated already into the restool userspace tool).
+
+At the moment, the dpaa2-switch driver imposes the following restrictions on
+the DPSW object that it will probe:
+
+ * The minimum number of FDBs should be at least equal to the number of switch
+ interfaces. This is necessary so that separation of switch ports can be
+ done, ie when not under a bridge, each switch port will have its own FDB.
+ ::
+
+ fsl_dpaa2_switch dpsw.0: The number of FDBs is lower than the number of ports, cannot probe
+
+ * Both the broadcast and flooding configuration should be per FDB. This
+ enables the driver to restrict the broadcast and flooding domains of each
+ FDB depending on the switch ports that are sharing it (aka are under the
+ same bridge).
+ ::
+
+ fsl_dpaa2_switch dpsw.0: Flooding domain is not per FDB, cannot probe
+ fsl_dpaa2_switch dpsw.0: Broadcast domain is not per FDB, cannot probe
+
+ * The control interface of the switch should not be disabled
+ (DPSW_OPT_CTRL_IF_DIS not passed as a create time option). Without the
+ control interface, the driver is not capable to provide proper Rx/Tx traffic
+ support on the switch port netdevices.
+ ::
+
+ fsl_dpaa2_switch dpsw.0: Control Interface is disabled, cannot probe
+
+Besides the configuration of the actual DPSW object, the dpaa2-switch driver
+will need the following DPAA2 objects:
+
+ * 1 DPMCP - A Management Command Portal object is needed for any interraction
+ with the MC firmware.
+
+ * 1 DPBP - A Buffer Pool is used for seeding buffers intended for the Rx path
+ on the control interface.
+
+ * Access to at least one DPIO object (Software Portal) is needed for any
+ enqueue/dequeue operation to be performed on the control interface queues.
+ The DPIO object will be shared, no need for a private one.
+
+Switching features
+==================
+
+The driver supports the configuration of L2 forwarding rules in hardware for
+port bridging as well as standalone usage of the independent switch interfaces.
+
+The hardware is not configurable with respect to VLAN awareness, thus any DPAA2
+switch port should be used only in usecases with a VLAN aware bridge::
+
+ $ ip link add dev br0 type bridge vlan_filtering 1
+
+ $ ip link add dev br1 type bridge
+ $ ip link set dev ethX master br1
+ Error: fsl_dpaa2_switch: Cannot join a VLAN-unaware bridge
+
+Topology and loop detection through STP is supported when ``stp_state 1`` is
+used at bridge create ::
+
+ $ ip link add dev br0 type bridge vlan_filtering 1 stp_state 1
+
+L2 FDB manipulation (add/delete/dump) is supported.
+
+HW FDB learning can be configured on each switch port independently through
+bridge commands. When the HW learning is disabled, a fast age procedure will be
+run and any previously learnt addresses will be removed.
+::
+
+ $ bridge link set dev ethX learning off
+ $ bridge link set dev ethX learning on
+
+Restricting the unknown unicast and multicast flooding domain is supported, but
+not independently of each other::
+
+ $ ip link set dev ethX type bridge_slave flood off mcast_flood off
+ $ ip link set dev ethX type bridge_slave flood off mcast_flood on
+ Error: fsl_dpaa2_switch: Cannot configure multicast flooding independently of unicast.
+
+Broadcast flooding on a switch port can be disabled/enabled through the brport sysfs::
+
+ $ echo 0 > /sys/bus/fsl-mc/devices/dpsw.Y/net/ethX/brport/broadcast_flood
+
+Offloads
+========
+
+Routing actions (redirect, trap, drop)
+--------------------------------------
+
+The DPAA2 switch is able to offload flow-based redirection of packets making
+use of ACL tables. Shared filter blocks are supported by sharing a single ACL
+table between multiple ports.
+
+The following flow keys are supported:
+
+ * Ethernet: dst_mac/src_mac
+ * IPv4: dst_ip/src_ip/ip_proto/tos
+ * VLAN: vlan_id/vlan_prio/vlan_tpid/vlan_dei
+ * L4: dst_port/src_port
+
+Also, the matchall filter can be used to redirect the entire traffic received
+on a port.
+
+As per flow actions, the following are supported:
+
+ * drop
+ * mirred egress redirect
+ * trap
+
+Each ACL entry (filter) can be setup with only one of the listed
+actions.
+
+Example 1: send frames received on eth4 with a SA of 00:01:02:03:04:05 to the
+CPU::
+
+ $ tc qdisc add dev eth4 clsact
+ $ tc filter add dev eth4 ingress flower src_mac 00:01:02:03:04:05 skip_sw action trap
+
+Example 2: drop frames received on eth4 with VID 100 and PCP of 3::
+
+ $ tc filter add dev eth4 ingress protocol 802.1q flower skip_sw vlan_id 100 vlan_prio 3 action drop
+
+Example 3: redirect all frames received on eth4 to eth1::
+
+ $ tc filter add dev eth4 ingress matchall action mirred egress redirect dev eth1
+
+Example 4: Use a single shared filter block on both eth5 and eth6::
+
+ $ tc qdisc add dev eth5 ingress_block 1 clsact
+ $ tc qdisc add dev eth6 ingress_block 1 clsact
+ $ tc filter add block 1 ingress flower dst_mac 00:01:02:03:04:04 skip_sw \
+ action trap
+ $ tc filter add block 1 ingress protocol ipv4 flower src_ip 192.168.1.1 skip_sw \
+ action mirred egress redirect dev eth3
+
+Mirroring
+~~~~~~~~~
+
+The DPAA2 switch supports only per port mirroring and per VLAN mirroring.
+Adding mirroring filters in shared blocks is also supported.
+
+When using the tc-flower classifier with the 802.1q protocol, only the
+''vlan_id'' key will be accepted. Mirroring based on any other fields from the
+802.1q protocol will be rejected::
+
+ $ tc qdisc add dev eth8 ingress_block 1 clsact
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_prio 3 action mirred egress mirror dev eth6
+ Error: fsl_dpaa2_switch: Only matching on VLAN ID supported.
+ We have an error talking to the kernel
+
+If a mirroring VLAN filter is requested on a port, the VLAN must to be
+installed on the switch port in question either using ''bridge'' or by creating
+a VLAN upper device if the switch port is used as a standalone interface::
+
+ $ tc qdisc add dev eth8 ingress_block 1 clsact
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+ Error: VLAN must be installed on the switch port.
+ We have an error talking to the kernel
+
+ $ bridge vlan add vid 200 dev eth8
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+
+ $ ip link add link eth8 name eth8.200 type vlan id 200
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+
+Also, it should be noted that the mirrored traffic will be subject to the same
+egress restrictions as any other traffic. This means that when a mirrored
+packet will reach the mirror port, if the VLAN found in the packet is not
+installed on the port it will get dropped.
+
+The DPAA2 switch supports only a single mirroring destination, thus multiple
+mirror rules can be installed but their ''to'' port has to be the same::
+
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 100 action mirred egress mirror dev eth7
+ Error: fsl_dpaa2_switch: Multiple mirror ports not supported.
+ We have an error talking to the kernel
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst b/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst
new file mode 100644
index 0000000000..9c4a91d382
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+The Gianfar Ethernet Driver
+===========================
+
+:Author: Andy Fleming <afleming@freescale.com>
+:Updated: 2005-07-28
+
+
+Checksum Offloading
+===================
+
+The eTSEC controller (first included in parts from late 2005 like
+the 8548) has the ability to perform TCP, UDP, and IP checksums
+in hardware. The Linux kernel only offloads the TCP and UDP
+checksums (and always performs the pseudo header checksums), so
+the driver only supports checksumming for TCP/IP and UDP/IP
+packets. Use ethtool to enable or disable this feature for RX
+and TX.
+
+VLAN
+====
+
+In order to use VLAN, please consult Linux documentation on
+configuring VLANs. The gianfar driver supports hardware insertion and
+extraction of VLAN headers, but not filtering. Filtering will be
+done by the kernel.
+
+Multicasting
+============
+
+The gianfar driver supports using the group hash table on the
+TSEC (and the extended hash table on the eTSEC) for multicast
+filtering. On the eTSEC, the exact-match MAC registers are used
+before the hash tables. See Linux documentation on how to join
+multicast groups.
+
+Padding
+=======
+
+The gianfar driver supports padding received frames with 2 bytes
+to align the IP header to a 16-byte boundary, when supported by
+hardware.
+
+Ethtool
+=======
+
+The gianfar driver supports the use of ethtool for many
+configuration options. You must run ethtool only on currently
+open interfaces. See ethtool documentation for details.
diff --git a/Documentation/networking/device_drivers/ethernet/google/gve.rst b/Documentation/networking/device_drivers/ethernet/google/gve.rst
new file mode 100644
index 0000000000..31d621bca8
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/google/gve.rst
@@ -0,0 +1,175 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============================================================
+Linux kernel driver for Compute Engine Virtual Ethernet (gve):
+==============================================================
+
+Supported Hardware
+===================
+The GVE driver binds to a single PCI device id used by the virtual
+Ethernet device found in some Compute Engine VMs.
+
++--------------+----------+---------+
+|Field | Value | Comments|
++==============+==========+=========+
+|Vendor ID | `0x1AE0` | Google |
++--------------+----------+---------+
+|Device ID | `0x0042` | |
++--------------+----------+---------+
+|Sub-vendor ID | `0x1AE0` | Google |
++--------------+----------+---------+
+|Sub-device ID | `0x0058` | |
++--------------+----------+---------+
+|Revision ID | `0x0` | |
++--------------+----------+---------+
+|Device Class | `0x200` | Ethernet|
++--------------+----------+---------+
+
+PCI Bars
+========
+The gVNIC PCI device exposes three 32-bit memory BARS:
+- Bar0 - Device configuration and status registers.
+- Bar1 - MSI-X vector table
+- Bar2 - IRQ, RX and TX doorbells
+
+Device Interactions
+===================
+The driver interacts with the device in the following ways:
+ - Registers
+ - A block of MMIO registers
+ - See gve_register.h for more detail
+ - Admin Queue
+ - See description below
+ - Reset
+ - At any time the device can be reset
+ - Interrupts
+ - See supported interrupts below
+ - Transmit and Receive Queues
+ - See description below
+
+Descriptor Formats
+------------------
+GVE supports two descriptor formats: GQI and DQO. These two formats have
+entirely different descriptors, which will be described below.
+
+Addressing Mode
+------------------
+GVE supports two addressing modes: QPL and RDA.
+QPL ("queue-page-list") mode communicates data through a set of
+pre-registered pages.
+
+For RDA ("raw DMA addressing") mode, the set of pages is dynamic.
+Therefore, the packet buffers can be anywhere in guest memory.
+
+Registers
+---------
+All registers are MMIO.
+
+The registers are used for initializing and configuring the device as well as
+querying device status in response to management interrupts.
+
+Endianness
+----------
+- Admin Queue messages and registers are all Big Endian.
+- GQI descriptors and datapath registers are Big Endian.
+- DQO descriptors and datapath registers are Little Endian.
+
+Admin Queue (AQ)
+----------------
+The Admin Queue is a PAGE_SIZE memory block, treated as an array of AQ
+commands, used by the driver to issue commands to the device and set up
+resources.The driver and the device maintain a count of how many commands
+have been submitted and executed. To issue AQ commands, the driver must do
+the following (with proper locking):
+
+1) Copy new commands into next available slots in the AQ array
+2) Increment its counter by he number of new commands
+3) Write the counter into the GVE_ADMIN_QUEUE_DOORBELL register
+4) Poll the ADMIN_QUEUE_EVENT_COUNTER register until it equals
+ the value written to the doorbell, or until a timeout.
+
+The device will update the status field in each AQ command reported as
+executed through the ADMIN_QUEUE_EVENT_COUNTER register.
+
+Device Resets
+-------------
+A device reset is triggered by writing 0x0 to the AQ PFN register.
+This causes the device to release all resources allocated by the
+driver, including the AQ itself.
+
+Interrupts
+----------
+The following interrupts are supported by the driver:
+
+Management Interrupt
+~~~~~~~~~~~~~~~~~~~~
+The management interrupt is used by the device to tell the driver to
+look at the GVE_DEVICE_STATUS register.
+
+The handler for the management irq simply queues the service task in
+the workqueue to check the register and acks the irq.
+
+Notification Block Interrupts
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The notification block interrupts are used to tell the driver to poll
+the queues associated with that interrupt.
+
+The handler for these irqs schedule the napi for that block to run
+and poll the queues.
+
+GQI Traffic Queues
+------------------
+GQI queues are composed of a descriptor ring and a buffer and are assigned to a
+notification block.
+
+The descriptor rings are power-of-two-sized ring buffers consisting of
+fixed-size descriptors. They advance their head pointer using a __be32
+doorbell located in Bar2. The tail pointers are advanced by consuming
+descriptors in-order and updating a __be32 counter. Both the doorbell
+and the counter overflow to zero.
+
+Each queue's buffers must be registered in advance with the device as a
+queue page list, and packet data can only be put in those pages.
+
+Transmit
+~~~~~~~~
+gve maps the buffers for transmit rings into a FIFO and copies the packets
+into the FIFO before sending them to the NIC.
+
+Receive
+~~~~~~~
+The buffers for receive rings are put into a data ring that is the same
+length as the descriptor ring and the head and tail pointers advance over
+the rings together.
+
+DQO Traffic Queues
+------------------
+- Every TX and RX queue is assigned a notification block.
+
+- TX and RX buffers queues, which send descriptors to the device, use MMIO
+ doorbells to notify the device of new descriptors.
+
+- RX and TX completion queues, which receive descriptors from the device, use a
+ "generation bit" to know when a descriptor was populated by the device. The
+ driver initializes all bits with the "current generation". The device will
+ populate received descriptors with the "next generation" which is inverted
+ from the current generation. When the ring wraps, the current/next generation
+ are swapped.
+
+- It's the driver's responsibility to ensure that the RX and TX completion
+ queues are not overrun. This can be accomplished by limiting the number of
+ descriptors posted to HW.
+
+- TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
+ buffer_id. These will be returned on the TX completion and RX queues
+ respectively to let the driver know which packet/buffer was completed.
+
+Transmit
+~~~~~~~~
+A packet's buffers are DMA mapped for the device to access before transmission.
+After the packet was successfully transmitted, the buffers are unmapped.
+
+Receive
+~~~~~~~
+The driver posts fixed sized buffers to HW on the RX buffer queue. The packet
+received on the associated RX queue may span multiple descriptors.
diff --git a/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst b/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
new file mode 100644
index 0000000000..867ac8f4e0
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
@@ -0,0 +1,128 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================
+Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family
+============================================================
+
+Overview:
+=========
+HiNIC is a network interface card for the Data Center Area.
+
+The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.).
+The driver supports also a negotiated and extendable feature set.
+
+Some HiNIC devices support SR-IOV. This driver is used for Physical Function
+(PF).
+
+HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and
+adaptive interrupt moderation.
+
+HiNIC devices support also various offload features such as checksum offload,
+TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and
+LRO(Large Receive Offload).
+
+
+Supported PCI vendor ID/device IDs:
+===================================
+
+19e5:1822 - HiNIC PF
+
+
+Driver Architecture and Source Code:
+====================================
+
+hinic_dev - Implement a Logical Network device that is independent from
+specific HW details about HW data structure formats.
+
+hinic_hwdev - Implement the HW details of the device and include the components
+for accessing the PCI NIC.
+
+hinic_hwdev contains the following components:
+===============================================
+
+HW Interface:
+=============
+
+The interface for accessing the pci device (DMA memory and PCI BARs).
+(hinic_hw_if.c, hinic_hw_if.h)
+
+Configuration Status Registers Area that describes the HW Registers on the
+configuration and status BAR0. (hinic_hw_csr.h)
+
+MGMT components:
+================
+
+Asynchronous Event Queues(AEQs) - The event queues for receiving messages from
+the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h)
+
+Application Programmable Interface commands(API CMD) - Interface for sending
+MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h)
+
+Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT
+commands to the card and receives notifications from the MGMT modules on the
+card by AEQs. Also set the addresses of the IO CMDQs in HW.
+(hinic_hw_mgmt.c, hinic_hw_mgmt.h)
+
+IO components:
+==============
+
+Completion Event Queues(CEQs) - The completion Event Queues that describe IO
+tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h)
+
+Work Queues(WQ) - Contain the memory and operations for use by CMD queues and
+the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains
+pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs).
+(hinic_hw_wq.c, hinic_hw_wq.h)
+
+Command Queues(CMDQ) - The queues for sending commands for IO management and is
+used to set the QPs addresses in HW. The commands completion events are
+accumulated on the CEQ that is configured to receive the CMDQ completion events.
+(hinic_hw_cmdq.c, hinic_hw_cmdq.h)
+
+Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting
+Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h)
+
+IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h)
+
+HW device:
+==========
+
+HW device - de/constructs the HW Interface, the MGMT components on the
+initialization of the driver and the IO components on the case of Interface
+UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h)
+
+
+hinic_dev contains the following components:
+===============================================
+
+PCI ID table - Contains the supported PCI Vendor/Device IDs.
+(hinic_pci_tbl.h)
+
+Port Commands - Send commands to the HW device for port management
+(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h)
+
+Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit.
+The Logical Tx queue is not dependent on the format of the HW Send Queue.
+(hinic_tx.c, hinic_tx.h)
+
+Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive.
+The Logical Rx queue is not dependent on the format of the HW Receive Queue.
+(hinic_rx.c, hinic_rx.h)
+
+hinic_dev - de/constructs the Logical Tx and Rx Queues.
+(hinic_main.c, hinic_dev.h)
+
+
+Miscellaneous
+=============
+
+Common functions that are used by HW and Logical Device.
+(hinic_common.c, hinic_common.h)
+
+
+Support
+=======
+
+If an issue is identified with the released source code on the supported kernel
+with a supported adapter, email the specific information related to the issue to
+aviad.krawczyk@huawei.com.
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
new file mode 100644
index 0000000000..9827e81608
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -0,0 +1,64 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Ethernet Device Drivers
+=======================
+
+Device drivers for Ethernet and Ethernet-based virtual function devices.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ 3com/3c509
+ 3com/vortex
+ amazon/ena
+ altera/altera_tse
+ amd/pds_core
+ amd/pds_vdpa
+ amd/pds_vfio_pci
+ aquantia/atlantic
+ chelsio/cxgb
+ cirrus/cs89x0
+ dlink/dl2k
+ davicom/dm9000
+ dec/dmfe
+ freescale/dpaa
+ freescale/dpaa2/index
+ freescale/gianfar
+ google/gve
+ huawei/hinic
+ intel/e100
+ intel/e1000
+ intel/e1000e
+ intel/fm10k
+ intel/igb
+ intel/igbvf
+ intel/ixgbe
+ intel/ixgbevf
+ intel/i40e
+ intel/iavf
+ intel/ice
+ marvell/octeontx2
+ marvell/octeon_ep
+ mellanox/mlx5/index
+ microsoft/netvsc
+ neterion/s2io
+ netronome/nfp
+ pensando/ionic
+ smsc/smc9
+ stmicro/stmmac
+ ti/cpsw
+ ti/cpsw_switchdev
+ ti/am65_nuss_cpsw_switchdev
+ ti/tlan
+ toshiba/spider_net
+ wangxun/txgbe
+ wangxun/ngbe
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/ethernet/intel/e100.rst b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
new file mode 100644
index 0000000000..5dee1b53e9
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
@@ -0,0 +1,185 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================================================
+Linux Base Driver for the Intel(R) PRO/100 Family of Adapters
+=============================================================
+
+June 1, 2018
+
+Contents
+========
+
+- In This Release
+- Identifying Your Adapter
+- Building and Installation
+- Driver Configuration Parameters
+- Additional Configurations
+- Known Issues
+- Support
+
+
+In This Release
+===============
+
+This file describes the Linux Base Driver for the Intel(R) PRO/100 Family of
+Adapters. This driver includes support for Itanium(R)2-based systems.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel PRO/100 adapter.
+
+The following features are now available in supported kernels:
+ - Native VLANs
+ - Channel Bonding (teaming)
+ - SNMP
+
+Channel Bonding documentation can be found in the Linux kernel source:
+/Documentation/networking/bonding.rst
+
+
+Identifying Your Adapter
+========================
+
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+Driver Configuration Parameters
+===============================
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+Rx Descriptors:
+ Number of receive descriptors. A receive descriptor is a data
+ structure that describes a receive buffer and its attributes to the network
+ controller. The data in the descriptor is used by the controller to write
+ data from the controller to host memory. In the 3.x.x driver the valid range
+ for this parameter is 64-256. The default value is 256. This parameter can be
+ changed using the command::
+
+ ethtool -G eth? rx n
+
+ Where n is the number of desired Rx descriptors.
+
+Tx Descriptors:
+ Number of transmit descriptors. A transmit descriptor is a data
+ structure that describes a transmit buffer and its attributes to the network
+ controller. The data in the descriptor is used by the controller to read
+ data from the host memory to the controller. In the 3.x.x driver the valid
+ range for this parameter is 64-256. The default value is 128. This parameter
+ can be changed using the command::
+
+ ethtool -G eth? tx n
+
+ Where n is the number of desired Tx descriptors.
+
+Speed/Duplex:
+ The driver auto-negotiates the link speed and duplex settings by
+ default. The ethtool utility can be used as follows to force speed/duplex.::
+
+ ethtool -s eth? autoneg off speed {10|100} duplex {full|half}
+
+ NOTE: setting the speed/duplex to incorrect values will cause the link to
+ fail.
+
+Event Log Message Level:
+ The driver uses the message level flag to log events
+ to syslog. The message level can be set at driver load time. It can also be
+ set using the command::
+
+ ethtool -s eth? msglvl n
+
+
+Additional Configurations
+=========================
+
+Configuring the Driver on Different Distributions
+-------------------------------------------------
+
+Configuring a network driver to load properly when the system is started
+is distribution dependent. Typically, the configuration process involves
+adding an alias line to `/etc/modprobe.d/*.conf` as well as editing other
+system startup scripts and/or configuration files. Many popular Linux
+distributions ship with tools to make these changes for you. To learn
+the proper way to configure a network device for your system, refer to
+your distribution documentation. If during this process you are asked
+for the driver or module name, the name for the Linux Base Driver for
+the Intel PRO/100 Family of Adapters is e100.
+
+As an example, if you install the e100 driver for two PRO/100 adapters
+(eth0 and eth1), add the following to a configuration file in
+/etc/modprobe.d/::
+
+ alias eth0 e100
+ alias eth1 e100
+
+Viewing Link Messages
+---------------------
+
+In order to see link messages and other Intel driver information on your
+console, you must set the dmesg level up to six. This can be done by
+entering the following on the command line before loading the e100
+driver::
+
+ dmesg -n 6
+
+If you wish to see all messages issued by the driver, including debug
+messages, set the dmesg level to eight.
+
+NOTE: This setting is not saved across reboots.
+
+ethtool
+-------
+
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The ethtool
+version 1.6 or later is required for this functionality.
+
+The latest release of ethtool can be found from
+https://www.kernel.org/pub/software/network/ethtool/
+
+Enabling Wake on LAN (WoL)
+--------------------------
+WoL is provided through the ethtool utility. For instructions on
+enabling WoL with ethtool, refer to the ethtool man page. WoL will be
+enabled on the system during the next shut down or reboot. For this
+driver version, in order to enable WoL, the e100 driver must be loaded
+when shutting down or rebooting the system.
+
+NAPI
+----
+
+NAPI (Rx polling mode) is supported in the e100 driver.
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
+
+Multiple Interfaces on Same Ethernet Broadcast Network
+------------------------------------------------------
+
+Due to the default ARP behavior on Linux, it is not possible to have one
+system on two IP networks in the same Ethernet broadcast domain
+(non-partitioned switch) behave as expected. All Ethernet interfaces
+will respond to IP traffic for any IP address assigned to the system.
+This results in unbalanced receive traffic.
+
+If you have multiple interfaces in a server, either turn on ARP
+filtering by
+
+(1) entering::
+
+ echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
+
+ (this only works if your kernel's version is higher than 2.4.5), or
+
+(2) installing the interfaces in separate broadcast domains (either
+ in different switches or in a switch partitioned to VLANs).
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/e1000.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
new file mode 100644
index 0000000000..52a7fb9ce8
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
@@ -0,0 +1,458 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==========================================================
+Linux Base Driver for Intel(R) Ethernet Network Connection
+==========================================================
+
+Intel Gigabit Linux driver.
+Copyright(c) 1999 - 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Command Line Parameters
+- Speed and Duplex Configuration
+- Additional Configurations
+- Support
+
+Identifying Your Adapter
+========================
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/go/network/adapter/idguide.htm
+
+For the latest Intel network drivers for Linux, refer to the following
+website. In the search field, enter your adapter name or type, or use the
+networking link on the left to search for your adapter:
+
+ http://support.intel.com/support/go/network/adapter/home.htm
+
+Command Line Parameters
+=======================
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+NOTES:
+ For more information about the AutoNeg, Duplex, and Speed
+ parameters, see the "Speed and Duplex Configuration" section in
+ this document.
+
+ For more information about the InterruptThrottleRate,
+ RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay
+ parameters, see the application note at:
+ http://www.intel.com/design/network/applnots/ap450.htm
+
+AutoNeg
+-------
+
+(Supported only on adapters with copper connections)
+
+:Valid Range: 0x01-0x0F, 0x20-0x2F
+:Default Value: 0x2F
+
+This parameter is a bit-mask that specifies the speed and duplex settings
+advertised by the adapter. When this parameter is used, the Speed and
+Duplex parameters must not be specified.
+
+NOTE:
+ Refer to the Speed and Duplex section of this readme for more
+ information on the AutoNeg parameter.
+
+Duplex
+------
+
+(Supported only on adapters with copper connections)
+
+:Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full)
+:Default Value: 0
+
+This defines the direction in which data is allowed to flow. Can be
+either one or two-directional. If both Duplex and the link partner are
+set to auto-negotiate, the board auto-detects the correct duplex. If the
+link partner is forced (either full or half), Duplex defaults to half-
+duplex.
+
+FlowControl
+-----------
+
+:Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
+:Default Value: Reads flow control settings from the EEPROM
+
+This parameter controls the automatic generation(Tx) and response(Rx)
+to Ethernet PAUSE frames.
+
+InterruptThrottleRate
+---------------------
+
+(not supported on Intel(R) 82542, 82543 or 82544-based adapters)
+
+:Valid Range:
+ 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative,
+ 4=simplified balancing)
+:Default Value: 3
+
+The driver can limit the amount of interrupts per second that the adapter
+will generate for incoming packets. It does this by writing a value to the
+adapter that is based on the maximum amount of interrupts that the adapter
+will generate per second.
+
+Setting InterruptThrottleRate to a value greater or equal to 100
+will program the adapter to send out a maximum of that many interrupts
+per second, even if more packets have come in. This reduces interrupt
+load on the system and can lower CPU utilization under heavy load,
+but will increase latency as packets are not processed as quickly.
+
+The default behaviour of the driver previously assumed a static
+InterruptThrottleRate value of 8000, providing a good fallback value for
+all traffic types,but lacking in small packet performance and latency.
+The hardware can handle many more small packets per second however, and
+for this reason an adaptive interrupt moderation algorithm was implemented.
+
+Since 7.3.x, the driver has two adaptive modes (setting 1 or 3) in which
+it dynamically adjusts the InterruptThrottleRate value based on the traffic
+that it receives. After determining the type of incoming traffic in the last
+timeframe, it will adjust the InterruptThrottleRate to an appropriate value
+for that traffic.
+
+The algorithm classifies the incoming traffic every interval into
+classes. Once the class is determined, the InterruptThrottleRate value is
+adjusted to suit that traffic type the best. There are three classes defined:
+"Bulk traffic", for large amounts of packets of normal size; "Low latency",
+for small amounts of traffic and/or a significant percentage of small
+packets; and "Lowest latency", for almost completely small packets or
+minimal traffic.
+
+In dynamic conservative mode, the InterruptThrottleRate value is set to 4000
+for traffic that falls in class "Bulk traffic". If traffic falls in the "Low
+latency" or "Lowest latency" class, the InterruptThrottleRate is increased
+stepwise to 20000. This default mode is suitable for most applications.
+
+For situations where low latency is vital such as cluster or
+grid computing, the algorithm can reduce latency even more when
+InterruptThrottleRate is set to mode 1. In this mode, which operates
+the same as mode 3, the InterruptThrottleRate will be increased stepwise to
+70000 for traffic in class "Lowest latency".
+
+In simplified mode the interrupt rate is based on the ratio of TX and
+RX traffic. If the bytes per second rate is approximately equal, the
+interrupt rate will drop as low as 2000 interrupts per second. If the
+traffic is mostly transmit or mostly receive, the interrupt rate could
+be as high as 8000.
+
+Setting InterruptThrottleRate to 0 turns off any interrupt moderation
+and may improve small packet latency, but is generally not suitable
+for bulk throughput traffic.
+
+NOTE:
+ InterruptThrottleRate takes precedence over the TxAbsIntDelay and
+ RxAbsIntDelay parameters. In other words, minimizing the receive
+ and/or transmit absolute delays does not force the controller to
+ generate more interrupts than what the Interrupt Throttle Rate
+ allows.
+
+CAUTION:
+ If you are using the Intel(R) PRO/1000 CT Network Connection
+ (controller 82547), setting InterruptThrottleRate to a value
+ greater than 75,000, may hang (stop transmitting) adapters
+ under certain network conditions. If this occurs a NETDEV
+ WATCHDOG message is logged in the system event log. In
+ addition, the controller is automatically reset, restoring
+ the network connection. To eliminate the potential for the
+ hang, ensure that InterruptThrottleRate is set no greater
+ than 75,000 and is not set to 0.
+
+NOTE:
+ When e1000 is loaded with default settings and multiple adapters
+ are in use simultaneously, the CPU utilization may increase non-
+ linearly. In order to limit the CPU utilization without impacting
+ the overall throughput, we recommend that you load the driver as
+ follows::
+
+ modprobe e1000 InterruptThrottleRate=3000,3000,3000
+
+ This sets the InterruptThrottleRate to 3000 interrupts/sec for
+ the first, second, and third instances of the driver. The range
+ of 2000 to 3000 interrupts per second works on a majority of
+ systems and is a good starting point, but the optimal value will
+ be platform-specific. If CPU utilization is not a concern, use
+ RX_POLLING (NAPI) and default driver settings.
+
+RxDescriptors
+-------------
+
+:Valid Range:
+ - 48-256 for 82542 and 82543-based adapters
+ - 48-4096 for all other supported adapters
+:Default Value: 256
+
+This value specifies the number of receive buffer descriptors allocated
+by the driver. Increasing this value allows the driver to buffer more
+incoming packets, at the expense of increased system memory utilization.
+
+Each descriptor is 16 bytes. A receive buffer is also allocated for each
+descriptor and can be either 2048, 4096, 8192, or 16384 bytes, depending
+on the MTU setting. The maximum MTU size is 16110.
+
+NOTE:
+ MTU designates the frame size. It only needs to be set for Jumbo
+ Frames. Depending on the available system resources, the request
+ for a higher number of receive descriptors may be denied. In this
+ case, use a lower number.
+
+RxIntDelay
+----------
+
+:Valid Range: 0-65535 (0=off)
+:Default Value: 0
+
+This value delays the generation of receive interrupts in units of 1.024
+microseconds. Receive interrupt reduction can improve CPU efficiency if
+properly tuned for specific network traffic. Increasing this value adds
+extra latency to frame reception and can end up decreasing the throughput
+of TCP traffic. If the system is reporting dropped receives, this value
+may be set too high, causing the driver to run out of available receive
+descriptors.
+
+CAUTION:
+ When setting RxIntDelay to a value other than 0, adapters may
+ hang (stop transmitting) under certain network conditions. If
+ this occurs a NETDEV WATCHDOG message is logged in the system
+ event log. In addition, the controller is automatically reset,
+ restoring the network connection. To eliminate the potential
+ for the hang ensure that RxIntDelay is set to 0.
+
+RxAbsIntDelay
+-------------
+
+(This parameter is supported only on 82540, 82545 and later adapters.)
+
+:Valid Range: 0-65535 (0=off)
+:Default Value: 128
+
+This value, in units of 1.024 microseconds, limits the delay in which a
+receive interrupt is generated. Useful only if RxIntDelay is non-zero,
+this value ensures that an interrupt is generated after the initial
+packet is received within the set amount of time. Proper tuning,
+along with RxIntDelay, may improve traffic throughput in specific network
+conditions.
+
+Speed
+-----
+
+(This parameter is supported only on adapters with copper connections.)
+
+:Valid Settings: 0, 10, 100, 1000
+:Default Value: 0 (auto-negotiate at all supported speeds)
+
+Speed forces the line speed to the specified value in megabits per second
+(Mbps). If this parameter is not specified or is set to 0 and the link
+partner is set to auto-negotiate, the board will auto-detect the correct
+speed. Duplex should also be set when Speed is set to either 10 or 100.
+
+TxDescriptors
+-------------
+
+:Valid Range:
+ - 48-256 for 82542 and 82543-based adapters
+ - 48-4096 for all other supported adapters
+:Default Value: 256
+
+This value is the number of transmit descriptors allocated by the driver.
+Increasing this value allows the driver to queue more transmits. Each
+descriptor is 16 bytes.
+
+NOTE:
+ Depending on the available system resources, the request for a
+ higher number of transmit descriptors may be denied. In this case,
+ use a lower number.
+
+TxIntDelay
+----------
+
+:Valid Range: 0-65535 (0=off)
+:Default Value: 8
+
+This value delays the generation of transmit interrupts in units of
+1.024 microseconds. Transmit interrupt reduction can improve CPU
+efficiency if properly tuned for specific network traffic. If the
+system is reporting dropped transmits, this value may be set too high
+causing the driver to run out of available transmit descriptors.
+
+TxAbsIntDelay
+-------------
+
+(This parameter is supported only on 82540, 82545 and later adapters.)
+
+:Valid Range: 0-65535 (0=off)
+:Default Value: 32
+
+This value, in units of 1.024 microseconds, limits the delay in which a
+transmit interrupt is generated. Useful only if TxIntDelay is non-zero,
+this value ensures that an interrupt is generated after the initial
+packet is sent on the wire within the set amount of time. Proper tuning,
+along with TxIntDelay, may improve traffic throughput in specific
+network conditions.
+
+XsumRX
+------
+
+(This parameter is NOT supported on the 82542-based adapter.)
+
+:Valid Range: 0-1
+:Default Value: 1
+
+A value of '1' indicates that the driver should enable IP checksum
+offload for received packets (both UDP and TCP) to the adapter hardware.
+
+Copybreak
+---------
+
+:Valid Range: 0-xxxxxxx (0=off)
+:Default Value: 256
+:Usage: modprobe e1000.ko copybreak=128
+
+Driver copies all packets below or equaling this size to a fresh RX
+buffer before handing it up the stack.
+
+This parameter is different than other parameters, in that it is a
+single (not 1,1,1 etc.) parameter applied to all driver instances and
+it is also available during runtime at
+/sys/module/e1000/parameters/copybreak
+
+SmartPowerDownEnable
+--------------------
+
+:Valid Range: 0-1
+:Default Value: 0 (disabled)
+
+Allows PHY to turn off in lower power states. The user can turn off
+this parameter in supported chipsets.
+
+Speed and Duplex Configuration
+==============================
+
+Three keywords are used to control the speed and duplex configuration.
+These keywords are Speed, Duplex, and AutoNeg.
+
+If the board uses a fiber interface, these keywords are ignored, and the
+fiber interface board only links at 1000 Mbps full-duplex.
+
+For copper-based boards, the keywords interact as follows:
+
+- The default operation is auto-negotiate. The board advertises all
+ supported speed and duplex combinations, and it links at the highest
+ common speed and duplex mode IF the link partner is set to auto-negotiate.
+
+- If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps
+ is advertised (The 1000BaseT spec requires auto-negotiation.)
+
+- If Speed = 10 or 100, then both Speed and Duplex should be set. Auto-
+ negotiation is disabled, and the AutoNeg parameter is ignored. Partner
+ SHOULD also be forced.
+
+The AutoNeg parameter is used when more control is required over the
+auto-negotiation process. It should be used when you wish to control which
+speed and duplex combinations are advertised during the auto-negotiation
+process.
+
+The parameter may be specified as either a decimal or hexadecimal value as
+determined by the bitmap below.
+
+============== ====== ====== ======= ======= ====== ====== ======= ======
+Bit position 7 6 5 4 3 2 1 0
+Decimal Value 128 64 32 16 8 4 2 1
+Hex value 80 40 20 10 8 4 2 1
+Speed (Mbps) N/A N/A 1000 N/A 100 100 10 10
+Duplex Full Full Half Full Half
+============== ====== ====== ======= ======= ====== ====== ======= ======
+
+Some examples of using AutoNeg::
+
+ modprobe e1000 AutoNeg=0x01 (Restricts autonegotiation to 10 Half)
+ modprobe e1000 AutoNeg=1 (Same as above)
+ modprobe e1000 AutoNeg=0x02 (Restricts autonegotiation to 10 Full)
+ modprobe e1000 AutoNeg=0x03 (Restricts autonegotiation to 10 Half or 10 Full)
+ modprobe e1000 AutoNeg=0x04 (Restricts autonegotiation to 100 Half)
+ modprobe e1000 AutoNeg=0x05 (Restricts autonegotiation to 10 Half or 100
+ Half)
+ modprobe e1000 AutoNeg=0x020 (Restricts autonegotiation to 1000 Full)
+ modprobe e1000 AutoNeg=32 (Same as above)
+
+Note that when this parameter is used, Speed and Duplex must not be specified.
+
+If the link partner is forced to a specific speed and duplex, then this
+parameter should not be used. Instead, use the Speed and Duplex parameters
+previously mentioned to force the adapter to the same speed and duplex.
+
+Additional Configurations
+=========================
+
+Jumbo Frames
+------------
+
+ Jumbo Frames support is enabled by changing the MTU to a value larger than
+ the default of 1500. Use the ifconfig command to increase the MTU size.
+ For example::
+
+ ifconfig eth<x> mtu 9000 up
+
+ This setting is not saved across reboots. It can be made permanent if
+ you add::
+
+ MTU=9000
+
+ to the file /etc/sysconfig/network-scripts/ifcfg-eth<x>. This example
+ applies to the Red Hat distributions; other distributions may store this
+ setting in a different location.
+
+Notes:
+ Degradation in throughput performance may be observed in some Jumbo frames
+ environments. If this is observed, increasing the application's socket buffer
+ size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help.
+ See the specific application manual and /usr/src/linux*/Documentation/
+ networking/ip-sysctl.txt for more details.
+
+ - The maximum MTU setting for Jumbo Frames is 16110. This value coincides
+ with the maximum Jumbo Frames size of 16128.
+
+ - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in
+ poor performance or loss of link.
+
+ - Adapters based on the Intel(R) 82542 and 82573V/E controller do not
+ support Jumbo Frames. These correspond to the following product names::
+
+ Intel(R) PRO/1000 Gigabit Server Adapter
+ Intel(R) PRO/1000 PM Network Connection
+
+ethtool
+-------
+
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The ethtool
+ version 1.6 or later is required for this functionality.
+
+ The latest release of ethtool can be found from
+ https://www.kernel.org/pub/software/network/ethtool/
+
+Enabling Wake on LAN (WoL)
+--------------------------
+
+ WoL is configured through the ethtool utility.
+
+ WoL will be enabled on the system during the next shut down or reboot.
+ For this driver version, in order to enable WoL, the e1000 driver must be
+ loaded when shutting down or rebooting the system.
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+http://support.intel.com
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
new file mode 100644
index 0000000000..d8f810afdd
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
@@ -0,0 +1,378 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=====================================================
+Linux Driver for Intel(R) Ethernet Network Connection
+=====================================================
+
+Intel Gigabit Linux driver.
+Copyright(c) 2008-2018 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Command Line Parameters
+- Additional Configurations
+- Support
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Command Line Parameters
+=======================
+If the driver is built as a module, the following optional parameters are used
+by entering them on the command line with the modprobe command using this
+syntax::
+
+ modprobe e1000e [<option>=<VAL1>,<VAL2>,...]
+
+There needs to be a <VAL#> for each network port in the system supported by
+this driver. The values will be applied to each instance, in function order.
+For example::
+
+ modprobe e1000e InterruptThrottleRate=16000,16000
+
+In this case, there are two network ports supported by e1000e in the system.
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+NOTE: A descriptor describes a data buffer and attributes related to the data
+buffer. This information is accessed by the hardware.
+
+InterruptThrottleRate
+---------------------
+:Valid Range: 0,1,3,4,100-100000
+:Default Value: 3
+
+Interrupt Throttle Rate controls the number of interrupts each interrupt
+vector can generate per second. Increasing ITR lowers latency at the cost of
+increased CPU utilization, though it may help throughput in some circumstances.
+
+Setting InterruptThrottleRate to a value greater or equal to 100
+will program the adapter to send out a maximum of that many interrupts
+per second, even if more packets have come in. This reduces interrupt
+load on the system and can lower CPU utilization under heavy load,
+but will increase latency as packets are not processed as quickly.
+
+The default behaviour of the driver previously assumed a static
+InterruptThrottleRate value of 8000, providing a good fallback value for
+all traffic types, but lacking in small packet performance and latency.
+The hardware can handle many more small packets per second however, and
+for this reason an adaptive interrupt moderation algorithm was implemented.
+
+The driver has two adaptive modes (setting 1 or 3) in which
+it dynamically adjusts the InterruptThrottleRate value based on the traffic
+that it receives. After determining the type of incoming traffic in the last
+timeframe, it will adjust the InterruptThrottleRate to an appropriate value
+for that traffic.
+
+The algorithm classifies the incoming traffic every interval into
+classes. Once the class is determined, the InterruptThrottleRate value is
+adjusted to suit that traffic type the best. There are three classes defined:
+"Bulk traffic", for large amounts of packets of normal size; "Low latency",
+for small amounts of traffic and/or a significant percentage of small
+packets; and "Lowest latency", for almost completely small packets or
+minimal traffic.
+
+ - 0: Off
+ Turns off any interrupt moderation and may improve small packet latency.
+ However, this is generally not suitable for bulk throughput traffic due
+ to the increased CPU utilization of the higher interrupt rate.
+ - 1: Dynamic mode
+ This mode attempts to moderate interrupts per vector while maintaining
+ very low latency. This can sometimes cause extra CPU utilization. If
+ planning on deploying e1000e in a latency sensitive environment, this
+ parameter should be considered.
+ - 3: Dynamic Conservative mode (default)
+ In dynamic conservative mode, the InterruptThrottleRate value is set to
+ 4000 for traffic that falls in class "Bulk traffic". If traffic falls in
+ the "Low latency" or "Lowest latency" class, the InterruptThrottleRate is
+ increased stepwise to 20000. This default mode is suitable for most
+ applications.
+ - 4: Simplified Balancing mode
+ In simplified mode the interrupt rate is based on the ratio of TX and
+ RX traffic. If the bytes per second rate is approximately equal, the
+ interrupt rate will drop as low as 2000 interrupts per second. If the
+ traffic is mostly transmit or mostly receive, the interrupt rate could
+ be as high as 8000.
+ - 100-100000:
+ Setting InterruptThrottleRate to a value greater or equal to 100
+ will program the adapter to send at most that many interrupts per second,
+ even if more packets have come in. This reduces interrupt load on the
+ system and can lower CPU utilization under heavy load, but will increase
+ latency as packets are not processed as quickly.
+
+NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and
+RxAbsIntDelay parameters. In other words, minimizing the receive and/or
+transmit absolute delays does not force the controller to generate more
+interrupts than what the Interrupt Throttle Rate allows.
+
+RxIntDelay
+----------
+:Valid Range: 0-65535 (0=off)
+:Default Value: 0
+
+This value delays the generation of receive interrupts in units of 1.024
+microseconds. Receive interrupt reduction can improve CPU efficiency if
+properly tuned for specific network traffic. Increasing this value adds extra
+latency to frame reception and can end up decreasing the throughput of TCP
+traffic. If the system is reporting dropped receives, this value may be set
+too high, causing the driver to run out of available receive descriptors.
+
+CAUTION: When setting RxIntDelay to a value other than 0, adapters may hang
+(stop transmitting) under certain network conditions. If this occurs a NETDEV
+WATCHDOG message is logged in the system event log. In addition, the
+controller is automatically reset, restoring the network connection. To
+eliminate the potential for the hang ensure that RxIntDelay is set to 0.
+
+RxAbsIntDelay
+-------------
+:Valid Range: 0-65535 (0=off)
+:Default Value: 8
+
+This value, in units of 1.024 microseconds, limits the delay in which a
+receive interrupt is generated. This value ensures that an interrupt is
+generated after the initial packet is received within the set amount of time,
+which is useful only if RxIntDelay is non-zero. Proper tuning, along with
+RxIntDelay, may improve traffic throughput in specific network conditions.
+
+TxIntDelay
+----------
+:Valid Range: 0-65535 (0=off)
+:Default Value: 8
+
+This value delays the generation of transmit interrupts in units of 1.024
+microseconds. Transmit interrupt reduction can improve CPU efficiency if
+properly tuned for specific network traffic. If the system is reporting
+dropped transmits, this value may be set too high causing the driver to run
+out of available transmit descriptors.
+
+TxAbsIntDelay
+-------------
+:Valid Range: 0-65535 (0=off)
+:Default Value: 32
+
+This value, in units of 1.024 microseconds, limits the delay in which a
+transmit interrupt is generated. It is useful only if TxIntDelay is non-zero.
+It ensures that an interrupt is generated after the initial Packet is sent on
+the wire within the set amount of time. Proper tuning, along with TxIntDelay,
+may improve traffic throughput in specific network conditions.
+
+copybreak
+---------
+:Valid Range: 0-xxxxxxx (0=off)
+:Default Value: 256
+
+The driver copies all packets below or equaling this size to a fresh receive
+buffer before handing it up the stack.
+This parameter differs from other parameters because it is a single (not 1,1,1
+etc.) parameter applied to all driver instances and it is also available
+during runtime at /sys/module/e1000e/parameters/copybreak.
+
+To use copybreak, type::
+
+ modprobe e1000e.ko copybreak=128
+
+SmartPowerDownEnable
+--------------------
+:Valid Range: 0,1
+:Default Value: 0 (disabled)
+
+Allows the PHY to turn off in lower power states. The user can turn off this
+parameter in supported chipsets.
+
+KumeranLockLoss
+---------------
+:Valid Range: 0,1
+:Default Value: 1 (enabled)
+
+This workaround skips resetting the PHY at shutdown for the initial silicon
+releases of ICH8 systems.
+
+IntMode
+-------
+:Valid Range: 0-2
+:Default Value: 0
+
+ +-------+----------------+
+ | Value | Interrupt Mode |
+ +=======+================+
+ | 0 | Legacy |
+ +-------+----------------+
+ | 1 | MSI |
+ +-------+----------------+
+ | 2 | MSI-X |
+ +-------+----------------+
+
+IntMode allows load time control over the type of interrupt registered for by
+the driver. MSI-X is required for multiple queue support, and some kernels and
+combinations of kernel .config options will force a lower level of interrupt
+support.
+
+This command will show different values for each type of interrupt::
+
+ cat /proc/interrupts
+
+CrcStripping
+------------
+:Valid Range: 0,1
+:Default Value: 1 (enabled)
+
+Strip the CRC from received packets before sending up the network stack. If
+you have a machine with a BMC enabled but cannot receive IPMI traffic after
+loading or enabling the driver, try disabling this feature.
+
+WriteProtectNVM
+---------------
+:Valid Range: 0,1
+:Default Value: 1 (enabled)
+
+If set to 1, configure the hardware to ignore all write/erase cycles to the
+GbE region in the ICHx NVM (in order to prevent accidental corruption of the
+NVM). This feature can be disabled by setting the parameter to 0 during initial
+driver load.
+
+NOTE: The machine must be power cycled (full off/on) when enabling NVM writes
+via setting the parameter to zero. Once the NVM has been locked (via the
+parameter at 1 when the driver loads) it cannot be unlocked except via power
+cycle.
+
+Debug
+-----
+:Valid Range: 0-16 (0=none,...,16=all)
+:Default Value: 0
+
+This parameter adjusts the level of debug messages displayed in the system logs.
+
+
+Additional Features and Configurations
+======================================
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <x> is the interface number::
+
+ ifconfig eth<x> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+ ip link set mtu 9000 dev eth<x>
+ ip link set up dev eth<x>
+
+This setting is not saved across reboots. The setting change can be made
+permanent by adding 'MTU=9000' to the file:
+
+- For RHEL: /etc/sysconfig/network-scripts/ifcfg-eth<x>
+- For SLES: /etc/sysconfig/network/<config_file>
+
+NOTE: The maximum MTU setting for Jumbo Frames is 8996. This value coincides
+with the maximum Jumbo Frames size of 9018 bytes.
+
+NOTE: Using Jumbo frames at 10 or 100 Mbps is not supported and may result in
+poor performance or loss of link.
+
+NOTE: The following adapters limit Jumbo Frames sized packets to a maximum of
+4088 bytes:
+
+ - Intel(R) 82578DM Gigabit Network Connection
+ - Intel(R) 82577LM Gigabit Network Connection
+
+The following adapters do not support Jumbo Frames:
+
+ - Intel(R) PRO/1000 Gigabit Server Adapter
+ - Intel(R) PRO/1000 PM Network Connection
+ - Intel(R) 82562G 10/100 Network Connection
+ - Intel(R) 82562G-2 10/100 Network Connection
+ - Intel(R) 82562GT 10/100 Network Connection
+ - Intel(R) 82562GT-2 10/100 Network Connection
+ - Intel(R) 82562V 10/100 Network Connection
+ - Intel(R) 82562V-2 10/100 Network Connection
+ - Intel(R) 82566DC Gigabit Network Connection
+ - Intel(R) 82566DC-2 Gigabit Network Connection
+ - Intel(R) 82566DM Gigabit Network Connection
+ - Intel(R) 82566MC Gigabit Network Connection
+ - Intel(R) 82566MM Gigabit Network Connection
+ - Intel(R) 82567V-3 Gigabit Network Connection
+ - Intel(R) 82577LC Gigabit Network Connection
+ - Intel(R) 82578DC Gigabit Network Connection
+
+NOTE: Jumbo Frames cannot be configured on an 82579-based Network device if
+MACSec is enabled on the system.
+
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+
+https://www.kernel.org/pub/software/network/ethtool/
+
+NOTE: When validating enable/disable tests on some parts (for example, 82578),
+it is necessary to add a few seconds between tests when working with ethtool.
+
+
+Speed and Duplex Configuration
+------------------------------
+In addressing speed and duplex configuration issues, you need to distinguish
+between copper-based adapters and fiber-based adapters.
+
+In the default mode, an Intel(R) Ethernet Network Adapter using copper
+connections will attempt to auto-negotiate with its link partner to determine
+the best setting. If the adapter cannot establish link with the link partner
+using auto-negotiation, you may need to manually configure the adapter and link
+partner to identical settings to establish link and pass packets. This should
+only be needed when attempting to link with an older switch that does not
+support auto-negotiation or one that has been forced to a specific speed or
+duplex mode. Your link partner must match the setting you choose. 1 Gbps speeds
+and higher cannot be forced. Use the autonegotiation advertising setting to
+manually set devices for 1 Gbps and higher.
+
+Speed, duplex, and autonegotiation advertising are configured through the
+ethtool utility.
+
+Caution: Only experienced network administrators should force speed and duplex
+or change autonegotiation advertising manually. The settings at the switch must
+always match the adapter settings. Adapter performance may suffer or your
+adapter may not operate if you configure the adapter differently from your
+switch.
+
+An Intel(R) Ethernet Network Adapter using fiber-based connections, however,
+will not attempt to auto-negotiate with its link partner since those adapters
+operate only in full duplex and only at their native speed.
+
+
+Enabling Wake on LAN (WoL)
+--------------------------
+WoL is configured through the ethtool utility.
+
+WoL will be enabled on the system during the next shut down or reboot. For
+this driver version, in order to enable WoL, the e1000e driver must be loaded
+prior to shutting down or suspending the system.
+
+NOTE: Wake on LAN is only supported on port A for the following devices:
+- Intel(R) PRO/1000 PT Dual Port Network Connection
+- Intel(R) PRO/1000 PT Dual Port Server Connection
+- Intel(R) PRO/1000 PT Dual Port Server Adapter
+- Intel(R) PRO/1000 PF Dual Port Server Adapter
+- Intel(R) PRO/1000 PT Quad Port Server Adapter
+- Intel(R) Gigabit PT Quad Port Server ExpressModule
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
new file mode 100644
index 0000000000..396a2c8c3d
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
@@ -0,0 +1,137 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================================================
+Linux Base Driver for Intel(R) Ethernet Multi-host Controller
+=============================================================
+
+August 20, 2018
+Copyright(c) 2015-2018 Intel Corporation.
+
+Contents
+========
+- Identifying Your Adapter
+- Additional Configurations
+- Performance Tuning
+- Known Issues
+- Support
+
+Identifying Your Adapter
+========================
+The driver in this release is compatible with devices based on the Intel(R)
+Ethernet Multi-host Controller.
+
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Flow Control
+------------
+The Intel(R) Ethernet Switch Host Interface Driver does not support Flow
+Control. It will not send pause frames. This may result in dropped frames.
+
+
+Virtual Functions (VFs)
+-----------------------
+Use sysfs to enable VFs.
+Valid Range: 0-64
+
+For example::
+
+ echo $num_vf_enabled > /sys/class/net/$dev/device/sriov_numvfs //enable VFs
+ echo 0 > /sys/class/net/$dev/device/sriov_numvfs //disable VFs
+
+NOTE: Neither the device nor the driver control how VFs are mapped into config
+space. Bus layout will vary by operating system. On operating systems that
+support it, you can check sysfs to find the mapping.
+
+NOTE: When SR-IOV mode is enabled, hardware VLAN filtering and VLAN tag
+stripping/insertion will remain enabled. Please remove the old VLAN filter
+before the new VLAN filter is added. For example::
+
+ ip link set eth0 vf 0 vlan 100 // set vlan 100 for VF 0
+ ip link set eth0 vf 0 vlan 0 // Delete vlan 100
+ ip link set eth0 vf 0 vlan 200 // set a new vlan 200 for VF 0
+
+
+Additional Features and Configurations
+======================================
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <x> is the interface number::
+
+ ifconfig eth<x> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+ ip link set mtu 9000 dev eth<x>
+ ip link set up dev eth<x>
+
+This setting is not saved across reboots. The setting change can be made
+permanent by adding 'MTU=9000' to the file:
+
+- For RHEL: /etc/sysconfig/network-scripts/ifcfg-eth<x>
+- For SLES: /etc/sysconfig/network/<config_file>
+
+NOTE: The maximum MTU setting for Jumbo Frames is 15342. This value coincides
+with the maximum Jumbo Frames size of 15364 bytes.
+
+NOTE: This driver will attempt to use multiple page sized buffers to receive
+each jumbo packet. This should help to avoid buffer starvation issues when
+allocating receive packets.
+
+
+Generic Receive Offload, aka GRO
+--------------------------------
+The driver supports the in-kernel software implementation of GRO. GRO has
+shown that by coalescing Rx traffic into larger chunks of data, CPU
+utilization can be significantly reduced when under large Rx load. GRO is an
+evolution of the previously-used LRO interface. GRO is able to coalesce
+other protocols besides TCP. It's also safe to use with configurations that
+are problematic for LRO, namely bridging and iSCSI.
+
+
+
+Supported ethtool Commands and Options for Filtering
+----------------------------------------------------
+-n --show-nfc
+ Retrieves the receive network flow classification configurations.
+
+rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6
+ Retrieves the hash options for the specified network traffic type.
+
+-N --config-nfc
+ Configures the receive network flow classification.
+
+rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 m|v|t|s|d|f|n|r
+ Configures the hash options for the specified network traffic type.
+
+- udp4: UDP over IPv4
+- udp6: UDP over IPv6
+- f Hash on bytes 0 and 1 of the Layer 4 header of the rx packet.
+- n Hash on bytes 2 and 3 of the Layer 4 header of the rx packet.
+
+
+Known Issues/Troubleshooting
+============================
+
+Enabling SR-IOV in a 64-bit Microsoft Windows Server 2012/R2 guest OS under Linux KVM
+-------------------------------------------------------------------------------------
+KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. This
+includes traditional PCIe devices, as well as SR-IOV-capable devices based on
+the Intel Ethernet Controller XL710.
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
new file mode 100644
index 0000000000..4fbaa1a2d6
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
@@ -0,0 +1,766 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Ethernet Controller 700 Series
+=================================================================
+
+Intel 40 Gigabit Linux driver.
+Copyright(c) 1999-2018 Intel Corporation.
+
+Contents
+========
+
+- Overview
+- Identifying Your Adapter
+- Intel(R) Ethernet Flow Director
+- Additional Configurations
+- Known Issues
+- Support
+
+
+Driver information can be obtained using ethtool, lspci, and ifconfig.
+Instructions on updating ethtool can be found in the section Additional
+Configurations later in this document.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+
+Identifying Your Adapter
+========================
+The driver is compatible with devices based on the following:
+
+ * Intel(R) Ethernet Controller X710
+ * Intel(R) Ethernet Controller XL710
+ * Intel(R) Ethernet Network Connection X722
+ * Intel(R) Ethernet Controller XXV710
+
+For the best performance, make sure the latest NVM/FW is installed on your
+device.
+
+For information on how to identify your adapter, and for the latest NVM/FW
+images and Intel network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+SFP+ and QSFP+ Devices
+----------------------
+For information about supported media, refer to this document:
+https://www.intel.com/content/dam/www/public/us/en/documents/release-notes/xl710-ethernet-controller-feature-matrix.pdf
+
+NOTE: Some adapters based on the Intel(R) Ethernet Controller 700 Series only
+support Intel Ethernet Optics modules. On these adapters, other modules are not
+supported and will not function. In all cases Intel recommends using Intel
+Ethernet Optics; other modules may function but are not validated by Intel.
+Contact Intel for supported media types.
+
+NOTE: For connections based on Intel(R) Ethernet Controller 700 Series, support
+is dependent on your system board. Please see your vendor for details.
+
+NOTE: In systems that do not have adequate airflow to cool the adapter and
+optical modules, you must use high temperature optical modules.
+
+Virtual Functions (VFs)
+-----------------------
+Use sysfs to enable VFs. For example::
+
+ #echo $num_vf_enabled > /sys/class/net/$dev/device/sriov_numvfs #enable VFs
+ #echo 0 > /sys/class/net/$dev/device/sriov_numvfs #disable VFs
+
+For example, the following instructions will configure PF eth0 and the first VF
+on VLAN 10::
+
+ $ ip link set dev eth0 vf 0 vlan 10
+
+VLAN Tag Packet Steering
+------------------------
+Allows you to send all packets with a specific VLAN tag to a particular SR-IOV
+virtual function (VF). Further, this feature allows you to designate a
+particular VF as trusted, and allows that trusted VF to request selective
+promiscuous mode on the Physical Function (PF).
+
+To set a VF as trusted or untrusted, enter the following command in the
+Hypervisor::
+
+ # ip link set dev eth0 vf 1 trust [on|off]
+
+Once the VF is designated as trusted, use the following commands in the VM to
+set the VF to promiscuous mode.
+
+::
+
+ For promiscuous all:
+ #ip link set eth2 promisc on
+ Where eth2 is a VF interface in the VM
+
+ For promiscuous Multicast:
+ #ip link set eth2 allmulticast on
+ Where eth2 is a VF interface in the VM
+
+NOTE: By default, the ethtool priv-flag vf-true-promisc-support is set to
+"off",meaning that promiscuous mode for the VF will be limited. To set the
+promiscuous mode for the VF to true promiscuous and allow the VF to see all
+ingress traffic, use the following command::
+
+ #ethtool -set-priv-flags p261p1 vf-true-promisc-support on
+
+The vf-true-promisc-support priv-flag does not enable promiscuous mode; rather,
+it designates which type of promiscuous mode (limited or true) you will get
+when you enable promiscuous mode using the ip link commands above. Note that
+this is a global setting that affects the entire device. However,the
+vf-true-promisc-support priv-flag is only exposed to the first PF of the
+device. The PF remains in limited promiscuous mode (unless it is in MFP mode)
+regardless of the vf-true-promisc-support setting.
+
+Now add a VLAN interface on the VF interface::
+
+ #ip link add link eth2 name eth2.100 type vlan id 100
+
+Note that the order in which you set the VF to promiscuous mode and add the
+VLAN interface does not matter (you can do either first). The end result in
+this example is that the VF will get all traffic that is tagged with VLAN 100.
+
+Intel(R) Ethernet Flow Director
+-------------------------------
+The Intel Ethernet Flow Director performs the following tasks:
+
+- Directs receive packets according to their flows to different queues.
+- Enables tight control on routing a flow in the platform.
+- Matches flows and CPU cores for flow affinity.
+- Supports multiple parameters for flexible flow classification and load
+ balancing (in SFP mode only).
+
+NOTE: The Linux i40e driver supports the following flow types: IPv4, TCPv4, and
+UDPv4. For a given flow type, it supports valid combinations of IP addresses
+(source or destination) and UDP/TCP ports (source and destination). For
+example, you can supply only a source IP address, a source IP address and a
+destination port, or any combination of one or more of these four parameters.
+
+NOTE: The Linux i40e driver allows you to filter traffic based on a
+user-defined flexible two-byte pattern and offset by using the ethtool user-def
+and mask fields. Only L3 and L4 flow types are supported for user-defined
+flexible filters. For a given flow type, you must clear all Intel Ethernet Flow
+Director filters before changing the input set (for that flow type).
+
+To enable or disable the Intel Ethernet Flow Director::
+
+ # ethtool -K ethX ntuple <on|off>
+
+When disabling ntuple filters, all the user programmed filters are flushed from
+the driver cache and hardware. All needed filters must be re-added when ntuple
+is re-enabled.
+
+To add a filter that directs packet to queue 2, use -U or -N switch::
+
+ # ethtool -N ethX flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 src-port 2000 dst-port 2001 action 2 [loc 1]
+
+To set a filter using only the source and destination IP address::
+
+ # ethtool -N ethX flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 action 2 [loc 1]
+
+To see the list of filters currently present::
+
+ # ethtool <-u|-n> ethX
+
+Application Targeted Routing (ATR) Perfect Filters
+--------------------------------------------------
+ATR is enabled by default when the kernel is in multiple transmit queue mode.
+An ATR Intel Ethernet Flow Director filter rule is added when a TCP-IP flow
+starts and is deleted when the flow ends. When a TCP-IP Intel Ethernet Flow
+Director rule is added from ethtool (Sideband filter), ATR is turned off by the
+driver. To re-enable ATR, the sideband can be disabled with the ethtool -K
+option. For example::
+
+ ethtool -K [adapter] ntuple [off|on]
+
+If sideband is re-enabled after ATR is re-enabled, ATR remains enabled until a
+TCP-IP flow is added. When all TCP-IP sideband rules are deleted, ATR is
+automatically re-enabled.
+
+Packets that match the ATR rules are counted in fdir_atr_match stats in
+ethtool, which also can be used to verify whether ATR rules still exist.
+
+Sideband Perfect Filters
+------------------------
+Sideband Perfect Filters are used to direct traffic that matches specified
+characteristics. They are enabled through ethtool's ntuple interface. To add a
+new filter use the following command::
+
+ ethtool -U <device> flow-type <type> src-ip <ip> dst-ip <ip> src-port <port> \
+ dst-port <port> action <queue>
+
+Where:
+ <device> - the ethernet device to program
+ <type> - can be ip4, tcp4, udp4, or sctp4
+ <ip> - the ip address to match on
+ <port> - the port number to match on
+ <queue> - the queue to direct traffic towards (-1 discards matching traffic)
+
+Use the following command to display all of the active filters::
+
+ ethtool -u <device>
+
+Use the following command to delete a filter::
+
+ ethtool -U <device> delete <N>
+
+Where <N> is the filter id displayed when printing all the active filters, and
+may also have been specified using "loc <N>" when adding the filter.
+
+The following example matches TCP traffic sent from 192.168.0.1, port 5300,
+directed to 192.168.0.5, port 80, and sends it to queue 7::
+
+ ethtool -U enp130s0 flow-type tcp4 src-ip 192.168.0.1 dst-ip 192.168.0.5 \
+ src-port 5300 dst-port 80 action 7
+
+For each flow-type, the programmed filters must all have the same matching
+input set. For example, issuing the following two commands is acceptable::
+
+ ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.5 src-port 55 action 10
+
+Issuing the next two commands, however, is not acceptable, since the first
+specifies src-ip and the second specifies dst-ip::
+
+ ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ ethtool -U enp130s0 flow-type ip4 dst-ip 192.168.0.5 src-port 55 action 10
+
+The second command will fail with an error. You may program multiple filters
+with the same fields, using different values, but, on one device, you may not
+program two tcp4 filters with different matching fields.
+
+Matching on a sub-portion of a field is not supported by the i40e driver, thus
+partial mask fields are not supported.
+
+The driver also supports matching user-defined data within the packet payload.
+This flexible data is specified using the "user-def" field of the ethtool
+command in the following way:
+
++----------------------------+--------------------------+
+| 31 28 24 20 16 | 15 12 8 4 0 |
++----------------------------+--------------------------+
+| offset into packet payload | 2 bytes of flexible data |
++----------------------------+--------------------------+
+
+For example,
+
+::
+
+ ... user-def 0x4FFFF ...
+
+tells the filter to look 4 bytes into the payload and match that value against
+0xFFFF. The offset is based on the beginning of the payload, and not the
+beginning of the packet. Thus
+
+::
+
+ flow-type tcp4 ... user-def 0x8BEAF ...
+
+would match TCP/IPv4 packets which have the value 0xBEAF 8 bytes into the
+TCP/IPv4 payload.
+
+Note that ICMP headers are parsed as 4 bytes of header and 4 bytes of payload.
+Thus to match the first byte of the payload, you must actually add 4 bytes to
+the offset. Also note that ip4 filters match both ICMP frames as well as raw
+(unknown) ip4 frames, where the payload will be the L3 payload of the IP4 frame.
+
+The maximum offset is 64. The hardware will only read up to 64 bytes of data
+from the payload. The offset must be even because the flexible data is 2 bytes
+long and must be aligned to byte 0 of the packet payload.
+
+The user-defined flexible offset is also considered part of the input set and
+cannot be programmed separately for multiple filters of the same type. However,
+the flexible data is not part of the input set and multiple filters may use the
+same offset but match against different data.
+
+To create filters that direct traffic to a specific Virtual Function, use the
+"action" parameter. Specify the action as a 64 bit value, where the lower 32
+bits represents the queue number, while the next 8 bits represent which VF.
+Note that 0 is the PF, so the VF identifier is offset by 1. For example::
+
+ ... action 0x800000002 ...
+
+specifies to direct traffic to Virtual Function 7 (8 minus 1) into queue 2 of
+that VF.
+
+Note that these filters will not break internal routing rules, and will not
+route traffic that otherwise would not have been sent to the specified Virtual
+Function.
+
+Setting the link-down-on-close Private Flag
+-------------------------------------------
+When the link-down-on-close private flag is set to "on", the port's link will
+go down when the interface is brought down using the ifconfig ethX down command.
+
+Use ethtool to view and set link-down-on-close, as follows::
+
+ ethtool --show-priv-flags ethX
+ ethtool --set-priv-flags ethX link-down-on-close [on|off]
+
+Viewing Link Messages
+---------------------
+Link messages will not be displayed to the console if the distribution is
+restricting system messages. In order to see network driver link messages on
+your console, set dmesg to eight by entering the following::
+
+ dmesg -n 8
+
+NOTE: This setting is not saved across reboots.
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <x> is the interface number::
+
+ ifconfig eth<x> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+ ip link set mtu 9000 dev eth<x>
+ ip link set up dev eth<x>
+
+This setting is not saved across reboots. The setting change can be made
+permanent by adding 'MTU=9000' to the file::
+
+ /etc/sysconfig/network-scripts/ifcfg-eth<x> // for RHEL
+ /etc/sysconfig/network/<config_file> // for SLES
+
+NOTE: The maximum MTU setting for Jumbo Frames is 9702. This value coincides
+with the maximum Jumbo Frames size of 9728 bytes.
+
+NOTE: This driver will attempt to use multiple page sized buffers to receive
+each jumbo packet. This should help to avoid buffer starvation issues when
+allocating receive packets.
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+https://www.kernel.org/pub/software/network/ethtool/
+
+Supported ethtool Commands and Options for Filtering
+----------------------------------------------------
+-n --show-nfc
+ Retrieves the receive network flow classification configurations.
+
+rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6
+ Retrieves the hash options for the specified network traffic type.
+
+-N --config-nfc
+ Configures the receive network flow classification.
+
+rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 m|v|t|s|d|f|n|r...
+ Configures the hash options for the specified network traffic type.
+
+udp4 UDP over IPv4
+udp6 UDP over IPv6
+
+f Hash on bytes 0 and 1 of the Layer 4 header of the Rx packet.
+n Hash on bytes 2 and 3 of the Layer 4 header of the Rx packet.
+
+Speed and Duplex Configuration
+------------------------------
+In addressing speed and duplex configuration issues, you need to distinguish
+between copper-based adapters and fiber-based adapters.
+
+In the default mode, an Intel(R) Ethernet Network Adapter using copper
+connections will attempt to auto-negotiate with its link partner to determine
+the best setting. If the adapter cannot establish link with the link partner
+using auto-negotiation, you may need to manually configure the adapter and link
+partner to identical settings to establish link and pass packets. This should
+only be needed when attempting to link with an older switch that does not
+support auto-negotiation or one that has been forced to a specific speed or
+duplex mode. Your link partner must match the setting you choose. 1 Gbps speeds
+and higher cannot be forced. Use the autonegotiation advertising setting to
+manually set devices for 1 Gbps and higher.
+
+NOTE: You cannot set the speed for devices based on the Intel(R) Ethernet
+Network Adapter XXV710 based devices.
+
+Speed, duplex, and autonegotiation advertising are configured through the
+ethtool utility.
+
+Caution: Only experienced network administrators should force speed and duplex
+or change autonegotiation advertising manually. The settings at the switch must
+always match the adapter settings. Adapter performance may suffer or your
+adapter may not operate if you configure the adapter differently from your
+switch.
+
+An Intel(R) Ethernet Network Adapter using fiber-based connections, however,
+will not attempt to auto-negotiate with its link partner since those adapters
+operate only in full duplex and only at their native speed.
+
+NAPI
+----
+NAPI (Rx polling mode) is supported in the i40e driver.
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
+
+Flow Control
+------------
+Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable
+receiving and transmitting pause frames for i40e. When transmit is enabled,
+pause frames are generated when the receive packet buffer crosses a predefined
+threshold. When receive is enabled, the transmit unit will halt for the time
+delay specified when a pause frame is received.
+
+NOTE: You must have a flow control capable link partner.
+
+Flow Control is on by default.
+
+Use ethtool to change the flow control settings.
+
+To enable or disable Rx or Tx Flow Control::
+
+ ethtool -A eth? rx <on|off> tx <on|off>
+
+Note: This command only enables or disables Flow Control if auto-negotiation is
+disabled. If auto-negotiation is enabled, this command changes the parameters
+used for auto-negotiation with the link partner.
+
+To enable or disable auto-negotiation::
+
+ ethtool -s eth? autoneg <on|off>
+
+Note: Flow Control auto-negotiation is part of link auto-negotiation. Depending
+on your device, you may not be able to change the auto-negotiation setting.
+
+RSS Hash Flow
+-------------
+Allows you to set the hash bytes per flow type and any combination of one or
+more options for Receive Side Scaling (RSS) hash byte configuration.
+
+::
+
+ # ethtool -N <dev> rx-flow-hash <type> <option>
+
+Where <type> is:
+ tcp4 signifying TCP over IPv4
+ udp4 signifying UDP over IPv4
+ tcp6 signifying TCP over IPv6
+ udp6 signifying UDP over IPv6
+And <option> is one or more of:
+ s Hash on the IP source address of the Rx packet.
+ d Hash on the IP destination address of the Rx packet.
+ f Hash on bytes 0 and 1 of the Layer 4 header of the Rx packet.
+ n Hash on bytes 2 and 3 of the Layer 4 header of the Rx packet.
+
+MAC and VLAN anti-spoofing feature
+----------------------------------
+When a malicious driver attempts to send a spoofed packet, it is dropped by the
+hardware and not transmitted.
+NOTE: This feature can be disabled for a specific Virtual Function (VF)::
+
+ ip link set <pf dev> vf <vf id> spoofchk {off|on}
+
+IEEE 1588 Precision Time Protocol (PTP) Hardware Clock (PHC)
+------------------------------------------------------------
+Precision Time Protocol (PTP) is used to synchronize clocks in a computer
+network. PTP support varies among Intel devices that support this driver. Use
+"ethtool -T <netdev name>" to get a definitive list of PTP capabilities
+supported by the device.
+
+IEEE 802.1ad (QinQ) Support
+---------------------------
+The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
+IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
+"tags," and multiple VLAN IDs are thus referred to as a "tag stack." Tag stacks
+allow L2 tunneling and the ability to segregate traffic within a particular
+VLAN ID, among other uses.
+
+The following are examples of how to configure 802.1ad (QinQ)::
+
+ ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+ ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+
+Where "24" and "371" are example VLAN IDs.
+
+NOTES:
+ Receive checksum offloads, cloud filters, and VLAN acceleration are not
+ supported for 802.1ad (QinQ) packets.
+
+VXLAN and GENEVE Overlay HW Offloading
+--------------------------------------
+Virtual Extensible LAN (VXLAN) allows you to extend an L2 network over an L3
+network, which may be useful in a virtualized or cloud environment. Some
+Intel(R) Ethernet Network devices perform VXLAN processing, offloading it from
+the operating system. This reduces CPU utilization.
+
+VXLAN offloading is controlled by the Tx and Rx checksum offload options
+provided by ethtool. That is, if Tx checksum offload is enabled, and the
+adapter has the capability, VXLAN offloading is also enabled.
+
+Support for VXLAN and GENEVE HW offloading is dependent on kernel support of
+the HW offloading features.
+
+Multiple Functions per Port
+---------------------------
+Some adapters based on the Intel Ethernet Controller X710/XL710 support
+multiple functions on a single physical port. Configure these functions through
+the System Setup/BIOS.
+
+Minimum TX Bandwidth is the guaranteed minimum data transmission bandwidth, as
+a percentage of the full physical port link speed, that the partition will
+receive. The bandwidth the partition is awarded will never fall below the level
+you specify.
+
+The range for the minimum bandwidth values is:
+1 to ((100 minus # of partitions on the physical port) plus 1)
+For example, if a physical port has 4 partitions, the range would be:
+1 to ((100 - 4) + 1 = 97)
+
+The Maximum Bandwidth percentage represents the maximum transmit bandwidth
+allocated to the partition as a percentage of the full physical port link
+speed. The accepted range of values is 1-100. The value is used as a limiter,
+should you chose that any one particular function not be able to consume 100%
+of a port's bandwidth (should it be available). The sum of all the values for
+Maximum Bandwidth is not restricted, because no more than 100% of a port's
+bandwidth can ever be used.
+
+NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
+per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
+"add vsi failed for VF N, aq_err 16". To workaround the issue, enable less than
+64 virtual functions (VFs).
+
+Data Center Bridging (DCB)
+--------------------------
+DCB is a configuration Quality of Service implementation in hardware. It uses
+the VLAN priority tag (802.1p) to filter traffic. That means that there are 8
+different priorities that traffic can be filtered into. It also enables
+priority flow control (802.1Qbb) which can limit or eliminate the number of
+dropped packets during network stress. Bandwidth can be allocated to each of
+these priorities, which is enforced at the hardware level (802.1Qaz).
+
+Adapter firmware implements LLDP and DCBX protocol agents as per 802.1AB and
+802.1Qaz respectively. The firmware based DCBX agent runs in willing mode only
+and can accept settings from a DCBX capable peer. Software configuration of
+DCBX parameters via dcbtool/lldptool are not supported.
+
+NOTE: Firmware LLDP can be disabled by setting the private flag disable-fw-lldp.
+
+The i40e driver implements the DCB netlink interface layer to allow user-space
+to communicate with the driver and query DCB configuration for the port.
+
+NOTE:
+The kernel assumes that TC0 is available, and will disable Priority Flow
+Control (PFC) on the device if TC0 is not available. To fix this, ensure TC0 is
+enabled when setting up DCB on your switch.
+
+Interrupt Rate Limiting
+-----------------------
+:Valid Range: 0-235 (0=no limit)
+
+The Intel(R) Ethernet Controller XL710 family supports an interrupt rate
+limiting mechanism. The user can control, via ethtool, the number of
+microseconds between interrupts.
+
+Syntax::
+
+ # ethtool -C ethX rx-usecs-high N
+
+The range of 0-235 microseconds provides an effective range of 4,310 to 250,000
+interrupts per second. The value of rx-usecs-high can be set independently of
+rx-usecs and tx-usecs in the same ethtool command, and is also independent of
+the adaptive interrupt moderation algorithm. The underlying hardware supports
+granularity in 4-microsecond intervals, so adjacent values may result in the
+same interrupt rate.
+
+One possible use case is the following::
+
+ # ethtool -C ethX adaptive-rx off adaptive-tx off rx-usecs-high 20 rx-usecs \
+ 5 tx-usecs 5
+
+The above command would disable adaptive interrupt moderation, and allow a
+maximum of 5 microseconds before indicating a receive or transmit was complete.
+However, instead of resulting in as many as 200,000 interrupts per second, it
+limits total interrupts per second to 50,000 via the rx-usecs-high parameter.
+
+Performance Optimization
+========================
+Driver defaults are meant to fit a wide variety of workloads, but if further
+optimization is required we recommend experimenting with the following settings.
+
+NOTE: For better performance when processing small (64B) frame sizes, try
+enabling Hyper threading in the BIOS in order to increase the number of logical
+cores in the system and subsequently increase the number of queues available to
+the adapter.
+
+Virtualized Environments
+------------------------
+1. Disable XPS on both ends by using the included virt_perf_default script
+or by running the following command as root::
+
+ for file in `ls /sys/class/net/<ethX>/queues/tx-*/xps_cpus`;
+ do echo 0 > $file; done
+
+2. Using the appropriate mechanism (vcpupin) in the vm, pin the cpu's to
+individual lcpu's, making sure to use a set of cpu's included in the
+device's local_cpulist: /sys/class/net/<ethX>/device/local_cpulist.
+
+3. Configure as many Rx/Tx queues in the VM as available. Do not rely on
+the default setting of 1.
+
+
+Non-virtualized Environments
+----------------------------
+Pin the adapter's IRQs to specific cores by disabling the irqbalance service
+and using the included set_irq_affinity script. Please see the script's help
+text for further options.
+
+- The following settings will distribute the IRQs across all the cores evenly::
+
+ # scripts/set_irq_affinity -x all <interface1> , [ <interface2>, ... ]
+
+- The following settings will distribute the IRQs across all the cores that are
+ local to the adapter (same NUMA node)::
+
+ # scripts/set_irq_affinity -x local <interface1> ,[ <interface2>, ... ]
+
+For very CPU intensive workloads, we recommend pinning the IRQs to all cores.
+
+For IP Forwarding: Disable Adaptive ITR and lower Rx and Tx interrupts per
+queue using ethtool.
+
+- Setting rx-usecs and tx-usecs to 125 will limit interrupts to about 8000
+ interrupts per second per queue.
+
+::
+
+ # ethtool -C <interface> adaptive-rx off adaptive-tx off rx-usecs 125 \
+ tx-usecs 125
+
+For lower CPU utilization: Disable Adaptive ITR and lower Rx and Tx interrupts
+per queue using ethtool.
+
+- Setting rx-usecs and tx-usecs to 250 will limit interrupts to about 4000
+ interrupts per second per queue.
+
+::
+
+ # ethtool -C <interface> adaptive-rx off adaptive-tx off rx-usecs 250 \
+ tx-usecs 250
+
+For lower latency: Disable Adaptive ITR and ITR by setting Rx and Tx to 0 using
+ethtool.
+
+::
+
+ # ethtool -C <interface> adaptive-rx off adaptive-tx off rx-usecs 0 \
+ tx-usecs 0
+
+Application Device Queues (ADq)
+-------------------------------
+Application Device Queues (ADq) allows you to dedicate one or more queues to a
+specific application. This can reduce latency for the specified application,
+and allow Tx traffic to be rate limited per application. Follow the steps below
+to set ADq.
+
+1. Create traffic classes (TCs). Maximum of 8 TCs can be created per interface.
+The shaper bw_rlimit parameter is optional.
+
+Example: Sets up two tcs, tc0 and tc1, with 16 queues each and max tx rate set
+to 1Gbit for tc0 and 3Gbit for tc1.
+
+::
+
+ # tc qdisc add dev <interface> root mqprio num_tc 2 map 0 0 0 0 1 1 1 1
+ queues 16@0 16@16 hw 1 mode channel shaper bw_rlimit min_rate 1Gbit 2Gbit
+ max_rate 1Gbit 3Gbit
+
+map: priority mapping for up to 16 priorities to tcs (e.g. map 0 0 0 0 1 1 1 1
+sets priorities 0-3 to use tc0 and 4-7 to use tc1)
+
+queues: for each tc, <num queues>@<offset> (e.g. queues 16@0 16@16 assigns
+16 queues to tc0 at offset 0 and 16 queues to tc1 at offset 16. Max total
+number of queues for all tcs is 64 or number of cores, whichever is lower.)
+
+hw 1 mode channel: ‘channel’ with ‘hw’ set to 1 is a new new hardware
+offload mode in mqprio that makes full use of the mqprio options, the
+TCs, the queue configurations, and the QoS parameters.
+
+shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
+Totals must be equal or less than port speed.
+
+For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
+monitoring tools such as `ifstat` or `sar -n DEV [interval] [number of samples]`
+
+2. Enable HW TC offload on interface::
+
+ # ethtool -K <interface> hw-tc-offload on
+
+3. Apply TCs to ingress (RX) flow of interface::
+
+ # tc qdisc add dev <interface> ingress
+
+NOTES:
+ - Run all tc commands from the iproute2 <pathtoiproute2>/tc/ directory.
+ - ADq is not compatible with cloud filters.
+ - Setting up channels via ethtool (ethtool -L) is not supported when the
+ TCs are configured using mqprio.
+ - You must have iproute2 latest version
+ - NVM version 6.01 or later is required.
+ - ADq cannot be enabled when any the following features are enabled: Data
+ Center Bridging (DCB), Multiple Functions per Port (MFP), or Sideband
+ Filters.
+ - If another driver (for example, DPDK) has set cloud filters, you cannot
+ enable ADq.
+ - Tunnel filters are not supported in ADq. If encapsulated packets do
+ arrive in non-tunnel mode, filtering will be done on the inner headers.
+ For example, for VXLAN traffic in non-tunnel mode, PCTYPE is identified
+ as a VXLAN encapsulated packet, outer headers are ignored. Therefore,
+ inner headers are matched.
+ - If a TC filter on a PF matches traffic over a VF (on the PF), that
+ traffic will be routed to the appropriate queue of the PF, and will
+ not be passed on the VF. Such traffic will end up getting dropped higher
+ up in the TCP/IP stack as it does not match PF address data.
+ - If traffic matches multiple TC filters that point to different TCs,
+ that traffic will be duplicated and sent to all matching TC queues.
+ The hardware switch mirrors the packet to a VSI list when multiple
+ filters are matched.
+
+
+Known Issues/Troubleshooting
+============================
+
+NOTE: 1 Gb devices based on the Intel(R) Ethernet Network Connection X722 do
+not support the following features:
+
+ * Data Center Bridging (DCB)
+ * QOS
+ * VMQ
+ * SR-IOV
+ * Task Encapsulation offload (VXLAN, NVGRE)
+ * Energy Efficient Ethernet (EEE)
+ * Auto-media detect
+
+Unexpected Issues when the device driver and DPDK share a device
+----------------------------------------------------------------
+Unexpected issues may result when an i40e device is in multi driver mode and
+the kernel driver and DPDK driver are sharing the device. This is because
+access to the global NIC resources is not synchronized between multiple
+drivers. Any change to the global NIC configuration (writing to a global
+register, setting global configuration by AQ, or changing switch modes) will
+affect all ports and drivers on the device. Loading DPDK with the
+"multi-driver" module parameter may mitigate some of the issues.
+
+TC0 must be enabled when setting up DCB on a switch
+---------------------------------------------------
+The kernel assumes that TC0 is available, and will disable Priority Flow
+Control (PFC) on the device if TC0 is not available. To fix this, ensure TC0 is
+enabled when setting up DCB on your switch.
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/iavf.rst b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
new file mode 100644
index 0000000000..eb926c3bd4
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
@@ -0,0 +1,326 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for Intel(R) Ethernet Adaptive Virtual Function
+=================================================================
+
+Intel Ethernet Adaptive Virtual Function Linux driver.
+Copyright(c) 2013-2018 Intel Corporation.
+
+Contents
+========
+
+- Overview
+- Identifying Your Adapter
+- Additional Configurations
+- Known Issues/Troubleshooting
+- Support
+
+Overview
+========
+
+This file describes the iavf Linux Base Driver. This driver was formerly
+called i40evf.
+
+The iavf driver supports the below mentioned virtual function devices and
+can only be activated on kernels running the i40e or newer Physical Function
+(PF) driver compiled with CONFIG_PCI_IOV. The iavf driver requires
+CONFIG_PCI_MSI to be enabled.
+
+The guest OS loading the iavf driver must support MSI-X interrupts.
+
+Identifying Your Adapter
+========================
+
+The driver in this kernel is compatible with devices based on the following:
+ * Intel(R) XL710 X710 Virtual Function
+ * Intel(R) X722 Virtual Function
+ * Intel(R) XXV710 Virtual Function
+ * Intel(R) Ethernet Adaptive Virtual Function
+
+For the best performance, make sure the latest NVM/FW is installed on your
+device.
+
+For information on how to identify your adapter, and for the latest NVM/FW
+images and Intel network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Additional Features and Configurations
+======================================
+
+Viewing Link Messages
+---------------------
+Link messages will not be displayed to the console if the distribution is
+restricting system messages. In order to see network driver link messages on
+your console, set dmesg to eight by entering the following::
+
+ # dmesg -n 8
+
+NOTE:
+ This setting is not saved across reboots.
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+https://www.kernel.org/pub/software/network/ethtool/
+
+Setting VLAN Tag Stripping
+--------------------------
+If you have applications that require Virtual Functions (VFs) to receive
+packets with VLAN tags, you can disable VLAN tag stripping for the VF. The
+Physical Function (PF) processes requests issued from the VF to enable or
+disable VLAN tag stripping. Note that if the PF has assigned a VLAN to a VF,
+then requests from that VF to set VLAN tag stripping will be ignored.
+
+To enable/disable VLAN tag stripping for a VF, issue the following command
+from inside the VM in which you are running the VF::
+
+ # ethtool -K <if_name> rxvlan on/off
+
+or alternatively::
+
+ # ethtool --offload <if_name> rxvlan on/off
+
+Adaptive Virtual Function
+-------------------------
+Adaptive Virtual Function (AVF) allows the virtual function driver, or VF, to
+adapt to changing feature sets of the physical function driver (PF) with which
+it is associated. This allows system administrators to update a PF without
+having to update all the VFs associated with it. All AVFs have a single common
+device ID and branding string.
+
+AVFs have a minimum set of features known as "base mode," but may provide
+additional features depending on what features are available in the PF with
+which the AVF is associated. The following are base mode features:
+
+- 4 Queue Pairs (QP) and associated Configuration Status Registers (CSRs)
+ for Tx/Rx
+- i40e descriptors and ring format
+- Descriptor write-back completion
+- 1 control queue, with i40e descriptors, CSRs and ring format
+- 5 MSI-X interrupt vectors and corresponding i40e CSRs
+- 1 Interrupt Throttle Rate (ITR) index
+- 1 Virtual Station Interface (VSI) per VF
+- 1 Traffic Class (TC), TC0
+- Receive Side Scaling (RSS) with 64 entry indirection table and key,
+ configured through the PF
+- 1 unicast MAC address reserved per VF
+- 16 MAC address filters for each VF
+- Stateless offloads - non-tunneled checksums
+- AVF device ID
+- HW mailbox is used for VF to PF communications (including on Windows)
+
+IEEE 802.1ad (QinQ) Support
+---------------------------
+The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
+IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
+"tags," and multiple VLAN IDs are thus referred to as a "tag stack." Tag stacks
+allow L2 tunneling and the ability to segregate traffic within a particular
+VLAN ID, among other uses.
+
+The following are examples of how to configure 802.1ad (QinQ)::
+
+ # ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+ # ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+
+Where "24" and "371" are example VLAN IDs.
+
+NOTES:
+ Receive checksum offloads, cloud filters, and VLAN acceleration are not
+ supported for 802.1ad (QinQ) packets.
+
+Application Device Queues (ADq)
+-------------------------------
+Application Device Queues (ADq) allows you to dedicate one or more queues to a
+specific application. This can reduce latency for the specified application,
+and allow Tx traffic to be rate limited per application. Follow the steps below
+to set ADq.
+
+Requirements:
+
+- The sch_mqprio, act_mirred and cls_flower modules must be loaded
+- The latest version of iproute2
+- If another driver (for example, DPDK) has set cloud filters, you cannot
+ enable ADQ
+- Depending on the underlying PF device, ADQ cannot be enabled when the
+ following features are enabled:
+
+ + Data Center Bridging (DCB)
+ + Multiple Functions per Port (MFP)
+ + Sideband Filters
+
+1. Create traffic classes (TCs). Maximum of 8 TCs can be created per interface.
+The shaper bw_rlimit parameter is optional.
+
+Example: Sets up two tcs, tc0 and tc1, with 16 queues each and max tx rate set
+to 1Gbit for tc0 and 3Gbit for tc1.
+
+::
+
+ tc qdisc add dev <interface> root mqprio num_tc 2 map 0 0 0 0 1 1 1 1
+ queues 16@0 16@16 hw 1 mode channel shaper bw_rlimit min_rate 1Gbit 2Gbit
+ max_rate 1Gbit 3Gbit
+
+map: priority mapping for up to 16 priorities to tcs (e.g. map 0 0 0 0 1 1 1 1
+sets priorities 0-3 to use tc0 and 4-7 to use tc1)
+
+queues: for each tc, <num queues>@<offset> (e.g. queues 16@0 16@16 assigns
+16 queues to tc0 at offset 0 and 16 queues to tc1 at offset 16. Max total
+number of queues for all tcs is 64 or number of cores, whichever is lower.)
+
+hw 1 mode channel: ‘channel’ with ‘hw’ set to 1 is a new new hardware
+offload mode in mqprio that makes full use of the mqprio options, the
+TCs, the queue configurations, and the QoS parameters.
+
+shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
+Totals must be equal or less than port speed.
+
+For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
+monitoring tools such as ``ifstat`` or ``sar -n DEV [interval] [number of samples]``
+
+NOTE:
+ Setting up channels via ethtool (ethtool -L) is not supported when the
+ TCs are configured using mqprio.
+
+2. Enable HW TC offload on interface::
+
+ # ethtool -K <interface> hw-tc-offload on
+
+3. Apply TCs to ingress (RX) flow of interface::
+
+ # tc qdisc add dev <interface> ingress
+
+NOTES:
+ - Run all tc commands from the iproute2 <pathtoiproute2>/tc/ directory
+ - ADq is not compatible with cloud filters
+ - Setting up channels via ethtool (ethtool -L) is not supported when the TCs
+ are configured using mqprio
+ - You must have iproute2 latest version
+ - NVM version 6.01 or later is required
+ - ADq cannot be enabled when any the following features are enabled: Data
+ Center Bridging (DCB), Multiple Functions per Port (MFP), or Sideband Filters
+ - If another driver (for example, DPDK) has set cloud filters, you cannot
+ enable ADq
+ - Tunnel filters are not supported in ADq. If encapsulated packets do arrive
+ in non-tunnel mode, filtering will be done on the inner headers. For example,
+ for VXLAN traffic in non-tunnel mode, PCTYPE is identified as a VXLAN
+ encapsulated packet, outer headers are ignored. Therefore, inner headers are
+ matched.
+ - If a TC filter on a PF matches traffic over a VF (on the PF), that traffic
+ will be routed to the appropriate queue of the PF, and will not be passed on
+ the VF. Such traffic will end up getting dropped higher up in the TCP/IP
+ stack as it does not match PF address data.
+ - If traffic matches multiple TC filters that point to different TCs, that
+ traffic will be duplicated and sent to all matching TC queues. The hardware
+ switch mirrors the packet to a VSI list when multiple filters are matched.
+
+
+Known Issues/Troubleshooting
+============================
+
+Bonding fails with VFs bound to an Intel(R) Ethernet Controller 700 series device
+---------------------------------------------------------------------------------
+If you bind Virtual Functions (VFs) to an Intel(R) Ethernet Controller 700
+series based device, the VF slaves may fail when they become the active slave.
+If the MAC address of the VF is set by the PF (Physical Function) of the
+device, when you add a slave, or change the active-backup slave, Linux bonding
+tries to sync the backup slave's MAC address to the same MAC address as the
+active slave. Linux bonding will fail at this point. This issue will not occur
+if the VF's MAC address is not set by the PF.
+
+Traffic Is Not Being Passed Between VM and Client
+-------------------------------------------------
+You may not be able to pass traffic between a client system and a
+Virtual Machine (VM) running on a separate host if the Virtual Function
+(VF, or Virtual NIC) is not in trusted mode and spoof checking is enabled
+on the VF. Note that this situation can occur in any combination of client,
+host, and guest operating system. For information on how to set the VF to
+trusted mode, refer to the section "VLAN Tag Packet Steering" in this
+readme document. For information on setting spoof checking, refer to the
+section "MAC and VLAN anti-spoofing feature" in this readme document.
+
+Do not unload port driver if VF with active VM is bound to it
+-------------------------------------------------------------
+Do not unload a port's driver if a Virtual Function (VF) with an active Virtual
+Machine (VM) is bound to it. Doing so will cause the port to appear to hang.
+Once the VM shuts down, or otherwise releases the VF, the command will complete.
+
+Using four traffic classes fails
+--------------------------------
+Do not try to reserve more than three traffic classes in the iavf driver. Doing
+so will fail to set any traffic classes and will cause the driver to write
+errors to stdout. Use a maximum of three queues to avoid this issue.
+
+Multiple log error messages on iavf driver removal
+--------------------------------------------------
+If you have several VFs and you remove the iavf driver, several instances of
+the following log errors are written to the log::
+
+ Unable to send opcode 2 to PF, err I40E_ERR_QUEUE_EMPTY, aq_err ok
+ Unable to send the message to VF 2 aq_err 12
+ ARQ Overflow Error detected
+
+Virtual machine does not get link
+---------------------------------
+If the virtual machine has more than one virtual port assigned to it, and those
+virtual ports are bound to different physical ports, you may not get link on
+all of the virtual ports. The following command may work around the issue::
+
+ # ethtool -r <PF>
+
+Where <PF> is the PF interface in the host, for example: p5p1. You may need to
+run the command more than once to get link on all virtual ports.
+
+MAC address of Virtual Function changes unexpectedly
+----------------------------------------------------
+If a Virtual Function's MAC address is not assigned in the host, then the VF
+(virtual function) driver will use a random MAC address. This random MAC
+address may change each time the VF driver is reloaded. You can assign a static
+MAC address in the host machine. This static MAC address will survive
+a VF driver reload.
+
+Driver Buffer Overflow Fix
+--------------------------
+The fix to resolve CVE-2016-8105, referenced in Intel SA-00069
+https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00069.html
+is included in this and future versions of the driver.
+
+Multiple Interfaces on Same Ethernet Broadcast Network
+------------------------------------------------------
+Due to the default ARP behavior on Linux, it is not possible to have one system
+on two IP networks in the same Ethernet broadcast domain (non-partitioned
+switch) behave as expected. All Ethernet interfaces will respond to IP traffic
+for any IP address assigned to the system. This results in unbalanced receive
+traffic.
+
+If you have multiple interfaces in a server, either turn on ARP filtering by
+entering::
+
+ # echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
+
+NOTE:
+ This setting is not saved across reboots. The configuration change can be
+ made permanent by adding the following line to the file /etc/sysctl.conf::
+
+ net.ipv4.conf.all.arp_filter = 1
+
+Another alternative is to install the interfaces in separate broadcast domains
+(either in different switches or in a switch partitioned to VLANs).
+
+Rx Page Allocation Errors
+-------------------------
+'Page allocation failure. order:0' errors may occur under stress.
+This is caused by the way the Linux kernel reports this stressed condition.
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://support.intel.com
+
+If an issue is identified with the released source code on the supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
new file mode 100644
index 0000000000..e4d065c55e
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -0,0 +1,1021 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Ethernet Controller 800 Series
+=================================================================
+
+Intel ice Linux driver.
+Copyright(c) 2018-2021 Intel Corporation.
+
+Contents
+========
+
+- Overview
+- Identifying Your Adapter
+- Important Notes
+- Additional Features & Configurations
+- Performance Optimization
+
+
+The associated Virtual Function (VF) driver for this driver is iavf.
+
+Driver information can be obtained using ethtool and lspci.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+This driver supports XDP (Express Data Path) and AF_XDP zero-copy. Note that
+XDP is blocked for frame sizes larger than 3KB.
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Important Notes
+===============
+
+Packet drops may occur under receive stress
+-------------------------------------------
+Devices based on the Intel(R) Ethernet Controller 800 Series are designed to
+tolerate a limited amount of system latency during PCIe and DMA transactions.
+If these transactions take longer than the tolerated latency, it can impact the
+length of time the packets are buffered in the device and associated memory,
+which may result in dropped packets. These packets drops typically do not have
+a noticeable impact on throughput and performance under standard workloads.
+
+If these packet drops appear to affect your workload, the following may improve
+the situation:
+
+1) Make sure that your system's physical memory is in a high-performance
+ configuration, as recommended by the platform vendor. A common
+ recommendation is for all channels to be populated with a single DIMM
+ module.
+2) In your system's BIOS/UEFI settings, select the "Performance" profile.
+3) Your distribution may provide tools like "tuned," which can help tweak
+ kernel settings to achieve better standard settings for different workloads.
+
+
+Configuring SR-IOV for improved network security
+------------------------------------------------
+In a virtualized environment, on Intel(R) Ethernet Network Adapters that
+support SR-IOV, the virtual function (VF) may be subject to malicious behavior.
+Software-generated layer two frames, like IEEE 802.3x (link flow control), IEEE
+802.1Qbb (priority based flow-control), and others of this type, are not
+expected and can throttle traffic between the host and the virtual switch,
+reducing performance. To resolve this issue, and to ensure isolation from
+unintended traffic streams, configure all SR-IOV enabled ports for VLAN tagging
+from the administrative interface on the PF. This configuration allows
+unexpected, and potentially malicious, frames to be dropped.
+
+See "Configuring VLAN Tagging on SR-IOV Enabled Adapter Ports" later in this
+README for configuration instructions.
+
+
+Do not unload port driver if VF with active VM is bound to it
+-------------------------------------------------------------
+Do not unload a port's driver if a Virtual Function (VF) with an active Virtual
+Machine (VM) is bound to it. Doing so will cause the port to appear to hang.
+Once the VM shuts down, or otherwise releases the VF, the command will
+complete.
+
+
+Additional Features and Configurations
+======================================
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+https://kernel.org/pub/software/network/ethtool/
+
+NOTE: The rx_bytes value of ethtool does not match the rx_bytes value of
+Netdev, due to the 4-byte CRC being stripped by the device. The difference
+between the two rx_bytes values will be 4 x the number of Rx packets. For
+example, if Rx packets are 10 and Netdev (software statistics) displays
+rx_bytes as "X", then ethtool (hardware statistics) will display rx_bytes as
+"X+40" (4 bytes CRC x 10 packets).
+
+
+Viewing Link Messages
+---------------------
+Link messages will not be displayed to the console if the distribution is
+restricting system messages. In order to see network driver link messages on
+your console, set dmesg to eight by entering the following::
+
+ # dmesg -n 8
+
+NOTE: This setting is not saved across reboots.
+
+
+Dynamic Device Personalization
+------------------------------
+Dynamic Device Personalization (DDP) allows you to change the packet processing
+pipeline of a device by applying a profile package to the device at runtime.
+Profiles can be used to, for example, add support for new protocols, change
+existing protocols, or change default settings. DDP profiles can also be rolled
+back without rebooting the system.
+
+The DDP package loads during device initialization. The driver looks for
+``intel/ice/ddp/ice.pkg`` in your firmware root (typically ``/lib/firmware/``
+or ``/lib/firmware/updates/``) and checks that it contains a valid DDP package
+file.
+
+NOTE: Your distribution should likely have provided the latest DDP file, but if
+ice.pkg is missing, you can find it in the linux-firmware repository or from
+intel.com.
+
+If the driver is unable to load the DDP package, the device will enter Safe
+Mode. Safe Mode disables advanced and performance features and supports only
+basic traffic and minimal functionality, such as updating the NVM or
+downloading a new driver or DDP package. Safe Mode only applies to the affected
+physical function and does not impact any other PFs. See the "Intel(R) Ethernet
+Adapters and Devices User Guide" for more details on DDP and Safe Mode.
+
+NOTES:
+
+- If you encounter issues with the DDP package file, you may need to download
+ an updated driver or DDP package file. See the log messages for more
+ information.
+
+- The ice.pkg file is a symbolic link to the default DDP package file.
+
+- You cannot update the DDP package if any PF drivers are already loaded. To
+ overwrite a package, unload all PFs and then reload the driver with the new
+ package.
+
+- Only the first loaded PF per device can download a package for that device.
+
+You can install specific DDP package files for different physical devices in
+the same system. To install a specific DDP package file:
+
+1. Download the DDP package file you want for your device.
+
+2. Rename the file ice-xxxxxxxxxxxxxxxx.pkg, where 'xxxxxxxxxxxxxxxx' is the
+ unique 64-bit PCI Express device serial number (in hex) of the device you
+ want the package downloaded on. The filename must include the complete
+ serial number (including leading zeros) and be all lowercase. For example,
+ if the 64-bit serial number is b887a3ffffca0568, then the file name would be
+ ice-b887a3ffffca0568.pkg.
+
+ To find the serial number from the PCI bus address, you can use the
+ following command::
+
+ # lspci -vv -s af:00.0 | grep -i Serial
+ Capabilities: [150 v1] Device Serial Number b8-87-a3-ff-ff-ca-05-68
+
+ You can use the following command to format the serial number without the
+ dashes::
+
+ # lspci -vv -s af:00.0 | grep -i Serial | awk '{print $7}' | sed s/-//g
+ b887a3ffffca0568
+
+3. Copy the renamed DDP package file to
+ ``/lib/firmware/updates/intel/ice/ddp/``. If the directory does not yet
+ exist, create it before copying the file.
+
+4. Unload all of the PFs on the device.
+
+5. Reload the driver with the new package.
+
+NOTE: The presence of a device-specific DDP package file overrides the loading
+of the default DDP package file (ice.pkg).
+
+
+Intel(R) Ethernet Flow Director
+-------------------------------
+The Intel Ethernet Flow Director performs the following tasks:
+
+- Directs receive packets according to their flows to different queues
+- Enables tight control on routing a flow in the platform
+- Matches flows and CPU cores for flow affinity
+
+NOTE: This driver supports the following flow types:
+
+- IPv4
+- TCPv4
+- UDPv4
+- SCTPv4
+- IPv6
+- TCPv6
+- UDPv6
+- SCTPv6
+
+Each flow type supports valid combinations of IP addresses (source or
+destination) and UDP/TCP/SCTP ports (source and destination). You can supply
+only a source IP address, a source IP address and a destination port, or any
+combination of one or more of these four parameters.
+
+NOTE: This driver allows you to filter traffic based on a user-defined flexible
+two-byte pattern and offset by using the ethtool user-def and mask fields. Only
+L3 and L4 flow types are supported for user-defined flexible filters. For a
+given flow type, you must clear all Intel Ethernet Flow Director filters before
+changing the input set (for that flow type).
+
+
+Flow Director Filters
+---------------------
+Flow Director filters are used to direct traffic that matches specified
+characteristics. They are enabled through ethtool's ntuple interface. To enable
+or disable the Intel Ethernet Flow Director and these filters::
+
+ # ethtool -K <ethX> ntuple <off|on>
+
+NOTE: When you disable ntuple filters, all the user programmed filters are
+flushed from the driver cache and hardware. All needed filters must be re-added
+when ntuple is re-enabled.
+
+To display all of the active filters::
+
+ # ethtool -u <ethX>
+
+To add a new filter::
+
+ # ethtool -U <ethX> flow-type <type> src-ip <ip> [m <ip_mask>] dst-ip <ip>
+ [m <ip_mask>] src-port <port> [m <port_mask>] dst-port <port> [m <port_mask>]
+ action <queue>
+
+ Where:
+ <ethX> - the Ethernet device to program
+ <type> - can be ip4, tcp4, udp4, sctp4, ip6, tcp6, udp6, sctp6
+ <ip> - the IP address to match on
+ <ip_mask> - the IPv4 address to mask on
+ NOTE: These filters use inverted masks.
+ <port> - the port number to match on
+ <port_mask> - the 16-bit integer for masking
+ NOTE: These filters use inverted masks.
+ <queue> - the queue to direct traffic toward (-1 discards the
+ matched traffic)
+
+To delete a filter::
+
+ # ethtool -U <ethX> delete <N>
+
+ Where <N> is the filter ID displayed when printing all the active filters,
+ and may also have been specified using "loc <N>" when adding the filter.
+
+EXAMPLES:
+
+To add a filter that directs packet to queue 2::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 src-port 2000 dst-port 2001 action 2 [loc 1]
+
+To set a filter using only the source and destination IP address::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 action 2 [loc 1]
+
+To set a filter based on a user-defined pattern and offset::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 user-def 0x4FFFF action 2 [loc 1]
+
+ where the value of the user-def field contains the offset (4 bytes) and
+ the pattern (0xffff).
+
+To match TCP traffic sent from 192.168.0.1, port 5300, directed to 192.168.0.5,
+port 80, and then send it to queue 7::
+
+ # ethtool -U enp130s0 flow-type tcp4 src-ip 192.168.0.1 dst-ip 192.168.0.5
+ src-port 5300 dst-port 80 action 7
+
+To add a TCPv4 filter with a partial mask for a source IP subnet::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.0.0 m 0.255.255.255 dst-ip
+ 192.168.5.12 src-port 12600 dst-port 31 action 12
+
+NOTES:
+
+For each flow-type, the programmed filters must all have the same matching
+input set. For example, issuing the following two commands is acceptable::
+
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.5 src-port 55 action 10
+
+Issuing the next two commands, however, is not acceptable, since the first
+specifies src-ip and the second specifies dst-ip::
+
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ # ethtool -U enp130s0 flow-type ip4 dst-ip 192.168.0.5 src-port 55 action 10
+
+The second command will fail with an error. You may program multiple filters
+with the same fields, using different values, but, on one device, you may not
+program two tcp4 filters with different matching fields.
+
+The ice driver does not support matching on a subportion of a field, thus
+partial mask fields are not supported.
+
+
+Flex Byte Flow Director Filters
+-------------------------------
+The driver also supports matching user-defined data within the packet payload.
+This flexible data is specified using the "user-def" field of the ethtool
+command in the following way:
+
+.. table::
+
+ ============================== ============================
+ ``31 28 24 20 16`` ``15 12 8 4 0``
+ ``offset into packet payload`` ``2 bytes of flexible data``
+ ============================== ============================
+
+For example,
+
+::
+
+ ... user-def 0x4FFFF ...
+
+tells the filter to look 4 bytes into the payload and match that value against
+0xFFFF. The offset is based on the beginning of the payload, and not the
+beginning of the packet. Thus
+
+::
+
+ flow-type tcp4 ... user-def 0x8BEAF ...
+
+would match TCP/IPv4 packets which have the value 0xBEAF 8 bytes into the
+TCP/IPv4 payload.
+
+Note that ICMP headers are parsed as 4 bytes of header and 4 bytes of payload.
+Thus to match the first byte of the payload, you must actually add 4 bytes to
+the offset. Also note that ip4 filters match both ICMP frames as well as raw
+(unknown) ip4 frames, where the payload will be the L3 payload of the IP4
+frame.
+
+The maximum offset is 64. The hardware will only read up to 64 bytes of data
+from the payload. The offset must be even because the flexible data is 2 bytes
+long and must be aligned to byte 0 of the packet payload.
+
+The user-defined flexible offset is also considered part of the input set and
+cannot be programmed separately for multiple filters of the same type. However,
+the flexible data is not part of the input set and multiple filters may use the
+same offset but match against different data.
+
+
+RSS Hash Flow
+-------------
+Allows you to set the hash bytes per flow type and any combination of one or
+more options for Receive Side Scaling (RSS) hash byte configuration.
+
+::
+
+ # ethtool -N <ethX> rx-flow-hash <type> <option>
+
+ Where <type> is:
+ tcp4 signifying TCP over IPv4
+ udp4 signifying UDP over IPv4
+ tcp6 signifying TCP over IPv6
+ udp6 signifying UDP over IPv6
+ And <option> is one or more of:
+ s Hash on the IP source address of the Rx packet.
+ d Hash on the IP destination address of the Rx packet.
+ f Hash on bytes 0 and 1 of the Layer 4 header of the Rx packet.
+ n Hash on bytes 2 and 3 of the Layer 4 header of the Rx packet.
+
+
+Accelerated Receive Flow Steering (aRFS)
+----------------------------------------
+Devices based on the Intel(R) Ethernet Controller 800 Series support
+Accelerated Receive Flow Steering (aRFS) on the PF. aRFS is a load-balancing
+mechanism that allows you to direct packets to the same CPU where an
+application is running or consuming the packets in that flow.
+
+NOTES:
+
+- aRFS requires that ntuple filtering is enabled via ethtool.
+- aRFS support is limited to the following packet types:
+
+ - TCP over IPv4 and IPv6
+ - UDP over IPv4 and IPv6
+ - Nonfragmented packets
+
+- aRFS only supports Flow Director filters, which consist of the
+ source/destination IP addresses and source/destination ports.
+- aRFS and ethtool's ntuple interface both use the device's Flow Director. aRFS
+ and ntuple features can coexist, but you may encounter unexpected results if
+ there's a conflict between aRFS and ntuple requests. See "Intel(R) Ethernet
+ Flow Director" for additional information.
+
+To set up aRFS:
+
+1. Enable the Intel Ethernet Flow Director and ntuple filters using ethtool.
+
+::
+
+ # ethtool -K <ethX> ntuple on
+
+2. Set up the number of entries in the global flow table. For example:
+
+::
+
+ # NUM_RPS_ENTRIES=16384
+ # echo $NUM_RPS_ENTRIES > /proc/sys/net/core/rps_sock_flow_entries
+
+3. Set up the number of entries in the per-queue flow table. For example:
+
+::
+
+ # NUM_RX_QUEUES=64
+ # for file in /sys/class/net/$IFACE/queues/rx-*/rps_flow_cnt; do
+ # echo $(($NUM_RPS_ENTRIES/$NUM_RX_QUEUES)) > $file;
+ # done
+
+4. Disable the IRQ balance daemon (this is only a temporary stop of the service
+ until the next reboot).
+
+::
+
+ # systemctl stop irqbalance
+
+5. Configure the interrupt affinity.
+
+ See ``/Documentation/core-api/irq/irq-affinity.rst``
+
+
+To disable aRFS using ethtool::
+
+ # ethtool -K <ethX> ntuple off
+
+NOTE: This command will disable ntuple filters and clear any aRFS filters in
+software and hardware.
+
+Example Use Case:
+
+1. Set the server application on the desired CPU (e.g., CPU 4).
+
+::
+
+ # taskset -c 4 netserver
+
+2. Use netperf to route traffic from the client to CPU 4 on the server with
+ aRFS configured. This example uses TCP over IPv4.
+
+::
+
+ # netperf -H <Host IPv4 Address> -t TCP_STREAM
+
+
+Enabling Virtual Functions (VFs)
+--------------------------------
+Use sysfs to enable virtual functions (VF).
+
+For example, you can create 4 VFs as follows::
+
+ # echo 4 > /sys/class/net/<ethX>/device/sriov_numvfs
+
+To disable VFs, write 0 to the same file::
+
+ # echo 0 > /sys/class/net/<ethX>/device/sriov_numvfs
+
+The maximum number of VFs for the ice driver is 256 total (all ports). To check
+how many VFs each PF supports, use the following command::
+
+ # cat /sys/class/net/<ethX>/device/sriov_totalvfs
+
+Note: You cannot use SR-IOV when link aggregation (LAG)/bonding is active, and
+vice versa. To enforce this, the driver checks for this mutual exclusion.
+
+
+Displaying VF Statistics on the PF
+----------------------------------
+Use the following command to display the statistics for the PF and its VFs::
+
+ # ip -s link show dev <ethX>
+
+NOTE: The output of this command can be very large due to the maximum number of
+possible VFs.
+
+The PF driver will display a subset of the statistics for the PF and for all
+VFs that are configured. The PF will always print a statistics block for each
+of the possible VFs, and it will show zero for all unconfigured VFs.
+
+
+Configuring VLAN Tagging on SR-IOV Enabled Adapter Ports
+--------------------------------------------------------
+To configure VLAN tagging for the ports on an SR-IOV enabled adapter, use the
+following command. The VLAN configuration should be done before the VF driver
+is loaded or the VM is booted. The VF is not aware of the VLAN tag being
+inserted on transmit and removed on received frames (sometimes called "port
+VLAN" mode).
+
+::
+
+ # ip link set dev <ethX> vf <id> vlan <vlan id>
+
+For example, the following will configure PF eth0 and the first VF on VLAN 10::
+
+ # ip link set dev eth0 vf 0 vlan 10
+
+
+Enabling a VF link if the port is disconnected
+----------------------------------------------
+If the physical function (PF) link is down, you can force link up (from the
+host PF) on any virtual functions (VF) bound to the PF.
+
+For example, to force link up on VF 0 bound to PF eth0::
+
+ # ip link set eth0 vf 0 state enable
+
+Note: If the command does not work, it may not be supported by your system.
+
+
+Setting the MAC Address for a VF
+--------------------------------
+To change the MAC address for the specified VF::
+
+ # ip link set <ethX> vf 0 mac <address>
+
+For example::
+
+ # ip link set <ethX> vf 0 mac 00:01:02:03:04:05
+
+This setting lasts until the PF is reloaded.
+
+NOTE: Assigning a MAC address for a VF from the host will disable any
+subsequent requests to change the MAC address from within the VM. This is a
+security feature. The VM is not aware of this restriction, so if this is
+attempted in the VM, it will trigger MDD events.
+
+
+Trusted VFs and VF Promiscuous Mode
+-----------------------------------
+This feature allows you to designate a particular VF as trusted and allows that
+trusted VF to request selective promiscuous mode on the Physical Function (PF).
+
+To set a VF as trusted or untrusted, enter the following command in the
+Hypervisor::
+
+ # ip link set dev <ethX> vf 1 trust [on|off]
+
+NOTE: It's important to set the VF to trusted before setting promiscuous mode.
+If the VM is not trusted, the PF will ignore promiscuous mode requests from the
+VF. If the VM becomes trusted after the VF driver is loaded, you must make a
+new request to set the VF to promiscuous.
+
+Once the VF is designated as trusted, use the following commands in the VM to
+set the VF to promiscuous mode.
+
+For promiscuous all::
+
+ # ip link set <ethX> promisc on
+ Where <ethX> is a VF interface in the VM
+
+For promiscuous Multicast::
+
+ # ip link set <ethX> allmulticast on
+ Where <ethX> is a VF interface in the VM
+
+NOTE: By default, the ethtool private flag vf-true-promisc-support is set to
+"off," meaning that promiscuous mode for the VF will be limited. To set the
+promiscuous mode for the VF to true promiscuous and allow the VF to see all
+ingress traffic, use the following command::
+
+ # ethtool --set-priv-flags <ethX> vf-true-promisc-support on
+
+The vf-true-promisc-support private flag does not enable promiscuous mode;
+rather, it designates which type of promiscuous mode (limited or true) you will
+get when you enable promiscuous mode using the ip link commands above. Note
+that this is a global setting that affects the entire device. However, the
+vf-true-promisc-support private flag is only exposed to the first PF of the
+device. The PF remains in limited promiscuous mode regardless of the
+vf-true-promisc-support setting.
+
+Next, add a VLAN interface on the VF interface. For example::
+
+ # ip link add link eth2 name eth2.100 type vlan id 100
+
+Note that the order in which you set the VF to promiscuous mode and add the
+VLAN interface does not matter (you can do either first). The result in this
+example is that the VF will get all traffic that is tagged with VLAN 100.
+
+
+Malicious Driver Detection (MDD) for VFs
+----------------------------------------
+Some Intel Ethernet devices use Malicious Driver Detection (MDD) to detect
+malicious traffic from the VF and disable Tx/Rx queues or drop the offending
+packet until a VF driver reset occurs. You can view MDD messages in the PF's
+system log using the dmesg command.
+
+- If the PF driver logs MDD events from the VF, confirm that the correct VF
+ driver is installed.
+- To restore functionality, you can manually reload the VF or VM or enable
+ automatic VF resets.
+- When automatic VF resets are enabled, the PF driver will immediately reset
+ the VF and reenable queues when it detects MDD events on the receive path.
+- If automatic VF resets are disabled, the PF will not automatically reset the
+ VF when it detects MDD events.
+
+To enable or disable automatic VF resets, use the following command::
+
+ # ethtool --set-priv-flags <ethX> mdd-auto-reset-vf on|off
+
+
+MAC and VLAN Anti-Spoofing Feature for VFs
+------------------------------------------
+When a malicious driver on a Virtual Function (VF) interface attempts to send a
+spoofed packet, it is dropped by the hardware and not transmitted.
+
+NOTE: This feature can be disabled for a specific VF::
+
+ # ip link set <ethX> vf <vf id> spoofchk {off|on}
+
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <ethX> is the interface number::
+
+ # ifconfig <ethX> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+ # ip link set mtu 9000 dev <ethX>
+ # ip link set up dev <ethX>
+
+This setting is not saved across reboots.
+
+
+NOTE: The maximum MTU setting for jumbo frames is 9702. This corresponds to the
+maximum jumbo frame size of 9728 bytes.
+
+NOTE: This driver will attempt to use multiple page sized buffers to receive
+each jumbo packet. This should help to avoid buffer starvation issues when
+allocating receive packets.
+
+NOTE: Packet loss may have a greater impact on throughput when you use jumbo
+frames. If you observe a drop in performance after enabling jumbo frames,
+enabling flow control may mitigate the issue.
+
+
+Speed and Duplex Configuration
+------------------------------
+In addressing speed and duplex configuration issues, you need to distinguish
+between copper-based adapters and fiber-based adapters.
+
+In the default mode, an Intel(R) Ethernet Network Adapter using copper
+connections will attempt to auto-negotiate with its link partner to determine
+the best setting. If the adapter cannot establish link with the link partner
+using auto-negotiation, you may need to manually configure the adapter and link
+partner to identical settings to establish link and pass packets. This should
+only be needed when attempting to link with an older switch that does not
+support auto-negotiation or one that has been forced to a specific speed or
+duplex mode. Your link partner must match the setting you choose. 1 Gbps speeds
+and higher cannot be forced. Use the autonegotiation advertising setting to
+manually set devices for 1 Gbps and higher.
+
+Speed, duplex, and autonegotiation advertising are configured through the
+ethtool utility. For the latest version, download and install ethtool from the
+following website:
+
+ https://kernel.org/pub/software/network/ethtool/
+
+To see the speed configurations your device supports, run the following::
+
+ # ethtool <ethX>
+
+Caution: Only experienced network administrators should force speed and duplex
+or change autonegotiation advertising manually. The settings at the switch must
+always match the adapter settings. Adapter performance may suffer or your
+adapter may not operate if you configure the adapter differently from your
+switch.
+
+
+Data Center Bridging (DCB)
+--------------------------
+NOTE: The kernel assumes that TC0 is available, and will disable Priority Flow
+Control (PFC) on the device if TC0 is not available. To fix this, ensure TC0 is
+enabled when setting up DCB on your switch.
+
+DCB is a configuration Quality of Service implementation in hardware. It uses
+the VLAN priority tag (802.1p) to filter traffic. That means that there are 8
+different priorities that traffic can be filtered into. It also enables
+priority flow control (802.1Qbb) which can limit or eliminate the number of
+dropped packets during network stress. Bandwidth can be allocated to each of
+these priorities, which is enforced at the hardware level (802.1Qaz).
+
+DCB is normally configured on the network using the DCBX protocol (802.1Qaz), a
+specialization of LLDP (802.1AB). The ice driver supports the following
+mutually exclusive variants of DCBX support:
+
+1) Firmware-based LLDP Agent
+2) Software-based LLDP Agent
+
+In firmware-based mode, firmware intercepts all LLDP traffic and handles DCBX
+negotiation transparently for the user. In this mode, the adapter operates in
+"willing" DCBX mode, receiving DCB settings from the link partner (typically a
+switch). The local user can only query the negotiated DCB configuration. For
+information on configuring DCBX parameters on a switch, please consult the
+switch manufacturer's documentation.
+
+In software-based mode, LLDP traffic is forwarded to the network stack and user
+space, where a software agent can handle it. In this mode, the adapter can
+operate in either "willing" or "nonwilling" DCBX mode and DCB configuration can
+be both queried and set locally. This mode requires the FW-based LLDP Agent to
+be disabled.
+
+NOTE:
+
+- You can enable and disable the firmware-based LLDP Agent using an ethtool
+ private flag. Refer to the "FW-LLDP (Firmware Link Layer Discovery Protocol)"
+ section in this README for more information.
+- In software-based DCBX mode, you can configure DCB parameters using software
+ LLDP/DCBX agents that interface with the Linux kernel's DCB Netlink API. We
+ recommend using OpenLLDP as the DCBX agent when running in software mode. For
+ more information, see the OpenLLDP man pages and
+ https://github.com/intel/openlldp.
+- The driver implements the DCB netlink interface layer to allow the user space
+ to communicate with the driver and query DCB configuration for the port.
+- iSCSI with DCB is not supported.
+
+
+FW-LLDP (Firmware Link Layer Discovery Protocol)
+------------------------------------------------
+Use ethtool to change FW-LLDP settings. The FW-LLDP setting is per port and
+persists across boots.
+
+To enable LLDP::
+
+ # ethtool --set-priv-flags <ethX> fw-lldp-agent on
+
+To disable LLDP::
+
+ # ethtool --set-priv-flags <ethX> fw-lldp-agent off
+
+To check the current LLDP setting::
+
+ # ethtool --show-priv-flags <ethX>
+
+NOTE: You must enable the UEFI HII "LLDP Agent" attribute for this setting to
+take effect. If "LLDP AGENT" is set to disabled, you cannot enable it from the
+OS.
+
+
+Flow Control
+------------
+Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable
+receiving and transmitting pause frames for ice. When transmit is enabled,
+pause frames are generated when the receive packet buffer crosses a predefined
+threshold. When receive is enabled, the transmit unit will halt for the time
+delay specified when a pause frame is received.
+
+NOTE: You must have a flow control capable link partner.
+
+Flow Control is disabled by default.
+
+Use ethtool to change the flow control settings.
+
+To enable or disable Rx or Tx Flow Control::
+
+ # ethtool -A <ethX> rx <on|off> tx <on|off>
+
+Note: This command only enables or disables Flow Control if auto-negotiation is
+disabled. If auto-negotiation is enabled, this command changes the parameters
+used for auto-negotiation with the link partner.
+
+Note: Flow Control auto-negotiation is part of link auto-negotiation. Depending
+on your device, you may not be able to change the auto-negotiation setting.
+
+NOTE:
+
+- The ice driver requires flow control on both the port and link partner. If
+ flow control is disabled on one of the sides, the port may appear to hang on
+ heavy traffic.
+- You may encounter issues with link-level flow control (LFC) after disabling
+ DCB. The LFC status may show as enabled but traffic is not paused. To resolve
+ this issue, disable and reenable LFC using ethtool::
+
+ # ethtool -A <ethX> rx off tx off
+ # ethtool -A <ethX> rx on tx on
+
+
+NAPI
+----
+
+This driver supports NAPI (Rx polling mode).
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
+
+MACVLAN
+-------
+This driver supports MACVLAN. Kernel support for MACVLAN can be tested by
+checking if the MACVLAN driver is loaded. You can run 'lsmod | grep macvlan' to
+see if the MACVLAN driver is loaded or run 'modprobe macvlan' to try to load
+the MACVLAN driver.
+
+NOTE:
+
+- In passthru mode, you can only set up one MACVLAN device. It will inherit the
+ MAC address of the underlying PF (Physical Function) device.
+
+
+IEEE 802.1ad (QinQ) Support
+---------------------------
+The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
+IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
+"tags," and multiple VLAN IDs are thus referred to as a "tag stack." Tag stacks
+allow L2 tunneling and the ability to segregate traffic within a particular
+VLAN ID, among other uses.
+
+NOTES:
+
+- Receive checksum offloads and VLAN acceleration are not supported for 802.1ad
+ (QinQ) packets.
+
+- 0x88A8 traffic will not be received unless VLAN stripping is disabled with
+ the following command::
+
+ # ethtool -K <ethX> rxvlan off
+
+- 0x88A8/0x8100 double VLANs cannot be used with 0x8100 or 0x8100/0x8100 VLANS
+ configured on the same port. 0x88a8/0x8100 traffic will not be received if
+ 0x8100 VLANs are configured.
+
+- The VF can only transmit 0x88A8/0x8100 (i.e., 802.1ad/802.1Q) traffic if:
+
+ 1) The VF is not assigned a port VLAN.
+ 2) spoofchk is disabled from the PF. If you enable spoofchk, the VF will
+ not transmit 0x88A8/0x8100 traffic.
+
+- The VF may not receive all network traffic based on the Inner VLAN header
+ when VF true promiscuous mode (vf-true-promisc-support) and double VLANs are
+ enabled in SR-IOV mode.
+
+The following are examples of how to configure 802.1ad (QinQ)::
+
+ # ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+ # ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+
+ Where "24" and "371" are example VLAN IDs.
+
+
+Tunnel/Overlay Stateless Offloads
+---------------------------------
+Supported tunnels and overlays include VXLAN, GENEVE, and others depending on
+hardware and software configuration. Stateless offloads are enabled by default.
+
+To view the current state of all offloads::
+
+ # ethtool -k <ethX>
+
+
+UDP Segmentation Offload
+------------------------
+Allows the adapter to offload transmit segmentation of UDP packets with
+payloads up to 64K into valid Ethernet frames. Because the adapter hardware is
+able to complete data segmentation much faster than operating system software,
+this feature may improve transmission performance.
+In addition, the adapter may use fewer CPU resources.
+
+NOTE:
+
+- The application sending UDP packets must support UDP segmentation offload.
+
+To enable/disable UDP Segmentation Offload, issue the following command::
+
+ # ethtool -K <ethX> tx-udp-segmentation [off|on]
+
+
+GNSS module
+-----------
+Requires kernel compiled with CONFIG_GNSS=y or CONFIG_GNSS=m.
+Allows user to read messages from the GNSS hardware module and write supported
+commands. If the module is physically present, a GNSS device is spawned:
+``/dev/gnss<id>``.
+The protocol of write command is dependent on the GNSS hardware module as the
+driver writes raw bytes by the GNSS object to the receiver through i2c. Please
+refer to the hardware GNSS module documentation for configuration details.
+
+
+Performance Optimization
+========================
+Driver defaults are meant to fit a wide variety of workloads, but if further
+optimization is required, we recommend experimenting with the following
+settings.
+
+
+Rx Descriptor Ring Size
+-----------------------
+To reduce the number of Rx packet discards, increase the number of Rx
+descriptors for each Rx ring using ethtool.
+
+ Check if the interface is dropping Rx packets due to buffers being full
+ (rx_dropped.nic can mean that there is no PCIe bandwidth)::
+
+ # ethtool -S <ethX> | grep "rx_dropped"
+
+ If the previous command shows drops on queues, it may help to increase
+ the number of descriptors using 'ethtool -G'::
+
+ # ethtool -G <ethX> rx <N>
+ Where <N> is the desired number of ring entries/descriptors
+
+ This can provide temporary buffering for issues that create latency while
+ the CPUs process descriptors.
+
+
+Interrupt Rate Limiting
+-----------------------
+This driver supports an adaptive interrupt throttle rate (ITR) mechanism that
+is tuned for general workloads. The user can customize the interrupt rate
+control for specific workloads, via ethtool, adjusting the number of
+microseconds between interrupts.
+
+To set the interrupt rate manually, you must disable adaptive mode::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off
+
+For lower CPU utilization:
+
+ Disable adaptive ITR and lower Rx and Tx interrupts. The examples below
+ affect every queue of the specified interface.
+
+ Setting rx-usecs and tx-usecs to 80 will limit interrupts to about
+ 12,500 interrupts per second per queue::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 80 tx-usecs 80
+
+For reduced latency:
+
+ Disable adaptive ITR and ITR by setting rx-usecs and tx-usecs to 0
+ using ethtool::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
+
+Per-queue interrupt rate settings:
+
+ The following examples are for queues 1 and 3, but you can adjust other
+ queues.
+
+ To disable Rx adaptive ITR and set static Rx ITR to 10 microseconds or
+ about 100,000 interrupts/second, for queues 1 and 3::
+
+ # ethtool --per-queue <ethX> queue_mask 0xa --coalesce adaptive-rx off
+ rx-usecs 10
+
+ To show the current coalesce settings for queues 1 and 3::
+
+ # ethtool --per-queue <ethX> queue_mask 0xa --show-coalesce
+
+Bounding interrupt rates using rx-usecs-high:
+
+ :Valid Range: 0-236 (0=no limit)
+
+ The range of 0-236 microseconds provides an effective range of 4,237 to
+ 250,000 interrupts per second. The value of rx-usecs-high can be set
+ independently of rx-usecs and tx-usecs in the same ethtool command, and is
+ also independent of the adaptive interrupt moderation algorithm. The
+ underlying hardware supports granularity in 4-microsecond intervals, so
+ adjacent values may result in the same interrupt rate.
+
+ The following command would disable adaptive interrupt moderation, and allow
+ a maximum of 5 microseconds before indicating a receive or transmit was
+ complete. However, instead of resulting in as many as 200,000 interrupts per
+ second, it limits total interrupts per second to 50,000 via the rx-usecs-high
+ parameter.
+
+ ::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs-high 20
+ rx-usecs 5 tx-usecs 5
+
+
+Virtualized Environments
+------------------------
+In addition to the other suggestions in this section, the following may be
+helpful to optimize performance in VMs.
+
+ Using the appropriate mechanism (vcpupin) in the VM, pin the CPUs to
+ individual LCPUs, making sure to use a set of CPUs included in the
+ device's local_cpulist: ``/sys/class/net/<ethX>/device/local_cpulist``.
+
+ Configure as many Rx/Tx queues in the VM as available. (See the iavf driver
+ documentation for the number of queues supported.) For example::
+
+ # ethtool -L <virt_interface> rx <max> tx <max>
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
+
+
+Trademarks
+==========
+Intel is a trademark or registered trademark of Intel Corporation or its
+subsidiaries in the United States and/or other countries.
+
+* Other names and brands may be claimed as the property of others.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/igb.rst b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
new file mode 100644
index 0000000000..fbd590b6a0
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
@@ -0,0 +1,208 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==========================================================
+Linux Base Driver for Intel(R) Ethernet Network Connection
+==========================================================
+
+Intel Gigabit Linux driver.
+Copyright(c) 1999-2018 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Command Line Parameters
+- Additional Configurations
+- Support
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Command Line Parameters
+========================
+If the driver is built as a module, the following optional parameters are used
+by entering them on the command line with the modprobe command using this
+syntax::
+
+ modprobe igb [<option>=<VAL1>,<VAL2>,...]
+
+There needs to be a <VAL#> for each network port in the system supported by
+this driver. The values will be applied to each instance, in function order.
+For example::
+
+ modprobe igb max_vfs=2,4
+
+In this case, there are two network ports supported by igb in the system.
+
+NOTE: A descriptor describes a data buffer and attributes related to the data
+buffer. This information is accessed by the hardware.
+
+max_vfs
+-------
+:Valid Range: 0-7
+
+This parameter adds support for SR-IOV. It causes the driver to spawn up to
+max_vfs worth of virtual functions. If the value is greater than 0 it will
+also force the VMDq parameter to be 1 or more.
+
+The parameters for the driver are referenced by position. Thus, if you have a
+dual port adapter, or more than one adapter in your system, and want N virtual
+functions per port, you must specify a number for each port with each parameter
+separated by a comma. For example::
+
+ modprobe igb max_vfs=4
+
+This will spawn 4 VFs on the first port.
+
+::
+
+ modprobe igb max_vfs=2,4
+
+This will spawn 2 VFs on the first port and 4 VFs on the second port.
+
+NOTE: Caution must be used in loading the driver with these parameters.
+Depending on your system configuration, number of slots, etc., it is impossible
+to predict in all cases where the positions would be on the command line.
+
+NOTE: Neither the device nor the driver control how VFs are mapped into config
+space. Bus layout will vary by operating system. On operating systems that
+support it, you can check sysfs to find the mapping.
+
+NOTE: When either SR-IOV mode or VMDq mode is enabled, hardware VLAN filtering
+and VLAN tag stripping/insertion will remain enabled. Please remove the old
+VLAN filter before the new VLAN filter is added. For example::
+
+ ip link set eth0 vf 0 vlan 100 // set vlan 100 for VF 0
+ ip link set eth0 vf 0 vlan 0 // Delete vlan 100
+ ip link set eth0 vf 0 vlan 200 // set a new vlan 200 for VF 0
+
+Debug
+-----
+:Valid Range: 0-16 (0=none,...,16=all)
+:Default Value: 0
+
+This parameter adjusts the level debug messages displayed in the system logs.
+
+
+Additional Features and Configurations
+======================================
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <x> is the interface number::
+
+ ifconfig eth<x> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+ ip link set mtu 9000 dev eth<x>
+ ip link set up dev eth<x>
+
+This setting is not saved across reboots. The setting change can be made
+permanent by adding 'MTU=9000' to the file:
+
+- For RHEL: /etc/sysconfig/network-scripts/ifcfg-eth<x>
+- For SLES: /etc/sysconfig/network/<config_file>
+
+NOTE: The maximum MTU setting for Jumbo Frames is 9216. This value coincides
+with the maximum Jumbo Frames size of 9234 bytes.
+
+NOTE: Using Jumbo frames at 10 or 100 Mbps is not supported and may result in
+poor performance or loss of link.
+
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+
+https://www.kernel.org/pub/software/network/ethtool/
+
+
+Enabling Wake on LAN (WoL)
+--------------------------
+WoL is configured through the ethtool utility.
+
+WoL will be enabled on the system during the next shut down or reboot. For
+this driver version, in order to enable WoL, the igb driver must be loaded
+prior to shutting down or suspending the system.
+
+NOTE: Wake on LAN is only supported on port A of multi-port devices. Also
+Wake On LAN is not supported for the following device:
+- Intel(R) Gigabit VT Quad Port Server Adapter
+
+
+Multiqueue
+----------
+In this mode, a separate MSI-X vector is allocated for each queue and one for
+"other" interrupts such as link status change and errors. All interrupts are
+throttled via interrupt moderation. Interrupt moderation must be used to avoid
+interrupt storms while the driver is processing one interrupt. The moderation
+value should be at least as large as the expected time for the driver to
+process an interrupt. Multiqueue is off by default.
+
+REQUIREMENTS: MSI-X support is required for Multiqueue. If MSI-X is not found,
+the system will fallback to MSI or to Legacy interrupts. This driver supports
+receive multiqueue on all kernels that support MSI-X.
+
+NOTE: On some kernels a reboot is required to switch between single queue mode
+and multiqueue mode or vice-versa.
+
+
+MAC and VLAN anti-spoofing feature
+----------------------------------
+When a malicious driver attempts to send a spoofed packet, it is dropped by the
+hardware and not transmitted.
+
+An interrupt is sent to the PF driver notifying it of the spoof attempt. When a
+spoofed packet is detected, the PF driver will send the following message to
+the system log (displayed by the "dmesg" command):
+Spoof event(s) detected on VF(n), where n = the VF that attempted to do the
+spoofing
+
+
+Setting MAC Address, VLAN and Rate Limit Using IProute2 Tool
+------------------------------------------------------------
+You can set a MAC address of a Virtual Function (VF), a default VLAN and the
+rate limit using the IProute2 tool. Download the latest version of the
+IProute2 tool from Sourceforge if your version does not have all the features
+you require.
+
+Credit Based Shaper (Qav Mode)
+------------------------------
+When enabling the CBS qdisc in the hardware offload mode, traffic shaping using
+the CBS (described in the IEEE 802.1Q-2018 Section 8.6.8.2 and discussed in the
+Annex L) algorithm will run in the i210 controller, so it's more accurate and
+uses less CPU.
+
+When using offloaded CBS, and the traffic rate obeys the configured rate
+(doesn't go above it), CBS should have little to no effect in the latency.
+
+The offloaded version of the algorithm has some limits, caused by how the idle
+slope is expressed in the adapter's registers. It can only represent idle slopes
+in 16.38431 kbps units, which means that if a idle slope of 2576kbps is
+requested, the controller will be configured to use a idle slope of ~2589 kbps,
+because the driver rounds the value up. For more details, see the comments on
+:c:func:`igb_config_tx_modes()`.
+
+NOTE: This feature is exclusive to i210 models.
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
new file mode 100644
index 0000000000..11a9017f30
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+===========================================================
+Linux Base Virtual Function Driver for Intel(R) 1G Ethernet
+===========================================================
+
+Intel Gigabit Virtual Function Linux driver.
+Copyright(c) 1999-2018 Intel Corporation.
+
+Contents
+========
+- Identifying Your Adapter
+- Additional Configurations
+- Support
+
+This driver supports Intel 82576-based virtual function devices-based virtual
+function devices that can only be activated on kernels that support SR-IOV.
+
+SR-IOV requires the correct platform and OS support.
+
+The guest OS loading this driver must support MSI-X interrupts.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+Driver information can be obtained using ethtool, lspci, and ifconfig.
+Instructions on updating ethtool can be found in the section Additional
+Configurations later in this document.
+
+NOTE: There is a limit of a total of 32 shared VLANs to 1 or more VFs.
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Additional Features and Configurations
+======================================
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+
+https://www.kernel.org/pub/software/network/ethtool/
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
new file mode 100644
index 0000000000..1e5f16993f
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
@@ -0,0 +1,552 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+===========================================================================
+Linux Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Adapters
+===========================================================================
+
+Intel 10 Gigabit Linux driver.
+Copyright(c) 1999-2018 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Command Line Parameters
+- Additional Configurations
+- Known Issues
+- Support
+
+Identifying Your Adapter
+========================
+The driver is compatible with devices based on the following:
+
+ * Intel(R) Ethernet Controller 82598
+ * Intel(R) Ethernet Controller 82599
+ * Intel(R) Ethernet Controller X520
+ * Intel(R) Ethernet Controller X540
+ * Intel(R) Ethernet Controller x550
+ * Intel(R) Ethernet Controller X552
+ * Intel(R) Ethernet Controller X553
+
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+SFP+ Devices with Pluggable Optics
+----------------------------------
+
+82599-BASED ADAPTERS
+~~~~~~~~~~~~~~~~~~~~
+NOTES:
+- If your 82599-based Intel(R) Network Adapter came with Intel optics or is an
+Intel(R) Ethernet Server Adapter X520-2, then it only supports Intel optics
+and/or the direct attach cables listed below.
+- When 82599-based SFP+ devices are connected back to back, they should be set
+to the same Speed setting via ethtool. Results may vary if you mix speed
+settings.
+
++---------------+---------------------------------------+------------------+
+| Supplier | Type | Part Numbers |
++===============+=======================================+==================+
+| SR Modules |
++---------------+---------------------------------------+------------------+
+| Intel | DUAL RATE 1G/10G SFP+ SR (bailed) | FTLX8571D3BCV-IT |
++---------------+---------------------------------------+------------------+
+| Intel | DUAL RATE 1G/10G SFP+ SR (bailed) | AFBR-703SDZ-IN2 |
++---------------+---------------------------------------+------------------+
+| Intel | DUAL RATE 1G/10G SFP+ SR (bailed) | AFBR-703SDDZ-IN1 |
++---------------+---------------------------------------+------------------+
+| LR Modules |
++---------------+---------------------------------------+------------------+
+| Intel | DUAL RATE 1G/10G SFP+ LR (bailed) | FTLX1471D3BCV-IT |
++---------------+---------------------------------------+------------------+
+| Intel | DUAL RATE 1G/10G SFP+ LR (bailed) | AFCT-701SDZ-IN2 |
++---------------+---------------------------------------+------------------+
+| Intel | DUAL RATE 1G/10G SFP+ LR (bailed) | AFCT-701SDDZ-IN1 |
++---------------+---------------------------------------+------------------+
+
+The following is a list of 3rd party SFP+ modules that have received some
+testing. Not all modules are applicable to all devices.
+
++---------------+---------------------------------------+------------------+
+| Supplier | Type | Part Numbers |
++===============+=======================================+==================+
+| Finisar | SFP+ SR bailed, 10g single rate | FTLX8571D3BCL |
++---------------+---------------------------------------+------------------+
+| Avago | SFP+ SR bailed, 10g single rate | AFBR-700SDZ |
++---------------+---------------------------------------+------------------+
+| Finisar | SFP+ LR bailed, 10g single rate | FTLX1471D3BCL |
++---------------+---------------------------------------+------------------+
+| Finisar | DUAL RATE 1G/10G SFP+ SR (No Bail) | FTLX8571D3QCV-IT |
++---------------+---------------------------------------+------------------+
+| Avago | DUAL RATE 1G/10G SFP+ SR (No Bail) | AFBR-703SDZ-IN1 |
++---------------+---------------------------------------+------------------+
+| Finisar | DUAL RATE 1G/10G SFP+ LR (No Bail) | FTLX1471D3QCV-IT |
++---------------+---------------------------------------+------------------+
+| Avago | DUAL RATE 1G/10G SFP+ LR (No Bail) | AFCT-701SDZ-IN1 |
++---------------+---------------------------------------+------------------+
+| Finisar | 1000BASE-T SFP | FCLF8522P2BTL |
++---------------+---------------------------------------+------------------+
+| Avago | 1000BASE-T | ABCU-5710RZ |
++---------------+---------------------------------------+------------------+
+| HP | 1000BASE-SX SFP | 453153-001 |
++---------------+---------------------------------------+------------------+
+
+82599-based adapters support all passive and active limiting direct attach
+cables that comply with SFF-8431 v4.1 and SFF-8472 v10.4 specifications.
+
+Laser turns off for SFP+ when ifconfig ethX down
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+"ifconfig ethX down" turns off the laser for 82599-based SFP+ fiber adapters.
+"ifconfig ethX up" turns on the laser.
+Alternatively, you can use "ip link set [down/up] dev ethX" to turn the
+laser off and on.
+
+
+82599-based QSFP+ Adapters
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+NOTES:
+- If your 82599-based Intel(R) Network Adapter came with Intel optics, it only
+supports Intel optics.
+- 82599-based QSFP+ adapters only support 4x10 Gbps connections. 1x40 Gbps
+connections are not supported. QSFP+ link partners must be configured for
+4x10 Gbps.
+- 82599-based QSFP+ adapters do not support automatic link speed detection.
+The link speed must be configured to either 10 Gbps or 1 Gbps to match the link
+partners speed capabilities. Incorrect speed configurations will result in
+failure to link.
+- Intel(R) Ethernet Converged Network Adapter X520-Q1 only supports the optics
+and direct attach cables listed below.
+
++---------------+---------------------------------------+------------------+
+| Supplier | Type | Part Numbers |
++===============+=======================================+==================+
+| Intel | DUAL RATE 1G/10G QSFP+ SRL (bailed) | E10GQSFPSR |
++---------------+---------------------------------------+------------------+
+
+82599-based QSFP+ adapters support all passive and active limiting QSFP+
+direct attach cables that comply with SFF-8436 v4.1 specifications.
+
+82598-BASED ADAPTERS
+~~~~~~~~~~~~~~~~~~~~
+NOTES:
+- Intel(r) Ethernet Network Adapters that support removable optical modules
+only support their original module type (for example, the Intel(R) 10 Gigabit
+SR Dual Port Express Module only supports SR optical modules). If you plug in
+a different type of module, the driver will not load.
+- Hot Swapping/hot plugging optical modules is not supported.
+- Only single speed, 10 gigabit modules are supported.
+- LAN on Motherboard (LOMs) may support DA, SR, or LR modules. Other module
+types are not supported. Please see your system documentation for details.
+
+The following is a list of SFP+ modules and direct attach cables that have
+received some testing. Not all modules are applicable to all devices.
+
++---------------+---------------------------------------+------------------+
+| Supplier | Type | Part Numbers |
++===============+=======================================+==================+
+| Finisar | SFP+ SR bailed, 10g single rate | FTLX8571D3BCL |
++---------------+---------------------------------------+------------------+
+| Avago | SFP+ SR bailed, 10g single rate | AFBR-700SDZ |
++---------------+---------------------------------------+------------------+
+| Finisar | SFP+ LR bailed, 10g single rate | FTLX1471D3BCL |
++---------------+---------------------------------------+------------------+
+
+82598-based adapters support all passive direct attach cables that comply with
+SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach cables
+are not supported.
+
+Third party optic modules and cables referred to above are listed only for the
+purpose of highlighting third party specifications and potential
+compatibility, and are not recommendations or endorsements or sponsorship of
+any third party's product by Intel. Intel is not endorsing or promoting
+products made by any third party and the third party reference is provided
+only to share information regarding certain optic modules and cables with the
+above specifications. There may be other manufacturers or suppliers, producing
+or supplying optic modules and cables with similar or matching descriptions.
+Customers must use their own discretion and diligence to purchase optic
+modules and cables from any third party of their choice. Customers are solely
+responsible for assessing the suitability of the product and/or devices and
+for the selection of the vendor for purchasing any product. THE OPTIC MODULES
+AND CABLES REFERRED TO ABOVE ARE NOT WARRANTED OR SUPPORTED BY INTEL. INTEL
+ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED
+WARRANTY, RELATING TO SALE AND/OR USE OF SUCH THIRD PARTY PRODUCTS OR
+SELECTION OF VENDOR BY CUSTOMERS.
+
+Command Line Parameters
+=======================
+
+max_vfs
+-------
+:Valid Range: 1-63
+
+This parameter adds support for SR-IOV. It causes the driver to spawn up to
+max_vfs worth of virtual functions.
+If the value is greater than 0 it will also force the VMDq parameter to be 1 or
+more.
+
+NOTE: This parameter is only used on kernel 3.7.x and below. On kernel 3.8.x
+and above, use sysfs to enable VFs. Also, for Red Hat distributions, this
+parameter is only used on version 6.6 and older. For version 6.7 and newer, use
+sysfs. For example::
+
+ #echo $num_vf_enabled > /sys/class/net/$dev/device/sriov_numvfs // enable VFs
+ #echo 0 > /sys/class/net/$dev/device/sriov_numvfs //disable VFs
+
+The parameters for the driver are referenced by position. Thus, if you have a
+dual port adapter, or more than one adapter in your system, and want N virtual
+functions per port, you must specify a number for each port with each parameter
+separated by a comma. For example::
+
+ modprobe ixgbe max_vfs=4
+
+This will spawn 4 VFs on the first port.
+
+::
+
+ modprobe ixgbe max_vfs=2,4
+
+This will spawn 2 VFs on the first port and 4 VFs on the second port.
+
+NOTE: Caution must be used in loading the driver with these parameters.
+Depending on your system configuration, number of slots, etc., it is impossible
+to predict in all cases where the positions would be on the command line.
+
+NOTE: Neither the device nor the driver control how VFs are mapped into config
+space. Bus layout will vary by operating system. On operating systems that
+support it, you can check sysfs to find the mapping.
+
+NOTE: When either SR-IOV mode or VMDq mode is enabled, hardware VLAN filtering
+and VLAN tag stripping/insertion will remain enabled. Please remove the old
+VLAN filter before the new VLAN filter is added. For example,
+
+::
+
+ ip link set eth0 vf 0 vlan 100 // set VLAN 100 for VF 0
+ ip link set eth0 vf 0 vlan 0 // Delete VLAN 100
+ ip link set eth0 vf 0 vlan 200 // set a new VLAN 200 for VF 0
+
+With kernel 3.6, the driver supports the simultaneous usage of max_vfs and DCB
+features, subject to the constraints described below. Prior to kernel 3.6, the
+driver did not support the simultaneous operation of max_vfs greater than 0 and
+the DCB features (multiple traffic classes utilizing Priority Flow Control and
+Extended Transmission Selection).
+
+When DCB is enabled, network traffic is transmitted and received through
+multiple traffic classes (packet buffers in the NIC). The traffic is associated
+with a specific class based on priority, which has a value of 0 through 7 used
+in the VLAN tag. When SR-IOV is not enabled, each traffic class is associated
+with a set of receive/transmit descriptor queue pairs. The number of queue
+pairs for a given traffic class depends on the hardware configuration. When
+SR-IOV is enabled, the descriptor queue pairs are grouped into pools. The
+Physical Function (PF) and each Virtual Function (VF) is allocated a pool of
+receive/transmit descriptor queue pairs. When multiple traffic classes are
+configured (for example, DCB is enabled), each pool contains a queue pair from
+each traffic class. When a single traffic class is configured in the hardware,
+the pools contain multiple queue pairs from the single traffic class.
+
+The number of VFs that can be allocated depends on the number of traffic
+classes that can be enabled. The configurable number of traffic classes for
+each enabled VF is as follows:
+0 - 15 VFs = Up to 8 traffic classes, depending on device support
+16 - 31 VFs = Up to 4 traffic classes
+32 - 63 VFs = 1 traffic class
+
+When VFs are configured, the PF is allocated one pool as well. The PF supports
+the DCB features with the constraint that each traffic class will only use a
+single queue pair. When zero VFs are configured, the PF can support multiple
+queue pairs per traffic class.
+
+allow_unsupported_sfp
+---------------------
+:Valid Range: 0,1
+:Default Value: 0 (disabled)
+
+This parameter allows unsupported and untested SFP+ modules on 82599-based
+adapters, as long as the type of module is known to the driver.
+
+debug
+-----
+:Valid Range: 0-16 (0=none,...,16=all)
+:Default Value: 0
+
+This parameter adjusts the level of debug messages displayed in the system
+logs.
+
+
+Additional Features and Configurations
+======================================
+
+Flow Control
+------------
+Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable
+receiving and transmitting pause frames for ixgbe. When transmit is enabled,
+pause frames are generated when the receive packet buffer crosses a predefined
+threshold. When receive is enabled, the transmit unit will halt for the time
+delay specified when a pause frame is received.
+
+NOTE: You must have a flow control capable link partner.
+
+Flow Control is enabled by default.
+
+Use ethtool to change the flow control settings. To enable or disable Rx or
+Tx Flow Control::
+
+ ethtool -A eth? rx <on|off> tx <on|off>
+
+Note: This command only enables or disables Flow Control if auto-negotiation is
+disabled. If auto-negotiation is enabled, this command changes the parameters
+used for auto-negotiation with the link partner.
+
+To enable or disable auto-negotiation::
+
+ ethtool -s eth? autoneg <on|off>
+
+Note: Flow Control auto-negotiation is part of link auto-negotiation. Depending
+on your device, you may not be able to change the auto-negotiation setting.
+
+NOTE: For 82598 backplane cards entering 1 gigabit mode, flow control default
+behavior is changed to off. Flow control in 1 gigabit mode on these devices can
+lead to transmit hangs.
+
+Intel(R) Ethernet Flow Director
+-------------------------------
+The Intel Ethernet Flow Director performs the following tasks:
+
+- Directs receive packets according to their flows to different queues.
+- Enables tight control on routing a flow in the platform.
+- Matches flows and CPU cores for flow affinity.
+- Supports multiple parameters for flexible flow classification and load
+ balancing (in SFP mode only).
+
+NOTE: Intel Ethernet Flow Director masking works in the opposite manner from
+subnet masking. In the following command::
+
+ #ethtool -N eth11 flow-type ip4 src-ip 172.4.1.2 m 255.0.0.0 dst-ip \
+ 172.21.1.1 m 255.128.0.0 action 31
+
+The src-ip value that is written to the filter will be 0.4.1.2, not 172.0.0.0
+as might be expected. Similarly, the dst-ip value written to the filter will be
+0.21.1.1, not 172.0.0.0.
+
+To enable or disable the Intel Ethernet Flow Director::
+
+ # ethtool -K ethX ntuple <on|off>
+
+When disabling ntuple filters, all the user programmed filters are flushed from
+the driver cache and hardware. All needed filters must be re-added when ntuple
+is re-enabled.
+
+To add a filter that directs packet to queue 2, use -U or -N switch::
+
+ # ethtool -N ethX flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 src-port 2000 dst-port 2001 action 2 [loc 1]
+
+To see the list of filters currently present::
+
+ # ethtool <-u|-n> ethX
+
+Sideband Perfect Filters
+------------------------
+Sideband Perfect Filters are used to direct traffic that matches specified
+characteristics. They are enabled through ethtool's ntuple interface. To add a
+new filter use the following command::
+
+ ethtool -U <device> flow-type <type> src-ip <ip> dst-ip <ip> src-port <port> \
+ dst-port <port> action <queue>
+
+Where:
+ <device> - the ethernet device to program
+ <type> - can be ip4, tcp4, udp4, or sctp4
+ <ip> - the IP address to match on
+ <port> - the port number to match on
+ <queue> - the queue to direct traffic towards (-1 discards the matched traffic)
+
+Use the following command to delete a filter::
+
+ ethtool -U <device> delete <N>
+
+Where <N> is the filter id displayed when printing all the active filters, and
+may also have been specified using "loc <N>" when adding the filter.
+
+The following example matches TCP traffic sent from 192.168.0.1, port 5300,
+directed to 192.168.0.5, port 80, and sends it to queue 7::
+
+ ethtool -U enp130s0 flow-type tcp4 src-ip 192.168.0.1 dst-ip 192.168.0.5 \
+ src-port 5300 dst-port 80 action 7
+
+For each flow-type, the programmed filters must all have the same matching
+input set. For example, issuing the following two commands is acceptable::
+
+ ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.5 src-port 55 action 10
+
+Issuing the next two commands, however, is not acceptable, since the first
+specifies src-ip and the second specifies dst-ip::
+
+ ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ ethtool -U enp130s0 flow-type ip4 dst-ip 192.168.0.5 src-port 55 action 10
+
+The second command will fail with an error. You may program multiple filters
+with the same fields, using different values, but, on one device, you may not
+program two TCP4 filters with different matching fields.
+
+Matching on a sub-portion of a field is not supported by the ixgbe driver, thus
+partial mask fields are not supported.
+
+To create filters that direct traffic to a specific Virtual Function, use the
+"user-def" parameter. Specify the user-def as a 64 bit value, where the lower 32
+bits represents the queue number, while the next 8 bits represent which VF.
+Note that 0 is the PF, so the VF identifier is offset by 1. For example::
+
+ ... user-def 0x800000002 ...
+
+specifies to direct traffic to Virtual Function 7 (8 minus 1) into queue 2 of
+that VF.
+
+Note that these filters will not break internal routing rules, and will not
+route traffic that otherwise would not have been sent to the specified Virtual
+Function.
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <x> is the interface number::
+
+ ifconfig eth<x> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+ ip link set mtu 9000 dev eth<x>
+ ip link set up dev eth<x>
+
+This setting is not saved across reboots. The setting change can be made
+permanent by adding 'MTU=9000' to the file::
+
+ /etc/sysconfig/network-scripts/ifcfg-eth<x> // for RHEL
+ /etc/sysconfig/network/<config_file> // for SLES
+
+NOTE: The maximum MTU setting for Jumbo Frames is 9710. This value coincides
+with the maximum Jumbo Frames size of 9728 bytes.
+
+NOTE: This driver will attempt to use multiple page sized buffers to receive
+each jumbo packet. This should help to avoid buffer starvation issues when
+allocating receive packets.
+
+NOTE: For 82599-based network connections, if you are enabling jumbo frames in
+a virtual function (VF), jumbo frames must first be enabled in the physical
+function (PF). The VF MTU setting cannot be larger than the PF MTU.
+
+NBASE-T Support
+---------------
+The ixgbe driver supports NBASE-T on some devices. However, the advertisement
+of NBASE-T speeds is suppressed by default, to accommodate broken network
+switches which cannot cope with advertised NBASE-T speeds. Use the ethtool
+command to enable advertising NBASE-T speeds on devices which support it::
+
+ ethtool -s eth? advertise 0x1800000001028
+
+On Linux systems with INTERFACES(5), this can be specified as a pre-up command
+in /etc/network/interfaces so that the interface is always brought up with
+NBASE-T support, e.g.::
+
+ iface eth? inet dhcp
+ pre-up ethtool -s eth? advertise 0x1800000001028 || true
+
+Generic Receive Offload, aka GRO
+--------------------------------
+The driver supports the in-kernel software implementation of GRO. GRO has
+shown that by coalescing Rx traffic into larger chunks of data, CPU
+utilization can be significantly reduced when under large Rx load. GRO is an
+evolution of the previously-used LRO interface. GRO is able to coalesce
+other protocols besides TCP. It's also safe to use with configurations that
+are problematic for LRO, namely bridging and iSCSI.
+
+Data Center Bridging (DCB)
+--------------------------
+NOTE:
+The kernel assumes that TC0 is available, and will disable Priority Flow
+Control (PFC) on the device if TC0 is not available. To fix this, ensure TC0 is
+enabled when setting up DCB on your switch.
+
+DCB is a configuration Quality of Service implementation in hardware. It uses
+the VLAN priority tag (802.1p) to filter traffic. That means that there are 8
+different priorities that traffic can be filtered into. It also enables
+priority flow control (802.1Qbb) which can limit or eliminate the number of
+dropped packets during network stress. Bandwidth can be allocated to each of
+these priorities, which is enforced at the hardware level (802.1Qaz).
+
+Adapter firmware implements LLDP and DCBX protocol agents as per 802.1AB and
+802.1Qaz respectively. The firmware based DCBX agent runs in willing mode only
+and can accept settings from a DCBX capable peer. Software configuration of
+DCBX parameters via dcbtool/lldptool are not supported.
+
+The ixgbe driver implements the DCB netlink interface layer to allow user-space
+to communicate with the driver and query DCB configuration for the port.
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+https://www.kernel.org/pub/software/network/ethtool/
+
+FCoE
+----
+The ixgbe driver supports Fiber Channel over Ethernet (FCoE) and Data Center
+Bridging (DCB). This code has no default effect on the regular driver
+operation. Configuring DCB and FCoE is outside the scope of this README. Refer
+to http://www.open-fcoe.org/ for FCoE project information and contact
+ixgbe-eedc@lists.sourceforge.net for DCB information.
+
+MAC and VLAN anti-spoofing feature
+----------------------------------
+When a malicious driver attempts to send a spoofed packet, it is dropped by the
+hardware and not transmitted.
+
+An interrupt is sent to the PF driver notifying it of the spoof attempt. When a
+spoofed packet is detected, the PF driver will send the following message to
+the system log (displayed by the "dmesg" command)::
+
+ ixgbe ethX: ixgbe_spoof_check: n spoofed packets detected
+
+where "x" is the PF interface number; and "n" is number of spoofed packets.
+NOTE: This feature can be disabled for a specific Virtual Function (VF)::
+
+ ip link set <pf dev> vf <vf id> spoofchk {off|on}
+
+IPsec Offload
+-------------
+The ixgbe driver supports IPsec Hardware Offload. When creating Security
+Associations with "ip xfrm ..." the 'offload' tag option can be used to
+register the IPsec SA with the driver in order to get higher throughput in
+the secure communications.
+
+The offload is also supported for ixgbe's VFs, but the VF must be set as
+'trusted' and the support must be enabled with::
+
+ ethtool --set-priv-flags eth<x> vf-ipsec on
+ ip link set eth<x> vf <y> trust on
+
+
+Known Issues/Troubleshooting
+============================
+
+Enabling SR-IOV in a 64-bit Microsoft Windows Server 2012/R2 guest OS
+---------------------------------------------------------------------
+Linux KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM.
+This includes traditional PCIe devices, as well as SR-IOV-capable devices based
+on the Intel Ethernet Controller XL710.
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
new file mode 100644
index 0000000000..08dc0d368a
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
@@ -0,0 +1,62 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================================================
+Linux Base Virtual Function Driver for Intel(R) 10G Ethernet
+============================================================
+
+Intel 10 Gigabit Virtual Function Linux driver.
+Copyright(c) 1999-2018 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Known Issues
+- Support
+
+This driver supports 82599, X540, X550, and X552-based virtual function devices
+that can only be activated on kernels that support SR-IOV.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+
+Identifying Your Adapter
+========================
+The driver is compatible with devices based on the following:
+
+ * Intel(R) Ethernet Controller 82598
+ * Intel(R) Ethernet Controller 82599
+ * Intel(R) Ethernet Controller X520
+ * Intel(R) Ethernet Controller X540
+ * Intel(R) Ethernet Controller x550
+ * Intel(R) Ethernet Controller X552
+ * Intel(R) Ethernet Controller X553
+
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+Known Issues/Troubleshooting
+============================
+
+SR-IOV requires the correct platform and OS support.
+
+The guest OS loading this driver must support MSI-X interrupts.
+
+This driver is only supported as a loadable module at this time. Intel is not
+supplying patches against the kernel source to allow for static linking of the
+drivers.
+
+VLANs: There is a limit of a total of 64 shared VLANs to 1 or more VFs.
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
new file mode 100644
index 0000000000..cad96c8d1f
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+====================================================================
+Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC
+====================================================================
+
+Network driver for Marvell's Octeon PCI EndPoint NIC.
+Copyright (c) 2020 Marvell International Ltd.
+
+Contents
+========
+
+- `Overview`_
+- `Supported Devices`_
+- `Interface Control`_
+
+Overview
+========
+This driver implements networking functionality of Marvell's Octeon PCI
+EndPoint NIC.
+
+Supported Devices
+=================
+Currently, this driver support following devices:
+ * Network controller: Cavium, Inc. Device b200
+ * Network controller: Cavium, Inc. Device b400
+
+Interface Control
+=================
+Network Interface control like changing mtu, link speed, link down/up are
+done by writing command to mailbox command queue, a mailbox interface
+implemented through a reserved region in BAR4.
+This driver writes the commands into the mailbox and the firmware on the
+Octeon device processes them. The firmware also sends unsolicited notifications
+to driver for events suchs as link change, through notification queue
+implemented as part of mailbox interface.
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
new file mode 100644
index 0000000000..1e196cb9ce
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -0,0 +1,342 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+====================================
+Marvell OcteonTx2 RVU Kernel Drivers
+====================================
+
+Copyright (c) 2020 Marvell International Ltd.
+
+Contents
+========
+
+- `Overview`_
+- `Drivers`_
+- `Basic packet flow`_
+- `Devlink health reporters`_
+- `Quality of service`_
+
+Overview
+========
+
+Resource virtualization unit (RVU) on Marvell's OcteonTX2 SOC maps HW
+resources from the network, crypto and other functional blocks into
+PCI-compatible physical and virtual functions. Each functional block
+again has multiple local functions (LFs) for provisioning to PCI devices.
+RVU supports multiple PCIe SRIOV physical functions (PFs) and virtual
+functions (VFs). PF0 is called the administrative / admin function (AF)
+and has privileges to provision RVU functional block's LFs to each of the
+PF/VF.
+
+RVU managed networking functional blocks
+ - Network pool or buffer allocator (NPA)
+ - Network interface controller (NIX)
+ - Network parser CAM (NPC)
+ - Schedule/Synchronize/Order unit (SSO)
+ - Loopback interface (LBK)
+
+RVU managed non-networking functional blocks
+ - Crypto accelerator (CPT)
+ - Scheduled timers unit (TIM)
+ - Schedule/Synchronize/Order unit (SSO)
+ Used for both networking and non networking usecases
+
+Resource provisioning examples
+ - A PF/VF with NIX-LF & NPA-LF resources works as a pure network device
+ - A PF/VF with CPT-LF resource works as a pure crypto offload device.
+
+RVU functional blocks are highly configurable as per software requirements.
+
+Firmware setups following stuff before kernel boots
+ - Enables required number of RVU PFs based on number of physical links.
+ - Number of VFs per PF are either static or configurable at compile time.
+ Based on config, firmware assigns VFs to each of the PFs.
+ - Also assigns MSIX vectors to each of PF and VFs.
+ - These are not changed after kernel boot.
+
+Drivers
+=======
+
+Linux kernel will have multiple drivers registering to different PF and VFs
+of RVU. Wrt networking there will be 3 flavours of drivers.
+
+Admin Function driver
+---------------------
+
+As mentioned above RVU PF0 is called the admin function (AF), this driver
+supports resource provisioning and configuration of functional blocks.
+Doesn't handle any I/O. It sets up few basic stuff but most of the
+funcionality is achieved via configuration requests from PFs and VFs.
+
+PF/VFs communicates with AF via a shared memory region (mailbox). Upon
+receiving requests AF does resource provisioning and other HW configuration.
+AF is always attached to host kernel, but PFs and their VFs may be used by host
+kernel itself, or attached to VMs or to userspace applications like
+DPDK etc. So AF has to handle provisioning/configuration requests sent
+by any device from any domain.
+
+AF driver also interacts with underlying firmware to
+ - Manage physical ethernet links ie CGX LMACs.
+ - Retrieve information like speed, duplex, autoneg etc
+ - Retrieve PHY EEPROM and stats.
+ - Configure FEC, PAM modes
+ - etc
+
+From pure networking side AF driver supports following functionality.
+ - Map a physical link to a RVU PF to which a netdev is registered.
+ - Attach NIX and NPA block LFs to RVU PF/VF which provide buffer pools, RQs, SQs
+ for regular networking functionality.
+ - Flow control (pause frames) enable/disable/config.
+ - HW PTP timestamping related config.
+ - NPC parser profile config, basically how to parse pkt and what info to extract.
+ - NPC extract profile config, what to extract from the pkt to match data in MCAM entries.
+ - Manage NPC MCAM entries, upon request can frame and install requested packet forwarding rules.
+ - Defines receive side scaling (RSS) algorithms.
+ - Defines segmentation offload algorithms (eg TSO)
+ - VLAN stripping, capture and insertion config.
+ - SSO and TIM blocks config which provide packet scheduling support.
+ - Debugfs support, to check current resource provising, current status of
+ NPA pools, NIX RQ, SQ and CQs, various stats etc which helps in debugging issues.
+ - And many more.
+
+Physical Function driver
+------------------------
+
+This RVU PF handles IO, is mapped to a physical ethernet link and this
+driver registers a netdev. This supports SR-IOV. As said above this driver
+communicates with AF with a mailbox. To retrieve information from physical
+links this driver talks to AF and AF gets that info from firmware and responds
+back ie cannot talk to firmware directly.
+
+Supports ethtool for configuring links, RSS, queue count, queue size,
+flow control, ntuple filters, dump PHY EEPROM, config FEC etc.
+
+Virtual Function driver
+-----------------------
+
+There are two types VFs, VFs that share the physical link with their parent
+SR-IOV PF and the VFs which work in pairs using internal HW loopback channels (LBK).
+
+Type1:
+ - These VFs and their parent PF share a physical link and used for outside communication.
+ - VFs cannot communicate with AF directly, they send mbox message to PF and PF
+ forwards that to AF. AF after processing, responds back to PF and PF forwards
+ the reply to VF.
+ - From functionality point of view there is no difference between PF and VF as same type
+ HW resources are attached to both. But user would be able to configure few stuff only
+ from PF as PF is treated as owner/admin of the link.
+
+Type2:
+ - RVU PF0 ie admin function creates these VFs and maps them to loopback block's channels.
+ - A set of two VFs (VF0 & VF1, VF2 & VF3 .. so on) works as a pair ie pkts sent out of
+ VF0 will be received by VF1 and vice versa.
+ - These VFs can be used by applications or virtual machines to communicate between them
+ without sending traffic outside. There is no switch present in HW, hence the support
+ for loopback VFs.
+ - These communicate directly with AF (PF0) via mbox.
+
+Except for the IO channels or links used for packet reception and transmission there is
+no other difference between these VF types. AF driver takes care of IO channel mapping,
+hence same VF driver works for both types of devices.
+
+Basic packet flow
+=================
+
+Ingress
+-------
+
+1. CGX LMAC receives packet.
+2. Forwards the packet to the NIX block.
+3. Then submitted to NPC block for parsing and then MCAM lookup to get the destination RVU device.
+4. NIX LF attached to the destination RVU device allocates a buffer from RQ mapped buffer pool of NPA block LF.
+5. RQ may be selected by RSS or by configuring MCAM rule with a RQ number.
+6. Packet is DMA'ed and driver is notified.
+
+Egress
+------
+
+1. Driver prepares a send descriptor and submits to SQ for transmission.
+2. The SQ is already configured (by AF) to transmit on a specific link/channel.
+3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
+4. NIX block transmits the pkt on the designated channel.
+5. NPC MCAM entries can be installed to divert pkt onto a different channel.
+
+Devlink health reporters
+========================
+
+NPA Reporters
+-------------
+The NPA reporters are responsible for reporting and recovering the following group of errors:
+
+1. GENERAL events
+
+ - Error due to operation of unmapped PF.
+ - Error due to disabled alloc/free for other HW blocks (NIX, SSO, TIM, DPI and AURA).
+
+2. ERROR events
+
+ - Fault due to NPA_AQ_INST_S read or NPA_AQ_RES_S write.
+ - AQ Doorbell Error.
+
+3. RAS events
+
+ - RAS Error Reporting for NPA_AQ_INST_S/NPA_AQ_RES_S.
+
+4. RVU events
+
+ - Error due to unmapped slot.
+
+Sample Output::
+
+ ~# devlink health
+ pci/0002:01:00.0:
+ reporter hw_npa_intr
+ state healthy error 2872 recover 2872 last_dump_date 2020-12-10 last_dump_time 09:39:09 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_gen
+ state healthy error 2872 recover 2872 last_dump_date 2020-12-11 last_dump_time 04:43:04 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_err
+ state healthy error 2871 recover 2871 last_dump_date 2020-12-10 last_dump_time 09:39:17 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_ras
+ state healthy error 0 recover 0 last_dump_date 2020-12-10 last_dump_time 09:32:40 grace_period 0 auto_recover true auto_dump true
+
+Each reporter dumps the
+
+ - Error Type
+ - Error Register value
+ - Reason in words
+
+For example::
+
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_gen
+ NPA_AF_GENERAL:
+ NPA General Interrupt Reg : 1
+ NIX0: free disabled RX
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_intr
+ NPA_AF_RVU:
+ NPA RVU Interrupt Reg : 1
+ Unmap Slot Error
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_err
+ NPA_AF_ERR:
+ NPA Error Interrupt Reg : 4096
+ AQ Doorbell Error
+
+
+NIX Reporters
+-------------
+The NIX reporters are responsible for reporting and recovering the following group of errors:
+
+1. GENERAL events
+
+ - Receive mirror/multicast packet drop due to insufficient buffer.
+ - SMQ Flush operation.
+
+2. ERROR events
+
+ - Memory Fault due to WQE read/write from multicast/mirror buffer.
+ - Receive multicast/mirror replication list error.
+ - Receive packet on an unmapped PF.
+ - Fault due to NIX_AQ_INST_S read or NIX_AQ_RES_S write.
+ - AQ Doorbell Error.
+
+3. RAS events
+
+ - RAS Error Reporting for NIX Receive Multicast/Mirror Entry Structure.
+ - RAS Error Reporting for WQE/Packet Data read from Multicast/Mirror Buffer..
+ - RAS Error Reporting for NIX_AQ_INST_S/NIX_AQ_RES_S.
+
+4. RVU events
+
+ - Error due to unmapped slot.
+
+Sample Output::
+
+ ~# ./devlink health
+ pci/0002:01:00.0:
+ reporter hw_npa_intr
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_gen
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_err
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_ras
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_intr
+ state healthy error 1121 recover 1121 last_dump_date 2021-01-19 last_dump_time 05:42:26 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_gen
+ state healthy error 949 recover 949 last_dump_date 2021-01-19 last_dump_time 05:42:43 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_err
+ state healthy error 1147 recover 1147 last_dump_date 2021-01-19 last_dump_time 05:42:59 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_ras
+ state healthy error 409 recover 409 last_dump_date 2021-01-19 last_dump_time 05:43:16 grace_period 0 auto_recover true auto_dump true
+
+Each reporter dumps the
+
+ - Error Type
+ - Error Register value
+ - Reason in words
+
+For example::
+
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_intr
+ NIX_AF_RVU:
+ NIX RVU Interrupt Reg : 1
+ Unmap Slot Error
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_gen
+ NIX_AF_GENERAL:
+ NIX General Interrupt Reg : 1
+ Rx multicast pkt drop
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_err
+ NIX_AF_ERR:
+ NIX Error Interrupt Reg : 64
+ Rx on unmapped PF_FUNC
+
+
+Quality of service
+==================
+
+
+Hardware algorithms used in scheduling
+--------------------------------------
+
+octeontx2 silicon and CN10K transmit interface consists of five transmit levels
+starting from SMQ/MDQ, TL4 to TL1. Each packet will traverse MDQ, TL4 to TL1
+levels. Each level contains an array of queues to support scheduling and shaping.
+The hardware uses the below algorithms depending on the priority of scheduler queues.
+once the usercreates tc classes with different priorities, the driver configures
+schedulers allocated to the class with specified priority along with rate-limiting
+configuration.
+
+1. Strict Priority
+
+ - Once packets are submitted to MDQ, hardware picks all active MDQs having different priority
+ using strict priority.
+
+2. Round Robin
+
+ - Active MDQs having the same priority level are chosen using round robin.
+
+
+Setup HTB offload
+-----------------
+
+1. Enable HW TC offload on the interface::
+
+ # ethtool -K <interface> hw-tc-offload on
+
+2. Crate htb root::
+
+ # tc qdisc add dev <interface> clsact
+ # tc qdisc replace dev <interface> root handle 1: htb offload
+
+3. Create tc classes with different priorities::
+
+ # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1
+
+ # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7
+
+4. Create tc classes with same priorities and different quantum::
+
+ # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 2 quantum 409600
+
+ # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416
+
+ # tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
new file mode 100644
index 0000000000..f69ee1ebee
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -0,0 +1,1305 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+================
+Ethtool counters
+================
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents
+========
+
+- `Overview`_
+- `Groups`_
+- `Types`_
+- `Descriptions`_
+
+Overview
+========
+
+There are several counter groups based on where the counter is being counted. In
+addition, each group of counters may have different counter types.
+
+These counter groups are based on which component in a networking setup,
+illustrated below, that they describe::
+
+ ----------------------------------------
+ | |
+ ---------------------------------------- ---------------------------------------- |
+ | Hypervisor | | VM | |
+ | | | | |
+ | ------------------- --------------- | | ------------------- --------------- | |
+ | | Ethernet driver | | RDMA driver | | | | Ethernet driver | | RDMA driver | | |
+ | ------------------- --------------- | | ------------------- --------------- | |
+ | | | | | | | | |
+ | ------------------- | | ------------------- | |
+ | | | | | |--
+ ---------------------------------------- ----------------------------------------
+ | |
+ ------------- -----------------------------
+ | |
+ ------ ------ ------ ------ ------ ------ ------
+ -----| PF |----------------------| VF |-| VF |-| VF |----- --| PF |--- --| PF |--- --| PF |---
+ | ------ ------ ------ ------ | | ------ | | ------ | | ------ |
+ | | | | | | | |
+ | | | | | | | |
+ | | | | | | | |
+ | eSwitch | | eSwitch | | eSwitch | | eSwitch |
+ ---------------------------------------------------------- ----------- ----------- -----------
+ -------------------------------------------------------------------------------
+ | |
+ | |
+ | Uplink (no counters) |
+ -------------------------------------------------------------------------------
+ ---------------------------------------------------------------
+ | |
+ | |
+ | MPFS (no counters) |
+ ---------------------------------------------------------------
+ |
+ |
+ | Port
+
+Groups
+======
+
+Ring
+ Software counters populated by the driver stack.
+
+Netdev
+ An aggregation of software ring counters.
+
+vPort counters
+ Traffic counters and drops due to steering or no buffers. May indicate issues
+ with NIC. These counters include Ethernet traffic counters (including Raw
+ Ethernet) and RDMA/RoCE traffic counters.
+
+Physical port counters
+ Counters that collect statistics about the PFs and VFs. May indicate issues
+ with NIC, link, or network. This measuring point holds information on
+ standardized counters like IEEE 802.3, RFC2863, RFC 2819, RFC 3635 and
+ additional counters like flow control, FEC and more. Physical port counters
+ are not exposed to virtual machines.
+
+Priority Port Counters
+ A set of the physical port counters, per priority per port.
+
+Types
+=====
+
+Counters are divided into three types.
+
+Traffic Informative Counters
+ Counters which count traffic. These counters can be used for load estimation
+ or for general debug.
+
+Traffic Acceleration Counters
+ Counters which count traffic that was accelerated by Mellanox driver or by
+ hardware. The counters are an additional layer to the informative counter set,
+ and the same traffic is counted in both informative and acceleration counters.
+
+.. [#accel] Traffic acceleration counter.
+
+Error Counters
+ Increment of these counters might indicate a problem. Each of these counters
+ has an explanation and correction action.
+
+Statistic can be fetched via the `ip link` or `ethtool` commands. `ethtool`
+provides more detailed information.::
+
+ ip –s link show <if-name>
+ ethtool -S <if-name>
+
+Descriptions
+============
+
+XSK, PTP, and QoS counters that are similar to counters defined previously will
+not be separately listed. For example, `ptp_tx[i]_packets` will not be
+explicitly documented since `tx[i]_packets` describes the behavior of both
+counters, except `ptp_tx[i]_packets` is only counted when precision time
+protocol is used.
+
+Ring / Netdev Counter
+----------------------------
+The following counters are available per ring or software port.
+
+These counters provide information on the amount of traffic that was accelerated
+by the NIC. The counters are counting the accelerated traffic in addition to the
+standard counters which counts it (i.e. accelerated traffic is counted twice).
+
+The counter names in the table below refers to both ring and port counters. The
+notation for ring counters includes the [i] index without the braces. The
+notation for port counters doesn't include the [i]. A counter name
+`rx[i]_packets` will be printed as `rx0_packets` for ring 0 and `rx_packets` for
+the software port.
+
+.. flat-table:: Ring / Software Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx[i]_packets`
+ - The number of packets received on ring i.
+ - Informative
+
+ * - `rx[i]_bytes`
+ - The number of bytes received on ring i.
+ - Informative
+
+ * - `tx[i]_packets`
+ - The number of packets transmitted on ring i.
+ - Informative
+
+ * - `tx[i]_bytes`
+ - The number of bytes transmitted on ring i.
+ - Informative
+
+ * - `tx[i]_recover`
+ - The number of times the SQ was recovered.
+ - Error
+
+ * - `tx[i]_cqes`
+ - Number of CQEs events on SQ issued on ring i.
+ - Informative
+
+ * - `tx[i]_cqe_err`
+ - The number of error CQEs encountered on the SQ for ring i.
+ - Error
+
+ * - `tx[i]_tso_packets`
+ - The number of TSO packets transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_bytes`
+ - The number of TSO bytes transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_inner_packets`
+ - The number of TSO packets which are indicated to be carry internal
+ encapsulation transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_inner_bytes`
+ - The number of TSO bytes which are indicated to be carry internal
+ encapsulation transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_gro_packets`
+ - Number of received packets processed using hardware-accelerated GRO. The
+ number of hardware GRO offloaded packets received on ring i.
+ - Acceleration
+
+ * - `rx[i]_gro_bytes`
+ - Number of received bytes processed using hardware-accelerated GRO. The
+ number of hardware GRO offloaded bytes received on ring i.
+ - Acceleration
+
+ * - `rx[i]_gro_skbs`
+ - The number of receive SKBs constructed while performing
+ hardware-accelerated GRO.
+ - Informative
+
+ * - `rx[i]_gro_match_packets`
+ - Number of received packets processed using hardware-accelerated GRO that
+ met the flow table match criteria.
+ - Informative
+
+ * - `rx[i]_gro_large_hds`
+ - Number of receive packets using hardware-accelerated GRO that have large
+ headers that require additional memory to be allocated.
+ - Informative
+
+ * - `rx[i]_lro_packets`
+ - The number of LRO packets received on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_lro_bytes`
+ - The number of LRO bytes received on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_ecn_mark`
+ - The number of received packets where the ECN mark was turned on.
+ - Informative
+
+ * - `rx_oversize_pkts_buffer`
+ - The number of dropped received packets due to length which arrived to RQ
+ and exceed software buffer size allocated by the device for incoming
+ traffic. It might imply that the device MTU is larger than the software
+ buffers size.
+ - Error
+
+ * - `rx_oversize_pkts_sw_drop`
+ - Number of received packets dropped in software because the CQE data is
+ larger than the MTU size.
+ - Error
+
+ * - `rx[i]_csum_unnecessary`
+ - Packets received with a `CHECKSUM_UNNECESSARY` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_unnecessary_inner`
+ - Packets received with inner encapsulation with a `CHECKSUM_UNNECESSARY`
+ on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_none`
+ - Packets received with a `CHECKSUM_NONE` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_complete`
+ - Packets received with a `CHECKSUM_COMPLETE` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_complete_tail`
+ - Number of received packets that had checksum calculation computed,
+ potentially needed padding, and were able to do so with
+ `CHECKSUM_PARTIAL`.
+ - Informative
+
+ * - `rx[i]_csum_complete_tail_slow`
+ - Number of received packets that need padding larger than eight bytes for
+ the checksum.
+ - Informative
+
+ * - `tx[i]_csum_partial`
+ - Packets transmitted with a `CHECKSUM_PARTIAL` on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_csum_partial_inner`
+ - Packets transmitted with inner encapsulation with a `CHECKSUM_PARTIAL` on
+ ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_csum_none`
+ - Packets transmitted with no hardware checksum acceleration on ring i.
+ - Informative
+
+ * - `tx[i]_stopped` / `tx_queue_stopped` [#ring_global]_
+ - Events where SQ was full on ring i. If this counter is increased, check
+ the amount of buffers allocated for transmission.
+ - Informative
+
+ * - `tx[i]_wake` / `tx_queue_wake` [#ring_global]_
+ - Events where SQ was full and has become not full on ring i.
+ - Informative
+
+ * - `tx[i]_dropped` / `tx_queue_dropped` [#ring_global]_
+ - Packets transmitted that were dropped due to DMA mapping failure on
+ ring i. If this counter is increased, check the amount of buffers
+ allocated for transmission.
+ - Error
+
+ * - `tx[i]_nop`
+ - The number of nop WQEs (empty WQEs) inserted to the SQ (related to
+ ring i) due to the reach of the end of the cyclic buffer. When reaching
+ near to the end of cyclic buffer the driver may add those empty WQEs to
+ avoid handling a state the a WQE start in the end of the queue and ends
+ in the beginning of the queue. This is a normal condition.
+ - Informative
+
+ * - `tx[i]_added_vlan_packets`
+ - The number of packets sent where vlan tag insertion was offloaded to the
+ hardware.
+ - Acceleration
+
+ * - `rx[i]_removed_vlan_packets`
+ - The number of packets received where vlan tag stripping was offloaded to
+ the hardware.
+ - Acceleration
+
+ * - `rx[i]_wqe_err`
+ - The number of wrong opcodes received on ring i.
+ - Error
+
+ * - `rx[i]_mpwqe_frag`
+ - The number of WQEs that failed to allocate compound page and hence
+ fragmented MPWQE’s (Multi Packet WQEs) were used on ring i. If this
+ counter raise, it may suggest that there is no enough memory for large
+ pages, the driver allocated fragmented pages. This is not abnormal
+ condition.
+ - Informative
+
+ * - `rx[i]_mpwqe_filler_cqes`
+ - The number of filler CQEs events that were issued on ring i.
+ - Informative
+
+ * - `rx[i]_mpwqe_filler_strides`
+ - The number of strides consumed by filler CQEs on ring i.
+ - Informative
+
+ * - `tx[i]_mpwqe_blks`
+ - The number of send blocks processed from Multi-Packet WQEs (mpwqe).
+ - Informative
+
+ * - `tx[i]_mpwqe_pkts`
+ - The number of send packets processed from Multi-Packet WQEs (mpwqe).
+ - Informative
+
+ * - `rx[i]_cqe_compress_blks`
+ - The number of receive blocks with CQE compression on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_cqe_compress_pkts`
+ - The number of receive packets with CQE compression on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_add`
+ - The number of aRFS flow rules added to the device for direct RQ steering
+ on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_request_in`
+ - Number of flow rules that have been requested to move into ring i for
+ direct RQ steering [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_request_out`
+ - Number of flow rules that have been requested to move out of ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_expired`
+ - Number of flow rules that have been expired and removed [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_err`
+ - Number of flow rules that failed to be added to the flow table.
+ - Error
+
+ * - `rx[i]_recover`
+ - The number of times the RQ was recovered.
+ - Error
+
+ * - `tx[i]_xmit_more`
+ - The number of packets sent with `xmit_more` indication set on the skbuff
+ (no doorbell).
+ - Acceleration
+
+ * - `ch[i]_poll`
+ - The number of invocations of NAPI poll of channel i.
+ - Informative
+
+ * - `ch[i]_arm`
+ - The number of times the NAPI poll function completed and armed the
+ completion queues on channel i.
+ - Informative
+
+ * - `ch[i]_aff_change`
+ - The number of times the NAPI poll function explicitly stopped execution
+ on a CPU due to a change in affinity, on channel i.
+ - Informative
+
+ * - `ch[i]_events`
+ - The number of hard interrupt events on the completion queues of channel i.
+ - Informative
+
+ * - `ch[i]_eq_rearm`
+ - The number of times the EQ was recovered.
+ - Error
+
+ * - `ch[i]_force_irq`
+ - Number of times NAPI is triggered by XSK wakeups by posting a NOP to
+ ICOSQ.
+ - Acceleration
+
+ * - `rx[i]_congst_umr`
+ - The number of times an outstanding UMR request is delayed due to
+ congestion, on ring i.
+ - Informative
+
+ * - `rx_pp_alloc_fast`
+ - Number of successful fast path allocations.
+ - Informative
+
+ * - `rx_pp_alloc_slow`
+ - Number of slow path order-0 allocations.
+ - Informative
+
+ * - `rx_pp_alloc_slow_high_order`
+ - Number of slow path high order allocations.
+ - Informative
+
+ * - `rx_pp_alloc_empty`
+ - Counter is incremented when ptr ring is empty, so a slow path allocation
+ was forced.
+ - Informative
+
+ * - `rx_pp_alloc_refill`
+ - Counter is incremented when an allocation which triggered a refill of the
+ cache.
+ - Informative
+
+ * - `rx_pp_alloc_waive`
+ - Counter is incremented when pages obtained from the ptr ring that cannot
+ be added to the cache due to a NUMA mismatch.
+ - Informative
+
+ * - `rx_pp_recycle_cached`
+ - Counter is incremented when recycling placed page in the page pool cache.
+ - Informative
+
+ * - `rx_pp_recycle_cache_full`
+ - Counter is incremented when page pool cache was full.
+ - Informative
+
+ * - `rx_pp_recycle_ring`
+ - Counter is incremented when page placed into the ptr ring.
+ - Informative
+
+ * - `rx_pp_recycle_ring_full`
+ - Counter is incremented when page released from page pool because the ptr
+ ring was full.
+ - Informative
+
+ * - `rx_pp_recycle_released_ref`
+ - Counter is incremented when page released (and not recycled) because
+ refcnt > 1.
+ - Informative
+
+ * - `rx[i]_xsk_buff_alloc_err`
+ - The number of times allocating an skb or XSK buffer failed in the XSK RQ
+ context.
+ - Error
+
+ * - `rx[i]_xdp_tx_xmit`
+ - The number of packets forwarded back to the port due to XDP program
+ `XDP_TX` action (bouncing). these packets are not counted by other
+ software counters. These packets are counted by physical port and vPort
+ counters.
+ - Informative
+
+ * - `rx[i]_xdp_tx_mpwqe`
+ - Number of multi-packet WQEs transmitted by the netdev and `XDP_TX`-ed by
+ the netdev during the RQ context.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_inlnw`
+ - Number of WQE data segments transmitted where the data could be inlined
+ in the WQE and then `XDP_TX`-ed during the RQ context.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_nops`
+ - Number of NOP WQEBBs (WQE building blocks) received posted to the XDP SQ.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_full`
+ - The number of packets that should have been forwarded back to the port
+ due to `XDP_TX` action but were dropped due to full tx queue. These packets
+ are not counted by other software counters. These packets are counted by
+ physical port and vPort counters. You may open more rx queues and spread
+ traffic rx over all queues and/or increase rx ring size.
+ - Error
+
+ * - `rx[i]_xdp_tx_err`
+ - The number of times an `XDP_TX` error such as frame too long and frame
+ too short occurred on `XDP_TX` ring of RX ring.
+ - Error
+
+ * - `rx[i]_xdp_tx_cqes` / `rx_xdp_tx_cqe` [#ring_global]_
+ - The number of completions received on the CQ of the `XDP_TX` ring.
+ - Informative
+
+ * - `rx[i]_xdp_drop`
+ - The number of packets dropped due to XDP program `XDP_DROP` action. these
+ packets are not counted by other software counters. These packets are
+ counted by physical port and vPort counters.
+ - Informative
+
+ * - `rx[i]_xdp_redirect`
+ - The number of times an XDP redirect action was triggered on ring i.
+ - Acceleration
+
+ * - `tx[i]_xdp_xmit`
+ - The number of packets redirected to the interface(due to XDP redirect).
+ These packets are not counted by other software counters. These packets
+ are counted by physical port and vPort counters.
+ - Informative
+
+ * - `tx[i]_xdp_full`
+ - The number of packets redirected to the interface(due to XDP redirect),
+ but were dropped due to full tx queue. these packets are not counted by
+ other software counters. you may enlarge tx queues.
+ - Informative
+
+ * - `tx[i]_xdp_mpwqe`
+ - Number of multi-packet WQEs offloaded onto the NIC that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_inlnw`
+ - Number of WQE data segments where the data could be inlined in the WQE
+ where the data segments were `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_nops`
+ - Number of NOP WQEBBs (WQE building blocks) posted to the SQ that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_err`
+ - The number of packets redirected to the interface(due to XDP redirect)
+ but were dropped due to error such as frame too long and frame too short.
+ - Error
+
+ * - `tx[i]_xdp_cqes`
+ - The number of completions received for packets redirected to the
+ interface(due to XDP redirect) on the CQ.
+ - Informative
+
+ * - `tx[i]_xsk_xmit`
+ - The number of packets transmitted using XSK zerocopy functionality.
+ - Acceleration
+
+ * - `tx[i]_xsk_mpwqe`
+ - Number of multi-packet WQEs offloaded onto the NIC that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xsk_inlnw`
+ - Number of WQE data segments where the data could be inlined in the WQE
+ that are transmitted using XSK zerocopy.
+ - Acceleration
+
+ * - `tx[i]_xsk_full`
+ - Number of times doorbell is rung in XSK zerocopy mode when SQ is full.
+ - Error
+
+ * - `tx[i]_xsk_err`
+ - Number of errors that occurred in XSK zerocopy mode such as if the data
+ size is larger than the MTU size.
+ - Error
+
+ * - `tx[i]_xsk_cqes`
+ - Number of CQEs processed in XSK zerocopy mode.
+ - Acceleration
+
+ * - `tx_tls_ctx`
+ - Number of TLS TX HW offload contexts added to device for encryption.
+ - Acceleration
+
+ * - `tx_tls_del`
+ - Number of TLS TX HW offload contexts removed from device (connection
+ closed).
+ - Acceleration
+
+ * - `tx_tls_pool_alloc`
+ - Number of times a unit of work is successfully allocated in the TLS HW
+ offload pool.
+ - Acceleration
+
+ * - `tx_tls_pool_free`
+ - Number of times a unit of work is freed in the TLS HW offload pool.
+ - Acceleration
+
+ * - `rx_tls_ctx`
+ - Number of TLS RX HW offload contexts added to device for decryption.
+ - Acceleration
+
+ * - `rx_tls_del`
+ - Number of TLS RX HW offload contexts deleted from device (connection has
+ finished).
+ - Acceleration
+
+ * - `rx[i]_tls_decrypted_packets`
+ - Number of successfully decrypted RX packets which were part of a TLS
+ stream.
+ - Acceleration
+
+ * - `rx[i]_tls_decrypted_bytes`
+ - Number of TLS payload bytes in RX packets which were successfully
+ decrypted.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_pkt`
+ - Number of received TLS packets with a resync request.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_start`
+ - Number of times the TLS async resync request was started.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_end`
+ - Number of times the TLS async resync request properly ended with
+ providing the HW tracked tcp-seq.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_skip`
+ - Number of times the TLS async resync request procedure was started but
+ not properly ended.
+ - Error
+
+ * - `rx[i]_tls_resync_res_ok`
+ - Number of times the TLS resync response call to the driver was
+ successfully handled.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_res_retry`
+ - Number of times the TLS resync response call to the driver was
+ reattempted when ICOSQ is full.
+ - Error
+
+ * - `rx[i]_tls_resync_res_skip`
+ - Number of times the TLS resync response call to the driver was terminated
+ unsuccessfully.
+ - Error
+
+ * - `rx[i]_tls_err`
+ - Number of times when CQE TLS offload was problematic.
+ - Error
+
+ * - `tx[i]_tls_encrypted_packets`
+ - The number of send packets that are TLS encrypted by the kernel.
+ - Acceleration
+
+ * - `tx[i]_tls_encrypted_bytes`
+ - The number of send bytes that are TLS encrypted by the kernel.
+ - Acceleration
+
+ * - `tx[i]_tls_ooo`
+ - Number of times out of order TLS SQE fragments were handled on ring i.
+ - Acceleration
+
+ * - `tx[i]_tls_dump_packets`
+ - Number of TLS decrypted packets copied over from NIC over DMA.
+ - Acceleration
+
+ * - `tx[i]_tls_dump_bytes`
+ - Number of TLS decrypted bytes copied over from NIC over DMA.
+ - Acceleration
+
+ * - `tx[i]_tls_resync_bytes`
+ - Number of TLS bytes requested to be resynchronized in order to be
+ decrypted.
+ - Acceleration
+
+ * - `tx[i]_tls_skip_no_sync_data`
+ - Number of TLS send data that can safely be skipped / do not need to be
+ decrypted.
+ - Acceleration
+
+ * - `tx[i]_tls_drop_no_sync_data`
+ - Number of TLS send data that were dropped due to retransmission of TLS
+ data.
+ - Acceleration
+
+ * - `ptp_cq[i]_abort`
+ - Number of times a CQE has to be skipped in precision time protocol due to
+ a skew between the port timestamp and CQE timestamp being greater than
+ 128 seconds.
+ - Error
+
+ * - `ptp_cq[i]_abort_abs_diff_ns`
+ - Accumulation of time differences between the port timestamp and CQE
+ timestamp when the difference is greater than 128 seconds in precision
+ time protocol.
+ - Error
+
+ * - `ptp_cq[i]_late_cqe`
+ - Number of times a CQE has been delivered on the PTP timestamping CQ when
+ the CQE was not expected since a certain amount of time had elapsed where
+ the device typically ensures not posting the CQE.
+ - Error
+
+.. [#ring_global] The corresponding ring and global counters do not share the
+ same name (i.e. do not follow the common naming scheme).
+
+vPort Counters
+--------------
+Counters on the NIC port that is connected to a eSwitch.
+
+.. flat-table:: vPort Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_vport_unicast_packets`
+ - Unicast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_unicast_bytes`
+ - Unicast bytes received, steered to a port including Raw Ethernet QP/DPDK
+ traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_unicast_packets`
+ - Unicast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_unicast_bytes`
+ - Unicast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_multicast_packets`
+ - Multicast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_multicast_bytes`
+ - Multicast bytes received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_multicast_packets`
+ - Multicast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_multicast_bytes`
+ - Multicast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_broadcast_packets`
+ - Broadcast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_broadcast_bytes`
+ - Broadcast bytes received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_broadcast_packets`
+ - Broadcast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_broadcast_bytes`
+ - Broadcast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_rdma_unicast_packets`
+ - RDMA unicast packets received, steered to a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_unicast_bytes`
+ - RDMA unicast bytes received, steered to a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_unicast_packets`
+ - RDMA unicast packets transmitted, steered from a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_unicast_bytes`
+ - RDMA unicast bytes transmitted, steered from a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_multicast_packets`
+ - RDMA multicast packets received, steered to a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_multicast_bytes`
+ - RDMA multicast bytes received, steered to a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_multicast_packets`
+ - RDMA multicast packets transmitted, steered from a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_multicast_bytes`
+ - RDMA multicast bytes transmitted, steered from a port (counters counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `vport_loopback_packets`
+ - Unicast, multicast and broadcast packets that were loop-back (received
+ and transmitted), IB/Eth [#accel]_.
+ - Acceleration
+
+ * - `vport_loopback_bytes`
+ - Unicast, multicast and broadcast bytes that were loop-back (received
+ and transmitted), IB/Eth [#accel]_.
+ - Acceleration
+
+ * - `rx_steer_missed_packets`
+ - Number of packets that was received by the NIC, however was discarded
+ because it did not match any flow in the NIC flow table.
+ - Error
+
+ * - `rx_packets`
+ - Representor only: packets received, that were handled by the hypervisor.
+ - Informative
+
+ * - `rx_bytes`
+ - Representor only: bytes received, that were handled by the hypervisor.
+ - Informative
+
+ * - `tx_packets`
+ - Representor only: packets transmitted, that were handled by the
+ hypervisor.
+ - Informative
+
+ * - `tx_bytes`
+ - Representor only: bytes transmitted, that were handled by the hypervisor.
+ - Informative
+
+ * - `dev_internal_queue_oob`
+ - The number of dropped packets due to lack of receive WQEs for an internal
+ device RQ.
+ - Error
+
+Physical Port Counters
+----------------------
+The physical port counters are the counters on the external port connecting the
+adapter to the network. This measuring point holds information on standardized
+counters like IEEE 802.3, RFC2863, RFC 2819, RFC 3635 and additional counters
+like flow control, FEC and more.
+
+.. flat-table:: Physical Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_packets_phy`
+ - The number of packets received on the physical port. This counter doesn’t
+ include packets that were discarded due to FCS, frame size and similar
+ errors.
+ - Informative
+
+ * - `tx_packets_phy`
+ - The number of packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_bytes_phy`
+ - The number of bytes received on the physical port, including Ethernet
+ header and FCS.
+ - Informative
+
+ * - `tx_bytes_phy`
+ - The number of bytes transmitted on the physical port.
+ - Informative
+
+ * - `rx_multicast_phy`
+ - The number of multicast packets received on the physical port.
+ - Informative
+
+ * - `tx_multicast_phy`
+ - The number of multicast packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_broadcast_phy`
+ - The number of broadcast packets received on the physical port.
+ - Informative
+
+ * - `tx_broadcast_phy`
+ - The number of broadcast packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_crc_errors_phy`
+ - The number of dropped received packets due to FCS (Frame Check Sequence)
+ error on the physical port. If this counter is increased in high rate,
+ check the link quality using `rx_symbol_error_phy` and
+ `rx_corrected_bits_phy` counters below.
+ - Error
+
+ * - `rx_in_range_len_errors_phy`
+ - The number of received packets dropped due to length/type errors on a
+ physical port.
+ - Error
+
+ * - `rx_out_of_range_len_phy`
+ - The number of received packets dropped due to length greater than allowed
+ on a physical port. If this counter is increasing, it implies that the
+ peer connected to the adapter has a larger MTU configured. Using same MTU
+ configuration shall resolve this issue.
+ - Error
+
+ * - `rx_oversize_pkts_phy`
+ - The number of dropped received packets due to length which exceed MTU
+ size on a physical port. If this counter is increasing, it implies that
+ the peer connected to the adapter has a larger MTU configured. Using same
+ MTU configuration shall resolve this issue.
+ - Error
+
+ * - `rx_symbol_err_phy`
+ - The number of received packets dropped due to physical coding errors
+ (symbol errors) on a physical port.
+ - Error
+
+ * - `rx_mac_control_phy`
+ - The number of MAC control packets received on the physical port.
+ - Informative
+
+ * - `tx_mac_control_phy`
+ - The number of MAC control packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_pause_ctrl_phy`
+ - The number of link layer pause packets received on a physical port. If
+ this counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from to the adapter.
+ - Informative
+
+ * - `tx_pause_ctrl_phy`
+ - The number of link layer pause packets transmitted on a physical port. If
+ this counter is increasing, it implies that the NIC is congested and
+ cannot absorb the traffic coming from the network.
+ - Informative
+
+ * - `rx_unsupported_op_phy`
+ - The number of MAC control packets received with unsupported opcode on a
+ physical port.
+ - Error
+
+ * - `rx_discards_phy`
+ - The number of received packets dropped due to lack of buffers on a
+ physical port. If this counter is increasing, it implies that the adapter
+ is congested and cannot absorb the traffic coming from the network.
+ - Error
+
+ * - `tx_discards_phy`
+ - The number of packets which were discarded on transmission, even no
+ errors were detected. the drop might occur due to link in down state,
+ head of line drop, pause from the network, etc.
+ - Error
+
+ * - `tx_errors_phy`
+ - The number of transmitted packets dropped due to a length which exceed
+ MTU size on a physical port.
+ - Error
+
+ * - `rx_undersize_pkts_phy`
+ - The number of received packets dropped due to length which is shorter
+ than 64 bytes on a physical port. If this counter is increasing, it
+ implies that the peer connected to the adapter has a non-standard MTU
+ configured or malformed packet had arrived.
+ - Error
+
+ * - `rx_fragments_phy`
+ - The number of received packets dropped due to a length which is shorter
+ than 64 bytes and has FCS error on a physical port. If this counter is
+ increasing, it implies that the peer connected to the adapter has a
+ non-standard MTU configured.
+ - Error
+
+ * - `rx_jabbers_phy`
+ - The number of received packets d due to a length which is longer than 64
+ bytes and had FCS error on a physical port.
+ - Error
+
+ * - `rx_64_bytes_phy`
+ - The number of packets received on the physical port with size of 64 bytes.
+ - Informative
+
+ * - `rx_65_to_127_bytes_phy`
+ - The number of packets received on the physical port with size of 65 to
+ 127 bytes.
+ - Informative
+
+ * - `rx_128_to_255_bytes_phy`
+ - The number of packets received on the physical port with size of 128 to
+ 255 bytes.
+ - Informative
+
+ * - `rx_256_to_511_bytes_phy`
+ - The number of packets received on the physical port with size of 256 to
+ 512 bytes.
+ - Informative
+
+ * - `rx_512_to_1023_bytes_phy`
+ - The number of packets received on the physical port with size of 512 to
+ 1023 bytes.
+ - Informative
+
+ * - `rx_1024_to_1518_bytes_phy`
+ - The number of packets received on the physical port with size of 1024 to
+ 1518 bytes.
+ - Informative
+
+ * - `rx_1519_to_2047_bytes_phy`
+ - The number of packets received on the physical port with size of 1519 to
+ 2047 bytes.
+ - Informative
+
+ * - `rx_2048_to_4095_bytes_phy`
+ - The number of packets received on the physical port with size of 2048 to
+ 4095 bytes.
+ - Informative
+
+ * - `rx_4096_to_8191_bytes_phy`
+ - The number of packets received on the physical port with size of 4096 to
+ 8191 bytes.
+ - Informative
+
+ * - `rx_8192_to_10239_bytes_phy`
+ - The number of packets received on the physical port with size of 8192 to
+ 10239 bytes.
+ - Informative
+
+ * - `link_down_events_phy`
+ - The number of times where the link operative state changed to down. In
+ case this counter is increasing it may imply on port flapping. You may
+ need to replace the cable/transceiver.
+ - Error
+
+ * - `rx_out_of_buffer`
+ - Number of times receive queue had no software buffers allocated for the
+ adapter's incoming traffic.
+ - Error
+
+ * - `module_bus_stuck`
+ - The number of times that module's I\ :sup:`2`\C bus (data or clock)
+ short-wire was detected. You may need to replace the cable/transceiver.
+ - Error
+
+ * - `module_high_temp`
+ - The number of times that the module temperature was too high. If this
+ issue persist, you may need to check the ambient temperature or replace
+ the cable/transceiver module.
+ - Error
+
+ * - `module_bad_shorted`
+ - The number of times that the module cables were shorted. You may need to
+ replace the cable/transceiver module.
+ - Error
+
+ * - `module_unplug`
+ - The number of times that module was ejected.
+ - Informative
+
+ * - `rx_buffer_passed_thres_phy`
+ - The number of events where the port receive buffer was over 85% full.
+ - Informative
+
+ * - `tx_pause_storm_warning_events`
+ - The number of times the device was sending pauses for a long period of
+ time.
+ - Informative
+
+ * - `tx_pause_storm_error_events`
+ - The number of times the device was sending pauses for a long period of
+ time, reaching time out and disabling transmission of pause frames. on
+ the period where pause frames were disabled, drop could have been
+ occurred.
+ - Error
+
+ * - `rx[i]_buff_alloc_err`
+ - Failed to allocate a buffer to received packet (or SKB) on ring i.
+ - Error
+
+ * - `rx_bits_phy`
+ - This counter provides information on the total amount of traffic that
+ could have been received and can be used as a guideline to measure the
+ ratio of errored traffic in `rx_pcs_symbol_err_phy` and
+ `rx_corrected_bits_phy`.
+ - Informative
+
+ * - `rx_pcs_symbol_err_phy`
+ - This counter counts the number of symbol errors that wasn’t corrected by
+ FEC correction algorithm or that FEC algorithm was not active on this
+ interface. If this counter is increasing, it implies that the link
+ between the NIC and the network is suffering from high BER, and that
+ traffic is lost. You may need to replace the cable/transceiver. The error
+ rate is the number of `rx_pcs_symbol_err_phy` divided by the number of
+ `rx_bits_phy` on a specific time frame.
+ - Error
+
+ * - `rx_corrected_bits_phy`
+ - The number of corrected bits on this port according to active FEC
+ (RS/FC). If this counter is increasing, it implies that the link between
+ the NIC and the network is suffering from high BER. The corrected bit
+ rate is the number of `rx_corrected_bits_phy` divided by the number of
+ `rx_bits_phy` on a specific time frame.
+ - Error
+
+ * - `rx_err_lane_[l]_phy`
+ - This counter counts the number of physical raw errors per lane l index.
+ The counter counts errors before FEC corrections. If this counter is
+ increasing, it implies that the link between the NIC and the network is
+ suffering from high BER, and that traffic might be lost. You may need to
+ replace the cable/transceiver. Please check in accordance with
+ `rx_corrected_bits_phy`.
+ - Error
+
+ * - `rx_global_pause`
+ - The number of pause packets received on the physical port. If this
+ counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from the adapter. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `rx_global_pause_duration`
+ - The duration of pause received (in microSec) on the physical port. The
+ counter represents the time the port did not send any traffic. If this
+ counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from the adapter. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `tx_global_pause`
+ - The number of pause packets transmitted on a physical port. If this
+ counter is increasing, it implies that the adapter is congested and
+ cannot absorb the traffic coming from the network. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `tx_global_pause_duration`
+ - The duration of pause transmitter (in microSec) on the physical port.
+ Note: This counter is only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `rx_global_pause_transition`
+ - The number of times a transition from Xoff to Xon on the physical port
+ has occurred. Note: This counter is only enabled when global pause mode
+ is enabled.
+ - Informative
+
+ * - `rx_if_down_packets`
+ - The number of received packets that were dropped due to interface down.
+ - Informative
+
+Priority Port Counters
+----------------------
+The following counters are physical port counters that are counted per L2
+priority (0-7).
+
+**Note:** `p` in the counter name represents the priority.
+
+.. flat-table:: Priority Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_prio[p]_bytes`
+ - The number of bytes received with priority p on the physical port.
+ - Informative
+
+ * - `rx_prio[p]_packets`
+ - The number of packets received with priority p on the physical port.
+ - Informative
+
+ * - `tx_prio[p]_bytes`
+ - The number of bytes transmitted on priority p on the physical port.
+ - Informative
+
+ * - `tx_prio[p]_packets`
+ - The number of packets transmitted on priority p on the physical port.
+ - Informative
+
+ * - `rx_prio[p]_pause`
+ - The number of pause packets received with priority p on a physical port.
+ If this counter is increasing, it implies that the network is congested
+ and cannot absorb the traffic coming from the adapter. Note: This counter
+ is available only if PFC was enabled on priority p.
+ - Informative
+
+ * - `rx_prio[p]_pause_duration`
+ - The duration of pause received (in microSec) on priority p on the
+ physical port. The counter represents the time the port did not send any
+ traffic on this priority. If this counter is increasing, it implies that
+ the network is congested and cannot absorb the traffic coming from the
+ adapter. Note: This counter is available only if PFC was enabled on
+ priority p.
+ - Informative
+
+ * - `rx_prio[p]_pause_transition`
+ - The number of times a transition from Xoff to Xon on priority p on the
+ physical port has occurred. Note: This counter is available only if PFC
+ was enabled on priority p.
+ - Informative
+
+ * - `tx_prio[p]_pause`
+ - The number of pause packets transmitted on priority p on a physical port.
+ If this counter is increasing, it implies that the adapter is congested
+ and cannot absorb the traffic coming from the network. Note: This counter
+ is available only if PFC was enabled on priority p.
+ - Informative
+
+ * - `tx_prio[p]_pause_duration`
+ - The duration of pause transmitter (in microSec) on priority p on the
+ physical port. Note: This counter is available only if PFC was enabled on
+ priority p.
+ - Informative
+
+ * - `rx_prio[p]_buf_discard`
+ - The number of packets discarded by device due to lack of per host receive
+ buffers.
+ - Informative
+
+ * - `rx_prio[p]_cong_discard`
+ - The number of packets discarded by device due to per host congestion.
+ - Informative
+
+ * - `rx_prio[p]_marked`
+ - The number of packets ecn marked by device due to per host congestion.
+ - Informative
+
+ * - `rx_prio[p]_discards`
+ - The number of packets discarded by device due to lack of receive buffers.
+ - Informative
+
+Device Counters
+---------------
+.. flat-table:: Device Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_pci_signal_integrity`
+ - Counts physical layer PCIe signal integrity errors, the number of
+ transitions to recovery due to Framing errors and CRC (dlp and tlp). If
+ this counter is raising, try moving the adapter card to a different slot
+ to rule out a bad PCI slot. Validate that you are running with the latest
+ firmware available and latest server BIOS version.
+ - Error
+
+ * - `tx_pci_signal_integrity`
+ - Counts physical layer PCIe signal integrity errors, the number of
+ transition to recovery initiated by the other side (moving to recovery
+ due to getting TS/EIEOS). If this counter is raising, try moving the
+ adapter card to a different slot to rule out a bad PCI slot. Validate
+ that you are running with the latest firmware available and latest server
+ BIOS version.
+ - Error
+
+ * - `outbound_pci_buffer_overflow`
+ - The number of packets dropped due to pci buffer overflow. If this counter
+ is raising in high rate, it might indicate that the receive traffic rate
+ for a host is larger than the PCIe bus and therefore a congestion occurs.
+ - Informative
+
+ * - `outbound_pci_stalled_rd`
+ - The percentage (in the range 0...100) of time within the last second that
+ the NIC had outbound non-posted reads requests but could not perform the
+ operation due to insufficient posted credits.
+ - Informative
+
+ * - `outbound_pci_stalled_wr`
+ - The percentage (in the range 0...100) of time within the last second that
+ the NIC had outbound posted writes requests but could not perform the
+ operation due to insufficient posted credits.
+ - Informative
+
+ * - `outbound_pci_stalled_rd_events`
+ - The number of seconds where `outbound_pci_stalled_rd` was above 30%.
+ - Informative
+
+ * - `outbound_pci_stalled_wr_events`
+ - The number of seconds where `outbound_pci_stalled_wr` was above 30%.
+ - Informative
+
+ * - `dev_out_of_buffer`
+ - The number of times the device owned queue had not enough buffers
+ allocated.
+ - Error
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
new file mode 100644
index 0000000000..581a91caa5
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+Mellanox ConnectX(R) mlx5 core VPI Network Driver
+=================================================
+
+:Copyright: |copy| 2019, Mellanox Technologies LTD.
+:Copyright: |copy| 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ kconfig
+ switchdev
+ tracepoints
+ counters
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
new file mode 100644
index 0000000000..0a42c3395f
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
@@ -0,0 +1,168 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+=======================================
+Enabling the driver and kconfig options
+=======================================
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
+| at build time via kernel Kconfig flags.
+| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
+| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
+| For the list of advanced features, please see below.
+
+**CONFIG_MLX5_BRIDGE=(y/n)**
+
+| Enable :ref:`Ethernet Bridging (BRIDGE) offloading support <mlx5_bridge_offload>`.
+| This will provide the ability to add representors of mlx5 uplink and VF
+| ports to Bridge and offloading rules for traffic between such ports.
+| Supports VLANs (trunk and access modes).
+
+
+**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
+
+| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
+| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
+
+
+**CONFIG_MLX5_CORE_EN=(y/n)**
+
+| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
+| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be
+| built-in into mlx5_core.ko.
+
+
+**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
+
+| Enables `Data Center Bridging (DCB) Support <https://enterprise-support.nvidia.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
+
+
+**CONFIG_MLX5_CORE_IPOIB=(y/n)**
+
+| IPoIB offloads & acceleration support.
+| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
+| IPoIB ulp netdevice.
+
+
+**CONFIG_MLX5_CLS_ACT=(y/n)**
+
+| Enables offload support for TC classifier action (NET_CLS_ACT).
+| Works in both native NIC mode and Switchdev SRIOV mode.
+| Flow-based classifiers, such as those registered through
+| `tc-flower(8)`, are processed by the device, rather than the
+| host. Actions that would then overwrite matching classification
+| results would then be instant due to the offload.
+
+
+**CONFIG_MLX5_EN_ARFS=(y/n)**
+
+| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
+| https://enterprise-support.nvidia.com/s/article/howto-configure-arfs-on-connectx-4
+
+
+**CONFIG_MLX5_EN_IPSEC=(y/n)**
+
+| Enables :ref:`IPSec XFRM cryptography-offload acceleration <xfrm_device>`.
+
+
+**CONFIG_MLX5_EN_MACSEC=(y/n)**
+
+| Build support for MACsec cryptography-offload acceleration in the NIC.
+
+
+**CONFIG_MLX5_EN_RXNFC=(y/n)**
+
+| Enables ethtool receive network flow classification, which allows user defined
+| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
+
+
+**CONFIG_MLX5_EN_TLS=(y/n)**
+
+| TLS cryptography-offload acceleration.
+
+
+**CONFIG_MLX5_ESWITCH=(y/n)**
+
+| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
+| and switching for the enabled VFs and PF in two available modes:
+| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://enterprise-support.nvidia.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet>`_.
+| 2) :ref:`Switchdev mode (eswitch offloads) <switchdev>`.
+
+
+**CONFIG_MLX5_FPGA=(y/n)**
+
+| Build support for the Innova family of network cards by Mellanox Technologies.
+| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
+| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
+| building sandbox-specific client drivers.
+
+
+**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
+
+| Provides low-level InfiniBand/RDMA and `RoCE <https://enterprise-support.nvidia.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
+
+
+**CONFIG_MLX5_MPFS=(y/n)**
+
+| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
+| MPFs is required for when `Multi-Host <https://www.nvidia.com/en-us/networking/multi-host/>`_ configuration is enabled to allow passing
+| user configured unicast MAC addresses to the requesting PF.
+
+
+**CONFIG_MLX5_SF=(y/n)**
+
+| Build support for subfunction.
+| Subfunctons are more light weight than PCI SRIOV VFs. Choosing this option
+| will enable support for creating subfunction devices.
+
+
+**CONFIG_MLX5_SF_MANAGER=(y/n)**
+
+| Build support for subfuction port in the NIC. A Mellanox subfunction
+| port is managed through devlink. A subfunction supports RDMA, netdevice
+| and vdpa device. It is similar to a SRIOV VF but it doesn't require
+| SRIOV support.
+
+
+**CONFIG_MLX5_SW_STEERING=(y/n)**
+
+| Build support for software-managed steering in the NIC.
+
+
+**CONFIG_MLX5_TC_CT=(y/n)**
+
+| Support offloading connection tracking rules via tc ct action.
+
+
+**CONFIG_MLX5_TC_SAMPLE=(y/n)**
+
+| Support offloading sample rules via tc sample action.
+
+
+**CONFIG_MLX5_VDPA=(y/n)**
+
+| Support library for Mellanox VDPA drivers. Provides code that is
+| common for all types of VDPA drivers. The following drivers are planned:
+| net, block.
+
+
+**CONFIG_MLX5_VDPA_NET=(y/n)**
+
+| VDPA network driver for ConnectX6 and newer. Provides offloading
+| of virtio net datapath such that descriptors put on the ring will
+| be executed by the hardware. It also supports a variety of stateless
+| offloads depending on the actual device used and firmware version.
+
+
+**CONFIG_MLX5_VFIO_PCI=(y/n)**
+
+| This provides migration support for MLX5 devices using the VFIO framework.
+
+
+**External options** ( Choose if the corresponding mlx5 feature is required )
+
+- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
+- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
+- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
new file mode 100644
index 0000000000..b617e93d7c
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
@@ -0,0 +1,281 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+=========
+Switchdev
+=========
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+.. _mlx5_bridge_offload:
+
+Bridge offload
+==============
+
+The mlx5 driver implements support for offloading bridge rules when in switchdev
+mode. Linux bridge FDBs are automatically offloaded when mlx5 switchdev
+representor is attached to bridge.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
+
+ $ ip link set enp8s0f0 master bridge1
+
+VLANs
+-----
+
+Following bridge VLAN functions are supported by mlx5:
+
+- VLAN filtering (including multiple VLANs per port)::
+
+ $ ip link set bridge1 type bridge vlan_filtering 1
+ $ bridge vlan add dev enp8s0f0 vid 2-3
+
+- VLAN push on bridge ingress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 pvid
+
+- VLAN pop on bridge egress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 untagged
+
+Subfunction
+===========
+
+Subfunction which are spawned over the E-switch are created only with devlink
+device, and by default all the SF auxiliary devices are disabled.
+This will allow user to configure the SF before the SF have been fully probed,
+which will save time.
+
+Usage example:
+
+- Create SF::
+
+ $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
+ $ devlink port function set pci/0000:08:00.0/32768 hw_addr 00:00:00:00:00:11 state active
+
+- Enable ETH auxiliary device::
+
+ $ devlink dev param set auxiliary/mlx5_core.sf.1 name enable_eth value true cmode driverinit
+
+- Now, in order to fully probe the SF, use devlink reload::
+
+ $ devlink dev reload auxiliary/mlx5_core.sf.1
+
+mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`).
+
+mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
+
+A subfunction has its own function capabilities and its own resources. This
+means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
+queues are neither shared nor stolen from the parent PCI function.
+
+When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
+resources neither shared nor stolen from the parent PCI function.
+
+A subfunction has a dedicated window in PCI BAR space that is not shared
+with the other subfunctions or the parent PCI function. This ensures that all
+devices (netdev, rdma, vdpa, etc.) of the subfunction accesses only assigned
+PCI BAR space.
+
+A subfunction supports eswitch representation through which it supports tc
+offloads. The user configures eswitch to send/receive packets from/to
+the subfunction port.
+
+Subfunctions share PCI level resources such as PCI MSI-X IRQs with
+other subfunctions and/or with its parent PCI function.
+
+Example mlx5 software, system, and device view::
+
+ _______
+ | admin |
+ | user |----------
+ |_______| |
+ | |
+ ____|____ __|______ _________________
+ | | | | | |
+ | devlink | | tc tool | | user |
+ | tool | |_________| | applications |
+ |_________| | |_________________|
+ | | | |
+ | | | | Userspace
+ +---------|-------------|-------------------|----------|--------------------+
+ | | +----------+ +----------+ Kernel
+ | | | netdev | | rdma dev |
+ | | +----------+ +----------+
+ (devlink port add/del | ^ ^
+ port function set) | | |
+ | | +---------------|
+ _____|___ | | _______|_______
+ | | | | | mlx5 class |
+ | devlink | +------------+ | | drivers |
+ | kernel | | rep netdev | | |(mlx5_core,ib) |
+ |_________| +------------+ | |_______________|
+ | | | ^
+ (devlink ops) | | (probe/remove)
+ _________|________ | | ____|________
+ | subfunction | | +---------------+ | subfunction |
+ | management driver|----- | subfunction |---| driver |
+ | (mlx5_core) | | auxiliary dev | | (mlx5_core) |
+ |__________________| +---------------+ |_____________|
+ | ^
+ (sf add/del, vhca events) |
+ | (device add/del)
+ _____|____ ____|________
+ | | | subfunction |
+ | PCI NIC |--- activate/deactivate events--->| host driver |
+ |__________| | (mlx5_core) |
+ |_____________|
+
+Subfunction is created using devlink port interface.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Add a devlink port of subfunction flavour::
+
+ $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+ pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Show a devlink port of the subfunction::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Delete a devlink port of subfunction after use::
+
+ $ devlink port del pci/0000:06:00.0/32768
+
+Function attributes
+===================
+
+The mlx5 driver provides a mechanism to setup PCI VF/SF function attributes in
+a unified way for SmartNIC and non-SmartNIC.
+
+This is supported only when the eswitch mode is set to switchdev. Port function
+configuration of the PCI VF/SF is supported through devlink eswitch port.
+
+Port function attributes should be set before PCI VF/SF is enumerated by the
+driver.
+
+MAC address setup
+-----------------
+
+mlx5 driver support devlink port function attr mechanism to setup MAC
+address. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+RoCE capability setup
+~~~~~~~~~~~~~~~~~~~~~
+Not all mlx5 PCI devices/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves 1 Mbytes worth of system memory per
+PCI devices/SF.
+
+mlx5 driver support devlink port function attr mechanism to setup RoCE
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+migratable capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+mlx5 driver support devlink port function attr mechanism to setup migratable
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+IPsec crypto capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform IPsec crypto offloading need
+to explicitly enable the VF ipsec_crypto capability. Enabling IPsec capability
+for VFs is supported starting with ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+mlx5 driver support devlink port function attr mechanism to setup ipsec_crypto
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+IPsec packet capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform IPsec packet offloading need
+to explicitly enable the VF ipsec_packet capability. Enabling IPsec capability
+for VFs is supported starting with ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+mlx5 driver support devlink port function attr mechanism to setup ipsec_packet
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+SF state setup
+--------------
+
+To use the SF, the user must activate the SF using the SF function state
+attribute.
+
+- Get the state of the SF identified by its unique devlink port index::
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state inactive opstate detached
+
+- Activate the function and verify its state is active::
+
+ $ devlink port function set ens2f0npf0sf88 state active
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state active opstate detached
+
+Upon function activation, the PF driver instance gets the event from the device
+that a particular SF was activated. It's the cue to put the device on bus, probe
+it and instantiate the devlink instance and class specific auxiliary devices
+for it.
+
+- Show the auxiliary device and port of the subfunction::
+
+ $ devlink dev show
+ devlink dev show auxiliary/mlx5_core.sf.4
+
+ $ devlink port show auxiliary/mlx5_core.sf.4/1
+ auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
+
+ $ rdma link show mlx5_0/1
+ link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
+
+ $ rdma dev show
+ 8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
+ 13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
+
+- Subfunction auxiliary device and class device hierarchy::
+
+ mlx5_core.sf.4
+ (subfunction auxiliary device)
+ /\
+ / \
+ / \
+ / \
+ / \
+ mlx5_core.eth.4 mlx5_core.rdma.4
+ (sf eth aux dev) (sf rdma aux dev)
+ | |
+ | |
+ p0sf88 mlx5_0
+ (sf netdev) (sf rdma device)
+
+Additionally, the SF port also gets the event when the driver attaches to the
+auxiliary device of the subfunction. This results in changing the operational
+state of the function. This provides visibility to the user to decide when is it
+safe to delete the SF port for graceful termination of the subfunction.
+
+- Show the SF port operational state::
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state active opstate attached
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst
new file mode 100644
index 0000000000..da8e53cebb
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst
@@ -0,0 +1,229 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+===========
+Tracepoints
+===========
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+mlx5 driver provides internal tracepoints for tracking and debugging using
+kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
+
+For the list of support mlx5 events, check /sys/kernel/tracing/events/mlx5/.
+
+tc and eswitch offloads tracepoints:
+
+- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
+
+- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
+
+ $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
+
+- mlx5e_stats_flower: trace flower stats request::
+
+ $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
+
+- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
+
+- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
+
+ $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
+
+Bridge offloads tracepoints:
+
+- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
+
+- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
+
+- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
+ mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
+
+- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
+
+- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
+
+- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_init >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
+
+- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
+
+Eswitch QoS tracepoints:
+
+- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
+
+- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
+
+ $ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
+
+- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
+
+ $ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
+
+- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
+
+ $ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
+
+SF tracepoints:
+
+- mlx5_sf_add: trace addition of the SF port::
+
+ $ echo mlx5:mlx5_sf_add >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_free: trace freeing of the SF port::
+
+ $ echo mlx5:mlx5_sf_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_activate: trace activation of the SF port::
+
+ $ echo mlx5:mlx5_sf_activate >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-29841 [008] ..... 3669.635095: mlx5_sf_activate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_deactivate: trace deactivation of the SF port::
+
+ $ echo mlx5:mlx5_sf_deactivate >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-29994 [008] ..... 4015.969467: mlx5_sf_deactivate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_update_state: trace state updates for SF contexts::
+
+ $ echo mlx5:mlx5_sf_update_state >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:3-29490 [009] ..... 4141.453530: mlx5_sf_update_state: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 state=2
+
+- mlx5_sf_vhca_event: trace SF vhca event and state::
+
+ $ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
+
+- mlx5_sf_dev_add: trace SF device add event::
+
+ $ echo mlx5:mlx5_sf_dev_add>> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_dev_del: trace SF device delete event::
+
+ $ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
diff --git a/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst b/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst
new file mode 100644
index 0000000000..fc5acd427a
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Hyper-V network driver
+======================
+
+Compatibility
+=============
+
+This driver is compatible with Windows Server 2012 R2, 2016 and
+Windows 10.
+
+Features
+========
+
+Checksum offload
+----------------
+ The netvsc driver supports checksum offload as long as the
+ Hyper-V host version does. Windows Server 2016 and Azure
+ support checksum offload for TCP and UDP for both IPv4 and
+ IPv6. Windows Server 2012 only supports checksum offload for TCP.
+
+Receive Side Scaling
+--------------------
+ Hyper-V supports receive side scaling. For TCP & UDP, packets can
+ be distributed among available queues based on IP address and port
+ number.
+
+ For TCP & UDP, we can switch hash level between L3 and L4 by ethtool
+ command. TCP/UDP over IPv4 and v6 can be set differently. The default
+ hash level is L4. We currently only allow switching TX hash level
+ from within the guests.
+
+ On Azure, fragmented UDP packets have high loss rate with L4
+ hashing. Using L3 hashing is recommended in this case.
+
+ For example, for UDP over IPv4 on eth0:
+
+ To include UDP port numbers in hashing::
+
+ ethtool -N eth0 rx-flow-hash udp4 sdfn
+
+ To exclude UDP port numbers in hashing::
+
+ ethtool -N eth0 rx-flow-hash udp4 sd
+
+ To show UDP hash level::
+
+ ethtool -n eth0 rx-flow-hash udp4
+
+Generic Receive Offload, aka GRO
+--------------------------------
+ The driver supports GRO and it is enabled by default. GRO coalesces
+ like packets and significantly reduces CPU usage under heavy Rx
+ load.
+
+Large Receive Offload (LRO), or Receive Side Coalescing (RSC)
+-------------------------------------------------------------
+ The driver supports LRO/RSC in the vSwitch feature. It reduces the per packet
+ processing overhead by coalescing multiple TCP segments when possible. The
+ feature is enabled by default on VMs running on Windows Server 2019 and
+ later. It may be changed by ethtool command::
+
+ ethtool -K eth0 lro on
+ ethtool -K eth0 lro off
+
+SR-IOV support
+--------------
+ Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV
+ is enabled in both the vSwitch and the guest configuration, then the
+ Virtual Function (VF) device is passed to the guest as a PCI
+ device. In this case, both a synthetic (netvsc) and VF device are
+ visible in the guest OS and both NIC's have the same MAC address.
+
+ The VF is enslaved by netvsc device. The netvsc driver will transparently
+ switch the data path to the VF when it is available and up.
+ Network state (addresses, firewall, etc) should be applied only to the
+ netvsc device; the slave device should not be accessed directly in
+ most cases. The exceptions are if some special queue discipline or
+ flow direction is desired, these should be applied directly to the
+ VF slave device.
+
+Receive Buffer
+--------------
+ Packets are received into a receive area which is created when device
+ is probed. The receive area is broken into MTU sized chunks and each may
+ contain one or more packets. The number of receive sections may be changed
+ via ethtool Rx ring parameters.
+
+ There is a similar send buffer which is used to aggregate packets
+ for sending. The send area is broken into chunks, typically of 6144
+ bytes, each of section may contain one or more packets. Small
+ packets are usually transmitted via copy to the send buffer. However,
+ if the buffer is temporarily exhausted, or the packet to be transmitted is
+ an LSO packet, the driver will provide the host with pointers to the data
+ from the SKB. This attempts to achieve a balance between the overhead of
+ data copy and the impact of remapping VM memory to be accessible by the
+ host.
+
+XDP support
+-----------
+ XDP (eXpress Data Path) is a feature that runs eBPF bytecode at the early
+ stage when packets arrive at a NIC card. The goal is to increase performance
+ for packet processing, reducing the overhead of SKB allocation and other
+ upper network layers.
+
+ hv_netvsc supports XDP in native mode, and transparently sets the XDP
+ program on the associated VF NIC as well.
+
+ Setting / unsetting XDP program on synthetic NIC (netvsc) propagates to
+ VF NIC automatically. Setting / unsetting XDP program on VF NIC directly
+ is not recommended, also not propagated to synthetic NIC, and may be
+ overwritten by setting of synthetic NIC.
+
+ XDP program cannot run with LRO (RSC) enabled, so you need to disable LRO
+ before running XDP::
+
+ ethtool -K eth0 lro off
+
+ XDP_REDIRECT action is not yet supported.
diff --git a/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst b/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst
new file mode 100644
index 0000000000..c5673ec455
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================================
+Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver
+=========================================================
+
+Release notes for Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver.
+
+.. Contents
+ - 1. Introduction
+ - 2. Identifying the adapter/interface
+ - 3. Features supported
+ - 4. Command line parameters
+ - 5. Performance suggestions
+ - 6. Available Downloads
+
+
+1. Introduction
+===============
+This Linux driver supports Neterion's Xframe I PCI-X 1.0 and
+Xframe II PCI-X 2.0 adapters. It supports several features
+such as jumbo frames, MSI/MSI-X, checksum offloads, TSO, UFO and so on.
+See below for complete list of features.
+
+All features are supported for both IPv4 and IPv6.
+
+2. Identifying the adapter/interface
+====================================
+
+a. Insert the adapter(s) in your system.
+b. Build and load driver::
+
+ # insmod s2io.ko
+
+c. View log messages::
+
+ # dmesg | tail -40
+
+You will see messages similar to::
+
+ eth3: Neterion Xframe I 10GbE adapter (rev 3), Version 2.0.9.1, Intr type INTA
+ eth4: Neterion Xframe II 10GbE adapter (rev 2), Version 2.0.9.1, Intr type INTA
+ eth4: Device is on 64 bit 133MHz PCIX(M1) bus
+
+The above messages identify the adapter type(Xframe I/II), adapter revision,
+driver version, interface name(eth3, eth4), Interrupt type(INTA, MSI, MSI-X).
+In case of Xframe II, the PCI/PCI-X bus width and frequency are displayed
+as well.
+
+To associate an interface with a physical adapter use "ethtool -p <ethX>".
+The corresponding adapter's LED will blink multiple times.
+
+3. Features supported
+=====================
+a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes,
+ modifiable using ip command.
+
+b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit
+ and receive, TSO.
+
+c. Multi-buffer receive mode. Scattering of packet across multiple
+ buffers. Currently driver supports 2-buffer mode which yields
+ significant performance improvement on certain platforms(SGI Altix,
+ IBM xSeries).
+
+d. MSI/MSI-X. Can be enabled on platforms which support this feature
+ (IA64, Xeon) resulting in noticeable performance improvement(up to 7%
+ on certain platforms).
+
+e. Statistics. Comprehensive MAC-level and software statistics displayed
+ using "ethtool -S" option.
+
+f. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings,
+ with multiple steering options.
+
+4. Command line parameters
+==========================
+
+a. tx_fifo_num
+ Number of transmit queues
+
+Valid range: 1-8
+
+Default: 1
+
+b. rx_ring_num
+ Number of receive rings
+
+Valid range: 1-8
+
+Default: 1
+
+c. tx_fifo_len
+ Size of each transmit queue
+
+Valid range: Total length of all queues should not exceed 8192
+
+Default: 4096
+
+d. rx_ring_sz
+ Size of each receive ring(in 4K blocks)
+
+Valid range: Limited by memory on system
+
+Default: 30
+
+e. intr_type
+ Specifies interrupt type. Possible values 0(INTA), 2(MSI-X)
+
+Valid values: 0, 2
+
+Default: 2
+
+5. Performance suggestions
+==========================
+
+General:
+
+a. Set MTU to maximum(9000 for switch setup, 9600 in back-to-back configuration)
+b. Set TCP windows size to optimal value.
+
+For instance, for MTU=1500 a value of 210K has been observed to result in
+good performance::
+
+ # sysctl -w net.ipv4.tcp_rmem="210000 210000 210000"
+ # sysctl -w net.ipv4.tcp_wmem="210000 210000 210000"
+
+For MTU=9000, TCP window size of 10 MB is recommended::
+
+ # sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
+ # sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
+
+Transmit performance:
+
+a. By default, the driver respects BIOS settings for PCI bus parameters.
+ However, you may want to experiment with PCI bus parameters
+ max-split-transactions(MOST) and MMRBC (use setpci command).
+
+ A MOST value of 2 has been found optimal for Opterons and 3 for Itanium.
+
+ It could be different for your hardware.
+
+ Set MMRBC to 4K**.
+
+ For example you can set
+
+ For opteron::
+
+ #setpci -d 17d5:* 62=1d
+
+ For Itanium::
+
+ #setpci -d 17d5:* 62=3d
+
+ For detailed description of the PCI registers, please see Xframe User Guide.
+
+b. Ensure Transmit Checksum offload is enabled. Use ethtool to set/verify this
+ parameter.
+
+c. Turn on TSO(using "ethtool -K")::
+
+ # ethtool -K <ethX> tso on
+
+Receive performance:
+
+a. By default, the driver respects BIOS settings for PCI bus parameters.
+ However, you may want to set PCI latency timer to 248::
+
+ #setpci -d 17d5:* LATENCY_TIMER=f8
+
+ For detailed description of the PCI registers, please see Xframe User Guide.
+
+b. Use 2-buffer mode. This results in large performance boost on
+ certain platforms(eg. SGI Altix, IBM xSeries).
+
+c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to
+ set/verify this option.
+
+d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network
+ device support ---> Ethernet (10000 Mbit) ---> S2IO 10Gbe Xframe NIC) to
+ bring down CPU utilization.
+
+.. note::
+
+ For AMD opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are
+ recommended as safe parameters.
+
+For more information, please review the AMD8131 errata at
+http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/
+26310_AMD-8131_HyperTransport_PCI-X_Tunnel_Revision_Guide_rev_3_18.pdf
+
+6. Support
+==========
+
+For further support please contact either your 10GbE Xframe NIC vendor (IBM,
+HP, SGI etc.)
diff --git a/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
new file mode 100644
index 0000000000..650b57742d
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
@@ -0,0 +1,374 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+.. include:: <isonum.txt>
+
+===========================================
+Network Flow Processor (NFP) Kernel Drivers
+===========================================
+
+:Copyright: |copy| 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2022, Corigine, Inc.
+
+Contents
+========
+
+- `Overview`_
+- `Acquiring Firmware`_
+- `Devlink Info`_
+- `Configure Device`_
+- `Statistics`_
+
+Overview
+========
+
+This driver supports Netronome and Corigine's line of Network Flow Processor
+devices, including the NFP3800, NFP4000, NFP5000, and NFP6000 models, which
+are also incorporated in the companies' family of Agilio SmartNICs. The SR-IOV
+physical and virtual functions for these devices are supported by the driver.
+
+Acquiring Firmware
+==================
+
+The NFP3800, NFP4000 and NFP6000 devices require application specific firmware
+to function. Application firmware can be located either on the host file system
+or in the device flash (if supported by management firmware).
+
+Firmware files on the host filesystem contain card type (`AMDA-*` string), media
+config etc. They should be placed in `/lib/firmware/netronome` directory to
+load firmware from the host file system.
+
+Firmware for basic NIC operation is available in the upstream
+`linux-firmware.git` repository.
+
+A more comprehensive list of firmware can be downloaded from the
+`Corigine Support site <https://www.corigine.com/DPUDownload.html>`_.
+
+Firmware in NVRAM
+-----------------
+
+Recent versions of management firmware supports loading application
+firmware from flash when the host driver gets probed. The firmware loading
+policy configuration may be used to configure this feature appropriately.
+
+Devlink or ethtool can be used to update the application firmware on the device
+flash by providing the appropriate `nic_AMDA*.nffw` file to the respective
+command. Users need to take care to write the correct firmware image for the
+card and media configuration to flash.
+
+Available storage space in flash depends on the card being used.
+
+Dealing with multiple projects
+------------------------------
+
+NFP hardware is fully programmable therefore there can be different
+firmware images targeting different applications.
+
+When using application firmware from host, we recommend placing
+actual firmware files in application-named subdirectories in
+`/lib/firmware/netronome` and linking the desired files, e.g.::
+
+ $ tree /lib/firmware/netronome/
+ /lib/firmware/netronome/
+ ├── bpf
+ │   ├── nic_AMDA0081-0001_1x40.nffw
+ │   └── nic_AMDA0081-0001_4x10.nffw
+ ├── flower
+ │   ├── nic_AMDA0081-0001_1x40.nffw
+ │   └── nic_AMDA0081-0001_4x10.nffw
+ ├── nic
+ │   ├── nic_AMDA0081-0001_1x40.nffw
+ │   └── nic_AMDA0081-0001_4x10.nffw
+ ├── nic_AMDA0081-0001_1x40.nffw -> bpf/nic_AMDA0081-0001_1x40.nffw
+ └── nic_AMDA0081-0001_4x10.nffw -> bpf/nic_AMDA0081-0001_4x10.nffw
+
+ 3 directories, 8 files
+
+You may need to use hard instead of symbolic links on distributions
+which use old `mkinitrd` command instead of `dracut` (e.g. Ubuntu).
+
+After changing firmware files you may need to regenerate the initramfs
+image. Initramfs contains drivers and firmware files your system may
+need to boot. Refer to the documentation of your distribution to find
+out how to update initramfs. Good indication of stale initramfs
+is system loading wrong driver or firmware on boot, but when driver is
+later reloaded manually everything works correctly.
+
+Selecting firmware per device
+-----------------------------
+
+Most commonly all cards on the system use the same type of firmware.
+If you want to load a specific firmware image for a specific card, you
+can use either the PCI bus address or serial number. The driver will
+print which files it's looking for when it recognizes a NFP device::
+
+ nfp: Looking for firmware file in order of priority:
+ nfp: netronome/serial-00-12-34-aa-bb-cc-10-ff.nffw: not found
+ nfp: netronome/pci-0000:02:00.0.nffw: not found
+ nfp: netronome/nic_AMDA0081-0001_1x40.nffw: found, loading...
+
+In this case if file (or link) called *serial-00-12-34-aa-bb-5d-10-ff.nffw*
+or *pci-0000:02:00.0.nffw* is present in `/lib/firmware/netronome` this
+firmware file will take precedence over `nic_AMDA*` files.
+
+Note that `serial-*` and `pci-*` files are **not** automatically included
+in initramfs, you will have to refer to documentation of appropriate tools
+to find out how to include them.
+
+Running firmware version
+------------------------
+
+The version of the loaded firmware for a particular <netdev> interface,
+(e.g. enp4s0), or an interface's port <netdev port> (e.g. enp4s0np0) can
+be displayed with the ethtool command::
+
+ $ ethtool -i <netdev>
+
+Firmware loading policy
+-----------------------
+
+Firmware loading policy is controlled via three HWinfo parameters
+stored as key value pairs in the device flash:
+
+app_fw_from_flash
+ Defines which firmware should take precedence, 'Disk' (0), 'Flash' (1) or
+ the 'Preferred' (2) firmware. When 'Preferred' is selected, the management
+ firmware makes the decision over which firmware will be loaded by comparing
+ versions of the flash firmware and the host supplied firmware.
+ This variable is configurable using the 'fw_load_policy'
+ devlink parameter.
+
+abi_drv_reset
+ Defines if the driver should reset the firmware when
+ the driver is probed, either 'Disk' (0) if firmware was found on disk,
+ 'Always' (1) reset or 'Never' (2) reset. Note that the device is always
+ reset on driver unload if firmware was loaded when the driver was probed.
+ This variable is configurable using the 'reset_dev_on_drv_probe'
+ devlink parameter.
+
+abi_drv_load_ifc
+ Defines a list of PF devices allowed to load FW on the device.
+ This variable is not currently user configurable.
+
+Devlink Info
+============
+
+The devlink info command displays the running and stored firmware versions
+on the device, serial number and board information.
+
+Devlink info command example (replace PCI address)::
+
+ $ devlink dev info pci/0000:03:00.0
+ pci/0000:03:00.0:
+ driver nfp
+ serial_number CSAAMDA2001-1003000111
+ versions:
+ fixed:
+ board.id AMDA2001-1003
+ board.rev 01
+ board.manufacture CSA
+ board.model mozart
+ running:
+ fw.mgmt 22.10.0-rc3
+ fw.cpld 0x1000003
+ fw.app nic-22.09.0
+ chip.init AMDA-2001-1003 1003000111
+ stored:
+ fw.bundle_id bspbundle_1003000111
+ fw.mgmt 22.10.0-rc3
+ fw.cpld 0x0
+ chip.init AMDA-2001-1003 1003000111
+
+Configure Device
+================
+
+This section explains how to use Agilio SmartNICs running basic NIC firmware.
+
+Configure interface link-speed
+------------------------------
+The following steps explains how to change between 10G mode and 25G mode on
+Agilio CX 2x25GbE cards. The changing of port speed must be done in order,
+port 0 (p0) must be set to 10G before port 1 (p1) may be set to 10G.
+
+Down the respective interface(s)::
+
+ $ ip link set dev <netdev port 0> down
+ $ ip link set dev <netdev port 1> down
+
+Set interface link-speed to 10G::
+
+ $ ethtool -s <netdev port 0> speed 10000
+ $ ethtool -s <netdev port 1> speed 10000
+
+Set interface link-speed to 25G::
+
+ $ ethtool -s <netdev port 0> speed 25000
+ $ ethtool -s <netdev port 1> speed 25000
+
+Reload driver for changes to take effect::
+
+ $ rmmod nfp; modprobe nfp
+
+Configure interface Maximum Transmission Unit (MTU)
+---------------------------------------------------
+
+The MTU of interfaces can temporarily be set using the iproute2, ip link or
+ifconfig tools. Note that this change will not persist. Setting this via
+Network Manager, or another appropriate OS configuration tool, is
+recommended as changes to the MTU using Network Manager can be made to
+persist.
+
+Set interface MTU to 9000 bytes::
+
+ $ ip link set dev <netdev port> mtu 9000
+
+It is the responsibility of the user or the orchestration layer to set
+appropriate MTU values when handling jumbo frames or utilizing tunnels. For
+example, if packets sent from a VM are to be encapsulated on the card and
+egress a physical port, then the MTU of the VF should be set to lower than
+that of the physical port to account for the extra bytes added by the
+additional header. If a setup is expected to see fallback traffic between
+the SmartNIC and the kernel then the user should also ensure that the PF MTU
+is appropriately set to avoid unexpected drops on this path.
+
+Configure Forward Error Correction (FEC) modes
+----------------------------------------------
+
+Agilio SmartNICs support FEC mode configuration, e.g. Auto, Firecode Base-R,
+ReedSolomon and Off modes. Each physical port's FEC mode can be set
+independently using ethtool. The supported FEC modes for an interface can
+be viewed using::
+
+ $ ethtool <netdev>
+
+The currently configured FEC mode can be viewed using::
+
+ $ ethtool --show-fec <netdev>
+
+To force the FEC mode for a particular port, auto-negotiation must be disabled
+(see the `Auto-negotiation`_ section). An example of how to set the FEC mode
+to Reed-Solomon is::
+
+ $ ethtool --set-fec <netdev> encoding rs
+
+Auto-negotiation
+----------------
+
+To change auto-negotiation settings, the link must first be put down. After the
+link is down, auto-negotiation can be enabled or disabled using::
+
+ ethtool -s <netdev> autoneg <on|off>
+
+Statistics
+==========
+
+Following device statistics are available through the ``ethtool -S`` interface:
+
+.. flat-table:: NFP device statistics
+ :header-rows: 1
+ :widths: 3 1 11
+
+ * - Name
+ - ID
+ - Meaning
+
+ * - dev_rx_discards
+ - 1
+ - Packet can be discarded on the RX path for one of the following reasons:
+
+ * The NIC is not in promisc mode, and the destination MAC address
+ doesn't match the interfaces' MAC address.
+ * The received packet is larger than the max buffer size on the host.
+ I.e. it exceeds the Layer 3 MRU.
+ * There is no freelist descriptor available on the host for the packet.
+ It is likely that the NIC couldn't cache one in time.
+ * A BPF program discarded the packet.
+ * The datapath drop action was executed.
+ * The MAC discarded the packet due to lack of ingress buffer space
+ on the NIC.
+
+ * - dev_rx_errors
+ - 2
+ - A packet can be counted (and dropped) as RX error for the following
+ reasons:
+
+ * A problem with the VEB lookup (only when SR-IOV is used).
+ * A physical layer problem that causes Ethernet errors, like FCS or
+ alignment errors. The cause is usually faulty cables or SFPs.
+
+ * - dev_rx_bytes
+ - 3
+ - Total number of bytes received.
+
+ * - dev_rx_uc_bytes
+ - 4
+ - Unicast bytes received.
+
+ * - dev_rx_mc_bytes
+ - 5
+ - Multicast bytes received.
+
+ * - dev_rx_bc_bytes
+ - 6
+ - Broadcast bytes received.
+
+ * - dev_rx_pkts
+ - 7
+ - Total number of packets received.
+
+ * - dev_rx_mc_pkts
+ - 8
+ - Multicast packets received.
+
+ * - dev_rx_bc_pkts
+ - 9
+ - Broadcast packets received.
+
+ * - dev_tx_discards
+ - 10
+ - A packet can be discarded in the TX direction if the MAC is
+ being flow controlled and the NIC runs out of TX queue space.
+
+ * - dev_tx_errors
+ - 11
+ - A packet can be counted as TX error (and dropped) for one for the
+ following reasons:
+
+ * The packet is an LSO segment, but the Layer 3 or Layer 4 offset
+ could not be determined. Therefore LSO could not continue.
+ * An invalid packet descriptor was received over PCIe.
+ * The packet Layer 3 length exceeds the device MTU.
+ * An error on the MAC/physical layer. Usually due to faulty cables or
+ SFPs.
+ * A CTM buffer could not be allocated.
+ * The packet offset was incorrect and could not be fixed by the NIC.
+
+ * - dev_tx_bytes
+ - 12
+ - Total number of bytes transmitted.
+
+ * - dev_tx_uc_bytes
+ - 13
+ - Unicast bytes transmitted.
+
+ * - dev_tx_mc_bytes
+ - 14
+ - Multicast bytes transmitted.
+
+ * - dev_tx_bc_bytes
+ - 15
+ - Broadcast bytes transmitted.
+
+ * - dev_tx_pkts
+ - 16
+ - Total number of packets transmitted.
+
+ * - dev_tx_mc_pkts
+ - 17
+ - Multicast packets transmitted.
+
+ * - dev_tx_bc_pkts
+ - 18
+ - Broadcast packets transmitted.
+
+Note that statistics unknown to the driver will be displayed as
+``dev_unknown_stat$ID``, where ``$ID`` refers to the second column
+above.
diff --git a/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst
new file mode 100644
index 0000000000..6ec7d686ef
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst
@@ -0,0 +1,274 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+========================================================
+Linux Driver for the Pensando(R) Ethernet adapter family
+========================================================
+
+Pensando Linux Ethernet driver.
+Copyright(c) 2019 Pensando Systems, Inc
+
+Contents
+========
+
+- Identifying the Adapter
+- Enabling the driver
+- Configuring the driver
+- Statistics
+- Support
+
+Identifying the Adapter
+=======================
+
+To find if one or more Pensando PCI Ethernet devices are installed on the
+host, check for the PCI devices::
+
+ $ lspci -d 1dd8:
+ b5:00.0 Ethernet controller: Device 1dd8:1002
+ b6:00.0 Ethernet controller: Device 1dd8:1002
+
+If such devices are listed as above, then the ionic.ko driver should find
+and configure them for use. There should be log entries in the kernel
+messages such as these::
+
+ $ dmesg | grep ionic
+ ionic 0000:b5:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
+ ionic 0000:b5:00.0 enp181s0: renamed from eth0
+ ionic 0000:b5:00.0 enp181s0: Link up - 100 Gbps
+ ionic 0000:b6:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
+ ionic 0000:b6:00.0 enp182s0: renamed from eth0
+ ionic 0000:b6:00.0 enp182s0: Link up - 100 Gbps
+
+Driver and firmware version information can be gathered with either of
+ethtool or devlink tools::
+
+ $ ethtool -i enp181s0
+ driver: ionic
+ version: 5.7.0
+ firmware-version: 1.8.0-28
+ ...
+
+ $ devlink dev info pci/0000:b5:00.0
+ pci/0000:b5:00.0:
+ driver ionic
+ serial_number FLM18420073
+ versions:
+ fixed:
+ asic.id 0x0
+ asic.rev 0x0
+ running:
+ fw 1.8.0-28
+
+See Documentation/networking/devlink/ionic.rst for more information
+on the devlink dev info data.
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet driver support
+ -> Pensando devices
+ -> Pensando Ethernet IONIC Support
+
+Configuring the Driver
+======================
+
+MTU
+---
+
+Jumbo frame support is available with a maximum size of 9194 bytes.
+
+Interrupt coalescing
+--------------------
+
+Interrupt coalescing can be configured by changing the rx-usecs value with
+the "ethtool -C" command. The rx-usecs range is 0-190. The tx-usecs value
+reflects the rx-usecs value as they are tied together on the same interrupt.
+
+SR-IOV
+------
+
+Minimal SR-IOV support is currently offered and can be enabled by setting
+the sysfs 'sriov_numvfs' value, if supported by your particular firmware
+configuration.
+
+Statistics
+==========
+
+Basic hardware stats
+--------------------
+
+The commands ``netstat -i``, ``ip -s link show``, and ``ifconfig`` show
+a limited set of statistics taken directly from firmware. For example::
+
+ $ ip -s link show enp181s0
+ 7: enp181s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
+ link/ether 00:ae:cd:00:07:68 brd ff:ff:ff:ff:ff:ff
+ RX: bytes packets errors dropped overrun mcast
+ 414 5 0 0 0 0
+ TX: bytes packets errors dropped carrier collsns
+ 1384 18 0 0 0 0
+
+ethtool -S
+----------
+
+The statistics shown from the ``ethtool -S`` command includes a combination of
+driver counters and firmware counters, including port and queue specific values.
+The driver values are counters computed by the driver, and the firmware values
+are gathered by the firmware from the port hardware and passed through the
+driver with no further interpretation.
+
+Driver port specific::
+
+ tx_packets: 12
+ tx_bytes: 964
+ rx_packets: 5
+ rx_bytes: 414
+ tx_tso: 0
+ tx_tso_bytes: 0
+ tx_csum_none: 12
+ tx_csum: 0
+ rx_csum_none: 0
+ rx_csum_complete: 3
+ rx_csum_error: 0
+
+Driver queue specific::
+
+ tx_0_pkts: 3
+ tx_0_bytes: 294
+ tx_0_clean: 3
+ tx_0_dma_map_err: 0
+ tx_0_linearize: 0
+ tx_0_frags: 0
+ tx_0_tso: 0
+ tx_0_tso_bytes: 0
+ tx_0_csum_none: 3
+ tx_0_csum: 0
+ tx_0_vlan_inserted: 0
+ rx_0_pkts: 2
+ rx_0_bytes: 120
+ rx_0_dma_map_err: 0
+ rx_0_alloc_err: 0
+ rx_0_csum_none: 0
+ rx_0_csum_complete: 0
+ rx_0_csum_error: 0
+ rx_0_dropped: 0
+ rx_0_vlan_stripped: 0
+
+Firmware port specific::
+
+ hw_tx_dropped: 0
+ hw_rx_dropped: 0
+ hw_rx_over_errors: 0
+ hw_rx_missed_errors: 0
+ hw_tx_aborted_errors: 0
+ frames_rx_ok: 15
+ frames_rx_all: 15
+ frames_rx_bad_fcs: 0
+ frames_rx_bad_all: 0
+ octets_rx_ok: 1290
+ octets_rx_all: 1290
+ frames_rx_unicast: 10
+ frames_rx_multicast: 5
+ frames_rx_broadcast: 0
+ frames_rx_pause: 0
+ frames_rx_bad_length: 0
+ frames_rx_undersized: 0
+ frames_rx_oversized: 0
+ frames_rx_fragments: 0
+ frames_rx_jabber: 0
+ frames_rx_pripause: 0
+ frames_rx_stomped_crc: 0
+ frames_rx_too_long: 0
+ frames_rx_vlan_good: 3
+ frames_rx_dropped: 0
+ frames_rx_less_than_64b: 0
+ frames_rx_64b: 4
+ frames_rx_65b_127b: 11
+ frames_rx_128b_255b: 0
+ frames_rx_256b_511b: 0
+ frames_rx_512b_1023b: 0
+ frames_rx_1024b_1518b: 0
+ frames_rx_1519b_2047b: 0
+ frames_rx_2048b_4095b: 0
+ frames_rx_4096b_8191b: 0
+ frames_rx_8192b_9215b: 0
+ frames_rx_other: 0
+ frames_tx_ok: 31
+ frames_tx_all: 31
+ frames_tx_bad: 0
+ octets_tx_ok: 2614
+ octets_tx_total: 2614
+ frames_tx_unicast: 8
+ frames_tx_multicast: 21
+ frames_tx_broadcast: 2
+ frames_tx_pause: 0
+ frames_tx_pripause: 0
+ frames_tx_vlan: 0
+ frames_tx_less_than_64b: 0
+ frames_tx_64b: 4
+ frames_tx_65b_127b: 27
+ frames_tx_128b_255b: 0
+ frames_tx_256b_511b: 0
+ frames_tx_512b_1023b: 0
+ frames_tx_1024b_1518b: 0
+ frames_tx_1519b_2047b: 0
+ frames_tx_2048b_4095b: 0
+ frames_tx_4096b_8191b: 0
+ frames_tx_8192b_9215b: 0
+ frames_tx_other: 0
+ frames_tx_pri_0: 0
+ frames_tx_pri_1: 0
+ frames_tx_pri_2: 0
+ frames_tx_pri_3: 0
+ frames_tx_pri_4: 0
+ frames_tx_pri_5: 0
+ frames_tx_pri_6: 0
+ frames_tx_pri_7: 0
+ frames_rx_pri_0: 0
+ frames_rx_pri_1: 0
+ frames_rx_pri_2: 0
+ frames_rx_pri_3: 0
+ frames_rx_pri_4: 0
+ frames_rx_pri_5: 0
+ frames_rx_pri_6: 0
+ frames_rx_pri_7: 0
+ tx_pripause_0_1us_count: 0
+ tx_pripause_1_1us_count: 0
+ tx_pripause_2_1us_count: 0
+ tx_pripause_3_1us_count: 0
+ tx_pripause_4_1us_count: 0
+ tx_pripause_5_1us_count: 0
+ tx_pripause_6_1us_count: 0
+ tx_pripause_7_1us_count: 0
+ rx_pripause_0_1us_count: 0
+ rx_pripause_1_1us_count: 0
+ rx_pripause_2_1us_count: 0
+ rx_pripause_3_1us_count: 0
+ rx_pripause_4_1us_count: 0
+ rx_pripause_5_1us_count: 0
+ rx_pripause_6_1us_count: 0
+ rx_pripause_7_1us_count: 0
+ rx_pause_1us_count: 0
+ frames_tx_truncated: 0
+
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+ netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+ drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst b/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
new file mode 100644
index 0000000000..e5eac896a6
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+SMC 9xxxx Driver
+================
+
+Revision 0.12
+
+3/5/96
+
+Copyright 1996 Erik Stahlman
+
+Released under terms of the GNU General Public License.
+
+This file contains the instructions and caveats for my SMC9xxx driver. You
+should not be using the driver without reading this file.
+
+Things to note about installation:
+
+ 1. The driver should work on all kernels from 1.2.13 until 1.3.71.
+ (A kernel patch is supplied for 1.3.71 )
+
+ 2. If you include this into the kernel, you might need to change some
+ options, such as for forcing IRQ.
+
+
+ 3. To compile as a module, run 'make'.
+ Make will give you the appropriate options for various kernel support.
+
+ 4. Loading the driver as a module::
+
+ use: insmod smc9194.o
+ optional parameters:
+ io=xxxx : your base address
+ irq=xx : your irq
+ ifport=x : 0 for whatever is default
+ 1 for twisted pair
+ 2 for AUI ( or BNC on some cards )
+
+How to obtain the latest version?
+
+FTP:
+ ftp://fenris.campus.vt.edu/smc9/smc9-12.tar.gz
+ ftp://sfbox.vt.edu/filebox/F/fenris/smc9/smc9-12.tar.gz
+
+
+Contacting me:
+ erik@mail.vt.edu
diff --git a/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst b/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst
new file mode 100644
index 0000000000..5d46e50361
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst
@@ -0,0 +1,700 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============================================================
+Linux Driver for the Synopsys(R) Ethernet Controllers "stmmac"
+==============================================================
+
+Authors: Giuseppe Cavallaro <peppe.cavallaro@st.com>,
+Alexandre Torgue <alexandre.torgue@st.com>, Jose Abreu <joabreu@synopsys.com>
+
+Contents
+========
+
+- In This Release
+- Feature List
+- Kernel Configuration
+- Command Line Parameters
+- Driver Information and Notes
+- Debug Information
+- Support
+
+In This Release
+===============
+
+This file describes the stmmac Linux Driver for all the Synopsys(R) Ethernet
+Controllers.
+
+Currently, this network device driver is for all STi embedded MAC/GMAC
+(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XILINX XC2V3000
+FF1152AMT0221 D1215994A VIRTEX FPGA board. The Synopsys Ethernet QoS 5.0 IPK
+is also supported.
+
+DesignWare(R) Cores Ethernet MAC 10/100/1000 Universal version 3.70a
+(and older) and DesignWare(R) Cores Ethernet Quality-of-Service version 4.0
+(and upper) have been used for developing this driver as well as
+DesignWare(R) Cores XGMAC - 10G Ethernet MAC and DesignWare(R) Cores
+Enterprise MAC - 100G Ethernet MAC.
+
+This driver supports both the platform bus and PCI.
+
+This driver includes support for the following Synopsys(R) DesignWare(R)
+Cores Ethernet Controllers and corresponding minimum and maximum versions:
+
++-------------------------------+--------------+--------------+--------------+
+| Controller Name | Min. Version | Max. Version | Abbrev. Name |
++===============================+==============+==============+==============+
+| Ethernet MAC Universal | N/A | 3.73a | GMAC |
++-------------------------------+--------------+--------------+--------------+
+| Ethernet Quality-of-Service | 4.00a | N/A | GMAC4+ |
++-------------------------------+--------------+--------------+--------------+
+| XGMAC - 10G Ethernet MAC | 2.10a | N/A | XGMAC2+ |
++-------------------------------+--------------+--------------+--------------+
+| XLGMAC - 100G Ethernet MAC | 2.00a | N/A | XLGMAC2+ |
++-------------------------------+--------------+--------------+--------------+
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Ethernet adapter. All hardware requirements listed apply
+to use with Linux.
+
+Feature List
+============
+
+The following features are available in this driver:
+ - GMII/MII/RGMII/SGMII/RMII/XGMII/XLGMII Interface
+ - Half-Duplex / Full-Duplex Operation
+ - Energy Efficient Ethernet (EEE)
+ - IEEE 802.3x PAUSE Packets (Flow Control)
+ - RMON/MIB Counters
+ - IEEE 1588 Timestamping (PTP)
+ - Pulse-Per-Second Output (PPS)
+ - MDIO Clause 22 / Clause 45 Interface
+ - MAC Loopback
+ - ARP Offloading
+ - Automatic CRC / PAD Insertion and Checking
+ - Checksum Offload for Received and Transmitted Packets
+ - Standard or Jumbo Ethernet Packets
+ - Source Address Insertion / Replacement
+ - VLAN TAG Insertion / Replacement / Deletion / Filtering (HASH and PERFECT)
+ - Programmable TX and RX Watchdog and Coalesce Settings
+ - Destination Address Filtering (PERFECT)
+ - HASH Filtering (Multicast)
+ - Layer 3 / Layer 4 Filtering
+ - Remote Wake-Up Detection
+ - Receive Side Scaling (RSS)
+ - Frame Preemption for TX and RX
+ - Programmable Burst Length, Threshold, Queue Size
+ - Multiple Queues (up to 8)
+ - Multiple Scheduling Algorithms (TX: WRR, DWRR, WFQ, SP, CBS, EST, TBS;
+ RX: WRR, SP)
+ - Flexible RX Parser
+ - TCP / UDP Segmentation Offload (TSO, USO)
+ - Split Header (SPH)
+ - Safety Features (ECC Protection, Data Parity Protection)
+ - Selftests using Ethtool
+
+Kernel Configuration
+====================
+
+The kernel configuration option is ``CONFIG_STMMAC_ETH``:
+ - ``CONFIG_STMMAC_PLATFORM``: is to enable the platform driver.
+ - ``CONFIG_STMMAC_PCI``: is to enable the pci driver.
+
+Command Line Parameters
+=======================
+
+If the driver is built as a module the following optional parameters are used
+by entering them on the command line with the modprobe command using this
+syntax (e.g. for PCI module)::
+
+ modprobe stmmac_pci [<option>=<VAL1>,<VAL2>,...]
+
+Driver parameters can be also passed in command line by using::
+
+ stmmaceth=watchdog:100,chain_mode=1
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+watchdog
+--------
+:Valid Range: 5000-None
+:Default Value: 5000
+
+This parameter overrides the transmit timeout in milliseconds.
+
+debug
+-----
+:Valid Range: 0-16 (0=none,...,16=all)
+:Default Value: 0
+
+This parameter adjusts the level of debug messages displayed in the system
+logs.
+
+phyaddr
+-------
+:Valid Range: 0-31
+:Default Value: -1
+
+This parameter overrides the physical address of the PHY device.
+
+flow_ctrl
+---------
+:Valid Range: 0-3 (0=off,1=rx,2=tx,3=rx/tx)
+:Default Value: 3
+
+This parameter changes the default Flow Control ability.
+
+pause
+-----
+:Valid Range: 0-65535
+:Default Value: 65535
+
+This parameter changes the default Flow Control Pause time.
+
+tc
+--
+:Valid Range: 64-256
+:Default Value: 64
+
+This parameter changes the default HW FIFO Threshold control value.
+
+buf_sz
+------
+:Valid Range: 1536-16384
+:Default Value: 1536
+
+This parameter changes the default RX DMA packet buffer size.
+
+eee_timer
+---------
+:Valid Range: 0-None
+:Default Value: 1000
+
+This parameter changes the default LPI TX Expiration time in milliseconds.
+
+chain_mode
+----------
+:Valid Range: 0-1 (0=off,1=on)
+:Default Value: 0
+
+This parameter changes the default mode of operation from Ring Mode to
+Chain Mode.
+
+Driver Information and Notes
+============================
+
+Transmit Process
+----------------
+
+The xmit method is invoked when the kernel needs to transmit a packet; it sets
+the descriptors in the ring and informs the DMA engine that there is a packet
+ready to be transmitted.
+
+By default, the driver sets the ``NETIF_F_SG`` bit in the features field of
+the ``net_device`` structure, enabling the scatter-gather feature. This is
+true on chips and configurations where the checksum can be done in hardware.
+
+Once the controller has finished transmitting the packet, timer will be
+scheduled to release the transmit resources.
+
+Receive Process
+---------------
+
+When one or more packets are received, an interrupt happens. The interrupts
+are not queued, so the driver has to scan all the descriptors in the ring
+during the receive process.
+
+This is based on NAPI, so the interrupt handler signals only if there is work
+to be done, and it exits. Then the poll method will be scheduled at some
+future point.
+
+The incoming packets are stored, by the DMA, in a list of pre-allocated socket
+buffers in order to avoid the memcpy (zero-copy).
+
+Interrupt Mitigation
+--------------------
+
+The driver is able to mitigate the number of its DMA interrupts using NAPI for
+the reception on chips older than the 3.50. New chips have an HW RX Watchdog
+used for this mitigation.
+
+Mitigation parameters can be tuned by ethtool.
+
+WoL
+---
+
+Wake up on Lan feature through Magic and Unicast frames are supported for the
+GMAC, GMAC4/5 and XGMAC core.
+
+DMA Descriptors
+---------------
+
+Driver handles both normal and alternate descriptors. The latter has been only
+tested on DesignWare(R) Cores Ethernet MAC Universal version 3.41a and later.
+
+stmmac supports DMA descriptor to operate both in dual buffer (RING) and
+linked-list(CHAINED) mode. In RING each descriptor points to two data buffer
+pointers whereas in CHAINED mode they point to only one data buffer pointer.
+RING mode is the default.
+
+In CHAINED mode each descriptor will have pointer to next descriptor in the
+list, hence creating the explicit chaining in the descriptor itself, whereas
+such explicit chaining is not possible in RING mode.
+
+Extended Descriptors
+--------------------
+
+The extended descriptors give us information about the Ethernet payload when
+it is carrying PTP packets or TCP/UDP/ICMP over IP. These are not available on
+GMAC Synopsys(R) chips older than the 3.50. At probe time the driver will
+decide if these can be actually used. This support also is mandatory for PTPv2
+because the extra descriptors are used for saving the hardware timestamps and
+Extended Status.
+
+Ethtool Support
+---------------
+
+Ethtool is supported. For example, driver statistics (including RMON),
+internal errors can be taken using::
+
+ ethtool -S ethX
+
+Ethtool selftests are also supported. This allows to do some early sanity
+checks to the HW using MAC and PHY loopback mechanisms::
+
+ ethtool -t ethX
+
+Jumbo and Segmentation Offloading
+---------------------------------
+
+Jumbo frames are supported and tested for the GMAC. The GSO has been also
+added but it's performed in software. LRO is not supported.
+
+TSO Support
+-----------
+
+TSO (TCP Segmentation Offload) feature is supported by GMAC > 4.x and XGMAC
+chip family. When a packet is sent through TCP protocol, the TCP stack ensures
+that the SKB provided to the low level driver (stmmac in our case) matches
+with the maximum frame len (IP header + TCP header + payload <= 1500 bytes
+(for MTU set to 1500)). It means that if an application using TCP want to send
+a packet which will have a length (after adding headers) > 1514 the packet
+will be split in several TCP packets: The data payload is split and headers
+(TCP/IP ..) are added. It is done by software.
+
+When TSO is enabled, the TCP stack doesn't care about the maximum frame length
+and provide SKB packet to stmmac as it is. The GMAC IP will have to perform
+the segmentation by it self to match with maximum frame length.
+
+This feature can be enabled in device tree through ``snps,tso`` entry.
+
+Energy Efficient Ethernet
+-------------------------
+
+Energy Efficient Ethernet (EEE) enables IEEE 802.3 MAC sublayer along with a
+family of Physical layer to operate in the Low Power Idle (LPI) mode. The EEE
+mode supports the IEEE 802.3 MAC operation at 100Mbps, 1000Mbps and 1Gbps.
+
+The LPI mode allows power saving by switching off parts of the communication
+device functionality when there is no data to be transmitted & received.
+The system on both the side of the link can disable some functionalities and
+save power during the period of low-link utilization. The MAC controls whether
+the system should enter or exit the LPI mode and communicate this to PHY.
+
+As soon as the interface is opened, the driver verifies if the EEE can be
+supported. This is done by looking at both the DMA HW capability register and
+the PHY devices MCD registers.
+
+To enter in TX LPI mode the driver needs to have a software timer that enable
+and disable the LPI mode when there is nothing to be transmitted.
+
+Precision Time Protocol (PTP)
+-----------------------------
+
+The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP), which
+enables precise synchronization of clocks in measurement and control systems
+implemented with technologies such as network communication.
+
+In addition to the basic timestamp features mentioned in IEEE 1588-2002
+Timestamps, new GMAC cores support the advanced timestamp features.
+IEEE 1588-2008 can be enabled when configuring the Kernel.
+
+SGMII/RGMII Support
+-------------------
+
+New GMAC devices provide own way to manage RGMII/SGMII. This information is
+available at run-time by looking at the HW capability register. This means
+that the stmmac can manage auto-negotiation and link status w/o using the
+PHYLIB stuff. In fact, the HW provides a subset of extended registers to
+restart the ANE, verify Full/Half duplex mode and Speed. Thanks to these
+registers, it is possible to look at the Auto-negotiated Link Parter Ability.
+
+Physical
+--------
+
+The driver is compatible with Physical Abstraction Layer to be connected with
+PHY and GPHY devices.
+
+Platform Information
+--------------------
+
+Several information can be passed through the platform and device-tree.
+
+::
+
+ struct plat_stmmacenet_data {
+
+1) Bus identifier::
+
+ int bus_id;
+
+2) PHY Physical Address. If set to -1 the driver will pick the first PHY it
+finds::
+
+ int phy_addr;
+
+3) PHY Device Interface::
+
+ int interface;
+
+4) Specific platform fields for the MDIO bus::
+
+ struct stmmac_mdio_bus_data *mdio_bus_data;
+
+5) Internal DMA parameters::
+
+ struct stmmac_dma_cfg *dma_cfg;
+
+6) Fixed CSR Clock Range selection::
+
+ int clk_csr;
+
+7) HW uses the GMAC core::
+
+ int has_gmac;
+
+8) If set the MAC will use Enhanced Descriptors::
+
+ int enh_desc;
+
+9) Core is able to perform TX Checksum and/or RX Checksum in HW::
+
+ int tx_coe;
+ int rx_coe;
+
+11) Some HWs are not able to perform the csum in HW for over-sized frames due
+to limited buffer sizes. Setting this flag the csum will be done in SW on
+JUMBO frames::
+
+ int bugged_jumbo;
+
+12) Core has the embedded power module::
+
+ int pmt;
+
+13) Force DMA to use the Store and Forward mode or Threshold mode::
+
+ int force_sf_dma_mode;
+ int force_thresh_dma_mode;
+
+15) Force to disable the RX Watchdog feature and switch to NAPI mode::
+
+ int riwt_off;
+
+16) Limit the maximum operating speed and MTU::
+
+ int max_speed;
+ int maxmtu;
+
+18) Number of Multicast/Unicast filters::
+
+ int multicast_filter_bins;
+ int unicast_filter_entries;
+
+20) Limit the maximum TX and RX FIFO size::
+
+ int tx_fifo_size;
+ int rx_fifo_size;
+
+21) Use the specified number of TX and RX Queues::
+
+ u32 rx_queues_to_use;
+ u32 tx_queues_to_use;
+
+22) Use the specified TX and RX scheduling algorithm::
+
+ u8 rx_sched_algorithm;
+ u8 tx_sched_algorithm;
+
+23) Internal TX and RX Queue parameters::
+
+ struct stmmac_rxq_cfg rx_queues_cfg[MTL_MAX_RX_QUEUES];
+ struct stmmac_txq_cfg tx_queues_cfg[MTL_MAX_TX_QUEUES];
+
+24) This callback is used for modifying some syscfg registers (on ST SoCs)
+according to the link speed negotiated by the physical layer::
+
+ void (*fix_mac_speed)(void *priv, unsigned int speed);
+
+25) Callbacks used for calling a custom initialization; This is sometimes
+necessary on some platforms (e.g. ST boxes) where the HW needs to have set
+some PIO lines or system cfg registers. init/exit callbacks should not use
+or modify platform data::
+
+ int (*init)(struct platform_device *pdev, void *priv);
+ void (*exit)(struct platform_device *pdev, void *priv);
+
+26) Perform HW setup of the bus. For example, on some ST platforms this field
+is used to configure the AMBA bridge to generate more efficient STBus traffic::
+
+ struct mac_device_info *(*setup)(void *priv);
+ void *bsp_priv;
+
+27) Internal clocks and rates::
+
+ struct clk *stmmac_clk;
+ struct clk *pclk;
+ struct clk *clk_ptp_ref;
+ unsigned int clk_ptp_rate;
+ unsigned int clk_ref_rate;
+ s32 ptp_max_adj;
+
+28) Main reset::
+
+ struct reset_control *stmmac_rst;
+
+29) AXI Internal Parameters::
+
+ struct stmmac_axi *axi;
+
+30) HW uses GMAC>4 cores::
+
+ int has_gmac4;
+
+31) HW is sun8i based::
+
+ bool has_sun8i;
+
+32) Enables TSO feature::
+
+ bool tso_en;
+
+33) Enables Receive Side Scaling (RSS) feature::
+
+ int rss_en;
+
+34) MAC Port selection::
+
+ int mac_port_sel_speed;
+
+35) Enables TX LPI Clock Gating::
+
+ bool en_tx_lpi_clockgating;
+
+36) HW uses XGMAC>2.10 cores::
+
+ int has_xgmac;
+
+::
+
+ }
+
+For MDIO bus data, we have:
+
+::
+
+ struct stmmac_mdio_bus_data {
+
+1) PHY mask passed when MDIO bus is registered::
+
+ unsigned int phy_mask;
+
+2) List of IRQs, one per PHY::
+
+ int *irqs;
+
+3) If IRQs is NULL, use this for probed PHY::
+
+ int probed_phy_irq;
+
+4) Set to true if PHY needs reset::
+
+ bool needs_reset;
+
+::
+
+ }
+
+For DMA engine configuration, we have:
+
+::
+
+ struct stmmac_dma_cfg {
+
+1) Programmable Burst Length (TX and RX)::
+
+ int pbl;
+
+2) If set, DMA TX / RX will use this value rather than pbl::
+
+ int txpbl;
+ int rxpbl;
+
+3) Enable 8xPBL::
+
+ bool pblx8;
+
+4) Enable Fixed or Mixed burst::
+
+ int fixed_burst;
+ int mixed_burst;
+
+5) Enable Address Aligned Beats::
+
+ bool aal;
+
+6) Enable Enhanced Addressing (> 32 bits)::
+
+ bool eame;
+
+::
+
+ }
+
+For DMA AXI parameters, we have:
+
+::
+
+ struct stmmac_axi {
+
+1) Enable AXI LPI::
+
+ bool axi_lpi_en;
+ bool axi_xit_frm;
+
+2) Set AXI Write / Read maximum outstanding requests::
+
+ u32 axi_wr_osr_lmt;
+ u32 axi_rd_osr_lmt;
+
+3) Set AXI 4KB bursts::
+
+ bool axi_kbbe;
+
+4) Set AXI maximum burst length map::
+
+ u32 axi_blen[AXI_BLEN];
+
+5) Set AXI Fixed burst / mixed burst::
+
+ bool axi_fb;
+ bool axi_mb;
+
+6) Set AXI rebuild incrx mode::
+
+ bool axi_rb;
+
+::
+
+ }
+
+For the RX Queues configuration, we have:
+
+::
+
+ struct stmmac_rxq_cfg {
+
+1) Mode to use (DCB or AVB)::
+
+ u8 mode_to_use;
+
+2) DMA channel to use::
+
+ u32 chan;
+
+3) Packet routing, if applicable::
+
+ u8 pkt_route;
+
+4) Use priority routing, and priority to route::
+
+ bool use_prio;
+ u32 prio;
+
+::
+
+ }
+
+For the TX Queues configuration, we have:
+
+::
+
+ struct stmmac_txq_cfg {
+
+1) Queue weight in scheduler::
+
+ u32 weight;
+
+2) Mode to use (DCB or AVB)::
+
+ u8 mode_to_use;
+
+3) Credit Base Shaper Parameters::
+
+ u32 send_slope;
+ u32 idle_slope;
+ u32 high_credit;
+ u32 low_credit;
+
+4) Use priority scheduling, and priority::
+
+ bool use_prio;
+ u32 prio;
+
+::
+
+ }
+
+Device Tree Information
+-----------------------
+
+Please refer to the following document:
+Documentation/devicetree/bindings/net/snps,dwmac.yaml
+
+HW Capabilities
+---------------
+
+Note that, starting from new chips, where it is available the HW capability
+register, many configurations are discovered at run-time for example to
+understand if EEE, HW csum, PTP, enhanced descriptor etc are actually
+available. As strategy adopted in this driver, the information from the HW
+capability register can replace what has been passed from the platform.
+
+Debug Information
+=================
+
+The driver exports many information i.e. internal statistics, debug
+information, MAC and DMA registers etc.
+
+These can be read in several ways depending on the type of the information
+actually needed.
+
+For example a user can be use the ethtool support to get statistics: e.g.
+using: ``ethtool -S ethX`` (that shows the Management counters (MMC) if
+supported) or sees the MAC/DMA registers: e.g. using: ``ethtool -d ethX``
+
+Compiling the Kernel with ``CONFIG_DEBUG_FS`` the driver will export the
+following debugfs entries:
+
+ - ``descriptors_status``: To show the DMA TX/RX descriptor rings
+ - ``dma_cap``: To show the HW Capabilities
+
+Developer can also use the ``debug`` module parameter to get further debug
+information (please see: NETIF Msg Level).
+
+Support
+=======
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the
+issue to netdev@vger.kernel.org
diff --git a/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
new file mode 100644
index 0000000000..25fd9aa284
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================================
+Texas Instruments K3 AM65 CPSW NUSS switchdev based ethernet driver
+===================================================================
+
+:Version: 1.0
+
+Port renaming
+=============
+
+In order to rename via udev::
+
+ ip -d link show dev sw0p1 | grep switchid
+
+ SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \
+ ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
+
+
+Multi mac mode
+==============
+
+- The driver is operating in multi-mac mode by default, thus
+ working as N individual network interfaces.
+
+Devlink configuration parameters
+================================
+
+See Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
+
+Enabling "switch"
+=================
+
+The Switch mode can be enabled by configuring devlink driver parameter
+"switch_mode" to 1/true::
+
+ devlink dev param set platform/c000000.ethernet \
+ name switch_mode value true cmode runtime
+
+This can be done regardless of the state of Port's netdev devices - UP/DOWN, but
+Port's netdev devices have to be in UP before joining to the bridge to avoid
+overwriting of bridge configuration as CPSW switch driver completely reloads its
+configuration when first port changes its state to UP.
+
+When the both interfaces joined the bridge - CPSW switch driver will enable
+marking packets with offload_fwd_mark flag.
+
+All configuration is implemented via switchdev API.
+
+Bridge setup
+============
+
+::
+
+ devlink dev param set platform/c000000.ethernet \
+ name switch_mode value true cmode runtime
+
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge ageing_time 1000
+ ip link set dev sw0p1 up
+ ip link set dev sw0p2 up
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+
+ [*] bridge vlan add dev br0 vid 1 pvid untagged self
+
+ [*] if vlan_filtering=1. where default_pvid=1
+
+ Note. Steps [*] are mandatory.
+
+
+On/off STP
+==========
+
+::
+
+ ip link set dev BRDEV type bridge stp_state 1/0
+
+VLAN configuration
+==================
+
+::
+
+ bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1
+
+Note. This step is mandatory for bridge/default_pvid.
+
+Add extra VLANs
+===============
+
+ 1. untagged::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 pvid untagged master
+ bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100
+
+ 2. tagged::
+
+ bridge vlan add dev sw0p1 vid 100 master
+ bridge vlan add dev sw0p2 vid 100 master
+ bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100
+
+FDBs
+----
+
+FDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding FDBs::
+
+ bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100
+ bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs
+
+MDBs
+----
+
+MDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding MDBs::
+
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs
+
+Multicast flooding
+==================
+CPU port mcast_flooding is always on
+
+Turning flooding on/off on switch ports:
+bridge link set dev sw0p1 mcast_flood on/off
+
+Access and Trunk port
+=====================
+
+::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 master
+
+
+ bridge vlan add dev br0 vid 100 self
+ ip link add link br0 name br0.100 type vlan id 100
+
+Note. Setting PVID on Bridge device itself works only for
+default VLAN (default_pvid).
diff --git a/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
new file mode 100644
index 0000000000..a88946bd18
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
@@ -0,0 +1,587 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+Texas Instruments CPSW ethernet driver
+======================================
+
+Multiqueue & CBS & MQPRIO
+=========================
+
+
+The cpsw has 3 CBS shapers for each external ports. This document
+describes MQPRIO and CBS Qdisc offload configuration for cpsw driver
+based on examples. It potentially can be used in audio video bridging
+(AVB) and time sensitive networking (TSN).
+
+The following examples were tested on AM572x EVM and BBB boards.
+
+Test setup
+==========
+
+Under consideration two examples with AM572x EVM running cpsw driver
+in dual_emac mode.
+
+Several prerequisites:
+
+- TX queues must be rated starting from txq0 that has highest priority
+- Traffic classes are used starting from 0, that has highest priority
+- CBS shapers should be used with rated queues
+- The bandwidth for CBS shapers has to be set a little bit more then
+ potential incoming rate, thus, rate of all incoming tx queues has
+ to be a little less
+- Real rates can differ, due to discreetness
+- Map skb-priority to txq is not enough, also skb-priority to l2 prio
+ map has to be created with ip or vconfig tool
+- Any l2/socket prio (0 - 7) for classes can be used, but for
+ simplicity default values are used: 3 and 2
+- only 2 classes tested: A and B, but checked and can work with more,
+ maximum allowed 4, but only for 3 rate can be set.
+
+Test setup for examples
+=======================
+
+::
+
+ +-------------------------------+
+ |--+ |
+ | | Workstation0 |
+ |E | MAC 18:03:73:66:87:42 |
+ +-----------------------------+ +--|t | |
+ | | 1 | E | | |h |./tsn_listener -d \ |
+ | Target board: | 0 | t |--+ |0 | 18:03:73:66:87:42 -i eth0 \|
+ | AM572x EVM | 0 | h | | | -s 1500 |
+ | | 0 | 0 | |--+ |
+ | Only 2 classes: |Mb +---| +-------------------------------+
+ | class A, class B | |
+ | | +---| +-------------------------------+
+ | | 1 | E | |--+ |
+ | | 0 | t | | | Workstation1 |
+ | | 0 | h |--+ |E | MAC 20:cf:30:85:7d:fd |
+ | |Mb | 1 | +--|t | |
+ +-----------------------------+ |h |./tsn_listener -d \ |
+ |0 | 20:cf:30:85:7d:fd -i eth0 \|
+ | | -s 1500 |
+ |--+ |
+ +-------------------------------+
+
+
+Example 1: One port tx AVB configuration scheme for target board
+----------------------------------------------------------------
+
+(prints and scheme for AM572x evm, applicable for single port boards)
+
+- tc - traffic class
+- txq - transmit queue
+- p - priority
+- f - fifo (cpsw fifo)
+- S - shaper configured
+
+::
+
+ +------------------------------------------------------------------+ u
+ | +---------------+ +---------------+ +------+ +------+ | s
+ | | | | | | | | | | e
+ | | App 1 | | App 2 | | Apps | | Apps | | r
+ | | Class A | | Class B | | Rest | | Rest | |
+ | | Eth0 | | Eth0 | | Eth0 | | Eth1 | | s
+ | | VLAN100 | | VLAN100 | | | | | | | | p
+ | | 40 Mb/s | | 20 Mb/s | | | | | | | | a
+ | | SO_PRIORITY=3 | | SO_PRIORITY=2 | | | | | | | | c
+ | | | | | | | | | | | | | | e
+ | +---|-----------+ +---|-----------+ +---|--+ +---|--+ |
+ +-----|------------------|------------------|--------|-------------+
+ +-+ +------------+ | |
+ | | +-----------------+ +--+
+ | | | |
+ +---|-------|-------------|-----------------------|----------------+
+ | +----+ +----+ +----+ +----+ +----+ |
+ | | p3 | | p2 | | p1 | | p0 | | p0 | | k
+ | \ / \ / \ / \ / \ / | e
+ | \ / \ / \ / \ / \ / | r
+ | \/ \/ \/ \/ \/ | n
+ | | | | | | e
+ | | | +-----+ | | l
+ | | | | | |
+ | +----+ +----+ +----+ +----+ | s
+ | |tc0 | |tc1 | |tc2 | |tc0 | | p
+ | \ / \ / \ / \ / | a
+ | \ / \ / \ / \ / | c
+ | \/ \/ \/ \/ | e
+ | | | +-----+ | |
+ | | | | | | |
+ | | | | | | |
+ | | | | | | |
+ | +----+ +----+ +----+ +----+ +----+ |
+ | |txq0| |txq1| |txq2| |txq3| |txq4| |
+ | \ / \ / \ / \ / \ / |
+ | \ / \ / \ / \ / \ / |
+ | \/ \/ \/ \/ \/ |
+ | +-|------|------|------|--+ +--|--------------+ |
+ | | | | | | | Eth0.100 | | Eth1 | |
+ +---|------|------|------|------------------------|----------------+
+ | | | | |
+ p p p p |
+ 3 2 0-1, 4-7 <- L2 priority |
+ | | | | |
+ | | | | |
+ +---|------|------|------|------------------------|----------------+
+ | | | | | |----------+ |
+ | +----+ +----+ +----+ +----+ +----+ |
+ | |dma7| |dma6| |dma5| |dma4| |dma3| |
+ | \ / \ / \ / \ / \ / | c
+ | \S / \S / \ / \ / \ / | p
+ | \/ \/ \/ \/ \/ | s
+ | | | | +----- | | w
+ | | | | | | |
+ | | | | | | | d
+ | +----+ +----+ +----+p p+----+ | r
+ | | | | | | |o o| | | i
+ | | f3 | | f2 | | f0 |r r| f0 | | v
+ | |tc0 | |tc1 | |tc2 |t t|tc0 | | e
+ | \CBS / \CBS / \CBS /1 2\CBS / | r
+ | \S / \S / \ / \ / |
+ | \/ \/ \/ \/ |
+ +------------------------------------------------------------------+
+
+
+1) ::
+
+
+ // Add 4 tx queues, for interface Eth0, and 1 tx queue for Eth1
+ $ ethtool -L eth0 rx 1 tx 5
+ rx unmodified, ignoring
+
+2) ::
+
+ // Check if num of queues is set correctly:
+ $ ethtool -l eth0
+ Channel parameters for eth0:
+ Pre-set maximums:
+ RX: 8
+ TX: 8
+ Other: 0
+ Combined: 0
+ Current hardware settings:
+ RX: 1
+ TX: 5
+ Other: 0
+ Combined: 0
+
+3) ::
+
+ // TX queues must be rated starting from 0, so set bws for tx0 and tx1
+ // Set rates 40 and 20 Mb/s appropriately.
+ // Pay attention, real speed can differ a bit due to discreetness.
+ // Leave last 2 tx queues not rated.
+ $ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
+ $ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
+
+4) ::
+
+ // Check maximum rate of tx (cpdma) queues:
+ $ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
+ 40
+ 20
+ 0
+ 0
+ 0
+
+5) ::
+
+ // Map skb->priority to traffic class:
+ // 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+ // Map traffic class to transmit queue:
+ // tc0 -> txq0, tc1 -> txq1, tc2 -> (txq2, txq3)
+ $ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
+ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
+
+5a) ::
+
+ // As two interface sharing same set of tx queues, assign all traffic
+ // coming to interface Eth1 to separate queue in order to not mix it
+ // with traffic from interface Eth0, so use separate txq to send
+ // packets to Eth1, so all prio -> tc0 and tc0 -> txq4
+ // Here hw 0, so here still default configuration for eth1 in hw
+ $ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 1 \
+ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@4 hw 0
+
+6) ::
+
+ // Check classes settings
+ $ tc -g class show dev eth0
+ +---(100:ffe2) mqprio
+ | +---(100:3) mqprio
+ | +---(100:4) mqprio
+ |
+ +---(100:ffe1) mqprio
+ | +---(100:2) mqprio
+ |
+ +---(100:ffe0) mqprio
+ +---(100:1) mqprio
+
+ $ tc -g class show dev eth1
+ +---(100:ffe0) mqprio
+ +---(100:5) mqprio
+
+7) ::
+
+ // Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc
+ // Set it +1 Mb for reserve (important!)
+ // here only idle slope is important, others arg are ignored
+ // Pay attention, real speed can differ a bit due to discreetness
+ $ tc qdisc add dev eth0 parent 100:1 cbs locredit -1438 \
+ hicredit 62 sendslope -959000 idleslope 41000 offload 1
+ net eth0: set FIFO3 bw = 50
+
+8) ::
+
+ // Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc:
+ // Set it +1 Mb for reserve (important!)
+ $ tc qdisc add dev eth0 parent 100:2 cbs locredit -1468 \
+ hicredit 65 sendslope -979000 idleslope 21000 offload 1
+ net eth0: set FIFO2 bw = 30
+
+9) ::
+
+ // Create vlan 100 to map sk->priority to vlan qos
+ $ ip link add link eth0 name eth0.100 type vlan id 100
+ 8021q: 802.1Q VLAN Support v1.8
+ 8021q: adding VLAN 0 to HW filter on device eth0
+ 8021q: adding VLAN 0 to HW filter on device eth1
+ net eth0: Adding vlanid 100 to vlan filter
+
+10) ::
+
+ // Map skb->priority to L2 prio, 1 to 1
+ $ ip link set eth0.100 type vlan \
+ egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+11) ::
+
+ // Check egress map for vlan 100
+ $ cat /proc/net/vlan/eth0.100
+ [...]
+ INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+ EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+12) ::
+
+ // Run your appropriate tools with socket option "SO_PRIORITY"
+ // to 3 for class A and/or to 2 for class B
+ // (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+ ./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
+ ./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
+
+13) ::
+
+ // run your listener on workstation (should be in same vlan)
+ // (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+ ./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39000 kbps
+
+14) ::
+
+ // Restore default configuration if needed
+ $ ip link del eth0.100
+ $ tc qdisc del dev eth1 root
+ $ tc qdisc del dev eth0 root
+ net eth0: Prev FIFO2 is shaped
+ net eth0: set FIFO3 bw = 0
+ net eth0: set FIFO2 bw = 0
+ $ ethtool -L eth0 rx 1 tx 1
+
+Example 2: Two port tx AVB configuration scheme for target board
+----------------------------------------------------------------
+
+(prints and scheme for AM572x evm, for dual emac boards only)
+
+::
+
+ +------------------------------------------------------------------+ u
+ | +----------+ +----------+ +------+ +----------+ +----------+ | s
+ | | | | | | | | | | | | e
+ | | App 1 | | App 2 | | Apps | | App 3 | | App 4 | | r
+ | | Class A | | Class B | | Rest | | Class B | | Class A | |
+ | | Eth0 | | Eth0 | | | | | Eth1 | | Eth1 | | s
+ | | VLAN100 | | VLAN100 | | | | | VLAN100 | | VLAN100 | | p
+ | | 40 Mb/s | | 20 Mb/s | | | | | 10 Mb/s | | 30 Mb/s | | a
+ | | SO_PRI=3 | | SO_PRI=2 | | | | | SO_PRI=3 | | SO_PRI=2 | | c
+ | | | | | | | | | | | | | | | | | e
+ | +---|------+ +---|------+ +---|--+ +---|------+ +---|------+ |
+ +-----|-------------|-------------|---------|-------------|--------+
+ +-+ +-------+ | +----------+ +----+
+ | | +-------+------+ | |
+ | | | | | |
+ +---|-------|-------------|--------------|-------------|-------|---+
+ | +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ |
+ | | p3 | | p2 | | p1 | | p0 | | p0 | | p1 | | p2 | | p3 | | k
+ | \ / \ / \ / \ / \ / \ / \ / \ / | e
+ | \ / \ / \ / \ / \ / \ / \ / \ / | r
+ | \/ \/ \/ \/ \/ \/ \/ \/ | n
+ | | | | | | | | e
+ | | | +----+ +----+ | | | l
+ | | | | | | | |
+ | +----+ +----+ +----+ +----+ +----+ +----+ | s
+ | |tc0 | |tc1 | |tc2 | |tc2 | |tc1 | |tc0 | | p
+ | \ / \ / \ / \ / \ / \ / | a
+ | \ / \ / \ / \ / \ / \ / | c
+ | \/ \/ \/ \/ \/ \/ | e
+ | | | +-----+ +-----+ | | |
+ | | | | | | | | | |
+ | | | | | | | | | |
+ | | | | | E E | | | | |
+ | +----+ +----+ +----+ +----+ t t +----+ +----+ +----+ +----+ |
+ | |txq0| |txq1| |txq4| |txq5| h h |txq6| |txq7| |txq3| |txq2| |
+ | \ / \ / \ / \ / 0 1 \ / \ / \ / \ / |
+ | \ / \ / \ / \ / . . \ / \ / \ / \ / |
+ | \/ \/ \/ \/ 1 1 \/ \/ \/ \/ |
+ | +-|------|------|------|--+ 0 0 +-|------|------|------|--+ |
+ | | | | | | | 0 0 | | | | | | |
+ +---|------|------|------|---------------|------|------|------|----+
+ | | | | | | | |
+ p p p p p p p p
+ 3 2 0-1, 4-7 <-L2 pri-> 0-1, 4-7 2 3
+ | | | | | | | |
+ | | | | | | | |
+ +---|------|------|------|---------------|------|------|------|----+
+ | | | | | | | | | |
+ | +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ |
+ | |dma7| |dma6| |dma3| |dma2| |dma1| |dma0| |dma4| |dma5| |
+ | \ / \ / \ / \ / \ / \ / \ / \ / | c
+ | \S / \S / \ / \ / \ / \ / \S / \S / | p
+ | \/ \/ \/ \/ \/ \/ \/ \/ | s
+ | | | | +----- | | | | | w
+ | | | | | +----+ | | | |
+ | | | | | | | | | | d
+ | +----+ +----+ +----+p p+----+ +----+ +----+ | r
+ | | | | | | |o o| | | | | | | i
+ | | f3 | | f2 | | f0 |r CPSW r| f3 | | f2 | | f0 | | v
+ | |tc0 | |tc1 | |tc2 |t t|tc0 | |tc1 | |tc2 | | e
+ | \CBS / \CBS / \CBS /1 2\CBS / \CBS / \CBS / | r
+ | \S / \S / \ / \S / \S / \ / |
+ | \/ \/ \/ \/ \/ \/ |
+ +------------------------------------------------------------------+
+ ========================================Eth==========================>
+
+1) ::
+
+ // Add 8 tx queues, for interface Eth0, but they are common, so are accessed
+ // by two interfaces Eth0 and Eth1.
+ $ ethtool -L eth1 rx 1 tx 8
+ rx unmodified, ignoring
+
+2) ::
+
+ // Check if num of queues is set correctly:
+ $ ethtool -l eth0
+ Channel parameters for eth0:
+ Pre-set maximums:
+ RX: 8
+ TX: 8
+ Other: 0
+ Combined: 0
+ Current hardware settings:
+ RX: 1
+ TX: 8
+ Other: 0
+ Combined: 0
+
+3) ::
+
+ // TX queues must be rated starting from 0, so set bws for tx0 and tx1 for Eth0
+ // and for tx2 and tx3 for Eth1. That is, rates 40 and 20 Mb/s appropriately
+ // for Eth0 and 30 and 10 Mb/s for Eth1.
+ // Real speed can differ a bit due to discreetness
+ // Leave last 4 tx queues as not rated
+ $ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
+ $ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
+ $ echo 30 > /sys/class/net/eth1/queues/tx-2/tx_maxrate
+ $ echo 10 > /sys/class/net/eth1/queues/tx-3/tx_maxrate
+
+4) ::
+
+ // Check maximum rate of tx (cpdma) queues:
+ $ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
+ 40
+ 20
+ 30
+ 10
+ 0
+ 0
+ 0
+ 0
+
+5) ::
+
+ // Map skb->priority to traffic class for Eth0:
+ // 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+ // Map traffic class to transmit queue:
+ // tc0 -> txq0, tc1 -> txq1, tc2 -> (txq4, txq5)
+ $ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
+ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@4 hw 1
+
+6) ::
+
+ // Check classes settings
+ $ tc -g class show dev eth0
+ +---(100:ffe2) mqprio
+ | +---(100:5) mqprio
+ | +---(100:6) mqprio
+ |
+ +---(100:ffe1) mqprio
+ | +---(100:2) mqprio
+ |
+ +---(100:ffe0) mqprio
+ +---(100:1) mqprio
+
+7) ::
+
+ // Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc for Eth0
+ // here only idle slope is important, others ignored
+ // Real speed can differ a bit due to discreetness
+ $ tc qdisc add dev eth0 parent 100:1 cbs locredit -1470 \
+ hicredit 62 sendslope -959000 idleslope 41000 offload 1
+ net eth0: set FIFO3 bw = 50
+
+8) ::
+
+ // Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc for Eth0
+ $ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \
+ hicredit 65 sendslope -979000 idleslope 21000 offload 1
+ net eth0: set FIFO2 bw = 30
+
+9) ::
+
+ // Create vlan 100 to map sk->priority to vlan qos for Eth0
+ $ ip link add link eth0 name eth0.100 type vlan id 100
+ net eth0: Adding vlanid 100 to vlan filter
+
+10) ::
+
+ // Map skb->priority to L2 prio for Eth0.100, one to one
+ $ ip link set eth0.100 type vlan \
+ egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+11) ::
+
+ // Check egress map for vlan 100
+ $ cat /proc/net/vlan/eth0.100
+ [...]
+ INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+ EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+12) ::
+
+ // Map skb->priority to traffic class for Eth1:
+ // 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+ // Map traffic class to transmit queue:
+ // tc0 -> txq2, tc1 -> txq3, tc2 -> (txq6, txq7)
+ $ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 3 \
+ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@2 1@3 2@6 hw 1
+
+13) ::
+
+ // Check classes settings
+ $ tc -g class show dev eth1
+ +---(100:ffe2) mqprio
+ | +---(100:7) mqprio
+ | +---(100:8) mqprio
+ |
+ +---(100:ffe1) mqprio
+ | +---(100:4) mqprio
+ |
+ +---(100:ffe0) mqprio
+ +---(100:3) mqprio
+
+14) ::
+
+ // Set rate for class A - 31 Mbit (tc0, txq2) using CBS Qdisc for Eth1
+ // here only idle slope is important, others ignored, but calculated
+ // for interface speed - 100Mb for eth1 port.
+ // Set it +1 Mb for reserve (important!)
+ $ tc qdisc add dev eth1 parent 100:3 cbs locredit -1035 \
+ hicredit 465 sendslope -69000 idleslope 31000 offload 1
+ net eth1: set FIFO3 bw = 31
+
+15) ::
+
+ // Set rate for class B - 11 Mbit (tc1, txq3) using CBS Qdisc for Eth1
+ // Set it +1 Mb for reserve (important!)
+ $ tc qdisc add dev eth1 parent 100:4 cbs locredit -1335 \
+ hicredit 405 sendslope -89000 idleslope 11000 offload 1
+ net eth1: set FIFO2 bw = 11
+
+16) ::
+
+ // Create vlan 100 to map sk->priority to vlan qos for Eth1
+ $ ip link add link eth1 name eth1.100 type vlan id 100
+ net eth1: Adding vlanid 100 to vlan filter
+
+17) ::
+
+ // Map skb->priority to L2 prio for Eth1.100, one to one
+ $ ip link set eth1.100 type vlan \
+ egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+18) ::
+
+ // Check egress map for vlan 100
+ $ cat /proc/net/vlan/eth1.100
+ [...]
+ INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+ EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+19) ::
+
+ // Run appropriate tools with socket option "SO_PRIORITY" to 3
+ // for class A and to 2 for class B. For both interfaces
+ ./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
+ ./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
+ ./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p2 -s 1500&
+ ./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p3 -s 1500&
+
+20) ::
+
+ // run your listener on workstation (should be in same vlan)
+ // (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+ ./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39000 kbps
+
+21) ::
+
+ // Restore default configuration if needed
+ $ ip link del eth1.100
+ $ ip link del eth0.100
+ $ tc qdisc del dev eth1 root
+ net eth1: Prev FIFO2 is shaped
+ net eth1: set FIFO3 bw = 0
+ net eth1: set FIFO2 bw = 0
+ $ tc qdisc del dev eth0 root
+ net eth0: Prev FIFO2 is shaped
+ net eth0: set FIFO3 bw = 0
+ net eth0: set FIFO2 bw = 0
+ $ ethtool -L eth0 rx 1 tx 1
diff --git a/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
new file mode 100644
index 0000000000..464dce938e
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
@@ -0,0 +1,242 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Texas Instruments CPSW switchdev based ethernet driver
+======================================================
+
+:Version: 2.0
+
+Port renaming
+=============
+
+On older udev versions renaming of ethX to swXpY will not be automatically
+supported
+
+In order to rename via udev::
+
+ ip -d link show dev sw0p1 | grep switchid
+
+ SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \
+ ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
+
+
+Dual mac mode
+=============
+
+- The new (cpsw_new.c) driver is operating in dual-emac mode by default, thus
+ working as 2 individual network interfaces. Main differences from legacy CPSW
+ driver are:
+
+ - optimized promiscuous mode: The P0_UNI_FLOOD (both ports) is enabled in
+ addition to ALLMULTI (current port) instead of ALE_BYPASS.
+ So, Ports in promiscuous mode will keep possibility of mcast and vlan
+ filtering, which is provides significant benefits when ports are joined
+ to the same bridge, but without enabling "switch" mode, or to different
+ bridges.
+ - learning disabled on ports as it make not too much sense for
+ segregated ports - no forwarding in HW.
+ - enabled basic support for devlink.
+
+ ::
+
+ devlink dev show
+ platform/48484000.switch
+
+ devlink dev param show
+ platform/48484000.switch:
+ name switch_mode type driver-specific
+ values:
+ cmode runtime value false
+ name ale_bypass type driver-specific
+ values:
+ cmode runtime value false
+
+Devlink configuration parameters
+================================
+
+See Documentation/networking/devlink/ti-cpsw-switch.rst
+
+Bridging in dual mac mode
+=========================
+
+The dual_mac mode requires two vids to be reserved for internal purposes,
+which, by default, equal CPSW Port numbers. As result, bridge has to be
+configured in vlan unaware mode or default_pvid has to be adjusted::
+
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge vlan_filtering 0
+ echo 0 > /sys/class/net/br0/bridge/default_pvid
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+
+or::
+
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge vlan_filtering 0
+ echo 100 > /sys/class/net/br0/bridge/default_pvid
+ ip link set dev br0 type bridge vlan_filtering 1
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+
+Enabling "switch"
+=================
+
+The Switch mode can be enabled by configuring devlink driver parameter
+"switch_mode" to 1/true::
+
+ devlink dev param set platform/48484000.switch \
+ name switch_mode value 1 cmode runtime
+
+This can be done regardless of the state of Port's netdev devices - UP/DOWN, but
+Port's netdev devices have to be in UP before joining to the bridge to avoid
+overwriting of bridge configuration as CPSW switch driver copletly reloads its
+configuration when first Port changes its state to UP.
+
+When the both interfaces joined the bridge - CPSW switch driver will enable
+marking packets with offload_fwd_mark flag unless "ale_bypass=0"
+
+All configuration is implemented via switchdev API.
+
+Bridge setup
+============
+
+::
+
+ devlink dev param set platform/48484000.switch \
+ name switch_mode value 1 cmode runtime
+
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge ageing_time 1000
+ ip link set dev sw0p1 up
+ ip link set dev sw0p2 up
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+
+ [*] bridge vlan add dev br0 vid 1 pvid untagged self
+
+ [*] if vlan_filtering=1. where default_pvid=1
+
+ Note. Steps [*] are mandatory.
+
+
+On/off STP
+==========
+
+::
+
+ ip link set dev BRDEV type bridge stp_state 1/0
+
+VLAN configuration
+==================
+
+::
+
+ bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1
+
+Note. This step is mandatory for bridge/default_pvid.
+
+Add extra VLANs
+===============
+
+ 1. untagged::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 pvid untagged master
+ bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100
+
+ 2. tagged::
+
+ bridge vlan add dev sw0p1 vid 100 master
+ bridge vlan add dev sw0p2 vid 100 master
+ bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100
+
+FDBs
+----
+
+FDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding FDBs::
+
+ bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100
+ bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs
+
+MDBs
+----
+
+MDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding MDBs::
+
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs
+
+Multicast flooding
+==================
+CPU port mcast_flooding is always on
+
+Turning flooding on/off on switch ports:
+bridge link set dev sw0p1 mcast_flood on/off
+
+Access and Trunk port
+=====================
+
+::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 master
+
+
+ bridge vlan add dev br0 vid 100 self
+ ip link add link br0 name br0.100 type vlan id 100
+
+Note. Setting PVID on Bridge device itself working only for
+default VLAN (default_pvid).
+
+NFS
+===
+
+The only way for NFS to work is by chrooting to a minimal environment when
+switch configuration that will affect connectivity is needed.
+Assuming you are booting NFS with eth1 interface(the script is hacky and
+it's just there to prove NFS is doable).
+
+setup.sh::
+
+ #!/bin/sh
+ mkdir proc
+ mount -t proc none /proc
+ ifconfig br0 > /dev/null
+ if [ $? -ne 0 ]; then
+ echo "Setting up bridge"
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge ageing_time 1000
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ ip link set eth1 down
+ ip link set eth1 name sw0p1
+ ip link set dev sw0p1 up
+ ip link set dev sw0p2 up
+ ip link set dev sw0p2 master br0
+ ip link set dev sw0p1 master br0
+ bridge vlan add dev br0 vid 1 pvid untagged self
+ ifconfig sw0p1 0.0.0.0
+ udhchc -i br0
+ fi
+ umount /proc
+
+run_nfs.sh:::
+
+ #!/bin/sh
+ mkdir /tmp/root/bin -p
+ mkdir /tmp/root/lib -p
+
+ cp -r /lib/ /tmp/root/
+ cp -r /bin/ /tmp/root/
+ cp /sbin/ip /tmp/root/bin
+ cp /sbin/bridge /tmp/root/bin
+ cp /sbin/ifconfig /tmp/root/bin
+ cp /sbin/udhcpc /tmp/root/bin
+ cp /path/to/setup.sh /tmp/root/bin
+ chroot /tmp/root/ busybox sh /bin/setup.sh
+
+ run ./run_nfs.sh
diff --git a/Documentation/networking/device_drivers/ethernet/ti/tlan.rst b/Documentation/networking/device_drivers/ethernet/ti/tlan.rst
new file mode 100644
index 0000000000..4fdc0907f4
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/tlan.rst
@@ -0,0 +1,140 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+TLAN driver for Linux
+=====================
+
+:Version: 1.14a
+
+(C) 1997-1998 Caldera, Inc.
+
+(C) 1998 James Banks
+
+(C) 1999-2001 Torben Mathiasen <tmm@image.dk, torben.mathiasen@compaq.com>
+
+For driver information/updates visit http://www.compaq.com
+
+
+
+
+
+I. Supported Devices
+====================
+
+ Only PCI devices will work with this driver.
+
+ Supported:
+
+ ========= ========= ===========================================
+ Vendor ID Device ID Name
+ ========= ========= ===========================================
+ 0e11 ae32 Compaq Netelligent 10/100 TX PCI UTP
+ 0e11 ae34 Compaq Netelligent 10 T PCI UTP
+ 0e11 ae35 Compaq Integrated NetFlex 3/P
+ 0e11 ae40 Compaq Netelligent Dual 10/100 TX PCI UTP
+ 0e11 ae43 Compaq Netelligent Integrated 10/100 TX UTP
+ 0e11 b011 Compaq Netelligent 10/100 TX Embedded UTP
+ 0e11 b012 Compaq Netelligent 10 T/2 PCI UTP/Coax
+ 0e11 b030 Compaq Netelligent 10/100 TX UTP
+ 0e11 f130 Compaq NetFlex 3/P
+ 0e11 f150 Compaq NetFlex 3/P
+ 108d 0012 Olicom OC-2325
+ 108d 0013 Olicom OC-2183
+ 108d 0014 Olicom OC-2326
+ ========= ========= ===========================================
+
+
+ Caveats:
+
+ I am not sure if 100BaseTX daughterboards (for those cards which
+ support such things) will work. I haven't had any solid evidence
+ either way.
+
+ However, if a card supports 100BaseTx without requiring an add
+ on daughterboard, it should work with 100BaseTx.
+
+ The "Netelligent 10 T/2 PCI UTP/Coax" (b012) device is untested,
+ but I do not expect any problems.
+
+
+II. Driver Options
+==================
+
+ 1. You can append debug=x to the end of the insmod line to get
+ debug messages, where x is a bit field where the bits mean
+ the following:
+
+ ==== =====================================
+ 0x01 Turn on general debugging messages.
+ 0x02 Turn on receive debugging messages.
+ 0x04 Turn on transmit debugging messages.
+ 0x08 Turn on list debugging messages.
+ ==== =====================================
+
+ 2. You can append aui=1 to the end of the insmod line to cause
+ the adapter to use the AUI interface instead of the 10 Base T
+ interface. This is also what to do if you want to use the BNC
+ connector on a TLAN based device. (Setting this option on a
+ device that does not have an AUI/BNC connector will probably
+ cause it to not function correctly.)
+
+ 3. You can set duplex=1 to force half duplex, and duplex=2 to
+ force full duplex.
+
+ 4. You can set speed=10 to force 10Mbs operation, and speed=100
+ to force 100Mbs operation. (I'm not sure what will happen
+ if a card which only supports 10Mbs is forced into 100Mbs
+ mode.)
+
+ 5. You have to use speed=X duplex=Y together now. If you just
+ do "insmod tlan.o speed=100" the driver will do Auto-Neg.
+ To force a 10Mbps Half-Duplex link do "insmod tlan.o speed=10
+ duplex=1".
+
+ 6. If the driver is built into the kernel, you can use the 3rd
+ and 4th parameters to set aui and debug respectively. For
+ example::
+
+ ether=0,0,0x1,0x7,eth0
+
+ This sets aui to 0x1 and debug to 0x7, assuming eth0 is a
+ supported TLAN device.
+
+ The bits in the third byte are assigned as follows:
+
+ ==== ===============
+ 0x01 aui
+ 0x02 use half duplex
+ 0x04 use full duplex
+ 0x08 use 10BaseT
+ 0x10 use 100BaseTx
+ ==== ===============
+
+ You also need to set both speed and duplex settings when forcing
+ speeds with kernel-parameters.
+ ether=0,0,0x12,0,eth0 will force link to 100Mbps Half-Duplex.
+
+ 7. If you have more than one tlan adapter in your system, you can
+ use the above options on a per adapter basis. To force a 100Mbit/HD
+ link with your eth1 adapter use::
+
+ insmod tlan speed=0,100 duplex=0,1
+
+ Now eth0 will use auto-neg and eth1 will be forced to 100Mbit/HD.
+ Note that the tlan driver supports a maximum of 8 adapters.
+
+
+III. Things to try if you have problems
+=======================================
+
+ 1. Make sure your card's PCI id is among those listed in
+ section I, above.
+ 2. Make sure routing is correct.
+ 3. Try forcing different speed/duplex settings
+
+
+There is also a tlan mailing list which you can join by sending "subscribe tlan"
+in the body of an email to majordomo@vuser.vu.union.edu.
+
+There is also a tlan website at http://www.compaq.com
+
diff --git a/Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst b/Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst
new file mode 100644
index 0000000000..fe5b32be15
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst
@@ -0,0 +1,202 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+The Spidernet Device Driver
+===========================
+
+Written by Linas Vepstas <linas@austin.ibm.com>
+
+Version of 7 June 2007
+
+Abstract
+========
+This document sketches the structure of portions of the spidernet
+device driver in the Linux kernel tree. The spidernet is a gigabit
+ethernet device built into the Toshiba southbridge commonly used
+in the SONY Playstation 3 and the IBM QS20 Cell blade.
+
+The Structure of the RX Ring.
+=============================
+The receive (RX) ring is a circular linked list of RX descriptors,
+together with three pointers into the ring that are used to manage its
+contents.
+
+The elements of the ring are called "descriptors" or "descrs"; they
+describe the received data. This includes a pointer to a buffer
+containing the received data, the buffer size, and various status bits.
+
+There are three primary states that a descriptor can be in: "empty",
+"full" and "not-in-use". An "empty" or "ready" descriptor is ready
+to receive data from the hardware. A "full" descriptor has data in it,
+and is waiting to be emptied and processed by the OS. A "not-in-use"
+descriptor is neither empty or full; it is simply not ready. It may
+not even have a data buffer in it, or is otherwise unusable.
+
+During normal operation, on device startup, the OS (specifically, the
+spidernet device driver) allocates a set of RX descriptors and RX
+buffers. These are all marked "empty", ready to receive data. This
+ring is handed off to the hardware, which sequentially fills in the
+buffers, and marks them "full". The OS follows up, taking the full
+buffers, processing them, and re-marking them empty.
+
+This filling and emptying is managed by three pointers, the "head"
+and "tail" pointers, managed by the OS, and a hardware current
+descriptor pointer (GDACTDPA). The GDACTDPA points at the descr
+currently being filled. When this descr is filled, the hardware
+marks it full, and advances the GDACTDPA by one. Thus, when there is
+flowing RX traffic, every descr behind it should be marked "full",
+and everything in front of it should be "empty". If the hardware
+discovers that the current descr is not empty, it will signal an
+interrupt, and halt processing.
+
+The tail pointer tails or trails the hardware pointer. When the
+hardware is ahead, the tail pointer will be pointing at a "full"
+descr. The OS will process this descr, and then mark it "not-in-use",
+and advance the tail pointer. Thus, when there is flowing RX traffic,
+all of the descrs in front of the tail pointer should be "full", and
+all of those behind it should be "not-in-use". When RX traffic is not
+flowing, then the tail pointer can catch up to the hardware pointer.
+The OS will then note that the current tail is "empty", and halt
+processing.
+
+The head pointer (somewhat mis-named) follows after the tail pointer.
+When traffic is flowing, then the head pointer will be pointing at
+a "not-in-use" descr. The OS will perform various housekeeping duties
+on this descr. This includes allocating a new data buffer and
+dma-mapping it so as to make it visible to the hardware. The OS will
+then mark the descr as "empty", ready to receive data. Thus, when there
+is flowing RX traffic, everything in front of the head pointer should
+be "not-in-use", and everything behind it should be "empty". If no
+RX traffic is flowing, then the head pointer can catch up to the tail
+pointer, at which point the OS will notice that the head descr is
+"empty", and it will halt processing.
+
+Thus, in an idle system, the GDACTDPA, tail and head pointers will
+all be pointing at the same descr, which should be "empty". All of the
+other descrs in the ring should be "empty" as well.
+
+The show_rx_chain() routine will print out the locations of the
+GDACTDPA, tail and head pointers. It will also summarize the contents
+of the ring, starting at the tail pointer, and listing the status
+of the descrs that follow.
+
+A typical example of the output, for a nearly idle system, might be::
+
+ net eth1: Total number of descrs=256
+ net eth1: Chain tail located at descr=20
+ net eth1: Chain head is at 20
+ net eth1: HW curr desc (GDACTDPA) is at 21
+ net eth1: Have 1 descrs with stat=x40800101
+ net eth1: HW next desc (GDACNEXTDA) is at 22
+ net eth1: Last 255 descrs with stat=xa0800000
+
+In the above, the hardware has filled in one descr, number 20. Both
+head and tail are pointing at 20, because it has not yet been emptied.
+Meanwhile, hw is pointing at 21, which is free.
+
+The "Have nnn decrs" refers to the descr starting at the tail: in this
+case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers
+to all of the rest of the descrs, from the last status change. The "nnn"
+is a count of how many descrs have exactly the same status.
+
+The status x4... corresponds to "full" and status xa... corresponds
+to "empty". The actual value printed is RXCOMST_A.
+
+In the device driver source code, a different set of names are
+used for these same concepts, so that::
+
+ "empty" == SPIDER_NET_DESCR_CARDOWNED == 0xa
+ "full" == SPIDER_NET_DESCR_FRAME_END == 0x4
+ "not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
+
+
+The RX RAM full bug/feature
+===========================
+
+As long as the OS can empty out the RX buffers at a rate faster than
+the hardware can fill them, there is no problem. If, for some reason,
+the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
+pointer will catch up to the head, notice the not-empty condition,
+ad stop. However, RX packets may still continue arriving on the wire.
+The spidernet chip can save some limited number of these in local RAM.
+When this local ram fills up, the spider chip will issue an interrupt
+indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
+will be set in GHIINT1STS). When the RX ram full condition occurs,
+a certain bug/feature is triggered that has to be specially handled.
+This section describes the special handling for this condition.
+
+When the OS finally has a chance to run, it will empty out the RX ring.
+In particular, it will clear the descriptor on which the hardware had
+stopped. However, once the hardware has decided that a certain
+descriptor is invalid, it will not restart at that descriptor; instead
+it will restart at the next descr. This potentially will lead to a
+deadlock condition, as the tail pointer will be pointing at this descr,
+which, from the OS point of view, is empty; the OS will be waiting for
+this descr to be filled. However, the hardware has skipped this descr,
+and is filling the next descrs. Since the OS doesn't see this, there
+is a potential deadlock, with the OS waiting for one descr to fill,
+while the hardware is waiting for a different set of descrs to become
+empty.
+
+A call to show_rx_chain() at this point indicates the nature of the
+problem. A typical print when the network is hung shows the following::
+
+ net eth1: Spider RX RAM full, incoming packets might be discarded!
+ net eth1: Total number of descrs=256
+ net eth1: Chain tail located at descr=255
+ net eth1: Chain head is at 255
+ net eth1: HW curr desc (GDACTDPA) is at 0
+ net eth1: Have 1 descrs with stat=xa0800000
+ net eth1: HW next desc (GDACNEXTDA) is at 1
+ net eth1: Have 127 descrs with stat=x40800101
+ net eth1: Have 1 descrs with stat=x40800001
+ net eth1: Have 126 descrs with stat=x40800101
+ net eth1: Last 1 descrs with stat=xa0800000
+
+Both the tail and head pointers are pointing at descr 255, which is
+marked xa... which is "empty". Thus, from the OS point of view, there
+is nothing to be done. In particular, there is the implicit assumption
+that everything in front of the "empty" descr must surely also be empty,
+as explained in the last section. The OS is waiting for descr 255 to
+become non-empty, which, in this case, will never happen.
+
+The HW pointer is at descr 0. This descr is marked 0x4.. or "full".
+Since its already full, the hardware can do nothing more, and thus has
+halted processing. Notice that descrs 0 through 254 are all marked
+"full", while descr 254 and 255 are empty. (The "Last 1 descrs" is
+descr 254, since tail was at 255.) Thus, the system is deadlocked,
+and there can be no forward progress; the OS thinks there's nothing
+to do, and the hardware has nowhere to put incoming data.
+
+This bug/feature is worked around with the spider_net_resync_head_ptr()
+routine. When the driver receives RX interrupts, but an examination
+of the RX chain seems to show it is empty, then it is probable that
+the hardware has skipped a descr or two (sometimes dozens under heavy
+network conditions). The spider_net_resync_head_ptr() subroutine will
+search the ring for the next full descr, and the driver will resume
+operations there. Since this will leave "holes" in the ring, there
+is also a spider_net_resync_tail_ptr() that will skip over such holes.
+
+As of this writing, the spider_net_resync() strategy seems to work very
+well, even under heavy network loads.
+
+
+The TX ring
+===========
+The TX ring uses a low-watermark interrupt scheme to make sure that
+the TX queue is appropriately serviced for large packet sizes.
+
+For packet sizes greater than about 1KBytes, the kernel can fill
+the TX ring quicker than the device can drain it. Once the ring
+is full, the netdev is stopped. When there is room in the ring,
+the netdev needs to be reawakened, so that more TX packets are placed
+in the ring. The hardware can empty the ring about four times per jiffy,
+so its not appropriate to wait for the poll routine to refill, since
+the poll routine runs only once per jiffy. The low-watermark mechanism
+marks a descr about 1/4th of the way from the bottom of the queue, so
+that an interrupt is generated when the descr is processed. This
+interrupt wakes up the netdev, which can then refill the queue.
+For large packets, this mechanism generates a relatively small number
+of interrupts, about 1K/sec. For smaller packets, this will drop to zero
+interrupts, as the hardware can empty the queue faster than the kernel
+can fill it.
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
new file mode 100644
index 0000000000..43a02f9943
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================================
+Linux Base Driver for WangXun(R) Gigabit PCI Express Adapters
+=============================================================
+
+WangXun Gigabit Linux driver.
+Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd.
+
+Support
+=======
+ If you have problems with the software or hardware, please contact our
+ customer support team via email at nic-support@net-swift.com or check our website
+ at https://www.net-swift.com
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
new file mode 100644
index 0000000000..d052ef40fe
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================================
+Linux Base Driver for WangXun(R) 10 Gigabit PCI Express Adapters
+================================================================
+
+WangXun 10 Gigabit Linux driver.
+Copyright (c) 2015 - 2022 Beijing WangXun Technology Co., Ltd.
+
+
+Contents
+========
+
+- Support
+
+
+Support
+=======
+If you got any problem, contact Wangxun support team via nic-support@net-swift.com
+and Cc: netdev.
diff --git a/Documentation/networking/device_drivers/fddi/defza.rst b/Documentation/networking/device_drivers/fddi/defza.rst
new file mode 100644
index 0000000000..7393f33ea7
--- /dev/null
+++ b/Documentation/networking/device_drivers/fddi/defza.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+Notes on the DEC FDDIcontroller 700 (DEFZA-xx) driver
+=====================================================
+
+:Version: v.1.1.4
+
+
+DEC FDDIcontroller 700 is DEC's first-generation TURBOchannel FDDI
+network card, designed in 1990 specifically for the DECstation 5000
+model 200 workstation. The board is a single attachment station and
+it was manufactured in two variations, both of which are supported.
+
+First is the SAS MMF DEFZA-AA option, the original design implementing
+the standard MMF-PMD, however with a pair of ST connectors rather than
+the usual MIC connector. The other one is the SAS ThinWire/STP DEFZA-CA
+option, denoted 700-C, with the network medium selectable by a switch
+between the DEC proprietary ThinWire-PMD using a BNC connector and the
+standard STP-PMD using a DE-9F connector. This option can interface to
+a DECconcentrator 500 device and, in the case of the STP-PMD, also other
+FDDI equipment and was designed to make it easier to transition from
+existing IEEE 802.3 10BASE2 Ethernet and IEEE 802.5 Token Ring networks
+by providing means to reuse existing cabling.
+
+This driver handles any number of cards installed in a single system.
+They get fddi0, fddi1, etc. interface names assigned in the order of
+increasing TURBOchannel slot numbers.
+
+The board only supports DMA on the receive side. Transmission involves
+the use of PIO. As a result under a heavy transmission load there will
+be a significant impact on system performance.
+
+The board supports a 64-entry CAM for matching destination addresses.
+Two entries are preoccupied by the Directed Beacon and Ring Purger
+multicast addresses and the rest is used as a multicast filter. An
+all-multi mode is also supported for LLC frames and it is used if
+requested explicitly or if the CAM overflows. The promiscuous mode
+supports separate enables for LLC and SMT frames, but this driver
+doesn't support changing them individually.
+
+
+Known problems:
+
+None.
+
+
+To do:
+
+5. MAC address change. The card does not support changing the Media
+ Access Controller's address registers but a similar effect can be
+ achieved by adding an alias to the CAM. There is no way to disable
+ matching against the original address though.
+
+7. Queueing incoming/outgoing SMT frames in the driver if the SMT
+ receive/RMC transmit ring is full. (?)
+
+8. Retrieving/reporting FDDI/SNMP stats.
+
+
+Both success and failure reports are welcome.
+
+Maciej W. Rozycki <macro@orcam.me.uk>
diff --git a/Documentation/networking/device_drivers/fddi/index.rst b/Documentation/networking/device_drivers/fddi/index.rst
new file mode 100644
index 0000000000..0b75294e6c
--- /dev/null
+++ b/Documentation/networking/device_drivers/fddi/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Fiber Distributed Data Interface (FDDI) Device Drivers
+======================================================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ defza
+ skfp
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/fddi/skfp.rst b/Documentation/networking/device_drivers/fddi/skfp.rst
new file mode 100644
index 0000000000..58f548105c
--- /dev/null
+++ b/Documentation/networking/device_drivers/fddi/skfp.rst
@@ -0,0 +1,253 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+========================
+SysKonnect driver - SKFP
+========================
+
+|copy| Copyright 1998-2000 SysKonnect,
+
+skfp.txt created 11-May-2000
+
+Readme File for skfp.o v2.06
+
+
+.. This file contains
+
+ (1) OVERVIEW
+ (2) SUPPORTED ADAPTERS
+ (3) GENERAL INFORMATION
+ (4) INSTALLATION
+ (5) INCLUSION OF THE ADAPTER IN SYSTEM START
+ (6) TROUBLESHOOTING
+ (7) FUNCTION OF THE ADAPTER LEDS
+ (8) HISTORY
+
+
+1. Overview
+===========
+
+This README explains how to use the driver 'skfp' for Linux with your
+network adapter.
+
+Chapter 2: Contains a list of all network adapters that are supported by
+this driver.
+
+Chapter 3:
+ Gives some general information.
+
+Chapter 4: Describes common problems and solutions.
+
+Chapter 5: Shows the changed functionality of the adapter LEDs.
+
+Chapter 6: History of development.
+
+
+2. Supported adapters
+=====================
+
+The network driver 'skfp' supports the following network adapters:
+SysKonnect adapters:
+
+ - SK-5521 (SK-NET FDDI-UP)
+ - SK-5522 (SK-NET FDDI-UP DAS)
+ - SK-5541 (SK-NET FDDI-FP)
+ - SK-5543 (SK-NET FDDI-LP)
+ - SK-5544 (SK-NET FDDI-LP DAS)
+ - SK-5821 (SK-NET FDDI-UP64)
+ - SK-5822 (SK-NET FDDI-UP64 DAS)
+ - SK-5841 (SK-NET FDDI-FP64)
+ - SK-5843 (SK-NET FDDI-LP64)
+ - SK-5844 (SK-NET FDDI-LP64 DAS)
+
+Compaq adapters (not tested):
+
+ - Netelligent 100 FDDI DAS Fibre SC
+ - Netelligent 100 FDDI SAS Fibre SC
+ - Netelligent 100 FDDI DAS UTP
+ - Netelligent 100 FDDI SAS UTP
+ - Netelligent 100 FDDI SAS Fibre MIC
+
+
+3. General Information
+======================
+
+From v2.01 on, the driver is integrated in the linux kernel sources.
+Therefore, the installation is the same as for any other adapter
+supported by the kernel.
+
+Refer to the manual of your distribution about the installation
+of network adapters.
+
+Makes my life much easier :-)
+
+4. Troubleshooting
+==================
+
+If you run into problems during installation, check those items:
+
+Problem:
+ The FDDI adapter cannot be found by the driver.
+
+Reason:
+ Look in /proc/pci for the following entry:
+
+ 'FDDI network controller: SysKonnect SK-FDDI-PCI ...'
+
+ If this entry exists, then the FDDI adapter has been
+ found by the system and should be able to be used.
+
+ If this entry does not exist or if the file '/proc/pci'
+ is not there, then you may have a hardware problem or PCI
+ support may not be enabled in your kernel.
+
+ The adapter can be checked using the diagnostic program
+ which is available from the SysKonnect web site:
+
+ www.syskonnect.de
+
+ Some COMPAQ machines have a problem with PCI under
+ Linux. This is described in the 'PCI howto' document
+ (included in some distributions or available from the
+ www, e.g. at 'www.linux.org') and no workaround is available.
+
+Problem:
+ You want to use your computer as a router between
+ multiple IP subnetworks (using multiple adapters), but
+ you cannot reach computers in other subnetworks.
+
+Reason:
+ Either the router's kernel is not configured for IP
+ forwarding or there is a problem with the routing table
+ and gateway configuration in at least one of the
+ computers.
+
+If your problem is not listed here, please contact our
+technical support for help.
+
+You can send email to: linux@syskonnect.de
+
+When contacting our technical support,
+please ensure that the following information is available:
+
+- System Manufacturer and Model
+- Boards in your system
+- Distribution
+- Kernel version
+
+
+5. Function of the Adapter LEDs
+===============================
+
+ The functionality of the LED's on the FDDI network adapters was
+ changed in SMT version v2.82. With this new SMT version, the yellow
+ LED works as a ring operational indicator. An active yellow LED
+ indicates that the ring is down. The green LED on the adapter now
+ works as a link indicator where an active GREEN LED indicates that
+ the respective port has a physical connection.
+
+ With versions of SMT prior to v2.82 a ring up was indicated if the
+ yellow LED was off while the green LED(s) showed the connection
+ status of the adapter. During a ring down the green LED was off and
+ the yellow LED was on.
+
+ All implementations indicate that a driver is not loaded if
+ all LEDs are off.
+
+
+6. History
+==========
+
+v2.06 (20000511) (In-Kernel version)
+ New features:
+
+ - 64 bit support
+ - new pci dma interface
+ - in kernel 2.3.99
+
+v2.05 (20000217) (In-Kernel version)
+ New features:
+
+ - Changes for 2.3.45 kernel
+
+v2.04 (20000207) (Standalone version)
+ New features:
+
+ - Added rx/tx byte counter
+
+v2.03 (20000111) (Standalone version)
+ Problems fixed:
+
+ - Fixed printk statements from v2.02
+
+v2.02 (991215) (Standalone version)
+ Problems fixed:
+
+ - Removed unnecessary output
+ - Fixed path for "printver.sh" in makefile
+
+v2.01 (991122) (In-Kernel version)
+ New features:
+
+ - Integration in Linux kernel sources
+ - Support for memory mapped I/O.
+
+v2.00 (991112)
+ New features:
+
+ - Full source released under GPL
+
+v1.05 (991023)
+ Problems fixed:
+
+ - Compilation with kernel version 2.2.13 failed
+
+v1.04 (990427)
+ Changes:
+
+ - New SMT module included, changing LED functionality
+
+ Problems fixed:
+
+ - Synchronization on SMP machines was buggy
+
+v1.03 (990325)
+ Problems fixed:
+
+ - Interrupt routing on SMP machines could be incorrect
+
+v1.02 (990310)
+ New features:
+
+ - Support for kernel versions 2.2.x added
+ - Kernel patch instead of private duplicate of kernel functions
+
+v1.01 (980812)
+ Problems fixed:
+
+ Connection hangup with telnet
+ Slow telnet connection
+
+v1.00 beta 01 (980507)
+ New features:
+
+ None.
+
+ Problems fixed:
+
+ None.
+
+ Known limitations:
+
+ - tar archive instead of standard package format (rpm).
+ - FDDI statistic is empty.
+ - not tested with 2.1.xx kernels
+ - integration in kernel not tested
+ - not tested simultaneously with FDDI adapters from other vendors.
+ - only X86 processors supported.
+ - SBA (Synchronous Bandwidth Allocator) parameters can
+ not be configured.
+ - does not work on some COMPAQ machines. See the PCI howto
+ document for details about this problem.
+ - data corruption with kernel versions below 2.0.33.
diff --git a/Documentation/networking/device_drivers/hamradio/baycom.rst b/Documentation/networking/device_drivers/hamradio/baycom.rst
new file mode 100644
index 0000000000..fe2d010f0e
--- /dev/null
+++ b/Documentation/networking/device_drivers/hamradio/baycom.rst
@@ -0,0 +1,174 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Linux Drivers for Baycom Modems
+===============================
+
+Thomas M. Sailer, HB9JNX/AE4WA, <sailer@ife.ee.ethz.ch>
+
+The drivers for the baycom modems have been split into
+separate drivers as they did not share any code, and the driver
+and device names have changed.
+
+This document describes the Linux Kernel Drivers for simple Baycom style
+amateur radio modems.
+
+The following drivers are available:
+====================================
+
+baycom_ser_fdx:
+ This driver supports the SER12 modems either full or half duplex.
+ Its baud rate may be changed via the ``baud`` module parameter,
+ therefore it supports just about every bit bang modem on a
+ serial port. Its devices are called bcsf0 through bcsf3.
+ This is the recommended driver for SER12 type modems,
+ however if you have a broken UART clone that does not have working
+ delta status bits, you may try baycom_ser_hdx.
+
+baycom_ser_hdx:
+ This is an alternative driver for SER12 type modems.
+ It only supports half duplex, and only 1200 baud. Its devices
+ are called bcsh0 through bcsh3. Use this driver only if baycom_ser_fdx
+ does not work with your UART.
+
+baycom_par:
+ This driver supports the par96 and picpar modems.
+ Its devices are called bcp0 through bcp3.
+
+baycom_epp:
+ This driver supports the EPP modem.
+ Its devices are called bce0 through bce3.
+ This driver is work-in-progress.
+
+The following modems are supported:
+
+======= ========================================================================
+ser12 This is a very simple 1200 baud AFSK modem. The modem consists only
+ of a modulator/demodulator chip, usually a TI TCM3105. The computer
+ is responsible for regenerating the receiver bit clock, as well as
+ for handling the HDLC protocol. The modem connects to a serial port,
+ hence the name. Since the serial port is not used as an async serial
+ port, the kernel driver for serial ports cannot be used, and this
+ driver only supports standard serial hardware (8250, 16450, 16550)
+
+par96 This is a modem for 9600 baud FSK compatible to the G3RUH standard.
+ The modem does all the filtering and regenerates the receiver clock.
+ Data is transferred from and to the PC via a shift register.
+ The shift register is filled with 16 bits and an interrupt is signalled.
+ The PC then empties the shift register in a burst. This modem connects
+ to the parallel port, hence the name. The modem leaves the
+ implementation of the HDLC protocol and the scrambler polynomial to
+ the PC.
+
+picpar This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem
+ is protocol compatible to par96, but uses only three low power ICs
+ and can therefore be fed from the parallel port and does not require
+ an additional power supply. Furthermore, it incorporates a carrier
+ detect circuitry.
+
+EPP This is a high-speed modem adaptor that connects to an enhanced parallel
+ port.
+
+ Its target audience is users working over a high speed hub (76.8kbit/s).
+
+eppfpga This is a redesign of the EPP adaptor.
+======= ========================================================================
+
+All of the above modems only support half duplex communications. However,
+the driver supports the KISS (see below) fullduplex command. It then simply
+starts to send as soon as there's a packet to transmit and does not care
+about DCD, i.e. it starts to send even if there's someone else on the channel.
+This command is required by some implementations of the DAMA channel
+access protocol.
+
+
+The Interface of the drivers
+============================
+
+Unlike previous drivers, these drivers are no longer character devices,
+but they are now true kernel network interfaces. Installation is therefore
+simple. Once installed, four interfaces named bc{sf,sh,p,e}[0-3] are available.
+sethdlc from the ax25 utilities may be used to set driver states etc.
+Users of userland AX.25 stacks may use the net2kiss utility (also available
+in the ax25 utilities package) to convert packets of a network interface
+to a KISS stream on a pseudo tty. There's also a patch available from
+me for WAMPES which allows attaching a kernel network interface directly.
+
+
+Configuring the driver
+======================
+
+Every time a driver is inserted into the kernel, it has to know which
+modems it should access at which ports. This can be done with the setbaycom
+utility. If you are only using one modem, you can also configure the
+driver from the insmod command line (or by means of an option line in
+``/etc/modprobe.d/*.conf``).
+
+Examples::
+
+ modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4
+ sethdlc -i bcsf0 -p mode "ser12*" io 0x3f8 irq 4
+
+Both lines configure the first port to drive a ser12 modem at the first
+serial port (COM1 under DOS). The * in the mode parameter instructs the driver
+to use the software DCD algorithm (see below)::
+
+ insmod baycom_par mode="picpar" iobase=0x378
+ sethdlc -i bcp0 -p mode "picpar" io 0x378
+
+Both lines configure the first port to drive a picpar modem at the
+first parallel port (LPT1 under DOS). (Note: picpar implies
+hardware DCD, par96 implies software DCD).
+
+The channel access parameters can be set with sethdlc -a or kissparms.
+Note that both utilities interpret the values slightly differently.
+
+
+Hardware DCD versus Software DCD
+================================
+
+To avoid collisions on the air, the driver must know when the channel is
+busy. This is the task of the DCD circuitry/software. The driver may either
+utilise a software DCD algorithm (options=1) or use a DCD signal from
+the hardware (options=0).
+
+======= =================================================================
+ser12 if software DCD is utilised, the radio's squelch should always be
+ open. It is highly recommended to use the software DCD algorithm,
+ as it is much faster than most hardware squelch circuitry. The
+ disadvantage is a slightly higher load on the system.
+
+par96 the software DCD algorithm for this type of modem is rather poor.
+ The modem simply does not provide enough information to implement
+ a reasonable DCD algorithm in software. Therefore, if your radio
+ feeds the DCD input of the PAR96 modem, the use of the hardware
+ DCD circuitry is recommended.
+
+picpar the picpar modem features a builtin DCD hardware, which is highly
+ recommended.
+======= =================================================================
+
+
+
+Compatibility with the rest of the Linux kernel
+===============================================
+
+The serial driver and the baycom serial drivers compete
+for the same hardware resources. Of course only one driver can access a given
+interface at a time. The serial driver grabs all interfaces it can find at
+startup time. Therefore the baycom drivers subsequently won't be able to
+access a serial port. You might therefore find it necessary to release
+a port owned by the serial driver with 'setserial /dev/ttyS# uart none', where
+# is the number of the interface. The baycom drivers do not reserve any
+ports at startup, unless one is specified on the 'insmod' command line. Another
+method to solve the problem is to compile all drivers as modules and
+leave it to kmod to load the correct driver depending on the application.
+
+The parallel port drivers (baycom_par, baycom_epp) now use the parport subsystem
+to arbitrate the ports between different client drivers.
+
+vy 73s de
+
+Tom Sailer, sailer@ife.ee.ethz.ch
+
+hb9jnx @ hb9w.ampr.org
diff --git a/Documentation/networking/device_drivers/hamradio/index.rst b/Documentation/networking/device_drivers/hamradio/index.rst
new file mode 100644
index 0000000000..7e73173205
--- /dev/null
+++ b/Documentation/networking/device_drivers/hamradio/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Amateur Radio Device Drivers
+============================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ baycom
+ z8530drv
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/hamradio/z8530drv.rst b/Documentation/networking/device_drivers/hamradio/z8530drv.rst
new file mode 100644
index 0000000000..d2942760f1
--- /dev/null
+++ b/Documentation/networking/device_drivers/hamradio/z8530drv.rst
@@ -0,0 +1,686 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=========================================================
+SCC.C - Linux driver for Z8530 based HDLC cards for AX.25
+=========================================================
+
+
+This is a subset of the documentation. To use this driver you MUST have the
+full package from:
+
+Internet:
+
+ 1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz
+
+ 2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz
+
+Please note that the information in this document may be hopelessly outdated.
+A new version of the documentation, along with links to other important
+Linux Kernel AX.25 documentation and programs, is available on
+http://yaina.de/jreuter
+
+Copyright |copy| 1993,2000 by Joerg Reuter DL1BKE <jreuter@yaina.de>
+
+portions Copyright |copy| 1993 Guido ten Dolle PE1NNZ
+
+for the complete copyright notice see >> Copying.Z8530DRV <<
+
+1. Initialization of the driver
+===============================
+
+To use the driver, 3 steps must be performed:
+
+ 1. if compiled as module: loading the module
+ 2. Setup of hardware, MODEM and KISS parameters with sccinit
+ 3. Attach each channel to the Linux kernel AX.25 with "ifconfig"
+
+Unlike the versions below 2.4 this driver is a real network device
+driver. If you want to run xNOS instead of our fine kernel AX.25
+use a 2.x version (available from above sites) or read the
+AX.25-HOWTO on how to emulate a KISS TNC on network device drivers.
+
+
+1.1 Loading the module
+======================
+
+(If you're going to compile the driver as a part of the kernel image,
+ skip this chapter and continue with 1.2)
+
+Before you can use a module, you'll have to load it with::
+
+ insmod scc.o
+
+please read 'man insmod' that comes with module-init-tools.
+
+You should include the insmod in one of the /etc/rc.d/rc.* files,
+and don't forget to insert a call of sccinit after that. It
+will read your /etc/z8530drv.conf.
+
+1.2. /etc/z8530drv.conf
+=======================
+
+To setup all parameters you must run /sbin/sccinit from one
+of your rc.*-files. This has to be done BEFORE you can
+"ifconfig" an interface. Sccinit reads the file /etc/z8530drv.conf
+and sets the hardware, MODEM and KISS parameters. A sample file is
+delivered with this package. Change it to your needs.
+
+The file itself consists of two main sections.
+
+1.2.1 configuration of hardware parameters
+==========================================
+
+The hardware setup section defines the following parameters for each
+Z8530::
+
+ chip 1
+ data_a 0x300 # data port A
+ ctrl_a 0x304 # control port A
+ data_b 0x301 # data port B
+ ctrl_b 0x305 # control port B
+ irq 5 # IRQ No. 5
+ pclock 4915200 # clock
+ board BAYCOM # hardware type
+ escc no # enhanced SCC chip? (8580/85180/85280)
+ vector 0 # latch for interrupt vector
+ special no # address of special function register
+ option 0 # option to set via sfr
+
+
+chip
+ - this is just a delimiter to make sccinit a bit simpler to
+ program. A parameter has no effect.
+
+data_a
+ - the address of the data port A of this Z8530 (needed)
+ctrl_a
+ - the address of the control port A (needed)
+data_b
+ - the address of the data port B (needed)
+ctrl_b
+ - the address of the control port B (needed)
+
+irq
+ - the used IRQ for this chip. Different chips can use different
+ IRQs or the same. If they share an interrupt, it needs to be
+ specified within one chip-definition only.
+
+pclock - the clock at the PCLK pin of the Z8530 (option, 4915200 is
+ default), measured in Hertz
+
+board
+ - the "type" of the board:
+
+ ======================= ========
+ SCC type value
+ ======================= ========
+ PA0HZP SCC card PA0HZP
+ EAGLE card EAGLE
+ PC100 card PC100
+ PRIMUS-PC (DG9BL) card PRIMUS
+ BayCom (U)SCC card BAYCOM
+ ======================= ========
+
+escc
+ - if you want support for ESCC chips (8580, 85180, 85280), set
+ this to "yes" (option, defaults to "no")
+
+vector
+ - address of the vector latch (aka "intack port") for PA0HZP
+ cards. There can be only one vector latch for all chips!
+ (option, defaults to 0)
+
+special
+ - address of the special function register on several cards.
+ (option, defaults to 0)
+
+option - The value you write into that register (option, default is 0)
+
+You can specify up to four chips (8 channels). If this is not enough,
+just change::
+
+ #define MAXSCC 4
+
+to a higher value.
+
+Example for the BAYCOM USCC:
+----------------------------
+
+::
+
+ chip 1
+ data_a 0x300 # data port A
+ ctrl_a 0x304 # control port A
+ data_b 0x301 # data port B
+ ctrl_b 0x305 # control port B
+ irq 5 # IRQ No. 5 (#)
+ board BAYCOM # hardware type (*)
+ #
+ # SCC chip 2
+ #
+ chip 2
+ data_a 0x302
+ ctrl_a 0x306
+ data_b 0x303
+ ctrl_b 0x307
+ board BAYCOM
+
+An example for a PA0HZP card:
+-----------------------------
+
+::
+
+ chip 1
+ data_a 0x153
+ data_b 0x151
+ ctrl_a 0x152
+ ctrl_b 0x150
+ irq 9
+ pclock 4915200
+ board PA0HZP
+ vector 0x168
+ escc no
+ #
+ #
+ #
+ chip 2
+ data_a 0x157
+ data_b 0x155
+ ctrl_a 0x156
+ ctrl_b 0x154
+ irq 9
+ pclock 4915200
+ board PA0HZP
+ vector 0x168
+ escc no
+
+A DRSI would should probably work with this:
+--------------------------------------------
+(actually: two DRSI cards...)
+
+::
+
+ chip 1
+ data_a 0x303
+ data_b 0x301
+ ctrl_a 0x302
+ ctrl_b 0x300
+ irq 7
+ pclock 4915200
+ board DRSI
+ escc no
+ #
+ #
+ #
+ chip 2
+ data_a 0x313
+ data_b 0x311
+ ctrl_a 0x312
+ ctrl_b 0x310
+ irq 7
+ pclock 4915200
+ board DRSI
+ escc no
+
+Note that you cannot use the on-board baudrate generator off DRSI
+cards. Use "mode dpll" for clock source (see below).
+
+This is based on information provided by Mike Bilow (and verified
+by Paul Helay)
+
+The utility "gencfg"
+--------------------
+
+If you only know the parameters for the PE1CHL driver for DOS,
+run gencfg. It will generate the correct port addresses (I hope).
+Its parameters are exactly the same as the ones you use with
+the "attach scc" command in net, except that the string "init" must
+not appear. Example::
+
+ gencfg 2 0x150 4 2 0 1 0x168 9 4915200
+
+will print a skeleton z8530drv.conf for the OptoSCC to stdout.
+
+::
+
+ gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10
+
+does the same for the BAYCOM USCC card. In my opinion it is much easier
+to edit scc_config.h...
+
+
+1.2.2 channel configuration
+===========================
+
+The channel definition is divided into three sub sections for each
+channel:
+
+An example for scc0::
+
+ # DEVICE
+
+ device scc0 # the device for the following params
+
+ # MODEM / BUFFERS
+
+ speed 1200 # the default baudrate
+ clock dpll # clock source:
+ # dpll = normal half duplex operation
+ # external = MODEM provides own Rx/Tx clock
+ # divider = use full duplex divider if
+ # installed (1)
+ mode nrzi # HDLC encoding mode
+ # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM
+ # nrz = DF9IC 9k6 MODEM
+ #
+ bufsize 384 # size of buffers. Note that this must include
+ # the AX.25 header, not only the data field!
+ # (optional, defaults to 384)
+
+ # KISS (Layer 1)
+
+ txdelay 36 # (see chapter 1.4)
+ persist 64
+ slot 8
+ tail 8
+ fulldup 0
+ wait 12
+ min 3
+ maxkey 7
+ idle 3
+ maxdef 120
+ group 0
+ txoff off
+ softdcd on
+ slip off
+
+The order WITHIN these sections is unimportant. The order OF these
+sections IS important. The MODEM parameters are set with the first
+recognized KISS parameter...
+
+Please note that you can initialize the board only once after boot
+(or insmod). You can change all parameters but "mode" and "clock"
+later with the Sccparam program or through KISS. Just to avoid
+security holes...
+
+(1) this divider is usually mounted on the SCC-PBC (PA0HZP) or not
+ present at all (BayCom). It feeds back the output of the DPLL
+ (digital pll) as transmit clock. Using this mode without a divider
+ installed will normally result in keying the transceiver until
+ maxkey expires --- of course without sending anything (useful).
+
+2. Attachment of a channel by your AX.25 software
+=================================================
+
+2.1 Kernel AX.25
+================
+
+To set up an AX.25 device you can simply type::
+
+ ifconfig scc0 44.128.1.1 hw ax25 dl0tha-7
+
+This will create a network interface with the IP number 44.128.20.107
+and the callsign "dl0tha". If you do not have any IP number (yet) you
+can use any of the 44.128.0.0 network. Note that you do not need
+axattach. The purpose of axattach (like slattach) is to create a KISS
+network device linked to a TTY. Please read the documentation of the
+ax25-utils and the AX.25-HOWTO to learn how to set the parameters of
+the kernel AX.25.
+
+2.2 NOS, NET and TFKISS
+=======================
+
+Since the TTY driver (aka KISS TNC emulation) is gone you need
+to emulate the old behaviour. The cost of using these programs is
+that you probably need to compile the kernel AX.25, regardless of whether
+you actually use it or not. First setup your /etc/ax25/axports,
+for example::
+
+ 9k6 dl0tha-9 9600 255 4 9600 baud port (scc3)
+ axlink dl0tha-15 38400 255 4 Link to NOS
+
+Now "ifconfig" the scc device::
+
+ ifconfig scc3 44.128.1.1 hw ax25 dl0tha-9
+
+You can now axattach a pseudo-TTY::
+
+ axattach /dev/ptys0 axlink
+
+and start your NOS and attach /dev/ptys0 there. The problem is that
+NOS is reachable only via digipeating through the kernel AX.25
+(disastrous on a DAMA controlled channel). To solve this problem,
+configure "rxecho" to echo the incoming frames from "9k6" to "axlink"
+and outgoing frames from "axlink" to "9k6" and start::
+
+ rxecho
+
+Or simply use "kissbridge" coming with z8530drv-utils::
+
+ ifconfig scc3 hw ax25 dl0tha-9
+ kissbridge scc3 /dev/ptys0
+
+
+3. Adjustment and Display of parameters
+=======================================
+
+3.1 Displaying SCC Parameters:
+==============================
+
+Once a SCC channel has been attached, the parameter settings and
+some statistic information can be shown using the param program::
+
+ dl1bke-u:~$ sccstat scc0
+
+ Parameters:
+
+ speed : 1200 baud
+ txdelay : 36
+ persist : 255
+ slottime : 0
+ txtail : 8
+ fulldup : 1
+ waittime : 12
+ mintime : 3 sec
+ maxkeyup : 7 sec
+ idletime : 3 sec
+ maxdefer : 120 sec
+ group : 0x00
+ txoff : off
+ softdcd : on
+ SLIP : off
+
+ Status:
+
+ HDLC Z8530 Interrupts Buffers
+ -----------------------------------------------------------------------
+ Sent : 273 RxOver : 0 RxInts : 125074 Size : 384
+ Received : 1095 TxUnder: 0 TxInts : 4684 NoSpace : 0
+ RxErrors : 1591 ExInts : 11776
+ TxErrors : 0 SpInts : 1503
+ Tx State : idle
+
+
+The status info shown is:
+
+============== ==============================================================
+Sent number of frames transmitted
+Received number of frames received
+RxErrors number of receive errors (CRC, ABORT)
+TxErrors number of discarded Tx frames (due to various reasons)
+Tx State status of the Tx interrupt handler: idle/busy/active/tail (2)
+RxOver number of receiver overruns
+TxUnder number of transmitter underruns
+RxInts number of receiver interrupts
+TxInts number of transmitter interrupts
+EpInts number of receiver special condition interrupts
+SpInts number of external/status interrupts
+Size maximum size of an AX.25 frame (*with* AX.25 headers!)
+NoSpace number of times a buffer could not get allocated
+============== ==============================================================
+
+An overrun is abnormal. If lots of these occur, the product of
+baudrate and number of interfaces is too high for the processing
+power of your computer. NoSpace errors are unlikely to be caused by the
+driver or the kernel AX.25.
+
+
+3.2 Setting Parameters
+======================
+
+
+The setting of parameters of the emulated KISS TNC is done in the
+same way in the SCC driver. You can change parameters by using
+the kissparms program from the ax25-utils package or use the program
+"sccparam"::
+
+ sccparam <device> <paramname> <decimal-|hexadecimal value>
+
+You can change the following parameters:
+
+=========== =====
+param value
+=========== =====
+speed 1200
+txdelay 36
+persist 255
+slottime 0
+txtail 8
+fulldup 1
+waittime 12
+mintime 3
+maxkeyup 7
+idletime 3
+maxdefer 120
+group 0x00
+txoff off
+softdcd on
+SLIP off
+=========== =====
+
+
+The parameters have the following meaning:
+
+speed:
+ The baudrate on this channel in bits/sec
+
+ Example: sccparam /dev/scc3 speed 9600
+
+txdelay:
+ The delay (in units of 10 ms) after keying of the
+ transmitter, until the first byte is sent. This is usually
+ called "TXDELAY" in a TNC. When 0 is specified, the driver
+ will just wait until the CTS signal is asserted. This
+ assumes the presence of a timer or other circuitry in the
+ MODEM and/or transmitter, that asserts CTS when the
+ transmitter is ready for data.
+ A normal value of this parameter is 30-36.
+
+ Example: sccparam /dev/scc0 txd 20
+
+persist:
+ This is the probability that the transmitter will be keyed
+ when the channel is found to be free. It is a value from 0
+ to 255, and the probability is (value+1)/256. The value
+ should be somewhere near 50-60, and should be lowered when
+ the channel is used more heavily.
+
+ Example: sccparam /dev/scc2 persist 20
+
+slottime:
+ This is the time between samples of the channel. It is
+ expressed in units of 10 ms. About 200-300 ms (value 20-30)
+ seems to be a good value.
+
+ Example: sccparam /dev/scc0 slot 20
+
+tail:
+ The time the transmitter will remain keyed after the last
+ byte of a packet has been transferred to the SCC. This is
+ necessary because the CRC and a flag still have to leave the
+ SCC before the transmitter is keyed down. The value depends
+ on the baudrate selected. A few character times should be
+ sufficient, e.g. 40ms at 1200 baud. (value 4)
+ The value of this parameter is in 10 ms units.
+
+ Example: sccparam /dev/scc2 4
+
+full:
+ The full-duplex mode switch. This can be one of the following
+ values:
+
+ 0: The interface will operate in CSMA mode (the normal
+ half-duplex packet radio operation)
+ 1: Fullduplex mode, i.e. the transmitter will be keyed at
+ any time, without checking the received carrier. It
+ will be unkeyed when there are no packets to be sent.
+ 2: Like 1, but the transmitter will remain keyed, also
+ when there are no packets to be sent. Flags will be
+ sent in that case, until a timeout (parameter 10)
+ occurs.
+
+ Example: sccparam /dev/scc0 fulldup off
+
+wait:
+ The initial waittime before any transmit attempt, after the
+ frame has been queue for transmit. This is the length of
+ the first slot in CSMA mode. In full duplex modes it is
+ set to 0 for maximum performance.
+ The value of this parameter is in 10 ms units.
+
+ Example: sccparam /dev/scc1 wait 4
+
+maxkey:
+ The maximal time the transmitter will be keyed to send
+ packets, in seconds. This can be useful on busy CSMA
+ channels, to avoid "getting a bad reputation" when you are
+ generating a lot of traffic. After the specified time has
+ elapsed, no new frame will be started. Instead, the trans-
+ mitter will be switched off for a specified time (parameter
+ min), and then the selected algorithm for keyup will be
+ started again.
+ The value 0 as well as "off" will disable this feature,
+ and allow infinite transmission time.
+
+ Example: sccparam /dev/scc0 maxk 20
+
+min:
+ This is the time the transmitter will be switched off when
+ the maximum transmission time is exceeded.
+
+ Example: sccparam /dev/scc3 min 10
+
+idle:
+ This parameter specifies the maximum idle time in full duplex
+ 2 mode, in seconds. When no frames have been sent for this
+ time, the transmitter will be keyed down. A value of 0 is
+ has same result as the fullduplex mode 1. This parameter
+ can be disabled.
+
+ Example: sccparam /dev/scc2 idle off # transmit forever
+
+maxdefer
+ This is the maximum time (in seconds) to wait for a free channel
+ to send. When this timer expires the transmitter will be keyed
+ IMMEDIATELY. If you love to get trouble with other users you
+ should set this to a very low value ;-)
+
+ Example: sccparam /dev/scc0 maxdefer 240 # 2 minutes
+
+
+txoff:
+ When this parameter has the value 0, the transmission of packets
+ is enable. Otherwise it is disabled.
+
+ Example: sccparam /dev/scc2 txoff on
+
+group:
+ It is possible to build special radio equipment to use more than
+ one frequency on the same band, e.g. using several receivers and
+ only one transmitter that can be switched between frequencies.
+ Also, you can connect several radios that are active on the same
+ band. In these cases, it is not possible, or not a good idea, to
+ transmit on more than one frequency. The SCC driver provides a
+ method to lock transmitters on different interfaces, using the
+ "param <interface> group <x>" command. This will only work when
+ you are using CSMA mode (parameter full = 0).
+
+ The number <x> must be 0 if you want no group restrictions, and
+ can be computed as follows to create restricted groups:
+ <x> is the sum of some OCTAL numbers:
+
+
+ === =======================================================
+ 200 This transmitter will only be keyed when all other
+ transmitters in the group are off.
+ 100 This transmitter will only be keyed when the carrier
+ detect of all other interfaces in the group is off.
+ 0xx A byte that can be used to define different groups.
+ Interfaces are in the same group, when the logical AND
+ between their xx values is nonzero.
+ === =======================================================
+
+ Examples:
+
+ When 2 interfaces use group 201, their transmitters will never be
+ keyed at the same time.
+
+ When 2 interfaces use group 101, the transmitters will only key
+ when both channels are clear at the same time. When group 301,
+ the transmitters will not be keyed at the same time.
+
+ Don't forget to convert the octal numbers into decimal before
+ you set the parameter.
+
+ Example: (to be written)
+
+softdcd:
+ use a software dcd instead of the real one... Useful for a very
+ slow squelch.
+
+ Example: sccparam /dev/scc0 soft on
+
+
+4. Problems
+===========
+
+If you have tx-problems with your BayCom USCC card please check
+the manufacturer of the 8530. SGS chips have a slightly
+different timing. Try Zilog... A solution is to write to register 8
+instead to the data port, but this won't work with the ESCC chips.
+*SIGH!*
+
+A very common problem is that the PTT locks until the maxkeyup timer
+expires, although interrupts and clock source are correct. In most
+cases compiling the driver with CONFIG_SCC_DELAY (set with
+make config) solves the problems. For more hints read the (pseudo) FAQ
+and the documentation coming with z8530drv-utils.
+
+I got reports that the driver has problems on some 386-based systems.
+(i.e. Amstrad) Those systems have a bogus AT bus timing which will
+lead to delayed answers on interrupts. You can recognize these
+problems by looking at the output of Sccstat for the suspected
+port. If it shows under- and overruns you own such a system.
+
+Delayed processing of received data: This depends on
+
+- the kernel version
+
+- kernel profiling compiled or not
+
+- a high interrupt load
+
+- a high load of the machine --- running X, Xmorph, XV and Povray,
+ while compiling the kernel... hmm ... even with 32 MB RAM ... ;-)
+ Or running a named for the whole .ampr.org domain on an 8 MB
+ box...
+
+- using information from rxecho or kissbridge.
+
+Kernel panics: please read /linux/README and find out if it
+really occurred within the scc driver.
+
+If you cannot solve a problem, send me
+
+- a description of the problem,
+- information on your hardware (computer system, scc board, modem)
+- your kernel version
+- the output of cat /proc/net/z8530
+
+4. Thor RLC100
+==============
+
+Mysteriously this board seems not to work with the driver. Anyone
+got it up-and-running?
+
+
+Many thanks to Linus Torvalds and Alan Cox for including the driver
+in the Linux standard distribution and their support.
+
+::
+
+ Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org
+ AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU
+ Internet: jreuter@yaina.de
+ WWW : http://yaina.de/jreuter
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst
new file mode 100644
index 0000000000..601eacaf12
--- /dev/null
+++ b/Documentation/networking/device_drivers/index.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Hardware Device Drivers
+=======================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ appletalk/index
+ atm/index
+ cable/index
+ can/index
+ cellular/index
+ ethernet/index
+ fddi/index
+ hamradio/index
+ qlogic/index
+ wifi/index
+ wwan/index
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/qlogic/index.rst b/Documentation/networking/device_drivers/qlogic/index.rst
new file mode 100644
index 0000000000..ad05b04286
--- /dev/null
+++ b/Documentation/networking/device_drivers/qlogic/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+QLogic QLGE Device Drivers
+===============================================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ qlge
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/qlogic/qlge.rst b/Documentation/networking/device_drivers/qlogic/qlge.rst
new file mode 100644
index 0000000000..0b888253d1
--- /dev/null
+++ b/Documentation/networking/device_drivers/qlogic/qlge.rst
@@ -0,0 +1,118 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+QLogic QLGE 10Gb Ethernet device driver
+=======================================
+
+This driver use drgn and devlink for debugging.
+
+Dump kernel data structures in drgn
+-----------------------------------
+
+To dump kernel data structures, the following Python script can be used
+in drgn:
+
+.. code-block:: python
+
+ def align(x, a):
+ """the alignment a should be a power of 2
+ """
+ mask = a - 1
+ return (x+ mask) & ~mask
+
+ def struct_size(struct_type):
+ struct_str = "struct {}".format(struct_type)
+ return sizeof(Object(prog, struct_str, address=0x0))
+
+ def netdev_priv(netdevice):
+ NETDEV_ALIGN = 32
+ return netdevice.value_() + align(struct_size("net_device"), NETDEV_ALIGN)
+
+ name = 'xxx'
+ qlge_device = None
+ netdevices = prog['init_net'].dev_base_head.address_of_()
+ for netdevice in list_for_each_entry("struct net_device", netdevices, "dev_list"):
+ if netdevice.name.string_().decode('ascii') == name:
+ print(netdevice.name)
+
+ ql_adapter = Object(prog, "struct ql_adapter", address=netdev_priv(qlge_device))
+
+The struct ql_adapter will be printed in drgn as follows,
+
+ >>> ql_adapter
+ (struct ql_adapter){
+ .ricb = (struct ricb){
+ .base_cq = (u8)0,
+ .flags = (u8)120,
+ .mask = (__le16)26637,
+ .hash_cq_id = (u8 [1024]){ 172, 142, 255, 255 },
+ .ipv6_hash_key = (__le32 [10]){},
+ .ipv4_hash_key = (__le32 [4]){},
+ },
+ .flags = (unsigned long)0,
+ .wol = (u32)0,
+ .nic_stats = (struct nic_stats){
+ .tx_pkts = (u64)0,
+ .tx_bytes = (u64)0,
+ .tx_mcast_pkts = (u64)0,
+ .tx_bcast_pkts = (u64)0,
+ .tx_ucast_pkts = (u64)0,
+ .tx_ctl_pkts = (u64)0,
+ .tx_pause_pkts = (u64)0,
+ ...
+ },
+ .active_vlans = (unsigned long [64]){
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 52780853100545, 18446744073709551615,
+ 18446619461681283072, 0, 42949673024, 2147483647,
+ },
+ .rx_ring = (struct rx_ring [17]){
+ {
+ .cqicb = (struct cqicb){
+ .msix_vect = (u8)0,
+ .reserved1 = (u8)0,
+ .reserved2 = (u8)0,
+ .flags = (u8)0,
+ .len = (__le16)0,
+ .rid = (__le16)0,
+ ...
+ },
+ .cq_base = (void *)0x0,
+ .cq_base_dma = (dma_addr_t)0,
+ }
+ ...
+ }
+ }
+
+coredump via devlink
+--------------------
+
+
+And the coredump obtained via devlink in json format looks like,
+
+.. code:: shell
+
+ $ devlink health dump show DEVICE reporter coredump -p -j
+ {
+ "Core Registers": {
+ "segment": 1,
+ "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
+ },
+ "Test Logic Regs": {
+ "segment": 2,
+ "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
+ },
+ "RMII Registers": {
+ "segment": 3,
+ "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
+ },
+ ...
+ "Sem Registers": {
+ "segment": 50,
+ "values": [ 0,0,0,0 ]
+ }
+ }
+
+When the module parameter qlge_force_coredump is set to be true, the MPI
+RISC reset before coredumping. So coredumping will much longer since
+devlink tool has to wait for 5 secs for the resetting to be
+finished.
diff --git a/Documentation/networking/device_drivers/wifi/index.rst b/Documentation/networking/device_drivers/wifi/index.rst
new file mode 100644
index 0000000000..bf91a87c7a
--- /dev/null
+++ b/Documentation/networking/device_drivers/wifi/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Wi-Fi Device Drivers
+====================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ intel/ipw2100
+ intel/ipw2200
+ ray_cs
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst b/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst
new file mode 100644
index 0000000000..883e963557
--- /dev/null
+++ b/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst
@@ -0,0 +1,323 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===========================================
+Intel(R) PRO/Wireless 2100 Driver for Linux
+===========================================
+
+Support for:
+
+- Intel(R) PRO/Wireless 2100 Network Connection
+
+Copyright |copy| 2003-2006, Intel Corporation
+
+README.ipw2100
+
+:Version: git-1.1.5
+:Date: January 25, 2006
+
+.. Index
+
+ 0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+ 1. Introduction
+ 2. Release git-1.1.5 Current Features
+ 3. Command Line Parameters
+ 4. Sysfs Helper Files
+ 5. Radio Kill Switch
+ 6. Dynamic Firmware
+ 7. Power Management
+ 8. Support
+ 9. License
+
+
+0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+=================================================
+
+Important Notice FOR ALL USERS OR DISTRIBUTORS!!!!
+
+Intel wireless LAN adapters are engineered, manufactured, tested, and
+quality checked to ensure that they meet all necessary local and
+governmental regulatory agency requirements for the regions that they
+are designated and/or marked to ship into. Since wireless LANs are
+generally unlicensed devices that share spectrum with radars,
+satellites, and other licensed and unlicensed devices, it is sometimes
+necessary to dynamically detect, avoid, and limit usage to avoid
+interference with these devices. In many instances Intel is required to
+provide test data to prove regional and local compliance to regional and
+governmental regulations before certification or approval to use the
+product is granted. Intel's wireless LAN's EEPROM, firmware, and
+software driver are designed to carefully control parameters that affect
+radio operation and to ensure electromagnetic compliance (EMC). These
+parameters include, without limitation, RF power, spectrum usage,
+channel scanning, and human exposure.
+
+For these reasons Intel cannot permit any manipulation by third parties
+of the software provided in binary format with the wireless WLAN
+adapters (e.g., the EEPROM and firmware). Furthermore, if you use any
+patches, utilities, or code with the Intel wireless LAN adapters that
+have been manipulated by an unauthorized party (i.e., patches,
+utilities, or code (including open source code modifications) which have
+not been validated by Intel), (i) you will be solely responsible for
+ensuring the regulatory compliance of the products, (ii) Intel will bear
+no liability, under any theory of liability for any issues associated
+with the modified products, including without limitation, claims under
+the warranty and/or issues arising from regulatory non-compliance, and
+(iii) Intel will not provide or be required to assist in providing
+support to any third parties for such modified products.
+
+Note: Many regulatory agencies consider Wireless LAN adapters to be
+modules, and accordingly, condition system-level regulatory approval
+upon receipt and review of test data documenting that the antennas and
+system configuration do not cause the EMC and radio operation to be
+non-compliant.
+
+The drivers available for download from SourceForge are provided as a
+part of a development project. Conformance to local regulatory
+requirements is the responsibility of the individual developer. As
+such, if you are interested in deploying or shipping a driver as part of
+solution intended to be used for purposes other than development, please
+obtain a tested driver from Intel Customer Support at:
+
+https://www.intel.com/support/wireless/sb/CS-006408.htm
+
+1. Introduction
+===============
+
+This document provides a brief overview of the features supported by the
+IPW2100 driver project. The main project website, where the latest
+development version of the driver can be found, is:
+
+ http://ipw2100.sourceforge.net
+
+There you can find the not only the latest releases, but also information about
+potential fixes and patches, as well as links to the development mailing list
+for the driver project.
+
+
+2. Release git-1.1.5 Current Supported Features
+===============================================
+
+- Managed (BSS) and Ad-Hoc (IBSS)
+- WEP (shared key and open)
+- Wireless Tools support
+- 802.1x (tested with XSupplicant 1.0.1)
+
+Enabled (but not supported) features:
+- Monitor/RFMon mode
+- WPA/WPA2
+
+The distinction between officially supported and enabled is a reflection
+on the amount of validation and interoperability testing that has been
+performed on a given feature.
+
+
+3. Command Line Parameters
+==========================
+
+If the driver is built as a module, the following optional parameters are used
+by entering them on the command line with the modprobe command using this
+syntax::
+
+ modprobe ipw2100 [<option>=<VAL1><,VAL2>...]
+
+For example, to disable the radio on driver loading, enter:
+
+ modprobe ipw2100 disable=1
+
+The ipw2100 driver supports the following module parameters:
+
+========= ============== ============ ==============================
+Name Value Example Meaning
+========= ============== ============ ==============================
+debug 0x0-0xffffffff debug=1024 Debug level set to 1024
+mode 0,1,2 mode=1 AdHoc
+channel int channel=3 Only valid in AdHoc or Monitor
+associate boolean associate=0 Do NOT auto associate
+disable boolean disable=1 Do not power the HW
+========= ============== ============ ==============================
+
+
+4. Sysfs Helper Files
+=====================
+
+There are several ways to control the behavior of the driver. Many of the
+general capabilities are exposed through the Wireless Tools (iwconfig). There
+are a few capabilities that are exposed through entries in the Linux Sysfs.
+
+
+**Driver Level**
+
+For the driver level files, look in /sys/bus/pci/drivers/ipw2100/
+
+ debug_level
+ This controls the same global as the 'debug' module parameter. For
+ information on the various debugging levels available, run the 'dvals'
+ script found in the driver source directory.
+
+ .. note::
+
+ 'debug_level' is only enabled if CONFIG_IPW2100_DEBUG is turn on.
+
+**Device Level**
+
+For the device level files look in::
+
+ /sys/bus/pci/drivers/ipw2100/{PCI-ID}/
+
+For example::
+
+ /sys/bus/pci/drivers/ipw2100/0000:02:01.0
+
+For the device level files, see /sys/bus/pci/drivers/ipw2100:
+
+ rf_kill
+ read
+
+ == =========================================
+ 0 RF kill not enabled (radio on)
+ 1 SW based RF kill active (radio off)
+ 2 HW based RF kill active (radio off)
+ 3 Both HW and SW RF kill active (radio off)
+ == =========================================
+
+ write
+
+ == ==================================================
+ 0 If SW based RF kill active, turn the radio back on
+ 1 If radio is on, activate SW based RF kill
+ == ==================================================
+
+ .. note::
+
+ If you enable the SW based RF kill and then toggle the HW
+ based RF kill from ON -> OFF -> ON, the radio will NOT come back on
+
+
+5. Radio Kill Switch
+====================
+
+Most laptops provide the ability for the user to physically disable the radio.
+Some vendors have implemented this as a physical switch that requires no
+software to turn the radio off and on. On other laptops, however, the switch
+is controlled through a button being pressed and a software driver then making
+calls to turn the radio off and on. This is referred to as a "software based
+RF kill switch"
+
+See the Sysfs helper file 'rf_kill' for determining the state of the RF switch
+on your system.
+
+
+6. Dynamic Firmware
+===================
+
+As the firmware is licensed under a restricted use license, it can not be
+included within the kernel sources. To enable the IPW2100 you will need a
+firmware image to load into the wireless NIC's processors.
+
+You can obtain these images from <http://ipw2100.sf.net/firmware.php>.
+
+See INSTALL for instructions on installing the firmware.
+
+
+7. Power Management
+===================
+
+The IPW2100 supports the configuration of the Power Save Protocol
+through a private wireless extension interface. The IPW2100 supports
+the following different modes:
+
+ === ===========================================================
+ off No power management. Radio is always on.
+ on Automatic power management
+ 1-5 Different levels of power management. The higher the
+ number the greater the power savings, but with an impact to
+ packet latencies.
+ === ===========================================================
+
+Power management works by powering down the radio after a certain
+interval of time has passed where no packets are passed through the
+radio. Once powered down, the radio remains in that state for a given
+period of time. For higher power savings, the interval between last
+packet processed to sleep is shorter and the sleep period is longer.
+
+When the radio is asleep, the access point sending data to the station
+must buffer packets at the AP until the station wakes up and requests
+any buffered packets. If you have an AP that does not correctly support
+the PSP protocol you may experience packet loss or very poor performance
+while power management is enabled. If this is the case, you will need
+to try and find a firmware update for your AP, or disable power
+management (via ``iwconfig eth1 power off``)
+
+To configure the power level on the IPW2100 you use a combination of
+iwconfig and iwpriv. iwconfig is used to turn power management on, off,
+and set it to auto.
+
+ ========================= ====================================
+ iwconfig eth1 power off Disables radio power down
+ iwconfig eth1 power on Enables radio power management to
+ last set level (defaults to AUTO)
+ iwpriv eth1 set_power 0 Sets power level to AUTO and enables
+ power management if not previously
+ enabled.
+ iwpriv eth1 set_power 1-5 Set the power level as specified,
+ enabling power management if not
+ previously enabled.
+ ========================= ====================================
+
+You can view the current power level setting via::
+
+ iwpriv eth1 get_power
+
+It will return the current period or timeout that is configured as a string
+in the form of xxxx/yyyy (z) where xxxx is the timeout interval (amount of
+time after packet processing), yyyy is the period to sleep (amount of time to
+wait before powering the radio and querying the access point for buffered
+packets), and z is the 'power level'. If power management is turned off the
+xxxx/yyyy will be replaced with 'off' -- the level reported will be the active
+level if `iwconfig eth1 power on` is invoked.
+
+
+8. Support
+==========
+
+For general development information and support,
+go to:
+
+ http://ipw2100.sf.net/
+
+The ipw2100 1.1.0 driver and firmware can be downloaded from:
+
+ http://support.intel.com
+
+For installation support on the ipw2100 1.1.0 driver on Linux kernels
+2.6.8 or greater, email support is available from:
+
+ http://supportmail.intel.com
+
+9. License
+==========
+
+ Copyright |copy| 2003 - 2006 Intel Corporation. All rights reserved.
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License (version 2) as
+ published by the Free Software Foundation.
+
+ This program is distributed in the hope that it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc., 59
+ Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+
+ The full GNU General Public License is included in this distribution in the
+ file called LICENSE.
+
+ License Contact Information:
+
+ James P. Ketrenos <ipw2100-admin@linux.intel.com>
+
+ Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
diff --git a/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst b/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst
new file mode 100644
index 0000000000..0cb42d2fd7
--- /dev/null
+++ b/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst
@@ -0,0 +1,526 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==============================================
+Intel(R) PRO/Wireless 2915ABG Driver for Linux
+==============================================
+
+
+Support for:
+
+- Intel(R) PRO/Wireless 2200BG Network Connection
+- Intel(R) PRO/Wireless 2915ABG Network Connection
+
+Note: The Intel(R) PRO/Wireless 2915ABG Driver for Linux and Intel(R)
+PRO/Wireless 2200BG Driver for Linux is a unified driver that works on
+both hardware adapters listed above. In this document the Intel(R)
+PRO/Wireless 2915ABG Driver for Linux will be used to reference the
+unified driver.
+
+Copyright |copy| 2004-2006, Intel Corporation
+
+README.ipw2200
+
+:Version: 1.1.2
+:Date: March 30, 2006
+
+
+.. Index
+
+ 0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+ 1. Introduction
+ 1.1. Overview of features
+ 1.2. Module parameters
+ 1.3. Wireless Extension Private Methods
+ 1.4. Sysfs Helper Files
+ 1.5. Supported channels
+ 2. Ad-Hoc Networking
+ 3. Interacting with Wireless Tools
+ 3.1. iwconfig mode
+ 3.2. iwconfig sens
+ 4. About the Version Numbers
+ 5. Firmware installation
+ 6. Support
+ 7. License
+
+
+0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+=================================================
+
+Important Notice FOR ALL USERS OR DISTRIBUTORS!!!!
+
+Intel wireless LAN adapters are engineered, manufactured, tested, and
+quality checked to ensure that they meet all necessary local and
+governmental regulatory agency requirements for the regions that they
+are designated and/or marked to ship into. Since wireless LANs are
+generally unlicensed devices that share spectrum with radars,
+satellites, and other licensed and unlicensed devices, it is sometimes
+necessary to dynamically detect, avoid, and limit usage to avoid
+interference with these devices. In many instances Intel is required to
+provide test data to prove regional and local compliance to regional and
+governmental regulations before certification or approval to use the
+product is granted. Intel's wireless LAN's EEPROM, firmware, and
+software driver are designed to carefully control parameters that affect
+radio operation and to ensure electromagnetic compliance (EMC). These
+parameters include, without limitation, RF power, spectrum usage,
+channel scanning, and human exposure.
+
+For these reasons Intel cannot permit any manipulation by third parties
+of the software provided in binary format with the wireless WLAN
+adapters (e.g., the EEPROM and firmware). Furthermore, if you use any
+patches, utilities, or code with the Intel wireless LAN adapters that
+have been manipulated by an unauthorized party (i.e., patches,
+utilities, or code (including open source code modifications) which have
+not been validated by Intel), (i) you will be solely responsible for
+ensuring the regulatory compliance of the products, (ii) Intel will bear
+no liability, under any theory of liability for any issues associated
+with the modified products, including without limitation, claims under
+the warranty and/or issues arising from regulatory non-compliance, and
+(iii) Intel will not provide or be required to assist in providing
+support to any third parties for such modified products.
+
+Note: Many regulatory agencies consider Wireless LAN adapters to be
+modules, and accordingly, condition system-level regulatory approval
+upon receipt and review of test data documenting that the antennas and
+system configuration do not cause the EMC and radio operation to be
+non-compliant.
+
+The drivers available for download from SourceForge are provided as a
+part of a development project. Conformance to local regulatory
+requirements is the responsibility of the individual developer. As
+such, if you are interested in deploying or shipping a driver as part of
+solution intended to be used for purposes other than development, please
+obtain a tested driver from Intel Customer Support at:
+
+http://support.intel.com
+
+
+1. Introduction
+===============
+
+The following sections attempt to provide a brief introduction to using
+the Intel(R) PRO/Wireless 2915ABG Driver for Linux.
+
+This document is not meant to be a comprehensive manual on
+understanding or using wireless technologies, but should be sufficient
+to get you moving without wires on Linux.
+
+For information on building and installing the driver, see the INSTALL
+file.
+
+
+1.1. Overview of Features
+-------------------------
+The current release (1.1.2) supports the following features:
+
++ BSS mode (Infrastructure, Managed)
++ IBSS mode (Ad-Hoc)
++ WEP (OPEN and SHARED KEY mode)
++ 802.1x EAP via wpa_supplicant and xsupplicant
++ Wireless Extension support
++ Full B and G rate support (2200 and 2915)
++ Full A rate support (2915 only)
++ Transmit power control
++ S state support (ACPI suspend/resume)
+
+The following features are currently enabled, but not officially
+supported:
+
++ WPA
++ long/short preamble support
++ Monitor mode (aka RFMon)
+
+The distinction between officially supported and enabled is a reflection
+on the amount of validation and interoperability testing that has been
+performed on a given feature.
+
+
+
+1.2. Command Line Parameters
+----------------------------
+
+Like many modules used in the Linux kernel, the Intel(R) PRO/Wireless
+2915ABG Driver for Linux allows configuration options to be provided
+as module parameters. The most common way to specify a module parameter
+is via the command line.
+
+The general form is::
+
+ % modprobe ipw2200 parameter=value
+
+Where the supported parameter are:
+
+ associate
+ Set to 0 to disable the auto scan-and-associate functionality of the
+ driver. If disabled, the driver will not attempt to scan
+ for and associate to a network until it has been configured with
+ one or more properties for the target network, for example configuring
+ the network SSID. Default is 0 (do not auto-associate)
+
+ Example: % modprobe ipw2200 associate=0
+
+ auto_create
+ Set to 0 to disable the auto creation of an Ad-Hoc network
+ matching the channel and network name parameters provided.
+ Default is 1.
+
+ channel
+ channel number for association. The normal method for setting
+ the channel would be to use the standard wireless tools
+ (i.e. `iwconfig eth1 channel 10`), but it is useful sometimes
+ to set this while debugging. Channel 0 means 'ANY'
+
+ debug
+ If using a debug build, this is used to control the amount of debug
+ info is logged. See the 'dvals' and 'load' script for more info on
+ how to use this (the dvals and load scripts are provided as part
+ of the ipw2200 development snapshot releases available from the
+ SourceForge project at http://ipw2200.sf.net)
+
+ led
+ Can be used to turn on experimental LED code.
+ 0 = Off, 1 = On. Default is 1.
+
+ mode
+ Can be used to set the default mode of the adapter.
+ 0 = Managed, 1 = Ad-Hoc, 2 = Monitor
+
+
+1.3. Wireless Extension Private Methods
+---------------------------------------
+
+As an interface designed to handle generic hardware, there are certain
+capabilities not exposed through the normal Wireless Tool interface. As
+such, a provision is provided for a driver to declare custom, or
+private, methods. The Intel(R) PRO/Wireless 2915ABG Driver for Linux
+defines several of these to configure various settings.
+
+The general form of using the private wireless methods is::
+
+ % iwpriv $IFNAME method parameters
+
+Where $IFNAME is the interface name the device is registered with
+(typically eth1, customized via one of the various network interface
+name managers, such as ifrename)
+
+The supported private methods are:
+
+ get_mode
+ Can be used to report out which IEEE mode the driver is
+ configured to support. Example:
+
+ % iwpriv eth1 get_mode
+ eth1 get_mode:802.11bg (6)
+
+ set_mode
+ Can be used to configure which IEEE mode the driver will
+ support.
+
+ Usage::
+
+ % iwpriv eth1 set_mode {mode}
+
+ Where {mode} is a number in the range 1-7:
+
+ == =====================
+ 1 802.11a (2915 only)
+ 2 802.11b
+ 3 802.11ab (2915 only)
+ 4 802.11g
+ 5 802.11ag (2915 only)
+ 6 802.11bg
+ 7 802.11abg (2915 only)
+ == =====================
+
+ get_preamble
+ Can be used to report configuration of preamble length.
+
+ set_preamble
+ Can be used to set the configuration of preamble length:
+
+ Usage::
+
+ % iwpriv eth1 set_preamble {mode}
+
+ Where {mode} is one of:
+
+ == ========================================
+ 1 Long preamble only
+ 0 Auto (long or short based on connection)
+ == ========================================
+
+
+1.4. Sysfs Helper Files
+-----------------------
+
+The Linux kernel provides a pseudo file system that can be used to
+access various components of the operating system. The Intel(R)
+PRO/Wireless 2915ABG Driver for Linux exposes several configuration
+parameters through this mechanism.
+
+An entry in the sysfs can support reading and/or writing. You can
+typically query the contents of a sysfs entry through the use of cat,
+and can set the contents via echo. For example::
+
+ % cat /sys/bus/pci/drivers/ipw2200/debug_level
+
+Will report the current debug level of the driver's logging subsystem
+(only available if CONFIG_IPW2200_DEBUG was configured when the driver
+was built).
+
+You can set the debug level via::
+
+ % echo $VALUE > /sys/bus/pci/drivers/ipw2200/debug_level
+
+Where $VALUE would be a number in the case of this sysfs entry. The
+input to sysfs files does not have to be a number. For example, the
+firmware loader used by hotplug utilizes sysfs entries for transferring
+the firmware image from user space into the driver.
+
+The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries
+at two levels -- driver level, which apply to all instances of the driver
+(in the event that there are more than one device installed) and device
+level, which applies only to the single specific instance.
+
+
+1.4.1 Driver Level Sysfs Helper Files
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For the driver level files, look in /sys/bus/pci/drivers/ipw2200/
+
+ debug_level
+ This controls the same global as the 'debug' module parameter
+
+
+
+1.4.2 Device Level Sysfs Helper Files
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For the device level files, look in::
+
+ /sys/bus/pci/drivers/ipw2200/{PCI-ID}/
+
+For example:::
+
+ /sys/bus/pci/drivers/ipw2200/0000:02:01.0
+
+For the device level files, see /sys/bus/pci/drivers/ipw2200:
+
+ rf_kill
+ read -
+
+ == =========================================
+ 0 RF kill not enabled (radio on)
+ 1 SW based RF kill active (radio off)
+ 2 HW based RF kill active (radio off)
+ 3 Both HW and SW RF kill active (radio off)
+ == =========================================
+
+ write -
+
+ == ==================================================
+ 0 If SW based RF kill active, turn the radio back on
+ 1 If radio is on, activate SW based RF kill
+ == ==================================================
+
+ .. note::
+
+ If you enable the SW based RF kill and then toggle the HW
+ based RF kill from ON -> OFF -> ON, the radio will NOT come back on
+
+ ucode
+ read-only access to the ucode version number
+
+ led
+ read -
+
+ == =================
+ 0 LED code disabled
+ 1 LED code enabled
+ == =================
+
+ write -
+
+ == ================
+ 0 Disable LED code
+ 1 Enable LED code
+ == ================
+
+
+ .. note::
+
+ The LED code has been reported to hang some systems when
+ running ifconfig and is therefore disabled by default.
+
+
+1.5. Supported channels
+-----------------------
+
+Upon loading the Intel(R) PRO/Wireless 2915ABG Driver for Linux, a
+message stating the detected geography code and the number of 802.11
+channels supported by the card will be displayed in the log.
+
+The geography code corresponds to a regulatory domain as shown in the
+table below.
+
+ +------+----------------------------+--------------------+
+ | | | Supported channels |
+ | Code | Geography +----------+---------+
+ | | | 802.11bg | 802.11a |
+ +======+============================+==========+=========+
+ | --- | Restricted | 11 | 0 |
+ +------+----------------------------+----------+---------+
+ | ZZF | Custom US/Canada | 11 | 8 |
+ +------+----------------------------+----------+---------+
+ | ZZD | Rest of World | 13 | 0 |
+ +------+----------------------------+----------+---------+
+ | ZZA | Custom USA & Europe & High | 11 | 13 |
+ +------+----------------------------+----------+---------+
+ | ZZB | Custom NA & Europe | 11 | 13 |
+ +------+----------------------------+----------+---------+
+ | ZZC | Custom Japan | 11 | 4 |
+ +------+----------------------------+----------+---------+
+ | ZZM | Custom | 11 | 0 |
+ +------+----------------------------+----------+---------+
+ | ZZE | Europe | 13 | 19 |
+ +------+----------------------------+----------+---------+
+ | ZZJ | Custom Japan | 14 | 4 |
+ +------+----------------------------+----------+---------+
+ | ZZR | Rest of World | 14 | 0 |
+ +------+----------------------------+----------+---------+
+ | ZZH | High Band | 13 | 4 |
+ +------+----------------------------+----------+---------+
+ | ZZG | Custom Europe | 13 | 4 |
+ +------+----------------------------+----------+---------+
+ | ZZK | Europe | 13 | 24 |
+ +------+----------------------------+----------+---------+
+ | ZZL | Europe | 11 | 13 |
+ +------+----------------------------+----------+---------+
+
+2. Ad-Hoc Networking
+=====================
+
+When using a device in an Ad-Hoc network, it is useful to understand the
+sequence and requirements for the driver to be able to create, join, or
+merge networks.
+
+The following attempts to provide enough information so that you can
+have a consistent experience while using the driver as a member of an
+Ad-Hoc network.
+
+2.1. Joining an Ad-Hoc Network
+------------------------------
+
+The easiest way to get onto an Ad-Hoc network is to join one that
+already exists.
+
+2.2. Creating an Ad-Hoc Network
+-------------------------------
+
+An Ad-Hoc networks is created using the syntax of the Wireless tool.
+
+For Example:
+iwconfig eth1 mode ad-hoc essid testing channel 2
+
+2.3. Merging Ad-Hoc Networks
+----------------------------
+
+
+3. Interaction with Wireless Tools
+==================================
+
+3.1 iwconfig mode
+-----------------
+
+When configuring the mode of the adapter, all run-time configured parameters
+are reset to the value used when the module was loaded. This includes
+channels, rates, ESSID, etc.
+
+3.2 iwconfig sens
+-----------------
+
+The 'iwconfig ethX sens XX' command will not set the signal sensitivity
+threshold, as described in iwconfig documentation, but rather the number
+of consecutive missed beacons that will trigger handover, i.e. roaming
+to another access point. At the same time, it will set the disassociation
+threshold to 3 times the given value.
+
+
+4. About the Version Numbers
+=============================
+
+Due to the nature of open source development projects, there are
+frequently changes being incorporated that have not gone through
+a complete validation process. These changes are incorporated into
+development snapshot releases.
+
+Releases are numbered with a three level scheme:
+
+ major.minor.development
+
+Any version where the 'development' portion is 0 (for example
+1.0.0, 1.1.0, etc.) indicates a stable version that will be made
+available for kernel inclusion.
+
+Any version where the 'development' portion is not a 0 (for
+example 1.0.1, 1.1.5, etc.) indicates a development version that is
+being made available for testing and cutting edge users. The stability
+and functionality of the development releases are not know. We make
+efforts to try and keep all snapshots reasonably stable, but due to the
+frequency of their release, and the desire to get those releases
+available as quickly as possible, unknown anomalies should be expected.
+
+The major version number will be incremented when significant changes
+are made to the driver. Currently, there are no major changes planned.
+
+5. Firmware installation
+========================
+
+The driver requires a firmware image, download it and extract the
+files under /lib/firmware (or wherever your hotplug's firmware.agent
+will look for firmware files)
+
+The firmware can be downloaded from the following URL:
+
+ http://ipw2200.sf.net/
+
+
+6. Support
+==========
+
+For direct support of the 1.0.0 version, you can contact
+http://supportmail.intel.com, or you can use the open source project
+support.
+
+For general information and support, go to:
+
+ http://ipw2200.sf.net/
+
+
+7. License
+==========
+
+ Copyright |copy| 2003 - 2006 Intel Corporation. All rights reserved.
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License version 2 as
+ published by the Free Software Foundation.
+
+ This program is distributed in the hope that it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc., 59
+ Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+
+ The full GNU General Public License is included in this distribution in the
+ file called LICENSE.
+
+ Contact Information:
+
+ James P. Ketrenos <ipw2100-admin@linux.intel.com>
+
+ Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
diff --git a/Documentation/networking/device_drivers/wifi/ray_cs.rst b/Documentation/networking/device_drivers/wifi/ray_cs.rst
new file mode 100644
index 0000000000..9a46d1ae8f
--- /dev/null
+++ b/Documentation/networking/device_drivers/wifi/ray_cs.rst
@@ -0,0 +1,165 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+=========================
+Raylink wireless LAN card
+=========================
+
+September 21, 1999
+
+Copyright |copy| 1998 Corey Thomas (corey@world.std.com)
+
+This file is the documentation for the Raylink Wireless LAN card driver for
+Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE
+802.11 compatible wireless network connectivity at 1 and 2 megabits/second.
+See http://www.raytheon.com/micro/raylink/ for more information on the Raylink
+card. This driver is in early development and does have bugs. See the known
+bugs and limitations at the end of this document for more information.
+This driver also works with WebGear's Aviator 2.4 and Aviator Pro
+wireless LAN cards.
+
+As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel
+source. My web page for the development of ray_cs is at
+http://web.ralinktech.com/ralink/Home/Support/Linux.html
+and I can be emailed at corey@world.std.com
+
+The kernel driver is based on ray_cs-1.62.tgz
+
+The driver at my web page is intended to be used as an add on to
+David Hinds pcmcia package. All the command line parameters are
+available when compiled as a module. When built into the kernel, only
+the essid= string parameter is available via the kernel command line.
+This will change after the method of sorting out parameters for all
+the PCMCIA drivers is agreed upon. If you must have a built in driver
+with nondefault parameters, they can be edited in
+/usr/src/linux/drivers/net/pcmcia/ray_cs.c. Searching for module_param
+will find them all.
+
+Information on card services is available at:
+
+ http://pcmcia-cs.sourceforge.net/
+
+
+Card services user programs are still required for PCMCIA devices.
+pcmcia-cs-3.1.1 or greater is required for the kernel version of
+the driver.
+
+Currently, ray_cs is not part of David Hinds card services package,
+so the following magic is required.
+
+At the end of the /etc/pcmcia/config.opts file, add the line:
+source ./ray_cs.opts
+This will make card services read the ray_cs.opts file
+when starting. Create the file /etc/pcmcia/ray_cs.opts containing the
+following::
+
+ #### start of /etc/pcmcia/ray_cs.opts ###################
+ # Configuration options for Raylink Wireless LAN PCMCIA card
+ device "ray_cs"
+ class "network" module "misc/ray_cs"
+
+ card "RayLink PC Card WLAN Adapter"
+ manfid 0x01a6, 0x0000
+ bind "ray_cs"
+
+ module "misc/ray_cs" opts ""
+ #### end of /etc/pcmcia/ray_cs.opts #####################
+
+
+To join an existing network with
+different parameters, contact the network administrator for the
+configuration information, and edit /etc/pcmcia/ray_cs.opts.
+Add the parameters below between the empty quotes.
+
+Parameters for ray_cs driver which may be specified in ray_cs.opts:
+
+=============== =============== =============================================
+bc integer 0 = normal mode (802.11 timing),
+ 1 = slow down inter frame timing to allow
+ operation with older breezecom access
+ points.
+
+beacon_period integer beacon period in Kilo-microseconds,
+
+ legal values = must be integer multiple
+ of hop dwell
+
+ default = 256
+
+country integer 1 = USA (default),
+ 2 = Europe,
+ 3 = Japan,
+ 4 = Korea,
+ 5 = Spain,
+ 6 = France,
+ 7 = Israel,
+ 8 = Australia
+
+essid string ESS ID - network name to join
+
+ string with maximum length of 32 chars
+ default value = "ADHOC_ESSID"
+
+hop_dwell integer hop dwell time in Kilo-microseconds
+
+ legal values = 16,32,64,128(default),256
+
+irq_mask integer linux standard 16 bit value 1bit/IRQ
+
+ lsb is IRQ 0, bit 1 is IRQ 1 etc.
+ Used to restrict choice of IRQ's to use.
+ Recommended method for controlling
+ interrupts is in /etc/pcmcia/config.opts
+
+net_type integer 0 (default) = adhoc network,
+ 1 = infrastructure
+
+phy_addr string string containing new MAC address in
+ hex, must start with x eg
+ x00008f123456
+
+psm integer 0 = continuously active,
+ 1 = power save mode (not useful yet)
+
+pc_debug integer (0-5) larger values for more verbose
+ logging. Replaces ray_debug.
+
+ray_debug integer Replaced with pc_debug
+
+ray_mem_speed integer defaults to 500
+
+sniffer integer 0 = not sniffer (default),
+ 1 = sniffer which can be used to record all
+ network traffic using tcpdump or similar,
+ but no normal network use is allowed.
+
+translate integer 0 = no translation (encapsulate frames),
+ 1 = translation (RFC1042/802.1)
+=============== =============== =============================================
+
+More on sniffer mode:
+
+tcpdump does not understand 802.11 headers, so it can't
+interpret the contents, but it can record to a file. This is only
+useful for debugging 802.11 lowlevel protocols that are not visible to
+linux. If you want to watch ftp xfers, or do similar things, you
+don't need to use sniffer mode. Also, some packet types are never
+sent up by the card, so you will never see them (ack, rts, cts, probe
+etc.) There is a simple program (showcap) included in the ray_cs
+package which parses the 802.11 headers.
+
+Known Problems and missing features
+
+ Does not work with non x86
+
+ Does not work with SMP
+
+ Support for defragmenting frames is not yet debugged, and in
+ fact is known to not work. I have never encountered a net set
+ up to fragment, but still, it should be fixed.
+
+ The ioctl support is incomplete. The hardware address cannot be set
+ using ifconfig yet. If a different hardware address is needed, it may
+ be set using the phy_addr parameter in ray_cs.opts. This requires
+ a card insertion to take effect.
diff --git a/Documentation/networking/device_drivers/wwan/index.rst b/Documentation/networking/device_drivers/wwan/index.rst
new file mode 100644
index 0000000000..370d8264d5
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+WWAN Device Drivers
+===================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ iosm
+ t7xx
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/wwan/iosm.rst b/Documentation/networking/device_drivers/wwan/iosm.rst
new file mode 100644
index 0000000000..6f9e955af9
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/iosm.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _iosm_driver_doc:
+
+===========================================
+IOSM Driver for Intel M.2 PCIe based Modems
+===========================================
+The IOSM (IPC over Shared Memory) driver is a WWAN PCIe host driver developed
+for linux or chrome platform for data exchange over PCIe interface between
+Host platform & Intel M.2 Modem. The driver exposes interface conforming to the
+MBIM protocol [1]. Any front end application ( eg: Modem Manager) could easily
+manage the MBIM interface to enable data communication towards WWAN.
+
+Basic usage
+===========
+MBIM functions are inactive when unmanaged. The IOSM driver only provides a
+userspace interface MBIM "WWAN PORT" representing MBIM control channel and does
+not play any role in managing the functionality. It is the job of a userspace
+application to detect port enumeration and enable MBIM functionality.
+
+Examples of few such userspace application are:
+- mbimcli (included with the libmbim [2] library), and
+- Modem Manager [3]
+
+Management Applications to carry out below required actions for establishing
+MBIM IP session:
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing
+MBIM WWAN Port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Application shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per MBIM specification.
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The IOSM driver exposes IP link interface "wwan0-X" of type "wwan" for IP
+traffic. Iproute network utility is used for creating "wwan0-X" network
+interface and for associating it with MBIM IP session. The Driver supports
+up to 8 IP sessions for simultaneous IP communication.
+
+The userspace management application is responsible for creating new IP link
+prior to establishing MBIM IP session where the SessionId is greater than 0.
+
+For example, creating new IP link for a MBIM IP session with SessionId 1:
+
+ ip link add dev wwan0-1 parentdev-name wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
+
+References
+==========
+[1] "MBIM (Mobile Broadband Interface Model) Errata-1"
+ - https://www.usb.org/document-library/
+
+[2] libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+ - http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] Modem Manager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+ - http://www.freedesktop.org/wiki/Software/ModemManager/
diff --git a/Documentation/networking/device_drivers/wwan/t7xx.rst b/Documentation/networking/device_drivers/wwan/t7xx.rst
new file mode 100644
index 0000000000..dd5b731957
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/t7xx.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _t7xx_driver_doc:
+
+============================================
+t7xx driver for MTK PCIe based T700 5G modem
+============================================
+The t7xx driver is a WWAN PCIe host driver developed for linux or Chrome OS platforms
+for data exchange over PCIe interface between Host platform & MediaTek's T700 5G modem.
+The driver exposes an interface conforming to the MBIM protocol [1]. Any front end
+application (e.g. Modem Manager) could easily manage the MBIM interface to enable
+data communication towards WWAN. The driver also provides an interface to interact
+with the MediaTek's modem via AT commands.
+
+Basic usage
+===========
+MBIM & AT functions are inactive when unmanaged. The t7xx driver provides
+WWAN port userspace interfaces representing MBIM & AT control channels and does
+not play any role in managing their functionality. It is the job of a userspace
+application to detect port enumeration and enable MBIM & AT functionalities.
+
+Examples of few such userspace applications are:
+
+- mbimcli (included with the libmbim [2] library), and
+- Modem Manager [3]
+
+Management Applications to carry out below required actions for establishing
+MBIM IP session:
+
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+Management Applications to carry out below required actions for send an AT
+command and receive response:
+
+- open the AT control channel using a UART tool or a special user tool
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing
+MBIM WWAN Port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Application shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per MBIM specification.
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The t7xx driver exposes IP link interface "wwan0-X" of type "wwan" for IP
+traffic. Iproute network utility is used for creating "wwan0-X" network
+interface and for associating it with MBIM IP session.
+
+The userspace management application is responsible for creating new IP link
+prior to establishing MBIM IP session where the SessionId is greater than 0.
+
+For example, creating new IP link for a MBIM IP session with SessionId 1:
+
+ ip link add dev wwan0-1 parentdev wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
+
+AT port userspace ABI
+----------------------------------
+
+/dev/wwan0at0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an AT port by implementing AT WWAN Port.
+The userspace end of the control port is a /dev/wwan0at0 character
+device. Application shall use this interface to issue AT commands.
+
+The MediaTek's T700 modem supports the 3GPP TS 27.007 [4] specification.
+
+References
+==========
+[1] *MBIM (Mobile Broadband Interface Model) Errata-1*
+
+- https://www.usb.org/document-library/
+
+[2] *libmbim "a glib-based library for talking to WWAN modems and devices which
+speak the Mobile Interface Broadband Model (MBIM) protocol"*
+
+- http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] *Modem Manager "a DBus-activated daemon which controls mobile broadband
+(2G/3G/4G/5G) devices and connections"*
+
+- http://www.freedesktop.org/wiki/Software/ModemManager/
+
+[4] *Specification # 27.007 - 3GPP*
+
+- https://www.3gpp.org/DynaReport/27007.htm
diff --git a/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
new file mode 100644
index 0000000000..1e589c26ab
--- /dev/null
+++ b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+am65-cpsw-nuss devlink support
+==============================
+
+This document describes the devlink features implemented by the ``am65-cpsw-nuss``
+device driver.
+
+Parameters
+==========
+
+The ``am65-cpsw-nuss`` driver implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``switch_mode``
+ - Boolean
+ - runtime
+ - Enable switch mode
diff --git a/Documentation/networking/devlink/bnxt.rst b/Documentation/networking/devlink/bnxt.rst
new file mode 100644
index 0000000000..a4fb27663c
--- /dev/null
+++ b/Documentation/networking/devlink/bnxt.rst
@@ -0,0 +1,82 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+bnxt devlink support
+====================
+
+This document describes the devlink features implemented by the ``bnxt``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``enable_sriov``
+ - Permanent
+ * - ``ignore_ari``
+ - Permanent
+ * - ``msix_vec_per_pf_max``
+ - Permanent
+ * - ``msix_vec_per_pf_min``
+ - Permanent
+ * - ``enable_remote_dev_reset``
+ - Runtime
+
+The ``bnxt`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``gre_ver_check``
+ - Boolean
+ - Permanent
+ - Generic Routing Encapsulation (GRE) version check will be enabled in
+ the device. If disabled, the device will skip the version check for
+ incoming packets.
+
+Info versions
+=============
+
+The ``bnxt_en`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``board.id``
+ - fixed
+ - Part number identifying the board design
+ * - ``asic.id``
+ - fixed
+ - ASIC design identifier
+ * - ``asic.rev``
+ - fixed
+ - ASIC design revision
+ * - ``fw.psid``
+ - stored, running
+ - Firmware parameter set version of the board
+ * - ``fw``
+ - stored, running
+ - Overall board firmware version
+ * - ``fw.mgmt``
+ - stored, running
+ - NIC hardware resource management firmware version
+ * - ``fw.mgmt.api``
+ - running
+ - Minimum firmware interface spec version supported between driver and firmware
+ * - ``fw.nsci``
+ - stored, running
+ - General platform management firmware version
+ * - ``fw.roce``
+ - stored, running
+ - RoCE management firmware version
diff --git a/Documentation/networking/devlink/devlink-dpipe.rst b/Documentation/networking/devlink/devlink-dpipe.rst
new file mode 100644
index 0000000000..af37f250df
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-dpipe.rst
@@ -0,0 +1,252 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Devlink DPIPE
+=============
+
+Background
+==========
+
+While performing the hardware offloading process, much of the hardware
+specifics cannot be presented. These details are useful for debugging, and
+``devlink-dpipe`` provides a standardized way to provide visibility into the
+offloading process.
+
+For example, the routing longest prefix match (LPM) algorithm used by the
+Linux kernel may differ from the hardware implementation. The pipeline debug
+API (DPIPE) is aimed at providing the user visibility into the ASIC's
+pipeline in a generic way.
+
+The hardware offload process is expected to be done in a way that the user
+should not be able to distinguish between the hardware vs. software
+implementation. In this process, hardware specifics are neglected. In
+reality those details can have lots of meaning and should be exposed in some
+standard way.
+
+This problem is made even more complex when one wishes to offload the
+control path of the whole networking stack to a switch ASIC. Due to
+differences in the hardware and software models some processes cannot be
+represented correctly.
+
+One example is the kernel's LPM algorithm which in many cases differs
+greatly to the hardware implementation. The configuration API is the same,
+but one cannot rely on the Forward Information Base (FIB) to look like the
+Level Path Compression trie (LPC-trie) in hardware.
+
+In many situations trying to analyze systems failure solely based on the
+kernel's dump may not be enough. By combining this data with complementary
+information about the underlying hardware, this debugging can be made
+easier; additionally, the information can be useful when debugging
+performance issues.
+
+Overview
+========
+
+The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
+modeled as a graph of match/action tables. Each table represents a specific
+hardware block. This model is not new, first being used by the P4 language.
+
+Traditionally it has been used as an alternative model for hardware
+configuration, but the ``devlink-dpipe`` interface uses it for visibility
+purposes as a standard complementary tool. The system's view from
+``devlink-dpipe`` should change according to the changes done by the
+standard configuration tools.
+
+For example, it’s quite common to implement Access Control Lists (ACL)
+using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
+divided into TCAM regions. Complex TC filters can have multiple rules with
+different priorities and different lookup keys. On the other hand hardware
+TCAM regions have a predefined lookup key. Offloading the TC filter rules
+using TCAM engine can result in multiple TCAM regions being interconnected
+in a chain (which may affect the data path latency). In response to a new TC
+filter new tables should be created describing those regions.
+
+Model
+=====
+
+The ``DPIPE`` model introduces several objects:
+
+ * headers
+ * tables
+ * entries
+
+A ``header`` describes packet formats and provides names for fields within
+the packet. A ``table`` describes hardware blocks. An ``entry`` describes
+the actual content of a specific table.
+
+The hardware pipeline is not port specific, but rather describes the whole
+ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.
+
+Drivers can register and unregister tables at run time, in order to support
+dynamic behavior. This dynamic behavior is mandatory for describing hardware
+blocks like TCAM regions which can be allocated and freed dynamically.
+
+``devlink-dpipe`` generally is not intended for configuration. The exception
+is hardware counting for a specific table.
+
+The following commands are used to obtain the ``dpipe`` objects from
+userspace:
+
+ * ``table_get``: Receive a table's description.
+ * ``headers_get``: Receive a device's supported headers.
+ * ``entries_get``: Receive a table's current entries.
+ * ``counters_set``: Enable or disable counters on a table.
+
+Table
+-----
+
+The driver should implement the following operations for each table:
+
+ * ``matches_dump``: Dump the supported matches.
+ * ``actions_dump``: Dump the supported actions.
+ * ``entries_dump``: Dump the actual content of the table.
+ * ``counters_set_update``: Synchronize hardware with counters enabled or
+ disabled.
+
+Header/Field
+------------
+
+In a similar way to P4 headers and fields are used to describe a table's
+behavior. There is a slight difference between the standard protocol headers
+and specific ASIC metadata. The protocol headers should be declared in the
+``devlink`` core API. On the other hand ASIC meta data is driver specific
+and should be defined in the driver. Additionally, each driver-specific
+devlink documentation file should document the driver-specific ``dpipe``
+headers it implements. The headers and fields are identified by enumeration.
+
+In order to provide further visibility some ASIC metadata fields could be
+mapped to kernel objects. For example, internal router interface indexes can
+be directly mapped to the net device ifindex. FIB table indexes used by
+different Virtual Routing and Forwarding (VRF) tables can be mapped to
+internal routing table indexes.
+
+Match
+-----
+
+Matches are kept primitive and close to hardware operation. Match types like
+LPM are not supported due to the fact that this is exactly a process we wish
+to describe in full detail. Example of matches:
+
+ * ``field_exact``: Exact match on a specific field.
+ * ``field_exact_mask``: Exact match on a specific field after masking.
+ * ``field_range``: Match on a specific range.
+
+The id's of the header and the field should be specified in order to
+identify the specific field. Furthermore, the header index should be
+specified in order to distinguish multiple headers of the same type in a
+packet (tunneling).
+
+Action
+------
+
+Similar to match, the actions are kept primitive and close to hardware
+operation. For example:
+
+ * ``field_modify``: Modify the field value.
+ * ``field_inc``: Increment the field value.
+ * ``push_header``: Add a header.
+ * ``pop_header``: Remove a header.
+
+Entry
+-----
+
+Entries of a specific table can be dumped on demand. Each eentry is
+identified with an index and its properties are described by a list of
+match/action values and specific counter. By dumping the tables content the
+interactions between tables can be resolved.
+
+Abstraction Example
+===================
+
+The following is an example of the abstraction model of the L3 part of
+Mellanox Spectrum ASIC. The blocks are described in the order they appear in
+the pipeline. The table sizes in the following examples are not real
+hardware sizes and are provided for demonstration purposes.
+
+LPM
+---
+
+The LPM algorithm can be implemented as a list of hash tables. Each hash
+table contains routes with the same prefix length. The root of the list is
+/32, and in case of a miss the hardware will continue to the next hash
+table. The depth of the search will affect the data path latency.
+
+In case of a hit the entry contains information about the next stage of the
+pipeline which resolves the MAC address. The next stage can be either local
+host table for directly connected routes, or adjacency table for next-hops.
+The ``meta.lpm_prefix`` field is used to connect two LPM tables.
+
+.. code::
+
+ table lpm_prefix_16 {
+ size: 4096,
+ counters_enabled: true,
+ match: { meta.vr_id: exact,
+ ipv4.dst_addr: exact_mask,
+ ipv6.dst_addr: exact_mask,
+ meta.lpm_prefix: exact },
+ action: { meta.adj_index: set,
+ meta.adj_group_size: set,
+ meta.rif_port: set,
+ meta.lpm_prefix: set },
+ }
+
+Local Host
+----------
+
+In the case of local routes the LPM lookup already resolves the egress
+router interface (RIF), yet the exact MAC address is not known. The local
+host table is a hash table combining the output interface id with
+destination IP address as a key. The result is the MAC address.
+
+.. code::
+
+ table local_host {
+ size: 4096,
+ counters_enabled: true,
+ match: { meta.rif_port: exact,
+ ipv4.dst_addr: exact},
+ action: { ethernet.daddr: set }
+ }
+
+Adjacency
+---------
+
+In case of remote routes this table does the ECMP. The LPM lookup results in
+ECMP group size and index that serves as a global offset into this table.
+Concurrently a hash of the packet is generated. Based on the ECMP group size
+and the packet's hash a local offset is generated. Multiple LPM entries can
+point to the same adjacency group.
+
+.. code::
+
+ table adjacency {
+ size: 4096,
+ counters_enabled: true,
+ match: { meta.adj_index: exact,
+ meta.adj_group_size: exact,
+ meta.packet_hash_index: exact },
+ action: { ethernet.daddr: set,
+ meta.erif: set }
+ }
+
+ERIF
+----
+
+In case the egress RIF and destination MAC have been resolved by previous
+tables this table does multiple operations like TTL decrease and MTU check.
+Then the decision of forward/drop is taken and the port L3 statistics are
+updated based on the packet's type (broadcast, unicast, multicast).
+
+.. code::
+
+ table erif {
+ size: 800,
+ counters_enabled: true,
+ match: { meta.rif_port: exact,
+ meta.is_l3_unicast: exact,
+ meta.is_l3_broadcast: exact,
+ meta.is_l3_multicast, exact },
+ action: { meta.l3_drop: set,
+ meta.l3_forward: set }
+ }
diff --git a/Documentation/networking/devlink/devlink-flash.rst b/Documentation/networking/devlink/devlink-flash.rst
new file mode 100644
index 0000000000..603e732f00
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-flash.rst
@@ -0,0 +1,121 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+.. _devlink_flash:
+
+=============
+Devlink Flash
+=============
+
+The ``devlink-flash`` API allows updating device firmware. It replaces the
+older ``ethtool-flash`` mechanism, and doesn't require taking any
+networking locks in the kernel to perform the flash update. Example use::
+
+ $ devlink dev flash pci/0000:05:00.0 file flash-boot.bin
+
+Note that the file name is a path relative to the firmware loading path
+(usually ``/lib/firmware/``). Drivers may send status updates to inform
+user space about the progress of the update operation.
+
+Overwrite Mask
+==============
+
+The ``devlink-flash`` command allows optionally specifying a mask indicating
+how the device should handle subsections of flash components when updating.
+This mask indicates the set of sections which are allowed to be overwritten.
+
+.. list-table:: List of overwrite mask bits
+ :widths: 5 95
+
+ * - Name
+ - Description
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Indicates that the device should overwrite settings in the components
+ being updated with the settings found in the provided image.
+ * - ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Indicates that the device should overwrite identifiers in the
+ components being updated with the identifiers found in the provided
+ image. This includes MAC addresses, serial IDs, and similar device
+ identifiers.
+
+Multiple overwrite bits may be combined and requested together. If no bits
+are provided, it is expected that the device only update firmware binaries
+in the components being updated. Settings and identifiers are expected to be
+preserved across the update. A device may not support every combination and
+the driver for such a device must reject any combination which cannot be
+faithfully implemented.
+
+Firmware Loading
+================
+
+Devices which require firmware to operate usually store it in non-volatile
+memory on the board, e.g. flash. Some devices store only basic firmware on
+the board, and the driver loads the rest from disk during probing.
+``devlink-info`` allows users to query firmware information (loaded
+components and versions).
+
+In other cases the device can both store the image on the board, load from
+disk, or automatically flash a new image from disk. The ``fw_load_policy``
+devlink parameter can be used to control this behavior
+(:ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`).
+
+On-disk firmware files are usually stored in ``/lib/firmware/``.
+
+Firmware Version Management
+===========================
+
+Drivers are expected to implement ``devlink-flash`` and ``devlink-info``
+functionality, which together allow for implementing vendor-independent
+automated firmware update facilities.
+
+``devlink-info`` exposes the ``driver`` name and three version groups
+(``fixed``, ``running``, ``stored``).
+
+The ``driver`` attribute and ``fixed`` group identify the specific device
+design, e.g. for looking up applicable firmware updates. This is why
+``serial_number`` is not part of the ``fixed`` versions (even though it
+is fixed) - ``fixed`` versions should identify the design, not a single
+device.
+
+``running`` and ``stored`` firmware versions identify the firmware running
+on the device, and firmware which will be activated after reboot or device
+reset.
+
+The firmware update agent is supposed to be able to follow this simple
+algorithm to update firmware contents, regardless of the device vendor:
+
+.. code-block:: sh
+
+ # Get unique HW design identifier
+ $hw_id = devlink-dev-info['fixed']
+
+ # Find out which FW flash we want to use for this NIC
+ $want_flash_vers = some-db-backed.lookup($hw_id, 'flash')
+
+ # Update flash if necessary
+ if $want_flash_vers != devlink-dev-info['stored']:
+ $file = some-db-backed.download($hw_id, 'flash')
+ devlink-dev-flash($file)
+
+ # Find out the expected overall firmware versions
+ $want_fw_vers = some-db-backed.lookup($hw_id, 'all')
+
+ # Update on-disk file if necessary
+ if $want_fw_vers != devlink-dev-info['running']:
+ $file = some-db-backed.download($hw_id, 'disk')
+ write($file, '/lib/firmware/')
+
+ # Try device reset, if available
+ if $want_fw_vers != devlink-dev-info['running']:
+ devlink-reset()
+
+ # Reboot, if reset wasn't enough
+ if $want_fw_vers != devlink-dev-info['running']:
+ reboot()
+
+Note that each reference to ``devlink-dev-info`` in this pseudo-code
+is expected to fetch up-to-date information from the kernel.
+
+For the convenience of identifying firmware files some vendors add
+``bundle_id`` information to the firmware versions. This meta-version covers
+multiple per-component versions and can be used e.g. in firmware file names
+(all component versions could get rather long.)
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
new file mode 100644
index 0000000000..e0b8cfed61
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -0,0 +1,138 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Health
+==============
+
+Background
+==========
+
+The ``devlink`` health mechanism is targeted for Real Time Alerting, in
+order to know when something bad happened to a PCI device.
+
+ * Provide alert debug information.
+ * Self healing.
+ * If problem needs vendor support, provide a way to gather all needed
+ debugging information.
+
+Overview
+========
+
+The main idea is to unify and centralize driver health reports in the
+generic ``devlink`` instance and allow the user to set different
+attributes of the health reporting and recovery procedures.
+
+The ``devlink`` health reporter:
+Device driver creates a "health reporter" per each error/health type.
+Error/Health type can be a known/generic (e.g. PCI error, fw error, rx/tx error)
+or unknown (driver specific).
+For each registered health reporter a driver can issue error/health reports
+asynchronously. All health reports handling is done by ``devlink``.
+Device driver can provide specific callbacks for each "health reporter", e.g.:
+
+ * Recovery procedures
+ * Diagnostics procedures
+ * Object dump procedures
+ * Out Of Box initial parameters
+
+Different parts of the driver can register different types of health reporters
+with different handlers.
+
+Actions
+=======
+
+Once an error is reported, devlink health will perform the following actions:
+
+ * A log is being send to the kernel trace events buffer
+ * Health status and statistics are being updated for the reporter instance
+ * Object dump is being taken and saved at the reporter instance (as long as
+ auto-dump is set and there is no other dump which is already stored)
+ * Auto recovery attempt is being done. Depends on:
+
+ - Auto-recovery configuration
+ - Grace period vs. time passed since last recover
+
+Devlink formatted message
+=========================
+
+To handle devlink health diagnose and health dump requests, devlink creates a
+formatted message structure ``devlink_fmsg`` and send it to the driver's callback
+to fill the data in using the devlink fmsg API.
+
+Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
+json-like format. The API allows the driver to add nested attributes such as
+object, object pair and value array, in addition to attributes such as name and
+value.
+
+Driver should use this API to fill the fmsg context in a format which will be
+translated by the devlink to the netlink message later. When it needs to send
+the data using SKBs to the netlink layer, it fragments the data between
+different SKBs. In order to do this fragmentation, it uses virtual nests
+attributes, to avoid actual nesting use which cannot be divided between
+different SKBs.
+
+User Interface
+==============
+
+User can access/change each reporter's parameters and driver specific callbacks
+via ``devlink``, e.g per error type (per health reporter):
+
+ * Configure reporter's generic parameters (like: disable/enable auto recovery)
+ * Invoke recovery procedure
+ * Run diagnostics
+ * Object dump
+
+.. list-table:: List of devlink health interfaces
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
+ - Retrieves status and configuration info per DEV and reporter.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
+ - Allows reporter-related configuration setting.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
+ - Triggers reporter's recovery procedure.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
+ - Triggers a fake health event on the reporter. The effects of the test
+ event in terms of recovery flow should follow closely that of a real
+ event.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
+ - Retrieves current device state related to the reporter.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
+ - Retrieves the last stored dump. Devlink health
+ saves a single dump. If an dump is not already stored by devlink
+ for this reporter, devlink generates a new dump.
+ Dump output is defined by the reporter.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
+ - Clears the last saved dump file for the specified reporter.
+
+The following diagram provides a general overview of ``devlink-health``::
+
+ netlink
+ +--------------------------+
+ | |
+ | + |
+ | | |
+ +--------------------------+
+ |request for ops
+ |(diagnose,
+ driver devlink |recover,
+ |dump)
+ +--------+ +--------------------------+
+ | | | reporter| |
+ | | | +---------v----------+ |
+ | | ops execution | | | |
+ | <----------------------------------+ | |
+ | | | | | |
+ | | | + ^------------------+ |
+ | | | | request for ops |
+ | | | | (recover, dump) |
+ | | | | |
+ | | | +-+------------------+ |
+ | | health report | | health handler | |
+ | +-------------------------------> | |
+ | | | +--------------------+ |
+ | | health reporter create | |
+ | +----------------------------> |
+ +--------+ +--------------------------+
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
new file mode 100644
index 0000000000..1242b0e682
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -0,0 +1,215 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+============
+Devlink Info
+============
+
+The ``devlink-info`` mechanism enables device drivers to report device
+(hardware and firmware) information in a standard, extensible fashion.
+
+The original motivation for the ``devlink-info`` API was twofold:
+
+ - making it possible to automate device and firmware management in a fleet
+ of machines in a vendor-independent fashion (see also
+ :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`);
+ - name the per component FW versions (as opposed to the crowded ethtool
+ version string).
+
+``devlink-info`` supports reporting multiple types of objects. Reporting driver
+versions is generally discouraged - here, and via any other Linux API.
+
+.. list-table:: List of top level info objects
+ :widths: 5 95
+
+ * - Name
+ - Description
+ * - ``driver``
+ - Name of the currently used device driver, also available through sysfs.
+
+ * - ``serial_number``
+ - Serial number of the device.
+
+ This is usually the serial number of the ASIC, also often available
+ in PCI config space of the device in the *Device Serial Number*
+ capability.
+
+ The serial number should be unique per physical device.
+ Sometimes the serial number of the device is only 48 bits long (the
+ length of the Ethernet MAC address), and since PCI DSN is 64 bits long
+ devices pad or encode additional information into the serial number.
+ One example is adding port ID or PCI interface ID in the extra two bytes.
+ Drivers should make sure to strip or normalize any such padding
+ or interface ID, and report only the part of the serial number
+ which uniquely identifies the hardware. In other words serial number
+ reported for two ports of the same device or on two hosts of
+ a multi-host device should be identical.
+
+ * - ``board.serial_number``
+ - Board serial number of the device.
+
+ This is usually the serial number of the board, often available in
+ PCI *Vital Product Data*.
+
+ * - ``fixed``
+ - Group for hardware identifiers, and versions of components
+ which are not field-updatable.
+
+ Versions in this section identify the device design. For example,
+ component identifiers or the board version reported in the PCI VPD.
+ Data in ``devlink-info`` should be broken into the smallest logical
+ components, e.g. PCI VPD may concatenate various information
+ to form the Part Number string, while in ``devlink-info`` all parts
+ should be reported as separate items.
+
+ This group must not contain any frequently changing identifiers,
+ such as serial numbers. See
+ :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`
+ to understand why.
+
+ * - ``running``
+ - Group for information about currently running software/firmware.
+ These versions often only update after a reboot, sometimes device reset.
+
+ * - ``stored``
+ - Group for software/firmware versions in device flash.
+
+ Stored values must update to reflect changes in the flash even
+ if reboot has not yet occurred. If device is not capable of updating
+ ``stored`` versions when new software is flashed, it must not report
+ them.
+
+Each version can be reported at most once in each version group. Firmware
+components stored on the flash should feature in both the ``running`` and
+``stored`` sections, if device is capable of reporting ``stored`` versions
+(see :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`).
+In case software/firmware components are loaded from the disk (e.g.
+``/lib/firmware``) only the running version should be reported via
+the kernel API.
+
+Generic Versions
+================
+
+It is expected that drivers use the following generic names for exporting
+version information. If a generic name for a given component doesn't exist yet,
+driver authors should consult existing driver-specific versions and attempt
+reuse. As last resort, if a component is truly unique, using driver-specific
+names is allowed, but these should be documented in the driver-specific file.
+
+All versions should try to use the following terminology:
+
+.. list-table:: List of common version suffixes
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``id``, ``revision``
+ - Identifiers of designs and revision, mostly used for hardware versions.
+
+ * - ``api``
+ - Version of API between components. API items are usually of limited
+ value to the user, and can be inferred from other versions by the vendor,
+ so adding API versions is generally discouraged as noise.
+
+ * - ``bundle_id``
+ - Identifier of a distribution package which was flashed onto the device.
+ This is an attribute of a firmware package which covers multiple versions
+ for ease of managing firmware images (see
+ :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`).
+
+ ``bundle_id`` can appear in both ``running`` and ``stored`` versions,
+ but it must not be reported if any of the components covered by the
+ ``bundle_id`` was changed and no longer matches the version from
+ the bundle.
+
+board.id
+--------
+
+Unique identifier of the board design.
+
+board.rev
+---------
+
+Board design revision.
+
+asic.id
+-------
+
+ASIC design identifier.
+
+asic.rev
+--------
+
+ASIC design revision/stepping.
+
+board.manufacture
+-----------------
+
+An identifier of the company or the facility which produced the part.
+
+fw
+--
+
+Overall firmware version, often representing the collection of
+fw.mgmt, fw.app, etc.
+
+fw.mgmt
+-------
+
+Control unit firmware version. This firmware is responsible for house
+keeping tasks, PHY control etc. but not the packet-by-packet data path
+operation.
+
+fw.mgmt.api
+-----------
+
+Firmware interface specification version of the software interfaces between
+driver and firmware.
+
+fw.app
+------
+
+Data path microcode controlling high-speed packet processing.
+
+fw.undi
+-------
+
+UNDI software, may include the UEFI driver, firmware or both.
+
+fw.ncsi
+-------
+
+Version of the software responsible for supporting/handling the
+Network Controller Sideband Interface.
+
+fw.psid
+-------
+
+Unique identifier of the firmware parameter set. These are usually
+parameters of a particular board, defined at manufacturing time.
+
+fw.roce
+-------
+
+RoCE firmware version which is responsible for handling roce
+management.
+
+fw.bundle_id
+------------
+
+Unique identifier of the entire firmware bundle.
+
+fw.bootloader
+-------------
+
+Version of the bootloader.
+
+Future work
+===========
+
+The following extensions could be useful:
+
+ - on-disk firmware file names - drivers list the file names of firmware they
+ may need to load onto devices via the ``MODULE_FIRMWARE()`` macro. These,
+ however, are per module, rather than per device. It'd be useful to list
+ the names of firmware files the driver will try to load for a given device,
+ in order of priority.
diff --git a/Documentation/networking/devlink/devlink-linecard.rst b/Documentation/networking/devlink/devlink-linecard.rst
new file mode 100644
index 0000000000..6c0b8928bc
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-linecard.rst
@@ -0,0 +1,122 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Devlink Line card
+=================
+
+Background
+==========
+
+The ``devlink-linecard`` mechanism is targeted for manipulation of
+line cards that serve as a detachable PHY modules for modular switch
+system. Following operations are provided:
+
+ * Get a list of supported line card types.
+ * Provision of a slot with specific line card type.
+ * Get and monitor of line card state and its change.
+
+Line card according to the type may contain one or more gearboxes
+to mux the lanes with certain speed to multiple ports with lanes
+of different speed. Line card ensures N:M mapping between
+the switch ASIC modules and physical front panel ports.
+
+Overview
+========
+
+Each line card devlink object is created by device driver,
+according to the physical line card slots available on the device.
+
+Similar to splitter cable, where the device might have no way
+of detection of the splitter cable geometry, the device
+might not have a way to detect line card type. For that devices,
+concept of provisioning is introduced. It allows the user to:
+
+ * Provision a line card slot with certain line card type
+
+ - Device driver would instruct the ASIC to prepare all
+ resources accordingly. The device driver would
+ create all instances, namely devlink port and netdevices
+ that reside on the line card, according to the line card type
+ * Manipulate of line card entities even without line card
+ being physically connected or powered-up
+ * Setup splitter cable on line card ports
+
+ - As on the ordinary ports, user may provision a splitter
+ cable of a certain type, without the need to
+ be physically connected to the port
+ * Configure devlink ports and netdevices
+
+Netdevice carrier is decided as follows:
+
+ * Line card is not inserted or powered-down
+
+ - The carrier is always down
+ * Line card is inserted and powered up
+
+ - The carrier is decided as for ordinary port netdevice
+
+Line card state
+===============
+
+The ``devlink-linecard`` mechanism supports the following line card states:
+
+ * ``unprovisioned``: Line card is not provisioned on the slot.
+ * ``unprovisioning``: Line card slot is currently being unprovisioned.
+ * ``provisioning``: Line card slot is currently in a process of being provisioned
+ with a line card type.
+ * ``provisioning_failed``: Provisioning was not successful.
+ * ``provisioned``: Line card slot is provisioned with a type.
+ * ``active``: Line card is powered-up and active.
+
+The following diagram provides a general overview of ``devlink-linecard``
+state transitions::
+
+ +-------------------------+
+ | |
+ +----------------------------------> unprovisioned |
+ | | |
+ | +--------|-------^--------+
+ | | |
+ | | |
+ | +--------v-------|--------+
+ | | |
+ | | provisioning |
+ | | |
+ | +------------|------------+
+ | |
+ | +-----------------------------+
+ | | |
+ | +------------v------------+ +------------v------------+ +-------------------------+
+ | | | | ----> |
+ +----- provisioning_failed | | provisioned | | active |
+ | | | | <---- |
+ | +------------^------------+ +------------|------------+ +-------------------------+
+ | | |
+ | | |
+ | | +------------v------------+
+ | | | |
+ | | | unprovisioning |
+ | | | |
+ | | +------------|------------+
+ | | |
+ | +-----------------------------+
+ | |
+ +-----------------------------------------------+
+
+
+Example usage
+=============
+
+.. code:: shell
+
+ $ devlink lc show [ DEV [ lc LC_INDEX ] ]
+ $ devlink lc set DEV lc LC_INDEX [ { type LC_TYPE | notype } ]
+
+ # Show current line card configuration and status for all slots:
+ $ devlink lc
+
+ # Set slot 8 to be provisioned with type "16x100G":
+ $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
+
+ # Set slot 8 to be unprovisioned:
+ $ devlink lc set pci/0000:01:00.0 lc 8 notype
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
new file mode 100644
index 0000000000..4e01dc32bc
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Params
+==============
+
+``devlink`` provides capability for a driver to expose device parameters for low
+level device functionality. Since devlink can operate at the device-wide
+level, it can be used to provide configuration that may affect multiple
+ports on a single device.
+
+This document describes a number of generic parameters that are supported
+across multiple drivers. Each driver is also free to add their own
+parameters. Each driver must document the specific parameters they support,
+whether generic or not.
+
+Configuration modes
+===================
+
+Parameters may be set in different configuration modes.
+
+.. list-table:: Possible configuration modes
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``runtime``
+ - set while the driver is running, and takes effect immediately. No
+ reset is required.
+ * - ``driverinit``
+ - applied while the driver initializes. Requires the user to restart
+ the driver using the ``devlink`` reload command.
+ * - ``permanent``
+ - written to the device's non-volatile memory. A hard reset is required
+ for it to take effect.
+
+Reloading
+---------
+
+In order for ``driverinit`` parameters to take effect, the driver must
+support reloading via the ``devlink-reload`` command. This command will
+request a reload of the device driver.
+
+.. _devlink_params_generic:
+
+Generic configuration parameters
+================================
+The following is a list of generic configuration parameters that drivers may
+add. Use of generic parameters is preferred over each driver creating their
+own name.
+
+.. list-table:: List of generic parameters
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``enable_sriov``
+ - Boolean
+ - Enable Single Root I/O Virtualization (SRIOV) in the device.
+ * - ``ignore_ari``
+ - Boolean
+ - Ignore Alternative Routing-ID Interpretation (ARI) capability. If
+ enabled, the adapter will ignore ARI capability even when the
+ platform has support enabled. The device will create the same number
+ of partitions as when the platform does not support ARI.
+ * - ``msix_vec_per_pf_max``
+ - u32
+ - Provides the maximum number of MSI-X interrupts that a device can
+ create. Value is the same across all physical functions (PFs) in the
+ device.
+ * - ``msix_vec_per_pf_min``
+ - u32
+ - Provides the minimum number of MSI-X interrupts required for the
+ device to initialize. Value is the same across all physical functions
+ (PFs) in the device.
+ * - ``fw_load_policy``
+ - u8
+ - Control the device's firmware loading policy.
+ - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER`` (0)
+ Load firmware version preferred by the driver.
+ - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH`` (1)
+ Load firmware currently stored in flash.
+ - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK`` (2)
+ Load firmware currently available on host's disk.
+ * - ``reset_dev_on_drv_probe``
+ - u8
+ - Controls the device's reset policy on driver probe.
+ - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_UNKNOWN`` (0)
+ Unknown or invalid value.
+ - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS`` (1)
+ Always reset device on driver probe.
+ - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_NEVER`` (2)
+ Never reset device on driver probe.
+ - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK`` (3)
+ Reset the device only if firmware can be found in the filesystem.
+ * - ``enable_roce``
+ - Boolean
+ - Enable handling of RoCE traffic in the device.
+ * - ``enable_eth``
+ - Boolean
+ - When enabled, the device driver will instantiate Ethernet specific
+ auxiliary device of the devlink device.
+ * - ``enable_rdma``
+ - Boolean
+ - When enabled, the device driver will instantiate RDMA specific
+ auxiliary device of the devlink device.
+ * - ``enable_vnet``
+ - Boolean
+ - When enabled, the device driver will instantiate VDPA networking
+ specific auxiliary device of the devlink device.
+ * - ``enable_iwarp``
+ - Boolean
+ - Enable handling of iWARP traffic in the device.
+ * - ``internal_err_reset``
+ - Boolean
+ - When enabled, the device driver will reset the device on internal
+ errors.
+ * - ``max_macs``
+ - u32
+ - Typically macvlan, vlan net devices mac are also programmed in their
+ parent netdevice's Function rx filter. This parameter limit the
+ maximum number of unicast mac address filters to receive traffic from
+ per ethernet port of this device.
+ * - ``region_snapshot_enable``
+ - Boolean
+ - Enable capture of ``devlink-region`` snapshots.
+ * - ``enable_remote_dev_reset``
+ - Boolean
+ - Enable device reset by remote host. When cleared, the device driver
+ will NACK any attempt of other host to reset the device. This parameter
+ is useful for setups where a device is shared by different hosts, such
+ as multi-host setup.
+ * - ``io_eq_size``
+ - u32
+ - Control the size of I/O completion EQs.
+ * - ``event_eq_size``
+ - u32
+ - Control the size of asynchronous control events EQ.
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
new file mode 100644
index 0000000000..e33ad2401a
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -0,0 +1,443 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _devlink_port:
+
+============
+Devlink Port
+============
+
+``devlink-port`` is a port that exists on the device. It has a logically
+separate ingress/egress point of the device. A devlink port can be any one
+of many flavours. A devlink port flavour along with port attributes
+describe what a port represents.
+
+A device driver that intends to publish a devlink port sets the
+devlink port attributes and registers the devlink port.
+
+Devlink port flavours are described below.
+
+.. list-table:: List of devlink port flavours
+ :widths: 33 90
+
+ * - Flavour
+ - Description
+ * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
+ - Any kind of physical port. This can be an eswitch physical port or any
+ other physical port on the device.
+ * - ``DEVLINK_PORT_FLAVOUR_DSA``
+ - This indicates a DSA interconnect port.
+ * - ``DEVLINK_PORT_FLAVOUR_CPU``
+ - This indicates a CPU port applicable only to DSA.
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
+ - This indicates an eswitch port representing a port of PCI
+ physical function (PF).
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
+ - This indicates an eswitch port representing a port of PCI
+ virtual function (VF).
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
+ - This indicates an eswitch port representing a port of PCI
+ subfunction (SF).
+ * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
+ - This indicates a virtual port for the PCI virtual function.
+
+Devlink port can have a different type based on the link layer described below.
+
+.. list-table:: List of devlink port types
+ :widths: 23 90
+
+ * - Type
+ - Description
+ * - ``DEVLINK_PORT_TYPE_ETH``
+ - Driver should set this port type when a link layer of the port is
+ Ethernet.
+ * - ``DEVLINK_PORT_TYPE_IB``
+ - Driver should set this port type when a link layer of the port is
+ InfiniBand.
+ * - ``DEVLINK_PORT_TYPE_AUTO``
+ - This type is indicated by the user when driver should detect the port
+ type automatically.
+
+PCI controllers
+---------------
+In most cases a PCI device has only one controller. A controller consists of
+potentially multiple physical, virtual functions and subfunctions. A function
+consists of one or more ports. This port is represented by the devlink eswitch
+port.
+
+A PCI device connected to multiple CPUs or multiple PCI root complexes or a
+SmartNIC, however, may have multiple controllers. For a device with multiple
+controllers, each controller is distinguished by a unique controller number.
+An eswitch is on the PCI device which supports ports of multiple controllers.
+
+An example view of a system with two controllers::
+
+ ---------------------------------------------------------
+ | |
+ | --------- --------- ------- ------- |
+ ----------- | | vf(s) | | sf(s) | |vf(s)| |sf(s)| |
+ | server | | ------- ----/---- ---/----- ------- ---/--- ---/--- |
+ | pci rc |=== | pf0 |______/________/ | pf1 |___/_______/ |
+ | connect | | ------- ------- |
+ ----------- | | controller_num=1 (no eswitch) |
+ ------|--------------------------------------------------
+ (internal wire)
+ |
+ ---------------------------------------------------------
+ | devlink eswitch ports and reps |
+ | ----------------------------------------------------- |
+ | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
+ | |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | |
+ | ----------------------------------------------------- |
+ | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
+ | |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | |
+ | ----------------------------------------------------- |
+ | |
+ | |
+ ----------- | --------- --------- ------- ------- |
+ | smartNIC| | | vf(s) | | sf(s) | |vf(s)| |sf(s)| |
+ | pci rc |==| ------- ----/---- ---/----- ------- ---/--- ---/--- |
+ | connect | | | pf0 |______/________/ | pf1 |___/_______/ |
+ ----------- | ------- ------- |
+ | |
+ | local controller_num=0 (eswitch) |
+ ---------------------------------------------------------
+
+In the above example, the external controller (identified by controller number = 1)
+doesn't have the eswitch. Local controller (identified by controller number = 0)
+has the eswitch. The Devlink instance on the local controller has eswitch
+devlink ports for both the controllers.
+
+Function configuration
+======================
+
+Users can configure one or more function attributes before enumerating the PCI
+function. Usually it means, user should configure function attribute
+before a bus specific device for the function is created. However, when
+SRIOV is enabled, virtual function devices are created on the PCI bus.
+Hence, function attribute should be configured before binding virtual
+function device to the driver. For subfunctions, this means user should
+configure port function attribute before activating the port function.
+
+A user may set the hardware address of the function using
+`devlink port function set hw_addr` command. For Ethernet port function
+this means a MAC address.
+
+Users may also set the RoCE capability of the function using
+`devlink port function set roce` command.
+
+Users may also set the function as migratable using
+'devlink port function set migratable' command.
+
+Users may also set the IPsec crypto capability of the function using
+`devlink port function set ipsec_crypto` command.
+
+Users may also set the IPsec packet capability of the function using
+`devlink port function set ipsec_packet` command.
+
+Function attributes
+===================
+
+MAC address setup
+-----------------
+The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
+device created for the PCI VF/SF.
+
+- Get the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:11:22:33:44:55
+
+- Get the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:88:88
+
+RoCE capability setup
+---------------------
+Not all PCI VFs/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves system memory per PCI VF/SF.
+
+When user disables RoCE capability for a VF/SF, user application cannot send or
+receive any RoCE packets through this VF/SF and RoCE GID table for this PCI
+will be empty.
+
+When RoCE capability is disabled in the device using port function attribute,
+VF/SF driver cannot override it.
+
+- Get RoCE capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce enable
+
+- Set RoCE capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 roce disable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce disable
+
+migratable capability setup
+---------------------------
+Live migration is the process of transferring a live virtual machine
+from one physical host to another without disrupting its normal
+operation.
+
+User who want PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver
+with migration support, the user can migrate the VM with this VF from one HV to a
+different one.
+
+However, when migratable capability is enable, device will disable features which cannot
+be migrated. Thus migratable cap can impose limitations on a VF so let the user decide.
+
+Example of LM with migratable function configuration:
+- Get migratable capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable disable
+
+- Set migratable capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 migratable enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable enable
+
+- Bind VF to VFIO driver with migration support::
+
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
+ $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
+
+Attach VF to the VM.
+Start the VM.
+Perform live migration.
+
+IPsec crypto capability setup
+-----------------------------
+When user enables IPsec crypto capability for a VF, user application can offload
+XFRM state crypto operation (Encrypt/Decrypt) to this VF.
+
+When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
+processed in software by the kernel.
+
+- Get IPsec crypto capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_crypto disabled
+
+- Set IPsec crypto capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_crypto enabled
+
+IPsec packet capability setup
+-----------------------------
+When user enables IPsec packet capability for a VF, user application can offload
+XFRM state and policy crypto operation (Encrypt/Decrypt) to this VF, as well as
+IPsec encapsulation.
+
+When IPsec packet capability is disabled (default) for a VF, the XFRM state and
+policy is processed in software by the kernel.
+
+- Get IPsec packet capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet disabled
+
+- Set IPsec packet capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet enabled
+
+Subfunction
+============
+
+Subfunction is a lightweight function that has a parent PCI function on which
+it is deployed. Subfunction is created and deployed in unit of 1. Unlike
+SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
+A subfunction communicates with the hardware through the parent PCI function.
+
+To use a subfunction, 3 steps setup sequence is followed:
+
+1) create - create a subfunction;
+2) configure - configure subfunction attributes;
+3) deploy - deploy the subfunction;
+
+Subfunction management is done using devlink port user interface.
+User performs setup on the subfunction management device.
+
+(1) Create
+----------
+A subfunction is created using a devlink port interface. A user adds the
+subfunction by adding a devlink port of subfunction flavour. The devlink
+kernel code calls down to subfunction management driver (devlink ops) and asks
+it to create a subfunction devlink port. Driver then instantiates the
+subfunction port and any associated objects such as health reporters and
+representor netdevice.
+
+(2) Configure
+-------------
+A subfunction devlink port is created but it is not active yet. That means the
+entities are created on devlink side, the e-switch port representor is created,
+but the subfunction device itself is not created. A user might use e-switch port
+representor to do settings, putting it into bridge, adding TC rules, etc. A user
+might as well configure the hardware address (such as MAC address) of the
+subfunction while subfunction is inactive.
+
+(3) Deploy
+----------
+Once a subfunction is configured, user must activate it to use it. Upon
+activation, subfunction management driver asks the subfunction management
+device to instantiate the subfunction device on particular PCI function.
+A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
+At this point a matching subfunction driver binds to the subfunction's auxiliary device.
+
+Rate object management
+======================
+
+Devlink provides API to manage tx rates of single devlink port or a group.
+This is done through rate objects, which can be one of the two types:
+
+``leaf``
+ Represents a single devlink port; created/destroyed by the driver. Since leaf
+ have 1to1 mapping to its devlink port, in user space it is referred as
+ ``pci/<bus_addr>/<port_index>``;
+
+``node``
+ Represents a group of rate objects (leafs and/or nodes); created/deleted by
+ request from the userspace; initially empty (no rate objects added). In
+ userspace it is referred as ``pci/<bus_addr>/<node_name>``, where
+ ``node_name`` can be any identifier, except decimal number, to avoid
+ collisions with leafs.
+
+API allows to configure following rate object's parameters:
+
+``tx_share``
+ Minimum TX rate value shared among all other rate objects, or rate objects
+ that parts of the parent group, if it is a part of the same group.
+
+``tx_max``
+ Maximum TX rate value.
+
+``tx_priority``
+ Allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their priority
+ as long as the nodes remain within their bandwidth limit. The higher the
+ priority the higher the probability that the node will get selected for
+ scheduling.
+
+``tx_weight``
+ Allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with the
+ strict priority. As a node is configured with a higher rate it gets more
+ BW relative to its siblings. Values are relative like a percentage
+ points, they basically tell how much BW should node take relative to
+ its siblings.
+
+``parent``
+ Parent node name. Parent node rate limits are considered as additional limits
+ to all node children limits. ``tx_max`` is an upper limit for children.
+ ``tx_share`` is a total bandwidth distributed among children.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+Arbitration flow from the high level:
+
+#. Choose a node, or group of nodes with the highest priority that stays
+ within the BW limit and are not blocked. Use ``tx_priority`` as a
+ parameter for this arbitration.
+
+#. If group of nodes have the same priority perform WFQ arbitration on
+ that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
+
+#. Select the winner node, and continue arbitration flow among its children,
+ until leaf node is reached, and the winner is established.
+
+#. If all the nodes from the highest priority sub-group are satisfied, or
+ overused their assigned BW, move to the lower priority nodes.
+
+Driver implementations are allowed to support both or either rate object types
+and setting methods of their parameters. Additionally driver implementation
+may export nodes/leafs and their child-parent relationships.
+
+Terms and Definitions
+=====================
+
+.. list-table:: Terms and Definitions
+ :widths: 22 90
+
+ * - Term
+ - Definitions
+ * - ``PCI device``
+ - A physical PCI device having one or more PCI buses consists of one or
+ more PCI controllers.
+ * - ``PCI controller``
+ - A controller consists of potentially multiple physical functions,
+ virtual functions and subfunctions.
+ * - ``Port function``
+ - An object to manage the function of a port.
+ * - ``Subfunction``
+ - A lightweight function that has parent PCI function on which it is
+ deployed.
+ * - ``Subfunction device``
+ - A bus device of the subfunction, usually on a auxiliary bus.
+ * - ``Subfunction driver``
+ - A device driver for the subfunction auxiliary device.
+ * - ``Subfunction management device``
+ - A PCI physical function that supports subfunction management.
+ * - ``Subfunction management driver``
+ - A device driver for PCI physical function that supports
+ subfunction management using devlink port interface.
+ * - ``Subfunction host driver``
+ - A device driver for PCI physical function that hosts subfunction
+ devices. In most cases it is same as subfunction management driver. When
+ subfunction is used on external controller, subfunction management and
+ host drivers are different.
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
new file mode 100644
index 0000000000..9232cd7da3
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -0,0 +1,83 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Region
+==============
+
+``devlink`` regions enable access to driver defined address regions using
+devlink.
+
+Each device can create and register its own supported address regions. The
+region can then be accessed via the devlink region interface.
+
+Region snapshots are collected by the driver, and can be accessed via read
+or dump commands. This allows future analysis on the created snapshots.
+Regions may optionally support triggering snapshots on demand.
+
+Snapshot identifiers are scoped to the devlink instance, not a region.
+All snapshots with the same snapshot id within a devlink instance
+correspond to the same event.
+
+The major benefit to creating a region is to provide access to internal
+address regions that are otherwise inaccessible to the user.
+
+Regions may also be used to provide an additional way to debug complex error
+states, but see also Documentation/networking/devlink/devlink-health.rst
+
+Regions may optionally support capturing a snapshot on demand via the
+``DEVLINK_CMD_REGION_NEW`` netlink message. A driver wishing to allow
+requested snapshots must implement the ``.snapshot`` callback for the region
+in its ``devlink_region_ops`` structure. If snapshot id is not set in
+the ``DEVLINK_CMD_REGION_NEW`` request kernel will allocate one and send
+the snapshot information to user space.
+
+Regions may optionally allow directly reading from their contents without a
+snapshot. Direct read requests are not atomic. In particular a read request
+of size 256 bytes or larger will be split into multiple chunks. If atomic
+access is required, use a snapshot. A driver wishing to enable this for a
+region should implement the ``.read`` callback in the ``devlink_region_ops``
+structure. User space can request a direct read by using the
+``DEVLINK_ATTR_REGION_DIRECT`` attribute instead of specifying a snapshot
+id.
+
+example usage
+-------------
+
+.. code:: shell
+
+ $ devlink region help
+ $ devlink region show [ DEV/REGION ]
+ $ devlink region del DEV/REGION snapshot SNAPSHOT_ID
+ $ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
+ $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length length
+
+ # Show all of the exposed regions with region sizes:
+ $ devlink region show
+ pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2] max 8
+ pci/0000:00:05.0/fw-health: size 64 snapshot [1 2] max 8
+
+ # Delete a snapshot using:
+ $ devlink region del pci/0000:00:05.0/cr-space snapshot 1
+
+ # Request an immediate snapshot, if supported by the region
+ $ devlink region new pci/0000:00:05.0/cr-space
+ pci/0000:00:05.0/cr-space: snapshot 5
+
+ # Dump a snapshot:
+ $ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+ 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
+ 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
+
+ # Read a specific part of a snapshot:
+ $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+
+ # Read from the region without a snapshot
+ $ devlink region read pci/0000:00:05.0/fw-health address 16 length 16
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+
+As regions are likely very device or driver specific, no generic regions are
+defined. See the driver-specific documentation files for information on the
+specific regions a driver supports.
diff --git a/Documentation/networking/devlink/devlink-reload.rst b/Documentation/networking/devlink/devlink-reload.rst
new file mode 100644
index 0000000000..505d22da02
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-reload.rst
@@ -0,0 +1,81 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Reload
+==============
+
+``devlink-reload`` provides mechanism to reinit driver entities, applying
+``devlink-params`` and ``devlink-resources`` new values. It also provides
+mechanism to activate firmware.
+
+Reload Actions
+==============
+
+User may select a reload action.
+By default ``driver_reinit`` action is selected.
+
+.. list-table:: Possible reload actions
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``driver-reinit``
+ - Devlink driver entities re-initialization, including applying
+ new values to devlink entities which are used during driver
+ load such as ``devlink-params`` in configuration mode
+ ``driverinit`` or ``devlink-resources``
+ * - ``fw_activate``
+ - Firmware activate. Activates new firmware if such image is stored and
+ pending activation. If no limitation specified this action may involve
+ firmware reset. If no new image pending this action will reload current
+ firmware image.
+
+Note that even though user asks for a specific action, the driver
+implementation might require to perform another action alongside with
+it. For example, some driver do not support driver reinitialization
+being performed without fw activation. Therefore, the devlink reload
+command returns the list of actions which were actrually performed.
+
+Reload Limits
+=============
+
+By default reload actions are not limited and driver implementation may
+include reset or downtime as needed to perform the actions.
+
+However, some drivers support action limits, which limit the action
+implementation to specific constraints.
+
+.. list-table:: Possible reload limits
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``no_reset``
+ - No reset allowed, no down time allowed, no link flap and no
+ configuration is lost.
+
+Change Namespace
+================
+
+The netns option allows user to be able to move devlink instances into
+namespaces during devlink reload operation.
+By default all devlink instances are created in init_net and stay there.
+
+example usage
+-------------
+
+.. code:: shell
+
+ $ devlink dev reload help
+ $ devlink dev reload DEV [ netns { PID | NAME | ID } ] [ action { driver_reinit | fw_activate } ] [ limit no_reset ]
+
+ # Run reload command for devlink driver entities re-initialization:
+ $ devlink dev reload pci/0000:82:00.0 action driver_reinit
+ reload_actions_performed:
+ driver_reinit
+
+ # Run reload command to activate firmware:
+ # Note that mlx5 driver reloads the driver while activating firmware
+ $ devlink dev reload pci/0000:82:00.0 action fw_activate
+ reload_actions_performed:
+ driver_reinit fw_activate
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
new file mode 100644
index 0000000000..3d5ae51e65
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+Devlink Resource
+================
+
+``devlink`` provides the ability for drivers to register resources, which
+can allow administrators to see the device restrictions for a given
+resource, as well as how much of the given resource is currently
+in use. Additionally, these resources can optionally have configurable size.
+This could enable the administrator to limit the number of resources that
+are used.
+
+For example, the ``netdevsim`` driver enables ``/IPv4/fib`` and
+``/IPv4/fib-rules`` as resources to limit the number of IPv4 FIB entries and
+rules for a given device.
+
+Resource Ids
+============
+
+Each resource is represented by an id, and contains information about its
+current size and related sub resources. To access a sub resource, you
+specify the path of the resource. For example ``/IPv4/fib`` is the id for
+the ``fib`` sub-resource under the ``IPv4`` resource.
+
+Generic Resources
+=================
+
+Generic resources are used to describe resources that can be shared by multiple
+device drivers and their description must be added to the following table:
+
+.. list-table:: List of Generic Resources
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``physical_ports``
+ - A limited capacity of physical ports that the switch ASIC can support
+
+example usage
+-------------
+
+The resources exposed by the driver can be observed, for example:
+
+.. code:: shell
+
+ $devlink resource show pci/0000:03:00.0
+ pci/0000:03:00.0:
+ name kvd size 245760 unit entry
+ resources:
+ name linear size 98304 occ 0 unit entry size_min 0 size_max 147456 size_gran 128
+ name hash_double size 60416 unit entry size_min 32768 size_max 180224 size_gran 128
+ name hash_single size 87040 unit entry size_min 65536 size_max 212992 size_gran 128
+
+Some resource's size can be changed. Examples:
+
+.. code:: shell
+
+ $devlink resource set pci/0000:03:00.0 path /kvd/hash_single size 73088
+ $devlink resource set pci/0000:03:00.0 path /kvd/hash_double size 74368
+
+The changes do not apply immediately, this can be validated by the 'size_new'
+attribute, which represents the pending change in size. For example:
+
+.. code:: shell
+
+ $devlink resource show pci/0000:03:00.0
+ pci/0000:03:00.0:
+ name kvd size 245760 unit entry size_valid false
+ resources:
+ name linear size 98304 size_new 147456 occ 0 unit entry size_min 0 size_max 147456 size_gran 128
+ name hash_double size 60416 unit entry size_min 32768 size_max 180224 size_gran 128
+ name hash_single size 87040 unit entry size_min 65536 size_max 212992 size_gran 128
+
+Note that changes in resource size may require a device reload to properly
+take effect.
diff --git a/Documentation/networking/devlink/devlink-selftests.rst b/Documentation/networking/devlink/devlink-selftests.rst
new file mode 100644
index 0000000000..c0aa1f3aef
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-selftests.rst
@@ -0,0 +1,38 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+=================
+Devlink Selftests
+=================
+
+The ``devlink-selftests`` API allows executing selftests on the device.
+
+Tests Mask
+==========
+The ``devlink-selftests`` command should be run with a mask indicating
+the tests to be executed.
+
+Tests Description
+=================
+The following is a list of tests that drivers may execute.
+
+.. list-table:: List of tests
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``DEVLINK_SELFTEST_FLASH``
+ - Devices may have the firmware on non-volatile memory on the board, e.g.
+ flash. This particular test helps to run a flash selftest on the device.
+ Implementation of the test is left to the driver/firmware.
+
+example usage
+-------------
+
+.. code:: shell
+
+ # Query selftests supported on the devlink device
+ $ devlink dev selftests show DEV
+ # Query selftests supported on all devlink devices
+ $ devlink dev selftests show
+ # Executes selftests on the device
+ $ devlink dev selftests run DEV id flash
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
new file mode 100644
index 0000000000..2c14dfe69b
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -0,0 +1,640 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Devlink Trap
+============
+
+Background
+==========
+
+Devices capable of offloading the kernel's datapath and perform functions such
+as bridging and routing must also be able to send specific packets to the
+kernel (i.e., the CPU) for processing.
+
+For example, a device acting as a multicast-aware bridge must be able to send
+IGMP membership reports to the kernel for processing by the bridge module.
+Without processing such packets, the bridge module could never populate its
+MDB.
+
+As another example, consider a device acting as router which has received an IP
+packet with a TTL of 1. Upon routing the packet the device must send it to the
+kernel so that it will route it as well and generate an ICMP Time Exceeded
+error datagram. Without letting the kernel route such packets itself, utilities
+such as ``traceroute`` could never work.
+
+The fundamental ability of sending certain packets to the kernel for processing
+is called "packet trapping".
+
+Overview
+========
+
+The ``devlink-trap`` mechanism allows capable device drivers to register their
+supported packet traps with ``devlink`` and report trapped packets to
+``devlink`` for further analysis.
+
+Upon receiving trapped packets, ``devlink`` will perform a per-trap packets and
+bytes accounting and potentially report the packet to user space via a netlink
+event along with all the provided metadata (e.g., trap reason, timestamp, input
+port). This is especially useful for drop traps (see :ref:`Trap-Types`)
+as it allows users to obtain further visibility into packet drops that would
+otherwise be invisible.
+
+The following diagram provides a general overview of ``devlink-trap``::
+
+ Netlink event: Packet w/ metadata
+ Or a summary of recent drops
+ ^
+ |
+ Userspace |
+ +---------------------------------------------------+
+ Kernel |
+ |
+ +-------+--------+
+ | |
+ | drop_monitor |
+ | |
+ +-------^--------+
+ |
+ | Non-control traps
+ |
+ +----+----+
+ | | Kernel's Rx path
+ | devlink | (non-drop traps)
+ | |
+ +----^----+ ^
+ | |
+ +-----------+
+ |
+ +-------+-------+
+ | |
+ | Device driver |
+ | |
+ +-------^-------+
+ Kernel |
+ +---------------------------------------------------+
+ Hardware |
+ | Trapped packet
+ |
+ +--+---+
+ | |
+ | ASIC |
+ | |
+ +------+
+
+.. _Trap-Types:
+
+Trap Types
+==========
+
+The ``devlink-trap`` mechanism supports the following packet trap types:
+
+ * ``drop``: Trapped packets were dropped by the underlying device. Packets
+ are only processed by ``devlink`` and not injected to the kernel's Rx path.
+ The trap action (see :ref:`Trap-Actions`) can be changed.
+ * ``exception``: Trapped packets were not forwarded as intended by the
+ underlying device due to an exception (e.g., TTL error, missing neighbour
+ entry) and trapped to the control plane for resolution. Packets are
+ processed by ``devlink`` and injected to the kernel's Rx path. Changing the
+ action of such traps is not allowed, as it can easily break the control
+ plane.
+ * ``control``: Trapped packets were trapped by the device because these are
+ control packets required for the correct functioning of the control plane.
+ For example, ARP request and IGMP query packets. Packets are injected to
+ the kernel's Rx path, but not reported to the kernel's drop monitor.
+ Changing the action of such traps is not allowed, as it can easily break
+ the control plane.
+
+.. _Trap-Actions:
+
+Trap Actions
+============
+
+The ``devlink-trap`` mechanism supports the following packet trap actions:
+
+ * ``trap``: The sole copy of the packet is sent to the CPU.
+ * ``drop``: The packet is dropped by the underlying device and a copy is not
+ sent to the CPU.
+ * ``mirror``: The packet is forwarded by the underlying device and a copy is
+ sent to the CPU.
+
+Generic Packet Traps
+====================
+
+Generic packet traps are used to describe traps that trap well-defined packets
+or packets that are trapped due to well-defined conditions (e.g., TTL error).
+Such traps can be shared by multiple device drivers and their description must
+be added to the following table:
+
+.. list-table:: List of Generic Packet Traps
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``source_mac_is_multicast``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop because of a
+ multicast source MAC
+ * - ``vlan_tag_mismatch``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop in case of VLAN
+ tag mismatch: The ingress bridge port is not configured with a PVID and
+ the packet is untagged or prio-tagged
+ * - ``ingress_vlan_filter``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop in case they are
+ tagged with a VLAN that is not configured on the ingress bridge port
+ * - ``ingress_spanning_tree_filter``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop in case the STP
+ state of the ingress bridge port is not "forwarding"
+ * - ``port_list_is_empty``
+ - ``drop``
+ - Traps packets that the device decided to drop in case they need to be
+ flooded (e.g., unknown unicast, unregistered multicast) and there are
+ no ports the packets should be flooded to
+ * - ``port_loopback_filter``
+ - ``drop``
+ - Traps packets that the device decided to drop in case after layer 2
+ forwarding the only port from which they should be transmitted through
+ is the port from which they were received
+ * - ``blackhole_route``
+ - ``drop``
+ - Traps packets that the device decided to drop in case they hit a
+ blackhole route
+ * - ``ttl_value_is_too_small``
+ - ``exception``
+ - Traps unicast packets that should be forwarded by the device whose TTL
+ was decremented to 0 or less
+ * - ``tail_drop``
+ - ``drop``
+ - Traps packets that the device decided to drop because they could not be
+ enqueued to a transmission queue which is full
+ * - ``non_ip``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to
+ undergo a layer 3 lookup, but are not IP or MPLS packets
+ * - ``uc_dip_over_mc_dmac``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and they have a unicast destination IP and a multicast destination
+ MAC
+ * - ``dip_is_loopback_address``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their destination IP is the loopback address (i.e., 127.0.0.0/8
+ and ::1/128)
+ * - ``sip_is_mc``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their source IP is multicast (i.e., 224.0.0.0/8 and ff::/8)
+ * - ``sip_is_loopback_address``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their source IP is the loopback address (i.e., 127.0.0.0/8 and ::1/128)
+ * - ``ip_header_corrupted``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their IP header is corrupted: wrong checksum, wrong IP version
+ or too short Internet Header Length (IHL)
+ * - ``ipv4_sip_is_limited_bc``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their source IP is limited broadcast (i.e., 255.255.255.255/32)
+ * - ``ipv6_mc_dip_reserved_scope``
+ - ``drop``
+ - Traps IPv6 packets that the device decided to drop because they need to
+ be routed and their IPv6 multicast destination IP has a reserved scope
+ (i.e., ffx0::/16)
+ * - ``ipv6_mc_dip_interface_local_scope``
+ - ``drop``
+ - Traps IPv6 packets that the device decided to drop because they need to
+ be routed and their IPv6 multicast destination IP has an interface-local scope
+ (i.e., ffx1::/16)
+ * - ``mtu_value_is_too_small``
+ - ``exception``
+ - Traps packets that should have been routed by the device, but were bigger
+ than the MTU of the egress interface
+ * - ``unresolved_neigh``
+ - ``exception``
+ - Traps packets that did not have a matching IP neighbour after routing
+ * - ``mc_reverse_path_forwarding``
+ - ``exception``
+ - Traps multicast IP packets that failed reverse-path forwarding (RPF)
+ check during multicast routing
+ * - ``reject_route``
+ - ``exception``
+ - Traps packets that hit reject routes (i.e., "unreachable", "prohibit")
+ * - ``ipv4_lpm_miss``
+ - ``exception``
+ - Traps unicast IPv4 packets that did not match any route
+ * - ``ipv6_lpm_miss``
+ - ``exception``
+ - Traps unicast IPv6 packets that did not match any route
+ * - ``non_routable_packet``
+ - ``drop``
+ - Traps packets that the device decided to drop because they are not
+ supposed to be routed. For example, IGMP queries can be flooded by the
+ device in layer 2 and reach the router. Such packets should not be
+ routed and instead dropped
+ * - ``decap_error``
+ - ``exception``
+ - Traps NVE and IPinIP packets that the device decided to drop because of
+ failure during decapsulation (e.g., packet being too short, reserved
+ bits set in VXLAN header)
+ * - ``overlay_smac_is_mc``
+ - ``drop``
+ - Traps NVE packets that the device decided to drop because their overlay
+ source MAC is multicast
+ * - ``ingress_flow_action_drop``
+ - ``drop``
+ - Traps packets dropped during processing of ingress flow action drop
+ * - ``egress_flow_action_drop``
+ - ``drop``
+ - Traps packets dropped during processing of egress flow action drop
+ * - ``stp``
+ - ``control``
+ - Traps STP packets
+ * - ``lacp``
+ - ``control``
+ - Traps LACP packets
+ * - ``lldp``
+ - ``control``
+ - Traps LLDP packets
+ * - ``igmp_query``
+ - ``control``
+ - Traps IGMP Membership Query packets
+ * - ``igmp_v1_report``
+ - ``control``
+ - Traps IGMP Version 1 Membership Report packets
+ * - ``igmp_v2_report``
+ - ``control``
+ - Traps IGMP Version 2 Membership Report packets
+ * - ``igmp_v3_report``
+ - ``control``
+ - Traps IGMP Version 3 Membership Report packets
+ * - ``igmp_v2_leave``
+ - ``control``
+ - Traps IGMP Version 2 Leave Group packets
+ * - ``mld_query``
+ - ``control``
+ - Traps MLD Multicast Listener Query packets
+ * - ``mld_v1_report``
+ - ``control``
+ - Traps MLD Version 1 Multicast Listener Report packets
+ * - ``mld_v2_report``
+ - ``control``
+ - Traps MLD Version 2 Multicast Listener Report packets
+ * - ``mld_v1_done``
+ - ``control``
+ - Traps MLD Version 1 Multicast Listener Done packets
+ * - ``ipv4_dhcp``
+ - ``control``
+ - Traps IPv4 DHCP packets
+ * - ``ipv6_dhcp``
+ - ``control``
+ - Traps IPv6 DHCP packets
+ * - ``arp_request``
+ - ``control``
+ - Traps ARP request packets
+ * - ``arp_response``
+ - ``control``
+ - Traps ARP response packets
+ * - ``arp_overlay``
+ - ``control``
+ - Traps NVE-decapsulated ARP packets that reached the overlay network.
+ This is required, for example, when the address that needs to be
+ resolved is a local address
+ * - ``ipv6_neigh_solicit``
+ - ``control``
+ - Traps IPv6 Neighbour Solicitation packets
+ * - ``ipv6_neigh_advert``
+ - ``control``
+ - Traps IPv6 Neighbour Advertisement packets
+ * - ``ipv4_bfd``
+ - ``control``
+ - Traps IPv4 BFD packets
+ * - ``ipv6_bfd``
+ - ``control``
+ - Traps IPv6 BFD packets
+ * - ``ipv4_ospf``
+ - ``control``
+ - Traps IPv4 OSPF packets
+ * - ``ipv6_ospf``
+ - ``control``
+ - Traps IPv6 OSPF packets
+ * - ``ipv4_bgp``
+ - ``control``
+ - Traps IPv4 BGP packets
+ * - ``ipv6_bgp``
+ - ``control``
+ - Traps IPv6 BGP packets
+ * - ``ipv4_vrrp``
+ - ``control``
+ - Traps IPv4 VRRP packets
+ * - ``ipv6_vrrp``
+ - ``control``
+ - Traps IPv6 VRRP packets
+ * - ``ipv4_pim``
+ - ``control``
+ - Traps IPv4 PIM packets
+ * - ``ipv6_pim``
+ - ``control``
+ - Traps IPv6 PIM packets
+ * - ``uc_loopback``
+ - ``control``
+ - Traps unicast packets that need to be routed through the same layer 3
+ interface from which they were received. Such packets are routed by the
+ kernel, but also cause it to potentially generate ICMP redirect packets
+ * - ``local_route``
+ - ``control``
+ - Traps unicast packets that hit a local route and need to be locally
+ delivered
+ * - ``external_route``
+ - ``control``
+ - Traps packets that should be routed through an external interface (e.g.,
+ management interface) that does not belong to the same device (e.g.,
+ switch ASIC) as the ingress interface
+ * - ``ipv6_uc_dip_link_local_scope``
+ - ``control``
+ - Traps unicast IPv6 packets that need to be routed and have a destination
+ IP address with a link-local scope (i.e., fe80::/10). The trap allows
+ device drivers to avoid programming link-local routes, but still receive
+ packets for local delivery
+ * - ``ipv6_dip_all_nodes``
+ - ``control``
+ - Traps IPv6 packets that their destination IP address is the "All Nodes
+ Address" (i.e., ff02::1)
+ * - ``ipv6_dip_all_routers``
+ - ``control``
+ - Traps IPv6 packets that their destination IP address is the "All Routers
+ Address" (i.e., ff02::2)
+ * - ``ipv6_router_solicit``
+ - ``control``
+ - Traps IPv6 Router Solicitation packets
+ * - ``ipv6_router_advert``
+ - ``control``
+ - Traps IPv6 Router Advertisement packets
+ * - ``ipv6_redirect``
+ - ``control``
+ - Traps IPv6 Redirect Message packets
+ * - ``ipv4_router_alert``
+ - ``control``
+ - Traps IPv4 packets that need to be routed and include the Router Alert
+ option. Such packets need to be locally delivered to raw sockets that
+ have the IP_ROUTER_ALERT socket option set
+ * - ``ipv6_router_alert``
+ - ``control``
+ - Traps IPv6 packets that need to be routed and include the Router Alert
+ option in their Hop-by-Hop extension header. Such packets need to be
+ locally delivered to raw sockets that have the IPV6_ROUTER_ALERT socket
+ option set
+ * - ``ptp_event``
+ - ``control``
+ - Traps PTP time-critical event messages (Sync, Delay_req, Pdelay_Req and
+ Pdelay_Resp)
+ * - ``ptp_general``
+ - ``control``
+ - Traps PTP general messages (Announce, Follow_Up, Delay_Resp,
+ Pdelay_Resp_Follow_Up, management and signaling)
+ * - ``flow_action_sample``
+ - ``control``
+ - Traps packets sampled during processing of flow action sample (e.g., via
+ tc's sample action)
+ * - ``flow_action_trap``
+ - ``control``
+ - Traps packets logged during processing of flow action trap (e.g., via
+ tc's trap action)
+ * - ``early_drop``
+ - ``drop``
+ - Traps packets dropped due to the RED (Random Early Detection) algorithm
+ (i.e., early drops)
+ * - ``vxlan_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the VXLAN header parsing which
+ might be because of packet truncation or the I flag is not set.
+ * - ``llc_snap_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the LLC+SNAP header parsing
+ * - ``vlan_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the VLAN header parsing. Could
+ include unexpected packet truncation.
+ * - ``pppoe_ppp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the PPPoE+PPP header parsing.
+ This could include finding a session ID of 0xFFFF (which is reserved and
+ not for use), a PPPoE length which is larger than the frame received or
+ any common error on this type of header
+ * - ``mpls_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the MPLS header parsing which
+ could include unexpected header truncation
+ * - ``arp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the ARP header parsing
+ * - ``ip_1_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the first IP header parsing.
+ This packet trap could include packets which do not pass an IP checksum
+ check, a header length check (a minimum of 20 bytes), which might suffer
+ from packet truncation thus the total length field exceeds the received
+ packet length etc
+ * - ``ip_n_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the parsing of the last IP
+ header (the inner one in case of an IP over IP tunnel). The same common
+ error checking is performed here as for the ip_1_parsing trap
+ * - ``gre_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the GRE header parsing
+ * - ``udp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the UDP header parsing.
+ This packet trap could include checksum errorrs, an improper UDP
+ length detected (smaller than 8 bytes) or detection of header
+ truncation.
+ * - ``tcp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the TCP header parsing.
+ This could include TCP checksum errors, improper combination of SYN, FIN
+ and/or RESET etc.
+ * - ``ipsec_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the IPSEC header parsing
+ * - ``sctp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the SCTP header parsing.
+ This would mean that port number 0 was used or that the header is
+ truncated.
+ * - ``dccp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the DCCP header parsing
+ * - ``gtp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the GTP header parsing
+ * - ``esp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the ESP header parsing
+ * - ``blackhole_nexthop``
+ - ``drop``
+ - Traps packets that the device decided to drop in case they hit a
+ blackhole nexthop
+ * - ``dmac_filter``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop because
+ the destination MAC is not configured in the MAC table and
+ the interface is not in promiscuous mode
+ * - ``eapol``
+ - ``control``
+ - Traps "Extensible Authentication Protocol over LAN" (EAPOL) packets
+ specified in IEEE 802.1X
+ * - ``locked_port``
+ - ``drop``
+ - Traps packets that the device decided to drop because they failed the
+ locked bridge port check. That is, packets that were received via a
+ locked port and whose {SMAC, VID} does not correspond to an FDB entry
+ pointing to the port
+
+Driver-specific Packet Traps
+============================
+
+Device drivers can register driver-specific packet traps, but these must be
+clearly documented. Such traps can correspond to device-specific exceptions and
+help debug packet drops caused by these exceptions. The following list includes
+links to the description of driver-specific traps registered by various device
+drivers:
+
+ * Documentation/networking/devlink/netdevsim.rst
+ * Documentation/networking/devlink/mlxsw.rst
+ * Documentation/networking/devlink/prestera.rst
+
+.. _Generic-Packet-Trap-Groups:
+
+Generic Packet Trap Groups
+==========================
+
+Generic packet trap groups are used to aggregate logically related packet
+traps. These groups allow the user to batch operations such as setting the trap
+action of all member traps. In addition, ``devlink-trap`` can report aggregated
+per-group packets and bytes statistics, in case per-trap statistics are too
+narrow. The description of these groups must be added to the following table:
+
+.. list-table:: List of Generic Packet Trap Groups
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``l2_drops``
+ - Contains packet traps for packets that were dropped by the device during
+ layer 2 forwarding (i.e., bridge)
+ * - ``l3_drops``
+ - Contains packet traps for packets that were dropped by the device during
+ layer 3 forwarding
+ * - ``l3_exceptions``
+ - Contains packet traps for packets that hit an exception (e.g., TTL
+ error) during layer 3 forwarding
+ * - ``buffer_drops``
+ - Contains packet traps for packets that were dropped by the device due to
+ an enqueue decision
+ * - ``tunnel_drops``
+ - Contains packet traps for packets that were dropped by the device during
+ tunnel encapsulation / decapsulation
+ * - ``acl_drops``
+ - Contains packet traps for packets that were dropped by the device during
+ ACL processing
+ * - ``stp``
+ - Contains packet traps for STP packets
+ * - ``lacp``
+ - Contains packet traps for LACP packets
+ * - ``lldp``
+ - Contains packet traps for LLDP packets
+ * - ``mc_snooping``
+ - Contains packet traps for IGMP and MLD packets required for multicast
+ snooping
+ * - ``dhcp``
+ - Contains packet traps for DHCP packets
+ * - ``neigh_discovery``
+ - Contains packet traps for neighbour discovery packets (e.g., ARP, IPv6
+ ND)
+ * - ``bfd``
+ - Contains packet traps for BFD packets
+ * - ``ospf``
+ - Contains packet traps for OSPF packets
+ * - ``bgp``
+ - Contains packet traps for BGP packets
+ * - ``vrrp``
+ - Contains packet traps for VRRP packets
+ * - ``pim``
+ - Contains packet traps for PIM packets
+ * - ``uc_loopback``
+ - Contains a packet trap for unicast loopback packets (i.e.,
+ ``uc_loopback``). This trap is singled-out because in cases such as
+ one-armed router it will be constantly triggered. To limit the impact on
+ the CPU usage, a packet trap policer with a low rate can be bound to the
+ group without affecting other traps
+ * - ``local_delivery``
+ - Contains packet traps for packets that should be locally delivered after
+ routing, but do not match more specific packet traps (e.g.,
+ ``ipv4_bgp``)
+ * - ``external_delivery``
+ - Contains packet traps for packets that should be routed through an
+ external interface (e.g., management interface) that does not belong to
+ the same device (e.g., switch ASIC) as the ingress interface
+ * - ``ipv6``
+ - Contains packet traps for various IPv6 control packets (e.g., Router
+ Advertisements)
+ * - ``ptp_event``
+ - Contains packet traps for PTP time-critical event messages (Sync,
+ Delay_req, Pdelay_Req and Pdelay_Resp)
+ * - ``ptp_general``
+ - Contains packet traps for PTP general messages (Announce, Follow_Up,
+ Delay_Resp, Pdelay_Resp_Follow_Up, management and signaling)
+ * - ``acl_sample``
+ - Contains packet traps for packets that were sampled by the device during
+ ACL processing
+ * - ``acl_trap``
+ - Contains packet traps for packets that were trapped (logged) by the
+ device during ACL processing
+ * - ``parser_error_drops``
+ - Contains packet traps for packets that were marked by the device during
+ parsing as erroneous
+ * - ``eapol``
+ - Contains packet traps for "Extensible Authentication Protocol over LAN"
+ (EAPOL) packets specified in IEEE 802.1X
+
+Packet Trap Policers
+====================
+
+As previously explained, the underlying device can trap certain packets to the
+CPU for processing. In most cases, the underlying device is capable of handling
+packet rates that are several orders of magnitude higher compared to those that
+can be handled by the CPU.
+
+Therefore, in order to prevent the underlying device from overwhelming the CPU,
+devices usually include packet trap policers that are able to police the
+trapped packets to rates that can be handled by the CPU.
+
+The ``devlink-trap`` mechanism allows capable device drivers to register their
+supported packet trap policers with ``devlink``. The device driver can choose
+to associate these policers with supported packet trap groups (see
+:ref:`Generic-Packet-Trap-Groups`) during its initialization, thereby exposing
+its default control plane policy to user space.
+
+Device drivers should allow user space to change the parameters of the policers
+(e.g., rate, burst size) as well as the association between the policers and
+trap groups by implementing the relevant callbacks.
+
+If possible, device drivers should implement a callback that allows user space
+to retrieve the number of packets that were dropped by the policer because its
+configured policy was violated.
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/netdevsim/devlink_trap.sh`` for a
+test covering the core infrastructure. Test cases should be added for any new
+functionality.
+
+Device drivers should focus their tests on device-specific functionality, such
+as the triggering of supported packet traps.
diff --git a/Documentation/networking/devlink/etas_es58x.rst b/Documentation/networking/devlink/etas_es58x.rst
new file mode 100644
index 0000000000..3b857d82a4
--- /dev/null
+++ b/Documentation/networking/devlink/etas_es58x.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+etas_es58x devlink support
+==========================
+
+This document describes the devlink features implemented by the
+``etas_es58x`` device driver.
+
+Info versions
+=============
+
+The ``etas_es58x`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of the firmware running on the device. Also available
+ through ``ethtool -i`` as the first member of the
+ ``firmware-version``.
+ * - ``fw.bootloader``
+ - running
+ - Version of the bootloader running on the device. Also available
+ through ``ethtool -i`` as the second member of the
+ ``firmware-version``.
+ * - ``board.rev``
+ - fixed
+ - The hardware revision of the device.
+ * - ``serial_number``
+ - fixed
+ - The USB serial number. Also available through ``lsusb -v``.
diff --git a/Documentation/networking/devlink/hns3.rst b/Documentation/networking/devlink/hns3.rst
new file mode 100644
index 0000000000..4562a6e478
--- /dev/null
+++ b/Documentation/networking/devlink/hns3.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+hns3 devlink support
+====================
+
+This document describes the devlink features implemented by the ``hns3``
+device driver.
+
+The ``hns3`` driver supports reloading via ``DEVLINK_CMD_RELOAD``.
+
+Info versions
+=============
+
+The ``hns3`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 10 10 80
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Used to represent the firmware version.
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
new file mode 100644
index 0000000000..2f60e34ab9
--- /dev/null
+++ b/Documentation/networking/devlink/ice.rst
@@ -0,0 +1,395 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+ice devlink support
+===================
+
+This document describes the devlink features implemented by the ``ice``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ - Notes
+ * - ``enable_roce``
+ - runtime
+ - mutually exclusive with ``enable_iwarp``
+ * - ``enable_iwarp``
+ - runtime
+ - mutually exclusive with ``enable_roce``
+
+Info versions
+=============
+
+The ``ice`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 5 90
+
+ * - Name
+ - Type
+ - Example
+ - Description
+ * - ``board.id``
+ - fixed
+ - K65390-000
+ - The Product Board Assembly (PBA) identifier of the board.
+ * - ``fw.mgmt``
+ - running
+ - 2.1.7
+ - 3-digit version number of the management firmware running on the
+ Embedded Management Processor of the device. It controls the PHY,
+ link, access to device resources, etc. Intel documentation refers to
+ this as the EMP firmware.
+ * - ``fw.mgmt.api``
+ - running
+ - 1.5.1
+ - 3-digit version number (major.minor.patch) of the API exported over
+ the AdminQ by the management firmware. Used by the driver to
+ identify what commands are supported. Historical versions of the
+ kernel only displayed a 2-digit version number (major.minor).
+ * - ``fw.mgmt.build``
+ - running
+ - 0x305d955f
+ - Unique identifier of the source for the management firmware.
+ * - ``fw.undi``
+ - running
+ - 1.2581.0
+ - Version of the Option ROM containing the UEFI driver. The version is
+ reported in ``major.minor.patch`` format. The major version is
+ incremented whenever a major breaking change occurs, or when the
+ minor version would overflow. The minor version is incremented for
+ non-breaking changes and reset to 1 when the major version is
+ incremented. The patch version is normally 0 but is incremented when
+ a fix is delivered as a patch against an older base Option ROM.
+ * - ``fw.psid.api``
+ - running
+ - 0.80
+ - Version defining the format of the flash contents.
+ * - ``fw.bundle_id``
+ - running
+ - 0x80002ec0
+ - Unique identifier of the firmware image file that was loaded onto
+ the device. Also referred to as the EETRACK identifier of the NVM.
+ * - ``fw.app.name``
+ - running
+ - ICE OS Default Package
+ - The name of the DDP package that is active in the device. The DDP
+ package is loaded by the driver during initialization. Each
+ variation of the DDP package has a unique name.
+ * - ``fw.app``
+ - running
+ - 1.3.1.0
+ - The version of the DDP package that is active in the device. Note
+ that both the name (as reported by ``fw.app.name``) and version are
+ required to uniquely identify the package.
+ * - ``fw.app.bundle_id``
+ - running
+ - 0xc0000001
+ - Unique identifier for the DDP package loaded in the device. Also
+ referred to as the DDP Track ID. Can be used to uniquely identify
+ the specific DDP package.
+ * - ``fw.netlist``
+ - running
+ - 1.1.2000-6.7.0
+ - The version of the netlist module. This module defines the device's
+ Ethernet capabilities and default settings, and is used by the
+ management firmware as part of managing link and device
+ connectivity.
+ * - ``fw.netlist.build``
+ - running
+ - 0xee16ced7
+ - The first 4 bytes of the hash of the netlist module contents.
+
+Flash Update
+============
+
+The ``ice`` driver implements support for flash update using the
+``devlink-flash`` interface. It supports updating the device flash using a
+combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
+``fw.netlist`` components.
+
+.. list-table:: List of supported overwrite modes
+ :widths: 5 95
+
+ * - Bits
+ - Behavior
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Do not preserve settings stored in the flash components being
+ updated. This includes overwriting the port configuration that
+ determines the number of physical functions the device will
+ initialize with.
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Do not preserve either settings or identifiers. Overwrite everything
+ in the flash with the contents from the provided image, without
+ performing any preservation. This includes overwriting device
+ identifying fields such as the MAC address, VPD area, and device
+ serial number. It is expected that this combination be used with an
+ image customized for the specific device.
+
+The ice hardware does not support overwriting only identifiers while
+preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
+own will be rejected. If no overwrite mask is provided, the firmware will be
+instructed to preserve all settings and identifying fields when updating.
+
+Reload
+======
+
+The ``ice`` driver supports activating new firmware after a flash update
+using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
+action.
+
+.. code:: shell
+
+ $ devlink dev reload pci/0000:01:00.0 reload action fw_activate
+
+The new firmware is activated by issuing a device specific Embedded
+Management Processor reset which requests the device to reset and reload the
+EMP firmware image.
+
+The driver does not currently support reloading the driver via
+``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.
+
+Port split
+==========
+
+The ``ice`` driver supports port splitting only for port 0, as the FW has
+a predefined set of available port split options for the whole device.
+
+A system reboot is required for port split to be applied.
+
+The following command will select the port split option with 4 ports:
+
+.. code:: shell
+
+ $ devlink port split pci/0000:16:00.0/0 count 4
+
+The list of all available port options will be printed to dynamic debug after
+each ``split`` and ``unsplit`` command. The first option is the default.
+
+.. code:: shell
+
+ ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
+ ice 0000:16:00.0: Status Split Quad 0 Quad 1
+ ice 0000:16:00.0: count L0 L1 L2 L3 L4 L5 L6 L7
+ ice 0000:16:00.0: Active 2 100 - - - 100 - - -
+ ice 0000:16:00.0: 2 50 - 50 - - - - -
+ ice 0000:16:00.0: Pending 4 25 25 25 25 - - - -
+ ice 0000:16:00.0: 4 25 25 - - 25 25 - -
+ ice 0000:16:00.0: 8 10 10 10 10 10 10 10 10
+ ice 0000:16:00.0: 1 100 - - - - - - -
+
+There could be multiple FW port options with the same port split count. When
+the same port split count request is issued again, the next FW port option with
+the same port split count will be selected.
+
+``devlink port unsplit`` will select the option with a split count of 1. If
+there is no FW option available with split count 1, you will receive an error.
+
+Regions
+=======
+
+The ``ice`` driver implements the following regions for accessing internal
+device data.
+
+.. list-table:: regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``nvm-flash``
+ - The contents of the entire flash chip, sometimes referred to as
+ the device's Non Volatile Memory.
+ * - ``shadow-ram``
+ - The contents of the Shadow RAM, which is loaded from the beginning
+ of the flash. Although the contents are primarily from the flash,
+ this area also contains data generated during device boot which is
+ not stored in flash.
+ * - ``device-caps``
+ - The contents of the device firmware's capabilities buffer. Useful to
+ determine the current state and configuration of the device.
+
+Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
+snapshot. The ``device-caps`` region requires a snapshot as the contents are
+sent by firmware and can't be split into separate reads.
+
+Users can request an immediate capture of a snapshot for all three regions
+via the ``DEVLINK_CMD_REGION_NEW`` command.
+
+.. code:: shell
+
+ $ devlink region show
+ pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
+ pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10
+
+ $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
+ $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
+
+ $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+ 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
+ 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
+
+ $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+
+ $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1
+
+ $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
+ $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
+ 0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
+ 0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
+ 0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
+ 00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
+ 00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
+ 0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
+ 00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+ 00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+ 0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+
+ $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
+
+Devlink Rate
+============
+
+The ``ice`` driver implements devlink-rate API. It allows for offload of
+the Hierarchical QoS to the hardware. It enables user to group Virtual
+Functions in a tree structure and assign supported parameters: tx_share,
+tx_max, tx_priority and tx_weight to each node in a tree. So effectively
+user gains an ability to control how much bandwidth is allocated for each
+VF group. This is later enforced by the HW.
+
+It is assumed that this feature is mutually exclusive with DCB performed
+in FW and ADQ, or any driver feature that would trigger changes in QoS,
+for example creation of the new traffic class. The driver will prevent DCB
+or ADQ configuration if user started making any changes to the nodes using
+devlink-rate API. To configure those features a driver reload is necessary.
+Correspondingly if ADQ or DCB will get configured the driver won't export
+hierarchy at all, or will remove the untouched hierarchy if those
+features are enabled after the hierarchy is exported, but before any
+changes are made.
+
+This feature is also dependent on switchdev being enabled in the system.
+It's required because devlink-rate requires devlink-port objects to be
+present, and those objects are only created in switchdev mode.
+
+If the driver is set to the switchdev mode, it will export internal
+hierarchy the moment VF's are created. Root of the tree is always
+represented by the node_0. This node can't be deleted by the user. Leaf
+nodes and nodes with children also can't be deleted.
+
+.. list-table:: Attributes supported
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``tx_max``
+ - maximum bandwidth to be consumed by the tree Node. Rate Limit is
+ an absolute number specifying a maximum amount of bytes a Node may
+ consume during the course of one second. Rate limit guarantees
+ that a link will not oversaturate the receiver on the remote end
+ and also enforces an SLA between the subscriber and network
+ provider.
+ * - ``tx_share``
+ - minimum bandwidth allocated to a tree node when it is not blocked.
+ It specifies an absolute BW. While tx_max defines the maximum
+ bandwidth the node may consume, the tx_share marks committed BW
+ for the Node.
+ * - ``tx_priority``
+ - allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their
+ priority as long as the nodes remain within their bandwidth limit.
+ Range 0-7. Nodes with priority 7 have the highest priority and are
+ selected first, while nodes with priority 0 have the lowest
+ priority. Nodes that have the same priority are treated equally.
+ * - ``tx_weight``
+ - allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with
+ the strict priority. Range 1-200. Only relative values matter for
+ arbitration.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+.. code:: shell
+
+ # enable switchdev
+ $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
+
+ # at this point driver should export internal hierarchy
+ $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
+
+ $ devlink port function rate show
+ pci/0000:4b:00.0/node_25: type node parent node_24
+ pci/0000:4b:00.0/node_24: type node parent node_0
+ pci/0000:4b:00.0/node_32: type node parent node_31
+ pci/0000:4b:00.0/node_31: type node parent node_30
+ pci/0000:4b:00.0/node_30: type node parent node_16
+ pci/0000:4b:00.0/node_19: type node parent node_18
+ pci/0000:4b:00.0/node_18: type node parent node_17
+ pci/0000:4b:00.0/node_17: type node parent node_16
+ pci/0000:4b:00.0/node_14: type node parent node_5
+ pci/0000:4b:00.0/node_5: type node parent node_3
+ pci/0000:4b:00.0/node_13: type node parent node_4
+ pci/0000:4b:00.0/node_12: type node parent node_4
+ pci/0000:4b:00.0/node_11: type node parent node_4
+ pci/0000:4b:00.0/node_10: type node parent node_4
+ pci/0000:4b:00.0/node_9: type node parent node_4
+ pci/0000:4b:00.0/node_8: type node parent node_4
+ pci/0000:4b:00.0/node_7: type node parent node_4
+ pci/0000:4b:00.0/node_6: type node parent node_4
+ pci/0000:4b:00.0/node_4: type node parent node_3
+ pci/0000:4b:00.0/node_3: type node parent node_16
+ pci/0000:4b:00.0/node_16: type node parent node_15
+ pci/0000:4b:00.0/node_15: type node parent node_0
+ pci/0000:4b:00.0/node_2: type node parent node_1
+ pci/0000:4b:00.0/node_1: type node parent node_0
+ pci/0000:4b:00.0/node_0: type node
+ pci/0000:4b:00.0/1: type leaf parent node_25
+ pci/0000:4b:00.0/2: type leaf parent node_25
+
+ # let's create some custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
+
+ # second custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
+
+ # reassign second VF to newly created branch
+ $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
+
+ # assign tx_weight to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
+
+ # assign tx_share to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
new file mode 100644
index 0000000000..b49749e2b9
--- /dev/null
+++ b/Documentation/networking/devlink/index.rst
@@ -0,0 +1,69 @@
+Linux Devlink Documentation
+===========================
+
+devlink is an API to expose device information and resources not directly
+related to any device class, such as chip-wide/switch-ASIC-wide configuration.
+
+Locking
+-------
+
+Driver facing APIs are currently transitioning to allow more explicit
+locking. Drivers can use the existing ``devlink_*`` set of APIs, or
+new APIs prefixed by ``devl_*``. The older APIs handle all the locking
+in devlink core, but don't allow registration of most sub-objects once
+the main devlink object is itself registered. The newer ``devl_*`` APIs assume
+the devlink instance lock is already held. Drivers can take the instance
+lock by calling ``devl_lock()``. It is also held all callbacks of devlink
+netlink commands.
+
+Drivers are encouraged to use the devlink instance lock for their own needs.
+
+Interface documentation
+-----------------------
+
+The following pages describe various interfaces available through devlink in
+general.
+
+.. toctree::
+ :maxdepth: 1
+
+ devlink-dpipe
+ devlink-health
+ devlink-info
+ devlink-flash
+ devlink-params
+ devlink-port
+ devlink-region
+ devlink-resource
+ devlink-reload
+ devlink-selftests
+ devlink-trap
+ devlink-linecard
+
+Driver-specific documentation
+-----------------------------
+
+Each driver that implements ``devlink`` is expected to document what
+parameters, info versions, and other features it supports.
+
+.. toctree::
+ :maxdepth: 1
+
+ bnxt
+ etas_es58x
+ hns3
+ ionic
+ ice
+ mlx4
+ mlx5
+ mlxsw
+ mv88e6xxx
+ netdevsim
+ nfp
+ qed
+ ti-cpsw-switch
+ am65-nuss-cpsw-switch
+ prestera
+ iosm
+ octeontx2
+ sfc
diff --git a/Documentation/networking/devlink/ionic.rst b/Documentation/networking/devlink/ionic.rst
new file mode 100644
index 0000000000..48da9c92d5
--- /dev/null
+++ b/Documentation/networking/devlink/ionic.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+ionic devlink support
+=====================
+
+This document describes the devlink features implemented by the ``ionic``
+device driver.
+
+Info versions
+=============
+
+The ``ionic`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of firmware running on the device
+ * - ``asic.id``
+ - fixed
+ - The ASIC type for this device
+ * - ``asic.rev``
+ - fixed
+ - The revision of the ASIC for this device
diff --git a/Documentation/networking/devlink/iosm.rst b/Documentation/networking/devlink/iosm.rst
new file mode 100644
index 0000000000..6136181339
--- /dev/null
+++ b/Documentation/networking/devlink/iosm.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+iosm devlink support
+====================
+
+This document describes the devlink features implemented by the ``iosm``
+device driver.
+
+Parameters
+==========
+
+The ``iosm`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``erase_full_flash``
+ - u8
+ - runtime
+ - erase_full_flash parameter is used to check if full erase is required for
+ the device during firmware flashing.
+ If set, Full nand erase command will be sent to the device. By default,
+ only conditional erase support is enabled.
+
+
+Flash Update
+============
+
+The ``iosm`` driver implements support for flash update using the
+``devlink-flash`` interface.
+
+It supports updating the device flash using a combined flash image which contains
+the Bootloader images and other modem software images.
+
+The driver uses DEVLINK_SUPPORT_FLASH_UPDATE_COMPONENT to identify type of
+firmware image that need to be flashed as requested by user space application.
+Supported firmware image types.
+
+.. list-table:: Firmware Image types
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``PSI RAM``
+ - Primary Signed Image
+ * - ``EBL``
+ - External Bootloader
+ * - ``FLS``
+ - Modem Software Image
+
+PSI RAM and EBL are the RAM images which are injected to the device when the
+device is in BOOT ROM stage. Once this is successful, the actual modem firmware
+image is flashed to the device. The modem software image contains multiple files
+each having one secure bin file and at least one Loadmap/Region file. For flashing
+these files, appropriate commands are sent to the modem device along with the
+data required for flashing. The data like region count and address of each region
+has to be passed to the driver using the devlink param command.
+
+If the device has to be fully erased before firmware flashing, user application
+need to set the erase_full_flash parameter using devlink param command.
+By default, conditional erase feature is supported.
+
+Flash Commands:
+===============
+1) When modem is in Boot ROM stage, user can use below command to inject PSI RAM
+image using devlink flash command.
+
+$ devlink dev flash pci/0000:02:00.0 file <PSI_RAM_File_name>
+
+2) If user want to do a full erase, below command need to be issued to set the
+erase full flash param (To be set only if full erase required).
+
+$ devlink dev param set pci/0000:02:00.0 name erase_full_flash value true cmode runtime
+
+3) Inject EBL after the modem is in PSI stage.
+
+$ devlink dev flash pci/0000:02:00.0 file <EBL_File_name>
+
+4) Once EBL is injected successfully, then the actual firmware flashing takes
+place. Below is the sequence of commands used for each of the firmware images.
+
+a) Flash secure bin file.
+
+$ devlink dev flash pci/0000:02:00.0 file <Secure_bin_file_name>
+
+b) Flashing the Loadmap/Region file
+
+$ devlink dev flash pci/0000:02:00.0 file <Load_map_file_name>
+
+Regions
+=======
+
+The ``iosm`` driver supports dumping the coredump logs.
+
+In case a firmware encounters an exception, a snapshot will be taken by the
+driver. Following regions are accessed for device internal data.
+
+.. list-table:: Regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``report.json``
+ - The summary of exception details logged as part of this region.
+ * - ``coredump.fcd``
+ - This region contains the details related to the exception occurred in the
+ device (RAM dump).
+ * - ``cdd.log``
+ - This region contains the logs related to the modem CDD driver.
+ * - ``eeprom.bin``
+ - This region contains the eeprom logs.
+ * - ``bootcore_trace.bin``
+ - This region contains the current instance of bootloader logs.
+ * - ``bootcore_prev_trace.bin``
+ - This region contains the previous instance of bootloader logs.
+
+
+Region commands
+===============
+
+$ devlink region show
+
+$ devlink region new pci/0000:02:00.0/report.json
+
+$ devlink region dump pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region del pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region new pci/0000:02:00.0/coredump.fcd
+
+$ devlink region dump pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region del pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region new pci/0000:02:00.0/cdd.log
+
+$ devlink region dump pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region del pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region new pci/0000:02:00.0/eeprom.bin
+
+$ devlink region dump pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region del pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region new pci/0000:02:00.0/bootcore_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region del pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region new pci/0000:02:00.0/bootcore_prev_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
+
+$ devlink region del pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
diff --git a/Documentation/networking/devlink/mlx4.rst b/Documentation/networking/devlink/mlx4.rst
new file mode 100644
index 0000000000..7b2d17ea54
--- /dev/null
+++ b/Documentation/networking/devlink/mlx4.rst
@@ -0,0 +1,56 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+mlx4 devlink support
+====================
+
+This document describes the devlink features implemented by the ``mlx4``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``internal_err_reset``
+ - driverinit, runtime
+ * - ``max_macs``
+ - driverinit
+ * - ``region_snapshot_enable``
+ - driverinit, runtime
+
+The ``mlx4`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``enable_64b_cqe_eqe``
+ - Boolean
+ - driverinit
+ - Enable 64 byte CQEs/EQEs, if the FW supports it.
+ * - ``enable_4k_uar``
+ - Boolean
+ - driverinit
+ - Enable using the 4k UAR.
+
+The ``mlx4`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
+
+Regions
+=======
+
+The ``mlx4`` driver supports dumping the firmware PCI crspace and health
+buffer during a critical firmware issue.
+
+In case a firmware command times out, firmware getting stuck, or a non zero
+value on the catastrophic buffer, a snapshot will be taken by the driver.
+
+The ``cr-space`` region will contain the firmware PCI crspace contents. The
+``fw-health`` region will contain the device firmware's health buffer.
+Snapshots for both of these regions are taken on the same event triggers.
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
new file mode 100644
index 0000000000..702f204a3d
--- /dev/null
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -0,0 +1,288 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+mlx5 devlink support
+====================
+
+This document describes the devlink features implemented by the ``mlx5``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ - Validation
+ * - ``enable_roce``
+ - driverinit
+ - Type: Boolean
+
+ If the device supports RoCE disablement, RoCE enablement state controls
+ device support for RoCE capability. Otherwise, the control occurs in the
+ driver stack. When RoCE is disabled at the driver level, only raw
+ ethernet QPs are supported.
+ * - ``io_eq_size``
+ - driverinit
+ - The range is between 64 and 4096.
+ * - ``event_eq_size``
+ - driverinit
+ - The range is between 64 and 4096.
+ * - ``max_macs``
+ - driverinit
+ - The range is between 1 and 2^31. Only power of 2 values are supported.
+
+The ``mlx5`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``flow_steering_mode``
+ - string
+ - runtime
+ - Controls the flow steering mode of the driver
+
+ * ``dmfs`` Device managed flow steering. In DMFS mode, the HW
+ steering entities are created and managed through firmware.
+ * ``smfs`` Software managed flow steering. In SMFS mode, the HW
+ steering entities are created and manage through the driver without
+ firmware intervention.
+
+ SMFS mode is faster and provides better rule insertion rate compared to
+ default DMFS mode.
+ * - ``fdb_large_groups``
+ - u32
+ - driverinit
+ - Control the number of large groups (size > 1) in the FDB table.
+
+ * The default value is 15, and the range is between 1 and 1024.
+ * - ``esw_multiport``
+ - Boolean
+ - runtime
+ - Control MultiPort E-Switch shared fdb mode.
+
+ An experimental mode where a single E-Switch is used and all the vports
+ and physical ports on the NIC are connected to it.
+
+ An example is to send traffic from a VF that is created on PF0 to an
+ uplink that is natively associated with the uplink of PF1
+
+ Note: Future devices, ConnectX-8 and onward, will eventually have this
+ as the default to allow forwarding between all NIC ports in a single
+ E-switch environment and the dual E-switch mode will likely get
+ deprecated.
+
+ Default: disabled
+ * - ``esw_port_metadata``
+ - Boolean
+ - runtime
+ - When applicable, disabling eswitch metadata can increase packet rate up
+ to 20% depending on the use case and packet sizes.
+
+ Eswitch port metadata state controls whether to internally tag packets
+ with metadata. Metadata tagging must be enabled for multi-port RoCE,
+ failover between representors and stacked devices. By default metadata is
+ enabled on the supported devices in E-switch. Metadata is applicable only
+ for E-switch in switchdev mode and users may disable it when NONE of the
+ below use cases will be in use:
+ 1. HCA is in Dual/multi-port RoCE mode.
+ 2. VF/SF representor bonding (Usually used for Live migration)
+ 3. Stacked devices
+
+ When metadata is disabled, the above use cases will fail to initialize if
+ users try to enable them.
+ * - ``hairpin_num_queues``
+ - u32
+ - driverinit
+ - We refer to a TC NIC rule that involves forwarding as "hairpin".
+ Hairpin queues are mlx5 hardware specific implementation for hardware
+ forwarding of such packets.
+
+ Control the number of hairpin queues.
+ * - ``hairpin_queue_size``
+ - u32
+ - driverinit
+ - Control the size (in packets) of the hairpin queues.
+
+The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
+
+Info versions
+=============
+
+The ``mlx5`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw.psid``
+ - fixed
+ - Used to represent the board id of the device.
+ * - ``fw.version``
+ - stored, running
+ - Three digit major.minor.subminor firmware version number.
+
+Health reporters
+================
+
+tx reporter
+-----------
+The tx reporter is responsible for reporting and recovering of the following three error scenarios:
+
+- tx timeout
+ Report on kernel tx timeout detection.
+ Recover by searching lost interrupts.
+- tx error completion
+ Report on error tx completion.
+ Recover by flushing the tx queue and reset it.
+- tx PTP port timestamping CQ unhealthy
+ Report too many CQEs never delivered on port ts CQ.
+ Recover by flushing and re-creating all PTP channels.
+
+tx reporter also support on demand diagnose callback, on which it provides
+real time information of its send queues status.
+
+User commands examples:
+
+- Diagnose send queues status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter tx
+
+.. note::
+ This command has valid output only when interface is up, otherwise the command has empty output.
+
+- Show number of tx errors indicated, number of recover flows ended successfully,
+ is autorecover enabled and graceful period from last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter tx
+
+rx reporter
+-----------
+The rx reporter is responsible for reporting and recovering of the following two error scenarios:
+
+- rx queues' initialization (population) timeout
+ Population of rx queues' descriptors on ring initialization is done
+ in napi context via triggering an irq. In case of a failure to get
+ the minimum amount of descriptors, a timeout would occur, and
+ descriptors could be recovered by polling the EQ (Event Queue).
+- rx completions with errors (reported by HW on interrupt context)
+ Report on rx completion error.
+ Recover (if needed) by flushing the related queue and reset it.
+
+rx reporter also supports on demand diagnose callback, on which it
+provides real time information of its receive queues' status.
+
+- Diagnose rx queues' status and corresponding completion queue::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter rx
+
+.. note::
+ This command has valid output only when interface is up. Otherwise, the command has empty output.
+
+- Show number of rx errors indicated, number of recover flows ended successfully,
+ is autorecover enabled, and graceful period from last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter rx
+
+fw reporter
+-----------
+The fw reporter implements `diagnose` and `dump` callbacks.
+It follows symptoms of fw error such as fw syndrome by triggering
+fw core dump and storing it into the dump buffer.
+The fw reporter diagnose command can be triggered any time by the user to check
+current fw status.
+
+User commands examples:
+
+- Check fw heath status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter fw
+
+- Read FW core dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.0 reporter fw
+
+.. note::
+ This command can run only on the PF which has fw tracer ownership,
+ running it on other PF or any VF will return "Operation not permitted".
+
+fw fatal reporter
+-----------------
+The fw fatal reporter implements `dump` and `recover` callbacks.
+It follows fatal errors indications by CR-space dump and recover flow.
+The CR-space dump uses vsc interface which is valid even if the FW command
+interface is not functional, which is the case in most FW fatal errors.
+The recover function runs recover flow which reloads the driver and triggers fw
+reset if needed.
+On firmware error, the health buffer is dumped into the dmesg. The log
+level is derived from the error's severity (given in health buffer).
+
+User commands examples:
+
+- Run fw recover flow manually::
+
+ $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
+
+- Read FW CR-space dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
+
+.. note::
+ This command can run only on PF.
+
+vnic reporter
+-------------
+The vnic reporter implements only the `diagnose` callback.
+It is responsible for querying the vnic diagnostic counters from fw and displaying
+them in realtime.
+
+Description of the vnic counters:
+
+- total_q_under_processor_handle
+ number of queues in an error state due to
+ an async error or errored command.
+- send_queue_priority_update_flow
+ number of QP/SQ priority/SL update events.
+- cq_overrun
+ number of times CQ entered an error state due to an overflow.
+- async_eq_overrun
+ number of times an EQ mapped to async events was overrun.
+ comp_eq_overrun number of times an EQ mapped to completion events was
+ overrun.
+- quota_exceeded_command
+ number of commands issued and failed due to quota exceeded.
+- invalid_command
+ number of commands issued and failed dues to any reason other than quota
+ exceeded.
+- nic_receive_steering_discard
+ number of packets that completed RX flow
+ steering but were discarded due to a mismatch in flow table.
+- generated_pkt_steering_fail
+ number of packets generated by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow).
+- handled_pkt_steering_fail
+ number of packets handled by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow owned by the VNIC, including the FDB
+ for the eswitch owner).
+
+User commands examples:
+
+- Diagnose PF/VF vnic counters::
+
+ $ devlink health diagnose pci/0000:82:00.1 reporter vnic
+
+- Diagnose representor vnic counters (performed by supplying devlink port of the
+ representor, which can be obtained via devlink port command)::
+
+ $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic
+
+.. note::
+ This command can run over all interfaces such as PF/VF and representor ports.
diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst
new file mode 100644
index 0000000000..433962225b
--- /dev/null
+++ b/Documentation/networking/devlink/mlxsw.rst
@@ -0,0 +1,105 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+mlxsw devlink support
+=====================
+
+This document describes the devlink features implemented by the ``mlxsw``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``fw_load_policy``
+ - driverinit
+
+The ``mlxsw`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``acl_region_rehash_interval``
+ - u32
+ - runtime
+ - Sets an interval for periodic ACL region rehashes. The value is
+ specified in milliseconds, with a minimum of ``3000``. The value of
+ ``0`` disables periodic work entirely. The first rehash will be run
+ immediately after the value is set.
+
+The ``mlxsw`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
+
+Info versions
+=============
+
+The ``mlxsw`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``hw.revision``
+ - fixed
+ - The hardware revision for this board
+ * - ``fw.psid``
+ - fixed
+ - Firmware PSID
+ * - ``fw.version``
+ - running
+ - Three digit firmware version
+
+Line card auxiliary device info versions
+========================================
+
+The ``mlxsw`` driver reports the following versions for line card auxiliary device
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``hw.revision``
+ - fixed
+ - The hardware revision for this line card
+ * - ``ini.version``
+ - running
+ - Version of line card INI loaded
+ * - ``fw.psid``
+ - fixed
+ - Line card device PSID
+ * - ``fw.version``
+ - running
+ - Three digit firmware version of line card device
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``mlxsw``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``irif_disabled``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed from a disabled router interface (RIF). This can happen during
+ RIF dismantle, when the RIF is first disabled before being removed
+ completely
+ * - ``erif_disabled``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed through a disabled router interface (RIF). This can happen during
+ RIF dismantle, when the RIF is first disabled before being removed
+ completely
diff --git a/Documentation/networking/devlink/mv88e6xxx.rst b/Documentation/networking/devlink/mv88e6xxx.rst
new file mode 100644
index 0000000000..c621212a47
--- /dev/null
+++ b/Documentation/networking/devlink/mv88e6xxx.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+mv88e6xxx devlink support
+=========================
+
+This document describes the devlink features implemented by the ``mv88e6xxx``
+device driver.
+
+Parameters
+==========
+
+The ``mv88e6xxx`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``ATU_hash``
+ - u8
+ - runtime
+ - Select one of four possible hashing algorithms for MAC addresses in
+ the Address Translation Unit. A value of 3 may work better than the
+ default of 1 when many MAC addresses have the same OUI. Only the
+ values 0 to 3 are valid for this parameter.
diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst
new file mode 100644
index 0000000000..8848272542
--- /dev/null
+++ b/Documentation/networking/devlink/netdevsim.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+netdevsim devlink support
+=========================
+
+This document describes the ``devlink`` features supported by the
+``netdevsim`` device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``max_macs``
+ - driverinit
+
+The ``netdevsim`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``test1``
+ - Boolean
+ - driverinit
+ - Test parameter used to show how a driver-specific devlink parameter
+ can be implemented.
+
+The ``netdevsim`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
+
+Regions
+=======
+
+The ``netdevsim`` driver exposes a ``dummy`` region as an example of how the
+devlink-region interfaces work. A snapshot is taken whenever the
+``take_snapshot`` debugfs file is written to.
+
+Resources
+=========
+
+The ``netdevsim`` driver exposes resources to control the number of FIB
+entries, FIB rule entries and nexthops that the driver will allow.
+
+.. code:: shell
+
+ $ devlink resource set netdevsim/netdevsim0 path /IPv4/fib size 96
+ $ devlink resource set netdevsim/netdevsim0 path /IPv4/fib-rules size 16
+ $ devlink resource set netdevsim/netdevsim0 path /IPv6/fib size 64
+ $ devlink resource set netdevsim/netdevsim0 path /IPv6/fib-rules size 16
+ $ devlink resource set netdevsim/netdevsim0 path /nexthops size 16
+ $ devlink dev reload netdevsim/netdevsim0
+
+Rate objects
+============
+
+The ``netdevsim`` driver supports rate objects management, which includes:
+
+- registerging/unregistering leaf rate objects per VF devlink port;
+- creation/deletion node rate objects;
+- setting tx_share and tx_max rate values for any rate object type;
+- setting parent node for any rate object type.
+
+Rate nodes and their parameters are exposed in ``netdevsim`` debugfs in RO mode.
+For example created rate node with name ``some_group``:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/rate_groups/some_group
+ rate_parent tx_max tx_share
+
+Same parameters are exposed for leaf objects in corresponding ports directories.
+For ex.:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/ports/1
+ dev ethtool rate_parent tx_max tx_share
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``netdevsim``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fid_miss``
+ - ``exception``
+ - When a packet enters the device it is classified to a filtering
+ identifier (FID) based on the ingress port and VLAN. This trap is used
+ to trap packets for which a FID could not be found
diff --git a/Documentation/networking/devlink/nfp.rst b/Documentation/networking/devlink/nfp.rst
new file mode 100644
index 0000000000..a1717db0df
--- /dev/null
+++ b/Documentation/networking/devlink/nfp.rst
@@ -0,0 +1,65 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+nfp devlink support
+===================
+
+This document describes the devlink features implemented by the ``nfp``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``fw_load_policy``
+ - permanent
+ * - ``reset_dev_on_drv_probe``
+ - permanent
+
+Info versions
+=============
+
+The ``nfp`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``board.id``
+ - fixed
+ - Part number identifying the board design
+ * - ``board.rev``
+ - fixed
+ - Revision of the board design
+ * - ``board.manufacture``
+ - fixed
+ - Vendor of the board design
+ * - ``board.model``
+ - fixed
+ - Model name of the board design
+ * - ``fw.bundle_id``
+ - stored, running
+ - Firmware bundle id
+ * - ``fw.mgmt``
+ - stored, running
+ - Version of the management firmware
+ * - ``fw.cpld``
+ - stored, running
+ - The CPLD firmware component version
+ * - ``fw.app``
+ - stored, running
+ - The APP firmware component version
+ * - ``fw.undi``
+ - stored, running
+ - The UNDI firmware component version
+ * - ``fw.ncsi``
+ - stored, running
+ - The NSCI firmware component version
+ * - ``chip.init``
+ - stored, running
+ - The CFGR firmware component version
diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst
new file mode 100644
index 0000000000..610de99b72
--- /dev/null
+++ b/Documentation/networking/devlink/octeontx2.rst
@@ -0,0 +1,42 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+octeontx2 devlink support
+=========================
+
+This document describes the devlink features implemented by the ``octeontx2 AF, PF and VF``
+device drivers.
+
+Parameters
+==========
+
+The ``octeontx2 PF and VF`` drivers implement the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``mcam_count``
+ - u16
+ - runtime
+ - Select number of match CAM entries to be allocated for an interface.
+ The same is used for ntuple filters of the interface. Supported by
+ PF and VF drivers.
+
+The ``octeontx2 AF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``dwrr_mtu``
+ - u32
+ - runtime
+ - Use to set the quantum which hardware uses for scheduling among transmit queues.
+ Hardware uses weighted DWRR algorithm to schedule among all transmit queues.
diff --git a/Documentation/networking/devlink/prestera.rst b/Documentation/networking/devlink/prestera.rst
new file mode 100644
index 0000000000..96b1124e61
--- /dev/null
+++ b/Documentation/networking/devlink/prestera.rst
@@ -0,0 +1,141 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+prestera devlink support
+========================
+
+This document describes the devlink features implemented by the ``prestera``
+device driver.
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``arp_bc``
+ - ``trap``
+ - Traps ARP broadcast packets (both requests/responses)
+ * - ``is_is``
+ - ``trap``
+ - Traps IS-IS packets
+ * - ``ospf``
+ - ``trap``
+ - Traps OSPF packets
+ * - ``ip_bc_mac``
+ - ``trap``
+ - Traps IPv4 packets with broadcast DA Mac address
+ * - ``stp``
+ - ``trap``
+ - Traps STP BPDU
+ * - ``lacp``
+ - ``trap``
+ - Traps LACP packets
+ * - ``lldp``
+ - ``trap``
+ - Traps LLDP packets
+ * - ``router_mc``
+ - ``trap``
+ - Traps multicast packets
+ * - ``vrrp``
+ - ``trap``
+ - Traps VRRP packets
+ * - ``dhcp``
+ - ``trap``
+ - Traps DHCP packets
+ * - ``mtu_error``
+ - ``trap``
+ - Traps (exception) packets that exceeded port's MTU
+ * - ``mac_to_me``
+ - ``trap``
+ - Traps packets with switch-port's DA Mac address
+ * - ``ttl_error``
+ - ``trap``
+ - Traps (exception) IPv4 packets whose TTL exceeded
+ * - ``ipv4_options``
+ - ``trap``
+ - Traps (exception) packets due to the malformed IPV4 header options
+ * - ``ip_default_route``
+ - ``trap``
+ - Traps packets that have no specific IP interface (IP to me) and no forwarding prefix
+ * - ``local_route``
+ - ``trap``
+ - Traps packets that have been send to one of switch IP interfaces addresses
+ * - ``ipv4_icmp_redirect``
+ - ``trap``
+ - Traps (exception) IPV4 ICMP redirect packets
+ * - ``arp_response``
+ - ``trap``
+ - Traps ARP replies packets that have switch-port's DA Mac address
+ * - ``acl_code_0``
+ - ``trap``
+ - Traps packets that have ACL priority set to 0 (tc pref 0)
+ * - ``acl_code_1``
+ - ``trap``
+ - Traps packets that have ACL priority set to 1 (tc pref 1)
+ * - ``acl_code_2``
+ - ``trap``
+ - Traps packets that have ACL priority set to 2 (tc pref 2)
+ * - ``acl_code_3``
+ - ``trap``
+ - Traps packets that have ACL priority set to 3 (tc pref 3)
+ * - ``acl_code_4``
+ - ``trap``
+ - Traps packets that have ACL priority set to 4 (tc pref 4)
+ * - ``acl_code_5``
+ - ``trap``
+ - Traps packets that have ACL priority set to 5 (tc pref 5)
+ * - ``acl_code_6``
+ - ``trap``
+ - Traps packets that have ACL priority set to 6 (tc pref 6)
+ * - ``acl_code_7``
+ - ``trap``
+ - Traps packets that have ACL priority set to 7 (tc pref 7)
+ * - ``ipv4_bgp``
+ - ``trap``
+ - Traps IPv4 BGP packets
+ * - ``ssh``
+ - ``trap``
+ - Traps SSH packets
+ * - ``telnet``
+ - ``trap``
+ - Traps Telnet packets
+ * - ``icmp``
+ - ``trap``
+ - Traps ICMP packets
+ * - ``rxdma_drop``
+ - ``drop``
+ - Drops packets (RxDMA) due to the lack of ingress buffers etc.
+ * - ``port_no_vlan``
+ - ``drop``
+ - Drops packets due to faulty-configured network or due to internal bug (config issue).
+ * - ``local_port``
+ - ``drop``
+ - Drops packets whose decision (FDB entry) is to bridge packet back to the incoming port/trunk.
+ * - ``invalid_sa``
+ - ``drop``
+ - Drops packets with multicast source MAC address.
+ * - ``illegal_ip_addr``
+ - ``drop``
+ - Drops packets with illegal SIP/DIP multicast/unicast addresses.
+ * - ``illegal_ipv4_hdr``
+ - ``drop``
+ - Drops packets with illegal IPV4 header.
+ * - ``ip_uc_dip_da_mismatch``
+ - ``drop``
+ - Drops packets with destination MAC being unicast, but destination IP address being multicast.
+ * - ``ip_sip_is_zero``
+ - ``drop``
+ - Drops packets with zero (0) IPV4 source address.
+ * - ``met_red``
+ - ``drop``
+ - Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth.
diff --git a/Documentation/networking/devlink/qed.rst b/Documentation/networking/devlink/qed.rst
new file mode 100644
index 0000000000..805c6f6362
--- /dev/null
+++ b/Documentation/networking/devlink/qed.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+qed devlink support
+===================
+
+This document describes the devlink features implemented by the ``qed`` core
+device driver.
+
+Parameters
+==========
+
+The ``qed`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``iwarp_cmt``
+ - Boolean
+ - runtime
+ - Enable iWARP functionality for 100g devices. Note that this impacts
+ L2 performance, and is therefore not enabled by default.
diff --git a/Documentation/networking/devlink/sfc.rst b/Documentation/networking/devlink/sfc.rst
new file mode 100644
index 0000000000..db64a1bd97
--- /dev/null
+++ b/Documentation/networking/devlink/sfc.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+sfc devlink support
+===================
+
+This document describes the devlink features implemented by the ``sfc``
+device driver for the ef100 device.
+
+Info versions
+=============
+
+The ``sfc`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw.mgmt.suc``
+ - running
+ - For boards where the management function is split between multiple
+ control units, this is the SUC control unit's firmware version.
+ * - ``fw.mgmt.cmc``
+ - running
+ - For boards where the management function is split between multiple
+ control units, this is the CMC control unit's firmware version.
+ * - ``fpga.rev``
+ - running
+ - FPGA design revision.
+ * - ``fpga.app``
+ - running
+ - Datapath programmable logic version.
+ * - ``fw.app``
+ - running
+ - Datapath software/microcode/firmware version.
+ * - ``coproc.boot``
+ - running
+ - SmartNIC application co-processor (APU) first stage boot loader version.
+ * - ``coproc.uboot``
+ - running
+ - SmartNIC application co-processor (APU) co-operating system loader version.
+ * - ``coproc.main``
+ - running
+ - SmartNIC application co-processor (APU) main operating system version.
+ * - ``coproc.recovery``
+ - running
+ - SmartNIC application co-processor (APU) recovery operating system version.
+ * - ``fw.exprom``
+ - running
+ - Expansion ROM version. For boards where the expansion ROM is split between
+ multiple images (e.g. PXE and UEFI), this is the specifically the PXE boot
+ ROM version.
+ * - ``fw.uefi``
+ - running
+ - UEFI driver version (No UNDI support).
diff --git a/Documentation/networking/devlink/ti-cpsw-switch.rst b/Documentation/networking/devlink/ti-cpsw-switch.rst
new file mode 100644
index 0000000000..dc399e32ab
--- /dev/null
+++ b/Documentation/networking/devlink/ti-cpsw-switch.rst
@@ -0,0 +1,31 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+ti-cpsw-switch devlink support
+==============================
+
+This document describes the devlink features implemented by the ``ti-cpsw-switch``
+device driver.
+
+Parameters
+==========
+
+The ``ti-cpsw-switch`` driver implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``ale_bypass``
+ - Boolean
+ - runtime
+ - Enables ALE_CONTROL(4).BYPASS mode for debugging purposes. In this
+ mode, all packets will be sent to the host port only.
+ * - ``switch_mode``
+ - Boolean
+ - runtime
+ - Enable switch mode
diff --git a/Documentation/networking/dns_resolver.rst b/Documentation/networking/dns_resolver.rst
new file mode 100644
index 0000000000..add4d59a99
--- /dev/null
+++ b/Documentation/networking/dns_resolver.rst
@@ -0,0 +1,155 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+DNS Resolver Module
+===================
+
+.. Contents:
+
+ - Overview.
+ - Compilation.
+ - Setting up.
+ - Usage.
+ - Mechanism.
+ - Debugging.
+
+
+Overview
+========
+
+The DNS resolver module provides a way for kernel services to make DNS queries
+by way of requesting a key of key type dns_resolver. These queries are
+upcalled to userspace through /sbin/request-key.
+
+These routines must be supported by userspace tools dns.upcall, cifs.upcall and
+request-key. It is under development and does not yet provide the full feature
+set. The features it does support include:
+
+ (*) Implements the dns_resolver key_type to contact userspace.
+
+It does not yet support the following AFS features:
+
+ (*) Dns query support for AFSDB resource record.
+
+This code is extracted from the CIFS filesystem.
+
+
+Compilation
+===========
+
+The module should be enabled by turning on the kernel configuration options::
+
+ CONFIG_DNS_RESOLVER - tristate "DNS Resolver support"
+
+
+Setting up
+==========
+
+To set up this facility, the /etc/request-key.conf file must be altered so that
+/sbin/request-key can appropriately direct the upcalls. For example, to handle
+basic dname to IPv4/IPv6 address resolution, the following line should be
+added::
+
+
+ #OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ...
+ #====== ============ ======= ======= ==========================
+ create dns_resolver * * /usr/sbin/cifs.upcall %k
+
+To direct a query for query type 'foo', a line of the following should be added
+before the more general line given above as the first match is the one taken::
+
+ create dns_resolver foo:* * /usr/sbin/dns.foo %k
+
+
+Usage
+=====
+
+To make use of this facility, one of the following functions that are
+implemented in the module can be called after doing::
+
+ #include <linux/dns_resolver.h>
+
+ ::
+
+ int dns_query(const char *type, const char *name, size_t namelen,
+ const char *options, char **_result, time_t *_expiry);
+
+ This is the basic access function. It looks for a cached DNS query and if
+ it doesn't find it, it upcalls to userspace to make a new DNS query, which
+ may then be cached. The key description is constructed as a string of the
+ form::
+
+ [<type>:]<name>
+
+ where <type> optionally specifies the particular upcall program to invoke,
+ and thus the type of query to do, and <name> specifies the string to be
+ looked up. The default query type is a straight hostname to IP address
+ set lookup.
+
+ The name parameter is not required to be a NUL-terminated string, and its
+ length should be given by the namelen argument.
+
+ The options parameter may be NULL or it may be a set of options
+ appropriate to the query type.
+
+ The return value is a string appropriate to the query type. For instance,
+ for the default query type it is just a list of comma-separated IPv4 and
+ IPv6 addresses. The caller must free the result.
+
+ The length of the result string is returned on success, and a negative
+ error code is returned otherwise. -EKEYREJECTED will be returned if the
+ DNS lookup failed.
+
+ If _expiry is non-NULL, the expiry time (TTL) of the result will be
+ returned also.
+
+The kernel maintains an internal keyring in which it caches looked up keys.
+This can be cleared by any process that has the CAP_SYS_ADMIN capability by
+the use of KEYCTL_KEYRING_CLEAR on the keyring ID.
+
+
+Reading DNS Keys from Userspace
+===============================
+
+Keys of dns_resolver type can be read from userspace using keyctl_read() or
+"keyctl read/print/pipe".
+
+
+Mechanism
+=========
+
+The dnsresolver module registers a key type called "dns_resolver". Keys of
+this type are used to transport and cache DNS lookup results from userspace.
+
+When dns_query() is invoked, it calls request_key() to search the local
+keyrings for a cached DNS result. If that fails to find one, it upcalls to
+userspace to get a new result.
+
+Upcalls to userspace are made through the request_key() upcall vector, and are
+directed by means of configuration lines in /etc/request-key.conf that tell
+/sbin/request-key what program to run to instantiate the key.
+
+The upcall handler program is responsible for querying the DNS, processing the
+result into a form suitable for passing to the keyctl_instantiate_key()
+routine. This then passes the data to dns_resolver_instantiate() which strips
+off and processes any options included in the data, and then attaches the
+remainder of the string to the key as its payload.
+
+The upcall handler program should set the expiry time on the key to that of the
+lowest TTL of all the records it has extracted a result from. This means that
+the key will be discarded and recreated when the data it holds has expired.
+
+dns_query() returns a copy of the value attached to the key, or an error if
+that is indicated instead.
+
+See <file:Documentation/security/keys/request-key.rst> for further
+information about request-key function.
+
+
+Debugging
+=========
+
+Debugging messages can be turned on dynamically by writing a 1 into the
+following file::
+
+ /sys/module/dnsresolver/parameters/debug
diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst
new file mode 100644
index 0000000000..4f5dfa9c02
--- /dev/null
+++ b/Documentation/networking/driver.rst
@@ -0,0 +1,127 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Softnet Driver Issues
+=====================
+
+Probing guidelines
+==================
+
+Address validation
+------------------
+
+Any hardware layer address you obtain for your device should
+be verified. For example, for ethernet check it with
+linux/etherdevice.h:is_valid_ether_addr()
+
+Close/stop guidelines
+=====================
+
+Quiescence
+----------
+
+After the ndo_stop routine has been called, the hardware must
+not receive or transmit any data. All in flight packets must
+be aborted. If necessary, poll or wait for completion of
+any reset commands.
+
+Auto-close
+----------
+
+The ndo_stop routine will be called by unregister_netdevice
+if device is still UP.
+
+Transmit path guidelines
+========================
+
+Stop queues in advance
+----------------------
+
+The ndo_start_xmit method must not return NETDEV_TX_BUSY under
+any normal circumstances. It is considered a hard error unless
+there is no way your device can tell ahead of time when its
+transmit function will become busy.
+
+Instead it must maintain the queue properly. For example,
+for a driver implementing scatter-gather this means:
+
+.. code-block:: c
+
+ static u32 drv_tx_avail(struct drv_ring *dr)
+ {
+ u32 used = READ_ONCE(dr->prod) - READ_ONCE(dr->cons);
+
+ return dr->tx_ring_size - (used & bp->tx_ring_mask);
+ }
+
+ static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb,
+ struct net_device *dev)
+ {
+ struct drv *dp = netdev_priv(dev);
+ struct netdev_queue *txq;
+ struct drv_ring *dr;
+ int idx;
+
+ idx = skb_get_queue_mapping(skb);
+ dr = dp->tx_rings[idx];
+ txq = netdev_get_tx_queue(dev, idx);
+
+ //...
+ /* This should be a very rare race - log it. */
+ if (drv_tx_avail(dr) <= skb_shinfo(skb)->nr_frags + 1) {
+ netif_stop_queue(dev);
+ netdev_warn(dev, "Tx Ring full when queue awake!\n");
+ return NETDEV_TX_BUSY;
+ }
+
+ //... queue packet to card ...
+
+ netdev_tx_sent_queue(txq, skb->len);
+
+ //... update tx producer index using WRITE_ONCE() ...
+
+ if (!netif_txq_maybe_stop(txq, drv_tx_avail(dr),
+ MAX_SKB_FRAGS + 1, 2 * MAX_SKB_FRAGS))
+ dr->stats.stopped++;
+
+ //...
+ return NETDEV_TX_OK;
+ }
+
+And then at the end of your TX reclamation event handling:
+
+.. code-block:: c
+
+ //... update tx consumer index using WRITE_ONCE() ...
+
+ netif_txq_completed_wake(txq, cmpl_pkts, cmpl_bytes,
+ drv_tx_avail(dr), 2 * MAX_SKB_FRAGS);
+
+Lockless queue stop / wake helper macros
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: include/net/netdev_queues.h
+ :doc: Lockless queue stopping / waking helpers.
+
+No exclusive ownership
+----------------------
+
+An ndo_start_xmit method must not modify the shared parts of a
+cloned SKB.
+
+Timely completions
+------------------
+
+Do not forget that once you return NETDEV_TX_OK from your
+ndo_start_xmit method, it is your driver's responsibility to free
+up the SKB and in some finite amount of time.
+
+For example, this means that it is not allowed for your TX
+mitigation scheme to let TX packets "hang out" in the TX
+ring unreclaimed forever if no new TX packets are sent.
+This error can deadlock sockets waiting for send buffer room
+to be freed up.
+
+If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
+must not keep any reference to that SKB and you must not attempt
+to free it up.
diff --git a/Documentation/networking/dsa/b53.rst b/Documentation/networking/dsa/b53.rst
new file mode 100644
index 0000000000..b41637cdb8
--- /dev/null
+++ b/Documentation/networking/dsa/b53.rst
@@ -0,0 +1,183 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================================
+Broadcom RoboSwitch Ethernet switch driver
+==========================================
+
+The Broadcom RoboSwitch Ethernet switch family is used in quite a range of
+xDSL router, cable modems and other multimedia devices.
+
+The actual implementation supports the devices BCM5325E, BCM5365, BCM539x,
+BCM53115 and BCM53125 as well as BCM63XX.
+
+Implementation details
+======================
+
+The driver is located in ``drivers/net/dsa/b53/`` and is implemented as a
+DSA driver; see ``Documentation/networking/dsa/dsa.rst`` for details on the
+subsystem and what it provides.
+
+The switch is, if possible, configured to enable a Broadcom specific 4-bytes
+switch tag which gets inserted by the switch for every packet forwarded to the
+CPU interface, conversely, the CPU network interface should insert a similar
+tag for packets entering the CPU port. The tag format is described in
+``net/dsa/tag_brcm.c``.
+
+The configuration of the device depends on whether or not tagging is
+supported.
+
+The interface names and example network configuration are used according the
+configuration described in the :ref:`dsa-config-showcases`.
+
+Configuration with tagging support
+----------------------------------
+
+The tagging based configuration is desired. It is not specific to the b53
+DSA driver and will work like all DSA drivers which supports tagging.
+
+See :ref:`dsa-tagged-configuration`.
+
+Configuration without tagging support
+-------------------------------------
+
+Older models (5325, 5365) support a different tag format that is not supported
+yet. 539x and 531x5 require managed mode and some special handling, which is
+also not yet supported. The tagging support is disabled in these cases and the
+switch need a different configuration.
+
+The configuration slightly differ from the :ref:`dsa-vlan-configuration`.
+
+The b53 tags the CPU port in all VLANs, since otherwise any PVID untagged
+VLAN programming would basically change the CPU port's default PVID and make
+it untagged, undesirable.
+
+In difference to the configuration described in :ref:`dsa-vlan-configuration`
+the default VLAN 1 has to be removed from the slave interface configuration in
+single port and gateway configuration, while there is no need to add an extra
+VLAN configuration in the bridge showcase.
+
+single port
+~~~~~~~~~~~
+The configuration can only be set up via VLAN tagging and bridge setup.
+By default packages are tagged with vid 1:
+
+.. code-block:: sh
+
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+ ip link add link eth0 name eth0.2 type vlan id 2
+ ip link add link eth0 name eth0.3 type vlan id 3
+
+ # The master interface needs to be brought up before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+ ip link set eth0.2 up
+ ip link set eth0.3 up
+
+ # bring up the slave interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ # add ports to bridges
+ ip link set dev wan master br0
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 2 pvid untagged
+ bridge vlan del dev lan1 vid 1
+ bridge vlan add dev lan2 vid 3 pvid untagged
+ bridge vlan del dev lan2 vid 1
+
+ # configure the VLANs
+ ip addr add 192.0.2.1/30 dev eth0.1
+ ip addr add 192.0.2.5/30 dev eth0.2
+ ip addr add 192.0.2.9/30 dev eth0.3
+
+ # bring up the bridge devices
+ ip link set br0 up
+
+
+bridge
+~~~~~~
+
+.. code-block:: sh
+
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+
+ # The master interface needs to be brought up before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+
+ # bring up the slave interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ # add ports to bridge
+ ip link set dev wan master br0
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set eth0.1 master br0
+
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
+
+ # bring up the bridge
+ ip link set dev br0 up
+
+gateway
+~~~~~~~
+
+.. code-block:: sh
+
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+ ip link add link eth0 name eth0.2 type vlan id 2
+
+ # The master interface needs to be brought up before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+ ip link set eth0.2 up
+
+ # bring up the slave interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ # add ports to bridges
+ ip link set dev wan master br0
+ ip link set eth0.1 master br0
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+
+ # tag traffic on ports
+ bridge vlan add dev wan vid 2 pvid untagged
+ bridge vlan del dev wan vid 1
+
+ # configure the VLANs
+ ip addr add 192.0.2.1/30 dev eth0.2
+ ip addr add 192.0.2.129/25 dev br0
+
+ # bring up the bridge devices
+ ip link set br0 up
diff --git a/Documentation/networking/dsa/bcm_sf2.rst b/Documentation/networking/dsa/bcm_sf2.rst
new file mode 100644
index 0000000000..dee234039e
--- /dev/null
+++ b/Documentation/networking/dsa/bcm_sf2.rst
@@ -0,0 +1,115 @@
+=============================================
+Broadcom Starfighter 2 Ethernet switch driver
+=============================================
+
+Broadcom's Starfighter 2 Ethernet switch hardware block is commonly found and
+deployed in the following products:
+
+- xDSL gateways such as BCM63138
+- streaming/multimedia Set Top Box such as BCM7445
+- Cable Modem/residential gateways such as BCM7145/BCM3390
+
+The switch is typically deployed in a configuration involving between 5 to 13
+ports, offering a range of built-in and customizable interfaces:
+
+- single integrated Gigabit PHY
+- quad integrated Gigabit PHY
+- quad external Gigabit PHY w/ MDIO multiplexer
+- integrated MoCA PHY
+- several external MII/RevMII/GMII/RGMII interfaces
+
+The switch also supports specific congestion control features which allow MoCA
+fail-over not to lose packets during a MoCA role re-election, as well as out of
+band back-pressure to the host CPU network interface when downstream interfaces
+are connected at a lower speed.
+
+The switch hardware block is typically interfaced using MMIO accesses and
+contains a bunch of sub-blocks/registers:
+
+- ``SWITCH_CORE``: common switch registers
+- ``SWITCH_REG``: external interfaces switch register
+- ``SWITCH_MDIO``: external MDIO bus controller (there is another one in SWITCH_CORE,
+ which is used for indirect PHY accesses)
+- ``SWITCH_INDIR_RW``: 64-bits wide register helper block
+- ``SWITCH_INTRL2_0/1``: Level-2 interrupt controllers
+- ``SWITCH_ACB``: Admission control block
+- ``SWITCH_FCB``: Fail-over control block
+
+Implementation details
+======================
+
+The driver is located in ``drivers/net/dsa/bcm_sf2.c`` and is implemented as a DSA
+driver; see ``Documentation/networking/dsa/dsa.rst`` for details on the subsystem
+and what it provides.
+
+The SF2 switch is configured to enable a Broadcom specific 4-bytes switch tag
+which gets inserted by the switch for every packet forwarded to the CPU
+interface, conversely, the CPU network interface should insert a similar tag for
+packets entering the CPU port. The tag format is described in
+``net/dsa/tag_brcm.c``.
+
+Overall, the SF2 driver is a fairly regular DSA driver; there are a few
+specifics covered below.
+
+Device Tree probing
+-------------------
+
+The DSA platform device driver is probed using a specific compatible string
+provided in ``net/dsa/dsa.c``. The reason for that is because the DSA subsystem gets
+registered as a platform device driver currently. DSA will provide the needed
+device_node pointers which are then accessible by the switch driver setup
+function to setup resources such as register ranges and interrupts. This
+currently works very well because none of the of_* functions utilized by the
+driver require a struct device to be bound to a struct device_node, but things
+may change in the future.
+
+MDIO indirect accesses
+----------------------
+
+Due to a limitation in how Broadcom switches have been designed, external
+Broadcom switches connected to a SF2 require the use of the DSA slave MDIO bus
+in order to properly configure them. By default, the SF2 pseudo-PHY address, and
+an external switch pseudo-PHY address will both be snooping for incoming MDIO
+transactions, since they are at the same address (30), resulting in some kind of
+"double" programming. Using DSA, and setting ``ds->phys_mii_mask`` accordingly, we
+selectively divert reads and writes towards external Broadcom switches
+pseudo-PHY addresses. Newer revisions of the SF2 hardware have introduced a
+configurable pseudo-PHY address which circumvents the initial design limitation.
+
+Multimedia over CoAxial (MoCA) interfaces
+-----------------------------------------
+
+MoCA interfaces are fairly specific and require the use of a firmware blob which
+gets loaded onto the MoCA processor(s) for packet processing. The switch
+hardware contains logic which will assert/de-assert link states accordingly for
+the MoCA interface whenever the MoCA coaxial cable gets disconnected or the
+firmware gets reloaded. The SF2 driver relies on such events to properly set its
+MoCA interface carrier state and properly report this to the networking stack.
+
+The MoCA interfaces are supported using the PHY library's fixed PHY/emulated PHY
+device and the switch driver registers a ``fixed_link_update`` callback for such
+PHYs which reflects the link state obtained from the interrupt handler.
+
+
+Power Management
+----------------
+
+Whenever possible, the SF2 driver tries to minimize the overall switch power
+consumption by applying a combination of:
+
+- turning off internal buffers/memories
+- disabling packet processing logic
+- putting integrated PHYs in IDDQ/low-power
+- reducing the switch core clock based on the active port count
+- enabling and advertising EEE
+- turning off RGMII data processing logic when the link goes down
+
+Wake-on-LAN
+-----------
+
+Wake-on-LAN is currently implemented by utilizing the host processor Ethernet
+MAC controller wake-on logic. Whenever Wake-on-LAN is requested, an intersection
+between the user request and the supported host Ethernet interface WoL
+capabilities is done and the intersection result gets configured. During
+system-wide suspend/resume, only ports not participating in Wake-on-LAN are
+disabled.
diff --git a/Documentation/networking/dsa/configuration.rst b/Documentation/networking/dsa/configuration.rst
new file mode 100644
index 0000000000..d2934c40f0
--- /dev/null
+++ b/Documentation/networking/dsa/configuration.rst
@@ -0,0 +1,458 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+DSA switch configuration from userspace
+=======================================
+
+The DSA switch configuration is not integrated into the main userspace
+network configuration suites by now and has to be performed manually.
+
+.. _dsa-config-showcases:
+
+Configuration showcases
+-----------------------
+
+To configure a DSA switch a couple of commands need to be executed. In this
+documentation some common configuration scenarios are handled as showcases:
+
+*single port*
+ Every switch port acts as a different configurable Ethernet port
+
+*bridge*
+ Every switch port is part of one configurable Ethernet bridge
+
+*gateway*
+ Every switch port except one upstream port is part of a configurable
+ Ethernet bridge.
+ The upstream port acts as different configurable Ethernet port.
+
+All configurations are performed with tools from iproute2, which is available
+at https://www.kernel.org/pub/linux/utils/net/iproute2/
+
+Through DSA every port of a switch is handled like a normal linux Ethernet
+interface. The CPU port is the switch port connected to an Ethernet MAC chip.
+The corresponding linux Ethernet interface is called the master interface.
+All other corresponding linux interfaces are called slave interfaces.
+
+The slave interfaces depend on the master interface being up in order for them
+to send or receive traffic. Prior to kernel v5.12, the state of the master
+interface had to be managed explicitly by the user. Starting with kernel v5.12,
+the behavior is as follows:
+
+- when a DSA slave interface is brought up, the master interface is
+ automatically brought up.
+- when the master interface is brought down, all DSA slave interfaces are
+ automatically brought down.
+
+In this documentation the following Ethernet interfaces are used:
+
+*eth0*
+ the master interface
+
+*eth1*
+ another master interface
+
+*lan1*
+ a slave interface
+
+*lan2*
+ another slave interface
+
+*lan3*
+ a third slave interface
+
+*wan*
+ A slave interface dedicated for upstream traffic
+
+Further Ethernet interfaces can be configured similar.
+The configured IPs and networks are:
+
+*single port*
+ * lan1: 192.0.2.1/30 (192.0.2.0 - 192.0.2.3)
+ * lan2: 192.0.2.5/30 (192.0.2.4 - 192.0.2.7)
+ * lan3: 192.0.2.9/30 (192.0.2.8 - 192.0.2.11)
+
+*bridge*
+ * br0: 192.0.2.129/25 (192.0.2.128 - 192.0.2.255)
+
+*gateway*
+ * br0: 192.0.2.129/25 (192.0.2.128 - 192.0.2.255)
+ * wan: 192.0.2.1/30 (192.0.2.0 - 192.0.2.3)
+
+.. _dsa-tagged-configuration:
+
+Configuration with tagging support
+----------------------------------
+
+The tagging based configuration is desired and supported by the majority of
+DSA switches. These switches are capable to tag incoming and outgoing traffic
+without using a VLAN based configuration.
+
+*single port*
+ .. code-block:: sh
+
+ # configure each interface
+ ip addr add 192.0.2.1/30 dev lan1
+ ip addr add 192.0.2.5/30 dev lan2
+ ip addr add 192.0.2.9/30 dev lan3
+
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+
+ # bring up the slave interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
+
+*bridge*
+ .. code-block:: sh
+
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+
+ # bring up the slave interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
+
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
+
+ # bring up the bridge
+ ip link set dev br0 up
+
+*gateway*
+ .. code-block:: sh
+
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+
+ # bring up the slave interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
+
+ # configure the upstream port
+ ip addr add 192.0.2.1/30 dev wan
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
+
+ # bring up the bridge
+ ip link set dev br0 up
+
+.. _dsa-vlan-configuration:
+
+Configuration without tagging support
+-------------------------------------
+
+A minority of switches are not capable to use a taging protocol
+(DSA_TAG_PROTO_NONE). These switches can be configured by a VLAN based
+configuration.
+
+*single port*
+ The configuration can only be set up via VLAN tagging and bridge setup.
+
+ .. code-block:: sh
+
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+ ip link add link eth0 name eth0.2 type vlan id 2
+ ip link add link eth0 name eth0.3 type vlan id 3
+
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+ ip link set eth0.2 up
+ ip link set eth0.3 up
+
+ # bring up the slave interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ # add ports to bridges
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
+
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 2 pvid untagged
+ bridge vlan add dev lan3 vid 3 pvid untagged
+
+ # configure the VLANs
+ ip addr add 192.0.2.1/30 dev eth0.1
+ ip addr add 192.0.2.5/30 dev eth0.2
+ ip addr add 192.0.2.9/30 dev eth0.3
+
+ # bring up the bridge devices
+ ip link set br0 up
+
+
+*bridge*
+ .. code-block:: sh
+
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+
+ # bring up the slave interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
+ ip link set eth0.1 master br0
+
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 1 pvid untagged
+ bridge vlan add dev lan3 vid 1 pvid untagged
+
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
+
+ # bring up the bridge
+ ip link set dev br0 up
+
+*gateway*
+ .. code-block:: sh
+
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+ ip link add link eth0 name eth0.2 type vlan id 2
+
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+ ip link set eth0.2 up
+
+ # bring up the slave interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ # add ports to bridges
+ ip link set dev wan master br0
+ ip link set eth0.1 master br0
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 1 pvid untagged
+ bridge vlan add dev wan vid 2 pvid untagged
+
+ # configure the VLANs
+ ip addr add 192.0.2.1/30 dev eth0.2
+ ip addr add 192.0.2.129/25 dev br0
+
+ # bring up the bridge devices
+ ip link set br0 up
+
+Forwarding database (FDB) management
+------------------------------------
+
+The existing DSA switches do not have the necessary hardware support to keep
+the software FDB of the bridge in sync with the hardware tables, so the two
+tables are managed separately (``bridge fdb show`` queries both, and depending
+on whether the ``self`` or ``master`` flags are being used, a ``bridge fdb
+add`` or ``bridge fdb del`` command acts upon entries from one or both tables).
+
+Up until kernel v4.14, DSA only supported user space management of bridge FDB
+entries using the bridge bypass operations (which do not update the software
+FDB, just the hardware one) using the ``self`` flag (which is optional and can
+be omitted).
+
+ .. code-block:: sh
+
+ bridge fdb add dev swp0 00:01:02:03:04:05 self static
+ # or shorthand
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Due to a bug, the bridge bypass FDB implementation provided by DSA did not
+distinguish between ``static`` and ``local`` FDB entries (``static`` are meant
+to be forwarded, while ``local`` are meant to be locally terminated, i.e. sent
+to the host port). Instead, all FDB entries with the ``self`` flag (implicit or
+explicit) are treated by DSA as ``static`` even if they are ``local``.
+
+ .. code-block:: sh
+
+ # This command:
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+ # behaves the same for DSA as this command:
+ bridge fdb add dev swp0 00:01:02:03:04:05 local
+ # or shorthand, because the 'local' flag is implicit if 'static' is not
+ # specified, it also behaves the same as:
+ bridge fdb add dev swp0 00:01:02:03:04:05
+
+The last command is an incorrect way of adding a static bridge FDB entry to a
+DSA switch using the bridge bypass operations, and works by mistake. Other
+drivers will treat an FDB entry added by the same command as ``local`` and as
+such, will not forward it, as opposed to DSA.
+
+Between kernel v4.14 and v5.14, DSA has supported in parallel two modes of
+adding a bridge FDB entry to the switch: the bridge bypass discussed above, as
+well as a new mode using the ``master`` flag which installs FDB entries in the
+software bridge too.
+
+ .. code-block:: sh
+
+ bridge fdb add dev swp0 00:01:02:03:04:05 master static
+
+Since kernel v5.14, DSA has gained stronger integration with the bridge's
+software FDB, and the support for its bridge bypass FDB implementation (using
+the ``self`` flag) has been removed. This results in the following changes:
+
+ .. code-block:: sh
+
+ # This is the only valid way of adding an FDB entry that is supported,
+ # compatible with v4.14 kernels and later:
+ bridge fdb add dev swp0 00:01:02:03:04:05 master static
+ # This command is no longer buggy and the entry is properly treated as
+ # 'local' instead of being forwarded:
+ bridge fdb add dev swp0 00:01:02:03:04:05
+ # This command no longer installs a static FDB entry to hardware:
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Script writers are therefore encouraged to use the ``master static`` set of
+flags when working with bridge FDB entries on DSA switch interfaces.
+
+Affinity of user ports to CPU ports
+-----------------------------------
+
+Typically, DSA switches are attached to the host via a single Ethernet
+interface, but in cases where the switch chip is discrete, the hardware design
+may permit the use of 2 or more ports connected to the host, for an increase in
+termination throughput.
+
+DSA can make use of multiple CPU ports in two ways. First, it is possible to
+statically assign the termination traffic associated with a certain user port
+to be processed by a certain CPU port. This way, user space can implement
+custom policies of static load balancing between user ports, by spreading the
+affinities according to the available CPU ports.
+
+Secondly, it is possible to perform load balancing between CPU ports on a per
+packet basis, rather than statically assigning user ports to CPU ports.
+This can be achieved by placing the DSA masters under a LAG interface (bonding
+or team). DSA monitors this operation and creates a mirror of this software LAG
+on the CPU ports facing the physical DSA masters that constitute the LAG slave
+devices.
+
+To make use of multiple CPU ports, the firmware (device tree) description of
+the switch must mark all the links between CPU ports and their DSA masters
+using the ``ethernet`` reference/phandle. At startup, only a single CPU port
+and DSA master will be used - the numerically first port from the firmware
+description which has an ``ethernet`` property. It is up to the user to
+configure the system for the switch to use other masters.
+
+DSA uses the ``rtnl_link_ops`` mechanism (with a "dsa" ``kind``) to allow
+changing the DSA master of a user port. The ``IFLA_DSA_MASTER`` u32 netlink
+attribute contains the ifindex of the master device that handles each slave
+device. The DSA master must be a valid candidate based on firmware node
+information, or a LAG interface which contains only slaves which are valid
+candidates.
+
+Using iproute2, the following manipulations are possible:
+
+ .. code-block:: sh
+
+ # See the DSA master in current use
+ ip -d link show dev swp0
+ (...)
+ dsa master eth0
+
+ # Static CPU port distribution
+ ip link set swp0 type dsa master eth1
+ ip link set swp1 type dsa master eth0
+ ip link set swp2 type dsa master eth1
+ ip link set swp3 type dsa master eth0
+
+ # CPU ports in LAG, using explicit assignment of the DSA master
+ ip link add bond0 type bond mode balance-xor && ip link set bond0 up
+ ip link set eth1 down && ip link set eth1 master bond0
+ ip link set swp0 type dsa master bond0
+ ip link set swp1 type dsa master bond0
+ ip link set swp2 type dsa master bond0
+ ip link set swp3 type dsa master bond0
+ ip link set eth0 down && ip link set eth0 master bond0
+ ip -d link show dev swp0
+ (...)
+ dsa master bond0
+
+ # CPU ports in LAG, relying on implicit migration of the DSA master
+ ip link add bond0 type bond mode balance-xor && ip link set bond0 up
+ ip link set eth0 down && ip link set eth0 master bond0
+ ip link set eth1 down && ip link set eth1 master bond0
+ ip -d link show dev swp0
+ (...)
+ dsa master bond0
+
+Notice that in the case of CPU ports under a LAG, the use of the
+``IFLA_DSA_MASTER`` netlink attribute is not strictly needed, but rather, DSA
+reacts to the ``IFLA_MASTER`` attribute change of its present master (``eth0``)
+and migrates all user ports to the new upper of ``eth0``, ``bond0``. Similarly,
+when ``bond0`` is destroyed using ``RTM_DELLINK``, DSA migrates the user ports
+that were assigned to this interface to the first physical DSA master which is
+eligible, based on the firmware description (it effectively reverts to the
+startup configuration).
+
+In a setup with more than 2 physical CPU ports, it is therefore possible to mix
+static user to CPU port assignment with LAG between DSA masters. It is not
+possible to statically assign a user port towards a DSA master that has any
+upper interfaces (this includes LAG devices - the master must always be the LAG
+in this case).
+
+Live changing of the DSA master (and thus CPU port) affinity of a user port is
+permitted, in order to allow dynamic redistribution in response to traffic.
+
+Physical DSA masters are allowed to join and leave at any time a LAG interface
+used as a DSA master; however, DSA will reject a LAG interface as a valid
+candidate for being a DSA master unless it has at least one physical DSA master
+as a slave device.
diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst
new file mode 100644
index 0000000000..a94ddf8334
--- /dev/null
+++ b/Documentation/networking/dsa/dsa.rst
@@ -0,0 +1,1129 @@
+============
+Architecture
+============
+
+This document describes the **Distributed Switch Architecture (DSA)** subsystem
+design principles, limitations, interactions with other subsystems, and how to
+develop drivers for this subsystem as well as a TODO for developers interested
+in joining the effort.
+
+Design principles
+=================
+
+The Distributed Switch Architecture subsystem was primarily designed to
+support Marvell Ethernet switches (MV88E6xxx, a.k.a. Link Street product
+line) using Linux, but has since evolved to support other vendors as well.
+
+The original philosophy behind this design was to be able to use unmodified
+Linux tools such as bridge, iproute2, ifconfig to work transparently whether
+they configured/queried a switch port network device or a regular network
+device.
+
+An Ethernet switch typically comprises multiple front-panel ports and one
+or more CPU or management ports. The DSA subsystem currently relies on the
+presence of a management port connected to an Ethernet controller capable of
+receiving Ethernet frames from the switch. This is a very common setup for all
+kinds of Ethernet switches found in Small Home and Office products: routers,
+gateways, or even top-of-rack switches. This host Ethernet controller will
+be later referred to as "master" and "cpu" in DSA terminology and code.
+
+The D in DSA stands for Distributed, because the subsystem has been designed
+with the ability to configure and manage cascaded switches on top of each other
+using upstream and downstream Ethernet links between switches. These specific
+ports are referred to as "dsa" ports in DSA terminology and code. A collection
+of multiple switches connected to each other is called a "switch tree".
+
+For each front-panel port, DSA creates specialized network devices which are
+used as controlling and data-flowing endpoints for use by the Linux networking
+stack. These specialized network interfaces are referred to as "slave" network
+interfaces in DSA terminology and code.
+
+The ideal case for using DSA is when an Ethernet switch supports a "switch tag"
+which is a hardware feature making the switch insert a specific tag for each
+Ethernet frame it receives to/from specific ports to help the management
+interface figure out:
+
+- what port is this frame coming from
+- what was the reason why this frame got forwarded
+- how to send CPU originated traffic to specific ports
+
+The subsystem does support switches not capable of inserting/stripping tags, but
+the features might be slightly limited in that case (traffic separation relies
+on Port-based VLAN IDs).
+
+Note that DSA does not currently create network interfaces for the "cpu" and
+"dsa" ports because:
+
+- the "cpu" port is the Ethernet switch facing side of the management
+ controller, and as such, would create a duplication of feature, since you
+ would get two interfaces for the same conduit: master netdev, and "cpu" netdev
+
+- the "dsa" port(s) are just conduits between two or more switches, and as such
+ cannot really be used as proper network interfaces either, only the
+ downstream, or the top-most upstream interface makes sense with that model
+
+Switch tagging protocols
+------------------------
+
+DSA supports many vendor-specific tagging protocols, one software-defined
+tagging protocol, and a tag-less mode as well (``DSA_TAG_PROTO_NONE``).
+
+The exact format of the tag protocol is vendor specific, but in general, they
+all contain something which:
+
+- identifies which port the Ethernet frame came from/should be sent to
+- provides a reason why this frame was forwarded to the management interface
+
+All tagging protocols are in ``net/dsa/tag_*.c`` files and implement the
+methods of the ``struct dsa_device_ops`` structure, which are detailed below.
+
+Tagging protocols generally fall in one of three categories:
+
+1. The switch-specific frame header is located before the Ethernet header,
+ shifting to the right (from the perspective of the DSA master's frame
+ parser) the MAC DA, MAC SA, EtherType and the entire L2 payload.
+2. The switch-specific frame header is located before the EtherType, keeping
+ the MAC DA and MAC SA in place from the DSA master's perspective, but
+ shifting the 'real' EtherType and L2 payload to the right.
+3. The switch-specific frame header is located at the tail of the packet,
+ keeping all frame headers in place and not altering the view of the packet
+ that the DSA master's frame parser has.
+
+A tagging protocol may tag all packets with switch tags of the same length, or
+the tag length might vary (for example packets with PTP timestamps might
+require an extended switch tag, or there might be one tag length on TX and a
+different one on RX). Either way, the tagging protocol driver must populate the
+``struct dsa_device_ops::needed_headroom`` and/or ``struct dsa_device_ops::needed_tailroom``
+with the length in octets of the longest switch frame header/trailer. The DSA
+framework will automatically adjust the MTU of the master interface to
+accommodate for this extra size in order for DSA user ports to support the
+standard MTU (L2 payload length) of 1500 octets. The ``needed_headroom`` and
+``needed_tailroom`` properties are also used to request from the network stack,
+on a best-effort basis, the allocation of packets with enough extra space such
+that the act of pushing the switch tag on transmission of a packet does not
+cause it to reallocate due to lack of memory.
+
+Even though applications are not expected to parse DSA-specific frame headers,
+the format on the wire of the tagging protocol represents an Application Binary
+Interface exposed by the kernel towards user space, for decoders such as
+``libpcap``. The tagging protocol driver must populate the ``proto`` member of
+``struct dsa_device_ops`` with a value that uniquely describes the
+characteristics of the interaction required between the switch hardware and the
+data path driver: the offset of each bit field within the frame header and any
+stateful processing required to deal with the frames (as may be required for
+PTP timestamping).
+
+From the perspective of the network stack, all switches within the same DSA
+switch tree use the same tagging protocol. In case of a packet transiting a
+fabric with more than one switch, the switch-specific frame header is inserted
+by the first switch in the fabric that the packet was received on. This header
+typically contains information regarding its type (whether it is a control
+frame that must be trapped to the CPU, or a data frame to be forwarded).
+Control frames should be decapsulated only by the software data path, whereas
+data frames might also be autonomously forwarded towards other user ports of
+other switches from the same fabric, and in this case, the outermost switch
+ports must decapsulate the packet.
+
+Note that in certain cases, it might be the case that the tagging format used
+by a leaf switch (not connected directly to the CPU) is not the same as what
+the network stack sees. This can be seen with Marvell switch trees, where the
+CPU port can be configured to use either the DSA or the Ethertype DSA (EDSA)
+format, but the DSA links are configured to use the shorter (without Ethertype)
+DSA frame header, in order to reduce the autonomous packet forwarding overhead.
+It still remains the case that, if the DSA switch tree is configured for the
+EDSA tagging protocol, the operating system sees EDSA-tagged packets from the
+leaf switches that tagged them with the shorter DSA header. This can be done
+because the Marvell switch connected directly to the CPU is configured to
+perform tag translation between DSA and EDSA (which is simply the operation of
+adding or removing the ``ETH_P_EDSA`` EtherType and some padding octets).
+
+It is possible to construct cascaded setups of DSA switches even if their
+tagging protocols are not compatible with one another. In this case, there are
+no DSA links in this fabric, and each switch constitutes a disjoint DSA switch
+tree. The DSA links are viewed as simply a pair of a DSA master (the out-facing
+port of the upstream DSA switch) and a CPU port (the in-facing port of the
+downstream DSA switch).
+
+The tagging protocol of the attached DSA switch tree can be viewed through the
+``dsa/tagging`` sysfs attribute of the DSA master::
+
+ cat /sys/class/net/eth0/dsa/tagging
+
+If the hardware and driver are capable, the tagging protocol of the DSA switch
+tree can be changed at runtime. This is done by writing the new tagging
+protocol name to the same sysfs device attribute as above (the DSA master and
+all attached switch ports must be down while doing this).
+
+It is desirable that all tagging protocols are testable with the ``dsa_loop``
+mockup driver, which can be attached to any network interface. The goal is that
+any network interface should be capable of transmitting the same packet in the
+same way, and the tagger should decode the same received packet in the same way
+regardless of the driver used for the switch control path, and the driver used
+for the DSA master.
+
+The transmission of a packet goes through the tagger's ``xmit`` function.
+The passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
+``skb_mac_header(skb)``, i.e. at the destination MAC address, and the passed
+``struct net_device *dev`` represents the virtual DSA user network interface
+whose hardware counterpart the packet must be steered to (i.e. ``swp0``).
+The job of this method is to prepare the skb in a way that the switch will
+understand what egress port the packet is for (and not deliver it towards other
+ports). Typically this is fulfilled by pushing a frame header. Checking for
+insufficient size in the skb headroom or tailroom is unnecessary provided that
+the ``needed_headroom`` and ``needed_tailroom`` properties were filled out
+properly, because DSA ensures there is enough space before calling this method.
+
+The reception of a packet goes through the tagger's ``rcv`` function. The
+passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
+``skb_mac_header(skb) + ETH_ALEN`` octets, i.e. to where the first octet after
+the EtherType would have been, were this frame not tagged. The role of this
+method is to consume the frame header, adjust ``skb->data`` to really point at
+the first octet after the EtherType, and to change ``skb->dev`` to point to the
+virtual DSA user network interface corresponding to the physical front-facing
+switch port that the packet was received on.
+
+Since tagging protocols in category 1 and 2 break software (and most often also
+hardware) packet dissection on the DSA master, features such as RPS (Receive
+Packet Steering) on the DSA master would be broken. The DSA framework deals
+with this by hooking into the flow dissector and shifting the offset at which
+the IP header is to be found in the tagged frame as seen by the DSA master.
+This behavior is automatic based on the ``overhead`` value of the tagging
+protocol. If not all packets are of equal size, the tagger can implement the
+``flow_dissect`` method of the ``struct dsa_device_ops`` and override this
+default behavior by specifying the correct offset incurred by each individual
+RX packet. Tail taggers do not cause issues to the flow dissector.
+
+Checksum offload should work with category 1 and 2 taggers when the DSA master
+driver declares NETIF_F_HW_CSUM in vlan_features and looks at csum_start and
+csum_offset. For those cases, DSA will shift the checksum start and offset by
+the tag size. If the DSA master driver still uses the legacy NETIF_F_IP_CSUM
+or NETIF_F_IPV6_CSUM in vlan_features, the offload might only work if the
+offload hardware already expects that specific tag (perhaps due to matching
+vendors). DSA slaves inherit those flags from the master port, and it is up to
+the driver to correctly fall back to software checksum when the IP header is not
+where the hardware expects. If that check is ineffective, the packets might go
+to the network without a proper checksum (the checksum field will have the
+pseudo IP header sum). For category 3, when the offload hardware does not
+already expect the switch tag in use, the checksum must be calculated before any
+tag is inserted (i.e. inside the tagger). Otherwise, the DSA master would
+include the tail tag in the (software or hardware) checksum calculation. Then,
+when the tag gets stripped by the switch during transmission, it will leave an
+incorrect IP checksum in place.
+
+Due to various reasons (most common being category 1 taggers being associated
+with DSA-unaware masters, mangling what the master perceives as MAC DA), the
+tagging protocol may require the DSA master to operate in promiscuous mode, to
+receive all frames regardless of the value of the MAC DA. This can be done by
+setting the ``promisc_on_master`` property of the ``struct dsa_device_ops``.
+Note that this assumes a DSA-unaware master driver, which is the norm.
+
+Master network devices
+----------------------
+
+Master network devices are regular, unmodified Linux network device drivers for
+the CPU/management Ethernet interface. Such a driver might occasionally need to
+know whether DSA is enabled (e.g.: to enable/disable specific offload features),
+but the DSA subsystem has been proven to work with industry standard drivers:
+``e1000e,`` ``mv643xx_eth`` etc. without having to introduce modifications to these
+drivers. Such network devices are also often referred to as conduit network
+devices since they act as a pipe between the host processor and the hardware
+Ethernet switch.
+
+Networking stack hooks
+----------------------
+
+When a master netdev is used with DSA, a small hook is placed in the
+networking stack is in order to have the DSA subsystem process the Ethernet
+switch specific tagging protocol. DSA accomplishes this by registering a
+specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
+networking stack, this is also known as a ``ptype`` or ``packet_type``. A typical
+Ethernet Frame receive sequence looks like this:
+
+Master network device (e.g.: e1000e):
+
+1. Receive interrupt fires:
+
+ - receive function is invoked
+ - basic packet processing is done: getting length, status etc.
+ - packet is prepared to be processed by the Ethernet layer by calling
+ ``eth_type_trans``
+
+2. net/ethernet/eth.c::
+
+ eth_type_trans(skb, dev)
+ if (dev->dsa_ptr != NULL)
+ -> skb->protocol = ETH_P_XDSA
+
+3. drivers/net/ethernet/\*::
+
+ netif_receive_skb(skb)
+ -> iterate over registered packet_type
+ -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()
+
+4. net/dsa/dsa.c::
+
+ -> dsa_switch_rcv()
+ -> invoke switch tag specific protocol handler in 'net/dsa/tag_*.c'
+
+5. net/dsa/tag_*.c:
+
+ - inspect and strip switch tag protocol to determine originating port
+ - locate per-port network device
+ - invoke ``eth_type_trans()`` with the DSA slave network device
+ - invoked ``netif_receive_skb()``
+
+Past this point, the DSA slave network devices get delivered regular Ethernet
+frames that can be processed by the networking stack.
+
+Slave network devices
+---------------------
+
+Slave network devices created by DSA are stacked on top of their master network
+device, each of these network interfaces will be responsible for being a
+controlling and data-flowing end-point for each front-panel port of the switch.
+These interfaces are specialized in order to:
+
+- insert/remove the switch tag protocol (if it exists) when sending traffic
+ to/from specific switch ports
+- query the switch for ethtool operations: statistics, link state,
+ Wake-on-LAN, register dumps...
+- manage external/internal PHY: link, auto-negotiation, etc.
+
+These slave network devices have custom net_device_ops and ethtool_ops function
+pointers which allow DSA to introduce a level of layering between the networking
+stack/ethtool and the switch driver implementation.
+
+Upon frame transmission from these slave network devices, DSA will look up which
+switch tagging protocol is currently registered with these network devices and
+invoke a specific transmit routine which takes care of adding the relevant
+switch tag in the Ethernet frames.
+
+These frames are then queued for transmission using the master network device
+``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
+Ethernet switch will be able to process these incoming frames from the
+management interface and deliver them to the physical switch port.
+
+When using multiple CPU ports, it is possible to stack a LAG (bonding/team)
+device between the DSA slave devices and the physical DSA masters. The LAG
+device is thus also a DSA master, but the LAG slave devices continue to be DSA
+masters as well (just with no user port assigned to them; this is needed for
+recovery in case the LAG DSA master disappears). Thus, the data path of the LAG
+DSA master is used asymmetrically. On RX, the ``ETH_P_XDSA`` handler, which
+calls ``dsa_switch_rcv()``, is invoked early (on the physical DSA master;
+LAG slave). Therefore, the RX data path of the LAG DSA master is not used.
+On the other hand, TX takes place linearly: ``dsa_slave_xmit`` calls
+``dsa_enqueue_skb``, which calls ``dev_queue_xmit`` towards the LAG DSA master.
+The latter calls ``dev_queue_xmit`` towards one physical DSA master or the
+other, and in both cases, the packet exits the system through a hardware path
+towards the switch.
+
+Graphical representation
+------------------------
+
+Summarized, this is basically how DSA looks like from a network device
+perspective::
+
+ Unaware application
+ opens and binds socket
+ | ^
+ | |
+ +-----------v--|--------------------+
+ |+------+ +------+ +------+ +------+|
+ || swp0 | | swp1 | | swp2 | | swp3 ||
+ |+------+-+------+-+------+-+------+|
+ | DSA switch driver |
+ +-----------------------------------+
+ | ^
+ Tag added by | | Tag consumed by
+ switch driver | | switch driver
+ v |
+ +-----------------------------------+
+ | Unmodified host interface driver | Software
+ --------+-----------------------------------+------------
+ | Host interface (eth0) | Hardware
+ +-----------------------------------+
+ | ^
+ Tag consumed by | | Tag added by
+ switch hardware | | switch hardware
+ v |
+ +-----------------------------------+
+ | Switch |
+ |+------+ +------+ +------+ +------+|
+ || swp0 | | swp1 | | swp2 | | swp3 ||
+ ++------+-+------+-+------+-+------++
+
+Slave MDIO bus
+--------------
+
+In order to be able to read to/from a switch PHY built into it, DSA creates a
+slave MDIO bus which allows a specific switch driver to divert and intercept
+MDIO reads/writes towards specific PHY addresses. In most MDIO-connected
+switches, these functions would utilize direct or indirect PHY addressing mode
+to return standard MII registers from the switch builtin PHYs, allowing the PHY
+library and/or to return link status, link partner pages, auto-negotiation
+results, etc.
+
+For Ethernet switches which have both external and internal MDIO buses, the
+slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
+internal or external MDIO devices this switch might be connected to: internal
+PHYs, external PHYs, or even external switches.
+
+Data structures
+---------------
+
+DSA data structures are defined in ``include/net/dsa.h`` as well as
+``net/dsa/dsa_priv.h``:
+
+- ``dsa_chip_data``: platform data configuration for a given switch device,
+ this structure describes a switch device's parent device, its address, as
+ well as various properties of its ports: names/labels, and finally a routing
+ table indication (when cascading switches)
+
+- ``dsa_platform_data``: platform device configuration data which can reference
+ a collection of dsa_chip_data structures if multiple switches are cascaded,
+ the master network device this switch tree is attached to needs to be
+ referenced
+
+- ``dsa_switch_tree``: structure assigned to the master network device under
+ ``dsa_ptr``, this structure references a dsa_platform_data structure as well as
+ the tagging protocol supported by the switch tree, and which receive/transmit
+ function hooks should be invoked, information about the directly attached
+ switch is also provided: CPU port. Finally, a collection of dsa_switch are
+ referenced to address individual switches in the tree.
+
+- ``dsa_switch``: structure describing a switch device in the tree, referencing
+ a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
+ device, and a reference to the backing``dsa_switch_ops``
+
+- ``dsa_switch_ops``: structure referencing function pointers, see below for a
+ full description.
+
+Design limitations
+==================
+
+Lack of CPU/DSA network devices
+-------------------------------
+
+DSA does not currently create slave network devices for the CPU or DSA ports, as
+described before. This might be an issue in the following cases:
+
+- inability to fetch switch CPU port statistics counters using ethtool, which
+ can make it harder to debug MDIO switch connected using xMII interfaces
+
+- inability to configure the CPU port link parameters based on the Ethernet
+ controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/
+
+- inability to configure specific VLAN IDs / trunking VLANs between switches
+ when using a cascaded setup
+
+Common pitfalls using DSA setups
+--------------------------------
+
+Once a master network device is configured to use DSA (dev->dsa_ptr becomes
+non-NULL), and the switch behind it expects a tagging protocol, this network
+interface can only exclusively be used as a conduit interface. Sending packets
+directly through this interface (e.g.: opening a socket using this interface)
+will not make us go through the switch tagging protocol transmit function, so
+the Ethernet switch on the other end, expecting a tag will typically drop this
+frame.
+
+Interactions with other subsystems
+==================================
+
+DSA currently leverages the following subsystems:
+
+- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
+- Switchdev:``net/switchdev/*``
+- Device Tree for various of_* functions
+- Devlink: ``net/core/devlink.c``
+
+MDIO/PHY library
+----------------
+
+Slave network devices exposed by DSA may or may not be interfacing with PHY
+devices (``struct phy_device`` as defined in ``include/linux/phy.h)``, but the DSA
+subsystem deals with all possible combinations:
+
+- internal PHY devices, built into the Ethernet switch hardware
+- external PHY devices, connected via an internal or external MDIO bus
+- internal PHY devices, connected via an internal MDIO bus
+- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a
+ fixed PHYs
+
+The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
+logic basically looks like this:
+
+- if Device Tree is used, the PHY device is looked up using the standard
+ "phy-handle" property, if found, this PHY device is created and registered
+ using ``of_phy_connect()``
+
+- if Device Tree is used and the PHY device is "fixed", that is, conforms to
+ the definition of a non-MDIO managed PHY as defined in
+ ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
+ and connected transparently using the special fixed MDIO bus driver
+
+- finally, if the PHY is built into the switch, as is very common with
+ standalone switch packages, the PHY is probed using the slave MII bus created
+ by DSA
+
+
+SWITCHDEV
+---------
+
+DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
+more specifically with its VLAN filtering portion when configuring VLANs on top
+of per-port slave network devices. As of today, the only SWITCHDEV objects
+supported by DSA are the FDB and VLAN objects.
+
+Devlink
+-------
+
+DSA registers one devlink device per physical switch in the fabric.
+For each devlink device, every physical port (i.e. user ports, CPU ports, DSA
+links or unused ports) is exposed as a devlink port.
+
+DSA drivers can make use of the following devlink features:
+
+- Regions: debugging feature which allows user space to dump driver-defined
+ areas of hardware information in a low-level, binary format. Both global
+ regions as well as per-port regions are supported. It is possible to export
+ devlink regions even for pieces of data that are already exposed in some way
+ to the standard iproute2 user space programs (ip-link, bridge), like address
+ tables and VLAN tables. For example, this might be useful if the tables
+ contain additional hardware-specific details which are not visible through
+ the iproute2 abstraction, or it might be useful to inspect these tables on
+ the non-user ports too, which are invisible to iproute2 because no network
+ interface is registered for them.
+- Params: a feature which enables user to configure certain low-level tunable
+ knobs pertaining to the device. Drivers may implement applicable generic
+ devlink params, or may add new device-specific devlink params.
+- Resources: a monitoring feature which enables users to see the degree of
+ utilization of certain hardware tables in the device, such as FDB, VLAN, etc.
+- Shared buffers: a QoS feature for adjusting and partitioning memory and frame
+ reservations per port and per traffic class, in the ingress and egress
+ directions, such that low-priority bulk traffic does not impede the
+ processing of high-priority critical traffic.
+
+For more details, consult ``Documentation/networking/devlink/``.
+
+Device Tree
+-----------
+
+DSA features a standardized binding which is documented in
+``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
+functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
+per-port PHY specific details: interface connection, MDIO bus location, etc.
+
+Driver development
+==================
+
+DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will
+contain the various members described below.
+
+Probing, registration and device lifetime
+-----------------------------------------
+
+DSA switches are regular ``device`` structures on buses (be they platform, SPI,
+I2C, MDIO or otherwise). The DSA framework is not involved in their probing
+with the device core.
+
+Switch registration from the perspective of a driver means passing a valid
+``struct dsa_switch`` pointer to ``dsa_register_switch()``, usually from the
+switch driver's probing function. The following members must be valid in the
+provided structure:
+
+- ``ds->dev``: will be used to parse the switch's OF node or platform data.
+
+- ``ds->num_ports``: will be used to create the port list for this switch, and
+ to validate the port indices provided in the OF node.
+
+- ``ds->ops``: a pointer to the ``dsa_switch_ops`` structure holding the DSA
+ method implementations.
+
+- ``ds->priv``: backpointer to a driver-private data structure which can be
+ retrieved in all further DSA method callbacks.
+
+In addition, the following flags in the ``dsa_switch`` structure may optionally
+be configured to obtain driver-specific behavior from the DSA core. Their
+behavior when set is documented through comments in ``include/net/dsa.h``.
+
+- ``ds->vlan_filtering_is_global``
+
+- ``ds->needs_standalone_vlan_filtering``
+
+- ``ds->configure_vlan_while_not_filtering``
+
+- ``ds->untag_bridge_pvid``
+
+- ``ds->assisted_learning_on_cpu_port``
+
+- ``ds->mtu_enforcement_ingress``
+
+- ``ds->fdb_isolation``
+
+Internally, DSA keeps an array of switch trees (group of switches) global to
+the kernel, and attaches a ``dsa_switch`` structure to a tree on registration.
+The tree ID to which the switch is attached is determined by the first u32
+number of the ``dsa,member`` property of the switch's OF node (0 if missing).
+The switch ID within the tree is determined by the second u32 number of the
+same OF property (0 if missing). Registering multiple switches with the same
+switch ID and tree ID is illegal and will cause an error. Using platform data,
+a single switch and a single switch tree is permitted.
+
+In case of a tree with multiple switches, probing takes place asymmetrically.
+The first N-1 callers of ``dsa_register_switch()`` only add their ports to the
+port list of the tree (``dst->ports``), each port having a backpointer to its
+associated switch (``dp->ds``). Then, these switches exit their
+``dsa_register_switch()`` call early, because ``dsa_tree_setup_routing_table()``
+has determined that the tree is not yet complete (not all ports referenced by
+DSA links are present in the tree's port list). The tree becomes complete when
+the last switch calls ``dsa_register_switch()``, and this triggers the effective
+continuation of initialization (including the call to ``ds->ops->setup()``) for
+all switches within that tree, all as part of the calling context of the last
+switch's probe function.
+
+The opposite of registration takes place when calling ``dsa_unregister_switch()``,
+which removes a switch's ports from the port list of the tree. The entire tree
+is torn down when the first switch unregisters.
+
+It is mandatory for DSA switch drivers to implement the ``shutdown()`` callback
+of their respective bus, and call ``dsa_switch_shutdown()`` from it (a minimal
+version of the full teardown performed by ``dsa_unregister_switch()``).
+The reason is that DSA keeps a reference on the master net device, and if the
+driver for the master device decides to unbind on shutdown, DSA's reference
+will block that operation from finalizing.
+
+Either ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` must be called,
+but not both, and the device driver model permits the bus' ``remove()`` method
+to be called even if ``shutdown()`` was already called. Therefore, drivers are
+expected to implement a mutual exclusion method between ``remove()`` and
+``shutdown()`` by setting their drvdata to NULL after any of these has run, and
+checking whether the drvdata is NULL before proceeding to take any action.
+
+After ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` was called, no
+further callbacks via the provided ``dsa_switch_ops`` may take place, and the
+driver may free the data structures associated with the ``dsa_switch``.
+
+Switch configuration
+--------------------
+
+- ``get_tag_protocol``: this is to indicate what kind of tagging protocol is
+ supported, should be a valid value from the ``dsa_tag_protocol`` enum.
+ The returned information does not have to be static; the driver is passed the
+ CPU port number, as well as the tagging protocol of a possibly stacked
+ upstream switch, in case there are hardware limitations in terms of supported
+ tag formats.
+
+- ``change_tag_protocol``: when the default tagging protocol has compatibility
+ problems with the master or other issues, the driver may support changing it
+ at runtime, either through a device tree property or through sysfs. In that
+ case, further calls to ``get_tag_protocol`` should report the protocol in
+ current use.
+
+- ``setup``: setup function for the switch, this function is responsible for setting
+ up the ``dsa_switch_ops`` private structure with all it needs: register maps,
+ interrupts, mutexes, locks, etc. This function is also expected to properly
+ configure the switch to separate all network interfaces from each other, that
+ is, they should be isolated by the switch hardware itself, typically by creating
+ a Port-based VLAN ID for each port and allowing only the CPU port and the
+ specific port to be in the forwarding vector. Ports that are unused by the
+ platform should be disabled. Past this function, the switch is expected to be
+ fully configured and ready to serve any kind of request. It is recommended
+ to issue a software reset of the switch during this setup function in order to
+ avoid relying on what a previous software agent such as a bootloader/firmware
+ may have previously configured. The method responsible for undoing any
+ applicable allocations or operations done here is ``teardown``.
+
+- ``port_setup`` and ``port_teardown``: methods for initialization and
+ destruction of per-port data structures. It is mandatory for some operations
+ such as registering and unregistering devlink port regions to be done from
+ these methods, otherwise they are optional. A port will be torn down only if
+ it has been previously set up. It is possible for a port to be set up during
+ probing only to be torn down immediately afterwards, for example in case its
+ PHY cannot be found. In this case, probing of the DSA switch continues
+ without that particular port.
+
+- ``port_change_master``: method through which the affinity (association used
+ for traffic termination purposes) between a user port and a CPU port can be
+ changed. By default all user ports from a tree are assigned to the first
+ available CPU port that makes sense for them (most of the times this means
+ the user ports of a tree are all assigned to the same CPU port, except for H
+ topologies as described in commit 2c0b03258b8b). The ``port`` argument
+ represents the index of the user port, and the ``master`` argument represents
+ the new DSA master ``net_device``. The CPU port associated with the new
+ master can be retrieved by looking at ``struct dsa_port *cpu_dp =
+ master->dsa_ptr``. Additionally, the master can also be a LAG device where
+ all the slave devices are physical DSA masters. LAG DSA masters also have a
+ valid ``master->dsa_ptr`` pointer, however this is not unique, but rather a
+ duplicate of the first physical DSA master's (LAG slave) ``dsa_ptr``. In case
+ of a LAG DSA master, a further call to ``port_lag_join`` will be emitted
+ separately for the physical CPU ports associated with the physical DSA
+ masters, requesting them to create a hardware LAG associated with the LAG
+ interface.
+
+PHY devices and link management
+-------------------------------
+
+- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
+ if the PHY library PHY driver needs to know about information it cannot obtain
+ on its own (e.g.: coming from switch memory mapped registers), this function
+ should return a 32-bit bitmask of "flags" that is private between the switch
+ driver and the Ethernet PHY driver in ``drivers/net/phy/\*``.
+
+- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
+ the switch port MDIO registers. If unavailable, return 0xffff for each read.
+ For builtin switch Ethernet PHYs, this function should allow reading the link
+ status, auto-negotiation results, link partner pages, etc.
+
+- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
+ to the switch port MDIO registers. If unavailable return a negative error
+ code.
+
+- ``adjust_link``: Function invoked by the PHY library when a slave network device
+ is attached to a PHY device. This function is responsible for appropriately
+ configuring the switch port link parameters: speed, duplex, pause based on
+ what the ``phy_device`` is providing.
+
+- ``fixed_link_update``: Function invoked by the PHY library, and specifically by
+ the fixed PHY driver asking the switch driver for link parameters that could
+ not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
+ This is particularly useful for specific kinds of hardware such as QSGMII,
+ MoCA or other kinds of non-MDIO managed PHYs where out of band link
+ information is obtained
+
+Ethtool operations
+------------------
+
+- ``get_strings``: ethtool function used to query the driver's strings, will
+ typically return statistics strings, private flags strings, etc.
+
+- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
+ return their values. DSA overlays slave network devices general statistics:
+ RX/TX counters from the network device, with switch driver specific statistics
+ per port
+
+- ``get_sset_count``: ethtool function used to query the number of statistics items
+
+- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
+ function may for certain implementations also query the master network device
+ Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN
+
+- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
+ direct counterpart to set_wol with similar restrictions
+
+- ``set_eee``: ethtool function which is used to configure a switch port EEE (Green
+ Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
+ PHY level if relevant. This function should enable EEE at the switch port MAC
+ controller and data-processing logic
+
+- ``get_eee``: ethtool function which is used to query a switch port EEE settings,
+ this function should return the EEE state of the switch port MAC controller
+ and data-processing logic as well as query the PHY for its currently configured
+ EEE settings
+
+- ``get_eeprom_len``: ethtool function returning for a given switch the EEPROM
+ length/size in bytes
+
+- ``get_eeprom``: ethtool function returning for a given switch the EEPROM contents
+
+- ``set_eeprom``: ethtool function writing specified data to a given switch EEPROM
+
+- ``get_regs_len``: ethtool function returning the register length for a given
+ switch
+
+- ``get_regs``: ethtool function returning the Ethernet switch internal register
+ contents. This function might require user-land code in ethtool to
+ pretty-print register values and registers
+
+Power management
+----------------
+
+- ``suspend``: function invoked by the DSA platform device when the system goes to
+ suspend, should quiesce all Ethernet switch activities, but keep ports
+ participating in Wake-on-LAN active as well as additional wake-up logic if
+ supported
+
+- ``resume``: function invoked by the DSA platform device when the system resumes,
+ should resume all Ethernet switch activities and re-configure the switch to be
+ in a fully active state
+
+- ``port_enable``: function invoked by the DSA slave network device ndo_open
+ function when a port is administratively brought up, this function should
+ fully enable a given switch port. DSA takes care of marking the port with
+ ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
+ was not, and propagating these changes down to the hardware
+
+- ``port_disable``: function invoked by the DSA slave network device ndo_close
+ function when a port is administratively brought down, this function should
+ fully disable a given switch port. DSA takes care of marking the port with
+ ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
+ disabled while being a bridge member
+
+Address databases
+-----------------
+
+Switching hardware is expected to have a table for FDB entries, however not all
+of them are active at the same time. An address database is the subset (partition)
+of FDB entries that is active (can be matched by address learning on RX, or FDB
+lookup on TX) depending on the state of the port. An address database may
+occasionally be called "FID" (Filtering ID) in this document, although the
+underlying implementation may choose whatever is available to the hardware.
+
+For example, all ports that belong to a VLAN-unaware bridge (which is
+*currently* VLAN-unaware) are expected to learn source addresses in the
+database associated by the driver with that bridge (and not with other
+VLAN-unaware bridges). During forwarding and FDB lookup, a packet received on a
+VLAN-unaware bridge port should be able to find a VLAN-unaware FDB entry having
+the same MAC DA as the packet, which is present on another port member of the
+same bridge. At the same time, the FDB lookup process must be able to not find
+an FDB entry having the same MAC DA as the packet, if that entry points towards
+a port which is a member of a different VLAN-unaware bridge (and is therefore
+associated with a different address database).
+
+Similarly, each VLAN of each offloaded VLAN-aware bridge should have an
+associated address database, which is shared by all ports which are members of
+that VLAN, but not shared by ports belonging to different bridges that are
+members of the same VID.
+
+In this context, a VLAN-unaware database means that all packets are expected to
+match on it irrespective of VLAN ID (only MAC address lookup), whereas a
+VLAN-aware database means that packets are supposed to match based on the VLAN
+ID from the classified 802.1Q header (or the pvid if untagged).
+
+At the bridge layer, VLAN-unaware FDB entries have the special VID value of 0,
+whereas VLAN-aware FDB entries have non-zero VID values. Note that a
+VLAN-unaware bridge may have VLAN-aware (non-zero VID) FDB entries, and a
+VLAN-aware bridge may have VLAN-unaware FDB entries. As in hardware, the
+software bridge keeps separate address databases, and offloads to hardware the
+FDB entries belonging to these databases, through switchdev, asynchronously
+relative to the moment when the databases become active or inactive.
+
+When a user port operates in standalone mode, its driver should configure it to
+use a separate database called a port private database. This is different from
+the databases described above, and should impede operation as standalone port
+(packet in, packet out to the CPU port) as little as possible. For example,
+on ingress, it should not attempt to learn the MAC SA of ingress traffic, since
+learning is a bridging layer service and this is a standalone port, therefore
+it would consume useless space. With no address learning, the port private
+database should be empty in a naive implementation, and in this case, all
+received packets should be trivially flooded to the CPU port.
+
+DSA (cascade) and CPU ports are also called "shared" ports because they service
+multiple address databases, and the database that a packet should be associated
+to is usually embedded in the DSA tag. This means that the CPU port may
+simultaneously transport packets coming from a standalone port (which were
+classified by hardware in one address database), and from a bridge port (which
+were classified to a different address database).
+
+Switch drivers which satisfy certain criteria are able to optimize the naive
+configuration by removing the CPU port from the flooding domain of the switch,
+and just program the hardware with FDB entries pointing towards the CPU port
+for which it is known that software is interested in those MAC addresses.
+Packets which do not match a known FDB entry will not be delivered to the CPU,
+which will save CPU cycles required for creating an skb just to drop it.
+
+DSA is able to perform host address filtering for the following kinds of
+addresses:
+
+- Primary unicast MAC addresses of ports (``dev->dev_addr``). These are
+ associated with the port private database of the respective user port,
+ and the driver is notified to install them through ``port_fdb_add`` towards
+ the CPU port.
+
+- Secondary unicast and multicast MAC addresses of ports (addresses added
+ through ``dev_uc_add()`` and ``dev_mc_add()``). These are also associated
+ with the port private database of the respective user port.
+
+- Local/permanent bridge FDB entries (``BR_FDB_LOCAL``). These are the MAC
+ addresses of the bridge ports, for which packets must be terminated locally
+ and not forwarded. They are associated with the address database for that
+ bridge.
+
+- Static bridge FDB entries installed towards foreign (non-DSA) interfaces
+ present in the same bridge as some DSA switch ports. These are also
+ associated with the address database for that bridge.
+
+- Dynamically learned FDB entries on foreign interfaces present in the same
+ bridge as some DSA switch ports, only if ``ds->assisted_learning_on_cpu_port``
+ is set to true by the driver. These are associated with the address database
+ for that bridge.
+
+For various operations detailed below, DSA provides a ``dsa_db`` structure
+which can be of the following types:
+
+- ``DSA_DB_PORT``: the FDB (or MDB) entry to be installed or deleted belongs to
+ the port private database of user port ``db->dp``.
+- ``DSA_DB_BRIDGE``: the entry belongs to one of the address databases of bridge
+ ``db->bridge``. Separation between the VLAN-unaware database and the per-VID
+ databases of this bridge is expected to be done by the driver.
+- ``DSA_DB_LAG``: the entry belongs to the address database of LAG ``db->lag``.
+ Note: ``DSA_DB_LAG`` is currently unused and may be removed in the future.
+
+The drivers which act upon the ``dsa_db`` argument in ``port_fdb_add``,
+``port_mdb_add`` etc should declare ``ds->fdb_isolation`` as true.
+
+DSA associates each offloaded bridge and each offloaded LAG with a one-based ID
+(``struct dsa_bridge :: num``, ``struct dsa_lag :: id``) for the purposes of
+refcounting addresses on shared ports. Drivers may piggyback on DSA's numbering
+scheme (the ID is readable through ``db->bridge.num`` and ``db->lag.id`` or may
+implement their own.
+
+Only the drivers which declare support for FDB isolation are notified of FDB
+entries on the CPU port belonging to ``DSA_DB_PORT`` databases.
+For compatibility/legacy reasons, ``DSA_DB_BRIDGE`` addresses are notified to
+drivers even if they do not support FDB isolation. However, ``db->bridge.num``
+and ``db->lag.id`` are always set to 0 in that case (to denote the lack of
+isolation, for refcounting purposes).
+
+Note that it is not mandatory for a switch driver to implement physically
+separate address databases for each standalone user port. Since FDB entries in
+the port private databases will always point to the CPU port, there is no risk
+for incorrect forwarding decisions. In this case, all standalone ports may
+share the same database, but the reference counting of host-filtered addresses
+(not deleting the FDB entry for a port's MAC address if it's still in use by
+another port) becomes the responsibility of the driver, because DSA is unaware
+that the port databases are in fact shared. This can be achieved by calling
+``dsa_fdb_present_in_other_db()`` and ``dsa_mdb_present_in_other_db()``.
+The down side is that the RX filtering lists of each user port are in fact
+shared, which means that user port A may accept a packet with a MAC DA it
+shouldn't have, only because that MAC address was in the RX filtering list of
+user port B. These packets will still be dropped in software, however.
+
+Bridge layer
+------------
+
+Offloading the bridge forwarding plane is optional and handled by the methods
+below. They may be absent, return -EOPNOTSUPP, or ``ds->max_num_bridges`` may
+be non-zero and exceeded, and in this case, joining a bridge port is still
+possible, but the packet forwarding will take place in software, and the ports
+under a software bridge must remain configured in the same way as for
+standalone operation, i.e. have all bridging service functions (address
+learning etc) disabled, and send all received packets to the CPU port only.
+
+Concretely, a port starts offloading the forwarding plane of a bridge once it
+returns success to the ``port_bridge_join`` method, and stops doing so after
+``port_bridge_leave`` has been called. Offloading the bridge means autonomously
+learning FDB entries in accordance with the software bridge port's state, and
+autonomously forwarding (or flooding) received packets without CPU intervention.
+This is optional even when offloading a bridge port. Tagging protocol drivers
+are expected to call ``dsa_default_offload_fwd_mark(skb)`` for packets which
+have already been autonomously forwarded in the forwarding domain of the
+ingress switch port. DSA, through ``dsa_port_devlink_setup()``, considers all
+switch ports part of the same tree ID to be part of the same bridge forwarding
+domain (capable of autonomous forwarding to each other).
+
+Offloading the TX forwarding process of a bridge is a distinct concept from
+simply offloading its forwarding plane, and refers to the ability of certain
+driver and tag protocol combinations to transmit a single skb coming from the
+bridge device's transmit function to potentially multiple egress ports (and
+thereby avoid its cloning in software).
+
+Packets for which the bridge requests this behavior are called data plane
+packets and have ``skb->offload_fwd_mark`` set to true in the tag protocol
+driver's ``xmit`` function. Data plane packets are subject to FDB lookup,
+hardware learning on the CPU port, and do not override the port STP state.
+Additionally, replication of data plane packets (multicast, flooding) is
+handled in hardware and the bridge driver will transmit a single skb for each
+packet that may or may not need replication.
+
+When the TX forwarding offload is enabled, the tag protocol driver is
+responsible to inject packets into the data plane of the hardware towards the
+correct bridging domain (FID) that the port is a part of. The port may be
+VLAN-unaware, and in this case the FID must be equal to the FID used by the
+driver for its VLAN-unaware address database associated with that bridge.
+Alternatively, the bridge may be VLAN-aware, and in that case, it is guaranteed
+that the packet is also VLAN-tagged with the VLAN ID that the bridge processed
+this packet in. It is the responsibility of the hardware to untag the VID on
+the egress-untagged ports, or keep the tag on the egress-tagged ones.
+
+- ``port_bridge_join``: bridge layer function invoked when a given switch port is
+ added to a bridge, this function should do what's necessary at the switch
+ level to permit the joining port to be added to the relevant logical
+ domain for it to ingress/egress traffic with other members of the bridge.
+ By setting the ``tx_fwd_offload`` argument to true, the TX forwarding process
+ of this bridge is also offloaded.
+
+- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
+ removed from a bridge, this function should do what's necessary at the
+ switch level to deny the leaving port from ingress/egress traffic from the
+ remaining bridge members.
+
+- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
+ state is computed by the bridge layer and should be propagated to switch
+ hardware to forward/block/learn traffic.
+
+- ``port_bridge_flags``: bridge layer function invoked when a port must
+ configure its settings for e.g. flooding of unknown traffic or source address
+ learning. The switch driver is responsible for initial setup of the
+ standalone ports with address learning disabled and egress flooding of all
+ types of traffic, then the DSA core notifies of any change to the bridge port
+ flags when the port joins and leaves a bridge. DSA does not currently manage
+ the bridge port flags for the CPU port. The assumption is that address
+ learning should be statically enabled (if supported by the hardware) on the
+ CPU port, and flooding towards the CPU port should also be enabled, due to a
+ lack of an explicit address filtering mechanism in the DSA core.
+
+- ``port_fast_age``: bridge layer function invoked when flushing the
+ dynamically learned FDB entries on the port is necessary. This is called when
+ transitioning from an STP state where learning should take place to an STP
+ state where it shouldn't, or when leaving a bridge, or when address learning
+ is turned off via ``port_bridge_flags``.
+
+Bridge VLAN filtering
+---------------------
+
+- ``port_vlan_filtering``: bridge layer function invoked when the bridge gets
+ configured for turning on or off VLAN filtering. If nothing specific needs to
+ be done at the hardware level, this callback does not need to be implemented.
+ When VLAN filtering is turned on, the hardware must be programmed with
+ rejecting 802.1Q frames which have VLAN IDs outside of the programmed allowed
+ VLAN ID map/rules. If there is no PVID programmed into the switch port,
+ untagged frames must be rejected as well. When turned off the switch must
+ accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
+ allowed.
+
+- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
+ (tagged or untagged) for the given switch port. The CPU port becomes a member
+ of a VLAN only if a foreign bridge port is also a member of it (and
+ forwarding needs to take place in software), or the VLAN is installed to the
+ VLAN group of the bridge device itself, for termination purposes
+ (``bridge vlan add dev br0 vid 100 self``). VLANs on shared ports are
+ reference counted and removed when there is no user left. Drivers do not need
+ to manually install a VLAN on the CPU port.
+
+- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
+ given switch port
+
+- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
+ Forwarding Database entry, the switch hardware should be programmed with the
+ specified address in the specified VLAN Id in the forwarding database
+ associated with this VLAN ID.
+
+- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
+ Forwarding Database entry, the switch hardware should be programmed to delete
+ the specified MAC address from the specified VLAN ID if it was mapped into
+ this port forwarding database
+
+- ``port_fdb_dump``: bridge bypass function invoked by ``ndo_fdb_dump`` on the
+ physical DSA port interfaces. Since DSA does not attempt to keep in sync its
+ hardware FDB entries with the software bridge, this method is implemented as
+ a means to view the entries visible on user ports in the hardware database.
+ The entries reported by this function have the ``self`` flag in the output of
+ the ``bridge fdb show`` command.
+
+- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
+ a multicast database entry. The switch hardware should be programmed with the
+ specified address in the specified VLAN ID in the forwarding database
+ associated with this VLAN ID.
+
+- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
+ multicast database entry, the switch hardware should be programmed to delete
+ the specified MAC address from the specified VLAN ID if it was mapped into
+ this port forwarding database.
+
+Link aggregation
+----------------
+
+Link aggregation is implemented in the Linux networking stack by the bonding
+and team drivers, which are modeled as virtual, stackable network interfaces.
+DSA is capable of offloading a link aggregation group (LAG) to hardware that
+supports the feature, and supports bridging between physical ports and LAGs,
+as well as between LAGs. A bonding/team interface which holds multiple physical
+ports constitutes a logical port, although DSA has no explicit concept of a
+logical port at the moment. Due to this, events where a LAG joins/leaves a
+bridge are treated as if all individual physical ports that are members of that
+LAG join/leave the bridge. Switchdev port attributes (VLAN filtering, STP
+state, etc) and objects (VLANs, MDB entries) offloaded to a LAG as bridge port
+are treated similarly: DSA offloads the same switchdev object / port attribute
+on all members of the LAG. Static bridge FDB entries on a LAG are not yet
+supported, since the DSA driver API does not have the concept of a logical port
+ID.
+
+- ``port_lag_join``: function invoked when a given switch port is added to a
+ LAG. The driver may return ``-EOPNOTSUPP``, and in this case, DSA will fall
+ back to a software implementation where all traffic from this port is sent to
+ the CPU.
+- ``port_lag_leave``: function invoked when a given switch port leaves a LAG
+ and returns to operation as a standalone port.
+- ``port_lag_change``: function invoked when the link state of any member of
+ the LAG changes, and the hashing function needs rebalancing to only make use
+ of the subset of physical LAG member ports that are up.
+
+Drivers that benefit from having an ID associated with each offloaded LAG
+can optionally populate ``ds->num_lag_ids`` from the ``dsa_switch_ops::setup``
+method. The LAG ID associated with a bonding/team interface can then be
+retrieved by a DSA switch driver using the ``dsa_lag_id`` function.
+
+IEC 62439-2 (MRP)
+-----------------
+
+The Media Redundancy Protocol is a topology management protocol optimized for
+fast fault recovery time for ring networks, which has some components
+implemented as a function of the bridge driver. MRP uses management PDUs
+(Test, Topology, LinkDown/Up, Option) sent at a multicast destination MAC
+address range of 01:15:4e:00:00:0x and with an EtherType of 0x88e3.
+Depending on the node's role in the ring (MRM: Media Redundancy Manager,
+MRC: Media Redundancy Client, MRA: Media Redundancy Automanager), certain MRP
+PDUs might need to be terminated locally and others might need to be forwarded.
+An MRM might also benefit from offloading to hardware the creation and
+transmission of certain MRP PDUs (Test).
+
+Normally an MRP instance can be created on top of any network interface,
+however in the case of a device with an offloaded data path such as DSA, it is
+necessary for the hardware, even if it is not MRP-aware, to be able to extract
+the MRP PDUs from the fabric before the driver can proceed with the software
+implementation. DSA today has no driver which is MRP-aware, therefore it only
+listens for the bare minimum switchdev objects required for the software assist
+to work properly. The operations are detailed below.
+
+- ``port_mrp_add`` and ``port_mrp_del``: notifies driver when an MRP instance
+ with a certain ring ID, priority, primary port and secondary port is
+ created/deleted.
+- ``port_mrp_add_ring_role`` and ``port_mrp_del_ring_role``: function invoked
+ when an MRP instance changes ring roles between MRM or MRC. This affects
+ which MRP PDUs should be trapped to software and which should be autonomously
+ forwarded.
+
+IEC 62439-3 (HSR/PRP)
+---------------------
+
+The Parallel Redundancy Protocol (PRP) is a network redundancy protocol which
+works by duplicating and sequence numbering packets through two independent L2
+networks (which are unaware of the PRP tail tags carried in the packets), and
+eliminating the duplicates at the receiver. The High-availability Seamless
+Redundancy (HSR) protocol is similar in concept, except all nodes that carry
+the redundant traffic are aware of the fact that it is HSR-tagged (because HSR
+uses a header with an EtherType of 0x892f) and are physically connected in a
+ring topology. Both HSR and PRP use supervision frames for monitoring the
+health of the network and for discovery of other nodes.
+
+In Linux, both HSR and PRP are implemented in the hsr driver, which
+instantiates a virtual, stackable network interface with two member ports.
+The driver only implements the basic roles of DANH (Doubly Attached Node
+implementing HSR) and DANP (Doubly Attached Node implementing PRP); the roles
+of RedBox and QuadBox are not implemented (therefore, bridging a hsr network
+interface with a physical switch port does not produce the expected result).
+
+A driver which is able of offloading certain functions of a DANP or DANH should
+declare the corresponding netdev features as indicated by the documentation at
+``Documentation/networking/netdev-features.rst``. Additionally, the following
+methods must be implemented:
+
+- ``port_hsr_join``: function invoked when a given switch port is added to a
+ DANP/DANH. The driver may return ``-EOPNOTSUPP`` and in this case, DSA will
+ fall back to a software implementation where all traffic from this port is
+ sent to the CPU.
+- ``port_hsr_leave``: function invoked when a given switch port leaves a
+ DANP/DANH and returns to normal operation as a standalone port.
+
+TODO
+====
+
+Making SWITCHDEV and DSA converge towards an unified codebase
+-------------------------------------------------------------
+
+SWITCHDEV properly takes care of abstracting the networking stack with offload
+capable hardware, but does not enforce a strict switch device driver model. On
+the other DSA enforces a fairly strict device driver model, and deals with most
+of the switch specific. At some point we should envision a merger between these
+two subsystems and get the best of both worlds.
diff --git a/Documentation/networking/dsa/index.rst b/Documentation/networking/dsa/index.rst
new file mode 100644
index 0000000000..ee631e2d64
--- /dev/null
+++ b/Documentation/networking/dsa/index.rst
@@ -0,0 +1,13 @@
+===============================
+Distributed Switch Architecture
+===============================
+
+.. toctree::
+ :maxdepth: 1
+
+ dsa
+ b53
+ bcm_sf2
+ lan9303
+ sja1105
+ configuration
diff --git a/Documentation/networking/dsa/lan9303.rst b/Documentation/networking/dsa/lan9303.rst
new file mode 100644
index 0000000000..e3c820db28
--- /dev/null
+++ b/Documentation/networking/dsa/lan9303.rst
@@ -0,0 +1,37 @@
+==============================
+LAN9303 Ethernet switch driver
+==============================
+
+The LAN9303 is a three port 10/100 Mbps ethernet switch with integrated phys for
+the two external ethernet ports. The third port is an RMII/MII interface to a
+host master network interface (e.g. fixed link).
+
+
+Driver details
+==============
+
+The driver is implemented as a DSA driver, see ``Documentation/networking/dsa/dsa.rst``.
+
+See ``Documentation/devicetree/bindings/net/dsa/lan9303.txt`` for device tree
+binding.
+
+The LAN9303 can be managed both via MDIO and I2C, both supported by this driver.
+
+At startup the driver configures the device to provide two separate network
+interfaces (which is the default state of a DSA device). Due to HW limitations,
+no HW MAC learning takes place in this mode.
+
+When both user ports are joined to the same bridge, the normal HW MAC learning
+is enabled. This means that unicast traffic is forwarded in HW. Broadcast and
+multicast is flooded in HW. STP is also supported in this mode. The driver
+support fdb/mdb operations as well, meaning IGMP snooping is supported.
+
+If one of the user ports leave the bridge, the ports goes back to the initial
+separated operation.
+
+
+Driver limitations
+==================
+
+ - Support for VLAN filtering is not implemented
+ - The HW does not support VLAN-specific fdb entries
diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst
new file mode 100644
index 0000000000..e0219c1452
--- /dev/null
+++ b/Documentation/networking/dsa/sja1105.rst
@@ -0,0 +1,445 @@
+=========================
+NXP SJA1105 switch driver
+=========================
+
+Overview
+========
+
+The NXP SJA1105 is a family of 10 SPI-managed automotive switches:
+
+- SJA1105E: First generation, no TTEthernet
+- SJA1105T: First generation, TTEthernet
+- SJA1105P: Second generation, no TTEthernet, no SGMII
+- SJA1105Q: Second generation, TTEthernet, no SGMII
+- SJA1105R: Second generation, no TTEthernet, SGMII
+- SJA1105S: Second generation, TTEthernet, SGMII
+- SJA1110A: Third generation, TTEthernet, SGMII, integrated 100base-T1 and
+ 100base-TX PHYs
+- SJA1110B: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110C: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110D: Third generation, TTEthernet, SGMII, 100base-T1
+
+Being automotive parts, their configuration interface is geared towards
+set-and-forget use, with minimal dynamic interaction at runtime. They
+require a static configuration to be composed by software and packed
+with CRC and table headers, and sent over SPI.
+
+The static configuration is composed of several configuration tables. Each
+table takes a number of entries. Some configuration tables can be (partially)
+reconfigured at runtime, some not. Some tables are mandatory, some not:
+
+============================= ================== =============================
+Table Mandatory Reconfigurable
+============================= ================== =============================
+Schedule no no
+Schedule entry points if Scheduling no
+VL Lookup no no
+VL Policing if VL Lookup no
+VL Forwarding if VL Lookup no
+L2 Lookup no no
+L2 Policing yes no
+VLAN Lookup yes yes
+L2 Forwarding yes partially (fully on P/Q/R/S)
+MAC Config yes partially (fully on P/Q/R/S)
+Schedule Params if Scheduling no
+Schedule Entry Points Params if Scheduling no
+VL Forwarding Params if VL Forwarding no
+L2 Lookup Params no partially (fully on P/Q/R/S)
+L2 Forwarding Params yes no
+Clock Sync Params no no
+AVB Params no no
+General Params yes partially
+Retagging no yes
+xMII Params yes no
+SGMII no yes
+============================= ================== =============================
+
+
+Also the configuration is write-only (software cannot read it back from the
+switch except for very few exceptions).
+
+The driver creates a static configuration at probe time, and keeps it at
+all times in memory, as a shadow for the hardware state. When required to
+change a hardware setting, the static configuration is also updated.
+If that changed setting can be transmitted to the switch through the dynamic
+reconfiguration interface, it is; otherwise the switch is reset and
+reprogrammed with the updated static configuration.
+
+Switching features
+==================
+
+The driver supports the configuration of L2 forwarding rules in hardware for
+port bridging. The forwarding, broadcast and flooding domain between ports can
+be restricted through two methods: either at the L2 forwarding level (isolate
+one bridge's ports from another's) or at the VLAN port membership level
+(isolate ports within the same bridge). The final forwarding decision taken by
+the hardware is a logical AND of these two sets of rules.
+
+The hardware tags all traffic internally with a port-based VLAN (pvid), or it
+decodes the VLAN information from the 802.1Q tag. Advanced VLAN classification
+is not possible. Once attributed a VLAN tag, frames are checked against the
+port's membership rules and dropped at ingress if they don't match any VLAN.
+This behavior is available when switch ports are enslaved to a bridge with
+``vlan_filtering 1``.
+
+Normally the hardware is not configurable with respect to VLAN awareness, but
+by changing what TPID the switch searches 802.1Q tags for, the semantics of a
+bridge with ``vlan_filtering 0`` can be kept (accept all traffic, tagged or
+untagged), and therefore this mode is also supported.
+
+Segregating the switch ports in multiple bridges is supported (e.g. 2 + 2), but
+all bridges should have the same level of VLAN awareness (either both have
+``vlan_filtering`` 0, or both 1).
+
+Topology and loop detection through STP is supported.
+
+Offloads
+========
+
+Time-aware scheduling
+---------------------
+
+The switch supports a variation of the enhancements for scheduled traffic
+specified in IEEE 802.1Q-2018 (formerly 802.1Qbv). This means it can be used to
+ensure deterministic latency for priority traffic that is sent in-band with its
+gate-open event in the network schedule.
+
+This capability can be managed through the tc-taprio offload ('flags 2'). The
+difference compared to the software implementation of taprio is that the latter
+would only be able to shape traffic originated from the CPU, but not
+autonomously forwarded flows.
+
+The device has 8 traffic classes, and maps incoming frames to one of them based
+on the VLAN PCP bits (if no VLAN is present, the port-based default is used).
+As described in the previous sections, depending on the value of
+``vlan_filtering``, the EtherType recognized by the switch as being VLAN can
+either be the typical 0x8100 or a custom value used internally by the driver
+for tagging. Therefore, the switch ignores the VLAN PCP if used in standalone
+or bridge mode with ``vlan_filtering=0``, as it will not recognize the 0x8100
+EtherType. In these modes, injecting into a particular TX queue can only be
+done by the DSA net devices, which populate the PCP field of the tagging header
+on egress. Using ``vlan_filtering=1``, the behavior is the other way around:
+offloaded flows can be steered to TX queues based on the VLAN PCP, but the DSA
+net devices are no longer able to do that. To inject frames into a hardware TX
+queue with VLAN awareness active, it is necessary to create a VLAN
+sub-interface on the DSA master port, and send normal (0x8100) VLAN-tagged
+towards the switch, with the VLAN PCP bits set appropriately.
+
+Management traffic (having DMAC 01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx) is the
+notable exception: the switch always treats it with a fixed priority and
+disregards any VLAN PCP bits even if present. The traffic class for management
+traffic has a value of 7 (highest priority) at the moment, which is not
+configurable in the driver.
+
+Below is an example of configuring a 500 us cyclic schedule on egress port
+``swp5``. The traffic class gate for management traffic (7) is open for 100 us,
+and the gates for all other traffic classes are open for 400 us::
+
+ #!/bin/bash
+
+ set -e -u -o pipefail
+
+ NSEC_PER_SEC="1000000000"
+
+ gatemask() {
+ local tc_list="$1"
+ local mask=0
+
+ for tc in ${tc_list}; do
+ mask=$((${mask} | (1 << ${tc})))
+ done
+
+ printf "%02x" ${mask}
+ }
+
+ if ! systemctl is-active --quiet ptp4l; then
+ echo "Please start the ptp4l service"
+ exit
+ fi
+
+ now=$(phc_ctl /dev/ptp1 get | gawk '/clock time is/ { print $5; }')
+ # Phase-align the base time to the start of the next second.
+ sec=$(echo "${now}" | gawk -F. '{ print $1; }')
+ base_time="$(((${sec} + 1) * ${NSEC_PER_SEC}))"
+
+ tc qdisc add dev swp5 parent root handle 100 taprio \
+ num_tc 8 \
+ map 0 1 2 3 5 6 7 \
+ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
+ base-time ${base_time} \
+ sched-entry S $(gatemask 7) 100000 \
+ sched-entry S $(gatemask "0 1 2 3 4 5 6") 400000 \
+ flags 2
+
+It is possible to apply the tc-taprio offload on multiple egress ports. There
+are hardware restrictions related to the fact that no gate event may trigger
+simultaneously on two ports. The driver checks the consistency of the schedules
+against this restriction and errors out when appropriate. Schedule analysis is
+needed to avoid this, which is outside the scope of the document.
+
+Routing actions (redirect, trap, drop)
+--------------------------------------
+
+The switch is able to offload flow-based redirection of packets to a set of
+destination ports specified by the user. Internally, this is implemented by
+making use of Virtual Links, a TTEthernet concept.
+
+The driver supports 2 types of keys for Virtual Links:
+
+- VLAN-aware virtual links: these match on destination MAC address, VLAN ID and
+ VLAN PCP.
+- VLAN-unaware virtual links: these match on destination MAC address only.
+
+The VLAN awareness state of the bridge (vlan_filtering) cannot be changed while
+there are virtual link rules installed.
+
+Composing multiple actions inside the same rule is supported. When only routing
+actions are requested, the driver creates a "non-critical" virtual link. When
+the action list also contains tc-gate (more details below), the virtual link
+becomes "time-critical" (draws frame buffers from a reserved memory partition,
+etc).
+
+The 3 routing actions that are supported are "trap", "drop" and "redirect".
+
+Example 1: send frames received on swp2 with a DA of 42:be:24:9b:76:20 to the
+CPU and to swp3. This type of key (DA only) when the port's VLAN awareness
+state is off::
+
+ tc qdisc add dev swp2 clsact
+ tc filter add dev swp2 ingress flower skip_sw dst_mac 42:be:24:9b:76:20 \
+ action mirred egress redirect dev swp3 \
+ action trap
+
+Example 2: drop frames received on swp2 with a DA of 42:be:24:9b:76:20, a VID
+of 100 and a PCP of 0::
+
+ tc filter add dev swp2 ingress protocol 802.1Q flower skip_sw \
+ dst_mac 42:be:24:9b:76:20 vlan_id 100 vlan_prio 0 action drop
+
+Time-based ingress policing
+---------------------------
+
+The TTEthernet hardware abilities of the switch can be constrained to act
+similarly to the Per-Stream Filtering and Policing (PSFP) clause specified in
+IEEE 802.1Q-2018 (formerly 802.1Qci). This means it can be used to perform
+tight timing-based admission control for up to 1024 flows (identified by a
+tuple composed of destination MAC address, VLAN ID and VLAN PCP). Packets which
+are received outside their expected reception window are dropped.
+
+This capability can be managed through the offload of the tc-gate action. As
+routing actions are intrinsic to virtual links in TTEthernet (which performs
+explicit routing of time-critical traffic and does not leave that in the hands
+of the FDB, flooding etc), the tc-gate action may never appear alone when
+asking sja1105 to offload it. One (or more) redirect or trap actions must also
+follow along.
+
+Example: create a tc-taprio schedule that is phase-aligned with a tc-gate
+schedule (the clocks must be synchronized by a 1588 application stack, which is
+outside the scope of this document). No packet delivered by the sender will be
+dropped. Note that the reception window is larger than the transmission window
+(and much more so, in this example) to compensate for the packet propagation
+delay of the link (which can be determined by the 1588 application stack).
+
+Receiver (sja1105)::
+
+ tc qdisc add dev swp2 clsact
+ now=$(phc_ctl /dev/ptp1 get | awk '/clock time is/ {print $5}') && \
+ sec=$(echo $now | awk -F. '{print $1}') && \
+ base_time="$(((sec + 2) * 1000000000))" && \
+ echo "base time ${base_time}"
+ tc filter add dev swp2 ingress flower skip_sw \
+ dst_mac 42:be:24:9b:76:20 \
+ action gate base-time ${base_time} \
+ sched-entry OPEN 60000 -1 -1 \
+ sched-entry CLOSE 40000 -1 -1 \
+ action trap
+
+Sender::
+
+ now=$(phc_ctl /dev/ptp0 get | awk '/clock time is/ {print $5}') && \
+ sec=$(echo $now | awk -F. '{print $1}') && \
+ base_time="$(((sec + 2) * 1000000000))" && \
+ echo "base time ${base_time}"
+ tc qdisc add dev eno0 parent root taprio \
+ num_tc 8 \
+ map 0 1 2 3 4 5 6 7 \
+ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
+ base-time ${base_time} \
+ sched-entry S 01 50000 \
+ sched-entry S 00 50000 \
+ flags 2
+
+The engine used to schedule the ingress gate operations is the same that the
+one used for the tc-taprio offload. Therefore, the restrictions regarding the
+fact that no two gate actions (either tc-gate or tc-taprio gates) may fire at
+the same time (during the same 200 ns slot) still apply.
+
+To come in handy, it is possible to share time-triggered virtual links across
+more than 1 ingress port, via flow blocks. In this case, the restriction of
+firing at the same time does not apply because there is a single schedule in
+the system, that of the shared virtual link::
+
+ tc qdisc add dev swp2 ingress_block 1 clsact
+ tc qdisc add dev swp3 ingress_block 1 clsact
+ tc filter add block 1 flower skip_sw dst_mac 42:be:24:9b:76:20 \
+ action gate index 2 \
+ base-time 0 \
+ sched-entry OPEN 50000000 -1 -1 \
+ sched-entry CLOSE 50000000 -1 -1 \
+ action trap
+
+Hardware statistics for each flow are also available ("pkts" counts the number
+of dropped frames, which is a sum of frames dropped due to timing violations,
+lack of destination ports and MTU enforcement checks). Byte-level counters are
+not available.
+
+Limitations
+===========
+
+The SJA1105 switch family always performs VLAN processing. When configured as
+VLAN-unaware, frames carry a different VLAN tag internally, depending on
+whether the port is standalone or under a VLAN-unaware bridge.
+
+The virtual link keys are always fixed at {MAC DA, VLAN ID, VLAN PCP}, but the
+driver asks for the VLAN ID and VLAN PCP when the port is under a VLAN-aware
+bridge. Otherwise, it fills in the VLAN ID and PCP automatically, based on
+whether the port is standalone or in a VLAN-unaware bridge, and accepts only
+"VLAN-unaware" tc-flower keys (MAC DA).
+
+The existing tc-flower keys that are offloaded using virtual links are no
+longer operational after one of the following happens:
+
+- port was standalone and joins a bridge (VLAN-aware or VLAN-unaware)
+- port is part of a bridge whose VLAN awareness state changes
+- port was part of a bridge and becomes standalone
+- port was standalone, but another port joins a VLAN-aware bridge and this
+ changes the global VLAN awareness state of the bridge
+
+The driver cannot veto all these operations, and it cannot update/remove the
+existing tc-flower filters either. So for proper operation, the tc-flower
+filters should be installed only after the forwarding configuration of the port
+has been made, and removed by user space before making any changes to it.
+
+Device Tree bindings and board design
+=====================================
+
+This section references ``Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml``
+and aims to showcase some potential switch caveats.
+
+RMII PHY role and out-of-band signaling
+---------------------------------------
+
+In the RMII spec, the 50 MHz clock signals are either driven by the MAC or by
+an external oscillator (but not by the PHY).
+But the spec is rather loose and devices go outside it in several ways.
+Some PHYs go against the spec and may provide an output pin where they source
+the 50 MHz clock themselves, in an attempt to be helpful.
+On the other hand, the SJA1105 is only binary configurable - when in the RMII
+MAC role it will also attempt to drive the clock signal. To prevent this from
+happening it must be put in RMII PHY role.
+But doing so has some unintended consequences.
+In the RMII spec, the PHY can transmit extra out-of-band signals via RXD[1:0].
+These are practically some extra code words (/J/ and /K/) sent prior to the
+preamble of each frame. The MAC does not have this out-of-band signaling
+mechanism defined by the RMII spec.
+So when the SJA1105 port is put in PHY role to avoid having 2 drivers on the
+clock signal, inevitably an RMII PHY-to-PHY connection is created. The SJA1105
+emulates a PHY interface fully and generates the /J/ and /K/ symbols prior to
+frame preambles, which the real PHY is not expected to understand. So the PHY
+simply encodes the extra symbols received from the SJA1105-as-PHY onto the
+100Base-Tx wire.
+On the other side of the wire, some link partners might discard these extra
+symbols, while others might choke on them and discard the entire Ethernet
+frames that follow along. This looks like packet loss with some link partners
+but not with others.
+The take-away is that in RMII mode, the SJA1105 must be let to drive the
+reference clock if connected to a PHY.
+
+RGMII fixed-link and internal delays
+------------------------------------
+
+As mentioned in the bindings document, the second generation of devices has
+tunable delay lines as part of the MAC, which can be used to establish the
+correct RGMII timing budget.
+When powered up, these can shift the Rx and Tx clocks with a phase difference
+between 73.8 and 101.7 degrees.
+The catch is that the delay lines need to lock onto a clock signal with a
+stable frequency. This means that there must be at least 2 microseconds of
+silence between the clock at the old vs at the new frequency. Otherwise the
+lock is lost and the delay lines must be reset (powered down and back up).
+In RGMII the clock frequency changes with link speed (125 MHz at 1000 Mbps, 25
+MHz at 100 Mbps and 2.5 MHz at 10 Mbps), and link speed might change during the
+AN process.
+In the situation where the switch port is connected through an RGMII fixed-link
+to a link partner whose link state life cycle is outside the control of Linux
+(such as a different SoC), then the delay lines would remain unlocked (and
+inactive) until there is manual intervention (ifdown/ifup on the switch port).
+The take-away is that in RGMII mode, the switch's internal delays are only
+reliable if the link partner never changes link speeds, or if it does, it does
+so in a way that is coordinated with the switch port (practically, both ends of
+the fixed-link are under control of the same Linux system).
+As to why would a fixed-link interface ever change link speeds: there are
+Ethernet controllers out there which come out of reset in 100 Mbps mode, and
+their driver inevitably needs to change the speed and clock frequency if it's
+required to work at gigabit.
+
+MDIO bus and PHY management
+---------------------------
+
+The SJA1105 does not have an MDIO bus and does not perform in-band AN either.
+Therefore there is no link state notification coming from the switch device.
+A board would need to hook up the PHYs connected to the switch to any other
+MDIO bus available to Linux within the system (e.g. to the DSA master's MDIO
+bus). Link state management then works by the driver manually keeping in sync
+(over SPI commands) the MAC link speed with the settings negotiated by the PHY.
+
+By comparison, the SJA1110 supports an MDIO slave access point over which its
+internal 100base-T1 PHYs can be accessed from the host. This is, however, not
+used by the driver, instead the internal 100base-T1 and 100base-TX PHYs are
+accessed through SPI commands, modeled in Linux as virtual MDIO buses.
+
+The microcontroller attached to the SJA1110 port 0 also has an MDIO controller
+operating in master mode, however the driver does not support this either,
+since the microcontroller gets disabled when the Linux driver operates.
+Discrete PHYs connected to the switch ports should have their MDIO interface
+attached to an MDIO controller from the host system and not to the switch,
+similar to SJA1105.
+
+Port compatibility matrix
+-------------------------
+
+The SJA1105 port compatibility matrix is:
+
+===== ============== ============== ==============
+Port SJA1105E/T SJA1105P/Q SJA1105R/S
+===== ============== ============== ==============
+0 xMII xMII xMII
+1 xMII xMII xMII
+2 xMII xMII xMII
+3 xMII xMII xMII
+4 xMII xMII SGMII
+===== ============== ============== ==============
+
+
+The SJA1110 port compatibility matrix is:
+
+===== ============== ============== ============== ==============
+Port SJA1110A SJA1110B SJA1110C SJA1110D
+===== ============== ============== ============== ==============
+0 RevMII (uC) RevMII (uC) RevMII (uC) RevMII (uC)
+1 100base-TX 100base-TX 100base-TX
+ or SGMII SGMII
+2 xMII xMII xMII xMII
+ or SGMII or SGMII
+3 xMII xMII xMII
+ or SGMII or SGMII SGMII
+ or 2500base-X or 2500base-X or 2500base-X
+4 SGMII SGMII SGMII SGMII
+ or 2500base-X or 2500base-X or 2500base-X or 2500base-X
+5 100base-T1 100base-T1 100base-T1 100base-T1
+6 100base-T1 100base-T1 100base-T1 100base-T1
+7 100base-T1 100base-T1 100base-T1 100base-T1
+8 100base-T1 100base-T1 n/a n/a
+9 100base-T1 100base-T1 n/a n/a
+10 100base-T1 n/a n/a n/a
+===== ============== ============== ============== ==============
diff --git a/Documentation/networking/eql.rst b/Documentation/networking/eql.rst
new file mode 100644
index 0000000000..a628c4c811
--- /dev/null
+++ b/Documentation/networking/eql.rst
@@ -0,0 +1,373 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================================
+EQL Driver: Serial IP Load Balancing HOWTO
+==========================================
+
+ Simon "Guru Aleph-Null" Janes, simon@ncm.com
+
+ v1.1, February 27, 1995
+
+ This is the manual for the EQL device driver. EQL is a software device
+ that lets you load-balance IP serial links (SLIP or uncompressed PPP)
+ to increase your bandwidth. It will not reduce your latency (i.e. ping
+ times) except in the case where you already have lots of traffic on
+ your link, in which it will help them out. This driver has been tested
+ with the 1.1.75 kernel, and is known to have patched cleanly with
+ 1.1.86. Some testing with 1.1.92 has been done with the v1.1 patch
+ which was only created to patch cleanly in the very latest kernel
+ source trees. (Yes, it worked fine.)
+
+1. Introduction
+===============
+
+ Which is worse? A huge fee for a 56K leased line or two phone lines?
+ It's probably the former. If you find yourself craving more bandwidth,
+ and have a ISP that is flexible, it is now possible to bind modems
+ together to work as one point-to-point link to increase your
+ bandwidth. All without having to have a special black box on either
+ side.
+
+
+ The eql driver has only been tested with the Livingston PortMaster-2e
+ terminal server. I do not know if other terminal servers support load-
+ balancing, but I do know that the PortMaster does it, and does it
+ almost as well as the eql driver seems to do it (-- Unfortunately, in
+ my testing so far, the Livingston PortMaster 2e's load-balancing is a
+ good 1 to 2 KB/s slower than the test machine working with a 28.8 Kbps
+ and 14.4 Kbps connection. However, I am not sure that it really is
+ the PortMaster, or if it's Linux's TCP drivers. I'm told that Linux's
+ TCP implementation is pretty fast though.--)
+
+
+ I suggest to ISPs out there that it would probably be fair to charge
+ a load-balancing client 75% of the cost of the second line and 50% of
+ the cost of the third line etc...
+
+
+ Hey, we can all dream you know...
+
+
+2. Kernel Configuration
+=======================
+
+ Here I describe the general steps of getting a kernel up and working
+ with the eql driver. From patching, building, to installing.
+
+
+2.1. Patching The Kernel
+------------------------
+
+ If you do not have or cannot get a copy of the kernel with the eql
+ driver folded into it, get your copy of the driver from
+ ftp://slaughter.ncm.com/pub/Linux/LOAD_BALANCING/eql-1.1.tar.gz.
+ Unpack this archive someplace obvious like /usr/local/src/. It will
+ create the following files::
+
+ -rw-r--r-- guru/ncm 198 Jan 19 18:53 1995 eql-1.1/NO-WARRANTY
+ -rw-r--r-- guru/ncm 30620 Feb 27 21:40 1995 eql-1.1/eql-1.1.patch
+ -rwxr-xr-x guru/ncm 16111 Jan 12 22:29 1995 eql-1.1/eql_enslave
+ -rw-r--r-- guru/ncm 2195 Jan 10 21:48 1995 eql-1.1/eql_enslave.c
+
+ Unpack a recent kernel (something after 1.1.92) someplace convenient
+ like say /usr/src/linux-1.1.92.eql. Use symbolic links to point
+ /usr/src/linux to this development directory.
+
+
+ Apply the patch by running the commands::
+
+ cd /usr/src
+ patch </usr/local/src/eql-1.1/eql-1.1.patch
+
+
+2.2. Building The Kernel
+------------------------
+
+ After patching the kernel, run make config and configure the kernel
+ for your hardware.
+
+
+ After configuration, make and install according to your habit.
+
+
+3. Network Configuration
+========================
+
+ So far, I have only used the eql device with the DSLIP SLIP connection
+ manager by Matt Dillon (-- "The man who sold his soul to code so much
+ so quickly."--) . How you configure it for other "connection"
+ managers is up to you. Most other connection managers that I've seen
+ don't do a very good job when it comes to handling more than one
+ connection.
+
+
+3.1. /etc/rc.d/rc.inet1
+-----------------------
+
+ In rc.inet1, ifconfig the eql device to the IP address you usually use
+ for your machine, and the MTU you prefer for your SLIP lines. One
+ could argue that MTU should be roughly half the usual size for two
+ modems, one-third for three, one-fourth for four, etc... But going
+ too far below 296 is probably overkill. Here is an example ifconfig
+ command that sets up the eql device::
+
+ ifconfig eql 198.67.33.239 mtu 1006
+
+ Once the eql device is up and running, add a static default route to
+ it in the routing table using the cool new route syntax that makes
+ life so much easier::
+
+ route add default eql
+
+
+3.2. Enslaving Devices By Hand
+------------------------------
+
+ Enslaving devices by hand requires two utility programs: eql_enslave
+ and eql_emancipate (-- eql_emancipate hasn't been written because when
+ an enslaved device "dies", it is automatically taken out of the queue.
+ I haven't found a good reason to write it yet... other than for
+ completeness, but that isn't a good motivator is it?--)
+
+
+ The syntax for enslaving a device is "eql_enslave <master-name>
+ <slave-name> <estimated-bps>". Here are some example enslavings::
+
+ eql_enslave eql sl0 28800
+ eql_enslave eql ppp0 14400
+ eql_enslave eql sl1 57600
+
+ When you want to free a device from its life of slavery, you can
+ either down the device with ifconfig (eql will automatically bury the
+ dead slave and remove it from its queue) or use eql_emancipate to free
+ it. (-- Or just ifconfig it down, and the eql driver will take it out
+ for you.--)::
+
+ eql_emancipate eql sl0
+ eql_emancipate eql ppp0
+ eql_emancipate eql sl1
+
+
+3.3. DSLIP Configuration for the eql Device
+-------------------------------------------
+
+ The general idea is to bring up and keep up as many SLIP connections
+ as you need, automatically.
+
+
+3.3.1. /etc/slip/runslip.conf
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ Here is an example runslip.conf::
+
+ name sl-line-1
+ enabled
+ baud 38400
+ mtu 576
+ ducmd -e /etc/slip/dialout/cua2-288.xp -t 9
+ command eql_enslave eql $interface 28800
+ address 198.67.33.239
+ line /dev/cua2
+
+ name sl-line-2
+ enabled
+ baud 38400
+ mtu 576
+ ducmd -e /etc/slip/dialout/cua3-288.xp -t 9
+ command eql_enslave eql $interface 28800
+ address 198.67.33.239
+ line /dev/cua3
+
+
+3.4. Using PPP and the eql Device
+---------------------------------
+
+ I have not yet done any load-balancing testing for PPP devices, mainly
+ because I don't have a PPP-connection manager like SLIP has with
+ DSLIP. I did find a good tip from LinuxNET:Billy for PPP performance:
+ make sure you have asyncmap set to something so that control
+ characters are not escaped.
+
+
+ I tried to fix up a PPP script/system for redialing lost PPP
+ connections for use with the eql driver the weekend of Feb 25-26 '95
+ (Hereafter known as the 8-hour PPP Hate Festival). Perhaps later this
+ year.
+
+
+4. About the Slave Scheduler Algorithm
+======================================
+
+ The slave scheduler probably could be replaced with a dozen other
+ things and push traffic much faster. The formula in the current set
+ up of the driver was tuned to handle slaves with wildly different
+ bits-per-second "priorities".
+
+
+ All testing I have done was with two 28.8 V.FC modems, one connecting
+ at 28800 bps or slower, and the other connecting at 14400 bps all the
+ time.
+
+
+ One version of the scheduler was able to push 5.3 K/s through the
+ 28800 and 14400 connections, but when the priorities on the links were
+ very wide apart (57600 vs. 14400) the "faster" modem received all
+ traffic and the "slower" modem starved.
+
+
+5. Testers' Reports
+===================
+
+ Some people have experimented with the eql device with newer
+ kernels (than 1.1.75). I have since updated the driver to patch
+ cleanly in newer kernels because of the removal of the old "slave-
+ balancing" driver config option.
+
+
+ - icee from LinuxNET patched 1.1.86 without any rejects and was able
+ to boot the kernel and enslave a couple of ISDN PPP links.
+
+5.1. Randolph Bentson's Test Report
+-----------------------------------
+
+ ::
+
+ From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995
+ Date: Tue, 7 Feb 95 22:57 PST
+ From: Randolph Bentson <bentson@grieg.seaslug.org>
+ To: guru@ncm.com
+ Subject: EQL driver tests
+
+
+ I have been checking out your eql driver. (Nice work, that!)
+ Although you may already done this performance testing, here
+ are some data I've discovered.
+
+ Randolph Bentson
+ bentson@grieg.seaslug.org
+
+------------------------------------------------------------------
+
+
+ A pseudo-device driver, EQL, written by Simon Janes, can be used
+ to bundle multiple SLIP connections into what appears to be a
+ single connection. This allows one to improve dial-up network
+ connectivity gradually, without having to buy expensive DSU/CSU
+ hardware and services.
+
+ I have done some testing of this software, with two goals in
+ mind: first, to ensure it actually works as described and
+ second, as a method of exercising my device driver.
+
+ The following performance measurements were derived from a set
+ of SLIP connections run between two Linux systems (1.1.84) using
+ a 486DX2/66 with a Cyclom-8Ys and a 486SLC/40 with a Cyclom-16Y.
+ (Ports 0,1,2,3 were used. A later configuration will distribute
+ port selection across the different Cirrus chips on the boards.)
+ Once a link was established, I timed a binary ftp transfer of
+ 289284 bytes of data. If there were no overhead (packet headers,
+ inter-character and inter-packet delays, etc.) the transfers
+ would take the following times::
+
+ bits/sec seconds
+ 345600 8.3
+ 234600 12.3
+ 172800 16.7
+ 153600 18.8
+ 76800 37.6
+ 57600 50.2
+ 38400 75.3
+ 28800 100.4
+ 19200 150.6
+ 9600 301.3
+
+ A single line running at the lower speeds and with large packets
+ comes to within 2% of this. Performance is limited for the higher
+ speeds (as predicted by the Cirrus databook) to an aggregate of
+ about 160 kbits/sec. The next round of testing will distribute
+ the load across two or more Cirrus chips.
+
+ The good news is that one gets nearly the full advantage of the
+ second, third, and fourth line's bandwidth. (The bad news is
+ that the connection establishment seemed fragile for the higher
+ speeds. Once established, the connection seemed robust enough.)
+
+ ====== ======== === ======== ======= ======= ===
+ #lines speed mtu seconds theory actual %of
+ kbit/sec duration speed speed max
+ ====== ======== === ======== ======= ======= ===
+ 3 115200 900 _ 345600
+ 3 115200 400 18.1 345600 159825 46
+ 2 115200 900 _ 230400
+ 2 115200 600 18.1 230400 159825 69
+ 2 115200 400 19.3 230400 149888 65
+ 4 57600 900 _ 234600
+ 4 57600 600 _ 234600
+ 4 57600 400 _ 234600
+ 3 57600 600 20.9 172800 138413 80
+ 3 57600 900 21.2 172800 136455 78
+ 3 115200 600 21.7 345600 133311 38
+ 3 57600 400 22.5 172800 128571 74
+ 4 38400 900 25.2 153600 114795 74
+ 4 38400 600 26.4 153600 109577 71
+ 4 38400 400 27.3 153600 105965 68
+ 2 57600 900 29.1 115200 99410.3 86
+ 1 115200 900 30.7 115200 94229.3 81
+ 2 57600 600 30.2 115200 95789.4 83
+ 3 38400 900 30.3 115200 95473.3 82
+ 3 38400 600 31.2 115200 92719.2 80
+ 1 115200 600 31.3 115200 92423 80
+ 2 57600 400 32.3 115200 89561.6 77
+ 1 115200 400 32.8 115200 88196.3 76
+ 3 38400 400 33.5 115200 86353.4 74
+ 2 38400 900 43.7 76800 66197.7 86
+ 2 38400 600 44 76800 65746.4 85
+ 2 38400 400 47.2 76800 61289 79
+ 4 19200 900 50.8 76800 56945.7 74
+ 4 19200 400 53.2 76800 54376.7 70
+ 4 19200 600 53.7 76800 53870.4 70
+ 1 57600 900 54.6 57600 52982.4 91
+ 1 57600 600 56.2 57600 51474 89
+ 3 19200 900 60.5 57600 47815.5 83
+ 1 57600 400 60.2 57600 48053.8 83
+ 3 19200 600 62 57600 46658.7 81
+ 3 19200 400 64.7 57600 44711.6 77
+ 1 38400 900 79.4 38400 36433.8 94
+ 1 38400 600 82.4 38400 35107.3 91
+ 2 19200 900 84.4 38400 34275.4 89
+ 1 38400 400 86.8 38400 33327.6 86
+ 2 19200 600 87.6 38400 33023.3 85
+ 2 19200 400 91.2 38400 31719.7 82
+ 4 9600 900 94.7 38400 30547.4 79
+ 4 9600 400 106 38400 27290.9 71
+ 4 9600 600 110 38400 26298.5 68
+ 3 9600 900 118 28800 24515.6 85
+ 3 9600 600 120 28800 24107 83
+ 3 9600 400 131 28800 22082.7 76
+ 1 19200 900 155 19200 18663.5 97
+ 1 19200 600 161 19200 17968 93
+ 1 19200 400 170 19200 17016.7 88
+ 2 9600 600 176 19200 16436.6 85
+ 2 9600 900 180 19200 16071.3 83
+ 2 9600 400 181 19200 15982.5 83
+ 1 9600 900 305 9600 9484.72 98
+ 1 9600 600 314 9600 9212.87 95
+ 1 9600 400 332 9600 8713.37 90
+ ====== ======== === ======== ======= ======= ===
+
+5.2. Anthony Healy's Report
+---------------------------
+
+ ::
+
+ Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST)
+ From: Antony Healey <ahealey@st.nepean.uws.edu.au>
+ To: Simon Janes <guru@ncm.com>
+ Subject: Re: Load Balancing
+
+ Hi Simon,
+ I've installed your patch and it works great. I have trialed
+ it over twin SL/IP lines, just over null modems, but I was
+ able to data at over 48Kb/s [ISDN link -Simon]. I managed a
+ transfer of up to 7.5 Kbyte/s on one go, but averaged around
+ 6.4 Kbyte/s, which I think is pretty cool. :)
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
new file mode 100644
index 0000000000..2540c70952
--- /dev/null
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -0,0 +1,2103 @@
+=============================
+Netlink interface for ethtool
+=============================
+
+
+Basic information
+=================
+
+Netlink interface for ethtool uses generic netlink family ``ethtool``
+(userspace application should use macros ``ETHTOOL_GENL_NAME`` and
+``ETHTOOL_GENL_VERSION`` defined in ``<linux/ethtool_netlink.h>`` uapi
+header). This family does not use a specific header, all information in
+requests and replies is passed using netlink attributes.
+
+The ethtool netlink interface uses extended ACK for error and warning
+reporting, userspace application developers are encouraged to make these
+messages available to user in a suitable way.
+
+Requests can be divided into three categories: "get" (retrieving information),
+"set" (setting parameters) and "action" (invoking an action).
+
+All "set" and "action" type requests require admin privileges
+(``CAP_NET_ADMIN`` in the namespace). Most "get" type requests are allowed for
+anyone but there are exceptions (where the response contains sensitive
+information). In some cases, the request as such is allowed for anyone but
+unprivileged users have attributes with sensitive information (e.g.
+wake-on-lan password) omitted.
+
+
+Conventions
+===========
+
+Attributes which represent a boolean value usually use NLA_U8 type so that we
+can distinguish three states: "on", "off" and "not present" (meaning the
+information is not available in "get" requests or value is not to be changed
+in "set" requests). For these attributes, the "true" value should be passed as
+number 1 but any non-zero value should be understood as "true" by recipient.
+In the tables below, "bool" denotes NLA_U8 attributes interpreted in this way.
+
+In the message structure descriptions below, if an attribute name is suffixed
+with "+", parent nest can contain multiple attributes of the same type. This
+implements an array of entries.
+
+Attributes that need to be filled-in by device drivers and that are dumped to
+user space based on whether they are valid or not should not use zero as a
+valid value. This avoids the need to explicitly signal the validity of the
+attribute in the device driver API.
+
+
+Request header
+==============
+
+Each request or reply message contains a nested attribute with common header.
+Structure of this header is
+
+ ============================== ====== =============================
+ ``ETHTOOL_A_HEADER_DEV_INDEX`` u32 device ifindex
+ ``ETHTOOL_A_HEADER_DEV_NAME`` string device name
+ ``ETHTOOL_A_HEADER_FLAGS`` u32 flags common for all requests
+ ============================== ====== =============================
+
+``ETHTOOL_A_HEADER_DEV_INDEX`` and ``ETHTOOL_A_HEADER_DEV_NAME`` identify the
+device message relates to. One of them is sufficient in requests, if both are
+used, they must identify the same device. Some requests, e.g. global string
+sets, do not require device identification. Most ``GET`` requests also allow
+dump requests without device identification to query the same information for
+all devices providing it (each device in a separate message).
+
+``ETHTOOL_A_HEADER_FLAGS`` is a bitmap of request flags common for all request
+types. The interpretation of these flags is the same for all request types but
+the flags may not apply to requests. Recognized flags are:
+
+ ================================= ===================================
+ ``ETHTOOL_FLAG_COMPACT_BITSETS`` use compact format bitsets in reply
+ ``ETHTOOL_FLAG_OMIT_REPLY`` omit optional reply (_SET and _ACT)
+ ``ETHTOOL_FLAG_STATS`` include optional device statistics
+ ================================= ===================================
+
+New request flags should follow the general idea that if the flag is not set,
+the behaviour is backward compatible, i.e. requests from old clients not aware
+of the flag should be interpreted the way the client expects. A client must
+not set flags it does not understand.
+
+
+Bit sets
+========
+
+For short bitmaps of (reasonably) fixed length, standard ``NLA_BITFIELD32``
+type is used. For arbitrary length bitmaps, ethtool netlink uses a nested
+attribute with contents of one of two forms: compact (two binary bitmaps
+representing bit values and mask of affected bits) and bit-by-bit (list of
+bits identified by either index or name).
+
+Verbose (bit-by-bit) bitsets allow sending symbolic names for bits together
+with their values which saves a round trip (when the bitset is passed in a
+request) or at least a second request (when the bitset is in a reply). This is
+useful for one shot applications like traditional ethtool command. On the
+other hand, long running applications like ethtool monitor (displaying
+notifications) or network management daemons may prefer fetching the names
+only once and using compact form to save message size. Notifications from
+ethtool netlink interface always use compact form for bitsets.
+
+A bitset can represent either a value/mask pair (``ETHTOOL_A_BITSET_NOMASK``
+not set) or a single bitmap (``ETHTOOL_A_BITSET_NOMASK`` set). In requests
+modifying a bitmap, the former changes the bit set in mask to values set in
+value and preserves the rest; the latter sets the bits set in the bitmap and
+clears the rest.
+
+Compact form: nested (bitset) attribute contents:
+
+ ============================ ====== ============================
+ ``ETHTOOL_A_BITSET_NOMASK`` flag no mask, only a list
+ ``ETHTOOL_A_BITSET_SIZE`` u32 number of significant bits
+ ``ETHTOOL_A_BITSET_VALUE`` binary bitmap of bit values
+ ``ETHTOOL_A_BITSET_MASK`` binary bitmap of valid bits
+ ============================ ====== ============================
+
+Value and mask must have length at least ``ETHTOOL_A_BITSET_SIZE`` bits
+rounded up to a multiple of 32 bits. They consist of 32-bit words in host byte
+order, words ordered from least significant to most significant (i.e. the same
+way as bitmaps are passed with ioctl interface).
+
+For compact form, ``ETHTOOL_A_BITSET_SIZE`` and ``ETHTOOL_A_BITSET_VALUE`` are
+mandatory. ``ETHTOOL_A_BITSET_MASK`` attribute is mandatory if
+``ETHTOOL_A_BITSET_NOMASK`` is not set (bitset represents a value/mask pair);
+if ``ETHTOOL_A_BITSET_NOMASK`` is not set, ``ETHTOOL_A_BITSET_MASK`` is not
+allowed (bitset represents a single bitmap.
+
+Kernel bit set length may differ from userspace length if older application is
+used on newer kernel or vice versa. If userspace bitmap is longer, an error is
+issued only if the request actually tries to set values of some bits not
+recognized by kernel.
+
+Bit-by-bit form: nested (bitset) attribute contents:
+
+ +------------------------------------+--------+-----------------------------+
+ | ``ETHTOOL_A_BITSET_NOMASK`` | flag | no mask, only a list |
+ +------------------------------------+--------+-----------------------------+
+ | ``ETHTOOL_A_BITSET_SIZE`` | u32 | number of significant bits |
+ +------------------------------------+--------+-----------------------------+
+ | ``ETHTOOL_A_BITSET_BITS`` | nested | array of bits |
+ +-+----------------------------------+--------+-----------------------------+
+ | | ``ETHTOOL_A_BITSET_BITS_BIT+`` | nested | one bit |
+ +-+-+--------------------------------+--------+-----------------------------+
+ | | | ``ETHTOOL_A_BITSET_BIT_INDEX`` | u32 | bit index (0 for LSB) |
+ +-+-+--------------------------------+--------+-----------------------------+
+ | | | ``ETHTOOL_A_BITSET_BIT_NAME`` | string | bit name |
+ +-+-+--------------------------------+--------+-----------------------------+
+ | | | ``ETHTOOL_A_BITSET_BIT_VALUE`` | flag | present if bit is set |
+ +-+-+--------------------------------+--------+-----------------------------+
+
+Bit size is optional for bit-by-bit form. ``ETHTOOL_A_BITSET_BITS`` nest can
+only contain ``ETHTOOL_A_BITSET_BITS_BIT`` attributes but there can be an
+arbitrary number of them. A bit may be identified by its index or by its
+name. When used in requests, listed bits are set to 0 or 1 according to
+``ETHTOOL_A_BITSET_BIT_VALUE``, the rest is preserved. A request fails if
+index exceeds kernel bit length or if name is not recognized.
+
+When ``ETHTOOL_A_BITSET_NOMASK`` flag is present, bitset is interpreted as
+a simple bitmap. ``ETHTOOL_A_BITSET_BIT_VALUE`` attributes are not used in
+such case. Such bitset represents a bitmap with listed bits set and the rest
+zero.
+
+In requests, application can use either form. Form used by kernel in reply is
+determined by ``ETHTOOL_FLAG_COMPACT_BITSETS`` flag in flags field of request
+header. Semantics of value and mask depends on the attribute.
+
+
+List of message types
+=====================
+
+All constants identifying message types use ``ETHTOOL_CMD_`` prefix and suffix
+according to message purpose:
+
+ ============== ======================================
+ ``_GET`` userspace request to retrieve data
+ ``_SET`` userspace request to set data
+ ``_ACT`` userspace request to perform an action
+ ``_GET_REPLY`` kernel reply to a ``GET`` request
+ ``_SET_REPLY`` kernel reply to a ``SET`` request
+ ``_ACT_REPLY`` kernel reply to an ``ACT`` request
+ ``_NTF`` kernel notification
+ ============== ======================================
+
+Userspace to kernel:
+
+ ===================================== =================================
+ ``ETHTOOL_MSG_STRSET_GET`` get string set
+ ``ETHTOOL_MSG_LINKINFO_GET`` get link settings
+ ``ETHTOOL_MSG_LINKINFO_SET`` set link settings
+ ``ETHTOOL_MSG_LINKMODES_GET`` get link modes info
+ ``ETHTOOL_MSG_LINKMODES_SET`` set link modes info
+ ``ETHTOOL_MSG_LINKSTATE_GET`` get link state
+ ``ETHTOOL_MSG_DEBUG_GET`` get debugging settings
+ ``ETHTOOL_MSG_DEBUG_SET`` set debugging settings
+ ``ETHTOOL_MSG_WOL_GET`` get wake-on-lan settings
+ ``ETHTOOL_MSG_WOL_SET`` set wake-on-lan settings
+ ``ETHTOOL_MSG_FEATURES_GET`` get device features
+ ``ETHTOOL_MSG_FEATURES_SET`` set device features
+ ``ETHTOOL_MSG_PRIVFLAGS_GET`` get private flags
+ ``ETHTOOL_MSG_PRIVFLAGS_SET`` set private flags
+ ``ETHTOOL_MSG_RINGS_GET`` get ring sizes
+ ``ETHTOOL_MSG_RINGS_SET`` set ring sizes
+ ``ETHTOOL_MSG_CHANNELS_GET`` get channel counts
+ ``ETHTOOL_MSG_CHANNELS_SET`` set channel counts
+ ``ETHTOOL_MSG_COALESCE_GET`` get coalescing parameters
+ ``ETHTOOL_MSG_COALESCE_SET`` set coalescing parameters
+ ``ETHTOOL_MSG_PAUSE_GET`` get pause parameters
+ ``ETHTOOL_MSG_PAUSE_SET`` set pause parameters
+ ``ETHTOOL_MSG_EEE_GET`` get EEE settings
+ ``ETHTOOL_MSG_EEE_SET`` set EEE settings
+ ``ETHTOOL_MSG_TSINFO_GET`` get timestamping info
+ ``ETHTOOL_MSG_CABLE_TEST_ACT`` action start cable test
+ ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT`` action start raw TDR cable test
+ ``ETHTOOL_MSG_TUNNEL_INFO_GET`` get tunnel offload info
+ ``ETHTOOL_MSG_FEC_GET`` get FEC settings
+ ``ETHTOOL_MSG_FEC_SET`` set FEC settings
+ ``ETHTOOL_MSG_MODULE_EEPROM_GET`` read SFP module EEPROM
+ ``ETHTOOL_MSG_STATS_GET`` get standard statistics
+ ``ETHTOOL_MSG_PHC_VCLOCKS_GET`` get PHC virtual clocks info
+ ``ETHTOOL_MSG_MODULE_SET`` set transceiver module parameters
+ ``ETHTOOL_MSG_MODULE_GET`` get transceiver module parameters
+ ``ETHTOOL_MSG_PSE_SET`` set PSE parameters
+ ``ETHTOOL_MSG_PSE_GET`` get PSE parameters
+ ``ETHTOOL_MSG_RSS_GET`` get RSS settings
+ ``ETHTOOL_MSG_MM_GET`` get MAC merge layer state
+ ``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters
+ ===================================== =================================
+
+Kernel to userspace:
+
+ ======================================== =================================
+ ``ETHTOOL_MSG_STRSET_GET_REPLY`` string set contents
+ ``ETHTOOL_MSG_LINKINFO_GET_REPLY`` link settings
+ ``ETHTOOL_MSG_LINKINFO_NTF`` link settings notification
+ ``ETHTOOL_MSG_LINKMODES_GET_REPLY`` link modes info
+ ``ETHTOOL_MSG_LINKMODES_NTF`` link modes notification
+ ``ETHTOOL_MSG_LINKSTATE_GET_REPLY`` link state info
+ ``ETHTOOL_MSG_DEBUG_GET_REPLY`` debugging settings
+ ``ETHTOOL_MSG_DEBUG_NTF`` debugging settings notification
+ ``ETHTOOL_MSG_WOL_GET_REPLY`` wake-on-lan settings
+ ``ETHTOOL_MSG_WOL_NTF`` wake-on-lan settings notification
+ ``ETHTOOL_MSG_FEATURES_GET_REPLY`` device features
+ ``ETHTOOL_MSG_FEATURES_SET_REPLY`` optional reply to FEATURES_SET
+ ``ETHTOOL_MSG_FEATURES_NTF`` netdev features notification
+ ``ETHTOOL_MSG_PRIVFLAGS_GET_REPLY`` private flags
+ ``ETHTOOL_MSG_PRIVFLAGS_NTF`` private flags
+ ``ETHTOOL_MSG_RINGS_GET_REPLY`` ring sizes
+ ``ETHTOOL_MSG_RINGS_NTF`` ring sizes
+ ``ETHTOOL_MSG_CHANNELS_GET_REPLY`` channel counts
+ ``ETHTOOL_MSG_CHANNELS_NTF`` channel counts
+ ``ETHTOOL_MSG_COALESCE_GET_REPLY`` coalescing parameters
+ ``ETHTOOL_MSG_COALESCE_NTF`` coalescing parameters
+ ``ETHTOOL_MSG_PAUSE_GET_REPLY`` pause parameters
+ ``ETHTOOL_MSG_PAUSE_NTF`` pause parameters
+ ``ETHTOOL_MSG_EEE_GET_REPLY`` EEE settings
+ ``ETHTOOL_MSG_EEE_NTF`` EEE settings
+ ``ETHTOOL_MSG_TSINFO_GET_REPLY`` timestamping info
+ ``ETHTOOL_MSG_CABLE_TEST_NTF`` Cable test results
+ ``ETHTOOL_MSG_CABLE_TEST_TDR_NTF`` Cable test TDR results
+ ``ETHTOOL_MSG_TUNNEL_INFO_GET_REPLY`` tunnel offload info
+ ``ETHTOOL_MSG_FEC_GET_REPLY`` FEC settings
+ ``ETHTOOL_MSG_FEC_NTF`` FEC settings
+ ``ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY`` read SFP module EEPROM
+ ``ETHTOOL_MSG_STATS_GET_REPLY`` standard statistics
+ ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY`` PHC virtual clocks info
+ ``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters
+ ``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters
+ ``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings
+ ``ETHTOOL_MSG_MM_GET_REPLY`` MAC merge layer status
+ ======================================== =================================
+
+``GET`` requests are sent by userspace applications to retrieve device
+information. They usually do not contain any message specific attributes.
+Kernel replies with corresponding "GET_REPLY" message. For most types, ``GET``
+request with ``NLM_F_DUMP`` and no device identification can be used to query
+the information for all devices supporting the request.
+
+If the data can be also modified, corresponding ``SET`` message with the same
+layout as corresponding ``GET_REPLY`` is used to request changes. Only
+attributes where a change is requested are included in such request (also, not
+all attributes may be changed). Replies to most ``SET`` request consist only
+of error code and extack; if kernel provides additional data, it is sent in
+the form of corresponding ``SET_REPLY`` message which can be suppressed by
+setting ``ETHTOOL_FLAG_OMIT_REPLY`` flag in request header.
+
+Data modification also triggers sending a ``NTF`` message with a notification.
+These usually bear only a subset of attributes which was affected by the
+change. The same notification is issued if the data is modified using other
+means (mostly ioctl ethtool interface). Unlike notifications from ethtool
+netlink code which are only sent if something actually changed, notifications
+triggered by ioctl interface may be sent even if the request did not actually
+change any data.
+
+``ACT`` messages request kernel (driver) to perform a specific action. If some
+information is reported by kernel (which can be suppressed by setting
+``ETHTOOL_FLAG_OMIT_REPLY`` flag in request header), the reply takes form of
+an ``ACT_REPLY`` message. Performing an action also triggers a notification
+(``NTF`` message).
+
+Later sections describe the format and semantics of these messages.
+
+
+STRSET_GET
+==========
+
+Requests contents of a string set as provided by ioctl commands
+``ETHTOOL_GSSET_INFO`` and ``ETHTOOL_GSTRINGS.`` String sets are not user
+writeable so that the corresponding ``STRSET_SET`` message is only used in
+kernel replies. There are two types of string sets: global (independent of
+a device, e.g. device feature names) and device specific (e.g. device private
+flags).
+
+Request contents:
+
+ +---------------------------------------+--------+------------------------+
+ | ``ETHTOOL_A_STRSET_HEADER`` | nested | request header |
+ +---------------------------------------+--------+------------------------+
+ | ``ETHTOOL_A_STRSET_STRINGSETS`` | nested | string set to request |
+ +-+-------------------------------------+--------+------------------------+
+ | | ``ETHTOOL_A_STRINGSETS_STRINGSET+`` | nested | one string set |
+ +-+-+-----------------------------------+--------+------------------------+
+ | | | ``ETHTOOL_A_STRINGSET_ID`` | u32 | set id |
+ +-+-+-----------------------------------+--------+------------------------+
+
+Kernel response contents:
+
+ +---------------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_STRSET_HEADER`` | nested | reply header |
+ +---------------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_STRSET_STRINGSETS`` | nested | array of string sets |
+ +-+-------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_STRINGSETS_STRINGSET+`` | nested | one string set |
+ +-+-+-----------------------------------+--------+-----------------------+
+ | | | ``ETHTOOL_A_STRINGSET_ID`` | u32 | set id |
+ +-+-+-----------------------------------+--------+-----------------------+
+ | | | ``ETHTOOL_A_STRINGSET_COUNT`` | u32 | number of strings |
+ +-+-+-----------------------------------+--------+-----------------------+
+ | | | ``ETHTOOL_A_STRINGSET_STRINGS`` | nested | array of strings |
+ +-+-+-+---------------------------------+--------+-----------------------+
+ | | | | ``ETHTOOL_A_STRINGS_STRING+`` | nested | one string |
+ +-+-+-+-+-------------------------------+--------+-----------------------+
+ | | | | | ``ETHTOOL_A_STRING_INDEX`` | u32 | string index |
+ +-+-+-+-+-------------------------------+--------+-----------------------+
+ | | | | | ``ETHTOOL_A_STRING_VALUE`` | string | string value |
+ +-+-+-+-+-------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_STRSET_COUNTS_ONLY`` | flag | return only counts |
+ +---------------------------------------+--------+-----------------------+
+
+Device identification in request header is optional. Depending on its presence
+a and ``NLM_F_DUMP`` flag, there are three type of ``STRSET_GET`` requests:
+
+ - no ``NLM_F_DUMP,`` no device: get "global" stringsets
+ - no ``NLM_F_DUMP``, with device: get string sets related to the device
+ - ``NLM_F_DUMP``, no device: get device related string sets for all devices
+
+If there is no ``ETHTOOL_A_STRSET_STRINGSETS`` array, all string sets of
+requested type are returned, otherwise only those specified in the request.
+Flag ``ETHTOOL_A_STRSET_COUNTS_ONLY`` tells kernel to only return string
+counts of the sets, not the actual strings.
+
+
+LINKINFO_GET
+============
+
+Requests link settings as provided by ``ETHTOOL_GLINKSETTINGS`` except for
+link modes and autonegotiation related information. The request does not use
+any attributes.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKINFO_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKINFO_HEADER`` nested reply header
+ ``ETHTOOL_A_LINKINFO_PORT`` u8 physical port
+ ``ETHTOOL_A_LINKINFO_PHYADDR`` u8 phy MDIO address
+ ``ETHTOOL_A_LINKINFO_TP_MDIX`` u8 MDI(-X) status
+ ``ETHTOOL_A_LINKINFO_TP_MDIX_CTRL`` u8 MDI(-X) control
+ ``ETHTOOL_A_LINKINFO_TRANSCEIVER`` u8 transceiver
+ ==================================== ====== ==========================
+
+Attributes and their values have the same meaning as matching members of the
+corresponding ioctl structures.
+
+``LINKINFO_GET`` allows dump requests (kernel returns reply message for all
+devices supporting the request).
+
+
+LINKINFO_SET
+============
+
+``LINKINFO_SET`` request allows setting some of the attributes reported by
+``LINKINFO_GET``.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKINFO_HEADER`` nested request header
+ ``ETHTOOL_A_LINKINFO_PORT`` u8 physical port
+ ``ETHTOOL_A_LINKINFO_PHYADDR`` u8 phy MDIO address
+ ``ETHTOOL_A_LINKINFO_TP_MDIX_CTRL`` u8 MDI(-X) control
+ ==================================== ====== ==========================
+
+MDI(-X) status and transceiver cannot be set, request with the corresponding
+attributes is rejected.
+
+
+LINKMODES_GET
+=============
+
+Requests link modes (supported, advertised and peer advertised) and related
+information (autonegotiation status, link speed and duplex) as provided by
+``ETHTOOL_GLINKSETTINGS``. The request does not use any attributes.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKMODES_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ========================================== ====== ==========================
+ ``ETHTOOL_A_LINKMODES_HEADER`` nested reply header
+ ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status
+ ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes
+ ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes
+ ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
+ ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
+ ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG`` u8 Master/slave port mode
+ ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_STATE`` u8 Master/slave port state
+ ``ETHTOOL_A_LINKMODES_RATE_MATCHING`` u8 PHY rate matching
+ ========================================== ====== ==========================
+
+For ``ETHTOOL_A_LINKMODES_OURS``, value represents advertised modes and mask
+represents supported modes. ``ETHTOOL_A_LINKMODES_PEER`` in the reply is a bit
+list.
+
+``LINKMODES_GET`` allows dump requests (kernel returns reply messages for all
+devices supporting the request).
+
+
+LINKMODES_SET
+=============
+
+Request contents:
+
+ ========================================== ====== ==========================
+ ``ETHTOOL_A_LINKMODES_HEADER`` nested request header
+ ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status
+ ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes
+ ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes
+ ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
+ ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
+ ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG`` u8 Master/slave port mode
+ ``ETHTOOL_A_LINKMODES_RATE_MATCHING`` u8 PHY rate matching
+ ``ETHTOOL_A_LINKMODES_LANES`` u32 lanes
+ ========================================== ====== ==========================
+
+``ETHTOOL_A_LINKMODES_OURS`` bit set allows setting advertised link modes. If
+autonegotiation is on (either set now or kept from before), advertised modes
+are not changed (no ``ETHTOOL_A_LINKMODES_OURS`` attribute) and at least one
+of speed, duplex and lanes is specified, kernel adjusts advertised modes to all
+supported modes matching speed, duplex, lanes or all (whatever is specified).
+This autoselection is done on ethtool side with ioctl interface, netlink
+interface is supposed to allow requesting changes without knowing what exactly
+kernel supports.
+
+
+LINKSTATE_GET
+=============
+
+Requests link state information. Link up/down flag (as provided by
+``ETHTOOL_GLINK`` ioctl command) is provided. Optionally, extended state might
+be provided as well. In general, extended state describes reasons for why a port
+is down, or why it operates in some non-obvious mode. This request does not have
+any attributes.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKSTATE_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ============================
+ ``ETHTOOL_A_LINKSTATE_HEADER`` nested reply header
+ ``ETHTOOL_A_LINKSTATE_LINK`` bool link state (up/down)
+ ``ETHTOOL_A_LINKSTATE_SQI`` u32 Current Signal Quality Index
+ ``ETHTOOL_A_LINKSTATE_SQI_MAX`` u32 Max support SQI value
+ ``ETHTOOL_A_LINKSTATE_EXT_STATE`` u8 link extended state
+ ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` u8 link extended substate
+ ``ETHTOOL_A_LINKSTATE_EXT_DOWN_CNT`` u32 count of link down events
+ ==================================== ====== ============================
+
+For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns
+carrier flag provided by ``netif_carrier_ok()`` but there are drivers which
+define their own handler.
+
+``ETHTOOL_A_LINKSTATE_EXT_STATE`` and ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` are
+optional values. ethtool core can provide either both
+``ETHTOOL_A_LINKSTATE_EXT_STATE`` and ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE``,
+or only ``ETHTOOL_A_LINKSTATE_EXT_STATE``, or none of them.
+
+``LINKSTATE_GET`` allows dump requests (kernel returns reply messages for all
+devices supporting the request).
+
+
+Link extended states:
+
+ ================================================ ============================================
+ ``ETHTOOL_LINK_EXT_STATE_AUTONEG`` States relating to the autonegotiation or
+ issues therein
+
+ ``ETHTOOL_LINK_EXT_STATE_LINK_TRAINING_FAILURE`` Failure during link training
+
+ ``ETHTOOL_LINK_EXT_STATE_LINK_LOGICAL_MISMATCH`` Logical mismatch in physical coding sublayer
+ or forward error correction sublayer
+
+ ``ETHTOOL_LINK_EXT_STATE_BAD_SIGNAL_INTEGRITY`` Signal integrity issues
+
+ ``ETHTOOL_LINK_EXT_STATE_NO_CABLE`` No cable connected
+
+ ``ETHTOOL_LINK_EXT_STATE_CABLE_ISSUE`` Failure is related to cable,
+ e.g., unsupported cable
+
+ ``ETHTOOL_LINK_EXT_STATE_EEPROM_ISSUE`` Failure is related to EEPROM, e.g., failure
+ during reading or parsing the data
+
+ ``ETHTOOL_LINK_EXT_STATE_CALIBRATION_FAILURE`` Failure during calibration algorithm
+
+ ``ETHTOOL_LINK_EXT_STATE_POWER_BUDGET_EXCEEDED`` The hardware is not able to provide the
+ power required from cable or module
+
+ ``ETHTOOL_LINK_EXT_STATE_OVERHEAT`` The module is overheated
+
+ ``ETHTOOL_LINK_EXT_STATE_MODULE`` Transceiver module issue
+ ================================================ ============================================
+
+Link extended substates:
+
+ Autoneg substates:
+
+ =============================================================== ================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED`` Peer side is down
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_ACK_NOT_RECEIVED`` Ack not received from peer side
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NEXT_PAGE_EXCHANGE_FAILED`` Next page exchange failed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED_FORCE_MODE`` Peer side is down during force
+ mode or there is no agreement of
+ speed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_FEC_MISMATCH_DURING_OVERRIDE`` Forward error correction modes
+ in both sides are mismatched
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_HCD`` No Highest Common Denominator
+ =============================================================== ================================
+
+ Link training substates:
+
+ =========================================================================== ====================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_FRAME_LOCK_NOT_ACQUIRED`` Frames were not
+ recognized, the
+ lock failed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_LINK_INHIBIT_TIMEOUT`` The lock did not
+ occur before
+ timeout
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_LINK_PARTNER_DID_NOT_SET_RECEIVER_READY`` Peer side did not
+ send ready signal
+ after training
+ process
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_REMOTE_FAULT`` Remote side is not
+ ready yet
+ =========================================================================== ====================
+
+ Link logical mismatch substates:
+
+ ================================================================ ===============================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_BLOCK_LOCK`` Physical coding sublayer was
+ not locked in first phase -
+ block lock
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_AM_LOCK`` Physical coding sublayer was
+ not locked in second phase -
+ alignment markers lock
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_GET_ALIGN_STATUS`` Physical coding sublayer did
+ not get align status
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_FC_FEC_IS_NOT_LOCKED`` FC forward error correction is
+ not locked
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_RS_FEC_IS_NOT_LOCKED`` RS forward error correction is
+ not locked
+ ================================================================ ===============================
+
+ Bad signal integrity substates:
+
+ ================================================================= =============================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_LARGE_NUMBER_OF_PHYSICAL_ERRORS`` Large number of physical
+ errors
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_UNSUPPORTED_RATE`` The system attempted to
+ operate the cable at a rate
+ that is not formally
+ supported, which led to
+ signal integrity issues
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_REFERENCE_CLOCK_LOST`` The external clock signal for
+ SerDes is too weak or
+ unavailable.
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_ALOS`` The received signal for
+ SerDes is too weak because
+ analog loss of signal.
+ ================================================================= =============================
+
+ Cable issue substates:
+
+ =================================================== ============================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_CI_UNSUPPORTED_CABLE`` Unsupported cable
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_CI_CABLE_TEST_FAILURE`` Cable test failure
+ =================================================== ============================================
+
+ Transceiver module issue substates:
+
+ =================================================== ============================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_MODULE_CMIS_NOT_READY`` The CMIS Module State Machine did not reach
+ the ModuleReady state. For example, if the
+ module is stuck at ModuleFault state
+ =================================================== ============================================
+
+DEBUG_GET
+=========
+
+Requests debugging settings of a device. At the moment, only message mask is
+provided.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_DEBUG_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_DEBUG_HEADER`` nested reply header
+ ``ETHTOOL_A_DEBUG_MSGMASK`` bitset message mask
+ ==================================== ====== ==========================
+
+The message mask (``ETHTOOL_A_DEBUG_MSGMASK``) is equal to message level as
+provided by ``ETHTOOL_GMSGLVL`` and set by ``ETHTOOL_SMSGLVL`` in ioctl
+interface. While it is called message level there for historical reasons, most
+drivers and almost all newer drivers use it as a mask of enabled message
+classes (represented by ``NETIF_MSG_*`` constants); therefore netlink
+interface follows its actual use in practice.
+
+``DEBUG_GET`` allows dump requests (kernel returns reply messages for all
+devices supporting the request).
+
+
+DEBUG_SET
+=========
+
+Set or update debugging settings of a device. At the moment, only message mask
+is supported.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_DEBUG_HEADER`` nested request header
+ ``ETHTOOL_A_DEBUG_MSGMASK`` bitset message mask
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_DEBUG_MSGMASK`` bit set allows setting or modifying mask of
+enabled debugging message types for the device.
+
+
+WOL_GET
+=======
+
+Query device wake-on-lan settings. Unlike most "GET" type requests,
+``ETHTOOL_MSG_WOL_GET`` requires (netns) ``CAP_NET_ADMIN`` privileges as it
+(potentially) provides SecureOn(tm) password which is confidential.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_WOL_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_WOL_HEADER`` nested reply header
+ ``ETHTOOL_A_WOL_MODES`` bitset mask of enabled WoL modes
+ ``ETHTOOL_A_WOL_SOPASS`` binary SecureOn(tm) password
+ ==================================== ====== ==========================
+
+In reply, ``ETHTOOL_A_WOL_MODES`` mask consists of modes supported by the
+device, value of modes which are enabled. ``ETHTOOL_A_WOL_SOPASS`` is only
+included in reply if ``WAKE_MAGICSECURE`` mode is supported.
+
+
+WOL_SET
+=======
+
+Set or update wake-on-lan settings.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_WOL_HEADER`` nested request header
+ ``ETHTOOL_A_WOL_MODES`` bitset enabled WoL modes
+ ``ETHTOOL_A_WOL_SOPASS`` binary SecureOn(tm) password
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_WOL_SOPASS`` is only allowed for devices supporting
+``WAKE_MAGICSECURE`` mode.
+
+
+FEATURES_GET
+============
+
+Gets netdev features like ``ETHTOOL_GFEATURES`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_FEATURES_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_FEATURES_HEADER`` nested reply header
+ ``ETHTOOL_A_FEATURES_HW`` bitset dev->hw_features
+ ``ETHTOOL_A_FEATURES_WANTED`` bitset dev->wanted_features
+ ``ETHTOOL_A_FEATURES_ACTIVE`` bitset dev->features
+ ``ETHTOOL_A_FEATURES_NOCHANGE`` bitset NETIF_F_NEVER_CHANGE
+ ==================================== ====== ==========================
+
+Bitmaps in kernel response have the same meaning as bitmaps used in ioctl
+interference but attribute names are different (they are based on
+corresponding members of struct net_device). Legacy "flags" are not provided,
+if userspace needs them (most likely only ethtool for backward compatibility),
+it can calculate their values from related feature bits itself.
+ETHA_FEATURES_HW uses mask consisting of all features recognized by kernel (to
+provide all names when using verbose bitmap format), the other three use no
+mask (simple bit lists).
+
+
+FEATURES_SET
+============
+
+Request to set netdev features like ``ETHTOOL_SFEATURES`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_FEATURES_HEADER`` nested request header
+ ``ETHTOOL_A_FEATURES_WANTED`` bitset requested features
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_FEATURES_HEADER`` nested reply header
+ ``ETHTOOL_A_FEATURES_WANTED`` bitset diff wanted vs. result
+ ``ETHTOOL_A_FEATURES_ACTIVE`` bitset diff old vs. new active
+ ==================================== ====== ==========================
+
+Request contains only one bitset which can be either value/mask pair (request
+to change specific feature bits and leave the rest) or only a value (request
+to set all features to specified set).
+
+As request is subject to netdev_change_features() sanity checks, optional
+kernel reply (can be suppressed by ``ETHTOOL_FLAG_OMIT_REPLY`` flag in request
+header) informs client about the actual result. ``ETHTOOL_A_FEATURES_WANTED``
+reports the difference between client request and actual result: mask consists
+of bits which differ between requested features and result (dev->features
+after the operation), value consists of values of these bits in the request
+(i.e. negated values from resulting features). ``ETHTOOL_A_FEATURES_ACTIVE``
+reports the difference between old and new dev->features: mask consists of
+bits which have changed, values are their values in new dev->features (after
+the operation).
+
+``ETHTOOL_MSG_FEATURES_NTF`` notification is sent not only if device features
+are modified using ``ETHTOOL_MSG_FEATURES_SET`` request or on of ethtool ioctl
+request but also each time features are modified with netdev_update_features()
+or netdev_change_features().
+
+
+PRIVFLAGS_GET
+=============
+
+Gets private flags like ``ETHTOOL_GPFLAGS`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested reply header
+ ``ETHTOOL_A_PRIVFLAGS_FLAGS`` bitset private flags
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_PRIVFLAGS_FLAGS`` is a bitset with values of device private flags.
+These flags are defined by driver, their number and names (and also meaning)
+are device dependent. For compact bitset format, names can be retrieved as
+``ETH_SS_PRIV_FLAGS`` string set. If verbose bitset format is requested,
+response uses all private flags supported by the device as mask so that client
+gets the full information without having to fetch the string set with names.
+
+
+PRIVFLAGS_SET
+=============
+
+Sets or modifies values of device private flags like ``ETHTOOL_SPFLAGS``
+ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested request header
+ ``ETHTOOL_A_PRIVFLAGS_FLAGS`` bitset private flags
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_PRIVFLAGS_FLAGS`` can either set the whole set of private flags or
+modify only values of some of them.
+
+
+RINGS_GET
+=========
+
+Gets ring sizes like ``ETHTOOL_GRINGPARAM`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_RINGS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ======================================= ====== ===========================
+ ``ETHTOOL_A_RINGS_HEADER`` nested reply header
+ ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring
+ ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
+ ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
+ ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
+ ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX`` u32 max size of TX push buffer
+ ======================================= ====== ===========================
+
+``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
+page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``).
+If enabled the device is configured to place frame headers and data into
+separate buffers. The device configuration must make it possible to receive
+full memory pages of data, for example because MTU is high enough or through
+HW-GRO.
+
+``ETHTOOL_A_RINGS_[RX|TX]_PUSH`` flag is used to enable descriptor fast
+path to send or receive packets. In ordinary path, driver fills descriptors in DRAM and
+notifies NIC hardware. In fast path, driver pushes descriptors to the device
+through MMIO writes, thus reducing the latency. However, enabling this feature
+may increase the CPU cost. Drivers may enforce additional per-packet
+eligibility checks (e.g. on packet size).
+
+``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` specifies the maximum number of bytes of a
+transmitted packet a driver can push directly to the underlying device
+('push' mode). Pushing some of the payload bytes to the device has the
+advantages of reducing latency for small packets by avoiding DMA mapping (same
+as ``ETHTOOL_A_RINGS_TX_PUSH`` parameter) as well as allowing the underlying
+device to process packet headers ahead of fetching its payload.
+This can help the device to make fast actions based on the packet's headers.
+This is similar to the "tx-copybreak" parameter, which copies the packet to a
+preallocated DMA memory area instead of mapping new memory. However,
+tx-push-buff parameter copies the packet directly to the device to allow the
+device to take faster actions on the packet.
+
+RINGS_SET
+=========
+
+Sets ring sizes like ``ETHTOOL_SRINGPARAM`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ===========================
+ ``ETHTOOL_A_RINGS_HEADER`` nested reply header
+ ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
+ ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
+ ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer
+ ==================================== ====== ===========================
+
+Kernel checks that requested ring sizes do not exceed limits reported by
+driver. Driver may impose additional constraints and may not suspport all
+attributes.
+
+
+``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
+Completion queue events(CQE) are the events posted by NIC to indicate the
+completion status of a packet when the packet is sent(like send success or
+error) or received(like pointers to packet fragments). The CQE size parameter
+enables to modify the CQE size other than default size if NIC supports it.
+A bigger CQE can have more receive buffer pointers inturn NIC can transfer
+a bigger frame from wire. Based on the NIC hardware, the overall completion
+queue size can be adjusted in the driver if CQE size is modified.
+
+CHANNELS_GET
+============
+
+Gets channel counts like ``ETHTOOL_GCHANNELS`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_CHANNELS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_CHANNELS_HEADER`` nested reply header
+ ``ETHTOOL_A_CHANNELS_RX_MAX`` u32 max receive channels
+ ``ETHTOOL_A_CHANNELS_TX_MAX`` u32 max transmit channels
+ ``ETHTOOL_A_CHANNELS_OTHER_MAX`` u32 max other channels
+ ``ETHTOOL_A_CHANNELS_COMBINED_MAX`` u32 max combined channels
+ ``ETHTOOL_A_CHANNELS_RX_COUNT`` u32 receive channel count
+ ``ETHTOOL_A_CHANNELS_TX_COUNT`` u32 transmit channel count
+ ``ETHTOOL_A_CHANNELS_OTHER_COUNT`` u32 other channel count
+ ``ETHTOOL_A_CHANNELS_COMBINED_COUNT`` u32 combined channel count
+ ===================================== ====== ==========================
+
+
+CHANNELS_SET
+============
+
+Sets channel counts like ``ETHTOOL_SCHANNELS`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_CHANNELS_HEADER`` nested request header
+ ``ETHTOOL_A_CHANNELS_RX_COUNT`` u32 receive channel count
+ ``ETHTOOL_A_CHANNELS_TX_COUNT`` u32 transmit channel count
+ ``ETHTOOL_A_CHANNELS_OTHER_COUNT`` u32 other channel count
+ ``ETHTOOL_A_CHANNELS_COMBINED_COUNT`` u32 combined channel count
+ ===================================== ====== ==========================
+
+Kernel checks that requested channel counts do not exceed limits reported by
+driver. Driver may impose additional constraints and may not suspport all
+attributes.
+
+
+COALESCE_GET
+============
+
+Gets coalescing parameters like ``ETHTOOL_GCOALESCE`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_COALESCE_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ =========================================== ====== =======================
+ ``ETHTOOL_A_COALESCE_HEADER`` nested reply header
+ ``ETHTOOL_A_COALESCE_RX_USECS`` u32 delay (us), normal Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES`` u32 max packets, normal Rx
+ ``ETHTOOL_A_COALESCE_RX_USECS_IRQ`` u32 delay (us), Rx in IRQ
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_IRQ`` u32 max packets, Rx in IRQ
+ ``ETHTOOL_A_COALESCE_TX_USECS`` u32 delay (us), normal Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES`` u32 max packets, normal Tx
+ ``ETHTOOL_A_COALESCE_TX_USECS_IRQ`` u32 delay (us), Tx in IRQ
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_IRQ`` u32 IRQ packets, Tx in IRQ
+ ``ETHTOOL_A_COALESCE_STATS_BLOCK_USECS`` u32 delay of stats update
+ ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX`` bool adaptive Rx coalesce
+ ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_TX`` bool adaptive Tx coalesce
+ ``ETHTOOL_A_COALESCE_PKT_RATE_LOW`` u32 threshold for low rate
+ ``ETHTOOL_A_COALESCE_RX_USECS_LOW`` u32 delay (us), low Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_LOW`` u32 max packets, low Rx
+ ``ETHTOOL_A_COALESCE_TX_USECS_LOW`` u32 delay (us), low Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_LOW`` u32 max packets, low Tx
+ ``ETHTOOL_A_COALESCE_PKT_RATE_HIGH`` u32 threshold for high rate
+ ``ETHTOOL_A_COALESCE_RX_USECS_HIGH`` u32 delay (us), high Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_HIGH`` u32 max packets, high Rx
+ ``ETHTOOL_A_COALESCE_TX_USECS_HIGH`` u32 delay (us), high Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH`` u32 max packets, high Tx
+ ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval
+ ``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx
+ ``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
+ =========================================== ====== =======================
+
+Attributes are only included in reply if their value is not zero or the
+corresponding bit in ``ethtool_ops::supported_coalesce_params`` is set (i.e.
+they are declared as supported by driver).
+
+Timer reset mode (``ETHTOOL_A_COALESCE_USE_CQE_TX`` and
+``ETHTOOL_A_COALESCE_USE_CQE_RX``) controls the interaction between packet
+arrival and the various time based delay parameters. By default timers are
+expected to limit the max delay between any packet arrival/departure and a
+corresponding interrupt. In this mode timer should be started by packet
+arrival (sometimes delivery of previous interrupt) and reset when interrupt
+is delivered.
+Setting the appropriate attribute to 1 will enable ``CQE`` mode, where
+each packet event resets the timer. In this mode timer is used to force
+the interrupt if queue goes idle, while busy queues depend on the packet
+limit to trigger interrupts.
+
+Tx aggregation consists of copying frames into a contiguous buffer so that they
+can be submitted as a single IO operation. ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``
+describes the maximum size in bytes for the submitted buffer.
+``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` describes the maximum number of frames
+that can be aggregated into a single buffer.
+``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` describes the amount of time in usecs,
+counted since the first packet arrival in an aggregated block, after which the
+block should be sent.
+This feature is mainly of interest for specific USB devices which does not cope
+well with frequent small-sized URBs transmissions.
+
+COALESCE_SET
+============
+
+Sets coalescing parameters like ``ETHTOOL_SCOALESCE`` ioctl request.
+
+Request contents:
+
+ =========================================== ====== =======================
+ ``ETHTOOL_A_COALESCE_HEADER`` nested request header
+ ``ETHTOOL_A_COALESCE_RX_USECS`` u32 delay (us), normal Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES`` u32 max packets, normal Rx
+ ``ETHTOOL_A_COALESCE_RX_USECS_IRQ`` u32 delay (us), Rx in IRQ
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_IRQ`` u32 max packets, Rx in IRQ
+ ``ETHTOOL_A_COALESCE_TX_USECS`` u32 delay (us), normal Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES`` u32 max packets, normal Tx
+ ``ETHTOOL_A_COALESCE_TX_USECS_IRQ`` u32 delay (us), Tx in IRQ
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_IRQ`` u32 IRQ packets, Tx in IRQ
+ ``ETHTOOL_A_COALESCE_STATS_BLOCK_USECS`` u32 delay of stats update
+ ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX`` bool adaptive Rx coalesce
+ ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_TX`` bool adaptive Tx coalesce
+ ``ETHTOOL_A_COALESCE_PKT_RATE_LOW`` u32 threshold for low rate
+ ``ETHTOOL_A_COALESCE_RX_USECS_LOW`` u32 delay (us), low Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_LOW`` u32 max packets, low Rx
+ ``ETHTOOL_A_COALESCE_TX_USECS_LOW`` u32 delay (us), low Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_LOW`` u32 max packets, low Tx
+ ``ETHTOOL_A_COALESCE_PKT_RATE_HIGH`` u32 threshold for high rate
+ ``ETHTOOL_A_COALESCE_RX_USECS_HIGH`` u32 delay (us), high Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_HIGH`` u32 max packets, high Rx
+ ``ETHTOOL_A_COALESCE_TX_USECS_HIGH`` u32 delay (us), high Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH`` u32 max packets, high Tx
+ ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval
+ ``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx
+ ``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
+ =========================================== ====== =======================
+
+Request is rejected if it attributes declared as unsupported by driver (i.e.
+such that the corresponding bit in ``ethtool_ops::supported_coalesce_params``
+is not set), regardless of their values. Driver may impose additional
+constraints on coalescing parameters and their values.
+
+Compared to requests issued via the ``ioctl()`` netlink version of this request
+will try harder to make sure that values specified by the user have been applied
+and may call the driver twice.
+
+
+PAUSE_GET
+=========
+
+Gets pause frame settings like ``ETHTOOL_GPAUSEPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PAUSE_HEADER`` nested request header
+ ``ETHTOOL_A_PAUSE_STATS_SRC`` u32 source of statistics
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_PAUSE_STATS_SRC`` is optional. It takes values from:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_mac_stats_src
+
+If absent from the request, stats will be provided with
+an ``ETHTOOL_A_PAUSE_STATS_SRC`` attribute in the response equal to
+``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PAUSE_HEADER`` nested request header
+ ``ETHTOOL_A_PAUSE_AUTONEG`` bool pause autonegotiation
+ ``ETHTOOL_A_PAUSE_RX`` bool receive pause frames
+ ``ETHTOOL_A_PAUSE_TX`` bool transmit pause frames
+ ``ETHTOOL_A_PAUSE_STATS`` nested pause statistics
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_PAUSE_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set
+in ``ETHTOOL_A_HEADER_FLAGS``.
+It will be empty if driver did not report any statistics. Drivers fill in
+the statistics in the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_pause_stats
+
+Each member has a corresponding attribute defined.
+
+PAUSE_SET
+=========
+
+Sets pause parameters like ``ETHTOOL_GPAUSEPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PAUSE_HEADER`` nested request header
+ ``ETHTOOL_A_PAUSE_AUTONEG`` bool pause autonegotiation
+ ``ETHTOOL_A_PAUSE_RX`` bool receive pause frames
+ ``ETHTOOL_A_PAUSE_TX`` bool transmit pause frames
+ ===================================== ====== ==========================
+
+
+EEE_GET
+=======
+
+Gets Energy Efficient Ethernet settings like ``ETHTOOL_GEEE`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_EEE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_EEE_HEADER`` nested request header
+ ``ETHTOOL_A_EEE_MODES_OURS`` bool supported/advertised modes
+ ``ETHTOOL_A_EEE_MODES_PEER`` bool peer advertised link modes
+ ``ETHTOOL_A_EEE_ACTIVE`` bool EEE is actively used
+ ``ETHTOOL_A_EEE_ENABLED`` bool EEE is enabled
+ ``ETHTOOL_A_EEE_TX_LPI_ENABLED`` bool Tx lpi enabled
+ ``ETHTOOL_A_EEE_TX_LPI_TIMER`` u32 Tx lpi timeout (in us)
+ ===================================== ====== ==========================
+
+In ``ETHTOOL_A_EEE_MODES_OURS``, mask consists of link modes for which EEE is
+enabled, value of link modes for which EEE is advertised. Link modes for which
+peer advertises EEE are listed in ``ETHTOOL_A_EEE_MODES_PEER`` (no mask). The
+netlink interface allows reporting EEE status for all link modes but only
+first 32 are provided by the ``ethtool_ops`` callback.
+
+
+EEE_SET
+=======
+
+Sets Energy Efficient Ethernet parameters like ``ETHTOOL_SEEE`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_EEE_HEADER`` nested request header
+ ``ETHTOOL_A_EEE_MODES_OURS`` bool advertised modes
+ ``ETHTOOL_A_EEE_ENABLED`` bool EEE is enabled
+ ``ETHTOOL_A_EEE_TX_LPI_ENABLED`` bool Tx lpi enabled
+ ``ETHTOOL_A_EEE_TX_LPI_TIMER`` u32 Tx lpi timeout (in us)
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_EEE_MODES_OURS`` is used to either list link modes to advertise
+EEE for (if there is no mask) or specify changes to the list (if there is
+a mask). The netlink interface allows reporting EEE status for all link modes
+but only first 32 can be set at the moment as that is what the ``ethtool_ops``
+callback supports.
+
+
+TSINFO_GET
+==========
+
+Gets timestamping information like ``ETHTOOL_GET_TS_INFO`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_TSINFO_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_TSINFO_HEADER`` nested request header
+ ``ETHTOOL_A_TSINFO_TIMESTAMPING`` bitset SO_TIMESTAMPING flags
+ ``ETHTOOL_A_TSINFO_TX_TYPES`` bitset supported Tx types
+ ``ETHTOOL_A_TSINFO_RX_FILTERS`` bitset supported Rx filters
+ ``ETHTOOL_A_TSINFO_PHC_INDEX`` u32 PTP hw clock index
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_TSINFO_PHC_INDEX`` is absent if there is no associated PHC (there
+is no special value for this case). The bitset attributes are omitted if they
+would be empty (no bit set).
+
+CABLE_TEST
+==========
+
+Start a cable test.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_CABLE_TEST_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Notification contents:
+
+An Ethernet cable typically contains 1, 2 or 4 pairs. The length of
+the pair can only be measured when there is a fault in the pair and
+hence a reflection. Information about the fault may not be available,
+depending on the specific hardware. Hence the contents of the notify
+message are mostly optional. The attributes can be repeated an
+arbitrary number of times, in an arbitrary order, for an arbitrary
+number of pairs.
+
+The example shows the notification sent when the test is completed for
+a T2 cable, i.e. two pairs. One pair is OK and hence has no length
+information. The second pair has a fault and does have length
+information.
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_CABLE_TEST_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_CABLE_TEST_STATUS`` | u8 | completed |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_CABLE_TEST_NTF_NEST`` | nested | all the results |
+ +-+-------------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_CABLE_NEST_RESULT`` | nested | cable test result |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_CODE`` | u8 | result code |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_CABLE_NEST_RESULT`` | nested | cable test results |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_CODE`` | u8 | result code |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_CABLE_NEST_FAULT_LENGTH`` | nested | cable length |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_CM`` | u32 | length in cm |
+ +-+-+-----------------------------------------+--------+---------------------+
+
+CABLE_TEST TDR
+==============
+
+Start a cable test and report raw TDR data
+
+Request contents:
+
+ +--------------------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_HEADER`` | nested | reply header |
+ +--------------------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_CFG`` | nested | test configuration |
+ +-+------------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_CABLE_STEP_FIRST_DISTANCE`` | u32 | first data distance |
+ +-+-+----------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_CABLE_STEP_LAST_DISTANCE`` | u32 | last data distance |
+ +-+-+----------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_CABLE_STEP_STEP_DISTANCE`` | u32 | distance of each step |
+ +-+-+----------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_CABLE_TEST_TDR_CFG_PAIR`` | u8 | pair to test |
+ +-+-+----------------------------------------+--------+-----------------------+
+
+The ETHTOOL_A_CABLE_TEST_TDR_CFG is optional, as well as all members
+of the nest. All distances are expressed in centimeters. The PHY takes
+the distances as a guide, and rounds to the nearest distance it
+actually supports. If a pair is passed, only that one pair will be
+tested. Otherwise all pairs are tested.
+
+Notification contents:
+
+Raw TDR data is gathered by sending a pulse down the cable and
+recording the amplitude of the reflected pulse for a given distance.
+
+It can take a number of seconds to collect TDR data, especial if the
+full 100 meters is probed at 1 meter intervals. When the test is
+started a notification will be sent containing just
+ETHTOOL_A_CABLE_TEST_TDR_STATUS with the value
+ETHTOOL_A_CABLE_TEST_NTF_STATUS_STARTED.
+
+When the test has completed a second notification will be sent
+containing ETHTOOL_A_CABLE_TEST_TDR_STATUS with the value
+ETHTOOL_A_CABLE_TEST_NTF_STATUS_COMPLETED and the TDR data.
+
+The message may optionally contain the amplitude of the pulse send
+down the cable. This is measured in mV. A reflection should not be
+bigger than transmitted pulse.
+
+Before the raw TDR data should be an ETHTOOL_A_CABLE_TDR_NEST_STEP
+nest containing information about the distance along the cable for the
+first reading, the last reading, and the step between each
+reading. Distances are measured in centimeters. These should be the
+exact values the PHY used. These may be different to what the user
+requested, if the native measurement resolution is greater than 1 cm.
+
+For each step along the cable, a ETHTOOL_A_CABLE_TDR_NEST_AMPLITUDE is
+used to report the amplitude of the reflection for a given pair.
+
+ +---------------------------------------------+--------+----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_STATUS`` | u8 | completed |
+ +---------------------------------------------+--------+----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_NTF_NEST`` | nested | all the results |
+ +-+-------------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_PULSE`` | nested | TX Pulse amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_PULSE_mV`` | s16 | Pulse amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_NEST_STEP`` | nested | TDR step info |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_STEP_FIRST_DISTANCE`` | u32 | First data distance |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_STEP_LAST_DISTANCE`` | u32 | Last data distance |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_STEP_STEP_DISTANCE`` | u32 | distance of each step|
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_AMPLITUDE`` | nested | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_AMPLITUDE_mV`` | s16 | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_AMPLITUDE`` | nested | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_AMPLITUDE_mV`` | s16 | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_AMPLITUDE`` | nested | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_AMPLITUDE_mV`` | s16 | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+
+TUNNEL_INFO
+===========
+
+Gets information about the tunnel state NIC is aware of.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_TUNNEL_INFO_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_TUNNEL_INFO_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_TUNNEL_INFO_UDP_PORTS`` | nested | all UDP port tables |
+ +-+-------------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_TUNNEL_UDP_TABLE`` | nested | one UDP port table |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_SIZE`` | u32 | max size of the |
+ | | | | | table |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES`` | bitset | tunnel types which |
+ | | | | | table can hold |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_ENTRY`` | nested | offloaded UDP port |
+ +-+-+-+---------------------------------------+--------+---------------------+
+ | | | | ``ETHTOOL_A_TUNNEL_UDP_ENTRY_PORT`` | be16 | UDP port |
+ +-+-+-+---------------------------------------+--------+---------------------+
+ | | | | ``ETHTOOL_A_TUNNEL_UDP_ENTRY_TYPE`` | u32 | tunnel type |
+ +-+-+-+---------------------------------------+--------+---------------------+
+
+For UDP tunnel table empty ``ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES`` indicates that
+the table contains static entries, hard-coded by the NIC.
+
+FEC_GET
+=======
+
+Gets FEC configuration and state like ``ETHTOOL_GFECPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER`` nested request header
+ ``ETHTOOL_A_FEC_MODES`` bitset configured modes
+ ``ETHTOOL_A_FEC_AUTO`` bool FEC mode auto selection
+ ``ETHTOOL_A_FEC_ACTIVE`` u32 index of active FEC mode
+ ``ETHTOOL_A_FEC_STATS`` nested FEC statistics
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_FEC_ACTIVE`` is the bit index of the FEC link mode currently
+active on the interface. This attribute may not be present if device does
+not support FEC.
+
+``ETHTOOL_A_FEC_MODES`` and ``ETHTOOL_A_FEC_AUTO`` are only meaningful when
+autonegotiation is disabled. If ``ETHTOOL_A_FEC_AUTO`` is non-zero driver will
+select the FEC mode automatically based on the parameters of the SFP module.
+This is equivalent to the ``ETHTOOL_FEC_AUTO`` bit of the ioctl interface.
+``ETHTOOL_A_FEC_MODES`` carry the current FEC configuration using link mode
+bits (rather than old ``ETHTOOL_FEC_*`` bits).
+
+``ETHTOOL_A_FEC_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in
+``ETHTOOL_A_HEADER_FLAGS``.
+Each attribute carries an array of 64bit statistics. First entry in the array
+contains the total number of events on the port, while the following entries
+are counters corresponding to lanes/PCS instances. The number of entries in
+the array will be:
+
++--------------+---------------------------------------------+
+| `0` | device does not support FEC statistics |
++--------------+---------------------------------------------+
+| `1` | device does not support per-lane break down |
++--------------+---------------------------------------------+
+| `1 + #lanes` | device has full support for FEC stats |
++--------------+---------------------------------------------+
+
+Drivers fill in the statistics in the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_fec_stats
+
+FEC_SET
+=======
+
+Sets FEC parameters like ``ETHTOOL_SFECPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER`` nested request header
+ ``ETHTOOL_A_FEC_MODES`` bitset configured modes
+ ``ETHTOOL_A_FEC_AUTO`` bool FEC mode auto selection
+ ===================================== ====== ==========================
+
+``FEC_SET`` is only meaningful when autonegotiation is disabled. Otherwise
+FEC mode is selected as part of autonegotiation.
+
+``ETHTOOL_A_FEC_MODES`` selects which FEC mode should be used. It's recommended
+to set only one bit, if multiple bits are set driver may choose between them
+in an implementation specific way.
+
+``ETHTOOL_A_FEC_AUTO`` requests the driver to choose FEC mode based on SFP
+module parameters. This does not mean autonegotiation.
+
+MODULE_EEPROM_GET
+=================
+
+Fetch module EEPROM data dump.
+This interface is designed to allow dumps of at most 1/2 page at once. This
+means only dumps of 128 (or less) bytes are allowed, without crossing half page
+boundary located at offset 128. For pages other than 0 only high 128 bytes are
+accessible.
+
+Request contents:
+
+ ======================================= ====== ==========================
+ ``ETHTOOL_A_MODULE_EEPROM_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_EEPROM_OFFSET`` u32 offset within a page
+ ``ETHTOOL_A_MODULE_EEPROM_LENGTH`` u32 amount of bytes to read
+ ``ETHTOOL_A_MODULE_EEPROM_PAGE`` u8 page number
+ ``ETHTOOL_A_MODULE_EEPROM_BANK`` u8 bank number
+ ``ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS`` u8 page I2C address
+ ======================================= ====== ==========================
+
+If ``ETHTOOL_A_MODULE_EEPROM_BANK`` is not specified, bank 0 is assumed.
+
+Kernel response contents:
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_MODULE_EEPROM_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_MODULE_EEPROM_DATA`` | binary | array of bytes from |
+ | | | module EEPROM |
+ +---------------------------------------------+--------+---------------------+
+
+``ETHTOOL_A_MODULE_EEPROM_DATA`` has an attribute length equal to the amount of
+bytes driver actually read.
+
+STATS_GET
+=========
+
+Get standard statistics for the interface. Note that this is not
+a re-implementation of ``ETHTOOL_GSTATS`` which exposed driver-defined
+stats.
+
+Request contents:
+
+ ======================================= ====== ==========================
+ ``ETHTOOL_A_STATS_HEADER`` nested request header
+ ``ETHTOOL_A_STATS_SRC`` u32 source of statistics
+ ``ETHTOOL_A_STATS_GROUPS`` bitset requested groups of stats
+ ======================================= ====== ==========================
+
+Kernel response contents:
+
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_HEADER`` | nested | reply header |
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_SRC`` | u32 | source of statistics |
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_GRP`` | nested | one or more group of stats |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_ID`` | u32 | group ID - ``ETHTOOL_STATS_*`` |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_SS_ID`` | u32 | string set ID for names |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_STAT`` | nested | nest containing a statistic |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_HIST_RX`` | nested | histogram statistic (Rx) |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_HIST_TX`` | nested | histogram statistic (Tx) |
+ +-+---------------------------------+--------+--------------------------------+
+
+Users specify which groups of statistics they are requesting via
+the ``ETHTOOL_A_STATS_GROUPS`` bitset. Currently defined values are:
+
+ ====================== ======== ===============================================
+ ETHTOOL_STATS_ETH_MAC eth-mac Basic IEEE 802.3 MAC statistics (30.3.1.1.*)
+ ETHTOOL_STATS_ETH_PHY eth-phy Basic IEEE 802.3 PHY statistics (30.3.2.1.*)
+ ETHTOOL_STATS_ETH_CTRL eth-ctrl Basic IEEE 802.3 MAC Ctrl statistics (30.3.3.*)
+ ETHTOOL_STATS_RMON rmon RMON (RFC 2819) statistics
+ ====================== ======== ===============================================
+
+Each group should have a corresponding ``ETHTOOL_A_STATS_GRP`` in the reply.
+``ETHTOOL_A_STATS_GRP_ID`` identifies which group's statistics nest contains.
+``ETHTOOL_A_STATS_GRP_SS_ID`` identifies the string set ID for the names of
+the statistics in the group, if available.
+
+Statistics are added to the ``ETHTOOL_A_STATS_GRP`` nest under
+``ETHTOOL_A_STATS_GRP_STAT``. ``ETHTOOL_A_STATS_GRP_STAT`` should contain
+single 8 byte (u64) attribute inside - the type of that attribute is
+the statistic ID and the value is the value of the statistic.
+Each group has its own interpretation of statistic IDs.
+Attribute IDs correspond to strings from the string set identified
+by ``ETHTOOL_A_STATS_GRP_SS_ID``. Complex statistics (such as RMON histogram
+entries) are also listed inside ``ETHTOOL_A_STATS_GRP`` and do not have
+a string defined in the string set.
+
+RMON "histogram" counters count number of packets within given size range.
+Because RFC does not specify the ranges beyond the standard 1518 MTU devices
+differ in definition of buckets. For this reason the definition of packet ranges
+is left to each driver.
+
+``ETHTOOL_A_STATS_GRP_HIST_RX`` and ``ETHTOOL_A_STATS_GRP_HIST_TX`` nests
+contain the following attributes:
+
+ ================================= ====== ===================================
+ ETHTOOL_A_STATS_RMON_HIST_BKT_LOW u32 low bound of the packet size bucket
+ ETHTOOL_A_STATS_RMON_HIST_BKT_HI u32 high bound of the bucket
+ ETHTOOL_A_STATS_RMON_HIST_VAL u64 packet counter
+ ================================= ====== ===================================
+
+Low and high bounds are inclusive, for example:
+
+ ============================= ==== ====
+ RFC statistic low high
+ ============================= ==== ====
+ etherStatsPkts64Octets 0 64
+ etherStatsPkts512to1023Octets 512 1023
+ ============================= ==== ====
+
+``ETHTOOL_A_STATS_SRC`` is optional. Similar to ``PAUSE_GET``, it takes values
+from ``enum ethtool_mac_stats_src``. If absent from the request, stats will be
+provided with an ``ETHTOOL_A_STATS_SRC`` attribute in the response equal to
+``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
+
+PHC_VCLOCKS_GET
+===============
+
+Query device PHC virtual clocks information.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PHC_VCLOCKS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PHC_VCLOCKS_HEADER`` nested reply header
+ ``ETHTOOL_A_PHC_VCLOCKS_NUM`` u32 PHC virtual clocks number
+ ``ETHTOOL_A_PHC_VCLOCKS_INDEX`` s32 PHC index array
+ ==================================== ====== ==========================
+
+MODULE_GET
+==========
+
+Gets transceiver module parameters.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested reply header
+ ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` u8 power mode policy
+ ``ETHTOOL_A_MODULE_POWER_MODE`` u8 operational power mode
+ ====================================== ====== ==========================
+
+The optional ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` attribute encodes the
+transceiver module power mode policy enforced by the host. The default policy
+is driver-dependent, but "auto" is the recommended default and it should be
+implemented by new drivers and drivers where conformance to a legacy behavior
+is not critical.
+
+The optional ``ETHTHOOL_A_MODULE_POWER_MODE`` attribute encodes the operational
+power mode policy of the transceiver module. It is only reported when a module
+is plugged-in. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_power_mode
+
+MODULE_SET
+==========
+
+Sets transceiver module parameters.
+
+Request contents:
+
+ ====================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` u8 power mode policy
+ ====================================== ====== ==========================
+
+When set, the optional ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` attribute is used
+to set the transceiver module power policy enforced by the host. Possible
+values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_power_mode_policy
+
+For SFF-8636 modules, low power mode is forced by the host according to table
+6-10 in revision 2.10a of the specification.
+
+For CMIS modules, low power mode is forced by the host according to table 6-12
+in revision 5.0 of the specification.
+
+PSE_GET
+=======
+
+Gets PSE attributes.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PSE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested reply header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
+ PSE functions
+ ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoDL PSE.
+ ====================================== ====== =============================
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies
+the operational state of the PoDL PSE functions. The operational state of the
+PSE function can be changed using the ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL``
+action. This option is corresponding to ``IEEE 802.3-2018`` 30.15.1.1.2
+aPoDLPSEAdminState. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_podl_pse_admin_state
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` attribute identifies
+the power detection status of the PoDL PSE. The status depend on internal PSE
+state machine and automatic PD classification support. This option is
+corresponding to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus.
+Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_podl_pse_pw_d_status
+
+PSE_SET
+=======
+
+Sets PSE parameters.
+
+Request contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested request header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` u32 Control PoDL PSE Admin state
+ ====================================== ====== =============================
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used
+to control PoDL PSE Admin functions. This option is implementing
+``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
+``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
+
+RSS_GET
+=======
+
+Get indirection table, hash key and hash function info associated with a
+RSS context of an interface similar to ``ETHTOOL_GRSSH`` ioctl request.
+
+Request contents:
+
+===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_HEADER`` nested request header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
+===================================== ====== ==========================
+
+Kernel response contents:
+
+===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_HEADER`` nested reply header
+ ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
+ ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
+ ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
+===================================== ====== ==========================
+
+ETHTOOL_A_RSS_HFUNC attribute is bitmap indicating the hash function
+being used. Current supported options are toeplitz, xor or crc32.
+ETHTOOL_A_RSS_INDIR attribute returns RSS indrection table where each byte
+indicates queue number.
+
+PLCA_GET_CFG
+============
+
+Gets the IEEE 802.3cg-2019 Clause 148 Physical Layer Collision Avoidance
+(PLCA) Reconciliation Sublayer (RS) attributes.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PLCA_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PLCA_HEADER`` nested reply header
+ ``ETHTOOL_A_PLCA_VERSION`` u16 Supported PLCA management
+ interface standard/version
+ ``ETHTOOL_A_PLCA_ENABLED`` u8 PLCA Admin State
+ ``ETHTOOL_A_PLCA_NODE_ID`` u32 PLCA unique local node ID
+ ``ETHTOOL_A_PLCA_NODE_CNT`` u32 Number of PLCA nodes on the
+ network, including the
+ coordinator
+ ``ETHTOOL_A_PLCA_TO_TMR`` u32 Transmit Opportunity Timer
+ value in bit-times (BT)
+ ``ETHTOOL_A_PLCA_BURST_CNT`` u32 Number of additional packets
+ the node is allowed to send
+ within a single TO
+ ``ETHTOOL_A_PLCA_BURST_TMR`` u32 Time to wait for the MAC to
+ transmit a new frame before
+ terminating the burst
+ ====================================== ====== =============================
+
+When set, the optional ``ETHTOOL_A_PLCA_VERSION`` attribute indicates which
+standard and version the PLCA management interface complies to. When not set,
+the interface is vendor-specific and (possibly) supplied by the driver.
+The OPEN Alliance SIG specifies a standard register map for 10BASE-T1S PHYs
+embedding the PLCA Reconcialiation Sublayer. See "10BASE-T1S PLCA Management
+Registers" at https://www.opensig.org/about/specifications/.
+
+When set, the optional ``ETHTOOL_A_PLCA_ENABLED`` attribute indicates the
+administrative state of the PLCA RS. When not set, the node operates in "plain"
+CSMA/CD mode. This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.1
+aPLCAAdminState / 30.16.1.2.1 acPLCAAdminControl.
+
+When set, the optional ``ETHTOOL_A_PLCA_NODE_ID`` attribute indicates the
+configured local node ID of the PHY. This ID determines which transmit
+opportunity (TO) is reserved for the node to transmit into. This option is
+corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.4 aPLCALocalNodeID. The valid
+range for this attribute is [0 .. 255] where 255 means "not configured".
+
+When set, the optional ``ETHTOOL_A_PLCA_NODE_CNT`` attribute indicates the
+configured maximum number of PLCA nodes on the mixing-segment. This number
+determines the total number of transmit opportunities generated during a
+PLCA cycle. This attribute is relevant only for the PLCA coordinator, which is
+the node with aPLCALocalNodeID set to 0. Follower nodes ignore this setting.
+This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.3
+aPLCANodeCount. The valid range for this attribute is [1 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_TO_TMR`` attribute indicates the
+configured value of the transmit opportunity timer in bit-times. This value
+must be set equal across all nodes sharing the medium for PLCA to work
+correctly. This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.5
+aPLCATransmitOpportunityTimer. The valid range for this attribute is
+[0 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_BURST_CNT`` attribute indicates the
+configured number of extra packets that the node is allowed to send during a
+single transmit opportunity. By default, this attribute is 0, meaning that
+the node can only send a single frame per TO. When greater than 0, the PLCA RS
+keeps the TO after any transmission, waiting for the MAC to send a new frame
+for up to aPLCABurstTimer BTs. This can only happen a number of times per PLCA
+cycle up to the value of this parameter. After that, the burst is over and the
+normal counting of TOs resumes. This option is corresponding to
+``IEEE 802.3cg-2019`` 30.16.1.1.6 aPLCAMaxBurstCount. The valid range for this
+attribute is [0 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_BURST_TMR`` attribute indicates how
+many bit-times the PLCA RS waits for the MAC to initiate a new transmission
+when aPLCAMaxBurstCount is greater than 0. If the MAC fails to send a new
+frame within this time, the burst ends and the counting of TOs resumes.
+Otherwise, the new frame is sent as part of the current burst. This option
+is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.7 aPLCABurstTimer. The
+valid range for this attribute is [0 .. 255]. Although, the value should be
+set greater than the Inter-Frame-Gap (IFG) time of the MAC (plus some margin)
+for PLCA burst mode to work as intended.
+
+PLCA_SET_CFG
+============
+
+Sets PLCA RS parameters.
+
+Request contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PLCA_HEADER`` nested request header
+ ``ETHTOOL_A_PLCA_ENABLED`` u8 PLCA Admin State
+ ``ETHTOOL_A_PLCA_NODE_ID`` u8 PLCA unique local node ID
+ ``ETHTOOL_A_PLCA_NODE_CNT`` u8 Number of PLCA nodes on the
+ netkork, including the
+ coordinator
+ ``ETHTOOL_A_PLCA_TO_TMR`` u8 Transmit Opportunity Timer
+ value in bit-times (BT)
+ ``ETHTOOL_A_PLCA_BURST_CNT`` u8 Number of additional packets
+ the node is allowed to send
+ within a single TO
+ ``ETHTOOL_A_PLCA_BURST_TMR`` u8 Time to wait for the MAC to
+ transmit a new frame before
+ terminating the burst
+ ====================================== ====== =============================
+
+For a description of each attribute, see ``PLCA_GET_CFG``.
+
+PLCA_GET_STATUS
+===============
+
+Gets PLCA RS status information.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PLCA_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PLCA_HEADER`` nested reply header
+ ``ETHTOOL_A_PLCA_STATUS`` u8 PLCA RS operational status
+ ====================================== ====== =============================
+
+When set, the ``ETHTOOL_A_PLCA_STATUS`` attribute indicates whether the node is
+detecting the presence of the BEACON on the network. This flag is
+corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.2 aPLCAStatus.
+
+MM_GET
+======
+
+Retrieve 802.3 MAC Merge parameters.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_MM_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ================================= ====== ===================================
+ ``ETHTOOL_A_MM_HEADER`` nested request header
+ ``ETHTOOL_A_MM_PMAC_ENABLED`` bool set if RX of preemptible and SMD-V
+ frames is enabled
+ ``ETHTOOL_A_MM_TX_ENABLED`` bool set if TX of preemptible frames is
+ administratively enabled (might be
+ inactive if verification failed)
+ ``ETHTOOL_A_MM_TX_ACTIVE`` bool set if TX of preemptible frames is
+ operationally enabled
+ ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE`` u32 minimum size of transmitted
+ non-final fragments, in octets
+ ``ETHTOOL_A_MM_RX_MIN_FRAG_SIZE`` u32 minimum size of received non-final
+ fragments, in octets
+ ``ETHTOOL_A_MM_VERIFY_ENABLED`` bool set if TX of SMD-V frames is
+ administratively enabled
+ ``ETHTOOL_A_MM_VERIFY_STATUS`` u8 state of the verification function
+ ``ETHTOOL_A_MM_VERIFY_TIME`` u32 delay between verification attempts
+ ``ETHTOOL_A_MM_MAX_VERIFY_TIME``` u32 maximum verification interval
+ supported by device
+ ``ETHTOOL_A_MM_STATS`` nested IEEE 802.3-2018 subclause 30.14.1
+ oMACMergeEntity statistics counters
+ ================================= ====== ===================================
+
+The attributes are populated by the device driver through the following
+structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_mm_state
+
+The ``ETHTOOL_A_MM_VERIFY_STATUS`` will report one of the values from
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_mm_verify_status
+
+If ``ETHTOOL_A_MM_VERIFY_ENABLED`` was passed as false in the ``MM_SET``
+command, ``ETHTOOL_A_MM_VERIFY_STATUS`` will report either
+``ETHTOOL_MM_VERIFY_STATUS_INITIAL`` or ``ETHTOOL_MM_VERIFY_STATUS_DISABLED``,
+otherwise it should report one of the other states.
+
+It is recommended that drivers start with the pMAC disabled, and enable it upon
+user space request. It is also recommended that user space does not depend upon
+the default values from ``ETHTOOL_MSG_MM_GET`` requests.
+
+``ETHTOOL_A_MM_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in
+``ETHTOOL_A_HEADER_FLAGS``. The attribute will be empty if driver did not
+report any statistics. Drivers fill in the statistics in the following
+structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_mm_stats
+
+MM_SET
+======
+
+Modifies the configuration of the 802.3 MAC Merge layer.
+
+Request contents:
+
+ ================================= ====== ==========================
+ ``ETHTOOL_A_MM_VERIFY_TIME`` u32 see MM_GET description
+ ``ETHTOOL_A_MM_VERIFY_ENABLED`` bool see MM_GET description
+ ``ETHTOOL_A_MM_TX_ENABLED`` bool see MM_GET description
+ ``ETHTOOL_A_MM_PMAC_ENABLED`` bool see MM_GET description
+ ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE`` u32 see MM_GET description
+ ================================= ====== ==========================
+
+The attributes are propagated to the driver through the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_mm_cfg
+
+Request translation
+===================
+
+The following table maps ioctl commands to netlink commands providing their
+functionality. Entries with "n/a" in right column are commands which do not
+have their netlink replacement yet. Entries which "n/a" in the left column
+are netlink only.
+
+ =================================== =====================================
+ ioctl command netlink command
+ =================================== =====================================
+ ``ETHTOOL_GSET`` ``ETHTOOL_MSG_LINKINFO_GET``
+ ``ETHTOOL_MSG_LINKMODES_GET``
+ ``ETHTOOL_SSET`` ``ETHTOOL_MSG_LINKINFO_SET``
+ ``ETHTOOL_MSG_LINKMODES_SET``
+ ``ETHTOOL_GDRVINFO`` n/a
+ ``ETHTOOL_GREGS`` n/a
+ ``ETHTOOL_GWOL`` ``ETHTOOL_MSG_WOL_GET``
+ ``ETHTOOL_SWOL`` ``ETHTOOL_MSG_WOL_SET``
+ ``ETHTOOL_GMSGLVL`` ``ETHTOOL_MSG_DEBUG_GET``
+ ``ETHTOOL_SMSGLVL`` ``ETHTOOL_MSG_DEBUG_SET``
+ ``ETHTOOL_NWAY_RST`` n/a
+ ``ETHTOOL_GLINK`` ``ETHTOOL_MSG_LINKSTATE_GET``
+ ``ETHTOOL_GEEPROM`` n/a
+ ``ETHTOOL_SEEPROM`` n/a
+ ``ETHTOOL_GCOALESCE`` ``ETHTOOL_MSG_COALESCE_GET``
+ ``ETHTOOL_SCOALESCE`` ``ETHTOOL_MSG_COALESCE_SET``
+ ``ETHTOOL_GRINGPARAM`` ``ETHTOOL_MSG_RINGS_GET``
+ ``ETHTOOL_SRINGPARAM`` ``ETHTOOL_MSG_RINGS_SET``
+ ``ETHTOOL_GPAUSEPARAM`` ``ETHTOOL_MSG_PAUSE_GET``
+ ``ETHTOOL_SPAUSEPARAM`` ``ETHTOOL_MSG_PAUSE_SET``
+ ``ETHTOOL_GRXCSUM`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SRXCSUM`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GTXCSUM`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_STXCSUM`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GSG`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SSG`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_TEST`` n/a
+ ``ETHTOOL_GSTRINGS`` ``ETHTOOL_MSG_STRSET_GET``
+ ``ETHTOOL_PHYS_ID`` n/a
+ ``ETHTOOL_GSTATS`` n/a
+ ``ETHTOOL_GTSO`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_STSO`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GPERMADDR`` rtnetlink ``RTM_GETLINK``
+ ``ETHTOOL_GUFO`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SUFO`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GGSO`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SGSO`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GFLAGS`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SFLAGS`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_GET``
+ ``ETHTOOL_SPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_SET``
+ ``ETHTOOL_GRXFH`` n/a
+ ``ETHTOOL_SRXFH`` n/a
+ ``ETHTOOL_GGRO`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SGRO`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GRXRINGS`` n/a
+ ``ETHTOOL_GRXCLSRLCNT`` n/a
+ ``ETHTOOL_GRXCLSRULE`` n/a
+ ``ETHTOOL_GRXCLSRLALL`` n/a
+ ``ETHTOOL_SRXCLSRLDEL`` n/a
+ ``ETHTOOL_SRXCLSRLINS`` n/a
+ ``ETHTOOL_FLASHDEV`` n/a
+ ``ETHTOOL_RESET`` n/a
+ ``ETHTOOL_SRXNTUPLE`` n/a
+ ``ETHTOOL_GRXNTUPLE`` n/a
+ ``ETHTOOL_GSSET_INFO`` ``ETHTOOL_MSG_STRSET_GET``
+ ``ETHTOOL_GRXFHINDIR`` n/a
+ ``ETHTOOL_SRXFHINDIR`` n/a
+ ``ETHTOOL_GFEATURES`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SFEATURES`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GCHANNELS`` ``ETHTOOL_MSG_CHANNELS_GET``
+ ``ETHTOOL_SCHANNELS`` ``ETHTOOL_MSG_CHANNELS_SET``
+ ``ETHTOOL_SET_DUMP`` n/a
+ ``ETHTOOL_GET_DUMP_FLAG`` n/a
+ ``ETHTOOL_GET_DUMP_DATA`` n/a
+ ``ETHTOOL_GET_TS_INFO`` ``ETHTOOL_MSG_TSINFO_GET``
+ ``ETHTOOL_GMODULEINFO`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
+ ``ETHTOOL_GMODULEEEPROM`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
+ ``ETHTOOL_GEEE`` ``ETHTOOL_MSG_EEE_GET``
+ ``ETHTOOL_SEEE`` ``ETHTOOL_MSG_EEE_SET``
+ ``ETHTOOL_GRSSH`` ``ETHTOOL_MSG_RSS_GET``
+ ``ETHTOOL_SRSSH`` n/a
+ ``ETHTOOL_GTUNABLE`` n/a
+ ``ETHTOOL_STUNABLE`` n/a
+ ``ETHTOOL_GPHYSTATS`` n/a
+ ``ETHTOOL_PERQUEUE`` n/a
+ ``ETHTOOL_GLINKSETTINGS`` ``ETHTOOL_MSG_LINKINFO_GET``
+ ``ETHTOOL_MSG_LINKMODES_GET``
+ ``ETHTOOL_SLINKSETTINGS`` ``ETHTOOL_MSG_LINKINFO_SET``
+ ``ETHTOOL_MSG_LINKMODES_SET``
+ ``ETHTOOL_PHY_GTUNABLE`` n/a
+ ``ETHTOOL_PHY_STUNABLE`` n/a
+ ``ETHTOOL_GFECPARAM`` ``ETHTOOL_MSG_FEC_GET``
+ ``ETHTOOL_SFECPARAM`` ``ETHTOOL_MSG_FEC_SET``
+ n/a ``ETHTOOL_MSG_CABLE_TEST_ACT``
+ n/a ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT``
+ n/a ``ETHTOOL_MSG_TUNNEL_INFO_GET``
+ n/a ``ETHTOOL_MSG_PHC_VCLOCKS_GET``
+ n/a ``ETHTOOL_MSG_MODULE_GET``
+ n/a ``ETHTOOL_MSG_MODULE_SET``
+ n/a ``ETHTOOL_MSG_PLCA_GET_CFG``
+ n/a ``ETHTOOL_MSG_PLCA_SET_CFG``
+ n/a ``ETHTOOL_MSG_PLCA_GET_STATUS``
+ n/a ``ETHTOOL_MSG_MM_GET``
+ n/a ``ETHTOOL_MSG_MM_SET``
+ =================================== =====================================
diff --git a/Documentation/networking/failover.rst b/Documentation/networking/failover.rst
new file mode 100644
index 0000000000..f0c8483cdb
--- /dev/null
+++ b/Documentation/networking/failover.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+FAILOVER
+========
+
+Overview
+========
+
+The failover module provides a generic interface for paravirtual drivers
+to register a netdev and a set of ops with a failover instance. The ops
+are used as event handlers that get called to handle netdev register/
+unregister/link change/name change events on slave pci ethernet devices
+with the same mac address as the failover netdev.
+
+This enables paravirtual drivers to use a VF as an accelerated low latency
+datapath. It also allows live migration of VMs with direct attached VFs by
+failing over to the paravirtual datapath when the VF is unplugged.
diff --git a/Documentation/networking/fib_trie.rst b/Documentation/networking/fib_trie.rst
new file mode 100644
index 0000000000..f1435b7fcd
--- /dev/null
+++ b/Documentation/networking/fib_trie.rst
@@ -0,0 +1,149 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+LC-trie implementation notes
+============================
+
+Node types
+----------
+leaf
+ An end node with data. This has a copy of the relevant key, along
+ with 'hlist' with routing table entries sorted by prefix length.
+ See struct leaf and struct leaf_info.
+
+trie node or tnode
+ An internal node, holding an array of child (leaf or tnode) pointers,
+ indexed through a subset of the key. See Level Compression.
+
+A few concepts explained
+------------------------
+Bits (tnode)
+ The number of bits in the key segment used for indexing into the
+ child array - the "child index". See Level Compression.
+
+Pos (tnode)
+ The position (in the key) of the key segment used for indexing into
+ the child array. See Path Compression.
+
+Path Compression / skipped bits
+ Any given tnode is linked to from the child array of its parent, using
+ a segment of the key specified by the parent's "pos" and "bits"
+ In certain cases, this tnode's own "pos" will not be immediately
+ adjacent to the parent (pos+bits), but there will be some bits
+ in the key skipped over because they represent a single path with no
+ deviations. These "skipped bits" constitute Path Compression.
+ Note that the search algorithm will simply skip over these bits when
+ searching, making it necessary to save the keys in the leaves to
+ verify that they actually do match the key we are searching for.
+
+Level Compression / child arrays
+ the trie is kept level balanced moving, under certain conditions, the
+ children of a full child (see "full_children") up one level, so that
+ instead of a pure binary tree, each internal node ("tnode") may
+ contain an arbitrarily large array of links to several children.
+ Conversely, a tnode with a mostly empty child array (see empty_children)
+ may be "halved", having some of its children moved downwards one level,
+ in order to avoid ever-increasing child arrays.
+
+empty_children
+ the number of positions in the child array of a given tnode that are
+ NULL.
+
+full_children
+ the number of children of a given tnode that aren't path compressed.
+ (in other words, they aren't NULL or leaves and their "pos" is equal
+ to this tnode's "pos"+"bits").
+
+ (The word "full" here is used more in the sense of "complete" than
+ as the opposite of "empty", which might be a tad confusing.)
+
+Comments
+---------
+
+We have tried to keep the structure of the code as close to fib_hash as
+possible to allow verification and help up reviewing.
+
+fib_find_node()
+ A good start for understanding this code. This function implements a
+ straightforward trie lookup.
+
+fib_insert_node()
+ Inserts a new leaf node in the trie. This is bit more complicated than
+ fib_find_node(). Inserting a new node means we might have to run the
+ level compression algorithm on part of the trie.
+
+trie_leaf_remove()
+ Looks up a key, deletes it and runs the level compression algorithm.
+
+trie_rebalance()
+ The key function for the dynamic trie after any change in the trie
+ it is run to optimize and reorganize. It will walk the trie upwards
+ towards the root from a given tnode, doing a resize() at each step
+ to implement level compression.
+
+resize()
+ Analyzes a tnode and optimizes the child array size by either inflating
+ or shrinking it repeatedly until it fulfills the criteria for optimal
+ level compression. This part follows the original paper pretty closely
+ and there may be some room for experimentation here.
+
+inflate()
+ Doubles the size of the child array within a tnode. Used by resize().
+
+halve()
+ Halves the size of the child array within a tnode - the inverse of
+ inflate(). Used by resize();
+
+fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
+ The route manipulation functions. Should conform pretty closely to the
+ corresponding functions in fib_hash.
+
+fn_trie_flush()
+ This walks the full trie (using nextleaf()) and searches for empty
+ leaves which have to be removed.
+
+fn_trie_dump()
+ Dumps the routing table ordered by prefix length. This is somewhat
+ slower than the corresponding fib_hash function, as we have to walk the
+ entire trie for each prefix length. In comparison, fib_hash is organized
+ as one "zone"/hash per prefix length.
+
+Locking
+-------
+
+fib_lock is used for an RW-lock in the same way that this is done in fib_hash.
+However, the functions are somewhat separated for other possible locking
+scenarios. It might conceivably be possible to run trie_rebalance via RCU
+to avoid read_lock in the fn_trie_lookup() function.
+
+Main lookup mechanism
+---------------------
+fn_trie_lookup() is the main lookup function.
+
+The lookup is in its simplest form just like fib_find_node(). We descend the
+trie, key segment by key segment, until we find a leaf. check_leaf() does
+the fib_semantic_match in the leaf's sorted prefix hlist.
+
+If we find a match, we are done.
+
+If we don't find a match, we enter prefix matching mode. The prefix length,
+starting out at the same as the key length, is reduced one step at a time,
+and we backtrack upwards through the trie trying to find a longest matching
+prefix. The goal is always to reach a leaf and get a positive result from the
+fib_semantic_match mechanism.
+
+Inside each tnode, the search for longest matching prefix consists of searching
+through the child array, chopping off (zeroing) the least significant "1" of
+the child index until we find a match or the child index consists of nothing but
+zeros.
+
+At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
+chop off part of the key in order to find the longest matching prefix.
+
+At this point we will repeatedly descend subtries to look for a match, and there
+are some optimizations available that can provide us with "shortcuts" to avoid
+descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
+
+To alleviate any doubts about the correctness of the route selection process,
+a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
+gives userland access to fib_lookup().
diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst
new file mode 100644
index 0000000000..f69da50748
--- /dev/null
+++ b/Documentation/networking/filter.rst
@@ -0,0 +1,685 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _networking-filter:
+
+=======================================================
+Linux Socket Filtering aka Berkeley Packet Filter (BPF)
+=======================================================
+
+Notice
+------
+
+This file used to document the eBPF format and mechanisms even when not
+related to socket filtering. The ../bpf/index.rst has more details
+on eBPF.
+
+Introduction
+------------
+
+Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
+Though there are some distinct differences between the BSD and Linux
+Kernel filtering, but when we speak of BPF or LSF in Linux context, we
+mean the very same mechanism of filtering in the Linux kernel.
+
+BPF allows a user-space program to attach a filter onto any socket and
+allow or disallow certain types of data to come through the socket. LSF
+follows exactly the same filter code structure as BSD's BPF, so referring
+to the BSD bpf.4 manpage is very helpful in creating filters.
+
+On Linux, BPF is much simpler than on BSD. One does not have to worry
+about devices or anything like that. You simply create your filter code,
+send it to the kernel via the SO_ATTACH_FILTER option and if your filter
+code passes the kernel check on it, you then immediately begin filtering
+data on that socket.
+
+You can also detach filters from your socket via the SO_DETACH_FILTER
+option. This will probably not be used much since when you close a socket
+that has a filter on it the filter is automagically removed. The other
+less common case may be adding a different filter on the same socket where
+you had another filter that is still running: the kernel takes care of
+removing the old one and placing your new one in its place, assuming your
+filter has passed the checks, otherwise if it fails the old filter will
+remain on that socket.
+
+SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once
+set, a filter cannot be removed or changed. This allows one process to
+setup a socket, attach a filter, lock it then drop privileges and be
+assured that the filter will be kept until the socket is closed.
+
+The biggest user of this construct might be libpcap. Issuing a high-level
+filter command like `tcpdump -i em1 port 22` passes through the libpcap
+internal compiler that generates a structure that can eventually be loaded
+via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
+displays what is being placed into this structure.
+
+Although we were only speaking about sockets here, BPF in Linux is used
+in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
+qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places
+such as team driver, PTP code, etc where BPF is being used.
+
+.. [1] Documentation/userspace-api/seccomp_filter.rst
+
+Original BPF paper:
+
+Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
+architecture for user-level packet capture. In Proceedings of the
+USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993
+Conference Proceedings (USENIX'93). USENIX Association, Berkeley,
+CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]
+
+Structure
+---------
+
+User space applications include <linux/filter.h> which contains the
+following relevant structures::
+
+ struct sock_filter { /* Filter block */
+ __u16 code; /* Actual filter code */
+ __u8 jt; /* Jump true */
+ __u8 jf; /* Jump false */
+ __u32 k; /* Generic multiuse field */
+ };
+
+Such a structure is assembled as an array of 4-tuples, that contains
+a code, jt, jf and k value. jt and jf are jump offsets and k a generic
+value to be used for a provided code::
+
+ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
+ unsigned short len; /* Number of filter blocks */
+ struct sock_filter __user *filter;
+ };
+
+For socket filtering, a pointer to this structure (as shown in
+follow-up example) is being passed to the kernel through setsockopt(2).
+
+Example
+-------
+
+::
+
+ #include <sys/socket.h>
+ #include <sys/types.h>
+ #include <arpa/inet.h>
+ #include <linux/if_ether.h>
+ /* ... */
+
+ /* From the example above: tcpdump -i em1 port 22 -dd */
+ struct sock_filter code[] = {
+ { 0x28, 0, 0, 0x0000000c },
+ { 0x15, 0, 8, 0x000086dd },
+ { 0x30, 0, 0, 0x00000014 },
+ { 0x15, 2, 0, 0x00000084 },
+ { 0x15, 1, 0, 0x00000006 },
+ { 0x15, 0, 17, 0x00000011 },
+ { 0x28, 0, 0, 0x00000036 },
+ { 0x15, 14, 0, 0x00000016 },
+ { 0x28, 0, 0, 0x00000038 },
+ { 0x15, 12, 13, 0x00000016 },
+ { 0x15, 0, 12, 0x00000800 },
+ { 0x30, 0, 0, 0x00000017 },
+ { 0x15, 2, 0, 0x00000084 },
+ { 0x15, 1, 0, 0x00000006 },
+ { 0x15, 0, 8, 0x00000011 },
+ { 0x28, 0, 0, 0x00000014 },
+ { 0x45, 6, 0, 0x00001fff },
+ { 0xb1, 0, 0, 0x0000000e },
+ { 0x48, 0, 0, 0x0000000e },
+ { 0x15, 2, 0, 0x00000016 },
+ { 0x48, 0, 0, 0x00000010 },
+ { 0x15, 0, 1, 0x00000016 },
+ { 0x06, 0, 0, 0x0000ffff },
+ { 0x06, 0, 0, 0x00000000 },
+ };
+
+ struct sock_fprog bpf = {
+ .len = ARRAY_SIZE(code),
+ .filter = code,
+ };
+
+ sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (sock < 0)
+ /* ... bail out ... */
+
+ ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
+ if (ret < 0)
+ /* ... bail out ... */
+
+ /* ... */
+ close(sock);
+
+The above example code attaches a socket filter for a PF_PACKET socket
+in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
+be dropped for this socket.
+
+The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments
+and SO_LOCK_FILTER for preventing the filter to be detached, takes an
+integer value with 0 or 1.
+
+Note that socket filters are not restricted to PF_PACKET sockets only,
+but can also be used on other socket families.
+
+Summary of system calls:
+
+ * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
+ * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
+ * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val));
+
+Normally, most use cases for socket filtering on packet sockets will be
+covered by libpcap in high-level syntax, so as an application developer
+you should stick to that. libpcap wraps its own layer around all that.
+
+Unless i) using/linking to libpcap is not an option, ii) the required BPF
+filters use Linux extensions that are not supported by libpcap's compiler,
+iii) a filter might be more complex and not cleanly implementable with
+libpcap's compiler, or iv) particular filter codes should be optimized
+differently than libpcap's internal compiler does; then in such cases
+writing such a filter "by hand" can be of an alternative. For example,
+xt_bpf and cls_bpf users might have requirements that could result in
+more complex filter code, or one that cannot be expressed with libpcap
+(e.g. different return codes for various code paths). Moreover, BPF JIT
+implementors may wish to manually write test cases and thus need low-level
+access to BPF code as well.
+
+BPF engine and instruction set
+------------------------------
+
+Under tools/bpf/ there's a small helper tool called bpf_asm which can
+be used to write low-level filters for example scenarios mentioned in the
+previous section. Asm-like syntax mentioned here has been implemented in
+bpf_asm and will be used for further explanations (instead of dealing with
+less readable opcodes directly, principles are the same). The syntax is
+closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
+
+The BPF architecture consists of the following basic elements:
+
+ ======= ====================================================
+ Element Description
+ ======= ====================================================
+ A 32 bit wide accumulator
+ X 32 bit wide X register
+ M[] 16 x 32 bit wide misc registers aka "scratch memory
+ store", addressable from 0 to 15
+ ======= ====================================================
+
+A program, that is translated by bpf_asm into "opcodes" is an array that
+consists of the following elements (as already mentioned)::
+
+ op:16, jt:8, jf:8, k:32
+
+The element op is a 16 bit wide opcode that has a particular instruction
+encoded. jt and jf are two 8 bit wide jump targets, one for condition
+"jump if true", the other one "jump if false". Eventually, element k
+contains a miscellaneous argument that can be interpreted in different
+ways depending on the given instruction in op.
+
+The instruction set consists of load, store, branch, alu, miscellaneous
+and return instructions that are also represented in bpf_asm syntax. This
+table lists all bpf_asm instructions available resp. what their underlying
+opcodes as defined in linux/filter.h stand for:
+
+ =========== =================== =====================
+ Instruction Addressing mode Description
+ =========== =================== =====================
+ ld 1, 2, 3, 4, 12 Load word into A
+ ldi 4 Load word into A
+ ldh 1, 2 Load half-word into A
+ ldb 1, 2 Load byte into A
+ ldx 3, 4, 5, 12 Load word into X
+ ldxi 4 Load word into X
+ ldxb 5 Load byte into X
+
+ st 3 Store A into M[]
+ stx 3 Store X into M[]
+
+ jmp 6 Jump to label
+ ja 6 Jump to label
+ jeq 7, 8, 9, 10 Jump on A == <x>
+ jneq 9, 10 Jump on A != <x>
+ jne 9, 10 Jump on A != <x>
+ jlt 9, 10 Jump on A < <x>
+ jle 9, 10 Jump on A <= <x>
+ jgt 7, 8, 9, 10 Jump on A > <x>
+ jge 7, 8, 9, 10 Jump on A >= <x>
+ jset 7, 8, 9, 10 Jump on A & <x>
+
+ add 0, 4 A + <x>
+ sub 0, 4 A - <x>
+ mul 0, 4 A * <x>
+ div 0, 4 A / <x>
+ mod 0, 4 A % <x>
+ neg !A
+ and 0, 4 A & <x>
+ or 0, 4 A | <x>
+ xor 0, 4 A ^ <x>
+ lsh 0, 4 A << <x>
+ rsh 0, 4 A >> <x>
+
+ tax Copy A into X
+ txa Copy X into A
+
+ ret 4, 11 Return
+ =========== =================== =====================
+
+The next table shows addressing formats from the 2nd column:
+
+ =============== =================== ===============================================
+ Addressing mode Syntax Description
+ =============== =================== ===============================================
+ 0 x/%x Register X
+ 1 [k] BHW at byte offset k in the packet
+ 2 [x + k] BHW at the offset X + k in the packet
+ 3 M[k] Word at offset k in M[]
+ 4 #k Literal value stored in k
+ 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet
+ 6 L Jump label L
+ 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf
+ 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf
+ 9 #k,Lt Jump to Lt if predicate is true
+ 10 x/%x,Lt Jump to Lt if predicate is true
+ 11 a/%a Accumulator A
+ 12 extension BPF extension
+ =============== =================== ===============================================
+
+The Linux kernel also has a couple of BPF extensions that are used along
+with the class of load instructions by "overloading" the k argument with
+a negative offset + a particular extension offset. The result of such BPF
+extensions are loaded into A.
+
+Possible BPF extensions are shown in the following table:
+
+ =================================== =================================================
+ Extension Description
+ =================================== =================================================
+ len skb->len
+ proto skb->protocol
+ type skb->pkt_type
+ poff Payload start offset
+ ifidx skb->dev->ifindex
+ nla Netlink attribute of type X with offset A
+ nlan Nested Netlink attribute of type X with offset A
+ mark skb->mark
+ queue skb->queue_mapping
+ hatype skb->dev->type
+ rxhash skb->hash
+ cpu raw_smp_processor_id()
+ vlan_tci skb_vlan_tag_get(skb)
+ vlan_avail skb_vlan_tag_present(skb)
+ vlan_tpid skb->vlan_proto
+ rand get_random_u32()
+ =================================== =================================================
+
+These extensions can also be prefixed with '#'.
+Examples for low-level BPF:
+
+**ARP packets**::
+
+ ldh [12]
+ jne #0x806, drop
+ ret #-1
+ drop: ret #0
+
+**IPv4 TCP packets**::
+
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #6, drop
+ ret #-1
+ drop: ret #0
+
+**icmp random packet sampling, 1 in 4**::
+
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #1, drop
+ # get a random uint32 number
+ ld rand
+ mod #4
+ jneq #1, drop
+ ret #-1
+ drop: ret #0
+
+**SECCOMP filter example**::
+
+ ld [4] /* offsetof(struct seccomp_data, arch) */
+ jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
+ ld [0] /* offsetof(struct seccomp_data, nr) */
+ jeq #15, good /* __NR_rt_sigreturn */
+ jeq #231, good /* __NR_exit_group */
+ jeq #60, good /* __NR_exit */
+ jeq #0, good /* __NR_read */
+ jeq #1, good /* __NR_write */
+ jeq #5, good /* __NR_fstat */
+ jeq #9, good /* __NR_mmap */
+ jeq #14, good /* __NR_rt_sigprocmask */
+ jeq #13, good /* __NR_rt_sigaction */
+ jeq #35, good /* __NR_nanosleep */
+ bad: ret #0 /* SECCOMP_RET_KILL_THREAD */
+ good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
+
+Examples for low-level BPF extension:
+
+**Packet for interface index 13**::
+
+ ld ifidx
+ jneq #13, drop
+ ret #-1
+ drop: ret #0
+
+**(Accelerated) VLAN w/ id 10**::
+
+ ld vlan_tci
+ jneq #10, drop
+ ret #-1
+ drop: ret #0
+
+The above example code can be placed into a file (here called "foo"), and
+then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
+and cls_bpf understands and can directly be loaded with. Example with above
+ARP code::
+
+ $ ./bpf_asm foo
+ 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
+
+In copy and paste C-like output::
+
+ $ ./bpf_asm -c foo
+ { 0x28, 0, 0, 0x0000000c },
+ { 0x15, 0, 1, 0x00000806 },
+ { 0x06, 0, 0, 0xffffffff },
+ { 0x06, 0, 0, 0000000000 },
+
+In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
+filters that might not be obvious at first, it's good to test filters before
+attaching to a live system. For that purpose, there's a small tool called
+bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
+for testing BPF filters against given pcap files, single stepping through the
+BPF code on the pcap's packets and to do BPF machine register dumps.
+
+Starting bpf_dbg is trivial and just requires issuing::
+
+ # ./bpf_dbg
+
+In case input and output do not equal stdin/stdout, bpf_dbg takes an
+alternative stdin source as a first argument, and an alternative stdout
+sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
+
+Other than that, a particular libreadline configuration can be set via
+file "~/.bpf_dbg_init" and the command history is stored in the file
+"~/.bpf_dbg_history".
+
+Interaction in bpf_dbg happens through a shell that also has auto-completion
+support (follow-up example commands starting with '>' denote bpf_dbg shell).
+The usual workflow would be to ...
+
+* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
+ Loads a BPF filter from standard output of bpf_asm, or transformed via
+ e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT
+ debugging (next section), this command creates a temporary socket and
+ loads the BPF code into the kernel. Thus, this will also be useful for
+ JIT developers.
+
+* load pcap foo.pcap
+
+ Loads standard tcpdump pcap file.
+
+* run [<n>]
+
+bpf passes:1 fails:9
+ Runs through all packets from a pcap to account how many passes and fails
+ the filter will generate. A limit of packets to traverse can be given.
+
+* disassemble::
+
+ l0: ldh [12]
+ l1: jeq #0x800, l2, l5
+ l2: ldb [23]
+ l3: jeq #0x1, l4, l5
+ l4: ret #0xffff
+ l5: ret #0
+
+ Prints out BPF code disassembly.
+
+* dump::
+
+ /* { op, jt, jf, k }, */
+ { 0x28, 0, 0, 0x0000000c },
+ { 0x15, 0, 3, 0x00000800 },
+ { 0x30, 0, 0, 0x00000017 },
+ { 0x15, 0, 1, 0x00000001 },
+ { 0x06, 0, 0, 0x0000ffff },
+ { 0x06, 0, 0, 0000000000 },
+
+ Prints out C-style BPF code dump.
+
+* breakpoint 0::
+
+ breakpoint at: l0: ldh [12]
+
+* breakpoint 1::
+
+ breakpoint at: l1: jeq #0x800, l2, l5
+
+ ...
+
+ Sets breakpoints at particular BPF instructions. Issuing a `run` command
+ will walk through the pcap file continuing from the current packet and
+ break when a breakpoint is being hit (another `run` will continue from
+ the currently active breakpoint executing next instructions):
+
+ * run::
+
+ -- register dump --
+ pc: [0] <-- program counter
+ code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
+ curr: l0: ldh [12] <-- disassembly of current instruction
+ A: [00000000][0] <-- content of A (hex, decimal)
+ X: [00000000][0] <-- content of X (hex, decimal)
+ M[0,15]: [00000000][0] <-- folded content of M (hex, decimal)
+ -- packet dump -- <-- Current packet from pcap (hex)
+ len: 42
+ 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
+ 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
+ 32: 00 00 00 00 00 00 0a 3b 01 01
+ (breakpoint)
+ >
+
+ * breakpoint::
+
+ breakpoints: 0 1
+
+ Prints currently set breakpoints.
+
+* step [-<n>, +<n>]
+
+ Performs single stepping through the BPF program from the current pc
+ offset. Thus, on each step invocation, above register dump is issued.
+ This can go forwards and backwards in time, a plain `step` will break
+ on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
+
+* select <n>
+
+ Selects a given packet from the pcap file to continue from. Thus, on
+ the next `run` or `step`, the BPF program is being evaluated against
+ the user pre-selected packet. Numbering starts just as in Wireshark
+ with index 1.
+
+* quit
+
+ Exits bpf_dbg.
+
+JIT compiler
+------------
+
+The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
+PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
+CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
+attached filter from user space or for internal kernel users if it has
+been previously enabled by root::
+
+ echo 1 > /proc/sys/net/core/bpf_jit_enable
+
+For JIT developers, doing audits etc, each compile run can output the generated
+opcode image into the kernel log via::
+
+ echo 2 > /proc/sys/net/core/bpf_jit_enable
+
+Example output from dmesg::
+
+ [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
+ [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
+ [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
+ [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
+ [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
+ [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
+
+When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and
+setting any other value than that will return in failure. This is even the case for
+setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log
+is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the
+generally recommended approach instead.
+
+In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
+generating disassembly out of the kernel log's hexdump::
+
+ # ./bpf_jit_disasm
+ 70 bytes emitted from JIT compiler (pass:3, flen:6)
+ ffffffffa0069c8f + <x>:
+ 0: push %rbp
+ 1: mov %rsp,%rbp
+ 4: sub $0x60,%rsp
+ 8: mov %rbx,-0x8(%rbp)
+ c: mov 0x68(%rdi),%r9d
+ 10: sub 0x6c(%rdi),%r9d
+ 14: mov 0xd8(%rdi),%r8
+ 1b: mov $0xc,%esi
+ 20: callq 0xffffffffe0ff9442
+ 25: cmp $0x800,%eax
+ 2a: jne 0x0000000000000042
+ 2c: mov $0x17,%esi
+ 31: callq 0xffffffffe0ff945e
+ 36: cmp $0x1,%eax
+ 39: jne 0x0000000000000042
+ 3b: mov $0xffff,%eax
+ 40: jmp 0x0000000000000044
+ 42: xor %eax,%eax
+ 44: leaveq
+ 45: retq
+
+ Issuing option `-o` will "annotate" opcodes to resulting assembler
+ instructions, which can be very useful for JIT developers:
+
+ # ./bpf_jit_disasm -o
+ 70 bytes emitted from JIT compiler (pass:3, flen:6)
+ ffffffffa0069c8f + <x>:
+ 0: push %rbp
+ 55
+ 1: mov %rsp,%rbp
+ 48 89 e5
+ 4: sub $0x60,%rsp
+ 48 83 ec 60
+ 8: mov %rbx,-0x8(%rbp)
+ 48 89 5d f8
+ c: mov 0x68(%rdi),%r9d
+ 44 8b 4f 68
+ 10: sub 0x6c(%rdi),%r9d
+ 44 2b 4f 6c
+ 14: mov 0xd8(%rdi),%r8
+ 4c 8b 87 d8 00 00 00
+ 1b: mov $0xc,%esi
+ be 0c 00 00 00
+ 20: callq 0xffffffffe0ff9442
+ e8 1d 94 ff e0
+ 25: cmp $0x800,%eax
+ 3d 00 08 00 00
+ 2a: jne 0x0000000000000042
+ 75 16
+ 2c: mov $0x17,%esi
+ be 17 00 00 00
+ 31: callq 0xffffffffe0ff945e
+ e8 28 94 ff e0
+ 36: cmp $0x1,%eax
+ 83 f8 01
+ 39: jne 0x0000000000000042
+ 75 07
+ 3b: mov $0xffff,%eax
+ b8 ff ff 00 00
+ 40: jmp 0x0000000000000044
+ eb 02
+ 42: xor %eax,%eax
+ 31 c0
+ 44: leaveq
+ c9
+ 45: retq
+ c3
+
+For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
+toolchain for developing and testing the kernel's JIT compiler.
+
+BPF kernel internals
+--------------------
+Internally, for the kernel interpreter, a different instruction set
+format with similar underlying principles from BPF described in previous
+paragraphs is being used. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that a better performance can be achieved (more details later). This new
+ISA is called eBPF. See the ../bpf/index.rst for details. (Note: eBPF which
+originates from [e]xtended BPF is not the same as BPF extensions! While
+eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
+of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
+
+The new instruction set was originally designed with the possible goal in
+mind to write programs in "restricted C" and compile into eBPF with a optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> eBPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the eBPF interpreter. For in-kernel handlers, this all works transparently
+by using bpf_prog_create() for setting up the filter, resp.
+bpf_prog_destroy() for destroying it. The function
+bpf_prog_run(filter, ctx) transparently invokes eBPF interpreter or JITed
+code to run the filter. 'filter' is a pointer to struct bpf_prog that we
+got from bpf_prog_create(), and 'ctx' the given context (e.g.
+skb pointer). All constraints and restrictions from bpf_check_classic() apply
+before a conversion to the new layout is being done behind the scenes!
+
+Currently, the classic BPF format is being used for JITing on most
+32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
+sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF
+instruction set.
+
+Testing
+-------
+
+Next to the BPF toolchain, the kernel also ships a test module that contains
+various test cases for classic and eBPF that can be executed against
+the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
+enabled via Kconfig::
+
+ CONFIG_TEST_BPF=m
+
+After the module has been built and installed, the test suite can be executed
+via insmod or modprobe against 'test_bpf' module. Results of the test cases
+including timings in nsec can be found in the kernel log (dmesg).
+
+Misc
+----
+
+Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
+SECCOMP-BPF kernel fuzzing.
+
+Written by
+----------
+
+The document was written in the hope that it is found useful and in order
+to give potential BPF hackers or security auditors a better overview of
+the underlying architecture.
+
+- Jay Schulist <jschlst@samba.org>
+- Daniel Borkmann <daniel@iogearbox.net>
+- Alexei Starovoitov <ast@kernel.org>
diff --git a/Documentation/networking/gen_stats.rst b/Documentation/networking/gen_stats.rst
new file mode 100644
index 0000000000..595a83b9a6
--- /dev/null
+++ b/Documentation/networking/gen_stats.rst
@@ -0,0 +1,129 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================
+Generic networking statistics for netlink users
+===============================================
+
+Statistic counters are grouped into structs:
+
+==================== ===================== =====================
+Struct TLV type Description
+==================== ===================== =====================
+gnet_stats_basic TCA_STATS_BASIC Basic statistics
+gnet_stats_rate_est TCA_STATS_RATE_EST Rate estimator
+gnet_stats_queue TCA_STATS_QUEUE Queue statistics
+none TCA_STATS_APP Application specific
+==================== ===================== =====================
+
+
+Collecting:
+-----------
+
+Declare the statistic structs you need::
+
+ struct mystruct {
+ struct gnet_stats_basic bstats;
+ struct gnet_stats_queue qstats;
+ ...
+ };
+
+Update statistics, in dequeue() methods only, (while owning qdisc->running)::
+
+ mystruct->tstats.packet++;
+ mystruct->qstats.backlog += skb->pkt_len;
+
+
+Export to userspace (Dump):
+---------------------------
+
+::
+
+ my_dumping_routine(struct sk_buff *skb, ...)
+ {
+ struct gnet_dump dump;
+
+ if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump,
+ TCA_PAD) < 0)
+ goto rtattr_failure;
+
+ if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 ||
+ gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 ||
+ gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0)
+ goto rtattr_failure;
+
+ if (gnet_stats_finish_copy(&dump) < 0)
+ goto rtattr_failure;
+ ...
+ }
+
+TCA_STATS/TCA_XSTATS backward compatibility:
+--------------------------------------------
+
+Prior users of struct tc_stats and xstats can maintain backward
+compatibility by calling the compat wrappers to keep providing the
+existing TLV types::
+
+ my_dumping_routine(struct sk_buff *skb, ...)
+ {
+ if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS,
+ TCA_XSTATS, &mystruct->lock, &dump,
+ TCA_PAD) < 0)
+ goto rtattr_failure;
+ ...
+ }
+
+A struct tc_stats will be filled out during gnet_stats_copy_* calls
+and appended to the skb. TCA_XSTATS is provided if gnet_stats_copy_app
+was called.
+
+
+Locking:
+--------
+
+Locks are taken before writing and released once all statistics have
+been written. Locks are always released in case of an error. You
+are responsible for making sure that the lock is initialized.
+
+
+Rate Estimator:
+---------------
+
+0) Prepare an estimator attribute. Most likely this would be in user
+ space. The value of this TLV should contain a tc_estimator structure.
+ As usual, such a TLV needs to be 32 bit aligned and therefore the
+ length needs to be appropriately set, etc. The estimator interval
+ and ewma log need to be converted to the appropriate values.
+ tc_estimator.c::tc_setup_estimator() is advisable to be used as the
+ conversion routine. It does a few clever things. It takes a time
+ interval in microsecs, a time constant also in microsecs and a struct
+ tc_estimator to be populated. The returned tc_estimator can be
+ transported to the kernel. Transfer such a structure in a TLV of type
+ TCA_RATE to your code in the kernel.
+
+In the kernel when setting up:
+
+1) make sure you have basic stats and rate stats setup first.
+2) make sure you have initialized stats lock that is used to setup such
+ stats.
+3) Now initialize a new estimator::
+
+ int ret = gen_new_estimator(my_basicstats,my_rate_est_stats,
+ mystats_lock, attr_with_tcestimator_struct);
+
+ if ret == 0
+ success
+ else
+ failed
+
+From now on, every time you dump my_rate_est_stats it will contain
+up-to-date info.
+
+Once you are done, call gen_kill_estimator(my_basicstats,
+my_rate_est_stats) Make sure that my_basicstats and my_rate_est_stats
+are still valid (i.e still exist) at the time of making this call.
+
+
+Authors:
+--------
+- Thomas Graf <tgraf@suug.ch>
+- Jamal Hadi Salim <hadi@cyberus.ca>
diff --git a/Documentation/networking/generic-hdlc.rst b/Documentation/networking/generic-hdlc.rst
new file mode 100644
index 0000000000..1c3bb5cb98
--- /dev/null
+++ b/Documentation/networking/generic-hdlc.rst
@@ -0,0 +1,170 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Generic HDLC layer
+==================
+
+Krzysztof Halasa <khc@pm.waw.pl>
+
+
+Generic HDLC layer currently supports:
+
+1. Frame Relay (ANSI, CCITT, Cisco and no LMI)
+
+ - Normal (routed) and Ethernet-bridged (Ethernet device emulation)
+ interfaces can share a single PVC.
+ - ARP support (no InARP support in the kernel - there is an
+ experimental InARP user-space daemon available on:
+ http://www.kernel.org/pub/linux/utils/net/hdlc/).
+
+2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation
+3. Cisco HDLC
+4. PPP
+5. X.25 (uses X.25 routines).
+
+Generic HDLC is a protocol driver only - it needs a low-level driver
+for your particular hardware.
+
+Ethernet device emulation (using HDLC or Frame-Relay PVC) is compatible
+with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging).
+
+
+Make sure the hdlc.o and the hardware driver are loaded. It should
+create a number of "hdlc" (hdlc0 etc) network devices, one for each
+WAN port. You'll need the "sethdlc" utility, get it from:
+
+ http://www.kernel.org/pub/linux/utils/net/hdlc/
+
+Compile sethdlc.c utility::
+
+ gcc -O2 -Wall -o sethdlc sethdlc.c
+
+Make sure you're using a correct version of sethdlc for your kernel.
+
+Use sethdlc to set physical interface, clock rate, HDLC mode used,
+and add any required PVCs if using Frame Relay.
+Usually you want something like::
+
+ sethdlc hdlc0 clock int rate 128000
+ sethdlc hdlc0 cisco interval 10 timeout 25
+
+or::
+
+ sethdlc hdlc0 rs232 clock ext
+ sethdlc hdlc0 fr lmi ansi
+ sethdlc hdlc0 create 99
+ ifconfig hdlc0 up
+ ifconfig pvc0 localIP pointopoint remoteIP
+
+In Frame Relay mode, ifconfig master hdlc device up (without assigning
+any IP address to it) before using pvc devices.
+
+
+Setting interface:
+
+* v35 | rs232 | x21 | t1 | e1
+ - sets physical interface for a given port
+ if the card has software-selectable interfaces
+ loopback
+ - activate hardware loopback (for testing only)
+* clock ext
+ - both RX clock and TX clock external
+* clock int
+ - both RX clock and TX clock internal
+* clock txint
+ - RX clock external, TX clock internal
+* clock txfromrx
+ - RX clock external, TX clock derived from RX clock
+* rate
+ - sets clock rate in bps (for "int" or "txint" clock only)
+
+
+Setting protocol:
+
+* hdlc - sets raw HDLC (IP-only) mode
+
+ nrz / nrzi / fm-mark / fm-space / manchester - sets transmission code
+
+ no-parity / crc16 / crc16-pr0 (CRC16 with preset zeros) / crc32-itu
+
+ crc16-itu (CRC16 with ITU-T polynomial) / crc16-itu-pr0 - sets parity
+
+* hdlc-eth - Ethernet device emulation using HDLC. Parity and encoding
+ as above.
+
+* cisco - sets Cisco HDLC mode (IP, IPv6 and IPX supported)
+
+ interval - time in seconds between keepalive packets
+
+ timeout - time in seconds after last received keepalive packet before
+ we assume the link is down
+
+* ppp - sets synchronous PPP mode
+
+* x25 - sets X.25 mode
+
+* fr - Frame Relay mode
+
+ lmi ansi / ccitt / cisco / none - LMI (link management) type
+
+ dce - Frame Relay DCE (network) side LMI instead of default DTE (user).
+
+ It has nothing to do with clocks!
+
+ - t391 - link integrity verification polling timer (in seconds) - user
+ - t392 - polling verification timer (in seconds) - network
+ - n391 - full status polling counter - user
+ - n392 - error threshold - both user and network
+ - n393 - monitored events count - both user and network
+
+Frame-Relay only:
+
+* create n | delete n - adds / deletes PVC interface with DLCI #n.
+ Newly created interface will be named pvc0, pvc1 etc.
+
+* create ether n | delete ether n - adds a device for Ethernet-bridged
+ frames. The device will be named pvceth0, pvceth1 etc.
+
+
+
+
+Board-specific issues
+---------------------
+
+n2.o and c101.o need parameters to work::
+
+ insmod n2 hw=io,irq,ram,ports[:io,irq,...]
+
+example::
+
+ insmod n2 hw=0x300,10,0xD0000,01
+
+or::
+
+ insmod c101 hw=irq,ram[:irq,...]
+
+example::
+
+ insmod c101 hw=9,0xdc000
+
+If built into the kernel, these drivers need kernel (command line) parameters::
+
+ n2.hw=io,irq,ram,ports:...
+
+or::
+
+ c101.hw=irq,ram:...
+
+
+
+If you have a problem with N2, C101 or PLX200SYN card, you can issue the
+"private" command to see port's packet descriptor rings (in kernel logs)::
+
+ sethdlc hdlc0 private
+
+The hardware driver has to be build with #define DEBUG_RINGS.
+Attaching this info to bug reports would be helpful. Anyway, let me know
+if you have problems using this.
+
+For patches and other info look at:
+<http://www.kernel.org/pub/linux/utils/net/hdlc/>.
diff --git a/Documentation/networking/generic_netlink.rst b/Documentation/networking/generic_netlink.rst
new file mode 100644
index 0000000000..d960dbd7e8
--- /dev/null
+++ b/Documentation/networking/generic_netlink.rst
@@ -0,0 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Generic Netlink
+===============
+
+A wiki document on how to use Generic Netlink can be found here:
+
+ * https://wiki.linuxfoundation.org/networking/generic_netlink_howto
diff --git a/Documentation/networking/gtp.rst b/Documentation/networking/gtp.rst
new file mode 100644
index 0000000000..9a7835cc14
--- /dev/null
+++ b/Documentation/networking/gtp.rst
@@ -0,0 +1,251 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+The Linux kernel GTP tunneling module
+=====================================
+
+Documentation by
+ Harald Welte <laforge@gnumonks.org> and
+ Andreas Schultz <aschultz@tpip.net>
+
+In 'drivers/net/gtp.c' you are finding a kernel-level implementation
+of a GTP tunnel endpoint.
+
+What is GTP
+===========
+
+GTP is the Generic Tunnel Protocol, which is a 3GPP protocol used for
+tunneling User-IP payload between a mobile station (phone, modem)
+and the interconnection between an external packet data network (such
+as the internet).
+
+So when you start a 'data connection' from your mobile phone, the
+phone will use the control plane to signal for the establishment of
+such a tunnel between that external data network and the phone. The
+tunnel endpoints thus reside on the phone and in the gateway. All
+intermediate nodes just transport the encapsulated packet.
+
+The phone itself does not implement GTP but uses some other
+technology-dependent protocol stack for transmitting the user IP
+payload, such as LLC/SNDCP/RLC/MAC.
+
+At some network element inside the cellular operator infrastructure
+(SGSN in case of GPRS/EGPRS or classic UMTS, hNodeB in case of a 3G
+femtocell, eNodeB in case of 4G/LTE), the cellular protocol stacking
+is translated into GTP *without breaking the end-to-end tunnel*. So
+intermediate nodes just perform some specific relay function.
+
+At some point the GTP packet ends up on the so-called GGSN (GSM/UMTS)
+or P-GW (LTE), which terminates the tunnel, decapsulates the packet
+and forwards it onto an external packet data network. This can be
+public internet, but can also be any private IP network (or even
+theoretically some non-IP network like X.25).
+
+You can find the protocol specification in 3GPP TS 29.060, available
+publicly via the 3GPP website at http://www.3gpp.org/DynaReport/29060.htm
+
+A direct PDF link to v13.6.0 is provided for convenience below:
+http://www.etsi.org/deliver/etsi_ts/129000_129099/129060/13.06.00_60/ts_129060v130600p.pdf
+
+The Linux GTP tunnelling module
+===============================
+
+The module implements the function of a tunnel endpoint, i.e. it is
+able to decapsulate tunneled IP packets in the uplink originated by
+the phone, and encapsulate raw IP packets received from the external
+packet network in downlink towards the phone.
+
+It *only* implements the so-called 'user plane', carrying the User-IP
+payload, called GTP-U. It does not implement the 'control plane',
+which is a signaling protocol used for establishment and teardown of
+GTP tunnels (GTP-C).
+
+So in order to have a working GGSN/P-GW setup, you will need a
+userspace program that implements the GTP-C protocol and which then
+uses the netlink interface provided by the GTP-U module in the kernel
+to configure the kernel module.
+
+This split architecture follows the tunneling modules of other
+protocols, e.g. PPPoE or L2TP, where you also run a userspace daemon
+to handle the tunnel establishment, authentication etc. and only the
+data plane is accelerated inside the kernel.
+
+Don't be confused by terminology: The GTP User Plane goes through
+kernel accelerated path, while the GTP Control Plane goes to
+Userspace :)
+
+The official homepage of the module is at
+https://osmocom.org/projects/linux-kernel-gtp-u/wiki
+
+Userspace Programs with Linux Kernel GTP-U support
+==================================================
+
+At the time of this writing, there are at least two Free Software
+implementations that implement GTP-C and can use the netlink interface
+to make use of the Linux kernel GTP-U support:
+
+* OpenGGSN (classic 2G/3G GGSN in C):
+ https://osmocom.org/projects/openggsn/wiki/OpenGGSN
+
+* ergw (GGSN + P-GW in Erlang):
+ https://github.com/travelping/ergw
+
+Userspace Library / Command Line Utilities
+==========================================
+
+There is a userspace library called 'libgtpnl' which is based on
+libmnl and which implements a C-language API towards the netlink
+interface provided by the Kernel GTP module:
+
+http://git.osmocom.org/libgtpnl/
+
+Protocol Versions
+=================
+
+There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1
+[3GPP TS 29.281]. Both are implemented in the Kernel GTP module.
+Version 0 is a legacy version, and deprecated from recent 3GPP
+specifications.
+
+GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2151
+for GTPv1-U and 3386 for GTPv0-U.
+
+There are three versions of GTP-C: v0, v1, and v2. As the kernel
+doesn't implement GTP-C, we don't have to worry about this. It's the
+responsibility of the control plane implementation in userspace to
+implement that.
+
+IPv6
+====
+
+The 3GPP specifications indicate either IPv4 or IPv6 can be used both
+on the inner (user) IP layer, or on the outer (transport) layer.
+
+Unfortunately, the Kernel module currently supports IPv6 neither for
+the User IP payload, nor for the outer IP layer. Patches or other
+Contributions to fix this are most welcome!
+
+Mailing List
+============
+
+If you have questions regarding how to use the Kernel GTP module from
+your own software, or want to contribute to the code, please use the
+osmocom-net-grps mailing list for related discussion. The list can be
+reached at osmocom-net-gprs@lists.osmocom.org and the mailman
+interface for managing your subscription is at
+https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs
+
+Issue Tracker
+=============
+
+The Osmocom project maintains an issue tracker for the Kernel GTP-U
+module at
+https://osmocom.org/projects/linux-kernel-gtp-u/issues
+
+History / Acknowledgements
+==========================
+
+The Module was originally created in 2012 by Harald Welte, but never
+completed. Pablo came in to finish the mess Harald left behind. But
+doe to a lack of user interest, it never got merged.
+
+In 2015, Andreas Schultz came to the rescue and fixed lots more bugs,
+extended it with new features and finally pushed all of us to get it
+mainline, where it was merged in 4.7.0.
+
+Architectural Details
+=====================
+
+Local GTP-U entity and tunnel identification
+--------------------------------------------
+
+GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152
+for GTPv1-U and 3386 for GTPv0-U.
+
+There is only one GTP-U entity (and therefore SGSN/GGSN/S-GW/PDN-GW
+instance) per IP address. Tunnel Endpoint Identifier (TEID) are unique
+per GTP-U entity.
+
+A specific tunnel is only defined by the destination entity. Since the
+destination port is constant, only the destination IP and TEID define
+a tunnel. The source IP and Port have no meaning for the tunnel.
+
+Therefore:
+
+ * when sending, the remote entity is defined by the remote IP and
+ the tunnel endpoint id. The source IP and port have no meaning and
+ can be changed at any time.
+
+ * when receiving the local entity is defined by the local
+ destination IP and the tunnel endpoint id. The source IP and port
+ have no meaning and can change at any time.
+
+[3GPP TS 29.281] Section 4.3.0 defines this so::
+
+ The TEID in the GTP-U header is used to de-multiplex traffic
+ incoming from remote tunnel endpoints so that it is delivered to the
+ User plane entities in a way that allows multiplexing of different
+ users, different packet protocols and different QoS levels.
+ Therefore no two remote GTP-U endpoints shall send traffic to a
+ GTP-U protocol entity using the same TEID value except
+ for data forwarding as part of mobility procedures.
+
+The definition above only defines that two remote GTP-U endpoints
+*should not* send to the same TEID, it *does not* forbid or exclude
+such a scenario. In fact, the mentioned mobility procedures make it
+necessary that the GTP-U entity accepts traffic for TEIDs from
+multiple or unknown peers.
+
+Therefore, the receiving side identifies tunnels exclusively based on
+TEIDs, not based on the source IP!
+
+APN vs. Network Device
+======================
+
+The GTP-U driver creates a Linux network device for each Gi/SGi
+interface.
+
+[3GPP TS 29.281] calls the Gi/SGi reference point an interface. This
+may lead to the impression that the GGSN/P-GW can have only one such
+interface.
+
+Correct is that the Gi/SGi reference point defines the interworking
+between +the 3GPP packet domain (PDN) based on GTP-U tunnel and IP
+based networks.
+
+There is no provision in any of the 3GPP documents that limits the
+number of Gi/SGi interfaces implemented by a GGSN/P-GW.
+
+[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a
+specific Gi/SGi interfaces is made through the Access Point Name
+(APN)::
+
+ 2. each private network manages its own addressing. In general this
+ will result in different private networks having overlapping
+ address ranges. A logically separate connection (e.g. an IP in IP
+ tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW
+ and each private network.
+
+ In this case the IP address alone is not necessarily unique. The
+ pair of values, Access Point Name (APN) and IPv4 address and/or
+ IPv6 prefixes, is unique.
+
+In order to support the overlapping address range use case, each APN
+is mapped to a separate Gi/SGi interface (network device).
+
+.. note::
+
+ The Access Point Name is purely a control plane (GTP-C) concept.
+ At the GTP-U level, only Tunnel Endpoint Identifiers are present in
+ GTP-U packets and network devices are known
+
+Therefore for a given UE the mapping in IP to PDN network is:
+
+ * network device + MS IP -> Peer IP + Peer TEID,
+
+and from PDN to IP network:
+
+ * local GTP-U IP + TEID -> network device
+
+Furthermore, before a received T-PDU is injected into the network
+device the MS IP is checked against the IP recorded in PDP context.
diff --git a/Documentation/networking/ieee802154.rst b/Documentation/networking/ieee802154.rst
new file mode 100644
index 0000000000..c652d383fe
--- /dev/null
+++ b/Documentation/networking/ieee802154.rst
@@ -0,0 +1,182 @@
+===============================
+IEEE 802.15.4 Developer's Guide
+===============================
+
+Introduction
+============
+The IEEE 802.15.4 working group focuses on standardization of the bottom
+two layers: Medium Access Control (MAC) and Physical access (PHY). And there
+are mainly two options available for upper layers:
+
+- ZigBee - proprietary protocol from the ZigBee Alliance
+- 6LoWPAN - IPv6 networking over low rate personal area networks
+
+The goal of the Linux-wpan is to provide a complete implementation
+of the IEEE 802.15.4 and 6LoWPAN protocols. IEEE 802.15.4 is a stack
+of protocols for organizing Low-Rate Wireless Personal Area Networks.
+
+The stack is composed of three main parts:
+
+- IEEE 802.15.4 layer; We have chosen to use plain Berkeley socket API,
+ the generic Linux networking stack to transfer IEEE 802.15.4 data
+ messages and a special protocol over netlink for configuration/management
+- MAC - provides access to shared channel and reliable data delivery
+- PHY - represents device drivers
+
+Socket API
+==========
+
+::
+
+ int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
+
+The address family, socket addresses etc. are defined in the
+include/net/af_ieee802154.h header or in the special header
+in the userspace package (see either https://linux-wpan.org/wpan-tools.html
+or the git tree at https://github.com/linux-wpan/wpan-tools).
+
+6LoWPAN Linux implementation
+============================
+
+The IEEE 802.15.4 standard specifies an MTU of 127 bytes, yielding about 80
+octets of actual MAC payload once security is turned on, on a wireless link
+with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format
+[RFC4944] was specified to carry IPv6 datagrams over such constrained links,
+taking into account limited bandwidth, memory, or energy resources that are
+expected in applications such as wireless Sensor Networks. [RFC4944] defines
+a Mesh Addressing header to support sub-IP forwarding, a Fragmentation header
+to support the IPv6 minimum MTU requirement [RFC2460], and stateless header
+compression for IPv6 datagrams (LOWPAN_HC1 and LOWPAN_HC2) to reduce the
+relatively large IPv6 and UDP headers down to (in the best case) several bytes.
+
+In September 2011 the standard update was published - [RFC6282].
+It deprecates HC1 and HC2 compression and defines IPHC encoding format which is
+used in this Linux implementation.
+
+All the code related to 6lowpan you may find in files: net/6lowpan/*
+and net/ieee802154/6lowpan/*
+
+To setup a 6LoWPAN interface you need:
+1. Add IEEE802.15.4 interface and set channel and PAN ID;
+2. Add 6lowpan interface by command like:
+# ip link add link wpan0 name lowpan0 type lowpan
+3. Bring up 'lowpan0' interface
+
+Drivers
+=======
+
+Like with WiFi, there are several types of devices implementing IEEE 802.15.4.
+1) 'HardMAC'. The MAC layer is implemented in the device itself, the device
+exports a management (e.g. MLME) and data API.
+2) 'SoftMAC' or just radio. These types of devices are just radio transceivers
+possibly with some kinds of acceleration like automatic CRC computation and
+comparison, automagic ACK handling, address matching, etc.
+
+Those types of devices require different approach to be hooked into Linux kernel.
+
+HardMAC
+-------
+
+See the header include/net/ieee802154_netdev.h. You have to implement Linux
+net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family
+code via plain sk_buffs. On skb reception skb->cb must contain additional
+info as described in the struct ieee802154_mac_cb. During packet transmission
+the skb->cb is used to provide additional data to device's header_ops->create
+function. Be aware that this data can be overridden later (when socket code
+submits skb to qdisc), so if you need something from that cb later, you should
+store info in the skb->data on your own.
+
+To hook the MLME interface you have to populate the ml_priv field of your
+net_device with a pointer to struct ieee802154_mlme_ops instance. The fields
+assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional.
+All other fields are required.
+
+SoftMAC
+-------
+
+The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it
+provides interface for drivers registration and management of slave interfaces.
+
+NOTE: Currently the only monitor device type is supported - it's IEEE 802.15.4
+stack interface for network sniffers (e.g. WireShark).
+
+This layer is going to be extended soon.
+
+See header include/net/mac802154.h and several drivers in
+drivers/net/ieee802154/.
+
+Fake drivers
+------------
+
+In addition there is a driver available which simulates a real device with
+SoftMAC (fakelb - IEEE 802.15.4 loopback driver) interface. This option
+provides a possibility to test and debug the stack without usage of real hardware.
+
+Device drivers API
+==================
+
+The include/net/mac802154.h defines following functions:
+
+.. c:function:: struct ieee802154_dev *ieee802154_alloc_device (size_t priv_size, struct ieee802154_ops *ops)
+
+Allocation of IEEE 802.15.4 compatible device.
+
+.. c:function:: void ieee802154_free_device(struct ieee802154_dev *dev)
+
+Freeing allocated device.
+
+.. c:function:: int ieee802154_register_device(struct ieee802154_dev *dev)
+
+Register PHY in the system.
+
+.. c:function:: void ieee802154_unregister_device(struct ieee802154_dev *dev)
+
+Freeing registered PHY.
+
+.. c:function:: void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, u8 lqi)
+
+Telling 802.15.4 module there is a new received frame in the skb with
+the RF Link Quality Indicator (LQI) from the hardware device.
+
+.. c:function:: void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, bool ifs_handling)
+
+Telling 802.15.4 module the frame in the skb is or going to be
+transmitted through the hardware device
+
+The device driver must implement the following callbacks in the IEEE 802.15.4
+operations structure at least::
+
+ struct ieee802154_ops {
+ ...
+ int (*start)(struct ieee802154_hw *hw);
+ void (*stop)(struct ieee802154_hw *hw);
+ ...
+ int (*xmit_async)(struct ieee802154_hw *hw, struct sk_buff *skb);
+ int (*ed)(struct ieee802154_hw *hw, u8 *level);
+ int (*set_channel)(struct ieee802154_hw *hw, u8 page, u8 channel);
+ ...
+ };
+
+.. c:function:: int start(struct ieee802154_hw *hw)
+
+Handler that 802.15.4 module calls for the hardware device initialization.
+
+.. c:function:: void stop(struct ieee802154_hw *hw)
+
+Handler that 802.15.4 module calls for the hardware device cleanup.
+
+.. c:function:: int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb)
+
+Handler that 802.15.4 module calls for each frame in the skb going to be
+transmitted through the hardware device.
+
+.. c:function:: int ed(struct ieee802154_hw *hw, u8 *level)
+
+Handler that 802.15.4 module calls for Energy Detection from the hardware
+device.
+
+.. c:function:: int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel)
+
+Set radio for listening on specific channel of the hardware device.
+
+Moreover IEEE 802.15.4 device operations structure should be filled.
diff --git a/Documentation/networking/ila.rst b/Documentation/networking/ila.rst
new file mode 100644
index 0000000000..5ac0a6270b
--- /dev/null
+++ b/Documentation/networking/ila.rst
@@ -0,0 +1,296 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Identifier Locator Addressing (ILA)
+===================================
+
+
+Introduction
+============
+
+Identifier-locator addressing (ILA) is a technique used with IPv6 that
+differentiates between location and identity of a network node. Part of an
+address expresses the immutable identity of the node, and another part
+indicates the location of the node which can be dynamic. Identifier-locator
+addressing can be used to efficiently implement overlay networks for
+network virtualization as well as solutions for use cases in mobility.
+
+ILA can be thought of as means to implement an overlay network without
+encapsulation. This is accomplished by performing network address
+translation on destination addresses as a packet traverses a network. To
+the network, an ILA translated packet appears to be no different than any
+other IPv6 packet. For instance, if the transport protocol is TCP then an
+ILA translated packet looks like just another TCP/IPv6 packet. The
+advantage of this is that ILA is transparent to the network so that
+optimizations in the network, such as ECMP, RSS, GRO, GSO, etc., just work.
+
+The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila.
+
+
+ILA terminology
+===============
+
+ - Identifier
+ A number that identifies an addressable node in the network
+ independent of its location. ILA identifiers are sixty-four
+ bit values.
+
+ - Locator
+ A network prefix that routes to a physical host. Locators
+ provide the topological location of an addressed node. ILA
+ locators are sixty-four bit prefixes.
+
+ - ILA mapping
+ A mapping of an ILA identifier to a locator (or to a
+ locator and meta data). An ILA domain maintains a database
+ that contains mappings for all destinations in the domain.
+
+ - SIR address
+ An IPv6 address composed of a SIR prefix (upper sixty-
+ four bits) and an identifier (lower sixty-four bits).
+ SIR addresses are visible to applications and provide a
+ means for them to address nodes independent of their
+ location.
+
+ - ILA address
+ An IPv6 address composed of a locator (upper sixty-four
+ bits) and an identifier (low order sixty-four bits). ILA
+ addresses are never visible to an application.
+
+ - ILA host
+ An end host that is capable of performing ILA translations
+ on transmit or receive.
+
+ - ILA router
+ A network node that performs ILA translation and forwarding
+ of translated packets.
+
+ - ILA forwarding cache
+ A type of ILA router that only maintains a working set
+ cache of mappings.
+
+ - ILA node
+ A network node capable of performing ILA translations. This
+ can be an ILA router, ILA forwarding cache, or ILA host.
+
+
+Operation
+=========
+
+There are two fundamental operations with ILA:
+
+ - Translate a SIR address to an ILA address. This is performed on ingress
+ to an ILA overlay.
+
+ - Translate an ILA address to a SIR address. This is performed on egress
+ from the ILA overlay.
+
+ILA can be deployed either on end hosts or intermediate devices in the
+network; these are provided by "ILA hosts" and "ILA routers" respectively.
+Configuration and datapath for these two points of deployment is somewhat
+different.
+
+The diagram below illustrates the flow of packets through ILA as well
+as showing ILA hosts and routers::
+
+ +--------+ +--------+
+ | Host A +-+ +--->| Host B |
+ | | | (2) ILA (') | |
+ +--------+ | ...addressed.... ( ) +--------+
+ V +---+--+ . packet . +---+--+ (_)
+ (1) SIR | | ILA |----->-------->---->| ILA | | (3) SIR
+ addressed +->|router| . . |router|->-+ addressed
+ packet +---+--+ . IPv6 . +---+--+ packet
+ / . Network .
+ / . . +--+-++--------+
+ +--------+ / . . |ILA || Host |
+ | Host +--+ . .- -|host|| |
+ | | . . +--+-++--------+
+ +--------+ ................
+
+
+Transport checksum handling
+===========================
+
+When an address is translated by ILA, an encapsulated transport checksum
+that includes the translated address in a pseudo header may be rendered
+incorrect on the wire. This is a problem for intermediate devices,
+including checksum offload in NICs, that process the checksum. There are
+three options to deal with this:
+
+- no action Allow the checksum to be incorrect on the wire. Before
+ a receiver verifies a checksum the ILA to SIR address
+ translation must be done.
+
+- adjust transport checksum
+ When ILA translation is performed the packet is parsed
+ and if a transport layer checksum is found then it is
+ adjusted to reflect the correct checksum per the
+ translated address.
+
+- checksum neutral mapping
+ When an address is translated the difference can be offset
+ elsewhere in a part of the packet that is covered by
+ the checksum. The low order sixteen bits of the identifier
+ are used. This method is preferred since it doesn't require
+ parsing a packet beyond the IP header and in most cases the
+ adjustment can be precomputed and saved with the mapping.
+
+Note that the checksum neutral adjustment affects the low order sixteen
+bits of the identifier. When ILA to SIR address translation is done on
+egress the low order bits are restored to the original value which
+restores the identifier as it was originally sent.
+
+
+Identifier types
+================
+
+ILA defines different types of identifiers for different use cases.
+
+The defined types are:
+
+ 0: interface identifier
+
+ 1: locally unique identifier
+
+ 2: virtual networking identifier for IPv4 address
+
+ 3: virtual networking identifier for IPv6 unicast address
+
+ 4: virtual networking identifier for IPv6 multicast address
+
+ 5: non-local address identifier
+
+In the current implementation of kernel ILA only locally unique identifiers
+(LUID) are supported. LUID allows for a generic, unformatted 64 bit
+identifier.
+
+
+Identifier formats
+==================
+
+Kernel ILA supports two optional fields in an identifier for formatting:
+"C-bit" and "identifier type". The presence of these fields is determined
+by configuration as demonstrated below.
+
+If the identifier type is present it occupies the three highest order
+bits of an identifier. The possible values are given in the above list.
+
+If the C-bit is present, this is used as an indication that checksum
+neutral mapping has been done. The C-bit can only be set in an
+ILA address, never a SIR address.
+
+In the simplest format the identifier types, C-bit, and checksum
+adjustment value are not present so an identifier is considered an
+unstructured sixty-four bit value::
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Identifier |
+ + +
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+The checksum neutral adjustment may be configured to always be
+present using neutral-map-auto. In this case there is no C-bit, but the
+checksum adjustment is in the low order 16 bits. The identifier is
+still sixty-four bits::
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Identifier |
+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | Checksum-neutral adjustment |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+The C-bit may used to explicitly indicate that checksum neutral
+mapping has been applied to an ILA address. The format is::
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |C| Identifier |
+ | +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | Checksum-neutral adjustment |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+The identifier type field may be present to indicate the identifier
+type. If it is not present then the type is inferred based on mapping
+configuration. The checksum neutral adjustment may automatically
+used with the identifier type as illustrated below::
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type| Identifier |
+ +-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | Checksum-neutral adjustment |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+If the identifier type and the C-bit can be present simultaneously so
+the identifier format would be::
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type|C| Identifier |
+ +-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | Checksum-neutral adjustment |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+Configuration
+=============
+
+There are two methods to configure ILA mappings. One is by using LWT routes
+and the other is ila_xlat (called from NFHOOK PREROUTING hook). ila_xlat
+is intended to be used in the receive path for ILA hosts .
+
+An ILA router has also been implemented in XDP. Description of that is
+outside the scope of this document.
+
+The usage of for ILA LWT routes is:
+
+ip route add DEST/128 encap ila LOC csum-mode MODE ident-type TYPE via ADDR
+
+Destination (DEST) can either be a SIR address (for an ILA host or ingress
+ILA router) or an ILA address (egress ILA router). LOC is the sixty-four
+bit locator (with format W:X:Y:Z) that overwrites the upper sixty-four
+bits of the destination address. Checksum MODE is one of "no-action",
+"adj-transport", "neutral-map", and "neutral-map-auto". If neutral-map is
+set then the C-bit will be present. Identifier TYPE one of "luid" or
+"use-format." In the case of use-format, the identifier type field is
+present and the effective type is taken from that.
+
+The usage of ila_xlat is:
+
+ip ila add loc_match MATCH loc LOC csum-mode MODE ident-type TYPE
+
+MATCH indicates the incoming locator that must be matched to apply
+a the translaiton. LOC is the locator that overwrites the upper
+sixty-four bits of the destination address. MODE and TYPE have the
+same meanings as described above.
+
+
+Some examples
+=============
+
+::
+
+ # Configure an ILA route that uses checksum neutral mapping as well
+ # as type field. Note that the type field is set in the SIR address
+ # (the 2000 implies type is 1 which is LUID).
+ ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \
+ csum-mode neutral-map ident-type use-format
+
+ # Configure an ILA LWT route that uses auto checksum neutral mapping
+ # (no C-bit) and configure identifier type to be LUID so that the
+ # identifier type field will not be present.
+ ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \
+ csum-mode neutral-map-auto ident-type luid
+
+ ila_xlat configuration
+
+ # Configure an ILA to SIR mapping that matches a locator and overwrites
+ # it with a SIR address (3333:0:0:1 in this example). The C-bit and
+ # identifier field are used.
+ ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \
+ csum-mode neutral-map-auto ident-type use-format
+
+ # Configure an ILA to SIR mapping where checksum neutral is automatically
+ # set without the C-bit and the identifier type is configured to be LUID
+ # so that the identifier type field is not present.
+ ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \
+ csum-mode neutral-map-auto ident-type use-format
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
new file mode 100644
index 0000000000..5b75c3f7a1
--- /dev/null
+++ b/Documentation/networking/index.rst
@@ -0,0 +1,132 @@
+Networking
+==========
+
+Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ af_xdp
+ bareudp
+ batman-adv
+ can
+ can_ucan_protocol
+ device_drivers/index
+ dsa/index
+ devlink/index
+ caif/index
+ ethtool-netlink
+ ieee802154
+ j1939
+ kapi
+ msg_zerocopy
+ failover
+ net_dim
+ net_failover
+ page_pool
+ phy
+ sfp-phylink
+ alias
+ bridge
+ snmp_counter
+ checksum-offloads
+ segmentation-offloads
+ scaling
+ tls
+ tls-offload
+ tls-handshake
+ nfc
+ 6lowpan
+ 6pack
+ arcnet-hardware
+ arcnet
+ atm
+ ax25
+ bonding
+ cdc_mbim
+ dccp
+ dctcp
+ dns_resolver
+ driver
+ eql
+ fib_trie
+ filter
+ generic-hdlc
+ generic_netlink
+ gen_stats
+ gtp
+ ila
+ ioam6-sysctl
+ ipddp
+ ip_dynaddr
+ ipsec
+ ip-sysctl
+ ipv6
+ ipvlan
+ ipvs-sysctl
+ kcm
+ l2tp
+ lapb-module
+ mac80211-injection
+ mctp
+ mpls-sysctl
+ mptcp-sysctl
+ multiqueue
+ napi
+ netconsole
+ netdev-features
+ netdevices
+ netfilter-sysctl
+ netif-msg
+ nexthop-group-resilient
+ nf_conntrack-sysctl
+ nf_flowtable
+ openvswitch
+ operstates
+ packet_mmap
+ phonet
+ pktgen
+ plip
+ ppp_generic
+ proc_net_tcp
+ radiotap-headers
+ rds
+ regulatory
+ representors
+ rxrpc
+ sctp
+ secid
+ seg6-sysctl
+ skbuff
+ smc-sysctl
+ statistics
+ strparser
+ switchdev
+ sysfs-tagging
+ tc-actions-env-rules
+ tc-queue-filters
+ tcp-thin
+ team
+ timestamping
+ tipc
+ tproxy
+ tuntap
+ udplite
+ vrf
+ vxlan
+ x25
+ x25-iface
+ xfrm_device
+ xfrm_proc
+ xfrm_sync
+ xfrm_sysctl
+ xdp-rx-metadata
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/ioam6-sysctl.rst b/Documentation/networking/ioam6-sysctl.rst
new file mode 100644
index 0000000000..c18cab2c48
--- /dev/null
+++ b/Documentation/networking/ioam6-sysctl.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+IOAM6 Sysfs variables
+=====================
+
+
+/proc/sys/net/conf/<iface>/ioam6_* variables:
+=============================================
+
+ioam6_enabled - BOOL
+ Accept (= enabled) or ignore (= disabled) IPv6 IOAM options on ingress
+ for this interface.
+
+ * 0 - disabled (default)
+ * 1 - enabled
+
+ioam6_id - SHORT INTEGER
+ Define the IOAM id of this interface.
+
+ Default is ~0.
+
+ioam6_id_wide - INTEGER
+ Define the wide IOAM id of this interface.
+
+ Default is ~0.
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
new file mode 100644
index 0000000000..a66054d076
--- /dev/null
+++ b/Documentation/networking/ip-sysctl.rst
@@ -0,0 +1,3208 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+IP Sysctl
+=========
+
+/proc/sys/net/ipv4/* Variables
+==============================
+
+ip_forward - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ Forward Packets between interfaces.
+
+ This variable is special, its change resets all configuration
+ parameters to their default state (RFC1122 for hosts, RFC1812
+ for routers)
+
+ip_default_ttl - INTEGER
+ Default value of TTL field (Time To Live) for outgoing (but not
+ forwarded) IP packets. Should be between 1 and 255 inclusive.
+ Default: 64 (as recommended by RFC1700)
+
+ip_no_pmtu_disc - INTEGER
+ Disable Path MTU Discovery. If enabled in mode 1 and a
+ fragmentation-required ICMP is received, the PMTU to this
+ destination will be set to the smallest of the old MTU to
+ this destination and min_pmtu (see below). You will need
+ to raise min_pmtu to the smallest interface MTU on your system
+ manually if you want to avoid locally generated fragments.
+
+ In mode 2 incoming Path MTU Discovery messages will be
+ discarded. Outgoing frames are handled the same as in mode 1,
+ implicitly setting IP_PMTUDISC_DONT on every created socket.
+
+ Mode 3 is a hardened pmtu discover mode. The kernel will only
+ accept fragmentation-needed errors if the underlying protocol
+ can verify them besides a plain socket lookup. Current
+ protocols for which pmtu events will be honored are TCP, SCTP
+ and DCCP as they verify e.g. the sequence number or the
+ association. This mode should not be enabled globally but is
+ only intended to secure e.g. name servers in namespaces where
+ TCP path mtu must still work but path MTU information of other
+ protocols should be discarded. If enabled globally this mode
+ could break other protocols.
+
+ Possible values: 0-3
+
+ Default: FALSE
+
+min_pmtu - INTEGER
+ default 552 - minimum Path MTU. Unless this is changed manually,
+ each cached pmtu will never be lower than this setting.
+
+ip_forward_use_pmtu - BOOLEAN
+ By default we don't trust protocol path MTUs while forwarding
+ because they could be easily forged and can lead to unwanted
+ fragmentation by the router.
+ You only need to enable this if you have user-space software
+ which tries to discover path mtus by itself and depends on the
+ kernel honoring this information. This is normally not the
+ case.
+
+ Default: 0 (disabled)
+
+ Possible values:
+
+ - 0 - disabled
+ - 1 - enabled
+
+fwmark_reflect - BOOLEAN
+ Controls the fwmark of kernel-generated IPv4 reply packets that are not
+ associated with a socket for example, TCP RSTs or ICMP echo replies).
+ If unset, these packets have a fwmark of zero. If set, they have the
+ fwmark of the packet they are replying to.
+
+ Default: 0
+
+fib_multipath_use_neigh - BOOLEAN
+ Use status of existing neighbor entry when determining nexthop for
+ multipath routes. If disabled, neighbor information is not used and
+ packets could be directed to a failed nexthop. Only valid for kernels
+ built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+
+ Default: 0 (disabled)
+
+ Possible values:
+
+ - 0 - disabled
+ - 1 - enabled
+
+fib_multipath_hash_policy - INTEGER
+ Controls which hash policy to use for multipath routes. Only valid
+ for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+
+ Default: 0 (Layer 3)
+
+ Possible values:
+
+ - 0 - Layer 3
+ - 1 - Layer 4
+ - 2 - Layer 3 or inner Layer 3 if present
+ - 3 - Custom multipath hash. Fields used for multipath hash calculation
+ are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+ When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+ fields used for multipath hash calculation are determined by this
+ sysctl.
+
+ This value is a bitmask which enables various fields for multipath hash
+ calculation.
+
+ Possible fields are:
+
+ ====== ============================
+ 0x0001 Source IP address
+ 0x0002 Destination IP address
+ 0x0004 IP protocol
+ 0x0008 Unused (Flow Label)
+ 0x0010 Source port
+ 0x0020 Destination port
+ 0x0040 Inner source IP address
+ 0x0080 Inner destination IP address
+ 0x0100 Inner IP protocol
+ 0x0200 Inner Flow Label
+ 0x0400 Inner source port
+ 0x0800 Inner destination port
+ ====== ============================
+
+ Default: 0x0007 (source IP, destination IP and IP protocol)
+
+fib_sync_mem - UNSIGNED INTEGER
+ Amount of dirty memory from fib entries that can be backlogged before
+ synchronize_rcu is forced.
+
+ Default: 512kB Minimum: 64kB Maximum: 64MB
+
+ip_forward_update_priority - INTEGER
+ Whether to update SKB priority from "TOS" field in IPv4 header after it
+ is forwarded. The new SKB priority is mapped from TOS field value
+ according to an rt_tos2priority table (see e.g. man tc-prio).
+
+ Default: 1 (Update priority.)
+
+ Possible values:
+
+ - 0 - Do not update priority.
+ - 1 - Update priority.
+
+route/max_size - INTEGER
+ Maximum number of routes allowed in the kernel. Increase
+ this when using large numbers of interfaces and/or routes.
+
+ From linux kernel 3.6 onwards, this is deprecated for ipv4
+ as route cache is no longer used.
+
+ From linux kernel 6.3 onwards, this is deprecated for ipv6
+ as garbage collection manages cached route entries.
+
+neigh/default/gc_thresh1 - INTEGER
+ Minimum number of entries to keep. Garbage collector will not
+ purge entries if there are fewer than this number.
+
+ Default: 128
+
+neigh/default/gc_thresh2 - INTEGER
+ Threshold when garbage collector becomes more aggressive about
+ purging entries. Entries older than 5 seconds will be cleared
+ when over this number.
+
+ Default: 512
+
+neigh/default/gc_thresh3 - INTEGER
+ Maximum number of non-PERMANENT neighbor entries allowed. Increase
+ this when using large numbers of interfaces and when communicating
+ with large numbers of directly-connected peers.
+
+ Default: 1024
+
+neigh/default/unres_qlen_bytes - INTEGER
+ The maximum number of bytes which may be used by packets
+ queued for each unresolved address by other network layers.
+ (added in linux 3.3)
+
+ Setting negative value is meaningless and will return error.
+
+ Default: SK_WMEM_MAX, (same as net.core.wmem_default).
+
+ Exact value depends on architecture and kernel options,
+ but should be enough to allow queuing 256 packets
+ of medium size.
+
+neigh/default/unres_qlen - INTEGER
+ The maximum number of packets which may be queued for each
+ unresolved address by other network layers.
+
+ (deprecated in linux 3.3) : use unres_qlen_bytes instead.
+
+ Prior to linux 3.3, the default value is 3 which may cause
+ unexpected packet loss. The current default value is calculated
+ according to default value of unres_qlen_bytes and true size of
+ packet.
+
+ Default: 101
+
+neigh/default/interval_probe_time_ms - INTEGER
+ The probe interval for neighbor entries with NTF_MANAGED flag,
+ the min value is 1.
+
+ Default: 5000
+
+mtu_expires - INTEGER
+ Time, in seconds, that cached PMTU information is kept.
+
+min_adv_mss - INTEGER
+ The advertised MSS depends on the first hop route MTU, but will
+ never be lower than this setting.
+
+fib_notify_on_flag_change - INTEGER
+ Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+ RTM_F_TRAP/RTM_F_OFFLOAD_FAILED flags are changed.
+
+ After installing a route to the kernel, user space receives an
+ acknowledgment, which means the route was installed in the kernel,
+ but not necessarily in hardware.
+ It is also possible for a route already installed in hardware to change
+ its action and therefore its flags. For example, a host route that is
+ trapping packets can be "promoted" to perform decapsulation following
+ the installation of an IPinIP/VXLAN tunnel.
+ The notifications will indicate to user-space the state of the route.
+
+ Default: 0 (Do not emit notifications.)
+
+ Possible values:
+
+ - 0 - Do not emit notifications.
+ - 1 - Emit notifications.
+ - 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change.
+
+IP Fragmentation:
+
+ipfrag_high_thresh - LONG INTEGER
+ Maximum memory used to reassemble IP fragments.
+
+ipfrag_low_thresh - LONG INTEGER
+ (Obsolete since linux-4.17)
+ Maximum memory used to reassemble IP fragments before the kernel
+ begins to remove incomplete fragment queues to free up resources.
+ The kernel still accepts new fragments for defragmentation.
+
+ipfrag_time - INTEGER
+ Time in seconds to keep an IP fragment in memory.
+
+ipfrag_max_dist - INTEGER
+ ipfrag_max_dist is a non-negative integer value which defines the
+ maximum "disorder" which is allowed among fragments which share a
+ common IP source address. Note that reordering of packets is
+ not unusual, but if a large number of fragments arrive from a source
+ IP address while a particular fragment queue remains incomplete, it
+ probably indicates that one or more fragments belonging to that queue
+ have been lost. When ipfrag_max_dist is positive, an additional check
+ is done on fragments before they are added to a reassembly queue - if
+ ipfrag_max_dist (or more) fragments have arrived from a particular IP
+ address between additions to any IP fragment queue using that source
+ address, it's presumed that one or more fragments in the queue are
+ lost. The existing fragment queue will be dropped, and a new one
+ started. An ipfrag_max_dist value of zero disables this check.
+
+ Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can
+ result in unnecessarily dropping fragment queues when normal
+ reordering of packets occurs, which could lead to poor application
+ performance. Using a very large value, e.g. 50000, increases the
+ likelihood of incorrectly reassembling IP fragments that originate
+ from different IP datagrams, which could result in data corruption.
+ Default: 64
+
+bc_forwarding - INTEGER
+ bc_forwarding enables the feature described in rfc1812#section-5.3.5.2
+ and rfc2644. It allows the router to forward directed broadcast.
+ To enable this feature, the 'all' entry and the input interface entry
+ should be set to 1.
+ Default: 0
+
+INET peer storage
+=================
+
+inet_peer_threshold - INTEGER
+ The approximate size of the storage. Starting from this threshold
+ entries will be thrown aggressively. This threshold also determines
+ entries' time-to-live and time intervals between garbage collection
+ passes. More entries, less time-to-live, less GC interval.
+
+inet_peer_minttl - INTEGER
+ Minimum time-to-live of entries. Should be enough to cover fragment
+ time-to-live on the reassembling side. This minimum time-to-live is
+ guaranteed if the pool size is less than inet_peer_threshold.
+ Measured in seconds.
+
+inet_peer_maxttl - INTEGER
+ Maximum time-to-live of entries. Unused entries will expire after
+ this period of time if there is no memory pressure on the pool (i.e.
+ when the number of entries in the pool is very small).
+ Measured in seconds.
+
+TCP variables
+=============
+
+somaxconn - INTEGER
+ Limit of socket listen() backlog, known in userspace as SOMAXCONN.
+ Defaults to 4096. (Was 128 before linux-5.4)
+ See also tcp_max_syn_backlog for additional tuning for TCP sockets.
+
+tcp_abort_on_overflow - BOOLEAN
+ If listening service is too slow to accept new connections,
+ reset them. Default state is FALSE. It means that if overflow
+ occurred due to a burst, connection will recover. Enable this
+ option _only_ if you are really sure that listening daemon
+ cannot be tuned to accept connections faster. Enabling this
+ option can harm clients of your server.
+
+tcp_adv_win_scale - INTEGER
+ Obsolete since linux-6.6
+ Count buffering overhead as bytes/2^tcp_adv_win_scale
+ (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
+ if it is <= 0.
+
+ Possible values are [-31, 31], inclusive.
+
+ Default: 1
+
+tcp_allowed_congestion_control - STRING
+ Show/set the congestion control choices available to non-privileged
+ processes. The list is a subset of those listed in
+ tcp_available_congestion_control.
+
+ Default is "reno" and the default setting (tcp_congestion_control).
+
+tcp_app_win - INTEGER
+ Reserve max(window/2^tcp_app_win, mss) of window for application
+ buffer. Value 0 is special, it means that nothing is reserved.
+
+ Possible values are [0, 31], inclusive.
+
+ Default: 31
+
+tcp_autocorking - BOOLEAN
+ Enable TCP auto corking :
+ When applications do consecutive small write()/sendmsg() system calls,
+ we try to coalesce these small writes as much as possible, to lower
+ total amount of sent packets. This is done if at least one prior
+ packet for the flow is waiting in Qdisc queues or device transmit
+ queue. Applications can still use TCP_CORK for optimal behavior
+ when they know how/when to uncork their sockets.
+
+ Default : 1
+
+tcp_available_congestion_control - STRING
+ Shows the available congestion control choices that are registered.
+ More congestion control algorithms may be available as modules,
+ but not loaded.
+
+tcp_base_mss - INTEGER
+ The initial value of search_low to be used by the packetization layer
+ Path MTU discovery (MTU probing). If MTU probing is enabled,
+ this is the initial MSS used by the connection.
+
+tcp_mtu_probe_floor - INTEGER
+ If MTU probing is enabled this caps the minimum MSS used for search_low
+ for the connection.
+
+ Default : 48
+
+tcp_min_snd_mss - INTEGER
+ TCP SYN and SYNACK messages usually advertise an ADVMSS option,
+ as described in RFC 1122 and RFC 6691.
+
+ If this ADVMSS option is smaller than tcp_min_snd_mss,
+ it is silently capped to tcp_min_snd_mss.
+
+ Default : 48 (at least 8 bytes of payload per segment)
+
+tcp_congestion_control - STRING
+ Set the congestion control algorithm to be used for new
+ connections. The algorithm "reno" is always available, but
+ additional choices may be available based on kernel configuration.
+ Default is set as part of kernel configuration.
+ For passive connections, the listener congestion control choice
+ is inherited.
+
+ [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ]
+
+tcp_dsack - BOOLEAN
+ Allows TCP to send "duplicate" SACKs.
+
+tcp_early_retrans - INTEGER
+ Tail loss probe (TLP) converts RTOs occurring due to tail
+ losses into fast recovery (draft-ietf-tcpm-rack). Note that
+ TLP requires RACK to function properly (see tcp_recovery below)
+
+ Possible values:
+
+ - 0 disables TLP
+ - 3 or 4 enables TLP
+
+ Default: 3
+
+tcp_ecn - INTEGER
+ Control use of Explicit Congestion Notification (ECN) by TCP.
+ ECN is used only when both ends of the TCP connection indicate
+ support for it. This feature is useful in avoiding losses due
+ to congestion by allowing supporting routers to signal
+ congestion before having to drop packets.
+
+ Possible values are:
+
+ = =====================================================
+ 0 Disable ECN. Neither initiate nor accept ECN.
+ 1 Enable ECN when requested by incoming connections and
+ also request ECN on outgoing connection attempts.
+ 2 Enable ECN when requested by incoming connections
+ but do not request ECN on outgoing connections.
+ = =====================================================
+
+ Default: 2
+
+tcp_ecn_fallback - BOOLEAN
+ If the kernel detects that ECN connection misbehaves, enable fall
+ back to non-ECN. Currently, this knob implements the fallback
+ from RFC3168, section 6.1.1.1., but we reserve that in future,
+ additional detection mechanisms could be implemented under this
+ knob. The value is not used, if tcp_ecn or per route (or congestion
+ control) ECN settings are disabled.
+
+ Default: 1 (fallback enabled)
+
+tcp_fack - BOOLEAN
+ This is a legacy option, it has no effect anymore.
+
+tcp_fin_timeout - INTEGER
+ The length of time an orphaned (no longer referenced by any
+ application) connection will remain in the FIN_WAIT_2 state
+ before it is aborted at the local end. While a perfectly
+ valid "receive only" state for an un-orphaned connection, an
+ orphaned connection in FIN_WAIT_2 state could otherwise wait
+ forever for the remote to close its end of the connection.
+
+ Cf. tcp_max_orphans
+
+ Default: 60 seconds
+
+tcp_frto - INTEGER
+ Enables Forward RTO-Recovery (F-RTO) defined in RFC5682.
+ F-RTO is an enhanced recovery algorithm for TCP retransmission
+ timeouts. It is particularly beneficial in networks where the
+ RTT fluctuates (e.g., wireless). F-RTO is sender-side only
+ modification. It does not require any support from the peer.
+
+ By default it's enabled with a non-zero value. 0 disables F-RTO.
+
+tcp_fwmark_accept - BOOLEAN
+ If set, incoming connections to listening sockets that do not have a
+ socket mark will set the mark of the accepting socket to the fwmark of
+ the incoming SYN packet. This will cause all packets on that connection
+ (starting from the first SYNACK) to be sent with that fwmark. The
+ listening socket's mark is unchanged. Listening sockets that already
+ have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are
+ unaffected.
+
+ Default: 0
+
+tcp_invalid_ratelimit - INTEGER
+ Limit the maximal rate for sending duplicate acknowledgments
+ in response to incoming TCP packets that are for an existing
+ connection but that are invalid due to any of these reasons:
+
+ (a) out-of-window sequence number,
+ (b) out-of-window acknowledgment number, or
+ (c) PAWS (Protection Against Wrapped Sequence numbers) check failure
+
+ This can help mitigate simple "ack loop" DoS attacks, wherein
+ a buggy or malicious middlebox or man-in-the-middle can
+ rewrite TCP header fields in manner that causes each endpoint
+ to think that the other is sending invalid TCP segments, thus
+ causing each side to send an unterminating stream of duplicate
+ acknowledgments for invalid segments.
+
+ Using 0 disables rate-limiting of dupacks in response to
+ invalid segments; otherwise this value specifies the minimal
+ space between sending such dupacks, in milliseconds.
+
+ Default: 500 (milliseconds).
+
+tcp_keepalive_time - INTEGER
+ How often TCP sends out keepalive messages when keepalive is enabled.
+ Default: 2hours.
+
+tcp_keepalive_probes - INTEGER
+ How many keepalive probes TCP sends out, until it decides that the
+ connection is broken. Default value: 9.
+
+tcp_keepalive_intvl - INTEGER
+ How frequently the probes are send out. Multiplied by
+ tcp_keepalive_probes it is time to kill not responding connection,
+ after probes started. Default value: 75sec i.e. connection
+ will be aborted after ~11 minutes of retries.
+
+tcp_l3mdev_accept - BOOLEAN
+ Enables child sockets to inherit the L3 master device index.
+ Enabling this option allows a "global" listen socket to work
+ across L3 master domains (e.g., VRFs) with connected sockets
+ derived from the listen socket to be bound to the L3 domain in
+ which the packets originated. Only valid when the kernel was
+ compiled with CONFIG_NET_L3_MASTER_DEV.
+
+ Default: 0 (disabled)
+
+tcp_low_latency - BOOLEAN
+ This is a legacy option, it has no effect anymore.
+
+tcp_max_orphans - INTEGER
+ Maximal number of TCP sockets not attached to any user file handle,
+ held by system. If this number is exceeded orphaned connections are
+ reset immediately and warning is printed. This limit exists
+ only to prevent simple DoS attacks, you _must_ not rely on this
+ or lower the limit artificially, but rather increase it
+ (probably, after increasing installed memory),
+ if network conditions require more than default value,
+ and tune network services to linger and kill such states
+ more aggressively. Let me to remind again: each orphan eats
+ up to ~64K of unswappable memory.
+
+tcp_max_syn_backlog - INTEGER
+ Maximal number of remembered connection requests (SYN_RECV),
+ which have not received an acknowledgment from connecting client.
+
+ This is a per-listener limit.
+
+ The minimal value is 128 for low memory machines, and it will
+ increase in proportion to the memory of machine.
+
+ If server suffers from overload, try increasing this number.
+
+ Remember to also check /proc/sys/net/core/somaxconn
+ A SYN_RECV request socket consumes about 304 bytes of memory.
+
+tcp_max_tw_buckets - INTEGER
+ Maximal number of timewait sockets held by system simultaneously.
+ If this number is exceeded time-wait socket is immediately destroyed
+ and warning is printed. This limit exists only to prevent
+ simple DoS attacks, you _must_ not lower the limit artificially,
+ but rather increase it (probably, after increasing installed memory),
+ if network conditions require more than default value.
+
+tcp_mem - vector of 3 INTEGERs: min, pressure, max
+ min: below this number of pages TCP is not bothered about its
+ memory appetite.
+
+ pressure: when amount of memory allocated by TCP exceeds this number
+ of pages, TCP moderates its memory consumption and enters memory
+ pressure mode, which is exited when memory consumption falls
+ under "min".
+
+ max: number of pages allowed for queueing by all TCP sockets.
+
+ Defaults are calculated at boot time from amount of available
+ memory.
+
+tcp_min_rtt_wlen - INTEGER
+ The window length of the windowed min filter to track the minimum RTT.
+ A shorter window lets a flow more quickly pick up new (higher)
+ minimum RTT when it is moved to a longer path (e.g., due to traffic
+ engineering). A longer window makes the filter more resistant to RTT
+ inflations such as transient congestion. The unit is seconds.
+
+ Possible values: 0 - 86400 (1 day)
+
+ Default: 300
+
+tcp_moderate_rcvbuf - BOOLEAN
+ If set, TCP performs receive buffer auto-tuning, attempting to
+ automatically size the buffer (no greater than tcp_rmem[2]) to
+ match the size required by the path for full throughput. Enabled by
+ default.
+
+tcp_mtu_probing - INTEGER
+ Controls TCP Packetization-Layer Path MTU Discovery. Takes three
+ values:
+
+ - 0 - Disabled
+ - 1 - Disabled by default, enabled when an ICMP black hole detected
+ - 2 - Always enabled, use initial MSS of tcp_base_mss.
+
+tcp_probe_interval - UNSIGNED INTEGER
+ Controls how often to start TCP Packetization-Layer Path MTU
+ Discovery reprobe. The default is reprobing every 10 minutes as
+ per RFC4821.
+
+tcp_probe_threshold - INTEGER
+ Controls when TCP Packetization-Layer Path MTU Discovery probing
+ will stop in respect to the width of search range in bytes. Default
+ is 8 bytes.
+
+tcp_no_metrics_save - BOOLEAN
+ By default, TCP saves various connection metrics in the route cache
+ when the connection closes, so that connections established in the
+ near future can use these to set initial conditions. Usually, this
+ increases overall performance, but may sometimes cause performance
+ degradation. If set, TCP will not cache metrics on closing
+ connections.
+
+tcp_no_ssthresh_metrics_save - BOOLEAN
+ Controls whether TCP saves ssthresh metrics in the route cache.
+
+ Default is 1, which disables ssthresh metrics.
+
+tcp_orphan_retries - INTEGER
+ This value influences the timeout of a locally closed TCP connection,
+ when RTO retransmissions remain unacknowledged.
+ See tcp_retries2 for more details.
+
+ The default value is 8.
+
+ If your machine is a loaded WEB server,
+ you should think about lowering this value, such sockets
+ may consume significant resources. Cf. tcp_max_orphans.
+
+tcp_recovery - INTEGER
+ This value is a bitmap to enable various experimental loss recovery
+ features.
+
+ ========= =============================================================
+ RACK: 0x1 enables the RACK loss detection for fast detection of lost
+ retransmissions and tail drops. It also subsumes and disables
+ RFC6675 recovery for SACK connections.
+
+ RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
+
+ RACK: 0x4 disables RACK's DUPACK threshold heuristic
+ ========= =============================================================
+
+ Default: 0x1
+
+tcp_reflect_tos - BOOLEAN
+ For listening sockets, reuse the DSCP value of the initial SYN message
+ for outgoing packets. This allows to have both directions of a TCP
+ stream to use the same DSCP value, assuming DSCP remains unchanged for
+ the lifetime of the connection.
+
+ This options affects both IPv4 and IPv6.
+
+ Default: 0 (disabled)
+
+tcp_reordering - INTEGER
+ Initial reordering level of packets in a TCP stream.
+ TCP stack can then dynamically adjust flow reordering level
+ between this initial value and tcp_max_reordering
+
+ Default: 3
+
+tcp_max_reordering - INTEGER
+ Maximal reordering level of packets in a TCP stream.
+ 300 is a fairly conservative value, but you might increase it
+ if paths are using per packet load balancing (like bonding rr mode)
+
+ Default: 300
+
+tcp_retrans_collapse - BOOLEAN
+ Bug-to-bug compatibility with some broken printers.
+ On retransmit try to send bigger packets to work around bugs in
+ certain TCP stacks.
+
+tcp_retries1 - INTEGER
+ This value influences the time, after which TCP decides, that
+ something is wrong due to unacknowledged RTO retransmissions,
+ and reports this suspicion to the network layer.
+ See tcp_retries2 for more details.
+
+ RFC 1122 recommends at least 3 retransmissions, which is the
+ default.
+
+tcp_retries2 - INTEGER
+ This value influences the timeout of an alive TCP connection,
+ when RTO retransmissions remain unacknowledged.
+ Given a value of N, a hypothetical TCP connection following
+ exponential backoff with an initial RTO of TCP_RTO_MIN would
+ retransmit N times before killing the connection at the (N+1)th RTO.
+
+ The default value of 15 yields a hypothetical timeout of 924.6
+ seconds and is a lower bound for the effective timeout.
+ TCP will effectively time out at the first RTO which exceeds the
+ hypothetical timeout.
+
+ RFC 1122 recommends at least 100 seconds for the timeout,
+ which corresponds to a value of at least 8.
+
+tcp_rfc1337 - BOOLEAN
+ If set, the TCP stack behaves conforming to RFC1337. If unset,
+ we are not conforming to RFC, but prevent TCP TIME_WAIT
+ assassination.
+
+ Default: 0
+
+tcp_rmem - vector of 3 INTEGERs: min, default, max
+ min: Minimal size of receive buffer used by TCP sockets.
+ It is guaranteed to each TCP socket, even under moderate memory
+ pressure.
+
+ Default: 4K
+
+ default: initial size of receive buffer used by TCP sockets.
+ This value overrides net.core.rmem_default used by other protocols.
+ Default: 131072 bytes.
+ This value results in initial window of 65535.
+
+ max: maximal size of receive buffer allowed for automatically
+ selected receiver buffers for TCP socket. This value does not override
+ net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
+ automatic tuning of that socket's receive buffer size, in which
+ case this value is ignored.
+ Default: between 131072 and 6MB, depending on RAM size.
+
+tcp_sack - BOOLEAN
+ Enable select acknowledgments (SACKS).
+
+tcp_comp_sack_delay_ns - LONG INTEGER
+ TCP tries to reduce number of SACK sent, using a timer
+ based on 5% of SRTT, capped by this sysctl, in nano seconds.
+ The default is 1ms, based on TSO autosizing period.
+
+ Default : 1,000,000 ns (1 ms)
+
+tcp_comp_sack_slack_ns - LONG INTEGER
+ This sysctl control the slack used when arming the
+ timer used by SACK compression. This gives extra time
+ for small RTT flows, and reduces system overhead by allowing
+ opportunistic reduction of timer interrupts.
+
+ Default : 100,000 ns (100 us)
+
+tcp_comp_sack_nr - INTEGER
+ Max number of SACK that can be compressed.
+ Using 0 disables SACK compression.
+
+ Default : 44
+
+tcp_slow_start_after_idle - BOOLEAN
+ If set, provide RFC2861 behavior and time out the congestion
+ window after an idle period. An idle period is defined at
+ the current RTO. If unset, the congestion window will not
+ be timed out after an idle period.
+
+ Default: 1
+
+tcp_stdurg - BOOLEAN
+ Use the Host requirements interpretation of the TCP urgent pointer field.
+ Most hosts use the older BSD interpretation, so if you turn this on
+ Linux might not communicate correctly with them.
+
+ Default: FALSE
+
+tcp_synack_retries - INTEGER
+ Number of times SYNACKs for a passive TCP connection attempt will
+ be retransmitted. Should not be higher than 255. Default value
+ is 5, which corresponds to 31seconds till the last retransmission
+ with the current initial RTO of 1second. With this the final timeout
+ for a passive TCP connection will happen after 63seconds.
+
+tcp_syncookies - INTEGER
+ Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
+ Send out syncookies when the syn backlog queue of a socket
+ overflows. This is to prevent against the common 'SYN flood attack'
+ Default: 1
+
+ Note, that syncookies is fallback facility.
+ It MUST NOT be used to help highly loaded servers to stand
+ against legal connection rate. If you see SYN flood warnings
+ in your logs, but investigation shows that they occur
+ because of overload with legal connections, you should tune
+ another parameters until this warning disappear.
+ See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.
+
+ syncookies seriously violate TCP protocol, do not allow
+ to use TCP extensions, can result in serious degradation
+ of some services (f.e. SMTP relaying), visible not by you,
+ but your clients and relays, contacting you. While you see
+ SYN flood warnings in logs not being really flooded, your server
+ is seriously misconfigured.
+
+ If you want to test which effects syncookies have to your
+ network connections you can set this knob to 2 to enable
+ unconditionally generation of syncookies.
+
+tcp_migrate_req - BOOLEAN
+ The incoming connection is tied to a specific listening socket when
+ the initial SYN packet is received during the three-way handshake.
+ When a listener is closed, in-flight request sockets during the
+ handshake and established sockets in the accept queue are aborted.
+
+ If the listener has SO_REUSEPORT enabled, other listeners on the
+ same port should have been able to accept such connections. This
+ option makes it possible to migrate such child sockets to another
+ listener after close() or shutdown().
+
+ The BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program should
+ usually be used to define the policy to pick an alive listener.
+ Otherwise, the kernel will randomly pick an alive listener only if
+ this option is enabled.
+
+ Note that migration between listeners with different settings may
+ crash applications. Let's say migration happens from listener A to
+ B, and only B has TCP_SAVE_SYN enabled. B cannot read SYN data from
+ the requests migrated from A. To avoid such a situation, cancel
+ migration by returning SK_DROP in the type of eBPF program, or
+ disable this option.
+
+ Default: 0
+
+tcp_fastopen - INTEGER
+ Enable TCP Fast Open (RFC7413) to send and accept data in the opening
+ SYN packet.
+
+ The client support is enabled by flag 0x1 (on by default). The client
+ then must use sendmsg() or sendto() with the MSG_FASTOPEN flag,
+ rather than connect() to send data in SYN.
+
+ The server support is enabled by flag 0x2 (off by default). Then
+ either enable for all listeners with another flag (0x400) or
+ enable individual listeners via TCP_FASTOPEN socket option with
+ the option value being the length of the syn-data backlog.
+
+ The values (bitmap) are
+
+ ===== ======== ======================================================
+ 0x1 (client) enables sending data in the opening SYN on the client.
+ 0x2 (server) enables the server support, i.e., allowing data in
+ a SYN packet to be accepted and passed to the
+ application before 3-way handshake finishes.
+ 0x4 (client) send data in the opening SYN regardless of cookie
+ availability and without a cookie option.
+ 0x200 (server) accept data-in-SYN w/o any cookie option present.
+ 0x400 (server) enable all listeners to support Fast Open by
+ default without explicit TCP_FASTOPEN socket option.
+ ===== ======== ======================================================
+
+ Default: 0x1
+
+ Note that additional client or server features are only
+ effective if the basic support (0x1 and 0x2) are enabled respectively.
+
+tcp_fastopen_blackhole_timeout_sec - INTEGER
+ Initial time period in second to disable Fastopen on active TCP sockets
+ when a TFO firewall blackhole issue happens.
+ This time period will grow exponentially when more blackhole issues
+ get detected right after Fastopen is re-enabled and will reset to
+ initial value when the blackhole issue goes away.
+ 0 to disable the blackhole detection.
+
+ By default, it is set to 0 (feature is disabled).
+
+tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
+ The list consists of a primary key and an optional backup key. The
+ primary key is used for both creating and validating cookies, while the
+ optional backup key is only used for validating cookies. The purpose of
+ the backup key is to maximize TFO validation when keys are rotated.
+
+ A randomly chosen primary key may be configured by the kernel if
+ the tcp_fastopen sysctl is set to 0x400 (see above), or if the
+ TCP_FASTOPEN setsockopt() optname is set and a key has not been
+ previously configured via sysctl. If keys are configured via
+ setsockopt() by using the TCP_FASTOPEN_KEY optname, then those
+ per-socket keys will be used instead of any keys that are specified via
+ sysctl.
+
+ A key is specified as 4 8-digit hexadecimal integers which are separated
+ by a '-' as: xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx. Leading zeros may be
+ omitted. A primary and a backup key may be specified by separating them
+ by a comma. If only one key is specified, it becomes the primary key and
+ any previously configured backup keys are removed.
+
+tcp_syn_retries - INTEGER
+ Number of times initial SYNs for an active TCP connection attempt
+ will be retransmitted. Should not be higher than 127. Default value
+ is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
+ till the last retransmission with the current initial RTO of 1second.
+ With this the final timeout for an active TCP connection attempt
+ will happen after 131seconds.
+
+tcp_timestamps - INTEGER
+ Enable timestamps as defined in RFC1323.
+
+ - 0: Disabled.
+ - 1: Enable timestamps as defined in RFC1323 and use random offset for
+ each connection rather than only using the current time.
+ - 2: Like 1, but without random offsets.
+
+ Default: 1
+
+tcp_min_tso_segs - INTEGER
+ Minimal number of segments per TSO frame.
+
+ Since linux-3.12, TCP does an automatic sizing of TSO frames,
+ depending on flow rate, instead of filling 64Kbytes packets.
+ For specific usages, it's possible to force TCP to build big
+ TSO frames. Note that TCP stack might split too big TSO packets
+ if available window is too small.
+
+ Default: 2
+
+tcp_tso_rtt_log - INTEGER
+ Adjustment of TSO packet sizes based on min_rtt
+
+ Starting from linux-5.18, TCP autosizing can be tweaked
+ for flows having small RTT.
+
+ Old autosizing was splitting the pacing budget to send 1024 TSO
+ per second.
+
+ tso_packet_size = sk->sk_pacing_rate / 1024;
+
+ With the new mechanism, we increase this TSO sizing using:
+
+ distance = min_rtt_usec / (2^tcp_tso_rtt_log)
+ tso_packet_size += gso_max_size >> distance;
+
+ This means that flows between very close hosts can use bigger
+ TSO packets, reducing their cpu costs.
+
+ If you want to use the old autosizing, set this sysctl to 0.
+
+ Default: 9 (2^9 = 512 usec)
+
+tcp_pacing_ss_ratio - INTEGER
+ sk->sk_pacing_rate is set by TCP stack using a ratio applied
+ to current rate. (current_rate = cwnd * mss / srtt)
+ If TCP is in slow start, tcp_pacing_ss_ratio is applied
+ to let TCP probe for bigger speeds, assuming cwnd can be
+ doubled every other RTT.
+
+ Default: 200
+
+tcp_pacing_ca_ratio - INTEGER
+ sk->sk_pacing_rate is set by TCP stack using a ratio applied
+ to current rate. (current_rate = cwnd * mss / srtt)
+ If TCP is in congestion avoidance phase, tcp_pacing_ca_ratio
+ is applied to conservatively probe for bigger throughput.
+
+ Default: 120
+
+tcp_syn_linear_timeouts - INTEGER
+ The number of times for an active TCP connection to retransmit SYNs with
+ a linear backoff timeout before defaulting to an exponential backoff
+ timeout. This has no effect on SYNACK at the passive TCP side.
+
+ With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
+ expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
+ and the first exponential backoff using 2^0 * initial_RTO).
+ Default: 4
+
+tcp_tso_win_divisor - INTEGER
+ This allows control over what percentage of the congestion window
+ can be consumed by a single TSO frame.
+ The setting of this parameter is a choice between burstiness and
+ building larger TSO frames.
+
+ Default: 3
+
+tcp_tw_reuse - INTEGER
+ Enable reuse of TIME-WAIT sockets for new connections when it is
+ safe from protocol viewpoint.
+
+ - 0 - disable
+ - 1 - global enable
+ - 2 - enable for loopback traffic only
+
+ It should not be changed without advice/request of technical
+ experts.
+
+ Default: 2
+
+tcp_window_scaling - BOOLEAN
+ Enable window scaling as defined in RFC1323.
+
+tcp_shrink_window - BOOLEAN
+ This changes how the TCP receive window is calculated.
+
+ RFC 7323, section 2.4, says there are instances when a retracted
+ window can be offered, and that TCP implementations MUST ensure
+ that they handle a shrinking window, as specified in RFC 1122.
+
+ - 0 - Disabled. The window is never shrunk.
+ - 1 - Enabled. The window is shrunk when necessary to remain within
+ the memory limit set by autotuning (sk_rcvbuf).
+ This only occurs if a non-zero receive window
+ scaling factor is also in effect.
+
+ Default: 0
+
+tcp_wmem - vector of 3 INTEGERs: min, default, max
+ min: Amount of memory reserved for send buffers for TCP sockets.
+ Each TCP socket has rights to use it due to fact of its birth.
+
+ Default: 4K
+
+ default: initial size of send buffer used by TCP sockets. This
+ value overrides net.core.wmem_default used by other protocols.
+
+ It is usually lower than net.core.wmem_default.
+
+ Default: 16K
+
+ max: Maximal amount of memory allowed for automatically tuned
+ send buffers for TCP sockets. This value does not override
+ net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
+ automatic tuning of that socket's send buffer size, in which case
+ this value is ignored.
+
+ Default: between 64K and 4MB, depending on RAM size.
+
+tcp_notsent_lowat - UNSIGNED INTEGER
+ A TCP socket can control the amount of unsent bytes in its write queue,
+ thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
+ reports POLLOUT events if the amount of unsent bytes is below a per
+ socket value, and if the write queue is not full. sendmsg() will
+ also not add new buffers if the limit is hit.
+
+ This global variable controls the amount of unsent data for
+ sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
+ to the global variable has immediate effect.
+
+ Default: UINT_MAX (0xFFFFFFFF)
+
+tcp_workaround_signed_windows - BOOLEAN
+ If set, assume no receipt of a window scaling option means the
+ remote TCP is broken and treats the window as a signed quantity.
+ If unset, assume the remote TCP is not broken even if we do
+ not receive a window scaling option from them.
+
+ Default: 0
+
+tcp_thin_linear_timeouts - BOOLEAN
+ Enable dynamic triggering of linear timeouts for thin streams.
+ If set, a check is performed upon retransmission by timeout to
+ determine if the stream is thin (less than 4 packets in flight).
+ As long as the stream is found to be thin, up to 6 linear
+ timeouts may be performed before exponential backoff mode is
+ initiated. This improves retransmission latency for
+ non-aggressive thin streams, often found to be time-dependent.
+ For more information on thin streams, see
+ Documentation/networking/tcp-thin.rst
+
+ Default: 0
+
+tcp_limit_output_bytes - INTEGER
+ Controls TCP Small Queue limit per tcp socket.
+ TCP bulk sender tends to increase packets in flight until it
+ gets losses notifications. With SNDBUF autotuning, this can
+ result in a large amount of packets queued on the local machine
+ (e.g.: qdiscs, CPU backlog, or device) hurting latency of other
+ flows, for typical pfifo_fast qdiscs. tcp_limit_output_bytes
+ limits the number of bytes on qdisc or device to reduce artificial
+ RTT/cwnd and reduce bufferbloat.
+
+ Default: 1048576 (16 * 65536)
+
+tcp_challenge_ack_limit - INTEGER
+ Limits number of Challenge ACK sent per second, as recommended
+ in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
+ Note that this per netns rate limit can allow some side channel
+ attacks and probably should not be enabled.
+ TCP stack implements per TCP socket limits anyway.
+ Default: INT_MAX (unlimited)
+
+tcp_ehash_entries - INTEGER
+ Show the number of hash buckets for TCP sockets in the current
+ networking namespace.
+
+ A negative value means the networking namespace does not own its
+ hash buckets and shares the initial networking namespace's one.
+
+tcp_child_ehash_entries - INTEGER
+ Control the number of hash buckets for TCP sockets in the child
+ networking namespace, which must be set before clone() or unshare().
+
+ If the value is not 0, the kernel uses a value rounded up to 2^n
+ as the actual hash bucket size. 0 is a special value, meaning
+ the child networking namespace will share the initial networking
+ namespace's hash buckets.
+
+ Note that the child will use the global one in case the kernel
+ fails to allocate enough memory. In addition, the global hash
+ buckets are spread over available NUMA nodes, but the allocation
+ of the child hash table depends on the current process's NUMA
+ policy, which could result in performance differences.
+
+ Note also that the default value of tcp_max_tw_buckets and
+ tcp_max_syn_backlog depend on the hash bucket size.
+
+ Possible values: 0, 2^n (n: 0 - 24 (16Mi))
+
+ Default: 0
+
+tcp_plb_enabled - BOOLEAN
+ If set and the underlying congestion control (e.g. DCTCP) supports
+ and enables PLB feature, TCP PLB (Protective Load Balancing) is
+ enabled. PLB is described in the following paper:
+ https://doi.org/10.1145/3544216.3544226. Based on PLB parameters,
+ upon sensing sustained congestion, TCP triggers a change in
+ flow label field for outgoing IPv6 packets. A change in flow label
+ field potentially changes the path of outgoing packets for switches
+ that use ECMP/WCMP for routing.
+
+ PLB changes socket txhash which results in a change in IPv6 Flow Label
+ field, and currently no-op for IPv4 headers. It is possible
+ to apply PLB for IPv4 with other network header fields (e.g. TCP
+ or IPv4 options) or using encapsulation where outer header is used
+ by switches to determine next hop. In either case, further host
+ and switch side changes will be needed.
+
+ When set, PLB assumes that congestion signal (e.g. ECN) is made
+ available and used by congestion control module to estimate a
+ congestion measure (e.g. ce_ratio). PLB needs a congestion measure to
+ make repathing decisions.
+
+ Default: FALSE
+
+tcp_plb_idle_rehash_rounds - INTEGER
+ Number of consecutive congested rounds (RTT) seen after which
+ a rehash can be performed, given there are no packets in flight.
+ This is referred to as M in PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ Possible Values: 0 - 31
+
+ Default: 3
+
+tcp_plb_rehash_rounds - INTEGER
+ Number of consecutive congested rounds (RTT) seen after which
+ a forced rehash can be performed. Be careful when setting this
+ parameter, as a small value increases the risk of retransmissions.
+ This is referred to as N in PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ Possible Values: 0 - 31
+
+ Default: 12
+
+tcp_plb_suspend_rto_sec - INTEGER
+ Time, in seconds, to suspend PLB in event of an RTO. In order to avoid
+ having PLB repath onto a connectivity "black hole", after an RTO a TCP
+ connection suspends PLB repathing for a random duration between 1x and
+ 2x of this parameter. Randomness is added to avoid concurrent rehashing
+ of multiple TCP connections. This should be set corresponding to the
+ amount of time it takes to repair a failed link.
+
+ Possible Values: 0 - 255
+
+ Default: 60
+
+tcp_plb_cong_thresh - INTEGER
+ Fraction of packets marked with congestion over a round (RTT) to
+ tag that round as congested. This is referred to as K in the PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ The 0-1 fraction range is mapped to 0-256 range to avoid floating
+ point operations. For example, 128 means that if at least 50% of
+ the packets in a round were marked as congested then the round
+ will be tagged as congested.
+
+ Setting threshold to 0 means that PLB repaths every RTT regardless
+ of congestion. This is not intended behavior for PLB and should be
+ used only for experimentation purpose.
+
+ Possible Values: 0 - 256
+
+ Default: 128
+
+UDP variables
+=============
+
+udp_l3mdev_accept - BOOLEAN
+ Enabling this option allows a "global" bound socket to work
+ across L3 master domains (e.g., VRFs) with packets capable of
+ being received regardless of the L3 domain in which they
+ originated. Only valid when the kernel was compiled with
+ CONFIG_NET_L3_MASTER_DEV.
+
+ Default: 0 (disabled)
+
+udp_mem - vector of 3 INTEGERs: min, pressure, max
+ Number of pages allowed for queueing by all UDP sockets.
+
+ min: Number of pages allowed for queueing by all UDP sockets.
+
+ pressure: This value was introduced to follow format of tcp_mem.
+
+ max: This value was introduced to follow format of tcp_mem.
+
+ Default is calculated at boot time from amount of available memory.
+
+udp_rmem_min - INTEGER
+ Minimal size of receive buffer used by UDP sockets in moderation.
+ Each UDP socket is able to use the size for receiving data, even if
+ total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
+
+ Default: 4K
+
+udp_wmem_min - INTEGER
+ UDP does not have tx memory accounting and this tunable has no effect.
+
+udp_hash_entries - INTEGER
+ Show the number of hash buckets for UDP sockets in the current
+ networking namespace.
+
+ A negative value means the networking namespace does not own its
+ hash buckets and shares the initial networking namespace's one.
+
+udp_child_ehash_entries - INTEGER
+ Control the number of hash buckets for UDP sockets in the child
+ networking namespace, which must be set before clone() or unshare().
+
+ If the value is not 0, the kernel uses a value rounded up to 2^n
+ as the actual hash bucket size. 0 is a special value, meaning
+ the child networking namespace will share the initial networking
+ namespace's hash buckets.
+
+ Note that the child will use the global one in case the kernel
+ fails to allocate enough memory. In addition, the global hash
+ buckets are spread over available NUMA nodes, but the allocation
+ of the child hash table depends on the current process's NUMA
+ policy, which could result in performance differences.
+
+ Possible values: 0, 2^n (n: 7 (128) - 16 (64K))
+
+ Default: 0
+
+
+RAW variables
+=============
+
+raw_l3mdev_accept - BOOLEAN
+ Enabling this option allows a "global" bound socket to work
+ across L3 master domains (e.g., VRFs) with packets capable of
+ being received regardless of the L3 domain in which they
+ originated. Only valid when the kernel was compiled with
+ CONFIG_NET_L3_MASTER_DEV.
+
+ Default: 1 (enabled)
+
+CIPSOv4 Variables
+=================
+
+cipso_cache_enable - BOOLEAN
+ If set, enable additions to and lookups from the CIPSO label mapping
+ cache. If unset, additions are ignored and lookups always result in a
+ miss. However, regardless of the setting the cache is still
+ invalidated when required when means you can safely toggle this on and
+ off and the cache will always be "safe".
+
+ Default: 1
+
+cipso_cache_bucket_size - INTEGER
+ The CIPSO label cache consists of a fixed size hash table with each
+ hash bucket containing a number of cache entries. This variable limits
+ the number of entries in each hash bucket; the larger the value is, the
+ more CIPSO label mappings that can be cached. When the number of
+ entries in a given hash bucket reaches this limit adding new entries
+ causes the oldest entry in the bucket to be removed to make room.
+
+ Default: 10
+
+cipso_rbm_optfmt - BOOLEAN
+ Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of
+ the CIPSO draft specification (see Documentation/netlabel for details).
+ This means that when set the CIPSO tag will be padded with empty
+ categories in order to make the packet data 32-bit aligned.
+
+ Default: 0
+
+cipso_rbm_structvalid - BOOLEAN
+ If set, do a very strict check of the CIPSO option when
+ ip_options_compile() is called. If unset, relax the checks done during
+ ip_options_compile(). Either way is "safe" as errors are caught else
+ where in the CIPSO processing code but setting this to 0 (False) should
+ result in less work (i.e. it should be faster) but could cause problems
+ with other implementations that require strict checking.
+
+ Default: 0
+
+IP Variables
+============
+
+ip_local_port_range - 2 INTEGERS
+ Defines the local port range that is used by TCP and UDP to
+ choose the local port. The first number is the first, the
+ second the last local port number.
+ If possible, it is better these numbers have different parity
+ (one even and one odd value).
+ Must be greater than or equal to ip_unprivileged_port_start.
+ The default values are 32768 and 60999 respectively.
+
+ip_local_reserved_ports - list of comma separated ranges
+ Specify the ports which are reserved for known third-party
+ applications. These ports will not be used by automatic port
+ assignments (e.g. when calling connect() or bind() with port
+ number 0). Explicit port allocation behavior is unchanged.
+
+ The format used for both input and output is a comma separated
+ list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
+ 10). Writing to the file will clear all previously reserved
+ ports and update the current list with the one given in the
+ input.
+
+ Note that ip_local_port_range and ip_local_reserved_ports
+ settings are independent and both are considered by the kernel
+ when determining which ports are available for automatic port
+ assignments.
+
+ You can reserve ports which are not in the current
+ ip_local_port_range, e.g.::
+
+ $ cat /proc/sys/net/ipv4/ip_local_port_range
+ 32000 60999
+ $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
+ 8080,9148
+
+ although this is redundant. However such a setting is useful
+ if later the port range is changed to a value that will
+ include the reserved ports. Also keep in mind, that overlapping
+ of these ranges may affect probability of selecting ephemeral
+ ports which are right after block of reserved ports.
+
+ Default: Empty
+
+ip_unprivileged_port_start - INTEGER
+ This is a per-namespace sysctl. It defines the first
+ unprivileged port in the network namespace. Privileged ports
+ require root or CAP_NET_BIND_SERVICE in order to bind to them.
+ To disable all privileged ports, set this to 0. They must not
+ overlap with the ip_local_port_range.
+
+ Default: 1024
+
+ip_nonlocal_bind - BOOLEAN
+ If set, allows processes to bind() to non-local IP addresses,
+ which can be quite useful - but may break some applications.
+
+ Default: 0
+
+ip_autobind_reuse - BOOLEAN
+ By default, bind() does not select the ports automatically even if
+ the new socket and all sockets bound to the port have SO_REUSEADDR.
+ ip_autobind_reuse allows bind() to reuse the port and this is useful
+ when you use bind()+connect(), but may break some applications.
+ The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this
+ option should only be set by experts.
+ Default: 0
+
+ip_dynaddr - INTEGER
+ If set non-zero, enables support for dynamic addresses.
+ If set to a non-zero value larger than 1, a kernel log
+ message will be printed when dynamic address rewriting
+ occurs.
+
+ Default: 0
+
+ip_early_demux - BOOLEAN
+ Optimize input packet processing down to one demux for
+ certain kinds of local sockets. Currently we only do this
+ for established TCP and connected UDP sockets.
+
+ It may add an additional cost for pure routing workloads that
+ reduces overall throughput, in such case you should disable it.
+
+ Default: 1
+
+ping_group_range - 2 INTEGERS
+ Restrict ICMP_PROTO datagram sockets to users in the group range.
+ The default is "1 0", meaning, that nobody (not even root) may
+ create ping sockets. Setting it to "100 100" would grant permissions
+ to the single group. "0 4294967294" would enable it for the world, "100
+ 4294967294" would enable it for the users, but not daemons.
+
+tcp_early_demux - BOOLEAN
+ Enable early demux for established TCP sockets.
+
+ Default: 1
+
+udp_early_demux - BOOLEAN
+ Enable early demux for connected UDP sockets. Disable this if
+ your system could experience more unconnected load.
+
+ Default: 1
+
+icmp_echo_ignore_all - BOOLEAN
+ If set non-zero, then the kernel will ignore all ICMP ECHO
+ requests sent to it.
+
+ Default: 0
+
+icmp_echo_enable_probe - BOOLEAN
+ If set to one, then the kernel will respond to RFC 8335 PROBE
+ requests sent to it.
+
+ Default: 0
+
+icmp_echo_ignore_broadcasts - BOOLEAN
+ If set non-zero, then the kernel will ignore all ICMP ECHO and
+ TIMESTAMP requests sent to it via broadcast/multicast.
+
+ Default: 1
+
+icmp_ratelimit - INTEGER
+ Limit the maximal rates for sending ICMP packets whose type matches
+ icmp_ratemask (see below) to specific targets.
+ 0 to disable any limiting,
+ otherwise the minimal space between responses in milliseconds.
+ Note that another sysctl, icmp_msgs_per_sec limits the number
+ of ICMP packets sent on all targets.
+
+ Default: 1000
+
+icmp_msgs_per_sec - INTEGER
+ Limit maximal number of ICMP packets sent per second from this host.
+ Only messages whose type matches icmp_ratemask (see below) are
+ controlled by this limit. For security reasons, the precise count
+ of messages per second is randomized.
+
+ Default: 1000
+
+icmp_msgs_burst - INTEGER
+ icmp_msgs_per_sec controls number of ICMP packets sent per second,
+ while icmp_msgs_burst controls the burst size of these packets.
+ For security reasons, the precise burst size is randomized.
+
+ Default: 50
+
+icmp_ratemask - INTEGER
+ Mask made of ICMP types for which rates are being limited.
+
+ Significant bits: IHGFEDCBA9876543210
+
+ Default mask: 0000001100000011000 (6168)
+
+ Bit definitions (see include/linux/icmp.h):
+
+ = =========================
+ 0 Echo Reply
+ 3 Destination Unreachable [1]_
+ 4 Source Quench [1]_
+ 5 Redirect
+ 8 Echo Request
+ B Time Exceeded [1]_
+ C Parameter Problem [1]_
+ D Timestamp Request
+ E Timestamp Reply
+ F Info Request
+ G Info Reply
+ H Address Mask Request
+ I Address Mask Reply
+ = =========================
+
+ .. [1] These are rate limited by default (see default mask above)
+
+icmp_ignore_bogus_error_responses - BOOLEAN
+ Some routers violate RFC1122 by sending bogus responses to broadcast
+ frames. Such violations are normally logged via a kernel warning.
+ If this is set to TRUE, the kernel will not give such warnings, which
+ will avoid log file clutter.
+
+ Default: 1
+
+icmp_errors_use_inbound_ifaddr - BOOLEAN
+
+ If zero, icmp error messages are sent with the primary address of
+ the exiting interface.
+
+ If non-zero, the message will be sent with the primary address of
+ the interface that received the packet that caused the icmp error.
+ This is the behaviour many network administrators will expect from
+ a router. And it can make debugging complicated network layouts
+ much easier.
+
+ Note that if no primary address exists for the interface selected,
+ then the primary address of the first non-loopback interface that
+ has one will be used regardless of this setting.
+
+ Default: 0
+
+igmp_max_memberships - INTEGER
+ Change the maximum number of multicast groups we can subscribe to.
+ Default: 20
+
+ Theoretical maximum value is bounded by having to send a membership
+ report in a single datagram (i.e. the report can't span multiple
+ datagrams, or risk confusing the switch and leaving groups you don't
+ intend to).
+
+ The number of supported groups 'M' is bounded by the number of group
+ report entries you can fit into a single datagram of 65535 bytes.
+
+ M = 65536-sizeof (ip header)/(sizeof(Group record))
+
+ Group records are variable length, with a minimum of 12 bytes.
+ So net.ipv4.igmp_max_memberships should not be set higher than:
+
+ (65536-24) / 12 = 5459
+
+ The value 5459 assumes no IP header options, so in practice
+ this number may be lower.
+
+igmp_max_msf - INTEGER
+ Maximum number of addresses allowed in the source filter list for a
+ multicast group.
+
+ Default: 10
+
+igmp_qrv - INTEGER
+ Controls the IGMP query robustness variable (see RFC2236 8.1).
+
+ Default: 2 (as specified by RFC2236 8.1)
+
+ Minimum: 1 (as specified by RFC6636 4.5)
+
+force_igmp_version - INTEGER
+ - 0 - (default) No enforcement of a IGMP version, IGMPv1/v2 fallback
+ allowed. Will back to IGMPv3 mode again if all IGMPv1/v2 Querier
+ Present timer expires.
+ - 1 - Enforce to use IGMP version 1. Will also reply IGMPv1 report if
+ receive IGMPv2/v3 query.
+ - 2 - Enforce to use IGMP version 2. Will fallback to IGMPv1 if receive
+ IGMPv1 query message. Will reply report if receive IGMPv3 query.
+ - 3 - Enforce to use IGMP version 3. The same react with default 0.
+
+ .. note::
+
+ this is not the same with force_mld_version because IGMPv3 RFC3376
+ Security Considerations does not have clear description that we could
+ ignore other version messages completely as MLDv2 RFC3810. So make
+ this value as default 0 is recommended.
+
+``conf/interface/*``
+ changes special settings per interface (where
+ interface" is the name of your network interface)
+
+``conf/all/*``
+ is special, changes the settings for all interfaces
+
+log_martians - BOOLEAN
+ Log packets with impossible addresses to kernel log.
+ log_martians for the interface will be enabled if at least one of
+ conf/{all,interface}/log_martians is set to TRUE,
+ it will be disabled otherwise
+
+accept_redirects - BOOLEAN
+ Accept ICMP redirect messages.
+ accept_redirects for the interface will be enabled if:
+
+ - both conf/{all,interface}/accept_redirects are TRUE in the case
+ forwarding for the interface is enabled
+
+ or
+
+ - at least one of conf/{all,interface}/accept_redirects is TRUE in the
+ case forwarding for the interface is disabled
+
+ accept_redirects for the interface will be disabled otherwise
+
+ default:
+
+ - TRUE (host)
+ - FALSE (router)
+
+forwarding - BOOLEAN
+ Enable IP forwarding on this interface. This controls whether packets
+ received _on_ this interface can be forwarded.
+
+mc_forwarding - BOOLEAN
+ Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE
+ and a multicast routing daemon is required.
+ conf/all/mc_forwarding must also be set to TRUE to enable multicast
+ routing for the interface
+
+medium_id - INTEGER
+ Integer value used to differentiate the devices by the medium they
+ are attached to. Two devices can have different id values when
+ the broadcast packets are received only on one of them.
+ The default value 0 means that the device is the only interface
+ to its medium, value of -1 means that medium is not known.
+
+ Currently, it is used to change the proxy_arp behavior:
+ the proxy_arp feature is enabled for packets forwarded between
+ two devices attached to different media.
+
+proxy_arp - BOOLEAN
+ Do proxy arp.
+
+ proxy_arp for the interface will be enabled if at least one of
+ conf/{all,interface}/proxy_arp is set to TRUE,
+ it will be disabled otherwise
+
+proxy_arp_pvlan - BOOLEAN
+ Private VLAN proxy arp.
+
+ Basically allow proxy arp replies back to the same interface
+ (from which the ARP request/solicitation was received).
+
+ This is done to support (ethernet) switch features, like RFC
+ 3069, where the individual ports are NOT allowed to
+ communicate with each other, but they are allowed to talk to
+ the upstream router. As described in RFC 3069, it is possible
+ to allow these hosts to communicate through the upstream
+ router by proxy_arp'ing. Don't need to be used together with
+ proxy_arp.
+
+ This technology is known by different names:
+
+ In RFC 3069 it is called VLAN Aggregation.
+ Cisco and Allied Telesyn call it Private VLAN.
+ Hewlett-Packard call it Source-Port filtering or port-isolation.
+ Ericsson call it MAC-Forced Forwarding (RFC Draft).
+
+proxy_delay - INTEGER
+ Delay proxy response.
+
+ Delay response to a neighbor solicitation when proxy_arp
+ or proxy_ndp is enabled. A random value between [0, proxy_delay)
+ will be chosen, setting to zero means reply with no delay.
+ Value in jiffies. Defaults to 80.
+
+shared_media - BOOLEAN
+ Send(router) or accept(host) RFC1620 shared media redirects.
+ Overrides secure_redirects.
+
+ shared_media for the interface will be enabled if at least one of
+ conf/{all,interface}/shared_media is set to TRUE,
+ it will be disabled otherwise
+
+ default TRUE
+
+secure_redirects - BOOLEAN
+ Accept ICMP redirect messages only to gateways listed in the
+ interface's current gateway list. Even if disabled, RFC1122 redirect
+ rules still apply.
+
+ Overridden by shared_media.
+
+ secure_redirects for the interface will be enabled if at least one of
+ conf/{all,interface}/secure_redirects is set to TRUE,
+ it will be disabled otherwise
+
+ default TRUE
+
+send_redirects - BOOLEAN
+ Send redirects, if router.
+
+ send_redirects for the interface will be enabled if at least one of
+ conf/{all,interface}/send_redirects is set to TRUE,
+ it will be disabled otherwise
+
+ Default: TRUE
+
+bootp_relay - BOOLEAN
+ Accept packets with source address 0.b.c.d destined
+ not to this host as local ones. It is supposed, that
+ BOOTP relay daemon will catch and forward such packets.
+ conf/all/bootp_relay must also be set to TRUE to enable BOOTP relay
+ for the interface
+
+ default FALSE
+
+ Not Implemented Yet.
+
+accept_source_route - BOOLEAN
+ Accept packets with SRR option.
+ conf/all/accept_source_route must also be set to TRUE to accept packets
+ with SRR option on the interface
+
+ default
+
+ - TRUE (router)
+ - FALSE (host)
+
+accept_local - BOOLEAN
+ Accept packets with local source addresses. In combination with
+ suitable routing, this can be used to direct packets between two
+ local interfaces over the wire and have them accepted properly.
+ default FALSE
+
+route_localnet - BOOLEAN
+ Do not consider loopback addresses as martian source or destination
+ while routing. This enables the use of 127/8 for local routing purposes.
+
+ default FALSE
+
+rp_filter - INTEGER
+ - 0 - No source validation.
+ - 1 - Strict mode as defined in RFC3704 Strict Reverse Path
+ Each incoming packet is tested against the FIB and if the interface
+ is not the best reverse path the packet check will fail.
+ By default failed packets are discarded.
+ - 2 - Loose mode as defined in RFC3704 Loose Reverse Path
+ Each incoming packet's source address is also tested against the FIB
+ and if the source address is not reachable via any interface
+ the packet check will fail.
+
+ Current recommended practice in RFC3704 is to enable strict mode
+ to prevent IP spoofing from DDos attacks. If using asymmetric routing
+ or other complicated routing, then loose mode is recommended.
+
+ The max value from conf/{all,interface}/rp_filter is used
+ when doing source validation on the {interface}.
+
+ Default value is 0. Note that some distributions enable it
+ in startup scripts.
+
+src_valid_mark - BOOLEAN
+ - 0 - The fwmark of the packet is not included in reverse path
+ route lookup. This allows for asymmetric routing configurations
+ utilizing the fwmark in only one direction, e.g., transparent
+ proxying.
+
+ - 1 - The fwmark of the packet is included in reverse path route
+ lookup. This permits rp_filter to function when the fwmark is
+ used for routing traffic in both directions.
+
+ This setting also affects the utilization of fmwark when
+ performing source address selection for ICMP replies, or
+ determining addresses stored for the IPOPT_TS_TSANDADDR and
+ IPOPT_RR IP options.
+
+ The max value from conf/{all,interface}/src_valid_mark is used.
+
+ Default value is 0.
+
+arp_filter - BOOLEAN
+ - 1 - Allows you to have multiple network interfaces on the same
+ subnet, and have the ARPs for each interface be answered
+ based on whether or not the kernel would route a packet from
+ the ARP'd IP out that interface (therefore you must use source
+ based routing for this to work). In other words it allows control
+ of which cards (usually 1) will respond to an arp request.
+
+ - 0 - (default) The kernel can respond to arp requests with addresses
+ from other interfaces. This may seem wrong but it usually makes
+ sense, because it increases the chance of successful communication.
+ IP addresses are owned by the complete host on Linux, not by
+ particular interfaces. Only for more complex setups like load-
+ balancing, does this behaviour cause problems.
+
+ arp_filter for the interface will be enabled if at least one of
+ conf/{all,interface}/arp_filter is set to TRUE,
+ it will be disabled otherwise
+
+arp_announce - INTEGER
+ Define different restriction levels for announcing the local
+ source IP address from IP packets in ARP requests sent on
+ interface:
+
+ - 0 - (default) Use any local address, configured on any interface
+ - 1 - Try to avoid local addresses that are not in the target's
+ subnet for this interface. This mode is useful when target
+ hosts reachable via this interface require the source IP
+ address in ARP requests to be part of their logical network
+ configured on the receiving interface. When we generate the
+ request we will check all our subnets that include the
+ target IP and will preserve the source address if it is from
+ such subnet. If there is no such subnet we select source
+ address according to the rules for level 2.
+ - 2 - Always use the best local address for this target.
+ In this mode we ignore the source address in the IP packet
+ and try to select local address that we prefer for talks with
+ the target host. Such local address is selected by looking
+ for primary IP addresses on all our subnets on the outgoing
+ interface that include the target IP address. If no suitable
+ local address is found we select the first local address
+ we have on the outgoing interface or on all other interfaces,
+ with the hope we will receive reply for our request and
+ even sometimes no matter the source IP address we announce.
+
+ The max value from conf/{all,interface}/arp_announce is used.
+
+ Increasing the restriction level gives more chance for
+ receiving answer from the resolved target while decreasing
+ the level announces more valid sender's information.
+
+arp_ignore - INTEGER
+ Define different modes for sending replies in response to
+ received ARP requests that resolve local target IP addresses:
+
+ - 0 - (default): reply for any local target IP address, configured
+ on any interface
+ - 1 - reply only if the target IP address is local address
+ configured on the incoming interface
+ - 2 - reply only if the target IP address is local address
+ configured on the incoming interface and both with the
+ sender's IP address are part from same subnet on this interface
+ - 3 - do not reply for local addresses configured with scope host,
+ only resolutions for global and link addresses are replied
+ - 4-7 - reserved
+ - 8 - do not reply for all local addresses
+
+ The max value from conf/{all,interface}/arp_ignore is used
+ when ARP request is received on the {interface}
+
+arp_notify - BOOLEAN
+ Define mode for notification of address and device changes.
+
+ == ==========================================================
+ 0 (default): do nothing
+ 1 Generate gratuitous arp requests when device is brought up
+ or hardware address changes.
+ == ==========================================================
+
+arp_accept - INTEGER
+ Define behavior for accepting gratuitous ARP (garp) frames from devices
+ that are not already present in the ARP table:
+
+ - 0 - don't create new entries in the ARP table
+ - 1 - create new entries in the ARP table
+ - 2 - create new entries only if the source IP address is in the same
+ subnet as an address configured on the interface that received the
+ garp message.
+
+ Both replies and requests type gratuitous arp will trigger the
+ ARP table to be updated, if this setting is on.
+
+ If the ARP table already contains the IP address of the
+ gratuitous arp frame, the arp table will be updated regardless
+ if this setting is on or off.
+
+arp_evict_nocarrier - BOOLEAN
+ Clears the ARP cache on NOCARRIER events. This option is important for
+ wireless devices where the ARP cache should not be cleared when roaming
+ between access points on the same network. In most cases this should
+ remain as the default (1).
+
+ - 1 - (default): Clear the ARP cache on NOCARRIER events
+ - 0 - Do not clear ARP cache on NOCARRIER events
+
+mcast_solicit - INTEGER
+ The maximum number of multicast probes in INCOMPLETE state,
+ when the associated hardware address is unknown. Defaults
+ to 3.
+
+ucast_solicit - INTEGER
+ The maximum number of unicast probes in PROBE state, when
+ the hardware address is being reconfirmed. Defaults to 3.
+
+app_solicit - INTEGER
+ The maximum number of probes to send to the user space ARP daemon
+ via netlink before dropping back to multicast probes (see
+ mcast_resolicit). Defaults to 0.
+
+mcast_resolicit - INTEGER
+ The maximum number of multicast probes after unicast and
+ app probes in PROBE state. Defaults to 0.
+
+disable_policy - BOOLEAN
+ Disable IPSEC policy (SPD) for this interface
+
+disable_xfrm - BOOLEAN
+ Disable IPSEC encryption on this interface, whatever the policy
+
+igmpv2_unsolicited_report_interval - INTEGER
+ The interval in milliseconds in which the next unsolicited
+ IGMPv1 or IGMPv2 report retransmit will take place.
+
+ Default: 10000 (10 seconds)
+
+igmpv3_unsolicited_report_interval - INTEGER
+ The interval in milliseconds in which the next unsolicited
+ IGMPv3 report retransmit will take place.
+
+ Default: 1000 (1 seconds)
+
+ignore_routes_with_linkdown - BOOLEAN
+ Ignore routes whose link is down when performing a FIB lookup.
+
+promote_secondaries - BOOLEAN
+ When a primary IP address is removed from this interface
+ promote a corresponding secondary IP address instead of
+ removing all the corresponding secondary IP addresses.
+
+drop_unicast_in_l2_multicast - BOOLEAN
+ Drop any unicast IP packets that are received in link-layer
+ multicast (or broadcast) frames.
+
+ This behavior (for multicast) is actually a SHOULD in RFC
+ 1122, but is disabled by default for compatibility reasons.
+
+ Default: off (0)
+
+drop_gratuitous_arp - BOOLEAN
+ Drop all gratuitous ARP frames, for example if there's a known
+ good ARP proxy on the network and such frames need not be used
+ (or in the case of 802.11, must not be used to prevent attacks.)
+
+ Default: off (0)
+
+
+tag - INTEGER
+ Allows you to write a number, which can be used as required.
+
+ Default value is 0.
+
+xfrm4_gc_thresh - INTEGER
+ (Obsolete since linux-4.14)
+ The threshold at which we will start garbage collecting for IPv4
+ destination cache entries. At twice this value the system will
+ refuse new allocations.
+
+igmp_link_local_mcast_reports - BOOLEAN
+ Enable IGMP reports for link local multicast groups in the
+ 224.0.0.X range.
+
+ Default TRUE
+
+Alexey Kuznetsov.
+kuznet@ms2.inr.ac.ru
+
+Updated by:
+
+- Andi Kleen
+ ak@muc.de
+- Nicolas Delon
+ delon.nicolas@wanadoo.fr
+
+
+
+
+/proc/sys/net/ipv6/* Variables
+==============================
+
+IPv6 has no global variables such as tcp_*. tcp_* settings under ipv4/ also
+apply to IPv6 [XXX?].
+
+bindv6only - BOOLEAN
+ Default value for IPV6_V6ONLY socket option,
+ which restricts use of the IPv6 socket to IPv6 communication
+ only.
+
+ - TRUE: disable IPv4-mapped address feature
+ - FALSE: enable IPv4-mapped address feature
+
+ Default: FALSE (as specified in RFC3493)
+
+flowlabel_consistency - BOOLEAN
+ Protect the consistency (and unicity) of flow label.
+ You have to disable it to use IPV6_FL_F_REFLECT flag on the
+ flow label manager.
+
+ - TRUE: enabled
+ - FALSE: disabled
+
+ Default: TRUE
+
+auto_flowlabels - INTEGER
+ Automatically generate flow labels based on a flow hash of the
+ packet. This allows intermediate devices, such as routers, to
+ identify packet flows for mechanisms like Equal Cost Multipath
+ Routing (see RFC 6438).
+
+ = ===========================================================
+ 0 automatic flow labels are completely disabled
+ 1 automatic flow labels are enabled by default, they can be
+ disabled on a per socket basis using the IPV6_AUTOFLOWLABEL
+ socket option
+ 2 automatic flow labels are allowed, they may be enabled on a
+ per socket basis using the IPV6_AUTOFLOWLABEL socket option
+ 3 automatic flow labels are enabled and enforced, they cannot
+ be disabled by the socket option
+ = ===========================================================
+
+ Default: 1
+
+flowlabel_state_ranges - BOOLEAN
+ Split the flow label number space into two ranges. 0-0x7FFFF is
+ reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF
+ is reserved for stateless flow labels as described in RFC6437.
+
+ - TRUE: enabled
+ - FALSE: disabled
+
+ Default: true
+
+flowlabel_reflect - INTEGER
+ Control flow label reflection. Needed for Path MTU
+ Discovery to work with Equal Cost Multipath Routing in anycast
+ environments. See RFC 7690 and:
+ https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
+
+ This is a bitmask.
+
+ - 1: enabled for established flows
+
+ Note that this prevents automatic flowlabel changes, as done
+ in "tcp: change IPv6 flow-label upon receiving spurious retransmission"
+ and "tcp: Change txhash on every SYN and RTO retransmit"
+
+ - 2: enabled for TCP RESET packets (no active listener)
+ If set, a RST packet sent in response to a SYN packet on a closed
+ port will reflect the incoming flow label.
+
+ - 4: enabled for ICMPv6 echo reply messages.
+
+ Default: 0
+
+fib_multipath_hash_policy - INTEGER
+ Controls which hash policy to use for multipath routes.
+
+ Default: 0 (Layer 3)
+
+ Possible values:
+
+ - 0 - Layer 3 (source and destination addresses plus flow label)
+ - 1 - Layer 4 (standard 5-tuple)
+ - 2 - Layer 3 or inner Layer 3 if present
+ - 3 - Custom multipath hash. Fields used for multipath hash calculation
+ are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+ When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+ fields used for multipath hash calculation are determined by this
+ sysctl.
+
+ This value is a bitmask which enables various fields for multipath hash
+ calculation.
+
+ Possible fields are:
+
+ ====== ============================
+ 0x0001 Source IP address
+ 0x0002 Destination IP address
+ 0x0004 IP protocol
+ 0x0008 Flow Label
+ 0x0010 Source port
+ 0x0020 Destination port
+ 0x0040 Inner source IP address
+ 0x0080 Inner destination IP address
+ 0x0100 Inner IP protocol
+ 0x0200 Inner Flow Label
+ 0x0400 Inner source port
+ 0x0800 Inner destination port
+ ====== ============================
+
+ Default: 0x0007 (source IP, destination IP and IP protocol)
+
+anycast_src_echo_reply - BOOLEAN
+ Controls the use of anycast addresses as source addresses for ICMPv6
+ echo reply
+
+ - TRUE: enabled
+ - FALSE: disabled
+
+ Default: FALSE
+
+idgen_delay - INTEGER
+ Controls the delay in seconds after which time to retry
+ privacy stable address generation if a DAD conflict is
+ detected.
+
+ Default: 1 (as specified in RFC7217)
+
+idgen_retries - INTEGER
+ Controls the number of retries to generate a stable privacy
+ address if a DAD conflict is detected.
+
+ Default: 3 (as specified in RFC7217)
+
+mld_qrv - INTEGER
+ Controls the MLD query robustness variable (see RFC3810 9.1).
+
+ Default: 2 (as specified by RFC3810 9.1)
+
+ Minimum: 1 (as specified by RFC6636 4.5)
+
+max_dst_opts_number - INTEGER
+ Maximum number of non-padding TLVs allowed in a Destination
+ options extension header. If this value is less than zero
+ then unknown options are disallowed and the number of known
+ TLVs allowed is the absolute value of this number.
+
+ Default: 8
+
+max_hbh_opts_number - INTEGER
+ Maximum number of non-padding TLVs allowed in a Hop-by-Hop
+ options extension header. If this value is less than zero
+ then unknown options are disallowed and the number of known
+ TLVs allowed is the absolute value of this number.
+
+ Default: 8
+
+max_dst_opts_length - INTEGER
+ Maximum length allowed for a Destination options extension
+ header.
+
+ Default: INT_MAX (unlimited)
+
+max_hbh_length - INTEGER
+ Maximum length allowed for a Hop-by-Hop options extension
+ header.
+
+ Default: INT_MAX (unlimited)
+
+skip_notify_on_dev_down - BOOLEAN
+ Controls whether an RTM_DELROUTE message is generated for routes
+ removed when a device is taken down or deleted. IPv4 does not
+ generate this message; IPv6 does by default. Setting this sysctl
+ to true skips the message, making IPv4 and IPv6 on par in relying
+ on userspace caches to track link events and evict routes.
+
+ Default: false (generate message)
+
+nexthop_compat_mode - BOOLEAN
+ New nexthop API provides a means for managing nexthops independent of
+ prefixes. Backwards compatibility with old route format is enabled by
+ default which means route dumps and notifications contain the new
+ nexthop attribute but also the full, expanded nexthop definition.
+ Further, updates or deletes of a nexthop configuration generate route
+ notifications for each fib entry using the nexthop. Once a system
+ understands the new API, this sysctl can be disabled to achieve full
+ performance benefits of the new API by disabling the nexthop expansion
+ and extraneous notifications.
+ Default: true (backward compat mode)
+
+fib_notify_on_flag_change - INTEGER
+ Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+ RTM_F_TRAP/RTM_F_OFFLOAD_FAILED flags are changed.
+
+ After installing a route to the kernel, user space receives an
+ acknowledgment, which means the route was installed in the kernel,
+ but not necessarily in hardware.
+ It is also possible for a route already installed in hardware to change
+ its action and therefore its flags. For example, a host route that is
+ trapping packets can be "promoted" to perform decapsulation following
+ the installation of an IPinIP/VXLAN tunnel.
+ The notifications will indicate to user-space the state of the route.
+
+ Default: 0 (Do not emit notifications.)
+
+ Possible values:
+
+ - 0 - Do not emit notifications.
+ - 1 - Emit notifications.
+ - 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change.
+
+ioam6_id - INTEGER
+ Define the IOAM id of this node. Uses only 24 bits out of 32 in total.
+
+ Min: 0
+ Max: 0xFFFFFF
+
+ Default: 0xFFFFFF
+
+ioam6_id_wide - LONG INTEGER
+ Define the wide IOAM id of this node. Uses only 56 bits out of 64 in
+ total. Can be different from ioam6_id.
+
+ Min: 0
+ Max: 0xFFFFFFFFFFFFFF
+
+ Default: 0xFFFFFFFFFFFFFF
+
+IPv6 Fragmentation:
+
+ip6frag_high_thresh - INTEGER
+ Maximum memory used to reassemble IPv6 fragments. When
+ ip6frag_high_thresh bytes of memory is allocated for this purpose,
+ the fragment handler will toss packets until ip6frag_low_thresh
+ is reached.
+
+ip6frag_low_thresh - INTEGER
+ See ip6frag_high_thresh
+
+ip6frag_time - INTEGER
+ Time in seconds to keep an IPv6 fragment in memory.
+
+``conf/default/*``:
+ Change the interface-specific default settings.
+
+ These settings would be used during creating new interfaces.
+
+
+``conf/all/*``:
+ Change all the interface-specific settings.
+
+ [XXX: Other special features than forwarding?]
+
+conf/all/disable_ipv6 - BOOLEAN
+ Changing this value is same as changing ``conf/default/disable_ipv6``
+ setting and also all per-interface ``disable_ipv6`` settings to the same
+ value.
+
+ Reading this value does not have any particular meaning. It does not say
+ whether IPv6 support is enabled or disabled. Returned value can be 1
+ also in the case when some interface has ``disable_ipv6`` set to 0 and
+ has configured IPv6 addresses.
+
+conf/all/forwarding - BOOLEAN
+ Enable global IPv6 forwarding between all interfaces.
+
+ IPv4 and IPv6 work differently here; e.g. netfilter must be used
+ to control which interfaces may forward packets and which not.
+
+ This also sets all interfaces' Host/Router setting
+ 'forwarding' to the specified value. See below for details.
+
+ This referred to as global forwarding.
+
+proxy_ndp - BOOLEAN
+ Do proxy ndp.
+
+fwmark_reflect - BOOLEAN
+ Controls the fwmark of kernel-generated IPv6 reply packets that are not
+ associated with a socket for example, TCP RSTs or ICMPv6 echo replies).
+ If unset, these packets have a fwmark of zero. If set, they have the
+ fwmark of the packet they are replying to.
+
+ Default: 0
+
+``conf/interface/*``:
+ Change special settings per interface.
+
+ The functional behaviour for certain settings is different
+ depending on whether local forwarding is enabled or not.
+
+accept_ra - INTEGER
+ Accept Router Advertisements; autoconfigure using them.
+
+ It also determines whether or not to transmit Router
+ Solicitations. If and only if the functional setting is to
+ accept Router Advertisements, Router Solicitations will be
+ transmitted.
+
+ Possible values are:
+
+ == ===========================================================
+ 0 Do not accept Router Advertisements.
+ 1 Accept Router Advertisements if forwarding is disabled.
+ 2 Overrule forwarding behaviour. Accept Router Advertisements
+ even if forwarding is enabled.
+ == ===========================================================
+
+ Functional default:
+
+ - enabled if local forwarding is disabled.
+ - disabled if local forwarding is enabled.
+
+accept_ra_defrtr - BOOLEAN
+ Learn default router in Router Advertisement.
+
+ Functional default:
+
+ - enabled if accept_ra is enabled.
+ - disabled if accept_ra is disabled.
+
+ra_defrtr_metric - UNSIGNED INTEGER
+ Route metric for default route learned in Router Advertisement. This value
+ will be assigned as metric for the default route learned via IPv6 Router
+ Advertisement. Takes affect only if accept_ra_defrtr is enabled.
+
+ Possible values:
+ 1 to 0xFFFFFFFF
+
+ Default: IP6_RT_PRIO_USER i.e. 1024.
+
+accept_ra_from_local - BOOLEAN
+ Accept RA with source-address that is found on local machine
+ if the RA is otherwise proper and able to be accepted.
+
+ Default is to NOT accept these as it may be an un-intended
+ network loop.
+
+ Functional default:
+
+ - enabled if accept_ra_from_local is enabled
+ on a specific interface.
+ - disabled if accept_ra_from_local is disabled
+ on a specific interface.
+
+accept_ra_min_hop_limit - INTEGER
+ Minimum hop limit Information in Router Advertisement.
+
+ Hop limit Information in Router Advertisement less than this
+ variable shall be ignored.
+
+ Default: 1
+
+accept_ra_min_lft - INTEGER
+ Minimum acceptable lifetime value in Router Advertisement.
+
+ RA sections with a lifetime less than this value shall be
+ ignored. Zero lifetimes stay unaffected.
+
+ Default: 0
+
+accept_ra_pinfo - BOOLEAN
+ Learn Prefix Information in Router Advertisement.
+
+ Functional default:
+
+ - enabled if accept_ra is enabled.
+ - disabled if accept_ra is disabled.
+
+accept_ra_rt_info_min_plen - INTEGER
+ Minimum prefix length of Route Information in RA.
+
+ Route Information w/ prefix smaller than this variable shall
+ be ignored.
+
+ Functional default:
+
+ * 0 if accept_ra_rtr_pref is enabled.
+ * -1 if accept_ra_rtr_pref is disabled.
+
+accept_ra_rt_info_max_plen - INTEGER
+ Maximum prefix length of Route Information in RA.
+
+ Route Information w/ prefix larger than this variable shall
+ be ignored.
+
+ Functional default:
+
+ * 0 if accept_ra_rtr_pref is enabled.
+ * -1 if accept_ra_rtr_pref is disabled.
+
+accept_ra_rtr_pref - BOOLEAN
+ Accept Router Preference in RA.
+
+ Functional default:
+
+ - enabled if accept_ra is enabled.
+ - disabled if accept_ra is disabled.
+
+accept_ra_mtu - BOOLEAN
+ Apply the MTU value specified in RA option 5 (RFC4861). If
+ disabled, the MTU specified in the RA will be ignored.
+
+ Functional default:
+
+ - enabled if accept_ra is enabled.
+ - disabled if accept_ra is disabled.
+
+accept_redirects - BOOLEAN
+ Accept Redirects.
+
+ Functional default:
+
+ - enabled if local forwarding is disabled.
+ - disabled if local forwarding is enabled.
+
+accept_source_route - INTEGER
+ Accept source routing (routing extension header).
+
+ - >= 0: Accept only routing header type 2.
+ - < 0: Do not accept routing header.
+
+ Default: 0
+
+autoconf - BOOLEAN
+ Autoconfigure addresses using Prefix Information in Router
+ Advertisements.
+
+ Functional default:
+
+ - enabled if accept_ra_pinfo is enabled.
+ - disabled if accept_ra_pinfo is disabled.
+
+dad_transmits - INTEGER
+ The amount of Duplicate Address Detection probes to send.
+
+ Default: 1
+
+forwarding - INTEGER
+ Configure interface-specific Host/Router behaviour.
+
+ .. note::
+
+ It is recommended to have the same setting on all
+ interfaces; mixed router/host scenarios are rather uncommon.
+
+ Possible values are:
+
+ - 0 Forwarding disabled
+ - 1 Forwarding enabled
+
+ **FALSE (0)**:
+
+ By default, Host behaviour is assumed. This means:
+
+ 1. IsRouter flag is not set in Neighbour Advertisements.
+ 2. If accept_ra is TRUE (default), transmit Router
+ Solicitations.
+ 3. If accept_ra is TRUE (default), accept Router
+ Advertisements (and do autoconfiguration).
+ 4. If accept_redirects is TRUE (default), accept Redirects.
+
+ **TRUE (1)**:
+
+ If local forwarding is enabled, Router behaviour is assumed.
+ This means exactly the reverse from the above:
+
+ 1. IsRouter flag is set in Neighbour Advertisements.
+ 2. Router Solicitations are not sent unless accept_ra is 2.
+ 3. Router Advertisements are ignored unless accept_ra is 2.
+ 4. Redirects are ignored.
+
+ Default: 0 (disabled) if global forwarding is disabled (default),
+ otherwise 1 (enabled).
+
+hop_limit - INTEGER
+ Default Hop Limit to set.
+
+ Default: 64
+
+mtu - INTEGER
+ Default Maximum Transfer Unit
+
+ Default: 1280 (IPv6 required minimum)
+
+ip_nonlocal_bind - BOOLEAN
+ If set, allows processes to bind() to non-local IPv6 addresses,
+ which can be quite useful - but may break some applications.
+
+ Default: 0
+
+router_probe_interval - INTEGER
+ Minimum interval (in seconds) between Router Probing described
+ in RFC4191.
+
+ Default: 60
+
+router_solicitation_delay - INTEGER
+ Number of seconds to wait after interface is brought up
+ before sending Router Solicitations.
+
+ Default: 1
+
+router_solicitation_interval - INTEGER
+ Number of seconds to wait between Router Solicitations.
+
+ Default: 4
+
+router_solicitations - INTEGER
+ Number of Router Solicitations to send until assuming no
+ routers are present.
+
+ Default: 3
+
+use_oif_addrs_only - BOOLEAN
+ When enabled, the candidate source addresses for destinations
+ routed via this interface are restricted to the set of addresses
+ configured on this interface (vis. RFC 6724, section 4).
+
+ Default: false
+
+use_tempaddr - INTEGER
+ Preference for Privacy Extensions (RFC3041).
+
+ * <= 0 : disable Privacy Extensions
+ * == 1 : enable Privacy Extensions, but prefer public
+ addresses over temporary addresses.
+ * > 1 : enable Privacy Extensions and prefer temporary
+ addresses over public addresses.
+
+ Default:
+
+ * 0 (for most devices)
+ * -1 (for point-to-point devices and loopback devices)
+
+temp_valid_lft - INTEGER
+ valid lifetime (in seconds) for temporary addresses.
+
+ Default: 172800 (2 days)
+
+temp_prefered_lft - INTEGER
+ Preferred lifetime (in seconds) for temporary addresses.
+
+ Default: 86400 (1 day)
+
+keep_addr_on_down - INTEGER
+ Keep all IPv6 addresses on an interface down event. If set static
+ global addresses with no expiration time are not flushed.
+
+ * >0 : enabled
+ * 0 : system default
+ * <0 : disabled
+
+ Default: 0 (addresses are removed)
+
+max_desync_factor - INTEGER
+ Maximum value for DESYNC_FACTOR, which is a random value
+ that ensures that clients don't synchronize with each
+ other and generate new addresses at exactly the same time.
+ value is in seconds.
+
+ Default: 600
+
+regen_max_retry - INTEGER
+ Number of attempts before give up attempting to generate
+ valid temporary addresses.
+
+ Default: 5
+
+max_addresses - INTEGER
+ Maximum number of autoconfigured addresses per interface. Setting
+ to zero disables the limitation. It is not recommended to set this
+ value too large (or to zero) because it would be an easy way to
+ crash the kernel by allowing too many addresses to be created.
+
+ Default: 16
+
+disable_ipv6 - BOOLEAN
+ Disable IPv6 operation. If accept_dad is set to 2, this value
+ will be dynamically set to TRUE if DAD fails for the link-local
+ address.
+
+ Default: FALSE (enable IPv6 operation)
+
+ When this value is changed from 1 to 0 (IPv6 is being enabled),
+ it will dynamically create a link-local address on the given
+ interface and start Duplicate Address Detection, if necessary.
+
+ When this value is changed from 0 to 1 (IPv6 is being disabled),
+ it will dynamically delete all addresses and routes on the given
+ interface. From now on it will not possible to add addresses/routes
+ to the selected interface.
+
+accept_dad - INTEGER
+ Whether to accept DAD (Duplicate Address Detection).
+
+ == ==============================================================
+ 0 Disable DAD
+ 1 Enable DAD (default)
+ 2 Enable DAD, and disable IPv6 operation if MAC-based duplicate
+ link-local address has been found.
+ == ==============================================================
+
+ DAD operation and mode on a given interface will be selected according
+ to the maximum value of conf/{all,interface}/accept_dad.
+
+force_tllao - BOOLEAN
+ Enable sending the target link-layer address option even when
+ responding to a unicast neighbor solicitation.
+
+ Default: FALSE
+
+ Quoting from RFC 2461, section 4.4, Target link-layer address:
+
+ "The option MUST be included for multicast solicitations in order to
+ avoid infinite Neighbor Solicitation "recursion" when the peer node
+ does not have a cache entry to return a Neighbor Advertisements
+ message. When responding to unicast solicitations, the option can be
+ omitted since the sender of the solicitation has the correct link-
+ layer address; otherwise it would not have be able to send the unicast
+ solicitation in the first place. However, including the link-layer
+ address in this case adds little overhead and eliminates a potential
+ race condition where the sender deletes the cached link-layer address
+ prior to receiving a response to a previous solicitation."
+
+ndisc_notify - BOOLEAN
+ Define mode for notification of address and device changes.
+
+ * 0 - (default): do nothing
+ * 1 - Generate unsolicited neighbour advertisements when device is brought
+ up or hardware address changes.
+
+ndisc_tclass - INTEGER
+ The IPv6 Traffic Class to use by default when sending IPv6 Neighbor
+ Discovery (Router Solicitation, Router Advertisement, Neighbor
+ Solicitation, Neighbor Advertisement, Redirect) messages.
+ These 8 bits can be interpreted as 6 high order bits holding the DSCP
+ value and 2 low order bits representing ECN (which you probably want
+ to leave cleared).
+
+ * 0 - (default)
+
+ndisc_evict_nocarrier - BOOLEAN
+ Clears the neighbor discovery table on NOCARRIER events. This option is
+ important for wireless devices where the neighbor discovery cache should
+ not be cleared when roaming between access points on the same network.
+ In most cases this should remain as the default (1).
+
+ - 1 - (default): Clear neighbor discover cache on NOCARRIER events.
+ - 0 - Do not clear neighbor discovery cache on NOCARRIER events.
+
+mldv1_unsolicited_report_interval - INTEGER
+ The interval in milliseconds in which the next unsolicited
+ MLDv1 report retransmit will take place.
+
+ Default: 10000 (10 seconds)
+
+mldv2_unsolicited_report_interval - INTEGER
+ The interval in milliseconds in which the next unsolicited
+ MLDv2 report retransmit will take place.
+
+ Default: 1000 (1 second)
+
+force_mld_version - INTEGER
+ * 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed
+ * 1 - Enforce to use MLD version 1
+ * 2 - Enforce to use MLD version 2
+
+suppress_frag_ndisc - INTEGER
+ Control RFC 6980 (Security Implications of IPv6 Fragmentation
+ with IPv6 Neighbor Discovery) behavior:
+
+ * 1 - (default) discard fragmented neighbor discovery packets
+ * 0 - allow fragmented neighbor discovery packets
+
+optimistic_dad - BOOLEAN
+ Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
+
+ * 0: disabled (default)
+ * 1: enabled
+
+ Optimistic Duplicate Address Detection for the interface will be enabled
+ if at least one of conf/{all,interface}/optimistic_dad is set to 1,
+ it will be disabled otherwise.
+
+use_optimistic - BOOLEAN
+ If enabled, do not classify optimistic addresses as deprecated during
+ source address selection. Preferred addresses will still be chosen
+ before optimistic addresses, subject to other ranking in the source
+ address selection algorithm.
+
+ * 0: disabled (default)
+ * 1: enabled
+
+ This will be enabled if at least one of
+ conf/{all,interface}/use_optimistic is set to 1, disabled otherwise.
+
+stable_secret - IPv6 address
+ This IPv6 address will be used as a secret to generate IPv6
+ addresses for link-local addresses and autoconfigured
+ ones. All addresses generated after setting this secret will
+ be stable privacy ones by default. This can be changed via the
+ addrgenmode ip-link. conf/default/stable_secret is used as the
+ secret for the namespace, the interface specific ones can
+ overwrite that. Writes to conf/all/stable_secret are refused.
+
+ It is recommended to generate this secret during installation
+ of a system and keep it stable after that.
+
+ By default the stable secret is unset.
+
+addr_gen_mode - INTEGER
+ Defines how link-local and autoconf addresses are generated.
+
+ = =================================================================
+ 0 generate address based on EUI64 (default)
+ 1 do no generate a link-local address, use EUI64 for addresses
+ generated from autoconf
+ 2 generate stable privacy addresses, using the secret from
+ stable_secret (RFC7217)
+ 3 generate stable privacy addresses, using a random secret if unset
+ = =================================================================
+
+drop_unicast_in_l2_multicast - BOOLEAN
+ Drop any unicast IPv6 packets that are received in link-layer
+ multicast (or broadcast) frames.
+
+ By default this is turned off.
+
+drop_unsolicited_na - BOOLEAN
+ Drop all unsolicited neighbor advertisements, for example if there's
+ a known good NA proxy on the network and such frames need not be used
+ (or in the case of 802.11, must not be used to prevent attacks.)
+
+ By default this is turned off.
+
+accept_untracked_na - INTEGER
+ Define behavior for accepting neighbor advertisements from devices that
+ are absent in the neighbor cache:
+
+ - 0 - (default) Do not accept unsolicited and untracked neighbor
+ advertisements.
+
+ - 1 - Add a new neighbor cache entry in STALE state for routers on
+ receiving a neighbor advertisement (either solicited or unsolicited)
+ with target link-layer address option specified if no neighbor entry
+ is already present for the advertised IPv6 address. Without this knob,
+ NAs received for untracked addresses (absent in neighbor cache) are
+ silently ignored.
+
+ This is as per router-side behavior documented in RFC9131.
+
+ This has lower precedence than drop_unsolicited_na.
+
+ This will optimize the return path for the initial off-link
+ communication that is initiated by a directly connected host, by
+ ensuring that the first-hop router which turns on this setting doesn't
+ have to buffer the initial return packets to do neighbor-solicitation.
+ The prerequisite is that the host is configured to send unsolicited
+ neighbor advertisements on interface bringup. This setting should be
+ used in conjunction with the ndisc_notify setting on the host to
+ satisfy this prerequisite.
+
+ - 2 - Extend option (1) to add a new neighbor cache entry only if the
+ source IP address is in the same subnet as an address configured on
+ the interface that received the neighbor advertisement.
+
+enhanced_dad - BOOLEAN
+ Include a nonce option in the IPv6 neighbor solicitation messages used for
+ duplicate address detection per RFC7527. A received DAD NS will only signal
+ a duplicate address if the nonce is different. This avoids any false
+ detection of duplicates due to loopback of the NS messages that we send.
+ The nonce option will be sent on an interface unless both of
+ conf/{all,interface}/enhanced_dad are set to FALSE.
+
+ Default: TRUE
+
+``icmp/*``:
+===========
+
+ratelimit - INTEGER
+ Limit the maximal rates for sending ICMPv6 messages.
+
+ 0 to disable any limiting,
+ otherwise the minimal space between responses in milliseconds.
+
+ Default: 1000
+
+ratemask - list of comma separated ranges
+ For ICMPv6 message types matching the ranges in the ratemask, limit
+ the sending of the message according to ratelimit parameter.
+
+ The format used for both input and output is a comma separated
+ list of ranges (e.g. "0-127,129" for ICMPv6 message type 0 to 127 and
+ 129). Writing to the file will clear all previous ranges of ICMPv6
+ message types and update the current list with the input.
+
+ Refer to: https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml
+ for numerical values of ICMPv6 message types, e.g. echo request is 128
+ and echo reply is 129.
+
+ Default: 0-1,3-127 (rate limit ICMPv6 errors except Packet Too Big)
+
+echo_ignore_all - BOOLEAN
+ If set non-zero, then the kernel will ignore all ICMP ECHO
+ requests sent to it over the IPv6 protocol.
+
+ Default: 0
+
+echo_ignore_multicast - BOOLEAN
+ If set non-zero, then the kernel will ignore all ICMP ECHO
+ requests sent to it over the IPv6 protocol via multicast.
+
+ Default: 0
+
+echo_ignore_anycast - BOOLEAN
+ If set non-zero, then the kernel will ignore all ICMP ECHO
+ requests sent to it over the IPv6 protocol destined to anycast address.
+
+ Default: 0
+
+error_anycast_as_unicast - BOOLEAN
+ If set to 1, then the kernel will respond with ICMP Errors
+ resulting from requests sent to it over the IPv6 protocol destined
+ to anycast address essentially treating anycast as unicast.
+
+ Default: 0
+
+xfrm6_gc_thresh - INTEGER
+ (Obsolete since linux-4.14)
+ The threshold at which we will start garbage collecting for IPv6
+ destination cache entries. At twice this value the system will
+ refuse new allocations.
+
+
+IPv6 Update by:
+Pekka Savola <pekkas@netcore.fi>
+YOSHIFUJI Hideaki / USAGI Project <yoshfuji@linux-ipv6.org>
+
+
+/proc/sys/net/bridge/* Variables:
+=================================
+
+bridge-nf-call-arptables - BOOLEAN
+ - 1 : pass bridged ARP traffic to arptables' FORWARD chain.
+ - 0 : disable this.
+
+ Default: 1
+
+bridge-nf-call-iptables - BOOLEAN
+ - 1 : pass bridged IPv4 traffic to iptables' chains.
+ - 0 : disable this.
+
+ Default: 1
+
+bridge-nf-call-ip6tables - BOOLEAN
+ - 1 : pass bridged IPv6 traffic to ip6tables' chains.
+ - 0 : disable this.
+
+ Default: 1
+
+bridge-nf-filter-vlan-tagged - BOOLEAN
+ - 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables.
+ - 0 : disable this.
+
+ Default: 0
+
+bridge-nf-filter-pppoe-tagged - BOOLEAN
+ - 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
+ - 0 : disable this.
+
+ Default: 0
+
+bridge-nf-pass-vlan-input-dev - BOOLEAN
+ - 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan
+ interface on the bridge and set the netfilter input device to the
+ vlan. This allows use of e.g. "iptables -i br0.1" and makes the
+ REDIRECT target work with vlan-on-top-of-bridge interfaces. When no
+ matching vlan interface is found, or this switch is off, the input
+ device is set to the bridge interface.
+
+ - 0: disable bridge netfilter vlan interface lookup.
+
+ Default: 0
+
+``proc/sys/net/sctp/*`` Variables:
+==================================
+
+addip_enable - BOOLEAN
+ Enable or disable extension of Dynamic Address Reconfiguration
+ (ADD-IP) functionality specified in RFC5061. This extension provides
+ the ability to dynamically add and remove new addresses for the SCTP
+ associations.
+
+ 1: Enable extension.
+
+ 0: Disable extension.
+
+ Default: 0
+
+pf_enable - INTEGER
+ Enable or disable pf (pf is short for potentially failed) state. A value
+ of pf_retrans > path_max_retrans also disables pf state. That is, one of
+ both pf_enable and pf_retrans > path_max_retrans can disable pf state.
+ Since pf_retrans and path_max_retrans can be changed by userspace
+ application, sometimes user expects to disable pf state by the value of
+ pf_retrans > path_max_retrans, but occasionally the value of pf_retrans
+ or path_max_retrans is changed by the user application, this pf state is
+ enabled. As such, it is necessary to add this to dynamically enable
+ and disable pf state. See:
+ https://datatracker.ietf.org/doc/draft-ietf-tsvwg-sctp-failover for
+ details.
+
+ 1: Enable pf.
+
+ 0: Disable pf.
+
+ Default: 1
+
+pf_expose - INTEGER
+ Unset or enable/disable pf (pf is short for potentially failed) state
+ exposure. Applications can control the exposure of the PF path state
+ in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO
+ sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with
+ SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info
+ can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled,
+ a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming
+ SCTP_PF state and a SCTP_PF-state transport info can be got via
+ SCTP_GET_PEER_ADDR_INFO sockopt; When it's disabled, no
+ SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when
+ trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO
+ sockopt.
+
+ 0: Unset pf state exposure, Compatible with old applications.
+
+ 1: Disable pf state exposure.
+
+ 2: Enable pf state exposure.
+
+ Default: 0
+
+addip_noauth_enable - BOOLEAN
+ Dynamic Address Reconfiguration (ADD-IP) requires the use of
+ authentication to protect the operations of adding or removing new
+ addresses. This requirement is mandated so that unauthorized hosts
+ would not be able to hijack associations. However, older
+ implementations may not have implemented this requirement while
+ allowing the ADD-IP extension. For reasons of interoperability,
+ we provide this variable to control the enforcement of the
+ authentication requirement.
+
+ == ===============================================================
+ 1 Allow ADD-IP extension to be used without authentication. This
+ should only be set in a closed environment for interoperability
+ with older implementations.
+
+ 0 Enforce the authentication requirement
+ == ===============================================================
+
+ Default: 0
+
+auth_enable - BOOLEAN
+ Enable or disable Authenticated Chunks extension. This extension
+ provides the ability to send and receive authenticated chunks and is
+ required for secure operation of Dynamic Address Reconfiguration
+ (ADD-IP) extension.
+
+ - 1: Enable this extension.
+ - 0: Disable this extension.
+
+ Default: 0
+
+prsctp_enable - BOOLEAN
+ Enable or disable the Partial Reliability extension (RFC3758) which
+ is used to notify peers that a given DATA should no longer be expected.
+
+ - 1: Enable extension
+ - 0: Disable
+
+ Default: 1
+
+max_burst - INTEGER
+ The limit of the number of new packets that can be initially sent. It
+ controls how bursty the generated traffic can be.
+
+ Default: 4
+
+association_max_retrans - INTEGER
+ Set the maximum number for retransmissions that an association can
+ attempt deciding that the remote end is unreachable. If this value
+ is exceeded, the association is terminated.
+
+ Default: 10
+
+max_init_retransmits - INTEGER
+ The maximum number of retransmissions of INIT and COOKIE-ECHO chunks
+ that an association will attempt before declaring the destination
+ unreachable and terminating.
+
+ Default: 8
+
+path_max_retrans - INTEGER
+ The maximum number of retransmissions that will be attempted on a given
+ path. Once this threshold is exceeded, the path is considered
+ unreachable, and new traffic will use a different path when the
+ association is multihomed.
+
+ Default: 5
+
+pf_retrans - INTEGER
+ The number of retransmissions that will be attempted on a given path
+ before traffic is redirected to an alternate transport (should one
+ exist). Note this is distinct from path_max_retrans, as a path that
+ passes the pf_retrans threshold can still be used. Its only
+ deprioritized when a transmission path is selected by the stack. This
+ setting is primarily used to enable fast failover mechanisms without
+ having to reduce path_max_retrans to a very low value. See:
+ http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ for details. Note also that a value of pf_retrans > path_max_retrans
+ disables this feature. Since both pf_retrans and path_max_retrans can
+ be changed by userspace application, a variable pf_enable is used to
+ disable pf state.
+
+ Default: 0
+
+ps_retrans - INTEGER
+ Primary.Switchover.Max.Retrans (PSMR), it's a tunable parameter coming
+ from section-5 "Primary Path Switchover" in rfc7829. The primary path
+ will be changed to another active path when the path error counter on
+ the old primary path exceeds PSMR, so that "the SCTP sender is allowed
+ to continue data transmission on a new working path even when the old
+ primary destination address becomes active again". Note this feature
+ is disabled by initializing 'ps_retrans' per netns as 0xffff by default,
+ and its value can't be less than 'pf_retrans' when changing by sysctl.
+
+ Default: 0xffff
+
+rto_initial - INTEGER
+ The initial round trip timeout value in milliseconds that will be used
+ in calculating round trip times. This is the initial time interval
+ for retransmissions.
+
+ Default: 3000
+
+rto_max - INTEGER
+ The maximum value (in milliseconds) of the round trip timeout. This
+ is the largest time interval that can elapse between retransmissions.
+
+ Default: 60000
+
+rto_min - INTEGER
+ The minimum value (in milliseconds) of the round trip timeout. This
+ is the smallest time interval the can elapse between retransmissions.
+
+ Default: 1000
+
+hb_interval - INTEGER
+ The interval (in milliseconds) between HEARTBEAT chunks. These chunks
+ are sent at the specified interval on idle paths to probe the state of
+ a given path between 2 associations.
+
+ Default: 30000
+
+sack_timeout - INTEGER
+ The amount of time (in milliseconds) that the implementation will wait
+ to send a SACK.
+
+ Default: 200
+
+valid_cookie_life - INTEGER
+ The default lifetime of the SCTP cookie (in milliseconds). The cookie
+ is used during association establishment.
+
+ Default: 60000
+
+cookie_preserve_enable - BOOLEAN
+ Enable or disable the ability to extend the lifetime of the SCTP cookie
+ that is used during the establishment phase of SCTP association
+
+ - 1: Enable cookie lifetime extension.
+ - 0: Disable
+
+ Default: 1
+
+cookie_hmac_alg - STRING
+ Select the hmac algorithm used when generating the cookie value sent by
+ a listening sctp socket to a connecting client in the INIT-ACK chunk.
+ Valid values are:
+
+ * md5
+ * sha1
+ * none
+
+ Ability to assign md5 or sha1 as the selected alg is predicated on the
+ configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and
+ CONFIG_CRYPTO_SHA1).
+
+ Default: Dependent on configuration. MD5 if available, else SHA1 if
+ available, else none.
+
+rcvbuf_policy - INTEGER
+ Determines if the receive buffer is attributed to the socket or to
+ association. SCTP supports the capability to create multiple
+ associations on a single socket. When using this capability, it is
+ possible that a single stalled association that's buffering a lot
+ of data may block other associations from delivering their data by
+ consuming all of the receive buffer space. To work around this,
+ the rcvbuf_policy could be set to attribute the receiver buffer space
+ to each association instead of the socket. This prevents the described
+ blocking.
+
+ - 1: rcvbuf space is per association
+ - 0: rcvbuf space is per socket
+
+ Default: 0
+
+sndbuf_policy - INTEGER
+ Similar to rcvbuf_policy above, this applies to send buffer space.
+
+ - 1: Send buffer is tracked per association
+ - 0: Send buffer is tracked per socket.
+
+ Default: 0
+
+sctp_mem - vector of 3 INTEGERs: min, pressure, max
+ Number of pages allowed for queueing by all SCTP sockets.
+
+ min: Below this number of pages SCTP is not bothered about its
+ memory appetite. When amount of memory allocated by SCTP exceeds
+ this number, SCTP starts to moderate memory usage.
+
+ pressure: This value was introduced to follow format of tcp_mem.
+
+ max: Number of pages allowed for queueing by all SCTP sockets.
+
+ Default is calculated at boot time from amount of available memory.
+
+sctp_rmem - vector of 3 INTEGERs: min, default, max
+ Only the first value ("min") is used, "default" and "max" are
+ ignored.
+
+ min: Minimal size of receive buffer used by SCTP socket.
+ It is guaranteed to each SCTP socket (but not association) even
+ under moderate memory pressure.
+
+ Default: 4K
+
+sctp_wmem - vector of 3 INTEGERs: min, default, max
+ Only the first value ("min") is used, "default" and "max" are
+ ignored.
+
+ min: Minimum size of send buffer that can be used by SCTP sockets.
+ It is guaranteed to each SCTP socket (but not association) even
+ under moderate memory pressure.
+
+ Default: 4K
+
+addr_scope_policy - INTEGER
+ Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00
+
+ - 0 - Disable IPv4 address scoping
+ - 1 - Enable IPv4 address scoping
+ - 2 - Follow draft but allow IPv4 private addresses
+ - 3 - Follow draft but allow IPv4 link local addresses
+
+ Default: 1
+
+udp_port - INTEGER
+ The listening port for the local UDP tunneling sock. Normally it's
+ using the IANA-assigned UDP port number 9899 (sctp-tunneling).
+
+ This UDP sock is used for processing the incoming UDP-encapsulated
+ SCTP packets (from RFC6951), and shared by all applications in the
+ same net namespace. This UDP sock will be closed when the value is
+ set to 0.
+
+ The value will also be used to set the src port of the UDP header
+ for the outgoing UDP-encapsulated SCTP packets. For the dest port,
+ please refer to 'encap_port' below.
+
+ Default: 0
+
+encap_port - INTEGER
+ The default remote UDP encapsulation port.
+
+ This value is used to set the dest port of the UDP header for the
+ outgoing UDP-encapsulated SCTP packets by default. Users can also
+ change the value for each sock/asoc/transport by using setsockopt.
+ For further information, please refer to RFC6951.
+
+ Note that when connecting to a remote server, the client should set
+ this to the port that the UDP tunneling sock on the peer server is
+ listening to and the local UDP tunneling sock on the client also
+ must be started. On the server, it would get the encap_port from
+ the incoming packet's source port.
+
+ Default: 0
+
+plpmtud_probe_interval - INTEGER
+ The time interval (in milliseconds) for the PLPMTUD probe timer,
+ which is configured to expire after this period to receive an
+ acknowledgment to a probe packet. This is also the time interval
+ between the probes for the current pmtu when the probe search
+ is done.
+
+ PLPMTUD will be disabled when 0 is set, and other values for it
+ must be >= 5000.
+
+ Default: 0
+
+reconf_enable - BOOLEAN
+ Enable or disable extension of Stream Reconfiguration functionality
+ specified in RFC6525. This extension provides the ability to "reset"
+ a stream, and it includes the Parameters of "Outgoing/Incoming SSN
+ Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams".
+
+ - 1: Enable extension.
+ - 0: Disable extension.
+
+ Default: 0
+
+intl_enable - BOOLEAN
+ Enable or disable extension of User Message Interleaving functionality
+ specified in RFC8260. This extension allows the interleaving of user
+ messages sent on different streams. With this feature enabled, I-DATA
+ chunk will replace DATA chunk to carry user messages if also supported
+ by the peer. Note that to use this feature, one needs to set this option
+ to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2
+ and SCTP_INTERLEAVING_SUPPORTED to 1.
+
+ - 1: Enable extension.
+ - 0: Disable extension.
+
+ Default: 0
+
+ecn_enable - BOOLEAN
+ Control use of Explicit Congestion Notification (ECN) by SCTP.
+ Like in TCP, ECN is used only when both ends of the SCTP connection
+ indicate support for it. This feature is useful in avoiding losses
+ due to congestion by allowing supporting routers to signal congestion
+ before having to drop packets.
+
+ 1: Enable ecn.
+ 0: Disable ecn.
+
+ Default: 1
+
+l3mdev_accept - BOOLEAN
+ Enabling this option allows a "global" bound socket to work
+ across L3 master domains (e.g., VRFs) with packets capable of
+ being received regardless of the L3 domain in which they
+ originated. Only valid when the kernel was compiled with
+ CONFIG_NET_L3_MASTER_DEV.
+
+ Default: 1 (enabled)
+
+
+``/proc/sys/net/core/*``
+========================
+
+ Please see: Documentation/admin-guide/sysctl/net.rst for descriptions of these entries.
+
+
+``/proc/sys/net/unix/*``
+========================
+
+max_dgram_qlen - INTEGER
+ The maximum length of dgram socket receive queue
+
+ Default: 10
+
diff --git a/Documentation/networking/ip_dynaddr.rst b/Documentation/networking/ip_dynaddr.rst
new file mode 100644
index 0000000000..eacc0c780c
--- /dev/null
+++ b/Documentation/networking/ip_dynaddr.rst
@@ -0,0 +1,40 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+IP dynamic address hack-port v0.03
+==================================
+
+This stuff allows diald ONESHOT connections to get established by
+dynamically changing packet source address (and socket's if local procs).
+It is implemented for TCP diald-box connections(1) and IP_MASQuerading(2).
+
+If enabled\ [#]_ and forwarding interface has changed:
+
+ 1) Socket (and packet) source address is rewritten ON RETRANSMISSIONS
+ while in SYN_SENT state (diald-box processes).
+ 2) Out-bounded MASQueraded source address changes ON OUTPUT (when
+ internal host does retransmission) until a packet from outside is
+ received by the tunnel.
+
+This is specially helpful for auto dialup links (diald), where the
+``actual`` outgoing address is unknown at the moment the link is
+going up. So, the *same* (local AND masqueraded) connections requests that
+bring the link up will be able to get established.
+
+.. [#] At boot, by default no address rewriting is attempted.
+
+ To enable::
+
+ # echo 1 > /proc/sys/net/ipv4/ip_dynaddr
+
+ To enable verbose mode::
+
+ # echo 2 > /proc/sys/net/ipv4/ip_dynaddr
+
+ To disable (default)::
+
+ # echo 0 > /proc/sys/net/ipv4/ip_dynaddr
+
+Enjoy!
+
+Juanjo <jjciarla@raiz.uncu.edu.ar>
diff --git a/Documentation/networking/ipddp.rst b/Documentation/networking/ipddp.rst
new file mode 100644
index 0000000000..be7091b779
--- /dev/null
+++ b/Documentation/networking/ipddp.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================================
+AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation
+=========================================================
+
+Documentation ipddp.c
+
+This file is written by Jay Schulist <jschlst@samba.org>
+
+Introduction
+------------
+
+AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk
+networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams
+inside AppleTalk packets.
+
+Through this driver you can either allow your Linux box to communicate
+IP over an AppleTalk network or you can provide IP gatewaying functions
+for your AppleTalk users.
+
+You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk,
+EtherTalk and PPPTalk. The only limit on the protocol is that of what
+kernel AppleTalk layer and drivers are available.
+
+Each mode requires its own user space software.
+
+Compiling AppleTalk-IP Decapsulation/Encapsulation
+==================================================
+
+AppleTalk-IP decapsulation needs to be compiled into your kernel. You
+will need to turn on AppleTalk-IP driver support. Then you will need to
+select ONE of the two options; IP to AppleTalk-IP encapsulation support or
+AppleTalk-IP to IP decapsulation support. If you compile the driver
+statically you will only be able to use the driver for the function you have
+enabled in the kernel. If you compile the driver as a module you can
+select what mode you want it to run in via a module loading param.
+ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for
+AppleTalk-IP to IP decapsulation.
+
+Basic instructions for user space tools
+=======================================
+
+I will briefly describe the operation of the tools, but you will
+need to consult the supporting documentation for each set of tools.
+
+Decapsulation - You will need to download a software package called
+MacGate. In this distribution there will be a tool called MacRoute
+which enables you to add routes to the kernel for your Macs by hand.
+Also the tool MacRegGateWay is included to register the
+proper IP Gateway and IP addresses for your machine. Included in this
+distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from
+ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional
+but it allows automatic adding and deleting of routes for Macs. (Handy
+for locations with large Mac installations)
+
+Encapsulation - You will need to download a software daemon called ipddpd.
+This software expects there to be an AppleTalk-IP gateway on the network.
+You will also need to add the proper routes to route your Linux box's IP
+traffic out the ipddp interface.
+
+Common Uses of ipddp.c
+----------------------
+Of course AppleTalk-IP decapsulation and encapsulation, but specifically
+decapsulation is being used most for connecting LocalTalk networks to
+IP networks. Although it has been used on EtherTalk networks to allow
+Macs that are only able to tunnel IP over EtherTalk.
+
+Encapsulation has been used to allow a Linux box stuck on a LocalTalk
+network to use IP. It should work equally well if you are stuck on an
+EtherTalk only network.
+
+Further Assistance
+-------------------
+You can contact me (Jay Schulist <jschlst@samba.org>) with any
+questions regarding decapsulation or encapsulation. Bradford W. Johnson
+<johns393@maroon.tc.umn.edu> originally wrote the ipddp.c driver for IP
+encapsulation in AppleTalk.
diff --git a/Documentation/networking/ipsec.rst b/Documentation/networking/ipsec.rst
new file mode 100644
index 0000000000..afe9d7b48b
--- /dev/null
+++ b/Documentation/networking/ipsec.rst
@@ -0,0 +1,46 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+IPsec
+=====
+
+
+Here documents known IPsec corner cases which need to be keep in mind when
+deploy various IPsec configuration in real world production environment.
+
+1. IPcomp:
+ Small IP packet won't get compressed at sender, and failed on
+ policy check on receiver.
+
+Quote from RFC3173::
+
+ 2.2. Non-Expansion Policy
+
+ If the total size of a compressed payload and the IPComp header, as
+ defined in section 3, is not smaller than the size of the original
+ payload, the IP datagram MUST be sent in the original non-compressed
+ form. To clarify: If an IP datagram is sent non-compressed, no
+
+ IPComp header is added to the datagram. This policy ensures saving
+ the decompression processing cycles and avoiding incurring IP
+ datagram fragmentation when the expanded datagram is larger than the
+ MTU.
+
+ Small IP datagrams are likely to expand as a result of compression.
+ Therefore, a numeric threshold should be applied before compression,
+ where IP datagrams of size smaller than the threshold are sent in the
+ original form without attempting compression. The numeric threshold
+ is implementation dependent.
+
+Current IPComp implementation is indeed by the book, while as in practice
+when sending non-compressed packet to the peer (whether or not packet len
+is smaller than the threshold or the compressed len is larger than original
+packet len), the packet is dropped when checking the policy as this packet
+matches the selector but not coming from any XFRM layer, i.e., with no
+security path. Such naked packet will not eventually make it to upper layer.
+The result is much more wired to the user when ping peer with different
+payload length.
+
+One workaround is try to set "level use" for each policy if user observed
+above scenario. The consequence of doing so is small packet(uncompressed)
+will skip policy checking on receiver side.
diff --git a/Documentation/networking/ipv6.rst b/Documentation/networking/ipv6.rst
new file mode 100644
index 0000000000..ba09c2f2dc
--- /dev/null
+++ b/Documentation/networking/ipv6.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+IPv6
+====
+
+
+Options for the ipv6 module are supplied as parameters at load time.
+
+Module options may be given as command line arguments to the insmod
+or modprobe command, but are usually specified in either
+``/etc/modules.d/*.conf`` configuration files, or in a distro-specific
+configuration file.
+
+The available ipv6 module parameters are listed below. If a parameter
+is not specified the default value is used.
+
+The parameters are as follows:
+
+disable
+
+ Specifies whether to load the IPv6 module, but disable all
+ its functionality. This might be used when another module
+ has a dependency on the IPv6 module being loaded, but no
+ IPv6 addresses or operations are desired.
+
+ The possible values and their effects are:
+
+ 0
+ IPv6 is enabled.
+
+ This is the default value.
+
+ 1
+ IPv6 is disabled.
+
+ No IPv6 addresses will be added to interfaces, and
+ it will not be possible to open an IPv6 socket.
+
+ A reboot is required to enable IPv6.
+
+autoconf
+
+ Specifies whether to enable IPv6 address autoconfiguration
+ on all interfaces. This might be used when one does not wish
+ for addresses to be automatically generated from prefixes
+ received in Router Advertisements.
+
+ The possible values and their effects are:
+
+ 0
+ IPv6 address autoconfiguration is disabled on all interfaces.
+
+ Only the IPv6 loopback address (::1) and link-local addresses
+ will be added to interfaces.
+
+ 1
+ IPv6 address autoconfiguration is enabled on all interfaces.
+
+ This is the default value.
+
+disable_ipv6
+
+ Specifies whether to disable IPv6 on all interfaces.
+ This might be used when no IPv6 addresses are desired.
+
+ The possible values and their effects are:
+
+ 0
+ IPv6 is enabled on all interfaces.
+
+ This is the default value.
+
+ 1
+ IPv6 is disabled on all interfaces.
+
+ No IPv6 addresses will be added to interfaces.
+
diff --git a/Documentation/networking/ipvlan.rst b/Documentation/networking/ipvlan.rst
new file mode 100644
index 0000000000..895d0ccfd5
--- /dev/null
+++ b/Documentation/networking/ipvlan.rst
@@ -0,0 +1,189 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+IPVLAN Driver HOWTO
+===================
+
+Initial Release:
+ Mahesh Bandewar <maheshb AT google.com>
+
+1. Introduction:
+================
+This is conceptually very similar to the macvlan driver with one major
+exception of using L3 for mux-ing /demux-ing among slaves. This property makes
+the master device share the L2 with its slave devices. I have developed this
+driver in conjunction with network namespaces and not sure if there is use case
+outside of it.
+
+
+2. Building and Installation:
+=============================
+
+In order to build the driver, please select the config item CONFIG_IPVLAN.
+The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module
+(CONFIG_IPVLAN=m).
+
+
+3. Configuration:
+=================
+
+There are no module parameters for this driver and it can be configured
+using IProute2/ip utility.
+::
+
+ ip link add link <master> name <slave> type ipvlan [ mode MODE ] [ FLAGS ]
+ where
+ MODE: l3 (default) | l3s | l2
+ FLAGS: bridge (default) | private | vepa
+
+e.g.
+
+ (a) Following will create IPvlan link with eth0 as master in
+ L3 bridge mode::
+
+ bash# ip link add link eth0 name ipvl0 type ipvlan
+ (b) This command will create IPvlan link in L2 bridge mode::
+
+ bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge
+
+ (c) This command will create an IPvlan device in L2 private mode::
+
+ bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private
+
+ (d) This command will create an IPvlan device in L2 vepa mode::
+
+ bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa
+
+
+4. Operating modes:
+===================
+
+IPvlan has two modes of operation - L2 and L3. For a given master device,
+you can select one of these two modes and all slaves on that master will
+operate in the same (selected) mode. The RX mode is almost identical except
+that in L3 mode the slaves won't receive any multicast / broadcast traffic.
+L3 mode is more restrictive since routing is controlled from the other (mostly)
+default namespace.
+
+4.1 L2 mode:
+------------
+
+In this mode TX processing happens on the stack instance attached to the
+slave device and packets are switched and queued to the master device to send
+out. In this mode the slaves will RX/TX multicast and broadcast (if applicable)
+as well.
+
+4.2 L3 mode:
+------------
+
+In this mode TX processing up to L3 happens on the stack instance attached
+to the slave device and packets are switched to the stack instance of the
+master device for the L2 processing and routing from that instance will be
+used before packets are queued on the outbound device. In this mode the slaves
+will not receive nor can send multicast / broadcast traffic.
+
+4.3 L3S mode:
+-------------
+
+This is very similar to the L3 mode except that iptables (conn-tracking)
+works in this mode and hence it is L3-symmetric (L3s). This will have slightly less
+performance but that shouldn't matter since you are choosing this mode over plain-L3
+mode to make conn-tracking work.
+
+5. Mode flags:
+==============
+
+At this time following mode flags are available
+
+5.1 bridge:
+-----------
+This is the default option. To configure the IPvlan port in this mode,
+user can choose to either add this option on the command-line or don't specify
+anything. This is the traditional mode where slaves can cross-talk among
+themselves apart from talking through the master device.
+
+5.2 private:
+------------
+If this option is added to the command-line, the port is set in private
+mode. i.e. port won't allow cross communication between slaves.
+
+5.3 vepa:
+---------
+If this is added to the command-line, the port is set in VEPA mode.
+i.e. port will offload switching functionality to the external entity as
+described in 802.1Qbg
+Note: VEPA mode in IPvlan has limitations. IPvlan uses the mac-address of the
+master-device, so the packets which are emitted in this mode for the adjacent
+neighbor will have source and destination mac same. This will make the switch /
+router send the redirect message.
+
+6. What to choose (macvlan vs. ipvlan)?
+=======================================
+
+These two devices are very similar in many regards and the specific use
+case could very well define which device to choose. if one of the following
+situations defines your use case then you can choose to use ipvlan:
+
+
+(a) The Linux host that is connected to the external switch / router has
+ policy configured that allows only one mac per port.
+(b) No of virtual devices created on a master exceed the mac capacity and
+ puts the NIC in promiscuous mode and degraded performance is a concern.
+(c) If the slave device is to be put into the hostile / untrusted network
+ namespace where L2 on the slave could be changed / misused.
+
+
+6. Example configuration:
+=========================
+
+::
+
+ +=============================================================+
+ | Host: host1 |
+ | |
+ | +----------------------+ +----------------------+ |
+ | | NS:ns0 | | NS:ns1 | |
+ | | | | | |
+ | | | | | |
+ | | ipvl0 | | ipvl1 | |
+ | +----------#-----------+ +-----------#----------+ |
+ | # # |
+ | ################################ |
+ | # eth0 |
+ +==============================#==============================+
+
+
+(a) Create two network namespaces - ns0, ns1::
+
+ ip netns add ns0
+ ip netns add ns1
+
+(b) Create two ipvlan slaves on eth0 (master device)::
+
+ ip link add link eth0 ipvl0 type ipvlan mode l2
+ ip link add link eth0 ipvl1 type ipvlan mode l2
+
+(c) Assign slaves to the respective network namespaces::
+
+ ip link set dev ipvl0 netns ns0
+ ip link set dev ipvl1 netns ns1
+
+(d) Now switch to the namespace (ns0 or ns1) to configure the slave devices
+
+ - For ns0::
+
+ (1) ip netns exec ns0 bash
+ (2) ip link set dev ipvl0 up
+ (3) ip link set dev lo up
+ (4) ip -4 addr add 127.0.0.1 dev lo
+ (5) ip -4 addr add $IPADDR dev ipvl0
+ (6) ip -4 route add default via $ROUTER dev ipvl0
+
+ - For ns1::
+
+ (1) ip netns exec ns1 bash
+ (2) ip link set dev ipvl1 up
+ (3) ip link set dev lo up
+ (4) ip -4 addr add 127.0.0.1 dev lo
+ (5) ip -4 addr add $IPADDR dev ipvl1
+ (6) ip -4 route add default via $ROUTER dev ipvl1
diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
new file mode 100644
index 0000000000..3fb5fa142e
--- /dev/null
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -0,0 +1,332 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+IPvs-sysctl
+===========
+
+/proc/sys/net/ipv4/vs/* Variables:
+==================================
+
+am_droprate - INTEGER
+ default 10
+
+ It sets the always mode drop rate, which is used in the mode 3
+ of the drop_rate defense.
+
+amemthresh - INTEGER
+ default 1024
+
+ It sets the available memory threshold (in pages), which is
+ used in the automatic modes of defense. When there is no
+ enough available memory, the respective strategy will be
+ enabled and the variable is automatically set to 2, otherwise
+ the strategy is disabled and the variable is set to 1.
+
+backup_only - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ If set, disable the director function while the server is
+ in backup mode to avoid packet loops for DR/TUN methods.
+
+conn_reuse_mode - INTEGER
+ 1 - default
+
+ Controls how ipvs will deal with connections that are detected
+ port reuse. It is a bitmap, with the values being:
+
+ 0: disable any special handling on port reuse. The new
+ connection will be delivered to the same real server that was
+ servicing the previous connection.
+
+ bit 1: enable rescheduling of new connections when it is safe.
+ That is, whenever expire_nodest_conn and for TCP sockets, when
+ the connection is in TIME_WAIT state (which is only possible if
+ you use NAT mode).
+
+ bit 2: it is bit 1 plus, for TCP connections, when connections
+ are in FIN_WAIT state, as this is the last state seen by load
+ balancer in Direct Routing mode. This bit helps on adding new
+ real servers to a very busy cluster.
+
+conntrack - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ If set, maintain connection tracking entries for
+ connections handled by IPVS.
+
+ This should be enabled if connections handled by IPVS are to be
+ also handled by stateful firewall rules. That is, iptables rules
+ that make use of connection tracking. It is a performance
+ optimisation to disable this setting otherwise.
+
+ Connections handled by the IPVS FTP application module
+ will have connection tracking entries regardless of this setting.
+
+ Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled.
+
+cache_bypass - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ If it is enabled, forward packets to the original destination
+ directly when no cache server is available and destination
+ address is not local (iph->daddr is RTN_UNICAST). It is mostly
+ used in transparent web cache cluster.
+
+debug_level - INTEGER
+ - 0 - transmission error messages (default)
+ - 1 - non-fatal error messages
+ - 2 - configuration
+ - 3 - destination trash
+ - 4 - drop entry
+ - 5 - service lookup
+ - 6 - scheduling
+ - 7 - connection new/expire, lookup and synchronization
+ - 8 - state transition
+ - 9 - binding destination, template checks and applications
+ - 10 - IPVS packet transmission
+ - 11 - IPVS packet handling (ip_vs_in/ip_vs_out)
+ - 12 or more - packet traversal
+
+ Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled.
+
+ Higher debugging levels include the messages for lower debugging
+ levels, so setting debug level 2, includes level 0, 1 and 2
+ messages. Thus, logging becomes more and more verbose the higher
+ the level.
+
+drop_entry - INTEGER
+ - 0 - disabled (default)
+
+ The drop_entry defense is to randomly drop entries in the
+ connection hash table, just in order to collect back some
+ memory for new connections. In the current code, the
+ drop_entry procedure can be activated every second, then it
+ randomly scans 1/32 of the whole and drops entries that are in
+ the SYN-RECV/SYNACK state, which should be effective against
+ syn-flooding attack.
+
+ The valid values of drop_entry are from 0 to 3, where 0 means
+ that this strategy is always disabled, 1 and 2 mean automatic
+ modes (when there is no enough available memory, the strategy
+ is enabled and the variable is automatically set to 2,
+ otherwise the strategy is disabled and the variable is set to
+ 1), and 3 means that the strategy is always enabled.
+
+drop_packet - INTEGER
+ - 0 - disabled (default)
+
+ The drop_packet defense is designed to drop 1/rate packets
+ before forwarding them to real servers. If the rate is 1, then
+ drop all the incoming packets.
+
+ The value definition is the same as that of the drop_entry. In
+ the automatic mode, the rate is determined by the follow
+ formula: rate = amemthresh / (amemthresh - available_memory)
+ when available memory is less than the available memory
+ threshold. When the mode 3 is set, the always mode drop rate
+ is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
+
+est_cpulist - CPULIST
+ Allowed CPUs for estimation kthreads
+
+ Syntax: standard cpulist format
+ empty list - stop kthread tasks and estimation
+ default - the system's housekeeping CPUs for kthreads
+
+ Example:
+ "all": all possible CPUs
+ "0-N": all possible CPUs, N denotes last CPU number
+ "0,1-N:1/2": first and all CPUs with odd number
+ "": empty list
+
+est_nice - INTEGER
+ default 0
+ Valid range: -20 (more favorable) .. 19 (less favorable)
+
+ Niceness value to use for the estimation kthreads (scheduling
+ priority)
+
+expire_nodest_conn - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ The default value is 0, the load balancer will silently drop
+ packets when its destination server is not available. It may
+ be useful, when user-space monitoring program deletes the
+ destination server (because of server overload or wrong
+ detection) and add back the server later, and the connections
+ to the server can continue.
+
+ If this feature is enabled, the load balancer will expire the
+ connection immediately when a packet arrives and its
+ destination server is not available, then the client program
+ will be notified that the connection is closed. This is
+ equivalent to the feature some people requires to flush
+ connections when its destination is not available.
+
+expire_quiescent_template - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ When set to a non-zero value, the load balancer will expire
+ persistent templates when the destination server is quiescent.
+ This may be useful, when a user makes a destination server
+ quiescent by setting its weight to 0 and it is desired that
+ subsequent otherwise persistent connections are sent to a
+ different destination server. By default new persistent
+ connections are allowed to quiescent destination servers.
+
+ If this feature is enabled, the load balancer will expire the
+ persistence template if it is to be used to schedule a new
+ connection and the destination server is quiescent.
+
+ignore_tunneled - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ If set, ipvs will set the ipvs_property on all packets which are of
+ unrecognized protocols. This prevents us from routing tunneled
+ protocols like ipip, which is useful to prevent rescheduling
+ packets that have been tunneled to the ipvs host (i.e. to prevent
+ ipvs routing loops when ipvs is also acting as a real server).
+
+nat_icmp_send - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ It controls sending icmp error messages (ICMP_DEST_UNREACH)
+ for VS/NAT when the load balancer receives packets from real
+ servers but the connection entries don't exist.
+
+pmtu_disc - BOOLEAN
+ - 0 - disabled
+ - not 0 - enabled (default)
+
+ By default, reject with FRAG_NEEDED all DF packets that exceed
+ the PMTU, irrespective of the forwarding method. For TUN method
+ the flag can be disabled to fragment such packets.
+
+secure_tcp - INTEGER
+ - 0 - disabled (default)
+
+ The secure_tcp defense is to use a more complicated TCP state
+ transition table. For VS/NAT, it also delays entering the
+ TCP ESTABLISHED state until the three way handshake is completed.
+
+ The value definition is the same as that of drop_entry and
+ drop_packet.
+
+sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period
+ default 3 50
+
+ It sets synchronization threshold, which is the minimum number
+ of incoming packets that a connection needs to receive before
+ the connection will be synchronized. A connection will be
+ synchronized, every time the number of its incoming packets
+ modulus sync_period equals the threshold. The range of the
+ threshold is from 0 to sync_period.
+
+ When sync_period and sync_refresh_period are 0, send sync only
+ for state changes or only once when pkts matches sync_threshold
+
+sync_refresh_period - UNSIGNED INTEGER
+ default 0
+
+ In seconds, difference in reported connection timer that triggers
+ new sync message. It can be used to avoid sync messages for the
+ specified period (or half of the connection timeout if it is lower)
+ if connection state is not changed since last sync.
+
+ This is useful for normal connections with high traffic to reduce
+ sync rate. Additionally, retry sync_retries times with period of
+ sync_refresh_period/8.
+
+sync_retries - INTEGER
+ default 0
+
+ Defines sync retries with period of sync_refresh_period/8. Useful
+ to protect against loss of sync messages. The range of the
+ sync_retries is from 0 to 3.
+
+sync_qlen_max - UNSIGNED LONG
+
+ Hard limit for queued sync messages that are not sent yet. It
+ defaults to 1/32 of the memory pages but actually represents
+ number of messages. It will protect us from allocating large
+ parts of memory when the sending rate is lower than the queuing
+ rate.
+
+sync_sock_size - INTEGER
+ default 0
+
+ Configuration of SNDBUF (master) or RCVBUF (slave) socket limit.
+ Default value is 0 (preserve system defaults).
+
+sync_ports - INTEGER
+ default 1
+
+ The number of threads that master and backup servers can use for
+ sync traffic. Every thread will use single UDP port, thread 0 will
+ use the default port 8848 while last thread will use port
+ 8848+sync_ports-1.
+
+snat_reroute - BOOLEAN
+ - 0 - disabled
+ - not 0 - enabled (default)
+
+ If enabled, recalculate the route of SNATed packets from
+ realservers so that they are routed as if they originate from the
+ director. Otherwise they are routed as if they are forwarded by the
+ director.
+
+ If policy routing is in effect then it is possible that the route
+ of a packet originating from a director is routed differently to a
+ packet being forwarded by the director.
+
+ If policy routing is not in effect then the recalculated route will
+ always be the same as the original route so it is an optimisation
+ to disable snat_reroute and avoid the recalculation.
+
+sync_persist_mode - INTEGER
+ default 0
+
+ Controls the synchronisation of connections when using persistence
+
+ 0: All types of connections are synchronised
+
+ 1: Attempt to reduce the synchronisation traffic depending on
+ the connection type. For persistent services avoid synchronisation
+ for normal connections, do it only for persistence templates.
+ In such case, for TCP and SCTP it may need enabling sloppy_tcp and
+ sloppy_sctp flags on backup servers. For non-persistent services
+ such optimization is not applied, mode 0 is assumed.
+
+sync_version - INTEGER
+ default 1
+
+ The version of the synchronisation protocol used when sending
+ synchronisation messages.
+
+ 0 selects the original synchronisation protocol (version 0). This
+ should be used when sending synchronisation messages to a legacy
+ system that only understands the original synchronisation protocol.
+
+ 1 selects the current synchronisation protocol (version 1). This
+ should be used where possible.
+
+ Kernels with this sync_version entry are able to receive messages
+ of both version 1 and version 2 of the synchronisation protocol.
+
+run_estimation - BOOLEAN
+ 0 - disabled
+ not 0 - enabled (default)
+
+ If disabled, the estimation will be suspended and kthread tasks
+ stopped.
+
+ You can always re-enable estimation by setting this value to 1.
+ But be careful, the first estimation after re-enable is not
+ accurate.
diff --git a/Documentation/networking/j1939.rst b/Documentation/networking/j1939.rst
new file mode 100644
index 0000000000..e4bd7aa1f5
--- /dev/null
+++ b/Documentation/networking/j1939.rst
@@ -0,0 +1,460 @@
+.. SPDX-License-Identifier: (GPL-2.0 OR MIT)
+
+===================
+J1939 Documentation
+===================
+
+Overview / What Is J1939
+========================
+
+SAE J1939 defines a higher layer protocol on CAN. It implements a more
+sophisticated addressing scheme and extends the maximum packet size above 8
+bytes. Several derived specifications exist, which differ from the original
+J1939 on the application level, like MilCAN A, NMEA2000, and especially
+ISO-11783 (ISOBUS). This last one specifies the so-called ETP (Extended
+Transport Protocol), which has been included in this implementation. This
+results in a maximum packet size of ((2 ^ 24) - 1) * 7 bytes == 111 MiB.
+
+Specifications used
+-------------------
+
+* SAE J1939-21 : data link layer
+* SAE J1939-81 : network management
+* ISO 11783-6 : Virtual Terminal (Extended Transport Protocol)
+
+.. _j1939-motivation:
+
+Motivation
+==========
+
+Given the fact there's something like SocketCAN with an API similar to BSD
+sockets, we found some reasons to justify a kernel implementation for the
+addressing and transport methods used by J1939.
+
+* **Addressing:** when a process on an ECU communicates via J1939, it should
+ not necessarily know its source address. Although, at least one process per
+ ECU should know the source address. Other processes should be able to reuse
+ that address. This way, address parameters for different processes
+ cooperating for the same ECU, are not duplicated. This way of working is
+ closely related to the UNIX concept, where programs do just one thing and do
+ it well.
+
+* **Dynamic addressing:** Address Claiming in J1939 is time critical.
+ Furthermore, data transport should be handled properly during the address
+ negotiation. Putting this functionality in the kernel eliminates it as a
+ requirement for _every_ user space process that communicates via J1939. This
+ results in a consistent J1939 bus with proper addressing.
+
+* **Transport:** both TP & ETP reuse some PGNs to relay big packets over them.
+ Different processes may thus use the same TP & ETP PGNs without actually
+ knowing it. The individual TP & ETP sessions _must_ be serialized
+ (synchronized) between different processes. The kernel solves this problem
+ properly and eliminates the serialization (synchronization) as a requirement
+ for _every_ user space process that communicates via J1939.
+
+J1939 defines some other features (relaying, gateway, fast packet transport,
+...). In-kernel code for these would not contribute to protocol stability.
+Therefore, these parts are left to user space.
+
+The J1939 sockets operate on CAN network devices (see SocketCAN). Any J1939
+user space library operating on CAN raw sockets will still operate properly.
+Since such a library does not communicate with the in-kernel implementation, care
+must be taken that these two do not interfere. In practice, this means they
+cannot share ECU addresses. A single ECU (or virtual ECU) address is used by
+the library exclusively, or by the in-kernel system exclusively.
+
+J1939 concepts
+==============
+
+PGN
+---
+
+The J1939 protocol uses the 29-bit CAN identifier with the following structure:
+
+ ============ ============== ====================
+ 29 bit CAN-ID
+ --------------------------------------------------
+ Bit positions within the CAN-ID
+ --------------------------------------------------
+ 28 ... 26 25 ... 8 7 ... 0
+ ============ ============== ====================
+ Priority PGN SA (Source Address)
+ ============ ============== ====================
+
+The PGN (Parameter Group Number) is a number to identify a packet. The PGN
+is composed as follows:
+
+ ============ ============== ================= =================
+ PGN
+ ------------------------------------------------------------------
+ Bit positions within the CAN-ID
+ ------------------------------------------------------------------
+ 25 24 23 ... 16 15 ... 8
+ ============ ============== ================= =================
+ R (Reserved) DP (Data Page) PF (PDU Format) PS (PDU Specific)
+ ============ ============== ================= =================
+
+In J1939-21 distinction is made between PDU1 format (where PF < 240) and PDU2
+format (where PF >= 240). Furthermore, when using the PDU2 format, the PS-field
+contains a so-called Group Extension, which is part of the PGN. When using PDU2
+format, the Group Extension is set in the PS-field.
+
+ ============== ========================
+ PDU1 Format (specific) (peer to peer)
+ ----------------------------------------
+ Bit positions within the CAN-ID
+ ----------------------------------------
+ 23 ... 16 15 ... 8
+ ============== ========================
+ 00h ... EFh DA (Destination address)
+ ============== ========================
+
+ ============== ========================
+ PDU2 Format (global) (broadcast)
+ ----------------------------------------
+ Bit positions within the CAN-ID
+ ----------------------------------------
+ 23 ... 16 15 ... 8
+ ============== ========================
+ F0h ... FFh GE (Group Extension)
+ ============== ========================
+
+On the other hand, when using PDU1 format, the PS-field contains a so-called
+Destination Address, which is _not_ part of the PGN. When communicating a PGN
+from user space to kernel (or vice versa) and PDU2 format is used, the PS-field
+of the PGN shall be set to zero. The Destination Address shall be set
+elsewhere.
+
+Regarding PGN mapping to 29-bit CAN identifier, the Destination Address shall
+be get/set from/to the appropriate bits of the identifier by the kernel.
+
+
+Addressing
+----------
+
+Both static and dynamic addressing methods can be used.
+
+For static addresses, no extra checks are made by the kernel and provided
+addresses are considered right. This responsibility is for the OEM or system
+integrator.
+
+For dynamic addressing, so-called Address Claiming, extra support is foreseen
+in the kernel. In J1939 any ECU is known by its 64-bit NAME. At the moment of
+a successful address claim, the kernel keeps track of both NAME and source
+address being claimed. This serves as a base for filter schemes. By default,
+packets with a destination that is not locally will be rejected.
+
+Mixed mode packets (from a static to a dynamic address or vice versa) are
+allowed. The BSD sockets define separate API calls for getting/setting the
+local & remote address and are applicable for J1939 sockets.
+
+Filtering
+---------
+
+J1939 defines white list filters per socket that a user can set in order to
+receive a subset of the J1939 traffic. Filtering can be based on:
+
+* SA
+* SOURCE_NAME
+* PGN
+
+When multiple filters are in place for a single socket, and a packet comes in
+that matches several of those filters, the packet is only received once for
+that socket.
+
+How to Use J1939
+================
+
+API Calls
+---------
+
+On CAN, you first need to open a socket for communicating over a CAN network.
+To use J1939, ``#include <linux/can/j1939.h>``. From there, ``<linux/can.h>`` will be
+included too. To open a socket, use:
+
+.. code-block:: C
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_J1939);
+
+J1939 does use ``SOCK_DGRAM`` sockets. In the J1939 specification, connections are
+mentioned in the context of transport protocol sessions. These still deliver
+packets to the other end (using several CAN packets). ``SOCK_STREAM`` is not
+supported.
+
+After the successful creation of the socket, you would normally use the ``bind(2)``
+and/or ``connect(2)`` system call to bind the socket to a CAN interface. After
+binding and/or connecting the socket, you can ``read(2)`` and ``write(2)`` from/to the
+socket or use ``send(2)``, ``sendto(2)``, ``sendmsg(2)`` and the ``recv*()`` counterpart
+operations on the socket as usual. There are also J1939 specific socket options
+described below.
+
+In order to send data, a ``bind(2)`` must have been successful. ``bind(2)`` assigns a
+local address to a socket.
+
+Different from CAN is that the payload data is just the data that get sends,
+without its header info. The header info is derived from the sockaddr supplied
+to ``bind(2)``, ``connect(2)``, ``sendto(2)`` and ``recvfrom(2)``. A ``write(2)`` with size 4 will
+result in a packet with 4 bytes.
+
+The sockaddr structure has extensions for use with J1939 as specified below:
+
+.. code-block:: C
+
+ struct sockaddr_can {
+ sa_family_t can_family;
+ int can_ifindex;
+ union {
+ struct {
+ __u64 name;
+ /* pgn:
+ * 8 bit: PS in PDU2 case, else 0
+ * 8 bit: PF
+ * 1 bit: DP
+ * 1 bit: reserved
+ */
+ __u32 pgn;
+ __u8 addr;
+ } j1939;
+ } can_addr;
+ }
+
+``can_family`` & ``can_ifindex`` serve the same purpose as for other SocketCAN sockets.
+
+``can_addr.j1939.pgn`` specifies the PGN (max 0x3ffff). Individual bits are
+specified above.
+
+``can_addr.j1939.name`` contains the 64-bit J1939 NAME.
+
+``can_addr.j1939.addr`` contains the address.
+
+The ``bind(2)`` system call assigns the local address, i.e. the source address when
+sending packages. If a PGN during ``bind(2)`` is set, it's used as a RX filter.
+I.e. only packets with a matching PGN are received. If an ADDR or NAME is set
+it is used as a receive filter, too. It will match the destination NAME or ADDR
+of the incoming packet. The NAME filter will work only if appropriate Address
+Claiming for this name was done on the CAN bus and registered/cached by the
+kernel.
+
+On the other hand ``connect(2)`` assigns the remote address, i.e. the destination
+address. The PGN from ``connect(2)`` is used as the default PGN when sending
+packets. If ADDR or NAME is set it will be used as the default destination ADDR
+or NAME. Further a set ADDR or NAME during ``connect(2)`` is used as a receive
+filter. It will match the source NAME or ADDR of the incoming packet.
+
+Both ``write(2)`` and ``send(2)`` will send a packet with local address from ``bind(2)`` and the
+remote address from ``connect(2)``. Use ``sendto(2)`` to overwrite the destination
+address.
+
+If ``can_addr.j1939.name`` is set (!= 0) the NAME is looked up by the kernel and
+the corresponding ADDR is used. If ``can_addr.j1939.name`` is not set (== 0),
+``can_addr.j1939.addr`` is used.
+
+When creating a socket, reasonable defaults are set. Some options can be
+modified with ``setsockopt(2)`` & ``getsockopt(2)``.
+
+RX path related options:
+
+- ``SO_J1939_FILTER`` - configure array of filters
+- ``SO_J1939_PROMISC`` - disable filters set by ``bind(2)`` and ``connect(2)``
+
+By default no broadcast packets can be send or received. To enable sending or
+receiving broadcast packets use the socket option ``SO_BROADCAST``:
+
+.. code-block:: C
+
+ int value = 1;
+ setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &value, sizeof(value));
+
+The following diagram illustrates the RX path:
+
+.. code::
+
+ +--------------------+
+ | incoming packet |
+ +--------------------+
+ |
+ V
+ +--------------------+
+ | SO_J1939_PROMISC? |
+ +--------------------+
+ | |
+ no | | yes
+ | |
+ .---------' `---------.
+ | |
+ +---------------------------+ |
+ | bind() + connect() + | |
+ | SOCK_BROADCAST filter | |
+ +---------------------------+ |
+ | |
+ |<---------------------'
+ V
+ +---------------------------+
+ | SO_J1939_FILTER |
+ +---------------------------+
+ |
+ V
+ +---------------------------+
+ | socket recv() |
+ +---------------------------+
+
+TX path related options:
+``SO_J1939_SEND_PRIO`` - change default send priority for the socket
+
+Message Flags during send() and Related System Calls
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``send(2)``, ``sendto(2)`` and ``sendmsg(2)`` take a 'flags' argument. Currently
+supported flags are:
+
+* ``MSG_DONTWAIT``, i.e. non-blocking operation.
+
+recvmsg(2)
+^^^^^^^^^^
+
+In most cases ``recvmsg(2)`` is needed if you want to extract more information than
+``recvfrom(2)`` can provide. For example package priority and timestamp. The
+Destination Address, name and packet priority (if applicable) are attached to
+the msghdr in the ``recvmsg(2)`` call. They can be extracted using ``cmsg(3)`` macros,
+with ``cmsg_level == SOL_J1939 && cmsg_type == SCM_J1939_DEST_ADDR``,
+``SCM_J1939_DEST_NAME`` or ``SCM_J1939_PRIO``. The returned data is a ``uint8_t`` for
+``priority`` and ``dst_addr``, and ``uint64_t`` for ``dst_name``.
+
+.. code-block:: C
+
+ uint8_t priority, dst_addr;
+ uint64_t dst_name;
+
+ for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+ switch (cmsg->cmsg_level) {
+ case SOL_CAN_J1939:
+ if (cmsg->cmsg_type == SCM_J1939_DEST_ADDR)
+ dst_addr = *CMSG_DATA(cmsg);
+ else if (cmsg->cmsg_type == SCM_J1939_DEST_NAME)
+ memcpy(&dst_name, CMSG_DATA(cmsg), cmsg->cmsg_len - CMSG_LEN(0));
+ else if (cmsg->cmsg_type == SCM_J1939_PRIO)
+ priority = *CMSG_DATA(cmsg);
+ break;
+ }
+ }
+
+Dynamic Addressing
+------------------
+
+Distinction has to be made between using the claimed address and doing an
+address claim. To use an already claimed address, one has to fill in the
+``j1939.name`` member and provide it to ``bind(2)``. If the name had claimed an address
+earlier, all further messages being sent will use that address. And the
+``j1939.addr`` member will be ignored.
+
+An exception on this is PGN 0x0ee00. This is the "Address Claim/Cannot Claim
+Address" message and the kernel will use the ``j1939.addr`` member for that PGN if
+necessary.
+
+To claim an address following code example can be used:
+
+.. code-block:: C
+
+ struct sockaddr_can baddr = {
+ .can_family = AF_CAN,
+ .can_addr.j1939 = {
+ .name = name,
+ .addr = J1939_IDLE_ADDR,
+ .pgn = J1939_NO_PGN, /* to disable bind() rx filter for PGN */
+ },
+ .can_ifindex = if_nametoindex("can0"),
+ };
+
+ bind(sock, (struct sockaddr *)&baddr, sizeof(baddr));
+
+ /* for Address Claiming broadcast must be allowed */
+ int value = 1;
+ setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &value, sizeof(value));
+
+ /* configured advanced RX filter with PGN needed for Address Claiming */
+ const struct j1939_filter filt[] = {
+ {
+ .pgn = J1939_PGN_ADDRESS_CLAIMED,
+ .pgn_mask = J1939_PGN_PDU1_MAX,
+ }, {
+ .pgn = J1939_PGN_REQUEST,
+ .pgn_mask = J1939_PGN_PDU1_MAX,
+ }, {
+ .pgn = J1939_PGN_ADDRESS_COMMANDED,
+ .pgn_mask = J1939_PGN_MAX,
+ },
+ };
+
+ setsockopt(sock, SOL_CAN_J1939, SO_J1939_FILTER, &filt, sizeof(filt));
+
+ uint64_t dat = htole64(name);
+ const struct sockaddr_can saddr = {
+ .can_family = AF_CAN,
+ .can_addr.j1939 = {
+ .pgn = J1939_PGN_ADDRESS_CLAIMED,
+ .addr = J1939_NO_ADDR,
+ },
+ };
+
+ /* Afterwards do a sendto(2) with data set to the NAME (Little Endian). If the
+ * NAME provided, does not match the j1939.name provided to bind(2), EPROTO
+ * will be returned.
+ */
+ sendto(sock, dat, sizeof(dat), 0, (const struct sockaddr *)&saddr, sizeof(saddr));
+
+If no-one else contests the address claim within 250ms after transmission, the
+kernel marks the NAME-SA assignment as valid. The valid assignment will be kept
+among other valid NAME-SA assignments. From that point, any socket bound to the
+NAME can send packets.
+
+If another ECU claims the address, the kernel will mark the NAME-SA expired.
+No socket bound to the NAME can send packets (other than address claims). To
+claim another address, some socket bound to NAME, must ``bind(2)`` again, but with
+only ``j1939.addr`` changed to the new SA, and must then send a valid address claim
+packet. This restarts the state machine in the kernel (and any other
+participant on the bus) for this NAME.
+
+``can-utils`` also include the ``j1939acd`` tool, so it can be used as code example or as
+default Address Claiming daemon.
+
+Send Examples
+-------------
+
+Static Addressing
+^^^^^^^^^^^^^^^^^
+
+This example will send a PGN (0x12300) from SA 0x20 to DA 0x30.
+
+Bind:
+
+.. code-block:: C
+
+ struct sockaddr_can baddr = {
+ .can_family = AF_CAN,
+ .can_addr.j1939 = {
+ .name = J1939_NO_NAME,
+ .addr = 0x20,
+ .pgn = J1939_NO_PGN,
+ },
+ .can_ifindex = if_nametoindex("can0"),
+ };
+
+ bind(sock, (struct sockaddr *)&baddr, sizeof(baddr));
+
+Now, the socket 'sock' is bound to the SA 0x20. Since no ``connect(2)`` was called,
+at this point we can use only ``sendto(2)`` or ``sendmsg(2)``.
+
+Send:
+
+.. code-block:: C
+
+ const struct sockaddr_can saddr = {
+ .can_family = AF_CAN,
+ .can_addr.j1939 = {
+ .name = J1939_NO_NAME;
+ .addr = 0x30,
+ .pgn = 0x12300,
+ },
+ };
+
+ sendto(sock, dat, sizeof(dat), 0, (const struct sockaddr *)&saddr, sizeof(saddr));
diff --git a/Documentation/networking/kapi.rst b/Documentation/networking/kapi.rst
new file mode 100644
index 0000000000..ea55f462ce
--- /dev/null
+++ b/Documentation/networking/kapi.rst
@@ -0,0 +1,159 @@
+=========================================
+Linux Networking and Network Devices APIs
+=========================================
+
+Linux Networking
+================
+
+Networking Base Types
+---------------------
+
+.. kernel-doc:: include/linux/net.h
+ :internal:
+
+Socket Buffer Functions
+-----------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :internal:
+
+.. kernel-doc:: include/net/sock.h
+ :internal:
+
+.. kernel-doc:: net/socket.c
+ :export:
+
+.. kernel-doc:: net/core/skbuff.c
+ :export:
+
+.. kernel-doc:: net/core/sock.c
+ :export:
+
+.. kernel-doc:: net/core/datagram.c
+ :export:
+
+.. kernel-doc:: net/core/stream.c
+ :export:
+
+Socket Filter
+-------------
+
+.. kernel-doc:: net/core/filter.c
+ :export:
+
+Generic Network Statistics
+--------------------------
+
+.. kernel-doc:: include/uapi/linux/gen_stats.h
+ :internal:
+
+.. kernel-doc:: net/core/gen_stats.c
+ :export:
+
+.. kernel-doc:: net/core/gen_estimator.c
+ :export:
+
+SUN RPC subsystem
+-----------------
+
+.. kernel-doc:: net/sunrpc/xdr.c
+ :export:
+
+.. kernel-doc:: net/sunrpc/svc_xprt.c
+ :export:
+
+.. kernel-doc:: net/sunrpc/xprt.c
+ :export:
+
+.. kernel-doc:: net/sunrpc/sched.c
+ :export:
+
+.. kernel-doc:: net/sunrpc/socklib.c
+ :export:
+
+.. kernel-doc:: net/sunrpc/stats.c
+ :export:
+
+.. kernel-doc:: net/sunrpc/rpc_pipe.c
+ :export:
+
+.. kernel-doc:: net/sunrpc/rpcb_clnt.c
+ :export:
+
+.. kernel-doc:: net/sunrpc/clnt.c
+ :export:
+
+Network device support
+======================
+
+Driver Support
+--------------
+
+.. kernel-doc:: net/core/dev.c
+ :export:
+
+.. kernel-doc:: net/ethernet/eth.c
+ :export:
+
+.. kernel-doc:: net/sched/sch_generic.c
+ :export:
+
+.. kernel-doc:: include/linux/etherdevice.h
+ :internal:
+
+.. kernel-doc:: include/linux/netdevice.h
+ :internal:
+
+PHY Support
+-----------
+
+.. kernel-doc:: drivers/net/phy/phy.c
+ :export:
+
+.. kernel-doc:: drivers/net/phy/phy.c
+ :internal:
+
+.. kernel-doc:: drivers/net/phy/phy-core.c
+ :export:
+
+.. kernel-doc:: drivers/net/phy/phy-c45.c
+ :export:
+
+.. kernel-doc:: include/linux/phy.h
+ :internal:
+
+.. kernel-doc:: drivers/net/phy/phy_device.c
+ :export:
+
+.. kernel-doc:: drivers/net/phy/phy_device.c
+ :internal:
+
+.. kernel-doc:: drivers/net/phy/mdio_bus.c
+ :export:
+
+.. kernel-doc:: drivers/net/phy/mdio_bus.c
+ :internal:
+
+PHYLINK
+-------
+
+ PHYLINK interfaces traditional network drivers with PHYLIB, fixed-links,
+ and SFF modules (eg, hot-pluggable SFP) that may contain PHYs. PHYLINK
+ provides management of the link state and link modes.
+
+.. kernel-doc:: include/linux/phylink.h
+ :internal:
+
+.. kernel-doc:: drivers/net/phy/phylink.c
+
+SFP support
+-----------
+
+.. kernel-doc:: drivers/net/phy/sfp-bus.c
+ :internal:
+
+.. kernel-doc:: include/linux/sfp.h
+ :internal:
+
+.. kernel-doc:: drivers/net/phy/sfp-bus.c
+ :export:
diff --git a/Documentation/networking/kcm.rst b/Documentation/networking/kcm.rst
new file mode 100644
index 0000000000..db0f5560ac
--- /dev/null
+++ b/Documentation/networking/kcm.rst
@@ -0,0 +1,290 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Kernel Connection Multiplexor
+=============================
+
+Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
+interface over TCP for generic application protocols. With KCM an application
+can efficiently send and receive application protocol messages over TCP using
+datagram sockets.
+
+KCM implements an NxM multiplexor in the kernel as diagrammed below::
+
+ +------------+ +------------+ +------------+ +------------+
+ | KCM socket | | KCM socket | | KCM socket | | KCM socket |
+ +------------+ +------------+ +------------+ +------------+
+ | | | |
+ +-----------+ | | +----------+
+ | | | |
+ +----------------------------------+
+ | Multiplexor |
+ +----------------------------------+
+ | | | | |
+ +---------+ | | | ------------+
+ | | | | |
+ +----------+ +----------+ +----------+ +----------+ +----------+
+ | Psock | | Psock | | Psock | | Psock | | Psock |
+ +----------+ +----------+ +----------+ +----------+ +----------+
+ | | | | |
+ +----------+ +----------+ +----------+ +----------+ +----------+
+ | TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock |
+ +----------+ +----------+ +----------+ +----------+ +----------+
+
+KCM sockets
+===========
+
+The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
+bound to a multiplexor are considered to have equivalent function, and I/O
+operations in different sockets may be done in parallel without the need for
+synchronization between threads in userspace.
+
+Multiplexor
+===========
+
+The multiplexor provides the message steering. In the transmit path, messages
+written on a KCM socket are sent atomically on an appropriate TCP socket.
+Similarly, in the receive path, messages are constructed on each TCP socket
+(Psock) and complete messages are steered to a KCM socket.
+
+TCP sockets & Psocks
+====================
+
+TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
+for each bound TCP socket, this structure holds the state for constructing
+messages on receive as well as other connection specific information for KCM.
+
+Connected mode semantics
+========================
+
+Each multiplexor assumes that all attached TCP connections are to the same
+destination and can use the different connections for load balancing when
+transmitting. The normal send and recv calls (include sendmmsg and recvmmsg)
+can be used to send and receive messages from the KCM socket.
+
+Socket types
+============
+
+KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
+
+Message delineation
+-------------------
+
+Messages are sent over a TCP stream with some application protocol message
+format that typically includes a header which frames the messages. The length
+of a received message can be deduced from the application protocol header
+(often just a simple length field).
+
+A TCP stream must be parsed to determine message boundaries. Berkeley Packet
+Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
+BPF program must be specified. The program is called at the start of receiving
+a new message and is given an skbuff that contains the bytes received so far.
+It parses the message header and returns the length of the message. Given this
+information, KCM will construct the message of the stated length and deliver it
+to a KCM socket.
+
+TCP socket management
+---------------------
+
+When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and
+write space available (POLLOUT) events are handled by the multiplexor. If there
+is a state change (disconnection) or other error on a TCP socket, an error is
+posted on the TCP socket so that a POLLERR event happens and KCM discontinues
+using the socket. When the application gets the error notification for a
+TCP socket, it should unattach the socket from KCM and then handle the error
+condition (the typical response is to close the socket and create a new
+connection if necessary).
+
+KCM limits the maximum receive message size to be the size of the receive
+socket buffer on the attached TCP socket (the socket buffer size can be set by
+SO_RCVBUF). If the length of a new message reported by the BPF program is
+greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP
+socket. The BPF program may also enforce a maximum messages size and report an
+error when it is exceeded.
+
+A timeout may be set for assembling messages on a receive socket. The timeout
+value is taken from the receive timeout of the attached TCP socket (this is set
+by SO_RCVTIMEO). If the timer expires before assembly is complete an error
+(ETIMEDOUT) is posted on the socket.
+
+User interface
+==============
+
+Creating a multiplexor
+----------------------
+
+A new multiplexor and initial KCM socket is created by a socket call::
+
+ socket(AF_KCM, type, protocol)
+
+- type is either SOCK_DGRAM or SOCK_SEQPACKET
+- protocol is KCMPROTO_CONNECTED
+
+Cloning KCM sockets
+-------------------
+
+After the first KCM socket is created using the socket call as described
+above, additional sockets for the multiplexor can be created by cloning
+a KCM socket. This is accomplished by an ioctl on a KCM socket::
+
+ /* From linux/kcm.h */
+ struct kcm_clone {
+ int fd;
+ };
+
+ struct kcm_clone info;
+
+ memset(&info, 0, sizeof(info));
+
+ err = ioctl(kcmfd, SIOCKCMCLONE, &info);
+
+ if (!err)
+ newkcmfd = info.fd;
+
+Attach transport sockets
+------------------------
+
+Attaching of transport sockets to a multiplexor is performed by calling an
+ioctl on a KCM socket for the multiplexor. e.g.::
+
+ /* From linux/kcm.h */
+ struct kcm_attach {
+ int fd;
+ int bpf_fd;
+ };
+
+ struct kcm_attach info;
+
+ memset(&info, 0, sizeof(info));
+
+ info.fd = tcpfd;
+ info.bpf_fd = bpf_prog_fd;
+
+ ioctl(kcmfd, SIOCKCMATTACH, &info);
+
+The kcm_attach structure contains:
+
+ - fd: file descriptor for TCP socket being attached
+ - bpf_prog_fd: file descriptor for compiled BPF program downloaded
+
+Unattach transport sockets
+--------------------------
+
+Unattaching a transport socket from a multiplexor is straightforward. An
+"unattach" ioctl is done with the kcm_unattach structure as the argument::
+
+ /* From linux/kcm.h */
+ struct kcm_unattach {
+ int fd;
+ };
+
+ struct kcm_unattach info;
+
+ memset(&info, 0, sizeof(info));
+
+ info.fd = cfd;
+
+ ioctl(fd, SIOCKCMUNATTACH, &info);
+
+Disabling receive on KCM socket
+-------------------------------
+
+A setsockopt is used to disable or enable receiving on a KCM socket.
+When receive is disabled, any pending messages in the socket's
+receive buffer are moved to other sockets. This feature is useful
+if an application thread knows that it will be doing a lot of
+work on a request and won't be able to service new messages for a
+while. Example use::
+
+ int val = 1;
+
+ setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val))
+
+BFP programs for message delineation
+------------------------------------
+
+BPF programs can be compiled using the BPF LLVM backend. For example,
+the BPF program for parsing Thrift is::
+
+ #include "bpf.h" /* for __sk_buff */
+ #include "bpf_helpers.h" /* for load_word intrinsic */
+
+ SEC("socket_kcm")
+ int bpf_prog1(struct __sk_buff *skb)
+ {
+ return load_word(skb, 0) + 4;
+ }
+
+ char _license[] SEC("license") = "GPL";
+
+Use in applications
+===================
+
+KCM accelerates application layer protocols. Specifically, it allows
+applications to use a message based interface for sending and receiving
+messages. The kernel provides necessary assurances that messages are sent
+and received atomically. This relieves much of the burden applications have
+in mapping a message based protocol onto the TCP stream. KCM also make
+application layer messages a unit of work in the kernel for the purposes of
+steering and scheduling, which in turn allows a simpler networking model in
+multithreaded applications.
+
+Configurations
+--------------
+
+In an Nx1 configuration, KCM logically provides multiple socket handles
+to the same TCP connection. This allows parallelism between in I/O
+operations on the TCP socket (for instance copyin and copyout of data is
+parallelized). In an application, a KCM socket can be opened for each
+processing thread and inserted into the epoll (similar to how SO_REUSEPORT
+is used to allow multiple listener sockets on the same port).
+
+In a MxN configuration, multiple connections are established to the
+same destination. These are used for simple load balancing.
+
+Message batching
+----------------
+
+The primary purpose of KCM is load balancing between KCM sockets and hence
+threads in a nominal use case. Perfect load balancing, that is steering
+each received message to a different KCM socket or steering each sent
+message to a different TCP socket, can negatively impact performance
+since this doesn't allow for affinities to be established. Balancing
+based on groups, or batches of messages, can be beneficial for performance.
+
+On transmit, there are three ways an application can batch (pipeline)
+messages on a KCM socket.
+
+ 1) Send multiple messages in a single sendmmsg.
+ 2) Send a group of messages each with a sendmsg call, where all messages
+ except the last have MSG_BATCH in the flags of sendmsg call.
+ 3) Create "super message" composed of multiple messages and send this
+ with a single sendmsg.
+
+On receive, the KCM module attempts to queue messages received on the
+same KCM socket during each TCP ready callback. The targeted KCM socket
+changes at each receive ready callback on the KCM socket. The application
+does not need to configure this.
+
+Error handling
+--------------
+
+An application should include a thread to monitor errors raised on
+the TCP connection. Normally, this will be done by placing each
+TCP socket attached to a KCM multiplexor in epoll set for POLLERR
+event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
+on the socket thus waking up the application thread. When the application
+sees the error (which may just be a disconnect) it should unattach the
+socket from KCM and then close it. It is assumed that once an error is
+posted on the TCP socket the data stream is unrecoverable (i.e. an error
+may have occurred in the middle of receiving a message).
+
+TCP connection monitoring
+-------------------------
+
+In KCM there is no means to correlate a message to the TCP socket that
+was used to send or receive the message (except in the case there is
+only one attached TCP socket). However, the application does retain
+an open file descriptor to the socket so it will be able to get statistics
+from the socket which can be used in detecting issues (such as high
+retransmissions on the socket).
diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst
new file mode 100644
index 0000000000..7f383e99db
--- /dev/null
+++ b/Documentation/networking/l2tp.rst
@@ -0,0 +1,677 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+L2TP
+====
+
+Layer 2 Tunneling Protocol (L2TP) allows L2 frames to be tunneled over
+an IP network.
+
+This document covers the kernel's L2TP subsystem. It documents kernel
+APIs for application developers who want to use the L2TP subsystem and
+it provides some technical details about the internal implementation
+which may be useful to kernel developers and maintainers.
+
+Overview
+========
+
+The kernel's L2TP subsystem implements the datapath for L2TPv2 and
+L2TPv3. L2TPv2 is carried over UDP. L2TPv3 is carried over UDP or
+directly over IP (protocol 115).
+
+The L2TP RFCs define two basic kinds of L2TP packets: control packets
+(the "control plane"), and data packets (the "data plane"). The kernel
+deals only with data packets. The more complex control packets are
+handled by user space.
+
+An L2TP tunnel carries one or more L2TP sessions. Each tunnel is
+associated with a socket. Each session is associated with a virtual
+netdevice, e.g. ``pppN``, ``l2tpethN``, through which data frames pass
+to/from L2TP. Fields in the L2TP header identify the tunnel or session
+and whether it is a control or data packet. When tunnels and sessions
+are set up using the Linux kernel API, we're just setting up the L2TP
+data path. All aspects of the control protocol are to be handled by
+user space.
+
+This split in responsibilities leads to a natural sequence of
+operations when establishing tunnels and sessions. The procedure looks
+like this:
+
+ 1) Create a tunnel socket. Exchange L2TP control protocol messages
+ with the peer over that socket in order to establish a tunnel.
+
+ 2) Create a tunnel context in the kernel, using information
+ obtained from the peer using the control protocol messages.
+
+ 3) Exchange L2TP control protocol messages with the peer over the
+ tunnel socket in order to establish a session.
+
+ 4) Create a session context in the kernel using information
+ obtained from the peer using the control protocol messages.
+
+L2TP APIs
+=========
+
+This section documents each userspace API of the L2TP subsystem.
+
+Tunnel Sockets
+--------------
+
+L2TPv2 always uses UDP. L2TPv3 may use UDP or IP encapsulation.
+
+To create a tunnel socket for use by L2TP, the standard POSIX
+socket API is used.
+
+For example, for a tunnel using IPv4 addresses and UDP encapsulation::
+
+ int sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
+
+Or for a tunnel using IPv6 addresses and IP encapsulation::
+
+ int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
+
+UDP socket programming doesn't need to be covered here.
+
+IPPROTO_L2TP is an IP protocol type implemented by the kernel's L2TP
+subsystem. The L2TPIP socket address is defined in struct
+sockaddr_l2tpip and struct sockaddr_l2tpip6 at
+`include/uapi/linux/l2tp.h`_. The address includes the L2TP tunnel
+(connection) id. To use L2TP IP encapsulation, an L2TPv3 application
+should bind the L2TPIP socket using the locally assigned
+tunnel id. When the peer's tunnel id and IP address is known, a
+connect must be done.
+
+If the L2TP application needs to handle L2TPv3 tunnel setup requests
+from peers using L2TPIP, it must open a dedicated L2TPIP
+socket to listen for those requests and bind the socket using tunnel
+id 0 since tunnel setup requests are addressed to tunnel id 0.
+
+An L2TP tunnel and all of its sessions are automatically closed when
+its tunnel socket is closed.
+
+Netlink API
+-----------
+
+L2TP applications use netlink to manage L2TP tunnel and session
+instances in the kernel. The L2TP netlink API is defined in
+`include/uapi/linux/l2tp.h`_.
+
+L2TP uses `Generic Netlink`_ (GENL). Several commands are defined:
+Create, Delete, Modify and Get for tunnel and session
+instances, e.g. ``L2TP_CMD_TUNNEL_CREATE``. The API header lists the
+netlink attribute types that can be used with each command.
+
+Tunnel and session instances are identified by a locally unique
+32-bit id. L2TP tunnel ids are given by ``L2TP_ATTR_CONN_ID`` and
+``L2TP_ATTR_PEER_CONN_ID`` attributes and L2TP session ids are given
+by ``L2TP_ATTR_SESSION_ID`` and ``L2TP_ATTR_PEER_SESSION_ID``
+attributes. If netlink is used to manage L2TPv2 tunnel and session
+instances, the L2TPv2 16-bit tunnel/session id is cast to a 32-bit
+value in these attributes.
+
+In the ``L2TP_CMD_TUNNEL_CREATE`` command, ``L2TP_ATTR_FD`` tells the
+kernel the tunnel socket fd being used. If not specified, the kernel
+creates a kernel socket for the tunnel, using IP parameters set in
+``L2TP_ATTR_IP[6]_SADDR``, ``L2TP_ATTR_IP[6]_DADDR``,
+``L2TP_ATTR_UDP_SPORT``, ``L2TP_ATTR_UDP_DPORT`` attributes. Kernel
+sockets are used to implement unmanaged L2TPv3 tunnels (iproute2's "ip
+l2tp" commands). If ``L2TP_ATTR_FD`` is given, it must be a socket fd
+that is already bound and connected. There is more information about
+unmanaged tunnels later in this document.
+
+``L2TP_CMD_TUNNEL_CREATE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Sets the tunnel (connection) id.
+PEER_CONN_ID Y Sets the peer tunnel (connection) id.
+PROTO_VERSION Y Protocol version. 2 or 3.
+ENCAP_TYPE Y Encapsulation type: UDP or IP.
+FD N Tunnel socket file descriptor.
+UDP_CSUM N Enable IPv4 UDP checksums. Used only if FD is
+ not set.
+UDP_ZERO_CSUM6_TX N Zero IPv6 UDP checksum on transmit. Used only
+ if FD is not set.
+UDP_ZERO_CSUM6_RX N Zero IPv6 UDP checksum on receive. Used only if
+ FD is not set.
+IP_SADDR N IPv4 source address. Used only if FD is not
+ set.
+IP_DADDR N IPv4 destination address. Used only if FD is
+ not set.
+UDP_SPORT N UDP source port. Used only if FD is not set.
+UDP_DPORT N UDP destination port. Used only if FD is not
+ set.
+IP6_SADDR N IPv6 source address. Used only if FD is not
+ set.
+IP6_DADDR N IPv6 destination address. Used only if FD is
+ not set.
+DEBUG N Debug flags.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_DESTROY`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the tunnel id to be destroyed.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_MODIFY`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the tunnel id to be modified.
+DEBUG N Debug flags.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_GET`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID N Identifies the tunnel id to be queried.
+ Ignored in DUMP requests.
+================== ======== ===
+
+``L2TP_CMD_SESSION_CREATE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y The parent tunnel id.
+SESSION_ID Y Sets the session id.
+PEER_SESSION_ID Y Sets the parent session id.
+PW_TYPE Y Sets the pseudowire type.
+DEBUG N Debug flags.
+RECV_SEQ N Enable rx data sequence numbers.
+SEND_SEQ N Enable tx data sequence numbers.
+LNS_MODE N Enable LNS mode (auto-enable data sequence
+ numbers).
+RECV_TIMEOUT N Timeout to wait when reordering received
+ packets.
+L2SPEC_TYPE N Sets layer2-specific-sublayer type (L2TPv3
+ only).
+COOKIE N Sets optional cookie (L2TPv3 only).
+PEER_COOKIE N Sets optional peer cookie (L2TPv3 only).
+IFNAME N Sets interface name (L2TPv3 only).
+================== ======== ===
+
+For Ethernet session types, this will create an l2tpeth virtual
+interface which can then be configured as required. For PPP session
+types, a PPPoL2TP socket must also be opened and connected, mapping it
+onto the new session. This is covered in "PPPoL2TP Sockets" later.
+
+``L2TP_CMD_SESSION_DESTROY`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the parent tunnel id of the session
+ to be destroyed.
+SESSION_ID Y Identifies the session id to be destroyed.
+IFNAME N Identifies the session by interface name. If
+ set, this overrides any CONN_ID and SESSION_ID
+ attributes. Currently supported for L2TPv3
+ Ethernet sessions only.
+================== ======== ===
+
+``L2TP_CMD_SESSION_MODIFY`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the parent tunnel id of the session
+ to be modified.
+SESSION_ID Y Identifies the session id to be modified.
+IFNAME N Identifies the session by interface name. If
+ set, this overrides any CONN_ID and SESSION_ID
+ attributes. Currently supported for L2TPv3
+ Ethernet sessions only.
+DEBUG N Debug flags.
+RECV_SEQ N Enable rx data sequence numbers.
+SEND_SEQ N Enable tx data sequence numbers.
+LNS_MODE N Enable LNS mode (auto-enable data sequence
+ numbers).
+RECV_TIMEOUT N Timeout to wait when reordering received
+ packets.
+================== ======== ===
+
+``L2TP_CMD_SESSION_GET`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID N Identifies the tunnel id to be queried.
+ Ignored for DUMP requests.
+SESSION_ID N Identifies the session id to be queried.
+ Ignored for DUMP requests.
+IFNAME N Identifies the session by interface name.
+ If set, this overrides any CONN_ID and
+ SESSION_ID attributes. Ignored for DUMP
+ requests. Currently supported for L2TPv3
+ Ethernet sessions only.
+================== ======== ===
+
+Application developers should refer to `include/uapi/linux/l2tp.h`_ for
+netlink command and attribute definitions.
+
+Sample userspace code using libmnl_:
+
+ - Open L2TP netlink socket::
+
+ struct nl_sock *nl_sock;
+ int l2tp_nl_family_id;
+
+ nl_sock = nl_socket_alloc();
+ genl_connect(nl_sock);
+ genl_id = genl_ctrl_resolve(nl_sock, L2TP_GENL_NAME);
+
+ - Create a tunnel::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_TUNNEL_CREATE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_FD, tunl_sock_fd);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_CONN_ID, peer_tid);
+ mnl_attr_put_u8(nlh, L2TP_ATTR_PROTO_VERSION, protocol_version);
+ mnl_attr_put_u16(nlh, L2TP_ATTR_ENCAP_TYPE, encap);
+
+ - Create a session::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_SESSION_CREATE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_CONN_ID, peer_tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_SESSION_ID, sid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_SESSION_ID, peer_sid);
+ mnl_attr_put_u16(nlh, L2TP_ATTR_PW_TYPE, pwtype);
+ /* there are other session options which can be set using netlink
+ * attributes during session creation -- see l2tp.h
+ */
+
+ - Delete a session::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_SESSION_DELETE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_SESSION_ID, sid);
+
+ - Delete a tunnel and all of its sessions (if any)::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_TUNNEL_DELETE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+
+PPPoL2TP Session Socket API
+---------------------------
+
+For PPP session types, a PPPoL2TP socket must be opened and connected
+to the L2TP session.
+
+When creating PPPoL2TP sockets, the application provides information
+to the kernel about the tunnel and session in a socket connect()
+call. Source and destination tunnel and session ids are provided, as
+well as the file descriptor of a UDP or L2TPIP socket. See struct
+pppol2tp_addr in `include/linux/if_pppol2tp.h`_. For historical reasons,
+there are unfortunately slightly different address structures for
+L2TPv2/L2TPv3 IPv4/IPv6 tunnels and userspace must use the appropriate
+structure that matches the tunnel socket type.
+
+Userspace may control behavior of the tunnel or session using
+setsockopt and ioctl on the PPPoX socket. The following socket
+options are supported:-
+
+========= ===========================================================
+DEBUG bitmask of debug message categories. See below.
+SENDSEQ - 0 => don't send packets with sequence numbers
+ - 1 => send packets with sequence numbers
+RECVSEQ - 0 => receive packet sequence numbers are optional
+ - 1 => drop receive packets without sequence numbers
+LNSMODE - 0 => act as LAC.
+ - 1 => act as LNS.
+REORDERTO reorder timeout (in millisecs). If 0, don't try to reorder.
+========= ===========================================================
+
+In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided
+to retrieve tunnel and session statistics from the kernel using the
+PPPoX socket of the appropriate tunnel or session.
+
+Sample userspace code:
+
+ - Create session PPPoX data socket::
+
+ struct sockaddr_pppol2tp sax;
+ int fd;
+
+ /* Note, the tunnel socket must be bound already, else it
+ * will not be ready
+ */
+ sax.sa_family = AF_PPPOX;
+ sax.sa_protocol = PX_PROTO_OL2TP;
+ sax.pppol2tp.fd = tunnel_fd;
+ sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
+ sax.pppol2tp.addr.sin_port = addr->sin_port;
+ sax.pppol2tp.addr.sin_family = AF_INET;
+ sax.pppol2tp.s_tunnel = tunnel_id;
+ sax.pppol2tp.s_session = session_id;
+ sax.pppol2tp.d_tunnel = peer_tunnel_id;
+ sax.pppol2tp.d_session = peer_session_id;
+
+ /* session_fd is the fd of the session's PPPoL2TP socket.
+ * tunnel_fd is the fd of the tunnel UDP / L2TPIP socket.
+ */
+ fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
+ if (fd < 0 ) {
+ return -errno;
+ }
+ return 0;
+
+Old L2TPv2-only API
+-------------------
+
+When L2TP was first added to the Linux kernel in 2.6.23, it
+implemented only L2TPv2 and did not include a netlink API. Instead,
+tunnel and session instances in the kernel were managed directly using
+only PPPoL2TP sockets. The PPPoL2TP socket is used as described in
+section "PPPoL2TP Session Socket API" but tunnel and session instances
+are automatically created on a connect() of the socket instead of
+being created by a separate netlink request:
+
+ - Tunnels are managed using a tunnel management socket which is a
+ dedicated PPPoL2TP socket, connected to (invalid) session
+ id 0. The L2TP tunnel instance is created when the PPPoL2TP
+ tunnel management socket is connected and is destroyed when the
+ socket is closed.
+
+ - Session instances are created in the kernel when a PPPoL2TP
+ socket is connected to a non-zero session id. Session parameters
+ are set using setsockopt. The L2TP session instance is destroyed
+ when the socket is closed.
+
+This API is still supported but its use is discouraged. Instead, new
+L2TPv2 applications should use netlink to first create the tunnel and
+session, then create a PPPoL2TP socket for the session.
+
+Unmanaged L2TPv3 tunnels
+------------------------
+
+The kernel L2TP subsystem also supports static (unmanaged) L2TPv3
+tunnels. Unmanaged tunnels have no userspace tunnel socket, and
+exchange no control messages with the peer to set up the tunnel; the
+tunnel is configured manually at each end of the tunnel. All
+configuration is done using netlink. There is no need for an L2TP
+userspace application in this case -- the tunnel socket is created by
+the kernel and configured using parameters sent in the
+``L2TP_CMD_TUNNEL_CREATE`` netlink request. The ``ip`` utility of
+``iproute2`` has commands for managing static L2TPv3 tunnels; do ``ip
+l2tp help`` for more information.
+
+Debugging
+---------
+
+The L2TP subsystem offers a range of debugging interfaces through the
+debugfs filesystem.
+
+To access these interfaces, the debugfs filesystem must first be mounted::
+
+ # mount -t debugfs debugfs /debug
+
+Files under the l2tp directory can then be accessed, providing a summary
+of the current population of tunnel and session contexts existing in the
+kernel::
+
+ # cat /debug/l2tp/tunnels
+
+The debugfs files should not be used by applications to obtain L2TP
+state information because the file format is subject to change. It is
+implemented to provide extra debug information to help diagnose
+problems. Applications should instead use the netlink API.
+
+In addition the L2TP subsystem implements tracepoints using the standard
+kernel event tracing API. The available L2TP events can be reviewed as
+follows::
+
+ # find /debug/tracing/events/l2tp
+
+Finally, /proc/net/pppol2tp is also provided for backwards compatibility
+with the original pppol2tp code. It lists information about L2TPv2
+tunnels and sessions only. Its use is discouraged.
+
+Internal Implementation
+=======================
+
+This section is for kernel developers and maintainers.
+
+Sockets
+-------
+
+UDP sockets are implemented by the networking core. When an L2TP
+tunnel is created using a UDP socket, the socket is set up as an
+encapsulated UDP socket by setting encap_rcv and encap_destroy
+callbacks on the UDP socket. l2tp_udp_encap_recv is called when
+packets are received on the socket. l2tp_udp_encap_destroy is called
+when userspace closes the socket.
+
+L2TPIP sockets are implemented in `net/l2tp/l2tp_ip.c`_ and
+`net/l2tp/l2tp_ip6.c`_.
+
+Tunnels
+-------
+
+The kernel keeps a struct l2tp_tunnel context per L2TP tunnel. The
+l2tp_tunnel is always associated with a UDP or L2TP/IP socket and
+keeps a list of sessions in the tunnel. When a tunnel is first
+registered with L2TP core, the reference count on the socket is
+increased. This ensures that the socket cannot be removed while L2TP's
+data structures reference it.
+
+Tunnels are identified by a unique tunnel id. The id is 16-bit for
+L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
+value.
+
+Tunnels are kept in a per-net list, indexed by tunnel id. The tunnel
+id namespace is shared by L2TPv2 and L2TPv3. The tunnel context can be
+derived from the socket's sk_user_data.
+
+Handling tunnel socket close is perhaps the most tricky part of the
+L2TP implementation. If userspace closes a tunnel socket, the L2TP
+tunnel and all of its sessions must be closed and destroyed. Since the
+tunnel context holds a ref on the tunnel socket, the socket's
+sk_destruct won't be called until the tunnel sock_put's its
+socket. For UDP sockets, when userspace closes the tunnel socket, the
+socket's encap_destroy handler is invoked, which L2TP uses to initiate
+its tunnel close actions. For L2TPIP sockets, the socket's close
+handler initiates the same tunnel close actions. All sessions are
+first closed. Each session drops its tunnel ref. When the tunnel ref
+reaches zero, the tunnel puts its socket ref. When the socket is
+eventually destroyed, its sk_destruct finally frees the L2TP tunnel
+context.
+
+Sessions
+--------
+
+The kernel keeps a struct l2tp_session context for each session. Each
+session has private data which is used for data specific to the
+session type. With L2TPv2, the session always carries PPP
+traffic. With L2TPv3, the session can carry Ethernet frames (Ethernet
+pseudowire) or other data types such as PPP, ATM, HDLC or Frame
+Relay. Linux currently implements only Ethernet and PPP session types.
+
+Some L2TP session types also have a socket (PPP pseudowires) while
+others do not (Ethernet pseudowires). We can't therefore use the
+socket reference count as the reference count for session
+contexts. The L2TP implementation therefore has its own internal
+reference counts on the session contexts.
+
+Like tunnels, L2TP sessions are identified by a unique
+session id. Just as with tunnel ids, the session id is 16-bit for
+L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
+value.
+
+Sessions hold a ref on their parent tunnel to ensure that the tunnel
+stays extant while one or more sessions references it.
+
+Sessions are kept in a per-tunnel list, indexed by session id. L2TPv3
+sessions are also kept in a per-net list indexed by session id,
+because L2TPv3 session ids are unique across all tunnels and L2TPv3
+data packets do not contain a tunnel id in the header. This list is
+therefore needed to find the session context associated with a
+received data packet when the tunnel context cannot be derived from
+the tunnel socket.
+
+Although the L2TPv3 RFC specifies that L2TPv3 session ids are not
+scoped by the tunnel, the kernel does not police this for L2TPv3 UDP
+tunnels and does not add sessions of L2TPv3 UDP tunnels into the
+per-net session list. In the UDP receive code, we must trust that the
+tunnel can be identified using the tunnel socket's sk_user_data and
+lookup the session in the tunnel's session list instead of the per-net
+session list.
+
+PPP
+---
+
+`net/l2tp/l2tp_ppp.c`_ implements the PPPoL2TP socket family. Each PPP
+session has a PPPoL2TP socket.
+
+The PPPoL2TP socket's sk_user_data references the l2tp_session.
+
+Userspace sends and receives PPP packets over L2TP using a PPPoL2TP
+socket. Only PPP control frames pass over this socket: PPP data
+packets are handled entirely by the kernel, passing between the L2TP
+session and its associated ``pppN`` netdev through the PPP channel
+interface of the kernel PPP subsystem.
+
+The L2TP PPP implementation handles the closing of a PPPoL2TP socket
+by closing its corresponding L2TP session. This is complicated because
+it must consider racing with netlink session create/destroy requests
+and pppol2tp_connect trying to reconnect with a session that is in the
+process of being closed. Unlike tunnels, PPP sessions do not hold a
+ref on their associated socket, so code must be careful to sock_hold
+the socket where necessary. For all the details, see commit
+3d609342cc04129ff7568e19316ce3d7451a27e8.
+
+Ethernet
+--------
+
+`net/l2tp/l2tp_eth.c`_ implements L2TPv3 Ethernet pseudowires. It
+manages a netdev for each session.
+
+L2TP Ethernet sessions are created and destroyed by netlink request,
+or are destroyed when the tunnel is destroyed. Unlike PPP sessions,
+Ethernet sessions do not have an associated socket.
+
+Miscellaneous
+=============
+
+RFCs
+----
+
+The kernel code implements the datapath features specified in the
+following RFCs:
+
+======= =============== ===================================
+RFC2661 L2TPv2 https://tools.ietf.org/html/rfc2661
+RFC3931 L2TPv3 https://tools.ietf.org/html/rfc3931
+RFC4719 L2TPv3 Ethernet https://tools.ietf.org/html/rfc4719
+======= =============== ===================================
+
+Implementations
+---------------
+
+A number of open source applications use the L2TP kernel subsystem:
+
+============ ==============================================
+iproute2 https://github.com/shemminger/iproute2
+go-l2tp https://github.com/katalix/go-l2tp
+tunneldigger https://github.com/wlanslovenija/tunneldigger
+xl2tpd https://github.com/xelerance/xl2tpd
+============ ==============================================
+
+Limitations
+-----------
+
+The current implementation has a number of limitations:
+
+ 1) Multiple UDP sockets with the same 5-tuple address cannot be
+ used. The kernel's tunnel context is identified using private
+ data associated with the socket so it is important that each
+ socket is uniquely identified by its address.
+
+ 2) Interfacing with openvswitch is not yet implemented. It may be
+ useful to map OVS Ethernet and VLAN ports into L2TPv3 tunnels.
+
+ 3) VLAN pseudowires are implemented using an ``l2tpethN`` interface
+ configured with a VLAN sub-interface. Since L2TPv3 VLAN
+ pseudowires carry one and only one VLAN, it may be better to use
+ a single netdevice rather than an ``l2tpethN`` and ``l2tpethN``:M
+ pair per VLAN session. The netlink attribute
+ ``L2TP_ATTR_VLAN_ID`` was added for this, but it was never
+ implemented.
+
+Testing
+-------
+
+Unmanaged L2TPv3 Ethernet features are tested by the kernel's built-in
+selftests. See `tools/testing/selftests/net/l2tp.sh`_.
+
+Another test suite, l2tp-ktest_, covers all
+of the L2TP APIs and tunnel/session types. This may be integrated into
+the kernel's built-in L2TP selftests in the future.
+
+.. Links
+.. _Generic Netlink: generic_netlink.html
+.. _libmnl: https://www.netfilter.org/projects/libmnl
+.. _include/uapi/linux/l2tp.h: ../../../include/uapi/linux/l2tp.h
+.. _include/linux/if_pppol2tp.h: ../../../include/linux/if_pppol2tp.h
+.. _net/l2tp/l2tp_ip.c: ../../../net/l2tp/l2tp_ip.c
+.. _net/l2tp/l2tp_ip6.c: ../../../net/l2tp/l2tp_ip6.c
+.. _net/l2tp/l2tp_ppp.c: ../../../net/l2tp/l2tp_ppp.c
+.. _net/l2tp/l2tp_eth.c: ../../../net/l2tp/l2tp_eth.c
+.. _tools/testing/selftests/net/l2tp.sh: ../../../tools/testing/selftests/net/l2tp.sh
+.. _l2tp-ktest: https://github.com/katalix/l2tp-ktest
diff --git a/Documentation/networking/lapb-module.rst b/Documentation/networking/lapb-module.rst
new file mode 100644
index 0000000000..ff586bc9f0
--- /dev/null
+++ b/Documentation/networking/lapb-module.rst
@@ -0,0 +1,305 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+The Linux LAPB Module Interface
+===============================
+
+Version 1.3
+
+Jonathan Naylor 29.12.96
+
+Changed (Henner Eisen, 2000-10-29): int return value for data_indication()
+
+The LAPB module will be a separately compiled module for use by any parts of
+the Linux operating system that require a LAPB service. This document
+defines the interfaces to, and the services provided by this module. The
+term module in this context does not imply that the LAPB module is a
+separately loadable module, although it may be. The term module is used in
+its more standard meaning.
+
+The interface to the LAPB module consists of functions to the module,
+callbacks from the module to indicate important state changes, and
+structures for getting and setting information about the module.
+
+Structures
+----------
+
+Probably the most important structure is the skbuff structure for holding
+received and transmitted data, however it is beyond the scope of this
+document.
+
+The two LAPB specific structures are the LAPB initialisation structure and
+the LAPB parameter structure. These will be defined in a standard header
+file, <linux/lapb.h>. The header file <net/lapb.h> is internal to the LAPB
+module and is not for use.
+
+LAPB Initialisation Structure
+-----------------------------
+
+This structure is used only once, in the call to lapb_register (see below).
+It contains information about the device driver that requires the services
+of the LAPB module::
+
+ struct lapb_register_struct {
+ void (*connect_confirmation)(int token, int reason);
+ void (*connect_indication)(int token, int reason);
+ void (*disconnect_confirmation)(int token, int reason);
+ void (*disconnect_indication)(int token, int reason);
+ int (*data_indication)(int token, struct sk_buff *skb);
+ void (*data_transmit)(int token, struct sk_buff *skb);
+ };
+
+Each member of this structure corresponds to a function in the device driver
+that is called when a particular event in the LAPB module occurs. These will
+be described in detail below. If a callback is not required (!!) then a NULL
+may be substituted.
+
+
+LAPB Parameter Structure
+------------------------
+
+This structure is used with the lapb_getparms and lapb_setparms functions
+(see below). They are used to allow the device driver to get and set the
+operational parameters of the LAPB implementation for a given connection::
+
+ struct lapb_parms_struct {
+ unsigned int t1;
+ unsigned int t1timer;
+ unsigned int t2;
+ unsigned int t2timer;
+ unsigned int n2;
+ unsigned int n2count;
+ unsigned int window;
+ unsigned int state;
+ unsigned int mode;
+ };
+
+T1 and T2 are protocol timing parameters and are given in units of 100ms. N2
+is the maximum number of tries on the link before it is declared a failure.
+The window size is the maximum number of outstanding data packets allowed to
+be unacknowledged by the remote end, the value of the window is between 1
+and 7 for a standard LAPB link, and between 1 and 127 for an extended LAPB
+link.
+
+The mode variable is a bit field used for setting (at present) three values.
+The bit fields have the following meanings:
+
+====== =================================================
+Bit Meaning
+====== =================================================
+0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED).
+1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP).
+2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE)
+3-31 Reserved, must be 0.
+====== =================================================
+
+Extended LAPB operation indicates the use of extended sequence numbers and
+consequently larger window sizes, the default is standard LAPB operation.
+MLP operation is the same as SLP operation except that the addresses used by
+LAPB are different to indicate the mode of operation, the default is Single
+Link Procedure. The difference between DCE and DTE operation is (i) the
+addresses used for commands and responses, and (ii) when the DCE is not
+connected, it sends DM without polls set, every T1. The upper case constant
+names will be defined in the public LAPB header file.
+
+
+Functions
+---------
+
+The LAPB module provides a number of function entry points.
+
+::
+
+ int lapb_register(void *token, struct lapb_register_struct);
+
+This must be called before the LAPB module may be used. If the call is
+successful then LAPB_OK is returned. The token must be a unique identifier
+generated by the device driver to allow for the unique identification of the
+instance of the LAPB link. It is returned by the LAPB module in all of the
+callbacks, and is used by the device driver in all calls to the LAPB module.
+For multiple LAPB links in a single device driver, multiple calls to
+lapb_register must be made. The format of the lapb_register_struct is given
+above. The return values are:
+
+============= =============================
+LAPB_OK LAPB registered successfully.
+LAPB_BADTOKEN Token is already registered.
+LAPB_NOMEM Out of memory
+============= =============================
+
+::
+
+ int lapb_unregister(void *token);
+
+This releases all the resources associated with a LAPB link. Any current
+LAPB link will be abandoned without further messages being passed. After
+this call, the value of token is no longer valid for any calls to the LAPB
+function. The valid return values are:
+
+============= ===============================
+LAPB_OK LAPB unregistered successfully.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= ===============================
+
+::
+
+ int lapb_getparms(void *token, struct lapb_parms_struct *parms);
+
+This allows the device driver to get the values of the current LAPB
+variables, the lapb_parms_struct is described above. The valid return values
+are:
+
+============= =============================
+LAPB_OK LAPB getparms was successful.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= =============================
+
+::
+
+ int lapb_setparms(void *token, struct lapb_parms_struct *parms);
+
+This allows the device driver to set the values of the current LAPB
+variables, the lapb_parms_struct is described above. The values of t1timer,
+t2timer and n2count are ignored, likewise changing the mode bits when
+connected will be ignored. An error implies that none of the values have
+been changed. The valid return values are:
+
+============= =================================================
+LAPB_OK LAPB getparms was successful.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_INVALUE One of the values was out of its allowable range.
+============= =================================================
+
+::
+
+ int lapb_connect_request(void *token);
+
+Initiate a connect using the current parameter settings. The valid return
+values are:
+
+============== =================================
+LAPB_OK LAPB is starting to connect.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_CONNECTED LAPB module is already connected.
+============== =================================
+
+::
+
+ int lapb_disconnect_request(void *token);
+
+Initiate a disconnect. The valid return values are:
+
+================= ===============================
+LAPB_OK LAPB is starting to disconnect.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_NOTCONNECTED LAPB module is not connected.
+================= ===============================
+
+::
+
+ int lapb_data_request(void *token, struct sk_buff *skb);
+
+Queue data with the LAPB module for transmitting over the link. If the call
+is successful then the skbuff is owned by the LAPB module and may not be
+used by the device driver again. The valid return values are:
+
+================= =============================
+LAPB_OK LAPB has accepted the data.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_NOTCONNECTED LAPB module is not connected.
+================= =============================
+
+::
+
+ int lapb_data_received(void *token, struct sk_buff *skb);
+
+Queue data with the LAPB module which has been received from the device. It
+is expected that the data passed to the LAPB module has skb->data pointing
+to the beginning of the LAPB data. If the call is successful then the skbuff
+is owned by the LAPB module and may not be used by the device driver again.
+The valid return values are:
+
+============= ===========================
+LAPB_OK LAPB has accepted the data.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= ===========================
+
+Callbacks
+---------
+
+These callbacks are functions provided by the device driver for the LAPB
+module to call when an event occurs. They are registered with the LAPB
+module with lapb_register (see above) in the structure lapb_register_struct
+(see above).
+
+::
+
+ void (*connect_confirmation)(void *token, int reason);
+
+This is called by the LAPB module when a connection is established after
+being requested by a call to lapb_connect_request (see above). The reason is
+always LAPB_OK.
+
+::
+
+ void (*connect_indication)(void *token, int reason);
+
+This is called by the LAPB module when the link is established by the remote
+system. The value of reason is always LAPB_OK.
+
+::
+
+ void (*disconnect_confirmation)(void *token, int reason);
+
+This is called by the LAPB module when an event occurs after the device
+driver has called lapb_disconnect_request (see above). The reason indicates
+what has happened. In all cases the LAPB link can be regarded as being
+terminated. The values for reason are:
+
+================= ====================================================
+LAPB_OK The LAPB link was terminated normally.
+LAPB_NOTCONNECTED The remote system was not connected.
+LAPB_TIMEDOUT No response was received in N2 tries from the remote
+ system.
+================= ====================================================
+
+::
+
+ void (*disconnect_indication)(void *token, int reason);
+
+This is called by the LAPB module when the link is terminated by the remote
+system or another event has occurred to terminate the link. This may be
+returned in response to a lapb_connect_request (see above) if the remote
+system refused the request. The values for reason are:
+
+================= ====================================================
+LAPB_OK The LAPB link was terminated normally by the remote
+ system.
+LAPB_REFUSED The remote system refused the connect request.
+LAPB_NOTCONNECTED The remote system was not connected.
+LAPB_TIMEDOUT No response was received in N2 tries from the remote
+ system.
+================= ====================================================
+
+::
+
+ int (*data_indication)(void *token, struct sk_buff *skb);
+
+This is called by the LAPB module when data has been received from the
+remote system that should be passed onto the next layer in the protocol
+stack. The skbuff becomes the property of the device driver and the LAPB
+module will not perform any more actions on it. The skb->data pointer will
+be pointing to the first byte of data after the LAPB header.
+
+This method should return NET_RX_DROP (as defined in the header
+file include/linux/netdevice.h) if and only if the frame was dropped
+before it could be delivered to the upper layer.
+
+::
+
+ void (*data_transmit)(void *token, struct sk_buff *skb);
+
+This is called by the LAPB module when data is to be transmitted to the
+remote system by the device driver. The skbuff becomes the property of the
+device driver and the LAPB module will not perform any more actions on it.
+The skb->data pointer will be pointing to the first byte of the LAPB header.
diff --git a/Documentation/networking/mac80211-auth-assoc-deauth.txt b/Documentation/networking/mac80211-auth-assoc-deauth.txt
new file mode 100644
index 0000000000..d7a15fe91b
--- /dev/null
+++ b/Documentation/networking/mac80211-auth-assoc-deauth.txt
@@ -0,0 +1,95 @@
+#
+# This outlines the Linux authentication/association and
+# deauthentication/disassociation flows.
+#
+# This can be converted into a diagram using the service
+# at http://www.websequencediagrams.com/
+#
+
+participant userspace
+participant mac80211
+participant driver
+
+alt authentication needed (not FT)
+userspace->mac80211: authenticate
+
+alt authenticated/authenticating already
+mac80211->driver: sta_state(AP, not-exists)
+mac80211->driver: bss_info_changed(clear BSSID)
+else associated
+note over mac80211,driver
+like deauth/disassoc, without sending the
+BA session stop & deauth/disassoc frames
+end note
+end
+
+mac80211->driver: config(channel, channel type)
+mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap)
+mac80211->driver: sta_state(AP, exists)
+
+alt no probe request data known
+mac80211->driver: TX directed probe request
+driver->mac80211: RX probe response
+end
+
+mac80211->driver: TX auth frame
+driver->mac80211: RX auth frame
+
+alt WEP shared key auth
+mac80211->driver: TX auth frame
+driver->mac80211: RX auth frame
+end
+
+mac80211->driver: sta_state(AP, authenticated)
+mac80211->userspace: RX auth frame
+
+end
+
+userspace->mac80211: associate
+alt authenticated or associated
+note over mac80211,driver: cleanup like for authenticate
+end
+
+alt not previously authenticated (FT)
+mac80211->driver: config(channel, channel type)
+mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap)
+mac80211->driver: sta_state(AP, exists)
+mac80211->driver: sta_state(AP, authenticated)
+end
+mac80211->driver: TX assoc
+driver->mac80211: RX assoc response
+note over mac80211: init rate control
+mac80211->driver: sta_state(AP, associated)
+
+alt not using WPA
+mac80211->driver: sta_state(AP, authorized)
+end
+
+mac80211->driver: set up QoS parameters
+
+mac80211->driver: bss_info_changed(QoS, HT, associated with AID)
+mac80211->userspace: associated
+
+note left of userspace: associated now
+
+alt using WPA
+note over userspace
+do 4-way-handshake
+(data frames)
+end note
+userspace->mac80211: authorized
+mac80211->driver: sta_state(AP, authorized)
+end
+
+userspace->mac80211: deauthenticate/disassociate
+mac80211->driver: stop BA sessions
+mac80211->driver: TX deauth/disassoc
+mac80211->driver: flush frames
+mac80211->driver: sta_state(AP,associated)
+mac80211->driver: sta_state(AP,authenticated)
+mac80211->driver: sta_state(AP,exists)
+mac80211->driver: sta_state(AP,not-exists)
+mac80211->driver: turn off powersave
+mac80211->driver: bss_info_changed(clear BSSID, not associated, no QoS, ...)
+mac80211->driver: config(channel type to non-HT)
+mac80211->userspace: disconnected
diff --git a/Documentation/networking/mac80211-injection.rst b/Documentation/networking/mac80211-injection.rst
new file mode 100644
index 0000000000..63ba6611fd
--- /dev/null
+++ b/Documentation/networking/mac80211-injection.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+How to use packet injection with mac80211
+=========================================
+
+mac80211 now allows arbitrary packets to be injected down any Monitor Mode
+interface from userland. The packet you inject needs to be composed in the
+following format::
+
+ [ radiotap header ]
+ [ ieee80211 header ]
+ [ payload ]
+
+The radiotap format is discussed in
+./Documentation/networking/radiotap-headers.rst.
+
+Despite many radiotap parameters being currently defined, most only make sense
+to appear on received packets. The following information is parsed from the
+radiotap headers and used to control injection:
+
+ * IEEE80211_RADIOTAP_FLAGS
+
+ ========================= ===========================================
+ IEEE80211_RADIOTAP_F_FCS FCS will be removed and recalculated
+ IEEE80211_RADIOTAP_F_WEP frame will be encrypted if key available
+ IEEE80211_RADIOTAP_F_FRAG frame will be fragmented if longer than the
+ current fragmentation threshold.
+ ========================= ===========================================
+
+ * IEEE80211_RADIOTAP_TX_FLAGS
+
+ ============================= ========================================
+ IEEE80211_RADIOTAP_F_TX_NOACK frame should be sent without waiting for
+ an ACK even if it is a unicast frame
+ ============================= ========================================
+
+ * IEEE80211_RADIOTAP_RATE
+
+ legacy rate for the transmission (only for devices without own rate control)
+
+ * IEEE80211_RADIOTAP_MCS
+
+ HT rate for the transmission (only for devices without own rate control).
+ Also some flags are parsed
+
+ ============================ ========================
+ IEEE80211_RADIOTAP_MCS_SGI use short guard interval
+ IEEE80211_RADIOTAP_MCS_BW_40 send in HT40 mode
+ ============================ ========================
+
+ * IEEE80211_RADIOTAP_DATA_RETRIES
+
+ number of retries when either IEEE80211_RADIOTAP_RATE or
+ IEEE80211_RADIOTAP_MCS was used
+
+ * IEEE80211_RADIOTAP_VHT
+
+ VHT mcs and number of streams used in the transmission (only for devices
+ without own rate control). Also other fields are parsed
+
+ flags field
+ IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval
+
+ bandwidth field
+ * 1: send using 40MHz channel width
+ * 4: send using 80MHz channel width
+ * 11: send using 160MHz channel width
+
+The injection code can also skip all other currently defined radiotap fields
+facilitating replay of captured radiotap headers directly.
+
+Here is an example valid radiotap header defining some parameters::
+
+ 0x00, 0x00, // <-- radiotap version
+ 0x0b, 0x00, // <- radiotap header length
+ 0x04, 0x0c, 0x00, 0x00, // <-- bitmap
+ 0x6c, // <-- rate
+ 0x0c, //<-- tx power
+ 0x01 //<-- antenna
+
+The ieee80211 header follows immediately afterwards, looking for example like
+this::
+
+ 0x08, 0x01, 0x00, 0x00,
+ 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
+ 0x13, 0x22, 0x33, 0x44, 0x55, 0x66,
+ 0x13, 0x22, 0x33, 0x44, 0x55, 0x66,
+ 0x10, 0x86
+
+Then lastly there is the payload.
+
+After composing the packet contents, it is sent by send()-ing it to a logical
+mac80211 interface that is in Monitor mode. Libpcap can also be used,
+(which is easier than doing the work to bind the socket to the right
+interface), along the following lines:::
+
+ ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf);
+ ...
+ r = pcap_inject(ppcap, u8aSendBuffer, nLength);
+
+You can also find a link to a complete inject application here:
+
+https://wireless.wiki.kernel.org/en/users/Documentation/packetspammer
+
+Andy Green <andy@warmcat.com>
diff --git a/Documentation/networking/mac80211_hwsim/hostapd.conf b/Documentation/networking/mac80211_hwsim/hostapd.conf
new file mode 100644
index 0000000000..08cde7e35f
--- /dev/null
+++ b/Documentation/networking/mac80211_hwsim/hostapd.conf
@@ -0,0 +1,11 @@
+interface=wlan0
+driver=nl80211
+
+hw_mode=g
+channel=1
+ssid=mac80211 test
+
+wpa=2
+wpa_key_mgmt=WPA-PSK
+wpa_pairwise=CCMP
+wpa_passphrase=12345678
diff --git a/Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst b/Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst
new file mode 100644
index 0000000000..d2266ce553
--- /dev/null
+++ b/Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst
@@ -0,0 +1,80 @@
+:orphan:
+
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===================================================================
+mac80211_hwsim - software simulator of 802.11 radio(s) for mac80211
+===================================================================
+
+:Copyright: |copy| 2008, Jouni Malinen <j@w1.fi>
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License version 2 as
+published by the Free Software Foundation.
+
+
+Introduction
+============
+
+mac80211_hwsim is a Linux kernel module that can be used to simulate
+arbitrary number of IEEE 802.11 radios for mac80211. It can be used to
+test most of the mac80211 functionality and user space tools (e.g.,
+hostapd and wpa_supplicant) in a way that matches very closely with
+the normal case of using real WLAN hardware. From the mac80211 view
+point, mac80211_hwsim is yet another hardware driver, i.e., no changes
+to mac80211 are needed to use this testing tool.
+
+The main goal for mac80211_hwsim is to make it easier for developers
+to test their code and work with new features to mac80211, hostapd,
+and wpa_supplicant. The simulated radios do not have the limitations
+of real hardware, so it is easy to generate an arbitrary test setup
+and always reproduce the same setup for future tests. In addition,
+since all radio operation is simulated, any channel can be used in
+tests regardless of regulatory rules.
+
+mac80211_hwsim kernel module has a parameter 'radios' that can be used
+to select how many radios are simulated (default 2). This allows
+configuration of both very simply setups (e.g., just a single access
+point and a station) or large scale tests (multiple access points with
+hundreds of stations).
+
+mac80211_hwsim works by tracking the current channel of each virtual
+radio and copying all transmitted frames to all other radios that are
+currently enabled and on the same channel as the transmitting
+radio. Software encryption in mac80211 is used so that the frames are
+actually encrypted over the virtual air interface to allow more
+complete testing of encryption.
+
+A global monitoring netdev, hwsim#, is created independent of
+mac80211. This interface can be used to monitor all transmitted frames
+regardless of channel.
+
+
+Simple example
+==============
+
+This example shows how to use mac80211_hwsim to simulate two radios:
+one to act as an access point and the other as a station that
+associates with the AP. hostapd and wpa_supplicant are used to take
+care of WPA2-PSK authentication. In addition, hostapd is also
+processing access point side of association.
+
+::
+
+
+ # Build mac80211_hwsim as part of kernel configuration
+
+ # Load the module
+ modprobe mac80211_hwsim
+
+ # Run hostapd (AP) for wlan0
+ hostapd hostapd.conf
+
+ # Run wpa_supplicant (station) for wlan1
+ wpa_supplicant -Dnl80211 -iwlan1 -c wpa_supplicant.conf
+
+
+More test cases are available in hostap.git:
+git://w1.fi/srv/git/hostap.git and mac80211_hwsim/tests subdirectory
+(http://w1.fi/gitweb/gitweb.cgi?p=hostap.git;a=tree;f=mac80211_hwsim/tests)
diff --git a/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf b/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf
new file mode 100644
index 0000000000..299128cff0
--- /dev/null
+++ b/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf
@@ -0,0 +1,10 @@
+ctrl_interface=/var/run/wpa_supplicant
+
+network={
+ ssid="mac80211 test"
+ psk="12345678"
+ key_mgmt=WPA-PSK
+ proto=WPA2
+ pairwise=CCMP
+ group=CCMP
+}
diff --git a/Documentation/networking/mctp.rst b/Documentation/networking/mctp.rst
new file mode 100644
index 0000000000..c628cb5406
--- /dev/null
+++ b/Documentation/networking/mctp.rst
@@ -0,0 +1,320 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================
+Management Component Transport Protocol (MCTP)
+==============================================
+
+net/mctp/ contains protocol support for MCTP, as defined by DMTF standard
+DSP0236. Physical interface drivers ("bindings" in the specification) are
+provided in drivers/net/mctp/.
+
+The core code provides a socket-based interface to send and receive MCTP
+messages, through an AF_MCTP, SOCK_DGRAM socket.
+
+Structure: interfaces & networks
+================================
+
+The kernel models the local MCTP topology through two items: interfaces and
+networks.
+
+An interface (or "link") is an instance of an MCTP physical transport binding
+(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
+device. This is represented as a ``struct netdevice``.
+
+A network defines a unique address space for MCTP endpoints by endpoint-ID
+(described by DSP0236, section 3.2.31). A network has a user-visible identifier
+to allow references from userspace. Route definitions are specific to one
+network.
+
+Interfaces are associated with one network. A network may be associated with one
+or more interfaces.
+
+If multiple networks are present, each may contain endpoint IDs (EIDs) that are
+also present on other networks.
+
+Sockets API
+===========
+
+Protocol definitions
+--------------------
+
+MCTP uses ``AF_MCTP`` / ``PF_MCTP`` for the address- and protocol- families.
+Since MCTP is message-based, only ``SOCK_DGRAM`` sockets are supported.
+
+.. code-block:: C
+
+ int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
+
+The only (current) value for the ``protocol`` argument is 0.
+
+As with all socket address families, source and destination addresses are
+specified with a ``sockaddr`` type, with a single-byte endpoint address:
+
+.. code-block:: C
+
+ typedef __u8 mctp_eid_t;
+
+ struct mctp_addr {
+ mctp_eid_t s_addr;
+ };
+
+ struct sockaddr_mctp {
+ __kernel_sa_family_t smctp_family;
+ unsigned int smctp_network;
+ struct mctp_addr smctp_addr;
+ __u8 smctp_type;
+ __u8 smctp_tag;
+ };
+
+ #define MCTP_NET_ANY 0x0
+ #define MCTP_ADDR_ANY 0xff
+
+
+Syscall behaviour
+-----------------
+
+The following sections describe the MCTP-specific behaviours of the standard
+socket system calls. These behaviours have been chosen to map closely to the
+existing sockets APIs.
+
+``bind()`` : set local socket address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sockets that receive incoming request packets will bind to a local address,
+using the ``bind()`` syscall.
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+
+ addr.smctp_family = AF_MCTP;
+ addr.smctp_network = MCTP_NET_ANY;
+ addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
+ addr.smctp_type = MCTP_TYPE_PLDM;
+ addr.smctp_tag = MCTP_TAG_OWNER;
+
+ int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
+
+This establishes the local address of the socket. Incoming MCTP messages that
+match the network, address, and message type will be received by this socket.
+The reference to 'incoming' is important here; a bound socket will only receive
+messages with the TO bit set, to indicate an incoming request message, rather
+than a response.
+
+The ``smctp_tag`` value will configure the tags accepted from the remote side of
+this socket. Given the above, the only valid value is ``MCTP_TAG_OWNER``, which
+will result in remotely "owned" tags being routed to this socket. Since
+``MCTP_TAG_OWNER`` is set, the 3 least-significant bits of ``smctp_tag`` are not
+used; callers must set them to zero.
+
+A ``smctp_network`` value of ``MCTP_NET_ANY`` will configure the socket to
+receive incoming packets from any locally-connected network. A specific network
+value will cause the socket to only receive incoming messages from that network.
+
+The ``smctp_addr`` field specifies a local address to bind to. A value of
+``MCTP_ADDR_ANY`` configures the socket to receive messages addressed to any
+local destination EID.
+
+The ``smctp_type`` field specifies which message types to receive. Only the
+lower 7 bits of the type is matched on incoming messages (ie., the
+most-significant IC bit is not part of the match). This results in the socket
+receiving packets with and without a message integrity check footer.
+
+``sendto()``, ``sendmsg()``, ``send()`` : transmit an MCTP message
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An MCTP message is transmitted using one of the ``sendto()``, ``sendmsg()`` or
+``send()`` syscalls. Using ``sendto()`` as the primary example:
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+ char buf[14];
+ ssize_t len;
+
+ /* set message destination */
+ addr.smctp_family = AF_MCTP;
+ addr.smctp_network = 0;
+ addr.smctp_addr.s_addr = 8;
+ addr.smctp_tag = MCTP_TAG_OWNER;
+ addr.smctp_type = MCTP_TYPE_ECHO;
+
+ /* arbitrary message to send, with message-type header */
+ buf[0] = MCTP_TYPE_ECHO;
+ memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);
+
+ len = sendto(sd, buf, sizeof(buf), 0,
+ (struct sockaddr_mctp *)&addr, sizeof(addr));
+
+The network and address fields of ``addr`` define the remote address to send to.
+If ``smctp_tag`` has the ``MCTP_TAG_OWNER``, the kernel will ignore any bits set
+in ``MCTP_TAG_VALUE``, and generate a tag value suitable for the destination
+EID. If ``MCTP_TAG_OWNER`` is not set, the message will be sent with the tag
+value as specified. If a tag value cannot be allocated, the system call will
+report an errno of ``EAGAIN``.
+
+The application must provide the message type byte as the first byte of the
+message buffer passed to ``sendto()``. If a message integrity check is to be
+included in the transmitted message, it must also be provided in the message
+buffer, and the most-significant bit of the message type byte must be 1.
+
+The ``sendmsg()`` system call allows a more compact argument interface, and the
+message buffer to be specified as a scatter-gather list. At present no ancillary
+message types (used for the ``msg_control`` data passed to ``sendmsg()``) are
+defined.
+
+Transmitting a message on an unconnected socket with ``MCTP_TAG_OWNER``
+specified will cause an allocation of a tag, if no valid tag is already
+allocated for that destination. The (destination-eid,tag) tuple acts as an
+implicit local socket address, to allow the socket to receive responses to this
+outgoing message. If any previous allocation has been performed (to for a
+different remote EID), that allocation is lost.
+
+Sockets will only receive responses to requests they have sent (with TO=1) and
+may only respond (with TO=0) to requests they have received.
+
+``recvfrom()``, ``recvmsg()``, ``recv()`` : receive an MCTP message
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An MCTP message can be received by an application using one of the
+``recvfrom()``, ``recvmsg()``, or ``recv()`` system calls. Using ``recvfrom()``
+as the primary example:
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+ socklen_t addrlen;
+ char buf[14];
+ ssize_t len;
+
+ addrlen = sizeof(addr);
+
+ len = recvfrom(sd, buf, sizeof(buf), 0,
+ (struct sockaddr_mctp *)&addr, &addrlen);
+
+ /* We can expect addr to describe an MCTP address */
+ assert(addrlen >= sizeof(buf));
+ assert(addr.smctp_family == AF_MCTP);
+
+ printf("received %zd bytes from remote EID %d\n", rc, addr.smctp_addr);
+
+The address argument to ``recvfrom`` and ``recvmsg`` is populated with the
+remote address of the incoming message, including tag value (this will be needed
+in order to reply to the message).
+
+The first byte of the message buffer will contain the message type byte. If an
+integrity check follows the message, it will be included in the received buffer.
+
+The ``recv()`` system call behaves in a similar way, but does not provide a
+remote address to the application. Therefore, these are only useful if the
+remote address is already known, or the message does not require a reply.
+
+Like the send calls, sockets will only receive responses to requests they have
+sent (TO=1) and may only respond (TO=0) to requests they have received.
+
+``ioctl(SIOCMCTPALLOCTAG)`` and ``ioctl(SIOCMCTPDROPTAG)``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These tags give applications more control over MCTP message tags, by allocating
+(and dropping) tag values explicitly, rather than the kernel automatically
+allocating a per-message tag at ``sendmsg()`` time.
+
+In general, you will only need to use these ioctls if your MCTP protocol does
+not fit the usual request/response model. For example, if you need to persist
+tags across multiple requests, or a request may generate more than one response.
+In these cases, the ioctls allow you to decouple the tag allocation (and
+release) from individual message send and receive operations.
+
+Both ioctls are passed a pointer to a ``struct mctp_ioc_tag_ctl``:
+
+.. code-block:: C
+
+ struct mctp_ioc_tag_ctl {
+ mctp_eid_t peer_addr;
+ __u8 tag;
+ __u16 flags;
+ };
+
+``SIOCMCTPALLOCTAG`` allocates a tag for a specific peer, which an application
+can use in future ``sendmsg()`` calls. The application populates the
+``peer_addr`` member with the remote EID. Other fields must be zero.
+
+On return, the ``tag`` member will be populated with the allocated tag value.
+The allocated tag will have the following tag bits set:
+
+ - ``MCTP_TAG_OWNER``: it only makes sense to allocate tags if you're the tag
+ owner
+
+ - ``MCTP_TAG_PREALLOC``: to indicate to ``sendmsg()`` that this is a
+ preallocated tag.
+
+ - ... and the actual tag value, within the least-significant three bits
+ (``MCTP_TAG_MASK``). Note that zero is a valid tag value.
+
+The tag value should be used as-is for the ``smctp_tag`` member of ``struct
+sockaddr_mctp``.
+
+``SIOCMCTPDROPTAG`` releases a tag that has been previously allocated by a
+``SIOCMCTPALLOCTAG`` ioctl. The ``peer_addr`` must be the same as used for the
+allocation, and the ``tag`` value must match exactly the tag returned from the
+allocation (including the ``MCTP_TAG_OWNER`` and ``MCTP_TAG_PREALLOC`` bits).
+The ``flags`` field must be zero.
+
+Kernel internals
+================
+
+There are a few possible packet flows in the MCTP stack:
+
+1. local TX to remote endpoint, message <= MTU::
+
+ sendmsg()
+ -> mctp_local_output()
+ : route lookup
+ -> rt->output() (== mctp_route_output)
+ -> dev_queue_xmit()
+
+2. local TX to remote endpoint, message > MTU::
+
+ sendmsg()
+ -> mctp_local_output()
+ -> mctp_do_fragment_route()
+ : creates packet-sized skbs. For each new skb:
+ -> rt->output() (== mctp_route_output)
+ -> dev_queue_xmit()
+
+3. remote TX to local endpoint, single-packet message::
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ -> sock_queue_rcv_skb()
+
+4. remote TX to local endpoint, multiple-packet message::
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ : stores skb in struct sk_key->reasm_head
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ : finds existing reassembly in sk_key->reasm_head
+ : appends new fragment
+ -> sock_queue_rcv_skb()
+
+Key refcounts
+-------------
+
+ * keys are refed by:
+
+ - a skb: during route output, stored in ``skb->cb``.
+
+ - netns and sock lists.
+
+ * keys can be associated with a device, in which case they hold a
+ reference to the dev (set through ``key->dev``, counted through
+ ``dev->key_count``). Multiple keys can reference the device.
diff --git a/Documentation/networking/mpls-sysctl.rst b/Documentation/networking/mpls-sysctl.rst
new file mode 100644
index 0000000000..0a2ac88404
--- /dev/null
+++ b/Documentation/networking/mpls-sysctl.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+MPLS Sysfs variables
+====================
+
+/proc/sys/net/mpls/* Variables:
+===============================
+
+platform_labels - INTEGER
+ Number of entries in the platform label table. It is not
+ possible to configure forwarding for label values equal to or
+ greater than the number of platform labels.
+
+ A dense utilization of the entries in the platform label table
+ is possible and expected as the platform labels are locally
+ allocated.
+
+ If the number of platform label table entries is set to 0 no
+ label will be recognized by the kernel and mpls forwarding
+ will be disabled.
+
+ Reducing this value will remove all label routing entries that
+ no longer fit in the table.
+
+ Possible values: 0 - 1048575
+
+ Default: 0
+
+ip_ttl_propagate - BOOL
+ Control whether TTL is propagated from the IPv4/IPv6 header to
+ the MPLS header on imposing labels and propagated from the
+ MPLS header to the IPv4/IPv6 header on popping the last label.
+
+ If disabled, the MPLS transport network will appear as a
+ single hop to transit traffic.
+
+ * 0 - disabled / RFC 3443 [Short] Pipe Model
+ * 1 - enabled / RFC 3443 Uniform Model (default)
+
+default_ttl - INTEGER
+ Default TTL value to use for MPLS packets where it cannot be
+ propagated from an IP header, either because one isn't present
+ or ip_ttl_propagate has been disabled.
+
+ Possible values: 1 - 255
+
+ Default: 255
+
+conf/<interface>/input - BOOL
+ Control whether packets can be input on this interface.
+
+ If disabled, packets will be discarded without further
+ processing.
+
+ * 0 - disabled (default)
+ * not 0 - enabled
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
new file mode 100644
index 0000000000..15f1919d64
--- /dev/null
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -0,0 +1,84 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+MPTCP Sysfs variables
+=====================
+
+/proc/sys/net/mptcp/* Variables
+===============================
+
+enabled - BOOLEAN
+ Control whether MPTCP sockets can be created.
+
+ MPTCP sockets can be created if the value is 1. This is a
+ per-namespace sysctl.
+
+ Default: 1 (enabled)
+
+add_addr_timeout - INTEGER (seconds)
+ Set the timeout after which an ADD_ADDR control message will be
+ resent to an MPTCP peer that has not acknowledged a previous
+ ADD_ADDR message.
+
+ The default value matches TCP_RTO_MAX. This is a per-namespace
+ sysctl.
+
+ Default: 120
+
+checksum_enabled - BOOLEAN
+ Control whether DSS checksum can be enabled.
+
+ DSS checksum can be enabled if the value is nonzero. This is a
+ per-namespace sysctl.
+
+ Default: 0
+
+allow_join_initial_addr_port - BOOLEAN
+ Allow peers to send join requests to the IP address and port number used
+ by the initial subflow if the value is 1. This controls a flag that is
+ sent to the peer at connection time, and whether such join requests are
+ accepted or denied.
+
+ Joins to addresses advertised with ADD_ADDR are not affected by this
+ value.
+
+ This is a per-namespace sysctl.
+
+ Default: 1
+
+pm_type - INTEGER
+ Set the default path manager type to use for each new MPTCP
+ socket. In-kernel path management will control subflow
+ connections and address advertisements according to
+ per-namespace values configured over the MPTCP netlink
+ API. Userspace path management puts per-MPTCP-connection subflow
+ connection decisions and address advertisements under control of
+ a privileged userspace program, at the cost of more netlink
+ traffic to propagate all of the related events and commands.
+
+ This is a per-namespace sysctl.
+
+ * 0 - In-kernel path manager
+ * 1 - Userspace path manager
+
+ Default: 0
+
+stale_loss_cnt - INTEGER
+ The number of MPTCP-level retransmission intervals with no traffic and
+ pending outstanding data on a given subflow required to declare it stale.
+ The packet scheduler ignores stale subflows.
+ A low stale_loss_cnt value allows for fast active-backup switch-over,
+ an high value maximize links utilization on edge scenarios e.g. lossy
+ link with high BER or peer pausing the data processing.
+
+ This is a per-namespace sysctl.
+
+ Default: 4
+
+scheduler - STRING
+ Select the scheduler of your choice.
+
+ Support for selection of different schedulers. This is a per-namespace
+ sysctl.
+
+ Default: "default"
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
new file mode 100644
index 0000000000..b3ea96af9b
--- /dev/null
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -0,0 +1,256 @@
+
+============
+MSG_ZEROCOPY
+============
+
+Intro
+=====
+
+The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
+The feature is currently implemented for TCP and UDP sockets.
+
+
+Opportunity and Caveats
+-----------------------
+
+Copying large buffers between user process and kernel can be
+expensive. Linux supports various interfaces that eschew copying,
+such as sendfile and splice. The MSG_ZEROCOPY flag extends the
+underlying copy avoidance mechanism to common socket send calls.
+
+Copy avoidance is not a free lunch. As implemented, with page pinning,
+it replaces per byte copy cost with page accounting and completion
+notification overhead. As a result, MSG_ZEROCOPY is generally only
+effective at writes over around 10 KB.
+
+Page pinning also changes system call semantics. It temporarily shares
+the buffer between process and network stack. Unlike with copying, the
+process cannot immediately overwrite the buffer after system call
+return without possibly modifying the data in flight. Kernel integrity
+is not affected, but a buggy program can possibly corrupt its own data
+stream.
+
+The kernel returns a notification when it is safe to modify data.
+Converting an existing application to MSG_ZEROCOPY is not always as
+trivial as just passing the flag, then.
+
+
+More Info
+---------
+
+Much of this document was derived from a longer paper presented at
+netdev 2.1. For more in-depth information see that paper and talk,
+the excellent reporting over at LWN.net or read the original code.
+
+ paper, slides, video
+ https://netdevconf.org/2.1/session.html?debruijn
+
+ LWN article
+ https://lwn.net/Articles/726917/
+
+ patchset
+ [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
+ https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
+
+
+Interface
+=========
+
+Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
+avoidance, but not the only one.
+
+Socket Setup
+------------
+
+The kernel is permissive when applications pass undefined flags to the
+send system call. By default it simply ignores these. To avoid enabling
+copy avoidance mode for legacy processes that accidentally already pass
+this flag, a process must first signal intent by setting a socket option:
+
+::
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
+ error(1, errno, "setsockopt zerocopy");
+
+Transmission
+------------
+
+The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
+Pass the new flag.
+
+::
+
+ ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
+
+A zerocopy failure will return -1 with errno ENOBUFS. This happens if
+the socket exceeds its optmem limit or the user exceeds their ulimit on
+locked pages.
+
+
+Mixing copy avoidance and copying
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many workloads have a mixture of large and small buffers. Because copy
+avoidance is more expensive than copying for small packets, the
+feature is implemented as a flag. It is safe to mix calls with the flag
+with those without.
+
+
+Notifications
+-------------
+
+The kernel has to notify the process when it is safe to reuse a
+previously passed buffer. It queues completion notifications on the
+socket error queue, akin to the transmit timestamping interface.
+
+The notification itself is a simple scalar value. Each socket
+maintains an internal unsigned 32-bit counter. Each send call with
+MSG_ZEROCOPY that successfully sends data increments the counter. The
+counter is not incremented on failure or if called with length zero.
+The counter counts system call invocations, not bytes. It wraps after
+UINT_MAX calls.
+
+
+Notification Reception
+~~~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates the API. In the simplest case, each
+send syscall is followed by a poll and recvmsg on the error queue.
+
+Reading from the error queue is always a non-blocking operation. The
+poll call is there to block until an error is outstanding. It will set
+POLLERR in its output flags. That flag does not have to be set in the
+events field. Errors are signaled unconditionally.
+
+::
+
+ pfd.fd = fd;
+ pfd.events = 0;
+ if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0)
+ error(1, errno, "poll");
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (ret == -1)
+ error(1, errno, "recvmsg");
+
+ read_notification(msg);
+
+The example is for demonstration purpose only. In practice, it is more
+efficient to not wait for notifications, but read without blocking
+every couple of send calls.
+
+Notifications can be processed out of order with other operations on
+the socket. A socket that has an error queued would normally block
+other operations until the error is read. Zerocopy notifications have
+a zero error code, however, to not block send and recv calls.
+
+
+Notification Batching
+~~~~~~~~~~~~~~~~~~~~~
+
+Multiple outstanding packets can be read at once using the recvmmsg
+call. This is often not needed. In each message the kernel returns not
+a single value, but a range. It coalesces consecutive notifications
+while one is outstanding for reception on the error queue.
+
+When a new notification is about to be queued, it checks whether the
+new value extends the range of the notification at the tail of the
+queue. If so, it drops the new notification packet and instead increases
+the range upper value of the outstanding notification.
+
+For protocols that acknowledge data in-order, like TCP, each
+notification can be squashed into the previous one, so that no more
+than one notification is outstanding at any one point.
+
+Ordered delivery is the common case, but not guaranteed. Notifications
+may arrive out of order on retransmission and socket teardown.
+
+
+Notification Parsing
+~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates how to parse the control message: the
+read_notification() call in the previous snippet. A notification
+is encoded in the standard error format, sock_extended_err.
+
+The level and type fields in the control data are protocol family
+specific, IP_RECVERR or IPV6_RECVERR.
+
+Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
+as explained before, to avoid blocking read and write system calls on
+the socket.
+
+The 32-bit notification range is encoded as [ee_info, ee_data]. This
+range is inclusive. Other fields in the struct must be treated as
+undefined, bar for ee_code, as discussed below.
+
+::
+
+ struct sock_extended_err *serr;
+ struct cmsghdr *cm;
+
+ cm = CMSG_FIRSTHDR(msg);
+ if (cm->cmsg_level != SOL_IP &&
+ cm->cmsg_type != IP_RECVERR)
+ error(1, 0, "cmsg");
+
+ serr = (void *) CMSG_DATA(cm);
+ if (serr->ee_errno != 0 ||
+ serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+ error(1, 0, "serr");
+
+ printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
+
+
+Deferred copies
+~~~~~~~~~~~~~~~
+
+Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
+avoidance, and a contract that the kernel will queue a completion
+notification. It is not a guarantee that the copy is elided.
+
+Copy avoidance is not always feasible. Devices that do not support
+scatter-gather I/O cannot send packets made up of kernel generated
+protocol headers plus zerocopy user data. A packet may need to be
+converted to a private copy of data deep in the stack, say to compute
+a checksum.
+
+In all these cases, the kernel returns a completion notification when
+it releases its hold on the shared pages. That notification may arrive
+before the (copied) data is fully transmitted. A zerocopy completion
+notification is not a transmit completion notification, therefore.
+
+Deferred copies can be more expensive than a copy immediately in the
+system call, if the data is no longer warm in the cache. The process
+also incurs notification processing cost for no benefit. For this
+reason, the kernel signals if data was completed with a copy, by
+setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
+A process may use this signal to stop passing flag MSG_ZEROCOPY on
+subsequent requests on the same socket.
+
+
+Implementation
+==============
+
+Loopback
+--------
+
+Data sent to local sockets can be queued indefinitely if the receive
+process does not read its socket. Unbound notification latency is not
+acceptable. For this reason all packets generated with MSG_ZEROCOPY
+that are looped to a local socket will incur a deferred copy. This
+includes looping onto packet sockets (e.g., tcpdump) and tun devices.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+tools/testing/selftests/net/msg_zerocopy.c.
+
+Be cognizant of the loopback constraint. The test can be run between
+a pair of hosts. But if run between a local pair of processes, for
+instance when run with msg_zerocopy.sh between a veth pair across
+namespaces, the test will not show any improvement. For testing, the
+loopback restriction can be temporarily relaxed by making
+skb_orphan_frags_rx identical to skb_orphan_frags.
diff --git a/Documentation/networking/multiqueue.rst b/Documentation/networking/multiqueue.rst
new file mode 100644
index 0000000000..0a576166e9
--- /dev/null
+++ b/Documentation/networking/multiqueue.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+HOWTO for multiqueue network device support
+===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+=======================================================================
+
+Intro: Kernel support for multiqueue devices
+---------------------------------------------------------
+
+Kernel support for multiqueue devices is always present.
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device. The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today. Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational. netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+
+Section 2: Qdisc support for multiqueue devices
+===============================================
+
+Currently two qdiscs are optimized for multiqueue devices. The first is the
+default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue.
+A new round-robin qdisc, sch_multiq also supports multiple hardware queues. The
+qdisc is responsible for classifying the skb's and then directing the skb's to
+bands and queues based on the value in skb->queue_mapping. Use this field in
+the base driver to determine which queue to send the skb to.
+
+sch_multiq has been added for hardware that wishes to avoid head-of-line
+blocking. It will cycle though the bands and verify that the hardware queue
+associated with the band is not stopped prior to dequeuing a packet.
+
+On qdisc load, the number of bands is based on the number of queues on the
+hardware. Once the association is made, any skb with skb->queue_mapping set,
+will be queued to the band associated with the hardware queue.
+
+
+Section 3: Brief howto using MULTIQ for multiqueue devices
+==========================================================
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs. To add the MULTIQ qdisc to your network device, assuming the device
+is called eth0, run the following command::
+
+ # tc qdisc add dev eth0 root handle 1: multiq
+
+The qdisc will allocate the number of bands to equal the number of queues that
+the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx
+queues, the band mapping would look like::
+
+ band 0 => queue 0
+ band 1 => queue 1
+ band 2 => queue 2
+ band 3 => queue 3
+
+Traffic will begin flowing through each queue based on either the simple_tx_hash
+function or based on netdev->select_queue() if you have it defined.
+
+The behavior of tc filters remains the same. However a new tc action,
+skbedit, has been added. Assuming you wanted to route all traffic to a
+specific host, for example 192.168.0.3, through a specific queue you could use
+this action and establish a filter such as::
+
+ tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
+ match ip dst 192.168.0.3 \
+ action skbedit queue_mapping 3
+
+:Author: Alexander Duyck <alexander.h.duyck@intel.com>
+:Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
new file mode 100644
index 0000000000..7bf7b95c4f
--- /dev/null
+++ b/Documentation/networking/napi.rst
@@ -0,0 +1,255 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+.. _napi:
+
+====
+NAPI
+====
+
+NAPI is the event handling mechanism used by the Linux networking stack.
+The name NAPI no longer stands for anything in particular [#]_.
+
+In basic operation the device notifies the host about new events
+via an interrupt.
+The host then schedules a NAPI instance to process the events.
+The device may also be polled for events via NAPI without receiving
+interrupts first (:ref:`busy polling<poll>`).
+
+NAPI processing usually happens in the software interrupt context,
+but there is an option to use :ref:`separate kernel threads<threaded>`
+for NAPI processing.
+
+All in all NAPI abstracts away from the drivers the context and configuration
+of event (packet Rx and Tx) processing.
+
+Driver API
+==========
+
+The two most important elements of NAPI are the struct napi_struct
+and the associated poll method. struct napi_struct holds the state
+of the NAPI instance while the method is the driver-specific event
+handler. The method will typically free Tx packets that have been
+transmitted and process newly received packets.
+
+.. _drv_ctrl:
+
+Control API
+-----------
+
+netif_napi_add() and netif_napi_del() add/remove a NAPI instance
+from the system. The instances are attached to the netdevice passed
+as argument (and will be deleted automatically when netdevice is
+unregistered). Instances are added in a disabled state.
+
+napi_enable() and napi_disable() manage the disabled state.
+A disabled NAPI can't be scheduled and its poll method is guaranteed
+to not be invoked. napi_disable() waits for ownership of the NAPI
+instance to be released.
+
+The control APIs are not idempotent. Control API calls are safe against
+concurrent use of datapath APIs but an incorrect sequence of control API
+calls may result in crashes, deadlocks, or race conditions. For example,
+calling napi_disable() multiple times in a row will deadlock.
+
+Datapath API
+------------
+
+napi_schedule() is the basic method of scheduling a NAPI poll.
+Drivers should call this function in their interrupt handler
+(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
+will take ownership of the NAPI instance.
+
+Later, after NAPI is scheduled, the driver's poll method will be
+called to process the events/packets. The method takes a ``budget``
+argument - drivers can process completions for any number of Tx
+packets but should only process up to ``budget`` number of
+Rx packets. Rx processing is usually much more expensive.
+
+In other words for Rx processing the ``budget`` argument limits how many
+packets driver can process in a single poll. Rx specific APIs like page
+pool or XDP cannot be used at all when ``budget`` is 0.
+skb Tx processing should happen regardless of the ``budget``, but if
+the argument is 0 driver cannot call any XDP (or page pool) APIs.
+
+.. warning::
+
+ The ``budget`` argument may be 0 if core tries to only process
+ skb Tx completions and no Rx or XDP packets.
+
+The poll method returns the amount of work done. If the driver still
+has outstanding work to do (e.g. ``budget`` was exhausted)
+the poll method should return exactly ``budget``. In that case,
+the NAPI instance will be serviced/polled again (without the
+need to be scheduled).
+
+If event processing has been completed (all outstanding packets
+processed) the poll method should call napi_complete_done()
+before returning. napi_complete_done() releases the ownership
+of the instance.
+
+.. warning::
+
+ The case of finishing all events and using exactly ``budget``
+ must be handled carefully. There is no way to report this
+ (rare) condition to the stack, so the driver must either
+ not call napi_complete_done() and wait to be called again,
+ or return ``budget - 1``.
+
+ If the ``budget`` is 0 napi_complete_done() should never be called.
+
+Call sequence
+-------------
+
+Drivers should not make assumptions about the exact sequencing
+of calls. The poll method may be called without the driver scheduling
+the instance (unless the instance is disabled). Similarly,
+it's not guaranteed that the poll method will be called, even
+if napi_schedule() succeeded (e.g. if the instance gets disabled).
+
+As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
+calls to the poll method only wait for the ownership of the instance
+to be released, not for the poll method to exit. This means that
+drivers should avoid accessing any data structures after calling
+napi_complete_done().
+
+.. _drv_sched:
+
+Scheduling and IRQ masking
+--------------------------
+
+Drivers should keep the interrupts masked after scheduling
+the NAPI instance - until NAPI polling finishes any further
+interrupts are unnecessary.
+
+Drivers which have to mask the interrupts explicitly (as opposed
+to IRQ being auto-masked by the device) should use the napi_schedule_prep()
+and __napi_schedule() calls:
+
+.. code-block:: c
+
+ if (napi_schedule_prep(&v->napi)) {
+ mydrv_mask_rxtx_irq(v->idx);
+ /* schedule after masking to avoid races */
+ __napi_schedule(&v->napi);
+ }
+
+IRQ should only be unmasked after a successful call to napi_complete_done():
+
+.. code-block:: c
+
+ if (budget && napi_complete_done(&v->napi, work_done)) {
+ mydrv_unmask_rxtx_irq(v->idx);
+ return min(work_done, budget - 1);
+ }
+
+napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
+of guarantees given by being invoked in IRQ context (no need to
+mask interrupts). Note that PREEMPT_RT forces all interrupts
+to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD``
+to avoid issues on real-time kernel configurations.
+
+Instance to queue mapping
+-------------------------
+
+Modern devices have multiple NAPI instances (struct napi_struct) per
+interface. There is no strong requirement on how the instances are
+mapped to queues and interrupts. NAPI is primarily a polling/processing
+abstraction without specific user-facing semantics. That said, most networking
+devices end up using NAPI in fairly similar ways.
+
+NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
+(queue pair is a set of a single Rx and single Tx queue).
+
+In less common cases a NAPI instance may be used for multiple queues
+or Rx and Tx queues can be serviced by separate NAPI instances on a single
+core. Regardless of the queue assignment, however, there is usually still
+a 1:1 mapping between NAPI instances and interrupts.
+
+It's worth noting that the ethtool API uses a "channel" terminology where
+each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
+what constitutes a channel; the recommended interpretation is to understand
+a channel as an IRQ/NAPI which services queues of a given type. For example,
+a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
+to utilize 3 interrupts, 2 Rx and 2 Tx queues.
+
+User API
+========
+
+User interactions with NAPI depend on NAPI instance ID. The instance IDs
+are only visible to the user thru the ``SO_INCOMING_NAPI_ID`` socket option.
+It's not currently possible to query IDs used by a given device.
+
+Software IRQ coalescing
+-----------------------
+
+NAPI does not perform any explicit event coalescing by default.
+In most scenarios batching happens due to IRQ coalescing which is done
+by the device. There are cases where software coalescing is helpful.
+
+NAPI can be configured to arm a repoll timer instead of unmasking
+the hardware interrupts as soon as all packets are processed.
+The ``gro_flush_timeout`` sysfs configuration of the netdevice
+is reused to control the delay of the timer, while
+``napi_defer_hard_irqs`` controls the number of consecutive empty polls
+before NAPI gives up and goes back to using hardware IRQs.
+
+.. _poll:
+
+Busy polling
+------------
+
+Busy polling allows a user process to check for incoming packets before
+the device interrupt fires. As is the case with any busy polling it trades
+off CPU cycles for lower latency (production uses of NAPI busy polling
+are not well known).
+
+Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
+selected sockets or using the global ``net.core.busy_poll`` and
+``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
+also exists.
+
+IRQ mitigation
+---------------
+
+While busy polling is supposed to be used by low latency applications,
+a similar mechanism can be used for IRQ mitigation.
+
+Very high request-per-second applications (especially routing/forwarding
+applications and especially applications using AF_XDP sockets) may not
+want to be interrupted until they finish processing a request or a batch
+of packets.
+
+Such applications can pledge to the kernel that they will perform a busy
+polling operation periodically, and the driver should keep the device IRQs
+permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
+socket option. To avoid system misbehavior the pledge is revoked
+if ``gro_flush_timeout`` passes without any busy poll call.
+
+The NAPI budget for busy polling is lower than the default (which makes
+sense given the low latency intention of normal busy polling). This is
+not the case with IRQ mitigation, however, so the budget can be adjusted
+with the ``SO_BUSY_POLL_BUDGET`` socket option.
+
+.. _threaded:
+
+Threaded NAPI
+-------------
+
+Threaded NAPI is an operating mode that uses dedicated kernel
+threads rather than software IRQ context for NAPI processing.
+The configuration is per netdevice and will affect all
+NAPI instances of that device. Each NAPI instance will spawn a separate
+thread (called ``napi/${ifc-name}-${napi-id}``).
+
+It is recommended to pin each kernel thread to a single CPU, the same
+CPU as the CPU which services the interrupt. Note that the mapping
+between IRQs and NAPI instances may not be trivial (and is driver
+dependent). The NAPI instance IDs will be assigned in the opposite
+order than the process IDs of the kernel threads.
+
+Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
+netdev's sysfs directory.
+
+.. rubric:: Footnotes
+
+.. [#] NAPI was originally referred to as New API in 2.4 Linux.
diff --git a/Documentation/networking/net_dim.rst b/Documentation/networking/net_dim.rst
new file mode 100644
index 0000000000..3bed9fd953
--- /dev/null
+++ b/Documentation/networking/net_dim.rst
@@ -0,0 +1,176 @@
+======================================================
+Net DIM - Generic Network Dynamic Interrupt Moderation
+======================================================
+
+:Author: Tal Gilboa <talgi@mellanox.com>
+
+.. contents:: :depth: 2
+
+Assumptions
+===========
+
+This document assumes the reader has basic knowledge in network drivers
+and in general interrupt moderation.
+
+
+Introduction
+============
+
+Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
+interrupt moderation configuration of a channel in order to optimize packet
+processing. The mechanism includes an algorithm which decides if and how to
+change moderation parameters for a channel, usually by performing an analysis on
+runtime data sampled from the system. Net DIM is such a mechanism. In each
+iteration of the algorithm, it analyses a given sample of the data, compares it
+to the previous sample and if required, it can decide to change some of the
+interrupt moderation configuration fields. The data sample is composed of data
+bandwidth, the number of packets and the number of events. The time between
+samples is also measured. Net DIM compares the current and the previous data and
+returns an adjusted interrupt moderation configuration object. In some cases,
+the algorithm might decide not to change anything. The configuration fields are
+the minimum duration (microseconds) allowed between events and the maximum
+number of wanted packets per event. The Net DIM algorithm ascribes importance to
+increase bandwidth over reducing interrupt rate.
+
+
+Net DIM Algorithm
+=================
+
+Each iteration of the Net DIM algorithm follows these steps:
+
+#. Calculates new data sample.
+#. Compares it to previous sample.
+#. Makes a decision - suggests interrupt moderation configuration fields.
+#. Applies a schedule work function, which applies suggested configuration.
+
+The first two steps are straightforward, both the new and the previous data are
+supplied by the driver registered to Net DIM. The previous data is the new data
+supplied to the previous iteration. The comparison step checks the difference
+between the new and previous data and decides on the result of the last step.
+A step would result as "better" if bandwidth increases and as "worse" if
+bandwidth reduces. If there is no change in bandwidth, the packet rate is
+compared in a similar fashion - increase == "better" and decrease == "worse".
+In case there is no change in the packet rate as well, the interrupt rate is
+compared. Here the algorithm tries to optimize for lower interrupt rate so an
+increase in the interrupt rate is considered "worse" and a decrease is
+considered "better". Step #2 has an optimization for avoiding false results: it
+only considers a difference between samples as valid if it is greater than a
+certain percentage. Also, since Net DIM does not measure anything by itself, it
+assumes the data provided by the driver is valid.
+
+Step #3 decides on the suggested configuration based on the result from step #2
+and the internal state of the algorithm. The states reflect the "direction" of
+the algorithm: is it going left (reducing moderation), right (increasing
+moderation) or standing still. Another optimization is that if a decision
+to stay still is made multiple times, the interval between iterations of the
+algorithm would increase in order to reduce calculation overhead. Also, after
+"parking" on one of the most left or most right decisions, the algorithm may
+decide to verify this decision by taking a step in the other direction. This is
+done in order to avoid getting stuck in a "deep sleep" scenario. Once a
+decision is made, an interrupt moderation configuration is selected from
+the predefined profiles.
+
+The last step is to notify the registered driver that it should apply the
+suggested configuration. This is done by scheduling a work function, defined by
+the Net DIM API and provided by the registered driver.
+
+As you can see, Net DIM itself does not actively interact with the system. It
+would have trouble making the correct decisions if the wrong data is supplied to
+it and it would be useless if the work function would not apply the suggested
+configuration. This does, however, allow the registered driver some room for
+manoeuvre as it may provide partial data or ignore the algorithm suggestion
+under some conditions.
+
+
+Registering a Network Device to DIM
+===================================
+
+Net DIM API exposes the main function net_dim().
+This function is the entry point to the Net
+DIM algorithm and has to be called every time the driver would like to check if
+it should change interrupt moderation parameters. The driver should provide two
+data structures: :c:type:`struct dim <dim>` and
+:c:type:`struct dim_sample <dim_sample>`. :c:type:`struct dim <dim>`
+describes the state of DIM for a specific object (RX queue, TX queue,
+other queues, etc.). This includes the current selected profile, previous data
+samples, the callback function provided by the driver and more.
+:c:type:`struct dim_sample <dim_sample>` describes a data sample,
+which will be compared to the data sample stored in :c:type:`struct dim <dim>`
+in order to decide on the algorithm's next
+step. The sample should include bytes, packets and interrupts, measured by
+the driver.
+
+In order to use Net DIM from a networking driver, the driver needs to call the
+main net_dim() function. The recommended method is to call net_dim() on each
+interrupt. Since Net DIM has a built-in moderation and it might decide to skip
+iterations under certain conditions, there is no need to moderate the net_dim()
+calls as well. As mentioned above, the driver needs to provide an object of type
+:c:type:`struct dim <dim>` to the net_dim() function call. It is advised for
+each entity using Net DIM to hold a :c:type:`struct dim <dim>` as part of its
+data structure and use it as the main Net DIM API object.
+The :c:type:`struct dim_sample <dim_sample>` should hold the latest
+bytes, packets and interrupts count. No need to perform any calculations, just
+include the raw data.
+
+The net_dim() call itself does not return anything. Instead Net DIM relies on
+the driver to provide a callback function, which is called when the algorithm
+decides to make a change in the interrupt moderation parameters. This callback
+will be scheduled and run in a separate thread in order not to add overhead to
+the data flow. After the work is done, Net DIM algorithm needs to be set to
+the proper state in order to move to the next iteration.
+
+
+Example
+=======
+
+The following code demonstrates how to register a driver to Net DIM. The actual
+usage is not complete but it should make the outline of the usage clear.
+
+.. code-block:: c
+
+ #include <linux/dim.h>
+
+ /* Callback for net DIM to schedule on a decision to change moderation */
+ void my_driver_do_dim_work(struct work_struct *work)
+ {
+ /* Get struct dim from struct work_struct */
+ struct dim *dim = container_of(work, struct dim,
+ work);
+ /* Do interrupt moderation related stuff */
+ ...
+
+ /* Signal net DIM work is done and it should move to next iteration */
+ dim->state = DIM_START_MEASURE;
+ }
+
+ /* My driver's interrupt handler */
+ int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
+ {
+ ...
+ /* A struct to hold current measured data */
+ struct dim_sample dim_sample;
+ ...
+ /* Initiate data sample struct with current data */
+ dim_update_sample(my_entity->events,
+ my_entity->packets,
+ my_entity->bytes,
+ &dim_sample);
+ /* Call net DIM */
+ net_dim(&my_entity->dim, dim_sample);
+ ...
+ }
+
+ /* My entity's initialization function (my_entity was already allocated) */
+ int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
+ {
+ ...
+ /* Initiate struct work_struct with my driver's callback function */
+ INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
+ ...
+ }
+
+Dynamic Interrupt Moderation (DIM) library API
+==============================================
+
+.. kernel-doc:: include/linux/dim.h
+ :internal:
diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
new file mode 100644
index 0000000000..f4e1b4e07a
--- /dev/null
+++ b/Documentation/networking/net_failover.rst
@@ -0,0 +1,184 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+NET_FAILOVER
+============
+
+Overview
+========
+
+The net_failover driver provides an automated failover mechanism via APIs
+to create and destroy a failover master netdev and manages a primary and
+standby slave netdevs that get registered via the generic failover
+infrastructure.
+
+The failover netdev acts a master device and controls 2 slave devices. The
+original paravirtual interface is registered as 'standby' slave netdev and
+a passthru/vf device with the same MAC gets registered as 'primary' slave
+netdev. Both 'standby' and 'failover' netdevs are associated with the same
+'pci' device. The user accesses the network interface via 'failover' netdev.
+The 'failover' netdev chooses 'primary' netdev as default for transmits when
+it is available with link up and running.
+
+This can be used by paravirtual drivers to enable an alternate low latency
+datapath. It also enables hypervisor controlled live migration of a VM with
+direct attached VF by failing over to the paravirtual datapath when the VF
+is unplugged.
+
+virtio-net accelerated datapath: STANDBY mode
+=============================================
+
+net_failover enables hypervisor controlled accelerated datapath to virtio-net
+enabled VMs in a transparent manner with no/minimal guest userspace changes.
+
+To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY
+feature on the virtio-net interface and assign the same MAC address to both
+virtio-net and VF interfaces.
+
+Here is an example libvirt XML snippet that shows such configuration:
+::
+
+ <interface type='network'>
+ <mac address='52:54:00:00:12:53'/>
+ <source network='enp66s0f0_br'/>
+ <target dev='tap01'/>
+ <model type='virtio'/>
+ <driver name='vhost' queues='4'/>
+ <link state='down'/>
+ <teaming type='persistent'/>
+ <alias name='ua-backup0'/>
+ </interface>
+ <interface type='hostdev' managed='yes'>
+ <mac address='52:54:00:00:12:53'/>
+ <source>
+ <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
+ </source>
+ <teaming type='transient' persistent='ua-backup0'/>
+ </interface>
+
+In this configuration, the first device definition is for the virtio-net
+interface and this acts as the 'persistent' device indicating that this
+interface will always be plugged in. This is specified by the 'teaming' tag with
+required attribute type having value 'persistent'. The link state for the
+virtio-net device is set to 'down' to ensure that the 'failover' netdev prefers
+the VF passthrough device for normal communication. The virtio-net device will
+be brought UP during live migration to allow uninterrupted communication.
+
+The second device definition is for the VF passthrough interface. Here the
+'teaming' tag is provided with type 'transient' indicating that this device may
+periodically be unplugged. A second attribute - 'persistent' is provided and
+points to the alias name declared for the virtio-net device.
+
+Booting a VM with the above configuration will result in the following 3
+interfaces created in the VM:
+::
+
+ 4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
+ link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+ inet 192.168.12.53/24 brd 192.168.12.255 scope global dynamic ens10
+ valid_lft 42482sec preferred_lft 42482sec
+ inet6 fe80::97d8:db2:8c10:b6d6/64 scope link
+ valid_lft forever preferred_lft forever
+ 5: ens10nsby: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master ens10 state DOWN group default qlen 1000
+ link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+ 7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000
+ link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+
+Here, ens10 is the 'failover' master interface, ens10nsby is the slave 'standby'
+virtio-net interface, and ens11 is the slave 'primary' VF passthrough interface.
+
+One point to note here is that some user space network configuration daemons
+like systemd-networkd, ifupdown, etc, do not understand the 'net_failover'
+device; and on the first boot, the VM might end up with both 'failover' device
+and VF acquiring IP addresses (either same or different) from the DHCP server.
+This will result in lack of connectivity to the VM. So some tweaks might be
+needed to these network configuration daemons to make sure that an IP is
+received only on the 'failover' device.
+
+Below is the patch snippet used with 'cloud-ifupdown-helper' script found on
+Debian cloud images:
+
+::
+ @@ -27,6 +27,8 @@ do_setup() {
+ local working="$cfgdir/.$INTERFACE"
+ local final="$cfgdir/$INTERFACE"
+
+ + if [ -d "/sys/class/net/${INTERFACE}/master" ]; then exit 0; fi
+ +
+ if ifup --no-act "$INTERFACE" > /dev/null 2>&1; then
+ # interface is already known to ifupdown, no need to generate cfg
+ log "Skipping configuration generation for $INTERFACE"
+
+
+Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode
+==================================================================
+
+net_failover also enables hypervisor controlled live migration to be supported
+with VMs that have direct attached SR-IOV VF devices by automatic failover to
+the paravirtual datapath when the VF is unplugged.
+
+Here is a sample script that shows the steps to initiate live migration from
+the source hypervisor. Note: It is assumed that the VM is connected to a
+software bridge 'br0' which has a single VF attached to it along with the vnet
+device to the VM. This is not the VF that was passthrough'd to the VM (seen in
+the vf.xml file).
+::
+
+ # cat vf.xml
+ <interface type='hostdev' managed='yes'>
+ <mac address='52:54:00:00:12:53'/>
+ <source>
+ <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
+ </source>
+ <teaming type='transient' persistent='ua-backup0'/>
+ </interface>
+
+ # Source Hypervisor migrate.sh
+ #!/bin/bash
+
+ DOMAIN=vm-01
+ PF=ens6np0
+ VF=ens6v1 # VF attached to the bridge.
+ VF_NUM=1
+ TAP_IF=vmtap01 # virtio-net interface in the VM.
+ VF_XML=vf.xml
+
+ MAC=52:54:00:00:12:53
+ ZERO_MAC=00:00:00:00:00:00
+
+ # Set the virtio-net interface up.
+ virsh domif-setlink $DOMAIN $TAP_IF up
+
+ # Remove the VF that was passthrough'd to the VM.
+ virsh detach-device --live --config $DOMAIN $VF_XML
+
+ ip link set $PF vf $VF_NUM mac $ZERO_MAC
+
+ # Add FDB entry for traffic to continue going to the VM via
+ # the VF -> br0 -> vnet interface path.
+ bridge fdb add $MAC dev $VF
+ bridge fdb add $MAC dev $TAP_IF master
+
+ # Migrate the VM
+ virsh migrate --live --persistent $DOMAIN qemu+ssh://$REMOTE_HOST/system
+
+ # Clean up FDB entries after migration completes.
+ bridge fdb del $MAC dev $VF
+ bridge fdb del $MAC dev $TAP_IF master
+
+On the destination hypervisor, a shared bridge 'br0' is created before migration
+starts, and a VF from the destination PF is added to the bridge. Similarly an
+appropriate FDB entry is added.
+
+The following script is executed on the destination hypervisor once migration
+completes, and it reattaches the VF to the VM and brings down the virtio-net
+interface.
+
+::
+ # reattach-vf.sh
+ #!/bin/bash
+
+ bridge fdb del 52:54:00:00:12:53 dev ens36v0
+ bridge fdb del 52:54:00:00:12:53 dev vmtap01 master
+ virsh attach-device --config --live vm01 vf.xml
+ virsh domif-setlink vm01 vmtap01 down
diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst
new file mode 100644
index 0000000000..7a9de0568e
--- /dev/null
+++ b/Documentation/networking/netconsole.rst
@@ -0,0 +1,248 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Netconsole
+==========
+
+
+started by Ingo Molnar <mingo@redhat.com>, 2001.09.17
+
+2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003
+
+IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013
+
+Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015
+
+Release prepend support by Breno Leitao <leitao@debian.org>, Jul 7 2023
+
+Please send bug reports to Matt Mackall <mpm@selenic.com>
+Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com>
+
+Introduction:
+=============
+
+This module logs kernel printk messages over UDP allowing debugging of
+problem where disk logging fails and serial consoles are impractical.
+
+It can be used either built-in or as a module. As a built-in,
+netconsole initializes immediately after NIC cards and will bring up
+the specified interface as soon as possible. While this doesn't allow
+capture of early kernel panics, it does capture most of the boot
+process.
+
+Sender and receiver configuration:
+==================================
+
+It takes a string configuration parameter "netconsole" in the
+following format::
+
+ netconsole=[+][r][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
+
+ where
+ + if present, enable extended console support
+ r if present, prepend kernel version (release) to the message
+ src-port source for UDP packets (defaults to 6665)
+ src-ip source IP to use (interface address)
+ dev network interface (eth0)
+ tgt-port port for logging agent (6666)
+ tgt-ip IP address for logging agent
+ tgt-macaddr ethernet MAC address for logging agent (broadcast)
+
+Examples::
+
+ linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
+
+or::
+
+ insmod netconsole netconsole=@/,@10.0.0.2/
+
+or using IPv6::
+
+ insmod netconsole netconsole=@/,@fd00:1:2:3::1/
+
+It also supports logging to multiple remote agents by specifying
+parameters for the multiple agents separated by semicolons and the
+complete string enclosed in "quotes", thusly::
+
+ modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/"
+
+Built-in netconsole starts immediately after the TCP stack is
+initialized and attempts to bring up the supplied dev at the supplied
+address.
+
+The remote host has several options to receive the kernel messages,
+for example:
+
+1) syslogd
+
+2) netcat
+
+ On distributions using a BSD-based netcat version (e.g. Fedora,
+ openSUSE and Ubuntu) the listening port must be specified without
+ the -p switch::
+
+ nc -u -l -p <port>' / 'nc -u -l <port>
+
+ or::
+
+ netcat -u -l -p <port>' / 'netcat -u -l <port>
+
+3) socat
+
+::
+
+ socat udp-recv:<port> -
+
+Dynamic reconfiguration:
+========================
+
+Dynamic reconfigurability is a useful addition to netconsole that enables
+remote logging targets to be dynamically added, removed, or have their
+parameters reconfigured at runtime from a configfs-based userspace interface.
+[ Note that the parameters of netconsole targets that were specified/created
+from the boot/module option are not exposed via this interface, and hence
+cannot be modified dynamically. ]
+
+To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the
+netconsole module (or kernel, if netconsole is built-in).
+
+Some examples follow (where configfs is mounted at the /sys/kernel/config
+mountpoint).
+
+To add a remote logging target (target names can be arbitrary)::
+
+ cd /sys/kernel/config/netconsole/
+ mkdir target1
+
+Note that newly created targets have default parameter values (as mentioned
+above) and are disabled by default -- they must first be enabled by writing
+"1" to the "enabled" attribute (usually after setting parameters accordingly)
+as described below.
+
+To remove a target::
+
+ rmdir /sys/kernel/config/netconsole/othertarget/
+
+The interface exposes these parameters of a netconsole target to userspace:
+
+ ============== ================================= ============
+ enabled Is this target currently enabled? (read-write)
+ extended Extended mode enabled (read-write)
+ release Prepend kernel release to message (read-write)
+ dev_name Local network interface name (read-write)
+ local_port Source UDP port to use (read-write)
+ remote_port Remote agent's UDP port (read-write)
+ local_ip Source IP address to use (read-write)
+ remote_ip Remote agent's IP address (read-write)
+ local_mac Local interface's MAC address (read-only)
+ remote_mac Remote agent's MAC address (read-write)
+ ============== ================================= ============
+
+The "enabled" attribute is also used to control whether the parameters of
+a target can be updated or not -- you can modify the parameters of only
+disabled targets (i.e. if "enabled" is 0).
+
+To update a target's parameters::
+
+ cat enabled # check if enabled is 1
+ echo 0 > enabled # disable the target (if required)
+ echo eth2 > dev_name # set local interface
+ echo 10.0.0.4 > remote_ip # update some parameter
+ echo cb:a9:87:65:43:21 > remote_mac # update more parameters
+ echo 1 > enabled # enable target again
+
+You can also update the local interface dynamically. This is especially
+useful if you want to use interfaces that have newly come up (and may not
+have existed when netconsole was loaded / initialized).
+
+Extended console:
+=================
+
+If '+' is prefixed to the configuration line or "extended" config file
+is set to 1, extended console support is enabled. An example boot
+param follows::
+
+ linux netconsole=+4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
+
+Log messages are transmitted with extended metadata header in the
+following format which is the same as /dev/kmsg::
+
+ <level>,<sequnum>,<timestamp>,<contflag>;<message text>
+
+If 'r' (release) feature is enabled, the kernel release version is
+prepended to the start of the message. Example::
+
+ 6.4.0,6,444,501151268,-;netconsole: network logging started
+
+Non printable characters in <message text> are escaped using "\xff"
+notation. If the message contains optional dictionary, verbatim
+newline is used as the delimiter.
+
+If a message doesn't fit in certain number of bytes (currently 1000),
+the message is split into multiple fragments by netconsole. These
+fragments are transmitted with "ncfrag" header field added::
+
+ ncfrag=<byte-offset>/<total-bytes>
+
+For example, assuming a lot smaller chunk size, a message "the first
+chunk, the 2nd chunk." may be split as follows::
+
+ 6,416,1758426,-,ncfrag=0/31;the first chunk,
+ 6,416,1758426,-,ncfrag=16/31; the 2nd chunk.
+
+Miscellaneous notes:
+====================
+
+.. Warning::
+
+ the default target ethernet setting uses the broadcast
+ ethernet address to send packets, which can cause increased load on
+ other systems on the same ethernet segment.
+
+.. Tip::
+
+ some LAN switches may be configured to suppress ethernet broadcasts
+ so it is advised to explicitly specify the remote agents' MAC addresses
+ from the config parameters passed to netconsole.
+
+.. Tip::
+
+ to find out the MAC address of, say, 10.0.0.2, you may try using::
+
+ ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
+
+.. Tip::
+
+ in case the remote logging agent is on a separate LAN subnet than
+ the sender, it is suggested to try specifying the MAC address of the
+ default gateway (you may use /sbin/route -n to find it out) as the
+ remote MAC address instead.
+
+.. note::
+
+ the network device (eth1 in the above case) can run any kind
+ of other network traffic, netconsole is not intrusive. Netconsole
+ might cause slight delays in other traffic if the volume of kernel
+ messages is high, but should have no other impact.
+
+.. note::
+
+ if you find that the remote logging agent is not receiving or
+ printing all messages from the sender, it is likely that you have set
+ the "console_loglevel" parameter (on the sender) to only send high
+ priority messages to the console. You can change this at runtime using::
+
+ dmesg -n 8
+
+ or by specifying "debug" on the kernel command line at boot, to send
+ all kernel messages to the console. A specific value for this parameter
+ can also be set using the "loglevel" kernel boot option. See the
+ dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst
+ for details.
+
+Netconsole was designed to be as instantaneous as possible, to
+enable the logging of even the most critical kernel bugs. It works
+from IRQ contexts as well, and does not enable interrupts while
+sending packets. Due to these unique needs, configuration cannot
+be more automatic, and some fundamental limitations will remain:
+only IP networks, UDP packets and ethernet devices are supported.
diff --git a/Documentation/networking/netdev-features.rst b/Documentation/networking/netdev-features.rst
new file mode 100644
index 0000000000..d7b15bb64d
--- /dev/null
+++ b/Documentation/networking/netdev-features.rst
@@ -0,0 +1,205 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+Netdev features mess and how to get out from it alive
+=====================================================
+
+Author:
+ Michał Mirosław <mirq-linux@rere.qmqm.pl>
+
+
+
+Part I: Feature sets
+====================
+
+Long gone are the days when a network card would just take and give packets
+verbatim. Today's devices add multiple features and bugs (read: offloads)
+that relieve an OS of various tasks like generating and checking checksums,
+splitting packets, classifying them. Those capabilities and their state
+are commonly referred to as netdev features in Linux kernel world.
+
+There are currently three sets of features relevant to the driver, and
+one used internally by network core:
+
+ 1. netdev->hw_features set contains features whose state may possibly
+ be changed (enabled or disabled) for a particular device by user's
+ request. This set should be initialized in ndo_init callback and not
+ changed later.
+
+ 2. netdev->features set contains features which are currently enabled
+ for a device. This should be changed only by network core or in
+ error paths of ndo_set_features callback.
+
+ 3. netdev->vlan_features set contains features whose state is inherited
+ by child VLAN devices (limits netdev->features set). This is currently
+ used for all VLAN devices whether tags are stripped or inserted in
+ hardware or software.
+
+ 4. netdev->wanted_features set contains feature set requested by user.
+ This set is filtered by ndo_fix_features callback whenever it or
+ some device-specific conditions change. This set is internal to
+ networking core and should not be referenced in drivers.
+
+
+
+Part II: Controlling enabled features
+=====================================
+
+When current feature set (netdev->features) is to be changed, new set
+is calculated and filtered by calling ndo_fix_features callback
+and netdev_fix_features(). If the resulting set differs from current
+set, it is passed to ndo_set_features callback and (if the callback
+returns success) replaces value stored in netdev->features.
+NETDEV_FEAT_CHANGE notification is issued after that whenever current
+set might have changed.
+
+The following events trigger recalculation:
+ 1. device's registration, after ndo_init returned success
+ 2. user requested changes in features state
+ 3. netdev_update_features() is called
+
+ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks
+are treated as always returning success.
+
+A driver that wants to trigger recalculation must do so by calling
+netdev_update_features() while holding rtnl_lock. This should not be done
+from ndo_*_features callbacks. netdev->features should not be modified by
+driver except by means of ndo_fix_features callback.
+
+
+
+Part III: Implementation hints
+==============================
+
+ * ndo_fix_features:
+
+All dependencies between features should be resolved here. The resulting
+set can be reduced further by networking core imposed limitations (as coded
+in netdev_fix_features()). For this reason it is safer to disable a feature
+when its dependencies are not met instead of forcing the dependency on.
+
+This callback should not modify hardware nor driver state (should be
+stateless). It can be called multiple times between successive
+ndo_set_features calls.
+
+Callback must not alter features contained in NETIF_F_SOFT_FEATURES or
+NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but
+care must be taken as the change won't affect already configured VLANs.
+
+ * ndo_set_features:
+
+Hardware should be reconfigured to match passed feature set. The set
+should not be altered unless some error condition happens that can't
+be reliably detected in ndo_fix_features. In this case, the callback
+should update netdev->features to match resulting hardware state.
+Errors returned are not (and cannot be) propagated anywhere except dmesg.
+(Note: successful return is zero, >0 means silent error.)
+
+
+
+Part IV: Features
+=================
+
+For current list of features, see include/linux/netdev_features.h.
+This section describes semantics of some of them.
+
+ * Transmit checksumming
+
+For complete description, see comments near the top of include/linux/skbuff.h.
+
+Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
+It means that device can fill TCP/UDP-like checksum anywhere in the packets
+whatever headers there might be.
+
+ * Transmit TCP segmentation offload
+
+NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit
+set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6).
+
+ * Transmit UDP segmentation offload
+
+NETIF_F_GSO_UDP_L4 accepts a single UDP header with a payload that exceeds
+gso_size. On segmentation, it segments the payload on gso_size boundaries and
+replicates the network and UDP headers (fixing up the last one if less than
+gso_size).
+
+ * Transmit DMA from high memory
+
+On platforms where this is relevant, NETIF_F_HIGHDMA signals that
+ndo_start_xmit can handle skbs with frags in high memory.
+
+ * Transmit scatter-gather
+
+Those features say that ndo_start_xmit can handle fragmented skbs:
+NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST ---
+chained skbs (skb->next/prev list).
+
+ * Software features
+
+Features contained in NETIF_F_SOFT_FEATURES are features of networking
+stack. Driver should not change behaviour based on them.
+
+ * LLTX driver (deprecated for hardware drivers)
+
+NETIF_F_LLTX is meant to be used by drivers that don't need locking at all,
+e.g. software tunnels.
+
+This is also used in a few legacy drivers that implement their
+own locking, don't use it for new (hardware) drivers.
+
+ * netns-local device
+
+NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between
+network namespaces (e.g. loopback).
+
+Don't use it in drivers.
+
+ * VLAN challenged
+
+NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN
+headers. Some drivers set this because the cards can't handle the bigger MTU.
+[FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU
+VLANs. This may be not useful, though.]
+
+* rx-fcs
+
+This requests that the NIC append the Ethernet Frame Checksum (FCS)
+to the end of the skb data. This allows sniffers and other tools to
+read the CRC recorded by the NIC on receipt of the packet.
+
+* rx-all
+
+This requests that the NIC receive all possible frames, including errored
+frames (such as bad FCS, etc). This can be helpful when sniffing a link with
+bad packets on it. Some NICs may receive more packets if also put into normal
+PROMISC mode.
+
+* rx-gro-hw
+
+This requests that the NIC enables Hardware GRO (generic receive offload).
+Hardware GRO is basically the exact reverse of TSO, and is generally
+stricter than Hardware LRO. A packet stream merged by Hardware GRO must
+be re-segmentable by GSO or TSO back to the exact original packet stream.
+Hardware GRO is dependent on RXCSUM since every packet successfully merged
+by hardware must also have the checksum verified by hardware.
+
+* hsr-tag-ins-offload
+
+This should be set for devices which insert an HSR (High-availability Seamless
+Redundancy) or PRP (Parallel Redundancy Protocol) tag automatically.
+
+* hsr-tag-rm-offload
+
+This should be set for devices which remove HSR (High-availability Seamless
+Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically.
+
+* hsr-fwd-offload
+
+This should be set for devices which forward HSR (High-availability Seamless
+Redundancy) frames from one port to another in hardware.
+
+* hsr-dup-offload
+
+This should be set for devices which duplicate outgoing HSR (High-availability
+Seamless Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically
+frames in hardware.
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
new file mode 100644
index 0000000000..9e4cccb90b
--- /dev/null
+++ b/Documentation/networking/netdevices.rst
@@ -0,0 +1,299 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Network Devices, the Kernel, and You!
+=====================================
+
+
+Introduction
+============
+The following is a random collection of documentation regarding
+network devices.
+
+struct net_device lifetime rules
+================================
+Network device structures need to persist even after module is unloaded and
+must be allocated with alloc_netdev_mqs() and friends.
+If device has registered successfully, it will be freed on last use
+by free_netdev(). This is required to handle the pathological case cleanly
+(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)
+
+alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
+private data which gets freed when the network device is freed. If
+separately allocated data is attached to the network device
+(netdev_priv()) then it is up to the module exit handler to free that.
+
+There are two groups of APIs for registering struct net_device.
+First group can be used in normal contexts where ``rtnl_lock`` is not already
+held: register_netdev(), unregister_netdev().
+Second group can be used when ``rtnl_lock`` is already held:
+register_netdevice(), unregister_netdevice(), free_netdevice().
+
+Simple drivers
+--------------
+
+Most drivers (especially device drivers) handle lifetime of struct net_device
+in context where ``rtnl_lock`` is not held (e.g. driver probe and remove paths).
+
+In that case the struct net_device registration is done using
+the register_netdev(), and unregister_netdev() functions:
+
+.. code-block:: c
+
+ int probe()
+ {
+ struct my_device_priv *priv;
+ int err;
+
+ dev = alloc_netdev_mqs(...);
+ if (!dev)
+ return -ENOMEM;
+ priv = netdev_priv(dev);
+
+ /* ... do all device setup before calling register_netdev() ...
+ */
+
+ err = register_netdev(dev);
+ if (err)
+ goto err_undo;
+
+ /* net_device is visible to the user! */
+
+ err_undo:
+ /* ... undo the device setup ... */
+ free_netdev(dev);
+ return err;
+ }
+
+ void remove()
+ {
+ unregister_netdev(dev);
+ free_netdev(dev);
+ }
+
+Note that after calling register_netdev() the device is visible in the system.
+Users can open it and start sending / receiving traffic immediately,
+or run any other callback, so all initialization must be done prior to
+registration.
+
+unregister_netdev() closes the device and waits for all users to be done
+with it. The memory of struct net_device itself may still be referenced
+by sysfs but all operations on that device will fail.
+
+free_netdev() can be called after unregister_netdev() returns on when
+register_netdev() failed.
+
+Device management under RTNL
+----------------------------
+
+Registering struct net_device while in context which already holds
+the ``rtnl_lock`` requires extra care. In those scenarios most drivers
+will want to make use of struct net_device's ``needs_free_netdev``
+and ``priv_destructor`` members for freeing of state.
+
+Example flow of netdev handling under ``rtnl_lock``:
+
+.. code-block:: c
+
+ static void my_setup(struct net_device *dev)
+ {
+ dev->needs_free_netdev = true;
+ }
+
+ static void my_destructor(struct net_device *dev)
+ {
+ some_obj_destroy(priv->obj);
+ some_uninit(priv);
+ }
+
+ int create_link()
+ {
+ struct my_device_priv *priv;
+ int err;
+
+ ASSERT_RTNL();
+
+ dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
+ if (!dev)
+ return -ENOMEM;
+ priv = netdev_priv(dev);
+
+ /* Implicit constructor */
+ err = some_init(priv);
+ if (err)
+ goto err_free_dev;
+
+ priv->obj = some_obj_create();
+ if (!priv->obj) {
+ err = -ENOMEM;
+ goto err_some_uninit;
+ }
+ /* End of constructor, set the destructor: */
+ dev->priv_destructor = my_destructor;
+
+ err = register_netdevice(dev);
+ if (err)
+ /* register_netdevice() calls destructor on failure */
+ goto err_free_dev;
+
+ /* If anything fails now unregister_netdevice() (or unregister_netdev())
+ * will take care of calling my_destructor and free_netdev().
+ */
+
+ return 0;
+
+ err_some_uninit:
+ some_uninit(priv);
+ err_free_dev:
+ free_netdev(dev);
+ return err;
+ }
+
+If struct net_device.priv_destructor is set it will be called by the core
+some time after unregister_netdevice(), it will also be called if
+register_netdevice() fails. The callback may be invoked with or without
+``rtnl_lock`` held.
+
+There is no explicit constructor callback, driver "constructs" the private
+netdev state after allocating it and before registration.
+
+Setting struct net_device.needs_free_netdev makes core call free_netdevice()
+automatically after unregister_netdevice() when all references to the device
+are gone. It only takes effect after a successful call to register_netdevice()
+so if register_netdevice() fails driver is responsible for calling
+free_netdev().
+
+free_netdev() is safe to call on error paths right after unregister_netdevice()
+or when register_netdevice() fails. Parts of netdev (de)registration process
+happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
+will defer some of the processing until ``rtnl_lock`` is released.
+
+Devices spawned from struct rtnl_link_ops should never free the
+struct net_device directly.
+
+.ndo_init and .ndo_uninit
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
+registration and de-registration, under ``rtnl_lock``. Drivers can use
+those e.g. when parts of their init process need to run under ``rtnl_lock``.
+
+``.ndo_init`` runs before device is visible in the system, ``.ndo_uninit``
+runs during de-registering after device is closed but other subsystems
+may still have outstanding references to the netdevice.
+
+MTU
+===
+Each network device has a Maximum Transfer Unit. The MTU does not
+include any link layer protocol overhead. Upper layer protocols must
+not pass a socket buffer (skb) to a device to transmit with more data
+than the mtu. The MTU does not include link layer header overhead, so
+for example on Ethernet if the standard MTU is 1500 bytes used, the
+actual skb will contain up to 1514 bytes because of the Ethernet
+header. Devices should allow for the 4 byte VLAN header as well.
+
+Segmentation Offload (GSO, TSO) is an exception to this rule. The
+upper layer protocol may pass a large socket buffer to the device
+transmit routine, and the device will break that up into separate
+packets based on the current MTU.
+
+MTU is symmetrical and applies both to receive and transmit. A device
+must be able to receive at least the maximum size packet allowed by
+the MTU. A network device may use the MTU as mechanism to size receive
+buffers, but the device should allow packets with VLAN header. With
+standard Ethernet mtu of 1500 bytes, the device should allow up to
+1518 byte packets (1500 + 14 header + 4 tag). The device may either:
+drop, truncate, or pass up oversize packets, but dropping oversize
+packets is preferred.
+
+
+struct net_device synchronization rules
+=======================================
+ndo_open:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ndo_stop:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+ Note: netif_running() is guaranteed false
+
+ndo_do_ioctl:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ This is only called by network subsystems internally,
+ not by user space calling ioctl as it was in before
+ linux-5.14.
+
+ndo_siocbond:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ Used by the bonding driver for the SIOCBOND family of
+ ioctl commands.
+
+ndo_siocwandev:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ Used by the drivers/net/wan framework to handle
+ the SIOCWANDEV ioctl with the if_settings structure.
+
+ndo_siocdevprivate:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ This is used to implement SIOCDEVPRIVATE ioctl helpers.
+ These should not be added to new drivers, so don't use.
+
+ndo_eth_ioctl:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ndo_get_stats:
+ Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
+ Context: atomic (can't sleep under rwlock or RCU)
+
+ndo_start_xmit:
+ Synchronization: __netif_tx_lock spinlock.
+
+ When the driver sets NETIF_F_LLTX in dev->features this will be
+ called without holding netif_tx_lock. In this case the driver
+ has to lock by itself when needed.
+ The locking there should also properly protect against
+ set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
+ Don't use it for new drivers.
+
+ Context: Process with BHs disabled or BH (timer),
+ will be called with interrupts disabled by netconsole.
+
+ Return codes:
+
+ * NETDEV_TX_OK everything ok.
+ * NETDEV_TX_BUSY Cannot transmit packet, try later
+ Usually a bug, means queue start/stop flow control is broken in
+ the driver. Note: the driver must NOT put the skb in its DMA ring.
+
+ndo_tx_timeout:
+ Synchronization: netif_tx_lock spinlock; all TX queues frozen.
+ Context: BHs disabled
+ Notes: netif_queue_stopped() is guaranteed true
+
+ndo_set_rx_mode:
+ Synchronization: netif_addr_lock spinlock.
+ Context: BHs disabled
+
+struct napi_struct synchronization rules
+========================================
+napi->poll:
+ Synchronization:
+ NAPI_STATE_SCHED bit in napi->state. Device
+ driver's ndo_stop method will invoke napi_disable() on
+ all NAPI instances which will do a sleeping poll on the
+ NAPI_STATE_SCHED napi->state bit, waiting for all pending
+ NAPI activity to cease.
+
+ Context:
+ softirq
+ will be called with interrupts disabled by netconsole.
diff --git a/Documentation/networking/netfilter-sysctl.rst b/Documentation/networking/netfilter-sysctl.rst
new file mode 100644
index 0000000000..beb6d7b275
--- /dev/null
+++ b/Documentation/networking/netfilter-sysctl.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Netfilter Sysfs variables
+=========================
+
+/proc/sys/net/netfilter/* Variables:
+====================================
+
+nf_log_all_netns - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ By default, only init_net namespace can log packets into kernel log
+ with LOG target; this aims to prevent containers from flooding host
+ kernel log. If enabled, this target also works in other network
+ namespaces. This variable is only accessible from init_net.
diff --git a/Documentation/networking/netif-msg.rst b/Documentation/networking/netif-msg.rst
new file mode 100644
index 0000000000..b20d265a73
--- /dev/null
+++ b/Documentation/networking/netif-msg.rst
@@ -0,0 +1,95 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+NETIF Msg Level
+===============
+
+The design of the network interface message level setting.
+
+History
+-------
+
+ The design of the debugging message interface was guided and
+ constrained by backwards compatibility previous practice. It is useful
+ to understand the history and evolution in order to understand current
+ practice and relate it to older driver source code.
+
+ From the beginning of Linux, each network device driver has had a local
+ integer variable that controls the debug message level. The message
+ level ranged from 0 to 7, and monotonically increased in verbosity.
+
+ The message level was not precisely defined past level 3, but were
+ always implemented within +-1 of the specified level. Drivers tended
+ to shed the more verbose level messages as they matured.
+
+ - 0 Minimal messages, only essential information on fatal errors.
+ - 1 Standard messages, initialization status. No run-time messages
+ - 2 Special media selection messages, generally timer-driver.
+ - 3 Interface starts and stops, including normal status messages
+ - 4 Tx and Rx frame error messages, and abnormal driver operation
+ - 5 Tx packet queue information, interrupt events.
+ - 6 Status on each completed Tx packet and received Rx packets
+ - 7 Initial contents of Tx and Rx packets
+
+ Initially this message level variable was uniquely named in each driver
+ e.g. "lance_debug", so that a kernel symbolic debugger could locate and
+ modify the setting. When kernel modules became common, the variables
+ were consistently renamed to "debug" and allowed to be set as a module
+ parameter.
+
+ This approach worked well. However there is always a demand for
+ additional features. Over the years the following emerged as
+ reasonable and easily implemented enhancements
+
+ - Using an ioctl() call to modify the level.
+ - Per-interface rather than per-driver message level setting.
+ - More selective control over the type of messages emitted.
+
+ The netif_msg recommendation adds these features with only a minor
+ complexity and code size increase.
+
+ The recommendation is the following points
+
+ - Retaining the per-driver integer variable "debug" as a module
+ parameter with a default level of '1'.
+
+ - Adding a per-interface private variable named "msg_enable". The
+ variable is a bit map rather than a level, and is initialized as::
+
+ 1 << debug
+
+ Or more precisely::
+
+ debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug)
+
+ Messages should changes from::
+
+ if (debug > 1)
+ printk(MSG_DEBUG "%s: ...
+
+ to::
+
+ if (np->msg_enable & NETIF_MSG_LINK)
+ printk(MSG_DEBUG "%s: ...
+
+
+The set of message levels is named
+
+
+ ========= =================== ============
+ Old level Name Bit position
+ ========= =================== ============
+ 0 NETIF_MSG_DRV 0x0001
+ 1 NETIF_MSG_PROBE 0x0002
+ 2 NETIF_MSG_LINK 0x0004
+ 2 NETIF_MSG_TIMER 0x0004
+ 3 NETIF_MSG_IFDOWN 0x0008
+ 3 NETIF_MSG_IFUP 0x0008
+ 4 NETIF_MSG_RX_ERR 0x0010
+ 4 NETIF_MSG_TX_ERR 0x0010
+ 5 NETIF_MSG_TX_QUEUED 0x0020
+ 5 NETIF_MSG_INTR 0x0020
+ 6 NETIF_MSG_TX_DONE 0x0040
+ 6 NETIF_MSG_RX_STATUS 0x0040
+ 7 NETIF_MSG_PKTDATA 0x0080
+ ========= =================== ============
diff --git a/Documentation/networking/nexthop-group-resilient.rst b/Documentation/networking/nexthop-group-resilient.rst
new file mode 100644
index 0000000000..fabecee24d
--- /dev/null
+++ b/Documentation/networking/nexthop-group-resilient.rst
@@ -0,0 +1,293 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Resilient Next-hop Groups
+=========================
+
+Resilient groups are a type of next-hop group that is aimed at minimizing
+disruption in flow routing across changes to the group composition and
+weights of constituent next hops.
+
+The idea behind resilient hashing groups is best explained in contrast to
+the legacy multipath next-hop group, which uses the hash-threshold
+algorithm, described in RFC 2992.
+
+To select a next hop, hash-threshold algorithm first assigns a range of
+hashes to each next hop in the group, and then selects the next hop by
+comparing the SKB hash with the individual ranges. When a next hop is
+removed from the group, the ranges are recomputed, which leads to
+reassignment of parts of hash space from one next hop to another. RFC 2992
+illustrates it thus::
+
+ +-------+-------+-------+-------+-------+
+ | 1 | 2 | 3 | 4 | 5 |
+ +-------+-+-----+---+---+-----+-+-------+
+ | 1 | 2 | 4 | 5 |
+ +---------+---------+---------+---------+
+
+ Before and after deletion of next hop 3
+ under the hash-threshold algorithm.
+
+Note how next hop 2 gave up part of the hash space in favor of next hop 1,
+and 4 in favor of 5. While there will usually be some overlap between the
+previous and the new distribution, some traffic flows change the next hop
+that they resolve to.
+
+If a multipath group is used for load-balancing between multiple servers,
+this hash space reassignment causes an issue that packets from a single
+flow suddenly end up arriving at a server that does not expect them. This
+can result in TCP connections being reset.
+
+If a multipath group is used for load-balancing among available paths to
+the same server, the issue is that different latencies and reordering along
+the way causes the packets to arrive in the wrong order, resulting in
+degraded application performance.
+
+To mitigate the above-mentioned flow redirection, resilient next-hop groups
+insert another layer of indirection between the hash space and its
+constituent next hops: a hash table. The selection algorithm uses SKB hash
+to choose a hash table bucket, then reads the next hop that this bucket
+contains, and forwards traffic there.
+
+This indirection brings an important feature. In the hash-threshold
+algorithm, the range of hashes associated with a next hop must be
+continuous. With a hash table, mapping between the hash table buckets and
+the individual next hops is arbitrary. Therefore when a next hop is deleted
+the buckets that held it are simply reassigned to other next hops::
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ v v v v
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Before and after deletion of next hop 3
+ under the resilient hashing algorithm.
+
+When weights of next hops in a group are altered, it may be possible to
+choose a subset of buckets that are currently not used for forwarding
+traffic, and use those to satisfy the new next-hop distribution demands,
+keeping the "busy" buckets intact. This way, established flows are ideally
+kept being forwarded to the same endpoints through the same paths as before
+the next-hop group change.
+
+Algorithm
+---------
+
+In a nutshell, the algorithm works as follows. Each next hop deserves a
+certain number of buckets, according to its weight and the number of
+buckets in the hash table. In accordance with the source code, we will call
+this number a "wants count" of a next hop. In case of an event that might
+cause bucket allocation change, the wants counts for individual next hops
+are updated.
+
+Next hops that have fewer buckets than their wants count, are called
+"underweight". Those that have more are "overweight". If there are no
+overweight (and therefore no underweight) next hops in the group, it is
+said to be "balanced".
+
+Each bucket maintains a last-used timer. Every time a packet is forwarded
+through a bucket, this timer is updated to current jiffies value. One
+attribute of a resilient group is then the "idle timer", which is the
+amount of time that a bucket must not be hit by traffic in order for it to
+be considered "idle". Buckets that are not idle are busy.
+
+After assigning wants counts to next hops, an "upkeep" algorithm runs. For
+buckets:
+
+1) that have no assigned next hop, or
+2) whose next hop has been removed, or
+3) that are idle and their next hop is overweight,
+
+upkeep changes the next hop that the bucket references to one of the
+underweight next hops. If, after considering all buckets in this manner,
+there are still underweight next hops, another upkeep run is scheduled to a
+future time.
+
+There may not be enough "idle" buckets to satisfy the updated wants counts
+of all next hops. Another attribute of a resilient group is the "unbalanced
+timer". This timer can be set to 0, in which case the table will stay out
+of balance until idle buckets do appear, possibly never. If set to a
+non-zero value, the value represents the period of time that the table is
+permitted to stay out of balance.
+
+With this in mind, we update the above list of conditions with one more
+item. Thus buckets:
+
+4) whose next hop is overweight, and the amount of time that the table has
+ been out of balance exceeds the unbalanced timer, if that is non-zero,
+
+\... are migrated as well.
+
+Offloading & Driver Feedback
+----------------------------
+
+When offloading resilient groups, the algorithm that distributes buckets
+among next hops is still the one in SW. Drivers are notified of updates to
+next hop groups in the following three ways:
+
+- Full group notification with the type
+ ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
+ created and buckets populated for the first time.
+
+- Single-bucket notifications of the type
+ ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
+ individual migrations within an already-established group.
+
+- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
+ is sent before the group is replaced, and is a way for the driver to veto
+ the group before committing anything to the HW.
+
+Some single-bucket notifications are forced, as indicated by the "force"
+flag in the notification. Those are used for the cases where e.g. the next
+hop associated with the bucket was removed, and the bucket really must be
+migrated.
+
+Non-forced notifications can be overridden by the driver by returning an
+error code. The use case for this is that the driver notifies the HW that a
+bucket should be migrated, but the HW discovers that the bucket has in fact
+been hit by traffic.
+
+A second way for the HW to report that a bucket is busy is through the
+``nexthop_res_grp_activity_update()`` API. The buckets identified this way
+as busy are treated as if traffic hit them.
+
+Offloaded buckets should be flagged as either "offload" or "trap". This is
+done through the ``nexthop_bucket_set_hw_flags()`` API.
+
+Netlink UAPI
+------------
+
+Resilient Group Replacement
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
+same manner as other multipath groups. The following changes apply to the
+attributes passed in the netlink message:
+
+ =================== =========================================================
+ ``NHA_GROUP_TYPE`` Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
+ ``NHA_RES_GROUP`` A nest that contains attributes specific to resilient
+ groups.
+ =================== =========================================================
+
+``NHA_RES_GROUP`` payload:
+
+ =================================== =========================================
+ ``NHA_RES_GROUP_BUCKETS`` Number of buckets in the hash table.
+ ``NHA_RES_GROUP_IDLE_TIMER`` Idle timer in units of clock_t.
+ ``NHA_RES_GROUP_UNBALANCED_TIMER`` Unbalanced timer in units of clock_t.
+ =================================== =========================================
+
+Next Hop Get
+^^^^^^^^^^^^
+
+Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
+message in exactly the same way as other next hop get requests. The
+response attributes match the replacement attributes cited above, except
+``NHA_RES_GROUP`` payload will include the following attribute:
+
+ =================================== =========================================
+ ``NHA_RES_GROUP_UNBALANCED_TIME`` How long has the resilient group been out
+ of balance, in units of clock_t.
+ =================================== =========================================
+
+Bucket Get
+^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
+used to request a single bucket. The attributes recognized at get requests
+are:
+
+ =================== =========================================================
+ ``NHA_ID`` ID of the next-hop group that the bucket belongs to.
+ ``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket.
+ =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+ ======================== ====================================================
+ ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
+ ======================== ====================================================
+
+Bucket Dumps
+^^^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
+to request a dump of matching buckets. The attributes recognized at dump
+requests are:
+
+ =================== =========================================================
+ ``NHA_ID`` If specified, limits the dump to just the next-hop group
+ with this ID.
+ ``NHA_OIF`` If specified, limits the dump to buckets that contain
+ next hops that use the device with this ifindex.
+ ``NHA_MASTER`` If specified, limits the dump to buckets that contain
+ next hops that use a device in the VRF with this ifindex.
+ ``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket.
+ =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+ ======================== ====================================================
+ ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
+ that contain the next hop with this ID.
+ ======================== ====================================================
+
+Usage
+-----
+
+To illustrate the usage, consider the following commands::
+
+ # ip nexthop add id 1 via 192.0.2.2 dev eth0
+ # ip nexthop add id 2 via 192.0.2.3 dev eth0
+ # ip nexthop add id 10 group 1/2 type resilient \
+ buckets 8 idle_timer 60 unbalanced_timer 300
+
+The last command creates a resilient next-hop group. It will have 8 buckets
+(which is unusually low number, and used here for demonstration purposes
+only), each bucket will be considered idle when no traffic hits it for at
+least 60 seconds, and if the table remains out of balance for 300 seconds,
+it will be forcefully brought into balance.
+
+Changing next-hop weights leads to change in bucket allocation::
+
+ # ip nexthop replace id 10 group 1,3/2 type resilient
+
+This can be confirmed by looking at individual buckets::
+
+ # ip nexthop bucket show id 10
+ id 10 index 0 idle_time 5.59 nhid 1
+ id 10 index 1 idle_time 5.59 nhid 1
+ id 10 index 2 idle_time 8.74 nhid 2
+ id 10 index 3 idle_time 8.74 nhid 2
+ id 10 index 4 idle_time 8.74 nhid 1
+ id 10 index 5 idle_time 8.74 nhid 1
+ id 10 index 6 idle_time 8.74 nhid 1
+ id 10 index 7 idle_time 8.74 nhid 1
+
+Note the two buckets that have a shorter idle time. Those are the ones that
+were migrated after the next-hop replace command to satisfy the new demand
+that next hop 1 be given 6 buckets instead of 4.
+
+Netdevsim
+---------
+
+The netdevsim driver implements a mock offload of resilient groups, and
+exposes debugfs interface that allows marking individual buckets as busy.
+For example, the following will mark bucket 23 in next-hop group 10 as
+active::
+
+ # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity
+
+In addition, another debugfs interface can be used to configure that the
+next attempt to migrate a bucket should fail::
+
+ # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
+
+Besides serving as an example, the interfaces that netdevsim exposes are
+useful in automated testing, and
+``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
+them to test the algorithm.
diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst
new file mode 100644
index 0000000000..c383a394c6
--- /dev/null
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Netfilter Conntrack Sysfs variables
+===================================
+
+/proc/sys/net/netfilter/nf_conntrack_* Variables:
+=================================================
+
+nf_conntrack_acct - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ Enable connection tracking flow accounting. 64-bit byte and packet
+ counters per flow are added.
+
+nf_conntrack_buckets - INTEGER
+ Size of hash table. If not specified as parameter during module
+ loading, the default size is calculated by dividing total memory
+ by 16384 to determine the number of buckets. The hash table will
+ never have fewer than 1024 and never more than 262144 buckets.
+ This sysctl is only writeable in the initial net namespace.
+
+nf_conntrack_checksum - BOOLEAN
+ - 0 - disabled
+ - not 0 - enabled (default)
+
+ Verify checksum of incoming packets. Packets with bad checksums are
+ in INVALID state. If this is enabled, such packets will not be
+ considered for connection tracking.
+
+nf_conntrack_count - INTEGER (read-only)
+ Number of currently allocated flow entries.
+
+nf_conntrack_events - BOOLEAN
+ - 0 - disabled
+ - 1 - enabled
+ - 2 - auto (default)
+
+ If this option is enabled, the connection tracking code will
+ provide userspace with connection tracking events via ctnetlink.
+ The default allocates the extension if a userspace program is
+ listening to ctnetlink events.
+
+nf_conntrack_expect_max - INTEGER
+ Maximum size of expectation table. Default value is
+ nf_conntrack_buckets / 256. Minimum is 1.
+
+nf_conntrack_frag6_high_thresh - INTEGER
+ default 262144
+
+ Maximum memory used to reassemble IPv6 fragments. When
+ nf_conntrack_frag6_high_thresh bytes of memory is allocated for this
+ purpose, the fragment handler will toss packets until
+ nf_conntrack_frag6_low_thresh is reached.
+
+nf_conntrack_frag6_low_thresh - INTEGER
+ default 196608
+
+ See nf_conntrack_frag6_low_thresh
+
+nf_conntrack_frag6_timeout - INTEGER (seconds)
+ default 60
+
+ Time to keep an IPv6 fragment in memory.
+
+nf_conntrack_generic_timeout - INTEGER (seconds)
+ default 600
+
+ Default for generic timeout. This refers to layer 4 unknown/unsupported
+ protocols.
+
+nf_conntrack_icmp_timeout - INTEGER (seconds)
+ default 30
+
+ Default for ICMP timeout.
+
+nf_conntrack_icmpv6_timeout - INTEGER (seconds)
+ default 30
+
+ Default for ICMP6 timeout.
+
+nf_conntrack_log_invalid - INTEGER
+ - 0 - disable (default)
+ - 1 - log ICMP packets
+ - 6 - log TCP packets
+ - 17 - log UDP packets
+ - 33 - log DCCP packets
+ - 41 - log ICMPv6 packets
+ - 136 - log UDPLITE packets
+ - 255 - log packets of any protocol
+
+ Log invalid packets of a type specified by value.
+
+nf_conntrack_max - INTEGER
+ Maximum number of allowed connection tracking entries. This value is set
+ to nf_conntrack_buckets by default.
+ Note that connection tracking entries are added to the table twice -- once
+ for the original direction and once for the reply direction (i.e., with
+ the reversed address). This means that with default settings a maxed-out
+ table will have a average hash chain length of 2, not 1.
+
+nf_conntrack_tcp_be_liberal - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ Be conservative in what you do, be liberal in what you accept from others.
+ If it's non-zero, we mark only out of window RST segments as INVALID.
+
+nf_conntrack_tcp_ignore_invalid_rst - BOOLEAN
+ - 0 - disabled (default)
+ - 1 - enabled
+
+ If it's 1, we don't mark out of window RST segments as INVALID.
+
+nf_conntrack_tcp_loose - BOOLEAN
+ - 0 - disabled
+ - not 0 - enabled (default)
+
+ If it is set to zero, we disable picking up already established
+ connections.
+
+nf_conntrack_tcp_max_retrans - INTEGER
+ default 3
+
+ Maximum number of packets that can be retransmitted without
+ received an (acceptable) ACK from the destination. If this number
+ is reached, a shorter timer will be started.
+
+nf_conntrack_tcp_timeout_close - INTEGER (seconds)
+ default 10
+
+nf_conntrack_tcp_timeout_close_wait - INTEGER (seconds)
+ default 60
+
+nf_conntrack_tcp_timeout_established - INTEGER (seconds)
+ default 432000 (5 days)
+
+nf_conntrack_tcp_timeout_fin_wait - INTEGER (seconds)
+ default 120
+
+nf_conntrack_tcp_timeout_last_ack - INTEGER (seconds)
+ default 30
+
+nf_conntrack_tcp_timeout_max_retrans - INTEGER (seconds)
+ default 300
+
+nf_conntrack_tcp_timeout_syn_recv - INTEGER (seconds)
+ default 60
+
+nf_conntrack_tcp_timeout_syn_sent - INTEGER (seconds)
+ default 120
+
+nf_conntrack_tcp_timeout_time_wait - INTEGER (seconds)
+ default 120
+
+nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds)
+ default 300
+
+nf_conntrack_timestamp - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ Enable connection tracking flow timestamping.
+
+nf_conntrack_sctp_timeout_closed - INTEGER (seconds)
+ default 10
+
+nf_conntrack_sctp_timeout_cookie_wait - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_cookie_echoed - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_established - INTEGER (seconds)
+ default 210
+
+ Default is set to (hb_interval * path_max_retrans + rto_max)
+
+nf_conntrack_sctp_timeout_shutdown_sent - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_shutdown_recd - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_shutdown_ack_sent - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_heartbeat_sent - INTEGER (seconds)
+ default 30
+
+ This timeout is used to setup conntrack entry on secondary paths.
+ Default is set to hb_interval.
+
+nf_conntrack_udp_timeout - INTEGER (seconds)
+ default 30
+
+nf_conntrack_udp_timeout_stream - INTEGER (seconds)
+ default 120
+
+ This extended timeout will be used in case there is an UDP stream
+ detected.
+
+nf_conntrack_gre_timeout - INTEGER (seconds)
+ default 30
+
+nf_conntrack_gre_timeout_stream - INTEGER (seconds)
+ default 180
+
+ This extended timeout will be used in case there is an GRE stream
+ detected.
+
+nf_hooks_lwtunnel - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ If this option is enabled, the lightweight tunnel netfilter hooks are
+ enabled. This option cannot be disabled once it is enabled.
+
+nf_flowtable_tcp_timeout - INTEGER (seconds)
+ default 30
+
+ Control offload timeout for tcp connections.
+ TCP connections may be offloaded from nf conntrack to nf flow table.
+ Once aged, the connection is returned to nf conntrack with tcp pickup timeout.
+
+nf_flowtable_udp_timeout - INTEGER (seconds)
+ default 30
+
+ Control offload timeout for udp connections.
+ UDP connections may be offloaded from nf conntrack to nf flow table.
+ Once aged, the connection is returned to nf conntrack with udp pickup timeout.
diff --git a/Documentation/networking/nf_flowtable.rst b/Documentation/networking/nf_flowtable.rst
new file mode 100644
index 0000000000..d757c21c10
--- /dev/null
+++ b/Documentation/networking/nf_flowtable.rst
@@ -0,0 +1,235 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+Netfilter's flowtable infrastructure
+====================================
+
+This documentation describes the Netfilter flowtable infrastructure which allows
+you to define a fastpath through the flowtable datapath. This infrastructure
+also provides hardware offload support. The flowtable supports for the layer 3
+IPv4 and IPv6 and the layer 4 TCP and UDP protocols.
+
+Overview
+--------
+
+Once the first packet of the flow successfully goes through the IP forwarding
+path, from the second packet on, you might decide to offload the flow to the
+flowtable through your ruleset. The flowtable infrastructure provides a rule
+action that allows you to specify when to add a flow to the flowtable.
+
+A packet that finds a matching entry in the flowtable (ie. flowtable hit) is
+transmitted to the output netdevice via neigh_xmit(), hence, packets bypass the
+classic IP forwarding path (the visible effect is that you do not see these
+packets from any of the Netfilter hooks coming after ingress). In case that
+there is no matching entry in the flowtable (ie. flowtable miss), the packet
+follows the classic IP forwarding path.
+
+The flowtable uses a resizable hashtable. Lookups are based on the following
+n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
+source and destination, layer 4 source and destination ports and the input
+interface (useful in case there are several conntrack zones in place).
+
+The 'flow add' action allows you to populate the flowtable, the user selectively
+specifies what flows are placed into the flowtable. Hence, packets follow the
+classic IP forwarding path unless the user explicitly instruct flows to use this
+new alternative forwarding path via policy.
+
+The flowtable datapath is represented in Fig.1, which describes the classic IP
+forwarding path including the Netfilter hooks and the flowtable fastpath bypass.
+
+::
+
+ userspace process
+ ^ |
+ | |
+ _____|____ ____\/___
+ / \ / \
+ | input | | output |
+ \__________/ \_________/
+ ^ |
+ | |
+ _________ __________ --------- _____\/_____
+ / \ / \ |Routing | / \
+ --> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit
+ \_________/ \__________/ ---------- \____________/ ^
+ | ^ | ^ |
+ flowtable | ____\/___ | |
+ | | / \ | |
+ __\/___ | | forward |------------ |
+ |-----| | \_________/ |
+ |-----| | 'flow offload' rule |
+ |-----| | adds entry to |
+ |_____| | flowtable |
+ | | |
+ / \ | |
+ /hit\_no_| |
+ \ ? / |
+ \ / |
+ |__yes_________________fastpath bypass ____________________________|
+
+ Fig.1 Netfilter hooks and flowtable interactions
+
+The flowtable entry also stores the NAT configuration, so all packets are
+mangled according to the NAT policy that is specified from the classic IP
+forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
+traffic is passed up to follow the classic IP forwarding path given that the
+transport header is missing, in this case, flowtable lookups are not possible.
+TCP RST and FIN packets are also passed up to the classic IP forwarding path to
+release the flow gracefully. Packets that exceed the MTU are also passed up to
+the classic forwarding path to report packet-too-big ICMP errors to the sender.
+
+Example configuration
+---------------------
+
+Enabling the flowtable bypass is relatively easy, you only need to create a
+flowtable and add one rule to your forward chain::
+
+ table inet x {
+ flowtable f {
+ hook ingress priority 0; devices = { eth0, eth1 };
+ }
+ chain y {
+ type filter hook forward priority 0; policy accept;
+ ip protocol tcp flow add @f
+ counter packets 0 bytes 0
+ }
+ }
+
+This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
+netdevices. You can create as many flowtables as you want in case you need to
+perform resource partitioning. The flowtable priority defines the order in which
+hooks are run in the pipeline, this is convenient in case you already have a
+nftables ingress chain (make sure the flowtable priority is smaller than the
+nftables ingress chain hence the flowtable runs before in the pipeline).
+
+The 'flow offload' action from the forward chain 'y' adds an entry to the
+flowtable for the TCP syn-ack packet coming in the reply direction. Once the
+flow is offloaded, you will observe that the counter rule in the example above
+does not get updated for the packets that are being forwarded through the
+forwarding bypass.
+
+You can identify offloaded flows through the [OFFLOAD] tag when listing your
+connection tracking table.
+
+::
+
+ # conntrack -L
+ tcp 6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2
+
+
+Layer 2 encapsulation
+---------------------
+
+Since Linux kernel 5.13, the flowtable infrastructure discovers the real
+netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
+parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
+VLAN ID / PPPoE session ID which are used for the flowtable lookups. The
+flowtable datapath also deals with layer 2 decapsulation.
+
+You do not need to add the PPPoE and the VLAN devices to your flowtable,
+instead the real device is sufficient for the flowtable to track your flows.
+
+Bridge and IP forwarding
+------------------------
+
+Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
+flowtable infrastructure discovers the topology behind the bridge device. This
+allows the flowtable to define a fastpath bypass between the bridge ports
+(represented as eth1 and eth2 in the example figure below) and the gateway
+device (represented as eth0) in your switch/router.
+
+::
+
+ fastpath bypass
+ .-------------------------.
+ / \
+ | IP forwarding |
+ | / \ \/
+ | br0 eth0 ..... eth0
+ . / \ *host B*
+ -> eth1 eth2
+ . *switch/router*
+ .
+ .
+ eth0
+ *host A*
+
+The flowtable infrastructure also supports for bridge VLAN filtering actions
+such as PVID and untagged. You can also stack a classic VLAN device on top of
+your bridge port.
+
+If you would like that your flowtable defines a fastpath between your bridge
+ports and your IP forwarding path, you have to add your bridge ports (as
+represented by the real netdevice) to your flowtable definition.
+
+Counters
+--------
+
+The flowtable can synchronize packet and byte counters with the existing
+connection tracking entry by specifying the counter statement in your flowtable
+definition, e.g.
+
+::
+
+ table inet x {
+ flowtable f {
+ hook ingress priority 0; devices = { eth0, eth1 };
+ counter
+ }
+ }
+
+Counter support is available since Linux kernel 5.7.
+
+Hardware offload
+----------------
+
+If your network device provides hardware offload support, you can turn it on by
+means of the 'offload' flag in your flowtable definition, e.g.
+
+::
+
+ table inet x {
+ flowtable f {
+ hook ingress priority 0; devices = { eth0, eth1 };
+ flags offload;
+ }
+ }
+
+There is a workqueue that adds the flows to the hardware. Note that a few
+packets might still run over the flowtable software path until the workqueue has
+a chance to offload the flow to the network device.
+
+You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
+listing your connection tracking table. Please, note that the [OFFLOAD] tag
+refers to the software offload mode, so there is a distinction between [OFFLOAD]
+which refers to the software flowtable fastpath and [HW_OFFLOAD] which refers
+to the hardware offload datapath being used by the flow.
+
+The flowtable hardware offload infrastructure also supports for the DSA
+(Distributed Switch Architecture).
+
+Limitations
+-----------
+
+The flowtable behaves like a cache. The flowtable entries might get stale if
+either the destination MAC address or the egress netdevice that is used for
+transmission changes.
+
+This might be a problem if:
+
+- You run the flowtable in software mode and you combine bridge and IP
+ forwarding in your setup.
+- Hardware offload is enabled.
+
+More reading
+------------
+
+This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
+also made a very complete and comprehensive summary called "A state of network
+acceleration" that describes how things were before this infrastructure was
+mainlined [3]_ and it also makes a rough summary of this work [4]_.
+
+.. [1] https://lwn.net/Articles/738214/
+.. [2] https://lwn.net/Articles/742164/
+.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
+.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html
diff --git a/Documentation/networking/nfc.rst b/Documentation/networking/nfc.rst
new file mode 100644
index 0000000000..9aab3a88c9
--- /dev/null
+++ b/Documentation/networking/nfc.rst
@@ -0,0 +1,130 @@
+===================
+Linux NFC subsystem
+===================
+
+The Near Field Communication (NFC) subsystem is required to standardize the
+NFC device drivers development and to create an unified userspace interface.
+
+This document covers the architecture overview, the device driver interface
+description and the userspace interface description.
+
+Architecture overview
+=====================
+
+The NFC subsystem is responsible for:
+ - NFC adapters management;
+ - Polling for targets;
+ - Low-level data exchange;
+
+The subsystem is divided in some parts. The 'core' is responsible for
+providing the device driver interface. On the other side, it is also
+responsible for providing an interface to control operations and low-level
+data exchange.
+
+The control operations are available to userspace via generic netlink.
+
+The low-level data exchange interface is provided by the new socket family
+PF_NFC. The NFC_SOCKPROTO_RAW performs raw communication with NFC targets.
+
+.. code-block:: none
+
+ +--------------------------------------+
+ | USER SPACE |
+ +--------------------------------------+
+ ^ ^
+ | low-level | control
+ | data exchange | operations
+ | |
+ | v
+ | +-----------+
+ | AF_NFC | netlink |
+ | socket +-----------+
+ | raw ^
+ | |
+ v v
+ +---------+ +-----------+
+ | rawsock | <--------> | core |
+ +---------+ +-----------+
+ ^
+ |
+ v
+ +-----------+
+ | driver |
+ +-----------+
+
+Device Driver Interface
+=======================
+
+When registering on the NFC subsystem, the device driver must inform the core
+of the set of supported NFC protocols and the set of ops callbacks. The ops
+callbacks that must be implemented are the following:
+
+* start_poll - setup the device to poll for targets
+* stop_poll - stop on progress polling operation
+* activate_target - select and initialize one of the targets found
+* deactivate_target - deselect and deinitialize the selected target
+* data_exchange - send data and receive the response (transceive operation)
+
+Userspace interface
+===================
+
+The userspace interface is divided in control operations and low-level data
+exchange operation.
+
+CONTROL OPERATIONS:
+
+Generic netlink is used to implement the interface to the control operations.
+The operations are composed by commands and events, all listed below:
+
+* NFC_CMD_GET_DEVICE - get specific device info or dump the device list
+* NFC_CMD_START_POLL - setup a specific device to polling for targets
+* NFC_CMD_STOP_POLL - stop the polling operation in a specific device
+* NFC_CMD_GET_TARGET - dump the list of targets found by a specific device
+
+* NFC_EVENT_DEVICE_ADDED - reports an NFC device addition
+* NFC_EVENT_DEVICE_REMOVED - reports an NFC device removal
+* NFC_EVENT_TARGETS_FOUND - reports START_POLL results when 1 or more targets
+ are found
+
+The user must call START_POLL to poll for NFC targets, passing the desired NFC
+protocols through NFC_ATTR_PROTOCOLS attribute. The device remains in polling
+state until it finds any target. However, the user can stop the polling
+operation by calling STOP_POLL command. In this case, it will be checked if
+the requester of STOP_POLL is the same of START_POLL.
+
+If the polling operation finds one or more targets, the event TARGETS_FOUND is
+sent (including the device id). The user must call GET_TARGET to get the list of
+all targets found by such device. Each reply message has target attributes with
+relevant information such as the supported NFC protocols.
+
+All polling operations requested through one netlink socket are stopped when
+it's closed.
+
+LOW-LEVEL DATA EXCHANGE:
+
+The userspace must use PF_NFC sockets to perform any data communication with
+targets. All NFC sockets use AF_NFC::
+
+ struct sockaddr_nfc {
+ sa_family_t sa_family;
+ __u32 dev_idx;
+ __u32 target_idx;
+ __u32 nfc_protocol;
+ };
+
+To establish a connection with one target, the user must create an
+NFC_SOCKPROTO_RAW socket and call the 'connect' syscall with the sockaddr_nfc
+struct correctly filled. All information comes from NFC_EVENT_TARGETS_FOUND
+netlink event. As a target can support more than one NFC protocol, the user
+must inform which protocol it wants to use.
+
+Internally, 'connect' will result in an activate_target call to the driver.
+When the socket is closed, the target is deactivated.
+
+The data format exchanged through the sockets is NFC protocol dependent. For
+instance, when communicating with MIFARE tags, the data exchanged are MIFARE
+commands and their responses.
+
+The first received package is the response to the first sent package and so
+on. In order to allow valid "empty" responses, every data received has a NULL
+header of 1 byte.
diff --git a/Documentation/networking/openvswitch.rst b/Documentation/networking/openvswitch.rst
new file mode 100644
index 0000000000..1a8353dbf1
--- /dev/null
+++ b/Documentation/networking/openvswitch.rst
@@ -0,0 +1,251 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
+Open vSwitch datapath developer documentation
+=============================================
+
+The Open vSwitch kernel module allows flexible userspace control over
+flow-level packet processing on selected network devices. It can be
+used to implement a plain Ethernet switch, network device bonding,
+VLAN processing, network access control, flow-based network control,
+and so on.
+
+The kernel module implements multiple "datapaths" (analogous to
+bridges), each of which can have multiple "vports" (analogous to ports
+within a bridge). Each datapath also has associated with it a "flow
+table" that userspace populates with "flows" that map from keys based
+on packet headers and metadata to sets of actions. The most common
+action forwards the packet to another vport; other actions are also
+implemented.
+
+When a packet arrives on a vport, the kernel module processes it by
+extracting its flow key and looking it up in the flow table. If there
+is a matching flow, it executes the associated actions. If there is
+no match, it queues the packet to userspace for processing (as part of
+its processing, userspace will likely set up a flow to handle further
+packets of the same type entirely in-kernel).
+
+
+Flow key compatibility
+----------------------
+
+Network protocols evolve over time. New protocols become important
+and existing protocols lose their prominence. For the Open vSwitch
+kernel module to remain relevant, it must be possible for newer
+versions to parse additional protocols as part of the flow key. It
+might even be desirable, someday, to drop support for parsing
+protocols that have become obsolete. Therefore, the Netlink interface
+to Open vSwitch is designed to allow carefully written userspace
+applications to work with any version of the flow key, past or future.
+
+To support this forward and backward compatibility, whenever the
+kernel module passes a packet to userspace, it also passes along the
+flow key that it parsed from the packet. Userspace then extracts its
+own notion of a flow key from the packet and compares it against the
+kernel-provided version:
+
+ - If userspace's notion of the flow key for the packet matches the
+ kernel's, then nothing special is necessary.
+
+ - If the kernel's flow key includes more fields than the userspace
+ version of the flow key, for example if the kernel decoded IPv6
+ headers but userspace stopped at the Ethernet type (because it
+ does not understand IPv6), then again nothing special is
+ necessary. Userspace can still set up a flow in the usual way,
+ as long as it uses the kernel-provided flow key to do it.
+
+ - If the userspace flow key includes more fields than the
+ kernel's, for example if userspace decoded an IPv6 header but
+ the kernel stopped at the Ethernet type, then userspace can
+ forward the packet manually, without setting up a flow in the
+ kernel. This case is bad for performance because every packet
+ that the kernel considers part of the flow must go to userspace,
+ but the forwarding behavior is correct. (If userspace can
+ determine that the values of the extra fields would not affect
+ forwarding behavior, then it could set up a flow anyway.)
+
+How flow keys evolve over time is important to making this work, so
+the following sections go into detail.
+
+
+Flow key format
+---------------
+
+A flow key is passed over a Netlink socket as a sequence of Netlink
+attributes. Some attributes represent packet metadata, defined as any
+information about a packet that cannot be extracted from the packet
+itself, e.g. the vport on which the packet was received. Most
+attributes, however, are extracted from headers within the packet,
+e.g. source and destination addresses from Ethernet, IP, or TCP
+headers.
+
+The <linux/openvswitch.h> header file defines the exact format of the
+flow key attributes. For informal explanatory purposes here, we write
+them as comma-separated strings, with parentheses indicating arguments
+and nesting. For example, the following could represent a flow key
+corresponding to a TCP packet that arrived on vport 1::
+
+ in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
+ eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,
+ frag=no), tcp(src=49163, dst=80)
+
+Often we ellipsize arguments not important to the discussion, e.g.::
+
+ in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
+
+
+Wildcarded flow key format
+--------------------------
+
+A wildcarded flow is described with two sequences of Netlink attributes
+passed over the Netlink socket. A flow key, exactly as described above, and an
+optional corresponding flow mask.
+
+A wildcarded flow can represent a group of exact match flows. Each '1' bit
+in the mask specifies a exact match with the corresponding bit in the flow key.
+A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
+of a incoming packet. Using wildcarded flow can improve the flow set up rate
+by reduce the number of new flows need to be processed by the user space program.
+
+Support for the mask Netlink attribute is optional for both the kernel and user
+space program. The kernel can ignore the mask attribute, installing an exact
+match flow, or reduce the number of don't care bits in the kernel to less than
+what was specified by the user space program. In this case, variations in bits
+that the kernel does not implement will simply result in additional flow setups.
+The kernel module will also work with user space programs that neither support
+nor supply flow mask attributes.
+
+Since the kernel may ignore or modify wildcard bits, it can be difficult for
+the userspace program to know exactly what matches are installed. There are
+two possible approaches: reactively install flows as they miss the kernel
+flow table (and therefore not attempt to determine wildcard changes at all)
+or use the kernel's response messages to determine the installed wildcards.
+
+When interacting with userspace, the kernel should maintain the match portion
+of the key exactly as originally installed. This will provides a handle to
+identify the flow for all future operations. However, when reporting the
+mask of an installed flow, the mask should include any restrictions imposed
+by the kernel.
+
+The behavior when using overlapping wildcarded flows is undefined. It is the
+responsibility of the user space program to ensure that any incoming packet
+can match at most one flow, wildcarded or not. The current implementation
+performs best-effort detection of overlapping wildcarded flows and may reject
+some but not all of them. However, this behavior may change in future versions.
+
+
+Unique flow identifiers
+-----------------------
+
+An alternative to using the original match portion of a key as the handle for
+flow identification is a unique flow identifier, or "UFID". UFIDs are optional
+for both the kernel and user space program.
+
+User space programs that support UFID are expected to provide it during flow
+setup in addition to the flow, then refer to the flow using the UFID for all
+future operations. The kernel is not required to index flows by the original
+flow key if a UFID is specified.
+
+
+Basic rule for evolving flow keys
+---------------------------------
+
+Some care is needed to really maintain forward and backward
+compatibility for applications that follow the rules listed under
+"Flow key compatibility" above.
+
+The basic rule is obvious::
+
+ ==================================================================
+ New network protocol support must only supplement existing flow
+ key attributes. It must not change the meaning of already defined
+ flow key attributes.
+ ==================================================================
+
+This rule does have less-obvious consequences so it is worth working
+through a few examples. Suppose, for example, that the kernel module
+did not already implement VLAN parsing. Instead, it just interpreted
+the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
+packet. The flow key for any packet with an 802.1Q header would look
+essentially like this, ignoring metadata::
+
+ eth(...), eth_type(0x8100)
+
+Naively, to add VLAN support, it makes sense to add a new "vlan" flow
+key attribute to contain the VLAN tag, then continue to decode the
+encapsulated headers beyond the VLAN tag using the existing field
+definitions. With this change, a TCP packet in VLAN 10 would have a
+flow key much like this::
+
+ eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
+
+But this change would negatively affect a userspace application that
+has not been updated to understand the new "vlan" flow key attribute.
+The application could, following the flow compatibility rules above,
+ignore the "vlan" attribute that it does not understand and therefore
+assume that the flow contained IP packets. This is a bad assumption
+(the flow only contains IP packets if one parses and skips over the
+802.1Q header) and it could cause the application's behavior to change
+across kernel versions even though it follows the compatibility rules.
+
+The solution is to use a set of nested attributes. This is, for
+example, why 802.1Q support uses nested attributes. A TCP packet in
+VLAN 10 is actually expressed as::
+
+ eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
+ ip(proto=6, ...), tcp(...)))
+
+Notice how the "eth_type", "ip", and "tcp" flow key attributes are
+nested inside the "encap" attribute. Thus, an application that does
+not understand the "vlan" key will not see either of those attributes
+and therefore will not misinterpret them. (Also, the outer eth_type
+is still 0x8100, not changed to 0x0800.)
+
+Handling malformed packets
+--------------------------
+
+Don't drop packets in the kernel for malformed protocol headers, bad
+checksums, etc. This would prevent userspace from implementing a
+simple Ethernet switch that forwards every packet.
+
+Instead, in such a case, include an attribute with "empty" content.
+It doesn't matter if the empty content could be valid protocol values,
+as long as those values are rarely seen in practice, because userspace
+can always forward all packets with those values to userspace and
+handle them individually.
+
+For example, consider a packet that contains an IP header that
+indicates protocol 6 for TCP, but which is truncated just after the IP
+header, so that the TCP header is missing. The flow key for this
+packet would include a tcp attribute with all-zero src and dst, like
+this::
+
+ eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
+
+As another example, consider a packet with an Ethernet type of 0x8100,
+indicating that a VLAN TCI should follow, but which is truncated just
+after the Ethernet type. The flow key for this packet would include
+an all-zero-bits vlan and an empty encap attribute, like this::
+
+ eth(...), eth_type(0x8100), vlan(0), encap()
+
+Unlike a TCP packet with source and destination ports 0, an
+all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
+VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
+attribute expressly to allow this situation to be distinguished.
+Thus, the flow key in this second example unambiguously indicates a
+missing or malformed VLAN TCI.
+
+Other rules
+-----------
+
+The other rules for flow keys are much less subtle:
+
+ - Duplicate attributes are not allowed at a given nesting level.
+
+ - Ordering of attributes is not significant.
+
+ - When the kernel sends a given flow key to userspace, it always
+ composes it the same way. This allows userspace to hash and
+ compare entire flow keys that it may not be able to fully
+ interpret.
diff --git a/Documentation/networking/operstates.rst b/Documentation/networking/operstates.rst
new file mode 100644
index 0000000000..1ee2141e8e
--- /dev/null
+++ b/Documentation/networking/operstates.rst
@@ -0,0 +1,187 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Operational States
+==================
+
+
+1. Introduction
+===============
+
+Linux distinguishes between administrative and operational state of an
+interface. Administrative state is the result of "ip link set dev
+<dev> up or down" and reflects whether the administrator wants to use
+the device for traffic.
+
+However, an interface is not usable just because the admin enabled it
+- ethernet requires to be plugged into the switch and, depending on
+a site's networking policy and configuration, an 802.1X authentication
+to be performed before user data can be transferred. Operational state
+shows the ability of an interface to transmit this user data.
+
+Thanks to 802.1X, userspace must be granted the possibility to
+influence operational state. To accommodate this, operational state is
+split into two parts: Two flags that can be set by the driver only, and
+a RFC2863 compatible state that is derived from these flags, a policy,
+and changeable from userspace under certain rules.
+
+
+2. Querying from userspace
+==========================
+
+Both admin and operational state can be queried via the netlink
+operation RTM_GETLINK. It is also possible to subscribe to RTNLGRP_LINK
+to be notified of updates while the interface is admin up. This is
+important for setting from userspace.
+
+These values contain interface state:
+
+ifinfomsg::if_flags & IFF_UP:
+ Interface is admin up
+
+ifinfomsg::if_flags & IFF_RUNNING:
+ Interface is in RFC2863 operational state UP or UNKNOWN. This is for
+ backward compatibility, routing daemons, dhcp clients can use this
+ flag to determine whether they should use the interface.
+
+ifinfomsg::if_flags & IFF_LOWER_UP:
+ Driver has signaled netif_carrier_on()
+
+ifinfomsg::if_flags & IFF_DORMANT:
+ Driver has signaled netif_dormant_on()
+
+TLV IFLA_OPERSTATE
+------------------
+
+contains RFC2863 state of the interface in numeric representation:
+
+IF_OPER_UNKNOWN (0):
+ Interface is in unknown state, neither driver nor userspace has set
+ operational state. Interface must be considered for user data as
+ setting operational state has not been implemented in every driver.
+
+IF_OPER_NOTPRESENT (1):
+ Unused in current kernel (notpresent interfaces normally disappear),
+ just a numerical placeholder.
+
+IF_OPER_DOWN (2):
+ Interface is unable to transfer data on L1, f.e. ethernet is not
+ plugged or interface is ADMIN down.
+
+IF_OPER_LOWERLAYERDOWN (3):
+ Interfaces stacked on an interface that is IF_OPER_DOWN show this
+ state (f.e. VLAN).
+
+IF_OPER_TESTING (4):
+ Interface is in testing mode, for example executing driver self-tests
+ or media (cable) test. It can't be used for normal traffic until tests
+ complete.
+
+IF_OPER_DORMANT (5):
+ Interface is L1 up, but waiting for an external event, f.e. for a
+ protocol to establish. (802.1X)
+
+IF_OPER_UP (6):
+ Interface is operational up and can be used.
+
+This TLV can also be queried via sysfs.
+
+TLV IFLA_LINKMODE
+-----------------
+
+contains link policy. This is needed for userspace interaction
+described below.
+
+This TLV can also be queried via sysfs.
+
+
+3. Kernel driver API
+====================
+
+Kernel drivers have access to two flags that map to IFF_LOWER_UP and
+IFF_DORMANT. These flags can be set from everywhere, even from
+interrupts. It is guaranteed that only the driver has write access,
+however, if different layers of the driver manipulate the same flag,
+the driver has to provide the synchronisation needed.
+
+__LINK_STATE_NOCARRIER, maps to !IFF_LOWER_UP:
+
+The driver uses netif_carrier_on() to clear and netif_carrier_off() to
+set this flag. On netif_carrier_off(), the scheduler stops sending
+packets. The name 'carrier' and the inversion are historical, think of
+it as lower layer.
+
+Note that for certain kind of soft-devices, which are not managing any
+real hardware, it is possible to set this bit from userspace. One
+should use TLV IFLA_CARRIER to do so.
+
+netif_carrier_ok() can be used to query that bit.
+
+__LINK_STATE_DORMANT, maps to IFF_DORMANT:
+
+Set by the driver to express that the device cannot yet be used
+because some driver controlled protocol establishment has to
+complete. Corresponding functions are netif_dormant_on() to set the
+flag, netif_dormant_off() to clear it and netif_dormant() to query.
+
+On device allocation, both flags __LINK_STATE_NOCARRIER and
+__LINK_STATE_DORMANT are cleared, so the effective state is equivalent
+to netif_carrier_ok() and !netif_dormant().
+
+
+Whenever the driver CHANGES one of these flags, a workqueue event is
+scheduled to translate the flag combination to IFLA_OPERSTATE as
+follows:
+
+!netif_carrier_ok():
+ IF_OPER_LOWERLAYERDOWN if the interface is stacked, IF_OPER_DOWN
+ otherwise. Kernel can recognise stacked interfaces because their
+ ifindex != iflink.
+
+netif_carrier_ok() && netif_dormant():
+ IF_OPER_DORMANT
+
+netif_carrier_ok() && !netif_dormant():
+ IF_OPER_UP if userspace interaction is disabled. Otherwise
+ IF_OPER_DORMANT with the possibility for userspace to initiate the
+ IF_OPER_UP transition afterwards.
+
+
+4. Setting from userspace
+=========================
+
+Applications have to use the netlink interface to influence the
+RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1
+via RTM_SETLINK instructs the kernel that an interface should go to
+IF_OPER_DORMANT instead of IF_OPER_UP when the combination
+netif_carrier_ok() && !netif_dormant() is set by the
+driver. Afterwards, the userspace application can set IFLA_OPERSTATE
+to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
+netif_carrier_off() or netif_dormant_on(). Changes made by userspace
+are multicasted on the netlink group RTNLGRP_LINK.
+
+So basically a 802.1X supplicant interacts with the kernel like this:
+
+- subscribe to RTNLGRP_LINK
+- set IFLA_LINKMODE to 1 via RTM_SETLINK
+- query RTM_GETLINK once to get initial state
+- if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
+ netlink multicast signals this state
+- do 802.1X, eventually abort if flags go down again
+- send RTM_SETLINK to set operstate to IF_OPER_UP if authentication
+ succeeds, IF_OPER_DORMANT otherwise
+- see how operstate and IFF_RUNNING is echoed via netlink multicast
+- set interface back to IF_OPER_DORMANT if 802.1X reauthentication
+ fails
+- restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag
+
+if supplicant goes down, bring back IFLA_LINKMODE to 0 and
+IFLA_OPERSTATE to a sane value.
+
+A routing daemon or dhcp client just needs to care for IFF_RUNNING or
+waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before
+considering the interface / querying a DHCP address.
+
+
+For technical questions and/or comments please e-mail to Stefan Rompf
+(stefan at loplof.de).
diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst
new file mode 100644
index 0000000000..30a3be3c48
--- /dev/null
+++ b/Documentation/networking/packet_mmap.rst
@@ -0,0 +1,1083 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+Packet MMAP
+===========
+
+Abstract
+========
+
+This file documents the mmap() facility available with the PACKET
+socket interface. This type of sockets is used for
+
+i) capture network traffic with utilities like tcpdump,
+ii) transmit network traffic, or any other that needs raw
+ access to network interface.
+
+Howto can be found at:
+
+ https://sites.google.com/site/packetmmap/
+
+Please send your comments to
+ - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+ - Johann Baudy
+
+Why use PACKET_MMAP
+===================
+
+Non PACKET_MMAP capture process (plain AF_PACKET) is very
+inefficient. It uses very limited buffers and requires one system call to
+capture each packet, it requires two if you want to get packet's timestamp
+(like libpcap always does).
+
+On the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. Concerning
+transmission, multiple packets can be sent through one system call to get the
+highest bandwidth. By using a shared buffer between the kernel and the user
+also has the benefit of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by devices of your network. CPU IRQ pinning of your network interface
+card can also be an advantage.
+
+How to use mmap() to improve capture process
+============================================
+
+From the user standpoint, you should use the higher level libpcap library, which
+is a de facto standard, portable across nearly all operating systems
+including Win32.
+
+Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
+TPACKET_V3 support was added in version 1.5.0
+
+How to use mmap() directly to improve capture process
+=====================================================
+
+From the system calls stand point, the use of PACKET_MMAP involves
+the following process::
+
+
+ [setup] socket() -------> creation of the capture socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_RX_RING
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+ [capture] poll() ---------> to wait for incoming packets
+
+ [shutdown] close() --------> destruction of the capture socket and
+ deallocation of all associated
+ resources.
+
+
+socket creation and destruction is straight forward, and is done
+the same way with or without PACKET_MMAP::
+
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
+
+where mode is SOCK_RAW for the raw interface were link level
+information can be captured or SOCK_DGRAM for the cooked
+interface where link level information capture is not
+supported and a link level pseudo-header is provided
+by the kernel.
+
+The destruction of the socket and all associated resources
+is done by a simple call to close(fd).
+
+Similarly as without PACKET_MMAP, it is possible to use one socket
+for capture and transmission. This can be done by mapping the
+allocated RX and TX buffer ring with a single mmap() call.
+See "Mapping and use of the circular buffer (ring)".
+
+Next I will describe PACKET_MMAP settings and its constraints,
+also the mapping of the circular buffer in the user process and
+the use of this buffer.
+
+How to use mmap() directly to improve transmission process
+==========================================================
+Transmission process is similar to capture as shown below::
+
+ [setup] socket() -------> creation of the transmission socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_TX_RING
+ bind() ---------> bind transmission socket with a network interface
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+ [transmission] poll() ---------> wait for free packets (optional)
+ send() ---------> send all packets that are set as ready in
+ the ring
+ The flag MSG_DONTWAIT can be used to return
+ before end of transfer.
+
+ [shutdown] close() --------> destruction of the transmission socket and
+ deallocation of all associated resources.
+
+Socket creation and destruction is also straight forward, and is done
+the same way as in capturing described in the previous paragraph::
+
+ int fd = socket(PF_PACKET, mode, 0);
+
+The protocol can optionally be 0 in case we only want to transmit
+via this socket, which avoids an expensive call to packet_rcv().
+In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
+set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+As capture, each frame contains two parts::
+
+ --------------------
+ | struct tpacket_hdr | Header. It contains the status of
+ | | of this frame
+ |--------------------|
+ | data buffer |
+ . . Data that will be sent over the network interface.
+ . .
+ --------------------
+
+ bind() associates the socket to your network interface thanks to
+ sll_ifindex parameter of struct sockaddr_ll.
+
+ Initialization example::
+
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strscpy_pad (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = htons(ETH_P_ALL);
+ my_addr.sll_ifindex = s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ A complete tutorial is available at: https://sites.google.com/site/packetmmap/
+
+By default, the user should put data at::
+
+ frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
+
+So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
+the beginning of the user data will be at::
+
+ frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+If you wish to put user data at a custom offset from the beginning of
+the frame (for payload alignment with SOCK_RAW mode for instance) you
+can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
+to make this work it must be enabled previously with setsockopt()
+and the PACKET_TX_HAS_OFF option.
+
+PACKET_MMAP settings
+====================
+
+To setup PACKET_MMAP from user level code is done with a call like
+
+ - Capture process::
+
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+
+ - Transmission process::
+
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
+
+The most significant argument in the previous call is the req parameter,
+this parameter must to have the following structure::
+
+ struct tpacket_req
+ {
+ unsigned int tp_block_size; /* Minimal size of contiguous block */
+ unsigned int tp_block_nr; /* Number of blocks */
+ unsigned int tp_frame_size; /* Size of frame */
+ unsigned int tp_frame_nr; /* Total number of frames */
+ };
+
+This structure is defined in /usr/include/linux/if_packet.h and establishes a
+circular buffer (ring) of unswappable memory.
+Being mapped in the capture process allows reading the captured frames and
+related meta-information like timestamps without requiring a system call.
+
+Frames are grouped in blocks. Each block is a physically contiguous
+region of memory and holds tp_block_size/tp_frame_size frames. The total number
+of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::
+
+ frames_per_block = tp_block_size/tp_frame_size
+
+indeed, packet_set_ring checks that the following condition is true::
+
+ frames_per_block * tp_block_nr == tp_frame_nr
+
+Lets see an example, with the following values::
+
+ tp_block_size= 4096
+ tp_frame_size= 2048
+ tp_block_nr = 4
+ tp_frame_nr = 8
+
+we will get the following buffer structure::
+
+ block #1 block #2
+ +---------+---------+ +---------+---------+
+ | frame 1 | frame 2 | | frame 3 | frame 4 |
+ +---------+---------+ +---------+---------+
+
+ block #3 block #4
+ +---------+---------+ +---------+---------+
+ | frame 5 | frame 6 | | frame 7 | frame 8 |
+ +---------+---------+ +---------+---------+
+
+A frame can be of any size with the only condition it can fit in a block. A block
+can only hold an integer number of frames, or in other words, a frame cannot
+be spawned across two blocks, so there are some details you have to take into
+account when choosing the frame_size. See "Mapping and use of the circular
+buffer (ring)".
+
+PACKET_MMAP setting constraints
+===============================
+
+In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
+the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
+16384 in a 64 bit architecture.
+
+Block size limit
+----------------
+
+As stated earlier, each block is a contiguous physical region of memory. These
+memory regions are allocated with calls to the __get_free_pages() function. As
+the name indicates, this function allocates pages of memory, and the second
+argument is "order" or a power of two number of pages, that is
+(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
+order=2 ==> 16384 bytes, etc. The maximum size of a
+region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
+precisely the limit can be calculated as::
+
+ PAGE_SIZE << MAX_ORDER
+
+ In a i386 architecture PAGE_SIZE is 4096 bytes
+ In a 2.4/i386 kernel MAX_ORDER is 10
+ In a 2.6/i386 kernel MAX_ORDER is 11
+
+So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
+respectively, with an i386 architecture.
+
+User space programs can include /usr/include/sys/user.h and
+/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
+
+The pagesize can also be determined dynamically with the getpagesize (2)
+system call.
+
+Block number limit
+------------------
+
+To understand the constraints of PACKET_MMAP, we have to see the structure
+used to hold the pointers to each block.
+
+Currently, this structure is a dynamically allocated vector with kmalloc
+called pg_vec, its size limits the number of blocks that can be allocated::
+
+ +---+---+---+---+
+ | x | x | x | x |
+ +---+---+---+---+
+ | | | |
+ | | | v
+ | | v block #4
+ | v block #3
+ v block #2
+ block #1
+
+kmalloc allocates any number of bytes of physically contiguous memory from
+a pool of pre-determined sizes. This pool of memory is maintained by the slab
+allocator which is at the end the responsible for doing the allocation and
+hence which imposes the maximum memory that kmalloc can allocate.
+
+In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
+predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
+entries of /proc/slabinfo
+
+In a 32 bit architecture, pointers are 4 bytes long, so the total number of
+pointers to blocks is::
+
+ 131072/4 = 32768 blocks
+
+PACKET_MMAP buffer size calculator
+==================================
+
+Definitions:
+
+============== ================================================================
+<size-max> is the maximum size of allocable with kmalloc
+ (see /proc/slabinfo)
+<pointer size> depends on the architecture -- ``sizeof(void *)``
+<page size> depends on the architecture -- PAGE_SIZE or getpagesize (2)
+<max-order> is the value defined with MAX_ORDER
+<frame size> it's an upper bound of frame's capture size (more on this later)
+============== ================================================================
+
+from these definitions we will derive::
+
+ <block number> = <size-max>/<pointer size>
+ <block size> = <pagesize> << <max-order>
+
+so, the max buffer size is::
+
+ <block number> * <block size>
+
+and, the number of frames be::
+
+ <block number> * <block size> / <frame size>
+
+Suppose the following parameters, which apply for 2.6 kernel and an
+i386 architecture::
+
+ <size-max> = 131072 bytes
+ <pointer size> = 4 bytes
+ <pagesize> = 4096 bytes
+ <max-order> = 11
+
+and a value for <frame size> of 2048 bytes. These parameters will yield::
+
+ <block number> = 131072/4 = 32768 blocks
+ <block size> = 4096 << 11 = 8 MiB.
+
+and hence the buffer will have a 262144 MiB size. So it can hold
+262144 MiB / 2048 bytes = 134217728 frames
+
+Actually, this buffer size is not possible with an i386 architecture.
+Remember that the memory is allocated in kernel space, in the case of
+an i386 kernel's memory size is limited to 1GiB.
+
+All memory allocations are not freed until the socket is closed. The memory
+allocations are done with GFP_KERNEL priority, this basically means that
+the allocation can wait and swap other process' memory in order to allocate
+the necessary memory, so normally limits can be reached.
+
+Other constraints
+-----------------
+
+If you check the source code you will see that what I draw here as a frame
+is not only the link level frame. At the beginning of each frame there is a
+header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame
+meta information like timestamp. So what we draw here a frame it's really
+the following (from include/linux/if_packet.h)::
+
+ /*
+ Frame structure:
+
+ - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
+ - struct tpacket_hdr
+ - pad to TPACKET_ALIGNMENT=16
+ - struct sockaddr_ll
+ - Gap, chosen so that packet data (Start+tp_net) aligns to
+ TPACKET_ALIGNMENT=16
+ - Start+tp_mac: [ Optional MAC header ]
+ - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
+ - Pad to align to TPACKET_ALIGNMENT=16
+ */
+
+The following are conditions that are checked in packet_set_ring
+
+ - tp_block_size must be a multiple of PAGE_SIZE (1)
+ - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
+ - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
+ - tp_frame_nr must be exactly frames_per_block*tp_block_nr
+
+Note that tp_block_size should be chosen to be a power of two or there will
+be a waste of memory.
+
+Mapping and use of the circular buffer (ring)
+---------------------------------------------
+
+The mapping of the buffer in the user process is done with the conventional
+mmap function. Even the circular buffer is compound of several physically
+discontiguous blocks of memory, they are contiguous to the user space, hence
+just one call to mmap is needed::
+
+ mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+
+If tp_frame_size is a divisor of tp_block_size frames will be
+contiguously spaced by tp_frame_size bytes. If not, each
+tp_block_size/tp_frame_size frames there will be a gap between
+the frames. This is because a frame cannot be spawn across two
+blocks.
+
+To use one socket for capture and transmission, the mapping of both the
+RX and TX buffer ring has to be done with one call to mmap::
+
+ ...
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
+ ...
+ rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ tx_ring = rx_ring + size;
+
+RX must be the first as the kernel maps the TX ring memory right
+after the RX one.
+
+At the beginning of each frame there is an status field (see
+struct tpacket_hdr). If this field is 0 means that the frame is ready
+to be used for the kernel, If not, there is a frame the user can read
+and the following flags apply:
+
+Capture process
+^^^^^^^^^^^^^^^
+
+From include/linux/if_packet.h::
+
+ #define TP_STATUS_COPY (1 << 1)
+ #define TP_STATUS_LOSING (1 << 2)
+ #define TP_STATUS_CSUMNOTREADY (1 << 3)
+ #define TP_STATUS_CSUM_VALID (1 << 7)
+
+====================== =======================================================
+TP_STATUS_COPY This flag indicates that the frame (and associated
+ meta information) has been truncated because it's
+ larger than tp_frame_size. This packet can be
+ read entirely with recvfrom().
+
+ In order to make this work it must to be
+ enabled previously with setsockopt() and
+ the PACKET_COPY_THRESH option.
+
+ The number of frames that can be buffered to
+ be read with recvfrom is limited like a normal socket.
+ See the SO_RCVBUF option in the socket (7) man page.
+
+TP_STATUS_LOSING indicates there were packet drops from last time
+ statistics where checked with getsockopt() and
+ the PACKET_STATISTICS option.
+
+TP_STATUS_CSUMNOTREADY currently it's used for outgoing IP packets which
+ its checksum will be done in hardware. So while
+ reading the packet we should not try to check the
+ checksum.
+
+TP_STATUS_CSUM_VALID This flag indicates that at least the transport
+ header checksum of the packet has been already
+ validated on the kernel side. If the flag is not set
+ then we are free to check the checksum by ourselves
+ provided that TP_STATUS_CSUMNOTREADY is also not set.
+====================== =======================================================
+
+for convenience there are also the following defines::
+
+ #define TP_STATUS_KERNEL 0
+ #define TP_STATUS_USER 1
+
+The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
+receives a packet it puts in the buffer and updates the status with
+at least the TP_STATUS_USER flag. Then the user can read the packet,
+once the packet is read the user must zero the status field, so the kernel
+can use again that frame buffer.
+
+The user can use poll (any other variant should apply too) to check if new
+packets are in the ring::
+
+ struct pollfd pfd;
+
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLIN|POLLRDNORM|POLLERR;
+
+ if (status == TP_STATUS_KERNEL)
+ retval = poll(&pfd, 1, timeout);
+
+It doesn't incur in a race condition to first check the status value and
+then poll for frames.
+
+Transmission process
+^^^^^^^^^^^^^^^^^^^^
+
+Those defines are also used for transmission::
+
+ #define TP_STATUS_AVAILABLE 0 // Frame is available
+ #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
+ #define TP_STATUS_SENDING 2 // Frame is currently in transmission
+ #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct
+
+First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
+packet, the user fills a data buffer of an available frame, sets tp_len to
+current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
+This can be done on multiple frames. Once the user is ready to transmit, it
+calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
+forwarded to the network device. The kernel updates each status of sent
+frames with TP_STATUS_SENDING until the end of transfer.
+
+At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
+
+::
+
+ header->tp_len = in_i_size;
+ header->tp_status = TP_STATUS_SEND_REQUEST;
+ retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+
+(status == TP_STATUS_SENDING)
+
+::
+
+ struct pollfd pfd;
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLOUT;
+ retval = poll(&pfd, 1, timeout);
+
+What TPACKET versions are available and when to use them?
+=========================================================
+
+::
+
+ int val = tpacket_version;
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
+
+TPACKET_V1:
+ - Default if not otherwise specified by setsockopt(2)
+ - RX_RING, TX_RING available
+
+TPACKET_V1 --> TPACKET_V2:
+ - Made 64 bit clean due to unsigned long usage in TPACKET_V1
+ structures, thus this also works on 64 bit kernel with 32 bit
+ userspace and the like
+ - Timestamp resolution in nanoseconds instead of microseconds
+ - RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+ (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
+ in the tpacket2_hdr structure:
+
+ - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
+ that the tp_vlan_tci field has valid VLAN TCI value
+ - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
+ indicates that the tp_vlan_tpid field has valid VLAN TPID value
+
+ - How to switch to TPACKET_V2:
+
+ 1. Replace struct tpacket_hdr by struct tpacket2_hdr
+ 2. Query header len and save
+ 3. Set protocol version to 2, set up ring as usual
+ 4. For getting the sockaddr_ll,
+ use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
+ ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``
+
+TPACKET_V2 --> TPACKET_V3:
+ - Flexible buffer implementation for RX_RING:
+ 1. Blocks can be configured with non-static frame-size
+ 2. Read/poll is at a block-level (as opposed to packet-level)
+ 3. Added poll timeout to avoid indefinite user-space wait
+ on idle links
+ 4. Added user-configurable knobs:
+
+ 4.1 block::timeout
+ 4.2 tpkt_hdr::sk_rxhash
+
+ - RX Hash data available in user space
+ - TX_RING semantics are conceptually similar to TPACKET_V2;
+ use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
+ instead of TPACKET2_HDRLEN. In the current implementation,
+ the tp_next_offset field in the tpacket3_hdr MUST be set to
+ zero, indicating that the ring does not hold variable sized frames.
+ Packets with non-zero values of tp_next_offset will be dropped.
+
+AF_PACKET fanout mode
+=====================
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Currently implemented fanout policies are:
+
+ - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
+ - PACKET_FANOUT_LB: schedule to socket by round-robin
+ - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
+ - PACKET_FANOUT_RND: schedule to socket by random selection
+ - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
+ - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.)::
+
+ #include <stddef.h>
+ #include <stdlib.h>
+ #include <stdio.h>
+ #include <string.h>
+
+ #include <sys/types.h>
+ #include <sys/wait.h>
+ #include <sys/socket.h>
+ #include <sys/ioctl.h>
+
+ #include <unistd.h>
+
+ #include <linux/if_ether.h>
+ #include <linux/if_packet.h>
+
+ #include <net/if.h>
+
+ static const char *device_name;
+ static int fanout_type;
+ static int fanout_id;
+
+ #ifndef PACKET_FANOUT
+ # define PACKET_FANOUT 18
+ # define PACKET_FANOUT_HASH 0
+ # define PACKET_FANOUT_LB 1
+ #endif
+
+ static int setup_socket(void)
+ {
+ int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+ struct sockaddr_ll ll;
+ struct ifreq ifr;
+ int fanout_arg;
+
+ if (fd < 0) {
+ perror("socket");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strcpy(ifr.ifr_name, device_name);
+ err = ioctl(fd, SIOCGIFINDEX, &ifr);
+ if (err < 0) {
+ perror("SIOCGIFINDEX");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = AF_PACKET;
+ ll.sll_ifindex = ifr.ifr_ifindex;
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ return EXIT_FAILURE;
+ }
+
+ fanout_arg = (fanout_id | (fanout_type << 16));
+ err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+ &fanout_arg, sizeof(fanout_arg));
+ if (err) {
+ perror("setsockopt");
+ return EXIT_FAILURE;
+ }
+
+ return fd;
+ }
+
+ static void fanout_thread(void)
+ {
+ int fd = setup_socket();
+ int limit = 10000;
+
+ if (fd < 0)
+ exit(fd);
+
+ while (limit-- > 0) {
+ char buf[1600];
+ int err;
+
+ err = read(fd, buf, sizeof(buf));
+ if (err < 0) {
+ perror("read");
+ exit(EXIT_FAILURE);
+ }
+ if ((limit % 10) == 0)
+ fprintf(stdout, "(%d) \n", getpid());
+ }
+
+ fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+ close(fd);
+ exit(0);
+ }
+
+ int main(int argc, char **argp)
+ {
+ int fd, err;
+ int i;
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ if (!strcmp(argp[2], "hash"))
+ fanout_type = PACKET_FANOUT_HASH;
+ else if (!strcmp(argp[2], "lb"))
+ fanout_type = PACKET_FANOUT_LB;
+ else {
+ fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+ exit(EXIT_FAILURE);
+ }
+
+ device_name = argp[1];
+ fanout_id = getpid() & 0xffff;
+
+ for (i = 0; i < 4; i++) {
+ pid_t pid = fork();
+
+ switch (pid) {
+ case 0:
+ fanout_thread();
+
+ case -1:
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ for (i = 0; i < 4; i++) {
+ int status;
+
+ wait(&status);
+ }
+
+ return 0;
+ }
+
+AF_PACKET TPACKET_V3 example
+============================
+
+AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
+sizes by doing its own memory management. It is based on blocks where polling
+works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
+
+It is said that TPACKET_V3 brings the following benefits:
+
+ * ~15% - 20% reduction in CPU-usage
+ * ~20% increase in packet capture rate
+ * ~2x increase in packet density
+ * Port aggregation analysis
+ * Non static frame size to capture entire packet payload
+
+So it seems to be a good candidate to be used with packet fanout.
+
+Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
+it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::
+
+ /* Written from scratch, but kernel-to-user space API usage
+ * dissected from lolpcap:
+ * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
+ * License: GPL, version 2.0
+ */
+
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <stdint.h>
+ #include <string.h>
+ #include <assert.h>
+ #include <net/if.h>
+ #include <arpa/inet.h>
+ #include <netdb.h>
+ #include <poll.h>
+ #include <unistd.h>
+ #include <signal.h>
+ #include <inttypes.h>
+ #include <sys/socket.h>
+ #include <sys/mman.h>
+ #include <linux/if_packet.h>
+ #include <linux/if_ether.h>
+ #include <linux/ip.h>
+
+ #ifndef likely
+ # define likely(x) __builtin_expect(!!(x), 1)
+ #endif
+ #ifndef unlikely
+ # define unlikely(x) __builtin_expect(!!(x), 0)
+ #endif
+
+ struct block_desc {
+ uint32_t version;
+ uint32_t offset_to_priv;
+ struct tpacket_hdr_v1 h1;
+ };
+
+ struct ring {
+ struct iovec *rd;
+ uint8_t *map;
+ struct tpacket_req3 req;
+ };
+
+ static unsigned long packets_total = 0, bytes_total = 0;
+ static sig_atomic_t sigint = 0;
+
+ static void sighandler(int num)
+ {
+ sigint = 1;
+ }
+
+ static int setup_socket(struct ring *ring, char *netdev)
+ {
+ int err, i, fd, v = TPACKET_V3;
+ struct sockaddr_ll ll;
+ unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
+ unsigned int blocknum = 64;
+
+ fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (fd < 0) {
+ perror("socket");
+ exit(1);
+ }
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ memset(&ring->req, 0, sizeof(ring->req));
+ ring->req.tp_block_size = blocksiz;
+ ring->req.tp_frame_size = framesiz;
+ ring->req.tp_block_nr = blocknum;
+ ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
+ ring->req.tp_retire_blk_tov = 60;
+ ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
+ sizeof(ring->req));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
+ if (ring->map == MAP_FAILED) {
+ perror("mmap");
+ exit(1);
+ }
+
+ ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
+ assert(ring->rd);
+ for (i = 0; i < ring->req.tp_block_nr; ++i) {
+ ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
+ ring->rd[i].iov_len = ring->req.tp_block_size;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = PF_PACKET;
+ ll.sll_protocol = htons(ETH_P_ALL);
+ ll.sll_ifindex = if_nametoindex(netdev);
+ ll.sll_hatype = 0;
+ ll.sll_pkttype = 0;
+ ll.sll_halen = 0;
+
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ exit(1);
+ }
+
+ return fd;
+ }
+
+ static void display(struct tpacket3_hdr *ppd)
+ {
+ struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
+ struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
+
+ if (eth->h_proto == htons(ETH_P_IP)) {
+ struct sockaddr_in ss, sd;
+ char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
+
+ memset(&ss, 0, sizeof(ss));
+ ss.sin_family = PF_INET;
+ ss.sin_addr.s_addr = ip->saddr;
+ getnameinfo((struct sockaddr *) &ss, sizeof(ss),
+ sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
+
+ memset(&sd, 0, sizeof(sd));
+ sd.sin_family = PF_INET;
+ sd.sin_addr.s_addr = ip->daddr;
+ getnameinfo((struct sockaddr *) &sd, sizeof(sd),
+ dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
+
+ printf("%s -> %s, ", sbuff, dbuff);
+ }
+
+ printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
+ }
+
+ static void walk_block(struct block_desc *pbd, const int block_num)
+ {
+ int num_pkts = pbd->h1.num_pkts, i;
+ unsigned long bytes = 0;
+ struct tpacket3_hdr *ppd;
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+ pbd->h1.offset_to_first_pkt);
+ for (i = 0; i < num_pkts; ++i) {
+ bytes += ppd->tp_snaplen;
+ display(ppd);
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
+ ppd->tp_next_offset);
+ }
+
+ packets_total += num_pkts;
+ bytes_total += bytes;
+ }
+
+ static void flush_block(struct block_desc *pbd)
+ {
+ pbd->h1.block_status = TP_STATUS_KERNEL;
+ }
+
+ static void teardown_socket(struct ring *ring, int fd)
+ {
+ munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
+ free(ring->rd);
+ close(fd);
+ }
+
+ int main(int argc, char **argp)
+ {
+ int fd, err;
+ socklen_t len;
+ struct ring ring;
+ struct pollfd pfd;
+ unsigned int block_num = 0, blocks = 64;
+ struct block_desc *pbd;
+ struct tpacket_stats_v3 stats;
+
+ if (argc != 2) {
+ fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ signal(SIGINT, sighandler);
+
+ memset(&ring, 0, sizeof(ring));
+ fd = setup_socket(&ring, argp[argc - 1]);
+ assert(fd > 0);
+
+ memset(&pfd, 0, sizeof(pfd));
+ pfd.fd = fd;
+ pfd.events = POLLIN | POLLERR;
+ pfd.revents = 0;
+
+ while (likely(!sigint)) {
+ pbd = (struct block_desc *) ring.rd[block_num].iov_base;
+
+ if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
+ poll(&pfd, 1, -1);
+ continue;
+ }
+
+ walk_block(pbd, block_num);
+ flush_block(pbd);
+ block_num = (block_num + 1) % blocks;
+ }
+
+ len = sizeof(stats);
+ err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
+ if (err < 0) {
+ perror("getsockopt");
+ exit(1);
+ }
+
+ fflush(stdout);
+ printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
+ stats.tp_packets, bytes_total, stats.tp_drops,
+ stats.tp_freeze_q_cnt);
+
+ teardown_socket(&ring, fd);
+ return 0;
+ }
+
+PACKET_QDISC_BYPASS
+===================
+
+If there is a requirement to load the network with many packets in a similar
+fashion as pktgen does, you might set the following option after socket
+creation::
+
+ int one = 1;
+ setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
+
+This has the side-effect, that packets sent through PF_PACKET will bypass the
+kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning,
+packet are not buffered, tc disciplines are ignored, increased loss can occur
+and such packets are also not visible to other PF_PACKET sockets anymore. So,
+you have been warned; generally, this can be useful for stress testing various
+components of a system.
+
+On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
+on PF_PACKET sockets.
+
+PACKET_TIMESTAMP
+================
+
+The PACKET_TIMESTAMP setting determines the source of the timestamp in
+the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
+NIC is capable of timestamping packets in hardware, you can request those
+hardware timestamps to be used. Note: you may need to enable the generation
+of hardware timestamps with SIOCSHWTSTAMP (see related information from
+Documentation/networking/timestamping.rst).
+
+PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::
+
+ int req = SOF_TIMESTAMPING_RAW_HARDWARE;
+ setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))
+
+For the mmap(2)ed ring buffers, such timestamps are stored in the
+``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
+To determine what kind of timestamp has been reported, the tp_status field
+is binary or'ed with the following possible bits ...
+
+::
+
+ TP_STATUS_TS_RAW_HARDWARE
+ TP_STATUS_TS_SOFTWARE
+
+... that are equivalent to its ``SOF_TIMESTAMPING_*`` counterparts. For the
+RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
+software fallback was invoked *within* PF_PACKET's processing code (less
+precise).
+
+Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
+ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant
+frames to be updated resp. the frame handed over to the application, iv) walk
+through the frames to pick up the individual hw/sw timestamps.
+
+Only (!) if transmit timestamping is enabled, then these bits are combined
+with binary | with TP_STATUS_AVAILABLE, so you must check for that in your
+application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
+in a first step to see if the frame belongs to the application, and then
+one can extract the type of timestamp in a second step from tp_status)!
+
+If you don't care about them, thus having it disabled, checking for
+TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the
+TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
+members do not contain a valid value. For TX_RINGs, by default no timestamp
+is generated!
+
+See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
+for more information on hardware timestamps.
+
+Miscellaneous bits
+==================
+
+- Packet sockets work well together with Linux socket filters, thus you also
+ might want to have a look at Documentation/networking/filter.rst
+
+THANKS
+======
+
+ Jesse Brandeburg, for fixing my grammathical/spelling errors
diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst
new file mode 100644
index 0000000000..215ebc9275
--- /dev/null
+++ b/Documentation/networking/page_pool.rst
@@ -0,0 +1,191 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Page Pool API
+=============
+
+.. kernel-doc:: include/net/page_pool/helpers.h
+ :doc: page_pool allocator
+
+Architecture overview
+=====================
+
+.. code-block:: none
+
+ +------------------+
+ | Driver |
+ +------------------+
+ ^
+ |
+ |
+ |
+ v
+ +--------------------------------------------+
+ | request memory |
+ +--------------------------------------------+
+ ^ ^
+ | |
+ | Pool empty | Pool has entries
+ | |
+ v v
+ +-----------------------+ +------------------------+
+ | alloc (and map) pages | | get page from cache |
+ +-----------------------+ +------------------------+
+ ^ ^
+ | |
+ | cache available | No entries, refill
+ | | from ptr-ring
+ | |
+ v v
+ +-----------------+ +------------------+
+ | Fast cache | | ptr-ring cache |
+ +-----------------+ +------------------+
+
+API interface
+=============
+The number of pools created **must** match the number of hardware queues
+unless hardware restrictions make that impossible. This would otherwise beat the
+purpose of page pool, which is allocate pages fast from cache without locking.
+This lockless guarantee naturally comes from running under a NAPI softirq.
+The protection doesn't strictly have to be NAPI, any guarantee that allocating
+a page will cause no race conditions is enough.
+
+.. kernel-doc:: net/core/page_pool.c
+ :identifiers: page_pool_create
+
+.. kernel-doc:: include/net/page_pool/types.h
+ :identifiers: struct page_pool_params
+
+.. kernel-doc:: include/net/page_pool/helpers.h
+ :identifiers: page_pool_put_page page_pool_put_full_page
+ page_pool_recycle_direct page_pool_dev_alloc_pages
+ page_pool_get_dma_addr page_pool_get_dma_dir
+
+.. kernel-doc:: net/core/page_pool.c
+ :identifiers: page_pool_put_page_bulk page_pool_get_stats
+
+DMA sync
+--------
+Driver is always responsible for syncing the pages for the CPU.
+Drivers may choose to take care of syncing for the device as well
+or set the ``PP_FLAG_DMA_SYNC_DEV`` flag to request that pages
+allocated from the page pool are already synced for the device.
+
+If ``PP_FLAG_DMA_SYNC_DEV`` is set, the driver must inform the core what portion
+of the buffer has to be synced. This allows the core to avoid syncing the entire
+page when the drivers knows that the device only accessed a portion of the page.
+
+Most drivers will reserve headroom in front of the frame. This part
+of the buffer is not touched by the device, so to avoid syncing
+it drivers can set the ``offset`` field in struct page_pool_params
+appropriately.
+
+For pages recycled on the XDP xmit and skb paths the page pool will
+use the ``max_len`` member of struct page_pool_params to decide how
+much of the page needs to be synced (starting at ``offset``).
+When directly freeing pages in the driver (page_pool_put_page())
+the ``dma_sync_size`` argument specifies how much of the buffer needs
+to be synced.
+
+If in doubt set ``offset`` to 0, ``max_len`` to ``PAGE_SIZE`` and
+pass -1 as ``dma_sync_size``. That combination of arguments is always
+correct.
+
+Note that the syncing parameters are for the entire page.
+This is important to remember when using fragments (``PP_FLAG_PAGE_FRAG``),
+where allocated buffers may be smaller than a full page.
+Unless the driver author really understands page pool internals
+it's recommended to always use ``offset = 0``, ``max_len = PAGE_SIZE``
+with fragmented page pools.
+
+Stats API and structures
+------------------------
+If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API
+page_pool_get_stats() and structures described below are available.
+It takes a pointer to a ``struct page_pool`` and a pointer to a struct
+page_pool_stats allocated by the caller.
+
+The API will fill in the provided struct page_pool_stats with
+statistics about the page_pool.
+
+.. kernel-doc:: include/net/page_pool/types.h
+ :identifiers: struct page_pool_recycle_stats
+ struct page_pool_alloc_stats
+ struct page_pool_stats
+
+Coding examples
+===============
+
+Registration
+------------
+
+.. code-block:: c
+
+ /* Page pool registration */
+ struct page_pool_params pp_params = { 0 };
+ struct xdp_rxq_info xdp_rxq;
+ int err;
+
+ pp_params.order = 0;
+ /* internal DMA mapping in page_pool */
+ pp_params.flags = PP_FLAG_DMA_MAP;
+ pp_params.pool_size = DESC_NUM;
+ pp_params.nid = NUMA_NO_NODE;
+ pp_params.dev = priv->dev;
+ pp_params.napi = napi; /* only if locking is tied to NAPI */
+ pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
+ page_pool = page_pool_create(&pp_params);
+
+ err = xdp_rxq_info_reg(&xdp_rxq, ndev, 0);
+ if (err)
+ goto err_out;
+
+ err = xdp_rxq_info_reg_mem_model(&xdp_rxq, MEM_TYPE_PAGE_POOL, page_pool);
+ if (err)
+ goto err_out;
+
+NAPI poller
+-----------
+
+
+.. code-block:: c
+
+ /* NAPI Rx poller */
+ enum dma_data_direction dma_dir;
+
+ dma_dir = page_pool_get_dma_dir(dring->page_pool);
+ while (done < budget) {
+ if (some error)
+ page_pool_recycle_direct(page_pool, page);
+ if (packet_is_xdp) {
+ if XDP_DROP:
+ page_pool_recycle_direct(page_pool, page);
+ } else (packet_is_skb) {
+ skb_mark_for_recycle(skb);
+ new_page = page_pool_dev_alloc_pages(page_pool);
+ }
+ }
+
+Stats
+-----
+
+.. code-block:: c
+
+ #ifdef CONFIG_PAGE_POOL_STATS
+ /* retrieve stats */
+ struct page_pool_stats stats = { 0 };
+ if (page_pool_get_stats(page_pool, &stats)) {
+ /* perhaps the driver reports statistics with ethool */
+ ethtool_print_allocation_stats(&stats.alloc_stats);
+ ethtool_print_recycle_stats(&stats.recycle_stats);
+ }
+ #endif
+
+Driver unload
+-------------
+
+.. code-block:: c
+
+ /* Driver unload */
+ page_pool_put_full_page(page_pool, page, false);
+ xdp_rxq_info_unreg(&xdp_rxq);
diff --git a/Documentation/networking/phonet.rst b/Documentation/networking/phonet.rst
new file mode 100644
index 0000000000..d705cc5b09
--- /dev/null
+++ b/Documentation/networking/phonet.rst
@@ -0,0 +1,230 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+============================
+Linux Phonet protocol family
+============================
+
+Introduction
+------------
+
+Phonet is a packet protocol used by Nokia cellular modems for both IPC
+and RPC. With the Linux Phonet socket family, Linux host processes can
+receive and send messages from/to the modem, or any other external
+device attached to the modem. The modem takes care of routing.
+
+Phonet packets can be exchanged through various hardware connections
+depending on the device, such as:
+
+ - USB with the CDC Phonet interface,
+ - infrared,
+ - Bluetooth,
+ - an RS232 serial port (with a dedicated "FBUS" line discipline),
+ - the SSI bus with some TI OMAP processors.
+
+
+Packets format
+--------------
+
+Phonet packets have a common header as follows::
+
+ struct phonethdr {
+ uint8_t pn_media; /* Media type (link-layer identifier) */
+ uint8_t pn_rdev; /* Receiver device ID */
+ uint8_t pn_sdev; /* Sender device ID */
+ uint8_t pn_res; /* Resource ID or function */
+ uint16_t pn_length; /* Big-endian message byte length (minus 6) */
+ uint8_t pn_robj; /* Receiver object ID */
+ uint8_t pn_sobj; /* Sender object ID */
+ };
+
+On Linux, the link-layer header includes the pn_media byte (see below).
+The next 7 bytes are part of the network-layer header.
+
+The device ID is split: the 6 higher-order bits constitute the device
+address, while the 2 lower-order bits are used for multiplexing, as are
+the 8-bit object identifiers. As such, Phonet can be considered as a
+network layer with 6 bits of address space and 10 bits for transport
+protocol (much like port numbers in IP world).
+
+The modem always has address number zero. All other device have a their
+own 6-bit address.
+
+
+Link layer
+----------
+
+Phonet links are always point-to-point links. The link layer header
+consists of a single Phonet media type byte. It uniquely identifies the
+link through which the packet is transmitted, from the modem's
+perspective. Each Phonet network device shall prepend and set the media
+type byte as appropriate. For convenience, a common phonet_header_ops
+link-layer header operations structure is provided. It sets the
+media type according to the network device hardware address.
+
+Linux Phonet network interfaces support a dedicated link layer packets
+type (ETH_P_PHONET) which is out of the Ethernet type range. They can
+only send and receive Phonet packets.
+
+The virtual TUN tunnel device driver can also be used for Phonet. This
+requires IFF_TUN mode, _without_ the IFF_NO_PI flag. In this case,
+there is no link-layer header, so there is no Phonet media type byte.
+
+Note that Phonet interfaces are not allowed to re-order packets, so
+only the (default) Linux FIFO qdisc should be used with them.
+
+
+Network layer
+-------------
+
+The Phonet socket address family maps the Phonet packet header::
+
+ struct sockaddr_pn {
+ sa_family_t spn_family; /* AF_PHONET */
+ uint8_t spn_obj; /* Object ID */
+ uint8_t spn_dev; /* Device ID */
+ uint8_t spn_resource; /* Resource or function */
+ uint8_t spn_zero[...]; /* Padding */
+ };
+
+The resource field is only used when sending and receiving;
+It is ignored by bind() and getsockname().
+
+
+Low-level datagram protocol
+---------------------------
+
+Applications can send Phonet messages using the Phonet datagram socket
+protocol from the PF_PHONET family. Each socket is bound to one of the
+2^10 object IDs available, and can send and receive packets with any
+other peer.
+
+::
+
+ struct sockaddr_pn addr = { .spn_family = AF_PHONET, };
+ ssize_t len;
+ socklen_t addrlen = sizeof(addr);
+ int fd;
+
+ fd = socket(PF_PHONET, SOCK_DGRAM, 0);
+ bind(fd, (struct sockaddr *)&addr, sizeof(addr));
+ /* ... */
+
+ sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr));
+ len = recvfrom(fd, buf, sizeof(buf), 0,
+ (struct sockaddr *)&addr, &addrlen);
+
+This protocol follows the SOCK_DGRAM connection-less semantics.
+However, connect() and getpeername() are not supported, as they did
+not seem useful with Phonet usages (could be added easily).
+
+
+Resource subscription
+---------------------
+
+A Phonet datagram socket can be subscribed to any number of 8-bits
+Phonet resources, as follow::
+
+ uint32_t res = 0xXX;
+ ioctl(fd, SIOCPNADDRESOURCE, &res);
+
+Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O
+control request, or when the socket is closed.
+
+Note that no more than one socket can be subscribed to any given
+resource at a time. If not, ioctl() will return EBUSY.
+
+
+Phonet Pipe protocol
+--------------------
+
+The Phonet Pipe protocol is a simple sequenced packets protocol
+with end-to-end congestion control. It uses the passive listening
+socket paradigm. The listening socket is bound to an unique free object
+ID. Each listening socket can handle up to 255 simultaneous
+connections, one per accept()'d socket.
+
+::
+
+ int lfd, cfd;
+
+ lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE);
+ listen (lfd, INT_MAX);
+
+ /* ... */
+ cfd = accept(lfd, NULL, NULL);
+ for (;;)
+ {
+ char buf[...];
+ ssize_t len = read(cfd, buf, sizeof(buf));
+
+ /* ... */
+
+ write(cfd, msg, msglen);
+ }
+
+Connections are traditionally established between two endpoints by a
+"third party" application. This means that both endpoints are passive.
+
+
+As of Linux kernel version 2.6.39, it is also possible to connect
+two endpoints directly, using connect() on the active side. This is
+intended to support the newer Nokia Wireless Modem API, as found in
+e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform::
+
+ struct sockaddr_spn spn;
+ int fd;
+
+ fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE);
+ memset(&spn, 0, sizeof(spn));
+ spn.spn_family = AF_PHONET;
+ spn.spn_obj = ...;
+ spn.spn_dev = ...;
+ spn.spn_resource = 0xD9;
+ connect(fd, (struct sockaddr *)&spn, sizeof(spn));
+ /* normal I/O here ... */
+ close(fd);
+
+
+.. Warning:
+
+ When polling a connected pipe socket for writability, there is an
+ intrinsic race condition whereby writability might be lost between the
+ polling and the writing system calls. In this case, the socket will
+ block until write becomes possible again, unless non-blocking mode
+ is enabled.
+
+
+The pipe protocol provides two socket options at the SOL_PNPIPE level:
+
+ PNPIPE_ENCAP accepts one integer value (int) of:
+
+ PNPIPE_ENCAP_NONE:
+ The socket operates normally (default).
+
+ PNPIPE_ENCAP_IP:
+ The socket is used as a backend for a virtual IP
+ interface. This requires CAP_NET_ADMIN capability. GPRS data
+ support on Nokia modems can use this. Note that the socket cannot
+ be reliably poll()'d or read() from while in this mode.
+
+ PNPIPE_IFINDEX
+ is a read-only integer value. It contains the
+ interface index of the network interface created by PNPIPE_ENCAP,
+ or zero if encapsulation is off.
+
+ PNPIPE_HANDLE
+ is a read-only integer value. It contains the underlying
+ identifier ("pipe handle") of the pipe. This is only defined for
+ socket descriptors that are already connected or being connected.
+
+
+Authors
+-------
+
+Linux Phonet was initially written by Sakari Ailus.
+
+Other contributors include Mikä Liljeberg, Andras Domokos,
+Carlos Chinea and Rémi Denis-Courmont.
+
+Copyright |copy| 2008 Nokia Corporation.
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
new file mode 100644
index 0000000000..1283240d76
--- /dev/null
+++ b/Documentation/networking/phy.rst
@@ -0,0 +1,551 @@
+=====================
+PHY Abstraction Layer
+=====================
+
+Purpose
+=======
+
+Most network devices consist of set of registers which provide an interface
+to a MAC layer, which communicates with the physical connection through a
+PHY. The PHY concerns itself with negotiating link parameters with the link
+partner on the other side of the network connection (typically, an ethernet
+cable), and provides a register interface to allow drivers to determine what
+settings were chosen, and to configure what settings are allowed.
+
+While these devices are distinct from the network devices, and conform to a
+standard layout for the registers, it has been common practice to integrate
+the PHY management code with the network driver. This has resulted in large
+amounts of redundant code. Also, on embedded systems with multiple (and
+sometimes quite different) ethernet controllers connected to the same
+management bus, it is difficult to ensure safe use of the bus.
+
+Since the PHYs are devices, and the management busses through which they are
+accessed are, in fact, busses, the PHY Abstraction Layer treats them as such.
+In doing so, it has these goals:
+
+#. Increase code-reuse
+#. Increase overall code-maintainability
+#. Speed development time for new network drivers, and for new systems
+
+Basically, this layer is meant to provide an interface to PHY devices which
+allows network driver writers to write as little code as possible, while
+still providing a full feature set.
+
+The MDIO bus
+============
+
+Most network devices are connected to a PHY by means of a management bus.
+Different devices use different busses (though some share common interfaces).
+In order to take advantage of the PAL, each bus interface needs to be
+registered as a distinct device.
+
+#. read and write functions must be implemented. Their prototypes are::
+
+ int write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
+ int read(struct mii_bus *bus, int mii_id, int regnum);
+
+ mii_id is the address on the bus for the PHY, and regnum is the register
+ number. These functions are guaranteed not to be called from interrupt
+ time, so it is safe for them to block, waiting for an interrupt to signal
+ the operation is complete
+
+#. A reset function is optional. This is used to return the bus to an
+ initialized state.
+
+#. A probe function is needed. This function should set up anything the bus
+ driver needs, setup the mii_bus structure, and register with the PAL using
+ mdiobus_register. Similarly, there's a remove function to undo all of
+ that (use mdiobus_unregister).
+
+#. Like any driver, the device_driver structure must be configured, and init
+ exit functions are used to register the driver.
+
+#. The bus must also be declared somewhere as a device, and registered.
+
+As an example for how one driver implemented an mdio bus driver, see
+drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file
+for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/")
+
+(RG)MII/electrical interface considerations
+===========================================
+
+The Reduced Gigabit Medium Independent Interface (RGMII) is a 12-pin
+electrical signal interface using a synchronous 125Mhz clock signal and several
+data lines. Due to this design decision, a 1.5ns to 2ns delay must be added
+between the clock line (RXC or TXC) and the data lines to let the PHY (clock
+sink) have a large enough setup and hold time to sample the data lines correctly. The
+PHY library offers different types of PHY_INTERFACE_MODE_RGMII* values to let
+the PHY driver and optionally the MAC driver, implement the required delay. The
+values of phy_interface_t must be understood from the perspective of the PHY
+device itself, leading to the following:
+
+* PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
+ internal delay by itself, it assumes that either the Ethernet MAC (if capable)
+ or the PCB traces insert the correct 1.5-2ns delay
+
+* PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
+ for the transmit data lines (TXD[3:0]) processed by the PHY device
+
+* PHY_INTERFACE_MODE_RGMII_RXID: the PHY should insert an internal delay
+ for the receive data lines (RXD[3:0]) processed by the PHY device
+
+* PHY_INTERFACE_MODE_RGMII_ID: the PHY should insert internal delays for
+ both transmit AND receive data lines from/to the PHY device
+
+Whenever possible, use the PHY side RGMII delay for these reasons:
+
+* PHY devices may offer sub-nanosecond granularity in how they allow a
+ receiver/transmitter side delay (e.g: 0.5, 1.0, 1.5ns) to be specified. Such
+ precision may be required to account for differences in PCB trace lengths
+
+* PHY devices are typically qualified for a large range of applications
+ (industrial, medical, automotive...), and they provide a constant and
+ reliable delay across temperature/pressure/voltage ranges
+
+* PHY device drivers in PHYLIB being reusable by nature, being able to
+ configure correctly a specified delay enables more designs with similar delay
+ requirements to be operated correctly
+
+For cases where the PHY is not capable of providing this delay, but the
+Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
+should be PHY_INTERFACE_MODE_RGMII, and the Ethernet MAC driver should be
+configured correctly in order to provide the required transmit and/or receive
+side delay from the perspective of the PHY device. Conversely, if the Ethernet
+MAC driver looks at the phy_interface_t value, for any other mode but
+PHY_INTERFACE_MODE_RGMII, it should make sure that the MAC-level delays are
+disabled.
+
+In case neither the Ethernet MAC, nor the PHY are capable of providing the
+required delays, as defined per the RGMII standard, several options may be
+available:
+
+* Some SoCs may offer a pin pad/mux/controller capable of configuring a given
+ set of pins' strength, delays, and voltage; and it may be a suitable
+ option to insert the expected 2ns RGMII delay.
+
+* Modifying the PCB design to include a fixed delay (e.g: using a specifically
+ designed serpentine), which may not require software configuration at all.
+
+Common problems with RGMII delay mismatch
+-----------------------------------------
+
+When there is a RGMII delay mismatch between the Ethernet MAC and the PHY, this
+will most likely result in the clock and data line signals to be unstable when
+the PHY or MAC take a snapshot of these signals to translate them into logical
+1 or 0 states and reconstruct the data being transmitted/received. Typical
+symptoms include:
+
+* Transmission/reception partially works, and there is frequent or occasional
+ packet loss observed
+
+* Ethernet MAC may report some or all packets ingressing with a FCS/CRC error,
+ or just discard them all
+
+* Switching to lower speeds such as 10/100Mbits/sec makes the problem go away
+ (since there is enough setup/hold time in that case)
+
+Connecting to a PHY
+===================
+
+Sometime during startup, the network driver needs to establish a connection
+between the PHY device, and the network device. At this time, the PHY's bus
+and drivers need to all have been loaded, so it is ready for the connection.
+At this point, there are several ways to connect to the PHY:
+
+#. The PAL handles everything, and only calls the network driver when
+ the link state changes, so it can react.
+
+#. The PAL handles everything except interrupts (usually because the
+ controller has the interrupt registers).
+
+#. The PAL handles everything, but checks in with the driver every second,
+ allowing the network driver to react first to any changes before the PAL
+ does.
+
+#. The PAL serves only as a library of functions, with the network device
+ manually calling functions to update status, and configure the PHY
+
+
+Letting the PHY Abstraction Layer do Everything
+===============================================
+
+If you choose option 1 (The hope is that every driver can, but to still be
+useful to drivers that can't), connecting to the PHY is simple:
+
+First, you need a function to react to changes in the link state. This
+function follows this protocol::
+
+ static void adjust_link(struct net_device *dev);
+
+Next, you need to know the device name of the PHY connected to this device.
+The name will look something like, "0:00", where the first number is the
+bus id, and the second is the PHY's address on that bus. Typically,
+the bus is responsible for making its ID unique.
+
+Now, to connect, just call this function::
+
+ phydev = phy_connect(dev, phy_name, &adjust_link, interface);
+
+*phydev* is a pointer to the phy_device structure which represents the PHY.
+If phy_connect is successful, it will return the pointer. dev, here, is the
+pointer to your net_device. Once done, this function will have started the
+PHY's software state machine, and registered for the PHY's interrupt, if it
+has one. The phydev structure will be populated with information about the
+current state, though the PHY will not yet be truly operational at this
+point.
+
+PHY-specific flags should be set in phydev->dev_flags prior to the call
+to phy_connect() such that the underlying PHY driver can check for flags
+and perform specific operations based on them.
+This is useful if the system has put hardware restrictions on
+the PHY/controller, of which the PHY needs to be aware.
+
+*interface* is a u32 which specifies the connection type used
+between the controller and the PHY. Examples are GMII, MII,
+RGMII, and SGMII. See "PHY interface mode" below. For a full
+list, see include/linux/phy.h
+
+Now just make sure that phydev->supported and phydev->advertising have any
+values pruned from them which don't make sense for your controller (a 10/100
+controller may be connected to a gigabit capable PHY, so you would need to
+mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions
+for these bitfields. Note that you should not SET any bits, except the
+SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get
+put into an unsupported state.
+
+Lastly, once the controller is ready to handle network traffic, you call
+phy_start(phydev). This tells the PAL that you are ready, and configures the
+PHY to connect to the network. If the MAC interrupt of your network driver
+also handles PHY status changes, just set phydev->irq to PHY_MAC_INTERRUPT
+before you call phy_start and use phy_mac_interrupt() from the network
+driver. If you don't want to use interrupts, set phydev->irq to PHY_POLL.
+phy_start() enables the PHY interrupts (if applicable) and starts the
+phylib state machine.
+
+When you want to disconnect from the network (even if just briefly), you call
+phy_stop(phydev). This function also stops the phylib state machine and
+disables PHY interrupts.
+
+PHY interface modes
+===================
+
+The PHY interface mode supplied in the phy_connect() family of functions
+defines the initial operating mode of the PHY interface. This is not
+guaranteed to remain constant; there are PHYs which dynamically change
+their interface mode without software interaction depending on the
+negotiation results.
+
+Some of the interface modes are described below:
+
+``PHY_INTERFACE_MODE_SMII``
+ This is serial MII, clocked at 125MHz, supporting 100M and 10M speeds.
+ Some details can be found in
+ https://opencores.org/ocsvn/smii/smii/trunk/doc/SMII.pdf
+
+``PHY_INTERFACE_MODE_1000BASEX``
+ This defines the 1000BASE-X single-lane serdes link as defined by the
+ 802.3 standard section 36. The link operates at a fixed bit rate of
+ 1.25Gbaud using a 10B/8B encoding scheme, resulting in an underlying
+ data rate of 1Gbps. Embedded in the data stream is a 16-bit control
+ word which is used to negotiate the duplex and pause modes with the
+ remote end. This does not include "up-clocked" variants such as 2.5Gbps
+ speeds (see below.)
+
+``PHY_INTERFACE_MODE_2500BASEX``
+ This defines a variant of 1000BASE-X which is clocked 2.5 times as fast
+ as the 802.3 standard, giving a fixed bit rate of 3.125Gbaud.
+
+``PHY_INTERFACE_MODE_SGMII``
+ This is used for Cisco SGMII, which is a modification of 1000BASE-X
+ as defined by the 802.3 standard. The SGMII link consists of a single
+ serdes lane running at a fixed bit rate of 1.25Gbaud with 10B/8B
+ encoding. The underlying data rate is 1Gbps, with the slower speeds of
+ 100Mbps and 10Mbps being achieved through replication of each data symbol.
+ The 802.3 control word is re-purposed to send the negotiated speed and
+ duplex information from to the MAC, and for the MAC to acknowledge
+ receipt. This does not include "up-clocked" variants such as 2.5Gbps
+ speeds.
+
+ Note: mismatched SGMII vs 1000BASE-X configuration on a link can
+ successfully pass data in some circumstances, but the 16-bit control
+ word will not be correctly interpreted, which may cause mismatches in
+ duplex, pause or other settings. This is dependent on the MAC and/or
+ PHY behaviour.
+
+``PHY_INTERFACE_MODE_5GBASER``
+ This is the IEEE 802.3 Clause 129 defined 5GBASE-R protocol. It is
+ identical to the 10GBASE-R protocol defined in Clause 49, with the
+ exception that it operates at half the frequency. Please refer to the
+ IEEE standard for the definition.
+
+``PHY_INTERFACE_MODE_10GBASER``
+ This is the IEEE 802.3 Clause 49 defined 10GBASE-R protocol used with
+ various different mediums. Please refer to the IEEE standard for a
+ definition of this.
+
+ Note: 10GBASE-R is just one protocol that can be used with XFI and SFI.
+ XFI and SFI permit multiple protocols over a single SERDES lane, and
+ also defines the electrical characteristics of the signals with a host
+ compliance board plugged into the host XFP/SFP connector. Therefore,
+ XFI and SFI are not PHY interface types in their own right.
+
+``PHY_INTERFACE_MODE_10GKR``
+ This is the IEEE 802.3 Clause 49 defined 10GBASE-R with Clause 73
+ autonegotiation. Please refer to the IEEE standard for further
+ information.
+
+ Note: due to legacy usage, some 10GBASE-R usage incorrectly makes
+ use of this definition.
+
+``PHY_INTERFACE_MODE_25GBASER``
+ This is the IEEE 802.3 PCS Clause 107 defined 25GBASE-R protocol.
+ The PCS is identical to 10GBASE-R, i.e. 64B/66B encoded
+ running 2.5 as fast, giving a fixed bit rate of 25.78125 Gbaud.
+ Please refer to the IEEE standard for further information.
+
+``PHY_INTERFACE_MODE_100BASEX``
+ This defines IEEE 802.3 Clause 24. The link operates at a fixed data
+ rate of 125Mpbs using a 4B/5B encoding scheme, resulting in an underlying
+ data rate of 100Mpbs.
+
+``PHY_INTERFACE_MODE_QUSGMII``
+ This defines the Cisco the Quad USGMII mode, which is the Quad variant of
+ the USGMII (Universal SGMII) link. It's very similar to QSGMII, but uses
+ a Packet Control Header (PCH) instead of the 7 bytes preamble to carry not
+ only the port id, but also so-called "extensions". The only documented
+ extension so-far in the specification is the inclusion of timestamps, for
+ PTP-enabled PHYs. This mode isn't compatible with QSGMII, but offers the
+ same capabilities in terms of link speed and negotiation.
+
+``PHY_INTERFACE_MODE_1000BASEKX``
+ This is 1000BASE-X as defined by IEEE 802.3 Clause 36 with Clause 73
+ autonegotiation. Generally, it will be used with a Clause 70 PMD. To
+ contrast with the 1000BASE-X phy mode used for Clause 38 and 39 PMDs, this
+ interface mode has different autonegotiation and only supports full duplex.
+
+``PHY_INTERFACE_MODE_PSGMII``
+ This is the Penta SGMII mode, it is similar to QSGMII but it combines 5
+ SGMII lines into a single link compared to 4 on QSGMII.
+
+Pause frames / flow control
+===========================
+
+The PHY does not participate directly in flow control/pause frames except by
+making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in
+MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC
+controller supports such a thing. Since flow control/pause frames generation
+involves the Ethernet MAC driver, it is recommended that this driver takes care
+of properly indicating advertisement and support for such features by setting
+the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done
+either before or after phy_connect() and/or as a result of implementing the
+ethtool::set_pauseparam feature.
+
+
+Keeping Close Tabs on the PAL
+=============================
+
+It is possible that the PAL's built-in state machine needs a little help to
+keep your network device and the PHY properly in sync. If so, you can
+register a helper function when connecting to the PHY, which will be called
+every second before the state machine reacts to any changes. To do this, you
+need to manually call phy_attach() and phy_prepare_link(), and then call
+phy_start_machine() with the second argument set to point to your special
+handler.
+
+Currently there are no examples of how to use this functionality, and testing
+on it has been limited because the author does not have any drivers which use
+it (they all use option 1). So Caveat Emptor.
+
+Doing it all yourself
+=====================
+
+There's a remote chance that the PAL's built-in state machine cannot track
+the complex interactions between the PHY and your network device. If this is
+so, you can simply call phy_attach(), and not call phy_start_machine or
+phy_prepare_link(). This will mean that phydev->state is entirely yours to
+handle (phy_start and phy_stop toggle between some of the states, so you
+might need to avoid them).
+
+An effort has been made to make sure that useful functionality can be
+accessed without the state-machine running, and most of these functions are
+descended from functions which did not interact with a complex state-machine.
+However, again, no effort has been made so far to test running without the
+state machine, so tryer beware.
+
+Here is a brief rundown of the functions::
+
+ int phy_read(struct phy_device *phydev, u16 regnum);
+ int phy_write(struct phy_device *phydev, u16 regnum, u16 val);
+
+Simple read/write primitives. They invoke the bus's read/write function
+pointers.
+::
+
+ void phy_print_status(struct phy_device *phydev);
+
+A convenience function to print out the PHY status neatly.
+::
+
+ void phy_request_interrupt(struct phy_device *phydev);
+
+Requests the IRQ for the PHY interrupts.
+::
+
+ struct phy_device * phy_attach(struct net_device *dev, const char *phy_id,
+ phy_interface_t interface);
+
+Attaches a network device to a particular PHY, binding the PHY to a generic
+driver if none was found during bus initialization.
+::
+
+ int phy_start_aneg(struct phy_device *phydev);
+
+Using variables inside the phydev structure, either configures advertising
+and resets autonegotiation, or disables autonegotiation, and configures
+forced settings.
+::
+
+ static inline int phy_read_status(struct phy_device *phydev);
+
+Fills the phydev structure with up-to-date information about the current
+settings in the PHY.
+::
+
+ int phy_ethtool_ksettings_set(struct phy_device *phydev,
+ const struct ethtool_link_ksettings *cmd);
+
+Ethtool convenience functions.
+::
+
+ int phy_mii_ioctl(struct phy_device *phydev,
+ struct mii_ioctl_data *mii_data, int cmd);
+
+The MII ioctl. Note that this function will completely screw up the state
+machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to
+use this only to write registers which are not standard, and don't set off
+a renegotiation.
+
+PHY Device Drivers
+==================
+
+With the PHY Abstraction Layer, adding support for new PHYs is
+quite easy. In some cases, no work is required at all! However,
+many PHYs require a little hand-holding to get up-and-running.
+
+Generic PHY driver
+------------------
+
+If the desired PHY doesn't have any errata, quirks, or special
+features you want to support, then it may be best to not add
+support, and let the PHY Abstraction Layer's Generic PHY Driver
+do all of the work.
+
+Writing a PHY driver
+--------------------
+
+If you do need to write a PHY driver, the first thing to do is
+make sure it can be matched with an appropriate PHY device.
+This is done during bus initialization by reading the device's
+UID (stored in registers 2 and 3), then comparing it to each
+driver's phy_id field by ANDing it with each driver's
+phy_id_mask field. Also, it needs a name. Here's an example::
+
+ static struct phy_driver dm9161_driver = {
+ .phy_id = 0x0181b880,
+ .name = "Davicom DM9161E",
+ .phy_id_mask = 0x0ffffff0,
+ ...
+ }
+
+Next, you need to specify what features (speed, duplex, autoneg,
+etc) your PHY device and driver support. Most PHYs support
+PHY_BASIC_FEATURES, but you can look in include/mii.h for other
+features.
+
+Each driver consists of a number of function pointers, documented
+in include/linux/phy.h under the phy_driver structure.
+
+Of these, only config_aneg and read_status are required to be
+assigned by the driver code. The rest are optional. Also, it is
+preferred to use the generic phy driver's versions of these two
+functions if at all possible: genphy_read_status and
+genphy_config_aneg. If this is not possible, it is likely that
+you only need to perform some actions before and after invoking
+these functions, and so your functions will wrap the generic
+ones.
+
+Feel free to look at the Marvell, Cicada, and Davicom drivers in
+drivers/net/phy/ for examples (the lxt and qsemi drivers have
+not been tested as of this writing).
+
+The PHY's MMD register accesses are handled by the PAL framework
+by default, but can be overridden by a specific PHY driver if
+required. This could be the case if a PHY was released for
+manufacturing before the MMD PHY register definitions were
+standardized by the IEEE. Most modern PHYs will be able to use
+the generic PAL framework for accessing the PHY's MMD registers.
+An example of such usage is for Energy Efficient Ethernet support,
+implemented in the PAL. This support uses the PAL to access MMD
+registers for EEE query and configuration if the PHY supports
+the IEEE standard access mechanisms, or can use the PHY's specific
+access interfaces if overridden by the specific PHY driver. See
+the Micrel driver in drivers/net/phy/ for an example of how this
+can be implemented.
+
+Board Fixups
+============
+
+Sometimes the specific interaction between the platform and the PHY requires
+special handling. For instance, to change where the PHY's clock input is,
+or to add a delay to account for latency issues in the data path. In order
+to support such contingencies, the PHY Layer allows platform code to register
+fixups to be run when the PHY is brought up (or subsequently reset).
+
+When the PHY Layer brings up a PHY it checks to see if there are any fixups
+registered for it, matching based on UID (contained in the PHY device's phy_id
+field) and the bus identifier (contained in phydev->dev.bus_id). Both must
+match, however two constants, PHY_ANY_ID and PHY_ANY_UID, are provided as
+wildcards for the bus ID and UID, respectively.
+
+When a match is found, the PHY layer will invoke the run function associated
+with the fixup. This function is passed a pointer to the phy_device of
+interest. It should therefore only operate on that PHY.
+
+The platform code can either register the fixup using phy_register_fixup()::
+
+ int phy_register_fixup(const char *phy_id,
+ u32 phy_uid, u32 phy_uid_mask,
+ int (*run)(struct phy_device *));
+
+Or using one of the two stubs, phy_register_fixup_for_uid() and
+phy_register_fixup_for_id()::
+
+ int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
+ int (*run)(struct phy_device *));
+ int phy_register_fixup_for_id(const char *phy_id,
+ int (*run)(struct phy_device *));
+
+The stubs set one of the two matching criteria, and set the other one to
+match anything.
+
+When phy_register_fixup() or \*_for_uid()/\*_for_id() is called at module load
+time, the module needs to unregister the fixup and free allocated memory when
+it's unloaded.
+
+Call one of following function before unloading module::
+
+ int phy_unregister_fixup(const char *phy_id, u32 phy_uid, u32 phy_uid_mask);
+ int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask);
+ int phy_register_fixup_for_id(const char *phy_id);
+
+Standards
+=========
+
+IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer Specifications, Section Two:
+http://standards.ieee.org/getieee802/download/802.3-2008_section2.pdf
+
+RGMII v1.3:
+http://web.archive.org/web/20160303212629/http://www.hp.com/rnd/pdfs/RGMIIv1_3.pdf
+
+RGMII v2.0:
+http://web.archive.org/web/20160303171328/http://www.hp.com/rnd/pdfs/RGMIIv2_0_final_hp.pdf
diff --git a/Documentation/networking/pktgen.rst b/Documentation/networking/pktgen.rst
new file mode 100644
index 0000000000..1225f0f63f
--- /dev/null
+++ b/Documentation/networking/pktgen.rst
@@ -0,0 +1,410 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+HOWTO for the linux packet generator
+====================================
+
+Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel
+or as a module. A module is preferred; modprobe pktgen if needed. Once
+running, pktgen creates a thread for each CPU with affinity to that CPU.
+Monitoring and controlling is done via /proc. It is easiest to select a
+suitable sample script and configure that.
+
+On a dual CPU::
+
+ ps aux | grep pkt
+ root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0]
+ root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1]
+
+
+For monitoring and control pktgen creates::
+
+ /proc/net/pktgen/pgctrl
+ /proc/net/pktgen/kpktgend_X
+ /proc/net/pktgen/ethX
+
+
+Tuning NIC for max performance
+==============================
+
+The default NIC settings are (likely) not tuned for pktgen's artificial
+overload type of benchmarking, as this could hurt the normal use-case.
+
+Specifically increasing the TX ring buffer in the NIC::
+
+ # ethtool -G ethX tx 1024
+
+A larger TX ring can improve pktgen's performance, while it can hurt
+in the general case, 1) because the TX ring buffer might get larger
+than the CPU's L1/L2 cache, 2) because it allows more queueing in the
+NIC HW layer (which is bad for bufferbloat).
+
+One should hesitate to conclude that packets/descriptors in the HW
+TX ring cause delay. Drivers usually delay cleaning up the
+ring-buffers for various performance reasons, and packets stalling
+the TX ring might just be waiting for cleanup.
+
+This cleanup issue is specifically the case for the driver ixgbe
+(Intel 82599 chip). This driver (ixgbe) combines TX+RX ring cleanups,
+and the cleanup interval is affected by the ethtool --coalesce setting
+of parameter "rx-usecs".
+
+For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6)::
+
+ # ethtool -C ethX rx-usecs 30
+
+
+Kernel threads
+==============
+Pktgen creates a thread for each CPU with affinity to that CPU.
+Which is controlled through procfile /proc/net/pktgen/kpktgend_X.
+
+Example: /proc/net/pktgen/kpktgend_0::
+
+ Running:
+ Stopped: eth4@0
+ Result: OK: add_device=eth4@0
+
+Most important are the devices assigned to the thread.
+
+The two basic thread commands are:
+
+ * add_device DEVICE@NAME -- adds a single device
+ * rem_device_all -- remove all associated devices
+
+When adding a device to a thread, a corresponding procfile is created
+which is used for configuring this device. Thus, device names need to
+be unique.
+
+To support adding the same device to multiple threads, which is useful
+with multi queue NICs, the device naming scheme is extended with "@":
+device@something
+
+The part after "@" can be anything, but it is custom to use the thread
+number.
+
+Viewing devices
+===============
+
+The Params section holds configured information. The Current section
+holds running statistics. The Result is printed after a run or after
+interruption. Example::
+
+ /proc/net/pktgen/eth4@0
+
+ Params: count 100000 min_pkt_size: 60 max_pkt_size: 60
+ frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0
+ flows: 0 flowlen: 0
+ queue_map_min: 0 queue_map_max: 0
+ dst_min: 192.168.81.2 dst_max:
+ src_min: src_max:
+ src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8
+ udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9
+ src_mac_count: 0 dst_mac_count: 0
+ Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU
+ Current:
+ pkts-sofar: 100000 errors: 0
+ started: 623913381008us stopped: 623913396439us idle: 25us
+ seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
+ cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2
+ cur_udp_dst: 9 cur_udp_src: 42
+ cur_queue_map: 0
+ flows: 0
+ Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags)
+ 6480562pps 3110Mb/sec (3110669760bps) errors: 0
+
+
+Configuring devices
+===================
+This is done via the /proc interface, and most easily done via pgset
+as defined in the sample scripts.
+You need to specify PGDEV environment variable to use functions from sample
+scripts, i.e.::
+
+ export PGDEV=/proc/net/pktgen/eth4@0
+ source samples/pktgen/functions.sh
+
+Examples::
+
+ pg_ctrl start starts injection.
+ pg_ctrl stop aborts injection. Also, ^C aborts generator.
+
+ pgset "clone_skb 1" sets the number of copies of the same packet
+ pgset "clone_skb 0" use single SKB for all transmits
+ pgset "burst 8" uses xmit_more API to queue 8 copies of the same
+ packet and update HW tx queue tail pointer once.
+ "burst 1" is the default
+ pgset "pkt_size 9014" sets packet size to 9014
+ pgset "frags 5" packet will consist of 5 fragments
+ pgset "count 200000" sets number of packets to send, set to zero
+ for continuous sends until explicitly stopped.
+
+ pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds
+
+ pgset "dst 10.0.0.1" sets IP destination address
+ (BEWARE! This generator is very aggressive!)
+
+ pgset "dst_min 10.0.0.1" Same as dst
+ pgset "dst_max 10.0.0.254" Set the maximum destination IP.
+ pgset "src_min 10.0.0.1" Set the minimum (or only) source IP.
+ pgset "src_max 10.0.0.254" Set the maximum source IP.
+ pgset "dst6 fec0::1" IPV6 destination address
+ pgset "src6 fec0::2" IPV6 source address
+ pgset "dstmac 00:00:00:00:00:00" sets MAC destination address
+ pgset "srcmac 00:00:00:00:00:00" sets MAC source address
+
+ pgset "queue_map_min 0" Sets the min value of tx queue interval
+ pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices
+ To select queue 1 of a given device,
+ use queue_map_min=1 and queue_map_max=1
+
+ pgset "src_mac_count 1" Sets the number of MACs we'll range through.
+ The 'minimum' MAC is what you set with srcmac.
+
+ pgset "dst_mac_count 1" Sets the number of MACs we'll range through.
+ The 'minimum' MAC is what you set with dstmac.
+
+ pgset "flag [name]" Set a flag to determine behaviour. Current flags
+ are: IPSRC_RND # IP source is random (between min/max)
+ IPDST_RND # IP destination is random
+ UDPSRC_RND, UDPDST_RND,
+ MACSRC_RND, MACDST_RND
+ TXSIZE_RND, IPV6,
+ MPLS_RND, VID_RND, SVID_RND
+ FLOW_SEQ,
+ QUEUE_MAP_RND # queue map random
+ QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
+ UDPCSUM,
+ IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
+ NODE_ALLOC # node specific memory allocation
+ NO_TIMESTAMP # disable timestamping
+ pgset 'flag ![name]' Clear a flag to determine behaviour.
+ Note that you might need to use single quote in
+ interactive mode, so that your shell wouldn't expand
+ the specified flag as a history command.
+
+ pgset "spi [SPI_VALUE]" Set specific SA used to transform packet.
+
+ pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then
+ cycle through the port range.
+
+ pgset "udp_src_max 9" set UDP source port max.
+ pgset "udp_dst_min 9" set UDP destination port min, If < udp_dst_max, then
+ cycle through the port range.
+ pgset "udp_dst_max 9" set UDP destination port max.
+
+ pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example
+ outer label=16,middle label=32,
+ inner label=0 (IPv4 NULL)) Note that
+ there must be no spaces between the
+ arguments. Leading zeros are required.
+ Do not set the bottom of stack bit,
+ that's done automatically. If you do
+ set the bottom of stack bit, that
+ indicates that you want to randomly
+ generate that address and the flag
+ MPLS_RND will be turned on. You
+ can have any mix of random and fixed
+ labels in the label stack.
+
+ pgset "mpls 0" turn off mpls (or any invalid argument works too!)
+
+ pgset "vlan_id 77" set VLAN ID 0-4095
+ pgset "vlan_p 3" set priority bit 0-7 (default 0)
+ pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0)
+
+ pgset "svlan_id 22" set SVLAN ID 0-4095
+ pgset "svlan_p 3" set priority bit 0-7 (default 0)
+ pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0)
+
+ pgset "vlan_id 9999" > 4095 remove vlan and svlan tags
+ pgset "svlan 9999" > 4095 remove svlan tag
+
+
+ pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00)
+ pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00)
+
+ pgset "rate 300M" set rate to 300 Mb/s
+ pgset "ratep 1000000" set rate to 1Mpps
+
+ pgset "xmit_mode netif_receive" RX inject into stack netif_receive_skb()
+ Works with "burst" but not with "clone_skb".
+ Default xmit_mode is "start_xmit".
+
+Sample scripts
+==============
+
+A collection of tutorial scripts and helpers for pktgen is in the
+samples/pktgen directory. The helper parameters.sh file support easy
+and consistent parameter parsing across the sample scripts.
+
+Usage example and help::
+
+ ./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2
+
+Usage:::
+
+ ./pktgen_sample01_simple.sh [-vx] -i ethX
+
+ -i : ($DEV) output interface/device (required)
+ -s : ($PKT_SIZE) packet size
+ -d : ($DEST_IP) destination IP. CIDR (e.g. 198.18.0.0/15) is also allowed
+ -m : ($DST_MAC) destination MAC-addr
+ -p : ($DST_PORT) destination PORT range (e.g. 433-444) is also allowed
+ -t : ($THREADS) threads to start
+ -f : ($F_THREAD) index of first thread (zero indexed CPU number)
+ -c : ($SKB_CLONE) SKB clones send before alloc new SKB
+ -n : ($COUNT) num messages to send per thread, 0 means indefinitely
+ -b : ($BURST) HW level bursting of SKBs
+ -v : ($VERBOSE) verbose
+ -x : ($DEBUG) debug
+ -6 : ($IP6) IPv6
+ -w : ($DELAY) Tx Delay value (ns)
+ -a : ($APPEND) Script will not reset generator's state, but will append its config
+
+The global variables being set are also listed. E.g. the required
+interface/device parameter "-i" sets variable $DEV. Copy the
+pktgen_sampleXX scripts and modify them to fit your own needs.
+
+
+Interrupt affinity
+===================
+Note that when adding devices to a specific CPU it is a good idea to
+also assign /proc/irq/XX/smp_affinity so that the TX interrupts are bound
+to the same CPU. This reduces cache bouncing when freeing skbs.
+
+Plus using the device flag QUEUE_MAP_CPU, which maps the SKBs TX queue
+to the running threads CPU (directly from smp_processor_id()).
+
+Enable IPsec
+============
+Default IPsec transformation with ESP encapsulation plus transport mode
+can be enabled by simply setting::
+
+ pgset "flag IPSEC"
+ pgset "flows 1"
+
+To avoid breaking existing testbed scripts for using AH type and tunnel mode,
+you can use "pgset spi SPI_VALUE" to specify which transformation mode
+to employ.
+
+
+Current commands and configuration options
+==========================================
+
+**Pgcontrol commands**::
+
+ start
+ stop
+ reset
+
+**Thread commands**::
+
+ add_device
+ rem_device_all
+
+
+**Device commands**::
+
+ count
+ clone_skb
+ burst
+ debug
+
+ frags
+ delay
+
+ src_mac_count
+ dst_mac_count
+
+ pkt_size
+ min_pkt_size
+ max_pkt_size
+
+ queue_map_min
+ queue_map_max
+ skb_priority
+
+ tos (ipv4)
+ traffic_class (ipv6)
+
+ mpls
+
+ udp_src_min
+ udp_src_max
+
+ udp_dst_min
+ udp_dst_max
+
+ node
+
+ flag
+ IPSRC_RND
+ IPDST_RND
+ UDPSRC_RND
+ UDPDST_RND
+ MACSRC_RND
+ MACDST_RND
+ TXSIZE_RND
+ IPV6
+ MPLS_RND
+ VID_RND
+ SVID_RND
+ FLOW_SEQ
+ QUEUE_MAP_RND
+ QUEUE_MAP_CPU
+ UDPCSUM
+ IPSEC
+ NODE_ALLOC
+ NO_TIMESTAMP
+
+ spi (ipsec)
+
+ dst_min
+ dst_max
+
+ src_min
+ src_max
+
+ dst_mac
+ src_mac
+
+ clear_counters
+
+ src6
+ dst6
+ dst6_max
+ dst6_min
+
+ flows
+ flowlen
+
+ rate
+ ratep
+
+ xmit_mode <start_xmit|netif_receive>
+
+ vlan_cfi
+ vlan_id
+ vlan_p
+
+ svlan_cfi
+ svlan_id
+ svlan_p
+
+
+References:
+
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
+
+Paper from Linux-Kongress in Erlangen 2004.
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
+
+Thanks to:
+
+Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek
+Stephen Hemminger, Andi Kleen, Dave Miller and many others.
+
+
+Good luck with the linux net-development.
diff --git a/Documentation/networking/plip.rst b/Documentation/networking/plip.rst
new file mode 100644
index 0000000000..0eda745050
--- /dev/null
+++ b/Documentation/networking/plip.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================
+PLIP: The Parallel Line Internet Protocol Device
+================================================
+
+Donald Becker (becker@super.org)
+I.D.A. Supercomputing Research Center, Bowie MD 20715
+
+At some point T. Thorn will probably contribute text,
+Tommy Thorn (tthorn@daimi.aau.dk)
+
+PLIP Introduction
+-----------------
+
+This document describes the parallel port packet pusher for Net/LGX.
+This device interface allows a point-to-point connection between two
+parallel ports to appear as a IP network interface.
+
+What is PLIP?
+=============
+
+PLIP is Parallel Line IP, that is, the transportation of IP packages
+over a parallel port. In the case of a PC, the obvious choice is the
+printer port. PLIP is a non-standard, but [can use] uses the standard
+LapLink null-printer cable [can also work in turbo mode, with a PLIP
+cable]. [The protocol used to pack IP packages, is a simple one
+initiated by Crynwr.]
+
+Advantages of PLIP
+==================
+
+It's cheap, it's available everywhere, and it's easy.
+
+The PLIP cable is all that's needed to connect two Linux boxes, and it
+can be built for very few bucks.
+
+Connecting two Linux boxes takes only a second's decision and a few
+minutes' work, no need to search for a [supported] netcard. This might
+even be especially important in the case of notebooks, where netcards
+are not easily available.
+
+Not requiring a netcard also means that apart from connecting the
+cables, everything else is software configuration [which in principle
+could be made very easy.]
+
+Disadvantages of PLIP
+=====================
+
+Doesn't work over a modem, like SLIP and PPP. Limited range, 15 m.
+Can only be used to connect three (?) Linux boxes. Doesn't connect to
+an existing Ethernet. Isn't standard (not even de facto standard, like
+SLIP).
+
+Performance
+===========
+
+PLIP easily outperforms Ethernet cards....(ups, I was dreaming, but
+it *is* getting late. EOB)
+
+PLIP driver details
+-------------------
+
+The Linux PLIP driver is an implementation of the original Crynwr protocol,
+that uses the parallel port subsystem of the kernel in order to properly
+share parallel ports between PLIP and other services.
+
+IRQs and trigger timeouts
+=========================
+
+When a parallel port used for a PLIP driver has an IRQ configured to it, the
+PLIP driver is signaled whenever data is sent to it via the cable, such that
+when no data is available, the driver isn't being used.
+
+However, on some machines it is hard, if not impossible, to configure an IRQ
+to a certain parallel port, mainly because it is used by some other device.
+On these machines, the PLIP driver can be used in IRQ-less mode, where
+the PLIP driver would constantly poll the parallel port for data waiting,
+and if such data is available, process it. This mode is less efficient than
+the IRQ mode, because the driver has to check the parallel port many times
+per second, even when no data at all is sent. Some rough measurements
+indicate that there isn't a noticeable performance drop when using IRQ-less
+mode as compared to IRQ mode as far as the data transfer speed is involved.
+There is a performance drop on the machine hosting the driver.
+
+When the PLIP driver is used in IRQ mode, the timeout used for triggering a
+data transfer (the maximal time the PLIP driver would allow the other side
+before announcing a timeout, when trying to handshake a transfer of some
+data) is, by default, 500usec. As IRQ delivery is more or less immediate,
+this timeout is quite sufficient.
+
+When in IRQ-less mode, the PLIP driver polls the parallel port HZ times
+per second (where HZ is typically 100 on most platforms, and 1024 on an
+Alpha, as of this writing). Between two such polls, there are 10^6/HZ usecs.
+On an i386, for example, 10^6/100 = 10000usec. It is easy to see that it is
+quite possible for the trigger timeout to expire between two such polls, as
+the timeout is only 500usec long. As a result, it is required to change the
+trigger timeout on the *other* side of a PLIP connection, to about
+10^6/HZ usecs. If both sides of a PLIP connection are used in IRQ-less mode,
+this timeout is required on both sides.
+
+It appears that in practice, the trigger timeout can be shorter than in the
+above calculation. It isn't an important issue, unless the wire is faulty,
+in which case a long timeout would stall the machine when, for whatever
+reason, bits are dropped.
+
+A utility that can perform this change in Linux is plipconfig, which is part
+of the net-tools package (its location can be found in the
+Documentation/Changes file). An example command would be
+'plipconfig plipX trigger 10000', where plipX is the appropriate
+PLIP device.
+
+PLIP hardware interconnection
+-----------------------------
+
+PLIP uses several different data transfer methods. The first (and the
+only one implemented in the early version of the code) uses a standard
+printer "null" cable to transfer data four bits at a time using
+data bit outputs connected to status bit inputs.
+
+The second data transfer method relies on both machines having
+bi-directional parallel ports, rather than output-only ``printer``
+ports. This allows byte-wide transfers and avoids reconstructing
+nibbles into bytes, leading to much faster transfers.
+
+Parallel Transfer Mode 0 Cable
+==============================
+
+The cable for the first transfer mode is a standard
+printer "null" cable which transfers data four bits at a time using
+data bit outputs of the first port (machine T) connected to the
+status bit inputs of the second port (machine R). There are five
+status inputs, and they are used as four data inputs and a clock (data
+strobe) input, arranged so that the data input bits appear as contiguous
+bits with standard status register implementation.
+
+A cable that implements this protocol is available commercially as a
+"Null Printer" or "Turbo Laplink" cable. It can be constructed with
+two DB-25 male connectors symmetrically connected as follows::
+
+ STROBE output 1*
+ D0->ERROR 2 - 15 15 - 2
+ D1->SLCT 3 - 13 13 - 3
+ D2->PAPOUT 4 - 12 12 - 4
+ D3->ACK 5 - 10 10 - 5
+ D4->BUSY 6 - 11 11 - 6
+ D5,D6,D7 are 7*, 8*, 9*
+ AUTOFD output 14*
+ INIT output 16*
+ SLCTIN 17 - 17
+ extra grounds are 18*,19*,20*,21*,22*,23*,24*
+ GROUND 25 - 25
+
+ * Do not connect these pins on either end
+
+If the cable you are using has a metallic shield it should be
+connected to the metallic DB-25 shell at one end only.
+
+Parallel Transfer Mode 1
+========================
+
+The second data transfer method relies on both machines having
+bi-directional parallel ports, rather than output-only ``printer``
+ports. This allows byte-wide transfers, and avoids reconstructing
+nibbles into bytes. This cable should not be used on unidirectional
+``printer`` (as opposed to ``parallel``) ports or when the machine
+isn't configured for PLIP, as it will result in output driver
+conflicts and the (unlikely) possibility of damage.
+
+The cable for this transfer mode should be constructed as follows::
+
+ STROBE->BUSY 1 - 11
+ D0->D0 2 - 2
+ D1->D1 3 - 3
+ D2->D2 4 - 4
+ D3->D3 5 - 5
+ D4->D4 6 - 6
+ D5->D5 7 - 7
+ D6->D6 8 - 8
+ D7->D7 9 - 9
+ INIT -> ACK 16 - 10
+ AUTOFD->PAPOUT 14 - 12
+ SLCT->SLCTIN 13 - 17
+ GND->ERROR 18 - 15
+ extra grounds are 19*,20*,21*,22*,23*,24*
+ GROUND 25 - 25
+
+ * Do not connect these pins on either end
+
+Once again, if the cable you are using has a metallic shield it should
+be connected to the metallic DB-25 shell at one end only.
+
+PLIP Mode 0 transfer protocol
+=============================
+
+The PLIP driver is compatible with the "Crynwr" parallel port transfer
+standard in Mode 0. That standard specifies the following protocol::
+
+ send header nibble '0x8'
+ count-low octet
+ count-high octet
+ ... data octets
+ checksum octet
+
+Each octet is sent as::
+
+ <wait for rx. '0x1?'> <send 0x10+(octet&0x0F)>
+ <wait for rx. '0x0?'> <send 0x00+((octet>>4)&0x0F)>
+
+To start a transfer the transmitting machine outputs a nibble 0x08.
+That raises the ACK line, triggering an interrupt in the receiving
+machine. The receiving machine disables interrupts and raises its own ACK
+line.
+
+Restated::
+
+ (OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise)
+ Send_Byte:
+ OUT := low nibble, OUT.4 := 1
+ WAIT FOR IN.4 = 1
+ OUT := high nibble, OUT.4 := 0
+ WAIT FOR IN.4 = 0
diff --git a/Documentation/networking/ppp_generic.rst b/Documentation/networking/ppp_generic.rst
new file mode 100644
index 0000000000..5a10abce59
--- /dev/null
+++ b/Documentation/networking/ppp_generic.rst
@@ -0,0 +1,456 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+PPP Generic Driver and Channel Interface
+========================================
+
+ Paul Mackerras
+ paulus@samba.org
+
+ 7 Feb 2002
+
+The generic PPP driver in linux-2.4 provides an implementation of the
+functionality which is of use in any PPP implementation, including:
+
+* the network interface unit (ppp0 etc.)
+* the interface to the networking code
+* PPP multilink: splitting datagrams between multiple links, and
+ ordering and combining received fragments
+* the interface to pppd, via a /dev/ppp character device
+* packet compression and decompression
+* TCP/IP header compression and decompression
+* detecting network traffic for demand dialling and for idle timeouts
+* simple packet filtering
+
+For sending and receiving PPP frames, the generic PPP driver calls on
+the services of PPP ``channels``. A PPP channel encapsulates a
+mechanism for transporting PPP frames from one machine to another. A
+PPP channel implementation can be arbitrarily complex internally but
+has a very simple interface with the generic PPP code: it merely has
+to be able to send PPP frames, receive PPP frames, and optionally
+handle ioctl requests. Currently there are PPP channel
+implementations for asynchronous serial ports, synchronous serial
+ports, and for PPP over ethernet.
+
+This architecture makes it possible to implement PPP multilink in a
+natural and straightforward way, by allowing more than one channel to
+be linked to each ppp network interface unit. The generic layer is
+responsible for splitting datagrams on transmit and recombining them
+on receive.
+
+
+PPP channel API
+---------------
+
+See include/linux/ppp_channel.h for the declaration of the types and
+functions used to communicate between the generic PPP layer and PPP
+channels.
+
+Each channel has to provide two functions to the generic PPP layer,
+via the ppp_channel.ops pointer:
+
+* start_xmit() is called by the generic layer when it has a frame to
+ send. The channel has the option of rejecting the frame for
+ flow-control reasons. In this case, start_xmit() should return 0
+ and the channel should call the ppp_output_wakeup() function at a
+ later time when it can accept frames again, and the generic layer
+ will then attempt to retransmit the rejected frame(s). If the frame
+ is accepted, the start_xmit() function should return 1.
+
+* ioctl() provides an interface which can be used by a user-space
+ program to control aspects of the channel's behaviour. This
+ procedure will be called when a user-space program does an ioctl
+ system call on an instance of /dev/ppp which is bound to the
+ channel. (Usually it would only be pppd which would do this.)
+
+The generic PPP layer provides seven functions to channels:
+
+* ppp_register_channel() is called when a channel has been created, to
+ notify the PPP generic layer of its presence. For example, setting
+ a serial port to the PPPDISC line discipline causes the ppp_async
+ channel code to call this function.
+
+* ppp_unregister_channel() is called when a channel is to be
+ destroyed. For example, the ppp_async channel code calls this when
+ a hangup is detected on the serial port.
+
+* ppp_output_wakeup() is called by a channel when it has previously
+ rejected a call to its start_xmit function, and can now accept more
+ packets.
+
+* ppp_input() is called by a channel when it has received a complete
+ PPP frame.
+
+* ppp_input_error() is called by a channel when it has detected that a
+ frame has been lost or dropped (for example, because of a FCS (frame
+ check sequence) error).
+
+* ppp_channel_index() returns the channel index assigned by the PPP
+ generic layer to this channel. The channel should provide some way
+ (e.g. an ioctl) to transmit this back to user-space, as user-space
+ will need it to attach an instance of /dev/ppp to this channel.
+
+* ppp_unit_number() returns the unit number of the ppp network
+ interface to which this channel is connected, or -1 if the channel
+ is not connected.
+
+Connecting a channel to the ppp generic layer is initiated from the
+channel code, rather than from the generic layer. The channel is
+expected to have some way for a user-level process to control it
+independently of the ppp generic layer. For example, with the
+ppp_async channel, this is provided by the file descriptor to the
+serial port.
+
+Generally a user-level process will initialize the underlying
+communications medium and prepare it to do PPP. For example, with an
+async tty, this can involve setting the tty speed and modes, issuing
+modem commands, and then going through some sort of dialog with the
+remote system to invoke PPP service there. We refer to this process
+as ``discovery``. Then the user-level process tells the medium to
+become a PPP channel and register itself with the generic PPP layer.
+The channel then has to report the channel number assigned to it back
+to the user-level process. From that point, the PPP negotiation code
+in the PPP daemon (pppd) can take over and perform the PPP
+negotiation, accessing the channel through the /dev/ppp interface.
+
+At the interface to the PPP generic layer, PPP frames are stored in
+skbuff structures and start with the two-byte PPP protocol number.
+The frame does *not* include the 0xff ``address`` byte or the 0x03
+``control`` byte that are optionally used in async PPP. Nor is there
+any escaping of control characters, nor are there any FCS or framing
+characters included. That is all the responsibility of the channel
+code, if it is needed for the particular medium. That is, the skbuffs
+presented to the start_xmit() function contain only the 2-byte
+protocol number and the data, and the skbuffs presented to ppp_input()
+must be in the same format.
+
+The channel must provide an instance of a ppp_channel struct to
+represent the channel. The channel is free to use the ``private`` field
+however it wishes. The channel should initialize the ``mtu`` and
+``hdrlen`` fields before calling ppp_register_channel() and not change
+them until after ppp_unregister_channel() returns. The ``mtu`` field
+represents the maximum size of the data part of the PPP frames, that
+is, it does not include the 2-byte protocol number.
+
+If the channel needs some headroom in the skbuffs presented to it for
+transmission (i.e., some space free in the skbuff data area before the
+start of the PPP frame), it should set the ``hdrlen`` field of the
+ppp_channel struct to the amount of headroom required. The generic
+PPP layer will attempt to provide that much headroom but the channel
+should still check if there is sufficient headroom and copy the skbuff
+if there isn't.
+
+On the input side, channels should ideally provide at least 2 bytes of
+headroom in the skbuffs presented to ppp_input(). The generic PPP
+code does not require this but will be more efficient if this is done.
+
+
+Buffering and flow control
+--------------------------
+
+The generic PPP layer has been designed to minimize the amount of data
+that it buffers in the transmit direction. It maintains a queue of
+transmit packets for the PPP unit (network interface device) plus a
+queue of transmit packets for each attached channel. Normally the
+transmit queue for the unit will contain at most one packet; the
+exceptions are when pppd sends packets by writing to /dev/ppp, and
+when the core networking code calls the generic layer's start_xmit()
+function with the queue stopped, i.e. when the generic layer has
+called netif_stop_queue(), which only happens on a transmit timeout.
+The start_xmit function always accepts and queues the packet which it
+is asked to transmit.
+
+Transmit packets are dequeued from the PPP unit transmit queue and
+then subjected to TCP/IP header compression and packet compression
+(Deflate or BSD-Compress compression), as appropriate. After this
+point the packets can no longer be reordered, as the decompression
+algorithms rely on receiving compressed packets in the same order that
+they were generated.
+
+If multilink is not in use, this packet is then passed to the attached
+channel's start_xmit() function. If the channel refuses to take
+the packet, the generic layer saves it for later transmission. The
+generic layer will call the channel's start_xmit() function again
+when the channel calls ppp_output_wakeup() or when the core
+networking code calls the generic layer's start_xmit() function
+again. The generic layer contains no timeout and retransmission
+logic; it relies on the core networking code for that.
+
+If multilink is in use, the generic layer divides the packet into one
+or more fragments and puts a multilink header on each fragment. It
+decides how many fragments to use based on the length of the packet
+and the number of channels which are potentially able to accept a
+fragment at the moment. A channel is potentially able to accept a
+fragment if it doesn't have any fragments currently queued up for it
+to transmit. The channel may still refuse a fragment; in this case
+the fragment is queued up for the channel to transmit later. This
+scheme has the effect that more fragments are given to higher-
+bandwidth channels. It also means that under light load, the generic
+layer will tend to fragment large packets across all the channels,
+thus reducing latency, while under heavy load, packets will tend to be
+transmitted as single fragments, thus reducing the overhead of
+fragmentation.
+
+
+SMP safety
+----------
+
+The PPP generic layer has been designed to be SMP-safe. Locks are
+used around accesses to the internal data structures where necessary
+to ensure their integrity. As part of this, the generic layer
+requires that the channels adhere to certain requirements and in turn
+provides certain guarantees to the channels. Essentially the channels
+are required to provide the appropriate locking on the ppp_channel
+structures that form the basis of the communication between the
+channel and the generic layer. This is because the channel provides
+the storage for the ppp_channel structure, and so the channel is
+required to provide the guarantee that this storage exists and is
+valid at the appropriate times.
+
+The generic layer requires these guarantees from the channel:
+
+* The ppp_channel object must exist from the time that
+ ppp_register_channel() is called until after the call to
+ ppp_unregister_channel() returns.
+
+* No thread may be in a call to any of ppp_input(), ppp_input_error(),
+ ppp_output_wakeup(), ppp_channel_index() or ppp_unit_number() for a
+ channel at the time that ppp_unregister_channel() is called for that
+ channel.
+
+* ppp_register_channel() and ppp_unregister_channel() must be called
+ from process context, not interrupt or softirq/BH context.
+
+* The remaining generic layer functions may be called at softirq/BH
+ level but must not be called from a hardware interrupt handler.
+
+* The generic layer may call the channel start_xmit() function at
+ softirq/BH level but will not call it at interrupt level. Thus the
+ start_xmit() function may not block.
+
+* The generic layer will only call the channel ioctl() function in
+ process context.
+
+The generic layer provides these guarantees to the channels:
+
+* The generic layer will not call the start_xmit() function for a
+ channel while any thread is already executing in that function for
+ that channel.
+
+* The generic layer will not call the ioctl() function for a channel
+ while any thread is already executing in that function for that
+ channel.
+
+* By the time a call to ppp_unregister_channel() returns, no thread
+ will be executing in a call from the generic layer to that channel's
+ start_xmit() or ioctl() function, and the generic layer will not
+ call either of those functions subsequently.
+
+
+Interface to pppd
+-----------------
+
+The PPP generic layer exports a character device interface called
+/dev/ppp. This is used by pppd to control PPP interface units and
+channels. Although there is only one /dev/ppp, each open instance of
+/dev/ppp acts independently and can be attached either to a PPP unit
+or a PPP channel. This is achieved using the file->private_data field
+to point to a separate object for each open instance of /dev/ppp. In
+this way an effect similar to Solaris' clone open is obtained,
+allowing us to control an arbitrary number of PPP interfaces and
+channels without having to fill up /dev with hundreds of device names.
+
+When /dev/ppp is opened, a new instance is created which is initially
+unattached. Using an ioctl call, it can then be attached to an
+existing unit, attached to a newly-created unit, or attached to an
+existing channel. An instance attached to a unit can be used to send
+and receive PPP control frames, using the read() and write() system
+calls, along with poll() if necessary. Similarly, an instance
+attached to a channel can be used to send and receive PPP frames on
+that channel.
+
+In multilink terms, the unit represents the bundle, while the channels
+represent the individual physical links. Thus, a PPP frame sent by a
+write to the unit (i.e., to an instance of /dev/ppp attached to the
+unit) will be subject to bundle-level compression and to fragmentation
+across the individual links (if multilink is in use). In contrast, a
+PPP frame sent by a write to the channel will be sent as-is on that
+channel, without any multilink header.
+
+A channel is not initially attached to any unit. In this state it can
+be used for PPP negotiation but not for the transfer of data packets.
+It can then be connected to a PPP unit with an ioctl call, which
+makes it available to send and receive data packets for that unit.
+
+The ioctl calls which are available on an instance of /dev/ppp depend
+on whether it is unattached, attached to a PPP interface, or attached
+to a PPP channel. The ioctl calls which are available on an
+unattached instance are:
+
+* PPPIOCNEWUNIT creates a new PPP interface and makes this /dev/ppp
+ instance the "owner" of the interface. The argument should point to
+ an int which is the desired unit number if >= 0, or -1 to assign the
+ lowest unused unit number. Being the owner of the interface means
+ that the interface will be shut down if this instance of /dev/ppp is
+ closed.
+
+* PPPIOCATTACH attaches this instance to an existing PPP interface.
+ The argument should point to an int containing the unit number.
+ This does not make this instance the owner of the PPP interface.
+
+* PPPIOCATTCHAN attaches this instance to an existing PPP channel.
+ The argument should point to an int containing the channel number.
+
+The ioctl calls available on an instance of /dev/ppp attached to a
+channel are:
+
+* PPPIOCCONNECT connects this channel to a PPP interface. The
+ argument should point to an int containing the interface unit
+ number. It will return an EINVAL error if the channel is already
+ connected to an interface, or ENXIO if the requested interface does
+ not exist.
+
+* PPPIOCDISCONN disconnects this channel from the PPP interface that
+ it is connected to. It will return an EINVAL error if the channel
+ is not connected to an interface.
+
+* PPPIOCBRIDGECHAN bridges a channel with another. The argument should
+ point to an int containing the channel number of the channel to bridge
+ to. Once two channels are bridged, frames presented to one channel by
+ ppp_input() are passed to the bridge instance for onward transmission.
+ This allows frames to be switched from one channel into another: for
+ example, to pass PPPoE frames into a PPPoL2TP session. Since channel
+ bridging interrupts the normal ppp_input() path, a given channel may
+ not be part of a bridge at the same time as being part of a unit.
+ This ioctl will return an EALREADY error if the channel is already
+ part of a bridge or unit, or ENXIO if the requested channel does not
+ exist.
+
+* PPPIOCUNBRIDGECHAN performs the inverse of PPPIOCBRIDGECHAN, unbridging
+ a channel pair. This ioctl will return an EINVAL error if the channel
+ does not form part of a bridge.
+
+* All other ioctl commands are passed to the channel ioctl() function.
+
+The ioctl calls that are available on an instance that is attached to
+an interface unit are:
+
+* PPPIOCSMRU sets the MRU (maximum receive unit) for the interface.
+ The argument should point to an int containing the new MRU value.
+
+* PPPIOCSFLAGS sets flags which control the operation of the
+ interface. The argument should be a pointer to an int containing
+ the new flags value. The bits in the flags value that can be set
+ are:
+
+ ================ ========================================
+ SC_COMP_TCP enable transmit TCP header compression
+ SC_NO_TCP_CCID disable connection-id compression for
+ TCP header compression
+ SC_REJ_COMP_TCP disable receive TCP header decompression
+ SC_CCP_OPEN Compression Control Protocol (CCP) is
+ open, so inspect CCP packets
+ SC_CCP_UP CCP is up, may (de)compress packets
+ SC_LOOP_TRAFFIC send IP traffic to pppd
+ SC_MULTILINK enable PPP multilink fragmentation on
+ transmitted packets
+ SC_MP_SHORTSEQ expect short multilink sequence
+ numbers on received multilink fragments
+ SC_MP_XSHORTSEQ transmit short multilink sequence nos.
+ ================ ========================================
+
+ The values of these flags are defined in <linux/ppp-ioctl.h>. Note
+ that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and
+ SC_MP_XSHORTSEQ bits are ignored if the CONFIG_PPP_MULTILINK option
+ is not selected.
+
+* PPPIOCGFLAGS returns the value of the status/control flags for the
+ interface unit. The argument should point to an int where the ioctl
+ will store the flags value. As well as the values listed above for
+ PPPIOCSFLAGS, the following bits may be set in the returned value:
+
+ ================ =========================================
+ SC_COMP_RUN CCP compressor is running
+ SC_DECOMP_RUN CCP decompressor is running
+ SC_DC_ERROR CCP decompressor detected non-fatal error
+ SC_DC_FERROR CCP decompressor detected fatal error
+ ================ =========================================
+
+* PPPIOCSCOMPRESS sets the parameters for packet compression or
+ decompression. The argument should point to a ppp_option_data
+ structure (defined in <linux/ppp-ioctl.h>), which contains a
+ pointer/length pair which should describe a block of memory
+ containing a CCP option specifying a compression method and its
+ parameters. The ppp_option_data struct also contains a ``transmit``
+ field. If this is 0, the ioctl will affect the receive path,
+ otherwise the transmit path.
+
+* PPPIOCGUNIT returns, in the int pointed to by the argument, the unit
+ number of this interface unit.
+
+* PPPIOCSDEBUG sets the debug flags for the interface to the value in
+ the int pointed to by the argument. Only the least significant bit
+ is used; if this is 1 the generic layer will print some debug
+ messages during its operation. This is only intended for debugging
+ the generic PPP layer code; it is generally not helpful for working
+ out why a PPP connection is failing.
+
+* PPPIOCGDEBUG returns the debug flags for the interface in the int
+ pointed to by the argument.
+
+* PPPIOCGIDLE returns the time, in seconds, since the last data
+ packets were sent and received. The argument should point to a
+ ppp_idle structure (defined in <linux/ppp_defs.h>). If the
+ CONFIG_PPP_FILTER option is enabled, the set of packets which reset
+ the transmit and receive idle timers is restricted to those which
+ pass the ``active`` packet filter.
+ Two versions of this command exist, to deal with user space
+ expecting times as either 32-bit or 64-bit time_t seconds.
+
+* PPPIOCSMAXCID sets the maximum connection-ID parameter (and thus the
+ number of connection slots) for the TCP header compressor and
+ decompressor. The lower 16 bits of the int pointed to by the
+ argument specify the maximum connection-ID for the compressor. If
+ the upper 16 bits of that int are non-zero, they specify the maximum
+ connection-ID for the decompressor, otherwise the decompressor's
+ maximum connection-ID is set to 15.
+
+* PPPIOCSNPMODE sets the network-protocol mode for a given network
+ protocol. The argument should point to an npioctl struct (defined
+ in <linux/ppp-ioctl.h>). The ``protocol`` field gives the PPP protocol
+ number for the protocol to be affected, and the ``mode`` field
+ specifies what to do with packets for that protocol:
+
+ ============= ==============================================
+ NPMODE_PASS normal operation, transmit and receive packets
+ NPMODE_DROP silently drop packets for this protocol
+ NPMODE_ERROR drop packets and return an error on transmit
+ NPMODE_QUEUE queue up packets for transmit, drop received
+ packets
+ ============= ==============================================
+
+ At present NPMODE_ERROR and NPMODE_QUEUE have the same effect as
+ NPMODE_DROP.
+
+* PPPIOCGNPMODE returns the network-protocol mode for a given
+ protocol. The argument should point to an npioctl struct with the
+ ``protocol`` field set to the PPP protocol number for the protocol of
+ interest. On return the ``mode`` field will be set to the network-
+ protocol mode for that protocol.
+
+* PPPIOCSPASS and PPPIOCSACTIVE set the ``pass`` and ``active`` packet
+ filters. These ioctls are only available if the CONFIG_PPP_FILTER
+ option is selected. The argument should point to a sock_fprog
+ structure (defined in <linux/filter.h>) containing the compiled BPF
+ instructions for the filter. Packets are dropped if they fail the
+ ``pass`` filter; otherwise, if they fail the ``active`` filter they are
+ passed but they do not reset the transmit or receive idle timer.
+
+* PPPIOCSMRRU enables or disables multilink processing for received
+ packets and sets the multilink MRRU (maximum reconstructed receive
+ unit). The argument should point to an int containing the new MRRU
+ value. If the MRRU value is 0, processing of received multilink
+ fragments is disabled. This ioctl is only available if the
+ CONFIG_PPP_MULTILINK option is selected.
+
+Last modified: 7-feb-2002
diff --git a/Documentation/networking/proc_net_tcp.rst b/Documentation/networking/proc_net_tcp.rst
new file mode 100644
index 0000000000..7d9dfe36af
--- /dev/null
+++ b/Documentation/networking/proc_net_tcp.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================
+The proc/net/tcp and proc/net/tcp6 variables
+============================================
+
+This document describes the interfaces /proc/net/tcp and /proc/net/tcp6.
+Note that these interfaces are deprecated in favor of tcp_diag.
+
+These /proc interfaces provide information about currently active TCP
+connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c
+and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively.
+
+It will first list all listening TCP sockets, and next list all established
+TCP connections. A typical entry of /proc/net/tcp would look like this (split
+up into 3 parts because of the length of the line)::
+
+ 46: 010310AC:9C4C 030310AC:1770 01
+ | | | | | |--> connection state
+ | | | | |------> remote TCP port number
+ | | | |-------------> remote IPv4 address
+ | | |--------------------> local TCP port number
+ | |---------------------------> local IPv4 address
+ |----------------------------------> number of entry
+
+ 00000150:00000000 01:00000019 00000000
+ | | | | |--> number of unrecovered RTO timeouts
+ | | | |----------> number of jiffies until timer expires
+ | | |----------------> timer_active (see below)
+ | |----------------------> receive-queue
+ |-------------------------------> transmit-queue
+
+ 1000 0 54165785 4 cd1e6040 25 4 27 3 -1
+ | | | | | | | | | |--> slow start size threshold,
+ | | | | | | | | | or -1 if the threshold
+ | | | | | | | | | is >= 0xFFFF
+ | | | | | | | | |----> sending congestion window
+ | | | | | | | |-------> (ack.quick<<1)|ack.pingpong
+ | | | | | | |---------> Predicted tick of soft clock
+ | | | | | | (delayed ACK control data)
+ | | | | | |------------> retransmit timeout
+ | | | | |------------------> location of socket in memory
+ | | | |-----------------------> socket reference count
+ | | |-----------------------------> inode
+ | |----------------------------------> unanswered 0-window probes
+ |---------------------------------------------> uid
+
+timer_active:
+
+ == ================================================================
+ 0 no timer is pending
+ 1 retransmit-timer is pending
+ 2 another timer (e.g. delayed ack or keepalive) is pending
+ 3 this is a socket in TIME_WAIT state. Not all fields will contain
+ data (or even exist)
+ 4 zero window probe timer is pending
+ == ================================================================
diff --git a/Documentation/networking/radiotap-headers.rst b/Documentation/networking/radiotap-headers.rst
new file mode 100644
index 0000000000..1a1bd1ec06
--- /dev/null
+++ b/Documentation/networking/radiotap-headers.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+How to use radiotap headers
+===========================
+
+Pointer to the radiotap include file
+------------------------------------
+
+Radiotap headers are variable-length and extensible, you can get most of the
+information you need to know on them from::
+
+ ./include/net/ieee80211_radiotap.h
+
+This document gives an overview and warns on some corner cases.
+
+
+Structure of the header
+-----------------------
+
+There is a fixed portion at the start which contains a u32 bitmap that defines
+if the possible argument associated with that bit is present or not. So if b0
+of the it_present member of ieee80211_radiotap_header is set, it means that
+the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the
+argument area.
+
+::
+
+ < 8-byte ieee80211_radiotap_header >
+ [ <possible argument bitmap extensions ... > ]
+ [ <argument> ... ]
+
+At the moment there are only 13 possible argument indexes defined, but in case
+we run out of space in the u32 it_present member, it is defined that b31 set
+indicates that there is another u32 bitmap following (shown as "possible
+argument bitmap extensions..." above), and the start of the arguments is moved
+forward 4 bytes each time.
+
+Note also that the it_len member __le16 is set to the total number of bytes
+covered by the ieee80211_radiotap_header and any arguments following.
+
+
+Requirements for arguments
+--------------------------
+
+After the fixed part of the header, the arguments follow for each argument
+index whose matching bit is set in the it_present member of
+ieee80211_radiotap_header.
+
+ - the arguments are all stored little-endian!
+
+ - the argument payload for a given argument index has a fixed size. So
+ IEEE80211_RADIOTAP_TSFT being present always indicates an 8-byte argument is
+ present. See the comments in ./include/net/ieee80211_radiotap.h for a nice
+ breakdown of all the argument sizes
+
+ - the arguments must be aligned to a boundary of the argument size using
+ padding. So a u16 argument must start on the next u16 boundary if it isn't
+ already on one, a u32 must start on the next u32 boundary and so on.
+
+ - "alignment" is relative to the start of the ieee80211_radiotap_header, ie,
+ the first byte of the radiotap header. The absolute alignment of that first
+ byte isn't defined. So even if the whole radiotap header is starting at, eg,
+ address 0x00000003, still the first byte of the radiotap header is treated as
+ 0 for alignment purposes.
+
+ - the above point that there may be no absolute alignment for multibyte
+ entities in the fixed radiotap header or the argument region means that you
+ have to take special evasive action when trying to access these multibyte
+ entities. Some arches like Blackfin cannot deal with an attempt to
+ dereference, eg, a u16 pointer that is pointing to an odd address. Instead
+ you have to use a kernel API get_unaligned() to dereference the pointer,
+ which will do it bytewise on the arches that require that.
+
+ - The arguments for a given argument index can be a compound of multiple types
+ together. For example IEEE80211_RADIOTAP_CHANNEL has an argument payload
+ consisting of two u16s of total length 4. When this happens, the padding
+ rule is applied dealing with a u16, NOT dealing with a 4-byte single entity.
+
+
+Example valid radiotap header
+-----------------------------
+
+::
+
+ 0x00, 0x00, // <-- radiotap version + pad byte
+ 0x0b, 0x00, // <- radiotap header length
+ 0x04, 0x0c, 0x00, 0x00, // <-- bitmap
+ 0x6c, // <-- rate (in 500kHz units)
+ 0x0c, //<-- tx power
+ 0x01 //<-- antenna
+
+
+Using the Radiotap Parser
+-------------------------
+
+If you are having to parse a radiotap struct, you can radically simplify the
+job by using the radiotap parser that lives in net/wireless/radiotap.c and has
+its prototypes available in include/net/cfg80211.h. You use it like this::
+
+ #include <net/cfg80211.h>
+
+ /* buf points to the start of the radiotap header part */
+
+ int MyFunction(u8 * buf, int buflen)
+ {
+ int pkt_rate_100kHz = 0, antenna = 0, pwr = 0;
+ struct ieee80211_radiotap_iterator iterator;
+ int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen);
+
+ while (!ret) {
+
+ ret = ieee80211_radiotap_iterator_next(&iterator);
+
+ if (ret)
+ continue;
+
+ /* see if this argument is something we can use */
+
+ switch (iterator.this_arg_index) {
+ /*
+ * You must take care when dereferencing iterator.this_arg
+ * for multibyte types... the pointer is not aligned. Use
+ * get_unaligned((type *)iterator.this_arg) to dereference
+ * iterator.this_arg for type "type" safely on all arches.
+ */
+ case IEEE80211_RADIOTAP_RATE:
+ /* radiotap "rate" u8 is in
+ * 500kbps units, eg, 0x02=1Mbps
+ */
+ pkt_rate_100kHz = (*iterator.this_arg) * 5;
+ break;
+
+ case IEEE80211_RADIOTAP_ANTENNA:
+ /* radiotap uses 0 for 1st ant */
+ antenna = *iterator.this_arg);
+ break;
+
+ case IEEE80211_RADIOTAP_DBM_TX_POWER:
+ pwr = *iterator.this_arg;
+ break;
+
+ default:
+ break;
+ }
+ } /* while more rt headers */
+
+ if (ret != -ENOENT)
+ return TXRX_DROP;
+
+ /* discard the radiotap header part */
+ buf += iterator.max_length;
+ buflen -= iterator.max_length;
+
+ ...
+
+ }
+
+Andy Green <andy@warmcat.com>
diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst
new file mode 100644
index 0000000000..498395f5fb
--- /dev/null
+++ b/Documentation/networking/rds.rst
@@ -0,0 +1,448 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===
+RDS
+===
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports. The current implementation used to support RDS over TCP as well
+as IB.
+
+The high-level semantics of RDS from the application's point of view are
+
+ * Addressing
+
+ RDS uses IPv4 addresses and 16bit port numbers to identify
+ the end point of a connection. All socket operations that involve
+ passing addresses between kernel and user space generally
+ use a struct sockaddr_in.
+
+ The fact that IPv4 addresses are used does not mean the underlying
+ transport has to be IP-based. In fact, RDS over IB uses a
+ reliable IB connection; the IP address is used exclusively to
+ locate the remote node's GID (by ARPing for the given IP).
+
+ The port space is entirely independent of UDP, TCP or any other
+ protocol.
+
+ * Socket interface
+
+ RDS sockets work *mostly* as you would expect from a BSD
+ socket. The next section will cover the details. At any rate,
+ all I/O is performed through the standard BSD socket API.
+ Some additions like zerocopy support are implemented through
+ control messages, while other extensions use the getsockopt/
+ setsockopt calls.
+
+ Sockets must be bound before you can send or receive data.
+ This is needed because binding also selects a transport and
+ attaches it to the socket. Once bound, the transport assignment
+ does not change. RDS will tolerate IPs moving around (eg in
+ a active-active HA scenario), but only as long as the address
+ doesn't move to a different transport.
+
+ * sysctls
+
+ RDS supports a number of sysctls in /proc/sys/net/rds
+
+
+Socket Interface
+================
+
+ AF_RDS, PF_RDS, SOL_RDS
+ AF_RDS and PF_RDS are the domain type to be used with socket(2)
+ to create RDS sockets. SOL_RDS is the socket-level to be used
+ with setsockopt(2) and getsockopt(2) for RDS specific socket
+ options.
+
+ fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ This creates a new, unbound RDS socket.
+
+ setsockopt(SOL_SOCKET): send and receive buffer size
+ RDS honors the send and receive buffer size socket options.
+ You are not allowed to queue more than SO_SNDSIZE bytes to
+ a socket. A message is queued when sendmsg is called, and
+ it leaves the queue when the remote system acknowledges
+ its arrival.
+
+ The SO_RCVSIZE option controls the maximum receive queue length.
+ This is a soft limit rather than a hard limit - RDS will
+ continue to accept and queue incoming messages, even if that
+ takes the queue length over the limit. However, it will also
+ mark the port as "congested" and send a congestion update to
+ the source node. The source node is supposed to throttle any
+ processes sending to this congested port.
+
+ bind(fd, &sockaddr_in, ...)
+ This binds the socket to a local IP address and port, and a
+ transport, if one has not already been selected via the
+ SO_RDS_TRANSPORT socket option
+
+ sendmsg(fd, ...)
+ Sends a message to the indicated recipient. The kernel will
+ transparently establish the underlying reliable connection
+ if it isn't up yet.
+
+ An attempt to send a message that exceeds SO_SNDSIZE will
+ return with -EMSGSIZE
+
+ An attempt to send a message that would take the total number
+ of queued bytes over the SO_SNDSIZE threshold will return
+ EAGAIN.
+
+ An attempt to send a message to a destination that is marked
+ as "congested" will return ENOBUFS.
+
+ recvmsg(fd, ...)
+ Receives a message that was queued to this socket. The sockets
+ recv queue accounting is adjusted, and if the queue length
+ drops below SO_SNDSIZE, the port is marked uncongested, and
+ a congestion update is sent to all peers.
+
+ Applications can ask the RDS kernel module to receive
+ notifications via control messages (for instance, there is a
+ notification when a congestion update arrived, or when a RDMA
+ operation completes). These notifications are received through
+ the msg.msg_control buffer of struct msghdr. The format of the
+ messages is described in manpages.
+
+ poll(fd)
+ RDS supports the poll interface to allow the application
+ to implement async I/O.
+
+ POLLIN handling is pretty straightforward. When there's an
+ incoming message queued to the socket, or a pending notification,
+ we signal POLLIN.
+
+ POLLOUT is a little harder. Since you can essentially send
+ to any destination, RDS will always signal POLLOUT as long as
+ there's room on the send queue (ie the number of bytes queued
+ is less than the sendbuf size).
+
+ However, the kernel will refuse to accept messages to
+ a destination marked congested - in this case you will loop
+ forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with
+ this - by using congestion notifications, and by checking for
+ ENOBUFS errors returned by sendmsg.
+
+ setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+ This allows the application to discard all messages queued to a
+ specific destination on this particular socket.
+
+ This allows the application to cancel outstanding messages if
+ it detects a timeout. For instance, if it tried to send a message,
+ and the remote host is unreachable, RDS will keep trying forever.
+ The application may decide it's not worth it, and cancel the
+ operation. In this case, it would use RDS_CANCEL_SENT_TO to
+ nuke any pending messages.
+
+ ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
+ Set or read an integer defining the underlying
+ encapsulating transport to be used for RDS packets on the
+ socket. When setting the option, integer argument may be
+ one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
+ value, RDS_TRANS_NONE will be returned on an unbound socket.
+ This socket option may only be set exactly once on the socket,
+ prior to binding it via the bind(2) system call. Attempts to
+ set SO_RDS_TRANSPORT on a socket for which the transport has
+ been previously attached explicitly (by SO_RDS_TRANSPORT) or
+ implicitly (via bind(2)) will return an error of EOPNOTSUPP.
+ An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
+ always return EINVAL.
+
+RDMA for RDS
+============
+
+ see rds-rdma(7) manpage (available in rds-tools)
+
+
+Congestion Notifications
+========================
+
+ see rds(7) manpage
+
+
+RDS Protocol
+============
+
+ Message header
+
+ The message header is a 'struct rds_header' (see rds.h):
+
+ Fields:
+
+ h_sequence:
+ per-packet sequence number
+ h_ack:
+ piggybacked acknowledgment of last packet received
+ h_len:
+ length of data, not including header
+ h_sport:
+ source port
+ h_dport:
+ destination port
+ h_flags:
+ Can be:
+
+ ============= ==================================
+ CONG_BITMAP this is a congestion update bitmap
+ ACK_REQUIRED receiver must ack this packet
+ RETRANSMITTED packet has previously been sent
+ ============= ==================================
+
+ h_credit:
+ indicate to other end of connection that
+ it has more credits available (i.e. there is
+ more send room)
+ h_padding[4]:
+ unused, for future use
+ h_csum:
+ header checksum
+ h_exthdr:
+ optional data can be passed here. This is currently used for
+ passing RDMA-related information.
+
+ ACK and retransmit handling
+
+ One might think that with reliable IB connections you wouldn't need
+ to ack messages that have been received. The problem is that IB
+ hardware generates an ack message before it has DMAed the message
+ into memory. This creates a potential message loss if the HCA is
+ disabled for any reason between when it sends the ack and before
+ the message is DMAed and processed. This is only a potential issue
+ if another HCA is available for fail-over.
+
+ Sending an ack immediately would allow the sender to free the sent
+ message from their send queue quickly, but could cause excessive
+ traffic to be used for acks. RDS piggybacks acks on sent data
+ packets. Ack-only packets are reduced by only allowing one to be
+ in flight at a time, and by the sender only asking for acks when
+ its send buffers start to fill up. All retransmissions are also
+ acked.
+
+ Flow Control
+
+ RDS's IB transport uses a credit-based mechanism to verify that
+ there is space in the peer's receive buffers for more data. This
+ eliminates the need for hardware retries on the connection.
+
+ Congestion
+
+ Messages waiting in the receive queue on the receiving socket
+ are accounted against the sockets SO_RCVBUF option value. Only
+ the payload bytes in the message are accounted for. If the
+ number of bytes queued equals or exceeds rcvbuf then the socket
+ is congested. All sends attempted to this socket's address
+ should return block or return -EWOULDBLOCK.
+
+ Applications are expected to be reasonably tuned such that this
+ situation very rarely occurs. An application encountering this
+ "back-pressure" is considered a bug.
+
+ This is implemented by having each node maintain bitmaps which
+ indicate which ports on bound addresses are congested. As the
+ bitmap changes it is sent through all the connections which
+ terminate in the local address of the bitmap which changed.
+
+ The bitmaps are allocated as connections are brought up. This
+ avoids allocation in the interrupt handling path which queues
+ sages on sockets. The dense bitmaps let transports send the
+ entire bitmap on any bitmap change reasonably efficiently. This
+ is much easier to implement than some finer-grained
+ communication of per-port congestion. The sender does a very
+ inexpensive bit test to test if the port it's about to send to
+ is congested or not.
+
+
+RDS Transport Layer
+===================
+
+ As mentioned above, RDS is not IB-specific. Its code is divided
+ into a general RDS layer and a transport layer.
+
+ The general layer handles the socket API, congestion handling,
+ loopback, stats, usermem pinning, and the connection state machine.
+
+ The transport layer handles the details of the transport. The IB
+ transport, for example, handles all the queue pairs, work requests,
+ CM event handlers, and other Infiniband details.
+
+
+RDS Kernel Structures
+=====================
+
+ struct rds_message
+ aka possibly "rds_outgoing", the generic RDS layer copies data to
+ be sent and sets header fields as needed, based on the socket API.
+ This is then queued for the individual connection and sent by the
+ connection's transport.
+
+ struct rds_incoming
+ a generic struct referring to incoming data that can be handed from
+ the transport to the general code and queued by the general code
+ while the socket is awoken. It is then passed back to the transport
+ code to handle the actual copy-to-user.
+
+ struct rds_socket
+ per-socket information
+
+ struct rds_connection
+ per-connection information
+
+ struct rds_transport
+ pointers to transport-specific functions
+
+ struct rds_statistics
+ non-transport-specific statistics
+
+ struct rds_cong_map
+ wraps the raw congestion bitmap, contains rbnode, waitq, etc.
+
+Connection management
+=====================
+
+ Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+ ERROR states.
+
+ The first time an attempt is made by an RDS socket to send data to
+ a node, a connection is allocated and connected. That connection is
+ then maintained forever -- if there are transport errors, the
+ connection will be dropped and re-established.
+
+ Dropping a connection while packets are queued will cause queued or
+ partially-sent datagrams to be retransmitted when the connection is
+ re-established.
+
+
+The send path
+=============
+
+ rds_sendmsg()
+ - struct rds_message built from incoming data
+ - CMSGs parsed (e.g. RDMA ops)
+ - transport connection alloced and connected if not already
+ - rds_message placed on send queue
+ - send worker awoken
+
+ rds_send_worker()
+ - calls rds_send_xmit() until queue is empty
+
+ rds_send_xmit()
+ - transmits congestion map if one is pending
+ - may set ACK_REQUIRED
+ - calls transport to send either non-RDMA or RDMA message
+ (RDMA ops never retransmitted)
+
+ rds_ib_xmit()
+ - allocs work requests from send ring
+ - adds any new send credits available to peer (h_credits)
+ - maps the rds_message's sg list
+ - piggybacks ack
+ - populates work requests
+ - post send to connection's queue pair
+
+The recv path
+=============
+
+ rds_ib_recv_cq_comp_handler()
+ - looks at write completions
+ - unmaps recv buffer from device
+ - no errors, call rds_ib_process_recv()
+ - refill recv ring
+
+ rds_ib_process_recv()
+ - validate header checksum
+ - copy header to rds_ib_incoming struct if start of a new datagram
+ - add to ibinc's fraglist
+ - if competed datagram:
+ - update cong map if datagram was cong update
+ - call rds_recv_incoming() otherwise
+ - note if ack is required
+
+ rds_recv_incoming()
+ - drop duplicate packets
+ - respond to pings
+ - find the sock associated with this datagram
+ - add to sock queue
+ - wake up sock
+ - do some congestion calculations
+ rds_recvmsg
+ - copy data into user iovec
+ - handle CMSGs
+ - return to application
+
+Multipath RDS (mprds)
+=====================
+ Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
+ (though the concept can be extended to other transports). The classical
+ implementation of RDS-over-TCP is implemented by demultiplexing multiple
+ PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
+ port]) over a single TCP socket between the 2 IP addresses involved. This
+ has the limitation that it ends up funneling multiple RDS flows over a
+ single TCP flow, thus it is
+ (a) upper-bounded to the single-flow bandwidth,
+ (b) suffers from head-of-line blocking for all the RDS sockets.
+
+ Better throughput (for a fixed small packet size, MTU) can be achieved
+ by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
+ RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
+ connection. RDS sockets will be attached to a path based on some hash
+ (e.g., of local address and RDS port number) and packets for that RDS
+ socket will be sent over the attached path using TCP to segment/reassemble
+ RDS datagrams on that path.
+
+ Multipathed RDS is implemented by splitting the struct rds_connection into
+ a common (to all paths) part, and a per-path struct rds_conn_path. All
+ I/O workqs and reconnect threads are driven from the rds_conn_path.
+ Transports such as TCP that are multipath capable may then set up a
+ TCP socket per rds_conn_path, and this is managed by the transport via
+ the transport privatee cp_transport_data pointer.
+
+ Transports announce themselves as multipath capable by setting the
+ t_mp_capable bit during registration with the rds core module. When the
+ transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
+ across multiple paths. The outgoing hash is computed based on the
+ local address and port that the PF_RDS socket is bound to.
+
+ Additionally, even if the transport is MP capable, we may be
+ peering with some node that does not support mprds, or supports
+ a different number of paths. As a result, the peering nodes need
+ to agree on the number of paths to be used for the connection.
+ This is done by sending out a control packet exchange before the
+ first data packet. The control packet exchange must have completed
+ prior to outgoing hash completion in rds_sendmsg() when the transport
+ is mutlipath capable.
+
+ The control packet is an RDS ping packet (i.e., packet to rds dest
+ port 0) with the ping packet having a rds extension header option of
+ type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
+ number of paths supported by the sender. The "probe" ping packet will
+ get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>)
+ The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
+ be able to compute the min(sender_paths, rcvr_paths). The pong
+ sent in response to a probe-ping should contain the rcvr's npaths
+ when the rcvr is mprds-capable.
+
+ If the rcvr is not mprds-capable, the exthdr in the ping will be
+ ignored. In this case the pong will not have any exthdrs, so the sender
+ of the probe-ping can default to single-path mprds.
+
diff --git a/Documentation/networking/regulatory.rst b/Documentation/networking/regulatory.rst
new file mode 100644
index 0000000000..5e42c8a175
--- /dev/null
+++ b/Documentation/networking/regulatory.rst
@@ -0,0 +1,209 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+Linux wireless regulatory documentation
+=======================================
+
+This document gives a brief review over how the Linux wireless
+regulatory infrastructure works.
+
+More up to date information can be obtained at the project's web page:
+
+https://wireless.wiki.kernel.org/en/developers/Regulatory
+
+Keeping regulatory domains in userspace
+---------------------------------------
+
+Due to the dynamic nature of regulatory domains we keep them
+in userspace and provide a framework for userspace to upload
+to the kernel one regulatory domain to be used as the central
+core regulatory domain all wireless devices should adhere to.
+
+How to get regulatory domains to the kernel
+-------------------------------------------
+
+When the regulatory domain is first set up, the kernel will request a
+database file (regulatory.db) containing all the regulatory rules. It
+will then use that database when it needs to look up the rules for a
+given country.
+
+How to get regulatory domains to the kernel (old CRDA solution)
+---------------------------------------------------------------
+
+Userspace gets a regulatory domain in the kernel by having
+a userspace agent build it and send it via nl80211. Only
+expected regulatory domains will be respected by the kernel.
+
+A currently available userspace agent which can accomplish this
+is CRDA - central regulatory domain agent. Its documented here:
+
+https://wireless.wiki.kernel.org/en/developers/Regulatory/CRDA
+
+Essentially the kernel will send a udev event when it knows
+it needs a new regulatory domain. A udev rule can be put in place
+to trigger crda to send the respective regulatory domain for a
+specific ISO/IEC 3166 alpha2.
+
+Below is an example udev rule which can be used:
+
+# Example file, should be put in /etc/udev/rules.d/regulatory.rules
+KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda"
+
+The alpha2 is passed as an environment variable under the variable COUNTRY.
+
+Who asks for regulatory domains?
+--------------------------------
+
+* Users
+
+Users can use iw:
+
+https://wireless.wiki.kernel.org/en/users/Documentation/iw
+
+An example::
+
+ # set regulatory domain to "Costa Rica"
+ iw reg set CR
+
+This will request the kernel to set the regulatory domain to
+the specified alpha2. The kernel in turn will then ask userspace
+to provide a regulatory domain for the alpha2 specified by the user
+by sending a uevent.
+
+* Wireless subsystems for Country Information elements
+
+The kernel will send a uevent to inform userspace a new
+regulatory domain is required. More on this to be added
+as its integration is added.
+
+* Drivers
+
+If drivers determine they need a specific regulatory domain
+set they can inform the wireless core using regulatory_hint().
+They have two options -- they either provide an alpha2 so that
+crda can provide back a regulatory domain for that country or
+they can build their own regulatory domain based on internal
+custom knowledge so the wireless core can respect it.
+
+*Most* drivers will rely on the first mechanism of providing a
+regulatory hint with an alpha2. For these drivers there is an additional
+check that can be used to ensure compliance based on custom EEPROM
+regulatory data. This additional check can be used by drivers by
+registering on its struct wiphy a reg_notifier() callback. This notifier
+is called when the core's regulatory domain has been changed. The driver
+can use this to review the changes made and also review who made them
+(driver, user, country IE) and determine what to allow based on its
+internal EEPROM data. Devices drivers wishing to be capable of world
+roaming should use this callback. More on world roaming will be
+added to this document when its support is enabled.
+
+Device drivers who provide their own built regulatory domain
+do not need a callback as the channels registered by them are
+the only ones that will be allowed and therefore *additional*
+channels cannot be enabled.
+
+Example code - drivers hinting an alpha2:
+------------------------------------------
+
+This example comes from the zd1211rw device driver. You can start
+by having a mapping of your device's EEPROM country/regulatory
+domain value to a specific alpha2 as follows::
+
+ static struct zd_reg_alpha2_map reg_alpha2_map[] = {
+ { ZD_REGDOMAIN_FCC, "US" },
+ { ZD_REGDOMAIN_IC, "CA" },
+ { ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */
+ { ZD_REGDOMAIN_JAPAN, "JP" },
+ { ZD_REGDOMAIN_JAPAN_ADD, "JP" },
+ { ZD_REGDOMAIN_SPAIN, "ES" },
+ { ZD_REGDOMAIN_FRANCE, "FR" },
+
+Then you can define a routine to map your read EEPROM value to an alpha2,
+as follows::
+
+ static int zd_reg2alpha2(u8 regdomain, char *alpha2)
+ {
+ unsigned int i;
+ struct zd_reg_alpha2_map *reg_map;
+ for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) {
+ reg_map = &reg_alpha2_map[i];
+ if (regdomain == reg_map->reg) {
+ alpha2[0] = reg_map->alpha2[0];
+ alpha2[1] = reg_map->alpha2[1];
+ return 0;
+ }
+ }
+ return 1;
+ }
+
+Lastly, you can then hint to the core of your discovered alpha2, if a match
+was found. You need to do this after you have registered your wiphy. You
+are expected to do this during initialization.
+
+::
+
+ r = zd_reg2alpha2(mac->regdomain, alpha2);
+ if (!r)
+ regulatory_hint(hw->wiphy, alpha2);
+
+Example code - drivers providing a built in regulatory domain:
+--------------------------------------------------------------
+
+[NOTE: This API is not currently available, it can be added when required]
+
+If you have regulatory information you can obtain from your
+driver and you *need* to use this we let you build a regulatory domain
+structure and pass it to the wireless core. To do this you should
+kmalloc() a structure big enough to hold your regulatory domain
+structure and you should then fill it with your data. Finally you simply
+call regulatory_hint() with the regulatory domain structure in it.
+
+Below is a simple example, with a regulatory domain cached using the stack.
+Your implementation may vary (read EEPROM cache instead, for example).
+
+Example cache of some regulatory domain::
+
+ struct ieee80211_regdomain mydriver_jp_regdom = {
+ .n_reg_rules = 3,
+ .alpha2 = "JP",
+ //.alpha2 = "99", /* If I have no alpha2 to map it to */
+ .reg_rules = {
+ /* IEEE 802.11b/g, channels 1..14 */
+ REG_RULE(2412-10, 2484+10, 40, 6, 20, 0),
+ /* IEEE 802.11a, channels 34..48 */
+ REG_RULE(5170-10, 5240+10, 40, 6, 20,
+ NL80211_RRF_NO_IR),
+ /* IEEE 802.11a, channels 52..64 */
+ REG_RULE(5260-10, 5320+10, 40, 6, 20,
+ NL80211_RRF_NO_IR|
+ NL80211_RRF_DFS),
+ }
+ };
+
+Then in some part of your code after your wiphy has been registered::
+
+ struct ieee80211_regdomain *rd;
+ int size_of_regd;
+ int num_rules = mydriver_jp_regdom.n_reg_rules;
+ unsigned int i;
+
+ size_of_regd = sizeof(struct ieee80211_regdomain) +
+ (num_rules * sizeof(struct ieee80211_reg_rule));
+
+ rd = kzalloc(size_of_regd, GFP_KERNEL);
+ if (!rd)
+ return -ENOMEM;
+
+ memcpy(rd, &mydriver_jp_regdom, sizeof(struct ieee80211_regdomain));
+
+ for (i=0; i < num_rules; i++)
+ memcpy(&rd->reg_rules[i],
+ &mydriver_jp_regdom.reg_rules[i],
+ sizeof(struct ieee80211_reg_rule));
+ regulatory_struct_hint(rd);
+
+Statically compiled regulatory database
+---------------------------------------
+
+When a database should be fixed into the kernel, it can be provided as a
+firmware file at build time that is then linked into the kernel.
diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst
new file mode 100644
index 0000000000..decb39c19b
--- /dev/null
+++ b/Documentation/networking/representors.rst
@@ -0,0 +1,261 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Network Function Representors
+=============================
+
+This document describes the semantics and usage of representor netdevices, as
+used to control internal switching on SmartNICs. For the closely-related port
+representors on physical (multi-port) switches, see
+:ref:`Documentation/networking/switchdev.rst <switchdev>`.
+
+Motivation
+----------
+
+Since the mid-2010s, network cards have started offering more complex
+virtualisation capabilities than the legacy SR-IOV approach (with its simple
+MAC/VLAN-based switching model) can support. This led to a desire to offload
+software-defined networks (such as OpenVSwitch) to these NICs to specify the
+network connectivity of each function. The resulting designs are variously
+called SmartNICs or DPUs.
+
+Network function representors bring the standard Linux networking stack to
+virtual switches and IOV devices. Just as each physical port of a Linux-
+controlled switch has a separate netdev, so does each virtual port of a virtual
+switch.
+When the system boots, and before any offload is configured, all packets from
+the virtual functions appear in the networking stack of the PF via the
+representors. The PF can thus always communicate freely with the virtual
+functions.
+The PF can configure standard Linux forwarding between representors, the uplink
+or any other netdev (routing, bridging, TC classifiers).
+
+Thus, a representor is both a control plane object (representing the function in
+administrative commands) and a data plane object (one end of a virtual pipe).
+As a virtual link endpoint, the representor can be configured like any other
+netdevice; in some cases (e.g. link state) the representee will follow the
+representor's configuration, while in others there are separate APIs to
+configure the representee.
+
+Definitions
+-----------
+
+This document uses the term "switchdev function" to refer to the PCIe function
+which has administrative control over the virtual switch on the device.
+Typically, this will be a PF, but conceivably a NIC could be configured to grant
+these administrative privileges instead to a VF or SF (subfunction).
+Depending on NIC design, a multi-port NIC might have a single switchdev function
+for the whole device or might have a separate virtual switch, and hence
+switchdev function, for each physical network port.
+If the NIC supports nested switching, there might be separate switchdev
+functions for each nested switch, in which case each switchdev function should
+only create representors for the ports on the (sub-)switch it directly
+administers.
+
+A "representee" is the object that a representor represents. So for example in
+the case of a VF representor, the representee is the corresponding VF.
+
+What does a representor do?
+---------------------------
+
+A representor has three main roles.
+
+1. It is used to configure the network connection the representee sees, e.g.
+ link up/down, MTU, etc. For instance, bringing the representor
+ administratively UP should cause the representee to see a link up / carrier
+ on event.
+2. It provides the slow path for traffic which does not hit any offloaded
+ fast-path rules in the virtual switch. Packets transmitted on the
+ representor netdevice should be delivered to the representee; packets
+ transmitted by the representee which fail to match any switching rule should
+ be received on the representor netdevice. (That is, there is a virtual pipe
+ connecting the representor to the representee, similar in concept to a veth
+ pair.)
+ This allows software switch implementations (such as OpenVSwitch or a Linux
+ bridge) to forward packets between representees and the rest of the network.
+3. It acts as a handle by which switching rules (such as TC filters) can refer
+ to the representee, allowing these rules to be offloaded.
+
+The combination of 2) and 3) means that the behaviour (apart from performance)
+should be the same whether a TC filter is offloaded or not. E.g. a TC rule
+on a VF representor applies in software to packets received on that representor
+netdevice, while in hardware offload it would apply to packets transmitted by
+the representee VF. Conversely, a mirred egress redirect to a VF representor
+corresponds in hardware to delivery directly to the representee VF.
+
+What functions should have a representor?
+-----------------------------------------
+
+Essentially, for each virtual port on the device's internal switch, there
+should be a representor.
+Some vendors have chosen to omit representors for the uplink and the physical
+network port, which can simplify usage (the uplink netdev becomes in effect the
+physical port's representor) but does not generalise to devices with multiple
+ports or uplinks.
+
+Thus, the following should all have representors:
+
+ - VFs belonging to the switchdev function.
+ - Other PFs on the local PCIe controller, and any VFs belonging to them.
+ - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
+ System-on-Chip within the SmartNIC).
+ - PFs and VFs with other personalities, including network block devices (such
+ as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
+ if) their network access is implemented through a virtual switch port. [#]_
+ Note that such functions can require a representor despite the representee
+ not having a netdev.
+ - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
+ their own port on the switch (as opposed to using their parent PF's port).
+ - Any accelerators or plugins on the device whose interface to the network is
+ through a virtual switch port, even if they do not have a corresponding PCIe
+ PF or VF.
+
+This allows the entire switching behaviour of the NIC to be controlled through
+representor TC rules.
+
+It is a common misunderstanding to conflate virtual ports with PCIe virtual
+functions or their netdevs. While in simple cases there will be a 1:1
+correspondence between VF netdevices and VF representors, more advanced device
+configurations may not follow this.
+A PCIe function which does not have network access through the internal switch
+(not even indirectly through the hardware implementation of whatever services
+the function provides) should *not* have a representor (even if it has a
+netdev).
+Such a function has no switch virtual port for the representor to configure or
+to be the other end of the virtual pipe.
+The representor represents the virtual port, not the PCIe function nor the 'end
+user' netdevice.
+
+.. [#] The concept here is that a hardware IP stack in the device performs the
+ translation between block DMA requests and network packets, so that only
+ network packets pass through the virtual port onto the switch. The network
+ access that the IP stack "sees" would then be configurable through tc rules;
+ e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However,
+ any needed configuration of the block device *qua* block device, not being a
+ networking entity, would not be appropriate for the representor and would
+ thus use some other channel such as devlink.
+ Contrast this with the case of a virtio-blk implementation which forwards the
+ DMA requests unchanged to another PF whose driver then initiates and
+ terminates IP traffic in software; in that case the DMA traffic would *not*
+ run over the virtual switch and the virtio-blk PF should thus *not* have a
+ representor.
+
+How are representors created?
+-----------------------------
+
+The driver instance attached to the switchdev function should, for each virtual
+port on the switch, create a pure-software netdevice which has some form of
+in-kernel reference to the switchdev function's own netdevice or driver private
+data (``netdev_priv()``).
+This may be by enumerating ports at probe time, reacting dynamically to the
+creation and destruction of ports at run time, or a combination of the two.
+
+The operations of the representor netdevice will generally involve acting
+through the switchdev function. For example, ``ndo_start_xmit()`` might send
+the packet through a hardware TX queue attached to the switchdev function, with
+either packet metadata or queue configuration marking it for delivery to the
+representee.
+
+How are representors identified?
+--------------------------------
+
+The representor netdevice should *not* directly refer to a PCIe device (e.g.
+through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
+representee or of the switchdev function.
+Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
+assign a devlink port instance to the netdevice before registering the
+netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
+and ``phys_port_name`` sysfs nodes.
+(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
+``ndo_get_phys_port_name()`` directly, but this is deprecated.) See
+:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
+details of this API.
+
+It is expected that userland will use this information (e.g. through udev rules)
+to construct an appropriately informative name or alias for the netdevice. For
+instance if the switchdev function is ``eth4`` then a representor with a
+``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
+
+There are as yet no established conventions for naming representors which do not
+correspond to PCIe functions (e.g. accelerators and plugins).
+
+How do representors interact with TC rules?
+-------------------------------------------
+
+Any TC rule on a representor applies (in software TC) to packets received by
+that representor netdevice. Thus, if the delivery part of the rule corresponds
+to another port on the virtual switch, the driver may choose to offload it to
+hardware, applying it to packets transmitted by the representee.
+
+Similarly, since a TC mirred egress action targeting the representor would (in
+software) send the packet through the representor (and thus indirectly deliver
+it to the representee), hardware offload should interpret this as delivery to
+the representee.
+
+As a simple example, if ``PORT_DEV`` is the physical port representor and
+``REP_DEV`` is a VF representor, the following rules::
+
+ tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
+ action mirred egress redirect dev $PORT_DEV
+ tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
+ action mirred egress mirror dev $REP_DEV
+
+would mean that all IPv4 packets from the VF are sent out the physical port, and
+all IPv4 packets received on the physical port are delivered to the VF in
+addition to ``PORT_DEV``. (Note that without ``skip_sw`` on the second rule,
+the VF would get two copies, as the packet reception on ``PORT_DEV`` would
+trigger the TC rule again and mirror the packet to ``REP_DEV``.)
+
+On devices without separate port and uplink representors, ``PORT_DEV`` would
+instead be the switchdev function's own uplink netdevice.
+
+Of course the rules can (if supported by the NIC) include packet-modifying
+actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
+
+Tunnel encapsulation and decapsulation are rather more complicated, as they
+involve a third netdevice (a tunnel netdev operating in metadata mode, such as
+a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
+require an IP address to be bound to the underlay device (e.g. switchdev
+function uplink netdev or port representor). TC rules such as::
+
+ tc filter add dev $REP_DEV parent ffff: flower \
+ action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
+ dst_port 4789 \
+ action mirred egress redirect dev vxlan0
+ tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
+ enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
+ action tunnel_key unset action mirred egress redirect dev $REP_DEV
+
+where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
+another IP address on the same subnet, mean that packets sent by the VF should
+be VxLAN encapsulated and sent out the physical port (the driver has to deduce
+this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
+perform an ARP/neighbour table lookup to find the MAC addresses to use in the
+outer Ethernet frame), while UDP packets received on the physical port with UDP
+port 4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``,
+decapsulated and forwarded to the VF.
+
+If this all seems complicated, just remember the 'golden rule' of TC offload:
+the hardware should ensure the same final results as if the packets were
+processed through the slow path, traversed software TC (except ignoring any
+``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
+received through the representor netdevices.
+
+Configuring the representee's MAC
+---------------------------------
+
+The representee's link state is controlled through the representor. Setting the
+representor administratively UP or DOWN should cause carrier ON or OFF at the
+representee.
+
+Setting an MTU on the representor should cause that same MTU to be reported to
+the representee.
+(On hardware that allows configuring separate and distinct MTU and MRU values,
+the representor MTU should correspond to the representee's MRU and vice-versa.)
+
+Currently there is no way to use the representor to set the station permanent
+MAC address of the representee; other methods available to do this include:
+
+ - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
+ - devlink port function (see **devlink-port(8)** and
+ :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst
new file mode 100644
index 0000000000..e807e18ba3
--- /dev/null
+++ b/Documentation/networking/rxrpc.rst
@@ -0,0 +1,1174 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+RxRPC Network Protocol
+======================
+
+The RxRPC protocol driver provides a reliable two-phase transport on top of UDP
+that can be used to perform RxRPC remote operations. This is done over sockets
+of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and
+receive data, aborts and errors.
+
+Contents of this document:
+
+ (#) Overview.
+
+ (#) RxRPC protocol summary.
+
+ (#) AF_RXRPC driver model.
+
+ (#) Control messages.
+
+ (#) Socket options.
+
+ (#) Security.
+
+ (#) Example client usage.
+
+ (#) Example server usage.
+
+ (#) AF_RXRPC kernel interface.
+
+ (#) Configurable parameters.
+
+
+Overview
+========
+
+RxRPC is a two-layer protocol. There is a session layer which provides
+reliable virtual connections using UDP over IPv4 (or IPv6) as the transport
+layer, but implements a real network protocol; and there's the presentation
+layer which renders structured data to binary blobs and back again using XDR
+(as does SunRPC)::
+
+ +-------------+
+ | Application |
+ +-------------+
+ | XDR | Presentation
+ +-------------+
+ | RxRPC | Session
+ +-------------+
+ | UDP | Transport
+ +-------------+
+
+
+AF_RXRPC provides:
+
+ (1) Part of an RxRPC facility for both kernel and userspace applications by
+ making the session part of it a Linux network protocol (AF_RXRPC).
+
+ (2) A two-phase protocol. The client transmits a blob (the request) and then
+ receives a blob (the reply), and the server receives the request and then
+ transmits the reply.
+
+ (3) Retention of the reusable bits of the transport system set up for one call
+ to speed up subsequent calls.
+
+ (4) A secure protocol, using the Linux kernel's key retention facility to
+ manage security on the client end. The server end must of necessity be
+ more active in security negotiations.
+
+AF_RXRPC does not provide XDR marshalling/presentation facilities. That is
+left to the application. AF_RXRPC only deals in blobs. Even the operation ID
+is just the first four bytes of the request blob, and as such is beyond the
+kernel's interest.
+
+
+Sockets of AF_RXRPC family are:
+
+ (1) created as type SOCK_DGRAM;
+
+ (2) provided with a protocol of the type of underlying transport they're going
+ to use - currently only PF_INET is supported.
+
+
+The Andrew File System (AFS) is an example of an application that uses this and
+that has both kernel (filesystem) and userspace (utility) components.
+
+
+RxRPC Protocol Summary
+======================
+
+An overview of the RxRPC protocol:
+
+ (#) RxRPC sits on top of another networking protocol (UDP is the only option
+ currently), and uses this to provide network transport. UDP ports, for
+ example, provide transport endpoints.
+
+ (#) RxRPC supports multiple virtual "connections" from any given transport
+ endpoint, thus allowing the endpoints to be shared, even to the same
+ remote endpoint.
+
+ (#) Each connection goes to a particular "service". A connection may not go
+ to multiple services. A service may be considered the RxRPC equivalent of
+ a port number. AF_RXRPC permits multiple services to share an endpoint.
+
+ (#) Client-originating packets are marked, thus a transport endpoint can be
+ shared between client and server connections (connections have a
+ direction).
+
+ (#) Up to a billion connections may be supported concurrently between one
+ local transport endpoint and one service on one remote endpoint. An RxRPC
+ connection is described by seven numbers::
+
+ Local address }
+ Local port } Transport (UDP) address
+ Remote address }
+ Remote port }
+ Direction
+ Connection ID
+ Service ID
+
+ (#) Each RxRPC operation is a "call". A connection may make up to four
+ billion calls, but only up to four calls may be in progress on a
+ connection at any one time.
+
+ (#) Calls are two-phase and asymmetric: the client sends its request data,
+ which the service receives; then the service transmits the reply data
+ which the client receives.
+
+ (#) The data blobs are of indefinite size, the end of a phase is marked with a
+ flag in the packet. The number of packets of data making up one blob may
+ not exceed 4 billion, however, as this would cause the sequence number to
+ wrap.
+
+ (#) The first four bytes of the request data are the service operation ID.
+
+ (#) Security is negotiated on a per-connection basis. The connection is
+ initiated by the first data packet on it arriving. If security is
+ requested, the server then issues a "challenge" and then the client
+ replies with a "response". If the response is successful, the security is
+ set for the lifetime of that connection, and all subsequent calls made
+ upon it use that same security. In the event that the server lets a
+ connection lapse before the client, the security will be renegotiated if
+ the client uses the connection again.
+
+ (#) Calls use ACK packets to handle reliability. Data packets are also
+ explicitly sequenced per call.
+
+ (#) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
+ A hard-ACK indicates to the far side that all the data received to a point
+ has been received and processed; a soft-ACK indicates that the data has
+ been received but may yet be discarded and re-requested. The sender may
+ not discard any transmittable packets until they've been hard-ACK'd.
+
+ (#) Reception of a reply data packet implicitly hard-ACK's all the data
+ packets that make up the request.
+
+ (#) An call is complete when the request has been sent, the reply has been
+ received and the final hard-ACK on the last packet of the reply has
+ reached the server.
+
+ (#) An call may be aborted by either end at any time up to its completion.
+
+
+AF_RXRPC Driver Model
+=====================
+
+About the AF_RXRPC driver:
+
+ (#) The AF_RXRPC protocol transparently uses internal sockets of the transport
+ protocol to represent transport endpoints.
+
+ (#) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC
+ connections are handled transparently. One client socket may be used to
+ make multiple simultaneous calls to the same service. One server socket
+ may handle calls from many clients.
+
+ (#) Additional parallel client connections will be initiated to support extra
+ concurrent calls, up to a tunable limit.
+
+ (#) Each connection is retained for a certain amount of time [tunable] after
+ the last call currently using it has completed in case a new call is made
+ that could reuse it.
+
+ (#) Each internal UDP socket is retained [tunable] for a certain amount of
+ time [tunable] after the last connection using it discarded, in case a new
+ connection is made that could use it.
+
+ (#) A client-side connection is only shared between calls if they have
+ the same key struct describing their security (and assuming the calls
+ would otherwise share the connection). Non-secured calls would also be
+ able to share connections with each other.
+
+ (#) A server-side connection is shared if the client says it is.
+
+ (#) ACK'ing is handled by the protocol driver automatically, including ping
+ replying.
+
+ (#) SO_KEEPALIVE automatically pings the other side to keep the connection
+ alive [TODO].
+
+ (#) If an ICMP error is received, all calls affected by that error will be
+ aborted with an appropriate network error passed through recvmsg().
+
+
+Interaction with the user of the RxRPC socket:
+
+ (#) A socket is made into a server socket by binding an address with a
+ non-zero service ID.
+
+ (#) In the client, sending a request is achieved with one or more sendmsgs,
+ followed by the reply being received with one or more recvmsgs.
+
+ (#) The first sendmsg for a request to be sent from a client contains a tag to
+ be used in all other sendmsgs or recvmsgs associated with that call. The
+ tag is carried in the control data.
+
+ (#) connect() is used to supply a default destination address for a client
+ socket. This may be overridden by supplying an alternate address to the
+ first sendmsg() of a call (struct msghdr::msg_name).
+
+ (#) If connect() is called on an unbound client, a random local port will
+ bound before the operation takes place.
+
+ (#) A server socket may also be used to make client calls. To do this, the
+ first sendmsg() of the call must specify the target address. The server's
+ transport endpoint is used to send the packets.
+
+ (#) Once the application has received the last message associated with a call,
+ the tag is guaranteed not to be seen again, and so it can be used to pin
+ client resources. A new call can then be initiated with the same tag
+ without fear of interference.
+
+ (#) In the server, a request is received with one or more recvmsgs, then the
+ the reply is transmitted with one or more sendmsgs, and then the final ACK
+ is received with a last recvmsg.
+
+ (#) When sending data for a call, sendmsg is given MSG_MORE if there's more
+ data to come on that call.
+
+ (#) When receiving data for a call, recvmsg flags MSG_MORE if there's more
+ data to come for that call.
+
+ (#) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
+ to indicate the terminal message for that call.
+
+ (#) A call may be aborted by adding an abort control message to the control
+ data. Issuing an abort terminates the kernel's use of that call's tag.
+ Any messages waiting in the receive queue for that call will be discarded.
+
+ (#) Aborts, busy notifications and challenge packets are delivered by recvmsg,
+ and control data messages will be set to indicate the context. Receiving
+ an abort or a busy message terminates the kernel's use of that call's tag.
+
+ (#) The control data part of the msghdr struct is used for a number of things:
+
+ (#) The tag of the intended or affected call.
+
+ (#) Sending or receiving errors, aborts and busy notifications.
+
+ (#) Notifications of incoming calls.
+
+ (#) Sending debug requests and receiving debug replies [TODO].
+
+ (#) When the kernel has received and set up an incoming call, it sends a
+ message to server application to let it know there's a new call awaiting
+ its acceptance [recvmsg reports a special control message]. The server
+ application then uses sendmsg to assign a tag to the new call. Once that
+ is done, the first part of the request data will be delivered by recvmsg.
+
+ (#) The server application has to provide the server socket with a keyring of
+ secret keys corresponding to the security types it permits. When a secure
+ connection is being set up, the kernel looks up the appropriate secret key
+ in the keyring and then sends a challenge packet to the client and
+ receives a response packet. The kernel then checks the authorisation of
+ the packet and either aborts the connection or sets up the security.
+
+ (#) The name of the key a client will use to secure its communications is
+ nominated by a socket option.
+
+
+Notes on sendmsg:
+
+ (#) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is
+ making progress at accepting packets within a reasonable time such that we
+ manage to queue up all the data for transmission. This requires the
+ client to accept at least one packet per 2*RTT time period.
+
+ If this isn't set, sendmsg() will return immediately, either returning
+ EINTR/ERESTARTSYS if nothing was consumed or returning the amount of data
+ consumed.
+
+
+Notes on recvmsg:
+
+ (#) If there's a sequence of data messages belonging to a particular call on
+ the receive queue, then recvmsg will keep working through them until:
+
+ (a) it meets the end of that call's received data,
+
+ (b) it meets a non-data message,
+
+ (c) it meets a message belonging to a different call, or
+
+ (d) it fills the user buffer.
+
+ If recvmsg is called in blocking mode, it will keep sleeping, awaiting the
+ reception of further data, until one of the above four conditions is met.
+
+ (2) MSG_PEEK operates similarly, but will return immediately if it has put any
+ data in the buffer rather than sleeping until it can fill the buffer.
+
+ (3) If a data message is only partially consumed in filling a user buffer,
+ then the remainder of that message will be left on the front of the queue
+ for the next taker. MSG_TRUNC will never be flagged.
+
+ (4) If there is more data to be had on a call (it hasn't copied the last byte
+ of the last data message in that phase yet), then MSG_MORE will be
+ flagged.
+
+
+Control Messages
+================
+
+AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex
+calls, to invoke certain actions and to report certain conditions. These are:
+
+ ======================= === =========== ===============================
+ MESSAGE ID SRT DATA MEANING
+ ======================= === =========== ===============================
+ RXRPC_USER_CALL_ID sr- User ID App's call specifier
+ RXRPC_ABORT srt Abort code Abort code to issue/received
+ RXRPC_ACK -rt n/a Final ACK received
+ RXRPC_NET_ERROR -rt error num Network error on call
+ RXRPC_BUSY -rt n/a Call rejected (server busy)
+ RXRPC_LOCAL_ERROR -rt error num Local error encountered
+ RXRPC_NEW_CALL -r- n/a New call received
+ RXRPC_ACCEPT s-- n/a Accept new call
+ RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call
+ RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded
+ RXRPC_TX_LENGTH s-- data len Total length of Tx data
+ ======================= === =========== ===============================
+
+ (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message)
+
+ (#) RXRPC_USER_CALL_ID
+
+ This is used to indicate the application's call ID. It's an unsigned long
+ that the app specifies in the client by attaching it to the first data
+ message or in the server by passing it in association with an RXRPC_ACCEPT
+ message. recvmsg() passes it in conjunction with all messages except
+ those of the RXRPC_NEW_CALL message.
+
+ (#) RXRPC_ABORT
+
+ This is can be used by an application to abort a call by passing it to
+ sendmsg, or it can be delivered by recvmsg to indicate a remote abort was
+ received. Either way, it must be associated with an RXRPC_USER_CALL_ID to
+ specify the call affected. If an abort is being sent, then error EBADSLT
+ will be returned if there is no call with that user ID.
+
+ (#) RXRPC_ACK
+
+ This is delivered to a server application to indicate that the final ACK
+ of a call was received from the client. It will be associated with an
+ RXRPC_USER_CALL_ID to indicate the call that's now complete.
+
+ (#) RXRPC_NET_ERROR
+
+ This is delivered to an application to indicate that an ICMP error message
+ was encountered in the process of trying to talk to the peer. An
+ errno-class integer value will be included in the control message data
+ indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
+ affected.
+
+ (#) RXRPC_BUSY
+
+ This is delivered to a client application to indicate that a call was
+ rejected by the server due to the server being busy. It will be
+ associated with an RXRPC_USER_CALL_ID to indicate the rejected call.
+
+ (#) RXRPC_LOCAL_ERROR
+
+ This is delivered to an application to indicate that a local error was
+ encountered and that a call has been aborted because of it. An
+ errno-class integer value will be included in the control message data
+ indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
+ affected.
+
+ (#) RXRPC_NEW_CALL
+
+ This is delivered to indicate to a server application that a new call has
+ arrived and is awaiting acceptance. No user ID is associated with this,
+ as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT.
+
+ (#) RXRPC_ACCEPT
+
+ This is used by a server application to attempt to accept a call and
+ assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID
+ to indicate the user ID to be assigned. If there is no call to be
+ accepted (it may have timed out, been aborted, etc.), then sendmsg will
+ return error ENODATA. If the user ID is already in use by another call,
+ then error EBADSLT will be returned.
+
+ (#) RXRPC_EXCLUSIVE_CALL
+
+ This is used to indicate that a client call should be made on a one-off
+ connection. The connection is discarded once the call has terminated.
+
+ (#) RXRPC_UPGRADE_SERVICE
+
+ This is used to make a client call to probe if the specified service ID
+ may be upgraded by the server. The caller must check msg_name returned to
+ recvmsg() for the service ID actually in use. The operation probed must
+ be one that takes the same arguments in both services.
+
+ Once this has been used to establish the upgrade capability (or lack
+ thereof) of the server, the service ID returned should be used for all
+ future communication to that server and RXRPC_UPGRADE_SERVICE should no
+ longer be set.
+
+ (#) RXRPC_TX_LENGTH
+
+ This is used to inform the kernel of the total amount of data that is
+ going to be transmitted by a call (whether in a client request or a
+ service response). If given, it allows the kernel to encrypt from the
+ userspace buffer directly to the packet buffers, rather than copying into
+ the buffer and then encrypting in place. This may only be given with the
+ first sendmsg() providing data for a call. EMSGSIZE will be generated if
+ the amount of data actually given is different.
+
+ This takes a parameter of __s64 type that indicates how much will be
+ transmitted. This may not be less than zero.
+
+The symbol RXRPC__SUPPORTED is defined as one more than the highest control
+message type supported. At run time this can be queried by means of the
+RXRPC_SUPPORTED_CMSG socket option (see below).
+
+
+==============
+SOCKET OPTIONS
+==============
+
+AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
+
+ (#) RXRPC_SECURITY_KEY
+
+ This is used to specify the description of the key to be used. The key is
+ extracted from the calling process's keyrings with request_key() and
+ should be of "rxrpc" type.
+
+ The optval pointer points to the description string, and optlen indicates
+ how long the string is, without the NUL terminator.
+
+ (#) RXRPC_SECURITY_KEYRING
+
+ Similar to above but specifies a keyring of server secret keys to use (key
+ type "keyring"). See the "Security" section.
+
+ (#) RXRPC_EXCLUSIVE_CONNECTION
+
+ This is used to request that new connections should be used for each call
+ made subsequently on this socket. optval should be NULL and optlen 0.
+
+ (#) RXRPC_MIN_SECURITY_LEVEL
+
+ This is used to specify the minimum security level required for calls on
+ this socket. optval must point to an int containing one of the following
+ values:
+
+ (a) RXRPC_SECURITY_PLAIN
+
+ Encrypted checksum only.
+
+ (b) RXRPC_SECURITY_AUTH
+
+ Encrypted checksum plus packet padded and first eight bytes of packet
+ encrypted - which includes the actual packet length.
+
+ (c) RXRPC_SECURITY_ENCRYPT
+
+ Encrypted checksum plus entire packet padded and encrypted, including
+ actual packet length.
+
+ (#) RXRPC_UPGRADEABLE_SERVICE
+
+ This is used to indicate that a service socket with two bindings may
+ upgrade one bound service to the other if requested by the client. optval
+ must point to an array of two unsigned short ints. The first is the
+ service ID to upgrade from and the second the service ID to upgrade to.
+
+ (#) RXRPC_SUPPORTED_CMSG
+
+ This is a read-only option that writes an int into the buffer indicating
+ the highest control message type supported.
+
+
+========
+SECURITY
+========
+
+Currently, only the kerberos 4 equivalent protocol has been implemented
+(security index 2 - rxkad). This requires the rxkad module to be loaded and,
+on the client, tickets of the appropriate type to be obtained from the AFS
+kaserver or the kerberos server and installed as "rxrpc" type keys. This is
+normally done using the klog program. An example simple klog program can be
+found at:
+
+ http://people.redhat.com/~dhowells/rxrpc/klog.c
+
+The payload provided to add_key() on the client should be of the following
+form::
+
+ struct rxrpc_key_sec2_v1 {
+ uint16_t security_index; /* 2 */
+ uint16_t ticket_length; /* length of ticket[] */
+ uint32_t expiry; /* time at which expires */
+ uint8_t kvno; /* key version number */
+ uint8_t __pad[3];
+ uint8_t session_key[8]; /* DES session key */
+ uint8_t ticket[0]; /* the encrypted ticket */
+ };
+
+Where the ticket blob is just appended to the above structure.
+
+
+For the server, keys of type "rxrpc_s" must be made available to the server.
+They have a description of "<serviceID>:<securityIndex>" (eg: "52:2" for an
+rxkad key for the AFS VL service). When such a key is created, it should be
+given the server's secret key as the instantiation data (see the example
+below).
+
+ add_key("rxrpc_s", "52:2", secret_key, 8, keyring);
+
+A keyring is passed to the server socket by naming it in a sockopt. The server
+socket then looks the server secret keys up in this keyring when secure
+incoming connections are made. This can be seen in an example program that can
+be found at:
+
+ http://people.redhat.com/~dhowells/rxrpc/listen.c
+
+
+====================
+EXAMPLE CLIENT USAGE
+====================
+
+A client would issue an operation by:
+
+ (1) An RxRPC socket is set up by::
+
+ client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
+
+ Where the third parameter indicates the protocol family of the transport
+ socket used - usually IPv4 but it can also be IPv6 [TODO].
+
+ (2) A local address can optionally be bound::
+
+ struct sockaddr_rxrpc srx = {
+ .srx_family = AF_RXRPC,
+ .srx_service = 0, /* we're a client */
+ .transport_type = SOCK_DGRAM, /* type of transport socket */
+ .transport.sin_family = AF_INET,
+ .transport.sin_port = htons(7000), /* AFS callback */
+ .transport.sin_address = 0, /* all local interfaces */
+ };
+ bind(client, &srx, sizeof(srx));
+
+ This specifies the local UDP port to be used. If not given, a random
+ non-privileged port will be used. A UDP port may be shared between
+ several unrelated RxRPC sockets. Security is handled on a basis of
+ per-RxRPC virtual connection.
+
+ (3) The security is set::
+
+ const char *key = "AFS:cambridge.redhat.com";
+ setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key));
+
+ This issues a request_key() to get the key representing the security
+ context. The minimum security level can be set::
+
+ unsigned int sec = RXRPC_SECURITY_ENCRYPT;
+ setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
+ &sec, sizeof(sec));
+
+ (4) The server to be contacted can then be specified (alternatively this can
+ be done through sendmsg)::
+
+ struct sockaddr_rxrpc srx = {
+ .srx_family = AF_RXRPC,
+ .srx_service = VL_SERVICE_ID,
+ .transport_type = SOCK_DGRAM, /* type of transport socket */
+ .transport.sin_family = AF_INET,
+ .transport.sin_port = htons(7005), /* AFS volume manager */
+ .transport.sin_address = ...,
+ };
+ connect(client, &srx, sizeof(srx));
+
+ (5) The request data should then be posted to the server socket using a series
+ of sendmsg() calls, each with the following control message attached:
+
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
+
+ MSG_MORE should be set in msghdr::msg_flags on all but the last part of
+ the request. Multiple requests may be made simultaneously.
+
+ An RXRPC_TX_LENGTH control message can also be specified on the first
+ sendmsg() call.
+
+ If a call is intended to go to a destination other than the default
+ specified through connect(), then msghdr::msg_name should be set on the
+ first request message of that call.
+
+ (6) The reply data will then be posted to the server socket for recvmsg() to
+ pick up. MSG_MORE will be flagged by recvmsg() if there's more reply data
+ for a particular call to be read. MSG_EOR will be set on the terminal
+ read for a call.
+
+ All data will be delivered with the following control message attached:
+
+ RXRPC_USER_CALL_ID - specifies the user ID for this call
+
+ If an abort or error occurred, this will be returned in the control data
+ buffer instead, and MSG_EOR will be flagged to indicate the end of that
+ call.
+
+A client may ask for a service ID it knows and ask that this be upgraded to a
+better service if one is available by supplying RXRPC_UPGRADE_SERVICE on the
+first sendmsg() of a call. The client should then check srx_service in the
+msg_name filled in by recvmsg() when collecting the result. srx_service will
+hold the same value as given to sendmsg() if the upgrade request was ignored by
+the service - otherwise it will be altered to indicate the service ID the
+server upgraded to. Note that the upgraded service ID is chosen by the server.
+The caller has to wait until it sees the service ID in the reply before sending
+any more calls (further calls to the same destination will be blocked until the
+probe is concluded).
+
+
+Example Server Usage
+====================
+
+A server would be set up to accept operations in the following manner:
+
+ (1) An RxRPC socket is created by::
+
+ server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
+
+ Where the third parameter indicates the address type of the transport
+ socket used - usually IPv4.
+
+ (2) Security is set up if desired by giving the socket a keyring with server
+ secret keys in it::
+
+ keyring = add_key("keyring", "AFSkeys", NULL, 0,
+ KEY_SPEC_PROCESS_KEYRING);
+
+ const char secret_key[8] = {
+ 0xa7, 0x83, 0x8a, 0xcb, 0xc7, 0x83, 0xec, 0x94 };
+ add_key("rxrpc_s", "52:2", secret_key, 8, keyring);
+
+ setsockopt(server, SOL_RXRPC, RXRPC_SECURITY_KEYRING, "AFSkeys", 7);
+
+ The keyring can be manipulated after it has been given to the socket. This
+ permits the server to add more keys, replace keys, etc. while it is live.
+
+ (3) A local address must then be bound::
+
+ struct sockaddr_rxrpc srx = {
+ .srx_family = AF_RXRPC,
+ .srx_service = VL_SERVICE_ID, /* RxRPC service ID */
+ .transport_type = SOCK_DGRAM, /* type of transport socket */
+ .transport.sin_family = AF_INET,
+ .transport.sin_port = htons(7000), /* AFS callback */
+ .transport.sin_address = 0, /* all local interfaces */
+ };
+ bind(server, &srx, sizeof(srx));
+
+ More than one service ID may be bound to a socket, provided the transport
+ parameters are the same. The limit is currently two. To do this, bind()
+ should be called twice.
+
+ (4) If service upgrading is required, first two service IDs must have been
+ bound and then the following option must be set::
+
+ unsigned short service_ids[2] = { from_ID, to_ID };
+ setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE,
+ service_ids, sizeof(service_ids));
+
+ This will automatically upgrade connections on service from_ID to service
+ to_ID if they request it. This will be reflected in msg_name obtained
+ through recvmsg() when the request data is delivered to userspace.
+
+ (5) The server is then set to listen out for incoming calls::
+
+ listen(server, 100);
+
+ (6) The kernel notifies the server of pending incoming connections by sending
+ it a message for each. This is received with recvmsg() on the server
+ socket. It has no data, and has a single dataless control message
+ attached::
+
+ RXRPC_NEW_CALL
+
+ The address that can be passed back by recvmsg() at this point should be
+ ignored since the call for which the message was posted may have gone by
+ the time it is accepted - in which case the first call still on the queue
+ will be accepted.
+
+ (7) The server then accepts the new call by issuing a sendmsg() with two
+ pieces of control data and no actual data:
+
+ ================== ==============================
+ RXRPC_ACCEPT indicate connection acceptance
+ RXRPC_USER_CALL_ID specify user ID for this call
+ ================== ==============================
+
+ (8) The first request data packet will then be posted to the server socket for
+ recvmsg() to pick up. At that point, the RxRPC address for the call can
+ be read from the address fields in the msghdr struct.
+
+ Subsequent request data will be posted to the server socket for recvmsg()
+ to collect as it arrives. All but the last piece of the request data will
+ be delivered with MSG_MORE flagged.
+
+ All data will be delivered with the following control message attached:
+
+
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
+
+ (9) The reply data should then be posted to the server socket using a series
+ of sendmsg() calls, each with the following control messages attached:
+
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
+
+ MSG_MORE should be set in msghdr::msg_flags on all but the last message
+ for a particular call.
+
+(10) The final ACK from the client will be posted for retrieval by recvmsg()
+ when it is received. It will take the form of a dataless message with two
+ control messages attached:
+
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ RXRPC_ACK indicates final ACK (no data)
+ ================== ===================================
+
+ MSG_EOR will be flagged to indicate that this is the final message for
+ this call.
+
+(11) Up to the point the final packet of reply data is sent, the call can be
+ aborted by calling sendmsg() with a dataless message with the following
+ control messages attached:
+
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ RXRPC_ABORT indicates abort code (4 byte data)
+ ================== ===================================
+
+ Any packets waiting in the socket's receive queue will be discarded if
+ this is issued.
+
+Note that all the communications for a particular service take place through
+the one server socket, using control messages on sendmsg() and recvmsg() to
+determine the call affected.
+
+
+AF_RXRPC Kernel Interface
+=========================
+
+The AF_RXRPC module also provides an interface for use by in-kernel utilities
+such as the AFS filesystem. This permits such a utility to:
+
+ (1) Use different keys directly on individual client calls on one socket
+ rather than having to open a whole slew of sockets, one for each key it
+ might want to use.
+
+ (2) Avoid having RxRPC call request_key() at the point of issue of a call or
+ opening of a socket. Instead the utility is responsible for requesting a
+ key at the appropriate point. AFS, for instance, would do this during VFS
+ operations such as open() or unlink(). The key is then handed through
+ when the call is initiated.
+
+ (3) Request the use of something other than GFP_KERNEL to allocate memory.
+
+ (4) Avoid the overhead of using the recvmsg() call. RxRPC messages can be
+ intercepted before they get put into the socket Rx queue and the socket
+ buffers manipulated directly.
+
+To use the RxRPC facility, a kernel utility must still open an AF_RXRPC socket,
+bind an address as appropriate and listen if it's to be a server socket, but
+then it passes this to the kernel interface functions.
+
+The kernel interface functions are as follows:
+
+ (#) Begin a new client call::
+
+ struct rxrpc_call *
+ rxrpc_kernel_begin_call(struct socket *sock,
+ struct sockaddr_rxrpc *srx,
+ struct key *key,
+ unsigned long user_call_ID,
+ s64 tx_total_len,
+ gfp_t gfp,
+ rxrpc_notify_rx_t notify_rx,
+ bool upgrade,
+ bool intr,
+ unsigned int debug_id);
+
+ This allocates the infrastructure to make a new RxRPC call and assigns
+ call and connection numbers. The call will be made on the UDP port that
+ the socket is bound to. The call will go to the destination address of a
+ connected client socket unless an alternative is supplied (srx is
+ non-NULL).
+
+ If a key is supplied then this will be used to secure the call instead of
+ the key bound to the socket with the RXRPC_SECURITY_KEY sockopt. Calls
+ secured in this way will still share connections if at all possible.
+
+ The user_call_ID is equivalent to that supplied to sendmsg() in the
+ control data buffer. It is entirely feasible to use this to point to a
+ kernel data structure.
+
+ tx_total_len is the amount of data the caller is intending to transmit
+ with this call (or -1 if unknown at this point). Setting the data size
+ allows the kernel to encrypt directly to the packet buffers, thereby
+ saving a copy. The value may not be less than -1.
+
+ notify_rx is a pointer to a function to be called when events such as
+ incoming data packets or remote aborts happen.
+
+ upgrade should be set to true if a client operation should request that
+ the server upgrade the service to a better one. The resultant service ID
+ is returned by rxrpc_kernel_recv_data().
+
+ intr should be set to true if the call should be interruptible. If this
+ is not set, this function may not return until a channel has been
+ allocated; if it is set, the function may return -ERESTARTSYS.
+
+ debug_id is the call debugging ID to be used for tracing. This can be
+ obtained by atomically incrementing rxrpc_debug_id.
+
+ If this function is successful, an opaque reference to the RxRPC call is
+ returned. The caller now holds a reference on this and it must be
+ properly ended.
+
+ (#) Shut down a client call::
+
+ void rxrpc_kernel_shutdown_call(struct socket *sock,
+ struct rxrpc_call *call);
+
+ This is used to shut down a previously begun call. The user_call_ID is
+ expunged from AF_RXRPC's knowledge and will not be seen again in
+ association with the specified call.
+
+ (#) Release the ref on a client call::
+
+ void rxrpc_kernel_put_call(struct socket *sock,
+ struct rxrpc_call *call);
+
+ This is used to release the caller's ref on an rxrpc call.
+
+ (#) Send data through a call::
+
+ typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk,
+ unsigned long user_call_ID,
+ struct sk_buff *skb);
+
+ int rxrpc_kernel_send_data(struct socket *sock,
+ struct rxrpc_call *call,
+ struct msghdr *msg,
+ size_t len,
+ rxrpc_notify_end_tx_t notify_end_rx);
+
+ This is used to supply either the request part of a client call or the
+ reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the
+ data buffers to be used. msg_iov may not be NULL and must point
+ exclusively to in-kernel virtual addresses. msg.msg_flags may be given
+ MSG_MORE if there will be subsequent data sends for this call.
+
+ The msg must not specify a destination address, control data or any flags
+ other than MSG_MORE. len is the total amount of data to transmit.
+
+ notify_end_rx can be NULL or it can be used to specify a function to be
+ called when the call changes state to end the Tx phase. This function is
+ called with a spinlock held to prevent the last DATA packet from being
+ transmitted until the function returns.
+
+ (#) Receive data from a call::
+
+ int rxrpc_kernel_recv_data(struct socket *sock,
+ struct rxrpc_call *call,
+ void *buf,
+ size_t size,
+ size_t *_offset,
+ bool want_more,
+ u32 *_abort,
+ u16 *_service)
+
+ This is used to receive data from either the reply part of a client call
+ or the request part of a service call. buf and size specify how much
+ data is desired and where to store it. *_offset is added on to buf and
+ subtracted from size internally; the amount copied into the buffer is
+ added to *_offset before returning.
+
+ want_more should be true if further data will be required after this is
+ satisfied and false if this is the last item of the receive phase.
+
+ There are three normal returns: 0 if the buffer was filled and want_more
+ was true; 1 if the buffer was filled, the last DATA packet has been
+ emptied and want_more was false; and -EAGAIN if the function needs to be
+ called again.
+
+ If the last DATA packet is processed but the buffer contains less than
+ the amount requested, EBADMSG is returned. If want_more wasn't set, but
+ more data was available, EMSGSIZE is returned.
+
+ If a remote ABORT is detected, the abort code received will be stored in
+ ``*_abort`` and ECONNABORTED will be returned.
+
+ The service ID that the call ended up with is returned into *_service.
+ This can be used to see if a call got a service upgrade.
+
+ (#) Abort a call??
+
+ ::
+
+ void rxrpc_kernel_abort_call(struct socket *sock,
+ struct rxrpc_call *call,
+ u32 abort_code);
+
+ This is used to abort a call if it's still in an abortable state. The
+ abort code specified will be placed in the ABORT message sent.
+
+ (#) Intercept received RxRPC messages::
+
+ typedef void (*rxrpc_interceptor_t)(struct sock *sk,
+ unsigned long user_call_ID,
+ struct sk_buff *skb);
+
+ void
+ rxrpc_kernel_intercept_rx_messages(struct socket *sock,
+ rxrpc_interceptor_t interceptor);
+
+ This installs an interceptor function on the specified AF_RXRPC socket.
+ All messages that would otherwise wind up in the socket's Rx queue are
+ then diverted to this function. Note that care must be taken to process
+ the messages in the right order to maintain DATA message sequentiality.
+
+ The interceptor function itself is provided with the address of the socket
+ and handling the incoming message, the ID assigned by the kernel utility
+ to the call and the socket buffer containing the message.
+
+ The skb->mark field indicates the type of message:
+
+ =============================== =======================================
+ Mark Meaning
+ =============================== =======================================
+ RXRPC_SKB_MARK_DATA Data message
+ RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call
+ RXRPC_SKB_MARK_BUSY Client call rejected as server busy
+ RXRPC_SKB_MARK_REMOTE_ABORT Call aborted by peer
+ RXRPC_SKB_MARK_NET_ERROR Network error detected
+ RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered
+ RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance
+ =============================== =======================================
+
+ The remote abort message can be probed with rxrpc_kernel_get_abort_code().
+ The two error messages can be probed with rxrpc_kernel_get_error_number().
+ A new call can be accepted with rxrpc_kernel_accept_call().
+
+ Data messages can have their contents extracted with the usual bunch of
+ socket buffer manipulation functions. A data message can be determined to
+ be the last one in a sequence with rxrpc_kernel_is_data_last(). When a
+ data message has been used up, rxrpc_kernel_data_consumed() should be
+ called on it.
+
+ Messages should be handled to rxrpc_kernel_free_skb() to dispose of. It
+ is possible to get extra refs on all types of message for later freeing,
+ but this may pin the state of a call until the message is finally freed.
+
+ (#) Accept an incoming call::
+
+ struct rxrpc_call *
+ rxrpc_kernel_accept_call(struct socket *sock,
+ unsigned long user_call_ID);
+
+ This is used to accept an incoming call and to assign it a call ID. This
+ function is similar to rxrpc_kernel_begin_call() and calls accepted must
+ be ended in the same way.
+
+ If this function is successful, an opaque reference to the RxRPC call is
+ returned. The caller now holds a reference on this and it must be
+ properly ended.
+
+ (#) Reject an incoming call::
+
+ int rxrpc_kernel_reject_call(struct socket *sock);
+
+ This is used to reject the first incoming call on the socket's queue with
+ a BUSY message. -ENODATA is returned if there were no incoming calls.
+ Other errors may be returned if the call had been aborted (-ECONNABORTED)
+ or had timed out (-ETIME).
+
+ (#) Allocate a null key for doing anonymous security::
+
+ struct key *rxrpc_get_null_key(const char *keyname);
+
+ This is used to allocate a null RxRPC key that can be used to indicate
+ anonymous security for a particular domain.
+
+ (#) Get the peer address of a call::
+
+ void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call,
+ struct sockaddr_rxrpc *_srx);
+
+ This is used to find the remote peer address of a call.
+
+ (#) Set the total transmit data size on a call::
+
+ void rxrpc_kernel_set_tx_length(struct socket *sock,
+ struct rxrpc_call *call,
+ s64 tx_total_len);
+
+ This sets the amount of data that the caller is intending to transmit on a
+ call. It's intended to be used for setting the reply size as the request
+ size should be set when the call is begun. tx_total_len may not be less
+ than zero.
+
+ (#) Get call RTT::
+
+ u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call);
+
+ Get the RTT time to the peer in use by a call. The value returned is in
+ nanoseconds.
+
+ (#) Check call still alive::
+
+ bool rxrpc_kernel_check_life(struct socket *sock,
+ struct rxrpc_call *call,
+ u32 *_life);
+ void rxrpc_kernel_probe_life(struct socket *sock,
+ struct rxrpc_call *call);
+
+ The first function passes back in ``*_life`` a number that is updated when
+ ACKs are received from the peer (notably including PING RESPONSE ACKs
+ which we can elicit by sending PING ACKs to see if the call still exists
+ on the server). The caller should compare the numbers of two calls to see
+ if the call is still alive after waiting for a suitable interval. It also
+ returns true as long as the call hasn't yet reached the completed state.
+
+ This allows the caller to work out if the server is still contactable and
+ if the call is still alive on the server while waiting for the server to
+ process a client operation.
+
+ The second function causes a ping ACK to be transmitted to try to provoke
+ the peer into responding, which would then cause the value returned by the
+ first function to change. Note that this must be called in TASK_RUNNING
+ state.
+
+ (#) Get remote client epoch::
+
+ u32 rxrpc_kernel_get_epoch(struct socket *sock,
+ struct rxrpc_call *call)
+
+ This allows the epoch that's contained in packets of an incoming client
+ call to be queried. This value is returned. The function always
+ successful if the call is still in progress. It shouldn't be called once
+ the call has expired. Note that calling this on a local client call only
+ returns the local epoch.
+
+ This value can be used to determine if the remote client has been
+ restarted as it shouldn't change otherwise.
+
+ (#) Set the maximum lifespan on a call::
+
+ void rxrpc_kernel_set_max_life(struct socket *sock,
+ struct rxrpc_call *call,
+ unsigned long hard_timeout)
+
+ This sets the maximum lifespan on a call to hard_timeout (which is in
+ jiffies). In the event of the timeout occurring, the call will be
+ aborted and -ETIME or -ETIMEDOUT will be returned.
+
+ (#) Apply the RXRPC_MIN_SECURITY_LEVEL sockopt to a socket from within in the
+ kernel::
+
+ int rxrpc_sock_set_min_security_level(struct sock *sk,
+ unsigned int val);
+
+ This specifies the minimum security level required for calls on this
+ socket.
+
+
+Configurable Parameters
+=======================
+
+The RxRPC protocol driver has a number of configurable parameters that can be
+adjusted through sysctls in /proc/net/rxrpc/:
+
+ (#) req_ack_delay
+
+ The amount of time in milliseconds after receiving a packet with the
+ request-ack flag set before we honour the flag and actually send the
+ requested ack.
+
+ Usually the other side won't stop sending packets until the advertised
+ reception window is full (to a maximum of 255 packets), so delaying the
+ ACK permits several packets to be ACK'd in one go.
+
+ (#) soft_ack_delay
+
+ The amount of time in milliseconds after receiving a new packet before we
+ generate a soft-ACK to tell the sender that it doesn't need to resend.
+
+ (#) idle_ack_delay
+
+ The amount of time in milliseconds after all the packets currently in the
+ received queue have been consumed before we generate a hard-ACK to tell
+ the sender it can free its buffers, assuming no other reason occurs that
+ we would send an ACK.
+
+ (#) resend_timeout
+
+ The amount of time in milliseconds after transmitting a packet before we
+ transmit it again, assuming no ACK is received from the receiver telling
+ us they got it.
+
+ (#) max_call_lifetime
+
+ The maximum amount of time in seconds that a call may be in progress
+ before we preemptively kill it.
+
+ (#) dead_call_expiry
+
+ The amount of time in seconds before we remove a dead call from the call
+ list. Dead calls are kept around for a little while for the purpose of
+ repeating ACK and ABORT packets.
+
+ (#) connection_expiry
+
+ The amount of time in seconds after a connection was last used before we
+ remove it from the connection list. While a connection is in existence,
+ it serves as a placeholder for negotiated security; when it is deleted,
+ the security must be renegotiated.
+
+ (#) transport_expiry
+
+ The amount of time in seconds after a transport was last used before we
+ remove it from the transport list. While a transport is in existence, it
+ serves to anchor the peer data and keeps the connection ID counter.
+
+ (#) rxrpc_rx_window_size
+
+ The size of the receive window in packets. This is the maximum number of
+ unconsumed received packets we're willing to hold in memory for any
+ particular call.
+
+ (#) rxrpc_rx_mtu
+
+ The maximum packet MTU size that we're willing to receive in bytes. This
+ indicates to the peer whether we're willing to accept jumbo packets.
+
+ (#) rxrpc_rx_jumbo_max
+
+ The maximum number of packets that we're willing to accept in a jumbo
+ packet. Non-terminal packets in a jumbo packet must contain a four byte
+ header plus exactly 1412 bytes of data. The terminal packet must contain
+ a four byte header plus any amount of data. In any event, a jumbo packet
+ may not exceed rxrpc_rx_mtu in size.
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
new file mode 100644
index 0000000000..92c9fb46d6
--- /dev/null
+++ b/Documentation/networking/scaling.rst
@@ -0,0 +1,523 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Scaling in the Linux Networking Stack
+=====================================
+
+
+Introduction
+============
+
+This document describes a set of complementary techniques in the Linux
+networking stack to increase parallelism and improve performance for
+multi-processor systems.
+
+The following technologies are described:
+
+- RSS: Receive Side Scaling
+- RPS: Receive Packet Steering
+- RFS: Receive Flow Steering
+- Accelerated Receive Flow Steering
+- XPS: Transmit Packet Steering
+
+
+RSS: Receive Side Scaling
+=========================
+
+Contemporary NICs support multiple receive and transmit descriptor queues
+(multi-queue). On reception, a NIC can send different packets to different
+queues to distribute processing among CPUs. The NIC distributes packets by
+applying a filter to each packet that assigns it to one of a small number
+of logical flows. Packets for each flow are steered to a separate receive
+queue, which in turn can be processed by separate CPUs. This mechanism is
+generally known as “Receive-side Scaling” (RSS). The goal of RSS and
+the other scaling techniques is to increase performance uniformly.
+Multi-queue distribution can also be used for traffic prioritization, but
+that is not the focus of these techniques.
+
+The filter used in RSS is typically a hash function over the network
+and/or transport layer headers-- for example, a 4-tuple hash over
+IP addresses and TCP ports of a packet. The most common hardware
+implementation of RSS uses a 128-entry indirection table where each entry
+stores a queue number. The receive queue for a packet is determined
+by masking out the low order seven bits of the computed hash for the
+packet (usually a Toeplitz hash), taking this number as a key into the
+indirection table and reading the corresponding value.
+
+Some advanced NICs allow steering packets to queues based on
+programmable filters. For example, webserver bound TCP port 80 packets
+can be directed to their own receive queue. Such “n-tuple” filters can
+be configured from ethtool (--config-ntuple).
+
+
+RSS Configuration
+-----------------
+
+The driver for a multi-queue capable NIC typically provides a kernel
+module parameter for specifying the number of hardware queues to
+configure. In the bnx2x driver, for instance, this parameter is called
+num_queues. A typical RSS configuration would be to have one receive queue
+for each CPU if the device supports enough queues, or otherwise at least
+one for each memory domain, where a memory domain is a set of CPUs that
+share a particular memory level (L1, L2, NUMA node, etc.).
+
+The indirection table of an RSS device, which resolves a queue by masked
+hash, is usually programmed by the driver at initialization. The
+default mapping is to distribute the queues evenly in the table, but the
+indirection table can be retrieved and modified at runtime using ethtool
+commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
+indirection table could be done to give different queues different
+relative weights.
+
+
+RSS IRQ Configuration
+~~~~~~~~~~~~~~~~~~~~~
+
+Each receive queue has a separate IRQ associated with it. The NIC triggers
+this to notify a CPU when new packets arrive on the given queue. The
+signaling path for PCIe devices uses message signaled interrupts (MSI-X),
+that can route each interrupt to a particular CPU. The active mapping
+of queues to IRQs can be determined from /proc/interrupts. By default,
+an IRQ may be handled on any CPU. Because a non-negligible part of packet
+processing takes place in receive interrupt handling, it is advantageous
+to spread receive interrupts between CPUs. To manually adjust the IRQ
+affinity of each interrupt see Documentation/core-api/irq/irq-affinity.rst. Some systems
+will be running irqbalance, a daemon that dynamically optimizes IRQ
+assignments and as a result may override any manual settings.
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+RSS should be enabled when latency is a concern or whenever receive
+interrupt processing forms a bottleneck. Spreading load between CPUs
+decreases queue length. For low latency networking, the optimal setting
+is to allocate as many queues as there are CPUs in the system (or the
+NIC maximum, if lower). The most efficient high-rate configuration
+is likely the one with the smallest number of receive queues where no
+receive queue overflows due to a saturated CPU, because in default
+mode with interrupt coalescing enabled, the aggregate number of
+interrupts (and thus work) grows with each additional queue.
+
+Per-cpu load can be observed using the mpstat utility, but note that on
+processors with hyperthreading (HT), each hyperthread is represented as
+a separate CPU. For interrupt handling, HT has shown no benefit in
+initial tests, so limit the number of queues to the number of CPU cores
+in the system.
+
+
+RPS: Receive Packet Steering
+============================
+
+Receive Packet Steering (RPS) is logically a software implementation of
+RSS. Being in software, it is necessarily called later in the datapath.
+Whereas RSS selects the queue and hence CPU that will run the hardware
+interrupt handler, RPS selects the CPU to perform protocol processing
+above the interrupt handler. This is accomplished by placing the packet
+on the desired CPU’s backlog queue and waking up the CPU for processing.
+RPS has some advantages over RSS:
+
+1) it can be used with any NIC
+2) software filters can easily be added to hash over new protocols
+3) it does not increase hardware device interrupt rate (although it does
+ introduce inter-processor interrupts (IPIs))
+
+RPS is called during bottom half of the receive interrupt handler, when
+a driver sends a packet up the network stack with netif_rx() or
+netif_receive_skb(). These call the get_rps_cpu() function, which
+selects the queue that should process a packet.
+
+The first step in determining the target CPU for RPS is to calculate a
+flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
+depending on the protocol). This serves as a consistent hash of the
+associated flow of the packet. The hash is either provided by hardware
+or will be computed in the stack. Capable hardware can pass the hash in
+the receive descriptor for the packet; this would usually be the same
+hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
+skb->hash and can be used elsewhere in the stack as a hash of the
+packet’s flow.
+
+Each receive hardware queue has an associated list of CPUs to which
+RPS may enqueue packets for processing. For each received packet,
+an index into the list is computed from the flow hash modulo the size
+of the list. The indexed CPU is the target for processing the packet,
+and the packet is queued to the tail of that CPU’s backlog queue. At
+the end of the bottom half routine, IPIs are sent to any CPUs for which
+packets have been queued to their backlog queue. The IPI wakes backlog
+processing on the remote CPU, and any queued packets are then processed
+up the networking stack.
+
+
+RPS Configuration
+-----------------
+
+RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
+by default for SMP). Even when compiled in, RPS remains disabled until
+explicitly configured. The list of CPUs to which RPS may forward traffic
+can be configured for each receive queue using a sysfs file entry::
+
+ /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
+
+This file implements a bitmap of CPUs. RPS is disabled when it is zero
+(the default), in which case packets are processed on the interrupting
+CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to
+the bitmap.
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+For a single queue device, a typical RPS configuration would be to set
+the rps_cpus to the CPUs in the same memory domain of the interrupting
+CPU. If NUMA locality is not an issue, this could also be all CPUs in
+the system. At high interrupt rate, it might be wise to exclude the
+interrupting CPU from the map since that already performs much work.
+
+For a multi-queue system, if RSS is configured so that a hardware
+receive queue is mapped to each CPU, then RPS is probably redundant
+and unnecessary. If there are fewer hardware queues than CPUs, then
+RPS might be beneficial if the rps_cpus for each queue are the ones that
+share the same memory domain as the interrupting CPU for that queue.
+
+
+RPS Flow Limit
+--------------
+
+RPS scales kernel receive processing across CPUs without introducing
+reordering. The trade-off to sending all packets from the same flow
+to the same CPU is CPU load imbalance if flows vary in packet rate.
+In the extreme case a single flow dominates traffic. Especially on
+common server workloads with many concurrent connections, such
+behavior indicates a problem such as a misconfiguration or spoofed
+source Denial of Service attack.
+
+Flow Limit is an optional RPS feature that prioritizes small flows
+during CPU contention by dropping packets from large flows slightly
+ahead of those from small flows. It is active only when an RPS or RFS
+destination CPU approaches saturation. Once a CPU's input packet
+queue exceeds half the maximum queue length (as set by sysctl
+net.core.netdev_max_backlog), the kernel starts a per-flow packet
+count over the last 256 packets. If a flow exceeds a set ratio (by
+default, half) of these packets when a new packet arrives, then the
+new packet is dropped. Packets from other flows are still only
+dropped once the input packet queue reaches netdev_max_backlog.
+No packets are dropped when the input packet queue length is below
+the threshold, so flow limit does not sever connections outright:
+even large flows maintain connectivity.
+
+
+Interface
+~~~~~~~~~
+
+Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
+turned on. It is implemented for each CPU independently (to avoid lock
+and cache contention) and toggled per CPU by setting the relevant bit
+in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
+bitmap interface as rps_cpus (see above) when called from procfs::
+
+ /proc/sys/net/core/flow_limit_cpu_bitmap
+
+Per-flow rate is calculated by hashing each packet into a hashtable
+bucket and incrementing a per-bucket counter. The hash function is
+the same that selects a CPU in RPS, but as the number of buckets can
+be much larger than the number of CPUs, flow limit has finer-grained
+identification of large flows and fewer false positives. The default
+table has 4096 buckets. This value can be modified through sysctl::
+
+ net.core.flow_limit_table_len
+
+The value is only consulted when a new table is allocated. Modifying
+it does not update active tables.
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Flow limit is useful on systems with many concurrent connections,
+where a single connection taking up 50% of a CPU indicates a problem.
+In such environments, enable the feature on all CPUs that handle
+network rx interrupts (as set in /proc/irq/N/smp_affinity).
+
+The feature depends on the input packet queue length to exceed
+the flow limit threshold (50%) + the flow history length (256).
+Setting net.core.netdev_max_backlog to either 1000 or 10000
+performed well in experiments.
+
+
+RFS: Receive Flow Steering
+==========================
+
+While RPS steers packets solely based on hash, and thus generally
+provides good load distribution, it does not take into account
+application locality. This is accomplished by Receive Flow Steering
+(RFS). The goal of RFS is to increase datacache hitrate by steering
+kernel processing of packets to the CPU where the application thread
+consuming the packet is running. RFS relies on the same RPS mechanisms
+to enqueue packets onto the backlog of another CPU and to wake up that
+CPU.
+
+In RFS, packets are not forwarded directly by the value of their hash,
+but the hash is used as index into a flow lookup table. This table maps
+flows to the CPUs where those flows are being processed. The flow hash
+(see RPS section above) is used to calculate the index into this table.
+The CPU recorded in each entry is the one which last processed the flow.
+If an entry does not hold a valid CPU, then packets mapped to that entry
+are steered using plain RPS. Multiple table entries may point to the
+same CPU. Indeed, with many flows and few CPUs, it is very likely that
+a single application thread handles flows with many different flow hashes.
+
+rps_sock_flow_table is a global flow table that contains the *desired* CPU
+for flows: the CPU that is currently processing the flow in userspace.
+Each table value is a CPU index that is updated during calls to recvmsg
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
+tcp_splice_read()).
+
+When the scheduler moves a thread to a new CPU while it has outstanding
+receive packets on the old CPU, packets may arrive out of order. To
+avoid this, RFS uses a second flow table to track outstanding packets
+for each flow: rps_dev_flow_table is a table specific to each hardware
+receive queue of each device. Each table value stores a CPU index and a
+counter. The CPU index represents the *current* CPU onto which packets
+for this flow are enqueued for further kernel processing. Ideally, kernel
+and userspace processing occur on the same CPU, and hence the CPU index
+in both tables is identical. This is likely false if the scheduler has
+recently migrated a userspace thread while the kernel still has packets
+enqueued for kernel processing on the old CPU.
+
+The counter in rps_dev_flow_table values records the length of the current
+CPU's backlog when a packet in this flow was last enqueued. Each backlog
+queue has a head counter that is incremented on dequeue. A tail counter
+is computed as head counter + queue length. In other words, the counter
+in rps_dev_flow[i] records the last element in flow i that has
+been enqueued onto the currently designated CPU for flow i (of course,
+entry i is actually selected by hash and multiple flows may hash to the
+same entry i).
+
+And now the trick for avoiding out of order packets: when selecting the
+CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
+and the rps_dev_flow table of the queue that the packet was received on
+are compared. If the desired CPU for the flow (found in the
+rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
+table), the packet is enqueued onto that CPU’s backlog. If they differ,
+the current CPU is updated to match the desired CPU if one of the
+following is true:
+
+ - The current CPU's queue head counter >= the recorded tail counter
+ value in rps_dev_flow[i]
+ - The current CPU is unset (>= nr_cpu_ids)
+ - The current CPU is offline
+
+After this check, the packet is sent to the (possibly updated) current
+CPU. These rules aim to ensure that a flow only moves to a new CPU when
+there are no packets outstanding on the old CPU, as the outstanding
+packets could arrive later than those about to be processed on the new
+CPU.
+
+
+RFS Configuration
+-----------------
+
+RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
+by default for SMP). The functionality remains disabled until explicitly
+configured. The number of entries in the global flow table is set through::
+
+ /proc/sys/net/core/rps_sock_flow_entries
+
+The number of entries in the per-queue flow table are set through::
+
+ /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Both of these need to be set before RFS is enabled for a receive queue.
+Values for both are rounded up to the nearest power of two. The
+suggested flow count depends on the expected number of active connections
+at any given time, which may be significantly less than the number of open
+connections. We have found that a value of 32768 for rps_sock_flow_entries
+works fairly well on a moderately loaded server.
+
+For a single queue device, the rps_flow_cnt value for the single queue
+would normally be configured to the same value as rps_sock_flow_entries.
+For a multi-queue device, the rps_flow_cnt for each queue might be
+configured as rps_sock_flow_entries / N, where N is the number of
+queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
+are 16 configured receive queues, rps_flow_cnt for each queue might be
+configured as 2048.
+
+
+Accelerated RFS
+===============
+
+Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
+balancing mechanism that uses soft state to steer flows based on where
+the application thread consuming the packets of each flow is running.
+Accelerated RFS should perform better than RFS since packets are sent
+directly to a CPU local to the thread consuming the data. The target CPU
+will either be the same CPU where the application runs, or at least a CPU
+which is local to the application thread’s CPU in the cache hierarchy.
+
+To enable accelerated RFS, the networking stack calls the
+ndo_rx_flow_steer driver function to communicate the desired hardware
+queue for packets matching a particular flow. The network stack
+automatically calls this function every time a flow entry in
+rps_dev_flow_table is updated. The driver in turn uses a device specific
+method to program the NIC to steer the packets.
+
+The hardware queue for a flow is derived from the CPU recorded in
+rps_dev_flow_table. The stack consults a CPU to hardware queue map which
+is maintained by the NIC driver. This is an auto-generated reverse map of
+the IRQ affinity table shown by /proc/interrupts. Drivers can use
+functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
+to populate the map. For each CPU, the corresponding queue in the map is
+set to be one whose processing CPU is closest in cache locality.
+
+
+Accelerated RFS Configuration
+-----------------------------
+
+Accelerated RFS is only available if the kernel is compiled with
+CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
+It also requires that ntuple filtering is enabled via ethtool. The map
+of CPU to queues is automatically deduced from the IRQ affinities
+configured for each receive queue by the driver, so no additional
+configuration should be necessary.
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+This technique should be enabled whenever one wants to use RFS and the
+NIC supports hardware acceleration.
+
+
+XPS: Transmit Packet Steering
+=============================
+
+Transmit Packet Steering is a mechanism for intelligently selecting
+which transmit queue to use when transmitting a packet on a multi-queue
+device. This can be accomplished by recording two kinds of maps, either
+a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
+to hardware transmit queue(s).
+
+1. XPS using CPUs map
+
+The goal of this mapping is usually to assign queues
+exclusively to a subset of CPUs, where the transmit completions for
+these queues are processed on a CPU within this set. This choice
+provides two benefits. First, contention on the device queue lock is
+significantly reduced since fewer CPUs contend for the same queue
+(contention can be eliminated completely if each CPU has its own
+transmit queue). Secondly, cache miss rate on transmit completion is
+reduced, in particular for data cache lines that hold the sk_buff
+structures.
+
+2. XPS using receive queues map
+
+This mapping is used to pick transmit queue based on the receive
+queue(s) map configuration set by the administrator. A set of receive
+queues can be mapped to a set of transmit queues (many:many), although
+the common use case is a 1:1 mapping. This will enable sending packets
+on the same queue associations for transmit and receive. This is useful for
+busy polling multi-threaded workloads where there are challenges in
+associating a given CPU to a given application thread. The application
+threads are not pinned to CPUs and each thread handles packets
+received on a single queue. The receive queue number is cached in the
+socket for the connection. In this model, sending the packets on the same
+transmit queue corresponding to the associated receive queue has benefits
+in keeping the CPU overhead low. Transmit completion work is locked into
+the same queue-association that a given application is polling on. This
+avoids the overhead of triggering an interrupt on another CPU. When the
+application cleans up the packets during the busy poll, transmit completion
+may be processed along with it in the same thread context and so result in
+reduced latency.
+
+XPS is configured per transmit queue by setting a bitmap of
+CPUs/receive-queues that may use that queue to transmit. The reverse
+mapping, from CPUs to transmit queues or from receive-queues to transmit
+queues, is computed and maintained for each network device. When
+transmitting the first packet in a flow, the function get_xps_queue() is
+called to select a queue. This function uses the ID of the receive queue
+for the socket connection for a match in the receive queue-to-transmit queue
+lookup table. Alternatively, this function can also use the ID of the
+running CPU as a key into the CPU-to-queue lookup table. If the
+ID matches a single queue, that is used for transmission. If multiple
+queues match, one is selected by using the flow hash to compute an index
+into the set. When selecting the transmit queue based on receive queue(s)
+map, the transmit device is not validated against the receive device as it
+requires expensive lookup operation in the datapath.
+
+The queue chosen for transmitting a particular flow is saved in the
+corresponding socket structure for the flow (e.g. a TCP connection).
+This transmit queue is used for subsequent packets sent on the flow to
+prevent out of order (ooo) packets. The choice also amortizes the cost
+of calling get_xps_queues() over all packets in the flow. To avoid
+ooo packets, the queue for a flow can subsequently only be changed if
+skb->ooo_okay is set for a packet in the flow. This flag indicates that
+there are no outstanding packets in the flow, so the transmit queue can
+change without the risk of generating out of order packets. The
+transport layer is responsible for setting ooo_okay appropriately. TCP,
+for instance, sets the flag when all data for a connection has been
+acknowledged.
+
+XPS Configuration
+-----------------
+
+XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
+default for SMP). If compiled in, it is driver dependent whether, and
+how, XPS is configured at device init. The mapping of CPUs/receive-queues
+to transmit queue can be inspected and configured using sysfs:
+
+For selection based on CPUs map::
+
+ /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
+
+For selection based on receive-queues map::
+
+ /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+For a network device with a single transmission queue, XPS configuration
+has no effect, since there is no choice in this case. In a multi-queue
+system, XPS is preferably configured so that each CPU maps onto one queue.
+If there are as many queues as there are CPUs in the system, then each
+queue can also map onto one CPU, resulting in exclusive pairings that
+experience no contention. If there are fewer queues than CPUs, then the
+best CPUs to share a given queue are probably those that share the cache
+with the CPU that processes transmit completions for that queue
+(transmit interrupts).
+
+For transmit queue selection based on receive queue(s), XPS has to be
+explicitly configured mapping receive-queue(s) to transmit queue(s). If the
+user configuration for receive-queue map does not apply, then the transmit
+queue is selected based on the CPUs map.
+
+
+Per TX Queue rate limitation
+============================
+
+These are rate-limitation mechanisms implemented by HW, where currently
+a max-rate attribute is supported, by setting a Mbps value to::
+
+ /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate
+
+A value of zero means disabled, and this is the default.
+
+
+Further Information
+===================
+RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
+2.6.38. Original patches were submitted by Tom Herbert
+(therbert@google.com)
+
+Accelerated RFS was introduced in 2.6.35. Original patches were
+submitted by Ben Hutchings (bwh@kernel.org)
+
+Authors:
+
+- Tom Herbert (therbert@google.com)
+- Willem de Bruijn (willemb@google.com)
diff --git a/Documentation/networking/sctp.rst b/Documentation/networking/sctp.rst
new file mode 100644
index 0000000000..9f4d9c8a92
--- /dev/null
+++ b/Documentation/networking/sctp.rst
@@ -0,0 +1,42 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Linux Kernel SCTP
+=================
+
+This is the current BETA release of the Linux Kernel SCTP reference
+implementation.
+
+SCTP (Stream Control Transmission Protocol) is a IP based, message oriented,
+reliable transport protocol, with congestion control, support for
+transparent multi-homing, and multiple ordered streams of messages.
+RFC2960 defines the core protocol. The IETF SIGTRAN working group originally
+developed the SCTP protocol and later handed the protocol over to the
+Transport Area (TSVWG) working group for the continued evolvement of SCTP as a
+general purpose transport.
+
+See the IETF website (http://www.ietf.org) for further documents on SCTP.
+See http://www.ietf.org/rfc/rfc2960.txt
+
+The initial project goal is to create an Linux kernel reference implementation
+of SCTP that is RFC 2960 compliant and provides an programming interface
+referred to as the UDP-style API of the Sockets Extensions for SCTP, as
+proposed in IETF Internet-Drafts.
+
+Caveats
+=======
+
+- lksctp can be built as statically or as a module. However, be aware that
+ module removal of lksctp is not yet a safe activity.
+
+- There is tentative support for IPv6, but most work has gone towards
+ implementation and testing lksctp on IPv4.
+
+
+For more information, please visit the lksctp project website:
+
+ http://www.sf.net/projects/lksctp
+
+Or contact the lksctp developers through the mailing list:
+
+ <linux-sctp@vger.kernel.org>
diff --git a/Documentation/networking/secid.rst b/Documentation/networking/secid.rst
new file mode 100644
index 0000000000..b45141a980
--- /dev/null
+++ b/Documentation/networking/secid.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+LSM/SeLinux secid
+=================
+
+flowi structure:
+
+The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate
+the label of the flow. This label of the flow is currently used in selecting
+matching labeled xfrm(s).
+
+If this is an outbound flow, the label is derived from the socket, if any, or
+the incoming packet this flow is being generated as a response to (e.g. tcp
+resets, timewait ack, etc.). It is also conceivable that the label could be
+derived from other sources such as process context, device, etc., in special
+cases, as may be appropriate.
+
+If this is an inbound flow, the label is derived from the IPSec security
+associations, if any, used by the packet.
diff --git a/Documentation/networking/seg6-sysctl.rst b/Documentation/networking/seg6-sysctl.rst
new file mode 100644
index 0000000000..07c20e470b
--- /dev/null
+++ b/Documentation/networking/seg6-sysctl.rst
@@ -0,0 +1,39 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Seg6 Sysfs variables
+====================
+
+
+/proc/sys/net/conf/<iface>/seg6_* variables:
+============================================
+
+seg6_enabled - BOOL
+ Accept or drop SR-enabled IPv6 packets on this interface.
+
+ Relevant packets are those with SRH present and DA = local.
+
+ * 0 - disabled (default)
+ * not 0 - enabled
+
+seg6_require_hmac - INTEGER
+ Define HMAC policy for ingress SR-enabled packets on this interface.
+
+ * -1 - Ignore HMAC field
+ * 0 - Accept SR packets without HMAC, validate SR packets with HMAC
+ * 1 - Drop SR packets without HMAC, validate SR packets with HMAC
+
+ Default is 0.
+
+seg6_flowlabel - INTEGER
+ Controls the behaviour of computing the flowlabel of outer
+ IPv6 header in case of SR T.encaps
+
+ == =======================================================
+ -1 set flowlabel to zero.
+ 0 copy flowlabel from Inner packet in case of Inner IPv6
+ (Set flowlabel to 0 in case IPv4/L2)
+ 1 Compute the flowlabel using seg6_make_flowlabel()
+ == =======================================================
+
+ Default is 0.
diff --git a/Documentation/networking/segmentation-offloads.rst b/Documentation/networking/segmentation-offloads.rst
new file mode 100644
index 0000000000..085e8fab03
--- /dev/null
+++ b/Documentation/networking/segmentation-offloads.rst
@@ -0,0 +1,184 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Segmentation Offloads
+=====================
+
+
+Introduction
+============
+
+This document describes a set of techniques in the Linux networking stack
+to take advantage of segmentation offload capabilities of various NICs.
+
+The following technologies are described:
+ * TCP Segmentation Offload - TSO
+ * UDP Fragmentation Offload - UFO
+ * IPIP, SIT, GRE, and UDP Tunnel Offloads
+ * Generic Segmentation Offload - GSO
+ * Generic Receive Offload - GRO
+ * Partial Generic Segmentation Offload - GSO_PARTIAL
+ * SCTP acceleration with GSO - GSO_BY_FRAGS
+
+
+TCP Segmentation Offload
+========================
+
+TCP segmentation allows a device to segment a single frame into multiple
+frames with a data payload size specified in skb_shinfo()->gso_size.
+When TCP segmentation requested the bit for either SKB_GSO_TCPV4 or
+SKB_GSO_TCPV6 should be set in skb_shinfo()->gso_type and
+skb_shinfo()->gso_size should be set to a non-zero value.
+
+TCP segmentation is dependent on support for the use of partial checksum
+offload. For this reason TSO is normally disabled if the Tx checksum
+offload for a given device is disabled.
+
+In order to support TCP segmentation offload it is necessary to populate
+the network and transport header offsets of the skbuff so that the device
+drivers will be able determine the offsets of the IP or IPv6 header and the
+TCP header. In addition as CHECKSUM_PARTIAL is required csum_start should
+also point to the TCP header of the packet.
+
+For IPv4 segmentation we support one of two types in terms of the IP ID.
+The default behavior is to increment the IP ID with every segment. If the
+GSO type SKB_GSO_TCP_FIXEDID is specified then we will not increment the IP
+ID and all segments will use the same IP ID. If a device has
+NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO
+and we will either increment the IP ID for all frames, or leave it at a
+static value based on driver preference.
+
+
+UDP Fragmentation Offload
+=========================
+
+UDP fragmentation offload allows a device to fragment an oversized UDP
+datagram into multiple IPv4 fragments. Many of the requirements for UDP
+fragmentation offload are the same as TSO. However the IPv4 ID for
+fragments should not increment as a single IPv4 datagram is fragmented.
+
+UFO is deprecated: modern kernels will no longer generate UFO skbs, but can
+still receive them from tuntap and similar devices. Offload of UDP-based
+tunnel protocols is still supported.
+
+
+IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads
+========================================================
+
+In addition to the offloads described above it is possible for a frame to
+contain additional headers such as an outer tunnel. In order to account
+for such instances an additional set of segmentation offload types were
+introduced including SKB_GSO_IPXIP4, SKB_GSO_IPXIP6, SKB_GSO_GRE, and
+SKB_GSO_UDP_TUNNEL. These extra segmentation types are used to identify
+cases where there are more than just 1 set of headers. For example in the
+case of IPIP and SIT we should have the network and transport headers moved
+from the standard list of headers to "inner" header offsets.
+
+Currently only two levels of headers are supported. The convention is to
+refer to the tunnel headers as the outer headers, while the encapsulated
+data is normally referred to as the inner headers. Below is the list of
+calls to access the given headers:
+
+IPIP/SIT Tunnel::
+
+ Outer Inner
+ MAC skb_mac_header
+ Network skb_network_header skb_inner_network_header
+ Transport skb_transport_header
+
+UDP/GRE Tunnel::
+
+ Outer Inner
+ MAC skb_mac_header skb_inner_mac_header
+ Network skb_network_header skb_inner_network_header
+ Transport skb_transport_header skb_inner_transport_header
+
+In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and
+SKB_GSO_UDP_TUNNEL_CSUM. These two additional tunnel types reflect the
+fact that the outer header also requests to have a non-zero checksum
+included in the outer header.
+
+Finally there is SKB_GSO_TUNNEL_REMCSUM which indicates that a given tunnel
+header has requested a remote checksum offload. In this case the inner
+headers will be left with a partial checksum and only the outer header
+checksum will be computed.
+
+
+Generic Segmentation Offload
+============================
+
+Generic segmentation offload is a pure software offload that is meant to
+deal with cases where device drivers cannot perform the offloads described
+above. What occurs in GSO is that a given skbuff will have its data broken
+out over multiple skbuffs that have been resized to match the MSS provided
+via skb_shinfo()->gso_size.
+
+Before enabling any hardware segmentation offload a corresponding software
+offload is required in GSO. Otherwise it becomes possible for a frame to
+be re-routed between devices and end up being unable to be transmitted.
+
+
+Generic Receive Offload
+=======================
+
+Generic receive offload is the complement to GSO. Ideally any frame
+assembled by GRO should be segmented to create an identical sequence of
+frames using GSO, and any sequence of frames segmented by GSO should be
+able to be reassembled back to the original by GRO. The only exception to
+this is IPv4 ID in the case that the DF bit is set for a given IP header.
+If the value of the IPv4 ID is not sequentially incrementing it will be
+altered so that it is when a frame assembled via GRO is segmented via GSO.
+
+
+Partial Generic Segmentation Offload
+====================================
+
+Partial generic segmentation offload is a hybrid between TSO and GSO. What
+it effectively does is take advantage of certain traits of TCP and tunnels
+so that instead of having to rewrite the packet headers for each segment
+only the inner-most transport header and possibly the outer-most network
+header need to be updated. This allows devices that do not support tunnel
+offloads or tunnel offloads with checksum to still make use of segmentation.
+
+With the partial offload what occurs is that all headers excluding the
+inner transport header are updated such that they will contain the correct
+values for if the header was simply duplicated. The one exception to this
+is the outer IPv4 ID field. It is up to the device drivers to guarantee
+that the IPv4 ID field is incremented in the case that a given header does
+not have the DF bit set.
+
+
+SCTP acceleration with GSO
+===========================
+
+SCTP - despite the lack of hardware support - can still take advantage of
+GSO to pass one large packet through the network stack, rather than
+multiple small packets.
+
+This requires a different approach to other offloads, as SCTP packets
+cannot be just segmented to (P)MTU. Rather, the chunks must be contained in
+IP segments, padding respected. So unlike regular GSO, SCTP can't just
+generate a big skb, set gso_size to the fragmentation point and deliver it
+to IP layer.
+
+Instead, the SCTP protocol layer builds an skb with the segments correctly
+padded and stored as chained skbs, and skb_segment() splits based on those.
+To signal this, gso_size is set to the special value GSO_BY_FRAGS.
+
+Therefore, any code in the core networking stack must be aware of the
+possibility that gso_size will be GSO_BY_FRAGS and handle that case
+appropriately.
+
+There are some helpers to make this easier:
+
+- skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
+ an skb is an SCTP GSO skb.
+
+- For size checks, the skb_gso_validate_*_len family of helpers correctly
+ considers GSO_BY_FRAGS.
+
+- For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size
+ will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs.
+
+This also affects drivers with the NETIF_F_FRAGLIST & NETIF_F_GSO_SCTP bits
+set. Note also that NETIF_F_GSO_SCTP is included in NETIF_F_GSO_SOFTWARE.
diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst
new file mode 100644
index 0000000000..55b65f607a
--- /dev/null
+++ b/Documentation/networking/sfp-phylink.rst
@@ -0,0 +1,284 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+phylink
+=======
+
+Overview
+========
+
+phylink is a mechanism to support hot-pluggable networking modules
+directly connected to a MAC without needing to re-initialise the
+adapter on hot-plug events.
+
+phylink supports conventional phylib-based setups, fixed link setups
+and SFP (Small Formfactor Pluggable) modules at present.
+
+Modes of operation
+==================
+
+phylink has several modes of operation, which depend on the firmware
+settings.
+
+1. PHY mode
+
+ In PHY mode, we use phylib to read the current link settings from
+ the PHY, and pass them to the MAC driver. We expect the MAC driver
+ to configure exactly the modes that are specified without any
+ negotiation being enabled on the link.
+
+2. Fixed mode
+
+ Fixed mode is the same as PHY mode as far as the MAC driver is
+ concerned.
+
+3. In-band mode
+
+ In-band mode is used with 802.3z, SGMII and similar interface modes,
+ and we are expecting to use and honor the in-band negotiation or
+ control word sent across the serdes channel.
+
+By example, what this means is that:
+
+.. code-block:: none
+
+ &eth {
+ phy = <&phy>;
+ phy-mode = "sgmii";
+ };
+
+does not use in-band SGMII signalling. The PHY is expected to follow
+exactly the settings given to it in its :c:func:`mac_config` function.
+The link should be forced up or down appropriately in the
+:c:func:`mac_link_up` and :c:func:`mac_link_down` functions.
+
+.. code-block:: none
+
+ &eth {
+ managed = "in-band-status";
+ phy = <&phy>;
+ phy-mode = "sgmii";
+ };
+
+uses in-band mode, where results from the PHY's negotiation are passed
+to the MAC through the SGMII control word, and the MAC is expected to
+acknowledge the control word. The :c:func:`mac_link_up` and
+:c:func:`mac_link_down` functions must not force the MAC side link
+up and down.
+
+Rough guide to converting a network driver to sfp/phylink
+=========================================================
+
+This guide briefly describes how to convert a network driver from
+phylib to the sfp/phylink support. Please send patches to improve
+this documentation.
+
+1. Optionally split the network driver's phylib update function into
+ two parts dealing with link-down and link-up. This can be done as
+ a separate preparation commit.
+
+ An older example of this preparation can be found in git commit
+ fc548b991fb0, although this was splitting into three parts; the
+ link-up part now includes configuring the MAC for the link settings.
+ Please see :c:func:`mac_link_up` for more information on this.
+
+2. Replace::
+
+ select FIXED_PHY
+ select PHYLIB
+
+ with::
+
+ select PHYLINK
+
+ in the driver's Kconfig stanza.
+
+3. Add::
+
+ #include <linux/phylink.h>
+
+ to the driver's list of header files.
+
+4. Add::
+
+ struct phylink *phylink;
+ struct phylink_config phylink_config;
+
+ to the driver's private data structure. We shall refer to the
+ driver's private data pointer as ``priv`` below, and the driver's
+ private data structure as ``struct foo_priv``.
+
+5. Replace the following functions:
+
+ .. flat-table::
+ :header-rows: 1
+ :widths: 1 1
+ :stub-columns: 0
+
+ * - Original function
+ - Replacement function
+ * - phy_start(phydev)
+ - phylink_start(priv->phylink)
+ * - phy_stop(phydev)
+ - phylink_stop(priv->phylink)
+ * - phy_mii_ioctl(phydev, ifr, cmd)
+ - phylink_mii_ioctl(priv->phylink, ifr, cmd)
+ * - phy_ethtool_get_wol(phydev, wol)
+ - phylink_ethtool_get_wol(priv->phylink, wol)
+ * - phy_ethtool_set_wol(phydev, wol)
+ - phylink_ethtool_set_wol(priv->phylink, wol)
+ * - phy_disconnect(phydev)
+ - phylink_disconnect_phy(priv->phylink)
+
+ Please note that some of these functions must be called under the
+ rtnl lock, and will warn if not. This will normally be the case,
+ except if these are called from the driver suspend/resume paths.
+
+6. Add/replace ksettings get/set methods with:
+
+ .. code-block:: c
+
+ static int foo_ethtool_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
+ {
+ struct foo_priv *priv = netdev_priv(dev);
+
+ return phylink_ethtool_ksettings_set(priv->phylink, cmd);
+ }
+
+ static int foo_ethtool_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
+ {
+ struct foo_priv *priv = netdev_priv(dev);
+
+ return phylink_ethtool_ksettings_get(priv->phylink, cmd);
+ }
+
+7. Replace the call to::
+
+ phy_dev = of_phy_connect(dev, node, link_func, flags, phy_interface);
+
+ and associated code with a call to::
+
+ err = phylink_of_phy_connect(priv->phylink, node, flags);
+
+ For the most part, ``flags`` can be zero; these flags are passed to
+ the phy_attach_direct() inside this function call if a PHY is specified
+ in the DT node ``node``.
+
+ ``node`` should be the DT node which contains the network phy property,
+ fixed link properties, and will also contain the sfp property.
+
+ The setup of fixed links should also be removed; these are handled
+ internally by phylink.
+
+ of_phy_connect() was also passed a function pointer for link updates.
+ This function is replaced by a different form of MAC updates
+ described below in (8).
+
+ Manipulation of the PHY's supported/advertised happens within phylink
+ based on the validate callback, see below in (8).
+
+ Note that the driver no longer needs to store the ``phy_interface``,
+ and also note that ``phy_interface`` becomes a dynamic property,
+ just like the speed, duplex etc. settings.
+
+ Finally, note that the MAC driver has no direct access to the PHY
+ anymore; that is because in the phylink model, the PHY can be
+ dynamic.
+
+8. Add a :c:type:`struct phylink_mac_ops <phylink_mac_ops>` instance to
+ the driver, which is a table of function pointers, and implement
+ these functions. The old link update function for
+ :c:func:`of_phy_connect` becomes three methods: :c:func:`mac_link_up`,
+ :c:func:`mac_link_down`, and :c:func:`mac_config`. If step 1 was
+ performed, then the functionality will have been split there.
+
+ It is important that if in-band negotiation is used,
+ :c:func:`mac_link_up` and :c:func:`mac_link_down` do not prevent the
+ in-band negotiation from completing, since these functions are called
+ when the in-band link state changes - otherwise the link will never
+ come up.
+
+ The :c:func:`validate` method should mask the supplied supported mask,
+ and ``state->advertising`` with the supported ethtool link modes.
+ These are the new ethtool link modes, so bitmask operations must be
+ used. For an example, see ``drivers/net/ethernet/marvell/mvneta.c``.
+
+ The :c:func:`mac_link_state` method is used to read the link state
+ from the MAC, and report back the settings that the MAC is currently
+ using. This is particularly important for in-band negotiation
+ methods such as 1000base-X and SGMII.
+
+ The :c:func:`mac_link_up` method is used to inform the MAC that the
+ link has come up. The call includes the negotiation mode and interface
+ for reference only. The finalised link parameters are also supplied
+ (speed, duplex and flow control/pause enablement settings) which
+ should be used to configure the MAC when the MAC and PCS are not
+ tightly integrated, or when the settings are not coming from in-band
+ negotiation.
+
+ The :c:func:`mac_config` method is used to update the MAC with the
+ requested state, and must avoid unnecessarily taking the link down
+ when making changes to the MAC configuration. This means the
+ function should modify the state and only take the link down when
+ absolutely necessary to change the MAC configuration. An example
+ of how to do this can be found in :c:func:`mvneta_mac_config` in
+ ``drivers/net/ethernet/marvell/mvneta.c``.
+
+ For further information on these methods, please see the inline
+ documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`.
+
+9. Remove calls to of_parse_phandle() for the PHY,
+ of_phy_register_fixed_link() for fixed links etc. from the probe
+ function, and replace with:
+
+ .. code-block:: c
+
+ struct phylink *phylink;
+ priv->phylink_config.dev = &dev.dev;
+ priv->phylink_config.type = PHYLINK_NETDEV;
+
+ phylink = phylink_create(&priv->phylink_config, node, phy_mode, &phylink_ops);
+ if (IS_ERR(phylink)) {
+ err = PTR_ERR(phylink);
+ fail probe;
+ }
+
+ priv->phylink = phylink;
+
+ and arrange to destroy the phylink in the probe failure path as
+ appropriate and the removal path too by calling:
+
+ .. code-block:: c
+
+ phylink_destroy(priv->phylink);
+
+10. Arrange for MAC link state interrupts to be forwarded into
+ phylink, via:
+
+ .. code-block:: c
+
+ phylink_mac_change(priv->phylink, link_is_up);
+
+ where ``link_is_up`` is true if the link is currently up or false
+ otherwise. If a MAC is unable to provide these interrupts, then
+ it should set ``priv->phylink_config.pcs_poll = true;`` in step 9.
+
+11. Verify that the driver does not call::
+
+ netif_carrier_on()
+ netif_carrier_off()
+
+ as these will interfere with phylink's tracking of the link state,
+ and cause phylink to omit calls via the :c:func:`mac_link_up` and
+ :c:func:`mac_link_down` methods.
+
+Network drivers should call phylink_stop() and phylink_start() via their
+suspend/resume paths, which ensures that the appropriate
+:c:type:`struct phylink_mac_ops <phylink_mac_ops>` methods are called
+as necessary.
+
+For information describing the SFP cage in DT, please see the binding
+documentation in the kernel source tree
+``Documentation/devicetree/bindings/net/sff,sfp.yaml``.
diff --git a/Documentation/networking/skbuff.rst b/Documentation/networking/skbuff.rst
new file mode 100644
index 0000000000..5b74275a73
--- /dev/null
+++ b/Documentation/networking/skbuff.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+struct sk_buff
+==============
+
+:c:type:`sk_buff` is the main networking structure representing
+a packet.
+
+Basic sk_buff geometry
+----------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: Basic sk_buff geometry
+
+Shared skbs and skb clones
+--------------------------
+
+:c:member:`sk_buff.users` is a simple refcount allowing multiple entities
+to keep a struct sk_buff alive. skbs with a ``sk_buff.users != 1`` are referred
+to as shared skbs (see skb_shared()).
+
+skb_clone() allows for fast duplication of skbs. None of the data buffers
+get copied, but caller gets a new metadata struct (struct sk_buff).
+&skb_shared_info.refcount indicates the number of skbs pointing at the same
+packet data (i.e. clones).
+
+dataref and headerless skbs
+---------------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: dataref and headerless skbs
+
+Checksum information
+--------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: skb checksums
diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
new file mode 100644
index 0000000000..6d8acdbe9b
--- /dev/null
+++ b/Documentation/networking/smc-sysctl.rst
@@ -0,0 +1,61 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+SMC Sysctl
+==========
+
+/proc/sys/net/smc/* Variables
+=============================
+
+autocorking_size - INTEGER
+ Setting SMC auto corking size:
+ SMC auto corking is like TCP auto corking from the application's
+ perspective of view. When applications do consecutive small
+ write()/sendmsg() system calls, we try to coalesce these small writes
+ as much as possible, to lower total amount of CDC and RDMA Write been
+ sent.
+ autocorking_size limits the maximum corked bytes that can be sent to
+ the under device in 1 single sending. If set to 0, the SMC auto corking
+ is disabled.
+ Applications can still use TCP_CORK for optimal behavior when they
+ know how/when to uncork their sockets.
+
+ Default: 64K
+
+smcr_buf_type - INTEGER
+ Controls which type of sndbufs and RMBs to use in later newly created
+ SMC-R link group. Only for SMC-R.
+
+ Default: 0 (physically contiguous sndbufs and RMBs)
+
+ Possible values:
+
+ - 0 - Use physically contiguous buffers
+ - 1 - Use virtually contiguous buffers
+ - 2 - Mixed use of the two types. Try physically contiguous buffers first.
+ If not available, use virtually contiguous buffers then.
+
+smcr_testlink_time - INTEGER
+ How frequently SMC-R link sends out TEST_LINK LLC messages to confirm
+ viability, after the last activity of connections on it. Value 0 means
+ disabling TEST_LINK.
+
+ Default: 30 seconds.
+
+wmem - INTEGER
+ Initial size of send buffer used by SMC sockets.
+ The default value inherits from net.ipv4.tcp_wmem[1].
+
+ The minimum value is 16KiB and there is no hard limit for max value, but
+ only allowed 512KiB for SMC-R and 1MiB for SMC-D.
+
+ Default: 16K
+
+rmem - INTEGER
+ Initial size of receive buffer (RMB) used by SMC sockets.
+ The default value inherits from net.ipv4.tcp_rmem[1].
+
+ The minimum value is 16KiB and there is no hard limit for max value, but
+ only allowed 512KiB for SMC-R and 1MiB for SMC-D.
+
+ Default: 128K
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
new file mode 100644
index 0000000000..2136374744
--- /dev/null
+++ b/Documentation/networking/snmp_counter.rst
@@ -0,0 +1,1793 @@
+============
+SNMP counter
+============
+
+This document explains the meaning of SNMP counters.
+
+General IPv4 counters
+=====================
+All layer 4 packets and ICMP packets will change these counters, but
+these counters won't be changed by layer 2 packets (such as STP) or
+ARP packets.
+
+* IpInReceives
+
+Defined in `RFC1213 ipInReceives`_
+
+.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26
+
+The number of packets received by the IP layer. It gets increasing at the
+beginning of ip_rcv function, always be updated together with
+IpExtInOctets. It will be increased even if the packet is dropped
+later (e.g. due to the IP header is invalid or the checksum is wrong
+and so on). It indicates the number of aggregated segments after
+GRO/LRO.
+
+* IpInDelivers
+
+Defined in `RFC1213 ipInDelivers`_
+
+.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28
+
+The number of packets delivers to the upper layer protocols. E.g. TCP, UDP,
+ICMP and so on. If no one listens on a raw socket, only kernel
+supported protocols will be delivered, if someone listens on the raw
+socket, all valid IP packets will be delivered.
+
+* IpOutRequests
+
+Defined in `RFC1213 ipOutRequests`_
+
+.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28
+
+The number of packets sent via IP layer, for both single cast and
+multicast packets, and would always be updated together with
+IpExtOutOctets.
+
+* IpExtInOctets and IpExtOutOctets
+
+They are Linux kernel extensions, no RFC definitions. Please note,
+RFC1213 indeed defines ifInOctets and ifOutOctets, but they
+are different things. The ifInOctets and ifOutOctets include the MAC
+layer header size but IpExtInOctets and IpExtOutOctets don't, they
+only include the IP layer header and the IP layer data.
+
+* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts
+
+They indicate the number of four kinds of ECN IP packets, please refer
+`Explicit Congestion Notification`_ for more details.
+
+.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6
+
+These 4 counters calculate how many packets received per ECN
+status. They count the real frame number regardless the LRO/GRO. So
+for the same packet, you might find that IpInReceives count 1, but
+IpExtInNoECTPkts counts 2 or more.
+
+* IpInHdrErrors
+
+Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is
+dropped due to the IP header error. It might happen in both IP input
+and IP forward paths.
+
+.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27
+
+* IpInAddrErrors
+
+Defined in `RFC1213 ipInAddrErrors`_. It will be increased in two
+scenarios: (1) The IP address is invalid. (2) The destination IP
+address is not a local address and IP forwarding is not enabled
+
+.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27
+
+* IpExtInNoRoutes
+
+This counter means the packet is dropped when the IP stack receives a
+packet and can't find a route for it from the route table. It might
+happen when IP forwarding is enabled and the destination IP address is
+not a local address and there is no route for the destination IP
+address.
+
+* IpInUnknownProtos
+
+Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the
+layer 4 protocol is unsupported by kernel. If an application is using
+raw socket, kernel will always deliver the packet to the raw socket
+and this counter won't be increased.
+
+.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27
+
+* IpExtInTruncatedPkts
+
+For IPv4 packet, it means the actual data size is smaller than the
+"Total Length" field in the IPv4 header.
+
+* IpInDiscards
+
+Defined in `RFC1213 ipInDiscards`_. It indicates the packet is dropped
+in the IP receiving path and due to kernel internal reasons (e.g. no
+enough memory).
+
+.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28
+
+* IpOutDiscards
+
+Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is
+dropped in the IP sending path and due to kernel internal reasons.
+
+.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28
+
+* IpOutNoRoutes
+
+Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is
+dropped in the IP sending path and no route is found for it.
+
+.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29
+
+ICMP counters
+=============
+* IcmpInMsgs and IcmpOutMsgs
+
+Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_
+
+.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
+.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43
+
+As mentioned in the RFC1213, these two counters include errors, they
+would be increased even if the ICMP packet has an invalid type. The
+ICMP output path will check the header of a raw socket, so the
+IcmpOutMsgs would still be updated if the IP header is constructed by
+a userspace program.
+
+* ICMP named types
+
+| These counters include most of common ICMP types, they are:
+| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
+| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
+| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
+| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
+| IcmpInRedirects: `RFC1213 icmpInRedirects`_
+| IcmpInEchos: `RFC1213 icmpInEchos`_
+| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
+| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
+| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
+| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
+| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
+| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
+| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
+| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
+| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
+| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
+| IcmpOutEchos: `RFC1213 icmpOutEchos`_
+| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
+| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
+| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
+| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
+| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_
+
+.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
+.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
+.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
+.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
+.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
+.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
+.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
+.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
+.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
+.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
+.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43
+
+.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
+.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
+.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
+.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
+.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
+.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
+.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
+.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
+.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
+.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
+.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46
+
+Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
+Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
+straightforward. The 'In' counter means kernel receives such a packet
+and the 'Out' counter means kernel sends such a packet.
+
+* ICMP numeric types
+
+They are IcmpMsgInType[N] and IcmpMsgOutType[N], the [N] indicates the
+ICMP type number. These counters track all kinds of ICMP packets. The
+ICMP type number definition could be found in the `ICMP parameters`_
+document.
+
+.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
+
+For example, if the Linux kernel sends an ICMP Echo packet, the
+IcmpMsgOutType8 would increase 1. And if kernel gets an ICMP Echo Reply
+packet, IcmpMsgInType0 would increase 1.
+
+* IcmpInCsumErrors
+
+This counter indicates the checksum of the ICMP packet is
+wrong. Kernel verifies the checksum after updating the IcmpInMsgs and
+before updating IcmpMsgInType[N]. If a packet has bad checksum, the
+IcmpInMsgs would be updated but none of IcmpMsgInType[N] would be updated.
+
+* IcmpInErrors and IcmpOutErrors
+
+Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_
+
+.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
+.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43
+
+When an error occurs in the ICMP packet handler path, these two
+counters would be updated. The receiving packet path use IcmpInErrors
+and the sending packet path use IcmpOutErrors. When IcmpInCsumErrors
+is increased, IcmpInErrors would always be increased too.
+
+relationship of the ICMP counters
+---------------------------------
+The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
+are updated at the same time. The sum of IcmpMsgInType[N] plus
+IcmpInErrors should be equal or larger than IcmpInMsgs. When kernel
+receives an ICMP packet, kernel follows below logic:
+
+1. increase IcmpInMsgs
+2. if has any error, update IcmpInErrors and finish the process
+3. update IcmpMsgOutType[N]
+4. handle the packet depending on the type, if has any error, update
+ IcmpInErrors and finish the process
+
+So if all errors occur in step (2), IcmpInMsgs should be equal to the
+sum of IcmpMsgOutType[N] plus IcmpInErrors. If all errors occur in
+step (4), IcmpInMsgs should be equal to the sum of
+IcmpMsgOutType[N]. If the errors occur in both step (2) and step (4),
+IcmpInMsgs should be less than the sum of IcmpMsgOutType[N] plus
+IcmpInErrors.
+
+General TCP counters
+====================
+* TcpInSegs
+
+Defined in `RFC1213 tcpInSegs`_
+
+.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48
+
+The number of packets received by the TCP layer. As mentioned in
+RFC1213, it includes the packets received in error, such as checksum
+error, invalid TCP header and so on. Only one error won't be included:
+if the layer 2 destination address is not the NIC's layer 2
+address. It might happen if the packet is a multicast or broadcast
+packet, or the NIC is in promiscuous mode. In these situations, the
+packets would be delivered to the TCP layer, but the TCP layer will discard
+these packets before increasing TcpInSegs. The TcpInSegs counter
+isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
+counter would only increase 1.
+
+* TcpOutSegs
+
+Defined in `RFC1213 tcpOutSegs`_
+
+.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48
+
+The number of packets sent by the TCP layer. As mentioned in RFC1213,
+it excludes the retransmitted packets. But it includes the SYN, ACK
+and RST packets. Doesn't like TcpInSegs, the TcpOutSegs is aware of
+GSO, so if a packet would be split to 2 by GSO, TcpOutSegs will
+increase 2.
+
+* TcpActiveOpens
+
+Defined in `RFC1213 tcpActiveOpens`_
+
+.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47
+
+It means the TCP layer sends a SYN, and come into the SYN-SENT
+state. Every time TcpActiveOpens increases 1, TcpOutSegs should always
+increase 1.
+
+* TcpPassiveOpens
+
+Defined in `RFC1213 tcpPassiveOpens`_
+
+.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47
+
+It means the TCP layer receives a SYN, replies a SYN+ACK, come into
+the SYN-RCVD state.
+
+* TcpExtTCPRcvCoalesce
+
+When packets are received by the TCP layer and are not be read by the
+application, the TCP layer will try to merge them. This counter
+indicate how many packets are merged in such situation. If GRO is
+enabled, lots of packets would be merged by GRO, these packets
+wouldn't be counted to TcpExtTCPRcvCoalesce.
+
+* TcpExtTCPAutoCorking
+
+When sending packets, the TCP layer will try to merge small packets to
+a bigger one. This counter increase 1 for every packet merged in such
+situation. Please refer to the LWN article for more details:
+https://lwn.net/Articles/576263/
+
+* TcpExtTCPOrigDataSent
+
+This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
+explanation below::
+
+ TCPOrigDataSent: number of outgoing packets with original data (excluding
+ retransmission but including data-in-SYN). This counter is different from
+ TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
+ more useful to track the TCP retransmission rate.
+
+* TCPSynRetrans
+
+This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
+explanation below::
+
+ TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
+ retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
+
+* TCPFastOpenActiveFail
+
+This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
+explanation below::
+
+ TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
+ the remote does not accept it or the attempts timed out.
+
+.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
+
+* TcpExtListenOverflows and TcpExtListenDrops
+
+When kernel receives a SYN from a client, and if the TCP accept queue
+is full, kernel will drop the SYN and add 1 to TcpExtListenOverflows.
+At the same time kernel will also add 1 to TcpExtListenDrops. When a
+TCP socket is in LISTEN state, and kernel need to drop a packet,
+kernel would always add 1 to TcpExtListenDrops. So increase
+TcpExtListenOverflows would let TcpExtListenDrops increasing at the
+same time, but TcpExtListenDrops would also increase without
+TcpExtListenOverflows increasing, e.g. a memory allocation fail would
+also let TcpExtListenDrops increase.
+
+Note: The above explanation is based on kernel 4.10 or above version, on
+an old kernel, the TCP stack has different behavior when TCP accept
+queue is full. On the old kernel, TCP stack won't drop the SYN, it
+would complete the 3-way handshake. As the accept queue is full, TCP
+stack will keep the socket in the TCP half-open queue. As it is in the
+half open queue, TCP stack will send SYN+ACK on an exponential backoff
+timer, after client replies ACK, TCP stack checks whether the accept
+queue is still full, if it is not full, moves the socket to the accept
+queue, if it is full, keeps the socket in the half-open queue, at next
+time client replies ACK, this socket will get another chance to move
+to the accept queue.
+
+
+TCP Fast Open
+=============
+* TcpEstabResets
+
+Defined in `RFC1213 tcpEstabResets`_.
+
+.. _RFC1213 tcpEstabResets: https://tools.ietf.org/html/rfc1213#page-48
+
+* TcpAttemptFails
+
+Defined in `RFC1213 tcpAttemptFails`_.
+
+.. _RFC1213 tcpAttemptFails: https://tools.ietf.org/html/rfc1213#page-48
+
+* TcpOutRsts
+
+Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
+the 'segments sent containing the RST flag', but in linux kernel, this
+counter indicates the segments kernel tried to send. The sending
+process might be failed due to some errors (e.g. memory alloc failed).
+
+.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52
+
+* TcpExtTCPSpuriousRtxHostQueues
+
+When the TCP stack wants to retransmit a packet, and finds that packet
+is not lost in the network, but the packet is not sent yet, the TCP
+stack would give up the retransmission and update this counter. It
+might happen if a packet stays too long time in a qdisc or driver
+queue.
+
+* TcpEstabResets
+
+The socket receives a RST packet in Establish or CloseWait state.
+
+* TcpExtTCPKeepAlive
+
+This counter indicates many keepalive packets were sent. The keepalive
+won't be enabled by default. A userspace program could enable it by
+setting the SO_KEEPALIVE socket option.
+
+* TcpExtTCPSpuriousRTOs
+
+The spurious retransmission timeout detected by the `F-RTO`_
+algorithm.
+
+.. _F-RTO: https://tools.ietf.org/html/rfc5682
+
+TCP Fast Path
+=============
+When kernel receives a TCP packet, it has two paths to handler the
+packet, one is fast path, another is slow path. The comment in kernel
+code provides a good explanation of them, I pasted them below::
+
+ It is split into a fast path and a slow path. The fast path is
+ disabled when:
+
+ - A zero window was announced from us
+ - zero window probing
+ is only handled properly on the slow path.
+ - Out of order segments arrived.
+ - Urgent data is expected.
+ - There is no buffer space left
+ - Unexpected TCP flags/window values/header lengths are received
+ (detected by checking the TCP header against pred_flags)
+ - Data is sent in both directions. The fast path only supports pure senders
+ or pure receivers (this means either the sequence number or the ack
+ value must stay constant)
+ - Unexpected TCP option.
+
+Kernel will try to use fast path unless any of the above conditions
+are satisfied. If the packets are out of order, kernel will handle
+them in slow path, which means the performance might be not very
+good. Kernel would also come into slow path if the "Delayed ack" is
+used, because when using "Delayed ack", the data is sent in both
+directions. When the TCP window scale option is not used, kernel will
+try to enable fast path immediately when the connection comes into the
+established state, but if the TCP window scale option is used, kernel
+will disable the fast path at first, and try to enable it after kernel
+receives packets.
+
+* TcpExtTCPPureAcks and TcpExtTCPHPAcks
+
+If a packet set ACK flag and has no data, it is a pure ACK packet, if
+kernel handles it in the fast path, TcpExtTCPHPAcks will increase 1,
+if kernel handles it in the slow path, TcpExtTCPPureAcks will
+increase 1.
+
+* TcpExtTCPHPHits
+
+If a TCP packet has data (which means it is not a pure ACK packet),
+and this packet is handled in the fast path, TcpExtTCPHPHits will
+increase 1.
+
+
+TCP abort
+=========
+* TcpExtTCPAbortOnData
+
+It means TCP layer has data in flight, but need to close the
+connection. So TCP layer sends a RST to the other side, indicate the
+connection is not closed very graceful. An easy way to increase this
+counter is using the SO_LINGER option. Please refer to the SO_LINGER
+section of the `socket man page`_:
+
+.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html
+
+By default, when an application closes a connection, the close function
+will return immediately and kernel will try to send the in-flight data
+async. If you use the SO_LINGER option, set l_onoff to 1, and l_linger
+to a positive number, the close function won't return immediately, but
+wait for the in-flight data are acked by the other side, the max wait
+time is l_linger seconds. If set l_onoff to 1 and set l_linger to 0,
+when the application closes a connection, kernel will send a RST
+immediately and increase the TcpExtTCPAbortOnData counter.
+
+* TcpExtTCPAbortOnClose
+
+This counter means the application has unread data in the TCP layer when
+the application wants to close the TCP connection. In such a situation,
+kernel will send a RST to the other side of the TCP connection.
+
+* TcpExtTCPAbortOnMemory
+
+When an application closes a TCP connection, kernel still need to track
+the connection, let it complete the TCP disconnect process. E.g. an
+app calls the close method of a socket, kernel sends fin to the other
+side of the connection, then the app has no relationship with the
+socket any more, but kernel need to keep the socket, this socket
+becomes an orphan socket, kernel waits for the reply of the other side,
+and would come to the TIME_WAIT state finally. When kernel has no
+enough memory to keep the orphan socket, kernel would send an RST to
+the other side, and delete the socket, in such situation, kernel will
+increase 1 to the TcpExtTCPAbortOnMemory. Two conditions would trigger
+TcpExtTCPAbortOnMemory:
+
+1. the memory used by the TCP protocol is higher than the third value of
+the tcp_mem. Please refer the tcp_mem section in the `TCP man page`_:
+
+.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html
+
+2. the orphan socket count is higher than net.ipv4.tcp_max_orphans
+
+
+* TcpExtTCPAbortOnTimeout
+
+This counter will increase when any of the TCP timers expire. In such
+situation, kernel won't send RST, just give up the connection.
+
+* TcpExtTCPAbortOnLinger
+
+When a TCP connection comes into FIN_WAIT_2 state, instead of waiting
+for the fin packet from the other side, kernel could send a RST and
+delete the socket immediately. This is not the default behavior of
+Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
+you could let kernel follow this behavior.
+
+* TcpExtTCPAbortFailed
+
+The kernel TCP layer will send RST if the `RFC2525 2.17 section`_ is
+satisfied. If an internal error occurs during this process,
+TcpExtTCPAbortFailed will be increased.
+
+.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50
+
+TCP Hybrid Slow Start
+=====================
+The Hybrid Slow Start algorithm is an enhancement of the traditional
+TCP congestion window Slow Start algorithm. It uses two pieces of
+information to detect whether the max bandwidth of the TCP path is
+approached. The two pieces of information are ACK train length and
+increase in packet delay. For detail information, please refer the
+`Hybrid Slow Start paper`_. Either ACK train length or packet delay
+hits a specific threshold, the congestion control algorithm will come
+into the Congestion Avoidance state. Until v4.20, two congestion
+control algorithms are using Hybrid Slow Start, they are cubic (the
+default congestion control algorithm) and cdg. Four snmp counters
+relate with the Hybrid Slow Start algorithm.
+
+.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf
+
+* TcpExtTCPHystartTrainDetect
+
+How many times the ACK train length threshold is detected
+
+* TcpExtTCPHystartTrainCwnd
+
+The sum of CWND detected by ACK train length. Dividing this value by
+TcpExtTCPHystartTrainDetect is the average CWND which detected by the
+ACK train length.
+
+* TcpExtTCPHystartDelayDetect
+
+How many times the packet delay threshold is detected.
+
+* TcpExtTCPHystartDelayCwnd
+
+The sum of CWND detected by packet delay. Dividing this value by
+TcpExtTCPHystartDelayDetect is the average CWND which detected by the
+packet delay.
+
+TCP retransmission and congestion control
+=========================================
+The TCP protocol has two retransmission mechanisms: SACK and fast
+recovery. They are exclusive with each other. When SACK is enabled,
+the kernel TCP stack would use SACK, or kernel would use fast
+recovery. The SACK is a TCP option, which is defined in `RFC2018`_,
+the fast recovery is defined in `RFC6582`_, which is also called
+'Reno'.
+
+The TCP congestion control is a big and complex topic. To understand
+the related snmp counter, we need to know the states of the congestion
+control state machine. There are 5 states: Open, Disorder, CWR,
+Recovery and Loss. For details about these states, please refer page 5
+and page 6 of this document:
+https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf
+
+.. _RFC2018: https://tools.ietf.org/html/rfc2018
+.. _RFC6582: https://tools.ietf.org/html/rfc6582
+
+* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery
+
+When the congestion control comes into Recovery state, if sack is
+used, TcpExtTCPSackRecovery increases 1, if sack is not used,
+TcpExtTCPRenoRecovery increases 1. These two counters mean the TCP
+stack begins to retransmit the lost packets.
+
+* TcpExtTCPSACKReneging
+
+A packet was acknowledged by SACK, but the receiver has dropped this
+packet, so the sender needs to retransmit this packet. In this
+situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
+could drop a packet which has been acknowledged by SACK, although it is
+unusual, it is allowed by the TCP protocol. The sender doesn't really
+know what happened on the receiver side. The sender just waits until
+the RTO expires for this packet, then the sender assumes this packet
+has been dropped by the receiver.
+
+* TcpExtTCPRenoReorder
+
+The reorder packet is detected by fast recovery. It would only be used
+if SACK is disabled. The fast recovery algorithm detects recorder by
+the duplicate ACK number. E.g., if retransmission is triggered, and
+the original retransmitted packet is not lost, it is just out of
+order, the receiver would acknowledge multiple times, one for the
+retransmitted packet, another for the arriving of the original out of
+order packet. Thus the sender would find more ACks than its
+expectation, and the sender knows out of order occurs.
+
+* TcpExtTCPTSReorder
+
+The reorder packet is detected when a hole is filled. E.g., assume the
+sender sends packet 1,2,3,4,5, and the receiving order is
+1,2,4,5,3. When the sender receives the ACK of packet 3 (which will
+fill the hole), two conditions will let TcpExtTCPTSReorder increase
+1: (1) if the packet 3 is not re-retransmitted yet. (2) if the packet
+3 is retransmitted but the timestamp of the packet 3's ACK is earlier
+than the retransmission timestamp.
+
+* TcpExtTCPSACKReorder
+
+The reorder packet detected by SACK. The SACK has two methods to
+detect reorder: (1) DSACK is received by the sender. It means the
+sender sends the same packet more than one times. And the only reason
+is the sender believes an out of order packet is lost so it sends the
+packet again. (2) Assume packet 1,2,3,4,5 are sent by the sender, and
+the sender has received SACKs for packet 2 and 5, now the sender
+receives SACK for packet 4 and the sender doesn't retransmit the
+packet yet, the sender would know packet 4 is out of order. The TCP
+stack of kernel will increase TcpExtTCPSACKReorder for both of the
+above scenarios.
+
+* TcpExtTCPSlowStartRetrans
+
+The TCP stack wants to retransmit a packet and the congestion control
+state is 'Loss'.
+
+* TcpExtTCPFastRetrans
+
+The TCP stack wants to retransmit a packet and the congestion control
+state is not 'Loss'.
+
+* TcpExtTCPLostRetransmit
+
+A SACK points out that a retransmission packet is lost again.
+
+* TcpExtTCPRetransFail
+
+The TCP stack tries to deliver a retransmission packet to lower layers
+but the lower layers return an error.
+
+* TcpExtTCPSynRetrans
+
+The TCP stack retransmits a SYN packet.
+
+DSACK
+=====
+The DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
+duplicate packets to the sender. There are two kinds of
+duplications: (1) a packet which has been acknowledged is
+duplicate. (2) an out of order packet is duplicate. The TCP stack
+counts these two kinds of duplications on both receiver side and
+sender side.
+
+.. _RFC2883 : https://tools.ietf.org/html/rfc2883
+
+* TcpExtTCPDSACKOldSent
+
+The TCP stack receives a duplicate packet which has been acked, so it
+sends a DSACK to the sender.
+
+* TcpExtTCPDSACKOfoSent
+
+The TCP stack receives an out of order duplicate packet, so it sends a
+DSACK to the sender.
+
+* TcpExtTCPDSACKRecv
+
+The TCP stack receives a DSACK, which indicates an acknowledged
+duplicate packet is received.
+
+* TcpExtTCPDSACKOfoRecv
+
+The TCP stack receives a DSACK, which indicate an out of order
+duplicate packet is received.
+
+invalid SACK and DSACK
+======================
+When a SACK (or DSACK) block is invalid, a corresponding counter would
+be updated. The validation method is base on the start/end sequence
+number of the SACK block. For more details, please refer the comment
+of the function tcp_is_sackblock_valid in the kernel source code. A
+SACK option could have up to 4 blocks, they are checked
+individually. E.g., if 3 blocks of a SACk is invalid, the
+corresponding counter would be updated 3 times. The comment of the
+`Add counters for discarded SACK blocks`_ patch has additional
+explanation:
+
+.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32
+
+* TcpExtTCPSACKDiscard
+
+This counter indicates how many SACK blocks are invalid. If the invalid
+SACK block is caused by ACK recording, the TCP stack will only ignore
+it and won't update this counter.
+
+* TcpExtTCPDSACKIgnoredOld and TcpExtTCPDSACKIgnoredNoUndo
+
+When a DSACK block is invalid, one of these two counters would be
+updated. Which counter will be updated depends on the undo_marker flag
+of the TCP socket. If the undo_marker is not set, the TCP stack isn't
+likely to re-transmit any packets, and we still receive an invalid
+DSACK block, the reason might be that the packet is duplicated in the
+middle of the network. In such scenario, TcpExtTCPDSACKIgnoredNoUndo
+will be updated. If the undo_marker is set, TcpExtTCPDSACKIgnoredOld
+will be updated. As implied in its name, it might be an old packet.
+
+SACK shift
+==========
+The linux networking stack stores data in sk_buff struct (skb for
+short). If a SACK block acrosses multiple skb, the TCP stack will try
+to re-arrange data in these skb. E.g. if a SACK block acknowledges seq
+10 to 15, skb1 has seq 10 to 13, skb2 has seq 14 to 20. The seq 14 and
+15 in skb2 would be moved to skb1. This operation is 'shift'. If a
+SACK block acknowledges seq 10 to 20, skb1 has seq 10 to 13, skb2 has
+seq 14 to 20. All data in skb2 will be moved to skb1, and skb2 will be
+discard, this operation is 'merge'.
+
+* TcpExtTCPSackShifted
+
+A skb is shifted
+
+* TcpExtTCPSackMerged
+
+A skb is merged
+
+* TcpExtTCPSackShiftFallback
+
+A skb should be shifted or merged, but the TCP stack doesn't do it for
+some reasons.
+
+TCP out of order
+================
+* TcpExtTCPOFOQueue
+
+The TCP layer receives an out of order packet and has enough memory
+to queue it.
+
+* TcpExtTCPOFODrop
+
+The TCP layer receives an out of order packet but doesn't have enough
+memory, so drops it. Such packets won't be counted into
+TcpExtTCPOFOQueue.
+
+* TcpExtTCPOFOMerge
+
+The received out of order packet has an overlay with the previous
+packet. the overlay part will be dropped. All of TcpExtTCPOFOMerge
+packets will also be counted into TcpExtTCPOFOQueue.
+
+TCP PAWS
+========
+PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
+which is used to drop old packets. It depends on the TCP
+timestamps. For detail information, please refer the `timestamp wiki`_
+and the `RFC of PAWS`_.
+
+.. _RFC of PAWS: https://tools.ietf.org/html/rfc1323#page-17
+.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps
+
+* TcpExtPAWSActive
+
+Packets are dropped by PAWS in Syn-Sent status.
+
+* TcpExtPAWSEstab
+
+Packets are dropped by PAWS in any status other than Syn-Sent.
+
+TCP ACK skip
+============
+In some scenarios, kernel would avoid sending duplicate ACKs too
+frequently. Please find more details in the tcp_invalid_ratelimit
+section of the `sysctl document`_. When kernel decides to skip an ACK
+due to tcp_invalid_ratelimit, kernel would update one of below
+counters to indicate the ACK is skipped in which scenario. The ACK
+would only be skipped if the received packet is either a SYN packet or
+it has no data.
+
+.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.rst
+
+* TcpExtTCPACKSkippedSynRecv
+
+The ACK is skipped in Syn-Recv status. The Syn-Recv status means the
+TCP stack receives a SYN and replies SYN+ACK. Now the TCP stack is
+waiting for an ACK. Generally, the TCP stack doesn't need to send ACK
+in the Syn-Recv status. But in several scenarios, the TCP stack need
+to send an ACK. E.g., the TCP stack receives the same SYN packet
+repeately, the received packet does not pass the PAWS check, or the
+received packet sequence number is out of window. In these scenarios,
+the TCP stack needs to send ACK. If the ACk sending frequency is higher than
+tcp_invalid_ratelimit allows, the TCP stack will skip sending ACK and
+increase TcpExtTCPACKSkippedSynRecv.
+
+
+* TcpExtTCPACKSkippedPAWS
+
+The ACK is skipped due to PAWS (Protect Against Wrapped Sequence
+numbers) check fails. If the PAWS check fails in Syn-Recv, Fin-Wait-2
+or Time-Wait statuses, the skipped ACK would be counted to
+TcpExtTCPACKSkippedSynRecv, TcpExtTCPACKSkippedFinWait2 or
+TcpExtTCPACKSkippedTimeWait. In all other statuses, the skipped ACK
+would be counted to TcpExtTCPACKSkippedPAWS.
+
+* TcpExtTCPACKSkippedSeq
+
+The sequence number is out of window and the timestamp passes the PAWS
+check and the TCP status is not Syn-Recv, Fin-Wait-2, and Time-Wait.
+
+* TcpExtTCPACKSkippedFinWait2
+
+The ACK is skipped in Fin-Wait-2 status, the reason would be either
+PAWS check fails or the received sequence number is out of window.
+
+* TcpExtTCPACKSkippedTimeWait
+
+The ACK is skipped in Time-Wait status, the reason would be either
+PAWS check failed or the received sequence number is out of window.
+
+* TcpExtTCPACKSkippedChallenge
+
+The ACK is skipped if the ACK is a challenge ACK. The RFC 5961 defines
+3 kind of challenge ACK, please refer `RFC 5961 section 3.2`_,
+`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
+three scenarios, In some TCP status, the linux TCP stack would also
+send challenge ACKs if the ACK number is before the first
+unacknowledged number (more strict than `RFC 5961 section 5.2`_).
+
+.. _RFC 5961 section 3.2: https://tools.ietf.org/html/rfc5961#page-7
+.. _RFC 5961 section 4.2: https://tools.ietf.org/html/rfc5961#page-9
+.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11
+
+TCP receive window
+==================
+* TcpExtTCPWantZeroWindowAdv
+
+Depending on current memory usage, the TCP stack tries to set receive
+window to zero. But the receive window might still be a no-zero
+value. For example, if the previous window size is 10, and the TCP
+stack receives 3 bytes, the current window size would be 7 even if the
+window size calculated by the memory usage is zero.
+
+* TcpExtTCPToZeroWindowAdv
+
+The TCP receive window is set to zero from a no-zero value.
+
+* TcpExtTCPFromZeroWindowAdv
+
+The TCP receive window is set to no-zero value from zero.
+
+
+Delayed ACK
+===========
+The TCP Delayed ACK is a technique which is used for reducing the
+packet count in the network. For more details, please refer the
+`Delayed ACK wiki`_
+
+.. _Delayed ACK wiki: https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment
+
+* TcpExtDelayedACKs
+
+A delayed ACK timer expires. The TCP stack will send a pure ACK packet
+and exit the delayed ACK mode.
+
+* TcpExtDelayedACKLocked
+
+A delayed ACK timer expires, but the TCP stack can't send an ACK
+immediately due to the socket is locked by a userspace program. The
+TCP stack will send a pure ACK later (after the userspace program
+unlock the socket). When the TCP stack sends the pure ACK later, the
+TCP stack will also update TcpExtDelayedACKs and exit the delayed ACK
+mode.
+
+* TcpExtDelayedACKLost
+
+It will be updated when the TCP stack receives a packet which has been
+ACKed. A Delayed ACK loss might cause this issue, but it would also be
+triggered by other reasons, such as a packet is duplicated in the
+network.
+
+Tail Loss Probe (TLP)
+=====================
+TLP is an algorithm which is used to detect TCP packet loss. For more
+details, please refer the `TLP paper`_.
+
+.. _TLP paper: https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
+
+* TcpExtTCPLossProbes
+
+A TLP probe packet is sent.
+
+* TcpExtTCPLossProbeRecovery
+
+A packet loss is detected and recovered by TLP.
+
+TCP Fast Open description
+=========================
+TCP Fast Open is a technology which allows data transfer before the
+3-way handshake complete. Please refer the `TCP Fast Open wiki`_ for a
+general description.
+
+.. _TCP Fast Open wiki: https://en.wikipedia.org/wiki/TCP_Fast_Open
+
+* TcpExtTCPFastOpenActive
+
+When the TCP stack receives an ACK packet in the SYN-SENT status, and
+the ACK packet acknowledges the data in the SYN packet, the TCP stack
+understand the TFO cookie is accepted by the other side, then it
+updates this counter.
+
+* TcpExtTCPFastOpenActiveFail
+
+This counter indicates that the TCP stack initiated a TCP Fast Open,
+but it failed. This counter would be updated in three scenarios: (1)
+the other side doesn't acknowledge the data in the SYN packet. (2) The
+SYN packet which has the TFO cookie is timeout at least once. (3)
+after the 3-way handshake, the retransmission timeout happens
+net.ipv4.tcp_retries1 times, because some middle-boxes may black-hole
+fast open after the handshake.
+
+* TcpExtTCPFastOpenPassive
+
+This counter indicates how many times the TCP stack accepts the fast
+open request.
+
+* TcpExtTCPFastOpenPassiveFail
+
+This counter indicates how many times the TCP stack rejects the fast
+open request. It is caused by either the TFO cookie is invalid or the
+TCP stack finds an error during the socket creating process.
+
+* TcpExtTCPFastOpenListenOverflow
+
+When the pending fast open request number is larger than
+fastopenq->max_qlen, the TCP stack will reject the fast open request
+and update this counter. When this counter is updated, the TCP stack
+won't update TcpExtTCPFastOpenPassive or
+TcpExtTCPFastOpenPassiveFail. The fastopenq->max_qlen is set by the
+TCP_FASTOPEN socket operation and it could not be larger than
+net.core.somaxconn. For example:
+
+setsockopt(sfd, SOL_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
+
+* TcpExtTCPFastOpenCookieReqd
+
+This counter indicates how many times a client wants to request a TFO
+cookie.
+
+SYN cookies
+===========
+SYN cookies are used to mitigate SYN flood, for details, please refer
+the `SYN cookies wiki`_.
+
+.. _SYN cookies wiki: https://en.wikipedia.org/wiki/SYN_cookies
+
+* TcpExtSyncookiesSent
+
+It indicates how many SYN cookies are sent.
+
+* TcpExtSyncookiesRecv
+
+How many reply packets of the SYN cookies the TCP stack receives.
+
+* TcpExtSyncookiesFailed
+
+The MSS decoded from the SYN cookie is invalid. When this counter is
+updated, the received packet won't be treated as a SYN cookie and the
+TcpExtSyncookiesRecv counter won't be updated.
+
+Challenge ACK
+=============
+For details of challenge ACK, please refer the explanation of
+TcpExtTCPACKSkippedChallenge.
+
+* TcpExtTCPChallengeACK
+
+The number of challenge acks sent.
+
+* TcpExtTCPSYNChallenge
+
+The number of challenge acks sent in response to SYN packets. After
+updates this counter, the TCP stack might send a challenge ACK and
+update the TcpExtTCPChallengeACK counter, or it might also skip to
+send the challenge and update the TcpExtTCPACKSkippedChallenge.
+
+prune
+=====
+When a socket is under memory pressure, the TCP stack will try to
+reclaim memory from the receiving queue and out of order queue. One of
+the reclaiming method is 'collapse', which means allocate a big skb,
+copy the contiguous skbs to the single big skb, and free these
+contiguous skbs.
+
+* TcpExtPruneCalled
+
+The TCP stack tries to reclaim memory for a socket. After updates this
+counter, the TCP stack will try to collapse the out of order queue and
+the receiving queue. If the memory is still not enough, the TCP stack
+will try to discard packets from the out of order queue (and update the
+TcpExtOfoPruned counter)
+
+* TcpExtOfoPruned
+
+The TCP stack tries to discard packet on the out of order queue.
+
+* TcpExtRcvPruned
+
+After 'collapse' and discard packets from the out of order queue, if
+the actually used memory is still larger than the max allowed memory,
+this counter will be updated. It means the 'prune' fails.
+
+* TcpExtTCPRcvCollapsed
+
+This counter indicates how many skbs are freed during 'collapse'.
+
+examples
+========
+
+ping test
+---------
+Run the ping command against the public dns server 8.8.8.8::
+
+ nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
+ PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
+ 64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
+
+ --- 8.8.8.8 ping statistics ---
+ 1 packets transmitted, 1 received, 0% packet loss, time 0ms
+ rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
+
+The nstayt result::
+
+ nstatuser@nstat-a:~$ nstat
+ #kernel
+ IpInReceives 1 0.0
+ IpInDelivers 1 0.0
+ IpOutRequests 1 0.0
+ IcmpInMsgs 1 0.0
+ IcmpInEchoReps 1 0.0
+ IcmpOutMsgs 1 0.0
+ IcmpOutEchos 1 0.0
+ IcmpMsgInType0 1 0.0
+ IcmpMsgOutType8 1 0.0
+ IpExtInOctets 84 0.0
+ IpExtOutOctets 84 0.0
+ IpExtInNoECTPkts 1 0.0
+
+The Linux server sent an ICMP Echo packet, so IpOutRequests,
+IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were increased 1. The
+server got ICMP Echo Reply from 8.8.8.8, so IpInReceives, IcmpInMsgs,
+IcmpInEchoReps and IcmpMsgInType0 were increased 1. The ICMP Echo Reply
+was passed to the ICMP layer via IP layer, so IpInDelivers was
+increased 1. The default ping data size is 48, so an ICMP Echo packet
+and its corresponding Echo Reply packet are constructed by:
+
+* 14 bytes MAC header
+* 20 bytes IP header
+* 16 bytes ICMP header
+* 48 bytes data (default value of the ping command)
+
+So the IpExtInOctets and IpExtOutOctets are 20+16+48=84.
+
+tcp 3-way handshake
+-------------------
+On server side, we run::
+
+ nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
+ Listening on [0.0.0.0] (family 0, port 9000)
+
+On client side, we run::
+
+ nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
+ Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
+
+The server listened on tcp 9000 port, the client connected to it, they
+completed the 3-way handshake.
+
+On server side, we can find below nstat output::
+
+ nstatuser@nstat-b:~$ nstat | grep -i tcp
+ TcpPassiveOpens 1 0.0
+ TcpInSegs 2 0.0
+ TcpOutSegs 1 0.0
+ TcpExtTCPPureAcks 1 0.0
+
+On client side, we can find below nstat output::
+
+ nstatuser@nstat-a:~$ nstat | grep -i tcp
+ TcpActiveOpens 1 0.0
+ TcpInSegs 1 0.0
+ TcpOutSegs 2 0.0
+
+When the server received the first SYN, it replied a SYN+ACK, and came into
+SYN-RCVD state, so TcpPassiveOpens increased 1. The server received
+SYN, sent SYN+ACK, received ACK, so server sent 1 packet, received 2
+packets, TcpInSegs increased 2, TcpOutSegs increased 1. The last ACK
+of the 3-way handshake is a pure ACK without data, so
+TcpExtTCPPureAcks increased 1.
+
+When the client sent SYN, the client came into the SYN-SENT state, so
+TcpActiveOpens increased 1, the client sent SYN, received SYN+ACK, sent
+ACK, so client sent 2 packets, received 1 packet, TcpInSegs increased
+1, TcpOutSegs increased 2.
+
+TCP normal traffic
+------------------
+Run nc on server::
+
+ nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
+ Listening on [0.0.0.0] (family 0, port 9000)
+
+Run nc on client::
+
+ nstatuser@nstat-a:~$ nc -v nstat-b 9000
+ Connection to nstat-b 9000 port [tcp/*] succeeded!
+
+Input a string in the nc client ('hello' in our example)::
+
+ nstatuser@nstat-a:~$ nc -v nstat-b 9000
+ Connection to nstat-b 9000 port [tcp/*] succeeded!
+ hello
+
+The client side nstat output::
+
+ nstatuser@nstat-a:~$ nstat
+ #kernel
+ IpInReceives 1 0.0
+ IpInDelivers 1 0.0
+ IpOutRequests 1 0.0
+ TcpInSegs 1 0.0
+ TcpOutSegs 1 0.0
+ TcpExtTCPPureAcks 1 0.0
+ TcpExtTCPOrigDataSent 1 0.0
+ IpExtInOctets 52 0.0
+ IpExtOutOctets 58 0.0
+ IpExtInNoECTPkts 1 0.0
+
+The server side nstat output::
+
+ nstatuser@nstat-b:~$ nstat
+ #kernel
+ IpInReceives 1 0.0
+ IpInDelivers 1 0.0
+ IpOutRequests 1 0.0
+ TcpInSegs 1 0.0
+ TcpOutSegs 1 0.0
+ IpExtInOctets 58 0.0
+ IpExtOutOctets 52 0.0
+ IpExtInNoECTPkts 1 0.0
+
+Input a string in nc client side again ('world' in our example)::
+
+ nstatuser@nstat-a:~$ nc -v nstat-b 9000
+ Connection to nstat-b 9000 port [tcp/*] succeeded!
+ hello
+ world
+
+Client side nstat output::
+
+ nstatuser@nstat-a:~$ nstat
+ #kernel
+ IpInReceives 1 0.0
+ IpInDelivers 1 0.0
+ IpOutRequests 1 0.0
+ TcpInSegs 1 0.0
+ TcpOutSegs 1 0.0
+ TcpExtTCPHPAcks 1 0.0
+ TcpExtTCPOrigDataSent 1 0.0
+ IpExtInOctets 52 0.0
+ IpExtOutOctets 58 0.0
+ IpExtInNoECTPkts 1 0.0
+
+
+Server side nstat output::
+
+ nstatuser@nstat-b:~$ nstat
+ #kernel
+ IpInReceives 1 0.0
+ IpInDelivers 1 0.0
+ IpOutRequests 1 0.0
+ TcpInSegs 1 0.0
+ TcpOutSegs 1 0.0
+ TcpExtTCPHPHits 1 0.0
+ IpExtInOctets 58 0.0
+ IpExtOutOctets 52 0.0
+ IpExtInNoECTPkts 1 0.0
+
+Compare the first client-side nstat and the second client-side nstat,
+we could find one difference: the first one had a 'TcpExtTCPPureAcks',
+but the second one had a 'TcpExtTCPHPAcks'. The first server-side
+nstat and the second server-side nstat had a difference too: the
+second server-side nstat had a TcpExtTCPHPHits, but the first
+server-side nstat didn't have it. The network traffic patterns were
+exactly the same: the client sent a packet to the server, the server
+replied an ACK. But kernel handled them in different ways. When the
+TCP window scale option is not used, kernel will try to enable fast
+path immediately when the connection comes into the established state,
+but if the TCP window scale option is used, kernel will disable the
+fast path at first, and try to enable it after kernel receives
+packets. We could use the 'ss' command to verify whether the window
+scale option is used. e.g. run below command on either server or
+client::
+
+ nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )
+ Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
+ tcp 0 0 192.168.122.250:40654 192.168.122.251:9000
+ ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
+
+The 'wscale:7,7' means both server and client set the window scale
+option to 7. Now we could explain the nstat output in our test:
+
+In the first nstat output of client side, the client sent a packet, server
+reply an ACK, when kernel handled this ACK, the fast path was not
+enabled, so the ACK was counted into 'TcpExtTCPPureAcks'.
+
+In the second nstat output of client side, the client sent a packet again,
+and received another ACK from the server, in this time, the fast path is
+enabled, and the ACK was qualified for fast path, so it was handled by
+the fast path, so this ACK was counted into TcpExtTCPHPAcks.
+
+In the first nstat output of server side, fast path was not enabled,
+so there was no 'TcpExtTCPHPHits'.
+
+In the second nstat output of server side, the fast path was enabled,
+and the packet received from client qualified for fast path, so it
+was counted into 'TcpExtTCPHPHits'.
+
+TcpExtTCPAbortOnClose
+---------------------
+On the server side, we run below python script::
+
+ import socket
+ import time
+
+ port = 9000
+
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+ s.bind(('0.0.0.0', port))
+ s.listen(1)
+ sock, addr = s.accept()
+ while True:
+ time.sleep(9999999)
+
+This python script listen on 9000 port, but doesn't read anything from
+the connection.
+
+On the client side, we send the string "hello" by nc::
+
+ nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
+
+Then, we come back to the server side, the server has received the "hello"
+packet, and the TCP layer has acked this packet, but the application didn't
+read it yet. We type Ctrl-C to terminate the server script. Then we
+could find TcpExtTCPAbortOnClose increased 1 on the server side::
+
+ nstatuser@nstat-b:~$ nstat | grep -i abort
+ TcpExtTCPAbortOnClose 1 0.0
+
+If we run tcpdump on the server side, we could find the server sent a
+RST after we type Ctrl-C.
+
+TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
+---------------------------------------------------
+Below is an example which let the orphan socket count be higher than
+net.ipv4.tcp_max_orphans.
+Change tcp_max_orphans to a smaller value on client::
+
+ sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
+
+Client code (create 64 connection to server)::
+
+ nstatuser@nstat-a:~$ cat client_orphan.py
+ import socket
+ import time
+
+ server = 'nstat-b' # server address
+ port = 9000
+
+ count = 64
+
+ connection_list = []
+
+ for i in range(64):
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+ s.connect((server, port))
+ connection_list.append(s)
+ print("connection_count: %d" % len(connection_list))
+
+ while True:
+ time.sleep(99999)
+
+Server code (accept 64 connection from client)::
+
+ nstatuser@nstat-b:~$ cat server_orphan.py
+ import socket
+ import time
+
+ port = 9000
+ count = 64
+
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+ s.bind(('0.0.0.0', port))
+ s.listen(count)
+ connection_list = []
+ while True:
+ sock, addr = s.accept()
+ connection_list.append((sock, addr))
+ print("connection_count: %d" % len(connection_list))
+
+Run the python scripts on server and client.
+
+On server::
+
+ python3 server_orphan.py
+
+On client::
+
+ python3 client_orphan.py
+
+Run iptables on server::
+
+ sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
+
+Type Ctrl-C on client, stop client_orphan.py.
+
+Check TcpExtTCPAbortOnMemory on client::
+
+ nstatuser@nstat-a:~$ nstat | grep -i abort
+ TcpExtTCPAbortOnMemory 54 0.0
+
+Check orphaned socket count on client::
+
+ nstatuser@nstat-a:~$ ss -s
+ Total: 131 (kernel 0)
+ TCP: 14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
+
+ Transport Total IP IPv6
+ * 0 - -
+ RAW 1 0 1
+ UDP 1 1 0
+ TCP 14 13 1
+ INET 16 14 2
+ FRAG 0 0 0
+
+The explanation of the test: after run server_orphan.py and
+client_orphan.py, we set up 64 connections between server and
+client. Run the iptables command, the server will drop all packets from
+the client, type Ctrl-C on client_orphan.py, the system of the client
+would try to close these connections, and before they are closed
+gracefully, these connections became orphan sockets. As the iptables
+of the server blocked packets from the client, the server won't receive fin
+from the client, so all connection on clients would be stuck on FIN_WAIT_1
+stage, so they will keep as orphan sockets until timeout. We have echo
+10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client system would
+only keep 10 orphan sockets, for all other orphan sockets, the client
+system sent RST for them and delete them. We have 64 connections, so
+the 'ss -s' command shows the system has 10 orphan sockets, and the
+value of TcpExtTCPAbortOnMemory was 54.
+
+An additional explanation about orphan socket count: You could find the
+exactly orphan socket count by the 'ss -s' command, but when kernel
+decide whither increases TcpExtTCPAbortOnMemory and sends RST, kernel
+doesn't always check the exactly orphan socket count. For increasing
+performance, kernel checks an approximate count firstly, if the
+approximate count is more than tcp_max_orphans, kernel checks the
+exact count again. So if the approximate count is less than
+tcp_max_orphans, but exactly count is more than tcp_max_orphans, you
+would find TcpExtTCPAbortOnMemory is not increased at all. If
+tcp_max_orphans is large enough, it won't occur, but if you decrease
+tcp_max_orphans to a small value like our test, you might find this
+issue. So in our test, the client set up 64 connections although the
+tcp_max_orphans is 10. If the client only set up 11 connections, we
+can't find the change of TcpExtTCPAbortOnMemory.
+
+Continue the previous test, we wait for several minutes. Because of the
+iptables on the server blocked the traffic, the server wouldn't receive
+fin, and all the client's orphan sockets would timeout on the
+FIN_WAIT_1 state finally. So we wait for a few minutes, we could find
+10 timeout on the client::
+
+ nstatuser@nstat-a:~$ nstat | grep -i abort
+ TcpExtTCPAbortOnTimeout 10 0.0
+
+TcpExtTCPAbortOnLinger
+----------------------
+The server side code::
+
+ nstatuser@nstat-b:~$ cat server_linger.py
+ import socket
+ import time
+
+ port = 9000
+
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+ s.bind(('0.0.0.0', port))
+ s.listen(1)
+ sock, addr = s.accept()
+ while True:
+ time.sleep(9999999)
+
+The client side code::
+
+ nstatuser@nstat-a:~$ cat client_linger.py
+ import socket
+ import struct
+
+ server = 'nstat-b' # server address
+ port = 9000
+
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+ s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
+ s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
+ s.connect((server, port))
+ s.close()
+
+Run server_linger.py on server::
+
+ nstatuser@nstat-b:~$ python3 server_linger.py
+
+Run client_linger.py on client::
+
+ nstatuser@nstat-a:~$ python3 client_linger.py
+
+After run client_linger.py, check the output of nstat::
+
+ nstatuser@nstat-a:~$ nstat | grep -i abort
+ TcpExtTCPAbortOnLinger 1 0.0
+
+TcpExtTCPRcvCoalesce
+--------------------
+On the server, we run a program which listen on TCP port 9000, but
+doesn't read any data::
+
+ import socket
+ import time
+ port = 9000
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+ s.bind(('0.0.0.0', port))
+ s.listen(1)
+ sock, addr = s.accept()
+ while True:
+ time.sleep(9999999)
+
+Save the above code as server_coalesce.py, and run::
+
+ python3 server_coalesce.py
+
+On the client, save below code as client_coalesce.py::
+
+ import socket
+ server = 'nstat-b'
+ port = 9000
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+ s.connect((server, port))
+
+Run::
+
+ nstatuser@nstat-a:~$ python3 -i client_coalesce.py
+
+We use '-i' to come into the interactive mode, then a packet::
+
+ >>> s.send(b'foo')
+ 3
+
+Send a packet again::
+
+ >>> s.send(b'bar')
+ 3
+
+On the server, run nstat::
+
+ ubuntu@nstat-b:~$ nstat
+ #kernel
+ IpInReceives 2 0.0
+ IpInDelivers 2 0.0
+ IpOutRequests 2 0.0
+ TcpInSegs 2 0.0
+ TcpOutSegs 2 0.0
+ TcpExtTCPRcvCoalesce 1 0.0
+ IpExtInOctets 110 0.0
+ IpExtOutOctets 104 0.0
+ IpExtInNoECTPkts 2 0.0
+
+The client sent two packets, server didn't read any data. When
+the second packet arrived at server, the first packet was still in
+the receiving queue. So the TCP layer merged the two packets, and we
+could find the TcpExtTCPRcvCoalesce increased 1.
+
+TcpExtListenOverflows and TcpExtListenDrops
+-------------------------------------------
+On server, run the nc command, listen on port 9000::
+
+ nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
+ Listening on [0.0.0.0] (family 0, port 9000)
+
+On client, run 3 nc commands in different terminals::
+
+ nstatuser@nstat-a:~$ nc -v nstat-b 9000
+ Connection to nstat-b 9000 port [tcp/*] succeeded!
+
+The nc command only accepts 1 connection, and the accept queue length
+is 1. On current linux implementation, set queue length to n means the
+actual queue length is n+1. Now we create 3 connections, 1 is accepted
+by nc, 2 in accepted queue, so the accept queue is full.
+
+Before running the 4th nc, we clean the nstat history on the server::
+
+ nstatuser@nstat-b:~$ nstat -n
+
+Run the 4th nc on the client::
+
+ nstatuser@nstat-a:~$ nc -v nstat-b 9000
+
+If the nc server is running on kernel 4.10 or higher version, you
+won't see the "Connection to ... succeeded!" string, because kernel
+will drop the SYN if the accept queue is full. If the nc client is running
+on an old kernel, you would see that the connection is succeeded,
+because kernel would complete the 3 way handshake and keep the socket
+on half open queue. I did the test on kernel 4.15. Below is the nstat
+on the server::
+
+ nstatuser@nstat-b:~$ nstat
+ #kernel
+ IpInReceives 4 0.0
+ IpInDelivers 4 0.0
+ TcpInSegs 4 0.0
+ TcpExtListenOverflows 4 0.0
+ TcpExtListenDrops 4 0.0
+ IpExtInOctets 240 0.0
+ IpExtInNoECTPkts 4 0.0
+
+Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time
+between the 4th nc and the nstat was longer, the value of
+TcpExtListenOverflows and TcpExtListenDrops would be larger, because
+the SYN of the 4th nc was dropped, the client was retrying.
+
+IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
+-------------------------------------------------
+server A IP address: 192.168.122.250
+server B IP address: 192.168.122.251
+Prepare on server A, add a route to server B::
+
+ $ sudo ip route add 8.8.8.8/32 via 192.168.122.251
+
+Prepare on server B, disable send_redirects for all interfaces::
+
+ $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
+ $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
+ $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
+ $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0
+
+We want to let sever A send a packet to 8.8.8.8, and route the packet
+to server B. When server B receives such packet, it might send a ICMP
+Redirect message to server A, set send_redirects to 0 will disable
+this behavior.
+
+First, generate InAddrErrors. On server B, we disable IP forwarding::
+
+ $ sudo sysctl -w net.ipv4.conf.all.forwarding=0
+
+On server A, we send packets to 8.8.8.8::
+
+ $ nc -v 8.8.8.8 53
+
+On server B, we check the output of nstat::
+
+ $ nstat
+ #kernel
+ IpInReceives 3 0.0
+ IpInAddrErrors 3 0.0
+ IpExtInOctets 180 0.0
+ IpExtInNoECTPkts 3 0.0
+
+As we have let server A route 8.8.8.8 to server B, and we disabled IP
+forwarding on server B, Server A sent packets to server B, then server B
+dropped packets and increased IpInAddrErrors. As the nc command would
+re-send the SYN packet if it didn't receive a SYN+ACK, we could find
+multiple IpInAddrErrors.
+
+Second, generate IpExtInNoRoutes. On server B, we enable IP
+forwarding::
+
+ $ sudo sysctl -w net.ipv4.conf.all.forwarding=1
+
+Check the route table of server B and remove the default route::
+
+ $ ip route show
+ default via 192.168.122.1 dev ens3 proto static
+ 192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
+ $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static
+
+On server A, we contact 8.8.8.8 again::
+
+ $ nc -v 8.8.8.8 53
+ nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable
+
+On server B, run nstat::
+
+ $ nstat
+ #kernel
+ IpInReceives 1 0.0
+ IpOutRequests 1 0.0
+ IcmpOutMsgs 1 0.0
+ IcmpOutDestUnreachs 1 0.0
+ IcmpMsgOutType3 1 0.0
+ IpExtInNoRoutes 1 0.0
+ IpExtInOctets 60 0.0
+ IpExtOutOctets 88 0.0
+ IpExtInNoECTPkts 1 0.0
+
+We enabled IP forwarding on server B, when server B received a packet
+which destination IP address is 8.8.8.8, server B will try to forward
+this packet. We have deleted the default route, there was no route for
+8.8.8.8, so server B increase IpExtInNoRoutes and sent the "ICMP
+Destination Unreachable" message to server A.
+
+Third, generate IpOutNoRoutes. Run ping command on server B::
+
+ $ ping -c 1 8.8.8.8
+ connect: Network is unreachable
+
+Run nstat on server B::
+
+ $ nstat
+ #kernel
+ IpOutNoRoutes 1 0.0
+
+We have deleted the default route on server B. Server B couldn't find
+a route for the 8.8.8.8 IP address, so server B increased
+IpOutNoRoutes.
+
+TcpExtTCPACKSkippedSynRecv
+--------------------------
+In this test, we send 3 same SYN packets from client to server. The
+first SYN will let server create a socket, set it to Syn-Recv status,
+and reply a SYN/ACK. The second SYN will let server reply the SYN/ACK
+again, and record the reply time (the duplicate ACK reply time). The
+third SYN will let server check the previous duplicate ACK reply time,
+and decide to skip the duplicate ACK, then increase the
+TcpExtTCPACKSkippedSynRecv counter.
+
+Run tcpdump to capture a SYN packet::
+
+ nstatuser@nstat-a:~$ sudo tcpdump -c 1 -w /tmp/syn.pcap port 9000
+ tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
+
+Open another terminal, run nc command::
+
+ nstatuser@nstat-a:~$ nc nstat-b 9000
+
+As the nstat-b didn't listen on port 9000, it should reply a RST, and
+the nc command exited immediately. It was enough for the tcpdump
+command to capture a SYN packet. A linux server might use hardware
+offload for the TCP checksum, so the checksum in the /tmp/syn.pcap
+might be not correct. We call tcprewrite to fix it::
+
+ nstatuser@nstat-a:~$ tcprewrite --infile=/tmp/syn.pcap --outfile=/tmp/syn_fixcsum.pcap --fixcsum
+
+On nstat-b, we run nc to listen on port 9000::
+
+ nstatuser@nstat-b:~$ nc -lkv 9000
+ Listening on [0.0.0.0] (family 0, port 9000)
+
+On nstat-a, we blocked the packet from port 9000, or nstat-a would send
+RST to nstat-b::
+
+ nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP
+
+Send 3 SYN repeatedly to nstat-b::
+
+ nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done
+
+Check snmp counter on nstat-b::
+
+ nstatuser@nstat-b:~$ nstat | grep -i skip
+ TcpExtTCPACKSkippedSynRecv 1 0.0
+
+As we expected, TcpExtTCPACKSkippedSynRecv is 1.
+
+TcpExtTCPACKSkippedPAWS
+-----------------------
+To trigger PAWS, we could send an old SYN.
+
+On nstat-b, let nc listen on port 9000::
+
+ nstatuser@nstat-b:~$ nc -lkv 9000
+ Listening on [0.0.0.0] (family 0, port 9000)
+
+On nstat-a, run tcpdump to capture a SYN::
+
+ nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/paws_pre.pcap -c 1 port 9000
+ tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
+
+On nstat-a, run nc as a client to connect nstat-b::
+
+ nstatuser@nstat-a:~$ nc -v nstat-b 9000
+ Connection to nstat-b 9000 port [tcp/*] succeeded!
+
+Now the tcpdump has captured the SYN and exit. We should fix the
+checksum::
+
+ nstatuser@nstat-a:~$ tcprewrite --infile /tmp/paws_pre.pcap --outfile /tmp/paws.pcap --fixcsum
+
+Send the SYN packet twice::
+
+ nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/paws.pcap; done
+
+On nstat-b, check the snmp counter::
+
+ nstatuser@nstat-b:~$ nstat | grep -i skip
+ TcpExtTCPACKSkippedPAWS 1 0.0
+
+We sent two SYN via tcpreplay, both of them would let PAWS check
+failed, the nstat-b replied an ACK for the first SYN, skipped the ACK
+for the second SYN, and updated TcpExtTCPACKSkippedPAWS.
+
+TcpExtTCPACKSkippedSeq
+----------------------
+To trigger TcpExtTCPACKSkippedSeq, we send packets which have valid
+timestamp (to pass PAWS check) but the sequence number is out of
+window. The linux TCP stack would avoid to skip if the packet has
+data, so we need a pure ACK packet. To generate such a packet, we
+could create two sockets: one on port 9000, another on port 9001. Then
+we capture an ACK on port 9001, change the source/destination port
+numbers to match the port 9000 socket. Then we could trigger
+TcpExtTCPACKSkippedSeq via this packet.
+
+On nstat-b, open two terminals, run two nc commands to listen on both
+port 9000 and port 9001::
+
+ nstatuser@nstat-b:~$ nc -lkv 9000
+ Listening on [0.0.0.0] (family 0, port 9000)
+
+ nstatuser@nstat-b:~$ nc -lkv 9001
+ Listening on [0.0.0.0] (family 0, port 9001)
+
+On nstat-a, run two nc clients::
+
+ nstatuser@nstat-a:~$ nc -v nstat-b 9000
+ Connection to nstat-b 9000 port [tcp/*] succeeded!
+
+ nstatuser@nstat-a:~$ nc -v nstat-b 9001
+ Connection to nstat-b 9001 port [tcp/*] succeeded!
+
+On nstat-a, run tcpdump to capture an ACK::
+
+ nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/seq_pre.pcap -c 1 dst port 9001
+ tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
+
+On nstat-b, send a packet via the port 9001 socket. E.g. we sent a
+string 'foo' in our example::
+
+ nstatuser@nstat-b:~$ nc -lkv 9001
+ Listening on [0.0.0.0] (family 0, port 9001)
+ Connection from nstat-a 42132 received!
+ foo
+
+On nstat-a, the tcpdump should have captured the ACK. We should check
+the source port numbers of the two nc clients::
+
+ nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
+ State Recv-Q Send-Q Local Address:Port Peer Address:Port
+ ESTAB 0 0 192.168.122.250:50208 192.168.122.251:9000
+ ESTAB 0 0 192.168.122.250:42132 192.168.122.251:9001
+
+Run tcprewrite, change port 9001 to port 9000, change port 42132 to
+port 50208::
+
+ nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum
+
+Now the /tmp/seq.pcap is the packet we need. Send it to nstat-b::
+
+ nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/seq.pcap; done
+
+Check TcpExtTCPACKSkippedSeq on nstat-b::
+
+ nstatuser@nstat-b:~$ nstat | grep -i skip
+ TcpExtTCPACKSkippedSeq 1 0.0
diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
new file mode 100644
index 0000000000..551b3cc29a
--- /dev/null
+++ b/Documentation/networking/statistics.rst
@@ -0,0 +1,221 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Interface statistics
+====================
+
+Overview
+========
+
+This document is a guide to Linux network interface statistics.
+
+There are three main sources of interface statistics in Linux:
+
+ - standard interface statistics based on
+ :c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`;
+ - protocol-specific statistics; and
+ - driver-defined statistics available via ethtool.
+
+Standard interface statistics
+-----------------------------
+
+There are multiple interfaces to reach the standard statistics.
+Most commonly used is the `ip` command from `iproute2`::
+
+ $ ip -s -s link show dev ens4u1u1
+ 6: ens4u1u1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
+ link/ether 48:2a:e3:4c:b1:d1 brd ff:ff:ff:ff:ff:ff
+ RX: bytes packets errors dropped overrun mcast
+ 74327665117 69016965 0 0 0 0
+ RX errors: length crc frame fifo missed
+ 0 0 0 0 0
+ TX: bytes packets errors dropped carrier collsns
+ 21405556176 44608960 0 0 0 0
+ TX errors: aborted fifo window heartbeat transns
+ 0 0 0 0 128
+ altname enp58s0u1u1
+
+Note that `-s` has been specified twice to see all members of
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+If `-s` is specified once the detailed errors won't be shown.
+
+`ip` supports JSON formatting via the `-j` option.
+
+Protocol-specific statistics
+----------------------------
+
+Protocol-specific statistics are exposed via relevant interfaces,
+the same interfaces as are used to configure them.
+
+ethtool
+~~~~~~~
+
+Ethtool exposes common low-level statistics.
+All the standard statistics are expected to be maintained
+by the device, not the driver (as opposed to driver-defined stats
+described in the next section which mix software and hardware stats).
+For devices which contain unmanaged
+switches (e.g. legacy SR-IOV or multi-host NICs) the events counted
+may not pertain exclusively to the packets destined to
+the local host interface. In other words the events may
+be counted at the network port (MAC/PHY blocks) without separation
+for different host side (PCIe) devices. Such ambiguity must not
+be present when internal switch is managed by Linux (so called
+switchdev mode for NICs).
+
+Standard ethtool statistics can be accessed via the interfaces used
+for configuration. For example ethtool interface used
+to configure pause frames can report corresponding hardware counters::
+
+ $ ethtool --include-statistics -a eth0
+ Pause parameters for eth0:
+ Autonegotiate: on
+ RX: on
+ TX: on
+ Statistics:
+ tx_pause_frames: 1
+ rx_pause_frames: 1
+
+General Ethernet statistics not associated with any particular
+functionality are exposed via ``ethtool -S $ifc`` by specifying
+the ``--groups`` parameter::
+
+ $ ethtool -S eth0 --groups eth-phy eth-mac eth-ctrl rmon
+ Stats for eth0:
+ eth-phy-SymbolErrorDuringCarrier: 0
+ eth-mac-FramesTransmittedOK: 1
+ eth-mac-FrameTooLongErrors: 1
+ eth-ctrl-MACControlFramesTransmitted: 1
+ eth-ctrl-MACControlFramesReceived: 0
+ eth-ctrl-UnsupportedOpcodesReceived: 1
+ rmon-etherStatsUndersizePkts: 1
+ rmon-etherStatsJabbers: 0
+ rmon-rx-etherStatsPkts64Octets: 1
+ rmon-rx-etherStatsPkts65to127Octets: 0
+ rmon-rx-etherStatsPkts128to255Octets: 0
+ rmon-tx-etherStatsPkts64Octets: 2
+ rmon-tx-etherStatsPkts65to127Octets: 3
+ rmon-tx-etherStatsPkts128to255Octets: 0
+
+Driver-defined statistics
+-------------------------
+
+Driver-defined ethtool statistics can be dumped using `ethtool -S $ifc`, e.g.::
+
+ $ ethtool -S ens4u1u1
+ NIC statistics:
+ tx_single_collisions: 0
+ tx_multi_collisions: 0
+
+uAPIs
+=====
+
+procfs
+------
+
+The historical `/proc/net/dev` text interface gives access to the list
+of interfaces as well as their statistics.
+
+Note that even though this interface is using
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`
+internally it combines some of the fields.
+
+sysfs
+-----
+
+Each device directory in sysfs contains a `statistics` directory (e.g.
+`/sys/class/net/lo/statistics/`) with files corresponding to
+members of :c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+
+This simple interface is convenient especially in constrained/embedded
+environments without access to tools. However, it's inefficient when
+reading multiple stats as it internally performs a full dump of
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`
+and reports only the stat corresponding to the accessed file.
+
+Sysfs files are documented in
+`Documentation/ABI/testing/sysfs-class-net-statistics`.
+
+
+netlink
+-------
+
+`rtnetlink` (`NETLINK_ROUTE`) is the preferred method of accessing
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>` stats.
+
+Statistics are reported both in the responses to link information
+requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
+when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
+
+ethtool
+-------
+
+Ethtool IOCTL interface allows drivers to report implementation
+specific statistics. Historically it has also been used to report
+statistics for which other APIs did not exist, like per-device-queue
+statistics, or standard-based statistics (e.g. RFC 2863).
+
+Statistics and their string identifiers are retrieved separately.
+Identifiers via `ETHTOOL_GSTRINGS` with `string_set` set to `ETH_SS_STATS`,
+and values via `ETHTOOL_GSTATS`. User space should use `ETHTOOL_GDRVINFO`
+to retrieve the number of statistics (`.n_stats`).
+
+ethtool-netlink
+---------------
+
+Ethtool netlink is a replacement for the older IOCTL interface.
+
+Protocol-related statistics can be requested in get commands by setting
+the `ETHTOOL_FLAG_STATS` flag in `ETHTOOL_A_HEADER_FLAGS`. Currently
+statistics are supported in the following commands:
+
+ - `ETHTOOL_MSG_PAUSE_GET`
+ - `ETHTOOL_MSG_FEC_GET`
+ - `ETHTOOL_MSG_MM_GET`
+
+debugfs
+-------
+
+Some drivers expose extra statistics via `debugfs`.
+
+struct rtnl_link_stats64
+========================
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+ :identifiers: rtnl_link_stats64
+
+Notes for driver authors
+========================
+
+Drivers should report all statistics which have a matching member in
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>` exclusively
+via `.ndo_get_stats64`. Reporting such standard stats via ethtool
+or debugfs will not be accepted.
+
+Drivers must ensure best possible compliance with
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+Please note for example that detailed error statistics must be
+added into the general `rx_error` / `tx_error` counters.
+
+The `.ndo_get_stats64` callback can not sleep because of accesses
+via `/proc/net/dev`. If driver may sleep when retrieving the statistics
+from the device it should do so periodically asynchronously and only return
+a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
+allows setting the frequency of refreshing statistics, if needed.
+
+Retrieving ethtool statistics is a multi-syscall process, drivers are advised
+to keep the number of statistics constant to avoid race conditions with
+user space trying to read them.
+
+Statistics must persist across routine operations like bringing the interface
+down and up.
+
+Kernel-internal data structures
+-------------------------------
+
+The following structures are internal to the kernel, their members are
+translated to netlink attributes when dumped. Drivers must not overwrite
+the statistics they don't report with 0.
+
+- ethtool_pause_stats()
+- ethtool_fec_stats()
diff --git a/Documentation/networking/strparser.rst b/Documentation/networking/strparser.rst
new file mode 100644
index 0000000000..6cab1f74ae
--- /dev/null
+++ b/Documentation/networking/strparser.rst
@@ -0,0 +1,240 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Stream Parser (strparser)
+=========================
+
+Introduction
+============
+
+The stream parser (strparser) is a utility that parses messages of an
+application layer protocol running over a data stream. The stream
+parser works in conjunction with an upper layer in the kernel to provide
+kernel support for application layer messages. For instance, Kernel
+Connection Multiplexor (KCM) uses the Stream Parser to parse messages
+using a BPF program.
+
+The strparser works in one of two modes: receive callback or general
+mode.
+
+In receive callback mode, the strparser is called from the data_ready
+callback of a TCP socket. Messages are parsed and delivered as they are
+received on the socket.
+
+In general mode, a sequence of skbs are fed to strparser from an
+outside source. Message are parsed and delivered as the sequence is
+processed. This modes allows strparser to be applied to arbitrary
+streams of data.
+
+Interface
+=========
+
+The API includes a context structure, a set of callbacks, utility
+functions, and a data_ready function for receive callback mode. The
+callbacks include a parse_msg function that is called to perform
+parsing (e.g. BPF parsing in case of KCM), and a rcv_msg function
+that is called when a full message has been completed.
+
+Functions
+=========
+
+ ::
+
+ strp_init(struct strparser *strp, struct sock *sk,
+ const struct strp_callbacks *cb)
+
+ Called to initialize a stream parser. strp is a struct of type
+ strparser that is allocated by the upper layer. sk is the TCP
+ socket associated with the stream parser for use with receive
+ callback mode; in general mode this is set to NULL. Callbacks
+ are called by the stream parser (the callbacks are listed below).
+
+ ::
+
+ void strp_pause(struct strparser *strp)
+
+ Temporarily pause a stream parser. Message parsing is suspended
+ and no new messages are delivered to the upper layer.
+
+ ::
+
+ void strp_unpause(struct strparser *strp)
+
+ Unpause a paused stream parser.
+
+ ::
+
+ void strp_stop(struct strparser *strp);
+
+ strp_stop is called to completely stop stream parser operations.
+ This is called internally when the stream parser encounters an
+ error, and it is called from the upper layer to stop parsing
+ operations.
+
+ ::
+
+ void strp_done(struct strparser *strp);
+
+ strp_done is called to release any resources held by the stream
+ parser instance. This must be called after the stream processor
+ has been stopped.
+
+ ::
+
+ int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
+ unsigned int orig_offset, size_t orig_len,
+ size_t max_msg_size, long timeo)
+
+ strp_process is called in general mode for a stream parser to
+ parse an sk_buff. The number of bytes processed or a negative
+ error number is returned. Note that strp_process does not
+ consume the sk_buff. max_msg_size is maximum size the stream
+ parser will parse. timeo is timeout for completing a message.
+
+ ::
+
+ void strp_data_ready(struct strparser *strp);
+
+ The upper layer calls strp_tcp_data_ready when data is ready on
+ the lower socket for strparser to process. This should be called
+ from a data_ready callback that is set on the socket. Note that
+ maximum messages size is the limit of the receive socket
+ buffer and message timeout is the receive timeout for the socket.
+
+ ::
+
+ void strp_check_rcv(struct strparser *strp);
+
+ strp_check_rcv is called to check for new messages on the socket.
+ This is normally called at initialization of a stream parser
+ instance or after strp_unpause.
+
+Callbacks
+=========
+
+There are six callbacks:
+
+ ::
+
+ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
+
+ parse_msg is called to determine the length of the next message
+ in the stream. The upper layer must implement this function. It
+ should parse the sk_buff as containing the headers for the
+ next application layer message in the stream.
+
+ The skb->cb in the input skb is a struct strp_msg. Only
+ the offset field is relevant in parse_msg and gives the offset
+ where the message starts in the skb.
+
+ The return values of this function are:
+
+ ========= ===========================================================
+ >0 indicates length of successfully parsed message
+ 0 indicates more data must be received to parse the message
+ -ESTRPIPE current message should not be processed by the
+ kernel, return control of the socket to userspace which
+ can proceed to read the messages itself
+ other < 0 Error in parsing, give control back to userspace
+ assuming that synchronization is lost and the stream
+ is unrecoverable (application expected to close TCP socket)
+ ========= ===========================================================
+
+ In the case that an error is returned (return value is less than
+ zero) and the parser is in receive callback mode, then it will set
+ the error on TCP socket and wake it up. If parse_msg returned
+ -ESTRPIPE and the stream parser had previously read some bytes for
+ the current message, then the error set on the attached socket is
+ ENODATA since the stream is unrecoverable in that case.
+
+ ::
+
+ void (*lock)(struct strparser *strp)
+
+ The lock callback is called to lock the strp structure when
+ the strparser is performing an asynchronous operation (such as
+ processing a timeout). In receive callback mode the default
+ function is to lock_sock for the associated socket. In general
+ mode the callback must be set appropriately.
+
+ ::
+
+ void (*unlock)(struct strparser *strp)
+
+ The unlock callback is called to release the lock obtained
+ by the lock callback. In receive callback mode the default
+ function is release_sock for the associated socket. In general
+ mode the callback must be set appropriately.
+
+ ::
+
+ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
+
+ rcv_msg is called when a full message has been received and
+ is queued. The callee must consume the sk_buff; it can
+ call strp_pause to prevent any further messages from being
+ received in rcv_msg (see strp_pause above). This callback
+ must be set.
+
+ The skb->cb in the input skb is a struct strp_msg. This
+ struct contains two fields: offset and full_len. Offset is
+ where the message starts in the skb, and full_len is the
+ the length of the message. skb->len - offset may be greater
+ then full_len since strparser does not trim the skb.
+
+ ::
+
+ int (*read_sock_done)(struct strparser *strp, int err);
+
+ read_sock_done is called when the stream parser is done reading
+ the TCP socket in receive callback mode. The stream parser may
+ read multiple messages in a loop and this function allows cleanup
+ to occur when exiting the loop. If the callback is not set (NULL
+ in strp_init) a default function is used.
+
+ ::
+
+ void (*abort_parser)(struct strparser *strp, int err);
+
+ This function is called when stream parser encounters an error
+ in parsing. The default function stops the stream parser and
+ sets the error in the socket if the parser is in receive callback
+ mode. The default function can be changed by setting the callback
+ to non-NULL in strp_init.
+
+Statistics
+==========
+
+Various counters are kept for each stream parser instance. These are in
+the strp_stats structure. strp_aggr_stats is a convenience structure for
+accumulating statistics for multiple stream parser instances.
+save_strp_stats and aggregate_strp_stats are helper functions to save
+and aggregate statistics.
+
+Message assembly limits
+=======================
+
+The stream parser provide mechanisms to limit the resources consumed by
+message assembly.
+
+A timer is set when assembly starts for a new message. In receive
+callback mode the message timeout is taken from rcvtime for the
+associated TCP socket. In general mode, the timeout is passed as an
+argument in strp_process. If the timer fires before assembly completes
+the stream parser is aborted and the ETIMEDOUT error is set on the TCP
+socket if in receive callback mode.
+
+In receive callback mode, message length is limited to the receive
+buffer size of the associated TCP socket. If the length returned by
+parse_msg is greater than the socket buffer size then the stream parser
+is aborted with EMSGSIZE error set on the TCP socket. Note that this
+makes the maximum size of receive skbuffs for a socket with a stream
+parser to be 2*sk_rcvbuf of the TCP socket.
+
+In general mode the message length limit is passed in as an argument
+to strp_process.
+
+Author
+======
+
+Tom Herbert (tom@quantonium.net)
diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
new file mode 100644
index 0000000000..758f1dae3f
--- /dev/null
+++ b/Documentation/networking/switchdev.rst
@@ -0,0 +1,564 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+.. _switchdev:
+
+===============================================
+Ethernet switch device driver model (switchdev)
+===============================================
+
+Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>
+
+Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>
+
+
+The Ethernet switch device driver model (switchdev) is an in-kernel driver
+model for switch devices which offload the forwarding (data) plane from the
+kernel.
+
+Figure 1 is a block diagram showing the components of the switchdev model for
+an example setup using a data-center-class switch ASIC chip. Other setups
+with SR-IOV or soft switches, such as OVS, are possible.
+
+::
+
+
+ User-space tools
+
+ user space |
+ +-------------------------------------------------------------------+
+ kernel | Netlink
+ |
+ +--------------+-------------------------------+
+ | Network stack |
+ | (Linux) |
+ | |
+ +----------------------------------------------+
+
+ sw1p2 sw1p4 sw1p6
+ sw1p1 + sw1p3 + sw1p5 + eth1
+ + | + | + | +
+ | | | | | | |
+ +--+----+----+----+----+----+---+ +-----+-----+
+ | Switch driver | | mgmt |
+ | (this document) | | driver |
+ | | | |
+ +--------------+----------------+ +-----------+
+ |
+ kernel | HW bus (eg PCI)
+ +-------------------------------------------------------------------+
+ hardware |
+ +--------------+----------------+
+ | Switch device (sw1) |
+ | +----+ +--------+
+ | | v offloaded data path | mgmt port
+ | | | |
+ +--|----|----+----+----+----+---+
+ | | | | | |
+ + + + + + +
+ p1 p2 p3 p4 p5 p6
+
+ front-panel ports
+
+
+ Fig 1.
+
+
+Include Files
+-------------
+
+::
+
+ #include <linux/netdevice.h>
+ #include <net/switchdev.h>
+
+
+Configuration
+-------------
+
+Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model
+support is built for driver.
+
+
+Switch Ports
+------------
+
+On switchdev driver initialization, the driver will allocate and register a
+struct net_device (using register_netdev()) for each enumerated physical switch
+port, called the port netdev. A port netdev is the software representation of
+the physical port and provides a conduit for control traffic to/from the
+controller (the kernel) and the network, as well as an anchor point for higher
+level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using
+standard netdev tools (iproute2, ethtool, etc), the port netdev can also
+provide to the user access to the physical properties of the switch port such
+as PHY link state and I/O statistics.
+
+There is (currently) no higher-level kernel object for the switch beyond the
+port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops.
+
+A switch management port is outside the scope of the switchdev driver model.
+Typically, the management port is not participating in offloaded data plane and
+is loaded with a different driver, such as a NIC driver, on the management port
+device.
+
+Switch ID
+^^^^^^^^^
+
+The switchdev driver must implement the net_device operation
+ndo_get_port_parent_id for each port netdev, returning the same physical ID for
+each port of a switch. The ID must be unique between switches on the same
+system. The ID does not need to be unique between switches on different
+systems.
+
+The switch ID is used to locate ports on a switch and to know if aggregated
+ports belong to the same switch.
+
+Port Netdev Naming
+^^^^^^^^^^^^^^^^^^
+
+Udev rules should be used for port netdev naming, using some unique attribute
+of the port as a key, for example the port MAC address or the port PHYS name.
+Hard-coding of kernel netdev names within the driver is discouraged; let the
+kernel pick the default netdev name, and let udev set the final name based on a
+port attribute.
+
+Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
+useful for dynamically-named ports where the device names its ports based on
+external configuration. For example, if a physical 40G port is split logically
+into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
+name for each port using port PHYS name. The udev rule would be::
+
+ SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
+ ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
+
+Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
+is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
+would be sub-port 0 on port 1 on switch 1.
+
+Port Features
+^^^^^^^^^^^^^
+
+NETIF_F_NETNS_LOCAL
+
+If the switchdev driver (and device) only supports offloading of the default
+network namespace (netns), the driver should set this feature flag to prevent
+the port netdev from being moved out of the default netns. A netns-aware
+driver/device would not set this flag and be responsible for partitioning
+hardware to preserve netns containment. This means hardware cannot forward
+traffic from a port in one namespace to another port in another namespace.
+
+Port Topology
+^^^^^^^^^^^^^
+
+The port netdevs representing the physical switch ports can be organized into
+higher-level switching constructs. The default construct is a standalone
+router port, used to offload L3 forwarding. Two or more ports can be bonded
+together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge
+L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3
+tunnels can be built on ports. These constructs are built using standard Linux
+tools such as the bridge driver, the bonding/team drivers, and netlink-based
+tools such as iproute2.
+
+The switchdev driver can know a particular port's position in the topology by
+monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
+bond will see its upper master change. If that bond is moved into a bridge,
+the bond's upper master will change. And so on. The driver will track such
+movements to know what position a port is in in the overall topology by
+registering for netdevice events and acting on NETDEV_CHANGEUPPER.
+
+L2 Forwarding Offload
+---------------------
+
+The idea is to offload the L2 data forwarding (switching) path from the kernel
+to the switchdev device by mirroring bridge FDB entries down to the device. An
+FDB entry is the {port, MAC, VLAN} tuple forwarding destination.
+
+To offloading L2 bridging, the switchdev driver/device should support:
+
+ - Static FDB entries installed on a bridge port
+ - Notification of learned/forgotten src mac/vlans from device
+ - STP state changes on the port
+ - VLAN flooding of multicast/broadcast and unknown unicast packets
+
+Static FDB Entries
+^^^^^^^^^^^^^^^^^^
+
+A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and
+``ndo_fdb_dump`` operations is able to support the command below, which adds a
+static bridge FDB entry::
+
+ bridge fdb add dev DEV ADDRESS [vlan VID] [self] static
+
+(the "static" keyword is non-optional: if not specified, the entry defaults to
+being "local", which means that it should not be forwarded)
+
+The "self" keyword (optional because it is implicit) has the role of
+instructing the kernel to fulfill the operation through the ``ndo_fdb_add``
+implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this
+will bypass the bridge and therefore leave the software database out of sync
+with the hardware one.
+
+To avoid this, the "master" keyword can be used::
+
+ bridge fdb add dev DEV ADDRESS [vlan VID] master static
+
+The above command instructs the kernel to search for a master interface of
+``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that.
+This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE`` notification
+which the port driver can handle and use it to program its hardware table. This
+way, the software and the hardware database will both contain this static FDB
+entry.
+
+Note: for new switchdev drivers that offload the Linux bridge, implementing the
+``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly
+discouraged: all static FDB entries should be added on a bridge port using the
+"master" flag. The ``ndo_fdb_dump`` is an exception and can be implemented to
+visualize the hardware tables, if the device does not have an interrupt for
+notifying the operating system of newly learned/forgotten dynamic FDB
+addresses. In that case, the hardware FDB might end up having entries that the
+software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to see
+them.
+
+Note: by default, the bridge does not filter on VLAN and only bridges untagged
+traffic. To enable VLAN support, turn on VLAN filtering::
+
+ echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
+
+Notification of Learned/Forgotten Source MAC/VLANs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The switch device will learn/forget source MAC address/VLAN on ingress packets
+and notify the switch driver of the mac/vlan/port tuples. The switch driver,
+in turn, will notify the bridge driver using the switchdev notifier call::
+
+ err = call_switchdev_notifiers(val, dev, info, extack);
+
+Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
+forgetting, and info points to a struct switchdev_notifier_fdb_info. On
+SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
+bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
+command will label these entries "offload"::
+
+ $ bridge fdb
+ 52:54:00:12:35:01 dev sw1p1 master br0 permanent
+ 00:02:00:00:02:00 dev sw1p1 master br0 offload
+ 00:02:00:00:02:00 dev sw1p1 self
+ 52:54:00:12:35:02 dev sw1p2 master br0 permanent
+ 00:02:00:00:03:00 dev sw1p2 master br0 offload
+ 00:02:00:00:03:00 dev sw1p2 self
+ 33:33:00:00:00:01 dev eth0 self permanent
+ 01:00:5e:00:00:01 dev eth0 self permanent
+ 33:33:ff:00:00:00 dev eth0 self permanent
+ 01:80:c2:00:00:0e dev eth0 self permanent
+ 33:33:00:00:00:01 dev br0 self permanent
+ 01:00:5e:00:00:01 dev br0 self permanent
+ 33:33:ff:12:35:01 dev br0 self permanent
+
+Learning on the port should be disabled on the bridge using the bridge command::
+
+ bridge link set dev DEV learning off
+
+Learning on the device port should be enabled, as well as learning_sync::
+
+ bridge link set dev DEV learning on self
+ bridge link set dev DEV learning_sync on self
+
+Learning_sync attribute enables syncing of the learned/forgotten FDB entry to
+the bridge's FDB. It's possible, but not optimal, to enable learning on the
+device port and on the bridge port, and disable learning_sync.
+
+To support learning, the driver implements switchdev op
+switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_{PRE}_BRIDGE_FLAGS.
+
+FDB Ageing
+^^^^^^^^^^
+
+The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
+the responsibility of the port driver/device to age out these entries. If the
+port device supports ageing, when the FDB entry expires, it will notify the
+driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the
+device does not support ageing, the driver can simulate ageing using a
+garbage collection timer to monitor FDB entries. Expired entries will be
+notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for
+example of driver running ageing timer.
+
+To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
+entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The
+notification will reset the FDB entry's last-used time to now. The driver
+should rate limit refresh notifications, for example, no more than once a
+second. (The last-used time is visible using the bridge -s fdb option).
+
+STP State Change on Port
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Internally or with a third-party STP protocol implementation (e.g. mstpd), the
+bridge driver maintains the STP state for ports, and will notify the switch
+driver of STP state change on a port using the switchdev op
+switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE.
+
+State is one of BR_STATE_*. The switch driver can use STP state updates to
+update ingress packet filter list for the port. For example, if port is
+DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs
+and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.
+
+Note that STP BDPUs are untagged and STP state applies to all VLANs on the port
+so packet filters should be applied consistently across untagged and tagged
+VLANs on the port.
+
+Flooding L2 domain
+^^^^^^^^^^^^^^^^^^
+
+For a given L2 VLAN domain, the switch device should flood multicast/broadcast
+and unknown unicast packets to all ports in domain, if allowed by port's
+current STP state. The switch driver, knowing which ports are within which
+vlan L2 domain, can program the switch device for flooding. The packet may
+be sent to the port netdev for processing by the bridge driver. The
+bridge should not reflood the packet to the same ports the device flooded,
+otherwise there will be duplicate packets on the wire.
+
+To avoid duplicate packets, the switch driver should mark a packet as already
+forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark
+the skb using the ingress bridge port's mark and prevent it from being forwarded
+through any bridge port with the same mark.
+
+It is possible for the switch device to not handle flooding and push the
+packets up to the bridge driver for flooding. This is not ideal as the number
+of ports scale in the L2 domain as the device is much more efficient at
+flooding packets that software.
+
+If supported by the device, flood control can be offloaded to it, preventing
+certain netdevs from flooding unicast traffic for which there is no FDB entry.
+
+IGMP Snooping
+^^^^^^^^^^^^^
+
+In order to support IGMP snooping, the port netdevs should trap to the bridge
+driver all IGMP join and leave messages.
+The bridge multicast module will notify port netdevs on every multicast group
+changed whether it is static configured or dynamically joined/leave.
+The hardware implementation should be forwarding all registered multicast
+traffic groups only to the configured ports.
+
+L3 Routing Offload
+------------------
+
+Offloading L3 routing requires that device be programmed with FIB entries from
+the kernel, with the device doing the FIB lookup and forwarding. The device
+does a longest prefix match (LPM) on FIB entries matching route prefix and
+forwards the packet to the matching FIB entry's nexthop(s) egress ports.
+
+To program the device, the driver has to register a FIB notifier handler
+using register_fib_notifier. The following events are available:
+
+=================== ===================================================
+FIB_EVENT_ENTRY_ADD used for both adding a new FIB entry to the device,
+ or modifying an existing entry on the device.
+FIB_EVENT_ENTRY_DEL used for removing a FIB entry
+FIB_EVENT_RULE_ADD,
+FIB_EVENT_RULE_DEL used to propagate FIB rule changes
+=================== ===================================================
+
+FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::
+
+ struct fib_entry_notifier_info {
+ struct fib_notifier_info info; /* must be first */
+ u32 dst;
+ int dst_len;
+ struct fib_info *fi;
+ u8 tos;
+ u8 type;
+ u32 tb_id;
+ u32 nlflags;
+ };
+
+to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The ``*fi``
+structure holds details on the route and route's nexthops. ``*dev`` is one
+of the port netdevs mentioned in the route's next hop list.
+
+Routes offloaded to the device are labeled with "offload" in the ip route
+listing::
+
+ $ ip route show
+ default via 192.168.0.2 dev eth0
+ 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload
+ 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
+ 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload
+ 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
+ 12.0.0.2 proto zebra metric 30 offload
+ nexthop via 11.0.0.1 dev sw1p1 weight 1
+ nexthop via 11.0.0.9 dev sw1p2 weight 1
+ 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
+ 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
+ 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15
+
+The "offload" flag is set in case at least one device offloads the FIB entry.
+
+XXX: add/mod/del IPv6 FIB API
+
+Nexthop Resolution
+^^^^^^^^^^^^^^^^^^
+
+The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
+the switch device to forward the packet with the correct dst mac address, the
+nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac
+address discovery comes via the ARP (or ND) process and is available via the
+arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver
+should trigger the kernel's neighbor resolution process. See the rocker
+driver's rocker_port_ipv4_resolve() for an example.
+
+The driver can monitor for updates to arp_tbl using the netevent notifier
+NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
+for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy
+to know when arp_tbl neighbor entries are purged from the port.
+
+Device driver expected behavior
+-------------------------------
+
+Below is a set of defined behavior that switchdev enabled network devices must
+adhere to.
+
+Configuration-less state
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Upon driver bring up, the network devices must be fully operational, and the
+backing driver must configure the network device such that it is possible to
+send and receive traffic to this network device and it is properly separated
+from other network devices/ports (e.g.: as is frequent with a switch ASIC). How
+this is achieved is heavily hardware dependent, but a simple solution can be to
+use per-port VLAN identifiers unless a better mechanism is available
+(proprietary metadata for each network port for instance).
+
+The network device must be capable of running a full IP protocol stack
+including multicast, DHCP, IPv4/6, etc. If necessary, it should program the
+appropriate filters for VLAN, multicast, unicast etc. The underlying device
+driver must effectively be configured in a similar fashion to what it would do
+when IGMP snooping is enabled for IP multicast over these switchdev network
+devices and unsolicited multicast must be filtered as early as possible in
+the hardware.
+
+When configuring VLANs on top of the network device, all VLANs must be working,
+irrespective of the state of other network devices (e.g.: other ports being part
+of a VLAN-aware bridge doing ingress VID checking). See below for details.
+
+If the device implements e.g.: VLAN filtering, putting the interface in
+promiscuous mode should allow the reception of all VLAN tags (including those
+not present in the filter(s)).
+
+Bridged switch ports
+^^^^^^^^^^^^^^^^^^^^
+
+When a switchdev enabled network device is added as a bridge member, it should
+not disrupt any functionality of non-bridged network devices and they
+should continue to behave as normal network devices. Depending on the bridge
+configuration knobs below, the expected behavior is documented.
+
+Bridge VLAN filtering
+^^^^^^^^^^^^^^^^^^^^^
+
+The Linux bridge allows the configuration of a VLAN filtering mode (statically,
+at device creation time, and dynamically, during run time) which must be
+observed by the underlying switchdev network device/hardware:
+
+- with VLAN filtering turned off: the bridge is strictly VLAN unaware and its
+ data path will process all Ethernet frames as if they are VLAN-untagged.
+ The bridge VLAN database can still be modified, but the modifications should
+ have no effect while VLAN filtering is turned off. Frames ingressing the
+ device with a VID that is not programmed into the bridge/switch's VLAN table
+ must be forwarded and may be processed using a VLAN device (see below).
+
+- with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing
+ the device with a VID that is not programmed into the bridges/switch's VLAN
+ table must be dropped (strict VID checking).
+
+When there is a VLAN device (e.g: sw0p1.100) configured on top of a switchdev
+network device which is a bridge port member, the behavior of the software
+network stack must be preserved, or the configuration must be refused if that
+is not possible.
+
+- with VLAN filtering turned off, the bridge will process all ingress traffic
+ for the port, except for the traffic tagged with a VLAN ID destined for a
+ VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even
+ be added to a second bridge, which includes other switch ports or software
+ interfaces. Some approaches to ensure that the forwarding domain for traffic
+ belonging to the VLAN upper interfaces are managed properly:
+
+ * If forwarding destinations can be managed per VLAN, the hardware could be
+ configured to map all traffic, except the packets tagged with a VID
+ belonging to a VLAN upper interface, to an internal VID corresponding to
+ untagged packets. This internal VID spans all ports of the VLAN-unaware
+ bridge. The VID corresponding to the VLAN upper interface spans the
+ physical port of that VLAN interface, as well as the other ports that
+ might be bridged with it.
+ * Treat bridge ports with VLAN upper interfaces as standalone, and let
+ forwarding be handled in the software data path.
+
+- with VLAN filtering turned on, these VLAN devices can be created as long as
+ the bridge does not have an existing VLAN entry with the same VID on any
+ bridge port. These VLAN devices cannot be enslaved into the bridge since they
+ duplicate functionality/use case with the bridge's VLAN data path processing.
+
+Non-bridged network ports of the same switch fabric must not be disturbed in any
+way by the enabling of VLAN filtering on the bridge device(s). If the VLAN
+filtering setting is global to the entire chip, then the standalone ports
+should indicate to the network stack that VLAN filtering is required by setting
+'rx-vlan-filter: on [fixed]' in the ethtool features.
+
+Because VLAN filtering can be turned on/off at runtime, the switchdev driver
+must be able to reconfigure the underlying hardware on the fly to honor the
+toggling of that option and behave appropriately. If that is not possible, the
+switchdev driver can also refuse to support dynamic toggling of the VLAN
+filtering knob at runtime and require a destruction of the bridge device(s) and
+creation of new bridge device(s) with a different VLAN filtering value to
+ensure VLAN awareness is pushed down to the hardware.
+
+Even when VLAN filtering in the bridge is turned off, the underlying switch
+hardware and driver may still configure itself in a VLAN-aware mode provided
+that the behavior described above is observed.
+
+The VLAN protocol of the bridge plays a role in deciding whether a packet is
+treated as tagged or not: a bridge using the 802.1ad protocol must treat both
+VLAN-untagged packets, as well as packets tagged with 802.1Q headers, as
+untagged.
+
+The 802.1p (VID 0) tagged packets must be treated in the same way by the device
+as untagged packets, since the bridge device does not allow the manipulation of
+VID 0 in its database.
+
+When the bridge has VLAN filtering enabled and a PVID is not configured on the
+ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge
+has VLAN filtering enabled and a PVID exists on the ingress port, untagged and
+priority-tagged packets must be accepted and forwarded according to the
+bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering
+disabled, the presence/lack of a PVID should not influence the packet
+forwarding decision.
+
+Bridge IGMP snooping
+^^^^^^^^^^^^^^^^^^^^
+
+The Linux bridge allows the configuration of IGMP snooping (statically, at
+interface creation time, or dynamically, during runtime) which must be observed
+by the underlying switchdev network device/hardware in the following way:
+
+- when IGMP snooping is turned off, multicast traffic must be flooded to all
+ ports within the same bridge that have mcast_flood=true. The CPU/management
+ port should ideally not be flooded (unless the ingress interface has
+ IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through
+ the network stack notifications. If the hardware is not capable of doing that
+ then the CPU/management port must also be flooded and multicast filtering
+ happens in software.
+
+- when IGMP snooping is turned on, multicast traffic must selectively flow
+ to the appropriate network ports (including CPU/management port). Flooding of
+ unknown multicast should be only towards the ports connected to a multicast
+ router (the local device may also act as a multicast router).
+
+The switch must adhere to RFC 4541 and flood multicast traffic accordingly
+since that is what the Linux bridge implementation does.
+
+Because IGMP snooping can be turned on/off at runtime, the switchdev driver
+must be able to reconfigure the underlying hardware on the fly to honor the
+toggling of that option and behave appropriately.
+
+A switchdev driver can also refuse to support dynamic toggling of the multicast
+snooping knob at runtime and require the destruction of the bridge device(s)
+and creation of a new bridge device(s) with a different multicast snooping
+value.
diff --git a/Documentation/networking/sysfs-tagging.rst b/Documentation/networking/sysfs-tagging.rst
new file mode 100644
index 0000000000..65307130ab
--- /dev/null
+++ b/Documentation/networking/sysfs-tagging.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Sysfs tagging
+=============
+
+(Taken almost verbatim from Eric Biederman's netns tagging patch
+commit msg)
+
+The problem. Network devices show up in sysfs and with the network
+namespace active multiple devices with the same name can show up in
+the same directory, ouch!
+
+To avoid that problem and allow existing applications in network
+namespaces to see the same interface that is currently presented in
+sysfs, sysfs now has tagging directory support.
+
+By using the network namespace pointers as tags to separate out
+the sysfs directory entries we ensure that we don't have conflicts
+in the directories and applications only see a limited set of
+the network devices.
+
+Each sysfs directory entry may be tagged with a namespace via the
+``void *ns member`` of its ``kernfs_node``. If a directory entry is tagged,
+then ``kernfs_node->flags`` will have a flag between KOBJ_NS_TYPE_NONE
+and KOBJ_NS_TYPES, and ns will point to the namespace to which it
+belongs.
+
+Each sysfs superblock's kernfs_super_info contains an array
+``void *ns[KOBJ_NS_TYPES]``. When a task in a tagging namespace
+kobj_nstype first mounts sysfs, a new superblock is created. It
+will be differentiated from other sysfs mounts by having its
+``s_fs_info->ns[kobj_nstype]`` set to the new namespace. Note that
+through bind mounting and mounts propagation, a task can easily view
+the contents of other namespaces' sysfs mounts. Therefore, when a
+namespace exits, it will call kobj_ns_exit() to invalidate any
+kernfs_node->ns pointers pointing to it.
+
+Users of this interface:
+
+- define a type in the ``kobj_ns_type`` enumeration.
+- call kobj_ns_type_register() with its ``kobj_ns_type_operations`` which has
+
+ - current_ns() which returns current's namespace
+ - netlink_ns() which returns a socket's namespace
+ - initial_ns() which returns the initial namespace
+
+- call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/Documentation/networking/tc-actions-env-rules.rst b/Documentation/networking/tc-actions-env-rules.rst
new file mode 100644
index 0000000000..86884b8fb4
--- /dev/null
+++ b/Documentation/networking/tc-actions-env-rules.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+TC Actions - Environmental Rules
+================================
+
+
+The "environmental" rules for authors of any new tc actions are:
+
+1) If you stealeth or borroweth any packet thou shalt be branching
+ from the righteous path and thou shalt cloneth.
+
+ For example if your action queues a packet to be processed later,
+ or intentionally branches by redirecting a packet, then you need to
+ clone the packet.
+
+2) If you munge any packet thou shalt call pskb_expand_head in the case
+ someone else is referencing the skb. After that you "own" the skb.
+
+3) Dropping packets you don't own is a no-no. You simply return
+ TC_ACT_SHOT to the caller and they will drop it.
+
+The "environmental" rules for callers of actions (qdiscs etc) are:
+
+#) Thou art responsible for freeing anything returned as being
+ TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
+ returned, then all is great and you don't need to do anything.
+
+Post on netdev if something is unclear.
diff --git a/Documentation/networking/tc-queue-filters.rst b/Documentation/networking/tc-queue-filters.rst
new file mode 100644
index 0000000000..6b41709227
--- /dev/null
+++ b/Documentation/networking/tc-queue-filters.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+TC queue based filtering
+=========================
+
+TC can be used for directing traffic to either a set of queues or
+to a single queue on both the transmit and receive side.
+
+On the transmit side:
+
+1) TC filter directing traffic to a set of queues is achieved
+ using the action skbedit priority for Tx priority selection,
+ the priority maps to a traffic class (set of queues) when
+ the queue-sets are configured using mqprio.
+
+2) TC filter directs traffic to a transmit queue with the action
+ skbedit queue_mapping $tx_qid. The action skbedit queue_mapping
+ for transmit queue is executed in software only and cannot be
+ offloaded.
+
+Likewise, on the receive side, the two filters for selecting set of
+queues and/or a single queue are supported as below:
+
+1) TC flower filter directs incoming traffic to a set of queues using
+ the 'hw_tc' option.
+ hw_tc $TCID - Specify a hardware traffic class to pass matching
+ packets on to. TCID is in the range 0 through 15.
+
+2) TC filter with action skbedit queue_mapping $rx_qid selects a
+ receive queue. The action skbedit queue_mapping for receive queue
+ is supported only in hardware. Multiple filters may compete in
+ the hardware for queue selection. In such case, the hardware
+ pipeline resolves conflicts based on priority. On Intel E810
+ devices, TC filter directing traffic to a queue have higher
+ priority over flow director filter assigning a queue. The hash
+ filter has lowest priority.
diff --git a/Documentation/networking/tcp-thin.rst b/Documentation/networking/tcp-thin.rst
new file mode 100644
index 0000000000..b06765c96e
--- /dev/null
+++ b/Documentation/networking/tcp-thin.rst
@@ -0,0 +1,52 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Thin-streams and TCP
+====================
+
+A wide range of Internet-based services that use reliable transport
+protocols display what we call thin-stream properties. This means
+that the application sends data with such a low rate that the
+retransmission mechanisms of the transport protocol are not fully
+effective. In time-dependent scenarios (like online games, control
+systems, stock trading etc.) where the user experience depends
+on the data delivery latency, packet loss can be devastating for
+the service quality. Extreme latencies are caused by TCP's
+dependency on the arrival of new data from the application to trigger
+retransmissions effectively through fast retransmit instead of
+waiting for long timeouts.
+
+After analysing a large number of time-dependent interactive
+applications, we have seen that they often produce thin streams
+and also stay with this traffic pattern throughout its entire
+lifespan. The combination of time-dependency and the fact that the
+streams provoke high latencies when using TCP is unfortunate.
+
+In order to reduce application-layer latency when packets are lost,
+a set of mechanisms has been made, which address these latency issues
+for thin streams. In short, if the kernel detects a thin stream,
+the retransmission mechanisms are modified in the following manner:
+
+1) If the stream is thin, fast retransmit on the first dupACK.
+2) If the stream is thin, do not apply exponential backoff.
+
+These enhancements are applied only if the stream is detected as
+thin. This is accomplished by defining a threshold for the number
+of packets in flight. If there are less than 4 packets in flight,
+fast retransmissions can not be triggered, and the stream is prone
+to experience high retransmission latencies.
+
+Since these mechanisms are targeted at time-dependent applications,
+they must be specifically activated by the application using the
+TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK IOCTLS or the
+tcp_thin_linear_timeouts and tcp_thin_dupack sysctls. Both
+modifications are turned off by default.
+
+References
+==========
+More information on the modifications, as well as a wide range of
+experimental data can be found here:
+
+"Improving latency for interactive, thin-stream applications over
+reliable transport"
+http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file
diff --git a/Documentation/networking/team.rst b/Documentation/networking/team.rst
new file mode 100644
index 0000000000..0a7f3a0595
--- /dev/null
+++ b/Documentation/networking/team.rst
@@ -0,0 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+Team
+====
+
+Team devices are driven from userspace via libteam library which is here:
+ https://github.com/jpirko/libteam
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
new file mode 100644
index 0000000000..f17c01834a
--- /dev/null
+++ b/Documentation/networking/timestamping.rst
@@ -0,0 +1,802 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Timestamping
+============
+
+
+1. Control Interfaces
+=====================
+
+The interfaces for receiving network packages timestamps are:
+
+SO_TIMESTAMP
+ Generates a timestamp for each incoming packet in (not necessarily
+ monotonic) system time. Reports the timestamp via recvmsg() in a
+ control message in usec resolution.
+ SO_TIMESTAMP is defined as SO_TIMESTAMP_NEW or SO_TIMESTAMP_OLD
+ based on the architecture type and time_t representation of libc.
+ Control message format is in struct __kernel_old_timeval for
+ SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for
+ SO_TIMESTAMP_NEW options respectively.
+
+SO_TIMESTAMPNS
+ Same timestamping mechanism as SO_TIMESTAMP, but reports the
+ timestamp as struct timespec in nsec resolution.
+ SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD
+ based on the architecture type and time_t representation of libc.
+ Control message format is in struct timespec for SO_TIMESTAMPNS_OLD
+ and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options
+ respectively.
+
+IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
+ Only for multicast:approximate transmit timestamp obtained by
+ reading the looped packet receive timestamp.
+
+SO_TIMESTAMPING
+ Generates timestamps on reception, transmission or both. Supports
+ multiple timestamp sources, including hardware. Supports generating
+ timestamps for stream sockets.
+
+
+1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW)
+-------------------------------------------------------------
+
+This socket option enables timestamping of datagrams on the reception
+path. Because the destination socket, if any, is not known early in
+the network stack, the feature has to be enabled for all packets. The
+same is true for all early receive timestamp options.
+
+For interface details, see `man 7 socket`.
+
+Always use SO_TIMESTAMP_NEW timestamp to always get timestamp in
+struct __kernel_sock_timeval format.
+
+SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038
+on 32 bit machines.
+
+1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW)
+-------------------------------------------------------------------
+
+This option is identical to SO_TIMESTAMP except for the returned data type.
+Its struct timespec allows for higher resolution (ns) timestamps than the
+timeval of SO_TIMESTAMP (ms).
+
+Always use SO_TIMESTAMPNS_NEW timestamp to always get timestamp in
+struct __kernel_timespec format.
+
+SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038
+on 32 bit machines.
+
+1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW)
+----------------------------------------------------------------------
+
+Supports multiple types of timestamp requests. As a result, this
+socket option takes a bitmap of flags, not a boolean. In::
+
+ err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
+
+val is an integer with any of the following bits set. Setting other
+bit returns EINVAL and does not change the current state.
+
+The socket option configures timestamp generation for individual
+sk_buffs (1.3.1), timestamp reporting to the socket's error
+queue (1.3.2) and options (1.3.3). Timestamp generation can also
+be enabled for individual sendmsg calls using cmsg (1.3.4).
+
+
+1.3.1 Timestamp Generation
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Some bits are requests to the stack to try to generate timestamps. Any
+combination of them is valid. Changes to these bits apply to newly
+created packets, not to packets already in the stack. As a result, it
+is possible to selectively request timestamps for a subset of packets
+(e.g., for sampling) by embedding an send() call within two setsockopt
+calls, one to enable timestamp generation and one to disable it.
+Timestamps may also be generated for reasons other than being
+requested by a particular socket, such as when receive timestamping is
+enabled system wide, as explained earlier.
+
+SOF_TIMESTAMPING_RX_HARDWARE:
+ Request rx timestamps generated by the network adapter.
+
+SOF_TIMESTAMPING_RX_SOFTWARE:
+ Request rx timestamps when data enters the kernel. These timestamps
+ are generated just after a device driver hands a packet to the
+ kernel receive stack.
+
+SOF_TIMESTAMPING_TX_HARDWARE:
+ Request tx timestamps generated by the network adapter. This flag
+ can be enabled via both socket options and control messages.
+
+SOF_TIMESTAMPING_TX_SOFTWARE:
+ Request tx timestamps when data leaves the kernel. These timestamps
+ are generated in the device driver as close as possible, but always
+ prior to, passing the packet to the network interface. Hence, they
+ require driver support and may not be available for all devices.
+ This flag can be enabled via both socket options and control messages.
+
+SOF_TIMESTAMPING_TX_SCHED:
+ Request tx timestamps prior to entering the packet scheduler. Kernel
+ transmit latency is, if long, often dominated by queuing delay. The
+ difference between this timestamp and one taken at
+ SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent
+ of protocol processing. The latency incurred in protocol
+ processing, if any, can be computed by subtracting a userspace
+ timestamp taken immediately before send() from this timestamp. On
+ machines with virtual devices where a transmitted packet travels
+ through multiple devices and, hence, multiple packet schedulers,
+ a timestamp is generated at each layer. This allows for fine
+ grained measurement of queuing delay. This flag can be enabled
+ via both socket options and control messages.
+
+SOF_TIMESTAMPING_TX_ACK:
+ Request tx timestamps when all data in the send buffer has been
+ acknowledged. This only makes sense for reliable protocols. It is
+ currently only implemented for TCP. For that protocol, it may
+ over-report measurement, because the timestamp is generated when all
+ data up to and including the buffer at send() was acknowledged: the
+ cumulative acknowledgment. The mechanism ignores SACK and FACK.
+ This flag can be enabled via both socket options and control messages.
+
+
+1.3.2 Timestamp Reporting
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The other three bits control which timestamps will be reported in a
+generated control message. Changes to the bits take immediate
+effect at the timestamp reporting locations in the stack. Timestamps
+are only reported for packets that also have the relevant timestamp
+generation request set.
+
+SOF_TIMESTAMPING_SOFTWARE:
+ Report any software timestamps when available.
+
+SOF_TIMESTAMPING_SYS_HARDWARE:
+ This option is deprecated and ignored.
+
+SOF_TIMESTAMPING_RAW_HARDWARE:
+ Report hardware timestamps as generated by
+ SOF_TIMESTAMPING_TX_HARDWARE when available.
+
+
+1.3.3 Timestamp Options
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The interface supports the options
+
+SOF_TIMESTAMPING_OPT_ID:
+ Generate a unique identifier along with each packet. A process can
+ have multiple concurrent timestamping requests outstanding. Packets
+ can be reordered in the transmit path, for instance in the packet
+ scheduler. In that case timestamps will be queued onto the error
+ queue out of order from the original send() calls. It is not always
+ possible to uniquely match timestamps to the original send() calls
+ based on timestamp order or payload inspection alone, then.
+
+ This option associates each packet at send() with a unique
+ identifier and returns that along with the timestamp. The identifier
+ is derived from a per-socket u32 counter (that wraps). For datagram
+ sockets, the counter increments with each sent packet. For stream
+ sockets, it increments with every byte. For stream sockets, also set
+ SOF_TIMESTAMPING_OPT_ID_TCP, see the section below.
+
+ The counter starts at zero. It is initialized the first time that
+ the socket option is enabled. It is reset each time the option is
+ enabled after having been disabled. Resetting the counter does not
+ change the identifiers of existing packets in the system.
+
+ This option is implemented only for transmit timestamps. There, the
+ timestamp is always looped along with a struct sock_extended_err.
+ The option modifies field ee_data to pass an id that is unique
+ among all possibly concurrently outstanding timestamp requests for
+ that socket.
+
+SOF_TIMESTAMPING_OPT_ID_TCP:
+ Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
+ timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
+ counter increments for stream sockets, but its starting point is
+ not entirely trivial. This option fixes that.
+
+ For stream sockets, if SOF_TIMESTAMPING_OPT_ID is set, this should
+ always be set too. On datagram sockets the option has no effect.
+
+ A reasonable expectation is that the counter is reset to zero with
+ the system call, so that a subsequent write() of N bytes generates
+ a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
+ implements this behavior under all conditions.
+
+ SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
+ especially when the socket option is set when no data is in
+ transmission. If data is being transmitted, it may be off by the
+ length of the output queue (SIOCOUTQ).
+
+ The difference is due to being based on snd_una versus write_seq.
+ snd_una is the offset in the stream acknowledged by the peer. This
+ depends on factors outside of process control, such as network RTT.
+ write_seq is the last byte written by the process. This offset is
+ not affected by external inputs.
+
+ The difference is subtle and unlikely to be noticed when configured
+ at initial socket creation, when no data is queued or sent. But
+ SOF_TIMESTAMPING_OPT_ID_TCP behavior is more robust regardless of
+ when the socket option is set.
+
+SOF_TIMESTAMPING_OPT_CMSG:
+ Support recv() cmsg for all timestamped packets. Control messages
+ are already supported unconditionally on all packets with receive
+ timestamps and on IPv6 packets with transmit timestamp. This option
+ extends them to IPv4 packets with transmit timestamp. One use case
+ is to correlate packets with their egress device, by enabling socket
+ option IP_PKTINFO simultaneously.
+
+
+SOF_TIMESTAMPING_OPT_TSONLY:
+ Applies to transmit timestamps only. Makes the kernel return the
+ timestamp as a cmsg alongside an empty packet, as opposed to
+ alongside the original packet. This reduces the amount of memory
+ charged to the socket's receive budget (SO_RCVBUF) and delivers
+ the timestamp even if sysctl net.core.tstamp_allow_data is 0.
+ This option disables SOF_TIMESTAMPING_OPT_CMSG.
+
+SOF_TIMESTAMPING_OPT_STATS:
+ Optional stats that are obtained along with the transmit timestamps.
+ It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the
+ transmit timestamp is available, the stats are available in a
+ separate control message of type SCM_TIMESTAMPING_OPT_STATS, as a
+ list of TLVs (struct nlattr) of types. These stats allow the
+ application to associate various transport layer stats with
+ the transmit timestamps, such as how long a certain block of
+ data was limited by peer's receiver window.
+
+SOF_TIMESTAMPING_OPT_PKTINFO:
+ Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming
+ packets with hardware timestamps. The message contains struct
+ scm_ts_pktinfo, which supplies the index of the real interface which
+ received the packet and its length at layer 2. A valid (non-zero)
+ interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is
+ enabled and the driver is using NAPI. The struct contains also two
+ other fields, but they are reserved and undefined.
+
+SOF_TIMESTAMPING_OPT_TX_SWHW:
+ Request both hardware and software timestamps for outgoing packets
+ when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE
+ are enabled at the same time. If both timestamps are generated,
+ two separate messages will be looped to the socket's error queue,
+ each containing just one timestamp.
+
+New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
+disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
+regardless of the setting of sysctl net.core.tstamp_allow_data.
+
+An exception is when a process needs additional cmsg data, for
+instance SOL_IP/IP_PKTINFO to detect the egress network interface.
+Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on
+having access to the contents of the original packet, so cannot be
+combined with SOF_TIMESTAMPING_OPT_TSONLY.
+
+
+1.3.4. Enabling timestamps via control messages
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In addition to socket options, timestamp generation can be requested
+per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1).
+Using this feature, applications can sample timestamps per sendmsg()
+without paying the overhead of enabling and disabling timestamps via
+setsockopt::
+
+ struct msghdr *msg;
+ ...
+ cmsg = CMSG_FIRSTHDR(msg);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SO_TIMESTAMPING;
+ cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
+ *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED |
+ SOF_TIMESTAMPING_TX_SOFTWARE |
+ SOF_TIMESTAMPING_TX_ACK;
+ err = sendmsg(fd, msg, 0);
+
+The SOF_TIMESTAMPING_TX_* flags set via cmsg will override
+the SOF_TIMESTAMPING_TX_* flags set via setsockopt.
+
+Moreover, applications must still enable timestamp reporting via
+setsockopt to receive timestamps::
+
+ __u32 val = SOF_TIMESTAMPING_SOFTWARE |
+ SOF_TIMESTAMPING_OPT_ID /* or any other flag */;
+ err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
+
+
+1.4 Bytestream Timestamps
+-------------------------
+
+The SO_TIMESTAMPING interface supports timestamping of bytes in a
+bytestream. Each request is interpreted as a request for when the
+entire contents of the buffer has passed a timestamping point. That
+is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record
+when all bytes have reached the device driver, regardless of how
+many packets the data has been converted into.
+
+In general, bytestreams have no natural delimiters and therefore
+correlating a timestamp with data is non-trivial. A range of bytes
+may be split across segments, any segments may be merged (possibly
+coalescing sections of previously segmented buffers associated with
+independent send() calls). Segments can be reordered and the same
+byte range can coexist in multiple segments for protocols that
+implement retransmissions.
+
+It is essential that all timestamps implement the same semantics,
+regardless of these possible transformations, as otherwise they are
+incomparable. Handling "rare" corner cases differently from the
+simple case (a 1:1 mapping from buffer to skb) is insufficient
+because performance debugging often needs to focus on such outliers.
+
+In practice, timestamps can be correlated with segments of a
+bytestream consistently, if both semantics of the timestamp and the
+timing of measurement are chosen correctly. This challenge is no
+different from deciding on a strategy for IP fragmentation. There, the
+definition is that only the first fragment is timestamped. For
+bytestreams, we chose that a timestamp is generated only when all
+bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to
+implement and reason about. An implementation that has to take into
+account SACK would be more complex due to possible transmission holes
+and out of order arrival.
+
+On the host, TCP can also break the simple 1:1 mapping from buffer to
+skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The
+implementation ensures correctness in all cases by tracking the
+individual last byte passed to send(), even if it is no longer the
+last byte after an skbuff extend or merge operation. It stores the
+relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff
+has only one such field, only one timestamp can be generated.
+
+In rare cases, a timestamp request can be missed if two requests are
+collapsed onto the same skb. A process can detect this situation by
+enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
+send time with the value returned for each timestamp. It can prevent
+the situation by always flushing the TCP stack in between requests,
+for instance by enabling TCP_NODELAY and disabling TCP_CORK and
+autocork.
+
+These precautions ensure that the timestamp is generated only when all
+bytes have passed a timestamp point, assuming that the network stack
+itself does not reorder the segments. The stack indeed tries to avoid
+reordering. The one exception is under administrator control: it is
+possible to construct a packet scheduler configuration that delays
+segments from the same stream differently. Such a setup would be
+unusual.
+
+
+2 Data Interfaces
+==================
+
+Timestamps are read using the ancillary data feature of recvmsg().
+See `man 3 cmsg` for details of this interface. The socket manual
+page (`man 7 socket`) describes how timestamps generated with
+SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.
+
+
+2.1 SCM_TIMESTAMPING records
+----------------------------
+
+These timestamps are returned in a control message with cmsg_level
+SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type
+
+For SO_TIMESTAMPING_OLD::
+
+ struct scm_timestamping {
+ struct timespec ts[3];
+ };
+
+For SO_TIMESTAMPING_NEW::
+
+ struct scm_timestamping64 {
+ struct __kernel_timespec ts[3];
+
+Always use SO_TIMESTAMPING_NEW timestamp to always get timestamp in
+struct scm_timestamping64 format.
+
+SO_TIMESTAMPING_OLD returns incorrect timestamps after the year 2038
+on 32 bit machines.
+
+The structure can return up to three timestamps. This is a legacy
+feature. At least one field is non-zero at any time. Most timestamps
+are passed in ts[0]. Hardware timestamps are passed in ts[2].
+
+ts[1] used to hold hardware timestamps converted to system time.
+Instead, expose the hardware clock device on the NIC directly as
+a HW PTP clock source, to allow time conversion in userspace and
+optionally synchronize system time with a userspace PTP stack such
+as linuxptp. For the PTP clock API, see Documentation/driver-api/ptp.rst.
+
+Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled
+together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false
+software timestamp will be generated in the recvmsg() call and passed
+in ts[0] when a real software timestamp is missing. This happens also
+on hardware transmit timestamps.
+
+2.1.1 Transmit timestamps with MSG_ERRQUEUE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For transmit timestamps the outgoing packet is looped back to the
+socket's error queue with the send timestamp(s) attached. A process
+receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE
+set and with a msg_control buffer sufficiently large to receive the
+relevant metadata structures. The recvmsg call returns the original
+outgoing data packet with two ancillary messages attached.
+
+A message of cm_level SOL_IP(V6) and cm_type IP(V6)_RECVERR
+embeds a struct sock_extended_err. This defines the error type. For
+timestamps, the ee_errno field is ENOMSG. The other ancillary message
+will have cm_level SOL_SOCKET and cm_type SCM_TIMESTAMPING. This
+embeds the struct scm_timestamping.
+
+
+2.1.1.2 Timestamp types
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The semantics of the three struct timespec are defined by field
+ee_info in the extended error structure. It contains a value of
+type SCM_TSTAMP_* to define the actual timestamp passed in
+scm_timestamping.
+
+The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_*
+control fields discussed previously, with one exception. For legacy
+reasons, SCM_TSTAMP_SND is equal to zero and can be set for both
+SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It
+is the first if ts[2] is non-zero, the second otherwise, in which
+case the timestamp is stored in ts[0].
+
+
+2.1.1.3 Fragmentation
+~~~~~~~~~~~~~~~~~~~~~
+
+Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
+explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
+then only the first fragment is timestamped and returned to the sending
+socket.
+
+
+2.1.1.4 Packet Payload
+~~~~~~~~~~~~~~~~~~~~~~
+
+The calling application is often not interested in receiving the whole
+packet payload that it passed to the stack originally: the socket
+error queue mechanism is just a method to piggyback the timestamp on.
+In this case, the application can choose to read datagrams with a
+smaller buffer, possibly even of length 0. The payload is truncated
+accordingly. Until the process calls recvmsg() on the error queue,
+however, the full packet is queued, taking up budget from SO_RCVBUF.
+
+
+2.1.1.5 Blocking Read
+~~~~~~~~~~~~~~~~~~~~~
+
+Reading from the error queue is always a non-blocking operation. To
+block waiting on a timestamp, use poll or select. poll() will return
+POLLERR in pollfd.revents if any data is ready on the error queue.
+There is no need to pass this flag in pollfd.events. This flag is
+ignored on request. See also `man 2 poll`.
+
+
+2.1.2 Receive timestamps
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+On reception, there is no reason to read from the socket error queue.
+The SCM_TIMESTAMPING ancillary data is sent along with the packet data
+on a normal recvmsg(). Since this is not a socket error, it is not
+accompanied by a message SOL_IP(V6)/IP(V6)_RECVERROR. In this case,
+the meaning of the three fields in struct scm_timestamping is
+implicitly defined. ts[0] holds a software timestamp if set, ts[1]
+is again deprecated and ts[2] holds a hardware timestamp if set.
+
+
+3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
+=======================================================================
+
+Hardware time stamping must also be initialized for each device driver
+that is expected to do hardware time stamping. The parameter is defined in
+include/uapi/linux/net_tstamp.h as::
+
+ struct hwtstamp_config {
+ int flags; /* no flags defined right now, must be zero */
+ int tx_type; /* HWTSTAMP_TX_* */
+ int rx_filter; /* HWTSTAMP_FILTER_* */
+ };
+
+Desired behavior is passed into the kernel and to a specific device by
+calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
+ifr_data points to a struct hwtstamp_config. The tx_type and
+rx_filter are hints to the driver what it is expected to do. If
+the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets.
+
+Drivers are free to use a more permissive configuration than the requested
+configuration. It is expected that drivers should only implement directly the
+most generic mode that can be supported. For example if the hardware can
+support HWTSTAMP_FILTER_PTP_V2_EVENT, then it should generally always upscale
+HWTSTAMP_FILTER_PTP_V2_L2_SYNC, and so forth, as HWTSTAMP_FILTER_PTP_V2_EVENT
+is more generic (and more useful to applications).
+
+A driver which supports hardware time stamping shall update the struct
+with the actual, possibly more permissive configuration. If the
+requested packets cannot be time stamped, then nothing should be
+changed and ERANGE shall be returned (in contrast to EINVAL, which
+indicates that SIOCSHWTSTAMP is not supported at all).
+
+Only a processes with admin rights may change the configuration. User
+space is responsible to ensure that multiple processes don't interfere
+with each other and that the settings are reset.
+
+Any process can read the actual configuration by passing this
+structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
+not been implemented in all drivers.
+
+::
+
+ /* possible values for hwtstamp_config->tx_type */
+ enum {
+ /*
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
+
+ /*
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+ };
+
+ /* possible values for hwtstamp_config->rx_filter */
+ enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+
+ /* for the complete list of values, please check
+ * the include file include/uapi/linux/net_tstamp.h
+ */
+ };
+
+3.1 Hardware Timestamping Implementation: Device Drivers
+--------------------------------------------------------
+
+A driver which supports hardware time stamping must support the
+SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
+the actual values as described in the section on SIOCSHWTSTAMP. It
+should also support SIOCGHWTSTAMP.
+
+Time stamps for received packets must be stored in the skb. To get a pointer
+to the shared time stamp structure of the skb call skb_hwtstamps(). Then
+set the time stamps in the structure::
+
+ struct skb_shared_hwtstamps {
+ /* hardware time stamp transformed into duration
+ * since arbitrary point in time
+ */
+ ktime_t hwtstamp;
+ };
+
+Time stamps for outgoing packets are to be generated as follows:
+
+- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)
+ is set no-zero. If yes, then the driver is expected to do hardware time
+ stamping.
+- If this is possible for the skb and requested, then declare
+ that the driver is doing the time stamping by setting the flag
+ SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with::
+
+ skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+
+ You might want to keep a pointer to the associated skb for the next step
+ and not free the skb. A driver not supporting hardware time stamping doesn't
+ do that. A driver must never touch sk_buff::tstamp! It is used to store
+ software generated time stamps by the network subsystem.
+- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware
+ as possible. skb_tx_timestamp() provides a software time stamp if requested
+ and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set).
+- As soon as the driver has sent the packet and/or obtained a
+ hardware time stamp for it, it passes the time stamp back by
+ calling skb_tstamp_tx() with the original skb, the raw
+ hardware time stamp. skb_tstamp_tx() clones the original skb and
+ adds the timestamps, therefore the original skb has to be freed now.
+ If obtaining the hardware time stamp somehow fails, then the driver
+ should not fall back to software time stamping. The rationale is that
+ this would occur at a later time in the processing pipeline than other
+ software time stamping and therefore could lead to unexpected deltas
+ between time stamps.
+
+3.2 Special considerations for stacked PTP Hardware Clocks
+----------------------------------------------------------
+
+There are situations when there may be more than one PHC (PTP Hardware Clock)
+in the data path of a packet. The kernel has no explicit mechanism to allow the
+user to select which PHC to use for timestamping Ethernet frames. Instead, the
+assumption is that the outermost PHC is always the most preferable, and that
+kernel drivers collaborate towards achieving that goal. Currently there are 3
+cases of stacked PHCs, detailed below:
+
+3.2.1 DSA (Distributed Switch Architecture) switches
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These are Ethernet switches which have one of their ports connected to an
+(otherwise completely unaware) host Ethernet interface, and perform the role of
+a port multiplier with optional forwarding acceleration features. Each DSA
+switch port is visible to the user as a standalone (virtual) network interface,
+and its network I/O is performed, under the hood, indirectly through the host
+interface (redirecting to the host port on TX, and intercepting frames on RX).
+
+When a DSA switch is attached to a host port, PTP synchronization has to
+suffer, since the switch's variable queuing delay introduces a path delay
+jitter between the host port and its PTP partner. For this reason, some DSA
+switches include a timestamping clock of their own, and have the ability to
+perform network timestamping on their own MAC, such that path delays only
+measure wire and PHY propagation latencies. Timestamping DSA switches are
+supported in Linux and expose the same ABI as any other network interface (save
+for the fact that the DSA interfaces are in fact virtual in terms of network
+I/O, they do have their own PHC). It is typical, but not mandatory, for all
+interfaces of a DSA switch to share the same PHC.
+
+By design, PTP timestamping with a DSA switch does not need any special
+handling in the driver for the host port it is attached to. However, when the
+host port also supports PTP timestamping, DSA will take care of intercepting
+the ``.ndo_eth_ioctl`` calls towards the host port, and block attempts to enable
+hardware timestamping on it. This is because the SO_TIMESTAMPING API does not
+allow the delivery of multiple hardware timestamps for the same packet, so
+anybody else except for the DSA switch port must be prevented from doing so.
+
+In the generic layer, DSA provides the following infrastructure for PTP
+timestamping:
+
+- ``.port_txtstamp()``: a hook called prior to the transmission of
+ packets with a hardware TX timestamping request from user space.
+ This is required for two-step timestamping, since the hardware
+ timestamp becomes available after the actual MAC transmission, so the
+ driver must be prepared to correlate the timestamp with the original
+ packet so that it can re-enqueue the packet back into the socket's
+ error queue. To save the packet for when the timestamp becomes
+ available, the driver can call ``skb_clone_sk`` , save the clone pointer
+ in skb->cb and enqueue a tx skb queue. Typically, a switch will have a
+ PTP TX timestamp register (or sometimes a FIFO) where the timestamp
+ becomes available. In case of a FIFO, the hardware might store
+ key-value pairs of PTP sequence ID/message type/domain number and the
+ actual timestamp. To perform the correlation correctly between the
+ packets in a queue waiting for timestamping and the actual timestamps,
+ drivers can use a BPF classifier (``ptp_classify_raw``) to identify
+ the PTP transport type, and ``ptp_parse_header`` to interpret the PTP
+ header fields. There may be an IRQ that is raised upon this
+ timestamp's availability, or the driver might have to poll after
+ invoking ``dev_queue_xmit()`` towards the host interface.
+ One-step TX timestamping do not require packet cloning, since there is
+ no follow-up message required by the PTP protocol (because the
+ TX timestamp is embedded into the packet by the MAC), and therefore
+ user space does not expect the packet annotated with the TX timestamp
+ to be re-enqueued into its socket's error queue.
+
+- ``.port_rxtstamp()``: On RX, the BPF classifier is run by DSA to
+ identify PTP event messages (any other packets, including PTP general
+ messages, are not timestamped). The original (and only) timestampable
+ skb is provided to the driver, for it to annotate it with a timestamp,
+ if that is immediately available, or defer to later. On reception,
+ timestamps might either be available in-band (through metadata in the
+ DSA header, or attached in other ways to the packet), or out-of-band
+ (through another RX timestamping FIFO). Deferral on RX is typically
+ necessary when retrieving the timestamp needs a sleepable context. In
+ that case, it is the responsibility of the DSA driver to call
+ ``netif_rx()`` on the freshly timestamped skb.
+
+3.2.2 Ethernet PHYs
+^^^^^^^^^^^^^^^^^^^
+
+These are devices that typically fulfill a Layer 1 role in the network stack,
+hence they do not have a representation in terms of a network interface as DSA
+switches do. However, PHYs may be able to detect and timestamp PTP packets, for
+performance reasons: timestamps taken as close as possible to the wire have the
+potential to yield a more stable and precise synchronization.
+
+A PHY driver that supports PTP timestamping must create a ``struct
+mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The presence
+of this pointer will be checked by the networking stack.
+
+Since PHYs do not have network interface representations, the timestamping and
+ethtool ioctl operations for them need to be mediated by their respective MAC
+driver. Therefore, as opposed to DSA switches, modifications need to be done
+to each individual MAC driver for PHY timestamping support. This entails:
+
+- Checking, in ``.ndo_eth_ioctl``, whether ``phy_has_hwtstamp(netdev->phydev)``
+ is true or not. If it is, then the MAC driver should not process this request
+ but instead pass it on to the PHY using ``phy_mii_ioctl()``.
+
+- On RX, special intervention may or may not be needed, depending on the
+ function used to deliver skb's up the network stack. In the case of plain
+ ``netif_rx()`` and similar, MAC drivers must check whether
+ ``skb_defer_rx_timestamp(skb)`` is necessary or not - and if it is, don't
+ call ``netif_rx()`` at all. If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is
+ enabled, and ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook
+ will be called now, to determine, using logic very similar to DSA, whether
+ deferral for RX timestamping is necessary. Again like DSA, it becomes the
+ responsibility of the PHY driver to send the packet up the stack when the
+ timestamp is available.
+
+ For other skb receive functions, such as ``napi_gro_receive`` and
+ ``netif_receive_skb``, the stack automatically checks whether
+ ``skb_defer_rx_timestamp()`` is necessary, so this check is not needed inside
+ the driver.
+
+- On TX, again, special intervention might or might not be needed. The
+ function that calls the ``mii_ts->txtstamp()`` hook is named
+ ``skb_clone_tx_timestamp()``. This function can either be called directly
+ (case in which explicit MAC driver support is indeed needed), but the
+ function also piggybacks from the ``skb_tx_timestamp()`` call, which many MAC
+ drivers already perform for software timestamping purposes. Therefore, if a
+ MAC supports software timestamping, it does not need to do anything further
+ at this stage.
+
+3.2.3 MII bus snooping devices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These perform the same role as timestamping Ethernet PHYs, save for the fact
+that they are discrete devices and can therefore be used in conjunction with
+any PHY even if it doesn't support timestamping. In Linux, they are
+discoverable and attachable to a ``struct phy_device`` through Device Tree, and
+for the rest, they use the same mii_ts infrastructure as those. See
+Documentation/devicetree/bindings/ptp/timestamper.txt for more details.
+
+3.2.4 Other caveats for MAC drivers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Stacked PHCs, especially DSA (but not only) - since that doesn't require any
+modification to MAC drivers, so it is more difficult to ensure correctness of
+all possible code paths - is that they uncover bugs which were impossible to
+trigger before the existence of stacked PTP clocks. One example has to do with
+this line of code, already presented earlier::
+
+ skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+
+Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a PHY
+driver or a MII bus snooping device driver, should set this flag.
+But a MAC driver that is unaware of PHC stacking might get tripped up by
+somebody other than itself setting this flag, and deliver a duplicate
+timestamp.
+For example, a typical driver design for TX timestamping might be to split the
+transmission part into 2 portions:
+
+1. "TX": checks whether PTP timestamping has been previously enabled through
+ the ``.ndo_eth_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the
+ current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags &
+ SKBTX_HW_TSTAMP``"). If this is true, it sets the
+ "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as
+ described above, in the case of a stacked PHC system, this condition should
+ never trigger, as this MAC is certainly not the outermost PHC. But this is
+ not where the typical issue is. Transmission proceeds with this packet.
+
+2. "TX confirmation": Transmission has finished. The driver checks whether it
+ is necessary to collect any TX timestamp for it. Here is where the typical
+ issues are: the MAC driver takes a shortcut and only checks whether
+ "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a stacked
+ PHC system, this is incorrect because this MAC driver is not the only entity
+ in the TX data path who could have enabled SKBTX_IN_PROGRESS in the first
+ place.
+
+The correct solution for this problem is for MAC drivers to have a compound
+check in their "TX confirmation" portion, not only for
+"``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for
+"``priv->hwtstamp_tx_enabled == true``". Because the rest of the system ensures
+that PTP timestamping is not enabled for anything other than the outermost PHC,
+this enhanced check will avoid delivering a duplicated TX timestamp to user
+space.
diff --git a/Documentation/networking/tipc.rst b/Documentation/networking/tipc.rst
new file mode 100644
index 0000000000..ab63d298cc
--- /dev/null
+++ b/Documentation/networking/tipc.rst
@@ -0,0 +1,215 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Linux Kernel TIPC
+=================
+
+Introduction
+============
+
+TIPC (Transparent Inter Process Communication) is a protocol that is specially
+designed for intra-cluster communication. It can be configured to transmit
+messages either on UDP or directly across Ethernet. Message delivery is
+sequence guaranteed, loss free and flow controlled. Latency times are shorter
+than with any other known protocol, while maximal throughput is comparable to
+that of TCP.
+
+TIPC Features
+-------------
+
+- Cluster wide IPC service
+
+ Have you ever wished you had the convenience of Unix Domain Sockets even when
+ transmitting data between cluster nodes? Where you yourself determine the
+ addresses you want to bind to and use? Where you don't have to perform DNS
+ lookups and worry about IP addresses? Where you don't have to start timers
+ to monitor the continuous existence of peer sockets? And yet without the
+ downsides of that socket type, such as the risk of lingering inodes?
+
+ Welcome to the Transparent Inter Process Communication service, TIPC in short,
+ which gives you all of this, and a lot more.
+
+- Service Addressing
+
+ A fundamental concept in TIPC is that of Service Addressing which makes it
+ possible for a programmer to chose his own address, bind it to a server
+ socket and let client programs use only that address for sending messages.
+
+- Service Tracking
+
+ A client wanting to wait for the availability of a server, uses the Service
+ Tracking mechanism to subscribe for binding and unbinding/close events for
+ sockets with the associated service address.
+
+ The service tracking mechanism can also be used for Cluster Topology Tracking,
+ i.e., subscribing for availability/non-availability of cluster nodes.
+
+ Likewise, the service tracking mechanism can be used for Cluster Connectivity
+ Tracking, i.e., subscribing for up/down events for individual links between
+ cluster nodes.
+
+- Transmission Modes
+
+ Using a service address, a client can send datagram messages to a server socket.
+
+ Using the same address type, it can establish a connection towards an accepting
+ server socket.
+
+ It can also use a service address to create and join a Communication Group,
+ which is the TIPC manifestation of a brokerless message bus.
+
+ Multicast with very good performance and scalability is available both in
+ datagram mode and in communication group mode.
+
+- Inter Node Links
+
+ Communication between any two nodes in a cluster is maintained by one or two
+ Inter Node Links, which both guarantee data traffic integrity and monitor
+ the peer node's availability.
+
+- Cluster Scalability
+
+ By applying the Overlapping Ring Monitoring algorithm on the inter node links
+ it is possible to scale TIPC clusters up to 1000 nodes with a maintained
+ neighbor failure discovery time of 1-2 seconds. For smaller clusters this
+ time can be made much shorter.
+
+- Neighbor Discovery
+
+ Neighbor Node Discovery in the cluster is done by Ethernet broadcast or UDP
+ multicast, when any of those services are available. If not, configured peer
+ IP addresses can be used.
+
+- Configuration
+
+ When running TIPC in single node mode no configuration whatsoever is needed.
+ When running in cluster mode TIPC must as a minimum be given a node address
+ (before Linux 4.17) and told which interface to attach to. The "tipc"
+ configuration tool makes is possible to add and maintain many more
+ configuration parameters.
+
+- Performance
+
+ TIPC message transfer latency times are better than in any other known protocol.
+ Maximal byte throughput for inter-node connections is still somewhat lower than
+ for TCP, while they are superior for intra-node and inter-container throughput
+ on the same host.
+
+- Language Support
+
+ The TIPC user API has support for C, Python, Perl, Ruby, D and Go.
+
+More Information
+----------------
+
+- How to set up TIPC:
+
+ http://tipc.io/getting_started.html
+
+- How to program with TIPC:
+
+ http://tipc.io/programming.html
+
+- How to contribute to TIPC:
+
+- http://tipc.io/contacts.html
+
+- More details about TIPC specification:
+
+ http://tipc.io/protocol.html
+
+
+Implementation
+==============
+
+TIPC is implemented as a kernel module in net/tipc/ directory.
+
+TIPC Base Types
+---------------
+
+.. kernel-doc:: net/tipc/subscr.h
+ :internal:
+
+.. kernel-doc:: net/tipc/bearer.h
+ :internal:
+
+.. kernel-doc:: net/tipc/name_table.h
+ :internal:
+
+.. kernel-doc:: net/tipc/name_distr.h
+ :internal:
+
+.. kernel-doc:: net/tipc/bcast.c
+ :internal:
+
+TIPC Bearer Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/bearer.c
+ :internal:
+
+.. kernel-doc:: net/tipc/udp_media.c
+ :internal:
+
+TIPC Crypto Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/crypto.c
+ :internal:
+
+TIPC Discoverer Interfaces
+--------------------------
+
+.. kernel-doc:: net/tipc/discover.c
+ :internal:
+
+TIPC Link Interfaces
+--------------------
+
+.. kernel-doc:: net/tipc/link.c
+ :internal:
+
+TIPC msg Interfaces
+-------------------
+
+.. kernel-doc:: net/tipc/msg.c
+ :internal:
+
+TIPC Name Interfaces
+--------------------
+
+.. kernel-doc:: net/tipc/name_table.c
+ :internal:
+
+.. kernel-doc:: net/tipc/name_distr.c
+ :internal:
+
+TIPC Node Management Interfaces
+-------------------------------
+
+.. kernel-doc:: net/tipc/node.c
+ :internal:
+
+TIPC Socket Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/socket.c
+ :internal:
+
+TIPC Network Topology Interfaces
+--------------------------------
+
+.. kernel-doc:: net/tipc/subscr.c
+ :internal:
+
+TIPC Server Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/topsrv.c
+ :internal:
+
+TIPC Trace Interfaces
+---------------------
+
+.. kernel-doc:: net/tipc/trace.c
+ :internal:
diff --git a/Documentation/networking/tls-handshake.rst b/Documentation/networking/tls-handshake.rst
new file mode 100644
index 0000000000..6f5ea1646a
--- /dev/null
+++ b/Documentation/networking/tls-handshake.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+In-Kernel TLS Handshake
+=======================
+
+Overview
+========
+
+Transport Layer Security (TLS) is a Upper Layer Protocol (ULP) that runs
+over TCP. TLS provides end-to-end data integrity and confidentiality in
+addition to peer authentication.
+
+The kernel's kTLS implementation handles the TLS record subprotocol, but
+does not handle the TLS handshake subprotocol which is used to establish
+a TLS session. Kernel consumers can use the API described here to
+request TLS session establishment.
+
+There are several possible ways to provide a handshake service in the
+kernel. The API described here is designed to hide the details of those
+implementations so that in-kernel TLS consumers do not need to be
+aware of how the handshake gets done.
+
+
+User handshake agent
+====================
+
+As of this writing, there is no TLS handshake implementation in the
+Linux kernel. To provide a handshake service, a handshake agent
+(typically in user space) is started in each network namespace where a
+kernel consumer might require a TLS handshake. Handshake agents listen
+for events sent from the kernel that indicate a handshake request is
+waiting.
+
+An open socket is passed to a handshake agent via a netlink operation,
+which creates a socket descriptor in the agent's file descriptor table.
+If the handshake completes successfully, the handshake agent promotes
+the socket to use the TLS ULP and sets the session information using the
+SOL_TLS socket options. The handshake agent returns the socket to the
+kernel via a second netlink operation.
+
+
+Kernel Handshake API
+====================
+
+A kernel TLS consumer initiates a client-side TLS handshake on an open
+socket by invoking one of the tls_client_hello() functions. First, it
+fills in a structure that contains the parameters of the request:
+
+.. code-block:: c
+
+ struct tls_handshake_args {
+ struct socket *ta_sock;
+ tls_done_func_t ta_done;
+ void *ta_data;
+ const char *ta_peername;
+ unsigned int ta_timeout_ms;
+ key_serial_t ta_keyring;
+ key_serial_t ta_my_cert;
+ key_serial_t ta_my_privkey;
+ unsigned int ta_num_peerids;
+ key_serial_t ta_my_peerids[5];
+ };
+
+The @ta_sock field references an open and connected socket. The consumer
+must hold a reference on the socket to prevent it from being destroyed
+while the handshake is in progress. The consumer must also have
+instantiated a struct file in sock->file.
+
+
+@ta_done contains a callback function that is invoked when the handshake
+has completed. Further explanation of this function is in the "Handshake
+Completion" sesction below.
+
+The consumer can provide a NUL-terminated hostname in the @ta_peername
+field that is sent as part of ClientHello. If no peername is provided,
+the DNS hostname associated with the server's IP address is used instead.
+
+The consumer can fill in the @ta_timeout_ms field to force the servicing
+handshake agent to exit after a number of milliseconds. This enables the
+socket to be fully closed once both the kernel and the handshake agent
+have closed their endpoints.
+
+Authentication material such as x.509 certificates, private certificate
+keys, and pre-shared keys are provided to the handshake agent in keys
+that are instantiated by the consumer before making the handshake
+request. The consumer can provide a private keyring that is linked into
+the handshake agent's process keyring in the @ta_keyring field to prevent
+access of those keys by other subsystems.
+
+To request an x.509-authenticated TLS session, the consumer fills in
+the @ta_my_cert and @ta_my_privkey fields with the serial numbers of
+keys containing an x.509 certificate and the private key for that
+certificate. Then, it invokes this function:
+
+.. code-block:: c
+
+ ret = tls_client_hello_x509(args, gfp_flags);
+
+The function returns zero when the handshake request is under way. A
+zero return guarantees the callback function @ta_done will be invoked
+for this socket. The function returns a negative errno if the handshake
+could not be started. A negative errno guarantees the callback function
+@ta_done will not be invoked on this socket.
+
+
+To initiate a client-side TLS handshake with a pre-shared key, use:
+
+.. code-block:: c
+
+ ret = tls_client_hello_psk(args, gfp_flags);
+
+However, in this case, the consumer fills in the @ta_my_peerids array
+with serial numbers of keys containing the peer identities it wishes
+to offer, and the @ta_num_peerids field with the number of array
+entries it has filled in. The other fields are filled in as above.
+
+
+To initiate an anonymous client-side TLS handshake use:
+
+.. code-block:: c
+
+ ret = tls_client_hello_anon(args, gfp_flags);
+
+The handshake agent presents no peer identity information to the remote
+during this type of handshake. Only server authentication (ie the client
+verifies the server's identity) is performed during the handshake. Thus
+the established session uses encryption only.
+
+
+Consumers that are in-kernel servers use:
+
+.. code-block:: c
+
+ ret = tls_server_hello_x509(args, gfp_flags);
+
+or
+
+.. code-block:: c
+
+ ret = tls_server_hello_psk(args, gfp_flags);
+
+The argument structure is filled in as above.
+
+
+If the consumer needs to cancel the handshake request, say, due to a ^C
+or other exigent event, the consumer can invoke:
+
+.. code-block:: c
+
+ bool tls_handshake_cancel(sock);
+
+This function returns true if the handshake request associated with
+@sock has been canceled. The consumer's handshake completion callback
+will not be invoked. If this function returns false, then the consumer's
+completion callback has already been invoked.
+
+
+Handshake Completion
+====================
+
+When the handshake agent has completed processing, it notifies the
+kernel that the socket may be used by the consumer again. At this point,
+the consumer's handshake completion callback, provided in the @ta_done
+field in the tls_handshake_args structure, is invoked.
+
+The synopsis of this function is:
+
+.. code-block:: c
+
+ typedef void (*tls_done_func_t)(void *data, int status,
+ key_serial_t peerid);
+
+The consumer provides a cookie in the @ta_data field of the
+tls_handshake_args structure that is returned in the @data parameter of
+this callback. The consumer uses the cookie to match the callback to the
+thread waiting for the handshake to complete.
+
+The success status of the handshake is returned via the @status
+parameter:
+
++------------+----------------------------------------------+
+| status | meaning |
++============+==============================================+
+| 0 | TLS session established successfully |
++------------+----------------------------------------------+
+| -EACCESS | Remote peer rejected the handshake or |
+| | authentication failed |
++------------+----------------------------------------------+
+| -ENOMEM | Temporary resource allocation failure |
++------------+----------------------------------------------+
+| -EINVAL | Consumer provided an invalid argument |
++------------+----------------------------------------------+
+| -ENOKEY | Missing authentication material |
++------------+----------------------------------------------+
+| -EIO | An unexpected fault occurred |
++------------+----------------------------------------------+
+
+The @peerid parameter contains the serial number of a key containing the
+remote peer's identity or the value TLS_NO_PEERID if the session is not
+authenticated.
+
+A best practice is to close and destroy the socket immediately if the
+handshake failed.
+
+
+Other considerations
+--------------------
+
+While a handshake is under way, the kernel consumer must alter the
+socket's sk_data_ready callback function to ignore all incoming data.
+Once the handshake completion callback function has been invoked, normal
+receive operation can be resumed.
+
+Once a TLS session is established, the consumer must provide a buffer
+for and then examine the control message (CMSG) that is part of every
+subsequent sock_recvmsg(). Each control message indicates whether the
+received message data is TLS record data or session metadata.
+
+See tls.rst for details on how a kTLS consumer recognizes incoming
+(decrypted) application data, alerts, and handshake packets once the
+socket has been promoted to use the TLS ULP.
diff --git a/Documentation/networking/tls-offload-layers.svg b/Documentation/networking/tls-offload-layers.svg
new file mode 100644
index 0000000000..cf72f05dbb
--- /dev/null
+++ b/Documentation/networking/tls-offload-layers.svg
@@ -0,0 +1 @@
+<svg version="1.1" viewBox="0.0 0.0 460.0 500.0" fill="none" stroke="none" stroke-linecap="square" stroke-miterlimit="10" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg"><clipPath id="p.0"><path d="m0 0l960.0 0l0 720.0l-960.0 0l0 -720.0z" clip-rule="nonzero"/></clipPath><g clip-path="url(#p.0)"><path fill="#000000" fill-opacity="0.0" d="m0 0l960.0 0l0 720.0l-960.0 0z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m117.02887 0l72.28346 0l0 40.25197l-72.28346 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m117.02887 0l72.28346 0l0 40.25197l-72.28346 0z" fill-rule="evenodd"/><path fill="#000000" d="m135.71944 27.045982l0 -9.671875l1.46875 0l0 1.46875q0.5625 -1.03125 1.03125 -1.359375q0.484375 -0.328125 1.0625 -0.328125q0.828125 0 1.6875 0.53125l-0.5625 1.515625q-0.609375 -0.359375 -1.203125 -0.359375q-0.546875 0 -0.96875 0.328125q-0.421875 0.328125 -0.609375 0.890625q-0.28125 0.875 -0.28125 1.921875l0 5.0625l-1.625 0zm12.853302 -3.109375l1.6875 0.203125q-0.40625 1.484375 -1.484375 2.3125q-1.078125 0.8125 -2.765625 0.8125q-2.125 0 -3.375 -1.296875q-1.234375 -1.3125 -1.234375 -3.671875q0 -2.453125 1.25 -3.796875q1.265625 -1.34375 3.265625 -1.34375q1.9375 0 3.15625 1.328125q1.234375 1.3125 1.234375 3.703125q0 0.15625 0 0.4375l-7.21875 0q0.09375 1.59375 0.90625 2.453125q0.8125 0.84375 2.015625 0.84375q0.90625 0 1.546875 -0.46875q0.640625 -0.484375 1.015625 -1.515625zm-5.390625 -2.65625l5.40625 0q-0.109375 -1.21875 -0.625 -1.828125q-0.78125 -0.953125 -2.03125 -0.953125q-1.125 0 -1.90625 0.765625q-0.765625 0.75 -0.84375 2.015625zm15.453842 4.578125q-0.921875 0.765625 -1.765625 1.09375q-0.828125 0.3125 -1.796875 0.3125q-1.59375 0 -2.453125 -0.78125q-0.859375 -0.78125 -0.859375 -1.984375q0 -0.71875 0.328125 -1.296875q0.328125 -0.59375 0.84375 -0.9375q0.53125 -0.359375 1.1875 -0.546875q0.46875 -0.125 1.453125 -0.25q1.984375 -0.234375 2.921875 -0.5625q0.015625 -0.34375 0.015625 -0.421875q0 -1.0 -0.46875 -1.421875q-0.625 -0.546875 -1.875 -0.546875q-1.15625 0 -1.703125 0.40625q-0.546875 0.40625 -0.8125 1.421875l-1.609375 -0.21875q0.21875 -1.015625 0.71875 -1.640625q0.5 -0.640625 1.453125 -0.984375q0.953125 -0.34375 2.1875 -0.34375q1.25 0 2.015625 0.296875q0.78125 0.28125 1.140625 0.734375q0.375 0.4375 0.515625 1.109375q0.078125 0.421875 0.078125 1.515625l0 2.1875q0 2.28125 0.109375 2.890625q0.109375 0.59375 0.40625 1.15625l-1.703125 0q-0.265625 -0.515625 -0.328125 -1.1875zm-0.140625 -3.671875q-0.890625 0.375 -2.671875 0.625q-1.015625 0.140625 -1.4375 0.328125q-0.421875 0.1875 -0.65625 0.53125q-0.21875 0.34375 -0.21875 0.78125q0 0.65625 0.5 1.09375q0.5 0.4375 1.453125 0.4375q0.9375 0 1.671875 -0.40625q0.75 -0.421875 1.09375 -1.140625q0.265625 -0.5625 0.265625 -1.640625l0 -0.609375zm10.469467 4.859375l0 -1.21875q-0.90625 1.4375 -2.703125 1.4375q-1.15625 0 -2.125 -0.640625q-0.96875 -0.640625 -1.5 -1.78125q-0.53125 -1.140625 -0.53125 -2.625q0 -1.453125 0.484375 -2.625q0.484375 -1.1875 1.4375 -1.8125q0.96875 -0.625 2.171875 -0.625q0.875 0 1.546875 0.375q0.6875 0.359375 1.109375 0.953125l0 -4.796875l1.640625 0l0 13.359375l-1.53125 0zm-5.171875 -4.828125q0 1.859375 0.78125 2.78125q0.78125 0.921875 1.84375 0.921875q1.078125 0 1.828125 -0.875q0.75 -0.890625 0.75 -2.6875q0 -1.984375 -0.765625 -2.90625q-0.765625 -0.9375 -1.890625 -0.9375q-1.078125 0 -1.8125 0.890625q-0.734375 0.890625 -0.734375 2.8125z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m309.02887 0l72.28348 0l0 40.25197l-72.28348 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m309.02887 0l72.28348 0l0 40.25197l-72.28348 0z" fill-rule="evenodd"/><path fill="#000000" d="m328.4915 27.045982l-2.96875 -9.671875l1.703125 0l1.53125 5.578125l0.578125 2.078125q0.046875 -0.15625 0.5 -2.0l1.546875 -5.65625l1.6875 0l1.4375 5.609375l0.484375 1.84375l0.5625 -1.859375l1.65625 -5.59375l1.59375 0l-3.03125 9.671875l-1.703125 0l-1.53125 -5.796875l-0.375 -1.640625l-1.953125 7.4375l-1.71875 0zm11.676086 0l0 -9.671875l1.46875 0l0 1.46875q0.5625 -1.03125 1.03125 -1.359375q0.484375 -0.328125 1.0625 -0.328125q0.828125 0 1.6875 0.53125l-0.5625 1.515625q-0.609375 -0.359375 -1.203125 -0.359375q-0.546875 0 -0.96875 0.328125q-0.421875 0.328125 -0.609375 0.890625q-0.28125 0.875 -0.28125 1.921875l0 5.0625l-1.625 0zm6.228302 -11.46875l0 -1.890625l1.640625 0l0 1.890625l-1.640625 0zm0 11.46875l0 -9.671875l1.640625 0l0 9.671875l-1.640625 0zm7.722931 -1.46875l0.234375 1.453125q-0.6875 0.140625 -1.234375 0.140625q-0.890625 0 -1.390625 -0.28125q-0.484375 -0.28125 -0.6875 -0.734375q-0.203125 -0.46875 -0.203125 -1.9375l0 -5.578125l-1.203125 0l0 -1.265625l1.203125 0l0 -2.390625l1.625 -0.984375l0 3.375l1.65625 0l0 1.265625l-1.65625 0l0 5.671875q0 0.6875 0.078125 0.890625q0.09375 0.203125 0.28125 0.328125q0.203125 0.109375 0.578125 0.109375q0.265625 0 0.71875 -0.0625zm8.230194 -1.640625l1.6875 0.203125q-0.40625 1.484375 -1.484375 2.3125q-1.078125 0.8125 -2.765625 0.8125q-2.125 0 -3.375 -1.296875q-1.234375 -1.3125 -1.234375 -3.671875q0 -2.453125 1.25 -3.796875q1.265625 -1.34375 3.265625 -1.34375q1.9375 0 3.15625 1.328125q1.234375 1.3125 1.234375 3.703125q0 0.15625 0 0.4375l-7.21875 0q0.09375 1.59375 0.90625 2.453125q0.8125 0.84375 2.015625 0.84375q0.90625 0 1.546875 -0.46875q0.640625 -0.484375 1.015625 -1.515625zm-5.390625 -2.65625l5.40625 0q-0.109375 -1.21875 -0.625 -1.828125q-0.78125 -0.953125 -2.03125 -0.953125q-1.125 0 -1.90625 0.765625q-0.765625 0.75 -0.84375 2.015625z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m73.64304 101.46588l351.0551 0l0 53.70079l-351.0551 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m73.64304 101.46588l351.0551 0l0 53.70079l-351.0551 0z" fill-rule="evenodd"/><path fill="#000000" d="m215.67503 135.23627l0 -13.359367l1.640625 0l0 7.6249924l3.890625 -3.9374924l2.109375 0l-3.6875 3.5937424l4.0625 6.078125l-2.015625 0l-3.203125 -4.953125l-1.15625 1.125l0 3.828125l-1.640625 0zm12.90625 -1.46875l0.234375 1.453125q-0.6875 0.140625 -1.234375 0.140625q-0.890625 0 -1.390625 -0.28125q-0.484375 -0.28125 -0.6875 -0.734375q-0.203125 -0.46875 -0.203125 -1.9375l0 -5.5781174l-1.203125 0l0 -1.265625l1.203125 0l0 -2.390625l1.625 -0.984375l0 3.375l1.65625 0l0 1.265625l-1.65625 0l0 5.6718674q0 0.6875 0.078125 0.890625q0.09375 0.203125 0.28125 0.328125q0.203125 0.109375 0.578125 0.109375q0.265625 0 0.71875 -0.0625zm1.5583038 1.46875l0 -13.359367l1.640625 0l0 13.359367l-1.640625 0zm3.5354462 -2.890625l1.625 -0.25q0.125 0.96875 0.75 1.5q0.625 0.515625 1.75 0.515625q1.125 0 1.671875 -0.453125q0.546875 -0.46875 0.546875 -1.09375q0 -0.546875 -0.484375 -0.875q-0.328125 -0.21875 -1.671875 -0.546875q-1.8125 -0.46875 -2.515625 -0.796875q-0.6875 -0.328125 -1.046875 -0.90625q-0.359375 -0.59375 -0.359375 -1.3125q0 -0.6406174 0.296875 -1.1874924q0.296875 -0.5625 0.8125 -0.921875q0.375 -0.28125 1.03125 -0.46875q0.671875 -0.203125 1.421875 -0.203125q1.140625 0 2.0 0.328125q0.859375 0.328125 1.265625 0.890625q0.421875 0.5625 0.578125 1.4999924l-1.609375 0.21875q-0.109375 -0.7499924 -0.640625 -1.1718674q-0.515625 -0.421875 -1.46875 -0.421875q-1.140625 0 -1.625 0.375q-0.46875 0.375 -0.46875 0.875q0 0.31249237 0.1875 0.5781174q0.203125 0.265625 0.640625 0.4375q0.234375 0.09375 1.4375 0.421875q1.75 0.453125 2.4375 0.75q0.6875 0.296875 1.078125 0.859375q0.390625 0.5625 0.390625 1.40625q0 0.828125 -0.484375 1.546875q-0.46875 0.71875 -1.375 1.125q-0.90625 0.390625 -2.046875 0.390625q-1.875 0 -2.875 -0.78125q-0.984375 -0.78125 -1.25 -2.328125zm24.136429 -10.468742l1.765625 0l0 7.7187424q0 2.015625 -0.453125 3.203125q-0.453125 1.1875 -1.640625 1.9375q-1.1875 0.734375 -3.125 0.734375q-1.875 0 -3.078125 -0.640625q-1.1875 -0.65625 -1.703125 -1.875q-0.5 -1.234375 -0.5 -3.359375l0 -7.7187424l1.765625 0l0 7.7187424q0 1.734375 0.3125 2.5625q0.328125 0.8125 1.109375 1.265625q0.796875 0.453125 1.9375 0.453125q1.953125 0 2.78125 -0.890625q0.828125 -0.890625 0.828125 -3.390625l0 -7.7187424zm4.629181 13.359367l0 -13.359367l1.78125 0l0 11.781242l6.5625 0l0 1.578125l-8.34375 0zm10.453857 0l0 -13.359367l5.046875 0q1.328125 0 2.03125 0.125q0.96875 0.171875 1.640625 0.640625q0.671875 0.453125 1.078125 1.28125q0.40625 0.828125 0.40625 1.828125q0 1.703125 -1.09375 2.8906174q-1.078125 1.171875 -3.921875 1.171875l-3.421875 0l0 5.421875l-1.765625 0zm1.765625 -7.0l3.453125 0q1.71875 0 2.4375 -0.6406174q0.71875 -0.640625 0.71875 -1.796875q0 -0.84375 -0.421875 -1.4375q-0.421875 -0.59375 -1.125 -0.78125q-0.4375 -0.125 -1.640625 -0.125l-3.421875 0l0 4.7812424z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m73.64304 216.38058l351.0551 0l0 53.700775l-351.0551 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m73.64304 216.38058l351.0551 0l0 53.700775l-351.0551 0z" fill-rule="evenodd"/><path fill="#000000" d="m211.16338 250.15097l0 -11.78125l-4.40625 0l0 -1.578125l10.578125 0l0 1.578125l-4.40625 0l0 11.78125l-1.765625 0zm17.52098 -4.6875l1.765625 0.453125q-0.5625 2.171875 -2.0 3.328125q-1.4375 1.140625 -3.53125 1.140625q-2.15625 0 -3.515625 -0.875q-1.34375 -0.890625 -2.0625 -2.546875q-0.703125 -1.671875 -0.703125 -3.59375q0 -2.078125 0.796875 -3.625q0.796875 -1.5625 2.265625 -2.359375q1.484375 -0.8125 3.25 -0.8125q2.0 0 3.359375 1.015625q1.375 1.015625 1.90625 2.875l-1.734375 0.40625q-0.46875 -1.453125 -1.359375 -2.109375q-0.875 -0.671875 -2.203125 -0.671875q-1.546875 0 -2.578125 0.734375q-1.03125 0.734375 -1.453125 1.984375q-0.421875 1.234375 -0.421875 2.5625q0 1.703125 0.5 2.96875q0.5 1.265625 1.546875 1.90625q1.046875 0.625 2.265625 0.625q1.484375 0 2.515625 -0.859375q1.03125 -0.859375 1.390625 -2.546875zm3.9416962 4.6875l0 -13.359375l5.046875 0q1.328125 0 2.03125 0.125q0.96875 0.171875 1.640625 0.640625q0.671875 0.453125 1.078125 1.28125q0.40625 0.828125 0.40625 1.828125q0 1.703125 -1.09375 2.890625q-1.078125 1.171875 -3.921875 1.171875l-3.421875 0l0 5.421875l-1.765625 0zm1.765625 -7.0l3.453125 0q1.71875 0 2.4375 -0.640625q0.71875 -0.640625 0.71875 -1.796875q0 -0.84375 -0.421875 -1.4375q-0.421875 -0.59375 -1.125 -0.78125q-0.4375 -0.125 -1.640625 -0.125l-3.421875 0l0 4.78125zm14.664642 4.109375l1.625 -0.25q0.125 0.96875 0.75 1.5q0.625 0.515625 1.75 0.515625q1.125 0 1.671875 -0.453125q0.546875 -0.46875 0.546875 -1.09375q0 -0.546875 -0.484375 -0.875q-0.328125 -0.21875 -1.671875 -0.546875q-1.8125 -0.46875 -2.515625 -0.796875q-0.6875 -0.328125 -1.046875 -0.90625q-0.359375 -0.59375 -0.359375 -1.3125q0 -0.640625 0.296875 -1.1875q0.296875 -0.5625 0.8125 -0.921875q0.375 -0.28125 1.03125 -0.46875q0.671875 -0.203125 1.421875 -0.203125q1.140625 0 2.0 0.328125q0.859375 0.328125 1.2656097 0.890625q0.421875 0.5625 0.578125 1.5l-1.6093597 0.21875q-0.109375 -0.75 -0.640625 -1.171875q-0.515625 -0.421875 -1.46875 -0.421875q-1.140625 0 -1.625 0.375q-0.46875 0.375 -0.46875 0.875q0 0.3125 0.1875 0.578125q0.203125 0.265625 0.640625 0.4375q0.234375 0.09375 1.4375 0.421875q1.75 0.453125 2.4375 0.75q0.68748474 0.296875 1.0781097 0.859375q0.390625 0.5625 0.390625 1.40625q0 0.828125 -0.484375 1.546875q-0.46875 0.71875 -1.3749847 1.125q-0.90625 0.390625 -2.046875 0.390625q-1.875 0 -2.875 -0.78125q-0.984375 -0.78125 -1.25 -2.328125zm13.562485 1.421875l0.234375 1.453125q-0.6875 0.140625 -1.234375 0.140625q-0.890625 0 -1.390625 -0.28125q-0.484375 -0.28125 -0.6875 -0.734375q-0.203125 -0.46875 -0.203125 -1.9375l0 -5.578125l-1.203125 0l0 -1.265625l1.203125 0l0 -2.390625l1.625 -0.984375l0 3.375l1.65625 0l0 1.265625l-1.65625 0l0 5.671875q0 0.6875 0.078125 0.890625q0.09375 0.203125 0.28125 0.328125q0.203125 0.109375 0.578125 0.109375q0.265625 0 0.71875 -0.0625zm7.917694 0.28125q-0.921875 0.765625 -1.765625 1.09375q-0.828125 0.3125 -1.796875 0.3125q-1.59375 0 -2.453125 -0.78125q-0.859375 -0.78125 -0.859375 -1.984375q0 -0.71875 0.328125 -1.296875q0.328125 -0.59375 0.84375 -0.9375q0.53125 -0.359375 1.1875 -0.546875q0.46875 -0.125 1.453125 -0.25q1.984375 -0.234375 2.921875 -0.5625q0.015625 -0.34375 0.015625 -0.421875q0 -1.0 -0.46875 -1.421875q-0.625 -0.546875 -1.875 -0.546875q-1.15625 0 -1.703125 0.40625q-0.546875 0.40625 -0.8125 1.421875l-1.609375 -0.21875q0.21875 -1.015625 0.71875 -1.640625q0.5 -0.640625 1.453125 -0.984375q0.953125 -0.34375 2.1875 -0.34375q1.25 0 2.015625 0.296875q0.78125 0.28125 1.140625 0.734375q0.375 0.4375 0.515625 1.109375q0.078125 0.421875 0.078125 1.515625l0 2.1875q0 2.28125 0.109375 2.890625q0.109375 0.59375 0.40625 1.15625l-1.703125 0q-0.265625 -0.515625 -0.328125 -1.1875zm-0.140625 -3.671875q-0.890625 0.375 -2.671875 0.625q-1.015625 0.140625 -1.4375 0.328125q-0.421875 0.1875 -0.65625 0.53125q-0.21875 0.34375 -0.21875 0.78125q0 0.65625 0.5 1.09375q0.5 0.4375 1.453125 0.4375q0.9375 0 1.671875 -0.40625q0.75 -0.421875 1.09375 -1.140625q0.265625 -0.5625 0.265625 -1.640625l0 -0.609375zm10.516327 1.3125l1.609375 0.21875q-0.265625 1.65625 -1.359375 2.609375q-1.078125 0.9375 -2.671875 0.9375q-1.984375 0 -3.1875 -1.296875q-1.203125 -1.296875 -1.203125 -3.71875q0 -1.578125 0.515625 -2.75q0.515625 -1.171875 1.578125 -1.75q1.0625 -0.59375 2.3125 -0.59375q1.578125 0 2.578125 0.796875q1.0 0.796875 1.28125 2.265625l-1.59375 0.234375q-0.234375 -0.96875 -0.8125 -1.453125q-0.578125 -0.5 -1.390625 -0.5q-1.234375 0 -2.015625 0.890625q-0.78125 0.890625 -0.78125 2.8125q0 1.953125 0.75 2.84375q0.75 0.875 1.953125 0.875q0.96875 0 1.609375 -0.59375q0.65625 -0.59375 0.828125 -1.828125zm3.015625 3.546875l0 -13.359375l1.640625 0l0 7.625l3.890625 -3.9375l2.109375 0l-3.6875 3.59375l4.0625 6.078125l-2.015625 0l-3.203125 -4.953125l-1.15625 1.125l0 3.828125l-1.640625 0z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m73.64304 331.2953l351.0551 0l0 53.700775l-351.0551 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m73.64304 331.2953l351.0551 0l0 53.700775l-351.0551 0z" fill-rule="evenodd"/><path fill="#000000" d="m225.73463 365.06567l0 -13.359375l4.609375 0q1.546875 0 2.375 0.203125q1.140625 0.25 1.953125 0.953125q1.0625 0.890625 1.578125 2.28125q0.53125 1.390625 0.53125 3.171875q0 1.515625 -0.359375 2.703125q-0.359375 1.171875 -0.921875 1.9375q-0.546875 0.765625 -1.203125 1.21875q-0.65625 0.4375 -1.59375 0.671875q-0.9375 0.21875 -2.140625 0.21875l-4.828125 0zm1.765625 -1.578125l2.859375 0q1.3125 0 2.0625 -0.234375q0.75 -0.25 1.203125 -0.703125q0.625 -0.625 0.96875 -1.6875q0.359375 -1.0625 0.359375 -2.578125q0 -2.09375 -0.6875 -3.21875q-0.6875 -1.125 -1.671875 -1.5q-0.703125 -0.28125 -2.28125 -0.28125l-2.8125 0l0 10.203125zm11.488571 1.578125l0 -9.671875l1.46875 0l0 1.46875q0.5625 -1.03125 1.03125 -1.359375q0.484375 -0.328125 1.0625 -0.328125q0.828125 0 1.6875 0.53125l-0.5625 1.515625q-0.609375 -0.359375 -1.203125 -0.359375q-0.546875 0 -0.96875 0.328125q-0.421875 0.328125 -0.609375 0.890625q-0.28125 0.875 -0.28125 1.921875l0 5.0625l-1.625 0zm6.228302 -11.46875l0 -1.890625l1.640625 0l0 1.890625l-1.640625 0zm0 11.46875l0 -9.671875l1.640625 0l0 9.671875l-1.640625 0zm6.832321 0l-3.6875 -9.671875l1.734375 0l2.078125 5.796875q0.328125 0.9375 0.625 1.9375q0.203125 -0.765625 0.609375 -1.828125l2.140625 -5.90625l1.6874847 0l-3.6562347 9.671875l-1.53125 0zm13.26561 -3.109375l1.6875 0.203125q-0.40625 1.484375 -1.484375 2.3125q-1.078125 0.8125 -2.765625 0.8125q-2.125 0 -3.375 -1.296875q-1.234375 -1.3125 -1.234375 -3.671875q0 -2.453125 1.25 -3.796875q1.265625 -1.34375 3.265625 -1.34375q1.9375 0 3.15625 1.328125q1.234375 1.3125 1.234375 3.703125q0 0.15625 0 0.4375l-7.21875 0q0.09375 1.59375 0.90625 2.453125q0.8125 0.84375 2.015625 0.84375q0.90625 0 1.546875 -0.46875q0.640625 -0.484375 1.015625 -1.515625zm-5.390625 -2.65625l5.40625 0q-0.109375 -1.21875 -0.625 -1.828125q-0.78125 -0.953125 -2.03125 -0.953125q-1.125 0 -1.90625 0.765625q-0.765625 0.75 -0.84375 2.015625zm9.125732 5.765625l0 -9.671875l1.46875 0l0 1.46875q0.5625 -1.03125 1.03125 -1.359375q0.484375 -0.328125 1.0625 -0.328125q0.828125 0 1.6875 0.53125l-0.5625 1.515625q-0.609375 -0.359375 -1.203125 -0.359375q-0.546875 0 -0.96875 0.328125q-0.421875 0.328125 -0.609375 0.890625q-0.28125 0.875 -0.28125 1.921875l0 5.0625l-1.625 0z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m73.64304 446.20996l351.0551 0l0 53.700806l-351.0551 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m73.64304 446.20996l351.0551 0l0 53.700806l-351.0551 0z" fill-rule="evenodd"/><path fill="#000000" d="m222.09538 479.98038l0 -13.359375l4.609375 0q1.546875 0 2.375 0.203125q1.140625 0.25 1.953125 0.953125q1.0625 0.890625 1.578125 2.28125q0.53125 1.390625 0.53125 3.171875q0 1.515625 -0.359375 2.703125q-0.359375 1.171875 -0.921875 1.9375q-0.546875 0.765625 -1.203125 1.21875q-0.65625 0.4375 -1.59375 0.671875q-0.9375 0.21875 -2.140625 0.21875l-4.828125 0zm1.765625 -1.578125l2.859375 0q1.3125 0 2.0625 -0.234375q0.75 -0.25 1.203125 -0.703125q0.625 -0.625 0.96875 -1.6875q0.359375 -1.0625 0.359375 -2.578125q0 -2.09375 -0.6875 -3.21875q-0.6875 -1.125 -1.671875 -1.5q-0.703125 -0.28125 -2.28125 -0.28125l-2.8125 0l0 10.203125zm18.129196 -1.53125l1.6875 0.203125q-0.40625 1.484375 -1.484375 2.3125q-1.078125 0.8125 -2.765625 0.8125q-2.125 0 -3.375 -1.296875q-1.234375 -1.3125 -1.234375 -3.671875q0 -2.453125 1.25 -3.796875q1.265625 -1.34375 3.265625 -1.34375q1.9375 0 3.15625 1.328125q1.234375 1.3125 1.234375 3.703125q0 0.15625 0 0.4375l-7.21875 0q0.09375 1.59375 0.90625 2.453125q0.8125 0.84375 2.015625 0.84375q0.90625 0 1.546875 -0.46875q0.640625 -0.484375 1.015625 -1.515625zm-5.390625 -2.65625l5.40625 0q-0.109375 -1.21875 -0.625 -1.828125q-0.78125 -0.953125 -2.03125 -0.953125q-1.125 0 -1.90625 0.765625q-0.765625 0.75 -0.84375 2.015625zm11.828842 5.765625l-3.6875 -9.671875l1.734375 0l2.078125 5.796875q0.328125 0.9375 0.625 1.9375q0.203125 -0.765625 0.609375 -1.828125l2.140625 -5.90625l1.6875 0l-3.65625 9.671875l-1.53125 0zm6.640625 -11.46875l0 -1.890625l1.6406097 0l0 1.890625l-1.6406097 0zm0 11.46875l0 -9.671875l1.6406097 0l0 9.671875l-1.6406097 0zm10.457321 -3.546875l1.609375 0.21875q-0.265625 1.65625 -1.359375 2.609375q-1.078125 0.9375 -2.671875 0.9375q-1.984375 0 -3.1875 -1.296875q-1.203125 -1.296875 -1.203125 -3.71875q0 -1.578125 0.515625 -2.75q0.515625 -1.171875 1.578125 -1.75q1.0625 -0.59375 2.3125 -0.59375q1.578125 0 2.578125 0.796875q1.0 0.796875 1.28125 2.265625l-1.59375 0.234375q-0.234375 -0.96875 -0.8125 -1.453125q-0.578125 -0.5 -1.390625 -0.5q-1.234375 0 -2.015625 0.890625q-0.78125 0.890625 -0.78125 2.8125q0 1.953125 0.75 2.84375q0.75 0.875 1.953125 0.875q0.96875 0 1.609375 -0.59375q0.65625 -0.59375 0.828125 -1.828125zm9.640625 0.4375l1.6875 0.203125q-0.40625 1.484375 -1.484375 2.3125q-1.078125 0.8125 -2.765625 0.8125q-2.125 0 -3.375 -1.296875q-1.234375 -1.3125 -1.234375 -3.671875q0 -2.453125 1.25 -3.796875q1.265625 -1.34375 3.265625 -1.34375q1.9375 0 3.15625 1.328125q1.234375 1.3125 1.234375 3.703125q0 0.15625 0 0.4375l-7.21875 0q0.09375 1.59375 0.90625 2.453125q0.8125 0.84375 2.015625 0.84375q0.90625 0 1.546875 -0.46875q0.640625 -0.484375 1.015625 -1.515625zm-5.390625 -2.65625l5.40625 0q-0.109375 -1.21875 -0.625 -1.828125q-0.78125 -0.953125 -2.03125 -0.953125q-1.125 0 -1.90625 0.765625q-0.765625 0.75 -0.84375 2.015625z" fill-rule="nonzero"/><path fill="#000000" fill-opacity="0.0" d="m153.17061 40.25197l0 0" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m153.17061 40.25197l0 0" fill-rule="evenodd"/><path fill="#000000" fill-opacity="0.0" d="m48.435696 73.03937l402.67715 0" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" stroke-dasharray="4.0,3.0" d="m48.435696 73.03937l402.67715 0" fill-rule="evenodd"/><path fill="#cfe2f3" d="m177.95801 71.49061l-12.393707 0l0 12.897636l-24.7874 0l0 -12.897636l-12.393692 0l24.7874 -12.89764z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m177.95801 71.49061l-12.393707 0l0 12.897636l-24.7874 0l0 -12.897636l-12.393692 0l24.7874 -12.89764z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m320.3832 71.49061l12.393707 0l0 -12.89764l24.787384 0l0 12.89764l12.393707 0l-24.787415 12.897636z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m320.3832 71.49061l12.393707 0l0 -12.89764l24.787384 0l0 12.89764l12.393707 0l-24.787415 12.897636z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m177.95801 188.3931l-12.393707 0l0 12.897629l-24.7874 0l0 -12.897629l-12.393692 0l24.7874 -12.897644z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m177.95801 188.3931l-12.393707 0l0 12.897629l-24.7874 0l0 -12.897629l-12.393692 0l24.7874 -12.897644z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m320.3832 188.3931l12.393707 0l0 -12.897644l24.787384 0l0 12.897644l12.393707 0l-24.787415 12.897629z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m320.3832 188.3931l12.393707 0l0 -12.897644l24.787384 0l0 12.897644l12.393707 0l-24.787415 12.897629z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m177.95801 301.4256l-12.393707 0l0 12.897644l-24.7874 0l0 -12.897644l-12.393692 0l24.7874 -12.897644z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m177.95801 301.4256l-12.393707 0l0 12.897644l-24.7874 0l0 -12.897644l-12.393692 0l24.7874 -12.897644z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m320.3832 301.4256l12.393707 0l0 -12.897644l24.787384 0l0 12.897644l12.393707 0l-24.787415 12.897644z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m320.3832 301.4256l12.393707 0l0 -12.897644l24.787384 0l0 12.897644l12.393707 0l-24.787415 12.897644z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m177.95801 415.4906l-12.393707 0l0 12.897644l-24.7874 0l0 -12.897644l-12.393692 0l24.7874 -12.897644z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m177.95801 415.4906l-12.393707 0l0 12.897644l-24.7874 0l0 -12.897644l-12.393692 0l24.7874 -12.897644z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m320.3832 415.4906l12.393707 0l0 -12.897644l24.787384 0l0 12.897644l12.393707 0l-24.787415 12.897644z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m320.3832 415.4906l12.393707 0l0 -12.897644l24.787384 0l0 12.897644l12.393707 0l-24.787415 12.897644z" fill-rule="evenodd"/><path fill="#000000" fill-opacity="0.0" d="m198.14961 44.009186l109.44881 0l0 53.70079l-109.44881 0z" fill-rule="evenodd"/><path fill="#000000" d="m224.98174 70.929184l2.78125 -13.359375l1.65625 0l-1.0 4.78125q0.78125 -0.71875 1.421875 -1.015625q0.640625 -0.296875 1.328125 -0.296875q1.359375 0 2.265625 1.015625q0.90625 1.0 0.90625 2.9375q0 1.28125 -0.375 2.359375q-0.359375 1.0625 -0.890625 1.78125q-0.53125 0.71875 -1.109375 1.15625q-0.578125 0.4375 -1.1875 0.640625q-0.59375 0.21875 -1.140625 0.21875q-0.96875 0 -1.703125 -0.5q-0.71875 -0.515625 -1.125 -1.546875l-0.390625 1.828125l-1.4375 0zm2.4375 -3.96875l-0.015625 0.3125q0 1.234375 0.59375 1.890625q0.59375 0.640625 1.484375 0.640625q0.859375 0 1.578125 -0.609375q0.734375 -0.609375 1.1875 -1.890625q0.46875 -1.28125 0.46875 -2.375q0 -1.21875 -0.59375 -1.890625q-0.578125 -0.671875 -1.4375 -0.671875q-0.90625 0 -1.65625 0.6875q-0.734375 0.6875 -1.234375 2.125q-0.375 1.0625 -0.375 1.78125zm14.531967 2.21875q-1.734375 1.96875 -3.5625 1.96875q-1.109375 0 -1.796875 -0.640625q-0.6875 -0.640625 -0.6875 -1.578125q0 -0.609375 0.296875 -2.09375l1.171875 -5.578125l1.65625 0l-1.296875 6.1875q-0.171875 0.765625 -0.171875 1.203125q0 0.546875 0.328125 0.859375q0.34375 0.296875 0.984375 0.296875q0.703125 0 1.359375 -0.328125q0.65625 -0.34375 1.125 -0.921875q0.484375 -0.578125 0.796875 -1.359375q0.1875 -0.5 0.453125 -1.765625l0.875 -4.171875l1.65625 0l-2.03125 9.671875l-1.515625 0l0.359375 -1.75zm4.000717 1.75l1.765625 -8.40625l-1.484375 0l0.265625 -1.265625l1.484375 0l0.28125 -1.375q0.21875 -1.03125 0.4375 -1.484375q0.234375 -0.453125 0.75 -0.75q0.53125 -0.296875 1.4375 -0.296875q0.625 0 1.828125 0.265625l-0.296875 1.4375q-0.84375 -0.21875 -1.40625 -0.21875q-0.484375 0 -0.734375 0.25q-0.25 0.234375 -0.4375 1.125l-0.21875 1.046875l1.84375 0l-0.265625 1.265625l-1.84375 0l-1.75 8.40625l-1.65625 0zm5.183304 0l1.765625 -8.40625l-1.484375 0l0.265625 -1.265625l1.484375 0l0.28125 -1.375q0.21875 -1.03125 0.4375 -1.484375q0.234375 -0.453125 0.75 -0.75q0.53125 -0.296875 1.4375 -0.296875q0.625 0 1.828125 0.265625l-0.296875 1.4375q-0.84375 -0.21875 -1.40625 -0.21875q-0.484375 0 -0.734375 0.25q-0.25 0.234375 -0.4375 1.125l-0.21875 1.046875l1.84375 0l-0.265625 1.265625l-1.84375 0l-1.75 8.40625l-1.65625 0zm12.058319 -3.28125l1.609375 0.15625q-0.34375 1.1875 -1.59375 2.265625q-1.234375 1.078125 -2.96875 1.078125q-1.0625 0 -1.96875 -0.5q-0.890625 -0.5 -1.359375 -1.4375q-0.46875 -0.953125 -0.46875 -2.15625q0 -1.59375 0.734375 -3.078125q0.734375 -1.484375 1.890625 -2.203125q1.171875 -0.734375 2.53125 -0.734375q1.734375 0 2.765625 1.078125q1.03125 1.0625 1.03125 2.921875q0 0.71875 -0.125 1.46875l-7.125 0q-0.046875 0.28125 -0.046875 0.5q0 1.359375 0.625 2.078125q0.625 0.71875 1.53125 0.71875q0.84375 0 1.65625 -0.546875q0.828125 -0.5625 1.28125 -1.609375zm-4.78125 -2.40625l5.421875 0q0.015625 -0.25 0.015625 -0.359375q0 -1.234375 -0.625 -1.890625q-0.625 -0.671875 -1.59375 -0.671875q-1.0625 0 -1.9375 0.734375q-0.859375 0.71875 -1.28125 2.1875zm8.063202 5.6875l2.015625 -9.671875l1.453125 0l-0.40625 1.96875q0.75 -1.109375 1.453125 -1.640625q0.71875 -0.546875 1.46875 -0.546875q0.5 0 1.21875 0.359375l-0.671875 1.53125q-0.4375 -0.3125 -0.9375 -0.3125q-0.875 0 -1.78125 0.96875q-0.90625 0.953125 -1.4375 3.46875l-0.8125 3.875l-1.5625 0zm6.368927 -3.3125l1.640625 -0.09375q0 0.703125 0.21875 1.21875q0.21875 0.5 0.796875 0.8125q0.59375 0.3125 1.375 0.3125q1.09375 0 1.640625 -0.4375q0.546875 -0.4375 0.546875 -1.015625q0 -0.4375 -0.328125 -0.8125q-0.328125 -0.390625 -1.640625 -0.953125q-1.3125 -0.5625 -1.671875 -0.78125q-0.609375 -0.375 -0.921875 -0.875q-0.3125 -0.515625 -0.3125 -1.171875q0 -1.140625 0.90625 -1.953125q0.921875 -0.828125 2.5625 -0.828125q1.828125 0 2.765625 0.84375q0.953125 0.84375 1.0 2.21875l-1.609375 0.109375q-0.046875 -0.875 -0.625 -1.390625q-0.578125 -0.515625 -1.65625 -0.515625q-0.84375 0 -1.328125 0.40625q-0.46875 0.390625 -0.46875 0.84375q0 0.453125 0.40625 0.796875q0.28125 0.234375 1.421875 0.734375q1.890625 0.8125 2.375 1.28125q0.78125 0.765625 0.78125 1.84375q0 0.71875 -0.4375 1.421875q-0.4375 0.6875 -1.34375 1.109375q-0.90625 0.40625 -2.140625 0.40625q-1.671875 0 -2.84375 -0.828125q-1.1875 -0.828125 -1.109375 -2.703125z" fill-rule="nonzero"/><path fill="#000000" fill-opacity="0.0" d="m198.14961 164.00919l109.44881 0l0 53.70079l-109.44881 0z" fill-rule="evenodd"/><path fill="#000000" d="m211.6696 187.61668l1.640625 -0.09375q0 0.703125 0.21875 1.21875q0.21875 0.5 0.796875 0.8125q0.59375 0.3125 1.375 0.3125q1.09375 0 1.640625 -0.4375q0.546875 -0.4375 0.546875 -1.015625q0 -0.4375 -0.328125 -0.8125q-0.328125 -0.390625 -1.640625 -0.953125q-1.3125 -0.5625 -1.671875 -0.78125q-0.609375 -0.375 -0.921875 -0.875q-0.3125 -0.515625 -0.3125 -1.171875q0 -1.140625 0.90625 -1.953125q0.921875 -0.828125 2.5625 -0.828125q1.828125 0 2.765625 0.84375q0.953125 0.84375 1.0 2.21875l-1.609375 0.109375q-0.046875 -0.875 -0.625 -1.390625q-0.578125 -0.515625 -1.65625 -0.515625q-0.84375 0 -1.328125 0.40625q-0.46875 0.390625 -0.46875 0.84375q0 0.453125 0.40625 0.796875q0.28125 0.234375 1.421875 0.734375q1.890625 0.8125 2.375 1.28125q0.78125 0.765625 0.78125 1.84375q0 0.71875 -0.4375 1.421875q-0.4375 0.6875 -1.34375 1.109375q-0.90625 0.40625 -2.140625 0.40625q-1.671875 0 -2.84375 -0.828125q-1.1875 -0.828125 -1.109375 -2.703125zm15.84375 -0.21875l1.65625 0.171875q-0.625 1.8125 -1.765625 2.703125q-1.140625 0.875 -2.609375 0.875q-1.578125 0 -2.5625 -1.015625q-0.96875 -1.03125 -0.96875 -2.859375q0 -1.578125 0.625 -3.109375q0.640625 -1.53125 1.796875 -2.328125q1.171875 -0.796875 2.6875 -0.796875q1.546875 0 2.453125 0.875q0.921875 0.875 0.921875 2.328125l-1.625 0.109375q-0.015625 -0.921875 -0.546875 -1.4375q-0.515625 -0.515625 -1.359375 -0.515625q-1.0 0 -1.734375 0.625q-0.71875 0.625 -1.140625 1.90625q-0.40625 1.28125 -0.40625 2.46875q0 1.234375 0.546875 1.859375q0.546875 0.609375 1.34375 0.609375q0.796875 0 1.53125 -0.609375q0.734375 -0.609375 1.15625 -1.859375zm9.171875 2.328125q-0.859375 0.734375 -1.65625 1.078125q-0.78125 0.34375 -1.6875 0.34375q-1.34375 0 -2.171875 -0.78125q-0.8125 -0.796875 -0.8125 -2.03125q0 -0.796875 0.375 -1.421875q0.375 -0.625 0.984375 -1.0q0.609375 -0.390625 1.484375 -0.546875q0.5625 -0.109375 2.109375 -0.171875q1.5625 -0.0625 2.234375 -0.328125q0.1875 -0.671875 0.1875 -1.125q0 -0.578125 -0.421875 -0.90625q-0.5625 -0.453125 -1.671875 -0.453125q-1.03125 0 -1.703125 0.46875q-0.65625 0.453125 -0.953125 1.296875l-1.671875 -0.140625q0.515625 -1.4375 1.609375 -2.203125q1.109375 -0.765625 2.796875 -0.765625q1.796875 0 2.84375 0.859375q0.796875 0.625 0.796875 1.65625q0 0.765625 -0.21875 1.796875l-0.53125 2.40625q-0.265625 1.140625 -0.265625 1.859375q0 0.453125 0.203125 1.3125l-1.671875 0q-0.125 -0.46875 -0.1875 -1.203125zm0.609375 -3.703125q-0.34375 0.140625 -0.75 0.21875q-0.390625 0.0625 -1.3125 0.15625q-1.4375 0.125 -2.03125 0.328125q-0.59375 0.1875 -0.90625 0.625q-0.296875 0.421875 -0.296875 0.9375q0 0.6875 0.484375 1.140625q0.484375 0.4375 1.359375 0.4375q0.828125 0 1.578125 -0.421875q0.75 -0.4375 1.1875 -1.203125q0.4375 -0.78125 0.6875 -2.21875zm7.094467 3.5625l-0.265625 1.359375q-0.59375 0.140625 -1.15625 0.140625q-0.984375 0 -1.5625 -0.46875q-0.4375 -0.375 -0.4375 -1.0q0 -0.3125 0.234375 -1.46875l1.171875 -5.625l-1.296875 0l0.265625 -1.265625l1.296875 0l0.5 -2.375l1.890625 -1.140625l-0.734375 3.515625l1.625 0l-0.28125 1.265625l-1.609375 0l-1.125 5.359375q-0.203125 1.015625 -0.203125 1.21875q0 0.28125 0.15625 0.4375q0.171875 0.15625 0.5625 0.15625q0.546875 0 0.96875 -0.109375zm5.183304 0l-0.265625 1.359375q-0.59375 0.140625 -1.15625 0.140625q-0.984375 0 -1.5625 -0.46875q-0.4375 -0.375 -0.4375 -1.0q0 -0.3125 0.234375 -1.46875l1.171875 -5.625l-1.296875 0l0.265625 -1.265625l1.296875 0l0.5 -2.375l1.890625 -1.140625l-0.734375 3.515625l1.625 0l-0.28125 1.265625l-1.609375 0l-1.125 5.359375q-0.203125 1.015625 -0.203125 1.21875q0 0.28125 0.15625 0.4375q0.171875 0.15625 0.5625 0.15625q0.546875 0 0.96875 -0.109375zm8.433304 -1.9375l1.609375 0.15625q-0.34375 1.1875 -1.59375 2.265625q-1.234375 1.078125 -2.96875 1.078125q-1.0625 0 -1.96875 -0.5q-0.890625 -0.5 -1.359375 -1.4375q-0.46875 -0.953125 -0.46875 -2.15625q0 -1.59375 0.734375 -3.078125q0.734375 -1.484375 1.890625 -2.203125q1.171875 -0.734375 2.53125 -0.734375q1.734375 0 2.765625 1.078125q1.03125 1.0625 1.03125 2.921875q0 0.71875 -0.125 1.46875l-7.125 0q-0.046875 0.28125 -0.046875 0.5q0 1.359375 0.625 2.078125q0.625 0.71875 1.53125 0.71875q0.84375 0 1.65625 -0.546875q0.828125 -0.5625 1.28125 -1.609375zm-4.78125 -2.40625l5.421875 0q0.015625 -0.25 0.015625 -0.359375q0 -1.234375 -0.625 -1.890625q-0.625 -0.671875 -1.59375 -0.671875q-1.0625 0 -1.9375 0.734375q-0.859375 0.71875 -1.28125 2.1875zm8.063202 5.6875l2.015625 -9.671875l1.453125 0l-0.40625 1.96875q0.75 -1.109375 1.453125 -1.640625q0.71875 -0.546875 1.46875 -0.546875q0.5 0 1.21875 0.359375l-0.671875 1.53125q-0.4375 -0.3125 -0.9375 -0.3125q-0.875 0 -1.78125 0.96875q-0.90625 0.953125 -1.4375 3.46875l-0.8125 3.875l-1.5625 0zm11.255371 0l2.796875 -13.359375l1.640625 0l-2.78125 13.359375l-1.65625 0zm6.613556 -11.484375l0.40625 -1.875l1.625 0l-0.390625 1.875l-1.640625 0zm-2.390625 11.484375l2.015625 -9.671875l1.65625 0l-2.03125 9.671875l-1.640625 0zm4.3635864 -3.3125l1.640625 -0.09375q0 0.703125 0.21875 1.21875q0.21875 0.5 0.796875 0.8125q0.59375 0.3125 1.375 0.3125q1.09375 0 1.640625 -0.4375q0.546875 -0.4375 0.546875 -1.015625q0 -0.4375 -0.328125 -0.8125q-0.328125 -0.390625 -1.640625 -0.953125q-1.3125 -0.5625 -1.671875 -0.78125q-0.609375 -0.375 -0.921875 -0.875q-0.3125 -0.515625 -0.3125 -1.171875q0 -1.140625 0.90625 -1.953125q0.921875 -0.828125 2.5625 -0.828125q1.828125 0 2.765625 0.84375q0.953125 0.84375 1.0 2.21875l-1.609375 0.109375q-0.046875 -0.875 -0.625 -1.390625q-0.578125 -0.515625 -1.65625 -0.515625q-0.84375 0 -1.328125 0.40625q-0.46875 0.390625 -0.46875 0.84375q0 0.453125 0.40625 0.796875q0.28125 0.234375 1.421875 0.734375q1.890625 0.8125 2.375 1.28125q0.78125 0.765625 0.78125 1.84375q0 0.71875 -0.4375 1.421875q-0.4375 0.6875 -1.34375 1.109375q-0.90625 0.40625 -2.140625 0.40625q-1.671875 0 -2.84375 -0.828125q-1.1875 -0.828125 -1.109375 -2.703125zm13.015625 1.96875l-0.265625 1.359375q-0.59375 0.140625 -1.15625 0.140625q-0.984375 0 -1.5625 -0.46875q-0.4375 -0.375 -0.4375 -1.0q0 -0.3125 0.234375 -1.46875l1.171875 -5.625l-1.296875 0l0.265625 -1.265625l1.296875 0l0.5 -2.375l1.890625 -1.140625l-0.734375 3.515625l1.625 0l-0.28125 1.265625l-1.609375 0l-1.125 5.359375q-0.203125 1.015625 -0.203125 1.21875q0 0.28125 0.15625 0.4375q0.171875 0.15625 0.5625 0.15625q0.546875 0 0.96875 -0.109375z" fill-rule="nonzero"/><path fill="#000000" fill-opacity="0.0" d="m198.14961 280.1392l109.44881 0l0 53.700806l-109.44881 0z" fill-rule="evenodd"/><path fill="#000000" d="m223.58026 303.7467l1.640625 -0.09375q0 0.703125 0.21875 1.21875q0.21875 0.5 0.796875 0.8125q0.59375 0.3125 1.375 0.3125q1.09375 0 1.640625 -0.4375q0.546875 -0.4375 0.546875 -1.015625q0 -0.4375 -0.328125 -0.8125q-0.328125 -0.390625 -1.640625 -0.953125q-1.3125 -0.5625 -1.671875 -0.78125q-0.609375 -0.375 -0.921875 -0.875q-0.3125 -0.515625 -0.3125 -1.171875q0 -1.140625 0.90625 -1.953125q0.921875 -0.828125 2.5625 -0.828125q1.828125 0 2.765625 0.84375q0.953125 0.84375 1.0 2.21875l-1.609375 0.109375q-0.046875 -0.875 -0.625 -1.390625q-0.578125 -0.515625 -1.65625 -0.515625q-0.84375 0 -1.328125 0.40625q-0.46875 0.390625 -0.46875 0.84375q0 0.453125 0.40625 0.796875q0.28125 0.234375 1.421875 0.734375q1.890625 0.8125 2.375 1.28125q0.78125 0.765625 0.78125 1.84375q0 0.71875 -0.4375 1.421875q-0.4375 0.6875 -1.34375 1.109375q-0.90625 0.40625 -2.140625 0.40625q-1.671875 0 -2.84375 -0.828125q-1.1875 -0.828125 -1.109375 -2.703125zm9.1875 3.3125l2.796875 -13.359375l1.640625 0l-1.734375 8.28125l4.8125 -4.59375l2.171875 0l-4.125 3.609375l2.5 6.0625l-1.8125 0l-1.921875 -4.96875l-2.015625 1.734375l-0.671875 3.234375l-1.640625 0zm7.5 3.703125l0 -1.1875l10.875 0l0 1.1875l-10.875 0zm12.188217 -3.703125l2.78125 -13.359375l1.6562653 0l-1.0000153 4.78125q0.78126526 -0.71875 1.4218903 -1.015625q0.640625 -0.296875 1.328125 -0.296875q1.359375 0 2.265625 1.015625q0.90625 1.0 0.90625 2.9375q0 1.28125 -0.375 2.359375q-0.359375 1.0625 -0.890625 1.78125q-0.53125 0.71875 -1.109375 1.15625q-0.578125 0.4375 -1.1875 0.640625q-0.59375 0.21875 -1.140625 0.21875q-0.96875 0 -1.7031403 -0.5q-0.71875 -0.515625 -1.125 -1.546875l-0.390625 1.828125l-1.4375 0zm2.4375 -3.96875l-0.015625 0.3125q0 1.234375 0.59375 1.890625q0.59376526 0.640625 1.4843903 0.640625q0.859375 0 1.578125 -0.609375q0.734375 -0.609375 1.1875 -1.890625q0.46875 -1.28125 0.46875 -2.375q0 -1.21875 -0.59375 -1.890625q-0.578125 -0.671875 -1.4375 -0.671875q-0.90625 0 -1.65625 0.6875q-0.73439026 0.6875 -1.2343903 2.125q-0.375 1.0625 -0.375 1.78125zm14.531967 2.21875q-1.734375 1.96875 -3.5625 1.96875q-1.109375 0 -1.796875 -0.640625q-0.6875 -0.640625 -0.6875 -1.578125q0 -0.609375 0.296875 -2.09375l1.171875 -5.578125l1.65625 0l-1.296875 6.1875q-0.171875 0.765625 -0.171875 1.203125q0 0.546875 0.328125 0.859375q0.34375 0.296875 0.984375 0.296875q0.703125 0 1.359375 -0.328125q0.65625 -0.34375 1.125 -0.921875q0.484375 -0.578125 0.796875 -1.359375q0.1875 -0.5 0.453125 -1.765625l0.875 -4.171875l1.65625 0l-2.03125 9.671875l-1.515625 0l0.359375 -1.75zm4.0007324 1.75l1.765625 -8.40625l-1.484375 0l0.265625 -1.265625l1.484375 0l0.28125 -1.375q0.21875 -1.03125 0.4375 -1.484375q0.234375 -0.453125 0.75 -0.75q0.53125 -0.296875 1.4375 -0.296875q0.625 0 1.828125 0.265625l-0.296875 1.4375q-0.84375 -0.21875 -1.40625 -0.21875q-0.484375 0 -0.734375 0.25q-0.25 0.234375 -0.4375 1.125l-0.21875 1.046875l1.84375 0l-0.265625 1.265625l-1.84375 0l-1.75 8.40625l-1.65625 0zm5.1832886 0l1.765625 -8.40625l-1.484375 0l0.265625 -1.265625l1.484375 0l0.28125 -1.375q0.21875 -1.03125 0.4375 -1.484375q0.234375 -0.453125 0.75 -0.75q0.53125 -0.296875 1.4375 -0.296875q0.625 0 1.828125 0.265625l-0.296875 1.4375q-0.84375 -0.21875 -1.40625 -0.21875q-0.484375 0 -0.734375 0.25q-0.25 0.234375 -0.4375 1.125l-0.21875 1.046875l1.84375 0l-0.265625 1.265625l-1.84375 0l-1.75 8.40625l-1.65625 0z" fill-rule="nonzero"/><path fill="#000000" fill-opacity="0.0" d="m163.81627 395.2362l178.11024 0l0 53.700806l-178.11024 0z" fill-rule="evenodd"/><path fill="#000000" d="m195.60925 418.62497l1.65625 0.171875q-0.625 1.8125 -1.765625 2.703125q-1.140625 0.875 -2.609375 0.875q-1.578125 0 -2.5625 -1.015625q-0.96875 -1.03125 -0.96875 -2.859375q0 -1.578125 0.625 -3.109375q0.640625 -1.53125 1.796875 -2.328125q1.171875 -0.796875 2.6875 -0.796875q1.546875 0 2.453125 0.875q0.921875 0.875 0.921875 2.328125l-1.625 0.109375q-0.015625 -0.921875 -0.546875 -1.4375q-0.515625 -0.515625 -1.359375 -0.515625q-1.0 0 -1.734375 0.625q-0.71875 0.625 -1.140625 1.90625q-0.40625 1.28125 -0.40625 2.46875q0 1.234375 0.546875 1.859375q0.546875 0.609375 1.34375 0.609375q0.796875 0 1.53125 -0.609375q0.734375 -0.609375 1.15625 -1.859375zm2.9375 -0.140625q0 -2.828125 1.671875 -4.6875q1.375 -1.53125 3.609375 -1.53125q1.75 0 2.8125 1.09375q1.078125 1.09375 1.078125 2.953125q0 1.65625 -0.671875 3.09375q-0.671875 1.4375 -1.921875 2.203125q-1.25 0.765625 -2.625 0.765625q-1.125 0 -2.046875 -0.484375q-0.921875 -0.484375 -1.421875 -1.359375q-0.484375 -0.890625 -0.484375 -2.046875zm1.65625 -0.15625q0 1.359375 0.65625 2.0625q0.65625 0.703125 1.65625 0.703125q0.53125 0 1.046875 -0.203125q0.53125 -0.21875 0.96875 -0.65625q0.453125 -0.4375 0.765625 -1.0q0.3125 -0.5625 0.5 -1.203125q0.28125 -0.90625 0.28125 -1.734375q0 -1.3125 -0.65625 -2.03125q-0.65625 -0.734375 -1.65625 -0.734375q-0.78125 0 -1.421875 0.375q-0.625 0.375 -1.140625 1.09375q-0.515625 0.703125 -0.765625 1.640625q-0.234375 0.9375 -0.234375 1.6875zm8.438217 3.828125l2.015625 -9.671875l1.5 0l-0.359375 1.6875q0.96875 -1.0 1.8125 -1.453125q0.859375 -0.453125 1.734375 -0.453125q1.1875 0 1.84375 0.640625q0.671875 0.625 0.671875 1.703125q0 0.53125 -0.234375 1.6875l-1.234375 5.859375l-1.640625 0l1.28125 -6.125q0.1875 -0.890625 0.1875 -1.328125q0 -0.484375 -0.328125 -0.78125q-0.328125 -0.296875 -0.953125 -0.296875q-1.28125 0 -2.265625 0.90625q-0.984375 0.90625 -1.453125 3.125l-0.9375 4.5l-1.640625 0zm14.219467 -1.34375l-0.265625 1.359375q-0.59375 0.140625 -1.15625 0.140625q-0.984375 0 -1.5625 -0.46875q-0.4375 -0.375 -0.4375 -1.0q0 -0.3125 0.234375 -1.46875l1.171875 -5.625l-1.296875 0l0.265625 -1.265625l1.296875 0l0.5 -2.375l1.890625 -1.140625l-0.734375 3.515625l1.625 0l-0.28125 1.265625l-1.609375 0l-1.125 5.359375q-0.203125 1.015625 -0.203125 1.21875q0 0.28125 0.15625 0.4375q0.171875 0.15625 0.5625 0.15625q0.546875 0 0.96875 -0.109375zm8.433304 -1.9375l1.609375 0.15625q-0.34375 1.1875 -1.59375 2.265625q-1.234375 1.078125 -2.96875 1.078125q-1.0625 0 -1.96875 -0.5q-0.890625 -0.5 -1.359375 -1.4375q-0.46875 -0.953125 -0.46875 -2.15625q0 -1.59375 0.734375 -3.078125q0.734375 -1.484375 1.890625 -2.203125q1.171875 -0.734375 2.53125 -0.734375q1.734375 0 2.765625 1.078125q1.03125 1.0625 1.03125 2.921875q0 0.71875 -0.125 1.46875l-7.125 0q-0.046875 0.28125 -0.046875 0.5q0 1.359375 0.625 2.078125q0.625 0.71875 1.53125 0.71875q0.84375 0 1.65625 -0.546875q0.828125 -0.5625 1.28125 -1.609375zm-4.78125 -2.40625l5.421875 0q0.015625 -0.25 0.015625 -0.359375q0 -1.234375 -0.625 -1.890625q-0.625 -0.671875 -1.59375 -0.671875q-1.0625 0 -1.9375 0.734375q-0.859375 0.71875 -1.28125 2.1875zm7.406967 5.6875l4.21875 -4.90625l-2.421875 -4.765625l1.828125 0l0.8125 1.71875q0.453125 0.96875 0.828125 1.84375l2.78125 -3.5625l2.015625 0l-4.0625 4.890625l2.4375 4.78125l-1.8125 0l-0.96875 -1.96875q-0.3125 -0.625 -0.703125 -1.5625l-2.875 3.53125l-2.078125 0zm13.828125 -1.34375l-0.265625 1.359375q-0.59375 0.140625 -1.15625 0.140625q-0.984375 0 -1.5625 -0.46875q-0.4375 -0.375 -0.4375 -1.0q0 -0.3125 0.234375 -1.46875l1.171875 -5.625l-1.296875 0l0.265625 -1.265625l1.296875 0l0.5 -2.375l1.890625 -1.140625l-0.734375 3.515625l1.625 0l-0.28125 1.265625l-1.609375 0l-1.125 5.359375q-0.203125 1.015625 -0.203125 1.21875q0 0.28125 0.15625 0.4375q0.171875 0.15625 0.5625 0.15625q0.546875 0 0.96875 -0.109375zm11.210358 -0.8125l0 -3.671875l-3.640625 0l0 -1.515625l3.640625 0l0 -3.640625l1.546875 0l0 3.640625l3.640625 0l0 1.515625l-3.640625 0l0 3.671875l-1.546875 0zm11.390778 2.15625l2.78125 -13.359375l1.65625 0l-1.0 4.78125q0.78125 -0.71875 1.421875 -1.015625q0.640625 -0.296875 1.328125 -0.296875q1.359375 0 2.265625 1.015625q0.90625 1.0 0.90625 2.9375q0 1.28125 -0.375 2.359375q-0.359375 1.0625 -0.890625 1.78125q-0.53125 0.71875 -1.109375 1.15625q-0.578125 0.4375 -1.1875 0.640625q-0.59375 0.21875 -1.140625 0.21875q-0.96875 0 -1.703125 -0.5q-0.71875 -0.515625 -1.125 -1.546875l-0.390625 1.828125l-1.4375 0zm2.4375 -3.96875l-0.015625 0.3125q0 1.234375 0.59375 1.890625q0.59375 0.640625 1.484375 0.640625q0.859375 0 1.578125 -0.609375q0.734375 -0.609375 1.1875 -1.890625q0.46875 -1.28125 0.46875 -2.375q0 -1.21875 -0.59375 -1.890625q-0.578125 -0.671875 -1.4375 -0.671875q-0.90625 0 -1.65625 0.6875q-0.734375 0.6875 -1.234375 2.125q-0.375 1.0625 -0.375 1.78125zm14.531952 2.21875q-1.734375 1.96875 -3.5625 1.96875q-1.109375 0 -1.796875 -0.640625q-0.6875 -0.640625 -0.6875 -1.578125q0 -0.609375 0.296875 -2.09375l1.171875 -5.578125l1.65625 0l-1.296875 6.1875q-0.171875 0.765625 -0.171875 1.203125q0 0.546875 0.328125 0.859375q0.34375 0.296875 0.984375 0.296875q0.703125 0 1.359375 -0.328125q0.65625 -0.34375 1.125 -0.921875q0.484375 -0.578125 0.796875 -1.359375q0.1875 -0.5 0.453125 -1.765625l0.875 -4.171875l1.65625 0l-2.03125 9.671875l-1.515625 0l0.359375 -1.75zm4.0007324 1.75l1.765625 -8.40625l-1.484375 0l0.265625 -1.265625l1.484375 0l0.28125 -1.375q0.21875 -1.03125 0.4375 -1.484375q0.234375 -0.453125 0.75 -0.75q0.53125 -0.296875 1.4375 -0.296875q0.625 0 1.828125 0.265625l-0.296875 1.4375q-0.84375 -0.21875 -1.40625 -0.21875q-0.484375 0 -0.734375 0.25q-0.25 0.234375 -0.4375 1.125l-0.21875 1.046875l1.84375 0l-0.265625 1.265625l-1.84375 0l-1.75 8.40625l-1.65625 0zm5.1832886 0l1.765625 -8.40625l-1.484375 0l0.265625 -1.265625l1.484375 0l0.28125 -1.375q0.21875 -1.03125 0.4375 -1.484375q0.234375 -0.453125 0.75 -0.75q0.53125 -0.296875 1.4375 -0.296875q0.625 0 1.828125 0.265625l-0.296875 1.4375q-0.84375 -0.21875 -1.40625 -0.21875q-0.484375 0 -0.734375 0.25q-0.25 0.234375 -0.4375 1.125l-0.21875 1.046875l1.84375 0l-0.265625 1.265625l-1.84375 0l-1.75 8.40625l-1.65625 0zm12.058319 -3.28125l1.609375 0.15625q-0.34375 1.1875 -1.59375 2.265625q-1.234375 1.078125 -2.96875 1.078125q-1.0625 0 -1.96875 -0.5q-0.890625 -0.5 -1.359375 -1.4375q-0.46875 -0.953125 -0.46875 -2.15625q0 -1.59375 0.734375 -3.078125q0.734375 -1.484375 1.890625 -2.203125q1.171875 -0.734375 2.53125 -0.734375q1.734375 0 2.765625 1.078125q1.03125 1.0625 1.03125 2.921875q0 0.71875 -0.125 1.46875l-7.125 0q-0.046875 0.28125 -0.046875 0.5q0 1.359375 0.625 2.078125q0.625 0.71875 1.53125 0.71875q0.84375 0 1.65625 -0.546875q0.828125 -0.5625 1.28125 -1.609375zm-4.78125 -2.40625l5.421875 0q0.015625 -0.25 0.015625 -0.359375q0 -1.234375 -0.625 -1.890625q-0.625 -0.671875 -1.59375 -0.671875q-1.0625 0 -1.9375 0.734375q-0.859375 0.71875 -1.28125 2.1875zm8.063202 5.6875l2.015625 -9.671875l1.453125 0l-0.40625 1.96875q0.75 -1.109375 1.453125 -1.640625q0.71875 -0.546875 1.46875 -0.546875q0.5 0 1.21875 0.359375l-0.671875 1.53125q-0.4375 -0.3125 -0.9375 -0.3125q-0.875 0 -1.78125 0.96875q-0.90625 0.953125 -1.4375 3.46875l-0.8125 3.875l-1.5625 0z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m0 165.96588l118.74016 0l0 40.25197l-118.74016 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m0 165.96588l118.74016 0l0 40.25197l-118.74016 0z" fill-rule="evenodd"/><path fill="#000000" d="m23.145836 190.12123l1.625 -0.25q0.125 0.96875 0.75 1.5q0.625 0.515625 1.75 0.515625q1.125 0 1.671875 -0.453125q0.546875 -0.46875 0.546875 -1.09375q0 -0.546875 -0.484375 -0.875q-0.328125 -0.21875 -1.671875 -0.546875q-1.8125 -0.46875 -2.515625 -0.796875q-0.6875 -0.328125 -1.046875 -0.90625q-0.359375 -0.59375 -0.359375 -1.3125q0 -0.640625 0.296875 -1.1875q0.296875 -0.5625 0.8125 -0.921875q0.375 -0.28125 1.03125 -0.46875q0.671875 -0.203125 1.421875 -0.203125q1.140625 0 2.0 0.328125q0.859375 0.328125 1.265625 0.890625q0.421875 0.5625 0.578125 1.5l-1.609375 0.21875q-0.109375 -0.75 -0.640625 -1.171875q-0.515625 -0.421875 -1.46875 -0.421875q-1.140625 0 -1.625 0.375q-0.46875 0.375 -0.46875 0.875q0 0.3125 0.1875 0.578125q0.203125 0.265625 0.640625 0.4375q0.234375 0.09375 1.4375 0.421875q1.75 0.453125 2.4375 0.75q0.6875 0.296875 1.078125 0.859375q0.390625 0.5625 0.390625 1.40625q0 0.828125 -0.484375 1.546875q-0.46875 0.71875 -1.375 1.125q-0.90625 0.390625 -2.046875 0.390625q-1.875 0 -2.875 -0.78125q-0.984375 -0.78125 -1.25 -2.328125zm13.5625 1.421875l0.234375 1.453125q-0.6875 0.140625 -1.234375 0.140625q-0.890625 0 -1.390625 -0.28125q-0.484375 -0.28125 -0.6875 -0.734375q-0.203125 -0.46875 -0.203125 -1.9375l0 -5.578125l-1.203125 0l0 -1.265625l1.203125 0l0 -2.390625l1.625 -0.984375l0 3.375l1.65625 0l0 1.265625l-1.65625 0l0 5.671875q0 0.6875 0.078125 0.890625q0.09375 0.203125 0.28125 0.328125q0.203125 0.109375 0.578125 0.109375q0.265625 0 0.71875 -0.0625zm1.5895538 1.46875l0 -9.671875l1.46875 0l0 1.46875q0.5625 -1.03125 1.03125 -1.359375q0.484375 -0.328125 1.0625 -0.328125q0.828125 0 1.6875 0.53125l-0.5625 1.515625q-0.609375 -0.359375 -1.203125 -0.359375q-0.546875 0 -0.96875 0.328125q-0.421875 0.328125 -0.609375 0.890625q-0.28125 0.875 -0.28125 1.921875l0 5.0625l-1.625 0zm6.228302 3.703125l0 -13.375l1.484375 0l0 1.25q0.53125 -0.734375 1.1875 -1.09375q0.671875 -0.375 1.625 -0.375q1.234375 0 2.171875 0.640625q0.953125 0.625 1.4375 1.796875q0.484375 1.15625 0.484375 2.546875q0 1.484375 -0.53125 2.671875q-0.53125 1.1875 -1.546875 1.828125q-1.015625 0.625 -2.140625 0.625q-0.8125 0 -1.46875 -0.34375q-0.65625 -0.34375 -1.0625 -0.875l0 4.703125l-1.640625 0zm1.484375 -8.484375q0 1.859375 0.75 2.765625q0.765625 0.890625 1.828125 0.890625q1.09375 0 1.875 -0.921875q0.78125 -0.9375 0.78125 -2.875q0 -1.84375 -0.765625 -2.765625q-0.75 -0.921875 -1.8125 -0.921875q-1.046875 0 -1.859375 0.984375q-0.796875 0.96875 -0.796875 2.84375zm15.203842 3.59375q-0.921875 0.765625 -1.765625 1.09375q-0.828125 0.3125 -1.796875 0.3125q-1.59375 0 -2.453125 -0.78125q-0.859375 -0.78125 -0.859375 -1.984375q0 -0.71875 0.328125 -1.296875q0.328125 -0.59375 0.84375 -0.9375q0.53125 -0.359375 1.1875 -0.546875q0.46875 -0.125 1.453125 -0.25q1.984375 -0.234375 2.921875 -0.5625q0.015625 -0.34375 0.015625 -0.421875q0 -1.0 -0.46875 -1.421875q-0.625 -0.546875 -1.875 -0.546875q-1.15625 0 -1.703125 0.40625q-0.546875 0.40625 -0.8125 1.421875l-1.609375 -0.21875q0.21875 -1.015625 0.71875 -1.640625q0.5 -0.640625 1.453125 -0.984375q0.953125 -0.34375 2.1875 -0.34375q1.25 0 2.015625 0.296875q0.78125 0.28125 1.140625 0.734375q0.375 0.4375 0.515625 1.109375q0.078125 0.421875 0.078125 1.515625l0 2.1875q0 2.28125 0.109375 2.890625q0.109375 0.59375 0.40625 1.15625l-1.703125 0q-0.265625 -0.515625 -0.328125 -1.1875zm-0.140625 -3.671875q-0.890625 0.375 -2.671875 0.625q-1.015625 0.140625 -1.4375 0.328125q-0.421875 0.1875 -0.65625 0.53125q-0.21875 0.34375 -0.21875 0.78125q0 0.65625 0.5 1.09375q0.5 0.4375 1.453125 0.4375q0.9375 0 1.671875 -0.40625q0.75 -0.421875 1.09375 -1.140625q0.265625 -0.5625 0.265625 -1.640625l0 -0.609375zm4.188217 4.859375l0 -9.671875l1.46875 0l0 1.46875q0.5625 -1.03125 1.03125 -1.359375q0.484375 -0.328125 1.0625 -0.328125q0.828125 0 1.6875 0.53125l-0.5625 1.515625q-0.609375 -0.359375 -1.203125 -0.359375q-0.546875 0 -0.96875 0.328125q-0.421875 0.328125 -0.609375 0.890625q-0.28125 0.875 -0.28125 1.921875l0 5.0625l-1.625 0zm5.572052 -2.890625l1.625 -0.25q0.125 0.96875 0.75 1.5q0.625 0.515625 1.75 0.515625q1.125 0 1.671875 -0.453125q0.546875 -0.46875 0.546875 -1.09375q0 -0.546875 -0.484375 -0.875q-0.328125 -0.21875 -1.671875 -0.546875q-1.8125 -0.46875 -2.515625 -0.796875q-0.6875 -0.328125 -1.046875 -0.90625q-0.359375 -0.59375 -0.359375 -1.3125q0 -0.640625 0.296875 -1.1875q0.296875 -0.5625 0.8125 -0.921875q0.375 -0.28125 1.03125 -0.46875q0.671875 -0.203125 1.421875 -0.203125q1.140625 0 2.0 0.328125q0.859375 0.328125 1.265625 0.890625q0.421875 0.5625 0.578125 1.5l-1.609375 0.21875q-0.109375 -0.75 -0.640625 -1.171875q-0.515625 -0.421875 -1.46875 -0.421875q-1.140625 0 -1.625 0.375q-0.46875 0.375 -0.46875 0.875q0 0.3125 0.1875 0.578125q0.203125 0.265625 0.640625 0.4375q0.234375 0.09375 1.4375 0.421875q1.75 0.453125 2.4375 0.75q0.6875 0.296875 1.078125 0.859375q0.390625 0.5625 0.390625 1.40625q0 0.828125 -0.484375 1.546875q-0.46875 0.71875 -1.375 1.125q-0.90625 0.390625 -2.046875 0.390625q-1.875 0 -2.875 -0.78125q-0.984375 -0.78125 -1.25 -2.328125zm16.609375 -0.21875l1.6875 0.203125q-0.40625 1.484375 -1.484375 2.3125q-1.078125 0.8125 -2.765625 0.8125q-2.125 0 -3.375 -1.296875q-1.234375 -1.3125 -1.234375 -3.671875q0 -2.453125 1.25 -3.796875q1.265625 -1.34375 3.265625 -1.34375q1.9375 0 3.15625 1.328125q1.234375 1.3125 1.234375 3.703125q0 0.15625 0 0.4375l-7.21875 0q0.09375 1.59375 0.90625 2.453125q0.8125 0.84375 2.015625 0.84375q0.90625 0 1.546875 -0.46875q0.640625 -0.484375 1.015625 -1.515625zm-5.390625 -2.65625l5.40625 0q-0.109375 -1.21875 -0.625 -1.828125q-0.78125 -0.953125 -2.03125 -0.953125q-1.125 0 -1.90625 0.765625q-0.765625 0.75 -0.84375 2.015625zm9.125717 5.765625l0 -9.671875l1.46875 0l0 1.46875q0.5625 -1.03125 1.03125 -1.359375q0.484375 -0.328125 1.0625 -0.328125q0.828125 0 1.6875 0.53125l-0.5625 1.515625q-0.609375 -0.359375 -1.203125 -0.359375q-0.546875 0 -0.96875 0.328125q-0.421875 0.328125 -0.609375 0.890625q-0.28125 0.875 -0.28125 1.921875l0 5.0625l-1.625 0z" fill-rule="nonzero"/></g></svg>
diff --git a/Documentation/networking/tls-offload-reorder-bad.svg b/Documentation/networking/tls-offload-reorder-bad.svg
new file mode 100644
index 0000000000..d107aaf0f7
--- /dev/null
+++ b/Documentation/networking/tls-offload-reorder-bad.svg
@@ -0,0 +1 @@
+<svg version="1.1" viewBox="0.0 0.0 672.0 68.0" fill="none" stroke="none" stroke-linecap="square" stroke-miterlimit="10" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg"><clipPath id="p.0"><path d="m0 0l960.0 0l0 720.0l-960.0 0l0 -720.0z" clip-rule="nonzero"/></clipPath><g clip-path="url(#p.0)"><path fill="#000000" fill-opacity="0.0" d="m0 0l960.0 0l0 720.0l-960.0 0z" fill-rule="evenodd"/><path fill="#b6d7a8" d="m0 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m0 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" d="m15.953125 52.942722l-1.640625 0l0 -10.453125q-0.59375 0.5625 -1.5625 1.140625q-0.953125 0.5625 -1.71875 0.84375l0 -1.59375q1.375 -0.640625 2.40625 -1.5625q1.03125 -0.921875 1.453125 -1.78125l1.0625 0l0 13.40625z" fill-rule="nonzero"/><path fill="#c9daf8" d="m340.69897 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m340.69897 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" d="m355.73022 52.942722l0 -3.203125l-5.796875 0l0 -1.5l6.09375 -8.65625l1.34375 0l0 8.65625l1.796875 0l0 1.5l-1.796875 0l0 3.203125l-1.640625 0zm0 -4.703125l0 -6.015625l-4.1875 6.015625l4.1875 0z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m225.37527 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m225.37527 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" d="m235.15652 49.411472l1.640625 -0.21875q0.28125 1.40625 0.953125 2.015625q0.6875 0.609375 1.65625 0.609375q1.15625 0 1.953125 -0.796875q0.796875 -0.796875 0.796875 -1.984375q0 -1.125 -0.734375 -1.859375q-0.734375 -0.734375 -1.875 -0.734375q-0.46875 0 -1.15625 0.171875l0.1875 -1.4375q0.15625 0.015625 0.265625 0.015625q1.046875 0 1.875 -0.546875q0.84375 -0.546875 0.84375 -1.671875q0 -0.90625 -0.609375 -1.5q-0.609375 -0.59375 -1.578125 -0.59375q-0.953125 0 -1.59375 0.609375q-0.640625 0.59375 -0.8125 1.796875l-1.640625 -0.296875q0.296875 -1.640625 1.359375 -2.546875q1.0625 -0.90625 2.65625 -0.90625q1.09375 0 2.0 0.46875q0.921875 0.46875 1.40625 1.28125q0.5 0.8125 0.5 1.71875q0 0.859375 -0.46875 1.578125q-0.46875 0.703125 -1.375 1.125q1.1875 0.28125 1.84375 1.140625q0.65625 0.859375 0.65625 2.15625q0 1.734375 -1.28125 2.953125q-1.265625 1.21875 -3.21875 1.21875q-1.765625 0 -2.921875 -1.046875q-1.15625 -1.046875 -1.328125 -2.71875z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m572.3295 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m572.3295 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" d="m590.72015 51.364597l0 1.578125l-8.828125 0q-0.015625 -0.59375 0.1875 -1.140625q0.34375 -0.90625 1.078125 -1.78125q0.75 -0.875 2.15625 -2.015625q2.171875 -1.78125 2.9375 -2.828125q0.765625 -1.046875 0.765625 -1.96875q0 -0.984375 -0.703125 -1.640625q-0.6875 -0.671875 -1.8125 -0.671875q-1.1875 0 -1.90625 0.71875q-0.703125 0.703125 -0.703125 1.953125l-1.6875 -0.171875q0.171875 -1.890625 1.296875 -2.875q1.140625 -0.984375 3.03125 -0.984375q1.921875 0 3.046875 1.0625q1.125 1.0625 1.125 2.640625q0 0.796875 -0.328125 1.578125q-0.328125 0.78125 -1.09375 1.640625q-0.75 0.84375 -2.53125 2.34375q-1.46875 1.234375 -1.890625 1.6875q-0.421875 0.4375 -0.6875 0.875l6.546875 0z" fill-rule="nonzero"/><path fill="#e06666" d="m615.56793 24.999102l6.5512085 0l0 42.04725l-6.5512085 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m615.56793 24.999102l6.5512085 0l0 42.04725l-6.5512085 0z" fill-rule="evenodd"/><path fill="#c9daf8" d="m456.51425 24.999102l99.02365 0l0 42.04725l-99.02365 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m456.51425 24.999102l99.02365 0l0 42.04725l-99.02365 0z" fill-rule="evenodd"/><path fill="#000000" d="m466.2955 49.442722l1.71875 -0.140625q0.1875 1.25 0.875 1.890625q0.703125 0.625 1.6875 0.625q1.1875 0 2.0 -0.890625q0.828125 -0.890625 0.828125 -2.359375q0 -1.40625 -0.796875 -2.21875q-0.78125 -0.8125 -2.0625 -0.8125q-0.78125 0 -1.421875 0.359375q-0.640625 0.359375 -1.0 0.9375l-1.546875 -0.203125l1.296875 -6.859375l6.640625 0l0 1.5625l-5.328125 0l-0.71875 3.59375q1.203125 -0.84375 2.515625 -0.84375q1.75 0 2.953125 1.21875q1.203125 1.203125 1.203125 3.109375q0 1.8125 -1.046875 3.140625q-1.296875 1.625 -3.515625 1.625q-1.8125 0 -2.96875 -1.015625q-1.15625 -1.03125 -1.3125 -2.71875z" fill-rule="nonzero"/><path fill="#e06666" d="m391.1985 24.999102l6.551178 0l0 42.04725l-6.551178 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m391.1985 24.999102l6.551178 0l0 42.04725l-6.551178 0z" fill-rule="evenodd"/><path fill="#000000" fill-opacity="0.0" d="m114.43843 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" stroke-dasharray="4.0,3.0" d="m114.43843 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" fill-opacity="0.0" d="m163.95024 24.999102c0 -12.5 114.47246 -25.007874 228.9449 -25.0c114.47241 0.007874016 228.94489 12.531496 228.94489 25.062992" fill-rule="evenodd"/><path stroke="#000000" stroke-width="2.0" stroke-linejoin="round" stroke-linecap="butt" d="m163.95024 24.9991c0 -12.499998 114.47246 -25.007872 228.9449 -24.999998c57.236206 0.003937008 114.47244 3.136811 157.3996 7.835138c21.463562 2.3491635 39.349915 5.0896897 51.8703 8.026144c3.130127 0.7341137 5.9248657 1.4804726 8.356262 2.236023c0.30395508 0.09444237 0.6022339 0.1890316 0.89471436 0.28375435l0.37457275 0.12388611" fill-rule="evenodd"/><path fill="#000000" stroke="#000000" stroke-width="2.0" stroke-linecap="butt" d="m609.98517 21.270555l9.406311 2.1936665l-5.7955933 -7.7266836z" fill-rule="evenodd"/><path fill="#e06666" d="m47.56793 24.999102l6.551182 0l0 42.04725l-6.551182 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m47.56793 24.999102l6.551182 0l0 42.04725l-6.551182 0z" fill-rule="evenodd"/></g></svg>
diff --git a/Documentation/networking/tls-offload-reorder-good.svg b/Documentation/networking/tls-offload-reorder-good.svg
new file mode 100644
index 0000000000..10e17d91f7
--- /dev/null
+++ b/Documentation/networking/tls-offload-reorder-good.svg
@@ -0,0 +1 @@
+<svg version="1.1" viewBox="0.0 0.0 672.0 68.0" fill="none" stroke="none" stroke-linecap="square" stroke-miterlimit="10" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg"><clipPath id="p.0"><path d="m0 0l960.0 0l0 720.0l-960.0 0l0 -720.0z" clip-rule="nonzero"/></clipPath><g clip-path="url(#p.0)"><path fill="#000000" fill-opacity="0.0" d="m0 0l960.0 0l0 720.0l-960.0 0z" fill-rule="evenodd"/><path fill="#b6d7a8" d="m0 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m0 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" d="m15.953125 52.942722l-1.640625 0l0 -10.453125q-0.59375 0.5625 -1.5625 1.140625q-0.953125 0.5625 -1.71875 0.84375l0 -1.59375q1.375 -0.640625 2.40625 -1.5625q1.03125 -0.921875 1.453125 -1.78125l1.0625 0l0 13.40625z" fill-rule="nonzero"/><path fill="#b6d7a8" d="m340.69897 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m340.69897 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" d="m355.73022 52.942722l0 -3.203125l-5.796875 0l0 -1.5l6.09375 -8.65625l1.34375 0l0 8.65625l1.796875 0l0 1.5l-1.796875 0l0 3.203125l-1.640625 0zm0 -4.703125l0 -6.015625l-4.1875 6.015625l4.1875 0z" fill-rule="nonzero"/><path fill="#cfe2f3" d="m225.37527 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m225.37527 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" d="m235.15652 49.411472l1.640625 -0.21875q0.28125 1.40625 0.953125 2.015625q0.6875 0.609375 1.65625 0.609375q1.15625 0 1.953125 -0.796875q0.796875 -0.796875 0.796875 -1.984375q0 -1.125 -0.734375 -1.859375q-0.734375 -0.734375 -1.875 -0.734375q-0.46875 0 -1.15625 0.171875l0.1875 -1.4375q0.15625 0.015625 0.265625 0.015625q1.046875 0 1.875 -0.546875q0.84375 -0.546875 0.84375 -1.671875q0 -0.90625 -0.609375 -1.5q-0.609375 -0.59375 -1.578125 -0.59375q-0.953125 0 -1.59375 0.609375q-0.640625 0.59375 -0.8125 1.796875l-1.640625 -0.296875q0.296875 -1.640625 1.359375 -2.546875q1.0625 -0.90625 2.65625 -0.90625q1.09375 0 2.0 0.46875q0.921875 0.46875 1.40625 1.28125q0.5 0.8125 0.5 1.71875q0 0.859375 -0.46875 1.578125q-0.46875 0.703125 -1.375 1.125q1.1875 0.28125 1.84375 1.140625q0.65625 0.859375 0.65625 2.15625q0 1.734375 -1.28125 2.953125q-1.265625 1.21875 -3.21875 1.21875q-1.765625 0 -2.921875 -1.046875q-1.15625 -1.046875 -1.328125 -2.71875z" fill-rule="nonzero"/><path fill="#e06666" d="m271.56793 24.999102l6.551178 0l0 42.04725l-6.551178 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m271.56793 24.999102l6.551178 0l0 42.04725l-6.551178 0z" fill-rule="evenodd"/><path fill="#cfe2f3" d="m572.3295 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m572.3295 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" d="m590.72015 51.364597l0 1.578125l-8.828125 0q-0.015625 -0.59375 0.1875 -1.140625q0.34375 -0.90625 1.078125 -1.78125q0.75 -0.875 2.15625 -2.015625q2.171875 -1.78125 2.9375 -2.828125q0.765625 -1.046875 0.765625 -1.96875q0 -0.984375 -0.703125 -1.640625q-0.6875 -0.671875 -1.8125 -0.671875q-1.1875 0 -1.90625 0.71875q-0.703125 0.703125 -0.703125 1.953125l-1.6875 -0.171875q0.171875 -1.890625 1.296875 -2.875q1.140625 -0.984375 3.03125 -0.984375q1.921875 0 3.046875 1.0625q1.125 1.0625 1.125 2.640625q0 0.796875 -0.328125 1.578125q-0.328125 0.78125 -1.09375 1.640625q-0.75 0.84375 -2.53125 2.34375q-1.46875 1.234375 -1.890625 1.6875q-0.421875 0.4375 -0.6875 0.875l6.546875 0z" fill-rule="nonzero"/><path fill="#b6d7a8" d="m456.51425 24.999102l99.02365 0l0 42.04725l-99.02365 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m456.51425 24.999102l99.02365 0l0 42.04725l-99.02365 0z" fill-rule="evenodd"/><path fill="#000000" d="m466.2955 49.442722l1.71875 -0.140625q0.1875 1.25 0.875 1.890625q0.703125 0.625 1.6875 0.625q1.1875 0 2.0 -0.890625q0.828125 -0.890625 0.828125 -2.359375q0 -1.40625 -0.796875 -2.21875q-0.78125 -0.8125 -2.0625 -0.8125q-0.78125 0 -1.421875 0.359375q-0.640625 0.359375 -1.0 0.9375l-1.546875 -0.203125l1.296875 -6.859375l6.640625 0l0 1.5625l-5.328125 0l-0.71875 3.59375q1.203125 -0.84375 2.515625 -0.84375q1.75 0 2.953125 1.21875q1.203125 1.203125 1.203125 3.109375q0 1.8125 -1.046875 3.140625q-1.296875 1.625 -3.515625 1.625q-1.8125 0 -2.96875 -1.015625q-1.15625 -1.03125 -1.3125 -2.71875z" fill-rule="nonzero"/><path fill="#e06666" d="m503.1985 24.999102l6.551178 0l0 42.04725l-6.551178 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m503.1985 24.999102l6.551178 0l0 42.04725l-6.551178 0z" fill-rule="evenodd"/><path fill="#000000" fill-opacity="0.0" d="m114.43843 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" stroke-dasharray="4.0,3.0" d="m114.43843 24.999102l99.02362 0l0 42.04725l-99.02362 0z" fill-rule="evenodd"/><path fill="#000000" fill-opacity="0.0" d="m163.95024 24.999102c0 -12.5 114.47246 -25.007874 228.9449 -25.0c114.47241 0.007874016 228.94489 12.531496 228.94489 25.062992" fill-rule="evenodd"/><path stroke="#000000" stroke-width="2.0" stroke-linejoin="round" stroke-linecap="butt" d="m163.95024 24.9991c0 -12.499998 114.47246 -25.007872 228.9449 -24.999998c57.236206 0.003937008 114.47244 3.136811 157.3996 7.835138c21.463562 2.3491635 39.349915 5.0896897 51.8703 8.026144c3.130127 0.7341137 5.9248657 1.4804726 8.356262 2.236023c0.30395508 0.09444237 0.6022339 0.1890316 0.89471436 0.28375435l0.37457275 0.12388611" fill-rule="evenodd"/><path fill="#000000" stroke="#000000" stroke-width="2.0" stroke-linecap="butt" d="m609.98517 21.270555l9.406311 2.1936665l-5.7955933 -7.7266836z" fill-rule="evenodd"/><path fill="#e06666" d="m47.56793 24.999102l6.551182 0l0 42.04725l-6.551182 0z" fill-rule="evenodd"/><path stroke="#000000" stroke-width="1.0" stroke-linejoin="round" stroke-linecap="butt" d="m47.56793 24.999102l6.551182 0l0 42.04725l-6.551182 0z" fill-rule="evenodd"/></g></svg>
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
new file mode 100644
index 0000000000..5f0dea3d57
--- /dev/null
+++ b/Documentation/networking/tls-offload.rst
@@ -0,0 +1,539 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+==================
+Kernel TLS offload
+==================
+
+Kernel TLS operation
+====================
+
+Linux kernel provides TLS connection offload infrastructure. Once a TCP
+connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
+Layer Protocol (ULP) and install the cryptographic connection state.
+For details regarding the user-facing interface refer to the TLS
+documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.
+
+``ktls`` can operate in three modes:
+
+ * Software crypto mode (``TLS_SW``) - CPU handles the cryptography.
+ In most basic cases only crypto operations synchronous with the CPU
+ can be used, but depending on calling context CPU may utilize
+ asynchronous crypto accelerators. The use of accelerators introduces extra
+ latency on socket reads (decryption only starts when a read syscall
+ is made) and additional I/O load on the system.
+ * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
+ on a packet by packet basis, provided the packets arrive in order.
+ This mode integrates best with the kernel stack and is described in detail
+ in the remaining part of this document
+ (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
+ * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where
+ NIC driver and firmware replace the kernel networking stack
+ with its own TCP handling, it is not usable in production environments
+ making use of the Linux networking stack for example any firewalling
+ abilities or QoS and packet scheduling (``ethtool`` flag ``tls-hw-record``).
+
+The operation mode is selected automatically based on device configuration,
+offload opt-in or opt-out on per-connection basis is not currently supported.
+
+TX
+--
+
+At a high level user write requests are turned into a scatter list, the TLS ULP
+intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
+mode) and then hands the modified scatter list to the TCP layer. From this
+point on the TCP stack proceeds as normal.
+
+In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
+Instead packets reach a device driver, the driver will mark the packets
+for crypto offload based on the socket the packet is attached to,
+and send them to the device for encryption and transmission.
+
+RX
+--
+
+On the receive side if the device handled decryption and authentication
+successfully, the driver will set the decrypted bit in the associated
+:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
+are handled normally. ``ktls`` is informed when data is queued to the socket
+and the ``strparser`` mechanism is used to delineate the records. Upon read
+request, records are retrieved from the socket and passed to decryption routine.
+If device decrypted all the segments of the record the decryption is skipped,
+otherwise software path handles decryption.
+
+.. kernel-figure:: tls-offload-layers.svg
+ :alt: TLS offload layers
+ :align: center
+ :figwidth: 28em
+
+ Layers of Kernel TLS stack
+
+Device configuration
+====================
+
+During driver initialization device sets the ``NETIF_F_HW_TLS_RX`` and
+``NETIF_F_HW_TLS_TX`` features and installs its
+:c:type:`struct tlsdev_ops <tlsdev_ops>`
+pointer in the :c:member:`tlsdev_ops` member of the
+:c:type:`struct net_device <net_device>`.
+
+When TLS cryptographic connection state is installed on a ``ktls`` socket
+(note that it is done twice, once for RX and once for TX direction,
+and the two are completely independent), the kernel checks if the underlying
+network device is offload-capable and attempts the offload. In case offload
+fails the connection is handled entirely in software using the same mechanism
+as if the offload was never tried.
+
+Offload request is performed via the :c:member:`tls_dev_add` callback of
+:c:type:`struct tlsdev_ops <tlsdev_ops>`:
+
+.. code-block:: c
+
+ int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
+ enum tls_offload_ctx_dir direction,
+ struct tls_crypto_info *crypto_info,
+ u32 start_offload_tcp_sn);
+
+``direction`` indicates whether the cryptographic information is for
+the received or transmitted packets. Driver uses the ``sk`` parameter
+to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
+Cryptographic information in ``crypto_info`` includes the key, iv, salt
+as well as TLS record sequence number. ``start_offload_tcp_sn`` indicates
+which TCP sequence number corresponds to the beginning of the record with
+sequence number from ``crypto_info``. The driver can add its state
+at the end of kernel structures (see :c:member:`driver_state` members
+in ``include/net/tls.h``) to avoid additional allocations and pointer
+dereferences.
+
+TX
+--
+
+After TX state is installed, the stack guarantees that the first segment
+of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
+number, simplifying TCP sequence number matching.
+
+TX offload being fully initialized does not imply that all segments passing
+through the driver and which belong to the offloaded socket will be after
+the expected sequence number and will have kernel record information.
+In particular, already encrypted data may have been queued to the socket
+before installing the connection state in the kernel.
+
+RX
+--
+
+In RX direction local networking stack has little control over the segmentation,
+so the initial records' TCP sequence number may be anywhere inside the segment.
+
+Normal operation
+================
+
+At the minimum the device maintains the following state for each connection, in
+each direction:
+
+ * crypto secrets (key, iv, salt)
+ * crypto processing state (partial blocks, partial authentication tag, etc.)
+ * record metadata (sequence number, processing offset and length)
+ * expected TCP sequence number
+
+There are no guarantees on record length or record segmentation. In particular
+segments may start at any point of a record and contain any number of records.
+Assuming segments are received in order, the device should be able to perform
+crypto operations and authentication regardless of segmentation. For this
+to be possible device has to keep small amount of segment-to-segment state.
+This includes at least:
+
+ * partial headers (if a segment carried only a part of the TLS header)
+ * partial data block
+ * partial authentication tag (all data had been seen but part of the
+ authentication tag has to be written or read from the subsequent segment)
+
+Record reassembly is not necessary for TLS offload. If the packets arrive
+in order the device should be able to handle them separately and make
+forward progress.
+
+TX
+--
+
+The kernel stack performs record framing reserving space for the authentication
+tag and populating all other TLS header and tailer fields.
+
+Both the device and the driver maintain expected TCP sequence numbers
+due to the possibility of retransmissions and the lack of software fallback
+once the packet reaches the device.
+For segments passed in order, the driver marks the packets with
+a connection identifier (note that a 5-tuple lookup is insufficient to identify
+packets requiring HW offload, see the :ref:`5tuple_problems` section)
+and hands them to the device. The device identifies the packet as requiring
+TLS handling and confirms the sequence number matches its expectation.
+The device performs encryption and authentication of the record data.
+It replaces the authentication tag and TCP checksum with correct values.
+
+RX
+--
+
+Before a packet is DMAed to the host (but after NIC's embedded switching
+and packet transformation functions) the device validates the Layer 4
+checksum and performs a 5-tuple lookup to find any TLS connection the packet
+may belong to (technically a 4-tuple
+lookup is sufficient - IP addresses and TCP port numbers, as the protocol
+is always TCP). If connection is matched device confirms if the TCP sequence
+number is the expected one and proceeds to TLS handling (record delineation,
+decryption, authentication for each record in the packet). The device leaves
+the record framing unmodified, the stack takes care of record decapsulation.
+Device indicates successful handling of TLS offload in the per-packet context
+(descriptor) passed to the host.
+
+Upon reception of a TLS offloaded packet, the driver sets
+the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
+corresponding to the segment. Networking stack makes sure decrypted
+and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
+and takes care of partial decryption.
+
+Resync handling
+===============
+
+In presence of packet drops or network packet reordering, the device may lose
+synchronization with the TLS stream, and require a resync with the kernel's
+TCP stack.
+
+Note that resync is only attempted for connections which were successfully
+added to the device table and are in TLS_HW mode. For example,
+if the table was full when cryptographic state was installed in the kernel,
+such connection will never get offloaded. Therefore the resync request
+does not carry any cryptographic connection state.
+
+TX
+--
+
+Segments transmitted from an offloaded socket can get out of sync
+in similar ways to the receive side-retransmissions - local drops
+are possible, though network reorders are not. There are currently
+two mechanisms for dealing with out of order segments.
+
+Crypto state rebuilding
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Whenever an out of order segment is transmitted the driver provides
+the device with enough information to perform cryptographic operations.
+This means most likely that the part of the record preceding the current
+segment has to be passed to the device as part of the packet context,
+together with its TCP sequence number and TLS record number. The device
+can then initialize its crypto state, process and discard the preceding
+data (to be able to insert the authentication tag) and move onto handling
+the actual packet.
+
+In this mode depending on the implementation the driver can either ask
+for a continuation with the crypto state and the new sequence number
+(next expected segment is the one after the out of order one), or continue
+with the previous stream state - assuming that the out of order segment
+was just a retransmission. The former is simpler, and does not require
+retransmission detection therefore it is the recommended method until
+such time it is proven inefficient.
+
+Next record sync
+~~~~~~~~~~~~~~~~
+
+Whenever an out of order segment is detected the driver requests
+that the ``ktls`` software fallback code encrypt it. If the segment's
+sequence number is lower than expected the driver assumes retransmission
+and doesn't change device state. If the segment is in the future, it
+may imply a local drop, the driver asks the stack to sync the device
+to the next record state and falls back to software.
+
+Resync request is indicated with:
+
+.. code-block:: c
+
+ void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)
+
+Until resync is complete driver should not access its expected TCP
+sequence number (as it will be updated from a different context).
+Following helper should be used to test if resync is complete:
+
+.. code-block:: c
+
+ bool tls_offload_tx_resync_pending(struct sock *sk)
+
+Next time ``ktls`` pushes a record it will first send its TCP sequence number
+and TLS record number to the driver. Stack will also make sure that
+the new record will start on a segment boundary (like it does when
+the connection is initially added).
+
+RX
+--
+
+A small amount of RX reorder events may not require a full resynchronization.
+In particular the device should not lose synchronization
+when record boundary can be recovered:
+
+.. kernel-figure:: tls-offload-reorder-good.svg
+ :alt: reorder of non-header segment
+ :align: center
+
+ Reorder of non-header segment
+
+Green segments are successfully decrypted, blue ones are passed
+as received on wire, red stripes mark start of new records.
+
+In above case segment 1 is received and decrypted successfully.
+Segment 2 was dropped so 3 arrives out of order. The device knows
+the next record starts inside 3, based on record length in segment 1.
+Segment 3 is passed untouched, because due to lack of data from segment 2
+the remainder of the previous record inside segment 3 cannot be handled.
+The device can, however, collect the authentication algorithm's state
+and partial block from the new record in segment 3 and when 4 and 5
+arrive continue decryption. Finally when 2 arrives it's completely outside
+of expected window of the device so it's passed as is without special
+handling. ``ktls`` software fallback handles the decryption of record
+spanning segments 1, 2 and 3. The device did not get out of sync,
+even though two segments did not get decrypted.
+
+Kernel synchronization may be necessary if the lost segment contained
+a record header and arrived after the next record header has already passed:
+
+.. kernel-figure:: tls-offload-reorder-bad.svg
+ :alt: reorder of header segment
+ :align: center
+
+ Reorder of segment with a TLS header
+
+In this example segment 2 gets dropped, and it contains a record header.
+Device can only detect that segment 4 also contains a TLS header
+if it knows the length of the previous record from segment 2. In this case
+the device will lose synchronization with the stream.
+
+Stream scan resynchronization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When the device gets out of sync and the stream reaches TCP sequence
+numbers more than a max size record past the expected TCP sequence number,
+the device starts scanning for a known header pattern. For example
+for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
+in the SSL/TLS version field of the header. Once pattern is matched
+the device continues attempting parsing headers at expected locations
+(based on the length fields at guessed locations).
+Whenever the expected location does not contain a valid header the scan
+is restarted.
+
+When the header is matched the device sends a confirmation request
+to the kernel, asking if the guessed location is correct (if a TLS record
+really starts there), and which record sequence number the given header had.
+The kernel confirms the guessed location was correct and tells the device
+the record sequence number. Meanwhile, the device had been parsing
+and counting all records since the just-confirmed one, it adds the number
+of records it had seen to the record number provided by the kernel.
+At this point the device is in sync and can resume decryption at next
+segment boundary.
+
+In a pathological case the device may latch onto a sequence of matching
+headers and never hear back from the kernel (there is no negative
+confirmation from the kernel). The implementation may choose to periodically
+restart scan. Given how unlikely falsely-matching stream is, however,
+periodic restart is not deemed necessary.
+
+Special care has to be taken if the confirmation request is passed
+asynchronously to the packet stream and record may get processed
+by the kernel before the confirmation request.
+
+Stack-driven resynchronization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The driver may also request the stack to perform resynchronization
+whenever it sees the records are no longer getting decrypted.
+If the connection is configured in this mode the stack automatically
+schedules resynchronization after it has received two completely encrypted
+records.
+
+The stack waits for the socket to drain and informs the device about
+the next expected record number and its TCP sequence number. If the
+records continue to be received fully encrypted stack retries the
+synchronization with an exponential back off (first after 2 encrypted
+records, then after 4 records, after 8, after 16... up until every
+128 records).
+
+Error handling
+==============
+
+TX
+--
+
+Packets may be redirected or rerouted by the stack to a different
+device than the selected TLS offload device. The stack will handle
+such condition using the :c:func:`sk_validate_xmit_skb` helper
+(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
+Offload maintains information about all records until the data is
+fully acknowledged, so if skbs reach the wrong device they can be handled
+by software fallback.
+
+Any device TLS offload handling error on the transmission side must result
+in the packet being dropped. For example if a packet got out of order
+due to a bug in the stack or the device, reached the device and can't
+be encrypted such packet must be dropped.
+
+RX
+--
+
+If the device encounters any problems with TLS offload on the receive
+side it should pass the packet to the host's networking stack as it was
+received on the wire.
+
+For example authentication failure for any record in the segment should
+result in passing the unmodified packet to the software fallback. This means
+packets should not be modified "in place". Splitting segments to handle partial
+decryption is not advised. In other words either all records in the packet
+had been handled successfully and authenticated or the packet has to be passed
+to the host's stack as it was on the wire (recovering original packet in the
+driver if device provides precise error is sufficient).
+
+The Linux networking stack does not provide a way of reporting per-packet
+decryption and authentication errors, packets with errors must simply not
+have the :c:member:`decrypted` mark set.
+
+A packet should also not be handled by the TLS offload if it contains
+incorrect checksums.
+
+Performance metrics
+===================
+
+TLS offload can be characterized by the following basic metrics:
+
+ * max connection count
+ * connection installation rate
+ * connection installation latency
+ * total cryptographic performance
+
+Note that each TCP connection requires a TLS session in both directions,
+the performance may be reported treating each direction separately.
+
+Max connection count
+--------------------
+
+The number of connections device can support can be exposed via
+``devlink resource`` API.
+
+Total cryptographic performance
+-------------------------------
+
+Offload performance may depend on segment and record size.
+
+Overload of the cryptographic subsystem of the device should not have
+significant performance impact on non-offloaded streams.
+
+Statistics
+==========
+
+Following minimum set of TLS-related statistics should be reported
+by the driver:
+
+ * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
+ which were part of a TLS stream.
+ * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
+ which were successfully decrypted.
+ * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
+ decryption.
+ * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
+ (connection has finished).
+ * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
+ request.
+ * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
+ was started.
+ * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
+ properly ended with providing the HW tracked tcp-seq.
+ * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
+ procedure was started by not properly ended.
+ * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
+ the driver was successfully handled.
+ * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
+ the driver was terminated unsuccessfully.
+ * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
+ but were not decrypted due to unexpected error in the state machine.
+ * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
+ for encryption of their TLS payload.
+ * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
+ passed to the device for encryption.
+ * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
+ encryption.
+ * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
+ but did not arrive in the expected order.
+ * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
+ a TLS stream and arrived out-of-order, but skipped the HW offload routine
+ and went to the regular transmit flow as they were retransmissions of the
+ connection handshake.
+ * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
+ a TLS stream dropped, because they arrived out of order and associated
+ record could not be found.
+ * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
+ stream dropped, because they contain both data that has been encrypted by
+ software and data that expects hardware crypto offload.
+
+Notable corner cases, exceptions and additional requirements
+============================================================
+
+.. _5tuple_problems:
+
+5-tuple matching limitations
+----------------------------
+
+The device can only recognize received packets based on the 5-tuple
+of the socket. Current ``ktls`` implementation will not offload sockets
+routed through software interfaces such as those used for tunneling
+or virtual networking. However, many packet transformations performed
+by the networking stack (most notably any BPF logic) do not require
+any intermediate software device, therefore a 5-tuple match may
+consistently miss at the device level. In such cases the device
+should still be able to perform TX offload (encryption) and should
+fallback cleanly to software decryption (RX).
+
+Out of order
+------------
+
+Introducing extra processing in NICs should not cause packets to be
+transmitted or received out of order, for example pure ACK packets
+should not be reordered with respect to data segments.
+
+Ingress reorder
+---------------
+
+A device is permitted to perform packet reordering for consecutive
+TCP segments (i.e. placing packets in the correct order) but any form
+of additional buffering is disallowed.
+
+Coexistence with standard networking offload features
+-----------------------------------------------------
+
+Offloaded ``ktls`` sockets should support standard TCP stack features
+transparently. Enabling device TLS offload should not cause any difference
+in packets as seen on the wire.
+
+Transport layer transparency
+----------------------------
+
+The device should not modify any packet headers for the purpose
+of the simplifying TLS offload.
+
+The device should not depend on any packet headers beyond what is strictly
+necessary for TLS offload.
+
+Segment drops
+-------------
+
+Dropping packets is acceptable only in the event of catastrophic
+system errors and should never be used as an error handling mechanism
+in cases arising from normal operation. In other words, reliance
+on TCP retransmissions to handle corner cases is not acceptable.
+
+TLS device features
+-------------------
+
+Drivers should ignore the changes to the TLS device feature flags.
+These flags will be acted upon accordingly by the core ``ktls`` code.
+TLS device feature flags only control adding of new TLS connection
+offloads, old connections will remain active after flags are cleared.
+
+TLS encryption cannot be offloaded to devices without checksum calculation
+offload. Hence, TLS TX device feature flag requires TX csum offload being set.
+Disabling the latter implies clearing the former. Disabling TX checksum offload
+should not affect old connections, and drivers should make sure checksum
+calculation does not break for them.
+Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
+does not want to enable RX csum offload, TLS RX device feature is disabled
+as well.
diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst
new file mode 100644
index 0000000000..658ed3a71e
--- /dev/null
+++ b/Documentation/networking/tls.rst
@@ -0,0 +1,288 @@
+.. _kernel_tls:
+
+==========
+Kernel TLS
+==========
+
+Overview
+========
+
+Transport Layer Security (TLS) is a Upper Layer Protocol (ULP) that runs over
+TCP. TLS provides end-to-end data integrity and confidentiality.
+
+User interface
+==============
+
+Creating a TLS connection
+-------------------------
+
+First create a new TCP socket and set the TLS ULP.
+
+.. code-block:: c
+
+ sock = socket(AF_INET, SOCK_STREAM, 0);
+ setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
+
+Setting the TLS ULP allows us to set/get TLS socket options. Currently
+only the symmetric encryption is handled in the kernel. After the TLS
+handshake is complete, we have all the parameters required to move the
+data-path to the kernel. There is a separate socket option for moving
+the transmit and the receive into the kernel.
+
+.. code-block:: c
+
+ /* From linux/tls.h */
+ struct tls_crypto_info {
+ unsigned short version;
+ unsigned short cipher_type;
+ };
+
+ struct tls12_crypto_info_aes_gcm_128 {
+ struct tls_crypto_info info;
+ unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE];
+ unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
+ unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE];
+ unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
+ };
+
+
+ struct tls12_crypto_info_aes_gcm_128 crypto_info;
+
+ crypto_info.info.version = TLS_1_2_VERSION;
+ crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
+ memcpy(crypto_info.iv, iv_write, TLS_CIPHER_AES_GCM_128_IV_SIZE);
+ memcpy(crypto_info.rec_seq, seq_number_write,
+ TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
+ memcpy(crypto_info.key, cipher_key_write, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ memcpy(crypto_info.salt, implicit_iv_write, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+
+ setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info));
+
+Transmit and receive are set separately, but the setup is the same, using either
+TLS_TX or TLS_RX.
+
+Sending TLS application data
+----------------------------
+
+After setting the TLS_TX socket option all application data sent over this
+socket is encrypted using TLS and the parameters provided in the socket option.
+For example, we can send an encrypted hello world record as follows:
+
+.. code-block:: c
+
+ const char *msg = "hello world\n";
+ send(sock, msg, strlen(msg));
+
+send() data is directly encrypted from the userspace buffer provided
+to the encrypted kernel send buffer if possible.
+
+The sendfile system call will send the file's data over TLS records of maximum
+length (2^14).
+
+.. code-block:: c
+
+ file = open(filename, O_RDONLY);
+ fstat(file, &stat);
+ sendfile(sock, file, &offset, stat.st_size);
+
+TLS records are created and sent after each send() call, unless
+MSG_MORE is passed. MSG_MORE will delay creation of a record until
+MSG_MORE is not passed, or the maximum record size is reached.
+
+The kernel will need to allocate a buffer for the encrypted data.
+This buffer is allocated at the time send() is called, such that
+either the entire send() call will return -ENOMEM (or block waiting
+for memory), or the encryption will always succeed. If send() returns
+-ENOMEM and some data was left on the socket buffer from a previous
+call using MSG_MORE, the MSG_MORE data is left on the socket buffer.
+
+Receiving TLS application data
+------------------------------
+
+After setting the TLS_RX socket option, all recv family socket calls
+are decrypted using TLS parameters provided. A full TLS record must
+be received before decryption can happen.
+
+.. code-block:: c
+
+ char buffer[16384];
+ recv(sock, buffer, 16384);
+
+Received data is decrypted directly in to the user buffer if it is
+large enough, and no additional allocations occur. If the userspace
+buffer is too small, data is decrypted in the kernel and copied to
+userspace.
+
+``EINVAL`` is returned if the TLS version in the received message does not
+match the version passed in setsockopt.
+
+``EMSGSIZE`` is returned if the received message is too big.
+
+``EBADMSG`` is returned if decryption failed for any other reason.
+
+Send TLS control messages
+-------------------------
+
+Other than application data, TLS has control messages such as alert
+messages (record type 21) and handshake messages (record type 22), etc.
+These messages can be sent over the socket by providing the TLS record type
+via a CMSG. For example the following function sends @data of @length bytes
+using a record of type @record_type.
+
+.. code-block:: c
+
+ /* send TLS control message using record_type */
+ static int klts_send_ctrl_message(int sock, unsigned char record_type,
+ void *data, size_t length)
+ {
+ struct msghdr msg = {0};
+ int cmsg_len = sizeof(record_type);
+ struct cmsghdr *cmsg;
+ char buf[CMSG_SPACE(cmsg_len)];
+ struct iovec msg_iov; /* Vector of data to send/receive into. */
+
+ msg.msg_control = buf;
+ msg.msg_controllen = sizeof(buf);
+ cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg->cmsg_level = SOL_TLS;
+ cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
+ cmsg->cmsg_len = CMSG_LEN(cmsg_len);
+ *CMSG_DATA(cmsg) = record_type;
+ msg.msg_controllen = cmsg->cmsg_len;
+
+ msg_iov.iov_base = data;
+ msg_iov.iov_len = length;
+ msg.msg_iov = &msg_iov;
+ msg.msg_iovlen = 1;
+
+ return sendmsg(sock, &msg, 0);
+ }
+
+Control message data should be provided unencrypted, and will be
+encrypted by the kernel.
+
+Receiving TLS control messages
+------------------------------
+
+TLS control messages are passed in the userspace buffer, with message
+type passed via cmsg. If no cmsg buffer is provided, an error is
+returned if a control message is received. Data messages may be
+received without a cmsg buffer set.
+
+.. code-block:: c
+
+ char buffer[16384];
+ char cmsg[CMSG_SPACE(sizeof(unsigned char))];
+ struct msghdr msg = {0};
+ msg.msg_control = cmsg;
+ msg.msg_controllen = sizeof(cmsg);
+
+ struct iovec msg_iov;
+ msg_iov.iov_base = buffer;
+ msg_iov.iov_len = 16384;
+
+ msg.msg_iov = &msg_iov;
+ msg.msg_iovlen = 1;
+
+ int ret = recvmsg(sock, &msg, 0 /* flags */);
+
+ struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+ if (cmsg->cmsg_level == SOL_TLS &&
+ cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+ int record_type = *((unsigned char *)CMSG_DATA(cmsg));
+ // Do something with record_type, and control message data in
+ // buffer.
+ //
+ // Note that record_type may be == to application data (23).
+ } else {
+ // Buffer contains application data.
+ }
+
+recv will never return data from mixed types of TLS records.
+
+Integrating in to userspace TLS library
+---------------------------------------
+
+At a high level, the kernel TLS ULP is a replacement for the record
+layer of a userspace TLS library.
+
+A patchset to OpenSSL to use ktls as the record layer is
+`here <https://github.com/Mellanox/openssl/commits/tls_rx2>`_.
+
+`An example <https://github.com/ktls/af_ktls-tool/commits/RX>`_
+of calling send directly after a handshake using gnutls.
+Since it doesn't implement a full record layer, control
+messages are not supported.
+
+Optional optimizations
+----------------------
+
+There are certain condition-specific optimizations the TLS ULP can make,
+if requested. Those optimizations are either not universally beneficial
+or may impact correctness, hence they require an opt-in.
+All options are set per-socket using setsockopt(), and their
+state can be checked using getsockopt() and via socket diag (``ss``).
+
+TLS_TX_ZEROCOPY_RO
+~~~~~~~~~~~~~~~~~~
+
+For device offload only. Allow sendfile() data to be transmitted directly
+to the NIC without making an in-kernel copy. This allows true zero-copy
+behavior when device offload is enabled.
+
+The application must make sure that the data is not modified between being
+submitted and transmission completing. In other words this is mostly
+applicable if the data sent on a socket via sendfile() is read-only.
+
+Modifying the data may result in different versions of the data being used
+for the original TCP transmission and TCP retransmissions. To the receiver
+this will look like TLS records had been tampered with and will result
+in record authentication failures.
+
+TLS_RX_EXPECT_NO_PAD
+~~~~~~~~~~~~~~~~~~~~
+
+TLS 1.3 only. Expect the sender to not pad records. This allows the data
+to be decrypted directly into user space buffers with TLS 1.3.
+
+This optimization is safe to enable only if the remote end is trusted,
+otherwise it is an attack vector to doubling the TLS processing cost.
+
+If the record decrypted turns out to had been padded or is not a data
+record it will be decrypted again into a kernel buffer without zero copy.
+Such events are counted in the ``TlsDecryptRetry`` statistic.
+
+Statistics
+==========
+
+TLS implementation exposes the following per-namespace statistics
+(``/proc/net/tls_stat``):
+
+- ``TlsCurrTxSw``, ``TlsCurrRxSw`` -
+ number of TX and RX sessions currently installed where host handles
+ cryptography
+
+- ``TlsCurrTxDevice``, ``TlsCurrRxDevice`` -
+ number of TX and RX sessions currently installed where NIC handles
+ cryptography
+
+- ``TlsTxSw``, ``TlsRxSw`` -
+ number of TX and RX sessions opened with host cryptography
+
+- ``TlsTxDevice``, ``TlsRxDevice`` -
+ number of TX and RX sessions opened with NIC cryptography
+
+- ``TlsDecryptError`` -
+ record decryption failed (e.g. due to incorrect authentication tag)
+
+- ``TlsDeviceRxResync`` -
+ number of RX resyncs sent to NICs handling cryptography
+
+- ``TlsDecryptRetry`` -
+ number of RX records which had to be re-decrypted due to
+ ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will
+ also increment for non-data records.
+
+- ``TlsRxNoPadViolation`` -
+ number of data RX records which had to be re-decrypted due to
+ ``TLS_RX_EXPECT_NO_PAD`` mis-prediction.
diff --git a/Documentation/networking/tproxy.rst b/Documentation/networking/tproxy.rst
new file mode 100644
index 0000000000..00dc3a1a66
--- /dev/null
+++ b/Documentation/networking/tproxy.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Transparent proxy support
+=========================
+
+This feature adds Linux 2.2-like transparent proxy support to current kernels.
+To use it, enable the socket match and the TPROXY target in your kernel config.
+You will need policy routing too, so be sure to enable that as well.
+
+From Linux 4.18 transparent proxy support is also available in nf_tables.
+
+1. Making non-local sockets work
+================================
+
+The idea is that you identify packets with destination address matching a local
+socket on your box, set the packet mark to a certain value::
+
+ # iptables -t mangle -N DIVERT
+ # iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
+ # iptables -t mangle -A DIVERT -j MARK --set-mark 1
+ # iptables -t mangle -A DIVERT -j ACCEPT
+
+Alternatively you can do this in nft with the following commands::
+
+ # nft add table filter
+ # nft add chain filter divert "{ type filter hook prerouting priority -150; }"
+ # nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept
+
+And then match on that value using policy routing to have those packets
+delivered locally::
+
+ # ip rule add fwmark 1 lookup 100
+ # ip route add local 0.0.0.0/0 dev lo table 100
+
+Because of certain restrictions in the IPv4 routing output code you'll have to
+modify your application to allow it to send datagrams _from_ non-local IP
+addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket
+option before calling bind::
+
+ fd = socket(AF_INET, SOCK_STREAM, 0);
+ /* - 8< -*/
+ int value = 1;
+ setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value));
+ /* - 8< -*/
+ name.sin_family = AF_INET;
+ name.sin_port = htons(0xCAFE);
+ name.sin_addr.s_addr = htonl(0xDEADBEEF);
+ bind(fd, &name, sizeof(name));
+
+A trivial patch for netcat is available here:
+http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch
+
+
+2. Redirecting traffic
+======================
+
+Transparent proxying often involves "intercepting" traffic on a router. This is
+usually done with the iptables REDIRECT target; however, there are serious
+limitations of that method. One of the major issues is that it actually
+modifies the packets to change the destination address -- which might not be
+acceptable in certain situations. (Think of proxying UDP for example: you won't
+be able to find out the original destination address. Even in case of TCP
+getting the original destination address is racy.)
+
+The 'TPROXY' target provides similar functionality without relying on NAT. Simply
+add rules like this to the iptables ruleset above::
+
+ # iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
+ --tproxy-mark 0x1/0x1 --on-port 50080
+
+Or the following rule to nft:
+
+# nft add rule filter divert tcp dport 80 tproxy to :50080 meta mark set 1 accept
+
+Note that for this to work you'll have to modify the proxy to enable (SOL_IP,
+IP_TRANSPARENT) for the listening socket.
+
+As an example implementation, tcprdr is available here:
+https://git.breakpoint.cc/cgit/fw/tcprdr.git/
+This tool is written by Florian Westphal and it was used for testing during the
+nf_tables implementation.
+
+3. Iptables and nf_tables extensions
+====================================
+
+To use tproxy you'll need to have the following modules compiled for iptables:
+
+ - NETFILTER_XT_MATCH_SOCKET
+ - NETFILTER_XT_TARGET_TPROXY
+
+Or the floowing modules for nf_tables:
+
+ - NFT_SOCKET
+ - NFT_TPROXY
+
+4. Application support
+======================
+
+4.1. Squid
+----------
+
+Squid 3.HEAD has support built-in. To use it, pass
+'--enable-linux-netfilter' to configure and set the 'tproxy' option on
+the HTTP listener you redirect traffic to with the TPROXY iptables
+target.
+
+For more information please consult the following page on the Squid
+wiki: http://wiki.squid-cache.org/Features/Tproxy4
diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
new file mode 100644
index 0000000000..4d7087f727
--- /dev/null
+++ b/Documentation/networking/tuntap.rst
@@ -0,0 +1,259 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============================
+Universal TUN/TAP device driver
+===============================
+
+Copyright |copy| 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
+
+ Linux, Solaris drivers
+ Copyright |copy| 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
+
+ FreeBSD TAP driver
+ Copyright |copy| 1999-2000 Maksim Yevmenkin <m_evmenkin@yahoo.com>
+
+ Revision of this document 2002 by Florian Thiel <florian.thiel@gmx.net>
+
+1. Description
+==============
+
+ TUN/TAP provides packet reception and transmission for user space programs.
+ It can be seen as a simple Point-to-Point or Ethernet device, which,
+ instead of receiving packets from physical media, receives them from
+ user space program and instead of sending packets via physical media
+ writes them to the user space program.
+
+ In order to use the driver a program has to open /dev/net/tun and issue a
+ corresponding ioctl() to register a network device with the kernel. A network
+ device will appear as tunXX or tapXX, depending on the options chosen. When
+ the program closes the file descriptor, the network device and all
+ corresponding routes will disappear.
+
+ Depending on the type of device chosen the userspace program has to read/write
+ IP packets (with tun) or ethernet frames (with tap). Which one is being used
+ depends on the flags given with the ioctl().
+
+ The package from http://vtun.sourceforge.net/tun contains two simple examples
+ for how to use tun and tap devices. Both programs work like a bridge between
+ two network interfaces.
+ br_select.c - bridge based on select system call.
+ br_sigio.c - bridge based on async io and SIGIO signal.
+ However, the best example is VTun http://vtun.sourceforge.net :))
+
+2. Configuration
+================
+
+ Create device node::
+
+ mkdir /dev/net (if it doesn't exist already)
+ mknod /dev/net/tun c 10 200
+
+ Set permissions::
+
+ e.g. chmod 0666 /dev/net/tun
+
+ There's no harm in allowing the device to be accessible by non-root users,
+ since CAP_NET_ADMIN is required for creating network devices or for
+ connecting to network devices which aren't owned by the user in question.
+ If you want to create persistent devices and give ownership of them to
+ unprivileged users, then you need the /dev/net/tun device to be usable by
+ those users.
+
+ Driver module autoloading
+
+ Make sure that "Kernel module loader" - module auto-loading
+ support is enabled in your kernel. The kernel should load it on
+ first access.
+
+ Manual loading
+
+ insert the module by hand::
+
+ modprobe tun
+
+ If you do it the latter way, you have to load the module every time you
+ need it, if you do it the other way it will be automatically loaded when
+ /dev/net/tun is being opened.
+
+3. Program interface
+====================
+
+3.1 Network device allocation
+-----------------------------
+
+``char *dev`` should be the name of the device with a format string (e.g.
+"tun%d"), but (as far as I can see) this can be any valid network device name.
+Note that the character pointer becomes overwritten with the real device name
+(e.g. "tun0")::
+
+ #include <linux/if.h>
+ #include <linux/if_tun.h>
+
+ int tun_alloc(char *dev)
+ {
+ struct ifreq ifr;
+ int fd, err;
+
+ if( (fd = open("/dev/net/tun", O_RDWR)) < 0 )
+ return tun_alloc_old(dev);
+
+ memset(&ifr, 0, sizeof(ifr));
+
+ /* Flags: IFF_TUN - TUN device (no Ethernet headers)
+ * IFF_TAP - TAP device
+ *
+ * IFF_NO_PI - Do not provide packet information
+ */
+ ifr.ifr_flags = IFF_TUN;
+ if( *dev )
+ strscpy_pad(ifr.ifr_name, dev, IFNAMSIZ);
+
+ if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
+ close(fd);
+ return err;
+ }
+ strcpy(dev, ifr.ifr_name);
+ return fd;
+ }
+
+3.2 Frame format
+----------------
+
+If flag IFF_NO_PI is not set each frame format is::
+
+ Flags [2 bytes]
+ Proto [2 bytes]
+ Raw protocol(IP, IPv6, etc) frame.
+
+3.3 Multiqueue tuntap interface
+-------------------------------
+
+From version 3.8, Linux supports multiqueue tuntap which can uses multiple
+file descriptors (queues) to parallelize packets sending or receiving. The
+device allocation is the same as before, and if user wants to create multiple
+queues, TUNSETIFF with the same device name must be called many times with
+IFF_MULTI_QUEUE flag.
+
+``char *dev`` should be the name of the device, queues is the number of queues
+to be created, fds is used to store and return the file descriptors (queues)
+created to the caller. Each file descriptor were served as the interface of a
+queue which could be accessed by userspace.
+
+::
+
+ #include <linux/if.h>
+ #include <linux/if_tun.h>
+
+ int tun_alloc_mq(char *dev, int queues, int *fds)
+ {
+ struct ifreq ifr;
+ int fd, err, i;
+
+ if (!dev)
+ return -1;
+
+ memset(&ifr, 0, sizeof(ifr));
+ /* Flags: IFF_TUN - TUN device (no Ethernet headers)
+ * IFF_TAP - TAP device
+ *
+ * IFF_NO_PI - Do not provide packet information
+ * IFF_MULTI_QUEUE - Create a queue of multiqueue device
+ */
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
+ strcpy(ifr.ifr_name, dev);
+
+ for (i = 0; i < queues; i++) {
+ if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
+ goto err;
+ err = ioctl(fd, TUNSETIFF, (void *)&ifr);
+ if (err) {
+ close(fd);
+ goto err;
+ }
+ fds[i] = fd;
+ }
+
+ return 0;
+ err:
+ for (--i; i >= 0; i--)
+ close(fds[i]);
+ return err;
+ }
+
+A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When
+calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when
+calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were
+enabled by default after it was created through TUNSETIFF.
+
+fd is the file descriptor (queue) that we want to enable or disable, when
+enable is true we enable it, otherwise we disable it::
+
+ #include <linux/if.h>
+ #include <linux/if_tun.h>
+
+ int tun_set_queue(int fd, int enable)
+ {
+ struct ifreq ifr;
+
+ memset(&ifr, 0, sizeof(ifr));
+
+ if (enable)
+ ifr.ifr_flags = IFF_ATTACH_QUEUE;
+ else
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
+
+ return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
+ }
+
+Universal TUN/TAP device driver Frequently Asked Question
+=========================================================
+
+1. What platforms are supported by TUN/TAP driver ?
+
+Currently driver has been written for 3 Unices:
+
+ - Linux kernels 2.2.x, 2.4.x
+ - FreeBSD 3.x, 4.x, 5.x
+ - Solaris 2.6, 7.0, 8.0
+
+2. What is TUN/TAP driver used for?
+
+As mentioned above, main purpose of TUN/TAP driver is tunneling.
+It is used by VTun (http://vtun.sourceforge.net).
+
+Another interesting application using TUN/TAP is pipsecd
+(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec
+implementation that can use complete kernel routing (unlike FreeS/WAN).
+
+3. How does Virtual network device actually work ?
+
+Virtual network device can be viewed as a simple Point-to-Point or
+Ethernet device, which instead of receiving packets from a physical
+media, receives them from user space program and instead of sending
+packets via physical media sends them to the user space program.
+
+Let's say that you configured IPv6 on the tap0, then whenever
+the kernel sends an IPv6 packet to tap0, it is passed to the application
+(VTun for example). The application encrypts, compresses and sends it to
+the other side over TCP or UDP. The application on the other side decompresses
+and decrypts the data received and writes the packet to the TAP device,
+the kernel handles the packet like it came from real physical device.
+
+4. What is the difference between TUN driver and TAP driver?
+
+TUN works with IP frames. TAP works with Ethernet frames.
+
+This means that you have to read/write IP packets when you are using tun and
+ethernet frames when using tap.
+
+5. What is the difference between BPF and TUN/TAP driver?
+
+BPF is an advanced packet filter. It can be attached to existing
+network interface. It does not provide a virtual network interface.
+A TUN/TAP driver does provide a virtual network interface and it is possible
+to attach BPF to this interface.
+
+6. Does TAP driver support kernel Ethernet bridging?
+
+Yes. Linux and FreeBSD drivers support Ethernet bridging.
diff --git a/Documentation/networking/udplite.rst b/Documentation/networking/udplite.rst
new file mode 100644
index 0000000000..2c225f28b7
--- /dev/null
+++ b/Documentation/networking/udplite.rst
@@ -0,0 +1,291 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+The UDP-Lite protocol (RFC 3828)
+================================
+
+
+ UDP-Lite is a Standards-Track IETF transport protocol whose characteristic
+ is a variable-length checksum. This has advantages for transport of multimedia
+ (video, VoIP) over wireless networks, as partly damaged packets can still be
+ fed into the codec instead of being discarded due to a failed checksum test.
+
+ This file briefly describes the existing kernel support and the socket API.
+ For in-depth information, you can consult:
+
+ - The UDP-Lite Homepage:
+ http://web.archive.org/web/%2E/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/
+
+ From here you can also download some example application source code.
+
+ - The UDP-Lite HOWTO on
+ http://web.archive.org/web/%2E/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/files/UDP-Lite-HOWTO.txt
+
+ - The Wireshark UDP-Lite WiKi (with capture files):
+ https://wiki.wireshark.org/Lightweight_User_Datagram_Protocol
+
+ - The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt
+
+
+1. Applications
+===============
+
+ Several applications have been ported successfully to UDP-Lite. Ethereal
+ (now called wireshark) has UDP-Litev4/v6 support by default.
+
+ Porting applications to UDP-Lite is straightforward: only socket level and
+ IPPROTO need to be changed; senders additionally set the checksum coverage
+ length (default = header length = 8). Details are in the next section.
+
+2. Programming API
+==================
+
+ UDP-Lite provides a connectionless, unreliable datagram service and hence
+ uses the same socket type as UDP. In fact, porting from UDP to UDP-Lite is
+ very easy: simply add ``IPPROTO_UDPLITE`` as the last argument of the
+ socket(2) call so that the statement looks like::
+
+ s = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDPLITE);
+
+ or, respectively,
+
+ ::
+
+ s = socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDPLITE);
+
+ With just the above change you are able to run UDP-Lite services or connect
+ to UDP-Lite servers. The kernel will assume that you are not interested in
+ using partial checksum coverage and so emulate UDP mode (full coverage).
+
+ To make use of the partial checksum coverage facilities requires setting a
+ single socket option, which takes an integer specifying the coverage length:
+
+ * Sender checksum coverage: UDPLITE_SEND_CSCOV
+
+ For example::
+
+ int val = 20;
+ setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int));
+
+ sets the checksum coverage length to 20 bytes (12b data + 8b header).
+ Of each packet only the first 20 bytes (plus the pseudo-header) will be
+ checksummed. This is useful for RTP applications which have a 12-byte
+ base header.
+
+
+ * Receiver checksum coverage: UDPLITE_RECV_CSCOV
+
+ This option is the receiver-side analogue. It is truly optional, i.e. not
+ required to enable traffic with partial checksum coverage. Its function is
+ that of a traffic filter: when enabled, it instructs the kernel to drop
+ all packets which have a coverage _less_ than this value. For example, if
+ RTP and UDP headers are to be protected, a receiver can enforce that only
+ packets with a minimum coverage of 20 are admitted::
+
+ int min = 20;
+ setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int));
+
+ The calls to getsockopt(2) are analogous. Being an extension and not a stand-
+ alone protocol, all socket options known from UDP can be used in exactly the
+ same manner as before, e.g. UDP_CORK or UDP_ENCAP.
+
+ A detailed discussion of UDP-Lite checksum coverage options is in section IV.
+
+3. Header Files
+===============
+
+ The socket API requires support through header files in /usr/include:
+
+ * /usr/include/netinet/in.h
+ to define IPPROTO_UDPLITE
+
+ * /usr/include/netinet/udplite.h
+ for UDP-Lite header fields and protocol constants
+
+ For testing purposes, the following can serve as a ``mini`` header file::
+
+ #define IPPROTO_UDPLITE 136
+ #define SOL_UDPLITE 136
+ #define UDPLITE_SEND_CSCOV 10
+ #define UDPLITE_RECV_CSCOV 11
+
+ Ready-made header files for various distros are in the UDP-Lite tarball.
+
+4. Kernel Behaviour with Regards to the Various Socket Options
+==============================================================
+
+
+ To enable debugging messages, the log level need to be set to 8, as most
+ messages use the KERN_DEBUG level (7).
+
+ 1) Sender Socket Options
+
+ If the sender specifies a value of 0 as coverage length, the module
+ assumes full coverage, transmits a packet with coverage length of 0
+ and according checksum. If the sender specifies a coverage < 8 and
+ different from 0, the kernel assumes 8 as default value. Finally,
+ if the specified coverage length exceeds the packet length, the packet
+ length is used instead as coverage length.
+
+ 2) Receiver Socket Options
+
+ The receiver specifies the minimum value of the coverage length it
+ is willing to accept. A value of 0 here indicates that the receiver
+ always wants the whole of the packet covered. In this case, all
+ partially covered packets are dropped and an error is logged.
+
+ It is not possible to specify illegal values (<0 and <8); in these
+ cases the default of 8 is assumed.
+
+ All packets arriving with a coverage value less than the specified
+ threshold are discarded, these events are also logged.
+
+ 3) Disabling the Checksum Computation
+
+ On both sender and receiver, checksumming will always be performed
+ and cannot be disabled using SO_NO_CHECK. Thus::
+
+ setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... );
+
+ will always will be ignored, while the value of::
+
+ getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...);
+
+ is meaningless (as in TCP). Packets with a zero checksum field are
+ illegal (cf. RFC 3828, sec. 3.1) and will be silently discarded.
+
+ 4) Fragmentation
+
+ The checksum computation respects both buffersize and MTU. The size
+ of UDP-Lite packets is determined by the size of the send buffer. The
+ minimum size of the send buffer is 2048 (defined as SOCK_MIN_SNDBUF
+ in include/net/sock.h), the default value is configurable as
+ net.core.wmem_default or via setting the SO_SNDBUF socket(7)
+ option. The maximum upper bound for the send buffer is determined
+ by net.core.wmem_max.
+
+ Given a payload size larger than the send buffer size, UDP-Lite will
+ split the payload into several individual packets, filling up the
+ send buffer size in each case.
+
+ The precise value also depends on the interface MTU. The interface MTU,
+ in turn, may trigger IP fragmentation. In this case, the generated
+ UDP-Lite packet is split into several IP packets, of which only the
+ first one contains the L4 header.
+
+ The send buffer size has implications on the checksum coverage length.
+ Consider the following example::
+
+ Payload: 1536 bytes Send Buffer: 1024 bytes
+ MTU: 1500 bytes Coverage Length: 856 bytes
+
+ UDP-Lite will ship the 1536 bytes in two separate packets::
+
+ Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes
+ Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes
+
+ The coverage packet covers the UDP-Lite header and 848 bytes of the
+ payload in the first packet, the second packet is fully covered. Note
+ that for the second packet, the coverage length exceeds the packet
+ length. The kernel always re-adjusts the coverage length to the packet
+ length in such cases.
+
+ As an example of what happens when one UDP-Lite packet is split into
+ several tiny fragments, consider the following example::
+
+ Payload: 1024 bytes Send buffer size: 1024 bytes
+ MTU: 300 bytes Coverage length: 575 bytes
+
+ +-+-----------+--------------+--------------+--------------+
+ |8| 272 | 280 | 280 | 280 |
+ +-+-----------+--------------+--------------+--------------+
+ 280 560 840 1032
+ ^
+ *****checksum coverage*************
+
+ The UDP-Lite module generates one 1032 byte packet (1024 + 8 byte
+ header). According to the interface MTU, these are split into 4 IP
+ packets (280 byte IP payload + 20 byte IP header). The kernel module
+ sums the contents of the entire first two packets, plus 15 bytes of
+ the last packet before releasing the fragments to the IP module.
+
+ To see the analogous case for IPv6 fragmentation, consider a link
+ MTU of 1280 bytes and a write buffer of 3356 bytes. If the checksum
+ coverage is less than 1232 bytes (MTU minus IPv6/fragment header
+ lengths), only the first fragment needs to be considered. When using
+ larger checksum coverage lengths, each eligible fragment needs to be
+ checksummed. Suppose we have a checksum coverage of 3062. The buffer
+ of 3356 bytes will be split into the following fragments::
+
+ Fragment 1: 1280 bytes carrying 1232 bytes of UDP-Lite data
+ Fragment 2: 1280 bytes carrying 1232 bytes of UDP-Lite data
+ Fragment 3: 948 bytes carrying 900 bytes of UDP-Lite data
+
+ The first two fragments have to be checksummed in full, of the last
+ fragment only 598 (= 3062 - 2*1232) bytes are checksummed.
+
+ While it is important that such cases are dealt with correctly, they
+ are (annoyingly) rare: UDP-Lite is designed for optimising multimedia
+ performance over wireless (or generally noisy) links and thus smaller
+ coverage lengths are likely to be expected.
+
+5. UDP-Lite Runtime Statistics and their Meaning
+================================================
+
+ Exceptional and error conditions are logged to syslog at the KERN_DEBUG
+ level. Live statistics about UDP-Lite are available in /proc/net/snmp
+ and can (with newer versions of netstat) be viewed using::
+
+ netstat -svu
+
+ This displays UDP-Lite statistics variables, whose meaning is as follows.
+
+ ============ =====================================================
+ InDatagrams The total number of datagrams delivered to users.
+
+ NoPorts Number of packets received to an unknown port.
+ These cases are counted separately (not as InErrors).
+
+ InErrors Number of erroneous UDP-Lite packets. Errors include:
+
+ * internal socket queue receive errors
+ * packet too short (less than 8 bytes or stated
+ coverage length exceeds received length)
+ * xfrm4_policy_check() returned with error
+ * application has specified larger min. coverage
+ length than that of incoming packet
+ * checksum coverage violated
+ * bad checksum
+
+ OutDatagrams Total number of sent datagrams.
+ ============ =====================================================
+
+ These statistics derive from the UDP MIB (RFC 2013).
+
+6. IPtables
+===========
+
+ There is packet match support for UDP-Lite as well as support for the LOG target.
+ If you copy and paste the following line into /etc/protocols::
+
+ udplite 136 UDP-Lite # UDP-Lite [RFC 3828]
+
+ then::
+
+ iptables -A INPUT -p udplite -j LOG
+
+ will produce logging output to syslog. Dropping and rejecting packets also works.
+
+7. Maintainer Address
+=====================
+
+ The UDP-Lite patch was developed at
+
+ University of Aberdeen
+ Electronics Research Group
+ Department of Engineering
+ Fraser Noble Building
+ Aberdeen AB24 3UE; UK
+
+ The current maintainer is Gerrit Renker, <gerrit@erg.abdn.ac.uk>. Initial
+ code was developed by William Stanislaus, <william@erg.abdn.ac.uk>.
diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst
new file mode 100644
index 0000000000..0a9a6f968c
--- /dev/null
+++ b/Documentation/networking/vrf.rst
@@ -0,0 +1,464 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+Virtual Routing and Forwarding (VRF)
+====================================
+
+The VRF Device
+==============
+
+The VRF device combined with ip rules provides the ability to create virtual
+routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the
+Linux network stack. One use case is the multi-tenancy problem where each
+tenant has their own unique routing tables and in the very least need
+different default gateways.
+
+Processes can be "VRF aware" by binding a socket to the VRF device. Packets
+through the socket then use the routing table associated with the VRF
+device. An important feature of the VRF device implementation is that it
+impacts only Layer 3 and above so L2 tools (e.g., LLDP) are not affected
+(ie., they do not need to be run in each VRF). The design also allows
+the use of higher priority ip rules (Policy Based Routing, PBR) to take
+precedence over the VRF device rules directing specific traffic as desired.
+
+In addition, VRF devices allow VRFs to be nested within namespaces. For
+example network namespaces provide separation of network interfaces at the
+device layer, VLANs on the interfaces within a namespace provide L2 separation
+and then VRF devices provide L3 separation.
+
+Design
+------
+A VRF device is created with an associated route table. Network interfaces
+are then enslaved to a VRF device::
+
+ +-----------------------------+
+ | vrf-blue | ===> route table 10
+ +-----------------------------+
+ | | |
+ +------+ +------+ +-------------+
+ | eth1 | | eth2 | ... | bond1 |
+ +------+ +------+ +-------------+
+ | |
+ +------+ +------+
+ | eth8 | | eth9 |
+ +------+ +------+
+
+Packets received on an enslaved device and are switched to the VRF device
+in the IPv4 and IPv6 processing stacks giving the impression that packets
+flow through the VRF device. Similarly on egress routing rules are used to
+send packets to the VRF device driver before getting sent out the actual
+interface. This allows tcpdump on a VRF device to capture all packets into
+and out of the VRF as a whole\ [1]_. Similarly, netfilter\ [2]_ and tc rules
+can be applied using the VRF device to specify rules that apply to the VRF
+domain as a whole.
+
+.. [1] Packets in the forwarded state do not flow through the device, so those
+ packets are not seen by tcpdump. Will revisit this limitation in a
+ future release.
+
+.. [2] Iptables on ingress supports PREROUTING with skb->dev set to the real
+ ingress device and both INPUT and PREROUTING rules with skb->dev set to
+ the VRF device. For egress POSTROUTING and OUTPUT rules can be written
+ using either the VRF device or real egress device.
+
+Setup
+-----
+1. VRF device is created with an association to a FIB table.
+ e.g,::
+
+ ip link add vrf-blue type vrf table 10
+ ip link set dev vrf-blue up
+
+2. An l3mdev FIB rule directs lookups to the table associated with the device.
+ A single l3mdev rule is sufficient for all VRFs. The VRF device adds the
+ l3mdev rule for IPv4 and IPv6 when the first device is created with a
+ default preference of 1000. Users may delete the rule if desired and add
+ with a different priority or install per-VRF rules.
+
+ Prior to the v4.8 kernel iif and oif rules are needed for each VRF device::
+
+ ip ru add oif vrf-blue table 10
+ ip ru add iif vrf-blue table 10
+
+3. Set the default route for the table (and hence default route for the VRF)::
+
+ ip route add table 10 unreachable default metric 4278198272
+
+ This high metric value ensures that the default unreachable route can
+ be overridden by a routing protocol suite. FRRouting interprets
+ kernel metrics as a combined admin distance (upper byte) and priority
+ (lower 3 bytes). Thus the above metric translates to [255/8192].
+
+4. Enslave L3 interfaces to a VRF device::
+
+ ip link set dev eth1 master vrf-blue
+
+ Local and connected routes for enslaved devices are automatically moved to
+ the table associated with VRF device. Any additional routes depending on
+ the enslaved device are dropped and will need to be reinserted to the VRF
+ FIB table following the enslavement.
+
+ The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global
+ addresses as VRF enslavement changes::
+
+ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
+
+5. Additional VRF routes are added to associated table::
+
+ ip route add table 10 ...
+
+
+Applications
+------------
+Applications that are to work within a VRF need to bind their socket to the
+VRF device::
+
+ setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
+
+or to specify the output device using cmsg and IP_PKTINFO.
+
+By default the scope of the port bindings for unbound sockets is
+limited to the default VRF. That is, it will not be matched by packets
+arriving on interfaces enslaved to an l3mdev and processes may bind to
+the same port if they bind to an l3mdev.
+
+TCP & UDP services running in the default VRF context (ie., not bound
+to any VRF device) can work across all VRF domains by enabling the
+tcp_l3mdev_accept and udp_l3mdev_accept sysctl options::
+
+ sysctl -w net.ipv4.tcp_l3mdev_accept=1
+ sysctl -w net.ipv4.udp_l3mdev_accept=1
+
+These options are disabled by default so that a socket in a VRF is only
+selected for packets in that VRF. There is a similar option for RAW
+sockets, which is enabled by default for reasons of backwards compatibility.
+This is so as to specify the output device with cmsg and IP_PKTINFO, but
+using a socket not bound to the corresponding VRF. This allows e.g. older ping
+implementations to be run with specifying the device but without executing it
+in the VRF. This option can be disabled so that packets received in a VRF
+context are only handled by a raw socket bound to the VRF, and packets in the
+default VRF are only handled by a socket not bound to any VRF::
+
+ sysctl -w net.ipv4.raw_l3mdev_accept=0
+
+netfilter rules on the VRF device can be used to limit access to services
+running in the default VRF context as well.
+
+Using VRF-aware applications (applications which simultaneously create sockets
+outside and inside VRFs) in conjunction with ``net.ipv4.tcp_l3mdev_accept=1``
+is possible but may lead to problems in some situations. With that sysctl
+value, it is unspecified which listening socket will be selected to handle
+connections for VRF traffic; ie. either a socket bound to the VRF or an unbound
+socket may be used to accept new connections from a VRF. This somewhat
+unexpected behavior can lead to problems if sockets are configured with extra
+options (ex. TCP MD5 keys) with the expectation that VRF traffic will
+exclusively be handled by sockets bound to VRFs, as would be the case with
+``net.ipv4.tcp_l3mdev_accept=0``. Finally and as a reminder, regardless of
+which listening socket is selected, established sockets will be created in the
+VRF based on the ingress interface, as documented earlier.
+
+--------------------------------------------------------------------------------
+
+Using iproute2 for VRFs
+=======================
+iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this
+section lists both commands where appropriate -- with the vrf keyword and the
+older form without it.
+
+1. Create a VRF
+
+ To instantiate a VRF device and associate it with a table::
+
+ $ ip link add dev NAME type vrf table ID
+
+ As of v4.8 the kernel supports the l3mdev FIB rule where a single rule
+ covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first
+ device create.
+
+2. List VRFs
+
+ To list VRFs that have been created::
+
+ $ ip [-d] link show type vrf
+ NOTE: The -d option is needed to show the table id
+
+ For example::
+
+ $ ip -d link show type vrf
+ 11: mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
+ link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0
+ vrf table 1 addrgenmode eui64
+ 12: red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
+ link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0
+ vrf table 10 addrgenmode eui64
+ 13: blue: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
+ link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0
+ vrf table 66 addrgenmode eui64
+ 14: green: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
+ link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0
+ vrf table 81 addrgenmode eui64
+
+
+ Or in brief output::
+
+ $ ip -br link show type vrf
+ mgmt UP 72:b3:ba:91:e2:24 <NOARP,MASTER,UP,LOWER_UP>
+ red UP b6:6f:6e:f6:da:73 <NOARP,MASTER,UP,LOWER_UP>
+ blue UP 36:62:e8:7d:bb:8c <NOARP,MASTER,UP,LOWER_UP>
+ green UP e6:28:b8:63:70:bb <NOARP,MASTER,UP,LOWER_UP>
+
+
+3. Assign a Network Interface to a VRF
+
+ Network interfaces are assigned to a VRF by enslaving the netdevice to a
+ VRF device::
+
+ $ ip link set dev NAME master NAME
+
+ On enslavement connected and local routes are automatically moved to the
+ table associated with the VRF device.
+
+ For example::
+
+ $ ip link set dev eth0 master mgmt
+
+
+4. Show Devices Assigned to a VRF
+
+ To show devices that have been assigned to a specific VRF add the master
+ option to the ip command::
+
+ $ ip link show vrf NAME
+ $ ip link show master NAME
+
+ For example::
+
+ $ ip link show vrf red
+ 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000
+ link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff
+ 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000
+ link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff
+ 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000
+ link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff
+
+
+ Or using the brief output::
+
+ $ ip -br link show vrf red
+ eth1 UP 02:00:00:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
+ eth2 UP 02:00:00:00:02:03 <BROADCAST,MULTICAST,UP,LOWER_UP>
+ eth5 DOWN 02:00:00:00:02:06 <BROADCAST,MULTICAST>
+
+
+5. Show Neighbor Entries for a VRF
+
+ To list neighbor entries associated with devices enslaved to a VRF device
+ add the master option to the ip command::
+
+ $ ip [-6] neigh show vrf NAME
+ $ ip [-6] neigh show master NAME
+
+ For example::
+
+ $ ip neigh show vrf red
+ 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE
+ 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE
+
+ $ ip -6 neigh show vrf red
+ 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE
+
+
+6. Show Addresses for a VRF
+
+ To show addresses for interfaces associated with a VRF add the master
+ option to the ip command::
+
+ $ ip addr show vrf NAME
+ $ ip addr show master NAME
+
+ For example::
+
+ $ ip addr show vrf red
+ 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
+ link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff
+ inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1
+ valid_lft forever preferred_lft forever
+ inet6 2002:1::2/120 scope global
+ valid_lft forever preferred_lft forever
+ inet6 fe80::ff:fe00:202/64 scope link
+ valid_lft forever preferred_lft forever
+ 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
+ link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff
+ inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2
+ valid_lft forever preferred_lft forever
+ inet6 2002:2::2/120 scope global
+ valid_lft forever preferred_lft forever
+ inet6 fe80::ff:fe00:203/64 scope link
+ valid_lft forever preferred_lft forever
+ 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN group default qlen 1000
+ link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff
+
+ Or in brief format::
+
+ $ ip -br addr show vrf red
+ eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64
+ eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64
+ eth5 DOWN
+
+
+7. Show Routes for a VRF
+
+ To show routes for a VRF use the ip command to display the table associated
+ with the VRF device::
+
+ $ ip [-6] route show vrf NAME
+ $ ip [-6] route show table ID
+
+ For example::
+
+ $ ip route show vrf red
+ unreachable default metric 4278198272
+ broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2
+ 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2
+ local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2
+ broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2
+ broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2
+ 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2
+ local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2
+ broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2
+
+ $ ip -6 route show vrf red
+ local 2002:1:: dev lo proto none metric 0 pref medium
+ local 2002:1::2 dev lo proto none metric 0 pref medium
+ 2002:1::/120 dev eth1 proto kernel metric 256 pref medium
+ local 2002:2:: dev lo proto none metric 0 pref medium
+ local 2002:2::2 dev lo proto none metric 0 pref medium
+ 2002:2::/120 dev eth2 proto kernel metric 256 pref medium
+ local fe80:: dev lo proto none metric 0 pref medium
+ local fe80:: dev lo proto none metric 0 pref medium
+ local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium
+ local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium
+ fe80::/64 dev eth1 proto kernel metric 256 pref medium
+ fe80::/64 dev eth2 proto kernel metric 256 pref medium
+ ff00::/8 dev red metric 256 pref medium
+ ff00::/8 dev eth1 metric 256 pref medium
+ ff00::/8 dev eth2 metric 256 pref medium
+ unreachable default dev lo metric 4278198272 error -101 pref medium
+
+8. Route Lookup for a VRF
+
+ A test route lookup can be done for a VRF::
+
+ $ ip [-6] route get vrf NAME ADDRESS
+ $ ip [-6] route get oif NAME ADDRESS
+
+ For example::
+
+ $ ip route get 10.2.1.40 vrf red
+ 10.2.1.40 dev eth1 table red src 10.2.1.2
+ cache
+
+ $ ip -6 route get 2002:1::32 vrf red
+ 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium
+
+
+9. Removing Network Interface from a VRF
+
+ Network interfaces are removed from a VRF by breaking the enslavement to
+ the VRF device::
+
+ $ ip link set dev NAME nomaster
+
+ Connected routes are moved back to the default table and local entries are
+ moved to the local table.
+
+ For example::
+
+ $ ip link set dev eth0 nomaster
+
+--------------------------------------------------------------------------------
+
+Commands used in this example::
+
+ cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
+ 1 mgmt
+ 10 red
+ 66 blue
+ 81 green
+ EOF
+
+ function vrf_create
+ {
+ VRF=$1
+ TBID=$2
+
+ # create VRF device
+ ip link add ${VRF} type vrf table ${TBID}
+
+ if [ "${VRF}" != "mgmt" ]; then
+ ip route add table ${TBID} unreachable default metric 4278198272
+ fi
+ ip link set dev ${VRF} up
+ }
+
+ vrf_create mgmt 1
+ ip link set dev eth0 master mgmt
+
+ vrf_create red 10
+ ip link set dev eth1 master red
+ ip link set dev eth2 master red
+ ip link set dev eth5 master red
+
+ vrf_create blue 66
+ ip link set dev eth3 master blue
+
+ vrf_create green 81
+ ip link set dev eth4 master green
+
+
+ Interface addresses from /etc/network/interfaces:
+ auto eth0
+ iface eth0 inet static
+ address 10.0.0.2
+ netmask 255.255.255.0
+ gateway 10.0.0.254
+
+ iface eth0 inet6 static
+ address 2000:1::2
+ netmask 120
+
+ auto eth1
+ iface eth1 inet static
+ address 10.2.1.2
+ netmask 255.255.255.0
+
+ iface eth1 inet6 static
+ address 2002:1::2
+ netmask 120
+
+ auto eth2
+ iface eth2 inet static
+ address 10.2.2.2
+ netmask 255.255.255.0
+
+ iface eth2 inet6 static
+ address 2002:2::2
+ netmask 120
+
+ auto eth3
+ iface eth3 inet static
+ address 10.2.3.2
+ netmask 255.255.255.0
+
+ iface eth3 inet6 static
+ address 2002:3::2
+ netmask 120
+
+ auto eth4
+ iface eth4 inet static
+ address 10.2.4.2
+ netmask 255.255.255.0
+
+ iface eth4 inet6 static
+ address 2002:4::2
+ netmask 120
diff --git a/Documentation/networking/vxlan.rst b/Documentation/networking/vxlan.rst
new file mode 100644
index 0000000000..2759dc1cc5
--- /dev/null
+++ b/Documentation/networking/vxlan.rst
@@ -0,0 +1,88 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Virtual eXtensible Local Area Networking documentation
+======================================================
+
+The VXLAN protocol is a tunnelling protocol designed to solve the
+problem of limited VLAN IDs (4096) in IEEE 802.1q. With VXLAN the
+size of the identifier is expanded to 24 bits (16777216).
+
+VXLAN is described by IETF RFC 7348, and has been implemented by a
+number of vendors. The protocol runs over UDP using a single
+destination port. This document describes the Linux kernel tunnel
+device, there is also a separate implementation of VXLAN for
+Openvswitch.
+
+Unlike most tunnels, a VXLAN is a 1 to N network, not just point to
+point. A VXLAN device can learn the IP address of the other endpoint
+either dynamically in a manner similar to a learning bridge, or make
+use of statically-configured forwarding entries.
+
+The management of vxlan is done in a manner similar to its two closest
+neighbors GRE and VLAN. Configuring VXLAN requires the version of
+iproute2 that matches the kernel release where VXLAN was first merged
+upstream.
+
+1. Create vxlan device::
+
+ # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 dstport 4789
+
+This creates a new device named vxlan0. The device uses the multicast
+group 239.1.1.1 over eth1 to handle traffic for which there is no
+entry in the forwarding table. The destination port number is set to
+the IANA-assigned value of 4789. The Linux implementation of VXLAN
+pre-dates the IANA's selection of a standard destination port number
+and uses the Linux-selected value by default to maintain backwards
+compatibility.
+
+2. Delete vxlan device::
+
+ # ip link delete vxlan0
+
+3. Show vxlan info::
+
+ # ip -d link show vxlan0
+
+It is possible to create, destroy and display the vxlan
+forwarding table using the new bridge command.
+
+1. Create forwarding table entry::
+
+ # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0
+
+2. Delete forwarding table entry::
+
+ # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0
+
+3. Show forwarding table::
+
+ # bridge fdb show dev vxlan0
+
+The following NIC features may indicate support for UDP tunnel-related
+offloads (most commonly VXLAN features, but support for a particular
+encapsulation protocol is NIC specific):
+
+ - `tx-udp_tnl-segmentation`
+ - `tx-udp_tnl-csum-segmentation`
+ ability to perform TCP segmentation offload of UDP encapsulated frames
+
+ - `rx-udp_tunnel-port-offload`
+ receive side parsing of UDP encapsulated frames which allows NICs to
+ perform protocol-aware offloads, like checksum validation offload of
+ inner frames (only needed by NICs without protocol-agnostic offloads)
+
+For devices supporting `rx-udp_tunnel-port-offload` the list of currently
+offloaded ports can be interrogated with `ethtool`::
+
+ $ ethtool --show-tunnels eth0
+ Tunnel information for eth0:
+ UDP port table 0:
+ Size: 4
+ Types: vxlan
+ No entries
+ UDP port table 1:
+ Size: 4
+ Types: geneve, vxlan-gpe
+ Entries (1):
+ port 1230, vxlan-gpe
diff --git a/Documentation/networking/x25-iface.rst b/Documentation/networking/x25-iface.rst
new file mode 100644
index 0000000000..285cefcfce
--- /dev/null
+++ b/Documentation/networking/x25-iface.rst
@@ -0,0 +1,81 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+X.25 Device Driver Interface
+============================
+
+Version 1.1
+
+ Jonathan Naylor 26.12.96
+
+This is a description of the messages to be passed between the X.25 Packet
+Layer and the X.25 device driver. They are designed to allow for the easy
+setting of the LAPB mode from within the Packet Layer.
+
+The X.25 device driver will be coded normally as per the Linux device driver
+standards. Most X.25 device drivers will be moderately similar to the
+already existing Ethernet device drivers. However unlike those drivers, the
+X.25 device driver has a state associated with it, and this information
+needs to be passed to and from the Packet Layer for proper operation.
+
+All messages are held in sk_buff's just like real data to be transmitted
+over the LAPB link. The first byte of the skbuff indicates the meaning of
+the rest of the skbuff, if any more information does exist.
+
+
+Packet Layer to Device Driver
+-----------------------------
+
+First Byte = 0x00 (X25_IFACE_DATA)
+
+This indicates that the rest of the skbuff contains data to be transmitted
+over the LAPB link. The LAPB link should already exist before any data is
+passed down.
+
+First Byte = 0x01 (X25_IFACE_CONNECT)
+
+Establish the LAPB link. If the link is already established then the connect
+confirmation message should be returned as soon as possible.
+
+First Byte = 0x02 (X25_IFACE_DISCONNECT)
+
+Terminate the LAPB link. If it is already disconnected then the disconnect
+confirmation message should be returned as soon as possible.
+
+First Byte = 0x03 (X25_IFACE_PARAMS)
+
+LAPB parameters. To be defined.
+
+
+Device Driver to Packet Layer
+-----------------------------
+
+First Byte = 0x00 (X25_IFACE_DATA)
+
+This indicates that the rest of the skbuff contains data that has been
+received over the LAPB link.
+
+First Byte = 0x01 (X25_IFACE_CONNECT)
+
+LAPB link has been established. The same message is used for both a LAPB
+link connect_confirmation and a connect_indication.
+
+First Byte = 0x02 (X25_IFACE_DISCONNECT)
+
+LAPB link has been terminated. This same message is used for both a LAPB
+link disconnect_confirmation and a disconnect_indication.
+
+First Byte = 0x03 (X25_IFACE_PARAMS)
+
+LAPB parameters. To be defined.
+
+
+Requirements for the device driver
+----------------------------------
+
+Packets should not be reordered or dropped when delivering between the
+Packet Layer and the device driver.
+
+To avoid packets from being reordered or dropped when delivering from
+the device driver to the Packet Layer, the device driver should not
+call "netif_rx" to deliver the received packets. Instead, it should
+call "netif_receive_skb_core" from softirq context to deliver them.
diff --git a/Documentation/networking/x25.rst b/Documentation/networking/x25.rst
new file mode 100644
index 0000000000..e11d9ebdf9
--- /dev/null
+++ b/Documentation/networking/x25.rst
@@ -0,0 +1,46 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Linux X.25 Project
+==================
+
+As my third year dissertation at University I have taken it upon myself to
+write an X.25 implementation for Linux. My aim is to provide a complete X.25
+Packet Layer and a LAPB module to allow for "normal" X.25 to be run using
+Linux. There are two sorts of X.25 cards available, intelligent ones that
+implement LAPB on the card itself, and unintelligent ones that simply do
+framing, bit-stuffing and checksumming. These both need to be handled by the
+system.
+
+I therefore decided to write the implementation such that as far as the
+Packet Layer is concerned, the link layer was being performed by a lower
+layer of the Linux kernel and therefore it did not concern itself with
+implementation of LAPB. Therefore the LAPB modules would be called by
+unintelligent X.25 card drivers and not by intelligent ones, this would
+provide a uniform device driver interface, and simplify configuration.
+
+To confuse matters a little, an 802.2 LLC implementation is also possible
+which could allow X.25 to be run over an Ethernet (or Token Ring) and
+conform with the JNT "Pink Book", this would have a different interface to
+the Packet Layer but there would be no confusion since the class of device
+being served by the LLC would be completely separate from LAPB.
+
+Just when you thought that it could not become more confusing, another
+option appeared, XOT. This allows X.25 Packet Layer frames to operate over
+the Internet using TCP/IP as a reliable link layer. RFC1613 specifies the
+format and behaviour of the protocol. If time permits this option will also
+be actively considered.
+
+A linux-x25 mailing list has been created at vger.kernel.org to support the
+development and use of Linux X.25. It is early days yet, but interested
+parties are welcome to subscribe to it. Just send a message to
+majordomo@vger.kernel.org with the following in the message body:
+
+subscribe linux-x25
+end
+
+The contents of the Subject line are ignored.
+
+Jonathan
+
+g4klx@g4klx.demon.co.uk
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
new file mode 100644
index 0000000000..25ce72af81
--- /dev/null
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -0,0 +1,113 @@
+===============
+XDP RX Metadata
+===============
+
+This document describes how an eXpress Data Path (XDP) program can access
+hardware metadata related to a packet using a set of helper functions,
+and how it can pass that metadata on to other consumers.
+
+General Design
+==============
+
+XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame.
+Every device driver that wishes to expose additional packet metadata can
+implement these kfuncs. The set of kfuncs is declared in ``include/net/xdp.h``
+via ``XDP_METADATA_KFUNC_xxx``.
+
+Currently, the following kfuncs are supported. In the future, as more
+metadata is supported, this set will grow:
+
+.. kernel-doc:: net/core/xdp.c
+ :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
+
+An XDP program can use these kfuncs to read the metadata into stack
+variables for its own consumption. Or, to pass the metadata on to other
+consumers, an XDP program can store it into the metadata area carried
+ahead of the packet. Not all packets will necessary have the requested
+metadata available in which case the driver returns ``-ENODATA``.
+
+Not all kfuncs have to be implemented by the device driver; when not
+implemented, the default ones that return ``-EOPNOTSUPP`` will be used
+to indicate the device driver have not implemented this kfunc.
+
+
+Within an XDP frame, the metadata layout (accessed via ``xdp_buff``) is
+as follows::
+
+ +----------+-----------------+------+
+ | headroom | custom metadata | data |
+ +----------+-----------------+------+
+ ^ ^
+ | |
+ xdp_buff->data_meta xdp_buff->data
+
+An XDP program can store individual metadata items into this ``data_meta``
+area in whichever format it chooses. Later consumers of the metadata
+will have to agree on the format by some out of band contract (like for
+the AF_XDP use case, see below).
+
+AF_XDP
+======
+
+:doc:`af_xdp` use-case implies that there is a contract between the BPF
+program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
+the final consumer. Thus the BPF program manually allocates a fixed number of
+bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
+of kfuncs to populate it. The userspace ``XSK`` consumer computes
+``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
+Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
+``METADATA_SIZE`` is an application-specific constant (``AF_XDP`` receive
+descriptor does _not_ explicitly carry the size of the metadata).
+
+Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
+
+ +----------+-----------------+------+
+ | headroom | custom metadata | data |
+ +----------+-----------------+------+
+ ^
+ |
+ rx_desc->address
+
+XDP_PASS
+========
+
+This is the path where the packets processed by the XDP program are passed
+into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
+contents. Currently, every driver has custom kernel code to parse
+the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
+conversion, and the XDP metadata is not used by the kernel when building
+``skbs``. However, TC-BPF programs can access the XDP metadata area using
+the ``data_meta`` pointer.
+
+In the future, we'd like to support a case where an XDP program
+can override some of the metadata used for building ``skbs``.
+
+bpf_redirect_map
+================
+
+``bpf_redirect_map`` can redirect the frame to a different device.
+Some devices (like virtual ethernet links) support running a second XDP
+program after the redirect. However, the final consumer doesn't have
+access to the original hardware descriptor and can't access any of
+the original metadata. The same applies to XDP programs installed
+into devmaps and cpumaps.
+
+This means that for redirected packets only custom metadata is
+currently supported, which has to be prepared by the initial XDP program
+before redirect. If the frame is eventually passed to the kernel, the
+``skb`` created from such a frame won't have any hardware metadata populated
+in its ``skb``. If such a packet is later redirected into an ``XSK``,
+that will also only have access to the custom metadata.
+
+bpf_tail_call
+=============
+
+Adding programs that access metadata kfuncs to the ``BPF_MAP_TYPE_PROG_ARRAY``
+is currently not supported.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and
+``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of
+BPF program that handles XDP metadata.
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
new file mode 100644
index 0000000000..535077cbeb
--- /dev/null
+++ b/Documentation/networking/xfrm_device.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfrm_device:
+
+===============================================
+XFRM device - offloading the IPsec computations
+===============================================
+
+Shannon Nelson <shannon.nelson@oracle.com>
+Leon Romanovsky <leonro@nvidia.com>
+
+
+Overview
+========
+
+IPsec is a useful feature for securing network traffic, but the
+computational cost is high: a 10Gbps link can easily be brought down
+to under 1Gbps, depending on the traffic and link configuration.
+Luckily, there are NICs that offer a hardware based IPsec offload which
+can radically increase throughput and decrease CPU utilization. The XFRM
+Device interface allows NIC drivers to offer to the stack access to the
+hardware offload.
+
+Right now, there are two types of hardware offload that kernel supports.
+ * IPsec crypto offload:
+ * NIC performs encrypt/decrypt
+ * Kernel does everything else
+ * IPsec packet offload:
+ * NIC performs encrypt/decrypt
+ * NIC does encapsulation
+ * Kernel and NIC have SA and policy in-sync
+ * NIC handles the SA and policies states
+ * The Kernel talks to the keymanager
+
+Userland access to the offload is typically through a system such as
+libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
+be handy when experimenting. An example command might look something
+like this for crypto offload:
+
+ ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+ reqid 0x07 replay-window 32 \
+ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+ offload dev eth4 dir in
+
+and for packet offload
+
+ ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+ reqid 0x07 replay-window 32 \
+ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+ offload packet dev eth4 dir in
+
+ ip x p add src 14.0.0.70 dst 14.0.0.52 offload packet dev eth4 dir in
+ tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport
+
+Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
+
+
+
+Callbacks to implement
+======================
+
+::
+
+ /* from include/linux/netdevice.h */
+ struct xfrmdev_ops {
+ /* Crypto and Packet offload callbacks */
+ int (*xdo_dev_state_add) (struct xfrm_state *x, struct netlink_ext_ack *extack);
+ void (*xdo_dev_state_delete) (struct xfrm_state *x);
+ void (*xdo_dev_state_free) (struct xfrm_state *x);
+ bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
+ struct xfrm_state *x);
+ void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+
+ /* Solely packet offload callbacks */
+ void (*xdo_dev_state_update_curlft) (struct xfrm_state *x);
+ int (*xdo_dev_policy_add) (struct xfrm_policy *x, struct netlink_ext_ack *extack);
+ void (*xdo_dev_policy_delete) (struct xfrm_policy *x);
+ void (*xdo_dev_policy_free) (struct xfrm_policy *x);
+ };
+
+The NIC driver offering ipsec offload will need to implement callbacks
+relevant to supported offload to make the offload available to the network
+stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
+NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
+
+
+
+Flow
+====
+
+At probe time and before the call to register_netdev(), the driver should
+set up local data structures and XFRM callbacks, and set the feature bits.
+The XFRM code's listener will finish the setup on NETDEV_REGISTER.
+
+::
+
+ adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops;
+ adapter->netdev->features |= NETIF_F_HW_ESP;
+ adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP;
+
+When new SAs are set up with a request for "offload" feature, the
+driver's xdo_dev_state_add() will be given the new SA to be offloaded
+and an indication of whether it is for Rx or Tx. The driver should
+
+ - verify the algorithm is supported for offloads
+ - store the SA information (key, salt, target-ip, protocol, etc)
+ - enable the HW offload of the SA
+ - return status value:
+
+ =========== ===================================
+ 0 success
+ -EOPNETSUPP offload not supported, try SW IPsec,
+ not applicable for packet offload mode
+ other fail the request
+ =========== ===================================
+
+The driver can also set an offload_handle in the SA, an opaque void pointer
+that can be used to convey context into the fast-path offload requests::
+
+ xs->xso.offload_handle = context;
+
+
+When the network stack is preparing an IPsec packet for an SA that has
+been setup for offload, it first calls into xdo_dev_offload_ok() with
+the skb and the intended offload state to ask the driver if the offload
+will serviceable. This can check the packet information to be sure the
+offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
+return true of false to signify its support.
+
+Crypto offload mode:
+When ready to send, the driver needs to inspect the Tx packet for the
+offload information, including the opaque context, and set up the packet
+send accordingly::
+
+ xs = xfrm_input_state(skb);
+ context = xs->xso.offload_handle;
+ set up HW for send
+
+The stack has already inserted the appropriate IPsec headers in the
+packet data, the offload just needs to do the encryption and fix up the
+header values.
+
+
+When a packet is received and the HW has indicated that it offloaded a
+decryption, the driver needs to add a reference to the decoded SA into
+the packet's skb. At this point the data should be decrypted but the
+IPsec headers are still in the packet data; they are removed later up
+the stack in xfrm_input().
+
+ find and hold the SA that was used to the Rx skb::
+
+ get spi, protocol, and destination IP from packet headers
+ xs = find xs from (spi, protocol, dest_IP)
+ xfrm_state_hold(xs);
+
+ store the state information into the skb::
+
+ sp = secpath_set(skb);
+ if (!sp) return;
+ sp->xvec[sp->len++] = xs;
+ sp->olen++;
+
+ indicate the success and/or error status of the offload::
+
+ xo = xfrm_offload(skb);
+ xo->flags = CRYPTO_DONE;
+ xo->status = crypto_status;
+
+ hand the packet to napi_gro_receive() as usual
+
+In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn().
+Driver will check packet seq number and update HW ESN state machine if needed.
+
+Packet offload mode:
+HW adds and deletes XFRM headers. So in RX path, XFRM stack is bypassed if HW
+reported success. In TX path, the packet lefts kernel without extra header
+and not encrypted, the HW is responsible to perform it.
+
+When the SA is removed by the user, the driver's xdo_dev_state_delete()
+and xdo_dev_policy_delete() are asked to disable the offload. Later,
+xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage
+collection routine after all reference counts to the state and policy
+have been removed and any remaining resources can be cleared for the
+offload state. How these are used by the driver will depend on specific
+hardware needs.
+
+As a netdev is set to DOWN the XFRM stack's netdev listener will call
+xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and
+xdo_dev_policy_free() on any remaining offloaded states.
+
+Outcome of HW handling packets, the XFRM core can't count hard, soft limits.
+The HW/driver are responsible to perform it and provide accurate data when
+xdo_dev_state_update_curlft() is called. In case of one of these limits
+occuried, the driver needs to call to xfrm_state_check_expire() to make sure
+that XFRM performs rekeying sequence.
diff --git a/Documentation/networking/xfrm_proc.rst b/Documentation/networking/xfrm_proc.rst
new file mode 100644
index 0000000000..0a771c5a73
--- /dev/null
+++ b/Documentation/networking/xfrm_proc.rst
@@ -0,0 +1,113 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+XFRM proc - /proc/net/xfrm_* files
+==================================
+
+Masahide NAKAMURA <nakam@linux-ipv6.org>
+
+
+Transformation Statistics
+-------------------------
+
+The xfrm_proc code is a set of statistics showing numbers of packets
+dropped by the transformation code and why. These counters are defined
+as part of the linux private MIB. These counters can be viewed in
+/proc/net/xfrm_stat.
+
+
+Inbound errors
+~~~~~~~~~~~~~~
+
+XfrmInError:
+ All errors which is not matched others
+
+XfrmInBufferError:
+ No buffer is left
+
+XfrmInHdrError:
+ Header error
+
+XfrmInNoStates:
+ No state is found
+ i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong
+
+XfrmInStateProtoError:
+ Transformation protocol specific error
+ e.g. SA key is wrong
+
+XfrmInStateModeError:
+ Transformation mode specific error
+
+XfrmInStateSeqError:
+ Sequence error
+ i.e. Sequence number is out of window
+
+XfrmInStateExpired:
+ State is expired
+
+XfrmInStateMismatch:
+ State has mismatch option
+ e.g. UDP encapsulation type is mismatch
+
+XfrmInStateInvalid:
+ State is invalid
+
+XfrmInTmplMismatch:
+ No matching template for states
+ e.g. Inbound SAs are correct but SP rule is wrong
+
+XfrmInNoPols:
+ No policy is found for states
+ e.g. Inbound SAs are correct but no SP is found
+
+XfrmInPolBlock:
+ Policy discards
+
+XfrmInPolError:
+ Policy error
+
+XfrmAcquireError:
+ State hasn't been fully acquired before use
+
+XfrmFwdHdrError:
+ Forward routing of a packet is not allowed
+
+Outbound errors
+~~~~~~~~~~~~~~~
+XfrmOutError:
+ All errors which is not matched others
+
+XfrmOutBundleGenError:
+ Bundle generation error
+
+XfrmOutBundleCheckError:
+ Bundle check error
+
+XfrmOutNoStates:
+ No state is found
+
+XfrmOutStateProtoError:
+ Transformation protocol specific error
+
+XfrmOutStateModeError:
+ Transformation mode specific error
+
+XfrmOutStateSeqError:
+ Sequence error
+ i.e. Sequence number overflow
+
+XfrmOutStateExpired:
+ State is expired
+
+XfrmOutPolBlock:
+ Policy discards
+
+XfrmOutPolDead:
+ Policy is dead
+
+XfrmOutPolError:
+ Policy error
+
+XfrmOutStateInvalid:
+ State is invalid, perhaps expired
diff --git a/Documentation/networking/xfrm_sync.rst b/Documentation/networking/xfrm_sync.rst
new file mode 100644
index 0000000000..6246503cea
--- /dev/null
+++ b/Documentation/networking/xfrm_sync.rst
@@ -0,0 +1,189 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+XFRM
+====
+
+The sync patches work is based on initial patches from
+Krisztian <hidden@balabit.hu> and others and additional patches
+from Jamal <hadi@cyberus.ca>.
+
+The end goal for syncing is to be able to insert attributes + generate
+events so that the SA can be safely moved from one machine to another
+for HA purposes.
+The idea is to synchronize the SA so that the takeover machine can do
+the processing of the SA as accurate as possible if it has access to it.
+
+We already have the ability to generate SA add/del/upd events.
+These patches add ability to sync and have accurate lifetime byte (to
+ensure proper decay of SAs) and replay counters to avoid replay attacks
+with as minimal loss at failover time.
+This way a backup stays as closely up-to-date as an active member.
+
+Because the above items change for every packet the SA receives,
+it is possible for a lot of the events to be generated.
+For this reason, we also add a nagle-like algorithm to restrict
+the events. i.e we are going to set thresholds to say "let me
+know if the replay sequence threshold is reached or 10 secs have passed"
+These thresholds are set system-wide via sysctls or can be updated
+per SA.
+
+The identified items that need to be synchronized are:
+- the lifetime byte counter
+note that: lifetime time limit is not important if you assume the failover
+machine is known ahead of time since the decay of the time countdown
+is not driven by packet arrival.
+- the replay sequence for both inbound and outbound
+
+1) Message Structure
+----------------------
+
+nlmsghdr:aevent_id:optional-TLVs.
+
+The netlink message types are:
+
+XFRM_MSG_NEWAE and XFRM_MSG_GETAE.
+
+A XFRM_MSG_GETAE does not have TLVs.
+
+A XFRM_MSG_NEWAE will have at least two TLVs (as is
+discussed further below).
+
+aevent_id structure looks like::
+
+ struct xfrm_aevent_id {
+ struct xfrm_usersa_id sa_id;
+ xfrm_address_t saddr;
+ __u32 flags;
+ __u32 reqid;
+ };
+
+The unique SA is identified by the combination of xfrm_usersa_id,
+reqid and saddr.
+
+flags are used to indicate different things. The possible
+flags are::
+
+ XFRM_AE_RTHR=1, /* replay threshold*/
+ XFRM_AE_RVAL=2, /* replay value */
+ XFRM_AE_LVAL=4, /* lifetime value */
+ XFRM_AE_ETHR=8, /* expiry timer threshold */
+ XFRM_AE_CR=16, /* Event cause is replay update */
+ XFRM_AE_CE=32, /* Event cause is timer expiry */
+ XFRM_AE_CU=64, /* Event cause is policy update */
+
+How these flags are used is dependent on the direction of the
+message (kernel<->user) as well the cause (config, query or event).
+This is described below in the different messages.
+
+The pid will be set appropriately in netlink to recognize direction
+(0 to the kernel and pid = processid that created the event
+when going from kernel to user space)
+
+A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS
+to get notified of these events.
+
+2) TLVS reflect the different parameters:
+-----------------------------------------
+
+a) byte value (XFRMA_LTIME_VAL)
+
+This TLV carries the running/current counter for byte lifetime since
+last event.
+
+b)replay value (XFRMA_REPLAY_VAL)
+
+This TLV carries the running/current counter for replay sequence since
+last event.
+
+c)replay threshold (XFRMA_REPLAY_THRESH)
+
+This TLV carries the threshold being used by the kernel to trigger events
+when the replay sequence is exceeded.
+
+d) expiry timer (XFRMA_ETIMER_THRESH)
+
+This is a timer value in milliseconds which is used as the nagle
+value to rate limit the events.
+
+3) Default configurations for the parameters:
+---------------------------------------------
+
+By default these events should be turned off unless there is
+at least one listener registered to listen to the multicast
+group XFRMNLGRP_AEVENTS.
+
+Programs installing SAs will need to specify the two thresholds, however,
+in order to not change existing applications such as racoon
+we also provide default threshold values for these different parameters
+in case they are not specified.
+
+the two sysctls/proc entries are:
+
+a) /proc/sys/net/core/sysctl_xfrm_aevent_etime
+used to provide default values for the XFRMA_ETIMER_THRESH in incremental
+units of time of 100ms. The default is 10 (1 second)
+
+b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth
+used to provide default values for XFRMA_REPLAY_THRESH parameter
+in incremental packet count. The default is two packets.
+
+4) Message types
+----------------
+
+a) XFRM_MSG_GETAE issued by user-->kernel.
+ XFRM_MSG_GETAE does not carry any TLVs.
+
+The response is a XFRM_MSG_NEWAE which is formatted based on what
+XFRM_MSG_GETAE queried for.
+
+The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+* if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved
+* if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved
+
+b) XFRM_MSG_NEWAE is issued by either user space to configure
+ or kernel to announce events or respond to a XFRM_MSG_GETAE.
+
+i) user --> kernel to configure a specific SA.
+
+any of the values or threshold parameters can be updated by passing the
+appropriate TLV.
+
+A response is issued back to the sender in user space to indicate success
+or failure.
+
+In the case of success, additionally an event with
+XFRM_MSG_NEWAE is also issued to any listeners as described in iii).
+
+ii) kernel->user direction as a response to XFRM_MSG_GETAE
+
+The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+The threshold TLVs will be included if explicitly requested in
+the XFRM_MSG_GETAE message.
+
+iii) kernel->user to report as event if someone sets any values or
+ thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above).
+ In such a case XFRM_AE_CU flag is set to inform the user that
+ the change happened as a result of an update.
+ The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+iv) kernel->user to report event when replay threshold or a timeout
+ is exceeded.
+
+In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout
+happened) is set to inform the user what happened.
+Note the two flags are mutually exclusive.
+The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+Exceptions to threshold settings
+--------------------------------
+
+If you have an SA that is getting hit by traffic in bursts such that
+there is a period where the timer threshold expires with no packets
+seen, then an odd behavior is seen as follows:
+The first packet arrival after a timer expiry will trigger a timeout
+event; i.e we don't wait for a timeout period or a packet threshold
+to be reached. This is done for simplicity and efficiency reasons.
+
+-JHS
diff --git a/Documentation/networking/xfrm_sysctl.rst b/Documentation/networking/xfrm_sysctl.rst
new file mode 100644
index 0000000000..47b9bbdd01
--- /dev/null
+++ b/Documentation/networking/xfrm_sysctl.rst
@@ -0,0 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+XFRM Syscall
+============
+
+/proc/sys/net/core/xfrm_* Variables:
+====================================
+
+xfrm_acq_expires - INTEGER
+ default 30 - hard timeout in seconds for acquire requests